arxiv_cs_cv April 24, 2026

When to Trust the Answer: Question-Aligned Semantic Nearest Neighbor Entropy for Safer Surgical VQA

Translated: 2026/4/24 19:50:18

surgical-vqa, uncertainty-estimation, semantic-nearest-neighbor-entropy, question-answering, medical-ai

Original Content

arXiv:2511.01458v2 Announce Type: replace Abstract: Safety and reliability are critical for deploying visual question answering (VQA) systems in surgery, where incorrect or ambiguous responses can cause patient harm. A key limitation of existing uncertainty estimation methods, such as Semantic Nearest Neighbor Entropy (SNNE), is that they do not explicitly account for the conditioning question. As a result, they may assign high confidence to answers that are semantically consistent yet misaligned with the clinical question, especially under variation in question phrasing. We propose Question-Aligned Semantic Nearest Neighbor Entropy (QA-SNNE), a black-box uncertainty estimator that incorporates question-answer alignment into semantic entropy through bilateral gating. QA-SNNE measures uncertainty by weighting pairwise semantic similarities among sampled answers according to their relevance to the question, using embedding-based, entailment-based, or cross-encoder alignment strategies. To assess robustness to language variation, we construct an out-of-template rephrased version of a benchmark surgical VQA dataset, where only the question wording is modified while images and ground-truth answers remain unchanged. We evaluate QA-SNNE on five VQA models across two benchmark surgical VQA datasets in both zero-shot and parameter-efficient fine-tuned (PEFT) settings, including out-of-template questions. QA-SNNE improves AUROC on EndoVis18-VQA for two of three zero-shot models in-template (e.g., +15% for Llama3.2 and +21% for Qwen2.5) and achieves up to +8% AUROC improvement under out-of-template rephrasing, with mixed results on external validation. Overall, QA-SNNE provides a practical, model-agnostic safeguard for surgical VQA by linking semantic uncertainty to question relevance.
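The abstract describes the mechanism only in words, so the following is a minimal sketch of one way question-aligned gating could work, not the authors' implementation. It assumes an SNNE-style score of the form -(1/n) Σ_i log Σ_j exp(s_ij / τ) over pairwise answer similarities s_ij, and it assumes that "bilateral gating" multiplies each s_ij by the question-alignment scores of both answers. The function names, the multiplicative gate, the temperature `tau`, and the hard-coded similarity and alignment values are all hypothetical; in practice the alignment scores would come from an embedding, entailment, or cross-encoder model.

```python
import math
from typing import Sequence


def snne(sim: Sequence[Sequence[float]], tau: float = 1.0) -> float:
    """SNNE-style score over an n x n answer-similarity matrix (assumed form).

    Higher values mean the sampled answers are less similar to each other,
    which is read as higher uncertainty.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        # Log-sum-exp of answer i's similarity to every sampled answer.
        total += math.log(sum(math.exp(sim[i][j] / tau) for j in range(n)))
    return -total / n


def qa_snne(
    sim: Sequence[Sequence[float]],
    align: Sequence[float],
    tau: float = 1.0,
) -> float:
    """Question-aligned variant (assumption: multiplicative two-sided gate).

    align[i] in [0, 1] is a hypothetical relevance score of answer i to the
    question. A pair only counts as similar if both answers are on-question.
    """
    n = len(sim)
    gated = [
        [align[i] * align[j] * sim[i][j] for j in range(n)]
        for i in range(n)
    ]
    return snne(gated, tau)


# Three sampled answers that agree perfectly with one another.
sim = [[1.0] * 3 for _ in range(3)]

on_question = qa_snne(sim, [1.0, 1.0, 1.0])   # answers address the question
off_question = qa_snne(sim, [0.2, 0.2, 0.2])  # consistent but misaligned

print("plain SNNE:           ", snne(sim))
print("QA-SNNE, aligned:     ", on_question)
print("QA-SNNE, misaligned:  ", off_question)
```

In this toy case plain SNNE cannot distinguish the two situations because the answers are equally consistent in both. The gated score rises when the answers are consistent yet off-question, which is the failure mode the abstract says SNNE misses.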