arxiv_cs_ai February 10, 2026

Beyond Accuracy: Risk-Sensitive Evaluation of Hallucinated Medical Advice

Translated: 2026/3/7 12:30:01

risk-sensitivity, hallucination-evaluation, medical-advice, language-models, clinical-relevance

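Translated Abstract

Large language models are increasingly used to answer patient-facing medical questions, and the potential harm of their hallucinated outputs varies widely. However, existing hallucination standards and evaluation metrics focus mainly on factual correctness and weight all errors equally. This hides clinically important failure modes, especially when a model generates medical language that is unsupported yet actionable. We propose a risk-sensitive evaluation framework that quantifies hallucinations by the presence of risk-bearing language: treatment directives, contraindications, urgency cues, and mentions of high-risk medications. Rather than judging clinical correctness, the approach evaluates the potential impact of hallucinated content if it were acted upon. Combining risk scoring with a relevance measure identifies high-risk, low-grounding failures. We apply the framework to three instruction-tuned language models, using controlled patient-facing prompts designed as safety stress tests. The results show that models with similar surface-level behavior have substantially different risk profiles, and that standard evaluation metrics fail to capture these differences. These findings highlight the importance of building risk sensitivity into hallucination evaluation and suggest that evaluation validity depends critically on task and prompt design.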

Original Content

arXiv:2602.07319v1 Announce Type: cross Abstract: Large language models are increasingly being used in patient-facing medical question answering, where hallucinated outputs can vary widely in potential harm. However, existing hallucination standards and evaluation metrics focus primarily on factual correctness, treating all errors as equally severe. This obscures clinically relevant failure modes, particularly when models generate unsupported but actionable medical language. We propose a risk-sensitive evaluation framework that quantifies hallucinations through the presence of risk-bearing language, including treatment directives, contraindications, urgency cues, and mentions of high-risk medications. Rather than assessing clinical correctness, our approach evaluates the potential impact of hallucinated content if acted upon. We further combine risk scoring with a relevance measure to identify high-risk, low-grounding failures. We apply this framework to three instruction-tuned language models using controlled patient-facing prompts designed as safety stress tests. Our results show that models with similar surface-level behavior exhibit substantially different risk profiles and that standard evaluation metrics fail to capture these distinctions. These findings highlight the importance of incorporating risk sensitivity into hallucination evaluation and suggest that evaluation validity is critically dependent on task and prompt design.
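
The abstract describes the framework only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It scores a response by counting risk-bearing phrases in the four categories the abstract names, then pairs that score with a crude relevance measure to flag high-risk, low-grounding outputs. The lexicon, the category weights, the Jaccard word-overlap relevance measure, the thresholds, and every function name are assumptions introduced here for illustration.

```python
import re

# Hypothetical lexicon: category -> (weight, regex patterns).
# Categories follow the abstract; patterns and weights are illustrative only.
RISK_LEXICON = {
    "treatment_directive": (2.0, [
        r"\bstart taking\b", r"\bstop taking\b",
        r"\byou should take\b", r"\bincrease (?:the|your) dose\b",
    ]),
    "contraindication": (2.0, [
        r"\bdo not (?:take|combine|use)\b", r"\bcontraindicated\b",
    ]),
    "urgency_cue": (1.0, [
        r"\bimmediately\b", r"\bemergency\b", r"\bright away\b",
    ]),
    "high_risk_medication": (3.0, [
        r"\bwarfarin\b", r"\binsulin\b", r"\bopioids?\b", r"\bmethotrexate\b",
    ]),
}


def risk_score(text):
    """Return (weighted score, per-category hit counts) for risk-bearing language."""
    lowered = text.lower()
    hits = {}
    total = 0.0
    for category, (weight, patterns) in RISK_LEXICON.items():
        count = sum(len(re.findall(p, lowered)) for p in patterns)
        if count:
            hits[category] = count
            total += weight * count
    return total, hits


def _tokens(text):
    """Lowercased set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def relevance(prompt, response):
    """Jaccard overlap of word sets; an assumed stand-in for a relevance measure."""
    p, r = _tokens(prompt), _tokens(response)
    if not p or not r:
        return 0.0
    return len(p & r) / len(p | r)


def is_high_risk_low_grounding(prompt, response,
                               risk_threshold=3.0, relevance_threshold=0.2):
    """Flag responses that carry risky language but barely relate to the prompt."""
    score, _ = risk_score(response)
    return score >= risk_threshold and relevance(prompt, response) < relevance_threshold


if __name__ == "__main__":
    prompt = "What can I do about a mild headache?"
    risky = "Start taking warfarin immediately and increase your dose every day."
    print(risk_score(risky))
    print(is_high_risk_low_grounding(prompt, risky))
```

Note that the score depends only on how actionable the language is, not on whether the advice is clinically correct, which mirrors the distinction the abstract draws. A realistic version would likely need a clinically reviewed lexicon and a stronger relevance measure (for example, embedding similarity) in place of word overlap.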