arxiv_cs_cv April 20, 2026

FETAL-GAUGE: A Benchmark for Assessing Vision-Language Models in Fetal Ultrasound

Translated: 2026/4/20 10:50:28

fetal-ultrasound, vision-language-models, medical-benchmark, deep-learning, prenatal-care

Original Content

arXiv:2512.22278v2 Announce Type: replace

Abstract: The growing demand for prenatal ultrasound imaging has intensified a global shortage of trained sonographers, creating barriers to essential fetal health monitoring. Deep learning has the potential to enhance sonographers' efficiency and support the training of new practitioners. Vision-Language Models (VLMs) are particularly promising for ultrasound interpretation, as they can jointly process images and text to perform multiple clinical tasks within a single framework. However, despite the expansion of VLMs, no standardized benchmark exists to evaluate their performance in fetal ultrasound imaging. This gap is primarily due to the modality's challenging nature, operator dependency, and the limited public availability of datasets. To address this gap, we present Fetal-Gauge, the first and largest visual question answering benchmark specifically designed to evaluate VLMs across various fetal ultrasound tasks. Our benchmark comprises over 42,000 images and 93,000 question-answer pairs, spanning anatomical plane identification, visual grounding of anatomical structures, fetal orientation assessment, clinical view conformity, and clinical diagnosis. We systematically evaluate several state-of-the-art VLMs, including general-purpose and medical-specific models, and reveal a substantial performance gap: the best-performing model achieves only 55% accuracy, far below clinical requirements. Our analysis identifies critical limitations of current VLMs in fetal ultrasound interpretation, highlighting the urgent need for domain-adapted architectures and specialized training approaches. Fetal-Gauge establishes a rigorous foundation for advancing multimodal deep learning in prenatal care and provides a pathway toward addressing global healthcare accessibility challenges. Our benchmark will be publicly available once the paper gets accepted.
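
The abstract reports a single accuracy figure across several task families but does not describe the scoring protocol. As a rough illustration only, the sketch below shows one common way a visual question answering benchmark could be scored: exact-match accuracy after light text normalization, reported per task and overall. The record schema (`task`, `prediction`, `answer`), the task labels, and the toy answers are all hypothetical and are not taken from Fetal-Gauge; a task such as visual grounding would more plausibly need a box-overlap metric than string matching.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.lower().split())


def per_task_accuracy(records):
    """Exact-match accuracy for each task.

    `records` is an iterable of dicts with the assumed keys
    'task', 'prediction', and 'answer' (hypothetical schema).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["task"]] += 1
        if normalize(record["prediction"]) == normalize(record["answer"]):
            correct[record["task"]] += 1
    return {task: correct[task] / total[task] for task in total}


def overall_accuracy(records):
    """Exact-match accuracy pooled over all question-answer pairs."""
    records = list(records)
    if not records:
        return 0.0
    hits = sum(
        normalize(r["prediction"]) == normalize(r["answer"]) for r in records
    )
    return hits / len(records)


# Made-up records for illustration; not real benchmark data.
toy_records = [
    {"task": "plane_identification", "prediction": "Femur", "answer": "femur"},
    {"task": "plane_identification", "prediction": "abdomen", "answer": "brain"},
    {"task": "fetal_orientation", "prediction": "cephalic", "answer": "cephalic"},
    {"task": "fetal_orientation", "prediction": " breech ", "answer": "breech"},
]

if __name__ == "__main__":
    print(per_task_accuracy(toy_records))
    print(overall_accuracy(toy_records))
```

Reporting per-task scores alongside the pooled figure matters when task families differ in size, since a pooled accuracy can hide a task on which a model performs near chance.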