arxiv_cs_cv April 20, 2026

Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI

spatial-reasoning, medical-vlm, volumetric-mri, vision-language-models, chain-of-thought

Original Content

arXiv:2604.15808v1 Announce Type: new

Abstract: Spatial reasoning and visual grounding are core capabilities for vision-language models (VLMs), yet most medical VLMs produce predictions without transparent reasoning or spatial evidence. Existing benchmarks also evaluate VLMs on isolated 2D images, overlooking the volumetric nature of clinical imaging, where findings can span multiple frames or appear on only a few slices. We introduce Spatially Grounded MRI Visual Question Answering (SGMRI-VQA), a 41,307-pair benchmark for multi-frame, spatially grounded reasoning on volumetric MRI. Built from expert radiologist annotations in the fastMRI+ dataset across brain and knee studies, each QA pair includes a clinician-aligned chain-of-thought trace with frame-indexed bounding box coordinates. Tasks are organized hierarchically across detection, localization, counting/classification, and captioning, requiring models to jointly reason about what is present, where it is, and across which frames it extends. We benchmark 10 VLMs and show that supervised fine-tuning of Qwen3-VL-8B with bounding box supervision consistently improves grounding performance over strong zero-shot baselines, indicating that targeted spatial supervision is an effective path toward grounded clinical reasoning.
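To make "frame-indexed bounding box coordinates" concrete, the following Python sketch shows one way such a QA record could be represented and scored. It is an illustration under stated assumptions, not the benchmark's actual schema or evaluation protocol: the `GroundedQA` field names, the `(x_min, y_min, x_max, y_max)` box convention, and the two scoring helpers (`frame_f1`, `mean_best_iou`) are all hypothetical, and the abstract does not say which grounding metrics the authors use.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]
FrameBoxes = Dict[int, List[Box]]  # frame (slice) index -> boxes on that frame


@dataclass
class GroundedQA:
    """Hypothetical layout of one QA pair; field names are illustrative only."""
    question: str
    answer: str
    task: str        # e.g. "detection", "localization", "counting/classification", "captioning"
    reasoning: str   # chain-of-thought trace
    boxes_by_frame: FrameBoxes = field(default_factory=dict)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def frame_f1(pred: FrameBoxes, gold: FrameBoxes) -> float:
    """F1 over the sets of frame indices that carry at least one box."""
    pred_frames = {k for k, v in pred.items() if v}
    gold_frames = {k for k, v in gold.items() if v}
    if not pred_frames and not gold_frames:
        return 1.0
    true_pos = len(pred_frames & gold_frames)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_frames)
    recall = true_pos / len(gold_frames)
    return 2 * precision * recall / (precision + recall)


def mean_best_iou(pred: FrameBoxes, gold: FrameBoxes) -> float:
    """Average, over gold boxes, of the best IoU with any predicted box on the same frame."""
    scores = []
    for frame, gold_boxes in gold.items():
        pred_boxes = pred.get(frame, [])
        for g in gold_boxes:
            scores.append(max((box_iou(g, p) for p in pred_boxes), default=0.0))
    return sum(scores) / len(scores) if scores else 0.0


# Toy example: the finding spans frames 12 and 13, but the model marks frames 12 and 14.
gold = {12: [(10, 10, 50, 50)], 13: [(10, 10, 50, 50)]}
pred = {12: [(10, 10, 50, 50)], 14: [(0, 0, 5, 5)]}
example = GroundedQA(
    question="Is there a lesion, and on which slices?",
    answer="Yes, on slices 12 and 13.",
    task="localization",
    reasoning="(illustrative placeholder for a chain-of-thought trace)",
    boxes_by_frame=gold,
)
print(frame_f1(pred, example.boxes_by_frame), mean_best_iou(pred, example.boxes_by_frame))
```

Scoring frame selection separately from in-frame box overlap reflects the abstract's point that models must reason both about where a finding is and across which frames it extends; a model can localize well on one slice and still miss the finding's extent through the volume.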