arxiv_cs_cv April 20, 2026

Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

vision-language-models, vision-reasoning, multimodal-benchmark, crossmodal-comparison, fine-tuning

Original Content

arXiv:2604.16256v1 Announce Type: new

Abstract: Reasoning in vision-language models (VLMs) has recently attracted significant attention due to its broad applicability across diverse downstream tasks. However, it remains unclear whether the superior performance of VLMs stems from genuine vision-grounded reasoning or relies predominantly on the reasoning capabilities of their textual backbones. To systematically measure this, we introduce CrossMath, a novel multimodal reasoning benchmark designed for controlled cross-modal comparisons. Specifically, we construct each problem in text-only, image-only, and image+text formats guaranteeing identical task-relevant information, verified by human annotators. This rigorous alignment effectively isolates modality-specific reasoning differences while eliminating confounding factors such as information mismatch. Extensive evaluation of state-of-the-art VLMs reveals a consistent phenomenon: a substantial performance gap between textual and visual reasoning. Notably, VLMs excel with text-only inputs, whereas incorporating visual data (image+text) frequently degrades performance compared to the text-only baseline. These findings indicate that current VLMs conduct reasoning primarily in the textual space, with limited genuine reliance on visual evidence. To mitigate this limitation, we curate a CrossMath training set for VLM fine-tuning. Empirical evaluations demonstrate that fine-tuning on this training set significantly boosts reasoning performance across all individual and joint modalities, while yielding robust gains on two general visual reasoning tasks. Source code is available at https://github.com/xuyige/CrossMath.
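
The cross-modal comparison described in the abstract can be illustrated with a minimal sketch. It assumes each problem is answered once per format (text-only, image-only, image+text) and that the modality gap is read as the accuracy of a format minus the text-only accuracy. The function names, the record layout, and the toy outcomes below are hypothetical illustrations. They are not taken from the CrossMath repository and are not results from the paper.

```python
from collections import defaultdict

# The three input formats named in the abstract.
MODALITIES = ("text", "image", "image+text")


def accuracy_by_modality(records):
    """Compute per-format accuracy.

    records: iterable of (problem_id, modality, is_correct) tuples
    (a hypothetical layout chosen for this sketch).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for _problem_id, modality, is_correct in records:
        total[modality] += 1
        correct[modality] += int(is_correct)
    return {m: correct[m] / total[m] for m in MODALITIES if total[m]}


def gap_to_text(acc):
    """Accuracy of each non-text format minus the text-only accuracy.

    A negative value means the format underperforms the text-only baseline.
    """
    return {m: acc[m] - acc["text"] for m in acc if m != "text"}


# Made-up toy outcomes for four problems, for illustration only.
toy_records = [
    ("p1", "text", True), ("p1", "image", False), ("p1", "image+text", True),
    ("p2", "text", True), ("p2", "image", False), ("p2", "image+text", False),
    ("p3", "text", True), ("p3", "image", True), ("p3", "image+text", True),
    ("p4", "text", False), ("p4", "image", False), ("p4", "image+text", False),
]

acc = accuracy_by_modality(toy_records)
gaps = gap_to_text(acc)
print(acc)
print(gaps)
```

Because every problem appears in all three formats with the same task-relevant information, a difference in accuracy between formats can be attributed to the input modality and not to differences in problem content.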