arxiv_cs_cv April 20, 2026

Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs

Translated: 2026/4/20 10:44:44

multimodal-llm, chain-of-thought, spatial-reasoning, hallucination, visual-benchmarks

Original Content

arXiv:2604.16060v1 (Announce Type: new)

Abstract: Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT prompted MLMs suffer from severe shortcut learning, and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.
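The claim that CoT prompting "consistently degrades" performance can be made concrete as a per-pair accuracy difference between CoT and direct prompting. The sketch below is only an illustration of that bookkeeping, not the authors' evaluation code: the function names, the model and benchmark labels, and every accuracy value are hypothetical placeholders, and none of the numbers come from the paper.

```python
from typing import Dict, Tuple

# A (model, benchmark) pair mapped to an accuracy in [0, 1].
Scores = Dict[Tuple[str, str], float]


def cot_deltas(direct: Scores, cot: Scores) -> Scores:
    """Accuracy change (CoT minus direct) for every pair present in both runs."""
    return {key: cot[key] - direct[key] for key in direct if key in cot}


def fraction_degraded(deltas: Scores) -> float:
    """Share of (model, benchmark) pairs where CoT scored lower than direct."""
    if not deltas:
        raise ValueError("no overlapping (model, benchmark) pairs to compare")
    return sum(1 for d in deltas.values() if d < 0) / len(deltas)


# Placeholder accuracies for illustration only; NOT results from the paper.
direct = {
    ("model_a", "bench_1"): 0.62,
    ("model_a", "bench_2"): 0.55,
    ("model_b", "bench_1"): 0.70,
    ("model_b", "bench_2"): 0.48,
}
cot = {
    ("model_a", "bench_1"): 0.58,
    ("model_a", "bench_2"): 0.51,
    ("model_b", "bench_1"): 0.71,
    ("model_b", "bench_2"): 0.44,
}

deltas = cot_deltas(direct, cot)
share = fraction_degraded(deltas)
print(f"CoT lowered accuracy on {share:.0%} of model-benchmark pairs")
```

A fraction close to 1 across many models and benchmarks is what "consistent" degradation would look like in this framing. A real analysis would also report the size of each drop, not only its sign.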