arxiv_cs_cv February 10, 2026

When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

Translated: 2026/3/15 19:05:34

visual-spatial-reasoning world-models test-time-scaling multimodal-llms image-imagination

Original Content

arXiv:2602.08236v1 (Announce Type: new)

Abstract: Despite rapid progress in Multimodal Large Language Models (MLLMs), visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints. Recent work addresses this by augmenting reasoning with world models for visual imagination, but questions such as when imagination is actually necessary, how much of it is beneficial, and when it becomes harmful, remain poorly understood. In practice, indiscriminate imagination can increase computation and even degrade performance by introducing misleading evidence. In this work, we present an in-depth analysis of test-time visual imagination as a controllable resource for spatial reasoning. We study when static visual evidence is sufficient, when imagination improves reasoning, and how excessive or unnecessary imagination affects accuracy and efficiency. To support this analysis, we introduce AVIC, an adaptive test-time framework with world models that explicitly reasons about the sufficiency of current visual evidence before selectively invoking and scaling visual imagination. Across spatial reasoning benchmarks (SAT, MMSI) and an embodied navigation benchmark (R2R), our results reveal clear scenarios where imagination is critical, marginal, or detrimental, and show that selective control can match or outperform fixed imagination strategies with substantially fewer world-model calls and language tokens. Overall, our findings highlight the importance of analyzing and controlling test-time imagination for efficient and reliable spatial reasoning.
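The abstract describes the control idea only at a high level: check whether the current visual evidence is sufficient, and invoke the world model only when it is not, up to some budget. The Python sketch below is one possible reading of that loop, not the authors' implementation. Every name in it (`adaptive_imagination`, `is_sufficient`, `imagine`, `answer`, `max_calls`, and the toy string-based "views") is a hypothetical stand-in; in a real system the three callables would wrap an MLLM and a world model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trace:
    """Result of one query plus how much imagination it consumed."""
    answer: str
    world_model_calls: int
    views_used: List[str] = field(default_factory=list)


def adaptive_imagination(
    question: str,
    observed_views: List[str],
    is_sufficient: Callable[[str, List[str]], bool],  # hypothetical sufficiency judge
    imagine: Callable[[str, List[str]], str],         # hypothetical world-model call
    answer: Callable[[str, List[str]], str],          # hypothetical reasoner
    max_calls: int = 3,                                # assumed imagination budget
) -> Trace:
    views = list(observed_views)
    calls = 0
    # Judge sufficiency first; imagine only while evidence is judged
    # insufficient and the budget is not exhausted.
    while calls < max_calls and not is_sufficient(question, views):
        views.append(imagine(question, views))
        calls += 1
    return Trace(answer(question, views), calls, views)


# Toy stand-ins (assumptions for illustration): a "view" is just a viewpoint
# name, and a question needs every viewpoint it mentions.
VIEWPOINTS = ("back", "front", "left", "right")


def missing_views(question: str, views: List[str]) -> List[str]:
    return [v for v in VIEWPOINTS if v in question and v not in views]


def toy_sufficient(question: str, views: List[str]) -> bool:
    return not missing_views(question, views)


def toy_imagine(question: str, views: List[str]) -> str:
    # Only called when at least one required view is missing.
    return missing_views(question, views)[0]


def toy_answer(question: str, views: List[str]) -> str:
    return "answered from: " + ", ".join(views)


easy = adaptive_imagination(
    "what is in front?", ["front"], toy_sufficient, toy_imagine, toy_answer
)
hard = adaptive_imagination(
    "compare left and back", ["front"], toy_sufficient, toy_imagine, toy_answer
)
print(easy.world_model_calls, hard.world_model_calls)  # 0 2
```

In this toy setup the first question is answered from the observed view with no world-model calls, while the second triggers two imagined views. A fixed imagination strategy would spend the same number of calls on both, which is the inefficiency the abstract argues selective control can avoid.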