arxiv_cs_cv February 10, 2026

Thinking in Structures: Evaluating Spatial Intelligence through Reasoning on Constrained Manifolds

Translated: 2026/3/15 19:02:31

spatial-intelligence, vision-language-models, vqa-benchmark, 3d-geometry, manifold-reasoning

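Translation (from Japanese)

arXiv:2602.07864v1 Announce Type: new Abstract: Spatial intelligence is essential for vision-language models (VLMs) operating in the physical world, yet many benchmarks evaluate loosely constrained scenes in which models can exploit 2D shortcuts. We propose SSI-Bench, a VQA benchmark for spatial reasoning on constrained manifolds, grounded in complex real-world 3D structures whose feasible configurations are strictly governed by geometric, topological, and physical constraints. SSI-Bench comprises 1,000 ranking questions that span geometric and topological reasoning and call for a diverse repertoire of compositional spatial operations, such as mental rotation, cross-sectional inference, occlusion reasoning, and force-path reasoning. It was built through a fully human-centered pipeline: ten researchers spent more than 400 hours selecting images, annotating structural components, and designing questions so as to minimize pixel-level cues. An evaluation of 31 widely used VLMs reveals a large gap to humans: the best open-source model achieves 22.2% accuracy and the strongest closed-source model reaches 33.6%, whereas humans score 91.6%. Prompting models to think yields only slight improvements, and error analysis points to failures in structural grounding and in constraint-consistent 3D reasoning. Project page: https://ssi-bench.github.io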

Original Content

arXiv:2602.07864v1 Announce Type: new Abstract: Spatial intelligence is crucial for vision-language models (VLMs) in the physical world, yet many benchmarks evaluate largely unconstrained scenes where models can exploit 2D shortcuts. We introduce SSI-Bench, a VQA benchmark for spatial reasoning on constrained manifolds, built from complex real-world 3D structures whose feasible configurations are tightly governed by geometric, topological, and physical constraints. SSI-Bench contains 1,000 ranking questions spanning geometric and topological reasoning and requiring a diverse repertoire of compositional spatial operations, such as mental rotation, cross-sectional inference, occlusion reasoning, and force-path reasoning. It is created via a fully human-centered pipeline: ten researchers spent over 400 hours curating images, annotating structural components, and designing questions to minimize pixel-level cues. Evaluating 31 widely used VLMs reveals a large gap to humans: the best open-source model achieves 22.2% accuracy and the strongest closed-source model reaches 33.6%, while humans score 91.6%. Encouraging models to think yields only marginal gains, and error analysis points to failures in structural grounding and constraint-consistent 3D reasoning. Project page: https://ssi-bench.github.io
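
The abstract reports accuracy on ranking questions but does not say how a ranking is scored. As a rough illustration only, the sketch below assumes exact-match scoring, where a question counts as correct only if the predicted ordering equals the reference ordering. The function name `exact_match_accuracy` and the toy data are hypothetical and are not taken from SSI-Bench; the only figures drawn from the abstract are the three reported accuracies.

```python
from typing import Sequence


def exact_match_accuracy(
    predictions: Sequence[Sequence[str]],
    references: Sequence[Sequence[str]],
) -> float:
    """Fraction of questions whose predicted ordering equals the reference.

    Assumption: exact-match scoring. The abstract does not state the metric.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one question")
    correct = sum(list(p) == list(r) for p, r in zip(predictions, references))
    return correct / len(references)


# Toy orderings invented for illustration; not SSI-Bench data.
references = [["A", "C", "B"], ["B", "A", "C"], ["C", "B", "A"], ["A", "B", "C"]]
predictions = [["A", "C", "B"], ["A", "B", "C"], ["C", "B", "A"], ["A", "C", "B"]]
accuracy = exact_match_accuracy(predictions, references)  # 2 of 4 match

# Gaps to human performance, in percentage points, from the reported figures.
human, best_closed, best_open = 91.6, 33.6, 22.2
gap_closed = round(human - best_closed, 1)
gap_open = round(human - best_open, 1)

print(f"toy accuracy: {accuracy:.1%}")
print(f"gap to humans: closed-source {gap_closed} pts, open-source {gap_open} pts")
```

Under exact-match scoring a partially correct ordering earns no credit, so the metric is strict. A benchmark could instead use a rank-correlation score, in which case the numbers would not be comparable to the ones computed here.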