arxiv_cs_cv April 24, 2026

Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning

Translated: 2026/4/24 19:48:25

vision-language-models, bongard-problems, large-language-models, visual-reasoning, symbolic-grounding

Japanese Translation

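arXiv:2604.21346v1 Announce Type: cross

Abstract: Vision-language models (VLMs) often fail on abstract visual reasoning benchmarks such as Bongard problems, which raises the question of whether the bottleneck lies in reasoning or in representation. We study this on Bongard-LOGO, a synthetic benchmark for abstract concept learning that comes with ground-truth generative programs, by evaluating end-to-end VLMs on raw images and comparing them with large language models (LLMs) given symbolic inputs derived from those images. Treating symbolic inputs as a diagnostic probe rather than a practical multimodal architecture, we use the Componential-Grammatical (C-G) paradigm to recast Bongard-LOGO as a symbolic reasoning task based on LOGO-style action programs or structured descriptions. LLMs achieve large, consistent gains, reaching accuracy in the mid-90s on Free-form problems, whereas a strong visual baseline stays near chance level under the same task definition. Ablations on input format, explicit concept prompts, and minimal visual grounding show that these factors matter far less than the shift from pixels to symbolic structure. These results show that symbolic input provides a controlled diagnostic upper bound and identify representation as a bottleneck in abstract visual reasoning.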

Original Content

arXiv:2604.21346v1 Announce Type: cross

Abstract: Vision-language models (VLMs) often fail on abstract visual reasoning benchmarks such as Bongard problems, raising the question of whether the main bottleneck lies in reasoning or representation. We study this on Bongard-LOGO, a synthetic benchmark of abstract concept learning with ground-truth generative programs, by comparing end-to-end VLMs on raw images with large language models (LLMs) given symbolic inputs derived from those images. Using symbolic inputs as a diagnostic probe rather than a practical multimodal architecture, our Componential-Grammatical (C-G) paradigm reformulates Bongard-LOGO as a symbolic reasoning task based on LOGO-style action programs or structured descriptions. LLMs achieve large and consistent gains, reaching mid-90s accuracy on Free-form problems, while a strong visual baseline remains near chance under matched task definitions. Ablations on input format, explicit concept prompts, and minimal visual grounding show that these factors matter much less than the shift from pixels to symbolic structure. These results identify representation as a key bottleneck in abstract visual reasoning and show how symbolic input can serve as a controlled diagnostic upper bound.
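
To make the reformulation concrete, here is a minimal sketch of how one Bongard-style problem might be serialized from LOGO-style action programs into a text prompt for an LLM. The abstract does not specify the paper's actual format, so everything here is an illustrative assumption: the `(kind, length, turn)` action schema, the `BongardProblem` container, the `render_program` and `build_prompt` helpers, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical schema (not from the paper): one stroke is
# (kind, length, turn angle in degrees).
Action = Tuple[str, float, float]


@dataclass
class BongardProblem:
    positives: List[List[Action]]  # programs for shapes that satisfy the concept
    negatives: List[List[Action]]  # programs for shapes that violate it
    query: List[Action]            # held-out shape to classify


def render_program(program: List[Action]) -> str:
    """Serialize one action program as a single line of text."""
    return " ; ".join(
        f"{kind}(len={length:.2f}, turn={angle:.0f})"
        for kind, length, angle in program
    )


def build_prompt(problem: BongardProblem) -> str:
    """Assemble a text-only prompt; the wording is an assumption."""
    lines = ["Each shape is given as a LOGO-style action program."]
    lines.append("Positive examples:")
    lines += [f"  P{i + 1}: {render_program(p)}"
              for i, p in enumerate(problem.positives)]
    lines.append("Negative examples:")
    lines += [f"  N{i + 1}: {render_program(n)}"
              for i, n in enumerate(problem.negatives)]
    lines.append(f"Query: {render_program(problem.query)}")
    lines.append("Does the query belong to the positive set? "
                 "Answer 'positive' or 'negative'.")
    return "\n".join(lines)


if __name__ == "__main__":
    square = [("line", 1.0, 90.0)] * 4
    triangle = [("line", 1.0, 120.0)] * 3
    demo = BongardProblem(positives=[square], negatives=[triangle], query=square)
    print(build_prompt(demo))
```

The point of the sketch is only that the LLM receives the structure of each shape directly as text, with no pixels involved. This is the shift from pixels to symbolic structure that the abstract identifies as mattering most.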