arxiv_cs_cv April 20, 2026

Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Translated: 2026/4/20 10:44:40

multimodal-llm visual-abstract benchmark cognitive-sciences arxiv-2604

Original Content

arXiv:2604.16054v1 Announce Type: new
Abstract: Multimodal large language models (MLLMs) have achieved impressive progress on vision-language benchmarks, yet their capacity for visual cognitive and visuospatial reasoning remains less well understood. We introduce "Mind's Eye", a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel "A-R-T" taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with that of human participants. Humans achieve 80% accuracy, while top-performing MLLMs remain below 50%. Error analysis reveals failures in (i) visual attention allocation, (ii) internal perceptual manipulation, and (iii) abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.