arxiv_cs_ai April 24, 2026

Can MLLMs "Read" What is Missing?

multimodal-llm, text-reconstruction, benchmarks, visual-grounding, document-analysis

Original Content

arXiv:2604.21277v1 Announce Type: new

Abstract: We introduce MMTR-Bench, a benchmark designed to evaluate the intrinsic ability of Multimodal Large Language Models (MLLMs) to reconstruct masked text directly from visual context. Unlike conventional question-answering tasks, MMTR-Bench eliminates explicit prompts, requiring models to recover masked text from single- or multi-page inputs across real-world domains such as documents and webpages. This design isolates the reconstruction task from instruction-following abilities, enabling a direct assessment of a model's layout understanding, visual grounding, and knowledge integration. MMTR-Bench comprises 2,771 test samples spanning multiple languages and varying target lengths. To account for this diversity, we propose a level-aware evaluation protocol. Experiments on representative MLLMs show that the benchmark poses a significant challenge, especially for sentence- and paragraph-level reconstruction. The homepage is available at https://mmtr-bench-dataset.github.io/MMTR-Bench/.
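
The abstract does not spell out the level-aware evaluation protocol, so the following is only an illustrative sketch of what scoring by target length could look like, not the benchmark's actual metric. The level names (`"word"`, `"sentence"`, `"paragraph"`), the function names, and the choice of exact match versus a character-level similarity ratio are all assumptions made for this example.

```python
import difflib
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFKC, lowercase, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(text.split())


def level_aware_score(prediction: str, target: str, level: str) -> float:
    """Score one reconstruction in [0, 1] with a rule chosen by target level.

    Hypothetical rule (not taken from the paper): short "word" targets
    require an exact match, while "sentence" and "paragraph" targets get
    partial credit from a character-level similarity ratio.
    """
    pred, gold = normalize(prediction), normalize(target)
    if level == "word":
        return 1.0 if pred == gold else 0.0
    if level in ("sentence", "paragraph"):
        return difflib.SequenceMatcher(None, pred, gold).ratio()
    raise ValueError(f"unknown level: {level!r}")


def mean_score_by_level(samples):
    """Average scores separately per level so long targets do not dominate.

    `samples` is an iterable of (prediction, target, level) tuples.
    """
    scores_by_level = {}
    for prediction, target, level in samples:
        score = level_aware_score(prediction, target, level)
        scores_by_level.setdefault(level, []).append(score)
    return {level: sum(s) / len(s) for level, s in scores_by_level.items()}
```

Reporting one mean per level, rather than a single pooled number, keeps the harder sentence- and paragraph-level cases visible instead of letting many easy short targets hide them.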