arxiv_cs_lg 2026年2月10日

マルチモーダル大規模言語モデルを用いた効率的なテーブル抽出と理解

Efficient Table Retrieval and Understanding with Multimodal Large Language Models

Translated: 2026/3/15 7:03:58

multi-modal-large-language-modelstabular-datavisual-retrievaltable-understandingimage-processing

Japanese Translation

arXiv:2602.07642v1 Announce Type: cross 要旨：表形式のデータは、金融レポート、手書き記録、文書スキャンなど、多岐にわたるリアルワールドシナリオにおいて画像形式で頻繁に記録されています。これらの視覚表現は、構造的かつ視覚的な複雑さを両方備えているため、機械による理解においてユニークな課題を呈します。最近のマルチモーダル大規模言語モデル（MLLMs）の進歩は表理解において有望な結果を示していますが、これらのモデルは通常、関連する表が手に入ることを前提としています。しかし、より実践的なシナリオは、大規模集合から関連する表を特定し、ユーザーのクエリに対して推論を行うというものです。このギャップを解決するために、私たちは MLLMs が大規模な表集合に対してクエリを回答できるフレームワーク、TabRAG を提案します。私たちのアプローチは、一緒に訓練された視覚 - テキスト基礎モデルを使用して候補表を抽出し、次に MLLMs を用いてこれらの候補を微細な階層で再ランク付けし、最後に選択された表について MLLMs を用いて推論を行うことで回答生成を行います。新しく作成されたデータセット、88,161 件のトレーニングサンプルと 9,819 件のテストサンプル、8 ベンチマーク、48,504 個のユニークな表を含む広範な実験を通じて、私たちが提案したフレームワークは、 retrieval recall で 7.0%、answer accuracy で 6.1% を上回る従来の手法に比較して著しく優れていることを示しており、現実世界における表理解タスクに対する実用的な解決策を提供します。

Original Content

arXiv:2602.07642v1 Announce Type: cross Abstract: Tabular data is frequently captured in image form across a wide range of real-world scenarios such as financial reports, handwritten records, and document scans. These visual representations pose unique challenges for machine understanding, as they combine both structural and visual complexities. While recent advances in Multimodal Large Language Models (MLLMs) show promising results in table understanding, they typically assume the relevant table is readily available. However, a more practical scenario involves identifying and reasoning over relevant tables from large-scale collections to answer user queries. To address this gap, we propose TabRAG, a framework that enables MLLMs to answer queries over large collections of table images. Our approach first retrieves candidate tables using jointly trained visual-text foundation models, then leverages MLLMs to perform fine-grained reranking of these candidates, and finally employs MLLMs to reason over the selected tables for answer generation. Through extensive experiments on a newly constructed dataset comprising 88,161 training and 9,819 testing samples across 8 benchmarks with 48,504 unique tables, we demonstrate that our framework significantly outperforms existing methods by 7.0% in retrieval recall and 6.1% in answer accuracy, offering a practical solution for real-world table understanding tasks.