arxiv_cs_cv April 20, 2026

InstructTable: Improving Table Structure Recognition Through Instructions

Translated: 2026/4/20 10:51:38

table-structure-recognition, vision-language-models, data-synthesis, arxiv-2026, benchmark-creation

Original Content

arXiv:2604.02880v2 Announce Type: replace

Abstract: Table structure recognition (TSR) holds widespread practical importance by parsing tabular images into structured representations, yet encounters significant challenges when processing complex layouts involving merged or empty cells. Traditional visual-centric models rely exclusively on visual information while lacking crucial semantic support, thereby impeding accurate structural recognition in complex scenarios. Vision-language models leverage contextual semantics to enhance comprehension; however, these approaches underemphasize the modeling of visual structural information. To address these limitations, this paper introduces InstructTable, an instruction-guided multi-stage training TSR framework. Meticulously designed table instruction pre-training directs attention toward fine-grained structural patterns, enhancing comprehension of complex tables. Complementary TSR fine-tuning preserves robust visual information modeling, maintaining high-precision table parsing across diverse scenarios. Furthermore, we introduce Table Mix Expand (TME), an innovative template-free method for synthesizing large-scale authentic tabular data. Leveraging TME, we construct the Balanced Complex Dense Synthetic Tables (BCDSTab) benchmark, comprising 900 complex table images synthesized through our method to serve as a rigorous benchmark. Extensive experiments on multiple public datasets (FinTabNet, PubTabNet, MUSTARD) and BCDSTab demonstrate that InstructTable achieves state-of-the-art performance in TSR tasks. Ablation studies further confirm the positive impact of the proposed tabular-data-specific instructions and synthetic data.
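To make concrete why merged and empty cells complicate structure recognition, the following minimal sketch expands a list of logical cells with row and column spans into a dense grid. It is not the InstructTable method, and nothing in it comes from the paper: the `Cell` class, the `expand_to_grid` function, and the sample table are hypothetical names introduced only for illustration, and the input is assumed to be well formed (no overlapping spans).

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """One logical table cell (hypothetical, illustrative structure)."""
    text: str = ""
    rowspan: int = 1
    colspan: int = 1


def expand_to_grid(rows):
    """Expand rows of logical cells into a dense grid of Cell references.

    Every grid slot holds the Cell that covers it, so a merged cell
    shows up in each slot it spans. Assumes spans do not overlap.
    """
    grid = []
    for r, row in enumerate(rows):
        while len(grid) <= r:
            grid.append([])
        c = 0
        for cell in row:
            # Skip slots already claimed by a rowspan from an earlier row.
            while c < len(grid[r]) and grid[r][c] is not None:
                c += 1
            for dr in range(cell.rowspan):
                while len(grid) <= r + dr:
                    grid.append([])
                target = grid[r + dr]
                # Pad the row so columns c .. c + colspan - 1 exist.
                while len(target) < c + cell.colspan:
                    target.append(None)
                for dc in range(cell.colspan):
                    target[c + dc] = cell
            c += cell.colspan
    return grid


# Sample table: one cell merged over two rows, one over two columns,
# and one empty cell in the last row.
rows = [
    [Cell("Region", rowspan=2), Cell("Sales", colspan=2)],
    [Cell("Q1"), Cell("Q2")],
    [Cell("North"), Cell(""), Cell("7")],
]
grid = expand_to_grid(rows)
texts = [[cell.text for cell in row] for row in grid]
for line in texts:
    print(line)
```

In this sample the second row lists only two cells yet occupies three columns, because the first column is still covered by the merged "Region" cell. The empty cell in the last row still needs its own slot, which a recognizer cannot infer from text content alone.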