arxiv_cs_cv April 20, 2026

DenTab: A Dataset for Table Recognition and Visual QA on Real-World Dental Estimates

table-recognition, visual-qa, dataset-benchmarking, vlm, ocr

Original Content

arXiv:2604.16099v1 Announce Type: new

Abstract: Tables condense key transactional and administrative information into compact layouts, but practical extraction requires more than text recognition: systems must also recover structure (rows, columns, merged cells, headers) and interpret roles such as line items, subtotals, and totals under common capture artifacts. Many existing resources for table structure recognition and TableVQA are built from clean digital-born sources or rendered tables, and therefore only partially reflect noisy administrative conditions. We introduce DenTab, a dataset of 2,000 cropped table images from dental estimates with high-quality HTML annotations, enabling evaluation of table recognition (TR) and table visual question answering (TableVQA) on the same inputs. DenTab includes 2,208 questions across eleven categories spanning retrieval, aggregation, and logic/consistency checks. We benchmark 16 systems, including 14 vision-language models (VLMs) and two OCR baselines. Across models, strong structure recovery does not consistently translate into reliable performance on multi-step arithmetic and consistency questions, and these reasoning failures persist even when using ground-truth HTML table inputs. To improve arithmetic reliability without training, we propose the Table Router Pipeline, which routes arithmetic questions to deterministic execution. The pipeline combines (i) a VLM that produces a baseline answer, a structured table representation, and a constrained table program with (ii) a rule-based executor that performs exact computation over the parsed table. The source code and dataset will be made publicly available at https://github.com/hamdilaziz/DenTab.
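The routing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the field names (`baseline_answer`, `table`, `program`), the question kinds, the tiny program format (`op`, `column`, `where`), and the fallback to the baseline answer are all assumptions made for illustration. The VLM stage is replaced by a hand-written dictionary, and only the rule-based executor and the router are shown.

```python
from decimal import Decimal, InvalidOperation

# Hypothetical VLM output: none of these field names come from the paper.
vlm_output = {
    "baseline_answer": "455.00",  # free-form VLM answer, deliberately wrong here
    "table": [
        {"item": "Crown", "role": "line_item", "amount": "320.00"},
        {"item": "Scaling", "role": "line_item", "amount": "85.50"},
        {"item": "X-ray", "role": "line_item", "amount": "40.00"},
        {"item": "Total", "role": "total", "amount": "445.50"},
    ],
    # Assumed constrained program: one operation over one filtered column.
    "program": {"op": "sum", "column": "amount", "where": {"role": "line_item"}},
}

# Assumed set of question kinds that are sent to deterministic execution.
ARITHMETIC_KINDS = {"aggregation", "consistency"}


def execute(program, table):
    """Rule-based executor: exact decimal arithmetic over the parsed table."""
    where = program.get("where", {})
    rows = [r for r in table if all(r.get(k) == v for k, v in where.items())]
    values = [Decimal(r[program["column"]]) for r in rows]
    op = program["op"]
    if op == "sum":
        return sum(values, Decimal("0"))
    if op == "max":
        return max(values)
    if op == "count":
        return Decimal(len(values))
    raise ValueError(f"unsupported op: {op}")


def route(question_kind, output):
    """Use the executor for arithmetic questions, else the baseline answer."""
    if question_kind in ARITHMETIC_KINDS and output.get("program"):
        try:
            return str(execute(output["program"], output["table"]))
        except (KeyError, ValueError, InvalidOperation):
            pass  # assumed fallback: malformed program or unparsable cell
    return output["baseline_answer"]


print(route("aggregation", vlm_output))  # computed from the table
print(route("retrieval", vlm_output))    # baseline answer passed through
```

Using `Decimal` rather than floats keeps monetary sums exact, which matches the abstract's emphasis on exact computation. The same executor could also support a consistency check by comparing the computed sum with the row whose role is `total`.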