arxiv_cs_cv April 20, 2026

PILOT: A Promptable Interleaved Layout-aware OCR Transformer


ocr, transformer, text-recognition, spatial-grounding, arxiv

Original Content

arXiv:2504.03621v2 Announce Type: replace

Abstract: Classical OCR pipelines decompose document reading into detection, segmentation, and recognition stages, which makes them sensitive to localization errors and difficult to extend to interactive querying. This work investigates whether a single compact model can jointly perform text recognition and spatial grounding on both handwritten and printed documents. We introduce PILOT, a 155M-parameter prompt-conditioned generative model that formulates document OCR as unified sequence generation. A lightweight depthwise-separable CNN encodes the page, and a Transformer decoder autoregressively emits a single stream of subword and quantized absolute-coordinate tokens on a 10 px grid, enabling full-page OCR, region-conditioned reading, and query-by-string spotting within the same architecture. A three-stage curriculum, progressing from plain transcription to joint text-and-box generation and finally to prompt-controlled extraction, stabilizes training and improves spatial grounding. Experiments on IAM, RIMES 2009, SROIE 2019, and the heterogeneous MAURDOR benchmark show that PILOT achieves competitive or superior performance in text recognition and line-level detection compared with traditional OCR systems, recent end-to-end HTR models, and compact vision-language models, while remaining substantially smaller than billion-scale multimodal models. Additional evaluations on fine-grained OCR and query-by-string spotting further confirm that a unified text-layout decoder can provide accurate and efficient promptable OCR in a compact setting. To support reproducibility, we release the synthetic SROIE generator, the 500k annotated IDL/PDFA pages, the harmonized line-level annotations for IAM, RIMES 2009, and MAURDOR, and the source code at https://github.com/hamdilaziz/PILOT.
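The idea of emitting text and quantized absolute coordinates in one stream can be illustrated with a minimal Python sketch. Only the 10 px grid comes from the abstract; the token spelling (`<c12>`), the box-then-text ordering, and the helper names `quantize`, `dequantize`, and `encode_line` are hypothetical choices made for illustration, and whitespace-separated words stand in for the subword tokens the model actually uses. This is not the authors' implementation.

```python
GRID = 10  # pixels per coordinate bin (the 10 px grid mentioned in the abstract)


def quantize(value_px: int, grid: int = GRID) -> int:
    """Map an absolute pixel coordinate to its grid bin index."""
    return value_px // grid


def dequantize(bin_index: int, grid: int = GRID) -> int:
    """Map a bin index back to the pixel at the centre of that bin."""
    return bin_index * grid + grid // 2


def coord_token(bin_index: int) -> str:
    """Hypothetical spelling of a coordinate token in the vocabulary."""
    return f"<c{bin_index}>"


def encode_line(words, box):
    """Build one interleaved token list for a text line.

    `box` is (x0, y0, x1, y1) in pixels. The layout used here,
    four coordinate tokens followed by the text tokens, is an
    assumption; the abstract does not specify the ordering.
    """
    coords = [coord_token(quantize(v)) for v in box]
    return coords + list(words)


if __name__ == "__main__":
    tokens = encode_line(["Total", "42.00"], (127, 348, 512, 379))
    print(tokens)
    # Quantization loses at most half a bin (5 px) per coordinate.
    print(dequantize(quantize(127)))
```

With a 10 px grid, a page 2000 px wide needs only 200 coordinate bins per axis, so under this scheme the coordinate vocabulary stays small relative to the subword vocabulary, at the cost of a localization error of at most 5 px per coordinate.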