arxiv_cs_cv February 10, 2026

MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm

Translated: 2026/3/15 6:02:06

monkeyocr, document-parsing, structure-recognition, large-language-models, computer-vision

Original Content

arXiv:2506.05218v2 Announce Type: replace

Abstract: We introduce MonkeyOCR, a document parsing model that advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm. This design simplifies what would otherwise be a complex multi-tool pipeline and avoids the inefficiencies of processing full pages with giant end-to-end models. In SRR, document parsing is abstracted into three fundamental questions: "Where is it?" (structure), "What is it?" (recognition), and "How is it organized?" (relation). These correspond to structure detection, content recognition, and relation prediction, respectively. To support this paradigm, we present MonkeyDoc, a comprehensive dataset with 4.5 million bilingual instances spanning over ten document types, which addresses the limitations of existing datasets that often focus on a single task, language, or document type. Leveraging the SRR paradigm and MonkeyDoc, we trained a 3B-parameter document foundation model. We further identify parameter redundancy in this model and propose contiguous parameter degradation (CPD), enabling the construction of models from 0.6B to 1.2B parameters that run faster with an acceptable performance drop. MonkeyOCR achieves state-of-the-art performance, surpassing previous open-source and closed-source methods, including Gemini 2.5-Pro. Additionally, the model can be efficiently deployed for inference on a single RTX 3090 GPU. Code and models will be released at https://github.com/Yuliang-Liu/MonkeyOCR.
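
To make the three SRR questions concrete, the following is a minimal sketch of how the stages could be chained on a single page. It is not MonkeyOCR's code or API: `Region`, `parse_page`, and the `toy_*` functions are hypothetical names introduced only for illustration, and the toy reading-order rule (top to bottom, then left to right) is a naive stand-in for the learned relation prediction the abstract describes.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels


@dataclass
class Region:
    """One detected block on the page (hypothetical type for this sketch)."""
    bbox: BBox
    kind: str          # e.g. "title", "text", "table", "formula"
    content: str = ""  # filled in by the recognition stage


def parse_page(
    page: object,
    detect_structure: Callable[[object], List[Region]],
    recognize: Callable[[object, Region], str],
    predict_relation: Callable[[List[Region]], List[int]],
) -> List[Region]:
    """Run the three SRR stages in sequence and return regions in reading order."""
    regions = detect_structure(page)           # structure: "Where is it?"
    for region in regions:                     # recognition: "What is it?"
        region.content = recognize(page, region)
    order = predict_relation(regions)          # relation: "How is it organized?"
    return [regions[i] for i in order]


# Toy stand-ins so the sketch runs without any model. Here a "page" is just a
# list of (bbox, kind, text) tuples; a real system would take an image.
def toy_detect(page) -> List[Region]:
    return [Region(bbox, kind) for bbox, kind, _ in page]


def toy_recognize(page, region: Region) -> str:
    for bbox, _, text in page:
        if bbox == region.bbox:
            return text
    return ""


def toy_reading_order(regions: List[Region]) -> List[int]:
    # Assumption: sort by top edge, then left edge. Not a learned model.
    return sorted(
        range(len(regions)),
        key=lambda i: (regions[i].bbox[1], regions[i].bbox[0]),
    )


toy_page = [
    ((50, 200, 550, 400), "text", "Body paragraph."),
    ((50, 50, 550, 100), "title", "A Title"),
]
parsed = parse_page(toy_page, toy_detect, toy_recognize, toy_reading_order)
for region in parsed:
    print(region.kind, "->", region.content)
```

Passing the three stages in as callables keeps the sketch independent of any particular detector or recognizer. Because recognition runs per region rather than on the full page, each call sees only a small crop, which is the efficiency argument the abstract makes against full-page end-to-end models.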