arxiv_cs_cv February 10, 2026

What, Whether and How? Unveiling Process Reward Models for Thinking with Images Reasoning

Translated: 2026/3/16 14:04:35

process-reward-models, thinking-with-images, large-vision-language-models, visual-reasoning, lvlm-benchmark

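Japanese Translation (rendered in English)

arXiv:2602.08346v1 Announce Type: new

Abstract: The rapid growth of large vision-language models (LVLMs) has brought excellent capabilities across a variety of visual tasks. Building on these developments, the "thinking with images" paradigm has emerged, in which models dynamically edit and re-encode visual information at each reasoning step, mimicking human visual processing. However, this paradigm poses a significant challenge, because diverse errors can arise during reasoning. Addressing this requires process reward models (PRMs) that distinguish correct reasoning steps from incorrect ones, yet existing PRM benchmarks are mainly text-centric and lack comprehensive evaluation under this paradigm. To close these gaps, this work proposes the first comprehensive benchmark designed to evaluate PRMs under the thinking with images paradigm. The main contributions are: (1) Through experiments on reasoning trajectories and PRM-guided search, we define 7 fine-grained error types and show both the need for specialized PRMs and the room for improvement. (2) We build a comprehensive benchmark containing 1,206 manually annotated thinking with images reasoning trajectories across 4 categories and 16 subcategories, enabling fine-grained evaluation of PRMs. (3) Experimental analysis shows that current LVLMs are inadequate as effective PRMs: their ability to evaluate visual reasoning processes is limited, and they show marked performance differences across error types, a bias toward positive evaluations, and sensitivity to the position of the reasoning step. These findings demonstrate the effectiveness of the benchmark and establish a crucial foundation for advancing PRMs in LVLMs.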

Original Content

arXiv:2602.08346v1 Announce Type: new

Abstract: The rapid advancement of Large Vision Language Models (LVLMs) has demonstrated excellent abilities in various visual tasks. Building upon these developments, the thinking with images paradigm has emerged, enabling models to dynamically edit and re-encode visual information at each reasoning step, mirroring human visual processing. However, this paradigm introduces significant challenges as diverse errors may occur during reasoning processes. This necessitates Process Reward Models (PRMs) for distinguishing positive and negative reasoning steps, yet existing benchmarks for PRMs are predominantly text-centric and lack comprehensive assessment under this paradigm. To address these gaps, this work introduces the first comprehensive benchmark specifically designed for evaluating PRMs under the thinking with images paradigm. Our main contributions are: (1) Through extensive analysis of reasoning trajectories and guided search experiments with PRMs, we define 7 fine-grained error types and demonstrate both the necessity for specialized PRMs and the potential for improvement. (2) We construct a comprehensive benchmark comprising 1,206 manually annotated thinking with images reasoning trajectories spanning 4 categories and 16 subcategories for fine-grained evaluation of PRMs. (3) Our experimental analysis reveals that current LVLMs fall short as effective PRMs, exhibiting limited capabilities in visual reasoning process evaluation with significant performance disparities across error types, positive evaluation bias, and sensitivity to reasoning step positions. These findings demonstrate the effectiveness of our benchmark and establish crucial foundations for advancing PRMs in LVLMs.
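
The abstract reports that LVLMs used as PRMs show a positive evaluation bias when judging reasoning steps. As a rough illustration of how such a bias could be measured, the sketch below scores a step-level judge separately on steps annotated as positive and as negative. It is not the paper's evaluation protocol or metric: the `Step` fields, the `Judge` signature, the `bias_gap` quantity, and the toy trajectories are all assumptions made for this example, and the image edits that define thinking with images are reduced to text summaries.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    text: str         # text summary of the step; image edits are omitted in this sketch
    is_correct: bool  # human label: True = positive step, False = negative step


# Assumed interface: a judge sees the preceding steps and the current step,
# and returns True if it considers the current step correct.
Judge = Callable[[List[Step], Step], bool]


def evaluate_prm(judge: Judge, trajectories: List[List[Step]]) -> Dict[str, float]:
    """Score a step-level judge separately on positive and negative steps."""
    pos_total = pos_hit = neg_total = neg_hit = 0
    for trajectory in trajectories:
        for i, step in enumerate(trajectory):
            predicted_correct = judge(trajectory[:i], step)
            if step.is_correct:
                pos_total += 1
                pos_hit += int(predicted_correct)
            else:
                neg_total += 1
                neg_hit += int(not predicted_correct)
    pos_acc = pos_hit / pos_total if pos_total else 0.0
    neg_acc = neg_hit / neg_total if neg_total else 0.0
    return {
        "positive_accuracy": pos_acc,
        "negative_accuracy": neg_acc,
        # Illustrative quantity: a large positive gap means the judge
        # accepts steps far more readily than it rejects them.
        "bias_gap": pos_acc - neg_acc,
    }


def always_positive(history: List[Step], step: Step) -> bool:
    """Degenerate judge that accepts every step (an extreme positive bias)."""
    return True


# Hypothetical toy trajectories, invented for this example only.
toy = [
    [Step("crop the chart legend", True), Step("read the wrong bar", False)],
    [Step("zoom into the sign", True), Step("transcribe the text", True)],
]

result = evaluate_prm(always_positive, toy)
print(result)
```

Reporting the two accuracies separately, rather than a single pooled accuracy, keeps a judge that accepts everything from looking strong on a benchmark where most steps are correct. Grouping the same counts by the step index `i` would be one way to probe the sensitivity to step position that the abstract mentions.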