arxiv_cs_lg 2026/4/24

LVLM における強化ファインチューニングの再考：収束、報酬分解、および汎化

Rethinking Reinforcement Fine-Tuning in LVLM: Convergence, Reward Decomposition, and Generalization

Translated: 2026/4/24 19:54:40

reinforcement-learning, lvlm, tool-augmented, grpo, generalization

Japanese Translation

arXiv:2604.19857v1 Announce Type: new 要約: 検証可能な報酬を用いた強化ファインチューニング（RLVR）は、ツール使用や多段階推論といったエージェント的能力を大規模視覚言語モデル（LVLM）に付与する強力なパラダイムとして登場した。Visual Agentic Reinforcement Fine-Tuning（Visual-ARFT）をはじめとする顕著な実証的成功にもかかわらず、このパラダイムの理論的基盤は依然として十分に理解されていない。特に、2 つの重要な問いに対する厳密な答えが欠けている: (i) 検証可能な報酬の複合構造（形式準拠、回答精度、ツールの実行可能性）は、Group Relative Policy Optimization（GRPO）の収束にどのように影響するのか、および (ii) 少数のツール拡張タスクでの訓練が、なぜ分布外（out-of-distribution）ドメインへ転移するのか。我々は、深さが有界なツール呼び出しを伴うマルチモーダルなエージェント的意思決定をモデル化する形式的枠組みである \textit{Tool-Augmented Markov Decision Process}（TA-MDP）を導入することで、これらのギャップに取り組む。この枠組みの中で、我々は 3 つの主要な結果を確立する。第一に、複合的な検証可能報酬の下での GRPO が、報酬成分数とグループサイズへの明示的な依存性を伴って、$O(1/\sqrt{T})$ の速度で一次定常点に収束することを証明する（\textbf{定理 1}）。第二に、成分ごとに分解した最適化と同時最適化との間の準最適性ギャップを上から抑える \textit{報酬分解定理} を導出し、報酬分解が有益となる条件を正確に特徴付ける（\textbf{定理 2}）。第三に、Visual-ARFT で観測された強い分布外転移を説明する、ツール拡張ポリシーに対する PAC-Bayes 汎化境界を確立する（\textbf{定理 3}）。

Original Content

arXiv:2604.19857v1 Announce Type: new Abstract: Reinforcement fine-tuning with verifiable rewards (RLVR) has emerged as a powerful paradigm for equipping large vision-language models (LVLMs) with agentic capabilities such as tool use and multi-step reasoning. Despite striking empirical successes, most notably Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT), the theoretical underpinnings of this paradigm remain poorly understood. In particular, two critical questions lack rigorous answers: (i)~how does the composite structure of verifiable rewards (format compliance, answer accuracy, tool executability) affect the convergence of Group Relative Policy Optimization (GRPO), and (ii)~why does training on a small set of tool-augmented tasks transfer to out-of-distribution domains? We address these gaps by introducing the \emph{Tool-Augmented Markov Decision Process} (TA-MDP), a formal framework that models multimodal agentic decision-making with bounded-depth tool calls. Within this framework, we establish three main results. First, we prove that GRPO under composite verifiable rewards converges to a first-order stationary point at rate $O(1/\sqrt{T})$ with explicit dependence on the number of reward components and group size (\textbf{Theorem~1}). Second, we derive a \emph{Reward Decomposition Theorem} that bounds the sub-optimality gap between decomposed per-component optimization and joint optimization, providing a precise characterization of when reward decomposition is beneficial (\textbf{Theorem~2}). Third, we establish a PAC-Bayes generalization bound for tool-augmented policies that explains the strong out-of-distribution transfer observed in Visual-ARFT (\textbf{Theorem~3}).
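
To make the setting of question (i) concrete, here is a minimal Python sketch of a composite verifiable reward feeding GRPO's group-relative advantage. It is an illustration only: the abstract does not say how the three components (format compliance, answer accuracy, tool executability) are combined, so the weighted sum, the `WEIGHTS` values, the function names, and the sample group are all assumptions. The advantage follows the commonly used GRPO form, which standardizes each reward within its sampled group.

```python
from statistics import mean, pstdev

# Hypothetical component weights: the abstract does not specify how the
# three verifiable reward components are combined.
WEIGHTS = {"format": 1.0, "accuracy": 1.0, "tool": 1.0}


def composite_reward(components, weights=WEIGHTS):
    """Weighted sum of per-component verifiable rewards (assumed form)."""
    return sum(weights[name] * value for name, value in components.items())


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style advantage).

    Uses the population standard deviation; implementations differ on
    this detail, and eps guards against a zero-variance group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of G = 4 sampled responses to the same prompt (made-up values).
group = [
    {"format": 1.0, "accuracy": 1.0, "tool": 1.0},
    {"format": 1.0, "accuracy": 0.0, "tool": 1.0},
    {"format": 0.0, "accuracy": 0.0, "tool": 0.0},
    {"format": 1.0, "accuracy": 0.0, "tool": 0.0},
]
rewards = [composite_reward(c) for c in group]
advantages = group_relative_advantages(rewards)
```

In this sketch the number of reward components (three) and the group size (four) are the two quantities that the abstract says appear explicitly in the convergence rate of Theorem 1. The form of that dependence is not given in the abstract and is not modeled here.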