arxiv_cs_lg April 24, 2026

Near-Future Policy Optimization: A New Approach That Accelerates RLVR Convergence by Learning from the Policy's Own Near-Future Self

Near-Future Policy Optimization

Translated: 2026/4/24 20:00:24

reinforcement-learning, policy-optimization, off-policy, arxiv, qwen-vl

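Translation

arXiv:2604.20733v1 Announce Type: new
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become established as a core recipe for post-training. Incorporating suitable off-policy trajectories into on-policy exploration has been shown to speed up RLVR convergence and raise the performance ceiling, but finding a source of such trajectories remains the biggest challenge. Existing mixed-policy methods either bring in trajectories from an external teacher (high quality but distributionally distant) or reuse past training trajectories (close in distribution but limited in quality). Neither satisfies at the same time the two conditions needed to maximize the effective learning signal $\mathcal{S} = Q/V$: "strong enough" (higher $Q$, more new knowledge) and "close enough" (lower $V$, more easily absorbed). We propose Near-Future Policy Optimization (NPO), a very simple mixed-policy scheme that learns from the policy's "near-future self", a later checkpoint from the same training run. Such a checkpoint is stronger than the current policy and closer to it than any external source, which gives the desired balance between the two conditions. We validate two interventions, early-stage bootstrapping and late-stage plateau breakthrough, and further propose AutoNPO, an adaptive variant that automatically triggers interventions from online training signals and selects the guide checkpoint that maximizes $\mathcal{S}$. With Qwen3-VL-8B-Instruct trained using GRPO, NPO improves average performance from 57.88 to 62.84 and AutoNPO raises it to 63.15, lifting the final performance ceiling while accelerating convergence.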

Original Content

arXiv:2604.20733v1 Announce Type: new
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe. Introducing suitable off-policy trajectories into on-policy exploration accelerates RLVR convergence and raises the performance ceiling, yet finding a source of such trajectories remains the key challenge. Existing mixed-policy methods either import trajectories from external teachers (high-quality but distributionally far) or replay past training trajectories (close but capped in quality), and neither simultaneously satisfies the "strong enough" (higher $Q$, more new knowledge to learn) and "close enough" (lower $V$, more readily absorbed) conditions required to maximize the effective learning signal $\mathcal{S} = Q/V$. We propose Near-Future Policy Optimization (NPO), a simple mixed-policy scheme that learns from a policy's own near-future self: a later checkpoint from the same training run is a natural source of auxiliary trajectories that is both stronger than the current policy and closer than any external source, directly balancing trajectory quality against variance cost. We validate NPO through two manual interventions, early-stage bootstrapping and late-stage plateau breakthrough, and further propose AutoNPO, an adaptive variant that automatically triggers interventions from online training signals and selects the guide checkpoint that maximizes $\mathcal{S}$. On Qwen3-VL-8B-Instruct with GRPO, NPO improves average performance from 57.88 to 62.84, and AutoNPO pushes it to 63.15, raising the final performance ceiling while accelerating convergence.
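
The selection rule in the abstract (pick the guide checkpoint that maximizes the effective learning signal S = Q/V) can be illustrated with a minimal sketch. The abstract does not say how Q and V are estimated, so the `Candidate` class, its fields, and the numbers below are hypothetical placeholders, not the authors' implementation; only the ratio S = Q/V and the argmax over later checkpoints come from the text.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """A later checkpoint from the same training run (hypothetical structure)."""
    step: int        # training step at which the checkpoint was saved
    quality: float   # assumed estimate of Q: how much stronger than the current policy
    variance: float  # assumed estimate of V: cost of being off-distribution, must be > 0


def learning_signal(c: Candidate) -> float:
    """Effective learning signal S = Q / V, as defined in the abstract."""
    if c.variance <= 0:
        raise ValueError("V must be positive")
    return c.quality / c.variance


def select_guide(candidates: Iterable[Candidate]) -> Optional[Candidate]:
    """Return the candidate with the largest S, or None if there are no candidates."""
    return max(candidates, key=learning_signal, default=None)


# Illustrative numbers only: a nearby checkpoint is cheap to absorb but adds
# little, a distant one is strong but costly, and a middle one wins on Q/V.
candidates = [
    Candidate(step=200, quality=0.25, variance=0.125),  # S = 2.0
    Candidate(step=400, quality=0.75, variance=0.25),   # S = 3.0
    Candidate(step=800, quality=0.75, variance=0.5),    # S = 1.5
]
guide = select_guide(candidates)
print(guide.step)
```

With these made-up values the middle checkpoint is selected, which mirrors the trade-off the abstract describes: the guide should be stronger than the current policy without drifting so far that its trajectories become hard to absorb.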