arxiv_cs_lg April 24, 2026

Bounded Ratio Reinforcement Learning

Translated: 2026/4/24 20:10:59

reinforcement-learning, policy-optimization, proximal-policy-optimization, trust-region-methods, llm-fine-tuning

Original Content

arXiv:2604.18578v2
Announce Type: replace

Abstract: Proximal Policy Optimization (PPO) has become the predominant algorithm for on-policy reinforcement learning due to its scalability and empirical robustness across domains. However, there is a significant disconnect between the underlying foundations of trust region methods and the heuristic clipped objective used in PPO. In this paper, we bridge this gap by introducing the Bounded Ratio Reinforcement Learning (BRRL) framework. We formulate a novel regularized and constrained policy optimization problem and derive its analytical optimal solution. We prove that this solution ensures monotonic performance improvement. To handle parameterized policy classes, we develop a policy optimization algorithm called Bounded Policy Optimization (BPO) that minimizes an advantage-weighted divergence between the policy and the analytic optimal solution from BRRL. We further establish a lower bound on the expected performance of the resulting policy in terms of the BPO loss function. Notably, our framework also provides a new theoretical lens to interpret the success of the PPO loss, and connects trust region policy optimization and the Cross-Entropy Method (CEM). We additionally extend BPO to Group-relative BPO (GBPO) for LLM fine-tuning. Empirical evaluations of BPO across MuJoCo, Atari, and complex IsaacLab environments (e.g., Humanoid locomotion), and of GBPO for LLM fine-tuning tasks, demonstrate that BPO and GBPO generally match or outperform PPO and GRPO in stability and final performance.
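
For readers less familiar with the two baselines the abstract compares against, the sketch below illustrates the standard PPO clipped surrogate and a group-relative advantage normalization in the style of GRPO. It is not the BPO or GBPO loss, which the abstract does not spell out. The function names, the default clip_eps of 0.2, and the use of the population standard deviation are assumptions made for illustration only.

```python
import math
from statistics import mean, pstdev


def ppo_clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Batch mean of the standard PPO clipped surrogate (a quantity to maximize).

    logp_new / logp_old: per-sample log-probabilities of the taken actions
    under the current and the data-collecting policy. clip_eps is an
    assumed default, not a value taken from the paper.
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # probability ratio pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Pessimistic choice between the unclipped and the clipped term.
        terms.append(min(ratio * adv, clipped * adv))
    return mean(terms)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one group of samples.

    Uses the population standard deviation; some implementations use the
    sample standard deviation instead (assumption for this sketch).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical group of four responses to one prompt, with scalar rewards.
    advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
    print(ppo_clipped_surrogate([-1.0, -1.2, -0.9, -1.1],
                                [-1.1, -1.1, -1.1, -1.1], advs))
```

With a positive advantage the ratio stops contributing once it exceeds 1 + clip_eps, and with a negative advantage once it falls below 1 - clip_eps. This is the heuristic clipping whose theoretical footing the abstract sets out to clarify.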