arxiv_cs_lg February 10, 2026

The Optimal Token Baseline: Variance Reduction for Long-Horizon LLM-RL

Translated: 2026/3/15 13:03:39

llm-rl, variance-reduction, optimal-token-baseline, reinforcement-learning, gradient-optimization

Original Content

arXiv:2602.07078v1 (Announce Type: new)

Abstract: Reinforcement Learning (RL) for Large Language Models (LLMs) often suffers from training collapse in long-horizon tasks due to exploding gradient variance. To mitigate this, a baseline is commonly introduced for advantage computation; however, traditional value models remain difficult to optimize, and standard group-based baselines overlook sequence heterogeneity. Although classic optimal baseline theory can achieve global variance reduction, it neglects token heterogeneity and requires prohibitive gradient-based computation. In this work, we derive the Optimal Token Baseline (OTB) from first principles, proving that gradient updates should be weighted inversely to their cumulative gradient norm. To ensure efficiency, we propose the Logit-Gradient Proxy that approximates the gradient norm using only forward-pass probabilities. Our method achieves training stability and matches the performance of large group sizes ($N=32$) with only $N=4$, reducing token consumption by over 65% across single-turn and tool-integrated reasoning tasks.
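
The two ingredients named in the abstract, a gradient-norm-weighted baseline and a norm estimate built from forward-pass probabilities, can be illustrated with a small sketch. This is not the paper's OTB or its Logit-Gradient Proxy, whose exact definitions the abstract does not give. It assumes a softmax policy, for which the gradient of log p(token) with respect to the logits is onehot(token) - p, so its squared norm, 1 - 2 p_token + sum_j p_j^2, depends only on probabilities. Those norms are then used as weights in the classic optimal-baseline form b = sum_i w_i R_i / sum_i w_i over a group of sampled sequences. All function names are hypothetical, and summing per-token norms into a single sequence weight is a simplifying assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_grad_sq_norm(probs, token):
    """Squared norm of d log p[token] / d logits for a softmax policy.

    The gradient is onehot(token) - probs, so the squared norm equals
    1 - 2 * p[token] + sum_j p[j] ** 2. No backward pass is needed.
    """
    return 1.0 - 2.0 * probs[token] + sum(p * p for p in probs)


def sequence_proxy(logits_per_step, tokens):
    """Sum of per-token logit-gradient squared norms for one sequence.

    Assumption for illustration only: cross terms between tokens are ignored.
    """
    return sum(
        logit_grad_sq_norm(softmax(step_logits), tok)
        for step_logits, tok in zip(logits_per_step, tokens)
    )


def weighted_baseline(rewards, weights):
    """Classic optimal-baseline form: sum(w_i * R_i) / sum(w_i).

    Falls back to the plain group mean when all weights are zero.
    """
    total = sum(weights)
    if total == 0.0:
        return sum(rewards) / len(rewards)
    return sum(w * r for w, r in zip(weights, rewards)) / total


def advantages(rewards, weights):
    """Reward minus the weighted baseline for each sequence in the group."""
    b = weighted_baseline(rewards, weights)
    return [r - b for r in rewards]
```

With uniform probabilities over four tokens, each sampled token contributes 0.75 to the proxy, while a token sampled with probability 1 contributes 0. When every sequence in the group has the same weight, `weighted_baseline` reduces to the ordinary group-mean baseline.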