arxiv_cs_lg April 24, 2026

Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning

Translated: 2026/4/24 20:07:31

reinforcement-learning, large-language-models, rollout-sampling, policy-optimization, grpo

Abstract

arXiv:2504.13818v5 Announce Type: replace

Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models. However, it faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. To address this, we introduce PODS (Policy Optimization with Down-Sampling), which decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts, maintaining learning quality while dramatically reducing update costs. We propose a principled subset selection criterion, max-variance down-sampling, that maximizes reward diversity, and provide an efficient $O(n\log n)$ implementation. Empirically, Group Relative Policy Optimization (GRPO) with PODS achieves the peak test accuracy of vanilla GRPO at least $\mathbf{1.7\times}$ faster across the different reasoning benchmarks and hardware configurations we tested.
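
To make the max-variance down-sampling criterion concrete, here is a minimal Python sketch of one way to select m of n rollouts in $O(n\log n)$ time. It is not taken from the paper's code. It assumes that a variance-maximizing subset can always be formed from the k highest and m - k lowest rewards for some k, so that after sorting only m + 1 candidates need checking, each in constant time via prefix sums. The function name `max_variance_downsample` and its arguments are hypothetical.

```python
from itertools import accumulate


def max_variance_downsample(rewards, m):
    """Return indices of m rollouts whose rewards have maximal variance.

    Assumption (not stated in the abstract): an optimal subset consists of
    the k highest and (m - k) lowest rewards for some k in 0..m.
    """
    n = len(rewards)
    if not 0 < m <= n:
        raise ValueError("m must satisfy 0 < m <= n")

    # Sort rollout indices by reward, ascending: O(n log n).
    order = sorted(range(n), key=lambda i: rewards[i])
    r = [rewards[i] for i in order]

    # Prefix sums of rewards and squared rewards, with a leading 0.
    s1 = [0.0] + list(accumulate(r))
    s2 = [0.0] + list(accumulate(x * x for x in r))

    best_k, best_var = 0, float("-inf")
    for k in range(m + 1):  # k = number of rollouts taken from the high end
        low = m - k         # number of rollouts taken from the low end
        total = s1[low] + (s1[n] - s1[n - k])
        total_sq = s2[low] + (s2[n] - s2[n - k])
        var = total_sq / m - (total / m) ** 2  # population variance
        if var > best_var:
            best_k, best_var = k, var

    low = m - best_k
    return order[:low] + order[n - best_k:]


# Usage: keep 2 of 6 rollouts; the selection pairs a lowest with a highest reward.
rewards = [0.0, 0.0, 1.0, 1.0, 0.5, 0.5]
selected = max_variance_downsample(rewards, 2)
```

In a training loop, only the rollouts at the returned indices would be passed to the policy update, while all n rollouts are still generated. How advantages are normalized over the selected subset is outside the scope of this sketch.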