arxiv_cs_lg April 20, 2026

Rethinking Entropy Regularization: An Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning

Revisiting Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning

Translated: 2026/4/20 11:04:11

entropy-regularization llm reinforcement-learning rlvr policy-collapse

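Translation (from Japanese)

arXiv:2510.10959v3 Announce Type: replace

Abstract: Reasoning ability has become a defining capability of large language models (LLMs), and reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for strengthening it. However, RLVR training often runs into policy entropy collapse, in which the policy becomes overly deterministic, exploration is suppressed, and reasoning performance is limited. Entropy regularization is a common remedy, but its effectiveness depends heavily on the fixed coefficient, so it is unstable across tasks and models. In this work, we revisit entropy regularization in RLVR and argue that its potential has been largely underestimated. Our analysis shows that (i) tasks of differing difficulty require different exploration intensities, and (ii) balanced exploration may require keeping the policy entropy within a moderate range below its initial level. We therefore propose Adaptive Entropy Regularization (AER), which dynamically balances exploration and exploitation through three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment. In experiments on multiple mathematical reasoning benchmarks, AER consistently outperforms the baselines, improving both reasoning accuracy and exploration capability.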

Original Content

arXiv:2510.10959v3 Announce Type: replace

Abstract: Reasoning ability has become a defining capability of Large Language Models (LLMs), with Reinforcement Learning with Verifiable Rewards (RLVR) emerging as a key paradigm to enhance it. However, RLVR training often suffers from policy entropy collapse, where the policy becomes overly deterministic, hindering exploration and limiting reasoning performance. While entropy regularization is a common remedy, its effectiveness is highly sensitive to the fixed coefficient, making it unstable across tasks and models. In this work, we revisit entropy regularization in RLVR and argue that its potential has been largely underestimated. Our analysis shows that (i) tasks of varying difficulty demand distinct exploration intensities, and (ii) balanced exploration may require the policy entropy to be maintained within a moderate range below its initial level. Therefore, we propose Adaptive Entropy Regularization (AER)--a framework that dynamically balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment. Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.
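
The abstract names three components but gives no formulas, so the following is only a minimal sketch of how such a scheme might be wired together, not the authors' implementation. Every function name, the target ratio of 0.5, the step size, the clipping bounds, and the rule "lower pass rate means larger exploration weight" are assumptions made for illustration.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def target_entropy(initial_entropy, ratio=0.5):
    """Initial-anchored target: a fraction of the entropy measured at the
    start of training. The ratio 0.5 is an assumed value, not from the paper."""
    return ratio * initial_entropy


def update_global_coef(coef, current_entropy, target, step=0.01, lo=0.0, hi=1.0):
    """Dynamic global adjustment (assumed rule): raise the coefficient when
    entropy is below the target, lower it when above, then clip to [lo, hi]."""
    if current_entropy < target:
        coef += step
    elif current_entropy > target:
        coef -= step
    return min(max(coef, lo), hi)


def difficulty_weight(pass_rate):
    """Difficulty-aware allocation (assumed rule): prompts with a lower
    empirical pass rate receive a larger share of the entropy bonus."""
    return 1.0 - pass_rate


def regularized_loss(policy_loss, entropy, coef, pass_rate):
    """Standard entropy-regularized loss: subtract a weighted entropy bonus."""
    return policy_loss - coef * difficulty_weight(pass_rate) * entropy
```

In a training loop one would measure the mean policy entropy once before the first update, derive the target from it, and call `update_global_coef` after each batch. A fixed-coefficient baseline corresponds to skipping that update and using a constant weight for every prompt.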