arxiv_cs_ai April 20, 2026

Targeted Exploration Using Unified Entropy Control for Reinforcement Learning

Targeted Exploration via Unified Entropy Control for Reinforcement Learning

Translated: 2026/4/20 11:16:48

reinforcement-learning, entropy-control, llm, vlm, exploration

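Translation (from Japanese)

arXiv:2604.14646v2 Announce Type: replace Abstract: Recent advances in reinforcement learning (RL) have improved the reasoning capabilities of large language models (LLMs) and vision-language models (VLMs). However, the widely used Group Relative Policy Optimization (GRPO) suffers from entropy collapse, which causes the policy to converge prematurely and lose diversity. Existing exploration methods introduce additional bias or variance during exploration, making it difficult to keep optimization stable. We propose a framework called Unified Entropy Control for Reinforcement Learning (UEC-RL), which provides targeted mechanisms for exploration and stabilization. UEC-RL activates more exploration on difficult prompts to search for potentially valuable reasoning paths. Meanwhile, a stabilizer prevents entropy from growing uncontrollably and keeps training stable while the model consolidates reliable behaviors. Together, these components expand the search space when needed while maintaining robust optimization throughout training. Experiments on LLM and VLM reasoning tasks show consistent gains over RL baselines on both Pass@1 and Pass@$k$. On Geometry3K, UEC-RL achieves a 37.9% relative improvement over GRPO, indicating that it sustains effective exploration without sacrificing convergence. The result also underscores UEC-RL as key to scaling RL-based reasoning in large models. Our code is available at https://github.com/597358816/UEC-RL.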
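
To make the two mechanisms named in the abstract more concrete, here is a minimal Python sketch. The abstract does not say how UEC-RL measures prompt difficulty or how its stabilizer works, so `exploration_weight`, `hard_threshold`, and `entropy_cap` are hypothetical names, and the gating rule is an illustrative assumption rather than the authors' method. `group_relative_advantages` follows the commonly described GRPO recipe of standardizing rewards within the group of rollouts sampled for one prompt; the use of the population standard deviation is also a choice made here.

```python
import math
from statistics import mean, pstdev


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the other rollouts of the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def exploration_weight(rewards, current_entropy, entropy_cap, hard_threshold=0.25):
    """Hypothetical gate, NOT taken from the paper.

    Returns 1.0 (explore more) only when the prompt looks hard, i.e. the mean
    reward of its rollouts is at or below `hard_threshold`, and the policy
    entropy is still below `entropy_cap`. Otherwise returns 0.0.
    """
    is_hard = mean(rewards) <= hard_threshold
    below_cap = current_entropy < entropy_cap
    return 1.0 if (is_hard and below_cap) else 0.0


if __name__ == "__main__":
    peaked = [0.97, 0.01, 0.01, 0.01]   # a nearly deterministic policy
    uniform = [0.25] * 4                # a maximally uncertain policy
    print("entropy (peaked): ", entropy(peaked))
    print("entropy (uniform):", entropy(uniform))
    print("advantages:", group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
    print("gate:", exploration_weight([0.0, 0.0, 0.0, 1.0], 0.5, entropy_cap=1.0))
```

The peaked distribution has far lower entropy than the uniform one, which is the kind of drop that "entropy collapse" refers to. The gate shows one way exploration could be switched on per prompt while a cap keeps entropy bounded.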

Original Content

arXiv:2604.14646v2 Announce Type: replace Abstract: Recent advances in reinforcement learning (RL) have improved the reasoning capabilities of large language models (LLMs) and vision-language models (VLMs). However, the widely used Group Relative Policy Optimization (GRPO) consistently suffers from entropy collapse, causing the policy to converge prematurely and lose diversity. Existing exploration methods introduce additional bias or variance during exploration, making it difficult to maintain optimization stability. We propose Unified Entropy Control for Reinforcement Learning (UEC-RL), a framework that provides targeted mechanisms for exploration and stabilization. UEC-RL activates more exploration on difficult prompts to search for potentially valuable reasoning trajectories. In parallel, a stabilizer prevents entropy from growing uncontrollably, thereby keeping training stable as the model consolidates reliable behaviors. Together, these components expand the search space when needed while maintaining robust optimization throughout training. Experiments on both LLM and VLM reasoning tasks show consistent gains over RL baselines on both Pass@1 and Pass@$k$. On Geometry3K, UEC-RL achieves a 37.9% relative improvement over GRPO, indicating that it sustains effective exploration without compromising convergence and underscoring UEC-RL as key to scaling RL-based reasoning in large models. Our code is available at https://github.com/597358816/UEC-RL.
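
The abstract reports results as Pass@1, Pass@$k$, and a relative improvement. As a reading aid, the sketch below computes the widely used unbiased pass@k estimate, 1 - C(n-c, k) / C(n, k), from n samples of which c are correct, and a relative improvement as (new - baseline) / baseline. Whether the paper uses this exact estimator is an assumption, and the numbers in the usage lines are illustrative only, not results from the paper.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: chance that at least one of k samples,
    drawn without replacement from n generated samples, is correct,
    given that c of the n samples are correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def relative_improvement(baseline, new):
    """Gain expressed as a fraction of the baseline score."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
    print(relative_improvement(baseline=40.0, new=50.0))
```

A relative improvement is a ratio to the baseline, so it is not the same as a gain in absolute percentage points: in the illustrative call above, going from 40.0 to 50.0 is a 10-point absolute gain but a 25% relative improvement.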