arxiv_cs_ai February 10, 2026

iGRPO: Self-Feedback-Driven LLM Reasoning

Translated: 2026/3/7 11:22:33

ai, large-language-models, self-feedback, reinforcement-learning

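Translated Abstract

Large language models (LLMs) have shown promise on complex mathematical problems, but they still cannot reliably produce accurate and consistent solutions. Reinforcement learning (RL) addresses this shortcoming by aligning the models with task-specific rewards. Group Relative Policy Optimization (GRPO) is a value-function-free alternative to Proximal Policy Optimization (PPO) that normalizes rewards relative to a group of samples. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning on model-generated drafts. In Stage 1, iGRPO samples several exploratory drafts and picks the highest-reward one, using the same scalar reward signal that drives optimization. In Stage 2, it appends that draft to the original prompt and applies a GRPO-style update to the draft-conditioned refinements, so the policy learns to improve on its own best previous attempt. With matched rollout budgets, iGRPO consistently outperforms GRPO across base models (e.g., Nemotron-H-8B-Base-8K and DeepSeek-R1 Distilled). Applying iGRPO to OpenReasoning-Nemotron-7B trained on AceReason-Math yields new state-of-the-art results of 85.62% on AIME24 and 79.64% on AIME25. Ablations show that the refinement wrapper also works with other GRPO variants, benefits from a generative judge, and changes the learning dynamics by delaying entropy collapse. These results point to the potential of iterative, self-feedback-based RL for advancing verifiable mathematical reasoning.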

Original Content

arXiv:2602.09000v1
Announce Type: new

Abstract: Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. Reinforcement Learning (RL) is a framework for aligning these models with task-specific rewards, improving overall quality and reliability. Group Relative Policy Optimization (GRPO) is an efficient, value-function-free alternative to Proximal Policy Optimization (PPO) that leverages group-relative reward normalization. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. In Stage 1, iGRPO samples multiple exploratory drafts and selects the highest-reward draft using the same scalar reward signal used for optimization. In Stage 2, it appends this best draft to the original prompt and applies a GRPO-style update on draft-conditioned refinements, training the policy to improve beyond its strongest prior attempt. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models (e.g., Nemotron-H-8B-Base-8K and DeepSeek-R1 Distilled), validating its effectiveness on diverse reasoning benchmarks. Moreover, applying iGRPO to OpenReasoning-Nemotron-7B trained on AceReason-Math achieves new state-of-the-art results of 85.62% and 79.64% on AIME24 and AIME25, respectively. Ablations further show that the refinement wrapper generalizes beyond GRPO variants, benefits from a generative judge, and alters learning dynamics by delaying entropy collapse. These results underscore the potential of iterative, self-feedback-based RL for advancing verifiable mathematical reasoning.
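
The two stages described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The function names (`group_relative_advantages`, `igrpo_step`), the `sample` and `reward` callables, the prompt template used to append the draft, and the choice to score refinements against the original prompt are all assumptions made for the example. The mean/standard-deviation normalization is one common reading of "group-relative reward normalization", and the policy-gradient update that would consume the advantages is omitted.

```python
import statistics
from typing import Callable, List, Tuple


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps).

    Assumption: mean/std normalization, with eps guarding against a
    zero standard deviation when all rewards in the group are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def igrpo_step(
    prompt: str,
    sample: Callable[[str], str],          # hypothetical: draws one completion from the policy
    reward: Callable[[str, str], float],   # hypothetical: scalar reward for (prompt, completion)
    num_drafts: int = 4,
    num_refinements: int = 4,
) -> Tuple[str, List[str], List[float]]:
    # Stage 1: sample exploratory drafts and keep the highest-reward one,
    # scored with the same scalar reward that is used for optimization.
    drafts = [sample(prompt) for _ in range(num_drafts)]
    best_draft = max(drafts, key=lambda d: reward(prompt, d))

    # Stage 2: append the best draft to the original prompt (the template
    # below is illustrative) and sample draft-conditioned refinements.
    conditioned_prompt = (
        f"{prompt}\n\nPrevious best attempt:\n{best_draft}\n\n"
        "Improve on this attempt."
    )
    refinements = [sample(conditioned_prompt) for _ in range(num_refinements)]

    # Score refinements against the original task and compute the
    # group-relative advantages a GRPO-style update would use.
    rewards = [reward(prompt, r) for r in refinements]
    advantages = group_relative_advantages(rewards)
    return best_draft, refinements, advantages
```

In a real training loop, `sample` would draw completions from the current policy and the returned advantages would weight the log-probabilities of the refinement tokens. Splitting a fixed number of rollouts between `num_drafts` and `num_refinements` is one way to read the abstract's "matched rollout budgets", though the abstract does not specify the split.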