arxiv_cs_lg April 20, 2026

Self-Aligned Reward: Towards Effective and Efficient Reasoners

Translated: 2026/4/20 11:03:36

reinforcement-learning, large-language-models, self-aligned-reward, prompt-engineering, rlhf

Original Content

arXiv:2509.05489v2 Announce Type: replace

Abstract: Reinforcement learning with verifiable rewards has significantly advanced reasoning in large language models (LLMs), but such signals remain coarse, offering only binary correctness feedback. This limitation often results in inefficiencies, including overly verbose reasoning and high computational cost, while existing solutions often compromise accuracy. To address this, we introduce self-aligned reward (SAR), a self-guided signal that complements verifiable rewards to encourage both reasoning accuracy and efficiency. SAR is defined as the relative perplexity difference between an answer conditioned on the query and the standalone answer, thereby favoring responses that are concise and query-specific. Quantitative analysis reveals that SAR reliably distinguishes answer quality: concise, correct answers score higher than redundant ones, and partially correct answers score higher than entirely incorrect ones. Evaluation on 4 models across 7 benchmarks shows that integrating SAR with prevalent RL algorithms like PPO and GRPO improves accuracy by 4%, while reducing inference cost by 30%. Further analysis demonstrates that SAR achieves a Pareto-optimal trade-off between correctness and efficiency compared to reward signals based on length or self-confidence. We also show that SAR shortens responses while preserving advanced reasoning behaviors, demonstrating its ability to suppress unnecessary elaboration without losing critical reasoning. These results highlight the promise of self-aligned reward as a fine-grained complement to verifiable rewards, paving the way for more efficient and effective LLM training.
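The definition of SAR as a relative perplexity difference can be illustrated with a minimal Python sketch. The abstract gives no exact formula, so everything here is an assumption: normalizing by the standalone perplexity, clipping to [-1, 1], the function names `perplexity`, `self_aligned_reward`, and `combined_reward`, and the additive weight `alpha`. The per-token log-probabilities are assumed to come from scoring the same answer twice with the policy model, once with the query in context and once without.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of a sequence: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def self_aligned_reward(
    cond_logprobs: Sequence[float],
    standalone_logprobs: Sequence[float],
) -> float:
    """Relative perplexity difference between the standalone answer and the
    answer conditioned on the query.

    Assumption: normalize by the standalone perplexity and clip to [-1, 1].
    A positive value means the query makes the answer more predictable,
    i.e. the answer is specific to the query.
    """
    ppl_cond = perplexity(cond_logprobs)
    ppl_alone = perplexity(standalone_logprobs)
    relative_diff = (ppl_alone - ppl_cond) / ppl_alone
    return max(-1.0, min(1.0, relative_diff))


def combined_reward(is_correct: bool, sar: float, alpha: float = 0.2) -> float:
    """Hypothetical additive combination of a binary verifiable reward and SAR.
    The weight alpha is an illustrative assumption, not a value from the paper."""
    return (1.0 if is_correct else 0.0) + alpha * sar


# Example: the answer is far more predictable once the query is in context.
cond = [-0.1, -0.2, -0.3]        # log p(token | query, previous tokens)
alone = [-1.0, -1.2, -0.8]       # log p(token | previous tokens)
reward = self_aligned_reward(cond, alone)
print(round(reward, 4), round(combined_reward(True, reward), 4))
```

Under this reading, an answer padded with generic text that the model would produce regardless of the query gains little from conditioning, so its reward stays near zero. That matches the abstract's claim that SAR favors concise, query-specific responses.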