arxiv_cs_lg February 10, 2026

Do We Need Adam? Surprisingly Strong and Sparse Reinforcement Learning with SGD in LLMs

Translated: 2026/3/15 14:47:33

reinforcement-learning, llm, sgd, optimization, parameter-efficiency

Original Content

arXiv:2602.07729v1 Announce Type: new

Abstract: Reinforcement learning (RL), particularly RL from verifiable reward (RLVR), has become a crucial phase of training large language models (LLMs) and a key focus of current scaling efforts. However, optimization practices in RL largely follow those of next-token prediction stages (e.g., pretraining and supervised fine-tuning), despite fundamental differences between RL and these stages highlighted by recent work. One such practice is the use of the AdamW optimizer, which is widely adopted for training large-scale transformers despite its high memory overhead. Our analysis shows that both momentum and adaptive learning rates in AdamW are less influential in RL than in SFT, leading us to hypothesize that RL benefits less from Adam-style per-parameter adaptive learning rates and momentum. Confirming this hypothesis, our experiments demonstrate that the substantially more memory-efficient SGD, which is known to perform poorly in supervised learning of large-scale transformers, matches or even outperforms AdamW in RL for LLMs. Remarkably, full fine-tuning with SGD updates fewer than 0.02% of model parameters without any sparsity-promoting regularization, more than 1000 times fewer than AdamW. Our analysis offers potential reasons for this update sparsity. These findings provide new insights into the optimization dynamics of RL in LLMs and show that RL can be substantially more parameter-efficient than previously recognized.
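
The abstract makes two quantitative points: AdamW carries extra optimizer state, and SGD fine-tuning changes only a tiny fraction of parameters. The sketch below illustrates how such quantities could be computed on toy data. It is not the authors' measurement code: the helper names `update_fraction` and `optimizer_state_floats` are hypothetical, and counting a parameter as "updated" when its value differs exactly between checkpoints is an assumption, since the abstract does not give the paper's criterion.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def update_fraction(before: Params, after: Params) -> float:
    """Fraction of scalar parameters whose value differs between checkpoints.

    Assumption: "updated" means exact inequality; the paper may use
    a different criterion.
    """
    total = 0
    changed = 0
    for name, old_values in before.items():
        new_values = after[name]
        if len(new_values) != len(old_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        total += len(old_values)
        changed += sum(1 for o, n in zip(old_values, new_values) if o != n)
    return changed / total if total else 0.0


def optimizer_state_floats(num_params: int, optimizer: str) -> int:
    """Extra values the optimizer stores in addition to the parameters.

    AdamW keeps first- and second-moment estimates (two per parameter),
    SGD with momentum keeps one buffer, plain SGD keeps none.
    """
    per_param = {"adamw": 2, "sgd_momentum": 1, "sgd": 0}
    return per_param[optimizer] * num_params


# Toy checkpoints: only one of eight values changes.
before = {
    "layer.weight": [0.5, -1.0, 2.0, 0.0],
    "layer.bias": [0.1, 0.2, 0.3, 0.4],
}
after = {
    "layer.weight": [0.5, -1.0, 2.0, 0.0],
    "layer.bias": [0.1, 0.25, 0.3, 0.4],
}

print(f"updated fraction: {update_fraction(before, after):.3f}")
print("AdamW state values for 8 params:", optimizer_state_floats(8, "adamw"))
print("SGD state values for 8 params:", optimizer_state_floats(8, "sgd"))
```

On real checkpoints the same comparison would run over tensors rather than Python lists, and numeric precision (for example bf16 rounding) would affect which parameters count as changed.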