arxiv_cs_lg February 10, 2026

Sign-Based Optimizers Are Effective Under Heavy-Tailed Noise

Translated: 2026/3/15 14:06:50

machine-learning, optimization, convergence-rates, heavy-tailed-distributions, large-language-models

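Translation from Japanese

arXiv:2602.07425v1 Announce Type: new Abstract: Although adaptive gradient methods are the workhorse of modern machine learning, sign-based optimization algorithms such as Lion and Muon have empirically outperformed AdamW in training Large Language Models (LLMs). However, a theoretical understanding of why sign-based updates outperform variance-adapted methods is still lacking. This paper aims to bridge the gap between theory and practice through heavy-tailed gradient noise, a phenomenon frequently observed in language modeling tasks. On the theoretical side, we introduce a new generalized heavy-tailed noise condition that captures the behavior of LLMs more accurately than the standard finite-variance assumption. Under this noise model, we establish sharp convergence rates for SignSGD and Lion over generalized smooth function classes, matching or surpassing the best previously known bounds. Furthermore, we extend the analysis to Muon and Muonlight, providing what is, to our knowledge, the first rigorous analysis of matrix optimization under heavy-tailed stochasticity. These results give a strong theoretical justification for the empirical superiority of sign-based optimizers and show that these optimizers are naturally suited to handling the noisy gradients associated with heavy tails. On the empirical side, LLM pretraining experiments validate our theoretical insights and confirm that the proposed noise model agrees well with practice.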

Original Content

arXiv:2602.07425v1 Announce Type: new Abstract: While adaptive gradient methods are the workhorse of modern machine learning, sign-based optimization algorithms such as Lion and Muon have recently demonstrated superior empirical performance over AdamW in training large language models (LLMs). However, a theoretical understanding of why sign-based updates outperform variance-adapted methods remains elusive. In this paper, we aim to bridge the gap between theory and practice through the lens of heavy-tailed gradient noise, a phenomenon frequently observed in language modeling tasks. Theoretically, we introduce a novel generalized heavy-tailed noise condition that captures the behavior of LLMs more accurately than standard finite-variance assumptions. Under this noise model, we establish sharp convergence rates of SignSGD and Lion for generalized smooth function classes, matching or surpassing previous best-known bounds. Furthermore, we extend our analysis to Muon and Muonlight, providing what is, to our knowledge, the first rigorous analysis of matrix optimization under heavy-tailed stochasticity. These results offer a strong theoretical justification for the empirical superiority of sign-based optimizers, showcasing that they are naturally suited to handle the noisy gradients associated with heavy tails. Empirically, LLM pretraining experiments validate our theoretical insights and confirm that our proposed noise models are well-aligned with practice.
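
The intuition behind the abstract's claim can be illustrated with a small sketch. It is not the paper's setup or noise condition: the quadratic objective, the symmetrized Pareto noise with tail index 1.5 (zero mean, infinite variance), and all function names are assumptions chosen for illustration. It compares a plain SGD step with a SignSGD step, whose per-coordinate movement is capped at the learning rate regardless of how large a noisy gradient sample is.

```python
import random

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def heavy_tailed_noise(rng, alpha=1.5):
    # Symmetrized, shifted Pareto sample (illustrative assumption):
    # zero mean by symmetry, infinite variance because alpha < 2.
    magnitude = rng.paretovariate(alpha) - 1.0
    return magnitude if rng.random() < 0.5 else -magnitude

def noisy_grad(x, rng):
    # Toy objective f(x) = 0.5 * sum(x_i^2), whose exact gradient is x.
    return [xi + heavy_tailed_noise(rng) for xi in x]

def sgd_step(x, g, lr):
    # Plain SGD: the step scales with the (possibly huge) gradient sample.
    return [xi - lr * gi for xi, gi in zip(x, g)]

def sign_sgd_step(x, g, lr):
    # SignSGD: each coordinate moves by at most lr, whatever the magnitude.
    return [xi - lr * sign(gi) for xi, gi in zip(x, g)]

def largest_step(step_fn, steps=2000, lr=0.01, dim=10, seed=0):
    """Run step_fn on the toy problem and return the largest coordinate move."""
    rng = random.Random(seed)
    x = [1.0] * dim
    largest = 0.0
    for _ in range(steps):
        new_x = step_fn(x, noisy_grad(x, rng), lr)
        largest = max(largest, max(abs(a - b) for a, b in zip(new_x, x)))
        x = new_x
    return largest

if __name__ == "__main__":
    print("largest SGD move:    ", largest_step(sgd_step))
    print("largest SignSGD move:", largest_step(sign_sgd_step))
```

In this toy setting the SignSGD move never exceeds `lr`, while the SGD move is driven by the largest noise sample drawn. This only shows the bounded-update property; it says nothing about the convergence rates the paper establishes.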