arxiv_cs_ai April 24, 2026

SGD at the Edge of Stability: The Stochastic Sharpness Gap

Translated: 2026/4/24 20:22:25

stochastic-gradient-descent, deep-learning, neural-networks, optimization, eigenvalues

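Translated Abstract

arXiv:2604.21016v1 Announce Type: cross Abstract: When a neural network is trained with full-batch gradient descent (GD) and step size $\eta$, the largest eigenvalue of the Hessian, i.e. the sharpness $S(\boldsymbol{\theta})$, rises to $2/\eta$ and stabilizes at that value. This phenomenon is called the Edge of Stability (EoS). Damian et al. (2023) showed that this behavior is explained by a self-stabilization mechanism arising from the third-order structure of the loss function, and that GD implicitly follows projected gradient descent (PGD) on the constraint $S(\boldsymbol{\theta}) \leq 2/\eta$. For mini-batch SGD, the sharpness stabilizes below $2/\eta$, and the gap widens as the batch size shrinks; however, no theoretical explanation for this suppression exists yet. This paper introduces stochastic self-stabilization, which extends the self-stabilization framework to SGD. Our key insight is that gradient noise injects variance into the oscillatory dynamics along the top Hessian eigenvector, strengthening the cubic sharpness-reducing force and shifting the equilibrium below $2/\eta$. Following the approach of Damian et al. (2023), we define stochastic predicted dynamics relative to a moving projected gradient descent trajectory and prove a stochastic coupling theorem that bounds the deviation of SGD from these predictions. We derive a closed-form equilibrium sharpness gap, $\Delta S = \eta \beta \sigma_{\boldsymbol{u}}^{2}/(4\alpha)$, where $\alpha$ is the progressive sharpening rate, $\beta$ is the self-stabilization strength, and $\sigma_{\boldsymbol{u}}^{2}$ is the variance of the gradient noise projected onto the top eigenvector. This formula predicts that smaller batch sizes yield flatter solutions, and it recovers GD when the batch size equals the full dataset.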
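The $2/\eta$ threshold in the abstract can be illustrated on a one-dimensional quadratic, where the sharpness is simply the curvature. The sketch below is not from the paper; the function name `gd_on_quadratic` and all numeric values are illustrative assumptions. It only shows the standard fact that GD on $L(x) = \tfrac{1}{2} S x^2$ contracts when $S < 2/\eta$ and diverges when $S > 2/\eta$.

```python
def gd_on_quadratic(sharpness, eta, steps=50, x0=1.0):
    """Run GD on L(x) = 0.5 * sharpness * x**2 and return the final |x|.

    Hypothetical helper for illustration; not part of the paper.
    """
    x = x0
    for _ in range(steps):
        grad = sharpness * x      # dL/dx for the quadratic loss
        x = x - eta * grad        # equivalent to x <- (1 - eta * sharpness) * x
    return abs(x)


eta = 0.1                         # illustrative step size
threshold = 2.0 / eta             # stability threshold on the curvature

# Curvature slightly below the threshold: |1 - eta * S| < 1, iterates shrink.
below = gd_on_quadratic(0.9 * threshold, eta)
# Curvature slightly above the threshold: |1 - eta * S| > 1, iterates grow.
above = gd_on_quadratic(1.1 * threshold, eta)

print(f"S = 0.9 * 2/eta -> |x| after 50 steps: {below:.3e}")
print(f"S = 1.1 * 2/eta -> |x| after 50 steps: {above:.3e}")
```

A quadratic has no third-order structure, so this toy captures only the instability threshold, not the self-stabilization mechanism the abstract describes.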

Original Content

arXiv:2604.21016v1 Announce Type: cross Abstract: When training neural networks with full-batch gradient descent (GD) and step size $\eta$, the largest eigenvalue of the Hessian -- the sharpness $S(\boldsymbol{\theta})$ -- rises to $2/\eta$ and hovers there, a phenomenon termed the Edge of Stability (EoS). Damian et al. (2023) showed that this behavior is explained by a self-stabilization mechanism driven by third-order structure of the loss, and that GD implicitly follows projected gradient descent (PGD) on the constraint $S(\boldsymbol{\theta}) \leq 2/\eta$. For mini-batch stochastic gradient descent (SGD), the sharpness stabilizes below $2/\eta$, with the gap widening as the batch size decreases; yet no theoretical explanation exists for this suppression. We introduce stochastic self-stabilization, extending the self-stabilization framework to SGD. Our key insight is that gradient noise injects variance into the oscillatory dynamics along the top Hessian eigenvector, strengthening the cubic sharpness-reducing force and shifting the equilibrium below $2/\eta$. Following the approach of Damian et al. (2023), we define stochastic predicted dynamics relative to a moving projected gradient descent trajectory and prove a stochastic coupling theorem that bounds the deviation of SGD from these predictions. We derive a closed-form equilibrium sharpness gap: $\Delta S = \eta \beta \sigma_{\boldsymbol{u}}^{2}/(4\alpha)$, where $\alpha$ is the progressive sharpening rate, $\beta$ is the self-stabilization strength, and $\sigma_{\boldsymbol{u}}^{2}$ is the gradient noise variance projected onto the top eigenvector. This formula predicts that smaller batch sizes yield flatter solutions and recovers GD when the batch equals the full dataset.
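To make the closed-form gap concrete, the following sketch evaluates $\Delta S = \eta \beta \sigma_{\boldsymbol{u}}^{2}/(4\alpha)$ for a few batch sizes. Only the gap formula comes from the abstract. The dependence of $\sigma_{\boldsymbol{u}}^{2}$ on batch size is an assumption made here for illustration (the usual finite-population scaling for a mini-batch mean drawn without replacement), and the function names and numeric values are hypothetical.

```python
def sharpness_gap(eta, alpha, beta, sigma_u_sq):
    """Delta S = eta * beta * sigma_u^2 / (4 * alpha), as stated in the abstract."""
    return eta * beta * sigma_u_sq / (4.0 * alpha)


def projected_noise_variance(per_sample_var, batch_size, dataset_size):
    """ASSUMPTION (not from the abstract): variance of a mini-batch mean sampled
    without replacement, which vanishes when batch_size == dataset_size."""
    return per_sample_var * (1.0 / batch_size - 1.0 / dataset_size)


# Hypothetical values chosen only to show the trend.
eta, alpha, beta = 0.01, 1.0, 2.0
per_sample_var, dataset_size = 4.0, 1024

gaps = {}
for batch_size in (32, 256, 1024):
    var = projected_noise_variance(per_sample_var, batch_size, dataset_size)
    gaps[batch_size] = sharpness_gap(eta, alpha, beta, var)
    equilibrium = 2.0 / eta - gaps[batch_size]   # predicted equilibrium sharpness
    print(f"B={batch_size:5d}  gap={gaps[batch_size]:.6f}  sharpness={equilibrium:.6f}")
```

Under this assumed noise model the gap shrinks as the batch grows and is exactly zero at full batch, which matches the abstract's statement that the formula recovers GD in that limit.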