arxiv_cs_lg February 10, 2026

Grokking in Linear Models for Logistic Regression

Translated: 2026/3/15 16:05:02

grokking, linear-models, logistic-regression, machine-learning, generalization

Original Content

arXiv:2602.08302v1, Announce Type: new

Abstract: Grokking, the phenomenon of delayed generalization, is often attributed to the depth and compositional structure of deep neural networks. We study grokking in one of the simplest possible settings: the learning of a linear model with logistic loss for binary classification on data that are linearly (and max margin) separable about the origin. We investigate three testing regimes: (1) test data drawn from the same distribution as the training data, in which case grokking is not observed; (2) test data concentrated around the margin, in which case grokking is observed; and (3) adversarial test data generated via projected gradient descent (PGD) attacks, in which case grokking is also observed. We theoretically show that the implicit bias of gradient descent induces a three-phase learning process (population-dominated, support-vector-dominated unlearning, and support-vector-dominated generalization) during which delayed generalization can arise. Our analysis further relates the emergence of grokking to asymmetries in the data, both in the number of examples per class and in the distribution of support vectors across classes, and yields a characterization of the grokking time. We experimentally validate our theory by planting different distributions of population points and support vectors, and by analyzing accuracy curves and hyperplane dynamics. Overall, our results demonstrate that grokking does not require depth or representation learning, and can emerge even in linear models through the dynamics of the bias term.
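The three testing regimes can be illustrated with a small NumPy sketch. This is not the authors' code and is not claimed to reproduce their results. It only shows one way to train a linear model with a bias term by gradient descent on the logistic loss, and to log accuracy on (1) a same-distribution test set, (2) a test set concentrated near the margin, and (3) PGD-perturbed test points. Everything specific here is an assumption made for illustration: the 2-D data generator, the class imbalance (200 vs. 20), the L-infinity threat model, and the names `make_split`, `pgd_linf`, `train`, `eps`, and `lr`.

```python
import numpy as np

def make_split(rng, n_pos, n_neg, lo, hi):
    """Labels y = sign(x1); |x1| is uniform on [lo, hi], x2 is Gaussian noise."""
    y = np.concatenate([np.ones(n_pos), -np.ones(n_neg)])
    x1 = y * (lo + (hi - lo) * rng.random(y.size))
    x2 = rng.normal(size=y.size)
    return np.column_stack([x1, x2]), y

def sigmoid_neg(m):
    """Numerically stable 1 / (1 + exp(m))."""
    return np.exp(-np.logaddexp(0.0, m))

def accuracy(w, b, X, y):
    """Fraction of points whose predicted sign matches the label."""
    return float(np.mean(np.sign(X @ w + b) == y))

def pgd_linf(w, b, X, y, eps, steps=10):
    """PGD on the logistic loss inside an L-infinity ball of radius eps (assumed threat model)."""
    alpha = eps / 4.0                      # step size, chosen for illustration
    X_adv = X.copy()
    for _ in range(steps):
        m = y * (X_adv @ w + b)
        grad_x = -(y * sigmoid_neg(m))[:, None] * w[None, :]  # d loss / d x
        X_adv = X_adv + alpha * np.sign(grad_x)               # ascent step
        X_adv = np.clip(X_adv, X - eps, X + eps)              # projection onto the ball
    return X_adv

def train(X, y, X_test, y_test, X_margin, y_margin,
          eps=0.3, lr=0.1, steps=3000, log_every=100):
    """Full-batch gradient descent on the mean logistic loss, logging the three regimes."""
    w, b = np.zeros(X.shape[1]), 0.0
    keys = ("step", "loss", "bias", "test_acc", "margin_acc", "adv_acc")
    hist = {k: [] for k in keys}
    for t in range(steps + 1):
        m = y * (X @ w + b)                # per-sample margins y * (w.x + b)
        if t % log_every == 0:
            X_adv = pgd_linf(w, b, X_test, y_test, eps)
            hist["step"].append(t)
            hist["loss"].append(float(np.mean(np.logaddexp(0.0, -m))))
            hist["bias"].append(b)
            hist["test_acc"].append(accuracy(w, b, X_test, y_test))
            hist["margin_acc"].append(accuracy(w, b, X_margin, y_margin))
            hist["adv_acc"].append(accuracy(w, b, X_adv, y_test))
        g = -y * sigmoid_neg(m)            # d loss / d score for each sample
        w = w - lr * (X.T @ g) / y.size
        b = b - lr * float(np.mean(g))
    return w, b, hist

rng = np.random.default_rng(0)
# Imbalanced, linearly separable training set: every point has |x1| >= 1.
X_train, y_train = make_split(rng, 200, 20, 1.0, 4.0)
# Regime (1): same distribution as the training data.
X_test, y_test = make_split(rng, 200, 20, 1.0, 4.0)
# Regime (2): test points concentrated around the margin |x1| = 1 (assumed range).
X_margin, y_margin = make_split(rng, 100, 100, 0.8, 1.2)

w, b, hist = train(X_train, y_train, X_test, y_test, X_margin, y_margin)

for i in range(0, len(hist["step"]), 5):
    print(hist["step"][i], round(hist["bias"][i], 3), hist["test_acc"][i],
          hist["margin_acc"][i], hist["adv_acc"][i])
```

Plotting `hist["margin_acc"]` and `hist["bias"]` against `hist["step"]` is one way to look at the accuracy curves and bias dynamics the abstract refers to. Whether a delayed rise in accuracy appears in this toy configuration depends on the assumed data layout, which differs from the planted population and support-vector distributions used in the paper.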