arxiv_cs_lg 2026年2月10日

非反復型マルチパス神経ネットワークにおける超パラメータ移転の法則

Hyperparameter Transfer Laws for Non-Recurrent Multi-Path Neural Networks

Translated: 2026/3/15 14:07:50

hyperparameter-transferneural-networksdepth-scalingtraining-dynamicseffective-depth

Japanese Translation

arXiv:2602.07494v1 Announce Type: new 要約: 深く設計された現代のアーキテクチャの訓練は高コストであり、高価な繰り返し調整に比して超パラメータの移転の方が望ましい。最大更新パラメトリゼーション（$ ext{maximal update parametrization, } \mu\text{P}$）は、多くの超パラメータが幅（width）の間で移転される理由を説明するのに役立ちます。しかし、現代のアーキテクチャにおいて、計算グラフに複数の並列パスとリズナル（residual）集合が含まれているため、深さのスケーリングはまだ十分に理解されていません。CNN、ResNets、Transformers などの各種非反復型マルチパス神経ネットワークを統一するために、私たちはグラフに基づく「有効な深さ」という概念を導入しました。安定化初期化と最大更新の基準下において、最適な学習率の有効な深さに従って、普遍的な $-3/2$ 乗の法則に従って減少すると示しました。ここで、最大更新の基準は、初期化における典型的な 1 ステップの表現変化を最大限にしつつ不安定性を起こさないよう、有効な深さは入力から出力への最短パス長で、層とリズナルの追加をカウントします。多岐にわたるアーキテクチャでの実験は、予測された傾きを確認し、深さと幅の間に学習率のゼロショット移転を可能にし、深さのスケーリングを予測可能な超パラメータ移転問題に変えました。

Original Content

arXiv:2602.07494v1 Announce Type: new Abstract: Deeper modern architectures are costly to train, making hyperparameter transfer preferable to expensive repeated tuning. Maximal Update Parametrization ($\mu$P) helps explain why many hyperparameters transfer across width. Yet depth scaling is less understood for modern architectures, whose computation graphs contain multiple parallel paths and residual aggregation. To unify various non-recurrent multi-path neural networks such as CNNs, ResNets, and Transformers, we introduce a graph-based notion of effective depth. Under stabilizing initializations and a maximal-update criterion, we show that the optimal learning rate decays with effective depth following a universal -3/2 power law. Here, the maximal-update criterion maximizes the typical one-step representation change at initialization without causing instability, and effective depth is the minimal path length from input to output, counting layers and residual additions. Experiments across diverse architectures confirm the predicted slope and enable reliable zero-shot transfer of learning rates across depths and widths, turning depth scaling into a predictable hyperparameter-transfer problem.