arxiv_cs_lg 2026年2月10日

From $O(mn)$ to $O(r^2)$: Two-Sided Low-Rank Communication for Adam in Distributed Training with Memory Efficiency

Translated: 2026/3/15 14:50:11

distributed-trainingadam-optimizerlow-rank-communicationfoundation-modelsgradient-synchronization

Japanese Translation

arXiv:2602.08007v1 Announce Type: new Abstract: ファウンデーションモデルの規模拡大に伴い、事前学習ではデータ並列最適化への依存がますます高まり、帯域幅に制限された勾配同期が主要なボトルネックとなっています。また、射影ベースの低階最適化器は主にメモリ効率のために設計されてきましたが、通信制限された学習には依然として最適ではなく、一方通信では $m\times n$ マトリックスの勾配に対して $O(rn)$ のオブジェクトを送信し、リフレッシュステップが最大通信バイトを支配しています。我々は、TSR-Adam と名付けた手法を提案し、これは Adam 族の更新に二方低階通信を導入し、コンパクトな核 $U^\top G V\in\mathbb{R}^{r\times r}$ を同期させることで、主要な毎ステップのペイロードを $O(mn)$ から $O(r^2)$ に削減し、モーメント状態を低次元核に保持するものであります。さらに、子空間リフレッシュの最大通信を削減するため、TSR-Adam は勾配全体同期を避けるためのランダム化された SVD ベースのリフレッシュを採用しています。また、我々は埋め込み勾配にも埋め込み特有のランクとリフレッシュスケジュールを用いた低階通信を広範に拡張し、埋め込みを高密度に維持した場合よりも追加の通信とメモリ節約をもたらしました。 60M から 1B モデル規模までの事前学習において、TSR-Adam はステップごとの平均通信バイトを $13$ 倍削減し、GLUE 微調整では通信を $25$ 倍削減し、同様のパフォーマンスを発揮しました。我々はさらに提案された更新に対して理論的定常性解析も提供しました。コードは https://github.com/DKmiyan/TSR-Adam に利用可能です。

Original Content

arXiv:2602.08007v1 Announce Type: new Abstract: As foundation models continue to scale, pretraining increasingly relies on data-parallel distributed optimization, making bandwidth-limited gradient synchronization a key bottleneck. Orthogonally, projection-based low-rank optimizers were mainly designed for memory efficiency, but remain suboptimal for communication-limited training: one-sided synchronization still transmits an $O(rn)$ object for an $m\times n$ matrix gradient and refresh steps can dominate peak communicated bytes. We propose TSR, which brings two-sided low-rank communication to Adam-family updates (TSR-Adam) by synchronizing a compact core $U^\top G V\in\mathbb{R}^{r\times r}$, reducing the dominant per-step payload from $O(mn)$ to $O(r^2)$ while keeping moment states in low-dimensional cores. To further reduce the peak communication from subspace refresh, TSR-Adam adopts a randomized SVD-based refresh that avoids full-gradient synchronization. We additionally extend low-rank communication to embedding gradients with embedding-specific ranks and refresh schedules, yielding additional communication and memory savings over keeping embeddings dense. Across pretraining from 60M to 1B model scales, TSR-Adam reduces average communicated bytes per step by $13\times$, and on GLUE fine-tuning it reduces communication by $25\times$, while achieving comparable performance; we further provide a theoretical stationarity analysis for the proposed update. Code is available at https://github.com/DKmiyan/TSR-Adam.