arxiv_cs_lg April 24, 2026

FlashNorm: Fast Normalization for Transformers

Translated: 2026/4/24 20:06:57

flashnorm transformer llm rmsnorm optimization

Original Content

arXiv:2407.09577v4 Announce Type: replace Abstract: Normalization layers are ubiquitous in large language models (LLMs) yet represent a compute bottleneck: on hardware with distinct vector and matrix execution units, the RMS calculation blocks the subsequent matrix multiplication, preventing parallel execution. We present FlashNorm, an exact reformulation of RMSNorm followed by a linear layer that (i) eliminates the normalization weights by folding them into the subsequent linear layer, and (ii) defers the scalar RMS normalization to the output of the matrix multiplication, enabling the two operations to execute in parallel. FlashNorm is mathematically identical to the original computation: it introduces no approximation and requires no retraining. The same technique extends to LayerNorm, Dynamic Tanh (DyT), feed-forward networks with GLU variants, and RoPE-based attention. On an NVIDIA T4 GPU, FlashNorm achieves 33% to 35% lower latency on the norm-then-project operation in the compute-bound (prefill) regime at SmolLM2-135M scale, and 12% to 14% lower latency at Llama-7B scale. We verify zero-loss weight folding on SmolLM2-135M, Llama-3.2-1B, and Llama-3.1-8B. Beyond inference speed, FlashNorm simplifies model implementations by reducing the parameter tensor count, analogous to the simplification PaLM achieved by removing bias parameters from all linear layers. Watch our explainer video at https://youtu.be/GEuJv34_XgU?si and see https://github.com/OpenMachine-ai/transformer-tricks for code.
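
The two steps named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation (see the repository above for that). It assumes a bias-free linear layer and the common RMSNorm definition sqrt(mean(x^2) + eps), and the function names (rmsnorm_then_linear, fold_norm_weights, flashnorm_linear) are hypothetical, chosen only for this example. The sketch relies on the identity ((x / rms) * g) W = (x (diag(g) W)) / rms: the weights g move into the matrix once, offline, and the scalar 1/rms is applied after the matrix product, so the product no longer has to wait for the RMS value.

```python
import math
import random

def rms(x, eps=0.0):
    """Root mean square of a vector, with the usual epsilon inside the sqrt."""
    return math.sqrt(sum(v * v for v in x) / len(x) + eps)

def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def rmsnorm_then_linear(x, g, W, eps=0.0):
    """Baseline: RMSNorm with weights g, then a bias-free linear layer W."""
    r = rms(x, eps)
    normed = [x[i] / r * g[i] for i in range(len(x))]
    return matvec(normed, W)

def fold_norm_weights(g, W):
    """Step (i), done once offline: W_star[i][j] = g[i] * W[i][j]."""
    return [[g[i] * w for w in W[i]] for i in range(len(W))]

def flashnorm_linear(x, W_star, eps=0.0):
    """Step (ii): multiply the raw input first, then scale the output by 1/rms."""
    y = matvec(x, W_star)   # does not depend on the RMS value
    r = rms(x, eps)         # could be computed in parallel with the product
    return [v / r for v in y]

# Illustrative sizes and random values (assumptions, not from the paper).
random.seed(0)
n, m = 16, 8
x = [random.gauss(0, 1) for _ in range(n)]
g = [random.gauss(0, 1) for _ in range(n)]
W = [[random.gauss(0, 1) for _ in range(m)] for _ in range(n)]

W_star = fold_norm_weights(g, W)
ref = rmsnorm_then_linear(x, g, W, eps=1e-6)
out = flashnorm_linear(x, W_star, eps=1e-6)
max_abs_diff = max(abs(a - b) for a, b in zip(ref, out))
print(max_abs_diff < 1e-9)
```

The two paths agree up to floating-point rounding, since the operations are reordered but algebraically unchanged. After folding, the separate weight vector g no longer needs to be stored, which is the reduction in parameter tensor count that the abstract mentions.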