arxiv_cs_lg 2026/2/10

SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm

Translated: 2026/3/15 15:01:52

transformer, pre-norm, post-norm, optimization-stability, deep-learning

Original Content

arXiv:2602.08064v1
Announce Type: new

Abstract: Modern Transformers predominantly adopt the Pre-Norm paradigm for its optimization stability, forgoing the superior potential of the unstable Post-Norm architecture. Prior attempts to combine their strengths typically lead to a stability-performance trade-off. We attribute this phenomenon to a structural incompatibility within a single-stream design: any application of the Post-Norm operation inevitably obstructs the clean identity gradient preserved by Pre-Norm. To fundamentally reconcile these paradigms, we propose SiameseNorm, a two-stream architecture that couples Pre-Norm-like and Post-Norm-like streams with shared parameters. This design decouples the optimization dynamics of the two streams and retains the distinct characteristics of both Pre-Norm and Post-Norm: all residual blocks receive combined gradients inherited from both paradigms, with one stream securing stability while the other enhances expressivity. Extensive pre-training experiments on 1.3B-parameter models demonstrate that SiameseNorm exhibits exceptional optimization robustness and consistently outperforms strong baselines. Code is available at https://github.com/Qwen-Applications/SiameseNorm.
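
To make the contrast in the abstract concrete, here is a minimal pure-Python sketch. The Pre-Norm update x + F(LN(x)) and the Post-Norm update LN(x + F(x)) are the standard definitions. The two-stream step is only an illustration of the idea of two streams sharing one sublayer: the abstract does not say how the streams are coupled, so the function `two_stream_block`, the way the sublayer input is formed, and the toy `sublayer` are all assumptions, not the authors' implementation (see the linked repository for that).

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance.
    Learned gain and bias are omitted to keep the sketch short."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def sublayer(x, w):
    """Toy stand-in for an attention or MLP sublayer (hypothetical):
    scale each element by a single weight w, then apply tanh."""
    return [math.tanh(w * v) for v in x]


def pre_norm_block(x, w):
    # Pre-Norm: x + F(LN(x)). The residual (identity) path is never normalized.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x), w))]


def post_norm_block(x, w):
    # Post-Norm: LN(x + F(x)). Normalization sits on the residual path itself.
    return layer_norm([a + b for a, b in zip(x, sublayer(x, w))])


def two_stream_block(pre, post, w):
    """Hypothetical two-stream step with one shared sublayer weight w.

    ASSUMPTION: the sublayer input is LN(pre) + post, and its output is
    added to both streams. The abstract does not specify this coupling.
    """
    h = [a + b for a, b in zip(layer_norm(pre), post)]
    y = sublayer(h, w)  # computed once, shared by both streams
    new_pre = [a + b for a, b in zip(pre, y)]               # Pre-Norm-like: plain residual add
    new_post = layer_norm([a + b for a, b in zip(post, y)])  # Post-Norm-like: normalize after add
    return new_pre, new_post


# Usage: run three toy blocks over a 4-dimensional input.
x = [0.5, -1.0, 2.0, 0.0]
pre, post = list(x), layer_norm(x)
for w in [0.3, 0.7, 1.1]:
    pre, post = two_stream_block(pre, post, w)
```

In this sketch the Pre-Norm-like stream accumulates sublayer outputs without ever being rescaled, while the Post-Norm-like stream is renormalized after every block. That mirrors the division of labor the abstract describes, with the caveat that the real coupling may differ.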