arxiv_cs_lg April 24, 2026

Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales

Translated: 2026/4/24 19:59:52

transformer-compression, machine-learning, neural-networks, model-scaling, quantization

Original Content

arXiv:2604.20682v1 Announce Type: new

Abstract: We present a systematic empirical study of transformer compression through over 40 experiments on GPT-2 (124M parameters) and Mistral 7B (7.24B parameters). Our analysis covers spectral compression, block-level function replacement, rotation-based quantization, activation geometry, and adaptive early exit. We identify five structural properties relevant to compression.

(1) Variance is not importance: high-variance activation directions are approximately 96 percent uncorrelated with predictive directions (measured via CCA), and projecting onto these subspaces preserves over 90 percent of variance while degrading perplexity.
(2) Block linearity is conditional: transformer blocks are approximately linear (R^2 ~ 0.95 on GPT-2, 0.93 on Mistral block 31) only under the correct upstream distribution; modifying earlier blocks induces distribution shift that degrades downstream approximations.
(3) The reconstruction wall: approaches that factor weights into quantized components amplify errors through cross-terms, making direct quantization strictly superior.
(4) Linearity increases with depth: Mistral 7B exhibits a progression from R^2 = 0.17 (block 0) to R^2 = 0.93 (block 31), indicating a division between nonlinear feature construction and linear refinement.
(5) Approximately 30 percent of tokens are computationally easy, confirmed via exit heads and KL divergence sensitivity.

We demonstrate that single-block linear replacement achieves 34x compression with a 1.71 perplexity increase on the final block of Mistral 7B, while multi-block replacement fails due to residual error accumulation and distribution shift. These findings suggest fundamental limits to static post-training compression and motivate adaptive, per-token computation as a more effective direction.
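
The first property, that variance is not importance, can be illustrated with a small synthetic sketch. This is not the paper's experiment: it uses made-up Gaussian data rather than transformer activations, and it measures the R^2 of a least-squares fit rather than CCA or perplexity. The function name, the dimensions, and the scale values are all hypothetical choices for illustration. The target depends only on one low-variance direction, so projecting onto the top-variance principal directions keeps most of the variance but discards nearly all of the predictive signal.

```python
import numpy as np

def variance_vs_importance_demo(n=5000, d=20, k=5, seed=0):
    """Toy illustration: a PCA projection can keep most variance yet lose the signal."""
    rng = np.random.default_rng(seed)

    # Hypothetical setup: k "loud" nuisance directions and d - k "quiet" ones.
    scales = np.ones(d)
    scales[:k] = 10.0
    x = rng.standard_normal((n, d)) * scales

    # The target depends only on one quiet (low-variance) direction.
    y = x[:, -1] + 0.1 * rng.standard_normal(n)

    # PCA via SVD of the centered data; rows of vt are principal directions.
    xc = x - x.mean(axis=0)
    _, s, vt = np.linalg.svd(xc, full_matrices=False)
    var_kept = float((s[:k] ** 2).sum() / (s ** 2).sum())

    # Coordinates of each sample in the top-k variance subspace.
    z = xc @ vt[:k].T

    def r2(features):
        # Least-squares fit of y from the features (plus intercept), then R^2.
        a = np.column_stack([features, np.ones(len(features))])
        coef, *_ = np.linalg.lstsq(a, y, rcond=None)
        resid = y - a @ coef
        return float(1.0 - resid.var() / y.var())

    return var_kept, r2(xc), r2(z)

if __name__ == "__main__":
    var_kept, r2_full, r2_proj = variance_vs_importance_demo()
    print(f"variance kept by top-k projection: {var_kept:.3f}")
    print(f"R^2 using all directions:          {r2_full:.3f}")
    print(f"R^2 using top-k directions only:   {r2_proj:.3f}")
```

With these settings the top-k projection should retain well over 90 percent of the variance while the fit from the projected features explains almost none of the target. The toy makes the effect extreme by construction; the abstract reports the real-model analogue as roughly 96 percent decorrelation between high-variance and predictive directions.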