arxiv_cs_ai February 10, 2026

TernaryLM: A Memory-Efficient Language Model Using Native 1-Bit Quantization

TernaryLM: Memory-Efficient Language Modeling via Native 1-Bit Quantization with Adaptive Layer-wise Scaling

Translated: 2026/3/7 12:31:42

tensornetwork, transformer-networks, quantized-models, language-modeling, neural-network-analysis

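Translation (from Japanese)

Large language models (LLMs) achieve remarkable performance but require substantial computational resources, which limits their deployment on edge devices and in resource-constrained environments. We present TernaryLM, a new transformer architecture with about 132 million parameters that uses native 1-bit ternary quantization (−1, 0, +1) during training to reduce memory use substantially. Unlike methods that quantize an already trained full-precision model after the fact, TernaryLM learns quantization-aware representations from scratch using adaptive per-layer scaling factors. Our experimental results include: (1) a validation perplexity of 58.42 on TinyStories; (2) downstream transfer reaching an F1 score of 82.47 percent on MRPC paraphrase detection; (3) a 2.4x memory reduction (498MB versus 1197MB) with comparable inference latency; and (4) stable training dynamics across diverse corpora. We provide a layer-wise quantization analysis whose results offer insight into future non-uniform precision strategies. Our results suggest that native 1-bit training is a promising direction for efficient neural language models. Code is available at <https://github.com/1nisharg/TernaryLM-Memory-Efficient-Language-Modeling/>.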

Original Content

arXiv:2602.07374v1 Announce Type: cross Abstract: Large language models (LLMs) achieve remarkable performance but demand substantial computational resources, limiting deployment on edge devices and resource-constrained environments. We present TernaryLM, a 132M parameter transformer architecture that employs native 1-bit ternary quantization {-1, 0, +1} during training, achieving significant memory reduction without sacrificing language modeling capability. Unlike post-training quantization approaches that quantize pre-trained full-precision models, TernaryLM learns quantization-aware representations from scratch using straight-through estimators and adaptive per-layer scaling factors. Our experiments demonstrate: (1) validation perplexity of 58.42 on TinyStories; (2) downstream transfer with 82.47 percent F1 on MRPC paraphrase detection; (3) 2.4x memory reduction (498MB vs 1197MB) with comparable inference latency; and (4) stable training dynamics across diverse corpora. We provide layer-wise quantization analysis showing that middle transformer layers exhibit highest compatibility with extreme quantization, informing future non-uniform precision strategies. Our results suggest that native 1-bit training is a promising direction for efficient neural language models. Code is available at https://github.com/1nisharg/TernaryLM-Memory-Efficient-Language-Modeling.
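
The abstract names three ingredients: ternary weights in {-1, 0, +1}, a scaling factor per layer, and straight-through estimators for training. The sketch below shows one way these pieces can fit together in plain Python. It is not the authors' implementation. The mean-absolute-value scale, the function names `ternarize`, `forward`, and `ste_step`, and the toy squared-error objective are all assumptions made for illustration.

```python
def ternarize(weights, eps=1e-8):
    """Map one layer's weights to codes in {-1, 0, +1} plus a single scale.

    Assumption: the per-layer scale is the mean absolute weight. The
    abstract does not say how the adaptive scaling factors are computed.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def forward(latent_weights, inputs):
    """Dot product evaluated with the quantized weights (code * scale)."""
    codes, scale = ternarize(latent_weights)
    return sum(c * scale * x for c, x in zip(codes, inputs))


def ste_step(latent_weights, inputs, target, lr=0.1):
    """One SGD step on squared error using a straight-through estimator.

    The gradient is taken with respect to the quantized weights and applied
    unchanged to the full-precision latent weights, as if rounding were the
    identity. For simplicity the scale is treated as a constant here.
    """
    error = forward(latent_weights, inputs) - target
    grads = [2.0 * error * x for x in inputs]  # dLoss / d(quantized weight)
    return [w - lr * g for w, g in zip(latent_weights, grads)]


# Toy usage: quantize four weights, then take one training step.
weights = [0.9, -0.05, -1.2, 0.3]
codes, scale = ternarize(weights)
print(codes, scale)  # codes are [1, 0, -1, 0]
updated = ste_step(weights, inputs=[1.0, 1.0, 1.0, 1.0], target=1.0)
print(updated)
```

Keeping full-precision latent weights and quantizing only in the forward pass is what lets small gradient updates accumulate until a weight crosses a rounding threshold. Only the ternary codes and the per-layer scale would need to be stored for inference.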