arxiv_cs_lg April 24, 2026

MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design

llm quantization mixed-precision arxiv machine-learning

Original Content

arXiv:2412.14590v2. Abstract: Quantization has become one of the most effective ways to compress LLMs to a smaller size. However, existing quantization solutions still suffer from either a non-negligible accuracy drop or low system efficiency. In this paper, we propose MixLLM, which explores the optimization space of mixed-precision quantization between output features, based on the insight that different features matter differently in the model. MixLLM identifies the important output features from a global view rather than within each single layer, assigning a larger bit-width to the output features that need it most, which achieves high accuracy with low memory usage. We present the sweet spot of quantization configuration for algorithm-system co-design, one that delivers both high accuracy and high system efficiency. To address the system challenge, we design a two-step dequantization that makes it easy to use the Tensor Core, together with fast data type conversion that reduces dequantization overhead, and we present a software pipeline that overlaps memory access, dequantization, and the MatMul as much as possible. Extensive experiments show that with only 10% more bits, the perplexity increase can be reduced from about 0.5 with the SOTA to within 0.2 for Llama 3.1 70B, while the MMLU-Pro loss can be reduced from 1.92 with the SOTA to 0.99 across three popular models. Besides its superior accuracy, MixLLM also achieves state-of-the-art system efficiency. Code is released at https://github.com/microsoft/MixLLM.
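
The global selection idea in the abstract can be illustrated with a small sketch. This is not the MixLLM implementation: the function names, the per-feature salience scores, the 4-bit and 8-bit widths, and the assumption that every output feature holds the same number of weights are all hypothetical choices made for illustration. The sketch ranks output features across all layers at once, so a layer with many important features may receive more high-precision slots than a layer with few.

```python
from typing import Dict, List


def select_high_precision_features(
    salience: Dict[str, List[float]],
    high_fraction: float,
) -> Dict[str, List[int]]:
    """Pick the most salient output features across ALL layers (hypothetical helper).

    salience maps a layer name to one score per output feature. How the scores
    are computed is outside this sketch; they are assumed to be given.
    """
    if not 0.0 <= high_fraction <= 1.0:
        raise ValueError("high_fraction must be in [0, 1]")
    # Flatten every (score, layer, index) triple so the ranking is global,
    # not per layer.
    ranked = sorted(
        (
            (score, layer, idx)
            for layer, scores in salience.items()
            for idx, score in enumerate(scores)
        ),
        key=lambda item: item[0],
        reverse=True,
    )
    budget = int(len(ranked) * high_fraction)
    chosen: Dict[str, List[int]] = {layer: [] for layer in salience}
    for _, layer, idx in ranked[:budget]:
        chosen[layer].append(idx)
    return {layer: sorted(indices) for layer, indices in chosen.items()}


def average_bits(
    salience: Dict[str, List[float]],
    chosen: Dict[str, List[int]],
    low_bits: int = 4,   # assumed base bit-width
    high_bits: int = 8,  # assumed bit-width for selected features
) -> float:
    """Average bits per weight, assuming all features have equal size."""
    total = sum(len(scores) for scores in salience.values())
    high = sum(len(indices) for indices in chosen.values())
    return (high * high_bits + (total - high) * low_bits) / total


# Toy scores: both of the globally most salient features sit in layer1.
salience = {
    "layer0": [0.30, 0.10, 0.20, 0.05, 0.25],
    "layer1": [0.80, 0.95, 0.60, 0.01, 0.02],
}
chosen = select_high_precision_features(salience, high_fraction=0.2)
print(chosen)
print(average_bits(salience, chosen))
```

With these toy scores, a per-layer rule would give each layer one high-precision feature, while the global ranking gives both slots to `layer1`. Under the assumed 4-bit and 8-bit widths, promoting a fraction p of the features raises the average to 4 + 4p bits, so p = 0.1 corresponds to 10% more bits than a uniform 4-bit model.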