arxiv_cs_ai April 24, 2026

NPU Design for Diffusion Language Model Inference

npu, diffusion-models, large-language-models, accelerator-architecture, kv-cache

Original Content

arXiv:2601.20706v2 Announce Type: replace-cross

Abstract: Diffusion-based LLMs (dLLMs) fundamentally depart from traditional autoregressive (AR) LLM inference: they leverage bidirectional attention, block-wise KV cache refreshing, cross-step reuse, and a non-GEMM-centric sampling phase. These characteristics make current dLLMs incompatible with most existing NPUs, as their inference patterns, in particular the reduction-heavy, top-$k$-driven sampling stage, demand new ISA and memory hierarchy support beyond that of AR accelerators. In addition, the blocked diffusion KV cache breaks from the append-only paradigm assumed by AR NPUs, and conventional AR-derived KV quantization schemes were designed for static activation distributions and do not account for the step-wise distribution shifts introduced by iterative block-wise refinement in dLLMs. In this paper, we introduce the first NPU accelerator specifically designed for dLLMs. It delivers: a dLLM-oriented ISA and compiler; a hardware-optimized execution model for both the transformer inference and diffusion sampling used in dLLMs; a novel Block-Adaptive Online Smoothing (BAOS) for quantizing KV cache in dLLMs; and a complete RTL implementation synthesized in 7nm. To evaluate and validate our design, we introduce a tri-path simulation framework that comprises analytical, cycle-accurate, and accuracy simulators, together with cross-validations against physical hardware. The full NPU stack, including ISA, simulation tools, and quantization software, will be open-sourced upon acceptance.
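To make the "reduction-heavy, top-$k$-driven sampling stage" concrete, the following is a minimal pure-Python sketch of one confidence-based unmasking step of the kind often used in dLLM decoding. It is an illustration under stated assumptions, not the paper's sampler or ISA: the function names `softmax_confidence` and `unmask_step`, the confidence criterion, and the tie-breaking rule are all hypothetical. The point is only that each masked position needs several reductions over the vocabulary, followed by a top-k selection over positions, and none of this work is a matrix multiply.

```python
import math

def softmax_confidence(logits):
    """Return (argmax token id, its softmax probability) for one position."""
    m = max(logits)                                   # reduction 1: max over vocab
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)                                     # reduction 2: sum over vocab
    best = max(range(len(logits)), key=logits.__getitem__)  # reduction 3: argmax
    return best, exps[best] / z

def unmask_step(block_logits, masked, k):
    """Commit the k most confident masked positions in a block (assumed policy).

    block_logits: list of per-position logit lists for the current block
    masked:       set of position indices that are still masked
    Returns a dict mapping committed position -> chosen token id.
    """
    scored = []
    for pos in sorted(masked):
        tok, conf = softmax_confidence(block_logits[pos])
        scored.append((conf, pos, tok))
    # Top-k over positions: highest confidence first, ties broken by position.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return {pos: tok for _, pos, tok in scored[:k]}

# Toy block: 4 positions, vocabulary of 3 tokens.
logits = [
    [2.0, 0.0, 0.0],   # position 0: fairly confident in token 0
    [0.1, 0.0, 0.2],   # position 1: nearly uniform
    [0.0, 5.0, 0.0],   # position 2: very confident in token 1
    [0.0, 0.0, 1.0],   # position 3: mildly confident in token 2
]
committed = unmask_step(logits, masked={0, 1, 2, 3}, k=2)
print(committed)  # positions 2 and 0 are committed; 1 and 3 stay masked
```

In a real model the vocabulary has on the order of tens of thousands of entries or more, and the step repeats for every refinement iteration of every block, which is why hardware support for max, sum, and top-k reductions matters here in a way it does not for one-token-at-a-time AR decoding.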