arxiv_cs_lg April 20, 2026

AscendKernelGen: A Systematic Study of LLM-Based Kernel Generation for Neural Processing Units

ascend-npu, llm-application, kernel-generation, chain-of-thought, supervised-fine-tuning

arXiv:2601.07160v2 Announce Type: replace-cross Abstract: To meet the ever-increasing demand for computational efficiency, Neural Processing Units (NPUs) have become critical in modern AI infrastructure. However, unlocking their full potential requires developing high-performance compute kernels using vendor-specific Domain-Specific Languages (DSLs), a task that demands deep hardware expertise and is labor-intensive. While Large Language Models (LLMs) have shown promise in general code generation, they struggle with the strict constraints and scarcity of training data in the NPU domain. Our preliminary study reveals that state-of-the-art general-purpose LLMs fail to generate functional complex kernels for Ascend NPUs, yielding a near-zero success rate. To address these challenges, we propose AscendKernelGen, a generation-evaluation integrated framework for NPU kernel development. We introduce Ascend-CoT, a high-quality dataset incorporating chain-of-thought reasoning derived from real-world kernel implementations, and KernelGen-LM, a domain-adaptive model trained via supervised fine-tuning and reinforcement learning with execution feedback. Furthermore, we design NPUKernelBench, a comprehensive benchmark for assessing compilation, correctness, and performance across varying complexity levels. Experimental results demonstrate that our approach significantly bridges the gap between general LLMs and hardware-specific coding. Specifically, the compilation success rate on complex Level-2 kernels improves from 0% to 95.5% (Pass@10), while functional correctness achieves 64.3% compared to the baseline's complete failure. These results highlight the critical role of domain-specific reasoning and rigorous evaluation in automating accelerator-aware code generation. AscendKernelGen is available at https://huggingface.co/AscendKernelGen and https://github.com/weich97/NPUKernelBench.
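
The abstract reports its headline numbers as Pass@10 but does not say how the metric is computed. As a rough illustration only, the sketch below uses the unbiased pass@k estimator that is common in code-generation evaluation: for a task with n sampled kernels of which c pass a check (compilation or functional correctness), pass@k = 1 - C(n-c, k) / C(n, k). Whether NPUKernelBench uses this exact estimator is an assumption, and the function names `pass_at_k` and `mean_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of kernels sampled for the task
    c: number of those samples that passed the check
    k: sample budget being scored (e.g. 10 for Pass@10)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample.
    # math.comb returns 0 when k > n - c, which makes the estimate 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; results holds one (n, c) pair per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Under this convention, a benchmark-level Pass@10 is the mean of the per-task estimates, so a task counts fully toward the score as soon as at least one of its ten samples passes.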