arxiv_cs_cv February 10, 2026

FlashKAT: Understanding and Addressing Performance Bottlenecks in the Kolmogorov-Arnold Transformer

Translated: 2026/3/15 17:02:25

kan-network transformer gradient-computation memory-bottleneck deep-learning-architecture

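Translation (from Japanese)

arXiv:2505.13813v3 Announce Type: replace-cross Abstract: The Kolmogorov-Arnold Network (KAN) is growing in popularity as an alternative to the multilayer perceptron (MLP) because of its high expressiveness and interpretability. However, KAN suffers from training instability and, owing to its computational cost, is orders of magnitude slower, which limits its use on large-scale tasks. Recently, the Kolmogorov-Arnold Transformer (KAT) was proposed; by using Group-Rational KAN (GR-KAN), it achieves FLOPs on par with conventional Transformer models. Yet despite the comparable FLOPs, our tests show that KAT is still 123x slower during training, which reveals performance bottlenecks beyond FLOPs. In this paper, we run a series of experiments to identify the root cause of KAT's slowdown. We trace the slowdown to memory stalls caused by inefficient gradient accumulation in the backward pass of GR-KAN. To address this memory bottleneck, we propose FlashKAT, which uses a restructured kernel to minimize accesses to slow memory and reduce the use of atomic adds. Evaluations show that FlashKAT achieves up to an 86.5x training speedup over the state-of-the-art KAT and reduces rounding errors in gradient computation.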

Original Content

arXiv:2505.13813v3 Announce Type: replace-cross Abstract: The Kolmogorov-Arnold Network (KAN) has been gaining popularity as an alternative to the multilayer perceptron (MLP) due to its greater expressiveness and interpretability. Even so, KAN suffers from training instability and being orders of magnitude slower due to its increased computational cost, limiting its applicability to large-scale tasks. Recently, the Kolmogorov-Arnold Transformer (KAT) has been proposed, achieving FLOPs comparable to traditional Transformer models with MLPs by leveraging Group-Rational KAN (GR-KAN). Unfortunately, despite the comparable FLOPs, our testing shows that KAT remains 123x slower during training, indicating that there are other performance bottlenecks beyond FLOPs. In this paper, we conduct a series of experiments to understand the root cause of the slowdown in KAT. We uncover that the slowdown can be isolated to memory stalls, linked more specifically to inefficient gradient accumulations in the backward pass of GR-KAN. To address this memory bottleneck, we propose FlashKAT, which minimizes accesses to slow memory and the usage of atomic adds through a restructured kernel. Evaluations show that FlashKAT achieves up to an 86.5x training speedup over state-of-the-art KAT while reducing rounding errors in gradient computation.
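
The abstract reports that the restructured kernel uses fewer atomic adds and also reduces rounding errors in gradient computation. One plausible reading, stated here as an assumption and not as the authors' explanation, is that the order in which many small contributions are accumulated affects float32 accuracy. The NumPy sketch below is only a CPU illustration of that effect and is not the FlashKAT kernel. The function names, the block size of 1024, and the constant contribution of 0.1 are all hypothetical choices made for the example. The sequential loop stands in for a long chain of adds into a single accumulator, and the blocked version stands in for reducing locally before combining partial sums.

```python
import numpy as np


def sequential_sum_f32(values: np.ndarray) -> np.float32:
    """Add every value into one float32 running total, one add at a time.

    Assumption: this loosely mimics many contributions being added into a
    single accumulator, as a long chain of atomic adds would do.
    """
    total = np.float32(0.0)
    for v in values.astype(np.float32):
        total = np.float32(total + v)
    return total


def blocked_sum_f32(values: np.ndarray, block: int = 1024) -> np.float32:
    """Sum fixed-size blocks first, then combine the per-block partial sums."""
    v = values.astype(np.float32)
    pad = (-v.size) % block
    v = np.pad(v, (0, pad))  # zero padding leaves the sum unchanged
    partial = v.reshape(-1, block).sum(axis=1, dtype=np.float32)
    return partial.sum(dtype=np.float32)


def accumulation_errors(n: int = 1_000_000, value: float = 0.1):
    """Return (sequential error, blocked error) against a float64 reference."""
    contributions = np.full(n, value, dtype=np.float32)
    reference = contributions.sum(dtype=np.float64)
    seq_err = abs(float(sequential_sum_f32(contributions)) - reference)
    blk_err = abs(float(blocked_sum_f32(contributions)) - reference)
    return seq_err, blk_err


if __name__ == "__main__":
    seq_err, blk_err = accumulation_errors()
    print(f"sequential float32 error: {seq_err:.4f}")
    print(f"blocked float32 error:    {blk_err:.4f}")
```

In this toy setting the sequential total drifts far from the float64 reference because each add is rounded against an ever larger running sum, while the blocked reduction keeps intermediate sums small. Whether this mechanism accounts for the accuracy gain reported for FlashKAT is not stated in the abstract.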