arxiv_cs_lg February 10, 2026

Scout Before You Attend: Lightweight Sketch-and-Walk Sparse Attention for Efficient LLM Inference

Scout Before You Attend: Sketch-and-Walk Sparse Attention for Efficient LLM Inference

Translated: 2026/3/15 14:05:56

sparse-attention, llm-inference, memory-efficient, training-free, long-context

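Japanese Translation (rendered in English)

arXiv:2602.07397v1
Announce Type: new
Abstract: Self-attention is the dominant factor in the computational and memory cost of long-context LLM inference, in both the prefill and decode phases. To address this challenge, we introduce Sketch&Walk Attention, a training-free sparse attention method that determines sparsity through lightweight sketches and a deterministic walk. Sketch&Walk uses Hadamard sketching to obtain low-cost approximations of attention scores, then aggregates these estimates across layers through a walk mechanism that captures attention influence beyond direct token-to-token interactions. The accumulated walk scores are used to select the top-k attention blocks; together with custom sparse attention kernels, this enables dynamic sparsity with a single training-free algorithm that applies uniformly to both the prefill and decode phases. Across a wide range of models and tasks, Sketch&Walk maintains near-lossless accuracy at 20% attention density, slightly outperforms dense attention in some settings, and achieves up to 6x inference speedup.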

Original Content

arXiv:2602.07397v1
Announce Type: new
Abstract: Self-attention dominates the computational and memory cost of long-context LLM inference across both prefill and decode phases. To address this challenge, we introduce Sketch&Walk Attention, a training-free sparse attention method that determines sparsity with lightweight sketches and deterministic walk. Sketch&Walk applies Hadamard sketching to get inexpensive approximations of attention scores, then aggregates these estimates across layers via a walk mechanism that captures attention influence beyond direct interactions between tokens. The accumulated walk scores are used to select top-k attention blocks, enabling dynamic sparsity with a single training-free algorithm that applies uniformly to both the prefill and decode phases, together with custom sparse attention kernels. Across a wide range of models and tasks, Sketch&Walk maintains near-lossless accuracy at 20% attention density and can slightly outperform dense attention in some settings, while achieving up to 6x inference speedup.
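
To make the two ingredients named in the abstract more concrete, here is a minimal illustrative sketch in plain Python. It is not the authors' implementation. It assumes a standard subsampled randomized Hadamard transform as the "Hadamard sketch", max-pooling of sketched query-key dot products within each key block, and a density fraction (for example 0.2) that is converted into a top-k block count. The function names `fwht`, `make_sketch`, `block_scores`, and `select_blocks` are hypothetical. The cross-layer walk mechanism is not specified in the abstract and is deliberately left out.

```python
import math
import random


def fwht(vec):
    """Unnormalized fast Walsh-Hadamard transform.

    len(vec) must be a power of two.
    """
    out = list(vec)
    h = 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def make_sketch(dim, sketch_dim, seed=0):
    """Return a function mapping R^dim -> R^sketch_dim.

    Assumed construction (not taken from the paper): random sign flips,
    a Hadamard transform, then a random subset of sketch_dim coordinates,
    scaled so that sketched dot products are unbiased estimates.
    """
    rng = random.Random(seed)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
    rows = rng.sample(range(dim), sketch_dim)
    scale = 1.0 / math.sqrt(sketch_dim)

    def sketch(vec):
        mixed = fwht([s * x for s, x in zip(signs, vec)])
        return [scale * mixed[r] for r in rows]

    return sketch


def block_scores(query, keys, block_size, sketch):
    """Score each key block by its largest sketched dot product with the query.

    Max-pooling is an illustrative choice, not something the abstract states.
    """
    q = sketch(query)
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        scores.append(
            max(sum(a * b for a, b in zip(q, sketch(k))) for k in block)
        )
    return scores


def select_blocks(scores, density):
    """Keep the top-k blocks, where k is about density * number of blocks."""
    k = max(1, round(density * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

With `sketch_dim` equal to `dim`, the transform is orthogonal and dot products are preserved up to floating-point error; smaller `sketch_dim` values trade accuracy for cost. Calling `select_blocks(scores, 0.2)` on ten block scores keeps two blocks, which corresponds to the 20% attention density mentioned in the abstract.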