arxiv_cs_lg 2026/2/10

Efficient Attention via Pre-Scoring: Prioritizing Informative Keys in Transformers

Translated: 2026/3/15 9:05:00

transformers, attention-mechanisms, pre-scoring, long-context, language-modeling

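Translation

arXiv:2505.11040v4 Announce Type: replace Abstract: Efficient attention mechanisms make long-context transformers possible, but they can overlook globally important tokens and degrade modeling quality. We introduce a pre-scoring framework that assigns each key a query-independent global importance prior before hierarchical approximate attention is applied. Using clustering-based or leverage-style scoring, pre-scoring identifies structurally informative keys and restricts computation to this prioritized subset. Integrated with HyperAttention, pre-scoring substantially improves approximation quality in long-context language modeling: on ChatGLM with 131k-token contexts, perplexity drops from 12.0 to 9.5 under a fixed interaction budget while subquadratic efficiency is retained. Under identical key budgets, clustering-based scoring consistently outperforms leverage-based selection. Beyond language models, replacing self-attention in Vision Transformers preserves most of the baseline accuracy, showing that the approach generalizes across modalities. We provide structural guarantees under a planted-subspace model, showing that clustering recovers the same heavy-key sets as leverage-based methods. Overall, pre-scoring improves the efficiency-accuracy trade-off of approximate attention by better prioritizing informative keys, without sacrificing scalability.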

Original Content

arXiv:2505.11040v4 Announce Type: replace Abstract: Efficient attention mechanisms enable long-context transformers but often miss globally important tokens, degrading modeling quality. We introduce a pre-scoring framework that assigns a query-independent global importance prior to keys before applying hierarchical approximate attention. Using clustering-based or leverage-style scoring, pre-scoring identifies structurally informative keys and restricts computation to this prioritized subset. Integrated with HyperAttention, pre-scoring substantially improves approximation quality on long-context language modeling: on ChatGLM with 131k-token contexts, perplexity decreases from 12.0 to 9.5 under a fixed interaction budget while retaining subquadratic efficiency. Clustering-based scoring consistently outperforms leverage-based selection under identical key budgets. Beyond language, replacing self-attention in Vision Transformers preserves most of the baseline accuracy, showing that the approach generalizes across modalities. We provide structural guarantees under a planted-subspace model, showing that clustering recovers the same heavy-key sets as leverage-based methods. Overall, pre-scoring improves the efficiency-accuracy trade-off of approximate attention by better prioritizing informative keys without sacrificing scalability.
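
The idea of scoring keys once, independently of any query, and then attending only to the selected subset can be illustrated with a small sketch. This is not the authors' implementation. It assumes that "leverage-style scoring" means standard statistical leverage scores of the key matrix, and it substitutes plain exact softmax attention over the selected keys for HyperAttention. The function names `leverage_scores`, `select_keys`, and `restricted_attention` are hypothetical, and the toy data is invented for illustration.

```python
import math


def leverage_scores(keys):
    """Query-independent leverage score for each row of the n x d key matrix.

    Builds an orthonormal basis for the column space of the keys with
    modified Gram-Schmidt; the score of key i is the squared norm of
    row i in that basis.
    """
    n, d = len(keys), len(keys[0])
    # cols[j] holds feature j across all n keys.
    cols = [[keys[i][j] for i in range(n)] for j in range(d)]
    basis = []
    for col in cols:
        v = list(col)
        for q in basis:
            proj = sum(a * b for a, b in zip(q, v))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm < 1e-12:
            continue  # column is linearly dependent on earlier ones
        basis.append([a / norm for a in v])
    return [sum(q[i] ** 2 for q in basis) for i in range(n)]


def select_keys(scores, budget):
    """Indices of the `budget` highest-scoring keys, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def restricted_attention(query, keys, values, selected):
    """Exact softmax attention for one query over the selected keys only."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(a * b for a, b in zip(query, keys[i])) for i in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[i][c] for w, i in zip(weights, selected)) / total
        for c in range(dim)
    ]


# Toy data (invented): four redundant keys and one key in a unique direction.
keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
values = [[1.0], [2.0], [3.0], [4.0], [10.0]]

scores = leverage_scores(keys)          # computed once, before any query
selected = select_keys(scores, budget=2)
output = restricted_attention([0.0, 2.0], keys, values, selected)
print(scores, selected, output)
```

In this toy setup the four identical keys share the leverage of their common direction, while the single key pointing in a unique direction receives the highest score and is always kept under a budget of two. The selection step runs once per key set, so every query reuses the same subset. A clustering-based scorer could replace `leverage_scores` without changing the other two functions.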