arxiv_cs_lg April 24, 2026

TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference

Translated: 2026/4/24 20:01:45

kv-cache, llm-inference, long-context, memory-hierarchy, attention-mechanism

Original Content

arXiv:2604.19769v1 Announce Type: cross

Abstract: Key-value (KV) caching is critical for efficient inference in large language models (LLMs), yet its memory footprint scales linearly with context length, resulting in a severe scalability bottleneck. Existing approaches largely treat KV states as equally important across time, implicitly assuming uniform precision and accessibility. However, this assumption contrasts with human memory systems, where memories vary in clarity, recall frequency, and relevance with temporal proximity. Motivated by this insight, we propose TTKV, a KV cache management framework that maps the human memory system onto the KV cache. TTKV partitions the KV cache into temporal tiers with heterogeneous capacity and precision. The design addresses three aspects: (1) Tier Layout, decoupling fast and slow memory using HBM and DRAM; (2) Tier Content, assigning more recent KV states to faster, higher-precision tiers based on temporal proximity; and (3) Tier Interaction, employing block-wise streaming attention to overlap communication and computation when accessing slow tiers. Experiments show that TTKV reduces cross-tier traffic by 5.94x on 128K-context tasks, achieving up to 76% latency reduction and 2x throughput improvement over strong baselines.
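The abstract names the mechanism without giving details, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a recency-based split into a "fast" and a "slow" tier, a toy grid-rounding step standing in for lower precision, and a standard online-softmax accumulation as one plausible reading of "block-wise streaming attention". All function names (`split_tiers`, `quantize`, `blocks`, `streaming_attention`) and all parameter values are hypothetical. The overlap of transfer and compute is not modeled; each yielded block merely marks where a DRAM-to-HBM copy could occur.

```python
import math
import random


def split_tiers(keys, values, fast_capacity):
    """Recency-based split: the newest `fast_capacity` tokens form the fast tier."""
    cut = max(len(keys) - fast_capacity, 0)
    return (keys[:cut], values[:cut]), (keys[cut:], values[cut:])


def quantize(vec, step=1 / 16):
    """Toy stand-in for reduced precision: snap each element to a coarse grid."""
    return [round(x / step) * step for x in vec]


def blocks(keys, values, block_size):
    """Yield (K, V) blocks; each yield marks where a slow-tier transfer could happen."""
    for i in range(0, len(keys), block_size):
        yield keys[i:i + block_size], values[i:i + block_size]


def streaming_attention(query, kv_blocks):
    """Single-query attention that consumes one KV block at a time (online softmax)."""
    scale = 1.0 / math.sqrt(len(query))
    running_max, denom, acc = -math.inf, 0.0, None
    for k_blk, v_blk in kv_blocks:
        if acc is None:
            acc = [0.0] * len(v_blk[0])
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial sums to the new maximum (exp(-inf) == 0.0 on the first block).
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


def full_attention(query, keys, values):
    """Reference: softmax attention over the whole cache at once."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


random.seed(0)
dim, n_tokens = 8, 64
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
query = [random.gauss(0, 1) for _ in range(dim)]

(slow_k, slow_v), (fast_k, fast_v) = split_tiers(keys, values, fast_capacity=16)
slow_k = [quantize(k) for k in slow_k]  # older tokens kept at lower precision
slow_v = [quantize(v) for v in slow_v]


def tiered_blocks():
    yield from blocks(slow_k, slow_v, block_size=8)  # streamed from the slow tier
    yield fast_k, fast_v                             # already resident in the fast tier


streamed = streaming_attention(query, tiered_blocks())
reference = full_attention(query, slow_k + fast_k, slow_v + fast_v)
max_err = max(abs(a - b) for a, b in zip(streamed, reference))
```

Here `max_err` compares the block-wise result against attention computed in one pass over the same tiered cache. The two should agree up to floating-point rounding, which is the property that lets slow-tier blocks be processed as they arrive instead of after the whole cache has been gathered.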