arxiv_cs_lg April 24, 2026

Continuous Semantic Caching for Low-Cost LLM Serving

Translated: 2026/4/24 19:55:26

llm, semantic-caching, machine-learning, inference-optimization, reinforcement-learning

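Japanese Translation (rendered in English)

arXiv:2604.20021v1 (Announce Type: new)

Abstract: As the use of Large Language Models (LLMs) continues to grow, caching responses so that they can serve users who submit semantically similar queries has become essential for reducing inference cost and latency. Existing caching frameworks assume a finite, known universe of discrete queries and decide which query responses to cache by learning serving costs and arrival probabilities. As the pool of LLM users and queries expands, however, this assumption increasingly fails to hold: real-world LLM queries lie in an infinite, continuous embedding space. In this paper, we establish the first rigorous theoretical framework for LLM response caching in a continuous query space under uncertainty. To bridge the gap between discrete optimization and continuous representation spaces, we combine dynamic $\epsilon$-net discretization with Kernel Ridge Regression. This design lets the system formally quantify estimation uncertainty and generalize partial feedback on LLM query costs across continuous semantic query neighborhoods. We develop offline learning and online adaptive algorithms optimized to reduce the switching cost of changing cached responses. We prove that the online algorithm achieves a sublinear regret bound against an optimal continuous oracle, and that this bound reduces to existing bounds for discrete query models. Extensive empirical evaluation shows that our framework closely approximates the optimal continuous cache while reducing computational and switching overhead relative to existing methods.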

Original Content

arXiv:2604.20021v1 (Announce Type: new)

Abstract: As Large Language Models (LLMs) become increasingly popular, caching responses so that they can be reused by users with semantically similar queries has become a vital strategy for reducing inference costs and latency. Existing caching frameworks have proposed to decide which query responses to cache by assuming a finite, known universe of discrete queries and learning their serving costs and arrival probabilities. As LLMs' pool of users and queries expands, however, such an assumption becomes increasingly untenable: real-world LLM queries reside in an infinite, continuous embedding space. In this paper, we establish the first rigorous theoretical framework for semantic LLM response caching in continuous query space under uncertainty. To bridge the gap between discrete optimization and continuous representation spaces, we introduce dynamic $\epsilon$-net discretization coupled with Kernel Ridge Regression. This design enables the system to formally quantify estimation uncertainty and generalize partial feedback on LLM query costs across continuous semantic query neighborhoods. We develop both offline learning and online adaptive algorithms optimized to reduce switching costs incurred by changing the cached responses. We prove that our online algorithm achieves a sublinear regret bound against an optimal continuous oracle, which reduces to existing bounds for discrete query models. Extensive empirical evaluations demonstrate that our framework approximates the continuous optimal cache well while also reducing computational and switching overhead compared to existing methods.
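The abstract names two building blocks, $\epsilon$-net discretization and Kernel Ridge Regression, without giving their construction. The sketch below illustrates the general idea only and is not the paper's algorithm. It assumes Euclidean distance between embeddings, a greedy net construction, an RBF kernel, and a Gaussian-process-style uncertainty width. All function names and parameters (`greedy_epsilon_net`, `krr_predict`, `lam`, `length_scale`) are hypothetical.

```python
import numpy as np


def greedy_epsilon_net(points, eps):
    """Return indices of centers such that every point is within eps of a center.

    Assumption: a simple greedy pass; the paper's dynamic construction may differ.
    """
    centers = []
    for i, p in enumerate(points):
        # Add p as a new center only if no existing center covers it.
        if not centers or np.min(np.linalg.norm(points[centers] - p, axis=1)) > eps:
            centers.append(i)
    return centers


def rbf_kernel(a, b, length_scale=1.0):
    """RBF kernel matrix between rows of a and rows of b (assumed kernel choice)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * length_scale**2))


def krr_predict(x_train, y_train, x_query, lam=0.1, length_scale=1.0):
    """Kernel ridge estimate of cost at x_query, plus an uncertainty width.

    The width is the GP-style posterior standard deviation, used here as a
    stand-in for whatever confidence bound the paper derives.
    """
    gram = rbf_kernel(x_train, x_train, length_scale) + lam * np.eye(len(x_train))
    k_q = rbf_kernel(x_query, x_train, length_scale)
    mean = k_q @ np.linalg.solve(gram, y_train)
    # k(q, q) = 1 for the RBF kernel.
    var = 1.0 - np.sum(k_q * np.linalg.solve(gram, k_q.T).T, axis=1)
    return mean, np.sqrt(np.maximum(var, 0.0))


# Synthetic data for illustration only: random "embeddings" and made-up costs.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 4))
centers = greedy_epsilon_net(embeddings, eps=1.5)
observed_costs = np.linalg.norm(embeddings[centers], axis=1)  # fake cost feedback
est, width = krr_predict(embeddings[centers], observed_costs, embeddings[:5])
```

In this sketch, cost feedback observed at the net centers is generalized to nearby queries through the kernel. The returned width grows for queries far from any observed center, which is one way to express the estimation uncertainty the abstract refers to.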