arxiv_cs_cv April 24, 2026

LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval

Translated: 2026/4/24 19:49:32

livevlm, video-llm, kv-cache, vision-language-model, streaming-compression

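Translated Abstract

arXiv:2505.15269v2 Announce Type: replace Abstract: Recent advances in Video Large Language Models (Video LLMs) have enabled models to process hour-long videos and deliver exceptional performance. However, the Key-Value (KV) cache grows linearly over time, causing substantial memory overhead and response latency. This is a critical challenge for a range of real-world online applications such as Deepseek services, autonomous driving, and robotics. To mitigate these problems, we propose $\textbf{LiveVLM}$, a training-free, query-agnostic framework designed specifically for online video understanding and real-time interaction. LiveVLM uses a Vision Sink Bucketing (VSB) mechanism to process video streams in real time, retaining long-term video details while discarding redundant KVs. The mechanism uses vision-to-vision attention scores as its metric and aims to maximize coverage of contextual information during compression. Observing that a KV cache compressed in a query-agnostic manner inevitably retains information irrelevant to a specific query, LiveVLM also incorporates a Position-agnostic KV Retrieval (PaR) mechanism to reduce interference from redundant context. The key idea of PaR is to decouple positional embeddings, which increases the similarity between key tensors and supports efficient retrieval at page granularity. Extensive experiments show that LiveVLM enables the foundation LLaVA-OneVision model to achieve state-of-the-art accuracy among both training-free query-agnostic methods and training-based online models.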
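The linear growth of the KV cache described in the abstract can be made concrete with a back-of-the-envelope calculation. The sketch below is only an illustration: the function name `kv_cache_bytes`, the layer and head configuration, and the tokens-per-frame value are hypothetical and are not taken from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # Factor 2: every layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Hypothetical model configuration, chosen only for illustration.
TOKENS_PER_FRAME = 196
cfg = dict(num_layers=28, num_kv_heads=4, head_dim=128)

# The cache size scales in direct proportion to the number of frames seen.
for frames in (60, 600, 3600):
    gib = kv_cache_bytes(frames * TOKENS_PER_FRAME, **cfg) / 2**30
    print(f"{frames:5d} frames -> {gib:.2f} GiB")
```

Because every term other than `num_tokens` is fixed by the model, an uncompressed cache for a continuous stream grows without bound, which is the motivation for compression and retrieval.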

Original Content

arXiv:2505.15269v2 Announce Type: replace Abstract: Recent developments in Video Large Language Models (Video LLMs) have enabled models to process hour-long videos and exhibit exceptional performance. Nonetheless, the Key-Value (KV) cache expands linearly over time, leading to substantial memory overhead and response delay--critical challenges in various real-world online applications, such as Deepseek services, autonomous driving and robotics. To mitigate these issues, we propose $\textbf{LiveVLM}$, a training-free and query-agnostic framework specifically designed for online video understanding and real-time interaction. LiveVLM employs a Vision Sink Bucketing (VSB) mechanism to process video streams in real time, retain long-term video details and eliminate redundant KVs. This mechanism utilizes vision-to-vision attention scores as the metric and seeks to maximize the coverage of contextual information during compression. Noting that KV cache compressed in a query-agnostic manner inevitably retains irrelevant information for specific queries, LiveVLM incorporates a Position-agnostic KV Retrieval (PaR) mechanism to reduce interference from redundant context. The key point of PaR lies in decoupling positional embeddings to enhance the similarity between key tensors, thereby supporting efficient retrieval at the granularity of pages. Extensive experiments demonstrate that LiveVLM enables the foundation LLaVA-OneVision model to achieve state-of-the-art accuracy among both training-free query-agnostic methods and training-based online models.
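The page-granularity retrieval idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes keys and the query are available without positional embeddings applied, summarizes each page by the mean of its keys, and ranks pages by dot product. The names `retrieve_pages` and `gather_tokens`, the mean pooling, and the scoring rule are all hypothetical choices made for this example.

```python
import numpy as np


def retrieve_pages(query, keys, page_size, top_k):
    """Return sorted indices of the `top_k` pages most similar to `query`.

    query: (d,) query vector, assumed to carry no positional embedding.
    keys:  (n, d) cached keys, assumed to carry no positional embedding.
    """
    n, _ = keys.shape
    num_pages = -(-n // page_size)  # ceiling division
    # Summarize each page by the mean of its keys (an illustrative choice).
    page_repr = np.stack([
        keys[p * page_size:(p + 1) * page_size].mean(axis=0)
        for p in range(num_pages)
    ])
    scores = page_repr @ query
    k = min(top_k, num_pages)
    top = np.argsort(scores)[-k:]  # indices of the k highest-scoring pages
    return np.sort(top)            # restore temporal order


def gather_tokens(page_ids, n, page_size):
    """Expand page indices into the token indices they cover."""
    return np.concatenate([
        np.arange(p * page_size, min((p + 1) * page_size, n)) for p in page_ids
    ])


# Synthetic data: page 4 (tokens 32..39) is shifted toward the query direction.
rng = np.random.default_rng(0)
keys = rng.normal(size=(64, 16))
keys[32:40] += 3.0
query = np.ones(16)

pages = retrieve_pages(query, keys, page_size=8, top_k=2)
tokens = gather_tokens(pages, n=len(keys), page_size=8)
print(pages, tokens.shape)
```

Scoring one representative per page instead of every cached key keeps the retrieval cost proportional to the number of pages, and only the selected pages' keys and values would then take part in attention for the query.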