arxiv_cs_cv February 10, 2026

MARC: Efficient Video Understanding via Memory-Augmented, Reinforcement-Learning-Based Token Compression Leveraging Large LLMs

MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding

Translated: 2026/3/15 14:46:42

video-understanding, token-compression, reinforcement-learning, multimodal-models, llm-applications

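Japanese Translation (rendered in English)

arXiv:2510.07915v3 Announce Type: replace

Abstract: The rapid progress of large language models (LLMs) has laid the foundation for multimodal models. However, when extending from images to video, the very high computational cost of processing high-frame-rate video remains a challenge. Token compression is a promising solution, but existing training-free methods often cause information loss and performance degradation. To overcome this, this work proposes Memory-Augmented Reinforcement Learning-based Token Compression (MARC), which integrates structured retrieval with reinforcement-learning-based distillation. MARC adopts a "retrieve-then-compress" strategy: it uses a Visual Memory Retriever (VMR) to select key clips and a Compression Group Relative Policy Optimization (C-GRPO) framework to distill reasoning ability from a teacher model to a student model. Experiments on six video benchmarks show that MARC achieves near-baseline accuracy using only one frame's worth of tokens, reducing visual tokens by 95%, GPU memory by 72%, and latency by 23.9%. This demonstrates its potential for efficient real-time video understanding in resource-constrained settings such as video QA, surveillance, and autonomous driving.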

Original Content

arXiv:2510.07915v3 Announce Type: replace

Abstract: The rapid progress of large language models (LLMs) has laid the foundation for multimodal models. However, visual language models (VLMs) still face heavy computational costs when extended from images to videos due to high frame rates and long durations. Token compression is a promising solution, yet most existing training-free methods cause information loss and performance degradation. To overcome this, we propose Memory-Augmented Reinforcement Learning-based Token Compression (MARC), which integrates structured retrieval and RL-based distillation. MARC adopts a retrieve-then-compress strategy using a Visual Memory Retriever (VMR) to select key clips and a Compression Group Relative Policy Optimization (C-GRPO) framework to distil reasoning ability from a teacher to a student model. Experiments on six video benchmarks show that MARC achieves near-baseline accuracy using only one frame's tokens, reducing visual tokens by 95%, GPU memory by 72%, and latency by 23.9%. This demonstrates its potential for efficient, real-time video understanding in resource-constrained settings such as video QA, surveillance, and autonomous driving.
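
The abstract gives no implementation details, so the following is only a minimal sketch of what a generic retrieve-then-compress step could look like. It is not the authors' VMR or C-GRPO. The function names (`retrieve_clips`, `pool_to_one_frame`, `token_reduction`), the cosine-similarity retrieval, and the mean pooling are all assumptions chosen for illustration, and the RL-based distillation stage is not shown at all.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_clips(clip_embeddings, query_embedding, k):
    """Hypothetical retrieval step: indices of the k clips most similar
    to the query, returned in temporal (ascending index) order."""
    ranked = sorted(
        range(len(clip_embeddings)),
        key=lambda i: cosine(clip_embeddings[i], query_embedding),
        reverse=True,
    )
    return sorted(ranked[:k])


def pool_to_one_frame(frames):
    """Hypothetical compression step: average tokens position-wise across
    frames, so the output has exactly one frame's worth of tokens.
    `frames` is a list of frames; each frame is a list of token vectors,
    and all frames are assumed to have the same shape."""
    n_frames = len(frames)
    n_tokens = len(frames[0])
    dim = len(frames[0][0])
    return [
        [sum(frame[t][d] for frame in frames) / n_frames for d in range(dim)]
        for t in range(n_tokens)
    ]


def token_reduction(original_tokens, kept_tokens):
    """Fraction of visual tokens removed by compression."""
    return 1.0 - kept_tokens / original_tokens


# Toy data, purely illustrative: four clips with 2-d embeddings and a query.
clip_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.9, 0.1]]
selected = retrieve_clips(clip_embeddings, [1.0, 0.0], k=2)

# Two frames of one 2-d token each, pooled down to a single frame.
pooled = pool_to_one_frame([[[1.0, 2.0]], [[3.0, 4.0]]])
```

Under this simple pooling scheme, keeping one frame's tokens out of N equally sized frames removes a fraction 1 - 1/N of the visual tokens, so an input of 20 frames would give a 95% reduction. The 20-frame figure is an illustrative assumption and is not stated in the abstract.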