arxiv_cs_cv February 10, 2026

FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging

Translated: 2026/3/15 19:03:48

video, large-language-models, computer efficiency, token merging, video compression, inference acceleration

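Translated Abstract (from Japanese)

arXiv:2602.08024v1. Video Large Language Models (VLLMs) have shown remarkable video understanding ability, but they must process a large number of visual tokens, which makes them computationally inefficient. Existing VLLM acceleration frameworks compress spatial and temporal redundancy independently, ignoring spatiotemporal relationships and yielding suboptimal spatiotemporal compression. Because video is dynamic, visual features that are highly correlated over time tend to change in spatial position, scale, orientation, and other attributes. Building on this insight, we propose FlashVID, a training-free inference acceleration framework for VLLMs. Specifically, FlashVID uses ADTS (Attention and Diversity-based Token Selection) to select representative tokens for a basic video representation, and TSTM (Tree-based Spatiotemporal Token Merging) to eliminate fine-grained spatiotemporal redundancy. Extensive experiments on three representative VLLMs across five video understanding benchmarks show the effectiveness and generality of our method. Notably, keeping only 10% of the visual tokens, FlashVID retains 99.1% of the performance of LLaVA-OneVision. FlashVID can therefore serve as a training-free, plug-and-play module for longer video inputs: it allows a 10x increase in the number of video frames fed to Qwen2.5-VL, for a relative improvement of 8.6% within the same computational budget. The source code is available at https://github.com/Fanziyang-v/FlashVID.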

Original Content

arXiv:2602.08024v1 Abstract: Although Video Large Language Models (VLLMs) have shown remarkable capabilities in video understanding, they are required to process high volumes of visual tokens, causing significant computational inefficiency. Existing VLLMs acceleration frameworks usually compress spatial and temporal redundancy independently, which overlooks the spatiotemporal relationships, thereby leading to suboptimal spatiotemporal compression. The highly correlated visual features are likely to change in spatial position, scale, orientation, and other attributes over time due to the dynamic nature of video. Building on this insight, we introduce FlashVID, a training-free inference acceleration framework for VLLMs. Specifically, FlashVID utilizes Attention and Diversity-based Token Selection (ADTS) to select the most representative tokens for basic video representation, then applies Tree-based Spatiotemporal Token Merging (TSTM) for fine-grained spatiotemporal redundancy elimination. Extensive experiments conducted on three representative VLLMs across five video understanding benchmarks demonstrate the effectiveness and generalization of our method. Notably, by retaining only 10% of visual tokens, FlashVID preserves 99.1% of the performance of LLaVA-OneVision. Consequently, FlashVID can serve as a training-free and plug-and-play module for extending long video frames, which enables a 10x increase in video frame input to Qwen2.5-VL, resulting in a relative improvement of 8.6% within the same computational budget. Code is available at https://github.com/Fanziyang-v/FlashVID.
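
The "10x frames within the same computational budget" claim follows from simple token accounting, sketched below. This is only an illustration under stated assumptions: the budget is approximated by the number of visual tokens passed to the language model, every frame yields the same number of tokens, and the frame count (32) and tokens per frame (196) are hypothetical values, not figures from the paper. The helper `visual_tokens` is likewise an illustrative name, not part of the FlashVID code.

```python
def visual_tokens(num_frames: int, tokens_per_frame: int, keep_percent: int = 100) -> int:
    """Visual tokens reaching the LLM when keep_percent% of them are retained.

    Assumes each frame contributes the same number of tokens (a simplification).
    """
    total = num_frames * tokens_per_frame
    return total * keep_percent // 100


# Hypothetical baseline: 32 frames, 196 tokens per frame, no compression.
baseline = visual_tokens(num_frames=32, tokens_per_frame=196)

# Ten times as many frames, but only 10% of the visual tokens are kept.
compressed = visual_tokens(num_frames=320, tokens_per_frame=196, keep_percent=10)

# Under these assumptions both settings hand the same token count to the LLM.
print(baseline, compressed)
```

Token count is only a proxy here; the vision encoder still processes every input frame, so the real cost comparison depends on details the abstract does not give.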