arxiv_cs_cv 2026年2月10日

ゲートアテンションと学習可能なサンプリングを用いた大規模マルチモーダルモデルにおける時長動画理解のための状態空間階層圧縮

State-Space Hierarchical Compression with Gated Attention and Learnable Sampling for Hour-Long Video Understanding in Large Multimodal Models

Open original article

Translated: 2026/3/15 7:01:40

state-space-modelsvideo-understandingmultimodal-modelscompressionattention-mechanisms

Japanese Translation

arXiv:2506.13564v2 Announce Type: replace 抽象：本稿では、大規模マルチモーダルモデルへ入力する前に膨大な動画フレーム機能を圧縮する効率的なフレームワークを提案し、時長動画に伴う深刻なトークン爆発を軽減します。当設計は、ゲートスキップ結合と周期性に挿入された学習済みクエリに適用される学習済みウェイト平均プーリング機構を備えた双方向状態空間モデルを活用しています。この構造は、空間および時間両方の次元にわたる階層的ダウンサンプリングを可能にし、費用対効果の高いパフォーマンス維持を実現します。困難な時長動画理解タスクにおいて、我々のアプローチは最先端モデルに対抗できる結果を示し、全体的なトークン予算を大幅に削減します。特に、我々の状態空間モデルを従来のモジュールに置き換えると顕著な性能低下が生じるため、提案した状態空間モデルがマルチフレーム動画情報を効果的に圧縮することを強調しています。当フレームワークはリソース意識的な効率性を重視しており、本格的な実運用に適しています。複数のベンチマークを通じてスケーラビリティと汎用性を検証し、効率的なリソース利用と包括的な動画理解という双目的を果たしました。

Original Content

arXiv:2506.13564v2 Announce Type: replace Abstract: We propose an efficient framework to compress massive video-frame features before feeding them into large multimodal models, thereby mitigating the severe token explosion arising from hour-long videos. Our design leverages a bidirectional state-space model equipped with a gated skip connection and a learnable weighted-average pooling mechanism applied to periodically inserted learned queries. This structure enables hierarchical downsampling across both spatial and temporal dimensions, preserving performance in a cost-effective manner. Across challenging hour-long video understanding tasks, our approach demonstrates competitive results against state-of-the-art models, while significantly reducing overall token budget. Notably, replacing our state-space model with conventional modules results in substantial performance degradation, highlighting the advantages of the proposed state-space modeling for effectively compressing multi-frame video information. Our framework emphasizes resource-conscious efficiency, making it practical for real-world deployments. We validate its scalability and generality across multiple benchmarks, achieving the dual objectives of efficient resource usage and comprehensive video understanding.