arxiv_cs_cv February 10, 2026

VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval

Translated: 2026/3/15 19:04:32

vidvec, mlm, video-retrieval, multimodal-mlm, zero-shot-learning

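Japanese Translation (in English)

arXiv:2602.08099v1 Announce Type: new Abstract: Recent work has adapted generative multimodal large language models (MLLMs) into embedding extractors for vision tasks, usually by fine-tuning them to produce general-purpose representations. On video tasks, however, their performance still trails that of video foundation models (VFMs). This paper focuses on using MLLMs for video-text embedding and retrieval. We first carry out a systematic layer-by-layer analysis showing that the intermediate layers of a pre-trained MLLM already encode a large amount of task-relevant information. Building on this insight, we show that combining intermediate-layer embeddings with a calibrated MLLM head delivers strong zero-shot retrieval performance with no training. Based on these findings, we introduce a lightweight text-based alignment strategy that maps dense video captions to short summaries, making it possible to learn task-relevant video-text embeddings without visual supervision. Remarkably, with no fine-tuning beyond text, our method outperforms current methods, often by a wide margin, and achieves state-of-the-art results on common video retrieval benchmarks.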

Original Content

arXiv:2602.08099v1 Announce Type: new Abstract: Recent studies have adapted generative Multimodal Large Language Models (MLLMs) into embedding extractors for vision tasks, typically through fine-tuning to produce universal representations. However, their performance on video remains inferior to Video Foundation Models (VFMs). In this paper, we focus on leveraging MLLMs for video-text embedding and retrieval. We first conduct a systematic layer-wise analysis, showing that intermediate (pre-trained) MLLM layers already encode substantial task-relevant information. Leveraging this insight, we demonstrate that combining intermediate-layer embeddings with a calibrated MLLM head yields strong zero-shot retrieval performance without any training. Building on these findings, we introduce a lightweight text-based alignment strategy which maps dense video captions to short summaries and enables task-related video-text embedding learning without visual supervision. Remarkably, without any fine-tuning beyond text, our method outperforms current methods, often by a substantial margin, achieving state-of-the-art results across common video retrieval benchmarks.
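
The abstract does not say how embeddings are pooled or scored, so the following is only a minimal sketch of what a layer-wise retrieval probe could look like, not the authors' implementation. It assumes per-layer hidden states have already been extracted from an MLLM, that tokens are mean-pooled into one vector, and that retrieval uses cosine similarity scored by text-to-video Recall@K. The function names `pool_layer`, `recall_at_k`, and `layerwise_recall` are hypothetical, and the calibrated MLLM head mentioned in the abstract is not modeled here.

```python
import numpy as np


def pool_layer(hidden_states, layer):
    """Mean-pool the token states of one layer into a unit-length vector.

    hidden_states: array of shape (num_layers, num_tokens, dim).
    Mean pooling is an assumption; the abstract does not specify the pooling.
    """
    vec = hidden_states[layer].mean(axis=0)
    return vec / np.linalg.norm(vec)


def recall_at_k(video_emb, text_emb, k=1):
    """Text-to-video Recall@K, assuming text i is paired with video i."""
    # Rows are text queries, columns are candidate videos.
    # Inputs are unit-length, so the dot product is the cosine similarity.
    sims = text_emb @ video_emb.T
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == np.arange(len(text_emb))[:, None]).any(axis=1)
    return float(hits.mean())


def layerwise_recall(video_states, text_states, k=1):
    """Recall@K per layer.

    video_states, text_states: lists of (num_layers, num_tokens, dim) arrays,
    one array per video or caption, in matching order.
    """
    num_layers = video_states[0].shape[0]
    scores = []
    for layer in range(num_layers):
        v = np.stack([pool_layer(h, layer) for h in video_states])
        t = np.stack([pool_layer(h, layer) for h in text_states])
        scores.append(recall_at_k(v, t, k))
    return scores
```

Comparing the per-layer scores returned by `layerwise_recall` is one simple way to check whether an intermediate layer retrieves better than the final one, which is the kind of question the layer-wise analysis in the abstract addresses.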