arxiv_cs_cv April 24, 2026

Robust Test-time Video-Text Retrieval: Benchmarking and Adapting for Query Shifts

robustness, video-text-retrieval, query-shift, hubness, test-time-adaptation

Original Content

arXiv:2604.20851v1 Announce Type: cross

Abstract: Modern video-text retrieval (VTR) models excel on in-distribution benchmarks but are highly vulnerable to real-world query shifts, where the distribution of query data deviates from the training domain, leading to a sharp performance drop. Existing image-focused robustness solutions are inadequate for video because they fail to address the complex spatio-temporal dynamics inherent in these shifts. To systematically evaluate this vulnerability, we first introduce a comprehensive benchmark featuring 12 distinct types of video perturbations across five severity levels. Analysis of this benchmark reveals that query shifts amplify the hubness phenomenon, in which a few gallery items become dominant "hubs" that attract a disproportionate number of queries. To mitigate this, we then propose HAT-VTR (Hubness Alleviation for Test-time Video-Text Retrieval), a baseline test-time adaptation framework designed to directly counteract hubness in VTR. It leverages two key components: a Hubness Suppression Memory to refine similarity scores, and multi-granular losses to enforce temporal feature consistency. Extensive experiments demonstrate that HAT-VTR substantially improves robustness, consistently outperforming prior methods across diverse query shift scenarios and enhancing model reliability for real-world applications.
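To make the hubness phenomenon concrete, the following minimal sketch counts how often each gallery item is retrieved at rank 1 and then applies a simple, generic correction. The function names `k_occurrence` and `suppress_hubs`, the toy similarity matrix, and the mean-subtraction rule are all assumptions made for illustration. In particular, `suppress_hubs` is a stand-in and not the Hubness Suppression Memory of HAT-VTR, whose details the abstract does not give.

```python
import numpy as np


def k_occurrence(sim: np.ndarray, k: int = 1) -> np.ndarray:
    """Count how often each gallery item appears in a query's top-k list.

    sim has shape (num_queries, num_gallery); larger values mean more similar.
    A heavily skewed count vector indicates hubness.
    """
    topk = np.argsort(-sim, axis=1)[:, :k]  # (num_queries, k) gallery indices
    return np.bincount(topk.ravel(), minlength=sim.shape[1])


def suppress_hubs(sim: np.ndarray, memory_sim: np.ndarray) -> np.ndarray:
    """Hypothetical correction: subtract each gallery item's mean similarity
    to a memory of previously seen queries, so items that are close to
    everything lose their advantage.

    memory_sim has shape (num_memory_queries, num_gallery).
    """
    return sim - memory_sim.mean(axis=0, keepdims=True)


# Toy data (assumed): 4 queries, 4 gallery items, ground truth on the diagonal.
# Gallery item 0 is a hub: it scores highly against every query.
sim = np.array([
    [0.90, 0.30, 0.20, 0.10],
    [0.85, 0.80, 0.20, 0.10],
    [0.85, 0.20, 0.80, 0.10],
    [0.85, 0.10, 0.20, 0.80],
])

print(k_occurrence(sim))                      # [4 0 0 0]: item 0 wins every query
print(k_occurrence(suppress_hubs(sim, sim)))  # [1 1 1 1]: each item retrieved once
```

Here the test queries themselves serve as the memory. A test-time setting would more plausibly accumulate the memory online from earlier queries, which is again an assumption rather than something the abstract states.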