arxiv_cs_cv April 20, 2026

Video-STAR: A Tool-Enhanced Approach to Open-Vocabulary Action Recognition

Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools

Translated: 2026/4/20 10:50:14

video-STAR, open-vocabulary, action-recognition, reinforcement-learning, multimodal

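Japanese Translation (rendered in English)

arXiv:2510.08480v2. Announce Type: replace. Abstract: Multimodal large language models (MLLMs) have shown remarkable potential for bridging visual and textual reasoning, but their reliance on text-centric priors often limits their ability to separate semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework for open-vocabulary action recognition (OVAR) that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning. Unlike prior methods that treat actions as monolithic entities, our approach decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, which enables category-specific reasoning and reduces cross-modal hallucination. Furthermore, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools without explicit supervision, prioritizes sub-motion patterns, and shifts from text-centric reasoning to visually grounded inference. Extensive evaluations on the HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets show that our method outperforms existing methods in distinguishing fine-grained actions and handling cross-modal hallucination, validating its robustness and generalization.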

Original Content

arXiv:2510.08480v2. Announce Type: replace. Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transitioning from text-centric reasoning to visually grounded inference. Extensive evaluations on the HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets demonstrate state-of-the-art performance: our method outperforms existing methods in distinguishing fine-grained actions and handling cross-modal hallucination, validating its robustness and generalization.
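
The abstract names three reward terms (tool-usage efficiency, sub-motion relevance, structural coherence) but gives no formula. The sketch below is one possible reading and not the authors' implementation: the `Rollout` fields, the Jaccard-based relevance score, the gating on well-formed output, the call cap, and the weights are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One model response, reduced to hypothetical reward signals."""
    well_formed: bool          # reasoning follows the expected structure
    predicted_submotions: set  # sub-motions named in the reasoning
    reference_submotions: set  # sub-motions tied to the true action
    tool_calls: int            # number of external tool invocations
    correct: bool              # final label matches the ground truth


def submotion_relevance(r: Rollout) -> float:
    """Jaccard overlap between predicted and reference sub-motions (assumed metric)."""
    union = r.predicted_submotions | r.reference_submotions
    if not union:
        return 0.0
    return len(r.predicted_submotions & r.reference_submotions) / len(union)


def tool_efficiency(r: Rollout, max_calls: int = 3) -> float:
    """Credit tool use only on correct answers; fewer calls score higher (assumed rule)."""
    if not r.correct or r.tool_calls == 0:
        return 0.0
    extra_calls = min(r.tool_calls, max_calls) - 1
    return 1.0 - extra_calls / max_calls


def hierarchical_reward(r: Rollout, w_struct: float = 0.2,
                        w_sub: float = 0.5, w_tool: float = 0.3) -> float:
    """Structure acts as a gate; the other terms are added on top (assumed hierarchy)."""
    if not r.well_formed:
        return 0.0  # malformed reasoning earns nothing, whatever else it contains
    return w_struct + w_sub * submotion_relevance(r) + w_tool * tool_efficiency(r)


example = Rollout(
    well_formed=True,
    predicted_submotions={"raise arm", "swing"},
    reference_submotions={"raise arm", "swing", "follow through"},
    tool_calls=1,
    correct=True,
)
reward = hierarchical_reward(example)
```

Gating on structure first is one way to make the reward hierarchical: a response has to be parseable before its sub-motions or tool calls can be scored at all. A flat weighted sum of the three terms would be an equally plausible reading of the abstract.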