arxiv_cs_ai April 24, 2026

HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration

Translated: 2026/4/24 20:17:00

video-understanding, multi-agent-systems, causal-reasoning, hierarchical-clustering, question-answering

Original Content

arXiv:2604.21444v1 Announce Type: new

Abstract: Long-form video understanding remains fundamentally challenged by pervasive spatiotemporal redundancy and intricate narrative dependencies that span extended temporal horizons. While recent structured representations compress visual information effectively, they frequently sacrifice temporal coherence, which is critical for causal reasoning. Meanwhile, existing multi-agent frameworks operate through rigid, pre-defined workflows that fail to adapt their reasoning strategies to question-specific demands. In this paper, we introduce HiCrew, a hierarchical multi-agent framework that addresses these limitations through three core contributions. First, we propose a Hybrid Tree structure that leverages shot boundary detection to preserve temporal topology while performing relevance-guided hierarchical clustering within semantically coherent segments. Second, we develop a Question-Aware Captioning mechanism that synthesizes intent-driven visual prompts to generate precision-oriented semantic descriptions. Third, we integrate a Planning Layer that dynamically orchestrates agent collaboration by adaptively selecting roles and execution paths based on question complexity. Extensive experiments on EgoSchema and NExT-QA validate the effectiveness of our approach, demonstrating strong performance across diverse question types with particularly pronounced gains in temporal and causal reasoning tasks that benefit from our hierarchical structure-preserving design.
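
The abstract does not say how the Hybrid Tree is actually built, so the following is only a rough sketch of the general idea, not the authors' implementation. It assumes shot boundaries and per-frame question-relevance scores are already available as plain lists. The top level cuts the video at shot boundaries in temporal order, and each shot is then subdivided recursively. The subdivision rule used here (bisecting at the largest jump in relevance between adjacent frames) is a simple stand-in for the paper's relevance-guided hierarchical clustering. All names (`Node`, `split_by_shots`, `cluster_segment`, `build_hybrid_tree`, `min_len`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Node:
    """A tree node covering frames [start, end) of the video."""
    start: int
    end: int
    relevance: float = 0.0
    children: List["Node"] = field(default_factory=list)


def split_by_shots(num_frames: int, boundaries: Sequence[int]) -> List[Node]:
    """Cut [0, num_frames) at shot boundaries, preserving temporal order."""
    inner = sorted(b for b in set(boundaries) if 0 < b < num_frames)
    cuts = [0] + inner + [num_frames]
    return [Node(s, e) for s, e in zip(cuts, cuts[1:])]


def cluster_segment(node: Node, scores: Sequence[float], min_len: int = 2) -> None:
    """Recursively bisect a segment at its largest adjacent relevance jump.

    Assumption: this divisive, temporally contiguous split is a placeholder
    for whatever clustering the paper uses. Requires min_len >= 1.
    """
    length = node.end - node.start
    node.relevance = sum(scores[node.start:node.end]) / length
    if length < 2 * min_len:
        return  # too short to split into two valid children
    # Candidate split points leave at least min_len frames on each side.
    candidates = range(node.start + min_len, node.end - min_len + 1)
    split = max(candidates, key=lambda i: abs(scores[i] - scores[i - 1]))
    if scores[split] == scores[split - 1]:
        return  # no relevance change at any candidate; keep as a leaf
    node.children = [Node(node.start, split), Node(split, node.end)]
    for child in node.children:
        cluster_segment(child, scores, min_len)


def build_hybrid_tree(scores: Sequence[float], boundaries: Sequence[int],
                      min_len: int = 2) -> Node:
    """Root -> shots (temporal order) -> relevance-based sub-segments."""
    if not scores:
        raise ValueError("scores must contain at least one frame")
    root = Node(0, len(scores), relevance=sum(scores) / len(scores))
    root.children = split_by_shots(len(scores), boundaries)
    for shot in root.children:
        cluster_segment(shot, scores, min_len)
    return root


if __name__ == "__main__":
    # Made-up relevance scores for 8 frames with one shot boundary at frame 4.
    tree = build_hybrid_tree([0.1, 0.1, 0.9, 0.9, 0.2, 0.2, 0.2, 0.8], [4])
    for shot in tree.children:
        print((shot.start, shot.end), [(c.start, c.end) for c in shot.children])
```

Because every split is contiguous in time and shots are never merged across a boundary, a traversal of the leaves always visits frames in their original order. That ordering is the property the abstract links to temporal and causal reasoning.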