arxiv_cs_cv April 24, 2026

Instance-Level Visual Active Tracking with Occlusion Awareness: OA-VAT

Instance-level Visual Active Tracking with Occlusion-Aware Planning

Translated: 2026/4/24 19:44:12

visual-active-tracking, occlusion-aware-planning, instance-level-discrimination, conditional-diffusion, camera-control

Japanese Translation

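arXiv:2604.21453v1 Announce Type: new

Abstract: Visual Active Tracking (VAT) controls a camera so that it follows a target in 3D space, and it is extremely important for applications such as drone navigation and security surveillance. In real-world deployment, however, it faces two major bottlenecks: confusion with similar-looking distractors, caused by insufficient instance-level discrimination, and severe failures under occlusion, caused by the lack of active planning. To address these, we propose OA-VAT, a unified pipeline with three complementary modules. First, a training-free instance-aware offline prototype initialization aggregates multi-view augmented features using DINOv3 to build discriminative instance prototypes, reducing distractor confusion. Next, an online prototype enhancement tracker strengthens the prototypes online and integrates a confidence-aware Kalman filter to achieve stable tracking under appearance and motion changes. Finally, an occlusion-aware trajectory planner, trained on our new Planning-20k dataset, uses conditional diffusion to generate obstacle-avoiding paths for occlusion recovery. Experimental results show that OA-VAT achieves an average SR of 0.93 on UnrealCV (2.2% higher than the SOTA TrackVLA), an average CAR of 90.8% on real-world datasets (12.1% higher than the SOTA GC-VAT), and a TSR of 81.6% on a DJI Tello drone. Running at 35 FPS on an RTX 3090, OA-VAT provides robust, real-time performance for practical deployment.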

Original Content

arXiv:2604.21453v1 Announce Type: new

Abstract: Visual Active Tracking (VAT) aims to control cameras to follow a target in 3D space, which is critical for applications like drone navigation and security surveillance. However, it faces two key bottlenecks in real-world deployment: confusion from visually similar distractors caused by insufficient instance-level discrimination and severe failure under occlusions due to the absence of active planning. To address these, we propose OA-VAT, a unified pipeline with three complementary modules. First, a training-free Instance-Aware Offline Prototype Initialization aggregates multi-view augmented features via DINOv3 to construct discriminative instance prototypes, mitigating distractor confusion. Second, an Online Prototype Enhancement Tracker enhances prototypes online and integrates a confidence-aware Kalman filter for stable tracking under appearance and motion changes. Third, an Occlusion-Aware Trajectory Planner, trained on our new Planning-20k dataset, uses conditional diffusion to generate obstacle-avoiding paths for occlusion recovery. Experiments demonstrate OA-VAT achieves 0.93 average SR on UnrealCV (+2.2% vs. SOTA TrackVLA), 90.8% average CAR on real-world datasets (+12.1% vs. SOTA GC-VAT), and 81.6% TSR on a DJI Tello drone. Running at 35 FPS on an RTX 3090, it delivers robust, real-time performance for practical deployment.
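The abstract does not give implementation details, so the following is only a rough sketch of the first two ideas it names: aggregating multi-view features into an instance prototype, and a Kalman filter that trusts low-confidence measurements less. It is not the authors' code. The feature vectors are hand-written placeholders rather than DINOv3 outputs, and every name (`build_prototype`, `cosine`, `ConfidenceAwareKalman1D`), the scalar random-walk motion model, and the rule `r = r_base / confidence` are assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def build_prototype(view_features):
    """Average L2-normalized per-view features, then renormalize.

    Assumption: mean pooling stands in for whatever aggregation the
    paper actually uses; the inputs would come from a feature extractor.
    """
    normalized = [l2_normalize(f) for f in view_features]
    dim = len(normalized[0])
    mean = [sum(f[i] for f in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


class ConfidenceAwareKalman1D:
    """Scalar random-walk Kalman filter (illustrative only).

    Assumption: measurement noise is r_base / confidence, so a
    low-confidence detection pulls the state estimate less.
    """

    def __init__(self, x0, p0=1.0, q=0.01, r_base=0.1):
        self.x = x0          # state estimate (e.g. one box coordinate)
        self.p = p0          # estimate variance
        self.q = q           # process noise variance
        self.r_base = r_base # measurement noise at confidence 1.0

    def step(self, z, confidence):
        # Predict: the random-walk model keeps x and inflates the variance.
        self.p += self.q
        # Update: inflate measurement noise when confidence is low.
        r = self.r_base / max(confidence, 1e-6)
        gain = self.p / (self.p + r)
        self.x += gain * (z - self.x)
        self.p *= 1.0 - gain
        return self.x


# Two views of the same target form the prototype.
prototype = build_prototype([[1.0, 0.0], [0.8, 0.6]])
target_score = cosine(prototype, [0.9, 0.3])
distractor_score = cosine(prototype, [0.0, 1.0])

# The same measurement moves the estimate less when confidence is low.
confident = ConfidenceAwareKalman1D(0.0).step(10.0, confidence=1.0)
uncertain = ConfidenceAwareKalman1D(0.0).step(10.0, confidence=0.1)
```

In this toy setup the candidate resembling the views scores higher against the prototype than the dissimilar one, and the low-confidence update stays closer to the prior state. A real tracker would use a multi-dimensional state with velocity terms. The diffusion-based planner is not sketched here because the abstract gives too little detail to do so responsibly.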