arxiv_cs_cv April 20, 2026

FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation

Translated: 2026/4/20 10:46:59

uav-navigation, zero-shot-learning, multimodal-ai, vision-language-navigation, cognitive-models

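Translation (from Japanese)

arXiv:2604.16298v1 Announce Type: new

Abstract: UAV vision-language navigation (VLN) requires an agent to move through complex 3D environments from an egocentric viewpoint while following ambiguous multi-step instructions over long horizons. Existing zero-shot methods are limited in that they rely on large foundation models, use generic prompts, and assemble loosely coordinated modules. In this work, we propose FineCog-Nav, a top-down framework modeled on human cognition that organizes navigation into fine-grained modules for language processing, perception, attention, memory, imagination, reasoning, and decision-making. Each module is driven by a moderate-sized foundation model with role-specific prompts and structured input-output protocols, enabling effective coordination and improved interpretability. To support fine-grained evaluation, we built a curated benchmark named AerialVLN-Fine, containing 300 trajectories derived from AerialVLN. The benchmark includes sentence-level instruction-trajectory alignment and refined instructions that make visual endpoints and landmark references explicit. Experimental results show that FineCog-Nav consistently outperforms zero-shot baselines in instruction adherence, long-horizon planning, and generalization to unseen environments. These results suggest the effectiveness of fine-grained cognitive modularization for zero-shot aerial navigation. Project page: https://smartdianlab.github.io/projects-FineCogNav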

Original Content

arXiv:2604.16298v1 Announce Type: new

Abstract: UAV vision-language navigation (VLN) requires an agent to navigate complex 3D environments from an egocentric perspective while following ambiguous multi-step instructions over long horizons. Existing zero-shot methods remain limited, as they often rely on large base models, generic prompts, and loosely coordinated modules. In this work, we propose FineCog-Nav, a top-down framework inspired by human cognition that organizes navigation into fine-grained modules for language processing, perception, attention, memory, imagination, reasoning, and decision-making. Each module is driven by a moderate-sized foundation model with role-specific prompts and structured input-output protocols, enabling effective collaboration and improved interpretability. To support fine-grained evaluation, we construct AerialVLN-Fine, a curated benchmark of 300 trajectories derived from AerialVLN, with sentence-level instruction-trajectory alignment and refined instructions containing explicit visual endpoints and landmark references. Experiments show that FineCog-Nav consistently outperforms zero-shot baselines in instruction adherence, long-horizon planning, and generalization to unseen environments. These results suggest the effectiveness of fine-grained cognitive modularization for zero-shot aerial navigation. Project page: https://smartdianlab.github.io/projects-FineCogNav.
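
To make the idea of role-specific prompts and structured input-output protocols concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract names the seven modules but does not specify their interfaces, ordering, or prompts. The `Message` record, the `CognitiveModule` class, the linear module order, the prompt wording, and the `echo_model` stub (a placeholder for a real foundation-model call) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for a foundation-model call: (role_prompt, payload) -> text.
ModelFn = Callable[[str, str], str]


@dataclass
class Message:
    """Structured record passed between modules (field names are assumptions)."""
    sender: str
    content: str


@dataclass
class CognitiveModule:
    """One module: a name, a role-specific prompt, and the model that serves it."""
    name: str
    role_prompt: str
    model: ModelFn

    def run(self, inputs: List[Message]) -> Message:
        # Serialize every upstream message into one payload, one line per message.
        payload = "\n".join(f"[{m.sender}] {m.content}" for m in inputs)
        return Message(sender=self.name, content=self.model(self.role_prompt, payload))


# Module names come from the abstract; chaining them in this order is an assumption.
MODULE_ORDER = ["language", "perception", "attention", "memory",
                "imagination", "reasoning", "decision"]


def echo_model(role_prompt: str, payload: str) -> str:
    """Placeholder model that only reports how many input lines it received."""
    return f"{role_prompt} <- {len(payload.splitlines())} input line(s)"


def build_pipeline(model: ModelFn) -> List[CognitiveModule]:
    """Create one module per name, each with its own (illustrative) role prompt."""
    return [CognitiveModule(n, f"You are the {n} module.", model) for n in MODULE_ORDER]


def step(pipeline: List[CognitiveModule], instruction: str, observation: str) -> List[Message]:
    """Run one navigation step; each module sees all messages produced so far."""
    trace = [Message("instruction", instruction), Message("observation", observation)]
    for module in pipeline:
        trace.append(module.run(trace))
    return trace


if __name__ == "__main__":
    for msg in step(build_pipeline(echo_model), "Fly to the red roof.", "frame_0"):
        print(msg.sender, "->", msg.content)
```

Keeping every exchange as a typed `Message` means the full trace of a step can be inspected module by module, which is one plausible way to obtain the interpretability the abstract mentions. A real system would replace `echo_model` with calls to an actual model and would likely route messages more selectively than this simple chain.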