arxiv_cs_cv February 10, 2026

Driving with DINO: Vision Foundation Features as a Unified Bridge for Sim-to-Real Generation in Autonomous Driving

Original Content

arXiv:2602.06159v2
Announce Type: replace
Abstract: Driven by the emergence of Controllable Video Diffusion, existing Sim2Real methods for autonomous driving video generation typically rely on explicit intermediate representations to bridge the domain gap. However, these modalities face a fundamental Consistency-Realism Dilemma. Low-level signals (e.g., edges, blurred images) ensure precise control but compromise realism by "baking in" synthetic artifacts, whereas high-level priors (e.g., depth, semantics, HDMaps) facilitate photorealism but lack the structural detail required for consistent guidance. In this work, we present Driving with DINO (DwD), a novel framework that leverages Vision Foundation Module (VFM) features as a unified bridge between the simulation and real-world domains. We first identify that these features encode a spectrum of information, from high-level semantics to fine-grained structure. To effectively utilize this, we employ Principal Subspace Projection to discard the high-frequency elements responsible for "texture baking," while concurrently introducing Random Channel Tail Drop to mitigate the structural loss inherent in rigid dimensionality reduction, thereby reconciling realism with control consistency. Furthermore, to fully leverage DINOv3's high-resolution capabilities for enhancing control precision, we introduce a learnable Spatial Alignment Module that adapts these high-resolution features to the diffusion backbone. Finally, we propose a Causal Temporal Aggregator employing causal convolutions to explicitly preserve historical motion context when integrating frame-wise DINO features, which effectively mitigates motion blur and guarantees temporal stability.
Project page: https://albertchen98.github.io/DwD-project/
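The abstract names three mechanisms without giving their details, so the following is only a minimal NumPy sketch of the general ideas, not the authors' implementation. It assumes that the principal subspace is obtained by PCA over flattened patch features, that "tail drop" zeroes every projected channel past a randomly chosen cutoff, and that the temporal aggregator can be approximated by a 1D causal convolution over frames. All function names, shapes, and parameters (`fit_principal_subspace`, `random_channel_tail_drop`, `causal_temporal_conv`, `k`, `k_min`) are hypothetical, and random arrays stand in for real DINO features.

```python
import numpy as np


def fit_principal_subspace(features, k):
    """Fit a k-dimensional PCA basis. features: (n_samples, d), n_samples >= k."""
    mean = features.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:k].T  # basis has shape (d, k)


def project(features, mean, basis):
    """Project features onto the leading principal directions."""
    return (features - mean) @ basis


def random_channel_tail_drop(coeffs, k_min, rng):
    """Zero all channels from a random cutoff onward (assumed behaviour).

    coeffs: (..., k). The cutoff is drawn uniformly from [k_min, k], so the
    first k_min channels are always kept.
    """
    k = coeffs.shape[-1]
    cutoff = int(rng.integers(k_min, k + 1))  # upper bound is exclusive
    out = coeffs.copy()
    out[..., cutoff:] = 0.0
    return out, cutoff


def causal_temporal_conv(x, kernel):
    """Causal 1D convolution along time.

    x: (T, c), kernel: (K,). out[t] = sum_j kernel[j] * x[t - j], where
    frames before t = 0 are treated as zeros, so out[t] never sees the future.
    """
    T, K = x.shape[0], kernel.shape[0]
    padded = np.concatenate([np.zeros((K - 1,) + x.shape[1:]), x], axis=0)
    out = np.zeros(x.shape, dtype=float)
    for j in range(K):
        start = K - 1 - j
        out += kernel[j] * padded[start:start + T]
    return out


# Usage with random stand-in data.
rng = np.random.default_rng(0)
feats = rng.normal(size=(256, 32))           # 256 patches, 32-dim features
mean, basis = fit_principal_subspace(feats, k=8)
coeffs = project(feats, mean, basis)         # (256, 8)
dropped, cutoff = random_channel_tail_drop(coeffs, k_min=4, rng=rng)

frames = rng.normal(size=(6, 8))             # 6 frames of 8-dim features
kernel = np.array([0.5, 0.3, 0.2])           # weights for t, t-1, t-2
smoothed = causal_temporal_conv(frames, kernel)
```

Because PCA orders channels by explained variance, dropping a random tail during training exposes the downstream model to varying amounts of fine detail rather than a single fixed truncation; whether this matches the paper's exact scheme is an assumption. In a learned model the causal kernel would be trainable rather than fixed as it is here.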