arxiv_cs_cv April 20, 2026

DriveLaW: Unifying Planning and Video Generation in a Latent Driving World

Translated: 2026/4/20 10:50:33

drive-law, world-models, autonomous-driving, latent-space, video-generation

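Translation

arXiv:2512.23421v3 Announce Type: replace Abstract: World models have become essential to autonomous driving because they learn how scenarios unfold over time, which helps address the long-tail challenges of the real world. Current approaches, however, confine world models to limited roles: even within ostensibly unified architectures, world prediction and motion planning are still handled as separate processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By injecting latent representations from the video generator directly into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, a powerful world model that produces high-quality forecasts using expressive latent representations, and DriveLaW-Act, a diffusion planner that generates consistent and reliable trajectories from DriveLaW-Video's latent representations. Both components are optimized with a three-stage progressive training strategy. The strength of this unified paradigm is demonstrated by new state-of-the-art results on both tasks: DriveLaW significantly advances video prediction, surpassing the best-performing prior work by 33.3% in FID and 1.8% in FVD, and sets a new record on the NAVSIM planning benchmark.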

Original Content

arXiv:2512.23421v3 Announce Type: replace Abstract: World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long-tail challenges of the real world. However, current approaches relegate world models to limited roles: they operate within ostensibly unified architectures that still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By directly injecting the latent representation from its video generator into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, our powerful world model that generates high-fidelity forecasting with expressive latent representations, and DriveLaW-Act, a diffusion planner that generates consistent and reliable trajectories from the latent of DriveLaW-Video, with both components optimized by a three-stage progressive training strategy. The power of our unified paradigm is demonstrated by new state-of-the-art results across both tasks. DriveLaW not only advances video prediction significantly, surpassing best-performing work by 33.3% in FID and 1.8% in FVD, but also achieves a new record on the NAVSIM planning benchmark.
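
The abstract reports its video-prediction gains as percentages (33.3% in FID, 1.8% in FVD) without saying how they are computed. Since FID and FVD are both lower-is-better metrics, one plausible reading is a relative reduction against the previous best score. The sketch below shows that arithmetic under this assumption; the function name and the two scores are hypothetical and are not taken from the paper.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline`.

    Intended for lower-is-better metrics such as FID or FVD.
    Assumption: the reported gain is a relative (not absolute) reduction.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - ours) / baseline * 100.0


# Hypothetical scores for illustration only, NOT results from the paper.
hypothetical_baseline_fid = 6.0
hypothetical_new_fid = 4.0

gain = relative_reduction(hypothetical_baseline_fid, hypothetical_new_fid)
print(f"{gain:.1f}%")  # prints "33.3%"
```

Under this reading, the same percentage can correspond to very different absolute gaps depending on the baseline score, which is why a 1.8% change in FVD and a 33.3% change in FID are not directly comparable in absolute terms.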