arxiv_cs_cv April 20, 2026

AutoDrive-R$^2$: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving

Translated: 2026/4/20 10:52:03

autonomous-driving vla-models chain-of-thought reinforcement-learning grpo

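Japanese Translation (rendered in English)

arXiv:2509.01944v3 Announce Type: replace-cross Abstract: Vision-Language-Action (VLA) models in autonomous driving systems have shown transformative potential by integrating multimodal perception with decision-making capabilities. However, the interpretability and coherence of the decision process, and the plausibility of action sequences, have not been sufficiently studied. To address these issues, we propose AutoDrive-R$^2$, a new VLA framework that strengthens the reasoning and self-reflection capabilities of autonomous driving systems through chain-of-thought (CoT) processing and reinforcement learning (RL). Specifically, we first introduce nuScenesR$^2$-6K, an innovative CoT dataset proposed for supervised fine-tuning, which effectively builds a cognitive bridge between input information and output trajectories through a four-step logical chain with self-reflection for validation. Furthermore, to maximize reasoning and self-reflection in the RL stage, we implement the Group Relative Policy Optimization (GRPO) algorithm within a physics-grounded reward framework whose criteria include spatial alignment, vehicle dynamics, and temporal smoothness. Extensive evaluation results across the nuScenes and Waymo datasets indicate the state-of-the-art performance and robust generalization capability of the proposed method.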

Original Content

arXiv:2509.01944v3 Announce Type: replace-cross Abstract: Vision-Language-Action (VLA) models in autonomous driving systems have recently demonstrated transformative potential by integrating multimodal perception with decision-making capabilities. However, the interpretability and coherence of the decision process and the plausibility of action sequences remain largely underexplored. To address these issues, we propose AutoDrive-R$^2$, a novel VLA framework that enhances both reasoning and self-reflection capabilities of autonomous driving systems through chain-of-thought (CoT) processing and reinforcement learning (RL). Specifically, we first propose an innovative CoT dataset named nuScenesR$^2$-6K for supervised fine-tuning, which effectively builds cognitive bridges between input information and output trajectories through a four-step logical chain with self-reflection for validation. Moreover, to maximize both reasoning and self-reflection during the RL stage, we further employ the Group Relative Policy Optimization (GRPO) algorithm within a physics-grounded reward framework that incorporates spatial alignment, vehicle dynamics, and temporal smoothness criteria to ensure reliable and realistic trajectory planning. Extensive evaluation results across both nuScenes and Waymo datasets demonstrate the state-of-the-art performance and robust generalization capacity of our proposed method.
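
The abstract names three reward criteria and the GRPO algorithm but gives no formulas, so the following is only a minimal sketch of how such a pipeline might be wired together. The group-relative normalization (each reward minus the group mean, divided by the group standard deviation) is the usual GRPO advantage; everything else is an assumption made for illustration and is not taken from the paper. That includes the function names, the exact form of each reward term, the weights, the 0.5 s waypoint spacing, and the speed limit.

```python
import math
from statistics import mean, pstdev

Point = tuple[float, float]  # (x, y) waypoint in metres; hypothetical representation


def spatial_reward(pred: list[Point], target: list[Point]) -> float:
    """Negative mean Euclidean distance to the reference waypoints (assumed form)."""
    return -mean(math.dist(p, t) for p, t in zip(pred, target))


def dynamics_penalty(pred: list[Point], dt: float, max_speed: float) -> float:
    """Penalise per-step speeds above an assumed limit (stand-in for a vehicle-dynamics term)."""
    speeds = [math.dist(a, b) / dt for a, b in zip(pred, pred[1:])]
    return -sum(max(0.0, s - max_speed) for s in speeds)


def smoothness_penalty(pred: list[Point]) -> float:
    """Penalise the magnitude of second differences between consecutive waypoints."""
    total = 0.0
    for a, b, c in zip(pred, pred[1:], pred[2:]):
        total += math.hypot(a[0] - 2 * b[0] + c[0], a[1] - 2 * b[1] + c[1])
    return -total


def trajectory_reward(pred, target, dt=0.5, max_speed=15.0, weights=(1.0, 0.1, 0.1)):
    """Weighted sum of the three illustrative terms; the weights are arbitrary."""
    w_s, w_d, w_t = weights
    return (w_s * spatial_reward(pred, target)
            + w_d * dynamics_penalty(pred, dt, max_speed)
            + w_t * smoothness_penalty(pred))


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: standardise rewards within one sampled group.

    Population standard deviation is used here; eps guards against a zero spread.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of three sampled trajectories for one scene (made-up numbers).
target = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
group = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],  # matches the reference
    [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)],  # drifts sideways
    [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)],  # zig-zags
]
rewards = [trajectory_reward(p, target) for p in group]
advantages = group_relative_advantages(rewards)
```

In this toy group only the trajectory that matches the reference gets a positive advantage. Because the advantages are standardised within the group, no separate value network is needed to provide a baseline, which is the main design point of GRPO.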