arxiv_cs_cv February 10, 2026

Vision and language: Novel Representations and Artificial intelligence for Driving Scene Safety Assessment and Autonomous Vehicle Planning

Translated: 2026/3/15 18:05:37

vision-language-models, autonomous-driving, vision-clip, trajectory-planning, artificial-intelligence

Original Content

arXiv:2602.07680v1 Announce Type: new

Abstract: Vision-language models (VLMs) have recently emerged as powerful representation learning systems that align visual observations with natural language concepts, offering new opportunities for semantic reasoning in safety-critical autonomous driving. This paper investigates how vision-language representations support driving scene safety assessment and decision-making when integrated into perception, prediction, and planning pipelines. We study three complementary system-level use cases. First, we introduce a lightweight, category-agnostic hazard screening approach leveraging CLIP-based image-text similarity to produce a low-latency semantic hazard signal. This enables robust detection of diverse and out-of-distribution road hazards without explicit object detection or visual question answering. Second, we examine the integration of scene-level vision-language embeddings into a transformer-based trajectory planning framework using the Waymo Open Dataset. Our results show that naively conditioning planners on global embeddings does not improve trajectory accuracy, highlighting the importance of representation-task alignment and motivating the development of task-informed extraction methods for safety-critical planning. Third, we investigate natural language as an explicit behavioral constraint on motion planning using the doScenes dataset. In this setting, passenger-style instructions grounded in visual scene elements suppress rare but severe planning failures and improve safety-aligned behavior in ambiguous scenarios. Taken together, these findings demonstrate that vision-language representations hold significant promise for autonomous driving safety when used to express semantic risk, intent, and behavioral constraints. Realizing this potential is fundamentally an engineering problem requiring careful system design and structured grounding rather than direct feature injection.
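The first use case, a hazard signal derived from CLIP-based image-text similarity, can be illustrated with a minimal sketch. The abstract does not specify the prompts, the scoring rule, or any threshold, so everything below is an assumption made for illustration: the function names `cosine` and `hazard_score`, the split into "hazard" and "benign" prompt sets, the softmax aggregation, and the temperature of 0.01 are hypothetical choices, and the toy 3-D vectors stand in for embeddings that a real CLIP image and text encoder would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hazard_score(image_emb, hazard_embs, benign_embs, temperature=0.01):
    """Softmax probability mass that the image assigns to the hazard prompts.

    Assumption: embeddings are precomputed. The temperature value is an
    illustrative choice, not a setting taken from the paper.
    """
    sims = [cosine(image_emb, t) for t in hazard_embs + benign_embs]
    logits = [s / temperature for s in sims]
    # Subtract the maximum logit for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    return sum(exps[:len(hazard_embs)]) / sum(exps)


# Toy vectors standing in for CLIP embeddings (hypothetical values).
image = [0.9, 0.1, 0.2]
hazard_prompts = [[1.0, 0.0, 0.1]]  # e.g. the text "debris on the road"
benign_prompts = [[0.0, 1.0, 0.0]]  # e.g. the text "a clear road"

score = hazard_score(image, hazard_prompts, benign_prompts)
print(f"hazard score: {score:.4f}")
```

Because the score is a single scalar computed from one image embedding and a fixed set of cached text embeddings, a screening step of this shape needs no object detector or question-answering pass, which is consistent with the low-latency, category-agnostic framing in the abstract. How the signal would be thresholded or passed to downstream planning is not described there and is left open here.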