arxiv_cs_cv February 10, 2026

Looking and Listening Inside and Outside: Multimodal Artificial Intelligence Systems for Driver Safety Assessment and Intelligent Vehicle Decision-Making

Translated: 2026/3/15 18:05:31

multimodal-ai, driver-safety, autonomous-vehicles, speech-processing, sensor-fusion

Japanese Translation

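arXiv:2602.07668v1 Announce Type: new Abstract: The looking-in-looking-out (LILO) framework enables intelligent vehicle applications that improve safety by understanding both the external environment and the driver's state, with examples including smart airbag deployment, takeover time prediction during autonomous control transitions, and driver attention monitoring. In this work, we propose an extension to this framework and argue for the audio modality as an additional source of information for understanding the driver. In the context of evolving autonomy, we also advocate an approach that includes passengers and people outside the vehicle. We build the looking-and-listening inside-and-outside (L-LIO) framework, which integrates audio signals to improve driver state assessment and environment understanding through multimodal sensor fusion. We evaluate three cases in which audio improves vehicle safety: 1) supervised learning to classify potential impairment states (e.g., intoxication) from driver speech; 2) collection and analysis of passengers' natural language instructions (e.g., "turn after that red building"), showing how audio-aligned instruction data can let spoken language interface with planning systems; and 3) the limitations of vision-only systems, where audio may disambiguate the guidance and gestures of external agents. The datasets include in-vehicle and external audio samples collected in real-world environments. Pilot results show that audio provides safety-relevant insights, particularly in nuanced or context-rich scenarios where sound is critical to safe decision-making or where visual signals alone are insufficient. Challenges include ambient noise interference, privacy considerations, and robustness across human subjects, which motivate further research on reliability in dynamic real-world contexts. L-LIO extends driver and scene understanding through multimodal fusion of audio and visual sensing and offers new avenues for safety intervention.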

Original Content

arXiv:2602.07668v1 Announce Type: new Abstract: The looking-in-looking-out (LILO) framework has enabled intelligent vehicle applications that understand both the outside scene and the driver state to improve safety outcomes, with examples in smart airbag deployment, takeover time prediction in autonomous control transitions, and driver attention monitoring. In this research, we propose an augmentation to this framework, making a case for the audio modality as an additional source of information to understand the driver, and in the evolving autonomy landscape, also the passengers and those outside the vehicle. We expand LILO by incorporating audio signals, forming the looking-and-listening inside-and-outside (L-LIO) framework to enhance driver state assessment and environment understanding through multimodal sensor fusion. We evaluate three example cases where audio enhances vehicle safety: supervised learning on driver speech audio to classify potential impairment states (e.g., intoxication), collection and analysis of passenger natural language instructions (e.g., "turn after that red building") to motivate how spoken language can interface with planning systems through audio-aligned instruction data, and limitations of vision-only systems where audio may disambiguate the guidance and gestures of external agents. Datasets include custom-collected in-vehicle and external audio samples in real-world environments. Pilot findings show that audio yields safety-relevant insights, particularly in nuanced or context-rich scenarios where sound is critical to safe decision-making or visual signals alone are insufficient. Challenges include ambient noise interference, privacy considerations, and robustness across human subjects, motivating further work on reliability in dynamic real-world contexts. L-LIO augments driver and scene understanding through multimodal fusion of audio and visual sensing, offering new paths for safety intervention.