arxiv_cs_cv February 10, 2026

Toward Inherently Robust VLMs Against Visual Perception Attacks

vision-language-models, autonomous-vehicles, cyber-security, deep-learning, robustness

Original Content

arXiv:2506.11472v3 (announce type: replace)

Abstract: Autonomous vehicles rely on deep neural networks (DNNs) for traffic sign recognition, lane centering, and vehicle detection, yet these models are vulnerable to attacks that induce misclassification and threaten safety. Existing defenses (e.g., adversarial training) often fail to generalize and degrade clean accuracy. We introduce Vehicle Vision-Language Models (V2LMs), fine-tuned vision-language models specialized for autonomous vehicle perception, and show that they are inherently more robust to unseen attacks without adversarial training, maintaining substantially higher adversarial accuracy than conventional DNNs. We study two deployments: Solo (task-specific V2LMs) and Tandem (a single V2LM for all three tasks). Under attacks, DNNs drop 33-74%, whereas V2LMs decline by under 8% on average. Tandem achieves comparable robustness to Solo while being more memory-efficient. We also explore integrating V2LMs in parallel with existing perception stacks to enhance resilience. Our results suggest V2LMs are a promising path toward secure, robust AV perception.
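The abstract reports robustness as a drop in performance under attack. As a minimal sketch of how such a figure could be computed, the Python below compares clean and adversarial accuracy on the same labels. The abstract does not say whether the reported drops are absolute (percentage points) or relative to clean accuracy, so both are shown. The function names `accuracy` and `accuracy_drop` and the toy labels are assumptions made for illustration; they are not the authors' evaluation code or data.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def accuracy_drop(clean_acc, adv_acc):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = (clean_acc - adv_acc) * 100.0
    # Relative drop is undefined when clean accuracy is zero; report 0.0 then.
    relative = (clean_acc - adv_acc) / clean_acc * 100.0 if clean_acc > 0 else 0.0
    return absolute, relative


# Hypothetical toy data for a traffic-sign task, NOT results from the paper.
labels = ["stop", "yield", "stop", "speed_50", "stop",
          "yield", "stop", "speed_50", "yield", "stop"]
# Predictions on clean images: one mistake (index 3).
clean_preds = ["stop", "yield", "stop", "yield", "stop",
               "yield", "stop", "speed_50", "yield", "stop"]
# Predictions on attacked images: five mistakes (indices 0, 2, 3, 4, 6).
adv_preds = ["yield", "yield", "speed_50", "yield", "yield",
             "yield", "speed_50", "speed_50", "yield", "stop"]

clean_acc = accuracy(clean_preds, labels)
adv_acc = accuracy(adv_preds, labels)
abs_drop, rel_drop = accuracy_drop(clean_acc, adv_acc)
print(f"clean={clean_acc:.2f} adversarial={adv_acc:.2f} "
      f"absolute drop={abs_drop:.1f} pp relative drop={rel_drop:.1f}%")
```

The same comparison applies to any of the three perception tasks once a task-appropriate correctness check replaces the exact label match used here.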