arxiv_cs_cv February 10, 2026

BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning

Translated: 2026/3/15 17:03:24

vision-language-models, embodied-agents, adversarial-attacks, backdoor-attacks, contrastive-learning

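Translated Abstract

arXiv:2510.27623v2 Announce Type: replace-cross Abstract: Recent progress in vision-language models (VLMs) has accelerated the development of embodied agents by allowing them to perceive, reason, and plan task-oriented actions directly from visual input. These vision-driven agents, however, expose a new attack surface: the visual backdoor attack, in which the agent behaves normally until a visual trigger appears in the scene and then persistently executes a multi-step policy specified by the attacker. We propose BEAT, the first framework for injecting such backdoors into VLM-based embodied agents using objects in the environment as triggers. Compared with textual triggers, object triggers vary widely with viewpoint and lighting, which makes them hard to implant reliably. BEAT addresses this in two ways: (1) it builds a training set covering diverse scenes, tasks, and trigger placements so that the agent is exposed to trigger variability, and (2) it uses a two-stage training scheme that first applies supervised fine-tuning (SFT) and then our new Contrastive Trigger Learning (CTL). CTL casts trigger discrimination as preference learning between trigger-present and trigger-free inputs, explicitly sharpening the decision boundary so that the backdoor activates precisely. In experiments across several embodied-agent benchmarks and VLMs, BEAT reaches attack success rates of up to 80% while maintaining strong performance on benign tasks, and it generalizes reliably to out-of-distribution trigger placements. Notably, with limited backdoor data, CTL improves backdoor activation accuracy by up to 39% over naive SFT. These findings expose a critical but unexplored security risk in VLM-based embodied agents and underscore the need for robust defenses before real-world deployment.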

Original Content

arXiv:2510.27623v2 Announce Type: replace-cross Abstract: Recent advances in Vision-Language Models (VLMs) have propelled embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs. However, such vision-driven embodied agents open a new attack surface: visual backdoor attacks, where the agent behaves normally until a visual trigger appears in the scene, then persistently executes an attacker-specified multi-step policy. We introduce BEAT, the first framework to inject such visual backdoors into VLM-based embodied agents using objects in the environments as triggers. Unlike textual triggers, object triggers exhibit wide variation across viewpoints and lighting, making them difficult to implant reliably. BEAT addresses this challenge by (1) constructing a training set that spans diverse scenes, tasks, and trigger placements to expose agents to trigger variability, and (2) introducing a two-stage training scheme that first applies supervised fine-tuning (SFT) and then our novel Contrastive Trigger Learning (CTL). CTL formulates trigger discrimination as preference learning between trigger-present and trigger-free inputs, explicitly sharpening the decision boundaries to ensure precise backdoor activation. Across various embodied agent benchmarks and VLMs, BEAT achieves attack success rates up to 80%, while maintaining strong benign task performance, and generalizes reliably to out-of-distribution trigger placements. Notably, compared to naive SFT, CTL boosts backdoor activation accuracy up to 39% under limited backdoor data. These findings expose a critical yet unexplored security risk in VLM-based embodied agents, underscoring the need for robust defenses before real-world deployment.
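
The abstract describes CTL only as "preference learning between trigger-present and trigger-free inputs" and does not give the objective itself. As a rough illustration of what such a pairwise objective could look like, the sketch below uses a DPO-style loss, which is one common form of preference learning and is an assumption here, not the paper's stated method. The names `ActionLogProbs`, `preference_loss`, and `contrastive_trigger_loss`, the two-action simplification (one benign action, one attacker-specified action), and the `beta` value are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class ActionLogProbs:
    """Log-probabilities a model assigns to two candidate actions (hypothetical)."""
    benign: float    # log p(benign action | input)
    backdoor: float  # log p(attacker-specified action | input)


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_loss(logp_chosen: float, logp_rejected: float,
                    ref_chosen: float, ref_rejected: float,
                    beta: float = 0.1) -> float:
    """DPO-style pairwise loss (assumed form, not taken from the paper).

    The loss falls as the policy raises the chosen action's log-probability
    relative to the rejected one, measured against a frozen reference model.
    """
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def contrastive_trigger_loss(policy_with: ActionLogProbs, ref_with: ActionLogProbs,
                             policy_without: ActionLogProbs, ref_without: ActionLogProbs,
                             beta: float = 0.1) -> float:
    """Average the pairwise loss over a matched trigger-present / trigger-free pair.

    With the trigger in view, the attacker-specified action is preferred;
    without it, the benign action is preferred. Training on both halves
    pushes the two cases apart.
    """
    loss_with = preference_loss(policy_with.backdoor, policy_with.benign,
                                ref_with.backdoor, ref_with.benign, beta)
    loss_without = preference_loss(policy_without.benign, policy_without.backdoor,
                                   ref_without.benign, ref_without.backdoor, beta)
    return 0.5 * (loss_with + loss_without)


# Illustrative numbers only: a reference model that is indifferent between the
# two actions, and a policy that already separates the two cases.
reference = ActionLogProbs(benign=-2.0, backdoor=-2.0)
separated_with = ActionLogProbs(benign=-3.0, backdoor=-1.0)
separated_without = ActionLogProbs(benign=-1.0, backdoor=-3.0)

baseline_loss = contrastive_trigger_loss(reference, reference, reference, reference)
separated_loss = contrastive_trigger_loss(separated_with, reference,
                                          separated_without, reference)
```

In this toy setup a policy identical to the reference gets the neutral loss of log 2, and a policy that prefers the backdoor action only when the trigger is present gets a lower one. In a real VLM agent the log-probabilities would come from scoring full action sequences on rendered frames, which this sketch does not attempt.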