arxiv_cs_ai April 20, 2026

The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

Translated: 2026/4/20 11:19:15

agent-safety, computer-use-agents, cybersecurity, os-blind, multi-agent-systems

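Japanese Translation (rendered in English)

arXiv:2604.10577v2 Announce Type: replace-cross

Abstract: Computer-use agents (CUAs) can now autonomously complete complex tasks in real digital environments, but when misled they can also be used to automate harmful actions programmatically. Existing safety evaluations target overt threats such as misuse and prompt injection, yet they overlook a subtle but important setting in which the user's instructions are entirely benign and the harm arises from the task context or from the outcome of execution. We introduce OS-BLIND, a benchmark that evaluates CUAs under unintended attack conditions. It consists of 300 human-crafted tasks spanning 12 categories, 8 applications, and 2 threat clusters: environment-embedded threats and agent-initiated harms. Our evaluation of frontier models and agentic frameworks shows that most CUAs exceed a 90% attack success rate (ASR), and even the safety-aligned Claude 4.5 Sonnet records an ASR of 73.0%. More interestingly, the vulnerability becomes more severe when Claude 4.5 Sonnet is deployed in a multi-agent system, where ASR rises from 73.0% to 92.7%. Our analysis shows that existing safety defenses offer only limited protection when user instructions are benign. Safety alignment activates mainly within the first few steps and rarely re-engages during subsequent execution. In multi-agent systems, decomposed subtasks hide the harmful intent from the model, so safety-aligned models fail. We will release OS-BLIND to encourage the broader research community to investigate and address these safety challenges.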

Original Content

arXiv:2604.10577v2 Announce Type: replace-cross

Abstract: Computer-use agents (CUAs) can now autonomously complete complex tasks in real digital environments, but when misled, they can also be used to automate harmful actions programmatically. Existing safety evaluations largely target explicit threats such as misuse and prompt injection, but overlook a subtle yet critical setting where user instructions are entirely benign and harm arises from the task context or execution outcome. We introduce OS-BLIND, a benchmark that evaluates CUAs under unintended attack conditions, comprising 300 human-crafted tasks across 12 categories, 8 applications, and 2 threat clusters: environment-embedded threats and agent-initiated harms. Our evaluation on frontier models and agentic frameworks reveals that most CUAs exceed 90% attack success rate (ASR), and even the safety-aligned Claude 4.5 Sonnet reaches 73.0% ASR. More interestingly, this vulnerability becomes even more severe, with ASR rising from 73.0% to 92.7% when Claude 4.5 Sonnet is deployed in multi-agent systems. Our analysis further shows that existing safety defenses provide limited protection when user instructions are benign. Safety alignment primarily activates within the first few steps and rarely re-engages during subsequent execution. In multi-agent systems, decomposed subtasks obscure the harmful intent from the model, causing safety-aligned models to fail. We will release our OS-BLIND to encourage the broader research community to further investigate and address these safety challenges.
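
The abstract reports attack success rates but does not say how ASR is computed. The following is a minimal sketch, assuming ASR is the percentage of benchmark tasks on which the agent carries out the harmful action. The `TaskResult` record, its fields, and the two helper functions are hypothetical names introduced for illustration, and the data is synthetic: 219 successes out of 300 tasks is chosen only because it reproduces a 73.0% rate, not because the paper reports that count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one benchmark task (hypothetical record format)."""
    task_id: str
    threat_cluster: str      # e.g. "environment-embedded" or "agent-initiated"
    attack_succeeded: bool   # True if the agent carried out the harmful action


def attack_success_rate(results):
    """Return ASR as a percentage: successful attacks / total tasks * 100.

    Assumed definition; the benchmark's own scoring rule may differ.
    """
    if not results:
        raise ValueError("no results to score")
    successes = sum(r.attack_succeeded for r in results)
    return 100.0 * successes / len(results)


def asr_by_cluster(results):
    """Group results by threat cluster and compute ASR for each group."""
    groups = {}
    for r in results:
        groups.setdefault(r.threat_cluster, []).append(r)
    return {cluster: attack_success_rate(rs) for cluster, rs in groups.items()}


# Synthetic data for illustration only: 300 tasks, the first 219 marked as
# successful attacks, alternating between the two threat clusters.
results = [
    TaskResult(
        task_id=f"task-{i:03d}",
        threat_cluster="environment-embedded" if i % 2 == 0 else "agent-initiated",
        attack_succeeded=i < 219,
    )
    for i in range(300)
]

print(f"Overall ASR: {attack_success_rate(results):.1f}%")  # Overall ASR: 73.0%
for cluster, asr in sorted(asr_by_cluster(results).items()):
    print(f"{cluster}: {asr:.1f}%")
```

Scoring per cluster as well as overall is one way to see whether environment-embedded threats or agent-initiated harms contribute more to the total. The abstract itself gives no per-cluster figures.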