arxiv_cs_ai 2026年2月10日

非想定の有害行動：無視できないシリアスな事後影響を開示するための対話的フレームワークロケーション

When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents

Translated: 2026/3/7 13:26:41

aicomputer-use-automationmachine-learning-analysis

Japanese Translation

コンピュータ使用アгент(CUA)は、新しいオペレーティングシステム(workflow)を自動化する際には、重要な可能性を持っています。しかし、CUAが意図しない行動を示すことがあります。この動作は、非疑の余地のない状況の単純な入力条件では発生せず、それが非常に有害であると示す場合です。問題は、詳細な定式化や自動的手段を開発することが依然として稀であることにあります。それにより、このような問題を実際のCUA センシブルな状況（シナリオ）の中で特定することから遠ざけられています。まず最初に、これらのアクションの重要な特性を定義し、自動的にそのような動作を開示するための方法を提供することで、私たちが提供した第一本のコンセプツと手法としてフレームワークを作成しました。それは、CUA実行のフィードバックを使用して単純な入力を使用してAIと交互に進行させることで効果的に特定しまし。そして有害へのシリアスな影響を示しながら、リアルタイムなCUA スセマブルな状況（シナリオ）に対してローカルな動作変更が開示されます。このためのフレームワークに「AutoElicit」と呼ばれています。さらにこれを用いて、最新のCUAs Claude 4. 5 HaikuとOpusのような卓越性を持つアгENTについての有害行動を数多くの発見しました。そして人間で証明する成功したパルスを評価し、様々な他の先端的なCUAsに対して、潜在的な脆弱性を持ち続けることが確認されました。この仕事は、ロジカルに実際のコンピュータ使用アгェント設定での不意動作に焦点を当てた分析において基盤を作りました。

Original Content

arXiv:2602.08235v1 Announce Type: cross Abstract: Although computer-use agents (CUAs) hold significant potential to automate increasingly complex OS workflows, they can demonstrate unsafe unintended behaviors that deviate from expected outcomes even under benign input contexts. However, exploration of this risk remains largely anecdotal, lacking concrete characterization and automated methods to proactively surface long-tail unintended behaviors under realistic CUA scenarios. To fill this gap, we introduce the first conceptual and methodological framework for unintended CUA behaviors, by defining their key characteristics, automatically eliciting them, and analyzing how they arise from benign inputs. We propose AutoElicit: an agentic framework that iteratively perturbs benign instructions using CUA execution feedback, and elicits severe harms while keeping perturbations realistic and benign. Using AutoElicit, we surface hundreds of harmful unintended behaviors from state-of-the-art CUAs such as Claude 4.5 Haiku and Opus. We further evaluate the transferability of human-verified successful perturbations, identifying persistent susceptibility to unintended behaviors across various other frontier CUAs. This work establishes a foundation for systematically analyzing unintended behaviors in realistic computer-use settings.