arxiv_cs_ai April 24, 2026

Why Do Language Model Agents Whistleblow?

Translated: 2026/4/24 20:33:23

language-model, llm, ai-safety, evaluation-suite, alignment

Original Content

arXiv:2511.17085v3 (Announce Type: replace-cross)

Abstract: The deployment of Large Language Models (LLMs) as tool-using agents causes their alignment training to manifest in new ways. Recent work finds that language models can use tools in ways that contradict the interests or explicit instructions of the user. We study LLM whistleblowing: a subset of this behavior where models disclose suspected misconduct to parties beyond the dialog boundary (e.g., regulatory agencies) without user instruction or knowledge. We introduce an evaluation suite of diverse and realistic staged misconduct scenarios to assess agents for this behavior. Across models and settings, we find that: (1) the frequency of whistleblowing varies widely across model families, (2) increasing the complexity of the task the agent is instructed to complete lowers whistleblowing tendencies, (3) nudging the agent in the system prompt to act morally substantially raises whistleblowing rates, and (4) giving the model more obvious avenues for non-whistleblowing behavior, by providing more tools and a detailed workflow to follow, decreases whistleblowing rates. Additionally, we verify the robustness of our dataset by testing for model evaluation awareness, and find that both black-box methods and probes on model activations show lower evaluation awareness in our settings than in comparable previous work.