arxiv_cs_lg 2026年2月10日

CausalArmor: 因果歸因に基づく効率的な間接プロンプト注入防護

CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution

Translated: 2026/3/15 7:04:31

causal-aiprompt-injectionsecurity-defensemachine-learningagent-safety

Japanese Translation

arXiv:2602.07918v1 Announce Type: cross 摘要：トールコール機能を備えた AI エージェントは、間接プロンプト注入 (IPI) 攻撃にさらされやすい。この攻撃シナリオでは、信頼できないコンテンツ内に隠された悪意のあるコマンドが、エージェントに承認されていない行動を実行させるよう誘導する。既存の防御手段は攻撃成功率を低下させるものの、過剰防御のジレンマに直面しており、実際の脅威に関わらず高価で常時稼働するサンジチゼーションを実装することで、無害なシナリオでも有用性とレイテンシーを低下させる。当社は、因果欠落の視点を通じて IPI を再考した：成功した注入は、ユーザーの要求がエージェントの権限ある行動に決定的な支持を提供しなくなる「優位性シフト」を示し、特定の信頼できないセグメント（例えば、取得ドキュメントやツール出力）が過剰な可視的帰属影響を与える現象である。この特徴に基づき、CausalArmor と呼ばれる選択的防御フレームワークを提案し、(i) 権限ある決定時点で軽量な、欠落に基づく帰算計算を行い、(ii) 信頼できないセグメントがユーザーの意向を優位に占める場合にのみ、指向されたサンジチゼーショントリガーを行う。さらに、CausalArmor は``poisoned``（毒された）推論トレースに基づいて行動しないよう、遡行的な Chain-of-Thought マスキングを採用する。当社は、帰算マージンに基づくサンジチゼーションが、悪意ある行動を選択する確率の指数関数的に小さな上界を条件付きで与えることを示す理論的解析を示した。AgentDojo と DoomArena での実験は、CausalArmor が激进的な防御と同程度のセキュリティ性能を持ち、説明可能性を向上させ、AI エージェントの有用性とレイテンシーを維持していることを示した。

Original Content

arXiv:2602.07918v1 Announce Type: cross Abstract: AI agents equipped with tool-calling capabilities are susceptible to Indirect Prompt Injection (IPI) attacks. In this attack scenario, malicious commands hidden within untrusted content trick the agent into performing unauthorized actions. Existing defenses can reduce attack success but often suffer from the over-defense dilemma: they deploy expensive, always-on sanitization regardless of actual threat, thereby degrading utility and latency even in benign scenarios. We revisit IPI through a causal ablation perspective: a successful injection manifests as a dominance shift where the user request no longer provides decisive support for the agent's privileged action, while a particular untrusted segment, such as a retrieved document or tool output, provides disproportionate attributable influence. Based on this signature, we propose CausalArmor, a selective defense framework that (i) computes lightweight, leave-one-out ablation-based attributions at privileged decision points, and (ii) triggers targeted sanitization only when an untrusted segment dominates the user intent. Additionally, CausalArmor employs retroactive Chain-of-Thought masking to prevent the agent from acting on ``poisoned'' reasoning traces. We present a theoretical analysis showing that sanitization based on attribution margins conditionally yields an exponentially small upper bound on the probability of selecting malicious actions. Experiments on AgentDojo and DoomArena demonstrate that CausalArmor matches the security of aggressive defenses while improving explainability and preserving utility and latency of AI agents.