arxiv_cs_cv April 24, 2026

AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

agent-safety, ai-agents, risk-diagnostics, guardrail-framework, llm-security

Original Content

arXiv:2601.18491v2 Announce Type: replace-cross

Abstract: The rise of AI agents introduces complex safety and security challenges arising from autonomous tool use and environmental interactions. Current guardrail models lack agentic risk awareness and transparency in risk diagnosis. To introduce an agentic guardrail that covers complex and numerous risky behaviors, we first propose a unified three-dimensional taxonomy that orthogonally categorizes agentic risks by their source (where), failure mode (how), and consequence (what). Guided by this structured and hierarchical taxonomy, we introduce a new fine-grained agentic safety benchmark (ATBench) and a Diagnostic Guardrail framework for agent safety and security (AgentDoG). AgentDoG provides fine-grained and contextual monitoring across agent trajectories. More crucially, AgentDoG can diagnose the root causes of unsafe actions and seemingly safe but unreasonable actions, offering provenance and transparency beyond binary labels to facilitate effective agent alignment. AgentDoG variants are available in three sizes (4B, 7B, and 8B parameters) across Qwen and Llama model families. Extensive experimental results demonstrate that AgentDoG achieves state-of-the-art performance in agentic safety moderation in diverse and complex interactive scenarios. All models and datasets are openly released.
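
To make the idea of a diagnosis "beyond binary labels" concrete, the following is a minimal sketch of how a per-step verdict carrying the three taxonomy axes (source, failure mode, consequence) might be represented. It is not AgentDoG's actual output format: the class names `Diagnosis` and `StepVerdict`, the helper `first_flagged`, and all label strings are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Diagnosis:
    """One label per taxonomy axis (hypothetical structure, placeholder values)."""
    source: str        # where the risk originates
    failure_mode: str  # how the agent fails
    consequence: str   # what harm results


@dataclass(frozen=True)
class StepVerdict:
    """Verdict for a single step of an agent trajectory (assumed shape)."""
    step_index: int
    flagged: bool
    diagnosis: Optional[Diagnosis] = None  # attached only when a step is flagged


def first_flagged(verdicts: list[StepVerdict]) -> Optional[StepVerdict]:
    """Return the earliest flagged step, or None if no step is flagged."""
    for verdict in sorted(verdicts, key=lambda v: v.step_index):
        if verdict.flagged:
            return verdict
    return None


# Illustrative trajectory: the label strings are placeholders,
# not categories taken from the paper or from ATBench.
verdicts = [
    StepVerdict(step_index=0, flagged=False),
    StepVerdict(
        step_index=1,
        flagged=True,
        diagnosis=Diagnosis(
            source="example-source",
            failure_mode="example-failure-mode",
            consequence="example-consequence",
        ),
    ),
]
flagged = first_flagged(verdicts)
```

Keeping the three axes as separate fields mirrors the abstract's claim that the taxonomy is orthogonal: each axis can be inspected or aggregated independently of the other two.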