arxiv_cs_lg February 10, 2026

Interpretable Failure Analysis in Multi-Agent Reinforcement Learning Systems

Translated: 2026/3/15 7:05:05

multi-agent-reinforcement-learning, interpretable-ai, failure-analysis, gradient-based, marl-safety

Original Content

arXiv:2602.08104v1 Announce Type: cross

Abstract: Multi-Agent Reinforcement Learning (MARL) is increasingly deployed in safety-critical domains, yet methods for interpretable failure detection and attribution remain underdeveloped. We introduce a two-stage gradient-based framework that provides interpretable diagnostics for three critical failure analysis tasks: (1) detecting the true initial failure source (Patient-0); (2) validating why non-attacked agents may be flagged first due to domino effects; and (3) tracing how failures propagate through learned coordination pathways. Stage 1 performs interpretable per-agent failure detection via Taylor-remainder analysis of policy-gradient costs, declaring an initial Patient-0 candidate at the first threshold crossing. Stage 2 provides validation through geometric analysis of critic derivatives (first-order sensitivity and directional second-order curvature) aggregated over causal windows to construct interpretable contagion graphs. This approach explains "downstream-first" detection anomalies by revealing pathways that amplify upstream deviations. Evaluated across 500 episodes in Simple Spread (3 and 5 agents) and 100 episodes in StarCraft II using MADDPG and HATRPO, our method achieves 88.2-99.4% Patient-0 detection accuracy while providing interpretable geometric evidence for detection decisions. By moving beyond black-box detection to interpretable gradient-level forensics, this framework offers practical tools for diagnosing cascading failures in safety-critical MARL systems.
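
The two quantities named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact cost, threshold rule, or window aggregation, so everything below is an assumption made for illustration. The names `taylor_remainder`, `first_threshold_crossing`, and `directional_curvature` are hypothetical, the tie-breaking rule is arbitrary, and the curvature is estimated by central finite differences rather than by differentiating an actual critic network.

```python
from typing import Callable, Optional, Sequence, Tuple

Vector = Sequence[float]
ScalarFn = Callable[[Vector], float]


def taylor_remainder(cost: ScalarFn, grad: Vector, x0: Vector, x: Vector) -> float:
    """Absolute first-order Taylor remainder |f(x) - f(x0) - g.(x - x0)|.

    `grad` is the gradient of `cost` evaluated at the reference point `x0`.
    A large remainder means the linear model around x0 no longer explains
    the observed cost (assumed here to be the per-agent failure signal).
    """
    linear = sum(g * (xi - x0i) for g, xi, x0i in zip(grad, x, x0))
    return abs(cost(x) - cost(x0) - linear)


def first_threshold_crossing(
    remainders: Sequence[Sequence[float]], threshold: float
) -> Optional[Tuple[int, int]]:
    """Return (timestep, agent) for the earliest remainder above `threshold`.

    remainders[t][i] is agent i's remainder at timestep t. If several agents
    cross in the same timestep, the one with the largest remainder is chosen
    (an arbitrary tie-break for this sketch). Returns None if nothing crosses.
    """
    for t, row in enumerate(remainders):
        flagged = [(r, i) for i, r in enumerate(row) if r > threshold]
        if flagged:
            _, agent = max(flagged)
            return t, agent
    return None


def directional_curvature(f: ScalarFn, x: Vector, d: Vector, h: float = 1e-3) -> float:
    """Central finite-difference estimate of d^T H d for a scalar function f at x."""
    plus = [xi + h * di for xi, di in zip(x, d)]
    minus = [xi - h * di for xi, di in zip(x, d)]
    return (f(plus) - 2.0 * f(x) + f(minus)) / (h * h)


if __name__ == "__main__":
    # Toy per-agent remainder traces: 3 timesteps, 3 agents (made-up numbers).
    traces = [[0.1, 0.2, 0.1], [0.1, 0.9, 0.3], [1.2, 1.5, 0.4]]
    print(first_threshold_crossing(traces, threshold=0.5))
```

In this toy trace, agent 1 is the first to exceed the threshold and would become the Stage 1 candidate. Per the abstract, that candidate may be a downstream agent rather than the true source, which is why Stage 2 checks it against sensitivity and curvature evidence aggregated over causal windows.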