arxiv_cs_lg February 10, 2026

Spectral Guardrails for Agents in the Wild: Detecting Tool Use Hallucinations via Attention Topology

Translated: 2026/3/15 15:02:05

spectral-analysis, attention-topology, hallucination-detection, agent-safety, llama-3.1

Original Content

arXiv:2602.08082v1 Announce Type: new

Abstract: Deploying autonomous agents in the wild requires reliable safeguards against tool-use failures. We propose a training-free guardrail based on spectral analysis of attention topology that complements supervised approaches. On Llama 3.1 8B, our method achieves 97.7% recall with multi-feature detection and 86.1% recall with 81.0% precision for balanced deployment, without requiring any labeled training data. Most remarkably, we discover that single-layer spectral features act as near-perfect hallucination detectors: Llama L26 Smoothness achieves 98.2% recall (213/217 hallucinations caught) with a single threshold, and Mistral L3 Entropy achieves 94.7% recall. This suggests hallucination is not merely a wrong token but a thermodynamic state change: the model's attention becomes noise when it errs. Through controlled cross-model evaluation on matched domains ($N=1000$, $T=0.3$, same General domain, hallucination rates 20–22%), we reveal the "Loud Liar" phenomenon: Llama 3.1 8B's failures are spectrally catastrophic and dramatically easier to detect, while Mistral 7B achieves the best discrimination (AUC 0.900). These findings establish spectral analysis as a principled, efficient framework for agent safety.
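The abstract does not define its Smoothness and Entropy features, so the following Python sketch only illustrates the general idea of thresholding a single spectral feature of an attention graph. Everything in it is an assumption rather than the authors' method: the symmetrized-attention Laplacian, the eigenvalue-entropy and Dirichlet-energy definitions, the direction of the threshold, and all function names (`attention_laplacian`, `spectral_entropy`, `dirichlet_smoothness`, `threshold_detector`, `recall`) are hypothetical stand-ins borrowed from standard graph signal processing.

```python
import numpy as np


def attention_laplacian(attn):
    """Treat a (tokens x tokens) attention matrix as a weighted graph.

    Assumption: symmetrize the weights, drop self-loops, and return the
    combinatorial Laplacian L = D - W.
    """
    w = 0.5 * (attn + attn.T)
    np.fill_diagonal(w, 0.0)
    return np.diag(w.sum(axis=1)) - w


def spectral_entropy(attn):
    """Shannon entropy of the Laplacian eigenvalues, normalized to sum to 1."""
    eig = np.clip(np.linalg.eigvalsh(attention_laplacian(attn)), 0.0, None)
    total = eig.sum()
    if total == 0.0:
        return 0.0  # graph has no edges, so there is no spectrum to spread
    p = eig / total
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


def dirichlet_smoothness(attn, signal):
    """Normalized Dirichlet energy x^T L x / x^T x (lower means smoother)."""
    x = np.asarray(signal, dtype=float)
    return float(x @ attention_laplacian(attn) @ x / (x @ x))


def threshold_detector(feature_values, threshold):
    """Flag a sample when its feature exceeds the threshold.

    Assumption: the comparison direction depends on the feature and model.
    """
    return np.asarray(feature_values) > threshold


def recall(flags, labels):
    """Fraction of true hallucinations (labels == True) that were flagged."""
    flags = np.asarray(flags, dtype=bool)
    labels = np.asarray(labels, dtype=bool)
    positives = labels.sum()
    if positives == 0:
        raise ValueError("recall is undefined without positive labels")
    return float((flags & labels).sum() / positives)


if __name__ == "__main__":
    # Reproduce the arithmetic behind the reported figure: 213 of 217 caught.
    labels = np.ones(217, dtype=bool)
    flags = np.array([True] * 213 + [False] * 4)
    print(f"{recall(flags, labels):.1%}")  # 98.2%
```

In practice, the attention matrix would come from one layer of the model (for example, averaged over heads), and the threshold would be chosen on held-out generations. Neither choice is specified in the abstract.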