arxiv_cs_lg April 20, 2026

The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination

Translated: 2026/4/20 11:04:31

llm-reasoning tool-hallucination reinforcement-learning agent-ai model-faults

Original Content

arXiv:2510.22977v2 Announce Type: replace

Abstract: Enhancing the reasoning capabilities of Large Language Models (LLMs) is a key strategy for building agents that "think then act." However, recent observations, such as those of OpenAI's o3, suggest a paradox: stronger reasoning often coincides with increased hallucination, yet no prior work has systematically examined whether reasoning enhancement itself causes tool hallucination. To address this gap, we pose the central question: does strengthening reasoning increase tool hallucination? To answer it, we introduce SimpleToolHalluBench, a diagnostic benchmark that measures tool hallucination in two failure modes: (i) no tool available, and (ii) only distractor tools available. Through controlled experiments, we establish three key findings. First, we demonstrate a causal relationship: progressively enhancing reasoning through RL increases tool hallucination proportionally with task performance gains. Second, this effect transcends overfitting: training on non-tool tasks (e.g., mathematics) still amplifies subsequent tool hallucination. Third, the effect is method-agnostic, appearing both when reasoning is instilled via supervised fine-tuning and when it is merely elicited at inference by switching from direct answers to step-by-step thinking. We also evaluate mitigation strategies, including prompt engineering and Direct Preference Optimization (DPO), revealing a fundamental reliability-capability trade-off: reducing hallucination consistently degrades utility. Mechanistically, reasoning RL disproportionately collapses tool-reliability-related representations, and hallucinations surface as amplified divergences concentrated in late-layer residual streams. These findings reveal that current reasoning enhancement methods inherently amplify tool hallucination, highlighting the need for new training objectives that jointly optimize for capability and reliability.
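
The two failure modes named in the abstract can be made concrete with a small scoring sketch. This is not the SimpleToolHalluBench implementation, which the abstract does not describe. The `Episode` record, the `classify` labels, and the rule that any tool call counts as a hallucination when no suitable tool exists are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Episode:
    """Hypothetical record of one benchmark item (not the paper's schema)."""
    setting: str                      # "no_tool" or "distractor_only"
    offered_tools: Set[str]           # tool names exposed to the model
    called_tools: List[str] = field(default_factory=list)  # tools the model tried to call


def classify(ep: Episode) -> str:
    """Label an episode under the assumed criterion.

    Assumption: in both settings no offered tool can solve the task,
    so the reliable behavior is to make no tool call at all.
    """
    if not ep.called_tools:
        return "abstained"
    # A call to a tool that was never offered is treated as fabrication.
    if any(name not in ep.offered_tools for name in ep.called_tools):
        return "fabricated_tool"
    # Every called tool was offered, but only irrelevant tools are offered.
    return "used_distractor"


def hallucination_rates(episodes: List[Episode]) -> Dict[str, float]:
    """Fraction of episodes per setting in which the model made any tool call."""
    totals: Dict[str, int] = {}
    hits: Dict[str, int] = {}
    for ep in episodes:
        totals[ep.setting] = totals.get(ep.setting, 0) + 1
        hits[ep.setting] = hits.get(ep.setting, 0) + int(classify(ep) != "abstained")
    return {setting: hits[setting] / totals[setting] for setting in totals}


# Toy data with made-up tool names, purely for illustration.
episodes = [
    Episode("no_tool", set(), ["get_weather"]),
    Episode("no_tool", set(), []),
    Episode("distractor_only", {"calculator"}, ["calculator"]),
    Episode("distractor_only", {"calculator"}, []),
    Episode("distractor_only", {"calculator"}, []),
    Episode("distractor_only", {"calculator"}, ["search"]),
]
print(hallucination_rates(episodes))
```

Reporting the rate per setting, rather than one pooled number, keeps fabricated calls in the no-tool case separate from misuse of distractor tools, which is the distinction the abstract draws between modes (i) and (ii).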