arxiv_cs_lg April 20, 2026

Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations

Translated: 2026/4/20 11:08:17

llm-robustness, chain-of-thought, large-language-models, reasoning-perturbations, mathematical-reasoning

Original Content

arXiv:2603.03332v3 Announce Type: replace-cross

Abstract: Chain-of-Thought (CoT) prompting has emerged as a foundational technique for eliciting reasoning from Large Language Models (LLMs), yet the robustness of this approach to corruptions in intermediate reasoning steps remains poorly understood. This paper presents a comprehensive empirical evaluation of LLM robustness to a structured taxonomy of 5 CoT perturbation types: MathError, UnitConversion, Sycophancy, SkippedSteps, and ExtraSteps. We evaluate 13 models spanning three orders of magnitude in parameter count, testing their ability to complete mathematical reasoning tasks despite perturbations injected into the reasoning chain. Our key findings reveal heterogeneous vulnerability patterns: MathError perturbations produce the most severe degradation in small models (50-60% accuracy loss) but show strong scaling benefits; UnitConversion remains challenging across all scales (>5% loss even for mid-sized models); ExtraSteps incurs minimal accuracy degradation (0-6%) even for the smallest models; Sycophancy and SkippedSteps produce modest effects (~10% loss for small models) and improve slightly with scale. Scaling relationships show that model size serves as a protective factor against many perturbations, but not always. These findings have direct implications for deploying LLMs in multi-stage reasoning pipelines and underscore the necessity of task-specific robustness assessments and mitigation strategies. The code and results are available at https://github.com/Mystic-Slice/CoTPerturbation
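To make the evaluation setup in the abstract concrete, here is a minimal Python sketch of one perturbation type and of the accuracy-loss measurement. It is not taken from the authors' repository: the function names (`inject_math_error`, `accuracy_loss`), the choice to corrupt the last integer in a step, and the reading of "accuracy loss" as a percentage-point difference between clean and perturbed accuracy are all assumptions made for illustration.

```python
import re


def inject_math_error(steps, step_index, offset=1):
    """Return a copy of `steps` with the last integer in one step shifted by `offset`.

    Hypothetical stand-in for a MathError-style perturbation: the corrupted
    step states a wrong numeric result, and the rest of the chain is untouched.
    """
    perturbed = list(steps)
    step = perturbed[step_index]
    matches = list(re.finditer(r"-?\d+", step))
    if not matches:
        return perturbed  # nothing numeric to corrupt in this step
    last = matches[-1]
    wrong_value = str(int(last.group()) + offset)
    perturbed[step_index] = step[: last.start()] + wrong_value + step[last.end():]
    return perturbed


def accuracy(correct_flags):
    """Fraction of problems answered correctly."""
    return sum(correct_flags) / len(correct_flags)


def accuracy_loss(clean_flags, perturbed_flags):
    """Percentage-point drop from clean to perturbed accuracy (assumed definition)."""
    return 100.0 * (accuracy(clean_flags) - accuracy(perturbed_flags))


# Toy reasoning chain; a real evaluation would feed the perturbed chain to a model.
chain = [
    "Each box holds 3 * 4 = 12 pencils.",
    "Two boxes hold 12 * 2 = 24 pencils.",
]
corrupted = inject_math_error(chain, step_index=0)

# Made-up correctness flags for four problems, purely to exercise the metric.
loss = accuracy_loss([True, True, True, False], [True, False, False, False])
```

In this sketch the model under test would receive `corrupted` as a partial reasoning chain and be asked to finish the problem; comparing its correctness against the run with the unmodified `chain` gives the flags passed to `accuracy_loss`.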