arxiv_cs_ai February 10, 2026

Systematic Failures in Collective Rational Judgment under Distributed Information: Multi-Agent LLMs

Systematic Failures in Collective Reasoning under Distributed Information in Multi-Agent LLMs

Translated: 2026/2/14 7:14:47

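Translation (from the Japanese)

Multi-agent systems built on large language models (LLMs) are expected to improve decision-making by pooling distributed information. However, evaluating this capability systematically has been difficult. We introduce HiddenBench, a 65-task benchmark based on the Hidden Profile paradigm that separates collective reasoning under distributed information from individual reasoning ability. In an evaluation of 15 frontier LLMs, multi-agent LLMs reached only 30.1% accuracy under distributed information, whereas single agents given complete information reached 80.7%. We trace this gap to a systematic failure mode: agents fail to recognize, or act on, latent information asymmetry. Because they do not reason about what other members may know but have not yet stated, they converge prematurely on shared evidence and leave critical distributed facts unexplored. These failures persist across prompting strategies, communication depths, and group sizes, and they worsen as groups grow. Some models (e.g., Gemini-2.5-Flash/Pro) outperform others, but neither model scale nor individual reasoning accuracy reliably predicts collective performance. Our results identify failures in collective information exploration during decision-making as a key limitation of multi-agent LLMs, and they provide a theory-grounded, reproducible framework for diagnosing collective reasoning failures.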

Original Content

arXiv:2505.11556v3 Announce Type: replace-cross

Abstract: Multi-agent systems built on large language models (LLMs) are expected to enhance decision-making by pooling distributed information, yet systematically evaluating this capability has remained challenging. We introduce HiddenBench, a 65-task benchmark grounded in the Hidden Profile paradigm, which isolates collective reasoning under distributed information from individual reasoning ability. Evaluating 15 frontier LLMs, we find that multi-agent LLMs achieve only 30.1% accuracy under distributed information, compared to 80.7% accuracy for single agents given complete information. We trace this gap to a systematic failure mode: agents cannot recognize or act under latent information asymmetry. They fail to reason about what others might know but have not yet expressed, leading to premature convergence on shared evidence while critical distributed facts remain unexplored. These failures persist across prompting strategies, communication depths, and group sizes, and worsen as groups scale. While some models (e.g., Gemini-2.5-Flash/Pro) outperform others, neither model scale nor individual reasoning accuracy reliably predicts collective performance. Our results identify failures in collective information exploration in decision-making as a key limitation of multi-agent LLMs, and provide a theory-grounded, reproducible framework for diagnosing collective reasoning failures.
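To make the Hidden Profile setup concrete, here is a minimal toy sketch in Python. It is not HiddenBench's task format or scoring method. The options, facts, weights, and function names are all hypothetical, and the additive scoring rule is an assumption chosen for simplicity. The sketch shows how shared evidence can favor one option while the pooled evidence favors another, so a group that converges on shared facts picks the wrong answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    option: str  # the option this fact bears on
    weight: int  # +1 supports the option, -1 counts against it


# Hypothetical toy task (not taken from HiddenBench).
OPTIONS = ("A", "B")

# Facts every agent sees: they favor option A.
SHARED = [Fact("A", +1), Fact("A", +1), Fact("B", -1)]

# Facts held by only one agent each: together they favor option B.
UNSHARED = {
    "agent_1": [Fact("B", +1), Fact("A", -1)],
    "agent_2": [Fact("B", +1), Fact("A", -1)],
    "agent_3": [Fact("B", +1), Fact("B", +1)],
}


def best_option(facts):
    """Return the option with the highest summed weight (assumed scoring rule)."""
    scores = {option: 0 for option in OPTIONS}
    for fact in facts:
        scores[fact.option] += fact.weight
    # On a tie, max() keeps the option listed first in OPTIONS.
    return max(OPTIONS, key=lambda option: scores[option])


def individual_view(agent):
    """What one agent prefers from the shared facts plus its own private facts."""
    return best_option(SHARED + UNSHARED[agent])


def premature_consensus():
    """Group decision when discussion covers only the shared evidence."""
    return best_option(SHARED)


def full_pooling():
    """Group decision when every private fact is surfaced and pooled."""
    pooled = list(SHARED)
    for facts in UNSHARED.values():
        pooled.extend(facts)
    return best_option(pooled)


if __name__ == "__main__":
    for agent in UNSHARED:
        print(agent, "prefers", individual_view(agent))
    print("shared evidence only:", premature_consensus())
    print("all facts pooled:", full_pooling())
```

In this toy task every agent individually prefers A, and so does a group that discusses only shared evidence. Option B wins only once all private facts are pooled, which is the kind of case where asking what others know changes the outcome.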