arxiv_cs_lg April 24, 2026

DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories

llms, theory-of-mind, dialogue, benchmark, forecasting

Original Content

arXiv:2604.20443v1
Announce Type: cross

Abstract: Large Language Models (LLMs) have been shown to possess Theory of Mind (ToM) abilities. However, it remains unclear whether this stems from robust reasoning or spurious correlations. We introduce DialToM, a human-verified benchmark built from natural human dialogue using a multiple-choice framework. We evaluate not only mental state prediction (Literal ToM) but also the functional utility of these states (Functional ToM) through Prospective Diagnostic Forecasting, which probes whether models can identify state-consistent dialogue trajectories solely from mental-state profiles. Our results reveal a significant reasoning asymmetry: while LLMs excel at identifying mental states, most (except for Gemini 3 Pro) fail to leverage this understanding to forecast social trajectories. Additionally, we find only weak semantic similarities between human and LLM-generated inferences. To facilitate reproducibility, the DialToM dataset and evaluation code are publicly available at https://github.com/Stealth-py/DialToM.
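
As a rough illustration of how a multiple-choice benchmark of this kind might be scored, the sketch below computes per-task accuracy and the gap between the two task types. It is not the authors' evaluation code: the record fields (`task`, `gold`, `predicted`), the task labels `"literal"` and `"functional"`, and the function name `accuracy_by_task` are all hypothetical, and the toy records are made up for the example rather than taken from the paper.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return {task: accuracy} for a list of multiple-choice records.

    Each record is assumed (hypothetically) to be a dict with keys
    "task", "gold", and "predicted".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["task"]] += 1
        if record["predicted"] == record["gold"]:
            correct[record["task"]] += 1
    return {task: correct[task] / total[task] for task in total}


# Made-up toy records, NOT results from the paper.
records = [
    {"task": "literal", "gold": "A", "predicted": "A"},
    {"task": "literal", "gold": "C", "predicted": "C"},
    {"task": "functional", "gold": "B", "predicted": "D"},
    {"task": "functional", "gold": "A", "predicted": "A"},
]

scores = accuracy_by_task(records)
# A positive gap means the model does better at identifying mental states
# than at using them to pick the state-consistent trajectory.
gap = scores["literal"] - scores["functional"]
print(scores, gap)
```

Under these assumptions, a large positive `gap` would correspond to the asymmetry the abstract describes: strong mental-state identification paired with weaker trajectory forecasting.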