arxiv_cs_ai April 20, 2026

Theory of Mind in Action: The Instruction Inference Task in Dynamic Human-Agent Collaboration

Translated: 2026/4/20 11:17:29

theory-of-mind, human-agent-collaboration, large-language-models, instruction-inference, tomcat

Original Content

arXiv:2507.02935v3 Announce Type: replace-cross

Abstract: Successful human-agent teaming relies on an agent being able to understand instructions given by a (human) principal. In many cases, an instruction may be incomplete or ambiguous. In such cases, the agent must infer the unspoken intentions from their shared context; that is, it must exercise Theory of Mind (ToM) to infer the mental states of its principal. We consider the prospects of effective human-agent collaboration using large language models (LLMs). To assess ToM in a dynamic, goal-oriented, and collaborative environment, we introduce a novel task, Instruction Inference, in which an agent assists a principal in reaching a goal by interpreting incomplete or ambiguous instructions. We present Tomcat, an LLM-based agent designed to exhibit ToM reasoning in interpreting and responding to the principal's instructions. We implemented two variants of Tomcat. One, dubbed Fs-CoT (Fs for few-shot, CoT for chain-of-thought), is based on a small number of examples demonstrating the requisite structured reasoning. The other, dubbed CP (commonsense prompt), relies on commonsense knowledge and information about the problem. We realized both variants of Tomcat on three leading LLMs, namely, GPT-4o, DeepSeek-R1, and Gemma-3-27B. To evaluate the effectiveness of Tomcat, we conducted a study with 52 human participants, who received the same information as the CP variant. We computed intent accuracy, action optimality, and planning optimality to measure the ToM capabilities of Tomcat and our study participants. We found that Tomcat with Fs-CoT, particularly with GPT-4o and DeepSeek-R1, achieves performance comparable to that of the human participants, underscoring its ToM potential for human-agent collaboration.
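
The abstract names intent accuracy as one of its three measures but does not define it. As a rough illustration only, the sketch below assumes the simplest reading: the fraction of instructions for which the inferred intent matches a reference intent. The function name `intent_accuracy` and the intent labels are hypothetical and do not come from the paper; action optimality and planning optimality are left out because the abstract gives no basis for how they are computed.

```python
from typing import Sequence


def intent_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of instructions whose inferred intent equals the reference intent.

    Assumption: intents are compared by exact label match. The paper may
    score intents differently.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one instruction is required")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Hypothetical intent labels, invented for illustration only.
inferred = ["fetch_key", "open_door", "fetch_key", "wait"]
reference = ["fetch_key", "open_door", "open_door", "wait"]

score = intent_accuracy(inferred, reference)
print(score)  # 3 of 4 labels match, so this prints 0.75
```

Under this reading, the same function could score either an LLM agent or a human participant, which is what would make the two groups directly comparable on this measure.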