arxiv_cs_ai April 24, 2026

Using Machine Mental Imagery for Representing Common Ground in Situated Dialogue

situated-dialogue, machine-learning, multimodal-models, knowledge-representation, generative-ai

Original Content

arXiv:2604.21144v1 Announce Type: cross

Abstract: Situated dialogue requires speakers to maintain a reliable representation of shared context rather than reasoning only over isolated utterances. Current conversational agents often struggle with this requirement, especially when the common ground must be preserved beyond the immediate context window. In such settings, fine-grained distinctions are frequently compressed into purely textual representations, leading to a critical failure mode we call "representational blur", in which similar but distinct entities collapse into interchangeable descriptions. This semantic flattening creates an illusion of grounding, where agents appear locally coherent but fail to track shared context persistently over time. Inspired by the role of mental imagery in human reasoning, and based on the increased availability of multimodal models, we explore whether conversational agents can be given an analogous ability to construct some depictive intermediate representations during dialogue to address these limitations. Thus, we introduce an active visual scaffolding framework that incrementally converts dialogue state into a persistent visual history that can later be retrieved for grounded response generation. Evaluation on the IndiRef benchmark shows that incremental externalization itself improves over full-dialog reasoning, while visual scaffolding provides additional gains by reducing representational blur and enforcing concrete scene commitments. At the same time, textual representations remain advantageous for non-depictable information, and a hybrid multimodal setting yields the best overall performance. Together, these findings suggest that conversational agents benefit from an explicitly multimodal representation of common ground that integrates depictive and propositional information.
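
The abstract gives no implementation details, so the following is only a toy sketch of two ideas it names: representational blur and incremental externalization into a persistent, retrievable history. It is not the authors' framework. In particular, it assumes structured attribute snapshots as a stand-in for rendered images, and every name in it (`Entity`, `SceneHistory`, `externalize`, `retrieve`, `blurred_summary`) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A concrete scene commitment: a name plus fully specified attributes."""
    name: str
    attributes: tuple  # sorted (key, value) pairs


@dataclass
class SceneHistory:
    """Hypothetical persistent history; each snapshot stands in for an image."""
    snapshots: list = field(default_factory=list)

    def externalize(self, turn_index, updates):
        """Fold one turn's entity updates into a new snapshot (incremental)."""
        # Copy the previous scene so earlier snapshots are never modified.
        scene = dict(self.snapshots[-1][1]) if self.snapshots else {}
        for name, attrs in updates.items():
            merged = dict(scene[name].attributes) if name in scene else {}
            merged.update(attrs)
            scene[name] = Entity(name, tuple(sorted(merged.items())))
        self.snapshots.append((turn_index, scene))

    def retrieve(self, **query):
        """Names of entities in the latest snapshot matching every query attribute."""
        if not self.snapshots:
            return []
        _, scene = self.snapshots[-1]
        return sorted(
            name for name, ent in scene.items()
            if all(dict(ent.attributes).get(k) == v for k, v in query.items())
        )


def blurred_summary(entity, kept=("color", "kind")):
    """Lossy text compression that keeps only coarse attributes (the blur)."""
    attrs = dict(entity.attributes)
    return " ".join(str(attrs[k]) for k in kept if k in attrs)


# Three dialogue turns introduce two similar mugs, then move one of them.
history = SceneHistory()
history.externalize(1, {"mug_a": {"kind": "mug", "color": "red", "handle": "chipped"}})
history.externalize(2, {"mug_b": {"kind": "mug", "color": "red", "handle": "intact"}})
history.externalize(3, {"mug_a": {"position": "shelf"}})

latest = history.snapshots[-1][1]
# Both mugs collapse to the same coarse description: representational blur.
print(blurred_summary(latest["mug_a"]) == blurred_summary(latest["mug_b"]))  # True
# The snapshot history still tells them apart.
print(history.retrieve(handle="chipped"))  # ['mug_a']
```

The design choice worth noting is that each turn appends a new snapshot instead of rewriting a single summary, so distinctions introduced early (here, the handle) remain available for later retrieval even after the coarse description has stopped distinguishing the two entities.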