arxiv_cs_ai February 10, 2026

ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models

Translated: 2026/2/14 7:11:34

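Translation (from Japanese)

Large language models (LLMs) are increasingly used in tasks that demand interpretive and inferential accuracy. This paper introduces ExpliCa, a new dataset for evaluating LLMs on explicit causal reasoning. ExpliCa integrates both causal and temporal relations, presented in different linguistic orders and expressed explicitly through linguistic connectives. The dataset also includes crowdsourced human acceptability ratings. LLMs were tested on ExpliCa using prompting and perplexity-based metrics. Seven commercial and open-source LLMs were evaluated, revealing that even the top models struggle to reach 0.80 accuracy. Interestingly, models often confuse temporal relations with causal ones, and their performance is strongly influenced by the linguistic order of the events. Finally, perplexity-based scores and prompting performance are affected differently by model size.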

Original Content

arXiv:2502.15487v4 Announce Type: replace-cross

Abstract: Large Language Models (LLMs) are increasingly used in tasks requiring interpretive and inferential accuracy. In this paper, we introduce ExpliCa, a new dataset for evaluating LLMs in explicit causal reasoning. ExpliCa uniquely integrates both causal and temporal relations presented in different linguistic orders and explicitly expressed by linguistic connectives. The dataset is enriched with crowdsourced human acceptability ratings. We tested LLMs on ExpliCa through prompting and perplexity-based metrics. We assessed seven commercial and open-source LLMs, revealing that even top models struggle to reach 0.80 accuracy. Interestingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events. Finally, perplexity-based scores and prompting performance are differently affected by model size.
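
The abstract names perplexity-based metrics without describing the scoring procedure, so the following is only a minimal sketch of one common way such an evaluation could be set up, not the authors' method. It assumes that each candidate connective yields one full sentence, that per-token natural-log probabilities for that sentence are already available from some model, and that the connective with the lowest perplexity counts as the model's choice. The function names (`perplexity`, `pick_connective`, `accuracy`), the connective labels, and all numeric values are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Perplexity is exp of the average negative log-likelihood per token.
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


def pick_connective(candidates):
    """Return the candidate whose sentence has the lowest perplexity.

    `candidates` maps a connective label to the token log-probabilities of
    the sentence built with that connective (assumed layout).
    """
    return min(candidates, key=lambda label: perplexity(candidates[label]))


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Hypothetical log-probabilities for one item, one entry per connective.
item = {
    "so": [-0.4, -0.9, -0.3, -0.6],
    "then": [-0.5, -1.4, -0.8, -0.9],
    "because": [-0.7, -2.1, -1.0, -1.2],
}
choice = pick_connective(item)
```

With these made-up numbers, `choice` is `"so"`, because its average negative log-likelihood is the smallest of the three. Averaging per token rather than summing keeps sentences of different lengths comparable, which matters when connectives tokenize into different numbers of pieces.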