arxiv_cs_ai April 24, 2026

The Path Not Taken: Duality in Reasoning about Program Execution

Translated: 2026/4/24 20:21:31

llms, dual-reasoning, code-execution, benchmarks, program-analysis

Original Content

arXiv:2604.20917v1 Announce Type: cross

Abstract: Large language models (LLMs) have shown remarkable capabilities across diverse coding tasks. However, their adoption requires a true understanding of program execution rather than reliance on surface-level patterns. Existing benchmarks primarily focus on predicting program properties tied to specific inputs (e.g., code coverage, program outputs). As a result, they provide a narrow view of dynamic code reasoning and are prone to data contamination. We argue that understanding program execution requires evaluating its inherent duality through two complementary reasoning tasks: (i) predicting a program's observed behavior for a given input, and (ii) inferring how the input must be mutated toward a specific behavioral objective. Both tasks jointly probe a model's causal understanding of execution flow. We instantiate this duality in DexBench, a benchmark comprising 445 paired instances, and evaluate 13 LLMs. Our results demonstrate that dual-path reasoning provides a robust and discriminative proxy for dynamic code understanding.
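The two tasks named in the abstract can be illustrated with a toy paired instance. The sketch below is only an illustration under stated assumptions: the program `classify`, the helper names, and the rule that a pair counts only when both directions are answered correctly are all hypothetical and are not taken from DexBench, whose actual instance format and scoring are not described here.

```python
def classify(n: int) -> str:
    """Toy program under test (hypothetical, not a DexBench instance)."""
    if n < 0:
        return "negative"
    if n % 2 == 0:
        return "even"
    return "odd"


def check_prediction(program, given_input, predicted_behavior) -> bool:
    """Task (i): does the predicted behavior match what execution actually produces?"""
    return program(given_input) == predicted_behavior


def check_mutation(program, original_input, mutated_input, target_behavior) -> bool:
    """Task (ii): is the input really changed, and does the change reach the target behavior?"""
    return mutated_input != original_input and program(mutated_input) == target_behavior


def score_pair(program, given_input, predicted, mutated_input, target) -> bool:
    """Assumed scoring rule: a paired instance passes only if both tasks are answered correctly."""
    return (check_prediction(program, given_input, predicted)
            and check_mutation(program, given_input, mutated_input, target))


if __name__ == "__main__":
    # A model claims: input 4 yields "even", and mutating it to -1 reaches "negative".
    print(score_pair(classify, 4, "even", -1, "negative"))
```

Verifying both answers by executing the program keeps the check independent of any memorized output, which is the kind of contamination concern the abstract raises about single-direction benchmarks.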