arxiv_cs_lg February 10, 2026

Debugging code world models

code-world-models, language-models, program-execution, tokenization, state-tracking

Original Content

arXiv:2602.07672v1
Announce Type: cross

Abstract: Code World Models (CWMs) are language models trained to simulate program execution by predicting explicit runtime state after every executed command. This execution-based world modeling enables internal verification within the model, offering an alternative to natural language chain-of-thought reasoning. However, the sources of errors and the nature of CWMs' limitations remain poorly understood. We study CWMs from two complementary perspectives: local semantic execution and long-horizon state tracking. On real-code benchmarks, we identify two dominant failure regimes. First, dense runtime state reveals produce token-intensive execution traces, leading to token-budget exhaustion on programs with long execution histories. Second, failures disproportionately concentrate in string-valued state, which we attribute to limitations of subword tokenization rather than program structure. To study long-horizon behavior, we use a controlled permutation-tracking benchmark that isolates state propagation under action execution. We show that long-horizon degradation is driven primarily by incorrect action generation: when actions are replaced with ground-truth commands, a Transformer-based CWM propagates state accurately over long horizons, despite known limitations of Transformers in long-horizon state tracking. These findings suggest directions for more efficient supervision and state representations in CWMs that are better aligned with program execution and data types.
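
To make the permutation-tracking setup concrete, the following is a minimal Python sketch of what such a benchmark could look like. It is an illustration only, not the authors' benchmark: the swap-based action encoding and the helper names (`apply_swap`, `rollout`, `random_actions`, `first_divergence`) are assumptions introduced here. The state is a permutation, each action swaps two positions, and the reference trace records the state after every executed action, mirroring the per-command state prediction described in the abstract.

```python
import random
from typing import List, Optional, Tuple

# Hypothetical action encoding (an assumption, not taken from the paper):
# an action (i, j) swaps the items at positions i and j.
Swap = Tuple[int, int]


def apply_swap(state: List[int], action: Swap) -> List[int]:
    """Return a new state with the two positions named by `action` exchanged."""
    i, j = action
    new_state = list(state)
    new_state[i], new_state[j] = new_state[j], new_state[i]
    return new_state


def rollout(initial: List[int], actions: List[Swap]) -> List[List[int]]:
    """Reference trace: the full state after every executed action."""
    trace = []
    state = list(initial)
    for action in actions:
        state = apply_swap(state, action)
        trace.append(state)
    return trace


def random_actions(n_items: int, horizon: int, seed: int = 0) -> List[Swap]:
    """Sample `horizon` swaps over `n_items` positions, reproducibly."""
    rng = random.Random(seed)
    return [tuple(rng.sample(range(n_items), 2)) for _ in range(horizon)]


def first_divergence(
    predicted: List[List[int]], reference: List[List[int]]
) -> Optional[int]:
    """Index of the first step where two traces differ, or None if they agree."""
    for step, (pred, ref) in enumerate(zip(predicted, reference)):
        if pred != ref:
            return step
    return None


# Reference rollout with ground-truth actions.
true_actions = [(0, 1), (1, 2), (0, 2)]
reference = rollout([0, 1, 2], true_actions)
print(reference)  # [[1, 0, 2], [1, 2, 0], [0, 2, 1]]

# A single wrong action at step 1, with otherwise exact state propagation.
wrong_actions = [(0, 1), (0, 2), (0, 2)]
print(first_divergence(rollout([0, 1, 2], wrong_actions), reference))  # 1
```

In this toy setting, `rollout` propagates state exactly, so any divergence from the reference trace comes from the action sequence alone. That is the kind of separation the abstract describes: substituting ground-truth actions isolates state propagation from action generation.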