arxiv_cs_ai April 24, 2026

DryRUN: On the Role of Public Tests in LLM-Driven Code Generation

Translated: 2026/4/24 20:27:27

llm, code-generation, multi-agent-systems, software-development, simulation

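Japanese Translation (rendered in English)

arXiv:2604.21598v1 Announce Type: cross Abstract: Multi-agent frameworks are widely used for autonomous code generation and for solving complex algorithmic problems. Recent efforts address the challenge of generating functionally correct code by incorporating simulation-driven planning and debugging, in which a language model traces execution steps to verify its logic. However, these approaches rely on human-provided public test cases to establish the debugging and simulation loop. Manually writing comprehensive input-output examples is a labor-intensive bottleneck in the software development lifecycle. Because ground-truth input-output examples are rarely available before implementation in real-world software engineering, this dependency confines such methods to curated competitive programming benchmarks. In addition, we find that reliance on these public tests induces an "overconfidence gap": frameworks overfit to simple examples and then fail on hidden evaluations. On the other hand, we observe that external sample inputs are not strictly necessary for code generation. We show that large language models (LLMs) can autonomously generate valid inputs and execution traces and correct themselves. We therefore developed DryRUN, a framework that removes the need for ground-truth samples by letting the LLM plan iteratively and autonomously simulate its own inputs and their execution. Evaluation on the LiveCodeBench v6 dataset (since March 2025) shows that DryRUN achieves performance similar to CodeSIM, a top-performing framework that depends on public tests, while reducing output token consumption, even though it uses no public test cases or external execution feedback.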

Original Content

arXiv:2604.21598v1 Announce Type: cross Abstract: Multi-agent frameworks are widely used in autonomous code generation and have applications in complex algorithmic problem-solving. Recent work has addressed the challenge of generating functionally correct code by incorporating simulation-driven planning and debugging, where language models trace execution steps to verify logic. However, these approaches depend on human-provided public test cases to ground the debugging and simulation loop. Manually authoring comprehensive input-output examples is a labor-intensive bottleneck in the software development lifecycle. Because ground-truth input-output examples are rarely available prior to implementation in real-world software engineering, this dependency restricts such methods to curated competitive programming benchmarks. Furthermore, we identify that reliance on these public tests induces an "overconfidence gap," causing frameworks to overfit to simplistic examples and fail on hidden evaluations. In contrast, we observe that external sample inputs are not strictly necessary for code generation. We demonstrate that large language models can autonomously generate valid inputs and simulate execution traces to self-correct. Consequently, we develop DryRUN, a framework that eliminates the need for ground-truth samples by allowing the LLM to iteratively plan, autonomously generate its own inputs, and simulate execution, thereby mitigating algorithmic overconfidence. Evaluations on the LiveCodeBench v6 dataset (post-March 2025) demonstrate that DryRUN matches the performance of CodeSIM, a state-of-the-art framework that depends on public tests, while operating entirely without public test cases or external execution feedback and reducing output token consumption.
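The abstract describes the loop only at a high level: plan, generate inputs without public tests, simulate execution, self-correct. The Python sketch below is one possible reading of that loop, not the authors' implementation. Everything in it is an assumption made for illustration: the `Model` interface, the prompt tags (`PLAN`, `INPUTS`, `CODE`, `TRACE`, `REVISE`), the `MISMATCH` verdict convention, the `max_rounds` limit, and the scripted stand-in model that replaces a real LLM so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to a completion.
Model = Callable[[str], str]


@dataclass
class DryRunResult:
    code: str
    rounds: int      # number of simulate rounds actually performed
    accepted: bool   # True if the model's own dry run reported no mismatch


def dry_run_generate(problem: str, model: Model, max_rounds: int = 3) -> DryRunResult:
    """Plan, self-generate inputs, simulate by hand, and revise. Nothing is executed."""
    plan = model(f"PLAN\n{problem}")
    # The model proposes its own inputs; no human-written public tests are consulted.
    raw_inputs = model(f"INPUTS\n{problem}")
    inputs: List[str] = [line for line in raw_inputs.splitlines() if line.strip()]
    code = model(f"CODE\n{problem}\n{plan}")

    for round_no in range(1, max_rounds + 1):
        # Ask the model to trace the candidate code on each self-generated input.
        traces = [model(f"TRACE\n{problem}\n{code}\n{inp}") for inp in inputs]
        failing = [t for t in traces if t.strip().upper().startswith("MISMATCH")]
        if not failing:
            return DryRunResult(code, round_no, True)
        # Feed the failing traces back so the model can correct its own code.
        code = model(f"REVISE\n{problem}\n{code}\n" + "\n".join(failing))

    # Round budget exhausted: the last revision was never re-simulated.
    return DryRunResult(code, max_rounds, False)


def scripted_model(prompt: str) -> str:
    """Offline stand-in for an LLM (illustrative only): first drafts a bug, then fixes it."""
    kind = prompt.split("\n", 1)[0]
    if kind == "PLAN":
        return "add the two numbers"
    if kind == "INPUTS":
        return "1 2\n-3 3\n"
    if kind == "CODE":
        return "def add(a, b): return a - b"
    if kind == "REVISE":
        return "def add(a, b): return a + b"
    # TRACE: report a mismatch while the buggy subtraction is still present.
    return "MISMATCH: subtraction used" if "a - b" in prompt else "OK"


if __name__ == "__main__":
    print(dry_run_generate("Return the sum of two integers.", scripted_model))
```

In this sketch the only signal driving revision is the model's own simulated trace, which is the property the abstract emphasizes. A real system would need a more robust verdict format than a `MISMATCH` prefix, and the stopping rule here is a placeholder.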