arxiv_cs_ai February 10, 2026

Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents

Translated: 2026/3/7 13:17:31

Tags: large-language-models, software-engineering, automated-code-generation, empirical-studies, human-computer-interaction

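Translated Summary

Large language model (LLM) code agents resolve repository-level issues by iteratively editing code, invoking tools, and validating candidate patches. In these workflows, agents often write their own tests, a pattern followed by many highly ranked agents on the SWE-bench leaderboard. GPT-5.2, however, writes almost no new tests yet performs comparably to top-ranking agents. This raises the question of whether agent-written tests actually help resolve issues. To examine their impact, the authors present an empirical study analyzing agent trajectories from six state-of-the-art LLMs on SWE-bench Verified. Test writing is common, but within the same model, resolved and unresolved tasks show similar test-writing frequencies. The tests serve mainly as observational feedback channels: agents use value-revealing print statements significantly more than formal assertion-based checks. Building on these findings, the authors revise the prompts of four agents to increase or reduce test writing in a controlled experiment, and find that changing the volume of agent-written tests does not significantly change final outcomes. The study concludes that current test-writing practices may offer only marginal utility in autonomous software engineering tasks.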

Original Content

arXiv:2602.07900v1 Announce Type: cross Abstract: Large Language Model (LLM) code agents increasingly resolve repository-level issues by iteratively editing code, invoking tools, and validating candidate patches. In these workflows, agents often write tests on the fly, a paradigm adopted by many high-ranking agents on the SWE-bench leaderboard. However, we observe that GPT-5.2, which writes almost no new tests, can still achieve performance comparable to top-ranking agents. This raises a critical question: do such tests meaningfully improve issue resolution, or do they merely mimic human testing practices while consuming a substantial interaction budget? To reveal the impact of agent-written tests, we present an empirical study that analyzes agent trajectories across six state-of-the-art LLMs on SWE-bench Verified. Our results show that while test writing is commonly adopted, resolved and unresolved tasks within the same model exhibit similar test-writing frequencies. Furthermore, these tests typically serve as observational feedback channels, where agents prefer value-revealing print statements significantly more than formal assertion-based checks. Based on these insights, we perform a controlled experiment by revising the prompts of four agents to either increase or reduce test writing. The results suggest that changes in the volume of agent-written tests do not significantly change final outcomes. Taken together, our study reveals that current test-writing practices may provide marginal utility in autonomous software engineering tasks.
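
The distinction the abstract draws between value-revealing print statements and formal assertion-based checks can be illustrated with a short sketch. This is not the authors' analysis pipeline, which the abstract does not describe. It is a minimal example that assumes the agent-written test is a Python script, and the helper name `count_feedback_styles` and the sample script are hypothetical.

```python
import ast


def count_feedback_styles(source: str) -> dict[str, int]:
    """Count print() calls and assertion-style checks in Python source.

    Hypothetical helper for illustration; the paper's own classification
    criteria are not given in the abstract.
    """
    counts = {"print": 0, "assert": 0}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assert):
            # Plain `assert ...` statements.
            counts["assert"] += 1
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name) and func.id == "print":
                counts["print"] += 1
            elif isinstance(func, ast.Attribute) and func.attr.startswith("assert"):
                # unittest-style calls such as self.assertEqual(...).
                counts["assert"] += 1
    return counts


# A made-up agent-written script: two observational prints, one formal check.
example = '''
result = parse("a=1")
print("parsed:", result)
print(type(result))
assert result == {"a": "1"}
'''

print(count_feedback_styles(example))  # {'print': 2, 'assert': 1}
```

The source is only parsed, never executed, so the undefined `parse` call in the sample script is harmless. A real study would also need to decide how to treat other check styles (for example `pytest.raises`), which this sketch ignores.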