arxiv_cs_ai February 10, 2026

Rigor, Reliability, and Reproducibility Matter: An Overview of Code Benchmark Evaluation, 2014-2025

Rigor, Reliability, and Reproducibility Matter: A Decade-Scale Survey of 572 Code Benchmarks

Translated: 2026/2/14 7:10:44

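Translation (from Japanese)

Code-related benchmarks play a critical role in evaluating large language models (LLMs). At the same time, the quality of these benchmarks fundamentally shapes how the community interprets model capabilities. Awareness of benchmark quality has grown in recent years, but a decade-scale survey (2014-2025) of 572 code benchmarks showed that this awareness has not yet carried over into practice. For example, in 2025 alone, the number of benchmarks that ignore code coverage when providing test cases was nearly equal to the total accumulated over the previous ten years. On this point we take a clear position: code benchmarks must prioritize rigor in construction, reliability in evaluation, and reproducibility in release. To put this position into practice, we introduce HOW2BENCH, a code benchmark guideline comprising 55 checklists. Finally, our further human study shows that the current problems stem both from the significant effort required and from a lack of awareness of why these practices matter.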

Original Content

arXiv:2501.10711v4 Announce Type: replace-cross

Abstract: Code-related benchmarks play a critical role in evaluating large language models (LLMs), yet their quality fundamentally shapes how the community interprets model capabilities. In the past few years, awareness of benchmark quality has grown. Yet, after a decade-scale (2014-2025) survey of 572 code benchmarks, we observed a lag between growing awareness and actual practice. For example, in 2025 alone, the number of benchmarks that ignore code coverage when providing test cases nearly matches the total count accumulated across the previous ten years. In response, we take a clear position: code benchmarks must prioritize rigor in benchmark construction, reliability in evaluation, and reproducibility in release. To operationalize this position, we introduce HOW2BENCH, a code benchmark guideline with 55 checklists. Finally, our further human study also revealed that the current issues stem not only from the significant effort required but also from a lack of awareness of their importance.