arxiv_cs_lg February 10, 2026

On Randomness in Agentic Evals

agentic-systems, evaluation-benchmarks, statistical-analysis, machine-learning, software-development

Original Content

arXiv:2602.07150v1 Announce Type: new

Abstract: Agentic systems are evaluated on benchmarks where agents interact with environments to solve tasks. Most papers report a pass@1 score computed from a single run per task, assuming this gives a reliable performance estimate. We test this assumption by collecting 60,000 agentic trajectories on SWE-Bench-Verified, spanning three models and two scaffolds. We find substantial variance: single-run pass@1 estimates vary by 2.2 to 6.0 percentage points depending on which run is selected, with standard deviations exceeding 1.5 percentage points even at temperature 0. This variance has critical implications: reported improvements of 2 to 3 percentage points may reflect evaluation noise rather than genuine algorithmic progress. Through token-level analysis, we show that trajectories diverge early, often within the first few percent of tokens, and that these small differences cascade into different solution strategies. To enable reliable evaluation of agentic systems, we recommend three concrete practices: (1) estimate pass@1 from multiple independent runs per task, especially when measuring small improvements, (2) use statistical power analysis to determine the number of runs needed to detect expected effect sizes, and (3) consider metrics like pass@k (optimistic bound) and pass^k (pessimistic bound) with k>1 to better characterize the full performance envelope. While these practices increase evaluation cost, they are essential for distinguishing genuine scientific progress from statistical noise.
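The three recommended practices can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy result matrix, and the specific estimators are assumptions chosen for illustration. pass@k uses the common combinatorial estimator 1 - C(n-c, k)/C(n, k) for c successes in n runs, pass^k uses the analogous C(c, k)/C(n, k), and the run-count calculation is a textbook two-sample normal approximation that treats each run's benchmark-level pass@1 as an independent draw with a known standard deviation.

```python
from math import ceil, comb
from statistics import NormalDist


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k runs succeeds) from c successes in n runs."""
    if not 0 < k <= n or not 0 <= c <= n:
        raise ValueError("need 0 < k <= n and 0 <= c <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k runs succeed) from c successes in n runs (pass^k)."""
    if not 0 < k <= n or not 0 <= c <= n:
        raise ValueError("need 0 < k <= n and 0 <= c <= n")
    return comb(c, k) / comb(n, k)


def summarize(results: list[list[int]], k: int) -> dict[str, float]:
    """results[t][r] is 1 if run r solved task t, else 0 (hypothetical layout)."""
    n_tasks, n_runs = len(results), len(results[0])
    # pass@1 that each individual run would have reported on its own.
    per_run = [sum(task[r] for task in results) / n_tasks for r in range(n_runs)]
    successes = [sum(task) for task in results]
    return {
        "pass@1_mean": sum(per_run) / n_runs,
        "pass@1_min_run": min(per_run),
        "pass@1_max_run": max(per_run),
        "pass@k": sum(pass_at_k(n_runs, c, k) for c in successes) / n_tasks,
        "pass^k": sum(pass_hat_k(n_runs, c, k) for c in successes) / n_tasks,
    }


def runs_needed(sigma_pp: float, delta_pp: float,
                alpha: float = 0.05, power: float = 0.8) -> int:
    """Runs per system to detect a pass@1 gap of delta_pp percentage points.

    Assumes a two-sided two-sample z-test with equal run-to-run standard
    deviation sigma_pp for both systems; a simplification, not the paper's
    procedure.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sigma_pp / delta_pp) ** 2)


if __name__ == "__main__":
    # Toy data: 3 tasks x 4 runs, invented purely for illustration.
    toy = [[1, 1, 1, 1], [1, 0, 1, 0], [0, 0, 0, 1]]
    print(summarize(toy, k=2))
    # Abstract's figures: about 1.5 pp run-to-run SD, 2 pp claimed improvement.
    print(runs_needed(sigma_pp=1.5, delta_pp=2.0))
```

On the toy data, individual runs report pass@1 anywhere between 1/3 and 2/3, while the multi-run mean is 7/12. That spread is the effect the abstract describes, and pass@k and pass^k bracket the mean from above and below.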