arxiv_cs_ai April 24, 2026

DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?

drbench, agent-benchmark, research-agents, synthetic-data, evaluation-framework

Original Content

arXiv:2604.09251v2 Announce Type: replace

Abstract: Deep research agents increasingly interleave web browsing with multi-step computation, yet existing benchmarks evaluate these capabilities in isolation, creating a blind spot in assessing real-world performance. We introduce DRBENCHER, a synthetic benchmark generator for questions that require both browsing and computation. It enforces four criteria: verifiability (gold answers are computed by executing parameterized code over knowledge-graph values), complexity (multi-hop entity identification, property retrieval, and domain-specific computation), difficulty (a two-stage verification cascade filters out questions solvable by the generating model), and diversity (a greedy max-min embedding filter maximizes coverage). These criteria are realized via a unified answer-first pipeline spanning five domains: biochemistry, financial, geophysical, security, and history. Human evaluation shows 76% validity (84% excluding stale data), with 35% of errors due to outdated knowledge-graph entries, highlighting an inherent limitation of systems that reason over evolving data. Automatic evaluation shows that the strongest frontier model achieves only 20% answer accuracy. Compared to manually constructed benchmarks (BrowseComp+, MATH-500, GPQA), DRBENCHER achieves the highest semantic diversity.
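The abstract names a greedy max-min embedding filter but gives no implementation details. The sketch below shows one common reading of that idea, farthest-point selection: repeatedly keep the candidate whose distance to its nearest already-kept item is largest. The function names, the choice of cosine distance, the fixed starting item, and the toy 2-D vectors are all assumptions made for illustration; they are not taken from the paper.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def greedy_max_min_select(embeddings, k, first=0):
    """Pick up to k indices by greedy max-min (farthest-point) selection.

    Hypothetical helper: starts from index `first`, then repeatedly adds the
    item that maximizes its minimum distance to the items selected so far.
    """
    if not embeddings or k <= 0:
        return []
    k = min(k, len(embeddings))
    selected = [first]
    chosen = {first}
    # min_dist[i] tracks the distance from item i to its nearest selected item.
    min_dist = [cosine_distance(e, embeddings[first]) for e in embeddings]
    while len(selected) < k:
        # Candidate with the largest distance to its nearest selected item.
        best = max(
            (i for i in range(len(embeddings)) if i not in chosen),
            key=lambda i: min_dist[i],
        )
        selected.append(best)
        chosen.add(best)
        # Update nearest-selected distances with the newly added item.
        for i, e in enumerate(embeddings):
            d = cosine_distance(e, embeddings[best])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


if __name__ == "__main__":
    # Toy 2-D "embeddings"; items 0 and 1 are near-duplicates.
    toy = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.7, 0.7]]
    print(greedy_max_min_select(toy, k=3))
```

On the toy input, the near-duplicate of the starting item is the one left out, which is the coverage behavior the abstract attributes to the filter. A real pipeline would use sentence embeddings of the generated questions rather than hand-written vectors.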