arxiv_cs_ai April 24, 2026

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Translated: 2026/4/24 20:30:28

agencybench, autonomous-agents, large-language-models, benchmarking, llm-evaluation

Original Content

arXiv:2601.11044v4 Announce Type: replace Abstract: Autonomous agents based on Large Language Models (LLMs) demonstrate multifaceted capabilities that can contribute substantially to economic production. However, existing benchmarks remain focused on a single agentic capability, failing to capture long-horizon real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck, hindering automated rollout collection and evaluation. To bridge this gap, we introduce AgencyBench, a comprehensive benchmark derived from daily AI usage, evaluating 6 core agentic capabilities across 32 real-world scenarios, comprising 138 tasks with specific queries, deliverables, and rubrics. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve. To enable automated evaluation, we employ a user simulation agent to provide iterative feedback, and a Docker sandbox to conduct visual and functional rubric-based assessment. Experiments reveal that closed-source models significantly outperform open-source models (48.4% vs 32.1%). Further analysis reveals significant disparities across models in resource efficiency, feedback-driven self-correction, and specific tool-use preferences. Finally, we investigate the impact of agentic scaffolds, observing that proprietary models demonstrate superior performance within their native ecosystems (e.g., Claude-4.5-Opus via Claude-Agent-SDK), while open-source models exhibit distinct performance peaks, suggesting potential optimization for specific execution frameworks. AgencyBench serves as a critical testbed for next-generation agents, highlighting the necessity of co-optimizing model architecture with agentic frameworks. We believe this work sheds light on the future direction of autonomous agents, and we release the full benchmark and evaluation toolkit at https://github.com/GAIR-NLP/AgencyBench.
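
The evaluation loop described in the abstract (an agent produces a deliverable, a simulated user returns feedback, and a rubric scores the result) can be illustrated with a toy sketch. This is not the AgencyBench toolkit and does not reflect its actual API: every name here (`RubricItem`, `rubric_score`, `simulated_user_feedback`, `run_task`, `toy_agent`) is hypothetical, the rubric checks are simple string tests rather than visual or functional checks, and the Docker sandbox is replaced by in-process function calls purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    """One rubric criterion (hypothetical structure, not the paper's format)."""
    description: str               # what is reported to the agent on failure
    check: Callable[[str], bool]   # True if the deliverable satisfies the item


def rubric_score(deliverable: str, rubric: List[RubricItem]) -> float:
    """Return the fraction of rubric items the deliverable satisfies."""
    if not rubric:
        return 0.0
    passed = sum(1 for item in rubric if item.check(deliverable))
    return passed / len(rubric)


def simulated_user_feedback(deliverable: str, rubric: List[RubricItem]) -> str:
    """Stand-in for a user simulation agent: list the unmet criteria."""
    failed = [item.description for item in rubric if not item.check(deliverable)]
    return "; ".join(failed)


def run_task(agent: Callable[[str, str], str], rubric: List[RubricItem],
             query: str, max_rounds: int = 3) -> float:
    """Run up to max_rounds of produce -> score -> feedback; return best score."""
    feedback = ""
    best = 0.0
    for _ in range(max_rounds):
        deliverable = agent(query, feedback)
        score = rubric_score(deliverable, rubric)
        best = max(best, score)
        if score == 1.0:
            break  # all rubric items satisfied, stop early
        feedback = simulated_user_feedback(deliverable, rubric)
    return best


# Toy rubric and agent, assumed only for this example.
RUBRIC = [
    RubricItem("missing title", lambda d: "title" in d),
    RubricItem("missing button", lambda d: "button" in d),
]


def toy_agent(query: str, feedback: str) -> str:
    """Always emits a title; adds a button only after being told it is missing."""
    parts = ["title"]
    if "button" in feedback:
        parts.append("button")
    return " ".join(parts)
```

With this toy agent, a single round satisfies only one of the two rubric items, while allowing a second round lets the feedback drive a correction. That is the kind of feedback-driven self-correction the abstract reports measuring across models.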