arxiv_cs_ai April 24, 2026

Survey on Evaluation of LLM-based Agents

llm, agents, evaluation, artificial-intelligence, benchmarking

Original Content

arXiv:2503.16416v2
Announce Type: replace

Abstract: LLM-based agents represent a paradigm shift in AI, enabling autonomous systems to plan, reason, and use tools while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methods for these increasingly capable agents. We analyze the field of agent evaluation from five perspectives: (1) core LLM capabilities needed for agentic workflows, such as planning and tool use; (2) application-specific benchmarks, such as those for web and SWE agents; (3) evaluation of generalist agents; (4) analysis of the core dimensions of agent benchmarks; and (5) evaluation frameworks and tools for agent developers. Our analysis reveals current trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. We also identify critical gaps that future research must address, particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained, scalable evaluation methods.