arxiv_cs_ai April 24, 2026

Efficient Agent Evaluation via Diversity-Guided User Simulation

Tags: llm, agent-evaluation, monte-carlo, diversity, simulation

Original Content

arXiv:2604.21480v1 Announce Type: new

Abstract: Large language models (LLMs) are increasingly deployed as customer-facing agents, yet evaluating their reliability remains challenging due to stochastic, multi-turn interactions. Current evaluation protocols rely on linear Monte Carlo rollouts of complete agent-user conversations to estimate success. However, this approach is computationally inefficient, repeatedly regenerating identical early prefixes, and often fails to uncover deep failure modes that arise from rare user behaviors. We introduce DIVERT (Diversity-Induced Evaluation via Branching of Trajectories), an efficient, snapshot-based, coverage-guided user simulation framework for systematic exploration of agent-user interactions. DIVERT captures the full agent-environment state at critical decision points and resumes execution from these snapshots, enabling reuse of shared conversation prefixes and reducing redundant computation. From each junction, the framework branches using targeted, diversity-inducing user responses, allowing directed exploration of alternative interaction paths. By focusing evaluation on semantically diverse and underexplored trajectories, DIVERT improves both efficiency and coverage. Empirical results show that it discovers more failures per token compared to standard linear rollout protocols, while expanding the set of tasks on which failures are identified.
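The contrast between linear rollouts and snapshot-based branching can be illustrated with a toy sketch. This is not DIVERT's implementation, which the abstract does not detail. Every name here (`State`, `TokenMeter`, `play_turn`, `linear_rollouts`, `branched_rollouts`), the flat per-turn token cost, the fixed list of user behaviors, and the toy failure rule are assumptions made for illustration. In particular, a fixed enumeration of behaviors stands in for the paper's diversity-inducing user responses, and the junction is chosen by hand rather than detected.

```python
import copy
import random
from dataclasses import dataclass, field

TOKENS_PER_TURN = 100   # assumed flat cost of generating one turn
MAX_TURNS = 6           # assumed conversation length
BEHAVIORS = ["confirm", "clarify", "change_mind", "go_off_topic"]
WEIGHTS = [0.85, 0.10, 0.03, 0.02]  # rare behaviors are unlikely under plain sampling


@dataclass
class State:
    """Toy stand-in for the full agent-environment state (hypothetical)."""
    history: list = field(default_factory=list)
    failed: bool = False


class TokenMeter:
    """Counts tokens spent across a whole evaluation run."""
    def __init__(self):
        self.total = 0


def play_turn(state, user_message, meter):
    """Advance one turn. The toy agent fails if the user changes their mind late."""
    meter.total += TOKENS_PER_TURN
    state.history.append(user_message)
    if user_message == "change_mind" and len(state.history) >= 4:
        state.failed = True


def linear_rollouts(n, rng):
    """Baseline: regenerate every conversation from the first turn."""
    meter, failures = TokenMeter(), 0
    for _ in range(n):
        state = State()
        for _ in range(MAX_TURNS):
            play_turn(state, rng.choices(BEHAVIORS, weights=WEIGHTS)[0], meter)
        failures += state.failed
    return failures, meter.total


def branched_rollouts(junction_turn):
    """Generate the shared prefix once, snapshot it, then branch per behavior."""
    meter, failures = TokenMeter(), 0
    state = State()
    for _ in range(junction_turn):
        play_turn(state, "confirm", meter)
    snapshot = copy.deepcopy(state)        # the prefix is paid for only once
    for behavior in BEHAVIORS:             # one targeted branch per behavior
        branch = copy.deepcopy(snapshot)   # resume execution from the snapshot
        play_turn(branch, behavior, meter)
        while len(branch.history) < MAX_TURNS:
            play_turn(branch, "confirm", meter)
        failures += branch.failed
    return failures, meter.total


lin_failures, lin_tokens = linear_rollouts(4, random.Random(0))
br_failures, br_tokens = branched_rollouts(junction_turn=3)
print(lin_tokens, br_tokens, br_failures)  # prints 2400 1500 1
```

With four conversations of six turns each, the linear baseline pays for 24 turns, while the branched run pays for 3 prefix turns plus 4 branches of 3 turns, 15 in total. Because each branch forces a different behavior at the junction, the rare `change_mind` path is always exercised, whereas plain sampling reaches it only occasionally. The numbers depend entirely on the toy assumptions above and say nothing about the paper's measured results.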