arxiv_cs_ai February 10, 2026

GISA: A Benchmark for General Information-Seeking Assistant

Translated: 2026/3/7 14:12:48

benchmark, information-seeking, large-language-models, multi-turn interactions

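Translation

Advances in large language models (LLMs) have greatly accelerated the development of search agents that gather information autonomously through multi-turn web interactions. Many benchmarks have been proposed to evaluate these agents, but they often build questions by working backward from answers, which yields unnatural tasks that diverge from real-world needs. They also tend to emphasize either finding specific information or integrating information from multiple sources, and they rely on static answer sets that carry a high risk of data contamination. To close these gaps, we introduce GISA, a benchmark for General Information-Seeking Assistants made up of 373 human-written queries that reflect authentic information-seeking scenarios. GISA uses four structured answer formats (item, set, list, and table), which makes deterministic evaluation possible. It combines deep reasoning and broad information aggregation within the same tasks, and it includes a live subset whose answers are updated periodically to resist memorization. In addition, GISA provides complete human search trajectories for every query, serving as gold-standard references for process-level supervision and imitation learning. Experiments on mainstream LLMs and commercial search products show that even the best-performing model reaches an exact match score of only 19.30%, and that performance drops markedly on tasks requiring complex planning and comprehensive information gathering. These findings point to substantial room for future improvement.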

Original Content

arXiv:2602.08543v1 Announce Type: cross

Abstract: The advancement of large language models (LLMs) has significantly accelerated the development of search agents capable of autonomously gathering information through multi-turn web interactions. Various benchmarks have been proposed to evaluate such agents. However, existing benchmarks often construct queries backward from answers, producing unnatural tasks misaligned with real-world needs. Moreover, these benchmarks tend to focus on either locating specific information or aggregating information from multiple sources, while relying on static answer sets prone to data contamination. To bridge these gaps, we introduce GISA, a benchmark for General Information-Seeking Assistants comprising 373 human-crafted queries that reflect authentic information-seeking scenarios. GISA features four structured answer formats (item, set, list, and table), enabling deterministic evaluation. It integrates both deep reasoning and broad information aggregation within unified tasks, and includes a live subset with periodically updated answers to resist memorization. Notably, GISA provides complete human search trajectories for every query, offering gold-standard references for process-level supervision and imitation learning. Experiments on mainstream LLMs and commercial search products reveal that even the best-performing model achieves only 19.30% exact match score, with performance notably degrading on tasks requiring complex planning and comprehensive information gathering. These findings highlight substantial room for future improvement.
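
The abstract says the four structured answer formats (item, set, list, and table) make deterministic, exact-match evaluation possible, but it does not spell out the scoring rules. The sketch below shows one way such a check could work. It is not the benchmark's actual scorer: the function names, the normalization step, the table representation (a list of row dictionaries), and the choice to ignore row order are all assumptions made for illustration.

```python
from typing import Any


def normalize(value: Any) -> str:
    """Hypothetical normalization: trim, collapse whitespace, lowercase."""
    return " ".join(str(value).split()).lower()


def exact_match(fmt: str, predicted: Any, gold: Any) -> bool:
    """Return True when the prediction equals the gold answer for a format."""
    if fmt == "item":
        # A single value.
        return normalize(predicted) == normalize(gold)
    if fmt == "set":
        # Unordered collection: order and duplicates are ignored.
        return {normalize(x) for x in predicted} == {normalize(x) for x in gold}
    if fmt == "list":
        # Ordered collection: position matters.
        return [normalize(x) for x in predicted] == [normalize(x) for x in gold]
    if fmt == "table":
        # Assumption: a table is a list of {column: cell} rows, row order ignored.
        def rows(table):
            return sorted(
                tuple(sorted((normalize(k), normalize(v)) for k, v in row.items()))
                for row in table
            )
        return rows(predicted) == rows(gold)
    raise ValueError(f"unknown answer format: {fmt}")


def exact_match_score(results: list[bool]) -> float:
    """Percentage of queries whose answer matched exactly."""
    return 100.0 * sum(results) / len(results) if results else 0.0
```

Under these assumptions, a set answer matches regardless of ordering while a list answer does not, and a table must agree in every cell to count. Because the comparison is all-or-nothing, an exact match score can stay low on tasks that need many rows or items gathered from different sources.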