arxiv_cs_ai April 24, 2026

Who Defines "Best"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards

Translated: 2026/4/24 20:18:24

llm-leaderboards, benchmark-analysis, interactive-visualization, evaluation-frameworks, model-comparison

Original Content

arXiv:2604.21769v1
Announce Type: new

Abstract: LLM leaderboards are widely used to compare models and guide deployment decisions. However, leaderboard rankings are shaped by evaluation priorities set by benchmark designers, rather than by the diverse goals and constraints of actual users and organizations. A single aggregate score often obscures how models behave across different prompt types and compositions. In this work, we conduct an in-depth analysis of the dataset used in the LMArena (formerly Chatbot Arena) benchmark and investigate this evaluation challenge by designing an interactive visualization interface as a design probe. Our analysis reveals that the dataset is heavily skewed toward certain topics, that model rankings vary across prompt slices, and that preference-based judgments are used in ways that blur their intended scope. Building on this analysis, we introduce a visualization interface that allows users to define their own evaluation priorities by selecting and weighting prompt slices and to explore how rankings change accordingly. A qualitative study suggests that this interactive approach improves transparency and supports more context-specific model evaluation, pointing toward alternative ways to design and use LLM leaderboards.
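
The idea of selecting and weighting prompt slices to see how rankings shift can be illustrated with a minimal sketch. This is not the authors' interface or scoring method: the abstract does not say how slice scores are computed or combined, so the sketch simply assumes a weighted average of per-slice scores. The model names, slice names, score values, and the `rank_models` helper are all hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical per-slice scores (for example, win rates in [0, 1]).
# Toy values for illustration only; not taken from the paper or from LMArena.
SLICE_SCORES: Dict[str, Dict[str, float]] = {
    "model_a": {"coding": 0.70, "creative_writing": 0.40, "math": 0.55},
    "model_b": {"coding": 0.50, "creative_writing": 0.65, "math": 0.45},
    "model_c": {"coding": 0.60, "creative_writing": 0.55, "math": 0.40},
}


def rank_models(
    scores: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank models by a weighted average of their per-slice scores.

    Slices missing from `weights` are treated as unselected (weight 0).
    The weighted average is an assumption made for this sketch.
    """
    if any(w < 0 for w in weights.values()):
        raise ValueError("slice weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one slice must have a positive weight")

    ranked = []
    for model, per_slice in scores.items():
        aggregate = sum(w * per_slice[s] for s, w in weights.items()) / total
        ranked.append((model, aggregate))

    # Highest aggregate score first.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


if __name__ == "__main__":
    # A user who only cares about coding prompts.
    print(rank_models(SLICE_SCORES, {"coding": 1.0}))
    # A user who weights creative writing three times as heavily as coding.
    print(rank_models(SLICE_SCORES, {"creative_writing": 3.0, "coding": 1.0}))
```

With these toy numbers, the top-ranked model differs between the two weightings, which is the kind of slice-dependent ranking change the abstract describes.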