arxiv_cs_ai April 24, 2026

Beyond One Output: Visualizing and Comparing Distributions of Language Model Generations

Translated: 2026/4/24 20:31:05

linguistics, machine-learning, visualization, language-model, human-computer-interaction

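Translated Abstract

arXiv:2604.18724v2 Announce Type: replace Abstract: Users usually interact with and evaluate language models (LMs) through single outputs, yet each output is only one sample from a broad distribution of possible completions. This mode of interaction hides distributional structure, such as modes, rare edge cases, and sensitivity to small changes in the prompt, so users over-generalize from individual examples when refining prompts for open-ended tasks. Building on a formative study with researchers who use LMs (n=13), which examined when stochasticity matters in practice, how they reason about distributions over language, and where current workflows break down, we propose GROVE. GROVE is an interactive visualization that represents multiple LM generations as overlapping paths through a text graph, exposing shared structure, branching points, and clusters while keeping the raw outputs accessible. We evaluate it in three crowdsourced user studies (N=47, 44, and 40 participants) targeting complementary distributional tasks. Our results support a hybrid workflow: graph summaries improve structural judgments such as assessing diversity, whereas direct inspection of outputs remains stronger for detail-oriented questions.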

Original Content

arXiv:2604.18724v2 Announce Type: replace Abstract: Users typically interact with and evaluate language models via single outputs, but each output is just one sample from a broad distribution of possible completions. This interaction hides distributional structure such as modes, uncommon edge cases, and sensitivity to small prompt changes, leading users to over-generalize from anecdotes when iterating on prompts for open-ended tasks. Informed by a formative study with researchers who use LMs (n=13) examining when stochasticity matters in practice, how they reason about distributions over language, and where current workflows break down, we introduce GROVE. GROVE is an interactive visualization that represents multiple LM generations as overlapping paths through a text graph, revealing shared structure, branching points, and clusters while preserving access to raw outputs. We evaluate across three crowdsourced user studies (N=47, 44, and 40 participants) targeting complementary distributional tasks. Our results support a hybrid workflow: graph summaries improve structural judgments such as assessing diversity, while direct output inspection remains stronger for detail-oriented questions.
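
The abstract does not specify how the text graph is built, so the following is only an illustrative sketch of the general idea of "overlapping paths", not GROVE's actual method. It assumes whitespace tokenization and merges generations into a simple prefix tree; the function names `build_prefix_graph` and `branching_points` and the sample sentences are hypothetical.

```python
from collections import defaultdict


def build_prefix_graph(generations):
    """Merge generations into a prefix tree keyed by token-prefix tuples.

    Returns a dict mapping (prefix, next_token) edges to the number of
    generations that traverse that edge. Assumption: whitespace tokens.
    """
    edge_counts = defaultdict(int)
    for text in generations:
        prefix = ()
        for token in text.split():
            edge_counts[(prefix, token)] += 1
            prefix = prefix + (token,)
    return dict(edge_counts)


def branching_points(edge_counts):
    """Return prefixes that are followed by more than one distinct token."""
    continuations = defaultdict(dict)
    for (prefix, token), count in edge_counts.items():
        continuations[prefix][token] = count
    return {p: nxt for p, nxt in continuations.items() if len(nxt) > 1}


# Hypothetical samples standing in for repeated generations of one prompt.
samples = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the dog ran away",
]
edges = build_prefix_graph(samples)
branches = branching_points(edges)

for prefix, options in branches.items():
    print(" ".join(prefix), "->", options)
```

Edge counts show how many samples share a path, and prefixes with several continuations mark where the samples diverge. A prefix tree only merges shared beginnings; merging shared middles or endings would need an alignment step that this sketch leaves out.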