arxiv_cs_ai February 10, 2026

DIAL-SUMMER: A Structured Evaluation Framework of Hierarchical Errors in Dialogue Summaries

Translated: 2026/3/7 13:24:02

dialsummer, dialogue-summaries, error-taxonomy, natural-language-processing, machine-learning-evaluation

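Translated Abstract (from Japanese)

Dialogue is the most common form of human communication, and automatically generated summaries of dialogues are very useful, for example when going back over the key points of a meeting or when reviewing conversations between customer agents and product users. Earlier work on evaluating dialogue summaries has largely overlooked two complexities specific to the task: (i) a shift in structure, from information exchanged in a scattered way across several turns and speakers to the sentences of a summary, and (ii) a shift in narration viewpoint, from the speakers' own first- and second-person narration to a standardized third-person narration. This work introduces the DIAL-SUMMER framework to address both. It proposes DIAL-SUMMER's taxonomy of errors, which evaluates dialogue summaries at two hierarchical levels: the DIALOGUE-LEVEL, which focuses on the broader speakers and turns, and the WITHIN-TURN-LEVEL, which focuses on the information inside a single turn. It then presents the DIAL-SUMMER dataset, annotated according to this taxonomy. Analysis of the annotated errors reveals interesting trends; in particular, turns in the middle of a dialogue are the ones most often missing from the summary. The authors also evaluate the ability of LLM-Judges to detect these errors, and use the results to show the difficulty of the dataset, the robustness of the taxonomy, and the need for further research on this task. The code and inference dataset are to be released soon.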

Original Content

arXiv:2602.08149v1 Announce Type: cross

Abstract: Dialogues are a predominant mode of communication for humans, and it is immensely helpful to have automatically generated summaries of them (e.g., to revise key points discussed in a meeting, to review conversations between customer agents and product users). Prior works on dialogue summary evaluation largely ignore the complexities specific to this task: (i) shift in structure, from multiple speakers discussing information in a scattered fashion across several turns, to a summary's sentences, and (ii) shift in narration viewpoint, from speakers' first/second-person narration, to standardized third-person narration in the summary. In this work, we introduce our framework DIAL-SUMMER to address the above. We propose DIAL-SUMMER's taxonomy of errors to comprehensively evaluate dialogue summaries at two hierarchical levels: DIALOGUE-LEVEL that focuses on the broader speakers/turns, and WITHIN-TURN-LEVEL that focuses on the information talked about inside a turn. We then present DIAL-SUMMER's dataset composed of dialogue summaries manually annotated with our taxonomy's fine-grained errors. We conduct empirical analyses of these annotated errors, and observe interesting trends (e.g., turns occurring in the middle of the dialogue are the most frequently missed in the summary, extrinsic hallucinations largely occur at the end of the summary). We also conduct experiments on LLM-Judges' capability at detecting these errors, through which we demonstrate the challenging nature of our dataset, the robustness of our taxonomy, and the need for future work in this field to enhance LLMs' performance in the same. Code and inference dataset coming soon.
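To make the two-level annotation and the positional analysis more concrete, here is a minimal Python sketch. It is not the authors' code or data format, which had not been released at the time of the abstract. The names `Level`, `ErrorAnnotation`, `position_bucket`, and `missed_turn_histogram`, the `"missed_turn"` label string, and the choice of splitting a dialogue into equal thirds are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    """The two hierarchical levels named in the abstract."""
    DIALOGUE = "dialogue-level"        # broader speakers/turns
    WITHIN_TURN = "within-turn-level"  # information inside one turn


@dataclass(frozen=True)
class ErrorAnnotation:
    """Hypothetical record for one annotated error in a summary."""
    level: Level
    label: str       # free-text label; the abstract does not list the taxonomy's labels
    turn_index: int  # 0-based index of the affected dialogue turn
    num_turns: int   # total number of turns in that dialogue


def position_bucket(turn_index: int, num_turns: int) -> str:
    """Assign a turn to the beginning, middle, or end third of its dialogue.

    Splitting into thirds is an assumption; the abstract does not say how
    "middle of the dialogue" is defined.
    """
    if not 0 <= turn_index < num_turns:
        raise ValueError("turn_index must satisfy 0 <= turn_index < num_turns")
    third = num_turns / 3
    if turn_index < third:
        return "beginning"
    if turn_index < 2 * third:
        return "middle"
    return "end"


def missed_turn_histogram(annotations: list[ErrorAnnotation]) -> Counter:
    """Count annotations labelled "missed_turn" by position bucket."""
    return Counter(
        position_bucket(a.turn_index, a.num_turns)
        for a in annotations
        if a.label == "missed_turn"
    )


if __name__ == "__main__":
    # Toy, made-up annotations used only to exercise the functions.
    toy = [
        ErrorAnnotation(Level.DIALOGUE, "missed_turn", 4, 9),
        ErrorAnnotation(Level.DIALOGUE, "missed_turn", 5, 9),
        ErrorAnnotation(Level.DIALOGUE, "missed_turn", 0, 9),
        ErrorAnnotation(Level.WITHIN_TURN, "other", 8, 9),
    ]
    print(missed_turn_histogram(toy))
```

Normalizing by relative position rather than raw turn index keeps dialogues of different lengths comparable, which is one plausible way to arrive at a statement such as "middle turns are missed most often".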