arxiv_cs_ai April 24, 2026

Evaluating AI Meeting Summaries with a Reusable Cross-Domain Pipeline

generative-ai, evaluation-pipeline, ai-meeting-summaries, benchmarking, deep-eval

Original Content

arXiv:2604.21345v1 Announce Type: new

Abstract: We present a reusable evaluation pipeline for generative AI applications, instantiated for AI meeting summaries and released with a public artifact package derived from a Dataset Pipeline. The system separates reusable orchestration from task-specific semantics across five stages: source intake, structured reference construction, candidate generation, structured scoring, and reporting. Unlike standalone claim scorers, it treats both ground truth and evaluator outputs as typed, persisted artifacts, enabling aggregation, issue analysis, and statistical testing. We benchmark the offline loop on a typed dataset of 114 meetings spanning city_council, private_data, and whitehouse_press_briefings, producing 340 meeting-model pairs and 680 judge runs across gpt-4.1-mini, gpt-5-mini, and gpt-5.1. Under this protocol, gpt-4.1-mini achieves the highest mean accuracy (0.583), while gpt-5.1 leads in completeness (0.886) and coverage (0.942). Paired sign tests with Holm correction show no significant accuracy winner but confirm significant retention gains for gpt-5.1. A typed DeepEval contrastive baseline preserves the retention ordering but reports higher holistic accuracy, suggesting that reference-based scoring may overlook unsupported-specifics errors that claim-grounded evaluation captures. Typed analysis identifies whitehouse_press_briefings as an accuracy-challenging domain with frequent unsupported specifics. A deployment follow-up shows gpt-5.4 outperforming gpt-4.1 across all metrics, with statistically robust gains on retention metrics under the same protocol. The system benchmarks the offline loop; it documents, but does not quantitatively evaluate, the online feedback-to-evaluation path.
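
The abstract names paired sign tests with Holm correction as the statistical procedure but gives no implementation detail. The following is a minimal sketch of that general technique, not the authors' code. It assumes an exact two-sided sign test with ties dropped and Holm's step-down adjustment. The function names `sign_test_p` and `holm_adjust` are hypothetical, and the per-meeting scores are made-up toy values for illustration only, not results from the paper.

```python
from math import comb


def sign_test_p(a, b):
    """Exact two-sided paired sign test p-value; tied pairs are dropped."""
    wins = sum(x > y for x, y in zip(a, b))
    losses = sum(x < y for x, y in zip(a, b))
    n = wins + losses
    if n == 0:
        return 1.0  # every pair tied: no evidence either way
    k = min(wins, losses)
    # Probability of k or fewer successes under Binomial(n, 0.5), doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, ...
        candidate = min(1.0, (m - rank) * pvalues[idx])
        # Enforce monotonicity so adjusted values never decrease.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Toy per-meeting scores for three hypothetical systems (illustrative only).
scores = {
    "model_a": [0.61, 0.55, 0.70, 0.48, 0.66, 0.59],
    "model_b": [0.58, 0.57, 0.64, 0.45, 0.60, 0.52],
    "model_c": [0.50, 0.49, 0.62, 0.40, 0.58, 0.51],
}
pairs = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
raw = [sign_test_p(scores[x], scores[y]) for x, y in pairs]
for (x, y), p_raw, p_adj in zip(pairs, raw, holm_adjust(raw)):
    print(f"{x} vs {y}: raw p={p_raw:.4f}, Holm-adjusted p={p_adj:.4f}")
```

A sign test uses only the direction of each per-meeting difference, so it makes no assumption about how the score differences are distributed. The Holm adjustment then controls the family-wise error rate across the pairwise model comparisons.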