arxiv_cs_ai April 24, 2026

Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization

llm-as-a-judge, prompt-optimization, legal-qa, text-generation, machine-learning

Original Content

arXiv:2604.20726v2 Announce Type: replace-cross

Abstract: This work explores the role of prompt design and judge selection in LLM-as-a-Judge evaluations of free-text legal question answering. We examine whether automatic task prompt optimization improves over human-centered design, whether optimization effectiveness varies by judge feedback style, and whether optimized prompts transfer across judges. We systematically address these questions on the LEXam benchmark by optimizing task prompts using the ProTeGi method with feedback from two judges (Qwen3-32B, DeepSeek-V3) across four task models, and then testing cross-judge transfer. Automatic optimization consistently outperforms the baseline, with lenient judge feedback yielding higher and more consistent gains than strict judge feedback. Prompts optimized with lenient feedback transfer better to strict judges than prompts optimized with strict feedback transfer to lenient judges. Analysis reveals that lenient judges provide permissive feedback, yielding prompts with broader applicability, whereas strict judges produce restrictive feedback, leading to judge-specific overfitting. Our findings demonstrate that algorithmically optimizing prompts on training data can outperform human-centered prompt design, and that judges' dispositions during optimization shape prompt generalizability.
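The protocol in the abstract (optimize a task prompt against one judge's feedback, then score the result under every judge) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not the authors' code: the judges are deterministic keyword stubs named `judge_a` and `judge_b` rather than Qwen3-32B or DeepSeek-V3, the rubric items are invented, and the toy judge scores the prompt directly instead of grading model answers on LEXam. The loop keeps only the feedback-driven beam search skeleton of ProTeGi and omits the paraphrasing and bandit-based candidate selection of the full method. The numbers it prints carry no empirical meaning.

```python
from typing import Callable, Dict, List

Judge = Callable[[str], float]  # maps a task prompt to a score in [0, 1]

# Hypothetical rubrics; a real judge would be an LLM grading model answers.
RUBRICS: Dict[str, List[str]] = {
    "judge_a": ["legal basis", "conclusion"],
    "judge_b": ["legal basis", "citation"],
}


def make_judge(rubric: List[str]) -> Judge:
    """Toy judge: fraction of rubric items that the prompt mentions."""
    def judge(prompt: str) -> float:
        return sum(item in prompt for item in rubric) / len(rubric)
    return judge


def feedback(prompt: str, rubric: List[str]) -> List[str]:
    """Stand-in for textual judge feedback: rubric items the prompt misses."""
    return [item for item in rubric if item not in prompt]


def optimize(seed: str, rubric: List[str], steps: int = 2, beam_width: int = 2) -> str:
    """Feedback-driven beam search over prompt edits (simplified skeleton)."""
    judge = make_judge(rubric)
    beam = [seed]
    for _ in range(steps):
        candidates = list(beam)
        for prompt in beam:
            # Each piece of feedback yields one edited candidate prompt.
            for item in feedback(prompt, rubric):
                candidates.append(f"{prompt} State the {item}.")
        unique = list(dict.fromkeys(candidates))  # dedupe, keep insertion order
        beam = sorted(unique, key=judge, reverse=True)[:beam_width]
    return beam[0]


def transfer_matrix(seed: str) -> Dict[str, Dict[str, float]]:
    """Optimize under each source judge, then score under every target judge."""
    matrix: Dict[str, Dict[str, float]] = {}
    for source, rubric in RUBRICS.items():
        best = optimize(seed, rubric)
        matrix[source] = {
            target: make_judge(target_rubric)(best)
            for target, target_rubric in RUBRICS.items()
        }
    return matrix


if __name__ == "__main__":
    for source, row in transfer_matrix("Answer the legal question.").items():
        print(source, row)
```

In the resulting matrix, rows are the judge used during optimization and columns are the judge used for evaluation; the off-diagonal entries are the cross-judge transfer scores. In a real replication the stubs would be replaced by calls to the judge and task models, and scores would be averaged over held-out questions.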