arxiv_cs_ai April 24, 2026

Mitigating Lost in Multi-turn Conversation via Curriculum RL with Verifiable Accuracy and Abstention Rewards

Translated: 2026/4/24 20:32:59

reinforcement-learning, large-language-models, multi-turn-conversation, curriculum-learning, rlvr

Original Content

arXiv:2510.18731v2 Announce Type: replace-cross

Abstract: Large Language Models demonstrate strong capabilities in single-turn instruction following but suffer from Lost-in-Conversation (LiC), a degradation in performance as information is revealed progressively in multi-turn settings. Motivated by recent progress in Reinforcement Learning with Verifiable Rewards (RLVR), we propose Curriculum Reinforcement Learning with Verifiable Accuracy and Abstention Rewards (RLAAR), a framework that encourages models not only to generate correct answers but also to judge the solvability of questions in the multi-turn conversation setting. Our approach employs a competence-gated curriculum that incrementally increases dialogue difficulty (measured in instruction shards), stabilizing training while promoting reliability. Using multi-turn, on-policy rollouts and a mixed-reward system, RLAAR teaches models to balance problem-solving with informed abstention, reducing the premature answering behaviors that cause LiC. Evaluated on LiC benchmarks, RLAAR significantly mitigates LiC performance decay (62.6% to 75.1%) and improves calibrated abstention rates (33.5% to 73.4%). Together, these results provide a practical recipe for building reliable and trustworthy multi-turn LLMs.
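The abstract does not give reward values, gating thresholds, or an answer-checking procedure, so the following is only a minimal sketch of how a mixed accuracy/abstention reward and a competence-gated curriculum over instruction shards might fit together. Every name and number here (`mixed_reward`, `CompetenceGatedCurriculum`, the ±1 rewards, the 0.7 threshold, the window size, exact-match checking) is a hypothetical assumption for illustration, not the authors' implementation.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional

ABSTAIN = None  # hypothetical marker: the model declined to answer at this turn


def mixed_reward(answer: Optional[str], gold: str, solvable: bool,
                 r_correct: float = 1.0, r_abstain: float = 1.0,
                 r_wrong: float = -1.0) -> float:
    """Toy verifiable reward (assumed values): accuracy when the question is
    solvable, abstention when information is still missing."""
    if answer is ABSTAIN:
        # Abstaining is rewarded only while shards are still unrevealed.
        return r_abstain if not solvable else r_wrong
    if not solvable:
        # Premature answer: penalized even if it happens to match the gold.
        return r_wrong
    # Exact string match stands in for a real verifier (assumption).
    return r_correct if answer.strip() == gold.strip() else r_wrong


@dataclass
class CompetenceGatedCurriculum:
    """Raise the number of instruction shards only after the mean reward over
    a full window of recent rollouts clears a threshold (assumed gate)."""
    max_shards: int
    threshold: float = 0.7  # assumed competence gate
    window: int = 4         # assumed number of recent rollouts to average
    num_shards: int = 1
    _recent: Deque[float] = field(default_factory=deque)

    def update(self, reward: float) -> int:
        self._recent.append(reward)
        if len(self._recent) > self.window:
            self._recent.popleft()
        window_full = len(self._recent) == self.window
        if (window_full
                and sum(self._recent) / self.window >= self.threshold
                and self.num_shards < self.max_shards):
            self.num_shards += 1   # harder dialogues: one more shard
            self._recent.clear()   # re-measure competence at the new level
        return self.num_shards
```

In a training loop under these assumptions, each on-policy multi-turn rollout would be scored with `mixed_reward` at the turn where the model answers or abstains, and the resulting reward would be passed to `update` to decide how many shards the next sampled dialogue should contain.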