Dialogue Model Optimization via Agent Game and Adaptive Tree-based GRPO
arXiv:2602.08533v1 Announce Type: new
Abstract: Open-ended dialogue agents aim to deliver engaging, personalized interactions by adapting to users' traits, but existing methods face critical limitations: over-reliance on pre-collected user data, and short-horizon biases in reinforcement learning (RL) that neglect long-term dialogue value. To address these, we propose a novel long-horizon RL framework integrating online personalization with Adaptive Tree-based Group Relative Policy Optimization (AT-GRPO). Adopting a two-agent game paradigm, a user agent constructs dynamic environments via style mimicry (learning user-specific conversational traits) and active termination (predicting turn-level termination probabilities as immediate rewards), forming an iterative cycle that drives the dialogue agent to deepen interest exploration. AT-GRPO reinterprets dialogue trajectories as trees and introduces adaptive observation ranges. Unlike full tree expansion that incurs exponential overhead, it limits each node to aggregate rewards from a stage-aware range: larger ranges support early-stage topic exploration, while smaller ranges facilitate late-stage dialogue maintenance. This design reduces rollout budgets from exponential to polynomial in the dialogue length, while preserving long-term reward capture. Extensive experiments show our framework's superior performance, sample efficiency, and robustness.
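To make the stage-aware aggregation concrete, below is a minimal Python sketch of the idea as described in the abstract; it is not the authors' implementation. The names (TurnNode, observation_range, node_return, group_relative_advantages), the linear range schedule, and the discounting are all assumptions: the abstract only states that each node aggregates rewards over a range that is larger early (topic exploration) and smaller late (dialogue maintenance), with turn-level termination probabilities serving as immediate rewards.

```python
from dataclasses import dataclass, field

@dataclass
class TurnNode:
    """One dialogue turn in the trajectory tree. `reward` is the immediate
    turn-level reward (per the abstract, the user agent's predicted
    termination probability)."""
    depth: int                                  # turn index within the dialogue
    reward: float                               # immediate turn-level reward
    children: list["TurnNode"] = field(default_factory=list)

def observation_range(depth: int, max_depth: int,
                      r_max: int = 6, r_min: int = 2) -> int:
    """Stage-aware range: wide early (topic exploration), narrow late
    (dialogue maintenance). The linear schedule and the bounds r_max/r_min
    are illustrative assumptions; the abstract only says the range shrinks
    with dialogue stage."""
    frac = depth / max(max_depth, 1)
    return max(r_min, round(r_max - frac * (r_max - r_min)))

def node_return(node: TurnNode, max_depth: int, gamma: float = 0.95) -> float:
    """Aggregate discounted rewards from descendants within the node's
    adaptive observation range, instead of expanding the full subtree."""
    horizon = observation_range(node.depth, max_depth)
    total, frontier, discount = node.reward, node.children, gamma
    for _ in range(horizon):
        if not frontier:
            break
        # Average over the current level so branching does not inflate returns.
        total += discount * (sum(c.reward for c in frontier) / len(frontier))
        frontier = [g for c in frontier for g in c.children]
        discount *= gamma
    return total

def group_relative_advantages(returns: list[float]) -> list[float]:
    """GRPO-style advantage: standardize returns within a non-empty group
    of sibling rollouts (mean-centered, scaled by the group std)."""
    mean = sum(returns) / len(returns)
    std = (sum((r - mean) ** 2 for r in returns) / len(returns)) ** 0.5 or 1.0
    return [(r - mean) / std for r in returns]

# Toy usage: a 3-turn dialogue branch with termination-probability rewards.
leaf = TurnNode(depth=2, reward=0.7)
mid = TurnNode(depth=1, reward=0.4, children=[leaf])
root = TurnNode(depth=0, reward=0.1, children=[mid])
print(node_return(root, max_depth=2))           # range-limited discounted return
print(group_relative_advantages([0.9, 1.2, 0.6]))
```

Because each node only looks a bounded number of turns ahead, the number of reward evaluations grows roughly with dialogue length times the range size rather than with the full branching of the tree, which is consistent with the abstract's claim of reducing rollout budgets from exponential to polynomial in the dialogue length.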