Dialogue Model Optimization via Agent Game and Adaptive Tree-based GRPO
arXiv:2602.08533v1 Announce Type: new
Abstract: Open-ended dialogue agents aim to deliver engaging, personalized interactions by adapting to users' traits, but existing methods face critical limitations: over-reliance on pre-collected user data, and short-horizon biases in reinforcement learning (RL) that neglect long-term dialogue value. To address these, we propose a novel long-horizon RL framework integrating online personalization with Adaptive Tree-based Group Relative Policy Optimization (AT-GRPO). Adopting a two-agent game paradigm, a user agent constructs dynamic environments via style mimicry (learning user-specific conversational traits) and active termination (predicting turn-level termination probabilities as immediate rewards), forming an iterative cycle that drives the dialogue agent to deepen interest exploration. AT-GRPO reinterprets dialogue trajectories as trees and introduces adaptive observation ranges. Unlike full tree expansion that incurs exponential overhead, it limits each node to aggregate rewards from a stage-aware range: larger ranges support early-stage topic exploration, while smaller ranges facilitate late-stage dialogue maintenance. This design reduces rollout budgets from exponential to polynomial in the dialogue length, while preserving long-term reward capture. Extensive experiments show our framework's superior performance, sample efficiency, and robustness.
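To make the stage-aware aggregation concrete, below is a minimal Python sketch of the idea as described in the abstract; it is not the authors' implementation. The names (TurnNode, observation_range, node_return, group_relative_advantages), the linear range schedule, and the discounting are all assumptions: the abstract only states that each node aggregates rewards over a range that is larger early (topic exploration) and smaller late (dialogue maintenance), with turn-level termination probabilities serving as immediate rewards.

```python
from dataclasses import dataclass, field

@dataclass
class TurnNode:
    """One dialogue turn in the trajectory tree. `reward` is the immediate
    turn-level reward (per the abstract, the user agent's predicted
    termination probability)."""
    depth: int                                  # turn index within the dialogue
    reward: float                               # immediate turn-level reward
    children: list["TurnNode"] = field(default_factory=list)

def observation_range(depth: int, max_depth: int,
                      r_max: int = 6, r_min: int = 2) -> int:
    """Stage-aware range: wide early (topic exploration), narrow late
    (dialogue maintenance). The linear schedule and the bounds r_max/r_min
    are illustrative assumptions; the abstract only says the range shrinks
    with dialogue stage."""
    frac = depth / max(max_depth, 1)
    return max(r_min, round(r_max - frac * (r_max - r_min)))

def node_return(node: TurnNode, max_depth: int, gamma: float = 0.95) -> float:
    """Aggregate discounted rewards from descendants within the node's
    adaptive observation range, instead of expanding the full subtree."""
    horizon = observation_range(node.depth, max_depth)
    total, frontier, discount = node.reward, node.children, gamma
    for _ in range(horizon):
        if not frontier:
            break
        # Average over the current level so branching does not inflate returns.
        total += discount * (sum(c.reward for c in frontier) / len(frontier))
        frontier = [g for c in frontier for g in c.children]
        discount *= gamma
    return total

def group_relative_advantages(returns: list[float]) -> list[float]:
    """GRPO-style advantage: standardize returns within a non-empty group
    of sibling rollouts (mean-centered, scaled by the group std)."""
    mean = sum(returns) / len(returns)
    std = (sum((r - mean) ** 2 for r in returns) / len(returns)) ** 0.5 or 1.0
    return [(r - mean) / std for r in returns]

# Toy usage: a 3-turn dialogue branch with termination-probability rewards.
leaf = TurnNode(depth=2, reward=0.7)
mid = TurnNode(depth=1, reward=0.4, children=[leaf])
root = TurnNode(depth=0, reward=0.1, children=[mid])
print(node_return(root, max_depth=2))           # range-limited discounted return
print(group_relative_advantages([0.9, 1.2, 0.6]))
```

Because each node only looks a bounded number of turns ahead, the number of reward evaluations grows roughly with dialogue length times the range size rather than with the full branching of the tree, which is consistent with the abstract's claim of reducing rollout budgets from exponential to polynomial in the dialogue length.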