arxiv_cs_ai February 10, 2026

WildReward: Learning Reward Models from In-the-Wild Human Interactions

Translated: 2026/3/7 14:17:03

language-models, reward-models, ordinal-regression, machine-learning

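Japanese Translation (rendered in English)

Reward models (RMs) are essential for training large language models (LLMs), but they usually depend on large-scale preference pairs annotated by humans. As LLMs have been widely deployed, in-the-wild interactions with people have emerged as a rich source of implicit reward signals. This raises a question: can we develop reward models directly from real-world human interactions? In this work, we use WildChat as the interaction source and propose a pipeline that extracts high-quality human feedback, which is then used to train WildReward. Training is performed with ordinal regression directly on user feedback and requires no preference pairs. Extensive experiments show that WildReward performs comparably to, or even better than, conventional reward models, with better calibration and cross-sample consistency. We also find that WildReward benefits directly from user diversity: including more users yields a stronger reward model. Finally, applying WildReward to online DPO (Direct Preference Optimization) training produces clear performance improvements across a variety of tasks. Resources are shared at https://github.com/THU-KEG/WildReward.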

Original Content

arXiv:2602.08829v1, Announce Type: cross

Abstract: Reward models (RMs) are crucial for the training of large language models (LLMs), yet they typically rely on large-scale human-annotated preference pairs. With the widespread deployment of LLMs, in-the-wild interactions have emerged as a rich source of implicit reward signals. This raises the question: Can we develop reward models directly from in-the-wild interactions? In this work, we explore this possibility by adopting WildChat as an interaction source and proposing a pipeline to extract reliable human feedback, yielding 186k high-quality instances for training WildReward via ordinal regression directly on user feedback without preference pairs. Extensive experiments demonstrate that WildReward achieves comparable or even superior performance compared to conventional reward models, with improved calibration and cross-sample consistency. We also observe that WildReward benefits directly from user diversity, where more users yield stronger reward models. Finally, we apply WildReward to online DPO training and observe significant improvements across various tasks. Code and data are released at https://github.com/THU-KEG/WildReward.
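The abstract says WildReward is trained with ordinal regression on user feedback rather than on preference pairs, but it does not spell out the formulation. As a rough illustration only, the sketch below uses a standard cumulative-link (proportional-odds) model, which may differ from what the authors actually implement. The four feedback levels, the threshold values, and the function names `ordinal_probs`, `ordinal_nll`, and `expected_level` are all assumptions made for this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def ordinal_probs(scores, thresholds):
    """Cumulative-link model: P(y <= k | s) = sigmoid(theta_k - s).

    scores:     scalar outputs of a (hypothetical) reward head, shape (N,)
    thresholds: strictly increasing cut points, shape (K-1,)
    returns:    probabilities over K ordered feedback levels, shape (N, K)
    """
    s = np.asarray(scores, dtype=float)[:, None]
    theta = np.asarray(thresholds, dtype=float)[None, :]
    cdf = sigmoid(theta - s)                      # (N, K-1)
    n = cdf.shape[0]
    # Pad with P(y <= -1) = 0 and P(y <= K-1) = 1, then take differences.
    cdf = np.concatenate([np.zeros((n, 1)), cdf, np.ones((n, 1))], axis=1)
    return np.diff(cdf, axis=1)


def ordinal_nll(scores, labels, thresholds, eps=1e-12):
    """Mean negative log-likelihood of the observed ordinal labels."""
    probs = ordinal_probs(scores, thresholds)
    labels = np.asarray(labels)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + eps)))


def expected_level(scores, thresholds):
    """Collapse the level distribution into one scalar reward per sample."""
    probs = ordinal_probs(scores, thresholds)
    return probs @ np.arange(probs.shape[1])


if __name__ == "__main__":
    # Assumption: 4 ordered feedback levels (0 = worst, 3 = best).
    thresholds = [-1.0, 0.0, 1.0]
    scores = [-3.0, 3.0]
    print(ordinal_probs(scores, thresholds).round(3))
    print(ordinal_nll(scores, [0, 3], thresholds))  # labels agree with scores
    print(ordinal_nll(scores, [3, 0], thresholds))  # labels disagree
    print(expected_level(scores, thresholds))
```

Each training instance here carries a single ordered label, not a chosen/rejected pair, which is the property the abstract highlights. The loss penalizes a score more the further its predicted level sits from the observed one, and the expected level gives a scalar that could be used to rank sampled responses.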