arxiv_cs_lg April 24, 2026

Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text

self-play, reinforcement-learning, large-language-models, open-ended-tasks, rubric-based-training

Original Content

arXiv:2604.20051v1 Announce Type: cross

Abstract: Self-play has recently emerged as a promising paradigm to train Large Language Models (LLMs). In self-play, the target LLM creates the task input (e.g., ask a question), which it then addresses itself by producing a task output (e.g., give an answer). A reward model evaluates the output, and the rewards are then used to train the LLM, typically via Reinforcement Learning (RL). Self-play incurs minimal supervision costs, and this is especially helpful for post-training LLMs, which require high-quality input-output pairs that traditionally have to be written by humans or expensive proprietary models. However, existing work explores self-play only for verifiable tasks such as math and coding. Instead, we seek to extend it to more realistic open-ended tasks. In particular, we propose POP, a self-play framework that uses the same LLM to synthesize evaluation rubrics, along with input-output pairs, for each example. The rubric is then used to evaluate outputs and train the model. We further ground the framework on a content-rich pretraining corpus to (1) ensure a generation-verification gap and reduce reward hacking, and (2) prevent mode collapse. On Qwen-2.5-7B, POP increases performance of both pretrained and instruction-tuned models, across different tasks ranging from long-form Healthcare QA to creative writing and instruction following.
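The loop the abstract describes (one model proposes a task and a rubric from a corpus passage, answers the task, and is scored against the rubric) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation. The prompt tags, the `LLM` callable signature, the "fraction of criteria passed" reward, and the choice to hide the passage from the solver are all assumptions made for the example. The abstract does not specify any of them. `toy_llm` is a hypothetical deterministic stand-in so the sketch runs without a real model, and the RL update itself is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: a text-in, text-out call to the single LLM being trained.
LLM = Callable[[str], str]


@dataclass
class Episode:
    task_input: str
    rubric: List[str]
    outputs: List[str]
    rewards: List[float]


def rubric_reward(llm: LLM, rubric: List[str], task_input: str, output: str) -> float:
    """Fraction of rubric criteria the judge call marks as met (assumed scoring rule)."""
    if not rubric:
        return 0.0
    passed = 0
    for criterion in rubric:
        verdict = llm(
            f"JUDGE\nTask: {task_input}\nCriterion: {criterion}\n"
            f"Answer: {output}\nReply YES or NO."
        )
        passed += verdict.strip().upper().startswith("YES")
    return passed / len(rubric)


def self_play_episode(llm: LLM, passage: str, num_samples: int = 2) -> Episode:
    """One self-play episode grounded on a pretraining passage."""
    # The same model proposes the task and its rubric while seeing the passage.
    task_input = llm(f"PROPOSE_TASK\nPassage: {passage}").strip()
    rubric_text = llm(f"PROPOSE_RUBRIC\nPassage: {passage}\nTask: {task_input}")
    rubric = [line.strip() for line in rubric_text.splitlines() if line.strip()]
    # Assumption: the solver does not see the passage, so checking is easier than solving.
    outputs = [llm(f"SOLVE\nTask: {task_input}\nSample: {i}") for i in range(num_samples)]
    rewards = [rubric_reward(llm, rubric, task_input, out) for out in outputs]
    return Episode(task_input, rubric, outputs, rewards)


def toy_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model, for demonstration only."""
    if prompt.startswith("PROPOSE_TASK"):
        return "Explain why hand washing reduces infection."
    if prompt.startswith("PROPOSE_RUBRIC"):
        return "mentions soap\nmentions germs"
    if prompt.startswith("SOLVE"):
        return "Soap removes germs." if prompt.endswith("Sample: 0") else "It just helps."
    # JUDGE: pass a criterion if its last word appears in the answer line.
    lines = prompt.splitlines()
    criterion = next(l for l in lines if l.startswith("Criterion: "))
    answer = next(l for l in lines if l.startswith("Answer: "))
    return "YES" if criterion.split()[-1] in answer.lower() else "NO"


if __name__ == "__main__":
    episode = self_play_episode(toy_llm, "Soap lifts germs off skin.")
    for out, reward in zip(episode.outputs, episode.rewards):
        print(f"{reward:.2f}  {out}")
```

In a real setup the per-output rewards would feed an RL update of the same model, and the passage would be sampled from the pretraining corpus. Both are left out here because the abstract gives no details on either.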