arxiv_cs_lg, April 24, 2026

Scaling Self-Play with Self-Guidance

llm, reinforcement-learning, self-play, scaling-laws, theorem-proving

Original Content

arXiv:2604.20209v1. Abstract: LLM self-play algorithms are notable in that, in principle, nothing bounds their learning: a Conjecturer model creates problems for a Solver, and both improve together. However, in practice, existing LLM self-play methods do not scale well with large amounts of compute, instead hitting learning plateaus. We argue this is because over long training runs, the Conjecturer learns to hack its reward, collapsing to artificially complex problems that do not help the Solver improve. To overcome this, we introduce Self-Guided Self-Play (SGS), a self-play algorithm in which the language model itself guides the Conjecturer away from degeneracy. In SGS, the model takes on three roles: Solver, Conjecturer, and a Guide that scores synthetic problems by their relevance to unsolved target problems and how clean and natural they are, providing supervision against Conjecturer collapse. Our core hypothesis is that language models can assess whether a subproblem is useful for achieving a goal. We evaluate the scaling properties of SGS by running training for significantly longer than prior work and by fitting scaling laws to cumulative solve rate curves. Applying SGS to formal theorem proving in Lean4, we find that it surpasses the asymptotic solve rate of our strongest RL baseline in fewer than 80 rounds of self-play and enables a 7B-parameter model, after 200 rounds of self-play, to solve more problems than a 671B-parameter model at pass@4.
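The abstract does not say how the Guide's score enters training, so the following is only an illustrative sketch of the three-role loop, not the authors' implementation. Everything in it is an assumption: the names `ScoredProblem`, `difficulty_reward`, `conjecturer_reward`, and `self_play_round`, the callables `conjecture`, `solve`, and `guide`, the bell-shaped difficulty term, and the choice to multiply it by the Guide score.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScoredProblem:
    statement: str
    solve_rate: float   # fraction of Solver attempts that succeeded, in [0, 1]
    guide_score: float  # Guide's rating of relevance and naturalness, in [0, 1]


def difficulty_reward(solve_rate: float) -> float:
    # Assumed shape: highest for problems solved about half the time,
    # zero for problems that are always or never solved.
    return 4.0 * solve_rate * (1.0 - solve_rate)


def conjecturer_reward(problem: ScoredProblem) -> float:
    # Assumed combination: the Guide score gates the difficulty term, so a
    # hard but irrelevant or unnatural problem earns little reward.
    return problem.guide_score * difficulty_reward(problem.solve_rate)


def self_play_round(
    target: str,
    conjecture: Callable[[str], str],    # hypothetical Conjecturer role
    solve: Callable[[str], bool],        # hypothetical Solver role (one attempt)
    guide: Callable[[str, str], float],  # hypothetical Guide role
    n_problems: int = 4,
    n_attempts: int = 8,
) -> List[ScoredProblem]:
    # One round: propose problems for an unsolved target, attempt each one
    # several times, and attach the Guide's score to each proposal.
    scored = []
    for _ in range(n_problems):
        statement = conjecture(target)
        successes = sum(1 for _ in range(n_attempts) if solve(statement))
        scored.append(
            ScoredProblem(statement, successes / n_attempts, guide(target, statement))
        )
    return scored
```

Under these assumptions, a Conjecturer that drifts toward artificially complex problems is penalized through `guide_score` even when those problems sit at a favorable solve rate, which is the kind of supervision against collapse the abstract attributes to the Guide.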