arxiv_cs_ai February 10, 2026

Efficient and Stable Reinforcement Learning for Diffusion Language Models

Translated: 2026/3/7 11:18:40

reinforcement-learning, large-language-models-dllms, spatio-temporal-pruning-stp, diffusion-based-language-models

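Translation (from Japanese)

Reinforcement learning (RL) is crucial for unlocking the complex reasoning capabilities of diffusion-based large language models (dLLMs). However, applying RL to dLLMs raises unique challenges in efficiency and stability. To address these challenges, we propose the Spatio-Temporal Pruning (STP) framework. It consists of (1) spatial pruning (SP), which constrains the exploration space using static priors, and (2) temporal pruning (TP), which bypasses redundant late-stage refinement steps. Our theoretical analysis shows that this reduces the variance of the log-likelihood estimate and thereby stabilizes policy updates. Our code is available at https://github.com/Lolo1222/STP.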

Original Content

arXiv:2602.08905v1 Announce Type: new

Abstract: Reinforcement Learning (RL) is crucial for unlocking the complex reasoning capabilities of Diffusion-based Large Language Models (dLLMs). However, applying RL to dLLMs faces unique challenges in efficiency and stability. To address these challenges, we propose Spatio-Temporal Pruning (STP), a framework designed to simultaneously improve the efficiency and stability of RL for dLLMs. STP compresses the redundancy in the generative process through: (1) spatial pruning, which constrains the exploration space using static priors; and (2) temporal pruning, which bypasses redundant late-stage refinement steps. Our theoretical analysis demonstrates that STP strictly reduces the variance of the log-likelihood estimation, thereby ensuring more stable policy updates. Extensive experiments demonstrate that STP surpasses state-of-the-art baselines in both efficiency and accuracy. Our code is available at https://github.com/Lolo1222/STP.
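
The abstract does not say how either pruning step is implemented, so the following is only a toy Python sketch of the two ideas it names, not the STP code from the repository. Everything in it is a hypothetical stand-in: the function `toy_denoise`, the `prior` dictionary (positions fixed in advance, standing in for a static prior), the `keep_steps` cutoff (standing in for skipping late-stage refinement), and the two-letter vocabulary. It counts how many stochastic token choices a denoising rollout makes, which is the quantity both forms of pruning would shrink in this simplified picture.

```python
import random

MASK = None  # marker for a position that has not been decoded yet


def toy_denoise(length, total_steps, prior=None, keep_steps=None, seed=0):
    """Toy masked-denoising loop; an illustration, NOT the paper's algorithm.

    prior      -- hypothetical {position: token} map fixed before sampling
                  (assumed stand-in for spatial pruning with a static prior)
    keep_steps -- hypothetical cutoff; later steps are skipped and the rest
                  is filled deterministically
                  (assumed stand-in for temporal pruning)
    Returns (tokens, sampled), where `sampled` counts stochastic choices.
    """
    rng = random.Random(seed)
    prior = prior or {}
    tokens = [prior.get(i, MASK) for i in range(length)]
    steps = total_steps if keep_steps is None else min(keep_steps, total_steps)
    sampled = 0
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break
        # Spread the remaining masked positions evenly over the full schedule.
        remaining_steps = total_steps - step
        k = -(-len(masked) // remaining_steps)  # ceiling division
        for i in rng.sample(masked, k):
            tokens[i] = rng.choice("ab")  # a choice an RL objective would score
            sampled += 1
    # Anything still masked after the cutoff is filled without sampling.
    tokens = ["a" if t is MASK else t for t in tokens]
    return tokens, sampled


full_tokens, full_count = toy_denoise(8, 4)
pruned_tokens, pruned_count = toy_denoise(
    8, 4, prior={0: "a", 1: "b"}, keep_steps=2
)
print(full_count, pruned_count)
```

In this toy setting the unpruned rollout samples all 8 positions, while fixing two positions up front and stopping after two of the four steps leaves 4 sampled positions. How the actual method chooses its priors and cutoff, and how that connects to the variance result, is described in the paper and repository rather than here.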