arxiv_cs_lg February 10, 2026

Direct Soft-Policy Sampling via Langevin Dynamics

Translated: 2026/3/15 14:48:49

reinforcement-learning, langevin-dynamics, soft-policy, q-learning, mujo-co

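English Rendering of the Japanese Translation

arXiv:2602.07873v1 Announce Type: new Abstract: Soft policies in reinforcement learning define the policy as a Boltzmann distribution over the state-action value function, offering a principled mechanism for balancing exploration and exploitation. In practice, however, realizing such soft policies remains difficult. Existing approaches either rely on parametric policies with limited expressive power or adopt diffusion-based policies whose likelihoods are intractable, which hinders reliable entropy estimation in the soft-policy objective. We tackle this challenge by realizing soft-policy sampling directly through Langevin dynamics driven by the action gradient of the Q-function. This view leads to Langevin Q-Learning (LQL), which samples actions from the target Boltzmann distribution without explicitly parameterizing the policy. Applying Langevin dynamics directly, however, mixes slowly in high-dimensional, non-convex Q-landscapes, which limits its practical effectiveness. To resolve this, we propose Noise-Conditioned Langevin Q-Learning (NC-LQL), which integrates multi-scale noise perturbations into the value function. NC-LQL learns a noise-conditioned Q-function that induces a sequence of progressively smoothed value landscapes, allowing sampling to move from global exploration to precise mode refinement. On the OpenAI Gym MuJoCo benchmarks, NC-LQL performs competitively with state-of-the-art diffusion-based methods, providing a simple yet powerful solution for online RL.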

Original Content

arXiv:2602.07873v1 Announce Type: new

Abstract: Soft policies in reinforcement learning define policies as Boltzmann distributions over state-action value functions, providing a principled mechanism for balancing exploration and exploitation. However, realizing such soft policies in practice remains challenging. Existing approaches either depend on parametric policies with limited expressivity or employ diffusion-based policies whose intractable likelihoods hinder reliable entropy estimation in soft policy objectives. We address this challenge by directly realizing soft-policy sampling via Langevin dynamics driven by the action gradient of the Q-function. This perspective leads to Langevin Q-Learning (LQL), which samples actions from the target Boltzmann distribution without explicitly parameterizing the policy. However, directly applying Langevin dynamics suffers from slow mixing in high-dimensional and non-convex Q-landscapes, limiting its practical effectiveness. To overcome this, we propose Noise-Conditioned Langevin Q-Learning (NC-LQL), which integrates multi-scale noise perturbations into the value function. NC-LQL learns a noise-conditioned Q-function that induces a sequence of progressively smoothed value landscapes, enabling sampling to transition from global exploration to precise mode refinement. On OpenAI Gym MuJoCo benchmarks, NC-LQL achieves competitive performance compared to state-of-the-art diffusion-based methods, providing a simple yet powerful solution for online RL.
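
As an illustration of the sampling idea in the abstract, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a fixed state, a temperature `alpha`, and a toy quadratic Q-function whose Boltzmann distribution exp(Q / alpha) is a Gaussian with mean `MU` and variance `ALPHA`, so the result can be checked by eye. All names here (`langevin_sample`, `annealed_langevin_sample`, `toy_grad_q`, `toy_grad_q_sigma`, `MU`, `ALPHA`) are hypothetical. The smoothed gradient used for the noise-conditioned variant is one assumed way to define a smoothed landscape (Gaussian smoothing of the target density); the abstract does not specify how NC-LQL constructs or learns its noise-conditioned Q-function.

```python
import numpy as np


def langevin_sample(grad_q, a0, alpha=0.5, step=0.05, n_steps=100, rng=None):
    """Unadjusted Langevin dynamics targeting pi(a) proportional to exp(Q(a) / alpha).

    grad_q(a) returns the action gradient of Q for a batch of actions.
    The step size introduces a small discretization bias in the samples.
    """
    rng = np.random.default_rng() if rng is None else rng
    a = np.array(a0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(a.shape)
        # Drift follows the action gradient of Q; noise keeps the chain stochastic.
        a = a + (step / alpha) * grad_q(a) + np.sqrt(2.0 * step) * noise
    return a


def annealed_langevin_sample(grad_q_sigma, a0, sigmas, alpha=0.5, step=0.05,
                             n_steps=100, rng=None):
    """Run Langevin dynamics over a decreasing sequence of noise levels.

    grad_q_sigma(a, sigma) is the action gradient of a noise-conditioned Q.
    Each level starts from the samples produced by the previous, smoother one.
    """
    rng = np.random.default_rng() if rng is None else rng
    a = np.array(a0, dtype=float)
    for sigma in sigmas:
        a = langevin_sample(lambda x, s=sigma: grad_q_sigma(x, s),
                            a, alpha, step, n_steps, rng)
    return a


# Toy setup (assumption): Q(a) = -0.5 * ||a - MU||^2 for a single fixed state.
MU = np.array([0.3, -0.2])
ALPHA = 0.5


def toy_grad_q(a):
    # Gradient of the toy quadratic Q with respect to the action.
    return -(a - MU)


def toy_grad_q_sigma(a, sigma):
    # Assumed smoothing: the Gaussian target N(MU, ALPHA) convolved with
    # N(0, sigma^2) is N(MU, ALPHA + sigma^2); sigma = 0 recovers toy_grad_q.
    return -ALPHA * (a - MU) / (ALPHA + sigma ** 2)


rng = np.random.default_rng(0)
a0 = rng.uniform(-1.0, 1.0, size=(4000, 2))  # batch of initial actions

plain = langevin_sample(toy_grad_q, a0, alpha=ALPHA, rng=rng)
annealed = annealed_langevin_sample(toy_grad_q_sigma, a0,
                                    sigmas=[1.0, 0.5, 0.0], alpha=ALPHA, rng=rng)

print("plain    mean/var:", plain.mean(axis=0), plain.var(axis=0))
print("annealed mean/var:", annealed.mean(axis=0), annealed.var(axis=0))
```

On this toy problem both samplers should produce sample means close to `MU` and per-dimension variances close to `ALPHA`, up to the discretization bias of the fixed step size. In an actual agent the analytic gradients would presumably be replaced by automatic differentiation through a learned Q-network, and a unimodal quadratic does not exhibit the slow-mixing problem that motivates the noise-conditioned variant.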