arxiv_cs_lg 2026年2月10日

Unichain 制約の一般パラメータ化平均報酬制約マルコフ決定過程における後悔解析

Regret Analysis of Unichain Average Reward Constrained MDPs with General Parameterization

Translated: 2026/3/15 14:50:01

reinforcement-learningmarkdownregret-analysismarkov-decision-processesunichain

Japanese Translation

arXiv:2602.08000v1 Announce Type: new Abstract: 私達はユニチェーン仮定と一般政策パラメータ化の下で、無限期間平均報酬制約マルコフ決定過程 (CMDPs) を研究します。既存の制約強化学習の後悔解析は、エルゴード性または強い混合時間の仮実に大きく依存しており、遷移状態が存在する場合はこれを成立させることができません。私達はマルチレベルモンテカルロ (MLMC) 推定者と明示的なバーンイン機構を活用した原対双天然アクター・クリティックアルゴリズムを提案し、混合時間オラクルを必要とせずにユニチェーンの動的挙動を取り扱います。我々の解析は、政策およびクリティックのパラメータ化から生じる近似誤差を除く限り、$ ilde{O}( oot T{2})$として有限時間后悔と累積制約違反の制約を確立し、これにより秩序最善保証をより広く一般化する CMDPs のクラスに適用します。

Original Content

arXiv:2602.08000v1 Announce Type: new Abstract: We study infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the unichain assumption and general policy parameterizations. Existing regret analyses for constrained reinforcement learning largely rely on ergodicity or strong mixing-time assumptions, which fail to hold in the presence of transient states. We propose a primal--dual natural actor--critic algorithm that leverages multi-level Monte Carlo (MLMC) estimators and an explicit burn-in mechanism to handle unichain dynamics without requiring mixing-time oracles. Our analysis establishes finite-time regret and cumulative constraint violation bounds that scale as $\tilde{O}(\sqrt{T})$, up to approximation errors arising from policy and critic parameterization, thereby extending order-optimal guarantees to a significantly broader class of CMDPs.