arxiv_cs_lg 2026年4月24日

マルチチェーン MDP におけるより高速な固定点法

Faster Fixed-Point Methods for Multichain MDPs

Translated: 2026/4/24 20:12:05

multichain-mdpmarkov-decision-processesvalue-iterationfixed-point-methodsaverage-reward

Japanese Translation

arXiv:2506.20910v2 Announce Type: replace-cross 要旨: 私たちは、平均報酬基準（a.k.a. マルチチェーン）の下で一般のマルコフ決定過程（MDPs）を解くために価値反復（VI）アルゴリズムを研究します。これは基本的でありながら理論的に挑戦的な設定です。すべての平均報酬問題に見られるベルマン演算子の非収束性と解の非一意性以及び、マルチチェーン設定における最適政策が、各コンポーネント内での長期パフォーマンスの最適化に加えて、最適な接続コンポーネントへと導くためのナビゲーションサブ問題の解決という課題を含んでいるためです。私たちが開発したアルゴリズムは、マルチチェーン MDP におけるより高速な収束を達成するためにこのナビゲーションサブ問題をより効果的に解決し、先の研究に対する改善された収束率と複雑性のより鋭い指標をもたらします。私たちの成果の多くの主要な構成要素は、潜在的に独立した興味を持つのではなく、平均報酬と割引問題との新しいつながり、一般のバナッハ空間に拡張される最適固定点法（割引 VI 法）、割引価値誤差のための新しい非線形収束率、そしてマルチチェーン MDP ための洗練された最適性分解が含まれています。全体として、私たちの成果は割引および平均報酬問題においてより高速な収束率をもたらすとともに、VI アプローチの理論的基礎を広げます。

Original Content

arXiv:2506.20910v2 Announce Type: replace-cross Abstract: We study value-iteration (VI) algorithms for solving general (a.k.a. multichain) Markov decision processes (MDPs) under the average-reward criterion, a fundamental but theoretically challenging setting. Beyond the difficulties inherent to all average-reward problems posed by the lack of contractivity and non-uniqueness of solutions to the Bellman operator, in the multichain setting an optimal policy must solve the navigation subproblem of steering towards the best connected component, in addition to optimizing long-run performance within each component. We develop algorithms which better solve this navigational subproblem in order to achieve faster convergence for multichain MDPs, obtaining improved rates of convergence and sharper measures of complexity relative to prior work. Many key components of our results are of potential independent interest, including novel connections between average-reward and discounted problems, optimal fixed-point methods for discounted VI which extend to general Banach spaces, new sublinear convergence rates for the discounted value error, and refined suboptimality decompositions for multichain MDPs. Overall our results yield faster convergence rates for discounted and average-reward problems and expand the theoretical foundations of VI approaches.