arxiv_cs_ai February 10, 2026

Introducing Finite-State Controllers via Reinforcement Learning for Partially Observable Markov Decision Processes (POMDPs)

Finite-State Controllers for (Hidden-Model) POMDPs using Deep Reinforcement Learning

Translated: 2026/3/7 10:17:51

reinforcement-learning, finite-state, pomdp, deep-reinforcement-learning

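Japanese Translation (rendered in English)

A new framework, Lexpop, is proposed for computing policies in partially observable Markov decision processes (POMDPs), where complete state information is unavailable. Lexpop uses deep reinforcement learning to train a neural policy and then constructs a finite-state controller that mimics this neural policy through efficient extraction methods. Unlike the neural policy, the resulting controller can be formally evaluated, which provides performance guarantees. We extend Lexpop to compute robust policies for hidden-model POMDPs (HM-POMDPs). We associate each extracted controller with its worst-case POMDP. Using a set of these POMDPs, we iteratively train a robust neural policy and then extract a robust controller from it. In our experiments, on problems with large state spaces, Lexpop outperforms state-of-the-art solvers for both POMDPs and HM-POMDPs.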

Original Content

arXiv:2602.08734v1 Announce Type: new

Abstract: Solving partially observable Markov decision processes (POMDPs) requires computing policies under imperfect state information. Despite recent advances, the scalability of existing POMDP solvers remains limited. Moreover, many settings require a policy that is robust across multiple POMDPs, further aggravating the scalability issue. We propose the Lexpop framework for POMDP solving. Lexpop (1) employs deep reinforcement learning to train a neural policy, represented by a recurrent neural network, and (2) constructs a finite-state controller mimicking the neural policy through efficient extraction methods. Crucially, unlike neural policies, such controllers can be formally evaluated, providing performance guarantees. We extend Lexpop to compute robust policies for hidden-model POMDPs (HM-POMDPs), which describe finite sets of POMDPs. We associate every extracted controller with its worst-case POMDP. Using a set of such POMDPs, we iteratively train a robust neural policy and consequently extract a robust controller. Our experiments show that on problems with large state spaces, Lexpop outperforms state-of-the-art solvers for POMDPs as well as HM-POMDPs.
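The abstract states that a finite-state controller can be formally evaluated and that each controller is associated with its worst-case POMDP. The sketch below illustrates that idea on a toy problem and is not the Lexpop implementation. It assumes a discounted-reward objective, a dictionary-based POMDP layout, and a hand-written two-node controller, none of which come from the paper. The helper names (`make_pomdp`, `evaluate_fsc`, `worst_case`) are hypothetical. The controller is evaluated by fixed-point iteration on the product of controller nodes and hidden states.

```python
from typing import Dict, Tuple

# Hypothetical POMDP layout chosen for this sketch (not from the paper):
#   trans[(s, a)]  -> list of (next_state, probability)
#   obs[s_next]    -> list of (observation, probability)
#   reward[(s, a)] -> immediate reward
#   init[s]        -> initial probability of hidden state s


def make_pomdp(p_correct: float) -> dict:
    """Two static hidden states; the observation equals the state w.p. p_correct.
    The reward is 1 when the chosen action matches the hidden state."""
    states, actions = [0, 1], [0, 1]
    return {
        "states": states,
        "trans": {(s, a): [(s, 1.0)] for s in states for a in actions},
        "obs": {s: [(s, p_correct), (1 - s, 1.0 - p_correct)] for s in states},
        "reward": {(s, a): 1.0 if a == s else 0.0
                   for s in states for a in actions},
        "init": {0: 0.5, 1: 0.5},
    }


# A two-node controller: node n plays action n and moves to the node
# named by the latest observation (it "acts on the last observation").
FSC_ACTION: Dict[int, int] = {0: 0, 1: 1}
FSC_SUCC: Dict[Tuple[int, int], int] = {(n, o): o for n in (0, 1) for o in (0, 1)}


def evaluate_fsc(pomdp, action, succ, gamma=0.5, tol=1e-10, init_node=0):
    """Expected discounted reward of the controller from the initial belief.
    Iterates V(n, s) = R(s, a_n) + gamma * E[V(n', s')] until convergence."""
    value = {(n, s): 0.0 for n in action for s in pomdp["states"]}
    while True:
        new_value, delta = {}, 0.0
        for (n, s), old in value.items():
            a = action[n]
            future = 0.0
            for s2, p_t in pomdp["trans"][(s, a)]:
                for o, p_o in pomdp["obs"][s2]:
                    future += p_t * p_o * value[(succ[(n, o)], s2)]
            new_value[(n, s)] = pomdp["reward"][(s, a)] + gamma * future
            delta = max(delta, abs(new_value[(n, s)] - old))
        value = new_value
        if delta < tol:
            break
    return sum(b * value[(init_node, s)] for s, b in pomdp["init"].items())


def worst_case(pomdps, action, succ):
    """Evaluate one controller on every POMDP in a finite set and
    return the name of the POMDP with the lowest value, plus all values."""
    values = {name: evaluate_fsc(m, action, succ) for name, m in pomdps.items()}
    return min(values, key=values.get), values


models = {"A": make_pomdp(0.8), "B": make_pomdp(0.6)}
worst, values = worst_case(models, FSC_ACTION, FSC_SUCC)
print(worst, values)
```

With these toy parameters the controller scores about 1.3 on model A and about 1.1 on model B, so B is its worst case. In a robust training loop of the kind the abstract describes, such a worst-case model would presumably be added to the set used to retrain the neural policy, though the abstract does not give those details.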