arxiv_cs_lg February 10, 2026

Solving General-Utility Markov Decision Processes in the Single-Trial Regime with Online Planning

general-utility Markov decision processes, single-trial regime, online planning, Monte-Carlo tree search, reinforcement learning

Original Content

arXiv:2505.15782v2
Announce Type: replace

Abstract: In this work, we contribute the first approach to solve infinite-horizon discounted general-utility Markov decision processes (GUMDPs) in the single-trial regime, i.e., when the agent's performance is evaluated based on a single trajectory. First, we provide some fundamental results regarding policy optimization in the single-trial regime, investigating which class of policies suffices for optimality, casting our problem as a particular MDP that is equivalent to our original problem, as well as studying the computational hardness of policy optimization in the single-trial regime. Second, we show how we can leverage online planning techniques, in particular a Monte-Carlo tree search algorithm, to solve GUMDPs in the single-trial regime. Third, we provide experimental results showcasing the superior performance of our approach in comparison to relevant baselines.
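
To make "performance evaluated based on a single trajectory" concrete, the following is a minimal sketch and is not taken from the paper. It assumes the general utility is a function f of the discounted state-action occupancy, and that in the single-trial regime f is applied to the empirical occupancy of one sampled trajectory. The function names, the normalization over a finite trajectory prefix, and the choice of negative entropy as f are all illustrative assumptions.

```python
import math
from collections import defaultdict


def empirical_discounted_occupancy(trajectory, gamma):
    """Discounted visitation frequency of each (state, action) pair in ONE trajectory.

    Assumption: the trajectory is a finite prefix, so we normalize by the
    total discounted weight of the prefix to obtain a distribution.
    """
    weights = defaultdict(float)
    for t, pair in enumerate(trajectory):
        weights[pair] += gamma ** t
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


def negative_entropy(occupancy):
    """Hypothetical utility f: smaller values mean more evenly spread visitation."""
    return sum(p * math.log(p) for p in occupancy.values() if p > 0)


# A single sampled trajectory of (state, action) pairs (made-up data).
trajectory = [("s0", "a"), ("s1", "a"), ("s0", "b"), ("s1", "a")]

d_hat = empirical_discounted_occupancy(trajectory, gamma=0.9)
single_trial_value = negative_entropy(d_hat)
```

Under these assumptions, the objective depends on the whole visitation history of the trajectory rather than on a sum of per-step rewards, which is one way to read the abstract's statement that the problem is cast as a particular MDP equivalent to the original.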