arxiv_cs_ai 2026年4月24日

部分観測マルコフ意思決定過程（POMDP）政策に対する観測擾乱への頑健性の解析

Robustness Analysis of POMDP Policies to Observation Perturbations

Translated: 2026/4/24 20:16:02

pomdpreinforcement-learningrobustnessoptimizationrobotics

Japanese Translation

arXiv:2604.21256v1 Announce Type: new 要旨：部分観測マルコフ意思決定過程（POMDP）の政策は、通常、標準システムモデルを使用して設計されます。実運用においては、このモデルが真のシステムから微細な漂移やセンサーの劣化といった要因により外れる場合があり、予想外のパフォーマンス低下を引き起こします。本稿では、POMDP 観測モデルに対する偏離に対する政策の頑健性を研究します。われわれは「Policy Observation Robustness Problem（ポリシー観測頑健性問題）」を導入し、それが政策の価値を特定のスレッショルド以上にあることを保証する最大の許容偏離を決定する方法を提案します。われわれは 2 つのバリアント、すなわち状態と行動に依存する偏離を持つ「sticky バリアント」と、履歴に依存する偏離を持つ「non-sticky バリアント」について分析します。われわれは、Policy Observation Robustness Problem を内側最適化が観測偏離の大きさに対して単調であるという階層最適化問題として形式化できることを示しました。これにより、外側最適化におけるルート検索アルゴリズムを使用して効率的な解決が可能になります。non-sticky バリアントについては、政策が有限状態コントローラー（FSCs）で表される場合、完全な履歴よりも FSC のノードに依存する観測のみを考慮すれば十分であることが示されました。われわれは、sticky バリアントおよび non-sticky バリアントの両方において健全性と収束保証を持つアルゴリズム「Robust Interval Search」を提示しました。このアルゴリズムは、non-sticky バリアントにおいて多項式時間の計算量を持ち、sticky バリアントでは至多指数時間計算量を持つことを示しました。われわれは、Robust Interval Search の実装が数万の状態を持つ POMDP 問題に対してスケーラビリティを実証する実験結果を提供しました。さらに、われわれはロボティクスとオペレーションズリサーチからの事例研究を提供し、問題とアルゴリズムの実用的有用性を示しました。

Original Content

arXiv:2604.21256v1 Announce Type: new Abstract: Policies for Partially Observable Markov Decision Processes (POMDPs) are often designed using a nominal system model. In practice, this model can deviate from the true system during deployment due to factors such as calibration drift or sensor degradation, leading to unexpected performance degradation. This work studies policy robustness against deviations in the POMDP observation model. We introduce the Policy Observation Robustness Problem: to determine the maximum tolerable deviation in a POMDP's observation model that guarantees the policy's value remains above a specified threshold. We analyze two variants: the sticky variant, where deviations are dependent on state and actions, and the non-sticky variant, where they can be history-dependent. We show that the Policy Observation Robustness Problem can be formulated as a bi-level optimization problem in which the inner optimization is monotonic in the size of the observation deviation. This enables efficient solutions using root-finding algorithms in the outer optimization. For the non-sticky variant, we show that when policies are represented with finite-state controllers (FSCs) it is sufficient to consider observations which depend on nodes in the FSC rather than full histories. We present Robust Interval Search, an algorithm with soundness and convergence guarantees, for both the sticky and non-sticky variants. We show this algorithm has polynomial time complexity in the non-sticky variant and at most exponential time complexity in the sticky variant. We provide experimental results validating and demonstrating the scalability of implementations of Robust Interval Search to POMDP problems with tens of thousands of states. We also provide case studies from robotics and operations research which demonstrate the practical utility of the problem and algorithms.