arxiv_cs_lg April 24, 2026

Distributional Inverse Reinforcement Learning

Translated: 2026/4/24 20:08:50

distributional-inverse-reinforcement-learning, offline-rl, risk-averse-policy, stochastic-dominance, reward-function

Original Content

arXiv:2510.03013v3 Announce Type: replace

Abstract: We propose a distributional framework for offline Inverse Reinforcement Learning (IRL) that jointly models uncertainty over reward functions and full distributions of returns. Unlike conventional IRL approaches that recover a deterministic reward estimate or match only expected returns, our method captures richer structure in expert behavior, particularly in learning the reward distribution, by minimizing first-order stochastic dominance (FSD) violations and thus integrating distortion risk measures (DRMs) into policy learning, enabling the recovery of both reward distributions and distribution-aware policies. This formulation is well-suited for behavior analysis and risk-aware imitation learning. Theoretical analysis shows that the algorithm converges with $\mathcal{O}(\varepsilon^{-2})$ iteration complexity. Empirical results on synthetic benchmarks, real-world neurobehavioral data, and MuJoCo control tasks demonstrate that our method recovers expressive reward representations and achieves state-of-the-art performance.
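
The abstract names two ingredients without defining them: first-order stochastic dominance (FSD) violations and distortion risk measures (DRMs). The Python sketch below illustrates both on small samples of returns using textbook empirical estimators. It is not the paper's objective or algorithm. The function names (`fsd_violation`, `distortion_risk_measure`, `cvar_distortion`), the equal-sample-size assumption, and the lower-tail CVaR convention are illustrative assumptions that do not come from the source.

```python
from typing import Callable, Sequence


def fsd_violation(candidate: Sequence[float], reference: Sequence[float]) -> float:
    """Average shortfall of candidate quantiles below reference quantiles.

    Returns 0.0 when the empirical candidate distribution first-order
    dominates the reference. Assumption: both samples have the same size,
    so the sorted samples line up as matching empirical quantiles.
    """
    if len(candidate) != len(reference) or not candidate:
        raise ValueError("need two non-empty samples of equal size")
    pairs = zip(sorted(candidate), sorted(reference))
    return sum(max(0.0, r - c) for c, r in pairs) / len(candidate)


def distortion_risk_measure(
    returns: Sequence[float], distortion: Callable[[float], float]
) -> float:
    """Empirical DRM: order statistics reweighted by increments of g.

    `distortion` is a nondecreasing g on [0, 1] with g(0) = 0 and g(1) = 1.
    With g(u) = u this reduces to the sample mean.
    """
    ordered = sorted(returns)
    n = len(ordered)
    if n == 0:
        raise ValueError("need a non-empty sample")
    return sum(
        z * (distortion(i / n) - distortion((i - 1) / n))
        for i, z in enumerate(ordered, start=1)
    )


def cvar_distortion(alpha: float) -> Callable[[float], float]:
    """Distortion g(u) = min(u / alpha, 1): mean of the worst alpha-fraction of returns."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return lambda u: min(u / alpha, 1.0)


# Hypothetical return samples, for illustration only.
expert_returns = [1.0, 2.0, 3.0, 4.0]
learner_returns = [0.0, 2.0, 3.0, 5.0]

violation = fsd_violation(learner_returns, expert_returns)               # 0.25
mean_return = distortion_risk_measure(expert_returns, lambda u: u)       # 2.5
cvar_half = distortion_risk_measure(expert_returns, cvar_distortion(0.5))  # 1.5
```

In this toy example the learner's worst return (0.0) falls below the expert's worst return (1.0), so the violation is positive even though the learner's best return is higher. Only the shortfalls are penalized. Choosing a different distortion function changes which part of the return distribution the risk measure emphasizes.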