arxiv_cs_lg April 24, 2026

Hybrid-AIRL: Enhancing Inverse Reinforcement Learning with Supervised Expert Guidance

Translated: 2026/4/24 20:09:12

inverse-reinforcement-learning, hybrid-airl, heads-up-limit-hold-em, gamification, machine-learning

Original Content

arXiv:2511.21356v3

Abstract: Adversarial Inverse Reinforcement Learning (AIRL) has shown promise in addressing the sparse reward problem in reinforcement learning (RL) by inferring dense reward functions from expert demonstrations. However, its performance in highly complex, imperfect-information settings remains largely unexplored. To explore this gap, we evaluate AIRL in the context of Heads-Up Limit Hold'em (HULHE) poker, a domain characterized by sparse, delayed rewards and significant uncertainty. In this setting, we find that AIRL struggles to infer a sufficiently informative reward function. To overcome this limitation, we contribute Hybrid-AIRL (H-AIRL), an extension that enhances reward inference and policy learning by incorporating a supervised loss derived from expert data and a stochastic regularization mechanism. We evaluate H-AIRL on a carefully selected set of Gymnasium benchmarks and the HULHE poker setting. Additionally, we analyze the learned reward function through visualization to gain deeper insights into the learning process. Our experimental results show that H-AIRL achieves higher sample efficiency and more stable learning compared to AIRL. This highlights the benefits of incorporating supervised signals into inverse RL and establishes H-AIRL as a promising framework for tackling challenging, real-world settings.
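
The abstract does not give the H-AIRL objective, so the following is only a rough sketch of the general idea, not the authors' method. The discriminator and reward use the standard AIRL form, D = exp(f) / (exp(f) + pi(a|s)). Everything about the hybrid term is an assumption made for illustration: the supervised loss is taken to be a behavior-cloning negative log-likelihood on expert actions, and `hybrid_policy_loss` and `bc_weight` are hypothetical names. The stochastic regularization mechanism is left out because the abstract does not describe it.

```python
import math


def airl_discriminator(f_value, log_pi):
    """Standard AIRL discriminator D = exp(f) / (exp(f) + pi(a|s)).

    Algebraically this equals sigmoid(f - log pi), which is evaluated
    here in a numerically stable way.
    """
    z = f_value - log_pi
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def airl_reward(f_value, log_pi):
    """AIRL reward log D - log(1 - D), which simplifies to f - log pi."""
    return f_value - log_pi


def behavior_cloning_loss(action_probs, expert_action):
    """ASSUMPTION: supervised term = negative log-likelihood of the
    expert's action under the current policy's action distribution."""
    return -math.log(action_probs[expert_action])


def hybrid_policy_loss(rl_loss, action_probs, expert_action, bc_weight=0.5):
    """ASSUMPTION (hypothetical combination): RL loss computed from the
    learned reward, plus a weighted supervised term on expert data."""
    return rl_loss + bc_weight * behavior_cloning_loss(action_probs, expert_action)


if __name__ == "__main__":
    log_pi = math.log(0.5)  # policy assigns probability 0.5 to the action
    d = airl_discriminator(0.0, log_pi)
    print("D:", d)
    print("reward:", airl_reward(0.0, log_pi))
    print("hybrid loss:", hybrid_policy_loss(1.0, [0.25, 0.75], 1, bc_weight=2.0))
```

In this sketch, a larger `bc_weight` pulls the policy toward imitating expert actions directly, while a smaller one leaves learning mostly to the inferred reward. How the paper actually balances the two is not stated in the abstract.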