arxiv_cs_lg April 24, 2026

Maximum Entropy Semi-Supervised Inverse Reinforcement Learning

Translated: 2026/4/24 19:55:45

inverse-reinforcement-learning, maximum-entropy, semi-supervised-learning, apprenticeship-learning, reinforcement-learning

Original Content

arXiv:2604.20074v1 Announce Type: new

Abstract: A popular approach to apprenticeship learning (AL) is to formulate it as an inverse reinforcement learning (IRL) problem. The MaxEnt-IRL algorithm integrates the maximum entropy principle into IRL and, unlike its predecessors, resolves the ambiguity that arises because a possibly large number of policies could match the expert's behavior. In this paper, we study an AL setting in which, in addition to the expert's trajectories, a number of unsupervised trajectories are available. We introduce MESSI, a novel algorithm that combines MaxEnt-IRL with principles from semi-supervised learning. In particular, MESSI integrates the unsupervised data into the MaxEnt-IRL framework using a pairwise penalty on trajectories. Empirical results on highway driving and grid-world problems indicate that MESSI is able to take advantage of the unsupervised trajectories and improve the performance of MaxEnt-IRL.
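
The idea of adding a pairwise penalty on trajectories to a MaxEnt-IRL objective can be illustrated with a small sketch. The abstract does not give the exact form of the objective, so everything below is an assumption made for illustration: the reward is taken to be linear in trajectory features, the trajectory distribution is normalised over a small finite candidate set, and the penalty is a similarity-weighted squared difference of rewards with a Gaussian similarity. All function names are hypothetical and this is not presented as the authors' implementation.

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def maxent_log_likelihood(theta, expert_feats, candidate_feats):
    """Average log-probability of the expert trajectories under
    P(traj) proportional to exp(theta . f(traj)).
    Assumption: the normaliser is a sum over a finite candidate set."""
    scores = [dot(theta, f) for f in candidate_feats]
    m = max(scores)  # subtract the max score for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return sum(dot(theta, f) - log_z for f in expert_feats) / len(expert_feats)


def rbf_similarity(f, g, bandwidth=1.0):
    """Hypothetical similarity between two trajectories: a Gaussian
    kernel on their feature vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(f, g))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def pairwise_penalty(theta, feats, similarity):
    """Sum over unordered pairs of
    similarity(i, j) * (reward_i - reward_j) ** 2.
    Similar trajectories are pushed towards similar rewards."""
    rewards = [dot(theta, f) for f in feats]
    total = 0.0
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            diff = rewards[i] - rewards[j]
            total += similarity(feats[i], feats[j]) * diff * diff
    return total


def penalized_objective(theta, expert_feats, unlabeled_feats,
                        candidate_feats, lam):
    """Log-likelihood of the expert data minus lam times the pairwise
    penalty over expert and unlabeled trajectories together."""
    ll = maxent_log_likelihood(theta, expert_feats, candidate_feats)
    pen = pairwise_penalty(theta, expert_feats + unlabeled_feats,
                           rbf_similarity)
    return ll - lam * pen


if __name__ == "__main__":
    # Toy feature vectors, made up for illustration only.
    expert = [[1.0, 0.0]]
    unlabeled = [[0.9, 0.1], [0.0, 1.0]]
    candidates = expert + unlabeled
    theta = [1.0, -1.0]
    print(penalized_objective(theta, expert, unlabeled, candidates, lam=0.5))
```

In this sketch, setting `lam` to zero recovers the plain maximum-likelihood objective on the expert trajectories, while a positive `lam` lets the unlabeled trajectories influence `theta` through their similarity to one another and to the expert data. Maximising the objective (for example by gradient ascent on `theta`) is left out to keep the example short.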