arxiv_cs_lg February 10, 2026

LEASE: Offline Preference-based Reinforcement Learning with High Sample Efficiency

Translated: 2026/3/15 9:03:27

offline-rl, reinforcement-learning, preference-learning, sample-efficiency, arxiv-2024

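Translation (from Japanese)

arXiv:2412.21001v3 Announcement Type: replacement Abstract: Offline preference-based reinforcement learning (PbRL) offers an effective approach to overcoming challenges such as reward function design and the high cost of online interaction. However, labeling requires real-time human feedback, so obtaining a sufficient number of preference labels is difficult. To address this problem, this paper proposes LEASE, an offline preference-based reinforcement learning algorithm with high sample efficiency that uses a learned transition model to generate unlabeled preference data. Because a pretrained reward model may produce incorrect labels for unlabeled data, we design an uncertainty-aware mechanism that safeguards reward model performance by selecting only data with high confidence and low variance. In addition, by providing a generalization bound for the reward model, we analyze the factors that affect reward accuracy and show that the policy learned by LEASE has a theoretical improvement guarantee. Because the developed theory is based on state-action pairs, it is easy to combine with other offline algorithms. Experimental results show that LEASE can achieve performance comparable to the baselines with less preference data and without online interaction.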

Original Content

arXiv:2412.21001v3 Announce Type: replace Abstract: Offline preference-based reinforcement learning (PbRL) provides an effective way to overcome the challenges of designing reward functions and the high cost of online interaction. However, because labeling preferences requires real-time human feedback, acquiring sufficient preference labels is challenging. To address this, this paper proposes an offLine prEference-bAsed RL with high Sample Efficiency (LEASE) algorithm, in which a learned transition model is leveraged to generate unlabeled preference data. Because the pretrained reward model may generate incorrect labels for unlabeled data, we design an uncertainty-aware mechanism to safeguard the performance of the reward model, in which only high-confidence, low-variance data are selected. Moreover, we provide a generalization bound for the reward model to analyze the factors influencing reward accuracy, and we demonstrate that the policy learned by LEASE has a theoretical improvement guarantee. The developed theory is based on state-action pairs, so it can be easily combined with other offline algorithms. The experimental results show that LEASE can achieve performance comparable to the baseline with fewer preference data and without online interaction.
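
The abstract says that only high-confidence, low-variance data are selected, but it does not give the selection rule. The sketch below is one plausible reading and not the authors' implementation. It assumes a small ensemble of reward models, a Bradley-Terry preference probability computed from predicted segment returns, and two thresholds. The names `preference_prob`, `select_pseudo_labels`, `conf_threshold`, and `var_threshold` are hypothetical, as are the numeric values.

```python
import math
from statistics import mean, pvariance


def preference_prob(return_a, return_b):
    """Bradley-Terry probability that segment A is preferred over segment B.

    Assumption: the abstract does not name the preference model; Bradley-Terry
    is used here only because it is a common choice in PbRL.
    """
    return 1.0 / (1.0 + math.exp(-(return_a - return_b)))


def select_pseudo_labels(ensemble_returns, conf_threshold=0.9, var_threshold=0.01):
    """Keep only pairs that the ensemble labels confidently and consistently.

    ensemble_returns: one entry per unlabeled pair; each entry is a list of
    (return_a, return_b) tuples, one per ensemble member.
    Returns a list of (pair_index, label); label 1 means A is preferred,
    label 0 means B is preferred.
    Both thresholds are illustrative values, not taken from the paper.
    """
    selected = []
    for idx, member_returns in enumerate(ensemble_returns):
        probs = [preference_prob(ra, rb) for ra, rb in member_returns]
        p_mean = mean(probs)
        p_var = pvariance(probs)  # disagreement between ensemble members
        confidence = max(p_mean, 1.0 - p_mean)  # how far the mean is from a tie
        if confidence >= conf_threshold and p_var <= var_threshold:
            selected.append((idx, 1 if p_mean >= 0.5 else 0))
    return selected


# Toy predicted returns for four unlabeled pairs and three ensemble members.
pairs = [
    [(5.0, 1.0), (4.8, 1.1), (5.2, 0.9)],   # members agree: A clearly better
    [(2.0, 2.1), (2.2, 1.9), (1.9, 2.0)],   # near tie: low confidence
    [(8.0, 1.0), (8.0, 1.0), (2.0, 1.05)],  # confident on average, but members disagree
    [(0.5, 4.5), (0.7, 4.9), (0.4, 4.4)],   # members agree: B clearly better
]
print(select_pseudo_labels(pairs))  # [(0, 1), (3, 0)]
```

In this toy run the second pair is dropped by the confidence test and the third by the variance test, so only the first and fourth pairs would be added to the reward model's training data as pseudo-labeled preferences.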