arxiv_cs_ai April 24, 2026

Reinforcement Learning with Foundation Priors: Let the Embodied Agent Efficiently Learn on Its Own

Translated: 2026/4/24 20:31:35

reinforcement-learning, foundation-models, robotics, sample-efficiency, embodied-ai

Original Content

arXiv:2310.02635v5 Announce Type: replace-cross Abstract: Reinforcement learning (RL) is a promising approach for solving robotic manipulation tasks. However, it is challenging to apply the RL algorithms directly in the real world. For one thing, RL is data-intensive and typically requires millions of interactions with environments, which are impractical in real scenarios. For another, it is necessary to make heavy engineering efforts to design reward functions manually. To address these issues, we leverage foundation models in this paper. We propose Reinforcement Learning with Foundation Priors (RLFP) to utilize guidance and feedback from policy, value, and success-reward foundation models. Within this framework, we introduce the Foundation-guided Actor-Critic (FAC) algorithm, which enables embodied agents to explore more efficiently with automatic reward functions. The benefits of our framework are threefold: (1) sample efficient; (2) minimal and effective reward engineering; (3) agnostic to foundation model forms and robust to noisy priors. Our method achieves remarkable performances in various manipulation tasks on both real robots and in simulation. Across 5 dexterous tasks with real robots, FAC achieves an average success rate of 86% after one hour of real-time learning. Across 8 tasks in the simulated Meta-world, FAC achieves 100% success rates in 7/8 tasks under less than 100k frames (about 1-hour training), outperforming baseline methods with manual-designed rewards in 1M frames. We believe the RLFP framework can enable future robots to explore and learn autonomously in the physical world for more tasks. Visualizations and code are available at https://yewr.github.io/rlfp.
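The abstract names value and success-reward foundation priors as sources of an automatic reward, but it does not say how they are combined. As an illustration only, the sketch below assumes one standard option: a binary success signal plus a potential-based shaping term computed from a value prior. The names `shaped_reward` and `toy_value_prior`, the state representation, and the discount `gamma` are hypothetical and are not taken from the paper or its code.

```python
from typing import Callable, Sequence

State = Sequence[float]


def shaped_reward(
    success: bool,
    value_prior: Callable[[State], float],
    state: State,
    next_state: State,
    gamma: float = 0.99,
) -> float:
    """Binary success reward plus a potential-based shaping term.

    Assumption: the value prior is used as a potential function, so the
    shaping term is gamma * V(s') - V(s). This is a generic technique and
    is not confirmed to be the paper's exact formulation.
    """
    shaping = gamma * value_prior(next_state) - value_prior(state)
    return float(success) + shaping


def toy_value_prior(state: State) -> float:
    # Hypothetical stand-in for a value foundation model: the first
    # coordinate is treated as distance to the goal, so closer is better.
    return -abs(state[0])


if __name__ == "__main__":
    # Moving from distance 1.0 to 0.5 without succeeding yet.
    print(shaped_reward(False, toy_value_prior, [1.0], [0.5], gamma=1.0))
    # Reaching the goal from distance 0.5.
    print(shaped_reward(True, toy_value_prior, [0.5], [0.0], gamma=1.0))
```

In this toy setup the agent receives a positive signal for progress toward the goal even before the sparse success reward fires, which is one plausible way a value prior could reduce the number of environment interactions needed.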