arxiv_cs_lg April 24, 2026

CubeDAgger: Interactive Imitation Learning for Dynamic Systems with Efficient yet Low-risk Interaction

Translated: 2026/4/24 20:11:50

interactive-imitation-learning, cubeDagger, dynamic-systems, reinforcement-learning, robot-control

Original Content

arXiv:2505.04897v2 Announce Type: replace-cross Abstract: Interactive imitation learning makes an agent's control policy robust through stepwise supervision from an expert. Most recent algorithms employ expert-agent switching systems that reduce the expert's burden by restricting supervision to selected timings. However, this approach is useful only for static tasks; in dynamic tasks, timing discrepancies cause abrupt changes in actions, and the robot loses its dynamic stability. This paper therefore proposes a novel method, named CubeDAgger, which improves robustness with fewer dynamic stability violations even for dynamic tasks. The proposed method builds on a baseline, EnsembleDAgger, with three improvements. The first adds a regularization to explicitly activate the threshold for deciding the supervision timing. The second transforms the expert-agent switching system into an optimal consensus system of multiple action candidates. The third injects autoregressive colored noise into the agent's actions for time-consistent exploration. These improvements are verified in simulations, which show that the trained policies are sufficiently robust while maintaining dynamic stability during interaction. Finally, real-robot scooping experiments with a human expert demonstrate that the proposed method can learn robust policies from scratch with just 30 minutes of interaction. https://youtu.be/kBl3SCTnVEM
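
To make the idea of time-consistent exploration concrete, the following is a minimal sketch of autoregressive colored noise. The abstract does not give the noise model, so this assumes a first-order autoregressive, AR(1), process; the function names `ar1_noise` and `lag1_autocorr` and the parameters `rho` and `sigma` are hypothetical and are not taken from the paper. The sketch only illustrates why correlated noise changes more smoothly over time than white noise.

```python
import numpy as np


def ar1_noise(n_steps, action_dim, rho=0.9, sigma=0.1, seed=0):
    """Sample AR(1) noise: e[t] = rho * e[t-1] + sqrt(1 - rho^2) * sigma * w[t].

    Assumed form for illustration only; rho sets the temporal correlation
    and sigma the stationary standard deviation. rho = 0 gives white noise.
    """
    rng = np.random.default_rng(seed)
    white = rng.standard_normal((n_steps, action_dim))
    noise = np.empty_like(white)
    noise[0] = sigma * white[0]  # start from the stationary distribution
    scale = sigma * np.sqrt(1.0 - rho**2)  # keeps the variance at sigma^2
    for t in range(1, n_steps):
        noise[t] = rho * noise[t - 1] + scale * white[t]
    return noise


def lag1_autocorr(x):
    """Sample lag-1 autocorrelation, pooled over all action dimensions."""
    x = x - x.mean(axis=0)
    return float((x[1:] * x[:-1]).sum() / (x * x).sum())


colored = ar1_noise(5000, 1, rho=0.9)  # temporally correlated noise
white = ar1_noise(5000, 1, rho=0.0)    # uncorrelated baseline
print("lag-1 autocorrelation, colored:", lag1_autocorr(colored))
print("lag-1 autocorrelation, white:  ", lag1_autocorr(white))
```

In use, such noise would be added to the agent's action at each control step. With `rho` close to 1, consecutive perturbations point in similar directions, which avoids the step-to-step jitter that white noise would inject into a dynamic task.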