arxiv_cs_lg February 10, 2026

Provable Domain Adaptation for Offline Reinforcement Learning with Limited Samples

Translated: 2026/3/15 9:02:52

offline-reinforcement-learning domain-adaptation provable-bounds sample-efficiency arxiv-2408

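Translation (from Japanese)

arXiv:2408.12136v5 Announce Type: replace Abstract: Offline reinforcement learning (RL) learns effective policies from a static target dataset. However strong current state-of-the-art offline RL algorithms are, their performance depends on the size of the target dataset and degrades when the target dataset contains only a limited number of samples, which is common in real-world applications. To address this problem, domain adaptation that leverages auxiliary samples from a related source dataset (such as a simulator) can be beneficial. However, establishing the best way to trade off a limited target dataset against a large but biased source dataset, while also providing provable theoretical guarantees, remains an open problem. To the best of our knowledge, this paper proposes the first framework that theoretically investigates how the weight assigned to each dataset affects the performance of offline RL. In particular, we establish performance bounds and the existence of an optimal weight that can be computed in closed form under simplifying assumptions. We also provide algorithmic guarantees of convergence to a neighborhood of the optimum. These results depend on the quality of the source dataset and the number of samples in the target dataset. Our empirical results on the well-known Procgen and MuJoCo benchmarks substantiate the theoretical contributions of this work.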

Original Content

arXiv:2408.12136v5 Announce Type: replace Abstract: Offline reinforcement learning (RL) learns effective policies from a static target dataset. The performance of state-of-the-art offline RL algorithms notwithstanding, it relies on the size of the target dataset, and it degrades if limited samples in the target dataset are available, which is often the case in real-world applications. To address this issue, domain adaptation that leverages auxiliary samples from related source datasets (such as simulators) can be beneficial. However, establishing the optimal way to trade off the limited target dataset and the large-but-biased source dataset while ensuring provably theoretical guarantees remains an open challenge. To the best of our knowledge, this paper proposes the first framework that theoretically explores the impact of the weights assigned to each dataset on the performance of offline RL. In particular, we establish performance bounds and the existence of the optimal weight, which can be computed in closed form under simplifying assumptions. We also provide algorithmic guarantees in terms of convergence to a neighborhood of the optimum. Notably, these results depend on the quality of the source dataset and the number of samples in the target dataset. Our empirical results on the well-known Procgen and MuJoCo benchmarks substantiate the theoretical contributions in this work.
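The trade-off the abstract describes, between a small unbiased target dataset and a large but biased source dataset, can be illustrated with a much simpler toy problem. The sketch below is not the paper's framework or its bound: it assumes we only estimate a scalar mean, that the target and source sample means are independent with a shared per-sample variance, and that the source mean is shifted by a known bias. The function names and parameters (`optimal_target_weight`, `combined_mse`, `sigma2`, `bias`) are hypothetical and chosen for illustration. Under these assumptions, the weight on the target estimate that minimizes mean squared error has a closed form that depends on the source quality (its bias) and on the number of target samples.

```python
def combined_mse(lam, sigma2, n_target, n_source, bias):
    """MSE of lam * target_mean + (1 - lam) * source_mean (toy assumption).

    target_mean: unbiased, variance sigma2 / n_target
    source_mean: biased by `bias`, variance sigma2 / n_source
    The two estimates are assumed independent.
    """
    var_target = sigma2 / n_target
    var_source = sigma2 / n_source
    variance = lam ** 2 * var_target + (1.0 - lam) ** 2 * var_source
    squared_bias = ((1.0 - lam) * bias) ** 2
    return variance + squared_bias


def optimal_target_weight(sigma2, n_target, n_source, bias):
    """Closed-form minimizer of combined_mse over lam (toy setting only)."""
    var_target = sigma2 / n_target
    mse_source = bias ** 2 + sigma2 / n_source
    return mse_source / (var_target + mse_source)


if __name__ == "__main__":
    # Few target samples, many source samples, moderately biased source.
    settings = dict(sigma2=1.0, n_target=20, n_source=10_000, bias=0.1)
    lam = optimal_target_weight(**settings)
    print("weight on target:", lam)
    print("MSE target only :", combined_mse(1.0, **settings))
    print("MSE source only :", combined_mse(0.0, **settings))
    print("MSE combined    :", combined_mse(lam, **settings))
```

In this toy setting the weight on the target estimate moves toward 1 as the target sample count grows or the source bias increases, and toward the source as the source becomes less biased, which mirrors the qualitative dependence stated in the abstract.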