arxiv_cs_lg 2026年4月24日

平均報酬オフライン強化学習における最適単一ポリシーサンプル複雑性および遷移時間カバレッジ

Optimal Single-Policy Sample Complexity and Transient Coverage for Average-Reward Offline RL

Translated: 2026/4/24 20:07:54

reinforcement-learningoffline-rlsample-complexitymdpaverage-reward

Japanese Translation

arXiv:2506.20904v2 発表タイプ：置換摘要：私たちは、分散シフトおよび不均一カバレッジの観点からより困難な課題を提起する平均報酬 MDP におけるオフライン強化学習を研究し、理論的な観点からはまだ十分に検討されていない分野である。これまでの研究では、単一ポリシーデータカバレッジ仮定に基づいて性能保証を取得しているが、そのような保証はすべてのポリシーに uniform である追加の複雑性測定（例：uniformal mixing time）を利用している。私たちは、目標ポリシー（specifically bias span と新たな policy hitting radius）に依存し、かつ平均報酬オフライン RL のために最初の完全な単一ポリシーサンプル複雑性境界を得るための明確な保証を発展させた。さらに、私たちは制限された構造仮定よりも一般的な弱く通信可能な MDP を扱うことに先駆的である。これは、この目的のために、我々の新しいクォンタイルクリッピング技術によって強化された悲観的割引価値イテレーションに基づくアルゴリズムを導入したものである。このアルゴリズムは、実装のために事前に任意のパラメータ知識を必要としない。驚くべきことに、ハードな例を用いて、我々の条件下での学習は、目標ポリシーの停留分布のみのカバレッジ仮定を超えたカバレッジ仮定を必要とすることを示し、単一ポリシー複雑性測定と以前検討されたケースを区別した。私たちはまた、我々の主要な結果にほぼ一致する下界を発展させた。

Original Content

arXiv:2506.20904v2 Announce Type: replace Abstract: We study offline reinforcement learning in average-reward MDPs, which presents increased challenges from the perspectives of distribution shift and non-uniform coverage, and has been relatively underexamined from a theoretical perspective. While previous work obtains performance guarantees under single-policy data coverage assumptions, such guarantees utilize additional complexity measures which are uniform over all policies, such as the uniform mixing time. We develop sharp guarantees depending only on the target policy, specifically the bias span and a novel policy hitting radius, yielding the first fully single-policy sample complexity bound for average-reward offline RL. We are also the first to handle general weakly communicating MDPs, contrasting restrictive structural assumptions made in prior work. To achieve this, we introduce an algorithm based on pessimistic discounted value iteration enhanced by a novel quantile clipping technique, which enables the use of a sharper empirical-span-based penalty function. Our algorithm also does not require any prior parameter knowledge for its implementation. Remarkably, we show via hard examples that learning under our conditions requires coverage assumptions beyond the stationary distribution of the target policy, distinguishing single-policy complexity measures from previously examined cases. We also develop lower bounds nearly matching our main result.