arxiv_cs_lg February 10, 2026

Epigraph-Guided Flow Matching for Safe and Performant Offline Reinforcement Learning

offline-reinforcement-learning, safe-rl, flow-matching, optimal-control, epigraph

Original Content

arXiv:2602.08054v1

Abstract: Offline reinforcement learning (RL) provides a compelling paradigm for training autonomous systems without the risks of online exploration, particularly in safety-critical domains. However, jointly achieving strong safety and performance from fixed datasets remains challenging. Existing safe offline RL methods often rely on soft constraints that allow violations, introduce excessive conservatism, or struggle to balance safety, reward optimization, and adherence to the data distribution. To address this, we propose Epigraph-Guided Flow Matching (EpiFlow), a framework that formulates safe offline RL as a state-constrained optimal control problem to co-optimize safety and performance. We learn a feasibility value function derived from an epigraph reformulation of the optimal control problem, thereby avoiding the decoupled objectives or post-hoc filtering common in prior work. Policies are synthesized by reweighting the behavior distribution based on this epigraph value function and fitting a generative policy via flow matching, enabling efficient, distribution-consistent sampling. Across various safety-critical tasks, including Safety-Gymnasium benchmarks, EpiFlow achieves competitive returns with near-zero empirical safety violations, demonstrating the effectiveness of epigraph-guided policy synthesis.
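The policy-synthesis step described in the abstract (reweight the behavior data, then fit a generative policy by flow matching) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation. The function names `epigraph_weights` and `weighted_flow_matching_loss`, the exponential weighting rule, the `temperature` parameter, the convention that a lower epigraph value is preferable, and the straight-line noise-to-action path are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def epigraph_weights(values, temperature=1.0):
    """Turn per-sample epigraph values into normalized weights.

    ASSUMPTION: lower value = safer / better, and weights are exponential
    in the negated value. The abstract does not state the actual rule.
    """
    logits = -np.asarray(values, dtype=float) / temperature
    logits = logits - logits.max()  # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()


def weighted_flow_matching_loss(velocity_fn, states, actions, weights, rng):
    """Weighted conditional flow matching loss on a straight-line path.

    velocity_fn(x_t, t, states) is a hypothetical state-conditioned
    velocity model returning an array shaped like `actions`.
    """
    actions = np.asarray(actions, dtype=float)
    noise = rng.standard_normal(actions.shape)   # x_0 ~ N(0, I)
    t = rng.uniform(size=(actions.shape[0], 1))  # t ~ U[0, 1], one per sample
    x_t = (1.0 - t) * noise + t * actions        # point on the path at time t
    target = actions - noise                     # velocity of the straight path
    pred = velocity_fn(x_t, t, states)
    per_sample = np.sum((pred - target) ** 2, axis=1)
    # Samples with larger weight contribute more to the regression target.
    return float(np.sum(weights * per_sample))
```

In a training loop, `velocity_fn` would be a neural network whose parameters are updated to reduce this loss, and actions would be sampled at deployment by integrating the learned velocity field from Gaussian noise. Because the weights only rescale the contribution of actions already present in the dataset, the fitted policy stays within the support of the behavior distribution, which is the "distribution-consistent sampling" property the abstract refers to.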