arxiv_cs_lg April 24, 2026

Occupancy Reward Shaping: Improving Credit Assignment for Offline Goal-Conditioned Reinforcement Learning

Translated: 2026/4/24 19:59:36

reinforcement-learning, offline-rl, credit-assignment, world-model, optimal-transport

Original Content

arXiv:2604.20627v1 Announce Type: new

Abstract: The temporal lag between actions and their long-term consequences makes credit assignment a challenge when learning goal-directed behaviors from data. Generative world models capture the distribution of future states an agent may visit, indicating that they have captured temporal information. How can that temporal information be extracted to perform credit assignment? In this paper, we formalize how the temporal information stored in world models encodes the underlying geometry of the world. Leveraging optimal transport, we extract this geometry from a learned model of the occupancy measure into a reward function that captures goal-reaching information. Our resulting method, Occupancy Reward Shaping, largely mitigates the problem of credit assignment in sparse reward settings. ORS provably does not alter the optimal policy, yet empirically improves performance by 2.2x across 13 diverse long-horizon locomotion and manipulation tasks. Moreover, we demonstrate the effectiveness of ORS in the real world for controlling nuclear fusion on 3 Tokamak control tasks.

Code: https://github.com/aravindvenu7/occupancy_reward_shaping
Website: https://aravindvenu7.github.io/website/ors/
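To make the idea in the abstract concrete, here is a minimal sketch of how a transport distance between occupancy samples could be turned into a shaping term. It is not the authors' implementation, and the abstract does not specify the construction. The sketch assumes the classic potential-based form r + gamma * Phi(s') - Phi(s), which is known to leave the optimal policy unchanged, and a 1-D Wasserstein-1 distance as the potential. All names here (`toy_occupancy_samples`, `potential`, `shaped_reward`) are hypothetical, and the "occupancy model" is a hand-written toy on a 1-D chain, not a learned world model.

```python
import numpy as np

def wasserstein_1d(x, y):
    """W1 distance between two equal-size 1-D empirical distributions."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample counts")
    return float(np.mean(np.abs(x - y)))

def toy_occupancy_samples(state, n_states=10, n_samples=64, seed=0):
    """Hypothetical stand-in for a learned occupancy model: future states
    visited from `state` lie 0 to 2 steps ahead on a 1-D chain."""
    rng = np.random.default_rng(seed)
    offsets = rng.integers(0, 3, size=n_samples)
    return np.clip(state + offsets, 0, n_states - 1)

def potential(state, goal, **kw):
    """Phi(s): negative transport cost between the occupancy samples
    at `state` and those at `goal` (assumed form, for illustration)."""
    return -wasserstein_1d(toy_occupancy_samples(state, **kw),
                           toy_occupancy_samples(goal, **kw))

def shaped_reward(r, s, s_next, goal, gamma=0.99, **kw):
    """Potential-based shaping: r + gamma * Phi(s') - Phi(s)."""
    return r + gamma * potential(s_next, goal, **kw) - potential(s, goal, **kw)

if __name__ == "__main__":
    goal = 9
    # Sparse task reward is 0 away from the goal; shaping adds a dense signal.
    print("step toward goal:", shaped_reward(0.0, 4, 5, goal, gamma=1.0))
    print("step away from goal:", shaped_reward(0.0, 5, 4, goal, gamma=1.0))
```

In this toy setting a step toward the goal receives a positive shaped reward and a step away receives a negative one, even though the sparse task reward is zero in both cases. Because the shaping term is a difference of potentials, its sum along a trajectory telescopes when gamma is 1, which is the intuition behind the policy-invariance property of potential-based shaping.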