arxiv_cs_ai April 24, 2026

Dynamical Priors as a Training Objective in Reinforcement Learning

reinforcement-learning, dynamical-priors, policy-gradient, decision-making, evidence-accumulation

Original Content

arXiv:2604.21464v1 (announce type: cross)

Abstract: Standard reinforcement learning (RL) optimizes policies for reward but imposes few constraints on how decisions evolve over time. As a result, policies may achieve high performance while exhibiting temporally incoherent behavior such as abrupt confidence shifts, oscillations, or degenerate inactivity. We introduce Dynamical Prior Reinforcement Learning (DP-RL), a training framework that augments policy gradient learning with an auxiliary loss derived from external state dynamics that implement evidence accumulation and hysteresis. Without modifying the reward, environment, or policy architecture, this prior shapes the temporal evolution of action probabilities during learning. Across three minimal environments, we show that dynamical priors systematically alter decision trajectories in task-dependent ways, promoting temporally structured behavior that cannot be explained by generic smoothing. These results demonstrate that training objectives alone can control the temporal geometry of decision-making in RL agents.
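
The abstract does not give the equations behind DP-RL, so the following is only an illustrative sketch of the general idea it describes: a policy gradient loss plus an auxiliary term derived from an external state that accumulates evidence and switches with hysteresis. Everything specific here is an assumption made for illustration and is not taken from the paper: the leaky accumulator, the `on`/`off` thresholds, the cross-entropy form of the auxiliary term, the weight `lam`, the target levels `hi`/`lo`, and the function names `prior_latch` and `dp_rl_loss`. The sketch uses NumPy and only evaluates the loss; in practice an autodiff framework would be needed to differentiate it with respect to the policy parameters.

```python
import numpy as np


def prior_latch(evidence, leak=0.1, on=0.7, off=0.3):
    """Hypothetical external dynamics: leaky accumulator plus hysteretic latch."""
    x, latched = 0.0, False
    xs, latch = [], []
    for e in evidence:
        # Leaky integration of momentary evidence, kept in [0, 1].
        x = min(1.0, max(0.0, (1.0 - leak) * x + e))
        # Hysteresis: turn on at `on`, turn off only once x falls to `off`.
        if not latched and x >= on:
            latched = True
        elif latched and x <= off:
            latched = False
        xs.append(x)
        latch.append(latched)
    return np.array(xs), np.array(latch)


def dp_rl_loss(logits, actions, returns, evidence, lam=0.5, hi=0.9, lo=0.1):
    """REINFORCE-style loss plus an assumed auxiliary term for a binary policy.

    The auxiliary term is the cross-entropy between the policy's P(a=1) and a
    target probability set by the latch state (hi when latched, lo otherwise).
    """
    logits = np.asarray(logits, dtype=float)
    actions = np.asarray(actions, dtype=float)
    returns = np.asarray(returns, dtype=float)

    # Bernoulli policy: probability of action 1 at each time step.
    p = np.clip(1.0 / (1.0 + np.exp(-logits)), 1e-8, 1.0 - 1e-8)

    # Policy gradient surrogate: -E[log pi(a_t) * R_t].
    log_prob = actions * np.log(p) + (1.0 - actions) * np.log(1.0 - p)
    pg_loss = -np.mean(log_prob * returns)

    # Auxiliary loss from the external dynamics; reward is left unchanged.
    _, latch = prior_latch(evidence)
    target = np.where(latch, hi, lo)
    aux_loss = -np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))

    return pg_loss + lam * aux_loss, pg_loss, aux_loss
```

In this sketch the reward, environment, and policy parameterization are untouched; only the extra term weighted by `lam` changes the objective, which matches the abstract's framing at a high level. Because the latch turns off at a lower threshold than it turns on, the target probabilities stay high for a while after evidence stops arriving, which is one simple way to get behavior that differs from plain temporal smoothing.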