arxiv_cs_lg February 10, 2026

Preference Conditioned Multi-Objective Reinforcement Learning: Decomposed, Diversity-Driven Policy Optimization

Translated: 2026/3/15 14:48:01

reinforcement-learning, multi-objective, policy-optimization, pareto-frontier, deep-reinforcement-learning

Original Content

arXiv:2602.07764v1 Announce Type: new

Abstract: Multi-objective reinforcement learning (MORL) seeks to learn policies that balance multiple, often conflicting objectives. Although a single preference-conditioned policy is the most flexible and scalable solution, existing approaches remain brittle in practice, frequently failing to recover complete Pareto fronts. We show that this failure stems from two structural issues in current methods: destructive gradient interference caused by premature scalarization and representational collapse across the preference space. We introduce $D^3PO$, a PPO-based framework that reorganizes multi-objective policy optimization to address these issues directly. $D^3PO$ preserves per-objective learning signals through a decomposed optimization pipeline and integrates preferences only after stabilization, enabling reliable credit assignment. In addition, a scaled diversity regularizer enforces sensitivity of policy behavior to preference changes, preventing collapse. Across standard MORL benchmarks, including high-dimensional and many-objective control tasks, $D^3PO$ consistently discovers broader and higher-quality Pareto fronts than prior single- and multi-policy methods, matching or exceeding state-of-the-art hypervolume and expected utility while using a single deployable policy.
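The abstract reports results in terms of hypervolume and expected utility. As a rough illustration of what these two metrics measure (this is not the paper's evaluation code), the sketch below computes both for a two-objective maximization problem. The function names, the sample return vectors, the reference point, and the preference weights are all assumptions chosen for illustration. Expected utility is taken here as the average, over a set of linear preference weights, of the best scalarized return available in the solution set, which is one common definition.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (return on objective 1, return on objective 2)


def pareto_front(points: Sequence[Point]) -> List[Point]:
    """Return the non-dominated points, assuming both objectives are maximized."""
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    # Drop exact duplicates and sort by the first objective.
    return sorted(set(front))


def hypervolume_2d(points: Sequence[Point], ref: Point) -> float:
    """Area dominated by the front and bounded below by the reference point."""
    front = [p for p in pareto_front(points) if p[0] > ref[0] and p[1] > ref[1]]
    # Sweep from the largest first objective; the second objective then increases.
    front.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    prev_y = ref[1]
    for x, y in front:
        area += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return area


def expected_utility(points: Sequence[Point], weights: Sequence[Point]) -> float:
    """Mean over preference weights of the best linearly scalarized return."""
    best = [max(w[0] * p[0] + w[1] * p[1] for p in points) for w in weights]
    return sum(best) / len(best)


# Hypothetical return vectors, e.g. one per preference fed to a single policy.
demo_points = [(1.0, 5.0), (3.0, 4.0), (5.0, 1.0), (2.0, 2.0)]
demo_ref = (0.0, 0.0)                               # assumed reference point
demo_weights = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]  # assumed preferences

print(pareto_front(demo_points))               # (2.0, 2.0) is dominated
print(hypervolume_2d(demo_points, demo_ref))
print(expected_utility(demo_points, demo_weights))
```

A broader front, with more well-spread non-dominated points, increases the dominated area, while expected utility rewards having a good solution for each sampled preference. The two metrics are therefore typically reported together.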