arxiv_cs_lg April 24, 2026

Issues with Value-Based Multi-objective Reinforcement Learning: Value Function Interference and Overestimation Sensitivity

Translated: 2026/4/24 20:06:52

Tags: reinforcement-learning, multi-objective, q-learning, value-function, arxiv

Original Content

arXiv:2402.06266v2

Abstract: Multi-objective reinforcement learning (MORL) algorithms extend conventional reinforcement learning (RL) to the more general case of problems with multiple, conflicting objectives, represented by vector-valued rewards. Widely used scalar RL methods such as Q-learning can be modified to handle multiple objectives by (1) learning vector-valued value functions, and (2) performing action selection using a scalarisation or ordering operator which reflects the user's preferences with respect to the different objectives. This paper investigates two previously unreported issues which can hinder the performance of value-based MORL algorithms when applied in conjunction with a non-linear utility function: value function interference and sensitivity to overestimation. We illustrate the nature of these phenomena on simple multi-objective MDPs using a tabular implementation of multi-objective Q-learning.
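The two modifications named in the abstract, (1) vector-valued value functions and (2) action selection through a scalarisation operator, can be sketched in a few lines of tabular Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `utility` function (minimum over objectives), the table sizes, the learning rate, and the example transition are all hypothetical choices made here for demonstration.

```python
import numpy as np

# Hypothetical problem size, chosen only for illustration.
N_STATES, N_ACTIONS, N_OBJECTIVES = 2, 2, 2


def utility(values):
    """Assumed non-linear utility: the value of the worst-off objective."""
    return np.min(values, axis=-1)


def greedy_action(q, state):
    """Step (2): scalarise each action's Q-vector, then pick the best action."""
    return int(np.argmax(utility(q[state])))


def q_update(q, state, action, reward_vec, next_state, done,
             alpha=0.1, gamma=1.0):
    """Step (1): vector-valued Q-learning update for a single transition."""
    if done:
        target = reward_vec
    else:
        # Bootstrap from the Q-vector of the action that is greedy
        # under the utility function in the next state.
        next_action = greedy_action(q, next_state)
        target = reward_vec + gamma * q[next_state, next_action]
    q[state, action] += alpha * (target - q[state, action])
    return q


# One Q-vector (length N_OBJECTIVES) per state-action pair.
q = np.zeros((N_STATES, N_ACTIONS, N_OBJECTIVES))

# A single made-up terminal transition with a vector-valued reward.
q = q_update(q, state=0, action=1, reward_vec=np.array([1.0, 3.0]),
             next_state=1, done=True, alpha=0.5)

print(q[0, 1])              # Q-vector moved halfway towards [1.0, 3.0]
print(greedy_action(q, 0))  # action with the highest utility in state 0
```

Because the utility is applied to the stored Q-vector rather than to individual outcomes, a non-linear `utility` makes action selection depend on how well each component of the vector estimate is learned, which is the setting the abstract says the two issues arise in.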