arxiv_cs_lg April 24, 2026

Robust Quality-Diversity Optimization via Distributional Value Estimation Without Target Networks

Distributional Value Estimation Without Target Networks for Robust Quality-Diversity

Translated: 2026/4/24 19:58:36

reinforcement-learning, quality-diversity, sample-efficiency, distributed-critic, evolutionary-algorithms

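Translation (from Japanese)

arXiv:2604.20381v1 Announce Type: new

Abstract: Quality-Diversity (QD) algorithms excel at discovering diverse repertoires of skills, but they are sample-inefficient and typically need tens of millions of environment steps to solve complex locomotion tasks. Recent advances in Reinforcement Learning (RL) have shown that high Update-to-Data (UTD) ratios accelerate Actor-Critic learning. Standard high-UTD algorithms, however, usually rely on target networks to stabilise training. That requirement is a computational bottleneck and makes them impractical for resource-intensive QD tasks, where sample efficiency and rapid population adaptation are essential. This paper proposes QDHUAC, a sample-efficient, target-free, distributional QD-RL algorithm. It provides dense, low-variance gradient signals, enables high-UTD training for Dominated Novelty Search, and needs roughly an order of magnitude fewer environment steps. The method trains stably at high UTD ratios and reaches competitive coverage and fitness on high-dimensional Brax environments with an order of magnitude fewer samples than the baselines. The results suggest that combining target-free distributional critics with dominance-based selection is a key enabler for the next generation of sample-efficient evolutionary RL algorithms.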

Original Content

arXiv:2604.20381v1 Announce Type: new

Abstract: Quality-Diversity (QD) algorithms excel at discovering diverse repertoires of skills, but are hindered by poor sample efficiency and often require tens of millions of environment steps to solve complex locomotion tasks. Recent advances in Reinforcement Learning (RL) have shown that high Update-to-Data (UTD) ratios accelerate Actor-Critic learning. While effective, standard high-UTD algorithms typically utilise target networks to stabilise training. This requirement introduces a significant computational bottleneck, rendering them impractical for resource-intensive Quality-Diversity (QD) tasks where sample efficiency and rapid population adaptation are critical. In this paper, we introduce QDHUAC, a sample-efficient, target-free and distributional QD-RL algorithm that provides dense and low-variance gradient signals, which enables high-UTD training for Dominated Novelty Search whilst requiring an order of magnitude fewer environment steps. We demonstrate that our method enables stable training at high UTD ratios, achieving competitive coverage and fitness on high-dimensional Brax environments with an order of magnitude fewer samples than baselines. Our results suggest that combining target-free distributional critics with dominance-based selection is a key enabler for the next generation of sample-efficient evolutionary RL algorithms.
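To make the two terms "UTD ratio" and "target-free" concrete, here is a minimal sketch in plain Python. It is not QDHUAC and is not taken from the paper: the 5-state chain environment, the tabular critic, the function name `train`, and all hyperparameters are assumptions chosen purely for illustration. The only points it shows are that the UTD ratio counts critic updates per environment step, and that a target-free update bootstraps from the online critic itself rather than from a separate, slowly updated copy.

```python
import random


def train(num_env_steps, utd_ratio, lr=0.1, gamma=0.9, seed=0):
    """Toy TD(0) on a 5-state chain (hypothetical setup, illustration only).

    Returns the learned value table and the total number of critic updates.
    """
    rng = random.Random(seed)
    n_states = 5
    values = [0.0] * n_states  # online critic, tabular for simplicity
    buffer = []                # replay buffer of (state, reward, next_state, done)
    updates = 0
    state = 0

    for _ in range(num_env_steps):
        # One environment step: always move right; reward 1 at the last state.
        next_state = state + 1
        done = next_state == n_states - 1
        reward = 1.0 if done else 0.0
        buffer.append((state, reward, next_state, done))
        state = 0 if done else next_state

        # UTD ratio: number of critic updates per environment step.
        for _ in range(utd_ratio):
            s, r, s2, d = rng.choice(buffer)
            # Target-free: the bootstrap value comes from the online critic,
            # not from a separate target network.
            target = r + (0.0 if d else gamma * values[s2])
            values[s] += lr * (target - values[s])
            updates += 1

    return values, updates


if __name__ == "__main__":
    for utd in (1, 8):
        vals, n = train(num_env_steps=100, utd_ratio=utd)
        print(f"UTD={utd}: {n} updates, values={[round(v, 3) for v in vals]}")
```

With the same 100 environment steps, the UTD=8 run performs eight times as many critic updates as the UTD=1 run, which is the sense in which a high UTD ratio trades extra computation for fewer environment samples. The distributional critic and the Dominated Novelty Search selection described in the abstract are deliberately left out of this sketch.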