arxiv_cs_lg April 24, 2026

Multi-Armed Bandits With Machine Learning-Generated Surrogate Rewards

Translated: 2026/4/24 20:12:01

multi-armed-bandit, machine-learning, regret-minimization, reinforcement-learning, sequential-decision-making

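Japanese Translation (English rendering)

arXiv:2506.16658v2 Announce Type: replace-cross Abstract: The multi-armed bandit (MAB) is a widely adopted framework for sequential decision-making under uncertainty. Conventional bandit algorithms depend only on online data, which tends to be limited because it is collected during the online phase while arms are actively being pulled. In many practical settings, however, rich auxiliary data, such as covariates of past users, is available before any arm is deployed. We propose a new MAB setting in which pre-trained machine learning (ML) models convert side information and historical data into \emph{surrogate rewards}. A notable challenge in this setting is that the surrogate rewards may be substantially biased, because true reward data is usually unavailable in the offline phase. To address this, we propose the Machine Learning-Assisted Upper Confidence Bound (MLA-UCB) algorithm, which applies to any reward prediction model and any form of auxiliary data. When the predicted and true rewards are jointly Gaussian, MLA-UCB provably improves cumulative regret, even when the mean surrogate reward is completely misaligned with the true mean reward, and achieves asymptotic optimality within a broad class of policies. Notably, our method requires no prior knowledge of the covariance matrix between the true and surrogate rewards. We further extend the method to a batched-reward MAB problem, in which each arm pull produces a batch of observations and rewards may be non-Gaussian, and we derive computable confidence bounds and regret guarantees that improve on classical UCB algorithms. Finally, extensive simulations with Gaussian and ML-generated surrogate rewards, together with real-world studies on language model selection and video recommendation, show consistent and often substantial regret reductions at moderate offline surrogate sample sizes and correlations.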

Original Content

arXiv:2506.16658v2 Announce Type: replace-cross Abstract: Multi-armed bandit (MAB) is a widely adopted framework for sequential decision-making under uncertainty. Traditional bandit algorithms rely solely on online data, which tends to be scarce as it must be gathered during the online phase when the arms are actively pulled. However, in many practical settings, rich auxiliary data, such as covariates of past users, is available prior to deploying any arms. We introduce a new setting for MAB where pre-trained machine learning (ML) models are applied to convert side information and historical data into \emph{surrogate rewards}. A prominent challenge of this setting is that the surrogate rewards may exhibit substantial bias, as true reward data is typically unavailable in the offline phase, forcing ML predictions to heavily rely on extrapolation. To address the issue, we propose the Machine Learning-Assisted Upper Confidence Bound (MLA-UCB) algorithm, which can be applied to any reward prediction model and any form of auxiliary data. When the predicted and true rewards are jointly Gaussian, it provably improves the cumulative regret, even in cases where the mean surrogate reward completely misaligns with the true mean rewards, and achieves the asymptotic optimality among a broad class of policies. Notably, our method requires no prior knowledge of the covariance matrix between true and surrogate rewards. We further extend the method to a batched reward MAB problem, where each arm pull yields a batch of observations and rewards may be non-Gaussian, and we derive computable confidence bounds and regret guarantees that improve upon classical UCB algorithms. Finally, extensive simulations with both Gaussian and ML-generated surrogates, together with real-world studies on language model selection and video recommendation, demonstrate consistent and often substantial regret reductions with moderate offline surrogate sample sizes and correlations.
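
The abstract does not spell out how MLA-UCB combines surrogate and true rewards, so the following is only a rough illustration of why a biased but correlated surrogate can still help. It is a generic control-variate mean estimator, not the authors' algorithm. The function names, the simulation parameters (`rho`, `bias`, sample sizes), and the data-generating model are all assumptions made for this sketch. A UCB-style policy would add an exploration bonus on top of such an estimate; that part is omitted here.

```python
import random
import statistics


def control_variate_mean(rewards, surrogates_online, surrogates_offline):
    """Estimate an arm's true mean reward (hypothetical helper, not from the paper).

    rewards[i] and surrogates_online[i] come from the same online pull;
    surrogates_offline are ML predictions with no true rewards attached.
    """
    n = len(rewards)
    mean_r = statistics.fmean(rewards)
    if n < 3:
        return mean_r  # too few pairs to estimate a slope; fall back to the plain mean
    mean_s_on = statistics.fmean(surrogates_online)
    mean_s_off = statistics.fmean(surrogates_offline)
    var_s = statistics.variance(surrogates_online)
    if var_s == 0.0:
        return mean_r
    cov_rs = sum((r - mean_r) * (s - mean_s_on)
                 for r, s in zip(rewards, surrogates_online)) / (n - 1)
    beta = cov_rs / var_s  # estimated regression slope of reward on surrogate
    # A constant surrogate bias appears in both surrogate means and cancels here.
    return mean_r - beta * (mean_s_on - mean_s_off)


def simulate(trials=1000, n_online=20, n_offline=500, rho=0.8, bias=5.0, seed=0):
    """Compare mean squared error of the plain mean and the control-variate estimate.

    Assumed model: reward and surrogate are jointly Gaussian with unit variances
    and correlation rho; the surrogate mean is shifted by `bias`.
    """
    rng = random.Random(seed)
    true_mean = 1.0
    se_plain = 0.0
    se_cv = 0.0
    for _ in range(trials):
        rewards, s_on = [], []
        for _ in range(n_online):
            z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
            rewards.append(true_mean + z1)
            s_on.append(true_mean + bias + rho * z1 + (1 - rho ** 2) ** 0.5 * z2)
        s_off = [true_mean + bias + rng.gauss(0, 1) for _ in range(n_offline)]
        se_plain += (statistics.fmean(rewards) - true_mean) ** 2
        se_cv += (control_variate_mean(rewards, s_on, s_off) - true_mean) ** 2
    return se_plain / trials, se_cv / trials


if __name__ == "__main__":
    mse_plain, mse_cv = simulate()
    print(f"plain mean MSE: {mse_plain:.4f}, control-variate MSE: {mse_cv:.4f}")
```

Under these assumptions the surrogate's mean can be far from the true mean without biasing the estimate, because only the difference between the online and offline surrogate means enters the correction. The variance reduction depends on the correlation and on the offline sample size, which matches the abstract's emphasis on "moderate offline surrogate sample sizes and correlations".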