arxiv_cs_lg April 20, 2026

Exploitation Over Exploration: Unmasking the Bias in Offline Evaluation of Linear Bandit Recommenders

Exploitation Over Exploration: Unmasking the Bias in Linear Bandit Recommender Offline Evaluation

Translated: 2026/4/20 11:03:31

multivariate-bandit recommendation-system offline-evaluation exploration-exploitation linear-regression

Japanese Translation

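arXiv:2507.18756v2 Announce Type: replace Abstract: Multi-armed bandit (MAB) algorithms are widely used in recommender systems that require continuous, incremental learning. A core aspect of MABs is the exploration-exploitation trade-off: whether to exploit items already known to have high expected value or to explore unknown items to gather new information. This trade-off is especially important in contextual linear bandits, many of which share the same linear regression backbone and differ mainly in their exploration strategies. Despite its widespread use, offline evaluation of MABs is increasingly recognized as limited in its ability to assess exploration behavior reliably. This study carries out a comprehensive offline empirical comparison of several linear MABs. Surprisingly, on over 90% of a diverse set of datasets, a greedy linear model with no exploration of any kind consistently achieves top-tier performance, in many cases outperforming or matching its exploratory counterparts. This observation is further supported by hyperparameter optimization, which consistently favors configurations that minimize exploration, suggesting that pure exploitation is the dominant strategy in these evaluation settings. Our results expose significant shortcomings in offline evaluation protocols for bandits, particularly in their ability to reflect true exploratory effectiveness. This work therefore underscores the urgent need for more robust evaluation methods and points future research toward alternative evaluation frameworks for interactive learning in recommender systems.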

Original Content

arXiv:2507.18756v2 Announce Type: replace Abstract: Multi-Armed Bandit (MAB) algorithms are widely used in recommender systems that require continuous, incremental learning. A core aspect of MABs is the exploration-exploitation trade-off: choosing between exploiting items likely to be enjoyed and exploring new ones to gather information. In contextual linear bandits, this trade-off is particularly central, as many variants share the same linear regression backbone and differ primarily in their exploration strategies. Despite its prevalent use, offline evaluation of MABs is increasingly recognized for its limitations in reliably assessing exploration behavior. This study conducts an extensive offline empirical comparison of several linear MABs. Strikingly, across over 90% of various datasets, a greedy linear model, with no type of exploration, consistently achieves top-tier performance, often outperforming or matching its exploratory counterparts. This observation is further corroborated by hyperparameter optimization, which consistently favors configurations that minimize exploration, suggesting that pure exploitation is the dominant strategy within these evaluation settings. Our results expose significant inadequacies in offline evaluation protocols for bandits, particularly concerning their capacity to reflect true exploratory efficacy. Consequently, this research underscores the urgent necessity for developing more robust assessment methodologies, guiding future investigations into alternative evaluation frameworks for interactive learning in recommender systems.
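
The abstract notes that many contextual linear bandits share one linear regression backbone and differ mainly in how they explore. The sketch below illustrates that idea only; it is not the authors' implementation. It assumes a ridge-regression model with a LinUCB-style confidence bonus, and the names `LinearBandit`, `alpha`, and `ridge` are hypothetical. Setting `alpha=0` gives the greedy (pure exploitation) model, while `alpha>0` adds exploration on top of the same regression.

```python
import math


class LinearBandit:
    """Ridge-regression bandit (illustrative): alpha=0 is greedy, alpha>0 adds a LinUCB-style bonus."""

    def __init__(self, dim, alpha=0.0, ridge=1.0):
        self.dim = dim
        self.alpha = alpha
        # A_inv starts as (ridge * I)^-1; b accumulates reward-weighted features.
        self.A_inv = [[(1.0 / ridge if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim

    def _matvec(self, x):
        return [sum(row[j] * x[j] for j in range(self.dim)) for row in self.A_inv]

    def theta(self):
        """Current ridge-regression coefficients, A^-1 b."""
        return self._matvec(self.b)

    def score(self, x):
        """Predicted reward plus alpha times the confidence width sqrt(x^T A^-1 x)."""
        mean = sum(t * xi for t, xi in zip(self.theta(), x))
        if self.alpha == 0.0:
            return mean  # greedy: no exploration term at all
        Ax = self._matvec(x)
        width = math.sqrt(max(0.0, sum(xi * a for xi, a in zip(x, Ax))))
        return mean + self.alpha * width

    def select(self, arms):
        """Return the index of the highest-scoring arm (ties go to the lowest index)."""
        scores = [self.score(x) for x in arms]
        return max(range(len(arms)), key=scores.__getitem__)

    def update(self, x, reward):
        """Rank-one Sherman-Morrison update of A^-1, then b += reward * x."""
        Ax = self._matvec(x)
        denom = 1.0 + sum(xi * a for xi, a in zip(x, Ax))
        for i in range(self.dim):
            for j in range(self.dim):
                self.A_inv[i][j] -= Ax[i] * Ax[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


# Two arms with one-hot features; both models see the same single observation.
arms = [[1.0, 0.0], [0.0, 1.0]]
greedy = LinearBandit(dim=2, alpha=0.0)
explorer = LinearBandit(dim=2, alpha=2.0)
for model in (greedy, explorer):
    model.update(arms[0], reward=1.0)

print(greedy.select(arms))    # 0: keeps exploiting the arm that already paid off
print(explorer.select(arms))  # 1: the uncertainty bonus favors the untried arm
```

Both models fit identical coefficients from the same data; only the selection rule differs. That is why, in a sketch like this, an offline protocol that scores models against logged interactions can end up measuring the regression fit rather than the value of exploring.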