arxiv_cs_lg February 10, 2026

Nonparametric Bayesian Optimization for General Rewards

Translated: 2026/3/15 14:06:21

bayesian-optimization, nonparametric, gaussian-process, thompson-sampling, regret-analysis

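Translation (from Japanese)

arXiv:2602.07411v1. Announce Type: new.

Abstract: This paper studies Bayesian optimization (BO) under reward model uncertainty. We propose the first BO algorithm to achieve a no-regret guarantee in a general reward setting, in which the objective function is only required to be Lipschitz continuous and a broad class of measurement noise is permitted. At the core of the approach is a new surrogate model, the infinite Gaussian process ($\infty$-GP): a Bayesian nonparametric model that places a prior over the space of reward distributions and can represent a far broader class of reward models than the classical Gaussian process (GP). Combining the $\infty$-GP with Thompson Sampling (TS) enables effective exploration and exploitation. Correspondingly, we develop a new TS regret analysis framework for general rewards, which relates the regret to the total variation distance between the surrogate model and the true reward distribution. Moreover, by using a truncated Gibbs sampling procedure, the method remains computationally scalable, requiring only minimal additional memory and computation compared with a classical GP. Empirical results show state-of-the-art performance, particularly in settings with non-stationary, heavy-tailed, or otherwise ill-conditioned rewards.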

Original Content

arXiv:2602.07411v1. Announce Type: new.

Abstract: This work focuses on Bayesian optimization (BO) under reward model uncertainty. We propose the first BO algorithm that achieves a no-regret guarantee in a general reward setting, requiring only Lipschitz continuity of the objective function and accommodating a broad class of measurement noise. The core of our approach is a novel surrogate model, termed the infinite Gaussian process ($\infty$-GP). It is a Bayesian nonparametric model that places a prior on the space of reward distributions, enabling it to represent a substantially broader class of reward models than the classical Gaussian process (GP). The $\infty$-GP is used in combination with Thompson Sampling (TS) to enable effective exploration and exploitation. Correspondingly, we develop a new TS regret analysis framework for general rewards, which relates the regret to the total variation distance between the surrogate model and the true reward distribution. Furthermore, with a truncated Gibbs sampling procedure, our method is computationally scalable, incurring minimal additional memory and computational cost compared to a classical GP. Empirical results demonstrate state-of-the-art performance, particularly in settings with non-stationary, heavy-tailed, or other ill-conditioned rewards.
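For orientation, the classical baseline that the abstract compares against (a GP surrogate combined with Thompson Sampling) can be sketched as below. This is a minimal illustration and not the paper's method: the $\infty$-GP and its truncated Gibbs sampler are not reproduced, because the abstract gives no details of them. The RBF kernel, the 1-D grid, the Gaussian noise, and all function names and parameters (`rbf_kernel`, `gp_posterior`, `thompson_sampling`, `cumulative_regret`, `length_scale`, `noise_std`) are assumptions made for the example.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of points (assumed kernel)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, grid, noise_var):
    """Posterior mean and covariance of a zero-mean GP evaluated on `grid`."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise_var * np.eye(len(x_obs))
    k_og = rbf_kernel(x_obs, grid)
    k_gg = rbf_kernel(grid, grid)
    solved = np.linalg.solve(k_oo, k_og)  # K_oo^{-1} K_og
    mean = solved.T @ y_obs
    cov = k_gg - k_og.T @ solved
    return mean, cov


def thompson_sampling(objective, grid, n_rounds=30, noise_std=0.1, seed=0):
    """Classical GP Thompson Sampling over a finite grid of candidate points."""
    rng = np.random.default_rng(seed)
    # Start from one random query so the posterior is defined.
    x_obs = [rng.choice(grid)]
    y_obs = [objective(x_obs[0]) + noise_std * rng.standard_normal()]
    for _ in range(n_rounds - 1):
        mean, cov = gp_posterior(np.array(x_obs), np.array(y_obs), grid, noise_std**2)
        # Symmetrize and add jitter so the covariance is numerically valid.
        cov = 0.5 * (cov + cov.T) + 1e-8 * np.eye(len(grid))
        # Draw one function from the posterior and query its maximizer.
        sample = rng.multivariate_normal(mean, cov)
        x_next = grid[np.argmax(sample)]
        x_obs.append(x_next)
        y_obs.append(objective(x_next) + noise_std * rng.standard_normal())
    return np.array(x_obs), np.array(y_obs)


def cumulative_regret(objective, grid, x_obs):
    """Cumulative regret relative to the best grid point (noise-free values)."""
    best = np.max(objective(grid))
    return np.cumsum(best - objective(x_obs))


if __name__ == "__main__":
    grid = np.linspace(0.0, 1.0, 50)
    f = lambda x: -(x - 0.7) ** 2  # a Lipschitz objective on [0, 1]
    xs, ys = thompson_sampling(f, grid)
    print("average regret:", cumulative_regret(f, grid, xs)[-1] / len(xs))
```

A no-regret guarantee means that the average regret printed at the end tends to zero as the number of rounds grows. The sketch assumes Gaussian observation noise, which is the modelling restriction the abstract says the $\infty$-GP is designed to relax.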