
Workshop: OPT 2023: Optimization for Machine Learning

From 6235149080811616882909238708 to 29: Vanilla Thompson Sampling Revisited

Bingshan Hu · Tianyue Zhang

Abstract: In this work, we derive a new problem-dependent regret bound for Thompson Sampling with Gaussian priors (Algorithm 2 in [Agrawal and Goyal, 2017]), a classical stochastic bandit algorithm that has demonstrated excellent empirical performance and has been widely deployed in real-world applications. The existing regret bound is $\sum\limits_{i \in \mathcal{A}: \Delta_i >0}\frac{288 \left(e^{64}+6 \right) \ln \left(T\Delta_i^2 + e^{32} \right)}{\Delta_i} + \frac{10.5}{\Delta_i} + \Delta_i$, where $\mathcal{A}$ is the arm set, $\Delta_i$ denotes the single-round performance loss incurred by pulling a sub-optimal arm $i$ instead of the optimal arm, and $T$ is the learning time horizon. Since real-world learning tasks are concerned with performance over a finite learning time horizon $T$, the existing regret bound is non-vacuous only when $T > 288 \cdot e^{64}$, which may not be practical. Our new regret bound is $ \sum\limits_{i \in \mathcal{A}: \Delta_i >0} \frac{1252 \ln \left(T \Delta_i^2 + 100^{\frac{1}{3}}\right)}{\Delta_i} +\frac{18 \ln \left(T\Delta_i^2 \right)}{\Delta_i} + \frac{182.5}{\Delta_i}+ \Delta_i$, which tightens the leading term's coefficient significantly. Beyond these improvements, we would like to emphasize that the goal of this work is to deepen the theoretical understanding of Thompson Sampling, so as to unlock the full potential of this classical learning algorithm for solving challenging real-world learning problems.
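For readers less familiar with the algorithm being analyzed, the following is a minimal sketch of vanilla Thompson Sampling with Gaussian priors in the spirit of Algorithm 2 of [Agrawal and Goyal, 2017], instantiated here for Bernoulli rewards; the function and variable names are ours, and this is an illustrative simulation, not the authors' code. At each round, the learner samples an index $\theta_i \sim \mathcal{N}(\hat{\mu}_i, 1/(k_i+1))$ for each arm $i$ (empirical mean $\hat{\mu}_i$, pull count $k_i$) and pulls the arm with the largest sample:

```python
import random


def thompson_sampling_gaussian(means, horizon, rng):
    """Vanilla Thompson Sampling with Gaussian priors, sketched for
    Bernoulli rewards. `means` holds the (unknown) arm means; returns
    the pull count of each arm after `horizon` rounds."""
    n_arms = len(means)
    counts = [0] * n_arms   # k_i: number of pulls of arm i so far
    sums = [0.0] * n_arms   # cumulative observed reward of arm i
    for _ in range(horizon):
        # Sample a posterior index theta_i ~ N(mu_hat_i, 1/(k_i + 1)).
        # Unpulled arms use mean 0 and variance 1 (the prior).
        thetas = [
            rng.gauss(
                sums[i] / counts[i] if counts[i] > 0 else 0.0,
                (1.0 / (counts[i] + 1)) ** 0.5,
            )
            for i in range(n_arms)
        ]
        # Pull the arm with the largest sampled index.
        arm = max(range(n_arms), key=lambda i: thetas[i])
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


# Two-armed instance with gap Delta = 0.3: the algorithm should
# concentrate its pulls on the better arm (index 1).
counts = thompson_sampling_gaussian([0.5, 0.8], 5000, random.Random(0))
```

Regret bounds of the form discussed in the abstract control how often the sub-optimal arm is pulled in such a run: the per-arm term scales as $\ln(T\Delta_i^2)/\Delta_i$, so sub-optimal pull counts grow only logarithmically in $T$.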
