NeurIPS Poster Kullback-Leibler Maillard Sampling for Multi-armed Bandits with Bounded Rewards

Poster

Kullback-Leibler Maillard Sampling for Multi-armed Bandits with Bounded Rewards

Hao Qin · Kwang-Sung Jun · Chicheng Zhang

Great Hall & Hall B1+B2 (level 1) #1803

[ Abstract ] [ Project Page ]

[ Paper] [ Slides] [ Poster] [ OpenReview]

Abstract: We study

K

$K$ -armed bandit problems where the reward distributions of the arms are all supported on the

[0, 1]

$[0,1]$ interval. Maillard sampling\cite{maillard13apprentissage}, an attractive alternative to Thompson sampling, has recently been shown to achieve competitive regret guarantees in the sub-Gaussian reward setting\cite{bian2022maillard} while maintaining closed-form action probabilities, which is useful for offline policy evaluation. In this work, we analyze the Kullback-Leibler Maillard Sampling (KL-MS) algorithm, a natural extension of Maillard sampling {and a special case of Minimum Empirical Divergence (MED)~\cite{honda2011asymptotically}} for achieving a KL-style finite-time gap-dependent regret bound. We show that KL-MS enjoys the asymptotic optimality when the rewards are Bernoulli and has an {adaptive} worst-case regret bound of the form

O (\sqrt{μ^{*} (1 - μ^{*}) K T \ln K} + K \ln T)

$O(\sqrt{\mu^*(1-\mu^*) K T \ln K} + K \ln T)$ , where

μ^{*}

$\mu^*$ is the expected reward of the optimal arm, and

T

$T$ is the time horizon length; {this is the first time such adaptivity is reported in the literature for an algorithm with asymptotic optimality guarantees.}

Chat is not available.