Timezone: »

Context-lumpable stochastic bandits
Chung-Wei Lee · Qinghua Liu · Yasin Abbasi Yadkori · Chi Jin · Tor Lattimore · Csaba Szepesvari

Wed Dec 13 08:45 AM -- 10:45 AM (PST) @ Great Hall & Hall B1+B2 #1815
We consider a contextual bandit problem with $S $ contexts and $K $ actions. In each round $t=1,2,\dots$ the learnerobserves a random context and chooses an action based on its past experience. The learner then observes a random reward whose mean is a function of the context and the action for the round. Under the assumption that the contexts can be lumped into $r\le \min(S ,K)$ groups such that the mean reward for the various actions is the same for any two contexts that are in the same group, we give an algorithm that outputs an $\epsilon$-optimal policy after using at most $\widetilde O(r (S +K )/\epsilon^2)$ samples with high probability and provide a matching $\widetilde\Omega(r (S +K )/\epsilon^2)$ lower bound. In the regret minimization setting, we give an algorithm whose cumulative regret up to time $T$ is bounded by $\widetilde O(\sqrt{r ^3(S +K )T})$. To the best of our knowledge, we are the first to show the near-optimal sample complexity in the PAC setting and $\widetilde O{\sqrt{\text{poly}(r)(S+K)T}}$ minimax regret in the online setting for this problem. We also show our algorithms can be applied to more general low-rank bandits and get improved regret bounds in some scenarios.

Author Information

Chung-Wei Lee (University of Southern California)
Qinghua Liu (Princeton University)
Yasin Abbasi Yadkori (DeepMind)
Chi Jin (Princeton University)
Tor Lattimore (DeepMind)
Csaba Szepesvari (University of Alberta)

More from the Same Authors