Poster
Contextual Bandits with Cross-Learning
Santiago Balseiro · Negin Golrezaei · Mohammad Mahdian · Vahab Mirrokni · Jon Schneider
East Exhibition Hall B, C #41
Keywords: [ Theory ] [ Game Theory and Computational Economics ] [ Bandit Algorithms ] [ Algorithms ]
Abstract:
In the classical contextual bandits problem, in each round t, a learner observes some context c, chooses some action a to perform, and receives some reward r_{a,t}(c). We consider the variant of this problem where, in addition to receiving the reward r_{a,t}(c), the learner also learns the values of r_{a,t}(c') for all other contexts c'; i.e., the rewards that would have been achieved by performing that action under different contexts. This variant arises in several strategic settings, such as learning how to bid in non-truthful repeated auctions (in this setting the context is the decision maker's private valuation for each auction). We call this problem the contextual bandits problem with cross-learning.
The best algorithms for the classical contextual bandits problem achieve Õ(√(CKT)) regret against all stationary policies, where C is the number of contexts, K the number of actions, and T the number of rounds. We demonstrate algorithms for the contextual bandits problem with cross-learning that remove the dependence on C and achieve regret Õ(√(KT)) (when contexts are stochastic with known distribution), Õ(K^{1/3} T^{2/3}) (when contexts are stochastic with unknown distribution), and Õ(√(KT)) (when contexts are adversarial but rewards are stochastic). We simulate our algorithms on real auction data from an ad exchange running first-price auctions (showing that they outperform traditional contextual bandit algorithms).
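To illustrate the cross-learning feedback model described above, the following is a minimal sketch (not the paper's exact algorithms): a UCB-style learner that, after playing arm a in the observed context c, updates its estimate of arm a for every context c', since r_{a,t}(c') is revealed for all c'. The function names, the UCB instantiation, and the Gaussian noise model are illustrative assumptions; the point is that arm pull counts can be shared across contexts, which is what removes the dependence on C.

```python
import math
import random


def cross_learning_ucb(K, C, T, mean_reward, seed=0):
    """Sketch of a UCB bandit with cross-learning.

    mean_reward(a, c) gives the expected reward of arm a in context c
    (an assumed simulator, not part of the paper). After each pull,
    the reward of the chosen arm is observed for ALL contexts, so
    every context's empirical mean for that arm is updated and the
    pull count is shared across contexts.
    """
    rng = random.Random(seed)
    counts = [0] * K                         # pulls per arm, shared across contexts
    means = [[0.0] * K for _ in range(C)]    # empirical mean of r_a(c) per context
    total = 0.0

    for t in range(1, T + 1):
        c = rng.randrange(C)                 # stochastic context each round

        def ucb(a):
            # Optimistic index for the observed context; unplayed arms first.
            if counts[a] == 0:
                return float("inf")
            return means[c][a] + math.sqrt(2 * math.log(t) / counts[a])

        a = max(range(K), key=ucb)
        total += mean_reward(a, c)           # track expected reward collected

        # Cross-learning feedback: a noisy r_a(c') is revealed for every c'.
        counts[a] += 1
        for cp in range(C):
            obs = mean_reward(a, cp) + rng.gauss(0, 0.1)
            means[cp][a] += (obs - means[cp][a]) / counts[a]

    return total / T
```

Because `counts` is shared across contexts, the exploration cost is paid once per arm rather than once per (arm, context) pair, which is the intuition behind the improved Õ(√(KT))-type bounds in the stochastic-context setting.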