Spotlight
Unreasonable Effectiveness of Greedy Algorithms in Multi-Armed Bandit with Many Arms
Mohsen Bayati · Nima Hamidi · Ramesh Johari · Khashayar Khosravi
Wed Dec 09 07:30 PM -- 07:40 PM (PST) @ Orals & Spotlights: Learning Theory
We study the structure of regret-minimizing policies in the {\em many-armed} Bayesian multi-armed bandit problem: in particular, with $k$ the number of arms and $T$ the time horizon, we consider the case where $k \geq \sqrt{T}$.
We first show that {\em subsampling} is a critical step for designing optimal policies. In particular, the standard UCB algorithm leads to sub-optimal regret bounds in the many-armed regime. However, a subsampled UCB (SS-UCB), which samples $\Theta(\sqrt{T})$ arms and executes UCB only on that subset, is rate-optimal.
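As a concrete illustration of the subsampling step, here is a minimal Python sketch of SS-UCB under stated assumptions: it uniformly samples $\lceil\sqrt{T}\rceil$ of the $k$ arms and runs a UCB1-style index on that subset only. The function name `ss_ucb`, the UCB1 index, and the one-pull-per-arm initialization are illustrative choices, not the paper's exact specification.

```python
import math
import numpy as np

def ss_ucb(pull, k, T, rng=None):
    """Subsampled UCB (SS-UCB) sketch: uniformly subsample ~sqrt(T) of the
    k arms and run a UCB1-style index on that subset only.
    `pull(arm)` returns a stochastic reward in [0, 1]; returns the list of
    (global) arm indices pulled, one per period."""
    rng = rng if rng is not None else np.random.default_rng()
    m = min(k, math.ceil(math.sqrt(T)))           # Theta(sqrt(T)) arms
    arms = rng.choice(k, size=m, replace=False)   # uniform subsample
    counts = np.zeros(m)
    means = np.zeros(m)
    pulls = []
    for t in range(1, T + 1):
        if t <= m:                                # initialize: one pull per arm
            i = t - 1
        else:                                     # UCB1 index on the subsample
            i = int(np.argmax(means + np.sqrt(2.0 * np.log(t) / counts)))
        r = pull(arms[i])
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]    # incremental mean update
        pulls.append(int(arms[i]))
    return pulls
```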
Despite its theoretically optimal regret, even SS-UCB performs poorly in practice due to excessive exploration of suboptimal arms. In particular, in numerical experiments SS-UCB performs worse than a simple greedy algorithm (and its subsampled version) that pulls the arm with the highest empirical mean at every time period.
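The greedy algorithm and its subsampled variant admit an equally short sketch, again under illustrative assumptions: after pulling each arm once, greedy always pulls the arm with the highest empirical mean, and SS-Greedy first restricts to a uniform subsample of $\lceil\sqrt{T}\rceil$ arms, mirroring SS-UCB. The initialization, tie-breaking, and the subsample size for SS-Greedy are assumptions rather than the paper's exact design.

```python
import math
import numpy as np

def greedy(pull, arms, T):
    """Greedy sketch: pull each arm once, then always pull the arm with the
    highest empirical mean. The one-pull initialization and argmax
    tie-breaking are illustrative assumptions."""
    m = len(arms)
    counts = np.zeros(m)
    means = np.zeros(m)
    pulls = []
    for t in range(T):
        i = t if t < m else int(np.argmax(means))
        r = pull(arms[i])
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]
        pulls.append(int(arms[i]))
    return pulls

def ss_greedy(pull, k, T, rng=None):
    """Subsampled greedy (SS-Greedy) sketch: run greedy on ~sqrt(T)
    uniformly subsampled arms. The sqrt(T) subsample size mirrors SS-UCB;
    the paper's exact rate for general reward distributions may differ."""
    rng = rng if rng is not None else np.random.default_rng()
    m = min(k, math.ceil(math.sqrt(T)))
    arms = rng.choice(k, size=m, replace=False)
    return greedy(pull, arms, T)
```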
We show that these insights hold even in a contextual setting, using real-world data.
These empirical results suggest a novel form of {\em free exploration} in the many-armed regime that benefits greedy algorithms. We theoretically study this new source of free exploration and find that it is deeply connected to the probability of a certain tail event for the prior distribution of arm rewards. This is a fundamentally distinct phenomenon from free exploration as discussed in the recent literature on contextual bandits, where free exploration arises due to variation in contexts.
We use this insight to prove that the subsampled greedy algorithm is rate-optimal for Bernoulli bandits when $k > \sqrt{T}$, and achieves sublinear regret with more general distributions. This is a case where theoretical rate optimality does not tell the whole story: when complemented by the empirical observations of our paper, the power of greedy algorithms becomes quite evident. Taken together, from a practical standpoint, our results suggest that in applications it may be preferable to use a variant of the greedy algorithm in the many-armed regime.
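As a hedged end-to-end illustration, the harness below (reusing the `ss_ucb` and `ss_greedy` sketches above) simulates a many-armed Bernoulli instance with arm means drawn i.i.d. from a Uniform(0, 1) prior. The prior, the horizon $T$, the choice $k = 2\sqrt{T}$, and the random seed are all illustrative; the printed pseudo-regret is whatever a particular run produces, not a result reported in the paper.

```python
import numpy as np

# Illustrative many-armed Bernoulli instance: arm means drawn i.i.d.
# Uniform(0, 1) (an assumed prior), with k = 2*sqrt(T) so that k >= sqrt(T).
rng = np.random.default_rng(0)
T = 10_000
k = int(2 * np.sqrt(T))
mus = rng.uniform(0.0, 1.0, size=k)
pull = lambda a: float(rng.random() < mus[a])   # Bernoulli(mu_a) reward

for name, algo in [("SS-UCB", ss_ucb), ("SS-Greedy", ss_greedy)]:
    pulls = algo(pull, k, T, rng=rng)
    # Cumulative pseudo-regret against the best arm among all k arms.
    regret = float(np.sum(mus.max() - mus[pulls]))
    print(f"{name}: cumulative pseudo-regret ~ {regret:.1f}")
```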
Author Information
Mohsen Bayati (Stanford University)
Nima Hamidi (Stanford University)
Ramesh Johari (Stanford University)
Khashayar Khosravi (Google Research)
Related Events (a corresponding poster, oral, or spotlight)
-
2020 Poster: Unreasonable Effectiveness of Greedy Algorithms in Multi-Armed Bandit with Many Arms »
Thu. Dec 10th 05:00 -- 07:00 AM @ Poster Session 4 #1274
More from the Same Authors
-
2022 Poster: Thompson Sampling Efficiently Learns to Control Diffusion Processes »
Mohamad Kazem Shirani Faradonbeh · Mohamad Sadegh Shirani Faradonbeh · Mohsen Bayati
-
2021 Poster: Synthetic Design: An Optimization Approach to Experimental Design with Synthetic Controls »
Nick Doudchenko · Khashayar Khosravi · Jean Pouget-Abadie · Sébastien Lahaie · Miles Lubin · Vahab Mirrokni · Jann Spiess · Guido Imbens
-
2020 Poster: Adaptive Experimental Design with Temporal Interference: A Maximum Likelihood Approach »
Peter W Glynn · Ramesh Johari · Mohammad Rasouli
-
2019 Poster: Personalizing Many Decisions with High-Dimensional Covariates »
Nima Hamidi · Mohsen Bayati · Kapil Gupta
-
2019 Poster: Semi-Parametric Dynamic Contextual Pricing »
Virag Shah · Ramesh Johari · Jose Blanchet
-
2018 Poster: Bandit Learning with Positive Externalities »
Virag Shah · Jose Blanchet · Ramesh Johari
-
2016 Poster: Scaled Least Squares Estimator for GLMs in Large-Scale Problems »
Murat Erdogdu · Lee H Dicker · Mohsen Bayati
-
2013 Poster: Estimating LASSO Risk and Noise Level »
Mohsen Bayati · Murat Erdogdu · Andrea Montanari
-
2010 Poster: The LASSO risk: asymptotic results and real world examples »
Mohsen Bayati · José Bento · Andrea Montanari