Workshop: OPT2020: Optimization for Machine Learning

Contributed Video: Employing No Regret Learners for Pure Exploration in Linear Bandits, Mohammadi Zaki

Mohammadi Zaki

Abstract: We study the best arm identification problem in linear multi-armed bandits (LMAB) in the fixed confidence ($\delta$-PAC) setting; this is also the problem of optimizing an unknown linear function over a discrete ground set with noisy, zeroth-order access. We propose an explicitly implementable and provably order-optimal sample-complexity algorithm to solve this problem. Most previous approaches rely on access to a minimax optimization oracle which is at the heart of the complexity of the problem. We propose a method to solve this optimization problem (upto suitable accuracy) by interpreting the problem as a two-player zero-sum game, and attempting to sequentially converge to its saddle point using low-regret learners to compute the players' strategies in each round which yields a concrete querying algorithm. The algorithm, which we call the {\em Phased Elimination Linear Exploration Game} (PELEG), maintains a high-probability confidence ellipsoid containing $\theta^*$ in each round and uses it to eliminate suboptimal arms in phases. We analyze the sample complexity of PELEG and show that it matches, up to order, an instance-dependent lower bound on sample complexity in the linear bandit setting without requiring boundedness assumptions on the parameter space. PELEG is, thus, the first algorithm to achieve both order-optimal sample complexity and explicit implementability for this setting. We also provide numerical results for the proposed algorithm consistent with its theoretical guarantees.