We study contextual bandits with budget and time constraints under discrete contexts, referred to as constrained contextual bandits. The time and budget constraints significantly complicate the exploration-exploitation tradeoff because they introduce complex coupling among contexts over time. To gain insight, we first study unit-cost systems with known context distribution. When the expected rewards are known, we develop an approximation of the oracle, referred to as Adaptive-Linear-Programming (ALP), which achieves near-optimality and only requires the ordering of expected rewards. Given these highly desirable features, we then combine ALP with the upper-confidence-bound (UCB) method in the general case where the expected rewards are unknown a priori. We show that the proposed UCB-ALP algorithm achieves logarithmic regret except in certain boundary cases. Further, we design algorithms and obtain similar regret analysis results for more general systems with unknown context distribution or heterogeneous costs. To the best of our knowledge, this is the first work that shows how to achieve logarithmic regret in constrained contextual bandits. Moreover, this work also sheds light on the study of computationally efficient algorithms for general constrained contextual bandits.
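To make the role of the reward ordering concrete, here is a minimal sketch of one way an ALP-style step could look in the unit-cost setting with known context distribution. It is an illustration under stated assumptions, not the paper's exact formulation: it assumes one candidate action per context, that skipping costs nothing, and that the linear program caps the expected spending per round at the ratio of remaining budget to remaining time. The function name `alp_probabilities` and all parameter names are hypothetical.

```python
def alp_probabilities(rewards, context_probs, remaining_budget, remaining_time):
    """Sketch (assumed formulation): greedily solve
        max  sum_j p_j * pi_j * u_j
        s.t. sum_j p_j * pi_j <= b / tau,  0 <= pi_j <= 1,
    where u_j is the expected reward of context j, p_j its probability,
    b the remaining budget and tau the remaining time.
    Returns pi_j, the probability of taking the action under context j.
    """
    n = len(rewards)
    if remaining_time <= 0 or remaining_budget <= 0:
        return [0.0] * n

    # Average budget per remaining round; cannot exceed 1 with unit costs.
    rho = min(1.0, remaining_budget / remaining_time)

    # Only the ranking of the rewards is used, never their magnitudes.
    order = sorted(range(n), key=lambda j: rewards[j], reverse=True)

    probs = [0.0] * n
    for j in order:
        if rho <= 1e-12:  # budget rate exhausted (tolerance for float error)
            break
        p = context_probs[j]
        if p <= 0:
            continue
        take = min(1.0, rho / p)  # fully serve context j if the rate allows
        probs[j] = take
        rho -= p * take
    return probs


# Example: three contexts, budget 4 left over 10 remaining rounds.
pi = alp_probabilities([0.9, 0.5, 0.1], [0.2, 0.3, 0.5], 4, 10)
```

Because the solution depends on the rewards only through their ranking, any reward vector with the same ordering yields the same probabilities. Under the same assumptions, a UCB-style variant could pass upper-confidence-bound estimates in place of `rewards` when the expected rewards are unknown.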
Huasen Wu (University of California at Davis)
Huasen Wu received his B.S. and Ph.D. degrees from the School of Electronic and Information Engineering, Beihang University, Beijing, in 2007 and 2014, respectively. He is currently a Postdoctoral Researcher working with Prof. Xin Liu at the Department of Computer Science, University of California, Davis. From December 2010 to January 2012, he was a visiting student at UC Davis; from October 2012 to January 2014, he worked as a research intern in the Wireless and Networking Group at Microsoft Research Asia (MSRA); and from May 2014 to August 2014, he was a visiting scholar at the University of Illinois at Urbana-Champaign (UIUC).
R. Srikant (University of Illinois at Urbana-Champaign)
Xin Liu (University of California, Davis)
Chong Jiang (University of Illinois at Urbana-Champaign)