Bias no more: high-probability data-dependent regret bounds for adversarial bandits and MDPs
Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, Mengxiao Zhang
Oral presentation: Orals & Spotlights Track 24: Learning Theory
on 2020-12-09T18:15:00-08:00 - 2020-12-09T18:30:00-08:00
Abstract: We develop a new approach to obtaining high-probability regret bounds for online learning with bandit feedback against an adaptive adversary. While existing approaches all require carefully constructing optimistic and biased loss estimators, our approach uses standard unbiased estimators and relies on a simple increasing learning rate schedule, together with the help of logarithmically homogeneous self-concordant barriers and a strengthened Freedman's inequality. Besides its simplicity, our approach enjoys several advantages. First, the obtained high-probability regret bounds are data-dependent and can be much smaller than the worst-case bounds, which resolves an open problem raised by Neu (2015). Second, resolving another open problem of Bartlett et al. (2008) and Abernethy and Rakhlin (2009), our approach leads to the first general and efficient algorithm with a high-probability regret bound for adversarial linear bandits, while previous methods are either inefficient or only applicable to specific action sets. Finally, our approach can also be applied to learning adversarial Markov decision processes and provides the first algorithm with a high-probability small-loss bound for this problem.
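To make the two key ingredients named in the abstract concrete, here is a minimal sketch of (a) the standard unbiased importance-weighted loss estimator and (b) an exponential-weights bandit learner driven by a multiplicatively increasing learning rate. This is only an illustrative simplification: the paper's actual algorithm uses FTRL with logarithmically homogeneous self-concordant barriers, and the specific schedule parameters (`eta0`, `growth`) below are hypothetical choices, not taken from the paper.

```python
import numpy as np

def iw_estimate(losses, probs, chosen):
    """Standard unbiased importance-weighted estimator:
    hat_ell[a] = ell[a] / p[a] if arm a was played, else 0.
    Unbiased because E[hat_ell[a]] = p[a] * ell[a] / p[a] = ell[a]."""
    est = np.zeros_like(losses, dtype=float)
    est[chosen] = losses[chosen] / probs[chosen]
    return est

def exp3_increasing_lr(loss_matrix, eta0=0.1, growth=1.01, rng=None):
    """EXP3-style learner whose learning rate grows multiplicatively
    each round (a stand-in for the paper's increasing schedule)."""
    rng = np.random.default_rng(0) if rng is None else rng
    T, K = loss_matrix.shape
    cum_est = np.zeros(K)   # cumulative estimated losses
    eta = eta0
    total_loss = 0.0
    for t in range(T):
        # Exponential weights over cumulative estimates (shifted for stability).
        w = np.exp(-eta * (cum_est - cum_est.min()))
        p = w / w.sum()
        a = rng.choice(K, p=p)
        total_loss += loss_matrix[t, a]
        cum_est += iw_estimate(loss_matrix[t], p, a)
        eta *= growth       # increasing learning rate schedule
    return total_loss
```

The unbiasedness of `iw_estimate` can be checked exactly: averaging the estimator over the arm distribution recovers the true loss vector, with no optimistic bias term added.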