Simultaneously Learning Stochastic and Adversarial Episodic MDPs with Known Transition
Tiancheng Jin, Haipeng Luo
Spotlight presentation: Orals & Spotlights Track 24: Learning Theory
on 2020-12-09T19:50:00-08:00 - 2020-12-09T20:00:00-08:00
on 2020-12-09T19:50:00-08:00 - 2020-12-09T20:00:00-08:00
Toggle Abstract Paper (in Proceedings / .pdf)
Abstract: This work studies the problem of learning episodic Markov Decision Processes with known transition and bandit feedback. We develop the first algorithm with a ``best-of-both-worlds'' guarantee: it achieves O(log T) regret when the losses are stochastic, and simultaneously enjoys worst-case robustness with \tilde{O}(\sqrt{T}) regret even when the losses are adversarial, where T is the number of episodes. More generally, it achieves \tilde{O}(\sqrt{C}) regret in an intermediate setting where the losses are corrupted by a total amount of C. Our algorithm is based on the Follow-the-Regularized-Leader method from Zimin and Neu (2013), with a novel hybrid regularizer inspired by recent works of Zimmert et al. (2019a, 2019b) for the special case of multi-armed bandits. Crucially, our regularizer admits a non-diagonal Hessian with a highly complicated inverse. Analyzing such a regularizer and deriving a particular self-bounding regret guarantee is our key technical contribution and might be of independent interest.