Poster
Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning
Zihan Zhang · Yuhang Jiang · Yuan Zhou · Xiangyang Ji
Abstract:
In this paper, we study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches. In the multi-batch reinforcement learning framework, the agent is required to provide a time schedule for policy updates before interaction begins; this setting is particularly suitable for scenarios where the agent suffers extensively from changing the policy adaptively. Given a finite-horizon MDP with $S$ states, $A$ actions and planning horizon $H$, we design a computationally efficient algorithm to achieve near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$\footnote{$\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.} in $K$ episodes using $O(H+\log_2\log_2(K))$ batches with confidence parameter $\delta$. To the best of our knowledge, this is the first $\tilde{O}(\sqrt{SAH^3K})$ regret bound with $O(H+\log_2\log_2(K))$ batch complexity. Meanwhile, we show that to achieve $\tilde{O}(\mathrm{poly}(S,A,H)\sqrt{K})$ regret, the number of batches is at least $\Omega(H/\log_A(K)+\log_2\log_2(K))$, which matches our upper bound up to logarithmic terms. Our technical contributions are two-fold: 1) a near-optimal design scheme to explore over the unlearned states; 2) a computationally efficient algorithm to explore certain directions with an approximated transition model.
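The batch constraint can be made concrete with a small sketch: the agent commits to a list of episode indices before interaction starts and is allowed to recompute its policy only at those pre-committed points. Everything here is an illustrative assumption rather than the paper's algorithm: the schedule shape (a short warm-up phase of roughly $H$ batches followed by a double-exponentially growing grid, which yields on the order of $H + \log_2\log_2 K$ batches) and the names `batch_schedule`, `run_multibatch`, `update_policy`, and `rollout` are all hypothetical.

```python
def batch_schedule(K: int, H: int) -> list[int]:
    """Illustrative pre-committed schedule of batch endpoints (episode
    indices): H short warm-up batches, then endpoints that grow
    double-exponentially, giving O(H + log2 log2 K) batches in total.
    This is a sketch of the *setting*, not the paper's schedule."""
    ends = list(range(1, H + 1))   # warm-up: H batches of one episode each
    t = H + 1
    while t < K:
        ends.append(t)
        t = t * t                  # double-exponential growth in batch size
    if ends[-1] != K:
        ends.append(K)             # final batch ends at the last episode
    return ends

def run_multibatch(K, H, update_policy, rollout):
    """Batch-constrained interaction loop: the policy is frozen within a
    batch and may change only at the pre-committed batch endpoints."""
    ends = set(batch_schedule(K, H))
    policy = update_policy(data=[])        # initial policy from no data
    data = []
    for k in range(1, K + 1):
        data.append(rollout(policy, k))    # one episode with a fixed policy
        if k in ends:                      # adapt only at a batch boundary
            policy = update_policy(data)
    return data
```

The point of the sketch is the information structure: with $K = 1000$ and $H = 3$, the schedule commits to only 7 batch endpoints up front, so the policy is updated far fewer times than the number of episodes.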