Poster
Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning
Zihan Zhang · Yuhang Jiang · Yuan Zhou · Xiangyang Ji
Abstract:
In this paper, we study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches. In the multi-batch reinforcement learning framework, the agent is required to provide a time schedule for policy updates before interaction begins; this setting is particularly suitable for scenarios where the agent suffers extensively from changing the policy adaptively. Given a finite-horizon MDP with $S$ states, $A$ actions and planning horizon $H$, we design a computationally efficient algorithm to achieve near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$\footnote{$\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.} in $K$ episodes using $O(H+\log_2\log_2(K))$ batches with confidence parameter $\delta$. To the best of our knowledge, this is the first $\tilde{O}(\sqrt{SAH^3K})$ regret bound with $O(H+\log_2\log_2(K))$ batch complexity. Meanwhile, we show that to achieve $\tilde{O}(\mathrm{poly}(S,A,H)\sqrt{K})$ regret, the number of batches is at least $\Omega(H/\log_A(K)+\log_2\log_2(K))$, which matches our upper bound up to logarithmic terms. Our technical contributions are two-fold: 1) a near-optimal design scheme to explore over the unlearned states; 2) a computationally efficient algorithm to explore certain directions with an approximated transition model.
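The batch constraint can be made concrete with a small sketch: the agent commits to a list of episode indices before interaction starts and is allowed to recompute its policy only at those pre-committed points. Everything here is an illustrative assumption rather than the paper's algorithm: the schedule shape (a short warm-up phase of roughly $H$ batches followed by a double-exponentially growing grid, which yields on the order of $H + \log_2\log_2 K$ batches) and the names `batch_schedule`, `run_multibatch`, `update_policy`, and `rollout` are all hypothetical.

```python
def batch_schedule(K: int, H: int) -> list[int]:
    """Illustrative pre-committed schedule of batch endpoints (episode
    indices): H short warm-up batches, then endpoints that grow
    double-exponentially, giving O(H + log2 log2 K) batches in total.
    This is a sketch of the *setting*, not the paper's schedule."""
    ends = list(range(1, H + 1))   # warm-up: H batches of one episode each
    t = H + 1
    while t < K:
        ends.append(t)
        t = t * t                  # double-exponential growth in batch size
    if ends[-1] != K:
        ends.append(K)             # final batch ends at the last episode
    return ends

def run_multibatch(K, H, update_policy, rollout):
    """Batch-constrained interaction loop: the policy is frozen within a
    batch and may change only at the pre-committed batch endpoints."""
    ends = set(batch_schedule(K, H))
    policy = update_policy(data=[])        # initial policy from no data
    data = []
    for k in range(1, K + 1):
        data.append(rollout(policy, k))    # one episode with a fixed policy
        if k in ends:                      # adapt only at a batch boundary
            policy = update_policy(data)
    return data
```

The point of the sketch is the information structure: with $K = 1000$ and $H = 3$, the schedule commits to only 7 batch endpoints up front, so the policy is updated far fewer times than the number of episodes.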