NeurIPS 2019 Schedule


Poster

Tue Dec 10 05:30 PM -- 07:30 PM (PST) @ East Exhibition Hall B + C #211

Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function

In Reinforcement Learning and Planning -- Reinforcement Learning

Zihan Zhang · Xiangyang Ji

[Paper] [Slides]

We present an algorithm based on the \emph{Optimism in the Face of Uncertainty} (OFU) principle that efficiently learns in Reinforcement Learning (RL) problems modeled as Markov decision processes (MDPs) with a finite state-action space. By evaluating the state-pair difference of the optimal bias function $h^{*}$, the proposed algorithm achieves a regret bound of $\tilde{O}(\sqrt{SAHT})$\footnote{The symbol $\tilde{O}$ means $O$ with log factors ignored.} for an MDP with $S$ states and $A$ actions, in the case that an upper bound $H$ on the span of $h^{*}$, i.e., $sp(h^{*})$, is known. This result improves on the best previous regret bound of $\tilde{O}(S\sqrt{AHT})$ \citep{fruit2019improved} by a factor of $\sqrt{S}$. Furthermore, this regret bound matches the lower bound of $\Omega(\sqrt{SAHT})$ \citep{jaksch2010near} up to a logarithmic factor. As a consequence, we obtain a near-optimal regret bound of $\tilde{O}(\sqrt{SADT})$ for MDPs with a finite diameter $D$, which matches the lower bound of $\Omega(\sqrt{SADT})$ \citep{jaksch2010near} up to a logarithmic factor.
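To make the size of the improvement concrete, the following is a minimal sketch that compares only the leading terms of the two regret bounds quoted above, with constants and log factors dropped. It is not the paper's algorithm. The function names and the example values of $S$, $A$, $H$, $T$ are hypothetical and chosen purely for illustration. The `span` helper assumes the usual definition $sp(h) = \max_s h(s) - \min_s h(s)$.

```python
import math


def span(h):
    """Span of a bias vector h, assuming sp(h) = max_s h(s) - min_s h(s)."""
    return max(h) - min(h)


def leading_term_this_work(S, A, H, T):
    """Leading term sqrt(S*A*H*T) of the bound stated in the abstract.

    Constants and log factors are dropped, so this is only a scale,
    not an actual regret guarantee.
    """
    return math.sqrt(S * A * H * T)


def leading_term_previous(S, A, H, T):
    """Leading term S*sqrt(A*H*T) of the previous bound cited in the abstract."""
    return S * math.sqrt(A * H * T)


if __name__ == "__main__":
    # Hypothetical problem sizes, for illustration only.
    S, A, H, T = 100, 5, 10, 10**6

    new = leading_term_this_work(S, A, H, T)
    old = leading_term_previous(S, A, H, T)

    # The ratio of the two leading terms is sqrt(S), independent of A, H, T.
    print(f"previous / this work = {old / new:.3f}, sqrt(S) = {math.sqrt(S):.3f}")
```

The ratio of the two leading terms depends only on $S$, which is the $\sqrt{S}$ improvement claimed in the abstract. The diameter-based bound has the same form with $H$ replaced by $D$, which presumes that $D$ is a valid upper bound on $sp(h^{*})$.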