

Poster

Achieving Tractable Minimax Optimal Regret in Average Reward MDPs

Victor Boone · Zihan Zhang

East Exhibit Hall A-C #4711
Wed 11 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract: In recent years, significant attention has been directed towards learning average-reward Markov Decision Processes (MDPs). However, existing algorithms either suffer from sub-optimal regret guarantees or are computationally inefficient. In this paper, we present the first *tractable* algorithm with minimax optimal regret of $\widetilde{\mathrm{O}}\left(\sqrt{\mathrm{sp}(h^*) S A T}\right)$ ($\widetilde{\mathrm{O}}(\cdot)$ hides logarithmic factors of $(S,A,T)$), where $\mathrm{sp}(h^*)$ is the span of the optimal bias function $h^*$, $S\times A$ is the size of the state-action space, and $T$ is the number of learning steps. Remarkably, our algorithm does not require prior information on $\mathrm{sp}(h^*)$. Our algorithm relies on a novel subroutine, **P**rojected **M**itigated **E**xtended **V**alue **I**teration (`PMEVI`), to compute bias-constrained optimal policies efficiently. This subroutine can be applied to various previous algorithms to obtain improved regret bounds.
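The abstract describes `PMEVI` as a subroutine for computing bias-constrained optimal policies. As a loose illustration of the general idea of constraining value-iteration iterates to a bounded-span set, here is a minimal sketch assuming a known transition kernel `P`, reward matrix `R`, and a hypothetical span bound `sp_bound`. This is not the paper's `PMEVI` (which, like extended value iteration, works with confidence sets over the unknown model and includes additional projection and mitigation steps); it only shows the span-clipping idea on a fully known MDP.

```python
import numpy as np

def span(v):
    """Span semi-norm: max(v) - min(v)."""
    return np.max(v) - np.min(v)

def span_constrained_value_iteration(P, R, sp_bound, n_iters=500, tol=1e-6):
    """Illustrative sketch (not the paper's PMEVI): average-reward value
    iteration whose iterates are kept inside {v : sp(v) <= sp_bound} by
    clipping each entry at min(v) + sp_bound.

    P: transition kernel, shape (S, A, S)
    R: expected rewards, shape (S, A)
    sp_bound: assumed upper bound on the span of the optimal bias h*
    """
    S, A, _ = P.shape
    v = np.zeros(S)
    for _ in range(n_iters):
        # One Bellman backup for the average-reward criterion.
        q = R + np.einsum("sat,t->sa", P, v)        # shape (S, A)
        v_new = q.max(axis=1)
        # Span constraint: clip values so that sp(v_new) <= sp_bound.
        v_new = np.minimum(v_new, v_new.min() + sp_bound)
        # Stop when the relative values stabilize (span of the difference).
        if span(v_new - v) < tol:
            v = v_new
            break
        v = v_new
    # Greedy policy with respect to the final (span-constrained) values.
    policy = (R + np.einsum("sat,t->sa", P, v)).argmax(axis=1)
    return v, policy
```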
