

Poster

Large Scale Markov Decision Processes with Changing Rewards

Adrian Rivera Cardoso · He Wang · Huan Xu

East Exhibition Hall B, C #52

Keywords: [ Online Learning ] [ Algorithms ] [ Markov Decision Processes ] [ Reinforcement Learning and Planning ]


Abstract: We consider Markov Decision Processes (MDPs) where the rewards are unknown and may change in an adversarial manner. We provide an algorithm that achieves a regret bound of O(√(τ(ln|S| + ln|A|)T) ln(T)), where S is the state space, A is the action space, τ is the mixing time of the MDP, and T is the number of periods. The algorithm's computational complexity is polynomial in |S| and |A|. We then consider a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions. By approximating the state-action occupancy measures with a linear architecture of dimension d ≪ |S|, we propose a modified algorithm with a computational complexity polynomial in d and independent of |S|. We also prove a regret bound for this modified algorithm, which, to the best of our knowledge, is the first Õ(√T) regret bound in the large-scale MDP setting with adversarially changing rewards.
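
To make the occupancy-measure viewpoint in the abstract concrete, here is a minimal sketch (not the paper's algorithm): it builds the occupancy-measure polytope of a small average-reward MDP and solves for the best stationary occupancy measure against a single fixed reward vector via a linear program. The toy MDP, its sizes, and the use of SciPy's linprog are illustrative assumptions of this sketch; the paper's online algorithm instead updates the occupancy measure across periods as the adversarial rewards change, and its large-scale variant restricts the occupancy measure to a d-dimensional linear architecture (e.g., μ ≈ Φθ for some feature matrix Φ), as described in the abstract.

```python
# Minimal sketch: the occupancy-measure LP view of an average-reward MDP.
# This is NOT the paper's online algorithm; it only illustrates the polytope
# over which occupancy-measure-based methods operate.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3          # toy |S|, |A| (hypothetical sizes)

# Random transition kernel P[s, a, s'] and one reward vector r[s, a].
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(size=(n_states, n_actions))

# Decision variable: occupancy measure mu(s, a), flattened to length |S||A|.
dim = n_states * n_actions

# Equality constraints defining the occupancy-measure polytope:
# (1) flow conservation: sum_a mu(s', a) = sum_{s,a} mu(s, a) P(s' | s, a) for every s'
# (2) normalization:     sum_{s,a} mu(s, a) = 1
A_eq = np.zeros((n_states + 1, dim))
for s_next in range(n_states):
    for s in range(n_states):
        for a in range(n_actions):
            A_eq[s_next, s * n_actions + a] -= P[s, a, s_next]
    for a in range(n_actions):
        A_eq[s_next, s_next * n_actions + a] += 1.0
A_eq[-1, :] = 1.0
b_eq = np.zeros(n_states + 1)
b_eq[-1] = 1.0

# Maximize expected average reward <mu, r>  <=>  minimize -<mu, r>.
res = linprog(c=-r.flatten(), A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * dim, method="highs")
mu = res.x.reshape(n_states, n_actions)

# Recover the induced stationary policy pi(a | s) = mu(s, a) / sum_a mu(s, a).
pi = mu / np.maximum(mu.sum(axis=1, keepdims=True), 1e-12)
print("average reward of best stationary policy:", -res.fun)
print("policy:\n", np.round(pi, 3))
```

In the adversarial-reward setting of the paper, r changes every period, so rather than re-solving this LP the learner updates μ online (keeping it inside the same polytope) and the regret is measured against the best fixed occupancy measure in hindsight.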
