Poster
Large Scale Markov Decision Processes with Changing Rewards
Adrian Rivera Cardoso · He Wang · Huan Xu
East Exhibition Hall B, C #52
Keywords: [ Online Learning ] [ Algorithms ] [ Markov Decision Processes ] [ Reinforcement Learning and Planning ]
Abstract:
We consider Markov Decision Processes (MDPs) where the rewards are unknown and may change in an adversarial manner. We provide an algorithm that achieves a regret bound of O(√(τ(ln|S| + ln|A|)T ln T)), where S is the state space, A is the action space, τ is the mixing time of the MDP, and T is the number of periods. The algorithm's computational complexity is polynomial in |S| and |A|. We then consider a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions. By approximating the state-action occupancy measures with a linear architecture of dimension d ≪ |S|, we propose a modified algorithm whose computational complexity is polynomial in d and independent of |S|. We also prove a regret bound for this modified algorithm, which, to the best of our knowledge, is the first Õ(√T) regret bound in the large-scale MDP setting with adversarially changing rewards.
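The occupancy-measure viewpoint used in the abstract can be illustrated with a small sketch: maintain a state-action occupancy measure, apply a multiplicative (mirror-descent style) update when a new reward arrives, and project back onto the occupancy polytope of the known transition kernel; the induced policy is the row-normalized occupancy measure. The function names, the step size eta, and the SLSQP-based projection below are illustrative assumptions for tabular MDPs, not the paper's algorithm or its large-scale variant.

```python
# Illustrative sketch (not the paper's algorithm): mirror-descent-style updates
# over state-action occupancy measures for a small MDP with known transitions P
# and rewards that may change every round.
import numpy as np
from scipy.optimize import minimize


def kl_project(q, P):
    """Project a positive matrix q (shape [nS, nA]) onto the occupancy-measure
    polytope {mu >= 0, entries sum to 1, flow conservation under P} by
    minimizing KL(mu || q). Only meant for small state/action spaces."""
    nS, nA = q.shape
    q = q.flatten()

    def kl(x):
        x = np.clip(x, 1e-12, None)
        return float(np.sum(x * np.log(x / q)))

    # Flow conservation: for each next state s', sum_a mu(s', a) equals
    # sum_{s, a} P[s, a, s'] * mu(s, a); plus one normalization row.
    A = np.zeros((nS + 1, nS * nA))
    for sp in range(nS):
        for s in range(nS):
            for a in range(nA):
                A[sp, s * nA + a] += P[s, a, sp]
        A[sp, sp * nA:(sp + 1) * nA] -= 1.0
    A[nS, :] = 1.0
    b = np.zeros(nS + 1)
    b[nS] = 1.0

    res = minimize(
        kl,
        x0=np.full(nS * nA, 1.0 / (nS * nA)),
        bounds=[(1e-9, 1.0)] * (nS * nA),
        constraints=[{"type": "eq", "fun": lambda x: A @ x - b}],
        method="SLSQP",
    )
    return res.x.reshape(nS, nA)


def occupancy_mirror_descent(P, rewards, eta=0.5):
    """Update mu <- project(mu * exp(eta * r)) each round; the played policy
    is pi(a | s) = mu(s, a) / sum_a' mu(s, a')."""
    nS, nA, _ = P.shape
    mu = np.full((nS, nA), 1.0 / (nS * nA))
    policies = []
    for r in rewards:  # r has shape [nS, nA]; here it stands in for adversarial rewards
        policies.append(mu / mu.sum(axis=1, keepdims=True))
        mu = kl_project(mu * np.exp(eta * r), P)
    return policies


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    nS, nA, T = 3, 2, 5
    P = rng.random((nS, nA, nS))
    P /= P.sum(axis=2, keepdims=True)  # random transition kernel
    rewards = [rng.random((nS, nA)) for _ in range(T)]
    for t, pi in enumerate(occupancy_mirror_descent(P, rewards)):
        print(f"round {t}: policy\n{pi}")
```

The large-scale version described in the abstract replaces the explicit |S|x|A| occupancy measure with a d-dimensional linear parameterization, which is what removes the dependence on |S|; the exact projection above would not scale and is shown only to make the tabular idea concrete.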