Poster
Policy Optimization for Robust Average Reward MDPs
Zhongchang Sun · Sihong He · Fei Miao · Shaofeng Zou
West Ballroom A-D #6804
Wed 11 Dec 11 a.m. PST — 2 p.m. PST
Abstract:
This paper studies first-order policy optimization for robust average reward Markov decision processes (MDPs). For robust average reward MDPs, the goal is to optimize the worst-case average reward over an uncertainty set of transition kernels. We first derive the sub-gradient of the robust average reward. Based on this sub-gradient, we propose a robust policy mirror descent algorithm. To characterize its iteration complexity, we develop a lower bound on the difference in robust average reward between two policies and show that the robust average reward satisfies a Polyak-Łojasiewicz (PL) condition. We then show that with an increasing step size, our robust policy mirror descent achieves linear convergence of the optimality gap, and with a constant step size, it converges to an $\epsilon$-optimal policy with iteration complexity $\mathcal{O}(1/\epsilon)$. This convergence rate matches the best known rate of policy-based algorithms for robust MDPs. Moreover, our algorithm is the first to converge to the global optimum in polynomial time for robust average reward MDPs. We provide simulation results to demonstrate the performance of our algorithm.
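The abstract does not give implementation details, so the following is a minimal, hypothetical sketch of robust policy mirror descent for a tabular average reward MDP with a finite uncertainty set of transition kernels. The function names (`avg_reward_eval`, `robust_pmd`), the parameters (`eta`, `iters`), the entropic (KL) mirror map, and the unichain assumption are all illustrative assumptions, not the paper's actual algorithmic specification.

```python
import numpy as np

def avg_reward_eval(P, R, pi):
    """Evaluate the gain (average reward) and bias of policy pi under kernel P.

    P: (S, A, S) transition kernel, R: (S, A) rewards, pi: (S, A) policy.
    Assumes the induced Markov chain is unichain, so the gain is a scalar.
    """
    S, A = R.shape
    P_pi = np.einsum('sa,sap->sp', pi, P)   # state transition matrix under pi
    r_pi = np.einsum('sa,sa->s', pi, R)     # expected one-step reward under pi
    # Solve the average-reward evaluation equations
    #   g + h(s) = r_pi(s) + sum_p P_pi(s, p) h(p)
    # with the normalization h(0) = 0, as a linear system in (g, h).
    M = np.zeros((S + 1, S + 1))
    b = np.zeros(S + 1)
    M[:S, 0] = 1.0                # coefficient of the gain g
    M[:S, 1:] = np.eye(S) - P_pi  # coefficients of the bias h
    b[:S] = r_pi
    M[S, 1] = 1.0                 # pin h(0) = 0
    sol = np.linalg.solve(M, b)
    return sol[0], sol[1:]        # gain g, bias h

def robust_pmd(kernels, R, eta=1.0, iters=200):
    """Hypothetical robust policy mirror descent over a finite uncertainty set."""
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)  # start from the uniform policy
    for _ in range(iters):
        # Worst-case kernel for the current policy; the sub-gradient is taken there.
        evals = [avg_reward_eval(P, R, pi) for P in kernels]
        k_star = int(np.argmin([g for g, _ in evals]))
        P_w = kernels[k_star]
        g_w, h_w = evals[k_star]
        # Relative action values under the worst-case kernel.
        Q = R + np.einsum('sap,p->sa', P_w, h_w) - g_w
        # KL (entropic) mirror descent step: multiplicative-weights update.
        pi = pi * np.exp(eta * Q)
        pi /= pi.sum(axis=1, keepdims=True)
    return pi
```

Under these assumptions, calling `robust_pmd(kernels, R)` with `kernels` a list of (S, A, S) arrays forming the uncertainty set and `R` an (S, A) reward matrix returns a policy optimized against the worst-case average reward; the increasing- versus constant-step-size regimes analyzed in the paper would correspond to different schedules for `eta`.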