Skip to yearly menu bar Skip to main content

Workshop: OPT 2023: Optimization for Machine Learning

Practical Principled Policy Optimization for Finite MDPs

Michael Lu · Matin Aghaei · Anant Raj · Sharan Vaswani


We consider (stochastic) softmax policy gradient (PG) methods for finite Markov Decision Processes (MDP). While the PG objective is not concave, recent research has used smoothness and gradient dominance to achieve convergence to an optimal policy. However, these results depend on having extensive knowledge of the environment, such as the optimal action or the true mean reward vector, to configure the algorithm parameters. This makes the resulting algorithms impractical in real applications. To alleviate this problem, we propose PG methods that employ an Armijo line-search in the deterministic setting and an exponentially decreasing step-size in the stochastic setting. We demonstrate that these proposed algorithms offer similar theoretical guarantees as previous works but now do not require the knowledge of oracle-like quantities. Furthermore, we apply the similar techniques to develop practical, theoretically sound entropy-regularized methods for both deterministic and stochastic settings. Finally, we empirically compare the proposed methods with previous approaches in single-state MDP environments.

Chat is not available.