Poster
Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
Bruno Scherrer
Thu Dec 05 07:00 PM – 11:59 PM (PST) @ Harrah's Special Events Center, 2nd Floor
Given a Markov Decision Process (MDP) with $n$ states and $m$ actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal $\gamma$-discounted policy. We consider two variations of PI: Howard's PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most $ O \left( \frac{ n m}{1-\gamma} \log \left( \frac{1}{1-\gamma} \right)\right) $ iterations, improving by a factor $O(\log n)$ a result by Hansen et al. (2013), while Simplex-PI terminates after at most $ O \left( \frac{n^2 m}{1-\gamma} \log \left( \frac{1}{1-\gamma} \right)\right) $ iterations, improving by a factor $O(\log n)$ a result by Ye (2011). Under some structural assumptions on the MDP, we then consider bounds that are independent of the discount factor~$\gamma$: given a measure of the maximal transient time $\tau_t$ and the maximal time $\tau_r$ to revisit states in recurrent classes under all policies, we show that Simplex-PI terminates after at most $ \tilde O \left( n^3 m^2 \tau_t \tau_r \right) $ iterations. This generalizes a recent result for deterministic MDPs by Post & Ye (2012), in which $\tau_t \le n$ and $\tau_r \le n$. We explain why similar results seem hard to derive for Howard's PI. Finally, under the additional (restrictive) assumption that the state space is partitioned into two sets, respectively states that are transient and recurrent for all policies, we show that Simplex-PI and Howard's PI terminate after at most $ \tilde O(nm (\tau_t+\tau_r))$ iterations.
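The two PI variants compared in the abstract differ only in their greedy-update rule. Below is a minimal NumPy sketch (not the paper's code, and with hypothetical function names) of exact policy iteration on a tabular MDP, switching between the two rules: Howard's PI updates the action in every state with a positive advantage, while Simplex-PI updates only the single state of maximal advantage.

```python
import numpy as np

def policy_eval(P, R, policy, gamma):
    """Exact evaluation: solve v = r_pi + gamma * P_pi v for the current policy."""
    n = P.shape[1]
    P_pi = P[policy, np.arange(n)]   # (n, n): row s is P[policy[s], s, :]
    r_pi = R[policy, np.arange(n)]   # (n,): reward of the chosen action in each state
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def policy_iteration(P, R, gamma, variant="howard"):
    """P: (m, n, n) transition tensor, R: (m, n) rewards; returns (policy, #iterations)."""
    m, n, _ = P.shape
    policy = np.zeros(n, dtype=int)
    for it in range(1, 1000):
        v = policy_eval(P, R, policy, gamma)
        q = R + gamma * (P @ v)      # (m, n) action-values under v
        adv = q - v                  # advantage of each action over the current policy
        if adv.max() <= 1e-10:       # no positive advantage anywhere: policy is optimal
            return policy, it
        if variant == "howard":
            # Howard's PI: switch the action in *every* state with a positive advantage
            improving = adv.max(axis=0) > 1e-10
            policy[improving] = np.argmax(q[:, improving], axis=0)
        else:
            # Simplex-PI: switch only in the single state with *maximal* advantage
            a, s = np.unravel_index(np.argmax(adv), adv.shape)
            policy[s] = a
    return policy, it
```

On a toy 2-state, 2-action MDP (action 0 stays in place with reward equal to the state index, action 1 moves to the other state with reward 0), both variants recover the policy "move to state 1, then stay" in a couple of iterations; the abstract's bounds concern how the worst-case iteration counts of these two rules scale with $n$, $m$, and $1/(1-\gamma)$.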
Author Information
Bruno Scherrer (INRIA)
More from the Same Authors

2013 Poster: Approximate Dynamic Programming Finally Performs Well in the Game of Tetris »
Victor Gabillon · Mohammad Ghavamzadeh · Bruno Scherrer 
2012 Poster: On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes »
Bruno Scherrer · Boris Lesner 
2012 Oral: On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes »
Bruno Scherrer · Boris Lesner 
2008 Poster: Biasing Approximate Dynamic Programming with a Lower Discount Factor »
Marek Petrik · Bruno Scherrer