Timezone: »

Large-scale reinforcement learning and Markov decision problems
Benjamin Van Roy · Mohammad Ghavamzadeh · Peter Bartlett · Yasin Abbasi Yadkori · Ambuj Tewari

Sat Dec 13 05:30 AM -- 03:30 PM (PST) @ Level 5, room 511 d
Event URL: https://sites.google.com/site/rlmdp2014/ »

Reinforcement learning (RL) and MDPs have been topics of intense research since the middle of the last century. It was shown that Dynamic Programming (DP) [B, H, SB] gives the optimal policy and its computational cost is polynomial in the number of states and actions. This polynomial dependence on the size of the state space limits exact DP to small state problems. Modern applications of RL need to deal with large state problems that arise in many areas ranging from robotics to medical trials to finance.

Solving a large state MDP problem can be computationally intractable in the worst case [PT, CT]. Despite these negative results, several algorithms are shown to perform remarkably well in certain large state problems. Examples are UCT algorithm of Kocsis and Szepesvari [KS] applied in heuristic search and games, Rapidly exploring Random Trees (RRT) of LaValle and Kuffner [LK] in motion planning, policy gradient methods applied in robotics [KP, GE], approximate linear programming (ALP) applied in queuing networks [FV, DFM], and approximate dynamic programming applied in very large scale industrial applications [P]. These algorithms are developed mostly independently in different communities. Despite some similarities, the relation between them and what makes them effective is not very clear.

The computational problem that we discussed above is the focus of optimization and ADP communities. The challenge is to find a computationally efficient way to achieve something that would be easy if we had infinite computational resources. In reinforcement learning, we encounter additional statistical challenges; even if we have infinite computational power, it is not clear how we should best make inferences from observations and select actions that balance between exploration and exploitation.

This workshop will bring researchers from different communities together to discuss and exchange ideas about effective approaches and open problems in large scale MDP problems.


[B] R. Bellman, “Dynamic Programming”, Princeton University Press, 1957.
[H] R. A. Howard, “Dynamic Programming and Markov Processes”, MIT, 1960.
[SB] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction . Bradford Book. MIT Press, 1998.
[PT] C. H. Papadimitriou and J. N. Tsitsiklis. "The Complexity of Markov Decision Processes", Mathematics of Operations Research, Vol. 12, No. 3, 1987, pp. 441-450.
[CT] C.-S. Chow and J. N. Tsitsiklis, "The Complexity of Dynamic Programming", Journal of Complexity, Vol. 5, No. 4, 1989, pp. 466-488.
[GE] M. Ghavamzadeh and Y. Engel, "Bayesian Policy Gradient Algorithms". Neural Information Processing Systems (NIPS), 2006.
[LK] S. M. LaValle, J. J. Kuffner, “Randomized kinodynamic planning”, The International Journal of Robotics Research 20 (5), 378-400, 2001
[KS] L. Kocsis and C. Szepesvari, "Bandit based monte-carlo planning", ECML, 2006.
[KP] J. Kober and J. Peters, "Policy Search for Motor Primitives in Robotics", Machine Learning, 84, 1-2, pp.171-203, 2011.
[FV] D. P. de Farias and B. Van Roy, "The linear programming approach to approximate dynamic programming", Operations Research, 51, 2003.
[DFM] V. V. Desai, V. F. Farias, and C. C. Moallemi, "Approximate dynamic programming via a smoothed linear program", Operations Research , 60(3):655–674, 2012.
[P] W. B. Powell, “Approximate Dynamic Programming: Solving the curses of dimensionality”, John Wiley & Sons, 2007.

Author Information

Benjamin Van Roy (Stanford University)
Mohammad Ghavamzadeh (FaceBook FAIR)
Peter Bartlett (UC Berkeley)
Yasin Abbasi Yadkori (Adobe Research)
Ambuj Tewari (University of Michigan)

More from the Same Authors