Timezone: »
Poster
Difference of Convex Functions Programming for Reinforcement Learning
Bilal Piot · Matthieu Geist · Olivier Pietquin
Large Markov Decision Processes (MDPs) are usually solved using Approximate Dynamic Programming (ADP) methods such as Approximate Value Iteration (AVI) or Approximate Policy Iteration (API). The main contribution of this paper is to show that, alternatively, the optimal state-action value function can be estimated using Difference of Convex functions (DC) Programming. To do so, we study the minimization of a norm of the Optimal Bellman Residual (OBR) $T^*Q-Q$, where $T^*$ is the so-called optimal Bellman operator. Controlling this residual allows controlling the distance to the optimal action-value function, and we show that minimizing an empirical norm of the OBR is consistant in the Vapnik sense. Finally, we frame this optimization problem as a DC program. That allows envisioning using the large related literature on DC Programming to address the Reinforcement Leaning (RL) problem.
Author Information
Bilal Piot (CNRS)
Matthieu Geist (SUPELEC)
Olivier Pietquin (Google Research Brain Team)
Related Events (a corresponding poster, oral, or spotlight)
-
2014 Spotlight: Difference of Convex Functions Programming for Reinforcement Learning »
Wed. Dec 10th 08:30 -- 08:50 PM Room Level 2, room 210
More from the Same Authors
-
2012 Poster: Inverse Reinforcement Learning through Structured Classification »
Edouard Klein · Matthieu Geist · BILAL PIOT · Olivier Pietquin