Spotlight
Difference of Convex Functions Programming for Reinforcement Learning
Bilal Piot · Matthieu Geist · Olivier Pietquin
Wed Dec 10 12:30 PM -- 12:50 PM (PST) @ Level 2, room 210
Large Markov Decision Processes (MDPs) are usually solved using Approximate Dynamic Programming (ADP) methods such as Approximate Value Iteration (AVI) or Approximate Policy Iteration (API). The main contribution of this paper is to show that, alternatively, the optimal state-action value function can be estimated using Difference of Convex functions (DC) Programming. To do so, we study the minimization of a norm of the Optimal Bellman Residual (OBR) $T^*Q-Q$, where $T^*$ is the so-called optimal Bellman operator. Controlling this residual allows controlling the distance to the optimal action-value function, and we show that minimizing an empirical norm of the OBR is consistent in the Vapnik sense. Finally, we frame this optimization problem as a DC program, which opens the way to drawing on the large body of literature on DC Programming to address the Reinforcement Learning (RL) problem.
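As a rough illustration of the objective described in the abstract, the sketch below minimizes an empirical squared norm of the OBR on a small synthetic MDP with a one-hot linear parametrization, using plain subgradient steps. The toy MDP, the features, and the simple subgradient scheme are illustrative assumptions only; they are not the DC programming formulation or algorithm studied in the paper.

```python
import numpy as np

# Illustrative sketch only: minimize an empirical squared norm of the
# Optimal Bellman Residual T*Q - Q on a small synthetic MDP, with a
# tabular (one-hot) linear parametrization Q_theta(s, a) = phi(s, a)^T theta.
# The toy MDP and the plain subgradient scheme are illustrative choices,
# not the authors' DC programming algorithm.

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
dim = n_states * n_actions

def phi(s, a):
    """One-hot feature vector for the state-action pair (s, a)."""
    v = np.zeros(dim)
    v[s * n_actions + a] = 1.0
    return v

# Random toy MDP: transition kernel P[s, a] and rewards R[s, a].
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(size=(n_states, n_actions))

# Sample transitions (s, a, r, s') from the toy MDP.
samples = []
for _ in range(300):
    s, a = rng.integers(n_states), rng.integers(n_actions)
    samples.append((s, a, R[s, a], rng.choice(n_states, p=P[s, a])))

def loss_and_grad(theta):
    """Empirical squared OBR and a subgradient with respect to theta.

    With a linear parametrization, each residual
    r + gamma * max_b Q(s', b) - Q(s, a) is a pointwise maximum of affine
    functions minus an affine function of theta; norms of such residuals
    are the kind of objective the abstract proposes to treat with DC
    programming tools.
    """
    loss, grad = 0.0, np.zeros(dim)
    for s, a, r, s2 in samples:
        b_star = max(range(n_actions), key=lambda b: phi(s2, b) @ theta)
        residual = r + gamma * (phi(s2, b_star) @ theta) - phi(s, a) @ theta
        loss += residual ** 2
        # Subgradient of the squared residual, using the maximizing action b_star.
        grad += 2.0 * residual * (gamma * phi(s2, b_star) - phi(s, a))
    return loss / len(samples), grad / len(samples)

theta = np.zeros(dim)
for _ in range(300):
    _, grad = loss_and_grad(theta)
    theta -= 0.5 * grad  # plain subgradient step on the nonconvex (DC) objective

final_loss, _ = loss_and_grad(theta)
print(f"empirical squared OBR after optimization: {final_loss:.4f}")
```

The subgradient loop above is only a baseline way to drive the empirical OBR down; the point of the paper is that the same objective has DC structure, so dedicated DC programming methods can be applied instead.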
Author Information
Bilal Piot (CNRS)
Matthieu Geist (SUPELEC)
Olivier Pietquin (Google Research Brain Team)
Related Events (a corresponding poster, oral, or spotlight)
-
2014 Poster: Difference of Convex Functions Programming for Reinforcement Learning
Thu. Dec 11th 12:00 -- 04:59 AM @ Level 2, room 210D
More from the Same Authors
-
2012 Poster: Inverse Reinforcement Learning through Structured Classification
Edouard Klein · Matthieu Geist · Bilal Piot · Olivier Pietquin