

Plenary Talk in Workshop: The Optimization Foundations of Reinforcement Learning

On the Convergence of GTD(λ) with General λ

Huizhen Yu

2019 Plenary Talk

Abstract: Gradient temporal-difference (GTD) algorithms are an important family of extensions of the standard TD algorithm, overcoming the stability issues of off-policy learning with function approximation by minimizing the projected Bellman error with stochastic gradient descent (SGD). In this talk, we consider GTD(λ) for policy evaluation in the case of linear function approximation and finite-space Markov decision processes (MDPs) with discounted rewards. GTD(λ) differs from a typical SGD algorithm for minimizing a fixed objective function in the following way. In an MDP, each stationary policy is associated with a family of Bellman equations. By selecting the values of the eligibility-trace parameters λ, we determine the Bellman operator that defines the objective function, and this choice influences both the approximation error and the learning behavior of the resulting algorithm. We first describe a dynamic scheme to judiciously set λ in a history-dependent way. This scheme is based on the idea of randomized stopping times and generalized Bellman equations, and it is useful for balancing the bias-variance tradeoff in off-policy learning. We then present asymptotic convergence results for several variations of GTD(λ), including saddle-point and mirror-descent TD variants, for different choices of λ: constant, state-dependent, and history-dependent. These convergence results are obtained by combining special properties of the joint state and eligibility-trace process in TD learning with ergodic theory for weak Feller Markov chains, mean-ODE-based proof methods, and convex optimization theory. Our results not only resolve long-standing open questions regarding the convergence of GTD(λ) but also provide the basis for using a broader class of generalized Bellman operators for approximate policy evaluation with linear TD methods.
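
To make the setting concrete, below is a minimal, illustrative sketch of a GTD(λ)-style two-time-scale update with linear features, written in NumPy. It is not the algorithm analyzed in the talk: the exact form of the gradient-correction term, the handling of importance-sampling ratios, and all variable names (gtd_lambda_step, theta, w, e, rho, lam) are assumptions made for illustration. It only shows the general shape of such updates, where the eligibility-trace parameter λ may be constant, state-dependent, or history-dependent as discussed in the abstract.

```python
import numpy as np

def gtd_lambda_step(theta, w, e, phi, phi_next, reward, rho,
                    gamma, lam, lam_next, alpha, beta):
    """One GTD(lambda)-style update with linear features (illustrative sketch).

    theta    : main weights; the value estimate of a state s is theta @ phi(s)
    w        : auxiliary weights updated on a second time scale
    e        : eligibility trace carried between steps
    phi, phi_next : feature vectors of the current and next state
    rho      : importance-sampling ratio (1.0 in the on-policy case)
    gamma    : discount factor
    lam, lam_next : eligibility-trace parameters; these may vary with the
                    state or the history, as in the talk
    alpha, beta   : step sizes for theta and w
    """
    # Accumulate the (off-policy) eligibility trace.
    e = rho * (gamma * lam * e + phi)
    # Temporal-difference error under the current value estimate.
    delta = reward + gamma * theta @ phi_next - theta @ phi
    # Main weights: TD update along the trace plus a gradient-correction term.
    theta = theta + alpha * (delta * e
                             - gamma * (1.0 - lam_next) * (w @ e) * phi_next)
    # Auxiliary weights: track a least-squares estimate of the expected update.
    w = w + beta * (delta * e - (w @ phi) * phi)
    return theta, w, e

# Tiny usage example with random features (illustration only, not real MDP data).
rng = np.random.default_rng(0)
d = 4
theta, w, e = np.zeros(d), np.zeros(d), np.zeros(d)
for _ in range(5):
    phi, phi_next = rng.random(d), rng.random(d)
    theta, w, e = gtd_lambda_step(theta, w, e, phi, phi_next,
                                  reward=rng.random(), rho=1.0,
                                  gamma=0.9, lam=0.8, lam_next=0.8,
                                  alpha=0.05, beta=0.1)
```

The two step sizes reflect the two-time-scale structure common to GTD methods: the auxiliary weights w are typically updated faster than the main weights theta, which is one of the ingredients behind the mean-ODE-based convergence analyses mentioned above.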