

Plenary Talk in Workshop: The Optimization Foundations of Reinforcement Learning

On the Convergence of GTD(λ) with General λ

Huizhen Yu

2019 Plenary Talk

Abstract: Gradient temporal-difference (GTD) algorithms are an important family of extensions of the standard TD algorithm, overcoming the stability issues of off-policy learning with function approximation by minimizing the projected Bellman error with stochastic gradient descent (SGD). In this talk, we consider GTD(λ) for policy evaluation in the case of linear function approximation and finite-space Markov decision processes (MDPs) with discounted rewards. GTD(λ) differs from a typical SGD algorithm for minimizing a fixed objective function in the following way. In an MDP, each stationary policy is associated with a family of Bellman equations. By selecting the values of the eligibility-trace parameters λ, we determine the Bellman operator that defines the objective function, and this choice influences both the approximation error and the learning behavior of the resulting algorithm. We first describe a dynamic scheme to judiciously set λ in a history-dependent way. This scheme is based on the idea of randomized stopping times and generalized Bellman equations, and it is useful for balancing the bias-variance tradeoff in off-policy learning. We then present asymptotic convergence results for several variations of GTD(λ), including saddle-point and mirror-descent TD variants, for different choices of λ: constant, state-dependent, and history-dependent. These convergence results are obtained by combining special properties of the joint state and eligibility-trace process in TD learning with ergodic theory for weak Feller Markov chains, mean-ODE-based proof methods, and convex optimization theory. Our results not only resolve long-standing open questions regarding the convergence of GTD(λ) but also provide the basis for using a broader class of generalized Bellman operators for approximate policy evaluation with linear TD methods.
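
To make the setting concrete, below is a minimal, illustrative sketch of a GTD(λ)-style two-time-scale update with linear features, written in NumPy. It is not the algorithm analyzed in the talk: the exact form of the gradient-correction term, the handling of importance-sampling ratios, and all variable names (gtd_lambda_step, theta, w, e, rho, lam) are assumptions made for illustration. It only shows the general shape of such updates, where the eligibility-trace parameter λ may be constant, state-dependent, or history-dependent as discussed in the abstract.

```python
import numpy as np

def gtd_lambda_step(theta, w, e, phi, phi_next, reward, rho,
                    gamma, lam, lam_next, alpha, beta):
    """One GTD(lambda)-style update with linear features (illustrative sketch).

    theta    : main weights; the value estimate of a state s is theta @ phi(s)
    w        : auxiliary weights updated on a second time scale
    e        : eligibility trace carried between steps
    phi, phi_next : feature vectors of the current and next state
    rho      : importance-sampling ratio (1.0 in the on-policy case)
    gamma    : discount factor
    lam, lam_next : eligibility-trace parameters; these may vary with the
                    state or the history, as in the talk
    alpha, beta   : step sizes for theta and w
    """
    # Accumulate the (off-policy) eligibility trace.
    e = rho * (gamma * lam * e + phi)
    # Temporal-difference error under the current value estimate.
    delta = reward + gamma * theta @ phi_next - theta @ phi
    # Main weights: TD update along the trace plus a gradient-correction term.
    theta = theta + alpha * (delta * e
                             - gamma * (1.0 - lam_next) * (w @ e) * phi_next)
    # Auxiliary weights: track a least-squares estimate of the expected update.
    w = w + beta * (delta * e - (w @ phi) * phi)
    return theta, w, e

# Tiny usage example with random features (illustration only, not real MDP data).
rng = np.random.default_rng(0)
d = 4
theta, w, e = np.zeros(d), np.zeros(d), np.zeros(d)
for _ in range(5):
    phi, phi_next = rng.random(d), rng.random(d)
    theta, w, e = gtd_lambda_step(theta, w, e, phi, phi_next,
                                  reward=rng.random(), rho=1.0,
                                  gamma=0.9, lam=0.8, lam_next=0.8,
                                  alpha=0.05, beta=0.1)
```

The two step sizes reflect the two-time-scale structure common to GTD methods: the auxiliary weights w are typically updated faster than the main weights theta, which is one of the ingredients behind the mean-ODE-based convergence analyses mentioned above.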