Workshop
OPT 2021: Optimization for Machine Learning
Courtney Paquette · Quanquan Gu · Oliver Hinder · Katya Scheinberg · Sebastian Stich · Martin Takac
Mon 13 Dec, 3:15 a.m. PST
OPT 2021 will bring experts in optimization to share their perspectives while leveraging crossover experts in ML to share their views and recent advances. OPT 2021 honors this tradition of bringing together people from optimization and from ML in order to promote and generate new interactions between the two communities.
To foster the spirit of innovation and collaboration, a goal of this workshop, OPT 2021 will focus the contributed talks on research in “Beyond Worstcase Complexity”. Classical optimization analyses measure the performances of algorithms based on (1). the computation cost and (2). convergence for any input into the algorithm. Yet algorithms with worse traditional complexity (e.g. SGD and its variants, ADAM, etc), are increasingly popular in practice for training deep neural networks and other ML tasks. This leads to questions such as what are good modeling assumptions for ML problems to measure an optimization algorithm’s success and how can we leverage these to better understand the performances of known (and new) algorithms. For instance, typical optimization problems in ML may be better conditioned than their worstcase counterparts in part because the problems are highly structured and/or highdimensional (large number of features/samples). One could leverage this observation to design algorithms with better “averagecase” complexity. Moreover, increasing research seems to indicate an intimate connection between the optimization algorithm and how well it performs on the test data (generalization). This new area of research in ML and its deep ties to optimization warrants a necessary discussion between the two communities. Specifically, we aim to continue the discussion on the precise meaning of generalization and averagecase complexity and to formalize what this means for optimization algorithms. By bringing together experts in both fields, OPT 2021 will foster insightful discussions around these topics and more.
Schedule
Mon 3:15 a.m.  3:50 a.m.

Welcome event (gather.town)
(
Social event/Break
)
>
link
Please join us in gather.town (see link below). 
🔗 
Mon 3:58 a.m.  4:00 a.m.

Opening Remarks to Session 1
(
Organizer intro
)
>
SlidesLive Video 
Sebastian Stich 🔗 
Mon 4:00 a.m.  4:25 a.m.

Deep Learning: Success, Failure, and the Border between them, Shai ShalevShwartz
(
Plenary Speaker
)
>
SlidesLive Video Abstract: Deep learning revolutionized computer vision, speech recognition, natural language understanding, and more. However, from the theoretical perspective we know that training neural networks is computationally hard and one can construct distributions on which deep learning fails. The talk will propose several parameters of distributions that can move them from being easytotrain to being hardtotrain. 
Shai ShalevShwartz 🔗 
Mon 4:25 a.m.  4:30 a.m.

Q&A with Shai ShalevShwartz
(
Q&A
)
>

Shai ShalevShwartz 🔗 
Mon 4:30 a.m.  4:55 a.m.

Learning with Strange Gradients, Martin Jaggi
(
Plenary Speaker
)
>
SlidesLive Video Abstract: Gradient methods form the foundation of current machine learning. A vast literature covers the use of stochastic gradients being simple unbiased estimators of the full gradient of our objective. In this talk, we discuss four applications motivated from practical machine learning, where this key assumption is violated, and show new ways to cope with gradients which are only loosely related to the original objective. We demonstrate that algorithms with rigorous convergence guarantees can still be obtained in such settings, for

Martin Jaggi 🔗 
Mon 4:55 a.m.  5:00 a.m.

Q&A with Martin Jaggi
(
Q&A
)
>

Martin Jaggi 🔗 
Mon 5:00 a.m.  5:30 a.m.

Contributed Talks in Session 1 (Zoom)
(
Orals and spotlights
)
>
SlidesLive Video Oral (10 min)
Spotlights (5 min)
There will be a Q&A in the last 5 minutes for all speakers. Abstracts for the talks are below the schedule. 
Sebastian Stich · Futong Liu · Abdurakhmon Sadiev · Frederik Benzing · Simon Roburin 🔗 
Mon 5:30 a.m.  6:30 a.m.

Poster Session 1 (gather.town)
(
Poster session
)
>
link
Please join us in gather.town (see link above). To see the abstracts of the posters presented in this session, please see below the schedule. Authors/papers presenting posters in gather.town for this session:

Hamed Jalali · Robert Hönig · Maximus Mutschler · Manuel Madeira · Abdurakhmon Sadiev · Egor Shulgin · Alasdair Paren · Pascal Esser · Simon Roburin · Julius Kunze · Agnieszka Słowik · Frederik Benzing · Futong Liu · Hongyi Li · Ryotaro Mitsuboshi · Grigory Malinovsky · Jayadev Naram · Zhize Li · Igor Sokolov · Sharan Vaswani

Mon 6:30 a.m.  6:58 a.m.

Break (gather.town)
(
Break
)
>
link
Please join us in gather.town (see link below), 
🔗 
Mon 6:58 a.m.  7:00 a.m.

Opening Remarks to Session 2
(
Organizer intro
)
>

Courtney Paquette 🔗 
Mon 7:00 a.m.  7:25 a.m.

The global optimization of functions with low effective dimension  better than worstcase?, Coralia Cartis
(
Plenary Speaker
)
>
SlidesLive Video Authors: Coralia Cartis, Estelle Massart and Adilet Otemissov (Mathematical Institute, University of Oxford and Alan Turing Institute, London) Abstract: We investigate the unconstrained and constrained global optimization of functions with low effective dimension, which are constant along an (unknown) linear subspace and only vary over the effective (complement) subspace. Known also as multiridge or active subspace objectives, these functions appear frequently in applications, such as in those involving some complex engineering simulations, overly parametrised models, and more. We aim to implicitly explore the intrinsic low dimensionality of the constrained/unconstrained landscape using (feasible) random embeddings, in order to understand and improve the scalability of algorithms for the global optimization of these specialstructure problems. We show that for these functions, the convergence and complexity of a random embedding algorithmic framework can potentially be exponentially better than if no special structure was present, and furthermore, under certain assumptions/problems, they are independent of the ambient dimension. Our numerical experiments on functions with low effective dimension illustrate the improved scalability of the framework when coupled with stateoftheart global — and even local — optimization solvers for the subproblems. Our analysis uses tools from random matrix theory, stochastic optimization and conic integral geometry. Papers:

Coralia Cartis 🔗 
Mon 7:25 a.m.  7:30 a.m.

Q&A with Coralia Cartis
(
Q&A
)
>

Coralia Cartis 🔗 
Mon 7:30 a.m.  7:55 a.m.

NonEuclidean Differentially Private Stochastic Convex Optimization, Cristóbal Guzmán
(
Plenary Speaker
)
>
SlidesLive Video
Abstract: Ensuring privacy of users' data in machine learning models has become a crucial requirement in multiple domains. In this respect, differential privacy (DP) is the gold standard, due to its general and rigorous privacy guarantees, as well as its high composability. For the particular case of stochastic convex optimization (SCO), recent efforts have established optimal rates for the excess risk under differential privacy in Euclidean setups. These bounds suffer a polynomial degradation of accuracy with respect to the dimension, which limits their applicability in highdimensional settings. In this talk, I will present nearlydimension independent rates on the excess risk for DPSCO in the $\ell_1$ setup, as well as the investigation of more general $\ell_p$ setups, where $1\leq p\leq \infty$. Based on joint work with Raef Bassily and Anupama Nandi.

Cristóbal Guzmán 🔗 
Mon 7:55 a.m.  8:00 a.m.

Q&A with Cristóbal Guzmán
(
Q&A
)
>

Cristóbal Guzmán 🔗 
Mon 8:00 a.m.  8:30 a.m.

Contributed Talks in Session 2 (Zoom)
(
Orals and spotlights
)
>
SlidesLive Video Oral (10 min)
Spotlights (5 min)
There will be a Q&A in the last 5 minutes for all speakers. Abstracts for the talks are below the schedule. 
Courtney Paquette · Chris Junchi Li · Jeffery Kline · Junhyung Lyle Kim · Pascal Esser 🔗 
Mon 8:30 a.m.  9:58 a.m.

Break ( Break ) > link  🔗 
Mon 9:58 a.m.  10:00 a.m.

Opening Remarks to Session 3
(
Organizer intro
)
>

Oliver Hinder 🔗 
Mon 10:00 a.m.  10:25 a.m.

Avoiding saddle points in nonsmooth optimization, Damek Davis
(
Plenary Speaker
)
>
SlidesLive Video Abstract: We introduce a geometrically transparent strict saddle property for nonsmooth functions. When present, this property guarantees that simple randomly initialized proximal algorithms on weakly convex problems converge only to local minimizers. We argue that the strict saddle property may be a realistic assumption in applications since it provably holds for generic semialgebraic optimization problems. Finally, we close the talk with an extension of the result to "perturbed" subgradient methods. 
Damek Davis 🔗 
Mon 10:25 a.m.  10:30 a.m.

Q&A with Damek Davis
(
Q&A
)
>

Damek Davis 🔗 
Mon 10:30 a.m.  10:55 a.m.

Faster Empirical Risk Minimization, Jelena Diakonikolas
(
Plenary Speaker
)
>
SlidesLive Video Abstract: Empirical Risk Minimization (ERM) problems are central to machine learning, and their efficient optimization has been studied from different perspectives, often taking advantage of the finite sum structure present in typical problem formulations. In particular, tight oracle complexity bounds have been established under fairly general assumptions about the loss functions. In this talk, I will present a rather surprising and general result that takes advantage of the separability of nonsmooth convex loss functions with efficiently computable proximal operators  such as, e.g., the hinge loss and the sum of absolute errors  to obtain an algorithm that exhibits significantly lower complexity than what is predicted by the lower bounds for general nonsmooth convex losses. I will then talk about how this result can be further improved for problems that can be stated in a form close to a linear program and discuss how these results relate to robust empirical risk minimization. The talk is based on joint results with Chaobing Song, Eric Lin, and Stephen Wright. 
Jelena Diakonikolas 🔗 
Mon 10:55 a.m.  11:00 a.m.

Q&A with Jelena Diakonikolas
(
Q&A
)
>

Jelena Diakonikolas 🔗 
Mon 11:00 a.m.  11:30 a.m.

Contributed talks in Session 3 (Zoom)
(
Orals and spotlights
)
>
SlidesLive Video Oral (10 min)
Spotlights (5 min)
There will be a Q&A in the last 5 minutes for all speakers. Abstracts for the talks are below the schedule. 
Oliver Hinder · Wenhao Zhan · Akhilesh Soni · Grigory Malinovsky · Boyue Li 🔗 
Mon 11:30 a.m.  12:30 p.m.

Poster Session 2 (gather.town)
(
Poster session
)
>
link
Please join us in gather.town (see link above). To see the abstracts of the posters presented in this session, please see below the schedule. Authors/papers presenting posters in gather.town for this session:

Wenjie Li · Akhilesh Soni · Jinwuk Seok · Jianhao Ma · Jeffery Kline · Mathieu Tuli · Miaolan Xie · Robert Gower · Quanqi Hu · Matteo Cacciola · Yuanlu Bai · Boyue Li · Wenhao Zhan · Shentong Mo · Junhyung Lyle Kim · Sajad Fathi Hafshejani · Chris Junchi Li · Zhishuai Guo · Harshvardhan Harshvardhan · Neha Wadia · Tatjana Chavdarova · Difan Zou · Zixiang Chen · Aman Gupta · Jacques Chen · Betty Shea · Benoit Dherin · Aleksandr Beznosikov

Mon 12:30 p.m.  12:58 p.m.

Break (gather.town)
(
Break
)
>
link
Please join us in gather.town (see link below). 
🔗 
Mon 12:58 p.m.  1:00 p.m.

Opening Remarks to Session 4
(
Organizer intro
)
>

Quanquan Gu 🔗 
Mon 1:00 p.m.  1:25 p.m.

Online Learning via Linear Programming, Yinyu Ye
(
Plenary Speaker
)
>
SlidesLive Video Abstract: A natural optimization model that formulates many online resource allocations and decisionmaking problems is online linear programming (OLP) where the constraint matrix, along with the objective coefficients and decision variables, are revealed and decided column by column sequentially. We review the near optimal algorithms and theories for solving this surprisingly general class of online problems under the assumption of random order of arrival and/or iid distributions of the input data. Then we present few recent applications of the model/algorithm, including a fast online algorithm as a presolver for solving largescale offline (binary) LPs, an interiorpoint online algorithm to address “fairness” for resource allocation, and a provable logarithmic regret bound for the Bandits with Knapsacks (BwK) problem. 
Yinyu Ye 🔗 
Mon 1:25 p.m.  1:30 p.m.

Q&A with Yinyu Ye
(
Q&A
)
>

Yinyu Ye 🔗 
Mon 1:30 p.m.  1:55 p.m.

Putting Randomized Matrix Algorithms in LAPACK, and Connections with Secondorder Stochastic Optimization, Michael Mahoney
(
Plenary Speaker
)
>
SlidesLive Video Abstract: LAPACK (Linear Algebra PACKage) is a widelyused highquality software library for numerical linear algebra that provides routines for solving systems of linear equations, linear least squares, eigenvalue problems, singular value decomposition, and matrix factorizations such as LU, QR, Cholesky and Schur decomposition. Randomized Numerical Linear Algebra (RandNLA) is an interdisciplinary research area that exploits randomization as a computational resource to develop improved algorithms for largescale linear algebra problems. In addition to providing some of the best linear algebra algorithms (in worstcase theory, numerical implementations, and nontrivial implicit statistical properties and machine learning implementations), RandNLA techniques are the basis for the best stochastic secondorder optimization algorithms (such as SubSampled Newton's methods, Iterative Hessian Sketch, and NewtonLESS). The time has come to put RandNLA methods into the next generation of LAPACK, and we have begun to do that. We will present our high level plan to introduce RandNLA algorithms into LAPACK. The RandLAPACK library will implement state of the art randomized algorithms for problems such as lowrank approximation and leastsquares, but its full scope will be larger than linear algebra per se. It will include certain higherlevel primitives in optimization and machine learning that require judicious use of RandNLA. We will describe building blocks, the modular design that will help RandLAPACK evolve with the field of RandNLA (as well as the needs of machine learning, scientific computing, and other users) over time, as well as connections and implications for machine learning optimization. Joint work with Riley Murray, Jim Demmel, and others. 
Michael Mahoney 🔗 
Mon 1:55 p.m.  2:00 p.m.

Q&A with Michael Mahoney
(
Q&A
)
>

Michael Mahoney 🔗 
Mon 2:00 p.m.  2:30 p.m.

Contributed talks in Session 4 (Zoom)
(
Orals and spotlights
)
>
SlidesLive Video Oral (10 min)
Spotlights (5 min)
There will be a Q&A in the last 5 minutes for all speakers. Abstracts for the talks are below the schedule. 
Quanquan Gu · Agnieszka Słowik · Jacques Chen · Neha Wadia · Difan Zou 🔗 
Mon 2:30 p.m.  2:35 p.m.

Closing remarks
(
Organizer closing
)
>

Courtney Paquette 🔗 


Integer Programming Approaches To Subspace Clustering With Missing Data
(
Poster
)
>
In the \emph{Subspace Clustering with Missing Data} (SCMD) problem, we are given a collection of $n$ data points, arranged into columns of a matrix $X \in \mathbb{R}^{d \times n}$, and each of the data points is observed only on a subset of its coordinates. The data points are assumed to be concentrated near a union of lowdimensional subspaces. The goal of SCMD is to cluster the vectors of the data matrix $X$ as per their subspace membership. Stateoftheart algorithms for SCMD can perform poorly on instances with a large amount of missing data or if the data matrix $X$ is nearly fullrank. We propose a novel integer programmingbased method for SCMD. The approach is based on dynamically determining a set of candidate subspaces and optimally assigning points to selected subspaces. The problem structure is identical to the classical facilitylocation problem, with subspaces playing the role of facilities and data points that of customers. We propose a columngeneration approach for identifying candidate subspaces combined with a Benders decomposition approach for solving the linear programming relaxation of the formulation. An empirical study demonstrates that the proposed approach can achieve better clustering accuracy than stateoftheart methods when the data is highrank or the percentage of missing data is high.

Akhilesh Soni · Daniel PimentelAlarcón 🔗 


Integer Programming Approaches To Subspace Clustering With Missing Data
(
Spotlight
)
>
In the \emph{Subspace Clustering with Missing Data} (SCMD) problem, we are given a collection of $n$ data points, arranged into columns of a matrix $X \in \mathbb{R}^{d \times n}$, and each of the data points is observed only on a subset of its coordinates. The data points are assumed to be concentrated near a union of lowdimensional subspaces. The goal of SCMD is to cluster the vectors of the data matrix $X$ as per their subspace membership. Stateoftheart algorithms for SCMD can perform poorly on instances with a large amount of missing data or if the data matrix $X$ is nearly fullrank. We propose a novel integer programmingbased method for SCMD. The approach is based on dynamically determining a set of candidate subspaces and optimally assigning points to selected subspaces. The problem structure is identical to the classical facilitylocation problem, with subspaces playing the role of facilities and data points that of customers. We propose a columngeneration approach for identifying candidate subspaces combined with a Benders decomposition approach for solving the linear programming relaxation of the formulation. An empirical study demonstrates that the proposed approach can achieve better clustering accuracy than stateoftheart methods when the data is highrank or the percentage of missing data is high.

Akhilesh Soni · Daniel PimentelAlarcón 🔗 


Farkas' Theorem of the Alternative for Prior Knowledge in Deep Networks
(
Poster
)
>
In this paper, prior knowledge expressed in the form of polyhedral sets is introduced into the training of a deep neural network. This approach to using prior knowledge extends earlier work that applies Farkas' Theorem of the Alternative to linear support vector machine classifiers. Through this extension, we gain the ability to sculpt the decision boundary of a neural network by training on a set of discrete data while simultaneously fitting an uncountable number of points that live within a polytope that is defined by prior knowledge. We demonstrate viability of this approach on both synthetic and benchmark data sets. 
Jeffery Kline · Joseph Bockhorst 🔗 


Farkas' Theorem of the Alternative for Prior Knowledge in Deep Networks
(
Spotlight
)
>
In this paper, prior knowledge expressed in the form of polyhedral sets is introduced into the training of a deep neural network. This approach to using prior knowledge extends earlier work that applies Farkas' Theorem of the Alternative to linear support vector machine classifiers. Through this extension, we gain the ability to sculpt the decision boundary of a neural network by training on a set of discrete data while simultaneously fitting an uncountable number of points that live within a polytope that is defined by prior knowledge. We demonstrate viability of this approach on both synthetic and benchmark data sets. 
Jeffery Kline · Joseph Bockhorst 🔗 


Decentralized Personalized Federated Learning: Lower Bounds and Optimal Algorithm for All Personalization Modes
(
Poster
)
>
In this paper, we consider the formulation of the federated learning problem that is relevant to both decentralized personalized federated learning and multitask learning. This formulation is widespread in the literature and represents the minimization of local losses with regularization taking into account the communication matrix of the network. First of all, we give lower bounds for the considered problem in different regularization regimes. We also constructed an optimal algorithm that matches these lower bounds. Additionally, we check the theoretical results with experiments. 
Abdurakhmon Sadiev · Ekaterina Borodich · Darina Dvinskikh · Aleksandr Beznosikov · Alexander Gasnikov 🔗 


Decentralized Personalized Federated Learning: Lower Bounds and Optimal Algorithm for All Personalization Modes
(
Spotlight
)
>
In this paper, we consider the formulation of the federated learning problem that is relevant to both decentralized personalized federated learning and multitask learning. This formulation is widespread in the literature and represents the minimization of local losses with regularization taking into account the communication matrix of the network. First of all, we give lower bounds for the considered problem in different regularization regimes. We also constructed an optimal algorithm that matches these lower bounds. Additionally, we check the theoretical results with experiments. 
Abdurakhmon Sadiev · Ekaterina Borodich · Darina Dvinskikh · Aleksandr Beznosikov · Alexander Gasnikov 🔗 


Towards Modeling and Resolving Singular Parameter Spaces using Stratifolds
(
Poster
)
>
When analyzing parametric statistical models, a useful approach consists in modeling geometrically the parameter space. However, even for very simple and commonly used hierarchical models like statistical mixtures or stochastic deep neural networks, the smoothness assumption of manifolds is violated at singular points which exhibit nonsmooth neighborhoods in the parameter space. These singular models have been analyzed in the context of learning dynamics, where singularities can act as attractors on the learning trajectory and, therefore, negatively influence the convergence speed of models. We propose a general approach to circumvent the problem arising from singularities by using stratifolds, a concept from algebraic topology, to formally model singular parameter spaces. We use the property that specific stratifolds are equipped with a resolution method to construct a smooth manifold approximation of the singular space. We empirically show that using (natural) gradient descent on the smooth manifold approximation instead of the singular space allows us to avoid the attractor behavior and therefore improve the convergence speed in learning. 
Pascal Esser · Frank Nielsen 🔗 


Towards Modeling and Resolving Singular Parameter Spaces using Stratifolds
(
Spotlight
)
>
When analyzing parametric statistical models, a useful approach consists in modeling geometrically the parameter space. However, even for very simple and commonly used hierarchical models like statistical mixtures or stochastic deep neural networks, the smoothness assumption of manifolds is violated at singular points which exhibit nonsmooth neighborhoods in the parameter space. These singular models have been analyzed in the context of learning dynamics, where singularities can act as attractors on the learning trajectory and, therefore, negatively influence the convergence speed of models. We propose a general approach to circumvent the problem arising from singularities by using stratifolds, a concept from algebraic topology, to formally model singular parameter spaces. We use the property that specific stratifolds are equipped with a resolution method to construct a smooth manifold approximation of the singular space. We empirically show that using (natural) gradient descent on the smooth manifold approximation instead of the singular space allows us to avoid the attractor behavior and therefore improve the convergence speed in learning. 
Pascal Esser · Frank Nielsen 🔗 


Spherical Perspective on Learning with Normalization Layers
(
Poster
)
>
Normalization Layers (NL) are widely used in modern deeplearning architectures. Despite their apparent simplicity, their effect on optimization is not yet fully understood. We introduce a spherical framework to study the optimization of neural networks with NL from a geometric perspective. Concretely, we leverage the radial invariance of groups of parameters to translate the optimization steps on the $L_2$ unit hypersphere. This formulation and the associated geometric interpretation shed new light on the training dynamics. We use it to derive the first effective learning rate expression of Adam. We then show theoretically and empirically
that, in the presence of NL, performing SGD alone is actually equivalent to a variant of Adam constrained to the unit hypersphere.

Simon Roburin · Yann de MontMarin · Andrei Bursuc · Renaud Marlet · Patrick Pérez · Mathieu Aubry 🔗 


Spherical Perspective on Learning with Normalization Layers
(
Spotlight
)
>
Normalization Layers (NL) are widely used in modern deeplearning architectures. Despite their apparent simplicity, their effect on optimization is not yet fully understood. We introduce a spherical framework to study the optimization of neural networks with NL from a geometric perspective. Concretely, we leverage the radial invariance of groups of parameters to translate the optimization steps on the $L_2$ unit hypersphere. This formulation and the associated geometric interpretation shed new light on the training dynamics. We use it to derive the first effective learning rate expression of Adam. We then show theoretically and empirically
that, in the presence of NL, performing SGD alone is actually equivalent to a variant of Adam constrained to the unit hypersphere.

Simon Roburin · Yann de MontMarin · Andrei Bursuc · Renaud Marlet · Patrick Pérez · Mathieu Aubry 🔗 


Optimization with Adaptive Step Size Selection from a Dynamical Systems Perspective
(
Poster
)
>
We investigate how adaptive step size methods from numerical analysis can be used to speed up optimization routines. In contrast to line search strategies, the proposed methods recycle available gradient evaluations. Thus, neither evaluations of the objective function nor additional gradient evaluations are required, and the computational complexity per iteration remains at $\mathcal{O}(d)$, where $d$ is the problem dimension. On strongly convex functions, our results show a consistent improvement in the number of iterations by up to a factor of $2$ with the gradient method and of $1.4$ with the heavy ball method across a range of condition numbers.

Neha Wadia · Michael Jordan · Michael Muehlebach 🔗 


Optimization with Adaptive Step Size Selection from a Dynamical Systems Perspective
(
Spotlight
)
>
We investigate how adaptive step size methods from numerical analysis can be used to speed up optimization routines. In contrast to line search strategies, the proposed methods recycle available gradient evaluations. Thus, neither evaluations of the objective function nor additional gradient evaluations are required, and the computational complexity per iteration remains at $\mathcal{O}(d)$, where $d$ is the problem dimension. On strongly convex functions, our results show a consistent improvement in the number of iterations by up to a factor of $2$ with the gradient method and of $1.4$ with the heavy ball method across a range of condition numbers.

Neha Wadia · Michael Jordan · Michael Muehlebach 🔗 


Better Linear Rates for SGD with Data Shuffling
(
Poster
)
>
Virtually all stateoftheart methods for training supervised machine learning models are variants of SGD, enhanced with a number of additional tricks, such as minibatching, momentum, and adaptive stepsizes. However, one of the most basic questions in the design of successful SGD methods, one that is orthogonal to the aforementioned tricks, is the choice of the {\em next} training data point to be learning from. Standard variants of SGD employ a {\em sampling with replacement} strategy, which means that the next training data point is sampled from the entire data set, often independently of all previous samples. While standard SGD is well understood theoretically, virtually all widely used machine learning software is based on {\em sampling without replacement} as this is often empirically superior. That is, the training data is randomly shuffled/permuted, either only once at the beginning, strategy known as {\em random shuffling} (RS), or before every epoch, strategy known as {\em random reshuffling} (RR), and training proceeds in the data order dictated by the shuffling. RS and RR strategies have for a long time remained beyond the reach of theoretical analysis that would satisfactorily explain their success. However, very recently, Mishchenko et al (2020) provided tight {\em sublinear} convergence rates through a novel analysis, and showed that these strategies can improve upon standard SGD in certain regimes. Inspired by these results, we seek to further improve the rates of shufflingbased methods. In particular, we show that it is possible to enhance them with a variance reduction mechanism, obtaining {\em linear} convergence rates in the strongly convex regime. To the best of our knowledge, our linear convergence rates are the best for any method based on sampling without replacement. Moreover, we obtained improved convergence guarantees in convex and nonconvex regimes as well. 
Grigory Malinovsky · Alibek Sailanbayev · Peter Richtarik 🔗 


Better Linear Rates for SGD with Data Shuffling
(
Spotlight
)
>
Virtually all stateoftheart methods for training supervised machine learning models are variants of SGD, enhanced with a number of additional tricks, such as minibatching, momentum, and adaptive stepsizes. However, one of the most basic questions in the design of successful SGD methods, one that is orthogonal to the aforementioned tricks, is the choice of the {\em next} training data point to be learning from. Standard variants of SGD employ a {\em sampling with replacement} strategy, which means that the next training data point is sampled from the entire data set, often independently of all previous samples. While standard SGD is well understood theoretically, virtually all widely used machine learning software is based on {\em sampling without replacement} as this is often empirically superior. That is, the training data is randomly shuffled/permuted, either only once at the beginning, strategy known as {\em random shuffling} (RS), or before every epoch, strategy known as {\em random reshuffling} (RR), and training proceeds in the data order dictated by the shuffling. RS and RR strategies have for a long time remained beyond the reach of theoretical analysis that would satisfactorily explain their success. However, very recently, Mishchenko et al (2020) provided tight {\em sublinear} convergence rates through a novel analysis, and showed that these strategies can improve upon standard SGD in certain regimes. Inspired by these results, we seek to further improve the rates of shufflingbased methods. In particular, we show that it is possible to enhance them with a variance reduction mechanism, obtaining {\em linear} convergence rates in the strongly convex regime. To the best of our knowledge, our linear convergence rates are the best for any method based on sampling without replacement. Moreover, we obtained improved convergence guarantees in convex and nonconvex regimes as well. 
Grigory Malinovsky · Alibek Sailanbayev · Peter Richtarik 🔗 


Fast, Exact Subsampled Natural Gradients and FirstOrder KFAC
(
Poster
)
>
Secondorder optimization algorithms hold the potential to speed up learning in neural networks, but are notoriously hard to compute due to the enormous size of the curvature matrix. This problem has inspired approximations of the curvature matrix, which allow for efficient computations, most prominently the kroneckerfactored curvature approximation (KFAC). Indeed, KFAC shows significant speedups for optimization compared to standard baselines. In this context, we challenge two common beliefs: Firstly we show that, when subsampling the curvature matrix (in our case the Fisher Information), secondorder updates can be computed efficiently and exactly: The PyTorch implementation of our method requires less than twice as much amortised wallclock time per parameter update than SGD. Secondly, through a careful set of experiments, we demonstrate that KFAC does not owe its performance to approximating the curvature matrix, but rather is closely linked to a new, simple firstorder optimization algorithm. We propose and analyse using this firstorder optimizer and demonstrate that it outperforms KFAC both in terms of computation cost and optimization progress per parameter update. 
Frederik Benzing 🔗 


Fast, Exact Subsampled Natural Gradients and FirstOrder KFAC
(
Spotlight
)
>
Secondorder optimization algorithms hold the potential to speed up learning in neural networks, but are notoriously hard to compute due to the enormous size of the curvature matrix. This problem has inspired approximations of the curvature matrix, which allow for efficient computations, most prominently the kroneckerfactored curvature approximation (KFAC). Indeed, KFAC shows significant speedups for optimization compared to standard baselines. In this context, we challenge two common beliefs: Firstly we show that, when subsampling the curvature matrix (in our case the Fisher Information), secondorder updates can be computed efficiently and exactly: The PyTorch implementation of our method requires less than twice as much amortised wallclock time per parameter update than SGD. Secondly, through a careful set of experiments, we demonstrate that KFAC does not owe its performance to approximating the curvature matrix, but rather is closely linked to a new, simple firstorder optimization algorithm. We propose and analyse using this firstorder optimizer and demonstrate that it outperforms KFAC both in terms of computation cost and optimization progress per parameter update. 
Frederik Benzing 🔗 


Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization
(
Poster
)
>
Adaptive gradient methods such as Adam have gained increasing popularity in deep learning optimization. However, it has been observed that compared with (stochastic) gradient descent, Adam can converge to a different solution with a significantly worse test error in many deep learning applications such as image classification, even with a finetuned regularization.
In this paper,
we provide a theoretical explanation for this phenomenon: we show that
in the nonconvex setting of learning overparameterized twolayer convolutional neural networks starting from the same random initialization, for a class of data distributions (inspired from image data), Adam and gradient descent (GD) can converge to different global solutions of the training objective with provably different generalization errors, even with weight decay regularization. 
Difan Zou · Yuan Cao · Yuanzhi Li · Quanquan Gu 🔗 


Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization
(
Spotlight
)
>
Adaptive gradient methods such as Adam have gained increasing popularity in deep learning optimization. However, it has been observed that compared with (stochastic) gradient descent, Adam can converge to a different solution with a significantly worse test error in many deep learning applications such as image classification, even with a finetuned regularization.
In this paper,
we provide a theoretical explanation for this phenomenon: we show that
in the nonconvex setting of learning overparameterized twolayer convolutional neural networks starting from the same random initialization, for a class of data distributions (inspired from image data), Adam and gradient descent (GD) can converge to different global solutions of the training objective with provably different generalization errors, even with weight decay regularization. 
Difan Zou · Yuan Cao · Yuanzhi Li · Quanquan Gu 🔗 


DESTRESS: ComputationOptimal and CommunicationEfficient Decentralized Nonconvex FiniteSum Optimization
(
Poster
)
>
Emerging applications in multiagent environments such as internetofthings, networked sensing, autonomous systems and federated learning, call for decentralized algorithms for finitesum optimizations that are resourceefficient in terms of both computation and communication. In this paper, we con sider the prototypical setting where the agents work collaboratively to minimize the sum of local loss functions by only communicating with their neighbors over a predetermined network topology. We develop a new algorithm, called DEcentralized STochastic REcurSive gradient methodS (DESTRESS) for nonconvex finitesum optimization, which matches the nearoptimal incremental firstorder oracle (IFO) complexity of stateoftheart centralized algorithms for finding firstorder stationary points, and significantly improves over existing decentralized algorithms. The communication complexity of DESTRESS also improves upon prior arts over a wide range of parameter regimes. DESTRESS leverages several key algorithm design ideas including stochastic recursive gradient updates with minibatches for local computation, gradient tracking with extra mixing for periteration communication, together with careful choices of hyperparameters and new analysis frameworks to provably achieve a desirable computationcommunication tradeoff. 
Boyue Li · Zhize Li · Yuejie Chi 🔗 


DESTRESS: ComputationOptimal and CommunicationEfficient Decentralized Nonconvex FiniteSum Optimization
(
Spotlight
)
>
Emerging applications in multiagent environments such as internetofthings, networked sensing, autonomous systems and federated learning, call for decentralized algorithms for finitesum optimizations that are resourceefficient in terms of both computation and communication. In this paper, we con sider the prototypical setting where the agents work collaboratively to minimize the sum of local loss functions by only communicating with their neighbors over a predetermined network topology. We develop a new algorithm, called DEcentralized STochastic REcurSive gradient methodS (DESTRESS) for nonconvex finitesum optimization, which matches the nearoptimal incremental firstorder oracle (IFO) complexity of stateoftheart centralized algorithms for finding firstorder stationary points, and significantly improves over existing decentralized algorithms. The communication complexity of DESTRESS also improves upon prior arts over a wide range of parameter regimes. DESTRESS leverages several key algorithm design ideas including stochastic recursive gradient updates with minibatches for local computation, gradient tracking with extra mixing for periteration communication, together with careful choices of hyperparameters and new analysis frameworks to provably achieve a desirable computationcommunication tradeoff. 
Boyue Li · Zhize Li · Yuejie Chi 🔗 


Heavytailed noise does not explain the gap between SGD and Adam on Transformers
(
Poster
)
>
The success of Adam on deep learning architectures such as transformers has made it the default option in many application where stochastic gradient descent (SGD) does not work. Our theoretical understanding of this discrepancy is lagging and has prevented us from significantly improving either algorithm. Recent work advanced the hypothesis that Adam, and other heuristics like gradient clipping, outperform SGD on language tasks because the distribution of the stochastic gradients is more heavytailed. This suggest that Adam performs better because it is more robust to outliers, which is a promising avenue for designing better stochastic gradient estimators. However, it is unclear whether heavytailed noise is the cause of this discrepancy or another symptom. Through experiments that control for stochasticity by increasing the batch size, we present evidence that stochasticity and heavytailed noise is not the major factor in this gap. 
Jacques Chen · Frederik Kunstner · Mark Schmidt 🔗 


Heavytailed noise does not explain the gap between SGD and Adam on Transformers
(
Spotlight
)
>
The success of Adam on deep learning architectures such as transformers has made it the default option in many application where stochastic gradient descent (SGD) does not work. Our theoretical understanding of this discrepancy is lagging and has prevented us from significantly improving either algorithm. Recent work advanced the hypothesis that Adam, and other heuristics like gradient clipping, outperform SGD on language tasks because the distribution of the stochastic gradients is more heavytailed. This suggest that Adam performs better because it is more robust to outliers, which is a promising avenue for designing better stochastic gradient estimators. However, it is unclear whether heavytailed noise is the cause of this discrepancy or another symptom. Through experiments that control for stochasticity by increasing the batch size, we present evidence that stochasticity and heavytailed noise is not the major factor in this gap. 
Jacques Chen · Frederik Kunstner · Mark Schmidt 🔗 


Acceleration and Stability of the Stochastic Proximal Point Algorithm
(
Poster
)
>
Stochastic gradient descent (SGD) has emerged as the defacto method for solving (unconstrained) stochastic optimization problems. However, it suffers from two fundamental limitations: $(i)$ slow convergence due to inaccurate gradient approximation, and $(ii)$ numerical instability, especially with respect to step size selection. To improve the slow convergence, accelerated variants such as stochastic gradient descent with momentum (SGDM) have been studied; however, the interference of gradient noise and momentum can aggravate the numerical instability. Proximal point methods, on the other hand, have gained much attention due to their numerical stability. Their stochastic accelerated variants though have received limited attention. To bridge this gap, we propose the stochastic proximal point algorithm with momentum (SPPAM), and study its convergence and stability. We show that SPPAM enjoys a better contraction factor compared to stochastic proximal point algorithm (SPPA), leading to faster convergence. In terms of stability, we show that SPPAM depends on problem constants more favorably than SGDM.

Junhyung Lyle Kim · Panos Toulis · Anastasios Kyrillidis 🔗 


Acceleration and Stability of the Stochastic Proximal Point Algorithm
(
Spotlight
)
>
Stochastic gradient descent (SGD) has emerged as the defacto method for solving (unconstrained) stochastic optimization problems. However, it suffers from two fundamental limitations: $(i)$ slow convergence due to inaccurate gradient approximation, and $(ii)$ numerical instability, especially with respect to step size selection. To improve the slow convergence, accelerated variants such as stochastic gradient descent with momentum (SGDM) have been studied; however, the interference of gradient noise and momentum can aggravate the numerical instability. Proximal point methods, on the other hand, have gained much attention due to their numerical stability. Their stochastic accelerated variants though have received limited attention. To bridge this gap, we propose the stochastic proximal point algorithm with momentum (SPPAM), and study its convergence and stability. We show that SPPAM enjoys a better contraction factor compared to stochastic proximal point algorithm (SPPA), leading to faster convergence. In terms of stability, we show that SPPAM depends on problem constants more favorably than SGDM.

Junhyung Lyle Kim · Panos Toulis · Anastasios Kyrillidis 🔗 


Gaussian Graphical Models as an Ensemble Method for Distributed Gaussian Processes
(
Poster
)
>
Distributed Gaussian process (DGP) is a popular approach to scale GP to big data which divides the training data into some subsets, performs local inference for each partition, and aggregates the results to acquire global prediction. To combine the local predictions, the conditional independence assumption is used which basically means there is a perfect diversity between the subsets. Although it keeps the aggregation tractable, it is often violated in practice and generally yields poor results. In this paper, we propose a novel approach for aggregating the Gaussian experts' predictions by Gaussian graphical model (GGM) where the target aggregation is defined as an unobserved latent variable and the local predictions are the observed variables. We first estimate the joint distribution of latent and observed variables using the ExpectationMaximization (EM) algorithm. The interaction between experts can be encoded by the precision matrix of the joint distribution and the aggregated predictions are obtained based on the property of conditional Gaussian distribution. Using both synthetic and real datasets, our experimental evaluations illustrate that our new method outperforms other stateoftheart DGP approaches. 
Hamed Jalali · Gjergji Kasneci 🔗 


DAdaQuant: Doublyadaptive quantization for communicationefficient Federated Learning
(
Poster
)
>
Federated Learning (FL) is a powerful technique to train a model on a server with data from several clients in a privacypreserving manner. FL incurs significant communication costs because it repeatedly transmits the model between the server and clients. Recently proposed algorithms quantize the model parameters to efficiently compress FL communication. We find that dynamic adaptations of the quantization level can boost compression without sacrificing model quality. We introduce DAdaQuant as a doublyadaptive quantization algorithm that dynamically changes the quantization level across time and different clients. Our experiments show that DAdaQuant consistently improves client>server compression, outperforming the strongest nonadaptive baselines by up to 2.8x. 
Robert Hönig · Yiren Zhao · Robert Mullins 🔗 


Barzilai and Borwein conjugate gradient method equipped with a nonmonotone line search technique
(
Poster
)
>
In this paper, we propose a new nonmonotone conjugate gradient method for solving unconstrained nonlinear optimization problems. We first modify the nonmonotone line search method by introducing a new trigonometric function to calculate the nonmonotone parameter, which plays an essential role in the algorithm's efficiency. Then, we apply a convex combination of the BarzilaiBorwein method \cite{Barzilai} for calculating the value of step size in each iteration. Under some suitable assumptions, we prove that the new algorithm has the global convergence property. The efficiency and effectiveness of the proposed method are determined in practice by applying the algorithm to some standard test problems and nonnegative matrix factorization problems. 
Sajad Fathi Hafshejani · Daya Gaur · Shahadat Hossain · Robert Benkoczi 🔗 


Using a one dimensional parabolic model of the fullbatch loss to estimate learning rates during training
(
Poster
)
>
A fundamental challenge in Deep Learning is to automatically find optimal step sizes for stochastic gradient descent. In traditional optimization, line searches are a commonly used method to determine step sizes. One problem in Deep Learning is that finding appropriate step sizes on the fullbatch loss is unfeasibly expensive. Therefore, classical line search approaches, designed for losses without inherent noise, are usually not applicable. Recent empirical findings suggest that the fullbatch loss behaves locally parabolically in the direction of noisy update step directions. Furthermore, the trend of the optimal update step size is changing slowly. By exploiting these findings, this work introduces a linesearch method that approximates the fullbatch loss with a parabola estimated over several minibatches. In the experiments conducted, our approach mostly outperforms SGD tuned with a piecewise constant learning rate schedule and other line search approaches for Deep Learning across models, datasets, and batch sizes on validation and test accuracy. 
Maximus Mutschler · Andreas Zell 🔗 


Communitybased Layerwise Distributed Training of Graph Convolutional Networks
(
Poster
)
>
Graph convolutional network (GCN) has been successfully applied to many graphbased applications. Training a largescale GCN model, however, is still challenging: Due to the node dependency and layer dependency of the GCN architecture, a huge amount of computational time and memory is required in the training process. In this paper, we propose a parallel and distributed GCN training algorithm based on the Alternating Direction Method of Multipliers (ADMM) to tackle the two challenges simultaneously. We first split the GCN layers into independent blocks to achieve layer parallelism. Furthermore, we reduce node dependency by dividing the graph into several dense communities such that each of them can be trained with an agent in parallel. Finally, we provide solutions for all subproblems in the communitybased ADMM algorithm. Preliminary results demonstrate that our proposed communitybased ADMM training algorithm can lead to more than triple speedup while achieving the best performance compared to stateoftheart methods. 
Hongyi Li · Junxiang Wang · Yongchao Wang · Yue Cheng · Liang Zhao 🔗 


Optimumstatistical Collaboration Towards Efficient BlackboxOptimization
(
Poster
)
>
We propose a general optimumstatistical collaboration framework for sequential blackbox optimization. Based on general definitions of the resolution descriptor and the uncertainty quantifier, we provide a general regret analysis of the proposed framework. We then show that the proposed framework can be applied to a broader range of functions that have different smoothness, and it inspires tighter measures of the statistical uncertainty and thus a faster algorithm. 
Wenjie Li · ChiHua Wang · Guang Cheng 🔗 


COCO Denoiser: Using CoCoercivity for Variance Reduction in Stochastic Convex Optimization
(
Poster
)
>
Firstorder methods for stochastic optimization have undeniable relevance. Variance reduction for these algorithms has become an important research topic. We exploit convexity and Lsmoothness to improve the noisy estimates outputted by the stochastic gradient oracle. Our method, named COCO denoiser, is the joint maximum likelihood estimator of multiple function gradients from their noisy observations, subject to cocoercivity constraints between them. The resulting estimate is the solution of a convex Quadratically Constrained Quadratic Problem. Although this problem is expensive to solve by interior point methods, we exploit its structure to apply an accelerated firstorder algorithm, the Fast Dual Proximal Gradient method. Besides analytically characterizing the proposed estimator, we show empirically that increasing the number and proximity of the queried points leads to better gradient estimates. We also apply COCO in stochastic settings by plugging it in existing algorithms, such as SGD, Adam or STRSAGA, outperforming their vanilla versions, even in scenarios where our modelling assumptions are mismatched. 
Manuel Madeira · Renato Negrinho · Joao Xavier · Pedro Aguiar 🔗 


Stochastic Learning Equation using Monotone Increasing Resolution of Quantization
(
Poster
)
>
In this paper, we propose a quantized learning equation with a monotone increasing resolution of quantization and stochastic analysis for the proposed algorithm. According to the white noise hypothesis for the quantization error with dense and uniform distribution, we can regard the quantization error as i.i.d. white noise. Based on this, we show that the learning equation with monotonically increasing quantization resolution converges weakly as the distribution viewpoint. The analysis of this paper shows that global optimization is possible for a domain that satisfies the Lipschitz condition instead of local convergence properties such as the Hessian constraint of the objective function. 
Jinwuk Seok · 🔗 


SignRIP: A Robust Restricted Isometry Property for Lowrank Matrix Recovery
(
Poster
)
>
Restricted isometry property (RIP), essentially stating that the linear measurements are approximately normpreserving, plays a crucial role in studying lowrank matrix recovery problem. However, RIP fails in the robust setting, when a subset of the measurements are grossly corrupted with noise. In this work, we propose a robust restricted isometry property, called SignRIP, and show its broad applications in robust lowrank matrix recovery. In particular, we show that SignRIP can guarantee the uniform convergence of the subdifferentials of the robust matrix recovery with nonsmooth loss function, even at the presence of arbitrarily dense and arbitrarily large outliers. Based on SignRIP, we characterize the location of the critical points in the robust rank1 matrix recovery, and prove that they are either close to the true solution, or have small norm. Moreover, in the overparameterized regime, where the rank of the true solution is overestimated, we show that subgradient method converges to the true solution at a (nearly) dimensionfree rate. We show that the new notion of signRIP enjoys almost the same sample complexity as its classical counterparts, but provides significantly better robustness against noise. 
Jianhao Ma · Salar Fattahi 🔗 


PracticeConsistent Analysis of AdamStyle Methods
(
Poster
)
>
In this paper, we present a simple and intuitive proof of convergence for a family of Adamstyle methods (including Adam, AMSGrad, Adabound, etc.) with an increasing or large "momentum" parameter for the firstorder moment, which gives an alternative yet more natural way to guarantee Adam converge in stochastic nonconvex minimization. We also establish a variance diminishing result for the used stochastic gradient estimators. The analysis is based on a widely used but not fully understood stochastic estimator using moving average (SEMA), which only requires a general unbiased stochastic oracle. In particular, we analyze Adamstyle methods based on the variance recursion property of SEMA for stochastic nonconvex minimization. 
Zhishuai Guo · Yi Xu · Wotao Yin · Rong Jin · Tianbao Yang 🔗 


Towards Robust and Automatic HyperParameter Tunning
(
Poster
)
>
The task of hyperparameter optimization (HPO) is burdened with heavy computational costs due to the intractability of optimizing both a model's weights and its hyperparameters simultaneously. In this work, we introduce a new class of HPO method and explore how the lowrank factorization of the convolutional weights of intermediate layers of a convolutional neural network can be used to define an analytical response surface for optimizing hyperparameters, using only training data. We quantify how this surface behaves as a surrogate to model performance and can be solved using a trustregion search algorithm, which we call \AlgName. The algorithm outperforms stateoftheart such as Bayesian Optimization and generalizes across model, optimizer, and dataset selection. 
Mathieu Tuli · Mahdi Hosseini · Konstantinos N Plataniotis 🔗 


Randomreshuffled SARAH does not need a full gradient computations
(
Poster
)
>
The StochAstic Recursive grAdient algoritHm (SARAH) algorithm is a variance reduced variant of the Stochastic Gradient Descent (SGD) algorithm that needs a gradient of the objective function from time to time. In this paper, we remove the necessity of a full gradient computation. This is achieved by using a randomized reshuffling strategy and aggregating stochastic gradients obtained in each epoch. The aggregated stochastic gradients serve as an estimate of a full gradient in the SARAH algorithm. We provide a theoretical analysis of the proposed approach and conclude the paper with numerical experiments that demonstrate the efficiency of this approach. 
Aleksandr Beznosikov · Martin Takac 🔗 


Shifted Compression Framework: Generalizations and Improvements
(
Poster
)
>
Communication is one of the key bottlenecks in the distributed training of largescale machine learning models, and lossy compression of exchanged information, such as stochastic gradients or models, is one of the most effective instruments to alleviate this issue. Among the most studied compression techniques is the class of unbiased compression operators with variance bounded by a multiple of the square norm of the vector we wish to compress. By design, this variance may remain high, and only diminishes if the input vector approaches zero. However, unless the model being trained is overparameterized, there is no apriori reason for the vectors we wish to compress to approach zero during the iterations of classical methods such as distributed compressed SGD, which has adverse effects on the convergence speed. Due to this issue, several more elaborate and seemingly very different algorithms have been proposed recently, with the goal of circumventing this issue. These methods are based on the idea of compressing the difference between the vector we would normally wish to compress and some auxiliary vector which changes throughout the iterative process. In this work we take a step back, and develop a unified framework for studying such methods, conceptually, and theoretically. Our framework incorporates methods compressing both gradients and models, using unbiased and biased compressors, and sheds light on the construction of the auxiliary vectors. Furthermore, our general framework can lead to the improvement of several existing algorithms, and can produce new algorithms. Finally, we performed several numerical experiments which illustrate and support our theoretical findings. 
Egor Shulgin · Peter Richtarik 🔗 


A New Scheme for Boosting with an Average Margin Distribution Oracle
(
Poster
)
>
Boosting is a standard framework for learning a largemargin sparse ensemble of base hypotheses, where the algorithm assumes an oracle called the base learner, which returns a base hypothesis h with maximum edge Ei[yih(xi)] with respect to a given distribution over the sample ((x1, y1), . . . , (xm, ym)). In particular, for the l1norm regularized soft margin optimization problem, there are several boosting algorithms that have theoretical iteration bound for finding ε approximate solutions. They are not as fast as classical LPBoost by Demiriz et al., which has no nontrivial iteration bound. In this paper, we propose a new boosting scheme, where we assume a special base learner, which returns the average margin distribution vector Eh[(yih(xi))mi=1] with respect to a certain distribution over the base hypothesis class H. Under this scheme, we pro pose a boosting algorithm whose iteration bound is O((r ln m)/ε2) and running time is O(m ln m) per iteration, where r is the VC dimension of H. Moreover, we also propose an efficient implementation for the new base learner, given that a relevant subclass HS of H, is represented by a nondeterministic version of the Zerosuppressed Decision Diagram (NZDD), where NZDD is a data structure for representing a family of sets succinctly. 
Ryotaro Mitsuboshi · Kohei Hatano · Eiji Takimoto 🔗 


The Geometric Occam Razor Implicit in Deep Learning
(
Poster
)
>
In overparameterized deep neural networks there can be many possible parameter configurations that fit the training data exactly. However, the properties of these interpolating solutions are poorly understood. We argue that overparameterized neural networks trained with stochastic gradient descent are subject to a Geometric Occam Razor: these networks are implicitly regularized by the geometric model complexity. For onedimensional regression, the geometric model complexity is simply given by the arc length of the function. For higherdimensional settings, the geometric model complexity depends on the Dirichlet energy of the function. We explore the relationship between the Geometric Occam Razor, Dirichlet energy and known forms of implicit regularization. Finally, for ResNets trained on Cifar10, we observe that Dirichlet energy measurements are consistent with the action of this implicit Geometric Occam Razor. 
Benoit Dherin · Michael Munn · David Barrett 🔗 


Escaping Local Minima With Stochastic Noise
(
Poster
)
>
With the recent advancements in Deep Learning, nonconvex problems have received wide spread attention. However, while stochastic gradient descent (SGD) can often successfully optimize these nonconvex models in practice, the theoretical analysismapping SGD to Gradient Descent (GD) in expectationcannot explain global convergence, as GD gets trapped in local minima and stationary points. We introduce a novel proof framework to analyze stochastic methods on a class of structured nonconvex functions. We prove that SGD converges linearly to approximate global minima on nonconvex functions that are `close' to a convexlike (strongly convex or P\L) function when its iterates are perturbed with stochastic noise of appropriate strength (smoothing). Our analysis applies to a strictly larger class of functions than studied in prior work. We demonstrate that special cases of smoothing are equivalent to vanilla SGD in expectation, which explains why SGD can escape local minima in practice. 
Harshvardhan Harshvardhan · Sebastian Stich 🔗 


Faking Interpolation Until You Make It
(
Poster
)
>
Deep overparameterized neural networks exhibit the interpolation property on many data sets. That is, these models are able to achieve approximately zero loss on all training samples simultaneously. Recently, this property has been exploited to develop novel optimisation algorithms for this setting. These algorithms use the fact that the optimal loss value is known to employ a variation of a Polyak Stepsize calculated on a stochastic batch of data. In this work, we introduce an algorithm that extends this idea to tasks where the interpolation property does not hold. As we no longer have access to the optimal loss values a priori, we instead estimate them for each sample online. To realise this, we introduce a simple but highly effective heuristic for approximating the optimal value based on previous loss evaluations. Through rigorous experimentation we show the effectiveness of our approach, which outperforms adaptive gradient and line search methods. 
Alasdair Paren · Rudra Poudel · Pawan K Mudigonda 🔗 


High Probability Step Size Lower Bound for Adaptive Stochastic Optimization
(
Poster
)
>
Several classical adaptive optimization algorithms, such as line search and trustregion methods, have been recently extended to stochastic settings. Unlike the stochastic gradient method and its many variants, these algorithms do not use a prespecified sequence of step sizes, but increase or decrease the step size adaptively according to the estimated progress of the algorithm. These algorithms rely on stochastic oracles that estimate function values, gradients, and Hessians in some cases. The accuracy requirement of these oracles is also adaptive and depends on the step size. In the deterministic setting, a lower bound on the step size is easily derived, however, in the stochastic setting, due to possible oracle failures, bounds on the step size have not been previously derived. In this paper, we give a lower bound on the step size that holds with high probability. This bound is dependent on the probability of the oracle failures, recovering the deterministic result as an extreme case when this probability is zero. 
Katya Scheinberg · Miaolan Xie 🔗 


Adaptive Optimization with Examplewise Gradients
(
Poster
)
>
We propose a new, more general approach to the design of stochastic gradientbased optimization methods for machine learning. In this new framework, optimizers assume access to a batch of gradient estimates per iteration, rather than a single estimate. This better reflects the information that is actually available in typical machine learning setups. To demonstrate the usefulness of this generalized approach, we develop Eve, an adaptation of the Adam optimizer which uses examplewise gradients to obtain more accurate secondmoment estimates. We provide preliminary experiments, without hyperparameter tuning, which show that the new optimizer slightly outperforms Adam on a small scale benchmark and performs the same or worse on larger scale benchmarks. Further work is needed to refine the algorithm and tune hyperparameters. 
Julius Kunze · James Townsend · David Barber 🔗 


Structured LowRank Tensor Learning
(
Poster
)
>
We consider the problem of learning lowrank tensors from partial observations with structural constraints and propose a novel factorization of such tensors, which leads to a simpler optimization problem. The resulting problem is an optimization problem on manifolds. We develop firstorder and secondorder Riemannian optimization algorithms to solve it. The duality gap for the resulting problem is derived, and we experimentally verify the correctness of the proposed algorithm. We demonstrate the algorithm on Nonnegative constraints and Hankel constraints. 
Jayadev Naram · Tanmay Sinha · Pawan Kumar 🔗 


ANITA: An Optimal Loopless Accelerated VarianceReduced Gradient Method
(
Poster
)
>
In this paper, we propose a novel accelerated gradient method called ANITA for solving the fundamental finitesum optimization problems. Concretely, we consider both general convex and strongly convex settings:
i) For general convex finitesum problems, ANITA improves previous stateoftheart result given by Varag (Lan et al., 2019). In particular, for largescale problems or the target error is not very small, i.e., $n \geq \frac{1}{\epsilon^2}$, ANITA obtains the \emph{first} optimal result $O(n)$, matching the lower bound $\Omega(n)$ provided by Woodworth and Srebro (2016), while previous results are $O(n \log \frac{1}{\epsilon})$ of Varag (Lan et al., 2019) and $O(\frac{n}{\sqrt{\epsilon}})$ of Katyusha (AllenZhu, 2017).
ii) For strongly convex finitesum problems, we also show that ANITA can achieve the optimal convergence rate $O\big((n+\sqrt{\frac{nL}{\mu}})\log\frac{1}{\epsilon}\big)$ matching the lower bound $\Omega\big((n+\sqrt{\frac{nL}{\mu}})\log\frac{1}{\epsilon}\big)$ provided by Lan and Zhou (2015).
Besides, ANITA enjoys a simpler loopless algorithmic structure unlike previous accelerated algorithms such as Varag (Lan et al., 2019) and Katyusha (AllenZhu, 2017) where they use an inconvenient doubleloop structure. Moreover, by exploiting the loopless structure of ANITA, we provide a new \emph{dynamic multistage convergence analysis}, which is the key technical part for improving previous results to the optimal rates.
Finally, the numerical experiments show that ANITA converges faster than the previous stateoftheart Varag (Lan et al., 2019), validating our theoretical results and confirming the practical superiority of ANITA.
We believe that our new theoretical rates and convergence analysis for this fundamental finitesum problem will directly lead to key improvements for many other related problems, such as distributed/federated/decentralized optimization problems. For instance, Li and Richtarik (2021) obtain the first compressed and accelerated result, substantially improving previous stateoftheart results, by applying ANITA to the distributed optimization problems with compressed communication.

Zhize Li 🔗 


EF21 with Bells & Whistles: Practical Algorithmic Extensions of Modern Error Feedback
(
Poster
)
>
First proposed by Seide et al (2014) as a heuristic, error feedback (EF) is a very popular mechanism for enforcing convergence of distributed gradientbased optimization methods enhanced with communication compression strategies based on the application of contractive compression operators. However, existing theory of EF relies on very strong assumptions (e.g., bounded gradients), and provides pessimistic convergence rates (e.g., while the best known rate for EF in the smooth nonconvex regime, and when full gradients are compressed, is $O(1/T^{2/3})$, the rate of gradient descent in the same regime is $O(1/T)$). Recently, Richt\'{a}rik et al (2021) proposed a new error feedback mechanism, EF21, based on the construction of a Markov compressor induced by a contractive compressor. EF21 removes the aforementioned theoretical deficiencies of EF and at the same time works better in practice. In this work we propose six practical extensions of EF21, all supported by strong convergence theory: partial participation, stochastic approximation, variance reduction, proximal setting, momentum and bidirectional compression. Several of these techniques were never analyzed in conjunction with EF before, and in cases where they were (e.g., bidirectional compression), our rates are vastly superior.

Peter Richtarik · Igor Sokolov · Ilyas Fatkhullin · Eduard Gorbunov · Zhize Li 🔗 


Stochastic Polyak Stepsize with a Moving Target
(
Poster
)
>
We propose a new stochastic gradient method that uses recorded past loss values to compute adaptive stepsizes. Our starting point is to show that the SP (Stochastic Polyak) method directly exploits interpolated models. That is, SP is a subsampled NewtonRaphson method applied to solving certain interpolation equations. These interpolation equations only hold for models that interpolate the data. We then use this viewpoint to develop a new variant of the SP method that converges without interpolation called MOTAPS. The MOTAPS method uses n auxiliary variables, one for each data point, that track the loss value for each data point. These auxiliary variables and the loss values are then used to set the step size. We provide a global convergence theory for MOTAPS by showing that it can be interpreted as a special variant of online SGD. We also perform several numerical experiments on convex learning problems, and nonconvex learning problem based on image classification and language translation. In all of our tasks we show that MOTAPS is competitive with the relevant baseline method. 
Robert Gower · Aaron Defazio · Mike Rabbat 🔗 


LastIterate Convergence of Saddle Point Optimizers via HighResolution Differential Equations
(
Poster
)
>
Several widelyused firstorder saddle point optimization methods yield the same continuoustime ordinary differential equation (ODE) as that of the Gradient Descent Ascent (GDA) method when derived naively. However, their convergence properties are very different even in simple bilinear games. In this paper, we use a technique from fluid dynamics called High Resolution Differential Equations (HRDEs) to design ODEs that converge to saddle points. As our main result, we design an ODE that has last iterate convergence for monotone variational inequalities. To our knowledge, this is the first continuoustime dynamics shown to converge for such a general setting. We also provide an implicit discretization of our ODE and we show it has last iterate convergence at a rate O(1/\sqrt{T}), which was previously shown to be optimal Golowich et al, 2020, using only firstorder smoothness of the monotone operator, in contrast to previous results that need secondorder smoothness as well. 
Tatjana Chavdarova · Michael Jordan · Emmanouil Zampetakis 🔗 


Towards Noiseadaptive, Problemadaptive Stochastic Gradient Descent
(
Poster
)
>
We aim to design stepsize schemes that make stochastic gradient descent (SGD) adaptive to (i) the noise \sigma^2 in the stochastic gradients and (ii) problemdependent constants. When minimizing smooth, stronglyconvex functions with condition number \kappa, we first prove that T iterations of SGD with Nesterov acceleration and exponentially decreasing stepsizes can achieve a nearoptimal \tilde{O}(\exp(T/\sqrt{\kappa}) + \sigma^2/T) convergence rate. Under a relaxed assumption on the noise, with the same stepsize scheme and knowledge of the smoothness, we prove that SGD can achieve an \tilde{O}(\exp(T/\kappa) + \sigma^2/T) rate. In order to be adaptive to the smoothness, we use a stochastic linesearch (SLS) and show (via upper and lowerbounds) that SGD converges at the desired rate, but only to a neighborhood of the solution. Next, we use SGD with an offline estimate of the smoothness and prove convergence to the minimizer. However, its convergence is slowed down proportional to the estimation error and we prove a lowerbound justifying this slowdown. Compared to other stepsize schemes, we empirically demonstrate the effectiveness of exponential stepsizes coupled with a novel variant of SLS. 
Sharan Vaswani · Benjamin DuboisTaine · Reza Babanezhad Harikandeh 🔗 


On ServerSide Stepsizes in Federated Optimization: Theory Explaining the Heuristics
(
Poster
)
>
We present a theoretical study of serverside optimization in federated learning. Our results are the first to show that the widely popular heuristic of scaling the client updates with an extra parameter is extremely useful in the context of Federated Averaging (FedAvg) with local passes over the client data. In particular, we prove that whenever the local stepsizes are small and the update direction is given by FedAvg over all clients, one can take a big leap in the obtained direction and improve the nonconvex rate of convergence from $\mathcal{O}(\varepsilon^{3})$ to $\mathcal{O}(\varepsilon^{2})$. In contrast, if the local stepsizes are large, we prove that the noise of client sampling can be controlled by using a small serverside stepsize. Together, our results on the advantage of large and small severside stepsizes give a formal theoretical justification for the practice of adaptive serverside optimization in federated learning. Moreover, we consider multiple strategies of client participation and cover the options of uniform client sampling, deterministic or adversarial permutation of clients, as well as random permutation of clients.

Grigory Malinovsky · Konstantin Mishchenko · Peter Richtarik 🔗 


A Stochastic Momentum Method for Minmax Bilevel Optimization
(
Poster
)
>
In this paper, we study nonconvex minmax bilevel optimization problem where the outer objective function is nonconvex and strongly concave and the inner objective function is strongly convex. This paper develops a single loop single timescale stochastic algorithm based on moving average estimator, which only requires a general unbiased stochastic oracle with bounded variance. To the best of our knowledge, the only existing work on minmax bilevel
optimization focuses on the ones with an upper objective in certain structure and only achieves an oracle complexity of $\cO(\epsilon^{5})$. Under some mild assumptions on the partial derivatives of both outer and inner objective functions, we provide the first convergence guarantee with an oracle complexity of $\cO(\epsilon^{4})$ for a general class of minmax bilevel problems, which matches the optimal complexity order for solving stochastic nonconvex optimization under a general unbiased stochastic oracle model.

Quanqi Hu · Bokun Wang · Tianbao Yang 🔗 


Deep Neural Networks pruning via the Structured Perspective Regularization
(
Poster
)
>
In Machine Learning, Artificial Neural Networks (ANNs) are a very powerful tool, broadly used in many applications. Often, the selected (deep) architectures include many layers, and therefore a large amount of parameters, which makes training, storage and inference expensive. This motivated a stream of research about compressing the original networks into smaller ones without excessively sacrificing performances. Among the many proposed compression approaches, one of the most popular is \emph{pruning}, whereby entire elements of the ANN (links, nodes, channels, etc.) and the corresponding weights are deleted. Since the nature of the problem is inherently combinatorial (what elements to prune and what not), we propose a new pruning method based on Operations Research tools. We start from a natural MixedInteger Programming model for the problem, and we use the Perspective Reformulation technique to strengthen its continuous relaxation. Projecting away the indicator variables from this reformulation yields a new regularization term, which we call the Structured Perspective Regularization. That leads to structured pruning of the initial architecture. We test our method on some ResNet architectures applied to CIFAR10, CIFAR100 and ImageNet datasets, obtaining competitive performances w.r.t.~the state of the art for structured pruning. 
Matteo Cacciola · Andrea Lodi · Xinlin Li 🔗 


Efficient Calibration of MultiAgent Market Simulators from Time Series with Bayesian Optimization
(
Poster
)
>
Multiagent market simulation is commonly used for testing trading strategies before deploying them to realtime trading. In electronic trading markets only the price or volume time series, that result from interaction of multiple market participants, are typically directly observable. Therefore, multiagent market environments need to be calibrated so that the time series that result from interaction of simulated agents resemble historical  which amounts to solving a highly complex largescale optimization problem. In this paper, we propose a simple and efficient framework for calibrating multiagent market simulator parameters from historical time series observations. First, we consider a novel concept of eligibility set to bypass the potential nonidentifiability issue. Second, we generalize the twosample KolmogorovSmirnov (KS) test with Bonferroni correction to test the similarity between two highdimensional time series distributions, which gives a simple yet effective distance metric between the time series sample sets. Third, we suggest using Bayesian optimization (BO) and trustregion BO (TuRBO) to minimize the aforementioned distance metric. Finally, we demonstrate the efficiency of our framework using numerical experiments. 
Yuanlu Bai · Svitlana Vyetrenko · Henry Lam · Tucker Balch 🔗 


Faster Perturbed Stochastic Gradient Methods for Finding Local Minima
(
Poster
)
>
Escaping from saddle points and finding local minimum is a central problem in nonconvex optimization.
Perturbed gradient methods are perhaps the simplest approach for this problem. However, to find $(\epsilon, \sqrt{\epsilon})$approximate local minima, the existing best stochastic gradient complexity for this type of algorithms is $\tilde O(\epsilon^{3.5})$, which is not optimal.
In this paper, we propose \texttt{Pullback}, a faster perturbed stochastic gradient framework for finding local minima. We show that Pullback with stochastic gradient estimators such as SARAH/SPIDER and STORM can find $(\epsilon, \epsilon_{H})$approximate local minima within $\tilde O(\epsilon^{3} + \epsilon_{H}^{6})$ stochastic gradient evaluations (or $\tilde O(\epsilon^{3})$ when $\epsilon_H = \sqrt{\epsilon}$).
The core idea of our framework is a stepsize ``pullback'' scheme to control the average movement of the iterates, which leads to faster convergence to the local minima.

Zixiang Chen · Dongruo Zhou · Quanquan Gu 🔗 


Adam vs. SGD: Closing the generalization gap on image classification
(
Poster
)
>
Adam is an adaptive deep neural network training optimizer that has been widely used across a variety of applications. However, on image classification problems, its generalization performance is significantly worse than stochastic gradient descent (SGD). By tuning several inner hyperparameters of Adam, it is possible to lift the performance of Adam and close this gap; but this makes the use of Adam computationally expensive. In this paper, we use a new training approach based on layerwise weight normalization (LAWN) to solidly improve Adam's performance and close the gap with SGD. LAWN also helps reduce the impact of batch size on Adam's performance. With speed in tact and performance vastly improved, the AdamLAWN combination becomes an attractive optimizer for use in image classification. 
Aman Gupta · Rohan Ramanath · Jun Shi · Sathiya Keerthi 🔗 


Simulated Annealing for Neural Architecture Search
(
Poster
)
>
Gradientbased Neural Architecture Search (NAS) approaches have achieved remarkable progress in the automated machine learning community. However, previous methods would cause much search time and huge computation resources in a big search space for seeking an optimal network structure. In this work, we propose a novel Simulated Annealing algorithm for NAS, namely SANAS, by adding perturbations to the gradientdescent for saving search cost and boosting the predictive performance of the search architecture. Our proposed algorithm is easy to be adapted to current stateoftheart methods in the literature. We conduct extensive experiments on various benchmarks where the results demonstrate the effectiveness and efficiency of our SANAS in reducing search time and saving computation resources. Compared to previous differentiable methods, our SANAS achieves comparable or better predictive performance under the same setting. 
Shentong Mo · Jingfei Xia · Pinxu Ren 🔗 


Faster QuasiNewton Methods for Linear Composition Problems
(
Poster
)
>
Despite its lack of theoretical guarantees, the limited memory version of BFGS (lBFGS) is widely used in practice for largescale optimization problems. In particular, when combined with the Wolfe conditions, lBFGS tends to find solutions significantly faster than methods with known convergence rates such as gradient descent (GD). Similarly, inexact stepsize searches have outsized practical impact despite, in theory, having the same worstcase complexity as using a constant step size. These search techniques are often used for linear composition problems which are known to allow efficient linesearches and subspace optimization (SO). In this paper, we propose a version of lBFGS that is faster for linear composition problems. Our method combines a lBFGS direction with a momentum direction using SO. We explore practical SO issues that include extending the Wolfe conditions from one to multidimension. We experimentally compare our method to standard lBFGS and to a SO method that is known to be efficient. 
Betty Shea · Mark Schmidt 🔗 


On the convergence of stochastic extragradient for bilinear games using restarted iteration averaging
(
Poster
)
>
We study the stochastic bilinear minimax optimization problem, presenting an analysis of the samesample Stochastic ExtraGradient (SEG) method with constant step size, and presenting variations of the method that yield favorable convergence. We first note that the last iterate of the basic SEG method only contracts to a fixed neighborhood of the Nash equilibrium, independent of the step size. This contrasts sharply with the standard setting of minimization where standard stochastic algorithms converge to a neighborhood that vanishes in proportion to the squareroot (constant) step size. Under the same setting, however, we prove that when augmented with iteration averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure. In the interpolation setting, we achieve an optimal convergence rate up to tight constants. We present numerical experiments that validate our theoretical findings and demonstrate the effectiveness of the SEG method when equipped with iteration averaging and restarting. 
Chris Junchi Li · Yaodong Yu · Nicolas Loizou · Gauthier Gidel · Yi Ma · Nicolas Le Roux perso · Michael Jordan 🔗 


On the convergence of stochastic extragradient for bilinear games using restarted iteration averaging
(
Oral
)
>
We study the stochastic bilinear minimax optimization problem, presenting an analysis of the samesample Stochastic ExtraGradient (SEG) method with constant step size, and presenting variations of the method that yield favorable convergence. We first note that the last iterate of the basic SEG method only contracts to a fixed neighborhood of the Nash equilibrium, independent of the step size. This contrasts sharply with the standard setting of minimization where standard stochastic algorithms converge to a neighborhood that vanishes in proportion to the squareroot (constant) step size. Under the same setting, however, we prove that when augmented with iteration averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure. In the interpolation setting, we achieve an optimal convergence rate up to tight constants. We present numerical experiments that validate our theoretical findings and demonstrate the effectiveness of the SEG method when equipped with iteration averaging and restarting. 
Chris Junchi Li · Yaodong Yu · Nicolas Loizou · Gauthier Gidel · Yi Ma · Nicolas Le Roux perso · Michael Jordan 🔗 


On the Relation between Distributionally Robust Optimization and Data Curation
(
Poster
)
>
Machine learning systems based on minimizing average error have been shown to perform inconsistently across notable subsets of the data, which is not exposed by a low average error for the entire dataset. In consequential social and economic applications, where data represent people, this can lead to discrimination of underrepresented gender and ethnic groups. Distributionally Robust Optimization (DRO) seemingly addresses this problem by minimizing the worst expected risk across subpopulations. We establish theoretical results that clarify the relation between DRO and the optimization of the same loss averaged on an adequately weighted training dataset. A practical implication of our results is that neither DRO nor curating the training set should be construed as a complete solution for bias mitigation. 
Agnieszka Słowik · Leon Bottou 🔗 


On the Relation between Distributionally Robust Optimization and Data Curation
(
Oral
)
>
Machine learning systems based on minimizing average error have been shown to perform inconsistently across notable subsets of the data, which is not exposed by a low average error for the entire dataset. In consequential social and economic applications, where data represent people, this can lead to discrimination of underrepresented gender and ethnic groups. Distributionally Robust Optimization (DRO) seemingly addresses this problem by minimizing the worst expected risk across subpopulations. We establish theoretical results that clarify the relation between DRO and the optimization of the same loss averaged on an adequately weighted training dataset. A practical implication of our results is that neither DRO nor curating the training set should be construed as a complete solution for bias mitigation. 
Agnieszka Słowik · Leon Bottou 🔗 


Policy Mirror Descent for Regularized RL: A Generalized Framework with Linear Convergence
(
Poster
)
>
Policy optimization, which learns the policy of interest by maximizing the value function via largescale optimization, lies at the heart of modern reinforcement learning (RL). In addition to value maximization, other practical considerations arise commonly as well, including the need to encourage exploration, and to ensure certain structural properties due to safety, resource and operational constraints. These considerations can often be accounted for by resorting to regularized RL. Focusing on an infinitehorizon discounted Markov decision process, we propose a generalized policy mirror descent (GPMD) algorithm for solving regularized RL. As a generalization of policy mirror descent [Lan, 2021], the proposed algorithm accommodates a general class of convex regularizers as well as a broad family of Bregman divergence in cognizant of the regularizer in use. We prove that our algorithm converges linearly over an entire range of learning rates, in a dimensionfree fashion, to the global solution, even when the regularizer lacks strong convexity and smoothness. In addition, this fast convergence feature is provably stable in the face of inexact policy evaluation and imperfect policy updates. Numerical experiments are provided to confirm the applicability of GPMD. 
Wenhao Zhan · Shicong Cen · Baihe Huang · Yuxin Chen · Jason Lee · Yuejie Chi 🔗 


Policy Mirror Descent for Regularized RL: A Generalized Framework with Linear Convergence
(
Oral
)
>
Policy optimization, which learns the policy of interest by maximizing the value function via largescale optimization, lies at the heart of modern reinforcement learning (RL). In addition to value maximization, other practical considerations arise commonly as well, including the need to encourage exploration, and to ensure certain structural properties due to safety, resource and operational constraints. These considerations can often be accounted for by resorting to regularized RL. Focusing on an infinitehorizon discounted Markov decision process, we propose a generalized policy mirror descent (GPMD) algorithm for solving regularized RL. As a generalization of policy mirror descent [Lan, 2021], the proposed algorithm accommodates a general class of convex regularizers as well as a broad family of Bregman divergence in cognizant of the regularizer in use. We prove that our algorithm converges linearly over an entire range of learning rates, in a dimensionfree fashion, to the global solution, even when the regularizer lacks strong convexity and smoothness. In addition, this fast convergence feature is provably stable in the face of inexact policy evaluation and imperfect policy updates. Numerical experiments are provided to confirm the applicability of GPMD. 
Wenhao Zhan · Shicong Cen · Baihe Huang · Yuxin Chen · Jason Lee · Yuejie Chi 🔗 


Understanding Memorization from the Perspective of Optimization via Efficient Influence Estimation
(
Poster
)
>
Overparameterized deep neural networks are able to achieve excellent training accuracy while maintaining a small generalization error. It has also been found that they are able to fit arbitrary labels, and this behaviour is referred to as the phenomenon of memorization. In this work, we study the phenomenon of memorization with turnover dropout, an efficient method to estimate influence and memorization, for data with true labels (real data) and data with random labels (random data). Our main findings are: (i) For both real data and random data, the optimization of easy examples (e.g. real data) and difficult examples (e.g. random data) are conducted by the network simultaneously, with easy ones at a higher speed; (ii) For real data, a correct difficult example in the training dataset is more informative than an easy one. By showing the existence of memorization on random data and real data, we highlight the consistency between them regarding optimization and we emphasize the implication of memorization during optimization. 
Futong Liu · Tao Lin · Martin Jaggi 🔗 


Understanding Memorization from the Perspective of Optimization via Efficient Influence Estimation
(
Oral
)
>
Overparameterized deep neural networks are able to achieve excellent training accuracy while maintaining a small generalization error. It has also been found that they are able to fit arbitrary labels, and this behaviour is referred to as the phenomenon of memorization. In this work, we study the phenomenon of memorization with turnover dropout, an efficient method to estimate influence and memorization, for data with true labels (real data) and data with random labels (random data). Our main findings are: (i) For both real data and random data, the optimization of easy examples (e.g. real data) and difficult examples (e.g. random data) are conducted by the network simultaneously, with easy ones at a higher speed; (ii) For real data, a correct difficult example in the training dataset is more informative than an easy one. By showing the existence of memorization on random data and real data, we highlight the consistency between them regarding optimization and we emphasize the implication of memorization during optimization. 
Futong Liu · Tao Lin · Martin Jaggi 🔗 