Workshop
Order up! The Benefits of HigherOrder Optimization in Machine Learning
Albert Berahas · Jelena Diakonikolas · Jarad Forristal · Brandon Reese · Martin Takac · Yan Xu
Room 275  277
Optimization is a cornerstone of nearly all modern machine learning (ML) and deep learning (DL). Simple firstorder gradientbased methods dominate the field for convincing reasons: low computational cost, simplicity of implementation, and strong empirical results.
Yet second or higherorder methods are rarely used in DL, despite also having many strengths: faster periteration convergence, frequent explicit regularization on stepsize, and better parallelization than SGD. Additionally, many scientific fields use secondorder optimization with great success.
A driving factor for this is the large difference in development effort. By the time higherorder methods were tractable for DL, firstorder methods such as SGD and it’s main varients (SGD + Momentum, Adam, …) already had many years of maturity and mass adoption.
The purpose of this workshop is to address this gap, to create an environment where higherorder methods are fairly considered and compared against oneanother, and to foster healthy discussion with the end goal of mainstream acceptance of higherorder methods in ML and DL.
Schedule
Fri 6:15 a.m.  6:30 a.m.

Welcome and Opening Remarks
(
Opening Remarks
)
SlidesLive Video 
🔗 
Fri 6:30 a.m.  7:15 a.m.

Efficient SecondOrder Stochastic Methods for Machine Learning
(
Plenary Talk
)
SlidesLive Video Our talk focuses on training Deep Neural Networks (DNNs), which due to the enormous number of parameters current DNNs have, using the Hessian matrix or a full approximation to it in a secondorder method is prohibitive, both in terms of memory requirements and computational cost per iteration. Hence, to be practical, layerwise blockdiagonal approximations to these matrices are usually used. Here we describe secondorder quasiNewton (QN), natural gradient (NG), and generalized GaussNewton (GGN) methods of this type that are competitive with and often outperform firstorder methods. These methods include those that use layerwise (i) Kroneckerfactored BFGS and LBFGS QN approximations, (ii) tensor normal covariance and (iii) miniblock Fisher matrix approximations, and (iv) ShermanMorrisonWoodbury based variants of NG and GGN methods. 
Donald Goldfarb 🔗 
Fri 7:15 a.m.  8:00 a.m.

Tensor Methods for Nonconvex Optimization.
(
Plenary Talk
)
SlidesLive Video We consider the advantages of having and incorporating higher (than second) order derivative information inside regularization frameworks, generating higherorder regularization algorithms that have better complexity, universal properties and can certify higherorder criticality of candidate solutions. Time permitting, we also discuss inexact settings where problem information and smoothness assumptions are weakened, without affecting the algorithms’ complexity. Efficient solution of some higherorder polynomial subproblems will also be discussed. 
Coralia Cartis 🔗 
Fri 8:00 a.m.  8:30 a.m.

Coffee Break
(
Coffee Break
)

🔗 
Fri 8:00 a.m.  8:30 a.m.

Poster Session I
(
Poster Session
)

🔗 
Fri 8:30 a.m.  8:45 a.m.

Quartic Polynomial Subproblem Solutions in Tensor Methods for Nonconvex Optimization
(
Spotlight Talk
)
SlidesLive Video
There has been growing interest in highorder tensor methods for nonconvex optimization in machine learning as these methods provide better/optimal worstcase evaluation complexity, stability to parameter tuning, and robustness to problem conditioning. The wellknown $p$thorder adaptive regularization (AR$p$) method relies crucially on repeatedly minimising a nonconvex multivariate Taylorbased polynomial subproblem. It remains an open question to find efficient techniques to minimise such a subproblem for $p\ge3$.In this paper, we propose a secondorder method (SQO) for the AR$3$ (AR$p$ with $p=3$) subproblem. SQO approximates the specialstructure quartic polynomial subproblem from above and below by using secondorder models that can be minimised efficiently and globally.We prove that SQO finds a local minimiser of a quartic polynomial, but in practice, due to its construction, it can find a much lower minimum than cubic regularization approaches. This encourages us to continue our quest for algorithmic techniques that find approximately global solutions for such polynomials.

Wenqi Zhu 🔗 
Fri 8:45 a.m.  9:00 a.m.

PSPS: Preconditioned Stochastic Polyak Stepsize method for badly scaled data
(
Spotlight Talk
)
SlidesLive Video The family of Stochastic Gradient Methods with Polyak Stepsize offers an update rule that alleviates the need of finetuning the learning rate of an optimizer. Recent work (Robert M Gower, Mathieu Blondel, Nidham Gazagnadou, and Fabian Pedregosa: Cutting some slack for SGD with adaptive polyak stepsizes) has been proposed to introduce a slack variable, which makes these methods applicable outside of the interpolation regime. In this paper, we combine preconditioning and slack in an updated optimization algorithm to show its performance on badly scaled and/or illconditioned datasets. We use Hutchinson's method to obtain an estimate of a Hessian which is used as the preconditioner. 
Farshed Abdukhakimov 🔗 
Fri 9:00 a.m.  9:15 a.m.

DRSOM: A Dimension Reduced SecondOrder Method
(
Spotlight Talk
)
SlidesLive Video In this paper, we propose a DimensionReduced SecondOrder Method (DRSOM) for convex and nonconvex (unconstrained) optimization. Under a trustregionlike framework, our method preserves the convergence of the secondorder method while using only Hessianvector products in a few directions, which enables the computational overhead of our method remain comparable to the firstorder such as the gradient descent method. Theoretically, we show that the method has a local quadratic convergence and a global convergence rate of O(ϵ −3/2 ) to satisfy the firstorder and secondorder conditions under a commonly adopted approximated Hessian assumption. We further show that this assumption can be removed if we perform a step of the Lanczos method periodically at the endstage of the algorithm. The applicability and performance of DRSOM are exhibited by various computational experiments, particularly in machine learning and deep learning. For neural networks, our preliminary implementation seems to gain computational advantages in terms of training accuracy and iteration complexity over stateoftheart firstorder methods such as SGD and ADAM. 
Chuwen Zhang 🔗 
Fri 9:15 a.m.  9:30 a.m.

Disentangling the Mechanisms Behind Implicit Regularization in SGD
(
Spotlight Talk
)
SlidesLive Video A number of competing hypotheses have been proposed to explain why smallbatch Stochastic Gradient Descent (SGD) leads to improved generalization over the fullbatch regime, with recent work crediting the implicit regularization of various quantities throughout training. However, to date, empirical evidence assessing the explanatory power of these hypotheses is lacking. In this paper, we conduct an extensive empirical evaluation, focusing on the ability of various theorized mechanisms to close the smalltolarge batch generalization gap. Additionally, we characterize how the quantities that SGD has been claimed to (implicitly) regularize change over the course of training. By using microbatches, i.e. disjoint smaller subsets of each minibatch, we empirically show that explicitly penalizing the gradient norm or the Fisher Information Matrix trace, averaged over microbatches, in the largebatch regime recovers smallbatch SGD generalization, whereas Jacobianbased regularizations fail to do so. This generalization performance is shown to often be correlated with how well the regularized model’s gradient norms resemble those of smallbatch SGD. We additionally show that this behavior breaks down as the microbatch size approaches the batch size. Finally, we note that in this line of inquiry, positive experimental findings on CIFAR10 are often reversed on other datasets like CIFAR100, highlighting the need to test hypotheses on a wider collection of datasets. 
Zachary Novack 🔗 
Fri 9:30 a.m.  9:45 a.m.

Cubic Regularized QuasiNewton Methods
(
Spotlight Talk
)
SlidesLive Video In this paper, we propose a Cubic Regularized LBFGS. Cubic Regularized Newton outperforms the classical Newton method in terms of global performance. In classics, LBFGS approximation is applied for the Newton method. We propose a new variant of inexact Cubic Regularized Newton. Then, we use LBFGS approximation as an inexact Hessian for Cubic Regularized Newton. It allows us to get better theoretical convergence rates and good practical performance, especially from the points where classical Newton is diverging. 
Klea Ziu 🔗 
Fri 9:45 a.m.  10:00 a.m.

Alternating minimization for generalized rank one matrix sensing: Sharp predictions from a random initialization
(
Spotlight Talk
)
SlidesLive Video We consider the problem of estimating the factors of a rank1 matrix with i.i.d. Gaussian, rank1 measurements that are nonlinearly transformed and corrupted by noise. Considering two prototypical choices for the nonlinearity, we study the convergence properties of a natural alternating update rule for this nonconvex optimization problem starting from a random initialization. We show sharp linear convergence guarantees for a samplesplit version of the algorithm by deriving a deterministic recursion that is accurate even in highdimensional problems. Our sharp, nonasymptotic analysis also exposes several other finegrained properties of this problem, including how the nonlinearity, sample size, and noise level affect convergence behavior.Our results are enabled by showing that the empirical error recursion can be predicted by our deterministic sequence within fluctuations of the order n^{1/2} when each iteration is run with n observations. Our technique leverages leaveoneout tools and provides an avenue for sharply analyzing higherorder iterative algorithms from a random initialization in other optimization problems with random data. 
Kabir Chandrasekher 🔗 
Fri 10:00 a.m.  11:30 a.m.

Lunch Break
(
Lunch Break
)

🔗 
Fri 11:30 a.m.  12:15 p.m.

Deterministically Constrained Stochastic Optimization
(
Plenary Talk
)
SlidesLive Video This talk highlights the recent work by my research group on the design, analysis, and implementation of algorithms for solving continuous nonlinear optimization problems that involve a stochastic objective function and deterministic constraints. We will focus on our sequential quadratic optimization (commonly known as SQP) methods for cases when the constraints are defined by nonlinear systems of equations and inequalities. Our methods are applicable for solving various types of problems, such as for training machine learning (e.g., deep learning) models with constraints. Our work focuses on the "fully stochastic" regime in which only stochastic gradient estimates are employed, for which we have derived convergenceinexpectation results and worstcase iteration complexity bounds that are on par with stochastic gradient methods for the unconstrained setting. We will also discuss the various extensions that my group is exploring. 
Frank E. Curtis 🔗 
Fri 12:15 p.m.  1:00 p.m.

Low Rank Approximation for Faster Convex Optimization
(
Plenary Talk
)
SlidesLive Video Low rank structure is pervasive in realworld datasets. This talk shows how to accelerate the solution of fundamental computational problems, including eigenvalue decomposition, linear system solves, and composite convex optimization,by exploiting this low rank structure. We present a simple and efficient method for approximate top eigendecomposition based on randomized numerical linear algebra. Armed with this primitive, we design a new randomized preconditioner for the conjugate gradient method, and a method called NysADMM, based on the inexact alternating directions method of multipliers, for composite convex optimization. These methods come with strong theoretical and numerical support. Indeed, a simple implementation of NysADMM solves important largescale statistical problems like lasso, logistic regression, and support vector machines 258x faster than standard solvers. 
Madeleine Udell 🔗 
Fri 1:00 p.m.  1:30 p.m.

Coffee Break
(
Coffee Break
)

🔗 
Fri 1:00 p.m.  2:00 p.m.

Poster Session II
(
Poster Session
)

🔗 
Fri 2:00 p.m.  2:45 p.m.

A Fast, Fisher Based Pruning of Transformers without Retraining
(
Plenary Talk
)
SlidesLive Video Pruning is an effective way to reduce the huge inference cost of large Transformer models. However, prior work on model pruning requires retraining the model. This can add high cost and complexity to model deployment, making it difficult to use in many practical situations. To address this, we propose a fast post training pruning framework for Transformers that does not require any retraining. Given a resource constraint and a sample dataset, our framework automatically prunes the Transformer model using structured sparsity methods. To retain high accuracy without retraining, we introduce three novel techniques: (i) a lightweight mask search algorithm that finds which heads and filters to prune based on the Fisher information; (ii) mask rearrangement that complements the search algorithm; and (iii) mask tuning that reconstructs the output activations for each layer. We apply our method to BERTBASE and DistilBERT, and we evaluate its effectiveness on GLUE and SQuAD benchmarks. Our framework achieves up to 2.0x reduction in FLOPs and 1.56x speedup in inference latency, while maintaining < 1\% loss in accuracy. Importantly, our framework prunes Transformers in less than 3 minutes on a single GPU, which is over two orders of magnitude faster than existing pruning approaches that retrain. 
Amir Gholami 🔗 
Fri 2:45 p.m.  3:00 p.m.

Closing Remarks
(
Closing Remarks
)
SlidesLive Video 
🔗 


How Small Amount of Data Sharing Benefits HigherOrder Distributed Optimization and Learning
(
Poster
)
Distributed optimization algorithms have been widely used in machine learning and statistical estimation. While distributed algorithms have the merits in parallel processing and protecting local data security, they often suffer from slow convergence compared with centralized optimization algorithms. This paper focuses on how small amount of data sharing could benefit distributed higherorder optimization algorithms with its application in learning problems. Specifically, we consider how data sharing could benefit distributed multiblock alternating direction method of multipliers (ADMM) and preconditioned conjugate gradient method (PCG) with application in machine learning tasks of linear and logistic regression. These algorithms are commonly known as algorithms between the first and the second order methods. Theoretically, we prove that a small amount of data share leads to improvements from nearworst to nearoptimal convergence rate when applying ADMM and PCG methods to machine learning tasks. A side theory product is the tight worstcase bound of linear convergence rate for distributed ADMM applied in linear regression. We further propose a meta randomized datasharing scheme and provide its tailored applications in multiblock ADMM and PCG methods in order to enjoy both the benefit from datasharing and from the efficiency of distributed computing. From the numerical evidences, we are convinced that our algorithms provide good quality of estimators in both the least square and the logistic regressions within much fewer iterations by only sharing a small amount of prefixed data, while purely distributed optimization algorithms may take hundreds more times of iterations to converge. We hope that the discovery resulted from this paper would encourage even small amount of data sharing among different regions to combat difficult global learning problems. 
Mingxi Zhu · Yinyu Ye 🔗 


A Stochastic Conjugate Subgradient Algorithm for Kernelized Support Vector Machines: The Evidence
(
Poster
)
Kernel Support Vector Machines (Kernel SVM) provide a powerful class of toolsfor classifying data whose classes are best identified via a nonlinear function.While a Kernel SVM is usually treated as a Quadratic Program (QP), its solution is usually obtained using stochastic gradient descent (SGD). In this paper we treat the Kernel SVM as a Stochastic Quadratic Linear Programming (SQLP) problem which motivates a decompositionbased algorithm that separates parameter choice from error estimation, with the latter being separable by data points. In order to takeadvantage of the quadratic structure due to the kernel matrix we introduce aconjugate subgradient approach. While convergence of the new method can be shown, the focus of this brief paper is on computational evidence which illustrates that our method maintains the scalability of SGD, while improving the accuracy of classification/optimization. 
Di Zhang · Suvrajeet Sen 🔗 


Distributed NewtonType Methods with Communication Compression and Bernoulli Aggregation
(
Poster
)
Despite their high computation and communication costs, Newtontype methods remain an appealing option for distributed training due to their robustness against illconditioned convex problems. In this work, we study {\em communication compression} and {\em aggregation mechanisms} for curvature information in order to reduce these costs while preserving theoretically superior local convergence guarantees. We prove that the recently developed class of {\em three point compressors (3PC)} of Richtárik et al. [2022] for gradient communication can be generalized to Hessian communication as well. This result opens up a wide variety of communication strategies, such as {\em contractive compression} and {\em lazy aggregation}, available to our disposal to compress prohibitively costly curvature information. Moreover, we discovered several new 3PC mechanisms, such as {\em adaptive thresholding} and {\em Bernoulli aggregation}, which require reduced communication and occasional Hessian computations. Furthermore, we extend and analyze our approach to bidirectional communication compression and partial device participation setups to cater to the practical considerations of applications in federated learning. For all our methods, we derive fast {\em conditionnumberindependent} local linear and/or superlinear convergence rates. Finally, with extensive numerical evaluations on convex optimization problems, we illustrate that our designed schemes achieve stateoftheart communication complexity compared to several key baselines using secondorder information. 
Rustem Islamov · Xun Qian · Slavomír Hanzely · Mher Safaryan · Peter Richtarik 🔗 


Fully Stochastic TrustRegion Sequential Quadratic Programming for EqualityConstrained Optimization Problems
(
Poster
)
We propose a fully stochastic trustregion sequential quadratic programming (TRStoSQP) algorithm to solve nonlinear optimization problems. The problems involve a stochastic objective and deterministic equality constraints. Under the fully stochastic setup, we suppose that only a single sample is generated in each iteration to estimate the objective gradient. Compared to the existing linesearch StoSQP schemes, our algorithm allows one to employ indefinite Hessian matrices for SQP subproblems. The algorithm adaptively selects the radius of the trust region based on an input sequence $\{\beta_k\}$, the estimated KKT residual, and the estimated Lipschitz constants of the objective gradients and constraint Jacobians. To address the infeasibility issue of trustregion methods that arises in constrained optimization, we propose an adaptive relaxation technique to compute the trial step. In particular, we decompose the trial step into a normal step and a tangential step. Based on the ratios of the feasibility and optimality residuals to the full KKT residual, we decompose the full trustregion radius into two segments that are used to control the size of the normal and tangential steps, respectively. The normal step has a closed form, while the tangential step is solved from a trustregion subproblem, of which the Cauchy point is sufficient for our study. We establish the global almost sure convergence guarantee of TRStoSQP, and demonstrate its empirical performance on a subset of problems in CUTEst test set.

Yuchen Fang · Sen Na · Mladen Kolar 🔗 


TrustRegion Sequential Quadratic Programming for Stochastic Optimization with Random Models: FirstOrder Stationarity
(
Poster
)
We consider optimization problems with a stochastic objective and deterministic constraints, and design a trustregion sequential quadratic programming (TRSQP) method to solve them. We name our method TRSQP for STochastic Optimization with Random Models (TRSQPSTORM). In each iteration, our algorithm constructs a random model for the objective that satisfies suitable accuracy conditions with a high but fixed probability. The algorithm decides whether a trial step is successful or not based on two ratios: the ratio between the estimated actual reduction and predicted reduction on the $\ell_2$ merit function, and the ratio between the estimated KKT residual and trustregion radius. For each successful step, the algorithm increases the trustregion radius, and further decides whether the step is reliable or not based on the amount of the predicted reduction. If the step is reliable, then the algorithm relaxes the accuracy conditions for the next iteration. To resolve the infeasibility issue of trustregion methods for constrained problems, we employ an adaptive relaxation technique proposed by a companion paper. Under reasonable assumptions, we establish the global firstorder convergence guarantee: the KKT residual converges to zero almost surely. We apply our method on a subset of problems in CUTEst set to demonstrate its empirical performance.

Yuchen Fang · Sen Na · Mladen Kolar 🔗 


Black Box Lie Group Preconditioners for SGD
(
Poster
)
A matrix free and a low rank approximation preconditioner are proposed to accelerate the convergence of stochastic gradient descent (SGD) by exploiting curvature information sampled from Hessianvector products or finite differences of parameters and gradients similar to the BFGS algorithm. Both preconditioners are fitted with an online updating manner minimizing a criterion that is free of line search and robust to stochastic gradient noise, and further constrained to be on certain connected Lie groups to preserve their corresponding symmetry or invariance, e.g., orientation of coordinates by the connected general linear group with positive determinants. The Lie group's equivariance property facilitates preconditioner fitting, and its invariance property saves any need of damping, which is common in second order optimizers, but difficult to tune. The learning rate for parameter updating and step size for preconditioner fitting are naturally normalized, and their default values work well in most situations. 
Xilin Li 🔗 


Perseus: A Simple and Optimal HighOrder Method for Variational Inequalities
(
Poster
)
This paper settles an open and challenging question pertaining to the design of simple highorder regularization methods for solving smooth and monotone variational inequalities (VIs). A VI involves finding $x^\star \in \XCal$ such that $\langle F(x), x  x^\star\rangle \geq 0$ for all $x \in \XCal$ and we consider the setting where $F: \br^d \mapsto \br^d$ is smooth with up to $(p1)^{\textnormal{th}}$order derivatives. Highorder methods based on similar binary search procedures have been further developed and shown to achieve a rate of $O(\epsilon^{2/(p+1)}\log(1/\epsilon))$~\citep{Bullins2020Higher,Lin2021Monotone,Jiang2022Generalized}. However, such search procedure can be computationally prohibitive in practice~\citep{Nesterov2018Lectures} and the problem of finding a simple highorder regularization methods remains as an open and challenging question in the optimization theory. We propose a $p^{\textnormal{th}}$order method that does \textit{not} require any binary search procedure and prove that it can converge to a weak solution at a global rate of $O(\epsilon^{2/(p+1)})$. A lower bound of $\Omega(\epsilon^{2/(p+1)})$ is also established under a linear span assumption to show that our $p^{\textnormal{th}}$order method is optimal in the monotone setting. A version with restarting attains a global linear and local superlinear convergence rate for smooth and strongly monotone VIs. Our method can achieve a global rate of $O(\epsilon^{2/p})$ for solving smooth and nonmonotone VIs satisfying the Minty condition. The restarted version again attains a global linear and local superlinear convergence rate if the strong Minty condition holds.

Tianyi Lin · Michael Jordan 🔗 


HighOrder Optimization of Gradient Boosted Decision Trees
(
Poster
)
Gradient Boosted Decision Trees (GBDTs) are dominant machine learning algorithms for modeling discrete or tabular data. Unlike neural networks with millions of trainable parameters, GBDTs optimize loss function in an additive manner and have a single trainable parameter per leaf, which makes it easy to apply highorder optimization of the loss function. In this paper, we introduce highorder optimization for GBDTs based on numerical optimization theory which allows us to construct trees based on highorder derivatives of a given loss function. In the experiments, we show that highorder optimization has faster periteration convergence that leads to reduced running time. Our solution can be easily parallelized and run on GPUs with little overhead on the code. Finally, we discuss future potential improvements such as automatic differentiation of arbitrary loss function and combination of GBDTs with neural networks. 
Jean Pachebat · Sergey IVANOV 🔗 


On the Global Convergence of the Regularized Generalized GaussNewton Algorithm
(
Poster
)
We detail the global convergence rates of a regularized generalized GaussNewton algorithm applied to compositional problems with surjective inner Jacobian mappings. Our analysis uncovers several convergence phases for the algorithm and identifies the key condition numbers governing the complexity of the algorithm. We present an implementation with a linesearch adaptive to the constants of the problem. 
Vincent Roulet · Maryam Fazel · Siddhartha Srinivasa · Zaid Harchaoui 🔗 


Disentangling the Mechanisms Behind Implicit Regularization in SGD
(
Poster
)
A number of competing hypotheses have been proposed to explain why smallbatch Stochastic Gradient Descent (SGD) leads to improved generalization over the fullbatch regime, with recent work crediting the implicit regularization of various quantities throughout training. However, to date, empirical evidence assessing the explanatory power of these hypotheses is lacking. In this paper, we conduct an extensive empirical evaluation, focusing on the ability of various theorized mechanisms to close the smalltolarge batch generalization gap. Additionally, we characterize how the quantities that SGD has been claimed to (implicitly) regularize change over the course of training. By using microbatches, i.e. disjoint smaller subsets of each minibatch, we empirically show that explicitly penalizing the gradient norm or the Fisher Information Matrix trace, averaged over microbatches, in the largebatch regime recovers smallbatch SGD generalization, whereas Jacobianbased regularizations fail to do so. This generalization performance is shown to often be correlated with how well the regularized model’s gradient norms resemble those of smallbatch SGD. We additionally show that this behavior breaks down as the microbatch size approaches the batch size. Finally, we note that in this line of inquiry, positive experimental findings on CIFAR10 are often reversed on other datasets like CIFAR100, highlighting the need to test hypotheses on a wider collection of datasets. 
Zachary Novack · Simran Kaur · Tanya Marwah · Saurabh Garg · Zachary Lipton 🔗 


Effects of momentum scaling for SGD
(
Poster
)
The paper studies the properties of stochastic gradient methods with preconditioning. We focus on momentum updated preconditioners with momentum coefficient $\beta$. Seeking to explain practical efficiency of scaled methods, we provide convergence analysis in a norm associated with preconditioner, and demonstrate that scaling allows one to get rid of gradients Lipschitz constant in convergence rates. Along the way, we emphasize important role of $\beta$, undeservedly set to constant $0.99...9$ at the arbitrariness of various authors. Finally, we propose the explicit constructive formulas for adaptive $\beta$ and step size values.

Dmitry A. Pasechnyuk · Alexander Gasnikov · Martin Takac 🔗 


Using quadratic equations for overparametrized models
(
Poster
)
Recently the SP (Stochastic Polyak step size) method has emerged as a competitive adaptive method for setting the step sizes of SGD. SP can be interpreted as a method specialized to interpolated models, since it solves the \emph{interpolation equations}. SP solves these equation by using local linearizations of the model. We take a step further and develop a method for solving the interpolation equations that uses the local secondorder approximation of the model. Our resulting method SP2 uses Hessianvector products to speedup the convergence of SP. Furthermore, and rather uniquely among secondorder methods, the design of SP2 in no way relies on positive definite Hessian matrices or convexity of the objective function. We show SP2 is very competitive on matrix completion, nonconvex test problems and logistic regression. We also provide a convergence theory on sumsofquadratics. 
Shuang Li · William Swartworth · Martin Takac · Deanna Needell · Robert Gower 🔗 


Alternating minimization for generalized rank one matrix sensing: Sharp predictions from a random initialization
(
Poster
)
We consider the problem of estimating the factors of a rank1 matrix with i.i.d. Gaussian, rank1 measurements that are nonlinearly transformed and corrupted by noise. Considering two prototypical choices for the nonlinearity, we study the convergence properties of a natural alternating update rule for this nonconvex optimization problem starting from a random initialization. We show sharp linear convergence guarantees for a samplesplit version of the algorithm by deriving a deterministic recursion that is accurate even in highdimensional problems. Our sharp, nonasymptotic analysis also exposes several other finegrained properties of this problem, including how the nonlinearity, sample size, and noise level affect convergence behavior.Our results are enabled by showing that the empirical error recursion can be predicted by our deterministic sequence within fluctuations of the order n^{1/2} when each iteration is run with n observations. Our technique leverages leaveoneout tools and provides an avenue for sharply analyzing higherorder iterative algorithms from a random initialization in other optimization problems with random data. 
MENGQI LOU · Kabir Chandrasekher · Ashwin Pananjady 🔗 


Improving LevenbergMarquardt Algorithm for Neural Networks
(
Poster
)
We explore the usage of the LevenbergMarquardt(LM) algorithm for regression (nonlinear least squares) and classification (generalized GaussNewton methods) tasks in neural networks. We compare the performance of the LM method with other popular firstorder algorithms such as SGD and Adam, as well as other secondorder algorithms such as LBFGS, HessianFree and KFAC. We further speed up the LM method by using adaptive momentum, learning rate line search, and uphill step acceptance. 
Omead Pooladzandi · Yiming Zhou 🔗 


DRSOM: A Dimension Reduced SecondOrder Method
(
Poster
)
In this paper, we propose a DimensionReduced SecondOrder Method (DRSOM) for convex and nonconvex (unconstrained) optimization. Under a trustregionlike framework, our method preserves the convergence of the secondorder method while using only Hessianvector products in a few directions, which enables the computational overhead of our method remain comparable to the firstorder such as the gradient descent method. Theoretically, we show that the method has a local quadratic convergence and a global convergence rate of O(ϵ −3/2 ) to satisfy the firstorder and secondorder conditions under a commonly adopted approximated Hessian assumption. We further show that this assumption can be removed if we perform a step of the Lanczos method periodically at the endstage of the algorithm. The applicability and performance of DRSOM are exhibited by various computational experiments, particularly in machine learning and deep learning. For neural networks, our preliminary implementation seems to gain computational advantages in terms of training accuracy and iteration complexity over stateoftheart firstorder methods such as SGD and ADAM. 
Chuwen Zhang · Jiang Bo · Chang He · Yuntian Jiang · Dongdong Ge · Yinyu Ye 🔗 


Randomsubspace adaptive cubic regularisation method for nonconvex optimisation
(
Poster
)
We investigate secondorder methods for nonconvex optimisation, and propose a Random Subspace Adaptive Cubic Regularisation (RARC) method, which we analyse under various assumptions on the objective function and the sketching matrices that generate the random subspaces. We show that, when the sketching matrix achieves a subspace embedding of the augmented matrix of the gradient and the Hessian with sufficiently high probability, then the RARC method satisfies, with high probability, a complexity bound of order O(ε^{3/2}) to drive the (full) gradient norm below ε; matching in the accuracy order its deterministic counterpart (ARC). As an illustration, we particularise our results to the special case of a scaled Gaussian ensemble. 
Coralia Cartis · Zhen Shao 🔗 


Quartic Polynomial Subproblem Solutions in Tensor Methods for Nonconvex Optimization
(
Poster
)
There has been growing interest in highorder tensor methods for nonconvex optimization in machine learning as these methods provide better/optimal worstcase evaluation complexity, stability to parameter tuning, and robustness to problem conditioning. The wellknown $p$thorder adaptive regularization (AR$p$) method relies crucially on repeatedly minimising a nonconvex multivariate Taylorbased polynomial subproblem. It remains an open question to find efficient techniques to minimise such a subproblem for $p\ge3$.In this paper, we propose a secondorder method (SQO) for the AR$3$ (AR$p$ with $p=3$) subproblem. SQO approximates the specialstructure quartic polynomial subproblem from above and below by using secondorder models that can be minimised efficiently and globally.We prove that SQO finds a local minimiser of a quartic polynomial, but in practice, due to its construction, it can find a much lower minimum than cubic regularization approaches. This encourages us to continue our quest for algorithmic techniques that find approximately global solutions for such polynomials.

Wenqi Zhu · Coralia Cartis 🔗 


FLECSCGD: A Federated Learning SecondOrder Framework via Compression and Sketching with Compressed Gradient Differences
(
Poster
)
In the recent paper FLECS (Agafonov et al, FLECS: A Federated Learning SecondOrder Framework via Compression and Sketching), the secondorder framework FLECS was proposed for the Federated Learning problem. This method utilize compression of sketched Hessians to make communication costs low. However, the main bottleneck of FLECS is gradient communication without compression. In this paper, we propose the modification of FLECS with compressed gradient differences, which we call FLECSCGD (FLECS with Compressed Gradient Differences) and make it applicable for stochastic optimization. Convergence guarantees are provided in strongly convex and nonconvex cases. Experiments show the practical benefit of proposed approach. 
Artem Agafonov · Brahim Erraji · Martin Takac 🔗 


Statistical and Computational Complexities of BFGS QuasiNewton Method for Generalized Linear Models
(
Poster
)
The gradient descent (GD) method has been used widely to solve parameter estimation in generalized linear models (GLMs), a generalization of linear models when the link function can be nonlinear. While GD has optimal statistical and computational complexities for estimating the true parameter under the high signaltonoise ratio (SNR) regime of the GLMs, it has suboptimal complexities when the SNR is low, namely, the iterates of GD require polynomial number of iterations to reach the final statistical radius. The slow convergence of GD for the low SNR case is mainly due to the local convexity of the leastsquare loss functions of the GLMs. To address the shortcomings of GD, we propose to use the BFGS quasiNewton method to solve parameter estimation of the GLMs. On the optimization side, when the SNR is low, we demonstrate that iterates of BFGS converge linearly to the optimal solution of the population leastsquare loss function. On the statistical side, we prove that the iterates of BFGS reach the final statistical radius of the low SNR GLMs after a logarithmic number of iterations, which is much lower than the polynomial number of iterations of GD. We also present numerical experiments that match our theoretical findings. 
Qiujiang Jin · Aryan Mokhtari · Nhat Ho · Tongzheng Ren 🔗 


The Tradeoffs of Incremental Linearization Algorithms for Nonsmooth Composite Problems
(
Poster
)
GaussNewton methods and their stochastic version have been widely used in machine learning. Their nonsmooth counterparts, modified GaussNewton or proxlinear algorithms, can lead to contrasted outcomes when compared to gradient descent in large scale settings. We explore the contrasting performance of these two classes of algorithms in theory on a stylized statistical example, and experimentally on learning problems including structured prediction. 
Krishna Pillutla · Vincent Roulet · Sham Kakade · Zaid Harchaoui 🔗 


Cubic Regularized QuasiNewton Methods
(
Poster
)
In this paper, we propose a Cubic Regularized LBFGS. Cubic Regularized Newton outperforms the classical Newton method in terms of global performance. In classics, LBFGS approximation is applied for the Newton method. We propose a new variant of inexact Cubic Regularized Newton. Then, we use LBFGS approximation as an inexact Hessian for Cubic Regularized Newton. It allows us to get better theoretical convergence rates and good practical performance, especially from the points where classical Newton is diverging. 
Dmitry Kamzolov · Klea Ziu · Artem Agafonov · Martin Takac 🔗 


ASDL: A Unified Interface for Gradient Preconditioning in PyTorch
(
Poster
)
Gradient preconditioning is a key technique to integrate the secondorder information into gradients for improving and extending gradientbased learning algorithms. In deep learning, stochasticity, nonconvexity, and high dimensionality lead to a wide variety of gradient preconditioning methods, with implementation complexity and inconsistent performance and feasibility. We propose the Automatic Secondorder Differentiation Library (ASDL), an extension library for PyTorch, which offers various implementations and a plugandplay unified interface for gradient preconditioning. ASDL enables the study and structured comparison of a range of gradient preconditioning methods. 
Kazuki Osawa · Satoki Ishikawa · Rio Yokota · Shigang Li · Torsten Hoefler 🔗 


PSPS: Preconditioned Stochastic Polyak Stepsize method for badly scaled data
(
Poster
)
The family of Stochastic Gradient Methods with Polyak Stepsize offers an update rule that alleviates the need of finetuning the learning rate of an optimizer. Recent work (Robert M Gower, Mathieu Blondel, Nidham Gazagnadou, and Fabian Pedregosa: Cutting some slack for SGD with adaptive polyak stepsizes) has been proposed to introduce a slack variable, which makes these methods applicable outside of the interpolation regime. In this paper, we combine preconditioning and slack in an updated optimization algorithm to show its performance on badly scaled and/or illconditioned datasets. We use Hutchinson's method to obtain an estimate of a Hessian which is used as the preconditioner. 
Farshed Abdukhakimov · Chulu Xiang · Dmitry Kamzolov · Robert Gower · Martin Takac 🔗 


HesScale: Scalable Computation of Hessian Diagonals
(
Poster
)
Secondorder optimization uses curvature information about the objective function, which can help in faster convergence. However, such methods typically require expensive computation of the Hessian matrix, preventing their usage in a scalable way. The absence of efficient ways of computation drove the most widely used methods to focus on firstorder approximations that do not capture the curvature information. In this paper, we develop HesScale, a scalable approach to approximating the diagonal of the Hessian matrix, to incorporate secondorder information in a computationally efficient manner. We show that HesScale has the same computational complexity as backpropagation. Our results on supervised classification show that HesScale achieves high approximation accuracy, allowing for scalable and efficient secondorder optimization. 
Mohamed Elsayed · Rupam Mahmood 🔗 