Timezone: »

Workshop
Order up! The Benefits of Higher-Order Optimization in Machine Learning
Albert Berahas · Jelena Diakonikolas · Jarad Forristal · Brandon Reese · Martin Takac · Yan Xu

Fri Dec 02 06:15 AM -- 03:00 PM (PST) @ Room 275 - 277

Optimization is a cornerstone of nearly all modern machine learning (ML) and deep learning (DL). Simple first-order gradient-based methods dominate the field for convincing reasons: low computational cost, simplicity of implementation, and strong empirical results.

Yet second- or higher-order methods are rarely used in DL, despite also having many strengths: faster per-iteration convergence, frequent explicit regularization on step-size, and better parallelization than SGD. Additionally, many scientific fields use second-order optimization with great success.

A driving factor for this is the large difference in development effort. By the time higher-order methods were tractable for DL, first-order methods such as SGD and it’s main varients (SGD + Momentum, Adam, …) already had many years of maturity and mass adoption.

The purpose of this workshop is to address this gap, to create an environment where higher-order methods are fairly considered and compared against one-another, and to foster healthy discussion with the end goal of mainstream acceptance of higher-order methods in ML and DL.

 Fri 6:15 a.m. - 6:30 a.m. Welcome and Opening Remarks (Opening Remarks) 🔗 Fri 6:30 a.m. - 7:15 a.m. Efficient Second-Order Stochastic Methods for Machine Learning (Plenary Talk) Our talk focuses on training Deep Neural Networks (DNNs), which due to the enormous number of parameters current DNNs have, using the Hessian matrix or a full approximation to it in a second-order method is prohibitive, both in terms of memory requirements and computational cost per iteration. Hence, to be practical, layer-wise block-diagonal approximations to these matrices are usually used. Here we describe second-order quasi-Newton (QN), natural gradient (NG), and generalized Gauss-Newton (GGN) methods of this type that are competitive with and often outperform first-order methods. These methods include those that use layer-wise (i) Kronecker-factored BFGS and L-BFGS QN approximations, (ii) tensor normal covariance and (iii) mini-block Fisher matrix approximations, and (iv) Sherman-Morrison-Woodbury based variants of NG and GGN methods. Donald Goldfarb 🔗 Fri 7:15 a.m. - 8:00 a.m. Tensor Methods for Nonconvex Optimization. (Plenary Talk) We consider the advantages of having and incorporating higher- (than second-) order derivative information inside regularization frameworks, generating higher-order regularization algorithms that have better complexity, universal properties and can certify higher-order criticality of candidate solutions. Time permitting, we also discuss inexact settings where problem information and smoothness assumptions are weakened, without affecting the algorithms’ complexity. Efficient solution of some higher-order polynomial subproblems will also be discussed. Coralia Cartis 🔗 Fri 8:00 a.m. - 8:30 a.m. Coffee Break 🔗 Fri 8:00 a.m. - 8:30 a.m. Poster Session I (Poster Session) 🔗 Fri 8:30 a.m. - 8:45 a.m. Quartic Polynomial Sub-problem Solutions in Tensor Methods for Nonconvex Optimization (Spotlight Talk) There has been growing interest in high-order tensor methods for nonconvex optimization in machine learning as these methods provide better/optimal worst-case evaluation complexity, stability to parameter tuning, and robustness to problem conditioning. The well-known $p$th-order adaptive regularization (AR$p$) method relies crucially on repeatedly minimising a nonconvex multivariate Taylor-based polynomial sub-problem. It remains an open question to find efficient techniques to minimise such a sub-problem for $p\ge3$.In this paper, we propose a second-order method (SQO) for the AR$3$ (AR$p$ with $p=3$) sub-problem. SQO approximates the special-structure quartic polynomial sub-problem from above and below by using second-order models that can be minimised efficiently and globally.We prove that SQO finds a local minimiser of a quartic polynomial, but in practice, due to its construction, it can find a much lower minimum than cubic regularization approaches. This encourages us to continue our quest for algorithmic techniques that find approximately global solutions for such polynomials. Wenqi Zhu 🔗 Fri 8:45 a.m. - 9:00 a.m. PSPS: Preconditioned Stochastic Polyak Step-size method for badly scaled data (Spotlight Talk) The family of Stochastic Gradient Methods with Polyak Step-size offers an update rule that alleviates the need of fine-tuning the learning rate of an optimizer. Recent work (Robert M Gower, Mathieu Blondel, Nidham Gazagnadou, and Fabian Pedregosa: Cutting some slack for SGD with adaptive polyak stepsizes) has been proposed to introduce a slack variable, which makes these methods applicable outside of the interpolation regime. In this paper, we combine preconditioning and slack in an updated optimization algorithm to show its performance on badly scaled and/or ill-conditioned datasets. We use Hutchinson's method to obtain an estimate of a Hessian which is used as the preconditioner. Farshed Abdukhakimov 🔗 Fri 9:00 a.m. - 9:15 a.m. DRSOM: A Dimension Reduced Second-Order Method (Spotlight Talk) In this paper, we propose a Dimension-Reduced Second-Order Method (DRSOM) for convex and nonconvex (unconstrained) optimization. Under a trust-region-like framework, our method preserves the convergence of the second-order method while using only Hessianvector products in a few directions, which enables the computational overhead of our method remain comparable to the first-order such as the gradient descent method. Theoretically, we show that the method has a local quadratic convergence and a global convergence rate of O(ϵ −3/2 ) to satisfy the first-order and second-order conditions under a commonly adopted approximated Hessian assumption. We further show that this assumption can be removed if we perform a step of the Lanczos method periodically at the end-stage of the algorithm. The applicability and performance of DRSOM are exhibited by various computational experiments, particularly in machine learning and deep learning. For neural networks, our preliminary implementation seems to gain computational advantages in terms of training accuracy and iteration complexity over state-of-the-art first-order methods such as SGD and ADAM. Chuwen Zhang 🔗 Fri 9:15 a.m. - 9:30 a.m. Disentangling the Mechanisms Behind Implicit Regularization in SGD (Spotlight Talk) A number of competing hypotheses have been proposed to explain why small-batch Stochastic Gradient Descent (SGD) leads to improved generalization over the full-batch regime, with recent work crediting the implicit regularization of various quantities throughout training. However, to date, empirical evidence assessing the explanatory power of these hypotheses is lacking. In this paper, we conduct an extensive empirical evaluation, focusing on the ability of various theorized mechanisms to close the small-to-large batch generalization gap. Additionally, we characterize how the quantities that SGD has been claimed to (implicitly) regularize change over the course of training. By using micro-batches, i.e. disjoint smaller subsets of each mini-batch, we empirically show that explicitly penalizing the gradient norm or the Fisher Information Matrix trace, averaged over micro-batches, in the large-batch regime recovers small-batch SGD generalization, whereas Jacobian-based regularizations fail to do so. This generalization performance is shown to often be correlated with how well the regularized model’s gradient norms resemble those of small-batch SGD. We additionally show that this behavior breaks down as the micro-batch size approaches the batch size. Finally, we note that in this line of inquiry, positive experimental findings on CIFAR10 are often reversed on other datasets like CIFAR100, highlighting the need to test hypotheses on a wider collection of datasets. Zachary Novack 🔗 Fri 9:30 a.m. - 9:45 a.m. Cubic Regularized Quasi-Newton Methods (Spotlight Talk) In this paper, we propose a Cubic Regularized L-BFGS. Cubic Regularized Newton outperforms the classical Newton method in terms of global performance. In classics, L-BFGS approximation is applied for the Newton method. We propose a new variant of inexact Cubic Regularized Newton. Then, we use L-BFGS approximation as an inexact Hessian for Cubic Regularized Newton. It allows us to get better theoretical convergence rates and good practical performance, especially from the points where classical Newton is diverging. Klea Ziu 🔗 Fri 9:45 a.m. - 10:00 a.m. Alternating minimization for generalized rank one matrix sensing: Sharp predictions from a random initialization (Spotlight Talk) We consider the problem of estimating the factors of a rank-1 matrix with i.i.d. Gaussian, rank-1 measurements that are nonlinearly transformed and corrupted by noise. Considering two prototypical choices for the nonlinearity, we study the convergence properties of a natural alternating update rule for this nonconvex optimization problem starting from a random initialization. We show sharp linear convergence guarantees for a sample-split version of the algorithm by deriving a deterministic recursion that is accurate even in high-dimensional problems. Our sharp, non-asymptotic analysis also exposes several other fine-grained properties of this problem, including how the nonlinearity, sample size, and noise level affect convergence behavior.Our results are enabled by showing that the empirical error recursion can be predicted by our deterministic sequence within fluctuations of the order n^{-1/2} when each iteration is run with n observations. Our technique leverages leave-one-out tools and provides an avenue for sharply analyzing higher-order iterative algorithms from a random initialization in other optimization problems with random data. Kabir Chandrasekher 🔗 Fri 10:00 a.m. - 11:30 a.m. Lunch Break 🔗 Fri 11:30 a.m. - 12:15 p.m. Deterministically Constrained Stochastic Optimization (Plenary Talk) This talk highlights the recent work by my research group on the design, analysis, and implementation of algorithms for solving continuous nonlinear optimization problems that involve a stochastic objective function and deterministic constraints. We will focus on our sequential quadratic optimization (commonly known as SQP) methods for cases when the constraints are defined by nonlinear systems of equations and inequalities. Our methods are applicable for solving various types of problems, such as for training machine learning (e.g., deep learning) models with constraints. Our work focuses on the "fully stochastic" regime in which only stochastic gradient estimates are employed, for which we have derived convergence-in-expectation results and worst-case iteration complexity bounds that are on par with stochastic gradient methods for the unconstrained setting. We will also discuss the various extensions that my group is exploring. Frank E. Curtis 🔗 Fri 12:15 p.m. - 1:00 p.m. Low Rank Approximation for Faster Convex Optimization (Plenary Talk) Low rank structure is pervasive in real-world datasets. This talk shows how to accelerate the solution of fundamental computational problems, including eigenvalue decomposition, linear system solves, and composite convex optimization,by exploiting this low rank structure. We present a simple and efficient method for approximate top eigendecomposition based on randomized numerical linear algebra. Armed with this primitive, we design a new randomized preconditioner for the conjugate gradient method, and a method called NysADMM, based on the inexact alternating directions method of multipliers, for composite convex optimization. These methods come with strong theoretical and numerical support. Indeed, a simple implementation of NysADMM solves important large-scale statistical problems like lasso, logistic regression, and support vector machines 2--58x faster than standard solvers. Madeleine Udell 🔗 Fri 1:00 p.m. - 1:30 p.m. Coffee Break 🔗 Fri 1:00 p.m. - 2:00 p.m. Poster Session II (Poster Session) 🔗 Fri 2:00 p.m. - 2:45 p.m. A Fast, Fisher Based Pruning of Transformers without Retraining (Plenary Talk) Pruning is an effective way to reduce the huge inference cost of large Transformer models. However, prior work on model pruning requires retraining the model. This can add high cost and complexity to model deployment, making it difficult to use in many practical situations. To address this, we propose a fast post- training pruning framework for Transformers that does not require any retraining. Given a resource constraint and a sample dataset, our framework automatically prunes the Transformer model using structured sparsity methods. To retain high accuracy without retraining, we introduce three novel techniques: (i) a lightweight mask search algorithm that finds which heads and filters to prune based on the Fisher information; (ii) mask rearrangement that complements the search algorithm; and (iii) mask tuning that reconstructs the output activations for each layer. We apply our method to BERT-BASE and DistilBERT, and we evaluate its effectiveness on GLUE and SQuAD benchmarks. Our framework achieves up to 2.0x reduction in FLOPs and 1.56x speedup in inference latency, while maintaining < 1\% loss in accuracy. Importantly, our framework prunes Transformers in less than 3 minutes on a single GPU, which is over two orders of magnitude faster than existing pruning approaches that retrain. Amir Gholami 🔗 Fri 2:45 p.m. - 3:00 p.m. Closing Remarks 🔗 - How Small Amount of Data Sharing Benefits Higher-Order Distributed Optimization and Learning (Poster) []  Distributed optimization algorithms have been widely used in machine learning and statistical estimation. While distributed algorithms have the merits in parallel processing and protecting local data security, they often suffer from slow convergence compared with centralized optimization algorithms. This paper focuses on how small amount of data sharing could benefit distributed higher-order optimization algorithms with its application in learning problems. Specifically, we consider how data sharing could benefit distributed multi-block alternating direction method of multipliers (ADMM) and preconditioned conjugate gradient method (PCG) with application in machine learning tasks of linear and logistic regression. These algorithms are commonly known as algorithms between the first and the second order methods. Theoretically, we prove that a small amount of data share leads to improvements from near-worst to near-optimal convergence rate when applying ADMM and PCG methods to machine learning tasks. A side theory product is the tight worst-case bound of linear convergence rate for distributed ADMM applied in linear regression. We further propose a meta randomized data-sharing scheme and provide its tailored applications in multi-block ADMM and PCG methods in order to enjoy both the benefit from data-sharing and from the efficiency of distributed computing. From the numerical evidences, we are convinced that our algorithms provide good quality of estimators in both the least square and the logistic regressions within much fewer iterations by only sharing a small amount of pre-fixed data, while purely distributed optimization algorithms may take hundreds more times of iterations to converge. We hope that the discovery resulted from this paper would encourage even small amount of data sharing among different regions to combat difficult global learning problems. Mingxi Zhu · Yinyu Ye 🔗 - A Stochastic Conjugate Subgradient Algorithm for Kernelized Support Vector Machines: The Evidence (Poster) []  Kernel Support Vector Machines (Kernel SVM) provide a powerful class of toolsfor classifying data whose classes are best identified via a nonlinear function.While a Kernel SVM is usually treated as a Quadratic Program (QP), its solution is usually obtained using stochastic gradient descent (SGD). In this paper we treat the Kernel SVM as a Stochastic Quadratic Linear Programming (SQLP) problem which motivates a decomposition-based algorithm that separates parameter choice from error estimation, with the latter being separable by data points. In order to takeadvantage of the quadratic structure due to the kernel matrix we introduce aconjugate subgradient approach. While convergence of the new method can be shown, the focus of this brief paper is on computational evidence which illustrates that our method maintains the scalability of SGD, while improving the accuracy of classification/optimization. Di Zhang · Suvrajeet Sen 🔗 - Distributed Newton-Type Methods with Communication Compression and Bernoulli Aggregation (Poster) []  Despite their high computation and communication costs, Newton-type methods remain an appealing option for distributed training due to their robustness against ill-conditioned convex problems. In this work, we study {\em communication compression} and {\em aggregation mechanisms} for curvature information in order to reduce these costs while preserving theoretically superior local convergence guarantees. We prove that the recently developed class of {\em three point compressors (3PC)} of Richtárik et al. [2022] for gradient communication can be generalized to Hessian communication as well. This result opens up a wide variety of communication strategies, such as {\em contractive compression} and {\em lazy aggregation}, available to our disposal to compress prohibitively costly curvature information. Moreover, we discovered several new 3PC mechanisms, such as {\em adaptive thresholding} and {\em Bernoulli aggregation}, which require reduced communication and occasional Hessian computations. Furthermore, we extend and analyze our approach to bidirectional communication compression and partial device participation setups to cater to the practical considerations of applications in federated learning. For all our methods, we derive fast {\em condition-number-independent} local linear and/or superlinear convergence rates. Finally, with extensive numerical evaluations on convex optimization problems, we illustrate that our designed schemes achieve state-of-the-art communication complexity compared to several key baselines using second-order information. Rustem Islamov · Xun Qian · Slavomír Hanzely · Mher Safaryan · Peter Richtarik 🔗 - Fully Stochastic Trust-Region Sequential Quadratic Programming for Equality-Constrained Optimization Problems (Poster) []  We propose a fully stochastic trust-region sequential quadratic programming (TR-StoSQP) algorithm to solve nonlinear optimization problems. The problems involve a stochastic objective and deterministic equality constraints. Under the fully stochastic setup, we suppose that only a single sample is generated in each iteration to estimate the objective gradient. Compared to the existing line-search StoSQP schemes, our algorithm allows one to employ indefinite Hessian matrices for SQP subproblems. The algorithm adaptively selects the radius of the trust region based on an input sequence $\{\beta_k\}$, the estimated KKT residual, and the estimated Lipschitz constants of the objective gradients and constraint Jacobians. To address the infeasibility issue of trust-region methods that arises in constrained optimization, we propose an adaptive relaxation technique to compute the trial step. In particular, we decompose the trial step into a normal step and a tangential step. Based on the ratios of the feasibility and optimality residuals to the full KKT residual, we decompose the full trust-region radius into two segments that are used to control the size of the normal and tangential steps, respectively. The normal step has a closed form, while the tangential step is solved from a trust-region subproblem, of which the Cauchy point is sufficient for our study. We establish the global almost sure convergence guarantee of TR-StoSQP, and demonstrate its empirical performance on a subset of problems in CUTEst test set. Yuchen Fang · Sen Na · Mladen Kolar 🔗 - Trust-Region Sequential Quadratic Programming for Stochastic Optimization with Random Models: First-Order Stationarity (Poster) []  We consider optimization problems with a stochastic objective and deterministic constraints, and design a trust-region sequential quadratic programming (TR-SQP) method to solve them. We name our method TR-SQP for STochastic Optimization with Random Models (TR-SQP-STORM). In each iteration, our algorithm constructs a random model for the objective that satisfies suitable accuracy conditions with a high but fixed probability. The algorithm decides whether a trial step is successful or not based on two ratios: the ratio between the estimated actual reduction and predicted reduction on the $\ell_2$ merit function, and the ratio between the estimated KKT residual and trust-region radius. For each successful step, the algorithm increases the trust-region radius, and further decides whether the step is reliable or not based on the amount of the predicted reduction. If the step is reliable, then the algorithm relaxes the accuracy conditions for the next iteration. To resolve the infeasibility issue of trust-region methods for constrained problems, we employ an adaptive relaxation technique proposed by a companion paper. Under reasonable assumptions, we establish the global first-order convergence guarantee: the KKT residual converges to zero almost surely. We apply our method on a subset of problems in CUTEst set to demonstrate its empirical performance. Yuchen Fang · Sen Na · Mladen Kolar 🔗 - Black Box Lie Group Preconditioners for SGD (Poster) []  A matrix free and a low rank approximation preconditioner are proposed to accelerate the convergence of stochastic gradient descent (SGD) by exploiting curvature information sampled from Hessian-vector products or finite differences of parameters and gradients similar to the BFGS algorithm. Both preconditioners are fitted with an online updating manner minimizing a criterion that is free of line search and robust to stochastic gradient noise, and further constrained to be on certain connected Lie groups to preserve their corresponding symmetry or invariance, e.g., orientation of coordinates by the connected general linear group with positive determinants. The Lie group's equivariance property facilitates preconditioner fitting, and its invariance property saves any need of damping, which is common in second order optimizers, but difficult to tune. The learning rate for parameter updating and step size for preconditioner fitting are naturally normalized, and their default values work well in most situations. Xilin Li 🔗 - Perseus: A Simple and Optimal High-Order Method for Variational Inequalities (Poster) []  This paper settles an open and challenging question pertaining to the design of simple high-order regularization methods for solving smooth and monotone variational inequalities (VIs). A VI involves finding $x^\star \in \XCal$ such that $\langle F(x), x - x^\star\rangle \geq 0$ for all $x \in \XCal$ and we consider the setting where $F: \br^d \mapsto \br^d$ is smooth with up to $(p-1)^{\textnormal{th}}$-order derivatives. High-order methods based on similar binary search procedures have been further developed and shown to achieve a rate of $O(\epsilon^{-2/(p+1)}\log(1/\epsilon))$~\citep{Bullins-2020-Higher,Lin-2021-Monotone,Jiang-2022-Generalized}. However, such search procedure can be computationally prohibitive in practice~\citep{Nesterov-2018-Lectures} and the problem of finding a simple high-order regularization methods remains as an open and challenging question in the optimization theory. We propose a $p^{\textnormal{th}}$-order method that does \textit{not} require any binary search procedure and prove that it can converge to a weak solution at a global rate of $O(\epsilon^{-2/(p+1)})$. A lower bound of $\Omega(\epsilon^{-2/(p+1)})$ is also established under a linear span assumption to show that our $p^{\textnormal{th}}$-order method is optimal in the monotone setting. A version with restarting attains a global linear and local superlinear convergence rate for smooth and strongly monotone VIs. Our method can achieve a global rate of $O(\epsilon^{-2/p})$ for solving smooth and non-monotone VIs satisfying the Minty condition. The restarted version again attains a global linear and local superlinear convergence rate if the strong Minty condition holds. Tianyi Lin · Michael Jordan 🔗 - High-Order Optimization of Gradient Boosted Decision Trees (Poster) []  Gradient Boosted Decision Trees (GBDTs) are dominant machine learning algorithms for modeling discrete or tabular data. Unlike neural networks with millions of trainable parameters, GBDTs optimize loss function in an additive manner and have a single trainable parameter per leaf, which makes it easy to apply high-order optimization of the loss function. In this paper, we introduce high-order optimization for GBDTs based on numerical optimization theory which allows us to construct trees based on high-order derivatives of a given loss function. In the experiments, we show that high-order optimization has faster per-iteration convergence that leads to reduced running time. Our solution can be easily parallelized and run on GPUs with little overhead on the code. Finally, we discuss future potential improvements such as automatic differentiation of arbitrary loss function and combination of GBDTs with neural networks. Jean Pachebat · Sergey IVANOV 🔗 - On the Global Convergence of the Regularized Generalized Gauss-Newton Algorithm (Poster) []  We detail the global convergence rates of a regularized generalized Gauss-Newton algorithm applied to compositional problems with surjective inner Jacobian mappings. Our analysis uncovers several convergence phases for the algorithm and identifies the key condition numbers governing the complexity of the algorithm. We present an implementation with a line-search adaptive to the constants of the problem. Vincent Roulet · Maryam Fazel · Siddhartha Srinivasa · Zaid Harchaoui 🔗 - Disentangling the Mechanisms Behind Implicit Regularization in SGD (Poster) []  A number of competing hypotheses have been proposed to explain why small-batch Stochastic Gradient Descent (SGD) leads to improved generalization over the full-batch regime, with recent work crediting the implicit regularization of various quantities throughout training. However, to date, empirical evidence assessing the explanatory power of these hypotheses is lacking. In this paper, we conduct an extensive empirical evaluation, focusing on the ability of various theorized mechanisms to close the small-to-large batch generalization gap. Additionally, we characterize how the quantities that SGD has been claimed to (implicitly) regularize change over the course of training. By using micro-batches, i.e. disjoint smaller subsets of each mini-batch, we empirically show that explicitly penalizing the gradient norm or the Fisher Information Matrix trace, averaged over micro-batches, in the large-batch regime recovers small-batch SGD generalization, whereas Jacobian-based regularizations fail to do so. This generalization performance is shown to often be correlated with how well the regularized model’s gradient norms resemble those of small-batch SGD. We additionally show that this behavior breaks down as the micro-batch size approaches the batch size. Finally, we note that in this line of inquiry, positive experimental findings on CIFAR10 are often reversed on other datasets like CIFAR100, highlighting the need to test hypotheses on a wider collection of datasets. Zachary Novack · Simran Kaur · Tanya Marwah · Saurabh Garg · Zachary Lipton 🔗 - Effects of momentum scaling for SGD (Poster) []  The paper studies the properties of stochastic gradient methods with preconditioning. We focus on momentum updated preconditioners with momentum coefficient $\beta$. Seeking to explain practical efficiency of scaled methods, we provide convergence analysis in a norm associated with preconditioner, and demonstrate that scaling allows one to get rid of gradients Lipschitz constant in convergence rates. Along the way, we emphasize important role of $\beta$, undeservedly set to constant $0.99...9$ at the arbitrariness of various authors. Finally, we propose the explicit constructive formulas for adaptive $\beta$ and step size values. Dmitry A. Pasechnyuk · Alexander Gasnikov · Martin Takac 🔗 - Using quadratic equations for overparametrized models (Poster) Recently the SP (Stochastic Polyak step size) method has emerged as a competitive adaptive method for setting the step sizes of SGD. SP can be interpreted as a method specialized to interpolated models, since it solves the \emph{interpolation equations}. SP solves these equation by using local linearizations of the model. We take a step further and develop a method for solving the interpolation equations that uses the local second-order approximation of the model. Our resulting method SP2 uses Hessian-vector products to speed-up the convergence of SP. Furthermore, and rather uniquely among second-order methods, the design of SP2 in no way relies on positive definite Hessian matrices or convexity of the objective function. We show SP2 is very competitive on matrix completion, non-convex test problems and logistic regression. We also provide a convergence theory on sums-of-quadratics. Shuang Li · William Swartworth · Martin Takac · Deanna Needell · Robert Gower 🔗 - Alternating minimization for generalized rank one matrix sensing: Sharp predictions from a random initialization (Poster) []  We consider the problem of estimating the factors of a rank-1 matrix with i.i.d. Gaussian, rank-1 measurements that are nonlinearly transformed and corrupted by noise. Considering two prototypical choices for the nonlinearity, we study the convergence properties of a natural alternating update rule for this nonconvex optimization problem starting from a random initialization. We show sharp linear convergence guarantees for a sample-split version of the algorithm by deriving a deterministic recursion that is accurate even in high-dimensional problems. Our sharp, non-asymptotic analysis also exposes several other fine-grained properties of this problem, including how the nonlinearity, sample size, and noise level affect convergence behavior.Our results are enabled by showing that the empirical error recursion can be predicted by our deterministic sequence within fluctuations of the order n^{-1/2} when each iteration is run with n observations. Our technique leverages leave-one-out tools and provides an avenue for sharply analyzing higher-order iterative algorithms from a random initialization in other optimization problems with random data. MENGQI LOU · Kabir Chandrasekher · Ashwin Pananjady 🔗 - Improving Levenberg-Marquardt Algorithm for Neural Networks (Poster) []  We explore the usage of the Levenberg-Marquardt(LM) algorithm for regression (non-linear least squares) and classification (generalized Gauss-Newton methods) tasks in neural networks. We compare the performance of the LM method with other popular first-order algorithms such as SGD and Adam, as well as other second-order algorithms such as L-BFGS, Hessian-Free and KFAC. We further speed up the LM method by using adaptive momentum, learning rate line search, and uphill step acceptance. Omead Pooladzandi · Yiming Zhou 🔗 - DRSOM: A Dimension Reduced Second-Order Method (Poster) In this paper, we propose a Dimension-Reduced Second-Order Method (DRSOM) for convex and nonconvex (unconstrained) optimization. Under a trust-region-like framework, our method preserves the convergence of the second-order method while using only Hessianvector products in a few directions, which enables the computational overhead of our method remain comparable to the first-order such as the gradient descent method. Theoretically, we show that the method has a local quadratic convergence and a global convergence rate of O(ϵ −3/2 ) to satisfy the first-order and second-order conditions under a commonly adopted approximated Hessian assumption. We further show that this assumption can be removed if we perform a step of the Lanczos method periodically at the end-stage of the algorithm. The applicability and performance of DRSOM are exhibited by various computational experiments, particularly in machine learning and deep learning. For neural networks, our preliminary implementation seems to gain computational advantages in terms of training accuracy and iteration complexity over state-of-the-art first-order methods such as SGD and ADAM. Chuwen Zhang · Jiang Bo · Chang He · Yuntian Jiang · Dongdong Ge · Yinyu Ye 🔗 - Random-subspace adaptive cubic regularisation method for nonconvex optimisation (Poster) []  We investigate second-order methods for nonconvex optimisation, and propose a Random Subspace Adaptive Cubic Regularisation (R-ARC) method, which we analyse under various assumptions on the objective function and the sketching matrices that generate the random subspaces. We show that, when the sketching matrix achieves a subspace embedding of the augmented matrix of the gradient and the Hessian with sufficiently high probability, then the R-ARC method satisfies, with high probability, a complexity bound of order O(ε^{-3/2}) to drive the (full) gradient norm below ε; matching in the accuracy order its deterministic counterpart (ARC). As an illustration, we particularise our results to the special case of a scaled Gaussian ensemble. Coralia Cartis · Zhen Shao 🔗 - Quartic Polynomial Sub-problem Solutions in Tensor Methods for Nonconvex Optimization (Poster) []  There has been growing interest in high-order tensor methods for nonconvex optimization in machine learning as these methods provide better/optimal worst-case evaluation complexity, stability to parameter tuning, and robustness to problem conditioning. The well-known $p$th-order adaptive regularization (AR$p$) method relies crucially on repeatedly minimising a nonconvex multivariate Taylor-based polynomial sub-problem. It remains an open question to find efficient techniques to minimise such a sub-problem for $p\ge3$.In this paper, we propose a second-order method (SQO) for the AR$3$ (AR$p$ with $p=3$) sub-problem. SQO approximates the special-structure quartic polynomial sub-problem from above and below by using second-order models that can be minimised efficiently and globally.We prove that SQO finds a local minimiser of a quartic polynomial, but in practice, due to its construction, it can find a much lower minimum than cubic regularization approaches. This encourages us to continue our quest for algorithmic techniques that find approximately global solutions for such polynomials. Wenqi Zhu · Coralia Cartis 🔗 - FLECS-CGD: A Federated Learning Second-Order Framework via Compression and Sketching with Compressed Gradient Differences (Poster) In the recent paper FLECS (Agafonov et al, FLECS: A Federated Learning Second-Order Framework via Compression and Sketching), the second-order framework FLECS was proposed for the Federated Learning problem. This method utilize compression of sketched Hessians to make communication costs low. However, the main bottleneck of FLECS is gradient communication without compression. In this paper, we propose the modification of FLECS with compressed gradient differences, which we call FLECS-CGD (FLECS with Compressed Gradient Differences) and make it applicable for stochastic optimization. Convergence guarantees are provided in strongly convex and nonconvex cases. Experiments show the practical benefit of proposed approach. Artem Agafonov · Brahim Erraji · Martin Takac 🔗 - Statistical and Computational Complexities of BFGS Quasi-Newton Method for Generalized Linear Models (Poster) []  The gradient descent (GD) method has been used widely to solve parameter estimation in generalized linear models (GLMs), a generalization of linear models when the link function can be non-linear. While GD has optimal statistical and computational complexities for estimating the true parameter under the high signal-to-noise ratio (SNR) regime of the GLMs, it has sub-optimal complexities when the SNR is low, namely, the iterates of GD require polynomial number of iterations to reach the final statistical radius. The slow convergence of GD for the low SNR case is mainly due to the local convexity of the least-square loss functions of the GLMs. To address the shortcomings of GD, we propose to use the BFGS quasi-Newton method to solve parameter estimation of the GLMs. On the optimization side, when the SNR is low, we demonstrate that iterates of BFGS converge linearly to the optimal solution of the population least-square loss function. On the statistical side, we prove that the iterates of BFGS reach the final statistical radius of the low SNR GLMs after a logarithmic number of iterations, which is much lower than the polynomial number of iterations of GD. We also present numerical experiments that match our theoretical findings. Qiujiang Jin · Aryan Mokhtari · Nhat Ho · Tongzheng Ren 🔗 - The Trade-offs of Incremental Linearization Algorithms for Nonsmooth Composite Problems (Poster) []  Gauss-Newton methods and their stochastic version have been widely used in machine learning. Their non-smooth counterparts, modified Gauss-Newton or prox-linear algorithms, can lead to contrasted outcomes when compared to gradient descent in large scale settings. We explore the contrasting performance of these two classes of algorithms in theory on a stylized statistical example, and experimentally on learning problems including structured prediction. Krishna Pillutla · Vincent Roulet · Sham Kakade · Zaid Harchaoui 🔗 - Cubic Regularized Quasi-Newton Methods (Poster) []  In this paper, we propose a Cubic Regularized L-BFGS. Cubic Regularized Newton outperforms the classical Newton method in terms of global performance. In classics, L-BFGS approximation is applied for the Newton method. We propose a new variant of inexact Cubic Regularized Newton. Then, we use L-BFGS approximation as an inexact Hessian for Cubic Regularized Newton. It allows us to get better theoretical convergence rates and good practical performance, especially from the points where classical Newton is diverging. Dmitry Kamzolov · Klea Ziu · Artem Agafonov · Martin Takac 🔗 - ASDL: A Unified Interface for Gradient Preconditioning in PyTorch (Poster) []  Gradient preconditioning is a key technique to integrate the second-order information into gradients for improving and extending gradient-based learning algorithms. In deep learning, stochasticity, nonconvexity, and high dimensionality lead to a wide variety of gradient preconditioning methods, with implementation complexity and inconsistent performance and feasibility. We propose the Automatic Second-order Differentiation Library (ASDL), an extension library for PyTorch, which offers various implementations and a plug-and-play unified interface for gradient preconditioning. ASDL enables the study and structured comparison of a range of gradient preconditioning methods. Kazuki Osawa · Satoki Ishikawa · Rio Yokota · Shigang Li · Torsten Hoefler 🔗 - PSPS: Preconditioned Stochastic Polyak Step-size method for badly scaled data (Poster) []  The family of Stochastic Gradient Methods with Polyak Step-size offers an update rule that alleviates the need of fine-tuning the learning rate of an optimizer. Recent work (Robert M Gower, Mathieu Blondel, Nidham Gazagnadou, and Fabian Pedregosa: Cutting some slack for SGD with adaptive polyak stepsizes) has been proposed to introduce a slack variable, which makes these methods applicable outside of the interpolation regime. In this paper, we combine preconditioning and slack in an updated optimization algorithm to show its performance on badly scaled and/or ill-conditioned datasets. We use Hutchinson's method to obtain an estimate of a Hessian which is used as the preconditioner. Farshed Abdukhakimov · Chulu Xiang · Dmitry Kamzolov · Robert Gower · Martin Takac 🔗 - HesScale: Scalable Computation of Hessian Diagonals (Poster) []  Second-order optimization uses curvature information about the objective function, which can help in faster convergence. However, such methods typically require expensive computation of the Hessian matrix, preventing their usage in a scalable way. The absence of efficient ways of computation drove the most widely used methods to focus on first-order approximations that do not capture the curvature information. In this paper, we develop HesScale, a scalable approach to approximating the diagonal of the Hessian matrix, to incorporate second-order information in a computationally efficient manner. We show that HesScale has the same computational complexity as backpropagation. Our results on supervised classification show that HesScale achieves high approximation accuracy, allowing for scalable and efficient second-order optimization. Mohamed Elsayed · Rupam Mahmood 🔗