Timezone: »

OPT 2016: Optimization for Machine Learning
Suvrit Sra · Francis Bach · Sashank J. Reddi · Niao He

Fri Dec 09 11:00 PM -- 09:30 AM (PST) @ Room 112
Event URL: http://opt-ml.org/ »

As the ninth in its series, OPT 2016 builds on remarkable precedent established by the highly successful series of workshops: OPT 2008--OPT 2015, which have been instrumental in bridging the OPT and ML communities closer together.

The previous OPT workshops enjoyed packed to overpacked attendance. This huge interest is no surprise: optimization is the 2nd largest topic at NIPS and is indeed foundational for the wider ML community.

Looking back over the past decade, a strong trend is apparent: The intersection of OPT and ML has grown monotonically to the point that now several cutting-edge advances in optimization arise from the ML community. The distinctive feature of optimization within ML is its departure from textbook approaches, in particular, by having a different set of goals driven by “big-data,” where both models and practical implementation are crucial.

This intimate relation between OPT and ML is the core theme of our workshop. We wish to use OPT2016 as a platform to foster discussion, discovery, and dissemination of the state-of-the-art in optimization as relevant to machine learning. And even beyond that, as a platform to identify new directions and challenges that will drive future research.

How OPT differs from other related workshops:

Compared to the other optimization focused workshops that we are aware of, the distinguishing features of OPT are: (a) it provides a unique bridge between the ML community and the wider optimization community; (b) it encourages theoretical work on an equal footing with practical efficiency; and (c) it caters to a wide body of NIPS attendees, experts and beginners alike (some OPT talks are always of a more “tutorial” nature).

Extended abstract

The OPT workshops have previously covered a variety of topics, such as frameworks for convex programs (D. Bertsekas), the intersection of ML and optimization, classification (S. Wright), stochastic gradient and its tradeoffs (L. Bottou, N. Srebro), structured sparsity (Vandenberghe), randomized methods for convex optimization (A. Nemirovski), complexity theory of convex optimization (Y. Nesterov), distributed optimization (S. Boyd), asynchronous stochastic gradient (B. Recht), algebraic techniques (P. Parrilo), nonconvex optimization (A. Lewis), sums-of-squares techniques (J. Lasserre), deep learning tricks (Y. Bengio), stochastic convex optimization (G. Lan), new views on interior point (E. Hazan), among others.

Several ideas propounded in OPT have by now become important research topics in ML and optimization --- especially in the field of randomized algorithms, stochastic gradient and variance reduced stochastic gradient methods. An edited book "Optimization for Machine Learning" (S. Sra, S. Nowozin, and S. Wright; MIT Press, 2011) grew out of the first three OPT workshops, and contains high-quality contributions from many of the speakers and attendees, and there have been sustained requests for the next edition of such a volume.

Much of the recent focus has been on large-scale first-order convex optimization algorithms for machine learning, both from a theoretical and methodological point of view. Covered topics included stochastic gradient algorithms, (accelerated) proximal algorithms, decomposition and coordinate descent algorithms, parallel and distributed optimization. Theoretical and practical advances in these methods remain a topic of core interest to the workshop. Recent years have also seen interesting advances in non-convex optimization such as a growing body of results on alternating minimization, tensor factorization etc.

We also do not wish to ignore the not particularly large scale setting, where one does have time to wield substantial computational resources. In this setting, high-accuracy solutions and deep understanding of the lessons contained in the data are needed. Examples valuable to MLers may be exploration of genetic and environmental data to identify risk factors for disease; or problems dealing with setups where the amount of observed data is not huge, but the mathematical model is complex. Consequently, we encourage optimization methods on manifolds, ML problems with differential geometric antecedents, those using advanced algebraic techniques, and computational topology, for instance.

At this point, we would like to emphasize again that OPT2016 is one of the few optimization+ML workshops that lies at the intersection of theory and practice: both actual efficiency of algorithms in practice as well as their theoretical analysis are given equal value.

Fri 11:15 p.m. - 11:30 p.m. [iCal]
Opening Remarks
Fri 11:30 p.m. - 12:15 a.m. [iCal]

In Online Optimization, the data in an optimization problem is revealed over time and at each step a decision variable needs to be set without knowing the future data. This setup covers online resource allocation, from classical inventory problems to the Adwords problem popular in online advertising. In this talk, we prove bounds on the competitive ratio of two primal-dual algorithms for a broad class of online convex optimization problems. We give a sufficient condition on the objective function that guarantees a constant worst-case competitive ratio for monotone functions. We show how smoothing the objective can improve the competitive ratio of these algorithms, and for separable functions, we show that the optimal smoothing can be derived by solving a convex optimization problem. This result allows us to directly optimize the competitive ratio bound over a class of smoothing functions, and hence design effective smoothing customized for a given cost function.

Sat 12:15 a.m. - 12:30 a.m. [iCal]

The time to converge to the steady state of a finite Markov chain can be greatly reduced by a lifting operation, which creates a new Markov chain on an expanded state space. For a class of quadratic objectives, we show an analogous behavior whereby a distributed ADMM algorithm can be seen as a lifting of Gradient Descent. This provides a deep insight for its faster convergence rate under optimal parameter tuning. We conjecture that this gain is always present, contrary to when lifting a Markov chain, where sometimes we only obtain a marginal speedup.

Sat 12:30 a.m. - 1:30 a.m. [iCal]
Poster Session 1 (Break)
Sat 1:30 a.m. - 2:00 a.m. [iCal]
Coffee Break 1 (Poster Session)
Sat 2:00 a.m. - 2:45 a.m. [iCal]

A lot of progress has been made in recent years on extending classical multi-armed bandit strategies to very large set of actions. A particularly challenging setting is the one where the action set is continuous and the underlying cost function is convex, this is the so-called bandit convex optimization (BCO) problem. I will tell the story of BCO and explain some of the new ideas that we recently developed to solve it. I will focus on three new ideas from our recent work http://arxiv.org/abs/1607.03084 with Yin Tat Lee and Ronen Eldan: (i) a new connection between kernel methods and the popular multiplicative weights strategy; (ii) a new connection between kernel methods and one of Erdos’ favorite mathematical object, the Bernoulli convolution, and (iii) a new adaptive (and increasing!) learning rate for multiplicative weights. These ideas could be of broader interest in learning/algorithm’s design

Sat 2:45 a.m. - 3:00 a.m. [iCal]

We extend the Frank-Wolfe (FW) optimization algorithm to solve constrained smooth convex-concave saddle point (SP) problems. Remarkably, the method only requires access to linear minimization oracles. Leveraging recent advances in FW optimization, we provide the first proof of convergence of a FW-type saddle point solver over polytopes, thereby partially answering a 30 year-old conjecture. We verify our convergence rates empirically and observe that by using a heuristic step-size, we can get empirical convergence under more general conditions, paving the way for future theoretical work.

Sat 3:00 a.m. - 5:00 a.m. [iCal]
Lunch Break (Break)
Sat 5:00 a.m. - 5:45 a.m. [iCal]

Semidefinite programs (SDPs) can be solved in polynomial time by interior point methods, but scalability can be an issue. To address this shortcoming, over a decade ago, Burer and Monteiro proposed to solve SDPs with few equality constraints via low-rank, non-convex surrogates. Remarkably, for some applications, local optimization methods seem to converge to global optima of these non-convex surrogates reliably.

In this presentation, we show that the Burer-Monteiro formulation of SDPs in a certain class almost never has any spurious local optima, that is: the non-convexity of the low-rank formulation is benign (even saddles are strict). This class of SDPs covers applications such as max-cut, community detection in the stochastic block model, robust PCA, phase retrieval and synchronization of rotations.

The crucial assumption we make is that the low-rank problem lives on a manifold. Then, theory and algorithms from optimization on manifolds can be used.

Optimization on manifolds is about minimizing a cost function over a smooth manifold, such as spheres, low-rank matrices, orthonormal frames, rotations, etc. We will present the basic framework as well as parts of the more general convergence theory, including recent complexity results. (Toolbox: http://www.manopt.org)

Select parts are joint work with P.-A. Absil, A. Bandeira, C. Cartis and V. Voroninski.

Sat 5:45 a.m. - 6:00 a.m. [iCal]

We propose a technique to accelerate gradient-based optimization algorithms by giving them the ability to exploit L-BFGS heuristics. Our scheme is (i) generic and can be applied to a large class of first-order algorithms; (ii) it is compatible with composite objectives, meaning that it may provide exactly sparse solutions when a sparsity-inducing regularization is involved; (iii) it admits a linear convergence rate for strongly-convex problems; (iv) it is easy to use and it does not require any line search. Our work is inspired in part by the Catalyst meta-algorithm, which accelerates gradient-based techniques in the sense of Nesterov; here, we adopt a different strategy based on L-BFGS rules to learn and exploit the local curvature. In most practical cases, we observe significant improvements over Catalyst for solving large-scale high-dimensional machine learning problems.

Sat 6:00 a.m. - 6:30 a.m. [iCal]
Coffee Break 2 (Poster Session)
Sat 6:30 a.m. - 7:30 a.m. [iCal]
Poster Session 2 (Poster Session)
Sat 7:30 a.m. - 8:15 a.m. [iCal]

Finite-sum optimization problems are ubiquitous in machine learning, and are commonly solved using first-order methods which rely on gradient computations. Recently, there has been growing interest in second-order methods, which rely on both gradients and Hessians. In principle, second-order methods can require much fewer iterations than first-order methods, and hold the promise for more efficient algorithms. Although computing and manipulating Hessians is prohibitive for high-dimensional problems in general, the Hessians of individual functions in finite-sum problems can often be efficiently computed, e.g. because they possess a low-rank structure. Can second-order information indeed be used to solve such problems more efficiently? In this talk, I'll provide evidence that the answer -- perhaps surprisingly -- is negative, at least in terms of worst-case guarantees. However, I'll also discuss what additional assumptions and algorithmic approaches might potentially circumvent this negative result. Joint work with Yossi Arjevani.

Sat 8:15 a.m. - 8:30 a.m. [iCal]

We give the first dimension-efficient algorithms for learning Rectified Linear Units (ReLUs), which are functions of the form max(0, w.x) with w a unit vector (2-norm equal to 1). Our algorithm works in the challenging Reliable Agnostic learning model of Kalai, Kanade, and Mansour where the learner is given access to a distribution D on labeled examples but the labeling may be arbitrary. We construct a hypothesis that simultaneously minimizes the false-positive rate and the l_p loss (for p=1 or p >=2) of inputs given positive labels by D.

It runs in polynomial-time (in n) with respect to {\em any} distribution on S^{n-1} (the unit sphere in n dimensions) and for any error parameter \epsilon = \Omega(1/ \log n). These results are in contrast to known efficient algorithms for reliably learning linear threshold functions, where epsilon must be Omega(1) and strong assumptions are required on the marginal distribution. We can compose our results to obtain the first set of efficient algorithms for learning constant-depth networks of ReLUs.

Our techniques combine kernel methods and polynomial approximations with a dual-loss'' approach to convex programming. As a byproduct we also obtain the first set of efficient algorithms forconvex piecewise-linear fitting,'' and the first efficient algorithms for agnostically learning low-weight multivariate polynomials on the unit sphere.

Author Information

Suvrit Sra (MIT)

Suvrit Sra is a faculty member within the EECS department at MIT, where he is also a core faculty member of IDSS, LIDS, MIT-ML Group, as well as the statistics and data science center. His research spans topics in optimization, matrix theory, differential geometry, and probability theory, which he connects with machine learning --- a key focus of his research is on the theme "Optimization for Machine Learning” (http://opt-ml.org)

Francis Bach (INRIA - Ecole Normale Superieure)
Sashank J. Reddi (Carnegie Mellon University)
Niao He (UIUC)

More from the Same Authors