`

Timezone: »

 
Workshop
Optimal Transport and Machine Learning
Jason Altschuler · Charlotte Bunne · Laetitia Chapel · Marco Cuturi · Rémi Flamary · Gabriel Peyré · Alexandra Suvorikova

Mon Dec 13 05:00 AM -- 01:00 PM (PST) @ None
Event URL: https://otml2021.github.io/ »

Over the last few years, optimal transport (OT) has quickly become a central topic in machine learning. OT is now routinely used in many areas of ML, ranging from the theoretical use of OT flow for controlling learning algorithms to the inference of high-dimensional cell trajectories in genomics. The Optimal Transport and Machine Learning (OTML) workshop series (in '14, '17, '19) has been instrumental in shaping this research thread. For this new installment of OTML, we aim even bigger by hosting an exceptional keynote speaker, Alessio Figalli, who received the 2018 Fields Medal for his breakthroughs in the analysis of the regularity properties of OT. OTML will be a unique opportunity for cross-fertilization between recent advances in pure mathematics and challenging high-dimensional learning problems.

Mon 5:00 a.m. - 5:45 a.m.
TBA (Plenary talk)
Lénaïc Chizat
Mon 5:45 a.m. - 6:00 a.m.

We are interested in the challenging problem of modelling densities on Riemannian manifolds with a known symmetry group using normalising flows. This has many potential applications in physical sciences such as molecular dynamics and quantum simulations. In this work we combine ideas from implicit neural layers and optimal transport theory to propose a generalisation of existing work on exponential map flows, Implicit Riemannian Concave Potential Maps, IRCPMs. IRCPMs have some nice properties such as simplicity of incorporating knowledge about symmetries and are less expensive then ODE-flows. We provide an initial theoretical analysis of its properties and layout sufficient conditions for stable optimisation. Finally, we illustrate the properties of IRCPMs with density learning experiments on tori and spheres.

Danilo Jimenez Rezende · Sébastien Racanière
Mon 6:00 a.m. - 7:10 a.m.

In optimal transport, understanding the regularity of optimal maps is an important topic. This lecture aims to present the regularity theory for optimal maps, explain the connection to Monge-Ampère type equations, and overview the most fundamental results available.

Alessio Figalli
Mon 7:10 a.m. - 7:35 a.m.

Generative Adversarial Networks (GANs) have proven to be a powerful framework for learning to draw samples from complex distributions. In this talk I will discuss the challenge of learning sequential data via GANs. This notably requires the choice of a loss function that reflects the discrepancy between (measures on) paths. To take on this task, we employ adapted versions of optimal transport distances, that result from imposing a temporal causality constraint on classical transport problems. This constraint provides a natural framework to parameterize the cost function that is learned by the discriminator as a robust (worst-case) distance. We then employ a modification of the empirical measure, to ensure consistency of the estimators. Following Genevay et al. (2018), we also include an entropic penalization term which allows for the use of the Sinkhorn algorithm when computing the optimal transport cost.

Beatrice Acciaio
Mon 7:35 a.m. - 8:15 a.m.
  • Optimizing Functionals on the Space of Probabilities with Input Convex Neural Network - by David Alvarez-Melis, Yair Schiff, Youssef Mroueh
  • Learning Revenue-Maximizing Auctions With Differentiable Matching - by Michael Curry, Uro Cornelius Lyi, Tom Goldstein, John P Dickerson
  • Factored couplings in multi-marginal optimal transport via difference of convex programming - by Quang Huy Huy, Hicham Janati, Ievgen Redko, Rémi Flamary, Nicolas Courty
  • Sinkhorn EM: An Expectation-Maximizationalgorithm based on entropic optimal transport - by gonzalo e mena, Amin Nejatbakhsh, Erdem Varol, Jonathan Niles-Weed
  • Subspace Detours Meet Gromov-Wasserstein - by Clément Bonet, Nicolas Courty, François Septier, Lucas Drumetz
  • Input Convex Gradient Networks - by Jack Richter-Powell, Jonathan Peter Lorraine, Brandon Amos
  • Linear-Time Gromov Wasserstein Distances using Low Rank Couplings and Costs - by Meyer Scetbon, Gabriel Peyré, Marco Cuturi
  • Faster Unbalanced Optimal Transport: Translation invariant Sinkhorn and 1-D Frank-Wolfe - by Thibault Sejourne, François-Xavier Vialard, Gabriel Peyré
Mon 8:15 a.m. - 8:45 a.m.
Poster session - 1 (Poster session)
Mon 8:45 a.m. - 9:30 a.m.
TBA (Plenary talk)
Caroline Uhler
Mon 9:30 a.m. - 9:55 a.m.

In this talk, I will present some recent work at the intersection of optimal transport (OT) and probability flows. Optimal transport is an elegant theory that has diverse downstream applications. For likelihood estimation in particular, there has been a recent interest in using parametric invertible models (aka normalizing flows) to approximate the data distribution of interest. I will present my recent work on parameterizing flows using a neural convex potential, which is inspired by Brenier's theorem. In addition, I will cover a few other recently proposed probability flow models related to OT.

Chin-Wei Huang
Mon 9:55 a.m. - 10:20 a.m.

Multi-marginal optimal transport (MOT) is a generalization of optimal transport theory to settings with possibly more than two marginals. The computation of the solutions to MOT problems has been a longstanding challenge. In this talk, we introduce graphical optimal transport, a special class of MOT problems. We consider MOT problems from a probabilistic graphical model perspective and point out an elegant connection between the two when the underlying cost for optimal transport allows a graph structure. In particular, an entropy regularized MOT is equivalent to a Bayesian marginal inference problem for probabilistic graphical models with the additional requirement that some of the marginal distributions are specified. This relation on the one hand extends the optimal transport as well as the probabilistic graphical model theories, and on the other hand leads to fast algorithms for MOT by leveraging the well-developed algorithms in Bayesian inference. We will cover recent developments of graphical optimal transport in theory and algorithms. We will also go over several applications in aggregate filtering and mean field games.

Yongxin Chen
Mon 10:20 a.m. - 11:00 a.m.
Poster session - 2 (Poster session)
Mon 11:00 a.m. - 11:25 a.m.
TBA (Keynote talk)
Pinar Demetci
Mon 11:25 a.m. - 11:40 a.m.
Entropic estimation of optimal transport maps (Oral)
Aram-Alexandre Pooladian · Jonathan Niles-Weed
Mon 11:40 a.m. - 11:55 a.m.
Discrete Schr\¨odinger Bridges with Applications to Two-Sample Homogeneity Testing (Oral)
Zaid Harchaoui · Lang Liu · Soumik Pal
Mon 11:55 a.m. - 12:20 p.m.

Understanding the generalization capacity has been a central topic in mathematical machine learning. In this talk, I will present a generalized weighted least-squares optimization method for computational learning and inversion with noisy data. In particular, using the Wasserstein metric as the objective function and implementing the Wasserstein gradient flow (or Wasserstein natural gradient descent method) fall into the framework. The weighting scheme encodes both a priori knowledge on the object to be learned and a strategy to weight the contribution of different data points in the loss function. We will see that appropriate weighting from prior knowledge can greatly improve the generalization capability of the learned model.

Yunan Yang
Mon 12:20 p.m. - 12:25 p.m.
Concluding remarks (Discussion)
Mon 12:25 p.m. - 1:00 p.m.
Poster session - 3 (Poster session)
-
[ OpenReview  link »

Unbalanced optimal transport (UOT) extends optimal transport (OT) to take into account mass variations to compare distributions. This is crucial to make OT successful in ML applications, making it robust to data normalization and outliers. The baseline algorithm is Sinkhorn, but its convergence speed might be significantly slower for UOT than for OT. In this work, we identify the cause for this deficiency, namely the lack of a global normalization of the iterates, which equivalently corresponds to a translation of the dual OT potentials. Our first contribution leverages this idea to develop an accelerated Sinkhorn algorithm (coined ”translation invariant Sinkhorn”) for UOT, bridging the computational gap with OT. Our second contribution focusses on 1-D UOT and proposes a Frank-Wolfe solver applied to this translation invariant formulation. The linear oracle of each steps amounts to solving a 1-D OT problems, resulting in a linear time complexity per iteration. Our last contribution extends this method to the computation of UOT barycenter of 1-D measures. Numerical simulations showcase the convergence speed improvement brought by these three approaches.

Thibault Sejourne · Francois-Xavier Vialard · Gabriel Peyré
-
[ OpenReview  link »

Unbalanced optimal transport (UOT) extends optimal transport (OT) to take into account mass variations to compare distributions. This is crucial to make OT successful in ML applications, making it robust to data normalization and outliers. The baseline algorithm is Sinkhorn, but its convergence speed might be significantly slower for UOT than for OT. In this work, we identify the cause for this deficiency, namely the lack of a global normalization of the iterates, which equivalently corresponds to a translation of the dual OT potentials. Our first contribution leverages this idea to develop an accelerated Sinkhorn algorithm (coined ”translation invariant Sinkhorn”) for UOT, bridging the computational gap with OT. Our second contribution focusses on 1-D UOT and proposes a Frank-Wolfe solver applied to this translation invariant formulation. The linear oracle of each steps amounts to solving a 1-D OT problems, resulting in a linear time complexity per iteration. Our last contribution extends this method to the computation of UOT barycenter of 1-D measures. Numerical simulations showcase the convergence speed improvement brought by these three approaches.

Thibault Sejourne · Francois-Xavier Vialard · Gabriel Peyré
-
[ OpenReview  link »

A prominent class of algorithms for solving regularized optimal transport problems is that of iterative Bregman projections (IBP). Among them, in the classical bi-marginal case, superior performance of Greenkhorn algorithm to Sinkhorn algorithm has been testified in several works. Here we prove global linear convergence of a new batch Greenkhorn algorithm for regularized multimarginal optimal transport, which processes at each iteration only a batch of components of a selected marginal. While the linear convergence is established for the famous example of cyclic IBP - the Sinkhorn algorithm, in general for IBP this question is open. Our framework of Batch Greenkhorn is general enough to cover, as special cases some existing algorithms in optimal transport like Greenkhorn algorithm for bi-marginal, and (greedy) MultiSinkhorn algorithm for multimarginal optimal transport, for which we provide explicit global linear convergence rates. Moreover, our results highlight the critical role played by the batch size in accelerating the convergence of the algorithm.

Vladimir Kostic · Saverio Salzo · Massimiliano Pontil
-
[ OpenReview  link »

Wasserstein gradient flows on probability measures have found a host of applications in various optimization problems. They typically arise as the continuum limit of exchangeable particle systems evolving by some mean-field interaction involving a gradient-type potential. However, in many problems, such as in multi-layer neural networks, the so-called particles are edge weights on large graphs whose nodes are exchangeable. Such large graphs are known to converge to continuum limits called graphons as their size grow to infinity. We show that the Euclidean gradient flow of a suitable function of the edge-weights converges to a novel continuum limit given by a curve on the space of graphons that can be appropriately described as a gradient flow or, more technically, a curve of maximal slope. Furthermore, such curves can also be recovered as the limit of a sequence of continuity equations satisfied by the time-marginal laws of stochastic processes of random matrices.

Sewoong Oh · Soumik Pal · Raghav Somani · Raghav Tripathi
-
[ OpenReview  link »
The ability to compare and align points across datasets that are known to be related, yet incomparable--because they live in heterogeneous spaces--plays an increasingly important role in machine learning. The Gromov-Wasserstein (GW) formalism can help tackle this problem. Its goal is to seek a low-distortion assignment between points taken across these incomparable datasets.As a non-convex and quadratic generalization of optimal transport (OT), GW is NP-hard. Although heuristics are known to work reasonably well (e.g. by solving a sequence of nested regularized OT problems), they still remain too costly to scale, with \emph{cubic} complexity in the number of samples $n$. As a result GW is far less used in practice than usual OT.We examine in this work how a recent variant of the OT problem that narrows down the set of admissible couplings to those admitting a low rank factorization of two sub-couplings is particularly well suited to the resolution of GW.By updating alternatively each sub-coupling, our algorithm computes a stationary point of the problem in time $O(n^2)$. When cost matrices have low rank , our algorithm becomes linear in $n$.We demonstrate the efficiency of our method on simulated and real data.
Meyer Scetbon · Gabriel Peyré · Marco Cuturi
-
[ OpenReview  link »
The ability to compare and align points across datasets that are known to be related, yet incomparable--because they live in heterogeneous spaces--plays an increasingly important role in machine learning. The Gromov-Wasserstein (GW) formalism can help tackle this problem. Its goal is to seek a low-distortion assignment between points taken across these incomparable datasets.As a non-convex and quadratic generalization of optimal transport (OT), GW is NP-hard. Although heuristics are known to work reasonably well (e.g. by solving a sequence of nested regularized OT problems), they still remain too costly to scale, with \emph{cubic} complexity in the number of samples $n$. As a result GW is far less used in practice than usual OT.We examine in this work how a recent variant of the OT problem that narrows down the set of admissible couplings to those admitting a low rank factorization of two sub-couplings is particularly well suited to the resolution of GW.By updating alternatively each sub-coupling, our algorithm computes a stationary point of the problem in time $O(n^2)$. When cost matrices have low rank , our algorithm becomes linear in $n$.We demonstrate the efficiency of our method on simulated and real data.
Meyer Scetbon · Gabriel Peyré · Marco Cuturi
-
[ OpenReview  link »

We consider a generalization of the rate-distortion-perception framework for lossy compression in which the reconstruction must attain a certain target distribution. This may arise, for example, in image restoration applications when there is a bit interface between a sender who records with degraded quality and receiver who wishes to recover a clear image. In this work, we characterize this as an optimal transport problem with constrained entropy between source and target distributions. We show that optimal solutions to this problem follow a framework that partially decouples the problems of compression and transport, and demonstrate the utility of common randomness. We show that the performance of a deep learning architecture following this framework is competitive with an end-to-end system.

Huan Liu · George Zhang · Jun Chen · Ashish Khisti
-
[ OpenReview  link »

The ability to understand and predict molecular responses towards external perturbations is a core question in molecular biology. Technological advancements in the recent past have enabled the generation of high-resolution single-cell data, making it possible to profile individual cells under different experimentally controlled perturbations. However, cells are typically destroyed during measurement, resulting in unpaired distributions over either perturbed or non-perturbed cells. Leveraging the theory of optimal transport and the recent advents of convex neural architectures, we learn a coupling describing the response of cell populations upon perturbation, enabling us to predict state trajectories on a single-cell level.We apply our approach, CellOT, to predict treatment responses of 21,650 cells subject to four different drug perturbations. CellOT outperforms current state-of-the-art methods both qualitatively and quantitatively, accurately capturing cellular behavior shifts across all different drugs.

Charlotte Bunne · Stefan Stark · Gabriele Gut · Andreas Krause · Gunnar Rätsch · Lucas Pelkmans · Kjong Lehmann
-
[ OpenReview  link »

The gradients of convex functions are expressive models of non-trivial vector fields. For example, the optimal transport map between any two measures on Euclidean spaces under the squared distance is realized as a convex gradients via Brenier's theorem, which is a key insight used in recent machine learning flow models. In this paper, we study how to model convex gradients by integrating a Jacobian-vector product parameterized by a neural network, which we call the Input Convex Gradient Network (ICGN). We theoretically study ICGNs and compare them to modeling the gradient by taking the derivative of an input-convex neural network, demonstrating that ICGNs can efficiently parameterize convex gradients.

Jack Richter-Powell · Jonathan Lorraine · Brandon Amos
-
[ OpenReview  link »

The gradients of convex functions are expressive models of non-trivial vector fields. For example, the optimal transport map between any two measures on Euclidean spaces under the squared distance is realized as a convex gradients via Brenier's theorem, which is a key insight used in recent machine learning flow models. In this paper, we study how to model convex gradients by integrating a Jacobian-vector product parameterized by a neural network, which we call the Input Convex Gradient Network (ICGN). We theoretically study ICGNs and compare them to modeling the gradient by taking the derivative of an input-convex neural network, demonstrating that ICGNs can efficiently parameterize convex gradients.

Jack Richter-Powell · Jonathan Lorraine · Brandon Amos
-
[ OpenReview  link »
Multi-marginal optimal transport (MOT) is a generalization of optimal transport to multiple marginals. Optimal transport has evolved into an important tool in many machine learning applications, and its multi-marginal extension opens up for addressing new challenges in the field of machine learning. However, the usage of MOT has been largely impeded by its computational complexity which scales exponentially in the number of marginals. Fortunately, in many applications, such as barycenter or interpolation problems, the cost function adheres to structures, which has recently been exploited for developing efficient computational methods. In this work, we derive computational bounds for these methods. With $m$ marginal distributions supported on $n$ points, we provide a $ \mathcal{\tilde O}(d(G)m n^2\epsilon^{-2})$ bound for a $\epsilon$-accuracy when the problem is associated with a tree with diameter $d(G)$. For the special case of the Wasserstein barycenter problem, which corresponds to a star-shaped tree, our bound is in alignment with the existing complexity bound for it.
Jiaojiao Fan · Isabel Haasler · Johan Karlsson · Yongxin Chen
-
[ OpenReview  link »

In the context of optimal transport methods, the subspace detour approach was recently presented by Muzellec and Cuturi (2019). It consists in building a nearly optimal transport plan in the measures space from an optimal transport plan in a wisely chosen subspace, onto which the original measures are projected. The contribution of this paper is to extend this category of methods to the Gromov-Wasserstein problem, which is a particular type of transport distance involving the inner geometry of the compared distributions. After deriving the associated formalism and properties, we also discuss a specific cost for which we can show connections with the Knothe-Rosenblatt rearrangement. We finally give an experimental illustration on a shape matching problem.

Clément Bonet · Nicolas Courty · François Septier · Lucas Drumetz
-
[ OpenReview  link »

In the context of optimal transport methods, the subspace detour approach was recently presented by Muzellec and Cuturi (2019). It consists in building a nearly optimal transport plan in the measures space from an optimal transport plan in a wisely chosen subspace, onto which the original measures are projected. The contribution of this paper is to extend this category of methods to the Gromov-Wasserstein problem, which is a particular type of transport distance involving the inner geometry of the compared distributions. After deriving the associated formalism and properties, we also discuss a specific cost for which we can show connections with the Knothe-Rosenblatt rearrangement. We finally give an experimental illustration on a shape matching problem.

Clément Bonet · Nicolas Courty · François Septier · Lucas Drumetz
-
[ OpenReview  link »

Let π ∈ Π(μ, ν) be a coupling between two probability measures μ and ν on a Polish space. We propose and study a class of nonparametric measures of association between μ and ν, which we call Wasserstein correlation coefficients. These coefficients are based on the Wasserstein distance between ν and the disintegration of π with respect to the first coordinate. We also establish basic statistical properties of this new class of measures: we develop a statistical theory for strongly consistent estimators and determine their convergence rate in the case of compactly supported measures μ and ν. Throughout our analysis we make use of the so-called adapted/bicausal Wasserstein distance, in particular we rely on results established in [Backhoff, Bartl, Beiglböck, Wiesel. Estimating processes in adapted Wasserstein distance. 2020]. Our approach applies to probability laws on general Polish spaces.

Johannes Wiesel
-
[ OpenReview  link »

We study Sinkhorn EM (sEM), a variant of expectation-maximization (EM) based on entropic optimal transport, as an algorithm for mixture model inference when prior information about the mixing weights is known.sEM differs from the classic EM algorithm in the way responsibilities are computed during the expectation step: rather than assign data points to clusters independently, sEM uses optimal transport to compute responsibilities that respect the known proportions.Like EM, sEM has a natural interpretation as a coordinate ascent procedure, which iteratively constructs and optimizes a lower bound on the log-likelihood.However, when the mixing weights are known, we show theoretically and empirically that sEM has better behavior than EM: it possesses better global convergence guarantees and is less prone to getting stuck in bad local optima.We complement our theoretical findings with experiments on simulated data, and demonstrate an application of sEM to an image segmentation task arising in neuroscience.In this setting, sEM yields segmentations that are significantly better than other approaches.

Gonzalo Mena · Amin Nejatbakhsh · Erdem Varol · Jonathan Niles-Weed
-
[ OpenReview  link »

We study Sinkhorn EM (sEM), a variant of expectation-maximization (EM) based on entropic optimal transport, as an algorithm for mixture model inference when prior information about the mixing weights is known.sEM differs from the classic EM algorithm in the way responsibilities are computed during the expectation step: rather than assign data points to clusters independently, sEM uses optimal transport to compute responsibilities that respect the known proportions.Like EM, sEM has a natural interpretation as a coordinate ascent procedure, which iteratively constructs and optimizes a lower bound on the log-likelihood.However, when the mixing weights are known, we show theoretically and empirically that sEM has better behavior than EM: it possesses better global convergence guarantees and is less prone to getting stuck in bad local optima.We complement our theoretical findings with experiments on simulated data, and demonstrate an application of sEM to an image segmentation task arising in neuroscience.In this setting, sEM yields segmentations that are significantly better than other approaches.

Gonzalo Mena · Amin Nejatbakhsh · Erdem Varol · Jonathan Niles-Weed
-
[ OpenReview  link »

Optimal transport (OT) theory underlies many emerging machine learning (ML) methods nowadays solving a wide range of tasks such as generative modeling, transfer learning and information retrieval. These latter works, however, usually build upon a traditional OT setup with two distributions, while leaving a more general multi-marginal OT formulation somewhat unexplored. In this paper, we study the multi-marginal OT (MMOT) problem and unify several popular OT methods under its umbrella by promoting structural information on the coupling.We show that incorporating such structural information into MMOT results in an instance of a different of convex (DC) programming problem allowing us to solve it numerically. Despite high computational cost of the latter procedure, the solutions provided by DC optimization are usually as qualitative as those obtained using currently employed optimization schemes.

Quang TRAN · Hicham Janati · Ievgen Redko · Rémi Flamary · Nicolas Courty
-
[ OpenReview  link »

Optimal transport (OT) theory underlies many emerging machine learning (ML) methods nowadays solving a wide range of tasks such as generative modeling, transfer learning and information retrieval. These latter works, however, usually build upon a traditional OT setup with two distributions, while leaving a more general multi-marginal OT formulation somewhat unexplored. In this paper, we study the multi-marginal OT (MMOT) problem and unify several popular OT methods under its umbrella by promoting structural information on the coupling.We show that incorporating such structural information into MMOT results in an instance of a different of convex (DC) programming problem allowing us to solve it numerically. Despite high computational cost of the latter procedure, the solutions provided by DC optimization are usually as qualitative as those obtained using currently employed optimization schemes.

Quang TRAN · Hicham Janati · Ievgen Redko · Rémi Flamary · Nicolas Courty
-
[ OpenReview  link »

We propose a new architecture to approximately learn incentive compatible, revenue-maximizing auctions from sampled valuations, which uses the Sinkhorn algorithm to perform a differentiable bipartite matching. This new framework allows the network to learn strategyproof revenue-maximizing mechanisms in settings not learnable by the previous RegretNet architecture. In particular, our architecture is able to learn mechanisms in settings without free disposal where each bidder must be allocated exactly some number of items. In experiments, we show our approach successfully recovers multiple known optimal mechanisms and high-revenue, low-regret mechanisms in larger settings where the optimal mechanism is unknown.

Michael Curry · Uro Lyi · Tom Goldstein · John P Dickerson
-
[ OpenReview  link »

We propose a new architecture to approximately learn incentive compatible, revenue-maximizing auctions from sampled valuations, which uses the Sinkhorn algorithm to perform a differentiable bipartite matching. This new framework allows the network to learn strategyproof revenue-maximizing mechanisms in settings not learnable by the previous RegretNet architecture. In particular, our architecture is able to learn mechanisms in settings without free disposal where each bidder must be allocated exactly some number of items. In experiments, we show our approach successfully recovers multiple known optimal mechanisms and high-revenue, low-regret mechanisms in larger settings where the optimal mechanism is unknown.

Michael Curry · Uro Lyi · Tom Goldstein · John P Dickerson
-
[ OpenReview  link »
We develop a computationally tractable method for estimating the optimal map between two distributions over $\mathbb{R}^d$ with rigorous finite-sample guarantees. Leveraging an entropic version of Brenier's theorem, we show that our estimator---the \emph{barycentric projection} of the optimal entropic plan---is easy to compute using Sinkhorn's algorithm. As a result, unlike current approaches for map estimation, which are slow to evaluate when the dimension or number of samples is large, our approach is parallelizable and extremely efficient even for massive data sets. Under smoothness assumptions on the optimal map, we show that our estimator enjoys comparable statistical performance to other estimators in the literature, but with much lower computational cost. We showcase the efficacy of our proposed estimator through numerical examples. Our proofs are based on a modified duality principle for entropic optimal transport and on a method for approximating optimal entropic plans due to Pal (2019).
Aram-Alexandre Pooladian · Jonathan Niles-Weed
-
[ OpenReview  link »
We develop a computationally tractable method for estimating the optimal map between two distributions over $\mathbb{R}^d$ with rigorous finite-sample guarantees. Leveraging an entropic version of Brenier's theorem, we show that our estimator---the \emph{barycentric projection} of the optimal entropic plan---is easy to compute using Sinkhorn's algorithm. As a result, unlike current approaches for map estimation, which are slow to evaluate when the dimension or number of samples is large, our approach is parallelizable and extremely efficient even for massive data sets. Under smoothness assumptions on the optimal map, we show that our estimator enjoys comparable statistical performance to other estimators in the literature, but with much lower computational cost. We showcase the efficacy of our proposed estimator through numerical examples. Our proofs are based on a modified duality principle for entropic optimal transport and on a method for approximating optimal entropic plans due to Pal (2019).
Aram-Alexandre Pooladian · Jonathan Niles-Weed
-
[ OpenReview  link »
We present recent approaches to prove the asymptotic behaviour of empirical transport cost, $\mathcal{T}_c(P_n,Q)$, under minimal assumptions in high dimension. Centering around its expectation, the weak limit of $\sqrt{n}\{\mathcal{T}_c(P_n,Q)-E\mathcal{T}_c(P_n,Q)\}$ is Gaussian. Yet, due to the curse of dimensionality, the variable $E\mathcal{T}_c(P_n,Q)$ can not be exchanged by its population counterpart $\mathcal{T}_c(P,Q)$. When $P$ is finitely supported this problem can be solved and the limit becomes the supremum of a centered Gaussian process, which is Gaussian under some additional conditions on the probability $Q$.
Alberto Gonzalez Sanz · Loubes Jean-Michel · Eustasio Barrio
-
[ OpenReview  link »

The gradient flow of a function over the space of probability densities with respect to the Wasserstein metric often exhibits nice properties and has been utilized in several machine learning applications. The standard approach to compute the Wasserstein gradient flow is the finite difference which discretizes the underlying space over a grid, and is not scalable. In this work, we propose a scalable proximal gradient type algorithm for Wasserstein gradient flow. The key of our method is a variational formulation of the objective function, which makes it possible to realize the JKO proximal map through a primal-dual optimization. This primal-dual problem can be efficiently solved by alternatively updating the parameters in the inner and outer loops. Our framework covers all the classical Wasserstein gradient flows including the heat equation and the porous medium equation. We demonstrate the performance and scalability of our algorithm with several numerical examples.

Jiaojiao Fan · Amirhossein Taghvaei · Yongxin Chen
-
[ OpenReview  link »

We introduce an entropy-regularized statistic that defines a divergence between probability distributions. The statistic is the transport cost of a coupling which admits an expression as a weighted average of Monge couplings with respect to a Gibbs measure. This coupling is related to the static Schr\"odinger bridge given a finite number of particles. We establish the asymptotic consistency of the statistic as the sample size goes to infinity and show that the population limit is the solution of F\"ollmer's entropy-regularized optimal transport. The proof technique relies on a chaos decomposition for paired samples. We illustrate the interest of the approach on the two-sample homogeneity testing problem.

Zaid Harchaoui · Lang Liu · Soumik Pal
-
[ OpenReview  link »

We introduce an entropy-regularized statistic that defines a divergence between probability distributions. The statistic is the transport cost of a coupling which admits an expression as a weighted average of Monge couplings with respect to a Gibbs measure. This coupling is related to the static Schr\"odinger bridge given a finite number of particles. We establish the asymptotic consistency of the statistic as the sample size goes to infinity and show that the population limit is the solution of F\"ollmer's entropy-regularized optimal transport. The proof technique relies on a chaos decomposition for paired samples. We illustrate the interest of the approach on the two-sample homogeneity testing problem.

Zaid Harchaoui · Lang Liu · Soumik Pal
-
[ OpenReview  link »

Complex measures recently became a well-established data model. We discuss the adaptation of the ubiquitous fast Fourier transform to measures, which involves their approximation by a multivariate trigonometric polynomial respecting normalization and non-negativity if applicable. The achieved approximation results, with respect to the Wasserstein-1 distance, are sharp up to logarithmic factors. The Fourier transform of atomic measures is shown to be computed up to logarithmic factors in linear time with respect to the problem size. The inverse Fourier transform is in general more involved but can be replaced by the easily computed approximation for typical applications.

Paul Catala · Mathias Hockmann · Stefan Kunis · Markus Wageringel
-
[ OpenReview  link »

Imitation learning (IL) seeks to teach agents specific tasks through expert demonstrations. One of the key approaches to IL is to define a distance between agent and expert and to find an agent policy that minimizes that distance. Optimal transport methods have been widely used in imitation learning as they provide ways to measure meaningful distances between agent and expert trajectories. However, the problem of how to optimally combine multiple expert demonstrations has not been widely studied. The standard method is to simply concatenate state (-action) trajectories, which is problematic when trajectories are multi-modal. We propose an alternative method that uses a multi-marginal optimal transport distance and enables the combination of multiple and diverse state-trajectories in the OT sense, providing a more sensible geometric average of the demonstrations. Our approach enables an agent to learn from several experts, and its efficiency is analyzed on OpenAI Gym control environments and demonstrates that the standard method is not always optimal.

ilana sebag · Samuel Cohen · Marc Deisenroth
-
[ OpenReview  link »

Multi-marginal optimal transport enables one to compare multiple probability measures, which increasingly finds application in multi-task learning problems.One practical limitation of multi-marginal transport is computational scalability in the number of measures, samples and dimensionality.In this work, we propose a multi-marginal optimal transport paradigm based on random one-dimensional projections, whose (generalized) distance we term the \emph{sliced multi-marginal Wasserstein distance}.To construct this distance, we introduce a characterization of the one-dimensional multi-marginal Kantorovich problem and use it to highlight a number of properties of the sliced multi-marginal Wasserstein distance. In particular, we show that (i) the sliced multi-marginal Wasserstein distance is a (generalized) metric that induces the same topology as the standard Wasserstein distance, (ii) it admits a dimension-free sample complexity, (iii) it is tightly connected with the problem of barycentric averaging under the sliced-Wasserstein metric.We conclude by illustrating the sliced multi-marginal Wasserstein on multi-task density estimation and multi-dynamics reinforcement learning problems.

Samuel Cohen · Alexander Terenin · Yannik Pitcan · Brandon Amos · Marc Deisenroth · Senanayak Sesh Kumar Karri
-
[ OpenReview  link »

We are interested in the challenging problem of modelling densities on Riemannian manifolds with a known symmetry group using normalising flows. This has many potential applications in physical sciences such as molecular dynamics and quantum simulations. In this work we combine ideas from implicit neural layers and optimal transport theory to propose a generalisation of existing work on exponential map flows, Implicit Riemannian Concave Potential Maps, IRCPMs. IRCPMs have some nice properties such as simplicity of incorporating knowledge about symmetries and are less expensive then ODE-flows. We provide an initial theoretical analysis of its properties and layout sufficient conditions for stable optimisation. Finally, we illustrate the properties of IRCPMs with density learning experiments on tori and spheres.

Danilo Jimenez Rezende · Sébastien Racanière
-
[ OpenReview  link »

We are interested in the challenging problem of modelling densities on Riemannian manifolds with a known symmetry group using normalising flows. This has many potential applications in physical sciences such as molecular dynamics and quantum simulations. In this work we combine ideas from implicit neural layers and optimal transport theory to propose a generalisation of existing work on exponential map flows, Implicit Riemannian Concave Potential Maps, IRCPMs. IRCPMs have some nice properties such as simplicity of incorporating knowledge about symmetries and are less expensive then ODE-flows. We provide an initial theoretical analysis of its properties and layout sufficient conditions for stable optimisation. Finally, we illustrate the properties of IRCPMs with density learning experiments on tori and spheres.

Danilo Jimenez Rezende · Sébastien Racanière
-
[ OpenReview  link »

Financial instability is one of the most important challenges in the transition to a low-carbon economy. Banks are encouraged to align their portfolios to CO2 trajectories fixed by international agreements, showing the necessity of a quantitative methodology to implement it. Wepropose a mathematical formulation for this problem and an optimization criterion for a transition between the current portfolio and a target one. The optimization Problem combines an Optimal Transport problem for which the cost is defined according to the financial context and a Credit Risk estimation.

Clara Lage · Emmanuel Gobet
-
[ OpenReview  link »

This paper introduces Wasserstein Adversarially Regularized Graph Autoencoder (WARGA), an implicit generative algorithm that directly regularizes the latent distribution of node embedding to a target distribution via the Wasserstein metric. The proposed method has been validated in tasks of link prediction and node clustering on real-world graphs, in which WARGA generally outperforms state-of-the-art models based on Kullback-Leibler (KL) divergence and typical adversarial framework.

Huidong Liang · Junbin Gao
-
[ OpenReview  link »

Schrödinger Bridges (SBs) is an optimal transport problem that has received increasing attentions in deep generative modeling for its mathematical flexibility compared to Scored-based Generative Models (SGMs). However, it remains unclear whether the optimization principle of SBs relates to modern training of deep generative models, which often relies on constructing parametrized log-likelihood objectives.This raises questions on the suitability of SB models as a principled alternative for generative applications. In this work, we present a novel computational framework for likelihood training of SB models grounded on Forward-Backward Stochastic Differential Equations Theory – a nonlinear stochastic optimal control principle that transforms the optimality condition of SBs into a set of SDEs. Crucially, these SDEs can be used to construct the likelihood objectives for SBs that, surprisingly, generalizes the ones for SGMs as special cases. This leads to a new optimization principle that inherits the same SB optimality yet without losing application of modern generative training techniques, and we show that the resulting training algorithm achieves encouraging results on generating realistic images on MNIST and CelebA.

Tianrong Chen · Guan-Horng Liu · Evangelos Theodorou
-
[ OpenReview  link »

Optimal transport distances (OT) have been widely used in recent work in Machine Learning as ways to compare probability distributions. These are costly to compute when the data lives in high dimension.Recent work aims specifically at reducing this cost by computing OT using low-rank projections of the data (seen as discrete measures)~\citep{paty2019subspace}. We extend this approach and show that one can approximate OT distances by using more general families of maps provided they are 1-Lipschitz. The best estimate is obtained by maximising OT over the given family. As OT calculations are done after mapping data to a lower dimensional space, our method scales well with the original data dimension.We demonstrate the idea with neural networks.

Patric Fulop · Vincent Danos
-
[ OpenReview  link »

We introduce a new class of convex-regularized Optimal Transport losses, which generalizes the classical Entropy-regularization of Optimal Transport and Sinkhorn divergences, and propose a generalized Sinkhorn algorithm. Our framework unifies many regularizations and numerical methods previously appeared in the literature. We show the existence of the maximizer for the dual problem, complementary slackness conditions, providing a complete characterization of solutions for such class of variational problems. As a consequence, we study structural properties of these losses, including continuity, differentiability and provide explicit formulas for its gradient. Finally, we provide theoretical guarantees of convergences and stability of the generalized Sinkhorn algorithm, even in the continuous setting. The techniques developed here are directly applicable also to study Wasserstein barycenters or, more generally, multi-marginal problems.

Augusto Gerolin · Simone Di Marino
-
[ OpenReview  link »

This paper shows that a popular approach to the supervised embedding of documents for classification, namely, contrastive Word Mover's Embedding, can be significantly enhanced by adding interpretability. This interpretability is achieved by incorporating a clustering promoting mechanism into the contrastive loss. On several public datasets, we show that our method improves significantly upon existing baselines while providing interpretation to the clusters via identifying a set of keywords that are the most representative of a particular class. Our approach was motivated in part by the need to develop Natural Language Processing (NLP) methods for the novel problem of assessing student work for scientific writing and thinking - a problem that is central to the area of (educational) Learning Sciences (LS). In this context, we show that our approach leads to a meaningful assessment of the student work related to lab reports from a biology class and can help LS researchers gain insights into student understanding and assess evidence of scientific thought processes.

ruijie jiang · Julia Gouvea · Eric L Miller · David Hammer · Shuchin Aeron
-
[ OpenReview  link »

Gradient flows are a powerful tool for optimizing functionals in general metric spaces, including the space of probabilities endowed with the Wasserstein metric. A typical approach to solving this optimization problem relies on its connection to the dynamic formulation of optimal transport and the celebrated Jordan-Kinderlehrer-Otto (JKO) scheme. However, this formulation involves optimization over convex functions, which is challenging, especially in high dimensions. In this work, we propose an approach that relies on the recently introduced input-convex neural networks (ICNN) to parameterize the space of convex functions in order to approximate the JKO scheme, as well as in designing functionals over measures that enjoy convergence guarantees. We derive a computationally efficient implementation of this JKO-ICNN framework and use various experiments to demonstrate its feasibility and validity in approximating solutions of low-dimensional partial differential equations with known solutions. We also explore the use of our JKO-ICNN approach in high dimensions with an experiment in controlled generation for molecular discovery.

David Alvarez-Melis · Yair Schiff · Youssef Mroueh
-
[ OpenReview  link »

Gradient flows are a powerful tool for optimizing functionals in general metric spaces, including the space of probabilities endowed with the Wasserstein metric. A typical approach to solving this optimization problem relies on its connection to the dynamic formulation of optimal transport and the celebrated Jordan-Kinderlehrer-Otto (JKO) scheme. However, this formulation involves optimization over convex functions, which is challenging, especially in high dimensions. In this work, we propose an approach that relies on the recently introduced input-convex neural networks (ICNN) to parameterize the space of convex functions in order to approximate the JKO scheme, as well as in designing functionals over measures that enjoy convergence guarantees. We derive a computationally efficient implementation of this JKO-ICNN framework and use various experiments to demonstrate its feasibility and validity in approximating solutions of low-dimensional partial differential equations with known solutions. We also explore the use of our JKO-ICNN approach in high dimensions with an experiment in controlled generation for molecular discovery.

David Alvarez-Melis · Yair Schiff · Youssef Mroueh
-
[ OpenReview  link »

In this paper, we present a new formulation of unbalanced optimal transport called Dual Regularized Optimal Transport (DROT). We argue that regularizing the dual formulation of optimal transport results in a version of unbalanced optimal transport that leads to sparse solutions and that gives us control over mass creation and destruction. We build intuition behind such control and present theoretical properties of the solutions to DROT. We demonstrate that due to recent advances in optimization techniques, we can feasibly solve such a formulation at large scales and present extensive experimental evidence for this formulation and its solution.

Rishi Sonthalia · Anna Gilbert

Author Information

Jason Altschuler (MIT)
Charlotte Bunne (ETH Zurich)
Laetitia Chapel (IRISA)
Marco Cuturi (Google Brain & CREST - ENSAE)

Marco Cuturi is a research scientist at Google AI, Brain team in Paris. He received his Ph.D. in 11/2005 from the Ecole des Mines de Paris in applied mathematics. Before that he graduated from National School of Statistics (ENSAE) with a master degree (MVA) from ENS Cachan. He worked as a post-doctoral researcher at the Institute of Statistical Mathematics, Tokyo, between 11/2005 and 3/2007 and then in the financial industry between 4/2007 and 9/2008. After working at the ORFE department of Princeton University as a lecturer between 2/2009 and 8/2010, he was at the Graduate School of Informatics of Kyoto University between 9/2010 and 9/2016 as a tenured associate professor. He joined ENSAE in 9/2016 as a professor, where he is now working part-time. His main employment is now with Google AI (Brain team in Paris) since 10/2018, as a research scientist working on fundamental aspects of machine learning.

Rémi Flamary (École Polytechnique)
Gabriel Peyré (CNRS and ENS)
Alexandra Suvorikova (WIAS Berlin)

More from the Same Authors