Workshop
Optimal Transport and Machine Learning
Jason Altschuler · Charlotte Bunne · Laetitia Chapel · Marco Cuturi · Rémi Flamary · Gabriel Peyré · Alexandra Suvorikova
Over the last few years, optimal transport (OT) has quickly become a central topic in machine learning. OT is now routinely used in many areas of ML, ranging from the theoretical use of OT flow for controlling learning algorithms to the inference of highdimensional cell trajectories in genomics. The Optimal Transport and Machine Learning (OTML) workshop series (in '14, '17, '19) has been instrumental in shaping this research thread. For this new installment of OTML, we aim even bigger by hosting an exceptional keynote speaker, Alessio Figalli, who received the 2018 Fields Medal for his breakthroughs in the analysis of the regularity properties of OT. OTML will be a unique opportunity for crossfertilization between recent advances in pure mathematics and challenging highdimensional learning problems.
Schedule
Mon 5:00 a.m.  5:45 a.m.

Optimal Transport in the Biomedical Sciences: Challenges and Opportunities
(
Plenary talk
)
>
SlidesLive Video 
Caroline Uhler 🔗 
Mon 5:45 a.m.  6:00 a.m.

Implicit Riemannian Concave Potential Maps
(
Oral
)
>
SlidesLive Video We are interested in the challenging problem of modelling densities on Riemannian manifolds with a known symmetry group using normalising flows. This has many potential applications in physical sciences such as molecular dynamics and quantum simulations. In this work we combine ideas from implicit neural layers and optimal transport theory to propose a generalisation of existing work on exponential map flows, Implicit Riemannian Concave Potential Maps, IRCPMs. IRCPMs have some nice properties such as simplicity of incorporating knowledge about symmetries and are less expensive then ODEflows. We provide an initial theoretical analysis of its properties and layout sufficient conditions for stable optimisation. Finally, we illustrate the properties of IRCPMs with density learning experiments on tori and spheres. 
Danilo Jimenez Rezende · Sébastien Racanière 🔗 
Mon 6:00 a.m.  7:10 a.m.

Regularity theory of optimal transport maps
(
Plenary talk
)
>
SlidesLive Video In optimal transport, understanding the regularity of optimal maps is an important topic. This lecture aims to present the regularity theory for optimal maps, explain the connection to MongeAmpère type equations, and overview the most fundamental results available. 
Alessio Figalli 🔗 
Mon 7:10 a.m.  7:35 a.m.

Generative adversarial learning with adapted distances
(
Keynote talk
)
>
SlidesLive Video Generative Adversarial Networks (GANs) have proven to be a powerful framework for learning to draw samples from complex distributions. In this talk I will discuss the challenge of learning sequential data via GANs. This notably requires the choice of a loss function that reflects the discrepancy between (measures on) paths. To take on this task, we employ adapted versions of optimal transport distances, that result from imposing a temporal causality constraint on classical transport problems. This constraint provides a natural framework to parameterize the cost function that is learned by the discriminator as a robust (worstcase) distance. We then employ a modification of the empirical measure, to ensure consistency of the estimators. Following Genevay et al. (2018), we also include an entropic penalization term which allows for the use of the Sinkhorn algorithm when computing the optimal transport cost. 
Beatrice Acciaio 🔗 
Mon 7:35 a.m.  8:15 a.m.

Spotlight Presentations
(
Spotlight Presentations
)
>

🔗 
Mon 8:15 a.m.  8:45 a.m.

Poster Session
(
Poster Session
)
>
link
See list of posters below 
🔗 
Mon 8:45 a.m.  9:30 a.m.

Entropic Regularization of Optimal Transport as a Statistical Regularization
(
Plenary talk
)
>
SlidesLive Video The squared 2Wasserstein distance is a natural loss to compare probability distributions in generative models or density fitting tasks thanks to its « informative » gradient, but this loss suffers from a poor sample and computational complexity compared to alternative losses such as kernel MMD. Adding an entropic regularization and debiaising the resulting quantity (yielding the Sinkhorn divergence) mitigates these downsides but also leads to a degradation of the discriminative power of the loss and of the quality of its gradients. In order to understand the tradeoffs at play, we propose to study entropic regularization as one typically studies regularization in Machine Learning: by discussing the optimization, estimation and approximation errors, and their tradeoffs, covering in passing a variety of recent works in the field. The analysis, complemented with numerical experiments, suggests that entropic regularization actually improves the quality and efficiency of the estimation of the squared 2Wasserstein distance, compared to the plugin (i.e unregularized) estimator. 
Lénaïc Chizat 🔗 
Mon 9:30 a.m.  9:55 a.m.

Optimal transport and probability flows
(
Keynote talk
)
>
SlidesLive Video In this talk, I will present some recent work at the intersection of optimal transport (OT) and probability flows. Optimal transport is an elegant theory that has diverse downstream applications. For likelihood estimation in particular, there has been a recent interest in using parametric invertible models (aka normalizing flows) to approximate the data distribution of interest. I will present my recent work on parameterizing flows using a neural convex potential, which is inspired by Brenier's theorem. In addition, I will cover a few other recently proposed probability flow models related to OT. 
ChinWei Huang 🔗 
Mon 9:55 a.m.  10:20 a.m.

Graphical Optimal Transport and its applications
(
Keynote talk
)
>
SlidesLive Video Multimarginal optimal transport (MOT) is a generalization of optimal transport theory to settings with possibly more than two marginals. The computation of the solutions to MOT problems has been a longstanding challenge. In this talk, we introduce graphical optimal transport, a special class of MOT problems. We consider MOT problems from a probabilistic graphical model perspective and point out an elegant connection between the two when the underlying cost for optimal transport allows a graph structure. In particular, an entropy regularized MOT is equivalent to a Bayesian marginal inference problem for probabilistic graphical models with the additional requirement that some of the marginal distributions are specified. This relation on the one hand extends the optimal transport as well as the probabilistic graphical model theories, and on the other hand leads to fast algorithms for MOT by leveraging the welldeveloped algorithms in Bayesian inference. We will cover recent developments of graphical optimal transport in theory and algorithms. We will also go over several applications in aggregate filtering and mean field games. 
Yongxin Chen 🔗 
Mon 10:20 a.m.  11:00 a.m.

Poster Session
(
Poster Session
)
>
link
See list of posters below 
🔗 
Mon 11:00 a.m.  11:25 a.m.

Enabling integrated analysis of singlecell multiomic datasets with optimal transport
(
Keynote talk
)
>
SlidesLive Video In this work, I will present an application of optimal transport to integrate multimodal biological datasets. Cells in multicellular organisms specialize to carry out different functions despite having the same genetic material. This is thanks to celltypespecific gene regulation and misregulation of genes can result in disease. With today’s sequencing technologies, we can take measurements at the singlecell resolution and probe different aspects of the genome that influence gene regulation, such as chemical modifications on the DNA, its 3D structure, etc. Jointly studying these measurements will give a holistic view of the regulatory mechanisms. However, with a few exceptions, applying multiple technologies on the same single cell is not possible. Then, computational integration of separately taken multimodal genomic (“multiomic”) measurements is crucial to enable joint analyses. This task requires an unsupervised approach due to the lack of correspondences known as a priori. We present an algorithm, Single Cell alignment with Optimal Transport (SCOT), that relies on GromovWasserstein optimal transport to align singlecell multiomic datasets. We show that SCOT yields alignments competitive with stateoftheart and unlike previous methods, can approximately selftune its hyperparameters by tracking the GromovWasserstein distance between the aligned datasets. With its unbalanced multimodal extension, it can integrate more than two datasets and yields quality alignments in different scenarios of disproportionate cell type representation across measurements. 
Pinar Demetci 🔗 
Mon 11:25 a.m.  11:40 a.m.

Entropic estimation of optimal transport maps
(
Oral
)
>
SlidesLive Video
We develop a computationally tractable method for estimating the optimal map between two distributions over $\mathbb{R}^d$ with rigorous finitesample guarantees. Leveraging an entropic version of Brenier's theorem, we show that our estimatorthe \emph{barycentric projection} of the optimal entropic planis easy to compute using Sinkhorn's algorithm. As a result, unlike current approaches for map estimation, which are slow to evaluate when the dimension or number of samples is large, our approach is parallelizable and extremely efficient even for massive data sets. Under smoothness assumptions on the optimal map, we show that our estimator enjoys comparable statistical performance to other estimators in the literature, but with much lower computational cost. We showcase the efficacy of our proposed estimator through numerical examples. Our proofs are based on a modified duality principle for entropic optimal transport and on a method for approximating optimal entropic plans due to Pal (2019).

AramAlexandre Pooladian · Jonathan NilesWeed 🔗 
Mon 11:40 a.m.  11:55 a.m.

Discrete Schrödinger Bridges with Applications to TwoSample Homogeneity Testing
(
Oral
)
>
SlidesLive Video We introduce an entropyregularized statistic that defines a divergence between probability distributions. The statistic is the transport cost of a coupling which admits an expression as a weighted average of Monge couplings with respect to a Gibbs measure. This coupling is related to the static Schrödinger bridge given a finite number of particles. We establish the asymptotic consistency of the statistic as the sample size goes to infinity and show that the population limit is the solution of Föllmer's entropyregularized optimal transport. The proof technique relies on a chaos decomposition for paired samples. We illustrate the interest of the approach on the twosample homogeneity testing problem. 
Zaid Harchaoui · Lang Liu · Soumik Pal 🔗 
Mon 11:55 a.m.  12:20 p.m.

Benefits of using optimal transport in computational learning and inversion
(
Keynote talk
)
>
SlidesLive Video Understanding the generalization capacity has been a central topic in mathematical machine learning. In this talk, I will present a generalized weighted leastsquares optimization method for computational learning and inversion with noisy data. In particular, using the Wasserstein metric as the objective function and implementing the Wasserstein gradient flow (or Wasserstein natural gradient descent method) fall into the framework. The weighting scheme encodes both a priori knowledge on the object to be learned and a strategy to weight the contribution of different data points in the loss function. We will see that appropriate weighting from prior knowledge can greatly improve the generalization capability of the learned model. 
Yunan Yang 🔗 
Mon 12:20 p.m.  12:25 p.m.

Concluding Remarks
(
Discussion
)
>
SlidesLive Video 
🔗 
Mon 12:25 p.m.  1:00 p.m.

Poster session  3 and social interaction
(
Poster session
)
>
link
See list of posters below 
🔗 


Faster Unbalanced Optimal Transport: Translation invariant Sinkhorn and 1D FrankWolfe
(
Poster
)
>
Unbalanced optimal transport (UOT) extends optimal transport (OT) to take into account mass variations to compare distributions. This is crucial to make OT successful in ML applications, making it robust to data normalization and outliers. The baseline algorithm is Sinkhorn, but its convergence speed might be significantly slower for UOT than for OT. In this work, we identify the cause for this deficiency, namely the lack of a global normalization of the iterates, which equivalently corresponds to a translation of the dual OT potentials. Our first contribution leverages this idea to develop an accelerated Sinkhorn algorithm (coined ”translation invariant Sinkhorn”) for UOT, bridging the computational gap with OT. Our second contribution focusses on 1D UOT and proposes a FrankWolfe solver applied to this translation invariant formulation. The linear oracle of each steps amounts to solving a 1D OT problems, resulting in a linear time complexity per iteration. Our last contribution extends this method to the computation of UOT barycenter of 1D measures. Numerical simulations showcase the convergence speed improvement brought by these three approaches. 
Thibault Sejourne · FrancoisXavier Vialard · Gabriel Peyré 🔗 


Faster Unbalanced Optimal Transport: Translation invariant Sinkhorn and 1D FrankWolfe
(
Spotlight
)
>
SlidesLive Video Unbalanced optimal transport (UOT) extends optimal transport (OT) to take into account mass variations to compare distributions. This is crucial to make OT successful in ML applications, making it robust to data normalization and outliers. The baseline algorithm is Sinkhorn, but its convergence speed might be significantly slower for UOT than for OT. In this work, we identify the cause for this deficiency, namely the lack of a global normalization of the iterates, which equivalently corresponds to a translation of the dual OT potentials. Our first contribution leverages this idea to develop an accelerated Sinkhorn algorithm (coined ”translation invariant Sinkhorn”) for UOT, bridging the computational gap with OT. Our second contribution focusses on 1D UOT and proposes a FrankWolfe solver applied to this translation invariant formulation. The linear oracle of each steps amounts to solving a 1D OT problems, resulting in a linear time complexity per iteration. Our last contribution extends this method to the computation of UOT barycenter of 1D measures. Numerical simulations showcase the convergence speed improvement brought by these three approaches. 
Thibault Sejourne · FrancoisXavier Vialard · Gabriel Peyré 🔗 


Linear Convergence of Batch Greenkhorn for Regularized Multimarginal Optimal Transport
(
Poster
)
>
link
A prominent class of algorithms for solving regularized optimal transport problems is that of iterative Bregman projections (IBP). Among them, in the classical bimarginal case, superior performance of Greenkhorn algorithm to Sinkhorn algorithm has been testified in several works. Here we prove global linear convergence of a new batch Greenkhorn algorithm for regularized multimarginal optimal transport, which processes at each iteration only a batch of components of a selected marginal. While the linear convergence is established for the famous example of cyclic IBP  the Sinkhorn algorithm, in general for IBP this question is open. Our framework of Batch Greenkhorn is general enough to cover, as special cases some existing algorithms in optimal transport like Greenkhorn algorithm for bimarginal, and (greedy) MultiSinkhorn algorithm for multimarginal optimal transport, for which we provide explicit global linear convergence rates. Moreover, our results highlight the critical role played by the batch size in accelerating the convergence of the algorithm. 
Vladimir Kostic · Saverio Salzo · Massimiliano Pontil 🔗 


Gradient flows on graphons: existence, convergence, continuity equations
(
Poster
)
>
link
Wasserstein gradient flows on probability measures have found a host of applications in various optimization problems. They typically arise as the continuum limit of exchangeable particle systems evolving by some meanfield interaction involving a gradienttype potential. However, in many problems, such as in multilayer neural networks, the socalled particles are edge weights on large graphs whose nodes are exchangeable. Such large graphs are known to converge to continuum limits called graphons as their size grow to infinity. We show that the Euclidean gradient flow of a suitable function of the edgeweights converges to a novel continuum limit given by a curve on the space of graphons that can be appropriately described as a gradient flow or, more technically, a curve of maximal slope. Furthermore, such curves can also be recovered as the limit of a sequence of continuity equations satisfied by the timemarginal laws of stochastic processes of random matrices. 
Sewoong Oh · Soumik Pal · Raghav Somani · Raghav Tripathi 🔗 


LinearTime Gromov Wasserstein Distances using Low Rank Couplings and Costs
(
Poster
)
>
link
The ability to compare and align points across datasets that are known to be related, yet incomparablebecause they live in heterogeneous spacesplays an increasingly important role in machine learning. The GromovWasserstein (GW) formalism can help tackle this problem. Its goal is to seek a lowdistortion assignment between points taken across these incomparable datasets.As a nonconvex and quadratic generalization of optimal transport (OT), GW is NPhard. Although heuristics are known to work reasonably well (e.g. by solving a sequence of nested regularized OT problems), they still remain too costly to scale, with \emph{cubic} complexity in the number of samples $n$. As a result GW is far less used in practice than usual OT.We examine in this work how a recent variant of the OT problem that narrows down the set of admissible couplings to those admitting a low rank factorization of two subcouplings is particularly well suited to the resolution of GW.By updating alternatively each subcoupling, our algorithm computes a stationary point of the problem in time $O(n^2)$. When cost matrices have low rank , our algorithm becomes linear in $n$.We demonstrate the efficiency of our method on simulated and real data.

Meyer Scetbon · Gabriel Peyré · Marco Cuturi 🔗 


LinearTime Gromov Wasserstein Distances using Low Rank Couplings and Costs
(
Spotlight
)
>
link
SlidesLive Video
The ability to compare and align points across datasets that are known to be related, yet incomparablebecause they live in heterogeneous spacesplays an increasingly important role in machine learning. The GromovWasserstein (GW) formalism can help tackle this problem. Its goal is to seek a lowdistortion assignment between points taken across these incomparable datasets.As a nonconvex and quadratic generalization of optimal transport (OT), GW is NPhard. Although heuristics are known to work reasonably well (e.g. by solving a sequence of nested regularized OT problems), they still remain too costly to scale, with \emph{cubic} complexity in the number of samples $n$. As a result GW is far less used in practice than usual OT.We examine in this work how a recent variant of the OT problem that narrows down the set of admissible couplings to those admitting a low rank factorization of two subcouplings is particularly well suited to the resolution of GW.By updating alternatively each subcoupling, our algorithm computes a stationary point of the problem in time $O(n^2)$. When cost matrices have low rank , our algorithm becomes linear in $n$.We demonstrate the efficiency of our method on simulated and real data.

Meyer Scetbon · Gabriel Peyré · Marco Cuturi 🔗 


CrossDomain Lossy Compression as Optimal Transport with an Entropy Bottleneck
(
Poster
)
>
We consider a generalization of the ratedistortionperception framework for lossy compression in which the reconstruction must attain a certain target distribution. This may arise, for example, in image restoration applications when there is a bit interface between a sender who records with degraded quality and receiver who wishes to recover a clear image. In this work, we characterize this as an optimal transport problem with constrained entropy between source and target distributions. We show that optimal solutions to this problem follow a framework that partially decouples the problems of compression and transport, and demonstrate the utility of common randomness. We show that the performance of a deep learning architecture following this framework is competitive with an endtoend system. 
Huan Liu · George Zhang · Jun Chen · Ashish Khisti 🔗 


Learning SingleCell Perturbation Responses using Neural Optimal Transport
(
Poster
)
>
The ability to understand and predict molecular responses towards external perturbations is a core question in molecular biology. Technological advancements in the recent past have enabled the generation of highresolution singlecell data, making it possible to profile individual cells under different experimentally controlled perturbations. However, cells are typically destroyed during measurement, resulting in unpaired distributions over either perturbed or nonperturbed cells. Leveraging the theory of optimal transport and the recent advents of convex neural architectures, we learn a coupling describing the response of cell populations upon perturbation, enabling us to predict state trajectories on a singlecell level.We apply our approach, CellOT, to predict treatment responses of 21,650 cells subject to four different drug perturbations. CellOT outperforms current stateoftheart methods both qualitatively and quantitatively, accurately capturing cellular behavior shifts across all different drugs. 
Charlotte Bunne · Stefan Stark · Gabriele Gut · Andreas Krause · Gunnar Rätsch · Lucas Pelkmans · Kjong Lehmann 🔗 


Input Convex Gradient Networks
(
Poster
)
>
link
The gradients of convex functions are expressive models of nontrivial vector fields. For example, the optimal transport map between any two measures on Euclidean spaces under the squared distance is realized as a convex gradients via Brenier's theorem, which is a key insight used in recent machine learning flow models. In this paper, we study how to model convex gradients by integrating a Jacobianvector product parameterized by a neural network, which we call the Input Convex Gradient Network (ICGN). We theoretically study ICGNs and compare them to modeling the gradient by taking the derivative of an inputconvex neural network, demonstrating that ICGNs can efficiently parameterize convex gradients. 
Jack RichterPowell · Jonathan Lorraine · Brandon Amos 🔗 


Input Convex Gradient Networks
(
Spotlight
)
>
link
SlidesLive Video The gradients of convex functions are expressive models of nontrivial vector fields. For example, the optimal transport map between any two measures on Euclidean spaces under the squared distance is realized as a convex gradients via Brenier's theorem, which is a key insight used in recent machine learning flow models. In this paper, we study how to model convex gradients by integrating a Jacobianvector product parameterized by a neural network, which we call the Input Convex Gradient Network (ICGN). We theoretically study ICGNs and compare them to modeling the gradient by taking the derivative of an inputconvex neural network, demonstrating that ICGNs can efficiently parameterize convex gradients. 
Jack RichterPowell · Jonathan Lorraine · Brandon Amos 🔗 


On the complexity of the optimal transport problem with graphstructured cost
(
Poster
)
>
link
Multimarginal optimal transport (MOT) is a generalization of optimal transport to multiple marginals. Optimal transport has evolved into an important tool in many machine learning applications, and its multimarginal extension opens up for addressing new challenges in the field of machine learning. However, the usage of MOT has been largely impeded by its computational complexity which scales exponentially in the number of marginals. Fortunately, in many applications, such as barycenter or interpolation problems, the cost function adheres to structures, which has recently been exploited for developing efficient computational methods. In this work, we derive computational bounds for these methods. With $m$ marginal distributions supported on $n$ points, we provide a $ \mathcal{\tilde O}(d(G)m n^2\epsilon^{2})$ bound for a $\epsilon$accuracy when the problem is associated with a tree with diameter $d(G)$. For the special case of the Wasserstein barycenter problem, which corresponds to a starshaped tree, our bound is in alignment with the existing complexity bound for it.

Jiaojiao Fan · Isabel Haasler · Johan Karlsson · Yongxin Chen 🔗 


Subspace Detours Meet GromovWasserstein
(
Poster
)
>
link
In the context of optimal transport methods, the subspace detour approach was recently presented by Muzellec and Cuturi (2019). It consists in building a nearly optimal transport plan in the measures space from an optimal transport plan in a wisely chosen subspace, onto which the original measures are projected. The contribution of this paper is to extend this category of methods to the GromovWasserstein problem, which is a particular type of transport distance involving the inner geometry of the compared distributions. After deriving the associated formalism and properties, we also discuss a specific cost for which we can show connections with the KnotheRosenblatt rearrangement. We finally give an experimental illustration on a shape matching problem. 
Clément Bonet · Nicolas Courty · François Septier · Lucas Drumetz 🔗 


Subspace Detours Meet GromovWasserstein
(
Spotlight
)
>
link
SlidesLive Video In the context of optimal transport methods, the subspace detour approach was recently presented by Muzellec and Cuturi (2019). It consists in building a nearly optimal transport plan in the measures space from an optimal transport plan in a wisely chosen subspace, onto which the original measures are projected. The contribution of this paper is to extend this category of methods to the GromovWasserstein problem, which is a particular type of transport distance involving the inner geometry of the compared distributions. After deriving the associated formalism and properties, we also discuss a specific cost for which we can show connections with the KnotheRosenblatt rearrangement. We finally give an experimental illustration on a shape matching problem. 
Clément Bonet · Nicolas Courty · François Septier · Lucas Drumetz 🔗 


Measuring association with Wasserstein distances
(
Poster
)
>
link
Let π ∈ Π(μ, ν) be a coupling between two probability measures μ and ν on a Polish space. We propose and study a class of nonparametric measures of association between μ and ν, which we call Wasserstein correlation coefficients. These coefficients are based on the Wasserstein distance between ν and the disintegration of π with respect to the first coordinate. We also establish basic statistical properties of this new class of measures: we develop a statistical theory for strongly consistent estimators and determine their convergence rate in the case of compactly supported measures μ and ν. Throughout our analysis we make use of the socalled adapted/bicausal Wasserstein distance, in particular we rely on results established in [Backhoff, Bartl, Beiglböck, Wiesel. Estimating processes in adapted Wasserstein distance. 2020]. Our approach applies to probability laws on general Polish spaces. 
Johannes Wiesel 🔗 


Sinkhorn EM: An ExpectationMaximization algorithm based on entropic optimal transport
(
Poster
)
>
link
We study Sinkhorn EM (sEM), a variant of expectationmaximization (EM) based on entropic optimal transport, as an algorithm for mixture model inference when prior information about the mixing weights is known.sEM differs from the classic EM algorithm in the way responsibilities are computed during the expectation step: rather than assign data points to clusters independently, sEM uses optimal transport to compute responsibilities that respect the known proportions.Like EM, sEM has a natural interpretation as a coordinate ascent procedure, which iteratively constructs and optimizes a lower bound on the loglikelihood.However, when the mixing weights are known, we show theoretically and empirically that sEM has better behavior than EM: it possesses better global convergence guarantees and is less prone to getting stuck in bad local optima.We complement our theoretical findings with experiments on simulated data, and demonstrate an application of sEM to an image segmentation task arising in neuroscience.In this setting, sEM yields segmentations that are significantly better than other approaches. 
Gonzalo Mena · Amin Nejatbakhsh · Erdem Varol · Jonathan NilesWeed 🔗 


Sinkhorn EM: An ExpectationMaximizationalgorithm based on entropic optimal transport
(
Spotlight
)
>
link
SlidesLive Video We study Sinkhorn EM (sEM), a variant of expectationmaximization (EM) based on entropic optimal transport, as an algorithm for mixture model inference when prior information about the mixing weights is known.sEM differs from the classic EM algorithm in the way responsibilities are computed during the expectation step: rather than assign data points to clusters independently, sEM uses optimal transport to compute responsibilities that respect the known proportions.Like EM, sEM has a natural interpretation as a coordinate ascent procedure, which iteratively constructs and optimizes a lower bound on the loglikelihood.However, when the mixing weights are known, we show theoretically and empirically that sEM has better behavior than EM: it possesses better global convergence guarantees and is less prone to getting stuck in bad local optima.We complement our theoretical findings with experiments on simulated data, and demonstrate an application of sEM to an image segmentation task arising in neuroscience.In this setting, sEM yields segmentations that are significantly better than other approaches. 
Gonzalo Mena · Amin Nejatbakhsh · Erdem Varol · Jonathan NilesWeed 🔗 


Factored couplings in multimarginal optimal transport via difference of convex programming
(
Poster
)
>
link
Optimal transport (OT) theory underlies many emerging machine learning (ML) methods nowadays solving a wide range of tasks such as generative modeling, transfer learning and information retrieval. These latter works, however, usually build upon a traditional OT setup with two distributions, while leaving a more general multimarginal OT formulation somewhat unexplored. In this paper, we study the multimarginal OT (MMOT) problem and unify several popular OT methods under its umbrella by promoting structural information on the coupling.We show that incorporating such structural information into MMOT results in an instance of a different of convex (DC) programming problem allowing us to solve it numerically. Despite high computational cost of the latter procedure, the solutions provided by DC optimization are usually as qualitative as those obtained using currently employed optimization schemes. 
Quang Huy TRAN · Hicham Janati · Ievgen Redko · Rémi Flamary · Nicolas Courty 🔗 


Factored couplings in multimarginal optimal transport via difference of convex programming
(
Spotlight
)
>
link
SlidesLive Video Optimal transport (OT) theory underlies many emerging machine learning (ML) methods nowadays solving a wide range of tasks such as generative modeling, transfer learning and information retrieval. These latter works, however, usually build upon a traditional OT setup with two distributions, while leaving a more general multimarginal OT formulation somewhat unexplored. In this paper, we study the multimarginal OT (MMOT) problem and unify several popular OT methods under its umbrella by promoting structural information on the coupling.We show that incorporating such structural information into MMOT results in an instance of a different of convex (DC) programming problem allowing us to solve it numerically. Despite high computational cost of the latter procedure, the solutions provided by DC optimization are usually as qualitative as those obtained using currently employed optimization schemes. 
Quang Huy TRAN · Hicham Janati · Ievgen Redko · Rémi Flamary · Nicolas Courty 🔗 


Learning RevenueMaximizing Auctions With Differentiable Matching
(
Poster
)
>
link
We propose a new architecture to approximately learn incentive compatible, revenuemaximizing auctions from sampled valuations, which uses the Sinkhorn algorithm to perform a differentiable bipartite matching. This new framework allows the network to learn strategyproof revenuemaximizing mechanisms in settings not learnable by the previous RegretNet architecture. In particular, our architecture is able to learn mechanisms in settings without free disposal where each bidder must be allocated exactly some number of items. In experiments, we show our approach successfully recovers multiple known optimal mechanisms and highrevenue, lowregret mechanisms in larger settings where the optimal mechanism is unknown. 
Michael Curry · Uro Lyi · Tom Goldstein · John P Dickerson 🔗 


Learning RevenueMaximizing Auctions With Differentiable Matching
(
Spotlight
)
>
link
SlidesLive Video We propose a new architecture to approximately learn incentive compatible, revenuemaximizing auctions from sampled valuations, which uses the Sinkhorn algorithm to perform a differentiable bipartite matching. This new framework allows the network to learn strategyproof revenuemaximizing mechanisms in settings not learnable by the previous RegretNet architecture. In particular, our architecture is able to learn mechanisms in settings without free disposal where each bidder must be allocated exactly some number of items. In experiments, we show our approach successfully recovers multiple known optimal mechanisms and highrevenue, lowregret mechanisms in larger settings where the optimal mechanism is unknown. 
Michael Curry · Uro Lyi · Tom Goldstein · John P Dickerson 🔗 


Entropic estimation of optimal transport maps
(
Poster
)
>
link
We develop a computationally tractable method for estimating the optimal map between two distributions over $\mathbb{R}^d$ with rigorous finitesample guarantees. Leveraging an entropic version of Brenier's theorem, we show that our estimatorthe \emph{barycentric projection} of the optimal entropic planis easy to compute using Sinkhorn's algorithm. As a result, unlike current approaches for map estimation, which are slow to evaluate when the dimension or number of samples is large, our approach is parallelizable and extremely efficient even for massive data sets. Under smoothness assumptions on the optimal map, we show that our estimator enjoys comparable statistical performance to other estimators in the literature, but with much lower computational cost. We showcase the efficacy of our proposed estimator through numerical examples. Our proofs are based on a modified duality principle for entropic optimal transport and on a method for approximating optimal entropic plans due to Pal (2019).

AramAlexandre Pooladian · Jonathan NilesWeed 🔗 


Entropic estimation of optimal transport maps
(
Oral
)
>
link
We develop a computationally tractable method for estimating the optimal map between two distributions over $\mathbb{R}^d$ with rigorous finitesample guarantees. Leveraging an entropic version of Brenier's theorem, we show that our estimatorthe \emph{barycentric projection} of the optimal entropic planis easy to compute using Sinkhorn's algorithm. As a result, unlike current approaches for map estimation, which are slow to evaluate when the dimension or number of samples is large, our approach is parallelizable and extremely efficient even for massive data sets. Under smoothness assumptions on the optimal map, we show that our estimator enjoys comparable statistical performance to other estimators in the literature, but with much lower computational cost. We showcase the efficacy of our proposed estimator through numerical examples. Our proofs are based on a modified duality principle for entropic optimal transport and on a method for approximating optimal entropic plans due to Pal (2019).

AramAlexandre Pooladian · Jonathan NilesWeed 🔗 


A Central Limit Theorems for Multidimensional Wasserstein Distances
(
Poster
)
>
We present recent approaches to prove the asymptotic behaviour of empirical transport cost, $\mathcal{T}_c(P_n,Q)$, under minimal assumptions in high dimension. Centering around its expectation, the weak limit of $\sqrt{n}\{\mathcal{T}_c(P_n,Q)E\mathcal{T}_c(P_n,Q)\}$ is Gaussian. Yet, due to the curse of dimensionality, the variable $E\mathcal{T}_c(P_n,Q)$ can not be exchanged by its population counterpart $\mathcal{T}_c(P,Q)$. When $P$ is finitely supported this problem can be solved and the limit becomes the supremum of a centered Gaussian process, which is Gaussian under some additional conditions on the probability $Q$.

Alberto Gonzalez Sanz · Loubes JeanMichel · Eustasio Barrio 🔗 


Variational Wasserstein gradient flow
(
Poster
)
>
link
The gradient flow of a function over the space of probability densities with respect to the Wasserstein metric often exhibits nice properties and has been utilized in several machine learning applications. The standard approach to compute the Wasserstein gradient flow is the finite difference which discretizes the underlying space over a grid, and is not scalable. In this work, we propose a scalable proximal gradient type algorithm for Wasserstein gradient flow. The key of our method is a variational formulation of the objective function, which makes it possible to realize the JKO proximal map through a primaldual optimization. This primaldual problem can be efficiently solved by alternatively updating the parameters in the inner and outer loops. Our framework covers all the classical Wasserstein gradient flows including the heat equation and the porous medium equation. We demonstrate the performance and scalability of our algorithm with several numerical examples. 
Jiaojiao Fan · Amirhossein Taghvaei · Yongxin Chen 🔗 


Discrete Schrödinger Bridges with Applications to TwoSample Homogeneity Testing
(
Poster
)
>
link
We introduce an entropyregularized statistic that defines a divergence between probability distributions. The statistic is the transport cost of a coupling which admits an expression as a weighted average of Monge couplings with respect to a Gibbs measure. This coupling is related to the static Schr\"odinger bridge given a finite number of particles. We establish the asymptotic consistency of the statistic as the sample size goes to infinity and show that the population limit is the solution of F\"ollmer's entropyregularized optimal transport. The proof technique relies on a chaos decomposition for paired samples. We illustrate the interest of the approach on the twosample homogeneity testing problem. 
Zaid Harchaoui · Lang Liu · Soumik Pal 🔗 


Discrete Schrödinger Bridges with Applications to TwoSample Homogeneity Testing
(
Oral
)
>
link
We introduce an entropyregularized statistic that defines a divergence between probability distributions. The statistic is the transport cost of a coupling which admits an expression as a weighted average of Monge couplings with respect to a Gibbs measure. This coupling is related to the static Schr\"odinger bridge given a finite number of particles. We establish the asymptotic consistency of the statistic as the sample size goes to infinity and show that the population limit is the solution of F\"ollmer's entropyregularized optimal transport. The proof technique relies on a chaos decomposition for paired samples. We illustrate the interest of the approach on the twosample homogeneity testing problem. 
Zaid Harchaoui · Lang Liu · Soumik Pal 🔗 


Towards an FFT for measures
(
Poster
)
>
Complex measures recently became a wellestablished data model. We discuss the adaptation of the ubiquitous fast Fourier transform to measures, which involves their approximation by a multivariate trigonometric polynomial respecting normalization and nonnegativity if applicable. The achieved approximation results, with respect to the Wasserstein1 distance, are sharp up to logarithmic factors. The Fourier transform of atomic measures is shown to be computed up to logarithmic factors in linear time with respect to the problem size. The inverse Fourier transform is in general more involved but can be replaced by the easily computed approximation for typical applications. 
Paul Catala · Mathias Hockmann · Stefan Kunis · Markus Wageringel 🔗 


On Combining Expert Demonstrations in Imitation Learning via Optimal Transport
(
Poster
)
>
Imitation learning (IL) seeks to teach agents specific tasks through expert demonstrations. One of the key approaches to IL is to define a distance between agent and expert and to find an agent policy that minimizes that distance. Optimal transport methods have been widely used in imitation learning as they provide ways to measure meaningful distances between agent and expert trajectories. However, the problem of how to optimally combine multiple expert demonstrations has not been widely studied. The standard method is to simply concatenate state (action) trajectories, which is problematic when trajectories are multimodal. We propose an alternative method that uses a multimarginal optimal transport distance and enables the combination of multiple and diverse statetrajectories in the OT sense, providing a more sensible geometric average of the demonstrations. Our approach enables an agent to learn from several experts, and its efficiency is analyzed on OpenAI Gym control environments and demonstrates that the standard method is not always optimal. 
ilana sebag · Samuel Cohen · Marc Deisenroth 🔗 


Sliced MultiMarginal Optimal Transport
(
Poster
)
>
link
Multimarginal optimal transport enables one to compare multiple probability measures, which increasingly finds application in multitask learning problems.One practical limitation of multimarginal transport is computational scalability in the number of measures, samples and dimensionality.In this work, we propose a multimarginal optimal transport paradigm based on random onedimensional projections, whose (generalized) distance we term the \emph{sliced multimarginal Wasserstein distance}.To construct this distance, we introduce a characterization of the onedimensional multimarginal Kantorovich problem and use it to highlight a number of properties of the sliced multimarginal Wasserstein distance. In particular, we show that (i) the sliced multimarginal Wasserstein distance is a (generalized) metric that induces the same topology as the standard Wasserstein distance, (ii) it admits a dimensionfree sample complexity, (iii) it is tightly connected with the problem of barycentric averaging under the slicedWasserstein metric.We conclude by illustrating the sliced multimarginal Wasserstein on multitask density estimation and multidynamics reinforcement learning problems. 
Samuel Cohen · Alexander Terenin · Yannik Pitcan · Brandon Amos · Marc Deisenroth · Senanayak Sesh Kumar Karri 🔗 


Implicit Riemannian Concave Potential Maps
(
Poster
)
>
link
We are interested in the challenging problem of modelling densities on Riemannian manifolds with a known symmetry group using normalising flows. This has many potential applications in physical sciences such as molecular dynamics and quantum simulations. In this work we combine ideas from implicit neural layers and optimal transport theory to propose a generalisation of existing work on exponential map flows, Implicit Riemannian Concave Potential Maps, IRCPMs. IRCPMs have some nice properties such as simplicity of incorporating knowledge about symmetries and are less expensive then ODEflows. We provide an initial theoretical analysis of its properties and layout sufficient conditions for stable optimisation. Finally, we illustrate the properties of IRCPMs with density learning experiments on tori and spheres. 
Danilo Jimenez Rezende · Sébastien Racanière 🔗 


Implicit Riemannian Concave Potential Maps
(
Oral
)
>
link
We are interested in the challenging problem of modelling densities on Riemannian manifolds with a known symmetry group using normalising flows. This has many potential applications in physical sciences such as molecular dynamics and quantum simulations. In this work we combine ideas from implicit neural layers and optimal transport theory to propose a generalisation of existing work on exponential map flows, Implicit Riemannian Concave Potential Maps, IRCPMs. IRCPMs have some nice properties such as simplicity of incorporating knowledge about symmetries and are less expensive then ODEflows. We provide an initial theoretical analysis of its properties and layout sufficient conditions for stable optimisation. Finally, we illustrate the properties of IRCPMs with density learning experiments on tori and spheres. 
Danilo Jimenez Rezende · Sébastien Racanière 🔗 


Multistage Monge Kantorovich Problem applied to optimal ecological transition
(
Poster
)
>
link
Financial instability is one of the most important challenges in the transition to a lowcarbon economy. Banks are encouraged to align their portfolios to CO2 trajectories fixed by international agreements, showing the necessity of a quantitative methodology to implement it. Wepropose a mathematical formulation for this problem and an optimization criterion for a transition between the current portfolio and a target one. The optimization Problem combines an Optimal Transport problem for which the cost is defined according to the financial context and a Credit Risk estimation. 
Clara Lage · Emmanuel Gobet 🔗 


Wasserstein Adversarially Regularized Graph Autoencoder
(
Poster
)
>
link
This paper introduces Wasserstein Adversarially Regularized Graph Autoencoder (WARGA), an implicit generative algorithm that directly regularizes the latent distribution of node embedding to a target distribution via the Wasserstein metric. The proposed method has been validated in tasks of link prediction and node clustering on realworld graphs, in which WARGA generally outperforms stateoftheart models based on KullbackLeibler (KL) divergence and typical adversarial framework. 
Huidong Liang · Junbin Gao 🔗 


Likelihood Training of Schrödinger Bridges using ForwardBackward SDEs Theory
(
Poster
)
>
link
Schrödinger Bridges (SBs) is an optimal transport problem that has received increasing attentions in deep generative modeling for its mathematical flexibility compared to Scoredbased Generative Models (SGMs). However, it remains unclear whether the optimization principle of SBs relates to modern training of deep generative models, which often relies on constructing parametrized loglikelihood objectives.This raises questions on the suitability of SB models as a principled alternative for generative applications. In this work, we present a novel computational framework for likelihood training of SB models grounded on ForwardBackward Stochastic Differential Equations Theory – a nonlinear stochastic optimal control principle that transforms the optimality condition of SBs into a set of SDEs. Crucially, these SDEs can be used to construct the likelihood objectives for SBs that, surprisingly, generalizes the ones for SGMs as special cases. This leads to a new optimization principle that inherits the same SB optimality yet without losing application of modern generative training techniques, and we show that the resulting training algorithm achieves encouraging results on generating realistic images on MNIST and CelebA. 
Tianrong Chen · GuanHorng Liu · Evangelos Theodorou 🔗 


Efficient estimates of optimal transport via lowdimensional embeddings
(
Poster
)
>
link
Optimal transport distances (OT) have been widely used in recent work in Machine Learning as ways to compare probability distributions. These are costly to compute when the data lives in high dimension.Recent work aims specifically at reducing this cost by computing OT using lowrank projections of the data (seen as discrete measures)~\citep{paty2019subspace}. We extend this approach and show that one can approximate OT distances by using more general families of maps provided they are 1Lipschitz. The best estimate is obtained by maximising OT over the given family. As OT calculations are done after mapping data to a lower dimensional space, our method scales well with the original data dimension.We demonstrate the idea with neural networks. 
Patric Fulop · Vincent Danos 🔗 


Optimal Transport losses and Sinkhorn algorithm with general convex regularization
(
Poster
)
>
link
We introduce a new class of convexregularized Optimal Transport losses, which generalizes the classical Entropyregularization of Optimal Transport and Sinkhorn divergences, and propose a generalized Sinkhorn algorithm. Our framework unifies many regularizations and numerical methods previously appeared in the literature. We show the existence of the maximizer for the dual problem, complementary slackness conditions, providing a complete characterization of solutions for such class of variational problems. As a consequence, we study structural properties of these losses, including continuity, differentiability and provide explicit formulas for its gradient. Finally, we provide theoretical guarantees of convergences and stability of the generalized Sinkhorn algorithm, even in the continuous setting. The techniques developed here are directly applicable also to study Wasserstein barycenters or, more generally, multimarginal problems. 
Augusto Gerolin · Simone Di Marino 🔗 


Towards interpretable contrastive word mover's embedding
(
Poster
)
>
link
This paper shows that a popular approach to the supervised embedding of documents for classification, namely, contrastive Word Mover's Embedding, can be significantly enhanced by adding interpretability. This interpretability is achieved by incorporating a clustering promoting mechanism into the contrastive loss. On several public datasets, we show that our method improves significantly upon existing baselines while providing interpretation to the clusters via identifying a set of keywords that are the most representative of a particular class. Our approach was motivated in part by the need to develop Natural Language Processing (NLP) methods for the novel problem of assessing student work for scientific writing and thinking  a problem that is central to the area of (educational) Learning Sciences (LS). In this context, we show that our approach leads to a meaningful assessment of the student work related to lab reports from a biology class and can help LS researchers gain insights into student understanding and assess evidence of scientific thought processes. 
ruijie jiang · Julia Gouvea · Eric L Miller · David Hammer · Shuchin Aeron 🔗 


Optimizing Functionals on the Space of Probabilities with Input Convex Neural Network
(
Poster
)
>
link
Gradient flows are a powerful tool for optimizing functionals in general metric spaces, including the space of probabilities endowed with the Wasserstein metric. A typical approach to solving this optimization problem relies on its connection to the dynamic formulation of optimal transport and the celebrated JordanKinderlehrerOtto (JKO) scheme. However, this formulation involves optimization over convex functions, which is challenging, especially in high dimensions. In this work, we propose an approach that relies on the recently introduced inputconvex neural networks (ICNN) to parameterize the space of convex functions in order to approximate the JKO scheme, as well as in designing functionals over measures that enjoy convergence guarantees. We derive a computationally efficient implementation of this JKOICNN framework and use various experiments to demonstrate its feasibility and validity in approximating solutions of lowdimensional partial differential equations with known solutions. We also explore the use of our JKOICNN approach in high dimensions with an experiment in controlled generation for molecular discovery. 
David AlvarezMelis · Yair Schiff · Youssef Mroueh 🔗 


Optimizing Functionals on the Space of Probabilities with Input Convex Neural Network
(
Spotlight
)
>
link
SlidesLive Video Gradient flows are a powerful tool for optimizing functionals in general metric spaces, including the space of probabilities endowed with the Wasserstein metric. A typical approach to solving this optimization problem relies on its connection to the dynamic formulation of optimal transport and the celebrated JordanKinderlehrerOtto (JKO) scheme. However, this formulation involves optimization over convex functions, which is challenging, especially in high dimensions. In this work, we propose an approach that relies on the recently introduced inputconvex neural networks (ICNN) to parameterize the space of convex functions in order to approximate the JKO scheme, as well as in designing functionals over measures that enjoy convergence guarantees. We derive a computationally efficient implementation of this JKOICNN framework and use various experiments to demonstrate its feasibility and validity in approximating solutions of lowdimensional partial differential equations with known solutions. We also explore the use of our JKOICNN approach in high dimensions with an experiment in controlled generation for molecular discovery. 
David AlvarezMelis · Yair Schiff · Youssef Mroueh 🔗 


Dual Regularized Optimal Transport
(
Poster
)
>
link
In this paper, we present a new formulation of unbalanced optimal transport called Dual Regularized Optimal Transport (DROT). We argue that regularizing the dual formulation of optimal transport results in a version of unbalanced optimal transport that leads to sparse solutions and that gives us control over mass creation and destruction. We build intuition behind such control and present theoretical properties of the solutions to DROT. We demonstrate that due to recent advances in optimization techniques, we can feasibly solve such a formulation at large scales and present extensive experimental evidence for this formulation and its solution. 
Rishi Sonthalia · Anna Gilbert 🔗 