Workshop
Optimal Transport and Machine Learning
Anna Korba · Aram-Alexandre Pooladian · Charlotte Bunne · David Alvarez-Melis · Marco Cuturi · Ziv Goldfeld
Room 220 - 222
Over the last decade, optimal transport (OT) has evolved from a prize-winning research area in pure mathematics to a recurring theme bursting across many areas of machine learning (ML). Advances in OT theory, computation, and statistics have fueled breakthroughs in a wide range of applications, from single-cell genomics (Schiebinger et al., 2019) to generative modeling (Arjovsky et al., 2017) and the optimization of over-parametrized neural nets (Chizat & Bach, 2018; De Bortoli et al., 2021), among many others. The OTML workshop series (in '14, '17, '19, and '21) has been instrumental in shaping this influential research thread. For this new OTML installment, we aim even higher by hosting two exceptional plenary speakers: Luis Caffarelli, who received the 2023 Abel Prize for his seminal contributions to regularity theory for the Monge–Ampère equation and OT, and Felix Otto, the 2006 Leibniz Prize awardee and 2017 Blaise Pascal medalist, who made profound contributions to the theory of Wasserstein gradient flows. The OTML workshop will provide a unique platform to federate, disseminate, and advance current knowledge in this rapidly growing field. This, in turn, will facilitate cross-field fertilization and drive the community towards future groundbreaking discoveries.
Schedule
Sat 6:50 a.m. - 7:00 a.m. | Introducing the Optimal Transport and Machine Learning (OTML) Workshop (Introduction) [SlidesLive Video]
Sat 7:00 a.m. - 8:00 a.m. | The making of the JKO scheme (Felix Otto) (Plenary talk) [SlidesLive Video]
De Giorgi popularized the idea that, even in the Riemannian setting, gradient flows allow for a variational time discretization that appeals only to the metric, not the differential structure. The JKO scheme is an instance of these minimizing movements, where the (informal) Riemannian setting is given by the Wasserstein geometry and the functional by the (relative) entropy. This explains why McCann's displacement convexity translates into contractivity of the diffusion equation. Bochner's formula relates the dimension and the Ricci curvature of the underlying manifold to contraction rates. In this picture, De Giorgi's inequality characterization of a gradient flow can be assimilated to a large deviation principle for particle diffusions.
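For reference, the minimizing-movement (JKO) step referred to in the abstract can be written as the implicit variational scheme (notation ours, not taken from the talk):

```latex
\rho_{k+1} \;\in\; \operatorname*{arg\,min}_{\rho \in \mathcal{P}_2(\mathbb{R}^d)}
\left\{ \mathrm{KL}(\rho \,\|\, \pi) \;+\; \frac{1}{2\tau}\, W_2^2(\rho, \rho_k) \right\}.
```

As the time step $\tau \to 0$, the iterates converge to the Wasserstein gradient flow of the relative entropy, i.e., the Fokker–Planck diffusion with stationary distribution $\pi$.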
Sat 8:00 a.m. - 8:30 a.m. | Coffee break (Break)
Sat 8:30 a.m. - 9:00 a.m. | Unbalanced Optimal Transport: Efficient solutions for outlier-robust machine learning (Laetitia Chapel) (Keynote talk) [SlidesLive Video]
Optimal transport (OT) has become a powerful tool for calculating distances (a.k.a. Wasserstein or earth mover's distances) between empirical distributions of data, thanks to efficient computational schemes that make the transport computation tractable. The OT problem aims to find a transportation map that preserves the total mass between two probability distributions, requiring their masses to be equal. However, this requirement can be overly restrictive in certain applications, such as color or shape matching, where distributions may have arbitrary masses, and/or when only a fraction of the total mass has to be transported. This also happens when the source and/or target distributions are contaminated by outliers; in such cases, one might want to exclude these outliers from the transportation plan. Unbalanced optimal transport addresses the problem of removing some mass from the problem by relaxing the marginal conditions (using weighted penalties in lieu of equalities). In this talk, we will review efficient algorithms that can be devised in this context, particularly focusing on formulations where no additional regularization is imposed on the OT plan.
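As background, a common computational route to unbalanced OT is the entropic scaling algorithm with KL marginal penalties. The sketch below illustrates that idea only; it is not the regularization-free algorithms the talk focuses on, and all parameter choices here are illustrative.

```python
import numpy as np

def unbalanced_sinkhorn(a, b, C, eps=0.05, rho=1.0, n_iter=500):
    """Entropic unbalanced OT with KL marginal penalties (a sketch).

    Marginal equality constraints are relaxed: deviations of the plan's
    marginals from a and b are penalized by rho * KL(. || a) and
    rho * KL(. || b), so outliers can simply keep (most of) their mass.
    """
    K = np.exp(-C / eps)               # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    fi = rho / (rho + eps)             # damping exponent from the KL penalty
    for _ in range(n_iter):
        u = (a / (K @ v)) ** fi
        v = (b / (K.T @ u)) ** fi
    return u[:, None] * K * v[None, :]  # transport plan

# Toy example: the last source point is an outlier far from the targets,
# so it ends up transporting much less of its mass.
x = np.array([0.0, 0.1, 1.5])          # source support (last point: outlier)
y = np.array([0.0, 0.1])               # target support
C = (x[:, None] - y[None, :]) ** 2
P = unbalanced_sinkhorn(np.ones(3) / 3, np.ones(2) / 2, C)
```

The exponent `fi = rho / (rho + eps)` is what distinguishes this from balanced Sinkhorn: as `rho` grows, the marginal penalty hardens and the updates recover the usual balanced scaling.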
Sat 9:00 a.m. - 9:15 a.m. | Self-Supervised Learning with the Matching Gap (Zoe Piran) (Contributing talk) [SlidesLive Video]
Contrastive learning (CL) is a fundamental paradigm in self-supervised learning. CL methods rely on a loss that nudges the features of various views from one image to stay close together, while pulling away those drawn from different images. Such a loss favors invariance: feature representations of the same perturbed image should collapse to the same vector, while remaining far enough from those of any other image. Although intuitive, CL leaves room for trivial solutions and has a documented propensity to collapse representations for very different images. This is often mitigated by using a very large variety of augmentations. In this talk, we address this tension by presenting a different loss, the matching gap, stemming from optimal transport theory. Given a set of n images transformed in two different ways, the matching gap is the difference between the mean cost (e.g., a squared distance), in representation space, of the n paired images and the optimal matching cost obtained by running an optimal matching solver across these two families of n images. The matching gap naturally mitigates the problem of data augmentation invariance, since it can be zero without requiring features from the same image to collapse. We will show that it can be easily differentiated using Danskin's theorem and attains competitive results, even without extensive data augmentations.
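The quantity described in the abstract is simple to compute with an off-the-shelf matching solver. A minimal sketch of the idea (not the authors' training code; the embeddings and cost here are placeholders):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_gap(z1, z2):
    """Matching gap between embeddings of two views of n images.

    gap = mean cost of the *paired* images (z1[i] vs z2[i])
        - cost of the *optimal* matching between the two families.
    It is zero whenever the identity pairing is already optimal, which
    does not require z1[i] == z2[i] (no feature collapse needed).
    """
    # Pairwise squared-distance cost matrix between the two families.
    C = ((z1[:, None, :] - z2[None, :, :]) ** 2).sum(-1)
    paired_cost = np.trace(C) / len(z1)
    row, col = linear_sum_assignment(C)   # optimal matching (Hungarian solver)
    optimal_cost = C[row, col].mean()
    return paired_cost - optimal_cost     # always >= 0

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 4))
print(matching_gap(z, z))                 # identity pairing is optimal -> 0.0
```

In training, the hard solver would typically be replaced by an entropic (Sinkhorn) solver, and Danskin's theorem, as the abstract notes, lets the loss be differentiated through the fixed optimal matching.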
Sat 9:15 a.m. - 9:30 a.m. | Accelerating Motion Planning via Optimal Transport (Oral) [link] [SlidesLive Video]
Motion planning is still an open problem for many disciplines, e.g., robotics and autonomous driving, due to issues like high planning times that hinder real-time, efficient decision-making. A class of methods striving to provide smooth solutions is gradient-based trajectory optimization. However, those methods might suffer from bad local minima, while for many settings they may be inapplicable due to the absence of easy access to objective gradients. In response to these issues, we introduce Motion Planning via Optimal Transport (MPOT), a gradient-free method that optimizes a batch of smooth trajectories over highly nonlinear costs, even for high-dimensional tasks, while imposing smoothness through a Gaussian process dynamics prior via a planning-as-inference perspective. To facilitate batch trajectory optimization, we introduce an original zero-order and highly parallelizable update rule, the Sinkhorn Step, which uses the regular polytope family for its search directions; each regular polytope, centered on trajectory waypoints, serves as a local neighborhood, effectively acting as a trust region, where the Sinkhorn Step "transports" local waypoints toward low-cost regions. We theoretically show that the Sinkhorn Step guides the optimizing parameters toward local minima regions of non-convex objective functions. We then show the efficiency of MPOT in a range of problems, from low-dimensional point-mass navigation to high-dimensional whole-body robot motion planning, evincing its superiority compared with popular motion planners and paving the way for new applications of optimal transport in motion planning.
An T. Le · Georgia Chalvatzaki · Armin Biess · Jan Peters
Sat 9:30 a.m. - 10:00 a.m. | Learning on graphs with Gromov-Wasserstein: from unsupervised learning to GNN (Rémi Flamary) (Keynote talk) [SlidesLive Video]
In recent years, the optimal transport (OT) based Gromov-Wasserstein (GW) divergence has been investigated as a similarity measure between structured data expressed as distributions typically lying in different metric spaces, such as graphs with arbitrary sizes. In this talk, we will address the optimization problem inherent in the computation of GW and some of its recent extensions, such as the entropic, fused, and semi-relaxed GW divergences. Next, we will illustrate how these OT problems can be used to model graph data in learning scenarios such as graph compression, clustering, classification, and structured prediction. Finally, we will present a recent application of the GW distance as a novel pooling layer in graph neural networks.
Sat 10:00 a.m. - 11:30 a.m. | Lunch + Poster Session I (Break)
Sat 11:30 a.m. - 12:00 p.m. | Amortized optimization for optimal transport (Brandon Amos) (Keynote talk) [SlidesLive Video]
Optimal transport has thriving applications in machine learning, computer vision, natural language processing, the physical sciences, and economics. These applications have largely been enabled by computational breakthroughs that have led to tractable solutions to challenging optimization problems, especially in discrete spaces through the use of convex optimization methods. Beyond these well-understood classes of problems, many difficult optimization problems and sub-problems in optimal transport remain open. This talk focuses on the use of learning methods to predict, or amortize, the solutions to these optimization problems. This amortization process incurs an initial computational cost of training a model to approximately predict the solutions, but afterwards, the model can produce predictions faster than solving the optimization problems from scratch to the same level of error. Furthermore, even inaccurate predictions are tolerable because they are easily detectable, e.g., via the optimality conditions, and can be fine-tuned by warm-starting an existing method with the prediction. The talk covers how to amortize the computation at three levels: 1) the optimal transport map or potential, 2) the c-transform or convex conjugate, and 3) costs defined by a Lagrangian.
Sat 12:00 p.m. - 12:30 p.m. | Diffusion Schrödinger Bridge Matching (Arnaud Doucet) (Keynote talk) [SlidesLive Video]
Novel mass transport methods motivated by generative modeling have recently been proposed: for example, Denoising Diffusion Models (DDMs) and Flow Matching Models (FMMs) implement such a transport through a Stochastic Differential Equation (SDE) or an Ordinary Differential Equation (ODE). However, while it is desirable in many applications to approximate the deterministic dynamic Optimal Transport (OT) map, which admits attractive properties, DDMs and FMMs are not guaranteed to provide transports close to the OT map. In contrast, Schrödinger bridges (SBs) compute stochastic dynamic mappings which recover entropy-regularized versions of OT. Unfortunately, existing numerical methods approximating SBs either scale poorly with dimension or accumulate errors across iterations. In this work, we introduce Iterative Markovian Fitting (IMF), a new methodology for solving SB problems, and Diffusion Schrödinger Bridge Matching (DSBM), a novel numerical algorithm for computing IMF iterates. DSBM significantly improves over previous SB numerics and recovers as special/limiting cases various recent transport methods. We demonstrate the performance of DSBM on a variety of problems. This is joint work with Yuyang Shi (Oxford), Valentin De Bortoli (Google DeepMind) and Andrew Campbell (Oxford).
Sat 12:30 p.m. - 1:00 p.m. | Sketched Wasserstein Distances (Florentina Bunea) (Keynote talk) [SlidesLive Video]
The Wasserstein distance between mixing measures has come to occupy a central place in the statistical analysis of mixture models. We give the first axiomatic justification of its usage as a canonical measure of discrepancy between mixture distributions. Inference for the Wasserstein distance between mixing measures is generally difficult in high dimensions. Specializing to discrete mixtures arising from topic models, we offer the first minimax lower bound on estimating the distance between pairs of mixing measures in this model class. This reveals regimes under which fast estimation of the distance between mixing measures can be expected, even if the ambient dimension of the mixture distributions is large. In such regimes, we develop fully data-driven inferential tools that allow us to obtain the first asymptotically valid confidence intervals for the Wasserstein distance between mixing measures in topic models. Our results apply to potentially sparse mixtures of potentially sparse high-dimensional discrete probability distributions, and are illustrated via an example on movie reviews from the IMDb data set.
Sat 1:00 p.m. - 1:15 p.m. | Provably Fast Finite Particle Variants of SVGD via Virtual Particle Stochastic Approximation (Oral) [link] [SlidesLive Video]
SVGD is a popular particle-based variational inference algorithm with well-studied mean-field dynamics. However, its finite-particle behavior is far less understood. Our work introduces the notion of virtual particles to develop novel stochastic approximations of mean-field SVGD dynamics in the space of probability measures that are exactly realizable using finite particles. As a result, we design two computationally efficient variants of SVGD (VP-SVGD and GB-SVGD) with provably fast finite-particle convergence rates. Our algorithms are specific random-batch approximations of SVGD which are computationally more efficient than ordinary SVGD. We show that the $n$ output particles of VP-SVGD and GB-SVGD, run for $T$ steps with batch size $K$, are as good as i.i.d. samples from a measure whose Kernel Stein Discrepancy to the target is at most $O(\tfrac{d^{1/3}}{(KT)^{1/6}})$ under standard assumptions. We prove similar results under a mild growth condition on the score function, which is weaker than the assumptions of prior works. Our convergence rates for the empirical measure (of the particles output by VP-SVGD and GB-SVGD) to the target distribution enjoy a double exponential improvement over the best known finite-particle analysis of SVGD. Furthermore, our results give the first known polynomial oracle complexity in dimension, completely eliminating the curse of dimensionality exhibited by previously known finite-particle rates.
Aniket Das · Dheeraj Nagaraj
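For readers unfamiliar with the baseline algorithm being analyzed, here is a minimal sketch of one plain SVGD update with an RBF kernel (background only; VP-SVGD and GB-SVGD replace this full-batch interaction with specific random-batch approximations, and all parameters here are illustrative):

```python
import numpy as np

def svgd_step(x, scores, step=0.1, bandwidth=1.0):
    """One Stein Variational Gradient Descent update.

    x:      (n, d) particle positions
    scores: (n, d) grad log-density of the target evaluated at x
    Each particle moves along a kernel-weighted average of the scores
    (attraction to the target) plus a kernel-gradient term (repulsion
    between particles, which prevents collapse).
    """
    n = len(x)
    diff = x[:, None, :] - x[None, :, :]                    # (n, n, d)
    k = np.exp(-(diff ** 2).sum(-1) / (2 * bandwidth ** 2))  # RBF kernel matrix
    drift = k @ scores / n                                   # attraction term
    repulsion = (diff * k[..., None]).sum(axis=1) / (n * bandwidth ** 2)
    return x + step * (drift + repulsion)

# Toy run: push particles toward a 1D Gaussian target N(2, 1).
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 1))
for _ in range(500):
    x = svgd_step(x, -(x - 2.0))   # score of N(2, 1) is -(x - 2)
```

The per-step cost is O(n²d) from the kernel matrix, which is exactly the term the random-batch variants in the paper aim to cheapen.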
Sat 1:15 p.m. - 1:30 p.m. | Towards a Statistical Theory of Learning to Learn In-context with Transformers (Oral) [link] [SlidesLive Video]
Classical learning theory focuses on supervised learning of functions via empirical risk minimization, where labeled examples for a particular task are represented by the data distribution experienced by the model during training. Recently, in-context learning emerged as a paradigm shift in large pre-trained models: when conditioned on a few labeled examples of tasks potentially unseen during training, the model infers the task at hand and makes predictions on new points. Learning to learn in-context, on the other hand, aims at training models in a meta-learning setup that generalize to new unseen tasks from only a few shots of labeled examples. In this paper, we present a statistical learning framework for the problem of in-context meta-learning and define a function class that enables it. The meta-learner is abstracted as a function defined on the cross product of the probability space (representing context) and the data space. The data distribution is sampled from a meta-distribution on tasks. Thanks to the regularity we assume on the function class in the Wasserstein geometry, we leverage tools from optimal transport to study the generalization of the meta-learner to unseen tasks. Finally, we show that encoder transformers exhibit this type of regularity and leverage our theory to analyze their generalization properties.
Youssef Mroueh
Sat 1:30 p.m. - 2:00 p.m. | Coffee break (Break)
Sat 2:00 p.m. - 2:30 p.m. | Optimal transport on graphs, manifolds and trees (Smita Krishnaswamy) (Keynote talk) [SlidesLive Video]
Sat 2:30 p.m. - 3:00 p.m. | Variational inference via Wasserstein gradient flows (Sinho Chewi) (Keynote talk) [SlidesLive Video]
Probabilistic problems which involve the non-smooth entropy functional benefit from the design of proximal algorithms. I will showcase the use of Wasserstein gradient flows as a conceptual framework for developing principled algorithms for variational inference (VI) with accompanying convergence guarantees, particularly for Gaussian VI and mean-field VI. This is joint work with Francis Bach, Krishnakumar Balasubramanian, Silvère Bonnabel, Michael Diao, Yiheng Jiang, Marc Lambert, Aram-Alexandre Pooladian, Philippe Rigollet, and Adil Salim.
Sat 3:00 p.m. - 3:15 p.m. | Closing remarks from organizers (Closing remarks) [SlidesLive Video]
Sat 3:15 p.m. - 3:30 p.m. | Poster session II + hangout (Poster session)
Network Regression with Wasserstein Distances (Poster) [link]
Motivated by the explosion of graph-based data analysis, we study the problem of network regression, where graph topology is predicted for unseen predictor values. We build upon recent developments on generalized regression models on metric spaces based on Fréchet means and propose a network regression method using the Wasserstein metric. We show that when representing graphs as multivariate Gaussian distributions, the regression problem in the Wasserstein metric becomes a weighted Wasserstein barycenter problem. Such a weighted barycenter can be efficiently computed using fixed-point iterations. Numerical results show that the proposed approach improves existing procedures by accurately accounting for graph size, randomness, and sparsity in synthetic and real-world experiments.
Alexander Zalles · Cesar Uribe · Kai M. Hung · Ann Finneran · Lydia Beaudrot
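The fixed-point iteration mentioned in the abstract has a well-known form for the 2-Wasserstein barycenter of centered Gaussians. A minimal sketch (the computational core of this approach, not the authors' exact code; initialization and stopping rule are our choices):

```python
import numpy as np

def sqrtm_psd(A):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def gaussian_w2_barycenter(covs, weights, n_iter=100):
    """Weighted 2-Wasserstein barycenter of centered Gaussians N(0, C_i).

    Fixed-point map (Alvarez-Esteban et al. style):
        S <- S^{-1/2} ( sum_i w_i (S^{1/2} C_i S^{1/2})^{1/2} )^2 S^{-1/2}
    """
    S = np.average(covs, axis=0, weights=weights)  # init at Euclidean mean
    for _ in range(n_iter):
        R = sqrtm_psd(S)
        R_inv = np.linalg.inv(R)
        M = sum(w * sqrtm_psd(R @ C @ R) for w, C in zip(weights, covs))
        S = R_inv @ M @ M @ R_inv
    return S

# Commuting (diagonal) covariances: the barycenter's sqrt is the weighted
# average of the sqrts, so the answer is computable by hand.
covs = np.array([np.diag([1.0, 4.0]), np.diag([9.0, 1.0])])
S = gaussian_w2_barycenter(covs, weights=[0.5, 0.5])
```

In the diagonal example, $\sqrt{S} = \tfrac12(\sqrt{C_1} + \sqrt{C_2})$ entrywise, giving `diag([4.0, 2.25])`.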
Learning via Wasserstein-Based High Probability Generalisation Bounds (Poster) [link]
Minimising upper bounds on the population risk or the generalisation gap has been widely used in structural risk minimisation (SRM) -- this is in particular at the core of PAC-Bayes learning. Despite its successes and unfailing surge of interest in recent years, a limitation of the PAC-Bayes framework is that most bounds involve a Kullback-Leibler (KL) divergence term (or its variations), which might exhibit erratic behavior and fail to capture the underlying geometric structure of the learning problem -- hence restricting its use in practical applications. As a remedy, recent studies have attempted to replace the KL divergence in the PAC-Bayes bounds with the Wasserstein distance. Even though these bounds alleviated the aforementioned issues to a certain extent, they either hold in expectation, are for bounded losses, or are nontrivial to minimize in an SRM framework. In this work, we contribute to this line of research and prove novel Wasserstein distance-based PAC-Bayes generalisation bounds for both batch learning with independent and identically distributed (i.i.d.) data, and online learning with potentially non-i.i.d. data. Contrary to previous art, our bounds are stronger in the sense that (i) they hold with high probability, (ii) they apply to unbounded (potentially heavy-tailed) losses, and (iii) they lead to optimizable training objectives that can be used in SRM. As a result, we derive novel Wasserstein-based PAC-Bayes learning algorithms and we illustrate their empirical advantage on a variety of experiments.
Paul Viallard · Maxime Haddouche · Umut Simsekli · Benjamin Guedj
Improved Stein Variational Gradient Descent with Importance Weights (Poster) [link]
Stein Variational Gradient Descent (SVGD) is a popular sampling algorithm used in various machine learning tasks. It is well known that SVGD arises from a discretization of the kernelized gradient flow of the Kullback-Leibler divergence $\mathrm{KL}(\cdot\,\|\,\pi)$, where $\pi$ is the target distribution. In this work, we propose to enhance SVGD via the introduction of importance weights, which leads to a new method for which we coin the name $\beta$-SVGD. In the continuous-time and infinite-particles regime, the time for this flow to converge to the equilibrium distribution $\pi$, quantified by the Stein Fisher information, depends on $\rho_0$ and $\pi$ very weakly. This is very different from the kernelized gradient flow of the Kullback-Leibler divergence, whose time complexity depends on $\mathrm{KL}(\rho_0\,\|\,\pi)$. Under certain assumptions, we provide a descent lemma for the population limit $\beta$-SVGD, which covers the descent lemma for the population limit SVGD when $\beta\to 0$. We also illustrate the advantages of $\beta$-SVGD over SVGD by experiments.
Lukang Sun · Peter Richtarik
On the explainable properties of 1-Lipschitz Neural Networks: An Optimal Transport Perspective (Poster) [link]
Input gradients have a pivotal role in a variety of applications, including adversarial attack algorithms for evaluating model robustness, explainable AI techniques for generating saliency maps, and counterfactual explanations. However, saliency maps generated by traditional neural networks are often noisy and provide limited insights. In this paper, we demonstrate that, on the contrary, the saliency maps of 1-Lipschitz neural networks, learnt with the dual loss of an optimal transportation problem, exhibit desirable XAI properties: they are highly concentrated on the essential parts of the image with low noise, significantly outperforming state-of-the-art explanation approaches across various models and metrics. We also prove that these maps align unprecedentedly well with human explanations on ImageNet. To explain the particularly beneficial properties of the saliency map for such models, we prove that this gradient encodes both the direction of the transportation plan and the direction towards the nearest adversarial attack. Following the gradient down to the decision boundary is no longer considered an adversarial attack, but rather a counterfactual explanation that explicitly transports the input from one class to another. Thus, learning with such a loss jointly optimizes the classification objective and the alignment of the gradient, i.e., the saliency map, to the transportation plan direction. These networks were previously known to be certifiably robust by design, and we demonstrate that they scale well for large problems and models, and are tailored for explainability using a fast and straightforward method.
Mathieu Serrurier · Franck Mamalet · Thomas FEL · Louis Béthune · Thibaut Boissin
Entropic Gromov-Wasserstein Distances: Stability and Algorithms (Poster) [link]
The Gromov-Wasserstein (GW) distance quantifies discrepancy between metric measure spaces, but suffers from computational hardness. The entropic Gromov-Wasserstein (EGW) distance serves as a computationally efficient proxy for the GW distance. Recently, it was shown that the quadratic GW and EGW distances admit variational forms that tie them to the well-understood optimal transport (OT) and entropic OT (EOT) problems. By leveraging this connection, we establish convexity and smoothness properties of the objective in this variational problem. This results in the first efficient algorithms for solving the EGW problem that are subject to formal guarantees in both the convex and non-convex regimes.
Gabriel Rioux · Ziv Goldfeld · Kengo Kato
SpecTr++: Improved transport plans for speculative decoding of large language models (Poster) [link]
We revisit the question of accelerating decoding of language models based on speculative draft samples, inspired by Y. Leviathan et al. (ICML 2023). Following Z. Sun et al. (NeurIPS 2023), which makes connections between speculative decoding and optimal transport theory, we design improved transport plans for this problem with no sacrifice in computational complexity in terms of the alphabet size.
Kwangjun Ahn · Ahmad Beirami · Ziteng Sun · Ananda Theertha Suresh
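For context, the baseline token-level coupling from Leviathan et al. that such work builds on can be sketched in a few lines (this is the standard accept/resample rule, not the improved transport plans of this poster):

```python
import numpy as np

def speculative_accept(p, q, draft_token, rng):
    """Token-level speculative sampling.

    p: target model's next-token distribution
    q: draft model's next-token distribution (draft_token was sampled from q)
    Accept the draft token with probability min(1, p/q); otherwise resample
    from the residual max(p - q, 0), renormalized. The returned token is
    exactly p-distributed, so the target distribution is preserved while
    cheap draft tokens are reused whenever possible.
    """
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token
    residual = np.clip(p - q, 0.0, None)
    return rng.choice(len(p), p=residual / residual.sum())
```

This rule is a maximal coupling of `p` and `q` over the token alphabet; the acceptance probability equals the total overlap `sum(min(p, q))`, which is exactly the quantity the optimal-transport view makes precise.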
Offline Imitation from Observation via Primal Wasserstein State Occupancy Matching (Poster) [link]
In real-world scenarios, arbitrary interactions with the environment can often be costly, and actions of expert demonstrations are not always available. To reduce the need for both, Offline Learning from Observations (LfO) is extensively studied, where the agent learns to solve a task with only expert states and task-agnostic non-expert state-action pairs. The state-of-the-art DIstribution Correction Estimation (DICE) methods minimize the state occupancy divergence between the learner and expert policies. However, they are limited to either $f$-divergences (KL and $\chi^2$) or Wasserstein distance with Rubinstein duality, the latter of which constrains the underlying distance metric crucial to the performance of Wasserstein-based solutions. To address this problem, we propose Primal Wasserstein DICE (PW-DICE), which minimizes the primal Wasserstein distance between the expert and learner state occupancies with a pessimistic regularizer and leverages a contrastively learned distance as the underlying metric for the Wasserstein distance. Theoretically, we prove that our framework is a generalization of the state-of-the-art, SMODICE, and unifies $f$-divergence and Wasserstein minimization. Empirically, we find that PW-DICE improves upon several state-of-the-art methods on multiple testbeds.
Kai Yan · Alex Schwing · Yu-Xiong Wang
Semi-discrete Gromov-Wasserstein distances: Existence of Gromov-Monge Maps and Statistical Theory (Poster) [link]
The Gromov-Wasserstein (GW) distance serves as a discrepancy measure between metric measure spaces. Despite recent theoretical developments, its structural properties, such as the existence of optimal maps, remain largely unaccounted for. In this work, we analyze the semi-discrete regime for the GW problem, wherein one measure is finitely supported. Notably, we derive a primitive condition which guarantees the existence of optimal maps. This condition also enables us to derive the asymptotic distribution of the empirical semi-discrete GW distance under proper centering and scaling. As a complement to this asymptotic result, we also derive expected empirical convergence rates. As is the case with the standard Wasserstein distance, the rate we derive in the semi-discrete GW case, $n^{-\frac{1}{2}}$, is dimension-independent, which is in stark contrast to the curse-of-dimensionality rate obtained in general.
Gabriel Rioux · Ziv Goldfeld · Kengo Kato
Sliced Wasserstein Estimation with Control Variates (Poster) [link]
The sliced Wasserstein (SW) distance between two probability measures is defined as the expectation of the Wasserstein distance between one-dimensional projections of the two measures. The randomness comes from a projecting direction that is used to project the two input measures to one dimension. Due to the intractability of the expectation, Monte Carlo integration is performed to estimate the value of the SW distance. Despite having various variants, there has been no prior work that improves the Monte Carlo estimation scheme for the SW distance in terms of controlling its variance. To bridge the literature on variance reduction and the literature on the SW distance, we propose computationally efficient control variates to reduce the variance of the empirical estimation of the SW distance. The key idea is to first find Gaussian approximations of projected one-dimensional measures, then utilize the closed form of the Wasserstein-2 distance between two Gaussian distributions to design the control variates. In particular, we propose using a lower bound and an upper bound of the Wasserstein-2 distance between two fitted Gaussians as two computationally efficient control variates. We empirically show that the proposed control variate estimators can help to reduce the variance considerably when comparing measures over images and point clouds. Finally, we demonstrate the favorable performance of the proposed control variate estimators in gradient flows to interpolate between two point clouds and in deep generative modeling on standard image datasets, such as CIFAR10 and CelebA.
Khai Nguyen · Nhat Ho
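To make the setup concrete, here is the plain Monte Carlo SW estimator whose variance the paper's control variates reduce, together with the closed-form 1D Gaussian Wasserstein-2 distance that the control variates are built from. This is background only (the control-variate correction itself is not sketched), and it assumes equal sample sizes so the sorted-projection formula applies:

```python
import numpy as np

def sliced_wasserstein2(X, Y, n_proj=200, rng=None):
    """Plain Monte Carlo estimator of the squared sliced Wasserstein-2
    distance between two empirical measures with equal sample sizes."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    theta = rng.normal(size=(n_proj, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # uniform directions
    total = 0.0
    for t in theta:
        # 1D W2^2 between equal-size empirical measures: match sorted samples.
        px, py = np.sort(X @ t), np.sort(Y @ t)
        total += np.mean((px - py) ** 2)
    return total / n_proj

def gaussian_w2_1d(m1, s1, m2, s2):
    """Closed-form squared W2 between 1D Gaussians N(m1, s1^2), N(m2, s2^2):
    (m1 - m2)^2 + (s1 - s2)^2. This is the building block of the paper's
    control variates (fitted to the projected measures)."""
    return (m1 - m2) ** 2 + (s1 - s2) ** 2
```

For two identical point clouds the estimator is exactly zero; for a pure translation by a vector $v$, each projected $W_2^2$ equals $(\theta^\top v)^2$, whose average over uniform directions is $\|v\|^2/d$.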
Zero-shot Cross-task Preference Alignment for Offline RL via Optimal Transport (Poster) [link]
In preference-based Reinforcement Learning (PbRL), aligning rewards with human intentions often necessitates a substantial volume of human-provided labels. Furthermore, the expensive preference data from prior tasks often lacks reusability for subsequent tasks, resulting in repetitive labeling for each new task. In this paper, we propose a novel zero-shot cross-task preference-based RL algorithm that leverages labeled preference data from source tasks to infer labels for target tasks, eliminating the requirement for human queries. Our approach utilizes the Gromov-Wasserstein distance to align trajectory distributions between source and target tasks. The solved optimal transport matrix serves as a correspondence between trajectories of the two tasks, making it possible to identify corresponding trajectory pairs between tasks and transfer the preference labels. However, direct learning from these inferred labels might introduce noisy or inaccurate reward functions. To this end, we introduce the Robust Preference Transformer, which considers both reward mean and uncertainty by modeling rewards as Gaussian distributions. Through extensive empirical validation on robotic manipulation tasks from Meta-World and Robomimic, our approach exhibits strong capabilities of transferring preferences between tasks in a zero-shot way and learns reward functions from noisy labels robustly. Notably, our approach significantly surpasses existing methods in limited-data scenarios. The videos of our method are available on the website: https://sites.google.com/view/pot-rpt.
Runze Liu · Yali Du · Fengshuo Bai · Jiafei Lyu · Xiu Li
Accelerating Motion Planning via Optimal Transport (Poster) [link]
Motion planning is still an open problem for many disciplines, e.g., robotics and autonomous driving, due to issues like high planning times that hinder real-time, efficient decision-making. A class of methods striving to provide smooth solutions is gradient-based trajectory optimization. However, those methods might suffer from bad local minima, while for many settings they may be inapplicable due to the absence of easy access to objective gradients. In response to these issues, we introduce Motion Planning via Optimal Transport (MPOT), a gradient-free method that optimizes a batch of smooth trajectories over highly nonlinear costs, even for high-dimensional tasks, while imposing smoothness through a Gaussian process dynamics prior via a planning-as-inference perspective. To facilitate batch trajectory optimization, we introduce an original zero-order and highly parallelizable update rule, the Sinkhorn Step, which uses the regular polytope family for its search directions; each regular polytope, centered on trajectory waypoints, serves as a local neighborhood, effectively acting as a trust region, where the Sinkhorn Step "transports" local waypoints toward low-cost regions. We theoretically show that the Sinkhorn Step guides the optimizing parameters toward local minima regions of non-convex objective functions. We then show the efficiency of MPOT in a range of problems, from low-dimensional point-mass navigation to high-dimensional whole-body robot motion planning, evincing its superiority compared with popular motion planners and paving the way for new applications of optimal transport in motion planning.
An T. Le · Georgia Chalvatzaki · Armin Biess · Jan Peters
Outlier-Robust Wasserstein DRO (Poster) [link]
Distributionally robust optimization (DRO) is an effective approach for data-driven decision-making in the presence of uncertainty. Geometric uncertainty due to sampling or localized perturbations of data points is captured by Wasserstein DRO (WDRO), which seeks to learn a model that performs uniformly well over a Wasserstein ball centered around the observed data distribution. However, WDRO fails to account for non-geometric perturbations such as adversarial outliers, which can greatly distort the Wasserstein distance measurement and impede the learned model. We address this gap by proposing a novel outlier-robust WDRO framework for decision-making under both geometric (Wasserstein) perturbations and non-geometric (total variation (TV)) contamination that allows an $\varepsilon$-fraction of data to be arbitrarily corrupted. We design an uncertainty set using a certain robust Wasserstein ball that accounts for both perturbation types. We derive minimax optimal excess risk bounds for this procedure that explicitly capture the Wasserstein and TV risks. We prove a strong duality result that enables efficient computation of our outlier-robust WDRO problem. When the loss function depends only on low-dimensional features of the data, we eliminate certain dimension dependencies from the risk bounds that are unavoidable in the general setting. Finally, we present experiments validating our theory on standard regression and classification tasks.
Sloan Nietert · Ziv Goldfeld · Soroosh Shafiee
-
|
Towards a Statistical Theory of Learning to Learn In-context with Transformers
(
Poster
)
>
link
Classical learning theory focuses on supervised learning of functions via empirical risk minimization, where labeled examples for a particular task are represented by the data distribution experienced by the model during training. Recently, in-context learning emerged as a paradigm shift in large pre-trained models: when conditioned on a few labeled examples from tasks potentially unseen during training, the model infers the task at hand and makes predictions on new points. Learning to learn in-context, on the other hand, aims at training models in a meta-learning setup that generalize to new unseen tasks from only a few shots of labeled examples. In this paper, we present a statistical learning framework for the problem of in-context meta-learning and define a function class that enables it. The meta-learner is abstracted as a function defined on the cross product of the probability space (representing context) and the data space. The data distribution is sampled from a meta distribution on tasks. Thanks to the regularity we assume on the function class in the Wasserstein geometry, we leverage tools from optimal transport to study the generalization of the meta-learner to unseen tasks. Finally, we show that encoder transformers exhibit this type of regularity and leverage our theory to analyze their generalization properties. |
Youssef Mroueh 🔗 |
-
|
Estimating Fréchet bounds for validating programmatic weak supervision
(
Poster
)
>
link
We develop methods for estimating Fréchet bounds on (possibly high-dimensional) distribution classes in which some variables are continuous-valued. We establish the statistical correctness of the computed bounds under uncertainty in the marginal constraints and demonstrate the usefulness of our algorithms by evaluating the performance of machine learning (ML) models trained with programmatic weak supervision (PWS). PWS is a framework for principled learning from weak supervision inputs (e.g., crowdsourced labels, knowledge bases, pre-trained models on related tasks, etc.), and it has achieved remarkable success in many areas of science and engineering. Unfortunately, it is generally difficult to validate the performance of ML models trained with PWS due to the absence of labeled data. Our algorithms address this issue by estimating sharp lower and upper bounds for performance metrics such as accuracy, recall, and precision, drawing connections to tools from computational optimal transport. |
Felipe Maia Polo · Mikhail Yurochkin · Moulinath Banerjee · Subha Maity · Yuekai Sun 🔗 |
-
|
Semidefinite Relaxations of the Gromov-Wasserstein Distance
(
Poster
)
>
link
The Gromov-Wasserstein distance (GW) is an extension of the optimal transport problem that allows one to match objects between incomparable spaces. At its core, the GW distance is specified as the solution of a non-convex quadratically constrained quadratic program, which is not known to be tractable to solve. In particular, existing solvers are only able to find local optimizers. In this work, we propose a semidefinite programming (SDP) relaxation of the GW distance. Our approach provides the ability to compute the optimality gap of any transport map relative to the global optimal solution. Our initial numerical experiments suggest that our proposed relaxation is strong, in that it frequently computes the global optimal solution together with a proof of global optimality. |
Junyu Chen · Binh T. Nguyen · Yong Sheng Soh 🔗 |
-
|
Invertible normalizing flow neural networks by JKO scheme
(
Poster
)
>
link
Normalizing flow is a class of deep generative models for efficient sampling and density estimation. In practice, the flow often appears as a chain of invertible neural network blocks; to facilitate training, existing works have regularized flow trajectories and designed special network architectures. The current paper develops a neural ODE flow network inspired by the Jordan-Kinderlehrer-Otto (JKO) scheme, which allows efficient block-wise training of the residual blocks without sampling SDE trajectories or inner loops of score matching or variational learning. As the JKO scheme unfolds the dynamics of a gradient flow, the proposed model naturally stacks residual network blocks one by one, reducing the memory load and difficulty of performing end-to-end deep flow network training. We also develop adaptive time reparameterization of the flow network with a progressive refinement of the trajectory in probability space, which improves the model's training efficiency and accuracy in practice. Using numerical experiments with synthetic and real data, we show that the proposed JKO-iFlow model achieves similar or better performance in generating new samples compared with existing flow and diffusion models, at a significantly reduced computational and memory cost. |
Chen Xu · Xiuyuan Cheng · Yao Xie 🔗 |
-
|
SyMOT-Flow: Learning optimal transport flow for two arbitrary distributions with maximum mean discrepancy
(
Poster
)
>
link
Finding a transformation between two unknown probability distributions from samples is crucial for modeling complex data distributions and performing tasks such as density estimation, sample generation, and statistical inference. One powerful framework for such transformations is normalizing flow, which transforms an unknown distribution into a standard normal distribution using an invertible network. In this paper, we introduce a novel model called SyMOT-Flow that trains an invertible transformation by minimizing the symmetric maximum mean discrepancy between samples from two unknown distributions, and we incorporate an optimal transport cost as regularization to obtain a short-distance and interpretable transformation. The resulting transformation leads to more stable and accurate sample generation. We establish several theoretical results for the proposed model and demonstrate its effectiveness with low-dimensional illustrative examples as well as high-dimensional generative samples obtained through the forward and reverse flows. |
Zhe Xiong · Qiaoqiao Ding · Xiaoqun Zhang 🔗 |
-
|
Computing high-dimensional optimal transport by flow neural networks
(
Poster
)
>
link
Flow-based models are widely used in generative tasks, including normalizing flow, where a neural network transports from a data distribution $P$ to a normal distribution. This work develops a flow-based model that transports from $P$ to an arbitrary $Q$ where both distributions are only accessible via finite samples. We propose to learn the dynamic optimal transport between $P$ and $Q$ by training a flow neural network. The model is trained to find an invertible transport map between $P$ and $Q$ optimally by minimizing the transport cost. The trained optimal transport flow allows for performing many downstream tasks, including infinitesimal density ratio estimation and distribution interpolation in the latent space for generative models. The effectiveness of the proposed model on high-dimensional data is empirically demonstrated in mutual information estimation, energy-based generative models, and image-to-image translation.
|
Chen Xu · Xiuyuan Cheng · Yao Xie 🔗 |
-
|
Duality and Sample Complexity for the Gromov-Wasserstein Distance
(
Poster
)
>
link
The Gromov-Wasserstein (GW) distance, rooted in optimal transport (OT) theory, quantifies dissimilarity between metric measure spaces and provides a framework for aligning heterogeneous datasets. While computational aspects of the GW problem have been widely studied, a duality theory and fundamental statistical questions concerning empirical convergence rates remained obscure. This work closes these gaps for the quadratic GW distance over Euclidean spaces of different dimensions $d_x$ and $d_y$. We derive a dual form that represents the GW distance in terms of the well-understood OT problem. This enables employing proof techniques from statistical OT based on regularity analysis of dual potentials and empirical process theory, using which we establish the first GW empirical convergence rates. The derived two-sample rate is $n^{-2/\max\{\min\{d_x,d_y\},4\}}$ (up to a log factor when $\min\{d_x,d_y\}=4$), which matches the corresponding rates for OT. We also provide matching lower bounds, thus establishing sharpness of the derived rates. Lastly, the duality is leveraged to shed new light on the open problem of the one-dimensional GW distance between uniform distributions on $n$ points, illuminating why the identity and anti-identity permutations may not be optimal. Our results serve as a first step towards a comprehensive statistical theory as well as computational advancements for GW distances, based on the discovered dual formulations.
|
Zhengxin Zhang · Ziv Goldfeld · Youssef Mroueh · Bharath Sriperumbudur 🔗 |
-
|
PTLP: Partial Transport $L^p$ Distances
(
Poster
)
>
link
Optimal transport and its related problems, including optimal partial transport, have proven to be valuable tools in machine learning for computing meaningful distances between probability or positive measures. This success has led to a growing interest in defining transport-based distances that allow for comparing signed measures and, more generally, multi-channeled signals. Transport $L^p$ distances are notable extensions of the optimal transport framework to signed and possibly multi-channeled signals. In this paper, we introduce partial transport $L^p$ distances as a new family of metrics for comparing generic signals, benefiting from the robustness of partial transport distances. We provide theoretical background such as the existence of optimal plans and the behavior of the distance in various limits. Furthermore, we introduce the sliced variation of these distances, which allows for faster comparison of generic signals. Finally, we demonstrate the application of the proposed distances in signal class separability and nearest neighbor classification.
|
Xinran Liu · Yikun Bai · Huy Tran · Zhanqi Zhu · Matthew Thorpe · Soheil Kolouri 🔗 |
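The sliced variant mentioned above builds on the fact that one-dimensional OT has a closed form via sorting. Below is a minimal sketch of that standard construction for the classical balanced, unsigned case, not the paper's partial transport $L^p$ metric; the sample sizes, projection count, and seed are illustrative:

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """W_p between equal-size empirical 1D distributions: sort and compare."""
    xs, ys = np.sort(x), np.sort(y)
    return np.mean(np.abs(xs - ys) ** p) ** (1.0 / p)

def sliced_wasserstein(X, Y, n_proj=100, p=2, seed=0):
    """Monte Carlo sliced W_p: average 1D costs over random unit projections."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=X.shape[1])
        theta /= np.linalg.norm(theta)              # random unit direction
        total += wasserstein_1d(X @ theta, Y @ theta, p) ** p
    return (total / n_proj) ** (1.0 / p)
```

Each projection reduces the d-dimensional problem to a sort, which is what makes sliced distances fast to evaluate on generic signals.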
-
|
Optimal Transport for Measures with Noisy Tree Metric
(
Poster
)
>
link
We study the optimal transport (OT) problem for probability measures supported on a tree metric space. It is known that this OT problem admits a closed-form expression, but one that depends fundamentally on the underlying tree structure of the supports of the input measures. In practice, however, the given tree structure may be perturbed due to noisy or adversarial measurements. To mitigate this issue, we follow the max-min robust OT approach, which considers the maximal possible distance between two input probability measures over an uncertainty set of tree metrics. In general, this robust OT approach is notoriously hard to compute due to its non-convexity and non-smoothness, which hinders its practical applications, especially in large-scale settings. In this work, we propose \emph{novel uncertainty sets of tree metrics} from the lens of edge deletion/addition, which cover a diversity of tree structures. Consequently, by building upon the proposed uncertainty sets and leveraging the tree structure over supports, we show that the max-min robust OT also admits a closed-form expression which is scalable for large-scale applications. Furthermore, we demonstrate that the max-min robust OT satisfies the metric property and is negative definite. We then exploit its negative definiteness to propose positive definite \emph{kernels} and test them in several simulations on document classification and topological data analysis for measures on a tree. |
Tam Le · Truyen Nguyen · Kenji Fukumizu 🔗 |
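The closed-form expression referenced above is classical: $W_1$ on a tree is a weighted sum, over edges, of the absolute difference in subtree mass. A minimal sketch for a fixed (noise-free) tree follows, assuming nodes are numbered so that `parent[i] < i` with root `0`; the paper's robust max-min construction over uncertainty sets is not reproduced here:

```python
def tree_wasserstein_1(parent, weight, mu, nu):
    """Closed-form W1 between measures mu and nu on a tree.

    parent[i] is the parent of node i (parent[0] = -1 for the root) with
    parent[i] < i, and weight[i] is the length of the edge (i, parent[i]).
    """
    diff = [m - n for m, n in zip(mu, nu)]
    cost = 0.0
    for i in range(len(parent) - 1, 0, -1):  # children before parents
        cost += weight[i] * abs(diff[i])     # mass crossing edge (i, parent[i])
        diff[parent[i]] += diff[i]           # fold subtree mass into the parent
    return cost

# path graph 0 - 1 - 2 with unit edge lengths:
# moving all mass from node 0 to node 2 costs 2
d = tree_wasserstein_1([-1, 0, 1], [0.0, 1.0, 1.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
# → 2.0
```

The single backward pass over the nodes is what makes tree OT linear-time in the number of nodes, which is the scalability the abstract exploits.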
-
|
Repairing Regressors for Fair Binary Classification at Any Decision Threshold
(
Poster
)
>
link
We study the problem of post-processing a supervised machine-learned regressor to maximize fair binary classification at all decision thresholds. By decreasing the statistical distance between each group's score distributions, we show that we can increase fair performance across all thresholds at once, and that we can do so without a large decrease in accuracy. To this end, we introduce a formal measure of Distributional Parity, which captures the degree of similarity in the distributions of classifications for different protected groups. Our main result is to put forward a novel post-processing algorithm based on optimal transport, which provably maximizes Distributional Parity, thereby attaining common notions of group fairness like Equalized Odds or Equal Opportunity at all thresholds. We demonstrate on two fairness benchmarks that our technique works well empirically, while also outperforming and generalizing similar techniques from related work. |
Kweku Kwegyir-Aggrey · Jessica Dai · A. Feder Cooper · John Dickerson · Keegan Hines · Suresh Venkatasubramanian 🔗 |
-
|
Causal Discovery via Monotone Triangular Transport Maps
(
Poster
)
>
link
We study the problem of causal structure learning from data using transport maps. Specifically, we first provide a constraint-based method which builds upon lower-triangular monotone parametric transport maps to design conditional independence tests which are agnostic to the noise distribution. We provide an algorithm for causal discovery up to Markov Equivalence for general structural equations and noise distributions, which allows for settings with latent variables. Our approach also extends to score-based causal discovery by providing a novel means for defining scores. This allows us to uniquely recover the causal graph under additional identifiability and structural assumptions, such as additive noise or post-nonlinear models. We provide experimental results to compare the proposed approach with the state of the art on both synthetic and real-world datasets. |
Sina Akbari · Luca Ganassali · Negar Kiyavash 🔗 |
-
|
Applications of Optimal Transport Distances in Unsupervised AutoML
(
Poster
)
>
link
In this work, we explore the utility of Optimal Transport-based dataset similarity to find similar \textit{unlabeled tabular} datasets, especially in the context of automated machine learning (AutoML) on unsupervised tasks. Since unsupervised tasks don't have a ground truth that optimization techniques can optimize towards, but often do have historical information on which pipelines work best, we propose to meta-learn over prior tasks to transfer useful pipelines to new tasks. Our intuition behind this work is that pipelines that worked well on datasets with a \textit{similar underlying data distribution} will work well on new datasets. We use Optimal Transport distances to find this similarity between unlabeled tabular datasets and recommend machine learning pipelines on two downstream unsupervised tasks: Outlier Detection and Clustering. We obtain very promising results against existing baselines and state-of-the-art methods. |
prabhant singh · Joaquin Vanschoren 🔗 |
-
|
Data-Conditional Diffusion Bridges
(
Poster
)
>
link
The dynamic Schrödinger bridge problem provides an appealing setting for solving constrained time-series data generation tasks posed as an iteration over optimal transport problems. Recent works have demonstrated state-of-the-art results but are limited to learning bridges with only initial and terminal constraints. Our work extends this paradigm by proposing the Iterative Smoothing Bridge (ISB). We integrate Bayesian filtering and optimal control into learning the diffusion process, enabling constrained stochastic processes governed by sparse observations at intermediate stages and terminal constraints, and assess the effectiveness of ISB on a single-cell embryo RNA data set. |
Ella Tamir · Martin Trapp · Arno Solin 🔗 |
-
|
Characterizing Out-of-Distribution Error via Optimal Transport
(
Poster
)
>
link
Out-of-distribution (OOD) data poses serious challenges in deployed machine learning models, so methods of predicting a model's performance on OOD data without labels are important for machine learning safety. While a number of methods have been proposed by prior work, they often underestimate the actual error, sometimes by a large margin, which greatly impacts their applicability to real tasks. In this work, we identify pseudo-label shift, or the difference between the predicted and true OOD label distributions, as a key indicator of this underestimation. Based on this observation, we introduce a novel method for estimating model performance by leveraging optimal transport theory, Confidence Optimal Transport (COT), and show that it provably provides more robust error estimates in the presence of pseudo-label shift. Additionally, we introduce an empirically-motivated variant of COT, Confidence Optimal Transport with Thresholding (COTT), which applies thresholding to the individual transport costs and further improves the accuracy of COT's error estimates. We evaluate COT and COTT on a variety of standard benchmarks that induce various types of distribution shift -- synthetic, novel subpopulation, and natural -- and show that our approaches significantly outperform existing state-of-the-art methods, with up to 3x lower prediction error. |
Yuzhe Lu · Yilong Qin · Runtian Zhai · Andrew Shen · Ketong Chen · Zhenlin Wang · Soheil Kolouri · Simon Stepputtis · Joseph Campbell · Katia Sycara 🔗 |
-
|
A generative flow model for conditional sampling via optimal transport
(
Poster
)
>
link
Sampling conditional distributions is a fundamental task for Bayesian inference and density estimation. Generative models characterize conditionals by learning a transport map that pushes forward a reference (e.g., a standard Gaussian) to the target distribution. While these approaches can successfully describe many non-Gaussian problems, their performance is often limited by parametric bias and the reliability of gradient-based (adversarial) optimizers to learn the map. This work proposes a non-parametric generative model that adaptively maps reference samples to the target. The model uses block-triangular transport maps, whose components characterize conditionals of the target distribution. These maps arise from solving an optimal transport problem with a weighted $L^2$ cost function, thereby extending the data-driven approach in [Trigila and Tabak, 2016] for conditional sampling. The proposed approach is demonstrated on a low-dimensional example and a parameter inference problem involving nonlinear ODEs.
|
Jason Alfonso · Ricardo Baptista · Anupam Bhakta · Noam Gal · Alfin Hou · Vasilisa Lyubimova · Daniel Pocklington · Josef Sajonz · Giulio Trigila · Ryan Tsai 🔗 |
-
|
Understanding Reward Ambiguity Through Optimal Transport Theory in Inverse Reinforcement Learning
(
Poster
)
>
link
In inverse reinforcement learning (IRL), the central objective is to infer underlying reward functions from observed expert behaviors in a way that not only explains the given data but also generalizes to unseen scenarios, ensuring robustness against reward ambiguity, where multiple reward functions can equally explain the same expert behaviors. While significant strides have been made in addressing this issue, current methods often struggle with high-dimensional problems and lack a geometric foundation. This paper harnesses optimal transport (OT) theory to provide a fresh perspective on these challenges. By utilizing the Wasserstein distance from OT, we establish a geometric framework that allows for quantifying reward ambiguity and identifying a central representation, or centroid, of reward functions. These insights pave the way for robust IRL methodologies anchored in geometric interpretations, offering a structured approach to tackling reward ambiguity in high-dimensional settings. |
Ali Baheri 🔗 |
-
|
Optimal Transport with Adaptive Regularisation
(
Poster
)
>
link
Regularising the primal formulation of optimal transport (OT) with a strictly convex term leads to improved numerical behaviour and a denser transport plan. Many formulations impose a global constraint on the transport plan, for instance by relying on entropic regularisation. As it is more expensive to diffuse mass for outlier points than for central ones, this typically results in a significant imbalance in the way mass is spread across the points. This can be detrimental for applications where a minimum of smoothing is required per point. To remedy this, we introduce OT with Adaptive RegularIsation (OTARI), a new formulation of OT that imposes constraints on the mass going into and/or out of each point. We then showcase the benefits of this approach for domain adaptation. |
Hugues Van Assel · Titouan Vayer · Rémi Flamary · Nicolas Courty 🔗 |
-
|
SpecTr: Fast Speculative Decoding via Optimal Transport
(
Poster
)
>
link
Autoregressive sampling from large language models has been shown to achieve state-of-the-art results in several natural language tasks. However, autoregressive sampling generates tokens one at a time, making it slow, and even prohibitive in certain tasks. One way to speed up decoding is *speculative decoding*: use a small model to sample a *draft* (block or sequence of tokens), and then score all tokens in the draft with the large language model in parallel. The tokens in the draft are either accepted or rejected based on a statistical method that guarantees the final output is a valid sample from the large model. In this work, we provide a principled understanding of speculative decoding through the lens of optimal transport (OT) with *membership cost*. This framework can be viewed as an extension of the well-known *maximal-coupling* problem. The new formulation enables us to generalize the sampling method to allow for a set of $k$ candidates at the token level, which leads to an improved optimal membership cost. We show that the optimal solution can be computed via linear programming, whose best-known runtime is exponential in $k$. We then propose an approximate solution whose acceptance probability is $(1-1/e)$-optimal multiplicatively. Moreover, it can be computed in time almost linear in the size of the domain of a single token. Using this new OT algorithm, we develop a new autoregressive sampling algorithm called *SpecTr*. We experimentally demonstrate that for state-of-the-art large language models, the proposed approach achieves a wall-clock speedup of 2.13X, a further 1.37X speedup over speculative decoding, on standard benchmarks.
|
Ziteng Sun · Ananda Theertha Suresh · Jae Hun Ro · Ahmad Beirami · Himanshu Jain · Felix Yu 🔗 |
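For orientation, the $k=1$ token-level scheme that SpecTr generalizes is the standard maximal-coupling accept/reject rule of speculative decoding. A minimal sketch follows; the vocabulary and distributions are toy values, and this is not the paper's $k$-candidate OT algorithm:

```python
import random

def speculative_token(p, q, rng):
    """Draw one token with draft distribution q and target distribution p.

    Accept the draft token x with probability min(1, p[x]/q[x]); on rejection,
    resample from the normalized residual (p - q)^+. The output is distributed
    exactly according to p (the maximal coupling of p and q).
    """
    vocab = list(range(len(p)))
    x = rng.choices(vocab, weights=q)[0]            # draft-model proposal
    if rng.random() < min(1.0, p[x] / q[x]):
        return x                                    # accepted: keep draft token
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0]  # rejected: resample residual

rng = random.Random(0)
p, q = [0.7, 0.2, 0.1], [0.4, 0.4, 0.2]
samples = [speculative_token(p, q, rng) for _ in range(20000)]
```

The acceptance probability of this rule equals one minus the total variation distance between `p` and `q`, which is exactly the quantity SpecTr's membership-cost formulation improves by drafting $k$ candidates.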
-
|
Quantum Theory and Application of Contextual Optimal Transport
(
Poster
)
>
link
Optimal Transport (OT) has fueled machine learning (ML) applications across various domains. In cases where paired data measurements ($\mu$, $\nu$) are coupled to a context variable $p_i$, one may aspire to learn a global transportation map, parameterized through the context, to facilitate prediction of target states even from unseen context. Existing approaches for this task leverage Brenier’s theorem and utilize Neural OT. Here, we follow a radically different approach inspired by quantum computing principles to develop a quantum formulation for learning transportation plans parameterized by a context variable. This is achieved by exploiting a natural link between doubly stochastic matrices and unitary operators. The latter can be directly related to recent results in quantum learning theory suggesting intrinsic advantages in modelling constrained problems with quantum methods. We verify our methodology on synthetic data, emulating the task of predicting single-cell perturbation responses parameterized through drug dosage as context. Our experimental comparisons to a baseline reveal that our method can capture dose-induced variations in cell distributions, even to some extent when extrapolating to dosages outside the interval seen during training. In summary, this work assesses the feasibility of learning to predict contextualized transportation plans through a novel quantum computing approach.
|
Nicola Mariella · Jannis Born · Albert Akhriev · Francesco Tacchino · Christa Zoufal · Eugene Koskin · Ivano Tavernelli · Stefan Woerner · Maria Anna Rapsomaniki · Sergiy Zhuk 🔗 |
-
|
Interpolating between Clustering and Dimensionality Reduction with Gromov-Wasserstein
(
Poster
)
>
link
We present a versatile adaptation of existing dimensionality reduction (DR) objectives, enabling the simultaneous reduction of both sample and feature sizes. Correspondences between input and embedding samples are computed through a semi-relaxed Gromov-Wasserstein optimal transport (OT) problem. When the embedding sample size matches that of the input, our model recovers classical popular DR models. When the embedding's dimensionality is unconstrained, we show that the OT plan delivers a competitive hard clustering. We emphasize the importance of intermediate stages that blend DR and clustering for summarizing real data and apply our method to visualize datasets of images. |
Hugues Van Assel · Cédric Vincent-Cuaz · Titouan Vayer · Rémi Flamary · Nicolas Courty 🔗 |
-
|
On Schrödinger Bridge Matching and Expectation Maximization
(
Poster
)
>
link
In this work, we analyze methods for solving the Schrödinger Bridge problem from the perspective of alternating KL divergence minimization. While existing methods such as Iterative Proportional or Markovian Fitting require exact updates because each iteration optimizes the same argument of the KL divergence, we justify a joint optimization of a single KL divergence objective from the perspective of information geometry. As in the variational EM algorithm, this allows for partial, stochastic gradient updates that decrease a unified objective. We highlight connections with related bridge-matching, flow-matching, and few-step generative modeling approaches, where various parameterizations of the coupling distributions are contextualized from the perspective of marginal-preserving inference. |
Rob Brekelmans · Kirill Neklyudov 🔗 |
-
|
Optimal transport for vector Gaussian mixture models
(
Poster
)
>
link
Vector-valued Gaussian mixtures form an important special subset of vector-valued distributions. In general, vector-valued distributions constitute natural representations for physical entities, which can mutate or transit among alternative manifestations distributed in a given space. A key example is color imagery. In this note, we vectorize the Gaussian mixture model and study several different optimal mass transport related problems associated to such models. The benefits of using vector Gaussian mixture for optimal mass transport include computational efficiency and the ability to preserve structure. |
Jiening Zhu · Kaiming Xu · Allen Tannenbaum 🔗 |
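A standard computational building block for OT between Gaussian mixtures is the closed-form $W_2$ distance between two Gaussian components (the Bures metric on covariances). Below is a numpy sketch of that classical formula, not of the note's vector-valued construction:

```python
import numpy as np

def sqrtm_psd(A):
    """Symmetric PSD matrix square root via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def gaussian_w2(m1, S1, m2, S2):
    """Closed-form W2 between N(m1, S1) and N(m2, S2):
    W2^2 = |m1 - m2|^2 + tr(S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2})."""
    m1, S1 = np.asarray(m1, float), np.asarray(S1, float)
    m2, S2 = np.asarray(m2, float), np.asarray(S2, float)
    r = sqrtm_psd(S1)
    bures2 = np.trace(S1 + S2 - 2.0 * sqrtm_psd(r @ S2 @ r))
    return float(np.sqrt(np.sum((m1 - m2) ** 2) + max(bures2, 0.0)))

# 1D sanity check: N(0, 1) vs N(3, 4) gives sqrt(3^2 + (1 - 2)^2) = sqrt(10)
d = gaussian_w2([0.0], [[1.0]], [3.0], [[4.0]])
```

Mixture-level OT schemes typically use such pairwise component distances as the ground cost of a small discrete OT problem over mixture weights.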
-
|
Learning TSP Algorithmic Prior using Gumbel-Sinkhorn Operator
(
Poster
)
>
link
The Machine Learning community has recently shown a growing interest in Optimal Transport (OT). Methods that leverage entropy regularization based on OT have proven especially effective for various tasks, including ranking, sorting, and solving jigsaw puzzles. In our study, we broaden the application of entropy regularization methods to address the NP-hard Travelling Salesman Problem (TSP). We first formulate TSP as identifying the permutation of a Hamiltonian cycle with the shortest length. Following this, we establish the permutation representation using the Gumbel-Sinkhorn operator with entropic regularization. Our findings indicate a balance between entropy and generalization. We further discuss how to generalize across instances of different hardness. |
Yimeng Min · Carla Gomes 🔗 |
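The Gumbel-Sinkhorn operator used above (Mena et al., 2018) perturbs a score matrix with Gumbel noise and repeatedly normalizes rows and columns, yielding a doubly stochastic relaxation of a random permutation. A minimal log-space numpy sketch, with illustrative temperature and iteration settings:

```python
import numpy as np

def gumbel_sinkhorn(log_alpha, tau=1.0, n_iters=500, seed=0):
    """Doubly stochastic relaxation of a permutation sampled from log-scores.

    Gumbel noise makes the output a relaxed random permutation; lowering the
    temperature tau pushes the result toward a hard permutation matrix.
    """
    rng = np.random.default_rng(seed)
    g = -np.log(-np.log(rng.uniform(size=log_alpha.shape)))    # Gumbel(0, 1) noise
    S = (log_alpha + g) / tau
    for _ in range(n_iters):   # Sinkhorn normalization in log space for stability
        S = S - np.logaddexp.reduce(S, axis=1)[:, None]        # normalize rows
        S = S - np.logaddexp.reduce(S, axis=0)[None, :]        # normalize columns
    return np.exp(S)

P = gumbel_sinkhorn(np.random.default_rng(1).normal(size=(4, 4)))
```

Because the output stays doubly stochastic, it can stand in for a permutation (here, a tour representation) inside a differentiable training loop.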
-
|
Adaptive Algorithms for Continuous-Time Transport: Homotopy-Driven Sampling and a New Interacting Particle System
(
Poster
)
>
link
We propose a new dynamic algorithm which transports samples from a reference distribution to a target distribution in unit time, given access to the target-to-reference density ratio. Our approach is to seek a sequence of transport maps that push forward the reference along a path given by a geometric mixture of the two densities. We take the maps to be simply parameterized, local, sample-driven optimal transport maps which we identify by approximately solving a root-finding problem formulated using importance weights. When feature functions for the maps are taken to be kernels, we obtain a novel interacting particle system from which we derive finite-particle and mean-field ODEs. In discrete time, we introduce an adaptive algorithm for simulating this interacting particle system which adjusts the ODE time steps based on the quality of the transport, automatically uncovering a good "schedule" for traversing the geometric mixture of densities. |
Aimee Maurais · Youssef Marzouk 🔗 |
-
|
Fast and Accurate Cost-Scaling Algorithm for the Semi-Discrete Optimal Transport
(
Poster
)
>
link
Given a continuous probability distribution $\mu$ and a discrete distribution $\nu$ in the $d$-dimensional space, the semi-discrete Optimal Transport (OT) problem asks for computing a minimum-cost plan to transport mass from $\mu$ to $\nu$. In this paper, given a parameter $\varepsilon >0$, we present an approximation algorithm that computes a semi-discrete transport plan $\tau$ with cost $\mathrm{cost}(\tau) \le \mathrm{cost}(\tau^*)+\varepsilon$ in $n^{O(d)}\log\frac{C_{\max}}{\varepsilon}$ time; here, $\tau^*$ is the optimal transport plan, $C_{\max}$ is the diameter of the supports of $\mu$ and $\nu$, $n$ is the number of points in the support of the discrete distribution, and we assume we have access to an oracle that outputs the mass of $\mu$ inside a constant-complexity region in $O(1)$ time. Our algorithm works for several ground distances, including the $L_p$-norm and the squared Euclidean distance.
|
Pankaj Agarwal · Sharath Raghvendra · Pouyan Shirzadian · Keegan Yao 🔗 |
-
|
A Computational Framework for Solving Wasserstein Lagrangian Flows
(
Poster
)
>
link
The dynamical formulation of optimal transport can be extended through various choices of the underlying geometry (kinetic energy) and the regularization of density paths (potential energy). These combinations yield different variational problems (Lagrangians), encompassing many variations of the optimal transport problem, such as the Schrödinger bridge, unbalanced optimal transport, and optimal transport with physical constraints, among others. In general, the optimal density path is unknown, and solving these variational problems can be computationally challenging. Leveraging the dual formulation of the Lagrangians, we propose a novel deep learning based framework approaching all of these problems from a unified perspective. Our method does not require simulating or backpropagating through the trajectories of the learned dynamics, and does not need access to optimal couplings. We showcase the versatility of the proposed framework by outperforming previous approaches for single-cell trajectory inference, where incorporating prior knowledge into the dynamics is crucial for correct predictions. |
Kirill Neklyudov · Rob Brekelmans · Alexander Tong · Lazar Atanackovic · Qiang Liu · Alireza Makhzani 🔗 |
-
|
Provably Fast Finite Particle Variants of SVGD via Virtual Particle Stochastic Approximation
(
Poster
)
>
link
SVGD is a popular particle-based variational inference algorithm with well-studied mean-field dynamics. However, its finite-particle behavior is far less understood. Our work introduces the notion of *virtual particles* to develop novel stochastic approximations of mean-field SVGD dynamics in the space of probability measures that are exactly realizable using finite particles. As a result, we design two computationally efficient variants of SVGD (VP-SVGD and GB-SVGD) with provably fast finite-particle convergence rates. Our algorithms are specific random-batch approximations of SVGD which are computationally more efficient than ordinary SVGD. We show that the $n$ output particles of VP-SVGD and GB-SVGD, run for $T$ steps with batch size $K$, are as good as i.i.d. samples from a measure whose Kernel Stein Discrepancy to the target is at most $O(\tfrac{d^{1/3}}{(KT)^{1/6}})$ under standard assumptions. We prove similar results under a mild growth condition on the score function, which is weaker than the assumptions of prior works. Our convergence rates for the empirical measure (of the particles output by VP-SVGD and GB-SVGD) to the target distribution enjoy a **double exponential improvement** over the best known finite-particle analysis of SVGD. Furthermore, our results give the **first known polynomial oracle complexity in dimension**, completely eliminating the curse of dimensionality exhibited by previously known finite-particle rates.
|
Aniket Das · Dheeraj Nagaraj 🔗 |
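For reference, the mean-field dynamics that the abstract's variants approximate reduce, per step, to the vanilla SVGD update: a kernel-weighted score term (attraction) plus a kernel-gradient term (repulsion). A minimal numpy sketch with an RBF kernel; the step size, bandwidth, and Gaussian target below are illustrative, and this is plain SVGD, not the authors' virtual-particle variants:

```python
import numpy as np

def svgd_update(x, score, step=0.1, bw=1.0):
    """One vanilla SVGD step with the RBF kernel k(a, b) = exp(-|a-b|^2 / (2 bw)).

    x: (n, d) particle positions; score: (n, d) values of grad log p at x.
    """
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)  # pairwise sq. dists
    K = np.exp(-d2 / (2.0 * bw))                                # kernel matrix
    repulsion = ((x[:, None, :] - x[None, :, :]) * K[:, :, None] / bw).sum(axis=1)
    phi = (K @ score + repulsion) / x.shape[0]                  # Stein direction
    return x + step * phi

# standard normal target: grad log p(x) = -x; particles drift toward 0 and spread
x = np.array([[2.0], [3.0], [4.0]])
for _ in range(300):
    x = svgd_update(x, -x)
```

The repulsion term is what prevents the particles from collapsing onto the mode, and it is exactly the term whose finite-particle behavior the paper's analysis quantifies.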
-
|
Fourier-Based Bounds for Wasserstein Distances and Their Implications in Computational Inversion
(
Poster
)
>
link
Computational inverse problems entail fitting a mathematical model to data. These problems are often solved numerically by minimizing the mismatch between the model and the data under an appropriate metric. We focus on the case when this metric is the Wasserstein-$p$ ($W_p$) distance and its generalizations for unbalanced measures, including the Kantorovich-Rubinstein norm. Recent work of Niles-Weed and Berthet established that $W_p$ is bounded from below and above by weighted $\ell_p$ norms of the wavelet coefficients of the mismatch. Steinerberger posed as an open problem proving the existence of a Fourier-based lower bound on $W_p$ that grows with $p$. Building on this line of work, we establish lower and upper bounds on $W_p$ in terms of, respectively, weighted $\ell_{p'}$ and $\ell_2$ norms of the Fourier coefficients of the mismatch, where $p'$ is the Hölder conjugate of $p$. Moreover, when the model and data have sufficient Sobolev regularity, the upper bound can also be expressed in terms of a weighted $\ell_{p'}$ norm. Our bounds hold for $p$ in $[1,2]$, and the lower bound resolves the above-mentioned open problem for such $p$. Moreover, in the context of computational inversion using $W_p$ as the mismatch metric, these bounds allow us to analyze the resolution of frequencies in the context of early stopping of the minimization process and beyond. Since this metric is used in a broad range of other problems in mathematics and the computational sciences, we expect that our bounds will also be of interest independently of inverse problems.
|
Wanli Hong · Vlad Kobzar · Kui Ren 🔗 |