Data for pretraining machine learning models often consists of collections of heterogeneous datasets. Although training on their union is reasonable in agnostic settings, it may be suboptimal when the target domain (where the model will ultimately be used) is known in advance. In that case, one would ideally pretrain on only the dataset(s) most similar to the target one. Instead of limiting this choice to the datasets already present in the pretraining collection, here we explore extending the search to all datasets that can be synthesized as 'combinations' of them. We define such combinations as multi-dataset interpolations, formalized through the notion of generalized geodesics from optimal transport (OT) theory. We compute these geodesics using a recent notion of distance between labeled datasets, and derive alternative interpolation schemes based on it: using either barycentric projections or optimal transport maps, the latter computed with recent neural OT methods. These methods are scalable, efficient, and, notably, can be used to interpolate even between datasets with distinct and unrelated label sets. Through various experiments in transfer learning, we demonstrate this promising new approach to targeted on-demand dataset synthesis.
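The abstract's core idea (displacing one dataset toward another along an OT coupling) can be illustrated in a few lines. The sketch below is not the paper's method: it ignores labels, uses SciPy's exact assignment solver as a stand-in for neural OT maps, and assumes two equal-sized, uniformly weighted point clouds. The function name `interpolate_datasets` and the toy Gaussian data are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def interpolate_datasets(X, Y, t):
    """McCann-style interpolation between two equal-sized point clouds.

    Solves the exact OT matching under squared Euclidean cost
    (uniform weights reduce OT to an assignment problem), then moves
    each source point a fraction t of the way toward its matched target.
    t = 0 recovers X; t = 1 recovers (a permutation of) Y.
    """
    # Pairwise squared-distance cost matrix, shape (n, n)
    cost = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    # Exact optimal matching; row_ind is 0..n-1 for a square cost matrix
    row_ind, col_ind = linear_sum_assignment(cost)
    return (1 - t) * X[row_ind] + t * Y[col_ind]


# Toy example: the halfway point between two well-separated blobs
rng = np.random.default_rng(0)
X = rng.normal(loc=-2.0, size=(50, 2))
Y = rng.normal(loc=+2.0, size=(50, 2))
Z = interpolate_datasets(X, Y, t=0.5)  # a synthetic "in-between" dataset
```

The paper's barycentric-projection scheme plays an analogous role when the datasets have different sizes or weights: instead of a one-to-one matching, each source point is displaced toward the plan-weighted average of its targets.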
Author Information
Jiaojiao Fan (Georgia Institute of Technology)
David Alvarez-Melis (Microsoft)
More from the Same Authors
- 2021: On the complexity of the optimal transport problem with graph-structured cost
  Jiaojiao Fan · Isabel Haasler · Johan Karlsson · Yongxin Chen
- 2021: Variational Wasserstein gradient flow
  Jiaojiao Fan · Amirhossein Taghvaei · Yongxin Chen
- 2022: Neural Unbalanced Optimal Transport via Cycle-Consistent Semi-Couplings
  Frederike Lübeck · Charlotte Bunne · Gabriele Gut · Jacobo Sarabia del Castillo · Lucas Pelkmans · David Alvarez-Melis
- 2022 Spotlight: Are GANs overkill for NLP?
  David Alvarez-Melis · Vikas Garg · Adam Kalai
- 2022 Poster: Are GANs overkill for NLP?
  David Alvarez-Melis · Vikas Garg · Adam Kalai