Fri 6:30 a.m. - 6:45 a.m. | Opening Remarks (Remarks)
Fri 6:45 a.m. - 7:30 a.m. | Youssef Mroueh on Interpolating for fairness (Invited Talk)
Fri 7:30 a.m. - 8:15 a.m. | Sanjeev Arora on Using Interpolation to provide privacy in Federated Learning settings (Invited Talk)
Fri 8:15 a.m. - 9:00 a.m. | Chelsea Finn on Repurposing Mixup for Robustness and Regression (Invited Talk)
Fri 9:00 a.m. - 10:00 a.m. | Panel Discussion I (Discussion Panel)
Panelists: Chelsea Finn, Sanjeev Arora, and Youssef Mroueh, joined by external panelists Hongyi Zhang, Kilian Weinberger, and Dustin Tran
Fri 10:30 a.m. - 12:00 p.m. | Lunch with random mixing group and organizers
Fri 12:00 p.m. - 12:45 p.m. | Kenji Kawaguchi on The developments of the theory of Mixup (Invited Talk)
Fri 12:45 p.m. - 1:30 p.m. | Alex Lamb on Latent Data Augmentation for Improved Generalization (Invited Talk)
Fri 1:30 p.m. - 2:15 p.m. | Gabriel Ilharco on Robust and accurate fine-tuning by interpolating weights (Invited Talk)
Fri 2:15 p.m. - 3:00 p.m. | Panel Discussion II (Discussion Panel)
Panelists: Kenji Kawaguchi and Alex Lamb, joined by external panelist Mikhail Belkin
Fri 3:00 p.m. - 3:45 p.m. | Poster Session (Posters)
Fri 3:45 p.m. - 4:00 p.m. | Closing Remarks (Remarks)
Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models (Poster)
We present Branch-Train-Merge (BTM), a communication-efficient algorithm for training language models (LMs). BTM learns a set of independent EXPERT LMs (ELMs), each specialized to a different domain, such as scientific or legal text. New ELMs are learned by branching from (mixtures of) ELMs in the current set, further training on new domains, and then merging the resulting models back into the set for future use. These ELMs can be ensembled or averaged at inference time. Experiments show that BTM improves in- and out-of-domain perplexities compared to compute-matched GPT-style transformer LMs. Our results suggest that extreme parallelism could be used to efficiently scale LMs in future work.
Margaret Li
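The branch-train-merge recipe ends with either ensembling or parameter averaging of the expert LMs. The snippet below is a minimal sketch of the averaging step only; the helper names (`expert_state_dicts`, `domain_weights`) are hypothetical and this is not the authors' implementation.

```python
# Minimal sketch of merging EXPERT LMs (ELMs) by weighted parameter averaging.
# `expert_state_dicts` and `domain_weights` are hypothetical placeholders.
from typing import Dict, List

import torch


def merge_experts(expert_state_dicts: List[Dict[str, torch.Tensor]],
                  domain_weights: List[float]) -> Dict[str, torch.Tensor]:
    """Return a single state dict: the weighted average of the expert parameters."""
    total = sum(domain_weights)
    weights = [w / total for w in domain_weights]
    merged = {}
    for name in expert_state_dicts[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, expert_state_dicts))
    return merged
```

Ensembling would instead keep the experts separate and combine their output distributions at inference time.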
On Data Augmentation and Consistency-based Semi-supervised Relation Extraction (Poster)
To improve the sample efficiency of relation extraction (RE) models, semi-supervised learning (SSL) methods aim to leverage unlabelled data in addition to learning from limited labelled data points. Recently, strong data augmentation combined with consistency-based semi-supervised learning has advanced the state of the art in several SSL tasks. However, adapting these methods to the RE task has been challenging due to the difficulty of data augmentation for RE. In this work, we leverage recent advances in controlled text generation to perform high-quality data augmentation for the RE task. We further introduce small but significant changes to the model architecture that allow more training data to be generated by interpolating different data points in their latent space. These data augmentations, along with consistency training, yield very competitive results for semi-supervised relation extraction on four benchmark datasets.
Komal Teru
Differentially Private CutMix for Split Learning with Vision Transformer (Poster)
Recently, the vision transformer (ViT) has started to outpace conventional CNNs in computer vision tasks. For privacy-preserving distributed learning with ViT, federated learning (FL) communicates models, which is ill-suited due to ViT's large model size and computing costs. Split learning (SL) sidesteps this by communicating smashed data at a cut layer, yet suffers from data privacy leakage and large communication costs caused by the high similarity between ViT's smashed data and its input data. Motivated by this problem, we propose \textit{DP-CutMixSL}, a differentially private (DP) SL framework, by developing \textit{DP patch-level randomized CutMix (DP-CutMix)}, a novel privacy-preserving inter-client interpolation scheme that removes randomly selected patches in smashed data. Experimentally, we show that DP-CutMixSL not only boosts privacy guarantees and communication efficiency, but also achieves higher accuracy than its Vanilla SL counterpart. Theoretically, we show that DP-CutMix amplifies R\'enyi DP (RDP), which is upper-bounded by its Vanilla Mixup counterpart.
Seungeun Oh · Jihong Park · Sihun Baek · Hyelin Nam · Praneeth Vepakomma · Ramesh Raskar · Mehdi Bennis · Seong-Lyun Kim
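As a rough illustration of patch-level mixing of smashed data between clients, the sketch below combines cut-layer activations patch-wise with a random binary mask. Tensor shapes and helper names are assumptions, and the differential-privacy noise and accounting of DP-CutMix are not reproduced here.

```python
# Illustrative patch-level CutMix of two clients' smashed (cut-layer) activations.
# Shapes are assumed to be (batch, num_patches, dim); DP mechanisms are omitted.
import torch


def patch_cutmix(smashed_a: torch.Tensor, smashed_b: torch.Tensor,
                 keep_prob: float = 0.5) -> torch.Tensor:
    assert smashed_a.shape == smashed_b.shape
    batch, num_patches, _ = smashed_a.shape
    # Per-patch Bernoulli mask: 1 keeps client A's patch, 0 takes client B's.
    mask = (torch.rand(batch, num_patches, 1) < keep_prob).to(smashed_a.dtype)
    return mask * smashed_a + (1.0 - mask) * smashed_b
```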
Improving Domain Generalization with Interpolation Robustness (Poster)
We address domain generalization (DG) by viewing the underlying distributional shift as interpolation between domains. We devise an algorithm to learn a representation that is robustly invariant under such interpolation, and term this property interpolation robustness. We investigate how DG algorithms fail when training data is scarce. Through extensive experiments, we show that our approach significantly outperforms the recent state-of-the-art algorithm DIRT and the baseline DeepAll on average across different data sizes on the PACS and VLCS datasets.
Ragja Palakkadavath · Thanh Nguyen-Tang · Sunil Gupta · Svetha Venkatesh
Pre-train, fine-tune, interpolate: a three-stage strategy for domain generalization (Poster)
The goal of domain generalization is to train models that generalize well to unseen domains. The typical strategy is two-stage: first pre-training the network on a large corpus, then fine-tuning on the task's training domains. If the pre-training dataset is large enough, this pre-training is effective because it will contain samples related to the unseen domains. Yet, large-scale pre-training is costly and possible only for a few large companies. Rather than trying to cover all kinds of test distributions during pre-training, we propose to add a third stage: editing the featurizer after fine-tuning. To this end, we interpolate the featurizer with auxiliary featurizers trained on auxiliary datasets. This merging via weight averaging edits the main featurizer by incorporating the feature mechanisms learned on the auxiliary datasets. Empirically, we show that this editing strategy improves the performance of existing state-of-the-art models on the DomainBed benchmark by adapting the featurizer to the test domain. We hope to encourage updatable approaches beyond the direct transfer learning strategy.
Alexandre Rame · Jianyu Zhang · Leon Bottou · David Lopez-Paz
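The third stage described above amounts to weight averaging between the fine-tuned featurizer and an auxiliary featurizer. A minimal sketch follows, assuming both featurizers share the same architecture; the coefficient `lam` and the helper names are illustrative, not the paper's exact recipe.

```python
# Sketch of editing the main featurizer by interpolating it with an auxiliary
# featurizer in weight space. `lam` controls how much auxiliary knowledge is mixed in.
from typing import Dict

import torch


def interpolate_featurizers(main: Dict[str, torch.Tensor],
                            auxiliary: Dict[str, torch.Tensor],
                            lam: float = 0.5) -> Dict[str, torch.Tensor]:
    """Convex combination of two state dicts with identical keys and shapes."""
    return {name: (1.0 - lam) * main[name] + lam * auxiliary[name] for name in main}
```

With several auxiliary featurizers, one option is to average them first and then interpolate with the main featurizer.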
Sample Relationships through the Lens of Learning Dynamics with Label Information (Poster)
Although much research has been devoted to proposing new models or loss functions to improve the generalisation of artificial neural networks (ANNs), less attention has been paid to the data, which is also an important factor for training ANNs. In this work, we start by approximating the interaction between two samples, i.e. how learning one sample modifies the model's prediction on the other. Through analysing the terms involved in weight updates in supervised learning, we find that the signs of labels influence the interactions between samples. We therefore propose the labelled pseudo Neural Tangent Kernel (lpNTK), which takes label information into account when measuring the interactions between samples. We first prove that lpNTK asymptotically converges to the well-known empirical Neural Tangent Kernel in terms of the Frobenius norm under certain assumptions. Secondly, we illustrate how lpNTK helps to understand learning phenomena identified in previous work, specifically the learning difficulty of samples and forgetting events during learning. We also show that lpNTK can help to improve the generalisation performance of ANNs in image classification tasks, compared with training on the original whole training sets.
Shangmin Guo · Yi Ren · Stefano Albrecht · Kenny Smith
AlignMixup: Improving Representations By Interpolating Aligned Features (Poster)
Mixup is a powerful data augmentation method that interpolates between two or more examples in the input or feature space and between the corresponding target labels. However, how to best interpolate images is not well defined. Recent mixup methods overlay or cut-and-paste two or more objects into one image, which needs care in selecting regions. In this work, we revisit mixup from the deformation perspective and introduce AlignMixup, where we geometrically align two images in the feature space. The correspondences allow us to interpolate between two sets of features, while keeping the locations of one set. Interestingly, this retains mostly the geometry or pose of one image and the appearance or texture of the other. AlignMixup outperforms state-of-the-art mixup methods on five different benchmarks.
Shashanka Venkataramanan · Ewa Kijak · laurent amsaleg · Yannis Avrithis
LSGANs with Gradient Regularizers are Smooth High-dimensional Interpolators (Poster)
We consider the problem of discriminator optimization in least-squares generative adversarial networks (LSGANs) subject to higher-order gradient regularization enforced on the convex hull of all possible interpolation points between the target (real) and generated (fake) data. We analyze the proposed LSGAN cost within a variational framework, and show that the optimal discriminator solves a regularized least-squares problem and can be represented through a polyharmonic radial basis function (RBF) interpolator. The optimal RBF discriminator can be implemented in closed form, with the weights computed by solving a linear system of equations. We validate the proposed approach on synthetic Gaussian and standard image datasets. While the optimal LSGAN discriminator leads to superior convergence on Gaussian data, the inherent low-dimensional manifold structure of images makes the implementation of the optimal discriminator ill-posed. Nevertheless, replacing the trainable discriminator network with a closed-form RBF interpolator results in superior convergence on 2-D Gaussian data, while overcoming pitfalls in GAN training, namely mode dropping and mode collapse.
Siddarth Asokan · Chandra Seelamantula
Over-Training with Mixup May Hurt Generalization (Poster)
Mixup, which creates synthetic training instances by linearly interpolating random sample pairs, is a simple yet effective regularization technique to boost the performance of deep models trained with SGD. In this work, we report a previously unobserved phenomenon in Mixup training: on a number of standard datasets, the performance of Mixup-trained models starts to decay after training for a large number of epochs, giving rise to a U-shaped generalization curve. This behavior is further aggravated when the size of the original dataset is reduced. To help understand this behavior, we show theoretically that Mixup training may introduce undesired data-dependent label noise into the synthesized data. By analyzing a least-squares regression problem with a random feature model, we explain why noisy labels may cause the U-shaped curve: Mixup improves generalization by fitting the clean patterns at the early training stage, but as training progresses, it overfits the noise in the synthetic data.
Zixuan Liu · Ziqiao Wang · Hongyu Guo · Yongyi Mao
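For reference, the Mixup procedure analysed in this paper follows the standard formulation: draw lambda from a Beta(alpha, alpha) distribution, interpolate a random pair of inputs, and mix the loss with the same lambda. The sketch below is a generic implementation, not the authors' experimental code.

```python
# Standard Mixup training step: interpolate inputs and mix the loss accordingly.
import numpy as np
import torch
import torch.nn.functional as F


def mixup_batch(x: torch.Tensor, y: torch.Tensor, alpha: float = 1.0):
    """Return mixed inputs, the two label sets involved, and the mixing coefficient."""
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    return x_mixed, y, y[perm], lam


def mixup_loss(logits: torch.Tensor, y_a: torch.Tensor, y_b: torch.Tensor, lam: float):
    """Cross-entropy mixed with the same coefficient used for the inputs."""
    return lam * F.cross_entropy(logits, y_a) + (1.0 - lam) * F.cross_entropy(logits, y_b)
```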
Covariate Shift Detection via Domain Interpolation Sensitivity (Poster)
Covariate shift is a major roadblock to the reliability of image classifiers in the real world. Work on covariate shift has focused on training classifiers to adapt or generalize to unseen domains. However, for transparent decision making, it is equally desirable to develop \textit{covariate shift detection} methods that can indicate whether or not a test image belongs to an unseen domain. In this paper, we introduce a benchmark for covariate shift detection (CSD) that builds upon and complements previous work on domain generalization. We use state-of-the-art OOD detection methods as baselines and find them to be worse than simple confidence-based methods on our CSD benchmark. We propose an interpolation-based technique, Domain Interpolation Sensitivity (DIS), based on the simple hypothesis that interpolation between the test input and randomly sampled inputs from the training domain offers sufficient information to distinguish between the training domain and unseen domains under covariate shift. DIS surpasses all OOD detection baselines for CSD on multiple domain generalization benchmarks.
Tejas Gokhale · Joshua Feinglass · 'YZ' Yezhou Yang
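One plausible way to instantiate the interpolation-sensitivity idea is sketched below: mix the test input with randomly sampled training inputs and measure how far the prediction moves. The KL-based score and the helper names are assumptions; the paper's exact DIS score may differ.

```python
# Hedged sketch: score a test input by how much its prediction changes under
# interpolation with randomly sampled training inputs.
import torch
import torch.nn.functional as F


@torch.no_grad()
def interpolation_sensitivity(model: torch.nn.Module, x_test: torch.Tensor,
                              train_batch: torch.Tensor, lam: float = 0.5) -> float:
    """x_test: (1, C, H, W); train_batch: (K, C, H, W). Larger score -> more likely shifted."""
    p_orig = F.softmax(model(x_test), dim=-1)                # (1, num_classes)
    mixed = lam * x_test + (1.0 - lam) * train_batch         # broadcasts over the K samples
    p_mixed = F.softmax(model(mixed), dim=-1)                # (K, num_classes)
    log_ratio = p_orig.clamp_min(1e-12).log() - p_mixed.clamp_min(1e-12).log()
    kl = (p_orig * log_ratio).sum(dim=-1)                    # KL(p_orig || p_mixed) per sample
    return kl.mean().item()
```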
Interpolating Compressed Parameter Subspaces (Poster)
Though distribution shifts have caused growing concern for machine learning scalability, solutions tend to specialize towards a specific type of distribution shift. We find that constructing a Compressed Parameter Subspace (CPS), a geometric structure representing distance-regularized parameters mapped to a set of train-time distributions, can maximize average accuracy over a broad range of distribution shifts concurrently. We show that sampling parameters within a CPS can mitigate backdoor, adversarial, permutation, stylization and rotation perturbations. Regularizing a hypernetwork with a CPS can also reduce task forgetting.
Siddhartha Datta · Nigel Shadbolt
Mixup for Robust Image Classification - Application in Continuously Transitioning Industrial Sprays (Poster)
Image classification with deep neural networks has seen a surge of technological breakthroughs, with promising applications in areas such as face recognition, object detection, etc. However, in engineering problems such as high-speed imaging of engine fuel injector sprays, deep neural networks face a fundamental challenge: the availability of adequate and diverse data. Typically, only hundreds or thousands of samples are available for training. In addition, the transition between different spray classes is a "continuum" and requires a high level of domain expertise to label images accurately. This work therefore leverages pre-trained neural network models to build classifiers and employs Mixup to systematically deal with the data scarcity and ambiguous class boundaries found in industrial spray applications. Compared to traditional data augmentation methods, Mixup, which linearly interpolates between different classes, naturally aligns with the continuous transition between classes in spray applications. Results also show that Mixup can train a more accurate and robust deep neural network classifier with only hundreds of samples.
Huanyi Shui · Hongjiang Li · devesh upadhyay · Praveen Narayanan · Alemayehu Solomon Admasu
Momentum-based Weight Interpolation of Strong Zero-Shot Models for Continual Learning (Poster)
Large pretrained, zero-shot capable models have shown considerable success both for standard transfer and adaptation tasks, with particular robustness towards distribution shifts. In addition, subsequent finetuning can considerably improve performance on a selected downstream task. However, through naive finetuning, these zero-shot models lose their generalizability and robustness towards distribution shifts. This is a particular problem for tasks such as Continual Learning (CL), where continuous adaptation has to be performed as new task distributions are introduced sequentially. In this work, we showcase that where finetuning falls short in adapting such zero-shot capable models, simple momentum-based weight interpolation can provide consistent improvements for CL tasks in both memory-free and memory-based settings. In particular, we find improvements of over $+4\%$ on standard CL benchmarks, while in parts more than halving the gap to the upper limit of jointly training on all tasks at once, allowing the continual learner to inch closer to the joint-training limit.
Zafir Stojanovski · Karsten Roth · Zeynep Akata
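A minimal sketch of the momentum-based interpolation idea, assuming a slow copy of the weights initialised from the zero-shot model and pulled towards the finetuned weights as an exponential moving average; the momentum value and helper names are illustrative, not the paper's exact procedure.

```python
# Keep a slow, interpolated copy of the model and pull it towards the
# finetuned ("fast") weights with momentum tau after each update.
import torch


@torch.no_grad()
def momentum_interpolate(slow_model: torch.nn.Module,
                         fast_model: torch.nn.Module,
                         tau: float = 0.99) -> None:
    """In-place: slow <- tau * slow + (1 - tau) * fast, parameter-wise.
    Assumes both models share the same architecture and parameter ordering."""
    for p_slow, p_fast in zip(slow_model.parameters(), fast_model.parameters()):
        p_slow.mul_(tau).add_(p_fast, alpha=1.0 - tau)
```

The slow copy would typically be created with copy.deepcopy of the zero-shot model before finetuning starts and used for evaluation.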
Mixed Samples Data Augmentation with Replacing Latent Vector Components in Normalizing Flow (Poster)
Data augmentation that mixes two samples has been acknowledged as an effective regularization method for various deep neural network models. Given that images mixed by popular methods (e.g., MixUp and CutMix) are unnatural to the human eye, we hypothesized that generating more natural images could achieve better performance as data augmentation. To verify this, we propose a new mixing method that synthesizes images in which two source images coexist naturally. Our method performs the mixing operation in latent space through a normalizing flow, and the key question is how to mix two latent vectors. We preliminarily observed that there exists a dependency between the dimensions in input space and those in latent space in transformations with normalizing flows. Based on this observation, we designed our mixing scheme in latent space. We show that our method yields visually natural augmented images and improves classification performance.
Genki Osada · Budrul Ahsan · Takashi Nishide
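The core operation described above, mixing two latent vectors by replacing components, can be sketched as follows. The dimension-selection rule here is a simple random subset and is only an assumption; the paper's scheme relies on the observed dependency between input and latent dimensions.

```python
# Hedged sketch: mix two flow latents by replacing a random subset of components
# of z1 with the corresponding components of z2, then decode with the inverse flow.
import torch


def mix_latents(z1: torch.Tensor, z2: torch.Tensor, replace_frac: float = 0.5) -> torch.Tensor:
    """z1, z2: latent vectors of the same shape from the same normalizing flow."""
    mask = torch.rand_like(z1) < replace_frac   # True -> take the component from z2
    return torch.where(mask, z2, z1)

# Hypothetical usage with a flow exposing forward/inverse transforms:
# x_mixed = flow.inverse(mix_latents(flow(x1), flow(x2)))
```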
Overparameterization Implicitly Regularizes Input-Space Smoothness (Poster)
Existing bounds on the generalization error of deep networks assume some form of smooth or bounded dependence on the input variable and intermediate activations, falling short of investigating the mechanisms controlling such factors in practice. In this work, we present an empirical study of the Lipschitz constant of networks trained in practice, as the number of model parameters and training epochs vary. We present non-monotonic trends for the Lipschitz constant, strongly correlating with double descent for the test error. Our findings highlight a theoretical shortcoming in modeling input-space smoothness via monotonic bounds.
Matteo Gamba · Hossein Azizpour · Mårten Björkman
Effect of mixup Training on Representation Learning (Poster)
Mixup is a regularization technique that artificially produces new samples using convex combinations of original training points. This simple technique has shown strong empirical performance, and has been heavily used as part of semi-supervised learning techniques such as mixmatch~\citep{berthelot2019mixmatch} and interpolation consistent training (ICT)~\citep{verma2019interpolation}. In this paper, we look at mixup through a representation learning lens in a semi-supervised learning setup. In particular, we study the role of mixup in promoting linearity in the learned network representations. Towards this, we study two questions: (1) how does the mixup loss that enforces linearity in the last network layer propagate the linearity to the earlier layers?; and (2) how does enforcing a stronger mixup loss on more than two data points affect the convergence of training? We empirically investigate these properties of mixup on vision datasets such as CIFAR-10, CIFAR-100 and SVHN. Our results show that supervised mixup training does not make all the network layers linear; in fact, the intermediate layers become more non-linear during mixup training compared to a network trained without mixup. However, when mixup is used as an unsupervised loss, we observe that all the network layers become more linear, resulting in faster training convergence.
Arslan Chaudhry · Aditya Menon · Andreas Veit · Sadeep Jayasumana · Srikumar Ramalingam · Sanjiv Kumar
FedLN: Federated Learning with Label Noise (Poster)
Federated Learning (FL) is a distributed machine learning paradigm that enables learning models from decentralized private datasets, where the labeling effort is entrusted to the clients. While most existing FL approaches assume that high-quality labels are readily available on users' devices, in reality label noise can naturally occur in FL and follows a non-i.i.d. distribution among clients. Due to the ``non-iid-ness'' challenges, existing state-of-the-art centralized approaches exhibit unsatisfactory performance, while previous FL studies rely on data exchange or repeated server-side aid to improve the model's performance. Here, we propose FedLN, a framework to deal with label noise across different FL training stages, namely FL initialization and server-side model aggregation. Extensive experiments on various publicly available vision and audio datasets demonstrate an improvement of 24% on average compared to state-of-the-art methods at a label noise level of 70%.
Vasileios Tsouvalas · Aaqib Saeed · Tanir Özçelebi · Nirvana Meratnia
Benefits of Overparameterized Convolutional Residual Networks: Function Approximation under Smoothness Constraint (Poster)
Overparameterized neural networks enjoy great representation power on complex data, and more importantly yield sufficiently smooth output, which is crucial to their generalization and robustness. Most existing function approximation theories suggest that with sufficiently many parameters, neural networks can well approximate certain classes of functions in terms of the function value. The neural networks themselves, however, can be highly nonsmooth. To bridge this gap, we take convolutional residual networks (ConvResNets) as an example, and prove that large ConvResNets can not only approximate a target function in terms of function value, but also exhibit sufficient first-order smoothness. Moreover, we extend our theory to approximating functions supported on a low-dimensional manifold. Our theory partially justifies the benefits of using deep and wide networks in practice. Numerical experiments on adversarially robust image classification are provided to support our theory.
Hao Liu · Minshuo Chen · Siawpeng Er · Wenjing Liao · Tong Zhang · Tuo Zhao
GroupMixNorm Layer for Learning Fair Models (Poster)
Recent research has focused on proposing algorithms for bias mitigation from automated prediction algorithms. Most of the techniques include convex surrogates of fairness metrics such as demographic parity or equalized odds in the loss function, which are not easy to estimate. Further, these fairness constraints are mostly data-dependent and aim to minimize the disparity among the protected groups during training. However, they may not achieve similar performance on the test set. In order to address the above limitations, this research proposes a novel GroupMixNorm layer for bias mitigation from deep learning models. As an alternative to solving constraint optimization separately for each fairness metric, we formulate bias mitigation as a problem of distribution alignment of several groups identified through the protected attributes. To this effect, the GroupMixNorm layer probabilistically mixes group-level feature statistics of samples across different groups based on the protected attribute. The proposed method improves upon several fairness metrics with minimal impact on accuracy. Experimental evaluation and extensive analysis on benchmark tabular and image datasets demonstrate the efficacy of the proposed method in achieving state-of-the-art performance.
Anubha Pandey · Aditi Rai · Maneet Singh · Deepak Bhatt · Tanmoy Bhowmik
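The following is a hedged sketch of mixing group-level feature statistics across protected groups, in the spirit of the layer described above; the mixing coefficient, group pairing, and normalization details are assumptions rather than the exact GroupMixNorm layer.

```python
# Sketch: re-normalize each sample with a mix of its own group's feature statistics
# and those of a randomly chosen other group. Assumes each group has several samples
# in the batch so that the statistics are well defined.
import torch


def group_mix_norm(features: torch.Tensor, group_ids: torch.Tensor,
                   lam: float = 0.5, eps: float = 1e-5) -> torch.Tensor:
    """features: (batch, dim); group_ids: (batch,) integer protected-attribute labels."""
    out = features.clone()
    groups = group_ids.unique()
    stats = {int(g): (features[group_ids == g].mean(dim=0),
                      features[group_ids == g].std(dim=0) + eps) for g in groups}
    for g in groups:
        idx = group_ids == g
        other = int(groups[torch.randint(len(groups), (1,))].item())
        mu_g, sd_g = stats[int(g)]
        mu_o, sd_o = stats[other]
        mu_mix = lam * mu_g + (1.0 - lam) * mu_o
        sd_mix = lam * sd_g + (1.0 - lam) * sd_o
        out[idx] = (features[idx] - mu_g) / sd_g * sd_mix + mu_mix
    return out
```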
SMILE: Sample-to-feature MIxup for Efficient Transfer LEarning (Poster)
To improve the performance of deep learning, mixup has been proposed to force neural networks to favor simple linear behaviors in-between training samples. Performing mixup for transfer learning with pre-trained models, however, is not that simple: a high-capacity pre-trained model with a large fully-connected (FC) layer could easily overfit to the target dataset even with samples-to-labels mixed up. In this work, we propose SMILE — \underline{S}ample-to-feature \underline{M}ixup for Eff\underline{I}cient Transfer \underline{LE}arning. With mixed images as inputs, SMILE regularizes the outputs of CNN feature extractors to learn from the mixed feature vectors of the inputs, in addition to learning from the mixed labels. SMILE incorporates a mean teacher to provide the surrogate "ground truth" for the mixed feature vectors. Extensive experiments verify the performance improvement made by SMILE in comparison with a wide spectrum of transfer learning algorithms, including fine-tuning, L2-SP, DELTA, BSS, RIFLE, Co-Tuning and RegSL, even with mixup strategies combined.
Xingjian Li · Haoyi Xiong · Cheng-Zhong Xu · Dejing Dou
Contributed Spotlights (Oral)