Workshop
Distribution shifts: connecting methods and applications (DistShift)
Shiori Sagawa · Pang Wei Koh · Fanny Yang · Hongseok Namkoong · Jiashi Feng · Kate Saenko · Percy Liang · Sarah Bird · Sergey Levine

Mon Dec 13 09:00 AM -- 06:00 PM (PST)
Event URL: https://sites.google.com/view/distshift2021

Distribution shifts---where a model is deployed on a data distribution different from what it was trained on---pose significant robustness challenges in real-world ML applications. Such shifts are often unavoidable in the wild and have been shown to substantially degrade model performance in applications such as biomedicine, wildlife conservation, sustainable development, robotics, education, and criminal justice. For example, models can systematically fail when tested on patients from different hospitals or people from different demographics. Despite the ubiquity of distribution shifts in ML applications, work on these types of real-world shifts is currently underrepresented in the ML research community, with prior work generally focusing instead on synthetic shifts. However, recent work has shown that models that are robust to one kind of shift need not be robust to another, underscoring the importance and urgency of studying the types of distribution shifts that arise in real-world ML deployments. With this workshop, we aim to facilitate deeper exchanges between domain experts in various ML application areas and more methods-oriented researchers, and ground the development of methods for characterizing and mitigating distribution shifts in real-world application contexts.

Mon 9:00 a.m. - 9:10 a.m.
Opening remarks (Talk)
Shiori Sagawa
Mon 9:10 a.m. - 9:35 a.m.
Invited talk: Aleksander Mądry (Invited talk)
Aleksander Madry
Mon 9:35 a.m. - 10:00 a.m.
Invited talk: Suchi Saria (Invited talk)
Suchi Saria
Mon 10:00 a.m. - 10:25 a.m.
Invited talk: Ernest Mwebaze (Invited talk)
Ernest Mwebaze
Mon 10:30 a.m. - 11:00 a.m.
Discussion: Aleksander, Ernest, Suchi (Panel)
Mon 11:00 a.m. - 11:25 a.m.
Invited talk: Elizabeth Tipton (Invited talk)
Beth Tipton
Mon 11:25 a.m. - 11:50 a.m.
Invited talk: Jonas Peters (Invited talk)
Jonas Peters
Mon 11:50 a.m. - 12:10 p.m.
Discussion: Elizabeth, Jonas (Panel)
Mon 12:20 p.m. - 12:30 p.m.

Bayesian deep learning seeks to equip deep neural networks with the ability to precisely quantify their predictive uncertainty, and has promised to make deep learning more reliable for safety-critical real-world applications. Yet, existing Bayesian deep learning methods fall short of this promise; new methods continue to be evaluated on unrealistic test beds that do not reflect the complexities of the downstream real-world tasks that would benefit most from reliable uncertainty quantification. We propose a set of real-world tasks that accurately reflect such complexities and assess the reliability of predictive models in safety-critical scenarios. Specifically, we curate two publicly available datasets of high-resolution human retina images exhibiting varying degrees of diabetic retinopathy, a medical condition that can lead to blindness, and use them to design a suite of automated diagnosis tasks that require reliable predictive uncertainty quantification. We use these tasks to benchmark well-established and state-of-the-art Bayesian deep learning methods on task-specific evaluation metrics. We provide an easy-to-use codebase for fast and easy benchmarking following reproducibility and software design principles. We provide implementations of all methods included in the benchmark as well as results computed over 100 TPU days, 20 GPU days, 400 hyperparameter configurations, and evaluation on at least 6 random seeds each.

Neil Band, Tim G. J. Rudner, Qixuan Feng, Angelos Filos, Zachary Nado, Mike Dusenberry, Ghassen Jerfel, Dustin Tran, Yarin Gal
Mon 12:30 p.m. - 12:40 p.m.

Distribution shifts, in which the training distribution differs from the testing distribution, can significantly degrade the performance of Graph Neural Networks (GNNs). We curate GDS, a benchmark of eight datasets reflecting a diverse range of distribution shifts across graphs. We observe that: (1) most domain generalization algorithms fail to work when applied to domain shifts on graphs; and (2) combinations of powerful GNN models and augmentation techniques usually achieve the best out-of-distribution performance. These emphasize the need for domain generalization algorithms tailored for graphs and further graph augmentation techniques that enhance the robustness of predictors.

Mucong Ding, Kezhi Kong, Jiuhai Chen, John Kirchenbauer, Micah Goldblum, David P Wipf, Furong Huang, Tom Goldstein
Mon 12:40 p.m. - 12:50 p.m.

Multi-armed bandit algorithms minimize the experimentation costs required to converge on optimal behavior. They do so by rapidly adapting experimentation effort away from poorly performing actions as feedback is observed. But this desirable feature makes them sensitive to confounding. We highlight, for instance, that popular bandit algorithms cannot address the problem of identifying the best action when day-of-week effects may confound inferences. In response, this paper formulates a general model of contextual bandit experiments with nonstationary contexts, which act as confounders for inference and can also be viewed as distribution shifts in the earlier periods of the experiments. In addition, this general model allows the target distribution or population distribution that is used to determine the best action to differ from the empirical distribution over the contexts observed during the experiments. The paper proposes deconfounded Thompson sampling, which makes simple, but critical, modifications to the way Thompson sampling is usually applied. Theoretical guarantees suggest the algorithm strikes a delicate balance between adaptivity and robustness to confounding and distribution shifts. It attains asymptotic lower bounds on the number of samples required to confidently identify the best action --- suggesting optimal adaptivity --- but also satisfies strong performance guarantees in the presence of day-of-week effects and delayed observations --- suggesting unusual robustness.

Chao Qin, Daniel Russo
Mon 12:50 p.m. - 1:00 p.m.

Importance weighting is a classic technique to handle distribution shifts. However, prior work has presented strong empirical and theoretical evidence demonstrating that importance weights can have little to no effect on overparameterized neural networks. Is importance weighting truly incompatible with the training of overparameterized neural networks? Our paper answers this in the negative. We show that importance weighting fails not because of the overparameterization, but instead, as a result of using exponentially-tailed losses like the logistic or cross-entropy loss. As a remedy, we show that polynomially-tailed losses restore the effects of importance reweighting in correcting distribution shift in overparameterized models. We characterize the behavior of gradient descent on importance weighted polynomially-tailed losses with overparameterized linear models, and theoretically demonstrate the advantage of using polynomially-tailed losses in a label shift setting. Surprisingly, our theory shows that using weights that are obtained by exponentiating the classical unbiased importance weights can improve performance. Finally, we demonstrate the practical value of our analysis with neural network experiments on a subpopulation shift and a label shift dataset. Our polynomially-tailed loss consistently increases the test accuracy by 2-3%.

Ke Alexander Wang, Niladri Chatterji, Saminul Haque, Tatsunori Hashimoto
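
To make the contrast between loss tails concrete, here is a minimal sketch of an importance-weighted empirical risk with a polynomially-tailed margin loss. The function names, the linear left branch, and the default `alpha` and `beta` values are illustrative assumptions, not the authors' exact loss; the only property taken from the abstract is that the loss decays polynomially (like margin to the power `-alpha`) for large margins instead of exponentially.

```python
from typing import Sequence


def poly_tailed_loss(margin: float, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Illustrative polynomially-tailed margin loss (hypothetical form).

    For margin >= beta the loss decays like (margin - (beta - 1)) ** -alpha.
    For margin < beta a linear branch is used (an assumption made here) that
    matches the value 1 and the slope -alpha of the tail at margin == beta.
    """
    if margin < beta:
        return 1.0 - alpha * (margin - beta)
    return 1.0 / (margin - (beta - 1.0)) ** alpha


def importance_weighted_risk(
    margins: Sequence[float],
    weights: Sequence[float],
    alpha: float = 1.0,
    beta: float = 1.0,
) -> float:
    """Average of weight_i * loss(margin_i), where margin_i = y_i * f(x_i)."""
    total = sum(w * poly_tailed_loss(m, alpha, beta) for m, w in zip(margins, weights))
    return total / len(margins)


if __name__ == "__main__":
    # Two examples: a misclassified minority point (upweighted) and a
    # confidently correct majority point.
    print(importance_weighted_risk(margins=[0.0, 3.0], weights=[2.0, 1.0]))
```

For comparison, the logistic loss log(1 + exp(-z)) decays like exp(-z) for large margins z, which is the exponentially-tailed case the abstract contrasts against.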
Mon 3:50 p.m. - 4:15 p.m.
Invited talk: Chelsea Finn (Invited talk)
Chelsea Finn
Mon 4:15 p.m. - 4:40 p.m.
Invited talk: Masashi Sugiyama (Invited talk)
Masashi Sugiyama
Mon 4:40 p.m. - 5:00 p.m.
Discussion: Chelsea, Masashi (Panel)
Mon 5:00 p.m. - 6:00 p.m.
Panel: Andrew Beck, Jamie Morgenstern, Judy Hoffman, Tatsunori Hashimoto (Panel)
-

The modeling of what a neural network does not know -- i.e. uncertainty -- is fundamentally important both in terms of theory and practice. This is especially true as the model encounters distribution shift during inference. Bayesian inference has been regarded as the most principled method of uncertainty modeling because it explicitly models two types of uncertainty: epistemic uncertainty and aleatoric uncertainty, in the form of posteriors over parameters and data likelihood, respectively. Epistemic uncertainty captures the uncertainty of model parameters due to lack of data, while aleatoric uncertainty captures inherent data ambiguity. Practically, epistemic uncertainty is often assessed by a model's out-of-distribution (OOD) detection performance or calibration, while aleatoric uncertainty can be assessed by in-distribution error detection. Recent attempts to model uncertainty using deterministic models failed to disentangle these two uncertainties due to their non-Bayesian nature. However, it is still possible to capture them empirically in a deterministic model using a combination of density estimation and softmax-entropy. This leaves us with the question: how to approach OOD detection/calibration for deterministic (as opposed to Bayesian) and discriminative (as opposed to generative) models? This is arguably the most widely used class of models due to its speed (compared to Bayesian models) and simplicity (compared to generative models). It seems that the conventional association of OOD data with epistemic uncertainty fails for this type of model, specifically because it does not reason about what has changed in the input distribution and the mechanisms through which these changes affect neural networks. A different perspective is needed to analyze them.

Junjiao Tian, Yen-Chang Hsu, Yilin Shen, Hongxia Jin, Zsolt Kira
-

Recent work has unveiled how average generalization frequently relies on superficial patterns in the data. The consequence is brittle models with poor performance in the presence of domain shift in group distribution at test time. When the subgroups in the train data are known, we can use tools from robust optimization and regularization mechanisms to tackle the problem. However, group annotation and identification are daunting and time-consuming tasks, seldom performed on large datasets. A recent line of research (Liu et al., 2021) is trying to solve this problem with implicit group distribution at train time, leveraging self-supervision and oversampling to improve generalization on minority groups. Following such ideas, we propose a new class-conditional variant of mixup (Zhang et al., 2017) for worst-group generalization, augmenting the train distribution with a continuous distribution of groups. Our method, called Just Mix Once, is domain agnostic, computationally efficient and performs on par or better than state-of-the-art on worst-group generalization.

Giorgio Giannone, Serhii Havrylov, Jordan Massiah, Emine Yilmaz, Yunlong JIAO
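
As a rough illustration of what a class-conditional mixup step could look like, the sketch below pairs each example with a randomly chosen partner from the same class and takes a convex combination of their features. This is an assumption-laden sketch and not the Just Mix Once procedure itself: the function name, the Beta(alpha, alpha) mixing coefficient (borrowed from standard mixup), and the uniform partner sampling are all illustrative choices.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple


def class_conditional_mixup(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    alpha: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Tuple[List[List[float]], List[int]]:
    """Mix every example with a random example of the SAME class.

    Because both endpoints share a label, the label of the mixed example
    is left unchanged (hypothetical simplification).
    """
    rng = rng or random.Random(0)

    # Index examples by class so that partners can be drawn within a class.
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    mixed: List[List[float]] = []
    for x, y in zip(features, labels):
        partner = features[rng.choice(by_class[y])]
        lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
        mixed.append([lam * a + (1.0 - lam) * b for a, b in zip(x, partner)])
    return mixed, list(labels)


if __name__ == "__main__":
    xs = [[0.0], [1.0], [10.0], [12.0]]
    ys = [0, 0, 1, 1]
    print(class_conditional_mixup(xs, ys))
```

Mixed points stay inside the convex hull of their own class, so the augmentation fills in the space between same-class examples without blending classes.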
-

Bayesian Neural Networks are often sought after for their strong and trustworthy predictive power. However, inference in these models is often computationally expensive; the cost can be reduced through dimensionality reduction, where the key goal is to find an appropriate subspace in which to perform the inference while retaining significant predictive power. In this work, we propose a theoretical comparative study of Principal Component Analysis (PCA) versus random projection for Bayesian Linear Regression. We find that PCA is not always the optimal dimensionality reduction method and that random projection can actually be superior, especially in cases where the data distribution is shifted and the labels have a small norm. We then confirm these results experimentally. Therefore, this work suggests considering dimension reduction by random projection for Bayesian inference when noisy data are expected.

Alexandre Bense, Amir Joudaki, Tim G. J. Rudner, Vincent Fortuin
-

In real-world applications of machine learning, robust systems must consider measures of performance beyond standard test accuracy. These include out-of-distribution (OOD) robustness, prediction consistency, resilience to adversaries, calibrated uncertainty estimates, and the ability to detect anomalous inputs. However, optimizing for some of these measures often sacrifices performance on others. For instance, adversarial training only improves adversarial robustness and degrades classifier performance. Similarly, strong data augmentation and regularization techniques often improve OOD robustness at the cost of weaker anomaly detection, raising the question of whether a Pareto improvement is possible. We identify a weakness of existing data augmentation techniques---namely, while they inject additional entropy into the training set, the entropy does not contain substantial structural complexity. This leads us to design a new data augmentation strategy utilizing the natural structural complexity of fractals, which outperforms numerous baselines and is the first method to comprehensively improve safety measures.

Dan Hendrycks, Andy Zou, Mantas Mazeika, Leonard Tang, Dawn Song, Jacob Steinhardt
-

This work quantifies the extent to which accuracy degrades on review classification when state-of-the-art Transformer models are subjected to distribution shifts, and offers a solution to significantly decrease this degradation. We find differences in the extent of degradation depending on the independent variable across which the shift is created. Specifically, in our experiments time and sentiment shifts show up to 10% drops in accuracy, whereas shifts between industry and product sectors show 20-40% drops in accuracy. We provide ablation experiments with different Transformer architectures, such as BERT, T5 and Jurassic-I, and study their relationship with this degradation. The suggested solution reuses the base of the model trained on one distribution and fine-tunes only the final dense layer to support the new distribution that is seen once the model is deployed. This uses just 100-300 samples from the unseen distribution, compared to the previous 10,000, while cutting the accuracy drop in half.

Sehaj Chawla, Nikhil Singh, Iddo Drori
-
Continual learning in environments with shifting data distributions is a challenging problem with several real-world applications. In this paper we consider settings in which the data distribution (task) shifts abruptly and the timing of these shifts is not known. Furthermore, we consider a semi-supervised task-agnostic setting in which the learning algorithm has access to both task-segmented and unsegmented data for offline training. We propose a new approach for this problem setting: Mixture of Basis models (MoB). The core idea is to learn a small set of basis models and construct a dynamic, task-dependent mixture of the models to predict for the current task. We also propose a new methodology to detect observations that are out-of-distribution with respect to the existing basis models and to instantiate new models. We test our approach in multiple domains and show that it achieves better prediction error compared to existing methods in most cases, while using fewer models. Moreover, we analyze the latent task representations learned by MoB to show that similar tasks tend to cluster together in the latent space and that the latent representation shifts at the task boundaries when the tasks are dissimilar.
Mengda Xu, Sumitra Ganesh, Pranay Pasula
-

When finetuning a pretrained language model for natural language generation tasks, one is currently faced with a tradeoff. Lightweight finetuning (e.g., prefix-tuning, adapters), which freezes all or most of the parameters of the pretrained model, has been shown to achieve stronger out-of-distribution (OOD) performance than full finetuning, which tunes all of the parameters. However, lightweight finetuning can underperform full finetuning in-distribution (ID). In this work, we present methods to combine the benefits of full and lightweight finetuning, achieving strong performance both ID and OOD. First, we show that an ensemble of the lightweight and full finetuning models achieves the best of both worlds: performance matching the better of full and lightweight finetuning, both ID and OOD. Second, we show that we can achieve similar improvements using a single model instead of two with our proposed cocktail finetuning, which augments full finetuning via distillation from a lightweight model. Finally, we provide some explanatory theory in a multiclass logistic regression setting with a large number of classes, describing how distillation on ID data can transfer the OOD behavior of one model to another.

John Hewitt, Xiang Li, Sang Michael Xie, Benjamin Newman, Percy Liang
-

In this work we contribute a distribution shift benchmark for a computer vision task: monocular depth estimation. What sets it apart is the decomposition of the wider distribution shift of uncontrolled testing on in-the-wild data into three distinct distribution shifts. Specifically, we generate data via synthesis and analyze them to produce covariate (color input), prior (depth output) and concept (their relationship) distribution shifts. We also synthesize combinations and show how each one is indeed a different challenge to address, as stacking them produces increased performance drops and cannot be addressed horizontally using standard approaches.

Georgios Albanis, Nikolaos Zioulis, Petros Drakoulis, Dimitrios Zarpalas, Petros Daras
-
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions that may cause performance drops. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds that threshold. ATC outperforms previous methods across several model architectures, types of distribution shifts (e.g., due to synthetic corruptions, dataset reproduction, or novel subpopulations), and datasets (WILDS-FMoW, ImageNet, BREEDS, CIFAR, and MNIST). In our experiments, ATC estimates target performance 2-4× more accurately than prior methods. We also explore the theoretical foundations of the problem, proving that, in general, identifying the accuracy is just as hard as identifying the optimal predictor and thus, the efficacy of any method rests upon (perhaps unstated) assumptions on the nature of the shift. Finally, analyzing our method on some toy distributions, we provide insights concerning when it works.
Saurabh Garg, Sivaraman Balakrishnan, Zachary Lipton, Behnam Neyshabur, Hanie Sedghi
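
The thresholding idea described above can be sketched in a few lines. The abstract does not say how the threshold is learned, so this sketch assumes one natural rule: choose the threshold on labeled source data so that the fraction of source examples whose confidence exceeds it equals the observed source accuracy. The function names and the use of a single scalar confidence per example (for instance the maximum softmax probability) are likewise assumptions.

```python
from typing import Sequence


def fit_atc_threshold(source_conf: Sequence[float], source_correct: Sequence[bool]) -> float:
    """Choose t so that the share of source examples with confidence > t
    matches the source accuracy (assumed fitting rule)."""
    n = len(source_conf)
    n_correct = sum(bool(c) for c in source_correct)
    if n_correct == 0:
        return float("inf")   # no example should exceed the threshold
    if n_correct == n:
        return float("-inf")  # every example should exceed the threshold
    ranked = sorted(source_conf, reverse=True)
    # Midpoint between the n_correct-th and (n_correct + 1)-th highest
    # confidence; with tied confidences the match is only approximate.
    return 0.5 * (ranked[n_correct - 1] + ranked[n_correct])


def estimate_target_accuracy(target_conf: Sequence[float], threshold: float) -> float:
    """Predicted accuracy: fraction of unlabeled target examples above the threshold."""
    return sum(c > threshold for c in target_conf) / len(target_conf)


if __name__ == "__main__":
    t = fit_atc_threshold([0.9, 0.8, 0.6, 0.55], [True, True, False, True])
    print(t, estimate_target_accuracy([0.95, 0.7, 0.5, 0.4], t))
```

Note that the labels are used only on the source side; the target estimate needs nothing but model confidences on unlabeled target inputs.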
-

We share our experience with the recently released WILDS benchmark, a collection of ten datasets dedicated to developing models and training strategies that are robust to domain shifts. From a handful of experiments, we make several critical observations which we believe are of general interest for any future work on WILDS. Our study focuses on two datasets: iWildCam and FMoW. We show that (1) conducting separate cross-validation for each evaluation metric is crucial for both datasets; (2) a weak correlation between validation and test performance might make model development difficult for iWildCam; (3) minor changes in the training hyper-parameters improve the baseline by a relatively large margin (mainly on FMoW); and (4) there is a strong correlation between certain domains and certain target labels (mainly on iWildCam). To the best of our knowledge, no prior work on these datasets has reported these observations despite their obvious importance.

Kazuki Irie, Imanol Schlag, Róbert Csordás, Jürgen Schmidhuber
-

Despite having good average test accuracy, classification models can have poor performance on subpopulations that are not well represented in the training set. In this work we introduce a method to improve prediction accuracy on underrepresented groups that does not require any group labels for training or validation, unlike existing approaches. We provide a sound empirical investigation of our procedure and show that it recovers the worst-group performance of methods that use oracle group annotations.

Vincent Bardenhagen, Alexandru Tifrea, Fanny Yang
-

Deep neural networks have made it possible for reinforcement learning algorithms to learn from raw high-dimensional inputs. This progress has led to deep reinforcement learning algorithms being deployed in many different fields, from financial markets to biomedical applications. While the vulnerability of deep neural networks to imperceptible, specifically crafted perturbations has also been inherited by deep reinforcement learning agents, several adversarial training methods have been proposed to overcome this vulnerability. In this paper we focus on state-of-the-art adversarial training algorithms and investigate their robustness to semantically meaningful natural perturbations ranging from changes in brightness to rotation. We conduct several experiments in the OpenAI Atari environments, and find that state-of-the-art adversarially trained neural policies are more sensitive to natural perturbations than vanilla trained agents. We believe our investigation lays out intriguing properties of adversarial training and our observations can help build robust and generalizable neural policies.

Ezgi Korkmaz
-

Compositional Zero-Shot Learning (CZSL) aims to recognize compositions of objects and states in images, and generalize to the unseen compositions of objects and states. Recent works tackled this problem effectively by using side information (e.g., word embeddings) together with either consistency constraints or specific network designs modeling the relationships between objects, states, compositions, and visual features. In this work, we take a step back, and we revisit the simplest baseline for this task, i.e., Visual Product (VisProd). VisProd considers CZSL as a multi-task problem, predicting objects and states separately. Despite its appealing simplicity, this baseline showed low performance in early CZSL studies. Here we identify the two main reasons behind such unimpressive initial results: network capacity and bias on the seen classes. We show that simple modifications to the object and state predictors allow the model to achieve either comparable or superior results w.r.t. the recent state of the art in both the open-world and closed-world CZSL settings on three different benchmarks.

Shyamgopal Karthik, Massimiliano Mancini, Zeynep Akata
-

This paper proposes a fast and scalable method for uncertainty quantification of machine learning models' predictions. First, we show a principled way to measure the uncertainty of predictions for a classifier based on the Nadaraya-Watson nonparametric estimate of the conditional label distribution. Importantly, the approach allows one to explicitly disentangle aleatoric and epistemic uncertainties. The resulting method works directly in the feature space. However, one can apply it to any neural network by considering an embedding of the data induced by the network. We demonstrate the strong performance of the method in uncertainty estimation tasks on a variety of real-world image datasets, such as MNIST, SVHN, CIFAR-100 and several versions of ImageNet.

Nikita Kotelevskii, Alexander Fishkov, Kirill Fedyanin, Aleksandr Petiushko, Maxim Panov
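
For readers unfamiliar with the Nadaraya-Watson estimator mentioned above, the following sketch computes a kernel-weighted estimate of the conditional label distribution at a query point. The Gaussian kernel, the `bandwidth` parameter, and the function name are assumptions made for illustration; the abstract does not specify the kernel or how the two uncertainty types are derived from the estimate. The total kernel mass is returned alongside the probabilities only as a rough indicator of how much training data lies near the query.

```python
import math
from typing import List, Sequence, Tuple


def nw_label_distribution(
    x: Sequence[float],
    train_x: Sequence[Sequence[float]],
    train_y: Sequence[int],
    num_classes: int,
    bandwidth: float = 1.0,
) -> Tuple[List[float], float]:
    """Nadaraya-Watson estimate of p(y | x) with a Gaussian kernel.

    Returns (class probabilities, total kernel mass at x).
    """
    weights = []
    for xi in train_x:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
        weights.append(math.exp(-sq_dist / (2.0 * bandwidth ** 2)))

    total = sum(weights)
    if total == 0.0:
        # Query is numerically far from all training data: fall back to uniform.
        return [1.0 / num_classes] * num_classes, 0.0

    probs = [0.0] * num_classes
    for w, y in zip(weights, train_y):
        probs[y] += w
    return [p / total for p in probs], total


if __name__ == "__main__":
    train_x = [(0.0,), (2.0,)]
    train_y = [0, 1]
    print(nw_label_distribution((1.0,), train_x, train_y, num_classes=2))
    print(nw_label_distribution((0.0,), train_x, train_y, num_classes=2))
```

To apply the same estimator to a neural network, `x` and `train_x` would be embeddings produced by the network instead of raw inputs, as the abstract suggests.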
-

Learning behavioral patterns from observational data has been a de facto approach to motion forecasting. Yet, the current paradigm suffers from two fundamental shortcomings: brittleness under covariate shift and inefficiency in knowledge transfer. In this work, we propose to address these challenges from a causal representation perspective. We first introduce a causal formalism of motion forecasting, which casts the problem as a dynamic process with physical mechanisms, style confounders, and spurious correlations. We then propose two components that explicitly promote the robustness and reusability of the learned motion representations: (i) unlike the common practice of merging datasets collected from different locations, we exploit their subtle distinctions by means of an invariance loss function, which encourages the model to suppress spurious correlations and capture physical mechanisms; (ii) we devise a modular architecture that factorizes the representations of physical laws and motion styles in a structured way, and progressively prunes their dense connections during training to approximate a sparse causal graph. We empirically validate the strength of the proposed method for robust generalization in controlled real-world experiments. We finally discuss the challenges and opportunities in the presence of style shifts through synthetic simulations.

Yuejiang Liu, Alexandre Alahi
-

Batch normalization is a common component in computer vision models, including ones typically used for few-shot learning. Batch normalization applied in convolutional networks consists of a normalization step, followed by the application of per-channel trainable affine parameters which shift and scale the normalized features. The use of these affine parameters can speed up model convergence on a source task. However, we demonstrate in this work that, on common few-shot learning benchmarks, training a model on a source task using these affine parameters is detrimental to downstream transfer performance. We study this effect for several methods on well-known benchmarks such as the cross-domain few-shot learning (CD-FSL) benchmark and few-shot image classification on miniImageNet. We find consistent performance gains, particularly in settings with more distant transfer tasks. Improvements from applying this low-cost and easy-to-implement modification are shown to rival gains obtained by more sophisticated and costly methods.

Moslem Yazdanpanah, Christian Desrosiers, Mohammad Havaei, Eugene Belilovsky, Samira Ebrahimi Kahou
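
The two steps of batch normalization described above (normalization, then a trainable per-channel scale and shift) can be separated explicitly. The sketch below implements only the normalization step, i.e. batch normalization without affine parameters, on a simplified 2-D input of shape (batch, features); convolutional layers would additionally pool statistics over the spatial dimensions. The function name and the plain-list representation are illustrative assumptions.

```python
import math
from typing import List, Sequence


def batch_norm_no_affine(batch: Sequence[Sequence[float]], eps: float = 1e-5) -> List[List[float]]:
    """Normalize each feature over the batch to zero mean and unit variance.

    No trainable scale (gamma) or shift (beta) is applied afterwards.
    """
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    # Biased (population) variance, as is usual for batch statistics.
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dims)]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
        for row in batch
    ]


if __name__ == "__main__":
    print(batch_norm_no_affine([[1.0, 10.0], [3.0, 30.0]]))
```

In a deep learning framework this usually amounts to a constructor flag; in PyTorch, for instance, `torch.nn.BatchNorm2d(num_features, affine=False)` creates a layer without the learnable scale and shift.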
-

We often see undesirable tradeoffs in robust machine learning where out-of-distribution (OOD) accuracy is at odds with in-distribution (ID) accuracy. A ‘robust’ classifier obtained via specialized techniques like removing spurious features has better OOD but worse ID accuracy compared to a ‘standard’ classifier trained via vanilla ERM. On six distribution shift datasets, we find that simply ensembling the standard and robust models is a strong baseline---we match the ID accuracy of a standard model with only a small drop in OOD accuracy compared to the robust model. However, calibrating these models in-domain surprisingly improves the OOD accuracy of the ensemble and completely eliminates the tradeoff: we achieve the best of both ID and OOD accuracy over the original models.

Ananya Kumar, Aditi Raghunathan, Tengyu Ma, Percy Liang
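
One way to picture "calibrate in-domain, then ensemble" is the sketch below. It makes several assumptions that are not stated in the abstract: calibration is done by temperature scaling with a simple grid search on held-out in-distribution data, the ensemble averages the two calibrated probability vectors, and all function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    scaled = [z / temperature for z in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows: Sequence[Sequence[float]], labels: Sequence[int], temperature: float) -> float:
    """Mean negative log-likelihood of the labels at a given temperature."""
    losses = [
        -math.log(max(softmax(row, temperature)[y], 1e-12))
        for row, y in zip(logit_rows, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(
    logit_rows: Sequence[Sequence[float]],
    labels: Sequence[int],
    grid: Optional[Sequence[float]] = None,
) -> float:
    """Pick the temperature that minimizes NLL on in-distribution validation data."""
    grid = grid or [0.25 * k for k in range(1, 41)]  # 0.25, 0.50, ..., 10.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


def ensemble_predict(
    logits_a: Sequence[float], logits_b: Sequence[float], temp_a: float, temp_b: float
) -> int:
    """Average the calibrated probabilities of two models and return the argmax."""
    probs_a = softmax(logits_a, temp_a)
    probs_b = softmax(logits_b, temp_b)
    avg = [(p + q) / 2.0 for p, q in zip(probs_a, probs_b)]
    return max(range(len(avg)), key=avg.__getitem__)


if __name__ == "__main__":
    # An overconfident model: one of four confident predictions is wrong.
    val_logits = [[5.0, 0.0], [5.0, 0.0], [0.0, 5.0], [5.0, 0.0]]
    val_labels = [0, 1, 1, 0]
    print(fit_temperature(val_logits, val_labels))
    print(ensemble_predict([2.0, 0.0], [0.0, 1.0], temp_a=1.0, temp_b=1.0))
    print(ensemble_predict([2.0, 0.0], [0.0, 1.0], temp_a=10.0, temp_b=1.0))
```

The last two calls show why calibration can matter for an ensemble: softening an overconfident member changes which model dominates the averaged prediction.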
-
We propose Correct-N-Contrast (CNC), a contrastive learning method to improve robustness to spurious correlations when training group labels are unknown. Our motivating observation is that worst-group performance is related to a representation alignment loss, which measures the distance in feature space between different groups within each class. We prove that the gap between worst-group and average loss for each class is upper bounded by this alignment loss for that class. Thus, CNC aims to improve representation alignment via contrastive learning. First, CNC uses an ERM model to infer the group information. Second, with a careful sampling scheme, CNC trains a contrastive model to encourage similar representations for groups in the same class. We show that CNC significantly improves worst-group accuracy over existing state-of-the-art methods on popular benchmarks, e.g., achieving 7.7% absolute lift in worst-group accuracy on the CelebA dataset, and performs almost as well as methods trained with group labels. CNC also learns better-aligned representations between different groups in each class, reducing the alignment loss substantially compared to prior methods.
Michael Zhang, Nimit Sohoni, Hongyang Zhang, Chelsea Finn, Christopher Ré
-

Test-time adaptation (TTA) aims to achieve high accuracy on out-of-distribution (OOD) target data with only model parameters trained on the source domain and the target data. Standard TTA assumes that the test data come from a single distribution, or that the distribution changes gradually as test data stream in. However, this assumption does not hold in many scenarios. For instance, when inference is performed on the cloud, the test data can come from totally different users. In this paper, we try to tackle Domain-agnostic Test-time Adaptation (DaTTA), a new problem setting where the test data distribution is unknown and varies abruptly. To address DaTTA, we propose a framework to perform prototypical training with auxiliary data (PAD). Specifically, we fine-tune the model on augmented test images with a consistency loss and further regulate the training process with auxiliary data. We curate a dataset for DaTTA, and the proposed PAD outperforms previous best methods by large margins on both DaTTA and standard TTA.

Qilong Wu, Xiangyu Yue, Alberto Sangiovanni-Vincentelli
-
[ OpenReview  link »

Partial domain adaptation which assumes that the unknown target label space is a subset of the source label space has attracted much attention in computer vision. Despite recent progress, existing methods often suffer from three key problems: negative transfer, lack of discriminability, and domain invariance in the latent space. To alleviate the above issues, we develop a novel 'Select, Label, and Mix' (SLM) framework that aims to learn discriminative invariant feature representations for partial domain adaptation. First, we present an efficient "select" module that automatically filters out the outlier source samples to avoid negative transfer while aligning distributions across both domains. Second, the "label" module iteratively trains the classifier using both the labeled source domain data and the generated pseudo-labels for the target domain to enhance the discriminability of the latent space. Finally, the "mix" module utilizes domain mixup jointly with the other two modules to explore more intrinsic structures across domains leading to a domain-invariant latent space for partial domain adaptation. Experiments on two datasets demonstrate the superiority of our framework over state-of-the-art methods.

Aadarsh Sahoo, Rameswar Panda, Rogerio Feris, Kate Saenko, Abir Das
-
[ OpenReview  link »

The method of Common Spatial Patterns (CSP) is widely used for feature extraction of electroencephalography (EEG) data such as in motor imagery Brain-computer Interface (BCI) systems. It is a data-driven method estimating a set of spatial filters so that the power of the filtered EEG data is maximally separated between imagery classes. This method, however, is prone to overfitting and is known to suffer from poor generalization especially with limited calibration data. On the other hand, due to the high heterogeneity in brain data and the non-stationarity of brain activity, CSP is usually trained for each user separately resulting in long calibration sessions or frequent re-calibrations that are tiring for the user. In this work, we propose a novel algorithm called Spectrally Adaptive Common Spatial Patterns (SACSP) that improves CSP by learning a temporal/spectral filter for each spatial filter so that the spatial filters are concentrated on the most relevant temporal frequencies. We show the efficacy of SACSP in motor imagery BCI in providing better generalizability and higher classification accuracy from calibration to online control compared to existing methods while providing neurophysiologically relevant information about the temporal frequencies of the filtered signals.

Mahta Mousavi, Eric Lybrand, Shuangquan Feng, Shuai Tang, Rayan Saab, Virginia de Sa
-
[ OpenReview  link »

Pre-training on massive unlabeled datasets greatly improves accuracy under distribution shifts. As a first step toward understanding this, we study a popular pre-training method, contrastive learning, in the unsupervised domain adaptation (UDA) setting where we only have labeled data from a source domain and unlabeled data from a target domain. We begin by showing on 4 benchmark datasets that out-of-the-box contrastive pre-training (even without large-scale unlabeled data) is competitive with other UDA methods. Intuitions from classical UDA methods such as domain adversarial training focus on bringing the domains together in feature space to improve generalization from source to target. Surprisingly, we find that contrastive pre-training learns features that are very far apart between the source and target domains. How then does contrastive learning improve robustness to distribution shift? We develop a conceptual model for contrastive learning under domain shifts, where data augmentations form connections between classes and domains that can be far apart. We propose a new measure of connectivity ---the relative connection strengths between same and different classes across domains---that governs the success of contrastive pre-training for domain adaptation in a simple example and strongly correlates with our results on benchmark datasets.

Kendrick Shen, Robbie Jones, Ananya Kumar, Sang Michael Xie, Percy Liang
-
[ OpenReview  link »

The recent success of machine learning methods in the industrial sector opens new perspectives for the design of innovative products. However, these promising results are often challenged when it comes to industrial model deployment. Indeed, model performance frequently degrades on application data because of the distribution shift between the training and target data. This issue is even more critical for models dedicated to the search for innovative designs, as such models are mainly used on unseen regions of the design space. In this work, we present, on a real application of tire design, how distribution shifts impact the model performance and what can be expected from several domain adaptation methods. With industrial model deployment in mind, we conduct this benchmark using unsupervised evaluation metrics that considerably help model selection.

Antoine De mathelin, François Deheeger, Mathilde MOUGEOT, Nicolas Vayatis
-
[ OpenReview  link »

Spurious correlations allow deep models to predict well during training but poorly on related test populations. Recent work has shown that models that satisfy particular independencies involving the correlation-inducing nuisance variable have guarantees on their test performance. However, enforcing independencies requires nuisances to be observed during training. But nuisances such as demographics or image background labels are often missing. Enforcing independence on just the observed data does not imply independence on the entire population. In this work, we derive the missing-mmd estimator used for invariance objectives under missing nuisances. On simulations, missing-mmds enable improvements in test performance similar to those achieved by using fully-observed data.

Mark Goldstein, Adriel Saporta, Aahlad Puli, Rajesh Ranganath, Andrew Miller
-
[ OpenReview  link »

In this paper, we consider the challenging problem of multi-source zero-shot domain generalization (MDG), where labeled training data from multiple source domains are available but with no access to data from the target domain. Many methods have been proposed to address this problem, but surprisingly the naive solution of pooling all source data together and training a single ERM model is highly competitive. Constructing an ensemble of deep classifiers is a popular approach for building models that are calibrated under challenging distribution shifts. Hence, we propose MulDEns (Multi-Domain Deep Ensembles), a new approach for constructing deep ensembles in multi-domain problems that does not require constructing domain-specific models. Our empirical studies on multiple standard benchmarks show that MulDEns significantly outperforms ERM and existing ensembling solutions for MDG.

Kowshik Thopalli, Sameeksha Katoch, Jayaraman Thiagarajan, Pavan Turaga, Andreas Spanias
-
[ OpenReview  link »

During the deployment of machine learning models, performance degradation can occur compared to the training and validation data. This generalization gap can appear for a variety of reasons and be particularly critical in applications where certain groups of people are disadvantaged by the outcome, e.g., facial analysis. The literature provides a large number of methods to either perform robust classification under distribution shifts or at least to express the uncertainty caused by the shifts. However, there is still a need for data that exhibit different natural distribution shifts considering specific subgroups to test these methods. We use a balanced dataset for facial analysis and introduce subpopulation shifts, spurious correlations, and subpopulation-specific label noise. This forms our basis to investigate to what extent known approaches for calibrating neural networks remain reliable under these specified shifts. Each of the modifications leads to performance degradation, but the combination of ensembles and temperature scaling is particularly useful to stabilize the calibration over the shifts.
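The combination named at the end of this abstract, ensembling plus temperature scaling, can be sketched in a few lines, together with a common calibration metric. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the temperature is assumed to have been fitted elsewhere on held-out data, and expected calibration error (ECE) with equal-width bins is assumed as the metric.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature > 1 softens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_probs(member_logits, temperature=1.0):
    """Average the temperature-scaled softmax outputs of several models."""
    probs = [softmax(logits, temperature) for logits in member_logits]
    n = len(probs)
    return [sum(p[k] for p in probs) / n for k in range(len(probs[0]))]


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(ok for _, ok in b) / len(b)
            ece += len(b) / n * abs(avg_conf - acc)
    return ece
```

Computing `expected_calibration_error` separately for each subgroup, before and after a shift, is one way to check whether calibration stays stable in the sense the abstract describes.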

Jessica Deuschel, Andreas Foltyn
-
[ OpenReview  link »

Traditional AI approaches in customized (personalized) contextual pricing applications assume that the data distribution at the time of online pricing is similar to that observed during training. However, this assumption may be violated in practice because of the dynamic nature of customer buying patterns, particularly due to unanticipated system shocks such as COVID-19. We study the changes in customer behavior for a major airline during the COVID-19 pandemic by framing it as a covariate shift detection problem. We identify which customers changed their travel and purchase behavior and the attributes affecting that change using (i) Fast Generalized Subset Scanning and (ii) Causal Forests. In our experiments with simulated and real-world data, we present how these two techniques can be used to detect covariate shifts through qualitative analysis.

Abhinav Garg, Naman Shukla, Lavanya Marla, Sriram Somanchi
-
[ OpenReview  link »

Machine learning systems deployed in the wild are often trained on a source distribution that differs from the target distribution on which they are deployed. Unlabeled data can be a powerful source of leverage for mitigating these distribution shifts, as it is frequently much more available than labeled data. However, existing distribution shift benchmarks for unlabeled data do not reflect many scenarios that arise naturally in real-world applications. In this work, we introduce U-WILDS, which augments the WILDS benchmark of in-the-wild distribution shifts with curated unlabeled data that would be realistically obtainable in deployment. U-WILDS contains 8 datasets spanning a wide range of applications (from histology to wildlife conservation), tasks (classification, regression, and detection), and modalities (photos, satellite images, microscope slides, text, molecular graphs). We systematically benchmark contemporary methods that leverage unlabeled data, including domain-invariant, self-training, and self-supervised methods, and show that their success on the shifts in U-WILDS is limited. To facilitate the development of methods that can work reliably on real-world distribution shifts, we provide an open-source package containing all of the relevant data loaders, model architectures, and methods.

Shiori Sagawa, Pang Wei Koh, Tony Lee, Irena Gao, Sang Michael Xie, Kendrick Shen, Ananya Kumar, Weihua Hu, Michihiro Yasunaga, Henrik Marklund, Sara Beery, Ian Stavness, Jure Leskovec, Kate Saenko, Tatsunori Hashimoto, Sergey Levine, Chelsea Finn, Percy Liang
-
[ OpenReview  link »

Recent work studies the supervised online continual learning setting where a learner receives a stream of data whose class distribution changes over time. Distinct from other continual learning settings the learner is presented new samples only once and must distinguish between all seen classes. A number of successful methods in this setting focus on storing and replaying a subset of samples alongside incoming data in a computationally efficient manner. One recent proposal ER-AML achieved strong performance in this setting by applying an asymmetric loss based on contrastive learning to the incoming data and replayed data. However, a key ingredient of the proposed method is avoiding contrasts between incoming data and stored data, which makes it impractical for the setting where only one new class is introduced in each phase of the stream. In this work we adapt a recently proposed approach (BYOL) from self-supervised learning to the supervised learning setting, unlocking the constraint on contrasts. We then show that supplementing this with additional regularization on class prototypes yields a new method that achieves strong performance in the one-class incremental learning setting and is competitive with the top performing methods in the multi-class incremental setting.

Nader Asadi, Sudhir Mudur, Eugene Belilovsky
-
[ OpenReview  link »

Distribution shifts in the wild jeopardize the performance of machine learning models as they tend to pick up spurious correlations during training. Recent work (Nagarajan et al., 2020) has characterized two specific failure modes of out-of-distribution (OOD) generalization, and we extend this theoretical framework by interpreting existing algorithms as solutions to these failure modes. We then evaluate them on different image classification datasets, and in the process surface two issues that are central to existing robustness techniques. For those that rely on group annotations, we show how the group information in standard benchmark datasets is unable to fully capture the spurious correlations present. For those that don't require group annotations, the validation set utilized for model selection still carries assumptions that are not realistic in real-world settings, and we show how this choice of shifts in validation set could impact performance of different OOD algorithms.

Thao Nguyen, Hanie Sedghi, Behnam Neyshabur
-
[ OpenReview  link »

This paper argues that continual learning methods can benefit by splitting the capacity of the learner across multiple models. We use statistical learning theory and a thorough experimental analysis to show how multiple tasks can interact with each other in a highly non-trivial fashion when trained on a single model. The generalization error on a particular task can improve when it is trained with synergistic tasks, but can just as easily deteriorate when trained with competing tasks. This phenomenon motivates our method named Model Zoo which, inspired from the boosting literature, grows an ensemble of small models, each of which is trained during one episode of continual learning. We demonstrate dramatically large gains in accuracy on a wide variety of continual learning benchmarks.

Rahul Ramesh, Pratik Chaudhari
-
[ OpenReview  link »

Transfer learning has become a standard technique in computer vision and natural language processing thanks to the fact that it often substantially improves performance on downstream tasks. Recent work by Hendrycks et al. demonstrated that using a pre-trained model can also significantly improve a model's calibration, i.e. how well the model's confidence estimates correspond to the probability of its prediction being correct. In this paper, we provide some nuance to the claim that pre-training improves calibration by demonstrating that this beneficial effect diminishes when there is a domain shift between the pre-training and fine-tuning tasks.

Jay Mohta, Colin Raffel
-
[ OpenReview  link »

Continual Learning methods typically focus on tackling the phenomenon of catastrophic forgetting in the context of neural networks. Catastrophic forgetting is associated with an abrupt loss of knowledge previously learned by a model. In supervised learning problems this forgetting is typically measured or observed by evaluating decrease in task performance. However, a model’s representations can change without losing knowledge. In this work we consider the concept of representation forgetting, which relies on using the difference in performance of an optimal linear classifier before and after a new task is introduced. Using this tool we revisit a number of standard continual learning benchmarks and observe that through this lens, model representations trained without any special control for forgetting often experience minimal representation forgetting. Furthermore we find that many approaches to continual learning that aim to resolve the catastrophic forgetting problem do not improve the representation forgetting upon the usefulness of the representation.

MohammadReza Davari, Eugene Belilovsky
-
[ OpenReview  link »

Training a predictive model with empirical risk minimization requires a distribution of the input training data that matches the testing data. Covariate shift can occur when the testing cases are not class-balanced, but the training is. In order to detect when class imbalance is present in a test sample (without labels), we propose to use statistical divergence based on the Wasserstein distance and optimal transport. Recently, slicing techniques have been proposed that provide computational and statistical advantages in high-dimensional spaces. In this work we present a computationally simple approach to perform generalized slicing via kernel-based Wasserstein distance and apply it as a two-sample test. The proposed landmark-based slicing chooses a single point in the samples to be the sole support vector to represent the witness function. We run pseudo-real experiments using the MNIST dataset and compare our method with maximum mean discrepancy (MMD). We show that our proposed methods perform better than MMD on these synthetic simulations of covariate shift.
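The general idea of a sliced Wasserstein two-sample test can be illustrated on samples that have already been projected to one dimension. The sketch below is deliberately simpler than the landmark-based kernel slicing proposed in the abstract: it uses the plain Wasserstein-1 distance between sorted scalars and a permutation test for the p-value. The function names and the equal-sample-size restriction are assumptions made for brevity.

```python
import random


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1D samples.

    In one dimension the optimal transport plan matches sorted values,
    so the distance is the mean absolute difference of order statistics.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def permutation_p_value(xs, ys, n_permutations=500, seed=0):
    """Permutation test: how often does a random relabeling look as extreme?"""
    rng = random.Random(seed)
    observed = wasserstein_1d(xs, ys)
    pooled = list(xs) + list(ys)
    n = len(xs)
    at_least = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if wasserstein_1d(pooled[:n], pooled[n:]) >= observed:
            at_least += 1
    # Add-one correction keeps the p-value strictly positive.
    return (at_least + 1) / (n_permutations + 1)
```

A small p-value suggests the two projected samples come from different distributions, for example a class-balanced training set versus an imbalanced test set.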

Yuksel Karahan, Bilal Riaz, Austin Brockmeier
-
[ OpenReview  link »
Multi-class classification is one of the most common tasks in machine learning applications, where data is labeled by one of many class labels. Many loss functions have been proposed for multi-class classification including two well-known ones, namely the cross-entropy (CE) loss and the Crammer-Singer (CS) loss (a.k.a. the SVM loss). While CS loss has been used widely for traditional machine learning tasks for structured data, CE loss is usually a better choice (the default choice) for multi-class deep learning tasks. There are also top-$k$ variants of CS loss and CE loss that are proposed to promote the learning of a classifier for achieving better top-$k$ accuracy. Nevertheless, the relationship between these different losses remains unclear, which hinders our understanding of their expectations in different scenarios. In this paper, we present a unified view of the CS/CE losses and their smoothed top-$k$ variants by proposing a new family of loss functions, which are arguably better than the CS/CE losses when the given label information is incomplete and noisy. The new family of smooth loss functions, named label-distributionally robust (LDR) loss, is defined by leveraging the distributionally robust optimization (DRO) framework to model the uncertainty in the given label information, where the uncertainty over true class labels is captured by using distributional weights for each label regularized by a function. We have two observations: (i) the CS and the CE loss are just two special cases of the LDR loss by choosing two particular values for the involved regularization parameter; hence the new LDR loss provides an interpolation between the CS loss and the CE loss, and also induces new variants; (ii) the smoothed top-$k$ losses are also special cases of the LDR loss by regularizing the involved uncertainty variables into a bounded ball.
Theoretically, we establish the top-$N$ consistency (for any $N\geq 1$) of the proposed LDR loss, which is not only consistent with existing consistency results for the CS and the CE loss but also addresses some open problems regarding the consistency of top-$k$ SVM losses. However, in many real-world applications (e.g., natural image classification), data is often inherently multi-label, which renders the given information incomplete and noisy. Hence, overfitting to the given annotations by deep neural networks with high capacity could harm the generalization performance. To tackle this issue, this paper proposes a novel label-distributionally robust method (named LDR), where the uncertainty over true class labels is captured by a regularized distributionally robust optimization framework. Interestingly, this LDR loss family includes many existing loss functions as special/extreme cases, e.g., cross-entropy (CE) loss, Crammer-Singer (CS) loss, but can avoid the defects of CS loss and enjoy more flexibility than CE loss by varying the regularization strength on the distributional weight (DW) variables. Furthermore, we propose a variant of LDR that specializes in top-$k$ classification named LDR-$k$, for which we develop a novel efficient analytical solution. Of independent interest, we prove that both the LDR and LDR-$k$ loss families are calibrated and hence Fisher consistent for a broad family of DW regularization functions. Empirically, we provide some experimental results on synthetic data and real-world benchmark data to validate the effectiveness of the new variants of LDR loss.
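Observation (i) can be checked numerically under one concrete choice of regularizer. The sketch below assumes a KL divergence to the uniform distribution, which the abstract does not specify (it says only "regularized by a function"). Under that assumption, maximizing sum_k p_k (s_k - s_y + c_k) - lam * KL(p || uniform) over the simplex has the closed form lam * log((1/K) * sum_k exp((s_k - s_y + c_k) / lam)). The function name `ldr_kl_loss` and the `margin` parameter are hypothetical.

```python
import math


def ldr_kl_loss(scores, label, lam, margin=0.0):
    """DRO loss over label weights with an ASSUMED KL-to-uniform regularizer.

    z_k = s_k - s_y + c_k, with c_y = 0 and c_k = margin for k != y.
    """
    k = len(scores)
    z = [s - scores[label] + (0.0 if j == label else margin)
         for j, s in enumerate(scores)]
    if lam == 0.0:
        return max(z)  # no regularization: all weight on the worst label
    m = max(z)  # factor out the max so the exponentials cannot overflow
    return lam * (m / lam + math.log(sum(math.exp((v - m) / lam) for v in z) / k))


def cross_entropy(scores, label):
    """Standard softmax cross-entropy loss."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores)) - scores[label]


def crammer_singer(scores, label, margin=1.0):
    """Crammer-Singer (multi-class SVM) hinge loss."""
    worst = max(s - scores[label] + margin
                for j, s in enumerate(scores) if j != label)
    return max(0.0, worst)
```

With `lam=1` and `margin=0` the sketch equals the CE loss minus the constant log K, and with `lam=0` and `margin=1` it equals the CS loss; intermediate values of `lam` interpolate between the two.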
Dixian Zhu, Tianbao Yang
-
[ OpenReview  link »

Domain generalization (DG) methods aim to develop models that generalize to settings where the test distribution is different from the training data. In this paper, we focus on the challenging problem of multi-source zero-shot DG, where labeled training data from multiple source domains is available but with no access to data from the target domain. Though this problem has become an important topic of research, surprisingly, the naive solution of pooling all source data together and training a single classifier is highly competitive on standard benchmarks. More importantly, even sophisticated approaches that explicitly optimize for invariance across different domains do not necessarily provide non-trivial gains over ERM. We hypothesize that this behavior arises due to the poor definitions of the domain splits themselves. In this paper, we make a first attempt to understand the role pre-defined domain labels play in the success of domain-aware DG methods. To this end, we ignore the domain labels that come with the dataset and instead alternate between performing unsupervised clustering to infer domain splits and training the DG method with these inferred domain labels. We also introduce a novel regularization to improve the behavior of this alternating optimization process. We conduct analysis on two standard benchmarks, PACS and VLCS, and demonstrate the benefit of re-categorizing samples into new domain groups on DG performance.

Kowshik Thopalli, Pavan Turaga, Jayaraman Thiagarajan
-
[ OpenReview  link »

Computer vision (CV) approaches applied to digital pathology have informed biological discovery and clinical decision-making. However, batch effects in images represent a major challenge to effective analysis. A CV model trained using Empirical Risk Minimization (ERM) risks learning batch effects when they align with the labels and serve as spurious correlates. The standard methods to circumvent learning such confounders include (i) application of image augmentation techniques and (ii) examination of the learning process by evaluating through external validation (e.g., unseen data coming from a comparable dataset collected at another hospital). The latter approach is data-hungry and the former risks occluding biological signal. Here, we suggest two solutions from the Distributionally Robust Optimization (DRO) families. Our contributions are (i) a DRO algorithm using abstention, which is a slight variation over existing abstention-based DRO algorithms, and (ii) a group-DRO method where groups are defined as hospitals from which data are collected. We find that the model trained using abstention-based DRO outperforms a model trained using ERM by 9.9% F1 in identifying tumor vs. normal tissue in lung adenocarcinoma (LUAD) at the expense of coverage. Further, by examining the areas abstained by the model with a pathologist, we find that the model trained using a DRO method is more robust to heterogeneity and artifacts in the tissue. Together, we propose selecting models that are more robust to spurious features for translational discovery and clinical decision support.
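The two ingredients mentioned here, a worst-group objective with hospitals as groups and the accuracy-versus-coverage trade-off of abstention, can be sketched as follows. This is not the authors' algorithm: it only evaluates the group-DRO objective (practical group DRO also updates group weights during training), and abstaining below a confidence threshold is an assumption, since the abstract does not describe the abstention mechanism. All function names are hypothetical.

```python
def group_losses(losses, groups):
    """Mean loss per group (here a group would be a hospital)."""
    sums, counts = {}, {}
    for loss, g in zip(losses, groups):
        sums[g] = sums.get(g, 0.0) + loss
        counts[g] = counts.get(g, 0) + 1
    return {g: sums[g] / counts[g] for g in sums}


def worst_group_loss(losses, groups):
    """Group-DRO objective: the largest per-group mean loss."""
    return max(group_losses(losses, groups).values())


def selective_accuracy(confidences, correct, threshold):
    """Accuracy on retained predictions and the fraction retained (coverage).

    Assumption: the model abstains whenever confidence < threshold.
    """
    kept = [ok for c, ok in zip(confidences, correct) if c >= threshold]
    coverage = len(kept) / len(confidences)
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return accuracy, coverage
```

ERM minimizes the mean of all losses, whereas the worst-group objective is driven entirely by the hardest hospital; raising `threshold` typically trades coverage for accuracy, which mirrors the "at the expense of coverage" caveat in the abstract.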

Surya Narayanan Hari
-
[ OpenReview  link »

Testing within the machine learning (ML) community has centered around assessing a learned model's predictive performance measured against a test dataset. This test dataset is often drawn from the same distribution as the dataset used to train the model, and hence is expected to follow the same distribution as the training dataset. While recent work on robustness testing within the ML community has pointed to the importance of testing against distributional shifts, these efforts also focus on estimating the likelihood of the model making an error against a reference dataset/distribution. In this paper, we argue that this view of testing actively discourages researchers and developers from looking into many other sources of robustness failures, for instance corner cases which may have severe impacts. We draw parallels with decades of work within software engineering testing focused on assessing a software system against various stress conditions, including corner cases, as opposed to solely focusing on average-case behaviour. Finally, we put forth a set of recommendations to broaden the view of machine learning testing to a rigorous practice.

Negar Rostamzadeh, Ben Hutchinson, Vinod Prabhakaran
-
[ OpenReview  link »

Invariant Causal Prediction provides a framework for domain (or out-of-distribution) generalization – predicated on the assumption of invariant causal mechanisms that are constant across the data distributions of interest. Accordingly, the Invariant Risk Minimization (IRM) objective has been proposed to learn this stable structure, given sufficient training distributions. Unfortunately, recent work has identified the limitations of IRM when extended to data-generating mechanisms that are different from those considered in its formulation. This work considers the generative process with causal (predecessor) and anticausal (successor) features where environment-specific exogenous factors influence all features – but the target is free of direct environment-specific influences. We show empirically that IRM fails under this data-generating process. Instead, we propose a target conditioned representation independence (TCRI) constraint, which enforces the mediative effect of the observed target on the causal chain of latent features we aim to identify. We show that this approach outperforms both Empirical Risk Minimization (ERM) and IRM.

Olawale Salaudeen, Sanmi Koyejo
-
[ OpenReview  link »

Humans are remarkably capable of zero-shot generalizing while performing tasks in new settings, even when the task is learned entirely from observing others. In this work, we show that current imitation-based policy learning methods do not share this capability, lacking robustness to minor shifts in the training environment. To demonstrate these limitations of current methods, we propose a testing protocol that new methods may use as a benchmark. We implement and evaluate KitchenShift, an instance of our testing protocol that applies domain shifts to a realistic kitchen environment. We train policies from RGB image observations using a set of demonstrations for a multi-stage robotic manipulation task in the kitchen environment. Using KitchenShift, we evaluate imitation and representation learning methods used in current policy learning approaches and find that they are not robust to visual changes in the scene (e.g., lighting, camera view) or changes in the environment state (e.g., orientation of an object). With our benchmark, we hope to encourage the development of algorithms that can generalize under such domain shifts and overcome the challenges preventing robots from completing tasks in diverse everyday settings.

Eliot Xing, Abhinav Gupta, Sam Powers, Victoria Dean
-
[ OpenReview  link »

Machine learning models are updated as new data is acquired or new architectures are developed. These updates usually increase model performance, but may introduce backward compatibility errors, where individual users or groups of users see their performance on the updated model adversely affected. This problem can also be present when training datasets do not accurately reflect overall population demographics, with some groups having overall lower participation in the data collection process, posing a significant fairness concern. We analyze how ideas from distributional robustness and minimax fairness can aid backward compatibility in this scenario, and propose two methods to directly address this issue. Our theoretical analysis is backed by experimental results on CIFAR-10, CelebA, and Waterbirds, three standard image classification datasets.
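Backward compatibility errors as described here, samples the old model classified correctly that the updated model gets wrong, can be counted directly, overall or per group. The sketch below is an illustration with hypothetical function names, and normalizing by group size is one possible convention rather than the authors' metric.

```python
def backward_compat_errors(old_preds, new_preds, labels):
    """Indices where the old model was right and the updated model is wrong."""
    return [i for i, (o, n, y) in enumerate(zip(old_preds, new_preds, labels))
            if o == y and n != y]


def per_group_regression_rate(old_preds, new_preds, labels, groups):
    """Fraction of each group's samples that regressed after the update."""
    flipped = set(backward_compat_errors(old_preds, new_preds, labels))
    totals, regressions = {}, {}
    for i, g in enumerate(groups):
        totals[g] = totals.get(g, 0) + 1
        regressions[g] = regressions.get(g, 0) + (i in flipped)
    return {g: regressions[g] / totals[g] for g in totals}
```

An update can leave overall accuracy unchanged or improved while one group's regression rate is high, which is the situation the abstract flags as a fairness concern.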

Martin Bertran, Natalia L Martinez, Guillermo Sapiro
-
[ OpenReview  link »

We investigate the robustness of vision transformers (ViTs) through the lens of their special patch-based architectural structure, i.e., they process an image as a sequence of image patches. We find that ViTs are surprisingly insensitive to patch-based transformations, even when the transformation largely destroys the original semantics and makes the image unrecognizable by humans. This indicates that ViTs heavily use features that survived such transformations but are generally not indicative of the semantic class to humans. Further investigations show that these features are useful but non-robust, as ViTs trained on them can achieve high in-distribution accuracy, but break down under distribution shifts. Based on this understanding, we use the images transformed with our patch-based operations as negatively augmented views and offer losses to regularize the training away from using non-robust features. This is a complementary view to existing research that mostly focuses on augmenting inputs with semantic-preserving transformations to enforce models' invariance. We show that patch-based negative augmentation consistently improves robustness of ViTs across a wide set of ImageNet-based robustness benchmarks. Furthermore, we find that our patch-based negative augmentation is complementary to traditional (positive) data augmentation, and together they boost the performance further. All the code in this work will be open-sourced.

Yao Qin, Chiyuan Zhang, Ting Chen, Balaji Lakshminarayanan, Alex Beutel, Xuezhi Wang
-
[ OpenReview  link »

Detecting and addressing distribution shift is an important task in machine learning. However, most of the machine learning solutions to deal with distribution shift lack the capability to identify the key characteristics of such a shift and present it to humans in an interpretable way. In this work, we propose a novel framework to compare two datasets and identify distribution shifts between the datasets. The key challenge is to identify generative factors of variation, which we refer to as attributes, that characterize the similarities and differences between the datasets. Producing this characterization requires finding a set of attributes that can be aligned between the two datasets and sets that are unique. We address this challenge through a novel approach that performs both attribute discovery and attribute alignment across the two distributions. We evaluate our algorithm's effectiveness at accurately identifying these attributes in two separate experiments, one involving two variants of MNIST and a second experiment involving two versions of dSprites.

Matthew Olson, Rushil Anirudh, Jayaraman Thiagarajan, Timo Bremer, Weng-Keen Wong, Shusen Liu
-
[ OpenReview  link »

We investigate an interpretable approach to compare two distributions. The approach, max-sliced Bures divergence, approximates the max-sliced Wasserstein distance, and projects the distributions into a one-dimensional subspace defined by a 'slicing' vector. Unlike heuristic algorithms for the max-sliced Wasserstein-2 distance that are not guaranteed to find the optimal slice, we detail a tractable algorithm that finds the global optimal slice and scales to large sample sizes, due to its expression in terms of second moments. However, it is unable to detect changes in higher-order statistics. To overcome this, we explore using a non-linear mapping provided by the internal representation of a pre-trained neural network (Inception Net). Our approach provides an interpretation of the Fréchet Inception distance by identifying the instances that are either overrepresented or underrepresented with respect to the other sample. We apply the proposed measure to detect class imbalances and underrepresentation within data sets.

Austin Brockmeier, Claudio Claros-Olivares, Luis G Sanchez Giraldo
-
[ OpenReview  link »

Machine learning (ML) has recently demonstrated impressive progress in predictive accuracy across a wide array of tasks. Most ML approaches focus on generalization performance on unseen data that are "similar" to the training data (a.k.a. In-Distribution, or IND). However, real world applications and deployments of ML rarely enjoy the comfort of encountering examples that are always IND. In such situations, most ML models commonly display erratic behavior on Out-of-Distribution (OOD) examples, such as assigning high confidence to wrong predictions, or vice-versa. Implications of such unusual model behavior are further exacerbated in the healthcare setting, where patient health can potentially be put at risk. It is crucial to study the behavior and robustness properties of models under distributional shift, understand common failure modes, and take mitigation steps before the model is deployed. Having a benchmark that shines light upon these aspects of a model is a first and necessary step in addressing the issue. Recent work and interest in increasing model robustness in OOD settings have focused more on image modality, both in terms of methods as well as benchmarks, while the Electronic Health Record (EHR) modality is still largely under-explored. We aim to bridge this gap by releasing BEDS-Bench, a benchmark for quantifying the behavior of ML models over EHR data under OOD settings. We use two open access, de-identified EHR datasets to construct several OOD data settings to run tests on. The benchmark exercises several clinical prediction tasks, OOD data settings, and measures relevant metrics that characterize crucial aspects of a model's OOD behavior. We evaluate several learning algorithms under BEDS-Bench and find that all of them show poor generalization performance under distributional shift in general. Our results highlight the need and the potential to improve robustness of EHR models under distributional shift, and BEDS-Bench provides one way to measure progress towards that goal.

Anand Avati, Martin Seneviratne, Yuan Xue, Zhen Xu, Balaji Lakshminarayanan, Andrew Dai
-
[ OpenReview  link »

We study the problem of test time robustification, i.e., using the test input to improve model robustness. In this work, we aim to study and devise methods that make no assumptions about the model training process and are broadly applicable at test time. We propose a simple approach that can be used in any test setting where the model is probabilistic and adaptable: when presented with a test example, perform different data augmentations on the data point, and then adapt (all of) the model parameters by minimizing the entropy of the model's average, or marginal, output distribution across the augmentations. In our experiments, we demonstrate that this approach consistently improves robust ResNet and vision transformer models. We achieve several new state-of-the-art results for test shifts caused by image corruptions (ImageNet-C), renditions of common objects (ImageNet-R), and, among ResNet-50 models, adversarially chosen natural examples (ImageNet-A).

Marvin Zhang, Sergey Levine, Chelsea Finn
-
[ OpenReview  link »
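The objective in the test-time robustification abstract above (Zhang, Levine, Finn) is the entropy of the model's prediction averaged over augmented copies of a single test input. The following is a minimal sketch of that quantity only, assuming plain Python lists of logits; the logits are hypothetical, and the gradient step that adapts the model parameters is omitted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def marginal_entropy(logits_per_augmentation):
    """Entropy of the prediction averaged over augmented copies of one input."""
    probs = [softmax(l) for l in logits_per_augmentation]
    n, k = len(probs), len(probs[0])
    # Marginal (average) output distribution across the augmentations.
    marginal = [sum(p[c] for p in probs) / n for c in range(k)]
    return -sum(p * math.log(p) for p in marginal if p > 0.0)

# Hypothetical logits for three augmentations of one test image.
agree = [[4.0, 0.0, 0.0], [3.5, 0.1, 0.0], [4.2, 0.0, 0.3]]
disagree = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]

print(marginal_entropy(agree), marginal_entropy(disagree))
```

The value is low when the augmented predictions are confident and agree with each other, and high when they disagree, which is why minimizing it encourages predictions that are both confident and consistent across augmentations.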

Models which can actively seek out the best quality training data hold the promise of more accurate, adaptable, and efficient machine learning. State-of-the-art techniques tend to prefer examples which are the most difficult to classify. While this works well on homogeneous datasets, we find that it can lead to catastrophic failures when performing active learning on multiple distributions which have different degrees of label noise (heteroskedasticity). Most active learning algorithms strongly prefer to draw from the distribution with more noise, even if its examples have no informative structure (such as solid color images). We find that active learning which encourages diversity and model uncertainty in the selected examples can significantly mitigate these failures. We hope these observations are immediately useful to practitioners and can lead to the construction of more realistic and challenging active learning benchmarks.

Savya Khosla, Alex Lamb, Jordan Ash, Cyril Zhang, Kenji Kawaguchi
-
[ OpenReview  link »

With privacy as a motivation, Federated Learning (FL) is an increasingly used paradigm where learning takes place collectively on edge devices, each with a cache of user-generated training examples that remain resident on the local device. These on-device training examples are gathered in situ during the course of users’ interactions with their devices, and thus are highly reflective of at least part of the inference data distribution. Yet a distribution shift may still exist, because on-device training examples can be lacking for some data inputs expected to be encountered at inference time. This paper proposes a way to mitigate this shift: selective usage of datacenter data, mixed in with FL. By mixing decentralized (federated) and centralized (datacenter) data, we can form an effective training data distribution that better matches the inference data distribution, resulting in more useful models.

Sean Augenstein, Andrew S Hard, Rajiv Mathews
-
[ OpenReview  link »

The concern of overconfident mispredictions under distributional shift demands extensive reliability research on Graph Neural Networks used in critical tasks in drug discovery. Here we first introduce CardioTox, a real-world benchmark on drug cardiotoxicity to facilitate such efforts. Our exploratory study shows overconfident mispredictions are often distant from training data. That leads us to develop distance-aware GNNs: GNN-SNGP. Through evaluation on CardioTox and three established benchmarks, we demonstrate GNN-SNGP's effectiveness in increasing distance-awareness, reducing overconfident mispredictions and making better calibrated predictions without sacrificing accuracy performance. Our ablation study further reveals that the embeddings learned by GNN-SNGP improve distance-preservation over its base architecture, which is one major factor for the improvements.

Kehang Han, Balaji Lakshminarayanan, Jeremiah Liu
-
[ OpenReview  link »

Bayesian coresets have become of increasing interest recently for providing a theoretically sound, scalable approach to Bayesian inference. In brief, a coreset is a (weighted) subsample of a dataset that approximates the original dataset under some metric. Bayesian coresets specifically focus on subsamples that approximate the posterior distribution. Unfortunately, existing Bayesian coreset approaches can significantly undersample minority subpopulations, leading to a lack of distributional robustness. As a remedy, this work extends existing Bayesian coresets from enforcing sparsity constraints to group-wise sparsity constraints. We explore how this approach helps to mitigate distributional vulnerability. We further generalize the group constraints to Bayesian coresets with matroid constraints, which may be of independent interest. We present an optimization analysis of the proposed approach, along with an empirical evaluation on benchmark datasets that support our claims.

Shovik Guha, Rajiv Khanna, Sanmi Koyejo
-
[ OpenReview  link »

Accurately predicting the possible behaviors of traffic participants is an essential capability for autonomous vehicles. Since autonomous vehicles need to navigate in dynamically changing environments, they are expected to make accurate predictions regardless of where they are and what driving circumstances they encounter. Therefore, generalization capability to unseen domains is crucial for prediction models when autonomous vehicles are deployed in the real world. In this paper, we aim to address the domain generalization problem for vehicle intention prediction tasks, and we propose a causal-based time series domain generalization (CTSDG) model. We construct a structural causal model for vehicle intention prediction tasks to learn an invariant representation of input driving data for domain generalization. We further integrate a recurrent latent variable model into our structural causal model to better capture temporal latent dependencies from time-series input data. The effectiveness of our approach is evaluated via real-world driving data. We demonstrate that our proposed method consistently improves prediction accuracy compared to other state-of-the-art domain generalization and behavior prediction methods.

Yeping Hu, Xiaogang Jia, Masayoshi Tomizuka, Wei Zhan
-
[ OpenReview  link »

Large pre-trained models such as CLIP offer consistent accuracy across a range of data distributions when performing zero-shot inference (i.e., without fine-tuning on a specific dataset). Although existing fine-tuning approaches substantially improve accuracy in-distribution, they also reduce out-of-distribution robustness. We address this tension by introducing a simple and effective method for improving robustness: ensembling the weights of the zero-shot and fine-tuned models (WiSE-FT). Compared to standard fine-tuning, WiSE-FT provides large accuracy improvements out-of-distribution, while matching or improving in-distribution accuracy. On ImageNet (in-distribution) and five derived distribution shifts, WiSE-FT improves out-of-distribution accuracy by 2 to 10 percentage points (pp) while increasing in-distribution accuracy by nearly 1 pp relative to standard fine-tuning. WiSE-FT achieves similarly large robustness improvements (2 to 15 pp) on a diverse set of six further distribution shifts, and in-distribution accuracy gains of 0.8 to 3.3 pp compared to standard fine-tuning on seven commonly used transfer learning datasets. These improvements come at no additional computational cost during fine-tuning or inference.

Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Hanna Hajishirzi, Ali Farhadi, Hong Namkoong, Ludwig Schmidt
-
[ OpenReview  link »
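The weight-space ensembling described in the WiSE-FT abstract above can be illustrated as a per-parameter interpolation between the two models. This is a minimal sketch, assuming both models share the same parameter names and that a single mixing coefficient `alpha` is used; the dictionary layout, the parameter names, and the numbers are hypothetical.

```python
def interpolate_weights(zero_shot, fine_tuned, alpha):
    """Return (1 - alpha) * zero_shot + alpha * fine_tuned, parameter by parameter."""
    if zero_shot.keys() != fine_tuned.keys():
        raise ValueError("both models must share the same parameter names")
    return {
        name: [(1.0 - alpha) * z + alpha * f
               for z, f in zip(zero_shot[name], fine_tuned[name])]
        for name in zero_shot
    }

# Hypothetical flattened parameters of a tiny model.
zero_shot = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
fine_tuned = {"layer.weight": [3.0, 0.0], "layer.bias": [1.0]}

mixed = interpolate_weights(zero_shot, fine_tuned, alpha=0.5)
print(mixed)
```

Because the result is a single set of weights, inference costs the same as for either original model, which is consistent with the abstract's statement that no additional computational cost is incurred.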

Since Transformer architectures have been popularised in Computer Vision, several papers started analysing their properties in terms of calibration, out-of-distribution detection and data-shift robustness. Most of these papers conclude that Transformers, due to some intrinsic properties (presumably the lack of restrictive inductive biases and the computationally intensive self-attention mechanism), outperform Convolutional Neural Networks (CNNs). In this paper we question this conclusion: in some relevant cases, CNNs, with a pre-training and fine-tuning procedure similar to the one used for transformers, exhibit competitive robustness. To fully understand this behaviour, our evidence suggests that researchers should focus on the interaction between pre-training, fine-tuning and the considered architectures rather than on intrinsic properties of Transformers. For this reason, we present some preliminary analyses that shed some light on the impact of pre-training and fine-tuning on out-of-distribution detection and data-shift.

Francesco Pinto, Philip Torr, Puneet Dokania
-
[ OpenReview  link »

Fine-tuning of large pre-trained image and language models on small customized datasets has become increasingly popular for improved prediction and efficient use of limited resources. Fine-tuning requires identifying the best models to transfer-learn from, and quantifying transferability avoids expensive re-training on all of the candidate model/task pairs. In this paper, we show that statistical problems with covariance estimation drive the poor performance of H-score [1] (a common baseline for newer metrics) and propose a shrinkage-based estimator. This results in up to 80% absolute gain in H-score correlation performance, making it competitive with the state-of-the-art LogME measure by [26]. Our shrinkage-based H-score is 3-55 times faster than LogME. Additionally, we look into a less common setting of target (as opposed to source) task selection. We highlight previously overlooked problems in such settings with different numbers of labels, class-imbalance ratios etc. for some recent metrics e.g., NCE [24], LEEP [18] that misrepresented them as leading measures. We propose a correction and recommend measuring correlation performance against relative accuracy in such settings. We support our findings with ~65,000 (fine-tuning trials) experiments.

Shibal Ibrahim, Natalia Ponomareva, Rahul Mazumder
-
[ OpenReview  link »

We propose an extremely simple approach to regularize a single deterministic neural network to obtain improved accuracy and reliable uncertainty estimates. Our approach, on top of the cross-entropy loss, simply puts an entropy maximization regularizer corresponding to the predictive distribution in the regions of the embedding space between the class clusters. This is achieved by synthetically generating between-cluster samples via the convex combination of two images from different classes and maximizing the entropy on these samples. Such a data-dependent regularization guides the maximum likelihood estimation to prefer a solution that (1) maps out-of-distribution samples to high entropy regions (creating an entropy barrier); and (2) is more robust to the superficial input perturbations. We empirically demonstrate that Mix-MaxEnt consistently provides much improved classification accuracy, better calibrated probabilities for in-distribution data, and reliable uncertainty estimates when exposed to situations involving domain-shift and out-of-distribution samples.

Francesco Pinto, Harry Yang, Ser Nam Lim, Philip Torr, Puneet Dokania
-
[ OpenReview  link »

The increasing deployment of advanced digital technologies such as Internet of Things (IoT) devices and Cyber-Physical Systems (CPS) in industrial environments is enabling the productive use of machine learning (ML) algorithms in the manufacturing domain. As ML applications move from research to productive use in real-world industrial environments, the question of reliability arises. Since the majority of ML models are trained and evaluated on static datasets, continuous online monitoring of their performance is required to build reliable systems. Furthermore, concept and sensor drift can lead to degrading accuracy of the algorithm over time, thus compromising safety, acceptance and economics if undetected and not properly addressed. In this work, we illustrate the severity of the issue on a publicly available industrial dataset which was recorded over the course of 36 months and explain possible sources of drift. We assess the robustness of ML algorithms commonly used in manufacturing and show how uncertainty estimation may be leveraged for online performance estimation as well as drift detection as a first step towards continually learning applications.

Nicolas Jourdan
-
[ OpenReview  link »

Reliable out-of-distribution (OOD) detection is a fundamental step towards a safer implementation of modern machine learning (ML) systems under distribution shift. In this paper, we introduce Igeood, an effective method for detecting OOD samples. Igeood applies to any pre-trained neural network, does not require OOD samples or assumptions on the OOD data, and works under different degrees of access to the ML model. By building on the geodesic (Fisher-Rao) distance between the underlying data distributions, our discriminator combines confidence scores from the logits outputs and the learned features of a deep neural network. Empirically, we show that Igeood outperforms competing state-of-the-art methods on a variety of network architectures and datasets, e.g., by increasing the average TNR at TPR-95% by up to 8.5% across six different models and nine different OOD datasets.

Eduardo Dadalto, Florence Alberge, Pierre Duhamel, Pablo Piantanida
-
[ OpenReview  link »
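The Igeood abstract above builds on the Fisher-Rao (geodesic) distance. For two categorical distributions this distance has a closed form, twice the arccosine of the sum of the square roots of the product of matching probabilities. The following is a minimal sketch of that formula only, assuming probability vectors given as Python lists; how Igeood combines such distances on logits and features into a detection score is not shown, and the example vectors are hypothetical.

```python
import math

def fisher_rao_distance(p, q):
    """Geodesic (Fisher-Rao) distance between two categorical distributions."""
    # Bhattacharyya coefficient, clamped to [0, 1] to guard against rounding.
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return 2.0 * math.acos(min(1.0, max(0.0, bc)))

# Hypothetical softmax outputs.
uniform = [0.5, 0.5]
confident = [0.9, 0.1]
class_a = [1.0, 0.0]
class_b = [0.0, 1.0]

print(fisher_rao_distance(uniform, confident))
print(fisher_rao_distance(class_a, class_b))
```

The distance is zero for identical distributions and reaches its maximum, pi, when the two distributions have disjoint support.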

In this work, we investigate the unexplored intersection of domain generalization and data-free learning. In particular, we address the question: How can knowledge contained in models trained on different source data domains be merged into a single model that generalizes well to unseen target domains, in the absence of source and target domain data? Machine learning models that can cope with domain shift are essential for real-world scenarios with often changing data distributions. Prior domain generalization methods typically rely on using source domain data, making them unsuitable for private decentralized data. We define the novel problem of Data-Free Domain Generalization (DFDG), a practical setting where models trained on the source domains separately are available instead of the original datasets, and investigate how to effectively solve the domain generalization problem in that case. We propose DEKAN, an approach that extracts and fuses domain-specific knowledge from the available teacher models into a student model robust to domain shift. Our empirical evaluation demonstrates the effectiveness of our method, which achieves the first state-of-the-art results in DFDG by significantly outperforming ensemble and data-free knowledge distillation baselines.

Ahmed Frikha, Haokun Chen, Denis Krompaß, Thomas Runkler, Volker Tresp
-
[ OpenReview  link »

Bayesian deep learning seeks to equip deep neural networks with the ability to precisely quantify their predictive uncertainty, and has promised to make deep learning more reliable for safety-critical real-world applications. Yet, existing Bayesian deep learning methods fall short of this promise; new methods continue to be evaluated on unrealistic test beds that do not reflect the complexities of the downstream real-world tasks that would benefit most from reliable uncertainty quantification. We propose a set of real-world tasks that accurately reflect such complexities and assess the reliability of predictive models in safety-critical scenarios. Specifically, we curate two publicly available datasets of high-resolution human retina images exhibiting varying degrees of diabetic retinopathy, a medical condition that can lead to blindness, and use them to design a suite of automated diagnosis tasks that require reliable predictive uncertainty quantification. We use these tasks to benchmark well-established and state-of-the-art Bayesian deep learning methods on task-specific evaluation metrics. We provide an easy-to-use codebase for fast and easy benchmarking following reproducibility and software design principles. We provide implementations of all methods included in the benchmark as well as results computed over 100 TPU days, 20 GPU days, 400 hyperparameter configurations, and evaluation on at least 6 random seeds each.

Neil Band, Tim G. J. Rudner, Qixuan Feng, Angelos Filos, Zachary Nado, Mike Dusenberry, Ghassen Jerfel, Dustin Tran, Yarin Gal
-
[ OpenReview  link »

In online applications with streaming data, awareness of how far the empirical training or test data has shifted away from its original data distribution can be crucial to the performance of the model. However, historical samples in the data stream may not be kept either due to space requirements or for regulatory reasons. To cope with such situations, we propose Continual Density Ratio Estimation (CDRE), for estimating density ratios between the initial and latest distributions (p/q_t) of a data stream without the need to store past samples, where q_t has shifted away from p after a time period t. In particular, CDRE is more accurate than standard DRE when the two distributions are less similar, despite not requiring samples from the original distribution. CDRE can be applied in scenarios of online or continual learning, such as importance weighted covariate shift, measuring dataset changes for better decision making.

Yu Chen, Song Liu, Tom Diethe, Peter Flach
-
[ OpenReview  link »

We propose a framework for predictive uncertainty quantification of a neural network that replaces the conventional Bayesian notion of weight probability density function (PDF) with a physics based potential field representation of the model weights in a Gaussian reproducing kernel Hilbert space (RKHS) embedding. This allows us to use perturbation theory from quantum physics to formulate a moment decomposition problem over the model weight-output relationship. The extracted moments reveal successive degrees of regularization of the weight potential field around the local neighborhood of the model output. Such localized moments represent well the PDF tails and we show that this consequently leads to a better ability to detect false model predictions of test data that has undergone a distributional shift away from the training PDF learned by the model. We evaluate our approach against baseline uncertainty quantification methods on datasets that are corrupted using common distortion techniques. Our approach provides fast model predictive uncertainty estimates with much greater precision and calibration.

Rishabh Singh, Jose C Principe
-
[ OpenReview  link »

We devise a coreset selection method based on the idea of gradient matching: the gradients induced by the coreset should match, as closely as possible, those induced by the original training dataset. We evaluate the method in the context of continual learning, where it can be used to curate a rehearsal memory. Our method outperforms strong competitors such as reservoir sampling across a range of memory sizes.

Lukas Balles, Giovanni Zappella, Cedric Archambeau
-
[ OpenReview  link »
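The gradient matching idea in the coreset abstract above (Balles, Zappella, Archambeau) can be illustrated with a naive greedy selection. This is a sketch under stated assumptions, not the authors' algorithm: it assumes one precomputed gradient vector per training example, an unweighted coreset, and a greedy search that adds whichever example brings the coreset's mean gradient closest to the full-data mean gradient. The function names and the gradients are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_gradient_matching(gradients, size):
    """Greedily pick indices whose mean gradient tracks the full mean gradient."""
    target = mean_vector(gradients)
    chosen = []
    for _ in range(size):
        best, best_err = None, float("inf")
        for i in range(len(gradients)):
            if i in chosen:
                continue
            candidate = [gradients[j] for j in chosen + [i]]
            err = squared_distance(mean_vector(candidate), target)
            if err < best_err:
                best, best_err = i, err
        chosen.append(best)
    return chosen

# Hypothetical per-example gradients in two dimensions.
gradients = [[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, -2.0], [5.0, 5.0]]
coreset = greedy_gradient_matching(gradients, size=2)
print(coreset)
```

In a continual learning setting, the selected indices would identify the examples kept in the rehearsal memory.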

Deep neural networks rely heavily on normalization methods to improve their performance and learning behavior. Although normalization methods spurred the development of increasingly deep and efficient architectures, they also increased the vulnerability with respect to noise and input corruptions. In most applications, however, noise is ubiquitous and diverse; this can often lead to complete failure of machine learning systems as they fail to cope with mismatches between the input distribution during training- and test-time. The most common normalization method, batch normalization, reduces the distribution shift during training but is agnostic to changes of the input distribution during test time. Sample-based normalization methods can correct linear transformations of the activation distribution but cannot mitigate changes in the distribution shape; this makes the network vulnerable to distribution changes that cannot be reflected in the normalization parameters. We propose an unsupervised non-parametric distribution correction method that adapts the activation distribution of each layer. This reduces the mismatch between the training and test-time distribution by minimizing the 1-D Wasserstein distance. In our experiments, we empirically show that the proposed method effectively reduces the impact of intense image corruptions and thus improves the classification performance without the need for retraining or fine-tuning the model.

Alexander Fuchs, Christian Knoll, Franz Pernkopf
-
[ OpenReview  link »
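The 1-D Wasserstein distance mentioned in the normalization abstract above (Fuchs, Knoll, Pernkopf) has a simple form for two empirical samples of equal size: the mean absolute gap between their sorted values. The sketch below computes it and applies a rank-based (quantile matching) correction that maps test activations onto the training values. The correction is an illustrative assumption and not necessarily the authors' procedure; the activation values are hypothetical.

```python
def wasserstein_1d(xs, ys):
    """W1 between two equal-size empirical samples: mean gap of sorted values."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def quantile_match(test_acts, train_acts):
    """Replace each test activation by the training value of the same rank."""
    order = sorted(range(len(test_acts)), key=lambda i: test_acts[i])
    targets = sorted(train_acts)
    corrected = [0.0] * len(test_acts)
    for rank, idx in enumerate(order):
        corrected[idx] = targets[rank]
    return corrected

# Hypothetical activations of one unit at training and at test time.
train_acts = [0.0, 1.0, 2.0, 3.0]
test_acts = [10.0, 4.0, 6.0, 8.0]

before = wasserstein_1d(test_acts, train_acts)
corrected = quantile_match(test_acts, train_acts)
after = wasserstein_1d(corrected, train_acts)
print(before, corrected, after)
```

Unlike a rescaling by mean and variance, a rank-based mapping can also change the shape of the distribution, which is the limitation of sample-based normalization that the abstract points out.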

Unsupervised domain adaptation aims to learn a model generalizing on target domain given labeled source data and unlabeled target data. However, source data sometimes may be unavailable when considering data privacy and decentralized learning architecture. In this paper, we address the source-free unsupervised domain adaptation problem where only the trained source model and unlabeled target data are given. To this end, we propose an Augmented Self-Labeling (ASL) method jointly optimizing model and labels for target data starting from the source model. This includes two alternating steps, where augmented self-labeling improves pseudo-labels via solving an optimal transport problem with Sinkhorn-Knopp algorithm, and model re-training trains the model with the supervision of improved pseudo-labels. We further introduce model regularization terms to improve the model re-training. Experiments show that our method can achieve comparable or better results than the state-of-the-art methods on the standard benchmarks.

Hao Yan, Yuhong Guo, Chunsheng Yang
-
[ OpenReview  link »
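The Sinkhorn-Knopp step named in the source-free adaptation abstract above (the ASL method) alternately rescales the rows and columns of a positive matrix until it has prescribed marginals. The following is a minimal sketch, assuming that the model's predicted probabilities are scaled directly and that every class should receive the same total mass. Both are illustrative assumptions rather than details taken from the abstract, and the probabilities are hypothetical.

```python
def sinkhorn_knopp(scores, n_iters=50):
    """Rescale a positive n x k matrix so rows sum to 1/n and columns to 1/k."""
    n, k = len(scores), len(scores[0])
    q = [list(row) for row in scores]
    for _ in range(n_iters):
        # Column step: give every class the same total mass 1/k (assumption).
        for j in range(k):
            col_sum = sum(q[i][j] for i in range(n))
            for i in range(n):
                q[i][j] *= (1.0 / k) / col_sum
        # Row step: give every sample total mass 1/n.
        for i in range(n):
            row_sum = sum(q[i])
            q[i] = [v * (1.0 / n) / row_sum for v in q[i]]
    return q

def hard_labels(plan):
    """Pseudo-label of each sample: the class with the largest assigned mass."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in plan]

# Hypothetical source-model predictions that all favor class 0.
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
plan = sinkhorn_knopp(probs)
labels = hard_labels(plan)
print(labels)
```

The balanced column constraint prevents the degenerate solution in which every target sample receives the same pseudo-label, even though the raw predictions all favor a single class.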

Generalizing to out-of-distribution (OOD) data -- that is, data from domains unseen during training -- is a key challenge in modern machine learning, which has only recently received much attention. Some existing approaches propose leveraging larger models and pre-training on larger datasets. In this paper, we provide new insights in applying these approaches. Concretely, we show that larger models and larger datasets need to be simultaneously leveraged to improve OOD performance. Moreover, we show that using smaller learning rates during fine-tuning is critical to achieving good results, contrary to popular intuition that larger learning rates generalize better when training from scratch. We show that strategies that improve in-distribution accuracy may, counter-intuitively, lead to poor OOD performance despite strong in-distribution performance. Our insights culminate in a method that achieves state-of-the-art results on a number of OOD generalization benchmark tasks, often by a significant margin.

Yaodong Yu, Heinrich Jiang, Dara Bahri, Hossein Mobahi, Seungyeon Kim, Ankit Rawat, Andreas Veit, Yi Ma
-
[ OpenReview  link »

Real-world applications of machine learning require a model to be capable of dealing with domain shifts that might occur at test time due to natural perturbations to the data distribution induced by, for example, changes in the data collection conditions, or synthetic distortions such as adversarial attacks. While a learning system might be simultaneously vulnerable to natural and hand-engineered perturbations, previous work has mainly focused on developing techniques to alleviate the effects of specific types of distribution shifts. In this work, we propose a unified and versatile approach to mitigate both natural and artificial domain shifts via the use of random projections. We show that such projections, implemented as convolutional layers with random weights placed at the input of a model, are capable of increasing the overlap between the different distributions that may appear at training/testing time. We evaluate the proposed approach on settings where different types of distribution shifts occur, and show it provides gains in terms of improved out-of-distribution generalization in the domain generalization setting, as well as increased robustness to two types of adversarial perturbations on the CIFAR-10 dataset without requiring adversarial training.

Isabela Albuquerque, Joao Monteiro, Tiago H Falk
-
[ OpenReview  link »

We study semi-supervised domain generalization (SSDG), a more realistic problem setting than existing domain generalization research. In particular, SSDG assumes only a few data are labeled from each source domain, along with abundant unlabeled data. Our proposed approach, called StyleMatch, extends FixMatch's two-view consistency learning paradigm in two crucial ways to address SSDG: first, stochastic modeling is applied to the classifier's weights to mitigate overfitting in the scarce labeled data; and second, style augmentation is integrated as a third view into the multi-view consistency learning framework to enhance robustness to domain shift. Two SSDG benchmarks are established where StyleMatch outperforms strong baseline methods developed in relevant areas including domain generalization and semi-supervised learning.

Kaiyang Zhou, Chen Change Loy, Ziwei Liu
-
[ OpenReview  link »

A number of deep learning approaches have recently been proposed to improve model performance on subgroups under-represented in the training set. However, Menon et al. recently showed that models with poor subgroup performance can still learn representations which contain useful information about these subgroups. In this work, we explore the representations learned by various approaches to robust learning, finding that different approaches learn practically identical representations. We probe a range of post-hoc procedures for making predictions from learned representations, showing that the distribution of the post-hoc validation set is paramount, and that clustering-based methods may be a promising approach.

David Madras, Richard Zemel
-
[ OpenReview  link »

Machine learning often experiences distribution shifts between training and testing. We introduce a simple objective whose optima are exactly all representations on which risk minimizers are guaranteed to be robust to Bayes preserving shifts, e.g., covariate shifts. Our objective has two components. First, a representation must remain discriminative, i.e., some predictor must be able to minimize the source and target risk. Second, the representation's support should be invariant across source and target. We make this practical by designing self-supervised methods that only use unlabelled data and augmentations. Our objectives achieve SOTA on DomainBed, and give insights into the robustness of recent methods, e.g., CLIP.

Yann Dubois, Yangjun Ruan, Chris Maddison
-
[ OpenReview  link »

Diverse data augmentation strategies are a natural approach to improving robustness in computer vision models against unforeseen shifts in data distribution. However, the ability to tailor such strategies to inoculate a model against specific classes of corruptions or attacks---without incurring substantial losses in robustness against other classes of corruptions---remains elusive. In this work, we successfully harden a model against Fourier-based attacks, while producing superior-to-AugMix accuracy and calibration results on both the CIFAR-10-C and CIFAR-100-C datasets; classification error is reduced by over ten percentage points for some high-severity noise and digital-type corruptions. We achieve this by incorporating Fourier-basis perturbations in the AugMix image-augmentation framework. Thus we demonstrate that the AugMix framework can be tailored to effectively target particular distribution shifts, while boosting overall model robustness.

Ryan Soklaski, Michael Yee, Theodoros Tsiligkaridis
-
[ OpenReview  link »

Data samples generated by several real-world processes are dynamic in nature, i.e., their characteristics vary with time. Thus it is not possible to train for and tackle all possible distributional shifts between training and inference using the host of transfer learning methods in the literature. In this paper, we tackle this problem of adapting to domain shift at inference time, i.e., we do not change the training process, but quickly adapt the model at test-time to handle any domain shift. For this, we propose to enforce consistency of predictions of data sampled in the vicinity of the test sample on the image manifold. On a host of test scenarios like dealing with corruptions (CIFAR-10-C and CIFAR-100-C), and domain adaptation (VisDA-C), our method is on par with or significantly outperforms previous methods.

Prabhu Teja Sivaprasad, François Fleuret
-
[ OpenReview  link »

Federated learning has been deployed to train machine learning models from decentralized client data on mobile devices in practice. The clients available for training are observed to have periodically shifting distributions changing with the time of day, which can cause instability in training and degrade the model performance. In this paper, instead of modeling the distribution shift with a block-cyclic pattern as previous works, we model it with a mixture of distributions that gradually changes between daytime modes and nighttime modes, and find this intuitive model to better match the observations in practical federated learning systems. We propose a Federated Expectation-Maximization algorithm enhanced by Temporal priors of the shifting distribution (FedTEM), which jointly learns a mixture model to infer the mode of each client, while training a network with multiple light-weight branches specializing at different modes. Experiments for image classification on EMNIST and CIFAR datasets, and next word prediction on the Stack Overflow dataset show that the proposed algorithm can effectively mitigate the impact of the distribution shift and significantly improve the final model performance.

Chen Zhu, Zheng Xu, Mingqing Chen, Jakub Konečný, Andrew S Hard, Tom Goldstein

Covariate shifts are a common problem in predictive modeling on real-world problems. This paper proposes addressing the covariate shift problem by minimizing Maximum Mean Discrepancy (MMD) statistics between the training and test sets in either feature input space, feature representation space, or both. We designed three techniques that we call MMD Representation, MMD Mask, and MMD Hybrid to deal with the scenarios where only a distribution shift exists, only a missingness shift exists, or both types of shift exist, respectively. We find that integrating an MMD loss component helps models use the best features for generalization and avoid dangerous extrapolation as much as possible for each test sample. Models treated with this MMD approach show better performance, calibration, and extrapolation on the test set.

Liwen Ouyang, Aaron Key
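
As a rough illustration of the statistic named in the abstract above, the following sketch computes a biased estimate of the squared Maximum Mean Discrepancy between two samples. It is not the authors' MMD Representation, MMD Mask, or MMD Hybrid technique; the RBF kernel, the `bandwidth` value, and the synthetic Gaussian data are all assumptions made for this example.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    # Pairwise squared Euclidean distances between rows of a and rows of b.
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples x and y."""
    kxx = rbf_kernel(x, x, bandwidth).mean()
    kyy = rbf_kernel(y, y, bandwidth).mean()
    kxy = rbf_kernel(x, y, bandwidth).mean()
    return kxx + kyy - 2.0 * kxy

# Synthetic stand-ins for training and test features (assumption for illustration).
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(200, 5))
test_same = rng.normal(0.0, 1.0, size=(200, 5))      # same distribution as train
test_shifted = rng.normal(1.5, 1.0, size=(200, 5))   # mean-shifted covariates

mmd_same = mmd_squared(train, test_same)
mmd_shifted = mmd_squared(train, test_shifted)
print(f"MMD^2 (no shift): {mmd_same:.4f}, MMD^2 (shifted): {mmd_shifted:.4f}")
```

In a training setup, a term like `mmd_squared` would typically be added to the task loss so that the statistic between training and test features is driven down; how the abstract's three variants do this is not specified here.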

Robustness to distribution shifts is critical for deploying machine learning models in the real world. Despite this necessity, there has been little work in defining the underlying mechanisms that cause these shifts and evaluating the robustness of algorithms across multiple, different distribution shifts. To this end, we introduce a framework that enables fine-grained analysis of various distribution shifts. We provide a holistic analysis of current state-of-the-art methods by evaluating 19 distinct methods grouped into five categories across both synthetic and real-world datasets. Overall, we train more than 85K models. Our experimental framework can be easily extended to include new methods, shifts, and datasets. We find, unlike previous work [Gulrajani and Lopez-Paz, 2021], that progress has been made over a standard ERM baseline; in particular, pre-training and augmentations (learned or heuristic) offer large gains in many cases. However, the best methods are not consistent over different datasets and shifts.

Olivia Wiles, Sven Gowal, Florian Stimberg, Sylvestre-Alvise Rebuffi, Ira Ktena, Krishnamurthy Dvijotham, A. Taylan Cemgil

Importance weighting is a classic technique to handle distribution shifts. However, prior work has presented strong empirical and theoretical evidence demonstrating that importance weights can have little to no effect on overparameterized neural networks. Is importance weighting truly incompatible with the training of overparameterized neural networks? Our paper answers this in the negative. We show that importance weighting fails not because of the overparameterization, but instead, as a result of using exponentially-tailed losses like the logistic or cross-entropy loss. As a remedy, we show that polynomially-tailed losses restore the effects of importance reweighting in correcting distribution shift in overparameterized models. We characterize the behavior of gradient descent on importance weighted polynomially-tailed losses with overparameterized linear models, and theoretically demonstrate the advantage of using polynomially-tailed losses in a label shift setting. Surprisingly, our theory shows that using weights that are obtained by exponentiating the classical unbiased importance weights can improve performance. Finally, we demonstrate the practical value of our analysis with neural network experiments on a subpopulation shift and a label shift dataset. Our polynomially-tailed loss consistently increases the test accuracy by 2-3%.

Ke Alexander Wang, Niladri Chatterji, Saminul Haque, Tatsunori Hashimoto
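
To make the weighting step in the abstract above concrete, here is a minimal sketch of classical importance weights under label shift, together with an exponentiated variant. The class priors, the exponent `gamma`, and the per-example losses are hypothetical values chosen for illustration; the sketch does not implement the paper's polynomially-tailed loss.

```python
import numpy as np

# Hypothetical class priors for a two-class label shift (assumptions for illustration).
p_train = np.array([0.9, 0.1])   # class frequencies in the training set
p_test = np.array([0.5, 0.5])    # class frequencies expected at test time

# Classical unbiased importance weights: w(y) = p_test(y) / p_train(y).
weights = p_test / p_train

# Exponentiated weights; gamma is a hypothetical exponent, gamma = 1 recovers the classical case.
gamma = 2.0
exp_weights = weights ** gamma

# Apply per-class weights to made-up per-example losses.
labels = np.array([0, 0, 0, 1])
losses = np.array([0.2, 0.1, 0.3, 1.5])
weighted_loss = float(np.mean(weights[labels] * losses))
exp_weighted_loss = float(np.mean(exp_weights[labels] * losses))
print(weights, exp_weights, weighted_loss, exp_weighted_loss)
```

Raising the weights to a power larger than one puts even more emphasis on the class that is rare in training, which is the direction of the effect the abstract describes.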

Most modern unsupervised domain adaptation (UDA) approaches are rooted in domain alignment, i.e., learning to align source and target features to learn a target domain classifier using source labels. In semi-supervised domain adaptation (SSDA), where the learner can access a few target domain labels, prior approaches have followed UDA theory and used domain alignment for learning. We show that the case of SSDA is different and that a good target classifier can be learned without explicit alignment. We use self-supervised pretraining and consistency regularization to achieve well-separated target clusters, which aids in learning a low-error target classifier and allows our method to outperform recent state-of-the-art approaches on large, challenging benchmarks like DomainNet and VisDA-17.

Samarth Mishra, Kate Saenko, Venkatesh Saligrama

Self-supervised learning (SSL) learns general visual representations without the need for labels. However, large-scale unlabeled datasets in the wild often have long-tailed label distributions, where we know little about the behavior of SSL. We investigate SSL under dataset imbalance, and find that existing self-supervised representations are more robust to class imbalance than supervised representations. The performance gap between balanced and imbalanced pre-training with SSL is much smaller than the gap with supervised learning. Second, to understand the robustness of SSL, we hypothesize that SSL learns richer features from frequent data: it may learn label-irrelevant-but-transferable features that help classify the rare classes. In contrast, supervised learning has no incentive to learn features irrelevant to the labels of frequent examples. We validate the hypothesis with semi-synthetic experiments and theoretical analysis on a simplified setting.

Hong Liu, Jeff Z. HaoChen, Adrien Gaidon, Tengyu Ma

A popular belief based on recent work suggests that overparameterization increases worst-group test error on datasets with spurious correlation in the minority subgroup. These works focus on the case where the subgroups are labelled. Thus, to gain a complete picture, we investigate the effect of model size on worst-group generalization under empirical risk minimization (ERM) across a wide range of settings. We evaluate overparameterized ResNet, VGG, and BERT models in the vision and natural language processing domains on datasets with spurious correlations. We improve on the experimental setup of prior works by (1) studying the effect of model size by varying the depth and width of widely-used model architectures, and (2) comparing the trends on pretrained models with those trained from scratch. We empirically demonstrate that increasing pretrained model size, by increasing either depth or width, helps or does not hurt worst-group test error under ERM. The Waterbirds and MultiNLI datasets in particular demonstrate a monotonic increase in worst-group accuracy as model size increases. Our systematic study provides benchmarks over a set of datasets and model architectures, and guidance to researchers working on problems without access to subgroup labels.

Alan Pham, Eunice Chan, Vikranth Srivatsa, Dhruba Ghosh, Yaoqing Yang, Yaodong Yu, Ruiqi Zhong, Joseph Gonzalez, Jacob Steinhardt
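
The worst-group metric discussed in the abstract above can be sketched as follows. The function name `worst_group_accuracy` and the toy labels, predictions, and group assignments are hypothetical and not taken from the paper; the point is only to show how a high average accuracy can hide a poorly served subgroup.

```python
import numpy as np

def worst_group_accuracy(y_true, y_pred, groups):
    """Return (minimum per-group accuracy, dict of per-group accuracies)."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    per_group = {}
    for g in np.unique(groups):
        mask = groups == g
        # Accuracy restricted to the examples that belong to group g.
        per_group[g.item()] = float(np.mean(y_true[mask] == y_pred[mask]))
    return min(per_group.values()), per_group

# Toy data (assumption): group 0 is classified perfectly, group 1 mostly wrong.
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 1, 0, 1]
groups = [0, 0, 0, 0, 1, 1, 1, 1]

worst, per_group = worst_group_accuracy(y_true, y_pred, groups)
average = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
print(f"average={average}, worst-group={worst}, per-group={per_group}")
```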

Transfer learning for deep models has shown great success for various recognition tasks. Typically, a backbone network is pre-trained on a source dataset, then fine-tuned on a target dataset. We consider that, when both datasets are at hand, learning from them simultaneously for at least some period of iterations would yield higher test performance than step-wise optimization. We propose Smooth Transfer Learning, which uses a learnable scheduler function for the loss coefficients so that the degrees of contribution from the two datasets can be smoothly changed over training time for optimal target performance. The scheduler function is designed so that it can express either pre-training-then-fine-tuning or multi-task learning with fixed weights as special cases. Our method consistently outperforms these special cases in object classification with CIFAR-10 and CIFAR-100, and in digit classification with SVHN and MNIST.

Keita Takayama, Teppei Suzuki, Ikuro Sato, Rei Kawakami, Koichi Shinoda

We present first empirical results from our ongoing investigation of distribution shifts in image data used for various computer vision tasks. Instead of analyzing the original training and test data, we propose to study shifts in the learned weights of trained models. In this work, we focus on the properties of the distributions of the predominantly used 3x3 convolution filter kernels. We collected and publicly provide a dataset with over half a billion filters from hundreds of trained CNNs, using a wide range of datasets, architectures, and vision tasks. Our analysis shows interesting distribution shifts (or the lack thereof) between trained filters along different axes of meta-parameters, like data type, task, architecture, or layer depth. We argue that the observed properties are a valuable source for further investigation into a better understanding of the impact of shifts in the input data on the generalization abilities of CNN models, and into novel methods for more robust transfer learning in this domain.

Paul Gavrikov, Janis Keuper

We propose an adversarial learning method to tackle a Domain Adaptation time series regression task (DANNTe). The task concerns the virtualization of a physical sensor of a turbine, with the aim of building a reliable virtual sensor that works under operating conditions not considered during the training phase. Our approach is directly inspired by the need for a domain-invariant representation of the features to correct the covariate shift present in the data. The learner has access to both labeled source data and unlabeled target data (Unsupervised DA) and is trained on both, exploiting the minmax game between a task regressor neural network and a domain classifier neural network. Both models share the same feature representation in terms of a feature extractor neural network. This work is based on the work of Ganin et al.; we present an extension suitable for time series data. The results show a significant improvement in regression performance compared to the base model trained on the source domain only.

Valentina Gori, Luca Strazzera

Fréchet inception distance (FID) has established itself as a standard performance measure for generative adversarial networks (GANs). In this paper, we empirically investigate the biases that are inherited from its underlying design decision of extracting image features using the Inception v3 image classification network. As a result, we investigate how reliable FID is in terms of ranking the performance of GANs. In this context, we find that FID is not aligned with human perception, and that exchanging Inception v3 with different image classification networks simply steers the ranking towards different biases.

Steffen Jung, Margret Keuper
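
For readers unfamiliar with the metric examined in the abstract above, the sketch below computes the Fréchet distance between Gaussians fitted to two feature sets, which is the quantity FID evaluates on Inception v3 features. Here random arrays stand in for network features (an assumption for illustration), so the numbers say nothing about any GAN, and the helper name `frechet_distance` is hypothetical.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two (n_samples, dim) feature arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

# Random stand-ins for extracted image features (assumption for illustration).
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 4))
fake_feats = real_feats + 2.0  # same covariance, mean shifted by 2 in each of 4 dims

d_identical = frechet_distance(real_feats, real_feats)
d_shifted = frechet_distance(real_feats, fake_feats)
print(d_identical, d_shifted)
```

Because the distance depends only on the mean and covariance of whatever features are supplied, swapping the feature extractor changes what the score is sensitive to, which is the design decision the abstract investigates.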

A fundamental and still largely unsolved question in the context of Generative Adversarial Networks is whether they are truly able to capture the real data distribution and, consequently, to sample from it. In particular, the multidimensional nature of image distributions makes evaluating the diversity of GAN distributions complex. Existing approaches provide only a partial understanding of this issue, leaving the question unanswered. In this work, we introduce a loop-training scheme for the systematic investigation of observable shifts between the distributions of real training data and GAN-generated data. Additionally, we introduce several bounded measures for distribution shifts, which are easy both to compute and to interpret. Overall, the combination of these methods allows an explorative investigation of innate limitations of current GAN algorithms. Our experiments on different datasets and multiple state-of-the-art GAN architectures show large shifts between input and output distributions, indicating that existing theoretical guarantees on the convergence of output distributions do not appear to hold in practice.

Ricard Durall, Janis Keuper

Existing long-tailed recognition methods, aiming to train models from long-tailed data, generally assume the models would be evaluated on a uniform test class distribution. However, the practical test class distribution often violates such an assumption (e.g., being long-tailed or even inversely long-tailed), which may lead existing methods to fail. In this work, we study a more practical task setting, called test-agnostic long-tailed recognition, where the training class distribution is long-tailed while the test class distribution is unknown and can be skewed arbitrarily. Besides class imbalance, this task poses another challenge: the class distribution shift between the training and test samples is unidentified. To address this, we propose a new method, called Test-time Aggregating Diverse Experts (TADE), that presents two solution strategies: (1) a novel skill-diverse expert learning strategy that trains diverse experts to excel at handling different test distributions from a single long-tailed training distribution; (2) a novel test-time expert aggregation strategy that leverages self-supervision to aggregate multiple experts for handling various test distributions. Promising results verify the effectiveness of TADE.

Yifan Zhang, Bryan Hooi, Rachel Hong, Jiashi Feng

In this work, we present Con$^{2}$DA, a simple framework that extends recent advances in semi-supervised learning to the semi-supervised domain adaptation (SSDA) problem. Our framework generates pairs of associated samples by applying stochastic data transformations to a given input. Associated data pairs are mapped to a feature representation space using a feature extractor. We use different loss functions to enforce consistency between the feature representations of associated data pairs. We show that these learned representations are useful for dealing with differences in data distributions in the domain adaptation problem. We performed experiments to study the main components of our model, and we show that (i) learning consistent and contrastive feature representations is crucial to extract good discriminative features across different domains, and (ii) our model benefits from the use of strong augmentation policies. With these findings, our method achieves state-of-the-art performance on three benchmark datasets for SSDA.

Manuel I Perez, Guillermo Cabrera-Vives, Pavlos Protopapas

Author Information

Shiori Sagawa (Stanford University)
Pang Wei Koh (Stanford University)
Fanny Yang (ETH)
Hongseok Namkoong (Columbia University)
Jiashi Feng (National University of Singapore)
Kate Saenko (Boston University & MIT-IBM Watson AI Lab, IBM Research)
Percy Liang (Stanford University)
Sarah Bird (Microsoft)

Sarah’s work focuses on research and emerging technology strategy for AI products in Azure. Sarah works to accelerate the adoption and positive impact of AI by bringing together the latest innovations in research with the best of open source and product expertise to create new tools and technologies. Sarah is currently leading Responsible AI for the Azure Cognitive Services. Prior to joining the Cognitive Services, Sarah led the development of responsible AI tools in Azure Machine Learning. She is an active member of the Microsoft AETHER committee, where she works to develop and drive company-wide adoption of responsible AI principles, best practices, and technologies. Sarah was one of the founding researchers in the Microsoft FATE research group and, prior to joining Microsoft, worked on AI fairness at Facebook. Sarah is an active contributor to the open source ecosystem: she co-founded ONNX, Fairlearn, and OpenDP’s SmartNoise, and was a leader in the PyTorch 1.0 and InterpretML projects. She was an early member of the machine learning systems research community and has been active in growing and forming the community. She co-founded the MLSys research conference and the Learning Systems workshops. She has a Ph.D. in computer science from UC Berkeley, advised by Dave Patterson, Krste Asanovic, and Burton Smith.

Sergey Levine (UC Berkeley)

More from the Same Authors