This workshop brings together domain experts and ML researchers working on mitigating distribution shifts in real-world applications.
Distribution shifts, where a model is deployed on a data distribution different from what it was trained on, pose significant robustness challenges in real-world ML applications. Such shifts are often unavoidable in the wild and have been shown to substantially degrade model performance in applications such as biomedicine, wildlife conservation, sustainable development, robotics, education, and criminal justice. For example, models can systematically fail when tested on patients from different hospitals or people from different demographics.
This workshop aims to convene a diverse set of domain experts and methods-oriented researchers working on distribution shifts. We are broadly interested in methods, evaluations and benchmarks, and theory for distribution shifts, and we are especially interested in work on distribution shifts that arise naturally in real-world application contexts. Examples of relevant topics include, but are not limited to:
- Examples of real-world distribution shifts in various application areas. We especially welcome applications that are not widely discussed in the ML research community, e.g., education, sustainable development, and conservation. We encourage submissions that characterize distribution shifts and their effects in real-world applications; it is not at all necessary to propose a solution that is algorithmically novel.
- Methods for improving robustness to distribution shifts. Relevant settings include domain generalization, domain adaptation, and subpopulation shifts, and we are interested in a wide range of approaches, from uncertainty estimation to causal inference to active data collection. We welcome methods that can work across a variety of shifts, as well as more domain-specific methods that incorporate prior knowledge on the types of shifts we wish to be robust to. We encourage evaluating these methods on real-world distribution shifts.
- Empirical and theoretical characterization of distribution shifts. Distribution shifts can vary widely in the way in which the data distribution changes, as well as the empirical trends they exhibit. What empirical trends do we observe? What empirical or theoretical frameworks can we use to characterize these different types of shifts and their effects? What kinds of theoretical settings capture useful components of real-world distribution shifts?
- Benchmarks and evaluations. We especially welcome contributions for subpopulation shifts, as they are underrepresented in current ML benchmarks. We are also interested in evaluation protocols that move beyond the standard assumption of fixed training and test splits: for which applications would we need to consider other forms of shifts, such as streams of continually changing data or feedback loops between models and data?
Sat 7:00 a.m. - 7:10 a.m.

Opening Remarks
(Opening remarks for DistShift 2022)
Sat 7:10 a.m. - 7:35 a.m.

Domain Adaptation: Theory, Algorithms, and Open Library
(Invited Talk)
Mingsheng Long
Sat 7:35 a.m. - 8:00 a.m.

Machine-learning, distribution shifts and extrapolation in the Earth System
(Invited Talk)
Markus Reichstein
Sat 8:00 a.m. - 8:30 a.m.

Coffee Break
Sat 8:30 a.m. - 8:55 a.m.

The promises and pitfalls of CVAR
(Invited Talk)
Pradeep Ravikumar
Sat 9:00 a.m. - 9:45 a.m.

Panel Discussion
(In-person Panel Discussion)
Behnam Neyshabur · David Sontag · Pradeep Ravikumar · Erin Hartman
Sat 9:45 a.m. - 11:00 a.m.

Lunch Break
Sat 11:00 a.m. - 12:30 p.m.

Poster Session
Sat 12:30 p.m. - 12:40 p.m.

First Steps Toward Understanding the Extrapolation of Nonlinear Models to Unseen Domains
(Spotlight)
Real-world machine learning applications often involve deploying neural networks to domains that are not seen at training time. Hence, we need to understand the extrapolation of nonlinear models: under what conditions on the distributions and function class can models be guaranteed to extrapolate to new test distributions? The question is very challenging because even two-layer neural networks cannot be guaranteed to extrapolate outside the support of the training distribution without further assumptions on the domain shift. This paper makes some initial steps towards analyzing the extrapolation of nonlinear models for structured domain shift. We primarily consider settings where the marginal distribution of each coordinate of the data (or subset of coordinates) does not shift significantly across the training and test distributions, but the joint distribution may have a much bigger shift. We prove that the family of nonlinear models of the form $f(x)=\sum f_i(x_i)$, where $f_i$ is an arbitrary function on the subset of features $x_i$, can extrapolate to unseen distributions, if the covariance of the features is well-conditioned. To the best of our knowledge, this is the first result that goes beyond linear models and the bounded density ratio assumption, even though the assumptions on the distribution shift and function class are stylized.

Kefan Dong · Tengyu Ma
Sat 12:40 p.m. - 12:50 p.m.

Learning Invariant Representations under General Interventions on the Response
(Spotlight)
It has become increasingly common nowadays to collect observations of feature and response pairs from different environments. As a consequence, one has to apply learned predictors to data with a different distribution due to distribution shifts. One principled approach is to adopt structural causal models to describe training and test models, following the invariance principle, which says that the conditional distribution of the response given its predictors remains the same across environments. However, this principle might be violated in practical settings when the response is intervened. A natural question is whether it is still possible to identify other forms of invariance to facilitate prediction in unseen environments. To shed light on this challenging scenario, we introduce the invariant matching property (IMP), an explicit relation that captures interventions through an additional feature. This leads to an alternative form of invariance that enables a unified treatment of general interventions on the response. We analyze the asymptotic generalization errors of our method under both the discrete and continuous environment settings, where the continuous case is handled by relating it to semiparametric varying coefficient models. We present algorithms that show competitive performance compared to existing methods over various experimental settings.
Kang Du · Yu Xiang
Sat 12:50 p.m. - 1:00 p.m.

CAREER: Economic Prediction of Labor Sequence Data Under Distribution Shift
(Spotlight)
Labor economists regularly analyze employment data by fitting predictive models to small, carefully constructed longitudinal survey datasets. Although modern machine learning methods offer promise for such problems, these survey datasets are too small to take advantage of them. In recent years, large datasets of online resumes have also become available, providing data about the career trajectories of millions of individuals. However, the distribution of these large resume datasets differs in meaningful ways from the survey datasets used for economic estimation; standard econometric models cannot take advantage of their scale or make predictions under distribution shift. To this end, we develop CAREER, a transformer-based model that uses transfer learning to learn representations of job sequences. CAREER is first fit to large, passively collected resume data and then fine-tuned on samples of the downstream data distribution of interest. We find that CAREER forms accurate predictions of job sequences, achieving state-of-the-art predictive performance on three widely used economics datasets. We also find that CAREER is adept at making predictions under distribution shifts in time.
Keyon Vafa · Emil Palikot · Tianyu Du · Ayush Kanodia · Susan Athey · David Blei
Sat 1:00 p.m. - 1:10 p.m.

Tackling Distribution Shifts in Federated Learning with Superquantile Aggregation
(Spotlight)
Federated learning has emerged as the predominant framework for distributed machine learning over decentralized data, e.g., on mobile phones. The usual approaches suffer from a distribution shift: the model is trained to fit the average population distribution but is deployed on individual clients, whose data distributions can be quite different. We present a distributionally robust approach to federated learning based on a risk measure known as the superquantile and show how to optimize it by interleaving federated averaging steps with quantile computation. We demonstrate experimentally that our approach is competitive with the usual ones in terms of average error and outperforms them in terms of tail statistics of the error.
Krishna Pillutla · Yassine Laguel · Jérôme Malick · Zaid Harchaoui
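To make the superquantile risk measure mentioned in this abstract concrete, here is a minimal sketch. It is not the authors' algorithm: it only computes a simple discrete approximation of the superquantile (the average of the worst `1 - theta` fraction of per-client losses). The function name `superquantile`, the parameter `theta`, and the tail-rounding rule are assumptions made for illustration; the federated averaging steps are not shown.

```python
import numpy as np


def superquantile(losses, theta):
    """Approximate superquantile: mean of the worst (1 - theta) fraction of losses.

    Assumption: the tail size is rounded up to a whole number of clients,
    so fractional tail mass is not handled exactly.
    """
    sorted_losses = np.sort(np.asarray(losses, dtype=float))
    tail_size = max(1, int(np.ceil((1.0 - theta) * sorted_losses.size)))
    return float(sorted_losses[-tail_size:].mean())


# Hypothetical per-client losses: the plain average hides the worst clients.
client_losses = [1.0, 2.0, 3.0, 4.0]
average_loss = superquantile(client_losses, theta=0.0)   # all clients
tail_loss = superquantile(client_losses, theta=0.5)      # worst half only
```

With `theta=0.0` the measure reduces to the ordinary average, while larger `theta` focuses on the clients with the highest error, which is the tail behavior the abstract refers to.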
Sat 1:10 p.m. - 1:20 p.m.

Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization
(Spotlight)
A common explanation for the failure of deep networks to generalize out-of-distribution is that they fail to recover the "correct" features. We challenge this notion with a simple experiment which suggests that ERM already learns sufficient features and that the current bottleneck is not feature learning, but robust regression. We therefore argue that devising simpler methods for learning predictors on existing features is a promising direction for future research. Towards this end, we introduce Domain-Adjusted Regression (DARE), a convex objective for learning a linear predictor that is provably robust under a new model of distribution shift. Rather than learning one function, DARE performs a domain-specific adjustment to unify the domains in a canonical latent space and learns to predict in this space. Under a natural model, we prove that the DARE solution is the minimax-optimal predictor for a constrained set of test distributions. Further, we provide the first finite-environment convergence guarantee to the minimax risk, improving over existing analyses which only yield minimax predictors after an environment threshold. Evaluated on fine-tuned features, we find that DARE compares favorably to prior methods, consistently achieving equal or better performance.
Elan Rosenfeld · Pradeep Ravikumar · Andrej Risteski
Sat 1:20 p.m. - 1:30 p.m.

Data Feedback Loops: Model-driven Amplification of Dataset Biases
(Spotlight)
Datasets scraped from the internet have been critical to large-scale machine learning. Yet this success puts the utility of future internet-derived datasets at potential risk, as model outputs begin to replace human annotations as a source of supervision. In this work, we formalize a system where interactions with one model are recorded as history and scraped as training data in the future. We then analyze its stability over time by tracking changes to a test-time bias statistic (e.g., gender bias of model predictions). We find that the degree of bias amplification is closely linked to whether the model’s outputs behave like samples from the training distribution, a behavior which we characterize and define as consistent calibration. Experiments in three conditional prediction scenarios – image classification, visual role-labeling, and language generation – demonstrate that models that exhibit a sampling-like behavior are more calibrated and thus more stable. Based on this insight, we propose an intervention to help calibrate and stabilize unstable feedback systems.
Rohan Taori · Tatsunori Hashimoto
Sat 1:30 p.m. - 1:45 p.m.

Coffee Break
Sat 1:45 p.m. - 2:10 p.m.

External Validity: Framework, Design, and Analysis
(Invited Talk)
Erin Hartman
Sat 2:10 p.m. - 2:35 p.m.

Bringing real-world data to bear in addressing distribution shifts: a sociolinguistically-informed analysis of ASR errors
(Invited Talk)
Alicia Beckford Wassink
Sat 2:35 p.m. - 3:00 p.m.

Geospatial Distribution Shifts in Ecology: Mapping the Urban Forest
(Invited Talk)
Sara Beery
Sat 2:58 p.m. - 3:00 p.m.

Closing Remarks
(Closing remarks for DistShift 2022)


Performative Prediction with Neural Networks
(Poster)
Performative prediction is a framework for learning models that influence the data they intend to predict. We focus on finding classifiers that are performatively stable, i.e., optimal for the data distribution they induce. Standard convergence results for the method of repeated risk minimization assume that the data distribution is Lipschitz continuous with respect to the model's parameters. Under this assumption, the loss must be strongly convex and smooth in these parameters; otherwise, the method will diverge for some problems. In this work, we instead assume that the data distribution is Lipschitz continuous with respect to the model's predictions, a more natural assumption for performative systems. As a result, we are able to significantly relax the assumptions on the loss function. In particular, we do not need to assume convexity with respect to the model's parameters. As an illustration, we introduce a resampling procedure that models realistic distribution shifts and show that it satisfies our assumptions. We support our theory by showing that one can learn performatively stable classifiers with neural networks making predictions about real data that shift according to our proposed procedure.
Mehrnaz Mofakhami · Ioannis Mitliagkas · Gauthier Gidel


Improving Domain Generalization with Interpolation Robustness
(Poster)
We address domain generalization by viewing the underlying distributional shift as interpolation between domains, and we devise an algorithm to learn a representation that is robustly invariant under such interpolation, an approach we call interpolation robustness. Through extensive experiments, we show that our approach significantly outperforms the recent state-of-the-art algorithm \citet{NEURIPS2021_2a271795} and the baseline DeepAll in a limited data setting on the PACS and VLCS datasets.
Ragja Palakkadavath · Thanh Nguyen-Tang · Sunil Gupta · Svetha Venkatesh


Deconstructing Distributions: A Pointwise Framework of Learning
(Poster)
In machine learning, we traditionally evaluate the performance of a single model, averaged over a collection of test inputs. In this work, we propose a new approach: we measure the performance of a collection of models when evaluated at a single input point. Specifically, we study a point's profile: the relationship between models' average performance on the test distribution and their pointwise performance on this individual point. We find that profiles can yield new insights into the structure of both models and data, in- and out-of-distribution. For example, we empirically show that real data distributions consist of points with qualitatively different profiles. On one hand, there are
Gal Kaplun · Nikhil Ghosh · Saurabh Garg · Boaz Barak · Preetum Nakkiran


Bitrate-Constrained DRO: Beyond Worst Case Robustness To Unknown Group Shifts
(Poster)
Although training machine learning models for robustness is critical for real-world adoption, determining how to best ensure robustness remains an open problem. Some methods (e.g., DRO) are overly conservative, while others (e.g., Group DRO) require domain knowledge that may be hard to obtain. In this work, we address limitations in prior approaches by assuming a more nuanced form of group shift: conditioned on the label, we assume that the true group function is simple. For example, we may expect that group shifts occur along high-level features (e.g., image background, lighting). Thus, we aim to learn a model that maintains high accuracy on simple group functions realized by these features, but need not spend valuable model capacity achieving high accuracy on contrived groups of examples. Based on this idea, we formulate a two-player game where, conditioned on the label, the adversary can only separate datapoints into potential groups using simple features, which corresponds to a bitrate constraint on the adversary's capacity. Our resulting practical algorithm, Bitrate-Constrained DRO (BR-DRO), does not require group information on training samples yet matches the performance of Group DRO. Our theoretical analysis reveals that in some settings the BR-DRO objective can provably yield statistically efficient and less pessimistic solutions than unconstrained DRO.
Amrith Setlur · Don Dennis · Benjamin Eysenbach · Aditi Raghunathan · Chelsea Finn · Virginia Smith · Sergey Levine


Impact of realistic properties of the point spread function on classification tasks to reveal a possible distribution shift
(Poster)
Image classification is a long-standing task in computer vision, with deep neural networks (DNN) producing excellent results on various challenges. However, they are required not only to perform with high accuracy on benchmarks such as ImageNet, but also to robustly handle images in adverse conditions, such as modified lighting, sharpness, weather conditions and image compression. Various benchmarks aimed at measuring robustness show that neural networks perform differently well under distribution shifts. While datasets such as ImageNet-C model, for example, common corruptions such as blur and adverse weather conditions, we argue that the properties of the optical system and the potentially resulting complex lens blur are insufficiently well studied in the literature. This study evaluates the impact of realistic optical corruptions on ImageNet classification. The proposed complex corruption kernels are direction and wavelength dependent and include chromatic aberration, which are all to be expected in realistic scenarios such as autonomous driving applications. Our experiments on twelve different DNN models show significant differences of more than 5% in the top-1 classification error, when compared to the model performances on matched ImageNet-C blur kernels.
Patrick Müller · Alexander Braun · Margret Keuper


A Simple Baseline that Questions the Use of Pretrained-Models in Continual Learning
(Poster)
With the success of pretraining techniques in representation learning, a number of continual learning methods based on pretrained models have been proposed. Some of these methods design continual learning mechanisms on the pretrained representations and only allow minimum updates or even no updates of the backbone models during the training of continual learning. In this paper, we question whether the complexity of these models is needed to achieve good performance by comparing them to a simple baseline that we designed. We argue that the pretrained feature extractor itself can be strong enough to achieve a competitive or even better continual learning performance on the Split-CIFAR100 and CoRe 50 benchmarks. To validate this, we build a very simple baseline that 1) uses the frozen pretrained model to extract image features for every class encountered during the continual learning stage and computes their corresponding mean features on training data, and 2) predicts the class of the input based on the nearest neighbor distance between test samples and mean features of the classes; i.e., a Nearest Mean Classifier (NMC). This baseline is single-headed, exemplar-free, and can be task-free (by updating the means continually). This baseline achieved $88.53\%$ on 10-Split-CIFAR100, surpassing most state-of-the-art continual learning methods that are all initialized using the same pretrained transformer model. We hope our baseline may encourage future progress in designing learning systems that can continually add quality to the learning representations even if they started from some pretrained weights.

Paul Janson · Wenxuan Zhang · Rahaf Aljundi · Mohamed Elhoseiny
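The two-step baseline described in this abstract can be sketched roughly as follows. This is an illustrative sketch, not the authors' code: it assumes features have already been extracted by some frozen pretrained model (not shown), and it assumes Euclidean distance to the class means. The class name `NearestMeanClassifier` and its methods are hypothetical.

```python
import numpy as np


class NearestMeanClassifier:
    """Keeps a running mean feature per class and predicts the nearest mean."""

    def __init__(self):
        self.sums = {}    # class label -> sum of feature vectors seen so far
        self.counts = {}  # class label -> number of feature vectors seen so far

    def update(self, features, labels):
        # Step 1: accumulate per-class feature sums; can be called once per task
        # or continually, since no past examples need to be stored.
        for feature, label in zip(np.asarray(features, dtype=float), labels):
            self.sums[label] = self.sums.get(label, 0.0) + feature
            self.counts[label] = self.counts.get(label, 0) + 1

    def predict(self, features):
        # Step 2: assign each test feature to the class with the closest mean
        # (Euclidean distance is an assumption of this sketch).
        classes = sorted(self.sums)
        means = np.stack([self.sums[c] / self.counts[c] for c in classes])
        feats = np.asarray(features, dtype=float)
        distances = np.linalg.norm(feats[:, None, :] - means[None, :, :], axis=2)
        return [classes[i] for i in distances.argmin(axis=1)]


# Toy usage with made-up 2-D "features"; classes arrive in two separate tasks.
nmc = NearestMeanClassifier()
nmc.update([[0.0, 0.0], [0.0, 2.0]], [0, 0])
nmc.update([[10.0, 10.0]], [1])
predictions = nmc.predict([[0.0, 1.0], [9.0, 9.0]])
```

Because only per-class sums and counts are kept, new classes can be added at any time without revisiting earlier data, which matches the exemplar-free property the abstract highlights.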


RLSBench: A Large-Scale Empirical Study of Domain Adaptation Under Relaxed Label Shift
(Poster)
Despite the emergence of principled methods for domain adaptation under label shift (where only the class balance changes), the sensitivity of these methods to natural-seeming covariate shifts remains precariously underexplored. Meanwhile, popular deep domain adaptation heuristics, despite showing promise on benchmark datasets, tend to falter when faced with shifts in the class balance. Moreover, it's difficult to assess the state of the field owing to inconsistencies among relevant papers in evaluation criteria, datasets, and baselines. In this paper, we introduce RLSbench, a large-scale benchmark for such relaxed label shift settings, consisting of 11 vision datasets spanning > 200 distribution shift pairs with different class proportions. We evaluate 12 popular domain adaptation methods, demonstrating a more widespread susceptibility to failure under extreme shifts in the class proportions than was previously known. We develop an effective meta-algorithm, compatible with most deep domain adaptation heuristics, that consists of the following two steps: (i) pseudo-balance the data at each epoch; and (ii) adjust the final classifier with (an estimate of) the target label distribution. Furthermore, we discover that batch-norm adaptation of a model trained on source with the aforementioned corrections offers a strong baseline, largely missing from prior comparisons. We hope that these findings and the availability of RLSbench will encourage researchers to rigorously evaluate proposed methods in relaxed label shift settings.
Saurabh Garg · Nick Erickson · James Sharpnack · Alexander Smola · Sivaraman Balakrishnan · Zachary Lipton
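Step (ii) of the meta-algorithm in this abstract, adjusting the classifier with an estimate of the target label distribution, can be illustrated with the standard prior-shift correction, which reweights predicted class probabilities by the ratio of target to source class priors and renormalizes. Whether the paper uses exactly this form is an assumption; the function name `adjust_for_label_shift` and its arguments are hypothetical, and estimating the target prior is not shown.

```python
import numpy as np


def adjust_for_label_shift(probs, source_prior, target_prior):
    """Reweight class probabilities by target/source prior ratios, then renormalize.

    probs: array of shape (n_samples, n_classes) from a source-trained classifier.
    source_prior, target_prior: class proportions (target_prior is an estimate).
    """
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(target_prior, dtype=float) / np.asarray(source_prior, dtype=float)
    adjusted = probs * weights
    return adjusted / adjusted.sum(axis=1, keepdims=True)


# Toy usage: an undecided prediction is pulled toward the class that is
# assumed to be more common in the target domain.
adjusted = adjust_for_label_shift(
    probs=[[0.5, 0.5]],
    source_prior=[0.5, 0.5],
    target_prior=[0.9, 0.1],
)
```

When the source and target priors coincide, the weights are all one and the predictions are returned unchanged.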


Mitigating Dataset Bias by Using Per-sample Gradient
(Poster)
The performance of deep neural networks is strongly influenced by the training dataset setup. In particular, when attributes having a strong correlation with the target attribute are present, the trained model can provide unintended prejudgments and show significant inference errors (i.e., the dataset bias problem). Various methods have been proposed to mitigate dataset bias, and their emphasis is on weakly correlated samples, called bias-conflicting samples. These methods are based on explicit bias labels provided by humans. However, such methods require human costs. Recently, several studies have tried to reduce human intervention by utilizing the output space values of neural networks, such as feature space, logits, loss, or accuracy. However, these output space values may be insufficient for the model to understand the bias attributes well. In this study, we propose a debiasing algorithm leveraging gradients called PGD (Per-sample Gradient-based Debiasing). PGD comprises three steps: (1) training a model on uniform batch sampling, (2) setting the importance of each sample in proportion to the norm of the sample gradient, and (3) training the model using importance-batch sampling, whose probability is obtained in step (2). Compared with existing baselines for various datasets, the proposed method showed state-of-the-art accuracy for the classification task.
Sumyeong Ahn · SeongYoon Kim · Se-Young Yun
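Steps (2) and (3) of the procedure in this abstract can be sketched as below. This is a simplified illustration rather than the authors' implementation: it assumes the per-sample gradient norms from the step (1) model are already available as an array, and it assumes batches are drawn with replacement. The function names and the `seed` parameter are hypothetical.

```python
import numpy as np


def importance_probabilities(grad_norms):
    """Step (2): sampling probability of each sample, proportional to its gradient norm."""
    norms = np.asarray(grad_norms, dtype=float)
    return norms / norms.sum()


def sample_batch(grad_norms, batch_size, seed=0):
    """Step (3): draw a batch of sample indices using the importance probabilities.

    Assumption: sampling with replacement; the abstract does not specify this.
    """
    rng = np.random.default_rng(seed)
    probs = importance_probabilities(grad_norms)
    return rng.choice(len(probs), size=batch_size, replace=True, p=probs)


# Toy usage: the third sample has the largest gradient norm, so it is drawn
# twice as often (in expectation) as each of the other two.
probs = importance_probabilities([1.0, 1.0, 2.0])
batch_indices = sample_batch([1.0, 1.0, 2.0], batch_size=8)
```

The intuition suggested by the abstract is that samples with large gradient norms under the initially trained model are likely to be bias-conflicting, so they are revisited more often in the second training phase.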


Train Offline, Test Online: A Real Robot Learning Benchmark
(Poster)
Three challenges limit the progress of robot learning research: robots are expensive (few labs can participate), everyone uses different robots (findings do not generalize across labs), and we lack internet-scale robotics data. We take on these challenges via a new benchmark: Train Offline, Test Online (TOTO). TOTO provides remote users with access to shared robots for evaluating methods on common tasks and an open-source dataset of these tasks for offline training. Its manipulation task suite requires challenging generalization to unseen objects, positions, and lighting. We present initial results on TOTO comparing five pretrained visual representations and four offline policy learning baselines, remotely contributed by five institutions. The real promise of TOTO, however, lies in the future: we release the benchmark for additional submissions from any user, enabling easy, direct comparison to several methods without the need to obtain hardware or collect data.
Gaoyue Zhou · Victoria Dean · Mohan Kumar Srirama · Aravind Rajeswaran · Jyothish Pari · Kyle Hatch · Aryan Jain · Tianhe Yu · Pieter Abbeel · Lerrel Pinto · Chelsea Finn · Abhinav Gupta



The Value of Out-of-distribution Data
(Poster)
More data is expected to help us generalize to a task. But real datasets can contain out-of-distribution (OOD) data; this can come in the form of heterogeneity such as intra-class variability but also in the form of temporal shifts or concept drifts. We demonstrate a counter-intuitive phenomenon for such problems: generalization error of the task can be a non-monotonic function of the number of OOD samples; a small number of OOD samples can improve generalization but if the number of OOD samples is beyond a threshold, then the generalization error can deteriorate. We also show that if we know which samples are OOD, then using a weighted objective between the target and OOD samples ensures that the generalization error decreases monotonically. We demonstrate and analyze this phenomenon using linear classifiers on synthetic datasets and medium-sized neural networks on vision benchmarks such as MNIST, CIFAR-10, CINIC-10, PACS, and DomainNet, and observe the effect that data augmentation, hyperparameter optimization, and pretraining have on this behavior.
Ashwin De Silva · Rahul Ramesh · Carey E Priebe · Pratik Chaudhari · Joshua T Vogelstein


Reliability benchmarks for image segmentation
(Poster)
Recent work has shown the importance of reliability, where model performance is assessed under stress conditions pervasive in real-world deployment. In this work, we examine reliability tasks in the setting of semantic segmentation, a dense output problem that has typically only been evaluated using in-distribution predictive performance, for example, the mean intersection over union score on the Cityscapes validation set. To reduce the gap toward reliable deployment in the real world, we compile a benchmark involving existing (and newly constructed) distribution shifts and metrics. We evaluate current models and several baselines to determine how well segmentation models make robust predictions across multiple types of distribution shift and flag when they don’t know.
Estefany Kelly Buchanan · Michael Dusenberry · Jie Ren · Kevin Murphy · Balaji Lakshminarayanan · Dustin Tran


Adaptive Pretraining of Language Models for Better Logical Reasoning
(Poster)
Logical reasoning over text is an important ability that requires understanding the logical information present in the text and reasoning through it to infer new conclusions. Prior works on improving the logical reasoning ability of language models require complex processing of training data (e.g., aligning symbolic knowledge to text), yielding task-specific data augmentation solutions that restrict the learning of general logical reasoning skills. In this work, we propose AERIE, an adaptively pretrained language model that has improved logical reasoning abilities. We select a subset of Wikipedia, based on a set of logical inference keywords, for continued pretraining of a language model. We use two self-supervised loss functions: a modified masked language modeling loss where only specific parts-of-speech words, which would likely require more reasoning than basic language understanding, are masked, and a sentence classification loss that teaches the model to distinguish between entailment and contradiction types of sentences. The proposed training paradigm is both simple and generalizable across tasks. We demonstrate the effectiveness of AERIE by comparing it with prior baselines on two logical reasoning datasets. AERIE performs comparably on ReClor and outperforms baselines on LogiQA.
Soumya Sanyal · Yichong Xu · Shuohang Wang · Ziyi Yang · Reid Pryzant · Wenhao Yu · Chenguang Zhu · Xiang Ren


Using Interventions to Improve Out-of-Distribution Generalization of Text-Matching Systems
(Poster)
Given a user's input text, text-matching recommender systems output relevant items by comparing the input text to available items' descriptions, such as product-to-product recommendation on e-commerce platforms. As users' interests and item inventory are expected to change, it is important for a text-matching system to generalize to data shifts, a task known as out-of-distribution (OOD) generalization. However, we find that the popular approach of fine-tuning a large, base language model on paired item relevance data (e.g., user clicks) can be counter-productive for OOD generalization. For a product recommendation task, fine-tuning obtains worse accuracy than the base model when recommending items in a new category or for a future time period. To explain this generalization failure, we consider an intervention-based importance metric, which shows that a fine-tuned model captures spurious correlations and fails to learn the causal features that determine the relevance between any two text inputs. Moreover, standard methods for causal regularization do not apply in this setting, because unlike in images, there exist no universally spurious features in a text-matching task (the same token may be spurious or causal depending on the text it is being matched to). For OOD generalization on text inputs, therefore, we highlight a different goal: avoiding high importance scores for certain features. We do so using an intervention-based regularizer that constrains the importance score of any token on the model's relevance score to be similar to the base model. Results on Amazon product and 3 question recommendation datasets show that our proposed regularizer improves generalization for both in-distribution and OOD evaluation, especially in difficult scenarios when the base model is not accurate.
Parikshit Bansal · Yashoteja Prabhu · Emre Kiciman · Amit Sharma


Dropout Disagreement: A Recipe for Group Robustness with Fewer Annotations
(Poster)
Empirical risk minimization (ERM) of neural networks can cause over-reliance on spurious correlations and poor generalization on minority groups. Deep feature reweighting (DFR) improves group robustness via last-layer retraining, but it requires full group and class annotations for the reweighting dataset. To eliminate this impractical requirement, we propose a one-shot active learning method which constructs the reweighting dataset with the disagreement points between the ERM model with and without dropout activated. Our experiments show our approach achieves 95% of DFR performance on the Waterbirds and CelebA datasets despite using no group annotations and up to $7.5\times$ fewer class annotations.

Tyler LaBonte · Abhishek Kumar · Vidya Muthukumar
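The selection rule described in this abstract, picking the points where the ERM model's predictions differ with and without dropout, can be sketched as follows. This is a minimal illustration under stated assumptions: the predicted labels from both forward passes are assumed to be precomputed (the model and the dropout pass are not shown), a single stochastic pass is assumed, and the function name `disagreement_indices` is hypothetical.

```python
import numpy as np


def disagreement_indices(preds_without_dropout, preds_with_dropout):
    """Indices of samples whose predicted label changes when dropout is activated."""
    base = np.asarray(preds_without_dropout)
    noisy = np.asarray(preds_with_dropout)
    return np.flatnonzero(base != noisy)


# Toy usage with made-up predicted labels for four held-out samples.
selected = disagreement_indices([0, 1, 1, 0], [0, 0, 1, 1])
# Only the selected samples would then be sent for class annotation and
# used as the reweighting dataset for last-layer retraining.
```

The appeal of this rule is that it needs no group labels at all: class labels are requested only for the (typically small) set of disagreement points.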


Domain Generalization for Robust Model-Based Offline Reinforcement Learning
(Poster)
Existing offline reinforcement learning (RL) algorithms typically assume that training data is either: 1) generated by a known policy, or 2) of entirely unknown origin. We consider multi-demonstrator offline RL, a middle ground where we know which demonstrators generated each dataset, but make no assumptions about the underlying policies of the demonstrators. This is the most natural setting when collecting data from multiple human operators, yet remains unexplored. Since different demonstrators induce different data distributions, we show that this can be naturally framed as a domain generalization problem, with each demonstrator corresponding to a different domain. Specifically, we propose Domain-Invariant Model-based Offline RL (DIMORL), where we apply Risk Extrapolation (REx) (Krueger et al., 2020) to the process of learning dynamics and rewards models. Our results show that models trained with REx exhibit improved domain generalization performance when compared with the natural baseline of pooling all demonstrators' data. We observe that the resulting models frequently enable the learning of superior policies in the offline model-based RL setting, can improve the stability of the policy learning process, and potentially increase exploration.
Alan Clark · Shoaib Siddiqui · Robert Kirk · Usman Anwar · Stephen Chung · David Krueger
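
To make the role of Risk Extrapolation concrete, here is a minimal sketch of a variance-penalized objective over per-demonstrator risks. It assumes the variance-penalty form of REx (often called V-REx) and averages the risks rather than summing them; the abstract does not say which REx variant is used, and the function name, `beta`, and the loss values are hypothetical.

```python
from statistics import mean, pvariance

def vrex_objective(domain_risks, beta=1.0):
    """Mean per-domain risk plus beta times the variance of those risks."""
    if len(domain_risks) < 2:
        raise ValueError("a variance penalty needs risks from at least two domains")
    return mean(domain_risks) + beta * pvariance(domain_risks)

# Hypothetical dynamics-model losses, one per demonstrator (illustrative numbers).
pooled_like = vrex_objective([0.2, 0.6], beta=0.0)   # no penalty: plain average
penalized = vrex_objective([0.2, 0.6], beta=10.0)    # unequal risks are penalized
balanced = vrex_objective([0.4, 0.4], beta=10.0)     # equal risks incur no penalty
```

With `beta=0` the objective reduces to averaging the per-demonstrator risks, which matches the pooling baseline only when every demonstrator contributes the same amount of data. Larger `beta` favors models whose error is similar across demonstrators.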


Multi-Domain Long-Tailed Learning by Augmenting Disentangled Representations
(Poster)
There is an inescapable long-tailed class-imbalance issue in many real-world classification problems. Existing long-tailed classification methods focus on the single-domain setting, where all examples are drawn from the same distribution. However, real-world scenarios often involve multiple domains with distinct imbalanced class distributions. We study this multi-domain long-tailed learning problem and aim to produce a model that generalizes well across all classes and domains. Towards that goal, we introduce TALLY, which produces invariant predictors by augmenting hidden representations in a balanced way over domains and classes. Built upon a proposed selective balanced sampling strategy, TALLY achieves this by mixing the semantic representation of one example with the domain-associated nuisances of another, producing a new representation for use as data augmentation. To improve the disentanglement of semantic representations, TALLY further utilizes a domain-invariant class prototype that averages out domain-specific effects. We evaluate TALLY on four long-tailed variants of classical domain generalization benchmarks and two real-world imbalanced multi-domain datasets. The results indicate that TALLY consistently outperforms other state-of-the-art methods in both subpopulation shift and domain shift.
Huaxiu Yao · Xinyu Yang · Allan Zhou · Chelsea Finn


Meta-Adaptive Stock Movement Prediction with Two-Stage Representation Learning
(Poster)
Stock movement prediction has always been a tough but attractive task for researchers in machine learning and data mining. Generally speaking, two challenges for stock time series prediction remain under-explored. One is the overfitting of deep learning models due to data shortage, and the other is the potential domain shift that may happen during the evolution of stock time series. In this paper, we present Meta-Adaptive Stock movement prediction with two-StagE Representation learning (MASSER), a novel framework for stock movement prediction based on self-supervised learning and meta-learning. Specifically, we first build a two-stage representation learning framework: the first stage aims at unified embedding learning for the data, and the second stage, which builds on the first, is used for temporal domain shift detection via self-supervised learning. Then, we formalize the problem of stock movement prediction into a standard meta-learning setting. Inspired by importance sampling, we estimate sampling probabilities for tasks to balance the domain discrepancy caused by evolving temporal domains. Extensive experimental results on two open-source datasets show that our framework, with two simple but classical architectures (GRU and ResNet) as the model, achieves improvements of 5% to 9.5% in average accuracy compared to state-of-the-art baselines.
Donglin Zhan · Yusheng Dai · Yiwei Dong · Jinghai He · Zhenyi Wang · James Anderson


Scale-conditioned Adaptation for Large Scale Combinatorial Optimization
(Poster)
Deep reinforcement learning (DRL) for combinatorial optimization has drawn attention as an alternative to human-designed solvers. However, training DRL solvers for large-scale tasks remains challenging due to combinatorial optimization problems' NP-hardness. This paper proposes a novel scale-conditioned adaptation (SCA) scheme that improves the transferability of the pretrained solvers on larger-scale tasks. The main idea is to design a scale-conditioned policy by plugging a simple deep neural network, denoted as scale-conditioned network (SCN), into the existing DRL model. SCN extracts a hidden vector from a scale value, and then we add it to the representation vector of the pretrained DRL model. The increment of the representation vector captures the context of scale information and helps the pretrained model effectively adapt the policy to larger-scale tasks. Our method is verified to improve the zero-shot and few-shot performance of DRL-based solvers in various large-scale combinatorial optimization tasks.
Minsu Kim · Jiwoo SON · Hyeonah Kim · Jinkyoo Park
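
The additive conditioning described above can be sketched in a few lines. This is only an illustration under stated assumptions: the one-hidden-layer network, the `tanh` activation, the parameter layout, and both function names are hypothetical, and a practical version would likely normalize the scale value before feeding it to the network.

```python
import math

def scale_conditioned_network(scale, w1, b1, w2, b2):
    """Toy one-hidden-layer MLP mapping a scalar problem size to a vector."""
    hidden = [math.tanh(w * scale + b) for w, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]

def condition_on_scale(representation, scale, params):
    """Add the scale-dependent vector to a pretrained model's representation."""
    offset = scale_conditioned_network(scale, *params)
    return [r + o for r, o in zip(representation, offset)]

# Hypothetical parameters: one hidden unit, two-dimensional representation.
params = ([1.0], [0.0], [[2.0], [-2.0]], [0.0, 1.0])
adapted = condition_on_scale([1.0, 1.0], scale=0.0, params=params)
```

Because the scale information enters only as an added vector, the pretrained weights can stay untouched while the small network alone is trained on larger instances.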


Malign Overfitting: Interpolation and Invariance are Fundamentally at Odds
(Poster)
Learned classifiers should often possess certain invariance properties meant to encourage fairness, robustness, or out-of-distribution generalization. However, multiple recent works empirically demonstrate that common invariance-inducing regularizers are ineffective in the overparameterized regime, in which classifiers perfectly fit (i.e. interpolate) the training data. This suggests that the phenomenon of "benign overfitting", in which models generalize well despite interpolating, might not favorably extend to settings in which robustness or fairness are desirable. In this work we provide a theoretical justification for these observations. We prove that, even in the simplest of settings, any interpolating classifier (with nonzero margin) will not satisfy these invariance properties. We then propose and analyze an algorithm that, in the same setting, successfully learns a non-interpolating classifier that is provably invariant. We validate our theoretical observations regarding the conflict between interpolation and invariance on simulated data and the Waterbirds dataset.
Yoav Wald · Gal Yona · Uri Shalit · Yair Carmon


On the Abilities of Mathematical Extrapolation with Implicit Models
(Poster)
Deep neural networks excel on a variety of different tasks, often surpassing human intelligence. However, when presented with out-of-distribution data, these models tend to break down even on the simplest tasks. In this paper, we compare implicitly defined and classical deep learning models on a series of mathematical extrapolation tasks, where the models are tested with out-of-distribution samples during inference time. Throughout our experiments, implicit models greatly outperform classical deep learning networks that overfit the training distribution. We showcase implicit models' unique advantages for extrapolation thanks to their flexible and selective framework. Implicit models, with potentially unlimited depth, not only adapt well to out-of-distribution data but also understand the underlying structure of inputs much better.
Juliette Decugis · Max Emerling · Ashwin Ganesh · Alicia Tsai · Laurent El Ghaoui


Estimation of prediction error with known covariate shift
(Poster)
In supervised learning, the estimation of prediction error on unlabeled test data is an important task. Existing methods are usually built on the assumption that the training and test data are sampled from the same distribution, which is often violated in practice. As a result, traditional estimators like cross-validation (CV) will be biased and this may result in poor model selection. In this paper, we assume that we have a test dataset in which the feature values are available but not the outcome labels, and focus on a particular form of distributional shift, namely covariate shift. We propose an alternative method based on parametric bootstrap of the target of conditional error ErrX. Empirically, our method outperforms CV for both simulation and real data examples across different modeling tasks, and is comparable to state-of-the-art methods for image classification.
Hui Xu · Robert Tibshirani


A Synthetic Limit Order Book Dataset for Benchmarking Forecasting Algorithms under Distributional Shift
(Poster)
In electronic trading markets, limit order books (LOBs) provide information about pending buy/sell orders at various price levels for a given security. Recently, there has been a growing interest in using LOB data for resolving downstream machine learning tasks (e.g., forecasting). However, dealing with out-of-distribution (OOD) LOB data is challenging since distributional shifts are unlabeled in current publicly available LOB datasets. Therefore, it is critical to build a synthetic LOB dataset with labeled OOD samples serving as a testbed for developing models that generalize well to unseen scenarios. In this work, we utilize a multi-agent market simulator to build a synthetic LOB dataset with and without market stress scenarios, which allows for the design of controlled distributional shift benchmarking. Using the proposed synthetic dataset, we provide a holistic analysis of the forecasting performance of three different state-of-the-art forecasting methods. Our results reflect the need for increased researcher efforts to develop algorithms with robustness to distributional shifts in high-frequency time series data.
Defu Cao · Yousef El-Laham · Loc Trinh · Svitlana Vyetrenko · Yan Liu


A Closer Look at Model Adaptation using Feature Distortion and Simplicity Bias
(Poster)
In order to achieve strong in-distribution (ID) and out-of-distribution (OOD) generalization during transfer learning, it was recently argued that adaptation protocols should better leverage the expressivity of high-quality, pretrained models by controlling feature distortion (FD), i.e., the failure to update features orthogonal to the ID. However, in addition to OOD generalization, practical applications require that adapted models are also safe. To this end, we study the susceptibility of common adaptation protocols to simplicity bias (SB), i.e., the well-known propensity of neural networks to rely upon simple features, as this phenomenon has recently been shown to underlie several problems in safe generalization. Using a controllable, synthetic setting, we demonstrate that solely controlling FD is not sufficient to avoid SB, which harms safe generalization. Given the need to control both SB and FD for improved safety and ID/OOD generalization, we propose modifying a recently proposed protocol with the goal of reducing SB. We verify the effectiveness of these modified protocols in decreasing SB in a synthetic setting, and in jointly improving OOD generalization and safety on standard adaptation benchmarks.
Puja Trivedi · Danai Koutra · Jayaraman Thiagarajan


Task Modeling: Approximating Multitask Predictions for Cross-Task Transfer
(Poster)
We study the problem of learning a target task when data samples from several auxiliary source tasks are available. Examples of this problem appear in multitask learning, where several tasks are combined jointly, and weak supervision, where multiple programmatic labels are generated for each sample. Because of task data's heterogeneity, negative interference is a critical challenge for solving this problem. Previous works have measured first-order task affinity as an effective metric, yet it becomes less accurate for approximating higher-order transfers. We propose a procedure called task modeling to model first- and higher-order transfers. This procedure samples subsets of source tasks and estimates surrogate functions to approximate multitask predictions. We show theoretical and empirical results that task models can be estimated in nearly-linear time in the number of tasks and accurately approximate multitask predictions. Thus, the target task's performance can be optimized using task models to select source tasks. We validate this approach on various datasets and performance metrics. Our method increases accuracy by up to 3.6% over existing methods on five text classification tasks with noisy supervision sources. Additionally, task modeling can be applied to group robustness and fairness metrics. Ablation studies show that task models can accurately predict whether or not a set of up to four source tasks transfers positively to the target task.
Dongyue Li · Huy Nguyen · Hongyang Zhang


Generative Posterior Networks for Approximately Bayesian Epistemic Uncertainty Estimation
(Poster)
In many real-world problems, there is a limited set of training data, but an abundance of unlabeled data. We propose a new method, Generative Posterior Networks (GPNs), that uses unlabeled data to estimate epistemic uncertainty in high-dimensional problems. A GPN is a generative model that, given a prior distribution over functions, approximates the posterior distribution directly by regularizing the network towards samples from the prior. We prove theoretically that our method indeed approximates the Bayesian posterior and show empirically that it improves epistemic uncertainty estimation and scalability over competing methods.
Melrose Roderick · Felix Berkenkamp · Fatemeh Sheikholeslami · J. Zico Kolter


Graph-Relational Distributionally Robust Optimization
(Poster)
Out-of-distribution (OOD) generalization is a challenging machine learning problem yet highly desirable in many high-stake applications. Distributionally robust optimization (DRO) is a promising learning paradigm to tackle this challenge but suffers from several limitations. To address this challenge, we propose graph-relational distributionally robust optimization that trains OOD-resilient machine learning models by exploiting the topological structure of data distributions. Our approach can uniformly handle both fully-known and partially-known topological structures. Empirical results on both synthetic and real-world datasets demonstrate the effectiveness and flexibility of our method.
Fengchun Qiao · Xi Peng


A Unified Framework for Comparing Learning Algorithms
(Poster)
Understanding model biases is crucial to understanding how models will perform out-of-distribution (OOD). These biases often stem from particular design choices (e.g., architecture or data augmentation). We propose a framework for (learning) algorithm comparisons, wherein the goal is to find similarities and differences between models trained with two different learning algorithms. We begin by formalizing the goal of algorithm comparison as finding distinguishing feature transformations, input transformations that change the predictions of models trained with one learning algorithm but not the other. We then present a two-stage method for algorithm comparisons based on comparing how models use the training data, leveraging the recently proposed datamodel representations [IPE+22]. We demonstrate our framework through a case study comparing classifiers trained on the Waterbirds [SKH+20] dataset with/without ImageNet pretraining.
Harshay Shah · Sung Min Park · Andrew Ilyas · Aleksander Madry


Domain Generalization with Nuclear Norm Regularization
(Poster)
The ability to generalize to unseen domains is crucial for machine learning systems, especially when we only have data from limited training domains and must deploy the resulting models in the real world. In this paper, we study domain generalization via the classic empirical risk minimization (ERM) approach with a simple regularizer based on the nuclear norm of the learned features from the training set. Theoretically, we provide intuitions on why nuclear norm regularization works better than ERM and ERM with L2 weight decay in linear settings. Empirically, we show that nuclear norm regularization achieves state-of-the-art average accuracy compared to existing methods in a wide range of domain generalization tasks (e.g. 1.7% test accuracy improvements over the second-best baseline on DomainNet).
Zhenmei Shi · Yifei Ming · Ying Fan · Frederic Sala · Yingyu Liang
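
As an illustration of the regularizer described in this abstract, the sketch below adds the nuclear norm (the sum of singular values) of a batch feature matrix to an ERM loss. It assumes NumPy is available; the function name, the `weight` value, and the example matrices are hypothetical, and the abstract does not specify how the authors scale or batch the penalty.

```python
import numpy as np

def nuclear_norm_regularized_loss(features, erm_loss, weight=0.01):
    """ERM loss plus weight times the sum of singular values of the feature matrix."""
    singular_values = np.linalg.svd(features, compute_uv=False)
    return erm_loss + weight * singular_values.sum()

# Hypothetical batches of 4 examples with 3-dimensional learned features.
rank_one = np.outer([1.0, 2.0, 2.0, 0.0], [1.0, 0.0, 0.0])  # one nonzero singular value
full_rank = np.eye(4, 3) * 3.0                               # three equal singular values

low_rank_loss = nuclear_norm_regularized_loss(rank_one, erm_loss=0.5, weight=0.1)
full_rank_loss = nuclear_norm_regularized_loss(full_rank, erm_loss=0.5, weight=0.1)
```

For the same ERM loss, the rank-one features receive a smaller penalty than the full-rank ones, which shows how the regularizer pushes the learned features toward a low-rank structure.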


Invariant Feature Subspace Recovery for Multi-Class Classification
(Poster)
Domain generalization aims to learn a model over multiple training environments to generalize to unseen environments. Recently, Wang et al. [2022] proposed Invariant-feature Subspace Recovery (ISR), a domain generalization algorithm which uses the means of class-conditional data distributions to provably identify the invariant-feature subspace. However, the original ISR algorithm conditions on a single class only, without utilizing information from the remaining classes. In this work, we consider the setting of multi-class classification, and propose an extension of the ISR algorithm, called ISR-Multiclass. This proposed algorithm can provably recover the invariant-feature subspace with $\mathcal{O}(d_{spu}/k) + 1$ environments, where $d_{spu}$ is the number of spurious features and $k$ is the number of classes. Empirically, we first examine ISR-Multiclass on a synthetic dataset, and demonstrate its superiority over the original ISR in the multi-class setting. Furthermore, we conduct experiments on Multiclass Coloured MNIST, a semi-synthetic dataset with strong spurious correlations, and show that ISR-Multiclass can significantly improve the robustness of neural nets trained by various methods (e.g., ERM and IRM) against spurious correlations.

Gargi Balasubramaniam · Haoxiang Wang · Han Zhao


Out-of-Distribution Robustness via Targeted Augmentations
(Poster)
Many machine learning systems deployed in the real world face the challenge of domain generalization, or generalizing to new domains that have different data distributions. For example, in wildlife conservation, animal classification models can perform poorly on new camera deployments. Across cameras, the data distribution changes along multiple factors, some of which are spurious (e.g., low-level background variations) and others of which are robustly predictive (e.g., habitat type). In this work, we aim to improve out-of-distribution performance by learning models that are invariant to spurious cross-domain variations while preserving predictive cross-domain variations. Specifically, we explore targeted augmentations that rely on prior knowledge to randomize only the spurious cross-domain variations. On iWildCam2020-WILDS and Camelyon17-WILDS, two domain generalization datasets, targeted augmentations outperform the previous state-of-the-art by 3.2 and 14.4 percentage points, respectively, suggesting that targeting spurious cross-domain variations using prior knowledge can be an effective route to out-of-distribution robustness.
Irena Gao · Shiori Sagawa · Pang Wei Koh · Tatsunori Hashimoto · Percy Liang


Pushing the Accuracy-Fairness Tradeoff Frontier with Introspective Self-play
(Poster)
Improving the accuracy-fairness frontier of deep neural network (DNN) models is an important problem. Uncertainty-based active learning (AL) can potentially improve the frontier by preferentially sampling underrepresented subgroups to create a more balanced training dataset. However, the quality of uncertainty estimates from modern DNNs tends to degrade in the presence of spurious correlations and dataset bias, compromising the effectiveness of AL for sampling tail groups. In this work, we propose Introspective Self-play (ISP), a simple approach to improve the uncertainty estimation of a deep neural network under dataset bias, by adding an auxiliary introspection task requiring a model to predict the bias for each data point in addition to the label. We show that ISP provably improves the bias-awareness of the model representation and the resulting uncertainty estimates. On two real-world tabular and language tasks, ISP serves as a simple "plug-in" for AL model training, consistently improving both the tail-group sampling rate and the final accuracy-fairness tradeoff frontier of popular AL methods.

Jeremiah Liu · Krishnamurthy Dvijotham · Jihyeon Lee · Quan Yuan · Martin Strobel · Balaji Lakshminarayanan · Deepak Ramachandran


Reducing Forgetting in Federated Learning with Truncated Cross-Entropy
(Poster)
In federated learning (FL), a global model is learned by aggregating model updates computed from a set of client nodes, each having their own data. A key challenge in FL is the heterogeneity of data across clients whose data distributions differ from one another. Standard FL algorithms perform multiple gradient steps before synchronizing the model, which can lead to clients overly minimizing their local objective and diverging from other client solutions. We demonstrate that in such a setting individual client models experience "catastrophic forgetting" with respect to other client data. We propose a simple yet efficient approach that modifies the cross-entropy objective on a per-client basis such that classes outside a client's label set are shielded from abrupt representation change. Through empirical evaluations, we demonstrate our approach can alleviate this problem, especially under the most challenging FL settings with high heterogeneity and low client participation.
Gwen Legate · Lucas Page-Caccia · Eugene Belilovsky
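
One way to read the per-client modification is as a cross-entropy whose softmax normalizes only over the classes a client actually holds. The sketch below follows that reading; it is an assumption rather than the authors' exact formulation, and the function name and example logits are hypothetical.

```python
import math

def truncated_cross_entropy(logits, target, client_classes):
    """Cross-entropy with the softmax restricted to the client's own classes."""
    if target not in client_classes:
        raise ValueError("target must belong to the client's label set")
    kept = sorted(client_classes)
    shift = max(logits[c] for c in kept)  # subtract the max for numerical stability
    log_norm = shift + math.log(sum(math.exp(logits[c] - shift) for c in kept))
    return log_norm - logits[target]

# Hypothetical client that only holds classes 0 and 1 of a 3-class problem.
logits = [0.0, 0.0, 5.0]
local_loss = truncated_cross_entropy(logits, target=0, client_classes={0, 1})
full_loss = truncated_cross_entropy(logits, target=0, client_classes={0, 1, 2})
```

Since the logit of class 2 never enters `local_loss`, local training on this client produces no gradient that pushes that logit down, which is the shielding effect the abstract describes.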


Learning to Extrapolate: A Transductive Approach
(Poster)
Machine learning systems, especially overparameterized deep neural networks, can generalize to novel testing instances drawn from the same distribution as the training data. However, they fare poorly when evaluated on out-of-support testing points. In this work, we tackle the problem of developing machine learning systems that retain the power of overparameterized function approximators, while enabling extrapolation to out-of-support testing points when possible. This is accomplished by noting that under certain conditions, a "transductive" reparameterization can convert an out-of-support extrapolation problem into a problem of within-support combinatorial generalization. We propose a simple strategy based on bilinear embeddings to enable this type of combinatorial generalization, thereby addressing the out-of-support extrapolation problem. We instantiate a simple, practical algorithm applicable to various supervised learning problems and imitation learning tasks.
Aviv Netanyahu · Abhishek Gupta · Max Simchowitz · Kaiqing Zhang · Pulkit Agrawal


Surgical Fine-Tuning Improves Adaptation to Distribution Shifts
(Poster)
A common approach to transfer learning under distribution shift is to fine-tune the last few layers of a pretrained model, preserving learned features while also adapting to the new task. This paper shows that in such settings, selectively fine-tuning a subset of layers (which we term surgical fine-tuning) matches or outperforms commonly used fine-tuning approaches. Moreover, the type of distribution shift influences which subset is more effective to tune: for example, for image corruptions, fine-tuning only the first few layers works best. We validate our findings systematically across seven real-world data tasks spanning three types of distribution shifts. Theoretically, we prove that for two-layer neural networks in an idealized setting, first-layer tuning can outperform fine-tuning all layers. Intuitively, fine-tuning more parameters on a small target dataset can cause information learned during pretraining to be forgotten, and the relevant information depends on the type of shift.
Yoonho Lee · Annie Chen · Fahim Tajwar · Ananya Kumar · Huaxiu Yao · Percy Liang · Chelsea Finn
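
The core mechanic of surgical fine-tuning, updating only a chosen subset of layers, can be sketched without any deep learning framework. The layer names, parameter values, and the helper `surgical_sgd_step` below are hypothetical; in practice the same effect is usually obtained by disabling gradients for the frozen layers.

```python
def surgical_sgd_step(params, grads, tuned_layers, lr=0.1):
    """Apply an SGD update only to layers named in tuned_layers; freeze the rest."""
    unknown = set(tuned_layers) - set(params)
    if unknown:
        raise KeyError(f"unknown layers: {sorted(unknown)}")
    return {
        name: [w - lr * g for w, g in zip(weights, grads[name])]
        if name in tuned_layers else list(weights)
        for name, weights in params.items()
    }

# Hypothetical three-block model with one gradient value per weight.
params = {"block1": [1.0, 2.0], "block2": [3.0], "head": [4.0]}
grads = {"block1": [10.0, 10.0], "block2": [10.0], "head": [10.0]}

# Tune only the first block, the choice the abstract reports for image corruptions.
first_block_only = surgical_sgd_step(params, grads, tuned_layers={"block1"})
```

Choosing `tuned_layers={"head"}` instead would reproduce the common last-layer baseline, so the same helper covers both ends of the comparison.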


Characterising the Robustness of Reinforcement Learning for Continuous Control using Disturbance Injection
(Poster)
In this study, we leverage the deliberate and systematic fault-injection capabilities of an open-source benchmark suite to perform a series of experiments on state-of-the-art deep and robust reinforcement learning algorithms. We aim to benchmark robustness in the context of continuous action spaces, which is crucial for deployment in robot control. We find that robustness is more prominent for action disturbances than it is for disturbances to observations and dynamics. We also observe that state-of-the-art approaches that are not explicitly designed to improve robustness perform at a level comparable to that achieved by those that are. Our study and results are intended to provide insight into the current state of safe and robust reinforcement learning and a foundation for the advancement of the field, in particular, for deployment in robotic systems.
Catherine Glossop · Jacopo Panerati · Amrit Krishnan · Zhaocong Yuan · Angela Schoellig


Classwise Domain Generalization: A Novel Framework for Evaluating Distributional Shift
(Poster)
Given that neural networks generalize unreasonably well in the IID setting, OOD presents a useful failure case to study their generalization performance. Recent studies have shown that a carefully trained ERM gives good performance in Domain Generalization (DG), with train samples from all domains randomly shuffled in each batch of training. Moreover, methods like MIRO can boost test performance of NNs under distribution shift without training data being explicitly annotated with domain information. We present a new setting beyond Traditional DG (TDG), called the Classwise DG (CWDG) benchmark, where for each class, we randomly select one of the domains and keep it aside for testing. Despite being exposed to all domains during training, our experiments show that the performance of neural networks drops in this framework compared to TDG. We evaluate popular DG methods in this setting and show that performance in the two settings is correlated for most methods, with a few exceptions. Finally, we propose a novel method called Iterative Domain Feature Masking (IDFM), achieving state-of-the-art results on the proposed benchmark.
Sarath Sivaprasad · Akshay Goindani · Mario Fritz · Vineet Gandhi
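
The classwise split described in this abstract (one held-out domain per class) can be sketched as follows. The tuple layout, the function name `classwise_dg_split`, and the toy samples are hypothetical; the benchmark's actual data loading is not described in the abstract.

```python
import random

def classwise_dg_split(samples, seed=0):
    """samples: list of (class_label, domain, item). Returns (train, test, held_out)."""
    rng = random.Random(seed)
    domains_by_class = {}
    for label, domain, _ in samples:
        domains_by_class.setdefault(label, set()).add(domain)
    # For each class, pick one domain at random and reserve it for testing.
    held_out = {
        label: rng.choice(sorted(domains))
        for label, domains in sorted(domains_by_class.items())
    }
    train = [s for s in samples if s[1] != held_out[s[0]]]
    test = [s for s in samples if s[1] == held_out[s[0]]]
    return train, test, held_out

# Hypothetical dataset with two classes spread over up to three domains.
samples = [
    ("cat", "photo", "c1"), ("cat", "sketch", "c2"),
    ("dog", "photo", "d1"), ("dog", "sketch", "d2"), ("dog", "cartoon", "d3"),
]
train, test, held_out = classwise_dg_split(samples, seed=0)
```

Unlike the traditional setting, where one whole domain is removed, every domain can still appear in `train` here; only specific class and domain combinations are unseen at training time.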


Memory bounds for continual learning
(Poster)
Continual learning, or lifelong learning, is a formidable current challenge to machine learning. It requires the learner to solve a sequence of $k$ different learning tasks, one after the other, while retaining its aptitude for earlier tasks; the continual learner should scale better than the obvious solution of developing and maintaining a separate learner for each of the $k$ tasks. We embark on a complexity-theoretic study of continual learning in the PAC framework. We make novel uses of communication complexity to establish that any continual learner, even an improper one, needs memory that grows linearly with $k$, strongly suggesting that the problem is intractable. When logarithmically many passes over the learning tasks are allowed, we provide an algorithm based on multiplicative weights update whose memory requirement scales well; we also establish that improper learning is necessary for such performance. We conjecture that these results may lead to new promising approaches to continual learning.

Binghui Peng · Xi Chen · Christos Papadimitriou


Tailored Overlap for Learning Under Distribution Shift
(Poster)
Distributional overlap is a critical determinant of learnability in domain adaptation. The standard theory quantifies overlap in terms of $\chi^2$ divergence, as this factors directly into variance and generalization bounds agnostic to the functional form of the $Y$-$X$ relationship. However, in many modern settings, we cannot afford this agnosticism; we often wish to transfer across distributions with disjoint support, where these standard divergence measures are infinite. In this note, we argue that "tailored" divergences that are restricted to measuring overlap in a particular function class are more appropriate. We show how $\chi^2$ (and other) divergences can be generalized to this restricted function class setting via a variational representation, and use this to motivate balancing weight-based methods that have been proposed before, but, we believe, should be more widely used.

David Bruns-Smith · Alexander D'Amour · Avi Feller · Steve Yadlowsky


Few-Shot Learnable Augmentation for Financial Time Series Prediction under Distribution Shifts
(Poster)
We address the problem of distribution shift in financial time series prediction, where the behavior of the time series changes over time. Satisfactory performance of forecasting algorithms requires constant model recalibration or fine-tuning to adapt to the new data distribution. Specifically, the ability to quickly fine-tune a model with only a few training samples available from the new distribution is crucial for many business applications. In this paper, we develop a novel method for learnable data augmentation that effectively adjusts to the new time series distribution with only a few samples. We demonstrate the effectiveness of our method compared to the state-of-the-art augmentation methods on both univariate time series (e.g., stock data) and multivariate time series (e.g., yield rate curves) in the presence of distribution shift due to the COVID market shock in 2020.
Dat Huynh · Elizabeth Fons · Svitlana Vyetrenko


Mechanistic Lens on Mode Connectivity
(Poster)
With the rise of pretrained models, fine-tuning has become increasingly important. However, naive fine-tuning often does not eliminate a model's sensitivity to spurious cues. To understand and address this limitation, we study the geometry of neural network loss landscapes through the lens of mode connectivity. We tackle two questions: 1) Are models trained on different distributions mode-connected? 2) Can we fine-tune a pretrained model to switch modes? We define a notion of mechanistic similarity based on shared invariances and show linearly connected modes are mechanistically similar. We find naive fine-tuning yields linearly connected solutions and hence is unable to induce relevant invariances. We also propose and validate a method of "mechanistic fine-tuning" based on our gained insights.
Ekdeep S Lubana · Eric Bigelow · Robert Dick · David Krueger · Hidenori Tanaka


Is Unsupervised Performance Estimation Impossible When Both Covariates and Labels shift?
(Poster)
Accurately estimating and explaining an ML model’s performance on new datasets is increasingly critical in reliable ML model deployment. With no labels on the new datasets, performance estimation paradigms often assume either covariate shift or label shift, and thus lead to poor estimation accuracy when the assumptions are broken. Is unsupervised performance monitoring really impossible when both covariates and labels shift? In this paper, we give a negative answer. To do so, we introduce Sparse Joint Shift (SJS), a new distribution shift model considering the shift of labels and a few features. We characterize the mathematical conditions under which SJS is identifiable. This shows that unsupervised performance monitoring is indeed feasible when a few features and labels shift. In addition, we propose SEES, an algorithmic framework for performance estimation under SJS. Preliminary experiments show the superior estimation performance of SEES over existing paradigms. This opens the door to tackling the joint shift of both covariates and labels without observing new datasets’ labels. 
Lingjiao Chen · Matei Zaharia · James Zou


First Steps Toward Understanding the Extrapolation of Nonlinear Models to Unseen Domains
(Poster)
Real-world machine learning applications often involve deploying neural networks to domains that are not seen at training time. Hence, we need to understand the extrapolation of \textit{nonlinear} models: under what conditions on the distributions and function class can models be guaranteed to extrapolate to new test distributions? The question is very challenging because even two-layer neural networks cannot be guaranteed to extrapolate outside the support of the training distribution without further assumptions on the domain shift. This paper makes some initial steps towards analyzing the extrapolation of nonlinear models for structured domain shift. We primarily consider settings where the \textit{marginal} distribution of each coordinate of the data (or subset of coordinates) does not shift significantly across the training and test distributions, but the joint distribution may have a much bigger shift. We prove that the family of nonlinear models of the form $f(x)=\sum f_i(x_i)$, where $f_i$ is an \emph{arbitrary} function on the subset of features $x_i$, can extrapolate to unseen distributions, if the covariance of the features is well-conditioned. To the best of our knowledge, this is the first result that goes beyond linear models and the bounded density ratio assumption, even though the assumptions on the distribution shift and function class are stylized.

Kefan Dong · Tengyu Ma
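
To see why matching marginals alone might not be enough for the additive family $f(x)=\sum f_i(x_i)$, the following decomposition may help. It is only an illustrative identity, not the paper's proof; the symbols $g$, $h_i$, $Q$, $Q_i$ and $Q_{ij}$ are introduced here for illustration. Let $g(x)=\sum_i g_i(x_i)$ be a second additive model, write $h_i = f_i - g_i$, and let $Q$ be the test distribution with coordinate marginals $Q_i$ and pairwise marginals $Q_{ij}$.

```latex
\mathbb{E}_{x \sim Q}\big[(f(x) - g(x))^2\big]
  = \sum_{i} \mathbb{E}_{x_i \sim Q_i}\big[h_i(x_i)^2\big]
  + \sum_{i \neq j} \mathbb{E}_{(x_i, x_j) \sim Q_{ij}}\big[h_i(x_i)\, h_j(x_j)\big]
```

The first sum depends only on the marginals $Q_i$, so it is unchanged when the marginals do not shift. The cross terms depend on the joint distribution, which is where an additional condition on the dependence between features (such as the well-conditioned covariance mentioned in the abstract) plausibly enters.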


DrML: Diagnosing and Rectifying Vision Models using Language
(Poster)
Recent multimodal contrastive learning models have demonstrated the ability to learn an embedding space suitable for building strong vision classifiers, by leveraging the rich information in large-scale image-caption datasets. Our work highlights a distinct advantage of this multimodal embedding space: the ability to diagnose vision classifiers through natural language. The traditional process of diagnosing model behaviors in deployment settings involves labor-intensive data acquisition and annotation. Our proposed method, DrML, can discover high-error data slices, identify influential attributes and further rectify undesirable model behaviors, without requiring any visual data. Through a combination of theoretical explanation and empirical verification, we present conditions under which classifiers trained on embeddings from one modality can be equivalently applied to embeddings from another modality. On a range of image datasets with known error slices, we demonstrate that our method can effectively identify the error slices and influential attributes, and can further use language to rectify failure modes of the classifier.
Yuhui Zhang · Jeff Z. HaoChen · Shih-Cheng Huang · Kuan-Chieh Wang · James Zou · Serena Yeung


Empirical Study on Optimizer Selection for Out-of-Distribution Generalization
(Poster)
Modern deep learning systems are fragile and do not generalize well under distribution shifts. While much promising work has been accomplished to address these concerns, a systematic study of the role of optimizers and their out-of-distribution generalization performance has not been undertaken. In this study, we examine the performance of popular first-order optimizers for different classes of distributional shift under empirical risk minimization and invariant risk minimization. We address the problem settings for image and text classification using DomainBed, WILDS, and Backgrounds Challenge as out-of-distribution datasets for the exhaustive study. We search over a wide range of hyperparameters and examine the classification accuracy (in-distribution and out-of-distribution) for over 20,000 models. We arrive at the following findings: i) contrary to conventional wisdom, adaptive optimizers (e.g., Adam) perform worse than non-adaptive optimizers (e.g., SGD, momentum-based SGD), ii) in-distribution performance and out-of-distribution performance exhibit three types of behavior depending on the dataset: linear returns, increasing returns, and diminishing returns. We believe these findings can help practitioners choose the right optimizer and know what behavior to expect. The code is available at https://anonymous.4open.science/r/OoDOptimizerComparison37DF.
Hiroki Naganuma · Kartik Ahuja · Ioannis Mitliagkas · Shiro Takagi · Tetsuya Motokawa · Rio Yokota · Kohta Ishikawa · Ikuro Sato


Choosing Public Datasets for Private Machine Learning via Gradient Subspace Distance
(Poster)
Differentially private stochastic gradient descent privatizes model training by injecting noise into each iteration, where the noise magnitude increases with the number of model parameters. Recent works suggest that we can reduce the noise by leveraging public data for private machine learning, by projecting gradients onto a subspace prescribed by the public data. However, given a choice of public datasets, it is not clear which one may be most appropriate for the private task. We give an algorithm for selecting a public dataset by measuring a low-dimensional subspace distance between gradients of the public and private examples. The computational and privacy cost overhead of our method is minimal. Empirical evaluation suggests that trained model accuracy is monotone in this distance.
Xin Gu · Gautam Kamath · Steven Wu


Learning Invariant Representations under General Interventions on the Response
(Poster)
It has become increasingly common to collect observations of feature and response pairs from different environments. As a consequence, one has to apply learned predictors to data with a different distribution due to distribution shifts. One principled approach is to adopt structural causal models to describe training and test models, following the invariance principle, which says that the conditional distribution of the response given its predictors remains the same across environments. However, this principle might be violated in practical settings when the response is intervened on. A natural question is whether it is still possible to identify other forms of invariance to facilitate prediction in unseen environments. To shed light on this challenging scenario, we introduce the invariant matching property (IMP), an explicit relation that captures interventions through an additional feature. This leads to an alternative form of invariance that enables a unified treatment of general interventions on the response. We analyze the asymptotic generalization errors of our method under both the discrete and continuous environment settings, where the continuous case is handled by relating it to semiparametric varying coefficient models. We present algorithms that show competitive performance compared to existing methods over various experimental settings.
Kang Du · Yu Xiang


Theory and Algorithm for Batch Distribution Drift Problems
(Poster)
We study a problem of gradual \emph{batch distribution drift} motivated by several applications, which consists of determining an accurate predictor for a target time segment, for which a moderate number of labeled samples is available, while leveraging past segments for which substantially more labeled samples are available. We give new algorithms for this problem guided by a new theoretical analysis and generalization bounds derived for this scenario. Additionally, we report the results of extensive experiments demonstrating the benefits of our drifting algorithm, including comparisons with natural baselines.
Pranjal Awasthi · Corinna Cortes · Christopher Mohri


Enabling the Visualization of Distributional Shift using Shapley Values
(Poster)
In streaming data, distributional shifts can appear both in the univariate dimensions and in the joint distributions with the labels. However, in many real-time scenarios, labels are often either missing or delayed; unsupervised drift detection methods are desired in those applications. We design slidSHAPs, a novel representation method for unlabelled data streams. Well known in machine learning, Shapley values offer a way to exploit correlation dependencies among random variables; we develop an unsupervised sliding Shapley value series for categorical time series, representing the data stream in a newly defined latent space and tracking the feature correlation changes. Transforming the original time series to the slidSHAPs allows us to track how distributional shifts affect the correlations among the input variables; the approach is independent of any kind of labeling. We show how abrupt distributional shifts in the input variables are transformed into smoother changes in the slidSHAPs; moreover, slidSHAPs allow for intuitive visualization of the shifts when they are not observable in the original data.
Bin Li · Chiara Balestra · Emmanuel Müller
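
The idea of a sliding Shapley value series can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which value function slidSHAPs uses, so the joint empirical entropy used here (`joint_entropy`) is an assumption, and the names `shapley_values` and `sliding_shapley` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import factorial, log2


def joint_entropy(rows, features):
    """Empirical joint entropy (bits) of the selected categorical features.

    ASSUMPTION: used only as a stand-in value function for illustration.
    """
    if not features:
        return 0.0
    counts = Counter(tuple(row[f] for f in sorted(features)) for row in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())


def shapley_values(players, value):
    """Exact Shapley values for a set function `value` over `players`."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Standard Shapley weight for coalitions of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def sliding_shapley(rows, features, window, step):
    """One Shapley vector per window of the (unlabelled) stream."""
    series = []
    for start in range(0, len(rows) - window + 1, step):
        chunk = rows[start:start + window]
        series.append(shapley_values(features, lambda s: joint_entropy(chunk, s)))
    return series


# Toy stream: the two features are identical in the first window and
# independent in the second, while each marginal stays uniform throughout.
stream = [(0, 0), (1, 1), (0, 0), (1, 1), (0, 0), (0, 1), (1, 0), (1, 1)]
series = sliding_shapley(stream, features=[0, 1], window=4, step=4)
```

In this toy stream the univariate distributions never change, yet the per-window Shapley vectors do, which is the kind of correlation change the abstract says is hard to see in the raw data. Exact Shapley computation is exponential in the number of features, so it is only practical for small feature sets.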


Frequency Shortcut Learning in Neural Networks
(Poster)
The generalization of neural networks is harmed by shortcut learning: the use of simple non-semantic features may prevent the networks from learning deeper semantic and task-related cues. Existing studies focus mainly on explicit shortcuts, e.g., color patches and annotated text in images, that are visually detectable and may be removed. However, there exist implicit shortcuts determined by bias or superficial statistics in the data that neural networks can easily exploit. Mitigating the learning of implicit shortcuts is challenging due to the simplicity bias and an intrinsic difficulty in identifying them. We empirically investigate shortcut learning in the frequency domain and propose a method to identify learned frequency shortcuts based on frequency removal. We found that frequency shortcuts often correspond to textures consisting of specific frequencies. We also investigate the influence of frequency shortcuts in Out-of-Distribution (OOD) tests.
Shunxin Wang · Raymond Veldhuis · Christoph Brune · Nicola Strisciuglio


Preserving privacy with PATE for heterogeneous data
(Poster)
Differential privacy has become the standard system to provide privacy guarantees for user data in machine learning models. One of the popular techniques to ensure privacy is the Private Aggregation of Teacher Ensembles (PATE) framework. PATE trains an ensemble of teacher models on private data and transfers the knowledge to a student model, with rigorous privacy guarantees derived using differential privacy. So far, PATE has been shown to work assuming the public and private data are distributed homogeneously. We show that in the case of high mismatch (non-IID-ness) in these distributions, the teachers suffer from high variance in their individual training updates, causing them to converge to vastly different optimum states. This leads to lower consensus and accuracy for data labelling. To address this, we propose a modification to the teacher training process in PATE that incorporates teacher averaging and update correction, which reduces the variance in teacher updates. Our technique leads to improved prediction accuracy of the teacher aggregation mechanism, especially for highly heterogeneous data. Furthermore, our evaluation shows our technique is necessary to sustain the student model performance, and allows it to achieve considerable gains over the original PATE in the utility-privacy metric.
Akshay Dodwadmath · Sebastian Stich
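
For readers unfamiliar with PATE, the teacher aggregation step that the abstract refers to can be sketched as a noisy plurality vote. This is a minimal sketch of the standard Laplace-noise variant of PATE aggregation, not the modified training procedure proposed in the paper; the function names and the `noise_scale` parameter are illustrative assumptions.

```python
import random


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_aggregate(teacher_votes, num_classes, noise_scale, rng=None):
    """Return the class with the highest noisy vote count.

    teacher_votes: one predicted class index per teacher.
    noise_scale:   Laplace scale added to every count (0 disables noise).
    """
    rng = rng or random.Random()
    counts = [0.0] * num_classes
    for vote in teacher_votes:
        counts[vote] += 1.0
    if noise_scale > 0:
        counts = [c + laplace(noise_scale, rng) for c in counts]
    return max(range(num_classes), key=lambda k: counts[k])


def consensus(teacher_votes, num_classes):
    """Fraction of teachers that agree with the plurality label."""
    counts = [0] * num_classes
    for vote in teacher_votes:
        counts[vote] += 1
    return max(counts) / len(teacher_votes)
```

When teachers trained on heterogeneous data disagree, `consensus` drops and the added noise is more likely to flip the aggregated label, which is one way to read the "lower consensus and accuracy" effect described above.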


Robustmix: Improving Robustness by Regularizing the Frequency Bias of Deep Nets
(Poster)
Deep networks have achieved impressive results on a range of well-curated benchmark datasets. Surprisingly, their performance remains sensitive to perturbations that have little effect on human performance. In this work, we propose a novel extension of Mixup called Robustmix that regularizes networks to classify based on lower-frequency spatial features. We show that this type of regularization improves robustness on a range of benchmarks such as ImageNet-C and Stylized ImageNet. It adds little computational overhead and furthermore does not require a priori knowledge of a large set of image transformations. We find that this approach further complements recent advances in model architecture and data augmentation, attaining a state-of-the-art mCE of 44.8 with an EfficientNet-B8 model and RandAugment, which is a reduction of 16 mCE compared to the baseline.
JONAS NGNAWE · Marianne ABEMGNIGNI NJIFON · Jonathan Heek · Yann Dauphin
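
The mCE figure quoted above is a mean corruption error. A minimal sketch of how such a score is usually computed, assuming the standard ImageNet-C convention of normalizing each corruption's summed error by that of a reference model, is shown below. The function names and the toy numbers in the usage note are hypothetical and are not results from the paper.

```python
def corruption_error(model_errors, reference_errors):
    """Error summed over severity levels, normalized by a reference model."""
    return sum(model_errors) / sum(reference_errors)


def mean_corruption_error(model, reference):
    """Average normalized corruption error, scaled to a percentage.

    model, reference: dicts mapping corruption name -> list of top-1 error
    rates (one entry per severity level).
    """
    ces = [corruption_error(model[c], reference[c]) for c in model]
    return 100.0 * sum(ces) / len(ces)
```

For example, with made-up error rates `{"noise": [0.2, 0.4], "blur": [0.1, 0.3]}` for a model and `{"noise": [0.4, 0.8], "blur": [0.2, 0.6]}` for the reference, every corruption error is one half, so the mCE is 50. Lower is better.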


Visual response inhibition for increased robustness of convolutional networks to distribution shifts
(Poster)
Convolutional neural networks have been shown to suffer from distribution shifts in the test data, for instance caused by the so-called common corruptions and perturbations. Test images can contain noise, digital transformations, and blur that were not present in the training data, negatively impacting the performance of trained models. Humans are much more robust to noise and visual distortions than deep networks. In this work, we explore the effectiveness of a neuronal response inhibition mechanism, called push-pull, observed in the early part of the visual system, to increase the robustness of deep convolutional networks. We deploy a Push-Pull inhibition layer as a replacement of the initial convolutional layers (input layer and in the first block of residual and dense architectures) of standard convolutional networks for image classification. We show that the Push-Pull inhibition component increases the robustness of standard networks for image classification to distribution shifts on the CIFAR-10-C and CIFAR-10-P test sets.
Nicola Strisciuglio · George Azzopardi


AdaME: Adaptive learning of multi-source adaptation ensembles
(Poster)
We present a new adaptive algorithm to build ensembles of neural networks for multi-source domain adaptation. Since the standard convex combination ensembles cannot succeed in this scenario, we present a learnable domain-weighted combination and new learning guarantees based on the deep boosting algorithm. We introduce and analyze a new algorithm, ADAME, for this scenario and show that it benefits from favorable theoretical guarantees, is risk-averse and reduces the worst-case mismatch between the inference and training distributions. We also report the results of several experiments demonstrating its performance on the FMoW-WILDS dataset.
Scott Yak · Javier Gonzalvo · Mehryar Mohri · Corinna Cortes


Transferability Between Regression Tasks
(Poster)
Transfer learning has been a widely used technique to adapt a deep learning model trained for one task to another when there is a data distribution shift between these tasks. To improve the effectiveness of transfer learning and to understand relationships between tasks, we consider the problem of transferability estimation between regression tasks and propose two novel transferability estimators that are simple, computationally efficient, yet effective and theoretically grounded. We test our proposed methods extensively in various challenging, practical scenarios and show they significantly outperform existing state-of-the-art regression task transferability estimators in both accuracy and efficiency.
Cuong Ngoc Nguyen · Phong Tran The · Lam Ho · Vu Dinh · Anh Tran · Tal Hassner · Cuong V. Nguyen


CAREER: Economic Prediction of Labor Sequence Data Under Distribution Shift
(Poster)
Labor economists regularly analyze employment data by fitting predictive models to small, carefully constructed longitudinal survey datasets. Although modern machine learning methods offer promise for such problems, these survey datasets are too small to take advantage of them. In recent years, large datasets of online resumes have also become available, providing data about the career trajectories of millions of individuals. However, the distributions of these large resume datasets differ in meaningful ways from the survey datasets used for economic estimation; standard econometric models cannot take advantage of their scale or make predictions under distribution shift. To this end, we develop CAREER, a transformer-based model that uses transfer learning to learn representations of job sequences. CAREER is first fit to large, passively collected resume data and then fine-tuned on samples of the downstream data distribution of interest. We find that CAREER forms accurate predictions of job sequences, achieving state-of-the-art predictive performance on three widely used economics datasets. We also find that CAREER is adept at making predictions under distribution shifts in time.
Keyon Vafa · Emil Palikot · Tianyu Du · Ayush Kanodia · Susan Athey · David Blei


Out-of-Distribution Generalization Challenge in Dialog State Tracking
(Poster)
Dialog State Tracking (DST) is a core component for multi-turn Task-Oriented Dialog (TOD) systems to understand the dialogs. DST models need to generalize to Out-of-Distribution (OOD) utterances due to the open environments dialog systems face. Unfortunately, utterances in TOD are multi-labeled, and most of them appear in specific contexts (i.e., the dialog histories). Both characteristics make them different from the conventional focus of OOD generalization research and remain unexplored. In this paper, we formally define OOD utterances in TOD and evaluate the generalizability of existing competitive DST models on the OOD utterances. Our experimental results show that the performance of all models drops considerably in dialogs with OOD utterances, indicating an OOD generalization challenge in DST.
Jiasheng Ye · Yawen Ouyang · Zhen Wu · Xinyu Dai


Diversity Boosted Learning for Domain Generalization with A Large Number of Domains
(Poster)
Machine learning algorithms minimizing the average training loss typically suffer from poor generalization performance. This has inspired various works on domain generalization (DG), among which a series of methods rely on $O(n^2)$ pairwise domain operations with $n$ domains, where each operation is often costly. Moreover, while a common objective in the DG literature is to learn invariant representations against spurious correlations induced by domains, we point out that this is insufficient and highlight the importance of alleviating spurious correlations caused by objects. Based on the observation that diversity helps mitigate spurious correlations, we propose a Diversity boosted twO-level saMplIng framework (DOMI) to efficiently sample the most informative ones among a large number of domains and data points. We show that DOMI helps train models that are robust against spurious correlations from both the domain side and the object side, substantially enhancing the performance of five backbone DG algorithms on Rotated MNIST and Rotated Fashion MNIST.

XI LENG · Yatao Bian · Xiaoying Tang


Learning with noisy labels using low-dimensional model trajectory
(Poster)
Recent work shows that deep neural networks (DNNs) first learn clean samples and then memorize noisy samples. Early stopping can therefore be used to improve performance when training with noisy labels. It was also shown recently that the training trajectory of DNNs can be approximated in a low-dimensional subspace using PCA. The DNNs can then be trained in this subspace, achieving similar or better generalization. These two observations were utilized together to further boost the generalization performance of vanilla early stopping on noisy label datasets. In this paper, we probe this finding further on different real-world and synthetic label noises. First, we show that the prior method is sensitive to the early stopping hyperparameter. Second, we investigate the effectiveness of PCA for approximating the optimization trajectory under noisy label information. We propose to estimate the low-rank subspace through robust and structured variants of PCA, namely Robust PCA and Sparse PCA. We find that the subspace estimated through these variants can be less sensitive to early stopping, and can outperform PCA to achieve better test error when trained on noisy labels.
Vasu Singla · Shuchin Aeron · Toshiaki Koike-Akino · Kieran Parsons · Matthew Brand · Ye Wang


Evaluating the Impact of Geometric and Statistical Skews on Out-Of-Distribution Generalization Performance
(Poster)
Out-of-distribution (OOD) or domain generalization is the problem of generalizing to unseen distributions. Recent work suggests that the marginal difficulty of generalizing to OOD over in-distribution data (the OOD-ID generalization gap) is due to spurious correlations, which arise due to statistical and geometric skews, and can be addressed by careful data augmentation and class balancing. We observe that after constructing a dataset where we remove all conceivable sources of spurious correlation between interpretable factors, classifiers still fail to close the OOD-ID generalization gap.
Aengus Lynch · Jean Kaddour · Ricardo Silva


Strategy-Aware Contextual Bandits
(Poster)
Algorithmic tools are often used to make decisions about people in high-stakes domains. In the presence of such automated decision making, there is incentive for strategic agents to modify their input to the algorithm in order to receive a more desirable outcome. While previous work on strategic classification attempts to capture this phenomenon, these models fail to take into account the multiple actions a decision maker usually has at their disposal, and the fact that they often have access only to bandit feedback. In contrast, we capture this setting as a contextual bandit problem, in which a decision maker must take actions based on a sequence of strategically modified contexts. We provide a low-strategic-regret algorithm for the two-action setting, and prove that sublinear strategic regret is generally not possible for settings in which the number of actions is greater than two. Along the way, we obtain impossibility results for multiclass strategic classification which may be of independent interest.
Keegan Harris · Chara Podimata · Steven Wu


Learning an Invertible Output Mapping Can Mitigate Simplicity Bias in Neural Networks
(Poster)
Deep Neural Networks (DNNs) are known to be brittle to even minor distribution shifts compared to the training distribution. While one line of work has demonstrated that the \emph{Simplicity Bias} (SB) of DNNs (a bias towards learning only the simplest features) is a key reason for this brittleness, another recent line of work has surprisingly found that diverse/complex features are indeed learned by the backbone, and their brittleness is due to the linear classification head relying primarily on the simplest features. To bridge the gap between these two lines of work, we first hypothesize and verify that while SB may not altogether preclude learning complex features, it amplifies simpler features over complex ones. Namely, simple features are replicated several times in the learned representations while complex features might not be replicated. This phenomenon, which we term the \emph{Feature Replication Hypothesis}, coupled with the \emph{Implicit Bias} of SGD to converge to maximum margin solutions in the feature space, leads the models to rely mostly on the simple features for classification. To mitigate this bias, we propose the \emph{Feature Reconstruction Regularizer (FRR)} to ensure that the learned features can be reconstructed back from the logits. The use of \emph{FRR} in linear layer training (\emph{FRRL}) encourages the use of more diverse features for classification. We further propose to fine-tune the full network by freezing the weights of the linear layer trained using \emph{FRRL}, to refine the learned features, making them more suitable for classification. Using the proposed approach, we demonstrate noteworthy gains on synthetic/semi-synthetic datasets, and outperform existing SOTA on the standard OOD benchmark DomainBed as well.
Sravanti Addepalli · Anshul Nasery · Venkatesh Babu R · Praneeth Netrapalli · Prateek Jain


Useful Confidence Measures: Beyond the Max Score
(Poster)
An important component in deploying machine learning (ML) in safety-critical applications is having a reliable measure of confidence in the ML model's predictions. For a classifier $f$ producing a probability vector $f(x)$ over the candidate classes, the confidence is typically taken to be $\max_i f(x)_i$. This approach is potentially limited, as it disregards the rest of the probability vector. In this work, we derive several confidence measures that depend on information beyond the maximum score, such as margin-based and entropy-based measures, and empirically evaluate their usefulness, focusing on NLP tasks with distribution shifts and Transformer-based models. We show that when models are evaluated on the out-of-distribution data ``out of the box'', using only the maximum score to inform the confidence measure is highly suboptimal. In the post-processing regime (where the scores of $f$ can be improved using additional in-distribution held-out data), this remains true, albeit to a lesser degree. Overall, our results suggest that entropy-based confidence is a surprisingly useful measure.

Gal Yona · Amir Feder · Itay Laish
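
The three families of confidence measures named in the abstract can be illustrated with a short sketch. The exact definitions used in the paper are not given here, so the formulas below (top-two margin and one minus normalized entropy) are common choices adopted as assumptions, and the function names are hypothetical.

```python
import math


def max_score(probs):
    """Baseline confidence: the largest class probability."""
    return max(probs)


def margin(probs):
    """Gap between the largest and second-largest probabilities."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def entropy_confidence(probs):
    """One minus the entropy normalized by log(num_classes).

    Equals 1 for a one-hot vector and 0 for the uniform distribution.
    """
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - h / math.log(len(probs))


# Two predictions with the same max score but different shapes.
a = [0.5, 0.5, 0.0]
b = [0.5, 0.25, 0.25]
```

Both vectors have a max score of 0.5, yet they differ in margin and in entropy-based confidence, which shows how the rest of the probability vector carries information that the max score ignores.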


Federated Learning under Distributed Concept Drift
(Poster)
Federated Learning (FL) under distributed concept drift is a largely unexplored area. Although concept drift is itself a well-studied phenomenon, it poses particular challenges for FL, because drifts arise staggered in time and space (across clients). Our work is the first to explicitly study data heterogeneity in both dimensions. We first demonstrate that prior solutions to drift adaptation, with their single global model, are ill-suited to staggered drifts, necessitating multiple-model solutions. We identify the problem of drift adaptation as a time-varying clustering problem, and we propose two new clustering algorithms for reacting to drifts based on local drift detection and hierarchical clustering. Empirical evaluation shows that our solutions achieve significantly higher accuracy than existing baselines, and are comparable to an idealized algorithm with oracle knowledge of the ground-truth clustering of clients to concepts at each time step.
Ellango Jothimurugesan · Kevin Hsieh · Jianyu Wang · Gauri Joshi · Phillip Gibbons


An Invariant Learning Characterization of Controlled Text Generation
(Poster)
Controlled generation refers to the problem of creating text that contains stylistic or semantic attributes of interest. Many approaches reduce this problem to building a predictor of the desired attribute. For example, researchers hoping to deploy a large language model to produce non-toxic content may use a toxicity classifier to filter generated text. In this paper, we show that the performance of controlled generation may be poor if the target distribution of text differs from the distribution the predictor was trained on. Instead, we take inspiration from causal representation learning and cast controlled generation under distribution shift as an invariant learning problem: the most effective predictor should be invariant across multiple text environments. Experiments demonstrate the promise and difficulty of adapting invariant learning methods, which have been primarily developed for vision, to text.
Claudia Shi · Carolina Zheng · Keyon Vafa · Amir Feder · David Blei


Tackling Distribution Shifts in Federated Learning with Superquantile Aggregation
(Poster)
Federated learning has emerged as the predominant framework for distributed machine learning over decentralized data, e.g., on mobile phones. The usual approaches suffer from a distribution shift: the model is trained to fit the average population distribution but is deployed on individual clients, whose data distributions can be quite different. We present a distributionally robust approach to federated learning based on a risk measure known as the superquantile and show how to optimize it by interleaving federated averaging steps with quantile computation. We demonstrate experimentally that our approach is competitive with the usual ones in terms of average error and outperforms them in terms of tail statistics of the error.
Krishna Pillutla · Yassine Laguel · Jérôme Malick · Zaid Harchaoui
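
The superquantile (also known as conditional value-at-risk) of a set of per-client losses is the average of the worst tail of those losses. The sketch below computes it for equally weighted clients; it illustrates the risk measure only, not the federated optimization procedure from the paper, and the function name and `theta` parameter are illustrative assumptions.

```python
def superquantile(losses, theta):
    """Average of the worst (1 - theta) fraction of equally weighted losses.

    theta = 0 recovers the plain mean; theta close to 1 approaches the max.
    """
    if not 0.0 <= theta < 1.0:
        raise ValueError("theta must be in [0, 1)")
    n = len(losses)
    tail_mass = (1.0 - theta) * n      # number of clients in the tail (may be fractional)
    remaining = tail_mass
    total = 0.0
    for loss in sorted(losses, reverse=True):
        weight = min(1.0, remaining)   # the boundary client counts fractionally
        total += weight * loss
        remaining -= weight
        if remaining <= 0:
            break
    return total / tail_mass
```

For per-client losses `[1, 2, 3, 4]`, `theta=0.5` averages the two worst clients and gives 3.5, while `theta=0` gives the usual mean of 2.5. Minimizing the former focuses training on the clients the average objective serves worst.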


Few Shot Generative Domain Adaptation Via Inference-Stage Latent Learning in GANs
(Poster)
In this study, we adapt generative models trained on large source datasets to scarce target domains. We adapt a pretrained Generative Adversarial Network (GAN) without retraining the generator, avoiding catastrophic forgetting and overfitting. Starting from the observation that target images can be `embedded' onto the latent space of a pretrained source-GAN, our method finds the latent code corresponding to the target domain on the source latent manifold. Optimizing a latent learner network during inference generates a novel target embedding that is supplied to the source-GAN generator to generate target samples. Our method, albeit simple, can be used to generate data from multiple target distributions using a generator trained on a single source distribution.
Arnab Kumar Mondal · Piyush Tiwary · Parag Singla · Prathosh AP


Relational Out-of-Distribution Generalization
(Poster)
In out-of-distribution (OOD) generalization, domain relation is an important factor. It can provide a global view on the functionality among domains, e.g., the protein domain in the binding affinity task or the geographical location domain in the weather forecast task. Existing work does not make use of the domain relation; in this work, we explore how to incorporate such rich information into solving the distribution shift problem. To this end, we propose READ, a general multi-head deep learning framework that harnesses domain relation to generalize to unseen domains in a structured learning and inference manner. In READ, each training domain shares a common backbone but learns one separate head. Built on a proposed explicit regularization, READ simulates the generalization process among heads, where a weighted ensemble prediction from heads irrelevant to the input domain is calculated via domain relation and aligned with the target. To improve the reliability of domain relation, READ further leverages similarity metric learning to update the initial relation. Empirically, we evaluate READ on three domain generalization benchmarks. The results indicate that READ consistently improves upon existing state-of-the-art methods on datasets from various fields.
Xinyu Yang · Xinyi Pan · Shengchao Liu · Huaxiu Yao


Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization
(Poster)
A common explanation for the failure of deep networks to generalize out-of-distribution is that they fail to recover the "correct" features. We challenge this notion with a simple experiment which suggests that ERM already learns sufficient features and that the current bottleneck is not feature learning, but robust regression. We therefore argue that devising simpler methods for learning predictors on existing features is a promising direction for future research. Towards this end, we introduce Domain-Adjusted Regression (DARE), a convex objective for learning a linear predictor that is provably robust under a new model of distribution shift. Rather than learning one function, DARE performs a domain-specific adjustment to unify the domains in a canonical latent space and learns to predict in this space. Under a natural model, we prove that the DARE solution is the minimax-optimal predictor for a constrained set of test distributions. Further, we provide the first finite-environment convergence guarantee to the minimax risk, improving over existing analyses which only yield minimax predictors after an environment threshold. Evaluated on fine-tuned features, we find that DARE compares favorably to prior methods, consistently achieving equal or better performance.
Elan Rosenfeld · Pradeep Ravikumar · Andrej Risteski


Test-time adaptation with slot-centric models
(Poster)
We consider the problem of segmenting scenes into constituent objects and their parts. Current supervised visual detectors, though impressive within their training distribution, often fail to segment out-of-distribution scenes into their constituent entities. Recent test-time adaptation methods use auxiliary self-supervised losses to adapt the network parameters to each test example independently and have shown promising results towards generalization outside the training distribution for the task of image classification. In our work, we find evidence that these losses can be insufficient for instance segmentation tasks, without also considering architectural inductive biases. For image segmentation, recent slot-centric generative models break such dependence on supervision by attempting to segment scenes into entities in a self-supervised manner by reconstructing pixels. Drawing upon these two lines of work, we propose Generating Fast and Slow Networks (GFS-Nets), a semi-supervised instance segmentation model equipped with a slot-centric image rendering component that is adapted per scene at test time through gradient descent on reconstruction or novel view synthesis objectives. We show that test-time adaptation greatly improves segmentation in out-of-distribution scenes. We evaluate GFS-Nets in scene segmentation benchmarks and show substantial out-of-distribution performance improvements against state-of-the-art supervised feed-forward detectors and self-supervised domain adaptation models.
Mihir Prabhudesai · Sujoy Paul · Sjoerd van Steenkiste · Mehdi S. M. Sajjadi · Anirudh Goyal · Deepak Pathak · Katerina Fragkiadaki · Gaurav Aggarwal · Thomas Kipf


Diversity through Disagreement for Better Transferability
(Poster)
Gradient-based learning algorithms have an implicit simplicity bias which in effect can limit the diversity of predictors being sampled by the learning procedure. This behavior can hinder the transferability of trained models by (i) favoring the learning of simpler but spurious features (present in the training data but absent from the test data) and (ii) only leveraging a small subset of predictive features. Such an effect is especially magnified when the test distribution does not exactly match the train distribution, referred to as the Out of Distribution (OOD) generalization problem. However, given only the training data, it is not always possible to assess a priori whether a given feature is spurious or transferable. Instead, we advocate for learning an ensemble of models which capture a diverse set of predictive features. Towards this, we propose a new algorithm D-BAT (Diversity-By-disAgreement Training), which enforces agreement among the models on the training data, but disagreement on the OOD data. We show how D-BAT naturally emerges from the notion of generalized discrepancy, and demonstrate in multiple experiments how the proposed method can mitigate shortcut learning, enhance uncertainty and OOD detection, and improve transferability.
Matteo Pagliardini · Martin Jaggi · François Fleuret · Sai Praneeth Karimireddy
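The agreement/disagreement idea in the abstract above can be illustrated with a small sketch for two binary classifiers. This is only one plausible way to write such an objective: the exact form of the disagreement term, the function names, and the weight `alpha` are assumptions made for illustration and are not taken from the listing.

```python
import numpy as np

def cross_entropy(p, y, eps=1e-12):
    """Mean binary cross-entropy of predicted probabilities p against labels y."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def disagreement(p1, p2, eps=1e-12):
    """Mean negative log-probability that two binary predictors disagree.

    p1 and p2 are each model's predicted probability of class 1 on
    unlabeled OOD inputs. The term is small when the models disagree.
    """
    prob_disagree = p1 * (1 - p2) + (1 - p1) * p2
    return float(-np.mean(np.log(np.clip(prob_disagree, eps, 1.0))))

def second_model_loss(p2_train, y_train, p1_ood, p2_ood, alpha=1.0):
    """Fit the training labels while disagreeing with model 1 on OOD data.

    alpha is a hypothetical trade-off weight, not a value from the source.
    """
    return cross_entropy(p2_train, y_train) + alpha * disagreement(p1_ood, p2_ood)
```

In this sketch the first model is treated as fixed, and only the second model's predictions would be optimized against `second_model_loss`.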


Env-Aware Anomaly Detection: Ignore Style Changes, Stay True to Content!
(Poster)
We introduce a formalization and benchmark for the unsupervised anomaly detection task in the distribution-shift scenario. Our work builds upon the iWildCam dataset, and, to the best of our knowledge, we are the first to propose such an approach for visual data. We empirically validate that environment-aware methods perform better in such cases when compared with the basic Empirical Risk Minimization (ERM). We next propose an extension for generating positive samples for contrastive methods that considers the environment labels when training, improving the ERM baseline score by 8.7%.
Stefan Smeu · Elena Burceanu · Andrei L Nicolicioiu · Emanuela Haller


Toward domain generalized pruning by scoring out-of-distribution importance
(Poster)
Filter pruning has been widely used for compressing convolutional neural networks to reduce computation costs during the deployment stage. Recent studies have shown that filter pruning techniques can achieve lossless compression of deep neural networks, removing redundant filters (kernels) without sacrificing accuracy. However, the evaluation is done when the training and testing data come from similar environmental conditions (independent and identically distributed), and how filter pruning affects cross-domain generalization (out-of-distribution) performance is largely ignored. We conduct extensive empirical experiments and reveal that although the intra-domain performance can be maintained after filter pruning, the cross-domain performance decays to a large extent. As scoring a filter's importance is one of the central problems for pruning, we design an importance score that uses the variance of domain-level risks to account for the pruning risk on the unseen distribution. As a result, we can retain more domain-generalized filters. The experiments show that under the same pruning ratio, our method can achieve significantly better cross-domain generalization performance than the baseline filter pruning method. As a first attempt, our work sheds light on the joint problem of domain generalization and filter pruning.
RIZHAO CAI · Haoliang Li · Alex Kot


Active Learning Over Multiple Domains in Natural Language Tasks
(Poster)
Studies of active learning traditionally assume the target and source data stem from a single domain. However, in realistic applications, practitioners often require active learning with multiple sources of out-of-distribution data, where it is unclear a priori which data sources will help or hurt the target domain. We survey a wide variety of techniques in active learning (AL), domain shift detection (DS), and multi-domain sampling to examine this challenging setting for question answering and sentiment analysis. Among 18 acquisition functions from 4 families of methods, we find H-Divergence methods, and particularly our proposed variant DAL-E, yield effective results, averaging 2-3% improvements over the random baseline. Our findings yield the first comprehensive analysis of both existing and novel methods for practitioners faced with multi-domain active learning for natural language tasks.
Shayne Longpre · Julia Reisler · Edward Huang · Yi Lu · Andrew Frank · Nikhil Ramesh · Chris DuBois


Adaptive Sampling for Probabilistic Forecasting under Distribution Shift
(Poster)
The world is not static: real-world time series change over time because external, and potentially disruptive, events such as macroeconomic cycles or the COVID-19 pandemic change the underlying factors that influence them. Once such a data distribution shift happens, it becomes part of the time series history and impacts future forecasting attempts. We present an adaptive sampling strategy that selects the part of the history that is relevant for the recent data distribution. We achieve this by learning a discrete distribution over relevant time steps by Bayesian optimization. We instantiate this idea with a two-step, model-agnostic method that first pretrains with uniform sampling and then trains a lightweight adaptive architecture with adaptive sampling. We show with synthetic and real-world experiments that this method adapts to distribution shift and reduces the forecasting error of the base model by 8.4%.
Luca Masserano · Syama Sundar Rangapuram · Shubham Kapoor · Rajbir Nirwan · Youngsuk Park · Michael Bohlke-Schneider


A Learning Based Hypothesis Test for Harmful Covariate Shift
(Poster)
Quickly and accurately identifying covariate shift at test time is a critical and often overlooked component of safe machine learning systems deployed in high-risk domains. In this work, we give an intuitive definition of harmful covariate shift (HCS) as a change in distribution that may weaken the generalization of a classification model. To detect HCS, we use the discordance between classifiers trained to agree on training data and disagree on test data. We derive a loss function for training these models and show that their disagreement rate and entropy represent powerful discriminative statistics for HCS. Empirically, we demonstrate the ability of our method to detect harmful covariate shift with statistical certainty on a variety of high-dimensional datasets. Across numerous domains and modalities, we show state-of-the-art performance compared to existing methods, particularly when the number of observed test samples is small.
Tom Ginsberg · Zhongyuan Liang · Rahul Krishnan


Engineering Uncertainty Representations to Monitor Distribution Shifts
(Poster)
In some classification tasks, the true label is not known until months or even years after the classifier prediction time. Once the model has been deployed, harmful dataset shift regimes can surface. Without cautious model monitoring, the damage could prove to be irreversible when true labels unfold. In this paper, we propose a method for practitioners to monitor distribution shifts on unlabeled data. We leverage two representations for quantifying and visualizing model uncertainty. The Adversarial Neighborhood Analysis assesses model uncertainty by aggregating predictions in the neighborhood of a data point and comparing them to the prediction at the single point. The Non-Conformity Analysis exploits the results of conformal prediction and leverages a decision tree to display uncertain zones. We empirically test our approach over scenarios of synthetically generated shifts to prove its efficacy.
Thomas Bonnier · Benjamin Bosch


Data Feedback Loops: Model-driven Amplification of Dataset Biases
(Poster)
Datasets scraped from the internet have been critical to large-scale machine learning. Yet, their success puts the utility of future internet-derived datasets at potential risk, as model outputs begin to replace human annotations as a source of supervision. In this work, we formalize a system where interactions with one model are recorded as history and scraped as training data in the future. We then analyze its stability over time by tracking changes to a test-time bias statistic (e.g. gender bias of model predictions). We find that the degree of bias amplification is closely linked to whether the model’s outputs behave like samples from the training distribution, a behavior which we characterize and define as consistent calibration. Experiments in three conditional prediction scenarios (image classification, visual role-labeling, and language generation) demonstrate that models that exhibit a sampling-like behavior are more calibrated and thus more stable. Based on this insight, we propose an intervention to help calibrate and stabilize unstable feedback systems.
Rohan Taori · Tatsunori Hashimoto


"Why did the Model Fail?": Attributing Model Performance Changes to Distribution Shifts
(Poster)
The performance of machine learning models in novel environments may differ significantly from their performance during training due to shifts in the underlying data distribution. Attributing performance changes to specific data shifts is critical for identifying sources of model failures and designing stable models. In this work, we design a novel method for attributing performance differences between environments to shifts in the underlying causal mechanisms. We formulate the problem as a cooperative game and derive an importance weighting method for computing the value of a coalition of distributions. The contribution of each distribution to the total performance change is then quantified as its Shapley value. We demonstrate the correctness and utility of our method on two synthetic datasets and two real-world case studies, showing its effectiveness in attributing performance changes to a wide range of distribution shifts.
Haoran Zhang · Harvineet Singh · Marzyeh Ghassemi · Shalmali Joshi
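To make the Shapley-value step in the abstract above concrete, the following sketch computes exact Shapley values for a tiny cooperative game. The player names and the `toy_value` function are invented for illustration; in the described method the value of a coalition would instead come from the authors' importance-weighting estimate, which is not reproduced here.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    `value` maps a frozenset of players to a real number. The cost grows
    factorially with the number of players, so this is for small games only.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

def toy_value(coalition):
    """Hypothetical accuracy drop when the listed distributions are shifted."""
    drop = 0.0
    if "covariates" in coalition:
        drop += 0.10
    if "labels" in coalition:
        drop += 0.04
    if {"covariates", "labels"} <= coalition:
        drop += 0.02  # made-up interaction effect
    return drop

phi = shapley_values(["covariates", "labels"], toy_value)
```

The attributions sum to the value of the full coalition, which is the property that makes Shapley values a natural way to split a total performance change among several shifts.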


A Reproducible and Realistic Evaluation of Partial Domain Adaptation Methods
(Poster)
Unsupervised Domain Adaptation (UDA) aims at classifying unlabeled target images leveraging source labeled ones. In this work, we consider the Partial Domain Adaptation (PDA) variant, where we have extra source classes not present in the target domain. Most successful algorithms use model selection strategies that rely on target labels to find the best hyperparameters and/or models along training. However, these strategies violate the main assumption in PDA: only unlabeled target domain samples are available. The main goal of this work is to provide a realistic evaluation of PDA methods with the different model selection strategies under a consistent evaluation protocol. We evaluate 7 representative PDA algorithms on 2 different real-world datasets using 7 different model selection strategies. Our two main findings are: (i) without target labels for model selection, the accuracy of the methods decreases up to 30 percentage points; (ii) only one method and model selection pair performs reasonably well on both datasets. Experiments were performed with our PyTorch framework, BenchmarkPDA, which we open source.
Tiago Salvador · Kilian FATRAS · Ioannis Mitliagkas · Adam Oberman


Sparse Mixture-of-Experts are Domain Generalizable Learners
(Poster)
In domain generalization (DG), most existing methods focus on the loss function design. This paper proposes to explore an orthogonal direction, i.e., the design of the backbone architecture. It is motivated by an empirical finding that transformer-based models trained with empirical risk minimization (ERM) outperform CNN-based models employing state-of-the-art (SOTA) DG algorithms on multiple DG datasets. We develop a formal framework to characterize a network's robustness to distribution shifts by studying its architecture's alignment with the correlations in the dataset. This analysis guides us to propose a novel DG model built upon vision transformers, namely Generalizable Mixture-of-Experts (GMoE). Experiments on DomainBed demonstrate that GMoE trained with ERM outperforms SOTA DG baselines by a large margin.
Bo Li · Yifei Shen · Jingkang Yang · Yezhen Wang · Jiawei Ren · Tong Che · Jun Zhang · Ziwei Liu


Deep Class-Conditional Gaussians for Continual Learning
(Poster)
The current state of the art for continual learning with frozen, pretrained embedding networks is a family of simple probabilistic models defined over the embedding space, for example class-conditional Gaussians. As yet, in the task-incremental online setting, it has been an open question how to extend these methods to the case where the embedding function has to be learned from scratch. In this paper, we propose DeepCCG, an empirical Bayesian method which learns online both a class-conditional Gaussian model and an embedding function. The learning process can be interpreted as using a variant of experience replay, known to be effective in continual learning. As part of our framework, we decide which examples to store by selecting the subset that minimises the KL divergence between the true posterior and the posterior induced by the subset. We demonstrate performance in task-incremental online settings, including those with overlapping tasks. Our method outperforms all other methods, including several other replay-based methods.
Thomas Lee · Amos Storkey


A Closer Look at Novel Class Discovery from the Labeled Set
(Poster)
Novel class discovery (NCD) aims to infer novel categories in an unlabeled set using prior knowledge of a labeled set comprising diverse but related classes. Existing research focuses on using the labeled set methodologically and little on analyzing it. In this study, we take a closer look at NCD from the perspective of the labeled set and focus on two questions: (i) Given an unlabeled set, what labeled set best supports novel class discovery? (ii) A fundamental premise of NCD is that the labeled set must be related to the unlabeled set, but how can we measure this relation? For (i), we propose and substantiate the hypothesis that NCD could benefit from a labeled set with high semantic similarity to the unlabeled set. Using ImageNet's hierarchical class structure, we create a large-scale benchmark with variable semantic similarity across labeled/unlabeled datasets. In contrast, existing NCD benchmarks ignore the semantic relation. For (ii), we introduce a mathematical definition for quantifying the semantic similarity between labeled and unlabeled sets. We utilize this metric to validate our established benchmark and demonstrate that it corresponds closely with NCD performance. Furthermore, without quantitative analysis, previous works commonly assume that label information is always beneficial. However, counterintuitively, our experimental results show that using labels may lead to suboptimal outcomes in low-similarity settings.
ZIYUN LI · Jona Otholt · Ben Dai · Di Hu · Christoph Meinel · Haojin Yang


A new benchmark for group distribution shifts in hand grasp regression for object manipulation. Can meta-learning raise the bar?
(Poster)
Understanding hand-object pose with computer vision opens the door to new applications in mixed reality, assisted living or human-robot interaction. Most methods are trained and evaluated on balanced datasets. This is of limited use in real-world applications; how do these methods perform in the wild on unknown objects? We propose a novel benchmark for object group distribution shifts in hand and object pose regression. We then test the hypothesis that meta-learning a baseline pose regression neural network can adapt to these shifts and generalise better to unknown objects. Our results show measurable improvements over the baseline, depending on the amount of prior knowledge. For the task of joint hand-object pose regression, we observe optimisation interference for the meta-learner. To address this issue and improve the method further, we provide a comprehensive analysis which should serve as a basis for future work on this benchmark.
Théo Morales · Gerard Lacey


Instance norm improves meta-learning in class-imbalanced land cover classification
(Poster)
Distribution shift is omnipresent in geographic data, where various climatic and cultural factors lead to different representations across the globe. We aim to adapt dynamically to unseen data distributions with model-agnostic meta-learning, where data sampled from each distribution is seen as a task with only a few annotated samples. Transductive batch normalization layers are often employed in meta-learning models, as they reach the highest numerical accuracy on the class-balanced target tasks used as meta-learning benchmarks. In this work, we demonstrate empirically that transductive batch normalization collapses when deployed on a real class-imbalanced land cover classification problem. As a solution, we propose replacing batch normalization with instance normalization. This modification consistently outperformed all other normalization alternatives across different meta-learning algorithms in our class-imbalanced land cover classification test tasks.
Marc Russwurm · Devis Tuia
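The difference between the two normalization schemes named in the abstract above can be shown with a few lines of NumPy. This is a generic sketch of batch versus instance normalization without learnable scale and shift, not the authors' model. It only illustrates that batch statistics depend on which other samples share the batch, whereas instance statistics do not; the array shapes and the shifted second batch are assumptions chosen for the demonstration.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each channel with statistics pooled over batch and spatial axes."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def instance_norm(x, eps=1e-5):
    """Normalize each channel of each sample with its own spatial statistics."""
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
sample = rng.normal(size=(1, 3, 4, 4))  # one input, layout (N, C, H, W)

# The same sample placed in two batches with different companions.
batch_a = np.concatenate([sample, rng.normal(size=(7, 3, 4, 4))])
batch_b = np.concatenate([sample, rng.normal(loc=5.0, size=(7, 3, 4, 4))])

same_under_in = np.allclose(instance_norm(batch_a)[0], instance_norm(batch_b)[0])
same_under_bn = np.allclose(batch_norm(batch_a)[0], batch_norm(batch_b)[0])
```

Here `same_under_in` is true and `same_under_bn` is false: the normalized sample changes with batch composition only under batch normalization.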


CUDA: Curriculum of Data Augmentation for Long-tailed Recognition
(Poster)
Class imbalance problems frequently occur in real-world tasks, and conventional deep learning algorithms are well known for performance degradation on imbalanced training datasets. To mitigate this problem, many approaches have aimed to balance among given classes by re-weighting or re-sampling training samples. These re-balancing methods increase the impact of minority classes and reduce the influence of majority classes on the output of models. Despite extensive recent studies, no deep analysis has been conducted on determining which classes to augment and how strongly to augment them. In this study, we propose a simple and efficient novel curriculum, which is designed to find the appropriate per-class strength of data augmentation, called CUDA: CUrriculum of Data Augmentation for long-tailed recognition. CUDA can simply be integrated into existing long-tailed recognition methods. We present the results of experiments showing that CUDA effectively achieves better generalization performance compared to the state-of-the-art method on imbalanced datasets such as CIFAR-100-LT.
Sumyeong Ahn · Jongwoo Ko · Se-Young Yun


Benchmarking Robustness under Distribution Shift of Multimodal Image-Text Models
(Poster)
Multimodal image-text models have shown remarkable performance in the past few years. However, the robustness of such foundation models against distribution shifts is crucial in downstream applications. In this paper, we investigate their robustness under image and text perturbations. We first build several multimodal benchmark datasets by applying 17 image perturbation and 16 text perturbation techniques. Then we extensively study the robustness of 6 widely adopted models on 3 downstream tasks (image-text retrieval, visual reasoning, and visual entailment). We observe that these powerful multimodal models are sensitive to image/text perturbations, especially to image perturbations. For text, character-level perturbations have shown higher adversarial impact than word-level and sentence-level perturbations. We also observe that models trained by generative objectives tend to be more robust. Our findings in terms of robustness study could facilitate the development of large image-text models, as well as their deployment for real-world applications.
Jielin Qiu · Yi Zhu · Xingjian Shi · Zhiqiang Tang · DING ZHAO · Bo Li · Mu Li


Sorted eigenvalue comparison $d_{\mathsf{Eig}}$: A simple alternative to $d_{\mathsf{FID}}$
(Poster)
For $i = 1, 2$, let $\mathbf{S}_i$ be the sample covariance of $\mathbf{Z}_i$ with $n_i$ $p$-dimensional vectors. First, we theoretically justify an improved Fréchet Inception Distance ($d_{\mathsf{FID}}$) algorithm that replaces np.trace(sqrtm($\mathbf{S}_1 \mathbf{S}_2$)) with np.sqrt(eigvals($\mathbf{S}_1 \mathbf{S}_2$)).sum(). With the appearance of unsorted eigenvalues in the improved $d_{\mathsf{FID}}$, we are then motivated to propose sorted eigenvalue comparison ($d_{\mathsf{Eig}}$) as a simple alternative: $d_{\mathsf{Eig}}(\mathbf{S}_1, \mathbf{S}_2)^2=\sum_{j=1}^p (\sqrt{\lambda_j^1} - \sqrt{\lambda_j^2})^2$, where $\lambda_j^i$ is the $j$-th largest eigenvalue of $\mathbf{S}_i$. Second, we present two main takeaways for the improved $d_{\mathsf{FID}}$ and proposed $d_{\mathsf{Eig}}$. (i) $d_{\mathsf{FID}}$: The error bound for computing non-negative eigenvalues of diagonalizable $\mathbf{S}_1 \mathbf{S}_2$ is reduced to $\mathcal{O}(\varepsilon) \|\mathbf{S}_1\| \|\mathbf{S}_1 \mathbf{S}_2\|$, along with reducing the run time by $\sim25\%$. (ii) $d_{\mathsf{Eig}}$: The error bound for computing non-negative eigenvalues of sample covariance $\mathbf{S}_i$ is further tightened to $\mathcal{O}(\varepsilon) \|\mathbf{S}_i\|$, while reducing the run time by $\sim90\%$. Last, we discuss limitations and future work for $d_{\mathsf{Eig}}$.
Jiqing Wu · Viktor H Koelzer
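The two quantities defined in the abstract above translate almost directly into NumPy. The sketch below follows the stated formulas. The function names, the clipping of tiny negative eigenvalues caused by round-off, and the assumption that each feature matrix stores one vector per row with at least two dimensions are choices made here, not details from the listing.

```python
import numpy as np

def trace_sqrt_term(s1, s2):
    """np.sqrt(eigvals(S1 S2)).sum(), the replacement for np.trace(sqrtm(S1 S2))."""
    lam = np.linalg.eigvals(s1 @ s2)
    # Eigenvalues of a product of covariance matrices are real and non-negative
    # in exact arithmetic; drop round-off imaginary parts and negatives.
    return float(np.sqrt(np.clip(lam.real, 0.0, None)).sum())

def d_eig(z1, z2):
    """Sorted eigenvalue comparison between the sample covariances of z1 and z2.

    z1 and z2 have shape (n_i, p): one p-dimensional vector per row.
    """
    s1 = np.atleast_2d(np.cov(z1, rowvar=False))
    s2 = np.atleast_2d(np.cov(z2, rowvar=False))
    # eigvalsh returns eigenvalues in ascending order for both matrices, so the
    # j-th entries pair up exactly as the j-th largest eigenvalues would.
    lam1 = np.clip(np.linalg.eigvalsh(s1), 0.0, None)
    lam2 = np.clip(np.linalg.eigvalsh(s2), 0.0, None)
    return float(np.sqrt(np.sum((np.sqrt(lam1) - np.sqrt(lam2)) ** 2)))
```

Because only the spectra enter `d_eig`, rotating one feature set by an orthogonal matrix leaves the value unchanged, which is worth keeping in mind when interpreting it as a distance between distributions.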


HICO-DET-SG and V-COCO-SG: New Data Splits to Evaluate Systematic Generalization in Human-Object Interaction Detection
(Poster)
Human-Object Interaction (HOI) detection is a task to predict interactions between humans and objects in an image. In real-world scenarios, HOI detection models are required to generalize systematically, i.e., to generalize to novel combinations of objects and interactions, because it is highly probable that the training data cover only a limited portion of all possible combinations. However, to our knowledge, no open benchmark or existing work evaluates systematic generalization in HOI detection. To address this issue, we created two new sets of HOI detection data splits named HICO-DET-SG and V-COCO-SG based on the HICO-DET and V-COCO datasets. We evaluated representative HOI detection models on our data splits and observed large degradation in test performance compared to that on the original datasets. This result shows that systematic generalization is a challenging goal in HOI detection. We hope our new data splits encourage more research toward this goal.
Kentaro Takemoto · Moyuru Yamada · Tomotake Sasaki · Hisanao Akima


Undersampling is a Minimax Optimal Robustness Intervention in Nonparametric Classification
(Poster)
While a broad range of techniques have been proposed to tackle distribution shift, the simple baseline of training on an undersampled balanced dataset often achieves close to state-of-the-art accuracy across several popular benchmarks. This is rather surprising, since undersampling algorithms discard excess majority group data. To understand this phenomenon, we ask if learning is fundamentally constrained by a lack of minority group samples. We prove that this is indeed the case in the setting of nonparametric binary classification. Our results show that in the worst case, an algorithm cannot outperform undersampling unless there is a high degree of overlap between the train and test distributions (which is unlikely to be the case in real-world datasets), or if the algorithm leverages additional structure about the distribution shift. In particular, in the case of label shift we show that there is always an undersampling algorithm that is minimax optimal. In the case of group-covariate shift we show that there is an undersampling algorithm that is minimax optimal when the overlap between the group distributions is small. We also perform an experimental case study on a label shift dataset and find that in line with our theory, the test accuracy of robust neural network classifiers is constrained by the number of minority samples.
Niladri S. Chatterji · Saminul Haque · Tatsunori Hashimoto


Cross-Dataset Propensity Estimation for Debiasing Recommender Systems
(Poster)
Datasets for training recommender systems are often subject to distribution shift induced by users' and recommenders' selection biases. In this paper, we study the impact of selection bias on datasets with different quantization. We then leverage two differently quantized datasets from different source distributions to mitigate distribution shift by applying the inverse probability scoring method from causal inference. Empirically, our approach achieves significant performance improvements over single-dataset methods and alternative ways of combining two datasets.
Fengyu Li · Sarah Dean
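The inverse probability weighting mentioned in the abstract above can be sketched in its generic textbook form: each observed entry is weighted by the reciprocal of its probability of being observed, which corrects for selection bias in expectation. The function names and the toy ratings are hypothetical, and the propensities are simply given here, whereas the abstract describes estimating them across two datasets.

```python
from itertools import product

import numpy as np

def ips_mean(values, observed, propensity):
    """Inverse-propensity estimate of the mean of `values` over all entries,
    using only the entries where `observed` is True."""
    weights = observed / propensity
    return float(np.sum(weights * np.where(observed, values, 0.0)) / values.size)

def expected_estimate(values, propensity):
    """Exact expectation of ips_mean over every possible observation mask."""
    total = 0.0
    for mask in product([False, True], repeat=values.size):
        mask = np.array(mask)
        prob = np.prod(np.where(mask, propensity, 1 - propensity))
        total += prob * ips_mean(values, mask, propensity)
    return float(total)

# Toy ratings: high ratings are assumed to be observed more often than low ones.
ratings = np.array([5.0, 4.0, 1.0])
propensity = np.array([0.8, 0.5, 0.1])
```

Enumerating all masks with `expected_estimate(ratings, propensity)` recovers the plain mean of `ratings`, even though any single naive average over observed entries would be biased toward the frequently observed high ratings.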


Multiple Modes for Continual Learning
(Poster)
Adapting model parameters to incoming streams of data is a crucial factor in deep learning scalability. Interestingly, prior continual learning strategies in online settings inadvertently anchor their updated parameters to a local parameter subspace to remember old tasks, or else drift away from the subspace and forget. From this observation, we formulate a trade-off between constructing multiple parameter modes and allocating tasks per mode. Mode-Optimized Task Allocation (MOTA), our contributed adaptation strategy, trains multiple modes in parallel, then optimizes task allocation per mode. We empirically demonstrate improvements over baseline continual learning strategies and across varying distribution shifts, namely sub-population, domain, and task shift.
Siddhartha Datta · Nigel Shadbolt


An Empirical Study on Distribution Shift Robustness From the Perspective of Pre-Training and Data Augmentation
(Poster)
The performance of machine learning models under distribution shift has been the focus of the community in recent years. Most current methods aim to improve robustness to distribution shift from the algorithmic perspective, i.e., by designing better training algorithms to help generalization on shifted test distributions. This paper studies the distribution shift problem from the perspective of pre-training and data augmentation, two important factors in the practice of deep learning that have not been systematically investigated by existing work. By evaluating seven pre-trained models, including ResNets and ViTs with self-supervised and supervised modes, on five important distribution-shift datasets from the WILDS and DomainBed benchmarks, with five different learning algorithms, we provide the first comprehensive empirical study focusing on pre-training and data augmentation. With our empirical results obtained from 1,330 models, we provide the following main observations: 1) ERM combined with data augmentation can achieve state-of-the-art performance if we choose a proper pre-trained model respecting the data property; 2) specialized algorithms further improve the robustness on top of ERM when handling a specific type of distribution shift, e.g., GroupDRO for spurious correlation and CORAL for large-scale out-of-distribution data; 3) comparing different pre-training modes, architectures and data sizes, we provide novel observations about pre-training under distribution shift, which shed light on designing or selecting a pre-training strategy for different kinds of distribution shifts. In summary, our empirical study provides a comprehensive baseline for a wide range of pre-trained models fine-tuned with data augmentation, which may inspire future research on distribution shift that exploits the power of pre-training and data augmentation.
Ziquan Liu · Yi Xu · Yuanhong Xu · Qi Qian · Hao Li · Rong Jin · Xiangyang Ji · Antoni Chan


Characterizing Anomalies with Explainable Classifiers
(Poster)
As machine learning techniques are increasingly used to make societal-scale decisions, model performance issues stemming from data drift can result in costly consequences. While methods exist to quantify data drift, a further classification of drifted points into groups of similarly anomalous points can help practitioners combat drift (e.g. by providing context about how and where in the data pipeline shift might be introduced). We show how such characterization is possible by making use of tools from the model explainability literature. We also show how simple rules can be extracted to generate database queries for anomalous data and to detect anomalous data in the future.
Naveen Durvasula · Valentine d Hauteville · Keegan Hines · John Dickerson


Momentum-based Weight Interpolation of Strong Zero-Shot Models for Continual Learning
(Poster)
Large pre-trained, zero-shot capable models have shown considerable success both for standard transfer and adaptation tasks, with particular robustness towards distribution shifts. In addition, subsequent fine-tuning can considerably improve performance on a selected downstream task. However, through naive fine-tuning, these zero-shot models lose their generalizability and robustness towards distribution shifts. This is a particular problem for tasks such as Continual Learning (CL), where continuous adaptation has to be performed as new task distributions are introduced sequentially. In this work, we showcase that where fine-tuning falls short in adapting such zero-shot capable models, simple momentum-based weight interpolation can provide consistent improvements for CL tasks in both memory-free and memory-based settings. In particular, we find improvements of over $+4\%$ on standard CL benchmarks, while reducing the error to the upper limit of jointly training on all tasks at once in parts by more than half, allowing the continual learner to inch closer to the joint training limits.
Zafir Stojanovski · Karsten Roth · Zeynep Akata
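Momentum-based weight interpolation, as named in the abstract above, can be read as keeping a slowly moving copy of the weights that takes a small step toward the fine-tuned weights after each update. The sketch below shows that generic exponential-moving-average form. The dictionary layout, the function name, and the momentum value are assumptions for illustration, and the authors' exact update schedule is not given in the listing.

```python
import numpy as np

def momentum_interpolate(slow, fast, momentum=0.99):
    """Return slow weights moved a fraction (1 - momentum) toward fast weights.

    `slow` and `fast` map parameter names to arrays of matching shapes.
    """
    return {
        name: momentum * slow[name] + (1.0 - momentum) * fast[name]
        for name in slow
    }

# Toy run: the slow copy starts at the (hypothetical) zero-shot weights and
# trails a fine-tuned copy that is held fixed here for simplicity.
slow = {"w": np.zeros(2)}
fast = {"w": np.ones(2)}
for _ in range(3):
    slow = momentum_interpolate(slow, fast, momentum=0.9)
```

With a fixed target, the slow weights after n steps sit at 1 - momentum**n of the way to the fine-tuned weights, so a momentum close to 1 keeps the model near its zero-shot starting point for longer.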


Explanation Shift: Detecting distribution shifts on tabular data via the explanation space
(Poster)
As input data distributions evolve, the predictive performance of machine learning models tends to deteriorate. In the past, predictive performance was considered the key indicator to monitor. However, explanation aspects have gained attention in recent years. In this work, we investigate how model predictive performance and model explanation characteristics are affected under distribution shifts and how these key indicators are related to each other for tabular data. We find that the modeling of explanation shifts can be a better indicator for the detection of predictive performance changes than state-of-the-art techniques based on representations of distribution shifts. We provide a mathematical analysis of different types of distribution shifts as well as synthetic experimental examples.
Carlos Mougan · Klaus Broelemann · Gjergji Kasneci · Thanassis Tiropanis · Steffen Staab


Augmentation Consistency-guided Self-training for Source-free Domain Adaptive Semantic Segmentation
(Poster)
We focus on source-free domain adaptation for semantic segmentation, wherein a source model must adapt itself to a new target domain given only unlabeled target data. We propose Augmentation Consistency-guided Self-training (AUGCO), an adaptation algorithm that uses the model's pixel-level predictive consistency across diverse, automatically generated views of each target image along with model confidence to identify reliable pixel predictions, and selectively self-trains on those, leading to state-of-the-art performance within a simple-to-implement and fast-to-converge approach.
Viraj Prabhu · Shivam Khare · Deeksha Kartik · Judy Hoffman