
Workshop on Distribution Shifts: Connecting Methods and Applications
Chelsea Finn · Fanny Yang · Hongseok Namkoong · Masashi Sugiyama · Jacob Eisenstein · Jonas Peters · Rebecca Roelofs · Shiori Sagawa · Pang Wei Koh · Yoonho Lee

Sat Dec 03 07:00 AM -- 03:00 PM (PST) @ Room 388 - 390

This workshop brings together domain experts and ML researchers working on mitigating distribution shifts in real-world applications.

Distribution shifts—where a model is deployed on a data distribution different from what it was trained on—pose significant robustness challenges in real-world ML applications. Such shifts are often unavoidable in the wild and have been shown to substantially degrade model performance in applications such as biomedicine, wildlife conservation, sustainable development, robotics, education, and criminal justice. For example, models can systematically fail when tested on patients from different hospitals or people from different demographics.

This workshop aims to convene a diverse set of domain experts and methods-oriented researchers working on distribution shifts. We are broadly interested in methods, evaluations and benchmarks, and theory for distribution shifts, and we are especially interested in work on distribution shifts that arise naturally in real-world application contexts. Examples of relevant topics include, but are not limited to:
- Examples of real-world distribution shifts in various application areas. We especially welcome applications that are not widely discussed in the ML research community, e.g., education, sustainable development, and conservation. We encourage submissions that characterize distribution shifts and their effects in real-world applications; it is not at all necessary to propose a solution that is algorithmically novel.
- Methods for improving robustness to distribution shifts. Relevant settings include domain generalization, domain adaptation, and subpopulation shifts, and we are interested in a wide range of approaches, from uncertainty estimation to causal inference to active data collection. We welcome methods that can work across a variety of shifts, as well as more domain-specific methods that incorporate prior knowledge on the types of shifts we wish to be robust on. We encourage evaluating these methods on real-world distribution shifts.
- Empirical and theoretical characterization of distribution shifts. Distribution shifts can vary widely in the way in which the data distribution changes, as well as the empirical trends they exhibit. What empirical trends do we observe? What empirical or theoretical frameworks can we use to characterize these different types of shifts and their effects? What kinds of theoretical settings capture useful components of real-world distribution shifts?
- Benchmarks and evaluations. We especially welcome contributions for subpopulation shifts, as they are underrepresented in current ML benchmarks. We are also interested in evaluation protocols that move beyond the standard assumption of fixed training and test splits -- for which applications would we need to consider other forms of shifts, such as streams of continually-changing data or feedback loops between models and data?
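Several of the topics above center on subpopulation shift, where the standard evaluation reports worst-group accuracy alongside average accuracy. A minimal sketch in Python (the data and group labels are hypothetical; in practice groups come from dataset metadata such as hospital or demographic):

```python
# Minimal sketch of a subpopulation-shift evaluation: report worst-group
# accuracy in addition to average accuracy. Groups here are hypothetical.

def group_accuracies(y_true, y_pred, groups):
    """Return {group: accuracy} over (label, prediction, group) triples."""
    correct, total = {}, {}
    for y, p, g in zip(y_true, y_pred, groups):
        total[g] = total.get(g, 0) + 1
        correct[g] = correct.get(g, 0) + int(y == p)
    return {g: correct[g] / total[g] for g in total}

def worst_group_accuracy(y_true, y_pred, groups):
    return min(group_accuracies(y_true, y_pred, groups).values())

# A model can look fine on average (6/8 correct here) yet fail completely
# on a minority subpopulation (group "b"):
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "a", "a"]
print(worst_group_accuracy(y_true, y_pred, groups))  # → 0.0
```

Average accuracy alone (0.75 here) hides the failure on group "b", which is exactly why worst-group accuracy is the headline metric in subpopulation-shift benchmarks.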

Sat 7:00 a.m. - 7:10 a.m.  Opening Remarks (opening remarks for DistShift 2022)

Sat 7:10 a.m. - 7:35 a.m.  Invited Talk: Domain Adaptation: Theory, Algorithms, and Open Library (Mingsheng Long)

Sat 7:35 a.m. - 8:00 a.m.  Invited Talk: Machine-learning, distribution shifts and extrapolation in the Earth System (Markus Reichstein)

Sat 8:00 a.m. - 8:30 a.m.  Coffee Break

Sat 8:30 a.m. - 8:55 a.m.  Invited Talk: The promises and pitfalls of CVaR (Pradeep Ravikumar)

Sat 9:00 a.m. - 9:45 a.m.  Panel Discussion (in person): Behnam Neyshabur · David Sontag · Pradeep Ravikumar · Erin Hartman

Sat 9:45 a.m. - 11:00 a.m.  Lunch Break

Sat 11:00 a.m. - 12:30 p.m.  Poster Session

Sat 12:30 p.m. - 12:40 p.m.  Spotlight: First Steps Toward Understanding the Extrapolation of Nonlinear Models to Unseen Domains
Real-world machine learning applications often involve deploying neural networks to domains not seen at training time. We therefore need to understand the extrapolation of nonlinear models: under what conditions on the distributions and function class can models be guaranteed to extrapolate to new test distributions? The question is challenging because even two-layer neural networks cannot be guaranteed to extrapolate outside the support of the training distribution without further assumptions on the domain shift. This paper makes initial steps toward analyzing the extrapolation of nonlinear models under structured domain shift. We primarily consider settings where the marginal distribution of each coordinate of the data (or subset of coordinates) does not shift significantly across the training and test distributions, but the joint distribution may have a much bigger shift.
We prove that the family of nonlinear models of the form $f(x)=\sum f_i(x_i)$, where $f_i$ is an arbitrary function on the subset of features $x_i$, can extrapolate to unseen distributions if the covariance of the features is well-conditioned. To the best of our knowledge, this is the first result that goes beyond linear models and the bounded density ratio assumption, even though the assumptions on the distribution shift and function class are stylized.
Kefan Dong · Tengyu Ma

Sat 12:40 p.m. - 12:50 p.m.  Spotlight: Learning Invariant Representations under General Interventions on the Response
It has become increasingly common to collect observations of feature and response pairs from different environments. As a consequence, learned predictors must be applied to data with a different distribution due to distribution shifts. One principled approach is to adopt structural causal models to describe the training and test distributions, following the invariance principle: the conditional distribution of the response given its predictors remains the same across environments. However, this principle can be violated in practical settings when the response is intervened on. A natural question is whether it is still possible to identify other forms of invariance to facilitate prediction in unseen environments. To shed light on this challenging scenario, we introduce the invariant matching property (IMP), an explicit relation that captures interventions through an additional feature. This leads to an alternative form of invariance that enables a unified treatment of general interventions on the response. We analyze the asymptotic generalization errors of our method under both discrete and continuous environment settings, where the continuous case is handled by relating it to semiparametric varying-coefficient models.
We present algorithms that show competitive performance compared to existing methods over various experimental settings.
Kang Du · Yu Xiang

Sat 12:50 p.m. - 1:00 p.m.  Spotlight: CAREER: Economic Prediction of Labor Sequence Data Under Distribution Shift
Labor economists regularly analyze employment data by fitting predictive models to small, carefully constructed longitudinal survey datasets. Although modern machine learning methods offer promise for such problems, these survey datasets are too small to take advantage of them. In recent years, large datasets of online resumes have also become available, providing data about the career trajectories of millions of individuals. However, the distribution of these large resume datasets differs in meaningful ways from the survey datasets used for economic estimation; standard econometric models cannot take advantage of their scale or make predictions under distribution shift. To this end we develop CAREER, a transformer-based model that uses transfer learning to learn representations of job sequences. CAREER is first fit to large, passively collected resume data and then fine-tuned on samples of the downstream data distribution of interest. We find that CAREER forms accurate predictions of job sequences, achieving state-of-the-art predictive performance on three widely used economics datasets. We also find that CAREER is adept at making predictions under distribution shifts in time.
Keyon Vafa · Emil Palikot · Tianyu Du · Ayush Kanodia · Susan Athey · David Blei

Sat 1:00 p.m. - 1:10 p.m.  Spotlight: Tackling Distribution Shifts in Federated Learning with Superquantile Aggregation
Federated learning has emerged as the predominant framework for distributed machine learning over decentralized data, e.g. on mobile phones.
The usual approaches suffer from a distribution shift: the model is trained to fit the average population distribution but is deployed on individual clients, whose data distributions can be quite different. We present a distributionally robust approach to federated learning based on a risk measure known as the superquantile and show how to optimize it by interleaving federated averaging steps with quantile computation. We demonstrate experimentally that our approach is competitive with the usual ones in terms of average error and outperforms them in terms of tail statistics of the error.
Krishna Pillutla · Yassine Laguel · Jérôme Malick · Zaid Harchaoui

Sat 1:10 p.m. - 1:20 p.m.  Spotlight: Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization
A common explanation for the failure of deep networks to generalize out-of-distribution is that they fail to recover the "correct" features. We challenge this notion with a simple experiment suggesting that ERM already learns sufficient features and that the current bottleneck is not feature learning but robust regression. We therefore argue that devising simpler methods for learning predictors on existing features is a promising direction for future research. Toward this end, we introduce Domain-Adjusted Regression (DARE), a convex objective for learning a linear predictor that is provably robust under a new model of distribution shift. Rather than learning one function, DARE performs a domain-specific adjustment to unify the domains in a canonical latent space and learns to predict in this space. Under a natural model, we prove that the DARE solution is the minimax-optimal predictor for a constrained set of test distributions. Further, we provide the first finite-environment convergence guarantee to the minimax risk, improving over existing analyses, which only yield minimax predictors after an environment threshold.
Evaluated on fine-tuned features, we find that DARE compares favorably to prior methods, consistently achieving equal or better performance.
Elan Rosenfeld · Pradeep Ravikumar · Andrej Risteski

Sat 1:20 p.m. - 1:30 p.m.  Spotlight: Data Feedback Loops: Model-driven Amplification of Dataset Biases
Datasets scraped from the internet have been critical to large-scale machine learning. Yet their success puts the utility of future internet-derived datasets at potential risk, as model outputs begin to replace human annotations as a source of supervision. In this work, we formalize a system where interactions with one model are recorded as history and scraped as training data in the future. We then analyze its stability over time by tracking changes to a test-time bias statistic (e.g. gender bias of model predictions). We find that the degree of bias amplification is closely linked to whether the model's outputs behave like samples from the training distribution, a behavior which we characterize and define as consistent calibration. Experiments in three conditional prediction scenarios (image classification, visual role-labeling, and language generation) demonstrate that models exhibiting sampling-like behavior are more calibrated and thus more stable. Based on this insight, we propose an intervention to help calibrate and stabilize unstable feedback systems.
Rohan Taori · Tatsunori Hashimoto

Sat 1:30 p.m. - 1:45 p.m.  Coffee Break

Sat 1:45 p.m. - 2:10 p.m.  Invited Talk: External Validity: Framework, Design, and Analysis (Erin Hartman)

Sat 2:10 p.m. - 2:35 p.m.  Invited Talk: Bringing real-world data to bear in addressing distribution shifts: a sociolinguistically-informed analysis of ASR errors (Alicia Beckford Wassink)

Sat 2:35 p.m. - 3:00 p.m.  Invited Talk: Geospatial Distribution Shifts in Ecology: Mapping the Urban Forest (Sara Beery)

Sat 2:58 p.m. - 3:00 p.m.
Closing Remarks (closing remarks for DistShift 2022)

Poster: Performative Prediction with Neural Networks
Performative prediction is a framework for learning models that influence the data they intend to predict. We focus on finding classifiers that are performatively stable, i.e. optimal for the data distribution they induce. Standard convergence results for the method of repeated risk minimization assume that the data distribution is Lipschitz continuous with respect to the model's parameters. Under this assumption, the loss must be strongly convex and smooth in these parameters; otherwise, the method will diverge for some problems. In this work, we instead assume that the data distribution is Lipschitz continuous with respect to the model's predictions, a more natural assumption for performative systems. As a result, we are able to significantly relax the assumptions on the loss function. In particular, we do not need to assume convexity with respect to the model's parameters. As an illustration, we introduce a resampling procedure that models realistic distribution shifts and show that it satisfies our assumptions. We support our theory by showing that one can learn performatively stable classifiers with neural networks making predictions about real data that shift according to our proposed procedure.
Mehrnaz Mofakhami · Ioannis Mitliagkas · Gauthier Gidel

Poster: Improving Domain Generalization with Interpolation Robustness
We address domain generalization by viewing the underlying distributional shift as interpolation between domains, and we devise an algorithm to learn a representation that is robustly invariant under such interpolation, an approach we call interpolation robustness.
Through extensive experiments, we show that our approach significantly outperforms the recent state-of-the-art algorithm of \citet{NEURIPS2021_2a271795} and the DeepAll baseline in a limited-data setting on the PACS and VLCS datasets.
Ragja Palakkadavath · Thanh Nguyen-Tang · Sunil Gupta · Svetha Venkatesh

Poster: Deconstructing Distributions: A Pointwise Framework of Learning
In machine learning, we traditionally evaluate the performance of a single model, averaged over a collection of test inputs. In this work, we propose a new approach: we measure the performance of a collection of models when evaluated at a single input point. Specifically, we study a point's profile: the relationship between models' average performance on the test distribution and their pointwise performance on this individual point. We find that profiles can yield new insights into the structure of both models and data, in- and out-of-distribution. For example, we empirically show that real data distributions consist of points with qualitatively different profiles. On one hand, there are "compatible" points with strong correlation between pointwise and average performance. On the other hand, there are points with weak and even negative correlation: cases where improving overall model accuracy actually hurts performance on these inputs. As an application, we use profiles to construct a dataset we call CIFAR-10-NEG: a subset of CINIC-10 such that for standard models, accuracy on CIFAR-10-NEG is negatively correlated with CIFAR-10 accuracy, illustrating for the first time an OOD dataset that completely inverts "accuracy on the line" (Miller et al., 2021).
Gal Kaplun · Nikhil Ghosh · Saurabh Garg · Boaz Barak · Preetum Nakkiran

Poster: Bitrate-Constrained DRO: Beyond Worst Case Robustness To Unknown Group Shifts
Although training machine learning models for robustness is critical for real-world adoption, determining how best to ensure robustness remains an open problem. Some methods (e.g., DRO) are overly conservative, while others (e.g., Group DRO) require domain knowledge that may be hard to obtain. In this work, we address limitations in prior approaches by assuming a more nuanced form of group shift: conditioned on the label, we assume that the true group function is simple. For example, we may expect that group shifts occur along high-level features (e.g., image background, lighting). Thus, we aim to learn a model that maintains high accuracy on simple group functions realized by these features, but need not spend valuable model capacity achieving high accuracy on contrived groups of examples. Based on this idea, we formulate a two-player game where, conditioned on the label, the adversary can only separate datapoints into potential groups using simple features, which corresponds to a bitrate constraint on the adversary's capacity. Our resulting practical algorithm, Bitrate-Constrained DRO (BR-DRO), does not require group information on training samples yet matches the performance of Group DRO. Our theoretical analysis reveals that in some settings the BR-DRO objective can provably yield statistically efficient and less pessimistic solutions than unconstrained DRO.
Amrith Setlur · Don Dennis · Benjamin Eysenbach · Aditi Raghunathan · Chelsea Finn · Virginia Smith · Sergey Levine

Poster: Impact of realistic properties of the point spread function on classification tasks to reveal a possible distribution shift
Image classification is a long-standing task in computer vision, with deep neural networks (DNNs) producing excellent results on various challenges.
However, they are required not only to perform highly accurately on benchmarks such as ImageNet, but also to robustly handle images in adverse conditions, such as modified lighting, sharpness, weather conditions, and image compression. Various benchmarks aimed at measuring robustness show that neural networks vary in how well they perform under distribution shifts. While datasets such as ImageNet-C model common corruptions such as blur and adverse weather conditions, we argue that the properties of the optical system, and the potentially resulting complex lens blur, are insufficiently well studied in the literature. This study evaluates the impact of realistic optical corruptions on ImageNet classification. The proposed complex corruption kernels are direction- and wavelength-dependent and include chromatic aberration, all of which are to be expected in realistic scenarios such as autonomous driving applications. Our experiments on twelve different DNN models show significant differences of more than 5% in top-1 classification error when compared to the model performances on matched ImageNet-C blur kernels.
Patrick Müller · Alexander Braun · Margret Keuper

Poster: A Simple Baseline that Questions the Use of Pretrained-Models in Continual Learning
With the success of pretraining techniques in representation learning, a number of continual learning methods based on pretrained models have been proposed. Some of these methods design continual learning mechanisms on the pretrained representations and allow only minimal updates, or even no updates, of the backbone models during continual learning. In this paper, we question whether the complexity of these models is needed to achieve good performance by comparing them to a simple baseline of our own design. We argue that the pretrained feature extractor itself can be strong enough to achieve competitive or even better continual learning performance on the Split-CIFAR100 and CORe50 benchmarks.
To validate this, we construct a very simple baseline that 1) uses the frozen pretrained model to extract image features for every class encountered during the continual learning stage and computes their corresponding mean features on the training data, and 2) predicts the class of the input based on the nearest-neighbor distance between test samples and the mean features of the classes, i.e., a Nearest Mean Classifier (NMC). This baseline is single-headed, exemplar-free, and can be task-free (by updating the means continually). It achieves 88.53% on 10-Split-CIFAR-100, surpassing most state-of-the-art continual learning methods initialized with the same pretrained transformer model. We hope our baseline may encourage future progress in designing learning systems that can continually add quality to their learned representations even when starting from pretrained weights.
Paul Janson · Wenxuan Zhang · Rahaf Aljundi · Mohamed Elhoseiny

Poster: RLSBench: A Large-Scale Empirical Study of Domain Adaptation Under Relaxed Label Shift
Despite the emergence of principled methods for domain adaptation under label shift (where only the class balance changes), the sensitivity of these methods to natural-seeming covariate shifts remains precariously underexplored. Meanwhile, popular deep domain adaptation heuristics, despite showing promise on benchmark datasets, tend to falter when faced with shifts in the class balance. Moreover, it is difficult to assess the state of the field owing to inconsistencies among relevant papers in evaluation criteria, datasets, and baselines. In this paper, we introduce RLSbench, a large-scale benchmark for such relaxed label shift settings, consisting of 11 vision datasets spanning more than 200 distribution shift pairs with different class proportions.
We evaluate 12 popular domain adaptation methods, demonstrating a more widespread susceptibility to failure under extreme shifts in the class proportions than was previously known. We develop an effective meta-algorithm, compatible with most deep domain adaptation heuristics, that consists of two steps: (i) pseudo-balance the data at each epoch; and (ii) adjust the final classifier with (an estimate of) the target label distribution. Furthermore, we discover that batch-norm adaptation of a model trained on source data, combined with the aforementioned corrections, offers a strong baseline largely missing from prior comparisons. We hope that these findings and the availability of RLSbench will encourage researchers to rigorously evaluate proposed methods in relaxed label shift settings.
Saurabh Garg · Nick Erickson · James Sharpnack · Alexander Smola · Sivaraman Balakrishnan · Zachary Lipton

Poster: Mitigating Dataset Bias by Using Per-sample Gradient
The performance of deep neural networks is strongly influenced by the training dataset setup. In particular, when attributes with a strong correlation to the target attribute are present, the trained model can make unintended prejudgments and show significant inference errors (the dataset bias problem). Various methods have been proposed to mitigate dataset bias, with emphasis on weakly correlated samples, called bias-conflicting samples. These methods rely on explicit bias labels provided by humans, and so incur labeling costs. Recently, several studies have tried to reduce human intervention by utilizing output-space values of neural networks, such as feature space, logits, loss, or accuracy. However, these output-space values may be insufficient for the model to understand the bias attributes well. In this study, we propose a gradient-based debiasing algorithm called PGD (Per-sample Gradient-based Debiasing).
PGD comprises three steps: (1) training a model with uniform batch sampling, (2) setting the importance of each sample in proportion to the norm of its per-sample gradient, and (3) training the model using importance-batch sampling, whose probabilities are obtained in step (2). Compared with existing baselines on various datasets, the proposed method achieves state-of-the-art accuracy on the classification task.
Sumyeong Ahn · SeongYoon Kim · Se-Young Yun

Poster: Train Offline, Test Online: A Real Robot Learning Benchmark
Three challenges limit the progress of robot learning research: robots are expensive (few labs can participate), everyone uses different robots (findings do not generalize across labs), and we lack internet-scale robotics data. We take on these challenges via a new benchmark: Train Offline, Test Online (TOTO). TOTO provides remote users with access to shared robots for evaluating methods on common tasks, along with an open-source dataset of these tasks for offline training. Its manipulation task suite requires challenging generalization to unseen objects, positions, and lighting. We present initial results on TOTO comparing five pretrained visual representations and four offline policy learning baselines, remotely contributed by five institutions. The real promise of TOTO, however, lies in the future: we release the benchmark for additional submissions from any user, enabling easy, direct comparison to several methods without the need to obtain hardware or collect data.
Gaoyue Zhou · Victoria Dean · Mohan Kumar Srirama · Aravind Rajeswaran · Jyothish Pari · Kyle Hatch · Aryan Jain · Tianhe Yu · Pieter Abbeel · Lerrel Pinto · Chelsea Finn · Abhinav Gupta

Poster: The Value of Out-of-distribution Data
More data is expected to help us generalize to a task.
But real datasets can contain out-of-distribution (OOD) data; this can come in the form of heterogeneity, such as intra-class variability, but also in the form of temporal shifts or concept drift. We demonstrate a counterintuitive phenomenon for such problems: the generalization error of the task can be a non-monotonic function of the number of OOD samples; a small number of OOD samples can improve generalization, but beyond a threshold the generalization error can deteriorate. We also show that if we know which samples are OOD, then using a weighted objective between the target and OOD samples ensures that the generalization error decreases monotonically. We demonstrate and analyze this phenomenon using linear classifiers on synthetic datasets and medium-sized neural networks on vision benchmarks such as MNIST, CIFAR-10, CINIC-10, PACS, and DomainNet, and observe the effect that data augmentation, hyperparameter optimization, and pre-training have on this behavior.
Ashwin De Silva · Rahul Ramesh · Carey E Priebe · Pratik Chaudhari · Joshua T Vogelstein

Poster: Reliability benchmarks for image segmentation
Recent work has shown the importance of reliability, where model performance is assessed under stress conditions pervasive in real-world deployment. In this work, we examine reliability tasks in the setting of semantic segmentation, a dense output problem that has typically only been evaluated using in-distribution predictive performance, for example the mean intersection-over-union score on the Cityscapes validation set. To reduce the gap toward reliable deployment in the real world, we compile a benchmark involving existing (and newly constructed) distribution shifts and metrics. We evaluate current models and several baselines to determine how well segmentation models make robust predictions across multiple types of distribution shift and flag when they don't know.
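The weighted objective mentioned in "The Value of Out-of-distribution Data" above can be sketched as a convex combination of target and OOD losses, so known-OOD samples can contribute without dominating the target term. The fixed weight below is an illustrative choice, not the paper's derived one:

```python
# Sketch of a weighted objective over target and known-OOD samples: the total
# loss is a convex combination of the two mean losses, so adding OOD samples
# cannot overwhelm the target term. The weight here is illustrative only.

def weighted_objective(target_losses, ood_losses, ood_weight=0.2):
    """Mean target loss combined with a downweighted mean OOD loss."""
    t = sum(target_losses) / len(target_losses)
    o = sum(ood_losses) / len(ood_losses) if ood_losses else 0.0
    return (1 - ood_weight) * t + ood_weight * o

# (1 - 0.25) * 0.3 + 0.25 * 2.0, i.e. about 0.725:
print(weighted_objective([0.2, 0.4], [1.0, 3.0], ood_weight=0.25))
```

Because the OOD contribution is bounded by its weight, the mixture avoids the non-monotonic degradation the abstract describes for unweighted pooling of OOD samples.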
Estefany Kelly Buchanan · Michael Dusenberry · Jie Ren · Kevin Murphy · Balaji Lakshminarayanan · Dustin Tran

Poster: Adaptive Pre-training of Language Models for Better Logical Reasoning
Logical reasoning over text is an important ability that requires understanding the logical information present in the text and reasoning through it to infer new conclusions. Prior work on improving the logical reasoning ability of language models requires complex processing of training data (e.g., aligning symbolic knowledge to text), yielding task-specific data augmentation solutions that restrict the learning of general logical reasoning skills. In this work, we propose AERIE, an adaptively pre-trained language model with improved logical reasoning abilities. We select a subset of Wikipedia, based on a set of logical inference keywords, for continued pretraining of a language model. We use two self-supervised loss functions: a modified masked language modeling loss in which only specific parts-of-speech words, which likely require more reasoning than basic language understanding, are masked, and a sentence classification loss that teaches the model to distinguish between entailment and contradiction types of sentences. The proposed training paradigm is both simple and generalizable across tasks. We demonstrate the effectiveness of AERIE by comparing it with prior baselines on two logical reasoning datasets: AERIE performs comparably on ReClor and outperforms baselines on LogiQA.
Soumya Sanyal · Yichong Xu · Shuohang Wang · Ziyi Yang · Reid Pryzant · Wenhao Yu · Chenguang Zhu · Xiang Ren

Poster: Using Interventions to Improve Out-of-Distribution Generalization of Text-Matching Systems
Given a user's input text, text-matching recommender systems output relevant items by comparing the input text to available items' descriptions, as in product-to-product recommendation on e-commerce platforms.
As users' interests and item inventory are expected to change, it is important for a text-matching system to generalize to data shifts, a task known as out-of-distribution (OOD) generalization. However, we find that the popular approach of fine-tuning a large base language model on paired item relevance data (e.g., user clicks) can be counterproductive for OOD generalization. For a product recommendation task, fine-tuning obtains worse accuracy than the base model when recommending items in a new category or for a future time period. To explain this generalization failure, we consider an intervention-based importance metric, which shows that a fine-tuned model captures spurious correlations and fails to learn the causal features that determine the relevance between any two text inputs. Moreover, standard methods for causal regularization do not apply in this setting, because unlike in images, there exist no universally spurious features in a text-matching task (the same token may be spurious or causal depending on the text it is being matched to). For OOD generalization on text inputs, therefore, we highlight a different goal: avoiding high importance scores for certain features. We do so using an intervention-based regularizer that constrains the importance score of any token on the model's relevance score to be similar to that of the base model. Results on Amazon product and three question recommendation datasets show that our proposed regularizer improves generalization for both in-distribution and OOD evaluation, especially in difficult scenarios when the base model is not accurate.
Parikshit Bansal · Yashoteja Prabhu · Emre Kiciman · Amit Sharma

Poster: Dropout Disagreement: A Recipe for Group Robustness with Fewer Annotations
Empirical risk minimization (ERM) of neural networks can cause over-reliance on spurious correlations and poor generalization on minority groups.
Deep feature reweighting (DFR) improves group robustness via last-layer retraining, but it requires full group and class annotations for the reweighting dataset. To eliminate this impractical requirement, we propose a one-shot active learning method which constructs the reweighting dataset from the points on which the ERM model's predictions with and without dropout activated disagree. Our experiments show our approach achieves 95% of DFR performance on the Waterbirds and CelebA datasets despite using no group annotations and up to 7.5× fewer class annotations.
Tyler LaBonte · Abhishek Kumar · Vidya Muthukumar

Poster: Domain Generalization for Robust Model-Based Offline Reinforcement Learning
Existing offline reinforcement learning (RL) algorithms typically assume that training data is either 1) generated by a known policy or 2) of entirely unknown origin. We consider multi-demonstrator offline RL, a middle ground where we know which demonstrators generated each dataset but make no assumptions about the demonstrators' underlying policies. This is the most natural setting when collecting data from multiple human operators, yet it remains unexplored. Since different demonstrators induce different data distributions, we show that this can be naturally framed as a domain generalization problem, with each demonstrator corresponding to a different domain. Specifically, we propose Domain-Invariant Model-based Offline RL (DIMORL), where we apply Risk Extrapolation (REx) (Krueger et al., 2020) to the process of learning dynamics and reward models. Our results show that models trained with REx exhibit improved domain generalization performance compared with the natural baseline of pooling all demonstrators' data. We observe that the resulting models frequently enable the learning of superior policies in the offline model-based RL setting, can improve the stability of the policy learning process, and can potentially increase exploration.
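DIMORL above applies Risk Extrapolation (REx) to model learning; in its V-REx form, REx adds a penalty on the variance of per-domain risks to the mean risk. A minimal sketch (beta and the toy risks are illustrative values, not those of the paper):

```python
# Sketch of the V-REx objective for domain generalization: mean risk across
# domains plus a variance penalty that pushes per-domain risks to be equal.
# Here each "domain" would correspond to one demonstrator's dataset.

def vrex_objective(domain_risks, beta=10.0):
    """Mean of per-domain risks plus beta times their (population) variance."""
    m = sum(domain_risks) / len(domain_risks)
    var = sum((r - m) ** 2 for r in domain_risks) / len(domain_risks)
    return m + beta * var

# Equal risks incur no penalty; unequal risks with the same mean are penalized
# (mean 0.5 plus penalty 10 * 0.09, i.e. about 1.4):
print(vrex_objective([0.5, 0.5]))
print(vrex_objective([0.2, 0.8]))
```

Minimizing this objective trades a little average risk for risks that are flat across demonstrators, which is what drives the extrapolation behavior REx targets.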
Link » Alan Clark · Shoaib Siddiqui · Robert Kirk · Usman Anwar · Stephen Chung · David Krueger 🔗 - Multi-Domain Long-Tailed Learning by Augmenting Disentangled Representations ( Poster )  link » There is an inescapable long-tailed class-imbalance issue in many real-world classification problems. Existing long-tailed classification methods focus on the single-domain setting, where all examples are drawn from the same distribution. However, real-world scenarios often involve multiple domains with distinct imbalanced class distributions. We study this multi-domain long-tailed learning problem and aim to produce a model that generalizes well across all classes and domains. Towards that goal, we introduce TALLY, which produces invariant predictors by balanced augmentation of hidden representations over domains and classes. Built upon a proposed selective balanced sampling strategy, TALLY achieves this by mixing the semantic representation of one example with the domain-associated nuisances of another, producing a new representation for use as data augmentation. To improve the disentanglement of semantic representations, TALLY further utilizes a domain-invariant class prototype that averages out domain-specific effects. We evaluate TALLY on four long-tailed variants of classical domain generalization benchmarks and two real-world imbalanced multi-domain datasets. The results indicate that TALLY consistently outperforms other state-of-the-art methods in both subpopulation shift and domain shift. Link » Huaxiu Yao · Xinyu Yang · Allan Zhou · Chelsea Finn 🔗 - Meta-Adaptive Stock Movement Prediction with Two-Stage Representation Learning ( Poster )  link »    Stock movement prediction has always been a tough but attractive task for researchers in machine learning and data mining. Generally speaking, two challenges for stock time series prediction remain under-explored.
One is the overfitting of deep learning models due to data shortage, and the other is the potential domain shift that may happen during the evolution of stock time series. In this paper, we present \textit{\textbf{M}eta-\textbf{A}daptive \textbf{S}tock movement prediction with two-\textbf{S}tag\textbf{E} \textbf{R}epresentation learning (\textbf{MASSER})}, a novel framework for stock movement prediction based on self-supervised learning and meta-learning. Specifically, we first build up a two-stage representation learning framework: the first stage learns a unified embedding for the data, and the second stage, built on the first, detects temporal domain shift via self-supervised learning. Then, we formalize the problem of stock movement prediction into a standard meta-learning setting. Inspired by importance sampling, we estimate sampling probability for tasks to balance the domain discrepancy caused by evolving temporal domains. Extensive experiment results on two open source datasets show that our framework with two simple but classical architectures (GRU and ResNet) as base models achieves improvements of 5\% - 9.5\% on average accuracy, compared to state-of-the-art baselines. Link » Donglin Zhan · Yusheng Dai · Yiwei Dong · Jinghai He · Zhenyi Wang · James Anderson 🔗 - Scale-conditioned Adaptation for Large Scale Combinatorial Optimization ( Poster )  link » Deep reinforcement learning (DRL) for combinatorial optimization has drawn attention as an alternative to human-designed solvers. However, training DRL solvers for large-scale tasks remains challenging due to combinatorial optimization problems' NP-hardness. This paper proposes a novel \textit{scale-conditioned adaptation} (SCA) scheme that improves the transferability of pre-trained solvers to larger-scale tasks.
The main idea is to design a scale-conditioned policy by plugging a simple deep neural network, denoted as \textit{scale-conditioned network} (SCN), into the existing DRL model. SCN extracts a hidden vector from a scale value, and then we add it to the representation vector of the pre-trained DRL model. The increment of the representation vector captures the context of scale information and helps the pre-trained model effectively adapt the policy to larger-scale tasks. Our method is verified to improve the zero-shot and few-shot performance of DRL-based solvers in various large-scale combinatorial optimization tasks. Link » Minsu Kim · Jiwoo SON · Hyeonah Kim · Jinkyoo Park 🔗 - Malign Overfitting: Interpolation and Invariance are Fundamentally at Odds ( Poster )  link » Learned classifiers should often possess certain invariance properties meant to encourage fairness, robustness, or out-of-distribution generalization. However, multiple recent works empirically demonstrate that common invariance-inducing regularizers are ineffective in the over-parameterized regime, in which classifiers perfectly fit (i.e. interpolate) the training data. This suggests that the phenomenon of "benign overfitting", in which models generalize well despite interpolating, might not favorably extend to settings in which robustness or fairness are desirable. In this work we provide a theoretical justification for these observations. We prove that - even in the simplest of settings - any interpolating classifier (with nonzero margin) will not satisfy these invariance properties. We then propose and analyze an algorithm that - in the same setting - successfully learns a non-interpolating classifier that is provably invariant. We validate our theoretical observations regarding the conflict between interpolation and invariance on simulated data and the Waterbirds dataset.
Link » Yoav Wald · Gal Yona · Uri Shalit · Yair Carmon 🔗 - On the Abilities of Mathematical Extrapolation with Implicit Models ( Poster )  link » Deep neural networks excel on a variety of different tasks, often surpassing human intelligence. However, when presented with out-of-distribution data, these models tend to break down even on the simplest tasks. In this paper, we compare implicitly-defined and classical deep learning models on a series of mathematical extrapolation tasks, where the models are tested with out-of-distribution samples during inference time. Throughout our experiments, implicit models greatly outperform classical deep learning networks that overfit the training distribution. We showcase implicit models' unique advantages for extrapolation thanks to their flexible and selective framework. Implicit models, with potentially unlimited depth, not only adapt well to out-of-distribution data but also understand the underlying structure of inputs much better. Link » Juliette Decugis · Max Emerling · Ashwin Ganesh · Alicia Tsai · Laurent El Ghaoui 🔗 - Estimation of prediction error with known covariate shift ( Poster )  link »    In supervised learning, the estimation of prediction error on unlabeled test data is an important task. Existing methods are usually built on the assumption that the training and test data are sampled from the same distribution, which is often violated in practice. As a result, traditional estimators like cross-validation (CV) will be biased, and this may result in poor model selection. In this paper, we assume that we have a test dataset in which the feature values are available but not the outcome labels, and focus on a particular form of distributional shift: covariate shift. We propose an alternative method based on a parametric bootstrap of the conditional error ErrX.
Empirically, our method outperforms CV in both simulation and real-data examples across different modeling tasks, and is comparable to state-of-the-art methods for image classification. Link » Hui Xu · Robert Tibshirani 🔗 - A Synthetic Limit Order Book Dataset for Benchmarking Forecasting Algorithms under Distributional Shift ( Poster )  link » In electronic trading markets, limit order books (LOBs) provide information about pending buy/sell orders at various price levels for a given security. Recently, there has been growing interest in using LOB data for resolving downstream machine learning tasks (e.g., forecasting). However, dealing with out-of-distribution (OOD) LOB data is challenging since distributional shifts are unlabeled in current publicly available LOB datasets. Therefore, it is critical to build a synthetic LOB dataset with labeled OOD samples serving as a testbed for developing models that generalize well to unseen scenarios. In this work, we utilize a multi-agent market simulator to build a synthetic LOB dataset with and without market stress scenarios, which allows for the design of controlled distributional shift benchmarking. Using the proposed synthetic dataset, we provide a holistic analysis of the forecasting performance of three different state-of-the-art forecasting methods. Our results reflect the need for increased research effort to develop algorithms robust to distributional shifts in high-frequency time series data.
Link » Defu Cao · Yousef El-Laham · Loc Trinh · Svitlana Vyetrenko · Yan Liu 🔗 - A Closer Look at Model Adaptation using Feature Distortion and Simplicity Bias ( Poster )  link » In order to achieve strong in-distribution (ID) and out-of-distribution (OOD) generalization during transfer learning, it was recently argued that adaptation protocols should better leverage the expressivity of high-quality, pretrained models by controlling feature distortion (FD), i.e., the failure to update features orthogonal to the ID. However, in addition to OOD generalization, practical applications require that adapted models are also safe. To this end, we study the susceptibility of common adaptation protocols to simplicity bias (SB), i.e., the well-known propensity of neural networks to rely upon simple features, as this phenomenon has recently been shown to underlie several problems in safe generalization. Using a controllable, synthetic setting, we demonstrate that solely controlling FD is not sufficient to avoid SB, harming safe generalization. Given the need to control both SB and FD for improved safety and ID/OOD generalization, we propose modifying a recently proposed protocol with the goal of reducing SB. We verify the effectiveness of these modified protocols in decreasing SB in the synthetic setting, and in jointly improving OOD generalization and safety on standard adaptation benchmarks. Link » Puja Trivedi · Danai Koutra · Jayaraman Thiagarajan 🔗 - Task Modeling: Approximating Multitask Predictions for Cross-Task Transfer ( Poster )  link » We study the problem of learning a target task when data samples from several auxiliary source tasks are available. Examples of this problem appear in multitask learning, where several tasks are combined jointly, and weak supervision, where multiple programmatic labels are generated for each sample. Because of the heterogeneity of task data, negative interference is a critical challenge for solving this problem.
Previous works have measured first-order task affinity as an effective metric, yet it becomes less accurate for approximating higher-order transfers. We propose a procedure called task modeling to model first- and higher-order transfers. This procedure samples subsets of source tasks and estimates surrogate functions to approximate multitask predictions. We show theoretical and empirical results that task models can be estimated in nearly-linear time in the number of tasks and accurately approximate multitask predictions. Thus, the target task's performance can be optimized using task models to select source tasks. We validate this approach on various datasets and performance metrics. Our method increases accuracy by up to 3.6% over existing methods on five text classification tasks with noisy supervision sources. Additionally, task modeling can be applied to group robustness and fairness metrics. Ablation studies show that task models can accurately predict whether or not a set of up to four source tasks transfers positively to the target task. Link » Dongyue Li · Huy Nguyen · Hongyang Zhang 🔗 - Generative Posterior Networks for Approximately Bayesian Epistemic Uncertainty Estimation ( Poster )  link » In many real-world problems, there is a limited set of training data, but an abundance of unlabeled data. We propose a new method, Generative Posterior Networks (GPNs), that uses unlabeled data to estimate epistemic uncertainty in high-dimensional problems. A GPN is a generative model that, given a prior distribution over functions, approximates the posterior distribution directly by regularizing the network towards samples from the prior. We prove theoretically that our method indeed approximates the Bayesian posterior and show empirically that it improves epistemic uncertainty estimation and scalability over competing methods. Link » Melrose Roderick · Felix Berkenkamp · Fatemeh Sheikholeslami · J. Zico Kolter 🔗 - Graph-Relational Distributionally Robust Optimization ( Poster )  link » Out-of-distribution (OOD) generalization is a challenging machine learning problem yet highly desirable in many high-stakes applications. Distributionally robust optimization (DRO) is a promising learning paradigm to tackle this challenge but suffers from several limitations. To address these limitations, we propose graph-relational distributionally robust optimization, which trains OOD-resilient machine learning models by exploiting the topological structure of data distributions. Our approach can uniformly handle both fully-known and partially-known topological structures. Empirical results on both synthetic and real-world datasets demonstrate the effectiveness and flexibility of our method. Link » Fengchun Qiao · Xi Peng 🔗 - A Unified Framework for Comparing Learning Algorithms ( Poster )  link » Understanding model biases is crucial to understanding how models will perform out-of-distribution (OOD). These biases often stem from particular design choices (e.g., architecture or data augmentation). We propose a framework for (learning) algorithm comparisons, wherein the goal is to find similarities and differences between models trained with two different learning algorithms. We begin by formalizing the goal of algorithm comparison as finding distinguishing feature transformations, input transformations that change the predictions of models trained with one learning algorithm but not the other. We then present a two-stage method for algorithm comparisons based on comparing how models use the training data, leveraging the recently proposed datamodel representations [IPE+22]. We demonstrate our framework through a case study comparing classifiers trained on the Waterbirds [SKH+20] dataset with/without ImageNet pre-training.
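The "distinguishing feature transformation" objective above can be illustrated with a simple counting sketch: a transformation distinguishes two learning algorithms if it flips the predictions of one model but not the other. This is illustrative code only; the paper's actual method works through datamodel representations:

```python
import numpy as np

def distinguishing_score(transform, model_a, model_b, inputs):
    # Fraction of inputs where `transform` changes model A's prediction
    # but leaves model B's prediction unchanged.
    pa, pb = model_a(inputs), model_b(inputs)
    ta, tb = model_a(transform(inputs)), model_b(transform(inputs))
    return float(np.mean((pa != ta) & (pb == tb)))
```

A score near 1 suggests the transformation targets a feature that one algorithm relies on and the other does not.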
Link » Harshay Shah · Sung Min Park · Andrew Ilyas · Aleksander Madry 🔗 - Domain Generalization with Nuclear Norm Regularization ( Poster )  link » The ability to generalize to unseen domains is crucial for machine learning systems, especially when we only have data from limited training domains and must deploy the resulting models in the real world. In this paper, we study domain generalization via the classic empirical risk minimization (ERM) approach with a simple regularizer based on the nuclear norm of the learned features from the training set. Theoretically, we provide intuitions on why nuclear norm regularization works better than ERM and ERM with L2 weight decay in linear settings. Empirically, we show that nuclear norm regularization achieves state-of-the-art average accuracy compared to existing methods in a wide range of domain generalization tasks (e.g. 1.7\% test accuracy improvement over the second-best baseline on DomainNet). Link » Zhenmei Shi · Yifei Ming · Ying Fan · Frederic Sala · Yingyu Liang 🔗 - Invariant Feature Subspace Recovery for Multi-Class Classification ( Poster )  link »    Domain generalization aims to learn a model over multiple training environments to generalize to unseen environments. Recently, Wang et al [2022] proposed Invariant-feature Subspace Recovery (ISR), a domain generalization algorithm which uses the means of class-conditional data distributions to provably identify the invariant-feature subspace. However, the original ISR algorithm conditions on a single class only, without utilizing information from the remaining classes. In this work, we consider the setting of multi-class classification, and propose an extension of the ISR algorithm, called ISR-Multiclass. This proposed algorithm can provably recover the invariant-feature subspace with $\mathcal{O}(d_{spu}/k) + 1$ environments, where $d_{spu}$ is the number of spurious features and $k$ is the number of classes.
Empirically, we first examine ISR-Multiclass on a synthetic dataset, and demonstrate its superiority over the original ISR in the multi-class setting. Furthermore, we conduct experiments on Multiclass Coloured MNIST, a semi-synthetic dataset with strong spurious correlations, and show that ISR-Multiclass can significantly improve the robustness of neural nets trained by various methods (e.g., ERM and IRM) against spurious correlations. Link » Gargi Balasubramaniam · Haoxiang Wang · Han Zhao 🔗 - Out-of-Distribution Robustness via Targeted Augmentations ( Poster )  link » Many machine learning systems deployed in the real world face the challenge of domain generalization, or generalizing to new domains that have different data distributions. For example, in wildlife conservation, animal classification models can perform poorly on new camera deployments. Across cameras, the data distribution changes along multiple factors, some of which are spurious (e.g., low-level background variations) and others of which are robustly predictive (e.g., habitat type). In this work, we aim to improve out-of-distribution performance by learning models that are invariant to spurious cross-domain variations while preserving predictive cross-domain variations. Specifically, we explore targeted augmentations that rely on prior knowledge to randomize only the spurious cross-domain variations. On iWildCam2020-WILDS and Camelyon17-WILDS, two domain generalization datasets, targeted augmentations outperform the previous state-of-the-art by 3.2 and 14.4 percentage points respectively, suggesting that targeting spurious cross-domain variations using prior knowledge can be an effective route to out-of-distribution robustness.
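The camera-trap example above suggests a concrete form of targeted augmentation: keep the robustly predictive pixels (the animal) fixed and randomize only the spurious factor (the background). A minimal sketch, where all names and the compositing scheme are our own illustration rather than the authors' implementation:

```python
import numpy as np

def targeted_augment(image, foreground_mask, background_pool, rng):
    # Composite the foreground onto a background drawn from a different
    # camera, randomizing only the spurious cross-domain variation.
    background = background_pool[rng.integers(len(background_pool))]
    mask = foreground_mask[..., None].astype(image.dtype)  # HxW -> HxWx1
    return image * mask + background * (1.0 - mask)
```

Unlike generic augmentation, nothing about the foreground is perturbed, so the predictive cross-domain variation is preserved.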
Link » Irena Gao · Shiori Sagawa · Pang Wei Koh · Tatsunori Hashimoto · Percy Liang 🔗 - Pushing the Accuracy-Fairness Tradeoff Frontier with Introspective Self-play ( Poster )  link »    Improving the accuracy-fairness frontier of deep neural network (DNN) models is an important problem. Uncertainty-based active learning (AL) can potentially improve the frontier by preferentially sampling underrepresented subgroups to create a more balanced training dataset. However, the quality of uncertainty estimates from modern DNNs tends to degrade in the presence of spurious correlations and dataset bias, compromising the effectiveness of AL for sampling tail groups. In this work, we propose Introspective Self-play (ISP), a simple approach to improve the uncertainty estimation of a deep neural network under dataset bias, by adding an auxiliary introspection task requiring a model to predict the bias for each data point in addition to the label. We show that ISP provably improves the bias-awareness of the model representation and the resulting uncertainty estimates. On two real-world tabular and language tasks, ISP serves as a simple “plug-in” for AL model training, consistently improving both the tail-group sampling rate and the final accuracy-fairness trade-off frontier of popular AL methods. Link » Jeremiah Liu · Krishnamurthy Dvijotham · Jihyeon Lee · Quan Yuan · Martin Strobel · Balaji Lakshminarayanan · Deepak Ramachandran 🔗 - Reducing Forgetting in Federated Learning with Truncated Cross-Entropy ( Poster )  link » In federated learning (FL), a global model is learned by aggregating model updates computed from a set of client nodes, each having their own data. A key challenge in FL is the heterogeneity of data across clients, whose data distributions differ from one another. Standard FL algorithms perform multiple gradient steps before synchronizing the model, which can lead to clients overly minimizing their local objective and diverging from other client solutions.
We demonstrate that in such a setting, individual client models experience "catastrophic forgetting" with respect to other client data. We propose a simple yet efficient approach that modifies the cross-entropy objective on a per-client basis such that classes outside a client's label set are shielded from abrupt representation change. Through empirical evaluations, we demonstrate that our approach can alleviate this problem, especially under the most challenging FL settings with high heterogeneity and low client participation. Link » Gwen Legate · Lucas Page-Caccia · Eugene Belilovsky 🔗 - Learning to Extrapolate: A Transductive Approach ( Poster )  link » Machine learning systems, especially overparameterized deep neural networks, can generalize to novel testing instances drawn from the same distribution as the training data. However, they fare poorly when evaluated on out-of-support testing points. In this work, we tackle the problem of developing machine learning systems that retain the power of overparametrized function approximators, while enabling extrapolation to out-of-support testing points when possible. This is accomplished by noting that under certain conditions, a "transductive" reparameterization can convert an out-of-support extrapolation problem into a problem of within-support combinatorial generalization. We propose a simple strategy based on bilinear embeddings to enable this type of combinatorial generalization, thereby addressing the out-of-support extrapolation problem. We instantiate a simple, practical algorithm applicable to various supervised learning problems and imitation learning tasks. Link » Aviv Netanyahu · Abhishek Gupta · Max Simchowitz · Kaiqing Zhang · Pulkit Agrawal 🔗 - Surgical Fine-Tuning Improves Adaptation to Distribution Shifts ( Poster )  link » A common approach to transfer learning under distribution shift is to fine-tune the last few layers of a pre-trained model, preserving learned features while also adapting to the new task.
This paper shows that in such settings, selectively fine-tuning a subset of layers (which we term surgical fine-tuning) matches or outperforms commonly used fine-tuning approaches. Moreover, the type of distribution shift influences which subset is more effective to tune: for example, for image corruptions, fine-tuning only the first few layers works best. We validate our findings systematically across seven real-world data tasks spanning three types of distribution shifts. Theoretically, we prove that for two-layer neural networks in an idealized setting, first-layer tuning can outperform fine-tuning all layers. Intuitively, fine-tuning more parameters on a small target dataset can cause information learned during pre-training to be forgotten, and the relevant information depends on the type of shift. Link » Yoonho Lee · Annie Chen · Fahim Tajwar · Ananya Kumar · Huaxiu Yao · Percy Liang · Chelsea Finn 🔗 - Characterising the Robustness of Reinforcement Learning for Continuous Control using Disturbance Injection ( Poster )  link »    In this study, we leverage the deliberate and systematic fault-injection capabilities of an open-source benchmark suite to perform a series of experiments on state-of-the-art deep and robust reinforcement learning algorithms. We aim to benchmark robustness in the context of continuous action spaces, which is crucial for deployment in robot control. We find that robustness is more prominent for action disturbances than it is for disturbances to observations and dynamics. We also observe that state-of-the-art approaches that are not explicitly designed to improve robustness perform at a level comparable to that achieved by those that are. Our study and results are intended to provide insight into the current state of safe and robust reinforcement learning and a foundation for the advancement of the field, in particular, for deployment in robotic systems.
Link » Catherine Glossop · Jacopo Panerati · Amrit Krishnan · Zhaocong Yuan · Angela Schoellig 🔗 - Class-wise Domain Generalization: A Novel Framework for Evaluating Distributional Shift ( Poster )  link »    Given that neural networks generalize unreasonably well in the IID setting, OOD presents a useful failure case for studying their generalization performance. Recent studies have shown that a carefully trained ERM model gives good performance in Domain Generalization (DG), with training samples from all domains randomly shuffled in each batch. Moreover, methods like MIRO can boost test performance of neural networks under distribution shift without training data being explicitly annotated with domain information. We present a new setting beyond Traditional DG (TDG), the Class-wise DG (CWDG) benchmark, where for each class we randomly select one of the domains and keep it aside for testing. Despite the network being exposed to all domains during training, our experiments show that its performance drops in this framework compared to TDG. We evaluate popular DG methods in this setting and find that performance in the two settings is correlated for most methods, though not all. Finally, we propose a novel method called Iterative Domain Feature Masking (IDFM), achieving state-of-the-art results on the proposed benchmark. Link » Sarath Sivaprasad · Akshay Goindani · Mario Fritz · Vineet Gandhi 🔗 - Memory bounds for continual learning ( Poster )  link » Continual learning, or lifelong learning, is a formidable current challenge to machine learning. It requires the learner to solve a sequence of $k$ different learning tasks, one after the other, while retaining its aptitude for earlier tasks; the continual learner should scale better than the obvious solution of developing and maintaining a separate learner for each of the $k$ tasks. We embark on a complexity-theoretic study of continual learning in the PAC framework.
We make novel uses of communication complexity to establish that any continual learner, even an improper one, needs memory that grows linearly with $k$, strongly suggesting that the problem is intractable. When logarithmically many passes over the learning tasks are allowed, we provide an algorithm based on multiplicative weights update whose memory requirement scales well; we also establish that improper learning is necessary for such performance. We conjecture that these results may lead to new promising approaches to continual learning. Link » Binghui Peng · Xi Chen · Christos Papadimitriou 🔗 - Tailored Overlap for Learning Under Distribution Shift ( Poster )  link » Distributional overlap is a critical determinant of learnability in domain adaptation. The standard theory quantifies overlap in terms of $\chi^2$ divergence, as this factors directly into variance and generalization bounds agnostic to the functional form of the $Y$-$X$ relationship. However, in many modern settings, we cannot afford this agnosticism; we often wish to transfer across distributions with disjoint support, where these standard divergence measures are infinite. In this note, we argue that "tailored" divergences that are restricted to measuring overlap in a particular function class are more appropriate. We show how $\chi^2$ (and other) divergences can be generalized to this restricted function class setting via a variational representation, and use this to motivate balancing weight-based methods that have been proposed before, but, we believe, should be more widely used.
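One standard way to make the restricted-divergence idea concrete is the f-divergence variational representation: for $\chi^2$, with $f(t)=(t-1)^2$ and convex conjugate $f^*(s)=s+s^2/4$, restricting the test function to a class $\mathcal{F}$ gives (a sketch of the construction, not necessarily the authors' exact definition):

```latex
\chi^2_{\mathcal{F}}(P \,\|\, Q)
  \;=\; \sup_{g \in \mathcal{F}}\;
  \mathbb{E}_P[g(X)] \;-\; \mathbb{E}_Q\!\left[ g(X) + \tfrac{1}{4}\, g(X)^2 \right].
```

Taking $\mathcal{F}$ to be all measurable functions recovers the usual $\chi^2$ divergence, while a restricted $\mathcal{F}$ can remain finite even when $P$ and $Q$ have disjoint support, which is exactly the regime the note targets.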
Link » David Bruns-Smith · Alexander D'Amour · Avi Feller · Steve Yadlowsky 🔗 - Few-Shot Learnable Augmentation for Financial Time Series Prediction under Distribution Shifts ( Poster )  link »    We address the problem of distribution shift in financial time series prediction, where the behavior of the time series changes over time. Satisfactory performance of forecasting algorithms requires constant model recalibration or fine-tuning to adapt to the new data distribution. Specifically, the ability to quickly fine-tune a model with only a few training samples available from the new distribution is crucial for many business applications. In this paper, we develop a novel method for learnable data augmentation that effectively adjusts to the new time series distribution with only a few samples. We demonstrate the effectiveness of our method compared to the state-of-the-art augmentation methods on both univariate time series (e.g., stock data) and multivariate time series (e.g., yield rate curves) in the presence of distribution shift due to the COVID market shock in 2020. Link » Dat Huynh · Elizabeth Fons · Svitlana Vyetrenko 🔗 - Mechanistic Lens on Mode Connectivity ( Poster )  link » With the rise of pretrained models, fine-tuning has become increasingly important. However, naive fine-tuning often does not eliminate a model's sensitivity to spurious cues. To understand and address this limitation, we study the geometry of neural network loss landscapes through the lens of mode-connectivity. We tackle two questions: 1) Are models trained on different distributions mode-connected? 2) Can we fine-tune a pre-trained model to switch modes? We define a notion of mechanistic similarity based on shared invariances and show linearly-connected modes are mechanistically similar. We find naive fine-tuning yields linearly connected solutions and hence is unable to induce relevant invariances.
We also propose and validate a method of "mechanistic fine-tuning" based on our gained insights. Link » Ekdeep S Lubana · Eric Bigelow · Robert Dick · David Krueger · Hidenori Tanaka 🔗 - Is Unsupervised Performance Estimation Impossible When Both Covariates and Labels shift? ( Poster )  link » Accurately estimating and explaining an ML model’s performance on new datasets is increasingly critical in reliable ML model deployment. With no labels on the new datasets, performance estimation paradigms often assume either covariate shift or label shift, and thus lead to poor estimation accuracy when the assumptions are broken. Is unsupervised performance monitoring really impossible when both covariates and labels shift? In this paper, we give a negative answer. To do so, we introduce Sparse Joint Shift (SJS), a new distribution shift model considering the shift of labels and a few features. We characterize the mathematical conditions under which SJS is identifiable. This shows that unsupervised performance monitoring is indeed feasible when a few features and labels shift. In addition, we propose SEES, an algorithmic framework for performance estimation under SJS. Preliminary experiments show the superior estimation performance of SEES over existing paradigms. This opens the door to tackling the joint shift of both covariates and labels without observing new datasets’ labels. Link » Lingjiao Chen · Matei Zaharia · James Zou 🔗 - First Steps Toward Understanding the Extrapolation of Nonlinear Models to Unseen Domains ( Poster )  link » Real-world machine learning applications often involve deploying neural networks to domains that are not seen at training time. Hence, we need to understand the extrapolation of \textit{nonlinear} models---under what conditions on the distributions and function class, models can be guaranteed to extrapolate to new test distributions.
The question is very challenging because even two-layer neural networks cannot be guaranteed to extrapolate outside the support of the training distribution without further assumptions on the domain shift. This paper makes some initial steps towards analyzing the extrapolation of nonlinear models for structured domain shift. We primarily consider settings where the \textit{marginal} distribution of each coordinate of the data (or subset of coordinates) does not shift significantly across the training and test distributions, but the joint distribution may have a much bigger shift. We prove that the family of nonlinear models of the form $f(x)=\sum f_i(x_i)$, where $f_i$ is an \emph{arbitrary} function on the subset of features $x_i$, can extrapolate to unseen distributions, if the covariance of the features is well-conditioned. To the best of our knowledge, this is the first result that goes beyond linear models and the bounded density ratio assumption, even though the assumptions on the distribution shift and function class are stylized. Link » Kefan Dong · Tengyu Ma 🔗 - DrML: Diagnosing and Rectifying Vision Models using Language ( Poster )  link » Recent multi-modal contrastive learning models have demonstrated the ability to learn an embedding space suitable for building strong vision classifiers, by leveraging the rich information in large-scale image-caption datasets. Our work highlights a distinct advantage of this multi-modal embedding space: the ability to diagnose vision classifiers through natural language. The traditional process of diagnosing model behaviors in deployment settings involves labor-intensive data acquisition and annotation. Our proposed method, DrML, can discover high-error data slices, identify influential attributes and further rectify undesirable model behaviors, without requiring any visual data. 
Through a combination of theoretical explanation and empirical verification, we present conditions under which classifiers trained on embeddings from one modality can be equivalently applied to embeddings from another modality. On a range of image datasets with known error slices, we demonstrate that our method can effectively identify the error slices and influential attributes, and can further use language to rectify failure modes of the classifier. Link » Yuhui Zhang · Jeff Z. HaoChen · Shih-Cheng Huang · Kuan-Chieh Wang · James Zou · Serena Yeung 🔗 - Empirical Study on Optimizer Selection for Out-of-Distribution Generalization ( Poster )  link » Modern deep learning systems are fragile and do not generalize well under distribution shifts. While much promising work has been done to address these concerns, a systematic study of the role of optimizers in out-of-distribution generalization has not been undertaken. In this study, we examine the performance of popular first-order optimizers for different classes of distributional shift under empirical risk minimization and invariant risk minimization. We address image and text classification settings, using DomainBed, WILDS, and the Backgrounds Challenge as out-of-distribution benchmarks in an exhaustive study. We search over a wide range of hyperparameters and examine the classification accuracy (in-distribution and out-of-distribution) for over 20,000 models. We arrive at the following findings: i) contrary to conventional wisdom, adaptive optimizers (e.g., Adam) perform worse than non-adaptive optimizers (e.g., SGD, momentum-based SGD); ii) in-distribution performance and out-of-distribution performance exhibit three types of behavior depending on the dataset – linear returns, increasing returns, and diminishing returns. We believe these findings can help practitioners choose the right optimizer and know what behavior to expect.
The code is available at https://anonymous.4open.science/r/OoD-Optimizer-Comparison-37DF. Link » Hiroki Naganuma · Kartik Ahuja · Ioannis Mitliagkas · Shiro Takagi · Tetsuya Motokawa · Rio Yokota · Kohta Ishikawa · Ikuro Sato 🔗 - Choosing Public Datasets for Private Machine Learning via Gradient Subspace Distance ( Poster )  link » Differentially private stochastic gradient descent privatizes model training by injecting noise into each iteration, where the noise magnitude increases with the number of model parameters. Recent works suggest that we can reduce the noise by leveraging public data for private machine learning, by projecting gradients onto a subspace prescribed by the public data. However, given a choice of public datasets, it is not clear which one is most appropriate for the private task. We give an algorithm for selecting a public dataset by measuring a low-dimensional subspace distance between gradients of the public and private examples. The computational and privacy cost overhead of our method is minimal. Empirical evaluation suggests that trained model accuracy is monotone in this distance. Link » Xin Gu · Gautam Kamath · Steven Wu 🔗 - Learning Invariant Representations under General Interventions on the Response ( Poster )  link »    It has become increasingly common to collect observations of feature and response pairs from different environments. As a consequence, one has to apply learned predictors to data with a different distribution due to distribution shifts. One principled approach is to adopt structural causal models to describe training and test models, following the invariance principle, which says that the conditional distribution of the response given its predictors remains the same across environments. However, this principle might be violated in practical settings when the response is intervened on.
A natural question is whether it is still possible to identify other forms of invariance to facilitate prediction in unseen environments. To shed light on this challenging scenario, we introduce the invariant matching property (IMP), an explicit relation that captures interventions through an additional feature. This leads to an alternative form of invariance that enables a unified treatment of general interventions on the response. We analyze the asymptotic generalization errors of our method under both discrete and continuous environment settings, where the continuous case is handled by relating it to semiparametric varying coefficient models. We present algorithms that show competitive performance compared to existing methods over various experimental settings. Link » Kang Du · Yu Xiang 🔗 - Theory and Algorithm for Batch Distribution Drift Problems ( Poster )  link »    We study a problem of gradual batch distribution drift motivated by several applications, which consists of determining an accurate predictor for a target time segment, for which a moderate number of labeled samples is at one's disposal, while leveraging past segments for which substantially more labeled samples are available. We give new algorithms for this problem guided by a new theoretical analysis and generalization bounds derived for this scenario. Additionally, we report the results of extensive experiments demonstrating the benefits of our drift algorithm, including comparisons with natural baselines. Link » Pranjal Awasthi · Corinna Cortes · Christopher Mohri 🔗 - Enabling the Visualization of Distributional Shift using Shapley Values ( Poster )  link » In streaming data, distributional shifts can appear both in the univariate dimensions and in the joint distributions with the labels.
However, in many real-time scenarios, labels are often either missing or delayed; unsupervised drift detection methods are desired in those applications. We design slidSHAPs, a novel representation method for unlabelled data streams. Commonly known in machine learning models, Shapley values offer a way to exploit correlation dependencies among random variables. We develop an unsupervised sliding Shapley value series for categorical time series, representing the data stream in a newly defined latent space and tracking the feature correlation changes. Transforming the original time series into slidSHAPs allows us to track how distributional shifts affect the correlations among the input variables; the approach is independent of any kind of labeling. We show how abrupt distributional shifts in the input variables are transformed into smoother changes in the slidSHAPs; moreover, slidSHAPs allow for intuitive visualization of the shifts when they are not observable in the original data. Link » Bin Li · Chiara Balestra · Emmanuel Müller 🔗 - Frequency Shortcut Learning in Neural Networks ( Poster )  link »    The generalization of neural networks is harmed by shortcut learning: the use of simple, non-semantic features may prevent networks from learning deeper semantic and task-related cues. Existing studies focus mainly on explicit shortcuts, e.g., color patches and annotated text in images, which are visually detectable and may be removed. However, there also exist implicit shortcuts, determined by bias or superficial statistics in the data, that neural networks can easily exploit. Mitigating the learning of implicit shortcuts is challenging due to the simplicity bias and an intrinsic difficulty in identifying them. We empirically investigate shortcut learning in the frequency domain and propose a method to identify learned frequency shortcuts based on frequency removal. We found that frequency shortcuts often correspond to textures consisting of specific frequencies.
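The frequency-removal probe can be sketched as follows (the band edges and the toy stand-in for a model's score are hypothetical; the paper's procedure operates on trained networks): zero out a band of spatial frequencies with an FFT mask and compare the model's output before and after; a large change suggests the model relies on that band.

```python
import numpy as np

def remove_frequency_band(image, low, high):
    """Zero out spatial frequencies with radius in [low, high) via an FFT mask."""
    f = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h // 2, xx - w // 2)
    f[(radius >= low) & (radius < high)] = 0
    return np.real(np.fft.ifft2(np.fft.ifftshift(f)))

# Hypothetical probe: compare a model's score on the original image and on
# versions with each band removed; bands causing the largest score change are
# candidate frequency shortcuts. Here `score` is a stand-in for a classifier.
rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32))
score = lambda x: float(x.std())  # toy "model" sensitive to all frequencies

for low, high in [(0, 4), (4, 12), (12, 23)]:
    drop = score(image) - score(remove_frequency_band(image, low, high))
    print(f"band [{low:2d},{high:2d}): score drop = {drop:.3f}")
```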
We also investigate the influence of frequency shortcuts in Out-of-Distribution (OOD) tests. Link » Shunxin Wang · Raymond Veldhuis · Christoph Brune · Nicola Strisciuglio 🔗 - Preserving privacy with PATE for heterogeneous data ( Poster )  link »    Differential privacy has become the standard system to provide privacy guarantees for user data in machine learning models. One of the popular techniques to ensure privacy is the Private Aggregation of Teacher Ensembles (PATE) framework. PATE trains an ensemble of teacher models on private data and transfers the knowledge to a student model, with rigorous privacy guarantees derived using differential privacy. So far, PATE has been shown to work assuming the public and private data are distributed homogeneously. We show that in the case of high mismatch (non-IID-ness) between these distributions, the teachers suffer from high variance in their individual training updates, causing them to converge to vastly different optimum states. This leads to lower consensus and accuracy in data labelling. To address this, we propose a modification to the teacher training process in PATE that incorporates teacher averaging and update correction, which reduces the variance in teacher updates. Our technique leads to improved prediction accuracy of the teacher aggregation mechanism, especially for highly heterogeneous data. Furthermore, our evaluation shows our technique is necessary to sustain the student model's performance, and allows it to achieve considerable gains over the original PATE in the utility-privacy metric. Link » Akshay Dodwadmath · Sebastian Stich 🔗 - Robustmix: Improving Robustness by Regularizing the Frequency Bias of Deep Nets ( Poster )  link » Deep networks have achieved impressive results on a range of well-curated benchmark datasets. Surprisingly, their performance remains sensitive to perturbations that have little effect on human performance.
In this work, we propose a novel extension of Mixup called Robustmix that regularizes networks to classify based on lower-frequency spatial features. We show that this type of regularization improves robustness on a range of benchmarks such as ImageNet-C and Stylized ImageNet. It adds little computational overhead and, furthermore, does not require a priori knowledge of a large set of image transformations. We find that this approach further complements recent advances in model architecture and data augmentation, attaining a state-of-the-art mCE of 44.8 with an EfficientNet-B8 model and RandAugment, a reduction of 16 mCE compared to the baseline. Link » JONAS NGNAWE · Marianne ABEMGNIGNI NJIFON · Jonathan Heek · Yann Dauphin 🔗 - Visual response inhibition for increased robustness of convolutional networks to distribution shifts ( Poster )  link » Convolutional neural networks have been shown to suffer from distribution shifts in the test data, for instance caused by the so-called common corruptions and perturbations. Test images can contain noise, digital transformations, and blur that were not present in the training data, negatively impacting the performance of trained models. Humans are far more robust to noise and visual distortions than deep networks. In this work, we explore the effectiveness of a neuronal response inhibition mechanism, called push-pull, observed in the early part of the visual system, to increase the robustness of deep convolutional networks. We deploy a Push-Pull inhibition layer as a replacement for the initial convolutional layers (the input layer and the first block of residual and dense architectures) of standard convolutional networks for image classification. We show that the Push-Pull inhibition component increases the robustness of standard networks for image classification to distribution shifts on the CIFAR10-C and CIFAR10-P test sets.
Link » Nicola Strisciuglio · George Azzopardi 🔗 - AdaME: Adaptive learning of multisource adaptation ensembles ( Poster )  link »    We present a new adaptive algorithm to build multisource domain adaptation neural network ensembles. Since standard convex combination ensembles cannot succeed in this scenario, we present a learnable domain-weighted combination and new learning guarantees based on the deep boosting algorithm. We introduce and analyze a new algorithm, AdaME, for this scenario and show that it benefits from favorable theoretical guarantees, is risk-averse, and reduces the worst-case mismatch between the inference and training distributions. We also report the results of several experiments demonstrating its performance on the FMoW-WILDS dataset. Link » Scott Yak · Javier Gonzalvo · Mehryar Mohri · Corinna Cortes 🔗 - Transferability Between Regression Tasks ( Poster )  link »    Transfer learning has been a widely used technique to adapt a deep learning model trained for one task to another when there is a data distribution shift between these tasks. To improve the effectiveness of transfer learning and to understand relationships between tasks, we consider the problem of transferability estimation between regression tasks and propose two novel transferability estimators that are simple, computationally efficient, yet effective and theoretically grounded. We test our proposed methods extensively in various challenging, practical scenarios and show that they significantly outperform existing state-of-the-art regression task transferability estimators in both accuracy and efficiency. Link » Cuong Ngoc Nguyen · Phong Tran The · Lam Ho · Vu Dinh · Anh Tran · Tal Hassner · Cuong V. Nguyen 🔗 - CAREER: Economic Prediction of Labor Sequence Data Under Distribution Shift ( Poster )  link » Labor economists regularly analyze employment data by fitting predictive models to small, carefully constructed longitudinal survey datasets.
Although modern machine learning methods offer promise for such problems, these survey datasets are too small to take advantage of them. In recent years, large datasets of online resumes have also become available, providing data about the career trajectories of millions of individuals. However, the distribution of these large resume datasets differs in meaningful ways from the survey datasets used for economic estimation; standard econometric models cannot take advantage of their scale or make predictions under distribution shift. To this end, we develop CAREER, a transformer-based model that uses transfer learning to learn representations of job sequences. CAREER is first fit to large, passively-collected resume data and then fine-tuned on samples of the downstream data distribution of interest. We find that CAREER forms accurate predictions of job sequences, achieving state-of-the-art predictive performance on three widely used economics datasets. We also find that CAREER is adept at making predictions under distribution shifts in time. Link » Keyon Vafa · Emil Palikot · Tianyu Du · Ayush Kanodia · Susan Athey · David Blei 🔗 - Out-of-Distribution Generalization Challenge in Dialog State Tracking ( Poster )  link »    Dialog State Tracking (DST) is a core component of multi-turn Task-Oriented Dialog (TOD) systems for understanding dialogs. DST models need to generalize to Out-of-Distribution (OOD) utterances due to the open environments dialog systems face. Unfortunately, utterances in TOD are multi-labeled, and most of them appear in specific contexts (i.e., dialog histories). Both characteristics make them different from the conventional focus of OOD generalization research, and they remain unexplored. In this paper, we formally define OOD utterances in TOD and evaluate the generalizability of existing competitive DST models on these OOD utterances.
Our experimental results show that the performance of all models drops considerably in dialogs with OOD utterances, indicating an OOD generalization challenge in DST. Link » Jiasheng Ye · Yawen Ouyang · Zhen Wu · Xinyu Dai 🔗 - Diversity Boosted Learning for Domain Generalization with A Large Number of Domains ( Poster )  link »    Machine learning algorithms that minimize the average training loss typically suffer from poor generalization performance. This has inspired various works on domain generalization (DG), among which a series of methods rely on $O(n^2)$ pairwise domain operations over $n$ domains, each of which is often costly. Moreover, while a common objective in the DG literature is to learn representations invariant to spurious correlations induced by domains, we point out its insufficiency and highlight the importance of alleviating spurious correlations caused by objects. Based on the observation that diversity helps mitigate spurious correlations, we propose a Diversity boosted twO-level saMplIng framework (DOMI) to efficiently sample the most informative ones among a large number of domains and data points. We show that DOMI helps train robust models against spurious correlations from both the domain side and the object side, substantially enhancing the performance of five backbone DG algorithms on Rotated MNIST and Rotated Fashion MNIST. Link » XI LENG · Yatao Bian · Xiaoying Tang 🔗 - Learning with noisy labels using low-dimensional model trajectory ( Poster )  link » Recent work shows that deep neural networks (DNNs) first learn clean samples and then memorize noisy samples. Early stopping can therefore be used to improve performance when training with noisy labels. It was also shown recently that the training trajectory of DNNs can be approximated in a low-dimensional subspace using PCA. The DNNs can then be trained in this subspace, achieving similar or better generalization.
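The trajectory-subspace idea this work builds on can be sketched on a toy quadratic loss (the dimensionality, snapshot schedule, and subspace size k=3 are illustrative choices, not the paper's settings): record weight snapshots during training, extract principal directions with an SVD, then optimize only the subspace coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50                                  # number of model parameters (toy)
w_star = rng.normal(size=d)             # optimum of a toy quadratic loss
loss = lambda w: 0.5 * np.sum((w - w_star) ** 2)
grad = lambda w: w - w_star

# 1) Run gradient-descent training and record snapshots along the trajectory.
w = np.zeros(d)
snapshots = []
for _ in range(30):
    w -= 0.1 * grad(w)
    snapshots.append(w.copy())

# 2) PCA of the centered trajectory: the top-k directions span a tiny subspace.
W = np.array(snapshots)
mean = W.mean(axis=0)
_, _, Vt = np.linalg.svd(W - mean, full_matrices=False)
P = Vt[:3]                              # k = 3 principal directions

# 3) Re-train only the k subspace coefficients c, with w = mean + P.T @ c.
c = np.zeros(3)
for _ in range(100):
    c -= 0.1 * (P @ grad(mean + P.T @ c))

print("full-dim final loss:", round(loss(w), 4))
print("subspace final loss:", round(loss(mean + P.T @ c), 6))
```

Robust or Sparse PCA, as proposed in the abstract, would replace step 2 so that noisy-label gradients corrupt the estimated subspace less.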
These two observations were utilized together to further boost the generalization performance of vanilla early stopping on noisy-label datasets. In this paper, we probe this finding further on different types of real-world and synthetic label noise. First, we show that the prior method is sensitive to the early-stopping hyperparameter. Second, we investigate the effectiveness of PCA for approximating the optimization trajectory under noisy labels. We propose to estimate the low-rank subspace through robust and structured variants of PCA, namely Robust PCA and Sparse PCA. We find that the subspace estimated through these variants can be less sensitive to early stopping, and can outperform PCA, achieving better test error when trained on noisy labels. Link » Vasu Singla · Shuchin Aeron · Toshiaki Koike-Akino · Kieran Parsons · Matthew Brand · Ye Wang 🔗 - Evaluating the Impact of Geometric and Statistical Skews on Out-Of-Distribution Generalization Performance ( Poster )  link » Out-of-distribution (OOD) or domain generalization is the problem of generalizing to unseen distributions. Recent work suggests that the marginal difficulty of generalizing to OOD over in-distribution data (the OOD-ID generalization gap) is due to spurious correlations, which arise from statistical and geometric skews and can be addressed by careful data augmentation and class balancing. We observe that even after constructing a dataset where we remove all conceivable sources of spurious correlation between interpretable factors, classifiers still fail to close the OOD-ID generalization gap. Link » Aengus Lynch · Jean Kaddour · Ricardo Silva 🔗 - Strategy-Aware Contextual Bandits ( Poster )  link » Algorithmic tools are often used to make decisions about people in high-stakes domains. In the presence of such automated decision making, there is incentive for strategic agents to modify their input to the algorithm in order to receive a more desirable outcome.
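This strategic behavior is commonly modeled as a best response: given a known decision rule, an agent moves its features just enough to cross the boundary when the gain outweighs a manipulation cost. The linear rule and quadratic cost below are standard textbook choices for illustration, not this paper's bandit model:

```python
import numpy as np

def best_response(x, w, b, cost=1.0):
    """Agent's best response to a linear rule sign(w.x + b): move the minimum
    distance needed for a positive decision if the utility (value 1) exceeds
    the quadratic movement cost; otherwise stay put."""
    score = w @ x + b
    if score >= 0:
        return x                           # already accepted, no need to move
    dist = -score / np.linalg.norm(w)      # distance to the decision boundary
    if cost * dist ** 2 > 1.0:             # manipulation too expensive
        return x
    return x + (dist + 1e-6) * w / np.linalg.norm(w)  # cross the boundary

w, b = np.array([1.0, 1.0]), -1.0
x_near = np.array([0.4, 0.4])              # close to the boundary: will game
x_far = np.array([-2.0, -2.0])             # far away: gaming is too costly

for x in (x_near, x_far):
    x_new = best_response(x, w, b)
    print(x, "->", x_new, "accepted:", bool(w @ x_new + b >= 0))
```

In the contextual bandit setting the abstract describes, the decision maker observes only such strategically modified contexts and the bandit feedback for the action taken.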
While previous work on strategic classification attempts to capture this phenomenon, these models fail to take into account the multiple actions a decision maker usually has at their disposal, and the fact that they often have access only to bandit feedback. In contrast, we capture this setting as a contextual bandit problem, in which a decision maker must take actions based on a sequence of strategically modified contexts. We provide a low-strategic-regret algorithm for the two-action setting, and prove that sublinear strategic regret is generally not possible for settings in which the number of actions is greater than two. Along the way, we obtain impossibility results for multi-class strategic classification which may be of independent interest. Link » Keegan Harris · Chara Podimata · Steven Wu 🔗 - Learning an Invertible Output Mapping Can Mitigate Simplicity Bias in Neural Networks ( Poster )  link »    Deep Neural Networks (DNNs) are known to be brittle to even minor distribution shifts relative to the training distribution. While one line of work has demonstrated that the Simplicity Bias (SB) of DNNs, i.e., a bias towards learning only the simplest features, is a key reason for this brittleness, another recent line of work has, surprisingly, found that diverse/complex features are indeed learned by the backbone, and the brittleness is due to the linear classification head relying primarily on the simplest features. To bridge the gap between these two lines of work, we first hypothesize and verify that while SB may not altogether preclude learning complex features, it amplifies simpler features over complex ones. Namely, simple features are replicated several times in the learned representations while complex features might not be replicated.
This phenomenon, which we term the Feature Replication Hypothesis, coupled with the implicit bias of SGD to converge to maximum-margin solutions in the feature space, leads models to rely mostly on the simple features for classification. To mitigate this bias, we propose a Feature Reconstruction Regularizer (FRR) to ensure that the learned features can be reconstructed back from the logits. Using FRR in linear layer training (FRR-L) encourages the use of more diverse features for classification. We further propose to fine-tune the full network while freezing the weights of the linear layer trained using FRR-L, to refine the learned features and make them more suitable for classification. Using the proposed approach, we demonstrate noteworthy gains on synthetic/semi-synthetic datasets, and outperform the existing SOTA on the standard OOD benchmark DomainBed as well. Link » Sravanti Addepalli · Anshul Nasery · Venkatesh Babu R · Praneeth Netrapalli · Prateek Jain 🔗 - Useful Confidence Measures: Beyond the Max Score ( Poster )  link » An important component in deploying machine learning (ML) in safety-critical applications is having a reliable measure of confidence in the ML model's predictions. For a classifier $f$ producing a probability vector $f(x)$ over the candidate classes, the confidence is typically taken to be $\max_i f(x)_i$. This approach is potentially limited, as it disregards the rest of the probability vector. In this work, we derive several confidence measures that depend on information beyond the maximum score, such as margin-based and entropy-based measures, and empirically evaluate their usefulness, focusing on NLP tasks with distribution shifts and Transformer-based models. We show that when models are evaluated on out-of-distribution data "out of the box", using only the maximum score to inform the confidence measure is highly suboptimal.
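Such measures can be computed from the probability vector alone; the definitions below are standard illustrative choices and may differ in detail from those derived in the paper:

```python
import numpy as np

def max_score(p):
    """Standard confidence: the top class probability."""
    return float(np.max(p))

def margin(p):
    """Margin-based confidence: gap between the two largest probabilities."""
    top2 = np.sort(p)[-2:]
    return float(top2[1] - top2[0])

def neg_entropy(p):
    """Entropy-based confidence: higher (less negative) means more peaked."""
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))

# Two predictions with the same max score but different uncertainty profiles:
# the margin and entropy measures distinguish them; the max score cannot.
p_peaked = np.array([0.5, 0.45, 0.05])
p_spread = np.array([0.5, 0.25, 0.25])
for name, f in [("max", max_score), ("margin", margin), ("neg-entropy", neg_entropy)]:
    print(f"{name:11s}: {f(p_peaked):.3f} vs {f(p_spread):.3f}")
```

Note that the two auxiliary measures can even disagree with each other: the margin prefers the second vector while the entropy prefers the first, which is exactly the extra information beyond the max score that the abstract refers to.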
In the post-processing regime (where the scores of $f$ can be improved using additional in-distribution held-out data), this remains true, albeit to a lesser degree. Overall, our results suggest that entropy-based confidence is a surprisingly useful measure. Link » Gal Yona · Amir Feder · Itay Laish 🔗 - Federated Learning under Distributed Concept Drift ( Poster )  link »    Federated Learning (FL) under distributed concept drift is a largely unexplored area. Although concept drift is itself a well-studied phenomenon, it poses particular challenges for FL, because drifts arise staggered in time and space (across clients). Our work is the first to explicitly study data heterogeneity in both dimensions. We first demonstrate that prior solutions to drift adaptation, with their single global model, are ill-suited to staggered drifts, necessitating multiple-model solutions. We identify the problem of drift adaptation as a time-varying clustering problem, and we propose two new clustering algorithms for reacting to drifts based on local drift detection and hierarchical clustering. Empirical evaluation shows that our solutions achieve significantly higher accuracy than existing baselines, and are comparable to an idealized algorithm with oracle knowledge of the ground-truth clustering of clients to concepts at each time step. Link » Ellango Jothimurugesan · Kevin Hsieh · Jianyu Wang · Gauri Joshi · Phillip Gibbons 🔗 - An Invariant Learning Characterization of Controlled Text Generation ( Poster )  link » Controlled generation refers to the problem of creating text that contains stylistic or semantic attributes of interest. Many approaches reduce this problem to building a predictor of the desired attribute. For example, researchers hoping to deploy a large language model to produce non-toxic content may use a toxicity classifier to filter generated text.
In this paper, we show that the performance of controlled generation may be poor if the target distribution of text differs from the distribution the predictor was trained on. Instead, we take inspiration from causal representation learning and cast controlled generation under distribution shift as an invariant learning problem: the most effective predictor should be invariant across multiple text environments. Experiments demonstrate the promise and difficulty of adapting invariant learning methods, which have been developed primarily for vision, to text. Link » Claudia Shi · Carolina Zheng · Keyon Vafa · Amir Feder · David Blei 🔗 - Tackling Distribution Shifts in Federated Learning with Superquantile Aggregation ( Poster )  link »    Federated learning has emerged as the predominant framework for distributed machine learning over decentralized data, e.g., on mobile phones. The usual approaches suffer from a distribution shift: the model is trained to fit the average population distribution but is deployed on individual clients, whose data distributions can be quite different. We present a distributionally robust approach to federated learning based on a risk measure known as the superquantile, and show how to optimize it by interleaving federated averaging steps with quantile computation. We demonstrate experimentally that our approach is competitive with the usual ones in terms of average error and outperforms them in terms of tail statistics of the error. Link » Krishna Pillutla · Yassine Laguel · Jérôme Malick · Zaid Harchaoui 🔗 - Few Shot Generative Domain Adaptation Via Inference-Stage Latent Learning in GANs ( Poster )  link »    In this study, we adapt generative models trained on large source datasets to scarce target domains. We adapt a pre-trained Generative Adversarial Network (GAN) without retraining the generator, avoiding catastrophic forgetting and over-fitting.
Starting from the observation that target images can be "embedded" into the latent space of a pre-trained source-GAN, our method finds the latent code corresponding to the target domain on the source latent manifold. Optimizing a latent learner network during inference produces a novel target embedding that is supplied to the source-GAN generator to generate target samples. Our method, albeit simple, can be used to generate data from multiple target distributions using a generator trained on a single source distribution. Link » Arnab Kumar Mondal · Piyush Tiwary · Parag Singla · Prathosh AP 🔗 - Relational Out-of-Distribution Generalization ( Poster )  link » In out-of-distribution (OOD) generalization, domain relations are an important factor. They can provide a global view of the functional relationships among domains, e.g., the protein domain in the binding affinity task or the geographical location domain in the weather forecast task. Existing work does not utilize domain relations; in this work, we explore how to incorporate such rich information into solving the distribution shift problem. To this end, we propose READ, a general multi-head deep learning framework that harnesses domain relations to generalize to unseen domains in a structured learning and inference manner. In READ, each training domain shares a common backbone but learns a separate head. Built on a proposed explicit regularization, READ simulates the generalization process among heads: a weighted ensemble prediction from heads irrelevant to the input domain is calculated via the domain relation and aligned with the target. To improve the reliability of the domain relation, READ further leverages similarity metric learning to update the initial relation. Empirically, we evaluate READ on three domain generalization benchmarks. The results indicate that READ consistently improves upon existing state-of-the-art methods on datasets from various fields.
Link » Xinyu Yang · Xinyi Pan · Shengchao Liu · Huaxiu Yao 🔗 - Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization ( Poster )  link »    A common explanation for the failure of deep networks to generalize out-of-distribution is that they fail to recover the "correct" features. We challenge this notion with a simple experiment which suggests that ERM already learns sufficient features and that the current bottleneck is not feature learning, but robust regression. We therefore argue that devising simpler methods for learning predictors on existing features is a promising direction for future research. Towards this end, we introduce Domain-Adjusted Regression (DARE), a convex objective for learning a linear predictor that is provably robust under a new model of distribution shift. Rather than learning one function, DARE performs a domain-specific adjustment to unify the domains in a canonical latent space and learns to predict in this space. Under a natural model, we prove that the DARE solution is the minimax-optimal predictor for a constrained set of test distributions. Further, we provide the first finite-environment convergence guarantee to the minimax risk, improving over existing analyses which only yield minimax predictors after an environment threshold. Evaluated on finetuned features, we find that DARE compares favorably to prior methods, consistently achieving equal or better performance. Link » Elan Rosenfeld · Pradeep Ravikumar · Andrej Risteski 🔗 - Test-time adaptation with slot-centric models ( Poster )  link »    We consider the problem of segmenting scenes into constituent objects and their parts. Current supervised visual detectors, though impressive within their training distribution, often fail to segment out-of-distribution scenes into their constituent entities. 
Recent test-time adaptation methods use auxiliary self-supervised losses to adapt the network parameters to each test example independently, and have shown promising results for generalization outside the training distribution on image classification. In our work, we find evidence that these losses can be insufficient for instance segmentation tasks unless architectural inductive biases are also considered. For image segmentation, recent slot-centric generative models break this dependence on supervision by attempting to segment scenes into entities in a self-supervised manner by reconstructing pixels. Drawing upon these two lines of work, we propose Generating Fast and Slow Networks (GFS-Nets), a semi-supervised instance segmentation model equipped with a slot-centric image rendering component that is adapted per scene at test time through gradient descent on reconstruction or novel view synthesis objectives. We show that test-time adaptation greatly improves segmentation in out-of-distribution scenes. We evaluate GFS-Nets on scene segmentation benchmarks and show substantial out-of-distribution performance improvements over state-of-the-art supervised feed-forward detectors and self-supervised domain adaptation models. Link » Mihir Prabhudesai · Sujoy Paul · Sjoerd van Steenkiste · Mehdi S. M. Sajjadi · Anirudh Goyal · Deepak Pathak · Katerina Fragkiadaki · Gaurav Aggarwal · Thomas Kipf 🔗 - Diversity through Disagreement for Better Transferability ( Poster )  link »    Gradient-based learning algorithms have an implicit simplicity bias which can limit the diversity of predictors sampled by the learning procedure. This behavior can hinder the transferability of trained models by (i) favoring the learning of simpler but spurious features (present in the training data but absent from the test data) and (ii) leveraging only a small subset of predictive features.
Such an effect is especially magnified when the test distribution does not exactly match the train distribution---referred to as the Out of Distribution (OOD) generalization problem. However, given only the training data, it is not always possible to assess a priori whether a given feature is spurious or transferable. Instead, we advocate for learning an ensemble of models which capture a diverse set of predictive features. Towards this, we propose a new algorithm D-BAT (Diversity-By-disAgreement Training), which enforces agreement among the models on the training data, but disagreement on the OOD data. We show how D-BAT naturally emerges from the notion of generalized discrepancy, as well as demonstrate in multiple experiments how the proposed method can mitigate shortcut-learning, enhance uncertainty and OOD detection, as well as improve transferability. Link » Matteo Pagliardini · Martin Jaggi · François Fleuret · Sai Praneeth Karimireddy 🔗 - Env-Aware Anomaly Detection: Ignore Style Changes, Stay True to Content! ( Poster )  link »    We introduce a formalization and benchmark for the unsupervised anomaly detection task in the distribution-shift scenario. Our work builds upon the iWildCam dataset, and, to the best of our knowledge, we are the first to propose such an approach for visual data. We empirically validate that environment-aware methods perform better in such cases when compared with the basic Empirical Risk Minimization (ERM). We next propose an extension for generating positive samples for contrastive methods that considers the environment labels when training, improving the ERM baseline score by 8.7%. Link » Stefan Smeu · Elena Burceanu · Andrei L Nicolicioiu · Emanuela Haller 🔗 - Toward domain generalized pruning by scoring out-of-distribution importance ( Poster )  link »    Filter pruning has been widely used for compressing convolutional neural networks to reduce computation costs during the deployment stage.
Recent studies have shown that filter pruning techniques can achieve lossless compression of deep neural networks, reducing redundant filters (kernels) without sacrificing accuracy. However, the evaluation is done when the training and testing data are from similar environmental conditions (independent and identically distributed), and how the filter pruning techniques would affect the cross-domain generalization (out-of-distribution) performance is largely ignored. We conduct extensive empirical experiments and reveal that although the intra-domain performance could be maintained after filter pruning, the cross-domain performance decays substantially. As scoring a filter's importance is one of the central problems for pruning, we design an importance scoring estimate that uses the variance of domain-level risks to account for the pruning risk on unseen distributions. As such, we retain more domain-generalizable filters. The experiments show that under the same pruning ratio, our method can achieve significantly better cross-domain generalization performance than the baseline filter pruning method. As a first attempt, our work sheds light on the joint problem of domain generalization and filter pruning research. Link » RIZHAO CAI · Haoliang Li · Alex Kot 🔗 - Active Learning Over Multiple Domains in Natural Language Tasks ( Poster )  link »    Studies of active learning traditionally assume the target and source data stem from a single domain. However, in realistic applications, practitioners often require active learning with multiple sources of out-of-distribution data, where it is unclear a priori which data sources will help or hurt the target domain. We survey a wide variety of techniques in active learning (AL), domain shift detection (DS), and multi-domain sampling to examine this challenging setting for question answering and sentiment analysis.
Among 18 acquisition functions from 4 families of methods, we find H-Divergence methods, and particularly our proposed variant DAL-E, yield effective results, averaging 2-3% improvements over the random baseline. Our findings yield the first comprehensive analysis of both existing and novel methods for practitioners faced with multi-domain active learning for natural language tasks. Link » Shayne Longpre · Julia Reisler · Edward Huang · Yi Lu · Andrew Frank · Nikhil Ramesh · Chris DuBois 🔗 - Adaptive Sampling for Probabilistic Forecasting under Distribution Shift ( Poster )  link »    The world is not static: real-world time series change over time as external, and potentially disruptive, events such as macroeconomic cycles or the COVID-19 pandemic alter the underlying factors that influence them. Once such a data distribution shift happens, it will be part of the time series history and impact future forecasting attempts. We present an adaptive sampling strategy that selects the part of the history that is relevant for the recent data distribution. We achieve this by learning a discrete distribution over relevant time steps by Bayesian optimization. We instantiate this idea with a two-step, model-agnostic method that is first pre-trained with uniform sampling and then trains a lightweight adaptive architecture with adaptive sampling. We show with synthetic and real-world experiments that this method adapts to distribution shift and reduces the forecasting error of the base model by 8.4%. Link » Luca Masserano · Syama Sundar Rangapuram · Shubham Kapoor · Rajbir Nirwan · Youngsuk Park · Michael Bohlke-Schneider 🔗 - A Learning Based Hypothesis Test for Harmful Covariate Shift ( Poster )  link »    Quickly and accurately identifying covariate shift at test time is a critical and often overlooked component of safe machine learning systems deployed in high-risk domains.
In this work, we give an intuitive definition of harmful covariate shift (HCS) as a change in distribution that may weaken the generalization of a classification model. To detect HCS, we use the discordance between classifiers trained to agree on training data and disagree on test data. We derive a loss function for training these models and show that their disagreement rate and entropy represent powerful discriminative statistics for HCS. Empirically, we demonstrate the ability of our method to detect harmful covariate shift with statistical certainty on a variety of high-dimensional datasets. Across numerous domains and modalities, we show state-of-the-art performance compared to existing methods, particularly when the number of observed test samples is small. Link » Tom Ginsberg · Zhongyuan Liang · Rahul Krishnan 🔗 - Engineering Uncertainty Representations to Monitor Distribution Shifts ( Poster )  link »    In some classification tasks, the true label is not known until months or even years after the classifier prediction time. Once the model has been deployed, harmful dataset shift regimes can surface. Without cautious model monitoring, the damage could prove to be irreversible when true labels unfold. In this paper, we propose a method for practitioners to monitor distribution shifts on unlabeled data. We leverage two representations for quantifying and visualizing model uncertainty. The Adversarial Neighborhood Analysis assesses model uncertainty by aggregating predictions in the neighborhood of a data point and comparing them to the prediction at the single point. The Non-Conformity Analysis exploits the results of conformal prediction and leverages a decision tree to display uncertain zones. We empirically test our approach over scenarios of synthetically generated shifts to prove its efficacy. 
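A minimal sketch of a neighborhood-style uncertainty score like the one described above: sample points around an input, aggregate model predictions over the neighborhood, and compare them to the prediction at the point itself. The function name, sampling scheme, and toy classifier are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def neighborhood_uncertainty(predict, x, radius=0.1, n_samples=100, seed=0):
    """Score uncertainty at x as the fraction of predictions in a small
    neighborhood that disagree with the prediction at x itself
    (a hypothetical variant of an adversarial neighborhood analysis)."""
    rng = np.random.default_rng(seed)
    # Sample points uniformly in a small L-infinity ball around x.
    neighbors = x + rng.uniform(-radius, radius, size=(n_samples, x.shape[0]))
    preds = np.array([predict(p) for p in neighbors])
    return float(np.mean(preds != predict(x)))

# Toy classifier: the label is the sign of the first coordinate.
predict = lambda p: int(p[0] > 0)

print(neighborhood_uncertainty(predict, np.array([5.0, 0.0])))  # far from the boundary
print(neighborhood_uncertainty(predict, np.array([0.0, 0.0])))  # on the boundary
```

Far from the decision boundary all neighbors agree (score 0.0); on the boundary roughly half disagree, flagging the point as uncertain.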
Link » Thomas Bonnier · Benjamin Bosch 🔗 - Data Feedback Loops: Model-driven Amplification of Dataset Biases ( Poster )  link » Datasets scraped from the internet have been critical to large-scale machine learning. Yet, this success puts the utility of future internet-derived datasets at potential risk, as model outputs begin to replace human annotations as a source of supervision. In this work, we formalize a system where interactions with one model are recorded as history and scraped as training data in the future. We then analyze its stability over time by tracking changes to a test-time bias statistic (e.g. gender bias of model predictions). We find that the degree of bias amplification is closely linked to whether the model’s outputs behave like samples from the training distribution, a behavior which we characterize and define as consistent calibration. Experiments in three conditional prediction scenarios – image classification, visual role-labeling, and language generation – demonstrate that models that exhibit a sampling-like behavior are more calibrated and thus more stable. Based on this insight, we propose an intervention to help calibrate and stabilize unstable feedback systems. Link » Rohan Taori · Tatsunori Hashimoto 🔗 - "Why did the Model Fail?": Attributing Model Performance Changes to Distribution Shifts ( Poster )  link » Performance of machine learning models may differ significantly in novel environments compared to training due to shifts in the underlying data distribution. Attributing performance changes to specific data shifts is critical for identifying sources of model failures and designing stable models. In this work, we design a novel method for attributing performance differences between environments to shifts in the underlying causal mechanisms. We formulate the problem as a cooperative game and derive an importance weighting method for computing the value of a coalition of distributions.
The contribution of each distribution to the total performance change is then quantified as its Shapley value. We demonstrate the correctness and utility of our method on two synthetic datasets and two real-world case studies, showing its effectiveness in attributing performance changes to a wide range of distribution shifts. Link » Haoran Zhang · Harvineet Singh · Marzyeh Ghassemi · Shalmali Joshi 🔗 - A Reproducible and Realistic Evaluation of Partial Domain Adaptation Methods ( Poster )  link »    Unsupervised Domain Adaptation (UDA) aims at classifying unlabeled target images by leveraging labeled source ones. In this work, we consider the Partial Domain Adaptation (PDA) variant, where we have extra source classes not present in the target domain. Most successful algorithms use model selection strategies that rely on target labels to find the best hyper-parameters and/or models during training. However, these strategies violate the main assumption in PDA: only unlabeled target domain samples are available. The main goal of this work is to provide a realistic evaluation of PDA methods with the different model selection strategies under a consistent evaluation protocol. We evaluate 7 representative PDA algorithms on 2 different real-world datasets using 7 different model selection strategies. Our two main findings are: (i) without target labels for model selection, the accuracy of the methods decreases by up to 30 percentage points; (ii) only one method and model selection pair performs reasonably well on both datasets. Experiments were performed with our PyTorch framework, BenchmarkPDA, which we open source. Link » Tiago Salvador · Kilian FATRAS · Ioannis Mitliagkas · Adam Oberman 🔗 - Sparse Mixture-of-Experts are Domain Generalizable Learners ( Poster )  link »    In domain generalization (DG), most existing methods have focused on loss function design. This paper proposes to explore an orthogonal direction, i.e., the design of the backbone architecture.
It is motivated by an empirical finding that transformer-based models trained with empirical risk minimization (ERM) outperform CNN-based models employing state-of-the-art (SOTA) DG algorithms on multiple DG datasets. We develop a formal framework to characterize a network's robustness to distribution shifts by studying its architecture's alignment with the correlations in the dataset. This analysis guides us to propose a novel DG model built upon vision transformers, namely \emph{Generalizable Mixture-of-Experts (GMoE)}. Experiments on DomainBed demonstrate that GMoE trained with ERM outperforms SOTA DG baselines by a large margin. Link » Bo Li · Yifei Shen · Jingkang Yang · Yezhen Wang · Jiawei Ren · Tong Che · Jun Zhang · Ziwei Liu 🔗 - Deep Class-Conditional Gaussians for Continual Learning ( Poster )  link » The current state of the art for continual learning with frozen, pre-trained embedding networks consists of simple probabilistic models defined over the embedding space, for example, class-conditional Gaussians. As yet, in the task-incremental online setting, it has been an open question how to extend these methods to the case where the embedding function has to be learned from scratch. In this paper, we propose DeepCCG, an empirical Bayesian method which learns online both a class-conditional Gaussian model and an embedding function. The learning process can be interpreted as using a variant of experience replay, known to be effective in continual learning. As part of our framework, we decide which examples to store by selecting the subset that minimises the KL divergence between the true posterior and the posterior induced by the subset. We demonstrate performance in task-incremental online settings, including those with overlapping tasks. Our method outperforms all other methods, including several other replay-based methods.
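As a reference point for the class-conditional Gaussian heads mentioned above, a minimal embedding-space classifier with per-class means and a shared isotropic covariance might look as follows. This is a sketch under our own simplifying assumptions (equal priors, identity covariance), not the paper's DeepCCG; with these assumptions, Bayes' rule reduces to nearest-class-mean.

```python
import numpy as np

class ClassConditionalGaussian:
    """Per-class Gaussian over embeddings with a shared isotropic covariance.
    With equal class priors, Bayes-rule prediction reduces to assigning
    each embedding to its nearest class mean."""

    def fit(self, Z, y):
        self.classes_ = np.unique(y)
        self.means_ = np.stack([Z[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, Z):
        # Squared distance from each embedding to each class mean; smallest wins.
        d2 = ((Z[:, None, :] - self.means_[None, :, :]) ** 2).sum(axis=-1)
        return self.classes_[d2.argmin(axis=1)]

# Toy embeddings: two well-separated classes.
Z = np.array([[0.0, 0.0], [0.1, -0.1], [5.0, 5.0], [5.1, 4.9]])
y = np.array([0, 0, 1, 1])
model = ClassConditionalGaussian().fit(Z, y)
print(model.predict(np.array([[0.2, 0.1], [4.8, 5.2]])))  # → [0 1]
```

In an online continual-learning setting the per-class means can be updated incrementally as new examples arrive, which is what makes this family of models attractive with frozen or slowly changing embeddings.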
Link » Thomas Lee · Amos Storkey 🔗 - A Closer Look at Novel Class Discovery from the Labeled Set ( Poster )  link »    Novel class discovery (NCD) aims to infer novel categories in an unlabeled set using prior knowledge of a labeled set comprising diverse but related classes. Existing research focuses on using the labeled set methodologically, with little analysis of the set itself. In this study, we take a closer look at NCD from the perspective of the labeled set and focus on two questions: (i) Given an unlabeled set, \textit{what labeled set best supports novel class discovery?} (ii) A fundamental premise of NCD is that the labeled set must be related to the unlabeled set, but \textit{how can we measure this relation?} For (i), we propose and substantiate the hypothesis that NCD could benefit from a labeled set with high semantic similarity to the unlabeled set. Using ImageNet's hierarchical class structure, we create a large-scale benchmark with variable semantic similarity across labeled/unlabeled datasets. In contrast, existing NCD benchmarks ignore the semantic relation. For (ii), we introduce a mathematical definition for quantifying the semantic similarity between labeled and unlabeled sets. We utilize this metric to validate our established benchmark and demonstrate that it corresponds closely with NCD performance. Furthermore, without quantitative analysis, previous works have commonly assumed that label information is always beneficial. However, counterintuitively, our experimental results show that using labels may lead to sub-optimal outcomes in low-similarity settings. Link » ZIYUN LI · Jona Otholt · Ben Dai · Di Hu · Christoph Meinel · Haojin Yang 🔗 - A new benchmark for group distribution shifts in hand grasp regression for object manipulation. Can meta-learning raise the bar? ( Poster )  link »    Understanding hand-object pose with computer vision opens the door to new applications in mixed reality, assisted living or human-robot interaction.
Most methods are trained and evaluated on balanced datasets. This is of limited use in real-world applications; how do these methods perform in the wild on unknown objects? We propose a novel benchmark for object group distribution shifts in hand and object pose regression. We then test the hypothesis that meta-learning enables a baseline pose regression neural network to adapt to these shifts and generalise better to unknown objects. Our results show measurable improvements over the baseline, depending on the amount of prior knowledge. For the task of joint hand-object pose regression, we observe optimisation interference for the meta-learner. To address this issue and improve the method further, we provide a comprehensive analysis which should serve as a basis for future work on this benchmark. Link » Théo Morales · Gerard Lacey 🔗 - Instance norm improves meta-learning in class-imbalanced land cover classification ( Poster )  link »    Distribution shift is omnipresent in geographic data, where various climatic and cultural factors lead to different representations across the globe. We aim to adapt dynamically to unseen data distributions with model-agnostic meta-learning, where data sampled from each distribution is seen as a task with only a few annotated samples. Transductive batch normalization layers are often employed in meta-learning models, as they reach the highest numerical accuracy on the class-balanced target tasks used as meta-learning benchmarks. In this work, we demonstrate empirically that transductive batch normalization collapses when deployed on a real class-imbalanced land cover classification problem. We propose to replace batch normalization with instance normalization. This modification consistently outperformed all other normalization alternatives across different meta-learning algorithms in our class-imbalanced land cover classification test tasks.
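To illustrate the normalization choice discussed above: instance normalization standardizes each sample by its own statistics, so the result does not depend on the composition of a possibly class-imbalanced test batch, whereas transductive batch normalization pools statistics across the batch. A minimal NumPy sketch over flat feature vectors (our own illustration, not the paper's code):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Transductive batch norm: statistics pooled over the whole batch,
    so a skewed test batch skews every sample's normalization."""
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def instance_norm(x, eps=1e-5):
    """Instance norm: each sample is normalized by its own statistics,
    independent of batch composition."""
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# The same sample, once in a balanced batch and once in a heavily
# skewed one: instance norm leaves it unchanged, batch norm does not.
sample = np.array([[1.0, 2.0, 3.0]])
balanced = np.vstack([sample, [[4.0, 5.0, 6.0]], [[2.0, 3.0, 4.0]]])
skewed = np.vstack([sample, [[40.0, 50.0, 60.0]], [[2.0, 3.0, 4.0]]])
print(np.allclose(instance_norm(balanced)[0], instance_norm(skewed)[0]))  # True
print(np.allclose(batch_norm(balanced)[0], batch_norm(skewed)[0]))        # False
```

The batch-dependence shown in the last line is exactly what makes transductive batch norm fragile when deployment batches are class-imbalanced.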
Link » Marc Russwurm · Devis Tuia 🔗 - CUDA: Curriculum of Data Augmentation for Long-tailed Recognition ( Poster )  link »    Class imbalance problems frequently occur in real-world tasks, and conventional deep learning algorithms are well known for performance degradation on imbalanced training datasets. To mitigate this problem, many approaches have aimed to balance among given classes by re-weighting or re-sampling training samples. These re-balancing methods increase the impact of minority classes and reduce the influence of majority classes on the output of models. Despite extensive recent studies, no deep analysis has been conducted on determining which classes to augment and how strongly. In this study, we propose a simple and efficient curriculum designed to find the appropriate per-class strength of data augmentation, called CUDA: CUrriculum of Data Augmentation for long-tailed recognition. CUDA can simply be integrated into existing long-tailed recognition methods. We present the results of experiments showing that CUDA effectively achieves better generalization performance compared to the state-of-the-art method on imbalanced datasets such as CIFAR-100-LT. Link » Sumyeong Ahn · Jongwoo Ko · Se-Young Yun 🔗 - Benchmarking Robustness under Distribution Shift of Multimodal Image-Text Models ( Poster )  link »    Multimodal image-text models have shown remarkable performance in the past few years. However, the robustness of such foundation models against distribution shifts is crucial in downstream applications. In this paper, we investigate their robustness under image and text perturbations. We first build several multimodal benchmark datasets by applying 17 image perturbation and 16 text perturbation techniques. Then we extensively study the robustness of 6 widely adopted models on 3 downstream tasks (image-text retrieval, visual reasoning, and visual entailment).
We observe that these powerful multimodal models are sensitive to image/text perturbations, especially to image perturbations. For text, character-level perturbations have shown higher adversarial impact than word-level and sentence-level perturbations. We also observe that models trained by generative objectives tend to be more robust. Our robustness findings could facilitate the development of large image-text models, as well as their deployment for real-world applications. Link » Jielin Qiu · Yi Zhu · Xingjian Shi · Zhiqiang Tang · DING ZHAO · Bo Li · Mu Li 🔗 - Sorted eigenvalue comparison $d_{\mathsf{Eig}}$: A simple alternative to $d_{\mathsf{FID}}$ ( Poster )  link »    For $i = 1, 2$, let $\mathbf{S}_i$ be the sample covariance of $\mathbf{Z}_i$ with $n_i$ $p$-dimensional vectors. First, we theoretically justify an improved Fréchet Inception Distance ($d_{\mathsf{FID}}$) algorithm that replaces np.trace(sqrtm($\mathbf{S}_1 \mathbf{S}_2$)) with np.sqrt(eigvals($\mathbf{S}_1 \mathbf{S}_2$)).sum(). With the appearance of unsorted eigenvalues in the improved $d_{\mathsf{FID}}$, we are then motivated to propose sorted eigenvalue comparison ($d_{\mathsf{Eig}}$) as a simple alternative: $d_{\mathsf{Eig}}(\mathbf{S}_1, \mathbf{S}_2)^2=\sum_{j=1}^p (\sqrt{\lambda_j^1} - \sqrt{\lambda_j^2})^2$, where $\lambda_j^i$ is the $j$-th largest eigenvalue of $\mathbf{S}_i$. Second, we present two main takeaways for the improved $d_{\mathsf{FID}}$ and proposed $d_{\mathsf{Eig}}$. (i) $d_{\mathsf{FID}}$: The error bound for computing non-negative eigenvalues of diagonalizable $\mathbf S_1 \mathbf S_2$ is reduced to $\mathcal{O}(\varepsilon) \|\mathbf S_1 \| \|\mathbf S_1 \mathbf S_2 \|$, along with reducing the run time by $\sim25\%$. (ii) $d_{\mathsf{Eig}}$: The error bound for computing non-negative eigenvalues of sample covariance $\mathbf S_i$ is further tightened to $\mathcal{O}(\varepsilon) \|\mathbf S_i \|$, while reducing run time by $\sim90\%$.
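The quantities above can be written down directly in NumPy/SciPy. A minimal sketch on synthetic covariances, considering only the $\mathrm{tr}\,\sqrt{\mathbf S_1 \mathbf S_2}$ cross term of FID (the mean term is ignored here); function names are our own:

```python
import numpy as np
from scipy.linalg import sqrtm

def trace_sqrtm_term(S1, S2):
    """Original FID cross term: trace of the matrix square root of S1 @ S2."""
    return np.trace(sqrtm(S1 @ S2)).real

def eig_sum_term(S1, S2):
    """Variant from the abstract: the same quantity via the eigenvalues of
    S1 @ S2, avoiding the full matrix square root."""
    return np.sqrt(np.linalg.eigvals(S1 @ S2)).real.sum()

def d_eig_sq(S1, S2):
    """Sorted eigenvalue comparison: sum_j (sqrt(l_j^1) - sqrt(l_j^2))^2,
    with each covariance's eigenvalues sorted in decreasing order."""
    l1 = np.sort(np.linalg.eigvalsh(S1))[::-1]
    l2 = np.sort(np.linalg.eigvalsh(S2))[::-1]
    return float(((np.sqrt(l1) - np.sqrt(l2)) ** 2).sum())

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 4))
B = rng.standard_normal((60, 4))
S1, S2 = np.cov(A.T), np.cov(B.T)
# The eigenvalue form agrees with the matrix-square-root form.
print(np.isclose(trace_sqrtm_term(S1, S2), eig_sum_term(S1, S2)))  # True
print(d_eig_sq(S1, S1))  # identical covariances → 0.0
```

The agreement in the second-to-last line holds because, for a diagonalizable product of covariances with non-negative eigenvalues, the trace of the matrix square root equals the sum of the square roots of the eigenvalues.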
Last, we discuss limitations and future work for $d_{\mathsf{Eig}}$. Link » Jiqing Wu · Viktor H Koelzer 🔗 - HICO-DET-SG and V-COCO-SG: New Data Splits to Evaluate Systematic Generalization in Human-Object Interaction Detection ( Poster )  link » Human-Object Interaction (HOI) detection is the task of predicting interactions between humans and objects in an image. In real-world scenarios, HOI detection models are required to generalize systematically, i.e., to novel combinations of objects and interactions, because the training data likely cover only a limited portion of all possible combinations. However, to our knowledge, no open benchmark or existing work evaluates systematic generalization in HOI detection. To address this issue, we created two new sets of HOI detection data splits named HICO-DET-SG and V-COCO-SG based on the HICO-DET and V-COCO datasets. We evaluated representative HOI detection models on our data splits and observed large degradation in test performance compared to the original datasets. This result shows that systematic generalization is a challenging goal in HOI detection. We hope our new data splits encourage more research toward this goal. Link » Kentaro Takemoto · Moyuru Yamada · Tomotake Sasaki · Hisanao Akima 🔗 - Undersampling is a Minimax Optimal Robustness Intervention in Nonparametric Classification ( Poster )  link »    While a broad range of techniques have been proposed to tackle distribution shift, the simple baseline of training on an undersampled balanced dataset often achieves close to state-of-the-art accuracy across several popular benchmarks. This is rather surprising, since undersampling algorithms discard excess majority group data. To understand this phenomenon, we ask if learning is fundamentally constrained by a lack of minority group samples. We prove that this is indeed the case in the setting of nonparametric binary classification.
Our results show that in the worst case, an algorithm cannot outperform undersampling unless there is a high degree of overlap between the train and test distributions (which is unlikely to be the case in real-world datasets), or if the algorithm leverages additional structure about the distribution shift. In particular, in the case of label shift we show that there is always an undersampling algorithm that is minimax optimal. In the case of group-covariate shift we show that there is an undersampling algorithm that is minimax optimal when the overlap between the group distributions is small. We also perform an experimental case study on a label shift dataset and find that in line with our theory, the test accuracy of robust neural network classifiers is constrained by the number of minority samples. Link » Niladri S. Chatterji · Saminul Haque · Tatsunori Hashimoto 🔗 - Cross-Dataset Propensity Estimation for Debiasing Recommender Systems ( Poster )  link » Datasets for training recommender systems are often subject to distribution shift induced by users' and recommenders' selection biases. In this paper, we study the impact of selection bias on datasets with different quantization. We then leverage two differently quantized datasets from different source distributions to mitigate distribution shift by applying the inverse probability scoring method from causal inference. Empirically, our approach achieves significant performance improvements over single-dataset methods and alternative ways of combining two datasets. Link » Fengyu Li · Sarah Dean 🔗 - Multiple Modes for Continual Learning ( Poster )  link » Adapting model parameters to incoming streams of data is a crucial factor in deep learning scalability. Interestingly, prior continual learning strategies in online settings inadvertently anchor their updated parameters to a local parameter subspace to remember old tasks, or else drift away from the subspace and forget.
From this observation, we formulate a trade-off between constructing multiple parameter modes and allocating tasks per mode. Mode-Optimized Task Allocation (MOTA), our contributed adaptation strategy, trains multiple modes in parallel, then optimizes task allocation per mode. We empirically demonstrate improvements over baseline continual learning strategies and across varying distribution shifts, namely subpopulation, domain, and task shift. Link » Siddhartha Datta · Nigel Shadbolt 🔗 - An Empirical Study on Distribution Shift Robustness From the Perspective of Pre-Training and Data Augmentation ( Poster )  link »    The performance of machine learning models under distribution shift has been the focus of the community in recent years. Most current methods have been proposed to improve the robustness to distribution shift from the algorithmic perspective, i.e., designing better training algorithms to aid generalization on shifted test distributions. This paper studies the distribution shift problem from the perspective of pre-training and data augmentation, two important factors in the practice of deep learning that have not been systematically investigated by existing work. By evaluating seven pre-trained models, including ResNets and ViTs trained in self-supervised and supervised modes, on five important distribution-shift datasets, from WILDS and DomainBed benchmarks, with five different learning algorithms, we provide the first comprehensive empirical study focusing on pre-training and data augmentation.
With our empirical results obtained from 1,330 models, we provide the following main observations: 1) ERM combined with data augmentation can achieve state-of-the-art performance if we choose a proper pre-trained model that respects the data properties; 2) specialized algorithms further improve the robustness on top of ERM when handling a specific type of distribution shift, e.g., GroupDRO for spurious correlation and CORAL for large-scale out-of-distribution data; 3) comparing different pre-training modes, architectures and data sizes, we provide novel observations about pre-training on distribution shift, which sheds light on designing or selecting a pre-training strategy for different kinds of distribution shifts. In summary, our empirical study provides a comprehensive baseline for a wide range of pre-training models fine-tuned with data augmentation, which potentially inspires research exploiting the power of pre-training and data augmentation in future distribution shift studies. Link » Ziquan Liu · Yi Xu · Yuanhong Xu · Qi Qian · Hao Li · Rong Jin · Xiangyang Ji · Antoni Chan 🔗 - Characterizing Anomalies with Explainable Classifiers ( Poster )  link »    As machine learning techniques are increasingly used to make societal-scale decisions, model performance issues stemming from data drift can result in costly consequences. While methods exist to quantify data drift, a further classification of drifted points into groups of similarly anomalous points can be helpful for practitioners as a means of combating drift (e.g. by providing context about how/where in the data pipeline shift might be introduced). We show how such characterization is possible by making use of tools from the model explainability literature. We also show how simple rules can be extracted to generate database queries for anomalous data and detect anomalous data in the future.
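One way to read the "simple rules to database queries" idea above is to summarize a group of anomalous points by per-feature bounds and emit those bounds as a SQL-style predicate. This is our own minimal illustration, not the paper's method; the table name, feature names, and sample values are hypothetical.

```python
def extract_box_rule(points, feature_names):
    """Summarize a group of anomalous points as per-feature [min, max]
    bounds and render the rule as a SQL-style WHERE clause that would
    match similar points in the future."""
    bounds = {}
    for j, name in enumerate(feature_names):
        column = [p[j] for p in points]
        bounds[name] = (min(column), max(column))
    clause = " AND ".join(
        f"{name} BETWEEN {lo} AND {hi}" for name, (lo, hi) in bounds.items()
    )
    return bounds, f"SELECT * FROM events WHERE {clause}"

# Hypothetical drifted points flagged by an upstream drift detector.
anomalies = [(102.5, 0.1), (110.0, 0.3), (98.0, 0.2)]
bounds, query = extract_box_rule(anomalies, ["latency_ms", "error_rate"])
print(query)
# SELECT * FROM events WHERE latency_ms BETWEEN 98.0 AND 110.0 AND error_rate BETWEEN 0.1 AND 0.3
```

Such a query gives practitioners a concrete handle on where in the data pipeline the anomalous group lives and a cheap monitor for its recurrence.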
Link » Naveen Durvasula · Valentine d Hauteville · Keegan Hines · John Dickerson 🔗 - Momentum-based Weight Interpolation of Strong Zero-Shot Models for Continual Learning ( Poster )  link »    Large pretrained, zero-shot capable models have shown considerable success both for standard transfer and adaptation tasks, with particular robustness towards distribution shifts. In addition, subsequent finetuning can considerably improve performance on a selected downstream task. However, through naive finetuning, these zero-shot models lose their generalizability and robustness towards distribution shifts. This is a particular problem for tasks such as Continual Learning (CL), where continuous adaptation has to be performed as new task distributions are introduced sequentially. In this work, we showcase that where finetuning falls short in adapting such zero-shot capable models, simple momentum-based weight interpolation can provide consistent improvements for CL tasks in both memory-free and memory-based settings. In particular, we find improvements of over $+4\%$ on standard CL benchmarks, while in parts more than halving the error gap to the upper limit of jointly training on all tasks at once, allowing the continual learner to inch closer to the joint training limits. Link » Zafir Stojanovski · Karsten Roth · Zeynep Akata 🔗 - Explanation Shift: Detecting distribution shifts on tabular data via the explanation space ( Poster )  link »    As input data distributions evolve, the predictive performance of machine learning models tends to deteriorate. In the past, predictive performance was considered the key indicator to monitor. However, explanation aspects have come to attention in recent years.
In this work, we investigate how model predictive performance and model explanation characteristics are affected under distribution shifts and how these key indicators are related to each other for tabular data. We find that modeling explanation shift can be a better indicator for detecting changes in predictive performance than state-of-the-art techniques based on representations of distribution shifts. We provide a mathematical analysis of different types of distribution shifts as well as synthetic experimental examples. Link » Carlos Mougan · Klaus Broelemann · Gjergji Kasneci · Thanassis Tiropanis · Steffen Staab 🔗 - Augmentation Consistency-guided Self-training for Source-free Domain Adaptive Semantic Segmentation ( Poster )  link » We focus on source-free domain adaptation for semantic segmentation, wherein a source model must adapt itself to a new target domain given only unlabeled target data. We propose Augmentation Consistency-guided Self-training (AUGCO), an adaptation algorithm that uses the model's pixel-level predictive consistency across diverse, automatically generated views of each target image along with model confidence to identify reliable pixel predictions, and selectively self-trains on those, leading to state-of-the-art performance with a simple-to-implement and fast-to-converge approach. Link » Viraj Prabhu · Shivam Khare · Deeksha Kartik · Judy Hoffman 🔗
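A schematic of the consistency-based pixel selection described in the last abstract, under our own simplifications: the "views" are precomputed per-pixel softmax maps already aligned back to the original image frame, and a pixel is kept as a pseudo-label only if all views agree and every view is confident. This is an illustration of the general idea, not AUGCO itself.

```python
import numpy as np

def select_reliable_pixels(view_probs, conf_thresh=0.8):
    """view_probs: (n_views, H, W, n_classes) softmax maps, one per
    augmented view of the same image, aligned to the original frame.
    Returns pseudo-labels and a mask of pixels safe to self-train on."""
    labels = view_probs.argmax(axis=-1)             # (n_views, H, W)
    consistent = (labels == labels[0]).all(axis=0)  # all views agree
    confident = view_probs.max(axis=-1).min(axis=0) >= conf_thresh
    mask = consistent & confident
    return labels[0], mask

# Two toy 1x2-pixel views over 2 classes: pixel 0 agrees confidently,
# pixel 1 flips between views and is excluded.
v1 = np.array([[[0.9, 0.1], [0.6, 0.4]]])
v2 = np.array([[[0.95, 0.05], [0.3, 0.7]]])
pseudo, mask = select_reliable_pixels(np.stack([v1, v2]))
print(pseudo[0], mask[0])  # [0 0] [ True False]
```

Self-training then minimizes a supervised loss only on the masked pixels, which is what keeps the adaptation from reinforcing its own mistakes on unreliable regions.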