Workshop
I Can’t Believe It’s Not Better: Understanding Deep Learning Through Empirical Falsification
Arno Blaas · Sahra Ghalebikesabi · Javier Antorán · Fan Feng · Melanie F. Pradier · Ian Mason · David Rohde
La Nouvelle Orleans Ballroom C (level 2)
Deep learning has flourished in the last decade. Recent breakthroughs have shown stunning results, and yet, researchers still cannot fully explain why neural networks generalise so well or why some architectures or optimizers work better than others. There is a lack of understanding of existing deep learning systems, which led NeurIPS 2017 test of time award winners Rahimi & Recht to compare machine learning with alchemy and to call for the return of the 'rigour police'.
Despite excellent theoretical work in the field, deep neural networks are so complex that they may not be fully comprehensible through theory alone. Unfortunately, the experimental alternative (rigorous work that neither proves a theorem nor proposes a new method) is currently undervalued in the machine learning community.
To change this, this workshop aims to promote the method of empirical falsification.
We solicit contributions which explicitly formulate a hypothesis related to deep learning or its applications (based on first principles or prior work), and then empirically falsify it through experiments. We further encourage submissions to go a layer deeper and investigate the causes of an initial idea not working as expected. This workshop will showcase how negative results offer important learning opportunities for deep learning researchers, possibly far greater than the incremental improvements found in conventional machine learning papers!
Why empirical falsification? In the words of Karl Popper, "It is easy to obtain confirmations, or verifications, for nearly every theory—if we look for confirmations. Confirmations should count only if they are the result of risky predictions."
We believe that, as in physics, which seeks to understand nature, the complexity of deep neural networks makes any inductively built understanding of them likely to be brittle.
The most reliable method with which physicists can probe nature is by experimentally validating (or not) the falsifiable predictions made by their existing theories. We posit the same could be the case for deep learning and believe that the task of understanding deep neural networks would benefit from adopting the approach of empirical falsification.
Schedule
Sat 6:15 a.m. - 6:25 a.m.

Welcome and Opening Remarks
(Opening Remarks)
Sat 6:25 a.m. - 6:30 a.m.

Introduction to ICBINB
(Briefing)
Sat 6:30 a.m. - 6:55 a.m.

Jeffrey Bowers: Researchers Comparing DNNs to Brains Need to Adopt Standard Methods of Science.
(Invited Talk)
The claim that DNNs and brains represent information in similar ways is largely based on the good performance of DNNs on various brain benchmarks. On this approach, the better DNNs can predict neural activity, the better the correspondence between DNNs and brains. But this is at odds with the standard scientific research approach that is characterized by varying independent variables to test specific hypotheses regarding the causal mechanisms that underlie some phenomenon; models are supported to the extent that they account for these experimental results. The best evidence for a model is that it survives “severe” tests, namely, experiments that have a high probability of falsifying a model if and only if the model is false in some relevant manner. When DNNs are assessed in this way, they catastrophically fail. The field needs to change its methods and put far more weight into falsification to get a better characterization of DNN-brain correspondences and to build more human-like AI.
Jeffrey Bowers
Sat 6:55 a.m. - 7:00 a.m.

Jeffrey Bowers: Researchers Comparing DNNs to Brains Need to Adopt Standard Methods of Science.
(Q&A)
Jeffrey Bowers
Sat 7:00 a.m. - 7:25 a.m.

Lawrence Udeigwe: On the Elements of Theory in Neuroscience.
(Invited Talk)
In science, theories are essential for encapsulating knowledge obtained from data, making predictions, and building models that make simulations and technological applications possible. Neuroscience, along with cognitive science, is however a young field with fewer established theories (than, say, physics). One consequence of this fact is that new practitioners in the field sometimes find it difficult to know what makes a good theory. Moreover, the use of conceptual theories and models in the field has endured some criticisms: theories have low quantitative prediction power; models have weak transparency; etc. Addressing these issues calls for identifying the elements of theory in neuroscience. In this talk I will try to present and discuss, with case studies, the following: (1) taxonomies by which the different dimensions of a theory can be assessed; (2) criteria for the goodness of a theory; (3) tradeoffs between agreement with the natural world and representational consistency in the theory/model world.
Lawrence Udeigwe
Sat 7:25 a.m. - 7:30 a.m.

Lawrence Udeigwe: On the Elements of Theory in Neuroscience.
(Q&A)
Lawrence Udeigwe
Sat 7:30 a.m. - 7:35 a.m.

Spotlight 1 - Elre Talea Oldewage: Adversarial Attacks are a Surprisingly Strong Baseline for Poisoning Few-Shot Meta-Learners
(Spotlight Talk)
This paper examines the robustness of deployed few-shot meta-learning systems when they are fed an imperceptibly perturbed few-shot dataset. We attack amortized meta-learners, which allows us to craft colluding sets of inputs that are tailored to fool the system's learning algorithm when used as training data. Jointly crafted adversarial inputs might be expected to synergistically manipulate a classifier, allowing for very strong data-poisoning attacks that would be hard to detect. We show that in a white box setting, these attacks are very successful and can cause the target model's predictions to become worse than chance. However, in opposition to the well-known transferability of adversarial examples in general, the colluding sets do not transfer well to different classifiers. We explore two hypotheses to explain this: 'overfitting' by the attack, and mismatch between the model on which the attack is generated and that to which the attack is transferred. Regardless of the mitigation strategies suggested by these hypotheses, the colluding inputs transfer no better than adversarial inputs that are generated independently in the usual way.
Elre Oldewage
Sat 7:35 a.m. - 7:40 a.m.

Spotlight 2 - Abhishek Moturu: Volume-based Performance not Guaranteed by Promising Patch-based Results in Medical Imaging
(Spotlight Talk)
Whole-body MRIs are commonly used to screen for early signs of cancers. In addition to the small size of tumours at onset, variations in individuals, tumour types, and MRI machines increase the difficulty of finding tumours in these scans. Using patches, rather than whole-body scans, to train a deep-learning-based segmentation model, with a custom compound patch loss function, several augmentations, and additional synthetically generated training data to identify areas where there is a high probability of a tumour, provided promising results at the patch level. However, applying the patch-based model to the entire volume did not yield great results despite all of the state-of-the-art improvements, with over 50% of the tumour sections in the dataset being missed. Our work highlights the discrepancy between the commonly used patch-based analysis and the overall performance on the whole image and the importance of focusing on the metrics relevant to the ultimate user, in our case, the clinician. Much work remains to be done to bring state-of-the-art segmentation to clinical practice of cancer screening.
Abhishek Moturu
Sat 7:40 a.m. - 7:45 a.m.

Spotlight 3 - Rebecca Saul: Lempel-Ziv Networks
(Spotlight Talk)
Sequence processing has long been a central area of machine learning research. Recurrent neural nets have been successful in processing sequences for a number of tasks; however, they are known to be both ineffective and computationally expensive when applied to very long sequences. Compression-based methods have demonstrated more robustness when processing such sequences; in particular, an approach pairing the Lempel-Ziv Jaccard Distance (LZJD) with the k-Nearest Neighbor algorithm has shown promise on long sequence problems (up to T=200,000,000 steps) involving malware classification. Unfortunately, use of LZJD is limited to discrete domains. To extend the benefits of LZJD to a continuous domain, we investigate the effectiveness of a deep-learning analog of the algorithm, the Lempel-Ziv Network. While we achieve successful proof-of-concept, we are unable to meaningfully improve on the performance of a standard LSTM across a variety of datasets and sequence processing tasks. In addition to presenting this negative result, our work highlights the problem of sub-par baseline tuning in newer research areas.
Rebecca Saul
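For readers unfamiliar with LZJD, the following is a minimal sketch of the idea as it is commonly described: each byte sequence is parsed into a set of previously unseen substrings (in the style of Lempel-Ziv parsing), and two sequences are compared with the Jaccard distance of those sets. The function names `lz_set` and `lz_jaccard_distance` and the sample inputs are illustrative assumptions, and the sketch uses exact sets rather than the hashing approximations a practical implementation would likely need at the sequence lengths mentioned above. It is not the code used in the paper.

```python
def lz_set(data: bytes) -> set:
    """Parse data into the set of substrings not seen earlier in the parse."""
    seen = set()
    start = 0
    for end in range(1, len(data) + 1):
        chunk = data[start:end]
        if chunk not in seen:
            # New substring: record it and start the next one after it.
            seen.add(chunk)
            start = end
    return seen


def lz_jaccard_distance(a: bytes, b: bytes) -> float:
    """Jaccard distance between the substring sets of two byte sequences."""
    set_a, set_b = lz_set(a), lz_set(b)
    union = set_a | set_b
    if not union:
        # Two empty inputs are treated as identical.
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


# Hypothetical inputs, chosen only to illustrate the call.
sample_a = b"aaaa"
sample_b = b"abab"
distance = lz_jaccard_distance(sample_a, sample_b)
```

A distance of this kind can be plugged into a k-Nearest Neighbor classifier because it only needs pairwise comparisons, which is also why it is restricted to discrete inputs.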
Sat 7:45 a.m. - 7:50 a.m.

Spotlight 4 - Tatiana Likhomanenko: Continuous Soft Pseudo-Labeling in ASR
(Spotlight Talk)
Continuous pseudo-labeling (PL) algorithms such as slimIPL have recently emerged as a powerful strategy for semi-supervised learning in speech recognition. In contrast with earlier strategies that alternated between training a model and generating pseudo-labels (PLs) with it, here PLs are generated in an end-to-end manner as training proceeds, improving training speed and the accuracy of the final model. PL shares a common theme with teacher-student models such as distillation in that a teacher model generates targets that need to be mimicked by the student model being trained. However, interestingly, PL strategies in general use hard-labels, whereas distillation uses the distribution over labels as the target to mimic. Inspired by distillation, we expect that specifying the whole distribution (aka soft-labels) over sequences as the target for unlabeled data, instead of a single best pass pseudo-labeled transcript (hard-labels), should improve PL performance and convergence. Surprisingly, we find that soft-labels targets can lead to training divergence, with the model collapsing to a degenerate token distribution per frame. We hypothesize that the reason this does not happen with hard-labels is that training loss on hard-labels imposes sequence-level consistency that keeps the model from collapsing to the degenerate solution. In this paper, we show several experiments that support this hypothesis, and experiment with several regularization approaches that can ameliorate the degenerate collapse when using soft-labels. These approaches can bring the accuracy of soft-labels closer to that of hard-labels, and while they are unable to outperform them yet, they serve as a useful framework for further improvements.
Tatiana Likhomanenko
Sat 7:50 a.m. - 7:55 a.m.

Spotlight 5 - Gabriel Loaiza-Ganem: Denoising Deep Generative Models
(Spotlight Talk)
Likelihood-based deep generative models have recently been shown to exhibit pathological behaviour under the manifold hypothesis as a consequence of using high-dimensional densities to model data with low-dimensional structure. In this paper we propose two methodologies aimed at addressing this problem. Both are based on adding Gaussian noise to the data to remove the dimensionality mismatch during training, and both provide a denoising mechanism whose goal is to sample from the model as though no noise had been added to the data. Our first approach is based on Tweedie's formula, and the second on models which take the variance of added noise as a conditional input. We show that surprisingly, while well motivated, these approaches only sporadically improve performance over not adding noise, and that other methods of addressing the dimensionality mismatch are more empirically adequate.
Gabriel Loaiza-Ganem
Sat 7:55 a.m. - 8:00 a.m.

Spotlight 6 - Sheheryar Zaidi: When Does Reinitialization Work?
(Spotlight Talk)
Reinitializing a neural network during training has been observed to improve generalization in recent works. Yet it is neither widely adopted in deep learning practice nor is it often used in state-of-the-art training protocols. This raises the question of when reinitialization works, and whether it should be used together with regularization techniques such as data augmentation, weight decay and learning rate schedules. In this work, we conduct an extensive empirical comparison of standard training with a selection of reinitialization methods to answer this question, training over 15,000 models on a variety of image classification benchmarks. We first establish that such methods are consistently beneficial for generalization in the absence of any other regularization. However, when deployed alongside other carefully tuned regularization techniques, reinitialization methods offer little to no added benefit for generalization, although optimal generalization performance becomes less sensitive to the choice of learning rate and weight decay hyperparameters. To investigate the impact of reinitialization methods on noisy data, we also consider learning under label noise. Surprisingly, in this case, reinitialization significantly improves upon standard training, even in the presence of other carefully tuned regularization techniques.
Sheheryar Zaidi
Sat 8:00 a.m. - 8:30 a.m.

Coffee Break (and Poster Session Set Up)
(Break)
Sat 8:30 a.m. - 10:00 a.m.

Poster Session
(Poster Session)
Sat 9:30 a.m. - 10:00 a.m.

ICBINB Virtual MeetUp
(Virtual Gathering)
Do you value depth over breadth, process over outcome, deep understanding of empirical results and collaborative/peer supported research? Do you think publication incentives put too much emphasis on benchmarks/tables with shiny bold numbers, at times using metrics that might not even be the most appropriate ones? Should the ML community value more empirical analysis and sharing of unexpected negative results? Do you want to know more about the ICBINB initiative? Come talk to us at the ICBINB MeetUp! This will be a 30-minute informal gathering point where you can get to know more about the ICBINB initiative, meet some of its current members and/or other workshop attendees that care about any of the questions above! We would love to meet you, exchange ideas, tell you about other efforts/activities we are doing, or just have a nice informal discussion about meta-research! We would also love to hear any feedback you have, and provide info for anyone that wants to be more involved.
Sat 9:30 a.m. - 10:00 a.m.

ICBINB In-Person MeetUp
(Informal Gathering)
Do you value depth over breadth, process over outcome, deep understanding of empirical results and collaborative/peer supported research? Do you think publication incentives put too much emphasis on benchmarks/tables with shiny bold numbers, at times using metrics that might not even be the most appropriate ones? Should the ML community value more empirical analysis and sharing of unexpected negative results? Do you want to know more about the ICBINB initiative? Come talk to us at the ICBINB MeetUp! This will be a 30-minute informal gathering point where you can get to know more about the ICBINB initiative, meet some of its current members and/or other workshop attendees that care about any of the questions above! We would love to meet you, exchange ideas, tell you about other efforts/activities we are doing, or just have a nice informal discussion about meta-research! We would also love to hear any feedback you have, and provide info for anyone that wants to be more involved.
Sat 10:00 a.m. - 11:00 a.m.

Lunch break
(Break)
Sat 11:00 a.m. - 11:25 a.m.

Kathrin Grosse: On the Limitations of Bayesian Uncertainty in Adversarial Settings.
(Invited Talk)
Adversarial examples have been recognized as a threat, and still pose problems, as it is hard to defend against them. Naturally, one might be tempted to think that an image looking like a panda and being classified as a gibbon might be unusual, or at least unusual enough to be discovered by, for example, Bayesian uncertainty measures. Alas, it turns out that Bayesian confidence and uncertainty measures are also easy to fool when the optimization procedure is adapted accordingly. Moreover, adversarial examples transfer between different methods, so they can also be attacked in a black box setting. To conclude the talk, we will briefly discuss the practical necessity of defending against evasion, and what is needed to not only evaluate defenses properly, but also build practical defenses.
Kathrin Grosse
Sat 11:25 a.m. - 11:30 a.m.

Kathrin Grosse: On the Limitations of Bayesian Uncertainty in Adversarial Settings.
(Q&A)
Kathrin Grosse
Sat 11:30 a.m. - 11:55 a.m.

Andrew Gordon Wilson: When Bayesian Orthodoxy Can Go Wrong: Model Selection and Out-of-Distribution Generalization
(Invited Talk)
We will re-examine two popular use cases of Bayesian approaches: model selection, and robustness to distribution shifts. The marginal likelihood (Bayesian evidence) provides a distinctive approach to resolving foundational scientific questions: "how can we choose between models that are entirely consistent with any data?" and "how can we learn hyperparameters or correct ground truth constraints, such as intrinsic dimensionalities, or symmetries, if our training loss doesn't select for them?". There are compelling arguments that the marginal likelihood automatically encodes Occam's razor. There are also widespread practical applications, including the variational ELBO for hyperparameter learning. However, we will discuss how the marginal likelihood is answering a fundamentally different question than "will my trained model provide good generalization?". We consider the discrepancies and their significant practical implications in detail, as well as possible resolutions. Moreover, it is often thought that Bayesian methods, representing epistemic uncertainty, ought to have more reasonable predictive distributions under covariate shift, since these points will be far from our data manifold. However, we were surprised to find that high quality approximate Bayesian inference often leads to significantly decreased generalization performance. To understand these findings, we investigate fundamentally why Bayesian model averaging can deteriorate predictive performance under distribution and covariate shifts, and provide several remedies based on this understanding.
Andrew Gordon Wilson
Sat 11:55 a.m. - 12:00 p.m.

Andrew Gordon Wilson: When Bayesian Orthodoxy Can Go Wrong: Model Selection and Out-of-Distribution Generalization
(Q&A)
Andrew Gordon Wilson
Sat 12:00 p.m. - 12:25 p.m.

Kun Zhang: Causal Principles Meet Deep Learning: Successes and Challenges.
(Invited Talk)
This talk is concerned with causal representation learning, which aims to reveal the underlying high-level hidden causal variables and their relations. It can be seen as a special case of causal discovery, whose goal is to recover the underlying causal structure or causal model from observational data. The modularity property of a causal system implies properties of minimal changes and independent changes of causal representations, and I will explain how such properties make it possible to recover the underlying causal representations from observational data with identifiability guarantees: under appropriate assumptions, the learned representations are consistent with the underlying causal process. The talk will consider various settings with independent and identically distributed (i.i.d.) data, temporal data, or data with distribution shift as input, and demonstrate when identifiable causal representation learning can benefit from the flexibility of deep learning and when it has to impose parametric assumptions on the causal process.
Kun Zhang
Sat 12:25 p.m. - 12:30 p.m.

Kun Zhang: Causal Principles Meet Deep Learning: Successes and Challenges.
(Q&A)
Kun Zhang
Sat 12:30 p.m. - 12:40 p.m.

Piersilvio De Bartolomeis: Certified defences hurt generalisation
(Contributed Talk)
In recent years, much work has been devoted to designing certified defences for neural networks, i.e., methods for learning neural networks that are provably robust to certain adversarial perturbations. Due to the non-convexity of the problem, dominant approaches in this area rely on convex approximations, which are inherently loose. In this paper, we question the effectiveness of such approaches for realistic computer vision tasks. First, we provide extensive empirical evidence to show that certified defences suffer not only worse accuracy but also worse robustness and fairness than empirical defences. We hypothesise that the reason why certified defences suffer in generalisation is (i) the large number of relaxed non-convex constraints and (ii) strong alignment between the adversarial perturbations and the "signal" direction. We provide a combination of theoretical and experimental evidence to support these hypotheses.
Piersilvio De Bartolomeis
Sat 12:40 p.m. - 12:50 p.m.

Simran Kaur: On the Maximum Hessian Eigenvalue and Generalization
(Contributed Talk)
The mechanisms by which certain training interventions, such as increasing learning rates and applying batch normalization, improve the generalization of deep networks remain a mystery. Prior works have speculated that "flatter" solutions generalize better than "sharper" solutions to unseen data, motivating several metrics for measuring flatness (particularly λmax, the largest eigenvalue of the Hessian of the loss); and algorithms, such as Sharpness-Aware Minimization (SAM), that directly optimize for flatness. Other works question the link between λmax and generalization. In this paper, we present findings that call λmax's influence on generalization further into question. We show that: (1) while larger learning rates reduce λmax for all batch sizes, generalization benefits sometimes vanish at larger batch sizes; (2) by scaling batch size and learning rate simultaneously, we can change λmax without affecting generalization; (3) while SAM produces smaller λmax for all batch sizes, generalization benefits (also) vanish with larger batch sizes; (4) for dropout, excessively high dropout probabilities can degrade generalization, even as they promote smaller λmax; and (5) while batch normalization does not consistently produce smaller λmax, it nevertheless confers generalization benefits. While our experiments affirm the generalization benefits of large learning rates and SAM for minibatch SGD, the GD-SGD discrepancy demonstrates limits to λmax's ability to explain generalization in neural networks.
Simran Kaur
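As a rough illustration of how a quantity like λmax can be estimated without forming the full Hessian, the sketch below runs power iteration using Hessian-vector products obtained from finite differences of the gradient. Everything here is an assumption made for illustration: the toy quadratic loss, the names `grad`, `hvp` and `lambda_max`, and the step size `eps`. It is not the measurement procedure from the paper, and plain power iteration returns the eigenvalue of largest magnitude, which matches λmax only when that eigenvalue is positive.

```python
import math


def grad(w):
    # Gradient of a toy loss L(w) = 0.5 * w^T A w with A = [[3, 1], [1, 2]].
    # In practice this would be the gradient of the training loss.
    return [3.0 * w[0] + 1.0 * w[1], 1.0 * w[0] + 2.0 * w[1]]


def hvp(grad_fn, w, v, eps=1e-4):
    # Central finite difference of the gradient approximates the product H v.
    g_plus = grad_fn([wi + eps * vi for wi, vi in zip(w, v)])
    g_minus = grad_fn([wi - eps * vi for wi, vi in zip(w, v)])
    return [(gp - gm) / (2.0 * eps) for gp, gm in zip(g_plus, g_minus)]


def lambda_max(grad_fn, w, iters=100):
    # Power iteration: repeatedly apply H to a unit vector and renormalize.
    v = [1.0] + [0.0] * (len(w) - 1)
    estimate = 0.0
    for _ in range(iters):
        hv = hvp(grad_fn, w, v)
        # Rayleigh quotient v^T H v (v has unit norm).
        estimate = sum(a * b for a, b in zip(v, hv))
        norm = math.sqrt(sum(x * x for x in hv))
        v = [x / norm for x in hv]
    return estimate


# Hypothetical evaluation point; for a quadratic loss the Hessian is constant.
sharpness = lambda_max(grad, [0.5, -0.2])
```

For the toy matrix above the exact largest eigenvalue is (5 + √5) / 2, so the estimate can be checked directly.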
Sat 12:50 p.m. - 1:00 p.m.

Taiga Abe: The Best Deep Ensembles Sacrifice Predictive Diversity
(Contributed Talk)
Ensembling remains a hugely popular method for increasing the performance of a given class of models. In the case of deep learning, the benefits of ensembling are often attributed to the diverse predictions of the individual ensemble members. Here we investigate a tradeoff between diversity and individual model performance, and find that, surprisingly, encouraging diversity during training almost always yields worse ensembles. We show that this tradeoff arises from the Jensen gap between the single model and ensemble losses, and show that the Jensen gap is a natural measure of diversity for both the mean squared error and cross entropy loss functions. Our results suggest that to reduce the ensemble error, we should move away from efforts to increase predictive diversity, and instead we should construct ensembles from less diverse (but more accurate) component models.
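To make the Jensen gap concrete, here is a minimal sketch for a single regression target under mean squared error: the gap is the average loss of the individual members minus the loss of the averaged prediction, and for squared error it equals the variance of the member predictions. The function name `jensen_gap` and the sample predictions are illustrative assumptions, not material from the talk.

```python
from statistics import fmean


def squared_error(prediction, target):
    return (prediction - target) ** 2


def jensen_gap(member_predictions, target, loss=squared_error):
    """Average member loss minus the loss of the averaged prediction."""
    average_member_loss = fmean([loss(p, target) for p in member_predictions])
    ensemble_loss = loss(fmean(member_predictions), target)
    # Non-negative for convex losses, by Jensen's inequality.
    return average_member_loss - ensemble_loss


# Hypothetical predictions from three ensemble members for one target.
preds = [0.8, 1.0, 1.5]
target = 1.0
gap = jensen_gap(preds, target)
```

Read this way, the ensemble loss is the average member loss minus the gap, so a larger gap (more diversity) lowers the ensemble loss only if it does not come at the cost of a higher average member loss.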
Sat 1:00 p.m. - 1:30 p.m.

Coffee break
(Break)
Sat 1:30 p.m. - 1:55 p.m.

Fanny Yang: Surprising failures of standard practices in ML when the sample size is small.
(Invited Talk)
In this talk, we discuss two failure cases of common practices that are typically believed to improve on vanilla methods: (i) adversarial training can lead to worse robust accuracy than standard training, and (ii) active learning can lead to a worse classifier than a model trained using uniform samples. In particular, we can prove both mathematically and empirically that such failures can happen in the small-sample regime. We discuss high-level explanations derived from the theory that shed light on the causes of these phenomena in practice.
Fanny Yang
Sat 1:55 p.m. - 2:00 p.m.

Fanny Yang: Surprising failures of standard practices in ML when the sample size is small.
(Q&A)
Fanny Yang
Sat 2:00 p.m. - 2:50 p.m.

Panel Discussion - What Role Should Empiricism Play in Building AI?
(Panel Discussion)
Panelists: Samy Bengio, Kevin Murphy, Cheng Zhang, Fanny Yang, Andrew Gordon Wilson. Moderated by: Francisco J.R. Ruiz.
Sat 2:50 p.m. - 3:00 p.m.

Closing remarks & awards
(Closing remarks)


Exploring the Long-Term Generalization of Counting Behavior in RNNs
(Poster)
In this study, we investigate the generalization of LSTM, ReLU and GRU models on counting tasks over long sequences. Previous theoretical work has established that RNNs with ReLU activation and LSTMs have the capacity for counting with suitable configuration, while GRUs have limitations that prevent correct counting over longer sequences. Despite this and some positive empirical results for LSTMs on Dyck-1 languages, our experimental results show that LSTMs fail to learn correct counting behavior for sequences that are significantly longer than in the training data. ReLUs show a much larger variance in behavior and, mostly, their generalization is worse. The long sequence generalization is empirically related to validation loss, but reliable long sequence generalization is not practically achievable through backpropagation. Because of their design, LSTMs, GRUs and ReLUs have different modes of failure, which we illustrate with specifically designed sequences. In particular, the necessary saturation of activation functions in LSTMs and the correct weight setting for ReLUs to generalize counting behavior are not achieved in standard training regimes. In summary, learning generalizable counting behavior is still an open problem and we discuss potential approaches for further research.
Nadine El-Naggar · Pranava Madhyastha · Tillman Weyde


Scaling Laws Beyond Backpropagation
(Poster)
Alternatives to backpropagation have long been studied to better understand how biological brains may learn. Recently, they have also garnered interest as a way to train neural networks more efficiently. By relaxing constraints inherent to backpropagation (e.g., symmetric feedforward and feedback weights, sequential updates), these methods enable promising prospects, such as local learning. However, the tradeoffs between different methods in terms of final task performance, convergence speed, and ultimately compute and data requirements are rarely outlined. In this work, we use scaling laws to study the ability of Direct Feedback Alignment (DFA) to train causal decoder-only Transformers efficiently. Scaling laws provide an overview of the tradeoffs implied by a modeling decision, up to extrapolating how it might transfer to increasingly large models. We find that DFA fails to offer more efficient scaling than backpropagation: there is never a regime for which the degradation in loss incurred by using DFA is worth the potential reduction in compute budget. Our finding is at variance with previous beliefs in the alternative training methods community, and highlights the need for holistic empirical approaches to better understand modeling decisions.
Matthew Filipovich · Alessandro Cappelli · Daniel Hesslow · Julien Launay


Dynamic Statistical Learning with Engineered Features Outperforms Deep Neural Networks for Smart Building Cooling Load Predictions
(Poster)
Cooling load predictions for smart building operations play an important role in optimizing the operation of heating, ventilation, and air-conditioning systems. In this paper we report the cooling load prediction solution of real municipal buildings in Hong Kong set up by a recent AI competition. We show that dynamic statistical learning models with engineered features from domain knowledge outperform deep learning alternatives. The proposed solution for the global competition was conferred a Grand Prize and a Gold Award by the panel of internationally renowned experts. We report the data preprocessing based on cooling operation knowledge, feature engineering from control system knowledge, and interpretable learning algorithms to build the models. To find the best model to predict the cooling load, deep learning models with LSTM and gated recurrent units are extensively studied and compared with our proposed solution.
Yiren Liu · S. Joe Qin · Xiangyu Zhao · Yixiao HUANG · Shenglong Yao · Guo Han


Spread Love Not Hate: Undermining the Importance of Hateful Pretraining for Hate Speech Detection
(Poster)
Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. Although this method has proven to be effective for many domains, it might not always provide desirable benefits. In this paper we study the effects of hateful pretraining on low-resource hate speech classification tasks. While previous studies on the English language have emphasized its importance, we aim to augment their observations with some non-obvious insights. We evaluate different variations of tweet-based BERT models pretrained on hateful, non-hateful and mixed subsets of a 40M-tweet dataset. This evaluation is carried out for the Indian languages Hindi and Marathi. This paper provides empirical evidence that hateful pretraining is not the best pretraining option for hate speech detection. We show that pretraining on non-hateful text from the target domain provides similar or better results. Further, we introduce HindTweetBERT and MahaTweetBERT, the first publicly available BERT models pretrained on Hindi and Marathi tweets respectively. We show that they provide state-of-the-art performance on hate speech classification tasks. We also release a gold hate speech evaluation benchmark, HateEvalHi and HateEvalMr, consisting of 2000 manually labeled tweets each.
Omkar Gokhale · Aditya Kane · Shantanu Patankar · Tanmay Chavan · Raviraj Joshi


On Equivalences between Weight and Function-Space Langevin Dynamics
(Poster)
Approximate inference for overparameterized Bayesian models appears challenging, due to the complex structure of the posterior. To address this issue, a recent line of work has investigated the possibility of directly conducting approximate inference in "function space", the space of prediction functions. This note provides an alternative perspective to this problem, by showing that for many models, including a simplified neural network model, Langevin dynamics in the overparameterized "weight space" induces equivalent function-space trajectories to certain Langevin dynamics procedures in function space. Thus, the former can already be viewed as a function-space inference algorithm, with its convergence unaffected by overparameterization. We provide simulations on Bayesian neural network models, and discuss the implication of the results.
Ziyu Wang · Yuhao Zhou · Ruqi Zhang · Jun Zhu


Pitfalls of conditional computation for multimodal learning
(Poster)
Humans have perfected the art of learning from multiple modalities, through sensory organs. Despite impressive predictive performance on a single modality, neural networks cannot reach human-level accuracy with respect to multiple modalities. This is a particularly challenging task due to variations in the structure of respective modalities. A popular method, Conditional Batch Normalization (CBN), was proposed to learn contextual features to aid a deep learning task. This uses the auxiliary data to improve representational power by learning an affine transformation for Convolutional Neural Networks. Despite the boost in performance from using a CBN layer, our work reveals that the visual features learned by introducing auxiliary data via CBN deteriorate. We perform comprehensive experiments to evaluate the brittleness of a dataset to CBN. We show the sensitivity of CBN to the dataset, suggesting that learning from visual features could often be superior for generalization. We perform exhaustive experiments on natural images for bird classification and histology images for cancer type classification. We observe that the CBN network learns close to no visual features on the bird classification dataset and partial visual features on the histology dataset. Our experiments reveal that CBN may encourage shortcut learning between the auxiliary data and labels.
Ivaxi Sheth · Mohammad Havaei · Samira Ebrahimi Kahou


The Effect of Data Dimensionality on Neural Network Prunability
(Poster)
Practitioners often prune neural networks for efficiency gains and generalization improvements, but few scrutinize the factors determining the prunability of a neural network, that is, the maximum fraction of weights that pruning can remove without compromising the model’s test accuracy. In this work, we study the properties of input data that may contribute to the prunability of a neural network. For high-dimensional input data such as images, text, and audio, the manifold hypothesis suggests that these inputs actually lie on or near a significantly lower-dimensional manifold. Prior work demonstrates that the underlying low-dimensional structure of the input data may affect the sample efficiency of learning. In this paper, we investigate whether the low-dimensional structure of the input data affects the prunability of a neural network.
Zachary Ankner · Alex Renda · Gintare Karolina Dziugaite · Jonathan Frankle · Tian Jin


Spike-and-Slab Probabilistic Backpropagation: When Smarter Approximations Make No Difference
(Poster)
Probabilistic backpropagation is an approximate Bayesian inference method for deep neural networks, using a message-passing framework. These messages, which correspond to distributions arising as we propagate our input through a probabilistic neural network, are approximated as Gaussian. However, in practice, the exact distributions may be highly non-Gaussian. In this paper, we propose a more realistic approximation based on a spike-and-slab distribution. Unfortunately, in this case, better approximation of the messages does not translate to better downstream performance. We present results comparing the two schemes and discuss why we do not see a benefit from this spike-and-slab approach.
Evan Ott · Sinead Williamson


Can We Forecast And Detect Earthquakes From Heterogeneous Multivariate Time Series Data?
(Poster)
Earthquake forecasting is a topic of utmost societal importance, yet it remains one of the greatest challenges to date. Case studies from the past show that seismic activity may lead to changes in the local geomagnetic and ionospheric field, which may operate as potential precursors and postcursors to large-magnitude earthquakes. However, detailed and data-driven research has yet to support the existence of precursors and postcursors. This work attempts to build data-driven deep learning networks that can learn the temporal changes in geophysical phenomena before and after large-magnitude earthquake events. First, we run numerous experiments using various machine learning and deep learning models, but none of them are sufficiently generalizable to forecast earthquakes from potential precursors. Our negative findings may make sense, as there is not yet any conclusive and comprehensive evidence supporting the existence of earthquake precursors. We therefore consider detecting earthquakes from postcursor data to spot potential pitfalls and outline the scope of possibility. Our tests indicate that while detecting earthquakes from postcursor data might be promising, it too falls short. Poor performance could be brought on by a lack of data and extremely complex relationships. However, we leave room for future research with deeper networks and data augmentation.
Asadullah Hill Galib · Luke Cullen · Andy Smith · Debvrat Varshney · Edward Brown · Peter Chi · Xiangning Chu · Filip Svoboda


Surgical Fine-Tuning Improves Adaptation to Distribution Shifts
(Poster)
A common approach to transfer learning under distribution shift is to fine-tune the last few layers of a pre-trained model, preserving learned features while also adapting to the new task. This paper shows that in such settings, selectively fine-tuning a subset of layers (which we term surgical fine-tuning) matches or outperforms commonly used fine-tuning approaches. Moreover, the type of distribution shift influences which subset is more effective to tune: for example, for image corruptions, fine-tuning only the first few layers works best. We validate our findings systematically across seven real-world data tasks spanning three types of distribution shifts. Theoretically, we prove that for two-layer neural networks in an idealized setting, first-layer tuning can outperform fine-tuning all layers. Intuitively, fine-tuning more parameters on a small target dataset can cause information learned during pre-training to be forgotten, and the relevant information depends on the type of shift.
Yoonho Lee · Annie Chen · Fahim Tajwar · Ananya Kumar · Huaxiu Yao · Percy Liang · Chelsea Finn


Models with Conditional Computation Learn Suboptimal Solutions
(Poster)
Sparsely-activated neural networks with conditional computation learn to route their inputs through different sub-networks, providing a strong structural prior and reducing computational costs. Despite their possible benefits, models with learned routing often underperform their parameter-matched densely-activated counterparts as well as models that use non-learned heuristic routing strategies. In this paper, we hypothesize that these shortcomings stem from the gradient estimation techniques used to train sparsely-activated models with non-differentiable discrete routing decisions. To test this hypothesis, we evaluate the performance of sparsely-activated models trained with various gradient estimation techniques in three settings where a high-quality heuristic routing strategy can be designed. Our experiments reveal that learned routing reaches substantially worse solutions than heuristic routing in various settings. As a first step towards remedying this gap, we demonstrate that supervising the routing decision on a small fraction of the examples is sufficient to help the model to learn better routing strategies. Our results shed light on the difficulties of learning effective routing and set the stage for future work on conditional computation mechanisms and training techniques.
Mohammed Muqeeth · Haokun Liu · Colin Raffel


On The Diversity of ASR Hypotheses In Spoken Language Understanding
(Poster)
In Conversational AI, an Automatic Speech Recognition (ASR) system is used to transcribe the user's speech, and the output of the ASR is passed as an input to a Spoken Language Understanding (SLU) system, which outputs semantic objects (such as intent, slot-act pairs, etc.). Recent work, including the state-of-the-art methods in SLU, utilizes either word lattices or N-Best hypotheses from the ASR. The intuition given for using N-Best instead of 1-Best is that the hypotheses provide extra information due to errors in the transcriptions of the ASR system, i.e., the performance gain is attributed to the word-error-rate (WER) of the ASR. We empirically show that the gain from using N-Best hypotheses is only loosely related to WER and is instead related to the diversity of the hypotheses.
Surya Kant Sahu · Swaraj Dalmia


Lessons from Developing Multimodal Models with Code and Developer Interactions
(Poster)
Recent advances in natural language processing have seen the rise of language models trained on code. Of great interest is the ability of these models to find and classify defects in existing code bases. These models have been applied to defect detection, but improvements between these models have been minor. Literature from cyber security highlights how developer behaviors are often the cause of these defects. In this work we propose to approach the defect detection problem in a multimodal manner using weakly aligned code and developer workflow data. We find that models trained on code and developer interactions tend to overfit and do not generalize because of weak alignment between the code and developer workflow data.
Nicholas Botzer · Yasanka Horawalavithana · Tim Weninger · Svitlana Volkova


An Empirical Analysis of the Advantages of Finite vs. Infinite Width Bayesian Neural Networks
(Poster)
Understanding the relative advantages of finite- versus infinite-width neural networks (NNs) is important for model selection. However, comparing NNs with different widths is challenging because, as the width increases, multiple model properties change simultaneously: model capacity increases while the model becomes less flexible in learning features from the data. Analysis of Bayesian neural networks (BNNs) is even more difficult because inference in the finite-width case is intractable. In this work, we empirically compare finite- and infinite-width BNNs, and provide quantitative and qualitative explanations for their performance difference. We find that when the limiting model is mis-specified, increasing the width can reduce the generalization performance of BNNs. In these cases, we provide evidence that finite BNNs generalize better partially due to the properties of their frequency spectrum that allow them to adapt under model mismatch.
Jiayu Yao · Yaniv Yacoby · Beau Coker · Weiwei Pan · Finale Doshi-Velez


Are Neurons Actually Collapsed? On the Fine-Grained Structure in Neural Representations
(Poster)
Recent work has observed an intriguing "Neural Collapse" phenomenon in well-trained neural networks, where the last-layer representations of training samples with the same label collapse into each other. This suggests that the last-layer representations are completely determined by the labels, and do not depend on the intrinsic structure of the input distribution. We provide evidence that this is not a complete description, and that the apparent collapse hides important fine-grained structure in the representations. Specifically, even when representations apparently collapse, the small amount of remaining variation can still faithfully and accurately capture the intrinsic structure of the input distribution. As an example, if we train on CIFAR-10 using only 5 coarse-grained labels (by combining two classes into one super-class) until convergence, we can reconstruct the original 10-class labels from the learned representations via unsupervised clustering. The reconstructed labels achieve $93\%$ accuracy on the CIFAR-10 test set, nearly matching the normal CIFAR-10 accuracy for the same architecture. Our findings show concretely how the structure of input data can play a significant role in determining the fine-grained structure of neural representations, going beyond what Neural Collapse predicts.
Yongyi Yang · Jacob Steinhardt · Wei Hu


Model Stitching: Looking For Functional Similarity Between Representations
(Poster)
Model stitching (Lenc & Vedaldi 2015) is a compelling methodology to compare different neural network representations, because it allows us to measure to what degree they may be interchanged. We expand on previous work from Bansal, Nakkiran & Barak, which used model stitching to compare representations of the same shapes learned by differently seeded and/or trained neural networks of the same architecture. Our contribution enables us to compare the representations learned by layers with different shapes from neural networks with different architectures. We subsequently reveal unexpected behavior of model stitching. Namely, we find that stitching, based on convolutions, for small ResNets, can reach high accuracy if those layers come later in the first (sender) network than in the second (receiver), even if those layers are far apart. This leads us to hypothesize that stitches are not in fact learning to match the representations expected by receiver layers, but instead finding different representations which nonetheless yield similar results. Thus, we believe that model stitching may not always be an accurate measure of similarity.
Adriano Hernandez · Rumen Dangovski · Peter Y. Lu


On the Sparsity of Image Super-resolution Network
(Poster)
The over-parameterization of neural networks has been a widely discussed concern for a long time. It gives us the opportunity to find, within an over-parameterized network, subnetworks that improve the parameter efficiency of neural networks. In our study, we use EDSR as the backbone network to explore parameter efficiency in super-resolution (SR) networks in the form of sparsity. Specifically, we search for sparse subnetworks at two granularities, weight and kernel, through various methods, and analyze the relationship between the structure and performance of the subnetworks. (1) We observe the "Lottery Ticket Hypothesis" from a new perspective in the regression task of SR at weight granularity. (2) At convolution kernel granularity, we apply several methods to explore the influence of different sparse subnetworks on network performance and find that, based on certain rules, the performance of different subnetworks rarely depends on their structures. (3) We propose a very convenient width-sparsity method at convolution kernel granularity, which can improve the parameter utilization efficiency of most SR networks.
Chenyu Dong · Hailong Ma · Jinjin Gu · Ruofan Zhang · Jieming Li · Chun Yuan


Paradigmatic Revolutions in Computer Vision
(Poster)
Kuhn's groundbreaking Structure divides scientific progress into four phases: the pre-paradigm period, normal science, scientific crisis and revolution. Most of the time a field advances incrementally, constrained and guided by a currently agreed upon paradigm following an implicit set of rules. Creative phases emerge when phenomena occur which lack satisfactory explanation within the current paradigm (the crisis) until a new one replaces it (the revolution). This model of science was mainly laid out using exemplars from natural science; we want to show that Kuhn's work is also applicable to information sciences. We analyze the state of one field in particular, computer vision, using Kuhn's vocabulary. Following significant technology-driven advances of machine learning methods in the age of deep learning, researchers in computer vision were eager to accept the models that now dominate the state of the art. We discuss the current state of the field, especially in light of the deep learning revolution, and argue that current deep learning methods cannot fully constitute a paradigm for computer vision in the Kuhnian sense.
Andreas Kriegler


The curse of (non)convexity: The case of an Optimization-Inspired Data Pruning algorithm
(Poster)
Data pruning consists of identifying a subset of the training set that can be used for training instead of the full dataset. This pruned dataset is often chosen to satisfy some desirable properties. In this paper, we leverage some existing theory on importance sampling with Stochastic Gradient Descent (SGD) to derive a new principled data pruning algorithm based on Lipschitz properties of the loss function. The goal is to identify a training subset that accelerates training (compared to e.g. random pruning). We call this algorithm $\texttt{LiPrune}$. We illustrate cases where $\texttt{LiPrune}$ outperforms existing methods and show the limitations and failures of this algorithm in the context of deep learning.
Fadhel Ayed · Soufiane Hayou


An Empirical Study on Clustering Pretrained Embeddings: Is Deep Strictly Better?
(Poster)
Recent research in clustering face embeddings has found that unsupervised, shallow, heuristic-based methods, including $k$-means and hierarchical agglomerative clustering, underperform supervised, deep, inductive methods. While the reported improvements are indeed impressive, experiments are mostly limited to face datasets, where the clustered embeddings are highly discriminative or well-separated by class (Recall@1 above 90% and often near ceiling), and the experimental methodology seemingly favors the deep methods. We conduct an empirical study of 14 clustering methods on two popular non-face datasets, Cars196 and Stanford Online Products, and obtain robust but contentious findings. Notably, deep methods are surprisingly fragile for embeddings with more uncertainty, where they underperform the shallow, heuristic-based methods. We believe our benchmarks broaden the scope of supervised clustering methods beyond the face domain and can serve as a foundation on which these methods could be improved.
Tyler Scott · Ting Liu · Michael Mozer · Andrew Gallagher


When Are Graph Neural Networks Better Than Structure-Agnostic Methods?
(Poster)
Graph neural networks (GNNs) are commonly applied to graph data, but their performance is often poorly understood. It is easy to find examples in which a GNN is unable to learn useful graph representations, but generally hard to explain why. In this work, we analyse the effectiveness of graph representations learned by shallow GNNs (2 layers) for input graphs with different structural properties and feature information. We expand on the failure cases by decoupling the impact of structural and feature information on the learning process. Our results indicate that GNNs' implicit architectural assumptions are tightly related to the structural properties of the input graph and may impair their learning ability. In case of mismatch, GNNs can often be outperformed by structure-agnostic methods such as the multi-layer perceptron.
Diana Gomes · Fred RL · Kyriakos Efthymiadis · Ann Nowe · Peter Vrancx


The (Un)Scalability of Heuristic Approximators for NP-Hard Search Problems
(Poster)
The A* algorithm is commonly used to solve NP-hard combinatorial optimization problems. When provided with an accurate heuristic function, A* can solve such problems in time complexity that is polynomial in the solution depth. This fact implies that accurate heuristic approximation for many such problems is also NP-hard. In this context, we examine a line of recent publications that propose the use of deep neural networks for heuristic approximation. We assert that these works suffer from inherent scalability limitations since, under the assumption that P$\ne$NP, such approaches result in either (a) network sizes that scale exponentially in the instance sizes or (b) heuristic approximation accuracy that scales inversely with the instance sizes. Our claim is supported by experimental results for three representative NP-hard search problems that show that fitting deep neural networks accurately to heuristic functions necessitates network sizes that scale exponentially with the instance size.
Sumedh Dattaguru Pendurkar · Taoan Huang · Sven Koenig · Guni Sharon


DARTFormer: Finding The Best Type Of Attention
(Poster)
Given the wide and ever-growing range of different efficient Transformer attention mechanisms, it is important to identify which attention is most effective for a given task. In this work, we are also interested in combining different attention types to build heterogeneous Transformers. We first propose a DARTS-like Neural Architecture Search (NAS) method to find the best attention for a given task; in this setup, all heads use the same attention (homogeneous models). Our results suggest that NAS is highly effective on this task, and it identifies the best attention mechanisms for IMDb byte-level text classification and Listops. We then extend our framework to search for and build Transformers with multiple different attention types, and call them heterogeneous Transformers. We show that whilst these heterogeneous Transformers are better than the average homogeneous models, they cannot outperform the best. We explore the reasons why heterogeneous attention makes sense, and why it ultimately fails.
Jason Brown · Yiren Zhao · I Shumailov · Robert Mullins


Exploring the Sharpened Cosine Similarity
(Poster)
Convolutional layers have long served as the primary workhorse for image classification. Recently, an alternative to convolution was proposed using the Sharpened Cosine Similarity (SCS), which in theory may serve as a better feature detector. While multiple sources report promising results, there has not been to date a full-scale empirical analysis of neural network performance using these new layers. In our work, we explore SCS's parameter behavior and potential as a drop-in replacement for convolutions in multiple CNN architectures benchmarked on CIFAR-10. We find that while SCS may not yield significant increases in accuracy, it may learn more interpretable representations. We also find that, in some circumstances, SCS may confer a slight increase in adversarial robustness.
Skyler Wu · Fred Lu · Edward Raff · James Holt
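For readers unfamiliar with SCS, the following is a minimal sketch of the similarity for a single pair of vectors. The abstract does not state the formula; the form used here, sign(s·k) · (|s·k| / ((‖s‖ + q)(‖k‖ + q)))^p, is the commonly cited definition and is an assumption, as are the function name and the default values of `p` and `q`.

```python
import math


def sharpened_cosine_similarity(signal, kernel, p=2.0, q=1e-3):
    """Sharpened cosine similarity between two equal-length vectors.

    Assumed form (not given in the abstract):
        sign(s.k) * (|s.k| / ((||s|| + q) * (||k|| + q))) ** p
    p sharpens the response; q keeps the denominator away from zero.
    """
    dot = sum(s * k for s, k in zip(signal, kernel))
    norm_s = math.sqrt(sum(s * s for s in signal))
    norm_k = math.sqrt(sum(k * k for k in kernel))
    cosine = abs(dot) / ((norm_s + q) * (norm_k + q))
    # Raise the magnitude to the power p, then restore the sign of the dot product.
    return math.copysign(cosine ** p, dot)


# Illustrative use: a patch compared with an aligned and an opposed kernel.
aligned = sharpened_cosine_similarity([1.0, 0.0], [2.0, 0.0], q=0.0)
opposed = sharpened_cosine_similarity([1.0, 0.0], [-2.0, 0.0], q=0.0)
```

Unlike a plain dot product, the response depends on the angle between patch and kernel rather than on their magnitudes, which is the intuition behind using it as a feature detector.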


Are you using test log-likelihood correctly?
(Poster)
Test log-likelihood is commonly used to compare different models of the same data and different approximate inference algorithms for fitting the same probabilistic model. We present simple examples demonstrating how comparisons based on test log-likelihood can contradict comparisons according to other objectives. Specifically, our examples show that (i) conclusions about forecast accuracy based on test log-likelihood comparisons may not agree with conclusions based on other distributional quantities like means; and (ii) that approximate Bayesian inference algorithms that attain higher test log-likelihoods need not also yield more accurate posterior approximations.
Sameer Deshpande · Soumya Ghosh · Tin Nguyen · Tamara Broderick
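Point (i) can be illustrated with a toy calculation. The sketch below is not taken from the paper: the three-point test set and the two Gaussian predictive models are hypothetical, chosen so that the model with the more accurate mean receives the lower test log-likelihood.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log density of a univariate Gaussian at x."""
    return (-math.log(std) - 0.5 * math.log(2 * math.pi)
            - (x - mean) ** 2 / (2 * std ** 2))


def mean_test_log_likelihood(data, mean, std):
    """Average per-point log-likelihood of the data under N(mean, std^2)."""
    return sum(gaussian_log_pdf(x, mean, std) for x in data) / len(data)


# Hypothetical held-out data with sample mean 0.
test_data = [-1.0, 0.0, 1.0]
sample_mean = sum(test_data) / len(test_data)

# Model A: exactly the right mean, but a far too narrow predictive spread.
# Model B: a biased mean, but a sensible predictive spread.
model_a = {"mean": 0.0, "std": 0.1}
model_b = {"mean": 0.5, "std": 1.0}

ll_a = mean_test_log_likelihood(test_data, **model_a)
ll_b = mean_test_log_likelihood(test_data, **model_b)
mean_error_a = abs(model_a["mean"] - sample_mean)
mean_error_b = abs(model_b["mean"] - sample_mean)

print(f"model A: log-lik {ll_a:.2f}, mean error {mean_error_a:.2f}")
print(f"model B: log-lik {ll_b:.2f}, mean error {mean_error_b:.2f}")
```

Model B wins on test log-likelihood while model A wins on mean accuracy, so the two criteria rank the models in opposite orders.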


Identifying the Context Shift between Test Benchmarks and Production Data
(Poster)
Benchmark datasets have traditionally served dual purposes: first, benchmarks offer a standard on which machine learning researchers can compare different methods, and second, benchmarks provide a model, albeit imperfect, of the real world. The incompleteness of test benchmarks (and the data upon which models are trained) hinders robustness in machine learning, enables shortcut learning, and leaves models systematically prone to err on out-of-distribution and adversarially perturbed data. In an effort to clarify how to address the mismatch between test benchmarks and production data, we introduce context shift to describe semantically meaningful changes in the underlying data generation process. Moreover, we identify three methods for addressing context shift that would otherwise lead to model prediction errors: first, we describe how human intuition and expert knowledge can identify semantically meaningful features upon which models systematically fail; second, we detail how dynamic benchmarking, with its focus on capturing the data generation process, can promote generalizability through corroboration; and third, we highlight that clarifying a model's limitations can reduce unexpected errors.
Matt Groh


On the performance of Direct Loss Minimization for Bayesian Neural Networks
(Poster)
Direct Loss Minimization (DLM) has been proposed as a pseudo-Bayesian method motivated as regularized loss minimization. Compared to variational inference, it replaces the loss term in the evidence lower bound (ELBO) with the predictive log loss, which is the same loss function used in evaluation. A number of theoretical and empirical results in prior work suggest that DLM can significantly improve over ELBO optimization for some models. However, as we point out in this paper, this is not the case for Bayesian neural networks (BNNs). The paper explores the practical performance of DLM for BNNs, the reasons for its failure and its relationship to optimizing the ELBO, uncovering some interesting facts about both algorithms.
Yadi Wei · Roni Khardon


Analysing the Relations of Misclassified Inputs Between Models
(Poster)
A common thought in the machine learning community is that many of the misclassified images are "difficult" images (for example, images where the details are too small to differentiate between two classes). We evaluate the misclassified images of various deep learning models and check whether other models can correctly classify those images. We find that the misclassified images of each model are different. Moreover, even though the models have similar accuracy on ImageNet, one model can correctly classify more than 15% of the misclassified images of another model. This finding may encourage further research on using two or more architectures when making a prediction.
Hadar Shavit


How many trained neural networks are needed for influence estimation in modern deep learning?
(Poster)
Influence estimation attempts to estimate the effect of removing a training example on downstream predictions. Prior work has shown that a first-order approximation to estimate influence does not agree with the ground truth of re-training or fine-tuning without a training example. Recently, Feldman and Zhang [2020] created an influence estimator that provides meaningful influence estimates but requires training thousands of models on large subsets of a dataset. In this work, we explore how the method in Feldman and Zhang [2020] scales with the number of trained models. We also show empirical and analytical results in the standard influence estimation setting that provide intuitions about the role of nondeterminism in neural network training and how the accuracy of test predictions affects the number of models needed to detect an influential training example. We ultimately find that a large number of models is needed for influence estimation, though the exact number is hard to quantify due to training nondeterminism and depends on test example difficulty, which varies between tasks.
Sasha (Alexandre) Doubov · Tianshi Cao · David Acuna · Sanja Fidler


Much Easier Said Than Done: Falsifying the Causal Relevance of Decoding Methods
(Poster)
Linear classifier probes are frequently utilized to better understand how neural networks function. Researchers have approached the problem of determining unit importance in neural networks by probing their learned, internal representations. Linear classifier probes identify highly selective units as the most important for network function. Whether or not a network actually relies on highly selective units can be tested by removing them from the network using ablation. Surprisingly, when highly selective units are ablated they only produce small performance deficits, and even then only in some cases. In spite of the absence of ablation effects for selective neurons, linear decoding methods can be effectively used to interpret network function, leaving their effectiveness a mystery. To falsify the exclusive role of selectivity in network function and resolve this contradiction, we systematically ablate groups of units in subregions of activation space. Here, we find a weak relationship between neurons identified by probes and those identified by ablation. More specifically, we find that an interaction between selectivity and the average activity of the unit better predicts ablation performance deficits for groups of units in AlexNet, VGG16, MobileNetV2, and ResNet101. Linear decoders are likely somewhat effective because they overlap with those units that are causally important for network function. Interpretability methods could be improved by focusing on causally important units.
Lucas Hayne · Abhijit Suresh · Hunar Jain · Rahul Kumar Mohan Kumar · R. McKell Carter


Evaluating Robust Perceptual Losses for Image Reconstruction
(Poster)
Nowadays, many deep neural networks (DNNs) for image reconstruction tasks are trained using a combination of pixel-wise loss functions and perceptual image losses like learned perceptual image patch similarity (LPIPS). As these perceptual image losses compare the features of a pre-trained DNN, it is unsurprising that they are vulnerable to adversarial examples. It is known that: (i) DNNs can be robustified against adversarial examples using adversarial training, and (ii) adversarial examples are imperceptible to the human eye. Thus, we hypothesize that perceptual metrics based on a robustly trained DNN are more aligned with human perception than those based on non-robust models. Our extensive experiments on an image super-resolution task show, however, that this is not the case. We observe that models trained with a robust perceptual loss tend to produce more artifacts in the reconstructed image. Furthermore, we were unable to find reliable image similarity metrics or evaluation methods to quantify these observations (which are known open problems).
Tobias Uelwer · Felix Michels · Oliver De Candido


The Best Deep Ensembles Sacrifice Predictive Diversity
(Poster)
Ensembling remains a hugely popular method for increasing the performance of a given class of models. In the case of deep learning, the benefits of ensembling are often attributed to the diverse predictions of the individual ensemble members. Here we investigate a tradeoff between diversity and individual model performance, and find that, surprisingly, encouraging diversity during training almost always yields worse ensembles. We show that this tradeoff arises from the Jensen gap between the single model and ensemble losses, and show that the Jensen gap is a natural measure of diversity for both the mean squared error and cross entropy loss functions. Our results suggest that to reduce the ensemble error, we should move away from efforts to increase predictive diversity, and instead we should construct ensembles from less diverse (but more accurate) component models.
Taiga Abe · Estefany Kelly Buchanan · Geoff Pleiss · John Cunningham
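The Jensen gap mentioned in the abstract can be made concrete for the mean squared error. The sketch below assumes an ensemble that averages its members' predictions and defines the gap as the average single-model loss minus the ensemble loss; under that assumption the gap equals the variance of the member predictions. The function name and the numbers are hypothetical, not taken from the paper.

```python
def jensen_gap_mse(member_predictions, target):
    """Return (average member loss, ensemble loss, Jensen gap) for squared error.

    Assumes the ensemble prediction is the plain average of member predictions.
    """
    count = len(member_predictions)
    ensemble_prediction = sum(member_predictions) / count
    average_member_loss = sum((p - target) ** 2 for p in member_predictions) / count
    ensemble_loss = (ensemble_prediction - target) ** 2
    return average_member_loss, ensemble_loss, average_member_loss - ensemble_loss


# Hypothetical predictions of three ensemble members for one regression target.
predictions = [1.0, 2.0, 3.0]
average_loss, ensemble_loss, gap = jensen_gap_mse(predictions, target=2.5)

# For squared error the gap equals the (population) variance of the predictions,
# which is why it can be read as a measure of predictive diversity.
mean_prediction = sum(predictions) / len(predictions)
variance = sum((p - mean_prediction) ** 2 for p in predictions) / len(predictions)
```

Since the ensemble loss is the average member loss minus the gap, a larger gap helps only if it does not come at the cost of a larger average member loss, which is the tradeoff the abstract describes.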


Volume-based Performance not Guaranteed by Promising Patch-based Results in Medical Imaging
(Poster)
Whole-body MRIs are commonly used to screen for early signs of cancers. In addition to the small size of tumours at onset, variations in individuals, tumour types, and MRI machines increase the difficulty of finding tumours in these scans. Training a deep-learning-based segmentation model on patches, rather than whole-body scans, to identify areas where there is a high probability of a tumour, using a custom compound patch loss function, several augmentations, and additional synthetically generated training data, provided promising results at the patch level. However, applying the patch-based model to the entire volume did not yield great results despite all of the state-of-the-art improvements, with over 50% of the tumour sections in the dataset being missed. Our work highlights the discrepancy between the commonly used patch-based analysis and the overall performance on the whole image, and the importance of focusing on the metrics relevant to the ultimate user, in our case, the clinician. Much work remains to be done to bring state-of-the-art segmentation to clinical practice of cancer screening.
Abhishek Moturu · Sayali Joshi · Andrea Doria · Anna Goldenberg


Continuous Soft Pseudo-Labeling in ASR
(Poster)
Continuous pseudo-labeling (PL) algorithms such as slimIPL have recently emerged as a powerful strategy for semi-supervised learning in speech recognition. In contrast with earlier strategies that alternated between training a model and generating pseudo-labels (PLs) with it, here PLs are generated in an end-to-end manner as training proceeds, improving training speed and the accuracy of the final model. PL shares a common theme with teacher-student models such as distillation in that a teacher model generates targets that need to be mimicked by the student model being trained. However, interestingly, PL strategies in general use hard labels, whereas distillation uses the distribution over labels as the target to mimic. Inspired by distillation, we expect that specifying the whole distribution (aka soft labels) over sequences as the target for unlabeled data, instead of a single best pass pseudo-labeled transcript (hard labels), should improve PL performance and convergence. Surprisingly, we find that soft-label targets can lead to training divergence, with the model collapsing to a degenerate token distribution per frame. We hypothesize that the reason this does not happen with hard labels is that training loss on hard labels imposes sequence-level consistency that keeps the model from collapsing to the degenerate solution. In this paper, we show several experiments that support this hypothesis, and experiment with several regularization approaches that can ameliorate the degenerate collapse when using soft labels. These approaches can bring the accuracy of soft labels closer to that of hard labels, and while they are unable to outperform them yet, they serve as a useful framework for further improvements.
Tatiana Likhomanenko · Ronan Collobert · Navdeep Jaitly · Samy Bengio


On the Maximum Hessian Eigenvalue and Generalization
(Poster)
The mechanisms by which certain training interventions, such as increasing learning rates and applying batch normalization, improve the generalization of deep networks remain a mystery. Prior works have speculated that "flatter" solutions generalize better than "sharper" solutions to unseen data, motivating several metrics for measuring flatness (particularly $\lambda_{max}$, the largest eigenvalue of the Hessian of the loss) and algorithms, such as Sharpness-Aware Minimization (SAM), that directly optimize for flatness. Other works question the link between $\lambda_{max}$ and generalization. In this paper, we present findings that call $\lambda_{max}$'s influence on generalization further into question. We show that: (1) while larger learning rates reduce $\lambda_{max}$ for all batch sizes, generalization benefits sometimes vanish at larger batch sizes; (2) by scaling batch size and learning rate simultaneously, we can change $\lambda_{max}$ without affecting generalization; (3) while SAM produces smaller $\lambda_{max}$ for all batch sizes, generalization benefits (also) vanish with larger batch sizes; (4) for dropout, excessively high dropout probabilities can degrade generalization, even as they promote smaller $\lambda_{max}$; and (5) while batch normalization does not consistently produce smaller $\lambda_{max}$, it nevertheless confers generalization benefits. While our experiments affirm the generalization benefits of large learning rates and SAM for minibatch SGD, the GD-SGD discrepancy demonstrates limits to $\lambda_{max}$'s ability to explain generalization in neural networks.

Simran Kaur · Jeremy M Cohen · Zachary Lipton
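For readers unfamiliar with the quantity discussed in the abstract above, $\lambda_{max}$ is commonly estimated without forming the Hessian, by running power iteration on Hessian-vector products. The following is a minimal sketch under stated assumptions, not the authors' procedure: the helper names (`hessian_vector_product`, `estimate_lambda_max`), the finite-difference approximation, and the toy quadratic loss are all illustrative choices. Power iteration converges to the eigenvalue of largest magnitude, which equals $\lambda_{max}$ only when that eigenvalue is positive.

```python
import numpy as np


def hessian_vector_product(grad_fn, w, v, eps=1e-4):
    """Approximate H(w) @ v with a central finite difference of the gradient."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)


def estimate_lambda_max(grad_fn, w, num_iters=200, seed=0):
    """Power iteration on the Hessian at w; returns the last Rayleigh quotient."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    eigenvalue = 0.0
    for _ in range(num_iters):
        hv = hessian_vector_product(grad_fn, w, v)
        eigenvalue = float(v @ hv)  # Rayleigh quotient for the unit vector v
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break
        v = hv / norm
    return eigenvalue


# Hypothetical toy loss L(w) = 0.5 * w^T A w, whose Hessian is exactly A.
A = np.diag([5.0, 2.0, 0.5])
grad_fn = lambda w: A @ w
lambda_max_estimate = estimate_lambda_max(grad_fn, np.ones(3))
```

In a real network, `grad_fn` would be replaced by the gradient of the training loss with respect to the flattened parameters, and an automatic-differentiation Hessian-vector product would normally be preferred over finite differences.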


Denoising Deep Generative Models
(Poster)
Likelihood-based deep generative models have recently been shown to exhibit pathological behaviour under the manifold hypothesis as a consequence of using high-dimensional densities to model data with low-dimensional structure. In this paper we propose two methodologies aimed at addressing this problem. Both are based on adding Gaussian noise to the data to remove the dimensionality mismatch during training, and both provide a denoising mechanism whose goal is to sample from the model as though no noise had been added to the data. Our first approach is based on Tweedie's formula, and the second on models which take the variance of added noise as a conditional input. We show that, surprisingly, while well motivated, these approaches only sporadically improve performance over not adding noise, and that other methods of addressing the dimensionality mismatch are more empirically adequate.
Gabriel Loaiza-Ganem · Brendan Ross · Luhuan Wu · John Cunningham · Jesse Cresswell · Anthony Caterini
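The abstract above names Tweedie's formula without stating it. For reference, the standard form for additive Gaussian noise is given below. The notation is an assumption introduced here rather than taken from the paper: $\tilde{x}$ is the noisy observation, $\sigma^2$ the noise variance, and $p_\sigma$ the density of the noisy data.

```latex
% Assumed noise model: \tilde{x} = x + \sigma \epsilon, with \epsilon \sim \mathcal{N}(0, I)
\mathbb{E}[x \mid \tilde{x}] = \tilde{x} + \sigma^2 \, \nabla_{\tilde{x}} \log p_\sigma(\tilde{x})
```

In words, the posterior mean of the clean data is the noisy point shifted along the score of the noisy density, which is what makes a model trained on noised data usable as a denoiser.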


Adversarial Attacks are a Surprisingly Strong Baseline for Poisoning Few-Shot Meta-Learners
(Poster)
This paper examines the robustness of deployed few-shot meta-learning systems when they are fed an imperceptibly perturbed few-shot dataset. We attack amortized meta-learners, which allows us to craft colluding sets of inputs that are tailored to fool the system's learning algorithm when used as training data. Jointly crafted adversarial inputs might be expected to synergistically manipulate a classifier, allowing for very strong data-poisoning attacks that would be hard to detect. We show that in a white-box setting, these attacks are very successful and can cause the target model's predictions to become worse than chance. However, in opposition to the well-known transferability of adversarial examples in general, the colluding sets do not transfer well to different classifiers. We explore two hypotheses to explain this: 'overfitting' by the attack, and mismatch between the model on which the attack is generated and that to which the attack is transferred. Regardless of the mitigation strategies suggested by these hypotheses, the colluding inputs transfer no better than adversarial inputs that are generated independently in the usual way.
Elre Oldewage · John Bronskill · Richard Turner


Certified defences hurt generalisation
(Poster)
In recent years, much work has been devoted to designing certified defences for neural networks, i.e., methods for learning neural networks that are provably robust to certain adversarial perturbations. Due to the nonconvexity of the problem, dominant approaches in this area rely on convex approximations, which are inherently loose. In this paper, we question the effectiveness of such approaches for realistic computer vision tasks. First, we provide extensive empirical evidence to show that certified defences suffer not only worse accuracy but also worse robustness and fairness than empirical defences. We hypothesise that the reasons why certified defences suffer in generalisation are (i) the large number of relaxed nonconvex constraints and (ii) strong alignment between the adversarial perturbations and the "signal" direction. We provide a combination of theoretical and experimental evidence to support these hypotheses.
Piersilvio De Bartolomeis · Jacob Clarysse · Fanny Yang · Amartya Sanyal


When Does Re-initialization Work?
(Poster)
Re-initializing a neural network during training has been observed to improve generalization in recent works. Yet it is neither widely adopted in deep learning practice nor is it often used in state-of-the-art training protocols. This raises the question of when re-initialization works, and whether it should be used together with regularization techniques such as data augmentation, weight decay and learning rate schedules. In this work, we conduct an extensive empirical comparison of standard training with a selection of re-initialization methods to answer this question, training over 15,000 models on a variety of image classification benchmarks. We first establish that such methods are consistently beneficial for generalization in the absence of any other regularization. However, when deployed alongside other carefully tuned regularization techniques, re-initialization methods offer little to no added benefit for generalization, although optimal generalization performance becomes less sensitive to the choice of learning rate and weight decay hyperparameters. To investigate the impact of re-initialization methods on noisy data, we also consider learning under label noise. Surprisingly, in this case, re-initialization significantly improves upon standard training, even in the presence of other carefully tuned regularization techniques.
Sheheryar Zaidi · Tudor Berariu · Hyunjik Kim · Jorg Bornschein · Claudia Clopath · Yee Whye Teh · Razvan Pascanu


Lempel-Ziv Networks
(Poster)
Sequence processing has long been a central area of machine learning research. Recurrent neural nets have been successful in processing sequences for a number of tasks; however, they are known to be both ineffective and computationally expensive when applied to very long sequences. Compression-based methods have demonstrated more robustness when processing such sequences. In particular, an approach pairing the Lempel-Ziv Jaccard Distance (LZJD) with the k-Nearest Neighbor algorithm has shown promise on long sequence problems (up to $T=200,000,000$ steps) involving malware classification. Unfortunately, use of LZJD is limited to discrete domains. To extend the benefits of LZJD to a continuous domain, we investigate the effectiveness of a deep-learning analog of the algorithm, the Lempel-Ziv Network. While we achieve a successful proof-of-concept, we are unable to meaningfully improve on the performance of a standard LSTM across a variety of datasets and sequence processing tasks. In addition to presenting this negative result, our work highlights the problem of subpar baseline tuning in newer research areas.

Rebecca Saul · Mohammad Mahmudul Alam · John Hurwitz · Edward Raff · Tim Oates · James Holt
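To make the discrete method mentioned in the abstract above concrete, the idea behind LZJD can be sketched as follows: parse each sequence into the set of distinct phrases produced by a Lempel-Ziv-style scan, then compare two sequences by the Jaccard distance between their phrase sets. This is a simplified illustration, not the paper's implementation. It computes the exact Jaccard distance, whereas practical LZJD implementations typically approximate it with hashing, and the function names `lempel_ziv_set` and `lzjd_distance` are introduced here for illustration only.

```python
def lempel_ziv_set(data: bytes) -> set:
    """Collect the distinct phrases found by an LZ78-style left-to-right parse."""
    phrases = set()
    start, end = 0, 1
    while end <= len(data):
        phrase = data[start:end]
        if phrase in phrases:
            end += 1  # extend the current phrase until it is new
        else:
            phrases.add(phrase)
            start = end  # begin the next phrase right after this one
            end = start + 1
    return phrases


def lzjd_distance(a: bytes, b: bytes) -> float:
    """Exact Jaccard distance between the Lempel-Ziv phrase sets of a and b."""
    set_a, set_b = lempel_ziv_set(a), lempel_ziv_set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty sequences are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)


same = lzjd_distance(b"abababab", b"abababab")
disjoint = lzjd_distance(b"aaaa", b"bbbb")
```

Because the phrase sets are built from exact byte matches, this distance only applies to discrete sequences, which is the limitation the abstract cites as motivation for a continuous analog.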