
Workshop
Bayesian Deep Learning
Yarin Gal · Yingzhen Li · Sebastian Farquhar · Christos Louizos · Eric Nalisnick · Andrew Gordon Wilson · Zoubin Ghahramani · Kevin Murphy · Max Welling

Tue Dec 14 03:00 AM -- 11:00 AM (PST)

To deploy deep learning in the wild responsibly, we must know when models are making unsubstantiated guesses. The field of Bayesian Deep Learning (BDL) has been a focal point in the ML community for the development of such tools. Big strides have been made in BDL in recent years, with the field making an impact outside of the ML community, in fields including astronomy, medical imaging, the physical sciences, and many others. But the field of BDL itself is facing an evaluation crisis: most BDL papers evaluate the uncertainty estimation quality of new methods on MNIST and CIFAR alone, ignoring the needs of real-world applications that use BDL. Therefore, apart from discussing the latest advances in BDL methodologies, a particular focus of this year’s programme is on the reliability of BDL techniques in downstream tasks. This focus is reflected through invited talks from practitioners in other fields and by working together with the two NeurIPS challenges in BDL — the Approximate Inference in Bayesian Deep Learning Challenge and the Shifts Challenge on Robustness and Uncertainty under Real-World Distributional Shift — advertising work done in applications including autonomous driving, medicine, space, and more. We hope that the mainstream BDL community will adopt real-world benchmarks based on such applications, pushing the field forward beyond MNIST and CIFAR evaluations.

Tue 3:00 a.m. - 3:10 a.m.  Opening Remarks (zoom)
Tue 3:10 a.m. - 3:30 a.m.  Adaptive and Robust Learning with Bayes (Invited talk) · Emtiyaz Khan, Dharmesh Tailor, Siddharth Swaroop
Tue 3:30 a.m. - 3:50 a.m.  A Bayesian Perspective on Meta-Learning (Invited talk) · Yee Whye Teh
Tue 3:50 a.m. - 4:10 a.m.  Shifts Challenge: Robustness and Uncertainty under Real-World Distributional Shift (Competition talk)
Tue 4:10 a.m. - 4:20 a.m.  Gaussian Dropout as an Information Bottleneck Layer (Contributed talk) · Melanie Rey
Tue 4:20 a.m. - 4:30 a.m.  Funnels: Exact Maximum Likelihood with Dimensionality Reduction (Contributed talk) · Samuel Klein
Tue 4:30 a.m. - 5:30 a.m.  Posters (gather town) and lunch break
Tue 5:30 a.m. - 5:50 a.m.  Spacecraft Collision Avoidance with Bayesian Deep Learning (Invited talk) · Atılım Güneş Baydin, Francesco Pinto
Tue 5:50 a.m. - 6:10 a.m.  Inference & Sampling with Symmetries (Invited talk) · Danilo Rezende, Peter Wirnsberger
Tue 6:10 a.m. - 6:30 a.m.  Bayesian Neural Networks, Adversarial Attacks, and How the Amount of Samples Matters (Invited talk) · Asja Fischer, Sina Däubener
Tue 6:30 a.m. - 8:00 a.m.  Posters (gather town)
Tue 8:00 a.m. - 8:20 a.m.  Quantified Uncertainty for Safe Operation of Particle Accelerators (Invited talk) · Adi Hanuka, Owen Convery
Tue 8:20 a.m. - 8:30 a.m.  Diversity is All You Need to Improve Bayesian Model Averaging (Contributed talk) · Yashvir Grewal
Tue 8:30 a.m. - 8:40 a.m.  Structured Stochastic Gradient MCMC: a hybrid VI and MCMC approach (Contributed talk) · Alex Boyd, Antonios Alexos
Tue 8:40 a.m. - 9:00 a.m.  Evaluating Approximate Inference in Bayesian Deep Learning (Competition talk)
Tue 9:00 a.m. - 9:20 a.m.  An Automatic Finite-Data Robustness Metric for Bayes and Beyond: Can Dropping a Little Data Change Conclusions? (Invited talk)
Tue 9:20 a.m. - 9:25 a.m.  Closing remarks
Tue 9:25 a.m. - 11:00 a.m.  Social and Posters (gather town)

- Diversity is All You Need to Improve Bayesian Model Averaging (Poster)
Existing approximate inference techniques produce predictive distributions that are quite distinct from the predictive distribution of the gold-standard Hamiltonian Monte Carlo. In this work, we bring the predictive distribution produced by deep ensembles closer to the Hamiltonian Monte Carlo predictive distribution by increasing the diversity within the ensembles. The proposed approach outperforms existing approximate inference methods and is currently ranked highest in the Approximate Inference competition at NeurIPS 2021.
Yashvir Singh Grewal · Thang Bui

- Regularizations Are All You Need: Weather Prediction Under Distributional Shift (Poster)
In this paper, we present preliminary results on improving out-of-domain weather prediction and uncertainty estimation as part of the Shifts Challenge on Robustness and Uncertainty under Real-World Distributional Shift. Our preliminary results show that by leveraging an ensemble of Bayesian models and thoughtfully splitting the training set, we can achieve more robust and accurate results than standard libraries. We quantify our predictions using several metrics and propose several future lines of inquiry and experimentation to boost performance.
Sankalp Gilda · Neel Bhandari · Wendy Wing Yee Mak · Andrea Panizza

- Reducing redundancy in Semantic-KITTI: Study on data augmentations within Active Learning (Poster)
Active learning has recently gained attention in deep learning tasks dedicated to autonomous driving, such as image classification.
However, semantic segmentation for point clouds remains a largely unexplored task in active learning, mainly due to the heavy computational cost of such work. In this paper, we present an analysis to reduce data redundancy in the large-scale dataset Semantic-KITTI, thanks to uncertainty-based active learning methods and data augmentation. We demonstrate that data augmentation techniques help our active learning cycles, and achieve baseline accuracy with only 60% of the dataset.
Alexandre Almin · Anh Duong · Léo Lemarié · Ravi Kiran

- An Empirical Analysis of Uncertainty Estimation in Genomics Applications (Poster)
The usability of machine learning solutions in critical real-world applications relies on the availability of an uncertainty measure that reflects the confidence in the model predictions. In this work, we present an empirical analysis of uncertainty estimation approaches in deep learning models. We contrast Bayesian neural networks (BNN) against Monte Carlo dropout (MC-dropout) methods to evaluate their performance and uncertainty scores in two classification tasks with different dataset characteristics.
Sepideh Saran · Mahsa Ghanbari · Uwe Ohler

- Hierarchical Topic Evaluation: Statistical vs. Neural Models (Poster)
Hierarchical topic models (HTMs), especially those based on Bayesian deep learning, are gaining increasing attention from the ML community. However, in contrast to their flat counterparts, their proper evaluation is rarely addressed. We propose several measures to evaluate HTMs in terms of their (branch-wise and layer-wise) topic hierarchy. We apply these measures to benchmark several HTMs on a wide range of datasets, and compare neural HTMs to traditional statistical HTMs in topic quality and interpretability. Our findings may help better judge the advantages and disadvantages of different deep hierarchical topic models and drive future research in this area.
Mayank Kumar Nagda · Charu Karakkaparambil James · Sophie Burkhardt · Marius Kloft

- Reflected Hamiltonian Monte Carlo (Poster)
The Hamiltonian Monte Carlo method is well known for its ability to generate distant proposals and avoid random-walk behaviour. Its sampling efficiency, however, is highly sensitive to the choice of the number of leapfrog integration steps. Although the No-U-Turn Sampler automates the tuning of this parameter, it is computationally expensive and practically challenging to implement, especially on parallel architectures. In this work, we introduce the Reflected Hamiltonian Monte Carlo sampler, an HMC methodology that builds upon a reflection mechanism also used in the Bouncy Particle Sampler. The algorithm has an update-rate parameter that plays a role analogous to that of the number of leapfrog integration steps in Hamiltonian Monte Carlo. With a focus on high-dimensional classification tasks, we demonstrate the competitive performance of the proposed algorithm against well-tuned Hamiltonian-based Markov chain Monte Carlo methods.
Khai Xiang Au · Alexandre Thiery

- Federated Functional Variational Inference (Poster)
Traditional federated learning (FL) involves optimizing point estimates for the parameters of the server model via a maximum likelihood objective. While models trained with such objectives show competitive predictive accuracy, they are poorly calibrated and provide no reliable uncertainty estimates. Well-calibrated uncertainty is, however, important in safety-critical applications of FL such as self-driving cars and healthcare. In this work, we propose several methods to train Bayesian neural networks, networks providing uncertainty over their model parameters, in FL. We introduce baseline methods that place priors on, and do inference in, the weight space of the network. We also propose two function-space inference methods.
These build upon recent work in functional variational inference to posit prior distributions in, and do inference on, the function space of the network. The two approaches are based on Federated Averaging (FedAvg) and Expectation-Maximization (EM). We compare these function-space methods to their weight-space counterparts.
Michael Hutchinson · Matthias Reisser · Christos Louizos

- Towards Robust Object Detection: Bayesian RetinaNet for Homoscedastic Aleatoric Uncertainty Modeling (Poster)
According to recent studies, commonly used computer vision datasets contain about 4% label errors. For example, the COCO dataset is known for its high level of noise in data labels, which limits its use for training robust deep neural architectures in real-world scenarios. To model such noise, in this paper we propose homoscedastic aleatoric uncertainty estimation and present a series of novel loss functions to address the problem of image object detection at scale. Specifically, the proposed functions are based on Bayesian inference, and we incorporate them into the common community-adopted object detection deep learning architecture RetinaNet. We also show that modeling homoscedastic aleatoric uncertainty using our novel functions increases model interpretability and improves object detection performance, as evaluated on the COCO dataset.
Natalia Khanzhina · Alexey Lapenok · Andrey Filchenkov

- Stochastic Pruning: Fine-Tuning, and PAC-Bayes bound optimization (Poster)
We introduce an algorithmic framework for stochastic fine-tuning of pruning masks, starting from masks produced by several baselines. We further show that by minimizing a PAC-Bayes bound with data-dependent priors, we obtain a self-bounded learning algorithm with numerically tight bounds.
In the linear model, we show that a PAC-Bayes generalization error bound is controlled by the magnitude of the change in feature alignment between the "prior" and "posterior" data.
Soufiane Hayou · Bobby He · Gintare Karolina Dziugaite

- Adversarial Learning of a Variational Generative Model with Succinct Bottleneck Representation (Poster)
A new bimodal generative model is proposed for generating conditional and joint samples, accompanied by a training method that learns a succinct bottleneck representation. The proposed model, dubbed the variational Wyner model, is designed based on two classical problems in network information theory, distributed simulation and channel synthesis, in which Wyner's common information arises as the fundamental limit on the succinctness of the common representation. The model is trained by minimizing the symmetric Kullback-Leibler divergence between the variational and model distributions, with regularization terms for common information, reconstruction consistency, and latent-space matching, which is carried out via an adversarial density-ratio estimation technique.
J. Jon Ryu · Yoojin Choi · Young-Han Kim · Mostafa El-Khamy · Jungwon Lee

- Posterior Temperature Optimization in Variational Inference for Inverse Problems (Poster)
Bayesian methods feature useful properties for solving inverse problems, such as tomographic reconstruction. The prior distribution introduces regularization, which helps solve the ill-posed problem and reduces overfitting. In practice, however, the posterior temperature is often chosen ad hoc, which is suboptimal, and the full potential of the Bayesian approach is not realized. In this paper, we optimize both the parameters of the prior distribution and the posterior temperature using Bayesian optimization. Well-tempered posteriors lead to better predictive performance and improved uncertainty calibration, which we demonstrate for the task of sparse-view CT reconstruction.
Max Laves · Malte Tölle · Alexander Schlaefer · Sandy Engelhardt

- Revisiting the Structured Variational Autoencoder (Poster)
The Structured Variational Autoencoder (SVAE) was introduced five years ago. It presented a modeling idea, using probabilistic graphical models (PGMs) as priors on latent variables and deep neural networks (DNNs) to map them to observed data, as well as an inference idea, having the recognition network output conjugate potentials to the PGM prior rather than a full posterior. While mathematically appealing, the SVAE proved impractical to use or extend, as learning required implicit differentiation of a PGM inference algorithm, and the original authors' implementation was in pure Python with no GPU or TPU support. Now, armed with the power of JAX, a software library for automatic differentiation and compilation to CPU, GPU, or TPU targets, we revisit the SVAE. We develop a modular implementation that is orders of magnitude faster than the original code and show examples in a variety of settings, including a scientific application to animal behavior modeling. Furthermore, we extend the original model by incorporating interior potentials, which allows for more expressive PGM priors, such as the recurrent switching linear dynamical system (rSLDS). Our JAX implementation of the SVAE and its extensions opens up avenues for many practical applications, extensions, and theoretical investigations.
Yixiu Zhao · Scott Linderman

- Robust outlier detection by de-biasing VAE likelihoods (Poster)
Deep networks often make confident yet incorrect predictions when tested with outlier data that is far removed from their training distributions. Likelihoods computed by deep generative models (DGMs) are a candidate metric for outlier detection with unlabeled data. Yet DGM likelihoods are readily biased and unreliable.
Here, we examine outlier detection with variational autoencoders (VAEs), among the simplest of DGMs. We show that an analytically derived correction ameliorates a key bias in VAE likelihoods. The bias correction is sample-specific, computationally inexpensive, and readily computed for various visible distributions. Next, we show that a well-known preprocessing technique, contrast stretching, extends the effectiveness of the bias correction to further improve outlier detection performance. We evaluate our approach comprehensively on nine (grayscale and natural) image datasets and demonstrate significant advantages, in terms of speed and accuracy, over four state-of-the-art methods.
Kushal Chauhan · Pradeep Shenoy · Manish Gupta · Devarajan Sridharan

- The Dynamics of Functional Diversity throughout Neural Network Training (Poster)
Deep ensembles offer consistent performance gains, both in terms of reduced generalization error and improved predictive uncertainty estimates. These performance gains are attributed to functional diversity among the components that make up the ensembles: ensemble performance increases with the diversity of the components. A standard way to generate diverse components from a single data set is to train multiple networks on the same data but with different minibatch orders (and augmentations, etc.). In this work, we study when and how this type of diversity decreases during deep neural network training. Using couplings of multiple training runs, we find that diversity rapidly decreases at the start of training, and that increased training time does not restore this lost diversity, implying that early stages of training make irreversible commitments. In particular, our findings provide further evidence that there is less diversity among functions once linear mode connectivity sets in. This motivates studying perturbations to training that upset linear mode connectivity.
We then study how functional diversity is affected by retraining after reinitializing the weights in some layers. We find that we recover significantly more diversity by reinitializing layers closer to the input than by reinitializing layers closer to the output, which also restores the error barrier.
Lee Zamparo · Marc-Etienne Brunet · Thomas George · Sepideh Kharaghani · Gintare Karolina Dziugaite

- Biases in variational Bayesian neural networks (Poster)
Variational inference has recently become the de facto standard method for approximate Bayesian neural networks. However, the standard mean-field approach (MFVI) exhibits many undesirable behaviours. This short paper empirically investigates the variational biases of MFVI and other variational families. The preliminary results shed light on the poor performance of many variational approaches for model selection.
Thang Bui

- Bayesian Inference in Augmented Bow Tie Networks (Poster)
We develop a deep generative model that generalizes feed-forward, rectified linear neural networks with stochastic activations. We call these models bow tie networks because of the shape of their activation distributions. We then leverage the Pólya-gamma augmentation scheme to render the model conditionally conjugate, and we derive a block Gibbs sampling algorithm to approximate the posterior distribution over activations and model parameters. The resulting algorithm is massively parallelizable. We show a proof of concept of this model and Bayesian inference algorithm on a variety of standard regression benchmarks.
Jimmy Smith · Dieterich Lawson · Scott Linderman

- Fast Finite Width Neural Tangent Kernel (Poster)
The Neural Tangent Kernel (NTK), defined as the outer product of the neural network (NN) Jacobians, $\Theta_\theta(x_1, x_2) = \left[\partial f(\theta, x_1)\big/\partial \theta\right] \left[\partial f(\theta, x_2)\big/\partial \theta\right]^T$, has emerged as a central object of study in deep learning. In the infinite-width limit, the NTK can sometimes be computed analytically and is useful for understanding training and generalization of NN architectures. At finite widths, the NTK is also used to better initialize NNs, compare conditioning across models, perform architecture search, and do meta-learning. Unfortunately, the finite-width NTK is notoriously expensive to compute, which severely limits its practical utility. We perform the first in-depth analysis of the compute and memory requirements for NTK computation in finite-width networks. Leveraging the structure of neural networks, we further propose two novel algorithms that change the exponent of the compute and memory requirements of the finite-width NTK, dramatically improving efficiency. We open-source (https://github.com/iclr2022anon/fast_finite_width_ntk) our two algorithms as general-purpose JAX function transformations that apply to any differentiable computation (convolutions, attention, recurrence, etc.) and introduce no new hyperparameters.
Roman Novak · Jascha Sohl-Dickstein · Samuel Schoenholz

- Reliable Uncertainty Quantification of Deep Learning Models for a Free Electron Laser Scientific Facility (Poster)
Particle accelerators are essential instruments for scientific experiments. They provide different experiments with particle beams of different parameters (e.g. beam energies or durations). This is accomplished by changing a wide variety of controllable settings, in a process called tuning.
This is a challenging task, as many particle accelerators are complex machines with thousands of components, each of which contributes sources of uncertainty. Fast, accurate models of these systems could aid rapid customization of beams, but to accomplish this reliably, quantified uncertainties are essential. We address the problem of obtaining reliable uncertainties from learned models of a noisy, high-dimensional, nonlinear accelerator system: the X-ray free electron laser at the Linac Coherent Light Source, a scientific user facility. We examine the efficacy of Bayesian neural networks (BNNs) for reliably quantifying predictive uncertainty and compare them with quantile regression neural networks (QRNNs). The QRNN models provide mean absolute errors on predictions that are consistent with the noise of the measured data. We find the BNN is sensitive to outliers and substantially more computationally expensive, but it still captures the general trend of the target data.
Lipi Gupta · Aashwin Mishra · Auralee Edelen

- Latent Goal Allocation for Multi-Agent Goal-Conditioned Self-Supervised Learning (Poster)
Multi-agent learning plays an essential role in ubiquitous practical applications including game theory, autonomous driving, and more. Meanwhile, goal-conditioned learning has attracted a surge of interest with its capability of solving a rich variety of tasks and configurations. Nevertheless, scenarios that combine both multi-agent and goal-conditioned settings have not been considered previously, owing to the daunting challenges of both areas. In this work, we target multi-agent goal-conditioned tasks, with the objective of learning a universal policy for multiple agents to reach a set of sub-goals. This task necessitates that the agents execute differently conditioned on the assigned sub-goal.
In various scenarios, since it is infeasible to have access to direct rewards for actions or sub-goal assignment labels for each agent, we resort to imitation learning using only demonstrations of experts, without the need for rewards or sub-goal assignment labels. To this end, we propose a probabilistic graphical model, named Latent Goal Allocation (LGA), which explicitly treats the sub-goal assignment as a latent variable to generate the corresponding action for each agent. We conduct experiments showing that the proposed LGA outperforms existing baselines with interpretable sub-goal assignment processes.
Laixi Shi · Peide Huang · Rui Chen

- Constraining cosmological parameters from N-body simulations with Bayesian Neural Networks (Poster)
In this paper, we use the Quijote simulations to extract cosmological parameters with Bayesian neural networks. This kind of model has a remarkable ability to estimate the associated uncertainty, which is one of the ultimate goals in the precision cosmology era. We demonstrate the advantages of BNNs for extracting more complex output distributions and non-Gaussian information from the simulations.
Hector Javier Hortua

- Evaluating Deep Learning Uncertainty Quantification Methods for Neutrino Physics Applications (Poster)
We evaluate uncertainty quantification (UQ) methods for deep learning applied to liquid argon time projection chamber (LArTPC) physics analysis tasks. As deep learning applications enter widespread usage in physics data analysis, neural networks with reliable estimates of prediction uncertainty and robust performance against overconfidence and out-of-distribution (OOD) samples are critical for full deployment in analyzing experimental data. While numerous UQ methods have been tested on simple datasets, performance evaluations for more complex tasks and datasets have been scarce.
We assess the application of selected deep learning UQ methods to the task of particle classification in a simulated 3D LArTPC point cloud dataset. We observe that uncertainty-enabled networks not only allow for better rejection of prediction mistakes and OOD detection, but also generally achieve higher overall accuracy across different task settings.
Dae Heun Koh · Aashwin Mishra · Kazuhiro Terao

- Model-embedding flows: Combining the inductive biases of model-free deep learning and explicit probabilistic modeling (Poster)
Normalizing flows have shown great success as general-purpose density estimators. However, many real-world applications require the use of domain-specific knowledge, which normalizing flows cannot readily incorporate. We propose embedded-model flows (EMF), which alternate general-purpose transformations with structured layers that embed domain-specific inductive biases. These layers are automatically constructed by converting user-specified differentiable probabilistic models into equivalent bijective transformations. We also introduce gated structured layers, which allow bypassing the parts of the models that fail to capture the statistics of the data. We demonstrate that EMFs can be used to induce desirable properties such as multimodality, hierarchical coupling, and continuity. Furthermore, we show that EMFs enable a high-performance form of variational inference where the structure of the prior model is embedded in the variational architecture. In our experiments, we show that this approach outperforms state-of-the-art methods on common structured inference problems.
Gianluigi Silvestri · Emily Fertig · Dave Moore · Luca Ambrogioni

- Likelihood-free Density Ratio Acquisition Functions are not Equivalent to Expected Improvements (Poster)
Bayesian optimization (BO) is one of the most effective black-box optimization methods, yet the need to ensure analytical tractability in the posterior predictive makes it challenging to apply BO to large-scale problems with high-dimensional observations. For these problems, likelihood-free methods present a promising avenue, since they can work with more expressive models and are often more efficient. Previous papers have claimed that density ratios acquired from likelihood-free inference are equivalent to the widely popular expected improvement acquisition function, allowing us to perform BO without expensive exact posterior inference. Unfortunately, we show in this paper that the claim is false; we identify errors in their reasoning and illustrate a counter-example where density ratios are inversely correlated with expected improvements. Our results suggest that additional care is needed when interpreting and applying density ratio acquisition functions from likelihood-free inference.
Jiaming Song · Stefano Ermon

- Object-Factored Models with Partially Observable State (Poster)
In a typical robot manipulation setting, the physical laws that govern object dynamics never change, but the set of objects does. To complicate matters, objects may have intrinsic properties that are not directly observable (e.g., center of mass or friction coefficients). In this work, we introduce a latent-variable model of object-factored dynamics. This model represents uncertainty about the dynamics using deep ensembles, while capturing uncertainty about each object's intrinsic properties using object-specific latent variables. We show that this model allows a robot to rapidly generalize to new objects by using information-theoretic active learning.
Additionally, we highlight the benefits of the deep ensemble for robust performance in downstream tasks.
Isaiah Brand · Michael Noseworthy · Sebastian Castro · Nick Roy

- On Efficient Uncertainty Estimation for Resource-Constrained Mobile Applications (Poster)
Deep neural networks have shown great success in prediction quality, while reliable and robust uncertainty estimation remains a challenge. Predictive uncertainty supplements model predictions and enables improved functionality of downstream tasks, including embedded and mobile applications such as virtual reality, augmented reality, sensor fusion, and perception. These applications often require a compromise in complexity to obtain uncertainty estimates due to very limited memory and compute resources. We tackle this problem by building upon Monte Carlo dropout (MCDO) models using the Axolotl framework; specifically, we diversify sampled subnetworks, leverage dropout patterns, and use a branching technique to improve predictive performance while maintaining fast computation. We conduct experiments on (1) a multi-class classification task using the CIFAR10 dataset, and (2) a more complex human body segmentation task. Our results show the effectiveness of our approach by reaching close to Deep Ensemble prediction quality and uncertainty estimation while still achieving faster inference on resource-limited mobile platforms.
Johanna Rock · Tiago Azevedo · René de Jong · Daniel Ruiz · Partha Maji

- Dropout and Ensemble Networks for Thermospheric Density Uncertainty Estimation (Poster)
Accurately estimating spacecraft location is of crucial importance for a variety of safety-critical tasks in low-Earth orbit (LEO), including satellite collision avoidance and re-entry. Solar activity largely impacts the physical characteristics of the thermosphere, consequently affecting the trajectories of spacecraft in LEO.
State-of-the-art models for estimating thermospheric density are either computationally expensive or under-perform during extreme solar activity. Moreover, these models provide single-point solutions, neglecting critical information on the associated uncertainty. In this work, we use and compare two methods, Monte Carlo dropout and deep ensembles, to estimate thermospheric total mass density and the associated uncertainty. The networks are trained using ground-truth density data from five well-calibrated satellites, with orbital information and solar and geomagnetic indices as input. For a subset of satellites, the trained models improve upon operational solutions, while also providing a measure of uncertainty in the density estimation.
Stefano Bonasera · Giacomo Acciarini · Jorge Pérez-Hernández · Bernard Benson · Edward Brown · Eric Sutton · Moriba Jah · Christopher Bridges · Atilim Gunes Baydin

- Benchmark for Out-of-Distribution Detection in Deep Reinforcement Learning (Poster)
Reinforcement learning (RL) based solutions are being adopted in a variety of domains, including robotics, health care, and industrial automation. Most attention is given to when these solutions work well, but they fail when presented with out-of-distribution inputs: RL policies share the same faults as most machine learning models. Out-of-distribution detection for RL is generally not well covered in the literature, and there is a lack of benchmarks for this task. In this work, we propose a benchmark to evaluate OOD detection methods in a reinforcement learning setting, by modifying the physical parameters of non-visual standard environments or corrupting the state observation for visual environments. We discuss ways to generate custom RL environments that can produce OOD data, and evaluate three uncertainty methods on the OOD detection task. Our results show that ensemble methods have the best OOD detection performance, with a lower standard deviation across multiple environments.
(Aaqib Parvez Mohammed · Matias Valdenegro-Toro)

- Can Network Flatness Explain the Training Speed-Generalisation Connection? (Poster): Recent work has shown that training speed, as estimated by the sum over training loss, is predictive of generalization performance. From a Bayesian perspective, this metric can be theoretically linked to the marginal likelihood in linear models. However, it is unclear why the relationship holds for DNNs and what the underlying mechanisms are. We hypothesise that this relationship holds in DNNs because of network flatness, which causes both fast training speed and good generalization. We investigate the hypothesis in varying settings and find that it might hold when the variance in the stochastic gradient estimation is moderate, with either logit averaging or no data transformation at all. This paper specifies the conditions future work should impose when investigating the connecting mechanism. (Qiaochu Jiang · Clare Lyle · Lisa Schut · Yarin Gal)

- Mixture-of-experts VAEs can disregard unimodal variation in surjective multimodal data (Poster): Machine learning systems are often deployed in domains that entail data from multiple modalities; for example, phenotypic and genotypic characteristics describe patients in healthcare. Previous works have developed variational autoencoders (VAEs) that generate multimodal data. We consider surjective data, where single datapoints from one modality (such as labels) describe multiple datapoints from another modality (such as images). We theoretically and empirically demonstrate that multimodal VAEs with a mixture-of-experts posterior can struggle to capture unimodal variability in surjective data. (Jannik Wolff · Tassilo Klein · Moin Nabi · Rahul G Krishnan · Shinichi Nakajima)

- Depth Uncertainty Networks for Active Learning (Poster): In active learning, the size and complexity of the training dataset change over time.
Simple models that are well specified by the amount of data available at the start of active learning might suffer from bias as more points are actively sampled. Flexible models that might be well suited to the full dataset can suffer from overfitting towards the start of active learning. We tackle this problem using Depth Uncertainty Networks (DUNs), a BNN variant in which the depth of the network, and thus its complexity, is inferred. We find that DUNs outperform other BNN variants on several active learning tasks. Importantly, we show that on the tasks in which DUNs perform best they exhibit notably less overfitting than baselines. (Chelsea Murray · James Allingham · Javier Antoran · José Miguel Hernández-Lobato)

- The Peril of Popular Deep Learning Uncertainty Estimation Methods (Poster): Uncertainty estimation (UE) techniques, such as the Gaussian process (GP), Bayesian neural networks (BNNs), and Monte Carlo dropout (MCDropout), aim to improve the interpretability of machine learning models by assigning an estimated uncertainty value to each of their prediction outputs. However, since unreliable uncertainty estimates can have fatal consequences in practice, this paper analyzes the above techniques. Firstly, we show that GP methods always yield high uncertainty estimates on out-of-distribution (OOD) data. Secondly, we show on a 2D toy example that both BNNs and MCDropout do not give high uncertainty estimates on OOD samples. Finally, we show empirically that this pitfall of BNNs and MCDropout holds on real-world datasets as well. Our insights (i) raise awareness of the need for more cautious use of currently popular UE methods in deep learning, (ii) encourage the development of UE methods that approximate GP-based methods rather than BNNs and MCDropout, and (iii) provide empirical setups that can be used for verifying the OOD performance of any other UE method.
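The first finding above, that GP methods yield high uncertainty on OOD data, can be reproduced with a minimal sketch: for a zero-mean GP with an RBF kernel, the posterior predictive variance reverts to the prior variance far from the training inputs. The data and kernel settings here are illustrative choices, not the paper's setup.

```python
import numpy as np

def rbf(A, B, ls=1.0):
    # RBF kernel matrix between two sets of 1D inputs
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=30)          # training inputs
noise = 1e-2

K = rbf(X, X) + noise * np.eye(len(X))
K_inv = np.linalg.inv(K)

def posterior_var(x_star):
    # Predictive variance of a zero-mean GP: k(x*,x*) - k*^T K^{-1} k*
    k_star = rbf(x_star, X)
    return 1.0 - np.sum((k_star @ K_inv) * k_star, axis=1)

v_in = posterior_var(np.array([0.0]))[0]    # inside the data region: small
v_ood = posterior_var(np.array([10.0]))[0]  # far from all training points: ~prior
```

Far from the data, every kernel entry in `k_star` vanishes, so the variance returns to the prior value of 1 regardless of the observations; this is the behaviour the paper contrasts with BNNs and MCDropout.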
(Yehao Liu · Matteo Pagliardini · Tatjana Chavdarova · Sebastian Stich)

- Dependence between Bayesian neural network units (Poster): The connection between Bayesian neural networks and Gaussian processes gained a lot of attention in the last few years, with the flagship result that hidden units converge to a Gaussian process limit when the layer width tends to infinity. Underpinning this result is the fact that hidden units become independent in the infinite-width limit. Our aim is to shed some light on hidden-unit dependence properties in practical finite-width Bayesian neural networks. In addition to theoretical results, we empirically assess the impact of depth and width on hidden-unit dependence properties. (Mariia Vladimirova · Julyan Arbel · Stephane Girard)

- Relaxed-Responsibility Hierarchical Discrete VAEs (Poster): Successfully training Variational Autoencoders (VAEs) with a hierarchy of discrete latent variables remains an area of active research. Vector-Quantised VAEs are a powerful approach to discrete VAEs, but naive hierarchical extensions can be unstable during training. Leveraging insights from classical methods of inference, we introduce Relaxed-Responsibility Vector-Quantisation, a novel way to parameterise discrete latent variables and a refinement of relaxed Vector-Quantisation that gives better performance and more stable training. This enables a novel approach to hierarchical discrete variational autoencoders with numerous layers of latent variables (here up to 32) that we train end-to-end. Within hierarchical probabilistic deep generative models with discrete latent variables trained end-to-end, we achieve state-of-the-art bits-per-dim results for various standard datasets.
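For readers unfamiliar with the vector-quantisation step that Relaxed-Responsibility Vector-Quantisation refines, here is a minimal sketch of standard hard VQ (nearest-codebook lookup); the relaxed, responsibility-based parameterisation itself is not shown, and the codebook here is a random stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # 8 discrete codes of dimension 4

def quantise(z):
    """Hard vector quantisation: snap each latent vector to its nearest
    codebook entry (the non-differentiable step that relaxed variants soften)."""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (batch, codes)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

z = rng.normal(size=(5, 4))          # a batch of continuous encoder outputs
zq, idx = quantise(z)                # quantised latents and code indices
```

Relaxed variants replace the hard argmin with a soft assignment over codes, which is what makes gradient-based training of deep hierarchies more stable.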
(Matthew Willetts · Xenia Miscouridou · Stephen J Roberts · Chris C Holmes)

- Precision Agriculture Based on Bayesian Neural Network (Poster): Precision agriculture, which utilizes various information to manage crop production, has become an important approach to mitigating the food supply problem around the world. Accurate prediction of crop yield is the main task of precision agriculture. With the help of neural networks, precision agriculture has progressed rapidly in the past decades. However, neural networks are notoriously data-hungry, and data collection in agriculture is expensive and time-consuming. Bayesian neural networks, which extend neural networks with Bayesian inference, are useful under such circumstances. Moreover, Bayesian inference allows estimating the uncertainty associated with a prediction, which makes the result more reliable. In this paper, a Bayesian neural network was applied to a small dataset, and the results show that the Bayesian neural network is more reliable under such circumstances. (Lei Zhao)

- Decomposing Representations for Deterministic Uncertainty Estimation (Poster): Uncertainty estimation is a key component in any deployed machine learning system. One way to evaluate uncertainty estimation is “out-of-distribution” (OoD) detection, that is, distinguishing between the training data distribution and an unseen, different data distribution using uncertainty. In this work, we show that current feature-density-based uncertainty estimators cannot perform well consistently across different OoD detection settings. To solve this, we propose to decompose the learned representations and integrate the uncertainties estimated on them separately. Through experiments, we demonstrate that we can greatly improve the performance and the interpretability of the uncertainty estimation.
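A minimal sketch of the kind of feature-density uncertainty estimator the work above builds on: fit a Gaussian to in-distribution features and score inputs by Mahalanobis distance. The features here are synthetic stand-ins, not the paper's learned representations.

```python
import numpy as np

rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 3))    # stand-in for penultimate-layer features

# Fit a single Gaussian density in feature space
mu = feats.mean(axis=0)
cov = np.cov(feats, rowvar=False) + 1e-6 * np.eye(3)
cov_inv = np.linalg.inv(cov)

def mahalanobis_score(f):
    # Feature-density OOD score: squared distance from the fitted Gaussian
    d = f - mu
    return float(d @ cov_inv @ d)

s_in = mahalanobis_score(feats[0])                       # typical training feature
s_ood = mahalanobis_score(np.array([8.0, 8.0, 8.0]))     # far from the cluster
```

Inputs mapped far from the training-feature cluster receive large scores; the paper's point is that a single such density over the whole representation can be unreliable, motivating the proposed decomposition.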
(Haiwen Huang · Joost van Amersfoort · Yarin Gal)

- Gaussian dropout as an information bottleneck layer (Poster): As models become more powerful, they can acquire the ability to fit the data well in multiple qualitatively different ways. At the same time, we might have requirements other than high predictive performance that we would like the model to satisfy. One way to express such preferences is by controlling the information flow in the model with carefully placed information bottleneck layers, which limit the amount of information that passes through them by applying noise to their inputs. The most notable example of such a layer is the stochastic representation layer of the Deep Variational Information Bottleneck, whose use requires adding a variational upper bound on the mutual information between its inputs and outputs as a penalty to the loss function. We show that using Gaussian dropout, which involves multiplicative Gaussian noise, achieves the same goal in a simpler way without requiring any additional terms in the objective. We evaluate the two approaches in the generative modelling setting, by using them to encourage the use of latent variables in a VAE with an autoregressive decoder for modelling images. (Melanie Rey · Andriy Mnih)

- Funnels: Exact maximum likelihood with dimensionality reduction (Poster): Normalizing flows are diffeomorphic, typically dimension-preserving, models trained using the likelihood of the model. We use the SurVAE framework to construct dimension-reducing surjective flows via a new layer, known as the funnel. We demonstrate its efficacy on a variety of datasets, and show that it improves upon or matches the performance of existing flows while having a reduced latent space size. This layer can also be used with convolutional and feed-forward layers.
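The multiplicative-noise mechanism described in the Gaussian dropout abstract above is simple to sketch; the noise variance `alpha` is an illustrative choice, not a value from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_dropout(h, alpha=0.5, rng=rng):
    """Multiplicative Gaussian noise: h * eps with eps ~ N(1, alpha).
    The bottleneck effect comes from the noise itself, so unlike the
    variational information bottleneck no extra loss term is needed."""
    eps = rng.normal(loc=1.0, scale=np.sqrt(alpha), size=h.shape)
    return h * eps

h = np.ones((4, 3))        # stand-in for a layer's activations
out = gaussian_dropout(h)  # noisy activations; E[out] = h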
(Samuel Klein · John Raine · Tobias Golling · Slava Voloshynovskiy · Sebastion Pina-Otey)

- Progress in Self-Certified Neural Networks (Poster): A learning method is self-certified if it uses all available data to simultaneously learn a predictor and certify its quality with a statistical certificate that is valid on unseen data. Recent work has shown that neural network models trained by optimising PAC-Bayes bounds lead not only to accurate predictors, but also to tight risk certificates, bearing promise towards self-certified learning. In this context, learning and certification strategies based on PAC-Bayes bounds are especially attractive due to their ability to leverage all data to learn a posterior and simultaneously certify its risk. In this paper, we assess the progress towards self-certification in neural networks learnt by PAC-Bayes-inspired objectives. We empirically compare (on four classification datasets) classical test-set bounds for deterministic predictors and a PAC-Bayes bound for randomised self-certified predictors. We show that in data-starvation regimes, holding out data for the test-set bounds adversely affects generalisation performance, while learning and certification strategies based on PAC-Bayes bounds do not suffer from this drawback. We find that probabilistic neural networks learnt by PAC-Bayes-inspired objectives lead to certificates that can be surprisingly competitive with commonly used test-set bounds. (Maria Perez-Ortiz · Omar Rivasplata · Emilio Parrado-Hernández · Benjamin Guedj · John Shawe-Taylor)

- Multimodal Relational VAE (Poster): In this work, we propose a new formulation for multimodal VAEs to model and learn the relationship between data types.
Despite their recent progress, current multimodal generative methods are based on simplistic assumptions regarding the relation between data types, which leads to a trade-off between coherence and quality of generated samples, even for simple toy datasets. The proposed method learns the relationship between data types instead of relying on pre-defined and limiting assumptions. Based on the principles of variational inference, we change the posterior approximation to explicitly include information about the relation between data types. We show empirically that the simplified assumption of a single shared latent space leads to inferior performance for a dataset with additional pairwise shared information. (Thomas Sutter · Julia Vogt)

- Laplace Approximation with Diagonalized Hessian for Over-parameterized Neural Networks (Poster): Bayesian Neural Networks (BNNs) provide valid uncertainty estimation on their feedforward outputs. However, it can become computationally prohibitive to apply them to modern large-scale neural networks. In this work, we combine Laplace approximation with linearized inference for real-time and robust uncertainty evaluation. Specifically, we study the effectiveness and computational necessity of a diagonal Hessian approximation in Laplace approximation on over-parameterized networks. The proposed approach is investigated on object detection tasks in an autonomous driving scenario and demonstrates faster inference speed and convincing results. (Ming Gui · Ziqing Zhao · Tianming Qiu · Hao Shen)

- Exploring the Limits of Epistemic Uncertainty Quantification in Low-Shot Settings (Poster): Uncertainty quantification in neural networks promises to increase the safety of AI systems, but it is not clear how performance might vary with the training set size. In this paper we evaluate seven uncertainty methods on Fashion MNIST and CIFAR10 as we sub-sample and produce varied training set sizes.
We find that calibration error and out-of-distribution detection performance strongly depend on the training set size, with most methods being miscalibrated on the test set with small training sets. Gradient-based methods seem to poorly estimate epistemic uncertainty and are the most affected by training set size. We expect that our results can guide future research into uncertainty quantification and help practitioners select methods based on their particular available data. (Matias Valdenegro-Toro)

- Mixtures of Laplace Approximations for Improved Post-Hoc Uncertainty in Deep Learning (Poster): Deep neural networks are prone to overconfident predictions on outliers. Bayesian neural networks and deep ensembles have both been shown to mitigate this problem to some extent. In this work, we aim to combine the benefits of the two approaches by proposing to predict with a Gaussian mixture model posterior that consists of a weighted sum of Laplace approximations of independently trained deep neural networks. The method can be used post hoc with any set of pre-trained networks and only requires a small computational and memory overhead compared to regular ensembles. We theoretically validate that our approach mitigates overconfidence “far away” from the training data and empirically compare against state-of-the-art baselines on standard uncertainty quantification benchmarks. (Runa Eschenhagen · Erik Daxberger · Philipp Hennig · Agustinus Kristiadi)

- Kronecker-Factored Optimal Curvature (Poster): Current scalable Bayesian methods for Deep Neural Networks (DNNs) often rely on the Fisher Information Matrix (FIM). For tractable computation of the FIM, the Kronecker-Factored Approximate Curvature (K-FAC) method is widely adopted: it approximates the true FIM by a layer-wise block-diagonal matrix, where each diagonal block is then Kronecker-factored.
In this paper, we propose an alternative formulation to obtain the Kronecker-factored FIM. The key insight is to cast the given FIM computations into an optimization problem over sums of Kronecker products. In particular, we prove that this formulation is equivalent to the best rank-one approximation problem, where the well-known power iteration method is guaranteed to converge to an optimal rank-one solution, resulting in our novel algorithm: the Kronecker-Factored Optimal Curvature (K-FOC). In a proof-of-concept experiment, we show that the proposed algorithm can achieve more accurate estimates of the true FIM when compared to the K-FAC method. (Dominik Schnaus · Jongseok Lee · Rudolph Triebel)

- Contrastive Generative Adversarial Network for Anomaly Detection (Poster): Anomaly detection (AD) is a fundamental challenge in machine learning: finding samples that do not belong to the distribution of the training data. Recently, self-supervised learning approaches, and in particular contrastive learning, have shown promising results in various machine vision applications, mitigating the hunger of traditional supervised deep learning approaches for enormous amounts of labeled data. In this work, we adopt the idea of contrastive learning for reconstruction-based anomaly detection models. Our contrastive learning approach contrasts a sample with local feature maps of itself, instead of contrasting a given sample with other instances as in conventional contrastive learning approaches. Our anomaly detection model based on a contrastive generative adversarial network, AD-CGAN, is shown to obtain state-of-the-art performance on multiple benchmark datasets. AD-CGAN outperforms existing reconstruction-based approaches by more than 15% ROC-AUC in several benchmark experiments.
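The rank-one Kronecker approximation underlying K-FOC, described above, can be sketched via power iteration on the Van Loan rearrangement of a matrix: the nearest Kronecker product in Frobenius norm corresponds to the best rank-one approximation of the rearranged matrix. This is a generic nearest-Kronecker-product sketch, not the authors' implementation.

```python
import numpy as np

def nearest_kron(A, m1, n1, m2, n2, iters=50):
    """Best Frobenius-norm approximation A ≈ B ⊗ C, with B of shape
    (m1, n1) and C of shape (m2, n2), via power iteration on the
    Van Loan rearrangement of A (the rank-one idea behind K-FOC)."""
    # Rearrange A so that B ⊗ C becomes the rank-one matrix vec(B) vec(C)^T
    R = A.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    v = np.ones(m2 * n2)
    for _ in range(iters):            # power iteration for the top singular pair
        u = R @ v
        u /= np.linalg.norm(u)
        v = R.T @ u
        v /= np.linalg.norm(v)
    sigma = u @ R @ v                 # top singular value (scale of the factor)
    return (sigma * u).reshape(m1, n1), v.reshape(m2, n2)

# Sanity check: an exact Kronecker product is recovered exactly
rng = np.random.default_rng(0)
B, C = rng.normal(size=(2, 3)), rng.normal(size=(4, 5))
A = np.kron(B, C)
B_hat, C_hat = nearest_kron(A, 2, 3, 4, 5)
err = np.linalg.norm(np.kron(B_hat, C_hat) - A)
```

When the input is only approximately a Kronecker product (as a FIM block would be), the same iteration returns the optimal rank-one factor pair rather than an exact recovery.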
(Laya Rafiee Sevyeri · Thomas Fevens)

- Certifiably Robust Variational Autoencoders (Poster): We derive bounds on the minimal size of an input perturbation required to change a VAE’s reconstruction by more than an allowed amount, with these bounds depending on key parameters such as the Lipschitz constants of the encoder and decoder. Our bounds allow one to specify a desired level of robustness upfront and then train a VAE that is certified to achieve this robustness. (Ben Barrett · Alexander Camuto · Matthew Willetts · Tom Rainforth)

- On Symmetries in Variational Bayesian Neural Nets (Poster): Probabilistic inference of neural network parameters is challenging due to highly multi-modal likelihood functions. Most importantly, the permutation invariance of the neurons of the hidden layers renders the likelihood function unidentifiable, with a factorial number of equivalent (symmetric) modes, independent of the data. We show that variational Bayesian methods that approximate the (multi-modal) posterior by a (uni-modal) Gaussian distribution are biased towards approximations with identical (e.g. zero-centred) weights, resulting in severe underfitting. This explains the common empirical observation that, in contrast to MCMC methods, variational approximations typically collapse most weights to the (zero-centred) prior. We propose a simple modification to the likelihood function that breaks the symmetry, using fixed semi-orthogonal matrices as skip connections in each layer. Initial empirical results show improved predictive performance. (Richard Kurle · Tim Januschowski · Jan Gasthaus · Bernie Wang)

- Greedy Bayesian Posterior Approximation with Deep Ensembles (Poster): Ensembles of independently trained neural networks are a state-of-the-art approach to estimating predictive uncertainty in deep learning, and can be interpreted as an approximation of the posterior distribution via a mixture of delta functions.
The training of ensembles relies on the non-convexity of the loss landscape and random initialization of their individual members, making the resulting posterior approximation uncontrolled. This paper proposes a novel and principled method to tackle this limitation, minimizing an $f$-divergence between the true posterior and a kernel density estimator in a function space. We analyze this objective from a combinatorial point of view, and show that it is submodular with respect to mixture components for any $f$. Subsequently, we consider the problem of ensemble construction, and from the marginal gain of the total objective, we derive a novel diversity term for training ensembles greedily. The performance of our approach is demonstrated on computer vision out-of-distribution detection benchmarks in a range of architectures trained on multiple datasets. The source code of our method is publicly available at https://github.com/MIPT-Oulu/greedy_ensembles_training. (Aleksei Tiulpin · Matthew Blaschko)

- On Feature Collapse and Deep Kernel Learning for Single Forward Pass Uncertainty (Poster): Inducing-point Gaussian process approximations are often considered a gold standard in uncertainty estimation, since they retain many of the properties of the exact GP and scale to large datasets. A major drawback is that they have difficulty scaling to high-dimensional inputs. Deep Kernel Learning (DKL) promises a solution: a deep feature extractor transforms the inputs over which an inducing-point Gaussian process is defined. However, DKL has been shown to provide unreliable uncertainty estimates in practice. We study why, and show that with no constraints, the DKL objective pushes “far-away” data points to be mapped to the same features as those of training-set points. With this insight we propose to constrain DKL’s feature extractor to approximately preserve distances through a bi-Lipschitz constraint, resulting in a feature space favorable to DKL.
We obtain a model, DUE, which demonstrates uncertainty quality outperforming previous DKL and other single-forward-pass uncertainty methods, while maintaining the speed and accuracy of standard neural networks. (Joost van Amersfoort · Lewis Smith · Andrew Jesson · Oscar Key · Yarin Gal)

- An Empirical Study of Neural Kernel Bandits (Poster): Neural bandits have enabled practitioners to operate efficiently on problems with non-linear reward functions. While contextual bandits in general commonly utilize Gaussian process (GP) predictive distributions for decision making, the most successful neural variants use only the last-layer parameters in the derivation. Research on neural kernels (NKs) has recently established a correspondence between deep networks and GPs that takes into account all the parameters of an NN and can be trained more efficiently than most Bayesian NNs. We propose to directly apply NK-induced distributions to guide an upper confidence bound or Thompson sampling based policy. We show that NK bandits achieve state-of-the-art performance on highly non-linear structured data. Furthermore, we analyze practical considerations such as training frequency and model partitioning. We believe our work will help better understand the impact of utilizing NKs in applied settings. (Michal Lisicki · Arash Afkanpour · Graham Taylor)

- Structured Stochastic Gradient MCMC: a hybrid VI and MCMC approach (Poster): Stochastic gradient Markov chain Monte Carlo (SGMCMC) is considered the gold standard for Bayesian inference in large-scale models, such as Bayesian neural networks. Since practitioners face speed versus accuracy tradeoffs in these models, variational inference (VI) is often the preferable option. Unfortunately, VI makes strong assumptions on both the factorization and the functional form of the posterior.
In this work, we propose a new non-parametric variational approximation that makes no assumptions about the approximate posterior’s functional form and allows practitioners to specify the exact dependencies the algorithm should respect or break. The approach relies on a new Langevin-type algorithm that operates on a modified energy function, where parts of the latent variables are averaged over samples from earlier iterations of the Markov chain. This way, statistical dependencies can be broken in a controlled way, allowing the chain to mix faster. This scheme can be further modified in a “dropout” manner, leading to even more scalability. By implementing the scheme on a ResNet-20 architecture, we obtain better predictive likelihoods and larger effective sample sizes than full SGMCMC. (Antonios Alexos · Alex Boyd · Stephan Mandt)

- Contrastive Representation Learning with Trainable Augmentation Channel (Poster): In contrastive representation learning, data representations are trained to classify image instances even when the images are altered by augmentations. However, depending on the dataset, some augmentations can damage the information of the images beyond recognition, and such augmentations can result in collapsed representations. We present a partial solution to this problem by formalizing a stochastic encoding process in which there exists a tug-of-war between the data corruption introduced by the augmentations and the information preserved by the encoder. We show that, with the infoMax objective based on this framework, we can learn a data-dependent distribution of augmentations to avoid the collapse of the representation. (Masanori Koyama · Kentaro Minami · Takeru Miyato · Yarin Gal)

- Power-law asymptotics of the generalization error for GP regression under power-law priors and targets (Poster): We study the power-law asymptotics of learning curves for Gaussian process regression (GPR).
When the eigenspectrum of the prior decays with rate $\alpha$ and the eigenexpansion coefficients of the target function decay with rate $\beta$, we show that the Bayesian generalization error behaves as $\tilde O(n^{\max\{\frac{1}{\alpha}-1, \frac{1-2\beta}{\alpha}\}})$ with high probability over the draw of $n$ input samples. Infinitely wide neural networks can be related to GPR with respect to the Neural Network Gaussian Process kernel, which in several cases is known to have a power-law spectrum. Hence our methods can be applied to study the generalization error of infinitely wide neural networks. We present toy experiments demonstrating the theory. (Hui Jin · Pradeep Kr. Banerjee · Guido Montufar)

- Deep Bayesian Learning for Car Hacking Detection (Poster): With the rise of self-driving and connected vehicles, cars are equipped with various devices to assist drivers or support self-driving systems. Undoubtedly, cars have become more intelligent as we deploy more and more devices and software on them. Accordingly, the security of assistance and self-driving systems becomes a life-threatening issue, as smart cars can be invaded by malicious attacks that cause traffic accidents. Currently, canonical machine learning and deep learning methods are extensively employed in car hacking detection. However, machine learning and deep learning methods can easily be overconfident and defeated by carefully designed adversarial examples. Moreover, those methods cannot provide explanations that security engineers can use for further analysis. In this work, we investigate deep Bayesian learning models to detect and analyze car hacking behaviors. Bayesian learning methods can capture the uncertainty of the data and avoid overconfidence issues. Moreover, Bayesian models can provide more information to support the prediction results, which can help security engineers further identify the attacks.
We have compared our model with deep learning models, and the results show the advantages of our proposed model. The code of this work is publicly available. (Laha Ale · Scott King · Ning Zhang)

- Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep Learning (Poster): High-quality estimates of uncertainty and robustness are crucial for numerous real-world applications, especially for deep learning, which underlies many deployed ML systems. The ability to compare techniques for improving these estimates is therefore very important for research and practice alike. Yet competitive comparisons of methods are often lacking due to a range of reasons, including: compute availability for extensive tuning, incorporation of sufficiently many baselines, and concrete documentation for reproducibility. In this paper we introduce Uncertainty Baselines: high-quality implementations of standard and state-of-the-art deep learning methods on a variety of tasks. As of this writing, the collection spans 19 methods across 9 tasks, each with at least 5 metrics. Each baseline is a self-contained experiment pipeline with easily reusable and extendable components. Our goal is to provide immediate starting points for experimentation with new methods or applications. Additionally, we provide model checkpoints, experiment outputs as Python notebooks, and leaderboards for comparing results. https://github.com/google/uncertainty-baselines
Zachary Nado · Neil Band · Mark Collier · Josip Djolonga · Mike Dusenberry · Sebastian Farquhar · Qixuan Feng · Angelos Filos · Marton Havasi · Rodolphe Jenatton · Ghassen Jerfel · Jeremiah Liu · Zelda Mariet · Jeremy Nixon · Shreyas Padhy · Jie Ren · Tim G. J. Rudner · Yeming Wen · Florian Wenzel · Kevin Murphy · D.
Sculley · Balaji Lakshminarayanan · Jasper Snoek · Yarin Gal · Dustin Tran

- Generation of data on discontinuous manifolds via continuous stochastic non-invertible networks (Poster): The generation of discontinuous distributions is a difficult task for most known frameworks, such as generative autoencoders and generative adversarial networks. Generative non-invertible models are unable to accurately generate such distributions, require long training, and are often subject to mode collapse. Variational autoencoders (VAEs), which are based on the idea of keeping the latent space Gaussian for the sake of simple sampling, allow accurate reconstruction while experiencing significant limitations at the generation level. In this work, instead of trying to keep the latent space Gaussian, we use a pretrained contrastive encoder to obtain a clustered latent space. Then, for each cluster, representing a unimodal submanifold, we train a dedicated low-complexity network to generate it from the Gaussian distribution. The proposed framework is based on an information-theoretic formulation of mutual information maximization between the input data and the latent space representation. We derive a link between the cost functions and the information-theoretic formulation. We apply our approach to synthetic 2D distributions to demonstrate both reconstruction and generation of discontinuous distributions using continuous stochastic networks. (Mariia Drozdova · Vitaliy Kinakh · Guillaume Quétant · Tobias Golling · Slava Voloshynovskiy)

- Uncertainty Quantification in End-to-End Implicit Neural Representations for Medical Imaging (Poster): Implicit neural representations (INRs) have recently achieved impressive results in image representation. This work explores the uncertainty quantification quality of INRs for medical imaging. We propose the first uncertainty-aware, end-to-end INR architecture for computed tomography (CT) image reconstruction.
Four established neural network uncertainty quantification techniques (deep ensembles, Monte Carlo dropout, Bayes-by-backpropagation, and Hamiltonian Monte Carlo) are implemented and assessed according to both image reconstruction quality and model calibration. We find that these INRs outperform traditional medical image reconstruction algorithms according to predictive accuracy; deep ensembles of Monte Carlo dropout base-learners achieve the best image reconstruction and model calibration among the techniques tested; activation function and random Fourier feature embedding frequency have large effects on model performance; and Bayes-by-backpropagation is ill-suited for sampling from the INR posterior distributions. Preliminary results further indicate that, with adequate tuning, Hamiltonian Monte Carlo may outperform Monte Carlo dropout deep ensembles. (Francisca Vasconcelos · Bobby He · Yee Teh)

- Evaluating Predictive Uncertainty and Robustness to Distributional Shift Using Real World Data (Poster): Most machine learning models operate under the assumption that the training, testing, and deployment data are independent and identically distributed (i.i.d.). This assumption doesn’t generally hold true in a natural setting: usually, the deployment data are subject to various types of distributional shifts, and a model’s performance degrades in proportion to the shift in the distribution of the dataset. Thus it becomes necessary to evaluate a model’s uncertainty and robustness to distributional shift to get a realistic estimate of its expected performance on real-world data. Present methods for evaluating uncertainty and model robustness are lacking and often fail to paint the full picture. Moreover, most analysis so far has primarily focused on classification tasks. In this paper, we propose more insightful metrics for general regression tasks using the Shifts Weather Prediction Dataset.
We also present an evaluation of the baseline methods using these metrics. Link » Kumud Lakara · Akshat Bhandari · Pratinav Seth · Ujjwal Verma 🔗 - Generalization Gap in Amortized Inference (Poster) []   link » The ability of likelihood-based probabilistic models to generalize to unseen data is central to many machine learning applications. The Variational Auto-Encoder (VAE) is a popular class of latent variable models used for many such applications, including density estimation, representation learning and lossless compression. In this work, we highlight how the common use of amortized inference to scale the training of VAE models to large data sets can be a major cause of poor generalization performance. We propose a new training phase for the inference network that helps reduce over-fitting to training data. We demonstrate how the proposed scheme can improve generalization performance in the context of image modeling. Link » Mingtian Zhang · Peter Hayes · David Barber 🔗 - Information-theoretic stochastic contrastive conditional GAN: InfoSCC-GAN (Poster) []   link » Conditional generation is a subclass of generative problems where the output of the generation is conditioned on attribute information. In this paper, we present a stochastic contrastive conditional generative adversarial network (InfoSCC-GAN) with an explorable latent space. The InfoSCC-GAN architecture is based on an unsupervised contrastive encoder built on the InfoNCE paradigm, an attribute classifier, and an EigenGAN generator. We propose a novel training method, based on generator regularization using external or internal attributes every $n$-th iteration, using a pre-trained contrastive encoder and a pre-trained classifier. The proposed InfoSCC-GAN is derived from an information-theoretic formulation of mutual information maximization between the input data and latent space representation as well as latent space and generated data.
Thus, we demonstrate a link between the training objective functions and the above information-theoretic formulation. The experimental results show that InfoSCC-GAN outperforms the "vanilla" EigenGAN at image generation on several datasets. In addition, we investigate the impact of regularization techniques, discriminator architectures, and loss functions by performing ablation studies. Finally, we demonstrate that, thanks to the EigenGAN generator, the proposed framework enjoys stochastic generation in contrast to vanilla deterministic GANs, yet with independent training of the encoder, classifier, and generator in contrast to existing frameworks. Code, experimental results, and demos are available at \url{https://anonymous.4open.science/r/InfoSCC-GAN-D113}. Link » Vitaliy Kinakh · Mariia Drozdova · Guillaume Quétant · Tobias Golling · Slava Voloshynovskiy 🔗 - Deep Classifiers with Label Noise Modeling and Distance Awareness (Poster) []   link » Uncertainty estimation in deep learning has recently emerged as a crucial area of interest to advance reliability and robustness in safety-critical applications. While there have been many proposed methods that either focus on distance-aware model uncertainties for out-of-distribution detection or on input-dependent label uncertainties for in-distribution calibration, both of these types of uncertainty are often necessary. In this work, we propose the HetSNGP method for jointly modeling the model and data uncertainty. We show that our proposed model affords a favorable combination of these two complementary types of uncertainty and thus outperforms the baseline methods on some challenging out-of-distribution datasets, including CIFAR-100C, Imagenet-C, and Imagenet-A. Moreover, we propose HetSNGP Ensemble, an ensembled version of our method which adds an additional type of uncertainty and also outperforms other ensemble baselines.
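The two complementary kinds of uncertainty that the HetSNGP abstract distinguishes (model vs. data uncertainty) can be illustrated with a generic ensemble decomposition. This is a minimal sketch of the standard entropy decomposition used with deep ensembles, not the paper's method; `ensemble_uncertainty` and its epsilon constant are illustrative names.

```python
import numpy as np

def ensemble_uncertainty(member_probs):
    """Decompose predictive uncertainty of an ensemble for one input.

    member_probs: array of shape (M, C) holding softmax outputs of M
    ensemble members over C classes.
    Returns (total, aleatoric, epistemic) entropies in nats, where
    total = predictive entropy of the averaged distribution,
    aleatoric = mean entropy of the individual members (data uncertainty),
    epistemic = their difference (the mutual information, model uncertainty).
    """
    member_probs = np.asarray(member_probs, dtype=float)
    mean = member_probs.mean(axis=0)                       # ensemble predictive
    total = -np.sum(mean * np.log(mean + 1e-12))           # predictive entropy
    per_member = -np.sum(member_probs * np.log(member_probs + 1e-12), axis=1)
    aleatoric = per_member.mean()                          # expected entropy
    epistemic = total - aleatoric                          # mutual information
    return total, aleatoric, epistemic
```

When all members agree, the epistemic term vanishes even if each member is individually uncertain; when members disagree, the epistemic term grows, which is the signal typically used for out-of-distribution detection.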
Link » Vincent Fortuin · Mark Collier · Florian Wenzel · James Allingham · Jeremiah Liu · Dustin Tran · Balaji Lakshminarayanan · Jesse Berent · Rodolphe Jenatton · Effrosyni Kokiopoulou 🔗 - Benchmarking Bayesian Deep Learning on Diabetic Retinopathy Detection Tasks (Poster) []   link » Bayesian deep learning seeks to equip deep neural networks with the ability to precisely quantify their predictive uncertainty, and has promised to make deep learning more reliable for safety-critical real-world applications. Yet, existing Bayesian deep learning methods fall short of this promise; new methods continue to be evaluated on unrealistic test beds that do not reflect the complexities of the downstream real-world tasks that would benefit most from reliable uncertainty quantification. We propose a set of real-world tasks that accurately reflect such complexities and assess the reliability of predictive models in safety-critical scenarios. Specifically, we curate two publicly available datasets of high-resolution human retina images exhibiting varying degrees of diabetic retinopathy, a medical condition that can lead to blindness, and use them to design a suite of automated diagnosis tasks that require reliable predictive uncertainty quantification. We use these tasks to benchmark well-established and state-of-the-art Bayesian deep learning methods on task-specific evaluation metrics. We provide an easy-to-use codebase for fast and easy benchmarking following reproducibility and software design principles. Link » Neil Band · Tim G. J. 
Rudner · Qixuan Feng · Angelos Filos · Zachary Nado · Mike Dusenberry · Ghassen Jerfel · Dustin Tran · Yarin Gal 🔗 - Stochastic Local Winner-Takes-All Networks Enable Profound Adversarial Robustness (Poster) []   link » This work explores the potency of stochastic competition-based activations, namely Stochastic Local Winner-Takes-All (LWTA), against powerful (gradient-based) white-box and black-box adversarial attacks; we especially focus on Adversarial Training settings. In our work, we replace the conventional ReLU-based nonlinearities with blocks comprising locally and stochastically competing linear units. Each network layer now yields a sparse output, depending on the outcome of winner sampling in each block. We rely on the Variational Bayesian framework for training and inference; we incorporate conventional PGD-based adversarial training arguments to increase the overall adversarial robustness. As we experimentally show, the arising networks yield state-of-the-art robustness against powerful adversarial attacks while retaining a very high classification rate in the benign case. Link » Konstantinos Panousis · Sotirios Chatzis · Sergios Theodoridis 🔗 - Being a Bit Frequentist Improves Bayesian Neural Networks (Poster) []   link » Despite their compelling theoretical properties, Bayesian neural networks (BNNs) tend to perform worse than frequentist methods in classification-based uncertainty quantification (UQ) tasks such as out-of-distribution (OOD) detection. In this paper, based on empirical findings in prior works, we hypothesize that this issue is because even recent Bayesian methods have never considered OOD data in their training processes, even though this "OOD training" technique is an integral part of state-of-the-art frequentist UQ methods. To validate this, we treat OOD data as a first-class citizen in BNN training by exploring several ways of incorporating OOD data in Bayesian inference.
We show in experiments that OOD-trained BNNs are competitive with, if not better than, recent frequentist baselines. This work thus provides strong baselines for future work in Bayesian deep learning. Link » Agustinus Kristiadi · Matthias Hein · Philipp Hennig 🔗 - Reproducible, incremental representation learning with Rosetta VAE (Poster) []   link » Variational autoencoders are among the most popular methods for distilling low-dimensional structure from high-dimensional data, making them increasingly valuable as tools for data exploration and scientific discovery. However, unlike typical machine learning problems, in which a single model is trained once on a single large dataset, scientific workflows privilege learned features that are reproducible, portable across labs, and capable of incrementally adding new data. Ideally, methods used by different research groups should produce comparable results, even without sharing fully-trained models or entire data sets. Here, we address this challenge by introducing the Rosetta VAE (R-VAE), a method of distilling previously learned representations and retraining new models to reproduce and build on prior results. The R-VAE uses post hoc clustering over the latent space of a fully-trained model to identify a small number of Rosetta Points, i.e. (input, latent) pairs, to serve as anchors for training future models. An adjustable hyperparameter balances fidelity to the previously learned latent space against accommodation of new data. We demonstrate that the R-VAE reconstructs data as well as the VAE and β-VAE, outperforms both methods in recovery of a target latent space in a sequential training setting, and dramatically increases consistency of the learned representation across training runs. Similar to other VAE methods, R-VAE makes few assumptions about the data and underlying distributions, uses the same number of hyperparameters as β-VAE, and provides a simple and intuitive solution to stable and consistent retraining.
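For reference, the β-VAE objective that the R-VAE abstract compares against simply reweights the KL term of the standard VAE loss. A minimal sketch, assuming a diagonal-Gaussian posterior and a standard-normal prior (so the KL has a closed form); `beta_vae_loss` is an illustrative name, not code from the paper:

```python
import numpy as np

def beta_vae_loss(recon_nll, mu, logvar, beta=1.0):
    """beta-VAE objective for one example.

    recon_nll: reconstruction negative log-likelihood (a scalar).
    mu, logvar: mean and log-variance of the diagonal-Gaussian
    posterior q(z|x); the prior is N(0, I).
    Closed-form KL(q || p) = 0.5 * sum(exp(logvar) + mu^2 - 1 - logvar).
    beta = 1 recovers the ordinary VAE ELBO (negated).
    """
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return recon_nll + beta * kl
```

With mu = 0 and logvar = 0 the posterior equals the prior, the KL term vanishes, and the loss reduces to the reconstruction term alone; increasing beta trades reconstruction fidelity for a more prior-like (often more disentangled) latent space.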
Link » Miles Martinez · John Pearson 🔗 - An Empirical Comparison of GANs and Normalizing Flows for Density Estimation (Poster) []   link » Generative adversarial networks (GANs) and normalizing flows are both approaches to density estimation that use deep neural networks to transform samples from an uninformative prior distribution to an approximation of the data distribution. There is great interest in both for general-purpose statistical modeling, but the two approaches have seldom been compared to each other for modeling non-image data. The difficulty of computing likelihoods with GANs, which are implicit models, makes conducting such a comparison challenging. We work around this difficulty by considering several low-dimensional synthetic datasets. An extensive grid search over GAN architectures, hyperparameters, and training procedures suggests that no GAN is capable of modeling our simple low-dimensional data well, a task we view as a prerequisite for an approach to be considered suitable for general-purpose statistical modeling. Several normalizing flows, on the other hand, excelled at these tasks, even substantially outperforming WGAN in terms of Wasserstein distance---the metric that WGAN alone targets. Scientists and other practitioners should be wary of relying on WGAN for applications that require accurate density estimation. Link » Tianci Liu · Jeffrey Regier 🔗 - Resilience of Bayesian Layer-Wise Explanations under Adversarial Attacks (Poster) []   link » We consider the problem of the stability of saliency-based explanations of Neural Network predictions under adversarial attacks in a classification task. Saliency interpretations of deterministic Neural Networks are remarkably brittle even when the attacks fail, i.e. for attacks that do not change the classification label.
We empirically show that interpretations provided by Bayesian Neural Networks are considerably more stable under adversarial perturbations of the inputs and even under direct attacks to the explanations. By leveraging recent results, we also provide a theoretical explanation of this result in terms of the geometry of the data manifold. Additionally, we discuss the stability of the interpretations of high-level representations of the inputs in the internal layers of a network. Our results demonstrate that Bayesian methods, in addition to being more robust to adversarial attacks, have the potential to provide more stable and interpretable assessments of Neural Network predictions. Link » Ginevra Carbone · Luca Bortolussi · Guido Sanguinetti 🔗 - Non-stationary Gaussian process discriminant analysis with variable selection for high-dimensional functional data (Poster) []   link » High-dimensional classification and feature selection tasks are ubiquitous with the recent advancement in data acquisition technology. In several application areas such as biology, genomics and proteomics, the data are often functional in their nature and exhibit a degree of roughness and non-stationarity. These structures pose additional challenges to commonly used methods that rely mainly on a two-stage approach performing variable selection and classification separately. We propose in this work a novel Gaussian process discriminant analysis (GPDA) that combines these steps in a unified framework. Our model is a two-layer non-stationary Gaussian process coupled with an Ising prior to identify differentially-distributed locations. Scalable inference is achieved by developing a variational scheme that exploits advances in the use of sparse inverse covariance matrices. We demonstrate the performance of our methodology on simulated datasets and two proteomics datasets: breast cancer and SARS-CoV-2.
Our approach distinguishes itself by offering explainability as well as uncertainty quantification in addition to low computational cost, which are crucial for increasing trust and social acceptance of data-driven tools. Link » Weichang Yu · Sara Wade · Howard Bondell · Lamiae Azizi 🔗 - Pathologies in Priors and Inference for Bayesian Transformers (Poster) []   link » In recent years, the transformer has established itself as a workhorse in many applications ranging from natural language processing to reinforcement learning. Similarly, Bayesian deep learning has become the gold-standard for uncertainty estimation in safety-critical applications, where robustness and calibration are crucial. Surprisingly, no successful attempts to improve transformer models in terms of predictive uncertainty using Bayesian inference exist. In this work, we study this curiously underpopulated area of Bayesian transformers. We find that weight-space inference in transformers does not work well, regardless of the approximate posterior. We also find that the prior is at least partially at fault, but that it is very hard to find well-specified weight priors for these models. We hypothesize that these problems stem from the complexity of obtaining a meaningful mapping from weight-space to function-space distributions in the transformer. Therefore, moving closer to function-space, we propose a novel method based on the implicit reparameterization of the Dirichlet distribution to apply variational inference directly to the attention weights. We find that this proposed method performs competitively with our baselines. Link » Tristan Cinquin · Alex Immer · Max Horn · Vincent Fortuin 🔗 - Analytically Tractable Inference in Neural Networks - An Alternative to Backpropagation (Poster) []   link » Until now, neural networks have predominantly relied on backpropagation and gradient descent as the inference engine for learning their parameters.
This is primarily because closed-form Bayesian inference for neural networks has been considered to be intractable. This short paper will outline a new analytical method for performing tractable approximate Gaussian inference (TAGI) in Bayesian neural networks. The method enables the analytical inference of the posterior mean vector and diagonal covariance matrix for weights and biases. One key aspect is that the method matches or exceeds the state-of-the-art performance while having the same computational complexity as current methods relying on gradient backpropagation, i.e., linear complexity with respect to the number of parameters in the network. Performing Bayesian inference in neural networks enables several key features, such as the quantification of epistemic uncertainty associated with model parameters, the online estimation of parameters, and a reduction in the number of hyperparameters due to the absence of gradient-based optimization. Moreover, the analytical framework proposed also enables unprecedented features such as the propagation of uncertainty from the input of a network up to its output, and it allows inferring the value of hidden states, inputs, as well as latent variables. The first part covers the theoretical foundation and working principles of the analytically tractable uncertainty propagation in neural networks, as well as the parameter and hidden state inference. Then, the second part will go through benchmarks demonstrating the superiority of the approach on supervised, unsupervised, and reinforcement learning tasks. In addition, we will showcase how TAGI can be applied to reinforcement learning problems such as the Atari game environment. Finally, the last part will present how we can leverage the analytic inference capabilities of our approach to enable novel applications of neural networks such as closed-form direct adversarial attacks, and the use of a neural network as a generic black-box optimization method.
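The forward uncertainty propagation that analytic methods like TAGI rely on can be illustrated for a single linear unit. This is a generic Gaussian moment-matching sketch under an independence assumption between weights and inputs, not the authors' implementation; `linear_moments` is an illustrative name.

```python
import numpy as np

def linear_moments(mu_w, var_w, mu_x, var_x, mu_b, var_b):
    """Exact mean and variance of y = sum_i w_i * x_i + b, assuming all
    w_i, x_i, b are mutually independent Gaussians.

    Uses Var[w*x] = var_w*var_x + var_w*mu_x^2 + var_x*mu_w^2 for each
    product of independent Gaussians, then sums over the inputs.
    """
    mu_y = np.dot(mu_w, mu_x) + mu_b
    var_y = np.sum(var_w * var_x + var_w * mu_x**2 + var_x * mu_w**2) + var_b
    return mu_y, var_y
```

Applied layer by layer (with a Gaussian approximation after each nonlinearity), this kind of moment propagation carries uncertainty from a network's input to its output without any sampling, which is what makes the inference cost comparable to a single deterministic forward pass.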
Link » Luong-Ha Nguyen · James-A. Goulet 🔗 - Infinite-channel deep convolutional Stable neural networks (Poster) []   link » The connection between infinite-width neural networks (NNs) and Gaussian processes (GPs) has been well known since the seminal work of Neal (1996). While numerous theoretical refinements have been proposed in recent years, the connection between NNs and GPs relies on two critical distributional assumptions on the NN's parameters: i) finite variance and ii) independent and identical distribution (iid). In this paper, we consider the problem of removing assumption i) in the context of deep feed-forward convolutional NNs. We show that the infinite-channel limit of a deep feed-forward convolutional NN, under suitable scaling, is a stochastic process with multivariate stable finite-dimensional distributions, and we give an explicit recursion over the layers for their parameters. Our contribution extends recent results of Favaro et al (2021) to convolutional architectures, and it paves the way to exciting lines of research that rely on GP limits. Link » Daniele Bracale · Stefano Favaro · Sandra Fortini · Stefano Peluchetti 🔗 - Unveiling Mode-connectivity of the ELBO Landscape (Poster) []   link » We demonstrate and discuss mode-connectivity of the ELBO, the objective function of variational inference (VI). Local optima of the ELBO are found to be connected by essentially flat maximum energy paths (MEPs), suggesting that optima of the ELBO are not discrete modes but lie on a connected subset in parameter space. We focus on Latent Dirichlet Allocation, a model commonly fit with VI. Our findings parallel recent results showing mode-connectivity of neural net loss functions, a property that has helped explain and improve the performance of neural nets. We find MEPs between maxima of the ELBO using the simplified string method (SSM), a gradient-based algorithm that updates images along a path on the ELBO.
The mode-connectivity property is explained with a heuristic argument about statistical degeneracy, related to over-parametrization in neural networks. This study corroborates and extends the empirical experience that topic modeling has many optima, providing a loss-landscape-based explanation for the "no best answer" phenomenon experienced by practitioners of LDA. Link » Edith Zhang · David Blei 🔗
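The exact likelihoods that make the GAN-versus-flow comparison above possible come from the change-of-variables formula, log p_x(x) = log p_z(f(x)) + log |det Jf(x)|. A minimal 1-D sketch, assuming a single affine flow and a standard-normal base distribution (`flow_log_prob` is an illustrative name, not code from any of the papers):

```python
import numpy as np

def flow_log_prob(x, scale, shift):
    """Exact log-density of x under the affine flow x = scale * z + shift,
    where z is standard normal.

    The inverse map is z = (x - shift) / scale, and the log-|Jacobian| of
    that inverse is -log|scale|, giving
        log p_x(x) = log N(z; 0, 1) - log|scale|.
    """
    z = (x - shift) / scale
    log_pz = -0.5 * (z**2 + np.log(2.0 * np.pi))   # standard-normal log-density
    return log_pz - np.log(abs(scale))
```

Deeper flows chain many such invertible maps and sum their log-Jacobian terms; GANs lack this invertible structure, which is exactly why their likelihoods are intractable and why the comparison above had to fall back on Wasserstein distance.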

#### Author Information

##### Yingzhen Li (Imperial College London)

Yingzhen Li is a senior researcher at Microsoft Research Cambridge. She received her PhD from the University of Cambridge, and previously she interned at Disney Research. She is passionate about building reliable machine learning systems, and her approach combines both Bayesian statistics and deep learning. Her contributions to the approximate inference field include: (1) algorithmic advances, such as variational inference with different divergences, combining variational inference with MCMC, and approximate inference with implicit distributions; (2) applications of approximate inference, such as uncertainty estimation in Bayesian neural networks and algorithms to train deep generative models. She has served as an area chair at NeurIPS/ICML/ICLR/AISTATS on related research topics, and she is a co-organizer of the AABI 2020 symposium, a flagship event for approximate inference.

##### Zoubin Ghahramani (Uber and University of Cambridge)

Zoubin Ghahramani is Professor of Information Engineering at the University of Cambridge, where he leads the Machine Learning Group. He studied computer science and cognitive science at the University of Pennsylvania, obtained his PhD from MIT in 1995, and was a postdoctoral fellow at the University of Toronto. His academic career includes concurrent appointments as one of the founding members of the Gatsby Computational Neuroscience Unit in London, and as a faculty member of CMU's Machine Learning Department for over 10 years. His current research interests include statistical machine learning, Bayesian nonparametrics, scalable inference, probabilistic programming, and building an automatic statistician. He has held a number of leadership roles as programme and general chair of the leading international conferences in machine learning including: AISTATS (2005), ICML (2007, 2011), and NIPS (2013, 2014). In 2015 he was elected a Fellow of the Royal Society.