Session

Probabilistic Methods, Applications


Wed 6 Dec. 16:20 - 16:35 PST

Oral
Reliable Decision Support using Counterfactual Models

Peter Schulam · Suchi Saria

Answering "What if?" questions is important in many domains. For example, would a patient's disease progression slow down if I were to give them a dose of drug A? Ideally, we answer our question using an experiment, but this is not always possible (e.g., it may be unethical). As an alternative, we can use non-experimental data to learn models that make counterfactual predictions of what we would observe had we run an experiment. In this paper, we propose the counterfactual GP, a counterfactual model of continuous-time trajectories (time series) under sequences of actions taken in continuous-time. We develop our model within the potential outcomes framework of Neyman and Rubin. The counterfactual GP is trained using a joint maximum likelihood objective that adjusts for dependencies between observed actions and outcomes in the training data. We report two sets of experimental results using the counterfactual GP. The first shows that it can be used to learn the natural progression (i.e. untreated progression) of biomarker trajectories from observational data. In the second, we show how the CGP can be used for medical decision support by learning counterfactual models of renal health under different types of dialysis.

Wed 6 Dec. 16:35 - 16:50 PST

Oral
Convolutional Gaussian Processes

Mark van der Wilk · Carl Edward Rasmussen · James Hensman

We introduce a practical way of incorporating convolutional structure into Gaussian processes, which makes them better suited to high-dimensional inputs like images than existing kernels. The main contribution of our work is the construction of an inter-domain inducing point approximation that is well-tailored to the convolutional kernel. This allows us to gain the generalisation benefit of a convolutional kernel, together with fast but accurate posterior inference. We investigate several variations of the convolutional kernel, and apply it to MNIST and CIFAR-10, which are known to be challenging for Gaussian processes. We also show how the marginal likelihood can be used to find an optimal weighting between convolutional and RBF kernels to further improve performance. We hope this illustration of the usefulness of the marginal likelihood will help to automate discovering architectures in larger models.
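The convolutional kernel in question evaluates a patch-level base kernel summed over all pairs of image patches. A minimal NumPy sketch of that construction (the patch size, RBF base kernel, and lengthscale are illustrative choices, not the paper's settings):

    import numpy as np

    def patches(img, size=3):
        # All overlapping size x size patches of a 2-D image, flattened.
        h, w = img.shape
        return np.array([img[i:i + size, j:j + size].ravel()
                         for i in range(h - size + 1)
                         for j in range(w - size + 1)])

    def rbf(A, B, lengthscale=1.0):
        # Base kernel k_g between two sets of patch vectors.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale ** 2)

    def conv_kernel(x, y, size=3):
        # k(x, y) = sum over all patch pairs of k_g(x_patch, y_patch).
        return rbf(patches(x, size), patches(y, size)).sum()

The inter-domain inducing points the paper introduces live in patch space rather than image space, which is what keeps posterior inference tractable; the sketch above covers only the kernel itself.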

Wed 6 Dec. 16:50 - 17:05 PST

Oral
Counterfactual Fairness

Matt Kusner · Joshua Loftus · Chris Russell · Ricardo Silva

Machine learning can impact people with legal or ethical consequences when it is used to automate decisions in areas such as insurance, lending, hiring, and predictive policing. In many of these scenarios, previous decisions have been made that are unfairly biased against certain subpopulations, for example those of a particular race, gender, or sexual orientation. Since this past data may be biased, machine learning predictors must account for this to avoid perpetuating or creating discriminatory practices. In this paper, we develop a framework for modeling fairness using tools from causal inference. Our definition of counterfactual fairness captures the intuition that a decision is fair towards an individual if it is the same in (a) the actual world and (b) a counterfactual world where the individual belonged to a different demographic group. We demonstrate our framework on a real-world problem of fair prediction of success in law school.
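The definition lends itself to a direct computational check via the standard three-step counterfactual procedure: abduct the latent background variables from the observed evidence, regenerate the features under an intervention on the protected attribute, and compare predictions. A toy sketch under an assumed linear structural model (all coefficients and names are hypothetical):

    # Assumed toy model: X = alpha * A + U, with U latent and A protected.
    alpha = 2.0

    def abduct_u(x, a):
        # Abduction: recover the latent U consistent with the evidence.
        return x - alpha * a

    def counterfactual_x(x, a, a_cf):
        # Action + prediction: regenerate X under A <- a_cf with U held fixed.
        return alpha * a_cf + abduct_u(x, a)

    def counterfactually_fair(predict, x, a, a_cf, tol=1e-9):
        return abs(predict(x, a) - predict(counterfactual_x(x, a, a_cf), a_cf)) < tol

    predict_fair = lambda x, a: 1.5 * abduct_u(x, a)  # uses only U: invariant
    predict_unfair = lambda x, a: 1.5 * x             # X descends from A
    print(counterfactually_fair(predict_fair, 1.0, 0.0, 1.0))    # True
    print(counterfactually_fair(predict_unfair, 1.0, 0.0, 1.0))  # False

Predictors built only from non-descendants of the protected attribute (here, the latent U) satisfy the definition by construction, which is the modeling strategy the paper develops.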

Wed 6 Dec. 17:05 - 17:10 PST

Spotlight
An Empirical Bayes Approach to Optimizing Machine Learning Algorithms

James McInerney

There is rapidly growing interest in using Bayesian optimization to tune model and inference hyperparameters for machine learning algorithms that take a long time to run. For example, Spearmint is a popular software package for selecting the optimal number of layers and learning rate in neural networks. But given that there is uncertainty about which hyperparameters give the best predictive performance, and given that fitting a model for each choice of hyperparameters is costly, it is arguably wasteful to "throw away" all but the best result, as per Bayesian optimization. A related issue is the danger of overfitting the validation data when optimizing many hyperparameters. In this paper, we consider an alternative approach that uses more samples from the hyperparameter selection procedure to average over the uncertainty in model hyperparameters. The resulting approach, empirical Bayes for hyperparameter averaging (EB-Hyp), predicts held-out data better than Bayesian optimization in two experiments on latent Dirichlet allocation and deep latent Gaussian models. EB-Hyp suggests a simpler approach to evaluating and deploying machine learning algorithms that does not require a separate validation data set and hyperparameter selection procedure.
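The core move is to average predictive distributions over the hyperparameter samples a search procedure already produced, weighted by (approximate) marginal likelihood, rather than keeping only the argmax. A simplified sketch, which omits the hyperprior fitting that gives EB-Hyp its empirical-Bayes character (function and variable names are illustrative):

    import numpy as np

    def average_over_hyperparams(runs, x_new):
        # `runs`: list of (log_marginal_likelihood, predict_fn) pairs from an
        # ordinary hyperparameter search.
        log_w = np.array([lml for lml, _ in runs])
        w = np.exp(log_w - log_w.max())   # subtract max for numerical stability
        w /= w.sum()
        return sum(wi * predict(x_new) for wi, (_, predict) in zip(w, runs))

    # Bayesian optimization, by contrast, keeps only the top-scoring run:
    #   max(runs, key=lambda r: r[0])[1](x_new)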

Wed 6 Dec. 17:10 - 17:15 PST

Spotlight
PASS-GLM: polynomial approximate sufficient statistics for scalable Bayesian GLM inference

Jonathan Huggins · Ryan Adams · Tamara Broderick

Generalized linear models (GLMs)---such as logistic regression, Poisson regression, and robust regression---provide interpretable models for diverse data types. Probabilistic approaches, particularly Bayesian ones, allow coherent estimates of uncertainty, incorporation of prior information, and sharing of power across experiments via hierarchical models. In practice, however, the approximate Bayesian methods necessary for inference have either failed to scale to large data sets or failed to provide theoretical guarantees on the quality of inference. We propose a new approach based on constructing polynomial approximate sufficient statistics for GLMs (PASS-GLM). We demonstrate that our method admits a simple algorithm as well as trivial streaming and distributed extensions that do not compound error across computations. We provide theoretical guarantees on the quality of point (MAP) estimates, the approximate posterior, and posterior mean and uncertainty estimates. We validate our approach empirically in the case of logistic regression using a quadratic approximation and show competitive performance in terms of both speed and accuracy---including on an advertising data set with 40 million data points and 20,000 covariates.
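For logistic regression with labels y in {-1, +1}, the idea can be made concrete: fit a degree-2 polynomial to the log-likelihood nonlinearity, after which the data enter only through the fixed-size sums sum_n y_n x_n and sum_n x_n x_n^T. A rough sketch (the paper uses a Chebyshev approximation; plain least squares and the interval [-4, 4] are simplifications):

    import numpy as np

    # Quadratic fit to phi(t) = -log(1 + exp(-t)); phi is concave, so b2 < 0
    # and the surrogate objective below is concave too.
    t = np.linspace(-4.0, 4.0, 200)
    b2, b1, b0 = np.polyfit(t, -np.log1p(np.exp(-t)), deg=2)

    def pass_glm_map(batches, d, lam=1.0):
        # One streaming pass accumulating the approximate sufficient statistics
        # (y^2 = 1, so the quadratic term needs only x x^T), then a single
        # solve for the MAP estimate under a Gaussian prior with precision lam.
        s1, S2 = np.zeros(d), np.zeros((d, d))
        for X, y in batches:          # X: (n, d) features, y: (n,) labels
            s1 += X.T @ y             # sum_n y_n x_n
            S2 += X.T @ X             # sum_n x_n x_n^T
        # Stationary point of b1*s1't + b2*t'S2 t - (lam/2)|t|^2 in theta:
        return np.linalg.solve(lam * np.eye(d) - 2.0 * b2 * S2, b1 * s1)

Because the statistics are plain sums, the streaming and distributed extensions mentioned above amount to adding per-shard accumulators, with no error compounding across merges.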

Wed 6 Dec. 17:15 - 17:20 PST

Spotlight
Multiresolution Kernel Approximation for Gaussian Process Regression

Yi Ding · Risi Kondor · Jonathan Eskreis-Winkler

Gaussian process regression generally does not scale beyond a few thousand data points without applying some sort of kernel approximation method. Most approximations focus on the high-eigenvalue part of the spectrum of the kernel matrix, $K$, which leads to bad performance when the length scale of the kernel is small. In this paper we introduce Multiresolution Kernel Approximation (MKA), the first true broad-bandwidth kernel approximation algorithm. MKA is memory efficient, and it is a direct method, which makes it easy to also approximate $K^{-1}$ and $\det(K)$.

Wed 6 Dec. 17:20 - 17:25 PST

Spotlight
Multi-Information Source Optimization

Matthias Poloczek · Jialei Wang · Peter Frazier

We consider Bayesian methods for multi-information source optimization (MISO), in which we seek to optimize an expensive-to-evaluate black-box objective function while also accessing cheaper but biased and noisy approximations ("information sources"). We present a novel algorithm that outperforms the state of the art for this problem by using a joint statistical model of the information sources better suited to MISO than those used by previous approaches, and a novel acquisition function based on a one-step optimality analysis supported by efficient parallelization. We provide a guarantee on the asymptotic quality of the solution provided by this algorithm. Experimental evaluations demonstrate that this algorithm consistently finds designs of higher value at less cost than previous approaches.
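The joint model in question treats each information source as the true objective plus a source-specific bias, each with its own GP prior; this induces a simple covariance across (source, point) pairs. A minimal sketch of such a kernel (RBF components and lengthscales are illustrative, not the paper's exact choices):

    import numpy as np

    def rbf(x, xp, ls):
        return np.exp(-0.5 * np.sum((np.asarray(x) - np.asarray(xp)) ** 2) / ls ** 2)

    def miso_cov(l, x, lp, xp, ls_obj=1.0, ls_bias=0.5):
        # Model: g_l(x) = f(x) + delta_l(x), with independent GP priors on the
        # objective f and each bias delta_l; source 0 is the objective itself
        # (delta_0 = 0), so bias terms only correlate evaluations of the same
        # source.
        k = rbf(x, xp, ls_obj)                 # shared term from f
        if l == lp and l != 0:
            k += rbf(x, xp, ls_bias)           # source-specific bias term
        return k

Observations from cheap sources then inform the posterior over the expensive objective through the shared term, which is what the acquisition function exploits.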

Wed 6 Dec. 17:25 - 17:30 PST

Spotlight
Doubly Stochastic Variational Inference for Deep Gaussian Processes

Hugh Salimbeni · Marc Deisenroth

Deep Gaussian processes (DGPs) are multi-layer generalizations of GPs, but inference in these models has proved challenging. Existing approaches to inference in DGP models assume approximate posteriors that force independence between the layers, and do not work well in practice. We present a doubly stochastic variational inference algorithm, which does not force independence between layers. With our method of inference we demonstrate that a DGP model can be used effectively on data ranging in size from hundreds to a billion points. We provide strong empirical evidence that our inference scheme for DGPs works well in practice in both classification and regression.
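Operationally, the scheme propagates samples rather than independent marginals: each layer's output is drawn from its variational posterior and fed to the next layer, and minibatch subsampling supplies the second source of stochasticity. A schematic sketch in which the per-layer posteriors are placeholder callables standing in for sparse GP layers:

    import numpy as np

    def dgp_forward_sample(x, layers, rng):
        # `layers`: list of (mean_fn, var_fn) pairs giving each layer's
        # posterior mean and variance at its inputs (placeholders here).
        h = x
        for mean_fn, var_fn in layers:
            m, v = mean_fn(h), var_fn(h)
            eps = rng.standard_normal(m.shape)
            h = m + np.sqrt(v) * eps   # reparameterized sample through the layer
        return h

Because the sample fed into layer l depends on the draw at layer l-1, correlations between layers are retained in the evidence lower bound, which is exactly what independence-forcing posteriors give up.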

Wed 6 Dec. 17:30 - 17:35 PST

Spotlight
Permutation-based Causal Inference Algorithms with Interventions

Yuhao Wang · Liam Solus · Karren Yang · Caroline Uhler

Learning Bayesian networks using both observational and interventional data is now a fundamentally important problem due to recent technological developments in genomics that generate single-cell gene expression data at a very large scale. In order to utilize these data for learning gene regulatory networks, efficient and reliable causal inference algorithms are needed that can make use of both observational and interventional data. In this paper, we present two such algorithms and prove that both are consistent under the faithfulness assumption. These algorithms are interventional adaptations of the Greedy SP algorithm and are the first algorithms using both observational and interventional data with consistency guarantees. Moreover, they are non-parametric, which makes them useful for analyzing inherently non-Gaussian gene expression data. We analyze the performance of both algorithms on simulated data, protein signaling data, and single-cell gene expression data.

Wed 6 Dec. 17:35 - 17:40 PST

Spotlight
Gradients of Generative Models for Improved Discriminative Analysis of Tandem Mass Spectra

John T Halloran · David M Rocke

Tandem mass spectrometry (MS/MS) is a high-throughput technology used to identify the proteins in a complex biological sample, such as a drop of blood. A collection of spectra is generated at the output of the process, each spectrum of which is representative of a peptide (protein subsequence) present in the original complex sample. In this work, we leverage the log-likelihood gradients of generative models to improve the identification of such spectra. In particular, we show that the gradient of a recently proposed dynamic Bayesian network (DBN) may be naturally employed by a kernel-based discriminative classifier. The resulting Fisher kernel substantially improves upon recent attempts to combine generative and discriminative models for post-processing analysis, outperforming all other methods on the evaluated datasets. We extend the improved accuracy offered by the Fisher kernel framework to other search algorithms by introducing Theseus, a DBN representing a large number of widely used MS/MS scoring functions. Furthermore, with gradient ascent and max-product inference at hand, we use Theseus to learn model parameters without any supervision.
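A Fisher kernel is built from the score (the gradient of the log-likelihood with respect to the model parameters) and the Fisher information. A generic sketch, with the DBN's score function abstracted into a callable and the Fisher information estimated empirically from the scores (the regularizer and names are illustrative):

    import numpy as np

    def fisher_kernel_matrix(score, xs, reg=1e-6):
        # score(x): gradient of the generative model's log-likelihood at x.
        G = np.array([score(x) for x in xs])              # (n, p) score matrix
        I = G.T @ G / len(xs) + reg * np.eye(G.shape[1])  # empirical Fisher info
        return G @ np.linalg.solve(I, G.T)                # K[i,j] = g_i' I^-1 g_j

The resulting matrix can be handed to any kernel-based discriminative classifier, e.g. an SVM, which is the combination the abstract describes.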

Wed 6 Dec. 17:40 - 17:45 PST

Spotlight
Style Transfer from Non-parallel Text by Cross-Alignment

Tianxiao Shen · Tao Lei · Regina Barzilay · Tommi Jaakkola

This paper focuses on style transfer on the basis of non-parallel text. This is an instance of a broader family of problems including machine translation, decipherment, and sentiment modification. The key technical challenge is to separate the content from desired text characteristics such as sentiment. We leverage refined cross-alignment of latent representations across mono-lingual text corpora with different characteristics. We deliberately modify encoded examples according to their characteristics, requiring the reproduced instances to match, as a population, available examples with the altered characteristics. We demonstrate the effectiveness of the method on three tasks: sentiment modification, decipherment of word substitution ciphers, and recovery of word reordering.

Wed 6 Dec. 17:45 - 17:50 PST

Spotlight
Premise Selection for Theorem Proving by Deep Graph Embedding

Mingzhe Wang · Yihe Tang · Jian Wang · Jia Deng

We propose a deep learning approach to premise selection: selecting relevant mathematical statements for the automated proof of a given conjecture. We represent a higher-order logic formula as a graph that is invariant to variable renaming, but at the same time fully preserves syntactic and semantic information. We then embed the graph into a continuous vector via a novel embedding method that preserves the information of edge ordering. Our approach achieves state-of-the-art results on the HolStep dataset, improving the classification accuracy from 83% to 90.3%.

Wed 6 Dec. 17:50 - 17:55 PST

Spotlight
Deep Multi-task Gaussian Processes for Survival Analysis with Competing Risks

Ahmed M. Alaa · Mihaela van der Schaar

Designing optimal treatment plans for patients with comorbidities requires accurate cause-specific mortality prognosis. Motivated by the recent availability of linked electronic health records, we develop a nonparametric Bayesian model for survival analysis with competing risks, which can be used for jointly assessing a patient's risk of multiple (competing) adverse outcomes. The model views a patient's survival times with respect to the competing risks as the outputs of a deep multi-task Gaussian process (DMGP), the inputs to which are the patients' covariates. Unlike parametric survival analysis methods based on Cox and Weibull models, our model uses DMGPs to capture complex non-linear interactions between the patients' covariates and cause-specific survival times, thereby learning flexible patient-specific and cause-specific survival curves, all in a data-driven fashion without explicit parametric assumptions on the hazard rates. We propose a variational inference algorithm that is capable of learning the model parameters from time-to-event data while handling right censoring. Experiments on synthetic and real data show that our model outperforms the state-of-the-art survival models.

Wed 6 Dec. 17:55 - 18:00 PST

Spotlight
Unsupervised Learning of Disentangled Representations from Video

Emily Denton · Vighnesh Birodkar

We present a new model, DRNET, that learns disentangled image representations from video. Our approach leverages the temporal coherence of video and a novel adversarial loss to learn a representation that factorizes each frame into a stationary part and a temporally varying component. The disentangled representation can be used for a range of tasks. For example, applying a standard LSTM to the time-varying components enables prediction of future frames. We evaluate our approach on a range of synthetic and real videos. For the latter, we demonstrate the ability to coherently generate up to several hundred steps into the future.