Kernelized Stein Discrepancies for Biological Sequences
Alan Amin · Eli Weinstein · Debora Marks
Event URL: https://openreview.net/forum?id=fvBVj5djg3

Generative models of biological sequences are a powerful tool for learning from complex sequence data, predicting the effects of mutations, and designing novel biomolecules with desired properties. The problem of measuring differences between high-dimensional distributions is central to the successful construction and use of generative probabilistic models. In this paper we propose the KSD-B, a novel divergence measure for distributions over biological sequences that is based on the kernelized Stein discrepancy (KSD). As for all KSDs, the KSD-B between a model and dataset can be evaluated even when the normalizing constant of the model is unknown; unlike any previous KSD, the KSD-B can be applied to arbitrary distributions over variable-length discrete sequences, and can take into account biological notions of mutational distance. Our theoretical results rigorously establish that the KSD-B is not only a valid divergence measure, but also that it detects non-convergence in distribution. We outline the wide variety of possible applications of the KSD-B, including (a) goodness-of-fit tests, which enable generative sequence models to be evaluated on an absolute instead of relative scale; (b) measurement of posterior sample quality, which enables accurate semi-supervised sequence design and ancestral sequence reconstruction; and (c) selection of a set of representative points, which enables the design of libraries of sequences that are representative of a given generative model for efficient experimental testing.
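As a rough illustration of why a KSD can be evaluated without the model's normalizing constant, the following minimal sketch builds a toy kernelized Stein discrepancy on fixed-length DNA sequences. It is not the KSD-B itself: the Stein operator here is assumed to be the generator of a Metropolis-style substitution chain, the kernel is an assumed exponentiated Hamming kernel, and the model `unnorm_p` and all function names are hypothetical. The model enters only through ratios of unnormalized probabilities.

```python
import functools
import itertools
import math

ALPHABET = "ACGT"
LENGTH = 3  # toy setting: fixed-length sequences, substitutions only
SEQS = ["".join(s) for s in itertools.product(ALPHABET, repeat=LENGTH)]


def unnorm_p(seq):
    # Hypothetical unnormalized model that favors 'A' and 'G'.
    return math.exp(0.8 * seq.count("A") + 0.3 * seq.count("G"))


def neighbors(seq):
    # All sequences that differ from seq by one substitution.
    for i, c in enumerate(seq):
        for a in ALPHABET:
            if a != c:
                yield seq[:i] + a + seq[i + 1:]


@functools.lru_cache(maxsize=None)
def rate(x, y):
    # Metropolis-style jump rate; depends only on a ratio of
    # unnormalized probabilities, so the normalizer cancels.
    return min(1.0, unnorm_p(y) / unnorm_p(x))


@functools.lru_cache(maxsize=None)
def kernel(x, y):
    # Assumed kernel: exp(-Hamming distance), positive definite here.
    return math.exp(-sum(a != b for a, b in zip(x, y)))


def stein_kernel(x, xp):
    # Apply the Stein operator (Lf)(x) = sum_y rate(x, y) (f(y) - f(x))
    # to both arguments of the base kernel.
    total = 0.0
    for y in neighbors(x):
        for yp in neighbors(xp):
            total += rate(x, y) * rate(xp, yp) * (
                kernel(y, yp) - kernel(y, xp) - kernel(x, yp) + kernel(x, xp)
            )
    return total


def ksd_squared(weights):
    # Exact squared discrepancy between the model and a distribution
    # given as a dict mapping each sequence to its probability.
    return sum(
        weights[x] * weights[xp] * stein_kernel(x, xp)
        for x in SEQS
        for xp in SEQS
    )


# The normalizer is computed here only to build an exact reference
# distribution for the check; the Stein kernel itself never uses it.
total_mass = sum(unnorm_p(s) for s in SEQS)
model = {s: unnorm_p(s) / total_mass for s in SEQS}
uniform = {s: 1.0 / len(SEQS) for s in SEQS}

ksd_model = ksd_squared(model)      # numerically zero: distributions match
ksd_uniform = ksd_squared(uniform)  # strictly positive: distributions differ
print(ksd_model, ksd_uniform)
```

In practice the weights would come from a finite dataset (each observed sequence weighted by 1/n) rather than exact enumeration, which is only feasible in this toy setting. Handling variable-length sequences and biologically motivated mutational distances, as the KSD-B does, requires the construction described in the paper.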

Author Information

Alan Amin (Harvard University)
Eli Weinstein (Columbia University)
Debora Marks (Harvard University)

Debora is a mathematician and computational biologist with a track record of using novel algorithms and statistics to successfully address unsolved biological problems. She has a passion for interpreting genetic variation in a way that impacts biomedical applications. During her PhD, she quantified the pan-genomic scope of microRNA targeting (the combinatorial regulation of protein expression) and co-discovered the first microRNA in a virus. As a postdoc, she made a breakthrough in the classic, unsolved problem of ab initio 3D structure prediction of proteins using undirected graphical probability models for evolutionary sequences. She has developed this approach to determine functional interactions and biomolecular structures, including the 3D structures of RNA and RNA-protein complexes and the conformational ensembles of apparently disordered proteins. Her new lab at Harvard is interested in developing deep learning methods to address a wide range of biological challenges, including designing drug affinity libraries for large numbers of human genes, predicting epistasis in antibiotic resistance and the effects of genetic variation on human disease etiology and drug response, and sequence design for biosynthetic applications.
