Non-identifiability and the Blessings of Misspecification in Models of Molecular Fitness
Eli Weinstein · Alan Amin · Jonathan Frazer · Debora Marks

Wed Nov 30 02:00 PM -- 04:00 PM (PST) @ Hall J #510

Understanding the consequences of mutation for molecular fitness and function is a fundamental problem in biology. Recently, generative probabilistic models have emerged as a powerful tool for estimating fitness from evolutionary sequence data, with accuracy sufficient to predict both laboratory measurements of function and disease risk in humans, and to design novel functional proteins. Existing techniques rest on an assumed relationship between density estimation and fitness estimation, a relationship that we interrogate in this article. We prove that fitness is not identifiable from observational sequence data alone, placing fundamental limits on our ability to disentangle fitness landscapes from phylogenetic history. We show on real datasets that perfect density estimation in the limit of infinite data would, with high confidence, result in poor fitness estimation; current models perform accurate fitness estimation because of, not despite, misspecification. Our results challenge the conventional wisdom that bigger models trained on bigger datasets will inevitably lead to better fitness estimation, and suggest novel estimation strategies going forward.

Author Information

Eli Weinstein (Columbia University)
Alan Amin (Harvard University)
Jonathan Frazer (Harvard Medical School)

Jonny's background is in theoretical physics. His previous work included developing computational tools for testing theories of the very early universe, as well as pioneering the use of information theory and probabilistic modelling for studying cosmic inflation in string theory. His love of high-dimensional probability and information theory has now brought him to the data-rich world of genomics. In his current work, Jonny is developing new models of evolutionary sequence data, covering both protein sequences and whole bacterial genomes. For protein sequences, the goal is to develop unsupervised models of protein fitness for mutation effect prediction (and hence protein design). For bacterial genomes, the goal is to uncover epistatic interactions involved in antibiotic resistance.

Debora Marks (Harvard University)

Debora is a mathematician and computational biologist with a track record of using novel algorithms and statistics to successfully address unsolved biological problems. She has a passion for interpreting genetic variation in a way that impacts biomedical applications. During her PhD, she quantified the pan-genomic scope of microRNA targeting (the combinatorial regulation of protein expression) and co-discovered the first microRNA in a virus. As a postdoc, she made a breakthrough in the classic, unsolved problem of ab initio 3D structure prediction of proteins using undirected graphical probability models for evolutionary sequences. She has developed this approach to determine functional interactions and biomolecular structures, including the 3D structure of RNA and RNA-protein complexes and the conformational ensembles of apparently disordered proteins. Her new lab at Harvard is interested in developing deep learning methods to address a wide range of biological challenges, including designing drug affinity libraries for large numbers of human genes, predicting epistasis in antibiotic resistance, predicting the effects of genetic variation on human disease etiology and drug response, and designing sequences for biosynthetic applications.
