TranceptEVE: Combining Family-specific and Family-agnostic Models of Protein Sequences for Improved Fitness Prediction
Pascal Notin · Lodevicus van Niekerk · Aaron Kollasch · Daniel Ritter · Yarin Gal · Debora Marks
Event URL: https://openreview.net/forum?id=l7Oo9DcLmR1

Successful approaches that model the fitness landscape of protein sequences have typically relied on family-specific sets of homologous sequences called multiple-sequence alignments (Hopf et al. 2017; Riesselman et al. 2018; Frazer et al. 2021). They are, however, limited by the fact that many proteins are difficult to align or have shallow alignments. Newer alignment-free models, such as transformers, have shown promise (Madani et al. 2020; Rives et al. 2021; Notin et al. 2022; Hesslow et al. 2022) and are progressively bridging the gap with their alignment-based counterparts. In this work, we introduce TranceptEVE -- a hybrid between family-specific and family-agnostic models that seeks to build on the relative strengths of each approach to achieve state-of-the-art performance on the fitness prediction task. We demonstrate that it outperforms all other baselines on the recently released ProteinGym benchmarks (Notin et al. 2022) -- a curated set of 94 deep mutational scanning assays for assessing the effects of substitution and indel mutations. We also quantify its ability to predict the pathogenicity of genetic mutations in humans based on annotations from ClinVar.

Author Information

Pascal Notin (Department of Computer Science, University of Oxford)
Lodevicus van Niekerk (University of Oxford)

Research Assistant at Marks Lab (Harvard Medical School) and OATML (Oxford).

Aaron Kollasch (Harvard University)
Daniel Ritter (Harvard Medical School)
Yarin Gal (University of Oxford)
Debora Marks (Harvard University)

Debora is a mathematician and computational biologist with a track record of using novel algorithms and statistics to successfully address unsolved biological problems. She has a passion for interpreting genetic variation in a way that impacts biomedical applications. During her PhD, she quantified the pan-genomic scope of microRNA targeting (the combinatorial regulation of protein expression) and co-discovered the first microRNA in a virus. As a postdoc, she made a breakthrough in the classic, unsolved problem of ab initio 3D structure prediction of proteins using undirected graphical probability models for evolutionary sequences. She has developed this approach to determine functional interactions and biomolecular structures, including the 3D structures of RNA and RNA-protein complexes and the conformational ensembles of apparently disordered proteins. Her new lab at Harvard is interested in developing deep learning methods to address a wide range of biological challenges, including designing drug affinity libraries for large numbers of human genes, predicting epistasis in antibiotic resistance, predicting the effects of genetic variation on human disease etiology and drug response, and designing sequences for biosynthetic applications.

More from the Same Authors