ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design
Pascal Notin · Aaron Kollasch · Daniel Ritter · Lood van Niekerk · Steffanie Paul · Han Spinner · Nathan Rollins · Ada Shaw · Rose Orenbuch · Ruben Weitzman · Jonathan Frazer · Mafalda Dias · Dinko Franceschi · Yarin Gal · Debora Marks

Tue Dec 12 08:45 AM -- 10:45 AM (PST) @ Great Hall & Hall B1+B2 #326

Predicting the effects of mutations in proteins is critical to many applications, from understanding genetic disease to designing novel proteins that address our most pressing challenges in climate, agriculture and healthcare. Despite an increase in machine learning-based protein modeling methods, assessing their effectiveness is problematic due to the use of distinct, often contrived, experimental datasets and variable performance across different protein families. Addressing these challenges requires scale. To that end, we introduce ProteinGym v1.0, a large-scale and holistic set of benchmarks specifically designed for protein fitness prediction and design. It encompasses both a broad collection of over 250 standardized deep mutational scanning assays, spanning millions of mutated sequences, and curated clinical datasets providing high-quality expert annotations about mutation effects. We devise a robust evaluation framework that combines metrics for both fitness prediction and design, factors in known limitations of the underlying experimental methods, and covers both zero-shot and supervised settings. We integrate a diverse set of over 40 high-performing models from various subfields (e.g., mutation effects, inverse folding) into a unified benchmark and report their performance. We open-source the corresponding codebase, datasets, MSAs, structures, and predictions, and develop a user-friendly website that facilitates comparisons across all settings.

Author Information

Pascal Notin (Department of Computer Science, University of Oxford)
Aaron Kollasch (Harvard University)
Daniel Ritter (Harvard Medical School)
Lood van Niekerk (Harvard)

Research Assistant at Debora Marks Lab (Harvard Medical School).

Steffanie Paul (Harvard University)

Steffanie is a PhD student in Debora Marks' lab at Harvard Medical School. She is interested in leveraging advances in Generative AI to combine sequence and structure information for protein design. Her current projects include developing models for affinity maturation campaign data and methods for de novo antibody design. She also does theoretical work, developing statistical goodness of fit tests for conditional generative protein models.

Han Spinner (Harvard Medical School, Harvard University)
Nathan Rollins (Seismic Therapeutic)
Ada Shaw (Harvard University)
Rose Orenbuch (Harvard University)
Ruben Weitzman (University of Oxford)
Jonathan Frazer (Centre for Genomic Regulation (CRG))
Mafalda Dias (Centre for Genomic Regulation (CRG))
Dinko Franceschi (Harvard University)
Yarin Gal (University of Oxford)
Debora Marks (Harvard University)

Debora is a mathematician and computational biologist with a track record of using novel algorithms and statistics to successfully address unsolved biological problems. She has a passion for interpreting genetic variation in a way that impacts biomedical applications. During her PhD, she quantified the pan-genomic scope of microRNA targeting (the combinatorial regulation of protein expression) and co-discovered the first microRNA in a virus. As a postdoc, she made a breakthrough in the classic, unsolved problem of ab initio 3D structure prediction of proteins using undirected graphical probability models for evolutionary sequences. She has developed this approach to determine functional interactions and biomolecular structures, including the 3D structure of RNA and RNA-protein complexes and the conformational ensembles of apparently disordered proteins. Her new lab at Harvard is interested in developing methods in deep learning to address a wide range of biological challenges, including designing drug affinity libraries for large numbers of human genes, predicting epistasis in antibiotic resistance, predicting the effects of genetic variation on human disease etiology and drug response, and sequence design for biosynthetic applications.
