Predicting the effects of mutations in proteins is critical to many applications, from understanding genetic disease to designing novel proteins to address our most pressing challenges in climate, agriculture and healthcare. Despite an increase in machine learning-based protein modeling methods, assessing their effectiveness is problematic due to the use of distinct, often contrived, experimental datasets and variable performance across different protein families. Addressing these challenges requires scale. To that end, we introduce ProteinGym v1.0, a large-scale and holistic set of benchmarks specifically designed for protein fitness prediction and design. It encompasses both a broad collection of over 250 standardized deep mutational scanning assays, spanning millions of mutated sequences, and curated clinical datasets providing high-quality expert annotations about mutation effects. We devise a robust evaluation framework that combines metrics for both fitness prediction and design, factors in known limitations of the underlying experimental methods, and covers both zero-shot and supervised settings. We report the performance of a diverse set of over 40 high-performing models from various subfields (e.g., mutation effects, inverse folding) within a unified benchmark. We open source the corresponding codebase, datasets, MSAs, structures and predictions, and develop a user-friendly website that facilitates comparisons across all settings.
Author Information
Pascal Notin (Department of Computer Science, University of Oxford)
Aaron Kollasch (Harvard University)
Daniel Ritter (Harvard Medical School)
Lood van Niekerk (Harvard)
Research Assistant at Debora Marks Lab (Harvard Medical School).
Steffanie Paul (Harvard University)
Steffanie is a PhD student in Debora Marks' lab at Harvard Medical School. She is interested in leveraging advances in generative AI to combine sequence and structure information for protein design. Her current projects include developing models for affinity maturation campaign data and methods for de novo antibody design. She also does theoretical work, developing statistical goodness-of-fit tests for conditional generative protein models.
Han Spinner (Harvard Medical School, Harvard University)
Nathan Rollins (Seismic Therapeutic)
Ada Shaw (Harvard University)
Rose Orenbuch (Harvard University)
Ruben Weitzman (University of Oxford)
Jonathan Frazer (Centre for Genomic Regulation (CRG))
Mafalda Dias (Centre for Genomic Regulation (CRG))
Dinko Franceschi (Harvard University)
Yarin Gal (University of Oxford)
Debora Marks (Harvard University)
Debora is a mathematician and computational biologist with a track record of using novel algorithms and statistics to successfully address unsolved biological problems. She has a passion for interpreting genetic variation in a way that impacts biomedical applications. During her PhD, she quantified the pan-genomic scope of microRNA targeting (the combinatorial regulation of protein expression) and co-discovered the first microRNA in a virus. As a postdoc, she made a breakthrough in the classic, unsolved problem of ab initio 3D structure prediction of proteins using undirected graphical probability models for evolutionary sequences. She has developed this approach to determine functional interactions and biomolecular structures, including the 3D structure of RNA and RNA-protein complexes and the conformational ensembles of apparently disordered proteins. Her new lab at Harvard is interested in developing deep learning methods to address a wide range of biological challenges, including designing drug affinity libraries for large numbers of human genes, predicting epistasis in antibiotic resistance, predicting the effects of genetic variation on human disease etiology and drug response, and designing sequences for biosynthetic applications.
More from the Same Authors
-
2020 : Paper 40: Real2sim: Automatic Generation of Open Street Map Towns For Autonomous Driving Benchmarks »
Panagiotis Tigas · Yarin Gal -
2022 : Discovering Long-period Exoplanets using Deep Learning with Citizen Science Labels »
Shreshth Malik · Nora Eisner · Chris Lintott · Yarin Gal -
2022 : How can we use natural evolution and genetic experiments to design protein functions? »
Ada Shaw · June Shin · Debora Marks -
2022 : TranceptEVE: Combining Family-specific and Family-agnostic Models of Protein Sequences for Improved Fitness Prediction »
Pascal Notin · Lood van Niekerk · Aaron Kollasch · Daniel Ritter · Yarin Gal · Debora Marks -
2022 : Kernelized Stein Discrepancies for Biological Sequences »
Alan Amin · Eli Weinstein · Debora Marks -
2022 : scPerturb: Information Resource for Harmonized Single-Cell Perturbation Data »
Tessa Green · Stefan Peidli · Ciyue Shen · Torsten Gross · Joseph Min · Samuele Garda · Jake Taylor-King · Debora Marks · Augustin Luna · Nils Blüthgen · Chris Sander -
2022 : Designing and Evolving Neuron-Specific Proteases »
Han Spinner · Colin Hemez · Julia McCreary · David Liu · Debora Marks -
2022 : Can Active Sampling Reduce Causal Confusion in Offline Reinforcement Learning? »
Gunshi Gupta · Tim G. J. Rudner · Rowan McAllister · Adrien Gaidon · Yarin Gal -
2022 : What 'Out-of-distribution' Is and Is Not »
Sebastian Farquhar · Yarin Gal -
2022 : Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation »
Lorenz Kuhn · Yarin Gal · Sebastian Farquhar -
2023 : Sampling Protein Language Models for Functional Protein Design »
Jeremie Theddy Darmawan · Yarin Gal · Pascal Notin -
2023 : An Energy Based Model for Incorporating Sequence Priors for Target-Specific Antibody Design »
Steffanie Paul · Yining Huang · Debora Marks -
2023 : Combining Structure and Sequence for Superior Fitness Prediction »
Steffanie Paul · Aaron Kollasch · Pascal Notin · Debora Marks -
2023 : Leveraging Deep Learning for Physical Model Bias of Global Air Quality Estimates »
Kelsey Doerksen · Yarin Gal · Freddie Kalaitzis · Yuliya Marchetti · Steven Lu · James Montgomery · Kazuyuki Miyazaki · Kevin Bowman -
2023 : Combining Structure and Sequence for Superior Fitness Prediction »
Steffanie Paul · Pascal Notin · Aaron Kollasch · Debora Marks -
2023 Poster: ProteinNPT: Improving protein property prediction and design with non-parametric transformers »
Pascal Notin · Ruben Weitzman · Debora Marks · Yarin Gal -
2022 Workshop: Learning Meaningful Representations of Life »
Elizabeth Wood · Adji Bousso Dieng · Aleksandrina Goeva · Alex X Lu · Anshul Kundaje · Chang Liu · Debora Marks · Ed Boyden · Eli N Weinstein · Lorin Crawford · Mor Nitzan · Rebecca Boiarsky · Romain Lopez · Tamara Broderick · Ray Jones · Wouter Boomsma · Yixin Wang · Stephen Ra -
2022 Poster: Tractable Function-Space Variational Inference in Bayesian Neural Networks »
Tim G. J. Rudner · Zonghao Chen · Yee Whye Teh · Yarin Gal -
2022 Poster: Scalable Sensitivity and Uncertainty Analyses for Causal-Effect Estimates of Continuous-Valued Interventions »
Andrew Jesson · Alyson Douglas · Peter Manshausen · Maëlys Solal · Nicolai Meinshausen · Philip Stier · Yarin Gal · Uri Shalit -
2022 Poster: Non-identifiability and the Blessings of Misspecification in Models of Molecular Fitness »
Eli Weinstein · Alan Amin · Jonathan Frazer · Debora Marks -
2022 Poster: Interventions, Where and How? Experimental Design for Causal Models at Scale »
Panagiotis Tigas · Yashas Annadani · Andrew Jesson · Bernhard Schölkopf · Yarin Gal · Stefan Bauer -
2022 Poster: Active Surrogate Estimators: An Active Learning Approach to Label-Efficient Model Evaluation »
Jannik Kossen · Sebastian Farquhar · Yarin Gal · Thomas Rainforth -
2021 Workshop: Learning Meaningful Representations of Life (LMRL) »
Elizabeth Wood · Adji Bousso Dieng · Aleksandrina Goeva · Anshul Kundaje · Barbara Engelhardt · Chang Liu · David Van Valen · Debora Marks · Edward Boyden · Eli N Weinstein · Lorin Crawford · Mor Nitzan · Romain Lopez · Tamara Broderick · Ray Jones · Wouter Boomsma · Yixin Wang -
2019 : Synthetic Systems »
Pamela Silver · Debora Marks · Chang Liu · Possu Huang -
2019 Workshop: Learning Meaningful Representations of Life »
Elizabeth Wood · Yakir Reshef · Jonathan Bloom · Jasper Snoek · Barbara Engelhardt · Scott Linderman · Suchi Saria · Alexander Wiltschko · Casey Greene · Chang Liu · Kresten Lindorff-Larsen · Debora Marks -
2018 : Invited Talk Session 2 »
Debora Marks · Olexandr Isayev · Tess Smidt · Nathaniel Thomas -
2018 : TBC 4 »
Debora Marks -
2018 Poster: BRUNO: A Deep Recurrent Model for Exchangeable Data »
Iryna Korshunova · Jonas Degrave · Ferenc Huszar · Yarin Gal · Arthur Gretton · Joni Dambre