Protein design holds immense potential for optimizing naturally occurring sequences, with broad applications in drug discovery, material design, and sustainability. However, computational methods for protein engineering are confronted with significant challenges, including an expansive design space, sparse functional regions, and scarcity of available labels. Furthermore, real-life design scenarios often necessitate the simultaneous optimization of multiple properties, exacerbating label sparsity issues. In this paper, we present ProteinNPT, a non-parametric transformer variant tailored for protein sequences and particularly suited to label-scarce and multi-task optimization settings. We first expand the ProteinGym benchmark to evaluate models in supervised settings and develop several cross-validation schemes for robust assessment. Subsequently, we reimplement existing top-performing baselines, introduce several extensions of these baselines by integrating diverse branches of protein engineering literature, and demonstrate that ProteinNPT consistently outperforms all of them across a diverse set of protein property prediction tasks. Finally, we demonstrate the value of our approach for iterative protein design in several in silico Bayesian optimization experiments.
Author Information
Pascal Notin (Department of Computer Science, University of Oxford)
Ruben Weitzman (University of Oxford)
Debora Marks (Harvard University)
Debora is a mathematician and computational biologist with a track record of using novel algorithms and statistics to successfully address unsolved biological problems. She has a passion for interpreting genetic variation in a way that impacts biomedical applications. During her PhD, she quantified the pan-genomic scope of microRNA targeting (the combinatorial regulation of protein expression) and co-discovered the first microRNA in a virus. As a postdoc, she made a breakthrough in the classic, unsolved problem of ab initio 3D structure prediction of proteins using undirected graphical probability models for evolutionary sequences. She has developed this approach to determine functional interactions and biomolecular structures, including the 3D structure of RNA and RNA-protein complexes and the conformational ensembles of apparently disordered proteins. Her new lab at Harvard is interested in developing methods in deep learning to address a wide range of biological challenges, including designing drug affinity libraries for large numbers of human genes, predicting epistasis in antibiotic resistance and the effects of genetic variation on human disease etiology and drug response, and designing sequences for biosynthetic applications.
Yarin Gal (University of Oxford)
More from the Same Authors
-
2020 : Paper 40: Real2sim: Automatic Generation of Open Street Map Towns For Autonomous Driving Benchmarks »
Panagiotis Tigas · Yarin Gal -
2022 : Discovering Long-period Exoplanets using Deep Learning with Citizen Science Labels »
Shreshth Malik · Nora Eisner · Chris Lintott · Yarin Gal -
2022 : How can we use natural evolution and genetic experiments to design protein functions? »
Ada Shaw · June Shin · Debora Marks -
2022 : TranceptEVE: Combining Family-specific and Family-agnostic Models of Protein Sequences for Improved Fitness Prediction »
Pascal Notin · Lood van Niekerk · Aaron Kollasch · Daniel Ritter · Yarin Gal · Debora Marks -
2022 : Kernelized Stein Discrepancies for Biological Sequences »
Alan Amin · Eli Weinstein · Debora Marks -
2022 : scPerturb: Information Resource for Harmonized Single-Cell Perturbation Data »
Tessa Green · Stefan Peidli · Ciyue Shen · Torsten Gross · Joseph Min · Samuele Garda · Jake Taylor-King · Debora Marks · Augustin Luna · Nils Blüthgen · Chris Sander -
2022 : Designing and Evolving Neuron-Specific Proteases »
Han Spinner · Colin Hemez · Julia McCreary · David Liu · Debora Marks -
2022 : Can Active Sampling Reduce Causal Confusion in Offline Reinforcement Learning? »
Gunshi Gupta · Tim G. J. Rudner · Rowan McAllister · Adrien Gaidon · Yarin Gal -
2022 : What 'Out-of-distribution' Is and Is Not »
Sebastian Farquhar · Yarin Gal -
2022 : Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation »
Lorenz Kuhn · Yarin Gal · Sebastian Farquhar -
2023 : Sampling Protein Language Models for Functional Protein Design »
Jeremie Theddy Darmawan · Yarin Gal · Pascal Notin -
2023 : An Energy Based Model for Incorporating Sequence Priors for Target-Specific Antibody Design »
Steffanie Paul · Yining Huang · Debora Marks -
2023 : Combining Structure and Sequence for Superior Fitness Prediction »
Steffanie Paul · Aaron Kollasch · Pascal Notin · Debora Marks -
2023 : Leveraging Deep Learning for Physical Model Bias of Global Air Quality Estimates »
Kelsey Doerksen · Yarin Gal · Freddie Kalaitzis · Yuliya Marchetti · Steven Lu · James Montgomery · Kazuyuki Miyazaki · Kevin Bowman -
2023 : Combining Structure and Sequence for Superior Fitness Prediction »
Steffanie Paul · Pascal Notin · Aaron Kollasch · Debora Marks -
2023 Poster: ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design »
Pascal Notin · Aaron Kollasch · Daniel Ritter · Lood van Niekerk · Steffanie Paul · Han Spinner · Nathan Rollins · Ada Shaw · Rose Orenbuch · Ruben Weitzman · Jonathan Frazer · Mafalda Dias · Dinko Franceschi · Yarin Gal · Debora Marks -
2022 Workshop: Learning Meaningful Representations of Life »
Elizabeth Wood · Adji Bousso Dieng · Aleksandrina Goeva · Alex X Lu · Anshul Kundaje · Chang Liu · Debora Marks · Ed Boyden · Eli N Weinstein · Lorin Crawford · Mor Nitzan · Rebecca Boiarsky · Romain Lopez · Tamara Broderick · Ray Jones · Wouter Boomsma · Yixin Wang · Stephen Ra -
2022 Poster: Tractable Function-Space Variational Inference in Bayesian Neural Networks »
Tim G. J. Rudner · Zonghao Chen · Yee Whye Teh · Yarin Gal -
2022 Poster: Scalable Sensitivity and Uncertainty Analyses for Causal-Effect Estimates of Continuous-Valued Interventions »
Andrew Jesson · Alyson Douglas · Peter Manshausen · Maëlys Solal · Nicolai Meinshausen · Philip Stier · Yarin Gal · Uri Shalit -
2022 Poster: Non-identifiability and the Blessings of Misspecification in Models of Molecular Fitness »
Eli Weinstein · Alan Amin · Jonathan Frazer · Debora Marks -
2022 Poster: Interventions, Where and How? Experimental Design for Causal Models at Scale »
Panagiotis Tigas · Yashas Annadani · Andrew Jesson · Bernhard Schölkopf · Yarin Gal · Stefan Bauer -
2022 Poster: Active Surrogate Estimators: An Active Learning Approach to Label-Efficient Model Evaluation »
Jannik Kossen · Sebastian Farquhar · Yarin Gal · Thomas Rainforth -
2021 Workshop: Learning Meaningful Representations of Life (LMRL) »
Elizabeth Wood · Adji Bousso Dieng · Aleksandrina Goeva · Anshul Kundaje · Barbara Engelhardt · Chang Liu · David Van Valen · Debora Marks · Edward Boyden · Eli N Weinstein · Lorin Crawford · Mor Nitzan · Romain Lopez · Tamara Broderick · Ray Jones · Wouter Boomsma · Yixin Wang -
2019 : Synthetic Systems »
Pamela Silver · Debora Marks · Chang Liu · Possu Huang -
2019 Workshop: Learning Meaningful Representations of Life »
Elizabeth Wood · Yakir Reshef · Jonathan Bloom · Jasper Snoek · Barbara Engelhardt · Scott Linderman · Suchi Saria · Alexander Wiltschko · Casey Greene · Chang Liu · Kresten Lindorff-Larsen · Debora Marks -
2018 : Invited Talk Session 2 »
Debora Marks · Olexandr Isayev · Tess Smidt · Nathaniel Thomas -
2018 : TBC 4 »
Debora Marks -
2018 Poster: BRUNO: A Deep Recurrent Model for Exchangeable Data »
Iryna Korshunova · Jonas Degrave · Ferenc Huszar · Yarin Gal · Arthur Gretton · Joni Dambre