Workshop
Machine Learning in Structural Biology Workshop
Hannah Wayment-Steele · Roshan Rao · Ellen Zhong · Sergey Ovchinnikov · Gabriele Corso · Gina El Nesr
Room 208 - 210
Structural biology, the study of the 3D structure or shape of proteins and other biomolecules, has been transformed by breakthroughs from machine learning algorithms. While methods such as AlphaFold2 have made dramatic progress in certain areas, many active and open challenges for the field remain, including modeling protein dynamics, predicting the structure of other classes of biomolecules such as RNA, and ultimately relating the structure of isolated proteins to the in vivo and contextual nature of their underlying function. These challenges are diverse and require interdisciplinary collaboration between ML and structural biology researchers. The 4th edition of the Machine Learning in Structural Biology (MLSB) workshop focuses on these challenges and opportunities. In a unique show of support, the journal PRX Life has committed to waiving publication fees for interested authors whose accepted papers appear in a special collection. We anticipate this workshop will be of significant interest to both ML researchers and computational/experimental biologists, and will stimulate continued problem-solving and new directions in the field.
Schedule
Fri 6:30 a.m. - 6:35 a.m. | Opening Remarks (Remarks)
Fri 6:35 a.m. - 7:00 a.m. | Health system scale language models for clinical and operational decision making (Talk) | Kyunghyun Cho
Fri 7:00 a.m. - 7:15 a.m. | Validation of de novo designed water-soluble and membrane proteins by in silico folding and melting (Contributed)
Fri 7:15 a.m. - 7:40 a.m. | Accurate and tunable de novo protein shapes for new functions (Talk) | Tanja Kortemme
Fri 8:00 a.m. - 8:25 a.m. | A CryoET Data Portal to Foster a Collaboration between the Machine Learning and CryoET Communities (Talk) | Bridget Carragher
Fri 8:25 a.m. - 8:40 a.m. | AlphaFold Meets Flow Matching for Generating Protein Ensembles (Contributed)
Fri 8:40 a.m. - 8:55 a.m. | DSMBind: an unsupervised generative modeling framework for binding energy prediction (Contributed)
Fri 8:55 a.m. - 9:20 a.m. | Leveraging microfluidics for high-throughput and quantitative biochemistry and biophysics (Talk) | Polly Fordyce
Fri 9:20 a.m. - 10:40 a.m. | Poster Session 1 / Lunch (Poster session)
Fri 10:40 a.m. - 11:05 a.m. | Illuminating protein space with a programmable generative model (Talk) | Gevorg Grigoryan
Fri 11:05 a.m. - 11:20 a.m. | Protein generation with evolutionary diffusion: sequence is all you need (Contributed)
Fri 11:20 a.m. - 11:45 a.m. | De novo design of protein structure and function with RFdiffusion (Talk) | Jason Yim · Brian L Trippe
Fri 12:00 p.m. - 12:15 p.m. | DiffDock-Pocket: Diffusion for Pocket-Level Docking with Sidechain Flexibility (Contributed)
Fri 12:15 p.m. - 12:30 p.m. | PoseCheck: Generative Models for 3D Structure-based Drug Design Produce Unrealistic Poses (Contributed)
Fri 12:30 p.m. - 12:55 p.m. | World-wide competitions and the RNA folding problem (Talk) | Rhiju Das
Fri 1:00 p.m. - 2:00 p.m. | Panel Session (Session)
Fri 2:00 p.m. - 3:00 p.m. | Poster Session 2 / Happy Hour (Poster Session)
Fri 3:00 p.m. - 3:05 p.m. | Closing Remarks (Remarks)
ESMFold Hallucinates Native-Like Protein Sequences (Poster)
We describe protein sequence design by inverting the protein structure prediction algorithm ESMFold, which achieves high accuracy by relying on evolutionary patterns learned by a pretrained protein language model (PLM; ESM2). In principle, by inverting ESMFold, protein sequences can be designed to fulfill one or more design objectives, such as high prediction confidence, predicted protein binding, or other geometric constraints that can be expressed with loss functions. In practice, sequences designed using an inverted AlphaFold model, termed AFDesign, contained unnatural sequence profiles and were shown to express poorly, whereas an inverted RosettaFold network was shown to be sensitive to adversarial sequences. Here, we demonstrate that these limitations do not extend to neural networks that include PLMs, such as ESMFold. Our inverted model, termed ESM-Design, can generate sequences with profiles that are both more native-like and more likely to express than sequences generated using AFDesign. However, these sequences are less likely to express than sequences rescued by the structure-based design method ProteinMPNN. The safeguard offered by the PLM came with steep increases in memory consumption, preventing proteins greater than 150 residues from being modeled on a single GPU with 80 GB of VRAM. During this investigation, we also observed the role played by different sequence initialization schemes, with random sampling of discrete amino acids improving convergence and model quality over any continuous random initialization method. Finally, we showed how this approach can be used to introduce sequence and structure diversification in small proteins such as ubiquitin, while respecting the sequence conservation of active-site residues. Our results highlight the effects of architectural differences between structure prediction networks on zero-shot protein design.
Jeliazko Jeliazkov · Diego del Alamo · Joel Karpiak
Conditioned Protein Structure Prediction (Poster)
Deep learning based protein structure prediction has facilitated major breakthroughs in biological sciences. However, current methods struggle with alternative conformation prediction and offer limited integration of expert knowledge on protein dynamics. We introduce AFEXplorer, a generic approach that tailors AlphaFold predictions to user-defined constraints in coarse coordinate spaces by optimizing embedding features. Its effectiveness in generating functional protein conformations in accordance with predefined conditions was demonstrated through comprehensive examples. AFEXplorer serves as a versatile platform for conditioned protein structure prediction, bridging the gap between automated models and domain-specific insights.
Tengyu Xie · Zilin Song · Jing Huang
Stable Online and Offline Reinforcement Learning for Antibody CDRH3 Design (Poster)
The field of antibody-based therapeutics has grown significantly in recent years, with targeted antibodies emerging as a potentially effective approach to personalized therapies. Such therapies could be particularly beneficial for complex, highly individual diseases such as cancer. However, progress in this field is often constrained by the extensive search space of amino acid sequences that forms the foundation of antibody design. In this study, we introduce a novel reinforcement learning method specifically tailored to address the unique challenges of this domain. We demonstrate that our method can learn the design of high-affinity antibodies against multiple targets in silico, utilizing either online interaction or offline datasets. To the best of our knowledge, our approach is the first of its kind and outperforms existing methods on all tested antigens in the Absolut! database.
Yannick Vogt · Mehdi Naouar · Maria Kalweit · Christoph Cornelius Miething · Justus Duyster · Roland Mertelsmann · Gabriel Kalweit · Joschka Boedecker
Guiding diffusion models for antibody sequence and structure co-design with developability properties (Poster)
Recent advances in deep generative methods have allowed antibody sequence and structure co-design. This study addresses the challenge of tailoring the highly variable complementarity-determining regions (CDRs) in antibodies to fulfill developability requirements. We introduce a novel approach that integrates property guidance into the antibody design process using diffusion probabilistic models. This approach allows us to simultaneously design CDRs conditioned on antigen structures while considering critical properties like solubility and folding stability. Our property-conditioned diffusion model offers versatility by accommodating diverse property constraints, presenting a promising avenue for computational antibody design in therapeutic applications.
Amelia Villegas-Morcillo · Jana M. Weber · Marcel Reinders
AlphaFold Distillation for Protein Design (Poster)
Inverse protein folding, the process of designing sequences that fold into a specific 3D structure, is crucial in bio-engineering and drug discovery. Traditional methods rely on experimentally resolved structures, but these cover only a small fraction of protein sequences. Forward folding models like AlphaFold offer a potential solution by accurately predicting structures from sequences. However, these models are too slow for integration into the optimization loop of inverse folding models during training. To address this, we propose using knowledge distillation on folding model confidence metrics, such as pTM or pLDDT scores, to create a faster and end-to-end differentiable distilled model. This model can then be used as a structure consistency regularizer in training the inverse folding model. Our technique is versatile and can be applied to other design tasks, such as sequence-based protein infilling. Experimental results show that our method outperforms non-regularized baselines, yielding up to 3% improvement in sequence recovery and up to 45% improvement in protein diversity while maintaining structural consistency in generated sequences. Anonymized code for this work is available at https://anonymous.4open.science/r/AFDistill-28C3
Igor Melnyk · Aurelie Lozano · Payel Das · Vijil Chenthamarakshan
Binding Oracle: Fine-Tuning From Stability to Binding Free Energy (Poster)
The ability to predict changes in binding free energy (ddG binding) for mutations at protein-protein interfaces (PPIs) is critical for understanding genetic diseases and engineering novel protein-based therapeutics. Here, we present Binding Oracle: a structure-based graph transformer for predicting ddG binding at PPIs. Binding Oracle fine-tunes Stability Oracle with Selective LoRA, a technique that synergizes layer selection via gradient norms with LoRA. Selective LoRA enables the identification and fine-tuning of the layers most critical for the downstream task, thus regularizing against overfitting. Additionally, we present new training-test splits of mutational data from the SKEMPI2.0, Ab-Bind, and NABE databases that use a strict 30% sequence similarity threshold to avoid data leakage during model evaluation. Binding Oracle, when trained with the Thermodynamic Permutations data augmentation technique, achieves SOTA on S487 without using any evolutionary auxiliary features. Our results empirically demonstrate how sparse fine-tuning techniques, such as Selective LoRA, can enable rapid domain adaptation in protein machine learning frameworks.
Chengyue Gong · Adam Klivans · Jordan Wells · James Loy · Qiang Liu · Alex Dimakis · Daniel Diaz
Scalable Multimer Structure Prediction using Diffusion Models (Poster)
Accurate protein complex structure modeling is a necessary step in understanding the behavior of biological pathways and cellular systems. While some works have attempted to address this challenge, there is still a need for scaling existing methods to larger protein complexes. To address this need, we propose a novel diffusion generative model (DGM) that predicts large multimeric protein structures by learning to rigidly dock its chains together. Additionally, we construct a new dataset specifically for large protein complexes used to train and evaluate our DGM. We substantially improve prediction runtime and completion rates while maintaining competitive accuracy with current methods.
Peter Pao-Huang · Bowen Jing · Dr. Bonnie Berger
Implicit Geometry and Interaction Embeddings Improve Few-Shot Molecular Property Prediction (Poster)
Few-shot learning is a promising approach to molecular property prediction as supervised data is often very limited. However, many important molecular properties depend on complex molecular characteristics — such as the various 3D geometries a molecule may adopt or the types of chemical interactions it can form — that are not explicitly encoded in the feature space and must be approximated from limited data. Learning these characteristics can be difficult, especially for few-shot learning algorithms that are designed for fast adaptation to new tasks. In this work, we develop molecular embeddings that encode complex molecular characteristics to improve the performance of few-shot molecular property prediction. Our approach leverages large amounts of synthetic data, namely the results of molecular docking calculations, and a multi-task learning paradigm to structure the embedding space. The embeddings improve few-shot learning performance using Multi-Task, MAML, and Prototypical Networks on multiple molecular property prediction benchmarks.
Christopher Fifty · Joseph M Paggi · Ehsan Amid · Jure Leskovec · Ron Dror
Molecular Diffusion Models with Virtual Receptors (Poster)
Machine learning approaches to Structure-Based Drug Design (SBDD) have proven quite fertile over the last few years. In particular, diffusion-based approaches to SBDD have shown great promise. We present a technique which expands on this diffusion approach in two crucial ways. First, we address the size disparity between the drug molecule and the target/receptor, which makes learning more challenging and inference slower. We do so through the notion of a Virtual Receptor, which is a compressed version of the receptor; it is learned so as to preserve key aspects of the structural information of the original receptor, while respecting the relevant group equivariance. Second, we incorporate a protein language embedding used originally in the context of protein folding. We experimentally demonstrate the contributions of both the virtual receptors and the protein embeddings: in practice, they lead to both better performance, as well as significantly faster computations.
Matan Halfon · Eyal Rozenberg · Ehud Rivlin · Daniel Freedman
CESPED: a new benchmark for supervised particle pose estimation in Cryo-EM (Poster)
Cryo-EM is a powerful tool for understanding macromolecular structures, yet current methods for structure reconstruction are slow and computationally demanding. To accelerate research on pose estimation, we present CESPED, a new dataset specifically designed for Supervised Pose Estimation in Cryo-EM. Alongside CESPED, we provide a PyTorch package to simplify Cryo-EM data handling and model evaluation. We evaluate the performance of a baseline model, Image2Sphere, on CESPED, showing promising results but also highlighting the need for further advancements in this area.
Ruben Sanchez Garcia · Michael Saur · Javier Vargas · Carl Poelking · Charlotte Deane
Learning Scalar Fields for Molecular Docking with Fast Fourier Transforms (Poster)
Molecular docking is critical to structure-based virtual screening, yet the throughput of such workflows is limited by the expensive optimization of scoring functions involved in most docking algorithms. We explore how machine learning can accelerate this process by learning a scoring function with a functional form that allows for more rapid optimization. Specifically, we define the scoring function to be the cross-correlation of multi-channel ligand and protein scalar fields parameterized by equivariant graph neural networks, enabling rapid optimization over rigid-body degrees of freedom with fast Fourier transforms. Moreover, the runtime of our approach can be amortized at several levels of abstraction, and is particularly favorable for virtual screening settings with a common binding pocket. We benchmark our scoring functions on two simplified docking-related tasks: decoy pose scoring and rigid conformer docking. Our method attains similar but faster performance on crystal structures compared to the Vina and Gnina scoring functions, and is more robust on computationally predicted structures.
Bowen Jing · Tommi Jaakkola · Dr. Bonnie Berger
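The FFT trick at the heart of this abstract can be sketched in a few lines: the cross-correlation of two scalar fields over every rigid translation collapses to a single elementwise product in Fourier space. The sketch below uses single-channel toy fields (the paper's fields are multi-channel and produced by equivariant GNNs, and rotations are handled separately); all array names are illustrative.

```python
import numpy as np

def fft_translation_scores(protein_field, ligand_field):
    """Score every rigid translation of the ligand field against the
    protein field at once via the cross-correlation theorem:
    C[t] = sum_x ligand[x] * protein[x + t] = IFFT(conj(FFT(L)) * FFT(P))."""
    P = np.fft.fftn(protein_field)
    L = np.fft.fftn(ligand_field)
    return np.fft.ifftn(np.conj(L) * P).real

# Toy example: a 3D "protein" field with one hot spot and a ligand field
# carrying the matching feature at its origin; the best-scoring shift
# should be the translation that aligns the two.
protein = np.zeros((16, 16, 16))
protein[4, 5, 6] = 1.0
ligand = np.zeros((16, 16, 16))
ligand[0, 0, 0] = 1.0
scores = fft_translation_scores(protein, ligand)
best = np.unravel_index(np.argmax(scores), scores.shape)
```

One FFT per field replaces an explicit loop over all 16³ translations, which is exactly why this functional form is cheap to optimize over rigid-body degrees of freedom.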
VN-EGNN: Equivariant Graph Neural Networks with Virtual Nodes Enhance Protein Binding Site Identification (Poster)
Being able to identify regions within or around proteins, to which ligands can potentially bind, is an essential step to develop new drugs. Binding site identification methods can now profit from the availability of large amounts of 3D structures in protein structure databases or from AlphaFold predictions. Current binding site identification methods rely on geometric deep learning, which takes geometric invariances and equivariances into account. Such methods turned out to be very beneficial for physics-related tasks like binding energy or motion trajectory prediction. However, their performance at binding site identification is still limited, which might be due to limited expressivity or oversquashing effects of E(n)-Equivariant Graph Neural Networks (EGNNs). Here, we extend EGNNs by adding virtual nodes and applying an extended message passing scheme. The virtual nodes in these graphs both improve the predictive performance and can also learn to represent binding sites. In our experiments, we show that VN-EGNN sets a new state of the art at binding site identification on three common benchmarks, COACH420, HOLO4K, and PDBbind2020.
Florian Sestak · Lisa Schneckenreiter · Sepp Hochreiter · Andreas Mayr · Günter Klambauer
Enhancing Ligand Pose Sampling for Machine Learning–Based Docking (Poster)
Deep learning promises to dramatically improve scoring functions for molecular docking, leading to substantial advances in binding pose prediction and virtual screening. To train scoring functions—and to perform molecular docking—one must generate a set of candidate ligand binding poses. Unfortunately, the sampling protocols currently used to generate candidate poses frequently fail to produce any poses close to the correct, experimentally determined pose, unless information about the correct pose is provided. This limits the accuracy of learned scoring functions and molecular docking. Here, we describe several improved protocols for pose sampling: GLOW (auGmented sampLing with sOftened vdW potential) and a novel technique named IVES (IteratiVe Ensemble Sampling). Our benchmarking results demonstrate the effectiveness of our methods in improving the likelihood of sampling accurate poses, especially for binding pockets whose shape changes substantially when different ligands bind. This improvement is observed across both experimentally determined and AlphaFold-generated protein structures. Additionally, we present datasets of candidate ligand poses generated using our methods for each of around 5,000 protein-ligand cross-docking pairs, for training and testing scoring functions. To benefit the research community, we provide an open-source Python implementation of GLOW and IVES and the newly created cross-docking datasets at https://github.com/drorlab/GLOW-IVES.
Patricia Suriana · Ron Dror
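The abstract names GLOW's key ingredient, a softened van der Waals potential, without giving its functional form. The general idea can be illustrated with a soft-core Lennard-Jones variant, a common softening scheme in molecular simulation; the parameters and the specific form below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def lj(r, epsilon=1.0, sigma=1.0):
    """Standard 12-6 Lennard-Jones potential; the repulsive wall
    diverges as r -> 0, so clashing poses score as impossible."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 ** 2 - sr6)

def soft_lj(r, epsilon=1.0, sigma=1.0, alpha=0.5):
    """Soft-core variant: padding r^6 with a constant keeps the
    repulsive wall finite, so sampling can pass through mild clashes
    instead of rejecting those poses outright."""
    sr6 = sigma ** 6 / (alpha * sigma ** 6 + r ** 6)
    return 4.0 * epsilon * (sr6 ** 2 - sr6)

r_clash = 0.3                 # an atom pair well inside clash distance
hard = lj(r_clash)            # astronomically large penalty
soft = soft_lj(r_clash)       # finite, modest penalty
```

At clash distances the softened potential stays finite while the standard one explodes; at normal contact distances the two agree closely, so softening mainly changes which near-clashing candidate poses survive sampling.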
Improved encoding of ensembles in PDBx/mmCIF (Poster)
In their folded state, biomolecules exchange between multiple conformational states, crucial for their function. However, most structural models derived from experiments and computational predictions only encode a single state. To represent biomolecules more accurately, we must move towards modeling and predicting structural ensembles. Information about structural ensembles exists within experimental data from X-ray crystallography and cryo-electron microscopy (cryoEM). While new tools are available to detect conformational and compositional heterogeneity that exist within these ensembles, the legacy PDB data structure does not robustly encapsulate this complexity. We propose modifications to the Macromolecular Crystallographic Information File (mmCIF) format to improve the representation and interrelation of conformational and compositional heterogeneity. These modifications will enable improved tools to capture macromolecular ensembles in a way that is human and machine interpretable, potentially catalyzing breakthroughs for ensemble-function predictions, analogous to AlphaFold's achievements with single structure prediction.
Stephanie Wankowicz · James Fraser
AlphaFold Meets Flow Matching for Generating Protein Ensembles (Poster)
The significant success of AlphaFold2 at protein structure prediction has pointed to structural ensembles as the next frontier towards a more complete computational understanding of protein structure. At the same time, iterative refinement-based techniques such as diffusion have driven significant breakthroughs in generative modeling. We explore the synergy of these developments by combining highly accurate protein structure prediction models with flow matching, a powerful modern generative modeling framework, in order to sample the conformational landscape of proteins. Preliminary results on membrane transporters, ligand-induced conformational change, and disordered ensembles show the potential of the approach. Importantly, and unlike MSA-based methods, our method also obtains similar distributions even when used with language model-based algorithms such as ESMFold, which are otherwise deterministic given an input sequence. These results open exciting avenues in the computational prediction of conformational flexibility.
Bowen Jing · Dr. Bonnie Berger · Tommi Jaakkola
AlphaFold Meets Flow Matching for Generating Protein Ensembles (Oral; abstract as in the poster entry above)
Bowen Jing · Dr. Bonnie Berger · Tommi Jaakkola
The Discovery of Binding Modes Requires Rethinking Docking Generalization (Poster)
Accurate blind docking has the potential to lead to new biological breakthroughs, but for this promise to be realized, it is critical that docking methods generalize well across the proteome. However, existing benchmarks fail to rigorously assess generalizability. Therefore, we develop DockGen, a new benchmark based on the ligand-binding domains of proteins, and we show that machine learning-based docking models have very weak generalization abilities even when combined with various data augmentation strategies. Instead, we propose Confidence Bootstrapping, a new training paradigm that solely relies on the interaction between a diffusion and a confidence model. Unlike previous self-training methods from other domains, we directly exploit the multi-resolution generation process of diffusion models using rollouts and confidence scores to reduce the generalization gap. We demonstrate that Confidence Bootstrapping significantly improves the ability of ML-based docking methods to dock to unseen protein classes.
Gabriele Corso · Arthur Deng · Nicholas Polizzi · Regina Barzilay · Tommi Jaakkola
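The Confidence Bootstrapping loop described in this abstract can be sketched as a short self-training skeleton. Everything below is a hypothetical stand-in: `diffusion_rollout` and `confidence` are random stubs for the trained diffusion and confidence models, and the threshold and round count are arbitrary; the real method fine-tunes the diffusion model between rounds.

```python
import random

random.seed(0)  # make the toy run reproducible

def diffusion_rollout(target):
    """Stub for the diffusion docking model: emit candidate poses for a
    target, each tagged with a number we will treat as its confidence."""
    return [("pose", target, random.random()) for _ in range(8)]

def confidence(pose):
    """Stub for the learned confidence model."""
    return pose[2]

def bootstrap(targets, threshold=0.8, rounds=3):
    """Roll out poses on unseen targets, keep only those the confidence
    model trusts, and accumulate them as a fine-tuning set. In the real
    method the diffusion model is retrained on this set each round,
    shrinking the generalization gap to unseen protein classes."""
    finetune_set = []
    for _ in range(rounds):
        for target in targets:
            finetune_set += [p for p in diffusion_rollout(target)
                             if confidence(p) >= threshold]
    return finetune_set

kept = bootstrap(["unseen_protein_class"])
```

The point of the sketch is the feedback structure: no ground-truth poses appear anywhere; the confidence model alone decides which rollouts become training signal.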
Conformational sampling and interpolation using language-based protein folding neural networks (Poster)
Protein language models (PLMs), such as ESM2, learn a rich semantic grammar of the protein sequence space. When coupled to protein folding neural networks (e.g., ESMFold), they can facilitate the prediction of tertiary and quaternary protein structures at high accuracy. However, they are limited to modeling protein structures in single states. This manuscript demonstrates that ESMFold can predict alternate conformations of some proteins, including de novo designed proteins. Randomly masking the sequence prior to PLM input returned alternate embeddings that ESMFold sometimes mapped to distinct physiologically relevant conformations. From there, inversion of the ESMFold trunk facilitated the generation of high-confidence interconversion paths between the two states. These paths provide a deeper glimpse of how language-based protein folding neural networks derive structural information from high-dimensional sequence representations, while exposing limitations in their general understanding of protein structure and folding.
Diego del Alamo · Jeliazko Jeliazkov · Daphne Truan · Joel Karpiak
FLIGHTED: Inferring Fitness Landscapes from Noisy High-Throughput Experimental Data (Poster)
Machine learning (ML) for protein design frequently requires large datasets of protein fitness generated by high-throughput experiments, and many ML models use these datasets for training, fine-tuning, and benchmarking. However, these approaches do not account for underlying experimental noise, potentially making their conclusions inaccurate. In this work, we present FLIGHTED (Fitness Landscape Inference Generated by High-Throughput Experimental Data), a Bayesian method for generating fitness landscapes with calibrated errors from noisy high-throughput experimental data. We apply FLIGHTED to datasets generated by single-step enrichment-based selection assays such as Fluorescence-Activated Cell Sorting (FACS) and phage display and to data from a novel high-throughput assay DHARMA (direct high-throughput activity recording and measurement assay) that ties fitness to base editing activity. Our results suggest that de-noising single-step selection data generates well-calibrated predictions that are sufficient to change which models perform best in benchmarking studies. Applying FLIGHTED to DHARMA provides more accurate fitness measurements with better calibrated errors; FLIGHTED-DHARMA can be used to generate large protein fitness datasets with up to 10^6 variants. FLIGHTED can be used on any high-throughput assay and makes it easy for ML scientists to account for experimental noise when modeling protein fitness.
Vikram Sundar · Boqiang Tu · Lindsey Guan · Kevin Esvelt
Contrasting Sequence with Structure: Pre-training Graph Representations with PLMs (Poster)
Understanding protein function is vital for drug discovery, disease diagnosis, and protein engineering. While Protein Language Models (PLMs) pre-trained on vast protein sequence datasets have achieved remarkable success, equivalent Protein Structure Models (PSMs) remain underrepresented. We attribute this to the relative lack of high-confidence structural data and suitable pre-training objectives. In this context, we introduce BioCLIP, a contrastive learning framework that pre-trains PSMs by leveraging PLMs, generating meaningful per-residue and per-chain structural representations. When evaluated on tasks such as protein-protein interaction, Gene Ontology annotation, and Enzyme Commission number prediction, BioCLIP-trained PSMs consistently outperform models trained from scratch and further enhance performance when merged with sequence embeddings. Notably, BioCLIP approaches, or exceeds, specialized methods across all benchmarks using its singular pre-trained design. Our work addresses the challenges of obtaining quality structural data and designing self-supervised objectives, setting the stage for more comprehensive models of protein function. Source code is publicly available.
Louis Robinson · Timothy Atkinson · Liviu Copoiu · Patrick Bordes · Thomas PIERROT · Thomas Barrett
Target-Aware Variational Auto-Encoders for Ligand Generation with Multi-Modal Protein Modeling (Poster)
Without knowledge of specific pockets, generating ligands based on the global structure of a protein target plays a crucial role in drug discovery as it helps reduce the search space for potential drug-like candidates in the pipeline. However, contemporary methods require optimizing tailored networks for each protein, which is arduous and costly. To address this issue, we introduce TargetVAE, a target-aware variational auto-encoder that generates ligands with high binding affinities to arbitrary protein targets, guided by a novel prior network that learns from entire protein structures. We showcase the superiority of our approach by conducting extensive experiments and evaluations, including the assessment of generative model quality, ligand generation for unseen targets, docking score computation, and binding affinity prediction. Empirical results demonstrate the promising performance of our proposed approach. Our source code in PyTorch is publicly available at https://github.com/HySonLab/Ligand_Generation
Khang Ngo · Truong Son Hy
DSMBind: an unsupervised generative modeling framework for binding energy prediction (Poster)
Predicting the binding between proteins and other molecules is a core question in biology. Geometric deep learning is a promising paradigm for protein-ligand or protein-protein binding energy prediction, but its accuracy is limited by the size of training data as high-throughput binding assays are expensive. Unsupervised learning, such as protein language models, is particularly useful in this setting because it does not need experimental binding energy data for training. In this work, we propose DSMBind, a new generative modeling framework for protein complex structures, and show that the likelihood of crystal structures is highly correlated with their binding energy. Specifically, DSMBind learns an energy-based model from a training set of unlabeled crystal structures via SE(3) denoising score matching (DSM), where we perturb a protein complex via random rotation of backbone and side-chains. We find the learned energy is highly correlated with experimental binding affinity across multiple benchmarks, including protein-ligand binding, antibody-antigen binding, and protein-protein binding mutation effect prediction. DSMBind not only outperforms unsupervised learning methods based on protein language models or inverse folding, but also matches the performance of state-of-the-art supervised models trained on experimental binding data.
Wengong Jin · Caroline Uhler · Nir HaCohen
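The corruption step of the SE(3) denoising scheme, perturbing a structure by a random rigid rotation about its center of mass, can be sketched with NumPy. The energy model and score-matching loss are omitted, and the rotation-sampling recipe below (QR decomposition of a Gaussian matrix) is one standard choice, not necessarily the authors' implementation.

```python
import numpy as np

def random_rotation(rng):
    """Draw a random 3D rotation matrix: QR-decompose a Gaussian matrix,
    fix the signs so diag(R) > 0, then flip a column if needed so that
    det(Q) = +1 (a proper rotation, no reflection)."""
    Q, R = np.linalg.qr(rng.normal(size=(3, 3)))
    Q *= np.sign(np.diag(R))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1
    return Q

def perturb_rigid(coords, rng):
    """DSM-style corruption: rotate a chain rigidly about its center of
    mass, leaving all internal geometry (bond lengths, angles) intact."""
    center = coords.mean(axis=0)
    return (coords - center) @ random_rotation(rng).T + center

rng = np.random.default_rng(0)
chain = rng.normal(size=(10, 3))   # toy backbone coordinates
noisy = perturb_rigid(chain, rng)  # same shape, rigidly rotated
```

Training then asks an energy model to point from `noisy` back towards `chain`; because the corruption is rigid, the model must learn inter-chain packing rather than internal geometry.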
DSMBind: an unsupervised generative modeling framework for binding energy prediction (Oral; abstract as in the poster entry above)
Wengong Jin · Caroline Uhler · Nir HaCohen
-
|
Fast non-autoregressive inverse folding with discrete diffusion
(
Poster
)
>
Generating protein sequences that fold into an intended 3D structure is a fundamental step in de novo protein design. De facto methods utilize autoregressive generation, but this eschews higher-order interactions that could be exploited to improve inference speed. We describe a non-autoregressive alternative that performs inference using a constant number of calls, resulting in a 23-fold speed-up without a loss in performance on the CATH benchmark. Conditioned on the 3D structure, we fine-tune ProteinMPNN to perform discrete diffusion with a purity prior over the index sampling order. Our approach offers the flexibility to trade off inference speed and accuracy by modulating the diffusion speed. |
John Yang · Jason Yim · Tommi Jaakkola · Regina Barzilay 🔗 |
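The purity-ordered, non-autoregressive sampling idea can be sketched as follows: all positions start masked, and at each round the positions whose predicted distributions are most confident ("purest") are committed in parallel. This toy uses random logits in place of a structure-conditioned network; `purity_decode` and `toy_logits` are hypothetical names, not the paper's code.

```python
import numpy as np

AA = list("ACDEFGHIKLMNPQRSTVWY")

def toy_logits(seq, rng):
    """Stand-in for a structure-conditioned network's per-position logits."""
    return rng.normal(size=(len(seq), len(AA)))

def purity_decode(length: int, steps: int, seed: int = 0) -> str:
    """Fill all positions in ~`steps` parallel rounds, committing the
    highest max-probability ("purest") positions first."""
    rng = np.random.default_rng(seed)
    seq = ["_"] * length
    per_round = int(np.ceil(length / steps))
    while "_" in seq:
        probs = np.exp(toy_logits(seq, rng))
        probs /= probs.sum(-1, keepdims=True)
        purity = probs.max(-1)
        # already-committed positions are never revisited
        purity[[i for i, a in enumerate(seq) if a != "_"]] = -np.inf
        for i in np.argsort(-purity)[:per_round]:
            if seq[i] == "_":
                seq[i] = AA[int(probs[i].argmax())]
    return "".join(seq)

designed = purity_decode(length=16, steps=4)
```

The number of network calls is fixed by `steps` rather than growing with sequence length, which is the source of the speed-up the abstract describes.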
-
|
TopoDiff: Improve Protein Backbone Generation with Topology-aware Latent Encoding
(
Poster
)
>
The de novo design of protein structures is an intriguing research topic in the field of protein engineering. Recent breakthroughs in diffusion-based generative models have demonstrated substantial promise in generating diverse and realistic protein structures. Nevertheless, while existing models focus either on unconditional generation or on fine-grained conditioning at the residue level, a holistic, top-down approach to controlling the overall topological arrangement is still lacking. In response, we introduce TopoDiff, a diffusion-based framework augmented by a topology encoding module, which learns, without supervision, a compact latent representation of natural protein topologies with interpretable characteristics, and simultaneously harnesses this learned information for controllable protein structure generation. We also propose a novel metric specifically designed to assess the coverage of sampled proteins with respect to the natural protein space. In comparative analyses with existing models, our generative model not only demonstrates comparable performance on established metrics but also exhibits better coverage across the recognized topology landscape. In summary, TopoDiff emerges as a novel solution for enhancing the controllability and comprehensiveness of de novo protein structure generation, presenting new possibilities for innovative applications in protein engineering and beyond. |
Yuyang Zhang · Zinnia Ma · Haipeng Gong 🔗 |
-
|
Harmonic Prior Self-conditioned Flow Matching for Multi-Ligand Docking and Binding Site Design
(
Poster
)
>
Many protein functions, including enzymatic catalysis, require binding small molecules. As such, designing binding pockets for small molecules has several impactful applications ranging from drug synthesis to energy storage. Towards this goal, we first develop HarmonicFlow, an improved generative process over 3D protein-ligand binding structures based on our self-conditioned flow matching objective. FlowSite extends this flow model to jointly generate a protein pocket's discrete residue types and the molecule's binding 3D structure. We show that HarmonicFlow improves upon state-of-the-art generative processes for docking in simplicity, generality, and performance. Enabled by this structure model, FlowSite designs binding sites substantially better than baseline approaches and provides the first general solution for binding site design. |
Hannes Stärk · Bowen Jing · Regina Barzilay · Tommi Jaakkola 🔗 |
-
|
CrysFormer: Protein Crystallography Prediction via 3d Patterson Maps and Partial Structure Attention
(
Poster
)
>
Determining the structure of a protein has been a decades-long open question. Computing a protein's three-dimensional structure with classical simulation algorithms often incurs nontrivial costs. Advances in the transformer neural network architecture have achieved significant improvements on this problem by learning from large datasets of sequence information and corresponding protein structures. Yet such methods often focus only on sequence information; other available prior knowledge, such as protein crystallography data and the partial structures of amino acids, could potentially be utilized. To the best of our knowledge, we propose the first transformer-based model that directly utilizes protein crystallography and partial structure information to predict the electron density maps of proteins. Via two new datasets of peptide fragments (2-residue and 15-residue), we demonstrate that our method, dubbed CrysFormer, achieves accurate predictions from a much smaller dataset and at reduced computational cost. |
Chen Dun · Tom Pan · Shikai Jin · Ria Stevens · Mitchell D. Miller · George Phillips · Anastasios Kyrillidis 🔗 |
-
|
PoseCheck: Generative Models for 3D Structure-based Drug Design Produce Unrealistic Poses
(
Poster
)
>
Deep generative models for structure-based drug design (SBDD), where molecule generation is conditioned on a 3D protein pocket, have received considerable interest in recent years. These methods offer the promise of higher-quality molecule generation by explicitly modelling the 3D interaction between a potential drug and a protein receptor. However, previous work has primarily focused on the quality of the generated molecules themselves, with limited evaluation of the 3D poses that these methods produce; most work simply discards the generated pose and reports only a "corrected" pose after redocking with traditional methods. Little is known about whether generated molecules satisfy known physical constraints for binding and the extent to which redocking alters the generated interactions. We introduce PoseCheck, an extensive analysis of multiple state-of-the-art methods, and find that generated molecules have significantly more physical violations and fewer key interactions compared to baselines, calling into question the implicit assumption that providing rich 3D structure information improves molecule complementarity. We make recommendations for future research tackling the identified failure modes and hope our benchmark will serve as a springboard for future SBDD generative modelling work to have a real-world impact. Our evaluation suite is easy to use in future 3D SBDD work and is available at https://anonymous.4open.science/r/posecheck-358E. |
Charles Harris · Kieran Didi · Arian Jamasb · Chaitanya Joshi · Simon Mathis · Pietro Lió · Tom Blundell 🔗 |
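One class of physical violations discussed above, steric clashes between a generated ligand pose and the protein, can be checked with a simple distance criterion. This is an illustrative sketch with an assumed clash definition (overlap of van der Waals radii beyond a tolerance), not PoseCheck's exact metric.

```python
import numpy as np

def count_clashes(lig_xyz, prot_xyz, lig_radii, prot_radii, tolerance=0.5):
    """Count ligand-protein atom pairs whose distance falls below the sum of
    their van der Waals radii minus a tolerance (all values in Angstroms)."""
    d = np.linalg.norm(lig_xyz[:, None, :] - prot_xyz[None, :, :], axis=-1)
    cutoff = lig_radii[:, None] + prot_radii[None, :] - tolerance
    return int((d < cutoff).sum())

# Toy example: a carbon 1.0 A from a protein carbon clearly clashes;
# 1.7 A is the approximate vdW radius of carbon.
lig = np.array([[0.0, 0.0, 0.0]])
prot = np.array([[1.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
n_clashes = count_clashes(lig, prot, np.array([1.7]), np.array([1.7, 1.7]))
```

Running such a check on poses both before and after redocking is one way to quantify how much redocking alters the generated interactions.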
-
|
PoseCheck: Generative Models for 3D Structure-based Drug Design Produce Unrealistic Poses
(
Oral
)
>
Charles Harris · Kieran Didi · Arian Jamasb · Chaitanya Joshi · Simon Mathis · Pietro Lió · Tom Blundell 🔗 |
-
|
Sampling Protein Language Models for Functional Protein Design
(
Poster
)
>
Protein language models have emerged as powerful ways to learn complex representations of proteins, thereby improving their performance on several downstream tasks, from structure prediction to fitness prediction, property prediction, homology detection, and more. By learning a distribution over protein sequences, they are also very promising tools for designing novel and functional proteins, with broad applications in healthcare, new materials, and sustainability. Given the vastness of the corresponding sample space, efficient exploration methods are critical to the success of protein engineering efforts. However, the methodologies for adequately sampling these models to achieve core protein design objectives remain underexplored and have predominantly leaned on techniques developed for Natural Language Processing. In this work, we first develop a holistic in silico protein design evaluation framework to comprehensively compare different sampling methods. After performing a thorough review of sampling methods for language models, we introduce several sampling strategies tailored to protein design. Lastly, we compare the various strategies on our in silico benchmark, investigating the effects of key hyperparameters and providing practical guidance on the relative strengths of different methods. |
Jeremie Theddy Darmawan · Yarin Gal · Pascal Notin 🔗 |
-
|
A framework for conditional diffusion modelling with applications in protein design
(
Poster
)
>
Many protein design applications, such as binder or enzyme design, require scaffolding a structural motif with high precision. Generative modelling paradigms based on denoising diffusion processes have emerged as a leading candidate for this motif scaffolding problem and have shown early experimental success in some cases. In the diffusion paradigm, motif scaffolding is treated as a conditional generation task, and several conditional generation protocols have been proposed or imported from the Computer Vision literature. However, most of these protocols are motivated heuristically, e.g. via analogies to Langevin dynamics, and lack a unifying framework, obscuring connections between the different approaches. In this work, we unify conditional training and conditional sampling procedures under one common framework based on the mathematically well-understood Doob's h-transform. This new perspective allows us to draw connections between existing methods and to propose a new conditional training protocol. We illustrate the effectiveness of this new protocol in both image outpainting and motif scaffolding, and find that it outperforms standard methods. |
Kieran Didi · Francisco Vargas · Simon Mathis · Vincent Dutordoir · Emile Mathieu · Urszula Julia Komorowska · Pietro Lió 🔗 |
-
|
DiffRNAFold: Generating RNA Tertiary Structures with Latent Space Diffusion
(
Poster
)
>
RNA molecules provide an exciting frontier for novel therapeutics. Accurate determination of RNA structure could accelerate development of therapeutics through an improved understanding of function. However, the extremely large conformation space has kept the RNA 3D structure space largely unresolved. Using recent advances in generative modeling, we propose DiffRNAFold, a latent space diffusion model for RNA tertiary structure design. Our preliminary results suggest that DiffRNAFold generated molecules are similar in 3D space to true RNA molecules, providing an important first step towards accurate structure and function prediction in vivo. |
Mihir Bafna · Vikranth Keerthipati · Subhash Kanaparthi · Ruochi Zhang 🔗 |
-
|
Pair-EGRET: Enhancing the prediction of protein-protein interaction sites through graph attention networks and protein language models
(
Poster
)
>
Proteins are responsible for most biological functions, many of which require the interaction of more than one protein molecule. However, predicting protein-protein interaction (PPI) sites (the interfacial residues of a protein that interact with other protein molecules) remains a challenge. The growing demand and cost associated with the reliable identification of PPI sites using conventional experimental methods call for computational tools for automated prediction and understanding of PPIs. Here, we present Pair-EGRET, an edge-aggregated graph attention network that leverages the features extracted from pre-trained transformer-like models to accurately predict pairwise protein-protein interaction sites. Pair-EGRET works on a k-nearest neighbor graph, representing the three-dimensional structure of a protein, and utilizes the cross-attention mechanism on top of a siamese network to accurately identify interfacial residues of a pair of proteins. Through an extensive evaluation study using a diverse array of experimental data, evaluation metrics, and case studies on representative protein sequences, we find that our method outperforms other state-of-the-art methods for predicting PPI sites. Moreover, Pair-EGRET can provide interpretable insights from the learned cross-attention matrix. Pair-EGRET is freely available at https://github.com/1705004/Pair-EGRET. |
Ramisa Alam · Sazan Mahbub · Md. Shamsuzzoha Bayzid 🔗 |
-
|
FlexiDock: Compositional diffusion models for flexible molecular docking
(
Poster
)
>
Molecular docking is a critical process in structure-based drug discovery for predicting the binding conformations between a protein and a small-molecule ligand. Recently, deep learning-based methods have achieved promising performance over traditional physics-based search-and-score methods. Despite their success at accurately predicting the binding poses of small-molecule ligands, modeling of protein flexibility and dynamics remains largely unexplored for docking. We observe that models that do not account for protein flexibility suffer a large performance drop in cases where proteins undergo large conformational changes upon ligand binding. To address this gap, we developed FlexiDock, a compositional alternating neural diffusion process, which includes two diffusion models to explicitly model the conformational flexibility of proteins and ligands, respectively. The compositional diffusion process is inspired by the induced-fit model in flexible docking. We find that compositional diffusion improves the structural prediction of proteins upon ligand binding. Our method also offers promising insights into modeling proteins' conformational switches. |
Zichen Wang · Balasubramaniam Srinivasan · Zhengyuan Shen · George Karypis · Huzefa Rangwala 🔗 |
-
|
In vitro validated antibody design against multiple therapeutic antigens using generative inverse folding
(
Poster
)
>
Deep learning approaches have demonstrated the ability to design protein sequences given backbone structures. While these approaches have been applied in silico to designing antibody complementarity-determining regions (CDRs), they have yet to be validated in vitro for designing antibody binders, which is the true measure of success for antibody design. Here we describe IgDesign, a deep learning method for antibody CDR design, and demonstrate its robustness with successful binder design for 8 therapeutic antigens. The model is tasked with designing heavy chain CDR3 (HCDR3) or all three heavy chain CDRs (HCDR123) using native backbone structures of antibody-antigen complexes, along with the antigen and antibody framework (FWR) sequences as context. For each of the 8 antigens, we design 100 HCDR3s and 100 HCDR123s, scaffold them into the native antibody's variable region, and screen them for binding against the antigen using surface plasmon resonance (SPR). As a baseline, we screen 100 HCDR3s taken from the model's training set and paired with the native HCDR1 and HCDR2. We observe that both HCDR3 design and HCDR123 design outperform this HCDR3-only baseline. IgDesign is the first experimentally validated antibody inverse folding model. It can design antibody binders to multiple therapeutic antigens with high success rates and, in some cases, improved affinities over clinically validated reference antibodies. Antibody inverse folding has applications to both de novo antibody design and lead optimization, making IgDesign a valuable tool for accelerating drug development and enabling therapeutic design. |
Amir Shanehsazzadeh 🔗 |
-
|
Evaluating Zero-Shot Scoring for In Vitro Antibody Binding Prediction with Experimental Validation
(
Poster
)
>
The success of therapeutic antibodies relies on their ability to selectively bind antigens. AI-based antibody design protocols have shown promise in generating epitope-specific designs. Many of these protocols use an inverse folding step to generate diverse sequences given a backbone structure. Due to prohibitive screening costs, it is key to identify candidate sequences likely to bind in vitro. Here, we compare the efficacy of 8 common scoring paradigms based on open-source models to classify antibody designs as binders or non-binders. We evaluate these approaches on a novel surface plasmon resonance (SPR) dataset spanning 5 antigens. Our results show that existing methods struggle to detect binders, and performance is highly variable across antigens. We find that metrics computed on flexibly docked antibody-antigen complexes are more robust, and ensemble scores are more consistent than individual metrics. We provide experimental insight to analyze current scoring techniques, highlighting that the development of robust, zero-shot filters is an important research gap. |
Divya Nori · Simon Mathis · Amir Shanehsazzadeh 🔗 |
-
|
PDB-Struct: A Comprehensive Benchmark for Structure-based Protein Design
(
Poster
)
>
Structure-based protein design has attracted increasing interest, with numerous methods being introduced in recent years. However, a universally accepted method for evaluation has not been established, since wet-lab validation can be overly time-consuming for the development of new algorithms, and in silico validation with recovery and perplexity metrics is efficient but may not precisely reflect true foldability. To address this gap, we introduce two novel metrics: a refoldability-based metric, which leverages high-accuracy protein structure prediction models as a proxy for wet-lab experiments, and a stability-based metric, which assesses whether models can assign high likelihoods to experimentally stable proteins. We curate datasets from high-quality CATH protein data as well as high-throughput de novo protein design and mutagenesis experiments, and in doing so, present the PDB-Struct benchmark that evaluates both recent and previously uncompared protein design methods. Experimental results indicate that ByProt, ProteinMPNN, and ESM-IF perform exceptionally well on our benchmark, while ESM-Design and AF-Design fall short on the refoldability metric. We also show that while some methods exhibit high sequence recovery, they do not perform as well on our new benchmark. Our proposed benchmark paves the way for a fair and comprehensive evaluation of protein design methods in the future. The source code will be released upon acceptance. |
Chuanrui WANG · Bozitao Zhong · Zuobai Zhang · Narendra Chaudhary · Sanchit Misra · Jian Tang 🔗 |
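A refoldability-style check compares the designed backbone against the structure predicted for the designed sequence, which requires an optimal rigid superposition before measuring RMSD. The superposition step can be sketched with the Kabsch algorithm; the structure predictor itself (e.g. an ESMFold or AlphaFold call) is omitted, and the toy coordinates below merely stand in for real backbones.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two coordinate sets after optimal rigid superposition
    (Kabsch algorithm): center both, find the best proper rotation via SVD."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T   # enforce det(R) = +1 (no reflection)
    return float(np.sqrt(((P @ R.T - Q) ** 2).sum() / len(P)))

# A "refolded" structure that is a rotated, translated copy of the target
# should superpose perfectly (RMSD ~ 0).
rng = np.random.default_rng(0)
target = rng.normal(size=(20, 3))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
refolded = target @ rot.T + np.array([1.0, -2.0, 3.0])
rmsd = kabsch_rmsd(target, refolded)
```

A design would then count as "refoldable" if this RMSD (or a TM-score) against the predicted structure clears some threshold; the threshold choice is benchmark-specific.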
-
|
Optimizing protein language models with Sentence Transformers
(
Poster
)
>
Protein language models (pLMs) have appeared in a wide range of in silico protein engineering tasks and have shown impressive results. However, the ways they are applied remain mostly standardised. Here, we introduce a set of fine-tuning techniques based on Sentence Transformers (STs), integrated with a novel data augmentation procedure, and show how they can offer new state-of-the-art performance. Despite having initially been developed in the classic NLP space, STs hold a natural appeal for pLM-related applications, largely due to their use of sequence pairs and triplets. We demonstrate this conceptual approach in two different settings that frequently occur in this domain: a residue-level and a sequence-level prediction task. Apart from showing how these tools can extract more and higher-quality information from pLMs, we discuss the main differences between their applications in NLP and in the protein space. We conclude by discussing the related challenges and provide a comprehensive outlook on potential applications. |
Istvan Redl 🔗 |
-
|
DiffDock-Pocket: Diffusion for Pocket-Level Docking with Sidechain Flexibility
(
Poster
)
>
When a small molecule binds to a protein, the 3D structure of the protein and its function change. Understanding this process, called molecular docking, can be crucial in areas such as drug design. Recent learning-based attempts have shown promising results at this task, yet lack features that traditional approaches support. In this work, we close this gap by proposing DiffDock-Pocket, a diffusion-based docking algorithm that is conditioned on a binding target to predict ligand poses only in a specific binding pocket. On top of this, our model supports receptor flexibility and predicts the position of sidechains close to the binding site. Empirically, we improve the state of the art in site-specific docking on the PDBBind benchmark. Especially when using in silico-generated structures, we achieve more than twice the performance of current methods while being more than 20 times faster than other flexible approaches. Although the model was not trained for cross-docking to different structures, it yields competitive results in this task. |
Michael Plainer · Marcella Toth · Simon Dobers · Hannes Stärk · Gabriele Corso · Céline Marquet · Regina Barzilay 🔗 |
-
|
DiffDock-Pocket: Diffusion for Pocket-Level Docking with Sidechain Flexibility
(
Oral
)
>
Michael Plainer · Marcella Toth · Simon Dobers · Hannes Stärk · Gabriele Corso · Céline Marquet · Regina Barzilay 🔗 |
-
|
Transition Path Sampling with Boltzmann Generator-based MCMC Moves
(
Poster
)
>
Sampling all possible transition paths between two 3D states of a molecular system has various applications ranging from catalyst design to drug discovery. Current approaches to sample transition paths use Markov chain Monte Carlo and rely on time-intensive molecular dynamics simulations to find new paths. Our approach operates in the latent space of a normalizing flow that maps from the molecule's Boltzmann distribution to a Gaussian, where we propose new paths without requiring molecular simulations. Using alanine dipeptide, we explore Metropolis-Hastings acceptance criteria in the latent space for exact sampling and investigate different latent proposal mechanisms. |
Michael Plainer · Hannes Stärk · Charlotte Bunne · Stephan Günnemann 🔗 |
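The latent-space proposal scheme above can be sketched with a generic random-walk Metropolis-Hastings sampler. In the real method the target density is the molecule's Boltzmann distribution pulled back through the normalizing flow; here a standard normal stands in for that latent density, and the function names are illustrative.

```python
import numpy as np

def metropolis_hastings(log_prob, z0, n_steps, step, rng):
    """Random-walk Metropolis-Hastings; returns the chain and acceptance rate."""
    z, lp = z0, log_prob(z0)
    chain, accepted = [z0], 0
    for _ in range(n_steps):
        proposal = z + step * rng.normal(size=z.shape)
        lp_prop = log_prob(proposal)
        # symmetric proposal, so the acceptance ratio is just the density ratio
        if np.log(rng.uniform()) < lp_prop - lp:
            z, lp = proposal, lp_prop
            accepted += 1
        chain.append(z)
    return np.array(chain), accepted / n_steps

# Toy stand-in for the flow's latent density: a standard normal.
log_latent = lambda z: -0.5 * float(np.sum(z ** 2))
rng = np.random.default_rng(0)
chain, acc_rate = metropolis_hastings(log_latent, np.zeros(2), 4000, 0.8, rng)
```

Because proposals are made in the latent space, no molecular dynamics simulation is needed to generate a candidate path; only the accept/reject step consults the (pulled-back) target density.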
-
|
Pre-training Sequence, Structure, and Surface Features for Comprehensive Protein Representation Learning
(
Poster
)
>
Proteins can be represented in various ways, including their sequences, 3D structures, and surfaces. While recent studies have successfully employed sequence- or structure-based representations to address multiple tasks in protein science, there has been significant oversight in incorporating protein surface information, a critical factor for protein function. In this paper, we present a pre-training strategy that incorporates information from protein sequences, 3D structures, and surfaces to improve protein representation learning. Specifically, we utilize Implicit Neural Representations (INRs) for learning surface characteristics, naming the resulting model ProteinINR. We confirm that ProteinINR successfully reconstructs protein surfaces, and we integrate this surface learning into the existing pre-training strategy for sequences and structures. Our results demonstrate that our approach enhances performance on various downstream tasks, underscoring the importance of including surface attributes in protein representation learning. |
Youhan Lee · Hasun Yu · Jaemyung Lee · Jaehoon Kim 🔗 |
-
|
Inpainting Protein Sequence and Structure with ProtFill
(
Poster
)
>
Designing new proteins with specific binding capabilities is a challenging task that has the potential to revolutionize many fields, including medicine and materials science. Here we introduce ProtFill, a unified method for simultaneous protein structure and sequence design. Distinct from most existing computational design frameworks, which focus on either structure or sequence design, our method embraces both representations concurrently. Employing an SE(3)-equivariant diffusion graph neural network, our method excels in both sequence prediction and structure recovery. We demonstrate the model's applicability to interface redesign for antibodies as well as other proteins, underscoring the efficacy of our approach and the potential of the diffusion framework in protein design. The code is available at https://anonymous.4open.science/r/ProtFill-1234/.
|
Elizaveta Kozlova · Daniel Nakhaee-Zadeh Gutierrez · Arthur Valentin 🔗 |
-
|
Investigating Protein-DNA Binding Energetic of Mismatched DNA
(
Poster
)
>
Transcription Factors (TFs) bind to regulatory DNA regions, modulating gene expression. Although various high-throughput techniques have been used to characterize protein binding preferences, this work is the first to extend these studies to non-canonical mismatched bases. The mutagenesis study presented here allows us to determine the binding profile across the double-stranded DNA sequence. Additionally, we leverage deep learning to complete the pairwise interaction map. In this context, we introduce ShapPWM, a motif strategy that marginalizes individual nucleotide contributions by computing Shapley values. Our model reveals that highly synergistic interactions appear between nucleotides in the flanking regions of the contacts. This information offers valuable insights into the binding mechanism and reaction energy without the necessity of solving intricate crystal structures. |
Ruben Solozabal · Tamir Avioz · Yunxiang LI · Le Song · Martin Takac · Ariel Afek 🔗 |
-
|
AntiFold: Improved antibody structure design using inverse folding
(
Poster
)
>
The design and optimization of antibodies, important therapeutic agents, requires an intricate balance across multiple properties. A primary challenge in optimization is ensuring that introduced sequence mutations do not disrupt the antibody structure or target binding mode. Protein inverse folding models, which predict diverse sequences that fold into the same structure, are promising for maintaining structural integrity during optimization. Here we present AntiFold, an inverse folding model developed for solved and predicted antibody structures, based on the ESM-IF1 model. AntiFold achieves large gains in performance versus existing inverse folding models on sequence recovery, across antibody complementarity determining regions and framework regions. AntiFold-generated sequences show high structural agreement between predicted and experimental structures. The tool efficiently samples hundreds of antibody structures per minute, providing a scalable solution for antibody design. AntiFold is freely available for academic use at: https://opig.stats.ox.ac.uk/data/downloads/AntiFold. |
Alissa M Hummer · Magnus H Høie · Tobias Olsen · Morten Nielsen · Charlotte Deane 🔗 |
-
|
Improved B-cell epitope prediction using AlphaFold2 modeling and inverse folding latent representations
(
Poster
)
>
Accurate computational identification of B-cell epitopes is crucial for the development of vaccines, therapies, and diagnostic tools. However, current structure-based prediction methods face limitations due to the dependency on experimentally solved structures. Here, we introduce a markedly improved B-cell epitope prediction tool that innovatively employs inverse folding structure representations and a positive-unlabelled learning strategy, and is explicitly adapted for both solved and predicted structures. Our tool demonstrates a considerable improvement in performance over existing methods, accurately predicting linear and conformational epitopes across multiple independent datasets. Most notably, it maintains high predictive performance across solved, relaxed and predicted structures, alleviating the need for experimental validation and extending the general applicability of accurate B-cell epitope prediction by more than 3 orders of magnitude. |
Paolo Marcatili 🔗 |
-
|
Combining Structure and Sequence for Superior Fitness Prediction
(
Poster
)
>
Deep generative models of protein sequence and inverse folding models have shown great promise as protein design methods. While sequence-based models have shown strong zero-shot mutation effect prediction performance, inverse folding models have not been extensively characterized in this way. As these models use information from protein structures, it is likely that inverse folding models possess inductive biases that make them better predictors of certain function types. Using the collection of model scores contained in the newly updated ProteinGym, we systematically explore the differential zero-shot predictive power of sequence and inverse folding models. We find that inverse folding models consistently outperform the best-in-class sequence models on assays of protein thermostability, but have lower performance on other properties. Motivated by these findings, we develop StructSeq, an ensemble model combining information from sequence, multiple sequence alignments (MSAs), and structure. StructSeq achieves state-of-the-art Spearman correlation on ProteinGym and is robust to different functional assay types. |
Steffanie Paul · Pascal Notin · Aaron Kollasch · Debora Marks 🔗 |
-
|
Epitope-specific antibody design using diffusion models on the latent space of ESM embeddings
(
Poster
)
>
There has been significant progress in protein design using deep learning approaches. The majority of methods predict sequences for a given structure. Recently, diffusion approaches were developed for generating protein backbones. However, de novo design of epitope-specific antibody binders remains an unsolved problem due to the challenge of simultaneously optimizing the antibody sequence, variable loop structures, and antigen binding. Here we present EAGLE (Epitope-specific Antibody Generation using Language model Embeddings), a diffusion-based model that does not require input backbone structures. The full antibody sequence (constant and variable regions) is designed in a continuous space using protein language model embeddings. Similarly to denoising diffusion probabilistic models for image generation that condition sampling on a text prompt, here we condition the sampling of antibody sequences on antigen structure and epitope amino acids. The model is trained on available antibody and antibody-antigen structures, as well as antibody sequences. Our top 100 designs include sequences with 55% identity to known binders for the most variable heavy chain loop. EAGLE's high performance is achieved by tailoring the method specifically to antibody design through the integration of continuous latent-space diffusion and sampling conditioned on antigen structure and epitope amino acids. Our model enables generating a wide range of diverse, unique, variable-loop-length antibody binders from straightforward epitope specifications. |
Tomer Cohen · Dina Schneidman 🔗 |
-
|
Protein language models learn evolutionary statistics of interacting sequence motifs
(
Poster
)
>
Protein language models (pLMs) have emerged as potent tools for predicting protein structures and designing proteins, yet it is unknown to what degree these models actually understand the inherent biophysics of protein structure. Motivated by the discovery that pLMs erroneously predict non-physical structure fragments for protein isoforms, we investigated the nature of the sequence context needed for contact predictions in ESM2 by developing a "categorical Jacobian" approach, which allows a completely unsupervised assessment of the coevolutionary signal stored in models, and by artificially modifying sequences. We found that pLMs make contact predictions conditioned on sequence motifs and the relative linear distance between segment pairs. Our investigation highlights the limitations of current pLMs and underscores the importance of understanding the underlying mechanisms of these models. |
Zhidian Zhang · Hannah Wayment-Steele · Garyk Brixi · Matteo Dal Peraro · Dorothee Kern · Sergey Ovchinnikov 🔗 |
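The "categorical Jacobian" idea above can be illustrated in a few lines: substitute every token at every position, record how the model's logits shift at all other positions, and reduce the resulting fourth-order tensor to a symmetric coupling map. The model interface and the Frobenius-norm/APC reduction below are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def categorical_jacobian(logits_fn, seq, n_tokens=20):
    """J[i, a, j, b] = logits(seq with i -> a)[j, b] - logits(seq)[j, b].

    logits_fn maps an integer sequence (L,) to per-position logits (L, n_tokens);
    a real use would wrap a pLM such as ESM2 here.
    """
    base = logits_fn(seq)                                # (L, n_tokens)
    J = np.zeros((len(seq), n_tokens) + base.shape)
    for i in range(len(seq)):
        for a in range(n_tokens):
            mutant = seq.copy()
            mutant[i] = a
            J[i, a] = logits_fn(mutant) - base
    return J

def contact_map(J):
    """Reduce to an L x L coupling matrix: Frobenius norm over both token
    axes, symmetrized, with average-product correction (APC)."""
    C = np.sqrt((J ** 2).sum(axis=(1, 3)))
    C = 0.5 * (C + C.T)
    apc = C.mean(0, keepdims=True) * C.mean(1, keepdims=True) / C.mean()
    return C - apc
```

With a toy `logits_fn` in which two positions are coupled, the largest off-diagonal entry of the map recovers that pair.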
-
|
Using artificial sequence coevolution to predict disulfide-rich peptide structures with experimental connectivity in AlphaFold
(
Poster
)
>
We present a novel approach for embedding contact information in AlphaFold to predict structures of disulfide-rich peptides (DRPs) with experimentally determined disulfide connectivity. While AlphaFold generates accurate DRP structure predictions in most cases, it sometimes fails to predict the specific connectivity pattern of the multiple disulfide bonds. Here, we take advantage of the principles of sequence coevolution to directly embed specific connectivity patterns within the MSA by mutating highly conserved cysteines in subsets of the MSA. This approach can be used to incorporate experimental disulfide connectivity patterns from mass spectrometry into DRP structure prediction. Lastly, after minimization of the predicted structures by molecular dynamics, we find that predicted DRP structures with native connectivity display more favorable peptide properties than those with non-native connectivities, suggesting our approach may be useful for determining the native connectivity of DRPs from sequence alone. |
Gabriella Gerlach · John Nicoludis 🔗 |
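One plausible reading of the MSA-editing step is sketched below: for each desired disulfide pair, co-mutate both cysteines in a random subset of alignment rows so the pair acquires an artificial covariation signal, while leaving the query sequence untouched. The function name, substitution residue, and subset fraction are illustrative assumptions, not the authors' protocol:

```python
import random

def embed_connectivity(msa, pairs, sub="A", frac=0.3, seed=0):
    """Inject artificial covariation into an MSA: for each cysteine pair
    (i, j) in the target connectivity, co-mutate both positions to `sub`
    in a random subset of non-query rows, so the pair carries a
    coevolution-like signal. Row 0 (the query) is never modified."""
    rng = random.Random(seed)
    rows = [list(s) for s in msa]
    for i, j in pairs:
        k = max(1, int(frac * (len(rows) - 1)))
        for r in rng.sample(range(1, len(rows)), k):
            rows[r][i] = sub
            rows[r][j] = sub
    return ["".join(r) for r in rows]
```

Because each pair is mutated jointly, the edited columns covary only with their intended partner, which is the signal coevolution-based predictors read.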
-
|
Preferential Bayesian Optimisation for Protein Design with Ranking-Based Fitness Predictors
(
Poster
)
>
Ranking-based loss functions have recently been shown to improve the quality of predictions of fitness landscapes for both standard supervised deep learning models and fine-tuned protein language models. We consider the implications of this finding for protein design with Bayesian optimisation. We investigate uncertainty quantification techniques applicable to protein language models fine-tuned with ranking losses, and show that they offer competitive calibration to CNN ensembles while demonstrating superior predictive performance. Finally, we demonstrate how uncertainty-aware ranking-based models can be exploited for protein design within the framework of preferential Bayesian optimisation. |
Alex Hawkins-Hooker · Paul Duckworth · Oliver Bent 🔗 |
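As background, a ranking-based loss of the kind referenced above can be written as a Bradley-Terry pairwise objective over measured fitness values; this is a generic sketch, not the authors' specific loss:

```python
import numpy as np

def pairwise_ranking_loss(scores, fitness):
    """Bradley-Terry pairwise objective: every pair (i, j) with
    fitness[i] > fitness[j] contributes -log sigmoid(scores[i] - scores[j]),
    so the model is trained only to order variants, not to regress values."""
    loss, n = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if fitness[i] > fitness[j]:
                loss += np.log1p(np.exp(-(scores[i] - scores[j])))
                n += 1
    return loss / max(n, 1)
```

Scores that order the variants correctly incur a per-pair loss below log 2; a reversed ordering incurs more.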
-
|
FAFormer: Frame Averaging Transformer for Predicting Nucleic Acid-Protein Interactions
(
Poster
)
>
Frame averaging (FA), a recent advance in geometric deep learning, is a general framework that endows a given architecture with the ability to transform data equivariantly. However, using FA as a model wrapper introduces additional computation that grows linearly with the group's cardinality and may hinder the exploitation of 3D structures, making it challenging to model macromolecules such as proteins and nucleic acids. In this paper, we present FAFormer, an equivariant Transformer model that incorporates FA as a basic component within each layer. This incorporation allows FAFormer to model coordinates in the latent space directly, without other elaborate geometric features. Building on this foundation, we introduce an equivariant cross-attention module to capture the interactions between node and coordinate representations, and an equivariant feed-forward network to enhance the communication between them. To evaluate FAFormer's performance, we establish two benchmark datasets for nucleic acid-protein contact prediction and compare FAFormer with 8 different baseline models. With these two innovations, FAFormer outperforms all the baselines and achieves state-of-the-art performance. |
Tinglin Huang · Zhenqiao Song · Rex Ying · Wengong Jin 🔗 |
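The frame-averaging construction that FAFormer builds on can be illustrated for point clouds: PCA of the centered coordinates defines a small set of frames, and averaging any function over the canonicalized inputs makes the result SE(3)-invariant. This is a minimal sketch of the generic FA recipe (Puny et al.), not of FAFormer itself:

```python
import itertools
import numpy as np

def frames(X):
    """PCA frames: the centroid plus the covariance eigenvectors, expanded
    over the sign flips that keep det(R) = +1 (the SO(3) case) -- 4 frames."""
    c = X.mean(axis=0)
    _, V = np.linalg.eigh((X - c).T @ (X - c))
    out = []
    for s in itertools.product([1.0, -1.0], repeat=3):
        R = V * np.array(s)            # flip eigenvector signs
        if np.linalg.det(R) > 0:
            out.append((R, c))
    return out

def frame_average(f, X):
    """Average an arbitrary point-cloud function f over the canonicalized
    inputs (X - c) @ R, yielding a rotation- and translation-invariant result.
    The cost grows with the number of frames -- the overhead the abstract
    refers to when FA is used as a wrapper."""
    return np.mean([f((X - c) @ R) for R, c in frames(X)], axis=0)
```

Because the frames transform along with the input, a rotated and translated copy of the point cloud canonicalizes to the same set of inputs, so the average is unchanged.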
-
|
LightMHC: A Light Model for pMHC Structure Prediction with Graph Neural Networks
(
Poster
)
>
The peptide-major histocompatibility complex (pMHC) is a crucial protein complex in cell-mediated immune recognition and response. Accurate structure prediction is potentially beneficial for protein interaction prediction and therefore aids immunotherapy design. However, predicting these structures is challenging due to their sequence and structural variability. In addition, existing pre-trained models such as AlphaFold 2 require expensive computation, inhibiting high-throughput in silico peptide screening. In this study, we propose LightMHC: a lightweight model (2.2M parameters) combining attention mechanisms, graph neural networks, and convolutional neural networks. LightMHC predicts full-atom pMHC structures from amino-acid sequences alone, without template structures. The model achieved comparable or superior performance to AlphaFold 2 and ESMFold (93M and 15B parameters respectively), with five-fold acceleration (6.65 seconds/sample for LightMHC versus 36.82 seconds/sample for AlphaFold 2), potentially offering a valuable tool for immune protein structure prediction and immunotherapy design. |
Antoine Delaunay · Yunguan Fu · Nikolai Gorbushin · Robert McHardy · Bachir Djermani · Liviu Copoiu · Michael Rooney · Maren Lang · Andrey Tovchigrechko · Ugur Sahin · Karim Beguir · Nicolas Lopez Carranza
|
-
|
FrameDiPT: SE(3) Diffusion Model for Protein Structure Inpainting
(
Poster
)
>
The protein structure prediction field has been revolutionised by deep learning, with protein folding models such as AlphaFold 2 and ESMFold. These models enable rapid in silico prediction and have been integrated into de novo protein design and protein-protein interaction (PPI) prediction. However, biologically relevant features that depend on conformational distributions cannot be estimated with these models. Diffusion models, a novel class of generative models, have been developed to learn conformational distributions and applied to de novo protein design. Limited work has been done on protein structure inpainting, in which a masked section is recovered by conditioning simultaneously on its sequence and the rest of the structure. In this work, we propose FrameDiff inPainTing (FrameDiPT), a generalised model for protein inpainting. This is particularly important for T-cell receptors given the hyper-variability of the complementarity-determining region (CDR) loops. We evaluated the model on CDR loop design for T-cell receptors and achieved prediction accuracy comparable to ProteinGenerator and RFdiffusion with limited training data and learnable parameters. Unlike deterministic structure prediction models, FrameDiPT captures the conformational distribution at different regions and binding states, highlighting a key advantage of generative models. |
Cheng ZHANG · Adam Leach · Thomas Makkink · Miguel Arbesú · Ibtissem Kadri · Daniel Luo · Liron Mizrahi · Sabrine Krichen · Maren Lang · Andrey Tovchigrechko · Nicolas Lopez Carranza · Ugur Sahin · Karim Beguir · Michael Rooney · Yunguan Fu
|
-
|
An Active Learning Framework for ML-Assisted Labeling of Cryo-EM Micrographs
(
Poster
)
>
Single-particle cryo-electron microscopy (cryo-EM) has grown significantly as a tool for discerning biological macromolecule structures. A fundamental step in this technique is the accurate identification of individual protein particles from micrographs laden with noise. Machine learning models, specifically convolutional neural networks like ResNet, have shown promise by reducing dependence on manual methods and adapting to the intricate features within the micrographs. However, challenges persist due to low signal-to-noise ratios, resulting in false positives or missed detections. Analogous challenges in computer vision have found respite in active learning, a method that combines automated systems with human intervention for refined outcomes. This paper presents a novel approach for cryo-EM particle picking based on active learning and logistic regression. Our method employs the pre-trained convolution-based model from the Topaz particle picking software for initial feature extraction and subsequently refines particle predictions through logistic regression with a human feedback loop. Complementing this, we introduce a Napari plugin, enhancing user interaction with the micrograph and facilitating intuitive model training. This approach allowed us to achieve ~10% average precision improvement over the pre-trained Topaz model with only 100 labeled particles.
|
Robert Kiewisz · Tristan Bepler 🔗 |
-
|
Validation of de novo designed water-soluble and membrane proteins by in silico folding and melting.
(
Poster
)
>
In silico validation of de novo designed proteins with deep learning (DL)-based structure prediction algorithms has become mainstream. However, formal evidence that high-confidence predictions lead to higher chances of experimental success is lacking. We used experimentally characterized de novo designs to show that AlphaFold2 and ESMFold excel at different tasks. ESMFold can identify designs generated from high-quality (designable) backbones. However, only AlphaFold2 can predict which sequences are more likely to fold among similar designs. We show that ESMFold can predict high-quality structures from just a few contacts and introduce a new approach based on incremental perturbation of the prediction ("in silico melting"), which can reveal differences in the presence of favorable contacts between designs. This study provides new insight into the explainability of DL-based structure prediction models and how they could be leveraged for the design of increasingly complex proteins, in particular membrane proteins, which still lack many basic in silico design and validation tools. |
Alvaro Martin · Carolin Berner · Sergey Ovchinnikov · Anastassia Vorobieva 🔗 |
-
|
Validation of de novo designed water-soluble and membrane proteins by in silico folding and melting.
(
Oral
)
>
In silico validation of de novo designed proteins with deep learning (DL)-based structure prediction algorithms has become mainstream. However, formal evidence that high-confidence predictions lead to higher chances of experimental success is lacking. We used experimentally characterized de novo designs to show that AlphaFold2 and ESMFold excel at different tasks. ESMFold can identify designs generated from high-quality (designable) backbones. However, only AlphaFold2 can predict which sequences are more likely to fold among similar designs. We show that ESMFold can predict high-quality structures from just a few contacts and introduce a new approach based on incremental perturbation of the prediction ("in silico melting"), which can reveal differences in the presence of favorable contacts between designs. This study provides new insight into the explainability of DL-based structure prediction models and how they could be leveraged for the design of increasingly complex proteins, in particular membrane proteins, which still lack many basic in silico design and validation tools. |
Alvaro Martin · Carolin Berner · Sergey Ovchinnikov · Anastassia Vorobieva 🔗 |
-
|
Structure, Surface and Interface Informed Protein Language Model
(
Poster
)
>
Language models applied to protein sequence data have gained considerable interest in recent years, mainly due to their ability to capture complex patterns at the protein sequence level. However, their understanding of why certain evolution-related conservation patterns appear is limited. This work explores the potential of protein language models to further incorporate intrinsic protein properties stemming from protein structures, surfaces, and interfaces. The results indicate that this multi-task pretraining allows the PLM to learn more meaningful representations by leveraging information obtained from different protein views. We evaluate and show improvements in performance on various downstream tasks, such as enzyme classification, remote homology detection, and protein engineering datasets. |
Ioan Ieremie 🔗 |
-
|
De Novo Short Linear Motif (SLiM) Discovery With AlphaFold-Multimer
(
Poster
)
>
Short Linear Motifs (SLiMs) are short, disordered peptide fragments, which mediate a large class of protein-protein interactions (PPIs). SLiM-mediated interactions are often dynamic, low-affinity interactions, which play a crucial role in cell regulation and signal transduction. Despite their importance to cell function, complete characterization of SLiMs, both in terms of binding partners and diversity, as well as consolidation into a unified dataset, is bottlenecked by experimental throughput as well as the difficulty of extracting and aggregating motif information across numerous papers and experiments. Currently, only a minuscule fraction of the estimated hundreds of thousands of SLiMs have been identified. Furthermore, the limited number of experimentally validated SLiM-protein interactions has made de novo SLiM discovery via computational methods challenging. Until now, de novo SLiM discovery has remained too difficult for computational methods, with most progress centered on the non-de novo setting, which leverages extant evolutionary data. However, recent progress in protein structure prediction has translated to significant progress across many applications, so we posit that protein structure prediction networks may make de novo SLiM discovery tractable. In this work, we curate a SLiM discovery benchmark dataset, devise an AlphaFold-Multimer-based SLiM discovery method, and demonstrate settings in which our method can accurately perform de novo SLiM discovery. |
Theo Sternlieb · Davian Ho · Jeffrey Chan 🔗 |
-
|
AF2BIND: Predicting ligand-binding sites using the pair representation of AlphaFold2
(
Poster
)
>
Predicting ligand-binding sites, particularly in the absence of previously resolved homologous structures, presents a significant challenge in structural biology. Here, we leverage the internal pairwise representation of AlphaFold2 (AF2) to train a model, AF2BIND, to accurately predict small-molecule-binding residues given only a target protein. AF2BIND uses 20 "bait" amino acids to optimally extract the binding signal in the absence of a small-molecule ligand. We find that the AF2 pair representation outperforms other neural-network representations for binding-site prediction. Moreover, use of the 20 bait amino acids allows for extraction of predicted chemical properties of the unknown ligand. |
Artem Gazizov · Anna Lian · Casper Goverde · Sergey Ovchinnikov · Nicholas Polizzi 🔗 |
-
|
Protein generation with evolutionary diffusion: sequence is all you need
(
Poster
)
>
Diffusion models have demonstrated the ability to generate biologically plausible proteins that are dissimilar to any proteins seen in nature, enabling unprecedented capability and control in de novo protein design. However, current state-of-the-art diffusion models generate protein structures, which limits the scope of their training data and restricts generations to a small and biased subset of protein space. We introduce a general-purpose diffusion framework, EvoDiff, that combines evolutionary-scale data with the distinct conditioning capabilities of diffusion models for controllable protein generation in sequence space. EvoDiff generates high-fidelity, diverse, and structurally-plausible proteins that cover natural sequence and functional space. Critically, EvoDiff can generate proteins inaccessible to structure-based models, such as those with disordered regions, and design scaffolds for functional structural motifs, demonstrating the universality of our sequence-based formulation. We envision that EvoDiff will expand capabilities in protein engineering beyond the structure-function paradigm toward programmable, sequence-first design. All code and models will be open-source. |
Sarah Alamdari · Nitya Thakkar · Rianne van den Berg · Alex X Lu · Nicolo Fusi · Ava Amini · Kevin Yang 🔗 |
-
|
Protein generation with evolutionary diffusion: sequence is all you need
(
Oral
)
>
Diffusion models have demonstrated the ability to generate biologically plausible proteins that are dissimilar to any proteins seen in nature, enabling unprecedented capability and control in de novo protein design. However, current state-of-the-art diffusion models generate protein structures, which limits the scope of their training data and restricts generations to a small and biased subset of protein space. We introduce a general-purpose diffusion framework, EvoDiff, that combines evolutionary-scale data with the distinct conditioning capabilities of diffusion models for controllable protein generation in sequence space. EvoDiff generates high-fidelity, diverse, and structurally-plausible proteins that cover natural sequence and functional space. Critically, EvoDiff can generate proteins inaccessible to structure-based models, such as those with disordered regions, and design scaffolds for functional structural motifs, demonstrating the universality of our sequence-based formulation. We envision that EvoDiff will expand capabilities in protein engineering beyond the structure-function paradigm toward programmable, sequence-first design. All code and models will be open-source. |
Sarah Alamdari · Nitya Thakkar · Rianne van den Berg · Alex X Lu · Nicolo Fusi · Ava Amini · Kevin Yang 🔗 |
-
|
Protein-Protein Docking with Latent Diffusion
(
Poster
)
>
Interactions between proteins form the basis for many biological processes, and understanding their relationships is an area of active research. Computational approaches offer a way to facilitate this understanding without the burden of expensive and time-consuming experiments. Here, we introduce LatentDock, a generative model for protein-protein docking. Our method leverages a diffusion model operating within a geometrically-structured latent space, derived from an encoder producing roto-translational invariant representations of protein complexes. Critically, it is able to perform flexible docking, capturing both backbone and side-chain conformational changes. Furthermore, our model can condition on binding sites, leading to significant performance gains. Empirical evaluations show the efficacy of our approach over relevant baselines, even outperforming models that do not account for flexibility. |
Matt McPartlon · Céline Marquet · Tomas Geffner · Daniel Kovtun · Alexander Goncearenco · Zachary Carpenter · Luca Naef · Michael Bronstein · Jinbo Xu 🔗 |
-
|
HiFi-NN annotates the microbial dark matter with Enzyme Commission numbers
(
Poster
)
>
The accurate computational annotation of protein sequences with enzymatic function, especially those that are part of the functional and taxonomic dark matter, remains a fundamental challenge in bioinformatics. Here, we present HiFi-NN (Hierarchically-Finetuned Nearest Neighbour search), which annotates protein sequences to the 4th level of EC (Enzyme Commission) number with greater precision and recall than all existing deep learning methods. HiFi-NN is a hierarchically-finetuned deep learning method based on a combination of semi-supervised representation learning and a nearest-neighbours classifier. Furthermore, we show that this method can correctly identify the EC number of sequences sharing less than 40% identity with known proteins, where the current state-of-the-art annotation tool, BLASTp, cannot. We improve the learned representations by increasing the diversity of the training set, not just in sequence space but also in terms of the environments the sequences were sampled from. Finally, we use HiFi-NN to annotate a portion of the microbial dark matter sequences in the MGnify database. |
Gavin Ayres 🔗 |
-
|
Towards Joint Sequence-Structure Generation of Nucleic Acid and Protein Complexes with SE(3)-Discrete Diffusion
(
Poster
)
>
Generative models of macromolecules carry abundant and impactful implications for industrial and biomedical efforts in protein engineering. However, existing methods are currently limited to modeling protein structures or sequences, independently or jointly, without regard to the interactions that commonly occur between proteins and other macromolecules. In this work, we introduce MMDiff, a generative model that jointly designs the sequences and structures of nucleic acids and proteins, independently or in complex, using joint SE(3)-discrete diffusion noise. Such a model has important implications for emerging areas of macromolecular design, including structure-based transcription factor design and design of noncoding RNA sequences. We demonstrate the utility of MMDiff through a rigorous new design benchmark for macromolecular complex generation that we introduce in this work. Our results demonstrate that MMDiff is able to successfully generate micro-RNA and single-stranded DNA molecules while being modestly capable of jointly modeling DNA and RNA molecules in interaction with multi-chain protein complexes. |
Alex Morehead · Jeffrey Ruffolo · Aadyot Bhatnagar · Ali Madani 🔗 |
-
|
SO(3)-Equivariant Representation Learning in 2D Images
(
Poster
)
>
Imaging physical objects that are free to rotate and translate in 3D is challenging. While an object's pose and location do not change its nature, varying them presents problems for current vision models. Equivariant models account for these nuisance transformations, but current architectures only model either 2D transformations of 2D signals or 3D transformations of 3D signals. Here, we propose a novel convolutional layer consisting of 2D projections of 3D filters that models 3D equivariances of 2D signals—critical for capturing the full space of spatial transformations of objects in imaging domains such as cryo-EM. We additionally present methods for aggregating our rotation-specific outputs. We demonstrate significant improvement on several tasks, including particle picking and pose estimation. |
Darnell Granberry · Alireza Nasiri · Jiayi Shou · Alex J. Noble · Tristan Bepler 🔗 |
-
|
HelixDiff: Conditional Full-atom Design of Peptides With Diffusion Models
(
Poster
)
>
Peptide engineering has emerged as a critical discipline within biomedicine, finding applications in therapeutics, diagnostics, and synthetic biology. Despite their prevalence in biological processes, de novo therapeutic peptide design remains a formidable challenge. We focus here on generating helical peptides and present HelixDiff, a score-based diffusion model that learns and generates all-atom helical structures. We incorporate a hotspot-specific inpainting mechanism for the conditional design of α-helix structures that align with critical residues at protein-peptide interfaces. Our model produces helix structures with near-native geometries for a substantial portion of the test scenarios, with root mean square deviations (RMSDs) of less than 1 Å. HelixDiff achieves better sequence recovery and Rosetta scores for unconditional and conditional generation than HelixGAN, our previous GAN-based model. A case study involving glucagon-like peptide-1 (GLP-1) underscores HelixDiff's capacity to generate therapeutic D-peptides: the HelixDiff D-GLP-1 design is more stable than our earlier HelixGAN design when both D-peptides are bound to the GLP-1 receptor, according to molecular dynamics simulations. The source code and datasets are available on GitHub (https://github.com/xxiexuezhi/HelixDiff). |
Xuezhi Xie · Pedro A Valiente · Jisun Kim · Philip Kim 🔗 |
-
|
DIFFMASIF: Score-Based Diffusion Models for Protein Surfaces
(
Poster
)
>
Predicting protein-protein complexes is a central challenge of computational structural biology. However, existing state-of-the-art methods rely on co-evolution learned from large amino acid sequence datasets and thus often fall short on both transient and engineered interfaces (which are of particular interest in therapeutic applications), where co-evolutionary signals are absent or minimal. To address this, we introduce DiffMASIF, a novel score-based diffusion model for rigid protein-protein docking. Instead of sequence-based features, DiffMASIF uses a protein molecular surface-based encoder-decoder architecture trained via a novel combination of geometric pre-training tasks to effectively learn physical complementarity. The encoder uses learned geometric features extracted from protein surface point clouds, as well as geometrically pre-trained residue embeddings pooled to the surface. It directly learns binding-site complementarity through prediction of contact sites, as both a pretraining and an auxiliary loss, and also allows specification of known binding sites during inference. It is followed by a decoder predicting rotation and translation via SO(3) diffusion. We show that DiffMASIF achieves state-of-the-art performance among deep learning methods for rigid-body docking, in particular on structurally novel interfaces and those with low sequence conservation. This provides a significant advance towards accurate modelling of protein interactions with low co-evolution and their many practical applications.
|
Mehmet Akdel · Freyr Sverrisson · Dylan Abramson · Jean Feydy · Alexander Goncearenco · Yusuf Adeshina · Daniel Kovtun · Céline Marquet · Xuejin Zhang · David Baugher · Zachary Carpenter · Luca Naef · Michael Bronstein · Bruno Correia
|
-
|
FLAb: Benchmarking deep learning methods for antibody fitness prediction
(
Poster
)
>
The successful application of machine learning in therapeutic antibody design relies heavily on the ability of models to accurately represent the sequence-structure-function landscape, also known as the fitness landscape. Previous protein benchmarks (including The Critical Assessment of Function Annotation, Tasks Assessing Protein Embeddings, and FLIP) examine fitness and mutational landscapes across many protein families, but they either exclude antibody data or use very little of it. In light of this, we present the Fitness Landscape for Antibodies (FLAb), the largest therapeutic antibody design benchmark to date. FLAb currently encompasses six properties of therapeutic antibodies: (1) expression, (2) thermostability, (3) immunogenicity, (4) aggregation, (5) polyreactivity, and (6) binding affinity. We use FLAb to assess the performance of various widely adopted, pretrained, deep learning models for proteins (IgLM, AntiBERTy, ProtGPT2, ProGen2, ProteinMPNN, and ESM-IF); and compare them to physics-based Rosetta. Overall, no models are able to correlate with all properties or across multiple datasets of similar properties, indicating that more work is needed in prediction of antibody fitness. Additionally, we elucidate how wild type origin, deep learning architecture, training data composition, parameter size, and evolutionary signal affect performance, and we identify which fitness landscapes are more readily captured by each protein model. To promote an expansion on therapeutic antibody design benchmarking, all FLAb data are freely accessible and open for additional contribution at https://github.com/Graylab/FLAb. |
Michael F Chungyoun · Jeffrey Ruffolo · Jeffrey Gray 🔗 |
-
|
Parameter-Efficient Fine Tuning of Protein Language Models Improves Prediction of Protein-Protein Interactions
(
Poster
)
>
Mirroring the massive increase in the size of transformer-based models in natural language processing, proteomics too has seen increasingly large foundational protein language models. As model size increases, the computational and memory footprint of fine-tuning grows out of reach of many academic labs and small biotechs. In this work, we compare full fine-tuning of protein language models with training a classifier head on frozen representations and with the parameter-efficient fine-tuning method LoRA, on the task of predicting protein-protein interactions. We find that LoRA outperforms full fine-tuning while requiring a reduced memory footprint, and that frozen embeddings remain a viable alternative when computational resources for fine-tuning are impractical. |
Samuel Sledzieski · Meghana Kshirsagar · Rahul Dodhia · Bonnie Berger · Juan Lavista Ferres 🔗 |
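For readers unfamiliar with LoRA, the comparison above contrasts full fine-tuning with keeping each pretrained weight matrix frozen and learning only a low-rank additive update. A minimal numpy sketch of such an adapted layer (a hypothetical class for illustration, not the authors' code):

```python
import numpy as np

class LoRALinear:
    """A frozen pretrained weight W plus a trainable low-rank update
    (alpha / r) * B @ A. During fine-tuning only A and B receive gradients,
    cutting trainable parameters from d_out * d_in to r * (d_in + d_out)."""
    def __init__(self, W, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                     # frozen pretrained weight
        self.A = 0.01 * rng.standard_normal((r, W.shape[1]))
        self.B = np.zeros((W.shape[0], r))             # zero init: no-op at start
        self.scale = alpha / r

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because B starts at zero, the adapted layer initially reproduces the frozen model exactly; training then moves only the small adapter matrices, which is what keeps the memory footprint below full fine-tuning.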
-
|
TriFold: A New Architecture for Predicting Protein Sequences from Structural Data
(
Poster
)
>
The inverse protein folding challenge aims to identify specific amino acid sequences that fold into a predetermined protein structure. Despite advancements like AlphaFold2, it remains a complex issue in protein engineering. This paper introduces a novel architecture inspired by the self-attention mechanisms in AlphaFold2 and RoseTTAFold2, adapted for solving the inverse folding problem. Our approach, contrasted with previous graph-based models, leverages attention-based transformer architecture to efficiently integrate information across the entire protein. We combine attention mechanisms, such as invariant point attention, with those designed for sequence and pair representations, resulting in enhanced performance in the inverse protein folding task. Furthermore, we introduce a novel feature representation of protein structure used as an inductive bias in pair representation. The proposed model is trained and tested using the OpenFold codebase on the Protein Data Bank and the AlphaFold distillation dataset, achieving performance improvements over ProteinMPNN regarding sequence recovery. The model's validation on the CAMEO dataset, which comprises proteins released from October 16th, 2021 – January 16th, 2022, further substantiates its efficacy in enhanced sequence recovery across short, single, and multiple chains. |
Harish Srinivasan · Jian Zhou 🔗 |
-
|
End-to-End Sidechain Modeling in AlphaFold2: Attention May or May Not Be All That You Need
(
Poster
)
>
AlphaFold2 (AF2) has made significant strides in computational structural biology and drug discovery. However, limitations remain, particularly for downstream tasks such as molecular docking. We propose that inaccuracies in amino acid sidechain prediction could contribute to these limitations. To address this, we explored two simple and complementary strategies to improve sidechain accuracy in AF2: (1) substituting the default ResNet-based angle predictor in AF2 with a Transformer-like model, and (2) refining the angle predictor using an energy-like loss function. Our analysis indicates that ResNets and Transformers offer comparable performance. However, training with an energy-like loss can sometimes boost structural quality, especially when the entire model is finetuned. We suggest a holistic approach that looks beyond AF2's sidechain torsion angle predictor to improve sidechain modeling in future studies. |
Jonathan King · David Koes 🔗 |
-
|
Coarse-graining via reparametrization avoids force-matching and back-mapping
(
Poster
)
>
Energy minimization problems are highly non-convex problems at the heart of the physical sciences. These problems often suffer from slow convergence due to sharply falling potentials, leading to small gradients. To make them tractable, we often resort to coarse-graining (CG), a type of lossy compression. We introduce a new way to perform CG using reparametrization, which does not require the costly force-matching and back-mapping steps of traditional CG. We focus on improving the slow dynamics by using CG to project onto slow modes, and we propose a way to find robust slow modes for many physical potentials. Our method also does not require data, which is expensive to obtain for molecular systems and a bottleneck for applying machine learning methods to them. We test our method on molecular dynamics simulations of small-protein folding. We observe that our method either reaches deeper (more optimal) energies or runs in less time than the baseline non-CG simulations. |
Nima Dehmamy · Csaba Both · Subhro Das · Tommi Jaakkola 🔗 |
-
|
SE3Lig: SE(3)-equivariant CNNs for the reconstruction of cofactors and ligands in protein structures
(
Poster
)
>
Protein structure prediction algorithms such as AlphaFold2 and ESMFold have dramatically increased the availability of high-quality models of protein structures. Because these algorithms do not predict anything aside from the protein itself, there is a growing need for methods that can rapidly screen protein structures for ligands. Previous work on similar tasks has shown promise but is limited in the classes of atoms it predicts and can benefit from the recent architectural developments in convolutional neural networks (CNNs). In this work, we introduce SE3Lig, a model for semantic in-painting of small molecules in protein structures. Specifically, we report SE(3)-equivariant CNNs trained to predict the atomic densities of common classes of cofactors (hemes, flavins, etc.) and the water molecules and inorganic ions in their vicinity. While the models are trained on high-resolution crystal structures of enzymes, they perform well on structures predicted by AlphaFold2, which suggests that the algorithm correctly represents cofactor-binding cavities. |
Guillaume Lamoureux · Sid Bhadra-Lobo · Anushriya Subedy · Sagar Khare 🔗 |
-
|
Cramming Protein Language Model Training in 24 GPU Hours
(
Poster
)
>
Protein language models (pLMs) are ubiquitous across biological machine learning research, but state-of-the-art models like ESM2 take hundreds of thousands of GPU hours to pre-train on the vast protein universe. Resource requirements for scaling up pLMs prevent fundamental investigations into how optimal modeling choices might differ from those used in natural language. Here, we define a "cramming" challenge for pLMs and train performant models in 24 hours on a single GPU. By re-examining many aspects of pLM training, we are able to train a 67 million parameter model in a single day that achieves comparable performance on downstream protein fitness landscape inference tasks to ESM-3B, a model trained for over 15,000 times more GPU hours than ours. |
Nathan Frey · Taylor Joren · Aya Ismail · Allen Goodman · Stephen Ra · Kyunghyun Cho · Richard Bonneau · Vladimir Gligorijevic 🔗 |
-
|
Preparation Of Labeled Cryo-ET Datasets For Training And Evaluation Of Machine Learning Models
(
Poster
)
>
We present datasets aimed at improving the efficiency of cryo-electron tomographic data analysis. While cryo-electron tomography (cryo-ET) holds immense promise as a tool for native structural biology, it faces persistent challenges in segmentation and annotation. These challenges primarily stem from the absence of diverse ground truth datasets for efficient model training, evaluation, and benchmarking. To address these challenges, we have collected and are currently annotating datasets spanning a range of complexities. Composed of carefully selected protein mixtures and organisms with small genomes, these datasets offer a broad spectrum of structures for study. The datasets are designed to provide a robust foundation for development and evaluation of machine learning models for annotation tasks, thereby enhancing the efficacy and applicability of cryo-ET in elucidating complex native biological structures and interactions. This ongoing project will soon offer the annotated datasets publicly, encouraging further innovation and research in the community. |
Aygul Ishemgulova · Alex J. Noble · Tristan Bepler · Alex De Marco 🔗 |
-
|
EMPOT: partial alignment of density maps and rigid body fitting using unbalanced Gromov-Wasserstein divergence
(
Poster
)
>
Aligning EM density maps and fitting atomic models are essential steps in single particle cryogenic electron microscopy (cryo-EM), with recent methods leveraging various algorithms and machine learning tools. As aligning maps remains challenging in the presence of a map that only partially fits the other (e.g. one subunit), we here propose a new procedure, EMPOT (EM Partial alignment with Optimal Transport), for partial alignment of 3D maps. EMPOT first finds a coupling between 3D point-cloud representations, which is associated with their so-called unbalanced Gromov-Wasserstein divergence, and second, uses this coupling to find an optimal rigid body transformation. Upon running and benchmarking our method with experimental maps and structures, we show that EMPOT outperforms standard methods for aligning subunits of a protein complex and fitting atomic models to a density map, suggesting potential applications of partial optimal transport for improving cryo-EM pipelines. |
Aryan Tajmir Riahi · Chenwei Zhang · James Chen · Anne Condon · Khanh Dao Duc 🔗 |
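The second stage of the EMPOT procedure above, turning a soft point-cloud coupling into a rigid-body transformation, can be sketched as a weighted Kabsch fit. This is an illustrative sketch, not the authors' implementation; the coupling matrix `P` is assumed to come from an (unbalanced) Gromov-Wasserstein solver, and the function name is hypothetical.

```python
import numpy as np

def rigid_fit_from_coupling(X, Y, P):
    """Weighted Kabsch: best rigid (R, t) mapping points X onto points Y,
    minimizing sum_ij P_ij ||R x_i + t - y_j||^2 for a soft coupling P."""
    w = P.sum(axis=1)               # mass assigned to each source point
    v = P.sum(axis=0)               # mass assigned to each target point
    W = P.sum()
    xbar = (w @ X) / W              # coupling-weighted centroids
    ybar = (v @ Y) / W
    Xc, Yc = X - xbar, Y - ybar
    H = Xc.T @ P @ Yc               # 3x3 weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = ybar - R @ xbar
    return R, t
```

With a hard one-to-one coupling this reduces to the classical Kabsch algorithm; with a partial coupling, unmatched points simply receive near-zero weight and do not bias the fit.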
-
|
Fast protein backbone generation with SE(3) flow matching
(
Poster
)
>
This work presents a method for fast protein backbone generation using SE(3) flow matching. Specifically, we adapt FrameDiff, a state-of-the-art non-pretrained diffusion model, to perform flow matching with minimal changes. We first develop the theoretical results for SE(3) flow matching, then demonstrate modifications during training to effectively learn the conditional vector field. Compared to FrameDiff, we require five times fewer timesteps to sample while achieving the same designability metrics on unconditional monomer backbone generation. Our work paves the way toward faster generative models in de novo protein design. |
Jason Yim · Andrew Campbell · Yue Kwang Foong · Sarah Lewis · Victor Satorras · Michael Gastegger · Bas Veeling · Jose Jimenez-Luna · Regina Barzilay · Tommi Jaakkola · Frank Noe
|
-
|
Frame2seq: structure-conditioned masked language modeling for protein sequence design
(
Poster
)
>
Machine learning has revolutionized computational protein design, enabling significant progress in protein backbone generation and sequence design. For protein sequence design, encoder-decoder models have achieved state-of-the-art accuracy, which has translated to experimental success. Here, we introduce Frame2seq, a structure-conditioned masked language model for protein sequence design that, in contrast to autoregressive methods, generates sequences in a single pass. On the CATH 4.2 test dataset, Frame2seq outperforms the state-of-the-art autoregressive method, ProteinMPNN, achieving 49.1% sequence recovery (2.0% improvement) with over six times faster inference. In addition, Frame2seq accurately estimates the error in its own predictions across diverse backbones. To expand design tasks beyond native-like sequence space, we use Frame2seq to generate low sequence identity designs for de novo backbones. Through experimental characterization, we show that Frame2seq successfully designs soluble, monomeric, stable proteins with low sequence identity to native proteins. The speed and accuracy of Frame2seq will accelerate exploration of novel sequence space across diverse design tasks, including challenging applications such as multi-objective optimization. |
Deniz Akpinaroglu · Kosuke Seki · Eleanor Zhu · Tanja Kortemme 🔗 |
-
|
Structure-Conditioned Generative Models for De Novo Ligand Generation: A Pharmacophore Assessment
(
Poster
)
>
Deep generative models show promise for de novo molecular design, especially pocket-conditioned conditional generation methods that output small-molecule ligands in their predicted binding pose with high shape complementarity. However, recent work demonstrates these models still fail to generate chemically valid and synthetically accessible ligands. This paper provides further insight into these methods and their generated molecules through analysis of pharmacophore features commonly used in structure-based and ligand-based drug discovery. We specifically assess the generated distribution of hydrogen bond donors, acceptors, and aromatic rings from deep generative methods on three well-studied protein targets: adenosine A2a receptor, cyclin-dependent kinase 2, and the main protease of SARS-CoV-2. Our results find autoregressive approaches better recapitulate the expected spatial distribution of pharmacophore features compared to diffusion-based models. The analysis presented here highlights current limitations in deep generative models for 3D design, while suggesting new directions to realistically aid structure-based design. |
Shannon Smith · Leo Gendelev · Kangway Chuang · Seth Harris 🔗 |
-
|
Jointly Embedding Protein Structures and Sequences through Residue Level Alignment
(
Poster
)
>
The relationships between protein sequences, structures, and their functions are determined by complex codes that scientists aim to decipher. In particular, while structures contain key information about the protein's biochemical functions, they are often experimentally difficult to obtain. In contrast, protein sequences are abundant but are a step removed from molecular function. In this paper, we propose Residue Level Alignment (RLA) — a self-supervised objective for aligning structure and sequence embedding spaces. By situating structure and sequence encoders within the same latent space, RLA allows the structure encoder to leverage large sequence databases and enriches the sequence encoder with spatial information. Moreover, our framework enables us to measure the similarity between a structure and sequence by comparing their RLA embeddings: we show how RLA similarity scores can be used for binder design by screening for appropriate docking candidates for a given protein-protein or protein-peptide interaction. |
Foster Birnbaum · Saachi Jain · Amy Keating · Aleksander Madry 🔗 |
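The residue-level alignment objective described above can be illustrated with a CLIP-style symmetric InfoNCE loss over per-residue embeddings. This is a minimal NumPy sketch under the assumption that the matched residue is the positive and all other residues are negatives; the function and parameter names (`residue_alignment_loss`, `tau`) are illustrative, not from the paper.

```python
import numpy as np

def residue_alignment_loss(seq_emb, struct_emb, tau=0.07):
    """Symmetric InfoNCE aligning residue i's sequence embedding with
    residue i's structure embedding; other residues act as negatives."""
    s = seq_emb / np.linalg.norm(seq_emb, axis=1, keepdims=True)
    z = struct_emb / np.linalg.norm(struct_emb, axis=1, keepdims=True)
    logits = (s @ z.T) / tau                     # (L, L) cosine similarities

    def xent(M):
        M = M - M.max(axis=1, keepdims=True)     # numerical stability
        logp = M - np.log(np.exp(M).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))           # correct match is residue i -> i

    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls the two encoders into a shared latent space, after which the dot product of RLA embeddings can serve as a structure-sequence similarity score, e.g. for screening binder candidates.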
-
|
Evaluating Representation Learning on the Protein Structure Universe
(
Poster
)
>
We introduce ProteinWorkshop, a comprehensive and rigorous benchmark suite for evaluating protein structure representation learning methods. We provide large-scale pretraining and downstream tasks comprising both experimental and predicted structures, offering a balanced challenge to representation learning algorithms. We demonstrate the utility of our benchmark by systematically evaluating state-of-the-art protein-specific and generic geometric Graph Neural Networks and the extent to which they benefit from pretraining. We find that: (1) pretraining consistently improves the performance of both rotation-invariant and equivariant geometric models; (2) equivariant models seem to benefit more from pretraining compared to invariant models. Our open-source codebase reduces the barrier to entry for working with large structure-based datasets by providing utilities for constructing new tasks directly from the entire PDB, as well as storage-efficient dataloaders from large-scale predicted structures including AlphaFoldDB and ESM Atlas. ProteinWorkshop is available at: https://anonymous.4open.science/r/ProteinWorkshop-B8F5. |
Arian Jamasb · Alex Morehead · Zuobai Zhang · Chaitanya K. Joshi · Kieran Didi · Simon Mathis · Charles Harris · Jian Tang · Jianlin Cheng · Pietro Lió · Tom Blundell
|
-
|
Enhancing Antibody Language Models with Structural Information
(
Poster
)
>
The central tenet of molecular biology is that a protein’s amino acid sequence determines its three-dimensional structure, and thus its function. However, proteins with similar sequences do not always fold into the same shape, and, vice versa, dissimilar sequences can adopt similar folds. In this work, we explore antibodies, a class of proteins in the immune system, whose local shapes are highly unpredictable, even with small variations in their sequence. Inspired by the CLIP method [1], we propose a multimodal contrastive learning approach, contrastive sequence-structure pre-training (CSSP), which amalgamates the representations of antibody sequences and structures in a mutual latent space. Integrating structural information leads both antibody and protein language models to show better correspondence with structural similarity and improves accuracy and data efficiency in downstream binding prediction tasks. We provide an optimised CSSP-trained model, AbFormer-CSSP, for non-commercial use at [HuggingFace link redacted for anonymity]. |
Justin Barton · Jacob Galson · Jinwoo Leem 🔗 |
-
|
Amortized Pose Estimation for X-Ray Single Particle Imaging
(
Poster
)
>
X-ray single particle imaging (SPI) is a nascent technique that can capture the dynamics of biomolecules at room temperature. SPI experiments will one day collect tens of millions of images of the same molecule in order to overcome the weak scattering of individual proteins. Existing reconstruction algorithms will be unable to scale to datasets of this size because they perform computationally expensive search steps to estimate the orientation of the molecule in each image. In this work, we propose a reconstruction algorithm that amortizes the estimation of pose via an autoencoder framework. Our approach consists of a convolutional encoder that maps X-ray images to predicted poses and a physics-based decoder that implicitly fuses all the 2D scattering images into a volumetric representation of the molecule. We validate our method on 6 synthetic datasets of 2 distinct proteins, showing that for the largest datasets containing 5 million images, our technique can reconstruct the electron density in a single pass. |
Jay Shenoy · Axel Levy · Frederic Poitevin · Gordon Wetzstein 🔗 |
-
|
Rethinking Performance Measures of RNA Secondary Structure Problems
(
Poster
)
>
Accurate RNA secondary structure prediction is vital for understanding cellular regulation and disease mechanisms. Deep learning (DL) methods have surpassed traditional algorithms by predicting complex features like pseudoknots and multi-interacting base pairs. However, traditional distance measures struggle to capture such tertiary interactions, and the currently used evaluation measures (F1 score, MCC) have limitations. We propose the Weisfeiler-Lehman graph kernel (WL) as an alternative metric. Embracing graph-based metrics like WL enables fair and accurate evaluation of RNA structure prediction algorithms. Further, WL provides informative guidance, as demonstrated in an RNA design experiment. |
Frederic Runge · Jörg Franke · Daniel Fertmann · Frank Hutter 🔗 |
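The Weisfeiler-Lehman subtree kernel proposed above can be sketched in a few lines: an RNA structure becomes a labeled graph (nucleotides labeled by base, with backbone and base-pair edges), nodes are iteratively relabeled by hashing their label together with their neighbours' labels, and two structures are compared via the dot product of their label-count vectors. This is a minimal sketch, assuming an adjacency-list input format; the function names and fixed iteration count are illustrative, not the authors' code.

```python
from collections import Counter

def wl_features(adj, labels, iterations=2):
    """Weisfeiler-Lehman relabeling: replace each node label with the pair
    (own label, sorted multiset of neighbour labels), repeat, count all labels."""
    feats = Counter(labels)
    for _ in range(iterations):
        labels = [
            (labels[v], tuple(sorted(labels[u] for u in adj[v])))
            for v in range(len(adj))
        ]
        feats.update(labels)
    return feats

def wl_kernel(adj1, lab1, adj2, lab2, iterations=2):
    """WL subtree kernel: dot product of the two label-count vectors."""
    f1 = wl_features(adj1, lab1, iterations)
    f2 = wl_features(adj2, lab2, iterations)
    return sum(f1[k] * f2[k] for k in f1)
```

Unlike F1 over base pairs, this comparison is sensitive to the local wiring around each nucleotide, which is what lets it account for pseudoknots and multi-interacting base pairs.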
-
|
Structure-based and leakage-free data splits for rigorous protein function evaluation
(
Poster
)
>
Datasets for protein machine learning tasks are typically constructed by splitting protein sequences between train, validation and test sets based on protein sequence similarity. For tasks largely determined by protein structure, such as protein function prediction, we hypothesize that such data splitting may cause data leakage between the sets, since proteins can be structurally and functionally similar but still have dissimilar sequences. As a result, model performance on low sequence similarity levels could be overestimated. We demonstrate that this is the case on a commonly used enzyme dataset, by introducing a novel dataset construction methodology designed to prevent leakage between sets based on i) using structure similarity instead of sequence similarity to cluster proteins, and ii) generating tight protein clusters using community detection. Additionally, we demonstrate that simple models based on protein language model representations provide powerful baselines for the task of protein function prediction. |
Charlotte Rochereau · Mohammed AlQuraishi · Arthur Valentin · Gergo Nikolenyi 🔗 |
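The cluster-level split idea above can be sketched as follows: connect proteins whose structural similarity exceeds a threshold, group them into clusters (here via simple union-find connected components, a stand-in for the paper's community detection), and assign whole clusters to either train or test so that no similar pair straddles the split. Function name and the edge-list input format are assumptions for illustration.

```python
import random

def leakage_free_split(n, similar_pairs, test_fraction=0.2, seed=0):
    """Assign whole structure-similarity clusters to train or test,
    so no pair of similar proteins is split across the two sets."""
    parent = list(range(n))

    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:        # merge structurally similar proteins
        parent[find(a)] = find(b)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)

    groups = list(clusters.values())
    random.Random(seed).shuffle(groups)
    train, test = [], []
    for g in groups:                  # fill test with whole clusters first
        (test if len(test) < test_fraction * n else train).extend(g)
    return train, test
```

Splitting at the cluster level rather than the sequence level is exactly what prevents the leakage the abstract describes: two proteins with dissimilar sequences but similar structures always end up on the same side of the split.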
-
|
Uncovering sequence diversity from a known protein structure
(
Poster
)
>
We present InvMSAFold, a method for generating a diverse set of protein sequences folding into a single structure. For a given structure, it defines a probability distribution over the space of sequences. This distribution captures second-order correlations observed in multiple sequence alignments (MSAs) of homologous proteins. Our innovation lies in generating highly diverse protein sequences while preserving structural and functional integrity. This approach offers exciting prospects, particularly in directed evolution, by providing diverse starting points for protein design. |
Luca Alessandro Silva · Barthélémy Meynard · Carlo Lucibello · Christoph Feinauer 🔗 |
-
|
Exploiting language models for protein discovery with latent walk-jump sampling
(
Poster
)
>
We introduce a single-step score-based denoising framework for generative modeling of antibody protein sequences from higher-dimensional embeddings of pretrained language models. Our latent Walk-Jump Sampler (L-WJS) framework learns the manifold of a smoothed latent space of a pretrained protein language model. New sequences are generated by score-based exploration using Langevin MCMC (walk) on the smoothed latent space and denoising (jump) back to the original latent space. Our framework thus combines the attractive properties of the rich and semantically meaningful representations from pretrained protein language models trained on large corpora of sequences with the improved sample quality of score-based modeling in the latent space. We demonstrate that L-WJS is data efficient, generates novel, diverse, and natural antibody sequences, and opens up avenues for sampling (both unguided and guided) from the latent space of various pretrained models. |
Sai Pooja Mahajan · Nathan Frey · Dan Berenberg · Joseph Kleinhenz · Richard Bonneau · Vladimir Gligorijevic · Andrew Watkins · Saeed Saremi 🔗 |
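The walk-jump scheme above can be illustrated with a toy one-dimensional example, under the assumption that the score of the sigma-smoothed empirical density is available in closed form (true for this Gaussian-smoothed toy data, whereas the paper learns the score in a language-model latent space). All names and hyperparameters here are illustrative.

```python
import numpy as np

def walk_jump_sample(data, sigma=0.5, n_steps=200, step=0.01, seed=0):
    """Toy 1-D walk-jump sampler: Langevin MCMC ('walk') on the
    sigma-smoothed data density, then a Tweedie denoising step ('jump')."""
    rng = np.random.default_rng(seed)

    def score(y):  # closed-form score of the Gaussian-smoothed empirical density
        d = data - y
        w = np.exp(-0.5 * (d / sigma) ** 2)
        return (w @ d) / (w.sum() * sigma**2)

    y = rng.choice(data) + sigma * rng.normal()      # start on the smoothed manifold
    for _ in range(n_steps):                         # walk: Langevin updates
        y = y + step * score(y) + np.sqrt(2 * step) * rng.normal()
    return y + sigma**2 * score(y)                   # jump: Tweedie's formula
```

The key property, mirrored here, is that the walk only ever needs the score of the *smoothed* density, which is easier to model, while the single jump step maps the noisy sample back to the clean data manifold.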