
Workshop
Machine Learning in Structural Biology Workshop
Roshan Rao · Jonas Adler · Namrata Anand · John Ingraham · Sergey Ovchinnikov · Ellen Zhong

Sat Dec 03 06:30 AM -- 03:00 PM (PST) @ Room 288 - 289

In only a few years, structural biology, the study of the 3D structure or shape of proteins and other biomolecules, has been transformed by breakthroughs from machine learning algorithms. Machine learning models are now routinely used by experimentalists to predict structures that can help answer real biological questions (e.g. AlphaFold) and to accelerate the experimental process of structure determination (e.g. computer vision algorithms for cryo-electron microscopy), and they have become a new industry standard for bioengineering new protein therapeutics (e.g. large language models for protein design). Despite all this progress, there are still many active and open challenges for the field, such as modeling protein dynamics, predicting higher order complexes, pushing towards generalization of protein folding physics, and relating the structure of proteins to the in vivo and contextual nature of their underlying function. These challenges are diverse and interdisciplinary, motivating new kinds of machine learning systems and requiring the development and maturation of standard benchmarks and datasets.

In this exciting time for the field, our workshop, “Machine Learning in Structural Biology” (MLSB), seeks to bring together relevant experts, practitioners, and students across a broad community to focus on these challenges and opportunities. We believe that bringing these communities together at our workshop, including the geometric and graph learning communities, NLP researchers, and structural biologists with domain expertise, can help spur new ideas, spark collaborations, and advance the impact of machine learning in structural biology. Progress at this intersection promises to unlock new scientific discoveries and the ability to design novel medicines.

Sat 6:30 a.m. - 6:35 a.m. Opening Remarks (Remarks)

Sat 6:35 a.m. - 7:00 a.m. Invited Speaker (Talk)
David Fleet

Sat 7:00 a.m. - 7:15 a.m. Latent Space Diffusion Models of Cryo-EM Structures (Oral)
Cryo-electron microscopy (cryo-EM) is unique among tools in structural biology in its ability to image large, dynamic protein complexes. Key to this ability are image processing algorithms for heterogeneous cryo-EM reconstruction, including recent deep learning-based approaches. The state-of-the-art method cryoDRGN uses a Variational Autoencoder (VAE) framework to learn a continuous distribution of protein structures from single particle cryo-EM imaging data. While cryoDRGN is able to model complex structural motions, in practice, the Gaussian prior distribution of the VAE fails to match the aggregate approximate posterior, especially for multi-modal distributions (e.g. compositional heterogeneity). Here, we train a diffusion model as an expressive, learnable prior for cryoDRGN. We show the ability to sample from the model on two synthetic and two real datasets, where samples accurately follow the data distribution unlike samples from the VAE prior distribution. Our approach learns a high-quality generative model over molecular configurations directly from cryo-EM imaging data. We also demonstrate how the diffusion model prior can be leveraged for fast latent space traversal and interpolation between states of interest. By learning an accurate model of the data distribution, our method unlocks tools in generative modeling, sampling, and distribution analysis for heterogeneous cryo-EM ensembles.
Karsten Kreis · Tim Dockhorn · Zihao Li · Ellen Zhong

Sat 7:15 a.m. - 7:30 a.m. Protein Structure and Sequence Generation with Equivariant Denoising Diffusion Probabilistic Models (Oral)
Proteins are macromolecules that mediate a significant fraction of the cellular processes that underlie life.
An important task in bioengineering is designing proteins with specific 3D structures and chemical properties which enable targeted functions. To this end, we introduce a generative model of both protein structure and sequence that can operate at significantly larger scales than previous molecular generative modeling approaches. The model is learned entirely from experimental data and conditions its generation on a compact specification of protein topology to produce a full-atom backbone configuration as well as sequence and side-chain predictions. We demonstrate the quality of the model via qualitative and quantitative analysis of its samples. We show how the model can be applied to protein structure determination such as in CryoEM and present results on predicting domain structures to simulated electron densities at varying resolutions. Videos of sampling trajectories are available at https://nanand2.github.io/proteins.
Namrata Anand · Tudor Achim

Sat 7:30 a.m. - 7:45 a.m. Predicting conformational landscapes of known and putative fold-switching proteins using AlphaFold2 (Oral)
Proteins that switch their secondary structures upon response to a stimulus, commonly known as "metamorphic proteins", directly question the paradigm of "one structure per protein". Despite the potential to more deeply understand protein folding and function through studying metamorphic proteins, their discovery has been largely by chance, with fewer than 10 experimentally validated. AlphaFold2 (AF2) has dramatically increased accuracy in predicting single structures, though it fails to return alternate states for known metamorphic proteins in its default settings. We demonstrate that clustering an input multiple sequence alignment (MSA) by sequence similarity enables AF2 to sample alternate states of known metamorphs. Moreover, AF2 scores these alternate states with high confidence.
We used our clustering method, AF-cluster, to screen for alternate states in protein families without known fold-switching, and identified a putative alternate state for the oxidoreductase DsbE. Similarly to KaiB, DsbE is predicted to switch between a thioredoxin-like fold and a novel fold. This prediction is the subject of ongoing experimental testing. Further development of such bioinformatic methods in tandem with experiment will likely aid in accelerating discovery and gaining a more systematic understanding of fold-switching in protein families.
Hannah Wayment-Steele · Sergey Ovchinnikov · Lucy Colwell · Dorothee Kern

Sat 7:45 a.m. - 8:05 a.m. Break

Sat 8:05 a.m. - 8:30 a.m. Invited Speaker (Talk)

Sat 8:30 a.m. - 8:45 a.m. SWAMPNN: End-to-end protein structures alignment (Oral)
With the recent breakthrough of highly accurate structure prediction methods, there has been a rapid growth of available protein structures. Efficient methods are needed to infer structural similarity within these datasets. We present an end-to-end alignment method, called SWAMPNN, that takes as input the 3D coordinates of a protein pair and outputs a structural alignment. We show that the model is able to recapitulate TM-align alignments while running faster and is more accurate than Foldseek on the alignment task while being comparable for classification.
Jeanne Trinquier · Samantha Petti · Shihao Feng · Johannes Soeding · Martin Steinegger · Sergey Ovchinnikov

Sat 8:45 a.m. - 9:00 a.m. DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking (Oral)
Predicting the binding structure of a small molecule ligand to a protein (a task known as molecular docking) is critical to drug design. Recent deep learning methods that treat docking as a regression problem have decreased runtime compared to traditional search-based methods but have yet to offer substantial improvements in accuracy.
We instead frame molecular docking as a generative modeling problem and develop DiffDock, a diffusion generative model over the non-Euclidean manifold of ligand poses. To do so, we map this manifold to the product space of the degrees of freedom (translational, rotational, and torsional) involved in docking and develop an efficient diffusion process on this space. Empirically, DiffDock obtains a 38% top-1 success rate (RMSD < 2 Å) on PDBBind, significantly outperforming the previous state-of-the-art of traditional docking (23%) and deep learning (20%) methods. Moreover, DiffDock has fast inference times and provides confidence estimates with high selective accuracy.
Gabriele Corso · Hannes Stärk · Bowen Jing · Regina Barzilay · Tommi Jaakkola

Sat 9:00 a.m. - 9:15 a.m. Dynamic-backbone protein-ligand structure prediction with multiscale generative diffusion models (Oral)
Molecular complexes formed by proteins and small-molecule ligands are ubiquitous, and predicting their 3D structures can facilitate both biological discoveries and the design of novel enzymes or drug molecules. Here we propose NeuralPLexer, a deep generative model framework to rapidly predict protein-ligand complex structures and their fluctuations using protein backbone template and molecular graph inputs. NeuralPLexer jointly samples protein and small-molecule 3D coordinates at an atomistic resolution through a generative model that incorporates biophysical constraints and inferred proximity information into a time-truncated diffusion process. The reverse-time generative diffusion process is learned by a novel stereochemistry-aware equivariant graph transformer that enables efficient, concurrent gradient field prediction for all heavy atoms in the protein-ligand complex. NeuralPLexer outperforms existing physics-based and learning-based methods on benchmarking problems including fixed-backbone blind protein-ligand docking and ligand-coupled binding site repacking.
Moreover, we identify preliminary evidence that NeuralPLexer enriches bound-state-like protein structures when applied to systems where protein folding landscapes are significantly altered by the presence of ligands. Our results reveal that a data-driven approach can capture the structural cooperativity among protein and small-molecule entities, showing promise for the computational identification of novel drug targets and the end-to-end differentiable design of functional small-molecules and ligand-binding proteins.
Zhuoran Qiao · Weili Nie · Arash Vahdat · Thomas Miller · Anima Anandkumar

Sat 9:15 a.m. - 10:15 a.m. Poster Session

Sat 10:15 a.m. - 11:00 a.m. Lunch (Break)

Sat 11:00 a.m. - 11:25 a.m. Invited Speaker (Talk)
Max Welling

Sat 11:25 a.m. - 11:40 a.m. EquiFold: Protein Structure Prediction with a Novel Coarse-Grained Structure Representation (Oral)
Designing proteins to achieve specific functions often requires in silico modeling of their properties at high throughput scale and can significantly benefit from fast and accurate protein structure prediction. We introduce EquiFold, a new end-to-end differentiable, SE(3)-equivariant, all-atom protein structure prediction model. EquiFold uses a novel coarse-grained representation of protein structures that does not require multiple sequence alignments or protein language model embeddings, inputs that are commonly used in other state-of-the-art structure prediction models. Our method relies on geometrical structure representation and is substantially smaller than prior state-of-the-art models. In preliminary studies, EquiFold achieved comparable accuracy to AlphaFold but was orders of magnitude faster. The combination of high speed and accuracy makes EquiFold suitable for a number of downstream tasks, including protein property prediction and design.
Jae Hyeon Lee · Payman Yadollahpour · Andrew Watkins · Nathan Frey · Andrew Leaver-Fay · Stephen Ra · Vladimir Gligorijevic · Kyunghyun Cho · Aviv Regev · Richard Bonneau

Sat 11:40 a.m. - 11:55 a.m. Predicting Ligand – RNA Binding Using E3-Equivariant Network and Pretraining (Oral)
It is becoming increasingly appreciated that small molecules hold great promise in targeting therapeutically relevant RNAs, such as viral RNAs or splicing junctions. Yet predicting which ligands target RNA is particularly difficult since limited data are available. To overcome this, we fine-tuned a pretrained small molecule representation model, Uni-Mol, to predict the RNA-binding propensity of ligands and the RNA binding QSAR model. In addition, we develop an E3-equivariant model to predict possible ligands given the RNA pocket geometry. To the best of our knowledge, this is the first E3-equivariant model for predicting RNA-ligand binding. We demonstrated the great potential of Uni-Mol pretraining in the RNA-ligand tasks towards efficient and rational RNA drug discovery.
Zhenfeng Deng · Ruichu Gu · Hangrui Bi · Xinyan Wang · Zhaolei Zhang · Han Wen

Sat 11:55 a.m. - 12:20 p.m. Invited Speaker (Talk)
Alex Rives

Sat 12:20 p.m. - 12:35 p.m. Seq2MSA: A Language Model for Protein Sequence Diversification (Oral)
Diversification libraries of protein sequences that contain a similar set of structures over a variety of sequences can help protein design pipelines by introducing flexibility into the starting structures and providing a range of starting points for directed evolution. However, exploring the sequence space is computationally challenging: the vast majority of sequence space is non-viable, and even of those sequences that do fold to well-formed protein structures, it is challenging to find the fraction that maintain a similar fold class to a given protein.
In this work, we propose to use an encoder-decoder language model, trained on a novel Seq2MSA task, that can create diversification libraries of any input protein. In particular, using our model, we are able to generate sequences that maintain structural similarity to a target sequence while pushing below 40% sequence identity to any protein in UniRef. Our diversification pipeline has the potential to aid in computational protein design by providing a diverse set of starting points in sequence space for a given functional or structural target.
Pascal Sturmfels · Roshan Rao · Robert Verkuil · Zeming Lin · Tom Sercu · Adam Lerer · Alex Rives

Sat 12:35 p.m. - 12:50 p.m. Metal3D: Accurate prediction of transition metal ion location via deep learning (Oral)
Metal ions are essential cofactors for many proteins and about half of the structurally characterized proteins contain a metal ion. Metal ions play a crucial role for many applications such as enzyme design or design of protein-protein interactions because they are biologically abundant, tether to the protein using strong interactions, and have favorable catalytic properties, e.g. as Lewis acids. In this work, we develop a convolutional neural network based approach to identify metal binding sites in experimental and computationally predicted protein structures. Comparison with other currently available tools shows that Metal3D is the most accurate metal ion location predictor to date using a single structure as input. Metal3D outputs a confidence metric for each predicted site and works on proteins with few homologs in the Protein Data Bank. The predicted metal ion locations for Metal3D are within 0.70 ± 0.64 Å of the experimental locations with half of the sites below 0.5 Å.
Metal3D predicts a global metal density that can be used for annotation of structures predicted using e.g. AlphaFold2 and a per residue metal density that can be used in protein design workflows for the location of suitable metal binding sites and rotamer sampling to create novel metalloproteins.
Simon Dürr

Sat 12:50 p.m. - 1:50 p.m. Panel Session (Discussion Panel)

Sat 1:50 p.m. - 2:55 p.m. Poster Session / Happy Hour (Poster Session)

Sat 2:55 p.m. - 3:00 p.m. Closing Remarks (Remarks)

- Identifying endogenous peptide receptors by combining structure and transmembrane topology prediction (Poster)
Many secreted endogenous peptides rely on signalling pathways to exert their function in the body. While peptides can be discovered through high throughput technologies, their cognate receptors typically cannot, hindering the understanding of their mode of action. We investigate the use of AlphaFold-Multimer for identifying the cognate receptors of secreted endogenous peptides in human receptor libraries without any prior knowledge about likely candidates. We find that AlphaFold's predicted confidence metrics have strong performance for prioritizing true peptide-receptor interactions. By applying transmembrane topology prediction using DeepTMHMM, we further improve performance by detecting and filtering biologically implausible predicted interactions. In a library of 1112 human receptors, the method ranks true receptors in the top percentile on average for 11 benchmark peptide-receptor pairs.
Felix Teufel · Jan Christian Refsgaard · Christian Toft Madsen · Carsten Stahlhut · Mads Grønborg · Dennis Madsen · Ole Winther

- Designing Biological Sequences via Meta-Reinforcement Learning and Bayesian Optimization (Poster)
Designing functionally interesting biological sequences poses challenges due to the combinatorially large space of the problem.
As such, the acceleration of exploration through this landscape can have a substantial impact on the progress of the medical field. Motivated by this, we propose MetaRLBO where we (1) train an autoregressive generative model via Meta-Reinforcement Learning augmented with surrogate reward functions and exploration bonus to navigate through the sequence space efficiently. The Meta-RL policy is trained over a distribution of beliefs (i.e., proxy oracles) of the objective function, encouraging the policy to generate diverse sequences. Due to the large-batch and low-round nature of the wet-lab evaluations (true function evaluation), we (2) perform a more targeted evaluation through Bayesian Optimization. Our in-silico experiments show that meta-learning over such ensembles provides robustness against reward misspecification and achieves competitive results compared to existing strong baselines.
Leo Feng · Padideh Nouri · Aneri Muni · Yoshua Bengio · Pierre-Luc Bacon

- Agile Language Transformers for Recombinant Protein Expression Optimization (Poster)
Language Transformers (LaTs) have achieved state-of-the-art performance in a range of challenging protein modeling tasks including structure prediction, design, mutation effect prediction, and others. The lion's share of these improvements derives from exponential increases in the size and depth of these neural networks, which now routinely exceed billions of trainable parameters, rather than fundamental architectural innovations. This explosive growth in model size poses an obstacle to integration into design-build-test cycles, wherein models are iteratively evaluated, retrained, and improved throughout data collection. As a result, large LaTs do not meet the need for lightweight, rapid-to-train models that excel at problems with tight data-model feedback loops.
Here, we present a small, 10 million-parameter BERT model with linearly scaling attention that can be trained from scratch on four Nvidia V100 GPUs in under a week and fine-tuned with full back-propagation in hours to days. We demonstrate that this model excels at two challenging active-learning problems, recombinant protein expression prediction and codon optimization, that require interfacing with experiments. Our approach highlights the size-cost tradeoff inherent to LaTs and demonstrates the utility of small, custom-designed models in practical settings.
Jeliazko Jeliazkov · Maxim Shapovalov · Diego del Alamo · Matt Sternke · Joel Karpiak

- ModelAngelo: Automated Model Building in Cryo-EM Maps (Poster)
Electron cryo-microscopy (cryo-EM) produces three-dimensional (3D) maps of the electrostatic potential of biological macromolecules, including proteins. At sufficient resolution, the cryo-EM maps, along with some knowledge about the imaged molecules, allow de novo atomic modelling. Typically, this is done through a laborious manual process. Recent advances in machine learning applications to protein structure prediction show potential for automating this process. Taking inspiration from these techniques, we have built ModelAngelo for automated model building of proteins in cryo-EM maps. ModelAngelo first uses a residual convolutional neural network (CNN) to initialize a graph representation with nodes assigned to individual amino acids of the proteins in the map and edges representing the protein chain. The graph is then refined with a graph neural network (GNN) that combines the cryo-EM data, the amino acid sequence data and prior knowledge about protein geometries. The GNN refines the geometry of the protein chain and classifies the amino acids for each of its nodes. The final graph is post-processed with a hidden Markov model (HMM) search to map each protein chain to entries in a user provided sequence file.
Application to 28 test cases shows that ModelAngelo outperforms the state of the art and approximates manual building for cryo-EM maps with resolutions better than 3.5 Å.
Kiarash Jamali · Dari Kimanius · Sjors Scheres

- Unsupervised language models for disease variant prediction (Poster)
There is considerable interest in predicting the pathogenicity of protein variants in human genes. Due to the sparsity of high quality labels, recent approaches turn to unsupervised learning, using Multiple Sequence Alignments (MSAs) to train generative models of natural sequence variation within each gene. These generative models then predict variant likelihood as a proxy to evolutionary fitness. In this work we instead combine this evolutionary principle with pretrained protein language models (LMs), which have already shown promising results in predicting protein structure and function. Instead of training separate models per-gene, we find that a single protein LM trained on broad sequence datasets can score pathogenicity for any gene variant zero-shot, without MSAs or finetuning. We call this unsupervised approach VELM (Variant Effect via Language Models), and show that it achieves scoring performance comparable to the state of the art when evaluated on clinically labeled variants of disease-related genes.
Allan Zhou · Nicholas C. Landolfi · Daniel ONeill

- Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem (Poster)
Construction of a scaffold structure that supports a desired motif, conferring protein function, shows promise for the design of vaccines and enzymes. But a general solution to this motif-scaffolding problem remains open. Current machine-learning techniques for scaffold design are either limited to unrealistically small scaffolds (up to length 20) or struggle to produce multiple diverse scaffolds.
We propose to learn a distribution over diverse and longer protein backbone structures via an E(3)-equivariant graph neural network. We develop ProtDiff to efficiently sample scaffolds from this distribution conditioned on a given motif; our algorithm is the first to theoretically guarantee conditional samples from a diffusion model in the large-compute limit. We evaluate our designed backbones by how well they align with AlphaFold2-predicted structures. We show that our method can (1) sample scaffolds up to 80 residues and (2) achieve structurally diverse scaffolds for a fixed motif.
Jason Yim · Brian L Trippe · Doug Tischer · David Baker · Tamara Broderick · Regina Barzilay · Tommi Jaakkola

- Online Inference of Structure Factor Amplitudes for Serial X-ray Crystallography (Poster)
Advances in X-ray techniques at free-electron lasers and synchrotrons now enable the collection of diffraction snapshots from millions of micro crystals. These are often paired with physical or chemical perturbations to obtain movies of the response of proteins to chemical and physical stimuli. Analysis of these data requires scalable algorithms. Distributed computing is one way to accomplish this as national labs may provide the necessary compute resources. However, a more accessible approach would be to construct algorithms which can operate on small batches of data on a single computer. The extreme case, an online algorithm, learns to process data by looking at one example at a time. Here we describe the successful implementation of one such algorithm for scaling and merging reflection intensities. The algorithm uses deep learning to scale reflection intensities while encouraging the merged structure factor estimates to follow a crystallographic prior distribution. The model is trained by gradient descent on a Bayesian objective function. We demonstrate that the model can estimate productive global parameter updates from single images.
This approach has modest hardware requirements, can adapt on the fly as new data are acquired, and has the potential for transfer learning between data sets. The algorithm can be the heart of a flexible, scalable infrastructure that powers the next generation of diffraction experiments.
Kevin Dalton · Doeke Hekstra

- Antibody optimization enabled by artificial intelligence predictions of binding affinity and naturalness (Poster)
Traditional antibody optimization approaches involve screening a small subset of the available sequence space, often resulting in drug candidates with suboptimal binding affinity, developability or immunogenicity. Based on two distinct antibodies, we demonstrate that deep contextual language models trained on high-throughput affinity data can quantitatively predict binding of unseen antibody sequence variants. These variants span a KD range of three orders of magnitude over a large mutational space. Our models reveal strong epistatic effects, which highlight the need for intelligent screening approaches. In addition, we introduce the modeling of “naturalness”, a metric that scores antibody variants for similarity to natural immunoglobulins. We show that naturalness is associated with measures of drug developability and immunogenicity, and that it can be optimized alongside binding affinity using a genetic algorithm. This approach promises to accelerate and improve antibody engineering, and may increase the success rate in developing novel antibody and related drug candidates.
Sharrol Bachas · Goran Rakocevic · David Spencer · Anand Sastry · Robel Haile · John Sutton · George Kasun · Andrew Stachyra · Jahir Gutierrez · Edriss Yassine · Borka Medjo · Vincent Blay · Christa Kohnert · Jennifer Stanton · Alexander Brown · Nebojsa Tijanic · Cailen McCloskey · Rebecca Viazzo · Rebecca Consbruck · Hayley Carter · Simon Gottreich-Levine · Shaheed Abdulhaqq · Jacob Shaul · Abigail Ventura · Randal Olson · Engin Yapici · Joshua Meier · Sean McClain · Matthew Weinstock · Gregory Hannum · Ariel Schwartz · Miles Gander · Roberto Spreafico

- Fast and Accurate Antibody Structure Prediction without Sequence Homologs (Poster)
Accurate prediction of antibody structures is critical in analyzing the function of antibodies, thus enabling the rational design of antibodies. However, existing antibody structure prediction methods often only formulate backbone atoms and rely on additional tools for side-chain conformation prediction. In this work, we propose a fully end-to-end architecture for simultaneous prediction of backbone and side-chain conformations. A pre-trained language model is adopted for fast structure prediction by avoiding the time-consuming search for sequence homologs. The model first predicts the monomer structure of each chain and then refines them into a heavy-light chain complex structure, which enables multi-level supervision for model training. Evaluation results verify the effectiveness of the proposed method in both antibody and nanobody structure prediction.
Jiaxiang Wu · Fandi Wu · Biaobin Jiang · Wei Liu · Peilin Zhao

- ExpressUrself: A spatial model for predicting recombinant expression from mRNA sequence (Poster)
Maximising the yield of recombinantly expressed proteins is a critical part of any protein engineering pipeline. In most cases, the expression of a given protein can be tuned by adjusting its DNA coding sequence; however, finding coding sequences that optimise expression is a nontrivial task.
The 3-dimensional structure of mRNA is known to strongly influence the expression levels of proteins, due to its effect on the efficiency of ribosome attachment. While correlations between mRNA structure and expression are well established, no model to date has succeeded in effectively utilising this information to accurately predict expression levels. Here we present ExpressUrself, a model designed to capture spatial characteristics of the sequence surrounding the start codon of an mRNA transcript, and intended to be used for optimising protein expression. The model is trained and tested on a large data set of variant DNA sequences and is able to predict the expression of previously unseen transcripts to a high degree of accuracy.
Michael P Dunne · Javier Caceres-Delpiano

- Adversarial Attacks on Protein Language Models (Poster)
Deep Learning models for protein structure prediction, such as AlphaFold2, leverage Transformer architectures and their attention mechanism to capture structural and functional properties of amino acid sequences. Despite the high accuracy of predictions, biologically insignificant perturbations of the input sequences, or even single point mutations, can lead to substantially different 3D structures. On the other hand, protein language models are often insensitive to biologically relevant mutations that induce misfolding or dysfunction (e.g. missense mutations). Precisely, predictions of the 3D coordinates do not reveal the structure-disruptive effect of these mutations. Therefore, there is an evident inconsistency between the biological importance of mutations and the resulting change in structural prediction. Inspired by this problem, we introduce the concept of adversarial perturbation of protein sequences in continuous embedding spaces of protein language models. Our method relies on attention scores to detect the most vulnerable amino acid positions in the input sequences.
Adversarial mutations are biologically diverse from their references and are able to significantly alter the resulting 3D structures.
Ginevra Carbone · Francesca Cuturello · Luca Bortolussi · Alberto Cazzaniga

- The geometry of hidden representations of protein language models (Poster)
Protein language models (pLMs) transform their input into a sequence of hidden representations whose geometric behavior changes across layers. Looking at fundamental geometric properties such as the intrinsic dimension and the neighbor composition of these representations, we observe that these changes highlight a pattern characterized by three distinct phases. This phenomenon emerges across many models trained on diverse datasets, thus revealing a general computational strategy learned by pLMs to reconstruct missing parts of the data. These analyses show the existence of low-dimensional maps that encode evolutionary and biological properties such as remote homology and structural information. Our geometric approach sets the foundations for future systematic attempts to understand the space of protein sequences with representation learning techniques.
Lucrezia Valeriani · Francesca Cuturello · Alessio Ansuini · Alberto Cazzaniga

- Physics-aware Graph Neural Network for Accurate RNA 3D Structure Prediction (Poster)
Biological functions of RNAs are determined by their three-dimensional (3D) structures. Thus, given the limited number of experimentally determined RNA structures, the prediction of RNA structures will facilitate elucidating RNA functions and RNA-targeted drug discovery, but remains a challenging task. In this work, we propose a Graph Neural Network (GNN)-based scoring function trained only with the atomic types and coordinates on limited solved RNA 3D structures for distinguishing accurate structural models.
The proposed Physics-aware Multiplex Graph Neural Network (PaxNet) separately models the local and non-local interactions inspired by molecular mechanics. Furthermore, PaxNet contains an attention-based fusion module that learns the individual contribution of each interaction type for the final prediction. We rigorously evaluate the performance of PaxNet on two benchmarks and compare it with several state-of-the-art baselines. The results show that PaxNet significantly outperforms all the baselines overall, and demonstrate the potential of PaxNet for improving the 3D structure modeling of RNA and other macromolecules. Shuo Zhang · Lei Xie · Yang Liu 🔗 - Improving Molecule Properties Through 2-Stage VAE (Poster) The variational autoencoder (VAE) is a popular method for drug discovery, and many architectures and pipelines have been proposed to improve its performance. However, the VAE model itself suffers from deficiencies such as poor manifold recovery when the data lie on a low-dimensional manifold embedded in a higher-dimensional ambient space, and these deficiencies manifest differently in each application. Their consequences in drug discovery are somewhat under-explored. In this paper, we study how to improve the similarity between the data generated via VAE and the training dataset by improving manifold recovery via a 2-stage VAE, where the second-stage VAE is trained on the latent variables of the first one. We experimentally evaluated our approach using the ChEMBL dataset as well as a polymer dataset. On both datasets, the 2-stage VAE method significantly improves the property statistics over a pre-existing method. Chenghui Zhou · Barnabas Poczos 🔗 - Investigating the conformational landscape of AlphaFold2-predicted protein kinase structures (Poster) [] Protein kinases are a family of signalling proteins, crucial for maintaining cellular homeostasis. 
When dysregulated, kinases drive the pathogenesis of several diseases, and are thus one of the largest target categories for drug discovery. Kinase activity is tightly controlled by switching through several active and inactive conformations in their catalytic domain. Kinase inhibitors have been designed to engage kinases in specific conformational states, where each conformation presents a unique physico-chemical environment for therapeutic intervention. Thus, modeling kinases across conformations can enable the design of novel and optimally selective kinase drugs. Due to the recent success of AlphaFold2 in accurately predicting the 3D structure of proteins based on sequence, we investigated the conformational landscape of protein kinases as modeled by AlphaFold2. We observed that AlphaFold2 is able to model several kinase conformations across the kinome, however, certain conformations are only observed in specific kinase families. Furthermore, we show that the per residue predicted local distance difference test can capture information describing conformational dynamics of kinases. Finally, we evaluated the docking performance of AlphaFold2 kinase structures for enriching known ligands. Taken together, we see an opportunity to leverage AlphaFold2 models for structure-based drug discovery against kinases across several pharmacologically relevant conformational states. Carmen Al Masri · Francesco Trozzi · Marcel Patek · Anna Cichonska · Balaguru Ravikumar · Rayees Rahman 🔗 - Heterogeneous reconstruction of deformable atomic models in Cryo-EM (Poster) Cryogenic electron microscopy (cryo-EM) provides a unique opportunity to study the structural heterogeneity of biomolecules. Being able to explain this heterogeneity with atomic models would help our understanding of their functional mechanism but the size and ruggedness of the structural space presents an immense challenge. 
In this work, we describe a heterogeneous reconstruction method based on an atomistic representation whose deformation is reduced to a handful of collective motions through normal mode analysis. Our implementation follows an encoder-decoder approach. The amplitude of motion along the normal modes and the 2D shift between the center of the image and the center of the molecule are jointly estimated by an encoder while a physics-based decoder aggregates the images into a representation of the heterogeneity readily interpretable at the atomic level. We illustrate our method on 3 synthetic datasets corresponding to different distributions along a simulated trajectory of adenylate kinase transitioning from its open to its closed conformations. We show for each distribution that, given enough normal modes, our approach is able to recapitulate the intermediate atomic models with atomic-level accuracy. Youssef Nashed · Ariana Peck · Julien Martel · Axel Levy · Bongjin Koo · Gordon Wetzstein · Nina Miolane · Daniel Ratner · Frederic Poitevin 🔗 - Improving Protein Subcellular Localization Prediction with Structural Prediction & Graph Neural Networks (Poster) []     We present a method that improves subcellular localization prediction for proteins based on their sequence by leveraging structure prediction and Graph Neural Networks. We demonstrate that Language Models, trained on protein sequences, and Graph Neural Nets, trained on protein's 3D structures, are both efficient approaches. They both learn meaningful, yet different representations of proteins; hence, ensembling them outperforms the reigning state of the art method. Geoffroy Dubourg-Felonneau · Arash Abbasi · Eyal Akiva · Lawrence Lee 🔗 - EvoOpt: an MSA-guided, fully unsupervised sequence optimization pipeline for protein design (Poster) []  Recent years have seen rapid growth in machine learning algorithms for protein design. 
Among them, protein sequence optimization methods to maximize molecular functionality can significantly impact many industries. However, as shown in this study, most existing methods are data-hungry: they tend to be low-performant when available training data is scarce (i.e., in a low-N regime), which is often the case in practical protein engineering scenarios. In response, here we examine the extreme case: what if we have no training data? To answer, we propose a fully unsupervised sequence optimization pipeline named EvoOpt that leverages evolutionary information provided by multiple sequence alignments (MSAs) and the generative power of MSA Transformer, a protein language model (PLM) that takes an MSA as input. The extensive evaluation herein demonstrates that EvoOpt outperforms or is on par with the existing supervised methods even in relatively high-N regimes. We also report that the optimization performance with MSA Transformer is almost equivalent to or superior to that with a PLM that takes a single sequence as input, such as ESM-1b or ESM2 of far more model parameters. These results indicate the advantage of using an MSA to guide an algorithm toward promising candidates in the search space, directly exploiting evolutionary information. Hideki Yamaguchi · Yutaka Saito 🔗 - ChemSpacE: Interpretable and Interactive Chemical Space Exploration (Poster) Discovering meaningful molecules in the vast combinatorial chemical space has been a long-standing challenge in many fields from materials science to drug discovery. Recent advances in machine learning, especially generative models, have made remarkable progress and demonstrate considerable promise for automated molecule design. Nevertheless, most molecule generative models remain black-box systems, whose utility is limited by a lack of interpretability and human participation in the generation process. 
In this work we propose Chemical Space Explorer (ChemSpacE), a simple yet effective method for exploring the chemical space with pre-trained deep generative models. It enables users to interact with existing generative models and inform the molecule generation process. We demonstrate the efficacy of ChemSpacE on the molecule optimization task and the molecule manipulation task in single property and multi-property settings. On the molecule optimization task, the performance of ChemSpacE is on par with previous black-box optimization methods yet is considerably faster and more sample efficient. Furthermore, the interface from ChemSpacE facilitates human-in-the-loop chemical space exploration and interactive molecule design. Yuanqi Du · Xian Liu · Nilay Shah · Shengchao Liu · Jieyu Zhang · Bolei Zhou 🔗 - T-cell receptor specific protein language model for prediction and interpretation of epitope binding (ProtLM.TCR) (Poster) [] The cellular adaptive immune response relies on epitope recognition by T-cell receptors (TCRs). We used a language model for TCRs (ProtLM.TCR) to predict TCR-epitope binding. This model was pre-trained on a large set of TCR sequences before being fine-tuned to predict TCR-epitope bindings across multiple human leukocyte antigen (HLA) of class-I types. We then tested ProtLM.TCR on a balanced set of binders and non-binders for each epitope, avoiding model shortcuts like HLA categories. We compared pan-HLA versus HLA-specific models, and our results show that while computational prediction of novel TCR-epitope binding probability is feasible, more diverse datasets are required to achieve a more generalized performance towards de novo epitope binding predictions. We also show that ProtLM.TCR embeddings outperform BLOSUM and hand-crafted embeddings. Finally, we have used the LIME framework to examine the interpretability of these predictions. 
Ahmed Essaghir 🔗 - Representation Learning on Biomolecular Structures using Equivariant Graph Attention (Poster) []     Learning and reasoning about 3D molecular structures with varying size is an emerging and important challenge in machine learning and especially in the development of biotherapeutics. Equivariant Graph Neural Networks (GNNs) can simultaneously leverage the geometric and relational detail of the problem domain and are known to learn expressive representations through the propagation of information between nodes leveraging geometrical details, such as directionality in their intermediate layers. In this work, we propose an equivariant GNN that operates with Cartesian coordinates to incorporate directionality and implements a novel attention mechanism, acting as a content and spatial dependent filter. Our proposed message function processes vector features in a geometrically meaningful way by mixing existing vectors and creating new ones based on cross products. Tuan Le · Frank Noe · Djork-Arné Clevert 🔗 - What is hidden in the darkness? Characterization of AlphaFold structural space (Poster) The recent public release of the latest version of the AlphaFold database has given us access to over 200 million predicted protein structures. We use a "shape-mer" approach, a structural fragmentation method analogous to sequence k-mers, to describe these structures and look for novelties - both in terms of proteins with rare or novel structural composition and possible functional annotation of under-studied proteins. 
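The "shape-mer" abstract above describes its fragmentation as analogous to sequence k-mers. As a point of reference only, the sketch below counts overlapping k-mers in a plain amino acid string; it does not implement shape-mers, and the function name `kmer_counts` and the example sequence are hypothetical.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int = 3) -> Counter:
    """Count every overlapping window of length k in a sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Slide a window of width k along the sequence, one position at a time.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Hypothetical toy sequence, chosen so that one k-mer repeats.
counts = kmer_counts("MKVLMKV", k=3)
print(counts.most_common(1))
```

A structural analogue would replace the character windows with descriptors of local backbone geometry; how those descriptors are defined is not stated in the abstract.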
Janani Durairaj · Joana Maria Soa Pereira · Mehmet Akdel · Torsten Schwede 🔗 - So ManyFolds, So Little Time: Efficient Protein Structure Prediction With pLMs and MSAs (Poster) [] In recent years, machine learning approaches for de novo protein structure prediction have made significant progress, culminating in AlphaFold, which approaches experimental accuracies in certain settings and heralds the possibility of rapid in silico protein modelling and design. However, such applications can be challenging in practice due to the significant compute required for training and inference of such models, and their strong reliance on the evolutionary information contained in multiple sequence alignments (MSAs), which may not be available for certain targets of interest. Here, we first present a streamlined AlphaFold architecture and training pipeline that still provides good performance with significantly reduced computational burden. Aligned with recent approaches such as OmegaFold and ESMFold, our model is initially trained to predict structure from sequences alone by leveraging embeddings from the pretrained ESM-2 protein language model (pLM). We then compare this approach to an equivalent model trained on MSA-profile information only, and find that the latter still provides a performance boost, suggesting that even state-of-the-art pLMs cannot yet easily replace the evolutionary information of homologous sequences. Finally, we train a model that can make predictions from either the combination, or only one, of pLM and MSA inputs. Ultimately, we obtain accuracies in any of these three input modes similar to models trained uniquely in that setting, whilst also demonstrating that these modalities are complementary, each regularly outperforming the other. 
Thomas D Barrett · Amelia Villegas-Morcillo · Louis Robinson · Benoit Gaujac · Karim Beguir · Arthur Flajolet 🔗 - Peptide-MHC Structure Prediction With Mixed Residue and Atom Graph Neural Network (Poster) [] Neoantigen-targeting vaccines have achieved breakthrough success in cancer immunotherapy by eliciting immune responses against neoantigens, which are proteins uniquely produced by cancer cells. During the immune response, the interactions between peptides and major histocompatibility complexes (MHC) play an important role as peptides must be bound and presented by MHC to be recognised by the immune system. However, only limited experimentally determined peptide-MHC (pMHC) structures are available, and in silico structure modelling is therefore used for studying their interactions. Current approaches mainly use Monte Carlo sampling and energy minimisation, and are often computationally expensive. On the other hand, the advent of large high-quality proteomic data sets has led to an unprecedented opportunity for deep learning-based methods, with pMHC structure prediction becoming feasible with these trained protein folding models. In this work, we present a graph neural network-based model for pMHC structure prediction, which takes an amino acid-level pMHC graph and an atomic-level peptide graph as inputs and predicts the peptide backbone conformation. With a novel weighted reconstruction loss, the trained model achieved a similar accuracy to AlphaFold 2, requiring only 1.7M learnable parameters compared to 93M, representing a more than 98% reduction in the number of required parameters. 
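The parameter comparison in the abstract above (1.7M versus 93M) can be checked with a one-line calculation. The sketch below is only that arithmetic; the helper name `relative_reduction` is hypothetical, and the two counts are the rounded figures quoted in the abstract.

```python
def relative_reduction(small: float, large: float) -> float:
    """Fraction by which `small` is lower than `large`."""
    return 1.0 - small / large


# Rounded parameter counts as quoted in the abstract.
reduction = relative_reduction(1.7e6, 93e6)
print(f"{reduction:.1%}")  # prints "98.2%"
```

With these rounded inputs the reduction is about 98.2%, consistent with the "more than 98%" stated in the abstract.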
Antoine Delaunay · Yunguan Fu · Alberto Bégué · Robert McHardy · Bachir Djermani · Liviu Copoiu · Michael Rooney · Andrey Tovchigrechko · Marcin Skwark · Nicolas Lopez Carranza · Maren Lang · Karim Beguir · Ugur Sahin 🔗 - ContactNet: Geometric-Based Deep Learning Model for Predicting Protein-Protein Interactions (Poster) [] Deep learning approaches have achieved significant progress in predicting protein structures. These methods are often applied to protein-protein interactions (PPIs) yet require a Multiple Sequence Alignment (MSA), which is unavailable for various interactions, such as antibody-antigen. Computational docking methods are capable of sampling accurate complex models, but also produce thousands of invalid configurations. The design of scoring functions for identifying accurate models is a long-standing challenge. We develop a novel attention-based Graph Neural Network (GNN), ContactNet, for classifying PPI models obtained from docking algorithms into accurate and incorrect ones. When trained on docked antigen and modeled antibody structures, ContactNet doubles the accuracy of current state-of-the-art scoring functions, achieving accurate models among its Top-10 in 43% of the test cases. When applied to unbound antibodies, its Top-10 accuracy increases to 65%. This performance is achieved without MSA and the approach is applicable to other types of interactions, such as host-pathogens or general PPIs. Matan Halfon · Dina Schneidman · Tomer Cohen · Raanan Fattal 🔗 - End-to-end accurate and high-throughput modeling of antibody-antigen complexes (Poster) [] Antibodies are produced by the immune system in response to infection or vaccination. 
While sequencing of the individual antibody repertoire is becoming routine, identifying the antigens they recognize requires costly low-throughput experiments. Even when the antigen is known, epitope mapping is still challenging: experimental approaches are low-throughput and computational ones are not sufficiently accurate. Recently, AlphaFold2 has revolutionized structural biology by predicting highly accurate protein structures and complexes. However, it relies on evolutionary information that is not available for antibody-antigen interactions. Traditional computational epitope mapping is based on structure modeling (folding) of the antibodies followed by docking the predicted structure to the corresponding antigen. The problem with this sequential approach is that the folding step does not consider the structural changes of the antibody upon antigen binding and the docking step is inaccurate because the antibody is considered rigid. Here, we develop a deep learning end-to-end model that, given an antibody sequence and its corresponding antigen structure, can simultaneously perform folding and docking tasks. The model produces the 3D coordinates of the entire antibody-antigen (Ab-Ag) or nanobody-antigen (Nb-Ag) complex, including the side chains. An accurate model is detected among the Top-5 predictions for 75% of the test set. In addition to mining antibody repertoires, such a method has the potential to be used in antibody-based drug design, as well as in vaccine design. Tomer Cohen · Dina Schneidman 🔗 - Deep Local Analysis estimates effects of mutations on protein-protein interactions (Poster) [] The spectacular advances in protein and protein complex structure prediction hold promise for the reconstruction of interactomes at large scale and at residue resolution. Beyond determining the 3D arrangement of interacting partners, modeling approaches should be able to sense the impact of sequence variations such as point mutations on the strength of the association. 
In this work, we report on DLA-mutation, a novel and efficient deep learning framework for accurately predicting mutation-induced binding affinity changes. It relies on a 3D-invariant description of local 3D environments at protein interfaces and leverages the large amounts of available protein complex structures through self-supervised learning. It combines the learnt representations with evolutionary information, and a description of interface structural regions, in a siamese architecture. DLA-mutation achieves a Pearson correlation coefficient of 0.81 on a large collection of more than 2000 mutations, and its generalization capability to unseen complexes is higher than state-of-the-art methods. Yasser Mohseni Behbahani · Elodie Laine · Alessandra Carbone 🔗 - Membrane and microtubule rapid instance segmentation with dimensionless instance segmentation by learning graph representations of point clouds (Poster) []     Point clouds are an increasingly common spatial data modality, being produced by sensors used in robotics and self-driving cars, and as natural intermediate representations of objects in microscopy and other bioimaging domains (e.g., cell locations over time, or filaments, membranes, or organelle boundaries in cryo-electron micrographs or tomograms). However, semantic and instance segmentation of this data remains challenging due to the complex nature of objects in point clouds. Especially in bioimaging domains where objects are often large and can be intersecting or overlapping. Furthermore, methods for operating on point clouds should not be sensitive to the specific orientation or translation of the point cloud, which is often arbitrary. Here, we frame the point cloud instance segmentation problem as a graph learning problem in which we seek to learn a function that accepts the point cloud as input and outputs a probability distribution over neighbor graphs in which connected components of the graph correspond to individual object instances. 
We introduce the Dimensionless Instance Segmentation Transformer (DIST), a deep neural network for spatially invariant instance segmentation of point clouds to solve this point cloud-to-graph problem. DIST uses an SO(n) invariant transformer layer architecture to operate on point clouds of arbitrary dimension and outputs, for each pair of points, the probability that an edge exists between them in the instance graph. We then decode the most likely set of instances using a graph cut. We demonstrate the power of DIST for the segmentation of biomolecules in cryo-electron micrographs and tomograms, far surpassing existing methods for membrane and filament segmentation in empirical evaluation. We anticipate that DIST will underpin a new generation of methods for point cloud segmentation in bioimaging and that our general model and approach will provide useful insights for point cloud segmentation methods in other domains. Robert Kiewisz · Tristan Bepler 🔗 - Masked inverse folding with sequence transfer for protein representation learning (Poster) Self-supervised pretraining on protein sequences has led to state-of-the-art performance on protein function and fitness prediction. However, sequence-only methods ignore the rich information contained in experimental and predicted protein structures. Meanwhile, inverse folding methods reconstruct a protein's amino-acid sequence given its structure, but do not take advantage of sequences that do not have known structures. In this study, we train a masked inverse folding protein language model parameterized as a structured graph neural network. We then show that using the outputs from a pretrained sequence-only protein masked language model as input to the inverse folding model further improves pretraining perplexity. We evaluate both of these models on downstream protein engineering tasks and analyze the effect of using information from experimental or predicted structures on performance. 
Kevin Yang · Niccoló Zanichelli · Hugh Yeh 🔗 - 3D Reconstruction of Protein Complex Structures Using Synthesized Multi-View AFM Images (Poster) []     Recent developments in deep learning-based methods demonstrated its potential to predict the 3D protein structures using inputs such as protein sequences, Cryo-Electron microscopy (Cryo-EM) images of proteins, etc. However, these methods struggle to predict the protein complexes (PC), structures with more than one protein. In this work, we explore the atomic force microscope (AFM) assisted deep learning-based methods to predict the 3D structure of PCs. The images produced by AFM capture the protein structure in different and random orientations. These multi-view images can help train the neural network to predict the 3D structure of protein complexes. However, obtaining the dataset of actual AFM images is time-consuming and not a pragmatic task. We propose a virtual AFM imaging pipeline that takes a 'PDB' protein file and generates multi-view 2D virtual AFM images using volume rendering techniques. With this, we created a dataset of around 8K proteins. We train a neural network for 3D reconstruction called Pix2Vox++ using the synthesized multi-view 2D AFM images dataset. We compare the predicted structure obtained using a different number of views and get the intersection over union (IoU) value of 0.92 on the training dataset and 0.52 on the validation dataset. We believe this approach will lead to better prediction of the structure of protein complexes. Jaydeep Rade · Soumik Sarkar · Anwesha Sarkar · Adarsh Krishnamurthy 🔗 - Protein structure generation via folding diffusion (Poster)    The ability to computationally generate novel yet physically foldable protein structures could lead to new biological discoveries and new treatments targeting yet incurable diseases. Despite recent advances in protein structure prediction, directly generating diverse, novel protein structures from neural networks remains difficult. 
In this work, we present a new diffusion-based generative model that designs protein backbone structures via a procedure that mirrors the native folding process. We describe protein backbone structure as a series of consecutive angles capturing the relative orientation of the constituent amino acid residues, and generate new structures by denoising from a random, unfolded state towards a stable folded structure. Not only does this mirror how proteins biologically twist into energetically favorable conformations, but the inherent shift and rotational invariance of this representation also crucially alleviates the need for complex equivariant networks. We train a denoising diffusion probabilistic model with a simple transformer backbone and demonstrate that our resulting model unconditionally generates highly realistic protein structures with complexity and structural patterns akin to those of naturally-occurring proteins. As a useful resource, we release the first open-source codebase and trained models for protein structure diffusion. Kevin Wu · Kevin Yang · Rianne van den Berg · James Zou · Alex X Lu · Ava Soleimany 🔗 - Learning Free Energy Pathways through Reinforcement Learning of Adaptive Steered Molecular Dynamics (Poster) [] In this paper, we develop a formulation to utilize reinforcement learning and sampling-based robotics planning to derive low free energy transition pathways between two known states. Our formulation uses Jarzynski's equality and the stiff-spring approximation to obtain point estimates of energy, and constructs an informed path search with atomistic resolution. At the core of this framework is our first attempt to use policy-driven adaptive steered molecular dynamics (SMD) to control our molecular dynamics simulations. We show that both the reinforcement learning and robotics planning realizations of the RL-guided framework can solve for pathways on toy analytical surfaces and alanine dipeptide. 
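The free energy abstract above relies on Jarzynski's equality, which relates the free energy difference between two states to an exponential average over nonequilibrium work values: ΔF = -kT ln⟨exp(-W/kT)⟩. The sketch below shows only that estimator and is not the authors' pipeline: the stiff-spring approximation and the steered simulations are omitted, the function name `jarzynski_free_energy` is hypothetical, the work values are made up, and work is assumed to be expressed in the same units as kT.

```python
import math


def jarzynski_free_energy(works, kT=1.0):
    """Estimate a free energy difference from nonequilibrium work samples.

    Implements dF = -kT * ln( mean( exp(-W / kT) ) ), using a
    log-sum-exp shift so large work values do not underflow.
    """
    if not works:
        raise ValueError("at least one work value is required")
    scaled = [-w / kT for w in works]
    shift = max(scaled)
    log_mean = shift + math.log(
        sum(math.exp(s - shift) for s in scaled) / len(scaled)
    )
    return -kT * log_mean


# Hypothetical work values from repeated pulling runs (units of kT).
works = [1.0, 2.0, 3.0]
estimate = jarzynski_free_energy(works)
print(estimate, sum(works) / len(works))
```

Because the exponential average is dominated by low-work trajectories, the estimate never exceeds the mean work, which is the usual statement that average work bounds the free energy change from above.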
Nicholas Ho · John Kevin Cava · John Vant · Ankita Shukla · Jacob Miratsky · Pavan Turaga · Ross Maciejewski · Abhishek Singharoy 🔗 - Representation of missense variants for predicting modes of action (Poster) [] Accurate prediction of functional impact for missense variants is fundamental for genetic analysis and clinical applications. Current methods focus on generating an overall pathogenicity prediction score while overlooking the fact that variant effect should be multi-dimensional via different modes of action, such as gain or loss of function, and loss of folding stability or enzymatic activity. Recent breakthroughs in high-capacity language models have enabled ab initio prediction of protein structures as well as self-supervised representation learning of protein sequence and functions. Here we present RESCVE, a method to learn a universal representation of sequence variation from protein context. We demonstrate the utility of the method in predicting a range of modes of action for missense variants through transfer learning. Guojie Zhong · Yufeng Shen 🔗 - Pretrained protein language model transfer learning: is the final layer representation what we want? (Poster) [] Large pretrained protein language models have improved protein sequence-to-function prediction. This often takes the form of transfer learning, where final-layer representations from large pretrained models are extracted for downstream tasks. Although pretrained models have been empirically successful, there is little current understanding of how the features learned by pretraining relate to and are useful for downstream tasks. In this work, we investigate whether transferring a partial model, by using the output from a middle layer, is as effective as full model transfer, and if so, whether successful transfer depends on the downstream task and model properties. 
Across datasets and tasks, we evaluate partial model transfer of pretrained transformer and convolutional neural networks of varying sizes. We observe that pretrained representations outperform the one-hot baseline for most tasks. More importantly, we find that representations from middle layers can be as effective as those from later layers. To our knowledge, our work is the first to report the effectiveness of partial model transfer for protein property prediction. Our results point to a mismatch between the pretraining and downstream tasks, indicating a need for more relevant pretraining tasks so that representations from later layers can be better utilized for downstream tasks. Francesca-Zhoufan Li · Ava Soleimany · Kevin Yang · Alex X Lu 🔗 - Does Inter-Protein Contact Prediction Benefit from Multi-Modal Data and Auxiliary Tasks? (Poster) []  Approaches to (in silico) predict structures of proteins have been revolutionized by AlphaFold2, while those to predict interfaces between proteins are relatively underdeveloped, owing to the overly complicated protein complex data. In short, proteins are represented by 1D sequences folding into 3D structures, and interact to form assemblies to function. We believe such intricate scenarios are better modeled with additional indicative information, of their multi-modality nature and multi-scale functionality. We thus hypothesize to improve inter-protein contact prediction via augmenting input features with multi-modal representations, and synergizing the objective with auxiliary predictive tasks. (i) We first progressively add three protein modalities into models: protein sequences, sequences with evolutionary information, and structure-aware intra-contact maps, with observations that utilizing all data modalities delivers the best prediction precision. 
Fine-grained analysis reveals evolutionary and structural information benefit predictions on the difficult and rigid protein complexes, respectively, assessed by resemblance to bound structures in residue contacts. (ii) We next introduce three auxiliary tasks via multi-task learning or pre-training: inter-contact distance, angle, and protein-protein interaction (PPI) prediction. Although PPI prediction is reported to benefit from predicting inter-contacts (as causal interpretations), in reverse it is not true, and the same are the other two tasks across all complex categories. This again reflects the high complexity of the protein assembly data on which, designing synergistic auxiliary tasks is nontrivial. Arghamitra Talukder · Rujie Yin · Yang Shen · Yuning You 🔗 - Conditional Invariances for Conformer Invariant Protein Representations (Poster) Representation learning for proteins is an emerging area in geometric deep learning. Recent works have factored in both the relational (atomic bonds) and the geometric aspects (atomic positions) of the task, notably bringing together graph neural networks (GNNs) with neural networks for point clouds. The equivariances and invariances to geometric transformations (group actions such as rotations and translations) so far treat large molecules as rigid structures. However, in many important settings, proteins can co-exist as an ensemble of multiple stable conformations. The conformations of a protein, however, cannot be described as input-independent transformations of the protein: Two proteins may require different sets of transformations in order to describe their set of viable conformations. To address this limitation, we introduce the concept of conditional transformations (CT). CT can capture protein structure, while respecting the constraints on dihedral (torsion) angles and steric repulsions between atoms. 
We then introduce a Markov chain Monte Carlo framework to learn representations that are invariant to these conditional transformations. Our results show that endowing existing baseline models with these conditional transformations helps improve their performance without sacrificing computational efficiency. Balasubramaniam Srinivasan · Vassilis Ioannidis · Soji Adeshina · Mayank Kakodkar · George Karypis · Bruno Ribeiro 🔗 - Investigating graph neural network for RNA structural embedding (Poster) []  The biological function of natural non-coding RNAs (ncRNA) is tightly bound to their molecular structure. Sequence analyses such as multiple sequence alignments (MSA) are the bread and butter of bio-molecules functional analysis; however, analyzing sequence and structure simultaneously is a difficult task. In this work, we propose CARNAGE (Clustering/Alignment of RNA with Graph-network Embedding), which leverages a graph neural network encoder to imprint structural information into a sequence-like embedding; therefore, downstream sequence analyses now account implicitly for structural constraints. In contrast to the traditional "supervised" alignment approaches, we trained our network on a masking problem, independent from the alignment or clustering problem. Our method is very versatile and has shown good performances in 1) designing RNAs sequences, 2) clustering sequences, and 3) aligning multiple sequences only using the simplest Needleman and Wunsch's algorithm. Not only can this approach be readily extended to RNA tridimensional structures, but it can also be applied to proteins. vaitea opuu · Helene Bret 🔗 - Explainable Deep Generative Models, Ancestral Fragments, and Murky Regions of the Protein Structure Universe (Poster) []  Modern proteins did not arise abruptly, as singular events, but rather over the course of at least 3.5 billion years of evolution. Can machine learning teach us how this occurred? 
The molecular evolutionary processes that yielded the intricate three-dimensional (3D) structures of proteins involve duplication, recombination and mutation of genetic elements, corresponding to short peptide fragments. Identifying and elucidating these ancestral fragments is crucial to deciphering the interrelationships amongst proteins, as well as how evolution acts upon protein sequences, structures & functions. Traditionally, structural fragments have been found using sequence and 3D structural alignment, but that becomes challenging when proteins have undergone extensive permutations, allowing two proteins to share a common architecture, though their topologies may drastically differ (a phenomenon termed the Urfold). We have designed a new framework to identify compact, potentially-discontinuous peptide fragments by combining (i) deep generative models of protein superfamilies with (ii) layer-wise relevance propagation (LRP) to identify atoms of great relevance in creating an embedding during an all superfamilies x all domains analysis. Our approach recapitulates known relationships amongst the evolutionarily ancient small beta-barrels (e.g. SH3 and OB folds) and P-loop-containing proteins (e.g. Rossmann and P-loop NTPases), previously established via manual analysis. Because of the generality of our deep model's approach, we anticipate that it can enable the discovery of new ancestral peptides. In a sense, our framework uses LRP as an 'explainable AI' approach, in conjunction with a recent deep generative model of protein structure (termed DeepUrfold), in order to leverage decades' worth of structural biology knowledge to decipher the underlying molecular bases for protein structural relationships, including those which are exceedingly remote, yet can be discovered via deep learning. 
Eli Draizen · Cameron Mura · Philip Bourne 🔗 - APPRAISE: ranking engineered proteins by target binding propensity using pair-wise, competitive structure modeling and physics-informed analysis (Poster) Deep learning-based methods for protein structure prediction, represented by AlphaFold2 [1] and RosettaFold [2], have achieved unprecedented accuracy. However, the power of these structure-prediction tools has not been fully harnessed to guide the engineering of protein-based therapeutics. For example, there is a gap between the ability to predict the structures of candidate protein molecules and the ability to assess which of those molecules are more likely to bind to a target receptor. Here we introduce Automated Pair-wise Peptide-Receptor binding model AnalysIs for Screening Engineered proteins (APPRAISE), a method for predicting the receptor binding propensity of engineered proteins. This method involves using an established structure-prediction tool to generate models of two engineered proteins competing for binding to a target protein. These structure models are then subjected to fast analysis (<1 CPU second per model) to generate a score that takes into account biophysical principles and geometrical constraints. As a proof-of-concept, we tested this tool on engineered Adeno-Associated Viral (AAV) vectors with surface displayed peptides. Using AlphaFold2-multimer [3] as the structure prediction engine, APPRAISE can accurately classify receptor-dependent vs. receptor-independent AAV capsids with a ROC-AUC of 0.87 in a set of 22 samples. When used to screen a library of 100 variants, APPRAISE correctly predicted a variant with a distinct sequence from previously known receptor binders to be a top receptor binder, which was confirmed by in vitro and in vivo experiments. Without further fine-tuning, APPRAISE can accurately rank other classes of engineered proteins, such as miniprotein binders and nanobodies, that bind to therapeutic receptors. 
With high accuracy, generalizability, and interpretability, the APPRAISE method would expand the utility of current structure prediction capabilities and accelerate protein engineering for biomedical applications. Xiaozhe Ding · Xinhong Chen · Erin Sullivan · Tim Miles · Viviana Gradinaru 🔗 - Plug & Play Directed Evolution of Proteins with Gradient-based Discrete MCMC (Poster) A long-standing goal of machine-learning-based protein engineering is to accelerate the discovery of novel mutations that improve the function of a known protein. We introduce a plug and play framework for evolving proteins in silico that supports mixing and matching a variety of generative models with discriminative models to help constrain search to the proteins most likely to appear in nature. Our framework achieves this by sampling from a product of experts distribution defined in discrete protein space and does not require any model fine-tuning or re-training. Instead of resorting to sample-inefficient search based on random mutations, as is typical of previous plug and play algorithms for protein engineering, we propose a fast discrete sampler that uses gradients to efficiently identify promising mutations. Our in silico directed evolution experiments on wide fitness landscapes show that we efficiently discover variants that are multiple mutations away from a wild type protein with high evolutionary sequence likelihood as well as estimated activity. Our framework is analyzed across a range of different evolutionary generative models including a 650M parameter protein language model. Patrick Emami · Aidan Perreault · Jeffrey Law · David Biagioni · Peter St. John 🔗 - Training self-supervised peptide sequence models on artificially chopped proteins (Poster) Representation learning for proteins has primarily focused on the global understanding of protein sequences regardless of their length. 
However, shorter proteins (known as peptides) take on distinct structures and functions compared to their longer counterparts. Unfortunately, there are not as many naturally occurring peptides available to be sequenced and therefore less peptide-specific data to train with. In this paper, we propose a new peptide data augmentation scheme, where we train peptide language models on artificially constructed peptides that are small contiguous subsets of longer, wild-type proteins; we refer to the training peptides as “chopped proteins”. We evaluate the representation potential of models trained with chopped proteins versus natural peptides and find that training language models with chopped proteins results in more generalized embeddings for short protein sequences. These peptide-specific models also retain information about the original protein they were derived from better than language models trained on full-length proteins. We compare masked language model training objectives to three novel peptide-specific training objectives: next-peptide prediction, contrastive peptide selection and evolution-weighted MLM. We demonstrate improved zero-shot learning performance on a deep mutational scanning benchmark for peptides. Gil Sadeh · Zichen Wang · Jasleen Grewal · Huzefa Rangwala · Layne Price 🔗 - Protein Sequence Design in a Latent Space via Model-based Reinforcement Learning (Poster) Proteins are complex molecules responsible for different functions in the human body. Enhancing the functionality of a protein and/or cellular fitness can significantly impact various industries. However, their optimization remains challenging, and sequences generated by data-driven methods often fail in wet lab experiments. This study investigates the limitations of existing model-based sequence design methods and presents a novel optimization framework that can efficiently traverse the latent representation space instead of the protein sequence space. 
Our framework generates proteins with higher functionality and cellular fitness by modeling the sequence design task as a Markov decision process and applying model-based reinforcement learning. We discuss the results in a comprehensive evaluation of two distinct proteins, GFP and His3. Minji Lee · Luiz Felipe Vecchietti · Hyunkyu Jung · Hyunjoo Ro · Ho Min Kim · Meeyoung Cha 🔗 - Holographic-(V)AE: an end-to-end SO(3)-Equivariant (Variational) Autoencoder in Fourier Space (Poster) Group-equivariant neural networks have emerged as a data-efficient approach to solve classification and regression tasks, while respecting the relevant symmetries of the data. However, little work has been done to extend this paradigm to the unsupervised and generative domains. Here, we present Holographic-(V)AE (H-(V)AE), a fully end-to-end SO(3)-equivariant (variational) autoencoder in Fourier space, suitable for unsupervised learning and generation of data distributed around a specified origin. H-(V)AE is trained to reconstruct the spherical Fourier encoding of data, learning in the process a latent space with a maximally informative invariant embedding alongside an equivariant frame describing the orientation of the data. We extensively test the performance of H-(V)AE on diverse datasets and show that its latent space efficiently encodes the categorical features of spherical images and structural features of protein atomic environments. Our work can further be seen as a case study for equivariant modeling of a data distribution by reconstructing its Fourier encoding. Gian Marco Visani · Michael Pun · Armita Nourmohammad 🔗 - Protein-Protein Docking with Iterative Transformer (Poster) Conventional protein-protein docking algorithms usually rely on heavy candidate sampling and re-ranking, but these steps are time-consuming and hinder applications that require high-throughput complex structure prediction, e.g., structure-based virtual screening. 
Existing deep learning methods for protein-protein docking, despite being much faster, suffer from low docking success rates. In addition, they simplify the problem by assuming no conformational changes within any protein upon binding (rigid docking). This assumption precludes applications when binding-induced conformational changes play a role, such as allosteric inhibition or docking from uncertain unbound model structures. To address these limitations, we designed a novel iterative transformer network that predicts the 3D transformation from a randomized initial docking pose to a refined docked pose. Our method, GeoDock, is flexible at the protein residue level, allowing the prediction of rigid-body movement as well as conformational changes upon binding. For two benchmark sets of rigid docking targets, GeoDock successfully docks 32% and 20% of the protein pairs, outperforming the baseline deep learning method EQUIDOCK (8% and 0% success rates). Additionally, GeoDock achieves comparable docking success rates to the conventional docking algorithms while being 80-500 times faster. Although binding-induced conformational changes are still a challenge owing to limited training and evaluation data, our architecture sets up the foundation to capture flexibility going ahead. Lee-Shin Chu · Jeffrey Ruffolo · Jeffrey Gray 🔗 - Structure-based Drug Design with Equivariant Diffusion Models (Poster) Structure-based drug design (SBDD) aims to design small-molecule ligands that bind with high affinity and specificity to pre-determined protein targets. Traditional SBDD pipelines start with large-scale docking of compound libraries from public databases, thus limiting the exploration of chemical space to existing, previously studied regions. Recent machine learning methods approached this problem using an atom-by-atom generation approach, which is computationally expensive. 
In this paper, we formulate SBDD as a 3D-conditional generation problem and present DiffSBDD, an E(3)-equivariant 3D-conditional diffusion model that generates novel ligands conditioned on protein pockets. Furthermore, we curate a new dataset of experimentally determined binding complex data from Binding MOAD to provide a realistic binding scenario that complements the synthetic CrossDocked dataset. Comprehensive in silico experiments demonstrate the efficiency of DiffSBDD in generating novel and diverse drug-like ligands that engage protein pockets with high binding energies as predicted by in silico docking. Arne Schneuing · Yuanqi Du · Charles Harris · Arian Jamasb · Ilia Igashov · weitao Du · Tom Blundell · Pietro Lió · Carla Gomes · Max Welling · Michael Bronstein · Bruno Correia 🔗 - Reconstruction of polymer structures from contact maps with Deep Learning (Poster) For any polymer, the Euclidean distance map ($\mathbf{D}$) is defined as a matrix with entries $D_{ij}=d_{ij}^2$, where $d_{ij}$ is the distance between monomers $i$ and $j$. This contains all the necessary information to re-create the structure. However, certain biological experiments, especially Hi-C or NOESY NMR, are only able to provide us with a list of monomers that are within a certain cut-off distance ($r_c$). This is called a contact map ($\mathbf{C}$). We propose a deep auto-encoder that is able to reconstruct $\mathbf{D}$ when only provided with $\mathbf{C}$. We test this network on ensembles of structures generated by MD simulations. We show that a deep auto-encoder is capable of reconstructing polymer structures simply from the contact map information. We propose that this network can be applied to single-cell Hi-C maps to reconstruct chromosome structures in individual cells. 
Atreya Dey 🔗 - Contrasting drugs from decoys (Poster) Protein language models (PLMs) have recently been proposed to advance drug-target interaction (DTI) prediction, and have shown state-of-the-art performance on several standard benchmarks. However, a remaining challenge for all DTI prediction models (including PLM-based ones) is distinguishing true drugs from highly-similar decoys. Leveraging techniques from self-supervised contrastive learning, we introduce a second-generation PLM-based DTI model trained on triplets of proteins, drugs, and decoys (small drug-like molecules that do not bind to the protein). We show that our approach, CON-Plex, improves specificity while maintaining high prediction accuracy and generalizability to new drug classes. CON-Plex maps proteins and drugs to a shared latent space which can be interpreted to identify mutually-compatible classes of proteins and drugs. Data and code are available at https://zenodo.org/record/7127229. Samuel Sledzieski · Rohit Singh · Lenore J Cowen · Bonnie Berger 🔗 - Learning from physics-based features improves protein property prediction (Poster) Data-based and physics-based methods have long been considered as distinct approaches for protein property prediction. However, they share complementary strengths, such that integrating physics-based features with machine learning may improve model generalizability and accuracy. Here, we demonstrate that incorporating pre-computed energetic features in machine learning models improves performance in out-of-distribution and low training data regimes in a proof of concept study with two distinct protein engineering tasks. By training with sequence, structure, and pre-computed Rosetta energy features on graph neural nets, we achieve performance comparable to masked inverse folding pretraining with the same architecture. 
Amy Wang · Ava Soleimany · Alex X Lu · Kevin Yang 🔗 - RL Boltzmann Generators for Conformer Generation in Data-Sparse Environments (Poster) The generation of conformers has been a long-standing interest to structural chemists and biologists alike. A subset of proteins known as intrinsically disordered proteins (IDPs) fail to exhibit a fixed structure and, therefore, must also be studied in this light of conformer generation. Unlike in the small molecule setting, ground truth data are sparse in the IDP setting, undermining many existing conformer generation methods that rely on such data for training. Boltzmann generators, trained solely on the energy function, serve as an alternative but display a mode collapse that similarly precludes their direct application to IDPs. We investigate the potential of training an RL Boltzmann generator against a closely related “Gibbs score,” and demonstrate that conformer coverage does not track well with such training. This suggests that the inadequacy of solely training against the energy is independent of the modeling modality. Yash Patel · Ambuj Tewari 🔗 - 3D alignment of cryogenic electron microscopy density maps by minimizing their Wasserstein distance (Poster) Aligning electron density maps of multiple conformations of a biomolecule from cryogenic electron microscopy (cryo-EM) is a first key step to study conformational heterogeneity. As this step remains challenging, with standard alignment tools being potentially stuck in local minima, we propose here a new procedure, which relies on the use of computational optimal transport (OT) to align EM maps in 3D space. By embedding a fast estimation of OT maps within a stochastic gradient descent algorithm, our method searches for a rotation that minimizes the Wasserstein distance between two maps, represented as point clouds. 
We show that our method outperforms standard methods on experimental data, with an increased range of rotation angles leading to proper alignment, suggesting that it can be further applied to align 3D EM maps. Aryan Tajmir Riahi · Geoffrey Woollard · Frederic Poitevin · Anne Condon · Khanh Dao Duc 🔗 - A Benchmark Framework for Evaluating Structure-to-Sequence Models for Protein Design (Poster) Structure-based de novo protein design methods, such as ESM-IF1 (Hsu et al., 2022) and ProteinMPNN (Dauparas et al., 2022), have recently shown impressive results in zero-shot fitness prediction, protein sequence recovery, and experimentally-validated protein design. The prospect of utilizing such methods to design better proteins is tantalizing and has already driven experimental work (Wicky et al., 2022). However, current understanding of when and why these methods perform well or poorly is limited due to a paucity of comprehensive ground-truth experimental data. This makes in silico benchmarking and ablation difficult, requiring expensive experimental validation that hampers fast feedback loops and rapid methodological development. In this work, we evaluate the capabilities of structure-based methods for protein design against a combinatorially complete fitness landscape measuring stability and binding of the protein G domain B1 (Wu et al., 2016). We develop a framework for protein design that divides into two tasks: generation and ranking. In the case of ESM-IF1 we significantly improve its generation capabilities via distilled conditional language modeling. We find that both methods show impressive generation and ranking results for small experimental budgets but scale poorly to larger budgets. Finally, we demonstrate that modeling protein complexes yields minor design improvements for binding affinity tasks. 
Jeffrey Chan · Seyone Chithrananda · David Brookes · Sam Sinai 🔗 - Predicting Immune Escape with Pretrained Protein Language Model Embeddings (Poster) Assessing the severity of new pathogenic variants requires an understanding of which mutations will escape the human immune response. Even single point mutations to an antigen can cause immune escape and infection via abrogation of antibody binding. Recent work has modeled the effect of single point mutations on proteins by leveraging the information contained in large-scale, pretrained protein language models. These models are often applied in a zero-shot setting, where the effect of each mutation is predicted based on the output of the language model with no additional training. However, this approach cannot appropriately model immune escape, which involves the interaction of two proteins (antibody and antigen) instead of one and requires making different predictions for the same antigenic mutation in response to different antibodies. Here, we explore several methods for predicting immune escape by building models on top of embeddings from pretrained protein language models. We evaluate our methods on a SARS-CoV-2 deep mutational scanning dataset and show that our embedding-based methods significantly outperform zero-shot methods, which have almost no predictive power. We additionally highlight insights into how best to use embeddings from pretrained protein language models to predict escape. Kyle Swanson · Howard Chang · James Zou 🔗 - Visualizing DNA reaction trajectories with deep graph embedding approaches (Poster) Synthetic biologists and molecular programmers design novel nucleic acid reactions, with many potential applications. Good visualization tools are needed to help domain experts make sense of the complex outputs of folding pathway simulations of such reactions. 
Here we present ViDa, a new approach for visualizing DNA reaction folding trajectories over the energy landscape of secondary structures. We integrate a deep graph embedding model with common dimensionality reduction approaches, to map high-dimensional data onto 2D Euclidean space. We assess ViDa on two well studied and contrasting DNA hybridization reactions. Our preliminary results suggest that ViDa's visualization successfully separates trajectories with different folding mechanisms, thereby providing useful insight to users, and is a big improvement over the current state-of-the-art in DNA kinetics visualization. Chenwei Zhang · Anne Condon · Khanh Dao Duc 🔗 - MLPfold: Identification of transition state ensembles in molecular dynamics simulations using machine learning (Poster) Molecular dynamics simulations generate a large amount of raw data, which often require computationally expensive analysis to extract important information. Here, we propose a novel method, MLPfold, to identify the transition state ensemble of a system through an automated labeling process and supervised learning using a simple MLP. This seeks to replicate the conventional Pfold calculation but without requiring the running of any additional simulations. MLPfold was tested on numerous model potentials and Brownian dynamics simulation of the Ubiquitin hairpin and shows promise in predicting committor probabilities and identifying transition states. Preetham Venkatesh 🔗 - Predicting interaction partners using masked language modeling (Poster) Determining which proteins interact together from their amino acid sequences is an important task. In particular, even if an interaction is known to exist in some species between members of two protein families, determining which other members of these families are interaction partners can be tricky. Indeed, it requires identifying which paralogs interact together. Various methods have been proposed to this end. 
Here, we present a new one, which relies on a protein language model trained on multiple sequence alignments and directly exploits the fact that this model was trained to fill in masked amino acids. We obtain promising results on two different benchmark pairs of interacting protein families where partners are known. In particular, performance is good even for shallow alignments, while previous coevolution-based methods require deep ones. Performance is also found to quickly improve by giving the model correct examples of interacting sequences. Damiano Sgarbossa · Umberto Lupo · Anne-Florence Bitbol 🔗 - ZymCTRL: a conditional language model for the controllable generation of artificial enzymes (Poster) The design of custom-tailored proteins has the potential to provide novel and groundbreaking solutions in many fields, including molecular medicine or environmental sciences. Among protein classes, enzymes are particularly attractive because their complex active sites can accelerate chemical reactions and transformations by several orders of magnitude. Since enzymes are biodegradable nanoscopic materials, they hold an unmatched promise as sustainable, large-scale industrial catalysts. Motivated by the enormous success of language models in designing novel yet nature-like proteins, we hypothesized that an enzyme-specific language model could provide new opportunities to design purpose-built artificial enzymes. Here, we describe ZymCTRL, a conditional language model trained on the BRENDA database of enzymes, which generates enzymes of a specific Enzymatic Class upon a user prompt. ZymCTRL generates artificial enzymes distant to natural ones while their intended functionality matches predictions from orthogonal methods. We release the model to the community. 
Noelia Ferruz 🔗 - Large-scale self-supervised pre-training on protein three-dimensional structures (Poster) Recent developments in the protein structure prediction field led to a drastic increase in the number of available protein three-dimensional structures. This creates a challenge and presents an opportunity for discovering fitting approaches to utilise such new datasets in various machine learning settings. In this paper, we propose STEP (STructural Embedding of Proteins), a self-supervised learning approach for creating meaningful embeddings of protein structures, and demonstrate its utility in a variety of downstream tasks. We study various approaches to such a problem, including deep metric learning, as well as simple label prediction tasks. We demonstrate the superiority of STEP over existing models in a variety of downstream tasks, including the prediction of drug-target interactions. We show that for especially challenging tasks, such as predicting drugs for new proteins, our model shows improvement of up to 0.1 AUROC over previous methods. Ilya Senatorov 🔗 - Fast protein structure searching using structure graph embeddings (Poster) Comparing and searching protein structures independent of primary sequence has proved useful for remote homology detection, function annotation and protein classification. With the recent leap in accuracy of protein structure prediction methods and increased availability of protein models, attention is turning to how to best make use of this data. Fast and accurate methods to search databases of millions of structures will be essential to this endeavour, in the same way that fast protein sequence searching underpins much of bioinformatics. We train a simple graph neural network to learn a low-dimensional embedding of protein structure, and show that the embedding can be used to query structures against large structural databases with accuracy comparable to current methods. 
The speed of the method and its ability to scale to millions of structures make it suitable for this structure-rich era. Joe Greener · Kiarash Jamali 🔗 - A Federated Learning benchmark for Drug-Target Interaction (Poster) Aggregating pharmaceutical data in the drug-target interaction (DTI) domain has the potential to deliver life-saving breakthroughs. It is, however, notoriously difficult due to regulatory constraints and commercial interests. This work proposes the application of federated learning, which we argue to be reconcilable with the industry's constraints, as it does not require sharing of any information that would reveal the entities' data or any other high-level summary of it. When used on a representative GraphDTA model and the KIBA dataset it achieves up to 15% improved performance relative to the best available non-privacy-preserving alternative. Our extensive battery of experiments shows that, unlike in other domains, the non-IID data distribution in the DTI datasets does not deteriorate FL performance. Additionally, we identify a material trade-off between the benefits of adding new data, and the cost of adding more clients. Filip Svoboda · Gianluca Mittone · Nicholas Lane · Pietro Lió 🔗 - Physics aware inference for the cryo-EM inverse problem: anisotropic network model heterogeneity, global 3D pose and microscope defocus (Poster) We propose a parametric forward model for single particle cryo-electron microscopy (cryo-EM), and employ stochastic variational inference to infer posterior distributions of the physically interpretable latent variables. Our novel cryo-EM forward model accounts for the biomolecular configuration (via spatial coordinates of pseudo-atoms, in contrast with traditional voxelized representations), the global 3D pose, the effect of the microscope (contrast transfer function's defocus parameter), and noise. To capture heterogeneity, we use the anisotropic network model (ANM), a Gaussian in the space of atomic coordinates. 
We perform experiments on synthetic data and show that the posterior of the scalar component along the lowest ANM mode and the angle of 2D in-plane pose can be jointly inferred with deep neural networks. We also demonstrate Fourier frequency marching in the simulation and likelihood during training, without retraining the neural networks that characterize the variational posterior. Geoffrey Woollard · Shayan Shekarforoush · Frank Wood · Marcus Brubaker · Khanh Dao Duc 🔗 - Allele-conditional attention mechanism for HLA-peptide complex binding affinity prediction (Poster) The Human Leukocyte Antigen (HLA) complex plays a crucial role in adaptive immune responses for cancer immunology. Due to the complex interactions governing the binding process of peptides on the HLA surface and the intrinsic polymorphic nature of the HLA complex, one of the main bottlenecks for cancer vaccine design is the accurate prediction of binding epitopes for specific alleles. Data-driven approaches using information from binding experiments have been shown to be effective for high-throughput screening of candidates, instead of expensive docking methods. However, there is still no consensus on how to most effectively represent amino acid sequences and model the long interaction patterns present in these complexes. Recently, attention-based models have been explored to improve this task, allowing for higher flexibility by introducing weaker inductive biases into the models, but carrying a critical trade-off between expressivity and data-efficiency. We propose an allele-conditional attention mechanism for binding prediction and show how constraining attention between the HLA-context and peptide sequences improves performance, while requiring fewer parameters compared to standard transformer-like models. We thoroughly study the impact of different attention schemes and pooling methods on the task of binding affinity prediction and benchmark widely utilized deep learning architectures. 
In addition, we show that patterns in string representation space can also provide insights and encode information that correlates with the underlying spatial interactions between HLA class I and peptide amino acids, without any extra docking simulations. Rodrigo Hormazabal · Doyeong Hwang · Kiyoung Kim · Sehui Han · Kyunghoon Bae · Honglak Lee 🔗 - Conditional neural processes for molecules (Poster) Neural processes (NPs) are models for transfer learning with properties reminiscent of Gaussian Processes (GPs). They are adept at modelling data consisting of few observations of many related functions on the same input space and are trained by minimizing a variational objective, which is computationally much less expensive than the Bayesian updating required by GPs. So far, most studies of NPs have focused on low-dimensional datasets which are not representative of realistic transfer learning tasks. Drug discovery is one application area that is characterized by datasets consisting of many chemical properties or functions which are sparsely observed, yet depend on shared features or representations of the molecular inputs. This paper applies the conditional neural process (CNP) to DOCKSTRING, a dataset of docking scores for benchmarking ML models. CNPs show competitive performance in few-shot learning tasks relative to supervised learning baselines common in QSAR modelling, as well as an alternative model for transfer learning based on pre-training and refining neural network regressors. We present a Bayesian optimization experiment which showcases the probabilistic nature of CNPs and discuss shortcomings of the model in uncertainty quantification. 
Miguel Garcia-Ortegon · Andreas Bender · Sergio Bacallado 🔗 - Towards automated crystallographic structure refinement with a differentiable pipeline (Poster) The lack of interfaces between crystallographic data and machine learning methods prevents the application of modern neural network frameworks to crystal structure determination. Here we present SFcalculator, a differentiable pipeline to generate crystallographic data (structure factors) from atomic molecular structures with the bulk solvent model. This calculator fills the gap between the long-established crystallography field and state-of-the-art deep learning algorithms. We discuss the correctness and performance of SFcalculator by comparing it with the most widely used tool, Phenix. Finally, we demonstrate in an initial attempt that it enables automated structure refinement in a well-regularized latent space defined by a deep generative model, which offers a more principled way to impose prior knowledge. We believe this tool paves the way towards fully automated structure refinement and a possible end-to-end model, which is crucial for the next generation of high-throughput diffraction experiments. Minhuan Li · Doeke Hekstra 🔗 - Ligand-aware protein sequence design using protein self contacts (Poster) The design of ligand-binding proteins remains a significant challenge. Few, if any, structure-to-sequence deep learning methods include representations of small molecules for use in sequence design. Here, we show that favorable interactions between chemical-group fragments and proteins can be learned from large databases consisting of protein self contacts. We approximate ligands as collections of proteinaceous chemical groups and train simple MLPs to learn amino-acid identities when conditioned on the placement of these chemical groups relative to the backbone of a residue.
We use fragment-aware amino-acid probabilities to compute the binding-site residues of protein-ligand structures and evaluate our method by sequence recovery. Surprisingly, this simple fragment-aware feature can in some cases accurately predict residue identities with no prior knowledge of binding site structures. Jody Mou · Benjamin Fry · Chun-Chen Yao · Nicholas Polizzi 🔗 - Lightweight Equivariant Graph Representation Learning for Protein Engineering (Poster) This work tackles a problem central to directed evolution in computational protein design: accurately predicting the function of a protein mutant. We design a lightweight graph neural network for pre-training multi-task protein representations from 3D structure. Rather than reconstructing and optimizing the protein structure, the trained model recovers the amino acid types and key properties of the central residues from a given noisy three-dimensional local environment. On the prediction task for higher-order mutants, where many amino acid sites of the protein are mutated, the proposed training strategy achieves a 20% improvement in performance while requiring less than 1% of the computational resources required by popular transformer-based state-of-the-art deep learning models for protein design. Bingxin Zhou · · Kai Yi · Xinye Xiong · Pan Tan · Liang Hong · Yuguang Wang 🔗 - Improving Molecular Pretraining with Complementary Featurizations (Poster) Molecular pretraining, which learns molecular representations over massive unlabeled data, has become a prominent paradigm to solve a variety of tasks in computational chemistry and drug discovery. Recently, substantial progress has been made in molecular pretraining with different molecular featurizations, including 1D SMILES strings, 2D graphs, and 3D geometries.
However, the role of molecular featurizations with their corresponding neural architectures in molecular pretraining remains largely unexamined. In this paper, through two case studies (chirality classification and aromatic ring counting), we first demonstrate that different featurization techniques convey chemical information differently. In light of this observation, we propose a simple and effective MOlecular pretraining framework with COmplementary featurizations (MOCO). MOCO comprehensively leverages multiple featurizations that complement each other and outperforms existing state-of-the-art models that rely solely on one or two featurizations on a wide range of molecular property prediction tasks. Yanqiao Zhu · Dingshuo Chen · Yuanqi Du · Yingze Wang · Qiang Liu · Shu Wu 🔗 - Using domain-domain interactions to probe the limitations of MSA pairing strategies (Poster) State-of-the-art methods for the prediction of the structures of interacting protein complexes rely on the construction of paired multiple sequence alignments, whose rows contain concatenated pairs of homologues of each of the interacting chains. Despite the inherent difficulty of accurately pairing interacting homologues of each chain, most existing methods use simple heuristic strategies for this purpose. The accuracy of these heuristic strategies and the consequences of their widespread usage remain poorly understood, due in large part to the paucity of ground truth data on correct pairings. To remedy this situation we propose a novel benchmark setting for interaction partner pairing algorithms, based on domain-domain interactions within single protein chains. The co-existence of pairs of domains within single chains means that ground-truth pairs of homologues are known a priori, allowing both the accuracy of pairing strategies and the influence of inaccurate pairings on downstream inferences to be quantified directly.
We provide evidence that the widely used best-hit pairing strategy leads in many cases to very noisy paired MSAs, from which inferences of 3D structure can be significantly less accurate than those made using the correctly paired MSAs. We conclude that further improvements in pairing strategies promise significant benefits for structure predictors capable of exploiting co-evolutionary signal. Alex Hawkins-Hooker · David Jones · Brooks Paige 🔗 - Metal3D: Accurate prediction of transition metal ion location via deep learning (Poster) Metal ions are essential cofactors for many proteins and about half of the structurally characterized proteins contain a metal ion. Metal ions play a crucial role in many applications such as enzyme design or design of protein-protein interactions because they are biologically abundant, tether to the protein using strong interactions, and have favorable catalytic properties, e.g. as Lewis acids. In this work, we develop a convolutional neural network based approach to identify metal binding sites in experimental and computationally predicted protein structures. Comparison with other currently available tools shows that Metal3D is the most accurate metal ion location predictor to date using a single structure as input. Metal3D outputs a confidence metric for each predicted site and works on proteins with few homologs in the Protein Data Bank. The predicted metal ion locations for Metal3D are within 0.70 ± 0.64 Å of the experimental locations, with half of the sites below 0.5 Å. Metal3D predicts a global metal density that can be used for annotation of structures predicted using e.g. AlphaFold2 and a per-residue metal density that can be used in protein design workflows for the location of suitable metal binding sites and rotamer sampling to create novel metalloproteins.
Simon Dürr 🔗 - Predicting conformational landscapes of known and putative fold-switching proteins using AlphaFold2 (Poster) Proteins that switch their secondary structures in response to a stimulus, commonly known as "metamorphic proteins", directly question the paradigm of "one structure per protein". Despite the potential to more deeply understand protein folding and function through studying metamorphic proteins, their discovery has been largely by chance, with fewer than 10 experimentally validated. AlphaFold2 (AF2) has dramatically increased accuracy in predicting single structures, though it fails to return alternate states for known metamorphic proteins in its default settings. We demonstrate that clustering an input multiple sequence alignment (MSA) by sequence similarity enables AF2 to sample alternate states of known metamorphs. Moreover, AF2 scores these alternate states with high confidence. We used our clustering method, AF-cluster, to screen for alternate states in protein families without known fold-switching, and identified a putative alternate state for the oxidoreductase DsbE. Similarly to KaiB, DsbE is predicted to switch between a thioredoxin-like fold and a novel fold. This prediction is the subject of ongoing experimental testing. Further development of such bioinformatic methods in tandem with experiment will likely aid in accelerating discovery and gaining a more systematic understanding of fold-switching in protein families. Hannah Wayment-Steele · Sergey Ovchinnikov · Lucy Colwell · Dorothee Kern 🔗 - Dynamic-backbone protein-ligand structure prediction with multiscale generative diffusion models (Poster) Molecular complexes formed by proteins and small-molecule ligands are ubiquitous, and predicting their 3D structures can facilitate both biological discoveries and the design of novel enzymes or drug molecules.
Here we propose NeuralPLexer, a deep generative model framework to rapidly predict protein-ligand complex structures and their fluctuations using protein backbone template and molecular graph inputs. NeuralPLexer jointly samples protein and small-molecule 3D coordinates at an atomistic resolution through a generative model that incorporates biophysical constraints and inferred proximity information into a time-truncated diffusion process. The reverse-time generative diffusion process is learned by a novel stereochemistry-aware equivariant graph transformer that enables efficient, concurrent gradient field prediction for all heavy atoms in the protein-ligand complex. NeuralPLexer outperforms existing physics-based and learning-based methods on benchmarking problems including fixed-backbone blind protein-ligand docking and ligand-coupled binding site repacking. Moreover, we identify preliminary evidence that NeuralPLexer enriches bound-state-like protein structures when applied to systems where protein folding landscapes are significantly altered by the presence of ligands. Our results reveal that a data-driven approach can capture the structural cooperativity among protein and small-molecule entities, showing promise for the computational identification of novel drug targets and the end-to-end differentiable design of functional small molecules and ligand-binding proteins. Zhuoran Qiao · Weili Nie · Arash Vahdat · Thomas Miller · Anima Anandkumar 🔗 - Predicting Ligand – RNA Binding Using E3-Equivariant Network and Pretraining (Poster) It is becoming increasingly appreciated that small molecules hold great promise in targeting therapeutically relevant RNAs, such as viral RNAs or splicing junctions. Yet predicting which ligands target RNA is particularly difficult since limited data are available.
To overcome this, we fine-tuned a pretrained small molecule representation model, Uni-Mol, to predict the RNA-binding propensity of ligands and to build an RNA-binding QSAR model. In addition, we develop an E3-equivariant model to predict possible ligands given the RNA pocket geometry. To the best of our knowledge, this is the first E3-equivariant model for predicting RNA-ligand binding. We demonstrate the great potential of Uni-Mol pretraining in RNA-ligand tasks towards efficient and rational RNA drug discovery. Zhenfeng Deng · Ruichu Gu · Hangrui Bi · Xinyan Wang · Zhaolei Zhang · Han Wen 🔗 - Seq2MSA: A Language Model for Protein Sequence Diversification (Poster) Diversification libraries, sets of protein sequences that share a similar set of structures across a variety of sequences, can help protein design pipelines by introducing flexibility into the starting structures and providing a range of starting points for directed evolution. However, exploring the sequence space is computationally challenging: the vast majority of sequence space is non-viable, and even among those sequences that do fold to well-formed protein structures, it is challenging to find the fraction that maintain a similar fold class to a given protein. In this work, we propose to use an encoder-decoder language model, trained on a novel Seq2MSA task, that can create diversification libraries of any input protein. In particular, using our model, we are able to generate sequences that maintain structural similarity to a target sequence while pushing below 40% sequence identity to any protein in UniRef. Our diversification pipeline has the potential to aid in computational protein design by providing a diverse set of starting points in sequence space for a given functional or structural target.
Pascal Sturmfels · Roshan Rao · Robert Verkuil · Zeming Lin · Tom Sercu · Adam Lerer · Alex Rives 🔗 - EquiFold: Protein Structure Prediction with a Novel Coarse-Grained Structure Representation (Poster) Designing proteins to achieve specific functions often requires in silico modeling of their properties at high-throughput scale and can significantly benefit from fast and accurate protein structure prediction. We introduce EquiFold, a new end-to-end differentiable, SE(3)-equivariant, all-atom protein structure prediction model. EquiFold uses a novel coarse-grained representation of protein structures that does not require multiple sequence alignments or protein language model embeddings, inputs that are commonly used in other state-of-the-art structure prediction models. Our method relies on geometrical structure representation and is substantially smaller than prior state-of-the-art models. In preliminary studies, EquiFold achieved comparable accuracy to AlphaFold but was orders of magnitude faster. The combination of high speed and accuracy makes EquiFold suitable for a number of downstream tasks, including protein property prediction and design. Jae Hyeon Lee · Payman Yadollahpour · Andrew Watkins · Nathan Frey · Andrew Leaver-Fay · Stephen Ra · Vladimir Gligorijevic · Kyunghyun Cho · Aviv Regev · Richard Bonneau 🔗 - DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking (Poster) Predicting the binding structure of a small molecule ligand to a protein, a task known as molecular docking, is critical to drug design. Recent deep learning methods that treat docking as a regression problem have decreased runtime compared to traditional search-based methods but have yet to offer substantial improvements in accuracy. We instead frame molecular docking as a generative modeling problem and develop DiffDock, a diffusion generative model over the non-Euclidean manifold of ligand poses.
To do so, we map this manifold to the product space of the degrees of freedom (translational, rotational, and torsional) involved in docking and develop an efficient diffusion process on this space. Empirically, DiffDock obtains a 38% top-1 success rate (RMSD < 2 Å) on PDBBind, significantly outperforming the previous state-of-the-art of traditional docking (23%) and deep learning (20%) methods. Moreover, DiffDock has fast inference times and provides confidence estimates with high selective accuracy. Gabriele Corso · Hannes Stärk · Bowen Jing · Regina Barzilay · Tommi Jaakkola 🔗 - SWAMPNN: End-to-end protein structures alignment (Poster) With the recent breakthrough of highly accurate structure prediction methods, there has been a rapid growth of available protein structures. Efficient methods are needed to infer structural similarity within these datasets. We present an end-to-end alignment method, called SWAMPNN, that takes as input the 3D coordinates of a protein pair and outputs a structural alignment. We show that the model is able to recapitulate TM-align alignments while running faster and is more accurate than Foldseek on the alignment task while being comparable for classification. Jeanne Trinquier · Samantha Petti · Shihao Feng · Johannes Soeding · Martin Steinegger · Sergey Ovchinnikov 🔗 - Protein Structure and Sequence Generation with Equivariant Denoising Diffusion Probabilistic Models (Poster) Proteins are macromolecules that mediate a significant fraction of the cellular processes that underlie life. An important task in bioengineering is designing proteins with specific 3D structures and chemical properties which enable targeted functions. To this end, we introduce a generative model of both protein structure and sequence that can operate at significantly larger scales than previous molecular generative modeling approaches.
The model is learned entirely from experimental data and conditions its generation on a compact specification of protein topology to produce a full-atom backbone configuration as well as sequence and side-chain predictions. We demonstrate the quality of the model via qualitative and quantitative analysis of its samples. We show how the model can be applied to protein structure determination such as in CryoEM and present results on predicting domain structures to simulated electron densities at varying resolutions. Videos of sampling trajectories are available at https://nanand2.github.io/proteins. Namrata Anand · Tudor Achim 🔗 - Latent Space Diffusion Models of Cryo-EM Structures (Poster) Cryo-electron microscopy (cryo-EM) is unique among tools in structural biology in its ability to image large, dynamic protein complexes. Key to this ability are image processing algorithms for heterogeneous cryo-EM reconstruction, including recent deep learning-based approaches. The state-of-the-art method cryoDRGN uses a Variational Autoencoder (VAE) framework to learn a continuous distribution of protein structures from single particle cryo-EM imaging data. While cryoDRGN is able to model complex structural motions, in practice, the Gaussian prior distribution of the VAE fails to match the aggregate approximate posterior, especially for multi-modal distributions (e.g. compositional heterogeneity). Here, we train a diffusion model as an expressive, learnable prior for cryoDRGN. We show the ability to sample from the model on two synthetic and two real datasets, where samples accurately follow the data distribution unlike samples from the VAE prior distribution. Our approach learns a high-quality generative model over molecular configurations directly from cryo-EM imaging data. We also demonstrate how the diffusion model prior can be leveraged for fast latent space traversal and interpolation between states of interest. 
By learning an accurate model of the data distribution, our method unlocks tools in generative modeling, sampling, and distribution analysis for heterogeneous cryo-EM ensembles. Karsten Kreis · Tim Dockhorn · Zihao Li · Ellen Zhong 🔗
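
As a reading aid for the DiffDock entry above, the "top-1 success rate (RMSD < 2 Å)" metric can be sketched as the fraction of complexes whose top-ranked predicted ligand pose lies within 2 Å RMSD of the reference pose. The snippet below is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names `rmsd` and `top1_success_rate` are hypothetical, atoms are assumed to be matched one-to-one in the same order, both poses are assumed to sit in the same protein frame (so no superposition is applied), and no correction for ligand symmetry is made.

```python
import numpy as np


def rmsd(pred, ref):
    """RMSD between two (n_atoms, 3) coordinate arrays.

    Assumes atoms are in the same order and both poses share one
    reference frame; no superposition or symmetry correction is done.
    """
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    if pred.shape != ref.shape:
        raise ValueError("pred and ref must have the same shape")
    # Mean of per-atom squared distances, then square root.
    return float(np.sqrt(np.mean(np.sum((pred - ref) ** 2, axis=1))))


def top1_success_rate(top_poses, ref_poses, threshold=2.0):
    """Fraction of complexes whose top-ranked pose has RMSD < threshold (in Å)."""
    hits = [rmsd(p, r) < threshold for p, r in zip(top_poses, ref_poses)]
    if not hits:
        raise ValueError("no complexes provided")
    return sum(hits) / len(hits)


# Toy data (made up for illustration): one pose shifted by 1 Å, one by 3 Å.
reference = np.zeros((4, 3))
near_pose = reference + np.array([1.0, 0.0, 0.0])
far_pose = reference + np.array([3.0, 0.0, 0.0])
rate = top1_success_rate([near_pose, far_pose], [reference, reference])
print(f"top-1 success rate: {rate:.2f}")
```

In this toy case one of the two poses falls under the 2 Å threshold, so the rate is 0.5; real benchmarks typically use symmetry-aware heavy-atom RMSD.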

#### Author Information

##### Jonas Adler (KTH - Royal Institute of Technology)

I’m a Research Scientist at Elekta, pursuing a PhD in Applied Mathematics working under the supervision of Ozan Öktem. I do research in inverse problems and machine learning, especially focusing on the intersection between model-driven and data-driven methods. Organizing [DLIP2019](https://sites.google.com/view/dlip2019).