Workshop
New Frontiers of AI for Drug Discovery and Development
Animashree Anandkumar · Ilija Bogunovic · Ti-chiun Chang · Quanquan Gu · Jure Leskovec · Michelle Li · Chong Liu · Nataša Tagasovska · Mengdi Wang · Wei Wang
Room 242
We will facilitate interdisciplinary discussions to identify gaps and opportunities for AI in the drug discovery and development pipeline.
Schedule
Fri 6:15 a.m. - 6:25 a.m.
|
Opening Remarks
(
Presentation
)
>
|
Chong Liu 🔗 |
Fri 6:25 a.m. - 7:05 a.m.
|
Roundtable Discussion
(
Discussion
)
>
Karen Sayal (GSK): The Opportunities and Challenges of Integrating ML into Clinical Trials
Bülent Kiziltan (Novartis): From Drug Discovery to Drug Design: AI-driven Generative Chemistry
Haoda Fu (Eli Lilly): AI/ML Based De Novo Design for Biologics
Quanquan Gu (UCLA/ByteDance): TBA |
🔗 |
Fri 7:05 a.m. - 7:40 a.m.
|
Invited Talk 1 - Wengong Jin (MIT) - DSMBind: SE(3) denoising score matching for unsupervised binding energy prediction and nanobody design
(
Presentation
)
>
|
🔗 |
Fri 7:40 a.m. - 8:00 a.m.
|
2 Oral Presentations
(
Presentation
)
>
|
🔗 |
Fri 8:00 a.m. - 8:30 a.m.
|
Coffee Break
(
Break
)
>
|
🔗 |
Fri 8:30 a.m. - 9:05 a.m.
|
Invited Talk 2 - Marinka Zitnik (Harvard) - Foundation Models for Molecular Drug Design and Clinical Drug Development
(
Presentation
)
>
We are laying the foundations for AI to enhance the design of new drugs and the understanding of existing medicines, eventually enabling AI to learn on its own. First, I describe FAIR (NeurIPS 2023), a generative model for protein pocket design that enhances drug binding to biological targets. FAIR co-designs protein pocket sequences and corresponding 3D structures, outperforming existing methods by 15.5% (AAR) and 13.5% (RMSD). For drugs to be effective, they must act on biological targets in relevant biological contexts. I describe PINNACLE (bioRxiv 2023), a multi-scale graph neural network for identifying optimal cell contexts for drugs to act in. PINNACLE models perform an array of tasks, including enhancing 3D structural protein representations critical in immuno-oncology, predicting the effects of drugs across cell-type contexts, and nominating therapeutic targets in a cell-type-specific manner. Finally, candidate drugs need to be matched to patient benefits. I present TxGNN (medRxiv 2023), a knowledge graph AI model for zero-shot prediction of therapeutic use across over 17,000 diseases, enabling drug repurposing for 7,000 rare diseases, of which a mere 5% have FDA-approved drugs. TxGNN's predictions align with clinical prescriptions across 1.2 million medical records. Last, we founded Therapeutics Commons, a global initiative to access and evaluate AI across therapeutic modalities (including small molecules, macro-molecules, cell and gene therapies) and stages of drug discovery (spanning from molecular design and target nomination to modeling efficacy, safety, and drug repurposing). The Commons offers benchmarks, leaderboards, and model hubs with pre-trained models and multimodal datasets to facilitate the use of AI in therapeutic science. |
🔗 |
Fri 9:05 a.m. - 9:25 a.m.
|
2 Oral Presentations
(
Presentation
)
>
|
🔗 |
Fri 9:25 a.m. - 10:00 a.m.
|
Invited Talk 3 - Haoda Fu (Eli Lilly) - LLM Is Not All You Need. Generative AI on Smooth Manifolds
(
Presentation
)
>
Generative AI is a rapidly evolving technology that has garnered significant interest lately. In this presentation, we'll discuss the latest approaches, organizing them within a cohesive framework using stochastic differential equations to understand complex, high-dimensional data distributions. We'll highlight the necessity of studying generative models beyond Euclidean spaces, considering smooth manifolds essential in areas like robotics and medical imagery, and for leveraging symmetries in the de novo design of molecular structures. Our team's recent advancements in this blossoming field, ripe with opportunities for academic and industrial collaborations, will also be showcased. |
🔗 |
Fri 10:00 a.m. - 11:20 a.m.
|
Lunch Break
(
Break
)
>
|
🔗 |
Fri 11:20 a.m. - 12:25 p.m.
|
Poster Session
(
Poster
)
>
|
🔗 |
Fri 12:25 p.m. - 1:00 p.m.
|
Invited Talk 4 - Michael Bronstein (Oxford) - Harnessing geometry for molecular design
(
Presentation
)
>
|
Michael Bronstein 🔗 |
Fri 1:00 p.m. - 1:30 p.m.
|
Coffee Break
(
Break
)
>
|
🔗 |
Fri 1:30 p.m. - 2:05 p.m.
|
Invited Talk 5 - Jian Tang (MILA) - Diffusion Models for Molecular Structure Prediction
(
Presentation
)
>
Predicting the 3D structures of molecules is a fundamental problem in both computational chemistry and biology (for small molecules and proteins respectively). In this talk, I'm going to introduce some of our recent work on molecular structure prediction with diffusion-based models, including: (1) the first diffusion model for 3D molecular structure prediction, GeoDiff; (2) a diffusion model defined on torsional space for protein side chain structure prediction, DiffPack; and (3) a diffusion model for inferring multiple stable protein conformations, Str2Str. |
🔗 |
Fri 2:05 p.m. - 2:40 p.m.
|
Invited Talk 6 - Iya Khalil (Merck) - Decoding Biology with AI and High-Throughput Biology
(
Presentation
)
>
|
🔗 |
Fri 2:40 p.m. - 3:10 p.m.
|
Award Ceremony
(
Presentation
)
>
|
Lucian Chan · Denis Tarasov 🔗 |
Fri 3:10 p.m. - 3:20 p.m.
|
Concluding Remarks
(
Presentation
)
>
|
Chong Liu 🔗 |
-
|
Removing Biases from Molecular Representations via Information Maximization
(
Poster
)
>
link
High-throughput drug screening -- using cell imaging or gene expression measurements as readouts of drug effect -- is a critical tool in biotechnology to assess and understand the relationship between the chemical structure and biological activity of a drug. Since large-scale screens have to be divided into multiple experiments, a key difficulty is dealing with batch effects, which can introduce systematic errors and non-biological associations in the data. We propose InfoCORE, an Information maximization approach for COnfounder REmoval, to effectively deal with batch effects and obtain refined molecular representations. InfoCORE establishes a variational lower bound on the conditional mutual information of the latent representations given a batch identifier. It adaptively reweights samples to equalize their implied batch distribution. Extensive experiments on drug screening data reveal InfoCORE's superior performance in a multitude of tasks including molecular property prediction and molecule-phenotype retrieval. Additionally, we show that InfoCORE offers a versatile framework for resolving general distribution shifts and issues of data fairness by minimizing correlation with spurious features or removing sensitive attributes. |
Chenyu Wang · Sharut Gupta · Caroline Uhler · Tommi Jaakkola 🔗 |
-
|
MolSiam: Simple Siamese Self-supervised Representation Learning for Small Molecules
(
Poster
)
>
link
We investigate a self-supervised learning technique from the Simple Siamese (SimSiam) Representation Learning framework on 2D molecule graphs. SimSiam does not require negative samples during training, making it 1) more computationally efficient and 2) less vulnerable to faulty negatives compared with contrastive learning. Leveraging unlabeled molecular data, we demonstrate that our approach, MolSiam, effectively captures the underlying features of molecules, and that molecules with similar properties tend to cluster in UMAP analysis. By fine-tuning pre-trained MolSiam models, we observe performance improvements across four downstream therapeutic property prediction tasks without training with negative pairs. |
Joshua Yao-Yu Lin 🔗 |
-
|
Multitask-Guided Self-Supervised Tabular Learning for Patient-Specific Survival Prediction
(
Poster
)
>
link
Survival prediction, central to the analysis of clinical trials, has the potential to be transformed by the availability of RNA-seq data, as it reveals the underlying molecular and genetic mechanisms for disease and outcomes. However, the number of RNA-seq samples available for understudied or rare diseases is often limited. To address this, leveraging data across different cancer types can be a viable solution, necessitating the application of self-supervised learning techniques. Yet, this wealth of data often comes in a tabular format without a known structure, hindering the development of a generally effective augmentation method for survival prediction. While traditional methods have been constrained by a one-cancer-one-model philosophy or have relied solely on a single modality, our approach, Guided-STab, pretrains on all available RNA-seq data from various cancer types while guiding the representation by incorporating sparse clinical features as auxiliary tasks. With a multitask-guided self-supervised representation learning framework, we maximize the potential of vast unlabeled datasets from various cancer types, leading to genomic-driven survival predictions. These auxiliary clinical tasks then guide the learned representations to enhance critical survival factors. Extensive experiments reinforce the promise of our approach, as Guided-STab consistently outperforms established benchmarks on the TCGA dataset. |
You Wu · Omid Bazgir · Yongju Lee · Tommaso Biancalani · James Lu · Ehsan Hajiramezanali 🔗 |
-
|
Do chemical language models provide a better compound representation?
(
Poster
)
>
link
In recent years, several chemical language models have been developed, inspired by the success of protein language models and advancements in natural language processing. In this study, we explore whether pre-training a chemical language model on billion-scale compound datasets, such as Enamine and ZINC20, can lead to improved compound representation in the drug space. We compare the learned representations of these models with the de facto standard compound representation, and evaluate their potential application in drug discovery and development by benchmarking them on biophysics, physiology, and physical chemistry datasets. Our findings suggest that the conventional masked language modeling approach on these extensive pre-training datasets is insufficient for enhancing compound representations. This highlights the need for additional physicochemical inductive bias in the modeling beyond scaling the dataset size. |
Mirko Torrisi · Saeid Asadollahi · Antonio De la Vega de Leon · Kai Wang · Wilbert Copeland 🔗 |
-
|
Generalist Equivariant Transformer Towards 3D Molecular Interaction Learning
(
Poster
)
>
link
Many processes in biology and drug discovery involve various 3D interactions between molecules, such as protein and protein, protein and small molecule, etc. Given that different molecules are usually represented at different granularities, existing methods usually encode each type of molecule independently with different models, which makes it hard to learn the universal underlying interaction physics. In this paper, we first propose to universally represent an arbitrary 3D complex as a geometric graph of sets, shedding light on encoding all types of molecules with one model. We then propose a Generalist Equivariant Transformer (GET) to effectively capture both domain-specific hierarchies and domain-agnostic interaction physics. To be specific, GET consists of a bilevel attention module, a feed-forward module and a layer normalization module, where each module is E(3) equivariant and specialized for handling sets of variable sizes. Notably, in contrast to conventional pooling-based hierarchical models, our GET is able to retain fine-grained information of all levels. Extensive experiments on the interactions between proteins, small molecules and RNA/DNAs verify the effectiveness and generalization capability of our proposed method across different domains. |
Xiangzhe Kong · Wenbing Huang · Yang Liu 🔗 |
-
|
Inpainting Protein Sequence and Structure with ProtFill
(
Poster
)
>
link
Designing new proteins with specific binding capabilities is a challenging task that has the potential to revolutionize many fields, including medicine and material science. Here we introduce ProtFill, a unified method for simultaneous protein structure and sequence design. Distinct from most existing computational design frameworks which focus on either structure or sequence design, our method embraces both representations concurrently. Employing an $SE(3)$ equivariant diffusion graph neural network, our method excels in both sequence prediction and structure recovery. We demonstrate the model's applicability in interface redesign for antibodies as well as other proteins, underscoring the efficacy of our approach and the potential of the diffusion framework in protein design. The code is available at https://anonymous.4open.science/r/ProtFill-1234/.
|
Elizaveta Kozlova · Arthur Valentin · Daniel Nakhaee-Zadeh Gutierrez 🔗 |
-
|
$\textit{In vitro}$ validated antibody design against multiple therapeutic antigens using generative inverse folding
(
Poster
)
>
link
Deep learning approaches have demonstrated the ability to design protein sequences given backbone structures. While these approaches have been applied $\textit{in silico}$ to designing antibody complementarity-determining regions (CDRs), they have yet to be validated $\textit{in vitro}$ for designing antibody binders, which is the true measure of success for antibody design. Here we describe $\textit{IgDesign}$, a deep learning method for antibody CDR design, and demonstrate its robustness with successful binder design for 8 therapeutic antigens. The model is tasked with designing heavy chain CDR3 (HCDR3) or all three heavy chain CDRs (HCDR123) using native backbone structures of antibody-antigen complexes, along with the antigen and antibody framework (FWR) sequences as context. For each of the 8 antigens, we design 100 HCDR3s and 100 HCDR123s, scaffold them into the native antibody's variable region, and screen them for binding against the antigen using surface plasmon resonance (SPR). As a baseline, we screen 100 HCDR3s taken from the model's training set and paired with the native HCDR1 and HCDR2. We observe that both HCDR3 design and HCDR123 design outperform this HCDR3-only baseline. IgDesign is the first experimentally validated antibody inverse folding model. It can design antibody binders to multiple therapeutic antigens with high success rates and, in some cases, improved affinities over clinically validated reference antibodies. Antibody inverse folding has applications to both $\textit{de novo}$ antibody design and lead optimization, making IgDesign a valuable tool for accelerating drug development and enabling therapeutic design.
|
Amir Shanehsazzadeh 🔗 |
-
|
TopoPool: An Adaptive Graph Pooling Layer for Extracting Molecular and Protein Substructures
(
Poster
)
>
link
Within molecules and proteins, discrete substructures affect high level properties and behavior in distinct ways. As such, explicitly locating and accounting for these substructures is a central problem when learning molecular or protein representations. Because molecules and proteins are typically represented as graphs, this task falls under the umbrella of graph pooling, or segmentation. Given the highly variable size, number, and topology of these substructures, an ideal pooling algorithm would adapt on a graph-by-graph basis and use local context to locate optimal pools. However, this poses a challenge where differentiability is concerned, and each of the learnable graph pooling methods proposed to date must make strong a priori assumptions regarding the number or size of the learned pools. As such, demand remains for a graph pooling algorithm that can maintain differentiability while retaining adaptability in the size and number of learned pools. To meet this demand, we introduce the Topographical Pooling Layer (TopoPool): a differentiable, hierarchical graph pooling layer that learns an arbitrary number of varying sized pools without making any a priori assumptions about their number or size. Additionally, it naturally uncovers only connected substructures, increasing the interpretability of the learned pools and obviating the need for exogenous regularizers to enforce connectedness. We evaluate TopoPool on diverse molecular and protein property prediction tasks, where we achieve competitive performance against existing methods. Taken together, TopoPool represents a novel addition to the graph pooling toolbox, and is particularly relevant to areas like drug design where locating and optimizing discrete, connected molecular substructures is of central importance. |
Mattson Thieme · Majdi Hassan · Chetan Rupakheti · Kedar Thiagarajan · Abhishek Pandey · Han Liu 🔗 |
-
|
Offline RL for generative design of protein binders
(
Poster
)
>
link
Offline Reinforcement Learning (RL) offers a compelling avenue for solving RL problems without the need for interactions with an environment, which may be expensive or unsafe. While online RL methods have found success in various domains, such as de novo Structure-Based Drug Discovery (SBDD), they struggle when it comes to optimizing essential properties derived from protein-ligand docking. The high computational cost associated with the docking process makes it impractical for online RL, which typically requires hundreds of thousands of interactions during learning. In this study, we propose the application of offline RL to address the bottleneck posed by the docking process, leveraging RL's capability to optimize non-differentiable properties. Our preliminary investigation focuses on using offline RL to conditionally generate drugs with improved docking and chemical properties. |
Denis Tarasov · Ulrich Armel Mbou Sob · Miguel Arbesú · Nima Siboni · Sebastien Boyer · Andries Smit · Oliver Bent · Arnu Pretorius · Marcin Skwark 🔗 |
-
|
Online Learning of Optimal Prescriptions under Bandit Feedback with Unknown Contexts
(
Poster
)
>
link
Contextual bandits constitute a classical framework for decision-making under uncertainty. In this setting, the goal is to learn prescriptions of highest reward subject to the contextual information, while the unknown reward parameters of each prescription need to be learned by experimenting with it. Accordingly, a fundamental problem is that of balancing exploration (i.e., prescribing different options to learn the parameters) versus exploitation (i.e., sticking with the best option to gain reward). To study this problem, the existing literature mostly considers perfectly observed contexts. However, the setting of partially observed contexts remains unexplored to date, despite being theoretically more general and practically more versatile. We study bandit policies for learning to select optimal prescriptions based on observations, which are noisy linear functions of the unobserved context vectors. Our theoretical analysis shows that the Thompson sampling policy successfully balances exploration and exploitation. Specifically, we establish (i) regret bounds that grow poly-logarithmically with time, (ii) square-root consistency of parameter estimation, and (iii) scaling with other quantities including dimensions and number of options. Extensive numerical experiments with both real and synthetic data are presented as well, corroborating the efficacy of Thompson sampling. To establish the results, we utilize concentration inequalities for dependent data and also develop novel probabilistic bounds for time-varying suboptimality gaps, among others. These techniques pave the way towards studying similar problems. |
Hongju Park · Mohamad Kazem Shirani Faradonbeh 🔗 |
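As a rough illustration of the exploration-exploitation balance described in the abstract above, the following is a minimal sketch of Thompson sampling in a deliberately simplified setting: a non-contextual bandit with Gaussian rewards, a standard normal prior on each option's mean, and a known noise level. It does not model the partially observed contexts studied in the paper, and the function name `thompson_gaussian`, the prior, and all parameter values are assumptions made for this example.

```python
import random
from typing import List


def thompson_gaussian(true_means: List[float], rounds: int,
                      noise_sd: float = 0.5, seed: int = 0) -> List[int]:
    """Run Gaussian Thompson sampling and return how often each option was chosen.

    Assumes a N(0, 1) prior on every option's mean reward and a known
    reward noise standard deviation (both are illustrative assumptions).
    """
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k      # number of times each option was prescribed
    sums = [0.0] * k      # running sum of observed rewards per option
    for _ in range(rounds):
        samples = []
        for a in range(k):
            # Conjugate Gaussian posterior for the mean of option a.
            precision = 1.0 + counts[a] / noise_sd ** 2
            post_mean = (sums[a] / noise_sd ** 2) / precision
            samples.append(rng.gauss(post_mean, precision ** -0.5))
        # Act greedily with respect to the sampled means.
        arm = max(range(k), key=samples.__getitem__)
        reward = rng.gauss(true_means[arm], noise_sd)
        counts[arm] += 1
        sums[arm] += reward
    return counts


if __name__ == "__main__":
    print(thompson_gaussian([0.0, 0.5, 2.0], rounds=500))
```

Options with few observations keep a wide posterior and are still sampled occasionally (exploration), while the option with the highest posterior mean is chosen most of the time (exploitation).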
-
|
Hit Expansion Driven By Machine Learning
(
Poster
)
>
link
Recent work (McCloskey et al., 2020) utilized experimental data from DNA-encoded library (DEL) selections to train graph convolutional neural networks (GCNNs) (Kearnes et al., 2016) for identifying hit compounds for protein targets, and their prospective test results demonstrated excellent hit rates for three diverse proteins. Building on this work, we propose two novel approaches to leverage DEL GCNN model predictions and embeddings to automate hit expansion, a critical step in real-world drug discovery that guides the optimization of initial hit compounds toward clinical candidates. We prospectively tested the proposed approaches on a protein target (sEH) and our methods identified more small molecules with higher potency compared to traditional molecular fingerprint similarity searches. Specifically, we discovered $34$ molecules with higher potency than a sEH clinical trial candidate using our approaches. All sEH assay results are publicly available at https://www.tdcommons.org/dpubs_series/7414/. Furthermore, applying the automated hit expansion approach to WDR91, a novel protein target that has no known binders, led to the discovery of two first-in-class covalent binders that were experimentally confirmed by co-crystal structures.
|
Jin Xu · Steven Kearnes · JW Feng 🔗 |
-
|
A framework for conditional diffusion modelling with applications in motif scaffolding for protein design
(
Poster
)
>
link
Many protein design applications, such as binder or enzyme design, require scaffolding a structural motif with high precision. Generative modelling paradigms based on denoising diffusion processes have emerged as a leading candidate to address this motif scaffolding problem and have shown early experimental success in some cases. In the diffusion paradigm, motif scaffolding is treated as a conditional generation task, and several conditional generation protocols were proposed or imported from the computer vision literature. However, most of these protocols are motivated heuristically, e.g. via analogies to Langevin dynamics, and lack a unifying framework, obscuring connections between the different approaches. In this work, we unify conditional training and conditional sampling procedures under one common framework based on the mathematically well-understood Doob's h-transform. This new perspective allows us to draw connections between existing methods and propose a new conditional training protocol. We illustrate the effectiveness of this new protocol in both image outpainting and motif scaffolding, and find that it outperforms standard methods. |
Kieran Didi · Francisco Vargas · Simon Mathis · Vincent Dutordoir · Emile Mathieu · Urszula Julia Komorowska · Pietro Lió 🔗 |
-
|
Automating reward function configuration for drug design
(
Poster
)
>
link
Designing reward functions that can guide generative molecular design (GMD) algorithms to desirable areas of chemical space is of critical importance in AI-driven drug discovery. Traditionally, this has been a manual and error-prone task; the selection of appropriate computational methods to approximate biological assays is challenging and the normalisation and aggregation of computed values into a single score even more so, leading to potential reliance on trial-and-error approaches. We propose a novel approach for automated reward configuration that relies solely on experimental data, mitigating the challenges of manual reward adjustment on drug discovery projects. Our method achieves this by constructing a ranking over experimental data based on Pareto dominance over the multi-objective space, then training a neural network to approximate the reward function such that rankings determined by the predicted reward correlate with those determined by the Pareto dominance relation. We validate our method using two case studies. In the first study we simulate Design-Make-Test-Analyse (DMTA) cycles by alternating reward function updates and generative runs guided by that function. We show that the learned function adapts over time to yield compounds that score highly with respect to evaluation functions taken from the literature. In the second study we apply our algorithm to historical data from four real drug discovery projects. We show that our algorithm yields reward functions that outperform the predictive accuracy of human-defined functions, achieving an improvement of up to $0.4$ in Spearman's correlation against a ground truth evaluation function that encodes the target drug profile for that project. Our method provides an efficient data-driven way to configure reward functions for GMD, and serves as a strong baseline for future research into transformative approaches for the automation of drug discovery.
|
Temitope Ajileye · Paul Gainer · Marius Urbonas · Douglas Pires 🔗 |
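The ranking step described in the abstract above, ordering experimental data by Pareto dominance over several objectives, can be sketched as follows. This is only an illustrative sketch: it assumes every objective is to be maximised, uses successive non-dominated fronts as the ranking, and replaces the neural-network training with a simple pairwise agreement check between a candidate reward and the Pareto ranks. The function names and the toy data are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a Pareto-dominates b, assuming all objectives are maximised."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_ranks(points: List[Sequence[float]]) -> List[int]:
    """Rank 0 is the non-dominated front; rank k is the front left after
    removing all points of rank lower than k."""
    remaining = set(range(len(points)))
    ranks = [0] * len(points)
    level = 0
    while remaining:
        front = {i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)}
        for i in front:
            ranks[i] = level
        remaining -= front
        level += 1
    return ranks


def pairwise_agreement(ranks: List[int], rewards: List[float]) -> float:
    """Fraction of pairs with different Pareto ranks in which the
    better-ranked point also receives the higher reward."""
    agree = total = 0
    for i in range(len(ranks)):
        for j in range(i + 1, len(ranks)):
            if ranks[i] == ranks[j]:
                continue
            total += 1
            better, worse = (i, j) if ranks[i] < ranks[j] else (j, i)
            agree += rewards[better] > rewards[worse]
    return agree / total if total else 1.0


if __name__ == "__main__":
    # Hypothetical compounds scored on two objectives, e.g. (potency, solubility).
    compounds = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.1, 0.1), (0.9, 0.6)]
    ranks = pareto_ranks(compounds)
    rewards = [sum(c) for c in compounds]  # stand-in for a learned reward
    print(ranks, pairwise_agreement(ranks, rewards))
```

A learned reward function would be trained so that this kind of agreement with the Pareto ranking is high; here the sum of objectives serves only as a placeholder reward.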
-
|
DGFN: Double Generative Flow Networks
(
Poster
)
>
link
Deep learning is emerging as an effective tool in drug discovery, with potential applications in both predictive and generative models. Generative Flow Networks (GFlowNets/GFNs) are a recently introduced method recognized for the ability to generate diverse candidates, in particular in small molecule generation tasks. In this work, we introduce double GFlowNets (DGFNs). Drawing inspiration from reinforcement learning and Double Deep Q-Learning, we introduce a target network used to sample trajectories, while updating the main network with these sampled trajectories. Empirical results confirm that DGFNs effectively enhance exploration in sparse reward domains and high-dimensional state spaces, both challenging aspects of de novo design in drug discovery. |
Elaine Lau · Nikhil Murali Vemgal · Doina Precup · Emmanuel Bengio 🔗 |
-
|
Target Conditioned GFlowNet for Drug Design
(
Poster
)
>
link
We seek to automate the generation of drug-like compounds conditioned on specific protein pocket targets. Most current methods approximate the protein-molecule distribution of a finite dataset and therefore struggle to generate molecules with significant binding improvement over the training dataset. We instead frame the pocket-conditioned molecular generation task as an RL problem and develop TacoGFN, a target-conditioned Generative Flow Network model. Our method is explicitly encouraged to generate molecules with desired properties as opposed to fitting on a pre-existing data distribution. To this end, we develop transformer-based docking score prediction to speed up docking score computation and propose TacoGFN to explore molecule space efficiently. Furthermore, we incorporate several rounds of active learning where generated samples are queried using a docking oracle to improve the docking score prediction. This approach allows us to accurately explore as much of the molecule landscape as we can afford computationally. Empirically, molecules generated using TacoGFN and its variants significantly outperform all baseline methods across every property (Docking score, QED, SA, Lipinski), while being orders of magnitude faster. |
Tony Shen · Mohit Pandey · Martin Ester 🔗 |
-
|
Towards a more inductive world for drug repurposing approaches
(
Poster
)
>
link
Drug-target interaction (DTI) prediction is a challenging, albeit essential, task in drug repurposing. Graph learning models have drawn special attention as they can significantly reduce drug repurposing costs and time commitment. However, many current approaches require demanding additional information besides DTIs, which complicates their evaluation process and usability. Additionally, structural differences in the learning architecture of current models hinder their fair benchmarking. In this work, we first perform an in-depth evaluation of current DTI datasets and prediction models through a robust benchmarking process, and show that DTI prediction methods based on transductive models lack generalization and lead to inflated performance when evaluated as previously done in the literature, and hence are not suited for drug repurposing approaches. We then propose a novel biologically-driven strategy for negative edge subsampling and show through in vitro validation that newly discovered interactions are indeed true. We envision this work as the underpinning for future fair benchmarking and robust model design. All generated resources and tools are publicly available as a Python package. |
Jesus de la Fuente Cedeño · Guillermo Serrano · Uxia Veleiro · Mikel Casals · Laura Vera · Marija Pizurica · Antonio Pineda-Lucena · Idoia Ochoa · Silve Vicent · Olivier Gevaert · Mikel Hernaez
|
-
|
Sample-efficient Antibody Design through Protein Language Model for Risk-aware Batch Bayesian Optimization
(
Poster
)
>
link
Antibody design is a time-consuming and expensive process that often requires extensive experimentation to identify the best candidates. To address this challenge, we propose an efficient and risk-aware antibody design framework that leverages protein language models (PLMs) and batch Bayesian optimization (BO). Our framework utilizes the generative power of protein language models to predict candidate sequences with higher naturalness and a Bayesian optimization algorithm to iteratively explore the sequence space and identify the most promising candidates. To further improve the efficiency of the search process, we introduce a risk-aware approach that balances exploration and exploitation by incorporating uncertainty estimates into the acquisition function of the Bayesian optimization algorithm. We demonstrate the effectiveness of our approach through experiments on several benchmark datasets, showing that our framework outperforms state-of-the-art methods in terms of both efficiency and quality of the designed sequences. Our framework has the potential to accelerate the discovery of new antibodies and reduce the cost and time required for antib |
Yanzheng Wang · Tianyu Shi 🔗 |
-
|
De novo design of antibody heavy chains with SE(3) diffusion
(
Poster
)
>
link
We introduce VH-Diff, an antibody heavy chain variable domain diffusion model. This model is based on FrameDiff, a general protein backbone diffusion framework, which was fine-tuned on antibody structures. The backbone dihedral angles of sampled structures show good agreement with a reference antibody distribution. We use an antibody-specific inverse folding model to recover sequences corresponding to the predicted structures, and study their validity with an antibody numbering tool. Assessing the designability and novelty of the structures generated with our heavy chain model, we find that VH-Diff produces highly designable structures that can contain novel binding regions. Finally, we compare our model with a state-of-the-art sequence-based generative model and show more consistent preservation of the conserved framework region with our structure-based method. |
Frédéric Dreyer · Daniel Cutting · David Errington · Charlotte Deane 🔗 |
-
|
Identifying regularization schemes that make the feature attributions faithful
(
Poster
)
>
link
Feature attribution methods assign a score to each input dimension as a measure of the relevance of that dimension to a model's output. Despite wide use, the feature importance rankings induced by gradient-based feature attributions are unfaithful, that is, they do not correlate with the input-perturbation sensitivity of the model, unless the model is trained to be adversarially robust. Here we demonstrate that these concerns translate to models trained for protein function prediction tasks. Despite making a model's gradient-based attributions faithful to the model, adversarial training has low real-data performance. We find that independent Gaussian noise corruption is an effective alternative to adversarial training that confers faithfulness onto a model's gradient-based attributions without performance degradation. On the other hand, we observe no meaningful faithfulness benefits from regularization schemes like dropout and weight decay. We translate these insights to a real-world protein function prediction task, where the gradient-based feature attributions of noise-regularized models correctly indicate low sensitivity to irrelevant gap tokens in a protein's sequence alignment. |
Julius Adebayo · Samuel Stanton · Simon Kelow · Michael Maser · Richard Bonneau · Vladimir Gligorijevic · Kyunghyun Cho · Stephen Ra · Nathan Frey 🔗 |
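As a rough illustration of the Gaussian noise corruption mentioned in the abstract above, the sketch below adds independent zero-mean Gaussian noise to each entry of a batch of continuous inputs before they would be passed to a model during training. The function name `gaussian_corrupt`, the noise scale `sigma`, and the choice to corrupt input embeddings directly are assumptions made for illustration; the abstract does not specify where or how the noise is applied.

```python
import random


def gaussian_corrupt(batch, sigma, rng):
    """Return a copy of `batch` with independent N(0, sigma^2) noise on every entry.

    `batch` is a list of rows (lists of floats), e.g. hypothetical input embeddings.
    """
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in batch]


# Hypothetical usage: corrupt the inputs at each training step, then train as usual.
rng = random.Random(0)
embeddings = [[0.2, -1.0, 0.5], [1.5, 0.0, -0.3]]
noisy = gaussian_corrupt(embeddings, sigma=0.1, rng=rng)
```

Setting `sigma` to zero recovers ordinary training, which makes the noise scale a single knob to compare against the unregularized baseline.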
-
|
ChatPathway: Conversational Large Language Models for Biology Pathway Detection
(
Poster
)
>
link
Biological pathways, like protein-protein interactions and metabolic networks, are vital for understanding diseases and drug development. Some databases such as KEGG are designed to store and map these pathways. However, many bioinformatics methods face limitations due to database constraints, and certain deep learning models struggle with the complexities of biochemical reactions involving large molecules and diverse enzymes. Importantly, the thorough exploration of biological pathways demands a deep understanding of scientific literature and past research. Despite this, recent advancements in Large Language Models (LLMs), especially ChatGPT, show promise. We first restructured data from KEGG and augmented it with molecule structural and functional information sourced from UniProt and PubChem. Our study evaluated LLMs, particularly GPT-3.5-turbo and Galactica, in predicting biochemical reactions and pathways using our constructed data. We also assessed their ability to predict novel pathways, not covered in their training datasets, using findings from recently published studies. While GPT demonstrated strengths in pathway mapping, Galactica encountered challenges. This research emphasizes the potential of merging LLMs with biology, suggesting a harmonious blend of human expertise and AI in decoding biological systems. |
Yanjing Li · Hannan Xu · Haiteng Zhao · Hongyu Guo · Shengchao Liu 🔗 |
-
|
Leap: molecular synthesisability scoring with intermediates
(
Poster
)
>
link
Assessing whether a molecule can be synthesised is a primary task in drug discovery. It enables computational chemists to filter for viable compounds or bias molecular generative models. The notion of synthesisability is dynamic as it evolves depending on the availability of key compounds. A common approach in drug discovery involves exploring the chemical space surrounding synthetically accessible intermediates. This strategy improves the synthesisability of the derived molecules due to the availability of key intermediates. Existing synthesisability scoring methods such as SAScore, SCScore and RAScore cannot condition on intermediates dynamically. Our approach, Leap, is a GPT-2 model trained on the depth, or longest linear path, of predicted synthesis routes that allows information on the availability of key intermediates to be included at inference time. We show that Leap surpasses all other scoring methods by at least 5% on AUC score when identifying synthesisable molecules, and can successfully adapt predicted scores when presented with a relevant intermediate compound. |
Antonia Calvi · Théophile Gaudin · Dominik Miketa · Dominique Sydow · Liam Wilbraham 🔗 |
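To make the "depth, or longest linear path" target in the Leap abstract above concrete, here is a minimal sketch that computes the depth of a synthesis route and shows how it shrinks once an intermediate is treated as available. This is not Leap itself (which the abstract describes as a GPT-2 model); the route representation (a dict mapping each molecule to the precursors of one predicted reaction), the function name `route_depth`, and the convention that molecules without a listed reaction are purchasable are all assumptions for illustration.

```python
def route_depth(molecule, routes, available=frozenset()):
    """Number of reaction steps on the longest linear path needed to make `molecule`.

    routes: dict mapping a molecule to the list of precursors of its (single)
            predicted reaction; molecules absent from `routes` are assumed purchasable.
    available: intermediates assumed to be in hand, so their sub-routes are skipped.
    """
    if molecule in available or molecule not in routes:
        return 0
    return 1 + max(route_depth(p, routes, available) for p in routes[molecule])


# Hypothetical route: target <- A + B, A <- C + D, C <- E
routes = {"target": ["A", "B"], "A": ["C", "D"], "C": ["E"]}

print(route_depth("target", routes))                    # 3
print(route_depth("target", routes, available={"A"}))   # 1
```

Treating intermediate `A` as available cuts the depth from three steps to one, which is the kind of dynamic conditioning the abstract says fixed scores cannot express.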
-
|
PharmacoNet: Accelerating Large-Scale Virtual Screening by Deep Pharmacophore Modeling
(
Poster
)
>
link
As the size of accessible compound libraries expands to over 10 billion, the need for more efficient structure-based virtual screening methods is emerging. Different pre-screening methods have been developed to rapidly screen the library, but the structure-based methods applicable to general proteins are still lacking: the challenge is to predict the binding pose between proteins and ligands and perform scoring in an extremely short time. We introduce PharmacoNet, a deep learning framework that identifies, from the binding site, the optimal 3D pharmacophore arrangement that a ligand should have for stable binding. By coarse-grained graph matching between ligands and the generated pharmacophore arrangement, we solve the expensive binding pose sampling and scoring procedures of existing methods in a single step. PharmacoNet is significantly faster than state-of-the-art structure-based approaches, yet reasonably accurate with a simple scoring function. Furthermore, we show the promising result that PharmacoNet effectively retains hit candidates even under high pre-screening filtration rates. Overall, our study uncovers the hitherto untapped potential of a pharmacophore modeling approach in deep learning-based drug discovery. |
Seonghwan Seo · Woo Youn Kim 🔗 |
-
|
Model-free selective inference and its applications to drug discovery
(
Poster
)
>
link
Decision making or scientific discovery pipelines such as drug discovery often involve multiple stages: before any resource-intensive step, there is often an initial screening that uses predictions from a machine learning model to shortlist a few candidates from a large pool. We study screening procedures that aim to select candidates whose unobserved outcomes exceed user-specified values. We develop a method that wraps around any prediction model to produce a subset of candidates while controlling the proportion of falsely selected units. Building upon the conformal inference framework, our method first constructs p-values that quantify the statistical evidence for large outcomes; it then determines the shortlist by comparing the p-values to a threshold introduced in the multiple testing literature. In many cases, the procedure selects candidates whose predictions are above a data-dependent threshold. Our theoretical guarantee holds under mild exchangeability conditions on the samples, generalizing existing results on multiple conformal p-values. We demonstrate the empirical performance of our method via applications to drug discovery datasets. |
Ying Jin · Emmanuel Candes 🔗 |
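The two-step procedure in the selective-inference abstract above (conformal p-values, then a multiple-testing threshold) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the score `y - prediction`, the single fixed `threshold` for all candidates, and the use of the Benjamini-Hochberg step-up rule are choices made here for concreteness, and the function names are hypothetical.

```python
def conformal_pvalues(calib_y, calib_pred, test_pred, threshold):
    """p-value for each test candidate under the null 'true outcome <= threshold'.

    Uses the score V(x, y) = y - prediction(x), which increases with y (an assumption).
    A high prediction gives a low test score and therefore a small p-value.
    """
    calib_scores = [y - m for y, m in zip(calib_y, calib_pred)]
    n = len(calib_scores)
    pvals = []
    for m in test_pred:
        test_score = threshold - m
        below = sum(1 for v in calib_scores if v < test_score)
        pvals.append((1 + below) / (n + 1))
    return pvals


def benjamini_hochberg(pvals, q):
    """Indices selected by the Benjamini-Hochberg step-up rule at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda j: pvals[j])
    k = 0
    for rank, j in enumerate(order, start=1):
        if pvals[j] <= q * rank / m:
            k = rank  # largest rank whose p-value clears its threshold
    return sorted(order[:k])


# Hypothetical usage with held-out labeled calibration data and unlabeled candidates.
calib_y = [0.3, 0.9, 0.5, 0.7, 0.2]
calib_pred = [0.4, 0.8, 0.5, 0.6, 0.1]
test_pred = [0.95, 0.30, 0.85]
pvals = conformal_pvalues(calib_y, calib_pred, test_pred, threshold=0.6)
shortlist = benjamini_hochberg(pvals, q=0.5)
```

Because the p-values decrease as the prediction increases, the shortlist ends up being the candidates whose predictions exceed a data-dependent cutoff, consistent with the abstract's remark.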
-
|
AlphaFold Meets Flow Matching for Generating Protein Ensembles
(
Poster
)
>
link
The significant success of AlphaFold2 at protein structure prediction has pointed to structural ensembles as the next frontier towards a more complete computational understanding of protein structure. At the same time, iterative refinement-based techniques such as diffusion have driven significant breakthroughs in generative modeling. We explore the synergy of these developments by combining highly accurate protein structure prediction models with flow matching, a powerful modern generative modeling framework, in order to sample the conformational landscape of proteins. Preliminary results on membrane transporters, ligand-induced conformational change, and disordered ensembles show the potential of the approach. Importantly, and unlike MSA-based methods, our method also obtains similar distributions even when used with language model-based algorithms such as ESMFold, which are otherwise deterministic given an input sequence. These results open exciting avenues in the computational prediction of conformational flexibility. |
Bowen Jing · Bonnie Berger · Tommi Jaakkola 🔗 |
-
|
RetroBridge: Modeling Retrosynthesis with Markov Bridges
(
Poster
)
>
link
Retrosynthesis planning is a fundamental challenge in chemistry which aims at designing multi-step reaction pathways from commercially available starting materials to a target molecule. Each step in multi-step retrosynthesis planning requires accurate prediction of possible precursor molecules given the target molecule and confidence estimates to guide heuristic search algorithms. We model single-step retrosynthesis as a distribution learning problem in a discrete state space. First, we introduce the Markov Bridge Model, a generative framework aimed to approximate the dependency between two intractable discrete distributions accessible via a finite sample of coupled data points. Our framework is based on the concept of a Markov bridge, a Markov process pinned at its endpoints. Unlike diffusion-based methods, our Markov Bridge Model does not need a tractable noise distribution as a sampling proxy and directly operates on the input product molecules as samples from the intractable prior distribution. We then address the retrosynthesis planning problem with our novel framework and introduce RetroBridge, a template-free retrosynthesis modeling approach that achieves state-of-the-art results on standard evaluation benchmarks. |
Ilia Igashov · Arne Schneuing · Marwin Segler · Michael Bronstein · Bruno Correia 🔗 |
-
|
Ab-DeepGA: A generative modeling framework leveraging deep learning for antibody affinity tuning
(
Poster
)
>
link
Antibodies and their derived biologics are a major class of novel human therapeutics, with over 70 FDA approvals in the past decade. In vitro display technologies are commonly used to select specific antibodies with high affinity and specificity to a target antigen, but these experiments are resource intensive and can explore only a limited antibody sequence space. Here, we present Ab-DeepGA, a method that combines experimental advances with a deep learning interpretability approach to efficiently search sequence space for sequences with desired affinity to a target antigen. Starting from a combined phage-yeast display experiment against a target antigen, we sorted and sequenced antigen-specific, llama-derived heavy-chain only antibodies ($V_{HH}$) with a wide range of binding affinities. This data was used to train a deep convolutional neural network to predict $V_{HH}$ binding strength from sequence. To generate de novo sequences at a desired binding strength, model interpretation was applied to the trained models, and SHAPley interpretation was used to guide genetic algorithm exploration of sequence space. We show our approach leads to improved recovery of sequences in a held-out test set compared to genetic algorithms. Ab-DeepGA is a novel generative modeling approach that combines advances in experimental display with an interpretable deep learning algorithm that efficiently explores antibody sequence space to identify high affinity binders to a target antigen.
|
BoRam Lee · Yara Seif · Kevin Teng · Xiao Xiao · Isha Verma · Ming-Tang Chen · Alan Cheng 🔗 |
-
|
ExPT: Scaling Foundation Models for Experimental Design via Synthetic Pretraining
(
Poster
)
>
link
Experimental design is a fundamental problem in many science and engineering fields. In this problem, sample efficiency is crucial due to the time, money, and safety costs of real-world design evaluations. Existing approaches either rely on active data collection or access to large, labeled datasets of past experiments, making them impractical in many real-world scenarios. In this work, we address the more challenging yet realistic setting of few-shot experimental design, where only a few labeled data points of input designs and their corresponding values are available. We approach this problem as a conditional generation task, where a model conditions on a few labeled examples and the desired output to generate an optimal input design. To this end, we present Pretrained Transformers for Experimental Design (ExPT), which uses a novel combination of synthetic pretraining with in-context learning to enable few-shot generalization. In ExPT, we only assume knowledge of a finite collection of unlabelled data points from the input domain and pretrain a transformer neural network to optimize diverse synthetic functions defined over this domain. Unsupervised pretraining allows ExPT to adapt to any design task at test time in an in-context fashion by conditioning on a few labeled data points from the target task and generating the candidate optima. We evaluate ExPT on few-shot experimental design in challenging domains and demonstrate its superior generality and performance compared to existing methods. |
Tung Nguyen · Sudhanshu Agrawal · Aditya Grover 🔗 |
-
|
On Modelability and Generalizability: Are Machine Learning Models for Drug Synergy Exploiting Artefacts and Biases in Available Data?
(
Poster
)
>
link
Synergy models are useful tools for exploring drug combinatorial search space and identifying promising sub-spaces for in vitro/vivo experiments. Here, we report that distributional biases in the training-validation-test sets used for predictive modeling of drug synergy can explain much of the variability observed in model performances (up to $0.22$ $\Delta$AUPRC). We built 145 classification models spanning 4,577 unique drugs and 75,276 pair-wise drug combinations extracted from DrugComb, and examined spurious correlations in both the input feature and output label spaces. We posit that some synergy datasets are easier to model than others due to factors such as synergy spread, class separation, chemical structural diversity, physicochemical diversity, combinatorial tests per drug, and combinatorial label entropy. We simulate distribution shifts for these dataset attributes and report that the drug-wise homogeneity of combinatorial labels most influences modelability ($0.16\pm0.06$ $\Delta$AUPRC). Our findings imply that seemingly high-performing drug synergy models may not generalize well to broader medicinal space. We caution that the synergy modeling community's efforts may be better expended in examining data-specific artefacts and biases rigorously prior to model building.
|
Arushi Gandhi 🔗 |
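One of the dataset attributes named in the synergy abstract above, the homogeneity of a drug's combination labels, can be illustrated with a per-drug Shannon entropy. The abstract does not define "combinatorial label entropy", so the definition below (binary entropy, in bits, of the synergy labels over all combinations a drug appears in) is only one plausible reading, and the function name and data layout are assumptions.

```python
import math
from collections import defaultdict


def drugwise_label_entropy(combos):
    """Binary entropy (bits) of synergy labels per drug.

    combos: iterable of (drug_a, drug_b, label) with label 0 (not synergistic)
            or 1 (synergistic). Each combination counts toward both of its drugs.
    """
    labels = defaultdict(list)
    for a, b, y in combos:
        labels[a].append(y)
        labels[b].append(y)
    entropy = {}
    for drug, ys in labels.items():
        p = sum(ys) / len(ys)
        if p == 0.0 or p == 1.0:
            entropy[drug] = 0.0  # perfectly homogeneous labels
        else:
            entropy[drug] = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return entropy


# Hypothetical toy data: drug "A" is synergistic in every tested pair.
combos = [("A", "B", 1), ("A", "C", 1), ("B", "C", 0)]
print(drugwise_label_entropy(combos))
```

A drug with entropy near zero lets a model predict the label from the drug's identity alone, which is the kind of shortcut the abstract warns may inflate apparent performance.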
-
|
Machine Learning Guided AQFEP: A Fast & Efficient Absolute Free Energy Perturbation Solution for Virtual Screening
(
Poster
)
>
link
Structure-based methods have become an integral part of the modern drug discovery process. The power of virtual screening (VS) lies in its ability to rapidly and cost-effectively explore enormous chemical spaces to select promising ligands for further in vitro investigation. Relative Free Energy Perturbation (RFEP) and similar methods are the gold standard for binding affinity prediction in drug discovery hit-to-lead and lead optimization phases, but they carry a high computational cost and require a structural analog with known activity. Without a reference molecule requirement, Absolute FEP (AFEP) has, in theory, better accuracy for hit ID, but in practice, the slow throughput is not compatible with VS, where fast docking and unreliable scoring functions are still the standard. Here, we present an integrated workflow to virtually screen large and diverse chemical libraries efficiently, combining active learning with a rigorous physics-based scoring function based on a fast absolute free energy perturbation method (AQFEP). We validate the performance of the approach in the ranking of structurally related ligands, virtual screening hit rate enrichment, and active learning chemical space exploration, disclosing the largest reported collection of absolute free energy simulations to date. |
Jordan Crivelli-Decker · Zane Beckwith · Gary Tom · Ly Le · Sheenam Khuttan · Romelia Salomon-Ferrer · Jackson Beall · Andrea Bortolato 🔗 |
-
|
Data-Efficient Molecular Generation with Hierarchical Textual Inversion
(
Poster
)
>
link
Developing an effective molecular generation framework even with a limited number of molecules is often important for its practical deployment, e.g., in drug discovery, since acquiring task-related molecular data requires expensive and time-consuming experiments. To tackle this issue, we introduce Hierarchical textual Inversion for Molecular Generation (HI-Mol), a novel data-efficient molecular generation method. HI-Mol is inspired by a recent textual inversion technique in the visual domain that achieves data-efficient generation via simple optimization of a new single text token of a pre-trained text-to-image generative model. However, we find that its naive adoption fails for molecules due to their complicated and structured nature. Hence, we propose a hierarchical textual inversion scheme based on introducing low-level tokens that are selected differently per molecule in addition to the original single text token in textual inversion to learn the common concept among molecules. We then generate molecules using a pre-trained text-to-molecule model by interpolating the low-level tokens. Extensive experiments demonstrate the superiority of HI-Mol with notable data-efficiency. For instance, on QM9, HI-Mol outperforms the prior state-of-the-art method with 50$\times$ less training data. We also show the efficacy of HI-Mol in various applications, including molecular optimization and low-shot molecular property prediction.
|
Seojin Kim · Jaehyun Nam · Sihyun Yu · Younghoon Shin · Jinwoo Shin 🔗 |
-
|
PiNUI: A Dataset of Protein-Protein Interactions for Machine Learning
(
Poster
)
>
link
We introduce a new dataset named PiNUI: Protein Interactions with Nearly Uniform Imbalance. PiNUI is a dataset of Protein-Protein Interactions (PPI) specifically designed for Machine Learning (ML) applications that offers a higher degree of representativeness of real-world PPI tasks compared to existing ML-ready PPI datasets. We achieve this by increasing the data size and quality, and minimizing the sampling bias of negative interactions. We demonstrate that models trained on PiNUI almost always outperform those trained on conventional PPI datasets when evaluated on various general PPI tasks using external test sets. |
Geoffroy Dubourg-Felonneau · Eyal Akiva · Daniel Wesego · Ranjani Varadan 🔗 |
-
|
Protein Language Model-Powered 3-Dimensional Ligand Binding Site Prediction from Protein Sequence
(
Poster
)
>
link
Prediction of ligand binding sites of proteins is a fundamental and important task for understanding the function of proteins and screening potential drugs. Most existing methods require experimentally determined protein holo-structures as input. However, such structures can be unavailable for novel or less-studied proteins. To tackle this limitation, we propose LaMPSite, which only takes protein sequences and ligand molecular graphs as input for ligand binding site predictions. The protein sequences are used to retrieve residue-level embeddings and contact maps from the pre-trained ESM-2 protein language model. The ligand molecular graphs are fed into a graph neural network to compute atom-level embeddings. Then we compute and update the protein-ligand interaction embedding based on the protein residue-level embeddings and ligand atom-level embeddings, and the geometric constraints in the inferred protein contact map and ligand distance map. A final pooling on the protein-ligand interaction embedding indicates which residues belong to the binding sites. Without any 3D coordinate information of proteins, our proposed model achieves competitive performance compared to baseline methods that require 3D protein structures when predicting binding sites. Given that less than 50% of proteins currently have reliable structure information, LaMPSite will provide new opportunities for drug discovery. |
Shuo Zhang · Lei Xie 🔗 |
-
|
All You Need is LOVE: Large Optimized Vector Embeddings Network for Drug Repurposing
(
Poster
)
>
link
Traditional drug development is a resource-intensive and time-consuming process with a high rate of failure. To expedite this process, researchers have turned to computational approaches to construct comprehensive graphs of drug-disease associations and explore drug repurposing, finding novel therapeutic applications for existing medications. In parallel, the rapid advancement of the machine-learning field, coupled with the evolution of Natural Language Processing, shows capabilities for reasoning and extracting relationships across various domains. Here, we introduce LOVENet (Large Optimized Vector Embeddings Network), a new framework maximizing the synergistic effects of knowledge graphs and large language models (LLMs) to discover novel therapeutic uses for pre-existing drugs. Specifically, our approach fuses information from pairs of embeddings from Llama 2 and heterogeneous knowledge graphs to derive complex relations of drugs and diseases. To empirically validate our methodology, we conducted benchmarking experiments against state-of-the-art algorithms, utilizing three distinct datasets. Our results demonstrate that LOVENet consistently outperforms all other baselines. |
Sina Akbarian · Sepehr Asgarian · Jouhyun Jeon 🔗 |
-
|
PoseCheck: Generative Models for 3D Structure-based Drug Design Produce Unrealistic Poses
(
Poster
)
>
link
Deep generative models for structure-based drug design (SBDD), where molecule generation is conditioned on a 3D protein pocket, have received considerable interest in recent years. These methods offer the promise of higher-quality molecule generation by explicitly modelling the 3D interaction between a potential drug and a protein receptor. However, previous work has primarily focused on the quality of the generated molecules themselves, with limited evaluation of the 3D poses that these methods produce, with most work simply discarding the generated pose and only reporting a "corrected" pose after redocking with traditional methods. Little is known about whether generated molecules satisfy known physical constraints for binding and the extent to which redocking alters the generated interactions. We introduce PoseCheck, an extensive analysis of multiple state-of-the-art methods and find that generated molecules have significantly more physical violations and fewer key interactions compared to baselines, calling into question the implicit assumption that providing rich 3D structure information improves molecule complementarity. We make recommendations for future research tackling identified failure modes and hope our benchmark will serve as a springboard for future SBDD generative modelling work to have a real-world impact. Our evaluation suite is easy to use in future 3D SBDD work and is available at https://anonymous.4open.science/r/posecheck-358E. |
Charles Harris · Kieran Didi · Arian Jamasb · Chaitanya K. Joshi · Simon Mathis · Pietro Lió · Tom Blundell 🔗 |
-
|
Neurosymbolic AI Reveals Biases and Limitations in ML-Driven Drug Discovery
(
Poster
)
>
link
Recently, several machine learning approaches have aided drug discovery by identifying promising candidates and predicting potential indications. However, understanding the ways in which drugs achieve their therapeutic effects, otherwise known as their mechanisms-of-action (MoA), is important for understanding potency, side effects, and interactions with various tissue types, among other things. We leveraged and improved the interpretability of a neurosymbolic reinforcement learning method in an attempt to reveal MoAs. While doing so, we observed that our findings raised several concerns with the reasoning process. Specifically, we debate situations in which patterns following a "guilt-by-association" trend are useful for predictions regarding novel compounds. We present our results to facilitate discussion about how generalizable ML-based models are to the drug discovery process as well as how important interpretability can be to such models. |
Lauren Nicole DeLong · Yojana Gadiya · Jacques Fleuriot · Daniel Domingo-Fernández 🔗 |
-
|
Leveraging expert feedback to align proxy and ground truth rewards in goal-oriented molecular generation
(
Poster
)
>
link
Reinforcement learning has proven useful for de novo molecular design. Leveraging a reward function associated with a given design task allows for efficiently exploring the chemical space, thus producing relevant candidates. Nevertheless, while tasks involving optimization of drug-likeness properties such as LogP or molecular weight do enjoy a tractable and cheap-to-evaluate reward definition, more realistic objectives such as bioactivity or binding affinity do not. For such tasks, the ground truth reward is prohibitively expensive to compute and cannot be done inside a molecule generation loop, thus it is usually taken as the output of a statistical model. Such a model will act as a faulty reward signal when taken out-of-training distribution, which typically happens when exploring the chemical space, thus leading to molecules judged promising by the system, but which do not align with reality. We investigate this alignment problem through the lens of Human-In-The-Loop ML and propose a combination of two reward models independently trained on experimental data and expert feedback, with a gating process that decides which model output will be used as a reward for a given candidate. This combined system can be fine-tuned as expert feedback is acquired throughout the molecular design process, using several active learning criteria that we evaluate. In this active learning regime, our combined model demonstrates an improvement over the vanilla setting, even for noisy expert feedback. |
Julien Martinelli · Yasmine Nahal · Duong Lê · Ola Engkvist · Samuel Kaski 🔗 |
-
|
CryoSTAR: Cryo-EM Heterogeneous Reconstruction of Atomic Models with Structural Regularization
(
Poster
)
>
link
Atomic models, which directly represent molecular structural variations (i.e., conformation), have received increasing attention in the field of cryo-electron microscopy (cryo-EM) heterogeneity analysis. However, the nonconvex nature of the structural space (the space of atomic coordinates) poses a significant challenge to finding a physically plausible solution. In this paper, we address this challenge by proposing a novel approach, named cryoSTAR, with the aim of reconstructing atomic models from cryo-EM images. Our approach is motivated by the observation that weak regularization allows atomic models to be excessively flexible in the search space, resulting in a loss of local structural fidelity, while strong regularization tends to trap atomic models in the neighborhood of the initial structure, limiting their ability to explore the conformational landscape effectively. To strike a balance, we introduce adaptive structural regularization at the atomic level to modulate the reconstruction process. We relax the flexible region adaptively to allow for greater conformational changes. Our method achieves the lowest RMSD (up to a maximum decrease of 7.14 Å) on a synthetic dataset, and uncovers reasonable dynamics on an experimental dataset, highlighting its generalizability across different protein systems. Our work sheds light on the potential of atomic models as an alternative to traditional volumetric density maps for cryo-EM heterogeneous reconstruction. |
Yi Zhou · Yilai Li · Jing Yuan · Fei YE · Quanquan Gu 🔗 |
-
|
TrustAffinity: accurate, reliable and scalable out-of-distribution protein-ligand binding affinity prediction using trustworthy deep learning
(
Poster
)
>
link
Accurate, reliable and scalable predictions of protein-ligand binding affinity through artificial intelligence have great potential to accelerate the drug discovery process. While many works have been introduced for this purpose, their performance remains poor when applied to new out-of-distribution (OOD) cases where new unseen chemicals belong to a new chemical scaffold. Moreover, they neither account for uncertainty nor quantify the uncertainty associated with individual predictions. To address these issues, we propose a novel sequence-based deep learning framework, TrustAffinity, to predict the binding affinity and the uncertainty of the prediction. TrustAffinity employs a novel uncertainty-based loss function to leverage the uncertainty for improving OOD generalization. We perform extensive validations of TrustAffinity in multiple OOD settings. TrustAffinity significantly outperforms state-of-the-art deep learning methods and protein-ligand docking in the prediction of binding affinity. Moreover, TrustAffinity is able to perform predictions at least three orders of magnitude faster than protein-ligand docking, highlighting its suitability for integration into a real-time drug discovery pipeline. Notably, we successfully illustrate the practical utility of TrustAffinity through a case study focused on lead discovery in the context of opioid use disorder. |
Amitesh Badkul · Li Xie · Shuo Zhang · Lei Xie 🔗 |
-
|
Scalable Normalizing Flows Enable Boltzmann Generators for Macromolecules
(
Poster
)
>
link
The Boltzmann distribution of a protein provides a roadmap to all of its functional states. Normalizing flows are a promising tool for modeling this distribution, but current methods become computationally intractable for typical pharmacological targets due to the size of the system, heterogeneity of intra-molecular potential energy, and long-range interactions. To remedy these issues, we present a novel flow architecture that utilizes split channels and gated attention to efficiently learn the conformational distribution of proteins defined by internal coordinates. We show that by utilizing a 2-Wasserstein loss, one can smooth the transition from maximum likelihood training to energy-based training, enabling the training of Boltzmann Generators for macromolecules. We evaluate our model and training strategy on villin headpiece HP35(nle-nle), a 35-residue subdomain, and protein G, a 56-residue protein. We demonstrate that standard architectures and training strategies, such as maximum likelihood alone, fail while our novel architecture and multi-stage training strategy are able to model the conformational distributions of protein G and HP35. |
Joseph Kim · David Bloore · Karan Kapoor · Jun Feng · Ming-Hong Hao · Mengdi Wang 🔗 |
-
|
Evaluating Zero-Shot Scoring for In Vitro Antibody Binding Prediction with Experimental Validation
(
Poster
)
>
link
The success of therapeutic antibodies relies on their ability to selectively bind antigens. AI-based antibody design protocols have shown promise in generating epitope-specific designs. Many of these protocols use an inverse folding step to generate diverse sequences given a backbone structure. Due to prohibitive screening costs, it is key to identify candidate sequences likely to bind in vitro. Here, we compare the efficacy of 8 common scoring paradigms based on open-source models to classify antibody designs as binders or non-binders. We evaluate these approaches on a novel surface plasmon resonance (SPR) dataset, spanning 5 antigens. Our results show that existing methods struggle to detect binders, and performance is highly variable across antigens. We find that metrics computed on flexibly docked antibody-antigen complexes are more robust, and ensemble scores are more consistent than individual metrics. We provide experimental insight to analyze current scoring techniques, highlighting that the development of robust, zero-shot filters is an important research gap. |
Divya Nori · Simon Mathis · Amir Shanehsazzadeh 🔗 |
-
|
CongFu: Conditional Graph Fusion for Drug Synergy Prediction
(
Poster
)
>
link
Drug synergy, characterized by the amplified combined effect of multiple drugs, is critically important for optimizing therapeutic outcomes. Limited data on drug synergy, arising from the vast number of possible drug combinations and testing costs, motivate the need for predictive methods. In this work, we introduce CongFu, a novel Conditional Graph Fusion Layer, designed to predict drug synergy. CongFu employs an attention mechanism and a bottleneck to extract local graph contexts and conditionally fuse graph data within a global context. Its modular architecture enables flexible replacement of layer modules, including readouts and graph encoders, facilitating customization for diverse applications. To evaluate the performance of CongFu, we conduct comprehensive experiments on four datasets, encompassing three distinct setups for drug synergy prediction. CongFu achieves state-of-the-art results on 11 out of 12 benchmark datasets, demonstrating its ability to capture intricate patterns of drug synergy. Through ablation studies, we validate the significance of individual layer components, affirming their contributions to overall predictive performance. Finally, we propose an explainability strategy for elucidating the effect of drugs on genes. By addressing the challenge of predicting drug synergy in untested drug pairs and utilizing our proposed explainability approach, CongFu opens new avenues for optimizing drug combinations and advancing personalized medicine. |
Oleksii Tsepa · Bohdan Naida · Anna Goldenberg · Bo Wang 🔗 |
-
|
The neural scaling laws for phenotypic drug discovery
(
Poster
)
>
link
Recent breakthroughs by deep neural networks (DNNs) in natural language processing (NLP) and computer vision have been driven by a scale-up of models and data rather than the discovery of novel computing paradigms. Here, we investigate whether scale can have a similar impact for models designed to aid small molecule drug discovery. We address this question through a large-scale and systematic analysis of how DNN size, data diet, and learning routines interact to impact accuracy on our Phenotypic Chemistry Arena (Pheno-CA) benchmark, a diverse set of drug development tasks posed on image-based high content screening data. Surprisingly, we find that DNNs explicitly supervised to solve tasks in the Pheno-CA do not continuously improve as their data and model size are scaled up. To address this issue, we introduce a novel precursor task, the Inverse Biological Process (IBP), which is designed to resemble the causal objective functions that have proven successful for NLP. We indeed find that DNNs first trained with IBP then probed for performance on the Pheno-CA significantly outperform task-supervised DNNs. More importantly, the performance of these IBP-trained DNNs monotonically improves with data and model scale. Our findings reveal that the DNN ingredients needed to accurately solve small molecule drug development tasks are already in our hands, and project how much more experimental data is needed to achieve any desired level of improvement. |
Drew Linsley · John Griffin · Jason Brown · Adam Roose · Steven Finkbeiner · Peter Linsley · Jeremy Linsley 🔗 |
-
|
VN-EGNN: Equivariant Graph Neural Networks with Virtual Nodes Enhance Protein Binding Site Identification
(
Poster
)
>
link
Being able to identify regions within or around proteins, to which ligands can potentially bind, is an essential step to develop new drugs. Binding site identification methods can now profit from the availability of large amounts of 3D structures in protein structure databases or from AlphaFold predictions. Current binding site identification methods rely on geometric deep learning, which takes geometric invariances and equivariances into account. Such methods turned out to be very beneficial for physics-related tasks like binding energy or motion trajectory prediction. However, their performance at binding site identification is still limited, which might be due to limited expressivity or oversquashing effects of E(n)-Equivariant Graph Neural Networks (EGNNs). Here, we extend EGNNs by adding virtual nodes and applying an extended message passing scheme. The virtual nodes in these graphs both improve the predictive performance and can also learn to represent binding sites. In our experiments, we show that VN-EGNN sets a new state of the art at binding site identification on three common benchmarks, COACH420, HOLO4K, and PDBbind2020. |
Florian Sestak · Lisa Schneckenreiter · Sepp Hochreiter · Andreas Mayr · Günter Klambauer 🔗 |
-
|
SALSA: Semantically-Aware Latent Space Autoencoder
(
Poster
)
>
link
For molecular representations, SMILES strings are a popular choice, as they allow for leveraging of modern NLP methodologies, one being the sequence-to-sequence autoencoder. However, an autoencoder trained solely on SMILES is insufficient to learn semantically meaningful representations, which capture structural similarities between molecules. We define native chemical similarity using chemical graphs, which enables the use of a rigorous metric, such as graph edit distance (GED). We demonstrate by example that a standard SMILES autoencoder may map structurally similar molecules to distant latent vectors, resulting in an incoherent latent space. To address this shortcoming we propose Semantically-Aware Latent Space Autoencoder (SALSA), a transformer-autoencoder modified with a contrastive objective of mapping structurally similar molecules to nearby vectors in the latent space. We evaluate semantic awareness of SALSA representations by comparing to a naive autoencoder as well as ECFP4, a molecular fingerprint commonly used in cheminformatics. We show empirically that SALSA learns a representation that maintains 1) structural awareness, 2) physicochemical property awareness, 3) biological property awareness, and 4) semantic continuity. |
Kathryn E. Kirchoff · Travis Maxfield · Alexander Tropsha · Shawn Gomez 🔗 |
-
|
Gotta be SAFE: A New Framework for Molecular Design
(
Poster
)
>
link
Traditional molecular string representations, such as SMILES, often pose challenges for AI-driven molecular design due to their non-sequential depiction of molecular substructures. To address this issue, we introduce Sequential Attachment-based Fragment Embedding (SAFE), a novel line notation for chemical structures. SAFE reimagines SMILES strings as an unordered sequence of interconnected fragment blocks while maintaining full compatibility with existing SMILES parsers. It streamlines complex generative tasks, including scaffold decoration, fragment linking, polymer generation, and scaffold hopping, while facilitating autoregressive generation for fragment-constrained design, thereby eliminating the need for intricate decoding or graph-based models. We demonstrate the effectiveness of SAFE by training an 87-million-parameter GPT2-like model on a dataset containing 1.1 billion SAFE representations. Through extensive experimentation, we show that our SAFE-GPT model exhibits versatile and robust optimization performance. SAFE opens up new avenues for the rapid exploration of chemical space under various constraints, promising breakthroughs in AI-driven molecular design. |
Emmanuel Noutahi · Cristian Gabellini · Michael Craig · Jonathan Siu Chi Lim · Prudencio Tossou 🔗 |
-
|
MoleculeGPT: Instruction Following Large Language Models for Molecular Property Prediction
(
Poster
)
>
link
Harnessing textual information offers significant advantages in the drug design process, providing invaluable insights into complex molecular structures and facilitating molecule design based on textual instructions. With recent advancements in the utilization of Large Language Models (LLMs) for multi-modal data applications, we aim to leverage the capabilities of LLMs for molecule property prediction tasks. We introduce MoleculeGPT, which is designed to provide answers to queries concerning molecular properties on the basis of molecular structure inputs. To train MoleculeGPT, we have curated a new dataset from the raw molecule descriptions in PubChem for instruction-following tasks. We evaluate the performance of MoleculeGPT on multiple-choice questions and several downstream tasks on molecule property prediction for drug design. Experimental results show that MoleculeGPT can generate responses that closely resemble human-level performance and demonstrate exceptional capabilities across diverse downstream tasks. |
Weitong ZHANG · Xiaoyun Wang · Weili Nie · Joe Eaton · Brad Rees · Quanquan Gu 🔗 |
-
|
Embracing assay heterogeneity with neural processes for markedly improved bioactivity predictions
(
Poster
)
>
link
Predicting the bioactivity of a ligand is one of the hardest and most important challenges in computer-aided drug discovery. Despite years of data collection and curation efforts, bioactivity data remains sparse and heterogeneous, thus hampering efforts to build predictive models that are accurate, transferable and robust. The intrinsic variability of the experimental data is further compounded by data aggregation practices that neglect heterogeneity to overcome sparsity. Here we discuss the limitations of these practices and present a hierarchical meta-learning framework that exploits the information synergy across disparate assays by successfully accounting for assay heterogeneity. We show that the model achieves a drastic improvement in affinity prediction across diverse protein targets and assay types compared to conventional baselines. It can quickly adapt to new target contexts using very few observations, thus enabling large-scale virtual screening in early-phase drug discovery. |
Lucian Chan · Marcel Verdonk · Carl Poelking 🔗 |
-
|
FragXsiteDTI: an interpretable transformer-based model for drug-target interaction prediction
(
Poster
)
>
link
Drug-Target Interaction (DTI) prediction is vital for drug discovery, yet challenges persist in achieving model interpretability and optimizing performance. We propose a novel transformer-based model, FragXsiteDTI, that aims to address these challenges in DTI prediction. Notably, FragXsiteDTI is the first DTI model to simultaneously leverage drug molecule fragments and protein pockets. Our information-rich representations for both proteins and drugs offer a detailed perspective on their interaction. Inspired by the Perceiver IO framework, our model features a learnable latent array, initially interacting with protein binding site embeddings using cross-attention and later refined through self-attention and used as a query to the drug fragments in the drug's cross-attention transformer block. This learnable query array serves as a mediator and enables seamless information translation, preserving critical nuances in drug-protein interactions. Our computational results on two benchmarking datasets demonstrate the superior predictive power of our model over several state-of-the-art models. We also show the interpretability of our model in terms of the critical components of both target proteins and drug molecules within drug-target pairs. |
Ali Khodabandeh Yalabadi · Mehdi Yazdani-Jahromi · Niloofar Yousefi · Aida Tayebi · Sina Abdidizaji · OZLEM GARIBAY 🔗 |
-
|
Tree Search-Based Evolutionary Bandits for Protein Sequence Optimization
(
Poster
)
>
link
While modern biotechnologies allow synthesizing new proteins and function measurements at scale, efficiently exploring a protein sequence space and engineering it remains a daunting task due to the vast sequence space of any given protein. Protein engineering is typically conducted through an iterative process of adding mutations to the wild-type or lead sequences, recombination of mutations, and running new rounds of screening. To enhance the efficiency of such a process, we propose a tree search-based bandit learning method, which expands a tree starting from the initial sequence with the guidance of a bandit machine learning model. Under simplified assumptions and a Gaussian Process prior, we provide theoretical analysis and a Bayesian regret bound, demonstrating that the method can efficiently discover a near-optimal design. The full algorithm is compatible with a suite of randomized tree search heuristics, machine learning models, pre-trained embeddings, and bandit techniques. We test various instances of the algorithm across benchmark protein datasets using simulated screens. Experimental results demonstrate that the algorithm is sample-efficient, diversity-promoting, and able to find top designs using reasonably small mutation counts. |
Jiahao Qiu · Hui Yuan · Jinghong Zhang · Wentao Chen · Huazheng Wang · Mengdi Wang 🔗 |
-
|
GraphPrint: Extracting Features from 3D Protein Structure for Drug Target Affinity Prediction
(
Poster
)
>
link
Accurate drug target affinity prediction can improve drug candidate selection, accelerate the drug discovery process, and reduce drug production costs. Previous work focused on traditional fingerprints or used features extracted based on the amino acid sequence in the protein, ignoring its 3D structure which affects its binding affinity. In this work, we propose GraphPrint: a framework for incorporating 3D protein structure features for drug target affinity prediction. We generate graph representations for protein 3D structures using amino acid residue location coordinates and combine them with drug graph representation and traditional features to jointly learn drug target affinity. Our model achieves a mean square error of 0.1378 and a concordance index of 0.8929 on the KIBA dataset and improves over using traditional protein features alone. Our ablation study shows that the 3D protein structure-based features provide information complementary to traditional features. |
Amritpal Singh 🔗 |
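For readers unfamiliar with the two metrics quoted in the GraphPrint abstract above, the following is a minimal sketch of how mean squared error and the concordance index are commonly computed for affinity prediction. The function names and the toy values are illustrative assumptions, not taken from the paper, and the tie-handling convention (prediction ties count as 0.5) is one common choice that may differ from the authors' evaluation code.

```python
from itertools import combinations


def mean_squared_error(y_true, y_pred):
    """Average squared difference between measured and predicted affinities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order.

    Pairs with equal true values are skipped; pairs with equal predictions
    count as half-concordant (an assumed, commonly used convention).
    """
    concordant = 0.0
    comparable = 0
    for (t_i, p_i), (t_j, p_j) in combinations(zip(y_true, y_pred), 2):
        if t_i == t_j:
            continue  # not comparable: no true ordering to recover
        comparable += 1
        agreement = (t_i - t_j) * (p_i - p_j)
        if agreement > 0:
            concordant += 1.0  # predicted order agrees with true order
        elif agreement == 0:
            concordant += 0.5  # tied predictions
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy affinities, for illustration only.
measured = [11.2, 12.1, 10.5, 13.0]
predicted = [11.0, 12.5, 11.0, 12.0]
toy_ci = concordance_index(measured, predicted)
toy_mse = mean_squared_error(measured, predicted)
```

A concordance index of 1.0 means every comparable pair is ranked correctly, while 0.5 corresponds to random ranking; lower is better for mean squared error.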
-
|
Large-scale Pretraining Improves Sample Efficiency of Active Learning based Molecule Virtual Screening
(
Poster
)
>
link
Virtual screening of large compound libraries to identify potential hit candidates is one of the earliest steps in drug discovery. As the size of commercially available compound collections grows exponentially to the scale of billions, brute-force virtual screening using traditional tools such as docking becomes infeasible in terms of time and computational resources. Active learning and Bayesian optimization have recently been proven to be effective methods of narrowing down the search space. An essential component in those methods is a surrogate machine learning model that is trained with a small subset of the library to predict the desired properties of compounds. An accurate model can achieve high sample efficiency by finding the most promising compounds with only a fraction of the whole library being virtually screened. In this study, we examined the performance of a pretrained transformer-based language model and a graph neural network in a Bayesian optimization active learning framework. The best pretrained model identifies 58.97% of the top-50000 compounds by docking score after screening only 0.6% of an ultra-large library containing 99.5 million compounds, improving 8% over the previous state-of-the-art baseline. Through extensive benchmarks, we show that the superior performance of pretrained models persists in both structure-based and ligand-based drug discovery. Such models can serve as a boost to the accuracy and sample efficiency of active learning based molecule virtual screening. |
Zhonglin Cao · Simone Sciabola · Ye Wang 🔗 |
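To put the headline figures of the virtual screening abstract above in absolute terms, here is a small sketch of the top-k recall metric together with the arithmetic behind the quoted percentages. The function name, the toy compound identifiers, and the convention that a lower docking score is better are assumptions made for illustration; the abstract itself does not specify them.

```python
def top_k_recall(docking_scores, screened_ids, k):
    """Fraction of the k best-scoring compounds that appear among the screened ones.

    docking_scores maps compound id -> docking score; lower is assumed better.
    """
    best_k = set(sorted(docking_scores, key=docking_scores.get)[:k])
    return len(best_k & set(screened_ids)) / k


# Arithmetic on the numbers quoted in the abstract (integer math avoids rounding).
library_size = 99_500_000
screened_count = library_size * 6 // 1000   # 0.6% of the library
top_k = 50_000
recovered_top_k = top_k * 5897 // 10_000    # 58.97% of the top-50000

# Hypothetical toy library, for illustration only.
toy_scores = {"a": -9.1, "b": -7.0, "c": -8.5, "d": -5.2, "e": -10.3}
toy_recall = top_k_recall(toy_scores, screened_ids=["a", "c", "d"], k=2)
```

Under these numbers, roughly 597,000 docking evaluations recover about 29,485 of the 50,000 best-scoring compounds.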
-
|
Role of Structural and Conformational Diversity for Machine Learning Potentials
(
Poster
)
>
link
In the field of Machine Learning Interatomic Potentials (MLIPs), understanding the intricate relationship between data biases, specifically conformational and structural diversity, and model generalization is critical in improving the quality of Quantum Mechanics (QM) data generation efforts. We investigate these dynamics through two distinct experiments: a fixed budget one, where the dataset size remains constant, and a fixed molecular set one, which focuses on fixed structural diversity while varying conformational diversity. Our results reveal nuanced patterns in generalization metrics. Notably, for optimal structural and conformational generalization, a careful balance between structural and conformational diversity is required, but existing QM datasets do not meet that trade-off. Additionally, our results highlight the limitation of the models at generalizing beyond their training distribution, emphasizing the importance of defining applicability domain during model deployment. These findings provide valuable insights and guidelines for QM data generation efforts. |
Nikhil Shenoy · Prudencio Tossou · Emmanuel Noutahi · Hadrien Mary · Dominique Beaini · Jiarui Ding 🔗 |
-
|
DrugImprover: Utilizing Reinforcement Learning for Multi-Objective Alignment in Drug Optimization
(
Poster
)
>
link
Reinforcement learning from human feedback (RLHF) is a method for enhancing the finetuning of large language models (LLMs), leading to notable performance improvements that can also align better with human values. Building upon the inspiration drawn from RLHF, this research delves into the realm of drug optimization. We employ reinforcement learning to finetune a drug optimization model, enhancing the original drug across multiple target objectives while retaining the beneficial chemical properties of the original drug. Our proposal comprises three primary components: (1) DrugImprover: A framework tailored for improving robustness and efficiency in drug optimization. (2) A novel Advantage-alignment Policy Optimization (APO) with multi-critic guided exploration algorithm for finetuning the objective-oriented properties. (3) A dataset of 2 million compounds, each with OEDOCK docking scores on two proteins, 3CLPro (PDBID: 7BQY) and RTCB (PDBID: 4DWQ), from SARS-CoV-2 and human cancer cells, respectively. We conduct a comprehensive evaluation of APO and demonstrate its effectiveness in improving the original drug across multiple properties. |
Xuefeng Liu · Songhao Jiang · Archit Vasan · Alexander Brace · Ozan Gokdemir · Thomas Brettin · Fangfang Xia · Ian Foster · Rick Stevens 🔗 |
-
|
AbLEF: Antibody Language Ensemble Fusion for thermodynamically empowered property predictions
(
Poster
)
>
link
Pre-trained protein language and/or structural models are often fine-tuned on drug development properties (i.e., developability properties) to accelerate drug discovery initiatives. However, these models generally rely on a single structural conformation and/or a single sequence as a molecular representation. We present a physics-based model whereby structural ensemble representations are fused by a transformer-based architecture and concatenated to a language representation to predict antibody protein properties. AbLEF enables the direct infusion of thermodynamic information into latent space and this enhances property prediction by explicitly infusing dynamic molecular behavior that occurs during experimental measurement. We find that $\textbf{(1)}$ ensembles of structures generated from molecular simulation can further improve antibody property prediction for small datasets, $\textbf{(2)}$ fine-tuned large protein language models can match smaller antibody-specific language models at predicting antibody properties, $\textbf{(3)}$ trained multimodal sequence and structural representations outperform sequence representations alone, $\textbf{(4)}$ pre-trained sequence with structure models are competitive with shallow machine learning (ML) methods in the small data regime, and $\textbf{(5)}$ predicting measured antibody properties remains difficult for limited high fidelity datasets.
|
Zachary Rollins · Talal Widatalla · Andrew Waight · Alan Cheng · Essam Metwally 🔗 |
-
|
DiffDock-Pocket: Diffusion for Pocket-Level Docking with Sidechain Flexibility
(
Poster
)
>
link
When a small molecule binds to a protein, the 3D structure of the protein and its function change. Understanding this process, called molecular docking, can be crucial in areas such as drug design. Recent learning-based attempts have shown promising results at this task, yet lack features that traditional approaches support. In this work, we close this gap by proposing DiffDock-Pocket, a diffusion-based docking algorithm that is conditioned on a binding target to predict ligand poses only in a specific binding pocket. On top of this, our model supports receptor flexibility and predicts the position of sidechains close to the binding site. Empirically, we improve the state-of-the-art in site-specific-docking on the PDBBind benchmark. Especially when using in-silico generated structures, we achieve more than twice the performance of current methods while being more than 20 times faster than other flexible approaches. Although the model was not trained for cross-docking to different structures, it yields competitive results in this task. |
Michael Plainer · Marcella Toth · Simon Dobers · Hannes Stärk · Gabriele Corso · Céline Marquet · Regina Barzilay 🔗 |
-
|
PDB-Struct: A Comprehensive Benchmark for Structure-based Protein Design
(
Poster
)
>
link
Structure-based protein design has attracted increasing interest, with numerous methods being introduced in recent years. However, a universally accepted method for evaluation has not been established, since the wet-lab validation can be overly time-consuming for the development of new algorithms, and the $\textit{in silico}$ validation with recovery and perplexity metrics is efficient but may not precisely reflect true foldability. To address this gap, we introduce two novel metrics: refoldability-based metric, which leverages high-accuracy protein structure prediction models as a proxy for wet lab experiments, and stability-based metric, which assesses whether models can assign high likelihoods to experimentally stable proteins. We curate datasets from high-quality CATH protein data as well as high-throughput $\textit{de novo}$ protein design and mutagenesis experiments, and in doing so, present the $\textbf{PDB-Struct}$ benchmark that evaluates both recent and previously uncompared protein design methods. Experimental results indicate that ByProt, ProteinMPNN, and ESM-IF perform exceptionally well on our benchmark, while ESM-Design and AF-Design fall short on the refoldability metric. We also show that while some methods exhibit high sequence recovery, they do not perform as well on our new benchmark. Our proposed benchmark paves the way for a fair and comprehensive evaluation of protein design methods in the future. The source code will be released upon acceptance.
|
Chuanrui Wang · Bozitao Zhong · Zuobai Zhang · Narendra Chaudhary · Sanchit Misra · Jian Tang 🔗 |
-
|
Protein Language Models Enable Accurate Cryptic Ligand Binding Pocket Prediction
(
Poster
)
>
link
Accurate prediction of protein-ligand binding pockets is a critical task in protein functional analysis and small molecule pharmaceutical design. However, the flexible and dynamic nature of proteins conceals an unknown number of potentially invaluable "cryptic" pockets. Current approaches for cryptic pocket discovery rely on molecular dynamics (MD), leading to poor scalability and bias. Even recent ML-based cryptic pocket discovery approaches require large, post-processed MD datasets to train their models. In contrast, this work presents "Efficient Sequence-based cryptic Pocket prediction" (ESP) leveraging advanced Protein Language Models (PLMs), and demonstrates significant improvement in predictive efficacy compared to ML-based cryptic pocket prediction SOTA (ROCAUC 0.93 vs 0.87). ESP achieves detection of cryptic pockets via training on readily available, non cryptic-pocket-specific data from the PDBBind dataset, rather than costly simulation and post-processing. Further, while SOTA's predictions often include positive signal broadly distributed over a target structure, ESP produces more spatially-focused predictions which increase downstream utility. |
David Bloore · Joseph Kim · Karan Kapoor · Eric Chen · Kaifu Gao · Mengdi Wang · Ming-Hong Hao 🔗 |
-
|
Explaining Drug Repositioning: A Case-Based Reasoning Graph Neural Network Approach
(
Poster
)
>
link
Drug repositioning, the identification of novel uses of existing therapies, has become an attractive strategy to accelerate drug development. Recently, knowledge graphs (KGs) have emerged as a powerful representation of interconnected data within the biomedical domain. While biomedical KGs can be used to predict new connections between compounds and diseases, most approaches only state whether two nodes are related; they fail to explain why two nodes are related. In this project, we introduce an implementation of the semi-parametric Case-Based Reasoning over subgraphs approach (CBR-SUBG), designed to derive the underlying mechanisms for a drug query by gathering graph patterns of similar entities. We show that our adaptation outperforms existing KG link prediction models on a drug repositioning task. Furthermore, our findings demonstrate that the CBR-SUBG strategy can not only rank potential repositioning candidates but also provide interpretable biological paths, leading to more informed decisions. |
Adriana Carolina Gonzalez Cavazos 🔗 |
-
|
Graph Neural Bayesian Optimization for Virtual Screening
(
Poster
)
>
link
Virtual screening is an essential component of early-stage drug and materials discovery. This is challenged by the increasingly intractable size of virtual libraries and the high cost of evaluating properties. We propose GNN-SS, a Graph Neural Network (GNN) powered Bayesian Optimization (BO) algorithm. GNN-SS utilizes random sub-sampling to reduce the computational complexity of the BO problem, and diversifies queries for training the model. We further introduce data-independent projections to efficiently model second-order random feature interactions, and improve uncertainty estimates. GNN-SS is computationally light, sample-efficient, and rapidly narrows the search space by leveraging the generalization ability of GNNs. Our algorithm achieves state-of-the-art performance among screening methods for the Practical Molecular Optimization benchmark. |
Miles Wang-Henderson · Bartu Soyuer · Parnian Kassraie · Andreas Krause · Ilija Bogunovic 🔗 |
-
|
Synthon Embeddings for Modeling DNA Encoded Libraries
(
Poster
)
>
link
DNA-Encoded Library (DEL) has proven to be a powerful tool that utilizes combinatorially constructed small-molecules to facilitate highly-efficient screening assays. These selection experiments, involving multiple stages of washing, elution, and identification of potent binders via unique DNA barcodes, often generate complex data. This complexity can potentially mask the underlying signals, necessitating the application of computational tools such as machine learning to uncover valuable insights. We introduce an innovative approach to model DEL data, by decomposing the molecular representation into their mono-synthon and di-synthon building blocks, which capitalizes on the inherent hierarchical structure of these molecules. Additionally, we investigate various methods of integrating covariate factors to more effectively account for data noise. Our model demonstrates strong performance compared to count baselines, enriches the correct pharmacophores, and offers valuable insights via its intrinsic interpretable structure, thereby providing a robust tool for the analysis of DEL data. |
Benson Chen · Mohammad Sultan · Theofanis Karaletsos 🔗 |
-
|
Learning Scalar Fields for Molecular Docking with Fast Fourier Transforms
(
Poster
)
>
link
Molecular docking is critical to structure-based virtual screening, yet the throughput of such workflows is limited by the expensive optimization of scoring functions involved in most docking algorithms. We explore how machine learning can accelerate this process by learning a scoring function with a functional form that allows for more rapid optimization. Specifically, we define the scoring function to be the cross-correlation of multi-channel ligand and protein scalar fields parameterized by equivariant graph neural networks, enabling rapid optimization over rigid-body degrees of freedom with fast Fourier transforms. Moreover, the runtime of our approach can be amortized at several levels of abstraction, and is particularly favorable for virtual screening settings with a common binding pocket. We benchmark our scoring functions on two simplified docking-related tasks: decoy pose scoring and rigid conformer docking. Our method attains similar but faster performance on crystal structures compared to the Vina and Gnina scoring functions, and is more robust on computationally predicted structures. |
Bowen Jing · Tommi Jaakkola · Bonnie Berger 🔗 |
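The core trick named in the docking abstract above, scoring every rigid translation at once by evaluating a cross-correlation with fast Fourier transforms, can be illustrated with a deliberately simplified sketch. Everything below is an assumption made for illustration: the fields are one-dimensional, single-channel, periodic, and hand-written, whereas the abstract describes multi-channel scalar fields parameterized by equivariant graph neural networks, and the function names are hypothetical.

```python
import cmath


def _fft(values, sign):
    """Recursive radix-2 Cooley-Tukey transform; length must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = _fft(values[0::2], sign)
    odd = _fft(values[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fft(values):
    return _fft(values, -1)


def ifft(values):
    n = len(values)
    return [v / n for v in _fft(values, +1)]


def translation_scores(protein_field, ligand_field):
    """Score every circular shift t of the ligand: sum_x protein[x + t] * ligand[x]."""
    n = len(protein_field)
    if n == 0 or n != len(ligand_field) or n & (n - 1):
        raise ValueError("fields must share a power-of-two length")
    p_hat = fft(protein_field)
    l_hat = fft(ligand_field)
    # Cross-correlation theorem for real fields: conj(F[ligand]) * F[protein].
    product = [l.conjugate() * p for l, p in zip(l_hat, p_hat)]
    return [s.real for s in ifft(product)]


def translation_scores_brute_force(protein_field, ligand_field):
    """Same scores by direct summation, O(n^2); used only as a reference."""
    n = len(protein_field)
    return [
        sum(protein_field[(x + t) % n] * ligand_field[x] for x in range(n))
        for t in range(n)
    ]


# Toy fields: the ligand profile matches the protein profile when shifted by 2.
protein = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0]
ligand = [1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
scores = translation_scores(protein, ligand)
best_shift = max(range(len(scores)), key=scores.__getitem__)
```

The FFT route costs O(n log n) for all n shifts instead of O(n^2) by direct summation, which is the reason such a functional form permits rapid optimization over translations; with several channels, the per-channel correlations would simply be summed.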
-
|
$\textit{In vitro}$ validated antibody design against multiple therapeutic antigens using generative inverse folding
(
Oral
)
>
link
Deep learning approaches have demonstrated the ability to design protein sequences given backbone structures. While these approaches have been applied $\textit{in silico}$ to designing antibody complementarity-determining regions (CDRs), they have yet to be validated $\textit{in vitro}$ for designing antibody binders, which is the true measure of success for antibody design. Here we describe $\textit{IgDesign}$, a deep learning method for antibody CDR design, and demonstrate its robustness with successful binder design for 8 therapeutic antigens. The model is tasked with designing heavy chain CDR3 (HCDR3) or all three heavy chain CDRs (HCDR123) using native backbone structures of antibody-antigen complexes, along with the antigen and antibody framework (FWR) sequences as context. For each of the 8 antigens, we design 100 HCDR3s and 100 HCDR123s, scaffold them into the native antibody's variable region, and screen them for binding against the antigen using surface plasmon resonance (SPR). As a baseline, we screen 100 HCDR3s taken from the model's training set and paired with the native HCDR1 and HCDR2. We observe that both HCDR3 design and HCDR123 design outperform this HCDR3-only baseline. IgDesign is the first experimentally validated antibody inverse folding model. It can design antibody binders to multiple therapeutic antigens with high success rates and, in some cases, improved affinities over clinically validated reference antibodies. Antibody inverse folding has applications to both $\textit{de novo}$ antibody design and lead optimization, making IgDesign a valuable tool for accelerating drug development and enabling therapeutic design.
|
Amir Shanehsazzadeh 🔗 |
-
|
Offline RL for generative design of protein binders
(
Oral
)
>
link
SlidesLive Video Offline Reinforcement Learning (RL) offers a compelling avenue for solving RL problems without the need for interactions with an environment, which may be expensive or unsafe. While online RL methods have found success in various domains, such as de novo Structure-Based Drug Discovery (SBDD), they struggle when it comes to optimizing essential properties derived from protein-ligand docking. The high computational cost associated with the docking process makes it impractical for online RL, which typically requires hundreds of thousands of interactions during learning. In this study, we propose the application of offline RL to address the bottleneck posed by the docking process, leveraging RL's capability to optimize non-differentiable properties. Our preliminary investigation focuses on using offline RL to conditionally generate drugs with improved docking and chemical properties. |
Denis Tarasov · Ulrich Armel Mbou Sob · Miguel Arbesú · Nima Siboni · Sebastien Boyer · Andries Smit · Oliver Bent · Arnu Pretorius · Marcin Skwark 🔗 |
-
|
A framework for conditional diffusion modelling with applications in motif scaffolding for protein design
(
Oral
)
>
link
Many protein design applications, such as binder or enzyme design, require scaffolding a structural motif with high precision. Generative modelling paradigms based on denoising diffusion processes emerged as a leading candidate to address this motif scaffolding problem and have shown early experimental success in some cases. In the diffusion paradigm, motif scaffolding is treated as a conditional generation task, and several conditional generation protocols were proposed or imported from the Computer Vision literature. However, most of these protocols are motivated heuristically, e.g. via analogies to Langevin dynamics, and lack a unifying framework, obscuring connections between the different approaches. In this work, we unify conditional training and conditional sampling procedures under one common framework based on the mathematically well-understood Doob's h-transform. This new perspective allows us to draw connections between existing methods and propose a new conditional training protocol. We illustrate the effectiveness of this new protocol in both image outpainting and motif scaffolding, and find that it outperforms standard methods. |
Kieran Didi · Francisco Vargas · Simon Mathis · Vincent Dutordoir · Emile Mathieu · Urszula Julia Komorowska · Pietro Lió 🔗 |
-
|
Towards a more inductive world for drug repurposing approaches
(
Oral
)
>
link
Drug-target interaction (DTI) prediction is a challenging, albeit essential, task in drug repurposing. Learning-on-graph models have drawn special attention, as they can significantly reduce drug repurposing costs and time commitment. However, many current approaches require demanding additional information besides DTIs, which complicates their evaluation process and usability. Additionally, structural differences in the learning architecture of current models hinder their fair benchmarking. In this work, we first perform an in-depth evaluation of current DTI datasets and prediction models through a robust benchmarking process, and show that DTI prediction methods based on transductive models lack generalization and lead to inflated performance when evaluated as previously done in the literature, and are hence not suited for drug repurposing approaches. We then propose a novel biologically-driven strategy for negative edge subsampling and show through in vitro validation that newly discovered interactions are indeed true. We envision this work as the underpinning for future fair benchmarking and robust model design. All generated resources and tools are publicly available as a Python package. |
Jesus de la Fuente Cedeño · Guillermo Serrano · Uxia Veleiro · Mikel Casals · Laura Vera · Marija Pizurica · Antonio Pineda-Lucena · Idoia Ochoa · Silve Vicent · Olivier Gevaert · Mikel Hernaez
|
-
|
Embracing assay heterogeneity with neural processes for markedly improved bioactivity predictions
(
Oral
)
>
link
SlidesLive Video Predicting the bioactivity of a ligand is one of the hardest and most important challenges in computer-aided drug discovery. Despite years of data collection and curation efforts, bioactivity data remains sparse and heterogeneous, thus hampering efforts to build predictive models that are accurate, transferable and robust. The intrinsic variability of the experimental data is further compounded by data aggregation practices that neglect heterogeneity to overcome sparsity. Here we discuss the limitations of these practices and present a hierarchical meta-learning framework that exploits the information synergy across disparate assays by successfully accounting for assay heterogeneity. We show that the model achieves a drastic improvement in affinity prediction across diverse protein targets and assay types compared to conventional baselines. It can quickly adapt to new target contexts using very few observations, thus enabling large-scale virtual screening in early-phase drug discovery. |
Lucian Chan · Marcel Verdonk · Carl Poelking 🔗 |
-
|
DrugImprover: Utilizing Reinforcement Learning for Multi-Objective Alignment in Drug Optimization
(
Oral
)
>
link
SlidesLive Video Reinforcement learning from human feedback (RLHF) is a method for enhancing the finetuning of large language models (LLMs), leading to notable performance improvements and better alignment with human values. Inspired by RLHF, this research addresses drug optimization. We employ reinforcement learning to finetune a drug optimization model, improving the original drug across multiple target objectives while retaining its beneficial chemical properties. Our proposal comprises three primary components: (1) DrugImprover: A framework tailored for improving robustness and efficiency in drug optimization. (2) A novel Advantage-alignment Policy Optimization (APO) algorithm with multi-critic guided exploration for finetuning the objective-oriented properties. (3) A dataset of 2 million compounds, each with OEDOCK docking scores on two proteins, 3CLPro (PDBID: 7BQY) and RTCB (PDBID: 4DWQ), from SARS-CoV-2 and human cancer cells, respectively. We conduct a comprehensive evaluation of APO and demonstrate its effectiveness in improving the original drug across multiple properties. |
Xuefeng Liu · Songhao Jiang · Archit Vasan · Alexander Brace · Ozan Gokdemir · Thomas Brettin · Fangfang Xia · Ian Foster · Rick Stevens 🔗 |