
Machine Learning for Molecules
José Miguel Hernández-Lobato · Matt Kusner · Brooks Paige · Marwin Segler · Jennifer Wei

Sat Dec 12 05:30 AM -- 01:00 PM (PST)
Event URL: https://ml4molecules.github.io/

Discovering new molecules and materials is a central pillar of human well-being, providing new medicines, securing the world’s food supply via agrochemicals, or delivering new battery or solar panel materials to mitigate climate change. However, the discovery of new molecules for an application can often take up to a decade, with costs spiraling. Machine learning can help to accelerate the discovery process. The goal of this workshop is to bring together researchers interested in improving applications of machine learning for chemical and physical problems and industry experts with practical experience in pharmaceutical and agricultural development. In a highly interactive format, we will outline the current frontiers and present emerging research directions. We aim to use this workshop as an opportunity to establish a common language between all communities, to actively discuss new research problems, and also to collect datasets by which novel machine learning models can be benchmarked. The program is a collection of invited talks, alongside contributed posters. A panel discussion will offer the perspectives and experiences of influential researchers from both fields and open the floor to conversation with participants. An expected outcome of this workshop is the interdisciplinary exchange of ideas and initiation of collaboration.

Sat 5:30 a.m. - 5:40 a.m.
Opening Remarks
Sat 5:30 a.m. - 2:00 p.m.

Please use this Discord for questions/discussion for all sessions.

First time Discord user? Check out this video (https://www.youtube.com/watch?v=LDVqruRsYtA).

Sat 5:41 a.m. - 6:01 a.m.

Abstract: The digital revolution has finally reached the pharmaceutical industry, and machine-learning models are becoming more and more relevant for drug discovery. People outside the domain often underestimate the complexity of drug discovery. The hope that novel algorithms and models can remedy the challenge of finding new drugs more efficiently is often shaken when data science experts dive more deeply into the domain. Nevertheless, there are many areas in drug discovery where machine learning and data science can make a difference. A very important point is to understand the data better before simply applying new models. Especially in early drug discovery, many data sets are very small and tricky because of their data distribution, data bias, data shift or incompleteness. Another critical point is the users: to be effective and actionable, machine learning models need to be accessible and integrated into the daily work of scientists, who are most of the time not data scientists themselves. An important related aspect is educating users to deepen their knowledge and set the right expectations of machine learning models. In this presentation, several of these aspects will be discussed in more detail using examples and lessons we have learned over the past years.

Biography: Dr. Nadine Schneider obtained a BSc and MSc in Bioinformatics from Saarland University in Germany. She did her PhD in Molecular Modeling in the group of Prof. Dr. Matthias Rarey at the University of Hamburg, Germany. In her PhD she worked on a novel protein-ligand scoring function, which was integrated into the commercial modeling software SeeSAR (BioSolveIT GmbH). In 2014 she joined the Novartis Institutes for BioMedical Research (NIBR) in Basel (Switzerland) for a postdoc focusing on Cheminformatics and Data Science under the supervision of Dr. Gregory Landrum and Dr. Nikolaus Stiefl. Since 2017 she has been a researcher in the Computer-Aided Drug Design team in Global Discovery Chemistry at NIBR, Basel.

Nadine Schneider
Sat 6:01 a.m. - 6:10 a.m.
Invited Talk: Nadine Schneider - Live Q&A (Q&A)
Sat 6:11 a.m. - 6:31 a.m.


The rare-event sampling problem is one of the fundamental problems in statistical mechanics, and particularly in molecular dynamics or Monte-Carlo simulations of molecules. Here I will introduce Boltzmann-generating flows, which combine invertible neural networks with statistical-mechanics-based reweighting or resampling methods in order to train a machine-learning method to generate samples from the desired equilibrium distribution of a molecule or other many-body system. In particular, two recent developments will be described: equivariant flows, which take symmetries in the molecular energy function into account, and Stochastic Normalizing Flows, which combine deterministic invertible neural networks with stochastic sampling steps and are trained using path likelihood maximization techniques that have emerged in nonequilibrium statistical mechanics.


Frank Noé has undergraduate degrees in electrical engineering and computer science and graduated in computer science and computational physics at the University of Heidelberg. Frank is currently a full professor of Mathematics, Computer Science and Physics at Freie Universität Berlin, Germany. Since 2015 he has also held an adjunct professorship in Chemistry at Rice University in Houston, Texas.

Frank's research focuses on developing new Machine Learning methods for the physical sciences, especially molecular sciences. Frank received two awards from the European Research Council, an ERC starting grant in 2012 and an ERC consolidator grant in 2017. He received the early career award in theoretical Chemistry of the American Chemical Society in 2019, and he has been an ISI highly cited researcher since 2019.

Frank Noe
Sat 6:31 a.m. - 6:40 a.m.
Invited Talk: Frank Noe - Live Q&A (Q&A)
Sat 6:40 a.m. - 6:50 a.m.
Contributed Talk: Evidential Deep Learning for Guided Molecular Property Prediction and Discovery - Ava Soleimany, Alexander Amini, Samuel Goldman, Daniela Rus, Sangeeta Bhatia and Connor Coley (Talk) Video
Ava P Soleimany
Sat 6:50 a.m. - 7:00 a.m.
Contributed Talk: Gaussian Process Molecular Property Prediction with FlowMO - Henry Moss and Ryan-Rhys Griffiths (Talk) Video
Henry Moss
Sat 7:00 a.m. - 7:10 a.m.
Contributed Talk: Explaining Deep Graph Networks with Molecular Counterfactuals - Davide Bacciu and Danilo Numeroso (Talk) Video
Danilo Numeroso
Sat 7:11 a.m. - 7:31 a.m.


Machine learning is emerging as a powerful tool in quantum chemistry and materials science, combining the accuracy of electronic structure methods with computational efficiency. Going beyond the simple prediction of chemical properties, machine learning potentials can be applied to perform fast molecular dynamics simulations, model solvent effects and response properties as well as find structures with desired properties by inverse design. In this talk, we will show how this opens a clear path towards unifying machine learning and quantum chemistry.


Klaus-Robert Müller has been a professor of computer science at Technische Universität Berlin since 2006; at the same time he is co-directing the Berlin Big Data Center. He studied physics in Karlsruhe from 1984 to 1989 and obtained his Ph.D. degree in computer science at Technische Universität Karlsruhe in 1992. After completing a postdoctoral position at GMD FIRST in Berlin, he was a research fellow at the University of Tokyo from 1994 to 1995. In 1995, he founded the Intelligent Data Analysis group at GMD-FIRST (later Fraunhofer FIRST) and directed it until 2008. From 1999 to 2006, he was a professor at the University of Potsdam. He was awarded the Olympus Prize for Pattern Recognition (1999), the SEL Alcatel Communication Award (2006), the Science Prize of Berlin by the Governing Mayor of Berlin (2014), and the Vodafone Innovations Award (2017). In 2012, he was elected a member of the German National Academy of Sciences-Leopoldina, in 2017 a member of the Berlin Brandenburg Academy of Sciences, and also in 2017 an external scientific member of the Max Planck Society. In 2019 and 2020 he was named an ISI Highly Cited Researcher. His research interests are intelligent data analysis and machine learning with applications in neuroscience (specifically brain-computer interfaces), physics and chemistry.

Kristof T. Schütt is a senior researcher at the Berlin Institute for the Foundations of Learning and Data (BIFOLD). He received his master's degree in computer science in 2012 and his PhD in machine learning in 2018 at the machine learning group of Technische Universität Berlin. Until September 2020, he worked at the Audatic company developing neural networks for real-time speech enhancement. His research interests include interpretable neural networks, representation learning, generative models, and machine learning applications in quantum chemistry.

Klaus-Robert Müller, Kristof Schütt
Sat 7:31 a.m. - 7:40 a.m.
Invited Talk: Klaus-Robert Müller and Kristof Schütt - Live Q&A (Q&A)
Sat 7:41 a.m. - 8:01 a.m.


Deep learning methods applied to chemistry can be used to accelerate the discovery of new molecules, such as promising pharmaceuticals. Notably, methods such as graph neural networks (GNNs) are interesting tools to explore for molecular design because graphs are natural data structures for describing molecules. The process of designing novel, drug-like compounds can be viewed as one of generating graphs which optimize all the features of the desirable molecules.

In this talk, I will provide an overview of how deep learning methods can be applied to complex drug design tasks, focusing on our recently published tool, GraphINVENT. GraphINVENT uses GNNs and a tiered deep neural network architecture to probabilistically generate new molecules a single bond at a time, and learns to build new molecules resembling a training set without any explicit programming of chemical rules. GraphINVENT is one of many recent platforms which aim to streamline the drug discovery process using AI.


I joined the Molecular AI group at AstraZeneca in October 2018. My work focuses on using deep learning methods for graph-based molecular design. Before AstraZeneca, I was a PhD student in Professor Berend Smit’s molecular simulation group at UC Berkeley and EPFL. I received my PhD in Chemistry from UC Berkeley in July 2018, and my BS in Chemistry from Caltech in June 2013.

Rocío Mercado
Sat 8:01 a.m. - 8:10 a.m.
Invited Talk: Rocio Mercado - Live Q&A (Q&A)
Sat 8:10 a.m. - 8:15 a.m.
Spotlight Talk: Comparison of Atom Representations in Graph Neural Networks for Molecular Property Prediction - Agnieszka Pocha, Tomasz Danel and Lukasz Maziarka (Talk) Video
Tomasz Danel
Sat 8:15 a.m. - 8:20 a.m.
Spotlight Talk: Completion of partial reaction equations - Alain C. Vaucher, Philippe Schwaller and Teodoro Laino (Talk) Video
Alain Vaucher
Sat 8:20 a.m. - 8:25 a.m.
Spotlight Talk: Molecular representation learning with language models and domain-relevant auxiliary tasks - Benedek Fabian, Thomas Edlich, Héléna Gaspar, Marwin Segler, Joshua Meyers, Marco Fiscato and Mohamed Ahmed (Talk) Video
Benedek Fabian
Sat 8:25 a.m. - 8:30 a.m.
Spotlight Talk: Accelerate the screening of complex materials by learning to reduce random and systematic errors - Tian Xie, Yang Shao-Horn and Jeffrey Grossman (Talk) Video
Tian Xie
Sat 8:30 a.m. - 9:30 a.m.
Poster Session Break (Break)
Sat 9:30 a.m. - 10:00 a.m.
Panel (Discussion Panel)
Alan Aspuru-Guzik, Jennifer Listgarten, Klaus-Robert Müller, Nadine Schneider
Sat 10:00 a.m. - 10:10 a.m.
Contributed Talk: Bayesian GNNs for Molecular Property Prediction - George Lamb and Brooks Paige (Talk) Video
George Lamb
Sat 10:10 a.m. - 10:20 a.m.
Contributed Talk: Design of Experiments for Verifying Biomolecular Networks - Ruby Sedgwick, John Goertz, Ruth Misener, Molly Stevens and Mark van der Wilk (Talk) Video
Ruby Sedgwick
Sat 10:20 a.m. - 10:30 a.m.
Contributed Talk: Multi-task learning for electronic structure to predict and explore molecular potential energy surfaces - Z. Qiao, F. Ding, M. Welborn, P.J. Bygrave, D.G.A. Smith, A. Anandkumar, F.R. Manby and T.F. Miller III (Talk) Video
Zhuoran Qiao
Sat 10:31 a.m. - 10:51 a.m.


Over the last few years, we have seen a dramatic uptick in the application of Machine Learning in drug discovery. Developments in deep learning have led to a renaissance in Quantitative Structure-Activity Relationships (QSAR) and de-novo molecule generation. While the field continues to advance, it faces several challenges. As with any application of machine learning, the results will depend on the data, the representation, and the algorithms used to generate the machine learning models. In many cases, drug discovery data presents some unique challenges not found in data from other disciplines. Furthermore, the optimal means of representing molecules in machine learning is still an open question. This presentation will highlight current challenges and hopefully motivate new work to move the field forward.


Pat Walters heads the Computation & Informatics group at Relay Therapeutics in Cambridge, MA. His group focuses on novel computational methods that integrate computer simulations and experimental data to provide insights that drive drug discovery programs. Pat is co-author of the book “Deep Learning for the Life Sciences,” published by O’Reilly and Associates. His AI work began with expert systems in the late 1980s, moved to machine learning in the 1990s, and has continued through 25 years in the pharmaceutical industry. Before joining Relay, Pat spent more than 20 years at Vertex Pharmaceuticals, where he was Global Head of Modeling & Informatics. He is a member of the editorial advisory board for the Journal of Medicinal Chemistry and has been a guest editor for multiple scientific journals. Pat received his Ph.D. in Organic Chemistry from the University of Arizona, where he studied the application of artificial intelligence in conformational analysis. Before obtaining his Ph.D., he worked at Varian Instruments as both a chemist and a software developer. Pat received his B.S. in Chemistry from the University of California, Santa Barbara.

Patrick Walters
Sat 10:51 a.m. - 11:00 a.m.
Invited Talk: Patrick Walters - Live Q&A (Q&A)
Sat 11:01 a.m. - 11:21 a.m.


Increased reliance on chemicals in both industrialized and developing countries has led to a dramatic change in our patterns of exposure to both natural and synthetic chemicals. This diverse plethora of xenobiotics, some of which have become nearly ubiquitous, includes, among others, pesticides, pharmaceuticals, food compounds, and their largely unknown chemo-/biotransformation products. To accurately assess the various environmental health threats they may pose, it is crucial to understand how they are biologically produced, activated, detoxified, and eliminated from various biological matrices. As it turns out, understanding the biological and environmental fate of xenobiotics is a major step towards deciphering the aforementioned mechanisms. Moreover, it contributes significantly to the development of safer and more sustainable chemicals. Over the past decade, several in silico tools have been developed for the prediction and identification of metabolites, most of which are only commercially available and significantly biased towards drug-like molecules. In this presentation, we will describe BioTransformer, an open source software tool and freely accessible server for the prediction of human CYP450-catalyzed metabolism, human gut microbial degradation, human phase-II metabolism, human promiscuous metabolism, and environmental microbial degradation. Moreover, we will present an assessment of its performance in predicting the metabolism of agrochemicals, conducted at Corteva Agriscience. Furthermore, we will illustrate a few examples of its application as demonstrated by various published scientific studies. Finally, we will share future perspectives for this open source project, and describe how it could significantly benefit the exposure science and regulatory communities.


Dr. Yannick Djoumbou Feunang earned his PhD in Microbiology and Biotechnology at the University of Alberta, Canada, in 2017, where his research focused on developing Cheminformatics tools to enhance Metabolomics. Some of his main contributions include the software tools ClassyFire, BioTransformer, and CFM-ID 3.0, with applications of ontology and linked data, as well as machine learning and knowledge-based artificial intelligence, to biology and chemistry. Additionally, he has contributed to the development of databases such as DrugBank and HMDB. Since 2018, Dr. Djoumbou Feunang has worked as a Research Investigator for the Chemistry Data Science research group at Corteva Agriscience in Indianapolis, Indiana. His responsibilities include, among others: (1) the development of machine learning models to support lead generation and optimization projects, and (2) the enhancement of Corteva’s Cheminformatics scientific computing platform. He also currently leads a project aimed at building a cutting-edge, adapted in silico metabolism platform at Corteva Agriscience.

Yannick Djoumbou Feunang
Sat 11:21 a.m. - 11:30 a.m.
Invited Talk: Yannick Djoumbou Feunang - Live Q&A (Q&A)
Sat 11:30 a.m. - 11:35 a.m.
Spotlight Talk: Data augmentation strategies to improve reaction yield predictions and estimate uncertainty - Philippe Schwaller, Alain Vaucher, Teodoro Laino and Jean-Louis Reymond (Talk) Video
Philippe Schwaller
Sat 11:35 a.m. - 11:40 a.m.
Spotlight Talk: Message Passing Networks for Molecules with Tetrahedral Chirality - Lagnajit Pattanaik, Octavian Ganea, Ian Coley, Klavs Jensen, William Green and Connor Coley (Talk) Video
Lagnajit Pattanaik
Sat 11:40 a.m. - 11:45 a.m.
Spotlight Talk: Protein model quality assessment using rotation-equivariant, hierarchical neural networks - Stephan Eismann, Patricia Suriana, Bowen Jing, Raphael Townshend and Ron Dror (Talk) Video
Stephan Eismann
Sat 11:45 a.m. - 11:50 a.m.
Spotlight Talk: Crystal Structure Search with Random Relaxations Using Graph Networks - Gowoon Cheon, Lusann Yang, Kevin McCloskey, Evan Reed and Ekin Cubuk (Talk) Video
Gowoon Cheon
Sat 11:51 a.m. - 12:11 p.m.


The interpretability of machine learning models for molecules is critical to scientific discovery, understanding, and debugging. Attribution is one approach to interpretability, which highlights parts of the input that are influential to a neural network’s prediction. With molecules, we can set up synthetic tasks such as the identification of subfragment logics to generate ground truth attributions and labels. This scenario serves as a testbed to quantitatively study attributions of molecular graphs with Graph Neural Networks (GNNs). We perform multiple experiments looking at the effect of GNN architectures, label noise, and spurious correlations in attributions. In the end, we make concrete recommendations for which attribution methods and models to use while also providing a framework for evaluating new attribution techniques.

Biography: I am a research scientist at Google Research. My research centers around using machine learning techniques to build data-driven models for the prediction of molecular properties and the generation of new molecules and materials via generative models. Applications include solar cells, solubility, drug design, and particularly smelly molecules. I am part of a team that wants to do for olfaction what machine learning has done for vision and speech.

I am also passionate about science education and outreach. I am one of the founders and organizers of Clubes de Ciencia Mexico and of RIIAA, a LatinX-centered AI conference. In my free time, I like to run, eat ice cream and cook food.

Benjamin Sanchez-Lengeling
Sat 12:11 p.m. - 12:20 p.m.
Invited Talk: Benjamin Sanchez-Lengeling - Live Q&A (Q&A)
Sat 12:21 p.m. - 12:41 p.m.


Data-driven design is making headway into a number of application areas, including protein, small-molecule, and materials engineering. The design goal is to construct an object with desired properties, such as a protein that binds to a target more tightly than previously observed. To that end, costly experimental measurements are being replaced with calls to a high-capacity regression model trained on labeled data, which can be leveraged in an in silico search for promising design candidates. The aim then is to discover designs that are better than the best design in the observed data. This goal puts machine-learning-based design in a much more difficult spot than traditional applications of predictive modelling, since successful design requires, by definition, some degree of extrapolation: pushing the predictive model to its unknown limits, in parts of the design space that are a priori unknown. In this talk I'll discuss our emerging approaches to tackle this problem.

Biography: Since Jan. 2018, Jennifer Listgarten has been a Professor in the Department of Electrical Engineering and Computer Science, and Center for Computational Biology, at the University of California, Berkeley. She is also a member of the steering committee for the Berkeley AI Research (BAIR) Lab, and a Chan Zuckerberg investigator. From 2007 to 2017 she was at Microsoft Research, through Cambridge, MA (2014-2017), Los Angeles (2008-2014), and Redmond, WA (2007-2008). She completed her Ph.D. in the machine learning group in the Department of Computer Science at the University of Toronto, located in her hometown. She has two undergraduate degrees, one in Physics and one in Computer Science, from Queen's University in Kingston, Ontario. Jennifer's research interests are broadly at the intersection of machine learning, applied statistics, molecular biology and science.

Jennifer Listgarten
Sat 12:41 p.m. - 12:50 p.m.
Invited Talk: Jennifer Listgarten - Live Q&A (Q&A)
Sat 12:50 p.m. - 1:00 p.m.
Closing Remarks
Sat 1:00 p.m. - 2:00 p.m.
Poster Session Part 2 (Break)

Author Information

José Miguel Hernández-Lobato (University of Cambridge)
Matt Kusner (University College London)
Brooks Paige (University College London)
Marwin Segler (BenevolentAI)
Jennifer Wei (Google Research)
