Drug Discovery with Large-Scale, Physics-Based Synthetic Data at D. E. Shaw Research
Abstract
Recent breakthroughs in machine learning, such as diffusion models for biomolecular cofolding and multimodal large language models (LLMs), have the potential to reshape the drug discovery process by enabling reasoning across complex biological and chemical systems. Despite advances in machine learning hardware, algorithms, and model architectures, data scarcity remains a fundamental bottleneck to progress in the field. High-quality experimental data on biological structures, their interactions with drug-like molecules, and important physical properties of those molecules are limited, and generating more is slow and capital-intensive. At D. E. Shaw Research, one way we are addressing this challenge is through large-scale, physics-based synthetic data generation. Many of our models are trained on vast amounts of molecular dynamics simulation data produced by our special-purpose supercomputer, Anton. The synthetic data from Anton gives us diverse and physically meaningful datasets that extend well beyond the data available from experiment. This talk will highlight some of the types of data we generate, as well as some of the methods used to incorporate molecular dynamics data into multimodal LLMs.