Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development
Kexin Huang · Tianfan Fu · Wenhao Gao · Yue Zhao · Yusuf Roohani · Jure Leskovec · Connor Coley · Cao Xiao · Jimeng Sun · Marinka Zitnik
Event URL: https://openreview.net/forum?id=8nvgnORnoWr

Therapeutics machine learning is an emerging field with incredible opportunities for innovation and impact. However, advancement in this field requires the formulation of meaningful tasks and careful curation of datasets. Here, we introduce Therapeutics Data Commons (TDC), the first unifying platform to systematically access and evaluate machine learning across the entire range of therapeutics. To date, TDC includes 66 AI-ready datasets spread across 22 learning tasks and spanning the discovery and development of safe and effective medicines. TDC also provides an ecosystem of tools and community resources, including 33 data functions and diverse types of data splits, 23 strategies for systematic model evaluation, 17 molecule generation oracles, and 29 public leaderboards. All resources are integrated and accessible via an open Python library. We carry out extensive experiments on selected datasets, demonstrating that even the strongest algorithms fall short of solving key therapeutics challenges, including distributional shifts, multi-scale and multi-modal learning, and robust generalization to novel data points. We envision that TDC can facilitate algorithmic advances and considerably accelerate machine-learning model development, validation and transition into biomedical and clinical implementation. TDC is available at https://tdcommons.ai.

Author Information

Kexin Huang (Stanford University)
Tianfan Fu (Georgia Institute of Technology)
Wenhao Gao (Massachusetts Institute of Technology)
Yue Zhao (Carnegie Mellon University)

I am pursuing a Ph.D. in Information Systems at Carnegie Mellon University, advised by Prof. Leman Akoglu. Unlike most IS researchers, I focus on data mining algorithms, systems, and applications. Research keywords: outlier and anomaly detection; ensemble learning; scalable machine learning; machine learning systems.

Yusuf Roohani (Stanford University)
Jure Leskovec (Stanford University/Pinterest)
Connor Coley (MIT)
Cao Xiao (IQVIA)
Jimeng Sun (University of Illinois Urbana-Champaign)
Marinka Zitnik (Harvard University)
