Synthetic clinical trial data create opportunities for data sharing, cross-collaboration, and innovation around these valuable, siloed data sources. While the value of synthetic clinical trial data relies on the privacy preservation they offer clinical trial participants, the true degree of privacy has been questioned in recent literature. Given the highly sensitive nature of clinical trial data, especially the private health information they contain, there is an urgent need for a framework specifically designed to provide guaranteed levels of privacy for synthetic datasets generated from clinical trial data. In this paper, we propose a practical privacy framework that ensures synthetic clinical trial data privacy at the level of the source data by design and provides objective, measurable bounds on the disclosure risks through a combination of technical, policy, and algorithmic controls. The proposed framework enforces privacy prior to the generation of synthetic datasets and therefore complements the privacy-preserving attributes intrinsic to the algorithms used for synthetic data generation. To demonstrate how the components of the framework address the privacy requirements of clinical trial data, we discuss how this privacy system responds to a set of realistic adversarial scenarios. Ultimately, we believe the proposed framework can foster more privacy research in clinical trial data sharing.
Afrah Shafquat (Medidata, a Dassault Systèmes company)
Afrah Shafquat is a Sr. Data Scientist at Medidata AI where her work is focused on synthetic clinical trial data generation and innovative machine-learning models to further understanding of clinical and healthcare datasets. She has a PhD in Computational Biology (2020) from Cornell University where her dissertation focused on inferring errors in disease diagnoses using Bayesian hierarchical models. She also has an SB in Biological Engineering from MIT.
Mandis Beigi (Medidata, a Dassault Systèmes company)
Mandis Beigi, PhD, is a senior data scientist working in the Trial Design solutions group at Medidata Solutions (Dassault Systèmes). She currently works on synthetic data generation from clinical trial data and on the privacy risks of synthetic data. She has over 20 years of prior experience at IBM Research, working in fields such as computer vision, sensor data analytics, high-dimensional data analytics, anomaly detection, and rule-based system and network management. She received her master's degree and PhD in Electrical Engineering from Columbia University.
Jimeng Sun (University of Illinois Urbana-Champaign)
More from the Same Authors
2021: Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development
Kexin Huang · Tianfan Fu · Wenhao Gao · Yue Zhao · Yusuf Roohani · Jure Leskovec · Connor Coley · Cao Xiao · Jimeng Sun · Marinka Zitnik
2022: Recommendation for New Drugs with Limited Prescription Data
Zhenbang Wu · Huaxiu Yao · Zhe Su · David Liebovitz · Lucas Glass · James Zou · Chelsea Finn · Jimeng Sun
2022: Synthetic Clinical Trial Data while Preserving Subject-Level Privacy
Mandis Beigi · Afrah Shafquat · Jason Mezey · Jacob Aptekar
2022 Poster: Reinforced Genetic Algorithm for Structure-based Drug Design
Tianfan Fu · Wenhao Gao · Connor Coley · Jimeng Sun
2022 Poster: ATD: Augmenting CP Tensor Decomposition by Self Supervision
Chaoqi Yang · Cheng Qian · Navjot Singh · Cao (Danica) Xiao · M Westover · Edgar Solomonik · Jimeng Sun
2022 Poster: TransTab: Learning Transferable Tabular Transformers Across Tables
Zifeng Wang · Jimeng Sun
2022 Poster: Sample Efficiency Matters: A Benchmark for Practical Molecular Optimization
Wenhao Gao · Tianfan Fu · Jimeng Sun · Connor Coley
2022 Poster: Conformal Prediction with Temporal Quantile Adjustments
Zhen Lin · Shubhendu Trivedi · Jimeng Sun