Synthetic Data for Empowering ML Research

Mihaela van der Schaar · Zhaozhi Qian · Sergul Aydore · Dimitris Vlitas · Dino Oglic · Tucker Balch



Advances in machine learning owe much to the public availability of high-quality benchmark datasets and the well-defined problem settings that they encapsulate. Examples are abundant: CIFAR-10 for image classification, COCO for object detection, SQuAD for question answering, BookCorpus for language modelling, etc. There is a general belief that the accessibility of high-quality benchmark datasets is central to the thriving of our community.

However, three prominent issues affect benchmark datasets: data scarcity, privacy, and bias. They already manifest in many existing benchmarks, and also make the curation and publication of new benchmarks difficult (if not impossible) in numerous high-stakes domains, including healthcare, finance, and education. Hence, although ML holds strong promise in these domains, the lack of high-quality benchmark datasets creates a significant hurdle for the development of methodology and algorithms and leads to missed opportunities.

Synthetic data is a promising solution to the key issues of benchmark dataset curation and publication. Specifically, high-quality synthetic data generation could be done while addressing the following major issues.

1. Data Scarcity. The training and evaluation of ML algorithms require datasets with a sufficient sample size. Note that even if the algorithm can learn from very few samples, we still need sufficient validation data for model evaluation. However, it is often challenging to obtain the desired number of samples due to the inherent data scarcity (e.g. people with unique characteristics, patients with rare diseases etc.) or the cost and feasibility of certain data collection. There has been very active research in cross-domain and out-of-domain data generation, as well as generation from a few samples. Once the generator is trained, one could obtain arbitrarily large synthetic datasets.

2. Privacy. In many key applications, ML algorithms rely on record-level data collected from human subjects, which leads to privacy concerns and legal risks. As a result, data owners are often hesitant to publish datasets for the research community. Even if they are willing to, accessing the datasets often requires significant time and effort from the researchers. Synthetic data is regarded as one potential way to promote privacy. The 2019 NeurIPS Competition "Synthetic data hide and seek challenge" demonstrates the difficulty in performing privacy attacks on synthetic data. Many recent works look further into the theoretical and practical aspects of synthetic data and privacy.

3. Bias and under-representation. The benchmark dataset may be subject to data collection bias and under-represent certain groups (e.g. people with less-privileged access to technology). Using these datasets as benchmarks would (implicitly) encourage the community to build algorithms that reflect or even exploit the existing bias. This is likely to hamper the adoption of ML in high-stake applications that require fairness, such as finance and justice. Synthetic data provides a way to curate less biased benchmark data. Specifically, (conditional) generative models can be used to augment any under-represented group in the original dataset. Recent works have shown that training on synthetically augmented data leads to consistent improvements in robustness and generalisation.

Why do we need this workshop? Despite the growing interest in using synthetic data to empower ML, this agenda is still challenging because it involves multiple research fields and various industry stakeholders. Specifically, it calls for the collaboration of the researchers in generative models, privacy, and fairness. Existing research in generative models focuses on generating high-fidelity data, often neglecting the privacy and fairness aspect. On the other hand, the existing research in privacy and fairness often focuses on the discriminative setting rather than the generative setting. Finally, while generative modelling in images and tabular data has matured, the generation of time series and multi-modal data is still a vibrant area of research, especially in complex domains in healthcare and finance. The data modality and characteristics differ significantly across application domains and industries. It is therefore important to get the inputs from the industry experts such that the benchmark reflects reality.

The goal of this workshop is to provide a platform for vigorous discussion with researchers in various fields of ML and industry experts in the hope to progress the idea of using synthetic data to empower ML research. The workshop also provides a forum for constructive debates and identifications of strengths and weaknesses with respect to alternative approaches, e.g. federated learning

Live content is unavailable. Log in and register to view live content

Timezone: America/Los_Angeles »


Log in and register to view live content