Synthetic Data for Empowering ML Research
Mihaela van der Schaar · Zhaozhi Qian · Sergul Aydore · Dimitris Vlitas · Dino Oglic · Tucker Balch

Fri Dec 02 06:00 AM -- 03:00 PM (PST) @ Room 288 - 289
Event URL: https://www.syntheticdata4ml.vanderschaar-lab.com/

Advances in machine learning owe much to the public availability of high-quality benchmark datasets and the well-defined problem settings that they encapsulate. Examples are abundant: CIFAR-10 for image classification, COCO for object detection, SQuAD for question answering, BookCorpus for language modelling, etc. There is a general belief that the accessibility of high-quality benchmark datasets is central to the thriving of our community.

However, three prominent issues affect benchmark datasets: data scarcity, privacy, and bias. They already manifest in many existing benchmarks, and also make the curation and publication of new benchmarks difficult (if not impossible) in numerous high-stakes domains, including healthcare, finance, and education. Hence, although ML holds strong promise in these domains, the lack of high-quality benchmark datasets creates a significant hurdle for the development of methodology and algorithms and leads to missed opportunities.

Synthetic data is a promising solution to the key issues of benchmark dataset curation and publication. Specifically, high-quality synthetic data generation could address the following three major issues.

1. Data Scarcity. The training and evaluation of ML algorithms require datasets with a sufficient sample size. Note that even if the algorithm can learn from very few samples, we still need sufficient validation data for model evaluation. However, it is often challenging to obtain the desired number of samples, either because the data are inherently scarce (e.g. people with unique characteristics, patients with rare diseases, etc.) or because collecting certain data is costly or infeasible. There has been very active research in cross-domain and out-of-domain data generation, as well as generation from a few samples. Once the generator is trained, one could obtain arbitrarily large synthetic datasets.

2. Privacy. In many key applications, ML algorithms rely on record-level data collected from human subjects, which leads to privacy concerns and legal risks. As a result, data owners are often hesitant to publish datasets for the research community. Even if they are willing to, accessing the datasets often requires significant time and effort from the researchers. Synthetic data is regarded as one potential way to promote privacy. The 2019 NeurIPS Competition "Synthetic data hide and seek challenge" demonstrates the difficulty in performing privacy attacks on synthetic data. Many recent works look further into the theoretical and practical aspects of synthetic data and privacy.

3. Bias and under-representation. The benchmark dataset may be subject to data collection bias and under-represent certain groups (e.g. people with less-privileged access to technology). Using these datasets as benchmarks would (implicitly) encourage the community to build algorithms that reflect or even exploit the existing bias. This is likely to hamper the adoption of ML in high-stakes applications that require fairness, such as finance and justice. Synthetic data provides a way to curate less biased benchmark data. Specifically, (conditional) generative models can be used to augment any under-represented group in the original dataset. Recent works have shown that training on synthetically augmented data leads to consistent improvements in robustness and generalisation.
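The augmentation idea in item 3 can be illustrated with a minimal sketch. It assumes a deliberately simple stand-in for a conditional generative model: an independent Gaussian fitted per feature within each group. The function names (`fit_group_gaussians`, `augment_minority_groups`), the toy records, and the field names are all hypothetical and are not taken from the workshop or any cited work. A generator this simple offers no privacy guarantee and ignores correlations between features.

```python
import random
import statistics
from collections import defaultdict


def fit_group_gaussians(records, group_key, feature_keys):
    """Fit an independent Gaussian (mean, std) per feature within each group."""
    by_group = defaultdict(list)
    for row in records:
        by_group[row[group_key]].append(row)
    params = {}
    for group, rows in by_group.items():
        params[group] = {
            key: (
                statistics.fmean([r[key] for r in rows]),
                statistics.pstdev([r[key] for r in rows]),
            )
            for key in feature_keys
        }
    counts = {group: len(rows) for group, rows in by_group.items()}
    return params, counts


def augment_minority_groups(records, group_key, feature_keys, seed=0):
    """Sample synthetic rows so every group matches the largest group's size."""
    rng = random.Random(seed)
    params, counts = fit_group_gaussians(records, group_key, feature_keys)
    target = max(counts.values())
    synthetic = []
    for group, count in counts.items():
        for _ in range(target - count):
            row = {group_key: group}
            for key in feature_keys:
                mean, std = params[group][key]
                row[key] = rng.gauss(mean, std)  # conditional on the group
            synthetic.append(row)
    return synthetic


# Hypothetical toy data: group "B" is under-represented (2 rows vs 8).
toy = (
    [{"group": "A", "age": 30.0 + i, "income": 50.0 + 2 * i} for i in range(8)]
    + [{"group": "B", "age": 60.0 + i, "income": 20.0 + i} for i in range(2)]
)
synthetic_rows = augment_minority_groups(toy, "group", ["age", "income"])
augmented = toy + synthetic_rows
```

In this sketch only group "B" receives synthetic rows, so the augmented dataset is balanced across groups. A realistic pipeline would replace the per-feature Gaussians with a learned conditional generator and would check fidelity and privacy separately.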

Why do we need this workshop? Despite the growing interest in using synthetic data to empower ML, this agenda is still challenging because it involves multiple research fields and various industry stakeholders. Specifically, it calls for collaboration among researchers in generative models, privacy, and fairness. Existing research in generative models focuses on generating high-fidelity data, often neglecting the privacy and fairness aspects. On the other hand, the existing research in privacy and fairness often focuses on the discriminative setting rather than the generative setting. Finally, while generative modelling in images and tabular data has matured, the generation of time series and multi-modal data is still a vibrant area of research, especially in complex domains in healthcare and finance. The data modality and characteristics differ significantly across application domains and industries. It is therefore important to gather input from industry experts so that the benchmarks reflect reality.

The goal of this workshop is to provide a platform for vigorous discussion between researchers in various fields of ML and industry experts, in the hope of advancing the idea of using synthetic data to empower ML research. The workshop also provides a forum for constructive debate and for identifying strengths and weaknesses relative to alternative approaches, e.g. federated learning.

Author Information

Mihaela van der Schaar (University of Cambridge)
Zhaozhi Qian (University of Cambridge)
Sergul Aydore (AWS AI)
Dimitris Vlitas
Dino Oglic (AstraZeneca)
Tucker Balch (J.P. Morgan)

Tucker Balch is a managing director at J.P. Morgan AI Research. He is a professor on leave from Georgia Tech. He is interested in research problems concerning multi-agent social behavior. This interest has led to research on a wide range of topics, from financial markets to tracking and modeling the behavior of ants, honeybees, and monkeys. He teaches courses in Robotics, Machine Learning and Finance. In addition to his teaching on campus, more than 170,000 students have taken his courses online via Coursera and Udacity. He is Chief Scientist and co-founder of Lucena Research, an investment software firm that applies Machine Learning and Big Data approaches to investment problems. Balch has published 120 conference papers and journal articles. His work has been covered by CNN, New Scientist, Institutional Investor, and the New York Times. His former students work at NASA/JPL, Boston Dynamics, Goldman Sachs, Morgan Stanley, Citadel, AQR, and Yahoo! Finance. Before his career in academia, Balch was a USAF F-15 pilot.
