Advances in machine learning owe much to the public availability of high-quality benchmark datasets and the well-defined problem settings that they encapsulate. Examples are abundant: CIFAR-10 for image classification, COCO for object detection, SQuAD for question answering, BookCorpus for language modelling, etc. It is widely believed that accessible, high-quality benchmark datasets are central to a thriving research community.
However, three prominent issues affect benchmark datasets: data scarcity, privacy, and bias. They already manifest in many existing benchmarks, and also make the curation and publication of new benchmarks difficult (if not impossible) in numerous high-stakes domains, including healthcare, finance, and education. Hence, although ML holds strong promise in these domains, the lack of high-quality benchmark datasets creates a significant hurdle for the development of methodology and algorithms and leads to missed opportunities.
Synthetic data is a promising solution to the key issues of benchmark dataset curation and publication. Specifically, high-quality synthetic data generation could address the following three major issues.
1. Data Scarcity. The training and evaluation of ML algorithms require datasets with a sufficient sample size. Even if an algorithm can learn from very few samples, sufficient validation data is still needed for model evaluation. However, it is often challenging to obtain the desired number of samples because of inherent data scarcity (e.g. people with unique characteristics or patients with rare diseases) or the cost and feasibility of data collection. There has been very active research in cross-domain and out-of-domain data generation, as well as generation from a few samples. Once the generator is trained, one could obtain arbitrarily large synthetic datasets.
2. Privacy. In many key applications, ML algorithms rely on record-level data collected from human subjects, which leads to privacy concerns and legal risks. As a result, data owners are often hesitant to publish datasets for the research community. Even when they are willing, accessing the datasets often requires significant time and effort from researchers. Synthetic data is regarded as one potential way to promote privacy. The 2019 NeurIPS Competition "Synthetic data hide and seek challenge" demonstrates the difficulty of performing privacy attacks on synthetic data. Many recent works look further into the theoretical and practical aspects of synthetic data and privacy.
3. Bias and under-representation. A benchmark dataset may be subject to data collection bias and may under-represent certain groups (e.g. people with less-privileged access to technology). Using such datasets as benchmarks would (implicitly) encourage the community to build algorithms that reflect or even exploit the existing bias. This is likely to hamper the adoption of ML in high-stakes applications that require fairness, such as finance and justice. Synthetic data provides a way to curate less biased benchmark data. Specifically, (conditional) generative models can be used to augment any under-represented group in the original dataset. Recent works have shown that training on synthetically augmented data leads to consistent improvements in robustness and generalisation.
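As a rough illustration of the augmentation idea in item 3, the sketch below balances group sizes by sampling synthetic rows from a per-group model. It is a toy stand-in, not a method endorsed by the workshop: an independent Gaussian per feature replaces a real conditional generative model, and the function names `fit_group_gaussians` and `augment_to_balance` are hypothetical.

```python
import random
import statistics
from collections import defaultdict


def fit_group_gaussians(rows):
    """Fit an independent Gaussian per feature for each group.

    `rows` is a list of (group_label, feature_list) pairs.
    Returns {group: [(mean, std), ...]} with one pair per feature.
    Assumption: a real system would use a conditional generative model here.
    """
    by_group = defaultdict(list)
    for group, features in rows:
        by_group[group].append(features)
    params = {}
    for group, feats in by_group.items():
        columns = list(zip(*feats))  # transpose rows into feature columns
        params[group] = [
            (statistics.fmean(col), statistics.pstdev(col)) for col in columns
        ]
    return params


def augment_to_balance(rows, seed=0):
    """Append synthetic rows until every group matches the largest group."""
    rng = random.Random(seed)
    params = fit_group_gaussians(rows)
    counts = defaultdict(int)
    for group, _ in rows:
        counts[group] += 1
    target = max(counts.values())
    augmented = list(rows)  # original rows are kept unchanged, in order
    for group, n in counts.items():
        for _ in range(target - n):
            sample = [rng.gauss(mu, sigma) for mu, sigma in params[group]]
            augmented.append((group, sample))
    return augmented


if __name__ == "__main__":
    # Hypothetical toy data: six majority rows, two minority rows.
    data = [("majority", [float(i), 1.0]) for i in range(6)]
    data += [("minority", [10.0, 5.0]), ("minority", [12.0, 7.0])]
    balanced = augment_to_balance(data, seed=1)
    print(len(balanced), "rows after augmentation")
```

Sampling only for groups below the largest group size leaves the original records untouched, which keeps real and synthetic rows easy to separate in later evaluation. This sketch makes no privacy claim: a generator fitted this way can still leak information about the few records in a small group.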
Why do we need this workshop? Despite the growing interest in using synthetic data to empower ML, this agenda remains challenging because it involves multiple research fields and various industry stakeholders. Specifically, it calls for collaboration among researchers in generative models, privacy, and fairness. Existing research in generative models focuses on generating high-fidelity data, often neglecting the privacy and fairness aspects. Conversely, existing research in privacy and fairness often focuses on the discriminative setting rather than the generative setting. Finally, while generative modelling of images and tabular data has matured, the generation of time series and multi-modal data is still a vibrant area of research, especially in complex domains such as healthcare and finance. Data modalities and characteristics differ significantly across application domains and industries. It is therefore important to gather input from industry experts so that benchmarks reflect reality.
The goal of this workshop is to provide a platform for vigorous discussion among researchers in various fields of ML and industry experts, in the hope of advancing the idea of using synthetic data to empower ML research. The workshop also provides a forum for constructive debate and for identifying strengths and weaknesses relative to alternative approaches, e.g. federated learning.
Fri 6:00 a.m. - 6:15 a.m.
Opening remark

Fri 6:15 a.m. - 6:40 a.m.
Invited Talk #1, Differentially Private Learning with Margin Guarantees, Mehryar Mohri (Invited Talk)
Abstract: Preserving privacy is a crucial objective for machine learning algorithms. But, despite the remarkable theoretical and algorithmic progress in differential privacy over the last decade or more, its application to learning still faces several obstacles. A recent series of publications has shown that differentially private PAC learning of infinite hypothesis sets is not possible, even for common hypothesis sets such as that of linear functions. Another rich body of literature has studied differentially private empirical risk minimization in a constrained optimization setting and shown that the guarantees are necessarily dimension-dependent. In the unconstrained setting, dimension-independent bounds have been given, but they admit a dependency on the norm of a vector that can be extremely large, which makes them uninformative. These results raise some fundamental questions about private learning with common high-dimensional problems: is differentially private learning with favorable (dimension-independent) guarantees possible for standard hypothesis sets? This talk presents a series of new differentially private algorithms for learning linear classifiers, kernel classifiers, and neural-network classifiers with dimension-independent, confidence-margin guarantees. Joint work with Raef Bassily and Ananda Theertha Suresh.
Mehryar Mohri

Fri 6:40 a.m. - 7:20 a.m.
Privacy Panel (Discussion Panel)
Synthetic data and privacy.
Mario Fritz · Katrina Ligett · Vamsi Potluru · Shuai Tang

Fri 7:20 a.m. - 8:00 a.m.
Contributed Talks Part 1/2 (Contributed Talks)

Fri 8:00 a.m. - 8:30 a.m.
Morning Break

Fri 8:30 a.m. - 9:20 a.m.
Contributed Talks Part 2/2 (Contributed Talks)

Fri 9:20 a.m. - 9:45 a.m.
Invited Talk #2, SDMetrics: Evaluating Synthetic Data, Kalyan Veeramachaneni (Invited Talk)
Abstract: Compared to other machine learning tasks like classification and regression, synthetic data generation is a new area of inquiry for machine learning. One challenge we encountered early on in working with synthetic data was the lack of standardized metrics for evaluating it. Although evaluation for tabular synthetic data is less subjective than for ML-generated images or natural language, it comes with its own specific considerations. For instance, metrics must take into account what the data is being generated for, as well as tradeoffs between quality, privacy, and utility that are inherent to this type of data. To begin addressing this need, we created an open source library called SDMetrics, which contains a number of synthetic data evaluation tools. We identified inherent hierarchies that exist in these evaluations (for example, columnwise comparison vs. correlation matrix comparison) and built ways to test and validate these metrics. The library also provides user-friendly, focused reports and mechanisms to prevent "metric fatigue."
Kalyan Veeramachaneni

Fri 9:45 a.m. - 10:00 a.m.
Announcing the Best Paper Awards (Award)

Fri 10:00 a.m. - 11:30 a.m.
Lunch and Poster Session (Poster Session)

Fri 11:30 a.m. - 12:10 p.m.
Fairness Panel (Discussion Panel)
Synthetic data and fairness.
Freedom Gumedze · Rachel Cummings · Bo Li · Robert Tillman · Edward Choi

Fri 12:10 p.m. - 12:35 p.m.
Achievements and Challenges Part 1/2 (Invited Talk)
Progress, achievements, and challenges in synthetic data.
Dimitris Vlitas · Dino Oglic

Fri 12:35 p.m. - 1:00 p.m.
Invited Talk #3, Synthetic medical data – needed to bring ML in medicine up to speed? (Invited Talk)
Abstract: To bring medical AI to the next level, in terms of exceeding the state of the art and not just developing but also implementing novel algorithms as decision support tools in clinical practice, we need to bring data scientists, laboratory scientists and physician (researchers) together. To succeed with this, we need to bring medical data out in the open, to benefit from the best possible models being developed jointly by the data science community. This requires solving the data privacy issues, that is, creating synthetic data sets that reflect the correlations of the original data sets without jeopardizing data privacy. This talk gives a physician researcher's perspective on real clinical challenges, real data science approaches being implemented in clinical practice, regulatory issues, and real unmet needs for synthetic data sets.
Carsten Utoft Niemann

Fri 1:00 p.m. - 1:30 p.m.
Afternoon Break

Fri 1:30 p.m. - 1:55 p.m.
Invited Talk #4, The Fifth Paradigm of Scientific Discovery, Max Welling (Invited Talk)
Abstract: I will argue that we may be at the beginning of a new paradigm of scientific discovery based on deep learning combined with ab initio simulation of physical processes. We envision a system where simulations generate data to train neural surrogate models that in turn will accelerate simulations. The result will be an active learning framework where accurate data is acquired when the surrogate model is uncertain about its predictions. We will argue this hybrid approach can accelerate scientific discovery, for instance the search for new drugs and materials.
Max Welling

Fri 1:55 p.m. - 2:20 p.m.
Achievements and Challenges Part 2/2 (Invited Talk)
Progress, achievements, and challenges in synthetic data.
Zhaozhi Qian · Tucker Balch · Sergul Aydore

Fri 2:20 p.m. - 2:45 p.m.
Invited Talk #5, Privacy-Preserving Data Synthesis for General Purposes, Bo Li (Invited Talk)
Abstract: The recent success of deep neural networks (DNNs) hinges on the availability of large-scale datasets; however, training on such datasets often poses privacy risks for sensitive training information, such as face images and medical records of individuals. In this talk, I will mainly discuss how to explore the power of generative models and gradient sparsity, and talk about different scalable privacy-preserving generative models in both centralized and decentralized settings. In particular, I will introduce our recent work on large-scale privacy-preserving data generative models leveraging gradient compression with convergence guarantees. I will also introduce how to train generative models with privacy guarantees in heterogeneous environments, where data of local agents come from diverse distributions. We will finally discuss some potential applications for different privacy-preserving data synthesis strategies.
Bo Li

Fri 2:45 p.m. - 3:00 p.m.
Closing remark