Workshop
Synthetic Data Generation with Generative AI
Sergul Aydore · Zhaozhi Qian · Mihaela van der Schaar
Hall E2 (level 1)
Synthetic data (SD) is data generated by a mathematical model to solve downstream data science tasks. SD can be used to address three key problems: (1) private data release, (2) data de-biasing and fairness, and (3) data augmentation for boosting the performance of ML models. While SD offers great opportunities for these problems, SD generation is still a developing area of research, and systematic frameworks for SD deployment and evaluation are still missing. Additionally, despite the substantial advances in generative AI, the scientific community still lacks a unified understanding of how generative AI can be used to generate SD for different modalities. The goal of this workshop is to provide a platform for vigorous discussion among research communities from all these different perspectives, in the hope of advancing the use of SD for better and more trustworthy ML training. Through submissions and facilitated discussions, we aim to characterize and mitigate the common challenges of SD generation that span numerous application domains. The workshop is jointly organized by academic researchers (University of Cambridge) and industry partners from tech (Amazon AI).
Schedule
Sat 7:00 a.m. - 7:05 a.m. | Welcome and workshop overview (Talk) | Sergul Aydore
Sat 7:05 a.m. - 7:15 a.m. | Synthetic Data: Charting New Research Frontiers, Maximizing Impact, and Cultivating Collaborative Communities (Talk) | Mihaela van der Schaar
Sat 7:15 a.m. - 8:00 a.m. | Generating health records (Invited Talk) | Edward Choi
Sat 8:00 a.m. - 8:30 a.m. | Coffee Break & Poster Session
Sat 8:30 a.m. - 9:15 a.m. | Privacy and Synthetic data (Invited Talk) | Antti Honkela
Sat 9:15 a.m. - 9:45 a.m. | Differentially Private Synthetic Data via Foundation Model APIs 1: Images (Contributed Talk) | Zinan Lin
Sat 9:45 a.m. - 10:15 a.m. | Effective Data Augmentation With Diffusion Models (Contributed Talk) | Max Gurinas · Brandon Trabucco
Sat 10:15 a.m. - 11:30 a.m. | Lunch Break & Poster Session
Sat 11:30 a.m. - 12:15 p.m. | Diversity and Synthetic data (Invited Talk) | Adji Bousso Dieng
Sat 12:15 p.m. - 12:45 p.m. | Fair Wasserstein Coresets (Contributed Talk) | Vamsi Potluru
Sat 12:45 p.m. - 1:15 p.m. | Improving fairness for spoken language understanding in atypical speech with Text-to-Speech (Contributed Talk) | Venkatesh Ravichandran · Helin Wang
Sat 1:15 p.m. - 1:30 p.m. | Coffee Break & Poster Session
Sat 1:30 p.m. - 2:15 p.m. | Generative Agents: Interactive Simulacra (Invited Talk) | Michael Bernstein
Sat 2:15 p.m. - 3:00 p.m. | Panel Discussion (Panel) | Danielle Belgrave · Cem Tekin · Robert Tillman · Megan Gibbs · Dino Oglic · Rudi Agius · Panagiota Konstantinou
Size Matters: Large Graph Generation with HiGGs (Poster)
Large graphs are present in a variety of domains, including social networks, civil infrastructure, and the physical sciences, to name a few. Graph generation is similarly widespread, with applications in drug discovery, network analysis and synthetic datasets among others. While GNN (Graph Neural Network) models have been applied in these domains, their high in-memory costs restrict them to small graphs. Conversely, less costly rule-based methods struggle to reproduce complex structures. We propose HIGGS (Hierarchical Generation of Graphs) as a model-agnostic framework for producing large graphs with realistic local structures. HIGGS uses GNN models with conditional generation capabilities to sample graphs in hierarchies of resolution. As a result, HIGGS has the capacity to extend the scale of generated graphs from a given GNN model by quadratic order. As a demonstration, we implement HIGGS using DiGress, a recent graph-diffusion model, including a novel edge-predictive-diffusion variant, edge-DiGress. We use this implementation to generate categorically attributed graphs with tens of thousands of nodes. These HIGGS-generated graphs are far larger than any previously produced using GNNs. Despite this jump in scale, we demonstrate that the graphs produced by HIGGS are, on the local scale, more realistic than those from the rule-based model BTER.
Alex O. Davies · Nirav Ajmeri · Telmo Silva Filho

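The hierarchy-of-resolution idea can be illustrated with a toy two-level generator: first sample a coarse graph over communities, then expand each community into a local subgraph and wire a few edges across each coarse edge. This is only a sketch of the structure of the approach; HIGGS replaces both random samplers below with conditionally trained GNN generative models, and the sizes and probabilities here are arbitrary stand-ins.

```python
import random

def generate_hierarchical_graph(n_communities=5, nodes_per_community=20,
                                p_intra=0.3, p_coarse=0.4, inter_edges=3, seed=0):
    """Toy two-level hierarchical graph generator (random stand-in for HIGGS)."""
    rng = random.Random(seed)
    # Level 1: coarse community graph (Erdos-Renyi stand-in for a learned model).
    coarse = [(a, b) for a in range(n_communities)
              for b in range(a + 1, n_communities) if rng.random() < p_coarse]
    edges = set()
    # Level 2: expand each community into a local subgraph.
    for c in range(n_communities):
        base = c * nodes_per_community
        for i in range(nodes_per_community):
            for j in range(i + 1, nodes_per_community):
                if rng.random() < p_intra:
                    edges.add((base + i, base + j))
    # Wire a few inter-community edges along each coarse edge.
    for a, b in coarse:
        for _ in range(inter_edges):
            u = a * nodes_per_community + rng.randrange(nodes_per_community)
            v = b * nodes_per_community + rng.randrange(nodes_per_community)
            edges.add((min(u, v), max(u, v)))
    return edges

edges = generate_hierarchical_graph()
print(len(edges))
```

Because the expensive model only ever sees one community-sized subgraph at a time, the total graph can be far larger than what the model could generate in one shot.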
Generating Medical Instructions with Conditional Transformer (Poster)
Access to real-world medical instructions is essential for medical research and healthcare quality improvement. However, access to real medical instructions is often limited due to the sensitive nature of the information expressed. Additionally, manually labelling these instructions for training and fine-tuning Natural Language Processing (NLP) models can be tedious and expensive. We introduce a novel task-specific model architecture, Label-To-Text-Transformer (LT3), tailored to generate synthetic medical instructions based on provided labels, such as a vocabulary list of medications and their attributes. LT3 is trained on a vast corpus of medical instructions extracted from the MIMIC-III database, allowing the model to produce valuable synthetic medical instructions. We evaluate LT3's performance by contrasting it with a state-of-the-art Pre-trained Language Model (PLM), T5, analysing the quality and diversity of generated texts. We deploy the generated synthetic data to train the SpacyNER model for the Named Entity Recognition (NER) task over the n2c2-2018 dataset. The experiments show that the model trained on synthetic data can achieve a 96-98% F1 score at Label Recognition on Drug, Frequency, Route, Strength, and Form. LT3's code will be shared at https://github.com/HECTA-UoM/Label-To-Text-Transformer
Samuel Belkadi · Nicolo Micheletti · Lifeng Han · Warren Del-Pinto · Goran Nenadic

SciFix: Outperforming GPT3 on Scientific Factual Error Correction (Poster)
Due to the prohibitively high cost of creating error correction datasets, most Factual Claim Correction methods rely on a powerful verification model to guide the correction process. This leads to a significant drop in performance in domains like Scientific Claim Correction, where good verification models do not always exist. In this work we introduce SciFix, a claim correction system that does not require a verifier but is able to outperform existing methods by a considerable margin, achieving correction accuracy of 84% on the SciFact dataset, 77% on SciFact-Open and 72% on the CovidFact dataset, compared to next-best accuracies of 7%, 5% and 15% on the same datasets respectively. Our method leverages the power of prompting with LLMs during training to create a richly annotated dataset that can be used for fully supervised training and regularization. We additionally use a claim-aware decoding procedure to improve the quality of corrected claims. Our method outperforms the very LLM that was used to generate the annotated dataset: Few-Shot Prompting on GPT3.5 achieves 58%, 61% and 64% on the respective datasets, a consistently lower correction accuracy, despite using nearly 800 times as many parameters as our model.
Dhananjay Ashok · Atharva Kulkarni · Hai Pham · Barnabas Poczos

Knowledge-based in silico models and dataset for the comparative evaluation of mammography AI (Poster)
To generate evidence regarding the safety and efficacy of artificial intelligence (AI) enabled medical devices, AI models need to be evaluated on a diverse population of patient cases, some of which may not be readily available. We propose an evaluation approach for testing medical imaging AI models that relies on in silico imaging pipelines in which stochastic digital models of human anatomy (in object space) with and without pathology are imaged using a digital replica imaging acquisition system to generate realistic synthetic image datasets. Here, we release M-SYNTH, a dataset of cohorts with four breast fibroglandular density distributions imaged at different exposure levels using Monte Carlo x-ray simulations with the publicly available Virtual Imaging Clinical Trial for Regulatory Evaluation (VICTRE) toolkit. We utilize the synthetic dataset to analyze AI model performance and find that model performance decreases with increasing breast density and increases with higher mass density, as expected. As exposure levels decrease, AI model performance drops, with the highest performance achieved at exposure levels lower than the nominal recommended dose for the breast type.
Elena Sizikova · Niloufar Saharkhiz · Diksha Sharma · Miguel Lago · Berkman Sahiner · Jana Delfino · Aldo Badano

Knowledge-Infused Prompting Improves Clinical Text Generation with Large Language Models (Poster)
Clinical natural language processing requires methods that can address domain-specific challenges, such as complex medical terminology and clinical contexts. Recently, large language models (LLMs) have shown promise in this domain. Yet, their direct deployment can lead to privacy issues, and they are constrained by resources. To address this challenge, we propose ClinGen, which infuses knowledge into synthetic clinical text generation using LLMs for clinical NLP tasks. Our model involves clinical knowledge extraction and context-informed LLM prompting. Both clinical topics and writing styles are drawn from external domain-specific knowledge graphs and LLMs to guide data generation. Extensive studies across 7 clinical NLP tasks and 16 datasets reveal that ClinGen consistently enhances performance across various tasks, effectively aligning the distribution of real datasets and enriching the diversity of generated training instances.
Ran Xu · Hejie Cui · Yue Yu · Xuan Kan · Wenqi Shi · Yuchen Zhuang · Wei Jin · Joyce Ho · Carl Yang

Improving Code Style for Accurate Code Generation (Poster)
Natural language to code generation is an important application area of LLMs and has received wide attention from the community. The majority of relevant studies have exclusively concentrated on increasing the quantity and functional correctness of training sets while disregarding other stylistic elements of programs. More recently, data quality has garnered a lot of interest, and multiple works have showcased its importance for improving performance. In this work, we investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system. We build a novel data-cleaning pipeline that uses these principles to transform existing programs by 1) renaming variables, 2) modularizing and decomposing complex code into smaller helper sub-functions, and 3) inserting natural-language-based planning annotations. We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B on our transformed programs improves the performance by up to 30% compared to fine-tuning on the original dataset. Additionally, we demonstrate improved performance from using a smaller amount of higher-quality data, finding that a model fine-tuned on the entire original dataset is outperformed by a model trained on one-eighth of our cleaned dataset. Even in comparison to closed-source models, our models outperform the much larger AlphaCode models.
Naman Jain · Tianjun Zhang · Wei-Lin Chiang · Joseph Gonzalez · Koushik Sen · Ion Stoica

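One ingredient of such a pipeline, variable renaming, can be sketched with Python's `ast` module. The renaming map below is written by hand for illustration; in the paper, names (along with modularization and planning annotations) are proposed by an LLM.

```python
import ast

class RenameVariables(ast.NodeTransformer):
    """Rewrite variable and parameter names according to a mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):  # loads and stores of local variables
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):   # function parameters
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

source = """
def f(a, n):
    s = 0
    for x in a:
        s += x
    return s / n
"""
# Hypothetical LLM-proposed descriptive names for this snippet.
mapping = {"a": "values", "n": "count", "s": "total", "x": "value"}
cleaned = ast.unparse(RenameVariables(mapping).visit(ast.parse(source)))
print(cleaned)
```

Because the transformation operates on the AST, the rewritten program is guaranteed to stay syntactically valid and behaviorally identical, which is what makes it safe to apply at training-set scale.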
GeMQuAD: Generating Multilingual Question Answering Datasets from Large Language Models using Few Shot Learning (Poster)
The emergence of Large Language Models (LLMs) with capabilities like In-Context Learning (ICL) has ushered in new possibilities for data generation across various domains while minimizing the need for extensive data collection and modeling techniques. Researchers have explored ways to use this generated synthetic data to optimize smaller student models for reduced deployment costs and lower latency in downstream tasks. However, ICL-generated data often suffers from low quality, as task specificity is limited by the few examples used in ICL. In this paper, we propose GeMQuAD, a semi-supervised learning approach extending the WeakDAP framework, applied to a dataset generated through ICL with just one example in the target language using the AlexaTM 20B Seq2Seq LLM. Through our approach, we iteratively identify high-quality data to enhance model performance, especially for low-resource multilingual settings in the context of the Extractive Question Answering task. Our framework surpasses the performance of a baseline model trained on an English-only dataset by 5.05/6.50 points in F1/Exact Match (EM) for Hindi and by 3.81/3.69 points in F1/EM for Spanish on the MLQA dataset. Notably, our approach uses a pre-trained LLM with no additional fine-tuning, using only one annotated example in ICL to generate data, keeping the development process cost-effective.
Amani Namboori · Shivam Mangale · Andy Rosenbaum · Saleh Soltan

EDGE++: Improved Training and Sampling of EDGE (Poster)
Traditional graph-generative models like the Stochastic-Block Model (SBM) fall short in capturing complex structures inherent in large graphs. Recently developed deep learning models like NetGAN, CELL, and Variational Graph Autoencoders have made progress but face limitations in replicating key graph statistics. Diffusion-based methods such as EDGE have emerged as promising alternatives; however, they present challenges in computational efficiency and generative performance. In this paper, we propose enhancements to the EDGE model to address these issues. Specifically, we introduce a degree-specific noise schedule that optimizes the number of active nodes at each timestep, significantly reducing memory consumption. Additionally, we present an improved sampling scheme that fine-tunes the generative process, allowing for better control over the similarity between the synthesized and the true network. Our experimental results demonstrate that the proposed modifications not only improve the efficiency but also enhance the accuracy of the generated graphs, offering a robust and scalable solution for graph generation tasks.
Xiaohui Chen · Mingyang Wu · Liping Liu

Conditional Generative Modeling for High-dimensional Marked Temporal Point Processes (Poster)
Recent advancements in generative modeling have made it possible to generate high-quality content from context information, but a key question remains: how to teach models to know when to generate content? To answer this question, this study proposes a novel event generative model that draws its statistical intuition from marked temporal point processes, and offers a clean, flexible, and computationally efficient solution for a wide range of applications involving the generation of asynchronous events with high-dimensional marks. We use a conditional generator that takes the history of events as input and generates the high-quality subsequent event that is likely to occur given the prior observations. The proposed framework offers a host of benefits, including considerable representational power to capture intricate dynamics in multi- or even high-dimensional event space, as well as exceptional efficiency in learning the model and generating samples. Our numerical results demonstrate superior performance compared to other state-of-the-art baselines.
Zheng Dong · Zekai Fan · Shixiang Zhu

Synthetic Data Generation for Scarce Road Scene Detection Scenarios (Poster)
Recent advancements in generative models have led to significant improvements in the quality of generated images, making them virtually indistinguishable from real ones. However, using AI-generated images for training robust computer vision models for real-world applications, especially object detection in road scene perception, is still a challenge. AI-generated images usually lack the required diversity and scene complexity, particularly for objects that appear with critically low frequency in the available real datasets. An example of such applications is the detection of emergency vehicles like police cars, fire trucks, and ambulances in road scenes. These vehicles appear with drastically low frequency in available datasets. Successfully generating synthetic images of road scenes that include these types of vehicles and using them in training downstream models would prove useful for autonomous driving vehicles, mitigating safety concerns on the road. To address this, this paper proposes a new approach for synthetically generating diverse, complex, and domain-compatible images of emergency vehicles in road scenes by employing a diffusion-based generative model pretrained on a generic dataset. We investigate the impact of using generated synthetic images on the performance of downstream object detection models. Finally, we thoroughly discuss the challenges of generating synthetic datasets with the proposed approach.
Dipika Khullar · Yash Shah · Ninadkulamz · Negin Sokhandan

Stable Diffusion For Aerial Object Detection (Poster)
Aerial object detection is a challenging task, in which one major obstacle lies in the limitations of large-scale data collection and the long-tail distribution of certain classes. Synthetic data offers a promising solution, especially with recent advances in diffusion-based methods like stable diffusion (SD). However, the direct application of diffusion methods to aerial domains poses unique challenges: stable diffusion's optimization for rich ground-level semantics doesn't align with the sparse nature of aerial objects, and the extraction of post-synthesis object coordinates remains problematic. To address these challenges, we introduce a synthetic data augmentation framework tailored for aerial images. It encompasses sparse-to-dense region of interest (ROI) extraction to bridge the semantic gap, fine-tuning the diffusion model with low-rank adaptation (LORA) to circumvent exhaustive retraining, and finally, a Copy-Paste method to compose synthesized objects with backgrounds, providing a nuanced approach to aerial object detection through synthetic data. Code will be released at https://github.com/Anonymous
Yanan Jian · FUXUN YU · Simranjit Singh · Dimitrios Stamoulis

INTAGS: Interactive Agent-Guided Simulation (Poster)
The development of a realistic agent-based simulator (ABS) remains a challenging task, mainly due to the sequential and dynamic nature of such a multi-agent system (MAS). To fill this gap, this work proposes a metric to distinguish between real and synthetic multi-agent systems. The metric evaluation depends on the live interaction between the experimental (Exp) autonomous agent and background (BG) agent(s), explicitly accounting for the systems' sequential and dynamic nature. Specifically, we propose to characterize the system/environment by studying the effect of a sequence of BG agents' responses to the environment state evolution, and we take such effects' differences as the MAS distance metric. The effect estimation is cast as a causal inference problem, since the environment evolution is confounded with the previous environment state. Importantly, we propose the Interactive Agent-Guided Simulation (INTAGS) framework to build a realistic simulator by optimizing over this novel metric. To adapt to any environment with interactive sequential decision-making agents, INTAGS formulates the simulator as a stochastic policy in reinforcement learning. Moreover, INTAGS utilizes the policy gradient update to bypass differentiating the proposed metric, so that it can support non-differentiable operations of multi-agent environments. Through extensive experiments, we demonstrate the effectiveness of INTAGS on an equity stock market simulation example.
Song Wei · Andrea Coletta · Svitlana Vyetrenko · Tucker Balch

CALICO: Conversational Agent Localization via Synthetic Data Generation (Poster)
We present CALICO, a method to fine-tune Large Language Models (LLMs) to localize conversational agent training data from one language to another. For named entities, CALICO supports three operations: verbatim copy, literal translation, and localization, i.e. generating entity values more appropriate in the target language, such as city and airport names located in countries where the language is spoken. To prove the effectiveness of CALICO, we build and release a new human-localized (HL) version of the MultiATIS++ travel information test set in 6 languages. Compared to the original human-translated (HT) version of the test set, we show that our new HL version is more challenging. We also show that CALICO outperforms the state-of-the-art LINGUIST (which relies on literal slot translation out of context) both on the HT case, where CALICO generates more accurate slot translations, and on the HL case, where CALICO generates localized entities which are closer to the HL test set.
Andy Rosenbaum · Ershad Banijamali · Christopher DiPersio · Pegah Kharazmi · Pan Wei · Lu Zeng · Gokmen Oz · Wael Hamza · Clement Chung · Karolina Owczarzak · Fabian Triefenbach

Improving fairness for spoken language understanding in atypical speech with Text-to-Speech (Oral)
Spoken language understanding (SLU) systems often exhibit suboptimal performance in processing atypical speech, typically caused by neurological conditions and motor impairments. Recent advancements in Text-to-Speech (TTS) synthesis-based augmentation for more fair SLU have struggled to accurately capture the unique vocal characteristics of atypical speakers, largely due to insufficient data. To address this issue, we present a novel data augmentation method for atypical speakers by finetuning a TTS model, called Aty-TTS. Aty-TTS models speaker and atypical characteristics via knowledge transferring from a voice conversion model. Then, we use the augmented data to train SLU models adapted to atypical speech. To train these data augmentation models and evaluate the resulting SLU systems, we have collected a new atypical speech dataset containing intent annotation. Both objective and subjective assessments validate that Aty-TTS is capable of generating high-quality atypical speech. Furthermore, it serves as an effective data augmentation strategy, contributing to more fair SLU systems that can better accommodate individuals with atypical speech patterns.
Helin Wang · Venkatesh Ravichandran · Milind Rao · Becky Lammers · Myra J. Sydnor · Nicholas Maragakis · Ankur Butala · Jayne Zhang · Lora Clawson · Victoria Chovaz · Laureano Moro-Velazquez

Generating Privacy-Preserving Longitudinal Synthetic Data (Poster)
Before synthetic data (SD) generators are able to generate entire electronic health records, many challenges still have to be tackled. One of these challenges is to generate SD that is both privacy-preserving and longitudinal. This research combines the research streams of longitudinal SD and privacy-preserving static SD and presents a novel GAN architecture called Time-ADS-GAN. Time-ADS-GAN outperforms current state-of-the-art models on both utility and privacy on three datasets, and is able to reproduce the results of a healthcare study significantly better than TimeGAN. As a second contribution, a variation of the ε-identifiability metric is introduced and used in the analysis.
Robin van Hoorn

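For context, a common baseline formulation of the ε-identifiability idea (not the paper's specific variation) asks what fraction of real records lie closer to some synthetic record than to their nearest real neighbour. A minimal sketch, assuming purely numeric records:

```python
import numpy as np

def identifiability_score(real, synth):
    """Fraction of real records closer to a synthetic record than to any
    other real record. High values suggest synthetic records may single
    out real individuals; this is a baseline sketch, not the paper's
    exact variation of the metric."""
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    # Real-to-real distances, excluding each record's distance to itself.
    rr = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=-1)
    np.fill_diagonal(rr, np.inf)
    # Real-to-synthetic distances.
    rs = np.linalg.norm(real[:, None, :] - synth[None, :, :], axis=-1)
    return float(np.mean(rs.min(axis=1) < rr.min(axis=1)))

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 5))
far_synth = real + 10.0   # synthetic records far from every real record
near_copies = real + 1e-6 # near-copies of real records: a privacy red flag
print(identifiability_score(real, far_synth),
      identifiability_score(real, near_copies))
```

A score near 1 for the near-copies and near 0 for the distant samples is exactly the behaviour such a metric should exhibit.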
AutoDiff: combining Auto-encoder and Diffusion model for tabular data synthesizing (Poster)
Diffusion models have become a main paradigm for synthetic data generation in many subfields of modern machine learning, including computer vision, language modeling, and speech synthesis. In this paper, we leverage the power of diffusion models for generating synthetic tabular data. The heterogeneous features in tabular data have been a main obstacle to tabular data synthesis, and we tackle this problem by employing an auto-encoder architecture. When compared with state-of-the-art tabular synthesizers, the synthetic tables produced by our model show good statistical fidelity to the real data and perform well in downstream machine learning tasks. We conducted experiments over 15 publicly available datasets. Notably, our model adeptly captures the correlations among features, which has been a long-standing challenge in tabular data synthesis. Our code is available upon request and will be publicly released if the paper is accepted.
Namjoon Suh · Xiaofeng Lin · Din-Yin Hsieh · Mehrdad Honarkhah · Guang Cheng

Towards Effective Synthetic Data Sampling for Domain Adaptive Pose Estimation (Poster)
In this paper, we investigate a synthetic data sampling approach towards unsupervised domain adaptation (UDA) for pose estimation. UDA is characterized by a labeled source domain and an unlabeled target domain. We observe that recent work in UDA for pose estimation fails to generalize across poses in target data, despite having support for such poses in the source data. We hypothesize that this failure to generalize is due to a lack of uniform support across poses of varying complexity in the source domain. Motivated by this challenge, we aim to sample and train with the source domain data to improve the domain adaptation performance on a target domain. The proposed sampling strategy sorts the source domain samples based on a difficulty score, which reflects the lack of uniform support across varying pose complexity in the source domain. The difficulty score is a reconstruction error obtained from training an auto-encoder on the source domain poses. We categorize the dataset into closely related groups using this score. Selectively training from all or some of these groups helps us to better utilize the source pose distribution. Finally, current pose estimation evaluation metrics do not effectively measure the ability of the model to learn the geometry of pose. We evaluate our approach qualitatively and quantitatively on benchmark datasets. Our sampling strategy outperforms the existing state-of-the-art for domain adaptation.
Isha Dua · Arjun Sharma · Shuaib Ahmed · Rahul Tallamraju

Fair Wasserstein Coresets (Oral)
Recent technological advancements have given rise to the ability to collect vast amounts of data, which often exceed the capacity of commonly used machine learning algorithms. Approaches such as coresets and synthetic data distillation have emerged as frameworks to generate a smaller, yet representative, set of samples for downstream training. As machine learning is increasingly applied to decision-making processes, it becomes imperative for modelers to consider and address biases in the data concerning subgroups defined by factors like race, gender, or other sensitive attributes. Current approaches focus on creating fair synthetic representative samples by optimizing local properties relative to the original samples. These methods, however, are not guaranteed to positively affect the performance or fairness of downstream learning processes. In this work, we present Fair Wasserstein Coresets (FWC), a novel coreset approach which generates fair synthetic representative samples along with sample-level weights to be used in downstream learning tasks. FWC aims to minimize the Wasserstein distance between the original datasets and the weighted synthetic samples while enforcing (an empirical version of) demographic parity, a prominent criterion for algorithmic fairness, via a linear constraint. We show that FWC can be thought of as a constrained version of Lloyd's algorithm for k-medians or k-means clustering. Our experiments, conducted on both synthetic and real datasets, demonstrate the scalability of our approach and highlight the competitive performance of FWC compared to existing fair clustering approaches, even when attempting to enhance the fairness of the latter through fair pre-processing techniques.
Zikai Xiong · Niccolo Dalmasso · Vamsi Potluru · Tucker Balch · Manuela Veloso

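The fairness side of the objective is the empirical demographic parity of the weighted sample: the gap between groups in the weighted rate of positive labels. A minimal numeric illustration of that quantity (not the paper's full optimization, whose coreset points and weights are learned jointly):

```python
import numpy as np

def weighted_demographic_parity_gap(labels, groups, weights):
    """Gap between groups in the weighted positive-label rate.
    A gap of 0 means the weighted sample satisfies empirical
    demographic parity."""
    labels = np.asarray(labels, dtype=float)
    groups = np.asarray(groups)
    weights = np.asarray(weights, dtype=float)
    rates = []
    for g in np.unique(groups):
        mask = groups == g
        rates.append(np.average(labels[mask], weights=weights[mask]))
    return float(max(rates) - min(rates))

labels  = [1, 1, 0, 0, 1, 0, 1, 0]
groups  = [0, 0, 0, 0, 1, 1, 1, 1]
uniform = [1, 1, 1, 1, 1, 1, 1, 1]
skewed  = [2, 2, 1, 1, 1, 1, 1, 1]  # up-weighting positives in group 0
print(weighted_demographic_parity_gap(labels, groups, uniform))  # 0.0
print(weighted_demographic_parity_gap(labels, groups, skewed))
```

Sample-level weights are exactly what makes the constraint enforceable: reweighting changes each group's effective positive rate without touching the samples themselves.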
Effective Data Augmentation With Diffusion Models (Oral)
Data augmentation is one of the most prevalent tools in deep learning, underpinning many recent advances, including those from classification, generative models, and representation learning. The standard approach to data augmentation combines simple transformations like rotations and flips to generate new images from existing ones. However, these new images lack diversity along key semantic axes present in the data. Current augmentations cannot alter the high-level semantic attributes, such as animal species present in a scene, to enhance the diversity of data. We address the lack of diversity in data augmentation with image-to-image transformations parameterized by pre-trained text-to-image diffusion models. Our method edits images to change their semantics using an off-the-shelf diffusion model, and generalizes to novel visual concepts from a few labelled examples. We evaluate our approach on few-shot image classification tasks, and on a real-world weed recognition task, and observe an improvement in accuracy in tested domains.
Brandon Trabucco · Kyle Doherty · Max Gurinas · Russ Salakhutdinov

Continuous Diffusion for Mixed-Type Tabular Data (Poster)
Score-based generative models or diffusion models have proven successful across many domains in generating texts and images. However, the consideration of mixed-type tabular data with this model family has fallen short so far. Existing research mainly combines continuous and categorical diffusion processes and does not explicitly account for the feature heterogeneity inherent to tabular data. In this paper, we combine score matching and score interpolation to ensure a common type of continuous noise distribution that affects both continuous and categorical features. Further, we investigate the impact of distinct noise schedules per feature or per data type. We allow for adaptive, learnable noise schedules to ensure optimally allocated model capacity and balanced generative capability. Results show that our model outperforms the benchmark models consistently and that accounting for heterogeneity within the noise schedule design boosts sample quality.
Markus Mueller · Kathrin Gruber · Dennis Fok

-
|
Balancing the Picture: Debiasing Vision-Language Datasets with Synthetic Contrast Sets
(
Poster
)
>
link
Vision-language models are growing in popularity and public visibility to generate, edit, and caption images at scale; but their outputs can perpetuate and amplify societal biases learned during pre-training on uncurated image-text pairs from the internet. Although debiasing methods have been proposed, we argue that these measurements of model bias lack validity due to dataset bias. We demonstrate there are spurious correlations in COCO Captions, the most commonly used dataset for evaluating bias, between background context and the gender of people in-situ. This is problematic because commonly-used bias metrics (such as Bias@K) rely on per-gender base rates. To address this issue, we propose a novel dataset debiasing pipeline to augment the COCO dataset with synthetic, gender-balanced contrast sets, where only the gender of the subject is edited and the background is fixed. As existing image editing methods have limitations and sometimes produce low-quality images; we introduce a method to automatically filter the generated images based on their similarity to real images. Using our balanced synthetic contrast sets, we benchmark bias in multiple CLIP-based models, demonstrating how metrics are skewed by imbalance in the original COCO images. Our results indicate that the proposed approach improves the validity of the evaluation, ultimately contributing to more realistic understanding of bias in CLIP. |
Brandon Smith · Miguel Farinha · Siobhan Mackenzie Hall · Hannah Rose Kirk · Aleksandar Shtedritski · Max Bain 🔗 |
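The filtering step described in this abstract can be sketched as a similarity gate. This is a minimal illustration, not the paper's actual pipeline: the embeddings are hypothetical stand-ins for image features (e.g., CLIP), and the threshold is arbitrary.

```python
import numpy as np

def filter_contrast_set(real_emb, edited_emb, threshold=0.8):
    # Keep only edited images whose embedding is still close to the original
    # real image; low-similarity edits are treated as low-quality and dropped.
    kept = []
    for i, (r, e) in enumerate(zip(real_emb, edited_emb)):
        sim = float(r @ e) / (np.linalg.norm(r) * np.linalg.norm(e))
        if sim >= threshold:
            kept.append(i)
    return kept

real = np.array([[1.0, 0.0], [0.0, 1.0]])
edited = np.array([[0.9, 0.1], [1.0, 0.0]])   # second edit drifted badly
kept = filter_contrast_set(real, edited)
```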
-
|
Harnessing Synthetic Datasets: The Role of Shape Bias in Deep Neural Network Generalization
(
Poster
)
>
link
Recent advancements in deep learning have been primarily driven by the use of large models trained on increasingly vast datasets. While neural scaling laws have emerged to predict network performance given a specific level of computational resources, the growing demand for expansive datasets raises concerns. To address this, a new research direction has emerged, focusing on the creation of synthetic data as a substitute. In this study, we investigate how neural networks exhibit shape bias during training on synthetic datasets, serving as an indicator of the synthetic data quality. Specifically, our findings indicate three key points: (1) Shape bias varies across network architectures and types of supervision, casting doubt on its reliability as a predictor for generalization and its ability to explain differences in model recognition compared to human capabilities. (2) Relying solely on shape bias to estimate generalization is unreliable, as it is entangled with diversity and naturalism. (3) We propose a novel interpretation of shape bias as a tool for estimating the diversity of samples within a dataset. Our research aims to clarify the implications of using synthetic data and its associated shape bias in deep learning, addressing concerns regarding generalization and dataset quality. |
Elior Benarous · Sotiris Anagnostidis · Luca Biggio · Thomas Hofmann 🔗 |
-
|
Carpe Diem: On the Evaluation of World Knowledge in Lifelong Language Models
(
Oral
)
>
link
In an ever-evolving world, the dynamic nature of knowledge presents challenges for language models that are trained on static data, leading to outdated encoded information. However, real-world scenarios require models not only to acquire new knowledge but also to overwrite outdated information with updated knowledge. Addressing this, we introduce the temporally evolving question answering benchmark, EvolvingQA - a novel benchmark designed for training and evaluating LMs on an evolving Wikipedia database, where the construction of our benchmark is automated with our pipeline using large language models. Our benchmark incorporates question-answering as a downstream task to emulate real-world applications. Through EvolvingQA, we uncover that existing continual learning baselines have difficulty updating and forgetting outdated knowledge. Our findings suggest that the models fail to learn properly when acquiring updated knowledge due to small weight gradients. Furthermore, we elucidate that the models struggle mostly with providing numerical or temporal answers to questions asking for updated knowledge. Our work aims to model the dynamic nature of real-world information, offering a robust measure for the evolution-adaptability of language models. Our data construction code and dataset files are available at https://anonymous.4open.science/r/EvolvingQA/. |
Yujin Kim · Jaehong Yoon · Seonghyeon Ye · Sung Ju Hwang · Se-Young Yun 🔗 |
-
|
Learning to Place Objects into Scenes by Hallucinating Scenes around Objects
(
Poster
)
>
link
The ability to modify images to add new objects into a scene stands to be a powerful image editing control, but is currently not robustly supported by existing diffusion-based image editing methods. We design a two-step method for inserting objects of a given class into images that first predicts where the object is likely to go in the image and, then, realistically inpaints the object at this location. The central challenge of our approach is predicting where an object should go in a scene, given only an image of the scene. We learn a prediction model entirely from synthetic data by using diffusion-based image outpainting to hallucinate novel images of scenes surrounding a given object. We demonstrate that this weakly supervised approach, which requires no human labels at all, is able to generate more realistic object addition image edits than prior text-controlled diffusion-based approaches. We also demonstrate that, for a limited set of object categories, our learned object placement prediction model, despite being trained entirely on generated data, makes more accurate object placements than prior state-of-the-art models for object placement that were trained on a large, manually annotated dataset. |
Lu Yuan · James Hong · Vishnu Sarukkai · Kayvon Fatahalian 🔗 |
-
|
Evaluating VLMs for Property-Specific Annotation of 3D Objects
(
Poster
)
>
link
3D objects, which often lack clean text descriptions, present an opportunity to evaluate pretrained vision language models (VLMs) on a range of annotation tasks---from describing object semantics to physical properties. An accurate response must take into account the full appearance of the object in 3D, various ways of phrasing the question/prompt, and changes in other factors that affect the response. We present a method to marginalize over arbitrary factors varied across VLM queries, which relies on the VLM's scores for sampled responses. We first show that this aggregation method can outperform a language model (e.g., GPT4) for summarization, for instance avoiding hallucinations when there are contrasting details between responses. Secondly, we show that aggregated annotations are useful for prompt-chaining; they help improve downstream VLM predictions (e.g., of object material when the object's type is specified as an auxiliary input in the prompt). Such auxiliary inputs allow ablating and measuring the contribution of visual reasoning over language-only reasoning. Using these evaluations, we show that VLMs approach the quality of human-verified annotations on both type and material inference on the large-scale Objaverse dataset. |
Rishabh Kabra · Loic Matthey · Alexander Lerchner · Niloy Mitra 🔗 |
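The score-based aggregation described above can be sketched as a softmax-weighted vote over responses pooled across prompt variants. This is a simplified illustration under stated assumptions: the scores stand in for VLM log-probabilities of sampled responses, and the example queries and answers are invented.

```python
import numpy as np

def aggregate(responses, scores):
    # Marginalize over query variations: softmax-normalize the VLM's scores
    # within each query, then sum probability mass per distinct answer.
    totals = {}
    for resp_list, score_list in zip(responses, scores):
        s = np.asarray(score_list, dtype=float)
        p = np.exp(s - s.max())
        p /= p.sum()
        for r, w in zip(resp_list, p):
            totals[r] = totals.get(r, 0.0) + w
    return max(totals, key=totals.get)

# three phrasings of "what material is this object?", with per-response scores
best = aggregate(
    [["metal", "wood"], ["metal", "plastic"], ["wood", "metal"]],
    [[2.0, 0.0], [1.0, 1.0], [0.5, 0.3]],
)
```

Pooling normalized scores rather than taking a majority vote lets a confident answer in one query outweigh weak disagreement in another.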
-
|
Strong statistical parity through fair synthetic data
(
Poster
)
>
link
AI-generated synthetic data, in addition to protecting the privacy of original data sets, allows users and data consumers to tailor data to their needs. This paper explores the creation of synthetic data that embodies Fairness by Design, focusing on the statistical parity fairness definition. By equalizing the learned target probability distributions of the synthetic data generator across sensitive attributes, a downstream model trained on such synthetic data provides fair predictions across all thresholds - that is, strongly fair predictions even when inferring from biased original data. This fairness adjustment can either be directly integrated into the sampling process of a synthetic generator or added as a post-processing step. This flexibility allows data consumers to create fair synthetic data and fine-tune the trade-off between accuracy and fairness without any prior assumptions on the data and without re-training the synthetic data generator. |
Ivona Krchova · Michael Platzer · Paul Tiwald 🔗 |
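The post-processing variant can be sketched as shifting each sensitive group's target probabilities toward a common rate. This is a deliberate simplification of the paper's approach: matching group means only equalizes one statistic, whereas equalizing the full learned distributions is what yields parity at every decision threshold. All data below are invented.

```python
import numpy as np

def parity_adjust(p, group):
    # Post-processing sketch: shift each sensitive group's target
    # probabilities so all groups share the overall mean positive rate.
    p = np.asarray(p, dtype=float)
    out = p.copy()
    overall = p.mean()
    for g in np.unique(group):
        mask = group == g
        out[mask] = np.clip(p[mask] + (overall - p[mask].mean()), 0.0, 1.0)
    return out

# biased generator output: group 0 receives positives far more often
p = np.array([0.90, 0.80, 0.85, 0.20, 0.10, 0.15])
group = np.array([0, 0, 0, 1, 1, 1])
p_fair = parity_adjust(p, group)
```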
-
|
On the Limitation of Diffusion Models for Synthesizing Training Datasets
(
Poster
)
>
link
Synthetic samples from diffusion models are a promising substitute for real training datasets when training discriminative models. However, we found that such synthetic datasets degrade classification performance relative to real datasets even when using state-of-the-art diffusion models. This means that modern diffusion models do not perfectly represent the data distribution for the purpose of replicating datasets for training discriminative tasks. This paper investigates the gap between synthetic and real samples by analyzing synthetic samples reconstructed from real samples through the diffusion and reverse processes. By varying the time step at which the reverse process starts in the reconstruction, we can control the trade-off between the information in the original real data and the information added by diffusion models. Through assessing the reconstructed samples and trained models, we found that the synthetic data concentrate in modes of the training data distribution as the reverse step increases, and thus struggle to cover the outer edges of the distribution. Our findings imply that modern diffusion models are insufficient to replicate the training data distribution perfectly, and there is room for improvement in generative modeling for the replication of training datasets. |
Shin'ya Yamaguchi · Takuma Fukuda 🔗 |
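The reconstruction knob described above can be illustrated with a toy 1-D sketch. This is not the paper's method: the "denoiser" below is a hypothetical stand-in that simply contracts toward the training-data mode, mimicking the reported effect that a larger starting step keeps less of the original sample and concentrates reconstructions near the modes.

```python
import numpy as np

rng = np.random.default_rng(0)

def reconstruct(x0, t, mode, n_steps=1000):
    # Diffuse a real sample forward to step t, then run a stand-in reverse
    # process; a real model would run the learned reverse diffusion.
    alpha = 1.0 - t / n_steps                          # toy noise schedule
    x_t = np.sqrt(alpha) * x0 + np.sqrt(1.0 - alpha) * rng.standard_normal(x0.shape)
    w = t / n_steps                                    # share of model-added info
    return (1.0 - w) * x_t + w * mode

x0 = np.array([3.0])       # a real sample near the edge of the distribution
mode = np.array([0.0])     # mode of the (toy) training distribution
near = reconstruct(x0, t=100, mode=mode)   # mostly original information
far = reconstruct(x0, t=900, mode=mode)    # mostly model-added information
```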
-
|
STAR: Improving Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models
(
Poster
)
>
link
Information extraction tasks such as event extraction require an in-depth understanding of the output structure and sub-task dependencies. They heavily rely on task-specific training data in the form of (passage, target structure) pairs to obtain reasonable performance. However, obtaining such data through human annotation is costly, leading to a pressing need for low-resource information extraction approaches that require minimal human labeling for real-world applications. Fine-tuning supervised models with synthesized training data would be a generalizable method, but existing data generation methods either still rely on large-scale ground-truth data or cannot be applied to complicated IE tasks due to their poor performance. To address these challenges, we propose STAR, a data generation method that leverages Large Language Models (LLMs) to synthesize data instances given limited seed demonstrations, thereby boosting low-resource information extraction performance. Our approach involves generating target structures (Y) followed by generating passages (X), all accomplished with the aid of LLMs. We design fine-grained step-by-step instructions to obtain the initial data instances. We further reduce errors and improve data quality through self-reflective error identification and self-refinement with iterative revision. Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks, even surpassing the effectiveness of human-curated data. Human assessment of the data quality shows that STAR-generated data exhibit higher passage quality and align better with the task definitions than human-curated data. |
Mingyu Derek Ma · Xiaoxuan Wang · Po-Nien Kung · P. Jeffrey Brantingham · Nanyun Peng · Wei Wang 🔗 |
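The structure-first pipeline (generate Y, then X, then self-refine) can be sketched as a loop over LLM calls. Everything here is a hypothetical illustration: the prompts are invented, and `toy_llm` is a trivial stand-in for a real LLM API, not STAR's actual instructions.

```python
def star_generate(seed_structures, llm, n_rounds=2):
    # STAR-style sketch: a passage X is generated conditioned on a target
    # structure Y, then iteratively revised until the model reports no errors.
    # `llm` is a hypothetical callable mapping a prompt string to a completion.
    instances = []
    for y in seed_structures:
        x = llm(f"Write a passage expressing the event structure: {y}")
        for _ in range(n_rounds):
            errors = llm(f"List errors in the passage given the structure {y}: {x}")
            if errors.strip().lower() == "none":
                break
            x = llm(f"Revise the passage to fix {errors}: {x}")
        instances.append((x, y))
    return instances

# toy stand-in for a real LLM API (hypothetical; always approves the draft)
def toy_llm(prompt):
    if prompt.startswith("List"):
        return "none"
    return "PASSAGE[" + prompt.split(": ", 1)[1] + "]"

instances = star_generate(["Attack(agent=PersonA, place=CityB)"], toy_llm)
```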
-
|
Feedback-guided Data Synthesis for Imbalanced Classification
(
Poster
)
>
link
The current status quo in machine learning is to use static datasets of real images for training, which often come from long-tailed distributions. With recent advances in generative models, researchers have started augmenting these static datasets with synthetic data, reporting moderate performance improvements on classification tasks. We hypothesize that these performance gains are limited by the lack of feedback from the classifier to the generative model, which would promote the usefulness of the generated samples for improving the classifier's performance. In this work, we introduce a framework for augmenting static datasets with useful synthetic samples, which leverages one-shot feedback from the classifier to drive the sampling of the generative model. In order for the framework to be effective, we find that the samples must be close to the support of the real data of the task at hand and be sufficiently diverse. We validate three feedback criteria on a long-tailed dataset (ImageNet-LT) as well as a group-imbalanced dataset (NICO++). On ImageNet-LT, we achieve state-of-the-art results, with over 4% improvement on underrepresented classes while being twice as efficient in terms of the number of generated synthetic samples. NICO++ also enjoys marked boosts of over 5% in worst-group accuracy. With these results, our framework paves the path towards effectively leveraging state-of-the-art text-to-image models as data sources that can be queried to improve downstream applications. |
Reyhane Askari Hemmat · Mohammad Pezeshki · Florian Bordes · Michal Drozdzal · Adriana Romero 🔗 |
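The two requirements above (closeness to the real-data support, plus diversity) can be sketched as a selection rule over generated samples. This is an illustrative simplification, not the paper's three criteria: the loss, realness, and diversity-bucket scores are hypothetical stand-ins for model outputs.

```python
import numpy as np

def select_useful(loss, realness, bucket, k):
    # One-shot feedback selection sketch: prefer generated samples the current
    # classifier gets wrong (high loss), keep only samples close to the real
    # data support (realness gate), and take at most one per diversity bucket.
    chosen, seen = [], set()
    for i in np.argsort(-np.asarray(loss)):
        if realness[i] < 0.5 or bucket[i] in seen:
            continue
        chosen.append(int(i))
        seen.add(bucket[i])
        if len(chosen) == k:
            break
    return chosen

chosen = select_useful(
    loss=[0.9, 0.8, 0.1, 0.95],
    realness=[0.9, 0.2, 0.9, 0.9],
    bucket=[0, 0, 1, 0],
    k=2,
)
```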
-
|
Synthetic Imitation Edit Feedback for Factual Alignment in Clinical Summarization
(
Poster
)
>
link
Large Language Models (LLMs) like the GPT and LLaMA families have demonstrated exceptional capabilities in capturing and condensing critical contextual information and achieving state-of-the-art performance in the summarization task. However, community concerns about these models' hallucination issues continue to rise. LLMs sometimes generate factually hallucinated summaries, which can be extremely harmful in clinical domain NLP tasks (e.g., clinical note summarization), where factually incorrect statements can lead to critically erroneous diagnoses. Fine-tuning LLMs using human feedback has shown promise for aligning LLMs to be factually consistent during generation, but such a training procedure requires high-quality human-annotated data, which can be extremely expensive to obtain in the clinical domain. In this work, we propose a new pipeline that uses ChatGPT instead of human experts to generate high-quality feedback data for improving factual consistency in the clinical note summarization task. We focus specifically on edit feedback because recent work discusses the shortcomings of human alignment via preference feedback in complex situations (such as clinical NLP tasks that require extensive expert knowledge), as well as some advantages of collecting edit feedback from domain experts. In addition, although GPT has reached the expert level in many clinical NLP tasks (e.g., USMLE QA), there is not much previous work discussing whether GPT can generate expert-level edit feedback for LMs in the clinical note summarization task. We hope to fill this gap. Finally, our evaluations demonstrate the potential use of GPT edits in human alignment, especially from a factuality perspective. |
Prakamya Mishra · Zonghai Yao · shuwei chen · Beining Wang · Rohan Mittal · Hong Yu 🔗 |
-
|
Privacy Measurements in Tabular Synthetic Data: State of the Art and Future Research Directions
(
Poster
)
>
link
Synthetic data (SD) have garnered attention as a privacy enhancing technology. Unfortunately, there is no standard for assessing their degree of privacy protection. In this paper, we discuss proposed assessment approaches. This contributes to the development of SD privacy standards; stimulates multi-disciplinary discussion; and helps SD researchers make informed modeling and evaluation decisions. |
Alexander Boudewijn · Andrea Filippo Ferraris · Daniele Panfilo · Vanessa Cocca · Sabrina Zinutti · Karel De Schepper · Carlo Chauvenet 🔗 |
-
|
On Consistent Bayesian Inference from Synthetic Data
(
Poster
)
>
link
Generating synthetic data, with or without differential privacy, has attracted significant attention as a potential solution to the dilemma between making data easily available, and the privacy of data subjects. Several works have shown that consistency of downstream analyses from synthetic data, including accurate uncertainty estimation, requires accounting for the synthetic data generation. There are very few methods of doing so, most of them for frequentist analysis. In this paper, we study how to perform consistent Bayesian inference from synthetic data. We prove that mixing posterior samples obtained separately from multiple large synthetic datasets converges to the posterior of the downstream analysis under standard regularity conditions when the analyst's model is compatible with the data provider's model. We also present several examples showing how the theory works in practice, and showing how Bayesian inference can fail when the compatibility assumption is not met, or the synthetic dataset is not significantly larger than the original. |
Ossi Räisä · Joonas Jälkö · Antti Honkela 🔗 |
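The paper's mixing result can be illustrated with a toy conjugate example: run the same Bayesian analysis separately on several large synthetic datasets, then pool the posterior draws. The normal-mean model below is a hypothetical stand-in for the analyst's downstream analysis, and the synthetic sets are simulated rather than produced by a real generator.

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_samples(data, n_draws, prior_var=100.0, noise_var=1.0):
    # Conjugate posterior draws for a normal mean with known noise variance;
    # a stand-in for whatever Bayesian analysis the analyst actually runs.
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * data.sum() / noise_var
    return rng.normal(post_mean, np.sqrt(post_var), n_draws)

# several large synthetic datasets released by a (simulated) data provider
true_mu = 2.0
synthetic_sets = [rng.normal(true_mu, 1.0, 5000) for _ in range(5)]

# run the analysis separately on each synthetic set, then mix the draws
mixed = np.concatenate([posterior_samples(d, 1000) for d in synthetic_sets])
```

Per the paper's theory, this mixture approaches the posterior of the downstream analysis only when the synthetic datasets are large and the analyst's model is compatible with the data provider's.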
-
|
Differentially Private Synthetic Data via Foundation Model APIs 1: Images
(
Oral
)
>
link
Generating differentially private (DP) synthetic data that closely resembles the original private data is a scalable way to mitigate privacy concerns in the current data-driven world. In contrast to current practices that train customized models for this task, we aim to generate DP Synthetic Data via APIs (DPSDA), where we treat foundation models as blackboxes and only utilize their inference APIs. Such API-based, training-free approaches are easier to deploy as exemplified by the recent surge in the number of API-based apps. These approaches can also leverage the power of large foundation models which are only accessible via their inference APIs. However, this comes with greater challenges due to strictly more restrictive model access and the need to protect privacy from the API provider. In this paper, we present a new framework called Private Evolution (PE) to solve this problem and show its initial promise on synthetic images. Surprisingly, PE can match or even outperform state-of-the-art (SOTA) methods without any model training. For example, on CIFAR10 (with ImageNet as the public data), we achieve FID ≤ 7.9 with privacy cost ε = 0.67, significantly improving the previous SOTA from ε = 32. We further demonstrate the promise of applying PE on large foundation models such as Stable Diffusion to tackle challenging private datasets with a small number of high-resolution images. |
Zinan Lin · Sivakanth Gopi · Janardhan Kulkarni · Harsha Nori · Sergey Yekhanin 🔗 |
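The Private Evolution loop can be sketched in one dimension. This toy version is a loose illustration only: real PE works on images via foundation-model generation and variation APIs, uses a DP nearest-neighbor histogram with calibrated noise, and accounts for the total privacy budget; the Laplace scale and all numbers below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def pe_step(candidates, private, epsilon):
    # One toy 1-D Private Evolution iteration: each private point votes for
    # its nearest candidate, the vote histogram is privatized with Laplace
    # noise, candidates are resampled in proportion to the noisy votes, then
    # perturbed (the "variation" a foundation-model API would produce).
    votes = np.zeros(len(candidates))
    for x in private:
        votes[np.argmin(np.abs(candidates - x))] += 1.0
    noisy = np.clip(votes + rng.laplace(0.0, 1.0 / epsilon, len(votes)), 0.0, None)
    if noisy.sum() == 0.0:
        noisy = np.ones_like(noisy)
    picked = rng.choice(candidates, size=len(candidates), p=noisy / noisy.sum())
    return picked + rng.normal(0.0, 0.1, len(picked))

private = rng.normal(5.0, 0.5, 500)    # private data, never released
cands = rng.uniform(-10.0, 10.0, 50)   # initial (public) API samples
for _ in range(10):
    cands = pe_step(cands, private, epsilon=1.0)
```

After a few iterations the candidate population drifts toward the private distribution even though only noisy vote counts ever depend on the private data.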
-
|
Synthetic Health-related Longitudinal Data with Mixed-type Variables Generated using Diffusion Models
(
Poster
)
>
link
This paper introduces a novel method for simulating Electronic Health Records (EHRs) using Diffusion Probabilistic Models (DPMs). We showcase the ability of DPMs to generate longitudinal EHRs with mixed-type variables – numeric, binary, and categorical. Our approach is benchmarked against existing Generative Adversarial Network (GAN)-based methods in two clinical scenarios: management of acute hypotension in the intensive care unit and antiretroviral therapy for people with human immunodeficiency virus. Our DPM-simulated datasets not only minimise patient disclosure risk but also outperform GAN-generated datasets in terms of realism. These datasets also prove effective for training downstream machine learning algorithms, including reinforcement learning and Cox proportional hazards models for survival analysis. |
Nicholas Kuo · Louisa Jorm · Sebastiano Barbieri 🔗 |
-
|
Diffusion-based Semantic-Discrepant Outlier Generation for Out-of-Distribution Detection
(
Poster
)
>
link
Out-of-distribution (OOD) detection, which determines whether a given sample is part of the training distribution, has recently shown promising results by training with synthetic OOD datasets. The important properties of effective synthetic OOD datasets are two-fold: (i) the OOD sample should be close to the in-distribution (ID) data, but (ii) represent semantically shifted information. To achieve this, we introduce a novel framework that consists of Semantic-Discrepant (SD) outlier generation and an advanced OOD detection method. For SD outlier generation, we utilize a conditional diffusion model trained with pseudo-labels. Then, we propose a simple yet effective method, semantic-discrepant guidance, allowing the model to generate realistic outliers that contain an incoherent semantic shift while preserving nuisance information (e.g., background). Furthermore, we suggest SD outlier-aware OOD detector training and scoring methods. Our experiments demonstrate the effectiveness of our framework on the CIFAR-10 dataset. We achieve an AUROC of 98% when CIFAR-100 is given as OOD. The SD outlier dataset on CIFAR-10 is available at https://zenodo.org/record/8394847. |
Suhee Yoon · Sanghyu Yoon · Hankook Lee · Sangjun Han · Ye Seul Sim · Kyungeun Lee · Hyeseung Cho · Woohyung Lim 🔗 |
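One way outlier-aware scoring can work is to train a classifier with an extra head for synthetic outliers and use its softmax probability as the OOD score. This is a generic sketch under that assumption, not necessarily the paper's exact scoring method; the logits below are invented.

```python
import numpy as np

def ood_score(id_logits, outlier_logit):
    # Outlier-aware scoring sketch: a classifier with K in-distribution heads
    # plus one extra head assumed to be trained on SD outliers; the softmax
    # probability of the extra head serves as the OOD score.
    z = np.concatenate([np.asarray(id_logits, dtype=float), [outlier_logit]])
    p = np.exp(z - z.max())
    p /= p.sum()
    return float(p[-1])

confident_id = ood_score([5.0, 1.0, 1.0], 0.5)   # looks in-distribution
likely_ood = ood_score([1.0, 1.0, 1.0], 4.0)     # outlier head dominates
```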