Workshop
Synthetic Data Generation with Generative AI
Sergul Aydore · Zhaozhi Qian · Mihaela van der Schaar
Hall E2 (level 1)
Synthetic data (SD) is data generated by a mathematical model to solve downstream data science tasks. SD can be used to address three key problems: (1) private data release, (2) data de-biasing and fairness, and (3) data augmentation for boosting the performance of ML models. While SD offers great opportunities for these problems, SD generation is still a developing area of research, and systematic frameworks for SD deployment and evaluation are still missing. Additionally, despite the substantial advances in generative AI, the scientific community still lacks a unified understanding of how generative AI can be used to generate SD for different modalities. The goal of this workshop is to provide a platform for vigorous discussion among research communities from all these different perspectives, in the hope of advancing the use of SD for better and more trustworthy ML training. Through submissions and facilitated discussions, we aim to characterize and mitigate the common challenges of SD generation that span numerous application domains. The workshop is jointly organized by academic researchers (University of Cambridge) and industry partners from tech (Amazon AI).
Schedule
Sat 7:00 a.m. - 7:05 a.m. | Welcome and workshop overview (Talk) | Sergul Aydore
Sat 7:05 a.m. - 7:15 a.m. | Synthetic Data: Charting New Research Frontiers, Maximizing Impact, and Cultivating Collaborative Communities (Talk) | Mihaela van der Schaar
Sat 7:15 a.m. - 8:00 a.m. | Generating health records (Invited Talk) | Edward Choi
Sat 8:00 a.m. - 8:30 a.m. | Coffee Break & Poster Session
Sat 8:30 a.m. - 9:15 a.m. | Privacy and Synthetic data (Invited Talk) | Antti Honkela
Sat 9:15 a.m. - 9:45 a.m. | Differentially Private Synthetic Data via Foundation Model APIs 1: Images (Contributed Talk) | Zinan Lin
Sat 9:45 a.m. - 10:15 a.m. | Effective Data Augmentation With Diffusion Models (Contributed Talk) | Max Gurinas · Brandon Trabucco
Sat 10:15 a.m. - 11:30 a.m. | Lunch Break & Poster Session
Sat 11:30 a.m. - 12:15 p.m. | Diversity and Synthetic data (Invited Talk) | Adji Bousso Dieng
Sat 12:15 p.m. - 12:45 p.m. | Fair Wasserstein Coresets (Contributed Talk) | Vamsi Potluru
Sat 12:45 p.m. - 1:15 p.m. | Improving fairness for spoken language understanding in atypical speech with Text-to-Speech (Contributed Talk) | Venkatesh Ravichandran · Helin Wang
Sat 1:15 p.m. - 1:30 p.m. | Coffee Break & Poster Session
Sat 1:30 p.m. - 2:15 p.m. | Generative Agents: Interactive Simulacra (Invited Talk) | Michael Bernstein
Sat 2:15 p.m. - 3:00 p.m. | Panel Discussion (Panel) | Danielle Belgrave · Cem Tekin · Robert Tillman · Megan Gibbs · Dino Oglic · Rudi Agius · Panagiota Konstantinou
Size Matters: Large Graph Generation with HiGGs (Poster)
Large graphs are present in a variety of domains, including social networks, civil infrastructure, and the physical sciences, to name a few. Graph generation is similarly widespread, with applications in drug discovery, network analysis and synthetic datasets among others. While GNN (Graph Neural Network) models have been applied in these domains, their high in-memory costs restrict them to small graphs. Conversely, less costly rule-based methods struggle to reproduce complex structures. We propose HIGGS (Hierarchical Generation of Graphs) as a model-agnostic framework for producing large graphs with realistic local structures. HIGGS uses GNN models with conditional generation capabilities to sample graphs in hierarchies of resolution. As a result, HIGGS has the capacity to extend the scale of generated graphs from a given GNN model by quadratic order. As a demonstration, we implement HIGGS using DiGress, a recent graph-diffusion model, including a novel edge-predictive-diffusion variant, edge-DiGress. We use this implementation to generate categorically attributed graphs with tens of thousands of nodes. These HIGGS-generated graphs are far larger than any previously produced using GNNs. Despite this jump in scale, we demonstrate that the graphs produced by HIGGS are, on the local scale, more realistic than those from the rule-based model BTER.
Alex O. Davies · Nirav Ajmeri · Telmo Silva Filho

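The hierarchy-of-resolution idea can be illustrated with a toy two-level generator: first sample a coarse graph over communities, then expand each community into a local subgraph and wire a few edges across each coarse edge. This is only a sketch of the structure of the approach; HIGGS replaces both random samplers below with conditionally trained GNN generative models, and the sizes and probabilities here are arbitrary stand-ins.

```python
import random

def generate_hierarchical_graph(n_communities=5, nodes_per_community=20,
                                p_intra=0.3, p_coarse=0.4, inter_edges=3, seed=0):
    """Toy two-level hierarchical graph generator (random stand-in for HIGGS)."""
    rng = random.Random(seed)
    # Level 1: coarse community graph (Erdos-Renyi stand-in for a learned model).
    coarse = [(a, b) for a in range(n_communities)
              for b in range(a + 1, n_communities) if rng.random() < p_coarse]
    edges = set()
    # Level 2: expand each community into a local subgraph.
    for c in range(n_communities):
        base = c * nodes_per_community
        for i in range(nodes_per_community):
            for j in range(i + 1, nodes_per_community):
                if rng.random() < p_intra:
                    edges.add((base + i, base + j))
    # Wire a few inter-community edges along each coarse edge.
    for a, b in coarse:
        for _ in range(inter_edges):
            u = a * nodes_per_community + rng.randrange(nodes_per_community)
            v = b * nodes_per_community + rng.randrange(nodes_per_community)
            edges.add((min(u, v), max(u, v)))
    return edges

edges = generate_hierarchical_graph()
print(len(edges))
```

Because the expensive model only ever sees one community-sized subgraph at a time, the total graph can be far larger than what the model could generate in one shot.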
Generating Medical Instructions with Conditional Transformer (Poster)
Access to real-world medical instructions is essential for medical research and healthcare quality improvement. However, access to real medical instructions is often limited due to the sensitive nature of the information expressed. Additionally, manually labelling these instructions for training and fine-tuning Natural Language Processing (NLP) models can be tedious and expensive. We introduce a novel task-specific model architecture, Label-To-Text-Transformer (LT3), tailored to generate synthetic medical instructions based on provided labels, such as a vocabulary list of medications and their attributes. LT3 is trained on a vast corpus of medical instructions extracted from the MIMIC-III database, allowing the model to produce valuable synthetic medical instructions. We evaluate LT3's performance by contrasting it with a state-of-the-art Pre-trained Language Model (PLM), T5, analysing the quality and diversity of generated texts. We deploy the generated synthetic data to train the SpacyNER model for the Named Entity Recognition (NER) task over the n2c2-2018 dataset. The experiments show that the model trained on synthetic data can achieve a 96-98% F1 score at Label Recognition on Drug, Frequency, Route, Strength, and Form. LT3's code will be shared at https://github.com/HECTA-UoM/Label-To-Text-Transformer
Samuel Belkadi · Nicolo Micheletti · Lifeng Han · Warren Del-Pinto · Goran Nenadic

SciFix: Outperforming GPT3 on Scientific Factual Error Correction (Poster)
Due to the prohibitively high cost of creating error correction datasets, most Factual Claim Correction methods rely on a powerful verification model to guide the correction process. This leads to a significant drop in performance in domains like Scientific Claim Correction, where good verification models do not always exist. In this work we introduce SciFix, a claim correction system that does not require a verifier but is able to outperform existing methods by a considerable margin, achieving correction accuracy of 84% on the SciFact dataset, 77% on SciFact-Open and 72% on the CovidFact dataset, compared to next-best accuracies of 7%, 5% and 15% on the same datasets respectively. Our method leverages the power of prompting with LLMs during training to create a richly annotated dataset that can be used for fully supervised training and regularization. We additionally use a claim-aware decoding procedure to improve the quality of corrected claims. Our method outperforms the very LLM that was used to generate the annotated dataset: Few-Shot Prompting on GPT3.5 achieves 58%, 61% and 64% on the respective datasets, a consistently lower correction accuracy, despite using nearly 800 times as many parameters as our model.
Dhananjay Ashok · Atharva Kulkarni · Hai Pham · Barnabas Poczos

Knowledge-based in silico models and dataset for the comparative evaluation of mammography AI (Poster)
To generate evidence regarding the safety and efficacy of artificial intelligence (AI) enabled medical devices, AI models need to be evaluated on a diverse population of patient cases, some of which may not be readily available. We propose an evaluation approach for testing medical imaging AI models that relies on in silico imaging pipelines in which stochastic digital models of human anatomy (in object space) with and without pathology are imaged using a digital replica imaging acquisition system to generate realistic synthetic image datasets. Here, we release M-SYNTH, a dataset of cohorts with four breast fibroglandular density distributions imaged at different exposure levels using Monte Carlo x-ray simulations with the publicly available Virtual Imaging Clinical Trial for Regulatory Evaluation (VICTRE) toolkit. We utilize the synthetic dataset to analyze AI model performance and find that model performance decreases with increasing breast density and increases with higher mass density, as expected. As exposure levels decrease, AI model performance drops, with the highest performance achieved at exposure levels lower than the nominal recommended dose for the breast type.
Elena Sizikova · Niloufar Saharkhiz · Diksha Sharma · Miguel Lago · Berkman Sahiner · Jana Delfino · Aldo Badano

Knowledge-Infused Prompting Improves Clinical Text Generation with Large Language Models (Poster)
Clinical natural language processing requires methods that can address domain-specific challenges, such as complex medical terminology and clinical contexts. Recently, large language models (LLMs) have shown promise in this domain. Yet, their direct deployment can lead to privacy issues, and they are constrained by resources. To address this challenge, we propose ClinGen, which infuses knowledge into synthetic clinical text generation using LLMs for clinical NLP tasks. Our model involves clinical knowledge extraction and context-informed LLM prompting. Both clinical topics and writing styles are drawn from external domain-specific knowledge graphs and LLMs to guide data generation. Extensive studies across 7 clinical NLP tasks and 16 datasets reveal that ClinGen consistently enhances performance across various tasks, effectively aligning the distribution of real datasets and enriching the diversity of generated training instances.
Ran Xu · Hejie Cui · Yue Yu · Xuan Kan · Wenqi Shi · Yuchen Zhuang · Wei Jin · Joyce Ho · Carl Yang

Improving Code Style for Accurate Code Generation (Poster)
Natural language to code generation is an important application area of LLMs and has received wide attention from the community. The majority of relevant studies have exclusively concentrated on increasing the quantity and functional correctness of training sets while disregarding other stylistic elements of programs. More recently, data quality has garnered a lot of interest, and multiple works have showcased its importance for improving performance. In this work, we investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system. We build a novel data-cleaning pipeline that uses these principles to transform existing programs by 1) renaming variables, 2) modularizing and decomposing complex code into smaller helper sub-functions, and 3) inserting natural-language-based planning annotations. We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B on our transformed programs improves the performance by up to 30% compared to fine-tuning on the original dataset. Additionally, we demonstrate improved performance from using a smaller amount of higher-quality data, finding that a model fine-tuned on the entire original dataset is outperformed by a model trained on one-eighth of our cleaned dataset. Even in comparison to closed-source models, our models outperform the much larger AlphaCode models.
Naman Jain · Tianjun Zhang · Wei-Lin Chiang · Joseph Gonzalez · Koushik Sen · Ion Stoica

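One ingredient of such a pipeline, variable renaming, can be sketched with Python's `ast` module. The renaming map below is written by hand for illustration; in the paper, names (along with modularization and planning annotations) are proposed by an LLM.

```python
import ast

class RenameVariables(ast.NodeTransformer):
    """Rewrite variable and parameter names according to a mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):  # loads and stores of local variables
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):   # function parameters
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

source = """
def f(a, n):
    s = 0
    for x in a:
        s += x
    return s / n
"""
# Hypothetical LLM-proposed descriptive names for this snippet.
mapping = {"a": "values", "n": "count", "s": "total", "x": "value"}
cleaned = ast.unparse(RenameVariables(mapping).visit(ast.parse(source)))
print(cleaned)
```

Because the transformation operates on the AST, the rewritten program is guaranteed to stay syntactically valid and behaviorally identical, which is what makes it safe to apply at training-set scale.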
GeMQuAD: Generating Multilingual Question Answering Datasets from Large Language Models using Few Shot Learning (Poster)
The emergence of Large Language Models (LLMs) with capabilities like In-Context Learning (ICL) has ushered in new possibilities for data generation across various domains while minimizing the need for extensive data collection and modeling techniques. Researchers have explored ways to use this generated synthetic data to optimize smaller student models for reduced deployment costs and lower latency in downstream tasks. However, ICL-generated data often suffers from low quality, as task specificity is limited by the few examples used in ICL. In this paper, we propose GeMQuAD, a semi-supervised learning approach extending the WeakDAP framework, applied to a dataset generated through ICL with just one example in the target language using the AlexaTM 20B Seq2Seq LLM. Through our approach, we iteratively identify high-quality data to enhance model performance, especially for low-resource multilingual settings in the context of the Extractive Question Answering task. Our framework surpasses the performance of a baseline model trained on an English-only dataset by 5.05/6.50 points in F1/Exact Match (EM) for Hindi and by 3.81/3.69 points in F1/EM for Spanish on the MLQA dataset. Notably, our approach uses a pre-trained LLM with no additional fine-tuning, using only one annotated example in ICL to generate data, keeping the development process cost-effective.
Amani Namboori · Shivam Mangale · Andy Rosenbaum · Saleh Soltan

EDGE++: Improved Training and Sampling of EDGE (Poster)
Traditional graph-generative models like the Stochastic-Block Model (SBM) fall short in capturing complex structures inherent in large graphs. Recently developed deep learning models like NetGAN, CELL, and Variational Graph Autoencoders have made progress but face limitations in replicating key graph statistics. Diffusion-based methods such as EDGE have emerged as promising alternatives; however, they present challenges in computational efficiency and generative performance. In this paper, we propose enhancements to the EDGE model to address these issues. Specifically, we introduce a degree-specific noise schedule that optimizes the number of active nodes at each timestep, significantly reducing memory consumption. Additionally, we present an improved sampling scheme that fine-tunes the generative process, allowing for better control over the similarity between the synthesized and the true network. Our experimental results demonstrate that the proposed modifications not only improve the efficiency but also enhance the accuracy of the generated graphs, offering a robust and scalable solution for graph generation tasks.
Xiaohui Chen · Mingyang Wu · Liping Liu

Conditional Generative Modeling for High-dimensional Marked Temporal Point Processes (Poster)
Recent advancements in generative modeling have made it possible to generate high-quality content from context information, but a key question remains: how to teach models to know when to generate content? To answer this question, this study proposes a novel event generative model that draws its statistical intuition from marked temporal point processes, and offers a clean, flexible, and computationally efficient solution for a wide range of applications involving the generation of asynchronous events with high-dimensional marks. We use a conditional generator that takes the history of events as input and generates the high-quality subsequent event that is likely to occur given the prior observations. The proposed framework offers a host of benefits, including considerable representational power to capture intricate dynamics in multi- or even high-dimensional event space, as well as exceptional efficiency in learning the model and generating samples. Our numerical results demonstrate superior performance compared to other state-of-the-art baselines.
Zheng Dong · Zekai Fan · Shixiang Zhu

Synthetic Data Generation for Scarce Road Scene Detection Scenarios (Poster)
Recent advancements in generative models have led to significant improvements in the quality of generated images, making them virtually indistinguishable from real ones. However, using AI-generated images for training robust computer vision models for real-world applications, especially object detection in road scene perception, is still a challenge. AI-generated images usually lack the required diversity and scene complexity, particularly for objects that appear with critically low frequency in the available real datasets. An example of such applications is the detection of emergency vehicles like police cars, fire trucks, and ambulances in road scenes. These vehicles appear with drastically low frequency in available datasets. Successfully generating synthetic images of road scenes that include these types of vehicles and using them in training downstream models would prove useful for autonomous driving vehicles, mitigating safety concerns on the road. To address this, this paper proposes a new approach for synthetically generating diverse, complex, and domain-compatible images of emergency vehicles in road scenes by employing a diffusion-based generative model pretrained on a generic dataset. We investigate the impact of using generated synthetic images on the performance of downstream object detection models. Finally, we thoroughly discuss the challenges of generating synthetic datasets with the proposed approach.
Dipika Khullar · Yash Shah · Ninadkulamz · Negin Sokhandan

Stable Diffusion For Aerial Object Detection (Poster)
Aerial object detection is a challenging task, in which one major obstacle lies in the limitations of large-scale data collection and the long-tail distribution of certain classes. Synthetic data offers a promising solution, especially with recent advances in diffusion-based methods like stable diffusion (SD). However, the direct application of diffusion methods to aerial domains poses unique challenges: stable diffusion's optimization for rich ground-level semantics doesn't align with the sparse nature of aerial objects, and the extraction of post-synthesis object coordinates remains problematic. To address these challenges, we introduce a synthetic data augmentation framework tailored for aerial images. It encompasses sparse-to-dense region of interest (ROI) extraction to bridge the semantic gap, fine-tuning the diffusion model with low-rank adaptation (LORA) to circumvent exhaustive retraining, and finally, a Copy-Paste method to compose synthesized objects with backgrounds, providing a nuanced approach to aerial object detection through synthetic data. Code will be released at https://github.com/Anonymous
Yanan Jian · FUXUN YU · Simranjit Singh · Dimitrios Stamoulis

INTAGS: Interactive Agent-Guided Simulation (Poster)
The development of a realistic agent-based simulator (ABS) remains a challenging task, mainly due to the sequential and dynamic nature of such a multi-agent system (MAS). To fill this gap, this work proposes a metric to distinguish between real and synthetic multi-agent systems. The metric evaluation depends on the live interaction between the experimental (Exp) autonomous agent and background (BG) agent(s), explicitly accounting for the systems' sequential and dynamic nature. Specifically, we propose to characterize the system/environment by studying the effect of a sequence of BG agents' responses to the environment state evolution, and we take such effects' differences as the MAS distance metric. The effect estimation is cast as a causal inference problem, since the environment evolution is confounded with the previous environment state. Importantly, we propose the Interactive Agent-Guided Simulation (INTAGS) framework to build a realistic simulator by optimizing over this novel metric. To adapt to any environment with interactive sequential decision-making agents, INTAGS formulates the simulator as a stochastic policy in reinforcement learning. Moreover, INTAGS utilizes the policy gradient update to bypass differentiating the proposed metric, so that it can support non-differentiable operations of multi-agent environments. Through extensive experiments, we demonstrate the effectiveness of INTAGS on an equity stock market simulation example.
Song Wei · Andrea Coletta · Svitlana Vyetrenko · Tucker Balch

CALICO: Conversational Agent Localization via Synthetic Data Generation (Poster)
We present CALICO, a method to fine-tune Large Language Models (LLMs) to localize conversational agent training data from one language to another. For named entities, CALICO supports three operations: verbatim copy, literal translation, and localization, i.e. generating entity values more appropriate in the target language, such as city and airport names located in countries where the language is spoken. To prove the effectiveness of CALICO, we build and release a new human-localized (HL) version of the MultiATIS++ travel information test set in 6 languages. Compared to the original human-translated (HT) version of the test set, we show that our new HL version is more challenging. We also show that CALICO outperforms the state-of-the-art LINGUIST (which relies on literal slot translation out of context) both on the HT case, where CALICO generates more accurate slot translations, and on the HL case, where CALICO generates localized entities which are closer to the HL test set.
Andy Rosenbaum · Ershad Banijamali · Christopher DiPersio · Pegah Kharazmi · Pan Wei · Lu Zeng · Gokmen Oz · Wael Hamza · Clement Chung · Karolina Owczarzak · Fabian Triefenbach

Improving fairness for spoken language understanding in atypical speech with Text-to-Speech (Oral)
Spoken language understanding (SLU) systems often exhibit suboptimal performance in processing atypical speech, typically caused by neurological conditions and motor impairments. Recent advancements in Text-to-Speech (TTS) synthesis-based augmentation for more fair SLU have struggled to accurately capture the unique vocal characteristics of atypical speakers, largely due to insufficient data. To address this issue, we present a novel data augmentation method for atypical speakers by finetuning a TTS model, called Aty-TTS. Aty-TTS models speaker and atypical characteristics via knowledge transferring from a voice conversion model. Then, we use the augmented data to train SLU models adapted to atypical speech. To train these data augmentation models and evaluate the resulting SLU systems, we have collected a new atypical speech dataset containing intent annotation. Both objective and subjective assessments validate that Aty-TTS is capable of generating high-quality atypical speech. Furthermore, it serves as an effective data augmentation strategy, contributing to more fair SLU systems that can better accommodate individuals with atypical speech patterns.
Helin Wang · Venkatesh Ravichandran · Milind Rao · Becky Lammers · Myra J. Sydnor · Nicholas Maragakis · Ankur Butala · Jayne Zhang · Lora Clawson · Victoria Chovaz · Laureano Moro-Velazquez

Generating Privacy-Preserving Longitudinal Synthetic Data (Poster)
Before synthetic data (SD) generators are able to generate entire electronic health records, many challenges still have to be tackled. One of these challenges is to generate SD that is both privacy-preserving and longitudinal. This research combines the research streams of longitudinal SD and privacy-preserving static SD and presents a novel GAN architecture called Time-ADS-GAN. Time-ADS-GAN outperforms current state-of-the-art models on both utility and privacy on three datasets, and is able to reproduce the results of a healthcare study significantly better than TimeGAN. As a second contribution, a variation of the ε-identifiability metric is introduced and used in the analysis.
Robin van Hoorn

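For context, a common baseline formulation of the ε-identifiability idea (not the paper's specific variation) asks what fraction of real records lie closer to some synthetic record than to their nearest real neighbour. A minimal sketch, assuming purely numeric records:

```python
import numpy as np

def identifiability_score(real, synth):
    """Fraction of real records closer to a synthetic record than to any
    other real record. High values suggest synthetic records may single
    out real individuals; this is a baseline sketch, not the paper's
    exact variation of the metric."""
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    # Real-to-real distances, excluding each record's distance to itself.
    rr = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=-1)
    np.fill_diagonal(rr, np.inf)
    # Real-to-synthetic distances.
    rs = np.linalg.norm(real[:, None, :] - synth[None, :, :], axis=-1)
    return float(np.mean(rs.min(axis=1) < rr.min(axis=1)))

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 5))
far_synth = real + 10.0   # synthetic records far from every real record
near_copies = real + 1e-6 # near-copies of real records: a privacy red flag
print(identifiability_score(real, far_synth),
      identifiability_score(real, near_copies))
```

A score near 1 for the near-copies and near 0 for the distant samples is exactly the behaviour such a metric should exhibit.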
AutoDiff: combining Auto-encoder and Diffusion model for tabular data synthesizing (Poster)
Diffusion models have become a main paradigm for synthetic data generation in many subfields of modern machine learning, including computer vision, language modeling, and speech synthesis. In this paper, we leverage the power of diffusion models for generating synthetic tabular data. The heterogeneous features in tabular data have been a main obstacle to tabular data synthesis, and we tackle this problem by employing an auto-encoder architecture. When compared with state-of-the-art tabular synthesizers, the synthetic tables produced by our model show good statistical fidelity to the real data and perform well in downstream machine learning tasks. We conducted experiments over 15 publicly available datasets. Notably, our model adeptly captures the correlations among features, which has been a long-standing challenge in tabular data synthesis. Our code is available upon request and will be publicly released if the paper is accepted.
Namjoon Suh · Xiaofeng Lin · Din-Yin Hsieh · Mehrdad Honarkhah · Guang Cheng

Towards Effective Synthetic Data Sampling for Domain Adaptive Pose Estimation (Poster)
In this paper, we investigate a synthetic data sampling approach towards unsupervised domain adaptation (UDA) for pose estimation. UDA is characterized by a labeled source domain and an unlabeled target domain. We observe that recent work in UDA for pose estimation fails to generalize across poses in target data, despite having support for such poses in the source data. We hypothesize that this failure to generalize is due to a lack of uniform support across poses of varying complexity in the source domain. Motivated by this challenge, we aim to sample and train with the source domain data to improve the domain adaptation performance on a target domain. The proposed sampling strategy sorts the source domain samples based on a difficulty score, which reflects the lack of uniform support across varying pose complexity in the source domain. The difficulty score is a reconstruction error obtained from training an auto-encoder on the source domain poses. We categorize the dataset into closely related groups using this score. Selectively training from all or some of these groups helps us to better utilize the source pose distribution. Finally, current pose estimation evaluation metrics do not effectively measure the ability of the model to learn the geometry of pose. We evaluate our approach qualitatively and quantitatively on benchmark datasets. Our sampling strategy outperforms the existing state-of-the-art for domain adaptation.
Isha Dua · Arjun Sharma · Shuaib Ahmed · Rahul Tallamraju

Fair Wasserstein Coresets (Oral)
Recent technological advancements have given rise to the ability to collect vast amounts of data, which often exceed the capacity of commonly used machine learning algorithms. Approaches such as coresets and synthetic data distillation have emerged as frameworks to generate a smaller, yet representative, set of samples for downstream training. As machine learning is increasingly applied to decision-making processes, it becomes imperative for modelers to consider and address biases in the data concerning subgroups defined by factors like race, gender, or other sensitive attributes. Current approaches focus on creating fair synthetic representative samples by optimizing local properties relative to the original samples. These methods, however, are not guaranteed to positively affect the performance or fairness of downstream learning processes. In this work, we present Fair Wasserstein Coresets (FWC), a novel coreset approach which generates fair synthetic representative samples along with sample-level weights to be used in downstream learning tasks. FWC aims to minimize the Wasserstein distance between the original datasets and the weighted synthetic samples while enforcing (an empirical version of) demographic parity, a prominent criterion for algorithmic fairness, via a linear constraint. We show that FWC can be thought of as a constrained version of Lloyd's algorithm for k-medians or k-means clustering. Our experiments, conducted on both synthetic and real datasets, demonstrate the scalability of our approach and highlight the competitive performance of FWC compared to existing fair clustering approaches, even when attempting to enhance the fairness of the latter through fair pre-processing techniques.
Zikai Xiong · Niccolo Dalmasso · Vamsi Potluru · Tucker Balch · Manuela Veloso

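The fairness side of the objective is the empirical demographic parity of the weighted sample: the gap between groups in the weighted rate of positive labels. A minimal numeric illustration of that quantity (not the paper's full optimization, whose coreset points and weights are learned jointly):

```python
import numpy as np

def weighted_demographic_parity_gap(labels, groups, weights):
    """Gap between groups in the weighted positive-label rate.
    A gap of 0 means the weighted sample satisfies empirical
    demographic parity."""
    labels = np.asarray(labels, dtype=float)
    groups = np.asarray(groups)
    weights = np.asarray(weights, dtype=float)
    rates = []
    for g in np.unique(groups):
        mask = groups == g
        rates.append(np.average(labels[mask], weights=weights[mask]))
    return float(max(rates) - min(rates))

labels  = [1, 1, 0, 0, 1, 0, 1, 0]
groups  = [0, 0, 0, 0, 1, 1, 1, 1]
uniform = [1, 1, 1, 1, 1, 1, 1, 1]
skewed  = [2, 2, 1, 1, 1, 1, 1, 1]  # up-weighting positives in group 0
print(weighted_demographic_parity_gap(labels, groups, uniform))  # 0.0
print(weighted_demographic_parity_gap(labels, groups, skewed))
```

Sample-level weights are exactly what makes the constraint enforceable: reweighting changes each group's effective positive rate without touching the samples themselves.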
Effective Data Augmentation With Diffusion Models (Oral)
Data augmentation is one of the most prevalent tools in deep learning, underpinning many recent advances, including those from classification, generative models, and representation learning. The standard approach to data augmentation combines simple transformations like rotations and flips to generate new images from existing ones. However, these new images lack diversity along key semantic axes present in the data. Current augmentations cannot alter the high-level semantic attributes, such as animal species present in a scene, to enhance the diversity of data. We address the lack of diversity in data augmentation with image-to-image transformations parameterized by pre-trained text-to-image diffusion models. Our method edits images to change their semantics using an off-the-shelf diffusion model, and generalizes to novel visual concepts from a few labelled examples. We evaluate our approach on few-shot image classification tasks, and on a real-world weed recognition task, and observe an improvement in accuracy in tested domains.
Brandon Trabucco · Kyle Doherty · Max Gurinas · Russ Salakhutdinov

Continuous Diffusion for Mixed-Type Tabular Data (Poster)
Score-based generative models or diffusion models have proven successful across many domains in generating texts and images. However, the consideration of mixed-type tabular data with this model family has fallen short so far. Existing research mainly combines continuous and categorical diffusion processes and does not explicitly account for the feature heterogeneity inherent to tabular data. In this paper, we combine score matching and score interpolation to ensure a common type of continuous noise distribution that affects both continuous and categorical features. Further, we investigate the impact of distinct noise schedules per feature or per data type. We allow for adaptive, learnable noise schedules to ensure optimally allocated model capacity and balanced generative capability. Results show that our model outperforms the benchmark models consistently and that accounting for heterogeneity within the noise schedule design boosts sample quality.
Markus Mueller · Kathrin Gruber · Dennis Fok

-
|
Balancing the Picture: Debiasing Vision-Language Datasets with Synthetic Contrast Sets
(
Poster
)
>
link
Vision-language models are growing in popularity and public visibility to generate, edit, and caption images at scale; but their outputs can perpetuate and amplify societal biases learned during pre-training on uncurated image-text pairs from the internet. Although debiasing methods have been proposed, we argue that these measurements of model bias lack validity due to dataset bias. We demonstrate there are spurious correlations in COCO Captions, the most commonly used dataset for evaluating bias, between background context and the gender of people in-situ. This is problematic because commonly-used bias metrics (such as Bias@K) rely on per-gender base rates. To address this issue, we propose a novel dataset debiasing pipeline to augment the COCO dataset with synthetic, gender-balanced contrast sets, where only the gender of the subject is edited and the background is fixed. As existing image editing methods have limitations and sometimes produce low-quality images; we introduce a method to automatically filter the generated images based on their similarity to real images. Using our balanced synthetic contrast sets, we benchmark bias in multiple CLIP-based models, demonstrating how metrics are skewed by imbalance in the original COCO images. Our results indicate that the proposed approach improves the validity of the evaluation, ultimately contributing to more realistic understanding of bias in CLIP. |
Brandon Smith · Miguel Farinha · Siobhan Mackenzie Hall · Hannah Rose Kirk · Aleksandar Shtedritski · Max Bain 🔗 |
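The filtering step described in this abstract can be sketched as a similarity gate. This is a minimal illustration, not the paper's actual pipeline: the embeddings are hypothetical stand-ins for image features (e.g., CLIP), and the threshold is arbitrary.

```python
import numpy as np

def filter_contrast_set(real_emb, edited_emb, threshold=0.8):
    # Keep only edited images whose embedding is still close to the original
    # real image; low-similarity edits are treated as low-quality and dropped.
    kept = []
    for i, (r, e) in enumerate(zip(real_emb, edited_emb)):
        sim = float(r @ e) / (np.linalg.norm(r) * np.linalg.norm(e))
        if sim >= threshold:
            kept.append(i)
    return kept

real = np.array([[1.0, 0.0], [0.0, 1.0]])
edited = np.array([[0.9, 0.1], [1.0, 0.0]])   # second edit drifted badly
kept = filter_contrast_set(real, edited)
```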
-
|
Harnessing Synthetic Datasets: The Role of Shape Bias in Deep Neural Network Generalization
(
Poster
)
>
link
Recent advancements in deep learning have been primarily driven by the use of large models trained on increasingly vast datasets. While neural scaling laws have emerged to predict network performance given a specific level of computational resources, the growing demand for expansive datasets raises concerns. To address this, a new research direction has emerged, focusing on the creation of synthetic data as a substitute. In this study, we investigate how neural networks exhibit shape bias during training on synthetic datasets, serving as an indicator of the synthetic data quality. Specifically, our findings indicate three key points: (1) Shape bias varies across network architectures and types of supervision, casting doubt on its reliability as a predictor for generalization and its ability to explain differences in model recognition compared to human capabilities. (2) Relying solely on shape bias to estimate generalization is unreliable, as it is entangled with diversity and naturalism. (3) We propose a novel interpretation of shape bias as a tool for estimating the diversity of samples within a dataset. Our research aims to clarify the implications of using synthetic data and its associated shape bias in deep learning, addressing concerns regarding generalization and dataset quality. |
Elior Benarous · Sotiris Anagnostidis · Luca Biggio · Thomas Hofmann 🔗 |
-
|
Carpe Diem: On the Evaluation of World Knowledge in Lifelong Language Models
(
Oral
)
>
link
In an ever-evolving world, the dynamic nature of knowledge presents challenges for language models that are trained on static data, leading to outdated encoded information. However, real-world scenarios require models not only to acquire new knowledge but also to overwrite outdated information with updated knowledge. Addressing this, we introduce the temporally evolving question answering benchmark, EvolvingQA - a novel benchmark designed for training and evaluating LMs on an evolving Wikipedia database, where the construction of our benchmark is automated with our pipeline using large language models. Our benchmark incorporates question-answering as a downstream task to emulate real-world applications. Through EvolvingQA, we uncover that existing continual learning baselines have difficulty updating and forgetting outdated knowledge. Our findings suggest that the models fail to learn properly when acquiring updated knowledge due to small weight gradients. Furthermore, we elucidate that the models struggle mostly with providing numerical or temporal answers to questions asking for updated knowledge. Our work aims to model the dynamic nature of real-world information, offering a robust measure for the evolution-adaptability of language models. Our data construction code and dataset files are available at https://anonymous.4open.science/r/EvolvingQA/. |
Yujin Kim · Jaehong Yoon · Seonghyeon Ye · Sung Ju Hwang · Se-Young Yun 🔗 |
-
|
Learning to Place Objects into Scenes by Hallucinating Scenes around Objects
(
Poster
)
>
link
The ability to modify images to add new objects into a scene stands to be a powerful image editing control, but is currently not robustly supported by existing diffusion-based image editing methods. We design a two-step method for inserting objects of a given class into images that first predicts where the object is likely to go in the image and, then, realistically inpaints the object at this location. The central challenge of our approach is predicting where an object should go in a scene, given only an image of the scene. We learn a prediction model entirely from synthetic data by using diffusion-based image outpainting to hallucinate novel images of scenes surrounding a given object. We demonstrate that this weakly supervised approach, which requires no human labels at all, is able to generate more realistic object addition image edits than prior text-controlled diffusion-based approaches. We also demonstrate that, for a limited set of object categories, our learned object placement prediction model, despite being trained entirely on generated data, makes more accurate object placements than prior state-of-the-art models for object placement that were trained on a large, manually annotated dataset. |
Lu Yuan · James Hong · Vishnu Sarukkai · Kayvon Fatahalian 🔗 |
-
|
Evaluating VLMs for Property-Specific Annotation of 3D Objects
(
Poster
)
>
link
3D objects, which often lack clean text descriptions, present an opportunity to evaluate pretrained vision language models (VLMs) on a range of annotation tasks---from describing object semantics to physical properties. An accurate response must take into account the full appearance of the object in 3D, various ways of phrasing the question/prompt, and changes in other factors that affect the response. We present a method to marginalize over arbitrary factors varied across VLM queries, which relies on the VLM's scores for sampled responses. We first show that this aggregation method can outperform a language model (e.g., GPT4) for summarization, for instance avoiding hallucinations when there are contrasting details between responses. Secondly, we show that aggregated annotations are useful for prompt-chaining; they help improve downstream VLM predictions (e.g., of object material when the object's type is specified as an auxiliary input in the prompt). Such auxiliary inputs allow ablating and measuring the contribution of visual reasoning over language-only reasoning. Using these evaluations, we show that VLMs approach the quality of human-verified annotations on both type and material inference on the large-scale Objaverse dataset. |
Rishabh Kabra · Loic Matthey · Alexander Lerchner · Niloy Mitra 🔗 |
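The score-based aggregation described above can be sketched as a softmax-weighted vote over responses pooled across prompt variants. This is a simplified illustration under stated assumptions: the scores stand in for VLM log-probabilities of sampled responses, and the example queries and answers are invented.

```python
import numpy as np

def aggregate(responses, scores):
    # Marginalize over query variations: softmax-normalize the VLM's scores
    # within each query, then sum probability mass per distinct answer.
    totals = {}
    for resp_list, score_list in zip(responses, scores):
        s = np.asarray(score_list, dtype=float)
        p = np.exp(s - s.max())
        p /= p.sum()
        for r, w in zip(resp_list, p):
            totals[r] = totals.get(r, 0.0) + w
    return max(totals, key=totals.get)

# three phrasings of "what material is this object?", with per-response scores
best = aggregate(
    [["metal", "wood"], ["metal", "plastic"], ["wood", "metal"]],
    [[2.0, 0.0], [1.0, 1.0], [0.5, 0.3]],
)
```

Pooling normalized scores rather than taking a majority vote lets a confident answer in one query outweigh weak disagreement in another.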
-
|
Strong statistical parity through fair synthetic data
(
Poster
)
>
link
AI-generated synthetic data, in addition to protecting the privacy of original data sets, allows users and data consumers to tailor data to their needs. This paper explores the creation of synthetic data that embodies Fairness by Design, focusing on the statistical parity fairness definition. By equalizing the learned target probability distributions of the synthetic data generator across sensitive attributes, a downstream model trained on such synthetic data provides fair predictions across all thresholds - that is, strongly fair predictions even when inferring from biased original data. This fairness adjustment can either be directly integrated into the sampling process of a synthetic generator or added as a post-processing step. This flexibility allows data consumers to create fair synthetic data and fine-tune the trade-off between accuracy and fairness without any prior assumptions on the data and without re-training the synthetic data generator. |
Ivona Krchova · Michael Platzer · Paul Tiwald 🔗 |
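The post-processing variant can be sketched as shifting each sensitive group's target probabilities toward a common rate. This is a deliberate simplification of the paper's approach: matching group means only equalizes one statistic, whereas equalizing the full learned distributions is what yields parity at every decision threshold. All data below are invented.

```python
import numpy as np

def parity_adjust(p, group):
    # Post-processing sketch: shift each sensitive group's target
    # probabilities so all groups share the overall mean positive rate.
    p = np.asarray(p, dtype=float)
    out = p.copy()
    overall = p.mean()
    for g in np.unique(group):
        mask = group == g
        out[mask] = np.clip(p[mask] + (overall - p[mask].mean()), 0.0, 1.0)
    return out

# biased generator output: group 0 receives positives far more often
p = np.array([0.90, 0.80, 0.85, 0.20, 0.10, 0.15])
group = np.array([0, 0, 0, 1, 1, 1])
p_fair = parity_adjust(p, group)
```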
-
|
On the Limitation of Diffusion Models for Synthesizing Training Datasets
(
Poster
)
>
link
Synthetic samples from diffusion models are a promising substitute for real training datasets when training discriminative models. However, we found that such synthetic datasets degrade classification performance relative to real datasets even when using state-of-the-art diffusion models. This means that modern diffusion models do not perfectly represent the data distribution for the purpose of replicating datasets for training discriminative tasks. This paper investigates the gap between synthetic and real samples by analyzing synthetic samples reconstructed from real samples through the diffusion and reverse processes. By varying the time step at which the reverse process starts in the reconstruction, we can control the trade-off between the information in the original real data and the information added by diffusion models. Through assessing the reconstructed samples and trained models, we found that the synthetic data concentrate in modes of the training data distribution as the reverse step increases, and thus struggle to cover the outer edges of the distribution. Our findings imply that modern diffusion models are insufficient to replicate the training data distribution perfectly, and there is room for improvement in generative modeling for the replication of training datasets. |
Shin'ya Yamaguchi · Takuma Fukuda 🔗 |
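The reconstruction knob described above can be illustrated with a toy 1-D sketch. This is not the paper's method: the "denoiser" below is a hypothetical stand-in that simply contracts toward the training-data mode, mimicking the reported effect that a larger starting step keeps less of the original sample and concentrates reconstructions near the modes.

```python
import numpy as np

rng = np.random.default_rng(0)

def reconstruct(x0, t, mode, n_steps=1000):
    # Diffuse a real sample forward to step t, then run a stand-in reverse
    # process; a real model would run the learned reverse diffusion.
    alpha = 1.0 - t / n_steps                          # toy noise schedule
    x_t = np.sqrt(alpha) * x0 + np.sqrt(1.0 - alpha) * rng.standard_normal(x0.shape)
    w = t / n_steps                                    # share of model-added info
    return (1.0 - w) * x_t + w * mode

x0 = np.array([3.0])       # a real sample near the edge of the distribution
mode = np.array([0.0])     # mode of the (toy) training distribution
near = reconstruct(x0, t=100, mode=mode)   # mostly original information
far = reconstruct(x0, t=900, mode=mode)    # mostly model-added information
```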
-
|
STAR: Improving Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models
(
Poster
)
>
link
Information extraction tasks such as event extraction require an in-depth understanding of the output structure and sub-task dependencies. They heavily rely on task-specific training data in the form of (passage, target structure) pairs to obtain reasonable performance. However, obtaining such data through human annotation is costly, leading to a pressing need for low-resource information extraction approaches that require minimal human labeling for real-world applications. Fine-tuning supervised models with synthesized training data would be a generalizable method, but existing data generation methods either still rely on large-scale ground-truth data or cannot be applied to complicated IE tasks due to their poor performance. To address these challenges, we propose STAR, a data generation method that leverages Large Language Models (LLMs) to synthesize data instances given limited seed demonstrations, thereby boosting low-resource information extraction performance. Our approach involves generating target structures (Y) followed by generating passages (X), all accomplished with the aid of LLMs. We design fine-grained step-by-step instructions to obtain the initial data instances. We further reduce errors and improve data quality through self-reflective error identification and self-refinement with iterative revision. Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks, even surpassing the effectiveness of human-curated data. Human assessment of the data quality shows that STAR-generated data exhibit higher passage quality and align better with the task definitions than human-curated data. |
Mingyu Derek Ma · Xiaoxuan Wang · Po-Nien Kung · P. Jeffrey Brantingham · Nanyun Peng · Wei Wang 🔗 |
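The structure-first pipeline (generate Y, then X, then self-refine) can be sketched as a loop over LLM calls. Everything here is a hypothetical illustration: the prompts are invented, and `toy_llm` is a trivial stand-in for a real LLM API, not STAR's actual instructions.

```python
def star_generate(seed_structures, llm, n_rounds=2):
    # STAR-style sketch: a passage X is generated conditioned on a target
    # structure Y, then iteratively revised until the model reports no errors.
    # `llm` is a hypothetical callable mapping a prompt string to a completion.
    instances = []
    for y in seed_structures:
        x = llm(f"Write a passage expressing the event structure: {y}")
        for _ in range(n_rounds):
            errors = llm(f"List errors in the passage given the structure {y}: {x}")
            if errors.strip().lower() == "none":
                break
            x = llm(f"Revise the passage to fix {errors}: {x}")
        instances.append((x, y))
    return instances

# toy stand-in for a real LLM API (hypothetical; always approves the draft)
def toy_llm(prompt):
    if prompt.startswith("List"):
        return "none"
    return "PASSAGE[" + prompt.split(": ", 1)[1] + "]"

instances = star_generate(["Attack(agent=PersonA, place=CityB)"], toy_llm)
```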
-
|
Feedback-guided Data Synthesis for Imbalanced Classification
(
Poster
)
>
link
The current status quo in machine learning is to use static datasets of real images for training, which often come from long-tailed distributions. With recent advances in generative models, researchers have started augmenting these static datasets with synthetic data, reporting moderate performance improvements on classification tasks. We hypothesize that these performance gains are limited by the lack of feedback from the classifier to the generative model, which would promote the usefulness of the generated samples for improving the classifier's performance. In this work, we introduce a framework for augmenting static datasets with useful synthetic samples, which leverages one-shot feedback from the classifier to drive the sampling of the generative model. In order for the framework to be effective, we find that the samples must be close to the support of the real data of the task at hand and be sufficiently diverse. We validate three feedback criteria on a long-tailed dataset (ImageNet-LT) as well as a group-imbalanced dataset (NICO++). On ImageNet-LT, we achieve state-of-the-art results, with over 4% improvement on underrepresented classes while being twice as efficient in terms of the number of generated synthetic samples. NICO++ also enjoys marked boosts of over 5% in worst-group accuracy. With these results, our framework paves the path towards effectively leveraging state-of-the-art text-to-image models as data sources that can be queried to improve downstream applications. |
Reyhane Askari Hemmat · Mohammad Pezeshki · Florian Bordes · Michal Drozdzal · Adriana Romero 🔗 |
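The two requirements above (closeness to the real-data support, plus diversity) can be sketched as a selection rule over generated samples. This is an illustrative simplification, not the paper's three criteria: the loss, realness, and diversity-bucket scores are hypothetical stand-ins for model outputs.

```python
import numpy as np

def select_useful(loss, realness, bucket, k):
    # One-shot feedback selection sketch: prefer generated samples the current
    # classifier gets wrong (high loss), keep only samples close to the real
    # data support (realness gate), and take at most one per diversity bucket.
    chosen, seen = [], set()
    for i in np.argsort(-np.asarray(loss)):
        if realness[i] < 0.5 or bucket[i] in seen:
            continue
        chosen.append(int(i))
        seen.add(bucket[i])
        if len(chosen) == k:
            break
    return chosen

chosen = select_useful(
    loss=[0.9, 0.8, 0.1, 0.95],
    realness=[0.9, 0.2, 0.9, 0.9],
    bucket=[0, 0, 1, 0],
    k=2,
)
```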
-
|
Synthetic Imitation Edit Feedback for Factual Alignment in Clinical Summarization
(
Poster
)
>
link
Large Language Models (LLMs) like the GPT and LLaMA families have demonstrated exceptional capabilities in capturing and condensing critical contextual information and achieving state-of-the-art performance in the summarization task. However, community concerns about these models' hallucination issues continue to rise. LLMs sometimes generate factually hallucinated summaries, which can be extremely harmful in clinical domain NLP tasks (e.g., clinical note summarization), where factually incorrect statements can lead to critically erroneous diagnoses. Fine-tuning LLMs using human feedback has shown promise for aligning LLMs to be factually consistent during generation, but such a training procedure requires high-quality human-annotated data, which can be extremely expensive to obtain in the clinical domain. In this work, we propose a new pipeline that uses ChatGPT instead of human experts to generate high-quality feedback data for improving factual consistency in the clinical note summarization task. We focus specifically on edit feedback because recent work discusses the shortcomings of human alignment via preference feedback in complex situations (such as clinical NLP tasks that require extensive expert knowledge), as well as some advantages of collecting edit feedback from domain experts. In addition, although GPT has reached the expert level in many clinical NLP tasks (e.g., USMLE QA), there is not much previous work discussing whether GPT can generate expert-level edit feedback for LMs in the clinical note summarization task. We hope to fill this gap. Finally, our evaluations demonstrate the potential use of GPT edits in human alignment, especially from a factuality perspective. |
Prakamya Mishra · Zonghai Yao · shuwei chen · Beining Wang · Rohan Mittal · Hong Yu 🔗 |
-
|
Privacy Measurements in Tabular Synthetic Data: State of the Art and Future Research Directions
(
Poster
)
>
link
Synthetic data (SD) have garnered attention as a privacy enhancing technology. Unfortunately, there is no standard for assessing their degree of privacy protection. In this paper, we discuss proposed assessment approaches. This contributes to the development of SD privacy standards; stimulates multi-disciplinary discussion; and helps SD researchers make informed modeling and evaluation decisions. |
Alexander Boudewijn · Andrea Filippo Ferraris · Daniele Panfilo · Vanessa Cocca · Sabrina Zinutti · Karel De Schepper · Carlo Chauvenet 🔗 |
-
|
On Consistent Bayesian Inference from Synthetic Data
(
Poster
)
>
link
Generating synthetic data, with or without differential privacy, has attracted significant attention as a potential solution to the dilemma between making data easily available, and the privacy of data subjects. Several works have shown that consistency of downstream analyses from synthetic data, including accurate uncertainty estimation, requires accounting for the synthetic data generation. There are very few methods of doing so, most of them for frequentist analysis. In this paper, we study how to perform consistent Bayesian inference from synthetic data. We prove that mixing posterior samples obtained separately from multiple large synthetic datasets converges to the posterior of the downstream analysis under standard regularity conditions when the analyst's model is compatible with the data provider's model. We also present several examples showing how the theory works in practice, and showing how Bayesian inference can fail when the compatibility assumption is not met, or the synthetic dataset is not significantly larger than the original. |
Ossi Räisä · Joonas Jälkö · Antti Honkela 🔗 |
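The paper's mixing result can be illustrated with a toy conjugate example: run the same Bayesian analysis separately on several large synthetic datasets, then pool the posterior draws. The normal-mean model below is a hypothetical stand-in for the analyst's downstream analysis, and the synthetic sets are simulated rather than produced by a real generator.

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_samples(data, n_draws, prior_var=100.0, noise_var=1.0):
    # Conjugate posterior draws for a normal mean with known noise variance;
    # a stand-in for whatever Bayesian analysis the analyst actually runs.
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * data.sum() / noise_var
    return rng.normal(post_mean, np.sqrt(post_var), n_draws)

# several large synthetic datasets released by a (simulated) data provider
true_mu = 2.0
synthetic_sets = [rng.normal(true_mu, 1.0, 5000) for _ in range(5)]

# run the analysis separately on each synthetic set, then mix the draws
mixed = np.concatenate([posterior_samples(d, 1000) for d in synthetic_sets])
```

Per the paper's theory, this mixture approaches the posterior of the downstream analysis only when the synthetic datasets are large and the analyst's model is compatible with the data provider's.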
-
|
Differentially Private Synthetic Data via Foundation Model APIs 1: Images
(
Oral
)
>
link
Generating differentially private (DP) synthetic data that closely resembles the original private data is a scalable way to mitigate privacy concerns in the current data-driven world. In contrast to current practices that train customized models for this task, we aim to generate DP Synthetic Data via APIs (DPSDA), where we treat foundation models as blackboxes and only utilize their inference APIs. Such API-based, training-free approaches are easier to deploy as exemplified by the recent surge in the number of API-based apps. These approaches can also leverage the power of large foundation models which are only accessible via their inference APIs. However, this comes with greater challenges due to strictly more restrictive model access and the need to protect privacy from the API provider. In this paper, we present a new framework called Private Evolution (PE) to solve this problem and show its initial promise on synthetic images. Surprisingly, PE can match or even outperform state-of-the-art (SOTA) methods without any model training. For example, on CIFAR10 (with ImageNet as the public data), we achieve FID ≤ 7.9 with privacy cost ε = 0.67, significantly improving the previous SOTA from ε = 32. We further demonstrate the promise of applying PE on large foundation models such as Stable Diffusion to tackle challenging private datasets with a small number of high-resolution images. |
Zinan Lin · Sivakanth Gopi · Janardhan Kulkarni · Harsha Nori · Sergey Yekhanin 🔗 |
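The Private Evolution loop can be sketched in one dimension. This toy version is a loose illustration only: real PE works on images via foundation-model generation and variation APIs, uses a DP nearest-neighbor histogram with calibrated noise, and accounts for the total privacy budget; the Laplace scale and all numbers below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def pe_step(candidates, private, epsilon):
    # One toy 1-D Private Evolution iteration: each private point votes for
    # its nearest candidate, the vote histogram is privatized with Laplace
    # noise, candidates are resampled in proportion to the noisy votes, then
    # perturbed (the "variation" a foundation-model API would produce).
    votes = np.zeros(len(candidates))
    for x in private:
        votes[np.argmin(np.abs(candidates - x))] += 1.0
    noisy = np.clip(votes + rng.laplace(0.0, 1.0 / epsilon, len(votes)), 0.0, None)
    if noisy.sum() == 0.0:
        noisy = np.ones_like(noisy)
    picked = rng.choice(candidates, size=len(candidates), p=noisy / noisy.sum())
    return picked + rng.normal(0.0, 0.1, len(picked))

private = rng.normal(5.0, 0.5, 500)    # private data, never released
cands = rng.uniform(-10.0, 10.0, 50)   # initial (public) API samples
for _ in range(10):
    cands = pe_step(cands, private, epsilon=1.0)
```

After a few iterations the candidate population drifts toward the private distribution even though only noisy vote counts ever depend on the private data.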
-
|
Synthetic Health-related Longitudinal Data with Mixed-type Variables Generated using Diffusion Models
(
Poster
)
>
link
This paper introduces a novel method for simulating Electronic Health Records (EHRs) using Diffusion Probabilistic Models (DPMs). We showcase the ability of DPMs to generate longitudinal EHRs with mixed-type variables – numeric, binary, and categorical. Our approach is benchmarked against existing Generative Adversarial Network (GAN)-based methods in two clinical scenarios: management of acute hypotension in the intensive care unit and antiretroviral therapy for people with human immunodeficiency virus. Our DPM-simulated datasets not only minimise patient disclosure risk but also outperform GAN-generated datasets in terms of realism. These datasets also prove effective for training downstream machine learning algorithms, including reinforcement learning and Cox proportional hazards models for survival analysis. |
Nicholas Kuo · Louisa Jorm · Sebastiano Barbieri 🔗 |
-
|
Diffusion-based Semantic-Discrepant Outlier Generation for Out-of-Distribution Detection
(
Poster
)
>
link
Out-of-distribution (OOD) detection, which determines whether a given sample is part of the training distribution, has recently shown promising results by training with synthetic OOD datasets. The important properties of effective synthetic OOD datasets are two-fold: (i) the OOD sample should be close to the in-distribution (ID) data, but (ii) represent semantically shifted information. To achieve this, we introduce a novel framework that consists of Semantic-Discrepant (SD) outlier generation and an advanced OOD detection method. For SD outlier generation, we utilize a conditional diffusion model trained with pseudo-labels. Then, we propose a simple yet effective method, semantic-discrepant guidance, allowing the model to generate realistic outliers that contain an incoherent semantic shift while preserving nuisance information (e.g., background). Furthermore, we suggest SD outlier-aware OOD detector training and scoring methods. Our experiments demonstrate the effectiveness of our framework on the CIFAR-10 dataset. We achieve an AUROC of 98% when CIFAR-100 is given as OOD. The SD outlier dataset on CIFAR-10 is available at https://zenodo.org/record/8394847. |
Suhee Yoon · Sanghyu Yoon · Hankook Lee · Sangjun Han · Ye Seul Sim · Kyungeun Lee · Hyeseung Cho · Woohyung Lim 🔗 |
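One way outlier-aware scoring can work is to train a classifier with an extra head for synthetic outliers and use its softmax probability as the OOD score. This is a generic sketch under that assumption, not necessarily the paper's exact scoring method; the logits below are invented.

```python
import numpy as np

def ood_score(id_logits, outlier_logit):
    # Outlier-aware scoring sketch: a classifier with K in-distribution heads
    # plus one extra head assumed to be trained on SD outliers; the softmax
    # probability of the extra head serves as the OOD score.
    z = np.concatenate([np.asarray(id_logits, dtype=float), [outlier_logit]])
    p = np.exp(z - z.max())
    p /= p.sum()
    return float(p[-1])

confident_id = ood_score([5.0, 1.0, 1.0], 0.5)   # looks in-distribution
likely_ood = ood_score([1.0, 1.0, 1.0], 4.0)     # outlier head dominates
```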