Compared to other machine learning tasks like classification and regression, synthetic data generation is a new area of inquiry for machine learning. One challenge we encountered early on in working with synthetic data was the lack of standardized metrics for evaluating it. Although evaluation for tabular synthetic data is less subjective than for ML-generated images or natural language, it comes with its own specific considerations. For instance, metrics must take into account what the data is being generated for, as well as tradeoffs between quality, privacy, and utility that are inherent to this type of data.
To begin addressing this need, we created an open source library called SDMetrics, which contains a number of synthetic data evaluation tools. We identified inherent hierarchies in these evaluations — for example, columnwise comparison vs. correlation matrix comparison — and built ways to test and validate these metrics. The library also provides user-friendly, focused reports and mechanisms to prevent "metric fatigue."
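To make the two levels of that hierarchy concrete, here is a minimal sketch of columnwise comparison and correlation matrix comparison, written with pandas and SciPy rather than SDMetrics itself. The function names (columnwise_scores, correlation_matrix_score) and the specific scoring formulas are illustrative assumptions for this example, not the library's API.

```python
# Illustrative sketch of the two evaluation levels discussed above;
# this is NOT SDMetrics' implementation, just the underlying ideas.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def columnwise_scores(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.Series:
    """Column-level comparison: 1 - KS statistic per numeric column (1.0 = identical marginal shapes)."""
    scores = {}
    for col in real.select_dtypes(include="number").columns:
        statistic, _ = ks_2samp(real[col].dropna(), synthetic[col].dropna())
        scores[col] = 1.0 - statistic
    return pd.Series(scores)


def correlation_matrix_score(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Table-level comparison: how well the synthetic data preserves pairwise correlations."""
    real_corr = real.corr(numeric_only=True)
    synth_corr = synthetic.corr(numeric_only=True)
    # Correlations lie in [-1, 1], so absolute differences lie in [0, 2];
    # dividing by 2 maps the mean difference into [0, 1].
    diff = (real_corr - synth_corr).abs().to_numpy()
    return 1.0 - float(np.nanmean(diff)) / 2.0


if __name__ == "__main__":
    # Toy example: "synthetic" data is the real data plus small noise.
    rng = np.random.default_rng(0)
    real = pd.DataFrame({"age": rng.normal(40, 10, 1000),
                         "income": rng.normal(50_000, 8_000, 1000)})
    synthetic = real + rng.normal(0, 1, real.shape)
    print(columnwise_scores(real, synthetic))
    print(correlation_matrix_score(real, synthetic))
```

A columnwise score near 1.0 means each synthetic column's distribution matches its real counterpart; a high correlation score means relationships between columns are also preserved, which is the kind of higher-level check the hierarchy captures.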
Author Information
Kalyan Veeramachaneni (Massachusetts Institute of Technology)
More from the Same Authors
- 2019 Poster: Modeling Tabular data using Conditional GAN »
  Lei Xu · Maria Skoularidou · Alfredo Cuesta-Infante · Kalyan Veeramachaneni
- 2013 Workshop: Data Driven Education »
  Jonathan Huang · Sumit Basu · Kalyan Veeramachaneni