

Invited Talk in Workshop: Synthetic Data for Empowering ML Research

Invited Talk #2, Kalyan Veeramachaneni, SDMetrics: Evaluating Synthetic Data



Abstract:

Compared to other machine learning tasks like classification and regression, synthetic data generation is a new area of inquiry for machine learning. One challenge we encountered early on in working with synthetic data was the lack of standardized metrics for evaluating it. Although evaluation for tabular synthetic data is less subjective than for ML-generated images or natural language, it comes with its own specific considerations. For instance, metrics must take into account what the data is being generated for, as well as tradeoffs between quality, privacy, and utility that are inherent to this type of data.

To begin addressing this need, we created an open-source library called SDMetrics, which contains a number of synthetic data evaluation tools. We identified inherent hierarchies that exist in these evaluations (for example, columnwise comparison vs. correlation matrix comparison) and built ways to test and validate these metrics. The library also provides user-friendly, focused reports and mechanisms to prevent "metric fatigue."
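The hierarchy mentioned above can be illustrated with a small sketch: a column-wise metric checks whether each synthetic column's marginal distribution matches the real one, while a table-level metric checks whether relationships *between* columns (here, the correlation matrix) are preserved. The functions below are hypothetical helpers written for illustration, not the SDMetrics API; both return a score in [0, 1], higher meaning more faithful.

```python
import numpy as np

def column_shape_score(real_col, synth_col, bins=20):
    # Column-wise fidelity: 1 minus the total variation distance between
    # histograms of one real column and one synthetic column.
    # (Hypothetical helper for illustration, not an SDMetrics function.)
    lo = min(real_col.min(), synth_col.min())
    hi = max(real_col.max(), synth_col.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(real_col, bins=edges)
    q, _ = np.histogram(synth_col, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    return 1.0 - 0.5 * np.abs(p - q).sum()

def correlation_score(real, synth):
    # Table-level fidelity: mean absolute difference between the two
    # correlation matrices, rescaled so the score lies in [0, 1]
    # (correlations can differ by at most 2).
    diff = np.abs(np.corrcoef(real, rowvar=False)
                  - np.corrcoef(synth, rowvar=False))
    return 1.0 - diff.mean() / 2.0

# Toy example: the "synthetic" data has the right marginals but a much
# weaker correlation between the two columns than the "real" data.
rng = np.random.default_rng(0)
real = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=2000)
synth = rng.multivariate_normal([0, 0], [[1.0, 0.1], [0.1, 1.0]], size=2000)

col_scores = [column_shape_score(real[:, j], synth[:, j]) for j in range(2)]
corr = correlation_score(real, synth)
```

On this example the column-wise scores are high (each marginal is a standard normal in both tables) while the correlation score is noticeably lower, which is exactly the kind of failure a purely column-wise evaluation would miss.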
