
Invited Talk #2, Kalyan Veeramachaneni, SDMetrics: Evaluating Synthetic Data
Kalyan Veeramachaneni

Fri Dec 02 09:20 AM -- 09:45 AM (PST)

Compared to established machine learning tasks such as classification and regression, synthetic data generation is a relatively new area of inquiry. One challenge we encountered early on in working with synthetic data was the lack of standardized metrics for evaluating it. Although evaluation of tabular synthetic data is less subjective than evaluation of ML-generated images or natural language, it comes with its own specific considerations. For instance, metrics must take into account the purpose for which the data is being generated, as well as the tradeoffs between quality, privacy, and utility that are inherent to this type of data.

To begin addressing this need, we created an open-source library called SDMetrics, which contains a number of synthetic data evaluation tools. We identified inherent hierarchies that exist in these evaluations (for example, columnwise comparison vs. correlation matrix comparison) and built ways to test and validate these metrics. The library also provides user-friendly, focused reports and mechanisms to prevent "metric fatigue."
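
As a rough illustration of the hierarchy mentioned above, the sketch below contrasts a columnwise comparison with a correlation comparison on a toy table. It uses only the Python standard library and is not the SDMetrics API. The function names (`ks_complement`, `correlation_similarity`), the scoring formulas, and the toy data are assumptions made for this example.

```python
from bisect import bisect_right
from math import sqrt


def ks_complement(real, synthetic):
    """Columnwise score: 1 minus the two-sample Kolmogorov-Smirnov statistic.

    1.0 means the two empirical distributions are identical.
    (Hypothetical helper for illustration, not the library's implementation.)
    """
    real_sorted, synth_sorted = sorted(real), sorted(synthetic)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so check those points.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_similarity(real_table, synthetic_table, col_a, col_b):
    """Pairwise score: 1 - |r_real - r_synth| / 2, which lies in [0, 1]."""
    r_real = pearson(real_table[col_a], real_table[col_b])
    r_synth = pearson(synthetic_table[col_a], synthetic_table[col_b])
    return 1.0 - abs(r_real - r_synth) / 2.0


# Toy data (made up): the synthetic table reproduces each column's values
# exactly but reverses how the two columns are paired.
real = {"age": [20, 30, 40, 50], "income": [20, 30, 40, 50]}
synthetic = {"age": [20, 30, 40, 50], "income": [50, 40, 30, 20]}

column_scores = {col: ks_complement(real[col], synthetic[col]) for col in real}
pair_score = correlation_similarity(real, synthetic, "age", "income")

print(column_scores)
print(pair_score)
```

In this toy case every column scores perfectly on its own, while the pairwise score shows that the relationship between the columns has been lost. That is one reason a columnwise comparison alone may not be sufficient.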

Author Information

Kalyan Veeramachaneni (Massachusetts Institute of Technology)
