Poster
in
Workshop: Table Representation Learning

Tabular Data Generation: Can We Fool XGBoost ?

EL Hacen Zein ⋅ Tanguy Urvoy

Keywords: Generative Models tabular data generative adversarial networks Classifier two-sample test Numerical features encoding Variational Autoencoders

Project Page [ Poster] [ OpenReview]

Abstract

If by 'realistic' we mean indistinguishable from (fresh) real data, generating realistic synthetic tabular data is far from being a trivial task. We present here a series of experiments showing that strong classifiers like XGBoost are able to distinguish state-of-the-art synthetic data from fresh real data almost perfectly on several tabular datasets. By studying the important features of these classifiers, we remark that mixed-type (continuous/discrete) and ill-distributed numerical columns are the ones which are the less faithfully reconstituted. We hence propose and experiment a series of automated reversible column-wise encoders which improve the realism of the generators.

Chat is not available.