

Poster in Workshop: Table Representation Learning

Tabular Data Generation: Can We Fool XGBoost ?

EL Hacen Zein · Tanguy Urvoy

Keywords: [ Variational Autoencoders ] [ Numerical Features Encoding ] [ Classifier Two-Sample Test ] [ Generative Adversarial Networks ] [ Tabular Data ] [ Generative Models ]


Abstract:

If by 'realistic' we mean indistinguishable from (fresh) real data, generating realistic synthetic tabular data is far from a trivial task. We present a series of experiments showing that strong classifiers such as XGBoost can distinguish state-of-the-art synthetic data from fresh real data almost perfectly on several tabular datasets. By studying which features these classifiers rely on most, we observe that mixed-type (continuous/discrete) and ill-distributed numerical columns are the least faithfully reproduced. We therefore propose and evaluate a series of automated, reversible column-wise encoders that improve the realism of the generators.
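
The classifier two-sample test named in the keywords can be sketched in a few lines. The example below is an illustration under stated assumptions, not the authors' experimental setup: a 1-nearest-neighbour classifier from the standard library stands in for XGBoost, the data is a single hypothetical mixed-type column (exact zeros mixed with continuous values), and the "generator" is a toy that blurs the zeros into small noisy values. All function names are made up for this sketch.

```python
import random


def one_nn_predict(train_x, train_y, x):
    """Return the label of the training point closest to x (1-nearest-neighbour)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def c2st_accuracy(real, synthetic, seed=0, test_fraction=0.5):
    """Classifier two-sample test on one numerical column.

    Labels real rows 1 and synthetic rows 0, trains on one part and
    returns the accuracy on the held-out part. About 0.5 means the two
    samples are indistinguishable to this classifier; values close to
    1.0 mean the synthetic data is detected.
    """
    rng = random.Random(seed)
    rows = [(x, 1) for x in real] + [(x, 0) for x in synthetic]
    rng.shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    test, train = rows[:n_test], rows[n_test:]
    train_x = [x for x, _ in train]
    train_y = [y for _, y in train]
    hits = sum(one_nn_predict(train_x, train_y, x) == y for x, y in test)
    return hits / len(test)


def sample_real(rng, n):
    # Hypothetical mixed-type column: an exact zero half of the time,
    # otherwise a continuous positive value.
    return [0.0 if rng.random() < 0.5 else rng.expovariate(1.0) for _ in range(n)]


def sample_smoothed(rng, n):
    # Hypothetical generator flaw: the point mass at zero is blurred
    # into small Gaussian noise, the continuous part is reproduced well.
    return [
        rng.gauss(0.0, 0.05) if rng.random() < 0.5 else rng.expovariate(1.0)
        for _ in range(n)
    ]


rng = random.Random(42)
real_a = sample_real(rng, 600)
real_b = sample_real(rng, 600)   # fresh real data, used as a baseline
fake = sample_smoothed(rng, 600)

baseline = c2st_accuracy(real_a, real_b)
detected = c2st_accuracy(real_a, fake)
print(f"real vs fresh real: {baseline:.2f}")
print(f"real vs synthetic:  {detected:.2f}")
```

Comparing real data against fresh real data gives the chance-level reference that a realistic generator should approach. In this toy setting the blurred zeros alone are enough to push the score well above that reference, which is the kind of mixed-type column failure the abstract describes. A stronger classifier such as XGBoost would be plugged in where `one_nn_predict` is used.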
