Poster
Why do tree-based models still outperform deep learning on typical tabular data?
Leo Grinsztajn · Edouard Oyallon · Gael Varoquaux
While deep learning has enabled tremendous progress on text and image datasets, its superiority on tabular data is not clear. We contribute extensive benchmarks of standard and novel deep learning methods as well as tree-based models such as XGBoost and Random Forests, across a large number of datasets and hyperparameter combinations. We define a standard set of 45 datasets from varied domains with clear characteristics of tabular data, and a benchmarking methodology that accounts for both fitting the models and finding good hyperparameters. Results show that tree-based models remain state-of-the-art on medium-sized data (~10K samples), even without accounting for their superior speed. To understand this gap, we conduct an empirical investigation into the differing inductive biases of tree-based models and neural networks. This leads to a series of challenges which should guide researchers aiming to build tabular-specific neural networks: 1) be robust to uninformative features, 2) preserve the orientation of the data, and 3) be able to easily learn irregular functions. To stimulate research on tabular architectures, we contribute a standard benchmark and raw data for baselines: every point of a 20,000-compute-hour hyperparameter search for each learner.
Author Information
Leo Grinsztajn (Université Paris Saclay-Inria)
Edouard Oyallon (CNRS/ISIR)
Gael Varoquaux (INRIA)
More from the Same Authors
- 2021 Spotlight: What's a good imputation to predict with missing values?
  Marine Le Morvan · Julie Josse · Erwan Scornet · Gael Varoquaux
- 2021: AI as statistical methods for imperfect theories
  Gael Varoquaux
- 2022 Poster: On Non-Linear operators for Geometric Deep Learning
  Grégoire Sergeant-Perthuis · Jakob Maier · Joan Bruna · Edouard Oyallon
- 2021 Poster: What's a good imputation to predict with missing values?
  Marine Le Morvan · Julie Josse · Erwan Scornet · Gael Varoquaux
- 2020 Poster: NeuMiss networks: differentiable programming for supervised learning with missing values
  Marine Le Morvan · Julie Josse · Thomas Moreau · Erwan Scornet · Gael Varoquaux
- 2020 Oral: NeuMiss networks: differentiable programming for supervised learning with missing values
  Marine Le Morvan · Julie Josse · Thomas Moreau · Erwan Scornet · Gael Varoquaux
- 2019 Poster: Comparing distributions: $\ell_1$ geometry improves kernel two-sample testing
  Meyer Scetbon · Gael Varoquaux
- 2019 Spotlight: Comparing distributions: $\ell_1$ geometry improves kernel two-sample testing
  Meyer Scetbon · Gael Varoquaux
- 2019 Poster: Manifold-regression to predict from MEG/EEG brain signals without source modeling
  David Sabbagh · Pierre Ablin · Gael Varoquaux · Alexandre Gramfort · Denis A. Engemann
- 2017: Scikit-learn & nilearn: Democratisation of machine learning for brain imaging (INRIA)
  Gael Varoquaux
- 2017: Invited Talk: "Tales from fMRI: Learning from limited labeled data"
  Gael Varoquaux
- 2017 Poster: Learning Neural Representations of Human Cognition across Many fMRI Studies
  Arthur Mensch · Julien Mairal · Danilo Bzdok · Bertrand Thirion · Gael Varoquaux
- 2016 Poster: Learning brain regions via large-scale online structured sparse dictionary learning
  Elvis DOHMATOB · Arthur Mensch · Gael Varoquaux · Bertrand Thirion
- 2015 Poster: Semi-Supervised Factored Logistic Regression for High-Dimensional Neuroimaging Data
  Danilo Bzdok · Michael Eickenberg · Olivier Grisel · Bertrand Thirion · Gael Varoquaux
- 2013 Poster: Mapping paradigm ontologies to and from the brain
  Yannick Schwartz · Bertrand Thirion · Gael Varoquaux