Invited Talk

The Data-Centric Era: How ML is Becoming an Experimental Science

Isabelle Guyon

Moderator: Joaquin Vanschoren

Hall H (level 1)

Abstract:

NeurIPS has been in existence for more than three decades, each marked by a dominant trend. The pioneering years saw the burgeoning of back-prop nets; the coming-of-age years blossomed with convex optimization, regularization, Bayesian methods, boosting, and kernel methods, to name a few; and the junior years have been dominated by deep nets and big data. Now, recent analyses conclude that using ever bigger data and deeper networks is not a sustainable way of progressing. Meanwhile, other indicators show that Machine Learning is increasingly reliant upon good data and benchmarks, not only to train more powerful and/or more compact models, but also to soundly evaluate new ideas and to stress-test models on their reliability, fairness, and protection against various attacks, including privacy attacks.

Simultaneously, in 2021, the NeurIPS Datasets and Benchmarks track was launched and the Data-Centric AI initiative was born. This kickstarted the "data-centric era", which is gaining momentum in response to the new needs of data scientists who, admittedly, spend more time understanding problems, designing experimental settings, and engineering datasets than designing and training ML models.

We will retrace the enormous collective efforts made by our community since the 1980s to share datasets and benchmarks, putting forward important milestones that led us to today's effervescence. We will pick a few hot topics that have raised controversy and have engendered novel, thought-provoking contributions. Finally, we will highlight some of the most pressing issues that must be addressed by the community.