Oral in Datasets and Benchmarks: Dataset and Benchmark Track 2

Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research

Bernard Koch · Emily Denton · Alex Hanna · Jacob G Foster


Abstract:

Benchmark datasets play a central role in the organization of machine learning research. They coordinate researchers around shared research problems and serve as a measure of progress towards shared goals. Despite the foundational role benchmarking practices play in the field, relatively little attention has been paid to the dynamics of benchmark dataset use and reuse within and across machine learning subcommunities. In this work, we dig into these dynamics by studying how dataset usage patterns differ across machine learning subcommunities and across time from 2015 to 2020. We find increasing concentration on fewer and fewer datasets within task communities, significant adoption of datasets from other tasks, and concentration across the field on datasets that have been introduced by researchers situated within a small number of elite institutions. Our results have implications for scientific evaluation, AI ethics, and equity and access within the field.