Moderator: Moritz Hardt
Benchmark datasets have played a crucial role in driving empirical progress in machine learning, leading to an interesting dynamic between those on a quest for state-of-the-art performance and those creating new challenging benchmarks. In this panel, we reflect on how benchmarks can lead to scientific progress, both in terms of new algorithmic innovations and improved scientific understanding. First, what qualities of a machine learning system should a good benchmark dataset seek to measure? How well can benchmarks assess performance in dynamic and novel environments, or in tasks with an open-ended set of acceptable answers? Benchmarks can also raise significant ethical concerns, including poor data collection practices, under- and misrepresentation of subjects, and misspecification of objectives. Second, even given high-quality, carefully constructed benchmarks, which research questions can we hope to answer through leaderboard-climbing, and which ones are deprioritized or impossible to answer due to the limitations of the benchmark paradigm? In general, we hope to deepen the community’s awareness of the important role of benchmarks in advancing the science of machine learning.