In a typical machine learning competition or shared task, success is measured in terms of systems' ability to reproduce gold-standard labels. The potential impact of the systems being developed on stakeholder populations, if considered at all, is studied separately from system "performance". Given the tight train-eval cycle of both shared tasks and system development in general, we argue that making disparate impact on vulnerable populations visible in dataset and metric design will be key to making the potential for such impact present and salient to developers. We see this as an effective way to promote the development of machine learning technology that is helpful for people, especially those who have been subject to marginalization. This talk will explore how to develop such shared tasks, considering task choice, stakeholder community input, and annotation and metric design desiderata.
Joint work with Hal Daumé III, University of Maryland, Bernease Herman, University of Washington, and Brandeis Marshall, Spelman College.
Emily M. Bender (University of Washington)