Workshop on Dataset Curation and Security
Nathalie Baracaldo Angel · Yonatan Bisk · Avrim Blum · Michael Curry · John Dickerson · Micah Goldblum · Tom Goldstein · Bo Li · Avi Schwarzschild

Fri Dec 11 06:00 AM -- 11:00 AM (PST)
Event URL: https://securedata.lol/

Classical machine learning research has focused largely on models, optimizers, and computational challenges. As technical progress and hardware advancements ease these challenges, practitioners are now finding that the limitations and faults of their models are the result of their datasets. This is particularly true of deep networks, which often rely on datasets that are too large and unwieldy for domain experts to curate by hand. This workshop addresses issues in the following areas: data harvesting, dealing with the challenges and opportunities involved in creating and labeling massive datasets; data security, dealing with protecting datasets against risks of poisoning and backdoor attacks; policy, security, and privacy, dealing with the social, ethical, and regulatory issues involved in collecting large datasets, especially with regard to privacy; and data bias, related to the potential of biased datasets to result in biased models that harm members of certain groups. Dates and details can be found at securedata.lol.

Fri 6:00 a.m. - 6:30 a.m.
Dawn Song (topic TBD) (Invited talk)
Dawn Song
Fri 6:30 a.m. - 7:00 a.m.

Large-scale vision benchmarks have driven—and often even defined—progress in machine learning. However, these benchmarks are merely proxies for the real-world tasks we actually care about. How well do our benchmarks capture such tasks?

In this talk, I will discuss the alignment between our benchmark-driven ML paradigm and the real-world use cases that motivate it. First, we will explore examples of biases in the ImageNet dataset, and how state-of-the-art models exploit them. We will then demonstrate how these biases arise as a result of design choices in the data collection and curation processes.

Throughout, we illustrate how one can leverage relatively standard tools (e.g., crowdsourcing, image processing) to quantify the biases that we observe. Based on joint work with Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras and Kai Xiao.

Aleksander Madry
Fri 7:00 a.m. - 7:15 a.m.
Discussion (Discussion panel)
Fri 7:15 a.m. - 7:30 a.m.
Fri 7:30 a.m. - 8:00 a.m.
Darrell West (TBD) (Invited talk)
Darrell West
Fri 8:00 a.m. - 8:30 a.m.
Adversarial, Socially Aware, and Commonsensical Data (Invited talk)
Yejin Choi
Fri 8:30 a.m. - 8:45 a.m.
Discussion panel (Discussion)
Fri 8:45 a.m. - 10:00 a.m.
Lunch Break
Fri 10:00 a.m. - 10:30 a.m.
Dataset Curation via Active Learning (Invited talk)
Robert Nowak
Fri 10:30 a.m. - 11:00 a.m.
Don't Steal Data (Invited talk)
Liz O'Sullivan
Fri 11:30 a.m. - 1:00 p.m.
Poster Session

Author Information

Nathalie Baracaldo Angel (IBM Research AI)
Yonatan Bisk (LTI @ CMU)
Avrim Blum (Toyota Technological Institute at Chicago)
Michael Curry (University of Maryland)
John Dickerson (University of Maryland)
Micah Goldblum (University of Maryland)
Tom Goldstein (University of Maryland)
Bo Li (UIUC)
Avi Schwarzschild (University of Maryland)
