Data is one of the key drivers of progress in machine learning. Modern datasets require scale far beyond the ability of individual domain experts to produce. To overcome this limitation, a wide variety of techniques have been developed to build large datasets efficiently, including crowdsourcing, automated labeling, weak supervision, and many more. This tutorial describes classical and modern methods for building datasets that go beyond manual labeling. It covers both theoretical and practical aspects of dataset construction. Theoretically, we discuss guarantees for a variety of crowdsourcing, active-learning-based, and weak supervision techniques, with a particular focus on generalization properties of downstream models trained on the resulting datasets. Practically, we describe several popular systems implementing such techniques and their use in industry and beyond. We cover both the promise and potential pitfalls of using such methods. Finally, we compare automated dataset construction with other popular approaches to dealing with scarce labeled data, including few- and zero-shot methods enabled by foundation models.
Mon 5:00 p.m. - 6:50 p.m. | Tutorial part 1 | Frederic Sala · Ramya Korlakai Vinayak
Mon 6:50 p.m. - 7:00 p.m. | Q & A | Frederic Sala · Ramya Korlakai Vinayak
Mon 7:00 p.m. - 7:05 p.m. | Break to welcome panellists
Mon 7:05 p.m. - 7:30 p.m. | Panel | Mayee Chen · Alexander Ratner · Robert Nowak · Cody Coleman · Ramya Korlakai Vinayak
Author Information
Frederic Sala (University of Wisconsin-Madison)
Ramya Korlakai Vinayak (University of Wisconsin-Madison)
More from the Same Authors
- 2022: Anomaly Detection with Multiple Reference Datasets in High Energy Physics
  Mayee Chen · Benjamin Nachman · Frederic Sala
- 2022: AutoML for Climate Change: A Call to Action
  Renbo Tu · Nicholas Roberts · Vishak Prasad C · Sibasis Nayak · Paarth Jain · Frederic Sala · Ganesh Ramakrishnan · Ameet Talwalkar · Willie Neiswanger · Colin White
- 2022: Domain Generalization with Nuclear Norm Regularization
  Zhenmei Shi · Yifei Ming · Ying Fan · Frederic Sala · Yingyu Liang
- 2023 Poster: Mitigating Source Bias for Fairer Weak Supervision
  Changho Shin · Sonia Cromp · Dyah Adila · Frederic Sala
- 2023 Poster: Geometry-Aware Adaptation for Pretrained Models
  Nicholas Roberts · Xintong Li · Dyah Adila · Sonia Cromp · Tzu-Heng Huang · Jitian Zhao · Frederic Sala
- 2023 Poster: Promises and Pitfalls of Threshold-based Auto-labeling
  Harit Vishwakarma · Heguang Lin · Frederic Sala · Ramya Korlakai Vinayak
- 2023 Poster: Skill-it! A data-driven skills framework for understanding and training language models
  Mayee Chen · Nicholas Roberts · Kush Bhatia · Jue WANG · Ce Zhang · Frederic Sala · Christopher Ré
- 2023 Poster: Train 'n Trade: Foundations of Parameter Markets
  Tzu-Heng Huang · Harit Vishwakarma · Frederic Sala
- 2023 Poster: Embroid: Unsupervised Prediction Smoothing Can Improve Few-Shot Classification
  Neel Guha · Mayee Chen · Kush Bhatia · Azalia Mirhoseini · Frederic Sala · Christopher Ré
- 2022 Competition: AutoML Decathlon: Diverse Tasks, Modern Methods, and Efficiency at Scale
  Samuel Guo · Cong Xu · Nicholas Roberts · Misha Khodak · Junhong Shen · Evan Sparks · Ameet Talwalkar · Yuriy Nevmyvaka · Frederic Sala · Anderson Schneider
- 2022: Panel
  Mayee Chen · Alexander Ratner · Robert Nowak · Cody Coleman · Ramya Korlakai Vinayak
- 2022: Q & A
  Frederic Sala · Ramya Korlakai Vinayak
- 2022: Tutorial part 1
  Frederic Sala · Ramya Korlakai Vinayak
- 2022 Poster: AutoWS-Bench-101: Benchmarking Automated Weak Supervision with 100 Labels
  Nicholas Roberts · Xintong Li · Tzu-Heng Huang · Dyah Adila · Spencer Schoenberg · Cheng-Yu Liu · Lauren Pick · Haotian Ma · Aws Albarghouthi · Frederic Sala
- 2022 Poster: Lifting Weak Supervision To Structured Prediction
  Harit Vishwakarma · Frederic Sala
- 2022 Poster: One for All: Simultaneous Metric and Preference Learning over Multiple Users
  Gregory Canal · Blake Mason · Ramya Korlakai Vinayak · Robert Nowak
- 2022 Poster: NAS-Bench-360: Benchmarking Neural Architecture Search on Diverse Tasks
  Renbo Tu · Nicholas Roberts · Misha Khodak · Junhong Shen · Frederic Sala · Ameet Talwalkar
- 2016 Poster: Crowdsourced Clustering: Querying Edges vs Triangles
  Ramya Korlakai Vinayak · Babak Hassibi
- 2014 Poster: Graph Clustering With Missing Data: Convex Algorithms and Analysis
  Ramya Korlakai Vinayak · Samet Oymak · Babak Hassibi