Pre-training datasets are critical for building state-of-the-art machine learning models, motivating rigorous study of their impact on downstream tasks. In this work, we study the trade-off between the intra-class diversity (the number of samples per class) and the inter-class diversity (the number of classes) of a supervised pre-training dataset. Empirically, we find that with the size of the pre-training dataset fixed, the best downstream performance comes from a balance between intra-class and inter-class diversity. To understand the underlying mechanism, we show theoretically that the downstream performance depends monotonically on both types of diversity. Notably, our theory reveals that the optimal class-to-sample ratio (#classes / #samples per class) is invariant to the size of the pre-training dataset, which motivates an application: predicting the optimal number of pre-training classes. We demonstrate the effectiveness of this application with an improvement of around 2 points on the downstream tasks when using ImageNet as the pre-training dataset.
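To make the invariance claim concrete, here is a minimal arithmetic sketch and not the authors' procedure. It assumes a perfectly balanced dataset, so that the total size N equals (#classes) x (#samples per class), and it assumes the optimal ratio r = #classes / #samples per class is already known, for example from a smaller-scale estimate. Under those assumptions, #classes = sqrt(r * N). The function name `split_budget` and its parameters are hypothetical.

```python
import math
from typing import Tuple


def split_budget(total_samples: int, class_to_sample_ratio: float) -> Tuple[int, int]:
    """Split a fixed sample budget into (num_classes, samples_per_class).

    Assumes balanced classes: total_samples = num_classes * samples_per_class,
    and class_to_sample_ratio = num_classes / samples_per_class.
    Solving the two equations gives num_classes = sqrt(ratio * total_samples).
    """
    if total_samples <= 0 or class_to_sample_ratio <= 0:
        raise ValueError("total_samples and class_to_sample_ratio must be positive")
    # Round to the nearest whole number of classes, with at least one class.
    num_classes = max(1, round(math.sqrt(class_to_sample_ratio * total_samples)))
    # Integer division keeps the total within the budget.
    samples_per_class = max(1, total_samples // num_classes)
    return num_classes, samples_per_class


# With the ratio held fixed, quadrupling the budget doubles both quantities.
small = split_budget(40_000, 0.25)
large = split_budget(160_000, 0.25)
```

Because the ratio is held constant, both the number of classes and the number of samples per class grow with the square root of the dataset size in this sketch; the ratio value 0.25 is purely illustrative and is not taken from the paper.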
Author Information
Jieyu Zhang (Department of Computer Science, University of Washington)
Bohan Wang (USTC)
Zhengyu Hu (HKUST)
Pang Wei Koh (University of Washington)
Alexander Ratner (Stanford University)
More from the Same Authors
-
2021 : WRENCH: A Comprehensive Benchmark for Weak Supervision »
Jieyu Zhang · Yue Yu · Yinghao Li · Yujing Wang · Yaming Yang · Mao Yang · Alexander Ratner -
2021 : Extending the WILDS Benchmark for Unsupervised Adaptation »
Shiori Sagawa · Pang Wei Koh · Tony Lee · Irena Gao · Sang Michael Xie · Kendrick Shen · Ananya Kumar · Weihua Hu · Michihiro Yasunaga · Henrik Marklund · Sara Beery · Ian Stavness · Jure Leskovec · Kate Saenko · Tatsunori Hashimoto · Sergey Levine · Chelsea Finn · Percy Liang -
2022 : Wild-Time: A Benchmark of in-the-Wild Distribution Shift over Time »
Caroline Choi · Huaxiu Yao · Yoonho Lee · Pang Wei Koh · Chelsea Finn -
2022 : Out-of-Distribution Robustness via Targeted Augmentations »
Irena Gao · Shiori Sagawa · Pang Wei Koh · Tatsunori Hashimoto · Percy Liang -
2023 : SCIBENCH: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models »
Xiaoxuan Wang · Ziniu Hu · Pan Lu · Yanqiao Zhu · Jieyu Zhang · Satyen Subramaniam · Arjun Loomba · Shichang Zhang · Yizhou Sun · Wei Wang -
2023 : NLPBench: Evaluating Large Language Models on Solving NLP Problems »
Linxin Song · Jieyu Zhang · Lechao Cheng · Pengyuan Zhou · Tianyi Zhou · Zihui Li -
2023 : FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation »
Sewon Min · Kalpesh Krishna · Xinxi Lyu · Mike Lewis · Scott Yih · Pang Wei Koh · Mohit Iyyer · Luke Zettlemoyer · Hannaneh Hajishirzi -
2023 : Retrieval-based Language Models Using a Multi-domain Datastore »
Rulin Shao · Sewon Min · Luke Zettlemoyer · Pang Wei Koh -
2023 : Large Catapults in Momentum Gradient Descent with Warmup: An Empirical Study »
Prin Phunyaphibarn · Junghyun Lee · Bohan Wang · Huishuai Zhang · Chulhee Yun -
2023 Workshop: Workshop on Distribution Shifts: New Frontiers with Foundation Models »
Rebecca Roelofs · Fanny Yang · Hongseok Namkoong · Masashi Sugiyama · Jacob Eisenstein · Pang Wei Koh · Shiori Sagawa · Tatsunori Hashimoto · Yoonho Lee -
2023 Poster: Closing the gap between the upper bound and lower bound of Adam's iteration complexity »
Bohan Wang · Jingwen Fu · Huishuai Zhang · Nanning Zheng · Wei Chen -
2023 Poster: Proximity-Informed Calibration for Deep Neural Networks »
Miao Xiong · Ailin Deng · Pang Wei Koh · Jiaying Wu · Shen Li · Jianqing Xu · Bryan Hooi -
2023 Poster: Uncovering Neural Scaling Laws in Molecular Representation Learning »
Dingshuo Chen · Yanqiao Zhu · Jieyu Zhang · Yuanqi Du · Zhixun Li · Qiang Liu · Shu Wu · Liang Wang -
2023 Poster: SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality »
Cheng-Yu Hsieh · Jieyu Zhang · Zixian Ma · Aniruddha Kembhavi · Ranjay Krishna -
2023 Poster: Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias »
Yue Yu · Yuchen Zhuang · Jieyu Zhang · Yu Meng · Alexander Ratner · Ranjay Krishna · Jiaming Shen · Chao Zhang -
2023 Poster: DataComp: In search of the next generation of multimodal datasets »
Samir Yitzhak Gadre · Gabriel Ilharco · Alex Fang · Jonathan Hayase · Georgios Smyrnis · Thao Nguyen · Ryan Marten · Mitchell Wortsman · Dhruba Ghosh · Jieyu Zhang · Eyal Orgad · Rahim Entezari · Giannis Daras · Sarah Pratt · Vivek Ramanujan · Yonatan Bitton · Kalyani Marathe · Stephen Mussmann · Richard Vencu · Mehdi Cherti · Ranjay Krishna · Pang Wei Koh · Olga Saukh · Alexander Ratner · Shuran Song · Hannaneh Hajishirzi · Ali Farhadi · Romain Beaumont · Sewoong Oh · Alex Dimakis · Jenia Jitsev · Yair Carmon · Vaishaal Shankar · Ludwig Schmidt -
2023 Oral: DataComp: In search of the next generation of multimodal datasets »
Samir Yitzhak Gadre · Gabriel Ilharco · Alex Fang · Jonathan Hayase · Georgios Smyrnis · Thao Nguyen · Ryan Marten · Mitchell Wortsman · Dhruba Ghosh · Jieyu Zhang · Eyal Orgad · Rahim Entezari · Giannis Daras · Sarah Pratt · Vivek Ramanujan · Yonatan Bitton · Kalyani Marathe · Stephen Mussmann · Richard Vencu · Mehdi Cherti · Ranjay Krishna · Pang Wei Koh · Olga Saukh · Alexander Ratner · Shuran Song · Hannaneh Hajishirzi · Ali Farhadi · Romain Beaumont · Sewoong Oh · Alex Dimakis · Jenia Jitsev · Yair Carmon · Vaishaal Shankar · Ludwig Schmidt -
2023 Poster: Characterizing the Impacts of Semi-supervised Learning for Weak Supervision »
Jeffrey Li · Jieyu Zhang · Ludwig Schmidt · Alexander Ratner -
2023 Poster: Are aligned neural networks adversarially aligned? »
Nicholas Carlini · Milad Nasr · Christopher A. Choquette-Choo · Matthew Jagielski · Irena Gao · Pang Wei Koh · Daphne Ippolito · Florian Tramer · Ludwig Schmidt -
2023 Poster: Fast Conditional Mixing of MCMC Algorithms for Non-log-concave Distributions »
Xiang Cheng · Bohan Wang · Jingzhao Zhang · Yusong Zhu -
2022 Spotlight: Lightning Talks 4A-3 »
Zhihan Gao · Yabin Wang · Xingyu Qu · Luziwei Leng · Mingqing Xiao · Bohan Wang · Yu Shen · Zhiwu Huang · Xingjian Shi · Qi Meng · Yupeng Lu · Diyang Li · Qingyan Meng · Kaiwei Che · Yang Li · Hao Wang · Huishuai Zhang · Zongpeng Zhang · Kaixuan Zhang · Xiaopeng Hong · Xiaohan Zhao · Di He · Jianguo Zhang · Yaofeng Tu · Bin Gu · Yi Zhu · Ruoyu Sun · Yuyang (Bernie) Wang · Zhouchen Lin · Qinghu Meng · Wei Chen · Wentao Zhang · Bin CUI · Jie Cheng · Zhi-Ming Ma · Mu Li · Qinghai Guo · Dit-Yan Yeung · Tie-Yan Liu · Jianxing Liao -
2022 Spotlight: Does Momentum Change the Implicit Regularization on Separable Data? »
Bohan Wang · Qi Meng · Huishuai Zhang · Ruoyu Sun · Wei Chen · Zhi-Ming Ma · Tie-Yan Liu -
2022 : Panel »
Mayee Chen · Alexander Ratner · Robert Nowak · Cody Coleman · Ramya Korlakai Vinayak -
2022 Workshop: Workshop on Distribution Shifts: Connecting Methods and Applications »
Chelsea Finn · Fanny Yang · Hongseok Namkoong · Masashi Sugiyama · Jacob Eisenstein · Jonas Peters · Rebecca Roelofs · Shiori Sagawa · Pang Wei Koh · Yoonho Lee -
2022 Poster: Does Momentum Change the Implicit Regularization on Separable Data? »
Bohan Wang · Qi Meng · Huishuai Zhang · Ruoyu Sun · Wei Chen · Zhi-Ming Ma · Tie-Yan Liu -
2022 Poster: Understanding Programmatic Weak Supervision via Source-aware Influence Function »
Jieyu Zhang · Haonan Wang · Cheng-Yu Hsieh · Alexander Ratner -
2022 Poster: Wild-Time: A Benchmark of in-the-Wild Distribution Shift over Time »
Huaxiu Yao · Caroline Choi · Bochuan Cao · Yoonho Lee · Pang Wei Koh · Chelsea Finn -
2021 : AI workloads inside databases »
Guy Van den Broeck · Alexander Ratner · Ben Moseley · Konstantinos Karanasos · Parisa Kordjamshidi · Molham Aref · Arun Kumar -
2021 Workshop: Distribution shifts: connecting methods and applications (DistShift) »
Shiori Sagawa · Pang Wei Koh · Fanny Yang · Hongseok Namkoong · Jiashi Feng · Kate Saenko · Percy Liang · Sarah Bird · Sergey Levine -
2021 Poster: Optimizing Information-theoretical Generalization Bound via Anisotropic Noise of SGLD »
Bohan Wang · Huishuai Zhang · Jieyu Zhang · Qi Meng · Wei Chen · Tie-Yan Liu -
2020 : Q & A and Panel Session with Dan Weld, Kristen Grauman, Scott Yih, Emma Brunskill, and Alex Ratner »
Kristen Grauman · Wen-tau Yih · Alexander Ratner · Emma Brunskill · Douwe Kiela · Daniel S. Weld -
2020 : WILDS: A Survey and Benchmark of in-the-Wild Distribution Shifts »
Pang Wei Koh -
2019 Poster: On the Accuracy of Influence Functions for Measuring Group Effects »
Pang Wei Koh · Kai-Siang Ang · Hubert Teo · Percy Liang -
2019 Poster: Slice-based Learning: A Programming Model for Residual Learning in Critical Data Slices »
Vincent Chen · Sen Wu · Alexander Ratner · Jen Weng · Christopher Ré -
2019 Poster: Temporal FiLM: Capturing Long-Range Sequence Dependencies with Feature-Wise Modulations. »
Sawyer Birnbaum · Volodymyr Kuleshov · Zayd Enam · Pang Wei Koh · Stefano Ermon -
2017 Workshop: Learning with Limited Labeled Data: Weak Supervision and Beyond »
Isabelle Augenstein · Stephen Bach · Eugene Belilovsky · Matthew Blaschko · Christoph Lampert · Edouard Oyallon · Emmanouil Antonios Platanios · Alexander Ratner · Christopher Ré -
2017 : Coffee break and Poster Session II »
Mohamed Kane · Albert Haque · Vagelis Papalexakis · John Guibas · Peter Li · Carlos Arias · Eric Nalisnick · Padhraic Smyth · Frank Rudzicz · Xia Zhu · Theodore Willke · Noemie Elhadad · Hans Raffauf · Harini Suresh · Paroma Varma · Yisong Yue · Ognjen (Oggi) Rudovic · Luca Foschini · Syed Rameel Ahmad · Hasham ul Haq · Valerio Maggio · Giuseppe Jurman · Sonali Parbhoo · Pouya Bashivan · Jyoti Islam · Mirco Musolesi · Chris Wu · Alexander Ratner · Jared Dunnmon · Cristóbal Esteban · Aram Galstyan · Greg Ver Steeg · Hrant Khachatrian · Marc Górriz · Mihaela van der Schaar · Anton Nemchenko · Manasi Patwardhan · Tanay Tandon -
2017 Workshop: Machine Learning for Health (ML4H) - What Parts of Healthcare are Ripe for Disruption by Machine Learning Right Now? »
Jason Fries · Alex Wiltschko · Andrew Beam · Isaac S Kohane · Jasper Snoek · Peter Schulam · Madalina Fiterau · David Kale · Rajesh Ranganath · Bruno Jedynak · Michael Hughes · Tristan Naumann · Natalia Antropova · Adrian Dalca · SHUBHI ASTHANA · Prateek Tandon · Jaz Kandola · Uri Shalit · Marzyeh Ghassemi · Tim Althoff · Alexander Ratner · Jumana Dakka -
2017 Poster: Learning to Compose Domain-Specific Transformations for Data Augmentation »
Alexander Ratner · Henry Ehrenberg · Zeshan Hussain · Jared Dunnmon · Christopher Ré -
2017 Poster: Certified Defenses for Data Poisoning Attacks »
Jacob Steinhardt · Pang Wei Koh · Percy Liang -
2016 Poster: Data Programming: Creating Large Training Sets, Quickly »
Alexander Ratner · Christopher M De Sa · Sen Wu · Daniel Selsam · Christopher Ré -
2011 Poster: Sparse Filtering »
Jiquan Ngiam · Pang Wei Koh · Zhenghao Chen · Sonia A Bhaskar · Andrew Y Ng -
2011 Spotlight: Sparse Filtering »
Jiquan Ngiam · Pang Wei Koh · Zhenghao Chen · Sonia A Bhaskar · Andrew Y Ng -
2010 Poster: Tiled convolutional neural networks »
Quoc V. Le · Jiquan Ngiam · Zhenghao Chen · Daniel Jin hao Chia · Pang Wei Koh · Andrew Y Ng