Timezone: »
Web-crawled datasets have enabled remarkable generalization capabilities in recent image-text models such as CLIP (Contrastive Language-Image pre-training) or Flamingo, but little is known about the dataset creation processes. In this work, we introduce a testbed of six publicly available data sources---YFCC, LAION, Conceptual Captions, WIT, RedCaps, Shutterstock---to investigate how pre-training distributions induce robustness in CLIP. We find that the performance of the pre-training data varies substantially across distribution shifts, with no single data source dominating. Moreover, we systematically study the interactions between these data sources and find that mixing multiple sources does not necessarily yield better models, but rather dilutes the robustness of the best individual data source. We complement our empirical findings with theoretical insights from a simple setting, where combining the training data also results in diluted robustness. In addition, our theoretical model provides a candidate explanation for the success of the CLIP-based data filtering technique recently employed in the LAION dataset. Overall our results demonstrate that simply gathering a large amount of data from the web is not the most effective way to build a pre-training dataset for robust generalization, necessitating further study into dataset design. Code is available at https://github.com/mlfoundations/clipqualitynot_quantity.
Author Information
Thao Nguyen (University of Washington)
Gabriel Ilharco (Department of Computer Science, University of Washington)
Mitchell Wortsman (University of Washington, Allen Institute for Artificial Intelligence)
Sewoong Oh (University of Washington)
Ludwig Schmidt (University of Washington)
More from the Same Authors
-
2021 : Are We Learning Yet? A Meta Review of Evaluation Failures Across Machine Learning »
Thomas Liao · Rohan Taori · Deborah Raji · Ludwig Schmidt -
2021 : Do ImageNet Classifiers Generalize to ImageNet? »
Benjamin Recht · Becca Roelofs · Ludwig Schmidt · Vaishaal Shankar -
2021 : Evaluating Machine Accuracy on ImageNet »
Vaishaal Shankar · Becca Roelofs · Horia Mania · Benjamin Recht · Ludwig Schmidt -
2021 : Measuring Robustness to Natural Distribution Shifts in Image Classification »
Rohan Taori · Achal Dave · Vaishaal Shankar · Nicholas Carlini · Benjamin Recht · Ludwig Schmidt -
2021 : Avoiding Spurious Correlations: Bridging Theory and Practice »
Thao Nguyen · Hanie Sedghi · Behnam Neyshabur -
2021 : Robust fine-tuning of zero-shot models »
Mitchell Wortsman · Gabriel Ilharco · Jong Wook Kim · Mike Li · Hanna Hajishirzi · Ali Farhadi · Hongseok Namkoong · Ludwig Schmidt -
2021 : Gradient flows on graphons: existence, convergence, continuity equations »
Sewoong Oh · Soumik Pal · Raghav Somani · Raghav Tripathi -
2022 : Few-shot Backdoor Attacks via Neural Tangent Kernels »
Jonathan Hayase · Sewoong Oh -
2023 Poster: Stable and low-precision training for large-scale vision-language models »
Mitchell Wortsman · Tim Dettmers · Luke Zettlemoyer · Ari Morcos · Ali Farhadi · Ludwig Schmidt -
2023 Poster: Label Poisoning is All You Need »
Rishi Jha · Jonathan Hayase · Sewoong Oh -
2023 Poster: Unleashing the Power of Randomization in Auditing Differentially Private ML »
Krishna Pillutla · Galen Andrew · Peter Kairouz · H. Brendan McMahan · Alina Oprea · Sewoong Oh -
2023 Poster: Private (Stochastic) Non-Convex Optimization Revisited: Second-Order Stationary Points and Excess Risks »
Daogao Liu · Arun Ganesh · Sewoong Oh · Abhradeep Guha Thakurta -
2023 Poster: Characterizing the Impacts of Semi-supervised Learning for Weak Supervision »
Jeffrey Li · Jieyu Zhang · Ludwig Schmidt · Alexander Ratner -
2023 Poster: Effective Robustness against Natural Distribution Shifts for Models with Different Training Data »
Zhouxing Shi · Nicholas Carlini · Ananth Balashankar · Ludwig Schmidt · Cho-Jui Hsieh · Alex Beutel · Yao Qin -
2023 Poster: Are aligned neural networks adversarially aligned? »
Nicholas Carlini · Florian Tramer · Daphne Ippolito · Ludwig Schmidt · Milad Nasr · Matthew Jagielski · Pang Wei Koh · Irena Gao · Christopher Choquette-Choo -
2023 Poster: Label Robust and Differentially Private Linear Regression: Computational and Statistical Efficiency »
Xiyang Liu · Prateek Jain · Weihao Kong · Sewoong Oh · Arun Suggala -
2023 Poster: Neural Priming for Sample-Efficient Adaptation »
Matthew Wallingford · Vivek Ramanujan · Alex Fang · Aditya Kusupati · Roozbeh Mottaghi · Aniruddha Kembhavi · Ludwig Schmidt · Ali Farhadi -
2023 Poster: On the Connection between Pre-training Data Diversity and Fine-tuning Robustness »
Vivek Ramanujan · Thao Nguyen · Sewoong Oh · Ali Farhadi · Ludwig Schmidt -
2023 Poster: Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text »
Wanrong Zhu · Jack Hessel · Anas Awadalla · Samir Yitzhak Gadre · Jesse Dodge · Alex Fang · Youngjae Yu · Ludwig Schmidt · William Yang Wang · Yejin Choi -
2023 Poster: DataComp: In search of the next generation of multimodal datasets »
Samir Yitzhak Gadre · Gabriel Ilharco · Alex Fang · Jonathan Hayase · Georgios Smyrnis · Thao Nguyen · Ryan Marten · Mitchell Wortsman · Dhruba Ghosh · Jieyu Zhang · Eyal Orgad · Rahim Entezari · Giannis Daras · Sarah Pratt · Vivek Ramanujan · Yonatan Bitton · Kalyani Marathe · Stephen Mussmann · Richard Vencu · Mehdi Cherti · Ranjay Krishna · Pang Wei Koh · Olga Saukh · Alexander Ratner · Shuran Song · Hannaneh Hajishirzi · Ali Farhadi · Romain Beaumont · Sewoong Oh · Alex Dimakis · Jenia Jitsev · Yair Carmon · Vaishaal Shankar · Ludwig Schmidt -
2023 Poster: VisIT-Bench: A Dynamic Benchmark for Evaluating Instruction-Following Vision-and-Language Models »
Yonatan Bitton · Hritik Bansal · Jack Hessel · Rulin Shao · Wanrong Zhu · Anas Awadalla · Josh Gardner · Rohan Taori · Ludwig Schmidt -
2023 Poster: Benchmarking Distribution Shift in Tabular Data with TableShift »
Josh Gardner · Zoran Popovic · Ludwig Schmidt -
2023 Poster: GenEval: An object-focused framework for evaluating text-to-image alignment »
Dhruba Ghosh · Hannaneh Hajishirzi · Ludwig Schmidt -
2023 Poster: Improving multimodal datasets with image captioning »
Thao Nguyen · Samir Yitzhak Gadre · Gabriel Ilharco · Sewoong Oh · Ludwig Schmidt -
2023 Poster: Objaverse-XL: A Colossal Universe of 3D Objects »
Matt Deitke · Ruoshi Liu · Matthew Wallingford · Huong Ngo · Oscar Michel · Aditya Kusupati · Alan Fan · Christian Laforte · Vikram Voleti · Samir Yitzhak Gadre · Eli VanderBilt · Aniruddha Kembhavi · Carl Vondrick · Georgia Gkioxari · Kiana Ehsani · Ludwig Schmidt · Ali Farhadi -
2023 Poster: Does progress on ImageNet transfer to real-world datasets? »
Alex Fang · Simon Kornblith · Ludwig Schmidt -
2023 Oral: DataComp: In search of the next generation of multimodal datasets »
Samir Yitzhak Gadre · Gabriel Ilharco · Alex Fang · Jonathan Hayase · Georgios Smyrnis · Thao Nguyen · Ryan Marten · Mitchell Wortsman · Dhruba Ghosh · Jieyu Zhang · Eyal Orgad · Rahim Entezari · Giannis Daras · Sarah Pratt · Vivek Ramanujan · Yonatan Bitton · Kalyani Marathe · Stephen Mussmann · Richard Vencu · Mehdi Cherti · Ranjay Krishna · Pang Wei Koh · Olga Saukh · Alexander Ratner · Shuran Song · Hannaneh Hajishirzi · Ali Farhadi · Romain Beaumont · Sewoong Oh · Alex Dimakis · Jenia Jitsev · Yair Carmon · Vaishaal Shankar · Ludwig Schmidt -
2022 Panel: Panel 2C-5: TVLT: Textless Vision-Language… & Quality Not Quantity:… »
Zineng Tang · Thao Nguyen -
2022 Poster: Patching open-vocabulary models by interpolating weights »
Gabriel Ilharco · Mitchell Wortsman · Samir Yitzhak Gadre · Shuran Song · Hannaneh Hajishirzi · Simon Kornblith · Ali Farhadi · Ludwig Schmidt -
2022 Poster: DP-PCA: Statistically Optimal and Differentially Private PCA »
Xiyang Liu · Weihao Kong · Prateek Jain · Sewoong Oh -
2022 Poster: Zonotope Domains for Lagrangian Neural Network Verification »
Matt Jordan · Jonathan Hayase · Alex Dimakis · Sewoong Oh -
2022 Poster: LAION-5B: An open large-scale dataset for training next generation image-text models »
Christoph Schuhmann · Romain Beaumont · Richard Vencu · Cade Gordon · Ross Wightman · Mehdi Cherti · Theo Coombes · Aarush Katta · Clayton Mullis · Mitchell Wortsman · Patrick Schramowski · Srivatsa Kundurthy · Katherine Crowson · Ludwig Schmidt · Robert Kaczmarczyk · Jenia Jitsev -
2022 Poster: Subgroup Robustness Grows On Trees: An Empirical Baseline Investigation »
Josh Gardner · Zoran Popovic · Ludwig Schmidt -
2022 Poster: Bring Your Own Algorithm for Optimal Differentially Private Stochastic Minimax Optimization »
Liang Zhang · Kiran Thekumparampil · Sewoong Oh · Niao He -
2021 Oral: Retiring Adult: New Datasets for Fair Machine Learning »
Frances Ding · Moritz Hardt · John Miller · Ludwig Schmidt -
2021 Poster: Divergence Frontiers for Generative Models: Sample Complexity, Quantization Effects, and Frontier Integrals »
Lang Liu · Krishna Pillutla · Sean Welleck · Sewoong Oh · Yejin Choi · Zaid Harchaoui -
2021 Poster: Robust and differentially private mean estimation »
Xiyang Liu · Weihao Kong · Sham Kakade · Sewoong Oh -
2021 Poster: Retiring Adult: New Datasets for Fair Machine Learning »
Frances Ding · Moritz Hardt · John Miller · Ludwig Schmidt -
2021 Poster: Characterizing Generalization under Out-Of-Distribution Shifts in Deep Metric Learning »
Timo Milbich · Karsten Roth · Samarth Sinha · Ludwig Schmidt · Marzyeh Ghassemi · Bjorn Ommer -
2021 Poster: Gradient Inversion with Generative Image Prior »
Jinwoo Jeon · jaechang Kim · Kangwook Lee · Sewoong Oh · Jungseul Ok -
2021 Poster: Statistically and Computationally Efficient Linear Meta-representation Learning »
Kiran Thekumparampil · Prateek Jain · Praneeth Netrapalli · Sewoong Oh -
2020 : 17 - Uncovering How Neural Network Representations Vary with Width and Depth »
Thao Nguyen -
2020 Poster: Projection Efficient Subgradient Method and Optimal Nonsmooth Frank-Wolfe Method »
Kiran Thekumparampil · Prateek Jain · Praneeth Netrapalli · Sewoong Oh -
2020 Spotlight: Projection Efficient Subgradient Method and Optimal Nonsmooth Frank-Wolfe Method »
Kiran Thekumparampil · Prateek Jain · Praneeth Netrapalli · Sewoong Oh -
2020 Poster: Robust Meta-learning for Mixed Linear Regression with Small Batches »
Weihao Kong · Raghav Somani · Sham Kakade · Sewoong Oh -
2020 Poster: Supermasks in Superposition »
Mitchell Wortsman · Vivek Ramanujan · Rosanne Liu · Aniruddha Kembhavi · Mohammad Rastegari · Jason Yosinski · Ali Farhadi -
2020 Poster: Measuring Robustness to Natural Distribution Shifts in Image Classification »
Rohan Taori · Achal Dave · Vaishaal Shankar · Nicholas Carlini · Benjamin Recht · Ludwig Schmidt -
2020 Spotlight: Measuring Robustness to Natural Distribution Shifts in Image Classification »
Rohan Taori · Achal Dave · Vaishaal Shankar · Nicholas Carlini · Benjamin Recht · Ludwig Schmidt -
2019 Poster: Efficient Algorithms for Smooth Minimax Optimization »
Kiran Thekumparampil · Prateek Jain · Praneeth Netrapalli · Sewoong Oh -
2019 Poster: Turbo Autoencoder: Deep learning based channel codes for point-to-point communication channels »
Yihan Jiang · Hyeji Kim · Himanshu Asnani · Sreeram Kannan · Sewoong Oh · Pramod Viswanath -
2019 Poster: Model Similarity Mitigates Test Set Overuse »
Horia Mania · John Miller · Ludwig Schmidt · Moritz Hardt · Benjamin Recht -
2019 Poster: Unlabeled Data Improves Adversarial Robustness »
Yair Carmon · Aditi Raghunathan · Ludwig Schmidt · John Duchi · Percy Liang -
2019 Poster: A Meta-Analysis of Overfitting in Machine Learning »
Becca Roelofs · Vaishaal Shankar · Benjamin Recht · Sara Fridovich-Keil · Moritz Hardt · John Miller · Ludwig Schmidt -
2019 Poster: Discovering Neural Wirings »
Mitchell Wortsman · Ali Farhadi · Mohammad Rastegari -
2019 Poster: Minimax Optimal Estimation of Approximate Differential Privacy on Neighboring Databases »
Xiyang Liu · Sewoong Oh -
2018 Poster: Deepcode: Feedback Codes via Deep Learning »
Hyeji Kim · Yihan Jiang · Sreeram Kannan · Sewoong Oh · Pramod Viswanath -
2018 Poster: Robustness of conditional GANs to noisy labels »
Kiran Thekumparampil · Ashish Khetan · Zinan Lin · Sewoong Oh -
2018 Spotlight: Robustness of conditional GANs to noisy labels »
Kiran Thekumparampil · Ashish Khetan · Zinan Lin · Sewoong Oh -
2018 Poster: PacGAN: The power of two samples in generative adversarial networks »
Zinan Lin · Ashish Khetan · Giulia Fanti · Sewoong Oh -
2017 : New perspective from Blackwell's "comparisons of experiments" on generative adversarial networks »
Sewoong Oh -
2017 Poster: Optimal Sample Complexity of M-wise Data for Top-K Ranking »
Minje Jang · Sunghyun Kim · Changho Suh · Sewoong Oh -
2017 Poster: Estimating Mutual Information for Discrete-Continuous Mixtures »
Weihao Gao · Sreeram Kannan · Sewoong Oh · Pramod Viswanath -
2017 Poster: Matrix Norm Estimation from a Few Entries »
Ashish Khetan · Sewoong Oh -
2017 Spotlight: Estimating Mutual Information for Discrete-Continuous Mixtures »
Weihao Gao · Sreeram Kannan · Sewoong Oh · Pramod Viswanath -
2017 Spotlight: Matrix Norm Estimation from a Few Entries »
Ashish Khetan · Sewoong Oh -
2017 Poster: Discovering Potential Correlations via Hypercontractivity »
Hyeji Kim · Weihao Gao · Sreeram Kannan · Sewoong Oh · Pramod Viswanath -
2016 : Sewoong Oh: "The Minimax Rate for Adaptive Crowdsourcing" »
Sewoong Oh -
2016 Poster: Breaking the Bandwidth Barrier: Geometrical Adaptive Entropy Estimation »
Weihao Gao · Sewoong Oh · Pramod Viswanath -
2016 Poster: Computational and Statistical Tradeoffs in Learning to Rank »
Ashish Khetan · Sewoong Oh -
2016 Poster: Achieving budget-optimality with adaptive schemes in crowdsourcing »
Ashish Khetan · Sewoong Oh -
2015 Workshop: Non-convex Optimization for Machine Learning: Theory and Practice »
Anima Anandkumar · Niranjan Uma Naresh · Kamalika Chaudhuri · Percy Liang · Sewoong Oh -
2015 Poster: Secure Multi-party Differential Privacy »
Peter Kairouz · Sewoong Oh · Pramod Viswanath -
2015 Poster: Collaboratively Learning Preferences from Ordinal Data »
Sewoong Oh · Kiran Thekumparampil · Jiaming Xu -
2014 Workshop: Analysis of Rank Data: Confluence of Social Choice, Operations Research, and Machine Learning »
Shivani Agarwal · Hossein Azari Soufiani · Guy Bresler · Sewoong Oh · David Parkes · Arun Rajkumar · Devavrat Shah -
2014 Poster: Provable Tensor Factorization with Missing Data »
Prateek Jain · Sewoong Oh -
2014 Poster: Extremal Mechanisms for Local Differential Privacy »
Peter Kairouz · Sewoong Oh · Pramod Viswanath -
2014 Poster: Minimax-optimal Inference from Partial Rankings »
Bruce Hajek · Sewoong Oh · Jiaming Xu -
2014 Poster: Learning Mixed Multinomial Logit Model from Ordinal Data »
Sewoong Oh · Devavrat Shah -
2012 Poster: Iterative ranking from pair-wise comparisons »
Sahand N Negahban · Sewoong Oh · Devavrat Shah -
2012 Spotlight: Iterative ranking from pair-wise comparisons »
Sahand N Negahban · Sewoong Oh · Devavrat Shah -
2011 Poster: Iterative Learning for Reliable Crowdsourcing Systems »
David R Karger · Sewoong Oh · Devavrat Shah -
2011 Oral: Iterative Learning for Reliable Crowdsourcing Systems »
David R Karger · Sewoong Oh · Devavrat Shah -
2009 Poster: Matrix Completion from Noisy Entries »
Raghunandan Keshavan · Andrea Montanari · Sewoong Oh