Assessing the quality and impact of individual data points is critical for improving model performance and mitigating undesirable biases within the training dataset. Several data valuation algorithms have been proposed to quantify data quality; however, a systematic and standardized benchmarking system for data valuation is lacking. In this paper, we introduce OpenDataVal, an easy-to-use and unified benchmark framework that empowers researchers and practitioners to apply and compare various data valuation algorithms. OpenDataVal provides an integrated environment that includes (i) a diverse collection of image, natural language, and tabular datasets, (ii) implementations of eleven different state-of-the-art data valuation algorithms, and (iii) a prediction model API that can import any model from scikit-learn. Furthermore, we propose four downstream machine learning tasks for evaluating the quality of data values. We perform a benchmarking analysis using OpenDataVal, quantifying and comparing the efficacy of state-of-the-art data valuation approaches. We find that no single algorithm performs uniformly best across all tasks, and that an appropriate algorithm should be chosen for the user's downstream task. OpenDataVal is publicly available at https://opendataval.github.io with comprehensive documentation. We also provide a leaderboard where researchers can evaluate the effectiveness of their own data valuation algorithms.
Author Information
Kevin Jiang (Columbia University)
Weixin Liang (Stanford University)
James Zou (Stanford University)
Yongchan Kwon (Columbia University)
More from the Same Authors
-
2022 : Predicting Immune Escape with Pretrained Protein Language Model Embeddings »
Kyle Swanson · Howard Chang · James Zou -
2022 : Data-driven subgroup identification for linear regression »
Zachary Izzo · Ruishan Liu · James Zou -
2022 : Is Unsupervised Performance Estimation Impossible When Both Covariates and Labels shift? »
Lingjiao Chen · Matei Zaharia · James Zou -
2022 : DrML: Diagnosing and Rectifying Vision Models using Language »
Yuhui Zhang · Jeff Z. HaoChen · Shih-Cheng Huang · Kuan-Chieh Wang · James Zou · Serena Yeung -
2022 : Provable Re-Identification Privacy »
Zachary Izzo · Jinsung Yoon · Sercan Arik · James Zou -
2022 : Recommendation for New Drugs with Limited Prescription Data »
Zhenbang Wu · Huaxiu Yao · Zhe Su · David Liebovitz · Lucas Glass · James Zou · Chelsea Finn · Jimeng Sun -
2023 : Generative AI for designing and validating easily synthesizable and structurally novel antibiotics »
Kyle Swanson · Gary Liu · Denise Catacutan · James Zou · Jonathan Stokes -
2023 : Analyzing ChatGPT’s Behavior Shifts Over Time »
Lingjiao Chen · Matei A Zaharia · James Zou -
2023 : Navigating Dataset Documentation in ML: A Large-Scale Analysis of Dataset Cards on Hugging Face »
Xinyu Yang · Weixin Liang · James Zou -
2023 : A Theoretical Study of Dataset Distillation »
Zachary Izzo · James Zou -
2023 Poster: TWIGMA: A dataset of AI-Generated Images with Metadata From Twitter »
Yiqun Chen · James Zou -
2023 Poster: Factorized Contrastive Learning: Going Beyond Multi-view Redundancy »
Paul Pu Liang · Zihao Deng · Martin Q. Ma · James Zou · Louis-Philippe Morency · Ruslan Salakhutdinov -
2023 Poster: Beyond Confidence: Reliable Models Should Also Consider Atypicality »
Mert Yuksekgonul · Linjun Zhang · James Zou · Carlos Guestrin -
2023 Poster: DataPerf: Benchmarks for Data-Centric AI Development »
Mark Mazumder · Colby Banbury · Xiaozhe Yao · Bojan Karlaš · William Gaviria Rojas · Sudnya Diamos · Greg Diamos · Lynn He · Alicia Parrish · Hannah Rose Kirk · Jessica Quaye · Charvi Rastogi · Douwe Kiela · David Jurado · David Kanter · Rafael Mosquera · Will Cukierski · Juan Ciro · Lora Aroyo · Bilge Acun · Lingjiao Chen · Mehul Raje · Max Bartolo · Evan Sabri Eyuboglu · Amirata Ghorbani · Emmett Goodman · Addison Howard · Oana Inel · Tariq Kane · Christine R. Kirkpatrick · D. Sculley · Tzu-Sheng Kuo · Jonas Mueller · Tristan Thrush · Joaquin Vanschoren · Margaret Warren · Adina Williams · Serena Yeung · Newsha Ardalani · Praveen Paritosh · Ce Zhang · James Zou · Carole-Jean Wu · Cody Coleman · Andrew Ng · Peter Mattson · Vijay Janapa Reddi -
2022 : An Electrocardiogram-Based Risk Score for Cardiovascular Mortality »
John Hughes · David Ouyang · Pierre Elias · James Zou · Euan Ashley · Marco Perez -
2022 Poster: Estimating and Explaining Model Performance When Both Covariates and Labels Shift »
Lingjiao Chen · Matei Zaharia · James Zou -
2022 Poster: SkinCon: A skin disease dataset densely annotated by domain experts for fine-grained debugging and analysis »
Roxana Daneshjou · Mert Yuksekgonul · Zhuo Ran Cai · Roberto Novoa · James Zou -
2022 Poster: HAPI: A Large-scale Longitudinal Dataset of Commercial ML API Predictions »
Lingjiao Chen · Zhihua Jin · Evan Sabri Eyuboglu · Christopher Ré · Matei Zaharia · James Zou -
2022 Poster: Uncalibrated Models Can Improve Human-AI Collaboration »
Kailas Vodrahalli · Tobias Gerstenberg · James Zou -
2022 Poster: C-Mixup: Improving Generalization in Regression »
Huaxiu Yao · Yiping Wang · Linjun Zhang · James Zou · Chelsea Finn -
2022 Poster: Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning »
Weixin Liang · Yuhui Zhang · Yongchan Kwon · Serena Yeung · James Zou -
2022 Poster: WeightedSHAP: analyzing and improving Shapley based feature attributions »
Yongchan Kwon · James Zou -
2021 Poster: Adversarial Training Helps Transfer Learning via Better Representations »
Zhun Deng · Linjun Zhang · Kailas Vodrahalli · Kenji Kawaguchi · James Zou -
2020 Session: Orals & Spotlights Track 02: COVID/Health/Bio Applications »
Tristan Naumann · James Zou -
2019 Poster: Making AI Forget You: Data Deletion in Machine Learning »
Antonio Ginart · Melody Guan · Gregory Valiant · James Zou -
2019 Spotlight: Making AI Forget You: Data Deletion in Machine Learning »
Antonio Ginart · Melody Guan · Gregory Valiant · James Zou -
2017 Workshop: Machine Learning in Computational Biology »
James Zou · Anshul Kundaje · Gerald Quon · Nicolo Fusi · Sara Mostafavi -
2017 Poster: NeuralFDR: Learning Discovery Thresholds from Hypothesis Features »
Fei Xia · Martin J Zhang · James Zou · David Tse