Workshop: Data Centric AI

Data vast and low in variance: Augment machine learning pipelines with dataset profiles to improve data quality without sacrificing scale


The recent discussion of data-centric artificial intelligence (DCAI) has galvanized researchers and practitioners to elevate data quality and dataset iteration practices to the level of importance given to model iteration on fixed datasets. Some DCAI techniques successfully increase training data quality, but at the expense of the number of training examples. Meanwhile, production AI systems are increasingly deployed in new settings, producing ever more inference data. Dataset profiling techniques provide systematic ways of transferring important characteristics and data examples from large, real-time inference data sources to the smaller datasets used for training, delivering higher-quality data without sacrificing scalability.