Data Centric AI
Andrew Ng · Lora Aroyo · Greg Diamos · Cody Coleman · Vijay Janapa Reddi · Joaquin Vanschoren · Carole-Jean Wu · Sharon Zhou · Lynn He

Tue Dec 14 08:30 AM -- 06:00 PM (PST)
Event URL: http://datacentricai.org/

Data-Centric AI (DCAI) represents the recent shift in focus from modeling to the underlying data used to train and evaluate models. Increasingly, common model architectures have begun to dominate a wide range of tasks, and predictable scaling rules have emerged. While building and using datasets has been critical to these successes, the endeavor is often artisanal -- painstaking and expensive. The community lacks high-productivity, efficient open data engineering tools that make building, maintaining, and evaluating datasets easier, cheaper, and more repeatable. The DCAI movement aims to address this lack of tooling, best practices, and infrastructure for managing data in modern ML systems.

The main objective of this workshop is to cultivate the DCAI community into a vibrant interdisciplinary field that tackles practical data problems. We consider some of those problems to be: data collection/generation, data labeling, data preprocessing/augmentation, data quality evaluation, data debt, and data governance. Many of these areas are nascent, and we hope to further their development by knitting them together into a coherent whole. Together we will define the DCAI movement that will shape the future of AI and ML. Please see our call for papers below to take an active role in shaping that future! If you have any questions, please reach out to the organizers (neurips-data-centric-ai@googlegroups.com).

The ML community has a strong track record of building and using datasets for AI systems. But this endeavor is often artisanal -- painstaking and expensive. The community lacks high-productivity, efficient open data engineering tools that make building, maintaining, and evaluating datasets easier, cheaper, and more repeatable. The core challenge, then, is to accelerate dataset creation and iteration while increasing the efficiency of use and reuse by democratizing data engineering and evaluation.

If 80 percent of machine learning work is data preparation, then ensuring data quality is the most important work of a machine learning team and therefore a vital research area. Human-labeled data has increasingly become the fuel and compass of AI-based software systems, yet innovative efforts have mostly focused on models and code. The growing focus on the scale, speed, and cost of building and improving datasets has affected quality, which is nebulous and often circularly defined, since the annotators are the source of both data and ground truth [Riezler, 2014]. The development of tools to make repeatable and systematic adjustments to datasets has also lagged. While dataset quality is still everyone's top concern, the ways in which it is measured in practice are poorly understood and sometimes simply wrong. A decade later, we see some cause for concern: fairness and bias issues in labeled datasets [Goel and Faltings, 2019], quality issues in datasets [Crawford and Paglen, 2019], limitations of benchmarks [Kovaleva et al., 2019, Welty et al., 2019], reproducibility concerns in machine learning research [Pineau et al., 2018, Gunderson and Kjensmo, 2018], lack of documentation and replication of data [Katsuno et al., 2019], and unrealistic performance metrics [Bernstein, 2021].

We need a framework for excellence in data engineering, and none exists yet. In the first-to-market rush with data, the maintainability, reproducibility, reliability, validity, and fidelity of datasets are often overlooked. We want to turn this way of thinking on its head and highlight examples, case studies, and methodologies for excellence in data collection. Building an active research community focused on Data-Centric AI is an important part of defining the core problems and creating ways to measure progress in machine learning through data quality tasks.

Author Information

Andrew Ng (DeepLearning.AI)
Lora Aroyo (Google Research)

I am a research scientist at Google Research, where I work on Data Excellence, focusing specifically on metrics and strategies to measure the quality of human-labeled data in a reliable and transparent way. I received an MSc in Computer Science from Sofia University, Bulgaria, and a PhD from Twente University, The Netherlands. Prior to joining Google, I was a computer science professor heading the User-Centric Data Science research group at the VU University Amsterdam. Our team invented the CrowdTruth crowdsourcing method and applied it in various domains such as digital humanities, medicine, and online multimedia. I guided the human-in-the-loop strategies as Chief Scientist at the NY-based startup Tagasauris. Currently I am also president of the User Modeling Society. For a list of my publications, please see my profile on Google Scholar.

Greg Diamos (Landing AI)
Cody Coleman (Stanford University)

Cody is a computer science Ph.D. candidate at Stanford University, advised by Professors Matei Zaharia and Peter Bailis and supported by a National Science Foundation Fellowship. As a member of the Stanford DAWN Project, Cody’s research is focused on democratizing machine learning through tools and infrastructure that enable more than the most well-funded teams to create innovative and impactful systems; this includes reducing the cost of producing state-of-the-art models and creating novel abstractions that simplify machine learning development and deployment. Prior to joining Stanford, he completed his B.S. and M.Eng. in electrical engineering and computer science at the Massachusetts Institute of Technology.

Vijay Janapa Reddi (Harvard University)
Joaquin Vanschoren (Eindhoven University of Technology)

Joaquin Vanschoren is an Assistant Professor in Machine Learning at the Eindhoven University of Technology. He holds a PhD from the Katholieke Universiteit Leuven, Belgium. His research focuses on meta-learning and on understanding and automating machine learning. He founded and leads OpenML.org, a popular open science platform that facilitates the sharing and reuse of reproducible empirical machine learning data. He has received several demo and application awards and has been an invited speaker at ECDA, StatComp, IDA, AutoML@ICML, CiML@NIPS, AutoML@PRICAI, MLOSS@NIPS, and many other occasions, as well as a tutorial speaker at NIPS and ECMLPKDD. He was general chair at LION 2016, program chair of Discovery Science 2018, and demo chair at ECMLPKDD 2013, and he co-organizes the AutoML and meta-learning workshop series at NIPS 2018, ICML 2016-2018, ECMLPKDD 2012-2015, and ECAI 2012-2014. He is also an editor of and contributor to the book 'Automatic Machine Learning: Methods, Systems, Challenges'.

Carole-Jean Wu (Facebook AI Research)
Sharon Zhou (Stanford University)
Lynn He (DeepLearning.AI)