
Towards an Artificial Intelligence for Data Science
Charles Sutton · James Geddes · Zoubin Ghahramani · Padhraic Smyth · Chris Williams

Fri Dec 09 11:00 PM -- 09:30 AM (PST) @ Room 114
Event URL: http://workshops.inf.ed.ac.uk/nips2016-ai4datasci/

Machine learning methods have been applied beyond their origins in artificial intelligence to a wide variety of data analysis problems in fields such as science, health care, technology, and commerce. Previous research in machine learning, perhaps motivated by its roots in AI, has primarily aimed at fully automated approaches for prediction problems. But predictive analytics is only one step in the larger pipeline of data science, which includes data wrangling, data cleaning, exploratory visualization, data integration, model criticism and revision, and presentation of results to domain experts.

An emerging strand of work aims to address all of these challenges in one stroke by automating a greater portion of the full data science pipeline. This workshop will bring together experts in machine learning, data mining, databases and statistics to discuss the challenges that arise in the full end-to-end process of collecting data, analysing data, and making decisions, and in building new methods that support, whether in an automated or semi-automated way, more of the full process of analysing real data.

Considering the full process of data science raises interesting questions for discussion, such as: What aspects of data analysis might potentially be automated and what aspects seem more difficult? Statistical model building often emphasizes interpretability and human understanding, while machine learning often emphasizes predictive modeling. Are ML methods truly suitable for supporting the full data analysis pipeline? Do recent advances in ML offer help here? Finally, is there low-hanging fruit, i.e., how much time is wasted on routine tasks in scientific data analysis that could be automated?

Specific topics of interest include: data cleaning, exploratory data analysis, semi-supervised learning, active learning, interactive machine learning, model criticism, automated and semi-automated model construction, usable machine learning, interpretable prediction methods and automatic methods to explain predictions. We are especially interested in contributions that take a broader perspective, i.e., that aim toward supporting the process of data science more holistically.

Sat 12:10 a.m. - 12:50 a.m.

One of the first steps in the data analysis pipeline is data cleaning: detecting data from failed sensors. This talk will discuss the application of anomaly detection algorithms to find and remove bad readings from weather station data. We will review our previous work on DBN time series models and our current work on applying non-parametric anomaly detection algorithms as part of our SENSOR-DX multi-view anomaly detection architecture. A major challenge in evaluating these algorithms is to obtain ground truth, because real sensor data tends to be labeled conservatively by domain experts.

Tom Dietterich
Sat 12:50 a.m. - 1:10 a.m.
Automatic Discovery of the Statistical Types of Variables in a Dataset (Talk)
Isabel Valera, Zoubin Ghahramani
Sat 1:10 a.m. - 1:30 a.m.

Isabel Valera and Zoubin Ghahramani. Automatic Discovery of the Statistical Types of Variables in a Dataset

David Janz, Brooks Paige, Tom Rainforth, Jan-Willem van de Meent and Frank Wood. Probabilistic structure discovery in time series data

Richard Lippmann, William Campbell and Joseph Campbell. An Overview of the DARPA Data Driven Discovery of Models (D3M) Program

Kristin Bennett, John Erickson, Hannah De Los Santos, Evan Patton, John Sheehan and Deborah McGuinness. Data Analytics as Data: A Semantic Workflow Approach

Lidia Contreras-Ochando, Fernando Martínez-Plumed, Cesar Ferri, Jose Hernandez-Orallo and Maria Jose Ramirez-Quintana. General-Purpose Inductive Programming for Data Wrangling Automation

Lev Faivishevsky and Amitai Armon. Using Downhill Simplex Method for Optimizing Machine Learning Training Running Time

Lidia Contreras-Ochando, Fernando Martínez-Plumed, Cesar Ferri, Jose Hernandez-Orallo and Maria Jose Ramirez-Quintana. Logging Data Scientists: Collecting Evidence for Data Science Automation

Lin Li, William Campbell, Cagri Dagli and Joseph Campbell. Making Sense of Unstructured Text Data

Cornelia Caragea. Identifying Descriptive Keyphrases from Scholarly Big Data

Simao Eduardo and Charles Sutton. Data Cleaning using Probabilistic Models of Integrity Constraints

Udayan Khurana, Fatemeh Nargesian, Horst Samulowitz, Elias Khalil and Deepak Turaga. Automating Feature Engineering

Zhao Xu and Lorenzo von Ritter. Adaptive Streaming Anomaly Analysis

Sat 2:00 a.m. - 2:40 a.m.

Christian Steinruecken, University of Cambridge

Christian Steinruecken
Sat 2:40 a.m. - 3:00 a.m.

Existing methods for structure discovery in time series data construct interpretable, compositional kernels for Gaussian process regression models. While the learned Gaussian process model provides posterior mean and variance estimates, typically the structure is learned via a greedy optimization procedure. This restricts the space of possible solutions and leads to over-confident uncertainty estimates. We introduce a fully Bayesian approach, inferring a full posterior over structures, which more reliably captures the uncertainty of the model.

David Janz, Brooks Paige, Tom Rainforth, Jan-Willem van de Meent
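
To make the abstract above more concrete, here is a minimal sketch of what a compositional kernel for Gaussian process regression can look like, and how the posterior mean and variance are obtained from it. It does not implement the authors' Bayesian structure inference or any greedy search. The base kernels, their hyperparameters, the fixed kernel structure, the noise level, and the toy data are all assumptions chosen for illustration.

```python
import numpy as np

# Base kernels (illustrative choices, not taken from the paper).
def rbf(lengthscale=1.0):
    """Squared-exponential kernel: smooth, local variation."""
    def k(x, y):
        d = x[:, None] - y[None, :]
        return np.exp(-0.5 * (d / lengthscale) ** 2)
    return k

def periodic(period=1.0, lengthscale=1.0):
    """Periodic kernel: repeating structure with the given period."""
    def k(x, y):
        d = x[:, None] - y[None, :]
        return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale ** 2)
    return k

def linear():
    """Linear kernel: straight-line trends through the origin."""
    def k(x, y):
        return x[:, None] * y[None, :]
    return k

# Composition operators: sums and products of kernels are again kernels.
def add(k1, k2):
    return lambda x, y: k1(x, y) + k2(x, y)

def mul(k1, k2):
    return lambda x, y: k1(x, y) * k2(x, y)

def gp_posterior(kernel, x_train, y_train, x_test, noise=1e-2):
    """Posterior mean and variance of a zero-mean GP at x_test."""
    K = kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = kernel(x_train, x_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha
    var = np.diag(kernel(x_test, x_test)) - np.sum(v ** 2, axis=0)
    return mean, var

# Toy series: a linear trend plus a periodic component (assumed data).
x = np.linspace(0.0, 4.0, 40)
y = 0.5 * x + np.sin(2.0 * np.pi * x)

# One fixed structure: linear + (periodic * rbf).
kernel = add(linear(), mul(periodic(period=1.0), rbf(lengthscale=2.0)))

x_new = np.linspace(0.0, 5.0, 11)
mean, var = gp_posterior(kernel, x, y, x_new)
```

The sketch commits to a single kernel structure, so its variance estimates are conditional on that choice. The abstract's point is that inferring a posterior over many such structures captures uncertainty that a single selected structure leaves out.
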
Sat 3:00 a.m. - 5:00 a.m.
Poster session
Sat 5:00 a.m. - 5:40 a.m.
Invited talk, Carlos Guestrin (Talk)
Carlos Guestrin
Sat 5:40 a.m. - 6:00 a.m.

Richard Lippmann, William Campbell, Joseph Campbell

A new DARPA program called Data Driven Discovery of Models (D3M) aims to develop automated model discovery systems that can be used by researchers with specific subject matter expertise to create empirical models of real, complex processes. Two major goals of this program are to allow experts to create empirical models without the need for data scientists and to increase the productivity of data scientists via automation. The automated model discovery systems developed will be tested on real-world problems that get progressively harder during the course of the program. Toward the end of the program, problems will be both unsolved and underspecified in terms of data and desired outcomes. The program will emphasize creating and leveraging open source technology and architecture. Our presentation reviews the goals and structure of this program, which will begin early in 2017. Although the deadline for submitting proposals has passed, we welcome suggestions concerning challenge tasks, evaluations, or new open-source data sets to be included for system development and evaluation that would supplement data currently being curated from many sources.

Richard Lippmann, William Campbell
Sat 6:30 a.m. - 7:10 a.m.
Invited talk, Frank Hutter (Talk)
Frank Hutter
Sat 7:10 a.m. - 7:30 a.m.

Kristin Bennett, John Erickson, Hannah De Los Santos, Evan Patton, John Sheehan, Deborah McGuinness

By treating the end-to-end data science workflow as data itself and through the conceptual modeling of the goals and functional intent of the data analyst, the entire process of data analytics becomes open and accessible to the powerful tools of artificial intelligence, machine learning, statistics, and data mining. We examine the fundamental questions and capabilities that must be addressed to realize capturing and reasoning over workflows as well as interpreting and contextualizing their results. Our approach focuses on capturing key components of complete workflow processes, making explicit the “deep” semantics of the workflow plan; the analysis performed; the structure and sub-components of the workflow; and intermediate and final data products. Our goal is to provide sufficient detail to facilitate practical workflow and work product integration, interpretation, reuse, reproducibility, recommendation, and search. The structure for this workflow-as-data view is formalized by an extensible, reusable ontology that we are creating that applies to all aspects of the workflow representation and reasoning process. We report on our exploration and reuse of existing methods, tools and ontologies as well as our semantic analytics contributions to real-world projects addressing childhood health challenges.

Kristin P Bennett
Sat 7:30 a.m. - 7:50 a.m.

Lidia Contreras-Ochando, Fernando Martínez-Plumed, Cesar Ferri, Jose Hernandez-Orallo, Maria Jose Ramirez-Quintana

Data acquisition, integration, transformation, cleansing and other highly tedious tasks take up a large proportion of data science projects. These routine tasks are tedious largely because they are repetitive and, hence, automatable. As a consequence, progress in automating them can lead to a dramatic reduction in the cost and duration of data science projects. Recently, Inductive Programming (IP) has shown great potential as a paradigm for addressing this automation. This short paper elaborates on the recent success of induction using domain-specific languages (DSLs) for the automation of the data wrangling process and advocates the use of inductive programming over general-purpose declarative languages (GPDLs) using domain-specific background knowledge (DSBKs).
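
As a rough illustration of the idea behind inducing a data wrangling program from examples, the sketch below enumerates short compositions of string primitives until it finds one that reproduces every given input/output pair. It is not the authors' system and does not use a general-purpose declarative language. The primitive set (`PRIMITIVES`), the helper names `run` and `synthesize`, and the example data are hypothetical and chosen only to keep the search tiny.

```python
from itertools import product

# Hypothetical mini-DSL: each primitive maps a string to a string.
PRIMITIVES = {
    "strip": str.strip,
    "lower": str.lower,
    "upper": str.upper,
    "title": str.title,
    "first_word": lambda s: s.split()[0] if s.split() else s,
    "last_word": lambda s: s.split()[-1] if s.split() else s,
}

def run(program, text):
    """Apply a sequence of primitive names to the text, left to right."""
    for name in program:
        text = PRIMITIVES[name](text)
    return text

def synthesize(examples, max_length=3):
    """Return the first shortest primitive sequence consistent with all
    (input, output) examples, or None if none exists within max_length."""
    for length in range(1, max_length + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, src) == dst for src, dst in examples):
                return program
    return None

# Two messy name fields and the cleaned surname we want (assumed data).
examples = [("  ada LOVELACE ", "Lovelace"), ("alan  turing", "Turing")]
program = synthesize(examples)

# The induced program can then be applied to unseen rows.
cleaned = run(program, " grace HOPPER")
```

Exhaustive enumeration only works because the primitive set here is so small. The abstract's argument is that domain-specific background knowledge is what keeps this kind of search tractable in realistic wrangling tasks.
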

Author Information

Charles Sutton (Google)
James Geddes (The Alan Turing Institute)
Zoubin Ghahramani (Uber and University of Cambridge)

Zoubin Ghahramani is Professor of Information Engineering at the University of Cambridge, where he leads the Machine Learning Group. He studied computer science and cognitive science at the University of Pennsylvania, obtained his PhD from MIT in 1995, and was a postdoctoral fellow at the University of Toronto. His academic career includes concurrent appointments as one of the founding members of the Gatsby Computational Neuroscience Unit in London, and as a faculty member of CMU's Machine Learning Department for over 10 years. His current research interests include statistical machine learning, Bayesian nonparametrics, scalable inference, probabilistic programming, and building an automatic statistician. He has held a number of leadership roles as programme and general chair of the leading international conferences in machine learning including: AISTATS (2005), ICML (2007, 2011), and NIPS (2013, 2014). In 2015 he was elected a Fellow of the Royal Society.

Padhraic Smyth (University of California, Irvine)
Chris Williams (University of Edinburgh)
