RASL: Relational Algebra in Scikit-Learn Pipelines
Chirag Sahni · Kiran Kate · Avi Shinnar · Thanh Lam Hoang · Martin Hirzel

Mon Dec 13 10:34 AM -- 10:48 AM (PST)
Event URL: https://openreview.net/forum?id=u9ct1gjoDcn

Keywords: relational algebra, scikit pipelines, machine learning

TL;DR: This paper proposes RASL, an open-source library of relational algebra (RA) operators for scikit-learn (SL).

Abstract: Integrating data preparation with machine-learning (ML) pipelines has been a long-standing challenge. Prior work tried to solve it by building new data processing platforms such as MapReduce or Spark, and then implementing new libraries of ML algorithms for those. But despite the availability of these platforms, many ML practitioners continue to use scikit-learn instead, owing to its clean design and rich set of algorithms. Therefore, this paper proposes a different approach: instead of extending a data processing platform for ML, extend an ML library for data processing. Specifically, this paper proposes RASL, an open-source library of relational algebra (RA) operators for scikit-learn (SL). We illustrate RASL with a detailed case study involving joins and aggregation across multi-table input data. We hope our approach will lead to cleaner integration of data preparation with machine learning in practice.

Author Information

Chirag Sahni (Rensselaer Polytechnic Institute)
Kiran Kate (IBM Research)
Avi Shinnar (International Business Machines)
Thanh Lam Hoang (IBM Research)
Martin Hirzel (IBM Research AI)

More from the Same Authors