RASL: Relational Algebra in Scikit-Learn Pipelines
Kiran Kate · Avi Shinnar · Thanh Lam Hoang · Martin Hirzel
Event URL: https://openreview.net/forum?id=u9ct1gjoDcn

Integrating data preparation with machine-learning (ML) pipelines has been a long-standing challenge. Prior work tried to solve it by building new data processing platforms such as MapReduce or Spark, and then implementing new libraries of ML algorithms for those. But despite the availability of these platforms, many ML practitioners continue to use scikit-learn instead, owing to its clean design and rich set of algorithms. Therefore, this paper proposes a different approach: instead of extending a data processing platform for ML, extend an ML library for data processing. Specifically, this paper proposes RASL, an open-source library of relational algebra (RA) operators for scikit-learn (SL). We illustrate RASL with a detailed case study involving joins and aggregation across multi-table input data. We hope our approach will lead to cleaner integration of data preparation with machine learning in practice.
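
To make the idea of relational operators inside a scikit-learn pipeline concrete, here is a minimal sketch that uses only pandas and scikit-learn. It does not use RASL or reproduce its API; the tables (`customers`, `orders`), the helper `join_and_aggregate`, and the toy labels are all hypothetical and chosen purely for illustration. The sketch aggregates a secondary table, joins the result onto the primary table, and passes the joined features to an ordinary estimator.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Hypothetical multi-table input: one row per customer, many orders per customer.
customers = pd.DataFrame({"customer_id": [1, 2, 3, 4], "age": [25, 40, 31, 58]})
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [10.0, 20.0, 5.0, 7.0, 8.0, 9.0],
})
labels = [0, 1, 0, 1]  # toy target, one label per customer


def join_and_aggregate(cust, orders):
    """Aggregate orders per customer, then left-join onto the customer table."""
    per_customer = orders.groupby("customer_id", as_index=False).agg(
        total_amount=("amount", "sum"),
        n_orders=("amount", "count"),
    )
    joined = cust.merge(per_customer, on="customer_id", how="left")
    # Customers without any orders get zeros instead of missing values.
    joined = joined.fillna({"total_amount": 0.0, "n_orders": 0})
    return joined[["age", "total_amount", "n_orders"]]


# The relational step is an ordinary pipeline stage; the secondary table is
# passed as a keyword argument (an assumption of this sketch, not RASL's design).
pipeline = Pipeline([
    ("relational_prep", FunctionTransformer(join_and_aggregate, kw_args={"orders": orders})),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipeline.fit(customers, labels)
predictions = pipeline.predict(customers)
```

Because the join and aggregation live inside the pipeline, the same data preparation runs at both fit and predict time, which is the kind of integration the abstract argues for.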

Author Information

Kiran Kate (IBM Research)
Avi Shinnar (International Business Machines)
Thanh Lam Hoang (IBM Research)
Martin Hirzel (IBM Research AI)