Relational data represents the vast majority of data in the enterprise world, yet almost none of the ML computation happens inside the relational databases where the data reside. Instead, a great deal of time is wasted denormalizing the data and moving them out of the databases in order to train models. Relational learning, which takes advantage of relational data structure, has been a research area for over twenty years, but it has not been connected with relational database systems, despite the fact that relational databases are the natural place to store relational data. Recent advances in database research have shown that it is possible to exploit the relational structure of data in order to accelerate ML algorithms. Research on relational algebra originating from the database community has shown that linear algebra operations can be accelerated further. Probabilistic programming has also been proposed as a framework for AI that can be realized in relational databases. Data programming, a mechanism for weak/self-supervision, is slowly migrating to the natural place where data are stored: the database. Finally, as deep learning models grow, several systems are being developed for model management inside relational databases.
Mon 5:50 a.m. - 6:00 a.m.
Opening Remarks (Welcome from the organizers)
Mon 6:00 a.m. - 6:45 a.m.
Machine Learning through Database Glasses (Invited Talk)
Abstract: As we witness the data science revolution, each research community legitimately reflects on its relevance and place in this new landscape. The database research community has at least three reasons to feel empowered by this revolution. This has to do with the pervasiveness of relational data in data science, the widespread need for efficient data processing, and the new processing challenges posed by data science workloads beyond the classical database workloads. The first two aforementioned reasons are widely acknowledged as core to the community's raison d'être. The third reason explains the longevity of relational database management systems' success: whenever a promising new data-centric technology surfaces, research is under way to show that it can be captured naturally by variations or extensions of the existing relational techniques. In this talk, I will make the case for a first-principles approach to machine learning over relational databases that guided our recent work and can dramatically improve the runtime performance of machine learning. This approach exploits the algebraic and combinatorial structure of relational data processing. It also relies on compilation for hybrid database and learning workloads and on computation sharing across aggregates in learning-specific batches. This work is the outcome of extensive collaboration of the author with colleagues from RelationalAI (https://www.relational.ai), in particular Mahmoud Abo Khamis, Molham Aref, Hung Ngo, and XuanLong Nguyen, and from the FDB research project (https://fdbresearch.github.io/), in particular Ahmet Kara, Milos Nikolic, Maximilian Schleich, Amir Shaikhha, and Haozhe Zhang.
Dan Olteanu
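To make the structure-exploiting idea concrete, here is a toy sketch (not from the talk; the relations R, S and the aggregate are invented for illustration) of how a regression-style aggregate such as SUM(x*y) over a join can be computed without materializing the joined table, by aggregating each relation per join key first.

```python
# A toy illustration (not the talk's actual system): computing the aggregate
# SUM(x * y) over the join R(k, x) JOIN S(k, y) without materializing the join.
# Such sums of products are building blocks of least-squares learning over
# joins; pushing them past the join avoids the blow-up of the joined table.
from collections import defaultdict

R = [(1, 2.0), (1, 3.0), (2, 5.0)]   # (k, x) tuples
S = [(1, 10.0), (2, 1.0), (2, 4.0)]  # (k, y) tuples

# Naive approach: materialize the join, then aggregate.
naive = sum(x * y for (k1, x) in R for (k2, y) in S if k1 == k2)

# Factorized approach: aggregate each relation per join key first,
# then combine the partial sums per key.
sum_x = defaultdict(float)
sum_y = defaultdict(float)
for k, x in R:
    sum_x[k] += x
for k, y in S:
    sum_y[k] += y
factorized = sum(sum_x[k] * sum_y[k] for k in sum_x.keys() & sum_y.keys())

assert abs(naive - factorized) < 1e-9
print(factorized)  # 75.0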
Mon 6:45 a.m. - 7:30 a.m.
Programmatic Supervision for Model-Centric AI (Invited Talk)
Abstract: One of the key bottlenecks in building machine learning systems is creating and managing the massive training datasets that today's models learn from. In this talk, we will describe our work at Snorkel AI on labeling training data efficiently using our system, Snorkel, which allows users to programmatically label training data. Snorkel has been deployed by major technology companies like Google, Facebook and Intel, academic labs, and government agencies. Rather than hand-labeling training data, users write labeling functions which label data using heuristic strategies such as pattern matching, distant supervision, and other models. These labeling functions can have noisy, conflicting, and correlated outputs, which Snorkel models and combines into clean training labels. This allows training sets to be built in hours or days, rather than months or years.
Paroma Varma
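As a concrete illustration of programmatic labeling, the sketch below uses the open-source snorkel package; the names labeling_function, PandasLFApplier, and LabelModel follow its 0.9.x API, and the exact signatures should be treated as assumptions to check against the installed version. Two heuristic labeling functions vote on a toy spam task, and the label model combines their noisy outputs into training labels.

```python
# A minimal programmatic-labeling sketch in the spirit of Snorkel.
# Assumes the snorkel 0.9.x API (labeling_function, PandasLFApplier, LabelModel).
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, HAM, SPAM = -1, 0, 1

@labeling_function()
def lf_contains_link(x):
    # Pattern-matching heuristic: messages with links are often spam.
    return SPAM if "http" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_reply(x):
    # Another weak heuristic: very short messages tend to be ham.
    return HAM if len(x.text.split()) < 4 else ABSTAIN

df = pd.DataFrame({"text": [
    "check out http://spam.example", "thanks!",
    "win money now http://x.y", "see you tomorrow at noon",
]})

# Apply all labeling functions to get a (num_examples x num_lfs) label matrix.
applier = PandasLFApplier(lfs=[lf_contains_link, lf_short_reply])
L_train = applier.apply(df)

# The label model resolves noisy, conflicting votes into probabilistic labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=100, seed=0)
df["weak_label"] = label_model.predict(L_train)
print(df)
```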
Mon 7:30 a.m. - 8:00 a.m.
Break
Mon 8:00 a.m. - 8:45 a.m.
The New DBfication of ML/AI (Invited Talk)
Abstract: The recent boom in ML/AI applications has brought into sharp focus the pressing need for tackling the concerns of scalability, usability, and manageability across the entire lifecycle of ML/AI applications. The ML/AI world has long studied the concerns of accuracy, automation, etc. from theoretical and algorithmic vantage points. But to truly democratize ML/AI, the vantage point of building and deploying practical systems is equally critical. In this talk, I will make the case that it is high time to bridge the gap between the ML/AI world and a world that exemplifies successful democratization of data technology: databases. I will show how new bridges rooted in the principles, techniques, and tools of the database world are helping tackle the above pressing concerns and in turn, posing new research questions to the world of ML/AI. As case studies of such bridges, I will describe two lines of work from my group: query optimization for ML systems and benchmarking data preparation in AutoML platforms. I will conclude with my thoughts on community mechanisms to foster more such bridges between research worlds and between research and practice.
Arun Kumar
Mon 8:45 a.m. - 9:08 a.m.
Collective Grounding: Relational Learning Meets Relational Theory (Invited Talk)
Abstract: Relational learning takes advantage of relational structure in its inputs, e.g., graphs, and its outputs, e.g., constraints. Building upon that, statistical relational learning (SRL) defines structure using first-order predicate logic and models probabilistic dependencies between outputs. The use of predicate logic provides a natural groundwork for SRL to take advantage of the relational theory used in modern databases. Despite this common basis, SRL frameworks still have many unexplored opportunities to use the methods developed by the database community. Grounding, the process of enumerating all valid instantiations of structured tuples in the model, is one of the most computationally expensive components in SRL systems. In this talk, I explore the use of several concepts from database research to accelerate grounding. To improve grounding, we borrow from three well-known problems in the database community: query rewriting, query containment, and multi-query optimization. Although not exact matches, each of these problems appears in SRL grounding in a form analogous to its database counterpart. By recognizing the connection to well-researched database techniques, we are able to address these problems in a way that takes advantage of the structure provided by SRL and the existing research provided by the database community. We show that by implementing these techniques within an existing SRL system, we can achieve up to a 60% speedup in grounding.
Eriq Augustine
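Grounding as described here, enumerating all valid instantiations of a relational rule, is essentially answering a conjunctive query, which is why database techniques such as query rewriting and multi-query optimization apply. The sketch below is a generic illustration, not the SRL system from the talk: it grounds a toy rule Friends(A, B) AND Smokes(A) IMPLIES Smokes(B) with a single SQL join over an in-memory SQLite database.

```python
# Grounding a toy rule Friends(A, B) & Smokes(A) -> Smokes(B) as a SQL join.
# Illustrative only: it just makes the "grounding = conjunctive query"
# connection concrete; the talk's SRL system is not reproduced here.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE friends (a TEXT, b TEXT)")
cur.execute("CREATE TABLE smokes (p TEXT)")
cur.executemany("INSERT INTO friends VALUES (?, ?)",
                [("alice", "bob"), ("bob", "carol"), ("alice", "carol")])
cur.executemany("INSERT INTO smokes VALUES (?)", [("alice",), ("bob",)])

# Each result row is one ground instance of the rule body; an SRL system
# would attach a potential/weight to the implied ground atom Smokes(B).
groundings = cur.execute("""
    SELECT f.a, f.b
    FROM friends AS f
    JOIN smokes  AS s ON s.p = f.a
""").fetchall()

for a, b in groundings:
    print(f"Friends({a}, {b}) & Smokes({a}) -> Smokes({b})")
```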
Mon 9:08 a.m. - 9:25 a.m.
Two Ways of Thinking about Weighted Relations (Invited Talk)
Abstract: I will talk about two ways of describing weighted or probabilistic relations: First, mathematical notation for tensors with named axes, which removes the burden of keeping track of the order of axes and the purpose of each. It also makes it easy to extend operations on low-order tensors to higher order ones (e.g., to extend an operation on images to minibatches of images, or extend the attention mechanism to multiple attention heads). Our notation builds on ideas from many previous papers and software libraries, and we hope their adoption may result in clearer papers and less bug-prone implementations. Second, hyperedge replacement graph grammars for factor graphs, or factor graph grammars (FGGs) for short, generate sets of factor graphs and can describe a more general class of models than plate notation, dynamic graphical models, case-factor diagrams, and sum-product networks can. Moreover, inference can be done on FGGs without enumerating all the generated factor graphs. For finite variable domains (but possibly infinite sets of graphs), a generalization of variable elimination to FGGs allows exact and tractable inference in many situations.
David Chiang
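To make the named-axes idea concrete, here is a tiny illustrative wrapper (not the notation or library from the talk): a tensor carries a name for each axis, so operations refer to axes by name rather than by position, and adding a new axis such as a batch dimension does not break existing code.

```python
# A toy "named axes" wrapper (illustrative only, not the talk's notation):
# axes are addressed by name, so code does not depend on axis order.
import numpy as np

class NamedTensor:
    def __init__(self, data, axes):
        self.data = np.asarray(data)
        self.axes = tuple(axes)                # one name per dimension
        assert self.data.ndim == len(self.axes)

    def sum(self, axis_name):
        # Reduce over an axis selected by name instead of by position.
        i = self.axes.index(axis_name)
        return NamedTensor(self.data.sum(axis=i),
                           self.axes[:i] + self.axes[i + 1:])

# A single image: height x width.
img = NamedTensor(np.ones((4, 6)), axes=("height", "width"))
print(img.sum("width").axes)       # ('height',)

# The same code works unchanged on a minibatch of images,
# because "width" is found by name, not by position.
batch = NamedTensor(np.ones((8, 4, 6)), axes=("batch", "height", "width"))
print(batch.sum("width").axes)     # ('batch', 'height')
```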
Mon 9:25 a.m. - 10:20 a.m.
Lunch Break
Mon 10:20 a.m. - 10:34 a.m.
DRL-Clusters: Buffer Management with Clustering based Deep Reinforcement Learning (Contributed Talk)
Keywords: Buffer pool management, cache replacement, machine learning, deep learning, deep reinforcement learning, clustering
TL;DR: This paper proposes a deep reinforcement learning-based approach, DRL-Clusters, to manage the buffer pool for database systems when handling changing workloads.
Abstract: Buffer caches have been widely implemented in database systems to reduce disk I/Os. Existing database systems typically use heuristic-based algorithms for buffer replacement, which cannot dynamically adapt to changing workload patterns. This paper proposes a deep reinforcement learning-based approach, DRL-Clusters, to manage the buffer pool when handling changing workloads. DRL-Clusters can dynamically adapt to different workload patterns, without incurring high inference overhead or a high miss ratio, through page re-clustering and continuous interactions with the cache environment. Our evaluation results demonstrate that DRL-Clusters can achieve a miss ratio lower than or comparable to that of heuristic policies while reducing page access overhead by 13.3%-26.8% under changing workloads.
Kai Li · Qi Zhang · Lei Yu · Hong Min
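To give a flavor of learned buffer management (this is not the DRL-Clusters algorithm, whose details are in the paper), the sketch below frames eviction as a learning problem in the simplest possible way: an epsilon-greedy bandit picks an eviction heuristic for each window of page accesses and learns from the observed hit ratio, so it can re-adapt when the workload shifts. DRL-Clusters instead learns a deep policy over clustered pages, but the feedback loop with the cache environment is analogous.

```python
# A toy learned-eviction sketch (not the DRL-Clusters algorithm): an
# epsilon-greedy bandit chooses between LRU and LFU eviction per window
# and learns from the resulting hit ratio.
import random

CAPACITY, WINDOW, EPSILON, ALPHA = 8, 50, 0.1, 0.2
q = {"lru": 0.0, "lfu": 0.0}      # estimated hit ratio under each heuristic
cache, last_use, freq = set(), {}, {}
random.seed(0)

def access(page, policy, t):
    """Serve one page request under the given eviction heuristic."""
    hit = page in cache
    if not hit and len(cache) >= CAPACITY:
        victim = (min(cache, key=lambda p: last_use[p]) if policy == "lru"
                  else min(cache, key=lambda p: freq.get(p, 0)))
        cache.remove(victim)
    cache.add(page)
    last_use[page] = t
    freq[page] = freq.get(page, 0) + 1
    return hit

# Workload that shifts from a skewed (hot-set) pattern to a scan-like pattern.
trace = [random.randint(0, 5) for _ in range(1000)] + [i % 40 for i in range(1000)]

for start in range(0, len(trace), WINDOW):
    policy = (random.choice(list(q)) if random.random() < EPSILON
              else max(q, key=q.get))
    window = trace[start:start + WINDOW]
    hits = sum(access(p, policy, start + i) for i, p in enumerate(window))
    q[policy] += ALPHA * (hits / len(window) - q[policy])   # bandit update

print("learned hit-ratio estimates:", q)
```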
Mon 10:34 a.m. - 10:48 a.m.
RASL: Relational Algebra in Scikit-Learn Pipelines (Contributed Talk)
Keywords: relational algebra, scikit pipelines, machine learning
TL;DR: This paper proposes RASL, an open-source library of relational algebra (RA) operators for scikit-learn (SL).
Abstract: Integrating data preparation with machine-learning (ML) pipelines has been a long-standing challenge. Prior work tried to solve it by building new data processing platforms such as MapReduce or Spark, and then implementing new libraries of ML algorithms for those. But despite the availability of these platforms, many ML practitioners continue to use scikit-learn instead, owing to its clean design and rich set of algorithms. Therefore, this paper proposes a different approach: instead of extending a data processing platform for ML, extend an ML library for data processing. Specifically, this paper proposes RASL, an open-source library of relational algebra (RA) operators for scikit-learn (SL). We illustrate RASL with a detailed case study involving joins and aggregation across multi-table input data. We hope our approach will lead to cleaner integration of data preparation with machine learning in practice.
Chirag Sahni · Kiran Kate · Avi Shinnar · Thanh Lam Hoang · Martin Hirzel
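RASL's actual operator API is in the paper; the sketch below only illustrates the general idea of expressing relational data preparation as scikit-learn components, using a hypothetical custom transformer that aggregates and joins a secondary table before a downstream estimator in one Pipeline.

```python
# Illustrative sketch (not RASL's actual API): a custom scikit-learn
# transformer that performs an aggregation + join over multi-table input,
# so data preparation composes with an estimator in a single Pipeline.
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

class JoinAggregate(BaseEstimator, TransformerMixin):
    """Aggregate `detail` per key and left-join the result onto the main table."""
    def __init__(self, detail, key, value):
        self.detail, self.key, self.value = detail, key, value

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        agg = (self.detail.groupby(self.key)[self.value]
               .agg(["sum", "count"]).reset_index())
        out = X.merge(agg, on=self.key, how="left").fillna(0.0)
        return out.drop(columns=[self.key])

customers = pd.DataFrame({"cust_id": [1, 2, 3], "age": [25, 40, 31],
                          "churned": [0, 1, 0]})
orders = pd.DataFrame({"cust_id": [1, 1, 2, 3, 3, 3],
                       "amount": [10.0, 5.0, 99.0, 3.0, 4.0, 2.0]})

pipe = Pipeline([
    ("prep", JoinAggregate(detail=orders, key="cust_id", value="amount")),
    ("clf", LogisticRegression()),
])
X, y = customers.drop(columns=["churned"]), customers["churned"]
pipe.fit(X, y)
print(pipe.predict(X))
```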
Mon 10:48 a.m. - 11:02 a.m.
DP-KB: Data Programming with Knowledge Bases Improves Transformer Fine Tuning for Answer Sentence Selection (Contributed Talk)
Keywords: Knowledge Bases, Transformers, Question Answering, Language Models, Data Programming, Answer Sentence Selection, Natural Language Processing
TL;DR: We use data programming to enrich transformer training data with KB-derived context, demonstrate that it beats the SOTA approach on challenging datasets like WikiQA and TrecQA, and explore widely studied deficiencies of transformers as implicit KBs.
Abstract: While transformers demonstrate impressive performance on many knowledge-intensive (KI) tasks, their ability to serve as implicit knowledge bases (KBs) remains limited, as shown on several slot-filling, question-answering (QA), fact verification, and entity-linking tasks. In this paper, we implement an efficient data-programming technique that enriches training data with KB-derived context and improves transformer utilization of encoded knowledge when fine-tuning for a particular QA task, namely answer sentence selection (AS2). Our method outperforms the state-of-the-art transformer approach on WikiQA and TrecQA, two widely studied AS2 benchmarks, improving p@1, MAP, and MRR by 2.0%, 1.3%, and 1.1% on WikiQA and by 4.4%, 0.9%, and 2.4% on TrecQA, respectively. To demonstrate our improvements in an industry setting, we additionally evaluate our approach on a proprietary dataset of Alexa QA pairs and show increases of 2.3% F1 and 2.0% MAP. We additionally find that these improvements remain even when KB context is omitted at inference time, allowing for the use of our models within existing transformer workflows without additional latency or deployment costs.
Nicolaas Jedema · Thuy Vu · Manish Gupta · Alessandro Moschitti
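The paper's exact data-programming pipeline is not reproduced here; the sketch below only illustrates the general idea of enriching AS2 training pairs with KB-derived context, using a hypothetical entity dictionary and naive string matching to append entity descriptions to each candidate sentence before fine-tuning.

```python
# Illustrative sketch (not the paper's pipeline): enrich question/candidate
# pairs for answer sentence selection (AS2) with KB-derived context via a
# hypothetical entity dictionary and simple string matching.
KB = {  # hypothetical knowledge base: entity -> short description
    "amazon river": "The Amazon River is a river in South America, the largest by discharge.",
    "nile": "The Nile is a major north-flowing river in northeastern Africa.",
}

def enrich(question, candidate):
    """Append descriptions of KB entities mentioned in the pair as extra context."""
    text = f"{question} {candidate}".lower()
    context = " ".join(desc for ent, desc in KB.items() if ent in text)
    # The enriched candidate is what the transformer would see during fine-tuning.
    return f"{candidate} [CONTEXT] {context}".strip()

pairs = [
    ("Which river is the largest by discharge?",
     "The Amazon River discharges more water than the next seven rivers combined."),
    ("Where does the Nile flow?",
     "It flows north through eleven countries before reaching the Mediterranean."),
]
for q, c in pairs:
    print(enrich(q, c))
```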
Mon 11:02 a.m. - 11:16 a.m.
Compressing (Multidimensional) Learned Bloom Filters (Contributed Talk)
Keywords: learned bloom filters, compression, multidimensional data
TL;DR: Introducing a compression technique for reducing the memory consumption of learned multidimensional Bloom filters while preserving accuracy.
Abstract: Bloom filters are widely used data structures that compactly represent sets of elements. Querying a Bloom filter reveals whether an element is not included in the underlying set or is included with a certain error rate. This membership testing can be modeled as a binary classification problem and solved through deep learning models, leading to what is called learned Bloom filters. We have identified that the benefits of learned Bloom filters are apparent only when considering a vast amount of data, and even then, there is a possibility to further reduce their memory consumption. For that reason, we introduce a lossless input compression technique that improves the memory consumption of the learned model while preserving a comparable model accuracy. We evaluate our approach and show significant memory consumption improvements over learned Bloom filters.
Angjela Davitkova · Damjan Gjurovski · Sebastian Michel
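To make the learned-Bloom-filter setup concrete (the compression technique itself is in the paper), here is a sketch of the standard construction from prior work that learned Bloom filters build on: a classifier screens membership queries, and a conventional backup Bloom filter stores only the keys the classifier misses, so no true member is ever rejected. The model below is a hand-written stand-in for a trained classifier.

```python
# Sketch of the standard learned Bloom filter construction (the paper's
# compression technique is not shown): a classifier screens queries, and a
# backup Bloom filter holds the classifier's false negatives so that no
# true member is ever rejected.
import hashlib

class BloomFilter:
    def __init__(self, m=256, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _hashes(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for idx in self._hashes(item):
            self.bits[idx] = 1

    def __contains__(self, item):
        return all(self.bits[idx] for idx in self._hashes(item))

def model_score(key):
    # Stand-in "learned" model: a real learned Bloom filter would use a
    # trained binary classifier here.
    return 1.0 if len(key) > 8 else 0.0

THRESHOLD = 0.5
members = {"transaction-0001", "transaction-0002", "id7"}

backup = BloomFilter()
for key in members:
    if model_score(key) < THRESHOLD:      # the classifier misses this member
        backup.add(key)

def query(key):
    # Positive if the model says so, otherwise fall back to the backup filter.
    return model_score(key) >= THRESHOLD or key in backup

print([query(k) for k in members])   # all True: no false negatives
print(query("id9"))                  # short non-member; may be a false positive
```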
Mon 11:16 a.m. - 11:30 a.m.
Numerical Reasoning over Legal Contracts via Relational Database (Contributed Talk)
Keywords: Neural-Symbolic
TL;DR: A neural-symbolic system for numerical reasoning over legal contracts using a relational database.
Abstract: Numerical reasoning over text requires deep integration between the semantic understanding of the natural language context and the mathematical calculation of the symbolic terms. However, existing approaches are limited in their ability to incorporate domain-specific knowledge and to express mathematical formulas over data structures. Delegating logical reasoning to a relational database is a promising approach to supporting more complex reasoning. We study the problem of distilling natural language text into a relational database with a numerical data structure and querying this database to obtain the desired answers. Specifically, given a legal contract and a set of date-related questions in natural language, we utilize pre-trained neural network models to create a relational database from which the target dates are retrieved and generated. We evaluate our method on the CUAD dataset and demonstrate that our approach achieves high correct-answer coverage and significantly reduces incorrect results even without any labels.
Jiani Huang · Ziyang Li · Ilias Fountalis · Mayur Naik
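As a generic illustration of the delegate-reasoning-to-a-database idea (not the paper's system), the sketch below stores date facts that a neural extractor might produce from a contract into SQLite, and answers a date-arithmetic question with a SQL query instead of asking the language model to compute it. The table schema and facts are invented for the example.

```python
# Illustrative sketch (not the paper's system): once a neural model has
# extracted date facts from a contract, date arithmetic is delegated to a
# relational database query rather than to the language model itself.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE contract_dates
               (clause TEXT, event TEXT, event_date TEXT, notice_days INTEGER)""")

# Facts a trained extractor might produce from the contract text.
cur.executemany(
    "INSERT INTO contract_dates VALUES (?, ?, ?, ?)",
    [("Section 2.1", "effective_date", "2021-01-15", None),
     ("Section 9.3", "expiration_date", "2024-01-15", None),
     ("Section 9.4", "termination_notice", "2024-01-15", 90)],
)

# Question: "By what date must a termination notice be sent?"
# Answered by SQL date arithmetic over the extracted facts.
row = cur.execute("""
    SELECT clause, date(event_date, '-' || notice_days || ' days')
    FROM contract_dates
    WHERE event = 'termination_notice'
""").fetchone()
print(row)   # ('Section 9.4', '2023-10-17')
```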
Mon 11:30 a.m. - 12:00 p.m.
Deep Learning with Relations (Contributed Talk)
Molham Aref
Mon 12:00 p.m. - 12:15 p.m.
Break
Mon 12:15 p.m. - 12:45 p.m.
Towards AI-Native Databases (Invited Talk)
Olga Papaemmanouil
Mon 12:45 p.m. - 1:59 p.m.
AI workloads inside databases (Panel)
In this panel we will discuss the following topics
Guy Van den Broeck · Alexander Ratner · Benjamin Moseley · Konstantinos Karanasos · Parisa Kordjamshidi · Molham Aref · Arun Kumar
Mon 2:00 p.m. - 2:05 p.m.
Closing Remarks