Expo Talk Panel
Room R06-R09 (level 2)

Within enterprises, there is a growing need to intelligently navigate data lakes. Of particular importance is the ability to find related tables in data repositories: tables that are unionable, joinable, or subsets of each other. Example applications of this kind of discovery include privacy enforcement and analytical queries that span multiple tables. A number of pretrained models now target the processing of tabular data, but none targets the data discovery use case in particular. There is also a dearth of benchmark tasks to support learning data discovery tasks with neural tabular models. To help with neural tabular learning of data discovery, we developed LakeBench, a benchmark suite for a diverse set of data discovery tasks based on government data from CKAN, Socrata, and the European Central Bank. Inspired by what has been shown to work well for data discovery tasks, we also used a novel approach based on data sketches to create TabSketchFM, a neural model for data discovery. We contrast the data-sketch-based approach of TabSketchFM with the row-based approaches of other models and show that, for data discovery tasks, data-sketch-based approaches are more effective. Through ablation studies, we examine which specific types of data sketches help which tasks. Finally, we perform initial experiments that use models such as TabSketchFM in search, showing that they can re-rank and even improve the top-k search results of existing non-neural systems.
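
To make the idea of a data sketch concrete, the following is a minimal illustration of one widely used sketch, MinHash, applied to estimating how much two columns overlap (a signal relevant to joinability). The abstract does not say which sketches TabSketchFM uses, so this is not the authors' method; the function names `minhash_signature` and `estimate_jaccard` and the choice of 128 hash functions are assumptions made for the example.

```python
import hashlib


def minhash_signature(values, num_hashes=128):
    """Compact sketch of a column: for each seeded hash, keep the minimum hash value.

    Hypothetical helper for illustration; not taken from TabSketchFM or LakeBench.
    """
    distinct = {str(v) for v in values}  # MinHash works on the set of distinct values
    if not distinct:
        raise ValueError("column has no values")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt gives a different hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(v.encode("utf-8"), digest_size=8, salt=salt).digest(),
                    "big",
                )
                for v in distinct
            )
        )
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of signature positions that agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


if __name__ == "__main__":
    customers = [f"id-{i}" for i in range(100)]   # ids 0..99
    orders = [f"id-{i}" for i in range(50, 150)]  # ids 50..149, true Jaccard is 1/3
    estimate = estimate_jaccard(minhash_signature(customers), minhash_signature(orders))
    print(f"estimated overlap: {estimate:.2f}")
```

The signature has a fixed size regardless of how many rows the column holds, which is what makes sketches attractive as a compact input representation compared with feeding raw rows to a model.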
