Talk
in
Workshop: Nearest Neighbors for Modern Applications with Massive Data: An Age-old Solution with New Challenges

Data-dependent methods for similarity search in high dimensions

Piotr Indyk

2017 Talk
in
Workshop: Nearest Neighbors for Modern Applications with Massive Data: An Age-old Solution with New Challenges

Abstract

I will give an overview of recent research on designing provable data-dependent approximate similarity search methods for high-dimensional data. Unlike many approaches proposed in the algorithms literature, the new methods lead to data structures that adapt to the data while retaining correctness and running time guarantees. The high-level principle behind those methods is that every data set (even a completely random one) has some structure that can be exploited algorithmically. This interpolates between basic "data-oblivious" approaches (e.g., those based on simple random projections) that are analyzable but do not adapt easily to the structure of the data, and basic "data-dependent" methods that exploit specific structure of the data, but might not perform well if this structure is not present.

Concretely, I will cover two recent lines of research: - Data-dependent Locality-Sensitive Hashing: approximate nearest neighbor search methods that provably improve over the best possible data-oblivious LSH algorithms (covers work from SODA'14, STOC'15 and NIPS'15). - Data-dependent metric compression: algorithms that represent sets of points using a small number of bits while approximately preserving the distances up to higher accuracy than previously known (covers work from SODA'17 and NIPS'17).

Chat is not available.