Skip to yearly menu bar Skip to main content


Dimensionality Reduction of Massive Sparse Datasets Using Coresets

Dan Feldman · Mikhail Volkov · Daniela Rus

Area 5+6+7+8 #89

Keywords: [ Component Analysis (ICA,PCA,CCA, FLDA) ] [ Sparsity and Feature Selection ] [ (Application) Information Retrieval ] [ Matrix Factorization ] [ Clustering ]

Abstract: In this paper we present a practical solution with performance guarantees to the problem of dimensionality reduction for very large scale sparse matrices. We show applications of our approach to computing the Principle Component Analysis (PCA) of any $n\times d$ matrix, using one pass over the stream of its rows. Our solution uses coresets: a scaled subset of the $n$ rows that approximates their sum of squared distances to \emph{every} $k$-dimensional \emph{affine} subspace. An open theoretical problem has been to compute such a coreset that is independent of both $n$ and $d$. An open practical problem has been to compute a non-trivial approximation to the PCA of very large but sparse databases such as the Wikipedia document-term matrix in a reasonable time. We answer both of these questions affirmatively. Our main technical result is a new framework for deterministic coreset constructions based on a reduction to the problem of counting items in a stream.

Live content is unavailable. Log in and register to view live content