Poster
Sample Complexity of Testing the Manifold Hypothesis
Hariharan Narayanan · Sanjoy K Mitter
The hypothesis that high dimensional data tends to lie in the vicinity of a low dimensional manifold is the basis of a collection of methodologies termed Manifold Learning.
In this paper, we study statistical aspects of the question of fitting a manifold with a nearly optimal least squared error. Given upper bounds on the dimension, volume, and curvature, we show that Empirical Risk Minimization can produce a nearly optimal manifold using a number of random samples that is {\it independent} of the ambient dimension of the space in which data lie. We obtain an upper bound on the required number of samples that depends polynomially on the curvature, exponentially on the intrinsic dimension, and linearly on the intrinsic volume.
For constant error, we prove a matching minimax lower bound on the sample complexity that shows that this dependence on intrinsic dimension, volume and curvature is unavoidable.
Whether the known lower bound of $O(\frac{k}{\epsilon^2} + \frac{\log \frac{1}{\delta}}{\epsilon^2})$ for the sample complexity of Empirical Risk Minimization on $k$-means applied to data in a unit ball of arbitrary dimension is tight has been an open question since 1997 \cite{bart2}.
Here $\epsilon$ is the desired bound on the error and $\delta$ is a bound on the probability of failure. We improve the best currently known upper bound \cite{pontil} of $O(\frac{k^2}{\epsilon^2} + \frac{\log \frac{1}{\delta}}{\epsilon^2})$ to $O\left(\frac{k}{\epsilon^2}\left(\min\left(k, \frac{\log^4 \frac{k}{\epsilon}}{\epsilon^2}\right)\right) + \frac{\log \frac{1}{\delta}}{\epsilon^2}\right)$.
Based on these results, we devise a simple algorithm for $k$-means, and another algorithm that uses a family of convex programs to fit a piecewise linear curve of a specified length to high dimensional data; in both cases the sample complexity is independent of the ambient dimension.
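To get a feel for how the $k$-means bounds quoted in the abstract compare, the following minimal sketch evaluates the three expressions numerically. It rests on assumptions not stated in the abstract: the constants hidden by the $O(\cdot)$ notation are simply dropped, the logarithms are taken to be natural, and the function names are hypothetical. The values are therefore only meaningful relative to one another, not as actual sample counts.

```python
import math


def lower_bound(k, eps, delta):
    """k/eps^2 + log(1/delta)/eps^2, with constants dropped (assumption)."""
    return k / eps**2 + math.log(1 / delta) / eps**2


def prior_upper_bound(k, eps, delta):
    """k^2/eps^2 + log(1/delta)/eps^2, with constants dropped (assumption)."""
    return k**2 / eps**2 + math.log(1 / delta) / eps**2


def improved_upper_bound(k, eps, delta):
    """(k/eps^2) * min(k, log^4(k/eps)/eps^2) + log(1/delta)/eps^2.

    Natural logarithm and unit constants are assumptions of this sketch.
    """
    inner = min(k, math.log(k / eps) ** 4 / eps**2)
    return (k / eps**2) * inner + math.log(1 / delta) / eps**2


if __name__ == "__main__":
    # Illustrative parameter choices only; they do not come from the paper.
    for k, eps in [(10, 0.1), (100_000, 0.5)]:
        delta = 0.1
        print(
            k,
            eps,
            lower_bound(k, eps, delta),
            improved_upper_bound(k, eps, delta),
            prior_upper_bound(k, eps, delta),
        )
```

Because the improved expression replaces one factor of $k$ by $\min(k, \cdot)$, it can never exceed the earlier $k^2/\epsilon^2$ expression. With unit constants, the two coincide until $k$ exceeds $\log^4(k/\epsilon)/\epsilon^2$, so in this sketch the gain only shows up when $k$ is very large relative to $1/\epsilon$.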
Author Information
Hariharan Narayanan (Tata Institute of Fundamental Research)
Sanjoy K Mitter (Massachusetts Institute of Technology)
Related Events (a corresponding poster, oral, or spotlight)

2010 Spotlight: Sample Complexity of Testing the Manifold Hypothesis
Wed. Dec 8th, 11:25 – 11:30 PM, Room Regency Ballroom
More from the Same Authors

2010 Poster: Random Walk Approach to Regret Minimization
Hariharan Narayanan · Sasha Rakhlin
2010 Poster: Probabilistic Belief Revision with Structural Constraints
Peter B Jones · Venkatesh Saligrama · Sanjoy K Mitter
2006 Poster: On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts
Hariharan Narayanan · Mikhail Belkin · Partha Niyogi