Timezone: »
Poster
Query Kmeans Clustering and the Double Dixie Cup Problem
Eli Chien · Chao Pan · Olgica Milenkovic
We consider the problem of approximate $K$means clustering with outliers and side information provided by samecluster queries and possibly noisy answers. Our solution shows that, under some mild assumptions on the smallest cluster size, one can obtain an $(1+\epsilon)$approximation for the optimal potential with probability at least $1\delta$, where $\epsilon>0$ and $\delta\in(0,1)$, using an expected number of $O(\frac{K^3}{\epsilon \delta})$ noiseless samecluster queries and comparisonbased clustering of complexity $O(ndK + \frac{K^3}{\epsilon \delta})$; here, $n$ denotes the number of points and $d$ the dimension of space. Compared to a handful of other known approaches that perform importance sampling to account for small cluster sizes, the proposed query technique reduces the number of queries by a factor of roughly $O(\frac{K^6}{\epsilon^3})$, at the cost of possibly missing very small clusters. We extend this settings to the case where some queries to the oracle produce erroneous information, and where certain points, termed outliers, do not belong to any clusters. Our proof techniques differ from previous methods used for $K$means clustering analysis, as they rely on estimating the sizes of the clusters and the number of points needed for accurate centroid estimation and subsequent nontrivial generalizations of the double Dixie cup problem. We illustrate the performance of the proposed algorithm both on synthetic and real datasets, including MNIST and CIFAR $10$.
Author Information
Eli Chien (UIUC)
Chao Pan (University of Illinois UrbanaChampaign)
I am currently a ECE graduate student. My research interests are machine learning and optimization.
Olgica Milenkovic (University of Illinois at UrbanaChampaign)
More from the Same Authors

2019 Poster: Optimizing Generalized PageRank Methods for SeedExpansion Community Detection »
Pan Li · Eli Chien · Olgica Milenkovic 
2019 Poster: Online Convex Matrix Factorization with Representative Regions »
Jianhao Peng · Olgica Milenkovic · Abhishek Agarwal 
2018 Poster: Revisiting Decomposable Submodular Function Minimization with Incidence Relations »
Pan Li · Olgica Milenkovic 
2018 Poster: Quadratic Decomposable Submodular Function Minimization »
Pan Li · Niao He · Olgica Milenkovic 
2017 Poster: Inhomogeneous Hypergraph Clustering with Applications »
Pan Li · Olgica Milenkovic 
2017 Spotlight: Inhomogoenous Hypergraph Clustering with Applications »
Pan Li · Olgica Milenkovic 
2012 Workshop: Social Choice: Theory and Practice »
Behrouz Touri · Olgica Milenkovic · Faramarz Fekri