Timezone: »
We provide a clustering algorithm that approximately optimizes the k-means objective, in the one-pass streaming setting. We make no assumptions about the data, and our algorithm is very light-weight in terms of memory, and computation. This setting is applicable to unsupervised learning on massive data sets, or resource-constrained devices. The two main ingredients of our theoretical work are: a derivation of an extremely simple pseudo-approximation batch algorithm for k-means, in which the algorithm is allowed to output more than k centers (based on the recent "k-means++"), and a streaming clustering algorithm in which batch clustering algorithms are performed on small inputs (fitting in memory) and combined in a hierarchical manner. Empirical evaluations on real and simulated data reveal the practical utility of our method.
Author Information
Nir Ailon (Technion)
Ragesh Jaiswal (Columbia University)
Claire Monteleoni (University of Colorado Boulder)
Claire Monteleoni is an associate professor of Computer Science at University of Colorado Boulder. Previously, she was an associate professor at George Washington University, and research faculty at the Center for Computational Learning Systems, at Columbia University. She did a postdoc in Computer Science and Engineering at the University of California, San Diego, and completed her PhD and Masters in Computer Science, at MIT. She holds a Bachelors in Earth and Planetary Sciences from Harvard. Her research focuses on machine learning algorithms and theory for problems including learning from data streams, learning from raw (unlabeled) data, learning from private data, and climate informatics: accelerating discovery in climate science with machine learning. Her work on climate informatics received the Best Application Paper Award at NASA CIDU 2010. In 2011, she co-founded the International Workshop on Climate Informatics, which is now in its fourth year, attracting climate scientists and data scientists from over 14 countries and 26 states.
More from the Same Authors
-
2014 Tutorial: Climate Change: Challenges for Machine Learning »
Arindam Banerjee · Claire Monteleoni -
2011 Poster: Active Learning Ranking from Pairwise Preferences with Almost Optimal Query Complexity »
Nir Ailon -
2008 Poster: Reconciling Real Scores with Binary Comparisons: A Unified Logistic Model for Ranking »
Nir Ailon -
2008 Poster: Privacy-preserving logistic regression »
Kamalika Chaudhuri · Claire Monteleoni -
2007 Spotlight: A general agnostic active learning algorithm »
Sanjoy Dasgupta · Daniel Hsu · Claire Monteleoni -
2007 Poster: A general agnostic active learning algorithm »
Sanjoy Dasgupta · Daniel Hsu · Claire Monteleoni