Timezone: »
Clustering large datasets is a fundamental problem with a number of applications in machine learning. Data is often collected on different sites and clustering needs to be performed in a distributed manner with low communication. We would like the quality of the clustering in the distributed setting to match that in the centralized setting for which all the data resides on a single site. In this work, we study both graph and geometric clustering problems in two distributed models: (1) a point-to-point model, and (2) a model with a broadcast channel. We give protocols in both models which we show are nearly optimal by proving almost matching communication lower bounds. Our work highlights the surprising power of a broadcast channel for clustering problems; roughly speaking, to cluster n points or n vertices in a graph distributed across s servers, for a worst-case partitioning the communication complexity in a point-to-point model is n*s, while in the broadcast model it is n + s. We implement our algorithms and demonstrate this phenomenon on real life datasets, showing that our algorithms are also very efficient in practice.
Author Information
Jiecao Chen (Indiana University Bloomington)
He Sun (The University of Bristol)
David Woodruff (IBM Research)
Qin Zhang (Indiana University Bloomington)
More from the Same Authors
-
2020 Poster: Batched Coarse Ranking in Multi-Armed Bandits »
Nikolai Karpov · Qin Zhang -
2018 Poster: A Practical Algorithm for Distributed Clustering and Outlier Detection »
Jiecao Chen · Erfan Sadeqi Azer · Qin Zhang -
2018 Poster: Tight Bounds for Collaborative PAC Learning via Multiplicative Weights »
Jiecao Chen · Qin Zhang · Yuan Zhou -
2016 Poster: Sublinear Time Orthogonal Tensor Decomposition »
Zhao Song · David Woodruff · Huan Zhang -
2015 : Sketching as a tool for numerical linear algebra »
David Woodruff -
2014 Poster: Improved Distributed Principal Component Analysis »
Yingyu Liang · Maria-Florina F Balcan · Vandana Kanchanapally · David Woodruff -
2014 Poster: Low Rank Approximation Lower Bounds in Row-Update Streams »
David Woodruff -
2014 Poster: Subspace Embeddings for the Polynomial Kernel »
Haim Avron · Huy Nguyen · David Woodruff -
2013 Workshop: Large Scale Matrix Analysis and Inference »
Reza Zadeh · Gunnar Carlsson · Michael Mahoney · Manfred K. Warmuth · Wouter M Koolen · Nati Srebro · Satyen Kale · Malik Magdon-Ismail · Ashish Goel · Matei A Zaharia · David Woodruff · Ioannis Koutis · Benjamin Recht -
2013 Poster: Sketching Structured Matrices for Faster Nonlinear Regression »
Haim Avron · Vikas Sindhwani · David Woodruff