Keywords: [ Clustering ] [ Self-Supervised Learning ]

Abstract:
As a fundamental unsupervised learning task, hierarchical clustering has been extensively studied in the past decade. In particular, standard metric formulations as hierarchical $k$-center, $k$-means, and $k$-median received a lot of attention and the problems have been studied extensively in different models of computation. Despite all this interest, not many efficient parallel algorithms are known for these problems. In this paper we introduce a new parallel algorithm for the Euclidean hierarchical $k$-median problem that, when using machines with memory $s$ (for $s\in \Omega(\log^2 (n+\Delta+d))$), outputs a hierarchical clustering such that for every fixed value of $k$ the cost of the solution is at most an $O(\min\{d, \log n\} \log \Delta)$ factor larger in expectation than that of an optimal solution. Furthermore, we also get that for all $k$ simultanuously the cost of the solution is at most an $O(\min\{d, \log n\} \log \Delta \log (\Delta d n))$ factor bigger that the corresponding optimal solution. The algorithm requires in $O\left(\log_{s} (nd\log(n+\Delta))\right)$ rounds. Here $d$ is the dimension of the data set and $\Delta$ is the ratio between the maximum and minimum distance of two points in the input dataset. To the best of our knowledge, this is the first \emph{parallel} algorithm for the hierarchical $k$-median problem with theoretical guarantees. We further complement our theoretical results with an empirical study of our algorithm that shows its effectiveness in practice.

Chat is not available.