Workshop: OPT 2021: Optimization for Machine Learning

Integer Programming Approaches To Subspace Clustering With Missing Data

Akhilesh Soni · Daniel Pimentel-Alarcón

Abstract: In the \emph{Subspace Clustering with Missing Data} (SCMD) problem, we are given a collection of $n$ data points, arranged into columns of a matrix $X \in \mathbb{R}^{d \times n}$, and each of the data points is observed only on a subset of its coordinates. The data points are assumed to be concentrated near a union of low-dimensional subspaces. The goal of SCMD is to cluster the vectors of the data matrix $X$ as per their subspace membership. State-of-the-art algorithms for SCMD can perform poorly on instances with a large amount of missing data or if the data matrix $X$ is nearly full-rank. We propose a novel integer programming-based method for SCMD. The approach is based on dynamically determining a set of candidate subspaces and optimally assigning points to selected subspaces. The problem structure is identical to the classical facility-location problem, with subspaces playing the role of facilities and data points that of customers. We propose a column-generation approach for identifying candidate subspaces combined with a Benders decomposition approach for solving the linear programming relaxation of the formulation. An empirical study demonstrates that the proposed approach can achieve better clustering accuracy than state-of-the-art methods when the data is high-rank or the percentage of missing data is high.