Poster
Linear Time Approximation Algorithm for Column Subset Selection with Local Search
YuanBin Zou · Ziyun Huang · Jinhui Xu · Qilong Feng · Jianxin Wang
West Ballroom A-D #5906
[
Abstract
]
Thu 12 Dec 4:30 p.m. PST
— 7:30 p.m. PST
Abstract:
The Column Subset Selection (CSS) problem has been widely studied in dimensionality reduction and feature selection, where the goal is to output a submatrix $S$ consisting of k columns from a n×d input matrix A that minimizes the residual error ‖A-SS^\dagger A‖_F^2 and S^\dagger is the Moore-Penrose inverse matrix of $S$. Many previous approximation algorithms have non-linear running times in n and d, while the existing linear time algorithm has a relatively larger approximation ratio. Additionally, the local search algorithms in existing results for solving the CSS problem are heuristic. To achieve linear running time while maintaining better approximation using a local search strategy, we propose a local search-based approximation algorithm for the CSS problem with exactly k columns selected. A key challenge in achieving linear running time with the local search strategy is how to avoid exhaustive enumerations of candidate columns for constructing swap pairs in each local search step. To address this issue, we propose a two-step mixed sampling method which can reduce the number of enumerations for swap pair construction from O(dk) to k in linear time. Although two-step mixed sampling method reduces the search space of local search strategy, bounding the residual error after swaps is a non-trivial task. To estimate the changes in residual error after swaps, we propose a matched swap pair construction method to bound the approximation loss, guaranteeing a constant probability of reduction in loss with each local search step. These techniques enable us to obtain the local search algorithm for the CSS problem with theoretical guarantees, where a 110(k+1)-approximate solution can be obtained in linear running time O(ndk^3\log k) in expectation. Empirical experiments show that our proposed algorithm achieves better quality and time compared to previous algorithms on both small and large datasets. Moreover, it is at least 10 times faster than state-of-the-art algorithms across all large-scale datasets.
Live content is unavailable. Log in and register to view live content