Abstract:
Data pruning consists of identifying a subset of the training set that can be used for training in place of the full dataset. This pruned subset is typically chosen to satisfy desirable properties, for example preserving accuracy or reducing training cost. In this paper, we leverage existing theory on importance sampling with Stochastic Gradient Descent (SGD) to derive a new, principled data pruning algorithm based on Lipschitz properties of the loss function. The goal is to identify a training subset that accelerates training compared to, e.g., random pruning. We call this algorithm $\texttt{LiPrune}$. We illustrate cases where $\texttt{LiPrune}$ outperforms existing methods and show the limitations and failure modes of this algorithm in the context of deep learning.
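To make the general idea behind Lipschitz-based importance sampling concrete, the following is a minimal, hypothetical sketch (not the paper's $\texttt{LiPrune}$ algorithm; the scoring rule and sampling scheme are assumptions). For the logistic loss $\ell_i(w) = \log(1 + \exp(-y_i\, w^\top x_i))$, the per-sample gradient norm is bounded by $\|x_i\|$, so this feature norm can act as a per-sample Lipschitz constant and hence as an importance score for pruning.

```python
# Hypothetical sketch: prune a dataset by sampling examples in proportion
# to a per-sample Lipschitz upper bound on the gradient norm.
# This is NOT the paper's LiPrune algorithm; the score ||x_i|| is an
# assumption valid for logistic loss, used here purely for illustration.
import numpy as np

def lipschitz_prune(X, y, keep_fraction=0.5, seed=None):
    """Keep a subset of (X, y) sampled proportionally to per-sample
    Lipschitz bounds (here, the feature norms ||x_i||)."""
    rng = np.random.default_rng(seed)
    scores = np.linalg.norm(X, axis=1)          # per-sample Lipschitz bound
    probs = scores / scores.sum()               # importance-sampling weights
    n_keep = max(1, int(keep_fraction * len(X)))
    idx = rng.choice(len(X), size=n_keep, replace=False, p=probs)
    return X[idx], y[idx], idx

# Example usage on synthetic data
X = np.random.default_rng(0).normal(size=(1000, 20))
y = np.sign(X[:, 0])
X_sub, y_sub, kept = lipschitz_prune(X, y, keep_fraction=0.2, seed=1)
```

Under this kind of scheme, training on the retained subset (possibly with weights inversely proportional to the sampling probabilities to keep gradient estimates unbiased) is what would be compared against random pruning.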