Subsampling is a common and often effective way to deal with the computational challenges of large datasets. However, for most statistical models, there is no well-motivated approach for drawing a non-uniform subsample. We show that the concept of an asymptotically linear estimator, together with its associated influence function, leads to asymptotically optimal sampling probabilities for a wide class of popular models. To our knowledge, this is the only tight optimality result for subsampling; other methods provide only probabilistic error bounds or optimal rates. Furthermore, for linear regression models, which have well-studied procedures for non-uniform subsampling, we show empirically that our influence-function-based method outperforms previous approaches, even when it uses approximations to the optimal probabilities.
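As a rough illustration of the idea for ordinary least squares, the sketch below samples rows with probability proportional to the norm of each row's influence function and then refits with inverse-probability weights. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `influence_subsample_ols` is hypothetical, the pilot estimate is computed on the full data purely for simplicity (in practice a cheap pilot fit on a small uniform subsample would be the natural choice), and sampling with replacement is an assumed scheme.

```python
import numpy as np


def influence_subsample_ols(X, y, m, rng):
    """Hypothetical helper: influence-based subsampling for OLS.

    Returns the subsample estimate and the sampling probabilities.
    """
    n, d = X.shape

    # Pilot estimate. Assumption: fitted on the full data for simplicity.
    beta_pilot, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_pilot

    # Empirical influence function for OLS:
    #   psi_i = (X^T X / n)^{-1} x_i * (y_i - x_i^T beta)
    H = X.T @ X / n
    psi = np.linalg.solve(H, X.T).T * resid[:, None]

    # Sampling probabilities proportional to the influence norm.
    norms = np.linalg.norm(psi, axis=1)
    probs = norms / norms.sum()

    # Assumed scheme: draw m rows with replacement.
    idx = rng.choice(n, size=m, replace=True, p=probs)

    # Inverse-probability weights, applied via square-root row scaling
    # so that plain least squares solves the weighted problem.
    w = 1.0 / (m * probs[idx])
    sw = np.sqrt(w)
    beta_sub, *_ = np.linalg.lstsq(X[idx] * sw[:, None], y[idx] * sw, rcond=None)
    return beta_sub, probs
```

For example, calling `influence_subsample_ols(X, y, 500, np.random.default_rng(0))` on a design matrix `X` and response `y` returns a coefficient estimate based on 500 sampled rows. Rows with large residuals or high leverage under the pilot fit are sampled more often, and the weights compensate for that so the refit is not biased toward those rows.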
Daniel Ting (Tableau Software)
Eric Brochu (Tableau Software)