We introduce a framework for statistical estimation that leverages knowledge of how samples are collected but makes no distributional assumptions on the data values. Specifically, we consider a population of elements [n] = {1, ..., n} with corresponding data values x_1, ..., x_n. We observe the values for a "sample" set A \subset [n] and wish to estimate some statistic of the values for a "target" set B \subset [n], where B could be the entire set. Crucially, we assume that the sets A and B are drawn according to some known distribution P over pairs of subsets of [n]. A given estimation algorithm is evaluated based on its "worst-case expected error", where the expectation is with respect to the distribution P from which the sample set A and target set B are drawn, and the worst case is with respect to the data values x_1, ..., x_n. Within this framework, we give an efficient algorithm for estimating the target mean that returns a weighted combination of the sample values (where the weights are functions of the distribution P and the sample and target sets A, B) and show that the worst-case expected error achieved by this algorithm is at most a multiplicative \pi/2 factor worse than the optimal of such algorithms. The algorithm and proof leverage a surprising connection to the Grothendieck problem. We also extend these results to the linear regression setting, where each datapoint is not a scalar but a labeled vector (x_i, y_i). This framework, which makes no distributional assumptions on the data values but rather relies on knowledge of the data collection process via the distribution P, is a significant departure from the typical statistical estimation framework. It introduces a uniform analysis for the many natural settings where membership in a sample may be correlated with data values, such as when individuals are recruited into a sample through their social networks, as in "snowball/chain" sampling, or when samples have chronological structure, as in "selective prediction".
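The evaluation criterion described in the abstract can be illustrated on a toy instance. The sketch below is not the paper's algorithm: it only scores a fixed weighted estimator (the plain sample mean) by brute force. It assumes, beyond what the abstract states, that error is measured as squared error and that data values lie in [-1, 1]. Under those assumptions the expected squared error of a weighted estimator is a convex quadratic in x, so its maximum over the cube is attained at a vertex in {-1, 1}^n. All function names and the toy distribution P are hypothetical.

```python
import itertools

def expected_sq_error(x, P, weights):
    """Expected squared error of a weighted estimator for fixed data values x.

    P is a list of (probability, A, B) triples, standing in for the known
    distribution over (sample, target) pairs. weights(A, B) returns a dict
    mapping each index in A to its weight.
    """
    total = 0.0
    for prob, A, B in P:
        w = weights(A, B)
        estimate = sum(w[i] * x[i] for i in A)
        target_mean = sum(x[i] for i in B) / len(B)
        total += prob * (estimate - target_mean) ** 2
    return total

def worst_case_expected_sq_error(n, P, weights):
    """Maximize the expected squared error over x in {-1, 1}^n by enumeration.

    Assumption: values are bounded in [-1, 1]; the objective is convex in x,
    so checking the 2^n vertices suffices. Only feasible for small n.
    """
    return max(
        expected_sq_error(x, P, weights)
        for x in itertools.product((-1.0, 1.0), repeat=n)
    )

def sample_mean_weights(A, B):
    """Baseline estimator: equal weight on every sampled element."""
    return {i: 1.0 / len(A) for i in A}

# Toy instance: n = 3, the sample is one uniformly random element,
# and the target is the whole population.
n = 3
P = [(1.0 / n, (i,), tuple(range(n))) for i in range(n)]
worst = worst_case_expected_sq_error(n, P, sample_mean_weights)
print(worst)
```

In this toy instance the expected error equals the population variance of x, so the worst case is the sign pattern with the largest variance. Finding weights that minimize this worst-case quantity for a general P is the harder problem the abstract addresses.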
Author Information
Justin Chen (MIT)
Gregory Valiant (Stanford University)
Paul Valiant (IAS; Purdue University)
Related Events (a corresponding poster, oral, or spotlight)

2020 Poster: Worst-Case Analysis for Randomly Collected Data
Thu Dec 10th 05:00 - 07:00 AM Room Poster Session 4
More from the Same Authors

2019 Poster: A Polynomial Time Algorithm for Log-Concave Maximum Likelihood via Locally Exponential Families
Brian Axelrod · Ilias Diakonikolas · Alistair Stewart · Anastasios Sidiropoulos · Gregory Valiant 
2018 Poster: Estimating Learnability in the Sublinear Data Regime
Weihao Kong · Gregory Valiant 
2013 Poster: Estimating the Unseen: Improved Estimators for Entropy and other Properties
Paul Valiant · Gregory Valiant