The Value of Out-of-distribution Data
Ashwin De Silva · Rahul Ramesh · Carey E Priebe · Pratik Chaudhari · Joshua T Vogelstein
Event URL: https://openreview.net/forum?id=tsx6Pyh0Er

More data is expected to help us generalize to a task. But real datasets can contain out-of-distribution (OOD) data; this can arise as heterogeneity, such as intra-class variability, but also as temporal shifts or concept drift. We demonstrate a counter-intuitive phenomenon for such problems: the generalization error on the target task can be a non-monotonic function of the number of OOD samples; a small number of OOD samples can improve generalization, but beyond a threshold, additional OOD samples can cause the generalization error to deteriorate. We also show that if we know which samples are OOD, then using a weighted objective between the target and OOD samples ensures that the generalization error decreases monotonically. We demonstrate and analyze this phenomenon using linear classifiers on synthetic datasets and medium-sized neural networks on vision benchmarks such as MNIST, CIFAR-10, CINIC-10, PACS, and DomainNet, and observe the effect that data augmentation, hyperparameter optimization, and pre-training have on this behavior.
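The weighted objective described in the abstract can be sketched as a convex combination of the average loss on target samples and the average loss on known OOD samples. The following is a minimal illustration, not the authors' implementation; the weight alpha, the logistic-regression setup, and the synthetic shifted-Gaussian data are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_logistic_regression(X_tgt, y_tgt, X_ood, y_ood, alpha=0.8,
                                 lr=0.1, n_steps=2000):
    """Fit a linear classifier by minimizing a weighted objective of the form
    alpha * L_target + (1 - alpha) * L_OOD, where each term is the average
    logistic loss on that group. (alpha is a hypothetical weighting parameter
    used for illustration only.)"""
    d = X_tgt.shape[1]
    w, b = np.zeros(d), 0.0
    for _ in range(n_steps):
        # Gradient of the logistic loss w.r.t. the logits is (p - y).
        p_tgt = sigmoid(X_tgt @ w + b)
        p_ood = sigmoid(X_ood @ w + b)
        g_w = (alpha * X_tgt.T @ (p_tgt - y_tgt) / len(y_tgt)
               + (1 - alpha) * X_ood.T @ (p_ood - y_ood) / len(y_ood))
        g_b = alpha * np.mean(p_tgt - y_tgt) + (1 - alpha) * np.mean(p_ood - y_ood)
        w -= lr * g_w
        b -= lr * g_b
    return w, b

# Synthetic target task plus an OOD distribution with a shifted class mean.
rng = np.random.default_rng(0)
X_tgt = rng.normal(size=(200, 2)) + np.array([1.0, 0.0])
y_tgt = (X_tgt[:, 0] > 1.0).astype(float)
X_ood = rng.normal(size=(500, 2)) + np.array([3.0, 2.0])  # distribution shift
y_ood = (X_ood[:, 0] > 3.0).astype(float)

w, b = weighted_logistic_regression(X_tgt, y_tgt, X_ood, y_ood, alpha=0.8)
X_test = rng.normal(size=(1000, 2)) + np.array([1.0, 0.0])
y_test = (X_test[:, 0] > 1.0).astype(float)
acc = np.mean((sigmoid(X_test @ w + b) > 0.5) == y_test)
print(f"target-task test accuracy: {acc:.3f}")
```

In this sketch, down-weighting the OOD term (alpha close to 1) limits how much the shifted samples can pull the decision boundary away from the target task, which mirrors the paper's claim that a weighted objective avoids the non-monotonic degradation seen when OOD samples are treated the same as target samples.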

Author Information

Ashwin De Silva (Johns Hopkins University)
Rahul Ramesh (University of Pennsylvania)
Carey E Priebe (Johns Hopkins University)
Pratik Chaudhari (University of Pennsylvania / AWS)
Joshua T Vogelstein (Johns Hopkins University)
