Prioritizing data collection under distribution shift

Dominik Rothenhaeusler, Speaker
Stanford University
 
Wednesday, Aug 6: 9:00 AM - 9:25 AM
Invited Paper Session 
Music City Center 
Should I gather cheap, low-quality data or expensive, high-quality data? For example, in the social sciences, running behavioral experiments on Amazon Mechanical Turk is often easier than carefully recruiting participants from the target population. However, data from Amazon Mechanical Turk may be biased. This leads to a tradeoff between data quantity and data quality. We formalize this decision problem from a distribution shift perspective, taking into account (a) data quality, (b) data quantity, and (c) problem difficulty. We demonstrate that the usefulness of data can be predicted from summary statistics. More specifically, our proposed notion of data usefulness allows us to predict how much the mean squared error (MSE) of estimation and prediction procedures would improve with additional data from a particular candidate distribution, without having access to individual-level data from that distribution. We illustrate the effectiveness of our approach on both estimation and prediction tasks.
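
The quantity-versus-quality tradeoff can be made concrete with a toy calculation. The sketch below is not the formalization presented in the talk; it assumes the simplest possible setting, in which the goal is to estimate a target mean, the candidate source has a fixed bias, and the MSE of a sample mean is squared bias plus variance divided by sample size. The function names, the budget, and the per-sample costs are all hypothetical and chosen only for illustration.

```python
def mse_of_sample_mean(variance: float, n: int, bias: float = 0.0) -> float:
    """MSE of the mean of n i.i.d. draws: squared bias plus variance / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    return bias ** 2 + variance / n


def preferred_source(
    budget: float,
    cost_target: float,
    cost_candidate: float,
    var_target: float,
    var_candidate: float,
    bias_candidate: float,
):
    """Compare spending the whole budget on one source or the other.

    Uses only summary statistics (variance, bias, cost per sample);
    no individual-level data is needed. This is a toy model, assumed
    here for illustration only.
    """
    # Number of samples the budget buys from each source.
    n_target = int(budget // cost_target)
    n_candidate = int(budget // cost_candidate)

    # Target-population data is assumed unbiased; candidate data is not.
    mse_target = mse_of_sample_mean(var_target, n_target)
    mse_candidate = mse_of_sample_mean(var_candidate, n_candidate, bias_candidate)

    choice = "target" if mse_target <= mse_candidate else "candidate"
    return choice, mse_target, mse_candidate


if __name__ == "__main__":
    # Cheap but mildly biased data versus expensive unbiased data.
    print(preferred_source(100, 10, 1, 1.0, 1.0, bias_candidate=0.1))
    # Same costs, but a larger bias in the cheap source.
    print(preferred_source(100, 10, 1, 1.0, 1.0, bias_candidate=0.5))
```

Under these assumptions, a small bias leaves the cheap source preferable because the larger sample size dominates, while a larger bias reverses the choice because the squared-bias term does not shrink with more samples.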

Keywords

Distribution shift

Robust inference

Causal inference

Active learning

Experimental design