Prioritizing data collection under distribution shift
Wednesday, Aug 6: 9:00 AM - 9:25 AM
Invited Paper Session
Music City Center
Should I gather cheap, low-quality data or expensive, high-quality data? For example, in the social sciences, running behavioral experiments on Amazon Mechanical Turk is often easier than carefully recruiting participants from the target population. However, data from Amazon Mechanical Turk may be biased relative to the target population, which creates a tradeoff between data quantity and data quality. We formalize this decision problem from a distribution shift perspective, taking into account (a) data quality, (b) data quantity, and (c) problem difficulty. We demonstrate that the usefulness of data can be predicted from summary statistics. More specifically, our proposed notion of data usefulness allows us to predict how much the mean squared error (MSE) of estimation and prediction procedures would improve with additional data from a particular candidate distribution, without access to individual-level data from that distribution. We illustrate the effectiveness of our approach on both estimation and prediction tasks.
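The quantity-versus-quality tradeoff described above can be illustrated with a toy calculation. The sketch below is not the method proposed in the talk; it assumes the simplest possible setting, estimating a target mean with a sample mean, where the MSE decomposes as squared bias plus variance over sample size. The function names, the source names, and all costs, variances, and biases are hypothetical values chosen for illustration.

```python
def mse_of_mean(variance, n, bias=0.0):
    """MSE of a sample mean of n i.i.d. draws whose mean is offset
    from the target mean by `bias` (standard bias-variance split)."""
    return bias ** 2 + variance / n


def best_source(budget, sources):
    """Spend the whole budget on one source and return the
    (name, mse) pair with the smallest predicted MSE."""
    results = {}
    for name, s in sources.items():
        n = int(budget // s["cost"])  # samples affordable from this source
        if n > 0:
            results[name] = mse_of_mean(s["variance"], n, s["bias"])
    if not results:
        raise ValueError("budget too small to buy a single sample")
    name = min(results, key=results.get)
    return name, results[name]


# Hypothetical summary statistics: a cheap but biased source and a
# costly unbiased one. Only per-source summaries are needed, not
# individual-level data.
sources = {
    "cheap": {"cost": 1.0, "variance": 1.0, "bias": 0.2},
    "careful": {"cost": 10.0, "variance": 1.0, "bias": 0.0},
}

small_budget_choice = best_source(100, sources)
large_budget_choice = best_source(1000, sources)
print(small_budget_choice[0], large_budget_choice[0])
```

Under these assumed numbers, the cheap source is preferable at a small budget because variance dominates, while the careful source is preferable at a large budget because the bias of the cheap source does not shrink with more samples.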
Distribution shift
Robust inference
Causal inference
Active learning
Experimental design