Prediction-Powered Inference with Imputed Covariates and Nonuniform Sampling

Presented During: Explorations in Online Learning and Time Series

Sherrie Wang Co-Author
MIT

Kerri Lu Co-Author
MIT

Tijana Zrnic Co-Author
University of California

Stephen Bates Co-Author
Stanford University

Dan Kluger First Author
MIT

Dan Kluger Presenting Author
MIT

Tuesday, Aug 5: 9:05 AM - 9:20 AM
2033
Contributed Papers

Music City Center

Machine learning models are increasingly used to produce predictions that serve as input data in subsequent statistical analyses. For example, computer vision predictions of economic and environmental indicators based on satellite imagery are used in downstream regressions; similarly, language models are widely used to approximate human ratings and opinions in social science research. However, failure to properly account for errors in the machine learning predictions renders standard statistical procedures invalid. Prior work uses what we call the Predict-Then-Debias estimator to give valid confidence intervals when machine learning algorithms impute missing variables, assuming that a small complete sample from the population of interest is available. We expand the scope by introducing bootstrap confidence intervals that apply when the complete data are a nonuniform (i.e., weighted, stratified, or clustered) sample and when an arbitrary subset of features is imputed. Importantly, the method can be applied to many settings without requiring additional calculations. We prove that these confidence intervals are valid without any assumptions on the quality of the machine learning model.
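
As a rough illustration of the idea described in the abstract, the sketch below computes a predict-then-debias style estimate for linear regression coefficients and a percentile bootstrap interval. It is not the authors' implementation. The function names (`wls`, `ptd_estimate`, `ptd_bootstrap_ci`), the additive correction with no tuning matrix, the use of inverse-probability weights for the complete rows, the row-level resampling scheme, and the synthetic data are all assumptions made for this example.

```python
import numpy as np


def wls(X, y, w):
    """Weighted least squares coefficients: solve (X' W X) beta = X' W y."""
    Xw = X * w[:, None]
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)


def ptd_estimate(X, y, X_imp, y_imp, complete, w):
    """Predict-then-debias style point estimate (hypothetical simple form).

    X, y         : true values, only used on rows where `complete` is True
    X_imp, y_imp : machine-learning-imputed values, available for every row
    complete     : boolean mask marking rows with true values observed
    w            : inverse-probability weights (assumed known) for complete rows
    """
    c = complete
    # Fit on imputed data from all rows: precise, but possibly biased.
    gamma_all = wls(X_imp, y_imp, np.ones(len(y_imp)))
    # Fits on the complete rows with true and with imputed values.
    theta_c = wls(X[c], y[c], w[c])
    gamma_c = wls(X_imp[c], y_imp[c], w[c])
    # Correct the all-row fit by the bias measured on the complete rows.
    return gamma_all + (theta_c - gamma_c)


def ptd_bootstrap_ci(X, y, X_imp, y_imp, complete, w,
                     n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval obtained by resampling rows."""
    rng = np.random.default_rng(seed)
    n = len(y_imp)
    draws = np.empty((n_boot, X_imp.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        draws[b] = ptd_estimate(X[idx], y[idx], X_imp[idx], y_imp[idx],
                                complete[idx], w[idx])
    lo, hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2], axis=0)
    return lo, hi


# Synthetic data, for illustration only: true model y = 1 + 2x + noise.
rng = np.random.default_rng(1)
N = 2000
x = rng.normal(size=N)
X_full = np.column_stack([np.ones(N), x])
y_full = 1.0 + 2.0 * x + rng.normal(size=N)
# Deliberately poor predictions of y (wrong intercept and slope).
y_imp = 0.5 + 1.5 * x + rng.normal(scale=0.5, size=N)
X_imp = X_full.copy()  # covariates are fully observed in this toy example
complete = rng.random(N) < 0.2  # about 20% of rows have the true y
w = np.full(N, 1 / 0.2)  # inverse of the sampling probability
# True values are missing (NaN) outside the complete sample.
X_true = np.where(complete[:, None], X_full, np.nan)
y_true = np.where(complete, y_full, np.nan)

est = ptd_estimate(X_true, y_true, X_imp, y_imp, complete, w)
lo, hi = ptd_bootstrap_ci(X_true, y_true, X_imp, y_imp, complete, w)
print("estimate:", est)
print("95% interval lower:", lo)
print("95% interval upper:", hi)
```

In this toy setup the regression on imputed data alone would target the wrong coefficients, while the corrected estimate targets the true ones even though the predictions are poor. Weighted, stratified, or clustered designs would need design-appropriate weights and resampling in place of the uniform choices assumed here.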

Keywords

prediction-powered inference

synthetic data

missing data

measurement error

two-phase sampling designs

bootstrap

Main Sponsor

IMS