Optimal Variance Reduction with Multiple Synthetic Proxies: A Dynamic Control Variate Framework

Yuyang Li Speaker
Washington University in St. Louis
 
Xuming He Co-Author
Washington University in St. Louis
 
Jimin Ding Co-Author
Washington University in St. Louis
 
Thursday, Aug 6: 8:30 AM - 10:20 AM
3042 
Contributed Papers 
Thomas M. Menino Convention & Exhibition Center 
Synthetic data analysis augments small labeled datasets with massive unlabeled datasets containing proxy outcomes. Moving beyond existing single-proxy frameworks, we demonstrate that integrating multiple synthetic copies, whether overlapping or disjoint, substantially improves estimation efficiency. Operating within a bias-correction framework for the parameter of statistical interest, we identify the optimal variance-minimizing weight for both linear and generalized linear models. This strategy guarantees a "free lunch" for variance reduction even with imperfect proxies, avoiding the model specification assumptions of traditional semi-supervised learning. Furthermore, to address uneven proxy quality, we introduce a dynamic coefficient approach that adapts the correction locally to maximize efficiency where proxies are most reliable. We validate the method through asymptotic theory, simulations, and an analysis of St. Louis housing prices, which yields significantly sharper estimates of school district quality capitalization than standard methods.
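
As a rough illustration of the control variate idea described above, the following sketch estimates a simple mean (not the regression parameters treated in the talk) from a small labeled sample and a large, disjoint unlabeled sample carrying several proxies. It is a textbook multi-proxy control variate under stated assumptions, not the authors' estimator: the function name multi_proxy_mean, the disjoint-sample setup, and the simulated data are all hypothetical. The weight is the standard variance-minimizing choice w = N/(n+N) * Cov(P)^{-1} Cov(P, Y), and the correction term has mean zero, so biased proxies do not bias the estimate.

```python
import numpy as np

def multi_proxy_mean(y_lab, p_lab, p_unlab):
    """Control variate estimate of E[Y] using K proxies (illustrative sketch).

    y_lab:   (n,)   true outcomes on the labeled sample
    p_lab:   (n, K) proxy outcomes on the labeled sample
    p_unlab: (N, K) proxy outcomes on a disjoint unlabeled sample
    """
    y_lab = np.asarray(y_lab, dtype=float)
    p_lab = np.asarray(p_lab, dtype=float).reshape(len(y_lab), -1)
    p_unlab = np.asarray(p_unlab, dtype=float).reshape(-1, p_lab.shape[1])
    n, N = len(y_lab), len(p_unlab)

    # Zero-mean control variate: both terms estimate E[P], so any proxy
    # bias cancels in the difference.
    c = p_lab.mean(axis=0) - p_unlab.mean(axis=0)

    # Proxy covariance from all proxy rows; proxy-outcome covariance
    # can only be estimated on the labeled rows.
    sigma_p = np.atleast_2d(np.cov(np.vstack([p_lab, p_unlab]), rowvar=False))
    p_centered = p_lab - p_lab.mean(axis=0)
    cov_py = p_centered.T @ (y_lab - y_lab.mean()) / (n - 1)

    # Plug-in variance-minimizing weight for disjoint samples.
    w = (N / (n + N)) * np.linalg.solve(sigma_p, cov_py)
    return y_lab.mean() - w @ c, w

# Hypothetical data: the second proxy is deliberately biased and rescaled.
rng = np.random.default_rng(0)
y = rng.normal(2.0, 1.0, size=2050)
p = np.column_stack([y + 0.5 * rng.normal(size=y.size),
                     0.5 * y + 1.0 + 0.5 * rng.normal(size=y.size)])
estimate, weights = multi_proxy_mean(y[:50], p[:50], p[50:])
print("labeled-only mean:", y[:50].mean())
print("control variate estimate:", estimate, "weights:", weights)
```

Because the weight is estimated from the labeled data, the variance guarantee in this sketch holds only asymptotically; the locally adaptive (dynamic) coefficient mentioned in the abstract is not attempted here.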

Keywords

Synthetic Data Analysis

Prediction-Based Inference

Multiple Data Integration

Heterogeneous 

Main Sponsor

Section on Statistical Learning and Data Science