Optimal Variance Reduction with Multiple Synthetic Proxies: A Dynamic Control Variate Framework
Yuyang Li
Speaker
Washington University in St. Louis
Xuming He
Co-Author
Washington University in St. Louis
Jimin Ding
Co-Author
Washington University in St. Louis
Thursday, Aug 6: 8:30 AM - 10:20 AM
3042
Contributed Papers
Thomas M. Menino Convention & Exhibition Center
Synthetic data analysis augments small, labeled datasets with massive unlabeled datasets containing proxy outcomes. Moving beyond existing single-proxy frameworks, we demonstrate that integrating multiple synthetic copies, either overlapping or disjoint, substantially amplifies estimation efficiency. Operating within a bias-correction framework for the parameter of statistical interest, we identify the optimal variance-minimizing weight for both linear and generalized linear models. This strategy guarantees a "free lunch" for variance reduction even with imperfect proxies, avoiding the model specification assumptions of traditional semi-supervised learning. Furthermore, to address uneven proxy quality, we introduce a dynamic coefficient approach that adapts the correction locally to maximize efficiency where proxies are most reliable. We validate the method through asymptotic theory, simulations, and an analysis of St. Louis housing prices, yielding significantly sharper estimates of school district quality capitalization compared to standard methods.
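To make the control variate idea concrete, the following is a minimal sketch for the simplest possible target, the mean of the outcome, and not the estimator presented in the talk. It assumes the labeled and unlabeled samples are disjoint and drawn i.i.d. from the same distribution, and it uses the textbook variance-minimizing weight for that setting. The function name `control_variate_mean` and all variable names are hypothetical.

```python
import numpy as np

def control_variate_mean(y, f_lab, f_unlab):
    """Illustrative estimate of E[Y] using K proxies as control variates.

    y:       (n,)   true outcomes on the labeled sample
    f_lab:   (n, K) proxy outcomes on the labeled sample
    f_unlab: (N, K) proxy outcomes on a disjoint unlabeled sample
    Assumption: both samples are i.i.d. draws from the same population.
    """
    y = np.asarray(y, dtype=float)
    f_lab = np.asarray(f_lab, dtype=float).reshape(len(y), -1)
    f_unlab = np.asarray(f_unlab, dtype=float).reshape(-1, f_lab.shape[1])
    n, N = len(y), len(f_unlab)

    # Sample covariance of the proxies, and their covariance with y,
    # both estimated from the labeled sample.
    fc = f_lab - f_lab.mean(axis=0)
    sigma_ff = fc.T @ fc / (n - 1)
    sigma_fy = fc.T @ (y - y.mean()) / (n - 1)

    # Variance-minimizing weight for disjoint samples:
    # w = N / (n + N) * Sigma_ff^{-1} Sigma_fy
    w = (N / (n + N)) * np.linalg.solve(sigma_ff, sigma_fy)

    # The correction term has mean zero whether or not the proxies are
    # biased for Y, so subtracting it leaves the estimate unbiased to
    # first order while removing variance the proxies can explain.
    correction = f_lab.mean(axis=0) - f_unlab.mean(axis=0)
    estimate = y.mean() - w @ correction
    return estimate, w
```

Because the correction compares proxy means across the two samples rather than comparing proxies to the true outcome, a biased or noisy proxy simply receives a smaller weight instead of contaminating the estimate, which is the intuition behind the "free lunch" claim in this simplified setting.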
Synthetic Data Analysis
Prediction-Based Inference
Multiple Data Integration
Heterogeneous
Main Sponsor
Section on Statistical Learning and Data Science