Surrogate-powered Regularized Estimation: Semi-Supervised Modeling with Multi-Wave Sampling

Huiyuan Wang Co-Author
University of Pennsylvania
 
Thomas Lumley Co-Author
University of Auckland
 
Yong Chen Co-Author
University of Pennsylvania, Perelman School of Medicine
 
Jianmin Chen First Author
University of Pennsylvania, Perelman School of Medicine
 
Jianmin Chen Presenting Author
University of Pennsylvania, Perelman School of Medicine
 
Tuesday, Aug 5: 12:05 PM - 12:20 PM
2269 
Contributed Papers 
Music City Center 
Surrogate-powered modeling is an emerging approach in semi-supervised learning that improves statistical efficiency by combining large-scale unlabeled data with a small labeled dataset through multiple surrogate outcomes. The framework is particularly useful for risk modeling with electronic health records (EHR), where gold-standard outcomes are scarce because chart review is costly, while algorithm-generated surrogates are widely available. Two key challenges are combining labeled and unlabeled data effectively when multiple surrogates are present, and designing efficient sampling rules for chart review. To address these challenges, we propose a multi-wave sampling strategy that adaptively approximates the optimal sampling rule, and we introduce a novel semi-supervised estimator that uses first-order bias correction and sparse regularization to reduce estimation error. The estimator is asymptotically normal and unbiased, and it improves statistical efficiency. Extensive numerical studies demonstrate its effectiveness in reducing mean squared error.
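
As a rough illustration of the general idea of correcting a surrogate-based estimate with a small, unequally sampled labeled set, the sketch below estimates an outcome mean. It is not the estimator proposed in this talk: it uses a single surrogate, a fixed (non-adaptive) two-rate sampling rule, and no sparse regularization. The function name `surrogate_corrected_mean`, its arguments, and the simulated data are all hypothetical.

```python
import numpy as np


def surrogate_corrected_mean(surrogate, y_labeled, labeled_idx, sampling_prob):
    """Estimate E[Y] from a surrogate on all records plus weighted labels.

    surrogate     : surrogate values for all N records
    y_labeled     : gold-standard outcomes for the chart-reviewed records
    labeled_idx   : positions of the reviewed records in `surrogate`
    sampling_prob : probability with which each reviewed record was selected
    """
    surrogate = np.asarray(surrogate, dtype=float)
    y_labeled = np.asarray(y_labeled, dtype=float)
    weights = 1.0 / np.asarray(sampling_prob, dtype=float)

    # Plug-in term: surrogate mean over the full cohort (precise but biased).
    plug_in = surrogate.mean()

    # Correction term: inverse-probability-weighted mean of the residuals
    # (gold standard minus surrogate) among the reviewed records.
    residuals = y_labeled - surrogate[labeled_idx]
    correction = np.sum(weights * residuals) / np.sum(weights)

    return plug_in + correction


# Simulated cohort (assumed for illustration only).
rng = np.random.default_rng(0)
n = 20_000
risk = rng.uniform(0.05, 0.6, size=n)
y = rng.binomial(1, risk)        # gold standard, observed only if reviewed
surrogate = risk + 0.1           # surrogate that systematically over-calls

# Fixed unequal sampling rule: review high-surrogate records more often.
prob = np.where(surrogate > 0.4, 0.05, 0.01)
picked = np.flatnonzero(rng.random(n) < prob)

estimate = surrogate_corrected_mean(surrogate, y[picked], picked, prob[picked])
print(f"surrogate-only mean: {surrogate.mean():.3f}")
print(f"corrected estimate:  {estimate:.3f}")
print(f"full-cohort truth:   {y.mean():.3f}")
```

In a multi-wave design, the selection probabilities would be updated between waves using the labels collected so far. Here they are fixed in advance to keep the sketch short.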

Keywords

EHR data

semi-supervised learning

surrogate regression

bias reduction

Main Sponsor

Section on Statistics in Epidemiology