Synthetic Sampling Weights for Volunteer-Based National Biobanks: A Case Study with the All of Us Research Program
Lina Sulieman
Co-Author
Department of Biomedical Informatics, Vanderbilt University
Robert Cronin
Co-Author
Department of Internal Medicine, The Ohio State University
Wednesday, Aug 6: 9:55 AM - 10:15 AM
Topic-Contributed Paper Session
Music City Center
While national biobanks are essential for advancing medical research, their nonprobability sampling designs limit how well they represent the target population. This paper proposes a method that leverages high-quality national surveys to create synthetic sampling weights for nonprobability cohort studies, with the aim of improving representativeness. Specifically, we focus on deriving more accurate base weights, which enhance calibration by meeting population constraints, and on automating the data-supported selection of cross-tabulations for calibration. The approach combines a pseudo-design-based model with a novel Last-In-First-Out criterion, enhancing both the accuracy and the stability of estimates. Extensive simulations demonstrate that our method, named nps-lifo-rake, reduces bias, improves efficiency, and strengthens inference compared to existing approaches. We apply the proposed method to the All of Us Research Program, leveraging data from the 2020 National Health Interview Survey and the 2022 American Community Survey, and compare the resulting prevalence estimates for common phenotypes against national benchmarks. The results underscore the method's ability to reduce selection bias in nonprobability samples, offering a valuable tool for enhancing biobank representativeness. With the sampling weights developed for the All of Us Research Program, we can estimate the United States population prevalence of phenotypes and genotypes not captured by national probability studies.
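As a rough illustration of the calibration step mentioned in the abstract, the sketch below shows classical raking (iterative proportional fitting) of base weights to known population margins, followed by a weighted prevalence estimate. It is not the authors' nps-lifo-rake method: the base weights are simply set to 1, there is no propensity model and no Last-In-First-Out selection of cross-tabulations, and the function name `rake`, the variables, and the toy data are all hypothetical.

```python
import numpy as np

def rake(base_weights, categories, targets, max_iter=1000, tol=1e-10):
    """Rake weights so that weighted category totals match population margins.

    categories: dict mapping a variable name to integer category codes per unit
    targets:    dict mapping the same name to population totals per category
    Assumes every category contains at least one sampled unit.
    """
    w = np.asarray(base_weights, dtype=float).copy()
    for _ in range(max_iter):
        max_gap = 0.0
        for name, codes in categories.items():
            target = np.asarray(targets[name], dtype=float)
            # Current weighted total in each category of this variable
            current = np.bincount(codes, weights=w, minlength=len(target))
            factors = target / current
            max_gap = max(max_gap, float(np.max(np.abs(factors - 1.0))))
            # Scale each unit's weight by its category's adjustment factor
            w *= factors[codes]
        if max_gap < tol:
            break
    return w

# Toy volunteer sample (made-up values, for illustration only)
sex = np.array([0, 0, 0, 1, 1, 0, 1, 0])        # two categories
age = np.array([0, 1, 2, 0, 1, 2, 1, 0])        # three categories
has_condition = np.array([1, 0, 0, 1, 0, 1, 0, 0])

# Hypothetical population margins; both sum to the same total
targets = {"sex": [50.0, 50.0], "age": [30.0, 40.0, 30.0]}

weights = rake(np.ones(len(sex)), {"sex": sex, "age": age}, targets)

# Weighted (Hajek-type) prevalence estimate using the raked weights
prevalence = np.sum(weights * has_condition) / np.sum(weights)
print(weights.round(3), round(float(prevalence), 3))
```

In this sketch the margins come from made-up totals; in an application like the one described, they would be taken from a reference source such as a national survey, and the starting weights would come from a model of selection rather than being constant.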
Calibration Weighting
Generalized Raking
Nested Propensity Score
Non-Probability
Prevalence
Sampling Design