Hand in Glove: Novel Applications of Data Integration of Probability and Non-Probability Samples

Kim Huynh Chair
Bank of Canada
 
Heng Chen Organizer
Bank of Canada
 
Thursday, Aug 7: 8:30 AM - 10:20 AM
0655 
Topic-Contributed Paper Session 
Music City Center 
Room: CC-104E 

Applied

Main Sponsor

Survey Research Methods Section

Co-Sponsors

SSC (Statistical Society of Canada)

Presentations

Selection Bias Correction for Imbalanced Samples

Selection bias correction is often applied in a two-sample setup: alongside the nonprobability sample of interest, a probability sample sharing common auxiliary variables is used to construct correction weights for the nonprobability sample. The two-sample setup allows one to compute weighted estimates of population parameters of interest from the nonprobability sample. Because nonprobability samples are usually easy to collect, we often have a large nonprobability sample paired with a small probability sample. This imbalance between the two samples can make it difficult to model the propensity of units to be included in the nonprobability sample. This presentation discusses common remedies for imbalanced samples from the machine learning literature, namely undersampling, the Synthetic Minority Oversampling Technique (SMOTE), and a mixture of both, and adjusts a selection bias correction framework to incorporate these imbalance solutions. 
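
As a concrete illustration of the remedies named above, the sketch below (not the authors' implementation) stacks a large nonprobability sample and a small probability sample, rebalances with undersampling or SMOTE, fits a logistic propensity model, and returns inverse-propensity correction weights. The names X_np and X_ps, the use of scikit-learn/imbalanced-learn, and the omission of the probability-sample design weights are simplifying assumptions.

```python
# Minimal sketch of propensity-based selection bias correction with
# imbalance handling; illustrative only, not the presented method.
import numpy as np
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

def correction_weights(X_np, X_ps, strategy="smote", seed=0):
    """Inverse-propensity correction weights for the nonprobability sample."""
    # Stack the two samples; z = 1 flags membership in the nonprobability sample.
    X = np.vstack([X_np, X_ps])
    z = np.concatenate([np.ones(len(X_np), dtype=int),
                        np.zeros(len(X_ps), dtype=int)])

    # Rebalance the stacked sample before fitting the propensity model.
    if strategy == "smote":
        X_bal, z_bal = SMOTE(random_state=seed).fit_resample(X, z)
    elif strategy == "undersample":
        X_bal, z_bal = RandomUnderSampler(random_state=seed).fit_resample(X, z)
    else:
        X_bal, z_bal = X, z

    # Logistic propensity of inclusion in the nonprobability sample.
    model = LogisticRegression(max_iter=1000).fit(X_bal, z_bal)
    p_np = model.predict_proba(X_np)[:, 1]
    return 1.0 / np.clip(p_np, 1e-6, 1.0)
```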

Keywords

TBD 

Speaker

An-Chiao Liu, Utrecht University

A Bayesian Framework for Combining Probability and Non-Probability Samples in Small Area Estimation

The integration of probability (PS) and non-probability (NPS) samples offers a promising avenue for robust small area estimation, addressing challenges such as sample representation, measurement error, and sample size limitations. We propose a Bayesian hierarchical framework that leverages the complementary strengths of PS and NPS. We consider the scenario in which the binary outcome of interest is measured in both studies but is subject to measurement error. We introduce latent variables for the binary outcomes and link them through a shared dependency structure. This approach provides a principled mechanism for exploiting the representativeness of the PS and the scale of the NPS when estimating small-area means. Our methodology is demonstrated through applications to data from the Health and Retirement Study (HRS) and Electronic Health Records (EHR), offering insights into its practical implications and utility in real-world settings. Preliminary simulation studies validate the efficacy and robustness of the proposed framework, highlighting its potential for broader application. This work contributes to the development of advanced statistical methods for combining disparate data sources and enhancing the precision and reliability of small area estimates. 
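
The sketch below illustrates one way such a hierarchy could be coded; it is an assumption about the general structure, not the authors' model. Area prevalences enter both samples through shared area-level random effects, and the NPS outcome is treated as an error-prone measurement whose latent true status is marginalized out via sensitivity and specificity (for simplicity the PS measurement is taken as error-free here). Data names (y_ps, area_ps, y_nps, area_nps, n_areas) are illustrative.

```python
# Minimal PyMC sketch of a hierarchical model linking PS and NPS binary
# outcomes through shared small-area effects, with NPS measurement error.
import pymc as pm

def build_model(y_ps, area_ps, y_nps, area_nps, n_areas):
    with pm.Model() as model:
        # Shared small-area random effects on the logit scale.
        mu = pm.Normal("mu", 0.0, 2.0)
        sigma = pm.HalfNormal("sigma", 1.0)
        u = pm.Normal("u", 0.0, sigma, shape=n_areas)
        prev = pm.Deterministic("prev", pm.math.invlogit(mu + u))

        # PS outcome: assumed measured without error in this sketch.
        pm.Bernoulli("y_ps", p=prev[area_ps], observed=y_ps)

        # NPS outcome: error-prone; the latent true status is marginalized
        # out through sensitivity/specificity of the NPS measurement.
        sens = pm.Beta("sens", 10.0, 2.0)
        spec = pm.Beta("spec", 10.0, 2.0)
        p_obs = sens * prev[area_nps] + (1.0 - spec) * (1.0 - prev[area_nps])
        pm.Bernoulli("y_nps", p=p_obs, observed=y_nps)
    return model

# idata = pm.sample(model=build_model(...))  # posterior for area prevalences "prev"
```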

Keywords

TBD 

Speaker

Soumojit Das

Construction of Indirect Sampling Weights Through Data Integration

This paper employs a data integration approach to construct a new set of weights for indirect sampling surveys. By leveraging the reference probability sample, we recast the indirectly sampled units as a non-probability sample and compute inverse probability weights for them. Unlike the existing Generalized Weight Share Method, the proposed data-integrated weights can be applied to the estimation of medians and quantiles. In an empirical application estimating merchant cash acceptance, we assess the effectiveness of the proposed indirect sampling weights in terms of bias and variance. 
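
The sketch below is a simplified stand-in for this idea, not the paper's exact estimator: the indirectly sampled units are pooled with the reference probability sample, a design-weighted logistic model yields pseudo-inclusion propensities, and the resulting inverse-propensity weights feed a weighted quantile estimate. Names (X_ind, X_ref, d_ref, y_ind) are illustrative.

```python
# Minimal sketch: pseudo-weights for indirectly sampled units and a
# weighted quantile estimate; illustrative assumptions throughout.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudo_weights(X_ind, X_ref, d_ref):
    """Inverse-propensity weights treating indirect units as a nonprobability sample."""
    X = np.vstack([X_ind, X_ref])
    z = np.concatenate([np.ones(len(X_ind), dtype=int),
                        np.zeros(len(X_ref), dtype=int)])
    # Reference units carry their design weights; indirect units get weight 1.
    w = np.concatenate([np.ones(len(X_ind)), np.asarray(d_ref, dtype=float)])
    fit = LogisticRegression(max_iter=1000).fit(X, z, sample_weight=w)
    p = fit.predict_proba(X_ind)[:, 1]
    return 1.0 / np.clip(p, 1e-6, 1.0)

def weighted_quantile(y, w, q):
    """q-th quantile from the weighted empirical CDF."""
    order = np.argsort(y)
    y, w = np.asarray(y)[order], np.asarray(w)[order]
    cdf = np.cumsum(w) / np.sum(w)
    return y[min(np.searchsorted(cdf, q), len(y) - 1)]

# Example: median merchant cash acceptance under the new weights.
# w_ind = pseudo_weights(X_ind, X_ref, d_ref)
# median_hat = weighted_quantile(y_ind, w_ind, 0.5)
```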

Keywords

TBD 

Speaker

Heng Chen, Bank of Canada

Survey Data Integration for Estimating Distribution Functions and Quantiles

Estimates of finite population cumulative distribution functions (CDFs) and quantiles are critical for policy-making, resource allocation, and public health planning. For instance, federal finance agencies may require accurate estimates of the proportion of individuals with incomes below the federal poverty line to determine funding eligibility, while health organizations may rely on precise quantile estimates of key health variables to guide local health interventions. Despite the growing interest in survey data integration, research on the integration of probability and nonprobability samples for estimating CDFs and quantiles remains limited. In this study, we propose a novel residual-based CDF estimator that integrates information from a probability sample with data from potentially large nonprobability samples. Our approach leverages shared covariates observed in both datasets, while the response variable is available only in the nonprobability sample. Using a semiparametric approach, we train an outcome model on the nonprobability sample and incorporate model residuals with sampling weights from the probability sample to estimate the CDF of the target variable. Based on this CDF estimator, we define a quantile estimator and introduce linearization and bootstrap methods for variance estimation of both the CDF and quantile estimators. Under certain regularity conditions, we provide the asymptotic bias and variance formulae of the CDF estimator and compare them to the corresponding formulae of the naïve CDF estimator derived from the nonprobability sample only. Our empirical results demonstrate the favorable performance of the proposed estimators. A real data example will be presented to illustrate the proposed estimators.  
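
To make the residual-based construction concrete, the sketch below assumes a Chambers–Dunstan-type form: an outcome model is trained on the nonprobability sample, and its residual distribution is combined with the design-weighted predictions for the probability sample. It is an illustration only; the talk's semiparametric model, exact estimator, and variance procedures are not reproduced, and the names X_np, y_np, X_ps, d_ps, and t_grid are assumptions.

```python
# Minimal sketch of a residual-based CDF estimator and a grid-based quantile.
import numpy as np
from sklearn.linear_model import LinearRegression

def residual_based_cdf(X_np, y_np, X_ps, d_ps, t_grid):
    """Estimate F(t) on t_grid from a nonprobability outcome model and
    probability-sample design weights."""
    # Working outcome model on the nonprobability sample (a linear model
    # stands in here for the semiparametric fit described in the abstract).
    fit = LinearRegression().fit(X_np, y_np)
    resid = y_np - fit.predict(X_np)          # residuals from the NPS
    pred_ps = fit.predict(X_ps)               # predictions for PS units
    d = np.asarray(d_ps, dtype=float)

    F_hat = np.empty(len(t_grid))
    for k, t in enumerate(t_grid):
        # Model-based P(y_i <= t) for each PS unit: fraction of residuals
        # with m(x_i) + e <= t, then a design-weighted mean over the PS.
        unit_cdf = (pred_ps[:, None] + resid[None, :] <= t).mean(axis=1)
        F_hat[k] = np.sum(d * unit_cdf) / np.sum(d)
    return F_hat

def quantile_from_cdf(t_grid, F_hat, q):
    """Invert the estimated CDF on the grid (t_grid must cover the data range)."""
    return t_grid[min(np.searchsorted(F_hat, q), len(t_grid) - 1)]
```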

Keywords

TBD 

Co-Author(s)

Jeremy Flood
Sayed Mostafa, North Carolina A&T State University

Speaker

Sayed Mostafa, North Carolina A&T State University