Survey Data Integration for Estimating Distribution Functions and Quantiles

Jeremy Flood Co-Author
 
Sayed Mostafa Co-Author
North Carolina A&T State University
 
Sayed Mostafa Speaker
North Carolina A&T State University
 
Thursday, Aug 7: 9:15 AM - 9:35 AM
Topic-Contributed Paper Session 
Music City Center 
Estimates of finite population cumulative distribution functions (CDFs) and quantiles are critical for policy-making, resource allocation, and public health planning. For instance, federal finance agencies may require accurate estimates of the proportion of individuals with incomes below the federal poverty line to determine funding eligibility, while health organizations may rely on precise quantile estimates of key health variables to guide local health interventions. Despite the growing interest in survey data integration, research on the integration of probability and nonprobability samples for estimating CDFs and quantiles remains limited. In this study, we propose a novel residual-based CDF estimator that integrates information from a probability sample with data from potentially large nonprobability samples. Our approach leverages shared covariates observed in both datasets, while the response variable is available only in the nonprobability sample. Using a semiparametric approach, we train an outcome model on the nonprobability sample and incorporate model residuals with sampling weights from the probability sample to estimate the CDF of the target variable. Based on this CDF estimator, we define a quantile estimator and introduce linearization and bootstrap methods for variance estimation of both the CDF and quantile estimators. Under certain regularity conditions, we provide the asymptotic bias and variance formulae of the CDF estimator and compare them to the corresponding formulae of the naïve CDF estimator derived from the nonprobability sample only. Our empirical results demonstrate the favorable performance of the proposed estimators. A real data example will be presented to illustrate the proposed estimators.
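The abstract does not state the estimator's formula, so the following is only a minimal sketch of one plausible residual-based construction, not the authors' method. It assumes a Chambers-Dunstan-style form: an outcome model is fit on the nonprobability sample B, its residuals are added to the model predictions for each unit of the probability sample A, and the resulting pseudo-values are weighted by the design weights of A. An ordinary least-squares line stands in for the semiparametric outcome model, and the simulated population, the sample sizes, and all function names (`fit_linear`, `build_cdf`, `cdf_at`, `quantile_at`) are hypothetical.

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares line; a stand-in for the semiparametric outcome model."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def predict_linear(beta, x):
    return np.column_stack([np.ones(len(x)), x]) @ beta

def build_cdf(pred_A, w_A, resid_B):
    """Weighted ECDF of the pseudo-values pred_A[i] + resid_B[j].

    Pseudo-value (i, j) carries mass w_A[i] / (sum(w_A) * n_B), so the
    step function equals
    F(t) = sum_i w_i * mean_j 1{pred_i + resid_j <= t} / sum_i w_i.
    """
    n_B = len(resid_B)
    vals = (pred_A[:, None] + resid_B[None, :]).ravel()
    mass = np.repeat(w_A / (w_A.sum() * n_B), n_B)
    order = np.argsort(vals)
    return vals[order], np.cumsum(mass[order])

def cdf_at(t, vals, cum):
    """Evaluate the step-function CDF at t."""
    idx = np.searchsorted(vals, t, side="right")
    return np.where(idx > 0, cum[np.maximum(idx - 1, 0)], 0.0)

def quantile_at(p, vals, cum):
    """Smallest pseudo-value whose cumulative mass reaches p."""
    idx = np.searchsorted(cum, p, side="left")
    return vals[min(idx, len(vals) - 1)]

# --- Simulated illustration (all settings are assumptions) ---
rng = np.random.default_rng(0)
N = 20_000
x_pop = rng.normal(size=N)
y_pop = 1.0 + 2.0 * x_pop + rng.normal(size=N)

# Probability sample A: simple random sample, covariate only, weights N / n_A.
n_A = 500
idx_A = rng.choice(N, size=n_A, replace=False)
x_A, w_A = x_pop[idx_A], np.full(n_A, N / n_A)

# Nonprobability sample B: inclusion favors large x; both x and y observed.
n_B = 2_000
p_sel = np.exp(x_pop) / np.exp(x_pop).sum()
idx_B = rng.choice(N, size=n_B, replace=False, p=p_sel)
x_B, y_B = x_pop[idx_B], y_pop[idx_B]

# Train the outcome model on B, then combine residuals with A's weights.
beta = fit_linear(x_B, y_B)
resid_B = y_B - predict_linear(beta, x_B)
vals, cum = build_cdf(predict_linear(beta, x_A), w_A, resid_B)

true_med = np.median(y_pop)
q_hat = quantile_at(0.5, vals, cum)   # integrated median estimate
q_naive = np.median(y_B)              # naive estimate from B alone

print("population median:", true_med)
print("integrated estimate:", q_hat)
print("naive estimate:", q_naive)
```

In this toy setting, selection into B depends only on the covariate and the outcome model is correctly specified, which is the favorable case for a residual-based estimator. The sketch does not cover the linearization or bootstrap variance estimators mentioned in the abstract.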

Keywords

TBD