Innovative Analytic Strategies for Navigating Nonprobability Samples

Chair: Chao Xu, University of Oklahoma Health Sciences Center
Discussant: Brady West, Institute for Social Research
Organizer: Sixia Chen
 
Sunday, Aug 3: 4:00 PM - 5:50 PM
Session 0564: Topic-Contributed Paper Session
Music City Center, Room CC-104E


Main Sponsor

Survey Research Methods Section

Co-Sponsors

Government Statistics Section
Social Statistics Section

Presentations

Data Integration With Biased Summary Data via Generalized Entropy Balancing

Statistical methods for integrating individual-level data with external summary data have attracted attention due to their potential to significantly reduce data collection costs. Effective utilization of summary data can enhance estimation precision, thereby saving both time and resources. However, incorporating external data introduces the risk of bias, primarily due to potential differences in background distributions between the current study and the external source.
Model-based approaches, such as mass imputation and propensity score balancing, have been developed to integrate external summary data with internal individual-level data while mitigating these biases. Nonetheless, these methods remain vulnerable to bias from model misspecification.
In this paper, we propose a methodology utilizing generalized entropy balancing, designed to integrate external summary data even when derived from biased samples. Our method exhibits double robustness, offering enhanced protection against specific types of model misspecification.
We illustrate the versatility and effectiveness of our proposed estimator through an application to the analysis of Nationwide Public-Access Defibrillation data in Japan. 
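
As a rough illustration of the entropy-balancing idea behind this class of estimators (a minimal sketch, not the authors' generalized procedure for biased summary data), the snippet below solves the standard Kullback-Leibler entropy-balancing dual so that the weighted internal covariate means reproduce the external summary means; the function name and toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def entropy_balance(X, target_means):
    """Weights proportional to exp(X @ lam), with lam chosen so that the
    weighted covariate means match the external summary means
    (Kullback-Leibler entropy balancing)."""
    def dual(lam):
        # Convex dual of the entropy-balancing problem
        return np.log(np.mean(np.exp(X @ lam))) - target_means @ lam
    lam_hat = minimize(dual, np.zeros(X.shape[1]), method="BFGS").x
    w = np.exp(X @ lam_hat)
    return w / w.sum()

# Toy check: weighted internal means reproduce the external summary means
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))    # internal individual-level covariates
target = np.array([0.3, -0.1])   # external summary means
w = entropy_balance(X, target)
print(w @ X)                     # approximately [0.3, -0.1]
```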

Speaker

Kosuke Morikawa

Sensitivity Analyses for Nonignorable Selection Bias When Estimating Subgroup Parameters in Nonprobability Samples

Selection bias in survey estimates is a major concern for both non-probability samples and probability samples with low response rates. The proxy pattern-mixture model (PPMM) has been proposed as a method for conducting a sensitivity analysis that allows selection to depend on the survey outcomes of interest, i.e., assuming a nonignorable selection mechanism. Indices based on the PPMM have been proposed and used to quantify the potential for nonignorable nonresponse or selection bias, including the SMUB for means and the MUBP for proportions. These methods require information from a reference data source, such as a large probability-based survey, with summary-level auxiliary information for the target population of interest (means, variances, and covariances of the auxiliary variables). To date, the SMUB/MUBP measures have been used exclusively to estimate bias in overall population-level estimates. Extension to domain-level estimates is straightforward if the reference data source contains the domain indicator, so that population-level margins within the domain of interest can be calculated.

However, interest may often lie in subgroups for which population-level summaries are not available. This happens when the domain indicator is observed only in the survey (not in the reference data source), and can also happen when the goal is estimation within intersectional subgroups for which stable, reliable population-level estimates of the auxiliary variables may not be available. To address this issue, we propose creating nonignorable selection weights based on the PPMM and using these weights for domain estimation and subsequent calculation of the SMUB/MUBP within subgroups.

These PPMM selection weights rely on a single sensitivity parameter that ranges from 0 to 1 and captures a range of selection mechanisms, from ignorable to an "extreme" nonignorable mechanism in which selection depends only on the outcome of interest. The weights are based on the re-expression of the PPMM as a selection model, using the known equivalence between pattern-mixture models and selection models. In this talk, we briefly describe this re-expression and illustrate the use of the novel nonignorable selection weights to estimate various subgroup quantities using the Census Household Pulse Survey under a range of assumptions about the selection mechanism.
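
For orientation, the sketch below evaluates the commonly cited SMUB form from the PPMM literature over a grid of the sensitivity parameter phi (0 = ignorable selection, 1 = selection depends only on the outcome); it assumes the proxy is the predicted outcome built from the auxiliary variables and that the population proxy mean comes from the reference data source. This illustrates the index only, not the subgroup selection weights proposed in the talk, and the function name is hypothetical.

```python
import numpy as np

def smub(y, proxy, proxy_pop_mean, phis=(0.0, 0.5, 1.0)):
    """Standardized measure of unadjusted bias for a sample mean under the
    proxy pattern-mixture model, at several sensitivity values phi."""
    rho = np.corrcoef(y, proxy)[0, 1]                        # proxy strength in the selected sample
    d = (proxy.mean() - proxy_pop_mean) / proxy.std(ddof=1)  # standardized proxy-mean deviation
    return {phi: (phi + (1 - phi) * rho) / (phi * rho + (1 - phi)) * d
            for phi in phis}
```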

Keywords

proxy pattern-mixture model

domain estimation 

Co-Author

Seth Adarkwah Yiadom, The Ohio State University

Speaker

Rebecca Andridge, The Ohio State University

Analysis of Non-Probability Samples Using Generalized Entropy Calibration

Statistical analysis of non-probability sample (NPS) survey data is an important area of research in survey sampling. We consider a unified approach to voluntary survey data analysis under the assumption that the sampling mechanism is ignorable. Generalized entropy calibration is introduced as a unified tool for calibration weighting to control selection bias. We first establish the relationship between generalized calibration weighting and its dual expression for regression estimation. This dual relationship is critical for identifying the implied regression model and for developing model selection for calibration weighting. In addition, if a linear regression model for an important study variable is available, a two-step calibration method can be used to smooth the final weights and improve statistical efficiency.
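
As a schematic of the calibration setup being described (general form only; the specific entropy function, model selection, and two-step refinement are the speaker's), the calibration weights solve

\[
\hat{w} \;=\; \arg\min_{w} \sum_{i \in A} G(w_i)
\quad \text{subject to} \quad
\sum_{i \in A} w_i \, x_i \;=\; \sum_{i \in U} x_i ,
\]

and the first-order conditions give \( g(\hat{w}_i) = x_i^{\top} \hat{\lambda} \) with \( g = G' \): each weight is a known monotone function of a linear index in the covariates, which is the implied regression structure that the dual expression makes explicit.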

Speaker

Jae-Kwang Kim, Iowa State University

Boosted Pseudo-Weighting for Nonprobability Samples to Improve Population Inference

Nonprobability samples have rapidly emerged to address time-sensitive priority topics in various fields. While these data are timely, they are prone to selection bias. To mitigate selection bias, a wide body of literature in survey research has explored the use of propensity-score (PS) adjustment methods to enhance the population representativeness of nonprobability samples, using probability-based survey samples as external references. A recent advancement, the two-step PS-based pseudo-weighting adjustment method (2PS; Li, 2024), has been shown to improve upon recent developments with respect to mean squared error. However, the effectiveness of these methods in reducing bias critically depends on the ability of the underlying propensity model to accurately reflect the true (self-)selection process, which is challenging with parametric regression. In this study, we propose a set of pseudo-weight construction methods, 2PS-ML, which use machine learning (ML) methods to estimate the PSs and 2PS to construct pseudo-weights from the ML-estimated PSs, offering greater flexibility than logistic regression-based methods. We compare the proposed 2PS-ML pseudo-weights, based on gradient boosting, with existing methods including 2PS. The proposed methods are evaluated numerically via simulation studies and empirically using the naïve unweighted National Health and Nutrition Examination Survey III sample, with the 1997 National Health Interview Survey as the reference, to estimate various health outcomes.
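
For orientation, a generic one-step propensity pseudo-weighting sketch using gradient boosting (not the proposed two-step 2PS-ML construction): stack the nonprobability and reference samples, model membership in the nonprobability sample with the reference design weights applied, and invert the estimated odds. Function and variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def pseudo_weights(X_np, X_ref, ref_weights):
    """One-step boosted propensity pseudo-weighting: fit a model for
    membership in the nonprobability sample against a weighted reference
    sample, then form inverse-odds pseudo-weights for the nonprobability units."""
    X = np.vstack([X_np, X_ref])
    z = np.r_[np.ones(len(X_np)), np.zeros(len(X_ref))]  # 1 = nonprobability sample
    sw = np.r_[np.ones(len(X_np)), ref_weights]          # design weights for the reference sample
    gbm = GradientBoostingClassifier().fit(X, z, sample_weight=sw)
    p = gbm.predict_proba(X_np)[:, 1]                    # estimated selection propensities
    return (1 - p) / p                                   # estimated inverse-odds pseudo-weights
```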

Speaker

Yan Li, University of Maryland, College Park