Combining Probability and Non-Probability Data: Considerations, Methods, and Applications

Chair

Brady West, Institute for Social Research

Organizers

Morgan Earp, National Center for Health Statistics
Katherine Irimata, National Center for Health Statistics
 
Monday, Aug 5: 2:00 PM - 3:50 PM
Session 1359 (Invited Paper Session)
Oregon Convention Center
Room: CC-C123

Applied: Yes

Main Sponsor

Survey Research Methods Section

Co-Sponsors

Government Statistics Section
Social Statistics Section

Presentations

A Look at Propensity-based Methods for Combining Probability and Non-probability Sample Data

While the probability sample provides the theoretical backbone for extrapolating from a sample to a population, non-sampling errors such as nonresponse and measurement error have made it more challenging and costly to rely solely on probability samples. As a complement, many have proposed using administrative or convenience data, which have no probabilistic connection to the population but may offer richer detail on individuals and may be less costly to acquire. We present some recent approaches from the literature for combining probability and non-probability samples using propensity-based methods, which attempt to combine the benefits of both sources while mitigating their shortcomings. We highlight motivating applications from ecology, public health, and genetics.
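
As context for the talk, the following is a minimal sketch of one widely used propensity-based approach: stack the nonprobability sample on top of a design-weighted probability (reference) sample, model membership in the nonprobability sample, and convert the fitted propensities into inverse-odds pseudo-weights. The function and variable names are illustrative, not taken from the presentation.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def inverse_odds_pseudo_weights(X_nonprob, X_prob, w_prob):
        """Pseudo-weights for a nonprobability sample.

        Fits a logistic model for membership in the nonprobability sample
        versus a design-weighted probability (reference) sample, then maps
        each fitted propensity p to the inverse odds (1 - p) / p.
        """
        X = np.vstack([X_nonprob, X_prob])
        z = np.concatenate([np.ones(len(X_nonprob)), np.zeros(len(X_prob))])
        # Design weights let the probability sample stand in for the population.
        case_w = np.concatenate([np.ones(len(X_nonprob)), w_prob])
        model = LogisticRegression(max_iter=1000).fit(X, z, sample_weight=case_w)
        p = model.predict_proba(X_nonprob)[:, 1]
        return (1.0 - p) / p

    # A weighted mean of the nonprobability outcomes then serves as the
    # population estimate: np.average(y_nonprob, weights=pseudo_weights)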

Speaker

Matt Williams, RTI

A New Evaluation of the Impact of Combining Probability and Non-probability Sample Data

In recent years, survey researchers have begun to explore the possibility of "sample blending", wherein a questionnaire is administered simultaneously to a probability sample selected randomly from a population and to a non-probability sample of people who volunteer to complete questionnaires without compensation and have not been selected through any purposive method. A great deal of research shows that probability samples continue to yield highly accurate characterizations of populations, whereas non-probability samples yield notably less accurate measurements. Sample blending involves weighting a non-probability sample to match a probability sample on a handful of variables, with the intent that the weighting will eliminate the inaccuracy of the non-probability sample and yield an effectively larger sample size at much lower cost than would be incurred by collecting exclusively probability sample data. This paper tests the effectiveness of a variety of weighting approaches applied to datasets collected from large probability and non-probability national samples that answered the same long and elaborate questionnaire, affording opportunities for a range of analyses.
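
The weighting step at the heart of sample blending is often implemented by raking the non-probability sample to margins estimated from the probability sample. Below is a minimal sketch of that idea with hypothetical variable names; the paper's actual weighting approaches are more varied.

    import numpy as np
    import pandas as pd

    def rake(df, targets, max_iter=100, tol=1e-8):
        """Iterative proportional fitting (raking).

        df      : DataFrame holding the non-probability sample.
        targets : dict mapping a column name to a dict of target category
                  proportions, e.g. estimated from the probability sample.
        Returns weights whose weighted margins match the targets.
        """
        w = np.ones(len(df))
        for _ in range(max_iter):
            max_adj = 0.0
            for var, margin in targets.items():
                totals = pd.Series(w).groupby(df[var].values).sum()
                share = totals / totals.sum()
                factor = df[var].map(lambda k: margin[k] / share[k]).values
                w *= factor
                max_adj = max(max_adj, np.abs(factor - 1.0).max())
            if max_adj < tol:  # weighted margins have converged
                break
        return w * len(df) / w.sum()  # normalize to mean 1

    # Hypothetical usage:
    # w = rake(df, {"age_group": {"18-49": 0.55, "50+": 0.45},
    #               "education": {"college": 0.35, "no_college": 0.65}})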

Co-Author

Sierra Davis, Stanford University

Speaker

Jon Krosnick, Stanford University

Comparing Alternative Estimation Methods Using Combined Probability and Nonprobability Samples

Probability sampling remains the standard basis for inference from a sample to a population. With declining participation and increasing costs, however, there has been growing interest in combining probability and nonprobability samples to improve the timeliness and cost efficiency of survey estimation without loss of statistical accuracy. An array of estimation methods for combining probability and nonprobability samples can be found in the literature. In this paper, we compare the performance of a group of such methods through Monte Carlo simulations. The simulation samples are created from the completed interviews of a large-scale national study that employed both probability and nonprobability samples. Five estimation methods are compared: (1) propensity matching, (2) a decision tree model, (3) inverse probability weighting, (4) mass imputation, and (5) doubly robust estimation. The first two methods were developed at NORC, while the other three are implemented in the recent R package nonprobsvy. Evaluation metrics include variance, bias, mean squared error, and confidence interval coverage.

Key words: Nonprobability sample; Pseudo inclusion probability; Estimation methods
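
For readers unfamiliar with the last of the listed methods, a doubly robust estimator of a population mean can be sketched as follows. This is a Hájek-style variant under assumed inputs (pseudo-weights for the nonprobability sample, an outcome model's predictions for both samples, and the probability sample's design weights), not the paper's exact implementation; the estimator is consistent if either the propensity model behind the pseudo-weights or the outcome model is correctly specified. A small helper for the stated evaluation metrics follows.

    import numpy as np

    def doubly_robust_mean(y_np, w_np, mhat_np, mhat_prob, d_prob):
        """Doubly robust estimate of a population mean.

        y_np      : outcomes observed in the nonprobability sample
        w_np      : pseudo-weights (inverse estimated propensities)
        mhat_np   : outcome-model predictions, nonprobability sample
        mhat_prob : outcome-model predictions, probability sample
        d_prob    : design weights for the probability sample
        """
        # Mass-imputation term: project the outcome model onto the
        # population via the probability sample's design weights.
        imputation = np.sum(d_prob * mhat_prob) / np.sum(d_prob)
        # Bias correction: pseudo-weighted residuals from the
        # nonprobability sample.
        correction = np.sum(w_np * (y_np - mhat_np)) / np.sum(w_np)
        return imputation + correction

    def simulation_metrics(estimates, std_errors, truth):
        """Monte Carlo bias, variance, MSE, and 95% CI coverage."""
        estimates = np.asarray(estimates)
        bias = estimates.mean() - truth
        var = estimates.var(ddof=1)
        covered = np.abs(estimates - truth) <= 1.96 * np.asarray(std_errors)
        return {"bias": bias, "variance": var,
                "mse": bias ** 2 + var, "coverage": covered.mean()}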

Co-Author(s)

Soubhik Barari, NORC at the University of Chicago
David Dutwin, NORC at the University of Chicago
Chien-Min Huang, NORC at the University of Chicago
Stanislav Kolenikov, NORC at the University of Chicago

Speaker

Michael Yang, NORC at the University of Chicago

Leveraging Non-Probability Data at the National Center for Health Statistics

As the principal federal health statistics agency in the United States, the National Center for Health Statistics (NCHS) guides actions and policies through the dissemination of health statistics. These data are collected through a variety of sources, including vital records, population and healthcare provider surveys, and, more recently, probability-based commercial panel surveys. In addition, non-probability data have been considered in certain applications to expand research, data collection, and the reporting of official health statistics. This presentation gives an overview of the non-probability data and methods being utilized at NCHS, including procedures for combining data from probability and non-probability panels, the use of non-probability data to produce more granular estimates (such as small domain estimates), and approaches for oversampling hard-to-reach populations. NCHS has also utilized a commercially available non-probability hospital database to augment a probability-based sample of hospitals providing healthcare encounter data. The presentation concludes the session by discussing considerations for using non-probability data in a federal statistical agency.
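
As a generic illustration of one way estimates from two such sources can be combined (an inverse-MSE composite, not NCHS's specific procedure), consider the following sketch. Penalizing the non-probability source with its estimated MSE rather than its variance automatically down-weights a source with suspected selection bias.

    def blend_estimates(theta_prob, var_prob, theta_nonprob, mse_nonprob):
        """Inverse-MSE composite of two estimates of the same quantity.

        theta_prob / var_prob       : probability-based estimate and variance
        theta_nonprob / mse_nonprob : non-probability estimate and estimated
                                      MSE (variance plus squared bias)
        """
        lam = (1.0 / var_prob) / (1.0 / var_prob + 1.0 / mse_nonprob)
        return lam * theta_prob + (1.0 - lam) * theta_nonprob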

Co-Author(s)

Paul Scanlon, National Center for Health Statistics
Lauren Rossen, National Center for Health Statistics
Guangyu Zhang, National Center for Health Statistics

Speaker

Katherine Irimata, National Center for Health Statistics

Utilizing Data from an Incomplete Sample to Supplement the Probability-Based U.S. PIAAC Cycle II

Incomplete survey data can arise when there are unexpected disruptions to data collection, resulting in a sample that is a product of the probability-based sample design, the non-probabilistic mechanism that determined which sampled cases were worked, and nonresponse. In this paper, we describe a method used in the U.S. PIAAC Survey for combining incomplete survey data with complete survey data. The sample design consisted of a core national sample and a state-based supplemental sample. Data collection for the state supplement was halted less than halfway into the data collection period, before interviewers had visited all areas. Although the core sample was sufficient for national estimates, including the partial data from the state supplement could help improve small area estimates and psychometric modeling. We combined the two samples using a composite weighting technique, in which the compositing factor was based on an effective sample size, to reflect variance, and a Kolmogorov-Smirnov statistic, to reflect potential bias. As an evaluation, we compared survey estimates, variances, and measures of association based on the composite weights with those based on the weighted core sample alone.
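
The compositing ingredients described above can be sketched as follows. The effective sample size and weighted Kolmogorov-Smirnov distance are standard quantities; the combination rule at the end is only an assumed illustration, since the exact PIAAC formula is not reproduced here.

    import numpy as np

    def effective_sample_size(w):
        """Kish effective sample size: (sum w)^2 / sum(w^2)."""
        w = np.asarray(w, dtype=float)
        return w.sum() ** 2 / (w ** 2).sum()

    def weighted_ks(x_a, w_a, x_b, w_b):
        """Two-sample KS distance between weighted empirical CDFs, used as
        a rough indicator of distributional differences (potential bias)."""
        def ecdf(x, w, grid):
            order = np.argsort(x)
            cum = np.concatenate([[0.0], np.cumsum(np.asarray(w, float)[order])])
            cum /= cum[-1]
            return cum[np.searchsorted(np.asarray(x)[order], grid, side="right")]
        grid = np.sort(np.concatenate([x_a, x_b]))
        return np.abs(ecdf(x_a, w_a, grid) - ecdf(x_b, w_b, grid)).max()

    def compositing_factor(w_core, w_supp, ks):
        """Illustrative compositing factor: each sample contributes in
        proportion to its effective sample size, with the supplement
        discounted by (1 - KS) to reflect potential bias."""
        n_core = effective_sample_size(w_core)
        n_supp = effective_sample_size(w_supp) * (1.0 - ks)
        return n_core / (n_core + n_supp)

    # Composite weights: scale core weights by the factor and supplement
    # weights by one minus the factor before combining the two files.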

Co-Author(s)

Tom Krenzke, Westat
Benjamin Schneider, Westat
Mike Kwanisai, Westat

Speaker

Wendy Van de Kerckhove, Westat