Monday, Aug 5: 2:00 PM - 3:50 PM
1359
Invited Paper Session
Oregon Convention Center
Room: CC-C123
Applied
Yes
Main Sponsor
Survey Research Methods Section
Co-Sponsors
Government Statistics Section
Social Statistics Section
Presentations
While the probability sample provides the theoretical backbone for extrapolating from a sample to a population, non-sampling errors such as nonresponse and measurement error have made it more challenging and costly to rely solely on probability samples. As a complement, many have proposed using administrative or convenience data, which have no probabilistic connection to the population but may offer richer detail on individuals and may be less costly to acquire. We will present some recent approaches in the literature for combining probability and non-probability samples using propensity-based methods, which aim to retain the benefits of both sources while mitigating their shortcomings. We highlight motivating applications from ecology, public health, and genetics.
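A minimal sketch of the core propensity idea, in base R with simulated data (the variable names, the single covariate, and the naive pooled logistic fit are illustrative assumptions; rigorous versions, such as the pseudo-likelihood approach of Chen, Li, and Wu, also incorporate the probability sample's design weights):

```r
# Sketch: pool a non-probability sample with a reference probability
# sample, fit a logistic model for membership in the non-probability
# sample, and use the fitted values as pseudo-inclusion probabilities.
set.seed(1)
n <- 1000
x_prob    <- rnorm(n)                        # covariate, probability sample
x_nonprob <- rnorm(n, mean = 0.5)            # covariate, convenience sample
y_nonprob <- 2 + 0.5 * x_nonprob + rnorm(n)  # outcome, observed only here

pooled <- data.frame(
  x = c(x_nonprob, x_prob),
  z = c(rep(1, n), rep(0, n))        # z = 1 marks non-probability units
)
fit  <- glm(z ~ x, family = binomial, data = pooled)
phat <- fitted(fit)[pooled$z == 1]   # pseudo-inclusion probabilities

# Inverse-propensity-weighted estimate of the population mean of y.
weighted.mean(y_nonprob, 1 / phat)
```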
In recent years, survey researchers have begun to explore the possibility of "sample blending", wherein a questionnaire is administered simultaneously to a probability sample selected randomly from a population and to a non-probability sample of people who volunteer to complete questionnaires without compensation and have not been selected by any purposive method. A great deal of research shows that probability samples continue to yield highly accurate characterizations of populations, whereas non-probability samples yield notably less accurate measurements. Sample blending involves weighting a non-probability sample to match a probability sample on a handful of variables, with the intent that the weighting will eliminate the inaccuracy of the non-probability sample and yield an effectively larger sample size at much lower cost than would be incurred by collecting exclusively probability-sample data. This paper tests the effectiveness of a variety of weighting approaches applied to datasets collected from large probability and non-probability national samples that answered the same long and elaborate questionnaire, affording opportunities for a range of analyses.
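As a rough illustration of the weighting step, the base-R sketch below matches a non-probability sample to a probability sample on a single categorical variable via cell weighting; the data frames, column names, and one-variable blend are assumptions, and practical blending would typically rake on several variables jointly:

```r
# Sketch: weight a non-probability sample so one categorical variable
# matches its (design-weighted) distribution in a probability sample,
# then stack the two files into a blended dataset.
blend_on <- function(prob, nonprob, var) {
  target  <- tapply(prob$wt, prob[[var]], sum) / sum(prob$wt)
  current <- table(nonprob[[var]]) / nrow(nonprob)
  nonprob$wt <- as.numeric(target[as.character(nonprob[[var]])] /
                           current[as.character(nonprob[[var]])])
  rbind(prob, nonprob)           # blended file; rescale weights as needed
}

# Usage (hypothetical data frames with columns y, age_group, wt):
# blended <- blend_on(prob_df, nonprob_df, "age_group")
# weighted.mean(blended$y, blended$wt)
```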
Probability sampling remains the standard basis for inference from a sample to a population. With declining participation and increasing costs, however, there has been growing interest in combining probability and nonprobability samples to improve the timeliness and cost efficiency of survey estimation without loss of statistical accuracy. A range of estimation methods for combining probability and nonprobability samples can be found in the literature. In this paper, we compare the performance of a group of these methods through Monte Carlo simulations. The simulation samples are created from the completed interviews of a large-scale national study that employed both probability and nonprobability samples. Five estimation methods are compared: (1) propensity matching, (2) decision tree modeling, (3) inverse probability weighting, (4) mass imputation, and (5) doubly robust estimation. The first two methods were developed at NORC; the other three are implemented in the recent R package nonprobsvy. Evaluation metrics include variance, bias, mean squared error, and confidence interval coverage.
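For concreteness, here is a minimal base-R sketch of a doubly robust estimator of the general form studied by Chen, Li, and Wu (2020), combining a mass-imputation outcome model with inverse pseudo-inclusion-probability weights; the simulated data and model specifications are illustrative assumptions, not the implementation compared in this paper:

```r
# Sketch: doubly robust estimation combining a mass-imputation outcome
# model (fit on the non-probability sample A) with inverse
# pseudo-inclusion-probability weights, against a probability sample B.
set.seed(2)
n_A <- 1000; n_B <- 1000
x_A <- rnorm(n_A, 0.5); y_A <- 1 + x_A + rnorm(n_A)  # sample A (nonprob)
x_B <- rnorm(n_B);      d_B <- rep(100, n_B)         # sample B, design wts

om  <- lm(y_A ~ x_A)                                 # outcome model on A
m_A <- fitted(om)
m_B <- predict(om, newdata = data.frame(x_A = x_B))

# Naive pooled logistic fit for pseudo-inclusion probabilities of A
# (a rigorous fit would use B's design weights in the pseudo-likelihood).
pool <- data.frame(x = c(x_A, x_B), z = c(rep(1, n_A), rep(0, n_B)))
p_A  <- fitted(glm(z ~ x, family = binomial, data = pool))[pool$z == 1]

N_hat <- sum(d_B)                         # estimated population size
# DR estimator: propensity-weighted residuals from A plus the
# design-weighted mean of model predictions over B.
(sum((y_A - m_A) / p_A) + sum(d_B * m_B)) / N_hat
```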
Key words: Nonprobability sample; Pseudo-inclusion probability; Estimation methods.
As the principal federal health statistics agency in the United States, the National Center for Health Statistics (NCHS) guides actions and policies through the dissemination of health statistics. These data are collected through a variety of sources, including vital records, population and healthcare provider surveys, and, more recently, probability-based commercial panel surveys. In addition, non-probability data have been considered in certain applications to expand research, data collection, and the reporting of official health statistics. This presentation gives an overview of the non-probability data and methods being utilized at NCHS, including procedures for combining data from probability and non-probability panels, the use of non-probability data to produce more granular estimates (such as small domain estimates), and the oversampling of hard-to-reach populations. NCHS has also utilized a commercially available non-probability hospital database to augment a probability-based sample of hospitals providing healthcare encounter data. This presentation concludes the session by discussing considerations for using non-probability data in a federal statistical agency.
Incomplete survey data can arise when there are unexpected disruptions to data collection, resulting in a sample that is a product of the probability-based sample design, the non-probabilistic mechanism that determined which sampled cases were worked, and nonresponse. In this paper, we describe a method used in the U.S. PIAAC Survey for combining incomplete survey data with complete survey data. The sample design consisted of a core national sample and a state-based supplemental sample. Data collection for the state supplement was halted less than halfway into the data collection period, before interviewers had visited all areas. Although the core sample was sufficient for national estimates, including the partial data from the state supplement could help improve small area estimates and psychometric modeling. We combined the two samples using a composite weighting technique, in which the compositing factor was based on effective sample size, to reflect variance, and a Kolmogorov-Smirnov statistic, to reflect potential bias. As an evaluation, we compared survey estimates, variances, and measures of association computed with the resulting composite weights to those based on the weighted core sample alone.
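A minimal base-R sketch of one plausible compositing rule in this spirit follows; the Kish effective-sample-size formula is standard, but the specific KS penalty and the functional form of the compositing factor are illustrative assumptions, not the formula used in the U.S. PIAAC work:

```r
# Sketch: composite two weighted estimates; the supplement's share is
# driven by Kish effective sample sizes and shrunk by a two-sample
# Kolmogorov-Smirnov distance on a key covariate.
kish_neff <- function(w) sum(w)^2 / sum(w^2)

composite_mean <- function(y_core, w_core, y_supp, w_supp, x_core, x_supp) {
  n_core <- kish_neff(w_core)
  n_supp <- kish_neff(w_supp)
  D <- unname(ks.test(x_core, x_supp)$statistic)  # crude bias indicator
  lambda <- n_core / (n_core + n_supp * (1 - D))  # illustrative form
  lambda * weighted.mean(y_core, w_core) +
    (1 - lambda) * weighted.mean(y_supp, w_supp)
}
```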