Big Data Initiatives in Survey Statistics

Jamie Ridenhour, Chair
RTI International
 
Tuesday, Aug 6: 8:30 AM - 10:20 AM
5094 
Contributed Papers 
Oregon Convention Center 
Room: CC-A104 

Main Sponsor

Survey Research Methods Section

Presentations

Adaptive Sampling Design for Estimating Spatiotemporal Pathogen Prevalence in Cities

Cities are vulnerable to pandemic threats, but predictive awareness and an emergent response can fundamentally change epidemic dynamics. We leverage passive data collection via wastewater-based epidemiology in disjoint microsewersheds that span a city to develop an ongoing, adaptive monitoring protocol for infectious diseases such as COVID-19 and influenza. Adaptive sampling in cities has the potential to efficiently detect hotspots and identify the emergence of pathogens even when prevalence in the general population is very low. Each week, we use the past time series of wastewater pathogen concentration measurements, block-level Census data, and information provided by local public health officials to select the next sampling locations so as to minimize opportunity cost and efficiently reduce uncertainty in parameter values. In this talk, we describe the adaptive sampling design, derive appropriate estimators and standard errors, and discuss their statistical properties, including robustness to small sample sizes, missing data, and measurement error. We present preliminary results from data collected between February and July 2024 in three cities in Oregon.
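As a rough illustration of the flavor of such a design (not the authors' actual protocol), the sketch below greedily picks next week's microsewersheds by weighting each site's current estimation uncertainty by the population it serves; the site counts, sampling history, noise variance, and weekly budget are all hypothetical.

```python
# Hypothetical one-week step of an adaptive spatial design (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

n_sites = 30                                   # hypothetical microsewersheds
population = rng.integers(500, 5000, n_sites)  # block-level Census counts (stand-in)
times_sampled = rng.integers(0, 8, n_sites)    # how often each site was sampled before
obs_noise_var = 0.5                            # assumed measurement-error variance

# Posterior variance of each site's mean log concentration under a flat prior:
# it shrinks like 1/(number of past samples); unvisited sites stay very uncertain.
post_var = obs_noise_var / np.maximum(times_sampled, 1)
post_var = np.where(times_sampled == 0, 4.0, post_var)

# Opportunity-cost-style criterion: uncertainty weighted by the people served.
score = post_var * population

k = 5                                          # weekly sampling budget
next_sites = np.argsort(score)[-k:][::-1]      # k highest-scoring sites
print("sample next week:", next_sites)
```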

Keywords

adaptive sampling

COVID-19

prevalence estimation

spatial sampling

wastewater-based epidemiology 

Co-Author(s)

Jeffrey Bethel, Oregon State University
Nicole Breuner, Oregon State University
Benjamin Dalziel, Oregon State University
Kathryn Higley, Oregon State University
Allison Myers, Oregon State University
Justin Preece, Oregon State University
Tyler Radniecki, Oregon State University

First Author

Katherine McLaughlin, Oregon State University

Presenting Author

Katherine McLaughlin, Oregon State University

Estimating Control Total Acres for Desired Geographies Using Cropland Data Layer

This project aims to accurately estimate the total acreage of pastureland in any specified geographic region. Photo-interpretation data from the National Resources Inventory (NRI), a well-established longitudinal survey assessing conditions and trends of soil, water, and related resources on the non-federal lands of the U.S., are used to achieve this goal. However, NRI weights are developed to represent only the non-federal land of entire states, and the number of sample points falling within a smaller target region may be small. In this paper, we develop a model-assisted approach that uses satellite-based Cropland Data Layer (CDL) data as auxiliary information via machine learning methods to accurately estimate the total acreage of pastureland in any arbitrary geographic region. The procedure encompasses three key steps: first, estimating the relationship between pastureland indicators in the survey data and numerous CDL variables; second, applying this relationship to project pastureland probabilities across the entire U.S. map; and finally, extracting specific regions from this imputed map to calculate the total acreage of pastureland.
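A minimal sketch of the three-step procedure follows, using a random forest as a stand-in for the unspecified machine learning model; the NRI points, CDL covariates, cell size, and region mask are all simulated placeholders.

```python
# Simulated stand-ins for NRI points and CDL covariates (illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

n_points, n_cdl = 2000, 10
X_nri = rng.normal(size=(n_points, n_cdl))      # CDL covariates at NRI points
y_nri = (X_nri[:, 0] + rng.normal(size=n_points) > 0).astype(int)  # pasture indicator

# Step 1: estimate the relationship between the pasture indicator and CDL variables.
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_nri, y_nri)

# Step 2: project pasture probabilities across a (simulated) map of grid cells.
n_cells = 50_000
X_map = rng.normal(size=(n_cells, n_cdl))
p_pasture = model.predict_proba(X_map)[:, 1]

# Step 3: extract a target region and sum probability-weighted cell areas.
cell_acres = 55.0                               # hypothetical acres per cell
in_region = rng.random(n_cells) < 0.1           # hypothetical region mask
total_acres = float(np.sum(p_pasture[in_region]) * cell_acres)
print(f"estimated pastureland acres in region: {total_acres:,.0f}")
```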

Keywords

Sample surveys

Spatial data analysis

Machine learning 

First Author

Mingyue Hu

Presenting Author

Mingyue Hu

Exploring the big data paradox for various estimands using vaccination data from a global survey

Selection bias poses a substantial challenge to valid statistical inference in non-probability samples. This study compared estimates of first-dose COVID-19 vaccination rates among Indian adults in 2021 from a large non-probability sample, the COVID-19 Trends and Impact Survey (CTIS), and a small probability survey, the Center for Voting Options and Trends in Election Research (CVoter), against national benchmark data from the COVID Vaccine Intelligence Network (CoWIN). Notably, CTIS exhibits a larger estimation error on average (0.37) than CVoter (0.14). Additionally, we explored the accuracy of CTIS in estimating successive differences (over time) and subgroup differences in mean vaccine uptake (females versus males). Compared to the overall vaccination rates, targeting these alternative estimands, which compare differences or relative differences between two means, increased the effective sample size. These results suggest that the Big Data Paradox can manifest in countries beyond the US and may not apply equally to every estimand of interest.
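For intuition, the back-of-envelope sketch below applies Meng's (2018) identity, error = rho × sqrt((N − n)/n) × sigma, with made-up numbers (not the CTIS/CVoter/CoWIN values) to show how a tiny data defect correlation rho can collapse a huge sample to a small effective sample size.

```python
# Made-up numbers, chosen only to show the mechanics of Meng's identity.
import math

N = 900_000_000   # rough adult population (assumption)
n = 300_000       # hypothetical big-data sample size
sigma = 0.45      # sd of a binary uptake indicator with p around 0.7
error = 0.05      # hypothetical error of the unweighted sample mean

# Data defect correlation implied by the observed error:
# error = rho * sqrt((N - n)/n) * sigma  =>  solve for rho.
rho = error / (math.sqrt((N - n) / n) * sigma)

# Effective sample size: the SRS size whose sampling error matches this MSE,
# i.e., sigma^2 / n_eff = error^2.
n_eff = sigma**2 / error**2
print(f"rho = {rho:.1e}; effective n = {n_eff:.0f} out of {n:,} respondents")
```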

Keywords

Big Data Paradox

Non-probability sample

Selection bias

Online survey

Vaccine uptake 

Co-Author(s)

Walter Dempsey
Peisong Han, Gilead Sciences
Yashwant Deshmukh, CVoter Foundation
Sylvia Richardson, University of Cambridge
Brian Tom, University of Cambridge
Bhramar Mukherjee, University of Michigan

First Author

Youqi Yang

Presenting Author

Youqi Yang

False Discovery Rate in Large-Scale Data Error Localization

Statistical data editing is the process of identifying potential response errors in data. It is subject to two types of errors: labeling a correct observation as erroneous and failing to identify an incorrect value. There is no statistical criterion for deciding how many observations should be edited, and over-editing can introduce data errors, degrade data quality, change the data structure, and increase costs. Error localization consists of separate tests on each observation, where the null hypothesis states that the observation is error-free and the alternative states that it is erroneous. The False Discovery Rate (FDR) is the fraction of false-positive findings among those deemed erroneous. Because FDR control is tied to the number of edited observations, imposing an FDR requirement specifies the number of outliers to be edited, thereby controlling over-editing. In this presentation, we apply FDR theory to error localization and verify the theory on simulated data.
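As a concrete illustration, the sketch below runs per-observation tests of the error-free null on simulated data and uses the Benjamini-Hochberg step-up rule, one standard FDR controller, to decide how many observations get flagged for editing; the normal null model and robust scale estimate are assumptions of the sketch, not necessarily the authors' choices.

```python
# Simulated reported values: mostly clean, with a few gross errors injected.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

n = 1000
x = rng.normal(100, 10, n)
is_error = rng.random(n) < 0.05
x[is_error] += rng.choice([-1.0, 1.0], is_error.sum()) * rng.uniform(40, 80, is_error.sum())

# Null: the observation is error-free, i.e., consistent with N(mu, sigma^2),
# with mu and sigma estimated robustly (median and MAD) to resist the errors.
mu = np.median(x)
sigma = 1.4826 * np.median(np.abs(x - mu))
pvals = 2 * stats.norm.sf(np.abs(x - mu) / sigma)

# Benjamini-Hochberg step-up at level q: flag the k smallest p-values, where
# k is the largest index with p_(k) <= k * q / n.
q = 0.10
order = np.argsort(pvals)
passed = pvals[order] <= q * np.arange(1, n + 1) / n
k = int(passed.nonzero()[0].max()) + 1 if passed.any() else 0
flagged = order[:k]
print(f"flagged {k} observations; true errors among them: {is_error[flagged].sum()}")
```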

Keywords

data editing

response errors

over-editing

multiple hypothesis tests

periodic surveys 

Co-Author(s)

Paul Smith, University of Maryland (retired)
Chin-Fang Weng, U.S. Census Bureau (retired)
Eric Slud, U.S. Census Bureau

First Author

Chin-Fang Weng, U.S. Census Bureau

Presenting Author

Chin-Fang Weng, U.S. Census Bureau

Selection bias in big data in official statistics from a practitioner’s point of view

Which is better: a simple random sample (SRS) with nonresponse or a large dataset with selection bias?

Selection bias is an increasingly serious problem in surveys. Inspired by Meng's (2018) paper "Statistical paradises and paradoxes in big data," which highlights statistical issues that the sheer size of data sets can incur, we simulated sequences of growing populations with two different data collection methods: a simple random sample (SRS) with nonresponse and organic data with selection bias ("big data"). The results showed a trade-off, driven by the amount of data available, between bias and coverage probability. Users of statistics often focus on good point estimates; in that case, a large nonprobability data source may be better than an SRS with nonresponse. On the other hand, if the user wants a reliable confidence interval, a probability sample with missingness may be preferred. Tools for comparing different data sources were investigated and discussed from a practical point of view.
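The sketch below illustrates the kind of simulation described, with made-up population and selection mechanisms: an SRS of 1,000 whose response propensity is mildly related to the outcome versus a large "organic" sample whose inclusion probability increases with the outcome, compared on bias and naive 95% CI coverage.

```python
# Made-up population and selection mechanisms (illustrative only).
import numpy as np

rng = np.random.default_rng(3)

N = 200_000
y = rng.normal(50, 10, N)
true_mean = y.mean()

reps = 200
bias_srs = cover_srs = bias_big = cover_big = 0.0
for _ in range(reps):
    # SRS of 1,000 with response propensity mildly related to y.
    srs = rng.choice(N, 1000, replace=False)
    respond = rng.random(1000) < 1 / (1 + np.exp(-(y[srs] - 50) / 50))
    ys = y[srs][respond]
    m, se = ys.mean(), ys.std(ddof=1) / np.sqrt(len(ys))
    bias_srs += (m - true_mean) / reps
    cover_srs += (abs(m - true_mean) <= 1.96 * se) / reps

    # "Big data": inclusion probability increases with y (selection bias),
    # yielding roughly a quarter of the population.
    big = rng.random(N) < 0.5 / (1 + np.exp(-(y - 52) / 10))
    yb = y[big]
    m, se = yb.mean(), yb.std(ddof=1) / np.sqrt(len(yb))
    bias_big += (m - true_mean) / reps
    cover_big += (abs(m - true_mean) <= 1.96 * se) / reps

print(f"SRS + nonresponse: bias = {bias_srs:.3f}, 95% CI coverage = {cover_srs:.2f}")
print(f"big data:          bias = {bias_big:.3f}, 95% CI coverage = {cover_big:.2f}")
```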

Keywords

Selection bias

Simulation study

Bias-variance tradeoff

Non-response bias 

Co-Author(s)

Dan Hedlin, Stockholm University
Edgar Bueno, Stockholm University

First Author

Martin Hyllienmark

Presenting Author

Martin Hyllienmark

The Use of Big Data-Based Model Prediction for Stratification of Household Addresses

The National Survey of Early Care and Education (NSECE) is the most comprehensive study of the availability and use of early care and education (ECE) in the U.S. Because the target population of the NSECE's household survey is a relatively small proportion of all households, the cost of screening households to determine eligibility has always been an important constraint for the NSECE. Like many household surveys, the NSECE also faces the twin challenges of declining response rates and rising data collection costs. In response, the 2024 NSECE incorporates big data classification and disproportionate stratification into its frame construction and sampling design. Commercial household data are used as inputs to a machine learning model that predicts the probability that a given household on the frame falls within the target population. Household addresses are then stratified accordingly, and households with a high probability of eligibility are oversampled. In this study, we will evaluate the tradeoff between cost savings and survey precision and compare realized eligibility rates during data collection to their predicted equivalents at the design stage.
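A minimal sketch of the design idea (not NORC's actual model or frame): score each frame address with a predicted eligibility probability from commercial covariates, cut the frame into strata on that score, and oversample the high-probability stratum. The covariates, cutpoints, and sampling rates below are all invented.

```python
# Simulated frame with commercial covariates and a latent eligibility flag.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

n_frame = 100_000
X = rng.normal(size=(n_frame, 5))          # stand-in commercial covariates
p_true = 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1] - 2)))
eligible = rng.random(n_frame) < p_true    # latent: revealed only by screening

# Fit the classifier on a labeled subset (e.g., prior-round screening results).
train = rng.choice(n_frame, 10_000, replace=False)
model = LogisticRegression(max_iter=1000).fit(X[train], eligible[train])
p_hat = model.predict_proba(X)[:, 1]

# Stratify the frame on predicted eligibility and oversample the high stratum.
strata = np.digitize(p_hat, [0.1, 0.3])    # 0 = low, 1 = medium, 2 = high
rates = np.array([0.002, 0.01, 0.05])      # disproportionate sampling rates
sampled = rng.random(n_frame) < rates[strata]

print(f"screened {sampled.sum()} addresses; "
      f"realized eligibility {eligible[sampled].mean():.1%}")
```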

Keywords

Big Data

Machine Learning

Stratification

Sample Design 

First Author

Noah Bassel, NORC

Presenting Author

Noah Bassel, NORC

The utility of big data for evaluating public opinion

Social media data sources like Twitter (now X) provide a wealth of information that could be used to evaluate public opinion in real time. However, users of Twitter (in particular, the most vocal ones) are not representative of the general population, and the characteristics that would traditionally be used for weighting to generalize such non-representative data (e.g., demographics) are unknown for Twitter users. By combining the results of two surveys, we show how proxies for such characteristics can be developed for any Twitter user, and we illustrate their use in developing weights that generalize a large universe of Twitter users to the broader population. Large language models are used to evaluate the sentiment of posts made by our universe of Twitter users regarding Donald Trump. The sentiment analysis, in combination with the statistical weighting, is used to track Trump's approval rating over the period from 9/20/2020 to 1/20/2021.
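A minimal sketch of the final aggregation step, assuming the per-user weights (from the proxy-based weighting) and per-post approval codes (from the LLM sentiment step) have already been computed; the inputs below are simulated stand-ins, not the study's data.

```python
# Simulated stand-ins for weighted users and LLM-coded posts.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

n_users, n_posts = 2000, 20_000
posts = pd.DataFrame({
    "user": rng.integers(0, n_users, n_posts),
    "day": pd.to_datetime("2020-09-20")
           + pd.to_timedelta(rng.integers(0, 123, n_posts), "D"),
    "approve": rng.random(n_posts) < 0.42,             # stand-in LLM sentiment code
})
user_weight = pd.Series(rng.gamma(2.0, 1.0, n_users))  # stand-in survey-based weights
posts["w"] = posts["user"].map(user_weight)

# Weighted daily approval: weight on approving posts over total weight that day.
daily = posts.assign(wa=posts["w"] * posts["approve"]).groupby("day")[["wa", "w"]].sum()
approval = daily["wa"] / daily["w"]
print(approval.rolling(7).mean().dropna().tail())      # smoothed 7-day series
```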

Keywords

Weighting

Big data

Sentiment analysis 

First Author

Michael Robbins, RAND Corporation

Presenting Author

Michael Robbins, RAND Corporation