Tuesday, Aug 6: 8:30 AM - 10:20 AM
5094
Contributed Papers
Oregon Convention Center
Room: CC-A104
Main Sponsor
Survey Research Methods Section
Presentations
Cities are vulnerable to pandemic threats, but predictive awareness and emergent response can fundamentally change epidemic dynamics. We leverage passive data collection via wastewater-based epidemiology in disjoint microsewersheds spanning a city to develop an ongoing and adaptive monitoring protocol for infectious diseases such as COVID-19 and influenza. Adaptive sampling in cities has the potential to efficiently detect hotspots and identify the emergence of pathogens even when prevalence in the general population is very low. Each week, we use the past time series of wastewater pathogen concentration measurements, Census data at the block level, and information provided by local public health officials to select the next sampling locations so as to minimize opportunity cost and efficiently reduce uncertainty in parameter values. In this talk, we describe the adaptive sampling design, derive appropriate estimators and standard errors, and discuss their statistical properties, including robustness to small sample sizes, missing data, and measurement error. We present preliminary results from data collected from February through July 2024 in three cities in Oregon.
Keywords
adaptive sampling
COVID-19
prevalence estimation
spatial sampling
wastewater-based epidemiology
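A minimal sketch, in Python, of the kind of weekly site-selection step described in the abstract above: microsewershed sites are chosen greedily by expected variance reduction per unit cost under a simple conjugate-normal model. The function name, the fixed measurement-error variance, and the budget formulation are illustrative assumptions, not the authors' design.

import numpy as np

def select_sites(post_var, cost, budget):
    """Greedily pick the sites with the largest expected variance reduction
    per unit cost until the weekly sampling budget is spent."""
    post_var = np.asarray(post_var, dtype=float).copy()
    cost = np.asarray(cost, dtype=float)
    meas_var = 0.5            # assumed variance of one wastewater measurement
    remaining = budget
    chosen = []
    while True:
        # Posterior variance after one more measurement at each site
        # (conjugate-normal update), hence the expected reduction.
        new_var = 1.0 / (1.0 / post_var + 1.0 / meas_var)
        gain = (post_var - new_var) / cost
        gain[cost > remaining] = -np.inf          # skip unaffordable sites
        best = int(np.argmax(gain))
        if not np.isfinite(gain[best]) or gain[best] <= 0:
            break
        chosen.append(best)
        post_var[best] = new_var[best]
        remaining -= cost[best]
    return chosen

# Toy example: 12 microsewersheds with uneven uncertainty and sampling costs.
rng = np.random.default_rng(0)
print("sample next week at sites:",
      select_sites(post_var=rng.uniform(0.2, 2.0, 12),
                   cost=rng.uniform(1.0, 3.0, 12),
                   budget=8.0))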
This project aims to accurately estimate the total acreage of pastureland in any specified geographic region. Photo-interpretation data from the National Resources Inventory (NRI), a well-established longitudinal survey assessing conditions and trends of soil, water, and related resources on non-federal U.S. lands, are used to achieve this goal. However, NRI weights are developed to represent the non-federal land of each state only, and the number of survey points falling within a smaller target region may be small. In this paper, we develop a model-assisted approach that uses satellite-based Cropland Data Layer (CDL) data as auxiliary information, via machine learning methods, to estimate the total acreage of pastureland in any arbitrary geographic region. The procedure involves three key steps: first, estimating the relationship between pastureland indicators in the survey data and numerous CDL variables; second, applying this relationship to project pastureland probabilities across the entire U.S. map; and finally, extracting the region of interest from this imputed map to calculate its total pastureland acreage.
Keywords
Sample surveys
Spatial data analysis
Machine learning
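The three-step procedure described above could be prototyped roughly as follows. The random forest, the column names (is_pasture, p_pasture), and the per-cell acreage conversion are assumptions for illustration, not the paper's implementation.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def fit_pasture_model(nri_points: pd.DataFrame, cdl_cols: list):
    """Step 1: relate the NRI pastureland indicator to CDL covariates."""
    model = RandomForestClassifier(n_estimators=300, random_state=0)
    model.fit(nri_points[cdl_cols], nri_points["is_pasture"])
    return model

def project_probabilities(model, grid: pd.DataFrame, cdl_cols: list):
    """Step 2: predict a pastureland probability for every map cell."""
    grid = grid.copy()
    grid["p_pasture"] = model.predict_proba(grid[cdl_cols])[:, 1]
    return grid

def region_acreage(grid: pd.DataFrame, region_mask: np.ndarray,
                   acres_per_cell: float) -> float:
    """Step 3: sum predicted probabilities over the cells in the region."""
    return float(grid.loc[region_mask, "p_pasture"].sum() * acres_per_cell)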
Selection bias poses a substantial challenge to valid statistical inference in non-probability samples. This study compared estimates of first-dose COVID-19 vaccination rates among Indian adults in 2021 from a large non-probability sample, the COVID-19 Trends and Impact Survey (CTIS), and a small probability survey, the Center for Voting Options and Trends in Election Research (CVoter) survey, against national benchmark data from the COVID Vaccine Intelligence Network (CoWIN). Notably, CTIS exhibits a larger estimation error on average (0.37) than CVoter (0.14). Additionally, we explored the accuracy of CTIS in estimating successive differences (over time) and subgroup differences in mean vaccine uptake (females versus males). Compared with estimating overall vaccination rates, targeting these alternative estimands, which compare differences or relative differences between two means, increased the effective sample size. These results suggest that the Big Data Paradox can manifest in countries beyond the US and may not apply equally to every estimand of interest.
Keywords
Big Data Paradox
Non-probability sample
Selection bias
Online survey
Vaccine uptake
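For readers unfamiliar with the effective-sample-size notion referenced above, the sketch below applies Meng's (2018) error decomposition, error = rho * sqrt((N - n)/n) * sd_Y, with placeholder numbers; none of the values are CTIS, CVoter, or CoWIN results.

import math

def data_defect_correlation(est, truth, sd_y, n, N):
    """Back out the data defect correlation rho from the identity
    error = rho * sqrt((N - n) / n) * sd_y."""
    return (est - truth) / (math.sqrt((N - n) / n) * sd_y)

def effective_sample_size(rho, n, N):
    """Size of a simple random sample with the same mean squared error."""
    return n / (rho ** 2 * (N - n))

# Hypothetical illustration: a huge online sample of a binary outcome
# (vaccine uptake) with a small selection correlation.
N, n = 900_000_000, 400_000           # adults, respondents (illustrative)
truth = 0.30                          # benchmark uptake (illustrative)
sd_y = math.sqrt(truth * (1 - truth))
rho = data_defect_correlation(est=0.45, truth=truth, sd_y=sd_y, n=n, N=N)
print(f"rho = {rho:.5f}, effective n = {effective_sample_size(rho, n, N):.1f}")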
Statistical data editing means identifying potential response errors in the data. Data editing is subject to two types of errors: labeling a correct observation as erroneous and failing to identify an incorrect value. There is no statistical criterion for deciding how many observations should be edited. Over-editing can increase data errors, degrade data quality, change the data structure, and increase costs. Error localization consists of separate tests on each observation, where the null hypothesis states that the observation is error-free and the alternative states that the observation is erroneous. The False Discovery Rate (FDR) is the fraction of false-positive findings among those deemed erroneous. Because FDR control is related to the number of edited observations, imposing an FDR requirement specifies the number of outliers to be edited, thereby controlling over-editing. In this presentation, we apply FDR theory to error localization and verify the theory on simulated data.
Keywords
data editing
response errors
over-editing
multiple hypothesis tests
periodic surveys
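As a rough illustration of FDR-controlled error localization, the sketch below runs one test per observation and applies the Benjamini-Hochberg step-up rule to decide which records to edit; the robust z-score test statistic is an assumption, not necessarily the statistic used in the presentation.

import numpy as np
from scipy import stats

def flag_for_editing(x, q=0.05):
    """Return indices of observations flagged as potential errors,
    with the false discovery rate controlled at level q."""
    med = np.median(x)
    mad = stats.median_abs_deviation(x, scale="normal")
    pvals = 2 * stats.norm.sf(np.abs(x - med) / mad)   # two-sided p-values

    # Benjamini-Hochberg step-up procedure.
    m = len(pvals)
    order = np.argsort(pvals)
    thresh = q * (np.arange(1, m + 1) / m)
    below = pvals[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    return np.sort(order[:k])

rng = np.random.default_rng(1)
reported = rng.normal(100, 5, 500)
reported[:5] *= 10          # a few gross response errors
print("records to edit:", flag_for_editing(reported))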
Which is better: simple random sample (SRS) with nonresponse or a large dataset with selection bias?
Selection bias is increasingly problematic in surveys. Inspired by Meng's (2018) paper "Statistical paradises and paradoxes in big data," which highlights the statistical issues that the sheer size of data sets can incur, we simulated sequences of growing populations under two different data collection methods: a simple random sample (SRS) with nonresponse and organic data with selection bias ("big data"). The results showed a trade-off, driven by the amount of data available, between bias and coverage probability. Users of statistics often focus on good point estimates; in that case, a large nonprobability data source may be better than an SRS with nonresponse. On the other hand, if the user wants a reliable confidence interval, a probability sample with missingness may be preferred. Tools for comparing different data sources were investigated and discussed from a practical point of view.
Keywords
Selection bias
Simulation study
Bias-variance tradeoff
Non-response bias
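A small simulation in the spirit of the comparison above: an SRS with (ignorable) nonresponse versus a large self-selected sample whose inclusion probability depends on the outcome. The population model, response mechanism, and sample sizes are illustrative assumptions, not the authors' simulation design.

import numpy as np

rng = np.random.default_rng(2)
N = 1_000_000
y = rng.normal(50, 10, N)                      # population values
true_mean = y.mean()

def srs_with_nonresponse(n=1_000, p_respond=0.4):
    idx = rng.choice(N, size=n, replace=False)
    respond = rng.random(n) < p_respond        # nonresponse unrelated to y
    return y[idx][respond]

def big_data_with_selection(scale=0.3):
    # Inclusion probability mildly increasing in y -> selection bias.
    p = scale / (1 + np.exp(-(y - true_mean) / 10))
    return y[rng.random(N) < p]

def summary(sample):
    m, se = sample.mean(), sample.std(ddof=1) / np.sqrt(len(sample))
    covered = abs(m - true_mean) <= 1.96 * se  # naive 95% CI
    return len(sample), m - true_mean, covered

for name, s in [("SRS + nonresponse", srs_with_nonresponse()),
                ("big data + selection", big_data_with_selection())]:
    n_obs, bias, cover = summary(s)
    print(f"{name:22s} n={n_obs:>7d}  bias={bias:+.3f}  CI covers truth: {cover}")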
The National Survey of Early Care and Education (NSECE) is the most comprehensive study of the availability and use of early care and education (ECE) in the U.S. Because the target population of the NSECE's household survey is a relatively small proportion of all households, the cost of screening households to determine eligibility has always been an important constraint for the NSECE. Like many household surveys, the NSECE also faces the twin challenges of declining response rates and rising data collection costs. In response, the 2024 NSECE incorporates big data classification and disproportionate stratification into its frame construction and sampling design. Household commercial data are used as inputs for a machine learning model that predicts the probability that a given household on the frame falls within the target population. Household addresses are then stratified accordingly, and households with a high probability of eligibility are oversampled. In this study, we will evaluate the trade-off between cost savings and survey precision and compare realized eligibility rates during data collection with those predicted at the design stage.
Keywords
Big Data
Machine Learning
Stratification
Sample Design
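The frame-construction idea described above might be sketched as follows: score every address with a predicted eligibility probability, cut the frame into probability strata, and oversample the high-probability strata. The gradient-boosting model, feature handling, cut points, and sampling rates are all illustrative assumptions, not the NSECE design.

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def score_frame(training: pd.DataFrame, frame: pd.DataFrame, features):
    """Fit an eligibility model on linked training cases and score the frame."""
    model = GradientBoostingClassifier(random_state=0)
    model.fit(training[features], training["eligible"])
    frame = frame.copy()
    frame["p_eligible"] = model.predict_proba(frame[features])[:, 1]
    return frame

def stratify_and_sample(frame: pd.DataFrame, rng=None):
    """Assign probability strata and disproportionate sampling rates."""
    if rng is None:
        rng = np.random.default_rng(0)
    cuts = [0.0, 0.1, 0.3, 1.0]                   # illustrative cut points
    rates = {0: 0.005, 1: 0.02, 2: 0.08}          # oversample likely-eligible
    frame = frame.copy()
    frame["stratum"] = pd.cut(frame["p_eligible"], cuts, labels=False,
                              include_lowest=True)
    frame["sampled"] = rng.random(len(frame)) < frame["stratum"].map(rates)
    return frame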
Social media data sources like Twitter (now X) provide a wealth of information that could be used to evaluate public opinion in real time. However, users of Twitter (in particular the most vocal ones) are not representative of the general population, and the characteristics that would traditionally be used for weighting to generalize such non-representative data (e.g., demographics) are unknown for Twitter users. By combining the results of two surveys, we show how proxies for such characteristics can be developed for any Twitter user, and we illustrate how these proxies can be used to develop weights that generalize a large universe of Twitter users. Large language models are used to evaluate the sentiment of posts made by our universe of Twitter users regarding Donald Trump. The sentiment analysis, in combination with the statistical weighting, is used to track Trump's approval rating over the period from 9/20/2020 to 1/20/2021.
Keywords
Weighting
Big data
Sentiment analysis
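The final aggregation step described above, combining per-post sentiment with survey-derived user weights into a weighted approval series, might look roughly like the sketch below. The column names and daily grouping are assumptions, and the LLM sentiment step is represented here by an already-populated sentiment column.

import pandas as pd

def weighted_approval(posts: pd.DataFrame, weights: pd.DataFrame) -> pd.Series:
    """posts: columns [user_id, date, sentiment] with sentiment in {-1, 0, +1};
    weights: columns [user_id, weight] from the survey-based calibration."""
    df = posts.merge(weights, on="user_id", how="inner")
    df["approve"] = (df["sentiment"] > 0).astype(float)
    df["w_approve"] = df["approve"] * df["weight"]
    daily = df.groupby("date")[["w_approve", "weight"]].sum()
    return daily["w_approve"] / daily["weight"]   # weighted approval by day

# Tiny illustrative input (not real data).
posts = pd.DataFrame({"user_id": [1, 1, 2, 3],
                      "date": ["2020-09-20"] * 2 + ["2020-09-21"] * 2,
                      "sentiment": [1, -1, 1, 0]})
weights = pd.DataFrame({"user_id": [1, 2, 3], "weight": [0.5, 2.0, 1.5]})
print(weighted_approval(posts, weights))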