SPEED 7: Statistical Methods in Surveys & Policy Applications, Part 1

Chair: Barbara Bailey, San Diego State University
 
Wednesday, Aug 7: 8:30 AM - 10:20 AM
5129 
Contributed Speed 
Oregon Convention Center 
Room: CC-D135 

Presentations

A Partially Observed Merton’s Jump Model for Ultra-High Frequency Financial Data with Bayesian Learn

Time-stamped transaction data, which carry the most detailed information on price evolution, were coined "ultrahigh-frequency (UHF) data" by Engle (2000). A general partially observed Markov process framework with marked point observations, and the related Bayesian inference (estimation and model selection) via stochastic filtering equations, were developed in Hu, Kuipers, and Zeng (2018a, 2018b). The general framework accommodates two features of UHF data: random trading times and trading noise. While several specific partially observed models, including the Black-Scholes (BS) and stochastic volatility models, have been studied, the partially observed Merton's model, which extends the BS model with a jump component representing the impact of good and bad news, has not been investigated. In this study, we fill this gap by proposing a partially observed Merton's model for ultra-high frequency financial data that accommodates both UHF-data features. The joint posterior distribution of the parameters of interest and the intrinsic value process (which follows the Merton model) is characterized by the normalized filtering equation. The Bayes factors of the partially ob

Keywords

Ultrahigh-frequency data

Partially observed Merton’s jump model

Normalized filtering equation

Bayes factors 

View Abstract 3794

Co-Author

Yong Zeng, National Science Foundation

First Author

Jamila Kridan

Presenting Author

Jamila Kridan

A Practical Approach for Case Prioritization in a Panel Survey

Case prioritization has been employed by survey researchers as an adaptive survey design strategy to achieve optimal goals under fixed resources. One major use is to target subgroups of low-response-propensity cases and prioritize them in interviewers' workloads without increasing data collection resources, in order to equalize response rates and reduce nonresponse bias. Although existing research has shown the effectiveness of case prioritization, allocating the prioritized cases in practical settings is not straightforward, especially in a dynamic case prioritization process. Tourangeau et al. (2017) provided a clear notion of which cases are most worth pursuing: the authors recommended a composite score that considers a case's response propensity, design weight, and effect on sample balance. Inspired by that research, this presentation provides a revised approach to identifying the most valuable cases in a panel survey setting with oversampling of subpopulations. 
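
The composite-score idea can be sketched as follows. This is a minimal illustration, not the actual scoring rule of Tourangeau et al. (2017); the example inputs and the equal weighting of the three standardized components are assumptions for demonstration only.

```python
import numpy as np

# Hypothetical case-level inputs; the equal weighting of the three
# factors is illustrative, not the scoring used by Tourangeau et al. (2017).
propensity = np.array([0.10, 0.45, 0.80, 0.25])   # estimated response propensity
design_wt  = np.array([3.0,  1.0,  1.0,  2.5])    # design weight
balance    = np.array([0.9,  0.2,  0.1,  0.7])    # effect on sample balance

def z(x):
    """Standardize a component so the three scales are comparable."""
    return (x - x.mean()) / x.std()

# Low propensity, high design weight, and high balance impact
# all raise a case's priority.
score = z(-propensity) + z(design_wt) + z(balance)
priority_order = np.argsort(-score)   # most valuable cases first
print(priority_order)
```

Under these inputs, case 0 (hard to reach, heavily weighted, balance-relevant) ranks first and case 2 ranks last.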

Keywords

case prioritization

response propensity

dynamic adaptive design

nonresponse bias 

View Abstract 2257

Co-Author(s)

Xiaoshu Zhu, Westat
Nicholas Askew, Westat
Ting Yan, Westat
Sylvia Dohrmann, Westat

First Author

Rui Jiao, Westat

Presenting Author

Rui Jiao, Westat

Accounting for reporting delays in real-time phylodynamic analyses with preferential sampling

The ongoing pandemic demonstrated that fast and accurate analysis of continually collected infectious disease surveillance data is crucial for situational awareness and policy making. Phylodynamic analysis uses genetic sequences of a pathogen to estimate changes in its genetic diversity in a population of interest, the effective population size, which under certain conditions can be connected to the number of infections in the population. Phylodynamics is an important tool because its methods utilize a data source in a way that is resilient to the ascertainment biases present in traditional surveillance data. Unfortunately, it takes weeks or months to sequence and obtain the sampled pathogen genome for use in such analyses. When the number of infections depends on the sampling frequency, the missing data results in underestimation of the effective population size. Here we present a method that affords reliable estimation of the effective population size trajectory closer to the time of data collection, allowing for policy decisions to be based on more recent data, with a better understanding of the limitations and uncertainties of such inference. 

Keywords

infectious disease dynamics

disease surveillance

Bayesian phylogenetics

genomic epidemiology

Bayesian nonparametrics 

View Abstract 2293

Co-Author(s)

Julia Palacios, Stanford University
Lorenzo Cappello, Pompeu Fabra University
Volodymyr Minin, University of California-Irvine

First Author

Catalina Medina, University of California, Irvine

Presenting Author

Catalina Medina, University of California, Irvine

Analysis of Total Survey Error in the 2022 National Immunization Survey-Child

Total survey error (TSE) is the difference between a survey estimate and the true value of the corresponding population parameter. We use TSE to evaluate sampling and nonsampling errors in vaccination coverage estimates for children aged 19-35 months from CDC's National Immunization Survey-Child. We derive estimates of sampling-frame coverage error, nonresponse error, measurement error, and sampling error using such data sources as the National Health Interview Survey and immunization information systems. A Monte Carlo approach then combines estimated distributions of error components into a TSE distribution for the survey estimate of vaccination coverage. The mean of the TSE distribution provides an estimate of total bias in the survey estimator, and the 95% credible interval provides an interval within which total survey error falls with 0.95 probability. Our estimates of mean TSE for 4+ doses of DTaP (-4.0 percentage points), 1+ doses of MMR (-1.7 pp), Hep B birth dose (-3.3 pp), and the combined 7-vaccine series (-9.2 pp) indicate underestimates of vaccination coverage. Measurement error (or provider underascertainment) is consistently found to be the largest error component. 
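
The Monte Carlo combination step can be sketched as follows. The component distributions here are hypothetical stand-ins (normal, in percentage points), not the error distributions estimated for the NIS-Child.

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws = 100_000

# Hypothetical error-component distributions (percentage points); the
# actual components are estimated from external sources such as the
# National Health Interview Survey and immunization information systems.
components = {
    "frame_coverage": rng.normal(-0.5, 0.4, n_draws),
    "nonresponse":    rng.normal(-0.8, 0.6, n_draws),
    "measurement":    rng.normal(-2.5, 1.0, n_draws),
    "sampling":       rng.normal( 0.0, 0.7, n_draws),
}

# Monte Carlo TSE distribution: sum the component draws.
tse = sum(components.values())

mean_tse = tse.mean()                      # estimated total bias
lo, hi = np.percentile(tse, [2.5, 97.5])   # 95% interval for TSE
print(f"mean TSE = {mean_tse:.2f} pp, 95% interval = ({lo:.2f}, {hi:.2f})")
```

The mean of the summed draws estimates total bias, and the 2.5th/97.5th percentiles give the 95% interval described in the abstract.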

Keywords

Total survey error

Sampling-frame coverage error

Nonresponse error

Random digit dialing 

View Abstract 2144

Co-Author(s)

Zachary Seeskin, NORC at the University of Chicago
Benjamin Skalland, NORC at the University of Chicago
Kirk Wolter, NORC at the University of Chicago & University of Chicago
Holly Hill, Centers for Disease Control and Prevention
David Yankey, CDC
Laurie D Elam-Evans, CDC
Yi Mu
Kushagra Vashist, CDC

First Author

Yuhei Koshino

Presenting Author

Yuhei Koshino

Analyzing Survey Data with Tree Models: rpms R Package

The R package rpms provides an algorithm for producing design-consistent tree models of survey data. Tree models are an effective and flexible way to analyze survey data because they provide an easily interpretable model with automatic variable selection and interaction effects, which makes them very popular with analysts working with data collected from surveys. Besides the functions for estimating these models, the package includes a number of functions that operate on the tree-based objects to assist in understanding and analyzing survey data. We will demonstrate many of the tools in this package on data collected from a complex sample. 

Keywords

Sample Design

Regression Tree

Machine Learning

Government Survey Data

Statistical Inference

Statistical Model 

View Abstract 3089

First Author

Daniell Toth, US Bureau of Labor Statistics

Presenting Author

Daniell Toth, US Bureau of Labor Statistics

By ignoring statistics, the government sometimes spread pandemic misinformation.

Media platforms allowed misinformation to propagate easily during the pandemic. In an attempt to quell misinformation, the CDC and the federal government attempted, sometimes successfully, to shut down debate. Yet, at times, these entities themselves spread false information regarding measures meant to prevent Covid. We will look briefly at two topics through government statements and scientific evidence: vaccines and masks. The scientific evidence behind the Covid vaccines was extremely strong, and seemingly difficult to overstate, but the CDC did indeed overstate their benefit, while ignoring other factors regularly considered when recommending vaccines, such as age, side effects, and prior exposure. Regarding masks, the CDC pushed policies based on little or no scientific data, and ignored or even suppressed scientific data that called their efficacy into doubt. 

Keywords

masks

vaccines

myocarditis

CDC

Covid 

First Author

Alan Salzberg, Salt Hill Statistical Consulting

Presenting Author

Alan Salzberg, Salt Hill Statistical Consulting

Contrastive dimension estimation

Contrastive dimension reduction methods have been used to uncover the low-dimensional structure that distinguishes one dataset (foreground) from another (background). However, current contrastive dimension reduction techniques do not estimate the number of unique dimensions, denoted as d_c, within the foreground data. Instead, they require this quantity as an input and proceed to estimate the dimensions themselves. In this paper, we formally define the contrastive dimension, d_c, and present what we believe to be the first estimator for this parameter. Under a linear model, we demonstrate the consistency of this estimator, establish a finite-sample error bound, and develop a hypothesis test for d_c = 0. This test is valuable for determining the suitability of a contrastive method for a given dataset. Furthermore, we provide a detailed analysis of our findings, supported by simulations using both synthetic and real-world datasets. 

Keywords

Dimension reduction

Contrastive dimension 

View Abstract 2535

Co-Author

Didong Li

First Author

Sam Hawke

Presenting Author

Sam Hawke

Cross-fitting model evaluation for small area estimation using complex survey data

Model checking, evaluation, and comparison in Small Area Estimation (SAE) with limited data are difficult. The generic problem: given a survey dataset D, what is a good metric for scoring a model M? Considering cluster sampling in national surveys, we would like to achieve two goals: 1) to score models based on their ability to estimate subpopulation prevalence at different administrative levels; 2) to decide whether a given model M can be accepted (or not rejected, under a hypothesis testing framework). Focusing on a scenario with one level of spatial unit, we want to score models based on their ability to produce national estimates. We evaluate models using scoring rules such as mean square error (MSE), continuous ranked probability score (CRPS), and distribution-free scores from conformal prediction, based on leave-one-region-out, leave-one-cluster-out, or other splitting methods, and we use design-based estimates as a reference. 
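
A minimal sketch of leave-one-cluster-out scoring with MSE follows. The data are synthetic, and a grand-mean predictor stands in for a candidate model M; both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical clustered survey data: 5 clusters of 20 observations each.
clusters = {c: rng.normal(loc=c, scale=1.0, size=20) for c in range(5)}

def loco_mse(clusters):
    """Leave-one-cluster-out score: predict each held-out cluster's values
    with the grand mean of the remaining clusters (a stand-in for a model M),
    then average the squared errors across held-out clusters."""
    errs = []
    for held_out, y in clusters.items():
        rest = np.concatenate([v for c, v in clusters.items() if c != held_out])
        pred = rest.mean()            # model M's prediction for held-out units
        errs.append(np.mean((y - pred) ** 2))
    return float(np.mean(errs))

print(f"leave-one-cluster-out MSE: {loco_mse(clusters):.3f}")
```

Swapping the squared-error line for a CRPS or conformal score, and the cluster loop for a region loop, gives the other splitting/scoring combinations mentioned above.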

Keywords

Cross validation

Small Area Estimation

Complex survey data 

View Abstract 3360

Co-Author

Zehang Richard Li, University of California, Santa Cruz

First Author

Qianyu Dong

Presenting Author

Qianyu Dong

Evaluation of Data Quality and Imputation Methods for EIA’s Liquefied Natural Gas Storage Report

In January 2022, the U.S. Energy Information Administration (EIA) launched a new census survey to collect natural gas inventory storage data from all operating liquefied natural gas (LNG) storage facilities in the U.S. The EIA-191L, Monthly Liquefied Natural Gas Storage Report, collects data on injections, withdrawals, total gas in storage, total capacity, and maximum delivery for operators of LNG facilities across 29 states. EIA uses these data to publish state-level monthly estimates on LNG storage in EIA's Natural Gas Monthly. The data are also used in several other EIA publications such as the Natural Gas Annual, Monthly Energy Review, and Short-Term Energy Outlook.

To account for unit non-response in the 2022 survey, we developed a donor-based imputation method. It creates imputation cells using the monthly activities for the LNG facilities and selects donors based on the donors' expected total gas and the recipient's reported total gas for January 2023. In this presentation, we will discuss data quality metrics and statistical methodologies used in EIA-191L, emphasizing statistical editing and imputation methods. 

Keywords

Energy Statistics

Clustering 

View Abstract 2987

Co-Author(s)

Preston McDowney, DOE/EIA/SMG
Pushpal Mukhopadhyay, U.S. Energy Information Administration
Hongbin Weng, Energy Information Administration

First Author

Makayla Cowles, Energy Information Administration

Presenting Author

Makayla Cowles, Energy Information Administration

Inference of effective reproduction number dynamics from wastewater data in small populations

The effective reproduction number is an important descriptor of an infectious disease epidemic. In small populations, we would ideally estimate the effective reproduction number using a Markov jump process (MJP) model of infectious disease spread, but in practice this is computationally challenging. We propose a computationally tractable approximation: the EI model, an MJP that tracks only latent and infectious individuals, in which the time-varying immigration rate into the E compartment equals the product of the proportion of susceptibles in the population and the transmission rate. We use an analogue of the central limit theorem for MJPs to approximate transition densities as normal, which makes Bayesian computation tractable. Using simulated pathogen RNA concentrations in wastewater, we demonstrate the advantages of our stochastic model over deterministic counterparts for estimating effective reproduction number dynamics. We apply the new model to estimate the effective reproduction number of SARS-CoV-2 in several college campus communities. 

Keywords

Bayesian Statistics

Infectious Disease Statistics

Stochastic Processes

Nowcasting

Epidemic Modeling

Infectious Disease Surveillance 

View Abstract 3356

Co-Author

Volodymyr Minin, University of California-Irvine

First Author

Isaac Goldstein, University of California, Irvine

Presenting Author

Isaac Goldstein, University of California, Irvine

Multifaceted Gender Identity Measurement As An Alternative to Forced-Choice Assessments

While suggesting specific question wording for surveys collecting data on gender identity and sexual orientation, a 2022 National Academies of Sciences, Engineering, and Medicine report on "Measuring Sex, Gender Identity, and Sexual Orientation" recognized limitations of "forced-choice measurement" using multiple-choice items and recommended further research into representing nonbinary gender identity and gender fluidity. Here we describe a framework for "Multifaceted Gender Identity Measurement" (M-GIM) asking respondents about the extent to which they agree or disagree with a series of gender-identity and sexual-orientation prompts, anticipating that identifiable clusters will emerge from patterns in ordinal responses without requiring individuals to self-classify into one of a limited number of categories. After highlighting the appeal of keeping such queries free of the implicit constraints and negative associations built into mutually exclusive response options, the presentation will discuss a conceptual framework for investigating disparities in quality-of-life outcomes across population subgroups characterized by similar gender-identity or sexual-orientation profiles. 

Keywords

gender identity

sexual orientation

ordinal data

cluster analysis

nonbinary

gender fluidity 

View Abstract 3371

Co-Author(s)

Hilary Aralis, University of California Los Angeles
Zichen Liu, Univ
Andrew Chuang, University of California, Los Angeles
Sung-Jae Lee
Donatello Telesca, UCLA School of Public Health

First Author

Thomas Belin, University of California-Los Angeles

Presenting Author

Thomas Belin, University of California-Los Angeles

Multilevel Regression and Poststratification with Population Margins: Application to HIV Inference

Multilevel Regression and Poststratification (MRP) has gained popularity in survey sampling for population inference. It involves two stages: the first fits a model regressing the outcome on poststratification variables; the second predicts the outcome using this model and aggregates the predictions to the population level. Existing methods focus on settings where the joint distribution of the population post-stratifiers is known. In practice, however, such information is often not available; instead, we are provided only with the margins of the post-stratifiers. Motivated by this challenge, we propose an adapted MRP that models both the survey outcome and the population sizes of the subgroups formed by the post-stratifiers. Simulations demonstrate that the adapted MRP outperforms existing methods, with smaller bias and better coverage of the 95% probability interval. We apply the adapted MRP to estimate the proportion with suppressed viral load and mean mental/physical health among people with HIV in NYC, using the 2020-21 wave of the Community Health Advisory & Information Network survey, in which data collection was disrupted by the COVID-19 pandemic. 
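
The second (poststratification) stage can be sketched as follows, assuming for illustration that the joint cell sizes are known; in the setting of the talk only the margins are observed, and the cell sizes would themselves be modeled. The cell predictions and sizes below are made up.

```python
import numpy as np

# Hypothetical poststratification cells (e.g., age x sex), with
# cell-level model predictions and population sizes.
pred = np.array([[0.62, 0.55],    # first-stage model prediction per cell
                 [0.48, 0.41]])
N    = np.array([[1200, 1500],    # population size per cell
                 [  90,  800]])

# MRP second stage: population estimate is the size-weighted
# average of the cell-level predictions.
estimate = (pred * N).sum() / N.sum()
print(round(estimate, 4))   # → 0.5404
```

When only margins of N are known, the adapted MRP replaces the fixed N with modeled cell sizes, and the same aggregation is applied to their posterior draws.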

Keywords

Multilevel Regression and Poststratification (MRP)

Bayesian

Survey Methods

COVID-19

HIV 

View Abstract 2190

Co-Author(s)

Maiko Yomogida, Columbia University
Angela Aidala, Columbia University
Andrew Gelman, Columbia University
Qixuan Chen, Columbia University

First Author

Amy Pitts, Columbia University

Presenting Author

Amy Pitts, Columbia University

Restricted Adaptive Probability-Based Latin Hypercube Design

The complexity of environmental sampling comes from the combination of varied inclusion probabilities, irregular sampling regions, space-filling requirements, and sampling cost constraints. This article proposes a restricted adaptive probability-based Latin hypercube design for environmental sampling. Benefiting from a first-stage pilot design, the approach largely reduces the computational burden of traditional adaptive sampling without network replacement, while still achieving the same effective control on the final sample size. Under the restricted adaptive probability-based Latin hypercube design, Horvitz-Thompson and Hansen-Hurwitz type estimators are biased. A modified Murthy-type unbiased estimator with Rao-Blackwell improvements is thus proposed. The proposed approach is shown to perform better than several well-known sampling methodologies. 

Keywords

Adaptive sampling

Environmental sampling

Latin Hypercube Design

Rao-Blackwell 

View Abstract 2543

First Author

Huijuan Li

Presenting Author

Huijuan Li

Simulating Low-cost Rotating Panel Designs for the Commercial Buildings Energy Consumption Survey

The U.S. Energy Information Administration's (EIA's) Commercial Buildings Energy Consumption Survey (CBECS) is the primary source of data on energy use in the U.S. commercial sector. The survey collects detailed information about commercial building characteristics, energy consuming equipment, and fuel use in commercial buildings. EIA has collected 11 cycles of CBECS data since 1979. Because CBECS data collection is complicated and increasingly expensive, EIA is researching options to potentially reduce costs for future CBECS cycles. Winkler et al. (2022) examined rotating panel designs recommended for the CBECS by the National Academy of Sciences (NAS 2012). In this study, we consider low-cost panel designs involving fewer panels and smaller samples within the panels. Simulation results suggest that a longitudinal CBECS, involving dependent interviewing (Ridolfo et al. 2022), may provide useful time-series data despite increased standard errors. 

Keywords

rotating panel surveys

complex surveys

simulation 

View Abstract 2308

Co-Author(s)

Janice Lent, U.S. Energy Information Administration
Michael Winkler, Energy Information Administration

First Author

Adebowale Sijuwade, U.S. Energy Information Administration

Presenting Author

Adebowale Sijuwade, U.S. Energy Information Administration

Spatial Smoothing and FDR Control in Climate

In this paper we explore FDR control in the climate setting, focusing on applications to the commonly used gridbox-by-gridbox simple linear regression technique. To properly evaluate simulation results, a modification of the standard hypothesis tests is proposed and developed, and the consequences of using the new tests are explored. To improve the power of the Benjamini-Hochberg method in this setting, a method for locally smoothing the data is proposed: it estimates local spatial covariances and uses the estimated covariances to create smoothing weights. Simulation results show that the smoothing method increases the number of true rejections and the sensitivity of FDR approaches, at the cost of increasing the probability of finding no rejections. The technique is applied to January sea surface temperature standardized anomalies with a simulated response. 
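
For reference, the standard Benjamini-Hochberg step (without the paper's spatial smoothing, which is specific to the talk) can be sketched as:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean rejection mask controlling the FDR at level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    sorted_p = p[order]
    # Find the largest k with p_(k) <= (k/m) * alpha, then reject
    # the hypotheses with the k smallest p-values.
    below = sorted_p <= (np.arange(1, m + 1) / m) * alpha
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.6, 0.9]
print(benjamini_hochberg(pvals, alpha=0.05))
```

In the gridbox setting, `pvals` would hold one regression p-value per gridbox; the proposed smoothing reweights the data before these p-values are computed.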

Keywords

FDR

Spatial

Climate

Smoothing

Multiple Hypotheses

Regression 

View Abstract 3283

Co-Author

Karen McKinnon, University of California, Los Angeles

First Author

Kyle McEvoy

Presenting Author

Kyle McEvoy

Statistical Behaviour of Mixed Crowds of Humans and Automata

The behavior of highly dense crowds in emergency situations has drawn researchers' attention in recent years. Moreover, the presence of automata among people in public buildings, malls, and similar spaces is now common. Automata are set to do specific tasks and may or may not carry an emergency plan. This work simulates the dynamics of an escape situation due to some kind of danger, in which a mixed population of humans and automata tries to get out of a room. We handle the pedestrians' dynamics by means of the Social Force Model (SFM). The automata, however, evaluate the cost of deviating from a preset route. The two models interact, yielding quite unknown scenarios. Our aim is to identify the scenarios most favorable to human safety. 

Keywords

crowd dynamics

social force model

emergency

automata

safety 

View Abstract 2574

Co-Author

Claudio Dorso, Instituto de Física de Buenos Aires, CONICET

First Author

Guillermo Frank

Presenting Author

Guillermo Frank

Supplementing a Non-probability Sample with a Probability Sample to Predict the Population Mean

We show how to analyze a non-probability sample (nps) with limited information from a small probability sample (ps). The most practical case is when the nps has auxiliary variables and the study variable but no survey weights, while the ps has known weights and auxiliary variables but no study variable. The two samples are taken from the same population, and the auxiliary variables are common to both the nps and the ps. A large non-probability sample can reduce cost but gives a biased estimator with small variance; the small probability sample can provide supplemental information. We then apply the estimated weights to fit a mixture model, enhancing the robustness of the results and enabling estimation of the finite population mean. Additionally, we present a method to enhance the efficiency of the Gibbs sampler. 

Keywords

adjusted survey weight

Gibbs sampling

logistic regression

missing data

propensity score

robust model 

View Abstract 2224

Co-Author

Balgobin Nandram, Worcester Polytechnic Institute

First Author

Zihang Xu

Presenting Author

Zihang Xu