Wednesday, Aug 7: 8:30 AM - 10:20 AM
5129
Contributed Speed
Oregon Convention Center
Room: CC-D135
Presentations
Time-stamped financial transaction data, which carry the most detailed information about price evolution, were coined "ultrahigh-frequency (UHF) data" by Engle (2000). A general partially observed Markov process framework with marked point observations, together with the related Bayesian inference (estimation and model selection) via stochastic filtering equations, was developed in Hu, Kuipers, and Zeng (2018a, 2018b). The general framework accommodates two features of UHF data: random trading times and trading noise. While several specific partially observed models, including the Black-Scholes (BS) and stochastic volatility models, have been studied, the partially observed Merton's model, which extends the BS model with a jump component representing the impact of good and bad news, has not been investigated. In this study, we fill that gap by proposing a partially observed Merton's model for ultrahigh-frequency financial data that accommodates both UHF-data features. The joint posterior distribution of the parameters of interest and the intrinsic value process (which follows the Merton model) is characterized by the normalized filtering equation. The Bayes factors of the partially observed models are then obtained for model selection.
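For reference, a sketch in standard notation (which may differ from the presentation's exact parameterization) of the Merton jump-diffusion for the latent log intrinsic value X_t:

  dX_t = \mu\,dt + \sigma\,dW_t + dJ_t, \qquad J_t = \sum_{i=1}^{N_t} Y_i,

where W_t is a standard Brownian motion, N_t is a Poisson process with intensity \lambda whose jumps represent news arrivals, and the jump sizes Y_i are i.i.d. (e.g., normal), capturing the impact of good and bad news. The observations are trade prices at random trading times, i.e., noisy marked-point-process samples of X_t.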
Keywords
Ultrahigh-frequency data
Partially observed Merton’s jump model
Normalized filtering equation
Bayes factors
Case prioritization has been employed by survey researchers as an adaptive survey design strategy for achieving design goals under fixed resources. One major use is to target subgroups of low-response-propensity cases and prioritize them in interviewers' workloads, without increasing data collection resources, in order to equalize response rates and reduce nonresponse bias. Although existing research has shown that case prioritization is effective, deciding which cases to prioritize in practical settings is not straightforward, especially in a dynamic prioritization process. Tourangeau et al. (2017) provided a clear notion of which cases are most worth pursuing, recommending a composite score that accounts for a case's response propensity, its design weight, and its effect on sample balance. Inspired by that research, this presentation provides a revised approach to identifying the most valuable cases in a panel survey setting with oversampling of subpopulations.
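As a hedged illustration (not the authors' exact formula), a composite priority score in the spirit of Tourangeau et al. (2017) might combine the three ingredients as follows; all variable names and the equal weighting are hypothetical:

# Illustrative composite priority score: low propensity, high design weight,
# and large contribution to sample balance all raise a case's priority.
priority_score <- function(phat, w, balance_gain) {
  stopifnot(all(phat > 0 & phat < 1))
  scale01 <- function(x) (x - min(x)) / (max(x) - min(x) + 1e-12)
  scale01(1 - phat) + scale01(w) + scale01(balance_gain)
}

set.seed(1)
phat <- runif(10, 0.05, 0.90)   # estimated response propensities
w    <- runif(10, 1, 20)        # design weights
bal  <- runif(10)               # expected improvement in sample balance
order(priority_score(phat, w, bal), decreasing = TRUE)  # cases ranked by priority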
Keywords
case prioritization
response propensity
dynamic adaptive design
nonresponse bias
The ongoing pandemic demonstrated that fast and accurate analysis of continually collected infectious disease surveillance data is crucial for situational awareness and policy making. Phylodynamic analysis uses genetic sequences of a pathogen to estimate changes in its genetic diversity in a population of interest, the effective population size, which under certain conditions can be connected to the number of infections in the population. Phylodynamics is an important tool because its methods utilize a data source in a way that is resilient to the ascertainment biases present in traditional surveillance data. Unfortunately, it takes weeks or months to sequence and obtain the sampled pathogen genome for use in such analyses. When the number of infections depends on the sampling frequency, the missing data results in underestimation of the effective population size. Here we present a method that affords reliable estimation of the effective population size trajectory closer to the time of data collection, allowing for policy decisions to be based on more recent data, with a better understanding of the limitations and uncertainties of such inference.
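For context, the textbook coalescent relation that links a genealogy to the effective population size (standard background, not the presentation's new method): with k sampled lineages present at time t, pairwise coalescent events occur at rate

  \lambda_k(t) = \binom{k}{2} \frac{1}{N_e(t)},

so intervals with rapid coalescence indicate small N_e(t); under certain epidemiological conditions, N_e(t) is proportional to the number of infections.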
Keywords
infectious disease dynamics
disease surveillance
Bayesian phylogenetics
genomic epidemiology
Bayesian nonparametrics
Total survey error (TSE) is the difference between a survey estimate and the true value of the corresponding population parameter. We use TSE to evaluate sampling and nonsampling errors in vaccination coverage estimates for children aged 19-35 months from CDC's National Immunization Survey-Child. We derive estimates of sampling-frame coverage error, nonresponse error, measurement error, and sampling error using such data sources as the National Health Interview Survey and immunization information systems. A Monte Carlo approach then combines estimated distributions of error components into a TSE distribution for the survey estimate of vaccination coverage. The mean of the TSE distribution provides an estimate of total bias in the survey estimator, and the 95% credible interval provides an interval within which total survey error falls with 0.95 probability. Our estimates of mean TSE for 4+ doses of DTaP (-4.0 percentage points), 1+ doses of MMR (-1.7 pp), Hep B birth dose (-3.3 pp), and the combined 7-vaccine series (-9.2 pp) indicate underestimates of vaccination coverage. Measurement error (or provider underascertainment) is consistently found to be the largest error component.
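A minimal sketch of the Monte Carlo combination step, assuming each error component's estimated distribution is summarized as a normal in percentage points; all means and standard deviations below are placeholders, not NIS-Child estimates:

# Combine error-component draws into a TSE distribution (inputs hypothetical).
set.seed(42)
B <- 1e5
frame_err   <- rnorm(B, mean = -0.5, sd = 0.4)  # sampling-frame coverage error (pp)
nonresp_err <- rnorm(B, mean = -0.8, sd = 0.6)  # nonresponse error (pp)
meas_err    <- rnorm(B, mean = -2.5, sd = 1.0)  # measurement error (pp)
samp_err    <- rnorm(B, mean =  0.0, sd = 0.7)  # sampling error (pp)
tse <- frame_err + nonresp_err + meas_err + samp_err
mean(tse)                        # estimated total bias
quantile(tse, c(0.025, 0.975))   # 95% credible interval for total survey error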
Keywords
Total survey error
Sampling-frame coverage error
Nonresponse error
Random digit dialing
The R package rpms provides an algorithm for producing design-consistent tree models of survey data. Tree models are an effective and flexible way to analyze survey data: they provide an easily interpretable model with automatic variable selection and interaction effects, which makes them popular with analysts working with survey data. Besides providing functions for estimating these models, the package includes a number of functions that operate on the tree-based objects to assist in understanding and analyzing survey data. We will demonstrate many of the tools in this package on data collected from a complex sample.
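A hedged usage sketch (the data and variable names below are fabricated; the call follows the package's documented rpms() interface, with one-sided formulas for the design variables):

# install.packages("rpms")
library(rpms)

set.seed(2)
mydata <- data.frame(y = rnorm(200), x1 = rnorm(200),
                     x2 = factor(sample(letters[1:4], 200, TRUE)),
                     wt = runif(200, 1, 10),
                     stratum = rep(1:4, each = 50),
                     psu = rep(1:20, each = 10))

# Design-consistent regression tree honoring weights, strata, and clusters.
fit <- rpms(rp_equ = y ~ x1 + x2, data = mydata,
            weights = ~wt, strata = ~stratum, clusters = ~psu)
fit   # prints the estimated tree and end-node estimates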
Keywords
Sample Design
Regression Tree
Machine Learning
Government Survey Data
Statistical Inference
Statistical Model
Media platforms allowed misinformation to propagate easily during the pandemic. In an attempt to quell misinformation, the CDC and the federal government attempted, sometimes successfully, to shut down debate. Yet, at times, these entities themselves spread false information regarding measures meant to prevent Covid. We will look briefly at two topics through government statements and scientific evidence: vaccines and masks. The scientific evidence behind the Covid vaccines was extremely strong, and seemingly difficult to overstate, but the CDC did indeed overstate their benefit while ignoring other factors regularly considered when recommending vaccines, such as age, side effects, and prior exposure. Regarding masks, the CDC pushed policies based on little or no scientific data, and ignored or even suppressed scientific data that called their efficacy into doubt.
Keywords
masks
vaccines
myocarditis
CDC
Covid
Contrastive dimension reduction methods have been used to uncover the low-dimensional structure that distinguishes one dataset (foreground) from another (background). However, current contrastive dimension reduction techniques do not estimate the number of unique dimensions, denoted d_c, within the foreground data; instead, they require this quantity as an input and proceed to estimate the dimensions themselves. In this paper, we formally define the contrastive dimension d_c and present what we believe to be the first estimator for this parameter. Under a linear model, we demonstrate the consistency of this estimator, establish a finite-sample error bound, and develop a hypothesis test for d_c = 0. This test is valuable for determining the suitability of a contrastive method for a given dataset. Furthermore, we provide a detailed analysis of our findings, supported by experiments on both synthetic and real-world datasets.
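For orientation, a sketch of the contrastive spectrum such an estimator examines, via contrastive PCA: eigenvalues of the foreground covariance minus a scaled background covariance. Thresholding this spectrum is one heuristic for d_c, not the paper's estimator; all data below are simulated placeholders:

set.seed(7)
n <- 500; p <- 10
# Background is pure noise; foreground adds signal in 2 directions.
bg <- matrix(rnorm(n * p), n, p)
fg <- matrix(rnorm(n * p), n, p)
fg[, 1:2] <- fg[, 1:2] + matrix(rnorm(n * 2, sd = 2), n, 2)

alpha <- 1                          # contrast strength
C <- cov(fg) - alpha * cov(bg)      # contrastive covariance
ev <- eigen(C, symmetric = TRUE)$values
round(ev, 2)   # ~2 large positive eigenvalues suggest d_c = 2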
Keywords
Dimension reduction
Contrastive dimension
Model checking, evaluation, and comparison in Small Area Estimation (SAE) with limited data are difficult. The generic problem: given a survey dataset D, what is a good metric for scoring a model M? Considering cluster sampling for national surveys, we would like to achieve two goals: 1) to score models based on their ability to estimate subpopulation prevalence at different administrative levels; and 2) to decide whether a given model M can be accepted (or not rejected, under a hypothesis testing framework). Focusing on a scenario with one level of spatial unit, we want to score models based on their ability to produce national estimates. We evaluate models using scoring rules such as mean squared error (MSE), the continuous ranked probability score (CRPS), and distribution-free scores from conformal prediction, based on leave-one-region-out, leave-one-cluster-out, or other data-splitting methods, using design-based estimates as a reference.
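A hedged sketch of one scoring step, leave-one-cluster-out CRPS via the scoringRules package; the model fit is stubbed out with stand-in posterior predictive draws, and all names and values are hypothetical:

library(scoringRules)

# y_hold: the held-out cluster's design-based estimate; draws: posterior
# predictive draws for it from model M fit without that cluster.
set.seed(3)
y_hold <- 0.31
draws  <- rnorm(4000, mean = 0.30, sd = 0.04)  # stand-in for model output

crps_sample(y = y_hold, dat = draws)   # lower is better
mean((draws - y_hold)^2)               # a crude MSE-style score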
Keywords
Cross validation
Small Area Estimation
Complex survey data
In January 2022, the U.S. Energy Information Administration (EIA) launched a new census survey to collect natural gas inventory storage data from all operating liquefied natural gas (LNG) storage facilities in the U.S. The EIA-191L, Monthly Liquefied Natural Gas Storage Report, collects data on injections, withdrawals, total gas in storage, total capacity, and maximum delivery for operators of LNG facilities across 29 states. EIA uses these data to publish state-level monthly estimates on LNG storage in EIA's Natural Gas Monthly. The data are also used in several other EIA publications such as the Natural Gas Annual, Monthly Energy Review, and Short-Term Energy Outlook.
To account for unit non-response in the 2022 survey, we developed a donor-based imputation method. It creates imputation cells using the monthly activities for the LNG facilities and selects donors based on the donors' expected total gas and the recipient's reported total gas for January 2023. In this presentation, we will discuss data quality metrics and statistical methodologies used in EIA-191L, emphasizing statistical editing and imputation methods.
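A minimal sketch of the donor-selection idea, assuming cells are formed from facility activity and the donor whose expected total gas is closest to the recipient's reported value is chosen; this is illustrative, not EIA's production code, and all column names are hypothetical:

# Nearest-donor imputation within an imputation cell.
impute_from_donor <- function(recipient_reported, donors) {
  # donors: data.frame with expected_total_gas and the value to be donated
  i <- which.min(abs(donors$expected_total_gas - recipient_reported))
  donors$total_gas[i]
}

donors <- data.frame(expected_total_gas = c(120, 85, 200),
                     total_gas          = c(118, 90, 195))
impute_from_donor(recipient_reported = 92, donors)  # picks the closest donor's value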
Keywords
Energy Statistics
Clustering
The effective reproduction number is an important descriptor of an infectious disease epidemic. In small populations, ideally we would estimate the effective reproduction number using a Markov jump process (MJP) model of the spread of infectious disease, but in practice this is computationally challenging. We propose a computationally tractable approximation: the EI model, an MJP that tracks only latent and infectious individuals, in which the time-varying immigration rate into the E compartment equals the product of the proportion of susceptibles in the population and the transmission rate. We use an analogue of the central limit theorem for MJPs to approximate transition densities as normal, which makes Bayesian computation tractable. Using simulated wastewater pathogen RNA concentration data, we demonstrate the advantages of our stochastic model over deterministic counterparts for estimating effective reproduction number dynamics. We apply our new model to estimate the effective reproduction number of SARS-CoV-2 in several college campus communities.
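In the abstract's notation, a sketch of the EI model's transitions (the progression and removal rates \gamma and \nu are standard compartmental notation assumed here, not taken from the abstract):

  \varnothing \xrightarrow{\;\lambda(t)\;} E, \quad \lambda(t) = \frac{S(t)}{N}\,\beta(t); \qquad E \xrightarrow{\;\gamma\;} I \xrightarrow{\;\nu\;} \varnothing,

with the CLT analogue approximating increments over short intervals as Gaussian, with mean and variance matched to the jump rates, yielding tractable transition densities for Bayesian inference.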
Keywords
Bayesian Statistics
Infectious Disease Statistics
Stochastic Processes
Nowcasting
Epidemic Modeling
Infectious Disease Surveillance
While suggesting specific question wording for surveys collecting data on gender identity and sexual orientation, a 2022 National Academies of Sciences, Engineering, and Medicine report on "Measuring Sex, Gender Identity, and Sexual Orientation" recognized limitations of "forced-choice measurement" using multiple-choice items and recommended further research into representing nonbinary gender identity and gender fluidity. Here we describe a framework for "Multifaceted Gender Identity Measurement" (M-GIM) asking respondents about the extent to which they agree or disagree with a series of gender-identity and sexual-orientation prompts, anticipating that identifiable clusters will emerge from patterns in ordinal responses without requiring individuals to self-classify into one of a limited number of categories. After highlighting the appeal of keeping such queries free of the implicit constraints and negative associations built into mutually-exclusive response options, the presentation will discuss a conceptual framework for investigating disparities in quality-of-life outcomes across population subgroups characterized by similar gender-identity or sexual-orientation profiles.
Keywords
gender identity
sexual orientation
ordinal data
cluster analysis
nonbinary
gender fluidity
Multilevel Regression and Poststratification (MRP) has gained popularity in survey sampling for population inference. It involves two stages: the first fits a model regressing the outcome on poststratification variables; the second predicts the outcome from this model and aggregates the predictions to the population. Existing methods focus on settings where the joint distribution of the population post-stratifiers is known. In practice, however, such information is often not available; instead, we are provided only with the margins of the post-stratifiers. Motivated by this challenge, we propose an adapted MRP that jointly models the survey outcome and the population sizes of the subgroups formed by the post-stratifiers. Simulations demonstrate that the adapted MRP outperforms existing methods, with smaller bias and better coverage rates for the 95% probability interval. We apply the adapted MRP to estimate the proportion with suppressed viral load and mean mental/physical health measures among people with HIV in NYC, using the 2020-21 wave of the Community Health Advisory & Information Network survey, in which data collection was disrupted by the COVID-19 pandemic.
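A minimal sketch of classical two-stage MRP (the known-joint-distribution case that the adapted method extends), using lme4; all data, variable names, and cell sizes are hypothetical:

library(lme4)

set.seed(5)
svy <- data.frame(y = rbinom(500, 1, 0.4),
                  age_grp = sample(c("18-34", "35-64", "65+"), 500, TRUE),
                  region  = sample(paste0("R", 1:5), 500, TRUE))

# Stage 1: multilevel model of the outcome on poststratifiers.
fit <- glmer(y ~ (1 | age_grp) + (1 | region), data = svy, family = binomial)

# Stage 2: predict for every poststratification cell, weight by known cell
# population sizes N, and aggregate.
cells <- expand.grid(age_grp = c("18-34", "35-64", "65+"),
                     region  = paste0("R", 1:5))
cells$N    <- sample(1000:5000, nrow(cells))
cells$phat <- predict(fit, newdata = cells, type = "response",
                      allow.new.levels = TRUE)
sum(cells$N * cells$phat) / sum(cells$N)   # MRP population estimate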
Keywords
Multilevel Regression and Poststratification (MRP)
Bayesian
Survey Methods
COVID-19
HIV
The complexity of environmental sampling comes from the combination of varied inclusion probabilities, irregular sampling regions, space-filling requirements, and sampling cost constraints. This article proposes a restricted adaptive probability-based Latin hypercube design for environmental sampling. Benefiting from a first-stage pilot design, the approach greatly reduces the computational burden of traditional adaptive sampling without network replacement, while still achieving the same effective control of the final sample size. Under the restricted adaptive probability-based Latin hypercube design, Horvitz-Thompson and Hansen-Hurwitz type estimators are biased. A modified Murthy-type unbiased estimator with Rao-Blackwell improvements is thus proposed. The proposed approach is shown to perform better than several well-known sampling methodologies.
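For reference, the classical Murthy (1957) estimator that the modified estimator builds on: for a sample s drawn with probability p(s),

  \hat{Y}_M = \frac{1}{p(s)} \sum_{i \in s} p(s \mid i)\, y_i,

where p(s | i) is the conditional probability of drawing s given that unit i was drawn first. Rao-Blackwellization conditions on the unordered sample (the minimal sufficient statistic) to reduce the variance of such order-dependent estimators.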
Keywords
Adaptive sampling
Environmental sampling
Latin Hypercube Design
Rao-Blackwell
The U.S. Energy Information Administration's (EIA's) Commercial Buildings Energy Consumption Survey (CBECS) is the primary source of data on energy use in the U.S. commercial sector. The survey collects detailed information about commercial building characteristics, energy consuming equipment, and fuel use in commercial buildings. EIA has collected 11 cycles of CBECS data since 1979. Because CBECS data collection is complicated and increasingly expensive, EIA is researching options to potentially reduce costs for future CBECS cycles. Winkler et al. (2022) examined rotating panel designs recommended for the CBECS by the National Academy of Sciences (NAS 2012). In this study, we consider low-cost panel designs involving fewer panels and smaller samples within the panels. Simulation results suggest that a longitudinal CBECS, involving dependent interviewing (Ridolfo et al. 2022), may provide useful time-series data despite increased standard errors.
Keywords
rotating panel surveys
complex surveys
simulation
In this paper we explore FDR control in the climate setting, focusing on applications to the commonly used gridbox-by-gridbox simple linear regression technique. To properly evaluate simulation results, a modification of the standard hypothesis tests is proposed and developed, and the consequences of using the new tests are explored. To improve the power of the Benjamini-Hochberg method in this setting, a method for locally smoothing the data is proposed: it estimates local spatial covariances and uses the estimated covariances to create smoothing weights. Simulation results show that the smoothing method increases the number of true rejections and the sensitivity of FDR approaches, at the cost of increasing the probability of finding no rejections. The technique is applied to January sea surface temperature standardized anomalies with a simulated response.
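A minimal sketch of the baseline gridbox-by-gridbox step the paper builds on: per-gridbox simple linear regression p-values followed by Benjamini-Hochberg. The spatial-covariance smoothing weights are the paper's addition and are not reproduced here; all data are simulated placeholders:

set.seed(11)
G <- 1000; n <- 40
x <- rnorm(n)                                    # common covariate
Y <- matrix(rnorm(G * n), G, n)                  # gridbox anomalies, null here
Y[1:50, ] <- Y[1:50, ] + outer(rep(0.8, 50), x)  # 50 gridboxes with true signal

# Slope p-value from a simple linear regression in each gridbox.
pvals <- apply(Y, 1, function(y) summary(lm(y ~ x))$coefficients[2, 4])
reject <- p.adjust(pvals, method = "BH") <= 0.10   # FDR control at 10%
sum(reject); sum(reject[1:50])                     # total and true rejections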
Keywords
FDR
Spatial
Climate
Smoothing
Multiple Hypotheses
Regression
The behavior of highly dense crowds in emergency situations has attracted the attention of researchers in recent years. Moreover, the presence of automata among people in public buildings, malls, and similar spaces is now common. Automata are set to perform specific tasks and may or may not carry an emergency plan. This work simulates the dynamics of an escape situation prompted by some kind of danger, in which a mixed population of humans and automata tries to get out of a room. We handle the pedestrians' dynamics by means of the Social Force Model (SFM); the automata, however, evaluate the cost of deviating from a preset route. The two models interact, yielding largely unexplored scenarios. Our aim is to identify the scenarios most favorable to human safety.
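For reference, the Helbing-Molnár Social Force Model equation governing each human pedestrian i (standard form; the automata instead follow the route-cost rule described above):

  m_i \frac{d\mathbf{v}_i}{dt} = m_i \frac{v_i^0 \mathbf{e}_i^0 - \mathbf{v}_i}{\tau_i} + \sum_{j \neq i} \mathbf{f}_{ij} + \sum_{W} \mathbf{f}_{iW},

where the first term relaxes pedestrian i toward its desired speed v_i^0 and direction \mathbf{e}_i^0 over relaxation time \tau_i, and \mathbf{f}_{ij}, \mathbf{f}_{iW} are repulsive and contact forces from other pedestrians and from walls.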
Keywords
crowd dynamics
social force model
emergency
automata
safety
We show how to analyze a non-probability sample (nps) with limited information from a small probability sample (ps). The most practical case is when the nps has auxiliary variables and the study variable but no survey weights, while the ps has known weights and auxiliary variables but no study variable. The two samples are taken from the same population, and the auxiliary variables are common to both the nps and the ps. A large non-probability sample can reduce cost but yields a biased estimator with small variance; the small probability sample provides supplemental information. We use the two samples to estimate adjusted survey weights for the nps via propensity scores, and we then apply these weights to fit a mixture model, enhancing the robustness of the results and enabling estimation of the finite population mean. Additionally, we present a method to enhance the efficiency of the Gibbs sampler.
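A minimal sketch of the propensity-score step implied by the keywords (logistic regression on the stacked samples to build adjusted pseudo-weights for the nps); this is one standard recipe, not necessarily the authors' exact method, it ignores the ps design weights for simplicity, and the mixture-model and Gibbs-sampler stages are not reproduced. All data and names are hypothetical:

set.seed(9)
nps <- data.frame(x1 = rnorm(2000), x2 = rnorm(2000))
nps$y <- 1 + 0.5 * nps$x1 + rnorm(2000)              # study variable (nps only)
ps  <- data.frame(x1 = rnorm(200), x2 = rnorm(200))  # small probability sample

# Stack the samples: z = 1 for nps membership, 0 for ps.
comb <- rbind(data.frame(z = 1, nps[, c("x1", "x2")]),
              data.frame(z = 0, ps))
ps_fit <- glm(z ~ x1 + x2, family = binomial, data = comb)
p_nps  <- predict(ps_fit, newdata = nps, type = "response")

nps$w_adj <- (1 - p_nps) / p_nps          # adjusted pseudo-weights for the nps
sum(nps$w_adj * nps$y) / sum(nps$w_adj)   # estimated finite population mean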
Keywords
adjusted survey weight
Gibbs sampling
logistic regression
missing data
propensity score
robust model