Wednesday, Aug 6: 8:30 AM - 10:20 AM
0599
Topic-Contributed Paper Session
Music City Center
Room: CC-202B
This session presents the papers of the five winners of the joint SRMS/GSS/SSS Student Paper Competition. These five papers cover a wide range of topics relevant to survey, government, and social statistics.
Student Paper Competition Winners
Small area estimation
Weighting adjustments
Causal inference
Selection models
Applied: No
Main Sponsor
Survey Research Methods Section
Co-Sponsors
Government Statistics Section
Social Statistics Section
Presentations
Studying racial bias in policing is a critically important problem, but one that comes with a number of inherent difficulties due to the nature of the available data. In this manuscript we address several key issues in the causal analysis of racial bias in policing. First, we formalize race and place policing, the idea that individuals of one race are policed differently when they are in neighborhoods primarily made up of individuals of other races. We develop an estimand to study this question rigorously, show the assumptions necessary for causal identification, and develop sensitivity analyses to assess robustness to violations of key assumptions. Additionally, we investigate difficulties with existing estimands targeting racial bias in policing. We show for these estimands, and for the estimands developed in this manuscript, that estimation can benefit from incorporating mobility data into analyses. We apply these ideas to a study in New York City, where we find substantial racial bias, as well as race and place policing, and show that these findings are robust to large violations of untestable assumptions. We additionally show that mobility data can have a substantial impact on the resulting estimates, suggesting such data should be used whenever possible in subsequent studies.
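A minimal sketch of the denominator issue that motivates the use of mobility data, assuming invented counts and a hypothetical two-neighborhood setup (this is not the authors' code or data): naive stop rates divide by residential population, while mobility-adjusted rates divide by person-hours actually spent in each neighborhood, and the implied within-neighborhood rate ratios can differ sharply.

    import pandas as pd

    # Hypothetical stop counts by (race, neighborhood); all numbers invented.
    stops = pd.DataFrame({
        "race":         ["A", "A", "B", "B"],
        "neighborhood": ["N1", "N2", "N1", "N2"],
        "n_stops":      [50, 10, 30, 40],
    })

    # Naive denominator: residential population by (race, neighborhood).
    residents = pd.DataFrame({
        "race":         ["A", "A", "B", "B"],
        "neighborhood": ["N1", "N2", "N1", "N2"],
        "population":   [8000, 2000, 3000, 9000],
    })

    # Mobility-based denominator: person-hours spent in each neighborhood
    # by race (e.g., aggregated from mobility traces).
    exposure = pd.DataFrame({
        "race":         ["A", "A", "B", "B"],
        "neighborhood": ["N1", "N2", "N1", "N2"],
        "person_hours": [9.0e6, 4.0e6, 6.0e6, 8.0e6],
    })

    naive = stops.merge(residents, on=["race", "neighborhood"])
    naive["rate"] = naive["n_stops"] / naive["population"]

    adjusted = stops.merge(exposure, on=["race", "neighborhood"])
    adjusted["rate"] = adjusted["n_stops"] / adjusted["person_hours"]

    # Within-neighborhood rate ratios (race A vs. B) under each denominator.
    for label, d in [("naive", naive), ("mobility-adjusted", adjusted)]:
        r = d.pivot(index="neighborhood", columns="race", values="rate")
        print(label, (r["A"] / r["B"]).round(2).to_dict())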
Keywords
Causal inference
Mobility data
Racial discrimination
Race and place
Sensitivity analysis
Spatio-temporal area-level datasets play a critical role in official statistics, providing valuable insights for policy-making and regional planning. Accurate modeling and forecasting of these datasets can help policymakers develop informed strategies for future planning. Echo State Networks (ESNs) are efficient methods for capturing nonlinear temporal dynamics and generating forecasts. However, ESNs lack a direct mechanism to account for the neighborhood structure inherent in area-level data. Ignoring these spatial relationships can significantly compromise the accuracy and utility of forecasts. In this paper, we incorporate approximate graph spectral filters at the input stage of the ESN, thereby improving forecast accuracy while preserving the model's computational efficiency during training. We demonstrate the effectiveness of our approach using Eurostat's tourism occupancy dataset and show how it can support more informed decision-making in policy and planning contexts.
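A minimal sketch of the general recipe, assuming a GCN-style normalized-adjacency filter, a toy path-graph adjacency, and simulated seasonal series (the paper's exact filter, reservoir configuration, and data differ): the graph filter is applied to the inputs only, so training remains an ordinary ridge-regression readout.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy areal data: T monthly observations for N areas on a path graph.
    N, T = 6, 240
    A = np.diag(np.ones(N - 1), 1)
    A = A + A.T                                 # neighborhood adjacency matrix
    t = np.arange(T)
    Y = (np.sin(2 * np.pi * t / 12)[:, None]
         + rng.normal(0.0, 0.2, (T, N)))        # shared seasonality + noise

    # Approximate graph spectral filter: GCN-style normalized adjacency
    # S = D^{-1/2}(A + I)D^{-1/2}, applied to the inputs only, so training
    # cost stays that of an ordinary ESN.
    A_hat = A + np.eye(N)
    d = A_hat.sum(axis=1)
    S = A_hat / np.sqrt(np.outer(d, d))
    U = Y @ S                                   # spatially smoothed inputs

    # Standard leaky echo state network on the filtered inputs.
    n_res, alpha, lam = 100, 0.5, 1e-2
    W_in = rng.uniform(-0.5, 0.5, (n_res, N))
    W = rng.uniform(-0.5, 0.5, (n_res, n_res))
    W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # echo state property

    X = np.zeros((T, n_res))
    x = np.zeros(n_res)
    for i in range(T):
        x = (1 - alpha) * x + alpha * np.tanh(W_in @ U[i] + W @ x)
        X[i] = x

    # Ridge readout for one-step-ahead forecasts, after a washout period.
    washout = 24
    Xtr, Ytr = X[washout:-1], Y[washout + 1:]
    W_out = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(n_res), Xtr.T @ Ytr)
    forecast_next = X[-1] @ W_out     # forecast for the next, unobserved month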
Keywords
Areal data
Echo State Network
Graph Convolutional Network
Survey
Proxy pattern-mixture models (PPMM) have previously been proposed as a model-based framework for assessing the potential for nonignorable nonresponse in sample surveys and nonignorable selection in nonprobability samples. One defining feature of the PPMM is its single sensitivity parameter, φ, which ranges from 0 to 1 and governs the degree of departure from ignorability. While this sensitivity parameter is attractive in its simplicity, it may also be of interest to describe departures from ignorability in terms of how the odds of response (or selection) depend on the outcome being measured. In this paper, we re-express the PPMM as a selection model, using the known relationship between pattern-mixture models and selection models, in order to better understand the underlying assumptions of the PPMM and the implied effect of the outcome on nonresponse. The selection model that corresponds to the PPMM is a quadratic function of the survey outcome and proxy variable, and the magnitude of the effect depends on the value of the sensitivity parameter, φ (the missingness/selection mechanism), the differences in the proxy means and standard deviations for the respondent and nonrespondent populations, and the strength of the proxy, ρ. Large values of φ (beyond 0.5) often result in unrealistic selection mechanisms, and the corresponding selection model can be used to establish more realistic bounds on φ. We illustrate the results using data from the U.S. Census Household Pulse Survey.
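A minimal sketch of the PPMM sensitivity analysis whose parameter the selection-model re-expression helps to bound, assuming simulated respondent data, a made-up full-sample proxy mean, and the usual Andridge-Little form of the adjusted mean (our reading of the framework, not code from this paper): φ = 0 recovers the MAR regression estimator and φ = 1 the inverse-regression extreme.

    import numpy as np

    rng = np.random.default_rng(1)

    # Simulated respondent data: outcome y and proxy x; the full-sample
    # proxy mean below is made up for illustration.
    n_r = 500
    x_r = rng.normal(0.0, 1.0, n_r)
    y_r = 0.8 * x_r + rng.normal(0.0, 0.6, n_r)
    xbar_full = 0.3          # proxy mean over respondents + nonrespondents

    rho = np.corrcoef(x_r, y_r)[0, 1]          # strength of the proxy
    s_y = y_r.std(ddof=1)
    s_x = x_r.std(ddof=1)

    def ppmm_mean(phi):
        """PPMM-adjusted outcome mean for sensitivity parameter phi in [0, 1]."""
        g = (phi + (1.0 - phi) * rho) / (phi * rho + (1.0 - phi))
        return y_r.mean() + g * (s_y / s_x) * (xbar_full - x_r.mean())

    # Trace the estimate over phi; values beyond 0.5 may correspond to the
    # unrealistic selection mechanisms the paper's bounds rule out.
    for phi in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"phi={phi:.2f}  mu_hat={ppmm_mean(phi):.3f}")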
Keywords
Nonignorable nonresponse
Nonignorable selection
Proxy pattern-mixture models
Nonprobability samples
Sensitivity analysis
Nonresponse and selection bias
We advance the theory of the parametric bootstrap for constructing highly efficient empirical best (EB) prediction intervals for small area means. The coverage error of such a prediction interval is of order O(m^{-3/2}), where m is the number of small areas pooled under a linear mixed normal model. In the context of an area-level model where the random effects follow a known non-normal distribution, except possibly for unknown hyperparameters, we show analytically that the order of coverage error of the empirical best linear (EBL) prediction interval remains the same when normality of the random effects is relaxed to the existence of a pivot for suitably standardized random effects with known hyperparameters. Recognizing the challenge of establishing the existence of a pivot, we develop a simple moment-based method for establishing its non-existence. We show that the existing parametric bootstrap EBL prediction interval fails to achieve the desired order of coverage error, O(m^{-3/2}), in the absence of a pivot. We obtain the surprising result that the O(m^{-1}) term is always positive under certain conditions, indicating possible overcoverage of the existing parametric bootstrap EBL prediction interval. More generally, we show analytically, for the first time, that the coverage problem can be corrected by a suitably devised double parametric bootstrap. Our Monte Carlo simulations show that our proposed single bootstrap method performs reasonably well when compared to rival methods.
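For orientation, a minimal sketch of the baseline single parametric bootstrap for a normal area-level (Fay-Herriot) model, assuming a crude moment estimator of the variance component and simulated data (this is not the authors' double-bootstrap correction): the bootstrap calibrates the quantiles of the standardized prediction error.

    import numpy as np

    rng = np.random.default_rng(2)

    # Toy Fay-Herriot data: m areas, direct estimates y_i = theta_i + e_i with
    # known sampling variances D_i, and theta_i = mu + v_i, v_i ~ N(0, A).
    m = 30
    D = rng.uniform(0.5, 2.0, m)
    mu_true, A_true = 5.0, 1.0
    theta = mu_true + rng.normal(0.0, np.sqrt(A_true), m)
    y = theta + rng.normal(0.0, np.sqrt(D), m)

    def fit(y, D):
        # Crude moment estimator of A (floored away from zero) and GLS mean.
        A = max(1e-6, np.var(y, ddof=1) - D.mean())
        w = 1.0 / (A + D)
        return np.sum(w * y) / np.sum(w), A

    mu_hat, A_hat = fit(y, D)
    gamma = A_hat / (A_hat + D)
    theta_eb = gamma * y + (1 - gamma) * mu_hat     # EB predictor of theta_i
    g1 = gamma * D                                  # leading MSE term

    # Single parametric bootstrap: calibrate quantiles of the standardized
    # prediction error t_i = (theta_i - theta_eb_i) / sqrt(g1_i).
    B = 2000
    t_boot = np.empty((B, m))
    for b in range(B):
        th_b = mu_hat + rng.normal(0.0, np.sqrt(A_hat), m)
        y_b = th_b + rng.normal(0.0, np.sqrt(D), m)
        mu_b, A_b = fit(y_b, D)
        gam_b = A_b / (A_b + D)
        pred_b = gam_b * y_b + (1 - gam_b) * mu_b
        t_boot[b] = (th_b - pred_b) / np.sqrt(gam_b * D)

    q_lo, q_hi = np.quantile(t_boot, [0.025, 0.975], axis=0)
    lower = theta_eb + q_lo * np.sqrt(g1)           # 95% bootstrap-calibrated
    upper = theta_eb + q_hi * np.sqrt(g1)           # prediction interval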
Keywords
Small area estimation
empirical Bayes
linear mixed model
best linear predictor
Co-Author(s)
Yuting Chen, University of Maryland, College Park
Masayo Hirose, Kyushu University, Institute of Mathematics for Industry
Partha Lahiri, University of Maryland, College Park
Speaker
Yuting Chen, University of Maryland, College Park
While national biobanks are essential for advancing medical research, their nonprobability sampling designs limit their representativeness of the target population. This paper proposes a method that leverages high-quality national surveys to create synthetic sampling weights for non-probability cohort studies, aiming to improve representativeness. Specifically, we focus on deriving more accurate base weights, which enhance calibration by meeting population constraints, and on automating data-supported selection of cross-tabulations for calibration. This approach combines a pseudo-design-based model with a novel Last-In-First-Out criterion, enhancing both the accuracy and stability of estimates. Extensive simulations demonstrate that our method, named nps-lifo-rake, reduces bias, improves efficiency, and strengthens inference compared to existing approaches. We apply the proposed method to the All of Us Research Program, leveraging data from the National Health Interview Survey 2020 and the American Community Survey 2022, and compare the resulting prevalence estimates for common phenotypes against national benchmarks. The results underscore our method's ability to effectively reduce selection bias in non-probability samples, offering a valuable tool for enhancing biobank representativeness. Using the developed sampling weights for the All of Us Research Program, we can estimate the United States population prevalence for phenotypes and genotypes not captured by national probability studies.
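To make the calibration step concrete, a minimal sketch of classical raking (iterative proportional fitting) to made-up population margins, assuming uniform base weights and hypothetical calibration variables (it illustrates only the generic raking component, not the nps-lifo-rake base-weight construction or its Last-In-First-Out cross-tabulation selection).

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(3)

    # Hypothetical non-probability cohort with two calibration variables;
    # the skew relative to the margins below mimics selection bias.
    n = 1000
    df = pd.DataFrame({
        "age_grp": rng.choice(["18-39", "40-64", "65+"], n, p=[0.5, 0.35, 0.15]),
        "sex":     rng.choice(["F", "M"], n, p=[0.6, 0.4]),
    })
    df["w"] = 1.0   # uniform base weights; the paper derives better ones

    # Made-up population margins (in practice, e.g., from the ACS).
    pop_total = 250e6
    margins = {
        "age_grp": {"18-39": 0.38, "40-64": 0.42, "65+": 0.20},
        "sex":     {"F": 0.51, "M": 0.49},
    }

    # Classical raking: cycle through the margins until the weighted
    # totals match the population constraints.
    for _ in range(50):
        for var, targets in margins.items():
            cell_totals = df.groupby(var)["w"].sum()
            for level, share in targets.items():
                df.loc[df[var] == level, "w"] *= share * pop_total / cell_totals[level]

    # Weighted shares now reproduce the margins.
    print(df.groupby("age_grp")["w"].sum() / pop_total)
    print(df.groupby("sex")["w"].sum() / pop_total)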
Keywords
Calibration Weighting
Generalized Raking
Nested Propensity Score
Non-Probability
Prevalence
Sampling Design