SPEED 1: Data Challenge I, Statistical Applications, & Statistics in Policy, Part 1

Xingpei Zhao Chair
University of South Carolina
 
Sunday, Aug 4: 2:00 PM - 3:50 PM
5001 
Contributed Speed 
Oregon Convention Center 
Room: CC-D135 

Presentations

A Calibrated Sensitivity Analysis for Weighted Disparity Decompositions

Disparities in health or well-being experienced by racial and sexual minority groups can be difficult to study under the traditional exposure-outcome paradigm of causal inference, since potential outcomes for variables such as race or sexual minority status are challenging to interpret. Decomposition analysis addresses this gap by considering causal impacts on a disparity via interventions to other, intervenable exposures that may play a mediating role in the disparity. However, decomposition analyses are conducted in observational settings and require untestable assumptions that rule out unmeasured confounders. Using the marginal sensitivity model, we develop a sensitivity analysis for unobserved confounding in studies of disparities. Under mild conditions, we use the percentile bootstrap to construct valid confidence intervals for disparities, and for causal effects on disparities, at given levels of confounding. We also explore amplifications that give insight into multiple confounding mechanisms. We illustrate our framework on a study examining disparities in youth suicide rates among sexual minorities using the Adolescent Brain Cognitive Development Study. 

Keywords

Causal Inference

Sensitivity Analysis

Causal Decompositions

Disparity

Weighting 

View Abstract 2663

Co-Author

Samuel Pimentel, University of California-Berkeley

First Author

Andy Shen

Presenting Author

Andy Shen

A Matter of Perception: An Analysis of Factors Influencing Perceived Crime in Selected U.S. Cities

Recently, there has been a trend toward integrating advanced technologies into law enforcement strategies, which has dramatically improved the effectiveness of police departments. Our study aims to build on this trend, using machine learning to analyze national and local datasets to explore factors that influence public perceptions of crime, attitudes toward law enforcement and the criminal justice system, and average crime rates. On the national scale, we intend to use datasets obtained from the FBI's Crime Data Explorer, the General Social Survey, and polling results from sources including Gallup and the Pew Research Center to explore the relationship between variables of interest and crime. We will then focus on several large cities with high crime rates, using geospatial crime data to identify key factors in crime rates and perceived crime. We hope that our analysis will shed light on previously unconsidered variables, such as the presence of public parks. Our findings will provide insights that researchers and policymakers can use to craft informed public safety legislation. 

View Abstract 3239

Co-Author(s)

Erick Jiang, Duke University
Ziyao Cui, Duke University
Nicholas Sortisio, Duke University
Cynthia Rudin, Duke University
Eric Chen, Duke University

First Author

Haiyan Wang

Presenting Author

Haiyan Wang

Applying an Alpha Spending Function to an Observational Study for the Purpose of Accelerating the Assessment of a Healthcare Quality Improvement Initiative

Alpha spending functions (ASFs) control the type I error of interim analyses in randomized controlled trials (RCTs), allowing study duration to be shortened. Propensity score matching (PSM) in observational settings mimics RCTs to limit bias, but such studies can still be lengthy. The purpose of this project is to assess the potential benefits of applying an ASF to a PSM study of a healthcare quality improvement initiative.
Blankenship (2022) published a 2-year PSM study assessing incident dialysis patients attending Transitional Care Units (TCUs). The clinical outcomes assessed were either not statistically significant or significant with a positive association for TCUs.
This project divided the study into 8 quarterly looks, each consisting of the patients and outcome information available at that time. PSM and statistical methods were performed as in the published analysis, with an ASF additionally applied.
Applying an ASF to the TCU study showed that key findings could have been established in as little as 6 months. This, along with additional previously unavailable insights, reveals the potential benefits of applying ASFs to observational program evaluations. Further research would inform whether and how to apply ASFs to observational studies. 
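The spending-function idea can be sketched concretely. Below is a minimal stdlib-Python illustration of a Lan-DeMets O'Brien-Fleming-type spending function evaluated at 8 quarterly looks; the abstract does not state which spending function was applied, so the functional form here is an assumption.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def obf_spending(t, alpha=0.05):
    """Lan-DeMets O'Brien-Fleming-type spending function: cumulative
    type I error allowed to be spent by information fraction t."""
    z = 1.959963984540054  # z_{alpha/2} for alpha = 0.05, two-sided
    return min(alpha, 2.0 * (1.0 - norm_cdf(z / math.sqrt(t))))

# Cumulative alpha available at each of 8 quarterly looks
fractions = [k / 8 for k in range(1, 9)]
spent = [obf_spending(t) for t in fractions]
```

At early looks almost no alpha is spent, so stopping early requires very strong evidence; the full 0.05 becomes available only at the final look.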

Keywords

Alpha spending function

Observational study

Real world evidence

Propensity score matching

Longitudinal analysis

Group sequential design 

Co-Author

Quentin Eloise, Fresenius Medical Care

First Author

Derek Blankenship, Fresenius Medical Care North America

Presenting Author

Derek Blankenship, Fresenius Medical Care North America

Associations Between Neurocognitive Patterns and Mean Diffusivity in Childhood Brain Tumor Survivors

Limited research exists on the inter-individual variability of neurocognitive patterns in childhood brain tumor survivors. Our study aims to assess cognitive patterns and their association with brain substructure mean diffusivity (MD), a measure of isotropic diffusion indicating microstructural injury. Using group-based multi-trajectory modeling with intelligence quotient (IQ), processing speed (PS), and working memory (WM), we identified two distinct neurocognitive patterns: a High-Group (55%) with sustained high cognitive performance and a Low-Group (45%) exhibiting decreasing performance over time. High-Group patients, less likely to undergo radiation, showed significantly lower MD in the hippocampus (β=-45, p=0.045), middle frontal gyrus (β=-43, p=0.02), thalamus (β=-35, p=0.02), inferior frontal gyrus (β=-34, p=0.01), and superior frontal gyrus (β=-35, p=0.02) compared to the Low-Group in unadjusted linear mixed models. Adjusting for age, sex, and interaction with time, High-Group patients exhibited a decreasing trend in MD compared to the Low-Group, suggesting greater microstructural injury progression in these regions in the low-performance compared to the high-performance group. 

Keywords

multi-trajectory modeling

mean diffusivity

Neurocognitive

Brain Tumor 

View Abstract 2857

Co-Author(s)

Ryan Oglesby, Johns Hopkins School of Medicine
Rachel Peterson, Johns Hopkins School of Medicine, Baltimore
Sahaja Acharya, Johns Hopkins School of Medicine

First Author

Chathurangi Pathiravasan, Johns Hopkins University School of Public Health

Presenting Author

Chathurangi Pathiravasan, Johns Hopkins University School of Public Health

Building Classification Models for Early Detection of Asthma in Children in the US Population

Asthma is the most prominent chronic disease in children and one of the most challenging ailments to diagnose in infants and preschoolers. Utilizing BRFSS (2011-2020) data, this study focuses on building an efficient data-driven predictive model based on 28 associated risk factors and identifying the factors that contribute most to childhood asthma, using the XGBoost (eXtreme Gradient Boosting) algorithm.
Respondents were randomly divided into training and testing samples. The grid-search mechanism was implemented to compute the optimum values of the hyper-parameters of the analytical XGBoost model. The fitted XGBoost model was compared with four competing ML models including support vector machine (SVM), random forest, LASSO regression, and GBM. The performance of all the models was compared using accuracy, AUC, precision, and recall.
XGBoost was found to be the best performing model with AUC 0.96, followed by SVM (AUC 0.93).
The analytical methodology of the model development can be instrumental in predicting different types of chronic lung diseases affecting people of all ages from multidimensional behavioral health survey data. 
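The grid-search mechanism described is, at its core, an exhaustive scan over hyper-parameter combinations, keeping the best-scoring one. A minimal sketch, where `score_fn` is a generic stand-in for a cross-validated metric and the parameter names and values are hypothetical:

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Exhaustive grid search: evaluate score_fn on every combination of
    hyper-parameter values and return the best combination and its score."""
    names = sorted(param_grid)
    best, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best, best_score = params, score
    return best, best_score
```

In practice the same loop is performed by library routines (e.g. scikit-learn's GridSearchCV), with `score_fn` computed by cross-validation.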

Keywords

Childhood Asthma

Predictive Modeling

BRFSS Data

XGBoost 

View Abstract 3563

Co-Author

AKM R. Bashar, Augustana College

First Author

Aditya Chakraborty, Eastern Virginia Medical School

Presenting Author

Aditya Chakraborty, Eastern Virginia Medical School

Classification of Stimuli Through Colorimetric & Impedance Responses of a Matrix of Polydiacetylenes

Colorimetric sensor arrays typically consist of a matrix of agents meant to provide unique color responses to target stimuli. Polydiacetylenes (PDAs) are suitable candidates for colorimetric sensor arrays in tamper identification settings, as they change color from visibly blue to red. PDAs may also elicit an electrochemical signature, visualized via electrochemical impedance spectroscopy (EIS), which can be utilized to identify molecular species such as volatile organic compounds (VOCs). Thus, a suitably calibrated matrix of known PDAs can be utilized to uniquely identify stimuli in several settings using the three-dimensional scalar color values from the colorimetric array and functional impedance measurements. Reliably and accurately identifying a range of stimuli from these metrics poses a challenging classification problem. We leveraged a suite of classification algorithms, supplemented by dimension reduction techniques to combine the scalar and functional responses, with uncertainty quantification incorporated through conformal prediction.
SNL is managed and operated by NTESS under DOE NNSA contract DE-NA0003525. SAND2024-01105A. 
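Conformal prediction, as referenced above, wraps any classifier's scores in prediction sets with a coverage guarantee. Below is a generic split-conformal sketch for classification, not the study's specific procedure; the class probabilities and miscoverage level are illustrative.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal threshold: the (1 - alpha) empirical quantile (with
    finite-sample correction) of calibration scores 1 - p(true class)."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    return scores[min(k, n) - 1]

def prediction_set(probs, qhat):
    """All classes whose nonconformity score stays within the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]
```

The returned set may contain several classes when the classifier is unsure, which is exactly the uncertainty quantification the abstract describes.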

Keywords

classification

conformal prediction

colorimetric

chemometrics 

View Abstract 3463

Co-Author(s)

Marie Tuft
Stephanie White, Sandia National Laboratories
Cody Corbin, Sandia National Laboratories

First Author

Marieke Sorge, Arizona State University

Presenting Author

Marieke Sorge, Arizona State University

Equal Opportunity in the Presence of Risk Distribution Differences

The proliferation of risk prediction models and algorithm-assisted decision making has spurred demands for methods to identify, quantify and reduce algorithm bias. One popular notion of fairness, equal opportunity, requires parity in true positive rate (TPR) across subgroups. While intuitively appealing, models constrained to satisfy the equal opportunity condition suffer from a loss in overall accuracy. Moreover, even models with perfect prediction can fail to satisfy equal opportunity if the risk distribution differs between subgroups. In the healthcare setting, the true risk distribution can be expected to differ between subgroups; therefore, fairness measures that account for these differences are needed. We investigate how TPR and related error-rate based metrics are expected to differ between subgroups in the presence of differences in underlying risk distribution and/or subgroup-dependent calibration bias through a combination of theoretical and numerical work. We further propose a modified formulation of the equal opportunity criterion and apply it to a risk prediction model implemented in a large, urban health system. 
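The equal opportunity criterion discussed above compares true positive rates across subgroups. A minimal sketch of the TPR-parity check, using hypothetical labels, binary predictions, and subgroup membership:

```python
def true_positive_rate(y_true, y_pred):
    """TPR (sensitivity): fraction of actual positives predicted positive."""
    positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(positives) / len(positives) if positives else float("nan")

def tpr_gap(y_true, y_pred, group):
    """Equal opportunity gap: difference in TPR between two subgroups."""
    tpr = {}
    for g in set(group):
        idx = [i for i, gi in enumerate(group) if gi == g]
        tpr[g] = true_positive_rate([y_true[i] for i in idx],
                                    [y_pred[i] for i in idx])
    a, b = sorted(tpr)
    return tpr[a] - tpr[b]

# Hypothetical labels, binary predictions, and subgroup membership
y_true = [1, 1, 0, 1, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]
group  = ["A", "A", "A", "A", "B", "B", "B", "B"]
```

A nonzero gap can arise even for a well-calibrated model when the subgroups' risk distributions differ, which is the phenomenon the abstract investigates.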

Keywords

algorithm bias

fairness

equal opportunity

true positive rate

calibration bias 

View Abstract 3419

Co-Author(s)

Kristin Linn, University of Pennsylvania
Jinbo Chen, University of Pennsylvania

First Author

Sarah Hegarty, University of Pennsylvania

Presenting Author

Sarah Hegarty, University of Pennsylvania

Estimating velocities of infectious disease spread through spatio-temporal log-Gaussian point processes

Understanding the spread of infectious diseases such as COVID-19 is crucial for informed decision-making and resource allocation. A critical component of disease behavior is the velocity with which disease spreads, defined as the rate of change for each location and time. In this paper, we propose a spatio-temporal modeling approach to determine the velocities of infectious disease spread. Our approach assumes that the locations and times of people infected can be considered as a spatio-temporal point pattern that arises as a realization of a spatio-temporal log-Gaussian Cox process. The intensity of this process is estimated using fast Bayesian inference by employing the integrated nested Laplace approximation (INLA) and the Stochastic Partial Differential Equations (SPDE) approaches. Velocities are then computed by using finite differences that approximate the derivatives of the intensity function. Finally, the directions and magnitudes of the velocities can be mapped at specific times to better examine disease spread across the region. We demonstrate our method by analyzing COVID-19 spread in Cali, Colombia, during the 2020-2021 pandemic. 
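The finite-difference step can be illustrated simply. The sketch below approximates a front speed from an intensity grid using central differences; the 1-D space grid and the specific velocity definition -(dL/dt)/(dL/dx) are simplifying assumptions relative to the paper's full spatio-temporal surface.

```python
def central_diff(values, step):
    """Finite differences of a 1-D sequence (central in the interior,
    one-sided at the two ends)."""
    n = len(values)
    out = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        out.append((values[hi] - values[lo]) / ((hi - lo) * step))
    return out

def front_velocity(intensity, dx, dt):
    """Approximate front speed v = -(dL/dt) / (dL/dx) on a space-by-time
    grid, where intensity[i][j] is the intensity L at location i, time j."""
    nx, nt = len(intensity), len(intensity[0])
    dldt = [central_diff(row, dt) for row in intensity]
    dldx = [central_diff([intensity[i][j] for i in range(nx)], dx)
            for j in range(nt)]
    return [[-dldt[i][j] / dldx[j][i] if abs(dldx[j][i]) > 1e-12
             else float("nan")
             for j in range(nt)] for i in range(nx)]
```

For a linearly translating intensity, this recovers the translation speed exactly; on estimated surfaces the same differencing is applied to the fitted intensity.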

Keywords

Bayesian inference

Log-Gaussian Cox processes

Spatio-temporal point patterns

Velocities 

View Abstract 2718

Co-Author(s)

Jorge Mateu, Department of Mathematics, University Jaume I, 12071 Castellón, Spain
Paula Moraga, King Abdullah University of Science and Technology

First Author

Fernando Rodriguez Avellaneda, King Abdullah University of Science and Technology

Presenting Author

Fernando Rodriguez Avellaneda, King Abdullah University of Science and Technology

Exploring the 2024 GSS

We will present the results of our analysis for the 2024 Data Challenge Expo. 

Keywords

General social survey

Statistical graphics 

View Abstract 1977

Co-Author

Bryanna Schaffer

First Author

Adam Loy, Carleton College

Presenting Author

Yichi Song, Carleton College

Exploring the Effects of Sleeping Behaviors and Restless Leg Syndrome on Fetal Growth

Despite plausible mechanisms, how gestational sleep behaviors and the development of sleep disorders influence the fetal growth trajectory has not been fully explored. Analyzing prospective cohort data from the multicenter NICHD Fetal Growth Studies - Singletons (2009-2013), the study included 2,458 pregnant women recruited at 8-13 gestational weeks and followed up to five times during pregnancy. The trajectory of estimated fetal weight (EFW) from 10-41 weeks of gestation was derived from three ultrasonographic measures. Linear mixed effect models were applied to model EFW in relation to self-reported sleep and restless legs syndrome (RLS) exposures, adjusting for age, race and ethnicity, education, parity, pre-pregnancy body mass index category, infant sex, and pre-pregnancy sleep-napping behavior. From enrollment to near delivery, pregnant women's sleep duration and nap frequency generally declined while the frequency of RLS symptoms increased. No significant differences in EFW were observed by sleep-napping group or by RLS status between 10-41 weeks. From week 30 onward, a small but statistically insignificant divergence in mean EFW was observed across sleep-napping groups. Overall, our data do not support an association between gestational sleep behaviors or RLS symptoms and fetal growth between 10-41 weeks of gestation in healthy pregnant women. 

Keywords

time-varying exposures

linear mixed effect models

estimated fetal weight

restless legs syndrome

gestational sleep

cohort study 

View Abstract 3574

Co-Author(s)

Samidha Shetty, Montana State University
Xiaoyue Niu, Penn State University
Stefanie Hinkle, University of Pennsylvania
Cuilin Zhang, National University of Singapore
Xiang Gao, Fudan University

First Author

Muzi Na, Pennsylvania State University

Presenting Author

Samidha SudhakarShetty, Pennsylvania State University

Investigating Predictors of NFL Running Back Production Via Traditional and Regression Tree Models

Highly drafted running backs are becoming increasingly rare in the NFL. Running back contracts are also not typically as lucrative or lengthy as those of other positions, due to injuries and lack of longevity in the league. As a potential instructional data set for comparing traditional regression techniques and tree modeling, a data set was compiled investigating NFL running back production from 1999 to 2021. Responses such as years with the original draft team and total career NFL rushing yards were investigated. Predictors included total college attempts, college conference, and overall draft pick number. Results revealed some anticipated predictor significance as well as some less anticipated predictor importance. Furthermore, tree modeling revealed interesting ranges of predictor variables that might be useful in evaluating college players and predicting NFL performance. 

Keywords

regression

trees

modeling

prediction

teaching

sports 

View Abstract 2958

First Author

Bob Downer, Grand Valley State University

Presenting Author

Bob Downer, Grand Valley State University

Life Before and After COVID-19: Changes in Relationships, Technology Use, and Well-being

Social distancing during the COVID-19 pandemic caused significant changes in everyday life. Leveraging the rich longitudinal data provided by the General Social Survey (GSS), we analyze the profound impacts of the COVID-19 pandemic on American life, specifically focusing on changes in interpersonal relationships, computer and technology usage, and subjective well-being. By analyzing trends before and during the pandemic, we identify significant shifts in the reported behaviors and psychological state of survey respondents. We present novel interactive visualizations to explore these trends. We also construct statistical models to assess the degree to which changes in subjective well-being can be explained by the changes in interpersonal relationships and technology use. These analyses inform societal responses to pandemics and could also guide policies and interventions to mitigate negative social effects brought on by COVID-19 and future pandemics. 

Keywords

statistical social science

data visualization

COVID-19 

View Abstract 2656

Co-Author

Maximilian Rohde

First Author

Jiangmei Xiong

Presenting Author

Caroline Birdrow, Vanderbilt University Medical Center

PPV-guided cost-effective chart review for model-agnostic RWD-based discovery

Electronic Health Record (EHR)-based association studies have been commonly used to identify the risk factors associated with patient clinical phenotypes. While EHR-derived phenotypes (i.e., surrogates) have recently been utilized, manual chart review remains the gold standard for ensuring the quality of the phenotypes. This process is notably time-consuming and costly. Therefore, determining the optimal subset size for chart review is of great importance. In this paper, we propose a PPV/NPV-guided method to determine the minimum sample size required for chart review, substantially reducing its cost. Subsequently, we introduce an augmented estimation procedure that effectively combines the chart reviews with the surrogates to achieve asymptotically unbiased and efficient estimators for EHR-based association studies. Our approach offers a cost-effective solution that ensures accuracy and efficiency in estimation without explicitly specifying the misclassification mechanism of the surrogates. The robustness of our method is validated through extensive simulation studies and the evaluation of real-world data, utilizing the Flatiron dataset as a benchmark for verification. 
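The minimum chart-review sample size can be illustrated with a simple normal-approximation calculation for a PPV confidence interval. This is a generic sketch, not the paper's PPV/NPV-guided procedure; the PPV guess and desired precision are illustrative inputs.

```python
import math

def z_quantile(conf):
    """Two-sided normal quantile via bisection on the erf-based CDF."""
    target = 1.0 - (1.0 - conf) / 2.0
    lo, hi = 0.0, 10.0
    while hi - lo > 1e-10:
        mid = (lo + hi) / 2.0
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def chart_review_sample_size(ppv_guess, half_width, conf=0.95):
    """Minimum number of surrogate-positive charts to review so the
    normal-approximation CI for PPV has the requested half-width."""
    z = z_quantile(conf)
    return math.ceil(z ** 2 * ppv_guess * (1.0 - ppv_guess) / half_width ** 2)
```

For example, a PPV near 0.8 estimated to within ±0.05 at 95% confidence needs roughly 250 reviewed charts; tighter precision drives the requirement up quadratically.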

Keywords

Electronic Health Record (EHR)

Association study

Outcome dependent sampling 

View Abstract 2967

Co-Author(s)

Yiwen Lu
Yong Chen, University of Pennsylvania, Perelman School of Medicine

First Author

Jiayi Tong

Presenting Author

Yiwen Lu

Predicting the Final Points of the Decathlon Based on the Results of the First Day Events Using Regression

The decathlon is a complex athletics discipline that combines ten track and field events held over the course of two days for male athletes. These ten events can be classified as "running," "jumping," and "throwing" events. A dataset was gathered from the competition results of all Olympic games and world athletics championships from 1984 to 2023 (n = 595), and it was divided into training (90%) and testing (10%) subsets.
The main objective of this study is to predict the final decathlon points standings using the five events of the first day. The original dataset was resampled with replacement 10,000 times into training and test sets; four regression models were then applied to determine which best fits the data, with root mean square error (RMSE) used as the performance criterion. The results showed that the final performance is highly influenced by two first-day events: the long jump (LJ) and shot put (SP). In addition, the multiple linear regression model was the best-performing model for predicting the final results, followed by partial least squares regression and quantile regression. 
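The resampling-and-comparison loop can be sketched as follows. To keep the sketch self-contained, a single predictor (standing in for the five first-day event scores) is fit by simple least squares, whereas the study compared multiple regression, PLS, and quantile regression; the data below are hypothetical.

```python
import math
import random

def fit_ols(xs, ys):
    """Closed-form simple linear regression; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def rmse(model, xs, ys):
    """Root mean square error of the fitted line on (xs, ys)."""
    slope, intercept = model
    errs = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(errs) / len(errs))

def bootstrap_rmse(xs, ys, x_test, y_test, n_boot=1000, seed=0):
    """Refit on bootstrap resamples of the training data; collect test RMSE."""
    rng = random.Random(seed)
    n, out = len(xs), []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        if len(set(bx)) < 2:  # degenerate resample; slope undefined
            continue
        out.append(rmse(fit_ols(bx, [ys[i] for i in idx]), x_test, y_test))
    return out
```

Comparing the distributions of test RMSE across candidate models, rather than a single split, is what makes the model ranking stable.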

Keywords

Decathlon

Multiple linear regression model

Partial least square regression

Quantile regression

Principal component regression model

Root mean square error (RMSE) 

View Abstract 3831

First Author

Abdelmonaem Jornaz, Park University

Presenting Author

Abdelmonaem Jornaz, Park University

Quantitative Decision Making (QDM) in Phase 2b studies

Optimizing the design and dosing regimen for Phase 2b dose-ranging studies is crucial to achieve an optimal benefit-risk balance for patients. It is also essential for ensuring a seamless and successful transition to Phase 3, with the identification of the right dose and optimal design. Our proposal involves the integration of a Bayesian-based Quantitative Decision-Making (QDM) framework into Phase 2b design, enabling the incorporation of informative prior information. This shift from p-value-based to probability-based decision-making enhances the efficiency of our decisions. In line with the Commit to Medicine Development (C2MD) milestone, our focus extends to measuring the conditional probability of success in Phase 3 studies. This measurement reflects the design's ability to de-risk later phases based on our current prior belief and pre-defined success criteria. To illustrate this process, we will present results from a simulation involving an HIV asset, demonstrating the effectiveness of our approach. 

Keywords

Bayesian

Decision making

Clinical trial

Adaptive 

View Abstract 3211

First Author

JIANJUN GAN, GlaxoSmithKline

Presenting Author

JIANJUN GAN, GlaxoSmithKline

Robust Weighted Random Forest with Imbalanced Classification Problems

In many applications, it is common to have numerous features with different levels of information and an imbalanced outcome ratio simultaneously. Weighted Random Forest (WRF) has been utilized to address the low signal-to-noise problem by assigning larger weights to informative features, prioritizing their inclusion in the feature subset considered at each node of individual trees. However, it has not been actively studied for class-imbalanced problems. In this work, we propose using RF variable importance in the area under the receiver operating characteristic curve (referred to as VI-AUC) as weights within WRF to account for class imbalance. Our simulation studies show that WRF with VI-AUC is superior and stable compared to other weighting methods, particularly in class-imbalanced scenarios with small sample sizes. We illustrate the method with an immunologic marker dataset from an HIV vaccine efficacy trial. 
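The weighting idea can be sketched with a marginal per-feature AUC standing in for the paper's VI-AUC (which is derived from a fitted random forest): features whose AUC is far from 0.5 receive larger sampling weights.

```python
def auc(scores, labels):
    """Mann-Whitney AUC: P(score of random positive > random negative),
    counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def vi_auc_weights(features, labels):
    """Feature-sampling weights from each feature's discriminative AUC;
    features[k] lists the values of feature k across subjects."""
    raw = [abs(auc(col, labels) - 0.5) for col in features]
    total = sum(raw) or 1.0
    return [r / total for r in raw]
```

Because the rank-based AUC is insensitive to the class ratio, AUC-derived weights remain informative under heavy imbalance, where accuracy-based importance can collapse.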

Keywords

Variable importance

Weighted random forest

Class imbalance

AUC 

View Abstract 2800

Co-Author

Yunbi Nam

First Author

Sunwoo Han, University of Miami

Presenting Author

Sunwoo Han, University of Miami

Super Learner Prediction and Variable Importance in Nursing Home Resident Suicidal Ideation

The super learner method combines the stacking algorithm and regression analysis to obtain weighted predictions from varied statistical strategies for model prediction. It is shown to perform no worse than any single prediction method as well as to provide consistent estimates. The targeted maximum likelihood estimation (TMLE) method was further introduced for variable importance analyses, in which super learner predictions were compared between the saturated model and reduced models when each variable was left out. Variable importance was profiled by corresponding p-values.

In the study of nursing home resident suicidal ideation, we first performed individual modeling for each of eleven parametric or non-parametric strategies. Cross-validation was implemented in each strategy, and aggregated estimates for each algorithm were obtained. We then computed composite estimates by ensembling all model-specific estimates, using mean squared error (MSE) to identify the best weights for the ensemble. The TMLE method was used to identify the ten most important risk factors associated with nursing home resident suicidal ideation. 
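The MSE-based weight selection for the ensemble can be sketched for two candidate learners; the grid over convex weights below is an illustrative stand-in for the super learner's constrained regression over all eleven algorithms.

```python
def ensemble_weight(pred_a, pred_b, y, steps=100):
    """Grid search for the convex weight w minimizing the MSE of the
    blended prediction w * pred_a + (1 - w) * pred_b."""
    best_w, best_mse = 0.0, float("inf")
    for k in range(steps + 1):
        w = k / steps
        mse = sum((w * a + (1 - w) * b - t) ** 2
                  for a, b, t in zip(pred_a, pred_b, y)) / len(y)
        if mse < best_mse:
            best_w, best_mse = w, mse
    return best_w, best_mse
```

Evaluating the blend on cross-validated (out-of-fold) predictions, as the super learner does, prevents the weights from rewarding overfit base learners.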

Keywords

Super learner

targeted maximum likelihood

risk analysis 

View Abstract 1996

Co-Author(s)

Shan Gao, University of Rochester
Yue Li, University of Rochester

First Author

Xueya Cai, University of Rochester

Presenting Author

Xueya Cai, University of Rochester

Understanding Post-Pandemic Mental Health via Statistical Learning

Understanding the prevalence and impact of anxiety and depressive symptoms is crucial in recognizing the global pandemic aftermath. This research will explore the mental health complexity in the post-pandemic landscape of 2022 by analyzing data from the General Social Survey (https://gss.norc.org/). The research will focus on studying two primary mental health measures, Generalized Anxiety Disorder (GAD) and Patient Health Questionnaire (PHQ) scores, to identify individual well-being subtleties that affect these scores and offer insights into the evolving mental health relationship. We propose a two-step approach for this research: first, employing machine learning algorithms to analyze and identify distinct subgroups and structure patterns within the individual mental health data, and second, based on findings from the first step, utilizing advanced statistical models to explore the joint impact of individual and societal factors on mental health. The ultimate goal of this study is to identify key factors influencing post-pandemic mental health and to provide actionable insights for policymakers, clinicians, and mental health practitioners. 

Keywords

Mental Health

Post Pandemic

GAD-2

PHQ-2

Data Science 

View Abstract 2437

Co-Author

Lu Lu

First Author

Xiankui Yang

Presenting Author

Xiankui Yang

Utilizing Bayesian Optimization for Efficient Dispersion Curve Feature Acquisition

The conventional usage of Nuclear Quadrupole Resonance (NQR) technology in detecting explosives holds promise for its application in narcotics detection. However, its advancement is hindered by the inefficiency in ascertaining excitation frequencies for new substances. Currently, experimental physicists rely on identifying features in a dispersion curve, which necessitates conducting experiments often spanning several months across a dense frequency range. Our research delves into the incorporation of Bayesian Optimization and Active Learning techniques, aiming to enable data-driven decision-making in the selection of frequency subsets for experimentation. This innovative approach seeks to expedite the process of dispersion curve feature acquisition, ultimately enhancing the utility of NQR technology in narcotics detection by overcoming a current bottleneck in the field. 

Keywords

Bayesian Optimization

Bayesian Active Learning

Experimental Design

Experimental Physics

Applied Machine Learning 

View Abstract 2322

Co-Author(s)

Natalie Klein, Los Alamos National Laboratory
Sinead Williamson

First Author

Amber Day

Presenting Author

Amber Day