Sunday, Aug 4: 2:00 PM - 3:50 PM
5001
Contributed Speed
Oregon Convention Center
Room: CC-D135
Presentations
Disparities in health or well-being experienced by racial and sexual minority groups can be difficult to study under the traditional exposure-outcome paradigm of causal inference, since potential outcomes for variables such as race or sexual minority status are challenging to interpret. Decomposition analysis addresses this gap by considering causal effects on a disparity via interventions on other, intervenable exposures that may play a mediating role in the disparity. However, decomposition analyses are conducted in observational settings and therefore require untestable assumptions that rule out unmeasured confounders. Using the marginal sensitivity model, we develop a sensitivity analysis for unobserved confounding in studies of disparities. Under mild conditions, we use the percentile bootstrap to construct confidence intervals for disparities, and for causal effects on disparities, that remain valid at given levels of confounding. We also explore amplifications that give insight into multiple confounding mechanisms. We illustrate our framework on a study examining disparities in youth suicide rates among sexual minorities using the Adolescent Brain Cognitive Development Study.
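As a rough illustration of the percentile-bootstrap step, the sketch below bounds an inverse-propensity-weighted mean when each weight may deviate from 1/e(x) by a factor of at most Λ (one common parameterization of the marginal sensitivity model), then bootstraps those bounds into a confidence interval. The target estimand, the extremization device, and all variable names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def msm_interval(y, e, Lam):
    """Range of the weighted mean of y when each inverse-propensity
    weight 1/e may be off by a factor of at most Lam. The extrema of a
    weighted mean over box-constrained weights occur at a cutpoint in
    sorted-y order, so scanning all cutpoints finds them exactly."""
    lo_w, hi_w = (1 / e) / Lam, (1 / e) * Lam
    order = np.argsort(y)
    y_s, lo_s, hi_s = y[order], lo_w[order], hi_w[order]
    vals = []
    for k in range(len(y) + 1):
        for left, right in ((hi_s, lo_s), (lo_s, hi_s)):
            w = np.concatenate([left[:k], right[k:]])
            vals.append(np.sum(w * y_s) / np.sum(w))
    return min(vals), max(vals)

def percentile_bootstrap_ci(y, e, Lam, B=1000, alpha=0.05):
    """Percentile-bootstrap CI for the partially identified mean."""
    n, lo, hi = len(y), [], []
    for _ in range(B):
        idx = rng.integers(0, n, n)
        l, h = msm_interval(y[idx], e[idx], Lam)
        lo.append(l)
        hi.append(h)
    return np.quantile(lo, alpha / 2), np.quantile(hi, 1 - alpha / 2)

# toy data: outcomes and propensity scores for one exposure group
y = rng.normal(0, 1, 500)
e = rng.uniform(0.2, 0.8, 500)
print(percentile_bootstrap_ci(y, e, Lam=1.5))
```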
Keywords
Causal Inference
Sensitivity Analysis
Causal Decompositions
Disparity
Weighting
Recently, there has been a trend toward integrating advanced technologies into law enforcement strategies, which has markedly improved the effectiveness of police departments. Our study builds on this trend, using machine learning to analyze national and local datasets to explore factors that influence public perceptions of crime, attitudes toward law enforcement and the criminal justice system, and average crime rates. On the national scale, we intend to use datasets obtained from the FBI's Crime Data Explorer, the General Social Survey, and polling results from sources including Gallup and the Pew Research Center to explore the relationship between variables of interest and crime. We will then focus on several large cities with high crime rates, using geospatial crime data to identify key factors in crime rates and perceived crime. We hope that our analysis will shed light on previously unconsidered variables, such as the presence of public parks. Our findings will provide insights that researchers and policymakers can use to craft informed public safety legislation.
Alpha spending functions (ASFs) for randomized controlled trials (RCTs) control the type I error of interim analyses intended to shorten study duration. Propensity score matching (PSM) in observational settings mimics RCTs to limit bias, but such studies can still be lengthy. The purpose of this project is to assess the potential benefits of applying an ASF to a PSM study of a healthcare quality improvement initiative.
Blankenship (2022) published a two-year PSM study assessing incident dialysis patients attending Transitional Care Units (TCUs). The clinical outcomes assessed were either not statistically significant or significant with a positive association favoring TCUs.
This project divided the study into eight quarterly looks, each consisting of the patients and outcome information available at that time. PSM and statistical methods were performed as in the published analysis, with an ASF additionally applied.
Applying an ASF to the TCU study demonstrated that key findings could have been established in as little as six months. This, along with additional insights not previously available, reveals potential benefits of applying ASFs to observational program evaluations. Further research would inform if and how ASFs should be applied to observational studies.
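For readers unfamiliar with ASFs, the sketch below computes a Lan-DeMets O'Brien-Fleming-type spending schedule for eight equally spaced quarterly looks. The spending function actually applied in this project is not specified in the abstract, so this particular choice is an assumption for illustration.

```python
from scipy.stats import norm

def of_spend(t, alpha=0.05):
    """O'Brien-Fleming-type spending: alpha(t) = 2 - 2*Phi(z_{a/2}/sqrt(t)),
    where t is the information fraction in (0, 1]."""
    return 2 * (1 - norm.cdf(norm.ppf(1 - alpha / 2) / t ** 0.5))

looks = [k / 8 for k in range(1, 9)]        # 8 quarterly looks over 2 years
cum = [of_spend(t) for t in looks]
inc = [cum[0]] + [b - a for a, b in zip(cum, cum[1:])]
for t, c, i in zip(looks, cum, inc):
    print(f"information {t:.3f}: cumulative alpha {c:.5f}, spent this look {i:.5f}")
```

This schedule spends almost no alpha at early looks and nearly the full 0.05 at the final analysis, which is why early stopping under it requires very strong evidence.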
Keywords
Alpha spending function
Observational study
Real world evidence
Propensity score matching
Longitudinal analysis
Group sequential design
Limited research exists on the inter-individual variability of neurocognitive patterns in childhood brain tumor survivors. Our study aims to assess cognitive patterns and their association with brain substructure mean diffusivity (MD), a measure of isotropic diffusion indicating microstructural injury. Using group-based multi-trajectory modeling with intelligence quotient (IQ), processing speed (PS), and working memory (WM), we identified two distinct neurocognitive patterns: a High-Group (55%) with sustained high cognitive performance and a Low-Group (45%) exhibiting decreasing performance over time. High-Group patients, who were less likely to have undergone radiation, showed significantly lower MD in the hippocampus (β=-45, p=0.045), middle frontal gyrus (β=-43, p=0.02), thalamus (β=-35, p=0.02), inferior frontal gyrus (β=-34, p=0.01), and superior frontal gyrus (β=-35, p=0.02) compared with the Low-Group in unadjusted linear mixed models. After adjusting for age, sex, and an interaction with time, High-Group patients exhibited a decreasing trend in MD relative to the Low-Group, suggesting greater progression of microstructural injury in these regions in the low-performance group compared with the high-performance group.
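The adjusted comparison described above can be expressed as a linear mixed model with a group-by-time interaction. The sketch below, with synthetic data and made-up variable names, shows that model form only; it is not the study's code or data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n, visits = 60, 3
df = pd.DataFrame({
    "patient_id": np.repeat(np.arange(n), visits),
    "time": np.tile(np.arange(visits), n),
    "group": np.repeat(rng.choice(["High", "Low"], n), visits),
    "age": np.repeat(rng.uniform(6, 18, n), visits),
    "sex": np.repeat(rng.choice(["M", "F"], n), visits),
})
# synthetic MD values: Low group higher at baseline and rising over time
df["md"] = (800 + 40 * (df["group"] == "Low")
            + 8 * df["time"] * (df["group"] == "Low")
            + rng.normal(0, 20, len(df)))

# random intercept per patient; the group x time term tests
# differential MD progression between trajectory groups
model = smf.mixedlm("md ~ group * time + age + sex", df, groups=df["patient_id"])
print(model.fit().summary())
```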
Keywords
multi-trajectory modeling
mean diffusivity
Neurocognitive
Brain Tumor
Asthma is the most common chronic disease in children and one of the most challenging ailments to diagnose in infants and preschoolers. Utilizing BRFSS (2011-2020) data, this study focuses on building an efficient, data-driven predictive model based on 28 associated risk factors and identifying the factors that contribute most to childhood asthma using the XGBoost (eXtreme Gradient Boosting) algorithm.
Respondents were randomly divided into training and testing samples. A grid search was implemented to find optimal values of the hyperparameters of the XGBoost model. The fitted XGBoost model was compared with four competing machine learning models: support vector machine (SVM), random forest, LASSO regression, and gradient boosting machine (GBM). The performance of all models was compared using accuracy, AUC, precision, and recall.
XGBoost was found to be the best performing model with AUC 0.96, followed by SVM (AUC 0.93).
The analytical methodology of the model development can be instrumental in predicting different types of chronic lung diseases affecting people of all ages from multidimensional behavioral health survey data.
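A minimal sketch of the tuning-and-comparison pipeline on synthetic data follows; the BRFSS variables and the exact parameter grids are not reproduced here and the values below are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

# synthetic stand-in for 28 BRFSS risk factors with an imbalanced outcome
X, y = make_classification(n_samples=5000, n_features=28,
                           weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# grid search over a few key XGBoost hyperparameters, scored by AUC
grid = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid={"max_depth": [3, 5, 7],
                "learning_rate": [0.05, 0.1, 0.3],
                "n_estimators": [100, 300]},
    scoring="roc_auc", cv=5)
grid.fit(X_tr, y_tr)

proba = grid.predict_proba(X_te)[:, 1]
pred = grid.predict(X_te)
print("AUC:", roc_auc_score(y_te, proba))
print("accuracy:", accuracy_score(y_te, pred),
      "precision:", precision_score(y_te, pred),
      "recall:", recall_score(y_te, pred))
```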
Keywords
Childhood Asthma
Predictive Modeling
BRFSS Data
XGBoost
Colorimetric sensor arrays typically consist of a matrix of agents designed to produce unique color responses to target stimuli. Polydiacetylenes (PDAs) are suitable candidates for colorimetric sensor arrays in tamper-identification settings, as they change color from a visible blue to red. PDAs may also produce an electrochemical signature, visualized via electrochemical impedance spectroscopy (EIS), which can be used to identify molecular species such as volatile organic compounds (VOCs). Thus, a suitably calibrated matrix of known PDAs can uniquely identify stimuli in several settings using the three-dimensional scalar color values from the colorimetric array together with functional impedance measurements. Reliably and accurately identifying a range of stimuli from these measurements poses a challenging classification problem. We classify stimuli using a suite of classification algorithms, supplemented by dimension-reduction techniques that combine the scalar and functional responses, with uncertainty quantification incorporated through conformal prediction.
SNL is managed and operated by NTESS under DOE NNSA contract DE-NA0003525. SAND2024-01105A.
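The conformal step can be illustrated with a standard split-conformal construction of multi-class prediction sets. The score function, classifier, and synthetic data below are assumptions for illustration, not the study's pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# synthetic stand-in for a multi-class stimulus-identification task
X, y = make_classification(n_samples=3000, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)
X_fit, X_rest, y_fit, y_rest = train_test_split(X, y, test_size=0.5,
                                                random_state=0)
X_cal, X_new, y_cal, y_new = train_test_split(X_rest, y_rest, test_size=0.5,
                                              random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_fit, y_fit)

# nonconformity score: 1 - predicted probability of the true class
cal_scores = 1 - clf.predict_proba(X_cal)[np.arange(len(y_cal)), y_cal]
alpha = 0.1
n_cal = len(cal_scores)
q = np.quantile(cal_scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)

# prediction set: every class whose score falls below the threshold
sets = clf.predict_proba(X_new) >= 1 - q
coverage = sets[np.arange(len(y_new)), y_new].mean()
print(f"empirical coverage: {coverage:.3f}  avg set size: {sets.sum(1).mean():.2f}")
```

Split conformal guarantees roughly 1 - α marginal coverage regardless of the underlying classifier, which is what makes it attractive for uncertainty quantification in this setting.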
Keywords
classification
conformal prediction
colorimetric
chemometrics
The proliferation of risk prediction models and algorithm-assisted decision making has spurred demands for methods to identify, quantify and reduce algorithm bias. One popular notion of fairness, equal opportunity, requires parity in true positive rate (TPR) across subgroups. While intuitively appealing, models constrained to satisfy the equal opportunity condition suffer from a loss in overall accuracy. Moreover, even models with perfect prediction can fail to satisfy equal opportunity if the risk distribution differs between subgroups. In the healthcare setting, the true risk distribution can be expected to differ between subgroups; therefore, fairness measures that account for these differences are needed. We investigate how TPR and related error-rate based metrics are expected to differ between subgroups in the presence of differences in underlying risk distribution and/or subgroup-dependent calibration bias through a combination of theoretical and numerical work. We further propose a modified formulation of the equal opportunity criterion and apply it to a risk prediction model implemented in a large, urban health system.
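For concreteness, the equal opportunity condition can be checked by comparing true positive rates across subgroups. The sketch below, with toy data and hypothetical names, illustrates the quantity in question; it does not implement the authors' modified criterion.

```python
import numpy as np

def tpr_by_group(y_true, y_pred, group):
    """True positive rate within each subgroup; equal opportunity
    asks that these rates be (approximately) equal."""
    return {g: y_pred[(group == g) & (y_true == 1)].mean()
            for g in np.unique(group)}

# toy illustration with a subgroup-dependent calibration bias
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)
g = rng.integers(0, 2, 1000)
risk = 0.6 * y + 0.1 * g + rng.normal(0, 0.3, 1000)  # shifted for group 1
pred = (risk > 0.5).astype(int)

rates = tpr_by_group(y, pred, g)
print(rates, "TPR gap:", abs(rates[0] - rates[1]))
```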
Keywords
algorithm bias
fairness
equal opportunity
true positive rate
calibration bias
Understanding the spread of infectious diseases such as COVID-19 is crucial for informed decision-making and resource allocation. A critical component of disease behavior is the velocity with which it spreads, defined as the rate of change of the process at each location and time. In this paper, we propose a spatio-temporal modeling approach to determine the velocities of infectious disease spread. Our approach assumes that the locations and times of infections can be treated as a spatio-temporal point pattern arising as a realization of a spatio-temporal log-Gaussian Cox process. The intensity of this process is estimated with fast Bayesian inference using the integrated nested Laplace approximation (INLA) and stochastic partial differential equation (SPDE) approaches. Velocities are then computed using finite differences that approximate the derivatives of the intensity function. Finally, the directions and magnitudes of the velocities can be mapped at specific times to examine disease spread across the region. We demonstrate our method by analyzing the spread of COVID-19 in Cali, Colombia, during the 2020-2021 pandemic.
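The finite-difference step can be sketched as follows, assuming the fitted intensity has been evaluated on a regular space-time grid (e.g., exported from an R-INLA/SPDE fit). The level-set formula used here, v = -λ_t ∇λ / |∇λ|², is one common definition of front velocity and is an assumption for illustration; the paper's exact definition may differ.

```python
import numpy as np

# stand-in for lam[t, i, j]: a fitted LGCP intensity whose bump
# drifts in x at speed 0.3 per unit time
t, x, y = np.meshgrid(np.linspace(0, 1, 20),
                      np.linspace(0, 1, 50),
                      np.linspace(0, 1, 50), indexing="ij")
lam = np.exp(5 - ((x - 0.5 - 0.3 * t) ** 2 + (y - 0.5) ** 2) / 0.02)

# finite-difference derivatives of the intensity on the grid
dlam_dt, dlam_dx, dlam_dy = np.gradient(lam, 1 / 19, 1 / 49, 1 / 49)

# velocity of a level set lam = const: v = -lam_t * grad(lam) / |grad(lam)|^2
grad2 = dlam_dx ** 2 + dlam_dy ** 2 + 1e-12
vx = -dlam_dt * dlam_dx / grad2
vy = -dlam_dt * dlam_dy / grad2

speed = np.hypot(vx, vy)                      # magnitude to map
direction = np.degrees(np.arctan2(vy, vx))    # direction to map
print("median speed near the front:", np.median(speed[lam > lam.max() / 2]))
```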
Keywords
Bayesian inference
Log-Gaussian Cox processes
Spatio-temporal point patterns
Velocities
We will present the results of our analysis for the 2024 Data Challenge Expo.
Keywords
General social survey
Statistical graphics
Despite plausible mechanisms, how gestational sleep behaviors and the development of sleep disorders such as restless legs syndrome (RLS) influence the fetal growth trajectory has not been fully explored. Analyzing prospective cohort data from the multicenter NICHD Fetal Growth Studies - Singletons (2009-2013), the study included 2,458 pregnant women recruited between 8-13 gestational weeks and followed up to five times during pregnancy. The trajectory of estimated fetal weight (EFW) from 10-41 weeks of gestation was derived from three ultrasonographic measures. Linear mixed effects models were used to relate EFW to self-reported sleep and RLS exposures, adjusting for age, race and ethnicity, education, parity, pre-pregnancy body mass index category, infant sex, and pre-pregnancy sleep-napping behavior. From enrollment to near delivery, pregnant women's sleep duration and nap frequency generally declined while the frequency of RLS symptoms increased. No significant differences in EFW were observed by sleep-napping group or by RLS status between 10-41 weeks. From week 30 onward, a small but statistically nonsignificant divergence in mean EFW was observed across sleep-napping groups. Overall, our data do not support an association between gestational sleep behaviors or RLS symptoms and fetal growth between 10-41 weeks of gestation in healthy pregnant women.
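A hedged sketch of the modeling form follows, with synthetic data and a spline in gestational age standing in for the nonlinear growth trajectory; the study's actual covariate coding is richer than shown, and all names here are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n, k = 300, 4                                # women, visits per woman
ga = rng.uniform(10, 41, n * k)              # gestational age at each visit
wid = np.repeat(np.arange(n), k)             # woman identifier
sleep = np.repeat(rng.choice(["short", "normal", "long"], n), k)
efw = np.exp(0.2 * ga) + rng.normal(0, 50, n * k)   # toy growth curve (grams)

df = pd.DataFrame(dict(efw=efw, ga=ga, sleep=sleep, wid=wid))

# B-spline in gestational age captures the nonlinear trajectory; the
# sleep x spline interaction lets trajectories differ by exposure group,
# with a random intercept per woman for repeated measures
m = smf.mixedlm("efw ~ bs(ga, df=4) * sleep", df, groups=df["wid"])
print(m.fit().summary())
```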
Keywords
time-varying exposures
linear mixed effect models
estimated fetal weight
restless legs syndrome
gestational sleep
cohort study
Highly drafted running backs are becoming increasingly rare in the NFL. Running back contracts are also typically less lucrative and shorter than those of other positions due to injuries and lack of longevity in the league. As a potential instructional data set for comparing traditional regression techniques and tree modeling, a data set was compiled investigating NFL running back production from 1999 to 2021. Responses such as years with the original draft team and total career NFL rushing yards were investigated. Predictors included total college attempts, college conference, and overall draft pick number. Results revealed some anticipated predictor significance as well as some less anticipated predictor importance. Furthermore, tree modeling revealed interesting ranges of predictor variables that might be useful in evaluating college players and predicting NFL performance.
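In the instructional spirit of the data set, here is a minimal comparison of a linear fit and a shallow regression tree on synthetic stand-in data; the feature names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_text

# synthetic stand-in: three predictors such as college attempts,
# conference, and draft pick; response such as career rushing yards
X, y = make_regression(n_samples=400, n_features=3, noise=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lm = LinearRegression().fit(X_tr, y_tr)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_tr, y_tr)

print("linear R^2:", lm.score(X_te, y_te))
print("tree R^2:  ", tree.score(X_te, y_te))
# the printed splits expose ranges of predictors (e.g., draft pick
# cutoffs) that a linear fit cannot reveal directly
print(export_text(tree, feature_names=["attempts", "conference", "pick"]))
```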
Keywords
regression
trees
modeling
prediction
teaching
sports
First Author
Bob Downer, Grand Valley State University
Presenting Author
Bob Downer, Grand Valley State University
Social distancing during the COVID-19 pandemic caused significant changes in everyday life. Leveraging the rich longitudinal data provided by the General Social Survey (GSS), we analyze the profound impacts of the COVID-19 pandemic on American life, specifically focusing on changes in interpersonal relationships, computer and technology usage, and subjective well-being. By analyzing trends before and during the pandemic, we identify significant shifts in the reported behaviors and psychological state of survey respondents. We present novel interactive visualizations to explore these trends. We also construct statistical models to assess the degree to which changes in subjective well-being can be explained by the changes in interpersonal relationships and technology use. These analyses inform societal responses to pandemics and could also guide policies and interventions to mitigate negative social effects brought on by COVID-19 and future pandemics.
Keywords
statistical social science
data visualization
COVID-19
Electronic Health Record (EHR)-based association studies are commonly used to identify risk factors associated with patient clinical phenotypes. While EHR-derived phenotypes (i.e., surrogates) have recently been utilized, manual chart review remains the gold standard for ensuring phenotype quality, and this process is notably time-consuming and costly. Determining the optimal subset size for chart review is therefore of great importance. In this paper, we propose a PPV/NPV-guided method to determine the minimum sample size required for chart review, thereby substantially reducing cost. We then introduce an augmented estimation procedure that effectively combines the chart reviews with the surrogates to achieve asymptotically unbiased and efficient estimators for EHR-based association studies. Our approach offers a cost-effective solution that ensures accuracy and efficiency in estimation without explicitly specifying the misclassification mechanism of the surrogates. The robustness of our method is validated through extensive simulation studies and an evaluation of real-world data, using the Flatiron dataset as a benchmark for verification.
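The abstract does not spell out the PPV/NPV-guided sample-size rule, but a generic precision-based calculation conveys the flavor: choose the smallest chart-review sample such that a confidence interval for the surrogate's PPV (or NPV) has a desired half-width. The Wald-style formula below is an assumed stand-in, not the authors' method.

```python
from math import ceil
from scipy.stats import norm

def chart_review_n(p_expected, margin, conf=0.95):
    """Smallest n so a Wald CI for a proportion (e.g., the surrogate's
    PPV or NPV) has half-width at most `margin`."""
    z = norm.ppf(1 - (1 - conf) / 2)
    return ceil(z ** 2 * p_expected * (1 - p_expected) / margin ** 2)

# e.g., expect PPV around 0.85 and want it estimated to within +/- 0.05
print(chart_review_n(0.85, 0.05))   # -> 196 charts
```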
Keywords
Electronic Health Record (EHR)
Association study
Outcome dependent sampling
The decathlon is a complex athletics discipline that combines ten track and field events held over two days for male athletes. These ten events can be classified as "running," "jumping," and "throwing" events. A dataset was gathered from the competition results of all Olympic Games and World Athletics Championships from 1984 to 2023 (n = 595) and divided into training (90%) and testing (10%) subsets.
The main objective of this study is to predict the final decathlon points standings using the five events of the first day. The original dataset was resampled with replacement 10,000 times into training and test sets; four regression models were then applied to assess which fits the data best, with root mean square error (RMSE) used as the model performance criterion. The results showed that final performance is highly influenced by two first-day events: the long jump (LJ) and shot put (SP). In addition, multiple linear regression was the best-performing model for predicting the final results, followed by partial least squares regression and quantile regression.
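A minimal sketch of the resample-and-score loop with a linear model on synthetic stand-in data follows; the real study compared four regression models, and the coefficients and scales below are invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 595
# synthetic stand-in for day-one event points and final decathlon score
X = rng.normal(800, 100, size=(n, 5))           # 100m, LJ, SP, HJ, 400m
y = X @ np.array([0.2, 0.5, 0.5, 0.2, 0.2]) + rng.normal(0, 150, n)

rmses = []
for _ in range(1000):                           # the study used 10,000 resamples
    idx = rng.integers(0, n, n)                 # bootstrap training sample
    oob = np.setdiff1d(np.arange(n), idx)       # out-of-bag rows as test set
    fit = LinearRegression().fit(X[idx], y[idx])
    rmses.append(mean_squared_error(y[oob], fit.predict(X[oob])) ** 0.5)
print("mean bootstrap RMSE:", round(float(np.mean(rmses)), 1))
```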
Keywords
Decathlon
Multiple linear regression model
Partial least squares regression
Quantile regression
Principal component regression model
Root mean square error (RMSE)
Optimizing the design and dosing regimen for Phase 2b dose-ranging studies is crucial to achieving an optimal benefit-risk balance for patients. It is also essential for ensuring a seamless and successful transition to Phase 3 through identification of the right dose and an optimal design. Our proposal integrates a Bayesian Quantitative Decision-Making (QDM) framework into Phase 2b design, enabling the incorporation of informative prior information. This shift from p-value-based to probability-based decision-making enhances the efficiency of our decisions. In line with the Commit to Medicine Development (C2MD) milestone, our focus extends to measuring the conditional probability of success in Phase 3 studies. This measurement reflects the design's ability to de-risk later phases based on our current prior beliefs and pre-defined success criteria. To illustrate this process, we will present results from a simulation involving an HIV asset, demonstrating the effectiveness of our approach.
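One simple instance of probability-based decision-making of this kind is a Beta-Binomial go/no-go rule with an informative prior; all numbers below are illustrative placeholders, not the HIV asset's actual criteria.

```python
import numpy as np
from scipy.stats import beta, binom

# go/no-go rule: declare "go" if Pr(response rate > 0.30 | data) >= 0.80,
# under an informative Beta(4, 8) prior (all values illustrative)
prior_a, prior_b, target, threshold = 4, 8, 0.30, 0.80

def go_decision(successes, n):
    post = beta(prior_a + successes, prior_b + n - successes)
    return post.sf(target) >= threshold     # posterior Pr(rate > target)

# assurance: average probability of "go" over the prior on the rate
rng = np.random.default_rng(0)
n = 60
rates = beta(prior_a, prior_b).rvs(5000, random_state=rng)
goes = [go_decision(binom(n, r).rvs(random_state=rng), n) for r in rates]
print("prior probability of a go decision:", np.mean(goes))
```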
Keywords
Bayesian
Decision making
Clinical trial
Adaptive
In many applications, numerous features with differing levels of information and an imbalanced outcome ratio occur simultaneously. Weighted Random Forest (WRF) has been used to address low signal-to-noise problems by assigning larger weights to informative features, prioritizing their inclusion in the feature subset sampled at each node of individual trees. However, it has not been actively studied for class-imbalanced problems. In this work, we propose using RF variable importance measured in the area under the receiver operating characteristic curve - referred to as VI-AUC - as weights in WRF to account for class imbalance. Our simulation studies show that WRF with VI-AUC is superior to and more stable than other weighting methods, particularly in class-imbalanced scenarios with small sample sizes. We illustrate the approach using an immunologic marker dataset from an HIV vaccine efficacy trial.
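VI-AUC can be approximated with permutation importance scored by AUC; the sketch below computes such importances and normalizes them into candidate feature-sampling weights. Note that scikit-learn's random forest does not itself support weighted feature sampling, so feeding these weights into a WRF would require a custom implementation; this sketch covers only the weight-construction step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# class-imbalanced data with a few informative features
X, y = make_classification(n_samples=1000, n_features=20, n_informative=4,
                           weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# permutation importance in AUC: how much AUC drops when each feature
# is shuffled -- the idea behind VI-AUC
vi = permutation_importance(rf, X_te, y_te, scoring="roc_auc",
                            n_repeats=20, random_state=0)
weights = np.clip(vi.importances_mean, 0, None)
weights /= weights.sum()        # candidate feature-sampling weights for a WRF
print(np.round(weights, 3))
```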
Keywords
Variable importance
Weighted random forest
Class imbalance
AUC
The super learner method combines the stacking algorithm with regression analysis to obtain weighted predictions from varied statistical strategies for model prediction. It has been shown to perform no worse than any single prediction method and to provide consistent estimates. The targeted maximum likelihood estimation (TMLE) method was further introduced for variable importance analyses, in which super learner predictions were compared between the saturated model and reduced models with each variable left out in turn. Variable importance was profiled by the corresponding p-values.
In a study of nursing home resident suicide ideation, we first performed individual modeling for each of eleven parametric or non-parametric strategies. Cross-validation was implemented within each strategy, and aggregated estimates for each algorithm were obtained. We then derived composite parameter estimates by ensembling all model-specific estimates, using mean squared error (MSE) to identify the best weights for the ensemble. The TMLE method was used to identify the ten most important risk factors associated with nursing home resident suicide ideation.
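The weight-finding step of a super learner can be sketched as non-negative least squares on cross-validated predictions. Three candidate learners here stand in for the study's eleven strategies, and the TMLE step is not shown.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=1000, n_features=15, random_state=0)
learners = [LogisticRegression(max_iter=1000),
            RandomForestClassifier(random_state=0),
            GradientBoostingClassifier(random_state=0)]

# cross-validated predicted probabilities from each candidate algorithm
Z = np.column_stack(
    [cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
     for m in learners])

# super learner weights: minimize MSE of the convex combination
w, _ = nnls(Z, y.astype(float))
w /= w.sum()
print("ensemble weights:", np.round(w, 3))
```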
Keywords
Super learner
targeted maximum likelihood
risk analysis
Co-Author(s)
Shan Gao, University of Rochester
Yue Li, University of Rochester
First Author
Xueya Cai, University of Rochester
Presenting Author
Xueya Cai, University of Rochester
Understanding the prevalence and impact of anxiety and depressive symptoms is crucial for gauging the aftermath of the global pandemic. This research explores the complexity of mental health in the post-pandemic landscape of 2022 by analyzing data from the General Social Survey (https://gss.norc.org/). The research focuses on two primary mental health measures, Generalized Anxiety Disorder (GAD) and Patient Health Questionnaire (PHQ) scores, to identify the individual factors that affect these scores and to offer insight into the evolving mental health landscape. We propose a two-step approach: first, employing machine learning algorithms to identify distinct subgroups and structural patterns within the individual mental health data; and second, building on the findings from the first step, utilizing statistical models to explore the joint impact of individual and societal factors on mental health. The ultimate goal of this study is to identify key factors influencing post-pandemic mental health and to provide actionable insights for policymakers, clinicians, and mental health practitioners.
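A bare-bones sketch of the proposed two-step approach on synthetic data follows; the variables are hypothetical, and the actual GSS items, clustering algorithm, and models are still to be chosen by the authors.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000
# hypothetical GAD-2 / PHQ-2 style scores (0-6 each) plus one covariate
gad = rng.integers(0, 7, n)
phq = np.clip(gad + rng.integers(-2, 3, n), 0, 6)
income = rng.normal(50, 15, n)

# step 1: cluster respondents on the two mental-health measures
Z = StandardScaler().fit_transform(np.column_stack([gad, phq]))
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)

# step 2: model a combined score on individual/societal factors by subgroup
for k in range(3):
    m = sm.OLS((gad + phq)[labels == k],
               sm.add_constant(income[labels == k])).fit()
    print(f"cluster {k}: income coefficient = {m.params[1]:.3f}")
```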
Keywords
Mental Health
Post Pandemic
GAD-2
PHQ-2
Data Science
The conventional use of Nuclear Quadrupole Resonance (NQR) technology to detect explosives holds promise for application to narcotics detection. However, its advancement is hindered by the inefficiency of ascertaining excitation frequencies for new substances. Currently, experimental physicists rely on identifying features in a dispersion curve, which requires conducting experiments, often spanning several months, across a dense frequency range. Our research incorporates Bayesian optimization and active learning techniques to enable data-driven selection of frequency subsets for experimentation. This approach seeks to expedite the acquisition of dispersion curve features, ultimately enhancing the utility of NQR technology in narcotics detection by removing a current bottleneck in the field.
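An uncertainty-sampling loop with a Gaussian process surrogate illustrates the idea. The toy dispersion function, kernel, and acquisition rule (query the most uncertain frequency) are assumptions for illustration; an expected-improvement rule would instead target the curve's peak directly.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

def dispersion(f):
    """Toy stand-in for the (expensive) NQR dispersion measurement."""
    return np.exp(-((f - 3.2) ** 2) / 0.05) + 0.05 * rng.normal(size=np.shape(f))

freqs = np.linspace(2, 5, 400)[:, None]      # candidate frequencies (MHz)
X = rng.uniform(2, 5, (5, 1))                # a few seed experiments
y = dispersion(X).ravel()

gp = GaussianProcessRegressor(RBF(0.3) + WhiteKernel(0.01), normalize_y=True)
for _ in range(20):                          # active-learning loop
    gp.fit(X, y)
    mu, sd = gp.predict(freqs, return_std=True)
    nxt = freqs[np.argmax(sd)]               # query the most uncertain frequency
    X = np.vstack([X, [nxt]])
    y = np.append(y, dispersion(nxt))

gp.fit(X, y)
print("estimated peak frequency:", freqs[np.argmax(gp.predict(freqs))][0])
```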
Keywords
Bayesian Optimization
Bayesian Active Learning
Experimental Design
Experimental Physics
Applied Machine Learning