S4: Speed Session 4

Conference: Women in Statistics and Data Science 2025
11/13/2025: 2:30 PM - 4:00 PM EST
Speed Presentations

01. The Regional Effect of Mental Health in the United States

What is the impact of mental health on mortality rates in the United States? Using mortality statistics from the CDC and Mental Health America metrics, we analyze county-level mortality rates, suicide rates, and mental health indicators across 18 age groups from 2000 to 2023. We employ Bayesian spatio-temporal models with conditional auto-regressive (CAR) priors on the spatial effects to account for both regional and temporal trends in mortality. These models allow us to investigate the relationship between local mental health conditions and mortality outcomes while accounting for geographic and age-group variation. Our approach supports both retrospective analysis and forecasting, identifying regions where mental health is improving or deteriorating and quantifying rates of change. We specifically examine the impact of COVID-19 on mental health as it relates to mortality, highlighting regions where the pandemic contributed to shifts in mortality trends. Our findings provide demographers, statisticians, actuaries, and public health professionals with evidence-based insights on when and how to incorporate mental health data into mortality modeling, enhancing the accuracy of demographic projections and risk assessment in insurance and public health applications. 
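As a minimal, hypothetical sketch of the proper CAR structure such models typically place on spatial effects (the lattice adjacency, rho, and tau below are illustrative choices, not the authors' specification), the precision matrix Q = tau(D - rho*W) can be built from a county adjacency matrix and sampled from directly:

```python
import numpy as np

def car_precision(W, rho=0.9, tau=1.0):
    """Proper CAR precision: Q = tau * (D - rho * W), D = diag(row sums of W)."""
    D = np.diag(W.sum(axis=1))
    return tau * (D - rho * W)

def lattice_adjacency(n):
    """Rook adjacency for an n x n grid standing in for counties."""
    N = n * n
    W = np.zeros((N, N))
    for i in range(n):
        for j in range(n):
            k = i * n + j
            if i + 1 < n: W[k, k + n] = W[k + n, k] = 1
            if j + 1 < n: W[k, k + 1] = W[k + 1, k] = 1
    return W

rng = np.random.default_rng(0)
W = lattice_adjacency(3)
Q = car_precision(W, rho=0.9, tau=2.0)
# Draw one spatial effect vector phi ~ N(0, Q^{-1}) via the Cholesky factor of Q.
L = np.linalg.cholesky(Q)
phi = np.linalg.solve(L.T, rng.standard_normal(9))
print(phi.shape)
```

For rho strictly below 1, D - rho*W is strictly diagonally dominant, so Q is positive definite and the Cholesky-based draw is valid.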

Presenting Author

Brianne Weaver, Brigham Young University

First Author

Brianne Weaver, Brigham Young University

CoAuthor(s)

Robert Richardson, Brigham Young University
Brian Hartman, Brigham Young University
Brigg Trendler, Brigham Young University
Chris Groendyke, Robert Morris University
Davey Erekson, Brigham Young University

02. Saving lives, reducing costs: Predicting healthcare utilization from whole-person health using partially validated electronic health records data

A standardized measure of whole-person health in electronic health records (EHR) could be instrumental in identifying at-risk patients, preventing disease, and reducing patient engagement in the healthcare system. The allostatic load index (ALI), calculated from ten component stressors of the cardiovascular, metabolic, and inflammatory regulation systems, offers a promising estimate of holistic health. The ALI can be calculated from EHR data, but these data are prone to error and missingness, and calculating the ALI from non-validated data can lead to inaccurate conclusions about patient health and its association with healthcare utilization. To address this challenge, EHR data for 1000 patients from a large academic health system were partially validated, with expert chart review completed for 100 patients to improve their data quality and completeness. Using machine learning techniques, these data were used to predict patient engagement in the healthcare system (hospitalization or emergency department visit) based on ALI. To better explore how ALI can predict whether people engage in the healthcare system, we explore additional models, evaluate methods to fill in information gaps for the 900 remaining patients, and assess different strategies (such as imputation) for handling data quality issues for unvalidated patients. 
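As an illustration of the general workflow (fully simulated data; the ALI construction, missingness rate, and model choice below are placeholders, not the study's actual pipeline), an imputation-plus-classifier pipeline for predicting utilization might look like:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
n = 500
X = rng.standard_normal((n, 10))                 # 10 toy ALI component stressors
ali = (X > 0.5).sum(axis=1)                      # toy ALI: count of elevated components
y = (ali + rng.normal(0, 1, n) > 3).astype(int)  # utilization indicator
X[rng.random((n, 10)) < 0.2] = np.nan            # 20% missingness, mimicking unvalidated EHR

# Median imputation handles the gaps before the classifier sees the data.
model = make_pipeline(SimpleImputer(strategy="median"),
                      LogisticRegression(max_iter=1000))
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
print(round(auc, 2))
```

Wrapping the imputer and classifier in one pipeline keeps the imputation inside each cross-validation fold, avoiding leakage when comparing strategies.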

Presenting Author

Grayson Weavil, Wake Forest University

First Author

Grayson Weavil, Wake Forest University

CoAuthor

Sarah Lotspeich, Wake Forest University

03. Time-Varying Covariate Analysis of Dementia Risk Between Autistic and Non-Autistic Older Adults in Medicare

Prior studies have demonstrated heightened dementia risk among autistic older adults. Understanding disparities in dementia risk is increasingly important as more autistic individuals reach older adulthood. While research suggests autistic older adults experience heightened risk for most physical and mental health conditions compared with their non-autistic peers, the extent to which these risk factors contribute to the heightened dementia risk observed in autistic older adults remains unclear.
While logistic regression is commonly used to model binary outcomes, it fails to account for the timing of disease onset or the dynamic nature of risk exposure. Using a national Medicare sample, we modeled the association between autism and dementia using a Cox proportional hazards model with time-varying covariates for known dementia risk factors over 9 years of follow-up. Risk factors for dementia in the general population have been well-documented and include hypertension, hearing loss, type 2 diabetes, obesity, high cholesterol, TBI, depression, alcohol use, and tobacco use.
Our findings suggest that, even after accounting for these factors as time-dependent exposures, autistic older adults remain at significantly elevated risk for dementia. These results suggest that known risk factors only partially explain the disparity and that additional contributors may underlie the increased risk of dementia observed in the autistic population. 
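Time-varying Cox analyses rest on restructuring covariate histories into counting-process (start, stop] intervals, one row per person-period. A toy pandas sketch of that data format (hypothetical columns and values; not the authors' Medicare pipeline):

```python
import pandas as pd

# Long-format covariate history: one row per beneficiary per follow-up year.
long_df = pd.DataFrame({
    "id":           [1, 1, 1, 2, 2],
    "year":         [0, 1, 2, 0, 1],
    "hypertension": [0, 1, 1, 0, 0],   # time-varying risk factor
    "dementia":     [0, 0, 1, 0, 0],   # outcome observed at end of the interval
})

# Counting-process format: each row covers the interval (start, stop].
cp = long_df.rename(columns={"year": "start"}).copy()
cp["stop"] = cp["start"] + 1
cp["event"] = cp["dementia"]
cp = cp[["id", "start", "stop", "hypertension", "event"]]
print(cp)
```

Each subject contributes multiple rows, so a covariate like hypertension can switch on mid-follow-up while the partial likelihood still compares risk sets at each event time.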

Presenting Author

Melica Nikahd

First Author

Melica Nikahd

CoAuthor(s)

Madison Hyer
Lauren Bishop, University of Wisconsin–Madison
Brittany Hand, The Ohio State University

Withdrawn - 04. Statistical Framework for Spatial and Economic Prioritization of Distributed Wind

This study presents a statistically rigorous framework for assessing where distributed wind (DW) energy can most effectively reduce household energy burdens in the U.S. We integrate high-resolution geospatial techno-economic data from NREL's Distributed Wind Energy Futures Study with probability distribution modeling and inferential statistics to examine links between wind potential and socioeconomic vulnerability across multiple scenarios.

First, we construct a household-level electric-only energy burden (EB) metric using public data on income and electricity spending. We validate this metric through empirical distribution comparisons against official estimates and find it exhibits right-skewed, non-normal behavior. Normality tests, Box-Cox transformations, and nonparametric methods (e.g., Kolmogorov–Smirnov) guide our statistical choices.

Second, we define a standardized wind feasibility metric, Annual Energy Production (AEP) normalized by sectoral energy demand (the AEP-to-demand ratio), to enable cross-county comparisons across time (2022, 2025, 2035) and policy (2025) scenarios. Using this and our EB metric, we conduct:

- Pairwise scenario comparisons (parametric and nonparametric).
- Correlation analysis (Pearson, Kendall).
- Linear mixed-effects modeling with state as a random effect and demographic/economic covariates as fixed effects.
- Rank-based geographic comparisons using Spearman's ρ.

To enhance interpretability, we apply a Net Energy Return (NER) indicator, an algebraic function of energy burden, and extend both EB and NER to county-level values normalized by local GDP.

Results show significant correlations between AEP-to-demand ratio and EB, especially in Georgia, Louisiana, and Colorado. Agricultural employment and poverty rate emerge as key burden predictors. This framework informs spatially targeted DW deployment strategies. 

Presenting Author

Sara Abril Guevara, National Renewable Energy Laboratory

First Author

Sara Abril Guevara, National Renewable Energy Laboratory

CoAuthor(s)

Paula Perez, National Renewable Energy Laboratory
Jane Lockshin, National Renewable Energy Laboratory
Caleb Phillips, National Renewable Energy Laboratory

05. Exploring Associations between Latino Cultural Values, ADHD Stigma, and Help-Seeking Attitudes in a Community Sample in Mexico: A Mediation Analysis

Associations between mental health stigma and help-seeking are well-researched, but there is limited understanding of the associations between ADHD stigma, cultural values, and help-seeking attitudes. We examined whether the hypothesized association between adherence to Latino cultural values (including familismo and respeto) and help-seeking attitudes is mediated by ADHD stigma in a secondary analysis of cross-sectional survey data from 313 school staff and parents of children from 8 public elementary schools participating in a trial of the Collaborative Life Skills school-based ADHD program in Mexico (CLS-FUERTE). Families and teachers were shown a behavioral impairment video of a child portraying ADHD features and asked to complete the ADHD Stigma Questionnaire, Mexican American Cultural Values scale, Problem Recognition and Service Selection Questionnaire, Child Symptom Inventory, and ADHD-FX scale. The outcome, help-seeking attitudes, is split into two domains: ADHD problem recognition and the types of help participants considered appropriate. We conducted a causal mediation analysis to assess the direct and indirect effects of Latino cultural values on help-seeking attitudes, with ADHD stigma as a mediator on the causal pathway. The analysis targets three causal estimands: the natural indirect effect (NIE), natural direct effect (NDE), and controlled direct effect (CDE). We aim to assess multiple levels of confounding using a counterfactual framework. The results clarify how ADHD stigma may mediate the association between adherence to Latino cultural values and help-seeking attitudes (NIE), how much of this association may operate through non-stigma pathways (NDE), and the expected association if ADHD stigma were eliminated (CDE). We aim to improve understanding of ADHD stigma and to address it in culturally adapted services, reducing disparities for underserved populations like Spanish-speaking youth. 
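In the special case of linear models with no exposure-mediator interaction, the counterfactual NDE and NIE reduce to familiar regression quantities: the NDE is the exposure coefficient adjusting for the mediator, and the NIE is the product of the exposure-to-mediator and mediator-to-outcome coefficients. A simulated sketch under that simplifying assumption (illustrative only; not the study's estimation approach, which handles confounding within a counterfactual framework):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
values = rng.standard_normal(n)                                     # exposure A
stigma = 0.5 * values + rng.standard_normal(n)                      # mediator M
help_seek = 0.2 * values - 0.4 * stigma + rng.standard_normal(n)    # outcome Y

def ols(X, y):
    """Least-squares coefficients with an intercept column prepended."""
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

a = ols(values, stigma)[1]                               # A -> M path
coefs = ols(np.column_stack([values, stigma]), help_seek)
nde, b = coefs[1], coefs[2]                              # direct effect; M -> Y path
nie = a * b                                              # indirect (mediated) effect
print(round(nde, 2), round(nie, 2))
```

With the coefficients above, the true NDE is 0.2 and the true NIE is 0.5 * (-0.4) = -0.2, which the estimates recover up to sampling noise.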

Presenting Author

Aranya Shukla

First Author

Aranya Shukla

CoAuthor(s)

Suzanne Dufault, University of California, San Francisco
Samira Soleimanpour, University of California, San Francisco
Lauren Haack, UCSF

06. AI-Driven Patent Data Extraction and Analysis for Agricultural Patents

The AI-Driven Patent Data Extraction and Analysis System for Corteva Agriscience was a research project under The Data Mine at Purdue University. A team of 9 undergraduate and graduate students designed the system for efficient retrieval, extraction, and analysis of agricultural patents related to crop protection. The project integrated cutting-edge technologies, including large language models (LLMs) and advanced tools for data extraction and structured search. These capabilities will allow scientists and researchers to efficiently access, extract, and analyze patent data, enabling faster and more informed decision-making.

Project Objectives:

1. Patent Retrieval Development – Developed the system for retrieving patents directly from Google Patents.
2. Automated Data Extraction – Developed a tool that extracts and converts patent metadata, as well as relevant content from the example section of patents, into a structured table format that can be downloaded for further analysis.
3. Interactive Chat Module – Implemented an LLM chatbot that helps scientists perform IP-related queries. 

Presenting Author

Srishti Maurya

First Author

Srishti Maurya

CoAuthor(s)

Anna Bajszczak
Lina Im, Student

07. Maximum Entropy Mortality Forecasting for U.S. Females

The mortality experience of a population can be studied through its age-at-death distribution. This study examines the first four statistical moments of that distribution, the mean, variance, skewness, and kurtosis, which together provide a good approximation of the shape of the probability density function of the underlying distribution of deaths. Using data from the Human Mortality Database for the U.S. female population, we apply the Maximum Entropy Mortality (MEM) model to reconstruct the full mortality density. We validate MEM by comparing observed and reconstructed densities, demonstrating its ability to capture shifting mortality patterns. Forecast accuracy is further evaluated using standard metrics and benchmarked against the Lee–Carter model and other alternatives. Our analysis highlights how the MEM model, by leveraging moment information, provides a detailed characterization of age-specific mortality levels for forecasting. 
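A maximum-entropy density constrained to match the first four moments has the exponential-family form p(x) ∝ exp(λ1 x + λ2 x² + λ3 x³ + λ4 x⁴). As an illustrative sketch (not the MEM model itself; the grid, target moments, and optimizer are assumptions), the multipliers can be found by minimizing the convex dual on a bounded grid:

```python
import numpy as np
from scipy.optimize import minimize

# Grid and sufficient statistics T(x) = (x, x^2, x^3, x^4).
x = np.linspace(-6, 6, 2001)
dx = x[1] - x[0]
T = np.vstack([x, x**2, x**3, x**4])
m_target = np.array([0.0, 1.0, 0.0, 3.0])   # moments of a standard normal

def dual(lam):
    """Convex dual: log normalizer minus lam . m_target."""
    logw = lam @ T
    c = logw.max()
    logZ = np.log(np.sum(np.exp(logw - c)) * dx) + c
    return logZ - lam @ m_target

def grad(lam):
    """Gradient = model moments minus target moments."""
    logw = lam @ T
    w = np.exp(logw - logw.max())
    p = w / (w.sum() * dx)
    return (T @ p) * dx - m_target

res = minimize(dual, np.zeros(4), jac=grad, method="BFGS")
logw = res.x @ T
p = np.exp(logw - logw.max())
p /= p.sum() * dx                            # normalized maxent density on the grid
moments = (T @ p) * dx
print(np.round(moments, 3))
```

At the optimum the gradient vanishes, so the fitted density matches the target moments; with normal moments as the target, the recovered density is essentially Gaussian.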

Presenting Author

Jing Jing

First Author

Jing Jing

CoAuthor

Tatjana Miljkovic, Miami University

08. Comparing novel machine-learning derived weights versus standard weights for the Charlson Comorbidity Index in predicting mortality for autistic older adults

The Charlson Comorbidity Index (CCI) was developed in 1987 to predict one-year mortality of breast cancer patients based on 19 weighted conditions, providing a valuable tool for assessing patient risk in clinical and research settings. The CCI was updated over the years, including accommodating changes in coding systems like ICD-9 and ICD-10 and improving predictive accuracy. The CCI provides important foundational work for the field. Yet, this index is not ideally equipped to predict mortality in certain patient populations, like autistic older adults, who experience a higher prevalence of co-occurring conditions than the general population. We aimed to develop autism-specific weights for the 19 conditions in the CCI to better quantify the risk of mortality for autistic older adults. We hypothesized that the novel autism-specific weights for the CCI would have greater predictive validity than the established CCI weights, developed for the general population. We used a 100% sample of national Medicare claims from the years 2013-2021. We leveraged a machine learning optimization technique called stochastic hill climbing, where the weights of the 12 updated CCI conditions were randomly shifted to optimize a weighted mean for quantifying mortality risk. We then compared the predictive validity of the new autism-specific comorbidity weights to the established CCI weights through the area under the curve (AUC) of a logistic regression model with mortality as the outcome and the weighted mean as the predictor. We found the AUCs were similar for predicting mortality using the novel autism-specific weights (AUC=0.68) and the established CCI weights (AUC=0.67) among autistic older adults. These findings may suggest that additional health conditions not currently captured by the CCI, or patient demographics, may need to be added to the CCI to better predict mortality in autistic older adults. Future studies on developing an autism-specific mortality risk index are warranted. 
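A generic stochastic hill-climbing loop of the kind described, shown on simulated data (the perturbation scale, iteration count, and data-generating process are illustrative assumptions, not the study's settings):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
n, k = 2000, 12                                   # patients, CCI conditions
X = (rng.random((n, k)) < 0.15).astype(float)     # condition indicators
true_w = rng.uniform(0, 3, k)
logit = X @ true_w - 2
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)  # mortality

def auc_for(w):
    return roc_auc_score(y, X @ w)

w = np.ones(k)                                    # start from equal weights
best = auc_for(w)
for _ in range(500):
    cand = w.copy()
    j = rng.integers(k)
    cand[j] = max(0.0, cand[j] + rng.normal(0, 0.5))  # random local shift to one weight
    score = auc_for(cand)
    if score > best:                              # accept only improvements
        w, best = cand, score
print(round(best, 3))
```

Because only improving moves are accepted, the AUC of the weighted score is non-decreasing over iterations; restarts or annealing are common additions to escape local optima.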

Presenting Author

Madison Blake

First Author

Madison Blake

CoAuthor(s)

Madison Hyer
Melica Nikahd
Lauren Bishop, University of Wisconsin–Madison
Brittany Hand, The Ohio State University

09. Comparing preprocessing methods on inference of exposomic and metabolomic data with application to liver disease health outcomes in a clinical study

Exposomics and metabolomics data, like other high-throughput data, require preprocessing before downstream statistical analysis. One step is normalization, which is often accomplished with standard techniques (e.g., quantile, sum, median, reference sample, or reference feature) readily available in specialized software (e.g., MetaboAnalyst). Analysis of exposomics and metabolomics in larger cohorts will continue to increase due to lowered costs, and these cohorts include a richer set of clinical and patient characteristics that may be beneficial in the normalization process. Current software does not allow normalization by these additional characteristics, including class factors (i.e., biological sample classifiers) or other important study design features (e.g., age, sex, and other patient characteristics). Herein, we examine the performance of a normalization procedure that accounts for study design features in the cohort study, using simulation studies and an environmentally exposed cohort. We compare the results to those of normalization with the standard options to determine the best methods for assessing differentially expressed exposomic and metabolomic features in liver outcomes. We find similarities in some of these data but note differences in the outcomes based on normalization method. Future studies may benefit from including known clinical and study design features in their analyses. 
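For reference, quantile normalization, one of the standard baselines mentioned, forces every sample to share a common empirical distribution by replacing each value with the mean of the corresponding order statistics across samples. A numpy sketch (toy data, not the study's cohort):

```python
import numpy as np

def quantile_normalize(X):
    """Map every column (sample) onto the mean empirical distribution.

    X: features x samples matrix of abundances.
    """
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # within-column ranks
    ref = np.sort(X, axis=0).mean(axis=1)              # mean of order statistics
    return ref[ranks]                                  # substitute by rank

rng = np.random.default_rng(4)
# Three samples with deliberately different scales/skews.
X = rng.lognormal(mean=0, sigma=[0.5, 1.0, 2.0], size=(100, 3))
Xn = quantile_normalize(X)
print(np.allclose(np.sort(Xn, axis=0)[:, 0], np.sort(Xn, axis=0)[:, 1]))
```

After normalization every column is a permutation of the same reference vector, which is exactly the property design-aware procedures relax when class or covariate structure should be preserved.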

Presenting Author

Christina Pinkston, University of Louisville and Biostats, Health Inform & Data Sci, University of Cincinnati College of Medicine

First Author

Christina Pinkston, University of Louisville and Biostats, Health Inform & Data Sci, University of Cincinnati College of Medicine

CoAuthor(s)

Shesh Rai, Biostats, Health Inform & Data Sci, University of Cincinnati College of Medicine
Matthew Cave, University of Louisville

10. Mixed Integer Programming for Feature Selection in Scalar-on-Function Regression

Feature selection is a critical challenge in model selection, particularly for functional data, where appropriate statistical methodologies remain underdeveloped. This study investigates the application of Mixed Integer Programming (MIP) combined with information criteria for best feature subset selection in scalar-on-function regression (i.e., regression models where predictors are curves). Transforming the functional regression problem into a classic linear model framework with grouped variables allows the use of model selection criteria such as Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), and Generalized Cross Validation (GCV), in combination with MIP. In simulation studies, we compared our MIP method to alternative approaches and found that it consistently identifies truly active features. 
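For small numbers of candidate features, the best-subset problem that MIP solves at scale can be illustrated by exhaustive enumeration under an information criterion. The sketch below uses plain BIC on a simulated linear model as a stand-in (it omits the grouped-variable structure and the MIP solver of the actual method):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
n, p = 200, 8
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(0, 0.5, n)  # active set {0, 3}

def bic(S):
    """BIC of the least-squares fit on feature subset S (plus intercept)."""
    Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in S])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ beta) ** 2)
    return n * np.log(rss / n) + (len(S) + 1) * np.log(n)

# Enumerate all subsets up to size 3 and keep the BIC minimizer.
best = min((S for k in range(1, 4) for S in combinations(range(p), k)), key=bic)
print(best)
```

Enumeration costs O(2^p) and becomes infeasible quickly; formulating the same subset search as a mixed integer program is what makes the criterion-based selection tractable for larger or functional (grouped) predictors.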

Presenting Author

Asha Pantula

First Author

Asha Pantula

CoAuthor(s)

Luca Frigato, Università di Torino
Ana Kenney, University of California – Irvine
Marzia Cremona, Universite Laval

Withdrawn - 11. "From Local to Live: Taking Analyses into Production"

In many research applications, analyses are generated through local scripts or notebooks, often built for one-time use and confined to a single machine. While effective in the short term, these workflows limit reproducibility, collaboration, and broader scientific impact. In this poster, I address the importance of transitioning these isolated efforts into robust, shareable tools. I explore the benefits of deploying reusable, open-source, and publicly accessible software, and highlight how productionizing analysis workflows enhances reproducibility, fosters collaboration, and enables researchers to build on each other's work more efficiently. Drawing from my experience within the Statistical Engineering Division of the National Institute of Standards and Technology, I present a practical overview of the systems, practices, and infrastructure that I use to take local code into production. This includes containerization, API development, continuous integration, and cloud deployment strategies that support sustainable, scalable research software. By showcasing real-world examples, I hope to inspire other teams to consider the lifecycle of their analytical work, from isolated, often scattered workflows to living, maintained software that can support ongoing and future research. 

Presenting Author

David Newton

First Author

David Newton

12. Consensus Dimension Reduction via Data Integration

A plethora of dimension reduction methods have been developed to visualize high-dimensional data in low dimensions. However, different dimension reduction methods often output different visualizations, and many challenges make it difficult for researchers to determine which visualization is best. We thus propose a novel consensus dimension reduction framework, which summarizes multiple visualizations into a single "consensus" visualization. We leverage ideas from data integration to identify the patterns that are most stable or shared across the many different dimension reduction visualizations and subsequently visualize this shared structure in a single low-dimensional plot. We demonstrate through extensive simulations and real-world case studies that this consensus visualization effectively identifies and preserves the shared low-dimensional data structure. We further highlight our method's robustness to the choice of dimension reduction method and/or hyperparameters, a highly desirable property when working towards trustworthy and reproducible data science. 
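One simple baseline for combining embeddings, loosely related to but much cruder than the data-integration framework proposed here, is to Procrustes-align each visualization to a reference and average the aligned copies (simulated toy embeddings; illustrative only):

```python
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(6)
base = rng.standard_normal((50, 2))            # shared 2-D structure
# Three "dimension reduction outputs": rotated, rescaled, noisy copies of base.
embeddings = []
for _ in range(3):
    theta = rng.uniform(0, 2 * np.pi)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    embeddings.append(base @ R * rng.uniform(0.5, 2) + rng.normal(0, 0.05, (50, 2)))

# Align every embedding to the first (removing rotation/scale), then average.
aligned = [procrustes(embeddings[0], E)[1] for E in embeddings]
consensus = np.mean(aligned, axis=0)
print(consensus.shape)
```

Averaging after alignment cancels the arbitrary rotations and scalings that make raw embeddings incomparable, which is the basic obstacle any consensus method has to address.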

Presenting Author

Bingxue An, University of Notre Dame

First Author

Bingxue An, University of Notre Dame

CoAuthor

Tiffany Tang

Withdrawn - 13. Decomposing the Spectral Density of ARMA Models to Describe Quasi-Periodic Oscillations in X-ray Binary Systems

Astronomers aim to model the power spectral density (PSD) of X-ray binary systems, focusing on quasi-periodic oscillations (QPOs), narrow peaks in the frequency domain that reveal processes near compact objects such as black holes or neutron stars. A common method involves computing the periodogram and fitting a sum of Lorentzian functions to estimate spectral peaks (e.g., Uttley et al., 2014; Pawar et al., 2015; Malzac et al., 2018). While widely used, this two-step approach separates estimation from modeling, limiting coherence and interpretability.

We propose a statistically grounded alternative using autoregressive moving average (ARMA) models to represent the PSD directly. We show that the spectral density of an ARMA(p,q) process can be analytically decomposed into s component functions, where s depends on the nature of the roots (real or complex) of the autoregressive polynomial. Each component corresponds to a spectral peak, and we derive closed-form expressions for their frequencies, offering an interpretable alternative to Lorentzian fitting. To illustrate this methodology, we estimate an ARMA model using its state-space form and the Kalman filter. We apply it to data from the Rossi X-ray Timing Explorer (RXTE) light curve of the binary system XTE J1550–564. The resulting decomposition captures both broadband variability and QPO features, showing the model's ability to represent complex astrophysical signals in a coherent way.

Though motivated by astrophysics, this framework applies to fields where spectral peaks matter, such as neuroscience, geophysics, and engineering, offering a flexible, analytically grounded approach to frequency-domain analysis. This work, developed with Dr. Giovanni Motta (Columbia University) and Dr. Malgorzata Sobolewska (Smithsonian Astrophysical Observatory), reflects a multidisciplinary, women-led collaboration advancing statistical innovation at the interface of data science and astronomy. 

Presenting Author

Darlin Soto, Universidad del Bío Bío

First Author

Darlin Soto, Universidad del Bío Bío

CoAuthor(s)

Giovanni Motta, Columbia University
Malgorzata Sobolewska

14. Evaluating variable selection methods to detect DNA methylation patterns predictive of longitudinal outcomes: The impact of ignoring within-subject correlation

DNA methylation (DNAm) is a promising biomarker for quantifying biological aging and has been shown to be predictive of various health outcomes, including mortality, frailty, and chronic disease risk. However, its high dimensionality and longitudinal measurement structure present challenges for variable selection and model interpretation. This study assesses the performance of three penalized regression methods: LASSO with minimum cross-validation error (LASSO-min), LASSO with the one-standard-error rule (LASSO-1se), and the Sparse Penalized Linear Mixed Model (SPLMM). We use simulation data that emulate key features of longitudinal DNAm studies, such as random slopes and missing data. LASSO methods assume independence across observations and do not account for within-subject correlation, whereas SPLMM incorporates random effects to explicitly model such dependencies. Performance was evaluated using sensitivity, specificity, mean squared error, and the Matthews correlation coefficient. Simulation-based results suggest that LASSO-1se achieves a desirable balance between predictive accuracy and model sparsity, even in the presence of moderate serial correlation. While SPLMM appropriately models within-subject correlation, its computational burden is substantially higher and it does not yield consistent gains. Real data applications will consider the selection of DNAm patterns to predict longitudinal outcomes, including settings subject to loss to follow-up. 
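For readers who want to reproduce the LASSO-min versus LASSO-1se distinction: scikit-learn's LassoCV selects the minimum-error penalty, but the one-standard-error choice can be computed from its cross-validation error path. A sketch on simulated data (unrelated to the DNAm application itself):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(7)
n, p = 150, 30
X = rng.standard_normal((n, p))
y = X[:, 0] - X[:, 1] + rng.normal(0, 1, n)   # two active features

cv = LassoCV(cv=5, random_state=0).fit(X, y)  # cv.alpha_ is the LASSO-min penalty
mean_mse = cv.mse_path_.mean(axis=1)          # mse_path_: (n_alphas, n_folds)
se = cv.mse_path_.std(axis=1) / np.sqrt(cv.mse_path_.shape[1])
i_min = mean_mse.argmin()
# 1se rule: the largest (most penalized, hence sparsest) alpha whose mean CV
# error is within one standard error of the minimum.
ok = mean_mse <= mean_mse[i_min] + se[i_min]
alpha_1se = cv.alphas_[ok].max()
print(alpha_1se >= cv.alpha_)
```

Since the minimizing alpha always satisfies its own one-SE threshold, alpha_1se is at least as large as cv.alpha_, yielding an equally or more sparse model.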

Presenting Author

Wenjing Liu

First Author

Wenjing Liu

CoAuthor

Fernanda Schumacher, Ohio State University

15. Bridging the Gap: Complementary Data Science Skills That Drive Impact in the Industry

In industry, building accurate models is only half the job; driving real product impact requires a strong understanding of the "why" behind the problem, product thinking, and effective communication. This talk explores how data science education can be complemented to better prepare professionals for these challenges after their studies. Drawing on my industry experience as a Senior Data Scientist after my Ph.D., I'll highlight the importance of intuitive model understanding, business acumen, project management, stakeholder communication, and real-world data interpretation, skills that helped me drive millions in revenue. Attendees will gain practical insights into how these skills can elevate technical training and how they can acquire them. 

Presenting Author

Gunjan Mahindre

First Author

Gunjan Mahindre

16. FireGen: Quantifying Wildfire Risk by Simulation

The extent and severity of new wildfires continue to devastate our planet, leaving lasting impacts on human health, property, water quality, essential services, and biodiversity, and contributing to global climate change.

The goal of this project is to quantify how geophysical characteristics affect wildfire risk and variability. We assess the risk by creating a "wildfire generator" that simulates burned areas under changing climate conditions. We propose using spatial point process models to describe the distribution of burned area events in Northern California. This approach allows environmental factors, such as climate influences, to inform our model of potential fire occurrences. We then simulate burned areas as ellipses, using parameters derived from the fitted characteristics of observed wildfire shapes. Future work will involve developing risk metrics related to infrastructure exposure, such as calculating the total burned area and estimating the number of buildings impacted under various simulated wildfire scenarios. 
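Inhomogeneous spatial point processes of the kind proposed can be simulated by thinning a homogeneous Poisson process: generate points at the maximum rate, then keep each with probability proportional to the local intensity. A sketch with a toy intensity surface (illustrative only, not the fitted Northern California model):

```python
import numpy as np

rng = np.random.default_rng(8)

def intensity(x, y):
    """Toy ignition intensity: higher in the 'dry' northeast corner of the unit square."""
    return 50 * np.exp(-((x - 0.8) ** 2 + (y - 0.8) ** 2) / 0.1)

lam_max = 50.0                        # upper bound on the intensity surface
# Step 1: homogeneous Poisson process with rate lam_max on the unit square.
N = rng.poisson(lam_max)
pts = rng.random((N, 2))
# Step 2: thin each point, keeping it with probability intensity / lam_max.
keep = rng.random(N) < intensity(pts[:, 0], pts[:, 1]) / lam_max
fires = pts[keep]
print(len(fires), "simulated ignition points")
```

In practice the intensity would be driven by climate and geophysical covariates, and each retained point would seed a simulated burned-area ellipse.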

Presenting Author

Allyson Hineman

First Author

Allyson Hineman

17. Bayesian Copula Factor Models for Mixed-Type Time Series Data

We present a Bayesian Copula Factor Autoregressive (BCFAR) model adapted to analyze multivariate time series data with mixed variable types. This approach captures dynamic relationships among macroeconomic indicators and stock market indices by modeling both main effects and interactions through latent factors. The BCFAR framework integrates copula functions with quadratic autoregression to flexibly accommodate conditional dependence structures across continuous and discrete covariates. To improve computational scalability, we adopt a semiparametric extended rank likelihood and develop an efficient MCMC algorithm combining Metropolis-Hastings and Forward Filtering Backward Sampling within a Gibbs sampler. Simulation studies and real-world macroeconomic data analysis demonstrate the accuracy and efficiency of our methods. 
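The rank-based step underlying the extended rank likelihood maps mixed-type margins to normal scores before modeling dependence. A simplified Gaussian-copula sketch of that idea (illustrative; not the BCFAR sampler, and the data below are simulated):

```python
import numpy as np
from scipy.stats import norm, rankdata

rng = np.random.default_rng(9)
n = 400
# Mixed-type pair driven by a latent Gaussian with correlation 0.7:
z = rng.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], size=n)
cont = np.exp(z[:, 0])               # continuous, skewed margin
disc = (z[:, 1] > 0).astype(int)     # discrete (binary) margin

def to_normal_scores(v):
    """Rank-based pseudo-observations mapped through the normal quantile."""
    u = rankdata(v) / (len(v) + 1)   # ranks scaled into (0, 1); ties averaged
    return norm.ppf(u)

s1, s2 = to_normal_scores(cont), to_normal_scores(disc)
rho_hat = np.corrcoef(s1, s2)[0, 1]
print(round(rho_hat, 2))
```

Because the transform depends only on ranks, it sidesteps specifying the marginal distributions, which is what makes the approach attractive for mixed continuous and discrete series; the discretized margin attenuates but does not hide the latent dependence.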

Presenting Author

Samira Zaroudi, CUNY, John Jay College of Criminal Justice

First Author

Samira Zaroudi, CUNY, John Jay College of Criminal Justice

CoAuthor(s)

Hadi Safari Katesari
Seyed Yaser Samadi, Southern Illinois University-Carbondale

18. Understanding the Complex Association Between School-Based Mental Health Services and K-2 Early Reading Achievement in 2021-2024 using Longitudinal Data

School closures in 2020-2021 negatively impacted learning trajectories for young children. To facilitate academic recovery, one large, urban school district allocated additional resources in the post-closure years for direct school-based mental health services. We describe an evaluation of the impact of receiving such services on a student's early literacy trajectory. This presentation will also emphasize the common challenges faced in causal inference and the process for selecting a rigorous statistical analysis plan. Analytical challenges arise since the probability of receiving the intervention is substantially higher among students with lower literacy and slower trajectories with limited information for balancing. The selected method leverages a student's own pre-intervention trajectory to address the counterfactual question of whether the initiation and accumulation of services translated to better gains in literacy. Administrative data included 691,742 early literacy assessments completed by students in grades K-2 in three consecutive years. A linear mixed effects model was fit including student grade, an indicator of whether an initial service already occurred, cumulative services to date, and assessment time. Detailed trajectories were estimated using interactions and random effects accounted for nesting and repeated measures. Growth from beginning to middle and end of year was slower among students receiving services. Higher total number of prior services also translated to lower beginning of year scores. However, as services accumulated within a given year, the number of services received was positively correlated with the rate of increase in literacy. This study builds a flexible model to disentangle the impact of mental health service accumulation from the increased propensity for receiving services among students with lower reading achievement. The results display a promising pattern of recovery through the use of school-based mental health services. 
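A stripped-down version of such a growth model, random student intercepts with a services-by-time interaction, can be fit with statsmodels (simulated toy data; the actual analysis additionally includes grade, a service-initiation indicator, cumulative services, and richer random effects):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(10)
n_students, n_times = 200, 3
sid = np.repeat(np.arange(n_students), n_times)
time = np.tile(np.arange(n_times), n_students)       # beginning/middle/end of year
services = rng.poisson(1, n_students)[sid]           # toy cumulative services
intercepts = rng.normal(0, 1, n_students)[sid]       # student random intercepts
score = 50 - 2 * services + 5 * time + 0.5 * services * time \
        + intercepts + rng.normal(0, 1, len(sid))
df = pd.DataFrame(dict(sid=sid, time=time, services=services, score=score))

# Random-intercept growth model; the interaction captures whether accumulating
# services is associated with a faster within-year literacy trajectory.
fit = smf.mixedlm("score ~ time * services", df, groups=df["sid"]).fit()
print(fit.params["time:services"].round(2))
```

The negative main effect of services alongside a positive time-by-services interaction mirrors the pattern reported: students receiving services start lower but gain faster as services accumulate.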

Presenting Author

Naomi Wilcox, UCLA Jane and Terry Semel Institute for Neuroscience and Human Behavior

First Author

Naomi Wilcox, UCLA Jane and Terry Semel Institute for Neuroscience and Human Behavior

CoAuthor(s)

Hilary J. Aralis, Public Health Sciences, School of Medicine, University of California, Davis
Patricia Tan, UCLA Jane and Terry Semel Institute for Neuroscience & Human Behavior, VA Greater LA Health System
Alyssa R. Palmer, UCLA Jane and Terry Semel Institute for Neuroscience & Human Behavior
Sheryl H. Kataoka Endo, UCLA Jane and Terry Semel Institute for Neuroscience & Human Behavior
Alison Wood, UCLA Jane and Terry Semel Institute for Neuroscience & Human Behavior, VA Greater LA Health System
Roya Ijadi-Maghsoodi, UCLA Jane and Terry Semel Institute for Neuroscience & Human Behavior, VA Greater LA Health System