SPAAC Poster Competition — Topic Contributed Poster Presentations

Shirin Golchi, Chair
McGill University
 
Monday, Aug 4: 2:00 PM - 3:50 PM
4075 
Contributed Posters 
Music City Center 
Room: CC-Hall B 

Main Sponsor

Business Analytics/Statistics Education Interest Group
Business and Economic Statistics Section
Casualty Actuarial Society
Scientific and Public Affairs Advisory Committee

Presentations

01: A Framework for Loss Scale Determination in General Bayesian Updating

General Bayesian updating (GBU) is a framework for updating prior beliefs about the parameter of interest to a posterior distribution via a loss function, without imposing a distributional assumption on the data. In recent years, the asymptotic distribution of the loss-likelihood bootstrap (LLB) sample has become the standard for determining the loss scale parameter, which controls the relative weight of the loss function against the prior in GBU. However, the existing method fails to account for the prior distribution because it relies on the asymptotic equivalence between GBU and LLB. To address this limitation, we propose a new finite-sample approach to loss scale determination that uses the Bayesian generalized method of moments (BGMM) as a reference. We develop an efficient algorithm that determines the loss scale parameter by minimizing the Kullback-Leibler divergence between the exact posteriors of GBU and BGMM. We prove the convexity of our objective function, ensuring a unique solution. Asymptotic properties of the proposed method are established to demonstrate its generalizability. We demonstrate the performance of the proposed method through a simulation study and a real data application. 
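
For readers unfamiliar with the framework, a minimal sketch of the general Bayesian update and of the loss-scale calibration criterion described above, in standard (not necessarily the authors') notation, with loss scale \(\omega\), loss \(\ell\), and the BGMM posterior as the reference; the ordering of the divergence is shown for illustration only:

\[
\pi_{\omega}(\theta \mid x_{1:n}) \;\propto\; \pi(\theta)\,\exp\Big\{-\omega \sum_{i=1}^{n} \ell(\theta, x_i)\Big\},
\qquad
\hat{\omega} \;=\; \arg\min_{\omega > 0}\; \mathrm{KL}\big(\pi_{\omega}(\cdot \mid x_{1:n}) \,\big\|\, \pi_{\mathrm{BGMM}}(\cdot \mid x_{1:n})\big).
\]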

Keywords

General Bayesian updating

Loss-likelihood bootstrap

Generalized method of moments

Kullback-Leibler divergence

Monte Carlo Newton-Raphson method 


Co-Author(s)

Seung Jun Park, Department of Statistics, Kyungpook National University
Pamela Kim Salonga, Department of Statistics, Kyungpook National University
Kyeong Eun Lee, Department of Statistics, Kyungpook National University
Gyuhyeong Goh, Department of Statistics, Kyungpook National University

First Author

YU JIN SEO, Department of Statistics, Kyungpook National University

Presenting Author

YU JIN SEO, Department of Statistics, Kyungpook National University

02: A Graph Database Approach for Biomarker Discovery

Traditional statistical and machine learning methods often struggle to capture the complex, interconnected relationships within biological data that enable biomarker discovery. We present a novel graph-based framework that leverages graph neural networks and network-based feature engineering to identify predictive biomarkers. Our approach constructs several biological networks by integrating gene expression data and clinical attributes in a graph database, providing multiple representations of patient-specific relationships. We apply graph learning techniques to this ensemble of graphs to identify candidate biomarkers using hierarchical, feature-based, and filter-based methods. Using three independent datasets, we demonstrate that our method improves predictive performance compared to conventional machine learning models. This scalable and interpretable strategy has broad applications in biomarker discovery across diverse disease domains. 

Keywords

graph database

graph neural network

biomarker

feature engineering

feature selection 


Co-Author(s)

Jason Huse, MD Anderson Cancer Center
Kasthuri Kannan, MD Anderson Cancer Center

First Author

Yang Liu, SPH in The University of Texas Health Science Center at Houston | MD Anderson Cancer Center

Presenting Author

Yang Liu, SPH in The University of Texas Health Science Center at Houston | MD Anderson Cancer Center

03: A multivariate approach to estimating the withdrawal time in food animal species

In the US, the FDA uses linear regression and the non-central t distribution to estimate the upper limit of the 95% CI for the 99th percentile (TLM) and defines the withdrawal time (WDT) as the time at which this TLM falls at or below a safe concentration level (tolerance) following administration of an approved drug, in a labeled or extra-label manner, in food animal species. This procedure involves only the concentrations at or above the limits of detection (LODs) and determines the WDT for each tissue separately. However, the tissues collected from an animal, namely liver, kidney, muscle, and fat, may be correlated, and a multivariate linear regression (MvLR) model appropriately addresses this high inter-tissue correlation. In addition, because only concentrations above the LOD are used, censored observations can distort the correlation or covariance pattern among the tissues and yield biased and imprecise estimators. Therefore, we propose using ordinary least squares (OLS) and generalized least squares (GLS) in the MvLR, and the expectation-conditional maximization (ECM) algorithm in the censored MvLR, together with the multivariate t distribution, to obtain more precise and accurate estimates of the WDT. 

Keywords

TLM

WDT

Multivariate t distribution

OLS

GLS

ECM 


Co-Author(s)

Ronald Baynes, Dr.
Jim Riviere, Professor
Jacqueline Hughes-Oliver, North Carolina State University
Majid Jaberi-Douraki, Professor

First Author

Farha Ferdous Sheela

Presenting Author

Farha Ferdous Sheela

04: Active Safety Surveillance of the 2023–2024 COVID-19 Vaccine Formulation: Findings from VSD

Real-world post-vaccine safety monitoring is crucial for detecting adverse events and maintaining public trust. This study applied Rapid Cycle Analysis (RCA) to assess 2023–2024 COVID-19 vaccines (Pfizer, Moderna, Novavax) for 14 outcomes, including ischemic stroke, GBS, and myocarditis.
VSD data from nine healthcare organizations included 2.7M doses (Sep 2023–Apr 2024). Outcomes were identified in healthcare records. RCA used a concurrent comparator design to compare adverse event rates across risk and comparison intervals. Weekly monitoring with Pocock alpha-spending to control the Type I error enabled real-time safety assessment and reduced bias.
RCA identified a GBS signal after Pfizer in ≥65 yrs (aRR: 4.45, 95% CI: 1.07–22.62) and ischemic stroke signals with Pfizer (18–64 yrs: aRR: 1.48, 95% CI: 1.04–2.11) and Moderna (≥65 yrs: aRR: 1.68, 95% CI: 1.05–2.70). No signals were found for other outcomes.
RCA enables real-time vaccine safety monitoring, addressing limits of traditional comparators. While GBS and stroke signals require further evaluation, the 2023–2024 COVID-19 vaccines show a reassuring safety profile. Ongoing monitoring remains key for public trust and safety. 

Keywords

Vaccine Safety
Post-Vaccination Surveillance
Adverse Events
Rare Events
Signal Detection

Rapid Cycle Analysis (RCA)
Sequential Analysis
Concurrent Comparator Design
Pocock Alpha-Spending
Real-Time Monitoring

Vaccine Safety Datalink (VSD)
ICD-10 Codes
Healthcare Records

Guillain-Barré Syndrome (GBS)
Ischemic Stroke, Myocarditis
Adjusted Rate Ratio (aRR)
Risk and Comparison Intervals

COVID-19 Vaccine
Pfizer
Moderna
Novavax 


Co-Author(s)

Eric Weintraub, CDC
Burney Kieke, Marshfield Clinic Research Institute
Tat’Yana Kenigsberg, CDC
Kimp Walton, CDC
Michael McNeil, CDC
Jonathan Duffy, CDC

First Author

Lily Wang, CDC

Presenting Author

Lily Wang, CDC

05: An Unbiased Convex Estimator for Classical Linear Regression Model Using Prior Information

We propose an unbiased restricted estimator that leverages prior information to enhance estimation efficiency for the linear regression model. The statistical properties of the proposed estimator are rigorously examined, highlighting its superiority over several existing methods. A simulation study is conducted to evaluate the performance of the estimators, and real-world data on total national research and development expenditures by country are analyzed to illustrate the findings. Both the simulation results and real-data analysis demonstrate that the proposed estimator consistently outperforms the alternatives considered in this study. 
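
For context, the classical restricted least-squares estimator that this line of work builds on, under exact linear prior restrictions \(R\beta = r\) (a standard textbook formula shown only as background; the proposed unbiased convex estimator itself is not reproduced here):

\[
\hat{\beta}_{R} \;=\; \hat{\beta}_{\mathrm{OLS}} + (X^{\top}X)^{-1}R^{\top}\big(R(X^{\top}X)^{-1}R^{\top}\big)^{-1}\big(r - R\hat{\beta}_{\mathrm{OLS}}\big).
\]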

Keywords

Linear model

MSE

Unbiased ridge estimator

Restricted least-squares estimator

Multicollinearity 


Co-Author(s)

HM Nayem
B M Golam Kibria, Florida International University

First Author

Mustafa I. Alheety, University of Anbar

Presenting Author

HM Nayem

06: Analyzing Wildfire Patterns in Colorado, Montana, Utah, and Wyoming Using Spatio-Temporal Methods

In recent times, wildfires have posed significant threats to forest ecosystems, human communities, and economic assets. This study applies spatio-temporal techniques to analyze and predict wildfire patterns in the forests of Colorado, Montana, Utah, and Wyoming. Utilizing historical wildfire data, weather conditions, vegetation types, and topographic features, we aim to develop comprehensive models to identify high-risk areas and forecast future wildfire events. By integrating remote sensing data and geographic information systems (GIS), we perform detailed spatio-temporal analyses to uncover underlying patterns and trends. The findings from this research provide valuable insights for forest management, risk assessment, and wildfire mitigation strategies, contributing to more effective resource allocation and community preparedness. 

Keywords

Spatio-temporal analysis

Wildfire patterns

Forest Ecosystems

Risk assessment

Predictive modeling

Remote sensing 


Co-Author

Mostafa Zahed, East Tennessee State University

First Author

Princess Tagoe

Presenting Author

Princess Tagoe

07: Application of Analysis Methods for Vaccine Efficacy in a Dengue Trial

Vaccine efficacy is defined as the reduction in the relative risk of disease, 1 - (Rv/Rp), where Rv and Rp are the incidence rates for the disease of interest in the vaccine and placebo groups, respectively. A conditional exact method proposed by Chan and Bohidar (1998) is often used to estimate the vaccine efficacy and its confidence interval. In this poster, we will compare the conditional exact method with two alternative approaches, Poisson regression and the modified Poisson regression proposed by Zou (2004), using trial data with and without follow-up time adjustment. The dengue epidemiologic distribution across Brazil will be provided via a QR code. 
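
A minimal sketch of the Poisson-regression comparator described above, assuming per-subject case counts, a vaccine-arm indicator, and log follow-up time as an offset (illustrative variable names and simulated data; the conditional exact method and the Zou (2004) modification are not shown):

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
# Illustrative data: one row per subject with case count, arm, and follow-up time.
df = pd.DataFrame({
    "cases":    rng.poisson(0.02, 2000),
    "vaccine":  np.repeat([1, 0], 1000),         # 1 = vaccine, 0 = placebo
    "followup": rng.uniform(0.5, 2.0, 2000),     # person-years at risk
})

# Poisson regression with log follow-up time as an offset models the incidence rate.
X = sm.add_constant(df["vaccine"])
fit = sm.GLM(df["cases"], X, family=sm.families.Poisson(),
             offset=np.log(df["followup"])).fit()

rate_ratio = np.exp(fit.params["vaccine"])       # Rv / Rp
lo, hi = np.exp(fit.conf_int().loc["vaccine"])   # 95% CI for the rate ratio
print(f"VE = {1 - rate_ratio:.3f}, 95% CI = ({1 - hi:.3f}, {1 - lo:.3f})")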

Keywords

Vaccine efficacy

conditional exact method

Poisson regression

modified Poisson regression 


Co-Author

Jung-Jin Lee, MSD

First Author

Tulin Shekar, Merck & Co., Inc.

Presenting Author

Tulin Shekar, Merck & Co., Inc.

08: Bayesian Kernel Machine Regression Model for Analysing Sequence Count Data

RNA sequencing (RNA-Seq) is a powerful technology for quantifying gene expression and identifying genes influenced by environmental exposures or other treatments. There is currently interest in how mixtures of multiple environmental chemicals affect gene expression using RNA-Seq data. There are many popular methods for RNA-Seq analysis; however, none focus on correlated environmental exposures. The existing Bayesian kernel machine regression (BKMR) effectively analyses mixture effects for continuous outcomes but cannot handle the unique challenges of RNA-Seq count data. We develop BKMRSeq, a novel BKMR model tailored for RNA-Seq count data to address this gap. BKMRSeq uses Polya-Gamma augmentation within a Markov chain Monte Carlo (MCMC) framework to estimate a complex non-linear association between exposures and gene expression using a Gaussian kernel matrix and select genes that are differentially expressed. Through simulation studies, BKMRSeq demonstrates superior performance compared to the existing methods to analyze gene expression when there is a complex exposure-response relation. We further validate its utility by applying it to real-world RNA-Seq data. 

Keywords

RNA-Seq

Correlated environmental exposures

BKMR

Polya-gamma distribution

Data augmentation 


Co-Author

Ander Wilson, Colorado State University

First Author

Mosammat Sonia Khatun

Presenting Author

Mosammat Sonia Khatun

09: Bayesian Likelihood-free Inference with High-dimensional Data

With the growing availability of high-dimensional data, variable selection has become an inevitable step in regression analysis. Traditional Bayesian inference, however, depends on correctly specifying the likelihood, which is often impractical. The loss-likelihood bootstrap (LLB) has recently gained attention as a tool for likelihood-free Bayesian inference. In this paper, we aim to overcome the limited applicability of LLB for high-dimensional regression problems. To this end, we develop a likelihood-free Markov Chain Monte Carlo Model Composition (MC3) method. Traditional MC3 requires marginal likelihoods, which are not available in our likelihood-free setting. To address this, we propose a novel technique that utilizes the Laplace approximation to estimate marginal likelihood ratios without requiring explicit likelihood evaluations. This advancement allows for efficient and accurate model comparisons within the likelihood-free context. Our proposed method is applicable to various high-dimensional regression methods including machine learning techniques. The performance of the proposed method is examined via simulation studies and real data analysis. 

Keywords

Bayesian variable selection

High-dimensional regression

Likelihood-free Bayesian inference

Loss-likelihood bootstrap (LLB) 


Co-Author(s)

Gyuhyeong Goh, Department of Statistics, Kyungpook National University
Dipak Dey, University of Connecticut

First Author

Minhye Park, Kyungpook National University

Presenting Author

Minhye Park, Kyungpook National University

10: Benchmarking Spatial Co-Localization Methods for Single-Cell Multiplex Imaging Data of Cancers

Single-cell multiplex imaging (scMI) measures cell locations and phenotypes in tissues, enabling insights into the tumor microenvironment. In scMI studies, quantifying spatial co-localization of immune cells and its link to clinical outcomes, such as survival, is crucial. However, it is unclear which spatial indices have sufficient power to detect within-sample co-localization and its association with outcomes. This study evaluated six frequentist spatial co-localization metrics using simulated data to assess their power and Type I error. Additionally, these metrics were applied to two scMI datasets, high-grade serous ovarian cancer (HGSOC) and triple-negative breast cancer (TNBC), to detect co-localization between cell types and its relation to survival. Simulations showed Ripley's K had the highest power, followed by pair correlation g, while other metrics exhibited low power. In cancer studies, Ripley's K, pair correlation g, and the scLMM index were most effective in detecting within-sample co-localization and associations with survival, highlighting their utility in scMI analyses. 
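
A minimal sketch of the kind of cross-type Ripley's K statistic evaluated above, computed naively for two cell types within one sample (no edge correction, toy coordinates; the study's actual implementation and the other five metrics are not reproduced):

import numpy as np

def cross_k(pts_a, pts_b, radii, area):
    """Naive cross-type Ripley's K: for each radius r, the average number of
    type-B cells within distance r of a type-A cell, scaled by B's intensity.
    No edge correction is applied."""
    lam_b = len(pts_b) / area                        # intensity of type B
    d = np.sqrt(((pts_a[:, None, :] - pts_b[None, :, :]) ** 2).sum(-1))
    return np.array([(d < r).sum() / (len(pts_a) * lam_b) for r in radii])

rng = np.random.default_rng(0)
a = rng.uniform(0, 1, size=(200, 2))                 # e.g. tumor cell coordinates
b = rng.uniform(0, 1, size=(150, 2))                 # e.g. immune cell coordinates
radii = np.linspace(0.01, 0.2, 20)
k_obs = cross_k(a, b, radii, area=1.0)
# Under complete spatial randomness K(r) is roughly pi * r^2, so positive
# differences suggest co-localization at that spatial scale.
print(k_obs - np.pi * radii ** 2)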

Keywords

Co-Clustering

Multiplex Imaging

Spatial Biology

Spatial Proteomics 


Co-Author

Simon Vandekar, Vanderbilt University

First Author

Ishaan Gadiyar, Vanderbilt University Medical Center

Presenting Author

Ishaan Gadiyar, Vanderbilt University Medical Center

11: Can You Learn Statistics Online with an AI Tutor? A First Look at the BEAR!

Artificial Intelligence (AI) is showing up in more and more parts of our lives, including college classrooms. But can an AI actually help students learn tough subjects like statistics? This study takes a first look at the BEAR (Biostatistical Education Assistant and Resource), an AI tutor created using ChatGPT to help students in an online Biostatistics course at the University of Alabama. The BEAR is designed to help students understand challenging concepts, stimulate deeper thinking through thoughtful questioning, and provide supportive guidance similar to that of a real teacher. A key feature is its two-way interactivity, where both the BEAR and students can ask and answer questions, making the learning process more engaging and potentially more effective. At the end of the Spring 2025 semester, students will be surveyed to find out how helpful they found the BEAR, whether it improved their understanding, and how satisfied they were with the experience. This pilot study aims to share early insights into how AI tools like the BEAR might shape the future of online statistics education. 

Keywords

AI

ChatGPT

Online Statistical Education

AI Tutoring 


Co-Author

Subhabrata Chakraborti, The University of Alabama

First Author

Yuhui Yao, The University of Alabama

Presenting Author

Yuhui Yao, The University of Alabama

12: Deep Learning Models for Forecasting Treasury Bill Rates Using Gasoline Prices and Gold EPI

Artificial intelligence has given researchers access to a broader range of historical data, enabling the development of more accurate predictive models in finance. This study investigates the relationship between treasury bill rates and two high-demand commodities: gasoline and gold. We compiled a comprehensive dataset spanning ten years (July 2014-July 2024) of treasury bill rates, gasoline prices, gold EPI, natural gas consumption, and vital economic indicators such as GDP, inflation rates, and interest rates. Our analysis reveals that natural gas consumption exhibits the highest positive correlation (0.64) with treasury bill rates, while gasoline prices and the export price index of gold show moderate positive correlations (0.51 and 0.52, respectively). We trained a machine learning model (Artificial Neural Network) and three deep learning models (Recurrent Neural Network, Convolutional Neural Network, and 3D Convolutional Neural Network) on these data. Their performance was assessed using mean absolute error, mean squared error, root mean squared error (RMSE), and the R-squared score. The Recurrent Neural Network model performed best with an RMSE of 0.08. 

Keywords

Deep Learning

Recurrent Neural Network

Treasury Bill Rate

3D Convolutional Neural Network

Artificial Neural Network

Machine Learning 


Co-Author

Mary Akinyemi, Austin Peay State University

First Author

JaNiah Harris

Presenting Author

JaNiah Harris

13: Demographic Representativeness of the RADx-Up Cohort Compared to the U.S. Census Data

Wastewater surveillance is a promising tool for tracking COVID-19, but its effectiveness in underserved populations, who may experience disproportionately severe illness, has not been fully established. The NIH-funded RADx-UP program, comprising 144 projects across the U.S., aims to expand COVID-19 testing accessibility, particularly in hard-hit areas. The utility of wastewater surveillance in underserved communities can be assessed by comparing its findings with screening data from RADx-UP, which includes both symptomatic and asymptomatic individuals. However, this assessment requires RADx-UP to be representative of the U.S. population, a criterion that can be evaluated through generalizability analysis. In this study, we report participant characteristics and compute a generalizability score, which quantifies how well a study sample reflects a target population of interest based on demographic and clinical characteristics, at the Federal Information Processing Standard (FIPS) geography code level. With over 350,000 participants across 1,005 FIPS codes, we anticipate that major metropolitan areas have sufficient data for generalizable estimates, especially in underserved counties. 

Keywords

causal inference study

propensity score model

multiple imputation 


Co-Author(s)

Meri Varkila, Stanford University
Nivetha Subramanian, Stanford University
Julie Parsonnet, Stanford University
Shuchi Anand, Stanford University
Maria Montez-Rath, Stanford University
Glenn Chertow, Stanford University

First Author

Xue Yu, Stanford University

Presenting Author

Xue Yu, Stanford University

14: Detecting Monotone Data Drift in Galaxy Luminosities

The luminosity function is a fundamental concept in astrophysics and cosmology, describing the distribution of luminosities within a group of astronomical objects, such as galaxies. In practice, a monotone data drift (selection bias) may exist when luminosity data are collected: at the same distance, stars and galaxies with larger luminosities have a higher chance of being observed. This poses challenges for standard estimation procedures. Ignoring this bias can invalidate the results, while procedures that account for it may be inefficient when no selection bias is present. This poster introduces a semi-parametric procedure for detecting monotone drifts in data with unknown parameters, with an application to a real dataset of galaxy luminosities. 

Keywords

selection bias

semiparametric

luminosity function 


Co-Author(s)

Jiayang Sun, George Mason University
Mary Meyer, Colorado State University
Michael Woodroofe, University of Michigan

First Author

Zixiang Xu

Presenting Author

Zixiang Xu

15: Detection of multiple change points for non-stationary network autoregressive models

The Network Autoregression (NAR) model is widely used for analyzing network-dependent data. However, assuming fixed parameters over time is often unrealistic in dynamic systems. Identifying time points where NAR parameters shift, known as change points, is crucial for capturing structural changes in the network process. This work proposes a rolling-window approach to detect these change points efficiently. The method adapts to evolving network structures and parameter variations, improving the model's flexibility in real-world applications. Simulation studies demonstrate the effectiveness of the proposed approach, and its applicability is further illustrated using real-world network data. 

Keywords

Change Point Detection

Network Autoregressive (NAR) Model

Network-Dependent Data

Non-Stationary Processes

Structural Changes 


Co-Author

Abolfazl Safikhani, University of Florida

First Author

Ruishan Lin

Presenting Author

Ruishan Lin

16: Developing NLP AND Supervised Machine Learning Techniques to Classify Mars Tasks

As NASA's Human Research Program (HRP) prepares for long-duration Mars missions, understanding astronaut tasks is crucial. This study, conducted at NASA Glenn Research Center (GRC), applied Natural Language Processing (NLP) and machine learning to classify 1,058 Mars tasks into 18 Human System Task Categories (HSTCs) [1]. We developed an NLP model using Google's BERT [2] to capture semantic and syntactic nuances. Supervised training on a subset of tasks improved classification accuracy for 9 of 18 HSTCs, especially when incorporating HSTC descriptions. To address severe class imbalance, we introduced weighting and sampling techniques for data augmentation [3]. Additionally, we fine-tuned BERT to implement a pairwise relatedness scoring method, enabling task clustering and progressing toward unsupervised labeling. This presentation covers data preprocessing, key syntax extraction using BERT, and supervised classification, highlighting NLP's potential for analyzing Mars mission tasks in crew health and performance studies. 

Keywords

Natural Language Processing

Machine Learning

BERT (Bidirectional Encoder Representations from Transformers)


NASA Human Research Program (HRP)

Mars Task Classification

Supervised Learning 


Co-Author(s)

Mona Matar, NASA Glenn Research Center
Hunter Rehm, HX5 LLC, NASA Glenn Research Center

First Author

Adam Kurth, Arizona State University - Biodesign Institute

Presenting Author

Adam Kurth, Arizona State University - Biodesign Institute

17: Distributed Learning for Whole-Brain Functional Connectivity Analysis in Resting-State fMRI

Resting-state fMRI (rfMRI) is a powerful tool for characterizing brain-related phenotypes, but current approaches are often limited in their ability to efficiently capture the heterogeneous nature of functional connectivity due to lack of robust, scalable statistical methods. In this paper, we propose a new distributed learning framework for modeling the voxel-level dependencies across the whole brain, thus avoiding the need to average voxel outcomes within each region of interest. In addition, our method addresses confounder heterogeneity by integrating subject-level covariates in the estimation, which allows for comparing functional connectivity across diverse populations. We demonstrate the effectiveness and scalability of our approach in handling large rfMRI outcomes through simulations. Finally, we apply the proposed framework to study the association between brain connectivity and autism spectrum disorder (ASD), uncovering connectivity patterns that may advance our understanding of ASD-related neural mechanisms. 

Keywords

rfMRI

functional connectivity

voxel-level dependencies

distributed learning

ASD

whole-brain modeling 


Co-Author(s)

Emily Hector, North Carolina State University
Brian Reich, North Carolina State University

First Author

Wei Zhao

Presenting Author

Wei Zhao

18: Dynamic Functional Mediation Model in Geospatial Studies

Although causal effects can be modified by interactions between exposure and confounders, such interactions are difficult to capture in a mediation model. Function-on-scalar models with functional responses and scalar covariates have studied the dynamic relationship between functional responses and covariates, or the relationship among functional responses depending on time and covariates, but they do not capture the interaction effects between exposure and confounders. To explore the dynamic effects of a single index that includes dynamic confounders, we propose a dynamic functional mediation model. Simulation studies evaluate the performance and validity of the proposed model, with estimation and inference for causal estimands carried out using the wild bootstrap method. We apply our approach to country-level datasets, adjusting for dependence between countries by estimating the spatial dependence. The pathways of individual indirect effects from exposure to outcome through the geospatial mediator can thus be investigated without linearity assumptions or dependency issues. 

Keywords

COVID-19

dynamic interaction semiparametric function-on-scalar model

mediation analysis

penalized scalar-on-function linear regression 


Co-Author

Chao Huang, University of Georgia

First Author

Miyeon Yeon, The University of Tennessee Health Science Center

Presenting Author

Miyeon Yeon, The University of Tennessee Health Science Center

19: Dynamic Visualization of Complex Space-Time Processes Applied to Daily PM2.5 Concentrations

Recent studies link air pollution exposure to adverse human and environmental health outcomes. It is critical to identify time intervals and spatial regions where such exposure risks are high. For fine particles we need to be able to visualize and model effects at various quantiles, in addition to mean effects, since people are more adversely affected by excessive levels of pollution.
We propose versatile tools to describe and visualize quantiles of data with wide-ranging spatial-temporal structures and various degrees of missingness. We illustrate this methodology through dynamic visualization of spatial-temporal patterns to provide useful insights into where and when the process changes. This approach does not require strong theoretical assumptions and is useful for guiding future modeling efforts.
This statistical framework is applied to daily PM2.5 concentrations for the years 2020-24 collected at 108 locations across NY, NJ, and PA. We show how PM2.5 exposure risks evolve over space and time, identifying possible clusters. Our approach demonstrates the importance of effective dynamic visualizations of complex spatial-temporal datasets, with plans to expand the analysis to further regions. 

Keywords

Statistical Modeling

PM2.5 Pollution

Data Modeling 


Co-Author(s)

Dana Sylvan, Hunter College, City University of New York
Peter Craigmile, Hunter College, CUNY

First Author

Danielle Elterman

Presenting Author

Danielle Elterman

20: Enhancing Educational Approaches in Teaching Regression Techniques

This presentation highlights the impact of the COS Instructional Grant in enhancing the educational experience of undergraduate and graduate students in Regression Analysis courses. Our project aimed to improve students' statistical skills, encourage critical thinking in real-world problem-solving, and help them communicate complex concepts clearly. Students explored data by applying regression analysis to scenarios like studying the relationship between physical characteristics and systolic blood pressure while identifying challenges like outliers. While students mastered technical skills, the project revealed a need to improve critical thinking and data interpretation, especially in recognizing anomalies. We will discuss current and future efforts, including earlier integration of critical thinking through visualizations, creating separate course sections for different student backgrounds, and funding undergraduate research to develop statistical tools like an R package for mixed models. Our initiatives aim to boost student engagement, improve course alignment, and equip students with technical and transferable skills for careers in data analysis and beyond. 

Keywords

Statistical Education

Regression Analysis

Course Enhancement 


Co-Author

Anne Driscoll

First Author

Sierra Merkes, Virginia Tech Statistics Department

Presenting Author

Sierra Merkes, Virginia Tech Statistics Department

21: Enhancing SQL Code Efficiency in Insurance Data with LLMs: A Repeated Measures Approach

This project evaluates the effectiveness of an LLM-driven (Large Language Model) tool for SQL documentation and programming language conversion/SQL code generation. The experiment tests the LLM tool with code samples at three complexity levels (beginner, intermediate, and advanced) under three prompt conditions: minimally defined, moderately defined, and extremely defined. Raters will assess the LLM-generated outputs using a pre-set rubric. The statistical analysis will employ a Repeated Measures ANOVA to determine the impact of the experimental conditions on the tool's performance. Inter-rater reliability will be assessed using Shrout and Fleiss intraclass correlations to gauge evaluation consistency. 
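
A minimal sketch of the planned analysis, assuming a long-format table of rubric scores with one row per rater, code sample, and condition (illustrative column names and simulated scores; statsmodels' AnovaRM and pingouin's intraclass_corr are assumed tooling choices, which the abstract does not specify):

import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM
import pingouin as pg

rng = np.random.default_rng(0)
# Every code sample rated by every rater under each complexity x prompt condition.
samples = [f"s{i}" for i in range(12)]
complexity = ["beginner", "intermediate", "advanced"]
prompts = ["minimal", "moderate", "extreme"]
raters = ["r1", "r2", "r3"]
rows = [{"sample_id": s, "complexity": c, "prompt": p, "rater": r,
         "score": rng.integers(1, 6)}
        for s in samples for c in complexity for p in prompts for r in raters]
ratings = pd.DataFrame(rows)

# Repeated-measures ANOVA on per-sample mean scores across the two within factors.
per_sample = (ratings.groupby(["sample_id", "complexity", "prompt"], as_index=False)
                     ["score"].mean())
print(AnovaRM(per_sample, depvar="score", subject="sample_id",
              within=["complexity", "prompt"]).fit())

# Shrout-Fleiss intraclass correlations: one overall score per sample per rater.
per_rater = ratings.groupby(["sample_id", "rater"], as_index=False)["score"].mean()
print(pg.intraclass_corr(data=per_rater, targets="sample_id",
                         raters="rater", ratings="score")[["Type", "ICC", "CI95%"]])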

Keywords

LLM (Large Language Models)

SQL Code

Documentation

Quality Evaluation

Repeated Measures ANOVA

Inter-Rater Reliability Statistics 


Co-Author(s)

Gabriel Cotapos Jr, CSAA
Sean McCarthy, CSAA

First Author

Philip Wong, CSAA IG

Presenting Author

Philip Wong, CSAA IG

22: Enhancing Time Series Forecasting with Diffusion Models and Conformal Prediction

Accurate time series predictions are crucial across various scientific fields. Traditional statistical methods, such as autoregressive integrated moving average (ARIMA) models, have been widely used but often rely on assumptions of stationarity and linearity, limiting their ability to capture complex real-world patterns. To overcome these limitations, this study introduces novel methods for point forecasting and uncertainty quantification. Generative diffusion models are employed alongside a conformal prediction-based calibration method to enhance the reliability of prediction intervals. The effectiveness of this approach is demonstrated through simulations and an application to electricity load forecasting. The results contribute broadly to time series analysis by improving predictive accuracy and ensuring robust uncertainty quantification. 
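
A minimal sketch of the split-conformal calibration idea referenced above, wrapped around any point forecaster (a trivial placeholder forecaster here; the diffusion-model forecaster itself is not reproduced, and the study's exact calibration scheme may differ):

import numpy as np

def conformal_interval(residuals_cal, y_hat_new, alpha=0.1):
    """Split conformal prediction: use calibration-set absolute residuals to
    turn a point forecast into a (1 - alpha) prediction interval."""
    n = len(residuals_cal)
    # finite-sample-corrected quantile of the absolute residuals
    q = np.quantile(np.abs(residuals_cal),
                    min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    return y_hat_new - q, y_hat_new + q

rng = np.random.default_rng(1)
y_cal = rng.normal(size=500)           # held-out calibration observations
y_cal_hat = np.zeros(500)              # forecasts from any point model (placeholder)
lo, hi = conformal_interval(y_cal - y_cal_hat, y_hat_new=0.2, alpha=0.1)
print(lo, hi)                          # ~90% prediction interval for the new point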

Keywords

Generative models

Diffusion models

Conformal prediction

Time series analysis

Uncertainty quantification 


Co-Author

Hsin-Cheng Huang, Academia Sinica

First Author

Yu-Ting Fan, Academia Sinica

Presenting Author

Yu-Ting Fan, Academia Sinica

23: Evaluating Exposure Mixture Analysis Techniques for Pooling Multiple Cohorts

Environmental exposures during critical developmental periods significantly impact children's health, yet analyzing complex exposure mixtures across diverse populations remains challenging. We propose a unified framework combining mixture analysis and machine learning techniques for pooling multiple cohort data. Through comprehensive simulation studies, we evaluate methods including Weighted Quantile Sum (WQS), Bayesian Kernel Machine Regression (BKMR), quantile-based g-computation, and Partial Linear Single Index Model (PLSIM), assessing their performance across various scenarios of cohort heterogeneity, exposure correlations, and missing data patterns. We apply this framework to examine prenatal air pollution exposure effects on autism traits using data from over eight thousand mother-child pairs across multiple cohorts from the Environmental influences on Child Health Outcomes (ECHO) program. Our approach integrates ensemble learning, meta-learning algorithms, and Bayesian hierarchical models to account for between-cohort heterogeneity while maximizing information sharing. This methodology promises to enhance our understanding of environmental exposure effects across diverse populations. 

Keywords

Environmental mixture analysis

Machine learning

Heterogeneity

ECHO program 


Co-Author(s)

Akhgar Ghassabian, NYU Langone Health
Mengling Liu, New York University Grossman School of Medicine

First Author

Yuyan Wang, New York University

Presenting Author

Yuyan Wang, New York University

24: Heterogeneity-Aware Regression with Variable Selection: Applications to Readmission Prediction

Readmission prediction is a critical but challenging clinical task, as the inherent relationship between high-dimensional covariates and readmission is complex and heterogeneous. Despite this complexity, models should be interpretable to aid clinicians in understanding an individual's risk prediction. Readmissions are often heterogeneous, as individuals hospitalized for different reasons have materially different subsequent risks of readmission. To allow flexible yet interpretable modeling that accounts for patient heterogeneity, we propose hierarchical-group structure kernels that capture nonlinear and higher-order interactions via functional ANOVA, selecting variables through sparsity-inducing kernel summation, while modeling heterogeneity and allowing variable importance to vary across interactions. Extensive simulations and a Hematologic readmission dataset (N=18,096) demonstrate superior performance across subgroups of patients (AUROC, PRAUC) over the lasso and XGBoost. Additionally, our model provides interpretable insights into variable importance and group heterogeneity. 

Keywords

Readmission Prediction

Heterogeneity

Functional ANOVA

Kernel Methods

Sparsity-Inducing Regularization 


Co-Author(s)

Angela Bailey, University of Minnesota Twin Cities
Jared Huling, University of Minnesota

First Author

Wei Wang, University of Minnesota Twin Cities

Presenting Author

Wei Wang, University of Minnesota Twin Cities

25: High-dimensional Bayesian regression and classification using discretized hyperpriors

In Bayesian statistics, various shrinkage priors such as the horseshoe and lasso priors have been widely used for the problem of high-dimensional regression and classification. The type of shrinkage priors is determined by the choice of the distributions for hyperparameters, called hyperpriors. As a result, the posterior sampling method should vary depending on the choice of hyperpriors. To address this issue, we develop a new family of hyperpriors via a notion of discretization. The great merit of our discretization approach is that the full conditional of any hyperparameter always becomes a multinomial distribution. This feature provides a unifying posterior sampling scheme for any choice of hyperpriors. In addition, the proposed discretization approach includes the spike-and-slab prior as a special case. We illustrate the proposed method using several commonly used shrinkage priors such as horseshoe prior, Dirichlet-Laplace prior, and Bayesian lasso prior. We demonstrate the performance of our proposed method through a simulation study and a real data application. 
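
A minimal sketch of the unifying Gibbs step implied above: when a hyperparameter is restricted to a finite grid, its full conditional is multinomial over the grid points regardless of which shrinkage prior is chosen (a generic illustration with a local-scale parameter of a normal scale-mixture prior; not the authors' code or notation):

import numpy as np

def sample_discrete_hyperparameter(beta_j, grid, log_hyperprior, rng):
    """Full conditional of a discretized local-scale hyperparameter lambda_j
    under beta_j | lambda_j ~ N(0, lambda_j^2): a multinomial over the grid."""
    # log p(lambda_j = g | beta_j) up to a constant, for every grid point g
    log_w = (-np.log(grid) - 0.5 * beta_j**2 / grid**2) + log_hyperprior(grid)
    w = np.exp(log_w - log_w.max())          # stabilize before normalizing
    w /= w.sum()
    return rng.choice(grid, p=w)             # one multinomial draw

rng = np.random.default_rng(0)
grid = np.geomspace(1e-3, 1e2, 200)          # discretized support of lambda_j
# e.g. a half-Cauchy-like hyperprior evaluated on the grid (horseshoe-style)
log_halfcauchy = lambda lam: -np.log1p(lam**2)
print(sample_discrete_hyperparameter(beta_j=0.8, grid=grid,
                                     log_hyperprior=log_halfcauchy, rng=rng))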

Keywords

Bayesian shrinkage priors

Discretization

Gibbs sampler

High-dimensional regression and classification 


Co-Author(s)

Gyuhyeong Goh, Department of Statistics, Kyungpook National University
Dipak Dey, University of Connecticut

First Author

Gwanyeong Choi, Kyungpook National University

Presenting Author

Gwanyeong Choi, Kyungpook National University

26: Hypothesis Tests for Comparing Two Independent Population Proportions with Censored Data

Real-world electronic health record (EHR) data can be used to identify rare adverse events of medications because of their large sample size. This study uses a large nationwide EHR COVID-19 database to identify potential adverse effects of COVID-19 vaccines by comparison with historical EHR data on influenza vaccination. To compare the proportion of adverse events in a pre-specified time interval after vaccination, censored or incomplete data must be taken into account. We propose an inverse probability of censoring weighting (IPCW) adjusted hypothesis test to deal with censored EHR data. An asymptotically consistent estimator of the event proportion is proposed, and its asymptotic distribution is derived under censoring. An asymptotic hypothesis test for comparing the event proportion between the two groups under censoring is established. Simulation studies demonstrate the validity of the proposed IPCW-adjusted estimate and test statistic. We show that the proposed IPCW-adjusted estimate of the event proportion removes the estimation bias and that the type I error of the IPCW-adjusted test is controlled at the nominal level as the censoring rate increases. 
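
A minimal sketch of an IPCW-adjusted event-proportion estimate of the kind described above, with the censoring distribution estimated by a simple Kaplan-Meier step function (toy data and a simplified estimator; the proposed test statistic and its asymptotic variance are not reproduced):

import numpy as np

def km_censoring_survival(time, event, t):
    """Kaplan-Meier estimate of the censoring survival function G(t),
    treating censoring (event == 0) as the 'event' of interest."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    surv, n_at_risk = 1.0, len(time)
    for ti, ei in zip(time, event):
        if ti > t:
            break
        if ei == 0:                        # a censoring occurs at ti
            surv *= 1.0 - 1.0 / n_at_risk
        n_at_risk -= 1
    return surv

def ipcw_event_proportion(time, event, tau):
    """IPCW-adjusted proportion of subjects with the adverse event by time tau."""
    weights = np.array([1.0 / km_censoring_survival(time, event, min(ti, tau))
                        for ti in time])
    had_event = (event == 1) & (time <= tau)
    return np.mean(had_event * weights)

# toy data: observed time and event indicator (1 = adverse event, 0 = censored)
rng = np.random.default_rng(0)
t_true = rng.exponential(30, 500); c = rng.exponential(60, 500)
time = np.minimum(t_true, c); event = (t_true <= c).astype(int)
print(ipcw_event_proportion(time, event, tau=14.0))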

Keywords

Adverse effects

COVID-19 vaccine

hypothesis testing with censored data

IPCW 


Co-Author(s)

Mohsen Rezapour, Department of Biostatistics and Data Science, School of Public Health, UTHealth Houston
Vahed Maroufy, Department of Biostatistics and Data Science, School of Public Health, UTHealth Houston
Yashar Talebi, Department of Biostatistics and Data Science, School of Public Health, UTHealth Houston
Guo-qiang Zhang, Texas Institute for Restorative Neuro-technologies (TIRN), UTHealth Houston
Hulin Wu, University of Texas Health Science Center At Houston

First Author

Lili Liu

Presenting Author

Lili Liu

27: Impact of Performance Metrics in AI Model Evaluation

The selection of a performance metric for model evaluation is not as trivial as it may appear. On one hand, model commissioners' expectations of the model's contribution to achieving their business objective(s) often lack empirical support. On the other, model developers can easily be confused by the multitude of quantitative metrics recommended in the statistical literature. Hence the need for a methodology to guide the effective selection of a statistical performance metric during model evaluation. In Salami et al. (2024), we considered a fraud detection use case and showed that F-beta (F_β, β>1) is more appropriate than F_1 or the Area Under the Precision-Recall Curve (AUPRC) for measuring the model's contribution to the business objective. In this paper, we examine two facets of the F_β, namely the weighted F_β and the non-weighted F_β, and discuss how selecting one in lieu of the other can lead to erroneous decisions with adverse impacts. As the use of AI algorithms becomes more prevalent in decision making, our paper brings a new perspective to the selection of statistical performance metrics for evaluating AI models. 
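
A minimal sketch of one common reading of the weighted versus non-weighted distinction discussed above, using scikit-learn's fbeta_score on an imbalanced toy problem (the paper's exact definitions and the Salami et al. (2024) fraud-detection setting may differ):

import numpy as np
from sklearn.metrics import fbeta_score

rng = np.random.default_rng(0)
# heavily imbalanced toy labels: ~2% positives (e.g. fraud)
y_true = (rng.random(10_000) < 0.02).astype(int)
y_pred = np.where(rng.random(10_000) < 0.8, y_true, 1 - y_true)  # noisy classifier

beta = 2  # beta > 1 weights recall more heavily than precision
f2_positive = fbeta_score(y_true, y_pred, beta=beta)                     # positive class only
f2_weighted = fbeta_score(y_true, y_pred, beta=beta, average="weighted") # support-weighted over classes

# With 98% negatives, the support-weighted score is dominated by the majority
# class and can look reassuring even when fraud recall is poor.
print(f"F2 (positive class): {f2_positive:.3f}")
print(f"F2 (weighted):       {f2_weighted:.3f}")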

Keywords

artificial intelligence

machine learning

performance evaluation

performance metrics

model testing

model monitoring

performance thresholds 


Co-Author

Victor Lo, Fidelity Investments

First Author

Youssouf Salami, Fidelity Investments

Presenting Author

Youssouf Salami, Fidelity Investments

28: Implementation of Nonlinear Z-score Analysis for Cognitive Abnormality Detection

Traditional linear methods for generating age, sex, and education level corrected Z-scores in neuropsychological assessments can be problematic because of nonlinearity and bounded test scores. We propose a nonlinear censored regression model for generating Z-scores that adjusts for age, sex, education, and race, while incorporating age-varying residual standard deviations. This approach addresses non-normal score distributions and boundary censoring, enhancing the detection of abnormal cognitive performance. Application to diverse normative datasets demonstrates improved accuracy and sensitivity over traditional methods, as corroborated by clinician feedback. Our results advocate for adopting this model to refine neuropsychological evaluations across varied populations. 

Keywords

Neuropsychological Testing


Z-Score Adjustment

Censored Regression

Nonlinear Modeling

Cognitive Assessment 


Co-Author(s)

John Kornak, University of California-San Francisco
Adam Staffaroni, University of California San Francisco
Julie Fields, Mayo Clinic
Jingxuan Wang, University of California, San Francisco
Elena Tsoy, University of California, San Francisco

First Author

Peijun Liu, University of California San Francisco

Presenting Author

Peijun Liu, University of California San Francisco

29: Improving naive Bayes classifiers with high-dimensional non-Gaussian data

The naive Bayes classifier, which assumes conditional independence of predictors, improves classification efficiency and has a great advantage in handling high-dimensional as well as imbalanced data. However, its success hinges on the normality assumption for each continuous predictor, and its performance degrades considerably as many irrelevant predictors are included.
In this paper, we develop a way of improving the performance of naive Bayes classifiers for high-dimensional non-Gaussian data. To remove irrelevant predictors, we develop an efficient variable selection procedure in the context of naive Bayes classification based on the Bayesian Information Criterion (BIC). In addition, we adapt the naive Bayes classifier to non-Gaussian data via a power transformation. We conduct a comparative simulation study to demonstrate the superiority of our proposed classifier over existing classification methods. We also apply the proposed classifier to real data and confirm its effectiveness. 
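
A minimal sketch of the two ingredients described above, a per-predictor power (Box-Cox) transformation toward normality and a BIC-style screen that drops predictors whose class-specific means do not differ, feeding a Gaussian naive Bayes classifier (an illustrative simplification, not the authors' algorithm):

import numpy as np
from scipy import stats
from sklearn.naive_bayes import GaussianNB

def bic_keep_predictor(x, y):
    """Compare BIC of 'one common normal mean' vs 'class-specific normal means'
    (common scale, for simplicity) for one predictor; keep it if the latter wins."""
    n = len(x)
    ll_null = stats.norm.logpdf(x, x.mean(), x.std()).sum()
    ll_alt = sum(stats.norm.logpdf(x[y == k], x[y == k].mean(), x.std()).sum()
                 for k in np.unique(y))
    k_classes = len(np.unique(y))
    bic_null = -2 * ll_null + 2 * np.log(n)              # common mean + variance
    bic_alt = -2 * ll_alt + (k_classes + 1) * np.log(n)  # class means + variance
    return bic_alt < bic_null

rng = np.random.default_rng(0)
n, p = 300, 200
y = rng.integers(0, 2, n)
X = rng.lognormal(size=(n, p))                 # non-Gaussian predictors
X[:, :5] += 1.5 * y[:, None]                   # only the first 5 are relevant

# Box-Cox power transformation toward normality (predictors must be positive).
Xt = np.column_stack([stats.boxcox(X[:, j])[0] for j in range(p)])
keep = [j for j in range(p) if bic_keep_predictor(Xt[:, j], y)]
clf = GaussianNB().fit(Xt[:, keep], y)
print(len(keep), clf.score(Xt[:, keep], y))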

Keywords

Bayes classifier

Generative classifier

High-dimensional variable selection

Power transformation 


Co-Author(s)

Gyuhyeong Goh, Department of Statistics, Kyungpook National University
Dipak Dey, University of Connecticut

First Author

Mijin Jeong, Kyungpook National University

Presenting Author

Mijin Jeong, Kyungpook National University

30: Item-Level Imputation of Missing K10 Data in Unconditional Cash Transfer Trials Using XGBoost

Traditional imputation methods for psychological scales often focus on aggregate scores, potentially obscuring item-level response patterns. This study addresses the challenge of item-level missingness by introducing a multiple imputation framework that preserves the Kessler-10 (K10) scale's internal structure in a longitudinal setting. We analyzed item responses from the K10 across six waves in a three-arm RCT (n = 878) and, departing from conventional total-score imputation, employed XGBoost with 10 iterations to impute missing values at the individual item level. Despite missing data ranging from 2.5% to 34.4%, our approach yielded robust estimates, with mean uncertainty on a 5-point scale varying between 0.003 and 0.111. This method enhances the analysis of psychological assessment data by capturing item-specific variability, ultimately improving data integrity in mental health research. 
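
A minimal sketch of item-level imputation with an XGBoost learner inside a chained-equations loop, here via scikit-learn's IterativeImputer (an assumed tooling choice with illustrative column names and simulated data; the study's own MICE/XGBoost configuration and uncertainty pooling may differ):

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
# Wide layout: one column per K10 item per wave (illustrative shape and names only).
cols = [f"wave{w}_item{i}" for w in range(1, 7) for i in range(1, 11)]
k10 = pd.DataFrame(rng.integers(1, 6, size=(878, len(cols))).astype(float), columns=cols)
k10 = k10.mask(rng.random(k10.shape) < 0.15)          # inject item-level missingness

# Chained-equations imputation with an XGBoost learner, repeated with different
# seeds to obtain several completed datasets for multiple-imputation-style pooling.
imputations = []
for m in range(5):
    learner = XGBRegressor(n_estimators=50, max_depth=3, subsample=0.8,
                           colsample_bytree=0.8, random_state=m)
    imputer = IterativeImputer(estimator=learner, max_iter=5, random_state=m)
    completed = pd.DataFrame(imputer.fit_transform(k10), columns=cols)
    imputations.append(completed.clip(1, 5).round())  # keep values on the item scale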

Keywords

Multiple Imputation by Chained Equation (MICE)

Item-level imputation

XGBoost



Randomized controlled trial (RCT) 


First Author

Nidhi Tandon, University of Pennsylvania

Presenting Author

Nidhi Tandon, University of Pennsylvania

31: Learning Smooth Populations of Parameters with Trial Heterogeneity

We revisit the classical problem of estimating the mixing distribution of Binomial mixtures under trial heterogeneity and smoothness. This problem has been studied extensively when the trial parameter is homogeneous, but not under the more realistic scenario of heterogeneous trials, and only within a low smoothness regime, where the resulting rates are suboptimal. Under the assumption that the density is s-smooth, we derive faster error rates for nonparametric density estimators under trial heterogeneity. Importantly, even when reduced to the homogeneous case, our result improves upon the state of the art. We further discuss data-driven tuning parameter selection via cross-validation and a measure of a difference between two densities. Our work is motivated by an application in criminal justice: assessing the effectiveness of indigent representation in Pennsylvania. We find that the estimated conviction rates for appointed counsel (court-appointed private attorneys) are generally higher than those for public defenders, potentially due to a confounding factor: appointed counsel are more likely to take on severe cases. 

Keywords

Binomial mixtures

mixing distribution

nonparametric density estimation

criminal justice 


Co-Author

Edward Kennedy

First Author

JungHo Lee

Presenting Author

JungHo Lee

32: Leveraging the Bayesian Stochastic Antecedent Model to Quantify Long-Term Drought Effects on Forest

Statistical methods are essential in ecology, helping researchers navigate complex, noisy data to understand environmental change and predict ecosystem responses. One critical application is assessing how forests, which play a key role in carbon capture, are impacted by increasing drought frequency, a threat that may have long-term consequences overlooked by traditional models. This study addresses that gap by applying the Bayesian Stochastic Antecedent Model, a statistical approach designed to quantify how past drought severity and duration influence future growth. This methodology will be applied to B4WarmED, a new and rare long-term experimental dataset that uniquely combines temporal depth, a controlled causal setup, and the ability to extract signals from a low signal-to-noise ratio, making it ideal for comprehensive understanding. This methodology, combined with the depth and design of B4WarmED, enables nuanced, data-driven insights into the complex, long-term effects of drought, allowing for stronger causal claims about how past conditions shape future tree growth. Refining predictions of forest carbon dynamics will improve climate models and inform conservation strategies. 

Keywords

ecology

time-series

antecedent events

causal inference

tree growth 


First Author

Ashlan Simpson

Presenting Author

Ashlan Simpson

33: Light Absorption Enhancement in Coated Atmospheric Tar Balls Using Mie Theory

Atmospheric tar balls (TBs) are solid, strongly light-absorbing organic particles emitted by wildfires. TBs can disrupt Earth's energy balance by absorbing incoming solar radiation. However, large uncertainties remain in TBs' optical properties. Moreover, when TBs acquire different coatings (e.g., water and organics), their light absorption is enhanced. This highly variable optical behavior and absorption enhancement of TBs are not included in climate models.

This study applies Mie calculations to investigate the light absorption enhancement of different coatings on TBs. We used optical properties of TBs reported in the literature to cover their variability, testing different coating species (e.g., water, secondary organics, and brown carbon). The core size and coating thickness varied between x to y and a to b, respectively. We found that clear and brown coatings enhance light absorption through the "lensing effect," with brown coatings showing a marked increase, both accounted for in the core-shell parameterization. This model-measurement approach improves predictions of TB optical properties. 

Keywords

Mie Theory

Atmospheric modeling


light absorption

light scattering

Optical properties

Core-shell particles 


Co-Author(s)

Zezhen Cheng, Pacific Northwest National Laboratory
Manish Shrivastava, Pacific Northwest National Laboratory

First Author

Karen Magaña, Washington State University

Presenting Author

Karen Magaña, Washington State University

34: Mediation Approach to study Interplay of Poverty, Socioeconomic Status and Treatment on Mortality

Prior studies have shown persistent poverty (PP) to affect the risk of mortality for various cancers. We used a causal path-specific mediation approach to understand how the interplay between census-tract-level PP, socioeconomic status (SES), and receipt of cancer treatment affects mortality. Using Cox models, we obtained weighted causal estimates of the natural direct effect (NDE) of PP on mortality and of the natural indirect effects (NIE) of PP on mortality through the combined pathways, treating SES as an exposure-induced mediator-outcome confounder and treatment as a mediator. We used data on 50,533 stage I-IV hepatocellular carcinoma patients identified in the SEER program. The analysis showed that PP had an indirect effect on higher mortality probability. The Cox models yielded an NDE of PP on mortality of 1.06 (95% CI: 0.99-1.13), accounting for 31% of the total effect. The NIE of PP on mortality through the combined SES pathway was 1.03 (95% CI: 1.02-1.05), accounting for 16% of the total effect, and the NIE of PP on mortality through the treatment mediator alone was 1.10 (95% CI: 1.04-1.17), accounting for 53% of the total effect. Thus, SES and receipt of treatment contribute to understanding the causal effect of PP on mortality. 

Keywords

Causal Analysis

Mediation Analysis

Persistent Poverty

Cancer

Mortality 


Co-Author(s)

Yesung Kweon, The Ohio State University
Samiila Obeng-Gyasi, The Ohio State University Wexner Medical Center
Jesse Plascak, The Ohio State University Wexner Medical Center
Mohamed Elsaid

First Author

Demond Handley, The Ohio State University

Presenting Author

Demond Handley, The Ohio State University

35: Modeling Autoregressive Conditional Regional Extremes with Applications to Solar Flare Prediction

This poster studies big data streams with regional-temporal extreme event (REE) structures, with application to solar flare prediction. An autoregressive conditional Fréchet model with time-varying parameters for a region and its adjacent regions' extremes (ACRAE) is proposed. The ACRAE model can quickly and accurately predict rare REEs (i.e., solar flares) in big data streams. Under mild regularity conditions, the ACRAE model is proved to be stationary and ergodic. The parameter estimators are derived through the conditional maximum likelihood method, and their consistency and asymptotic normality are established. Simulations demonstrate the efficiency of the proposed parameter estimators. In real solar flare prediction, with the new dynamic extreme value modeling, the occurrence and climax of solar activity can be predicted earlier than with existing algorithms. The empirical study shows that the ACRAE model outperforms existing prediction algorithms with sampling strategies. 

Keywords

big data

solar flare detection

time series of regional extremes

extreme value theory

tail index dynamics 


Co-Author(s)

Jili Wang, University of Wisconsin-Madison
Zhengjun Zhang, University of Chinese Academy of Sciences

First Author

Steven Moen, University of Wisconsin-Madison

Presenting Author

Steven Moen, University of Wisconsin-Madison

36: Modeling Forage Quantity and Quality Using Machine Learning Models and Remote Sensing Data

Efficient monitoring and measurement of forage resources is challenging in livestock production and management. Failure to develop a real-time monitoring tool can result in overgrazed forage resources, ecosystem degradation, decreased animal production, and reduced resilience to climate change. Remote sensing data such as satellite imagery provide a cost-effective tool for monitoring forage quality and quantity. This study aims to develop data pipelines to automate the extraction of climate data and satellite imagery from Google Earth Engine. Specifically, forage quantity and quality indicators such as neutral detergent fiber, acid detergent fiber, biomass, and crude protein are predicted using precipitation metrics, seasonal weather metrics, and vegetation indices. The performance of univariate and multivariate Random Forest, Generalized Additive Model, Least Absolute Shrinkage and Selection Operator, Autoregressive Integrated Moving Average, Nonlinear Autoregressive Exogenous, and Multivariate Time Series models is compared. The results show that the non-linear models outperformed the linear models while remaining computationally efficient. 

Keywords

Livestock

Forage

Machine learning

Remote sensing 


Co-Author(s)

Jameson Brennan, South Dakota State University
Hossein Moradi Rekabdarkolaee, South Dakota State University

First Author

Michael Abalo

Presenting Author

Michael Abalo

37: Modeling Seasonal Time Series with Periodic Mean-Reverting Stochastic Differential Equations

Seasonal variation is a key feature of many environmental and biological systems, including infectious disease outbreaks and temperature patterns. Periodic mean-reverting stochastic differential equations (SDEs) effectively model such variability. We present periodic mean-reverting SDE models, \(dX(t) = r(\beta(t) - X(t))\,dt + d\beta(t) + \sigma X^{p}(t)\,dW(t)\), for \(p = 0, 1/2, 2/3, 5/6, 1\), with periodic mean \(\beta(t)\), and fit them to seasonally varying influenza and temperature data. The model with \(p = 0\) corresponds to the Ornstein-Uhlenbeck process, the models with \(p = 1/2\) and \(p = 1\) relate to the Cox-Ingersoll-Ross (CIR) process and geometric Brownian motion (GBM), respectively, and \(p = 2/3, 5/6\) give other mean-reverting SDEs. We show that the higher-order moments of the CIR and GBM processes exhibit periodicity. Novel model-fitting methods combine least squares for estimating \(\beta(t)\) with maximum likelihood for \(r\) and \(\sigma\). Confidence regions are constructed via bootstrapping, and missing data are handled using a modified MissForest algorithm. These models provide a robust framework for capturing seasonal dynamics and offer flexibility in the specification of the mean function. 
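
A minimal sketch simulating the periodic mean-reverting SDE above with an Euler-Maruyama scheme (illustrative parameter values and mean function; the least-squares/maximum-likelihood fitting and bootstrap confidence regions described in the abstract are not shown):

import numpy as np

def simulate_periodic_sde(r, sigma, p, beta, dbeta, T=10.0, dt=1e-3, x0=1.0, seed=0):
    """Euler-Maruyama for dX = [beta'(t) + r*(beta(t) - X)] dt + sigma * X^p dW."""
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    t = np.linspace(0.0, T, n + 1)
    x = np.empty(n + 1)
    x[0] = x0
    for k in range(n):
        drift = dbeta(t[k]) + r * (beta(t[k]) - x[k])
        diff = sigma * max(x[k], 0.0) ** p          # guard against negative states
        x[k + 1] = x[k] + drift * dt + diff * np.sqrt(dt) * rng.standard_normal()
    return t, x

# yearly periodic mean, e.g. for temperature- or influenza-like seasonality
beta = lambda t: 10.0 + 3.0 * np.sin(2 * np.pi * t)
dbeta = lambda t: 3.0 * 2 * np.pi * np.cos(2 * np.pi * t)

t, x_ou = simulate_periodic_sde(r=2.0, sigma=0.5, p=0.0, beta=beta, dbeta=dbeta)   # OU-type
t, x_cir = simulate_periodic_sde(r=2.0, sigma=0.5, p=0.5, beta=beta, dbeta=dbeta)  # CIR-type
print(x_ou[-5:], x_cir[-5:])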

Keywords

Parameter estimation

Mean-reverting stochastic differential equations

Confidence region

Seasonal time series 


Co-Author

Linda J.S. Allen, Texas Tech University

First Author

GM Fahad Bin Mostafa, Arizona State University

Presenting Author

GM Fahad Bin Mostafa, Arizona State University

38: On the Statistical Capacity of Deep Generative Models

Deep generative models are routinely used to generate samples from complex, high-dimensional distributions. Despite their apparent successes, their statistical properties are not well understood. A common assumption is that, with enough training data and sufficiently large neural networks, deep generative model samples will have arbitrarily small errors in sampling from any continuous target distribution. We set up a unifying framework that debunks this belief. We demonstrate that broad classes of deep generative models, including variational autoencoders and generative adversarial networks, are not universal generators. Under the predominant case of Gaussian latent variables, these models can only generate concentrated samples that exhibit light tails. Using tools from concentration of measure and convex geometry, we give analogous results for more general log-concave and strongly log-concave latent variable distributions. We extend our results to diffusion models via a reduction argument. We use the Gromov-Lévy inequality to give similar guarantees when the latent variables lie on manifolds with positive Ricci curvature. These results shed light on the limited capacity of deep generative models. 
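
As background for the light-tail claim, the classical Gaussian concentration inequality for Lipschitz maps that this style of argument rests on (a standard fact stated only for context; the poster's own, more general theorems are not reproduced here): if the generator \(G:\mathbb{R}^{d}\to\mathbb{R}^{k}\) is \(L\)-Lipschitz, \(Z \sim N(0, I_{d})\), and \(f\) is 1-Lipschitz, then

\[
\Pr\big(\lvert f(G(Z)) - \mathbb{E}\, f(G(Z)) \rvert \ge t\big) \;\le\; 2\exp\!\Big(-\frac{t^{2}}{2L^{2}}\Big), \qquad t \ge 0,
\]

so any one-dimensional Lipschitz statistic of the generated sample has sub-Gaussian (light) tails.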

Keywords

Deep Generative Models

Diffusion models

Generative Adversarial Networks

Variational Autoencoders

Concentration of Measure 


Co-Author

David Dunson

First Author

Edric Tam, Stanford University

Presenting Author

Edric Tam, Stanford University

39: Orthogonal Multimodality Integration and Clustering in Single-cell Data

Multimodal integration combines data from diverse sources or modalities to provide a more holistic understanding of a phenomenon. The challenges in multi-omics data analysis stem from the complexity, high dimensionality, and heterogeneity of the data, which require advanced computational tools and visualization methods for effective interpretation. This paper introduces a novel method called Orthogonal Multimodality Integration and Clustering (OMIC) to analyze CITE-seq data.

Our approach allows researchers to integrate various data sources while accounting for interdependencies. We demonstrate its effectiveness in cell clustering using CITE-seq datasets. The results show that our method outperforms existing techniques in terms of accuracy, computational efficiency, and interpretability. We conclude that OMIC is a powerful tool for multimodal data analysis, enhancing the feasibility and reliability of integrated data analysis. 

Keywords

Multimodality Integration

CITE-seq

Cell Clustering 

Abstracts


Co-Author(s)

Yongkai Chen
Haoran Lu, University of Georgia
Wenxuan Zhong, University of Georgia
Guo-cheng Yuan, Department of Genetics and Genomics, Icahn School of Medicine at Mount Sinai
Ping Ma, University of Georgia

First Author

Yufang Liu

Presenting Author

Yufang Liu

40: Personalized Dynamic Dose-Finding for Longitudinal Observational Data

The prescribed doses for many drugs are based on population norms or physician discretion. For example, very low birth weight (VLBW) infants (BW < 1500 grams) often experience slower postnatal growth and require glucose treatment to support weight gain and prevent hyperglycemia. A uniform dosage is unsuitable due to individual differences in glucose metabolism influenced by weight, gestational age, and other factors. Personalized dynamic dosing adjusts to an infant's observed responses, but quantifying uncertainty is essential in patient-critical environments. This study employs a longitudinal mixed model to estimate personalized optimal doses, using patient-specific random effects as biomarkers to capture individual sensitivities. We quantify the uncertainty of these doses through the implicit value delta method, aiding in safe clinical decision-making. Simulation studies validate our model's robustness, and analysis of NICU data on glucose treatments for VLBW infants over their first seven days demonstrates key differences between optimal and prescribed dosing strategies. 
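
As a rough sketch of the first modeling step only (the longitudinal mixed model whose patient-specific random effects act as dose-sensitivity biomarkers), the snippet below fits a random-slope model to simulated data; the data, formula, and variable names are assumptions, not the NICU dataset or the authors' specification, and the implicit delta method uncertainty step is not shown.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for i in range(50):                               # 50 simulated infants
    slope_i = 1.0 + rng.normal(scale=0.3)         # patient-specific dose sensitivity
    for day in range(7):                          # seven daily observations
        dose = rng.uniform(4, 12)
        growth = 0.5 + slope_i * dose + rng.normal(scale=1.0)
        rows.append({"id": i, "dose": dose, "growth": growth})
df = pd.DataFrame(rows)

# Random intercept and random dose slope per infant
model = smf.mixedlm("growth ~ dose", df, groups=df["id"], re_formula="~dose")
fit = model.fit()
print(fit.params["dose"])                         # population-level dose effect
print(list(fit.random_effects.items())[:2])       # patient-specific deviations ("biomarkers")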

Keywords

Personalized medicine

Optimal dose

Longitudinal Data Analysis

Implicit Delta Method 

Abstracts


Co-Author

Alexander McLain, University of South Carolina

First Author

Md Nasim Saba Nishat, Department of Epidemiology and Biostatistics, University of South Carolina

Presenting Author

Md Nasim Saba Nishat, Department of Epidemiology and Biostatistics, University of South Carolina

41: Practical Approaches to Machine Learning for Mental Illness Detection on Social Media

Social media provides a valuable avenue for mental health research, offering insights into conditions through user-generated content. Yet, the application of machine learning (ML) and deep learning (DL) models in this domain presents methodological challenges, including dataset representativeness, linguistic complexity, the need to distinguish multiple types of mental illness, and class imbalance. This project offers practical guidance to address these issues, focusing on best practices in data preprocessing, feature engineering, and model evaluation. The project introduces strategies for handling imbalanced datasets, optimizing hyperparameter tuning, and improving model transparency and reproducibility. Additionally, it demonstrates techniques for effectively differentiating various mental health conditions within social media data, ensuring that models capture their nuanced presentations. With real-world examples and step-by-step implementation, this project aims to provide tools to build more robust and interpretable ML/DL models for mental illness detection. These improvements contribute to the development of effective early detection and intervention tools in public health. 
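
One of the practices the project discusses, handling class imbalance, can be sketched with class weighting and stratified cross-validation; the synthetic features, model, and metric below are illustrative assumptions rather than the project's pipeline.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for engineered text features with a 10% minority class
X, y = make_classification(n_samples=2000, n_features=50, weights=[0.9, 0.1],
                           random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000)   # reweight rare class
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)     # preserve class ratios per fold
scores = cross_val_score(clf, X, y, cv=cv, scoring="f1_macro")
print("macro-F1 per fold:", scores.round(3))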

Keywords

Machine Learning

Deep Learning

Mental Health

Social Media

NLP

Multi-Class Classification 

Abstracts


Co-Author(s)

Zhanyi Ding, New York University
Yexin Tian, Georgia Institute of Technology
Jianglai Dai, UC Berkeley
Xiaorui Shen, Northeastern University
Yeyubei Zhang, University of Pennsylvania
Yunchong Liu, University of Pennsylvania
Yuchen Cao, Northeastern University

First Author

Zhongyan Wang, New York University

Presenting Author

Yuchen Cao, Northeastern University

42: Predicting EGFR Expression from Lung Cancer Pathology Images Using DNN

In this project, we develop a computational framework to predict Epidermal Growth Factor Receptor (EGFR) expression levels using pathology images. The workflow begins with cell segmentation and image feature extraction, performed using the scikit-image library in Python, to derive quantitative features from each patient's pathology images. These features are then utilized in a deep neural network model for variable selection and EGFR expression prediction. Our models demonstrate strong predictive performance, and two professional pathologists validated the extracted features, ensuring their clinical interpretability. This approach has the potential to significantly reduce the costs of gene expression sequencing and provide valuable guidance for pathologists in clinical trial analysis. By integrating computational pathology with deep learning, our work offers an efficient workflow for EGFR expression prediction, bridging the gap between traditional pathology and advanced statistical models. 
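
The segmentation and feature-extraction stage can be sketched with scikit-image as below; the sample image, thresholding choice, and property list are illustrative assumptions (the authors' segmentation pipeline is not reproduced), and the resulting per-object feature matrix is what would feed the downstream deep neural network.

import numpy as np
from skimage import data, filters, measure

image = data.coins()                                   # stand-in grayscale image
mask = image > filters.threshold_otsu(image)           # simple global threshold
labels = measure.label(mask)                           # connected components as "cells"

features = measure.regionprops_table(
    labels, intensity_image=image,
    properties=("area", "eccentricity", "mean_intensity", "perimeter"))
feature_matrix = np.column_stack([features[k] for k in sorted(features)])
print(feature_matrix.shape)   # one row of quantitative features per segmented object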

Keywords

Feature images extraction

Deep Neural Network

Gene expression prediction 

Abstracts


Co-Author(s)

Tong Wang, Yale University
Shuangge Ma

First Author

Yibo ZHAI

Presenting Author

Yibo ZHAI

43: PRITS framework for investigating and assessing web-scraped datasets for research applications

The PRITS framework addresses the lack of integrated technical and statistical guidance on the programmatic collection of data from online data sources and assessing existing web-scraped datasets for specific research uses. The framework covers five stages: Planning, Retrieval, Investigation, Transformation and Summary (PRITS). The 'Planning' stage focuses on problem and context definition, and sampling design. 'Retrieval' involves the technical execution and automated documentation of web-scraping processes and outputs (i.e. paradata and substantive data). 'Investigation' assesses the content and completeness of the retrieved web response objects. 'Transformation' involves parsing and cleaning the retrieved web data, potential integration with other data, and documentation of key decisions such as imputation or harmonisation strategies. Finally, the 'Summary' stage documents any decisions that might materially impact downstream analysis, and describes key properties (i.e. metadata) and limitations of the final web-scraped dataset. 
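
A minimal sketch of the 'Retrieval' stage idea, fetching a page while recording paradata (request time, status, headers) alongside the substantive payload, is shown below; the URL and the stored fields are placeholders chosen for illustration, not part of the PRITS specification.

import json, datetime
import requests

url = "https://example.com/listings"            # placeholder target
response = requests.get(url, timeout=30)

record = {
    "paradata": {                                # documentation of the scraping process
        "url": url,
        "retrieved_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "status_code": response.status_code,
        "content_type": response.headers.get("Content-Type"),
    },
    "substantive_data": response.text,           # raw response body, parsed in a later stage
}
with open("retrieval_log.json", "a") as f:
    f.write(json.dumps(record) + "\n")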

Keywords

internet data

sampling design

web scraping

data quality 

Abstracts


Co-Author(s)

Tina Lam, Monash University
Mitchell O'Hara-Wild, Monash University

First Author

Cynthia Huang

Presenting Author

Cynthia Huang

44: Random Processes with Stationary Increments and Intrinsic Random Functions on the Real Line

Random processes with stationary increments and intrinsic random processes are two concepts commonly used to deal with non-stationary random processes. They are broader classes than stationary random processes and conceptually closely related to each other. A random process with stationary increments is a stochastic process where its distribution of the increments only depends on its temporal or spatial intervals. On the other hand, an intrinsic random function is a flexible family of non-stationary processes where the process is assumed to have lower monomials as its mean and the transformed process becomes stationary. This research illustrates the relationship between these two concepts of stochastic processes and shows that, under certain conditions, they are equivalent on the real line. 

Keywords

intrinsic random function

random process with stationary increment

non-stationary random process

spatial statistics

time series 

Abstracts


Co-Author

Chunfeng Huang, Indiana University

First Author

Jongwook Kim, Indiana University Bloomington

Presenting Author

Jongwook Kim, Indiana University Bloomington

45: Recent Regulatory Guidance Trends for Randomization Monitoring in Clinical Trials

Monitoring is a process to ensure that data integrity is maintained across the duration of a clinical trial. Regulatory authorities recommend that Sponsors focus monitoring strategies on data critical to reliability of trial results. As Randomization is considered critical data, recent regulatory guidance (ICH, MHRA, FDA, EMA) places higher focus on Randomization Monitoring. This type of monitoring concentrates on reviewing accumulated randomization data to confirm that randomization has occurred per protocol/relevant specifications. Randomization monitoring is important in every clinical trial to provide verifiable evidence proving the randomization's integrity. It becomes even more crucial in complex innovative designs (e.g., Master Protocols, trials with AI-enabled devices/machine learning) due to their complexity/novelty. This presentation will establish the importance of randomization monitoring (both in standard and complex/novel protocol designs) and summarize regulatory requirements for randomization monitoring (e.g., sponsors' responsibilities, contents of monitoring plans/reports). It will also present guidance for developing an effective randomization monitoring process. 

Keywords

Randomization Monitoring

Clinical Trial Monitoring

Randomization

Regulatory Guidance Review 

Abstracts


Co-Author(s)

Kevin Venner, Almac Group
Noelle Sassany, Almac Group
Brian Stella, Almac

First Author

Jennifer Ross, Almac Group

Presenting Author

Jennifer Ross, Almac Group

46: Refining Community Characteristic Composites Representing American Community Survey Item Data

Our earlier work reported two US county-level composites, health/economics (HEC) and community capital/urbanicity (CCU), using American Community Survey (ACS) data and two other national health-related databases. We aim to develop new composites using 2023 5-year ACS estimates. One hundred two ACS variables among 3,144 counties were analyzed using principal components analysis. Standardized composite scores were computed for each county and their associations with HEC, CCU, Neighborhood Deprivation Index (NDI) and Urban-Rural Classification Scheme (URC) scores were evaluated. A 74-item, two-component solution that approximated "simple structure" and accounted for 41.4% of the total item covariance was interpreted to represent: 1) financial resources & educational attainment (FREA) and 2) age, demographic characteristics & urbanicity (ADU). Non-chance associations (r or rho, p<0.001) between FREA and HEC, CCU, NDI and URC were 0.77, 0.57, -0.88 and -0.44, respectively. For ADU they were 0.45, -0.69, -0.23 and 0.37. Patterns of associations indicated general concordance of the new composites with other measures. Future work will focus on smaller geographic entities using ACS data. 
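
The analytic recipe (principal components on standardized county-level items, composite scores, and rank correlations with external indices) can be sketched as follows; the random arrays stand in for the ACS variables and the NDI, and plain unrotated PCA is used here even though the reported "simple structure" solution may have involved rotation.

import numpy as np
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
acs = rng.normal(size=(3144, 74))               # placeholder: counties x ACS items
ndi = rng.normal(size=3144)                     # placeholder external index

X = StandardScaler().fit_transform(acs)          # standardize items before PCA
pca = PCA(n_components=2)
scores = pca.fit_transform(X)                    # county-level composite scores
print("variance explained:", pca.explained_variance_ratio_.sum().round(3))
rho, p = spearmanr(scores[:, 0], ndi)            # association with an external measure
print("rho with NDI:", round(rho, 3))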

Keywords

Latent variable models and principal components analysis

American Community Survey data

United States county community characteristics

Social determinants of health 

Abstracts


Co-Author(s)

Wali Johnson, Vanderbilt University Medical Center
Irene Feurer, Vanderbilt University Medical Center

First Author

Scott Rega, Vanderbilt University Medical Center

Presenting Author

Irene Feurer, Vanderbilt University Medical Center

47: Resilience Forecasting through Advanced Predictive Models: A Transformer-Based Approach for Dynamic Analysis

Predicting long-term system resilience is essential for strategic planning and risk management. Resilience forecasting plays a critical role in understanding and mitigating the impacts of shocks and facilitating quicker recoveries, especially in dynamic environments like crude oil markets. Traditional models lack dynamic adaptability. We propose TimeGPT, a transformer-based model, to manage temporal dependencies and ensure stable performance under stress. Using a 10-year crude oil price dataset, TimeGPT demonstrated robust zero-shot learning, enhanced by feature engineering (e.g., integrating external data like public holidays, anomaly indicators, and temporal trends) and fine-tuning. Attention mechanisms prioritized key features like US Rig Count, while filtering noise from less relevant variables such as Mortgage Rate and Import. Evaluated across different data splits (30%-70% to 70%-30%, incrementally by 10%), TimeGPT outperformed traditional models, capturing complex market dynamics and predicting long-term resilience. Metrics like MAE, RMSE, and R² confirmed its accuracy. This approach supports strategic decision-making in uncertain economic environments. 

Keywords

Resilience prediction

time-varying covariates

Long-term prediction

Time series

TimeGPT models 

Abstracts


Co-Author(s)

Hesam Saki, Department of Computer Engineering, University of Tehran
Lance Fiondella

First Author

Fatemeh Salboukh, University of Massachusetts at Dartmouth

Presenting Author

Fatemeh Salboukh, University of Massachusetts at Dartmouth

48: Risk Analysis of Flowlines in the Oil and Gas Sector: A GIS and Machine Learning Approach

This paper presents a comprehensive risk analysis of flowlines in the oil and gas sector using Geographic Information Systems (GIS) and machine learning. Flowlines, vital conduits transporting oil, gas, and water from wellheads to surface facilities, often face under-assessment compared to pipelines. This study addresses this gap using advanced tools to predict and mitigate failures, enhancing environmental safety and reducing casualties. Extensive datasets from the Colorado Energy and Carbon Management Commission (ECMC) were processed through spatial matching, feature engineering, and geometric extraction to build robust predictive models. Various machine learning algorithms were used to assess risks, with ensemble classifiers showing superior accuracy, especially when paired with Principal Component Analysis (PCA) for dimensionality reduction. Exploratory Data Analysis (EDA) highlighted spatial and operational factors influencing risks, identifying high-risk zones for focused monitoring. The study demonstrates the potential of integrating GIS and machine learning in flowline risk management, proposing a data-driven approach to enhance safety in petroleum extraction. 
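
The PCA-plus-ensemble modeling step can be sketched as a scikit-learn pipeline; the synthetic features below stand in for the ECMC flowline attributes, and the component count and classifier settings are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced stand-in for flowline features (failure vs. no failure)
X, y = make_classification(n_samples=5000, n_features=40, n_informative=10,
                           weights=[0.85, 0.15], random_state=0)

model = make_pipeline(StandardScaler(),
                      PCA(n_components=10),                      # dimensionality reduction
                      RandomForestClassifier(n_estimators=300, random_state=0))
print(cross_val_score(model, X, y, cv=5, scoring="roc_auc").round(3))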

Keywords

Risk Analysis

Flowlines

Machine Learning

GIS 

Abstracts


Co-Author

Soutir Bandyopadhyay, Colorado School of Mines

First Author

Isabella Chittumuri

Presenting Author

Isabella Chittumuri

49: Scalable Bayesian regression with massive and high-dimensional data

Recent advances in scalable MCMC methods for high-dimensional Bayesian regression have focused on addressing computational challenges associated with iterative computations involving large-scale covariance matrices. While existing research relies heavily on the large-p-small-n assumption to scale down computational costs, scenarios where both n and p are large have not yet been explored. In this study, we propose an innovative solution to the large-p-large-n problem by integrating a randomized sketching approach into a Gibbs sampling framework. Our method leverages a random sketching matrix to approximate high-dimensional posterior distributions efficiently, enabling scalable Bayesian inference for high-dimensional and massive datasets. The proposed approach is applicable to a variety of shrinkage priors that are widely used in high-dimensional regression. We investigate the performance of the proposed method via simulation studies. 
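
The core idea can be sketched schematically: a random sketching matrix compresses the n rows to m << n so the Gaussian full conditional of the coefficient vector in a ridge-type Gibbs update operates on the sketched data. This is a conceptual sketch only, not the authors' sampler; the prior, the fixed hyperparameters, and the problem sizes are assumptions.

import numpy as np

rng = np.random.default_rng(0)
n, p, m = 20_000, 100, 500                      # large n, moderate p, sketch size m
X = rng.normal(size=(n, p))
beta_true = np.zeros(p); beta_true[:5] = 2.0
y = X @ beta_true + rng.normal(size=n)

S = rng.normal(size=(m, n)) / np.sqrt(m)        # Gaussian sketching matrix
Xs, ys = S @ X, S @ y                           # sketched data replace the full data

tau2, sigma2 = 1.0, 1.0                         # hyperparameters held fixed for this sketch
precision = Xs.T @ Xs / sigma2 + np.eye(p) / tau2
cov = np.linalg.inv(precision)
mean = cov @ (Xs.T @ ys) / sigma2
beta_draw = rng.multivariate_normal(mean, cov)  # one Gibbs-style draw of beta
print(beta_draw[:6].round(2))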

Keywords

Acceptance-rejection method

Gibbs sampling

High-dimensional Bayesian regression

Scalable Bayesian computation 

Abstracts


Co-Author(s)

Gyuhyeong Goh, Department of Statistics, Kyungpook National University
Dipak Dey, University of Connecticut

First Author

Gyeongmin Park, Kyungpook National University

Presenting Author

Gyeongmin Park, Kyungpook National University

50: Similarities & Differences Between Inferential Statistics and Global Haiku: Follow-up

In this poster, we provide additional background and examples for our Feb. 2025 paper in Chance. By juxtaposing statistical estimates, which strive to be precise and accurate, with haiku, which are more often purposely ambiguous or carry several nuanced meanings, we raise the idea that the main similarity between statistics and haiku is that both are a reduction in dimensionality. Statistics reduces many data points to a single numerical summary or a set of coefficients in a model, while haiku places "moments in time" into three-line verse to convey a gist that may be multisensory and filled with the poet's poignant feelings. To demonstrate this contiguity, we begin by introducing readers to pop-culture haiku and literary haiku with examples of both, then briefly refresh readers on the patterns of praxis in descriptive and inferential statistics. We provide regression examples to illustrate dimension reduction, along with side-by-side boxplots of biomechanical data. Additional similarities include imagery (data visualization), hypothesis testing versus the third line of a haiku, and pairing and contrast. Next, current research on the use of AI in haiku is explored. Finally, Stefanski-style residuals are presented. 

Keywords

haiku

regression diagnostics

data visualization

AI haiku

basketball plots 

Abstracts


Co-Author

David McMurray, Kagoshima International University

First Author

Charles Smith, North Carolina State Univ.

Presenting Author

Charles Smith, North Carolina State Univ.

51: STIFT: Spatiotemporal Transcriptomics Integration by Spatially Informed Multi-Timepoint Bridging

Recent advances in spatial transcriptomics have highlighted the need for integrating spatial transcriptomics data across multiple developmental and regenerative stages. We present STIFT (SpatioTemporal Integration Framework for Transcriptomics), a three-component framework combining developmental spatiotemporal optimal transport, spatiotemporal graph construction, and triplet-informed graph attention autoencoder (GATE) specifically designed for integrating spatiotemporal transcriptomics data. STIFT efficiently processes large-scale 2D and 3D spatiotemporal transcriptomics data while preserving temporal patterns and biological structures, enabling batch effect removal, spatial domain identification, trajectory inference and exploration of developmental dynamics. Applied to axolotl brain regeneration, mouse embryonic development, and 3D planarian regeneration datasets, STIFT efficiently removes batch effects and achieves clear spatial domain identification while preserving temporal developmental patterns and biological variations across hundreds of thousands of spots, demonstrating its effectiveness and specificity in integrating spatiotemporal transcriptomics data. 

Keywords

spatial transcriptomics

spatiotemporal data integration

graph attention autoencoder

developmental biology 

Abstracts


Co-Author(s)

Muyang Ge, The Chinese University of Hong Kong
Jishuai MIAO, The Chinese University of Hong Kong
Xiaocheng Zhou, The Chinese University of Hong Kong
Zhixiang Lin, The Chinese University of Hong Kong

First Author

Ji Qi

Presenting Author

Ji Qi

52: Stratified Differential Privacy in Randomized Response: A Simulation Study

This research explores the application of stratified differential privacy in randomized response mechanisms to ensure data confidentiality while maintaining analytical utility. Using R simulations, we implement the Warner randomized response technique with stratification, incorporating Laplace and Gaussian noise mechanisms under varying privacy budgets. The study evaluates the bias and variance of the estimated proportions across different strata. Our results highlight the impact of differential privacy parameters on data utility and privacy protection. 
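
A simulation in the spirit of the abstract (Warner randomized response within strata, with Laplace noise added to the released counts) can be sketched as follows; the strata sizes, true proportions, design probability, and privacy budget are illustrative assumptions, and Python is used here in place of the R simulations described.

import numpy as np

rng = np.random.default_rng(0)
p_design = 0.7                                   # prob. of answering the direct question
epsilon = 1.0                                    # per-stratum privacy budget
strata = {"A": (2000, 0.15), "B": (1500, 0.30)}  # stratum: (size, true proportion)

for name, (n, pi_true) in strata.items():
    truth = rng.random(n) < pi_true
    direct = rng.random(n) < p_design            # Warner: direct vs. complementary question
    yes = np.where(direct, truth, ~truth)        # observed randomized responses
    noisy_yes = yes.sum() + rng.laplace(scale=1.0 / epsilon)   # Laplace mechanism on the count
    lam_hat = noisy_yes / n
    pi_hat = (lam_hat - (1 - p_design)) / (2 * p_design - 1)   # Warner estimator
    print(f"stratum {name}: true={pi_true:.2f}, estimate={pi_hat:.3f}")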

Keywords

Data Privacy

Randomized response technique

Stratified Random Sampling

Probability 

Abstracts


First Author

Grace Kim, Wayzata High School, Minnesota, USA

Presenting Author

Grace Kim, Wayzata High School, Minnesota, USA

53: Supervised Variational Autoencoder with Mixture-of-Experts Prediction

Large-scale datasets, such as images and texts, often exhibit complex heterogeneous structures caused by diverse data sources, intricate experimental designs, or latent subpopulations. Supervised learning from such data is challenging as it requires capturing relevant information from ultra-high-dimensional data while accounting for structural heterogeneity. We propose a unified framework that addresses both challenges simultaneously, facilitating effective feature extraction, structural learning, and robust prediction. The proposed framework employs a supervised variant of variational autoencoder (VAE) for both learning and prediction. Specifically, two types of latent variables are learned through the VAE: low-dimensional latent features and a latent stick-breaking process that characterizes the heterogeneous structure of samples. The latent features reduce the dimensionality of the input data, and the latent stick-breaking process serves as a gating function for mixture-of-experts prediction. This general framework reduces to a supervised VAE when the number of latent clusters is set to one, and to a stick-breaking VAE when both the latent features and response variables are omitted. We demonstrate advantages of the proposed framework by comparing it with supervised VAE and principal component regression in two simulation studies and a real data application involving brain tumor images. 
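
The stick-breaking gating idea can be illustrated in a few lines: latent fractions in (0, 1) (here a sigmoid of arbitrary logits standing in for an encoder output) are converted into mixture-of-experts weights that sum to one. This is a toy sketch of the construction, not the proposed model.

import numpy as np

def stick_breaking_weights(logits):
    """Map K logits to K+1 nonnegative gating weights that sum to one."""
    v = 1.0 / (1.0 + np.exp(-logits))            # stick fractions in (0, 1)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    weights = v * remaining                      # pi_k = v_k * prod_{j<k} (1 - v_j)
    leftover = remaining[-1] * (1.0 - v[-1])     # remaining mass goes to a final component
    return np.append(weights, leftover)

logits = np.array([0.5, -1.0, 2.0])              # placeholder for one sample's encoder output
pi = stick_breaking_weights(logits)
print(pi, pi.sum())                              # gating weights over experts, sums to 1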

Keywords

Variational Autoencoder

Data Heterogeneity

Stick-breaking Process

Supervised machine learning 

Abstracts


Co-Author

Hongxiao Zhu, Virginia Tech

First Author

Jaeyoung Lee, Virginia Tech

Presenting Author

Jaeyoung Lee, Virginia Tech

54: Survival Analysis of Tumor Incidence in a Randomized Placebo-Controlled Clinical Trial

Clonal hematopoiesis (CH), a condition increasingly prevalent with aging, has been associated with elevated risks of hematological malignancies and cardiovascular diseases. The most frequent CH-associated mutations occur in the DNMT3A and TET2 genes, leading to heightened proinflammatory signaling. We conducted a stratified analysis to evaluate the incidence of non-hematological malignancies by treatment group and CH mutation status. Kaplan-Meier and risk models are used to compare malignancy outcomes over the trial period. The incidence of non-hematological malignancies across cancer subtypes is assessed for patients with TET2 or DNMT3A mutations. Individuals in the experimental arm exhibited distinct outcomes across cancer types. The estimated cumulative incidence of at least one malignancy event was also assessed to compare outcomes for TET2-mutant individuals receiving the treatment with those in the placebo group. The observed reduction in cancer incidence among TET2-mutant carriers highlights a potential role for anti-inflammatory treatment in cancer prevention and warrants further investigation using causal inference and longitudinal modeling approaches. 
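
A hedged sketch of the stratified Kaplan-Meier comparison described is shown below using the lifelines package; the simulated times, events, and arm labels are placeholders for the trial data, which are not reproduced here.

import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(0)
n = 400
group = rng.integers(0, 2, n)                    # 0 = placebo, 1 = treatment (placeholder)
time = rng.exponential(scale=np.where(group == 1, 60, 45))   # time to first malignancy
event = rng.random(n) < 0.7                      # True = malignancy observed, False = censored

for g, label in [(0, "placebo"), (1, "treatment")]:
    km = KaplanMeierFitter()
    km.fit(time[group == g], event_observed=event[group == g], label=label)
    print(label, "median time:", km.median_survival_time_)

res = logrank_test(time[group == 0], time[group == 1],
                   event_observed_A=event[group == 0],
                   event_observed_B=event[group == 1])
print("log-rank p-value:", round(res.p_value, 3))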

Keywords

Kaplan-Meier

longitudinal modeling

malignancies 

Abstracts


First Author

Tingting Zhai, University of Kentucky

Presenting Author

Tingting Zhai, University of Kentucky

55: Synthetic Control by Covariate Balancing Propensity Score for Disaggregated Data

Traditionally, most quasi-experimental approaches like the synthetic control method (SCM) were developed for relatively small panel data (< 1000 units). In settings with large-scale environmental data containing many treated and untreated units (e.g., from a few to a few hundred treated units with a donor pool of a few thousand) and a relatively large number of covariates, applying the traditional SCM becomes challenging due to multiplicity of solutions and computational inefficiency. Despite recent developments on the penalized synthetic control method, which resolves the multiplicity of solutions by adding a nearest neighbor matching (NNM) penalty to the original SC estimator, this methodology is still computationally inefficient for high-dimensional datasets such as ours. On the other hand, when casting our SCM problem as a covariate balancing problem using the propensity score (CBPS), we encounter problems of covariate approximation and non-sparsity of solutions in implementation. We conducted various simulation studies to compare the CBPS estimator and the penalized SCM estimator, and propose a new CBPS estimator for disaggregated data. 

Keywords

causal inference

synthetic control method

Covariate Balancing Propensity Score

Disaggregated Data 

Abstracts


First Author

Yanran Li

Presenting Author

Yanran Li

56: Tensor Response Regression with Low Tubal Rank and Sparsity

Complicated, structured tensor data with high dimensions are now ubiquitous in contemporary scientific applications. Motivated by modeling the relationship between multivariate covariates and a complicated tensor response, we propose a tensor response model with low tubal rank and a sparsity constraint. The low tubal rank constraint can capture space-shifting or time-shifting characteristics of the data, while sparsity reduces the number of free parameters.

One special case of our model is equivalent to the multivariate reduced rank regression model. We also put forward a provably convergent ADMM algorithm that obtains the optimized estimate efficiently. Simulations show that our method significantly outperforms existing tensor response models. 
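
The tubal-rank machinery underlying the model can be illustrated with a small t-SVD truncation: take an FFT along the third mode, apply a matrix SVD to each frontal slice in the Fourier domain, truncate, and invert the FFT. The sizes and target rank are assumptions, and this sketch is not the proposed ADMM estimator.

import numpy as np

def tsvd_truncate(tensor, rank):
    """Low-tubal-rank approximation of an (n1, n2, n3) tensor via the t-SVD."""
    F = np.fft.fft(tensor, axis=2)                # frontal slices in the Fourier domain
    out = np.zeros_like(F)
    for k in range(F.shape[2]):
        U, s, Vh = np.linalg.svd(F[:, :, k], full_matrices=False)
        out[:, :, k] = (U[:, :rank] * s[:rank]) @ Vh[:rank, :]   # slice-wise truncation
    return np.real(np.fft.ifft(out, axis=2))

rng = np.random.default_rng(0)
T = rng.normal(size=(20, 15, 8))
approx = tsvd_truncate(T, rank=3)
print("relative error:", np.linalg.norm(T - approx) / np.linalg.norm(T))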

Keywords

ADMM

multidimensional array

multivariate linear regression

reduced rank regression

tubal rank

Fourier transform 

Abstracts


Co-Author

Xin Zhang, Florida State University

First Author

Jiping Wang

Presenting Author

Jiping Wang

57: Test methods for missing at random mechanism in clustered data

Methods for dealing with missing data rely on assumptions about the missingness, whether explicit or implicit. For example, one of the most popular such methods, multiple imputation (MI), typically assumes missing at random (MAR) in most of its implementations, even though the theory of MI does not necessarily require MAR. In this work, we consider formal tests for a condition known as missing always at random (MAAR) as a way to explore the MAR mechanism in settings where observational units are nested within naturally occurring groups. Specifically, we propose two tests for MAR mechanisms that extend existing methods to incorporate clustered data structures: 1) comparison of conditional means (CCM) with clustering effects and 2) testing a posited missingness mechanism with clustering effects. We design a simulation study to evaluate the tests' performance in correctly capturing the missingness mechanism and demonstrate their use in a real-world application on post-COVID conditions that utilizes an EHR dataset. These test methods are expected to provide empirical evidence for improved selection of missing data approaches in application. 

Keywords

MAR

missingness mechanism

test

clustering effects 

Abstracts


Co-Author(s)

Recai Yucel, Temple University
Resa M. Jones, Temple University, Department of Epidemiology & Biostatistics
Edoardo Airoldi, Temple University

First Author

Haoyu Zhou, Temple University

Presenting Author

Haoyu Zhou, Temple University

58: Testing Against Parametric Regression Function using Shape-constrained Splines with AR(p) Errors

Estimating a regression function using a parametric model makes it easier to describe and interpret the relationship being studied. Many practitioners prefer this approach over using a nonparametric model. Here we consider the case of a stipulated parametric function when there are a priori assumptions about the shape and smoothness of the true regression function in the presence of AR(p) errors. For example, suppose it is known that the function must be non-decreasing; we can test the null hypothesis of linear and increasing against the alternative of smooth and increasing, using constrained splines for the alternative fit when AR(1) errors are present. We show that the test is consistent and that the power approaches one as the sample size increases if the alternative is true in the presence of AR(1) errors. There are few existing methods available for comparison with our proposed test. Through simulations, we demonstrate that our test performs well, particularly when compared to the WAVK test in the funtimes R package. 

Keywords

Shape restrictions

Regression splines

Parametric function

Autoregressive errors

Consistent 

Abstracts


Co-Author

Mary Meyer, Colorado State University

First Author

Musfiq Nabeen, Colorado State University

Presenting Author

Musfiq Nabeen, Colorado State University

59: The Impact of Lymphovascular Invasion on 2-Year Survival in HNSC: A Copula Regression Approach

Lymphovascular invasion (LVI) significantly impacts survival in head and neck squamous cell carcinoma (HNSC). Traditional two-stage analyses risk biasing the estimated effect of LVI on patient survival because of endogeneity. To address these issues, we propose a joint approach using a bivariate recursive copula model to estimate the effect of LVI status on two-year survival while controlling for potential endogeneity. This framework separates the joint model from the marginal distributions, offering a flexible dependence structure. Using data from The Cancer Genome Atlas (TCGA), we integrate miRNA expression, clinical covariates, and demographic factors to estimate LVI's average treatment effect (ATE) on survival. Key miRNAs (e.g., hsa-miR-203a-3p, hsa-miR-194-5p, hsa-miR-337-3p) were analyzed for their association with survival outcomes. Results indicate that LVI significantly reduces 2-year survival, with an ATE of -47%. Age at diagnosis exhibits a nonlinear effect on survival outcomes. This study highlights the utility of copula models in addressing endogeneity and provides insights into the interplay between LVI, molecular biomarkers, and survival outcomes. 

Keywords

Lymphovascular invasion

survival analysis

copula regression

endogeneity

miRNA 

Abstracts


Co-Author

Roger S Zoh, Indiana University

First Author

Yang Ou, Indiana University

Presenting Author

Yang Ou, Indiana University

60: Using Statistical Modeling to Revolutionize Ultimate Frisbee Strategy

Ultimate Frisbee, a fast-paced and dynamic sport, demands innovative offensive strategies to outmaneuver opponents and maximize scoring opportunities. This presentation introduces a simulation-based statistical modeling framework designed to evaluate and compare the success probabilities of various offensive plays. By integrating player-specific attributes, such as throwing precision and catching reliability, into a detailed model of frisbee dynamics, this approach provides actionable insights to optimize team performance. The model adapts to diverse team compositions and opposing defenses, offering a practical tool to support strategic planning in this evolving sport. 
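
A toy Monte Carlo sketch of the kind of play evaluation such a framework performs is shown below: a play is a sequence of throws, each completed with a probability combining thrower precision and receiver reliability. All probabilities and play definitions are made-up inputs, not calibrated player data or the presenter's model.

import numpy as np

rng = np.random.default_rng(0)

def play_success_prob(throw_chain, n_sims=100_000):
    """throw_chain: list of (thrower_precision, receiver_reliability) per pass."""
    completions = np.ones(n_sims, dtype=bool)
    for precision, reliability in throw_chain:
        p_complete = precision * reliability            # simple independence assumption
        completions &= rng.random(n_sims) < p_complete  # play fails if any throw fails
    return completions.mean()

huck_play = [(0.80, 0.75)]                                    # one long throw
swing_play = [(0.95, 0.97), (0.93, 0.96), (0.85, 0.90)]       # three shorter throws
print("huck:", play_success_prob(huck_play))
print("swing sequence:", play_success_prob(swing_play))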

Keywords

Ultimate frisbee

modeling and simulation in sports

sports analytics

statistical modeling 

Abstracts


First Author

Leo Shi

Presenting Author

Leo Shi

61: Utilizing Variational Autoencoders to Shift Individual Level Data Towards Summary Level Statistics

Access to individual-level data (ILD) from published literature poses a hurdle for researchers, yet such access is a driving force for many analyses (surrogate outcome validation, subgroup analyses, and other settings). Generative modeling can produce synthetic data that reflect the underlying properties of existing ILD. Specifically, by utilizing Variational Autoencoders (VAEs) and extending them to tabular data, new possibilities for accelerating research arise. This application of VAEs, within R, presents a simple method for researchers to leverage a set of ILD. The method applies to a mixture of distributions (binary, categorical, normal, etc.). While access to ILD may be difficult, summary-level information is more readily available. We propose an extension of VAEs that shifts the underlying distribution of the data towards summary-level statistics. This extension produces multiple sets of ILD under different prior information. The resulting shifted ILD can be considered a trustworthy representation of a published paper's data. By extending the framework of VAEs to tabular data and allowing for a distribution shift, exploratory research without direct ILD access becomes feasible. 

Keywords

Variational Autoencoders

Synthetic Data

Distribution Shift

Machine Learning

Generative Modeling

Summary Level Data 

Abstracts


Co-Author(s)

Janice Weinberg, Boston Univ School of Public Health
Fatema Shafie Khorassani

First Author

Sarah Milligan

Presenting Author

Sarah Milligan

62: Weakly Supervised Transformer for Rare Disease Phenotyping

Rare diseases affect an estimated 300-400 million people worldwide, yet individual conditions remain poorly characterized and difficult to diagnose due to their low prevalence and limited clinician familiarity. Efforts to automate rare disease detection through computational phenotyping are limited by the scarcity of labeled data and biases in available label sources. Gold-standard labels from registries or expert chart review offer high accuracy but suffer from selection bias and high ascertainment costs, while labels derived from electronic health records (EHRs) capture broader patient populations but introduce noise. To address these challenges, we propose a weakly supervised, transformer-based framework that integrates gold-standard labels with iteratively refined silver-standard labels from EHR data to train a scalable and generalizable phenotyping model. We first learn concept-level embeddings from EHR co-occurrence patterns, which are then refined and aggregated into patient-level representations using a multi-layer transformer. Using rare pulmonary diseases as a case study, we validate our framework on EHR data from Boston Children's Hospital. Our approach improves phenotype classification, uncovers clinically meaningful subphenotypes, and enhances disease progression prediction, enabling more accurate and scalable identification and stratification of rare disease patients. 

Keywords

Semi-Supervised Learning

Transformers

Phenotyping

Electronic Health Records

Rare Diseases

Machine Learning 

Abstracts


Co-Author(s)

Zongxin Yang, Harvard Medical School
Mengyan Li, Bentley University
Han Tong, Columbia University
Alon Geva, Boston Children's Hospital
Kenneth Mandl, Boston Children's Hospital
Tianxi Cai, Harvard University

First Author

Kimberly Greco, Harvard University

Presenting Author

Kimberly Greco, Harvard University

63: Weighted Bayesian Bootstrap for Reduced Rank Regression with Singular Value Decomposition

Bayesian Reduced Rank Regression (RRR) has attracted increasing attention as a means to quantify the uncertainty of both the coefficient matrix and its rank in a multivariate linear regression framework. However, the existing Bayesian RRR approach relies on the strong assumption that the positions of independent coefficient vectors are known when the rank of the coefficient matrix is given. In contrast, the conventional RRR approach is free from this assumption since it permits the singular value decomposition (SVD) of the coefficient matrix. In this paper, we propose a Weighted Bayesian Bootstrap (WBB) approach to incorporate the SVD into the Bayesian RRR framework. The proposed Bayesian method offers an innovative way of sampling from the posterior distribution of the low-rank coefficient matrix. In addition, our WBB approach allows simultaneous posterior sampling for all ranks, which greatly improves computational efficiency. To quantify the rank uncertainty, we develop a posterior sample-based Monte Carlo method for marginal likelihood calculation. We demonstrate the superiority and applicability of the proposed method by conducting simulation studies and real data analysis. 
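
A conceptual sketch of a single weighted-Bayesian-bootstrap draw in this spirit (not the proposed algorithm): re-weight observations with Exp(1) weights, solve weighted least squares, and truncate the coefficient matrix by SVD to a candidate rank. The dimensions and data-generating model below are illustrative assumptions, and the marginal-likelihood step for rank uncertainty is not shown.

import numpy as np

rng = np.random.default_rng(0)
n, p, q, true_rank = 200, 10, 6, 2
B_true = rng.normal(size=(p, true_rank)) @ rng.normal(size=(true_rank, q))
X = rng.normal(size=(n, p))
Y = X @ B_true + rng.normal(size=(n, q))

def wbb_draw(X, Y, rank, rng):
    w = rng.exponential(size=X.shape[0])            # random observation weights
    Xw, Yw = X * np.sqrt(w)[:, None], Y * np.sqrt(w)[:, None]
    B_ols = np.linalg.lstsq(Xw, Yw, rcond=None)[0]  # weighted least-squares fit
    U, s, Vh = np.linalg.svd(B_ols, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vh[:rank, :]  # rank-constrained coefficient draw via SVD

draws = np.stack([wbb_draw(X, Y, rank=2, rng=rng) for _ in range(200)])
print("posterior-style mean error:",
      np.linalg.norm(draws.mean(axis=0) - B_true) / np.linalg.norm(B_true))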

Keywords

Bayesian Reduced Rank Regression

Singular Value Decomposition

Weighted Bayesian Bootstrap

Bayes factors 

Abstracts


Co-Author(s)

Hyeonji Shin, Kyungpook National University
Hyewon Oh, Kyungpook National University
Yeonsu Lee, Kyungpook National University
Minseok Kim, Kyungpook National University
Seongyun Kim, Kyungpook National University
Gyuhyeong Goh, Department of Statistics, Kyungpook National University

First Author

Wonbin Jung, Department of Statistics, Kyungpook National University

Presenting Author

Wonbin Jung, Department of Statistics, Kyungpook National University