Monday, Aug 4: 2:00 PM - 3:50 PM
4075
Contributed Posters
Music City Center
Room: CC-Hall B
Main Sponsor
Business Analytics/Statistics Education Interest Group
Business and Economic Statistics Section
Casualty Actuarial Society
Scientific and Public Affairs Advisory Committee
Presentations
General Bayesian updating (GBU) is a framework for updating prior beliefs about the parameter of interest to a posterior distribution via a loss function, without imposing a distributional assumption on the data. In recent years, the asymptotic distribution of the loss-likelihood bootstrap (LLB) sample has served as the standard for determining the loss scale parameter, which controls the relative weight of the loss function against the prior in GBU. However, the existing method fails to account for the prior distribution because it relies on the asymptotic equivalence between GBU and LLB. To address this limitation, we propose a new finite-sample-based approach to loss scale determination using the Bayesian generalized method of moments (BGMM) as a reference. We develop an efficient algorithm that determines the loss scale parameter by minimizing the Kullback-Leibler divergence between the exact posteriors of GBU and BGMM. We prove the convexity of our objective function to ensure a unique solution. Asymptotic properties of the proposed method are established to demonstrate its generalizability. We demonstrate the performance of our proposed method through a simulation study and a real data application.
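As background for readers new to this framework, the display below sketches the generic form of a general Bayesian (Gibbs) posterior with loss scale eta and a loss-scale criterion of the kind described above; the particular loss, the reference BGMM posterior, and the direction of the Kullback-Leibler divergence are illustrative assumptions rather than details taken from this work.

```latex
% Generic general Bayesian (Gibbs) posterior with loss scale eta, and a
% loss-scale criterion of the kind described in the abstract (illustrative).
\[
  \pi_{\eta}(\theta \mid x_{1:n}) \;\propto\; \pi(\theta)\,
  \exp\!\Big\{-\eta \sum_{i=1}^{n} \ell(\theta; x_i)\Big\},
  \qquad
  \hat{\eta} \;=\; \arg\min_{\eta > 0}\;
  \mathrm{KL}\!\left( \pi_{\eta}(\cdot \mid x_{1:n}) \,\middle\|\,
  \pi^{\mathrm{BGMM}}(\cdot \mid x_{1:n}) \right).
\]
```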
Keywords
General Bayesian updating
Loss-likelihood bootstrap
Generalized method of moments
Kullback-Leibler divergence
Monte Carlo Newton-Raphson method
Abstracts
Co-Author(s)
Seung Jun Park, Department of Statistics, Kyungpook National University
Pamela Kim Salonga, Department of Statistics, Kyungpook National University
Kyeong Eun Lee, Department of Statistics, Kyungpook National University
Gyuhyeong Goh, Department of Statistics, Kyungpook National University
First Author
YU JIN SEO, Department of Statistics, Kyungpook National University
Presenting Author
YU JIN SEO, Department of Statistics, Kyungpook National University
Traditional statistical and machine learning methods often struggle to capture the complex, interconnected relationships within biological data that enable biomarker discovery. We present a novel graph-based framework that leverages graph neural networks and network-based feature engineering to identify predictive biomarkers. Our approach constructs several biological networks by integrating gene expression data and clinical attributes in a graph database, providing multiple representations of patient-specific relationships. We apply graph learning techniques to the ensemble of graphs to identify candidate biomarkers using hierarchical, feature-based, and filter-based methods. Using three independent datasets, we demonstrate that our method improves predictive performance compared to conventional machine learning models. This scalable and interpretable strategy has broad applications in biomarker discovery across diverse disease domains.
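As a rough illustration of network-based feature engineering (not the authors' pipeline), the sketch below builds a patient similarity graph from hypothetical expression data with networkx and derives simple node-level features that a downstream classifier could consume; the correlation threshold and the chosen features are arbitrary assumptions.

```python
# Illustrative sketch: build a patient similarity graph from simulated gene
# expression and derive simple network-based features for downstream modeling.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
expr = rng.normal(size=(50, 200))          # hypothetical: 50 patients x 200 genes

# Correlation-based adjacency: connect patients whose expression profiles correlate strongly.
corr = np.corrcoef(expr)
adj = (corr > 0.2) & ~np.eye(len(corr), dtype=bool)

G = nx.from_numpy_array(adj.astype(int))

# Node-level network features that could feed a downstream classifier.
features = {
    "degree": dict(G.degree()),
    "clustering": nx.clustering(G),
    "betweenness": nx.betweenness_centrality(G),
}
print({name: round(values[0], 3) for name, values in features.items()})  # features for patient 0
```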
Keywords
graph database
graph neural network
biomarker
feature engineering
feature selection
Abstracts
Co-Author(s)
Jason Huse, MD Anderson Cancer Center
Kasthuri Kannan, MD Anderson Cancer Center
First Author
Yang Liu, SPH in The University of Texas Health Science Center at Houston | MD Anderson Cancer Center
Presenting Author
Yang Liu, SPH in The University of Texas Health Science Center at Houston | MD Anderson Cancer Center
In the US, the FDA uses linear regression and the noncentral t distribution to estimate the upper limit of the 95% CI for the 99% quantile (TLM) and defines the withdrawal time (WDT) as the time at which this TLM falls at or below a safe concentration level (tolerance) following administration of the approved drug, in a labeled or extra-label manner, in food animal species. This approach uses only the concentrations at or above the limits of detection (LODs) and determines the WDT for each tissue separately. However, the tissues collected from an animal, namely liver, kidney, muscle, and fat, may be correlated, so a multivariate linear regression model (MvLR) is appropriate for addressing this high inter-tissue correlation. In addition, using only the concentrations above the LOD means that censored observations can distort the correlation or covariance pattern among the tissues and yield biased and imprecise estimators. Therefore, we propose using ordinary least squares (OLS) and generalized least squares (GLS) in the MvLR, and the expectation-conditional maximization (ECM) algorithm in the censored MvLR, together with the multivariate t distribution, to obtain more precise and accurate estimates of the WDT.
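As background on the noncentral-t calculation, the sketch below computes a one-sided upper tolerance limit (95% confidence on the 99th percentile) for a single sample of hypothetical log-concentration data using scipy; in the regulatory setting the same factor is applied to regression predictions over time for each tissue, which this simplified example omits.

```python
# Minimal sketch of the noncentral-t upper tolerance limit idea (99th percentile,
# 95% confidence) for a single tissue; hypothetical residue data on the log scale.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
log_conc = rng.normal(loc=0.5, scale=0.3, size=20)   # hypothetical log concentrations

n = len(log_conc)
xbar, s = log_conc.mean(), log_conc.std(ddof=1)

p, conf = 0.99, 0.95
delta = stats.norm.ppf(p) * np.sqrt(n)               # noncentrality parameter
k = stats.nct.ppf(conf, df=n - 1, nc=delta) / np.sqrt(n)

utl = xbar + k * s                                   # upper 95% confidence limit on the 99th percentile
print(f"tolerance factor k = {k:.3f}, UTL (log scale) = {utl:.3f}")
```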
Keywords
TLM
WDT
Multivariate t distribution
OLS
GLS
ECM
Abstracts
Real-world post-vaccine safety monitoring is crucial for detecting adverse events and maintaining public trust. This study applied Rapid Cycle Analysis (RCA) to assess 2023–2024 COVID-19 vaccines (Pfizer, Moderna, Novavax) for 14 outcomes, including ischemic stroke, GBS, and myocarditis.
VSD data from nine healthcare organizations included 2.7M doses (Sep 2023–Apr 2024). Outcomes were identified in healthcare records. RCA used a concurrent comparator design to compare adverse event rates across risk and comparison intervals. Weekly monitoring with Pocock alpha-spending to control the Type I error enabled real-time safety assessment and reduced bias.
RCA identified a GBS signal after Pfizer in ≥65 yrs (aRR: 4.45, 95% CI: 1.07–22.62) and ischemic stroke signals with Pfizer (18–64 yrs: aRR: 1.48, 95% CI: 1.04–2.11) and Moderna (≥65 yrs: aRR: 1.68, 95% CI: 1.05–2.70). No signals were found for other outcomes.
RCA enables real-time vaccine safety monitoring, addressing limits of traditional comparators. While GBS and stroke signals require further evaluation, the 2023–2024 COVID-19 vaccines show a reassuring safety profile. Ongoing monitoring remains key for public trust and safety.
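To make the sequential-testing machinery concrete, the snippet below computes a Pocock-type alpha-spending schedule (the Lan–DeMets approximation) over hypothetical weekly information fractions; the spending function, number of looks, and overall alpha are assumptions for illustration and are not taken from the VSD analysis.

```python
# Sketch of a Pocock-type alpha-spending schedule (Lan-DeMets approximation)
# over hypothetical weekly looks; not the actual RCA monitoring code.
import numpy as np

alpha = 0.05
info_frac = np.linspace(0.1, 1.0, 10)                    # hypothetical information fraction at each weekly look

cum_alpha = alpha * np.log(1 + (np.e - 1) * info_frac)   # cumulative alpha spent
incr_alpha = np.diff(cum_alpha, prepend=0.0)             # alpha spent at each individual look

for t, a in zip(info_frac, incr_alpha):
    print(f"information {t:.1f}: spend {a:.4f}")
```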
Keywords
Vaccine Safety
Post-Vaccination Surveillance
Adverse Events
Rare Events
Signal Detection
Rapid Cycle Analysis (RCA)
Sequential Analysis
Concurrent Comparator Design
Pocock Alpha-Spending
Real-Time Monitoring
Vaccine Safety Datalink (VSD)
ICD-10 Codes
Healthcare Records
Guillain-Barré Syndrome (GBS)
Ischemic Stroke
Myocarditis
Adjusted Rate Ratio (aRR)
Risk and Comparison Intervals
COVID-19 Vaccine
Pfizer
Moderna
Novavax
Abstracts
We propose an unbiased restricted estimator that leverages prior information to enhance estimation efficiency for the linear regression model. The statistical properties of the proposed estimator are rigorously examined, highlighting its superiority over several existing methods. A simulation study is conducted to evaluate the performance of the estimators, and real-world data on total national research and development expenditures by country are analyzed to illustrate the findings. Both the simulation results and real-data analysis demonstrate that the proposed estimator consistently outperforms the alternatives considered in this study.
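For context on restricted estimation, the snippet below computes the textbook restricted least-squares estimator under a linear restriction Rβ = r on simulated data; the restriction shown is hypothetical, and the unbiased restricted estimator proposed in the abstract is not reproduced here.

```python
# Numerical sketch of the restricted least-squares estimator under linear
# restrictions R beta = r (generic formula, not the proposed estimator).
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 4
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_ols = XtX_inv @ X.T @ y

# Prior information expressed as a restriction, e.g. beta_1 + beta_2 = 1 (hypothetical).
R = np.array([[0.0, 1.0, 1.0, 0.0]])
r = np.array([1.0])

adjustment = XtX_inv @ R.T @ np.linalg.inv(R @ XtX_inv @ R.T) @ (R @ beta_ols - r)
beta_rls = beta_ols - adjustment
print(beta_ols.round(3), beta_rls.round(3))
```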
Keywords
Linear model
MSE
Unbiased ridge estimator
Restricted least-squares estimator
Multicollinearity
Abstracts
In recent times, wildfires have posed significant threats to forest ecosystems, human communities, and economic assets. This study applies spatio-temporal techniques to analyze and predict wildfire patterns in the forests of Colorado, Montana, Utah, and Wyoming. Utilizing historical wildfire data, weather conditions, vegetation types, and topographic features, we aim to develop comprehensive models to identify high-risk areas and forecast future wildfire events. By integrating remote sensing data and geographic information systems (GIS), we perform detailed spatio-temporal analyses to uncover underlying patterns and trends. The findings from this research provide valuable insights for forest management, risk assessment, and wildfire mitigation strategies, contributing to more effective resource allocation and community preparedness.
Keywords
Spatio-temporal analysis
Wildfire patterns
Forest Ecosystems
Risk assessment
Predictive modeling
Remote sensing
Abstracts
Vaccine efficacy is defined as the relative reduction in disease risk, 1-(Rv/Rp), where Rv and Rp are the incidence rates of the disease of interest in the vaccine and placebo groups, respectively. A conditional exact method proposed by Chan and Bohidar (1998) is often used to estimate vaccine efficacy and its confidence interval. In this poster, we compare the conditional exact method with two alternative approaches, Poisson regression and the modified Poisson regression proposed by Zou (2004), using trial data with and without follow-up time adjustment. The dengue epidemiologic distribution across Brazil will be provided via a QR code.
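As a rough illustration of the modified Poisson approach of Zou (2004), the sketch below fits a Poisson regression with a robust (sandwich) variance to a simulated binary outcome, with log follow-up time as an offset, and converts the estimated rate ratio to a vaccine efficacy; the data, covariates, and exact robust-variance flavor are assumptions, not the trial analysis.

```python
# Sketch of modified Poisson regression with robust SEs for estimating VE = 1 - RR.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 2000
vaccinated = rng.integers(0, 2, size=n)
followup = rng.uniform(0.5, 1.0, size=n)                  # person-years of follow-up
rate = np.where(vaccinated == 1, 0.01, 0.04)
case = (rng.poisson(rate * followup) > 0).astype(float)   # binary disease indicator

X = sm.add_constant(vaccinated.astype(float))
fit = sm.GLM(case, X, family=sm.families.Poisson(),
             offset=np.log(followup)).fit(cov_type="HC1")  # robust (sandwich) variance

rr = np.exp(fit.params[1])
lo, hi = np.exp(fit.conf_int()[1])
print(f"VE = {1 - rr:.2%} (95% CI: {1 - hi:.2%} to {1 - lo:.2%})")
```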
Keywords
Vaccine efficacy
conditional exact method
Poisson regression
modified Poisson regression
Abstracts
RNA sequencing (RNA-Seq) is a powerful technology for quantifying gene expression and identifying genes influenced by environmental exposures or other treatments. There is currently interest in how mixtures of multiple environmental chemicals affect gene expression using RNA-Seq data. There are many popular methods for RNA-Seq analysis; however, none focus on correlated environmental exposures. The existing Bayesian kernel machine regression (BKMR) effectively analyses mixture effects for continuous outcomes but cannot handle the unique challenges of RNA-Seq count data. We develop BKMRSeq, a novel BKMR model tailored for RNA-Seq count data to address this gap. BKMRSeq uses Polya-Gamma augmentation within a Markov chain Monte Carlo (MCMC) framework to estimate a complex non-linear association between exposures and gene expression using a Gaussian kernel matrix and select genes that are differentially expressed. Through simulation studies, BKMRSeq demonstrates superior performance compared to the existing methods to analyze gene expression when there is a complex exposure-response relation. We further validate its utility by applying it to real-world RNA-Seq data.
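For reference, the Polya-Gamma integral identity of Polson, Scott, and Windle (2013), which this style of augmentation builds on, is shown below; conditional on the latent variable, the likelihood becomes Gaussian in the linear predictor, which is what enables conjugate MCMC updates.

```latex
% Polya-Gamma integral identity (Polson, Scott & Windle, 2013) underlying
% this type of data augmentation for logistic/count likelihoods.
\[
  \frac{\left(e^{\psi}\right)^{a}}{\left(1+e^{\psi}\right)^{b}}
  \;=\; 2^{-b}\, e^{\kappa \psi}
  \int_{0}^{\infty} e^{-\omega \psi^{2}/2}\, p(\omega \mid b, 0)\, d\omega,
  \qquad \kappa = a - b/2,
\]
% where p(omega | b, 0) is the PG(b, 0) density; given omega, the term is
% proportional to a Gaussian kernel in psi.
```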
Keywords
RNA-Seq
Correlated environmental exposures
BKMR
Polya-gamma distribution
Data augmentation
Abstracts
With the growing availability of high-dimensional data, variable selection has become an inevitable step in regression analysis. Traditional Bayesian inference, however, depends on correctly specifying the likelihood, which is often impractical. The loss-likelihood bootstrap (LLB) has recently gained attention as a tool for likelihood-free Bayesian inference. In this paper, we aim to overcome the limited applicability of LLB for high-dimensional regression problems. To this end, we develop a likelihood-free Markov Chain Monte Carlo Model Composition (MC3) method. Traditional MC3 requires marginal likelihoods, which are not available in our likelihood-free setting. To address this, we propose a novel technique that utilizes the Laplace approximation to estimate marginal likelihood ratios without requiring explicit likelihood evaluations. This advancement allows for efficient and accurate model comparisons within the likelihood-free context. Our proposed method is applicable to various high-dimensional regression methods including machine learning techniques. The performance of the proposed method is examined via simulation studies and real data analysis.
Keywords
Bayesian variable selection
High-dimensional regression
Likelihood-free Bayesian inference
Loss-likelihood bootstrap (LLB)
Abstracts
Co-Author(s)
Gyuhyeong Goh, Department of Statistics, Kyungpook National University
Dipak Dey, University of Connecticut
First Author
Minhye Park, Kyungpook National University
Presenting Author
Minhye Park, Kyungpook National University
Single-cell multiplex imaging (scMI) measures cell locations and phenotypes in tissues, enabling insights into the tumor microenvironment. In scMI studies, quantifying spatial co-localization of immune cells and its link to clinical outcomes, such as survival, is crucial. However, it is unclear which spatial indices have sufficient power to detect within-sample co-localization and its association with outcomes. This study evaluated six frequentist spatial co-localization metrics using simulated data to assess their power and Type I error. Additionally, these metrics were applied to two scMI datasets, high-grade serous ovarian cancer (HGSOC) and triple-negative breast cancer (TNBC), to detect co-localization between cell types and its relation to survival. Simulations showed Ripley's K had the highest power, followed by pair correlation g, while other metrics exhibited low power. In cancer studies, Ripley's K, pair correlation g, and the scLMM index were most effective in detecting within-sample co-localization and associations with survival, highlighting their utility in scMI analyses.
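As a toy illustration of one of the metrics compared here, the snippet below computes a bivariate (cross-type) Ripley's K without edge correction on simulated cell coordinates and contrasts it with its expectation under complete spatial randomness; the point patterns, the radii, and the omission of edge correction are simplifying assumptions.

```python
# Toy bivariate Ripley's K (no edge correction) for quantifying co-localization
# of two cell types in one sample.
import numpy as np

rng = np.random.default_rng(4)
area = 1.0                                   # unit-square tissue region (hypothetical)
cells_a = rng.uniform(size=(150, 2))         # e.g., tumor cells
cells_b = rng.uniform(size=(80, 2))          # e.g., immune cells

def cross_k(pts1, pts2, r, area):
    # |A| / (n1 * n2) * number of cross-type pairs within distance r
    d = np.linalg.norm(pts1[:, None, :] - pts2[None, :, :], axis=-1)
    return area * np.mean(d <= r)

radii = np.linspace(0.01, 0.2, 10)
k_obs = np.array([cross_k(cells_a, cells_b, r, area) for r in radii])
k_csr = np.pi * radii**2                     # expectation under complete spatial randomness
print(np.round(k_obs - k_csr, 4))            # positive values suggest co-localization
```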
Keywords
Co-Clustering
Multiplex Imaging
Spatial Biology
Spatial Proteomics
Abstracts
Artificial Intelligence (AI) is showing up in more and more parts of our lives, including college classrooms. But can an AI actually help students learn tough subjects like statistics? This study takes a first look at the BEAR (Biostatistical Education Assistant and Resource), an AI tutor created using ChatGPT to help students in an online Biostatistics course at the University of Alabama. The BEAR is designed to help students understand challenging concepts, stimulate deeper thinking through thoughtful questioning, and provide supportive guidance similar to that of a real teacher. A key feature is its two-way interactivity, where both the BEAR and students can ask and answer questions, making the learning process more engaging and potentially more effective. At the end of the Spring 2025 semester, students will be surveyed to find out how helpful they found the BEAR, whether it improved their understanding, and how satisfied they were with the experience. This pilot study aims to share early insights into how AI tools like the BEAR might shape the future of online statistics education.
Keywords
AI
ChatGPT
Online Statistical Education
AI Tutoring
Abstracts
Artificial intelligence has given researchers access to a broader range of historical data, enabling the development of more accurate predictive models in finance. This study investigates the relationship between treasury bill rates and two high-demand commodities: gasoline and gold. We compiled a comprehensive dataset spanning ten years of historical prices (July 2014-July 2024), consisting of treasury bill rates, gasoline prices, gold-EPI, and natural gas consumption, along with vital economic indicators such as GDP, inflation rates, and interest rates. Our analysis reveals that natural gas consumption exhibits the highest positive correlation (0.64) with treasury bill rates, while gasoline prices and the export price index of gold show moderate positive correlations (0.51 and 0.52, respectively). We trained a machine learning model (Artificial Neural Network) and three deep learning models (Recurrent Neural Network, Convolutional Neural Network, and 3D Convolutional Neural Network) on the data, and their performance was assessed using mean absolute error, mean squared error, root mean squared error (RMSE), and R-squared. The Recurrent Neural Network model performed best, with an RMSE of 0.08.
Keywords
Deep Learning
Recurrent Neural Network
Treasury Bill Rate
3D Convolutional Neural Network
Artificial Neural Network
Machine Learning
Abstracts
Wastewater surveillance is a promising tool for tracking COVID-19, but its effectiveness in underserved populations, who may experience disproportionately severe illness, has not been fully established. The NIH-funded RADx-UP program, comprising 144 projects across the U.S., aims to expand COVID-19 testing accessibility, particularly in hard-hit areas. The utility of wastewater surveillance in underserved communities can be assessed by comparing its findings with screening data from RADx-UP, which includes both symptomatic and asymptomatic individuals. However, this assessment requires RADx-UP to be representative of the U.S. population, a criterion that can be evaluated through generalizability analysis. In this study, we report participant characteristics and compute a generalizability score, which quantifies how well a study sample reflects a target population of interest based on demographic and clinical characteristics, at the Federal Information Processing Standard (FIPS) geography code level. With over 350,000 participants in 1,005 FIPS codes, we anticipate that major metropolitan areas have sufficient data for generalizable estimates, especially in underserved counties.
Keywords
causal inference study
propensity score model
multiple imputation
Abstracts
The luminosity function is a fundamental concept in astrophysics and cosmology that describes the distribution of luminosities within a group of astronomical objects, such as galaxies. In practice, there may exist a monotone data drift (selection bias) when luminosities are collected: at the same distance, stars or galaxies with larger luminosity have a higher chance of being observed. This poses challenges for standard estimation procedures: ignoring the bias can lead to misleading estimates, whereas procedures that account for it may be inefficient when no selection bias is present. This poster introduces a semi-parametric procedure for detecting monotone drifts in data with unknown parameters, with an application to a real dataset of galaxy luminosities.
Keywords
selection bias
semiparametric
luminosity function
Abstracts
The Network Autoregression (NAR) model is widely used for analyzing network-dependent data. However, assuming fixed parameters over time is often unrealistic in dynamic systems. Identifying time points where NAR parameters shift, known as change points, is crucial for capturing structural changes in the network process. This work proposes a rolling-window approach to detect these change points efficiently. The method adapts to evolving network structures and parameter variations, improving the model's flexibility in real-world applications. Simulation studies demonstrate the effectiveness of the proposed approach, and its applicability is further illustrated using real-world network data.
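For orientation, one common specification of the NAR model (in the spirit of Zhu et al., 2017) is shown below; a change point corresponds to a shift in the coefficient vector over time. The exact parameterization used in this work may differ, and nodal covariates are omitted here.

```latex
% One common NAR specification; a change point at time tau would correspond to
% a shift in (beta_0, beta_1, beta_2) before versus after tau.
\[
  Y_{i,t} \;=\; \beta_0
  \;+\; \beta_1 \, n_i^{-1} \sum_{j=1}^{N} a_{ij}\, Y_{j,t-1}
  \;+\; \beta_2 \, Y_{i,t-1}
  \;+\; \varepsilon_{i,t},
  \qquad n_i = \sum_{j} a_{ij},
\]
% where a_{ij} is the adjacency indicator between nodes i and j.
```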
Keywords
Change Point Detection
Network Autoregressive (NAR) Model
Network-Dependent Data
Non-Stationary Processes
Structural Changes
Abstracts
As NASA's Human Research Program (HRP) prepares for long-duration Mars missions, understanding astronaut tasks is crucial. This study, conducted at NASA Glenn Research Center (GRC), applied Natural Language Processing (NLP) and machine learning to classify 1,058 Mars tasks into 18 Human System Task Categories (HSTCs) [1]. We developed an NLP model using Google's BERT [2] to capture semantic and syntactic nuances. Supervised training on a subset of tasks improved classification accuracy for 9 of 18 HSTCs, especially when incorporating HSTC descriptions. To address severe class imbalance, we introduced weighting and sampling techniques for data augmentation [3]. Additionally, we fine-tuned BERT to implement a pairwise relatedness scoring method, enabling task clustering and progressing toward unsupervised labeling. This presentation covers data preprocessing, key syntax extraction using BERT, and supervised classification, highlighting NLP's potential for analyzing Mars mission tasks in crew health and performance studies.
Keywords
Natural Language Processing
Machine Learning
BERT (Bidirectional Encoder Representations from Transformers)
NASA Human Research Program (HRP)
Mars Task Classification
Supervised Learning
Abstracts
Co-Author(s)
Mona Matar, NASA Glenn Research Center
Hunter Rehm, HX5 LLC, NASA Glenn Research Center
First Author
Adam Kurth, Arizona State University - Biodesign Institute
Presenting Author
Adam Kurth, Arizona State University - Biodesign Institute
Resting-state fMRI (rfMRI) is a powerful tool for characterizing brain-related phenotypes, but current approaches are often limited in their ability to efficiently capture the heterogeneous nature of functional connectivity due to lack of robust, scalable statistical methods. In this paper, we propose a new distributed learning framework for modeling the voxel-level dependencies across the whole brain, thus avoiding the need to average voxel outcomes within each region of interest. In addition, our method addresses confounder heterogeneity by integrating subject-level covariates in the estimation, which allows for comparing functional connectivity across diverse populations. We demonstrate the effectiveness and scalability of our approach in handling large rfMRI outcomes through simulations. Finally, we apply the proposed framework to study the association between brain connectivity and autism spectrum disorder (ASD), uncovering connectivity patterns that may advance our understanding of ASD-related neural mechanisms.
Keywords
rfMRI
functional connectivity
voxel-level dependencies
distributed learning
ASD
whole-brain modeling
Abstracts
Although causal effects can be modified by interactions between the exposure and confounders, such interactions are difficult to capture in a mediation model. Function-on-scalar models with functional responses and scalar covariates have been used to study the dynamic relationship between functional responses and covariates, or the time-varying relationship between functional responses and selected covariates, but they cannot recover the interaction effects between exposure and confounders. To explore the dynamic effects of a single index that includes dynamic confounders, we propose a dynamic functional mediation model. Simulation studies evaluate the performance and validity of the proposed model, together with estimation and inference procedures for causal estimands based on the wild bootstrap. We apply our approach to country-level datasets, adjusting for dependence between countries by estimating the spatial dependence. The pathways of individual indirect effects from exposure to outcome through the geospatial mediator can thus be investigated without linearity assumptions or dependency issues.
Keywords
COVID-19
dynamic interaction semiparametric function-on-scalar model
mediation analysis
penalized scalar-on-function linear regression
Abstracts
Co-Author
Chao Huang, University of Georgia
First Author
Miyeon Yeon, The University of Tennessee Health Science Center
Presenting Author
Miyeon Yeon, The University of Tennessee Health Science Center
Recent studies link air pollution exposure to human and environmental health. It is critical to identify time intervals and spatial regions where such exposure risks are high. For fine particles we need to be able to visualize and model effects of various quantiles, in addition to mean effects, since people are more adversely affected by excessive levels of pollution.
We propose versatile tools to describe and visualize quantiles of data with wide-ranging spatial-temporal structures and various degrees of missingness. We illustrate this methodology through dynamic visualization of spatial-temporal patterns to provide useful insights to where and when the process changes. This approach does not require strong theoretical assumptions and is useful to guide future modeling efforts.
This statistical framework is applied to daily PM2.5 concentrations for the years 2020-24 collected at 108 locations across NY, NJ, and PA. We show how PM2.5 exposure risks evolve over space and time, identifying possible clusters. Our approach demonstrates the importance of effective dynamic visualizations of complex spatial-temporal datasets, and we plan to expand the analysis to further regions.
Keywords
Statistical Modeling
PM2.5 Pollution
Data Modeling
Abstracts
This presentation highlights the impact of the COS Instructional Grant in enhancing the educational experience of undergraduate and graduate students in Regression Analysis courses. Our project aimed to improve students' statistical skills, encourage critical thinking in real-world problem-solving, and help them communicate complex concepts clearly. Students explored data by applying regression analysis to scenarios like studying the relationship between physical characteristics and systolic blood pressure while identifying challenges like outliers. While students mastered technical skills, the project revealed a need to improve critical thinking and data interpretation, especially in recognizing anomalies. We will discuss current and future efforts, including earlier integration of critical thinking through visualizations, creating separate course sections for different student backgrounds, and funding undergraduate research to develop statistical tools like an R package for mixed models. Our initiatives aim to boost student engagement, improve course alignment, and equip students with technical and transferable skills for careers in data analysis and beyond.
Keywords
Statistical Education
Regression Analysis
Course Enhancement
Abstracts
This project evaluates the effectiveness of an LLM-driven (Large Language Model) tool for SQL documentation and programming language conversion/SQL code generation. The experiment tests the LLM tool with code samples at three complexity levels (beginner, intermediate, and advanced) under three prompt conditions: minimally defined, moderately defined, and extremely defined. Raters will assess the LLM-generated outputs using a pre-set rubric. The statistical analysis will employ a repeated measures ANOVA to determine the impact of the experimental conditions on the tool's performance. Inter-rater reliability will be assessed using Shrout and Fleiss intraclass correlations to quantify evaluation consistency.
Keywords
LLM (Large Language Models)
SQL Code
Documentation
Quality Evaluation
Repeated Measures ANOVA
Inter-Rater Reliability Statistics
Abstracts
Accurate time series predictions are crucial across various scientific fields. Traditional statistical methods, such as autoregressive integrated moving average (ARIMA) models, have been widely used but often rely on assumptions of stationarity and linearity, limiting their ability to capture complex real-world patterns. To overcome these limitations, this study introduces novel methods for point forecasting and uncertainty quantification. Generative diffusion models are employed alongside a conformal prediction-based calibration method to enhance the reliability of prediction intervals. The effectiveness of this approach is demonstrated through simulations and an application to electricity load forecasting. The results contribute broadly to time series analysis by improving predictive accuracy and ensuring robust uncertainty quantification.
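To make the calibration step concrete, the snippet below applies plain split conformal prediction to the residuals of an arbitrary point forecaster on a hypothetical calibration window; the authors' diffusion-based forecaster and their specific calibration method are not reproduced here.

```python
# Minimal split-conformal calibration of forecast intervals: use the quantile of
# absolute calibration residuals as the interval half-width.
import numpy as np

rng = np.random.default_rng(5)
y_cal = rng.normal(size=200)                            # hypothetical calibration observations
yhat_cal = y_cal + rng.normal(scale=0.5, size=200)      # point forecasts from any model

alpha = 0.1
scores = np.abs(y_cal - yhat_cal)
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))       # finite-sample-adjusted rank
q = np.sort(scores)[min(k, len(scores)) - 1]

yhat_new = 0.3                                          # a new point forecast
print(f"{1 - alpha:.0%} prediction interval: [{yhat_new - q:.2f}, {yhat_new + q:.2f}]")
```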
Keywords
Generative models
Diffusion models
Conformal prediction
Time series analysis
Uncertainty quantification
Abstracts
Environmental exposures during critical developmental periods significantly impact children's health, yet analyzing complex exposure mixtures across diverse populations remains challenging. We propose a unified framework combining mixture analysis and machine learning techniques for pooling multiple cohort data. Through comprehensive simulation studies, we evaluate methods including Weighted Quantile Sum (WQS), Bayesian Kernel Machine Regression (BKMR), quantile-based g-computation, and Partial Linear Single Index Model (PLSIM), assessing their performance across various scenarios of cohort heterogeneity, exposure correlations, and missing data patterns. We apply this framework to examine prenatal air pollution exposure effects on autism traits using data from over eight thousand mother-child pairs across multiple cohorts from the Environmental influences on Child Health Outcomes (ECHO) program. Our approach integrates ensemble learning, meta-learning algorithms, and Bayesian hierarchical models to account for between-cohort heterogeneity while maximizing information sharing. This methodology promises to enhance our understanding of environmental exposure effects across diverse populations.
Keywords
Environmental mixture analysis
Machine learning
Heterogeneity
ECHO program
Abstracts
Readmission prediction is a critical but challenging clinical task, as the inherent relationship between high-dimensional covariates and readmission is complex and heterogeneous. Despite this complexity, models should be interpretable to aid clinicians in understanding an individual's risk prediction. Readmissions are often heterogeneous, as individuals hospitalized for different reasons have materially different subsequent risks of readmission. To allow flexible yet interpretable modeling that accounts for patient heterogeneity, we propose hierarchical-group structure kernels that capture nonlinear and higher-order interactions via functional ANOVA, selecting variables through sparsity-inducing kernel summation, while modeling heterogeneity and allowing variable importance to vary across interactions. Extensive simulations and a hematologic readmission dataset (N=18,096) demonstrate superior performance across patient subgroups (AUROC, PRAUC) over the lasso and XGBoost. Additionally, our model provides interpretable insights into variable importance and group heterogeneity.
Keywords
Readmission Prediction
Heterogeneity
Functional ANOVA
Kernel Methods
Sparsity-Inducing Regularization
Abstracts
Co-Author(s)
Angela Bailey, University of Minnesota Twin Cities
Jared Huling, University of Minnesota
First Author
Wei Wang, University of Minnesota Twin Cities
Presenting Author
Wei Wang, University of Minnesota Twin Cities
In Bayesian statistics, various shrinkage priors such as the horseshoe and lasso priors have been widely used for the problem of high-dimensional regression and classification. The type of shrinkage priors is determined by the choice of the distributions for hyperparameters, called hyperpriors. As a result, the posterior sampling method should vary depending on the choice of hyperpriors. To address this issue, we develop a new family of hyperpriors via a notion of discretization. The great merit of our discretization approach is that the full conditional of any hyperparameter always becomes a multinomial distribution. This feature provides a unifying posterior sampling scheme for any choice of hyperpriors. In addition, the proposed discretization approach includes the spike-and-slab prior as a special case. We illustrate the proposed method using several commonly used shrinkage priors such as horseshoe prior, Dirichlet-Laplace prior, and Bayesian lasso prior. We demonstrate the performance of our proposed method through a simulation study and a real data application.
Keywords
Bayesian shrinkage priors
Discretization
Gibbs sampler
High-dimensional regression and classification
Abstracts
Real-world electronic health record (EHR) data can be used to identify rare adverse events of medications because of its large sample size. This study uses a large nationwide EHR COVID-19 database to identify potential adverse effects of COVID-19 vaccines by comparing them to historical EHR data on influenza vaccination. To compare the proportion of adverse events in a pre-specified time interval after vaccination, censored or incomplete data must be taken into account. We propose an inverse probability of censoring weight (IPCW) adjusted hypothesis test to deal with the censored EHR data problem. An asymptotically consistent estimator of the event proportion is proposed, and the corresponding asymptotic distribution is derived under censoring. An asymptotic hypothesis test for comparing the event proportion between the two groups under censoring is established. Simulation studies demonstrate the validity of the proposed IPCW-adjusted estimate and test statistic. We show that the proposed IPCW-adjusted estimate of the event proportion removes the estimation bias and that the type I error of the IPCW-adjusted test is controlled well at the nominal level as the censoring rate increases.
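As a simplified illustration of the IPCW idea (not the authors' estimator or test statistic), the sketch below computes an IPCW-type estimate of the proportion experiencing an event within a fixed window under right censoring, weighting observed events by the inverse of a Kaplan-Meier estimate of the censoring distribution. It assumes the lifelines package and simulated data, and it ignores the left-limit refinement of the censoring survival function.

```python
# IPCW-type estimate of the event proportion by day t_star under right censoring.
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(6)
n, t_star = 5000, 42                                   # follow-up window of interest (days)
event_time = rng.exponential(200, size=n)
censor_time = rng.uniform(10, 120, size=n)

obs_time = np.minimum(event_time, censor_time)
is_event = event_time <= censor_time

# Kaplan-Meier estimate of the censoring distribution G(t) = P(C > t).
kmf_c = KaplanMeierFitter()
kmf_c.fit(obs_time, event_observed=~is_event)
G_at_T = kmf_c.survival_function_at_times(np.minimum(obs_time, t_star)).to_numpy()

weights = 1.0 / np.clip(G_at_T, 1e-8, None)
ipcw_prop = np.mean(weights * (is_event & (obs_time <= t_star)))
naive_prop = np.mean(is_event & (obs_time <= t_star))
print(f"IPCW estimate: {ipcw_prop:.4f}, naive estimate: {naive_prop:.4f}")
```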
Keywords
Adverse effects
COVID-19 vaccine
hypothesis testing with censored data
IPCW
Abstracts
Co-Author(s)
Mohsen Rezapour, Department of Biostatistics and Data Science, School of Public Health, UTHealth Houston
Vahed Maroufy, Department of Biostatistics and Data Science, School of Public Health, UTHealth Houston
Yashar Talebi, Department of Biostatistics and Data Science, School of Public Health, UTHealth Houston
Guo-qiang Zhang, Texas Institute for Restorative Neuro-technologies (TIRN), UTHealth Houston
Hulin Wu, University of Texas Health Science Center At Houston
First Author
Lili Liu
Presenting Author
Lili Liu
The selection of a performance metric for the purpose of model evaluation is not as trivial as it may appear. On one hand, the model commissioners' expectations of the model's contribution to achieving their business objective(s) often lack empirical support. On the other, model developers can easily be confused by the multitude of quantitative metrics recommended in the statistical literature. Hence the need for a methodology to guide the effective selection of a statistical performance metric during model evaluation. In Salami et al. (2024), we considered a fraud detection use case and showed that F-beta (F_β, β>1) is more appropriate than F_1 or the Area Under the Precision Recall Curve (AUPRC) for measuring the model's contribution to the business objective. In this paper, we examine two facets of the F_β, namely the weighted F_β and the non-weighted F_β, and discuss how selecting one in lieu of the other can lead to erroneous decisions with adverse impacts. As the use of AI algorithms becomes more prevalent in decision making, our paper brings a new perspective to the selection of statistical performance metrics for the purpose of evaluating AI models.
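To illustrate how the two facets can diverge in practice, the snippet below compares scikit-learn's plain (binary, positive-class) and support-weighted F-beta on a toy imbalanced fraud-style label set; the labels, error rate, and beta value are arbitrary assumptions, and the example is not taken from the paper's use case.

```python
# Plain (positive-class) versus support-weighted F-beta on imbalanced toy labels.
import numpy as np
from sklearn.metrics import fbeta_score

rng = np.random.default_rng(7)
y_true = (rng.uniform(size=10_000) < 0.02).astype(int)        # ~2% positives (e.g., fraud)
y_pred = y_true.copy()
flip = rng.uniform(size=10_000) < 0.05                        # imperfect detector: flip 5% of labels
y_pred[flip] = 1 - y_pred[flip]

beta = 2.0
f_binary = fbeta_score(y_true, y_pred, beta=beta)                        # positive class only
f_weighted = fbeta_score(y_true, y_pred, beta=beta, average="weighted")  # support-weighted over classes
print(f"F_beta (beta={beta}): binary = {f_binary:.3f}, weighted = {f_weighted:.3f}")
```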
Keywords
artificial intelligence
machine learning
performance evaluation
performance metrics
model testing
model monitoring
performance thresholds
Abstracts
Traditional linear methods for generating age, sex, and education level corrected Z-scores in neuropsychological assessments can be problematic because of nonlinearity and bounded test scores. We propose a nonlinear censored regression model for generating Z-scores that adjusts for age, sex, education, and race, while incorporating age-varying residual standard deviations. This approach addresses non-normal score distributions and boundary censoring, enhancing the detection of abnormal cognitive performance. Application to diverse normative datasets demonstrates improved accuracy and sensitivity over traditional methods, as corroborated by clinician feedback. Our results advocate for adopting this model to refine neuropsychological evaluations across varied populations.
Keywords
Neuropsychological Testing
Z-Score Adjustment
Censored Regression
Nonlinear Modeling
Cognitive Assessment
Abstracts
The naive Bayes classifier, which assumes conditional independence of the predictors, improves classification efficiency and has a great advantage in handling high-dimensional data as well as imbalanced data. However, the success of the naive Bayes classifier hinges on the normality assumption for each continuous predictor, and its performance decreases considerably as many irrelevant predictors are included.
In this paper, we develop a way of improving the performance of naive Bayes classifiers when we deal with high-dimensional non-Gaussian data. To remove irrelevant predictors, we develop an efficient variable selection procedure in the context of naive Bayes classification using the notion of Bayesian Information Criteria (BIC). In addition, we adapt the naive Bayes classifier for use with non-Gaussian data via power transformation. We conduct a comparative simulation study to demonstrate the superiority of our proposed classifier over existing classification methods. We also apply our proposed classifier to real data and confirm its effectiveness.
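As a rough sketch of the transformation idea only (the BIC-based variable selection step is not shown), the snippet below compares a plain Gaussian naive Bayes classifier with one preceded by a marginal power transformation toward normality, using scikit-learn on simulated skewed predictors; the Yeo-Johnson transform and the simulated data are stand-in assumptions.

```python
# Gaussian naive Bayes with and without a marginal power transformation.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)
n, p = 300, 50
X = rng.lognormal(mean=0.0, sigma=1.0, size=(n, p))          # skewed, non-Gaussian predictors
y = (X[:, 0] + X[:, 1] + rng.normal(scale=2.0, size=n) > 3).astype(int)

plain_nb = GaussianNB()
transformed_nb = make_pipeline(PowerTransformer(method="yeo-johnson"), GaussianNB())

print("plain NB      :", cross_val_score(plain_nb, X, y, cv=5).mean().round(3))
print("transformed NB:", cross_val_score(transformed_nb, X, y, cv=5).mean().round(3))
```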
Keywords
Bayes classifier
Generative classifier
High-dimensional variable selection
Power transformation
Abstracts
Co-Author(s)
Gyuhyeong Goh, Department of Statistics, Kyungpook National University
Dipak Dey, University of Connecticut
First Author
Mijin Jeong, Kyungpook National University
Presenting Author
Mijin Jeong, Kyungpook National University
Traditional imputation methods for psychological scales often focus on aggregate scores, potentially obscuring item-level response patterns. This study addresses the challenge of item-level missingness by introducing a multiple imputation framework that preserves the Kessler-10 (K10) scale's internal structure in a longitudinal setting. We analyzed item responses from the K10 across six waves in a three-arm RCT (n = 878) and, departing from conventional total-score imputation, employed XGBoost with 10 iterations to impute missing values at the individual item level. Despite missing data ranging from 2.5% to 34.4%, our approach yielded robust estimates with mean uncertainty on a 5-point scale varying between 0.003 and 0.111. This method enhances the analysis of psychological assessment data by capturing item-specific variability, ultimately improving data integrity in mental health research.
Keywords
Multiple Imputation by Chained Equation (MICE)
Item-level imputation
XGBoost
Randomized controlled trial (RCT)
Abstracts
We revisit the classical problem of estimating the mixing distribution of Binomial mixtures under trial heterogeneity and smoothness. This problem has been studied extensively when the trial parameter is homogeneous, but not under the more realistic scenario of heterogeneous trials, and only within a low smoothness regime, where the resulting rates are suboptimal. Under the assumption that the density is s-smooth, we derive faster error rates for nonparametric density estimators under trial heterogeneity. Importantly, even when reduced to the homogeneous case, our result improves upon the state of the art. We further discuss data-driven tuning parameter selection via cross-validation and a measure of a difference between two densities. Our work is motivated by an application in criminal justice: assessing the effectiveness of indigent representation in Pennsylvania. We find that the estimated conviction rates for appointed counsel (court-appointed private attorneys) are generally higher than those for public defenders, potentially due to a confounding factor: appointed counsel are more likely to take on severe cases.
Keywords
Binomial mixtures
mixing distribution
nonparametric density estimation
criminal justice
Abstracts
Statistical methods are essential in ecology, helping researchers navigate complex, noisy data to understand environmental change and predict ecosystem responses. One critical application is assessing how forests, which play a key role in carbon capture, are impacted by increasing drought frequency, a threat that may have long-term consequences overlooked by traditional models. This study addresses that gap by applying the Bayesian Stochastic Antecedent Model, a statistical approach designed to quantify how past drought severity and duration influence future growth. This methodology will be applied to B4WarmED, a new and rare long-term experimental dataset that uniquely combines temporal depth, a controlled causal setup, and the ability to extract signals from a low signal-to-noise ratio, making it ideal for comprehensive understanding. This methodology, combined with the depth and design of B4WarmED, enables nuanced, data-driven insights into the complex, long-term effects of drought, allowing for stronger causal claims about how past conditions shape future tree growth. Refining predictions of forest carbon dynamics will improve climate models and inform conservation strategies.
Keywords
ecology
time-series
antecedent events
causal inference
tree growth
Abstracts
Atmospheric tar balls (TBs) are solid, strong light-absorbing organic particles emitted through wildfires. TBs can disrupt Earth's energy balance by absorbing incoming solar radiation. However, there are still large uncertainties with TBs' optical properties. Moreover, when TBs have different coatings (e.g., water and organic), their light absorption properties will be enhanced. This highly variable optical property and light enhancement of TBs are not included in a climate model.
This study applies Mie calculations to investigate the light absorption enhancement of different coatings on TBs. We used different optical properties of TBs from the literature to cover the variable optical properties of TBs, testing different coating species (e.g., water, secondary organic, and brown carbon). The core size and coating thickness varied from x to y and from a to b, respectively. We found that clear and brown coatings enhance light absorption through the "lensing effect," with brown coatings showing a marked increase, both accounted for in the core-shell parameterization. This model-measurement approach improves predictions of TB optical properties.
Keywords
Mie Theory
Atmospheric modeling
light absorption
light scattering
Optical properties
Core-shell particles
Abstracts
Prior studies have shown persistent poverty (PP) to affect the risk of mortality for various cancers. We used a causal path-specific mediation approach to understand how the interplay between census-tract-level PP, socioeconomic status (SES), and receipt of cancer treatment affects mortality. Using Cox models, we obtained weighted causal estimates for the natural direct effect (NDE) of PP on mortality and the natural indirect effects (NIE) of PP on mortality through the combined pathways, treating SES as an exposure-induced mediator-outcome confounder and treatment as the mediator. We used data on 50,533 stage I-IV hepatocellular carcinoma patients identified in the SEER program. The analysis showed that PP had an indirect effect toward higher mortality. Cox models yielded an NDE of PP on mortality of 1.06 (95% CI: 0.99-1.13), accounting for 31% of the total effect. The NIE of PP on mortality through the combined SES pathway was 1.03 (95% CI: 1.02-1.05), accounting for 16% of the total effect, and the NIE of PP on mortality through the treatment mediator alone was 1.10 (95% CI: 1.04-1.17), accounting for 53% of the total effect. Thus, SES and receipt of treatment contribute to understanding the causal effect of PP on mortality.
Keywords
Causal Analysis
Mediation Analysis
Persistent Poverty
Cancer
Mortality
Abstracts
This poster studies big data streams with regional-temporal extreme event (REE) structures and solar flare prediction. An autoregressive conditional Fréchet model with time-varying parameters for regional and adjacent-region extremes (ACRAE) is proposed. The ACRAE model can quickly and accurately predict rare REEs (i.e., solar flares) in big data streams. Under mild regularity conditions, the ACRAE model is proved to be stationary and ergodic. The parameter estimators are derived through the conditional maximum likelihood method, and their consistency and asymptotic normality are established. Simulations are used to demonstrate the efficiency of the proposed parameter estimators. In real solar flare prediction, with the new dynamic extreme value modeling, the occurrence and climax of solar activity can be predicted earlier than with existing algorithms. The empirical study shows that the ACRAE model outperforms existing prediction algorithms with sampling strategies.
Keywords
big data
solar flare detection
time series of regional extremes
extreme value theory
tail index dynamics
Abstracts
Co-Author(s)
Jili Wang, University of Wisconsin-Madison
Zhengjun Zhang, University of Chinese Academy of Sciences
First Author
Steven Moen, University of Wisconsin-Madison
Presenting Author
Steven Moen, University of Wisconsin-Madison
Efficient monitoring and measurement of forage resources is challenging in livestock production and management. Failure to develop a real-time monitoring tool can result in overgrazed forage resources, ecosystem degradation, decreased animal production, and reduced resiliency to climate change. Remote sensing data such as satellite imagery provide a cost-effective tool for monitoring forage quality and quantity. This study aims to develop data pipelines to automate the extraction of climate data and satellite imagery from Google Earth Engine. Specifically, forage quantity and quality indicators such as Neutral Detergent Fiber, Acid Detergent Fiber, Biomass, and Crude Protein are predicted using precipitation metrics, seasonal weather metrics, and vegetation indices. The performance of univariate and multivariate Random Forest, Generalized Additive Model, Least Absolute Shrinkage and Selection Operator, Autoregressive Integrated Moving Average, Nonlinear Autoregressive exogenous, and Multivariate Time Series models is compared. The results show that the non-linear models outperformed the linear models while remaining computationally efficient.
Keywords
Livestock
Forage
Machine learning
Remote sensing
Abstracts
Seasonal variation is a key feature of many environmental and biological systems, including infectious disease outbreaks and temperature patterns. Periodic mean-reverting stochastic differential equations (SDEs) effectively model such variability. We present periodic mean-reverting SDE models, $dX(t) = r(\beta(t) - X(t))dt + d\beta(t) + \sigma X^p(t)dW(t),$ for \(p = 0, 1/2, 2/3, 5/6, 1\), with periodic mean \(\beta(t)\), and fit them to seasonally varying influenza and temperature data. The model with \(p = 0\) corresponds to the Ornstein-Uhlenbeck process, \(p = 1/2\) and \(p = 1\) relate to the Cox-Ingersoll-Ross (CIR) process and geometric Brownian motion (GBM), respectively, and \(p = 2/3, 5/6\) give other mean-reverting SDEs. We show that the higher-order moments of the CIR and GBM processes exhibit periodicity. Novel model-fitting methods combine least squares for \(\beta(t)\) estimation and maximum likelihood for \(r\) and \(\sigma\). Confidence regions are constructed via bootstrapping, and missing data are handled using a modified MissForest algorithm. These models provide a robust framework for capturing seasonal dynamics and offer flexibility in mean function specification.
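To show what trajectories from this model class look like, the snippet below runs a simple Euler-Maruyama simulation of the SDE in the abstract with a hypothetical sinusoidal mean function; the parameter values, the mean function, and the positivity safeguard are illustrative choices, not those fitted in the study.

```python
# Euler-Maruyama simulation of dX = r(beta(t) - X)dt + dbeta(t) + sigma * X^p dW.
import numpy as np

def simulate(p=0.5, r=2.0, sigma=0.3, T=3.0, dt=1e-3, x0=1.0, seed=9):
    rng = np.random.default_rng(seed)
    beta = lambda t: 1.0 + 0.5 * np.sin(2 * np.pi * t)   # hypothetical periodic mean
    t = np.arange(0.0, T, dt)
    x = np.empty_like(t)
    x[0] = x0
    for k in range(len(t) - 1):
        dW = np.sqrt(dt) * rng.standard_normal()
        dbeta = beta(t[k + 1]) - beta(t[k])
        drift = r * (beta(t[k]) - x[k]) * dt + dbeta
        diffusion = sigma * max(x[k], 0.0) ** p * dW
        x[k + 1] = max(x[k] + drift + diffusion, 1e-8)    # crude positivity safeguard for p > 0
    return t, x

t, x = simulate(p=0.5)   # CIR-type case
print(x[:5].round(4))
```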
Keywords
Parameter estimation
Mean-reverting stochastic differential equations
Confidence region
Seasonal time series
Abstracts
Deep generative models are routinely used in generating samples from complex, high-dimensional distributions. Despite their apparent successes, their statistical properties are not well understood. A common assumption is that with enough training data and sufficiently large neural networks, deep generative model samples will have arbitrarily small errors in sampling from any continuous target distribution. We set up a unifying framework that debunks this belief. We demonstrate that broad classes of deep generative models, including variational autoencoders and generative adversarial networks, are not universal generators. Under the predominant case of Gaussian latent variables, these models can only generate concentrated samples that exhibit light tails. Using tools from concentration of measure and convex geometry, we give analogous results for more general log-concave and strongly log-concave latent variable distributions. We extend our results to diffusion models via a reduction argument. We use the Gromov-Levy inequality to give similar guarantees when the latent variables lie on manifolds with positive Ricci curvature. These results shed light on the limited capacity of deep generative models to sample from heavy-tailed target distributions.
Keywords
Deep Generative Models
Diffusion models
Generative Adversarial Networks
Variational Autoencoders
Concentration of Measure
Abstracts
Multimodal integration combines data from diverse sources or modalities to provide a more holistic understanding of a phenomenon. The challenges in multi-omics data analysis stem from the complexity, high dimensionality, and heterogeneity of the data, which require advanced computational tools and visualization methods for effective interpretation. This paper introduces a novel method called Orthogonal Multimodality Integration and Clustering (OMIC) to analyze CITE-seq data.
Our approach allows researchers to integrate various data sources while accounting for interdependencies. We demonstrate its effectiveness in cell clustering using CITE-seq datasets. The results show that our method outperforms existing techniques in terms of accuracy, computational efficiency, and interpretability. We conclude that OMIC is a powerful tool for multimodal data analysis, enhancing the feasibility and reliability of integrated data analysis.
Keywords
Multimodality Integration
CITE-seq
Cell Clustering
Abstracts
The prescribed doses for many drugs are based on population norms or physician discretion. For example, very low birth weight (VLBW) infants (BW < 1500 grams) often experience slower postnatal growth and require glucose treatment to support weight gain and prevent hyperglycemia. A uniform dosage is unsuitable due to individual differences in glucose metabolism influenced by weight, gestational age, and other factors. Personalized dynamic dosing adjusts to an infant's observed responses, but quantifying uncertainty is essential in patient-critical environments. This study employs a longitudinal mixed model to estimate personalized optimal doses, using patient-specific random effects as biomarkers to capture individual sensitivities. We quantify the uncertainty of these doses through the implicit value delta method, aiding in safe clinical decision-making. Simulation studies validate our model's robustness, and analysis of NICU data on glucose treatments for VLBW infants over their first seven days demonstrates key differences between optimal and prescribed dosing strategies.
Keywords
Personalized medicine
Optimal dose
Longitudinal Data Analysis
Implicit Delta Method
Abstracts
Co-Author
Alexander McLain, University of South Carolina
First Author
Md Nasim Saba Nishat, Department of Epidemiology and Biostatistics, University of South Carolina
Presenting Author
Md Nasim Saba Nishat, Department of Epidemiology and Biostatistics, University of South Carolina
Social media provides a valuable avenue for mental health research, offering insights into conditions through user-generated content. Yet, the application of machine learning (ML) and deep learning (DL) models in this domain presents methodological challenges, including dataset representativeness, linguistic complexity, the need to distinguish multiple types of mental illness, and class imbalance. This project offers practical guidance to address these issues, focusing on best practices in data preprocessing, feature engineering, and model evaluation. The project introduces strategies for handling imbalanced datasets, optimizing hyperparameter tuning, and improving model transparency and reproducibility. Additionally, it demonstrates techniques for effectively differentiating various mental health conditions within social media data, ensuring that models capture their nuanced presentations. With real-world examples and step-by-step implementation, this project aims to provide tools to build more robust and interpretable ML/DL models for mental illness detection. These improvements contribute to the development of effective early detection and intervention tools in public health.
Keywords
Machine Learning
Deep Learning
Mental Health
Social Media
NLP
Multi-Class Classification
Abstracts
In this project, we develop a computational framework to predict Epidermal Growth Factor Receptor (EGFR) expression levels using pathology images. The workflow begins with cell segmentation and image feature extraction, performed using the scikit-image library in Python, to derive quantitative features from each patient's pathology images. These features are then used in a deep neural network model for variable selection and EGFR expression prediction. Our models demonstrate strong predictive performance, and two professional pathologists validated the extracted features, ensuring their clinical interpretability. This approach has the potential to significantly reduce the costs of gene expression sequencing and provide valuable guidance for pathologists in clinical trial analysis. By integrating computational pathology with deep learning, our work offers an efficient workflow for EGFR expression prediction, bridging the gap between traditional pathology and advanced statistical models.
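As a rough sketch of the kind of segmentation-plus-feature-extraction step described here, the snippet below thresholds a synthetic image with scikit-image, labels connected components as candidate cells, and tabulates per-cell features; the synthetic image, threshold choice, and feature list are stand-in assumptions and do not reproduce the actual pipeline.

```python
# Segment a synthetic image and extract per-object features with scikit-image.
import numpy as np
from skimage import filters, measure, morphology

rng = np.random.default_rng(10)
image = rng.normal(loc=0.2, scale=0.05, size=(256, 256))
rr, cc = np.ogrid[:256, :256]
image[(rr - 80) ** 2 + (cc - 100) ** 2 < 15 ** 2] += 0.6      # a synthetic "cell"
image[(rr - 180) ** 2 + (cc - 60) ** 2 < 10 ** 2] += 0.5      # another one

# Threshold, clean up, and label connected components (candidate cells).
mask = image > filters.threshold_otsu(image)
mask = morphology.remove_small_objects(mask, min_size=30)
labels = measure.label(mask)

# Per-cell quantitative features that could feed a downstream neural network.
props = measure.regionprops_table(labels, intensity_image=image,
                                  properties=("area", "eccentricity", "mean_intensity"))
print({name: np.round(values, 3) for name, values in props.items()})
```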
Keywords
Feature images extraction
Deep Neural Network
Gene expression prediction
Abstracts
The PRITS framework addresses the lack of integrated technical and statistical guidance on the programmatic collection of data from online data sources and assessing existing web-scraped datasets for specific research uses. The framework covers five stages: Planning, Retrieval, Investigation, Transformation and Summary (PRITS). The 'Planning' stage focuses on problem and context definition, and sampling design. 'Retrieval' involves the technical execution and automated documentation of web-scraping processes and outputs (i.e. paradata and substantive data). 'Investigation' assesses the content and completeness of the retrieved web response objects. 'Transformation' involves parsing and cleaning the retrieved web data, potential integration with other data, and documentation of key decisions such as imputation or harmonisation strategies. Finally, the 'Summary' stage documents any decisions that might materially impact downstream analysis, and describes key properties (i.e. metadata) and limitations of the final web-scraped dataset.
Keywords
internet data
sampling design
web scraping
data quality
Abstracts
Random processes with stationary increments and intrinsic random functions are two concepts commonly used to deal with non-stationary random processes. They are broader classes than stationary random processes and are conceptually closely related to each other. A random process with stationary increments is a stochastic process in which the distribution of the increments depends only on the temporal or spatial lag. An intrinsic random function, on the other hand, is a flexible family of non-stationary processes in which the mean is assumed to consist of low-order monomials and a suitably transformed process becomes stationary. This research illustrates the relationship between these two concepts and shows that, under certain conditions, they are equivalent on the real line.
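For readers less familiar with the terminology, the display below states the stationary-increments property through the variogram and contrasts it with the intrinsic-random-function viewpoint; this is standard background notation, and the order k and the contrast formulation are generic rather than specific to this work.

```latex
% Stationary increments, stated through the (semi)variogram on the real line:
\[
  2\gamma(h) \;=\; \operatorname{Var}\{X(t+h) - X(t)\}
  \quad \text{does not depend on } t,
\]
% whereas an intrinsic random function of order k requires stationarity only of
% generalized increments (contrasts) \(\sum_i \lambda_i X(t_i)\) whose weights
% annihilate monomials up to degree k.
```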
Keywords
intrinsic random function
random process with stationary increment
non-stationary random process
spatial statistics
time series
Abstracts
Monitoring is a process to ensure that data integrity is maintained across the duration of a clinical trial. Regulatory authorities recommend that Sponsors focus monitoring strategies on data critical to reliability of trial results. As Randomization is considered critical data, recent regulatory guidance (ICH, MHRA, FDA, EMA) places higher focus on Randomization Monitoring. This type of monitoring concentrates on reviewing accumulated randomization data to confirm that randomization has occurred per protocol/relevant specifications. Randomization monitoring is important in every clinical trial to provide verifiable evidence proving the randomization's integrity. It becomes even more crucial in complex innovative designs (e.g., Master Protocols, trials with AI-enabled devices/machine learning) due to their complexity/novelty. This presentation will establish the importance of randomization monitoring (both in standard and complex/novel protocol designs) and summarize regulatory requirements for randomization monitoring (e.g., sponsors' responsibilities, contents of monitoring plans/reports). It will also present guidance for developing an effective randomization monitoring process.
Keywords
Randomization Monitoring
Clinical Trial Monitoring
Randomization
Regulatory Guidance Review
Abstracts
Our earlier work reported two US county-level composites, health/economics (HEC) and community capital/urbanicity (CCU), using American Community Survey (ACS) data and two other national health-related databases. We aim to develop new composites using 2023 5-year ACS estimates. One hundred two ACS variables among 3,144 counties were analyzed using principal components analysis. Standardized composite scores were computed for each county and their associations with HEC, CCU, Neighborhood Deprivation Index (NDI) and Urban-Rural Classification Scheme (URC) scores were evaluated. A 74-item, two-component solution that approximated "simple structure" and accounted for 41.4% of the total item covariance was interpreted to represent: 1) financial resources & educational attainment (FREA) and 2) age, demographic characteristics & urbanicity (ADU). Non-chance associations (r or rho, p<0.001) between FREA and HEC, CCU, NDI and URC were 0.77, 0.57, -0.88 and -0.44, respectively. For ADU they were 0.45, -0.69, -0.23 and 0.37. Patterns of associations indicated general concordance of the new composites with other measures. Future work will focus on smaller geographic entities using ACS data.
Keywords
Latent variable models and principal components analysis
American Community Survey data
United States county community characteristics
Social determinants of health
Abstracts
Co-Author(s)
Wali Johnson, Vanderbilt University Medical Center
Irene Feurer, Vanderbilt University Medical Center
First Author
Scott Rega, Vanderbilt University Medical Center
Presenting Author
Irene Feurer, Vanderbilt University Medical Center
Predicting long-term systems resilience is essential for strategic planning and risk management. Resilience forecasting plays a critical role in understanding and mitigating the impacts of shocks and facilitating quicker recoveries, especially in dynamic environments like crude oil markets. Traditional models lack dynamic adaptability. We propose using TimeGPT, a transformer-based model, to manage temporal dependencies and ensure system stability under stress. Using a 10-year crude oil price dataset, TimeGPT demonstrated robust zero-shot learning, enhanced by feature engineering (e.g., integrating external data such as public holidays, anomaly indicators, and temporal trends) and fine-tuning. Attention mechanisms prioritized key features such as US Rig Count while filtering noise from less relevant variables such as Mortgage Rate and Import. Evaluated across train-test splits ranging from 30%-70% to 70%-30% in 10% increments, TimeGPT outperformed traditional models, capturing complex market dynamics and predicting long-term resilience. Metrics such as MAE, RMSE, and R2 confirmed its accuracy. This approach supports strategic decision-making in uncertain economic environments.
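The evaluation protocol can be sketched independently of the forecasting model itself; the snippet below scores a naive placeholder forecast (not TimeGPT) on held-out prices across the incremental splits, purely to illustrate the metric computation:

# Score forecasts with MAE, RMSE, and R^2 across train fractions 30%-70%.
# `prices` is a toy random-walk series; the forecast is a naive stand-in.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

prices = np.cumsum(np.random.default_rng(1).normal(size=2500)) + 80.0

for train_frac in np.arange(0.3, 0.71, 0.1):
    split = int(len(prices) * train_frac)
    actual = prices[split:]
    forecast = np.full_like(actual, prices[split - 1])   # last observed value
    mae = mean_absolute_error(actual, forecast)
    rmse = np.sqrt(mean_squared_error(actual, forecast))
    r2 = r2_score(actual, forecast)
    print(f"train={train_frac:.0%}  MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.2f}")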
Keywords
Resilience prediction
time-varying covariates
Long-term prediction
Time series
TimeGPT models
Abstracts
This paper presents a comprehensive risk analysis of flowlines in the oil and gas sector using Geographic Information Systems (GIS) and machine learning. Flowlines, vital conduits transporting oil, gas, and water from wellheads to surface facilities, often face under-assessment compared to pipelines. This study addresses this gap using advanced tools to predict and mitigate failures, enhancing environmental safety and reducing casualties. Extensive datasets from the Colorado Energy and Carbon Management Commission (ECMC) were processed through spatial matching, feature engineering, and geometric extraction to build robust predictive models. Various machine learning algorithms were used to assess risks, with ensemble classifiers showing superior accuracy, especially when paired with Principal Component Analysis (PCA) for dimensionality reduction. Exploratory Data Analysis (EDA) highlighted spatial and operational factors influencing risks, identifying high-risk zones for focused monitoring. The study demonstrates the potential of integrating GIS and machine learning in flowline risk management, proposing a data-driven approach to enhance safety in petroleum extraction.
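A minimal sketch of the modeling stage described above (PCA feeding an ensemble classifier); the synthetic features and class imbalance are stand-ins for the engineered ECMC flowline data, not the study's actual configuration:

# Pipeline: scale features, reduce dimension with PCA, classify with a
# random-forest ensemble, and report cross-validated ROC AUC.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, n_features=40, n_informative=12,
                           weights=[0.9, 0.1], random_state=0)  # rare failures

model = make_pipeline(StandardScaler(), PCA(n_components=10),
                      RandomForestClassifier(n_estimators=300, random_state=0))
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"5-fold ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")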
Keywords
Risk Analysis
Flowlines
Machine Learning
GIS
Abstracts
Recent advances in scalable MCMC methods for high-dimensional Bayesian regression have focused on addressing the computational challenges associated with iterative computations involving large-scale covariance matrices. While existing research relies heavily on the large-p-small-n assumption to reduce computational costs, scenarios where both n and p are large have not yet been explored. In this study, we propose an innovative solution to the large-p-large-n problem by integrating a randomized sketching approach into a Gibbs sampling framework. Our method leverages a random sketching matrix to approximate high-dimensional posterior distributions efficiently, enabling scalable Bayesian inference for high-dimensional and massive datasets. The proposed approach is applicable to a variety of shrinkage priors that are widely used in high-dimensional regression. We investigate the performance of the proposed method via simulation studies.
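The core idea can be sketched for a single conjugate draw (a simplified illustration under a plain normal prior with fixed variances, not the authors' algorithm or their choice of shrinkage prior):

# Replace the full data with a random Gaussian sketch, then draw the
# Gaussian full conditional of the coefficients under a normal prior.
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 10_000, 300, 1_000            # large n and p; sketch size m << n
X = rng.normal(size=(n, p))
y = X[:, :5] @ np.ones(5) + rng.normal(size=n)
sigma2, tau2 = 1.0, 10.0                # error and prior variances, fixed here

S = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))   # Gaussian sketching matrix
Xs, ys = S @ X, S @ y                                  # sketched data

precision = Xs.T @ Xs / sigma2 + np.eye(p) / tau2
cov = np.linalg.inv(precision)
mean = cov @ (Xs.T @ ys) / sigma2
beta_draw = rng.multivariate_normal(mean, cov)         # one Gibbs-style draw
print(beta_draw[:5].round(2))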
Keywords
Acceptance-rejection method
Gibbs sampling
High-dimensional Bayesian regression
Scalable Bayesian computation
Abstracts
In this poster, we provide additional background and examples for our Feb. 2025 paper in Chance. By juxtaposing statistical estimates, which strive to be precise and accurate, with haiku, which are often purposely ambiguous or carry several nuanced meanings, we raise the idea that the main similarity between statistics and haiku is that both are reductions in dimensionality. Statistics reduces many data points to a single numerical summary, a set of summaries, or the coefficients of a model, while haiku distills "moments in time" into three-line verse to convey a gist that may be multisensory and filled with the poet's poignant feelings. To demonstrate this connection, we begin by introducing readers to pop-culture haiku and literary haiku with examples of both, then briefly refresh readers on the patterns of praxis in descriptive and inferential statistics. We provide regression examples to illustrate dimension reduction, along with side-by-side boxplots of biomechanical data. Additional similarities include imagery (data visualization), hypothesis testing versus the third line of a haiku, and pairing and contrast. Next, current research on the use of AI in haiku is explored, and finally Stefanski-style residual plots are considered.
Keywords
haiku
regression diagnostics
data visualization
AI haiku
basketball plots
Abstracts
Recent advances in spatial transcriptomics have highlighted the need for integrating spatial transcriptomics data across multiple developmental and regenerative stages. We present STIFT (SpatioTemporal Integration Framework for Transcriptomics), a three-component framework combining developmental spatiotemporal optimal transport, spatiotemporal graph construction, and triplet-informed graph attention autoencoder (GATE) specifically designed for integrating spatiotemporal transcriptomics data. STIFT efficiently processes large-scale 2D and 3D spatiotemporal transcriptomics data while preserving temporal patterns and biological structures, enabling batch effect removal, spatial domain identification, trajectory inference and exploration of developmental dynamics. Applied to axolotl brain regeneration, mouse embryonic development, and 3D planarian regeneration datasets, STIFT efficiently removes batch effects and achieves clear spatial domain identification while preserving temporal developmental patterns and biological variations across hundreds of thousands of spots, demonstrating its effectiveness and specificity in integrating spatiotemporal transcriptomics data.
Keywords
spatial transcriptomics
spatiotemporal data integration
graph attention autoencoder
developmental biology
Abstracts
This research explores the application of stratified differential privacy in randomized response mechanisms to ensure data confidentiality while maintaining analytical utility. Using R simulations, we implement the Warner randomized response technique with stratification, incorporating Laplace and Gaussian noise mechanisms under varying privacy budgets. The study evaluates the bias and variance of the estimated proportions across different strata. Our results highlight the impact of differential privacy parameters on data utility and privacy protection.
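A minimal single-stratum sketch of the mechanism being simulated (the design probability, privacy budget, and sample size below are arbitrary illustrative values, and only the Laplace mechanism is shown):

# Warner randomized response in one stratum, with Laplace noise added to
# the count of "yes" answers (sensitivity 1, privacy budget epsilon).
import numpy as np

rng = np.random.default_rng(42)
n, pi_true, p, epsilon = 2000, 0.30, 0.7, 1.0

truth = rng.random(n) < pi_true                 # latent sensitive status
sensitive_q = rng.random(n) < p                 # which question each person gets
answers = np.where(sensitive_q, truth, ~truth)  # Warner randomized responses

yes_count = answers.sum() + rng.laplace(scale=1.0 / epsilon)  # Laplace mechanism
lam_hat = yes_count / n                          # noisy proportion of "yes"
pi_hat = (lam_hat - (1 - p)) / (2 * p - 1)       # Warner's unbiased estimator
print(f"true pi = {pi_true:.2f}, private estimate = {pi_hat:.3f}")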
Keywords
Data Privacy
Randomized response technique
Stratified Random Sampling
Probability
Abstracts
First Author
Grace Kim, Wayzata High School, Minnesota, USA
Presenting Author
Grace Kim, Wayzata High School, Minnesota, USA
Large-scale datasets, such as images and texts, often exhibit complex heterogeneous structures caused by diverse data sources, intricate experimental designs, or latent subpopulations. Supervised learning from such data is challenging as it requires capturing relevant information from ultra-high-dimensional data while accounting for structural heterogeneity. We propose a unified framework that addresses both challenges simultaneously, facilitating effective feature extraction, structural learning, and robust prediction. The proposed framework employs a supervised variant of variational autoencoder (VAE) for both learning and prediction. Specifically, two types of latent variables are learned through the VAE: low-dimensional latent features and a latent stick-breaking process that characterizes the heterogeneous structure of samples. The latent features reduce the dimensionality of the input data, and the latent stick-breaking process serves as a gating function for mixture-of-experts prediction. This general framework reduces to a supervised VAE when the number of latent clusters is set to one, and to a stick-breaking VAE when both the latent features and response variables are omitted. We demonstrate advantages of the proposed framework by comparing it with supervised VAE and principal component regression in two simulation studies and a real data application involving brain tumor images.
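As a small illustration of the gating component only (our own minimal sketch, not the paper's parameterization), the stick-breaking construction maps K-1 breaking proportions to K mixture weights that sum to one:

# Convert breaking proportions v_1..v_{K-1} in (0, 1) into K gate weights
# for a mixture-of-experts prediction head.
import numpy as np


def stick_breaking(v: np.ndarray) -> np.ndarray:
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)])
    return remaining * np.concatenate([v, [1.0]])


v = 1.0 / (1.0 + np.exp(-np.array([0.2, -0.5, 1.0])))  # sigmoid of 3 latent logits
weights = stick_breaking(v)
print(weights, weights.sum())  # 4 gate weights summing to 1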
Keywords
Variational Autoencoder
Data Heterogeneity
Stick-breaking Process
Supervised machine learning
Abstracts
Clonal hematopoiesis (CH), a condition increasingly prevalent with aging, has been associated with elevated risks of hematological malignancies and cardiovascular diseases. The most frequent CH-associated mutations occur in the DNMT3A and TET2 genes, leading to heightened proinflammatory signaling. We conducted a stratified analysis to evaluate the incidence of non-hematological malignancies by treatment group and CH mutation status. Kaplan-Meier and risk models are used to compare malignancy outcomes over the trial period. The incidence of non-hematological malignancies across cancer subtypes is assessed for patients with TET2 or DNMT3A mutations. Individuals in the experimental arm exhibited distinct outcomes across cancer types. The estimated cumulative incidence of at least one malignancy event was also assessed to compare outcomes for TET2-mutant individuals receiving the treatment with those in the placebo group. The observed reduction in cancer incidence among TET2-mutant individuals highlights a potential role for anti-inflammatory therapy in cancer prevention and warrants further investigation using causal inference and longitudinal modeling approaches.
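A hedged sketch of the Kaplan-Meier comparison with simulated data (arm labels, follow-up times, and event rates are invented; the lifelines package is used here purely for illustration):

# Fit Kaplan-Meier curves by arm and compare them with a log-rank test.
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(7)
n = 500
arm = rng.integers(0, 2, n)                               # 0 = placebo, 1 = experimental
time = rng.exponential(scale=np.where(arm == 1, 60, 45))  # months to first malignancy
event = rng.random(n) < 0.6                               # observed event indicator

kmf = KaplanMeierFitter()
for label, mask in [("placebo", arm == 0), ("experimental", arm == 1)]:
    kmf.fit(time[mask], event_observed=event[mask], label=label)
    print(label, "median:", kmf.median_survival_time_)

result = logrank_test(time[arm == 0], time[arm == 1],
                      event_observed_A=event[arm == 0],
                      event_observed_B=event[arm == 1])
print("log-rank p-value:", result.p_value)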
Keywords
Kaplan-Meier
longitudinal modeling
malignancies
Abstracts
Traditionally, most quasi-experimental approaches like the synthetic control method (SCM) were developed for relatively small panel datasets (< 1,000 units). In settings with large-scale environmental data containing many treated and untreated units (e.g., from a few to a few hundred treated units with a donor pool of a few thousand) and a relatively large number of covariates, applying the traditional SCM becomes challenging due to multiplicity of solutions and computational inefficiency. Despite recent developments on the penalized synthetic control method, which resolves the multiplicity of solutions by adding a nearest neighbor matching (NNM) penalty to the original SC estimator, this methodology is still computationally inefficient for high-dimensional datasets such as ours. On the other hand, when casting our SCM problem as a covariate balancing problem using the covariate balancing propensity score (CBPS), we encounter problems of covariate approximation and non-sparsity of solutions in implementation. We conducted various simulation studies to compare the CBPS estimator and the penalized SCM estimator, and propose a new CBPS estimator for disaggregated data.
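For orientation, nearest-neighbor-penalized synthetic control weights of the kind referred to above are commonly written as the solution to a penalized balancing problem of roughly the following form (a schematic in our notation, not necessarily the exact estimator used here):

\hat{W}(\lambda) = \arg\min_{W} \Big\| X_1 - \sum_{j} W_j X_j \Big\|^2 + \lambda \sum_{j} W_j \big\| X_1 - X_j \big\|^2, \qquad W_j \ge 0, \ \ \sum_{j} W_j = 1,

where X_1 collects the treated unit's predictors and X_j those of donor j; the penalty discourages weight on donors far from the treated unit, which is what removes the multiplicity of solutions.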
Keywords
causal inference
synthetic control method
Covariate Balancing Propensity Score
Disaggregated Data
Abstracts
In contemporary scientific studies, complicated, high-dimensional structured tensor data are ubiquitous. Motivated by modeling the relationship between multivariate covariates and a complicated tensor response, we propose a tensor response model with a low tubal rank and a sparsity constraint. The low tubal rank constraint captures the space-shifting or time-shifting characteristics of the data, while sparsity reduces the number of free parameters. One special case of our model is equivalent to the multivariate reduced rank regression model. We also put forward an ADMM algorithm with proven convergence that obtains the optimized estimate efficiently. Simulations show that our method significantly outperforms existing tensor response models.
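One plausible way to write such an objective (a schematic only; the paper's exact formulation, tensor product, and penalties may differ) combines the tensor nuclear norm, a standard convex surrogate for low tubal rank, with an elementwise sparsity penalty:

\hat{\mathcal{B}} = \arg\min_{\mathcal{B}} \ \tfrac{1}{2} \big\| \mathcal{Y} - \mathcal{X} * \mathcal{B} \big\|_F^2 + \lambda_1 \| \mathcal{B} \|_{\mathrm{TNN}} + \lambda_2 \| \mathcal{B} \|_1,

where * denotes a tensor product and the tensor nuclear norm is computed from the singular values of the Fourier-domain frontal slices, which is where the Fourier transform in the keywords enters.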
Keywords
ADMM
multidimensional array
multivariate linear regression
reduced rank regression
tubal rank
Fourier transform
Abstracts
Methods for dealing with missing data rely on assumptions about the missingness, whether explicit or implicit. For example, one of the most popular such methods, multiple imputation (MI), typically assumes missing at random (MAR) in most implementations, even though the theory of MI does not necessarily require MAR. In this work, we consider formal tests for the condition known as missing always at random (MAAR) as a way to explore the MAR mechanism in settings where observational units are nested within naturally occurring groups. Specifically, we propose two tests for MAR mechanisms that extend existing methods to incorporate clustered data structures: 1) comparison of conditional means (CCM) with clustering effects and 2) testing a posited missingness mechanism with clustering effects. We design a simulation study to evaluate the tests' performance in correctly capturing the missingness mechanism and demonstrate their use in a real-world application on post-COVID conditions that utilizes an EHR dataset. These tests are expected to provide empirical evidence for better-informed selection of missing data approaches in applications.
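A simplified sketch of the first idea, comparison of conditional means with clustering effects: regress an always-observed covariate on the missingness indicator and use cluster-robust standard errors (variable names, group structure, and the MCAR-style data-generating process below are illustrative assumptions, not the proposed test itself):

# Under the null, the mean of the observed covariate should not differ by
# missingness pattern; cluster-robust SEs account for the nested groups.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n_groups, per_group = 50, 40
group = np.repeat(np.arange(n_groups), per_group)
x = rng.normal(size=n_groups * per_group) + rng.normal(size=n_groups)[group]
missing = (rng.random(n_groups * per_group) < 0.25).astype(int)  # missing-outcome indicator

df = pd.DataFrame({"x": x, "missing": missing, "group": group})
fit = smf.ols("x ~ missing", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["group"]})
print(fit.summary().tables[1])  # a significant 'missing' term is evidence against the posited mechanism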
Keywords
MAR
missingness mechanism
test
clustering effects
Abstracts
Estimating a regression function using a parametric model makes it easier to describe and interpret the relationship being studied, and many practitioners prefer this approach over using a nonparametric model. Here we consider the case of a stipulated parametric function when there are a priori assumptions about the shape and smoothness of the true regression function in the presence of AR(p) errors. For example, suppose it is known that the function must be non-decreasing; we can test the null hypothesis of linear and increasing against the alternative of smooth and increasing, using constrained splines for the alternative fit when AR(1) errors are present. We show that the test is consistent and that its power approaches one as the sample size increases if the alternative is true in the presence of AR(1) errors. Few existing methods are available for comparison with our proposed test. Through simulations, we demonstrate that our test performs well, particularly when compared to the WAVK test in the funtimes R package.
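Stated schematically in our own notation (not a verbatim reproduction of the test), the setup is:

y_i = f(x_i) + \varepsilon_i, \qquad \varepsilon_i = \phi\, \varepsilon_{i-1} + u_i, \quad u_i \overset{iid}{\sim} (0, \sigma^2),

H_0: f \ \text{is linear and non-decreasing} \quad \text{vs.} \quad H_1: f \ \text{is smooth and non-decreasing},

with the alternative fitted by monotonically constrained regression splines.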
Keywords
Shape restrictions
Regression splines
Parametric function
Autoregressive errors
Consistent
Abstracts
Lymphovascular invasion (LVI) significantly impacts survival in head and neck squamous cell carcinoma. Traditional two-stage analyses risk biasing the estimated effect of LVI on patient survival because of endogeneity. To address these issues, we propose a joint approach using a bivariate recursive copula model to estimate the effect of LVI status on two-year survival while controlling for potential endogeneity. This framework separates the dependence structure from the marginal distributions, offering a flexible dependence specification. Using data from The Cancer Genome Atlas (TCGA), we integrate miRNA expression, clinical covariates, and demographic factors to estimate LVI's average treatment effect (ATE) on survival. Key miRNAs (e.g., hsa-miR-203a-3p, hsa-miR-194-5p, hsa-miR-337-3p) were analyzed for their association with survival outcomes. Results indicate that LVI significantly reduces two-year survival, with an ATE of -47%. Age at diagnosis exhibits a nonlinear effect on survival outcomes. This study highlights the utility of copula models in addressing endogeneity and provides insights into the interplay between LVI, molecular biomarkers, and survival outcomes.
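For orientation, a bivariate copula model of this general kind couples the two margins through a copula C with dependence parameter theta (a schematic, not the exact recursive specification used here):

F_{Y_1, Y_2}(y_1, y_2 \mid \mathbf{x}) = C\big( F_1(y_1 \mid \mathbf{x}),\, F_2(y_2 \mid \mathbf{x});\, \theta \big),

where one margin models two-year survival, the other models LVI status, and the recursive structure lets LVI enter the survival margin as a covariate while the copula absorbs the unobserved dependence that drives the endogeneity.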
Keywords
Lymphovascular invasion
survival analysis
copula regression
endogeneity
miRNA
Abstracts
Co-Author
Roger S Zoh, Indiana University
First Author
Yang Ou, Indiana University
Presenting Author
Yang Ou, Indiana University
Ultimate Frisbee, a fast-paced and dynamic sport, demands innovative offensive strategies to outmaneuver opponents and maximize scoring opportunities. This presentation introduces a simulation-based statistical modeling framework designed to evaluate and compare the success probabilities of various offensive plays. By integrating player-specific attributes (such as throwing precision and catching reliability) into a detailed model of frisbee dynamics, this approach provides actionable insights to optimize team performance. The model adapts to diverse team compositions and opposing defenses, offering a practical tool to support strategic planning in this evolving sport.
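A toy Monte Carlo sketch of the kind of calculation such a framework performs (the play structure and player attributes below are invented for illustration, not the presented model):

# A play is a fixed sequence of passes; it scores only if every throw is
# both delivered and caught.
import numpy as np

rng = np.random.default_rng(11)

# (thrower accuracy, receiver catch reliability) for each pass in the play
play = [(0.92, 0.95), (0.85, 0.90), (0.80, 0.93)]


def simulate_play(play, n_sims=100_000):
    completions = np.ones(n_sims, dtype=bool)
    for throw_acc, catch_rel in play:
        completions &= rng.random(n_sims) < throw_acc * catch_rel
    return completions.mean()


print(f"Estimated scoring probability: {simulate_play(play):.3f}")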
Keywords
Ultimate frisbee
modeling and simulation in sports
sports analytics
statistical modeling
Abstracts
Access to individual-level data (ILD) from published literature poses a hurdle for researchers, yet such access is a driving force for many analyses (surrogate outcome validation, subgroup analyses, and other settings). Generative modeling can produce synthetic data that reflects the underlying properties of existing ILD. Specifically, by utilizing Variational Autoencoders (VAEs) and extending them to tabular data, new possibilities for accelerating research arise. This application of VAEs, within R, presents a simple method for researchers to leverage a set of ILD, and it applies to a mixture of distributions (binary, categorical, normal, etc.). While access to ILD may be difficult, summary-level information is more readily available. We propose an extension of VAEs that shifts the underlying distribution of the data toward summary-level statistics. This extension produces multiple sets of ILD under different prior information, and the resulting shifted ILD can be considered a trustworthy representation of a published paper's data. By extending the framework of VAEs to tabular data and allowing for a distribution shift, exploratory research without direct ILD access becomes plausible.
Keywords
Variational Autoencoders
Synthetic Data
Distribution Shift
Machine Learning
Generative Modeling
Summary Level Data
Abstracts
Rare diseases affect an estimated 300-400 million people worldwide, yet individual conditions remain poorly characterized and difficult to diagnose due to their low prevalence and limited clinician familiarity. Efforts to automate rare disease detection through computational phenotyping are limited by the scarcity of labeled data and biases in available label sources. Gold-standard labels from registries or expert chart review offer high accuracy but suffer from selection bias and high ascertainment costs, while labels derived from electronic health records (EHRs) capture broader patient populations but introduce noise. To address these challenges, we propose a weakly supervised, transformer-based framework that integrates gold-standard labels with iteratively refined silver-standard labels from EHR data to train a scalable and generalizable phenotyping model. We first learn concept-level embeddings from EHR co-occurrence patterns, which are then refined and aggregated into patient-level representations using a multi-layer transformer. Using rare pulmonary diseases as a case study, we validate our framework on EHR data from Boston Children's Hospital. Our approach improves phenotype classification, uncovers clinically meaningful subphenotypes, and enhances disease progression prediction, enabling more accurate and scalable identification and stratification of rare disease patients.
Keywords
Semi-Supervised Learning
Transformers
Phenotyping
Electronic Health Records
Rare Diseases
Machine Learning
Abstracts
Bayesian Reduced Rank Regression (RRR) has attracted increasing attention as a means to quantify the uncertainty of both the coefficient matrix and its rank in a multivariate linear regression framework. However, the existing Bayesian RRR approach relies on the strong assumption that the positions of independent coefficient vectors are known when the rank of the coefficient matrix is given. In contrast, the conventional RRR approach is free from this assumption since it permits the singular value decomposition (SVD) of the coefficient matrix. In this paper, we propose a Weighted Bayesian Bootstrap (WBB) approach to incorporate the SVD into the Bayesian RRR framework. The proposed Bayesian method offers an innovative way of sampling from the posterior distribution of the low-rank coefficient matrix. In addition, our WBB approach allows simultaneous posterior sampling for all ranks, which greatly improves computational efficiency. To quantify the rank uncertainty, we develop a posterior sample-based Monte Carlo method for marginal likelihood calculation. We demonstrate the superiority and applicability of the proposed method by conducting simulation studies and real data analysis.
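As background for the SVD parameterization discussed above (our schematic notation; the sampling description is a rough paraphrase, not the authors' exact construction):

\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}, \qquad \operatorname{rank}(\mathbf{B}) = r, \qquad \mathbf{B} = \mathbf{U}\mathbf{D}\mathbf{V}^{\top},

where U and V have orthonormal columns and D is diagonal with the singular values; roughly speaking, the weighted Bayesian bootstrap repeatedly reweights the data and applies this decomposition to each reweighted fit, yielding posterior draws of B for each candidate rank r.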
Keywords
Bayesian Reduced Rank Regression
Singular Value Decomposition
Weighted Bayesian Bootstrap
Bayes factors
Abstracts
Co-Author(s)
Hyeonji Shin, Kyungpook National University
Hyewon Oh, Kyungpook National University
Yeonsu Lee, Kyungpook National University
Minseok Kim, Kyungpook National University
Seongyun Kim, Kyungpook National University
Gyuhyeong Goh, Department of Statistics, Kyungpook National University
First Author
Wonbin Jung, Department of Statistics, Kyungpook National University
Presenting Author
Wonbin Jung, Department of Statistics, Kyungpook National University