Contributed Poster Presentations: Section on Statistical Computing

Shirin Golchi, Chair
McGill University
 
Tuesday, Aug 5: 10:30 AM - 12:20 PM
4102 
Contributed Posters 
Music City Center 
Room: CC-Hall B 

Main Sponsor

Section on Statistical Computing

Presentations

24: An Application of Cox Mixture Models to End-Stage Kidney Disease

Persons with end-stage kidney disease (ESKD) require dialysis or a kidney transplant. Ethnic minority groups are disproportionately affected by ESKD in the United States. Given the wide range of ethnic and socio-economic groups in the United States, the assumption of proportional hazards (PH), which is required for Cox regression, could easily be violated. Hence, we investigate subpopulations that better satisfy the PH assumption. Data from USRDS on patients with ESKD are analyzed. Cox mixture (CM) and deep Cox mixture (DCM) models are used to identify and model latent subpopulations while modeling time to death. CM models were investigated to combine the interpretability of standard Cox regression with the improved performance of a mixture model; DCM is used for comparison. We found that the CM and DCM models outperformed the Cox model in terms of the Brier score and a time-dependent concordance index. The mixture models also show better performance for smaller subpopulations, i.e., by race/ethnicity, region of the United States, and rurality of the community to which the patient belongs.
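As a sketch of one of the evaluation metrics named above, the following simplified, censoring-free version of the Brier score shows the quantity being compared across models. This is an illustration only, not the authors' code; a real survival evaluation reweights censored subjects (IPCW):

```python
import numpy as np

def brier_score(event_times, surv_prob_at_t, t):
    """Simplified Brier score at horizon t: mean squared difference between
    the predicted survival probability S(t) and the observed survival
    status (1 if still event-free at t, else 0). Censoring weights (IPCW),
    which a real evaluation requires, are omitted for brevity."""
    alive_at_t = (np.asarray(event_times) > t).astype(float)
    return float(np.mean((alive_at_t - np.asarray(surv_prob_at_t)) ** 2))
```

Lower values are better; a model that predicts each subject's status at `t` perfectly scores 0.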

Keywords

Survival Analysis

Finite Mixture Model

Unsupervised Learning

Cox Regression

End-Stage Kidney Disease 

Co-Author

Semhar Michael, South Dakota State University

First Author

Jason Hasse, South Dakota State University

Presenting Author

Jason Hasse, South Dakota State University

25: Benford's Law: A Collection of Formal Goodness-of-Fit Tests Based on Empirical Transforms

We propose a number of goodness-of-fit tests for the probability law of significant digits postulated by the celebrated Benford's law. First, the observations are transformed to uniformity, normality, exponentiality, or the Poisson law. Then test statistics are formulated by means of L2-type contrasts between the empirical transform of the transformed data and the corresponding population quantity under the null hypothesis. We also address the problem of a relaxed null hypothesis that only accounts for the probability distribution of a given number of significant digits under Benford's law. Computational formulae are provided for each case, and the suggested tests are compared via a detailed Monte Carlo study that includes competitors as well as popular alternatives to Benford's law. The methods are also applied to a few real data sets.
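For readers unfamiliar with the null model: Benford's law assigns probability P(d) = log10(1 + 1/d) to first significant digit d. A toy L2-type contrast against these probabilities can be sketched as follows; this is an illustration of the general idea only, not the transform-based statistics proposed in the abstract:

```python
import numpy as np

def first_digit(x):
    """First significant digit of each (nonzero) value."""
    x = np.abs(np.asarray(x, dtype=float))
    exponent = np.floor(np.log10(x))
    return (x / 10.0 ** exponent).astype(int)

def benford_l2_stat(x):
    """n times the squared L2 distance between empirical first-digit
    frequencies and the Benford probabilities log10(1 + 1/d)."""
    digits = first_digit(x)
    emp = np.array([(digits == d).mean() for d in range(1, 10)])
    ben = np.log10(1.0 + 1.0 / np.arange(1, 10))
    return float(len(digits) * np.sum((emp - ben) ** 2))
```

Large values of the statistic indicate departure from Benford's law; calibrating its null distribution is exactly where the Monte Carlo study comes in.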

Keywords

Empirical characteristic function

Empirical Laplace transform

Monte Carlo 

Co-Author(s)

Simos Meintanis, National and Kapodistrian University of Athens
Lethani Ndwandwe, UJ

First Author

James Allison, North-West University

Presenting Author

James Allison, North-West University

26: Exploring Physical Activity Data Integration: Merging Accelerometer Data from Multiple Studies

The "mobile Motor Activity Research Consortium for Health" (mMARCH) is a collaborative network of clinical and community studies across Switzerland, Australia, and the United States, focusing on the relationship between motor activity and human physiology, behavior, and health. Accelerometer data from 10 mMARCH cohorts (N=8,903) were processed using the GGIR package. Functional principal component analysis (FPCA) assessed the effects of study, device, season, age, sex, and BMI on physical activity data. Notably, all data were collected using GeneActiv devices, except for one study (N=1,052) that used the GT3X device. The analysis revealed that the type of device was the most significant factor influencing motor activity measurements, with GT3X data separating distinctly from GeneActiv data in FPCA plots. In addition, the functional principal components were strongly affected by study and age, while sex and BMI had moderate effects. In conclusion, because of the significant study-related variation in the mMARCH cohorts, motor activity data could not be merged directly, and statistical analyses involving data from multiple studies should be approached with caution.
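The FPCA step can be sketched, under the simplifying assumption that each subject's activity profile is observed on a common time grid, as a plain eigen-decomposition of the centered curves (the GGIR preprocessing and smoothing used in the study are omitted):

```python
import numpy as np

def fpca(curves, n_components=2):
    """Minimal functional PCA on a (subjects x time grid) matrix.
    Returns per-subject scores, the leading eigenfunctions, and the
    proportion of variance each component explains."""
    mean_curve = curves.mean(axis=0)
    centered = curves - mean_curve
    # SVD of the centered data: rows of vt are (discretized) eigenfunctions,
    # and u * s gives the per-subject component scores.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    eigenfunctions = vt[:n_components]
    explained = s[:n_components] ** 2 / np.sum(s ** 2)
    return scores, eigenfunctions, explained
```

Plotting the first two score columns by device or study is the kind of display in which the GT3X curves would separate from the GeneActiv curves.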

Keywords

mMARCH

GGIR package

functional principal component analysis

data merging 

Co-Author(s)

Andrew Leroux, Department of Biostatistics & Informatics, University of Colorado, Denver, CO
Vadim Zipunnikov, Johns Hopkins University
Kathleen Merikangas, National Institutes of Health

First Author

Wei Guo, National Institutes of Health

Presenting Author

Wei Guo, National Institutes of Health

28: Improving Bayesian Personalized Ranking Inference Using the AWSGLD Algorithm

User purchase history and rating data often suffer from bias and sparsity. To overcome this problem, Bayesian personalized ranking (BPR; Rendle et al., 2009) leverages statistical techniques to analyze data reflecting user preferences inferred from behavioral history, capitalizing on extensive feedback data that is typically large-scale yet sparse. The traditional BPR algorithm employs stochastic gradient descent (SGD) for its computational simplicity and ease of implementation. However, SGD struggles with inefficiencies when optimizing anisotropic functions, where gradients vary by direction. To overcome this limitation, this study proposes optimizing the BPR posterior distribution using the adaptively weighted stochastic gradient Langevin dynamics (AWSGLD; Deng et al., 2022) algorithm, which is highly scalable and capable of self-adjustment within the sample space. Additionally, we explore applying the adaptive weighting technique to the stochastic gradient Nosé-Hoover thermostat (SGNHT; Ding et al., 2014). Empirical analyses demonstrate that the proposed AWSGMCMC-based BPR algorithms significantly outperform traditional recommendation methods, highlighting their potential to enhance recommendation accuracy.
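For orientation, the basic SGLD update that AWSGLD builds on is a gradient step with injected Gaussian noise; the adaptive importance-weighting that defines AWSGLD is omitted here, so this toy sketch shows only the shared core, not the proposed algorithm:

```python
import numpy as np

def sgld_sample(grad_log_post, theta0, step=0.01, n_iter=20000, seed=0):
    """Plain SGLD on a 1-D posterior: theta <- theta + (step/2) * grad
    + sqrt(step) * N(0, 1). AWSGLD additionally reweights samples
    adaptively; that machinery is not reproduced here."""
    rng = np.random.default_rng(seed)
    theta = theta0
    samples = np.empty(n_iter)
    for i in range(n_iter):
        theta = theta + 0.5 * step * grad_log_post(theta) \
                + np.sqrt(step) * rng.normal()
        samples[i] = theta
    return samples

# Toy target N(2, 1), for which grad log p(theta) = -(theta - 2)
draws = sgld_sample(lambda th: -(th - 2.0), theta0=0.0)
```

In BPR the same update would be applied to the latent user and item factor matrices, with minibatch gradients of the BPR log-posterior.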

Keywords

Personalized recommendation algorithm

Bayesian Personalized Ranking

adaptively weighted stochastic gradient MCMC

Implicit data 

Co-Author

Sooyoung Cheon

First Author

Ah-Rim Joo

Presenting Author

Ah-Rim Joo

29: Intrinsic Dimension of Undirected Networks on Unknown Latent Manifold of Constant Curvature

This study proposes a novel data-driven approach for estimating the intrinsic dimension and curvature of complex networks by modeling them as simply connected, complete Riemannian manifolds of constant curvature. Unlike existing methods that rely on predefined structural assumptions, our framework integrates the k-nearest neighbors (KNN) algorithm with the TWO-NN approach, enabling adaptive and robust network partitioning, which enhances the accuracy of dimensionality reduction while preserving essential geometric properties. By leveraging fundamental forms and hypothesis testing, our method ensures precise curvature estimation and manifold classification. Experimental results demonstrate superior robustness against noise and improved effectiveness in capturing intrinsic network geometry, significantly advancing the interpretability and applicability of network data analysis. 
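One ingredient the abstract combines with KNN-based partitioning is the TWO-NN estimator (Facco et al., 2017): for each point, take the ratio mu = r2/r1 of its second- to first-nearest-neighbor distances; the maximum-likelihood dimension is N / sum(log mu). A brute-force sketch follows; the curvature estimation and partitioning steps are not reproduced:

```python
import numpy as np

def two_nn_dimension(points):
    """TWO-NN intrinsic-dimension MLE from pairwise distances
    (brute force; fine for a few thousand points)."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)   # exclude self-distances
    dist.sort(axis=1)
    mu = dist[:, 1] / dist[:, 0]     # second- over first-NN distance
    return len(points) / np.sum(np.log(mu))

rng = np.random.default_rng(1)
# Planar point cloud: the estimate should land near 2
dim_hat = two_nn_dimension(rng.uniform(size=(1500, 2)))
```

The same estimator applied to network data requires first embedding or metrizing the network, which is where the manifold assumptions in the abstract enter.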

Keywords

Intrinsic dimension estimation

Manifold geometry

Simply connected Riemannian manifold 

Co-Author(s)

Hongyu Miao, Florida State University
Xing Qiu

First Author

Feng Wang, University of Texas Health Science Center at Houston

Presenting Author

Feng Wang, University of Texas Health Science Center at Houston

30: Introducing mlmhelpr: A collection of R helper functions for lme4

This poster session will introduce and discuss a new R package, mlmhelpr, which includes helper functions for lme4 to streamline the estimation of multilevel models for applied researchers. With mlmhelpr, users can easily conduct common tasks, such as calculating intraclass correlation coefficients and design effects, centering variables and refitting models, obtaining pseudo-R squared measures, and estimating random intercept and slope reliabilities. The package also includes functions to compute cluster-robust and bootstrap standard errors, non-constant variance tests for detecting heteroscedasticity, and Hausman's statistic to test for differences between fixed-effect and random-effect models. Statistics and tests reported in the package are drawn from popular multilevel modeling textbooks, including Raudenbush & Bryk (2002), Hox et al. (2018), Fox (2016), and Snijders & Bosker (2012).
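The first two quantities mentioned, the intraclass correlation coefficient and the design effect, follow standard formulas. As a language-neutral illustration of the arithmetic (mlmhelpr itself is an R package, and this sketch does not mirror its API):

```python
def icc(tau2, sigma2):
    """Intraclass correlation: the share of total variance attributable
    to the cluster level, tau00 / (tau00 + sigma^2)."""
    return tau2 / (tau2 + sigma2)

def design_effect(icc_value, cluster_size):
    """Kish design effect for average cluster size m: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1) * icc_value
```

For example, a between-cluster variance of 2 against a residual variance of 8 gives an ICC of 0.2; with clusters of 21 observations, that inflates effective standard errors by a design effect of 5.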

Keywords

linear mixed models

multilevel modeling

R programming

R package

statistical models 

First Author

Louis Rocconi, University of Tennessee

Presenting Author

Louis Rocconi, University of Tennessee

31: Numerical Methods for Parameter Estimation of Spatio-Temporal Hawkes Processes

The Hawkes process is a widely used statistical model for point processes in which past events increase the intensity of the process. Strong dependence in these processes leads to challenges in point estimation and in constructing confidence intervals. Previous studies have shown that asymptotic confidence intervals perform poorly in simulation studies, while the parametric bootstrap achieves nominal coverage. This study explores non-parametric resampling methods, such as the block bootstrap and subsampling, for constructing confidence regions in highly dependent spatio-temporal Hawkes processes. These methods are applied to a criminology dataset to illustrate their practical implications.
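The resampling idea can be sketched with a moving-block bootstrap on a dependent sequence (here, hypothetical binned event counts); the study's spatio-temporal setting needs blocks in both space and time, which this one-dimensional toy does not attempt:

```python
import numpy as np

def moving_block_bootstrap(series, block_len, rng):
    """Resample a dependent series by concatenating randomly chosen
    contiguous blocks, preserving within-block dependence."""
    n = len(series)
    n_blocks = int(np.ceil(n / block_len))
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    blocks = [series[s:s + block_len] for s in starts]
    return np.concatenate(blocks)[:n]

rng = np.random.default_rng(0)
x = rng.poisson(3.0, size=200)          # toy binned event counts
resample = moving_block_bootstrap(x, block_len=20, rng=rng)
```

Repeating this, refitting the model on each resample, and taking empirical quantiles of the estimates yields the bootstrap confidence regions being assessed.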

Keywords

Hawkes Process

Block Bootstrap

Subsampling

Simulation Study

Criminology 

Co-Author(s)

Rodney Sturdivant
Rakheon Kim, Baylor University

First Author

Caleb Fox, Baylor University

Presenting Author

Caleb Fox, Baylor University

32: Robustness of OLS, Ridge, Lasso, and Elastic Net in Presence of Outliers: Simulation and Application

In linear regression models, multicollinearity often results in unstable and unreliable parameter estimates. Ridge regression, a biased estimation technique, is commonly used to mitigate this issue and produce more reliable estimates of the regression coefficients. Several estimators have been developed to select the optimal ridge parameter. This study focuses on the top 16 estimators from the 366 evaluated by Mermi et al. (2024), along with seven additional estimators introduced over time. These 23 estimators were compared with Ordinary Least Squares (OLS), Elastic-Net (EN), Lasso, and generalized ridge (GR) regression to evaluate their performance across different levels of multicollinearity. Simulated data, both with and without outliers, under various parametric conditions were used for the comparisons. The results indicate that certain ridge regression estimators perform reliably with small sample sizes and high correlations when outliers are absent, whereas other estimators performed better in the presence of outliers, small sample sizes, and increased variance. GR, EN, and Lasso were robust with large data sets, except under substantial outliers and high variance.
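The mechanism under study is visible in the closed-form ridge estimator, beta_hat = (X'X + kI)^(-1) X'y: k = 0 recovers OLS, and increasing k shrinks the coefficients, which is what stabilizes estimates under multicollinearity. A minimal sketch on deliberately collinear data (the ridge-parameter estimators compared in the abstract are not reproduced here):

```python
import numpy as np

def ridge(X, y, k):
    """Closed-form ridge estimator (X'X + k*I)^(-1) X'y; k = 0 is OLS."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
z = rng.normal(size=100)
# Two nearly collinear predictors: X'X is close to singular
X = np.column_stack([z, z + 0.01 * rng.normal(size=100)])
y = X @ np.array([1.0, 1.0]) + rng.normal(size=100)
beta_ols = ridge(X, y, k=0.0)
beta_ridge = ridge(X, y, k=1.0)
```

The coefficient sum is well identified, but OLS splits it erratically between the two collinear predictors; ridge shrinks that unstable direction hardest.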

Keywords

MSE

Multicollinearity

Ridge regression

Lasso

Elastic net

OLS 

Co-Author

Sinha Aziz, Florida International University

First Author

HM Nayem

Presenting Author

Sinha Aziz, Florida International University

33: TableMage: An LLM-Enhanced Python Package for Low-Code/Conversational Clinical Data Science

Data analysis is essential to evidence-based medicine, yet many clinicians encounter significant technical barriers due to limited training in statistical learning and data science workflows. These challenges often result in inefficiencies, errors, and a dependency on external experts for quantitative analyses. To address this, we introduce TableMage, an open-source, user-friendly Python package tailored for clinical researchers. TableMage enhances analytical workflows through a low-code API that supports exploratory data analysis, regression modeling, and machine learning. It also features a no-code interface powered by large language models (LLMs), enabling users to conduct secure analyses of proprietary datasets via locally hosted open-source LLMs, thereby ensuring data privacy. Our benchmarks against GPT-4o Advanced Data Analysis on 21 public datasets demonstrate that TableMage delivers comparable accuracy in core data analysis tasks, superior performance in machine learning applications, and enhanced flexibility for secure data handling. By equipping clinicians with the tools to directly engage with data, TableMage fosters more efficient, accurate, and independent research. 

Keywords

software

data science

large language models

agents

generative AI

machine learning 

Co-Author(s)

Andrew Yang, Brown University
Joshua Woo, Warren Alpert Medical School
Alan Mach, Warren Alpert Medical School
Prem Ramkumar, Commons Clinic
Ying Ma

First Author

Ryan Zhang, Carnegie Mellon University

Presenting Author

Ryan Zhang, Carnegie Mellon University

34: The Use of Multiple Imputation for Missing Data in A Health-Related Study

Multiple imputation of missing data has been an active area of statistics research since before the big data era. In this project, we study the use of the multiple imputation approach on a health-related data set with eight identified variables whose rates of missing data range from 0 to 16%. We conducted multiple imputation (simple random) on this data set.

Furthermore, to investigate the use of multiple imputation under a variety of missing-data structures and rates, we generated incomplete data sets from the complete data set obtained from the health-related data. The generated incomplete data sets were analyzed with logistic regression, using multiple imputation to handle the missing data, and the results were compared with those obtained from the analysis of the complete data set. Our results suggest that estimates based on five imputations are similar to those based on 100 imputations in the logistic regression analysis. They also indicate that missing data can substantially influence coefficients, odds ratios, and p-values in logistic regression analysis, especially when the missing rate is high, even with multiple imputation.
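Combining a coefficient across the m imputed data sets follows Rubin's rules: the pooled estimate is the mean, and the total variance combines the within-imputation variance W with the between-imputation variance B. A minimal sketch (illustration only):

```python
import numpy as np

def pool_estimates(estimates, variances):
    """Rubin's rules: pooled point estimate and total variance
    T = W + (1 + 1/m) * B across m imputed analyses."""
    m = len(estimates)
    qbar = np.mean(estimates)              # pooled point estimate
    w = np.mean(variances)                 # within-imputation variance W
    b = np.var(estimates, ddof=1)          # between-imputation variance B
    total_var = w + (1.0 + 1.0 / m) * b    # Rubin's total variance
    return qbar, total_var
```

The comparison of five versus 100 imputations amounts to asking how quickly qbar and the total variance stabilize as m grows.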

Keywords

Missing data

multiple imputation

simulation study

logistic regression

Health-related study 

First Author

Bin Ge, University of Missouri-Columbia

Presenting Author

Bin Ge, University of Missouri-Columbia

35: Time Series Forecasting with Conformal Prediction: A Critical Assessment

Forecasting time series is critical in domains like finance, epidemiology, and engineering. Classical models like ARIMA, GARCH, and state-space formulations capture temporal dependencies and volatility structures, while modern approaches like reservoir computing and deep learning handle complex dynamics. A key challenge across these methods is principled uncertainty quantification. Conformal prediction (CP) provides a model-agnostic framework for constructing prediction intervals with finite-sample validity, ensuring reliable uncertainty quantification. However, CP's efficiency varies across data-generating processes (DGPs), particularly in settings with residual dependence, complex temporal structures, or limited data.

This study evaluates CP across diverse DGPs, including stationary and non-stationary processes, latent-state models, and differential equation-driven systems. We compare classical and modern forecasting methods on interval coverage, efficiency, and robustness under distribution shifts. Additionally, we explore empirical Bayes as a bridge between likelihood-based inference and CP, offering insights into balancing predictive flexibility and reliability. 
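The CP construction being stress-tested can be sketched in its simplest split-conformal form for a one-step-ahead forecaster: calibrate on held-out absolute residuals, then widen the point forecast by the appropriate empirical quantile. Finite-sample validity rests on exchangeable residuals, which is precisely what serial dependence strains:

```python
import numpy as np

def split_conformal_interval(residuals_cal, point_forecast, alpha=0.1):
    """Split conformal prediction interval: the point forecast plus/minus
    the ceil((n + 1) * (1 - alpha)) / n empirical quantile of the
    calibration set's absolute residuals."""
    n = len(residuals_cal)
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(np.abs(residuals_cal), q_level)
    return point_forecast - q, point_forecast + q
```

The comparisons in the study then ask how tight (efficient) these intervals can be, across DGPs, while keeping their nominal coverage.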

Keywords

Conformal Prediction

Time Series Forecasting

Uncertainty Quantification

Machine Learning 

Co-Author

Sumanta Basu, Cornell University

First Author

Minjie Jia

Presenting Author

Minjie Jia

36: Using AI to generate R Code for Statistical Computations in Clinical Trial Designs

Commercial software for clinical trial design can have limitations. To address this, customized R code is often integrated with software tools to replace or enhance native capabilities so that users can simulate flexible designs. To assist in this process, we offer an AI coding assistant that helps with writing compatible R functions. The assistant is particularly beneficial for new users and ensures compatibility with the input/output parameters allowed in the R function template.

The current integration points we are focusing on include:
* Simulating patients' responses for Binary, Continuous, Time-to-event, and Repeated-measure endpoints
* Analyzing simulated data for the above endpoints
* Randomization of patients
* Customized enrollment and dropout mechanism

Our platform, powered by Azure OpenAI's GPT-4 LLM, integrates with Cytel's in-house R package, CyneRgy, for custom adaptive clinical trial designs. We provide testing code for AI-generated R functions, with features to detect errors. We adhere to Azure OpenAI's data protection policies to ensure security. At present, access to our platform is exclusive to Cytel's East Horizon users. 

Keywords

Generative-AI

R-Coding-Assistant

Custom-Adaptive-Clinical-Trial-Design

LLM

R-Integration 

Co-Author(s)

Subhajit Sengupta, Cytel
Sudipta Basu, Cytel
J. Kyle Wathen, Cytel

First Author

Subhajit Sengupta, Cytel

Presenting Author

Subhajit Sengupta, Cytel