Tuesday, Aug 5: 10:30 AM - 12:20 PM
4102
Contributed Posters
Music City Center
Room: CC-Hall B
Main Sponsor
Section on Statistical Computing
Presentations
Persons with end-stage kidney disease (ESKD) require dialysis or a kidney transplant. Ethnic minority groups are disproportionately affected by ESKD in the United States. Because the U.S. population spans a wide range of ethnic and socioeconomic groups, the proportional hazards (PH) assumption required for Cox regression could easily be violated. We therefore investigate subpopulations that better satisfy the PH assumption, using data from the United States Renal Data System (USRDS) on patients with ESKD. Cox mixture (CM) and deep Cox mixture (DCM) models are used to identify and model latent subpopulations while modeling time to death. CM models were investigated to combine the interpretability of standard Cox regression with the improved performance of a mixture model; DCM is used for comparison. We found that CM and DCM models outperformed the Cox model in terms of Brier score and a time-dependent concordance index. The mixture models also performed better for the smaller subpopulations defined by race/ethnicity, region of the United States, and rurality of the community to which the patient belongs.
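The concordance comparison described in this abstract can be illustrated with a minimal sketch. It computes Harrell's concordance index for right-censored data, which is a simpler, time-independent relative of the time-dependent index the abstract mentions, not the authors' metric. The function name and the toy inputs are hypothetical.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's C for right-censored survival data.

    times:       observed follow-up times
    events:      1 if death was observed, 0 if the record is censored
    risk_scores: higher score means higher predicted risk (shorter survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example (made-up values): three patients, the last one censored.
c_index = harrell_c_index([1, 2, 3], [1, 1, 0], [2, 3, 1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of patients by risk.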
Keywords
Survival Analysis
Finite Mixture Model
Unsupervised Learning
Cox Regression
End-Stage Kidney Disease
We propose a number of goodness-of-fit tests for the probability law of significant digits postulated by the celebrated Benford law. First, the observations are transformed to uniformity, normality, exponentiality, or the Poisson law. Test statistics are then formulated by means of L2-type contrasts between the empirical transform of the transformed data and the corresponding population quantity under the null hypothesis. We also address the problem of a relaxed null hypothesis that only accounts for the probability distribution of a given number of significant digits under Benford's law. Computational formulae are provided for each case, and the suggested tests are compared via a detailed Monte Carlo study that includes competitors as well as popular alternatives to Benford's law. The methods are also applied to a few real data sets.
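For readers unfamiliar with Benford's law, the following minimal sketch shows the first-digit probabilities, P(D = d) = log10(1 + 1/d), and a classical Pearson chi-square statistic against them. This is a baseline test swapped in for illustration, not the L2-type tests proposed in the abstract. All function names are hypothetical.

```python
import math
from collections import Counter


def benford_first_digit_probs():
    """Benford probabilities P(D = d) = log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_significant_digit(x):
    """Leading nonzero digit of a nonzero number."""
    if x == 0:
        raise ValueError("zero has no significant digit")
    # Scientific notation puts the leading digit first, e.g. '3.21...e-03'.
    return int(f"{abs(x):.15e}"[0])


def pearson_chi_square(data):
    """Pearson chi-square statistic of observed first digits vs. Benford."""
    probs = benford_first_digit_probs()
    counts = Counter(first_significant_digit(x) for x in data)
    n = len(data)
    return sum((counts.get(d, 0) - n * p) ** 2 / (n * p)
               for d, p in probs.items())


# Powers of two are a standard example of a Benford-like sequence.
powers_of_two = [2 ** k for k in range(1, 201)]
statistic = pearson_chi_square(powers_of_two)
```

Under the null hypothesis the statistic is compared with a chi-square distribution on 8 degrees of freedom, whose 5% critical value is about 15.51.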
Keywords
Empirical characteristic function
Empirical Laplace transform
Monte Carlo
The "mobile Motor Activity Research Consortium for Health" (mMARCH) is a collaborative network of clinical and community studies across Switzerland, Australia, and the United States, focusing on the relationship between motor activity and human physiology, behavior, and health. Accelerometer data from 10 mMARCH cohorts (N=8,903) were processed using the GGIR package. Functional principal component analysis (FPCA) was used to assess the effects of study, device, season, age, sex, and BMI on physical activity data. Notably, all data were collected using GeneActiv devices, except for one study (N=1,052) that used the GT3X device. The analysis revealed that device type was the most significant factor influencing motor activity measurements, with GT3X data distinctly separating from GeneActiv data in FPCA plots. Additionally, functional principal components were strongly affected by study and age, while sex and BMI had moderate impacts. In conclusion, due to significant variations attributed to the study in the mMARCH cohort, motor activity data could not be directly merged, and statistical analyses involving data from multiple studies should be approached with caution.
Keywords
mMARCH
GGIR package
functional principal component analysis
data merging
User purchase history or rating data often suffer from biases and sparsity. To address this problem, Bayesian personalized ranking (BPR; Rendle et al., 2009) uses statistical techniques to analyze user preferences inferred from behavioral history, capitalizing on feedback data that are typically large-scale yet sparse. The traditional BPR algorithm employs stochastic gradient descent (SGD) because of its computational simplicity and ease of implementation. However, SGD is inefficient when optimizing anisotropic functions, where gradients vary by direction. To overcome this limitation, this study proposes optimizing the BPR posterior distribution using the adaptively weighted stochastic gradient Langevin dynamics (AWSGLD; Deng et al., 2022) algorithm, which is highly scalable and capable of self-adjustment within the sample space. Additionally, we explore the application of the adaptively weighted technique to the stochastic gradient Nosé-Hoover thermostat (SGNHT; Ding et al., 2014). Empirical analyses demonstrate that the proposed AWSGMCMC-based BPR algorithms significantly outperform traditional recommendation methods, highlighting their potential to enhance recommendation accuracy.
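The baseline that this abstract improves on can be sketched as follows: one plain SGD step on the BPR objective, ln sigmoid(x_uij) minus an L2 penalty, for a matrix-factorization scorer with x_uij = p_u · (q_i - q_j). This is the traditional SGD update only, not the AWSGLD or SGNHT samplers proposed by the authors. Function names, the learning rate, and the regularization weight are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def preference_margin(p_u, q_i, q_j):
    """x_uij = p_u . (q_i - q_j): how strongly user u prefers item i over j."""
    return sum(pu * (qi - qj) for pu, qi, qj in zip(p_u, q_i, q_j))


def bpr_sgd_step(p_u, q_i, q_j, lr=0.05, reg=0.01):
    """One stochastic gradient ascent step on ln sigmoid(x_uij) - reg * ||theta||^2 / 2.

    p_u: latent factors of the user
    q_i: latent factors of an item the user interacted with
    q_j: latent factors of an item the user did not interact with
    """
    g = 1.0 - sigmoid(preference_margin(p_u, q_i, q_j))  # d ln sigmoid(x) / dx
    new_p = [pu + lr * (g * (qi - qj) - reg * pu)
             for pu, qi, qj in zip(p_u, q_i, q_j)]
    new_qi = [qi + lr * (g * pu - reg * qi) for pu, qi in zip(p_u, q_i)]
    new_qj = [qj + lr * (-g * pu - reg * qj) for pu, qj in zip(p_u, q_j)]
    return new_p, new_qi, new_qj


# Toy triple (made-up values): the margin starts negative and should grow.
p, qi, qj = [0.1, -0.2], [0.0, 0.1], [0.1, 0.0]
initial_margin = preference_margin(p, qi, qj)
for _ in range(500):
    p, qi, qj = bpr_sgd_step(p, qi, qj)
final_margin = preference_margin(p, qi, qj)
```

In practice the triple (u, i, j) is resampled at every step from the observed feedback; the loop above reuses a single triple only to keep the sketch short.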
Keywords
Personalized recommendation algorithm
Bayesian Personalized Ranking
adaptively weighted stochastic gradient MCMC
Implicit data
This study proposes a novel data-driven approach for estimating the intrinsic dimension and curvature of complex networks by modeling them as simply connected, complete Riemannian manifolds of constant curvature. Unlike existing methods that rely on predefined structural assumptions, our framework integrates the k-nearest neighbors (KNN) algorithm with the TWO-NN approach, enabling adaptive and robust network partitioning, which enhances the accuracy of dimensionality reduction while preserving essential geometric properties. By leveraging fundamental forms and hypothesis testing, our method ensures precise curvature estimation and manifold classification. Experimental results demonstrate superior robustness against noise and improved effectiveness in capturing intrinsic network geometry, significantly advancing the interpretability and applicability of network data analysis.
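As background for the TWO-NN component mentioned in this abstract, here is a minimal sketch of the maximum-likelihood form of the TWO-NN intrinsic dimension estimator, d = N / sum(log(r2 / r1)), where r1 and r2 are each point's first and second nearest-neighbor distances. It does not include the KNN-based partitioning or the curvature estimation the authors describe. The function name and the synthetic data are assumptions.

```python
import math
import random


def two_nn_dimension(points):
    """TWO-NN intrinsic dimension estimate (maximum-likelihood form).

    For each point, the ratio of second to first nearest-neighbor distance
    is assumed to follow a Pareto law with shape equal to the dimension.
    """
    log_ratio_sum = 0.0
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        log_ratio_sum += math.log(r2 / r1)
    return len(points) / log_ratio_sum


# Synthetic check: points on a 2-D plane embedded in 3-D space.
rng = random.Random(0)
plane = [(u, v, 2 * u - v)
         for u, v in ((rng.random(), rng.random()) for _ in range(400))]
estimate = two_nn_dimension(plane)
```

The brute-force neighbor search is quadratic in the number of points, which is acceptable for a sketch but not for large networks.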
Keywords
Intrinsic dimension estimation
Manifold geometry
Simply connected Riemannian manifold
Co-Author(s)
Hongyu Miao, Florida State University
Xing Qiu
First Author
Feng Wang, University of Texas Health Science Center at Houston
Presenting Author
Feng Wang, University of Texas Health Science Center at Houston
This poster session will introduce and discuss a new R package, mlmhelpr, which includes helper functions for lme4 to streamline the estimation of multilevel models for applied researchers. With mlmhelpr, users can easily conduct common tasks, such as calculating intraclass correlation coefficients and design effects, centering variables and refitting models, obtaining pseudo-R-squared measures, and estimating random intercept and slope reliabilities. The package also includes functions to compute cluster-robust and bootstrap standard errors, non-constant variance tests for detecting heteroscedasticity, and Hausman's statistic to test for differences between fixed-effect and random-effect models. Statistics and tests reported in the package are from popular multilevel modeling textbooks including Raudenbush & Bryk (2002), Hox et al. (2018), Fox (2016), and Snijders & Bosker (2012).
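Two of the quantities named in this abstract have simple textbook formulas for a two-level random-intercept model: the intraclass correlation, ICC = tau00 / (tau00 + sigma^2), and the design effect, 1 + (n - 1) * ICC for average cluster size n. The sketch below is written in Python purely for illustration and does not reproduce the mlmhelpr API. The function names and numeric inputs are hypothetical.

```python
def intraclass_correlation(between_variance, within_variance):
    """ICC = tau00 / (tau00 + sigma^2) for a random-intercept model."""
    return between_variance / (between_variance + within_variance)


def design_effect(icc, average_cluster_size):
    """Design effect = 1 + (n_bar - 1) * ICC."""
    return 1 + (average_cluster_size - 1) * icc


# Made-up variance components: intercept variance 1.0, residual variance 3.0.
icc = intraclass_correlation(1.0, 3.0)
deff = design_effect(icc, average_cluster_size=21)
```

A design effect well above 1 suggests that ignoring the clustering would noticeably understate standard errors.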
Keywords
linear mixed models
multilevel modeling
R programming
R package
statistical models
The Hawkes process is a widely used statistical model for point processes in which past events increase the intensity of the process. Strong dependence in these processes leads to challenges in point estimation and in constructing confidence intervals. Previous studies have shown that asymptotic confidence intervals perform poorly in simulation studies, while the parametric bootstrap achieves nominal coverage. This study explores non-parametric resampling methods, such as the block bootstrap and subsampling, for constructing confidence regions in highly dependent spatio-temporal Hawkes processes. These methods are applied to a criminology dataset to illustrate their practical implications.
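The self-exciting behavior described in this abstract can be made concrete with a minimal sketch of the conditional intensity of a purely temporal Hawkes process with an exponential kernel, lambda(t) = mu + sum over past events of alpha * exp(-beta * (t - t_i)). The exponential kernel and the omission of the spatial component are simplifying assumptions, and the parameter values are made up.

```python
import math


def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Conditional intensity at time t given strictly earlier events.

    mu:    baseline rate
    alpha: jump in intensity caused by each event
    beta:  exponential decay rate of that excitation
    """
    return mu + sum(alpha * math.exp(-beta * (t - s))
                    for s in event_times if s < t)


events = [0.0, 0.5, 0.6]
before_cluster = hawkes_intensity(0.4, events, mu=0.5, alpha=0.8, beta=1.0)
after_cluster = hawkes_intensity(0.7, events, mu=0.5, alpha=0.8, beta=1.0)
```

The intensity is higher just after the cluster of events at 0.5 and 0.6 than before it, which is the dependence that complicates interval estimation.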
Keywords
Hawkes Process
Block Bootstrap
Subsampling
Simulation Study
Criminology
In linear regression models, multicollinearity often results in unstable and unreliable parameter estimates. Ridge regression, a biased estimation technique, is commonly used to mitigate this issue and produce more reliable estimates of regression coefficients. Several estimators have been developed to select the optimal ridge parameter. This study focuses on the top 16 estimators from the 366 evaluated by Mermi et al. (2024), along with seven additional estimators introduced over time. These 23 estimators were compared to Ordinary Least Squares (OLS), Elastic-Net (EN), Lasso, and generalized ridge (GR) regression to evaluate their performance across different levels of multicollinearity. Simulated data, both with and without outliers, and various parametric conditions were used for the comparisons. The results indicated that certain ridge regression estimators perform reliably with small sample sizes and high correlations without outliers. However, some estimators performed better when outliers were present due to small sample sizes and increased variance. GR, EN, and Lasso were robust with large datasets, except with substantial outliers and high variance.
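For context, the ridge estimator that all of the compared ridge-parameter choices feed into is beta(k) = (X'X + kI)^(-1) X'y. The sketch below implements that closed form with a small Gaussian-elimination solver. It takes the ridge parameter k as given and does not implement any of the 23 estimators of k studied in the abstract. It also assumes centered predictors with no intercept, and all names and data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def ridge_coefficients(x_rows, y, k):
    """Ridge estimate (X'X + k I)^(-1) X'y; k = 0 gives OLS."""
    p = len(x_rows[0])
    xtx = [[sum(row[a] * row[b] for row in x_rows) + (k if a == b else 0.0)
            for b in range(p)] for a in range(p)]
    xty = [sum(row[a] * yi for row, yi in zip(x_rows, y)) for a in range(p)]
    return solve_linear_system(xtx, xty)


# Toy data generated exactly from y = 1 * x1 + 2 * x2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
y = [1.0, 2.0, 3.0, -1.0]
ols = ridge_coefficients(X, y, k=0.0)
shrunk = ridge_coefficients(X, y, k=3.0)
```

Increasing k shrinks the coefficients toward zero, trading bias for lower variance, which is why the choice of k matters under multicollinearity.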
Keywords
MSE
Multicollinearity
Ridge regression
Lasso
Elastic net
OLS
Data analysis is essential to evidence-based medicine, yet many clinicians encounter significant technical barriers due to limited training in statistical learning and data science workflows. These challenges often result in inefficiencies, errors, and a dependency on external experts for quantitative analyses. To address this, we introduce TableMage, an open-source, user-friendly Python package tailored for clinical researchers. TableMage enhances analytical workflows through a low-code API that supports exploratory data analysis, regression modeling, and machine learning. It also features a no-code interface powered by large language models (LLMs), enabling users to conduct secure analyses of proprietary datasets via locally hosted open-source LLMs, thereby ensuring data privacy. Our benchmarks against GPT-4o Advanced Data Analysis on 21 public datasets demonstrate that TableMage delivers comparable accuracy in core data analysis tasks, superior performance in machine learning applications, and enhanced flexibility for secure data handling. By equipping clinicians with the tools to directly engage with data, TableMage fosters more efficient, accurate, and independent research.
Keywords
software
data science
large language models
agents
generative AI
machine learning
Multiple imputation of missing data has been an active area of statistics research since before the big data era. In this project, we study the use of a multiple imputation approach on a health-related data set with eight identified variables whose missing-data rates range from 0 to 16%. We conducted multiple imputations (simple random) on this data set.
Furthermore, to investigate the use of multiple imputation across a variety of missing-data structures and missing-data rates, we generated incomplete data sets from the complete data set obtained from the health-related data. The generated incomplete data sets were analyzed with logistic regression, using multiple imputation to handle the missing data. The results of the regression analyses on those incomplete data sets were compared with the results obtained from the complete data set. Our results suggest that, for the logistic regression analysis, estimates obtained with five imputations are similar to those obtained with 100 imputations. Our results also indicate that missing data have a substantial influence on coefficients, odds ratios, and p-values in logistic regression analysis, especially when the missing rate is high. In such cases, even with multiple imputati
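Comparing results from 5 versus 100 imputations requires combining per-imputation estimates into one. A standard way to do this is Rubin's rules, sketched below for a single coefficient: the pooled estimate is the mean of the per-imputation estimates, and the total variance is W + (1 + 1/m) * B, where W is the average within-imputation variance and B is the between-imputation variance. Whether the authors pooled in exactly this way is an assumption, and the numbers are made up.

```python
from statistics import mean, variance


def pool_rubin(estimates, within_variances):
    """Pool one coefficient across m imputed data sets with Rubin's rules.

    estimates:        the coefficient estimated on each imputed data set
    within_variances: its squared standard error on each imputed data set
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(within_variances)
    between = variance(estimates)  # sample variance, denominator m - 1
    total = within + (1 + 1 / m) * between
    return pooled, total


# Made-up log-odds estimates from m = 3 imputed data sets.
pooled_estimate, total_variance = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The factor (1 + 1/m) shrinks toward 1 as the number of imputations grows.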
Keywords
Missing data
multiple imputation
simulation study
logistic regression
Health-related study
First Author
Bin Ge, University of Missouri-Columbia
Presenting Author
Bin Ge, University of Missouri-Columbia
Forecasting time series is critical in domains like finance, epidemiology, and engineering. Classical models like ARIMA, GARCH, and state-space formulations capture temporal dependencies and volatility structures, while modern approaches like reservoir computing and deep learning handle complex dynamics. A key challenge across these methods is principled uncertainty quantification. Conformal prediction (CP) provides a model-agnostic framework for constructing prediction intervals with finite-sample validity, ensuring reliable uncertainty quantification. However, CP's efficiency varies across data-generating processes (DGPs), particularly in settings with residual dependence, complex temporal structures, or limited data.
This study evaluates CP across diverse DGPs, including stationary and non-stationary processes, latent-state models, and differential equation-driven systems. We compare classical and modern forecasting methods on interval coverage, efficiency, and robustness under distribution shifts. Additionally, we explore empirical Bayes as a bridge between likelihood-based inference and CP, offering insights into balancing predictive flexibility and reliability.
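The conformal prediction idea in this abstract can be sketched in its simplest split-conformal form: take absolute residuals on a held-out calibration set, use their ceil((n + 1)(1 - alpha))-th smallest value as the half-width, and center the interval on the point forecast. The finite-sample guarantee assumes exchangeable residuals, which time series with residual dependence may violate; that is the issue the study examines. Function names and data are hypothetical.

```python
import math


def split_conformal_halfwidth(calibration_residuals, alpha=0.1):
    """Half-width giving at least 1 - alpha coverage under exchangeability."""
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]


def split_conformal_interval(point_forecast, calibration_residuals, alpha=0.1):
    """Symmetric prediction interval around any model's point forecast."""
    h = split_conformal_halfwidth(calibration_residuals, alpha)
    return point_forecast - h, point_forecast + h


# Made-up calibration residuals from an arbitrary forecaster.
residuals = [-1, 2, -3, 4, 5, -6, 7, 8, -9, 10]
lower, upper = split_conformal_interval(5.0, residuals, alpha=0.2)
```

Because the procedure only needs residuals, the same code applies whether the forecaster is ARIMA, a state-space model, or a neural network.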
Keywords
Conformal Prediction
Time Series Forecasting
Uncertainty Quantification
Machine Learning
Commercial software for clinical trial design can have limitations. To address this, customized R code is often integrated with software tools to either replace or enhance native capabilities so that users can simulate flexible designs. To assist in this process, we offer an AI coding assistant that helps with writing compatible R functions. This AI assistant is particularly beneficial for new users and ensures compatibility with the input/output parameters allowed in the R function template.
The current integration points we are focusing on include:
* Simulating patients' responses for Binary, Continuous, Time-to-event, and Repeated-measure endpoints
* Analyzing simulated data for the above endpoints
* Randomization of patients
* Customized enrollment and dropout mechanism
Our platform, powered by Azure OpenAI's GPT-4 LLM, integrates with Cytel's in-house R package, CyneRgy, for custom adaptive clinical trial designs. We provide testing code for AI-generated R functions, with features to detect errors. We adhere to Azure OpenAI's data protection policies to ensure security. At present, access to our platform is exclusive to Cytel's East Horizon users.
Keywords
Generative-AI
R-Coding-Assistant
Custom-Adaptive-Clinical-Trial-Design
LLM
R-Integration