Sunday, Aug 3: 2:00 PM - 3:50 PM
4006
Contributed Papers
Music City Center
Room: CC-103B
Main Sponsor
Section on Statistical Computing
Presentations
With the dramatic rise of automatic data collection, a huge volume of data is recorded on a daily basis. Despite the potential of big data, it is computationally expensive to fit traditional regression models to datasets with billions of rows. This motivates the use of Optimal Design Based (ODB) subsampling, which identifies a subset that maximizes an optimality criterion typically used in experimental design. Existing methods, such as Information-Based Optimal Subdata Selection (IBOSS), focus on the D-optimality criterion, which minimizes the generalized variance of the parameter estimates. While this is helpful for parameter estimation, little attention has been given to criteria that favor model prediction, such as the I-optimality criterion. In this paper, we propose new algorithms that identify I-optimal subsamples from massive datasets. These algorithms lead to computationally efficient and reliable prediction for linear regression models. The algorithms are extended to the case where there is heteroscedasticity in the errors. Case studies illustrate that the proposed methods have smaller prediction error than existing methods.
Keywords
Experimental Design
Big Data
Subsampling
I-optimality
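The core computation is inexpensive to prototype. Below is a minimal, hypothetical sketch of greedy I-optimal subsample selection in Python; it is not the paper's algorithm, and the candidate-screening shortcut and function names are illustrative only.

import numpy as np

def i_criterion(Xs, M):
    """Average prediction variance over the region: trace((Xs'Xs)^{-1} M)."""
    return np.trace(np.linalg.solve(Xs.T @ Xs, M))

def greedy_i_optimal_subsample(X, k, n_cand=512, seed=0):
    """Grow a k-point subsample greedily, keeping the I-criterion small."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    M = X.T @ X / n                              # moment matrix of the full data
    chosen = list(rng.choice(n, size=p + 1, replace=False))  # seed set
    remaining = np.setdiff1d(np.arange(n), chosen)
    while len(chosen) < k:
        cand = rng.choice(remaining, size=min(n_cand, len(remaining)), replace=False)
        scores = [i_criterion(X[chosen + [j]], M) for j in cand]
        best = cand[int(np.argmin(scores))]
        chosen.append(best)
        remaining = remaining[remaining != best]
    return np.array(chosen)

X = np.random.default_rng(1).normal(size=(100_000, 5))
idx = greedy_i_optimal_subsample(X, k=200)       # rows to fit the regression on

A D-optimal variant would instead maximize det(Xs'Xs); the only change is the scoring function.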
The a priori procedure is concerned with determining appropriate sample sizes to ensure that the sample statistics to be obtained are likely to be good estimators of the corresponding population parameters. Previous researchers have shown how to compute a priori confidence intervals for means or locations under normal and asymmetric distributions. In this paper, we extend a priori thinking to the unified skew-normal (SUN) distribution, where the researcher is interested in the location for one sample or in the difference in locations across two matched samples. The proposed procedure can be used under the assumption that the sample(s) come from unified skew-normal distributions. Simulation studies support the equations presented, and two applications involving real data sets illustrate our main results.
Keywords
A priori procedure
Unified skew-normal distribution
Sample size
Confidence interval
Coverage probability
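For intuition, the normal-theory version of the procedure reduces to a one-line sample-size formula; the sketch below shows only that familiar case (the SUN extension relies on the paper's equations, which are not reproduced here).

import math
from scipy.stats import norm

def apriori_n(f, c):
    """Smallest n with P(|xbar - mu| < f*sigma) >= c under normality."""
    z = norm.ppf((1 + c) / 2)
    return math.ceil((z / f) ** 2)

print(apriori_n(f=0.2, c=0.95))   # 97: n needed to land within 0.2*sigma with 95% probability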
Recent advances in generative artificial intelligence (GenAI) models have enabled the generation of personalized content that adapts to up-to-date user context. While personalized decision systems are often modeled using bandit formulations, the integration of GenAI introduces new structure into otherwise classical sequential learning problems. In GenAI-powered interventions, the agent selects a query, but the environment experiences a stochastic response drawn from the generative model. Standard bandit methods do not explicitly account for this structure, where actions influence rewards only through stochastic, observed treatments. We introduce generator-mediated bandit-Thompson sampling (GAMBITTS), a bandit approach designed for this action/treatment split, using mobile health interventions with large language model-generated text as a motivating case study. GAMBITTS explicitly models both the treatment and reward generation processes, using information in the delivered treatment to accelerate policy learning relative to standard methods. We establish regret bounds for GAMBITTS by decomposing sources of uncertainty in treatment and reward, identifying conditions where it achieves stronger guarantees than standard bandit approaches. In simulation studies, GAMBITTS consistently outperforms conventional algorithms by leveraging observed treatments to more accurately estimate expected rewards.
Keywords
Thompson Sampling
Contextual Bandit
Just-in-Time Adaptive Interventions
Mobile Health (mHealth)
Reinforcement Learning
Large Language Models (LLMs)
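A minimal sketch of the generator-mediated idea, with Gaussian stand-ins for the treatment (e.g., generated-text features) and reward models; the update rules below are illustrative, not the GAMBITTS implementation.

import numpy as np

rng = np.random.default_rng(0)
K, d, T = 3, 2, 2000                          # arms (queries), treatment dim, horizon
theta_true = np.array([1.0, -0.5])            # reward weights on treatment features
gen_means = rng.normal(size=(K, d))           # each query's generator mean (unknown to agent)

A, b = np.eye(d), np.zeros(d)                 # Bayesian linear model: reward | treatment
mu_hat, counts = np.zeros((K, d)), np.zeros(K)  # running treatment-model estimates

for _ in range(T):
    theta = rng.multivariate_normal(np.linalg.solve(A, b), np.linalg.inv(A))
    arm = int(np.argmax(mu_hat @ theta))      # expected reward via the treatment model
    z = gen_means[arm] + rng.normal(scale=0.5, size=d)   # stochastic observed treatment
    r = z @ theta_true + rng.normal(scale=0.1)           # reward depends on z, not on arm
    A += np.outer(z, z); b += r * z           # reward model learns from the delivered z
    counts[arm] += 1
    mu_hat[arm] += (z - mu_hat[arm]) / counts[arm]

The structural point sits in the update line: the reward model conditions on the delivered treatment z rather than on the arm index, which is what lets information accumulate across arms.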
In retrospective studies, inverse probability treatment weighting (IPTW) and entropy balancing (EB) help achieve covariate balance and reduce confounding. This study compared the two methods using Merative claims data (2006-2024). Three patient cohorts were balanced on age, sex, insurance type, region, and Elixhauser Comorbidity Index (ECI): two with binary treatments, targeting the average treatment effect on the treated (ATT), and one with a multinomial treatment, targeting the average treatment effect (ATE). Balance was assessed via effective sample size (ESS), weight distributions, and absolute standardized mean differences (ASMD). In the first binary group (48 vs. 4,800 patients), both methods achieved balance: IPTW (ASMD <0.01; ESS: 1,545; weights: 0.01-0.1) and EB (ASMD <0.001; ESS: 1,353; weights: 0.01-11.69). In the second binary group (24,423 vs. 16,406 patients), only EB balanced all covariates (ASMD <0.0001; ESS: 5,913; weights: 0.01-24). In the multinomial group (350 vs. 53 vs. 82 patients), only EB balanced all covariates (ASMD <0.001; ESS: 338, 39, 48; weights: 0.01-4.8). Findings suggest that EB, especially with second-moment constraints, provides better covariate balance in real-world studies.
Keywords
Entropy Balancing
Inverse Probability Treatment Weighting (IPTW)
Real World Data
Observational studies
Multinomial
Second moments
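As a schematic, the two weighting schemes can be compared on simulated data. The sketch below is illustrative only (made-up covariates, first-moment constraints), not the study's analysis code.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))                              # toy covariates
e = 1 / (1 + np.exp(-(X @ np.array([0.8, -0.5, 0.3])))) # propensity (fitted in practice)
t = rng.binomial(1, e)

w_iptw = np.where(t == 1, 1.0, e / (1 - e))              # IPTW weights for the ATT

target = X[t == 1].mean(axis=0)                          # treated covariate means
Xc = X[t == 0]
def dual(lam):                                           # convex dual of entropy balancing
    return np.log(np.exp(Xc @ lam).sum()) - lam @ target
lam = minimize(dual, np.zeros(3), method="BFGS").x
w_eb = np.exp(Xc @ lam); w_eb /= w_eb.sum()              # normalized control weights

print(np.abs(Xc.T @ w_eb - target).max())                # first moments match ~exactly

Second-moment constraints, as mentioned in the abstract, correspond to appending columns such as X**2 to Xc and target before solving the dual.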
Improving prediction accuracy in precision medicine is critical for identifying and treating at-risk patients in a timely manner. Accounting for temporal dynamics between variables by jointly modeling longitudinal and time-to-event data improves time-to-event prediction. However, parametric assumptions in both the longitudinal and survival sub-models, and the computational burden of integrating over a large number of random effects for multivariate longitudinal data, are limitations of traditional joint models. In this study, we propose a deep-learning joint modeling architecture using Kolmogorov-Arnold Networks: JM-KAN.
We used various survival loss functions, such as Cox proportional hazards (PH) and the non-proportional Cox-Time, in building the survival sub-model for JM-KAN. We used two clinical datasets: 1) 2,711 unique patients with Mild Cognitive Impairment (MCI) and no prior diagnosis of Alzheimer's disease (AD) from the National Alzheimer's Coordinating Center (NACC), to predict their disease progression from MCI to AD, and 2) 32,525 liver transplantation (LT) recipients with a major adverse cardiovascular event (MACE) diagnosis within 90 days post-LT, to track death following MACE. We also used 100 simulated datasets of 1,000 subjects each, covering PH, unspecified-interaction, and non-PH scenarios.
Comparing the KAN-based survival sub-model to existing survival methods, such as random survival forests and the probability-mass-function approach (DeepHit), the Cox PH model showed high discrimination in the PH scenario, and the Cox-Time model showed enhanced overall performance. The Cox-Time model was also superior for death prediction in the OPTN data. Coupling these Cox PH (CPH) and Cox-Time (CT) sub-models with dynamic longitudinal predictions, we found that JM-KAN-CT had the highest discrimination across all three simulation scenarios (integrated area under the curve (iAUC) 0.912-0.921) as well as the best calibration (integrated Brier score (iBS) 0.057-0.064). JM-KAN-CPH showed calibration comparable to JM-KAN-CT under the PH scenario. In the clinical datasets, JM-KAN-CPH was superior in dynamic prediction of both longitudinal covariates and survival probability compared to existing methods such as DeepSurv, MFPCCox, and MATCH-net on the NACC dataset. Similarly, JM-KAN-CT had the highest iAUC (0.669) and the lowest iBS (0.171) against the same models on the OPTN data.
We conclude that JM-KAN performs well in both discrimination and calibration, although computational burdens, such as run time and the need for the full dataset at analysis time, remain a challenge. In the future, fast approximations to the loss function as well as the integration of stochastic methods may be warranted.
Keywords
AI
Joint modeling
Neural Networks
Prognosis
Prediction
Co-Author(s)
Ruosha Li, University of Texas School of Public Health
Wenyaw Chan, University of Texas-Houston
Xi Luo, University of Texas Health Science Center at Houston
Cui Tao, Mayo Clinic Department of Artificial Intelligence and Informatics
First Author
Sori Lundin, The University of Texas Health Science Center at Houston
Presenting Author
Sori Lundin, The University of Texas Health Science Center at Houston
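Whatever network produces the risk scores, a Cox PH survival sub-model's objective is the standard negative partial log-likelihood. A minimal numpy sketch, assuming no tied event times (this is the generic loss, not the JM-KAN code):

import numpy as np

def cox_ph_loss(risk, time, event):
    """Negative Cox partial log-likelihood for risk scores risk = f(x)."""
    order = np.argsort(-time)                     # descending time: risk sets are prefixes
    eta, ev = risk[order], event[order]
    log_risk_set = np.logaddexp.accumulate(eta)   # log sum_{j: t_j >= t_i} exp(eta_j)
    return -np.sum((eta - log_risk_set) * ev) / ev.sum()

rng = np.random.default_rng(0)
eta = rng.normal(size=100)                        # scores from any network, e.g. a KAN
time, event = rng.exponential(size=100), rng.binomial(1, 0.7, size=100)
print(cox_ph_loss(eta, time, event))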
When the quality of the training data underlying a classifier is degraded, multiple effects arise: on the boundary structure of the classifier, on its performance on the training data, and on its performance on validation data. We illustrate these effects in the context of metagenomic assembly of short DNA reads arising from one of three genomes, for four classifiers: a naive Bayes classifier, a partition model, a random forest, and a neural net. In particular, when the quality of the training data can be parameterized, we show the existence of phase transitions where the behavior of the individual classifiers, as well as the congruence among them, changes dramatically.
Keywords
Classifier
Training data
Data quality
Phase transition
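The style of experiment is easy to emulate: parameterize training-data quality (here, a label-flip rate on synthetic Gaussian features standing in for read-derived features) and track validation accuracy. A hypothetical sketch with two of the four classifier families:

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 8))                       # stand-in for read-derived features
y = (X[:, 0] + X[:, 1] > 0).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

for noise in [0.0, 0.2, 0.4, 0.45, 0.5]:             # label-flip rate = data-quality knob
    y_noisy = np.where(rng.random(len(ytr)) < noise, 1 - ytr, ytr)
    for clf in (GaussianNB(), RandomForestClassifier(random_state=0)):
        acc = clf.fit(Xtr, y_noisy).score(Xte, yte)
        print(f"noise={noise:.2f}  {type(clf).__name__}: {acc:.3f}")

Validation accuracy degrades gracefully until the flip rate approaches 0.5, where it collapses to chance; plotting accuracy against the noise parameter is a one-dimensional analogue of the phase-transition behavior described.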
Clustering algorithms for quantitative data have been explored extensively in the literature. However, many real-life applications involve qualitative data, and the range of clustering procedures available in this framework is very limited. Categorical sequences have recently attracted the attention of researchers, but several existing methods for the analysis of such data were developed for univariate sequences. Oftentimes, however, observations take the form of multivariate categorical sequences, and there is currently a lack of models developed for this framework. Analyzing several univariate sequences separately ignores possible effects of the sequences on each other and poses challenges in agglomerating the obtained results. In this paper, we propose a novel mixture model for multivariate categorical sequences that can effectively model heterogeneity in the data and reflect its dynamic nature. As we demonstrate in a series of simulation studies, the developed mixture model shows good model-based clustering performance. An application of the method to the British Household Panel Survey data set produces meaningful results.
Keywords
EM algorithm
Finite mixture model
Markov model
model-based clustering
multivariate categorical sequences
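For concreteness, a minimal EM sketch for a mixture of first-order Markov chains, the univariate building block (initial-state probabilities omitted for brevity); the paper's model extends this idea to multivariate sequences.

import numpy as np

def em_markov_mixture(seqs, K, S, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1 / K)                         # mixture weights
    P = rng.dirichlet(np.ones(S), size=(K, S))     # per-cluster transition matrices
    for _ in range(iters):
        # E-step: responsibilities from per-cluster sequence log-likelihoods
        logr = np.log(pi) + np.array(
            [[np.log(P[k][s[:-1], s[1:]]).sum() for k in range(K)] for s in seqs])
        logr -= logr.max(axis=1, keepdims=True)
        r = np.exp(logr); r /= r.sum(axis=1, keepdims=True)
        # M-step: mixture weights and responsibility-weighted transition counts
        pi = r.mean(axis=0)
        C = np.zeros((K, S, S))
        for k in range(K):
            for i, s in enumerate(seqs):
                np.add.at(C[k], (s[:-1], s[1:]), r[i, k])
        P = (C + 1e-9) / (C + 1e-9).sum(axis=2, keepdims=True)
    return pi, P, r

seqs = [np.random.default_rng(i).integers(0, 3, size=20) for i in range(60)]
pi, P, resp = em_markov_mixture(seqs, K=2, S=3)    # resp gives the soft clustering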