Computing Advances in Sampling Methodology & Categorical Data

Chair: Li-Hsiang Lin
Georgia State University
 
Sunday, Aug 3: 2:00 PM - 3:50 PM
4006 
Contributed Papers 
Music City Center 
Room: CC-103B 

Main Sponsor

Section on Statistical Computing

Presentations

Advances in Exact Subsampling Methods with Linear Regression Models

With the dramatic rise of automatic data collection, a huge volume of data is recorded on a daily basis. Despite the potential of big data, it is computationally expensive to fit traditional regression models to datasets with billions of rows. This motivates the use of Optimal Design Based (ODB) subsampling, which identifies a subset that maximizes an optimality criterion typically used in experimental design. Existing methods, such as Information-Based Optimal Subdata Selection (IBOSS), focus on the D-optimality criterion, which minimizes the generalized variance of the parameter estimates. While this is helpful for parameter estimation, little attention has been given to criteria that favor model prediction, such as the I-optimality criterion. In this paper, we propose new algorithms that identify I-optimal subsamples from massive datasets. These algorithms lead to computationally efficient and reliable prediction for linear regression models. The algorithms are extended to the case where there is heteroscedasticity in the errors. Case studies illustrate that the proposed methods have smaller prediction error than existing methods.  
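
A minimal Python sketch of the I-optimality criterion that drives this kind of subsampling (a naive exchange heuristic for illustration, not the authors' algorithm): it scores a candidate subsample X_s by the average prediction variance tr(W M^{-1}), with M = X_s' X_s and W the moment matrix of the prediction region, and greedily swaps rows to lower that score. The data sizes and the candidate-pool shortcut are illustrative assumptions.

```python
# Illustrative exchange heuristic for I-optimal subsampling (not the paper's method).
import numpy as np

rng = np.random.default_rng(0)
N, p, n = 20_000, 5, 100                  # full-data rows, predictors, subsample size
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])  # design with intercept

W = X.T @ X / N                           # moment matrix of the prediction region,
                                          # here approximated by the data cloud itself

def crit(M):
    """I-criterion: average prediction variance tr(W M^{-1}), up to sigma^2."""
    return np.trace(W @ np.linalg.inv(M))

idx = list(rng.choice(N, n, replace=False))   # random starting subsample
pool = rng.choice(N, 500, replace=False)      # small candidate pool keeps the search cheap
M = X[idx].T @ X[idx]
best = crit(M)
for _ in range(2):                            # a couple of exchange passes
    for j in range(n):
        xj = X[idx[j]]
        for c in pool:
            if c in idx:
                continue
            xc = X[c]
            M_try = M - np.outer(xj, xj) + np.outer(xc, xc)  # rank-one swap
            val = crit(M_try)
            if val < best:
                idx[j], M, best, xj = int(c), M_try, val, xc

print(f"I-criterion of selected subsample: {best:.4f}")
```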

Keywords

Experimental Design

Big Data

Subsampling

I-optimality 

Co-Author

Nicholas Rios, George Mason University

First Author

Jiayi Zheng

Presenting Author

Jiayi Zheng

The A Priori Procedure for Estimating the Location Parameter in the Unified Skew-Normal Distribution

The a priori procedure is concerned with determining appropriate sample sizes to ensure that the sample statistics to be obtained are likely to be good estimators of the corresponding population parameters. Previous researchers have shown how to compute a priori confidence intervals for means or locations under normal and asymmetric distributions. In this paper, we extend a priori thinking to the unified skew-normal (SUN) distribution, where the researcher is interested in the location for one sample and in the difference in locations across two matched samples. The proposed procedure can be used under the assumption that the sample(s) come from unified skew-normal distributions. Simulation studies support the equations presented, and two applications involving real data sets illustrate our main results. 
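
A minimal simulation sketch of the a priori idea for one sample: find the smallest n at which the (bias-corrected) sample mean lands within a tolerance f of the true location with probability at least c. SciPy's ordinary skew-normal stands in for the SUN family, which SciPy does not provide, and f, c, and the parameter values below are illustrative assumptions.

```python
# Simulation sketch of the a priori procedure under a skew-normal stand-in for SUN.
import numpy as np
from scipy.stats import skewnorm

rng = np.random.default_rng(1)
loc, scale, alpha = 0.0, 1.0, 4.0    # location, scale, skewness (assumed values)
f, c, reps = 0.2, 0.95, 5_000        # tolerance (in scale units), target coverage, replications

for n in range(10, 500, 10):
    draws = skewnorm.rvs(alpha, loc=loc, scale=scale, size=(reps, n), random_state=rng)
    # bias-correct the sample mean so it estimates the location parameter itself
    est = draws.mean(axis=1) - skewnorm.mean(alpha, loc=0.0, scale=scale)
    coverage = np.mean(np.abs(est - loc) <= f * scale)
    if coverage >= c:
        print(f"smallest n meeting coverage {c}: {n} (empirical coverage {coverage:.3f})")
        break
```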

Keywords

A priori procedure

Unified skew-normal distribution

Sample size

Confidence interval

Coverage probability 

Co-Author

Weizhong Tian

First Author

Cong Wang, University of Nebraska at Omaha

Presenting Author

Cong Wang, University of Nebraska at Omaha

Generator-Mediated Bandits: Thompson Sampling for GenAI-Powered Adaptive Interventions

Recent advances in generative artificial intelligence (GenAI) models have enabled the generation of personalized content that adapts to up-to-date user context. While personalized decision systems are often modeled using bandit formulations, the integration of GenAI introduces new structure into otherwise classical sequential learning problems. In GenAI-powered interventions, the agent selects a query, but the environment experiences a stochastic response drawn from the generative model. Standard bandit methods do not explicitly account for this structure, where actions influence rewards only through stochastic, observed treatments. We introduce generator-mediated bandit-Thompson sampling (GAMBITTS), a bandit approach designed for this action/treatment split, using mobile health interventions with large language model-generated text as a motivating case study. GAMBITTS explicitly models both the treatment and reward generation processes, using information in the delivered treatment to accelerate policy learning relative to standard methods. We establish regret bounds for GAMBITTS by decomposing sources of uncertainty in treatment and reward, identifying conditions where it achieves stronger guarantees than standard bandit approaches. In simulation studies, GAMBITTS consistently outperforms conventional algorithms by leveraging observed treatments to more accurately estimate expected rewards. 
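
A schematic toy version of the action/treatment split described above (inspired by, but not reproducing, GAMBITTS): the agent Thompson-samples reward coefficients from a Bayesian linear model fit on observed treatment features, and picks the action whose estimated mean treatment features score best under the sampled coefficients. All distributions and parameter values are illustrative assumptions.

```python
# Toy generator-mediated Thompson sampler: actions emit stochastic treatments,
# and rewards depend only on the observed treatment features.
import numpy as np

rng = np.random.default_rng(2)
K, d, T = 3, 4, 2_000                      # actions, treatment-feature dim, horizon
mu_true = rng.normal(size=(K, d))          # true mean treatment features per action
theta_true = rng.normal(size=d)            # true reward coefficients

B, b, sigma2 = np.eye(d), np.zeros(d), 1.0  # Bayesian linear regression posterior state
mu_hat, counts = np.zeros((K, d)), np.zeros(K)

for t in range(T):
    if t < K:                              # pull each action once to initialize
        a = t
    else:                                  # Thompson sampling through the treatment model
        theta_s = rng.multivariate_normal(np.linalg.solve(B, b),
                                          sigma2 * np.linalg.inv(B))
        a = int(np.argmax(mu_hat @ theta_s))
    phi = mu_true[a] + 0.5 * rng.normal(size=d)   # generator emits a stochastic treatment
    r = phi @ theta_true + rng.normal()           # reward depends only on the treatment
    B += np.outer(phi, phi); b += r * phi         # update the reward posterior
    counts[a] += 1                                # update per-action treatment means
    mu_hat[a] += (phi - mu_hat[a]) / counts[a]

print("pulls per action:", counts.astype(int))
```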

Keywords

Thompson Sampling

Contextual Bandit

Just-in-Time Adaptive Interventions

Mobile Health (mHealth)

Reinforcement Learning

Large Language Models (LLMs) 

Co-Author(s)

Gabriel Durham, University of Michigan
Kihyuk Hong, University of Michigan

First Author

Marc Brooks

Presenting Author

Marc Brooks

Comparing Entropy Balancing & IPTW for Cohort Balancing: Case Studies from Real-World Claims Data

In retrospective studies, inverse probability treatment weighting (IPTW) and entropy balancing (EB) help achieve covariate balance and reduce confounding. This study compared these two methods using Merative claims data (2006-2024). Three patient cohort groups were balanced on age, sex, insurance type, region and Elixhauser Comorbidity Index (ECI): two with binary treatments using average treatment effect on treated (ATT) and one multinomial treatment using average treatment effect (ATE). Balance was assessed via effective sample size (ESS), weight distribution and absolute standardized mean difference (ASMD). In the first binary group (48 vs. 4,800 patients), both methods achieved balance: IPTW (ASMD <0.01; ESS: 1,545; weights: 0.01-0.1) and EB (ASMD <0.001; ESS: 1,353; weights: 0.01-11.69). In the second binary group (24,423 vs. 16,406 patients), only EB balanced all covariates (ASMD <0.0001; ESS: 5,913; weights: 0.01-24). In the multinomial group (350 vs. 53 vs. 82 patients), only EB balanced all covariates (ASMD <0.001; ESS: 338, 39, 48; weights: 0.01-4.8). Findings suggest EB, especially with second-moment constraints, provides better covariate balance in real-world studies. 
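
For readers unfamiliar with the two weighting schemes, the sketch below contrasts them on simulated data: IPTW derives control weights from a fitted propensity score, while entropy balancing solves a convex dual so that reweighted control covariate means exactly match the treated means (first moments only here; the abstract also considers second-moment constraints). The data, covariates, and balance summary are illustrative assumptions, not the Merative analysis.

```python
# Illustrative ATT weighting: IPTW via logistic propensity scores vs. entropy
# balancing via its convex dual (first-moment constraints only).
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 5_000
X = rng.normal(size=(n, 4))                         # covariate stand-ins
p_treat = 1 / (1 + np.exp(-(X @ np.array([0.6, -0.4, 0.3, 0.2]))))
z = rng.binomial(1, p_treat)                        # treatment indicator

# IPTW (ATT): treated get weight 1, controls get e(x) / (1 - e(x))
e = LogisticRegression(max_iter=1000).fit(X, z).predict_proba(X)[:, 1]
w_iptw = np.where(z == 1, 1.0, e / (1 - e))

# Entropy balancing (ATT): reweight controls so their covariate means match
# the treated means; solve the convex dual for lambda.
m = X[z == 1].mean(axis=0)                          # target: treated means
Xc = X[z == 0]
dual = lambda lam: np.log(np.exp((Xc - m) @ lam).sum())
lam = minimize(dual, np.zeros(X.shape[1]), method="BFGS").x
w_eb = np.exp((Xc - m) @ lam)
w_eb /= w_eb.sum()

# Simplified ASMD: weighted control means vs. treated means, in pooled-SD units
asmd = lambda w: np.abs((Xc * w[:, None]).sum(0) / w.sum() - m) / X.std(0)
print("ASMD after IPTW:", asmd(w_iptw[z == 0]).round(4))
print("ASMD after EB:  ", asmd(w_eb).round(4))
```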

Keywords

Entropy Balancing

Inverse Probability Treatment Weighting (IPTW)

Real World Data

Observational studies

Multinomial

Second moments 

Co-Author(s)

Jason Poh, EVERSANA
Mostafa Shokoohi, EVERSANA

First Author

Ramaa Nathan, EVERSANA

Presenting Author

Ramaa Nathan, EVERSANA

AI-powered joint model of longitudinal and survival outcomes with various survival loss functions

Improving prediction accuracy in precision medicine is critical for identifying and treating patients at risk in a timely manner. Accounting for temporal dynamics between variables by jointly modeling longitudinal and time-to-event data improves time-to-event prediction. However, parametric assumptions in both the longitudinal and survival sub-models, and the computational burden of integrating over a large number of random effects for multivariate longitudinal data, are limitations of traditional joint models. In this study, we propose a deep-learning joint modeling architecture using Kolmogorov-Arnold Networks: JM-KAN.
We utilized various survival loss functions, such as Cox proportional hazards (PH) and the non-proportional Cox-Time, in building a survival sub-model for JM-KAN. We used two clinical datasets: 1) 2,711 unique patients with Mild Cognitive Impairment (MCI) and no prior diagnosis of Alzheimer's disease (AD) from the National Alzheimer's Coordinating Center (NACC), to predict disease progression from MCI to AD, and 2) 32,525 liver transplantation (LT) recipients with a major adverse cardiovascular event (MACE) diagnosis within 90 days post-LT, to model death following MACE. We also utilized 100 simulated datasets of 1,000 subjects each, under PH, unspecified-interaction, and non-PH scenarios.
Comparing the KAN-based survival sub-model to existing survival methods, such as random survival forests and a probability-mass-function approach (DeepHit), demonstrated that the Cox PH model showed high discrimination in the PH scenario and that the Cox-Time model showed enhanced overall performance. The Cox-Time model was also superior for death prediction on the Organ Procurement and Transplantation Network (OPTN) data. Coupling these Cox PH (CPH) and Cox-Time (CT) sub-models with dynamic longitudinal predictions, we found that JM-KAN-CT had the highest discrimination across all three simulation scenarios (integrated area under the curve (iAUC) 0.912-0.921) as well as the best calibration (integrated Brier score (iBS) 0.057-0.064). JM-KAN-CPH also showed calibration comparable to JM-KAN-CT under the PH scenario. On the clinical datasets, JM-KAN-CPH was superior for dynamic prediction of both longitudinal covariates and survival probability compared with existing methods such as DeepSurv, MFPCCox, and MATCH-Net on the NACC dataset. Similarly, JM-KAN-CT had the highest iAUC (0.669) and the lowest iBS (0.171) compared with the same models on the OPTN data.
We conclude that JM-KAN performs well in both discrimination and calibration, although computational burdens, such as run time and the need for the complete dataset at analysis time, remain challenges. In the future, fast approximations to the loss function as well as the integration of stochastic methods may be warranted. 
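
As a hypothetical illustration of one ingredient the abstract names (not the JM-KAN architecture itself), the sketch below trains a small PyTorch network with a Cox proportional-hazards partial-likelihood loss; a plain MLP stands in for the Kolmogorov-Arnold layers, and the data are simulated.

```python
# Minimal neural survival sub-model trained with a Cox PH partial-likelihood loss.
import torch

def cox_ph_loss(risk, time, event):
    """Negative Cox partial log-likelihood (Breslow-style, no tie correction)."""
    order = torch.argsort(time, descending=True)      # so cumulative sums give risk sets
    risk, event = risk[order], event[order]
    log_cum_hazard = torch.logcumsumexp(risk, dim=0)  # log sum of exp(risk) over risk set
    return -((risk - log_cum_hazard) * event).sum() / event.sum()

torch.manual_seed(0)
n, p = 512, 10
x = torch.randn(n, p)
time = torch.rand(n) * 5                              # simulated event/censoring times
event = torch.bernoulli(torch.full((n,), 0.7))        # 1 = event observed, 0 = censored

net = torch.nn.Sequential(torch.nn.Linear(p, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))     # MLP stand-in for a KAN
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for epoch in range(100):
    opt.zero_grad()
    loss = cox_ph_loss(net(x).squeeze(-1), time, event)
    loss.backward()
    opt.step()
print(f"final loss: {loss.item():.4f}")
```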

Keywords

AI

Joint modeling

Neural Networks

Prognosis

Prediction 

Co-Author(s)

Ruosha Li, University of Texas School of Public Health
Wenyaw Chan, University of Texas-Houston
Xi Luo, University of Texas Health Science Center at Houston
Cui Tao, Mayo Clinic Department of Artificial Intelligence and Informatics

First Author

Sori Lundin, The University of Texas Health Science Center at Houston

Presenting Author

Sori Lundin, The University of Texas Health Science Center at Houston

Effect of Training Data Quality on Classifier Performance

When the quality of the training data underlying a classifier is degraded, multiple effects arise: on the boundary structure of the classifier, on its performance on the training data, and on its performance on validation data. We illustrate these effects in the context of metagenomic assembly of short DNA reads arising from one of three genomes, for four classifiers: a naive Bayes classifier, a partition model, a random forest, and a neural net.

In particular, when the quality of the training data can be parameterized, we show the existence of phase transitions where the behavior of the individual classifiers, as well as the congruence among them, changes dramatically. 
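
A compact sketch of the style of experiment described, with synthetic three-class data standing in for the genomic reads and scikit-learn models standing in for the classifiers studied: corrupt training labels at rate p, refit, and watch validation accuracy for abrupt changes as p grows.

```python
# Degrade training labels at rate p and track validation accuracy per classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X, y = make_classification(n_samples=6_000, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

models = {"naive Bayes": GaussianNB(),
          "random forest": RandomForestClassifier(random_state=0),
          "neural net": MLPClassifier(max_iter=500, random_state=0)}

for p in [0.0, 0.2, 0.4, 0.6, 0.8]:                  # label-corruption rate
    y_noisy = y_tr.copy()
    flip = rng.random(len(y_tr)) < p
    y_noisy[flip] = rng.integers(0, 3, flip.sum())   # random relabel (may hit truth)
    accs = {name: clf.fit(X_tr, y_noisy).score(X_va, y_va)
            for name, clf in models.items()}
    print(p, {k: round(v, 3) for k, v in accs.items()})
```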

Keywords

Classifier

Training data

Data quality

Phase transition 

Co-Author

Jeanne Ruane, University of Pennsylvania

First Author

Alan Karr, Temple University

Presenting Author

Alan Karr, Temple University

On finite mixture modeling and model-based clustering of multivariate categorical sequences

Clustering algorithms for quantitative data have been explored extensively in the literature. However, many real-life applications involve qualitative data, and the range of clustering procedures available in this framework is very limited. Categorical sequences have recently attracted the attention of researchers, but several existing methods for the analysis of such data were developed for univariate sequences. Often, however, observations take the form of multivariate categorical sequences, and there is currently a lack of models developed for this framework. Analyzing several univariate sequences separately ignores possible effects of the sequences on each other and poses challenges in agglomerating the obtained results. In this paper, we propose a novel mixture model for multivariate categorical sequences that can effectively model heterogeneity in data and reflect its dynamic nature. As we demonstrate in a series of simulation studies, the developed mixture model shows good model-based clustering performance. The application of the method to the British Household Panel Survey data set produces meaningful results. 
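
To make the modeling idea concrete, here is a toy EM routine for a finite mixture of first-order Markov chains over a single categorical sequence variable, a univariate simplification of the multivariate model proposed in the paper; all sizes and parameter values are illustrative, and every sequence starts in state 0 for simplicity.

```python
# Toy EM for a mixture of first-order Markov chains over categorical sequences.
import numpy as np

rng = np.random.default_rng(5)
K, S, n, T = 2, 3, 300, 20            # components, states, sequences, sequence length

# Simulate: each sequence follows one of K transition matrices
P_true = rng.dirichlet(np.ones(S), size=(K, S))
z_true = rng.integers(K, size=n)
seqs = np.zeros((n, T), dtype=int)
for i in range(n):
    for t in range(1, T):
        seqs[i, t] = rng.choice(S, p=P_true[z_true[i], seqs[i, t - 1]])

pi = np.full(K, 1 / K)                # mixing weights
P = rng.dirichlet(np.ones(S), size=(K, S))
counts = np.zeros((n, S, S))          # per-sequence transition counts
for i in range(n):
    np.add.at(counts[i], (seqs[i, :-1], seqs[i, 1:]), 1)

for it in range(50):
    # E-step: responsibilities from each sequence's log-likelihood per component
    loglik = np.einsum('nab,kab->nk', counts, np.log(P)) + np.log(pi)
    loglik -= loglik.max(axis=1, keepdims=True)
    resp = np.exp(loglik); resp /= resp.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted transition counts -> new parameters
    pi = resp.mean(axis=0)
    P = np.einsum('nk,nab->kab', resp, counts) + 1e-6   # small smoothing
    P /= P.sum(axis=2, keepdims=True)

print("estimated mixing weights:", pi.round(3))
```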

Keywords

EM algorithm

Finite mixture model

Markov model

model-based clustering

multivariate categorical sequences 

Co-Author

Volodymyr Melnykov, University of Alabama

First Author

Yingying Zhang, Western Michigan University

Presenting Author

Yingying Zhang, Western Michigan University