Wednesday, Aug 6: 8:30 AM - 10:20 AM
0422
Invited Paper Session
Music City Center
Room: CC-207B
Applied
Yes
Main Sponsor
Health Policy Statistics Section
Co Sponsors
Biometrics Section
Section on Statistics in Defense and National Security
Presentations
The increasing availability of wearable sensor data presents new opportunities for population-scale risk prediction, which is particularly difficult for rare diseases. In this study, we leverage UK Biobank data (42,157 controls, 157 cases; 264,988 person-years of follow-up) to compare 12 traditional risk factors and 8 accelerometer-based physical measures for predicting incident Alzheimer's disease diagnoses in adults 65 years and older.
Cox proportional hazards models and a robust, data-driven variable selection procedure revealed that moderate-to-vigorous physical activity (MVPA) was the most informative modifiable predictor, improving model concordance beyond age and comorbidities such as diabetes. Additional sensitivity analyses were conducted to account for potential reverse causality. Results suggest that each 14.5-minute increase in daily MVPA was associated with a substantially lower hazard of Alzheimer's disease diagnosis (p = 0.0001), comparable to a two-year reduction in age.
While grounded in classical survival analysis, this methodology reflects core elements of AI frameworks, including algorithmic model selection and prediction-oriented evaluation. These approaches highlight how statistical methods contribute to the development of scalable, data-driven tools for disease monitoring and prevention.
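As a rough illustration of the "model concordance" criterion mentioned in this abstract, the sketch below computes Harrell's concordance index for right-censored data. It is a generic textbook version, not the study's code: the function name `harrell_c` and its arguments are hypothetical, and pairs with tied observed times are simply skipped for brevity.

```python
from itertools import combinations


def harrell_c(times, events, risk_scores):
    """Harrell's concordance index for right-censored data (illustrative sketch).

    times       : observed follow-up times
    events      : 1 if the event was observed, 0 if censored
    risk_scores : model output where a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Skip tied times, and pairs whose earlier time is censored,
        # because the true ordering of events is then unknown.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # earlier event had the higher predicted risk
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the gain from adding a predictor such as MVPA can be read as the change in this index between two fitted models.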
Accurate prediction of recurrent clinical events is crucial for effective management of chronic conditions such as cancer and cardiovascular disease. In recent years, longitudinal health informatics databases, which routinely collect data on repeated clinical events, have been increasingly utilized to construct risk prediction models. We introduce a novel nonparametric framework to predict recurrent events on a gap time scale using survival tree ensembles. Our framework incorporates two predictive modeling strategies: an episode-specific model and a global model. These models avoid strong assumptions on how future event risk depends on previous event history and other predictors, making them a promising alternative to Cox-type models. Additional complexities in tree-based prediction for recurrent events include induced informative censoring of gap times and inter-event correlations. We develop algorithms to address these issues through the use of inverse probability of censoring weighting and modified resampling procedures. Applied to SEER-Medicare data to predict repeated hospitalizations for breast cancer patients, our models showed superior performance. In particular, borrowing information across events via global models substantially improved prediction accuracy for later hospitalizations.
Keywords
Dynamic risk prediction
Random forests
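The recurrent-event abstract above relies on inverse probability of censoring weighting (IPCW). The sketch below shows the basic idea on a single time scale only: each observed event is weighted by the inverse of a Kaplan-Meier estimate of the probability of remaining uncensored just before its event time. It does not reproduce the authors' gap-time weights or resampling scheme, and the function name `ipcw_weights` is a hypothetical choice for illustration.

```python
import numpy as np


def ipcw_weights(times, events):
    """Generic IPCW weights: event_i / G(T_i-), with G a Kaplan-Meier
    estimate of the censoring survival function (illustrative sketch)."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    censored = 1 - events
    weights = np.zeros(len(times))
    for i in range(len(times)):
        if events[i] == 0:
            continue  # censored observations receive weight zero
        g = 1.0
        # Product-limit estimate over distinct censoring times strictly before T_i.
        earlier_censoring = np.unique(times[(censored == 1) & (times < times[i])])
        for t in earlier_censoring:
            at_risk = np.sum(times >= t)
            n_censored = np.sum((times == t) & (censored == 1))
            g *= 1.0 - n_censored / at_risk
        weights[i] = 1.0 / g
    return weights
```

Observed events that occur after heavy censoring receive larger weights, so they stand in for comparable subjects who were lost to follow-up.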
Studying adverse healthcare events caused by diagnostic errors or delays, which are both common and costly, presents a significant opportunity to reduce preventable harm and is therefore vital for healthcare improvement. Traditional efforts such as chart reviews are labor-intensive and do not scale well. To enhance diagnostic performance monitoring and identify improvement areas more efficiently, researchers suggest using electronic health records or claims data to analyze the relationship between symptoms and diseases. Specifically, tracking elevated disease risk after a false-negative diagnosis can help signal potential harm. We introduce a mixture regression model along with harm measures and profiling analysis procedures designed to quantify, evaluate, and compare misdiagnosis-related harm across medical institutions with varying patient population compositions.
Keywords
EHR
diagnostic errors
mixture models
Electronic Health Records (EHR) offer rich real-world data for personalized medicine, providing insights into disease progression, treatment responses, and patient outcomes. However, their sparsity, heterogeneity, and high dimensionality make them difficult to model, while the lack of standardized ground truth further complicates predictive modeling. To address these challenges, we propose SCORE, a semi-supervised representation learning framework that captures multi-domain disease profiles through patient embeddings. SCORE employs a Poisson-Adapted Latent factor Mixture (PALM) Model with pre-trained code embeddings to characterize codified features and extract meaningful patient phenotypes and embeddings. To handle the computational challenges of large-scale data, it introduces a hybrid Expectation-Maximization (EM) and Gaussian Variational Approximation (GVA) algorithm, leveraging limited labeled data to refine estimates on a vast pool of unlabeled samples. We theoretically establish the convergence of this hybrid approach, quantify GVA errors, and derive SCORE's error rate under diverging embedding dimensions. Our analysis shows that incorporating unlabeled data enhances accuracy and reduces sensitivity to label scarcity. Extensive simulations confirm SCORE's superior finite-sample performance over existing methods. Finally, we apply SCORE to predict disability status for patients with multiple sclerosis (MS) using partially labeled EHR data, demonstrating that it produces more informative and predictive patient embeddings for multiple MS-related conditions compared to existing approaches.
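For readers unfamiliar with EM for count data, the sketch below fits a two-component Poisson mixture to one-dimensional counts. It is a deliberately simplified stand-in: it is not the PALM model, it uses no code embeddings or labeled data, and it omits the Gaussian variational approximation described in the abstract. The function name `poisson_mixture_em` and the quantile-based initialization are assumptions made for illustration.

```python
import numpy as np
from math import lgamma


def poisson_mixture_em(counts, n_iter=200):
    """EM for a two-component Poisson mixture (illustrative sketch only)."""
    x = np.asarray(counts, dtype=float)
    log_fact = np.array([lgamma(v + 1.0) for v in x])  # log(x!)
    # Assumed initialization: component rates start near the data quartiles.
    rates = np.array([np.quantile(x, 0.25) + 0.5, np.quantile(x, 0.75) + 0.5])
    mix = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each count,
        # computed on the log scale for numerical stability.
        log_p = (
            np.log(mix)[None, :]
            + x[:, None] * np.log(rates)[None, :]
            - rates[None, :]
            - log_fact[:, None]
        )
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update mixing proportions and responsibility-weighted mean rates.
        mix = resp.mean(axis=0)
        rates = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mix, rates
```

In this toy setting the posterior responsibilities play the role of soft phenotype assignments; the framework in the abstract replaces the scalar rates with latent factors and embeddings.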