Machine Learning Integration of Longitudinal Clinical and High-Dimensional Omics Data for Disease Subtype Identification

Boyi Hu Co-Author
Columbia University
 
Badri Vardarajan Co-Author
Columbia University
 
Philip De Jager Co-Author
Columbia University
 
David Bennett Co-Author
Rush Alzheimer Disease Center
 
Yuanjia Wang Co-Author
Columbia University
 
Annie Lee Speaker
Columbia University Irving Medical Center
 
Thursday, Aug 7: 9:35 AM - 9:55 AM
Topic-Contributed Paper Session 
Music City Center 
Background: Identifying latent subgroups in heterogeneous populations is key to understanding disease mechanisms and advancing precision medicine. Although high-dimensional omics and longitudinal clinical data provide rich phenotypic and molecular insights, few methods jointly model outcome dynamics and molecular heterogeneity. We introduce TPClust, a supervised generative subtyping model that integrates longitudinal outcomes with high-dimensional molecular data, flexibly accounting for time-varying and static covariates.

Methods: TPClust models covariate effects as smooth functions of time via nonparametric splines and applies structured regularization (sparse group lasso and exclusive lasso) for robust subtype-specific feature selection. Inference uses a scalable variational EM algorithm with bootstrap-based confidence intervals. We applied TPClust to 1,020 adults from the Religious Orders Study and Memory and Aging Project (ROSMAP), integrating longitudinal cognitive trajectories with postmortem prefrontal cortex transcriptomics in Alzheimer's disease (AD). Analyses adjusted for sex, APOE ε4, and vascular risk factors. We estimated subtype-specific time-varying effects and examined differences in neuropathological, proteomic, and epigenomic markers. Simulation studies evaluated model accuracy.
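
As a rough illustration of the two penalties named above, the following Python sketch implements a proximal step for the sparse group lasso and evaluates an exclusive lasso penalty. It is not the TPClust implementation: the function names, the mixing parameter `alpha`, the square-root group-size weights, and the toy coefficients are all assumptions chosen for illustration, following the textbook forms of these penalties.

```python
import numpy as np


def soft_threshold(x, t):
    """Elementwise soft-thresholding: shrink each entry toward zero by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def prox_sparse_group_lasso(beta, groups, lam, alpha, step=1.0):
    """Proximal operator of the (assumed) sparse group lasso penalty
        lam * (alpha * ||beta||_1 + (1 - alpha) * sum_g sqrt(p_g) * ||beta_g||_2).
    The prox is the composition of an elementwise soft-threshold and a
    group-wise shrinkage of the Euclidean norm.
    """
    out = soft_threshold(np.asarray(beta, dtype=float), step * lam * alpha)
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(out[idx])
        thresh = step * lam * (1.0 - alpha) * np.sqrt(idx.sum())
        # Zero out the whole group when its norm falls below the threshold.
        scale = max(0.0, 1.0 - thresh / norm) if norm > 0 else 0.0
        out[idx] = scale * out[idx]
    return out


def exclusive_lasso_penalty(beta, groups):
    """Exclusive lasso penalty: sum over groups of the squared L1 norm.
    It encourages sparsity within each group rather than across groups.
    """
    beta = np.asarray(beta, dtype=float)
    return float(sum(np.abs(beta[groups == g]).sum() ** 2
                     for g in np.unique(groups)))


# Toy example (hypothetical values): four features in two groups.
groups = np.array([0, 0, 1, 1])
beta = np.array([3.0, 0.1, -0.2, 0.05])
shrunk = prox_sparse_group_lasso(beta, groups, lam=1.0, alpha=0.5)
penalty = exclusive_lasso_penalty(np.array([1.0, -2.0, 3.0, 0.0]), groups)
print(shrunk, penalty)
```

In this toy case the weak second group is removed entirely while the dominant coefficient in the first group is shrunk but retained. That is the behavior that makes such penalties useful for selecting features among many candidates.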

Results: TPClust uncovered four distinct aging subtypes: Resilient (n=642), Late-Onset Decline (n=102), Early Vulnerability (n=76), and Rapid Decline (n=200). Resilient individuals maintained high cognition and low pathology with preserved synaptic and mitochondrial function. Late-Onset Decline remained stable until age 85, then exhibited accelerated decline among individuals with APOE ε4, diabetes, and stroke, accompanied by a moderate pathological burden. Early Vulnerability showed an earlier, steeper decline after age 84 and greater vulnerability associated with stroke, frailty, and male sex, along with reduced neuronal resilience and elevated stress-response markers. Rapid Decline exhibited the earliest deterioration (starting at approximately age 73), the highest dementia risk (87% by age 85), and the greatest burden of amyloid, tau, TDP-43, and vascular pathology, alongside broad vulnerability to genetic and vascular factors and dysregulation of tau transcription, blood–brain barrier integrity, and inflammation. Simulation studies confirmed TPClust's accuracy in subtyping, time-varying inference, and high-dimensional feature selection.

Conclusions: TPClust offers a robust framework for outcome-guided subtyping that integrates longitudinal clinical and molecular data. It reveals distinct cognitive and mechanistic profiles among aging and AD subtypes, advancing biomarker discovery, disease stratification, and precision medicine strategies.

Keywords

Disease subtyping

Integrative approach

Longitudinal clinical data

High-dimensional omics data