Innovative Statistical and Machine Learning Approaches for Precision Medicine in Diverse Data Sources

Runjia Li, Chair
University of Pittsburgh

Runjia Li, Organizer
University of Pittsburgh
 
Thursday, Aug 7: 8:30 AM - 10:20 AM
0675 
Topic-Contributed Paper Session 
Music City Center 
Room: CC-214 

Applied: Yes

Main Sponsor

Biopharmaceutical Section

Co-Sponsors

Biometrics Section
Health Policy Statistics Section

Presentations

A Bayesian Machine Learning Approach for Estimating Heterogeneous Survivor Causal Effects: Applications to a Critical Care Trial

Assessing heterogeneity in the effects of treatments has become increasingly popular in the field of causal inference and carries important implications for clinical decision-making. While extensive literature exists for studying treatment effect heterogeneity when outcomes are fully observed, there has been limited development in tools for estimating heterogeneous causal effects when patient-centered outcomes are truncated by a terminal event, such as death. Due to mortality occurring during study follow-up, the outcomes of interest are unobservable, undefined, or not fully observed for many participants; in such settings, principal stratification is an appealing framework for drawing valid causal conclusions. Motivated by the Acute Respiratory Distress Syndrome Network (ARDSNetwork) ARDS respiratory management (ARMA) trial, we developed a flexible Bayesian machine learning approach to estimate the average causal effect and heterogeneous causal effects among the always-survivors stratum when clinical outcomes are subject to truncation. We adopted Bayesian additive regression trees (BART) to flexibly specify separate mean models for the potential outcomes and latent stratum membership. In the analysis of the ARMA trial, we found that the low tidal volume treatment had an overall benefit for participants sustaining acute lung injuries on the outcome of time to returning home, but substantial heterogeneity in treatment effects among the always-survivors, driven most strongly by biologic sex and the alveolar-arterial oxygen gradient at baseline (a physiologic measure of lung function and degree of hypoxemia). These findings illustrate how the proposed methodology could guide the prognostic enrichment of future trials in the field.
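As a toy illustration of why the always-survivor stratum matters (this is not the authors' BART implementation, and all parameter values are invented), the sketch below simulates potential survival under a monotonicity assumption and contrasts the survivor average causal effect with a naive survivors-only comparison:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# One baseline covariate drives both survival and the clinical outcome.
x = rng.normal(size=n)

# Potential survival under control / treatment with monotonicity built in
# (treatment never harms survival): the same uniform draw is compared to a
# larger survival probability under treatment, so surviving under control
# implies surviving under treatment.
p0 = 1.0 / (1.0 + np.exp(-0.5 * x))           # P(survive | control)
u = rng.uniform(size=n)
s0 = u < p0
s1 = u < np.minimum(1.0, p0 + 0.2)

# Potential outcomes are defined only for survivors; true effect = 1.
y0 = 2.0 + x + rng.normal(size=n)
y1 = y0 + 1.0

always = s0 & s1                               # always-survivor stratum
sace = (y1[always] - y0[always]).mean()        # survivor average causal effect

# A naive survivors-only contrast mixes principal strata:
naive = y1[s1].mean() - y0[s0].mean()
```

Because the treated survivors include the "protected" stratum, the naive contrast is biased (here, toward the null) even though the individual-level effect is constant, which is the identification problem principal stratification addresses.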

Keywords

causal inference

heterogeneity of treatment effects

intercurrent events

principal stratification

truncation by death

acute lung injury 

Speaker

Fan Li, Yale School of Public Health

Adaptive Seamless Subgroup Enrichment Design and Novel Strategies for Expedited Drug Development

To support expedited drug development that addresses unmet medical needs in heterogeneous patient populations, seamless phase 2/3 designs that make the phase-switching decision based on an early surrogate endpoint are gaining popularity in practice. To also cater to patient subgroups that may benefit more, as identified by predictive biomarkers, it is appealing to incorporate a subgroup enrichment feature into the seamless phase 2/3 design. However, sample size planning for such a complex adaptive design is challenging, as it must strike a balance among shortening the development timeline, mitigating development risks, and accounting for uncertainty related to subgroup effects. To fill this gap, we propose a flexible seamless phase 2/3 design framework with population selection and sample size re-estimation using a surrogate endpoint. We elucidate the patterns of the overall type I error for the proposed adaptive design and propose an easy-to-implement approach to control it. Extensive simulation studies demonstrate the advantages of the proposed design over the fixed-sample design in terms of efficiency, power, and timeline savings.
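A minimal Monte Carlo sketch of the type I error issue described above, under invented design parameters (50% subgroup fraction, equal stage weights; the abstract's actual adjustment may differ): selecting the better-looking population at stage 1 and then pooling stages inflates a naive test, while a Bonferroni-style critical value restores control.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
reps, f, alpha = 200_000, 0.5, 0.025     # f = subgroup fraction (assumed)

# Stage-1 surrogate z-statistics under the global null; the full-population
# statistic shares patients with the subgroup, hence the correlation.
z1_sub = rng.normal(size=reps)
z1_full = np.sqrt(f) * z1_sub + np.sqrt(1 - f) * rng.normal(size=reps)

# Population selection: carry forward whichever looks better at stage 1.
z1_sel = np.maximum(z1_sub, z1_full)

# Stage 2 collects independent data; pool the stages with equal weights.
z_final = (z1_sel + rng.normal(size=reps)) / np.sqrt(2)

c_naive = NormalDist().inv_cdf(1 - alpha)      # ignores the selection
c_bonf = NormalDist().inv_cdf(1 - alpha / 2)   # adjusts for 2 hypotheses
naive_rate = np.mean(z_final > c_naive)        # exceeds nominal 2.5%
bonf_rate = np.mean(z_final > c_bonf)          # conservative but controlled
```

The inflation arises because the pooled final statistic inherits the stage-1 selection; testing only independent stage-2 data would avoid it but waste the stage-1 information.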

Keywords

Adaptive design

Subgroup enrichment

Heterogeneous treatment effect

Speaker

Liwen Wu, Takeda Pharmaceuticals

Empirical Assessments for Cost and Benefit of Cancer Screening with Multi-state Models for Semi-competing Risks Data

Health-care policy makers are often interested in the cost-effectiveness of an intervention. Effectiveness is usually measured by quality-adjusted life years, which are subject to informative censoring; both effectiveness and costs are often assessed from large-scale observational studies and databases (e.g., claims data, large cohort studies) and are thus susceptible to confounding. A rich literature is available to accommodate censoring and adjust for confounding factors. However, most cost-effectiveness studies are primarily concerned with the terminal event rather than the entire disease progression. Motivated by informing the optimal initial screening age for colorectal cancer (CRC) through cost-effectiveness analysis, we provide a unified measure of cost-effectiveness with semi-competing risks and multistate modeling, which allows us to gain insight into benefit and cost at each stage of cancer progression. Unlike most existing causal inference work focusing on static interventions, we develop a causal framework and estimation procedure to evaluate cost-effectiveness as a function of a time-varying screening strategy. These methods are justified theoretically and numerically using both simulation and the CRC data from the Women's Health Initiative observational study.
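As a self-contained toy (invented transition probabilities, utilities, and costs; no discounting, censoring, or confounding adjustment), a deterministic multi-state cohort model can show how the screening start age shifts quality-adjusted life years and costs across disease stages:

```python
import numpy as np

# States: 0 = healthy, 1 = screen-detected (early) cancer,
#         2 = late cancer, 3 = dead.  All numbers are illustrative.
UTIL = np.array([1.0, 0.8, 0.6, 0.0])            # QALY weight per state-year
CARE = np.array([0.0, 5_000.0, 20_000.0, 0.0])   # annual care cost
SCREEN_COST = 100.0

def run(screen_start, ages=range(50, 90)):
    """Deterministic cohort model: returns (QALYs, cost) per person."""
    p = np.array([1.0, 0.0, 0.0, 0.0])           # everyone healthy at 50
    qaly = cost = 0.0
    for age in ages:
        screened = age >= screen_start
        qaly += p @ UTIL
        cost += p @ CARE + (SCREEN_COST * p[0] if screened else 0.0)
        # Annual transitions; screening routes new cancers to the
        # early-detected state with lower mortality.
        new_cancer = 0.01 * p[0]
        nxt = np.zeros(4)
        nxt[0] = p[0] - new_cancer - 0.005 * p[0]
        nxt[1] = p[1] * (1 - 0.05) + (new_cancer if screened else 0.0)
        nxt[2] = p[2] * (1 - 0.12) + (0.0 if screened else new_cancer)
        nxt[3] = 1.0 - nxt[0] - nxt[1] - nxt[2]
        p = nxt
    return qaly, cost

q50, c50 = run(50)
q60, c60 = run(60)
d_qaly, d_cost = q50 - q60, c50 - c60
icer = d_cost / d_qaly    # incremental cost per QALY, screening at 50 vs 60
```

Earlier screening gains QALYs here because the cancers incident between 50 and 59 are caught in the early state; the semi-competing-risks machinery in the abstract is what makes the analogous comparison valid with real, censored, confounded data.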

Speaker

Yi Xiong, University at Buffalo

Machine Learning Integration of Longitudinal Clinical and High-Dimensional Omics Data for Disease Subtype Identification

Background: Identifying latent subgroups in heterogeneous populations is key to understanding disease mechanisms and advancing precision medicine. Although high-dimensional omics and longitudinal clinical data provide rich phenotypic and molecular insights, few methods jointly model outcome dynamics and molecular heterogeneity. We introduce TPClust, a supervised generative subtyping model that integrates longitudinal outcomes with high-dimensional molecular data, flexibly accounting for time-varying and static covariates.

Methods: TPClust models covariate effects as smooth functions of time via nonparametric splines and applies structured regularization—sparse group and exclusive lasso—for robust subtype-specific feature selection. Inference uses a scalable variational EM algorithm with bootstrap-based confidence intervals. We applied TPClust to 1,020 adults from the Religious Orders Study and Memory and Aging Project (ROSMAP), integrating longitudinal cognitive trajectories with postmortem prefrontal cortex transcriptomics in Alzheimer's Disease (AD). Analyses adjusted for sex, APOE ε4, and vascular risk factors. We estimated subtype-specific time-varying effects and examined differences in neuropathology, proteomic, and epigenomic markers. Simulation studies evaluated model accuracy.

Results: TPClust uncovered four distinct aging subtypes: Resilient (n=642), Late-Onset Decline (n=102), Early Vulnerability (n=76), and Rapid Decline (n=200). Resilient individuals maintained high cognition and low pathology with preserved synaptic and mitochondrial function. Late-Onset Decline remained stable until age 85, then exhibited accelerated decline among individuals with APOE ε4, diabetes, and stroke, accompanied by a moderate pathological burden. Early Vulnerability showed an earlier, steeper decline after age 84 and greater vulnerability associated with stroke, frailty, and male sex, along with reduced neuronal resilience and elevated stress-response markers. Rapid Decline exhibited the earliest deterioration (starting ~age 73), highest dementia risk (87% by age 85), and greatest burden of amyloid, tau, TDP-43, and vascular pathology, alongside broad vulnerability to genetic and vascular factors and dysregulation of tau transcription, blood–brain barrier integrity, and inflammation. Simulation studies confirmed TPClust's accuracy in subtyping, time-varying inference, and high-dimensional feature selection.

Conclusions: TPClust offers a robust framework for outcome-guided subtyping in longitudinal clinical and molecular data. It reveals distinct cognitive and mechanistic profiles among aging and AD subtypes, advancing biomarker discovery, disease stratification, and precision medicine strategies.
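The core of such outcome-guided subtyping can be illustrated with a much-simplified EM algorithm for a two-component mixture of linear cognitive trajectories on synthetic data (TPClust itself adds nonparametric splines, high-dimensional omics, structured regularization, and variational inference):

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 300, 6
t = np.arange(T, dtype=float)

# Two latent subtypes with different outcome trajectories (slopes).
true_z = rng.integers(0, 2, size=n)
slopes = np.where(true_z == 0, -0.1, -1.0)
y = 10.0 + slopes[:, None] * t + rng.normal(0, 0.5, size=(n, T))

# EM for a 2-component mixture of linear trajectory models.
K = 2
beta = np.array([[10.0, 0.0], [10.0, -0.5]])   # [intercept, slope] init
pi = np.full(K, 0.5)
sigma = 1.0
X = np.column_stack([np.ones(T), t])           # design matrix, (T, 2)

for _ in range(100):
    # E-step: subject-level responsibilities from trajectory likelihoods.
    mu = beta @ X.T                            # (K, T) mean curves
    ll = np.stack(
        [-0.5 * np.sum((y - mu[k]) ** 2, axis=1) / sigma**2 for k in range(K)],
        axis=1,
    ) + np.log(pi)
    ll -= ll.max(axis=1, keepdims=True)
    r = np.exp(ll)
    r /= r.sum(axis=1, keepdims=True)          # (n, K)

    # M-step: weighted least squares for each component's curve.
    for k in range(K):
        w = np.repeat(r[:, k], T)
        Xs = np.tile(X, (n, 1))
        WX = Xs * w[:, None]
        beta[k] = np.linalg.solve(Xs.T @ WX, WX.T @ y.ravel())
    pi = r.mean(axis=0)
    resid = y - (beta @ X.T)[np.argmax(r, axis=1)]
    sigma = np.sqrt(np.mean(resid**2))

z_hat = np.argmax(r, axis=1)                   # recovered subtype labels
```

Replacing the linear design matrix with a spline basis gives smooth time-varying effects of the kind described above; the supervised element is that cluster membership is driven by the outcome trajectories rather than the molecular data alone.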
 

Keywords

Disease subtyping

Integrative approach

Longitudinal clinical data

High-dimensional omics data 

Co-Author(s)

Boyi Hu, Columbia University
Badri Vardarajan, Columbia University
Philip De Jager, Columbia University
David Bennett, Rush Alzheimer Disease Center
Yuanjia Wang, Columbia University

Speaker

Annie Lee, Columbia University Irving Medical Center

A Bayesian Finite Mixture Model Approach for Clustering Correlated Mixed-type Variables and Censored Biomarkers

Clustering mixed-type data is a major challenge in biopharmaceutical research, particularly for phenotyping complex diseases where patient heterogeneity complicates treatment. Existing methods often assume local independence or fail to handle high-dimensional datasets with correlated continuous and categorical variables and censored biomarkers. We propose a Bayesian finite mixture model (BFMM) that integrates flexible dependence structures, spike-and-slab priors for variable importance, and a specialized Gibbs sampling step for imputing censored biomarkers. BFMM enables stable clustering and provides interpretable importance weights for both variable types, offering insights into cluster assignments. Simulations show BFMM outperforms existing methods, particularly for correlated data with varying censoring levels. Application to real-world datasets further validates its effectiveness. Our findings underscore BFMM's potential as a robust, interpretable tool for biomedical data analysis, with implications for precision medicine and targeted interventions. 
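A stripped-down sketch of the censored-biomarker Gibbs step (one continuous biomarker, two components, known unit variance, and equal weights — far simpler than the proposed BFMM, and all values invented): biomarkers recorded at the limit of detection are re-imputed from a truncated normal at each iteration.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(3)
nd = NormalDist()

# Simulate a biomarker from a 2-component mixture; values below the
# limit of detection (LOD) are left-censored.
n, lod = 2000, -0.5
z_true = rng.integers(0, 2, size=n)
x_true = rng.normal(np.where(z_true == 0, 0.0, 3.0), 1.0)
cens = x_true < lod
x = np.where(cens, lod, x_true)      # censored values recorded at the LOD

mu = np.array([-1.0, 4.0])           # initial component means
draws = []
for it in range(600):
    # Sample cluster labels; censored points contribute P(X < LOD | k)
    # rather than a density value.
    logp = np.empty((n, 2))
    for k in (0, 1):
        dens = -0.5 * (x - mu[k]) ** 2
        tail = np.log(nd.cdf(lod - mu[k]) + 1e-300)
        logp[:, k] = np.where(cens, tail, dens)
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    z = (rng.uniform(size=n) < p[:, 1] / p.sum(axis=1)).astype(int)

    # Gibbs imputation step: redraw each censored biomarker from a
    # truncated normal below the LOD, given its current component.
    xi = x.copy()
    for i in np.where(cens)[0]:
        a = nd.cdf(lod - mu[z[i]])
        xi[i] = mu[z[i]] + nd.inv_cdf(max(rng.uniform(0.0, a), 1e-300))

    # Conjugate update of each mean (sigma = 1, N(0, 10^2) prior).
    for k in (0, 1):
        xs = xi[z == k]
        prec = len(xs) + 0.01
        mu[k] = rng.normal(xs.sum() / prec, 1.0 / np.sqrt(prec))
    if it >= 200:
        draws.append(mu.copy())

post = np.mean(draws, axis=0)        # posterior mean of the component means
```

Treating censored values as latent and re-imputing them keeps the downstream conjugate updates exact; the full BFMM extends this with mixed-type dependence structures and spike-and-slab variable-importance priors.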

Co-Author(s)

Yueting Wang, University of Pittsburgh
Shu Wang, FDA
Jonathan Yabes

Speaker

Chung-Chou Chang, University of Pittsburgh