Risk Scores and Polygenic Modeling of Genomic and Genetic Data

Tabitha Peter, Chair
University of Iowa
 
Sunday, Aug 3: 4:00 PM - 5:50 PM
4025 
Contributed Papers 
Music City Center 
Room: CC-209C 

Main Sponsor

Section on Statistics in Genomics and Genetics

Presentations

Modeling continuous genetic ancestry to improve risk prediction across diverse populations

Polygenic risk scores are widely used in disease risk stratification, but their accuracy varies across diverse populations. Recent methods leverage large-scale multi-ancestry data to improve accuracy in under-represented populations but require labelling individuals by ancestry for prediction. This poses challenges for practical use, as clinical practices are typically not based on ancestry. We propose SPLENDID, a novel penalized regression framework for diverse biobank-scale data. Our method utilizes ancestry principal component interactions to model genetic ancestry as a continuum within a single prediction model for all ancestries, eliminating the need for discrete labels. In extensive simulations and analyses of 9 traits from the All of Us Research Program (N=224,364) and UK Biobank (N=340,140), SPLENDID significantly outperformed existing methods in prediction accuracy and model sparsity. By directly modeling continuous genetic ancestry, SPLENDID stands as a valuable tool for robust risk prediction across diverse populations and fairer clinical implementation.
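
As a rough illustration of what modeling ancestry as a continuum can look like, the Python sketch below lets each SNP effect vary linearly with an individual's ancestry principal components. The function name, the parameter layout (`beta`, `gamma`), and the toy numbers are assumptions made for illustration only; the abstract does not state SPLENDID's exact parameterization or penalty, and in practice the coefficients would come from a penalized fit.

```python
def ancestry_adjusted_score(genotypes, pcs, beta, gamma):
    """Score one individual with SNP main effects plus SNP x PC interactions.

    genotypes: allele counts per SNP, length J
    pcs:       ancestry principal components for this individual, length K
    beta:      main effect per SNP, length J (hypothetical fitted values)
    gamma:     J x K nested list of SNP-by-PC interaction effects (hypothetical)
    """
    score = 0.0
    for j, g in enumerate(genotypes):
        # The effective SNP effect shifts smoothly with position in PC space,
        # so no discrete ancestry label is needed at prediction time.
        effect = beta[j] + sum(gamma[j][k] * pc for k, pc in enumerate(pcs))
        score += g * effect
    return score


# Toy example with made-up numbers: two SNPs, one ancestry PC.
example = ancestry_adjusted_score(
    genotypes=[1, 2], pcs=[0.5], beta=[0.1, 0.2], gamma=[[0.2], [0.0]]
)
```

Setting every entry of `gamma` to zero recovers an ordinary single-ancestry score, which is one way to see how sparsity in the interaction terms would control how much the model adapts to ancestry.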

Keywords

polygenic risk scores

genetic ancestry

penalized regression

All of Us

UK Biobank

genetic interactions 

Co-Author(s)

Haoyu Zhang, National Cancer Institute
Rahul Mazumder, Massachusetts Institute of Technology
Xihong Lin, Harvard T.H. Chan School of Public Health

First Author

Tony Chen, Harvard University

Presenting Author

Tony Chen, Harvard University

A Poly-Adaptive Metric (PAM) model for multi-objective variable selection in predictive modeling

High-dimensional datasets require effective variable selection to reduce the search space for downstream predictive analyses. Traditionally, statistical significance has been the primary criterion for variable selection. However, statistical significance does not necessarily reflect a variable's predictive utility. Moreover, multiple predictive metrics evaluate different aspects of predictability, yet no existing framework systematically integrates variable selection across these diverse criteria. Here, we propose the Poly-Adaptive Metric (PAM) model, a multi-objective ensemble approach that combines statistical significance with predictive performance metrics to optimize variable selection. The PAM model quantifies the reliability of each selection criterion to construct a unified variable importance matrix that guides the ensembling of different selection strategies. We applied the PAM model to the UK Biobank to construct polygenic risk scores (PRS) and compared its predictive performance against PRS generated using conventional p-value-based SNP selection. Our results demonstrate that the PAM model consistently outperforms p-value-based selection across multiple evaluation metrics.

Keywords

variable selection

ensemble learning

multi-objective

predictive modeling

polygenic risk score 

Co-Author(s)

Ruowang Li, Cedars-Sinai Medical Center
Attri Ghosh, Cedars-Sinai Medical Center

First Author

Raelynn Chen, Cedars-Sinai Medical Center

Presenting Author

Ruowang Li, Cedars-Sinai Medical Center

Modeling and predicting single-cell multi-gene perturbation responses with scLAMBDA

While single-cell RNA-sequencing has facilitated profiling of heterogeneous transcriptional responses to genetic perturbations at the single-cell level, there remains a pressing need for computational models that can decode the mechanisms driving these responses and accurately predict outcomes to prioritize target genes for experimental design. We present scLAMBDA, a deep generative learning framework designed to model and predict single-cell transcriptional responses to genetic perturbations. By leveraging gene embeddings derived from large language models, scLAMBDA effectively integrates prior biological knowledge and disentangles basal cell states from perturbation-specific salient representations. Through comprehensive evaluations on multiple datasets, scLAMBDA consistently outperformed other methods in predicting perturbation outcomes. It demonstrated robust generalization to unseen target genes and perturbations, capturing both average expression changes and the heterogeneity of single-cell responses. Furthermore, its predictions enable diverse downstream analyses, including the identification of differentially expressed genes and the exploration of genetic interactions. 

Keywords

single-cell RNA-sequencing

genetic perturbation

deep learning 

Co-Author(s)

Tianyu Liu
Jia Zhao, Yale University
Youshu Cheng, Yale University
Hongyu Zhao, Yale University

First Author

Gefei Wang, Yale University

Presenting Author

Gefei Wang, Yale University

Efficient implementation of cumulative probability models for association studies

Association studies, such as genome-wide association studies (GWAS), for continuous outcomes are commonly conducted using standard linear regression models. However, these biological outcomes are frequently skewed, necessitating transformations (e.g., log transformations) that are often not known in advance. This dependency on transformations can lead to variability in results and interpretations. Recently, cumulative probability models (CPMs) have emerged as a semi-parametric alternative to linear models. CPMs treat continuous outcomes as ordered categories, assigning each unique value as a category, and utilize cumulative link models for estimation. While existing algorithms for association analyses with ordinal outcomes are efficient and scalable, they struggle to handle a very large number of outcome categories. To address this, we leverage the CPM's sparse Hessian structure to develop an efficient score test algorithm, making association studies for continuous outcomes with CPMs computationally feasible. We demonstrate the algorithm's effectiveness using a large-scale omics dataset.
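
The outcome coding described above can be sketched in a few lines of Python. This only illustrates how a continuous outcome becomes an ordinal one with one category per unique value; the function name is hypothetical, and the sketch does not attempt the score test or the sparse Hessian computation.

```python
def to_ordered_categories(values):
    """Map each observation to the 1-based rank of its unique value.

    Returns the category codes and the number of distinct categories.
    """
    levels = sorted(set(values))
    rank_of = {value: rank for rank, value in enumerate(levels, start=1)}
    return [rank_of[value] for value in values], len(levels)


# Toy outcome with one tie: three distinct values give three categories.
codes, n_categories = to_ordered_categories([2.5, 1.0, 2.5, 7.2])
```

Because the codes depend only on the ordering of the outcome, any monotone transformation (such as a log) yields the same categories. With a nearly tie-free continuous outcome, the number of categories approaches the sample size, which is the scaling problem the abstract describes.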

Keywords

ordinal regression

genome-wide association study

set-based testing

cumulative link models 

Co-Author

Chun Li, Case Western Reserve University

First Author

Eric Kawaguchi

Presenting Author

Eric Kawaguchi

Longitudinal Bayes Omics Regression Discovery

Clinicians are increasingly interested in discovering computational biomarkers from short-term longitudinal omics datasets. This work focuses on Bayesian regression and variable selection for longitudinal omics datasets, which can quantify uncertainty and control false discovery. In both our univariate and multivariate approaches, we use the first-difference scale of the longitudinal predictors and the response. In the univariate approach, Zellner's g prior is used with two options for the tuning parameter g: g = sqrt(n) and a g that minimizes Stein's unbiased risk estimate (SURE). Bayes factors are used to quantify uncertainty and control false discovery. In the multivariate approach, we use the Bayesian group LASSO with a spike-and-slab prior for group variable selection. We compare our methods against commonly used linear mixed-effects models on simulated data and on real data from a pulmonary tuberculosis study on metabolite biomarker selection. With automated selection of hyperparameters, the Zellner's g prior and multivariate Bayesian group LASSO spike-and-slab approaches correctly identify target metabolites with high specificity and sensitivity across various simulation and real-data scenarios.
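
To make the univariate step more concrete, here is a minimal Python sketch that first-differences a predictor and a response, then evaluates the standard closed-form Bayes factor of a linear model against the intercept-only model under Zellner's g prior, with g = sqrt(n) by default. Whether the authors use exactly this form is an assumption, as are the function names and the treatment of the differenced values from a single series as the n observations; the SURE-based choice of g is not shown.

```python
import math


def first_differences(series):
    """Successive differences x[t+1] - x[t] of one longitudinal series."""
    return [b - a for a, b in zip(series, series[1:])]


def r_squared(x, y):
    """Squared correlation of x and y (both must be non-constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)


def log_bayes_factor_g_prior(r2, n, p=1, g=None):
    """Log Bayes factor versus the intercept-only model under a g prior.

    Uses BF = (1 + g)^((n - 1 - p) / 2) / (1 + g * (1 - r2))^((n - 1) / 2),
    with g defaulting to sqrt(n).
    """
    if g is None:
        g = math.sqrt(n)
    return 0.5 * (n - 1 - p) * math.log1p(g) - 0.5 * (n - 1) * math.log1p(g * (1.0 - r2))
```

A typical call would difference a metabolite series and the response, compute `r_squared` on the differences, and pass the result with the number of differenced observations to `log_bayes_factor_g_prior`. Positive values favor including the predictor.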

Keywords

Disease Progression

Feature Selection

Mixed Models

Bayesian Group Lasso

Uncertainty Quantification

Zellner's g-prior

Co-Author(s)

Martin Wells, Cornell University
Sumanta Basu, Cornell University
Myung Hee Lee, Weill Cornell Medicine

First Author

Livia Popa

Presenting Author

Livia Popa

Improved Prognostic Survival Models Using High Dimensional Gene Expression Data

Recent genetic, epigenetic, and transcriptomic analyses have stratified medulloblastoma (MB) into four subgroups, Wingless Type (WNT), Sonic Hedgehog (SHH), Group 3, and Group 4, each with distinct patient profiles and prognoses. Using a dataset of over ten thousand gene expression profiles, this study explores which genes can improve prognostic accuracy for survival while accounting for molecular stratification and known clinical covariates. The approach involves a Benjamini-Hochberg screening of all genes, adjusted for molecular stratification and clinical covariates, followed by several high-dimensional models to predict survival. A case study of 483 pediatric MB patients demonstrated improved prognostic performance, as assessed by the C-index, Brier scores, and time-dependent AUCs, when gene expression data were included. Simulation studies validated the method's performance, successfully excluding non-informative genes in null scenarios and reliably identifying influential genes in alternative scenarios. This approach provides a robust framework for enhancing survival prediction and uncovering biologically significant markers.
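
The screening step relies on the Benjamini-Hochberg procedure, which can be sketched in plain Python as below. The function name and the default FDR level of 0.05 are illustrative assumptions. The per-gene p-values would come from models adjusted for subgroup and clinical covariates, which this sketch does not fit.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where a hypothesis is rejected.

    Step-up rule: find the largest rank k such that the k-th smallest
    p-value is at most (k / m) * alpha, then reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected
```

Note the step-up behavior: a p-value can be rejected even if it misses its own threshold, provided a larger p-value further down the sorted list meets its threshold.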

Keywords

High-dimensional genomic data

Medulloblastoma molecular stratification

prognostic models

variable selection

survival analysis 

Co-Author(s)

Shuoyang Wang, University of Louisville
Akshitkumar Mistry, UofL Health – Brown Cancer Center, University of Louisville, Louisville, KY
Howard Donninger, UofL Health – Brown Cancer Center, University of Louisville, Louisville, KY
Kavitha Yaddanapudi, UofL Health – Brown Cancer Center, University of Louisville, Louisville, KY
Maiying Kong, University of Louisville
Mst Sharmin Akter Sumy, Department of Bioinformatics and Biostatistics, SPHIS, University of Louisville, Louisville, KY

First Author

Tyler Jones

Presenting Author

Tyler Jones

Risk Score Prediction Model for Treatment Response in SLE Nephritis

This study quantifies the predictive value of clinical, genetic, and molecular data in forecasting treatment responses in Systemic Lupus Erythematosus (SLE) nephritis through a data-agnostic, statistically rigorous modeling approach that integrates traditional regression-based methods and machine learning techniques. While existing risk models are limited and often lack thorough validation, this study expands upon them by incorporating a genetic-based risk score, allowing for a direct comparison of its contribution to predictive performance relative to a base model without genetic features. Model performance has been assessed using the corrected C-index and B-score across different specifications and covariate selections, with subgroup analyses evaluating variations in predictive accuracy. Internal validation was performed via bootstrapping, while external validation is ongoing with multicenter datasets. Multiple imputation techniques have addressed missing data, enhancing the robustness of findings and refining the predictive utility of clinical and genetic factors in treatment response. 

Keywords

Genetic Risk Score Modeling

Predictive Analysis

Internal & External Validation

Machine Learning Techniques 

Co-Author(s)

Fei Ye, Vanderbilt University Medical Center
April Barnado, MD, MSCI, Department of Biomedical Informatics

First Author

Kun Bai

Presenting Author

Kun Bai