Sunday, Aug 3: 4:00 PM - 5:50 PM
4025
Contributed Papers
Music City Center
Room: CC-209C
Main Sponsor
Section on Statistics in Genomics and Genetics
Presentations
Polygenic risk scores are widely used in disease risk stratification, but their accuracy varies across diverse populations. Recent methods leverage large-scale multi-ancestry data to improve accuracy in under-represented populations but require labelling individuals by ancestry for prediction. This poses challenges for practical use, as clinical practices are typically not based on ancestry. We propose SPLENDID, a novel penalized regression framework for diverse biobank-scale data. Our method utilizes ancestry principal component interactions to model genetic ancestry as a continuum within a single prediction model for all ancestries, eliminating the need for discrete labels. In extensive simulations and analyses of 9 traits from the All of Us Research Program (N=224,364) and UK Biobank (N=340,140), SPLENDID significantly outperformed existing methods in prediction accuracy and model sparsity. By directly modeling continuous genetic ancestry, SPLENDID stands as a valuable tool for robust risk prediction across diverse populations and fairer clinical implementation.
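The idea of SNP-by-principal-component interactions can be illustrated with a small sketch. This is not the SPLENDID implementation: the function names, the toy genotypes, and the coefficient values are all hypothetical, and the penalized fitting step is omitted. It only shows how interaction columns let one SNP's effect vary smoothly with an individual's ancestry PCs instead of with a discrete label.

```python
def interaction_features(genotypes, pcs):
    """Build rows [g_1..g_m, g_1*pc_1, ..., g_1*pc_k, g_2*pc_1, ...].

    genotypes: per-individual allele counts (hypothetical toy data)
    pcs:       per-individual ancestry principal components
    """
    rows = []
    for g_row, pc_row in zip(genotypes, pcs):
        main = list(g_row)                                # main SNP effects
        inter = [g * pc for g in g_row for pc in pc_row]  # SNP x PC terms
        rows.append(main + inter)
    return rows


def snp_effect(beta_main, gamma, pc_row):
    """Effect of one SNP for one individual: beta + sum_k gamma_k * PC_k."""
    return beta_main + sum(gk * pk for gk, pk in zip(gamma, pc_row))


# Toy example: 3 individuals, 2 SNPs, 2 ancestry PCs (all values made up).
genotypes = [[0, 1], [2, 1], [1, 0]]
pcs = [[0.5, -1.0], [-0.2, 0.3], [1.0, 0.0]]
X = interaction_features(genotypes, pcs)

# Assumed coefficients, for illustration only: main effect 0.1,
# interaction effects 0.05 (PC1) and -0.02 (PC2).
effect_for_third_individual = snp_effect(0.1, [0.05, -0.02], pcs[2])
```

In a penalized regression on a design matrix like `X`, a SNP whose interaction coefficients are all shrunk to zero has the same effect for everyone, while nonzero interaction coefficients make its effect depend on position along the ancestry continuum.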
Keywords
polygenic risk scores
genetic ancestry
penalized regression
All of Us
UK Biobank
genetic interactions
High-dimensional datasets require effective variable selection to reduce the search space for downstream predictive analyses. Traditionally, statistical significance has been the primary criterion for variable selection. However, statistical significance does not necessarily reflect a variable's predictive utility. Moreover, multiple predictive metrics evaluate different aspects of predictability, yet no existing framework systematically integrates variable selection across these diverse criteria. Here, we propose the Poly-Adaptive Metric (PAM) model, a multi-objective ensemble approach that combines statistical significance with predictive performance metrics to optimize variable selection. The PAM model quantifies the reliability of each selection criterion to construct a unified variable importance matrix that guides the ensembling of different selection strategies. We applied the PAM model to the UK Biobank to construct polygenic risk scores (PRS) and compared its predictive performance against PRS generated using conventional p-value-based SNP selection. Our results demonstrate that the PAM model consistently outperforms p-value-based selection across multiple evaluation metrics.
Keywords
variable selection
ensemble learning
multi-objective
predictive modeling
polygenic risk score
While single-cell RNA-sequencing has facilitated profiling of heterogeneous transcriptional responses to genetic perturbations at the single-cell level, there remains a pressing need for computational models that can decode the mechanisms driving these responses and accurately predict outcomes to prioritize target genes for experimental design. We present scLAMBDA, a deep generative learning framework designed to model and predict single-cell transcriptional responses to genetic perturbations. By leveraging gene embeddings derived from large language models, scLAMBDA effectively integrates prior biological knowledge and disentangles basal cell states from perturbation-specific salient representations. Through comprehensive evaluations on multiple datasets, scLAMBDA consistently outperformed other methods in predicting perturbation outcomes. It demonstrated robust generalization to unseen target genes and perturbations, capturing both average expression changes and the heterogeneity of single-cell responses. Furthermore, its predictions enable diverse downstream analyses, including the identification of differentially expressed genes and the exploration of genetic interactions.
Keywords
single-cell RNA-sequencing
genetic perturbation
deep learning
Association studies, such as genome-wide association studies (GWAS), for continuous outcomes are commonly conducted using standard linear regression models. However, these biological outcomes are frequently skewed, necessitating transformations (e.g., log transformations) that are often not known in advance. This dependency on transformations can lead to variability in results and interpretations. Recently, cumulative probability models (CPMs) have emerged as a semi-parametric alternative to linear models. CPMs treat continuous outcomes as ordered categories, assigning each unique value as a category, and utilize cumulative link models for estimation. While existing algorithms for association analyses with ordinal outcomes are efficient and scalable, they struggle to handle a very large number of outcome categories. To address this, we leverage the CPM's sparse Hessian structure to develop an efficient score test algorithm, making association studies for continuous outcomes with CPMs computationally feasible. We demonstrate the algorithm's effectiveness using a large-scale omics dataset.
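The step of treating each unique outcome value as an ordered category can be sketched as follows. This is a minimal illustration with made-up data and a hypothetical helper name, not the authors' algorithm; it only shows that the category codes depend on the ordering of the outcome, so a monotone transformation such as the log leaves them unchanged.

```python
import math


def to_ordered_categories(y):
    """Map each value to the rank of its unique level (0 = smallest)."""
    levels = sorted(set(y))
    index = {value: rank for rank, value in enumerate(levels)}
    return [index[value] for value in y]


# Toy skewed outcome with one tie (values are made up).
y = [2.5, 0.3, 7.1, 0.3, 12.0]

cats_raw = to_ordered_categories(y)
cats_log = to_ordered_categories([math.log(v) for v in y])

n_categories = len(set(cats_raw))
```

Here `cats_raw` and `cats_log` are identical, which is why an analysis built on these categories does not require choosing a transformation in advance. With a truly continuous outcome the number of categories approaches the sample size, which is the computational difficulty the abstract describes.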
Keywords
ordinal regression
genome-wide association study
set-based testing
cumulative link models
Clinicians are increasingly interested in discovering computational biomarkers from short-term longitudinal omics data sets. This work focuses on Bayesian regression and variable selection for longitudinal omics datasets, which can quantify uncertainty and control false discovery. We consider a univariate and a multivariate approach, and in both we work on the first-difference scale of the longitudinal predictors and the response. In our univariate approach, Zellner's g prior is used with two different options for the tuning parameter g: g = sqrt(n) and a g that minimizes Stein's unbiased risk estimate (SURE). Bayes factors were used to quantify uncertainty and control for false discovery. In the multivariate approach, we use Bayesian Group LASSO with a spike and slab prior for group variable selection. We compare our method against commonly used linear mixed effect models on simulated data and real data from a Pulmonary Tuberculosis study on metabolite biomarker selection. With an automated selection of hyperparameters, the Zellner's g prior and Multivariate Bayesian Group LASSO spike and slab approaches correctly identify target metabolites with high specificity and sensitivity across various simulation and real data scenarios.
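A rough sketch of the univariate idea, under stated assumptions: first differences are taken of a single predictor and the response, and the evidence for an association is summarized with the standard closed-form Bayes factor for a fixed-g Zellner prior against the intercept-only model, BF = (1 + g)^((n - 1 - p)/2) / (1 + g(1 - R^2))^((n - 1)/2). The abstract does not state which Bayes factor expression the authors use, so this formula, the helper names, and the toy series are assumptions; the SURE-based choice of g is not shown.

```python
import math


def first_differences(series):
    """Differences between consecutive time points."""
    return [b - a for a, b in zip(series, series[1:])]


def r_squared(x, y):
    """Squared Pearson correlation (R^2 of a one-predictor regression)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def g_prior_log_bayes_factor(r2, n, p=1, g=None):
    """Log Bayes factor of a p-predictor model vs. the null, fixed g."""
    if g is None:
        g = math.sqrt(n)  # the g = sqrt(n) option named in the abstract
    return 0.5 * (n - 1 - p) * math.log1p(g) \
        - 0.5 * (n - 1) * math.log1p(g * (1.0 - r2))


# Toy longitudinal series (made-up values, one metabolite and one response).
metabolite = [1.0, 1.4, 2.1, 2.3, 3.0, 3.2]
response = [10.0, 10.9, 12.2, 12.7, 14.0, 14.5]

dx = first_differences(metabolite)
dy = first_differences(response)
r2 = r_squared(dx, dy)
log_bf = g_prior_log_bayes_factor(r2, n=len(dx))
```

A positive `log_bf` favors the model that includes the metabolite; with R^2 = 0 the expression is negative, so uninformative predictors are penalized. In a real study the differences would be pooled across subjects rather than taken from a single short series as in this toy example.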
Keywords
Disease Progression
Feature Selection
Mixed Models
Bayesian Group Lasso
Uncertainty Quantification
Zellner's g-prior
Recent genetic, epigenetic, and transcriptomic analyses have stratified Medulloblastoma (MB) into four subgroups, Wingless Type (WNT), Sonic Hedgehog (SHH), Group 3, and Group 4, with discrete patient profiles and prognoses. Using a dataset of over ten thousand gene expression profiles, this study explores which genes can improve prognostic accuracy for survival, while accounting for molecular stratification and known clinical covariates. This approach involves a Benjamini-Hochberg screening of all genes, adjustment for molecular stratification and clinical covariates, followed by several high-dimensional models to predict survival. A case study of 483 pediatric MB patients demonstrated improved prognostic performance, as assessed by the C-index, Brier scores, and time-dependent AUCs, when gene expression data were included. Simulation studies validated the method's performance, successfully excluding non-informative genes in null scenarios and reliably identifying influential genes in alternative scenarios. This approach provides a robust framework for enhancing survival prediction and uncovering biologically significant markers.
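The Benjamini-Hochberg screening step can be sketched as below. This is the generic step-up procedure applied to a list of per-gene p-values; the p-values, the 0.05 level, and the function name are illustrative assumptions, and the abstract does not say how the per-gene p-values were obtained.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return sorted indices of hypotheses rejected at FDR level alpha."""
    m = len(p_values)
    # Indices of p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: find the largest rank with p_(rank) <= rank*alpha/m.
        if p_values[i] <= rank * alpha / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    return sorted(order[:k_max])


# Toy per-gene p-values (made up); indices stand in for gene identifiers.
p = [0.041, 0.001, 0.216, 0.008, 0.039, 0.06]
selected = benjamini_hochberg(p)
```

With these toy values only the genes at indices 1 and 3 pass the screen; genes with raw p-values just under 0.05 (indices 0 and 4) are dropped because their thresholds are tightened by the multiplicity adjustment.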
Keywords
High-dimensional genomic data
Medulloblastoma molecular stratification
prognostic models
variable selection
survival analysis
Co-Author(s)
Shuoyang Wang, University of Louisville
Akshitkumar Mistry, UofL Health – Brown Cancer Center, University of Louisville, Louisville, KY
Howard Donninger, UofL Health – Brown Cancer Center, University of Louisville, Louisville, KY
Kavitha Yaddanapudi, UofL Health – Brown Cancer Center, University of Louisville, Louisville, KY
Maiying Kong, University of Louisville
Mst Sharmin Akter Sumy, Department of Bioinformatics and Biostatistics, SPHIS, University of Louisville, Louisville, KY
First Author
Tyler Jones
Presenting Author
Tyler Jones
This study quantifies the predictive value of clinical, genetic, and molecular data in forecasting treatment responses in Systemic Lupus Erythematosus (SLE) nephritis through a data-agnostic, statistically rigorous modeling approach that integrates traditional regression-based methods and machine learning techniques. While existing risk models are limited and often lack thorough validation, this study expands upon them by incorporating a genetic-based risk score, allowing for a direct comparison of its contribution to predictive performance relative to a base model without genetic features. Model performance has been assessed using the corrected C-index and B-score across different specifications and covariate selections, with subgroup analyses evaluating variations in predictive accuracy. Internal validation was performed via bootstrapping, while external validation is ongoing with multicenter datasets. Multiple imputation techniques have addressed missing data, enhancing the robustness of findings and refining the predictive utility of clinical and genetic factors in treatment response.
Keywords
Genetic Risk Score Modeling
Predictive Analysis
Internal & External Validation
Machine Learning Techniques
Co-Author(s)
Fei Ye, Vanderbilt University Medical Center
April Barnado, MD, MSCI, Department of Biomedical Informatics
First Author
Kun Bai
Presenting Author
Kun Bai