Bayesian and Machine Learning Models in Genomics and Genetics

Tony Chen, Chair
Harvard University
 
Monday, Aug 4: 8:30 AM - 10:20 AM
4034 
Contributed Papers 
Music City Center 
Room: CC-202B 

Main Sponsor

Section on Statistics in Genomics and Genetics

Presentations

Interval-censored Bayesian Fine-mapping using Functional Annotations for Genetic Variants

National and international genetic compendiums, such as the UK Biobank, have become invaluable resources for identifying genetic variants associated with complex diseases. These biobanks often collect data in interval-censored form; however, methodologies for genetic association testing with such outcomes are lacking. In particular, the use of Bayesian variable selection methods to fine-map genetic variants linked to interval-censored outcomes remains understudied. Fine-mapping specific SNPs within causal gene sets can offer deeper insights into the genetic mechanisms underlying the condition. Additionally, incorporating functional annotation information into the variable selection framework can prioritize variants with biological relevance and increase detection power. In this work, we extend Bayesian fine-mapping methods to incorporate functional annotation information in the model to improve selection. Our selection algorithm includes an MCMC scheme that is computationally efficient and yields interpretable results. We apply this method in a study using data from the UK Biobank to identify causal variants associated with colorectal cancer.
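
To illustrate how functional annotations can shift fine-mapping results, here is a minimal sketch. It is not the authors' method: it assumes a single causal variant per region, uses Wakefield-style approximate Bayes factors computed from ordinary z-scores in place of an interval-censored likelihood, and replaces the MCMC scheme with a closed-form normalization. The function names, the prior standard deviation `prior_sd`, the annotation weight `gamma`, and the example data are all hypothetical.

```python
import math


def approx_bayes_factor(z, se, prior_sd=0.2):
    """Wakefield-style approximate Bayes factor in favor of association.

    z is the association z-score, se the standard error of the effect
    estimate, and prior_sd the assumed prior standard deviation of the effect.
    """
    v = se ** 2
    w = prior_sd ** 2
    shrink = w / (v + w)
    return math.sqrt(1.0 - shrink) * math.exp(0.5 * z * z * shrink)


def annotation_priors(annotations, gamma):
    """Prior causal probabilities proportional to exp(gamma * annotation)."""
    scores = [gamma * a for a in annotations]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def posterior_inclusion(z_scores, std_errors, annotations, gamma=1.0):
    """Posterior inclusion probabilities under a single-causal-variant model."""
    priors = annotation_priors(annotations, gamma)
    bfs = [approx_bayes_factor(z, se) for z, se in zip(z_scores, std_errors)]
    unnormalized = [p * b for p, b in zip(priors, bfs)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical region: two variants with equal evidence, one of them annotated.
z_scores = [4.0, 4.0, 1.0]
std_errors = [0.05, 0.05, 0.05]
annotations = [1, 0, 0]  # 1 = variant lies in a functional element

pips = posterior_inclusion(z_scores, std_errors, annotations, gamma=1.0)
print([round(p, 3) for p in pips])
```

In this sketch, two variants with identical association evidence are separated only by the annotation prior; setting `gamma=0.0` recovers annotation-free fine-mapping.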

Keywords

GWAS

Interval-censored

Bayesian Fine-mapping

Functional Annotations 

First Author

Jaihee Choi, Marquette University

Presenting Author

Jaihee Choi, Marquette University

Multivariate Quantitative Bayesian LASSO: Detecting rare haplotype association with multiple traits

Often multiple traits are correlated, and they share underlying genetic factors. Rather than analyzing each phenotype separately, analyzing them jointly improves statistical power and allows biological insight into the shared genetic mechanisms. The Bivariate Quantitative Bayesian LASSO (QBL) was developed to detect rare haplotypes associated with two correlated continuous phenotypes by leveraging a latent variable to model their correlation and using Bayesian regularization to identify rare haplotypes associated with one or both phenotypes. However, its reliance on Markov Chain Monte Carlo (MCMC) for posterior estimation limits scalability as the number of phenotypes increases. To overcome this, we extend bivariate QBL to a Multivariate QBL (mQBL) framework, enabling efficient modeling of rare haplotype associations with multiple phenotypes. We employ Mean Field Variational Bayes (MFVB) for scalable posterior approximation, maintaining methodological rigor while significantly improving computational efficiency. Simulations demonstrate that mQBL performs comparably to bivariate QBL while being substantially faster. 

Keywords

Genetic Association

Rare Haplotype

Multiple Traits

Variational Bayes

Bayesian LASSO

Computational Scalability 

Co-Author

Swati Biswas, University of Texas at Dallas

First Author

Ibrahim Hossain Sajal, National Cancer Institute

Presenting Author

Ibrahim Hossain Sajal, National Cancer Institute

Robust Bayesian analysis of Sparse Longitudinal cancer omics data using Mixed Effect Models

The central task in analyzing omics data for complex diseases, including cancers, is to identify susceptible genetic factors that are associated with cancer phenotypes, with inferential guarantees. Such analysis is high-dimensional and becomes more challenging when the disease phenotypes are longitudinally measured and follow skewed distributions due to cancer heterogeneity. To overcome the limitations of existing longitudinal methods, which usually lack robustness and valid uncertainty quantification procedures, we have developed a sparse robust Bayesian mixed-effect model to analyze heterogeneous longitudinal omics data. Gibbs samplers for MCMC have been developed and efficiently implemented. Extensive numerical studies indicate the superior performance of the proposed model in estimation and variable selection. In particular, we show that the proposed model leads to valid inference in finite samples even in the presence of heterogeneous omics data. Case studies on longitudinal cancer omics data and other types of longitudinal omics data show that the proposed method identifies susceptible genetic factors with important biological implications.

Keywords

Robust Bayesian variable selection

Mixed effect models

Cancer omics data

Longitudinal studies

Uncertainty quantification 

Co-Author

Cen Wu, Kansas State University

First Author

Srijana Subedi, Kansas State University

Presenting Author

Srijana Subedi, Kansas State University

Robust Bayesian Bi-level Selection for Gene-Environment Interactions in Longitudinal Studies

Identifying important gene-environment (G×E) interactions in high-dimensional longitudinal studies poses unique challenges in the presence of long-tailed distributions or outliers in clinical outcomes. Robust Bayesian variable selection methods have been recently shown to effectively address outliers and outcome skewness in G×E studies. However, their potential for accommodating structured sparsity in longitudinal settings has not been fully investigated. In this study, we develop a novel robust Bayesian mixed-effects model for bi-level G×E interaction analysis in longitudinal studies. The proposed method performs effective sparse group selection for main and interaction effects through structured spike-and-slab priors, while accounting for within-subject correlations. To facilitate fast computation and reliable posterior inference, we develop efficient Gibbs samplers and MCMC algorithms. The superior performance of the proposed method in variable selection, estimation, and statistical inference, compared to existing approaches, is demonstrated through extensive simulation studies and applications to longitudinal cohorts with high-dimensional G×E interactions. 

Keywords

Robust Bayesian mixed-effects model

Sparse group selection

Longitudinal studies

Gene-environment interaction

Spike-and-slab priors 

First Author

Jie Ren, Indiana University School of Medicine

Presenting Author

Jie Ren, Indiana University School of Medicine

Penalized linear mixed models for correlated genetic data

Background

As genome-wide association studies (GWAS) aim to represent diverse populations and examine the heritability of complex traits, genetic data increasingly exhibit multiple layers of correlation, e.g., family groups within different data collection sites. This correlation structure motivated our application of high-dimensional regression. We propose a methodology and a new R package for applying penalized linear mixed models to correlated genetic data.

Methods

We introduce a novel projection technique to decorrelate structured genetic data. Our approach addresses practical model-building challenges, including cross-validation. The methodology is implemented in our R/C++ package, plmmr, which fits the regression model without reading data into memory, enabling scalability to GWAS-sized analyses.
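
The abstract does not spell out the projection. As a hedged illustration only, one common way to decorrelate data under a linear mixed model is to rotate by the eigenvectors of a relatedness matrix. The symbols below ($K$, $\eta$, $U$, $S$, $W$, $\lambda$) are assumptions for this sketch and may differ from the notation and details used in plmmr.

```latex
% Assumed model: correlated outcomes with relatedness matrix K
y = X\beta + u + \varepsilon, \qquad
u \sim N(0, \sigma^2 \eta K), \qquad
\varepsilon \sim N\bigl(0, \sigma^2 (1-\eta) I\bigr)

% Implied covariance and the eigendecomposition of K
\operatorname{Var}(y) = \sigma^2 \bigl(\eta K + (1-\eta) I\bigr), \qquad
K = U S U^\top

% Rotation that removes the correlation
W = \bigl(\eta S + (1-\eta) I\bigr)^{-1/2}, \qquad
\tilde{y} = W U^\top y, \qquad
\tilde{X} = W U^\top X, \qquad
\operatorname{Var}(\tilde{y}) = \sigma^2 I

% Penalized fit on the rotated data (lasso penalty shown as one example)
\hat{\beta} = \arg\min_{\beta} \;
\frac{1}{2n} \bigl\lVert \tilde{y} - \tilde{X}\beta \bigr\rVert_2^2
+ \lambda \lVert \beta \rVert_1
```

Under these assumptions the rotated observations are uncorrelated, so a standard penalized regression solver can be applied to $\tilde{y}$ and $\tilde{X}$. Cross-validation then has to respect the rotation, which is one of the model-building challenges the abstract mentions.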

Results

We demonstrate our method using data from a GWAS of orofacial clefts that involved family groups from multiple global sites.

Discussion

We will explore how our approach may be used to create polygenic risk scores. 

Keywords

Statistical genetics

GWAS

High-dimensional regression

lasso

Statistical computing 

Co-Author

Patrick Breheny, University of Iowa

First Author

Tabitha Peter, University of Iowa

Presenting Author

Tabitha Peter, University of Iowa

PPCA-XNORM: Harmonizing Multi-Platform Gene Expression Data via Probabilistic PCA

Cross-platform normalization is essential for integrating gene expression data from multiple platforms to improve statistical power and maximize the utility of publicly available datasets. However, existing methods struggle to disentangle biological variability from platform-specific effects, particularly when handling small or unbalanced sample sizes or data from more than two platforms. We propose PPCA-XNORM, a novel normalization framework based on Probabilistic Principal Component Analysis (PPCA), designed to address these limitations. Our model accounts for gene-specific platform effects through flexible location and scale adjustments while simultaneously capturing biological structure shared across genes via a low-rank between-gene correlation model. We develop a computationally efficient parameter estimation algorithm that combines conditional maximum likelihood estimation and gradient descent. Unlike previous methods, PPCA-XNORM supports normalization across three or more platforms, accommodates missing or unmatched samples during training, and enables cross-platform data transformation between arbitrary platforms via a closed-form conditional expectation without retraining. Using both simulated data and real-world RNA-seq and microarray datasets, we demonstrate that PPCA-XNORM consistently outperforms existing approaches, including MatchMixeR and Shambhala-2, in preserving biological signals while removing platform-specific artifacts. 
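
The closed-form conditional expectation mentioned above can be illustrated with the standard Gaussian conditioning identity. This is a sketch under assumed notation, not the parametrization of PPCA-XNORM itself: $x_A$ and $x_B$ denote expression measured on platforms $A$ and $B$, $W_A$ and $W_B$ are hypothetical loading blocks, and $\Psi_A$ is a hypothetical diagonal noise covariance.

```latex
% Assumed joint Gaussian model with a low-rank-plus-diagonal covariance
\begin{pmatrix} x_A \\ x_B \end{pmatrix}
\sim N\!\left(
\begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix},
\begin{pmatrix}
W_A W_A^\top + \Psi_A & W_A W_B^\top \\
W_B W_A^\top & W_B W_B^\top + \Psi_B
\end{pmatrix}
\right)

% Conditional mean used to map platform A measurements to platform B
\mathbb{E}[x_B \mid x_A]
= \mu_B + W_B W_A^\top \bigl(W_A W_A^\top + \Psi_A\bigr)^{-1} (x_A - \mu_A)
```

Because the mapping depends only on the fitted means and covariance blocks, any pair of platforms included in the fitted model can be converted without refitting, which is consistent with the "without retraining" property described in the abstract.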

Keywords

Cross-platform normalization

Probabilistic PCA

Platform-specific bias

Gene expression harmonization 

Co-Author(s)

Disa Yu, Sanofi
Jinfeng Zhang, Florida State University
Xing Qiu

First Author

Zhining Sui

Presenting Author

Zhining Sui