Monday, Aug 4: 8:30 AM - 10:20 AM
4034
Contributed Papers
Music City Center
Room: CC-202B
Main Sponsor
Section on Statistics in Genomics and Genetics
Presentations
National and international genetic compendiums, such as the UK Biobank, have become invaluable resources for identifying genetic variants associated with complex diseases. These biobanks often collect data in interval-censored form; however, there is a lack of methodologies for performing genetic association testing with such outcomes. Specifically, the use of Bayesian variable selection methods to fine-map genetic variants linked to interval-censored outcomes remains an understudied area. Fine-mapping specific SNPs within causal gene sets can offer deeper insights into the genetic mechanisms underlying the condition. Additionally, incorporating functional annotation information into the variable selection framework can prioritize variants with biological relevance and increase detection power. In this work, we extend Bayesian fine-mapping methods to incorporate functional annotation information in the model to improve selection. Our selection algorithm includes an MCMC scheme that is computationally efficient and yields interpretable results. We apply this method in a study using data from the UK Biobank to identify causal variants associated with colorectal cancer.
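For readers unfamiliar with interval-censored outcomes, the following is a minimal sketch of the likelihood such data give rise to: an event known only to lie in an interval (left, right] contributes S(left) - S(right), where S is the survival function. The Weibull baseline, the function names, and the toy intervals are illustrative assumptions and are not taken from the abstract, which does not specify the survival model used.

```python
import math

def weibull_survival(t, shape, scale):
    """Weibull survival function S(t) = exp(-(t / scale) ** shape)."""
    if t <= 0:
        return 1.0          # nothing has failed at time zero
    if math.isinf(t):
        return 0.0          # everything has failed eventually
    return math.exp(-((t / scale) ** shape))

def interval_censored_loglik(intervals, shape, scale):
    """Sum of log[S(left) - S(right)] over observed intervals (left, right]."""
    total = 0.0
    for left, right in intervals:
        prob = (weibull_survival(left, shape, scale)
                - weibull_survival(right, shape, scale))
        total += math.log(prob)
    return total

# Hypothetical observations (assumed for illustration only):
# event between two visits, event before the first visit (left-censored),
# and no event by the last visit (right-censored).
intervals = [(2.0, 5.0), (0.0, 3.0), (4.0, math.inf)]
loglik = interval_censored_loglik(intervals, shape=1.0, scale=5.0)
print(loglik)
```

Left- and right-censoring fall out as special cases by setting the left endpoint to 0 or the right endpoint to infinity.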
Keywords
GWAS
Interval-censored
Bayesian Fine-mapping
Functional Annotations
Multiple traits are often correlated and share underlying genetic factors. Rather than analyzing each phenotype separately, analyzing them jointly improves statistical power and provides biological insight into the shared genetic mechanisms. The Bivariate Quantitative Bayesian LASSO (QBL) was developed to detect rare haplotypes associated with two correlated continuous phenotypes by leveraging a latent variable to model their correlation and using Bayesian regularization to identify rare haplotypes associated with one or both phenotypes. However, its reliance on Markov chain Monte Carlo (MCMC) for posterior estimation limits scalability as the number of phenotypes increases. To overcome this, we extend bivariate QBL to a Multivariate QBL (mQBL) framework, enabling efficient modeling of rare haplotype associations with multiple phenotypes. We employ Mean Field Variational Bayes (MFVB) for scalable posterior approximation, maintaining methodological rigor while significantly improving computational efficiency. Simulations demonstrate that mQBL performs comparably to bivariate QBL while being substantially faster.
Keywords
Genetic Association
Rare Haplotype
Multiple Traits
Variational Bayes
Bayesian LASSO
Computational Scalability
A central task in analyzing omics data for complex diseases, including cancers, is to identify genetic susceptibility factors associated with cancer phenotypes, with inferential guarantees. Such analysis is high-dimensional and becomes more challenging when the disease phenotypes follow skewed distributions due to cancer heterogeneity and are measured longitudinally. To overcome the limitations of existing longitudinal analyses, which usually lack robustness and valid uncertainty quantification procedures, we have developed a sparse robust Bayesian mixed-effect model to analyze heterogeneous longitudinal omics data. We have developed and efficiently implemented Gibbs samplers for MCMC-based posterior inference. Extensive numerical studies indicate the superior performance of the proposed model in estimation and variable selection. In particular, we show that the proposed model can lead to valid inference results on finite samples even in the presence of heterogeneous omics data. Case studies on longitudinal cancer omics data and other types of longitudinal omics data show that the proposed method identifies genetic susceptibility factors with important biological implications.
Keywords
Robust Bayesian variable selection
Mixed effect models
Cancer omics data
Longitudinal studies
Uncertainty quantification
Identifying important gene-environment (G×E) interactions in high-dimensional longitudinal studies poses unique challenges in the presence of long-tailed distributions or outliers in clinical outcomes. Robust Bayesian variable selection methods have recently been shown to effectively address outliers and outcome skewness in G×E studies. However, their potential for accommodating structured sparsity in longitudinal settings has not been fully investigated. In this study, we develop a novel robust Bayesian mixed-effects model for bi-level G×E interaction analysis in longitudinal studies. The proposed method performs effective sparse group selection for main and interaction effects through structured spike-and-slab priors, while accounting for within-subject correlations. To facilitate fast computation and reliable posterior inference, we develop efficient Gibbs samplers and MCMC algorithms. Extensive simulation studies and applications to longitudinal cohorts with high-dimensional G×E interactions demonstrate the superior performance of the proposed method in variable selection, estimation, and statistical inference compared to existing approaches.
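To illustrate what bi-level (sparse group) sparsity means under a spike-and-slab prior, here is a minimal sketch that draws coefficients with a group-level and a within-group inclusion indicator. The point-mass spike, the Gaussian slab, the function name, and all parameter values are assumptions made for illustration; the abstract does not state the exact prior, and robust formulations may use a different slab.

```python
import numpy as np

def draw_bilevel_spike_slab(n_groups, group_size, pi_group, pi_within,
                            slab_sd, rng):
    """Draw coefficients that are sparse at the group and individual level.

    A coefficient is nonzero only if its group indicator AND its own
    within-group indicator are both switched on (assumed construction).
    """
    group_on = rng.random(n_groups) < pi_group                  # group level
    within_on = rng.random((n_groups, group_size)) < pi_within  # individual level
    slab = rng.normal(0.0, slab_sd, size=(n_groups, group_size))
    active = group_on[:, None] & within_on
    beta = np.where(active, slab, 0.0)   # spike = exact zero
    return beta, group_on

rng = np.random.default_rng(2)
beta, group_on = draw_bilevel_spike_slab(
    n_groups=10, group_size=5, pi_group=0.3, pi_within=0.5,
    slab_sd=1.0, rng=rng,
)
print(beta.shape, int(group_on.sum()))
```

In a G×E setting, one group could hold a gene's main effect together with its interactions with each environmental factor, so that a whole gene can be excluded while individual interactions within a retained gene can still be zeroed out.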
Keywords
Robust Bayesian mixed-effects model
Sparse group selection
Longitudinal studies
Gene-environment interaction
Spike-and-slab priors
First Author
Jie Ren, Indiana University School of Medicine
Presenting Author
Jie Ren, Indiana University School of Medicine
Background
As genome-wide association studies (GWAS) aim to represent diverse populations and examine the heritability of complex traits, they produce genetic data with multiple layers of correlation, e.g., family groups nested within different data collection sites. This correlation structure motivated our application of high-dimensional regression. We propose a methodology and a new R package for applying penalized linear mixed models to correlated genetic data.
Methods
We introduce a novel projection technique to decorrelate structured genetic data. Our approach addresses practical model-building challenges, including cross-validation. The methodology is implemented in our R/C++ package, plmmr, which fits the regression model without reading data into memory, enabling scalability to GWAS-sized analyses.
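The decorrelation idea can be sketched as follows, under a common penalized linear mixed model formulation in which the outcome covariance is proportional to eta * K + (1 - eta) * I, with K a relatedness matrix estimated from the genotypes. This is an illustrative assumption, not the plmmr implementation: the abstract does not give the details of the projection, eta is treated as known here, and the simulated data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200

# Simulated standardized "genotype" matrix and outcome (toy data).
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X[:, 0] - X[:, 1] + rng.standard_normal(n)

# Realized relatedness matrix and its eigendecomposition K = U diag(s) U^T.
K = X @ X.T / p
s, U = np.linalg.eigh(K)

# Assumed covariance structure: Sigma = eta * K + (1 - eta) * I.
eta = 0.5
w = 1.0 / np.sqrt(eta * s + (1.0 - eta))
W = (U * w).T                  # W = diag(w) U^T, so W Sigma W^T = I

# Rotated data have (approximately) uncorrelated errors, so a standard
# penalized regression such as the lasso can be fit to (X_rot, y_rot).
X_rot = W @ X
y_rot = W @ y

Sigma = eta * K + (1.0 - eta) * np.eye(n)
whitened = W @ Sigma @ W.T
print(np.allclose(whitened, np.eye(n)))
```

Because the rotation mixes observations, cross-validation folds have to be handled with care after such a transformation, which is one of the practical model-building challenges the abstract refers to.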
Results
We demonstrate our method using data from a GWAS of orofacial clefts that involved family groups from multiple global sites.
Discussion
We will explore how our approach may be used to create polygenic risk scores.
Keywords
Statistical genetics
GWAS
High-dimensional regression
lasso
Statistical computing
Cross-platform normalization is essential for integrating gene expression data from multiple platforms to improve statistical power and maximize the utility of publicly available datasets. However, existing methods struggle to disentangle biological variability from platform-specific effects, particularly when handling small or unbalanced sample sizes or data from more than two platforms. We propose PPCA-XNORM, a novel normalization framework based on Probabilistic Principal Component Analysis (PPCA), designed to address these limitations. Our model accounts for gene-specific platform effects through flexible location and scale adjustments while simultaneously capturing biological structure shared across genes via a low-rank between-gene correlation model. We develop a computationally efficient parameter estimation algorithm that combines conditional maximum likelihood estimation and gradient descent. Unlike previous methods, PPCA-XNORM supports normalization across three or more platforms, accommodates missing or unmatched samples during training, and enables cross-platform data transformation between arbitrary platforms via a closed-form conditional expectation without retraining. Using both simulated data and real-world RNA-seq and microarray datasets, we demonstrate that PPCA-XNORM consistently outperforms existing approaches, including MatchMixeR and Shambhala-2, in preserving biological signals while removing platform-specific artifacts.
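The closed-form conditional expectation mentioned above can be illustrated with the standard Gaussian conditioning formula applied to a PPCA-style covariance W W^T + sigma^2 I. This is a minimal sketch under assumed dimensions and randomly generated parameters; the stacking of two platforms into one vector, the variable names, and the function name are hypothetical and do not reproduce the PPCA-XNORM model, whose exact parameterization is not given in the abstract.

```python
import numpy as np

rng = np.random.default_rng(1)
d_a, d_b, q = 4, 3, 2          # features on platform A, platform B; latent dim

# Assumed (already estimated) PPCA parameters for the stacked vector [x_a, x_b].
W = rng.standard_normal((d_a + d_b, q))   # loadings
mu = rng.standard_normal(d_a + d_b)       # mean
sigma2 = 0.1                              # isotropic noise variance

# Marginal covariance implied by PPCA: Sigma = W W^T + sigma^2 I.
Sigma = W @ W.T + sigma2 * np.eye(d_a + d_b)
S_aa = Sigma[:d_a, :d_a]
S_ba = Sigma[d_a:, :d_a]

def predict_b_from_a(x_a):
    """E[x_b | x_a] = mu_b + S_ba S_aa^{-1} (x_a - mu_a) for a joint Gaussian."""
    return mu[d_a:] + S_ba @ np.linalg.solve(S_aa, x_a - mu[:d_a])

x_a = rng.standard_normal(d_a)            # a hypothetical platform-A profile
x_b_hat = predict_b_from_a(x_a)
print(x_b_hat.shape)
```

Since the mapping depends only on the fitted mean and covariance blocks, swapping which block is conditioned on gives the transformation in the other direction without refitting.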
Keywords
Cross-platform normalization
Probabilistic PCA
Platform-specific bias
Gene expression harmonization