Bayesian Regularization of Tweedie Family: Discovering Omics Data Associations

Ali Rahnavard Co-Author
The George Washington University
 
Ali Taheriyoun First Author
George Washington University
 
Ali Taheriyoun Presenting Author
George Washington University
 
Sunday, Aug 3: 4:50 PM - 5:05 PM
2594 
Contributed Papers 
Music City Center 
High-throughput sequencing technologies in microbiome, transcriptome, and genome studies have produced massive omics datasets, where the primary outcomes are either count data (e.g., RNA-seq) or relative abundance data (e.g., microbial taxa proportions). We aim to model such data collected in longitudinal studies. Unlike time-course (time series) data, which track realizations of stochastic processes, longitudinal data are sparse and subject-specific. Biomarker interactions—such as correlated metabolites in diabetes studies—can enhance detection power. However, fully multivariate models for serial measurements pose high-dimensional estimation challenges. A practical alternative for univariate outcomes is to incorporate random effects into fixed-effect models, such as linear or generalized linear mixed models (GLMMs). A widely adopted approach employs the negative binomial distribution to account for overdispersion in count data. However, this model is inappropriate for relative abundance data, which are continuous, non-negative, and often zero-inflated—violating the discrete nature assumed by the negative binomial distribution.

Meanwhile, the widely used Benjamini-Hochberg $p$-value adjustment addresses the multiple-testing burden in high-dimensional settings but does not yield an estimation or predictive model. Thus, there is a clear need for efficient GLMM estimation techniques in high-dimensional contexts—an area previously addressed in the literature, but typically under normality assumptions or limited to select distributions from the exponential dispersion family.

In most omics applications, microbiome, transcriptome, and genome data are normalized by total count, resulting in relative abundance values. These values lie in [0,1] and reflect compositional rather than raw count data. Modeling such data with a negative binomial distribution violates key assumptions, misrepresents zeros caused by detection limits or true absence, and fails to account for compositional constraints or batch effects that influence library size. Moreover, omics datasets often exhibit sparsity (high proportions of zeros) and skewness, particularly due to inter-sample variability, sequencing depth, and preprocessing thresholds. These characteristics necessitate statistical models capable of handling both zero-inflation and continuous positive values.

To address these challenges, we assume that the $j$th measurement for subject $i$, conditional on the random effects, follows a Tweedie distribution with mean $\mu_{ij}$ and with unknown dispersion and index (power) parameters. The mean is linked to the fixed and random effects through a log link function.
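The zero-plus-continuous behavior of this model can be illustrated with a small simulation. The sketch below is not the authors' code. It assumes the compound Poisson case of the Tweedie family (index $1 < p < 2$, variance $\phi\mu^p$), a single covariate, and a normally distributed random intercept per subject; the function name `simulate_tweedie` and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def simulate_tweedie(mu, phi, p, rng):
    """Draw compound Poisson-gamma (Tweedie, 1 < p < 2) variates
    with mean mu and variance phi * mu**p."""
    mu = np.asarray(mu, dtype=float)
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))    # Poisson rate of the latent count
    shape = (2.0 - p) / (p - 1.0)                # gamma shape of each summand
    scale = phi * (p - 1.0) * mu ** (p - 1.0)    # gamma scale of each summand
    n = rng.poisson(lam)
    # A sum of n iid Gamma(shape, scale) draws is Gamma(n * shape, scale);
    # n = 0 leaves an exact zero.
    y = np.zeros_like(mu)
    pos = n > 0
    y[pos] = rng.gamma(n[pos] * shape, scale[pos])
    return y

rng = np.random.default_rng(0)
n_subjects, n_visits = 200, 5
beta0, beta1, sigma_b = -1.0, 0.5, 0.7           # hypothetical values
phi, p = 1.5, 1.4                                # hypothetical values

x = rng.normal(size=(n_subjects, n_visits))           # one covariate per visit
b = rng.normal(scale=sigma_b, size=(n_subjects, 1))   # assumed normal random intercepts
mu = np.exp(beta0 + beta1 * x + b)                    # log link
y = simulate_tweedie(mu, phi, p, rng)

print(f"share of exact zeros: {np.mean(y == 0):.2f}")
print(f"smallest positive value: {y[y > 0].min():.4f}")
```

Under this parameterization the probability of an exact zero is $\exp\{-\mu_{ij}^{2-p}/(\phi(2-p))\}$, so a single distribution produces both the point mass at zero and the continuous positive part without a separate zero-inflation component.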

A major obstacle in applying standard LASSO to omics-scale data is computational inefficiency. We instead perform regularized quasi-likelihood estimation using $\ell_1$ regularization within a Bayesian framework. We assume that each regression coefficient follows a double-exponential (Laplace) prior, such that the maximum a posteriori (MAP) estimate under the quasi-likelihood corresponds to a regularized quasi-maximum likelihood solution. To address scalability issues, we implement an efficient MCMC algorithm that leverages posterior sampling to improve computational performance. Unlike standard least-squares or penalized likelihood approaches, which often fail under high dimensionality and zero-inflation, our MCMC method accommodates large covariate spaces, efficiently explores the posterior distribution under non-Gaussian outcomes, and ensures robust convergence even in the presence of singularities.
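The correspondence between the Laplace prior and the $\ell_1$ penalty can be sketched as follows. The notation is introduced here for illustration and is not taken from the authors: $Q(\beta)$ denotes the log quasi-likelihood conditional on the random effects, $\lambda > 0$ a common prior rate, $K$ the number of regression coefficients, $y_{ij}$ the observed outcome, $\phi$ the dispersion, and $p$ the Tweedie index, assumed to satisfy $p \neq 1, 2$.

$$
Q(\beta) = \frac{1}{\phi} \sum_{i,j} \left[ \frac{y_{ij}\,\mu_{ij}^{1-p}}{1-p} - \frac{\mu_{ij}^{2-p}}{2-p} \right],
\qquad
\frac{\partial Q}{\partial \mu_{ij}} = \frac{y_{ij} - \mu_{ij}}{\phi\,\mu_{ij}^{p}},
$$

$$
\pi(\beta_k \mid \lambda) = \frac{\lambda}{2} \exp\left(-\lambda |\beta_k|\right), \qquad k = 1, \dots, K,
$$

$$
\hat{\beta}_{\mathrm{MAP}}
= \arg\max_{\beta} \left\{ Q(\beta) + \sum_{k=1}^{K} \log \pi(\beta_k \mid \lambda) \right\}
= \arg\max_{\beta} \left\{ Q(\beta) - \lambda \sum_{k=1}^{K} |\beta_k| \right\}.
$$

The last equality holds because the term $K \log(\lambda/2)$ does not depend on $\beta$, so the posterior mode coincides with the $\ell_1$-penalized quasi-likelihood estimate for a fixed $\lambda$.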

We benchmark our method through simulations that evaluate bias, sparsity recovery, and convergence across varying degrees of zero-inflation and sequencing depth. We also apply our method to a real transcriptomic dataset with associated treatment and clinical metadata, demonstrating improved model fit and interpretability compared to negative binomial-based models.

Keywords

Bayesian lasso

compound Poisson distribution

generalized linear mixed model

longitudinal omics data

Tweedie family 

Main Sponsor

Biometrics Section