
Latest Research in Genomics and Microbiome with a Hint of Bayesian

Michael Newton Chair
University of Wisconsin-Madison

Sunday, Aug 3: 4:00 PM - 5:50 PM
4019
Contributed Papers

Music City Center

Room: CC-205C

This session showcases the latest research in genomics, the microbiome, metabolomics, and sequencing, with an emphasis on the Bayesian methodology increasingly incorporated into these areas.

Main Sponsor

Biometrics Section

Presentations

A Generalized Framework for Multi-Level Image Data via Multivariate Log Gaussian Cox Processes

In microscopic images of cells, various cell populations often co-exist in a particular tissue, forming highly spatially structured communities where different taxa interact at micrometer scales. Quantifying the spatial relationships of microbes is essential for uncovering biofilm functions and biological mechanisms. Multivariate log Gaussian Cox processes are flexible models for the analysis of multivariate point patterns. However, they have so far been applied only to single realizations (i.e., single images), ignoring similarities and dissimilarities across images. We move beyond this limitation to model spatial interactions among multiple object types, integrating multi-level images from multiple subjects. In particular, we propose a unified hierarchical multivariate log Gaussian Cox process framework for multi-level image data from multiple subjects with a global governing process, providing a comprehensive quantification of the multivariate spatial relationships among object types. The proposed framework is appealing because it quantifies both within-sample and across-sample variability and derives global and subject-level inter-type spatial relationships simultaneously.
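
For readers less familiar with this model class, a multivariate log Gaussian Cox process for $K$ object types can be written as below. This is the standard single-image formulation, not the hierarchical extension the abstract proposes; the symbols $\Lambda_k$, $\mu_k$, $Z_k$, and $C_{kl}$ are notation assumed here and do not come from the abstract.

$$
X_k \mid \Lambda_k \sim \text{Poisson process with intensity } \Lambda_k, \qquad \Lambda_k(s) = \exp\{\mu_k + Z_k(s)\}, \qquad k = 1, \dots, K,
$$

where $(Z_1, \dots, Z_K)$ is a zero-mean stationary multivariate Gaussian random field with cross-covariance $C_{kl}(r) = \mathrm{Cov}\{Z_k(s), Z_l(s + r)\}$. Under this formulation the cross pair correlation function between types $k$ and $l$ is

$$
g_{kl}(r) = \exp\{C_{kl}(r)\},
$$

so $g_{kl}(r) > 1$ indicates attraction between the two types at distance $r$ and $g_{kl}(r) < 1$ indicates repulsion.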

Keywords

Microbiome Biofilm Image

Cross-pair Correlation

Log Gaussian Cox Process

Multivariate Point Process

Spatial Ecology

Co-Author(s)

Suman Majumder, University of Missouri
Brent Coull, Harvard T.H. Chan School of Public Health
Jessica Mark Welch, The Forsyth Institute
Jacqueline Starr, Channing Division of Network Medicine, Brigham and Women's Hospital
Kyu Ha Lee, Harvard T.H. Chan School of Public Health

First Author

Shuwan Wang

Presenting Author

Shuwan Wang

Assessment of Bayesian Hierarchical Covariance Structures for Modeling Spatial Protein Imaging Data

Advances in spatial proteomic imaging techniques have revolutionized the study of the tumor immune microenvironment (TIME). These techniques assess multiple markers simultaneously to distinguish immune cell populations in the TIME. The analysis of these immune profiles has become increasingly important with the progress of immunotherapy treatments. We account for the over-dispersed nature of the cell count data by modeling the counts with a beta-binomial distribution. To account for the correlation between the different cell populations in the TIME (e.g., T cells and cytotoxic T cells), we developed a Bayesian hierarchical beta-binomial model. The model can incorporate different covariance (or relationship) structures between the immune cell populations to reflect immune differentiation paths. To illustrate the model and the covariance structures that are possible, we apply it to spatial proteomic data from three large epidemiologic cohorts (N = 486) examining the TIME of ovarian cancer.
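
To illustrate why a beta-binomial distribution suits over-dispersed cell counts, the following minimal sketch simulates counts of one cell type and compares their variance with the plain binomial variance. It is not the authors' hierarchical model; the function name, the parameterization by a mean proportion and an overdispersion parameter `rho`, and all numeric values are assumptions chosen for illustration.

```python
import numpy as np

def simulate_beta_binomial(n_cells, mean_prop, rho, rng):
    """Draw counts of one cell type per sample from a beta-binomial.

    n_cells   : array of total cell counts per sample (hypothetical data)
    mean_prop : mean proportion of the cell type, in (0, 1)
    rho       : overdispersion (intra-sample correlation), in (0, 1)
    """
    # With Beta(a, b) proportions, rho = 1 / (a + b + 1), so a + b = 1/rho - 1.
    total = 1.0 / rho - 1.0
    a = mean_prop * total
    b = (1.0 - mean_prop) * total
    # Each sample gets its own proportion, then a binomial count.
    p = rng.beta(a, b, size=len(n_cells))
    return rng.binomial(n_cells, p)

rng = np.random.default_rng(0)
n_cells = np.full(5000, 200)          # 5000 samples of 200 cells each (made up)
counts = simulate_beta_binomial(n_cells, mean_prop=0.3, rho=0.1, rng=rng)

binomial_var = 200 * 0.3 * 0.7                        # n * mu * (1 - mu)
beta_binomial_var = binomial_var * (1 + (200 - 1) * 0.1)  # inflated by 1 + (n - 1) * rho
print("empirical variance:      ", counts.var())
print("binomial variance:       ", binomial_var)
print("beta-binomial variance:  ", beta_binomial_var)
```

The empirical variance should sit near the beta-binomial value and far above the binomial one, which is the extra variability the beta-binomial likelihood is meant to absorb.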

Keywords

Bayesian

beta-binomial model

covariance structures

hierarchical

spatial protein imaging data

tumor immune microenvironment

Co-Author(s)

Alex Soupir, Biostatistics and Bioinformatics Shared Resource, Moffitt Cancer Center
Mary Townsend, Division of Oncological Sciences, Knight Cancer Institute Oregon Health and Science University
Jose Laborde, Moffitt Cancer Center
Courtney Johnson, Emory University
Andrew Lawson, Medical University of South Carolina, College of Medicine
Joellen Schildkraut, Emory University
Shelley Tworoger, Moffitt Cancer Center
Kathryn Terry, Brigham and Women’s Hospital and Harvard Medical School
Lauren Peres, Moffitt Cancer Center
Brooke Fridley, Children's Mercy

First Author

Chase Sakitis, Children's Mercy

Presenting Author

Chase Sakitis, Children's Mercy

Bayesian Group Shrinkage Model to Identify the Key Genera in Microbiome-Metabolite Relation Dynamics

The gut microbiome influences cancer therapy responses, particularly immunotherapies, by shaping the metabolome. While some studies examine specific microbial genera and metabolites, little work identifies key genera driving overall metabolome profiles. To address this, we introduce B-MASTER (Bayesian Multivariate Analysis for Selecting Targeted Essential Regressors), a fully Bayesian framework with L1 and L2 penalties for sparsity and shrinkage, paired with a scalable Gibbs sampler. B-MASTER efficiently enables full posterior inference for models with up to four million parameters. Using this approach, we identify key microbial genera shaping metabolite profiles and analyze their relevance to colorectal cancer.

Keywords

Bayesian penalized regression

Gibbs sampling

Scalable high-dimensional models

Microbiome-metabolites dynamics

Colorectal cancer

Co-Author(s)

Priyam Das, Virginia Commonwealth University
Tanujit Dey, Brigham and Women's Hospital, Harvard University
Christine Peterson, University of Texas MD Anderson Cancer Center

First Author

Sounak Chakraborty, University of Missouri-Columbia

Presenting Author

Sounak Chakraborty, University of Missouri-Columbia

Bayesian Regularization of Tweedie Family: Discovering Omics Data Associations

High-throughput sequencing technologies in microbiome, transcriptome, and genome studies have produced massive omics datasets, where the primary outcomes are either count data (e.g., RNA-seq) or relative abundance data (e.g., microbial taxa proportions). We aim to model such data collected in longitudinal studies. Unlike time-course (time series) data, which track realizations of stochastic processes, longitudinal data are sparse and subject-specific. Biomarker interactions—such as correlated metabolites in diabetes studies—can enhance detection power. However, fully multivariate models for serial measurements pose high-dimensional estimation challenges. A practical alternative for univariate outcomes is to incorporate random effects into fixed-effect models, such as linear or generalized linear mixed models (GLMMs). A widely adopted approach employs the negative binomial distribution to account for overdispersion in count data. However, this model is inappropriate for relative abundance data, which are continuous, non-negative, and often zero-inflated—violating the discrete nature assumed by the negative binomial distribution.

Meanwhile, the widely used Benjamini-Hochberg $p$-value adjustment addresses the multiple-testing burden in high-dimensional settings but does not yield an estimation or predictive model. Thus, there is a clear need for efficient GLMM estimation techniques in high-dimensional contexts—an area previously addressed in the literature, but typically under normality assumptions or limited to select distributions from the exponential dispersion family.

In most omics applications, microbiome, transcriptome, and genome data are normalized by total count, resulting in relative abundance values. These values lie in [0,1] and reflect compositional rather than raw count data. Modeling such data with a negative binomial distribution violates key assumptions, misrepresents zeros caused by detection limits or true absence, and fails to account for compositional constraints or batch effects that influence library size. Moreover, omics datasets often exhibit sparsity (high proportions of zeros) and skewness, particularly due to inter-sample variability, sequencing depth, and preprocessing thresholds. These characteristics necessitate statistical models capable of handling both zero-inflation and continuous positive values.

To address these challenges, we assume that the $j$th measurement for subject $i$, conditional on the random effects, follows a Tweedie distribution with mean $\mu_{ij}$ and with unknown dispersion and index (power) parameters. The mean is linked to both fixed and random effects via a log link function.
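
One way to write this model down is sketched below. The symbols $x_{ij}$, $z_{ij}$, $\beta$, $b_i$, $\phi$, and $p$ are notation assumed here and do not appear in the abstract, and the restriction $1 < p < 2$ is inferred from the "compound Poisson distribution" keyword rather than stated by the authors.

$$
Y_{ij} \mid b_i \sim \mathrm{Tw}_p(\mu_{ij}, \phi), \qquad \log \mu_{ij} = x_{ij}^\top \beta + z_{ij}^\top b_i, \qquad \mathrm{Var}(Y_{ij} \mid b_i) = \phi \, \mu_{ij}^{\,p}.
$$

For $1 < p < 2$ the Tweedie distribution is a compound Poisson-gamma distribution: it is continuous on the positive reals but places positive probability on exact zeros,

$$
\Pr(Y_{ij} = 0 \mid b_i) = \exp\left\{ -\frac{\mu_{ij}^{\,2-p}}{\phi \, (2 - p)} \right\},
$$

which is why it can represent zero-inflated, non-negative continuous outcomes that a negative binomial model cannot.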

A major obstacle in applying standard LASSO to omics-scale data is computational inefficiency. We instead perform regularized quasi-likelihood estimation using $l_1$ regularization within a Bayesian framework. We assume that each regression coefficient follows a double-exponential (Laplace) prior, such that the maximum a posteriori (MAP) estimate under the quasi-likelihood corresponds to a regularized quasi-maximum likelihood solution. To address scalability issues, we implement an efficient MCMC algorithm that leverages posterior sampling to improve computational performance. Unlike standard least-squares or penalized likelihood approaches—which often fail under high dimensionality and zero-inflation—our MCMC method accommodates large covariate spaces, efficiently explores the posterior distribution under non-Gaussian outcomes, and ensures robust convergence even in the presence of singularities.
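
The correspondence between the Laplace prior and $\ell_1$ regularization can be made explicit as follows. Here $Q(\beta)$ denotes the log quasi-likelihood and $\lambda$ the rate of the prior; both symbols are assumed notation, and the sketch treats $\lambda$ as fixed and places flat priors on any remaining parameters.

$$
\pi(\beta_k \mid \lambda) = \frac{\lambda}{2} \exp(-\lambda \lvert \beta_k \rvert), \qquad \pi(\beta \mid \text{data}) \propto \exp\{Q(\beta)\} \prod_{k} \pi(\beta_k \mid \lambda),
$$

$$
\hat{\beta}_{\mathrm{MAP}} = \arg\max_{\beta} \Big\{ Q(\beta) - \lambda \sum_{k} \lvert \beta_k \rvert \Big\}.
$$

Taking the logarithm of the posterior turns the product of Laplace densities into the $\ell_1$ penalty, so the posterior mode coincides with the lasso-type penalized quasi-likelihood estimate.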

We benchmark our method through simulations that evaluate bias, sparsity recovery, and convergence across varying degrees of zero-inflation and sequencing depth. We also apply our method to a real transcriptomic dataset with associated treatment and clinical metadata, demonstrating improved model fit and interpretability compared to negative binomial-based models.

Keywords

Bayesian lasso

compound Poisson distribution

generalized linear mixed model

longitudinal omics data

Tweedie family

Co-Author

Ali Rahnavard, The George Washington University

First Author

Ali Taheriyoun, George Washington University

Presenting Author

Ali Taheriyoun, George Washington University

Genome-Wide Variants Significantly Contribute to Longitudinal Phenotype Dynamics

Considerable progress has been made in quantifying the heritability of cross-sectional traits, but analyzing longitudinal phenotypic trajectories remains challenging. This study introduces a mixed model integrating genome-wide genetic variants to disentangle heritability metrics on baseline trait levels and rates of change over time, providing insights into both static and dynamic aspects of traits. Key challenges primarily stem from the potential for large-scale studies, truncated estimates due to limited measurements per subject, and joint genetic effects. To address these complexities, we compare the average information restricted maximum likelihood algorithm, augmented with meta-analysis to tackle truncation, with the restricted Haseman-Elston regression approach, which avoids reliance on precision matrix computations. Using these approaches, we analyzed 6,948,674 genome-wide common variants to study PSA trajectories in males from the PLCO Screening Trial. Our findings reveal moderate heritability of baseline PSA levels but significant heritability of PSA velocity, underscoring an increasing heritability trend with age and enabling more accurate prediction of disease risk.
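
The appeal of Haseman-Elston regression, namely that it needs no matrix inversion, can be seen in a minimal cross-sectional sketch. This is not the authors' longitudinal model or their REHE implementation: it handles a single trait, uses simulated standard-normal columns as stand-ins for genotypes, and treats clipping the slope at zero as a crude proxy for a non-negativity restriction. All names and sizes are hypothetical.

```python
import numpy as np

def he_regression(y, K):
    """Moment-based heritability estimate for one standardized trait.

    Regresses the products y_i * y_j (i < j) on the genetic relatedness
    K_ij through the origin; the slope estimates heritability.
    """
    y = (y - y.mean()) / y.std()
    upper = np.triu_indices_from(K, k=1)   # off-diagonal pairs i < j
    k = K[upper]
    prod = np.outer(y, y)[upper]
    slope = (k @ prod) / (k @ k)
    # Keep the estimate non-negative (illustrative restriction only).
    return max(slope, 0.0)

rng = np.random.default_rng(1)
n, m, h2_true = 600, 500, 0.5               # subjects, variants, heritability (made up)

Z = rng.standard_normal((n, m))             # stand-in for standardized genotypes
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
K = Z @ Z.T / m                             # genetic relatedness matrix

u = rng.standard_normal(m)                  # per-variant effects
g = Z @ u * np.sqrt(h2_true / m)            # genetic component, variance about h2_true
e = rng.standard_normal(n) * np.sqrt(1 - h2_true)
y = g + e

h2_hat = he_regression(y, K)
print("estimated heritability:", h2_hat)
```

The estimate only requires elementwise products over pairs of subjects, which is what makes this family of methods attractive at biobank scale; with the small simulated sample here it should land in the neighborhood of the true value but with noticeable sampling noise.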

Keywords

AI-REML algorithm

truncation

REHE method

heritability

PSA level

large-scale studies

Co-Author(s)

Jianxin Shi
Paul Albert, National Cancer Institute

First Author

Pei Zhang, University of Maryland, College Park

Presenting Author

Pei Zhang, University of Maryland, College Park

Improved Bayesian Graphical Models for Omics Data

The study of protein–protein interactions (PPIs) provides insight into various biological mechanisms, including the binding of antibodies to antigens, enzymes to inhibitors or promoters, and receptors to ligands. Recent studies of PPIs have led to significant biological breakthroughs. Graphical models are useful tools for understanding complex biological relationships between biomolecules in high-dimensional data. Nevertheless, computational limitations currently restrict their usability, particularly in a Bayesian estimation paradigm when handling large multiclass datasets such as those common in biology. Here, we introduce a clustering-focused iterative (CFI) methodology designed to enhance the scalability and accuracy of multiple Gaussian Graphical Model (GGM) estimation in high-dimensional spaces. Further, we present a framework for a Bayesian graphical model that allows group-specific prior distributions to be specified, leading to improved model accuracy. We present results from simulation studies as well as a real-world application to data from host-response mass spectrometry studies.

Keywords

graphical model

Bayesian

omics data

Co-Author(s)

David Degnan, Pacific Northwest National Laboratory
Erik VonKaenel
Moses Obiri, Pacific Northwest National Laboratory
Daniel Adrian, Grand Valley State University

First Author

Lisa Bramer, Pacific Northwest National Laboratory

Presenting Author

Lisa Bramer, Pacific Northwest National Laboratory

Robust Bayesian Graphical Regression Models for Assessing Tumor Heterogeneity in Proteomic Networks

Graphical models are powerful tools to investigate complex dependency structures in high-throughput datasets. However, most existing graphical models make one of two canonical assumptions: (1) a homogeneous graph with a common network for all subjects or (2) normality of the data, especially in the context of Gaussian graphical models. Both assumptions are restrictive and can fail in certain applications such as proteomic networks. We propose an approach termed robust Bayesian graphical regression (rBGR) to estimate heterogeneous graphs for non-normally distributed data. rBGR is a flexible framework that accommodates non-normality by random marginal transformations and constructs covariate-dependent graphs to accommodate heterogeneity via graphical regressions. We formulate a new characterization of dependencies, conditional sign independence with covariates, and pair it with an efficient sampler. Simulation studies show that rBGR outperforms existing graphical models in both edge and covariate selection for data with various levels of non-normality. We use rBGR to assess proteomic networks and find protein-protein interactions that are differentially associated with immune cell abundance.

Keywords

Bayesian graphical models

Cancer

Conditional sign independence

Covariate-dependent graphs

Protein-protein interactions

First Author

Tsung-Hung Yao, The University of Texas MD Anderson Cancer Center

Presenting Author

Tsung-Hung Yao, The University of Texas MD Anderson Cancer Center