Thursday, Aug 7: 8:30 AM - 10:20 AM
4207
Contributed Papers
Music City Center
Room: CC-207B
Main Sponsor
ENAR
Presentations
The study of compositional microbiome data is crucial for understanding microbial roles in health and disease. Analysis of such data has shifted from traditional log-ratio transformations to methods that enforce a sum-to-zero constraint on the regression coefficients. However, penalized regression provides only point estimates, while Markov chain Monte Carlo (MCMC) methods, though accurate, are computationally intensive for high-dimensional data.
We propose Bayesian generalized linear models for analyzing compositional and sub-compositional microbiome data. The model uses a spike-and-slab double-exponential prior, enabling weak shrinkage of important coefficients and strong shrinkage of irrelevant ones. The sum-to-zero constraint is handled via soft-centering, which places a prior distribution on the sum of the coefficients. A fast and stable algorithm integrates EM steps into the iteratively weighted least squares (IWLS) algorithm to improve computational efficiency.
Extensive simulations show that our method outperforms existing approaches in accuracy and prediction. We applied it to a microbiome study, identifying microorganisms linked to inflammatory bowel disease (IBD). The method is available in the R package BhGLM (https://github.com/nyiuab/BhGLM).
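As a rough illustration of the soft-centering idea only (a minimal base-R sketch with simulated data and a hypothetical function name, not the BhGLM interface, and with a single double-exponential penalty rather than the spike-and-slab mixture), the sum-to-zero constraint can be imposed softly by adding a tight zero-mean prior on the sum of the compositional coefficients to a penalized logistic log-likelihood:

```r
## Hypothetical sketch: ridge-free logistic fit with double-exponential
## (Laplace) shrinkage plus a soft sum-to-zero penalty, i.e. a normal prior
## centered at zero with small variance tau2 on sum(beta).
soft_centered_fit <- function(X, y, lambda = 1, tau2 = 0.01) {
  p <- ncol(X)
  negloglik <- function(beta) {
    eta <- X %*% beta
    -sum(y * eta - log(1 + exp(eta))) +   # negative logistic log-likelihood
      lambda * sum(abs(beta)) +           # double-exponential shrinkage
      sum(beta)^2 / (2 * tau2)            # soft-centering: prior on sum of coefficients
  }
  optim(rep(0, p), negloglik, method = "BFGS")$par
}

## Toy usage with simulated log relative abundances
set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- rbinom(100, 1, plogis(X[, 1] - X[, 2]))
round(soft_centered_fit(X, y), 3)
```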
Keywords
Bayesian GLMs
Compositional data
EM algorithm
Microbiome
Sum-to-zero constraint
Spike-and-slab priors
Survival analysis integrating microbiome and clinical data offers powerful insights into human health. The microbiome influences immunity, inflammation, and cancer outcomes. Combining microbial profiles with clinical factors like tumor stage, treatment, and demographics enhances understanding of how host-microbe interactions affect survival.
To analyze these complex datasets, advanced statistical methods, including Bayesian models, Cox proportional hazards models, and machine learning techniques, are employed. These approaches can uncover novel biomarkers and therapeutic targets, leading to personalized treatment strategies that optimize patient outcomes. However, challenges remain due to the compositionality and high dimensionality of microbiome data and the phylogenetic relationships among taxa.
In this paper, we propose a Bayesian compositional Cox proportional hazards model with a regularized horseshoe prior for analyzing compositional microbiome and clinical data. We apply a soft sum-to-zero constraint to the microbiome coefficients to address compositionality, and we introduce a structured shrinkage prior that incorporates similarity among taxa to account for the phylogenetic structure. To evaluate the predictive performance of our model, we conducted extensive simulation studies and analyzed a real dataset. The implementation was carried out using the R package brms, with results summarized from two Markov chain Monte Carlo (MCMC) algorithms executed in Stan.
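A minimal sketch of the brms building block, using toy simulated data with hypothetical column names; the soft sum-to-zero constraint and the phylogeny-informed structured prior described above would require additional custom Stan code and are not shown:

```r
library(brms)

## Toy stand-in data (hypothetical names; a real analysis would use
## log-transformed taxon abundances and clinical covariates)
set.seed(123)
n <- 150
dat <- data.frame(
  age    = rnorm(n, 60, 10),
  taxon1 = rnorm(n), taxon2 = rnorm(n), taxon3 = rnorm(n)
)
dat$time   <- rexp(n, rate = exp(-3 + 0.4 * dat$taxon1))
dat$status <- rbinom(n, 1, 0.7)          # 1 = event, 0 = right-censored

## Cox model with a regularized horseshoe prior on the regression coefficients
fit <- brm(
  time | cens(1 - status) ~ age + taxon1 + taxon2 + taxon3,
  data   = dat,
  family = brmsfamily("cox"),
  prior  = set_prior(horseshoe(df = 1, par_ratio = 0.1), class = "b"),
  chains = 2, iter = 2000, seed = 123
)
summary(fit)
```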
Keywords
Survival analysis
High-dimensional
Compositional
Horseshoe prior
MCMC
Soft sum-to-zero constraint
Wearable devices, such as actigraphy monitors and continuous glucose monitors (CGMs), capture high-frequency data, typically summarized by the percentage of time spent within fixed thresholds. For example, CGM data are categorized into hypoglycemia, normoglycemia, and hyperglycemia based on a standard glucose range of 70–180 mg/dL. Although scientific guidelines inform the choice of thresholds, it remains unclear whether this choice is optimal and whether the same thresholds should be applied across different populations. In this work, we define threshold optimality with loss functions that quantify discrepancies between the empirical distributions of wearable device measurements and threshold-based summaries. Using the Wasserstein distance as the base measure, we reformulate the loss minimization as optimal piecewise linearization of quantile functions, solved via stepwise algorithms and differential evolution. We also formulate semi-supervised approaches that incorporate some predefined thresholds based on scientific rationale. Applications to CGM data reveal that data-driven thresholds differ by population and improve discriminative power over fixed thresholds.
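As a rough base-R sketch with simulated readings (not the stepwise or differential-evolution algorithms described above), a candidate set of thresholds can be scored by the Wasserstein-1 distance between the empirical quantile function and the piecewise-linear quantile function induced by those thresholds:

```r
## Score glucose thresholds by the approximate W1 distance between the
## empirical quantile function and its threshold-induced piecewise-linear fit.
set.seed(42)
glucose <- rgamma(5000, shape = 20, rate = 0.15)   # toy CGM readings (mg/dL)

w1_loss <- function(thresholds, x, grid = seq(0.001, 0.999, length.out = 500)) {
  thresholds <- sort(thresholds)
  q_emp   <- quantile(x, grid, names = FALSE)      # empirical quantile function
  knots_p <- c(0, ecdf(x)(thresholds), 1)          # probabilities at the thresholds
  knots_q <- c(min(x), thresholds, max(x))         # glucose values at the knots
  q_approx <- approx(knots_p, knots_q, xout = grid, ties = "ordered")$y
  mean(abs(q_emp - q_approx))                      # approximate W1 distance
}

## Standard 70/180 mg/dL thresholds vs. a crude data-driven search
w1_loss(c(70, 180), glucose)
fit <- optim(c(70, 180), w1_loss, x = glucose, method = "Nelder-Mead")
fit$par; fit$value
```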
Keywords
Amalgamation
Continuous glucose monitoring (CGM)
Histogram
Time-in-Range (TIR)
Piecewise linearization
Wasserstein distance
Cluster randomized trials (CRTs) are essential for evaluating cluster-level interventions in medicine and public health. However, many CRTs include only a few clusters, such as hospital-based interventions where a small number of large hospitals are randomized. Conventional methods often require at least 30–40 clusters for reliable inference. This study uses simulations to explore statistical methods for CRTs with binary outcomes when there are ≤10 clusters with large sizes. We investigate whether asymptotic properties hold in this challenging yet common scenario.
We compare generalized estimating equations (GEE), generalized linear mixed models (GLMM), cluster-level summaries (CLS), and randomization-based (RB) methods. Simulations show that GLMM and CLS performed best in terms of Type I error and power. RB maintained Type I error but lagged behind CLS and GLMM in power. GEE had the worst Type I error control: the standard sandwich variance estimator inflated Type I error, while bias-corrected versions tended to yield Type I error below the nominal level. These findings can better guide the choice of analytic methods for CRTs with few but large clusters, ensuring more robust inference in real-world settings.
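An illustrative sketch of a single simulation replicate, simplified relative to the full study design, comparing GLMM (lme4), GEE (geepack), and a cluster-level summary analysis for a binary outcome with eight large clusters:

```r
library(lme4)
library(geepack)

set.seed(2025)
n_clusters <- 8; cluster_size <- 500
trt     <- rep(rep(0:1, each = n_clusters / 2), each = cluster_size)
cluster <- rep(seq_len(n_clusters), each = cluster_size)
u <- rnorm(n_clusters, sd = 0.3)                      # cluster random effects
y <- rbinom(n_clusters * cluster_size, 1,
            plogis(-0.5 + 0.4 * trt + u[cluster]))
dat <- data.frame(y, trt, cluster = factor(cluster))

## GLMM with a random intercept per cluster
glmm <- glmer(y ~ trt + (1 | cluster), data = dat, family = binomial)

## GEE with exchangeable working correlation (standard sandwich variance)
gee <- geeglm(y ~ trt, id = cluster, data = dat,
              family = binomial, corstr = "exchangeable")

## Cluster-level summary: t-test on cluster-specific event proportions
p_clust <- tapply(dat$y, dat$cluster, mean)
g_clust <- tapply(dat$trt, dat$cluster, mean)
cls <- t.test(p_clust[g_clust == 1], p_clust[g_clust == 0])

summary(glmm); summary(gee); cls
```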
Keywords
Cluster Randomized Trials
Multilevel Models
Type I Error
Simulation Study
Few Clusters
Inference
The frequency domain properties of biomedical signals offer valuable insights into health and functioning of underlying physiological systems. The power spectrum, which characterizes these properties, is often summarized by partitioning frequencies into standard bands and averaging power within bands. These summary measures are regularly used for analysis, but are not guaranteed to optimally retain differences in power spectra across signals from different participants. We propose a data-adaptive method for identifying frequency band summary measures that preserve spectral variability within a population of interest. The method can also identify subpopulations with distinct power spectra and summary measures that best characterize local dynamics. Validation criteria are developed to select a reasonable number of bands and subpopulations. An evolutionary algorithm is designed to simultaneously identify subpopulations and their corresponding summary measures. The method is used to analyze stride interval series from patients with different neurological disorders, revealing distinct subpopulations and the need for subpopulation-dependent summary measures.
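A toy base-R sketch of the underlying idea, with simulated series, a simple periodogram instead of multitaper estimation, and a random search standing in for the evolutionary algorithm: candidate band edges are scored by how well band-averaged power reproduces each series' log spectrum.

```r
set.seed(7)
series <- replicate(20, arima.sim(list(ar = runif(1, 0.2, 0.8)), n = 512))

band_loss <- function(edges, series) {
  edges <- sort(edges)
  loss <- 0
  for (j in seq_len(ncol(series))) {
    sp   <- spec.pgram(series[, j], taper = 0.1, plot = FALSE)
    lsp  <- log(sp$spec)
    band <- cut(sp$freq, breaks = c(0, edges, 0.5), include.lowest = TRUE)
    fit  <- ave(lsp, band)                 # band-wise constant approximation
    loss <- loss + mean((lsp - fit)^2)
  }
  loss / ncol(series)
}

## Random search over two interior band edges in (0, 0.5)
cands  <- replicate(200, sort(runif(2, 0.01, 0.49)))
losses <- apply(cands, 2, band_loss, series = series)
cands[, which.min(losses)]
```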
Keywords
Spectral analysis
Cluster validation
Evolutionary algorithm
Multitaper estimation
Gait variability
Stride interval
Longitudinal biomarker data and health outcomes are routinely collected in many studies to assess how biomarker trajectories predict health outcomes. Existing methods primarily focus on mean biomarker profiles, treating variability as a nuisance. However, excess variability may indicate system dysregulation associated with poor outcomes. In this paper, we address the long-standing problem of using variability information from multiple longitudinal biomarkers in time-to-event analyses by formulating and studying a Bayesian joint model. We first model multiple longitudinal biomarkers, some of which are subject to limit-of-detection censoring. We then model the survival times by incorporating random effects and variances from the longitudinal component as predictors through threshold regression, which admits non-proportional hazards. We demonstrate the operating characteristics of the proposed joint model through simulations and apply it to data from the Study of Women's Health Across the Nation (SWAN) to investigate the impact of the mean and variability of follicle-stimulating hormone (FSH) and anti-Müllerian hormone (AMH) on age at the final menstrual period (FMP).
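For intuition about the threshold-regression component, a brief R sketch of the first-hitting-time survival and density functions of a Wiener process with unit variance, initial level y0 > 0, and drift mu; in the joint model, the longitudinal random effects and variances would enter through y0 and mu, and the full Bayesian joint model is not shown here:

```r
## First hitting time of zero for a Wiener process starting at y0 > 0 with
## drift mu and unit variance (threshold-regression survival component).
fht_surv <- function(t, y0, mu) {
  pnorm((y0 + mu * t) / sqrt(t)) -
    exp(-2 * mu * y0) * pnorm((mu * t - y0) / sqrt(t))
}
fht_dens <- function(t, y0, mu) {
  y0 / sqrt(2 * pi * t^3) * exp(-(y0 + mu * t)^2 / (2 * t))
}

## Example: hazards from two covariate profiles are non-proportional
t  <- seq(0.1, 10, by = 0.1)
h1 <- fht_dens(t, y0 = 2.0, mu = -0.3) / fht_surv(t, y0 = 2.0, mu = -0.3)
h2 <- fht_dens(t, y0 = 1.2, mu = -0.5) / fht_surv(t, y0 = 1.2, mu = -0.5)
round(range(h1 / h2), 2)                 # hazard ratio varies over time
```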
Keywords
Bayesian hierarchical model
Multiple biomarkers
Variability
Limit of detection
Time-to-event outcomes
Threshold regression
Variable weighted random forest (vwRF) is a variant of random forest (RF) that assigns different weights to feature sampling at each node of the trees during model construction. vwRF has shown strong prediction performance as a feature selection method in low signal-to-noise problems. However, it has not been studied with datasets from two-phase case-control studies, which suffer from low signal-to-noise ratios and class imbalance simultaneously. In this talk, we introduce a novel weighting strategy for vwRF to improve prediction in two-phase sampling designs facing these problems. The weights are based on RF permutation variable importance combined with the areas under the precision-recall and receiver operating characteristic curves. We demonstrate the improved prediction of the proposed methods through simulation studies and illustrate their use with an immunologic biomarker dataset from the RV144 phase 3 HIV vaccine efficacy trial.
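A hedged sketch of the general variable-weighting mechanism with simulated data (not the authors' PR-AUC/ROC-AUC weighting or the two-phase sampling design): derive feature-sampling weights from an initial permutation-importance fit and refit using ranger's split.select.weights argument.

```r
library(ranger)

set.seed(11)
n <- 400; p <- 50
X <- as.data.frame(matrix(rnorm(n * p), n, p))
y <- factor(rbinom(n, 1, plogis(1.5 * X[, 1] - 1.5 * X[, 2] - 2)))  # imbalanced classes
dat <- data.frame(y, X)

## Step 1: initial forest to obtain permutation variable importance
rf0 <- ranger(y ~ ., data = dat, importance = "permutation",
              probability = TRUE, num.trees = 500)
imp <- pmax(rf0$variable.importance, 0)

## Step 2: turn importance into feature-sampling weights and refit
w   <- (imp + 1e-6) / sum(imp + 1e-6)
rf1 <- ranger(y ~ ., data = dat, probability = TRUE, num.trees = 500,
              split.select.weights = as.numeric(w))
rf1$prediction.error
```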
Keywords
Variable weighted random forest
Two-phase case-control study
HIV