Thursday, Aug 7: 8:30 AM - 10:20 AM
4207
Contributed Papers
Music City Center
Room: CC-207B
Main Sponsor
ENAR
Presentations
The study of compositional microbiome data is crucial for understanding microbial roles in health and disease. Analysis of such data has shifted from traditional log-ratio transformations to methods that enforce a sum-to-zero constraint on the regression coefficients. However, penalized regression provides only point estimates, while Markov chain Monte Carlo (MCMC) methods, though accurate, are computationally intensive for high-dimensional data.
We propose Bayesian generalized linear models for analyzing compositional and sub-compositional microbiome data. The model uses a spike-and-slab double-exponential prior, enabling weak shrinkage of important coefficients and strong shrinkage of irrelevant ones. The sum-to-zero constraint is handled via soft-centering, which places a prior distribution on the sum of the coefficients. A fast and stable algorithm integrates EM steps into the iteratively weighted least squares (IWLS) algorithm to improve computational efficiency.
Extensive simulations show that our method outperforms existing approaches in accuracy and prediction. We applied it to a microbiome study, identifying microorganisms linked to inflammatory bowel disease (IBD). The method is available in the R package BhGLM (https://github.com/nyiuab/BhGLM).
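As a rough illustration of the soft-centering idea only (a minimal base-R sketch with simulated data and a hypothetical function name, not the BhGLM interface, and with a single double-exponential penalty rather than the spike-and-slab mixture), the sum-to-zero constraint can be imposed softly by adding a tight zero-mean prior on the sum of the compositional coefficients to a penalized logistic log-likelihood:

```r
## Hypothetical sketch: ridge-free logistic fit with double-exponential
## (Laplace) shrinkage plus a soft sum-to-zero penalty, i.e. a normal prior
## centered at zero with small variance tau2 on sum(beta).
soft_centered_fit <- function(X, y, lambda = 1, tau2 = 0.01) {
  p <- ncol(X)
  negloglik <- function(beta) {
    eta <- X %*% beta
    -sum(y * eta - log(1 + exp(eta))) +   # negative logistic log-likelihood
      lambda * sum(abs(beta)) +           # double-exponential shrinkage
      sum(beta)^2 / (2 * tau2)            # soft-centering: prior on sum of coefficients
  }
  optim(rep(0, p), negloglik, method = "BFGS")$par
}

## Toy usage with simulated log relative abundances
set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- rbinom(100, 1, plogis(X[, 1] - X[, 2]))
round(soft_centered_fit(X, y), 3)
```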
Keywords
Bayesian GLMs
Compositional data
EM algorithm
Microbiome
Sum-to-zero constraint
Spike-and-slab priors
Survival analysis integrating microbiome and clinical data offers powerful insights into human health. The microbiome influences immunity, inflammation, and cancer outcomes. Combining microbial profiles with clinical factors like tumor stage, treatment, and demographics enhances understanding of how host-microbe interactions affect survival.
To analyze these complex datasets, advanced statistical methods, including Bayesian models, Cox proportional hazards models, and machine learning techniques, are employed. These approaches can uncover novel biomarkers and therapeutic targets, leading to personalized treatment strategies that optimize patient outcomes. However, challenges remain due to the compositionality and high dimensionality of microbiome data and the phylogenetic relationships among taxa.
In this paper, we propose a Bayesian compositional Cox proportional hazards model with a regularized horseshoe prior for analyzing compositional microbiome and clinical data. We apply a soft sum-to-zero constraint to the microbiome coefficients to address compositionality, and we introduce a structured shrinkage prior that incorporates similarity among taxa to account for the phylogenetic structure. To evaluate the predictive performance of our model, we conducted extensive simulation studies and analyzed a real dataset. The implementation was carried out using the R package brms, with results summarized from two Markov chain Monte Carlo (MCMC) algorithms executed in Stan.
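A minimal sketch of the brms building block, using toy simulated data with hypothetical column names; the soft sum-to-zero constraint and the phylogeny-informed structured prior described above would require additional custom Stan code and are not shown:

```r
library(brms)

## Toy stand-in data (hypothetical names; a real analysis would use
## log-transformed taxon abundances and clinical covariates)
set.seed(123)
n <- 150
dat <- data.frame(
  age    = rnorm(n, 60, 10),
  taxon1 = rnorm(n), taxon2 = rnorm(n), taxon3 = rnorm(n)
)
dat$time   <- rexp(n, rate = exp(-3 + 0.4 * dat$taxon1))
dat$status <- rbinom(n, 1, 0.7)          # 1 = event, 0 = right-censored

## Cox model with a regularized horseshoe prior on the regression coefficients
fit <- brm(
  time | cens(1 - status) ~ age + taxon1 + taxon2 + taxon3,
  data   = dat,
  family = brmsfamily("cox"),
  prior  = set_prior(horseshoe(df = 1, par_ratio = 0.1), class = "b"),
  chains = 2, iter = 2000, seed = 123
)
summary(fit)
```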
Keywords
Survival analysis
High-dimensional
Compositional
Horseshoe prior
MCMC
Soft sum-to-zero constraint
Wearable devices, such as actigraphy monitors and continuous glucose monitors (CGMs), capture high-frequency data, typically summarized by the percentage of time spent within fixed thresholds. For example, CGM data are categorized into hypoglycemia, normoglycemia, and hyperglycemia based on a standard glucose range of 70–180 mg/dL. Although scientific guidelines inform the choice of thresholds, it remains unclear whether this choice is optimal and whether the same thresholds should be applied across different populations. In this work, we define threshold optimality with loss functions that quantify discrepancies between the empirical distributions of wearable device measurements and threshold-based summaries. Using the Wasserstein distance as the base measure, we reformulate the loss minimization as optimal piecewise linearization of quantile functions, solved via stepwise algorithms and differential evolution. We also formulate semi-supervised approaches that incorporate some predefined thresholds based on scientific rationale. Applications to CGM data reveal that data-driven thresholds differ by population and improve discriminative power over fixed thresholds.
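As a rough base-R sketch with simulated readings (not the stepwise or differential-evolution algorithms described above), a candidate set of thresholds can be scored by the Wasserstein-1 distance between the empirical quantile function and the piecewise-linear quantile function induced by those thresholds:

```r
## Score glucose thresholds by the approximate W1 distance between the
## empirical quantile function and its threshold-induced piecewise-linear fit.
set.seed(42)
glucose <- rgamma(5000, shape = 20, rate = 0.15)   # toy CGM readings (mg/dL)

w1_loss <- function(thresholds, x, grid = seq(0.001, 0.999, length.out = 500)) {
  thresholds <- sort(thresholds)
  q_emp   <- quantile(x, grid, names = FALSE)      # empirical quantile function
  knots_p <- c(0, ecdf(x)(thresholds), 1)          # probabilities at the thresholds
  knots_q <- c(min(x), thresholds, max(x))         # glucose values at the knots
  q_approx <- approx(knots_p, knots_q, xout = grid, ties = "ordered")$y
  mean(abs(q_emp - q_approx))                      # approximate W1 distance
}

## Standard 70/180 mg/dL thresholds vs. a crude data-driven search
w1_loss(c(70, 180), glucose)
fit <- optim(c(70, 180), w1_loss, x = glucose, method = "Nelder-Mead")
fit$par; fit$value
```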
Keywords
Amalgamation
Continuous glucose monitoring (CGM)
Histogram
Time-in-Range (TIR)
Piecewise linearization
Wasserstein distance
Cluster randomized trials (CRTs) are essential for evaluating cluster-level interventions in medicine and public health. However, many CRTs include only a few clusters, such as hospital-based interventions where a small number of large hospitals are randomized. Conventional methods often require at least 30–40 clusters for reliable inference. This study uses simulations to explore statistical methods for CRTs with binary outcomes when there are ≤10 clusters with large sizes. We investigate whether asymptotic properties hold in this challenging yet common scenario.
We compare generalized estimating equations (GEE), generalized linear mixed models (GLMM), cluster-level summaries (CLS), and randomization-based (RB) methods. Simulations show that GLMM and CLS performed best in terms of Type I error and power. RB maintained Type I error but lagged behind CLS and GLMM in power. GEE had the worst Type I error control: the standard sandwich variance estimator inflated Type I error, while bias-corrected versions tended to yield Type I error below the nominal level. These findings can better guide the choice of analytic methods for CRTs with few but large clusters, ensuring more robust inference in real-world settings.
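An illustrative sketch of a single simulation replicate, simplified relative to the full study design, comparing GLMM (lme4), GEE (geepack), and a cluster-level summary analysis for a binary outcome with eight large clusters:

```r
library(lme4)
library(geepack)

set.seed(2025)
n_clusters <- 8; cluster_size <- 500
trt     <- rep(rep(0:1, each = n_clusters / 2), each = cluster_size)
cluster <- rep(seq_len(n_clusters), each = cluster_size)
u <- rnorm(n_clusters, sd = 0.3)                      # cluster random effects
y <- rbinom(n_clusters * cluster_size, 1,
            plogis(-0.5 + 0.4 * trt + u[cluster]))
dat <- data.frame(y, trt, cluster = factor(cluster))

## GLMM with a random intercept per cluster
glmm <- glmer(y ~ trt + (1 | cluster), data = dat, family = binomial)

## GEE with exchangeable working correlation (standard sandwich variance)
gee <- geeglm(y ~ trt, id = cluster, data = dat,
              family = binomial, corstr = "exchangeable")

## Cluster-level summary: t-test on cluster-specific event proportions
p_clust <- tapply(dat$y, dat$cluster, mean)
g_clust <- tapply(dat$trt, dat$cluster, mean)
cls <- t.test(p_clust[g_clust == 1], p_clust[g_clust == 0])

summary(glmm); summary(gee); cls
```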
Keywords
Cluster Randomized Trials
Multilevel Models
Type I Error
Simulation Study
Few Clusters
Inference
The frequency domain properties of biomedical signals offer valuable insights into health and functioning of underlying physiological systems. The power spectrum, which characterizes these properties, is often summarized by partitioning frequencies into standard bands and averaging power within bands. These summary measures are regularly used for analysis, but are not guaranteed to optimally retain differences in power spectra across signals from different participants. We propose a data-adaptive method for identifying frequency band summary measures that preserve spectral variability within a population of interest. The method can also identify subpopulations with distinct power spectra and summary measures that best characterize local dynamics. Validation criteria are developed to select a reasonable number of bands and subpopulations. An evolutionary algorithm is designed to simultaneously identify subpopulations and their corresponding summary measures. The method is used to analyze stride interval series from patients with different neurological disorders, revealing distinct subpopulations and the need for subpopulation-dependent summary measures.
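A toy base-R sketch of the underlying idea, with simulated series, a simple periodogram instead of multitaper estimation, and a random search standing in for the evolutionary algorithm: candidate band edges are scored by how well band-averaged power reproduces each series' log spectrum.

```r
set.seed(7)
series <- replicate(20, arima.sim(list(ar = runif(1, 0.2, 0.8)), n = 512))

band_loss <- function(edges, series) {
  edges <- sort(edges)
  loss <- 0
  for (j in seq_len(ncol(series))) {
    sp   <- spec.pgram(series[, j], taper = 0.1, plot = FALSE)
    lsp  <- log(sp$spec)
    band <- cut(sp$freq, breaks = c(0, edges, 0.5), include.lowest = TRUE)
    fit  <- ave(lsp, band)                 # band-wise constant approximation
    loss <- loss + mean((lsp - fit)^2)
  }
  loss / ncol(series)
}

## Random search over two interior band edges in (0, 0.5)
cands  <- replicate(200, sort(runif(2, 0.01, 0.49)))
losses <- apply(cands, 2, band_loss, series = series)
cands[, which.min(losses)]
```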
Keywords
Spectral analysis
Cluster validation
Evolutionary algorithm
Multitaper estimation
Gait variability
Stride interval
Longitudinal biomarker data and health outcomes are routinely collected in many studies to assess how biomarker trajectories predict health outcomes. Existing methods primarily focus on mean biomarker profiles, treating variability as a nuisance. However, excess variability may indicate system dysregulation associated with poor outcomes. In this paper, we address the long-standing problem of using variability information from multiple longitudinal biomarkers in time-to-event analyses by formulating and studying a Bayesian joint model. We first model multiple longitudinal biomarkers, some of which are subject to limit-of-detection censoring. We then model the survival times by incorporating random effects and variances from the longitudinal component as predictors through threshold regression, which admits non-proportional hazards. We demonstrate the operating characteristics of the proposed joint model through simulations and apply it to data from the Study of Women's Health Across the Nation (SWAN) to investigate the impact of the mean and variability of follicle-stimulating hormone (FSH) and anti-Müllerian hormone (AMH) on age at the final menstrual period (FMP).
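For intuition about the threshold-regression component, a brief R sketch of the first-hitting-time survival and density functions of a Wiener process with unit variance, initial level y0 > 0, and drift mu; in the joint model, the longitudinal random effects and variances would enter through y0 and mu, and the full Bayesian joint model is not shown here:

```r
## First hitting time of zero for a Wiener process starting at y0 > 0 with
## drift mu and unit variance (threshold-regression survival component).
fht_surv <- function(t, y0, mu) {
  pnorm((y0 + mu * t) / sqrt(t)) -
    exp(-2 * mu * y0) * pnorm((mu * t - y0) / sqrt(t))
}
fht_dens <- function(t, y0, mu) {
  y0 / sqrt(2 * pi * t^3) * exp(-(y0 + mu * t)^2 / (2 * t))
}

## Example: hazards from two covariate profiles are non-proportional
t  <- seq(0.1, 10, by = 0.1)
h1 <- fht_dens(t, y0 = 2.0, mu = -0.3) / fht_surv(t, y0 = 2.0, mu = -0.3)
h2 <- fht_dens(t, y0 = 1.2, mu = -0.5) / fht_surv(t, y0 = 1.2, mu = -0.5)
round(range(h1 / h2), 2)                 # hazard ratio varies over time
```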
Keywords
Bayesian hierarchical model
Multiple biomarkers
Variability
Limit of detection
Time-to-event outcomes
Threshold regression
Variable weighted random forest (vwRF) is a variant of random forest (RF) that assigns different weights to feature sampling at each node of the trees during model construction. vwRF has shown strong prediction performance as a feature selection method in low signal-to-noise problems. However, it has not been studied with datasets from two-phase case-control studies, which suffer from low signal-to-noise ratios and class imbalance simultaneously. In this talk, we introduce a novel weighting strategy for vwRF to improve prediction in two-phase sampling designs facing these problems. The weights are based on RF permutation variable importance combined with the areas under the precision-recall and receiver operating characteristic curves. We demonstrate the improved prediction of the proposed methods through simulation studies and illustrate their use with an immunologic biomarker dataset from the RV144 phase 3 HIV vaccine efficacy trial.
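A hedged sketch of the general variable-weighting mechanism with simulated data (not the authors' PR-AUC/ROC-AUC weighting or the two-phase sampling design): derive feature-sampling weights from an initial permutation-importance fit and refit using ranger's split.select.weights argument.

```r
library(ranger)

set.seed(11)
n <- 400; p <- 50
X <- as.data.frame(matrix(rnorm(n * p), n, p))
y <- factor(rbinom(n, 1, plogis(1.5 * X[, 1] - 1.5 * X[, 2] - 2)))  # imbalanced classes
dat <- data.frame(y, X)

## Step 1: initial forest to obtain permutation variable importance
rf0 <- ranger(y ~ ., data = dat, importance = "permutation",
              probability = TRUE, num.trees = 500)
imp <- pmax(rf0$variable.importance, 0)

## Step 2: turn importance into feature-sampling weights and refit
w   <- (imp + 1e-6) / sum(imp + 1e-6)
rf1 <- ranger(y ~ ., data = dat, probability = TRUE, num.trees = 500,
              split.select.weights = as.numeric(w))
rf1$prediction.error
```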
Keywords
Variable weighted random forest
Two-phase case-control study
HIV