
Advances in Mendelian Randomization and Microbiome Studies

Judong Shen Chair
Merck & Co., Inc.

Monday, Aug 4: 10:30 AM - 12:20 PM
4046
Contributed Papers

Music City Center

Room: CC-207D

Main Sponsor

Section on Statistics in Genomics and Genetics

Presentations

A More Robust Approach to Multivariable Mendelian Randomization

Multivariable Mendelian randomization (MVMR) uses genetic variants as instrumental variables to infer the direct effects of multiple exposures on an outcome. However, unlike univariable MR, MVMR often faces greater challenges with many weak instruments, which can lead to bias not necessarily toward zero and inflation of type I errors. In this work, we introduce a new asymptotic regime that allows exposures to have varying degrees of instrument strength, providing a more accurate theoretical framework for studying MVMR estimators. Under this regime, our analysis of the widely used multivariable inverse-variance weighted method shows that it is often biased and tends to produce misleadingly narrow confidence intervals in the presence of many weak instruments. To address this, we propose a simple, closed-form modification to the multivariable inverse-variance weighted estimator to reduce bias from weak instruments, and additionally introduce a novel spectral regularization technique to improve finite-sample performance. We show that the resulting spectral-regularized estimator remains consistent and asymptotically normal under many weak instruments. Through simulations and real data applications, we demonstrate that our proposed estimator and asymptotic framework can enhance the robustness of MVMR analyses.
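For orientation, the standard multivariable inverse-variance weighted estimator that the abstract analyzes can be sketched as a weighted least-squares fit, without intercept, of SNP-outcome associations on SNP-exposure associations. This is a minimal sketch of the conventional estimator only, not the authors' debiased or spectral-regularized version; the function name `mv_ivw` and its argument names are illustrative assumptions.

```python
import numpy as np

def mv_ivw(beta_x, beta_y, se_y):
    """Conventional multivariable IVW estimate (hypothetical helper).

    beta_x : (p, K) estimated SNP-exposure associations for K exposures
    beta_y : (p,)   estimated SNP-outcome associations
    se_y   : (p,)   standard errors of beta_y
    Returns the K direct-effect estimates and their first-order standard
    errors, which treat beta_x as fixed (no allowance for weak instruments).
    """
    beta_x = np.asarray(beta_x, dtype=float)
    beta_y = np.asarray(beta_y, dtype=float)
    w = 1.0 / np.asarray(se_y, dtype=float) ** 2   # inverse-variance weights
    xtwx = beta_x.T @ (w[:, None] * beta_x)        # X' W X
    xtwy = beta_x.T @ (w * beta_y)                 # X' W y
    est = np.linalg.solve(xtwx, xtwy)
    se = np.sqrt(np.diag(np.linalg.inv(xtwx)))
    return est, se
```

Because the standard errors ignore the sampling noise in `beta_x`, they shrink as more instruments are added regardless of instrument strength, which is the setting in which the abstract reports misleadingly narrow confidence intervals.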

Keywords

Causal inference

genetic variation

GWAS

instrumental variable

weak instruments

Co-Author(s)

Hyunseung Kang, University of Wisconsin-Madison
Ting Ye, University of Washington

First Author

Yinxiang Wu, University of Washington

Presenting Author

Yinxiang Wu, University of Washington

Valid inference for two sample summary data Mendelian randomization

Mendelian randomization (MR) studies commonly utilize summary statistics from genome-wide association studies (GWASs). However, a rigorous theoretical foundation for this practice remains underdeveloped. Assuming that the instrumental single nucleotide polymorphisms (SNPs) are in linkage equilibrium, we derive exact analytical expressions for both the two-stage least squares (TSLS) estimator and the two-sample TSLS (TSTSLS) estimator, along with their corresponding variances, directly in terms of GWAS summary statistics. These derivations yield several important insights. Notably, the widely used inverse variance weighted (IVW) estimator is shown to be nonequivalent to either the TSLS or TSTSLS estimators. Furthermore, the standard error of the IVW estimator is inconsistent when the causal effect is non-zero. Given the proliferation of IVW-based methods in MR research, our findings underscore the need to critically reassess these approaches to ensure valid causal inference. We validate our theoretical results through extensive simulation studies and apply them to a broad spectrum of complex traits using publicly available GWAS summary data. The simulations reveal two novel findings: (1) causal effect estimates derived from GWAS summary statistics using the IVW, TSLS, or TSTSLS estimators exhibit finite-sample bias, as evidenced by the coverage rates of the 95% Wald confidence intervals; and (2) this bias has a negligible impact on the standard error of these estimators.

Keywords

Mendelian randomization

summary statistics

two-stage least-squares regression

inverse variance weighting

generalized method of moments

Co-Author

Grace Wang, Department of Biostatistics, Harvard T.H. Chan School of Public Health

First Author

Kai Wang, University of Iowa

Presenting Author

Kai Wang, University of Iowa

A Shiny App to Support Rigor and Reproducibility in Mendelian Randomization Studies

We present a Shiny app that supports and facilitates two-sample Mendelian randomization studies with genome-wide association study (GWAS) summary statistics. The proliferation of GWAS and the sharing of their marginal SNP association statistics have enabled researchers to address causal inference questions between two complex traits. Two-sample Mendelian randomization posits a causal relationship between a putative exposure and a putative outcome. Our Shiny app will enable researchers to input GWAS summary statistics for the putative outcome and putative exposure. The app supports diverse sensitivity analyses to assess the assumptions that underlie Mendelian randomization. To ensure computational reproducibility, the user can download an R Markdown file with all analysis code from our app. We also briefly discuss anticipated issues with app deployment.

Keywords

reproducibility

genetics

genome-wide association study

causal inference

software development

Co-Author

Ji Hoon Park, South Dakota State University

First Author

Frederick Boehm

Presenting Author

Frederick Boehm

Accounting for Unobserved Confounding to Reduce False Discoveries in Microbiome Research

Recent research has highlighted false discoveries in microbiome studies, particularly in differential abundance (DA) analyses. While data compositionality has received attention, we demonstrate that unobserved confounding (e.g., population heterogeneity, recent antibiotic use, or seasonal dietary changes) can be an even stronger driver of false discoveries. Using real-data evidence, we show that unobserved confounding inflates false discoveries in microbiome DA more than data compositionality does. To address this, we introduce a novel factor-modeling regression method, Microbiome Latent Confounder DA (MiLC), to estimate unobserved confounding factors and control false discoveries. MiLC can be applied to both relative abundance and read count microbiome data. We validate its performance in controlling false discoveries, relative to existing methods, using extensive simulation- and real-data-based benchmarking. Our results highlight the critical need to correct for hidden confounders, offering a more reliable framework for microbiome DA analyses and ultimately improving the robustness of microbiome research findings.

Keywords

False discovery rate

Unobserved confounding

Microbiome

Differential abundance

Latent factor models

Co-Author(s)

Eric Koplin, Vanderbilt University
Dong Wang, Harvard T.H. Chan School of Public Health
Tina Hartert, Vanderbilt University School of Medicine
Suman Das, Vanderbilt University
Yu Shyr, Vanderbilt University Medical Center
Chris McKennan, The University of Chicago
Siyuan Ma, Vanderbilt University Medical Center

First Author

Chih-Ting Yang, Vanderbilt University

Presenting Author

Chih-Ting Yang, Vanderbilt University

DHT: A nonparametric test for homogeneity of multivariate dispersions

Testing homogeneity across groups in multivariate data is often a standalone scientific question as well as an auxiliary step in verifying the assumptions of ANOVA. Existing methods construct test statistics based on either the distance of each observation from its group center or the mean pairwise dissimilarity of the data points in a group. Both approaches can fail when the mean within-group distance is similar across groups but the distributions of the within-group distances differ. This is a pertinent concern in high-dimensional microbiome data, where outliers and overdispersion can distort the performance of a test based on mean dissimilarity. We introduce a nonparametric Distance-based Homogeneity Test (DHT), which combines information from the Kolmogorov-Smirnov and Wasserstein distances between the within-group dissimilarities for each pair of groups. The pairwise group tests are then combined to provide a permutation-based p-value. Through simulations, we show that our method has higher power than existing tests for homogeneity in certain situations. We also provide a general framework for extending the test to a continuous covariate.
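The core idea of comparing whole distributions of within-group dissimilarities, rather than their means, can be illustrated for two groups as follows. This is a rough sketch under stated assumptions, not the DHT procedure itself: it uses Euclidean distances, permutes group labels directly, and reports a separate permutation p-value for each distance because the abstract does not say how the two are combined. All function names are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import wasserstein_distance

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return np.max(np.abs(cdf_a - cdf_b))

def within_group_stats(x, labels):
    """KS and Wasserstein distances between the within-group pairwise
    Euclidean distances of group 0 and group 1."""
    d0 = pdist(x[labels == 0])
    d1 = pdist(x[labels == 1])
    return np.array([ks_statistic(d0, d1), wasserstein_distance(d0, d1)])

def permutation_pvalues(x, labels, n_perm=199, seed=0):
    """Observed statistics and label-permutation p-values (assumed scheme)."""
    rng = np.random.default_rng(seed)
    obs = within_group_stats(x, labels)
    exceed = np.zeros(2)
    for _ in range(n_perm):
        exceed += within_group_stats(x, rng.permutation(labels)) >= obs
    return obs, (exceed + 1) / (n_perm + 1)

# Toy data: same center, different spread (standard deviation 1 versus 3).
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(0, 3, (30, 5))])
labels = np.repeat([0, 1], 30)
obs, pvals = permutation_pvalues(x, labels)
```

With more than two groups, a statistic of this kind would be computed for each pair of groups and then combined, as the abstract describes.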

Keywords

permutation tests

ANOVA

multivariate tests

nonparametric

Wasserstein Distance

Kolmogorov-Smirnov

Co-Author(s)

Ni Zhao, Johns Hopkins University
Glen Satten, Emory University School of Medicine

First Author

Asmita Roy, Johns Hopkins University School of Public Health

Presenting Author

Asmita Roy, Johns Hopkins University School of Public Health

Microbiome Data Integration via Shared Dictionary Learning

Data integration is a powerful tool for facilitating a comprehensive understanding of microbial communities and their association with outcomes of interest. However, integrating data sets from different studies remains a challenging problem because of severe batch effects, unobserved confounding variables, and high heterogeneity across data sets. We propose a new data integration method called MetaDICT, which initially estimates the batch effects using weighting methods from the causal inference literature and then refines the estimation via a novel shared dictionary learning approach. Compared with existing methods, MetaDICT can better avoid the overcorrection of batch effects and preserve biological variation when there are unobserved confounding variables or when data sets are highly heterogeneous across studies. Applications to synthetic and real microbiome data sets demonstrate the robustness and effectiveness of MetaDICT in integrative analysis. Using MetaDICT, we characterize microbial interactions, identify generalizable microbial signatures, and enhance the accuracy of disease prediction in an integrative analysis of colorectal cancer metagenomics studies.

Keywords

data integration

shared dictionary learning

batch effect

microbiome

embedding

Co-Author

Shulei Wang

First Author

Bo Yuan

Presenting Author

Bo Yuan

Wald-Based Weighted Logistic Regression for Differential Abundance Analysis in Microbiome Data

Recent advances in sequencing technologies have vastly increased microbiome data availability and depth, posing significant computational and statistical challenges. While LOCOM provides strong FDR control and high sensitivity for differential abundance testing, its permutation-based framework becomes computationally expensive at large scales. Moreover, large datasets frequently exhibit batch effects and substantial library size variations, potentially confounding disease associations. Because LOCOM's likelihood-based estimation inherently upweights high-depth samples, these disparities can further bias results. We introduce an M-estimator-based weighted logistic regression for more balanced weighting and use a computationally efficient alternative that replaces permutation-based inference with a Wald test. In addition to supporting equal weighting to mitigate biases, our approach accommodates relative abundance data, whereas LOCOM only accepts count data. Through realistic simulations, we show that our method is computationally efficient and offers robust FDR control.
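The general technique named in the abstract, a weighted logistic estimating equation with a Wald test in place of permutation, can be sketched for a single taxon as below. This is a generic illustration and not the authors' method or LOCOM: the estimating equation, the robust (sandwich) covariance, and every function and argument name are assumptions. Setting `weights` to the library sizes reproduces the binomial likelihood score, which upweights high-depth samples; setting them to one weights every sample equally.

```python
import math
import numpy as np

def fit_weighted_logistic(x, counts, depth, weights, n_iter=50, tol=1e-10):
    """Solve sum_i w_i * x_i * (r_i - expit(x_i' beta)) = 0 by Newton's method,
    where r_i = counts_i / depth_i. Returns beta and a sandwich covariance."""
    x = np.asarray(x, dtype=float)
    r = np.asarray(counts, dtype=float) / np.asarray(depth, dtype=float)
    w = np.asarray(weights, dtype=float)
    beta = np.zeros(x.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-x @ beta))
        score = x.T @ (w * (r - p))
        bread = x.T @ ((w * p * (1.0 - p))[:, None] * x)
        step = np.linalg.solve(bread, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Sandwich covariance: bread^{-1} * meat * bread^{-1} at the solution.
    p = 1.0 / (1.0 + np.exp(-x @ beta))
    bread = x.T @ ((w * p * (1.0 - p))[:, None] * x)
    meat = x.T @ (((w * (r - p)) ** 2)[:, None] * x)
    bread_inv = np.linalg.inv(bread)
    return beta, bread_inv @ meat @ bread_inv

def wald_pvalue(beta, cov, j):
    """Two-sided Wald p-value for H0: beta[j] = 0 (normal reference)."""
    z = beta[j] / math.sqrt(cov[j, j])
    return math.erfc(abs(z) / math.sqrt(2.0))
```

Since only the ratio `counts / depth` enters the equation, the same sketch accepts relative abundances directly by passing them as `counts` with `depth` set to one.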

Keywords

large-scale microbiome data

differential abundance testing

M-estimator

FDR Control

relative abundance data

Co-Author(s)

Yijuan Hu, Emory University, Department of Biostatistics & Bioinformatics
Glen Satten, Emory University School of Medicine

First Author

Mengyu He, Emory University, Rollins School of Public Health

Presenting Author

Mengyu He, Emory University, Rollins School of Public Health