Monday, Aug 4: 10:30 AM - 12:20 PM
4046
Contributed Papers
Music City Center
Room: CC-207D
Main Sponsor
Section on Statistics in Genomics and Genetics
Presentations
Multivariable Mendelian randomization (MVMR) uses genetic variants as instrumental variables to infer the direct effects of multiple exposures on an outcome. However, unlike univariable MR, MVMR often faces greater challenges with many weak instruments, which can lead to bias not necessarily toward zero and inflation of type I errors. In this work, we introduce a new asymptotic regime that allows exposures to have varying degrees of instrument strength, providing a more accurate theoretical framework for studying MVMR estimators. Under this regime, our analysis of the widely used multivariable inverse-variance weighted method shows that it is often biased and tends to produce misleadingly narrow confidence intervals in the presence of many weak instruments. To address this, we propose a simple, closed-form modification to the multivariable inverse-variance weighted estimator to reduce bias from weak instruments, and additionally introduce a novel spectral regularization technique to improve finite-sample performance. We show that the resulting spectral-regularized estimator remains consistent and asymptotically normal under many weak instruments. Through simulations and real data applications, we demonstrate that our proposed estimator and asymptotic framework can enhance the robustness of MVMR analyses.
Keywords
Causal inference
genetic variation
GWAS
instrumental variable
weak instruments
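For orientation, the standard multivariable inverse-variance weighted estimator that the abstract above analyzes can be sketched as a weighted least squares fit without an intercept. This is a minimal illustration using NumPy, not the bias-corrected or spectral-regularized estimator the authors propose; the function name `mv_ivw` and its argument names are hypothetical.

```python
import numpy as np

def mv_ivw(beta_exposures, beta_outcome, se_outcome):
    """Standard multivariable IVW estimate (illustrative sketch only).

    beta_exposures: (p, K) SNP-exposure association estimates
    beta_outcome:   (p,)   SNP-outcome association estimates
    se_outcome:     (p,)   standard errors of the SNP-outcome estimates
    """
    B = np.asarray(beta_exposures, dtype=float)
    g = np.asarray(beta_outcome, dtype=float)
    w = 1.0 / np.asarray(se_outcome, dtype=float) ** 2  # inverse-variance weights

    BtWB = B.T @ (B * w[:, None])  # K x K weighted Gram matrix
    BtWg = B.T @ (w * g)           # K-vector of weighted cross-products
    estimate = np.linalg.solve(BtWB, BtWg)

    # Conventional covariance that treats the SNP-exposure estimates as noise-free.
    # Ignoring that noise is what makes the intervals too narrow when many
    # instruments are weak.
    std_err = np.sqrt(np.diag(np.linalg.inv(BtWB)))
    return estimate, std_err
```

With noise-free inputs the function recovers the true direct effects exactly; the weak-instrument bias discussed in the abstract arises once the SNP-exposure estimates carry sampling error.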
Mendelian randomization (MR) studies commonly utilize summary statistics from genome-wide association studies (GWASs). However, a rigorous theoretical foundation for this practice remains underdeveloped. Assuming that the instrumental single nucleotide polymorphisms (SNPs) are in linkage equilibrium, we derive exact analytical expressions for both the two-stage least squares (TSLS) estimator and the two-sample TSLS (TSTSLS) estimator, along with their corresponding variances, directly in terms of GWAS summary statistics. These derivations yield several important insights. Notably, the widely used inverse variance weighted (IVW) estimator is shown to be nonequivalent to either the TSLS or TSTSLS estimators. Furthermore, the standard error of the IVW estimator is inconsistent when the causal effect is non-zero. Given the proliferation of IVW-based methods in MR research, our findings underscore the need to critically reassess these approaches to ensure valid causal inference. We validate our theoretical results through extensive simulation studies and apply them to a broad spectrum of complex traits using publicly available GWAS summary data. The simulations reveal two novel findings: (1) causal effect estimates derived from GWAS summary statistics using the IVW, TSLS, or TSTSLS estimators exhibit finite sample bias, as evidenced by the coverage rates of the 95% Wald confidence intervals; and (2) this bias has a negligible impact on the standard error of these estimators.
Keywords
Mendelian randomization
summary statistics
two-stage least-squares regression
inverse variance weighting
generalized method of moments
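As a point of reference for the abstract above, the conventional univariable IVW estimator, its usual fixed-effect standard error, and the resulting Wald interval can be written in a few lines. This is a sketch of the textbook formulas under the assumption of independent SNPs, not the TSLS or TSTSLS expressions derived by the authors; the function name `ivw` and its arguments are hypothetical.

```python
import math

def ivw(beta_exposure, beta_outcome, se_outcome, z=1.96):
    """Conventional IVW estimate with a Wald confidence interval (sketch).

    Each argument is a sequence with one entry per SNP. The standard error
    uses the usual first-order formula, which treats the SNP-exposure
    associations as fixed and known.
    """
    num = sum(bx * by / s**2
              for bx, by, s in zip(beta_exposure, beta_outcome, se_outcome))
    den = sum(bx**2 / s**2 for bx, s in zip(beta_exposure, se_outcome))
    estimate = num / den
    std_err = math.sqrt(1.0 / den)
    wald_ci = (estimate - z * std_err, estimate + z * std_err)
    return estimate, std_err, wald_ci


# Toy summary statistics in which every SNP implies a causal effect of 0.5.
est, se, ci = ivw([0.1, 0.2, 0.3], [0.05, 0.10, 0.15], [0.01, 0.01, 0.01])
```

Note that the standard error here does not depend on the causal effect at all, which gives some intuition for why it may behave poorly when that effect is non-zero.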
Co-Author
Grace Wang, Department of Biostatistics, Harvard T.H. Chan School of Public Health
First Author
Kai Wang, University of Iowa
Presenting Author
Kai Wang, University of Iowa
We present a Shiny app that supports and facilitates two-sample Mendelian randomization studies with genome-wide association study (GWAS) summary statistics. The proliferation of GWAS and the sharing of their marginal SNP association statistics have enabled researchers to address causal inference questions between two complex traits. Two-sample Mendelian randomization posits a causal relationship between a putative exposure and a putative outcome. Our Shiny app will enable researchers to input GWAS summary statistics for the putative outcome and putative exposure. The app supports diverse sensitivity analyses to assess the assumptions that underlie Mendelian randomization. To ensure computational reproducibility, the user can download an R Markdown file containing all analysis code from our app. We also briefly discuss anticipated issues with app deployment.
Keywords
reproducibility
genetics
genome-wide association study
causal inference
software development
Recent research has highlighted false discoveries in microbiome studies, particularly in differential abundance (DA) analyses. While data compositionality has received attention, we demonstrate that unobserved confounding (e.g., population heterogeneity, recent antibiotic use, or seasonal dietary changes) can be an even stronger driver of false discoveries. Using real data, we show that unobserved confounding inflates false discoveries in microbiome DA more than data compositionality does. To address this, we introduce a novel factor-modeling regression method, Microbiome Latent Confounder DA (MiLC), to estimate unobserved confounding factors and control false discoveries. MiLC can be applied to both relative abundance and read count microbiome data. We validate its performance in controlling false discoveries, relative to existing methods, through extensive benchmarking on simulated and real data. Our results highlight the critical need to correct for hidden confounders, offering a more reliable framework for microbiome DA analyses and ultimately improving the robustness of microbiome research findings.
Keywords
False discovery rate
Unobserved confounding
Microbiome
Differential abundance
Latent factor models
Testing homogeneity across groups in multivariate data is often a standalone scientific question as well as an auxiliary step in verifying the assumptions of ANOVA. Existing methods construct test statistics based either on the distance of each observation from its group center or on the mean pairwise dissimilarity of the data points within a group. Both approaches can fail when the mean within-group distance is similar across groups but the distributions of the within-group distances differ. This is a pertinent concern in high-dimensional microbiome data, where outliers and overdispersion can distort the performance of a test based on mean dissimilarity. We introduce a non-parametric Distance based Homogeneity Test (DHT), which combines the information provided by the Kolmogorov-Smirnov and Wasserstein distances between the within-group dissimilarities for each pair of groups. The pairwise group tests are then combined to provide a permutation-based p-value. Through simulations, we show that our method has higher power than existing tests for homogeneity in certain situations. We also provide a general framework for extending the test to a continuous covariate.
Keywords
permutation tests
ANOVA
multivariate tests
nonparametric
Wasserstein Distance
Kolmogorov-Smirnov
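To make the two building blocks named in the abstract above concrete, the following sketch computes the Kolmogorov-Smirnov statistic and the Wasserstein distance between the within-group pairwise dissimilarities of two groups, using SciPy. It is not the DHT itself: how the authors combine the two distances and the pairwise group tests is not specified here, and the function name `dispersion_distances`, the use of all pairwise within-group distances, and the default Euclidean metric are assumptions made for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import ks_2samp, wasserstein_distance

def dispersion_distances(group_a, group_b, metric="euclidean"):
    """Compare the distributions of within-group pairwise dissimilarities.

    group_a, group_b: arrays of shape (n_samples, n_features).
    Returns (KS statistic, Wasserstein distance) between the two sets of
    within-group pairwise distances.
    """
    d_a = pdist(np.asarray(group_a, dtype=float), metric=metric)
    d_b = pdist(np.asarray(group_b, dtype=float), metric=metric)
    ks_stat = ks_2samp(d_a, d_b).statistic
    w_dist = wasserstein_distance(d_a, d_b)
    return float(ks_stat), float(w_dist)


# Two synthetic groups with the same center but different spread.
rng = np.random.default_rng(0)
tight = rng.normal(size=(30, 5))
spread = 3.0 * rng.normal(size=(30, 5))
ks_stat, w_dist = dispersion_distances(tight, spread)
```

A permutation p-value could then be obtained by shuffling group labels and recomputing a chosen combination of the two statistics. For microbiome data, a metric such as `"braycurtis"` may be more appropriate than the Euclidean default.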
Co-Author(s)
Ni Zhao, Johns Hopkins University
Glen Satten, Emory University School of Medicine
First Author
Asmita Roy, Johns Hopkins University School of Public Health
Presenting Author
Asmita Roy, Johns Hopkins University School of Public Health
Data integration is a powerful tool for facilitating a comprehensive understanding of microbial communities and their association with outcomes of interest. However, integrating data sets from different studies remains a challenging problem because of severe batch effects, unobserved confounding variables, and high heterogeneity across data sets. We propose a new data integration method called MetaDICT, which initially estimates the batch effects using weighting methods from the causal inference literature and then refines the estimation via a novel shared dictionary learning approach. Compared with existing methods, MetaDICT can better avoid the overcorrection of batch effects and preserve biological variation when unobserved confounding variables exist or data sets are highly heterogeneous across studies. Applications to synthetic and real microbiome data sets demonstrate the robustness and effectiveness of MetaDICT in integrative analysis. Using MetaDICT, we characterize microbial interactions, identify generalizable microbial signatures, and enhance the accuracy of disease prediction in an integrative analysis of colorectal cancer metagenomics studies.
Keywords
data integration
shared dictionary learning
batch effect
microbiome
embedding
Recent advances in sequencing technologies have vastly increased microbiome data availability and depth, posing significant computational and statistical challenges. While LOCOM provides strong FDR control and high sensitivity for differential abundance testing, its permutation-based framework becomes computationally expensive at large scales. Moreover, large datasets frequently exhibit batch effects and substantial library size variations, potentially confounding disease associations. Because LOCOM's likelihood-based estimation inherently upweights high-depth samples, these disparities can further bias results. We introduce an M-estimator-based weighted logistic regression for more balanced weighting and use a computationally efficient alternative that replaces permutation-based inference with a Wald test. In addition to supporting equal weighting to mitigate biases, our approach accommodates relative abundance data, whereas LOCOM only accepts count data. Through realistic simulations, we show that our method is computationally efficient and offers robust FDR control.
Keywords
large-scale microbiome data
differential abundance testing
M-estimator
FDR Control
relative abundance data
Co-Author(s)
Yijuan Hu, Emory University, Department of Biostatistics & Bioinformatics
Glen Satten, Emory University School of Medicine
First Author
Mengyu He, Emory University, Rollins School of Public Health
Presenting Author
Mengyu He, Emory University, Rollins School of Public Health