Emerging Statistical/Computational Methods for Genomic Data Science in the Era of Diverse Biobanks

Xihao Li Chair
University of North Carolina at Chapel Hill
 
Xihao Li Organizer
University of North Carolina at Chapel Hill
 
Wednesday, Aug 6: 8:30 AM - 10:20 AM
0265 
Invited Paper Session 
Music City Center 
Room: CC-103B 

Applied

Yes

Main Sponsor

Section on Statistics in Genomics and Genetics

Presentations

Testing a large number of composite null hypotheses for mediation, pleiotropy, and replication analyses in genome-wide studies

Causal mediation, pleiotropy, and replication analyses are three highly popular genetic study designs. Although these analyses address different scientific questions, the underlying statistical inference problems all involve large-scale testing of composite null hypotheses. The goal is to determine whether all null hypotheses (as opposed to at least one) in a set of individual tests should simultaneously be rejected. Recently, various methods have been proposed for each of these situations, including an appealing two-group empirical Bayes approach that calculates local false discovery rates (lfdr). However, lfdr estimation is difficult due to the need for multivariate density estimation. Furthermore, the multiple testing rules for the empirical Bayes lfdr approach can disagree with conventional frequentist z-statistics, which is troubling for a field that ubiquitously uses summary statistics. This work proposes a framework to unify two-group testing in genetic association composite null settings, the conditionally symmetric multidimensional Gaussian mixture model (csmGmm). Crucially, the csmGmm offers interpretability guarantees by harmonizing lfdr and z-statistic testing rules. We apply the model to a collection of translational lung cancer genetic association studies that motivated this work. 
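To make the composite null concrete, the following is a minimal sketch of the conventional max-P (joint significance) rule for pairs of z-statistics, where a pair is declared significant only if both component tests are. This is not the csmGmm described in the abstract; the function names `two_sided_p` and `max_p` are hypothetical and chosen for illustration only.

```python
import numpy as np
from math import erfc, sqrt


def two_sided_p(z):
    """Two-sided p-value for a z-statistic under a standard normal null."""
    return erfc(abs(z) / sqrt(2.0))


def max_p(z_pairs):
    """Max-P p-value for each pair of z-statistics.

    The composite null is "at least one of the two effects is zero",
    so the pair is only as significant as its weaker component.
    """
    return np.array(
        [max(two_sided_p(z1), two_sided_p(z2)) for z1, z2 in z_pairs]
    )


# Illustrative z-statistics (made up, not from any study).
pairs = [(0.0, 5.0), (3.0, 4.0)]
p_values = max_p(pairs)
```

In the first pair one component has no signal, so the max-P value is 1 regardless of how strong the other component is; in the second pair the result equals the p-value of the weaker statistic.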

Keywords

Composite null

Empirical Bayes

Mediation analysis

Pleiotropy

Replication analysis

Genome-wide association study 

Co-Author(s)

Ryan Sun, University of Texas, MD Anderson Cancer Center
Zachary McCaw, Harvard School of Public Health
Xihong Lin, Harvard T.H. Chan School of Public Health

Speaker

Ryan Sun, University of Texas, MD Anderson Cancer Center

Using large language models for rare variant association testing in large-scale biobanks

The application of whole exome sequencing to the study of rare genetic variation is well established as a powerful and cost-effective strategy for novel drug target discovery. The study of rare genetic variation, potentially important in the development of complex diseases, has been increasingly performed thanks to advances in sequencing technologies. Gene-based tests have been developed to address the challenges that single-variant tests face because of the rarity of these variants and the need for large sample sizes. These tests aggregate information across many variants and can integrate external functional annotations to improve the power of rare variant analysis. In recent years, large language models have been used to predict the functional impact of genetic mutations, potentially enhancing the power of rare variant association tests and complementing functional prediction approaches based on in silico algorithms. We showcase the integration of functional scores leveraging protein language models for large-scale gene-based association testing in the UK Biobank. 
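As a rough illustration of how a gene-based test can aggregate variants using functional scores, here is a minimal sketch of a weighted burden test on a quantitative trait. It assumes per-variant weights are already available (for example from some functional predictor); the weights, data, and function names are hypothetical, and this is not the speakers' pipeline.

```python
import numpy as np


def weighted_burden(genotypes, weights):
    """Collapse a samples-by-variants genotype matrix into one score per sample."""
    return np.asarray(genotypes, dtype=float) @ np.asarray(weights, dtype=float)


def burden_association(burden, phenotype):
    """Slope and t-statistic from a simple linear regression of phenotype on burden."""
    x = np.asarray(burden, dtype=float)
    y = np.asarray(phenotype, dtype=float)
    n = len(x)
    xc = x - x.mean()
    yc = y - y.mean()
    sxx = xc @ xc
    beta = (xc @ yc) / sxx
    resid = yc - beta * xc
    se = np.sqrt((resid @ resid) / (n - 2) / sxx)
    return beta, beta / se


# Toy data: 4 samples, 2 rare variants in one gene (made-up values).
G = [[1, 0], [0, 1], [0, 0], [1, 1]]
w = [0.5, 1.0]  # hypothetical functional scores used as variant weights
y = [1.1, 1.9, 0.0, 3.1]

burden = weighted_burden(G, w)
beta, t_stat = burden_association(burden, y)
```

Variants with larger weights contribute more to the burden score, which is how a more informative functional score could in principle increase power.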

Co-Author(s)

Christopher Gillies, Regeneron Genetics Center
Andrey Ziyatdinov, Regeneron Genetics Center
The Regeneron Genetics Center, Regeneron Genetics Center
Maya Ghoussaini, Regeneron Genetics Center
Jonathan Marchini, Regeneron Genetics Center

Speaker

Joelle Mbatchou

PresentationCC

Co-Author(s)

Wenxuan Lu, Bloomberg School of Public Health, Johns Hopkins University
Yuzheng Dun
Ruzhang Zhao, Johns Hopkins University

Speaker

Nilanjan Chatterjee

Integrating Common and Rare Variants Improves Polygenic Risk Prediction Across Diverse Populations

Polygenic risk scores (PRS) predict complex traits by aggregating genetic effects across the genome, yet most models focus on common variants, overlooking rare variants that may contribute to hidden heritability. We developed RICE, a new PRS framework integrating both common and rare variants to improve genetic risk prediction across diverse ancestries. RICE constructs separate PRSs: for common variants, it integrates methods using ensemble learning; for rare variants, it uses gene-level testing with functional annotations and penalized regression. We evaluated RICE using simulated datasets and sequencing data from UK Biobank and All of Us, involving up to 740 million genetic variants from 361,939 individuals across diverse ancestries and 11 complex traits. In real data analysis, RICE improved predictive accuracy by an average of 25.7% compared to leading common variant PRS methods. Our findings demonstrate that incorporating rare variants significantly enhances PRS, providing a more accurate and inclusive approach to genetic risk prediction. 
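The idea of building separate common-variant and rare-variant scores and then combining them can be sketched as follows. This is a simplified illustration under stated assumptions, not the RICE method: each score is a plain weighted sum of dosages, and the two scores are combined by ordinary least squares. All names and numbers are hypothetical.

```python
import numpy as np


def prs(dosages, effects):
    """Polygenic score: weighted sum of allele dosages for each individual."""
    return np.asarray(dosages, dtype=float) @ np.asarray(effects, dtype=float)


def combine_scores(common_prs, rare_prs, phenotype):
    """Fit intercept and weights for the two scores by ordinary least squares."""
    y = np.asarray(phenotype, dtype=float)
    X = np.column_stack([np.ones(len(y)), common_prs, rare_prs])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # [intercept, weight_common, weight_rare]


# Toy dosage matrix: 2 individuals, 3 variants (made-up values).
scores = prs([[0, 1, 2], [2, 0, 1]], [0.1, -0.2, 0.3])

# Toy combination step with an exactly linear phenotype.
common = np.array([0.0, 1.0, 0.0, 1.0])
rare = np.array([0.0, 0.0, 1.0, 1.0])
pheno = 1.0 + 2.0 * common + 3.0 * rare
coef = combine_scores(common, rare, pheno)
```

In practice the combination weights would be fit on a tuning set and evaluated on held-out individuals.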

Co-Author(s)

Peter Kraft, National Cancer Institute
Wendy Wong, National Cancer Institute
Jacob Williams
Tony Chen, Harvard University
Xing Hua
Kai Yu
Xihao Li, University of North Carolina at Chapel Hill
Haoyu Zhang, National Cancer Institute

Speaker

Haoyu Zhang, National Cancer Institute