Wednesday, Aug 6: 8:30 AM - 10:20 AM
0265
Invited Paper Session
Music City Center
Room: CC-103B
Applied: Yes
Main Sponsor
Section on Statistics in Genomics and Genetics
Presentations
Causal mediation, pleiotropy, and replication analyses are three highly popular genetic study designs. Although these analyses address different scientific questions, the underlying statistical inference problems all involve large-scale testing of composite null hypotheses. The goal is to determine whether all null hypotheses (as opposed to at least one) in a set of individual tests should simultaneously be rejected. Recently, various methods have been proposed for each of these situations, including an appealing two-group empirical Bayes approach that calculates local false discovery rates (lfdr). However, lfdr estimation is difficult due to the need for multivariate density estimation. Furthermore, the multiple testing rules for the empirical Bayes lfdr approach can disagree with conventional frequentist z-statistics, which is troubling for a field that ubiquitously uses summary statistics. This work proposes a framework to unify two-group testing in genetic association composite null settings, the conditionally symmetric multidimensional Gaussian mixture model (csmGmm). Crucially, the csmGmm offers interpretability guarantees by harmonizing lfdr and z-statistic testing rules. We apply the model to a collection of translational lung cancer genetic association studies that motivated this work.
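To make the composite null concrete, the following is a minimal sketch of the classical max-P (joint significance) rule, in which a set of tests is rejected only when every individual z-statistic is significant. It is not the csmGmm or the lfdr approach described in the abstract; the variant names, z-statistics, and the 0.05 threshold are hypothetical values chosen for illustration, and no multiple-testing correction is applied.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a z-statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def max_p(z_scores):
    """Max-P statistic: the largest p-value in the set.

    The composite null ("at least one effect is zero") is rejected only
    when even the weakest individual test is significant.
    """
    return max(two_sided_p(z) for z in z_scores)


# Hypothetical pairs of z-statistics (e.g. two studies in a replication
# analysis, or the two paths in a mediation analysis).
pairs = {
    "snp_a": (4.2, 3.9),   # both signals strong
    "snp_b": (5.0, 0.3),   # only one signal strong
    "snp_c": (0.1, -0.2),  # neither signal strong
}

alpha = 0.05  # illustrative per-set threshold, no multiplicity adjustment
rejected = [name for name, zs in pairs.items() if max_p(zs) < alpha]
print(rejected)
```

Under this rule only a variant whose statistics are all large in magnitude is declared significant, which matches the "all, not at least one" requirement in the abstract.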
Keywords
Composite null
Empirical Bayes
Mediation analysis
Pleiotropy
Replication analysis
Genome-wide association study
Co-Author(s)
Ryan Sun, University of Texas, MD Anderson Cancer Center
Zachary McCaw, Harvard School of Public Health
Xihong Lin, Harvard T.H. Chan School of Public Health
Speaker
Ryan Sun, University of Texas, MD Anderson Cancer Center
The application of whole exome sequencing in the study of rare genetic variation has been well-established as a powerful and cost-effective strategy for novel drug target discovery. The study of rare genetic variation, potentially important in the development of complex diseases, has been increasingly performed thanks to advances in sequencing technologies. Gene-based tests have been developed to address the challenges with single-variant tests caused by the rarity of these variants and the need for large sample sizes. These tests aggregate information across many variants and can integrate external functional annotations to improve the power of rare variant analysis. In recent years, large language models have been used to predict the functional impact of genetic mutations, potentially enhancing the power of rare variant association tests, and complementing functional prediction approaches based on in silico algorithms. We showcase the integration of functional scores leveraging protein language models for large-scale gene-based association testing in the UK Biobank.
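As an illustration of how a gene-based test can aggregate variants using functional annotations, here is a minimal sketch of a weighted burden score, one common form of aggregation. It is not necessarily the test used in this work; the genotype matrix and the per-variant functional weights are hypothetical, and in practice the weights might come from a functional predictor such as a protein language model score.

```python
def burden_scores(genotypes, weights):
    """Weighted burden score per individual for one gene.

    genotypes: list of rows, one per individual; each row holds the
               minor-allele count (0, 1, or 2) at each rare variant.
    weights:   one functional weight per variant (hypothetical values).
    """
    if any(len(row) != len(weights) for row in genotypes):
        raise ValueError("each genotype row needs one entry per variant")
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]


# Hypothetical data: 3 individuals, 3 rare variants in one gene.
genotypes = [
    [0, 1, 0],
    [1, 0, 2],
    [0, 0, 0],
]
weights = [0.9, 0.2, 0.5]  # larger weight = predicted more damaging

scores = burden_scores(genotypes, weights)
print(scores)
```

The resulting per-individual score would then be tested for association with the trait in a regression model, so that many rare variants contribute to a single test rather than each being tested alone.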
Polygenic risk scores (PRS) predict complex traits by aggregating genetic effects across the genome, yet most models focus on common variants, overlooking rare variants that may contribute to hidden heritability. We developed RICE, a new PRS framework integrating both common and rare variants to improve genetic risk prediction across diverse ancestries. RICE constructs separate PRSs: for common variants, it integrates methods using ensemble learning; for rare variants, it uses gene-level testing with functional annotations and penalized regression. We evaluated RICE using simulated datasets and sequencing data from UK Biobank and All of Us, involving up to 740 million genetic variants from 361,939 individuals across diverse ancestries and 11 complex traits. In real data analysis, RICE improved predictive accuracy by an average of 25.7% compared to leading common variant PRS methods. Our findings demonstrate that incorporating rare variants significantly enhances PRS, providing a more accurate and inclusive approach to genetic risk prediction.