Frontiers at the Intersection of Statistics and Genetics: Causal Inference, Network Analysis, and Machine Learning/Artificial Intelligence

Chair: Wei Pan, University of Minnesota
Organizer: Wei Pan, University of Minnesota
 
Tuesday, Aug 5: 2:00 PM - 3:50 PM
0795 
Topic-Contributed Paper Session 
Music City Center 
Room: CC-207C 

Applied: Yes

Main Sponsor

Section on Statistics in Genomics and Genetics

Co Sponsors

Biometrics Section
Section on Statistical Learning and Data Science

Presentations

An improved Graph-MRcML algorithm for causal network inference with Mendelian randomization

Understanding causal networks among multiple traits is crucial for unraveling complex biological relationships and informing interventions. Mendelian Randomization (MR) has emerged as a powerful tool for causal inference, utilizing genetic variants as instrumental variables (IVs) to estimate causal effects. However, when the causal relationships among traits are unknown, reconstructing the underlying causal network remains a significant challenge. The recently proposed Graph-MRcML method addresses this by estimating pairwise causal effects using a robust bidirectional MR approach and applying network deconvolution to infer direct causal relationships. While empirically effective, certain theoretical limitations remain in its formulation.
In this study, we first clarify the underlying causal model, which allows cycles, and the relationship between the effects estimated by MR and the causal network. Then we introduce an improved version of Graph-MRcML, incorporating a more rigorous IV screening procedure to enhance recovery of the causal network. Through extensive simulations, we demonstrate that the new method achieves higher accuracy and improved statistical properties. We further validate its practical utility by applying it to a dataset of 15 traits, showcasing its effectiveness in real-world applications.
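Graph-MRcML builds on pairwise bidirectional MR estimates. As background only, the core MR idea of using a genetic variant as an instrumental variable can be sketched with a simple Wald-ratio estimate on simulated data (a minimal illustration, not the Graph-MRcML estimator itself; all effect sizes and variable names below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
g = rng.binomial(2, 0.3, n)               # genetic variant used as the IV
u = rng.normal(size=n)                    # unmeasured confounder of x and y
x = 0.5 * g + u + rng.normal(size=n)      # exposure
y = 0.8 * x + u + rng.normal(size=n)      # outcome; true causal effect = 0.8

# Marginal regression coefficients, as would come from GWAS summary statistics
beta_gx = np.cov(g, x)[0, 1] / np.var(g)
beta_gy = np.cov(g, y)[0, 1] / np.var(g)

# Wald (IV) ratio estimate of the causal effect of x on y;
# unbiased despite confounding by u, since g is independent of u
wald_ratio = beta_gy / beta_gx
```

Note that a naive regression of y on x would be biased upward by the shared confounder u, while the ratio of the two genetic associations is not.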
 

Keywords

Causal network, directed cyclic graph, direct causal effect, horizontal pleiotropy, total causal effect 

Speaker

Zhaotong Lin

Benchmarking DNA Foundation Models for Genomic Sequence Classification

The rapid advancement of DNA foundation language models has revolutionized the field of genomics, enabling the decoding of complex patterns and regulatory mechanisms within DNA sequences. However, the current evaluation of these models often relies on fine-tuning and limited datasets, which introduces biases and limits the assessment of their true potential. Here, we present a benchmarking study of three recent DNA foundation language models: DNABERT-2, Nucleotide Transformer version-2 (NT-v2), and HyenaDNA, focusing on the quality of their zero-shot embeddings across a diverse range of genomic tasks and species through analyses of 57 real datasets. We found that DNABERT-2 exhibits the most consistent performance across human genome-related tasks, while NT-v2 excels in epigenetic modification detection. HyenaDNA stands out for its exceptional runtime scalability and ability to handle long input sequences. Importantly, we demonstrate that using the mean token embedding consistently improves the performance of all three models compared to the default setting of a sentence-level summary token embedding, with average AUC improvements ranging from 4.3% to 9.7% across the DNA foundation models. Furthermore, the performance differences between these models are substantially reduced when the mean token embedding is used. Our findings provide a framework for selecting and optimizing DNA language models, guiding researchers in applying these tools effectively in genomic studies.
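The contrast between a sentence-level summary token and mean token pooling can be sketched with mock arrays standing in for model outputs (a minimal sketch; the dimensions, the position-0 summary token, and the padding mask are illustrative assumptions, not any specific model's API):

```python
import numpy as np

# Hypothetical zero-shot token embeddings for a batch of DNA sequences:
# shape (batch, seq_len, hidden_dim), plus a mask marking real vs padded tokens.
rng = np.random.default_rng(1)
tok_emb = rng.normal(size=(4, 128, 256))
mask = np.ones((4, 128))
mask[:, 100:] = 0.0                      # last 28 positions are padding

# Default "sentence-level" summary: a single designated token (here, position 0)
cls_emb = tok_emb[:, 0, :]

# Mean token embedding: average over real (non-padded) tokens only
mean_emb = (tok_emb * mask[..., None]).sum(axis=1) / mask.sum(axis=1, keepdims=True)
```

Both pooled representations have shape (batch, hidden_dim) and can be fed directly to a downstream classifier; the abstract's finding is that the averaged version is the more informative sequence-level feature.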

Speaker

Chong Wu, The University of Texas MD Anderson Cancer Center

Leveraging Auxiliary Data on Related Traits to Enhance GWAS Power

Genome-wide association studies (GWAS) have been widely applied to identify genetic variants that are robustly associated with complex human traits and diseases. This has facilitated subsequent analyses, such as the calculation of polygenic risk scores and the performance of Mendelian randomization. Moreover, a variety of approaches have been developed to enhance GWAS power from different perspectives. Despite the considerable success of GWAS, many genetic variants linked to human traits remain undiscovered due to limited sample sizes. For instance, the UK Biobank Pharma Proteomics Project (UKB-PPP) recently highlighted that the number of identified protein quantitative trait loci (pQTL) continued to increase steadily as the sample size grew to its maximum of approximately 50,000. Motivated by the UKB-PPP proteomic data, and considering that (1) proteins are causal to many traits, and (2) genotype data and outcomes for various traits in the UKB encompass much larger sample sizes, we develop a novel method that leverages causal relationships and auxiliary data to enhance GWAS power, and apply it to the UKB-PPP for pQTL discovery. The general framework of the proposed method makes it broadly applicable to other biobank-scale data as well.

Keywords

Protein Quantitative Trait Loci (pQTL)

Single Nucleotide Polymorphism (SNP)

Causal Inference 

Speaker

Haoran Xue, City University of Hong Kong

On network deconvolution for undirected graphs

Network deconvolution (ND) is a method to reconstruct a direct-effect network, describing direct (or conditional) effects (or associations) between any two nodes, from a given network depicting total (or marginal) effects (or associations). Its key idea is that, in a directed graph, a total effect can be decomposed into the sum of a direct effect and an indirect effect, with the latter further decomposed as the sum of various products of direct effects. This yields a simple closed-form solution for the direct-effect network, facilitating important applications in distinguishing direct from indirect effects. Although ND has also been applied to undirected graphs, it is not well understood why the method works there, leaving it open to skepticism. We first clarify the implicit linear model assumption underlying ND, then derive a surprisingly simple result on the equivalence between ND and the use of precision matrices, offering an insightful justification and interpretation for the application of ND to undirected graphs. We also establish a formal result characterizing the effect of scaling a total-effect graph. Finally, leveraging large-scale genome-wide association study data, we show a novel application of ND to contrast marginal versus conditional genetic correlations between body height and risk of coronary artery disease; the results align with a causal directed graph inferred using ND. We conclude that ND is a promising approach, given its easy and wide applicability to both directed and undirected graphs.
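The decomposition described above admits a standard closed form: if the total-effect matrix is the sum of all powers of the direct-effect matrix, G_obs = G_dir + G_dir^2 + ..., then G_dir = G_obs (I + G_obs)^{-1}. A minimal sketch of this standard formulation (the exact variant analyzed in the talk may differ):

```python
import numpy as np

def network_deconvolution(G_obs):
    """Closed-form ND: recover the direct-effect matrix from the
    total-effect matrix via G_dir = G_obs @ inv(I + G_obs)."""
    n = G_obs.shape[0]
    return G_obs @ np.linalg.inv(np.eye(n) + G_obs)

# Sanity check on a small DAG: build the total-effect matrix from a known
# direct-effect matrix (sum of all matrix powers), then verify ND recovers it.
G_dir = np.array([[0.0, 0.4, 0.0],
                  [0.0, 0.0, 0.3],
                  [0.0, 0.0, 0.0]])
G_obs = G_dir @ np.linalg.inv(np.eye(3) - G_dir)   # G_dir + G_dir^2 + ...
G_hat = network_deconvolution(G_obs)
```

In this example the total effect of node 1 on node 3 (0.4 x 0.3 = 0.12) is entirely indirect, and ND correctly zeroes it out in the recovered direct-effect matrix.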

Speaker

Isaac Pan, University of North Carolina

ReHLine: Regularized Composite ReLU-ReHU Loss Minimization with Linear Computation and Linear Convergence

Empirical risk minimization (ERM) is a crucial framework that offers a general approach to handling a broad range of machine learning tasks. In this paper, we propose a novel algorithm, called ReHLine, for minimizing a set of regularized ERMs with convex piecewise linear-quadratic loss functions and optional linear constraints. The proposed algorithm can effectively handle diverse combinations of loss functions, regularizations, and constraints, making it particularly well-suited for complex domain-specific problems, such as FairSVM, elastic net regularized quantile regression, and Huber minimization. In addition, ReHLine enjoys a provable linear convergence rate and a per-iteration computational complexity that scales linearly with the sample size. The algorithm is implemented with both Python and R interfaces, and its performance is benchmarked on various tasks and datasets. Our experimental results demonstrate that ReHLine significantly surpasses generic optimization solvers in terms of computational efficiency on large-scale datasets. Moreover, it also outperforms specialized solvers such as liblinear for SVMs, hqreg for Huber minimization, and lightning (SAGA, SAG, SDCA, SVRG) for smooth SVMs, exhibiting exceptional flexibility and efficiency.
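The ReLU-ReHU building blocks behind the loss class can be illustrated with one common parameterization of the ReHU unit: zero for non-positive inputs, quadratic up to a threshold tau, and linear beyond it (a sketch under this assumed definition, not necessarily the paper's exact formulation):

```python
import numpy as np

def relu(z):
    """Piecewise linear unit: max(z, 0)."""
    return np.maximum(np.asarray(z, dtype=float), 0.0)

def rehu(z, tau):
    """Rectified Huber unit (assumed parameterization):
    0 for z <= 0, z^2/2 for 0 < z <= tau, tau*(z - tau/2) for z > tau.
    Quadratic near zero, linear in the tail, continuously differentiable."""
    z = np.asarray(z, dtype=float)
    return np.where(z <= 0.0, 0.0,
                    np.where(z <= tau, 0.5 * z**2, tau * (z - 0.5 * tau)))

# Example: many convex piecewise linear-quadratic losses are sums of such
# units applied to affine transforms of the margin, e.g. a smoothed hinge
# loss for a label y and score f can be written as rehu(1 - y * f, tau).
```

Writing each loss as a composite of ReLU and ReHU units is what lets a single solver cover hinge-type, quantile, and Huber-type objectives in one framework.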
 

Keywords

convex optimization

statistical computing 

Speaker

Ben Dai, The Chinese University of Hong Kong