New Frontiers in Measuring Association: Methods, Applications, and Challenges

Jingyi Jessica Li Chair
UCLA
 
Xinzhou Ge Discussant
Oregon State University
 
Xinzhou Ge Organizer
Oregon State University
 
Tuesday, Aug 5: 10:30 AM - 12:20 PM
0802 
Topic-Contributed Paper Session 
Music City Center 
Room: CC-101B 
This session focuses on the measures of association, with a particular emphasis on their applications in biomedical data analysis. We have invited leading statisticians who develop novel theoretical measures, alongside bioinformaticians who apply these methods in real-world data analysis to uncover meaningful biological insights. The aim of the session is to foster collaboration between these two groups, driving innovation and enhancing the future development of association measures. Additionally, we will feature a discussant who will address the issue of statistical rigor in the use of different measures.
Session Format:
The session will feature four distinguished speakers who will present their latest research on association measures and their genetic applications:
1. Mona Azadkia
Title: Measuring Dependence and Conditional Dependence
Dr. Azadkia will discuss a newly developed measure of dependence that is both robust and as simple as classical coefficients such as Pearson's correlation. The measure consistently captures the strength of dependence between variables: it equals zero only when the variables are independent and one only when one variable is a measurable function of the other.
2. Lucy Xia
Title: Conditional Semi-Distance Correlation
Dr. Xia will introduce a novel measure of conditional dependence designed to assess the relationship between a categorical random variable and a potentially high-dimensional random vector, conditioned on another random vector. Her work also includes a conditional independence test, with asymptotic distributions derived under the null hypothesis. She proposes the use of this correlation as a screening tool for group variables, with particular insights into spatial transcriptomics.
3. Han Chen
Title: Efficient Mixed Model Association Test for Nonlinear Effects in Large-Scale Human Genetic Studies
Dr. Chen will present a generalized additive mixed model framework for testing nonlinear genetic effects using smoothing splines. This method addresses multi-allelic genetic variations, even in the presence of nonlinear effects. He will also introduce a variance component score test for nonlinear effects and a joint test for linear and nonlinear genetic effects, demonstrated through a real-world GWAS example.
4. Changhu Wang
Title: A Powerful Framework for False Discovery Rate Control in High-Dimensional Variable Selection
Dr. Wang will present a novel framework for controlling the FDR in high-dimensional variable selection which maintains the integrity of the original data. It is versatile, easy to implement, and consistently outperforms state-of-the-art methods, such as knockoffs and data-splitting, in terms of FDR control, statistical power, and computational efficiency across various statistical models.
Following these presentations, a discussant will summarize the four presentations, offering a brief discussion on the statistical rigor in using different measures of association.
Each talk and the closing discussion will last approximately 20 minutes, including 5 minutes of Q&A.
Significance:
Measures of association are foundational to statistics and have broad applications across many fields, particularly in biomedical science. This session aims to refine statistical methodologies, particularly in genetics, and to promote more rigorous statistical practice in biomedical research in the era of big data, in alignment with the theme.

Applied

Yes

Main Sponsor

WNAR

Co Sponsors

International Chinese Statistical Association
Section on Statistics in Genomics and Genetics

Presentations

Measuring Dependence and Conditional Dependence: A New Approach

Following recent developments in measuring dependence between random variables, we introduce a new measure of dependence that is as simple as classical coefficients such as Pearson's correlation and that consistently captures the strength of dependence between the variables: it is 0 if and only if the variables are independent and 1 if and only if one is a measurable function of the other. We introduce two different families of estimators and study their asymptotic behaviour under different regimes. Finally, we showcase the application of this new measure in variable selection.
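
The abstract does not define the new measure. As a rough illustration of the two boundary properties it describes (near 0 under independence, near 1 when one variable is a function of the other), the sketch below uses Chatterjee's rank coefficient, an earlier and different coefficient known to share those properties. It is a stand-in, not the speaker's measure; the function name `xi_coefficient` and the simulated data are assumptions made for the example.

```python
import numpy as np

def xi_coefficient(x, y):
    """Chatterjee's rank coefficient xi_n(X, Y), assuming no ties in x or y.

    Stand-in illustration only: this is NOT the measure from the talk.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    order = np.argsort(x, kind="stable")           # sort the pairs by x
    ranks = np.argsort(np.argsort(y[order])) + 1   # ranks of y in that order
    return 1.0 - 3.0 * np.abs(np.diff(ranks)).sum() / (n**2 - 1)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=2000)

# y is a (non-monotone) function of x: Pearson's correlation is near 0,
# but the rank coefficient is close to 1.
xi_function = xi_coefficient(x, x**2)
pearson_function = np.corrcoef(x, x**2)[0, 1]

# y is independent of x: the rank coefficient is close to 0.
xi_independent = xi_coefficient(x, rng.normal(size=2000))
```

Unlike Pearson's correlation, the coefficient is not symmetric in its arguments: it measures how well `y` is explained as a function of `x`.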

Co-Author

Mona Azadkia, London School of Economics and Political Science

Speaker

Mona Azadkia, London School of Economics and Political Science

Conditional Semi-Distance Correlation

Measuring conditional dependence is crucial in various fields, including genetic association studies and graphical models. In spatial transcriptomics, one of the focuses is on elucidating the relationships between gene expression levels (continuous variables) and spatially-related covariates, such as spatial locations, brain layers, and cell types (categorical variables), conditioning on other factors.

We introduce a novel measure of conditional dependence that assesses the relationship between a categorical random variable and a potentially high-dimensional random vector, conditioned on another random vector. This measure is based on semi-distance correlation and extends the concept of conditional distance correlation to incorporate categorical variables. Importantly, it serves as a general conditional dependence metric, unrestricted by linear or monotonic relationships. Further, we will develop a conditional independence test and derive its asymptotic distributions under the null hypothesis. This allows us to efficiently compute p-values, providing a significant computational advantage for high-dimensional data analysis over traditional regression tests, which will be especially useful for spatial transcriptomics data. 
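
For readers unfamiliar with the distance-correlation family that this measure builds on, the following is a minimal sketch of the ordinary (unconditional) sample distance correlation between two continuous samples. It is not the conditional semi-distance correlation proposed in the talk, which additionally handles a categorical variable and a conditioning vector; the helper names are assumptions made for the example.

```python
import numpy as np

def _centered_distances(z):
    """Double-centered Euclidean distance matrix of a sample (rows = observations)."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]
    d = np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1))
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_correlation(x, y):
    """Sample distance correlation (V-statistic version); unconditional only."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = (a * b).mean()                        # squared distance covariance
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    if denom == 0:
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=500)

dcor_quadratic = distance_correlation(x, x**2)                    # nonlinear dependence
dcor_independent = distance_correlation(x, rng.normal(size=500))  # no dependence
```

The quadratic example shows the point made in the abstract: the dependence is neither linear nor monotonic, yet the distance-based statistic is clearly larger than in the independent case. This version needs O(n^2) memory, so it suits only moderate sample sizes.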

Co-Author

Lucy Xia

Speaker

Lucy Xia

Efficient mixed model association test for nonlinear effects in large-scale human genetic studies

Generalized linear mixed model (GLMM) based genetic association tests have been widely applied in human genetic studies with related individuals to identify genetic variants associated with complex diseases and quantitative traits. In recent years, efficient GLMM-based tests have been implemented in the genome-wide association studies (GWAS) from large biobank-scale cohorts with hundreds of thousands of individuals, such as the UK Biobank and All of Us. These methods and software programs often assume an additive coding scheme for bi-allelic genetic variants such as single nucleotide polymorphisms (SNPs). While additive coding is convenient and computationally efficient, the linearity assumption may be violated for multi-allelic genetic variations, such as structural variants (SVs), copy number variations (CNVs), and tandem repeats (TRs). In the presence of nonlinear genetic effects, GLMM-based tests with additive coding may suffer from substantial power loss, especially when the linear effects are weak. Here we develop a generalized additive mixed model (GAMM) based framework for testing nonlinear genetic effects using smoothing splines for multi-allelic genetic variations. To improve the computational efficiency, instead of fitting a separate GAMM for each genetic variant, we only fit a null model without any genetic effects once in a GWAS. We then develop a variance component score test for nonlinear effects after projecting out linear effects, as well as a joint test for linear and nonlinear genetic effects. Assuming a sparse kinship matrix for modeling sample relatedness with a bounded maximum cluster size, and a limited number of observed alleles for each genetic variant, the computational complexity scales linearly with both the sample size and the number of variants in a GWAS. 
We perform simulation studies to evaluate type I error control and power gain of GAMM-based association tests compared to GLMM-based tests, in the presence of nonlinear genetic effects for multi-allelic genetic variations. We also illustrate the new method in a real data example by performing GWAS on TRs. 
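
The computational idea of fitting the null model once and then scoring every variant against it can be sketched in a heavily simplified setting. The example below assumes an ordinary linear model with independent samples (no kinship matrix, no random effects, no splines) and computes a 1-degree-of-freedom score statistic for a linear effect; it is not the GAMM variance component test described above. The function name `score_tests` and the simulated data are assumptions made for the example.

```python
import numpy as np

def score_tests(y, covariates, genotypes):
    """Per-variant 1-df score statistics from a single null-model fit.

    Simplified sketch: linear model, independent samples, linear effect only.
    Monomorphic variants (zero variance) are not handled.
    """
    n = len(y)
    X = np.column_stack([np.ones(n), covariates])
    # Fit the null model (covariates only, no genetic effect) once.
    Q, _ = np.linalg.qr(X)
    resid = y - Q @ (Q.T @ y)
    sigma2 = resid @ resid / (n - X.shape[1])
    # Project the covariates out of all variants at once, then form the scores.
    G_adj = genotypes - Q @ (Q.T @ genotypes)
    scores = G_adj.T @ resid
    variances = sigma2 * (G_adj ** 2).sum(axis=0)
    return scores ** 2 / variances   # approximately chi-square(1) under the null

rng = np.random.default_rng(2)
n, m = 2000, 5
covariates = rng.normal(size=(n, 2))
genotypes = rng.binomial(2, 0.3, size=(n, m)).astype(float)
# Only the first variant has an effect in this simulated trait.
y = 0.3 * covariates[:, 0] + 0.5 * genotypes[:, 0] + rng.normal(size=n)

stats = score_tests(y, covariates, genotypes)
```

Because no per-variant model is refitted, the cost per variant is a few matrix-vector products, which is the property that makes score tests attractive at biobank scale.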

Co-Author

Han Chen, The University of Texas Health Science Center at Houston

Speaker

Han Chen, The University of Texas Health Science Center at Houston

SyNPar: Synthetic Null Data Parallelism for High-Power False Discovery Rate Control in High-Dimensional Variable Selection

Balancing false discovery rate (FDR) control and statistical power is a fundamental challenge in high-dimensional variable selection. Existing FDR control methods often perturb the original data, either by concatenating knockoff variables or by splitting the data, which can compromise power. In this paper, we introduce SyNPar, a novel framework that controls the FDR in high-dimensional variable selection while preserving the integrity of the original data. The framework is versatile, straightforward to implement, and applicable to a wide range of statistical models, including high-dimensional linear regression, generalized linear models (GLMs), Cox models, and Gaussian graphical models. Through extensive simulations and real-world data applications, we demonstrate that SyNPar consistently outperforms state-of-the-art methods, such as knockoffs and data-splitting techniques, in terms of FDR control, statistical power, and computational efficiency.

Co-Author

Changhu Wang, UCLA

Speaker

Changhu Wang, UCLA