Statistical Methods for High-dimensional Data

Noah Gade Chair
Wake Forest University
 
Monday, Aug 4: 10:30 AM - 12:20 PM
4061 
Contributed Papers 
Music City Center 
Room: CC-103B 

Main Sponsor

Section on Statistical Learning and Data Science

Presentations

A Debiased Estimator for the Mediation Functional in Ultra-High-Dimensional Setting in the Presence of Interaction Effects

Mediation analysis is a crucial tool for uncovering the mechanisms through which a treatment affects the outcome, providing deeper causal insights and guiding effective interventions.
Despite advances in analyzing the mediation effect with fixed or low-dimensional mediators and covariates, our understanding of estimation and inference for the mediation functional in the presence of (ultra-)high-dimensional mediators and covariates is still limited. In this paper, we present an estimator for the mediation functional in a high-dimensional setting that accommodates the interaction between covariates and treatment in generating mediators, as well as interactions between covariates and treatment and between mediators and treatment in generating the response. We demonstrate that our estimator is $\sqrt{n}$-consistent and asymptotically normal, thus enabling reliable inference on direct and indirect treatment effects with asymptotically valid confidence intervals. A key technical contribution of our work is a multi-step debiasing technique, which may also be valuable in other statistical settings with similar structural complexities where accurate estimation depends on debiasing.
We evaluate our proposed methodology through extensive simulation studies and apply it to the TCGA lung cancer dataset to estimate the effect of smoking, mediated by DNA methylation, on the survival time of lung cancer patients. 
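
For orientation, the target of inference can be written in the standard nested-expectation form. This is a sketch under assumed notation that the abstract does not state: $A$ is the binary treatment, $M$ the mediators, $X$ the covariates, $Y$ the outcome, and $Y(a, M(a'))$ the usual nested counterfactual, with the usual identification conditions (sequential ignorability, consistency, positivity) taken for granted.

```latex
% Mediation functional (assumed notation; identification conditions assumed to hold)
\theta_0 \;=\; \mathbb{E}\bigl[Y(1, M(0))\bigr]
        \;=\; \mathbb{E}\Bigl[\,\mathbb{E}\bigl[\,\mathbb{E}[Y \mid A = 1, M, X] \,\bigm|\, A = 0, X \bigr]\Bigr]

% Natural direct and indirect effects expressed through the same kind of functional
\mathrm{NDE} \;=\; \mathbb{E}\bigl[Y(1, M(0))\bigr] - \mathbb{E}\bigl[Y(0, M(0))\bigr],
\qquad
\mathrm{NIE} \;=\; \mathbb{E}\bigl[Y(1, M(1))\bigr] - \mathbb{E}\bigl[Y(1, M(0))\bigr]
```

The two effects sum to the total effect $\mathbb{E}[Y(1, M(1))] - \mathbb{E}[Y(0, M(0))]$, which is why inference on $\theta_0$ yields inference on both.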

Keywords

Causal Mediation Analysis

Debiased Estimation

Ultra-High-Dimensional Models

Interaction Effects 

Co-Author(s)

AmirEmad Ghassami, Boston University
Debarghya Mukherjee, Boston University

First Author

Shi Bo, Boston University

Presenting Author

Shi Bo, Boston University

Advanced Survival Transformer for Integrating Multi-Omics Longitudinal Data with High-Dimensional Fe

The TransformerPseudo model, an advanced survival Transformer, is specifically designed to analyze high-dimensional, longitudinal multi-omics data while addressing the challenges of covariate-dependent censoring. By transforming survival outcomes into pseudo probabilities, the model circumvents the need for observed survival times and censoring variables, enabling robust estimation of covariate effects on patient survival. Its architecture leverages positional encodings and multi-head attention mechanisms to efficiently capture temporal dependencies and high-dimensional feature interactions, reducing information loss and simplifying the analysis by eliminating the reliance on random effects. Furthermore, the model employs SHAP values for interpretable visualization, offering comprehensive insights into the impact of multi-omics variables. Validated through extensive simulations and real-world applications using the TEDDY disease datasets, the TransformerPseudo model consistently achieves superior predictive accuracy and outperforms traditional methodologies. 
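
The abstract does not say how the pseudo probabilities are computed. As a minimal sketch of the general idea, the block below builds the classical jackknife pseudo-observations for the survival probability at a fixed time from a Kaplan-Meier estimate. The function names and the `t0` argument are illustrative, and this standard construction assumes censoring independent of covariates, so the covariate-dependent case treated in the talk would need a modified estimator.

```python
import numpy as np


def kaplan_meier_at(time, event, t0):
    """Kaplan-Meier estimate of S(t0) from right-censored data."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    surv = 1.0
    # Multiply the conditional survival factors over distinct event times <= t0.
    for t in np.unique(time[event & (time <= t0)]):
        at_risk = np.sum(time >= t)
        deaths = np.sum(event & (time == t))
        surv *= 1.0 - deaths / at_risk
    return surv


def pseudo_observations(time, event, t0):
    """Jackknife pseudo-observations for S(t0): n * S_hat - (n - 1) * S_hat_(-i)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    n = len(time)
    full = kaplan_meier_at(time, event, t0)
    keep = np.ones(n, dtype=bool)
    pseudo = np.empty(n)
    for i in range(n):
        keep[i] = False  # leave subject i out
        loo = kaplan_meier_at(time[keep], event[keep], t0)
        pseudo[i] = n * full - (n - 1) * loo
        keep[i] = True
    return pseudo
```

Each subject, censored or not, receives a numeric pseudo-observation that can serve as a regression target for a downstream model. Without censoring, the values reduce to the indicators of surviving past `t0`.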

Keywords

Survival Analysis

Transformer

Deep Learning

Covariate-dependent censoring

High-dimensional longitudinal multi-omics

Robust estimation 

Co-Author

Jiyuan Hu, NYU Grossman School of Medicine

First Author

Yeji Kim, New York University, Division of Biostatistics, Department of Population Health

Presenting Author

Yeji Kim, New York University, Division of Biostatistics, Department of Population Health

Envelope-Guided Regularization for Improved Prediction in High-Dimensional Multivariate Regression

Envelope methods perform dimension reduction of predictors or responses in multivariate regression, exploiting the relationship between them to improve estimation efficiency. While most research on envelopes has focused on their estimation properties, certain envelope estimators have been shown to excel at prediction in both low and high dimensions. We propose to further improve prediction through envelope-guided regularization (EgReg), a novel method which uses envelope-derived information to guide shrinkage along the principal components (PCs) of the predictor matrix. We situate EgReg among other PC-based regression methods and envelope methods to motivate its development. We show that EgReg delivers lower prediction risk than a closely related non-shrinkage envelope estimator in fixed dimensions and in an asymptotic regime where the true intrinsic dimension of the predictors and n diverge proportionally. We compare the prediction performance of EgReg with envelope methods and other PC-based prediction methods in simulations and a real data example, observing improved prediction performance over these alternative approaches in general. 
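
To make "shrinkage along the principal components" concrete, here is a minimal sketch of a generic PC-shrinkage estimator. It is not EgReg: the abstract does not specify how envelope information sets the shrinkage factors, so they are passed in as a user-supplied vector. The function names are illustrative, and ridge regression is shown only as a familiar special case.

```python
import numpy as np


def pc_shrinkage_coef(X, Y, factors):
    """Coefficients with one shrinkage factor per principal component.

    X is an (n, p) column-centered predictor matrix with nonzero singular
    values, Y is an (n, q) response matrix, and factors[j] in [0, 1] scales
    the least-squares coefficient along the j-th PC direction of X.
    """
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    scale = np.asarray(factors, dtype=float) / d
    # Least-squares coefficient along PC j is (u_j' Y) / d_j; shrink it by factors[j].
    return Vt.T @ (scale[:, None] * (U.T @ Y))


def ridge_factors(X, lam):
    """Ridge regression expressed as PC shrinkage: d_j^2 / (d_j^2 + lam)."""
    d = np.linalg.svd(X, compute_uv=False)
    return d**2 / (d**2 + lam)
```

Setting every factor to 1 recovers ordinary least squares, setting the trailing factors to 0 gives principal component regression, and `ridge_factors` reproduces ridge regression.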

Keywords

double descent

predictor envelopes

principal components

shrinkage estimator 

Co-Author

Oh-Ran Kwon

First Author

Tate Jacobson, Oregon State University

Presenting Author

Tate Jacobson, Oregon State University

REML estimators in High-Dimensional Kernel Linear Mixed Models

Restricted Maximum Likelihood (REML) estimators are commonly used to produce unbiased estimators for the variance components in linear mixed models. In modern applications, the random-effects design matrix may be high-dimensional, especially in genetic association studies. Motivated by this, I will first introduce high-dimensional kernel linear mixed models. The REML equations will then be derived, followed by a discussion of the consistency of REML estimators for some commonly used kernel matrices. The theoretical results are illustrated with simulation studies. 
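
As a reference point, the REML criterion and estimating equations for a single-kernel model can be sketched as follows. The model form and notation are assumptions, not taken from the abstract: $y = X\beta + u + \varepsilon$ with $u \sim N(0, \sigma_g^2 K)$ for a kernel matrix $K$ and $\varepsilon \sim N(0, \sigma_e^2 I_n)$. The talk may consider a more general specification.

```latex
% Marginal covariance and projection matrix (assumed single-kernel model)
V = \sigma_g^2 K + \sigma_e^2 I_n,
\qquad
P = V^{-1} - V^{-1} X \bigl(X^\top V^{-1} X\bigr)^{-1} X^\top V^{-1}

% Restricted log-likelihood, up to an additive constant
\ell_R(\sigma_g^2, \sigma_e^2)
  = -\tfrac{1}{2}\Bigl[\log\det V + \log\det\bigl(X^\top V^{-1} X\bigr) + y^\top P y\Bigr]

% REML equations: one per variance component, with V_g = K and V_e = I_n
\operatorname{tr}(P V_i) = y^\top P V_i P y,
\qquad i \in \{g, e\}
```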

Keywords

Kernel Methods

Inner Product Random Matrices

Restricted Maximum Likelihood Estimator

Consistency 

Co-Author

Qing Lu

First Author

Xiaoxi Shen, Texas State University

Presenting Author

Xiaoxi Shen, Texas State University

Supervised Dimension Reduction for Regression Models with High-Dimensional Output

Regression models with high-dimensional response are increasingly ubiquitous across various domains, including computer experiments with high-dimensional output. Current methodology involves compressing the response using Unsupervised Dimension Reduction (UnsuperDR) techniques such as Singular Value Decomposition (SVD), and training regression models to predict the compressed values. We implement a novel Supervised Dimension Reduction (SuperDR) approach to infer an optimal linear compression within a comprehensive statistical model to simultaneously compress and predict high-dimensional response variables. Leveraging recent advances in SuperDR for linear models and regression modeling for multivariate output, our approach alternates between estimating a compressed regression model and an expansion matrix, theoretically converging to an optimal solution. Our framework is agnostic to the chosen regression model, as demonstrated by our implementation with Polynomial Chaos Expansion and Random Forests regression. We compare the effectiveness of SuperDR against the state-of-the-art UnsuperDR framework. 
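
The unsupervised baseline described above (compress the response with an SVD, then regress the compressed values) can be sketched as follows. This is only the UnsuperDR comparison point, not the proposed SuperDR method, and ordinary least squares stands in for the Polynomial Chaos Expansion and Random Forests regressors named in the abstract. The function names and the `rank` parameter are illustrative.

```python
import numpy as np


def fit_svd_compressed_regression(X, Y, rank):
    """Compress Y with a truncated SVD, then regress the scores on X."""
    y_mean = Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Y - y_mean, full_matrices=False)
    basis = Vt[:rank]                            # (rank, q) expansion matrix
    scores = (Y - y_mean) @ basis.T              # (n, rank) compressed response
    X1 = np.column_stack([np.ones(len(X)), X])   # add an intercept column
    coef, *_ = np.linalg.lstsq(X1, scores, rcond=None)
    return {"coef": coef, "basis": basis, "y_mean": y_mean}


def predict_svd_compressed_regression(model, X_new):
    """Predict the compressed scores, then expand back to the full response."""
    X1 = np.column_stack([np.ones(len(X_new)), X_new])
    return X1 @ model["coef"] @ model["basis"] + model["y_mean"]
```

Here the basis is chosen from `Y` alone, without reference to how well the scores can be predicted from `X`. That separation is what a supervised approach, which estimates the compression and the regression together, aims to remove.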

Keywords

Supervised Dimension Reduction

High-Dimensional Response

Nonlinear Regression 

Co-Author(s)

Derek Tucker, Sandia National Laboratories
Carlos Llosa, Sandia National Laboratories

First Author

Gavin Collins

Presenting Author

Gavin Collins

Tutorial: Differential Network Analysis with Application to High-Dimensional Biological Data

Identifying differences between biological networks, particularly in disease settings, can point to targets for diagnosis and treatment. Differential network analysis has been used to characterize gene expression network differences in cancer vs normal tissue, changes in brain networks over time in Alzheimer's, and gut microbial networks under various conditions. We identified 40+ methods for estimating and testing network differences, which are implemented in over 25 R and Python packages. There is currently no practical review and comparative study of this wide variety of software. We will present a comprehensive application-focused review and comparative evaluation, with the aim of providing a resource for applied researchers to select and put these methods into practice. We will give accessible explanations of the overarching methods and provide reproducible code examples for select methods using publicly available biological datasets. We focus on comparing popular frequentist estimation (Joint Graphical Lasso, Danaher 2014; iDingo, Class 2017) with a Bayesian counterpart (Spike and Slab Joint Graphical Lasso, Li 2019). 
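
For readers unfamiliar with the Joint Graphical Lasso, its fused-lasso form can be written as the penalized likelihood problem below. The notation is assumed rather than taken from the abstract: $K$ conditions, with sample size $n_k$, sample covariance $S^{(k)}$, and precision matrix $\Theta^{(k)} = (\theta^{(k)}_{ij})$ for condition $k$, and tuning parameters $\lambda_1, \lambda_2 \ge 0$.

```latex
% Fused graphical lasso objective (assumed notation)
\max_{\Theta^{(1)}, \dots, \Theta^{(K)} \succ 0} \;
  \sum_{k=1}^{K} n_k \Bigl[\log\det \Theta^{(k)} - \operatorname{tr}\bigl(S^{(k)} \Theta^{(k)}\bigr)\Bigr]
  \;-\; \lambda_1 \sum_{k=1}^{K} \sum_{i \neq j} \bigl|\theta^{(k)}_{ij}\bigr|
  \;-\; \lambda_2 \sum_{k < k'} \sum_{i, j} \bigl|\theta^{(k)}_{ij} - \theta^{(k')}_{ij}\bigr|
```

The first penalty encourages sparsity within each network and the second encourages entries to be shared across conditions, so edges with $\hat\theta^{(k)}_{ij} \neq \hat\theta^{(k')}_{ij}$ are read as differential.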

Keywords

graphical models

biological pathway estimation

differential network analysis

Bayesian network estimation

software application

method overview 

Co-Author(s)

Katherine Shutta, Harvard School of Public Health
Raji Balasubramanian, University of Massachusetts, Amherst

First Author

Margaret Janiczek, University of Massachusetts, Amherst

Presenting Author

Margaret Janiczek, University of Massachusetts, Amherst

Ultra-High-Dimensional Mediation Analysis via Latent Factor Models with Interaction Effects

Causal mediation analysis provides a framework to understand the causal pathways through which an independent variable affects an outcome via intermediary mediators. However, in high-dimensional settings, where the number of covariates and mediators far exceeds the number of observations, traditional methods often fail to capture the complexity of the data. To address this, we propose a novel approach that leverages latent factor models to estimate the mediation functional via a modified diversified projection method, in which one is not required to know the true dimension of the latent factors. We present a √n-consistent and asymptotically normal estimator that allows for the interaction between covariates and treatment in generating mediators, as well as the interactions between covariates and treatment and between mediators and treatment in generating the response. To demonstrate the practical relevance of our approach, we conduct an analysis of the ADNI (Alzheimer's Disease Neuroimaging Initiative) study, investigating how DNA methylation mediates the progression of Alzheimer's Disease (AD). Moreover, our findings are supported by extensive simulations and an investigation of the effect of the Geriatric Depression Scale on the Alzheimer's Disease Assessment Scale – Cognitive Subscale (ADAS-Cog), mediated via DNA methylation. 

Keywords

Latent Factors

Mediation Analysis

High-Dimensional Models

Factor Model

Diversified Projection Method 

Co-Author(s)

Shi Bo, Boston University
Debarghya Mukherjee, Boston University
AmirEmad Ghassami, Boston University

First Author

Himani Yadav, Boston University

Presenting Author

Himani Yadav, Boston University