Advances in Variable Selection

Tasneem Zaihra Rizvi, Chair
Lahey Hospital and Medical Center
 
Wednesday, Aug 6: 2:00 PM - 3:50 PM
4189 
Contributed Papers 
Music City Center 
Room: CC-202B 

Main Sponsor

Section on Statistical Learning and Data Science

Presentations

Comparing Feature Selection Methods in Clinical Data Modeling: LASSO and Stepwise Regression

In clinical data modeling, a common challenge is the high dimensionality of features relative to the number of patients, which complicates reliable inference. Regression models are frequently employed due to their interpretability and ability to quantify parameter estimates and confidence intervals. While stepwise feature selection has historically been popular, recent studies suggest that regularized methods like LASSO, Ridge Regression, and Elastic Net offer superior performance.

This study evaluates and compares the performance of LASSO, Ridge Regression, Elastic Net, and Stepwise Regression using both simulated datasets and a prospective study of endogenous hypercortisolism in a population with difficult-to-control type 2 diabetes. Key metrics include feature selection overlap, parameter estimates with confidence intervals, and test statistics. Results indicate substantial overlap between the features selected by LASSO and Stepwise Regression, with LASSO selecting a more comprehensive and robust feature set. LASSO also outperforms Stepwise Regression in accuracy and robustness, while Stepwise Regression exhibits a higher tendency to overfit.
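As a concrete illustration of the comparison described above, the following sketch contrasts the feature sets chosen by cross-validated LASSO and a greedy forward-selection stand-in for classical stepwise regression (scikit-learn's SequentialFeatureSelector uses cross-validation rather than p-value entry/exit rules); the simulation settings and model sizes are illustrative assumptions, not the study's design.

```python
# Minimal sketch (not the authors' code): compare features selected by LASSO
# and a forward-selection proxy for stepwise regression on simulated data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

# Simulate a "wide" clinical-style dataset: many features, few informative.
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

# LASSO with cross-validated penalty; nonzero coefficients are "selected".
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
lasso_selected = set(np.flatnonzero(lasso.coef_))

# Greedy forward selection, stopping at a fixed model size (assumed here).
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=10,
                                direction="forward", cv=5).fit(X, y)
step_selected = set(np.flatnonzero(sfs.get_support()))

# Feature-selection overlap, one of the comparison metrics in the abstract.
overlap = lasso_selected & step_selected
print(f"LASSO selected:    {sorted(lasso_selected)}")
print(f"Stepwise selected: {sorted(step_selected)}")
print(f"Overlap: {len(overlap)} features")
```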

Keywords

Lasso

stepwise regression

simulation

hypercortisolism 

Co-Author(s)

Daniel Einhorn, Corcept Therapeutics Inc.
Cristina Tudor, Corcept Therapeutics Inc.

First Author

Yumeng Wang

Presenting Author

Yumeng Wang

Fractional Ridge Regression: A New Perspective on Shrinkage Regression and Variable Selection

ℓp-norm penalization, notably the Lasso, has become a standard technique, extending shrinkage regression to subset selection. Despite aiming for oracle properties and consistent estimation, existing Lasso-derived methods still rely on shrinkage toward a null model, necessitating careful tuning-parameter selection and yielding stepwise variable selection. This research introduces Fractional Ridge Regression (Fridge), a novel generalization of the Lasso penalty that penalizes only a fraction of the coefficients. Critically, Fridge shrinks the model toward a non-null model of a prespecified target size, even under extreme regularization. By selectively penalizing the coefficients associated with less important variables, Fridge aims to reduce bias, improve performance relative to the Lasso, and offer more intuitive model interpretation while retaining certain advantages of best subset selection.
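The abstract does not state the objective function, but one stylized way to write a penalty that "penalizes only a fraction of the coefficients" is sketched below; this is an illustrative reconstruction from the abstract, not the authors' exact definition of Fridge.

```latex
% Stylized objective: ridge-penalize only the p-k smallest coefficients
% (illustrative reconstruction, not the paper's exact Fridge definition).
\hat{\beta} \;=\; \arg\min_{\beta \in \mathbb{R}^p}\;
  \|y - X\beta\|_2^2 \;+\; \lambda \sum_{j=k+1}^{p} \beta_{(j)}^2,
\qquad
\beta_{(1)}^2 \ge \beta_{(2)}^2 \ge \cdots \ge \beta_{(p)}^2 .
```

Under this reading, the k largest coefficients remain unpenalized, so as λ → ∞ the fit shrinks toward a model of the prespecified target size k rather than toward the null model.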

Keywords

Shrinkage Regression

Regularization

Variable Selection

Sparse Modeling 

Co-Author

Leonard Stefanski, North Carolina State University

First Author

Sihyung Park

Presenting Author

Sihyung Park

Robust Bayesian Elastic Net with Spike-and-Slab Priors

In high-dimensional regression problems, the demand for robust variable selection arises due to the commonly observed outliers, heavy-tailed distributions of the response variable, and model misspecifications when structured sparsity is ignored. The elastic net enjoys wide popularity in genomics studies as it can accommodate the strong correlations among omics features. Therefore, the robust elastic net in both the frequentist and Bayesian frameworks has received much attention in recent years for the robust identification of important omics features. In this study, we propose a robust Bayesian elastic net with spike-and-slab priors that overcomes the major limitations of the existing family of elastic net methods. Specifically, we have developed a fully Bayesian method that builds on the robust likelihood function to safeguard against the heterogeneity of complex diseases while accounting for structured sparsity. Incorporation of the spike-and-slab priors in the Bayesian hierarchical model has significantly improved accuracy in shrinkage estimation and variable selection. The advantages of the proposed method have been demonstrated through simulation studies and real data analysis.
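A generic hierarchical template matching the ingredients named in the abstract (a heavy-tailed robust likelihood, an elastic-net-shaped slab, and spike-and-slab indicators) might look as follows; the specific distributional choices here are assumptions, not the authors' exact model.

```latex
% Generic robust spike-and-slab elastic-net hierarchy (illustrative template;
% the authors' exact likelihood and prior choices may differ).
y_i \mid x_i, \beta \;\sim\; \mathrm{Laplace}\!\left(x_i^\top \beta,\, \sigma\right),
\qquad i = 1, \dots, n,
\\[4pt]
\pi\!\left(\beta_j \mid \gamma_j\right) \;=\;
  \gamma_j\, c(\lambda_1, \lambda_2)\,
  \exp\!\left(-\lambda_1 |\beta_j| - \lambda_2 \beta_j^2\right)
  \;+\; (1 - \gamma_j)\, \delta_0(\beta_j),
\qquad
\gamma_j \sim \mathrm{Bernoulli}(\pi_0).
```

In this template the indicator γ_j switches each coefficient between an elastic-net-shaped slab and a point mass at zero, while the heavy-tailed likelihood downweights outlying responses; posterior computation would proceed by MCMC, consistent with the "Markov Chain Monte Carlo" keyword below.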

Keywords

robust Bayesian elastic net

Markov Chain Monte Carlo

robust Bayesian variable selection

spike-and-slab priors

robust regularization

Bayesian inference analysis 

Co-Author(s)

Cen Wu, Kansas State University
Jie Ren, Indiana University School of Medicine
Shuangge Ma

First Author

Xi Lu, UH

Presenting Author

Xi Lu, UH

Stability of a Generalized Debiased Lasso with Applications to Resampling-Based Variable Selection

Suppose that we first apply the Lasso to a design matrix, and then update one of its columns. In general, the signs of the Lasso coefficients may change, and there is no closed-form expression for updating the Lasso solution exactly. In this work, we propose an approximate formula for updating a debiased Lasso coefficient. We provide general nonasymptotic error bounds in terms of the norms and correlations of a given design matrix's columns, and then prove asymptotic convergence results for the case of a random design matrix with i.i.d. sub-Gaussian row vectors and i.i.d. Gaussian noise. Notably, the approximate formula is asymptotically correct for most coordinates in the proportional growth regime, under the mild assumption that each row of the design matrix is sub-Gaussian with a covariance matrix having a bounded condition number. Our proof only requires certain concentration and anti-concentration properties to control various error terms and the number of sign changes. In contrast, rigorously establishing distributional limit properties (e.g., Gaussian limits for the debiased Lasso) under similarly general assumptions has been considered an open problem in the universality literature.
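For orientation, the standard debiased Lasso estimator that the update formula targets can be sketched as follows; the pseudo-inverse precision estimate is a simplification (practical constructions use, e.g., nodewise Lasso), and this is not the paper's new update formula itself.

```python
# Minimal sketch of the standard debiased Lasso:
#   beta_d = beta_hat + (1/n) * M * X^T (y - X beta_hat),  M ≈ Sigma^{-1}.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 400, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 2.0                       # 5 true signals
y = X @ beta + rng.standard_normal(n)

# Step 1: ordinary Lasso fit (penalty level chosen for illustration).
beta_hat = Lasso(alpha=0.1).fit(X, y).coef_

# Step 2: debias with a crude precision-matrix estimate.
Sigma_hat = X.T @ X / n
M = np.linalg.pinv(Sigma_hat)        # simplification; nodewise Lasso in practice
beta_debiased = beta_hat + M @ X.T @ (y - X @ beta_hat) / n

print(np.round(beta_debiased[:8], 2))   # roughly unbiased near the true values
```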

Keywords

debiased Lasso

inference in high-dimensional regression models

knockoffs

false discovery rate

universality

small ball probability 

First Author

Jingbo Liu, UIUC

Presenting Author

Jingbo Liu, UIUC

Using sufficiency and sparsity for more powerful controlled variable selection in the linear model

We show that for the problem of controlled variable selection in the Gaussian linear model, informative and valid weights (for weighted multiple testing) can be derived entirely from sufficient statistics and a belief in sparsity, using only the data itself and no external quantitative side information. This idea results in new procedures with strict guarantees on the (unweighted) familywise error rate or false discovery rate that are more powerful than existing methods when the model is sparse. A naive implementation of our idea is computationally intensive, so we propose computational improvements that maintain strict validity while having little impact on power. We show that the same idea extends asymptotically to any setting with a Gaussian limit and a consistently estimable covariance matrix, such as any M-estimation problem. We demonstrate the performance of our methods in simulations and an application to HIV drug resistance. 
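For context, the generic weighted Benjamini-Hochberg procedure into which such data-derived weights would plug can be sketched as below; the weights in the example are placeholders, not the sufficiency- and sparsity-based weights the paper constructs.

```python
# Minimal sketch of weighted Benjamini-Hochberg: run BH at level q on
# p_i / w_i, with weights renormalized to average 1 (placeholder weights).
import numpy as np

def weighted_bh(pvals, weights, q=0.1):
    """Reject hypotheses by BH applied to weighted p-values p_i / w_i."""
    w = np.asarray(weights, dtype=float)
    w = w / w.mean()                          # mean-one weights preserve FDR control
    adj = np.asarray(pvals, dtype=float) / w
    m = adj.size
    order = np.argsort(adj)
    passed = adj[order] <= q * np.arange(1, m + 1) / m
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True                # reject the k smallest weighted p-values
    return rejected

# Toy example: hypotheses believed more promising get larger weights.
pvals = np.array([0.001, 0.02, 0.04, 0.30, 0.70])
weights = np.array([2.0, 2.0, 0.5, 0.5, 1.0])
print(weighted_bh(pvals, weights, q=0.1))     # -> [ True  True False False False]
```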

Keywords

Variable selection

Weighted multiple testing

Sparsity

Familywise error rate

False discovery rate 

Co-Author

Lucas Janson, Harvard University

First Author

Souhardya Sengupta, Harvard University

Presenting Author

Souhardya Sengupta, Harvard University

Variable Selection in Multi-State Models of Correlated Data: An Application to COVID-19 Vaccination

Multi-state models (MSMs) are the primary analytical approach used to depict patient transitions among multiple clinical states in medical research. MSMs are typically complex, with multiple transition paths and many parameters. This complexity introduces computational and numerical challenges in parameter estimation and scientific difficulties in model interpretation. Compounding these issues is the inherent within-subject correlation. For example, in care transitions among patients receiving COVID-19 vaccines, transition times among different states within the same subject tend to be correlated. Failing to accommodate these correlations may lead to inefficient estimation and questionable inference. We propose a method for variable selection in MSMs with correlated data by reparameterizing the likelihood function and approximating the penalty term with a hyperbolic tangent function. We conducted a simulation study to evaluate the accuracy of this approach and applied the method to data from an observational study of transitions among people receiving COVID-19 vaccines, focusing on four health states: healthy, infection, emergency department or hospital admission, and death. 
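To make the smoothing idea concrete, here is a minimal sketch of a hyperbolic-tangent surrogate for a nonsmooth selection penalty; the surrogate form and the scale parameter tau are illustrative assumptions rather than the authors' exact construction.

```python
# Minimal sketch: tanh(|b|/tau) -> indicator(b != 0) as tau -> 0, giving a
# differentiable surrogate for an L0-type penalty (illustrative form only).
import numpy as np

def tanh_penalty(beta, lam=1.0, tau=0.05):
    # Near 0 for coefficients at zero, near lam for clearly nonzero ones,
    # and smooth everywhere, so gradient-based likelihood fitting applies.
    return lam * np.tanh(np.abs(beta) / tau)

b = np.array([0.0, 0.01, 0.1, 1.0, -2.0])
print(np.round(tanh_penalty(b), 3))   # -> [0.    0.197 0.964 1.    1.   ]
```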

Keywords

Multi-state Model

Variable Selection

Correlated data

COVID-19

EHR Data 

Co-Author(s)

Yang Li, Indiana University Purdue University Indianapolis
Wanzhu Tu, Indiana University School of Medicine

First Author

Jason Mao

Presenting Author

Jason Mao

Variable Selection in Partial Linear Models

Variable selection in partial linear models (PLMs) is crucial for high-dimensional data analysis, where accurately estimating both linear and nonlinear components is essential. In this work, we develop a methodology based on a Variational Bayes (VB) approach for variable selection in PLMs, incorporating a spike-and-slab prior on both the linear coefficients and the parameters of a neural network (NN) that is used to estimate the nonlinear component. The spike-and-slab prior promotes sparsity in the linear component while simultaneously regularizing the neural network, which ensures flexibility in capturing complex nonlinear relationships without overfitting. The VB framework provides an efficient and scalable inference procedure. We evaluate our method against existing approaches by assessing variable selection accuracy for both linear and nonlinear variables. We further assess the performance of our method through extensive simulations involving covariates with correlated structure and through real-data experiments, where our method demonstrates superior performance, achieving more precise nonlinear function estimation and variable selection, with comparable gains for the linear covariates. 
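A stylized version of the model the abstract describes is sketched below: a partial linear model with spike-and-slab priors on both the linear coefficients and the network weights, fit by maximizing the evidence lower bound; the distributional details are generic assumptions, not the authors' exact specification.

```latex
% Stylized spike-and-slab partial linear model (generic template; the authors'
% exact priors and variational family may differ). g_theta is the NN component.
y_i = x_i^\top \beta + g_\theta(z_i) + \varepsilon_i,
\qquad \varepsilon_i \sim N(0, \sigma^2),
\\[4pt]
\beta_j \mid \gamma_j \sim \gamma_j\, N(0, \sigma_1^2) + (1 - \gamma_j)\, \delta_0,
\qquad \gamma_j \sim \mathrm{Bernoulli}(\pi),
\\[4pt]
\theta_k \mid \eta_k \sim \eta_k\, N(0, \tau_1^2) + (1 - \eta_k)\, \delta_0,
\qquad \eta_k \sim \mathrm{Bernoulli}(\rho),
\\[4pt]
\text{VB: } \max_{q \in \mathcal{Q}} \;
  \mathbb{E}_q\!\left[\log p(y \mid \beta, \theta, \sigma^2)\right]
  - \mathrm{KL}\!\left(q \,\|\, \pi\right).
```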

Keywords

Partial Linear Models (PLM)

Variational Bayes (VB)

Neural Network (NN)

Spike-and-Slab

Variable Selection 

Co-Author(s)

Shrijita Bhattacharya, Michigan State University
Tapabrata Maiti, Michigan State University

First Author

Tathagata Dutta, Michigan State University

Presenting Author

Tathagata Dutta, Michigan State University