Statistical Learning Methods for Feature Selection & Model Tuning

Chris Nnanatu, Chair
University of Southampton
 
Monday, Aug 4, 2:00 PM - 3:50 PM
4078 
Contributed Papers 
Music City Center 
Room: CC-101B 

Main Sponsor

Section on Statistical Computing

Presentations

A Framework for Comprehensive Model and Variable Selection

We propose a framework for choosing variables and relationships without assuming additivity or parametric forms. The relationships between the response and each of the continuous predictors are modeled with regression splines and assumed to be smooth and one of the following: increasing, decreasing, convex, concave, or a combination of monotonicity and convexity. These eight shapes include a wide range of popular parametric functions such as linear, quadratic, and exponential, and the set of choices is appropriate if the component functions "do not wiggle." An ordinal predictor can be assigned a set of possible orderings, such as increasing, decreasing, tree or umbrella orderings, no ordering, or constant. Interactions between continuous predictors are modeled as multi-dimensional warped-plane spline surfaces, where the same possibilities for shapes are considered. We propose combining stepwise selection methods with information criteria, LASSO ideas, and model selection using a genetic algorithm.
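
As a toy illustration of one of the shape constraints above, the hedged Python sketch below fits a smooth increasing trend by nonnegative least squares on a ramp (hinge) basis; the basis, knots, and data are illustrative assumptions, not the authors' spline implementation.

```python
# A minimal sketch (assumed setup): fit a smooth increasing trend by
# nonnegative least squares on a ramp (hinge) basis, a simple stand-in
# for the shape-constrained regression splines described above.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.log1p(5 * x) + rng.normal(0, 0.1, 100)     # smooth, increasing truth

knots = np.linspace(0, 1, 10)[1:-1]               # interior knots
B = np.maximum(x[:, None] - knots[None, :], 0.0)  # increasing basis pieces
# A +1/-1 column pair makes the intercept unconstrained under nnls;
# all remaining coefficients are forced >= 0, which makes the fit monotone.
X = np.column_stack([np.ones_like(x), -np.ones_like(x), B])
coef, _ = nnls(X, y)
fit = X @ coef
print("fit is nondecreasing:", bool(np.all(np.diff(fit) >= -1e-9)))
```

Swapping the basis or the sign constraints in the same way yields the other shapes (for instance, a convex fit keeps the hinge basis but adds an unconstrained linear term).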

Keywords

variable selection

shape and order constraints

nonparametric

nonadditive 

Co-Author

Mary Meyer, Colorado State University

First Author

Xiyue Liao, San Diego State University

Presenting Author

Xiyue Liao, San Diego State University

A Generalized Extension of Finite Mixture Models

Finite mixture models, in which a random variable is assumed to follow some distribution conditioned on a categorical latent variable, have been widely used. Under the standard finite mixture model regime, each observation is assumed to have a single latent variable associated with it. More complex regimes exist, such as semi-supervised finite mixture models, where either some of the latent variables are known or multiple observations depend on the same latent variable. Further variations exist, such as covariance clustering mixture models, finite mixture model discriminant analysis, and parsimonious mixture models, among many others. We propose a generalized extension of finite mixture models in which each observation follows some distribution conditioned on an arbitrary number of latent and/or known variables sampled from an arbitrary number of categorical distributions. It can be shown that a vast number of models can be expressed in this form. We derive the Expectation-Maximization (EM) algorithm for the generalized extension of the finite mixture model, which allows for the rapid development, implementation, and estimation of new models that follow this form.
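
As a point of reference for the base case being generalized, here is a hedged sketch of the standard EM updates for a two-component univariate Gaussian mixture; the data, initialization, and iteration count are illustrative, and the authors' generalized algorithm is not reproduced.

```python
# A minimal sketch (illustrative data and initialization): standard EM
# updates for a univariate two-component Gaussian mixture, the base case
# that the proposed generalization extends.
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 100)])

w, mu, sd = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(200):
    # E-step: posterior responsibility of each latent class per observation
    dens = np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    r = w * dens
    r /= r.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted updates of weights, means, and sds
    nk = r.sum(axis=0)
    w = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print("weights:", w.round(2), "means:", mu.round(2), "sds:", sd.round(2))
```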

Keywords

Clustering

Classification

Few-Shot Learning 

Co-Author

Semhar Michael, South Dakota State University

First Author

Andrew Simpson, South Dakota State University

Presenting Author

Andrew Simpson, South Dakota State University

SUPER: a tuning-free procedure for subgroup analysis

Subgroup analysis has gained considerable attention as heterogeneity becomes increasingly common in many contemporary applications. Without any prior information, current popular methods for subgroup identification typically rely on pairwise fusion penalized mechanisms for shrinkage estimation in clustering. However, these methods require tuning an optimal regularization parameter over a broad range of potential values, resulting in significant computational costs when combined with certain information criteria. In this paper, we propose a new methodology called scaled fusion penalized regression (SUPER), which evaluates the noise level in the fusion penalized regression and automatically incorporates it into the determination of the penalty level, thus achieving the tuning-free property and facilitating further statistical inference. An alternating direction method of multipliers (ADMM) algorithm is then developed to implement the proposed method. We also establish the consistency and asymptotic normality of the proposed estimator. Both the computational and theoretical advantages of SUPER are demonstrated through simulation studies and a real data analysis.
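
To illustrate the penalized-fusion mechanics that SUPER builds on, the sketch below runs a toy ADMM for pairwise fusion of observation-level means at a fixed penalty level; the fixed lam, step size rho, and data are illustrative assumptions, and SUPER's scaled, tuning-free penalty choice is precisely what is not reproduced here.

```python
# A toy sketch: ADMM for pairwise fusion of observation-level means,
#   min_mu 0.5 * ||y - mu||^2 + lam * sum_{i<j} |mu_i - mu_j|.
# lam is FIXED here; choosing it automatically from the noise level is
# SUPER's contribution and is not reproduced.
import numpy as np

rng = np.random.default_rng(2)
y = np.concatenate([rng.normal(1, 0.2, 10), rng.normal(4, 0.2, 10)])
n = len(y)

pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
D = np.zeros((len(pairs), n))                  # pairwise difference operator
for k, (i, j) in enumerate(pairs):
    D[k, i], D[k, j] = 1.0, -1.0

lam, rho = 0.1, 1.0
mu, z, u = y.copy(), D @ y, np.zeros(len(pairs))
A = np.eye(n) + rho * D.T @ D                  # fixed; factor once in practice
for _ in range(500):
    mu = np.linalg.solve(A, y + rho * D.T @ (z - u))       # mu-step
    v = D @ mu + u
    z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0)  # soft threshold
    u += D @ mu - z                                        # dual update

# Roughly two fused centers emerge, shrunk toward each other by the penalty
print("fused centers (rounded):", np.unique(mu.round(1)))
```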

Keywords

Subgroup analysis

Heterogeneity

Scale-invariance

Tuning free

Penalized fusion

Statistical inference 

Co-Author(s)

Daoji Li, California State University, Fullerton
Jie Wu, Anhui University
Zemin Zheng, University of Science and Technology of China

First Author

Letian Li, University of Science and Technology of China

Presenting Author

Daoji Li, California State University, Fullerton

Autotune: fast, efficient, and automatic tuning parameter selection for LASSO

Tuning parameter selection for penalized regression methods such as LASSO is an important issue in practice, albeit less explored in the statistical methodology literature. The most common choices are cross-validation (CV), which is computationally expensive, and information criteria such as AIC/BIC, which are known to perform poorly in high-dimensional scenarios. Guided by the asymptotic theory of LASSO, which connects the choice of the tuning parameter λ to the estimation of the error standard deviation σ, we propose autotune, an automatic tuning algorithm that alternately maximizes a penalized log-likelihood over the regression coefficients β and the nuisance parameter σ. The core insight behind autotune is that under exact or approximate sparsity conditions, estimation of the scalar nuisance parameter σ may often be statistically and computationally easier than estimation of the high-dimensional regression parameter β, leading to a gain in efficiency. Using simulated and real data sets, we show that autotune is faster and provides better estimation, variable selection, and prediction performance than existing tuning strategies for LASSO, as well as alternatives such as the scaled LASSO.
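
A hedged sketch of the alternating idea, in the spirit of scaled-LASSO-type iterations: λ is tied to the current estimate of σ, and β and σ are updated in turn. The constants, degrees-of-freedom correction, and stopping rule below are illustrative choices, not the authors' autotune.

```python
# A minimal sketch of the alternating idea: lambda is tied to the current
# noise estimate sigma, and beta and sigma are updated in turn. Constants,
# the degrees-of-freedom correction, and the stopping rule are illustrative
# choices in the spirit of scaled-LASSO iterations, not the authors' autotune.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p, s = 200, 500, 5
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:s] = 2.0
y = X @ beta_true + rng.normal(0, 1.0, n)

sigma = np.std(y)                             # crude initial noise estimate
lam0 = np.sqrt(2 * np.log(p) / n)             # universal-threshold rate
for _ in range(20):
    model = Lasso(alpha=sigma * lam0).fit(X, y)        # beta-step
    resid = y - model.predict(X)
    df = max(n - np.count_nonzero(model.coef_), 1)
    new_sigma = np.sqrt(resid @ resid / df)            # sigma-step
    if abs(new_sigma - sigma) < 1e-6:
        break
    sigma = new_sigma

print(f"sigma_hat = {sigma:.3f}, nonzero coefficients = {np.count_nonzero(model.coef_)}")
```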

Keywords

Tuning

Biconvex Optimization

Linear Models

High Dimension

Cross Validation

Noise Variance 

Co-Author(s)

Sumanta Basu, Cornell University Department of Statistics and Data Science
Ines Wilms, Maastricht University
Stephan Smeekes, Maastricht University

First Author

Tathagata Sadhukhan, Cornell University

Presenting Author

Tathagata Sadhukhan, Cornell University

High-dimensional Inference for Sparse Vector Autoregression Processes

The rapid growth of big data has increased focus on high-dimensional data analysis across various fields. Vector Auto-Regression (VAR) models are widely used in econometrics for capturing dynamic relationships between variables. However, high-dimensional VAR models often exhibit sparsity, where many coefficients are zero. Exploiting this sparsity improves model efficiency, interpretability, and prediction accuracy.

In this study, we propose two algorithms for sparse VAR model identification, designed for situations where the number of parameters (m) is comparable to the sample size (n). Both methods use p-values for sparsification. The thresholding method (TLSE) removes coefficients whose p-values exceed a cutoff determined by n, and then re-estimates the model. The information criterion-based method (BLSE) uses p-value rankings to fit a sequence of increasingly large models, selecting the one with the smallest Bayesian Information Criterion (BIC).

Simulation results show that the proposed algorithms outperform lasso and BigVAR methods in recovering sparsity patterns, demonstrating their effectiveness for high-dimensional data analysis. 
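
A hedged sketch of the thresholding idea on a toy VAR(1): estimate each equation by OLS, zero out coefficients whose p-values exceed a cutoff, and refit on the surviving support. The Bonferroni-style cutoff and simulated data below are illustrative assumptions, not TLSE's n-dependent rule.

```python
# A minimal sketch of the thresholding idea on a toy VAR(1): estimate each
# equation by OLS, zero out coefficients whose p-values exceed a cutoff,
# and refit on the surviving support. The cutoff below is an illustrative
# Bonferroni-style choice, not the n-dependent rule of the proposed TLSE.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
k, n = 5, 300
A = np.zeros((k, k))
A[0, 0], A[1, 0], A[2, 2] = 0.5, 0.4, -0.5    # sparse, stationary truth
x = np.zeros((n, k))
for t in range(1, n):
    x[t] = x[t - 1] @ A.T + rng.normal(0, 1, k)

Y, X = x[1:], x[:-1]
cutoff = 0.05 / k**2                          # one test per coefficient
A_hat = np.zeros((k, k))
for i in range(k):                            # one OLS fit per equation
    fit = sm.OLS(Y[:, i], X).fit()
    keep = fit.pvalues < cutoff
    if keep.any():                            # refit on the surviving support
        A_hat[i, keep] = sm.OLS(Y[:, i], X[:, keep]).fit().params

print("true support:\n", (A != 0).astype(int))
print("recovered support:\n", (A_hat != 0).astype(int))
```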

Keywords

threshold estimation

information criteria

oracle property

Granger-causality

Sparse VAR 

Co-Author(s)

Mihai Giurcanu, University of Chicago, Department of Public Health Sciences
Alexandre Trindade, Texas Tech University, Department of Mathematics & Statistics

First Author

Dananjani Madiwala Liyanage, University of Minnesota - Duluth

Presenting Author

Dananjani Madiwala Liyanage, University of Minnesota - Duluth

WITHDRAWN: Margin Weighted Robust Discriminant Score for Feature Selection in Imbalanced Gene Expression Classification

Feature selection for high-dimensional gene expression classification faces significant challenges. Conventional procedures such as the Wilcoxon Rank-Sum Test, Proportional Overlap Score, Weighted Signal-to-Noise Ratio, Fisher Score, and ensemble Minimum Redundancy Maximum Relevance struggle with redundancy and class imbalance, often inadequately representing the minority class. To address these issues, this work proposes the Margin Weighted Robust Discriminant Score (MW-RDS), a novel feature selection method for high-dimensional imbalanced data. MW-RDS uses a Minority Amplification Factor to amplify observations in the minority class, coupled with class-specific stability weights based on a robust discriminant score (RDS). Margin weights derived from support vectors improve the discriminative capability of genes. The gene set is further refined using L1 regularization, reducing redundancy. The procedure is assessed on nine high-dimensional gene expression and simulated datasets using three classifiers, showing superior performance across several metrics. Additional visualization via boxplots and stability plots further validates the efficacy of the proposed method.
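
As a generic illustration of imbalance-aware feature ranking (not the proposed MW-RDS, whose margin weights, stability weights, and Minority Amplification Factor are the paper's contribution), the sketch below scores features by a robust between-class separation in which each class contributes equally regardless of size.

```python
# A generic sketch (not MW-RDS): rank features by a robust between-class
# separation score in which each class's spread counts equally, so the
# minority class is not swamped by the majority.
import numpy as np

rng = np.random.default_rng(5)
n_maj, n_min, p = 90, 10, 200                 # imbalanced two-class toy data
X0 = rng.normal(0, 1, (n_maj, p))             # majority class
X1 = rng.normal(0, 1, (n_min, p))             # minority class
X1[:, :5] += 1.5                              # first 5 features informative

def mad(Z):                                   # robust per-feature spread
    return np.median(np.abs(Z - np.median(Z, axis=0)), axis=0) + 1e-8

# Equal-weight pooling of the class spreads is a crude stand-in for
# minority amplification; MW-RDS's margin and stability weights are not used.
score = np.abs(X1.mean(axis=0) - X0.mean(axis=0)) / (0.5 * (mad(X0) + mad(X1)))
print("top-ranked features:", np.argsort(score)[::-1][:5])
```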

Keywords

Classification

High dimensional gene expression datasets

Feature selection 

Co-Author(s)

Saeed Aldahmani
Zardad Khan

First Author

Sheema Gul, Abdul University