Monday, Aug 4: 2:00 PM - 3:50 PM
4078
Contributed Papers
Music City Center
Room: CC-101B
Main Sponsor
Section on Statistical Computing
Presentations
We propose a framework for choosing variables and relationships without assuming additivity or parametric forms. The relationship between the response and each continuous predictor is modeled with regression splines and assumed to be smooth and one of the following: increasing, decreasing, convex, concave, or a combination of monotonicity and convexity. These eight shapes cover a wide range of popular parametric functions such as linear, quadratic, and exponential, and the set of choices is appropriate when the component functions "do not wiggle." For an ordinal predictor, the possible orderings include increasing, decreasing, tree or umbrella orderings, no ordering, or constant. Interactions between continuous predictors are modeled as multi-dimensional warped-plane spline surfaces, where the same possibilities for shapes are considered. We propose combining stepwise selection methods with information criteria, LASSO ideas, and model selection using a genetic algorithm.
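To give a flavor of the shape-selection step, the sketch below compares a few candidate monotone fits for a single predictor by a BIC-type score. It is a minimal illustration only, not the authors' implementation: it uses scikit-learn's IsotonicRegression in place of shape-constrained regression splines, covers only constant/increasing/decreasing shapes, and uses a crude effective-degrees-of-freedom count; the convex/concave shapes, warped-plane interaction surfaces, and genetic-algorithm search are not reproduced.

# A minimal sketch (not the authors' method): pick among "constant",
# "increasing", and "decreasing" fits for one predictor via a BIC-type score.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def bic_like(y, fitted, df):
    n = len(y)
    rss = np.sum((y - fitted) ** 2)
    return n * np.log(rss / n) + np.log(n) * df

def select_shape(x, y):
    fits = {
        "constant": (np.full_like(y, y.mean()), 1.0),
        "increasing": (IsotonicRegression(increasing=True).fit_transform(x, y), None),
        "decreasing": (IsotonicRegression(increasing=False).fit_transform(x, y), None),
    }
    scores = {}
    for name, (fitted, df) in fits.items():
        # crude effective degrees of freedom: number of distinct fitted levels
        df = df if df is not None else len(np.unique(np.round(fitted, 8)))
        scores[name] = bic_like(y, fitted, df)
    return min(scores, key=scores.get), scores

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.exp(2 * x) + rng.normal(scale=0.5, size=200)   # smooth increasing signal
best, scores = select_shape(x, y)
print(best, scores)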
Keywords
variable selection
shape and order constraints
nonparametric
nonadditive
Co-Author
Mary Meyer, Colorado State University
First Author
Xiyue Liao, San Diego State University
Presenting Author
Xiyue Liao, San Diego State University
Finite mixture models are widely used in settings where a random variable is assumed to follow some distribution conditioned on a categorical latent variable. Under the standard finite mixture model regime, each observation is assumed to have a single latent variable associated with it. More complex regimes exist, such as semi-supervised finite mixture models, where either some of the latent variables are known or multiple observations depend on the same latent variable. Further variations exist, such as covariance clustering mixture models, finite mixture model discriminant analysis, and parsimonious mixture models, among many others. We propose a generalized extension of finite mixture models in which each observation follows some distribution conditioned on an arbitrary number of latent and/or known variables sampled from an arbitrary number of categorical distributions. A vast number of models can be expressed in this form. We derive the Expectation-Maximization (EM) algorithm for the generalized extension of the finite mixture model, which allows for the rapid development, implementation, and estimation of new models that follow this form.
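As a point of reference for the standard regime that the abstract generalizes, the sketch below is a bare-bones EM for a two-component univariate Gaussian mixture, with one latent class label per observation. The generalized extension with multiple latent and/or known variables is not reproduced here.

# Minimal EM for a 2-component univariate Gaussian mixture -- the standard
# "one latent variable per observation" regime that the abstract generalizes.
import numpy as np
from scipy.stats import norm

def em_gmm2(y, n_iter=200, tol=1e-8):
    pi, mu, sd = 0.5, np.array([y.min(), y.max()]), np.array([y.std(), y.std()])
    ll_old = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probability of component 1 for each observation
        d0 = (1 - pi) * norm.pdf(y, mu[0], sd[0])
        d1 = pi * norm.pdf(y, mu[1], sd[1])
        r = d1 / (d0 + d1)
        # M-step: update mixing weight, means, and standard deviations
        pi = r.mean()
        mu = np.array([np.average(y, weights=1 - r), np.average(y, weights=r)])
        sd = np.array([
            np.sqrt(np.average((y - mu[0]) ** 2, weights=1 - r)),
            np.sqrt(np.average((y - mu[1]) ** 2, weights=r)),
        ])
        ll = np.log(d0 + d1).sum()
        if ll - ll_old < tol:
            break
        ll_old = ll
    return pi, mu, sd

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 1, 200)])
print(em_gmm2(y))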
Keywords
Clustering
Classification
Few-Shot Learning
Subgroup analysis has gained considerable attention as heterogeneity becomes increasingly common in many contemporary applications. Without any prior information, current popular methods for subgroup identification typically rely on pairwise fusion penalized mechanisms for shrinkage estimation in clustering. However, these methods require tuning an optimal regularization parameter over a broad range of candidate values, resulting in significant computational costs associated with certain information criteria. In this paper, we propose a new methodology called scaled fusion penalized regression (SUPER), which evaluates the noise level in the fusion penalized regression and incorporates it into the determination of the penalty level in an automatic way, thus enjoying a tuning-free property and facilitating further statistical inference. An alternating direction method of multipliers (ADMM) algorithm is then developed to implement the proposed method. We also establish the consistency and asymptotic normality of the proposed estimator. Both computational and theoretical advantages of SUPER are demonstrated by simulation studies and a real data analysis.
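The abstract does not spell out the updates, so the following is only a schematic ADMM for a simplified pairwise-fusion problem (subject-specific means with an L1 penalty on all pairwise differences), not the SUPER estimator itself; in particular, the scaled, tuning-free calibration of the penalty level is replaced here by a hand-picked lam.

# Schematic ADMM for a simplified pairwise-fusion problem:
#   minimize 0.5*||y - mu||^2 + lam * sum_{i<j} |mu_i - mu_j|
# This is NOT the SUPER estimator; lam is fixed by hand here, whereas SUPER
# calibrates the penalty level automatically from the estimated noise level.
import numpy as np

def fusion_admm(y, lam=0.02, rho=1.0, n_iter=500):
    n = len(y)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    D = np.zeros((len(pairs), n))
    for k, (i, j) in enumerate(pairs):
        D[k, i], D[k, j] = 1.0, -1.0           # row k encodes mu_i - mu_j
    eta = np.zeros(len(pairs))                  # auxiliary pairwise differences
    u = np.zeros(len(pairs))                    # scaled dual variables
    A = np.eye(n) + rho * D.T @ D               # fixed matrix for the mu-update
    for _ in range(n_iter):
        mu = np.linalg.solve(A, y + rho * D.T @ (eta - u))
        z = D @ mu + u
        eta = np.sign(z) * np.maximum(np.abs(z) - lam / rho, 0.0)   # soft-threshold
        u = u + D @ mu - eta
    return mu

rng = np.random.default_rng(2)
y = np.concatenate([rng.normal(0, 0.3, 20), rng.normal(3, 0.3, 20)])   # two subgroups
print(np.round(fusion_admm(y), 2))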
Keywords
Subgroup analysis
Heterogeneity
Scale-invariance
Tuning free
Penalized fusion
Statistical inference
Co-Author(s)
Daoji Li, California State University, Fullerton
Jie Wu, Anhui University
Zemin Zheng, University of Science and Technology of China
First Author
Letian Li, University of Science and Technology of China
Presenting Author
Daoji Li, California State University, Fullerton
Tuning parameter selection for penalized regression methods such as the LASSO is an important issue in practice, albeit one less explored in the statistical methodology literature. The most common choices are cross-validation (CV), which is computationally expensive, and information criteria such as AIC/BIC, which are known to perform poorly in high-dimensional scenarios. Guided by the asymptotic theory of the LASSO, which connects the choice of tuning parameter λ to estimation of the error standard deviation σ, we propose autotune, an automatic tuning algorithm that alternately maximizes a penalized log-likelihood over the regression coefficients β and the nuisance parameter σ. The core insight behind autotune is that under exact or approximate sparsity conditions, estimating the scalar nuisance parameter σ is often statistically and computationally easier than estimating the high-dimensional regression parameter β, leading to a gain in efficiency. Using simulated and real data sets, we show that autotune is faster and provides better estimation, variable selection, and prediction performance than existing tuning strategies for the LASSO as well as alternatives such as the scaled LASSO.
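The alternation can be sketched as follows; this is a simplification under assumed updates, not the authors' autotune code. Given a current estimate of σ, set λ proportional to σ·sqrt(2·log(p)/n), fit the LASSO, re-estimate σ from the residuals, and repeat, in the spirit of the λ–σ coupling mentioned in the abstract.

# A simplified alternation between beta and sigma (assumed updates, not the
# authors' autotune implementation), using scikit-learn's Lasso.
import numpy as np
from sklearn.linear_model import Lasso

def alternate_lasso(X, y, n_iter=20, tol=1e-4):
    n, p = X.shape
    sigma = np.std(y)                           # crude initial noise level
    for _ in range(n_iter):
        lam = sigma * np.sqrt(2 * np.log(p) / n)
        fit = Lasso(alpha=lam, max_iter=10000).fit(X, y)
        resid = y - fit.predict(X)
        df = np.count_nonzero(fit.coef_)        # number of selected variables
        sigma_new = np.sqrt(np.sum(resid ** 2) / max(n - df, 1))
        if abs(sigma_new - sigma) < tol:
            sigma = sigma_new
            break
        sigma = sigma_new
    return fit, sigma

rng = np.random.default_rng(3)
n, p, s = 100, 200, 5
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:s] = 2.0
y = X @ beta + rng.normal(scale=1.0, size=n)
fit, sigma_hat = alternate_lasso(X, y)
print("selected:", np.nonzero(fit.coef_)[0], "sigma_hat:", round(sigma_hat, 3))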
Keywords
Tuning
Biconvex Optimization
Linear Models
High Dimension
Cross Validation
Noise Variance
The rapid growth of big data has increased focus on high-dimensional data analysis across various fields. Vector Auto-Regression (VAR) models are widely used in econometrics for capturing dynamic relationships between variables. However, high-dimensional VAR models often exhibit sparsity, where many coefficients are zero. Exploiting this sparsity improves model efficiency, interpretability, and prediction accuracy.
In this study, we propose two algorithms for sparse VAR model identification, designed for situations where the number of parameters (m) is comparable to the sample size (n). Both methods use p-values for sparsification. The thresholding method (TLSE) removes coefficients whose p-values exceed a cutoff determined by n and then re-estimates the model. The information criterion-based method (BLSE) uses p-value rankings to fit successively larger models, selecting the one with the smallest Bayesian Information Criterion (BIC).
Simulation results show that the proposed algorithms outperform lasso and BigVAR methods in recovering sparsity patterns, demonstrating their effectiveness for high-dimensional data analysis.
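A single-equation analogue of the BLSE idea can be sketched as below; this is an assumed simplification on a plain regression (via statsmodels OLS p-values and BIC), not the authors' VAR implementation: rank candidate coefficients by p-value, fit nested models of increasing size, and keep the one with the smallest BIC.

# Single-equation analogue of the BLSE idea (assumed simplification, not the
# authors' code): rank coefficients by p-value, fit nested models, pick by BIC.
import numpy as np
import statsmodels.api as sm

def blse_like(X, y):
    full = sm.OLS(y, sm.add_constant(X)).fit()
    order = np.argsort(full.pvalues[1:])        # rank predictors, skip intercept
    best_bic, best_idx = np.inf, np.array([], dtype=int)
    for k in range(1, X.shape[1] + 1):
        idx = np.sort(order[:k])
        fit = sm.OLS(y, sm.add_constant(X[:, idx])).fit()
        if fit.bic < best_bic:
            best_bic, best_idx = fit.bic, idx
    return best_idx, best_bic

rng = np.random.default_rng(4)
n, m = 120, 30                                  # sample size vs. parameter count
X = rng.normal(size=(n, m))
y = 1.5 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(size=n)
idx, bic = blse_like(X, y)
print("kept predictors:", idx, "BIC:", round(bic, 1))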
Keywords
threshold estimation
information criteria
oracle property
Granger-causality
Sparse VAR
Feature selection for high-dimensional gene expression classification faces significant challenges. Conventional procedures such as the Wilcoxon Rank-Sum Test, Proportional Overlap Score, Weighted Signal-to-Noise Ratio, Fisher Score, and ensemble Minimum Redundancy Maximum Relevance struggle with redundancy and class imbalance, often representing the minority class inadequately. To address these issues, this work proposes the Margin Weighted Robust Discriminant Score (MW-RDS), a novel feature selection method for high-dimensional imbalanced data. MW-RDS uses a Minority Amplification Factor to amplify observations in the minority class, coupled with class-specific stability weights based on a robust discriminant score (RDS). Margin weights derived from support vectors improve the discriminative capability of the selected genes. The gene set is further refined using l1-regularization, reducing redundancy. The procedure is assessed on nine high-dimensional gene expression and simulated datasets using three classifiers, achieving superior performance across several metrics. Additional visualization via boxplots and stability plots further validates the efficacy of the proposed method.
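The final redundancy-reduction step can be illustrated generically as below, using an L1-penalized, class-weighted logistic regression on a pre-ranked gene subset. This is only a stand-in illustration: the MW-RDS ranking itself (minority amplification factor, stability weights, and margin weights from support vectors) is not reproduced, and a univariate F-score ranking is used in its place.

# Generic illustration of the final refinement step only: an L1-penalized,
# class-weighted logistic regression applied to a pre-ranked gene subset.
# The MW-RDS score itself is not reproduced here.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n_minor, n_major, p = 15, 85, 2000              # imbalanced classes, many genes
X = rng.normal(size=(n_minor + n_major, p))
y = np.r_[np.ones(n_minor), np.zeros(n_major)]
X[y == 1, :10] += 1.5                           # 10 informative genes

# Stage 1: univariate ranking stands in for the MW-RDS ranking.
top = SelectKBest(f_classif, k=100).fit(X, y).get_support(indices=True)

# Stage 2: l1 regularization with class weights prunes redundant genes.
clf = LogisticRegression(penalty="l1", solver="liblinear",
                         class_weight="balanced", C=0.5).fit(X[:, top], y)
kept = top[np.abs(clf.coef_[0]) > 0]
print("genes kept after l1 refinement:", kept)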
Keywords
Classification
High dimensional gene expression datasets
Feature selection