Tuesday, Aug 5: 8:30 AM - 10:20 AM
4086
Contributed Papers
Music City Center
Room: CC-104D
Main Sponsor
Section on Statistical Computing
Presentations
L1 penalized quantile regression (PQR) is used in many fields as an alternative to penalized least squares regression for data analysis. Existing algorithms for PQR either use linear programming, which does not scale well in high dimensions, or an approximate coordinate descent (CD) that does not solve for the exact coordinatewise minimum of the nonsmooth loss function. Further, neither approach leverages the sparsity structure of the problem in large-scale datasets. To avoid the computational challenges associated with the nonsmooth quantile loss, some recent works have even advocated using smooth approximations to the exact problem. In this work, we develop a fast, pathwise CD algorithm that computes exact L1 PQR estimates for data of any dimension. We derive an easy-to-compute exact solution for the coordinatewise nonsmooth loss minimization, which, to the best of our knowledge, has not been reported in the literature. We also employ a random perturbation to help the algorithm avoid getting stuck along the regularization path. On simulated and real-world datasets, we show that our algorithm runs substantially faster than existing alternatives while retaining the same level of estimation accuracy.
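The coordinatewise subproblem behind such an update can be verified directly. The sketch below is an illustration only, not the paper's closed-form update: it exploits the fact that the L1-penalized quantile objective is piecewise linear and convex in a single coordinate, so its exact minimum lies at a breakpoint, and simply searches those breakpoints by brute force. Function names and the search strategy are assumptions made for this sketch.

```python
import numpy as np

def check_loss(u, tau):
    """Quantile (check) loss: sum of u * (tau - I(u < 0))."""
    return np.sum(u * (tau - (u < 0)))

def coordinate_min_quantile_l1(r, x, tau, lam):
    """Exact minimizer over b of sum_i rho_tau(r_i - x_i * b) + lam * |b|.

    The objective is piecewise linear and convex in b, so its minimum is
    attained at one of the breakpoints {0} union {r_i / x_i : x_i != 0}.
    This brute-force search is O(n^2) per coordinate; it only illustrates
    exactness, not the fast update derived in the paper.
    """
    nz = x != 0
    candidates = np.concatenate(([0.0], r[nz] / x[nz]))
    objective = lambda b: check_loss(r - x * b, tau) + lam * abs(b)
    values = [objective(b) for b in candidates]
    return candidates[int(np.argmin(values))]
```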
Keywords
LASSO
penalized quantile regression
coordinate descent
pathwise algorithm
Dimension reduction techniques play a significant role in analyzing high-dimensional data, especially in fields like radiomics, where extracting meaningful patterns from complex datasets is essential. This study evaluates the performance of Principal Component Analysis (PCA), Isomap, and t-Distributed Stochastic Neighbor Embedding (t-SNE) in preserving data structure based on average silhouette scores. Through extensive simulations, we compare these methods across datasets with varying sample sizes (n = 100, 200, 300, 400, 500), noise levels (σ² = 0.25, 0.5, 0.75, 1, 1.5, 2), and feature counts (p = 20, 50, 100, 200, 300, 400). Our findings indicate that for datasets with an underlying linear structure, PCA achieves the highest accuracy in maintaining cluster integrity, as measured by the average silhouette score. Conversely, for nonlinear data structures, Isomap and t-SNE outperform PCA in preserving meaningful relationships.
One important application of these findings is in radiomics, where high-dimensional imaging data is used to extract quantitative biomarkers for cancer diagnosis and prognosis.
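A minimal version of such a comparison can be run with scikit-learn. The settings below (a single Gaussian-blob configuration) are hypothetical placeholders for the study's full grid of sample sizes, noise levels, and feature counts.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, TSNE
from sklearn.metrics import silhouette_score

# Hypothetical simulation cell: n = 300, p = 50, three clusters.
X, labels = make_blobs(n_samples=300, n_features=50, centers=3,
                       cluster_std=1.0, random_state=0)

reducers = {
    "PCA": PCA(n_components=2),
    "Isomap": Isomap(n_components=2),
    "t-SNE": TSNE(n_components=2, random_state=0),
}

for name, reducer in reducers.items():
    embedding = reducer.fit_transform(X)
    # Average silhouette score of the known clusters in the embedding.
    score = silhouette_score(embedding, labels)
    print(f"{name}: average silhouette = {score:.3f}")
```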
Keywords
Dimension Reduction Techniques
Linear and Nonlinear Data Structures
Radiomics
Principal Component Analysis (PCA)
Isomap
t-Distributed Stochastic Neighbor Embedding (t-SNE)
The need to model higher-dimensional data, such as in a tensor-variate framework where each observation is a three-dimensional object, is growing due to rapid improvements in computational power and data storage capabilities. In this study, a finite mixture of hidden Markov models for tensor-variate time series data is developed. Simulation studies demonstrate high classification accuracy for both cluster and regime IDs. To further validate the usefulness of the proposed model, it is applied to real-life data with promising results.
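The forward-backward recursion listed among the keywords below is, in its generic form, a short computation. The sketch assumes a single HMM with given transition matrix and per-time emission log-likelihoods; the tensor-variate emission densities and the mixture over HMMs, which are the paper's contribution, are not shown, and the function signature is an assumption.

```python
import numpy as np

def forward_backward(log_lik, trans, init):
    """Scaled forward-backward recursion for a single HMM.

    log_lik : (T, K) per-time, per-state emission log-likelihoods
              (in the tensor-variate model these would come from a
               tensor-variate density; here they are generic inputs).
    trans   : (K, K) transition matrix, rows summing to one.
    init    : (K,) initial state distribution.
    Returns the (T, K) smoothed state (regime) probabilities.
    """
    T, K = log_lik.shape
    lik = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))

    alpha = np.zeros((T, K))
    c = np.zeros(T)                      # scaling factors
    alpha[0] = init * lik[0]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * lik[t]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]

    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        beta[t] = (trans @ (lik[t + 1] * beta[t + 1])) / c[t + 1]

    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```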
Keywords
Finite Mixture model
Hidden Markov model
Forward-backward algorithm
tensor-variate time series
We aim to estimate and conduct inference for the effects of multiple covariates of interest simultaneously, after adjusting for the effects of high-dimensional control variables under a multivariate linear model setting. A chi-square statistic is proposed, based on the residuals obtained from fitting the response variables and the target covariates to the control covariates via regularized estimation. Procedures for hypothesis testing and confidence interval construction are developed. The proposed procedures mitigate the potential overfitting errors that regularized estimation introduces into inference on the target parameters and account for the inherent interconnectivity among the response variables.
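One generic residual-based construction of this flavor, shown purely for illustration and not necessarily the proposed statistic, residualizes both the multivariate response and a single target covariate on the controls with the lasso and then forms a Wald-type chi-square; the function and its details are assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def residual_chisq(Y, d, Z):
    """Illustrative residual-based chi-square: test whether a single target
    covariate d is associated with the multivariate response Y after
    adjusting for high-dimensional controls Z via lasso residualization.
    (An assumed construction in the spirit of the abstract only.)
    """
    n, q = Y.shape
    # Residualize the target covariate on the controls with the lasso.
    d_res = d - LassoCV(cv=5).fit(Z, d).predict(Z)
    # Residualize each response on the controls with the lasso.
    Y_res = np.column_stack(
        [y - LassoCV(cv=5).fit(Z, y).predict(Z) for y in Y.T]
    )
    # OLS of residualized responses on the residualized target covariate.
    denom = d_res @ d_res
    B = (d_res @ Y_res) / denom                 # (q,) coefficient vector
    E = Y_res - np.outer(d_res, B)              # residual matrix
    Sigma = (E.T @ E) / n                       # response covariance estimate
    # Wald-type statistic; asymptotically chi-square with q d.o.f. under H0.
    return denom * (B @ np.linalg.solve(Sigma, B))
```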
Keywords
High dimensional Inference
Multivariate
Hotelling-Lawley trace
Principal components computed via PCA are traditionally used to reduce dimensionality in genomic data or correct for population stratification. In this paper, we explore the penalized eigenvalue problem (PEP), which reformulates the first eigenvector computation as an optimization problem, adding an L1 penalty to enforce sparsity. Our contribution is threefold. First, we extend PEP by applying Nesterov smoothing to the LASSO-type L1 penalty, enabling analytical gradient computation for faster, more efficient minimization of the objective function. Second, we illustrate how higher-order eigenvectors can be computed with PEP using established SVD results. Third, we present experimental studies demonstrating the utility of smoothed penalized eigenvectors compared to other state-of-the-art methods. Using 1000 Genomes Project data, we empirically show that our smoothed PEP improves numerical stability and yields meaningful eigenvectors. We employ the PEP approach in further real-data applications (polygenic risk score computation and clustering), showing that exchanging the penalized eigenvectors for their smoothed counterparts enhances prediction accuracy and cluster discernibility.
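A stripped-down sketch of the smoothed penalized eigenvector idea, under the assumption of a Huber-type Nesterov smoothing of the absolute value and a plain projected-gradient update on the unit sphere, might look like the following; the step-size schedule, deflation for higher-order eigenvectors, and the SVD machinery of the actual method are not reproduced.

```python
import numpy as np

def smoothed_abs_grad(v, mu):
    """Gradient of the Nesterov-smoothed |.| (Huber-like surrogate with
    smoothing parameter mu): clip(v / mu, -1, 1)."""
    return np.clip(v / mu, -1.0, 1.0)

def sparse_leading_eigvec(A, lam, mu=1e-3, step=1e-2, iters=5000, seed=0):
    """Projected gradient ascent on v'Av - lam * sum_j |v_j|_mu over the
    unit sphere (a simplified sketch of a smoothed penalized eigenvalue
    problem, not the paper's exact algorithm)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        grad = 2.0 * A @ v - lam * smoothed_abs_grad(v, mu)
        v = v + step * grad
        v /= np.linalg.norm(v)          # project back onto the unit sphere
    return v
```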
Keywords
Principal Component Analysis
Eigenvector
Smoothing
Genomic Relationship Matrix
Singular Value Decomposition
Nesterov
Deep neural networks (DNNs) have been widely used for real-world regression tasks, but applying them to high-dimensional, low-sample-size data presents unique challenges. Existing approaches often prioritize sparse linear relationships before extending to the full DNN structure, which can overlook important nonlinear associations. The problem becomes even more complex when selecting the network architecture, such as determining the optimal number of layers and neurons. This study addresses these challenges by linking neuron selection in DNNs to knot placement in basis expansion techniques and additive modeling, introducing a sparsity-inducing difference penalty. This penalty automates knot selection and promotes parsimony in neuron activations, resulting in an efficient and scalable fitting method with automated architecture selection. The proposed method, named Sparse Deep P-Spline, is validated through numerical studies, demonstrating its ability to efficiently detect sparse nonlinear relationships. Applications to the analysis of computer experiments are also presented.
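The difference penalty that P-splines are built on is, in its classical ridge-type form, a few lines of code. The sketch below shows only that building block (the Whittaker-Eilers smoother); the sparsity-inducing variant and its use for neuron selection inside a DNN, which are the paper's contribution, are not reproduced.

```python
import numpy as np

def whittaker_smoother(y, lam=10.0, order=2):
    """Difference-penalized least squares (Whittaker-Eilers smoother):
    minimize ||y - z||^2 + lam * ||D z||^2, where D is the `order`-th
    difference matrix. P-splines apply the same kind of penalty to
    basis coefficients; the sparsity-inducing variant in the abstract
    is not shown here.
    """
    n = len(y)
    D = np.diff(np.eye(n), n=order, axis=0)      # (n - order, n) difference matrix
    return np.linalg.solve(np.eye(n) + lam * (D.T @ D), y)
```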
Keywords
Deep Smoothing Regression
Additive Models
Feature Selection
Fast Tuning Algorithm
In this work, we develop a novel variational inference framework for a regularized multivariate regression model that integrates latent clustering with advanced low-rank regression techniques. We demonstrate the utility of our method through simulation studies and an application to county-level COVID-19 outcomes, the Social Vulnerability Index (SVI), and non-pharmaceutical interventions (NPIs) in Florida. Our experiments show that the proposed framework not only enhances model flexibility and computational scalability but also offers valuable insights for targeted interventions, particularly in identifying vulnerable groups.
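The low-rank regression component alone can be illustrated with classical reduced-rank regression, used here as an assumed stand-in; the latent clustering, regularization, and variational inference of the proposed framework are not shown.

```python
import numpy as np

def reduced_rank_regression(X, Y, rank):
    """Classical reduced-rank multivariate regression: fit OLS, then project
    the fitted values onto their top `rank` right singular directions.
    Only the low-rank building block is illustrated here.
    """
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)   # (p, q) OLS coefficients
    fitted = X @ B_ols
    # Right singular vectors of the fitted values define the rank-r projector.
    _, _, Vt = np.linalg.svd(fitted, full_matrices=False)
    P = Vt[:rank].T @ Vt[:rank]                     # (q, q) projection matrix
    return B_ols @ P                                # rank-constrained coefficients
```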
Keywords
Low-Rank Regression
Variational Inference
Social Vulnerability