SPEED 3: Statistical Methods for High Dimensional and Complex Data, Part 1

Jijia Wang, Chair
UT Southwestern Medical Center
 
Monday, Aug 4: 10:30 AM - 12:20 PM
4059 
Contributed Speed 
Music City Center 
Room: CC-104A 

Presentations

$S^1$-Indexed Brownian Motion Through Abstract Wiener Spaces

Brownian motion is typically introduced as a stochastic process indexed by the half-line $[0, \infty)$, while a Brownian sheet is indexed by an octant of Euclidean space. Recent research has focused on extending these concepts to non-Euclidean index sets (primarily Riemannian manifolds), seeking to define stochastic processes over them that merit the name 'Brownian motion.' This extension is not merely a mathematical exercise: it aims to provide rigorous foundations for the 'SPDE approach' to analyzing data over such spaces, particularly addressing questions about the sparsity of Matérn covariance functions in these settings. In this work, we identify a critical gap in existing approaches: the lack of guaranteed path continuity in the processes explored so far. We present a modification that resolves this limitation, thereby establishing a more robust theoretical foundation for this emerging line of research. 

Keywords

SPDE Approach

Brownian motion

Matérn covariance 

Co-Author

Chunfeng Huang, Indiana University

First Author

Nicolas Escobar

Presenting Author

Nicolas Escobar

Multiple Imputation for Binary and Ordinal Responses: A Multivariate Probit Model Approach

Handling missing data in studies with mixed multivariate responses is a critical challenge in statistical research. We propose a multiple imputation technique for datasets with binary and ordinal variables. This method, based on a multivariate probit model using Markov chain Monte Carlo, captures the correlation structure among variables while respecting their categorical nature. We evaluate the method under various missing data scenarios: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Comparisons with standard imputation techniques, such as multivariate normal-based imputation and multiple imputation by chained equations (MICE), reveal that our approach outperforms existing methods. It better preserves the joint distribution of the data and provides unbiased parameter estimates, particularly under complex missingness patterns. Our findings highlight the multivariate probit model's potential as a robust and flexible tool for multiple imputation in datasets with mixed ordinal and binary responses. This advancement enhances the reliability of statistical inference in applied research involving such data structures. 
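
As a minimal sketch of the data-augmentation idea behind such a sampler (not the authors' implementation), the snippet below draws a latent normal consistent with an observed binary value and imputes a missing binary entry from the conditional latent normal. The means `mu_obs`, `mu_mis` and latent correlation `rho` are hypothetical, and the ordinal-category and parameter-update steps of the full method are omitted.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)

def draw_latent(y, mu):
    """Standard probit data augmentation: draw z ~ N(mu, 1) truncated so
    that z >= 0 when y = 1 and z <= 0 when y = 0."""
    if y == 1:
        a, b = -mu, np.inf          # truncnorm bounds are standardized
    else:
        a, b = -np.inf, -mu
    return truncnorm.rvs(a, b, loc=mu, scale=1.0, random_state=rng)

def impute_binary(z_obs, mu_obs, mu_mis, rho):
    """Impute a missing binary entry from the conditional latent normal
    z_mis | z_obs ~ N(mu_mis + rho * (z_obs - mu_obs), 1 - rho**2)."""
    m = mu_mis + rho * (z_obs - mu_obs)
    s = np.sqrt(1.0 - rho ** 2)
    return int(rng.normal(m, s) > 0)

# one augmentation/imputation cycle: y1 = 1 observed, y2 missing
z1 = draw_latent(1, mu=0.3)
y2_imp = impute_binary(z1, mu_obs=0.3, mu_mis=-0.2, rho=0.6)
```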

Keywords

Multiple Imputation

Multivariate probit model

Markov chain Monte Carlo (MCMC)

Missing completely at random (MCAR)

Missing at random (MAR)

Missing not at random (MNAR) 

First Author

Stephen Kofi Acheampong

Presenting Author

Stephen Kofi Acheampong

Advancing Ultra-high-dimensional Functional Regression: Exploring Genome-wide Association Studies

Genome-Wide Association Studies (GWAS) with imaging phenotypes pose significant challenges due to the complex interplay between high-dimensional genetic data and intricate spatial structures inherent in imaging data. In this paper, we develop an ultra-high-dimensional functional regression model tailored for GWAS with imaging phenotypes, incorporating genetic and non-visual contextual information. We approximate the coefficient functions using bivariate penalized splines and propose a forward selection procedure based on a functional Bayesian Information Criterion. This procedure is designed to identify critical main effects and interactions, adapting to imaging data characteristics. It achieves consistent variable selection in moderately high-dimensional settings and exhibits the sure screening property in ultra-high-dimensional scenarios. Extensive simulation studies and an analysis of data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) demonstrate the superior performance of the proposed method. 
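
The forward selection loop driven by an information criterion can be illustrated in an ordinary linear-model setting. This sketch uses the plain BIC rather than the functional BIC of the paper, and all data are simulated.

```python
import numpy as np

def bic_linear(X, y):
    """BIC of an ordinary least squares fit: n*log(RSS/n) + p*log(n)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + p * np.log(n)

def forward_select(X, y, max_steps=10):
    """Greedy forward selection: at each step add the column that lowers
    BIC the most; stop when no candidate improves the current BIC."""
    d = X.shape[1]
    active, best = [], np.inf
    for _ in range(max_steps):
        scores = {j: bic_linear(X[:, active + [j]], y)
                  for j in range(d) if j not in active}
        j_star = min(scores, key=scores.get)
        if scores[j_star] >= best:
            break
        best = scores[j_star]
        active.append(j_star)
    return active

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
y = 2.0 * X[:, 3] - 1.5 * X[:, 7] + rng.normal(scale=0.5, size=200)
sel = sorted(forward_select(X, y))   # should recover columns 3 and 7
```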

Keywords

Bayesian Information Criterion

Functional linear model

Bivariate splines

Forward selection

GWAS 

Co-Author(s)

Lily Wang, George Mason University
Guannan Wang, College of William and Mary

First Author

Wanying Zhu

Presenting Author

Wanying Zhu

An R Package for Multivariate Penalized Splines on Triangulations with Global & Distributed Learning

The MPST (Multivariate Penalized Spline over Triangulation) package provides a robust and efficient framework for statistical modeling of large-scale 2D and 3D data. Using advanced multivariate penalized splines, MPST effectively handles irregular domains, noisy observations, and sparse datasets. It supports global and distributed learning, enabling seamless large-scale analysis. Its distributed framework employs domain decomposition, partitioning data into subsets based on triangulation, processing them in parallel, and integrating results efficiently. This approach enhances computational performance without sacrificing accuracy. A key strength of MPST is its ability to achieve precise local fitting with varying smoothness across subdomains, ensuring smooth global transitions and overcoming traditional spline limitations. Additionally, MPST provides user-friendly 2D and 3D visualization tools, aiding result interpretation. Numerical studies show MPST outperforms existing smoothing methods in accuracy, efficiency, and scalability. By integrating state-of-the-art smoothing techniques with distributed computing, MPST is a powerful tool for complex, high-dimensional data modeling. 

Keywords

Complex multidimensional data

Computational efficiency

Distributed learning

Nonparametric smoothing

Multivariate spline smoothing

MPST package 

Co-Author(s)

Lily Wang, George Mason University
Guannan Wang, College of William and Mary

First Author

YU-CHUN WANG, George Mason University

Presenting Author

YU-CHUN WANG, George Mason University

Analysis of Multivariate Binary Data Using D-Vine Copula Model

Multivariate binary data arise in various scientific fields. The Multivariate Probit (MP) model is widely used for analyzing such data. However, it can fail even within a feasible range of binary variable correlations due to its requirement for a positive definite latent correlation matrix. To address this limitation, we propose a pair copula model using a D-vine with an assumed dependence structure of either first-order autoregressive or equicorrelation, which overcomes the difficulties associated with the MP model. Our presentation begins by introducing copulas and discussing the differences between D-vine and C-vine pair copula models. We present visualizations illustrating the relationship between the copula parameter and the binary variable correlation coefficient. We then derive the probability mass function (PMF) for bivariate and trivariate binary variables and provide numerical examples. Finally, we present an application of our model to a real-life dataset analysis. 
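
As a simple reference point for how a copula induces a bivariate binary PMF, the sketch below uses a Gaussian copula (a different family than the D-vine pair copulas of the talk) and the rectangle rule to obtain all four cell probabilities. The marginal probabilities and copula parameter are hypothetical.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def gaussian_copula_cdf(u, v, theta):
    """C(u, v; theta): Gaussian copula with latent correlation theta."""
    mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, theta], [theta, 1.0]])
    return mvn.cdf([norm.ppf(u), norm.ppf(v)])

def bivariate_binary_pmf(p1, p2, theta):
    """Joint PMF of (Y1, Y2) with P(Yi = 1) = pi and dependence induced by
    a Gaussian copula; cell probabilities follow from the rectangle rule."""
    q1, q2 = 1.0 - p1, 1.0 - p2
    c = gaussian_copula_cdf(q1, q2, theta)
    return {(0, 0): c,
            (0, 1): q1 - c,
            (1, 0): q2 - c,
            (1, 1): 1.0 - q1 - q2 + c}

pmf = bivariate_binary_pmf(0.4, 0.5, 0.6)
```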

Keywords

Multivariate Binary

Copula

D-Vine 

Co-Author

N. Rao Chaganty, Old Dominion University

First Author

Huihui Lin, Hampton University

Presenting Author

N. Rao Chaganty, Old Dominion University

WITHDRAWN Automatic Calibration of Agent-Based Models using Recurrent Neural Networks

This study presents a deep learning framework for calibrating Agent-Based Models (ABMs), focusing on the Susceptible-Infected-Recovered (SIR) model. By leveraging Convolutional Neural Networks (CNNs) for pattern extraction and Recurrent Neural Networks (RNNs) for temporal dependencies, the approach enhances parameter estimation accuracy and efficiency. A synthetic dataset generated using epiworldR enabled model training, with RNNs achieving lower Mean Absolute Errors (MAEs).

To support real-world applications, we developed epiworldRcalibrate, an R package for real-time SIR parameter estimation and epidemic visualization. Validated on 10,000 simulated datasets, the framework proved robust and adaptable. This method offers a scalable solution for real-time epidemiological modeling, improving decision-making in public health and beyond. 

Keywords

Parameter Calibration
Agent-Based Models (ABMs)
Convolutional Neural Networks (CNNs)
Recurrent Neural Networks (RNNs)
Susceptible-Infected-Recovered (SIR) Model



Co-Author(s)

George Vega Yon, University of Utah
Yue Zhang, University of Utah
Bernardo Modenesi, University of Utah

First Author

Sima Najafzadehkhoei

Consensus Dimension Reduction via Data Integration

A plethora of dimension reduction methods have been developed to visualize high-dimensional data in low dimensions. However, different dimension reduction methods often output different visualizations, and many challenges make it difficult for researchers to determine which visualization is best. We thus propose a novel consensus dimension reduction framework, which summarizes multiple visualizations into a single "consensus" visualization. Here, we leverage ideas from data integration (or data fusion) to identify the patterns that are most stable or shared across the many different dimension reduction visualizations and subsequently visualize this shared structure in a single low-dimensional plot. We demonstrate that this consensus visualization effectively identifies and preserves the shared low-dimensional data structure through extensive simulations and real-world case studies. We further highlight our method's robustness to the choice of dimension reduction method and/or hyperparameters, a highly desirable property when working towards trustworthy and reproducible data science. 
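
One crude way to extract structure shared across several 2-D embeddings, sketched here as a stand-in for the data-integration machinery of the talk, is to Procrustes-align each layout to a common reference and average. All data are simulated; this is not the authors' algorithm.

```python
import numpy as np

def procrustes_align(X, ref):
    """Orthogonal Procrustes: rotate/reflect the centered X (n x 2) so it
    best matches ref in the least-squares sense."""
    Xc = X - X.mean(0)
    Rc = ref - ref.mean(0)
    U, _, Vt = np.linalg.svd(Xc.T @ Rc)
    return Xc @ (U @ Vt)

def consensus_embedding(embeddings):
    """Align every 2-D embedding to the first one and average the aligned
    layouts: a crude consensus of the shared structure."""
    ref = embeddings[0]
    aligned = [procrustes_align(E, ref) for E in embeddings]
    return np.mean(aligned, axis=0)

rng = np.random.default_rng(2)
base = rng.normal(size=(50, 2))
views = []
for _ in range(3):   # three noisy, rotated copies of one true layout
    th = rng.uniform(0, 2 * np.pi)
    R = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
    views.append(base @ R + rng.normal(scale=0.05, size=base.shape))
cons = consensus_embedding(views)
```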

Keywords

dimension reduction

data integration

data visualization 

Co-Author

Tiffany Tang, University of Notre Dame

First Author

Bingxue An, University of Notre Dame

Presenting Author

Bingxue An, University of Notre Dame

High-Dimensional Graphical Latent Gaussian Copula Model with Covariates

In this work, we propose a high-dimensional Graphical Latent Gaussian Copula Model that extends traditional Gaussian graphical models by incorporating external covariates. The model assumes a latent Gaussian structure where observed variables arise through monotonic transformations, allowing for a flexible representation of conditional dependencies. We introduce a novel approach in which the mean and precision matrix of the latent variables are modeled as functions of covariates, capturing population-level and individual-specific network structures.

To estimate the model parameters, we develop an efficient estimation procedure that leverages bridge functions to infer latent correlations from observed data. The estimation is further refined using a sparse group lasso penalty to encourage structured sparsity.

Simulation studies and real-world applications demonstrate the model's ability to recover latent dependency structures and identify covariate-driven variations in network connectivity. This framework has broad applicability in biomedical and social sciences, where latent interactions play a crucial role in data analysis. 
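
For continuous margins, the classical bridge function relating Kendall's tau to the latent Gaussian correlation is $\sigma = \sin(\pi \tau / 2)$; the bridges for binary and ordinal margins used in such models are more involved. The sketch below illustrates the continuous-margin case on simulated data with monotone transformations.

```python
import numpy as np
from scipy.stats import kendalltau

def latent_corr_from_tau(x, y):
    """Bridge function for continuous margins under a latent Gaussian
    copula: sigma = sin(pi/2 * Kendall's tau)."""
    tau, _ = kendalltau(x, y)
    return np.sin(np.pi / 2.0 * tau)

rng = np.random.default_rng(3)
rho = 0.7
z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=5000)
# monotone transforms distort Pearson correlation but leave tau unchanged
x, y = np.exp(z[:, 0]), z[:, 1] ** 3
est = latent_corr_from_tau(x, y)   # recovers a value near rho = 0.7
```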

Keywords

Copula Model

High Dimensional Data

Graphical Model

Precision Matrix Estimation

Sparse Group Lasso

Covariate-Dependent Networks 

First Author

Zhentao Yu

Presenting Author

Zhentao Yu

ImgKnock: Novel Knockoff Generation and Feature Selection for Image Data via Latent Representations

Over 3 million Americans currently have glaucoma, a group of eye conditions that damage the optic nerve, leading to more severe vision problems. Glaucoma can be diagnosed and monitored by examining fundus images for signs such as thinning of the neuroretinal rim. While traditional feature selection techniques can be applied to pixelated fundus image data, they often struggle with high dimensionality, computational inefficiency, and procedural rigidity. To resolve these issues and control the false discovery rate (FDR), we present a novel approach that leverages latent representation learning to construct higher-level features from image data and generate knockoffs of the latent features, followed by knockoff feature selection with FDR control. Called ImgKnock, our four-step procedure uses a deep latent representation learning-based approach integrated with a model-X knockoffs framework. Simulations are conducted using the common MNIST and CIFAR-10 datasets to demonstrate the efficacy of ImgKnock. Results indicate proper FDR control, particularly with MNIST data, showing an AUC of up to 0.889. The proposed ImgKnock is also applied to fundus images from the UCLA Stein Eye Institute. 
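
The knockoff-generation step on latent features can be illustrated with the standard second-order (equicorrelated) Gaussian model-X construction, which is one common instantiation and not necessarily the one used in ImgKnock. The covariance below is hypothetical.

```python
import numpy as np

def gaussian_knockoffs(Z, Sigma, rng):
    """Second-order equicorrelated Gaussian knockoffs for rows Z ~ N(0, Sigma):
    Ztilde | Z ~ N(Z - Z @ Sinv @ D, 2D - D @ Sinv @ D), with D = s*I and
    s = min(1, 2*lambda_min(Sigma)), shrunk slightly for numerical safety."""
    p = Sigma.shape[0]
    s = min(1.0, 2.0 * np.linalg.eigvalsh(Sigma)[0]) * 0.999
    Sinv = np.linalg.inv(Sigma)
    D = s * np.eye(p)
    mu = Z - Z @ Sinv @ D            # conditional mean of the knockoffs
    C = 2.0 * D - D @ Sinv @ D       # conditional covariance
    L = np.linalg.cholesky(C)
    return mu + rng.normal(size=Z.shape) @ L.T

rng = np.random.default_rng(4)
p, n, rho = 5, 20000, 0.3
Sigma = rho * np.ones((p, p)) + (1.0 - rho) * np.eye(p)
Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Zt = gaussian_knockoffs(Z, Sigma, rng)
# knockoffs reproduce cross-covariances except on the diagonal, where
# Cov(Z_j, Ztilde_j) = Sigma_jj - s
```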

Keywords

knockoff selection

latent representation learning

FDR control

fundus images

self-supervised learning 

Co-Author

Zhe Fei, University of California, Riverside

First Author

Jericho Lawson, University of California, Riverside

Presenting Author

Jericho Lawson, University of California, Riverside

Integration of Image Segmentation with Classical L2 Optimization Theories of Statistics

Minimum distance estimation methodology based on an empirical distribution function has been popular due to its desirable properties, including robustness. Even though the statistical literature is awash with research on minimum distance estimation, most of it is confined to theoretical findings; only a few statisticians have investigated applying the method to real-world problems. In this paper, we extend the domain of application of this methodology to various applied fields by providing a solution to a rather challenging and complicated computational problem. The problem this paper tackles is image segmentation, which has been used in various fields. We propose a novel method based on classical minimum distance estimation theory to solve the image segmentation problem. The performance of the proposed method is then further elevated by integrating it with the "segmenting-together" strategy. We demonstrate that the proposed method combined with the segmenting-together strategy successfully completes the segmentation task when applied to complex images such as magnetic resonance images. 

Keywords

Empirical distribution

Cramér-von Mises

magnetic resonance

minimum distance

segmenting together 

Co-Author(s)

Jinhee Jang, Seoul St. Mary Hospital, College of Medicine, The Catholic University of Kor
Kun Bu, Department of Mathematics and Statistics

First Author

Jiwoong Kim, Department of Mathematics and Statistics, University of South Florida

Presenting Author

Jiwoong Kim, Department of Mathematics and Statistics, University of South Florida

Interpretable Deep Learning with Scalable Kernel-Based Density Estimation

Interpretable deep learning is critical in fields such as healthcare, finance, and autonomous systems, where transparency is essential. This study presents a computationally efficient framework integrating Random Fourier Features (RFF) with softmax-weighted kernel density estimation to introduce interpretability in deep learning models. By employing RFF for kernel approximation and refining kernel density estimation, the method provides a structured approach to modeling complex data distributions while maintaining accuracy and efficiency. To assess robustness, a sensitivity analysis is conducted on the dimensionality (D) of the mapped space to evaluate its impact on computational complexity. Additionally, the study examines the integration of multiple kernels within deep learning models, allowing flexible representation of high-dimensional data. This is particularly relevant when distinct feature sets, such as gene collections, require separate kernel representations. The framework's performance is assessed through benchmarking in a conditional density estimation setting using real-world data. 
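
The kernel approximation at the core of such a framework is the classical random Fourier features construction: cosine features whose inner products approximate an RBF kernel. The sketch below is a generic illustration on simulated data, not the paper's framework.

```python
import numpy as np

def rff_features(X, D, gamma, rng):
    """Random Fourier features z(x) = sqrt(2/D) * cos(x @ W + b), with
    W entries ~ N(0, 2*gamma), so z(x) @ z(y) ~= exp(-gamma * ||x - y||^2)."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
gamma = 0.5
Z = rff_features(X, D=2000, gamma=gamma, rng=rng)
K_approx = Z @ Z.T                                   # feature-space Gram matrix
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
K_exact = np.exp(-gamma * sq)                        # exact RBF kernel
err = float(np.abs(K_approx - K_exact).max())        # small for large D
```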

Keywords

interpretable deep learning

machine learning

learning with kernels

random features

nonparametric conditional density estimation 

Co-Author

Mithat Gonen, Memorial Sloan-Kettering Cancer Center

First Author

Ayyuce Begum Bektas, Memorial Sloan Kettering Cancer Center

Presenting Author

Ayyuce Begum Bektas, Memorial Sloan Kettering Cancer Center

Joint Graphical Lasso with Regularized Aggregation

We present methods for estimating multiple precision matrices for high-dimensional time series within the framework of Gaussian graphical models, with a specific focus on analyzing functional magnetic resonance imaging (fMRI) data collected from multiple subjects. Our goal is to estimate both individual brain networks and a collective structure representing a group of subjects. To achieve this, we propose a method that utilizes group Graphical Lasso and regularized aggregation to simultaneously estimate individual and group precision matrices, assigning varying weights to each individual based on their outlier status within the group. We investigate the convergence rates of the precision matrix estimators across different norms and expectations, assessing their performance under both sub-Gaussian and heavy-tailed assumptions. The effectiveness of our methods is demonstrated through simulations and real fMRI data analysis. 

Keywords

Aggregation

Brain connectivity

Joint estimation

Precision matrix estimation

Regularization

Long-memory 

Co-Author(s)

Qihu Zhang, University of Georgia
Jennifer McDowell, University of Georgia
Cheolwoo Park, KAIST

First Author

Jongik Chung, University of Central Florida

Presenting Author

Jongik Chung, University of Central Florida

Localized Sparse Principal Component Analysis of Multivariate Time Series in Frequency Domain

In the context of time series, principal component analysis of spectral density matrices can provide valuable, parsimonious information about the behavior of the underlying process. Given a high-dimensional weakly stationary time series, it is of interest to obtain principal components of the spectral density matrices that are interpretable as being sparse in coordinates and localized in frequency. In this talk, we introduce a formulation of this novel problem and an algorithm for estimating the object of interest. In addition, we propose a smoothing procedure that improves estimation of eigenvector trajectories over the frequency range. The method is motivated by and used to understand neurological mechanisms from high-density resting-state EEG in a patient hospitalized for a first psychotic episode and compared with a healthy control individual. 

Keywords

Principal Component Analysis

High Dimensional Time Series

Spectral Density Matrix

Sparse Estimation

EEG Data 

Co-Author(s)

Robert Krafty, Emory University
Amita Manatunga, Emory University
Fabio Ferrarelli, University of Pittsburgh

First Author

Jamshid Namdari

Presenting Author

Jamshid Namdari

Merging Versus Ensembling: An Adaptive Blending Approach for Handling Domain Heterogeneity

In multi-domain settings, where observations come from distinct but related data sources, heterogeneity often exists across domains due to shifts in data distributions. In cases of high heterogeneity, (1) training individual models on each domain and ensembling their predictions (ensemble approach) has been shown to outperform (2) combining domain datasets and fitting a single model (merged approach). However, determining when to choose each approach is less clear. This paper presents Multi-Study Adaptive Blend (MSAB), a method for optimally combining predictions from the ensemble and merged approaches adaptively across varying levels of heterogeneity. First, we provide theoretical insights on optimizing the combination weight in a linear model setting. Second, we propose a domain-wise cross-validation strategy for estimating the optimal blending weight as a practical, data-driven approach for broader applications. For a given heterogeneity level, MSAB performs comparably to or better than the best individual strategy (merged or ensemble), offering robust performance across low and high heterogeneity settings. MSAB offers potential improvements in predictive performance and mitigates the risk of selecting a suboptimal approach in multi-domain settings.  
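
The core blending idea can be sketched as a one-parameter combination of two predictors, with the weight chosen to minimize held-out squared error. This is a toy illustration with simulated predictions, not MSAB's domain-wise cross-validation procedure.

```python
import numpy as np

def blend_weight(pred_merged, pred_ensemble, y):
    """Grid-search the weight w in [0, 1] minimizing the squared error of
    the blended prediction w*merged + (1 - w)*ensemble on held-out data."""
    grid = np.linspace(0.0, 1.0, 101)
    losses = [np.mean((w * pred_merged + (1.0 - w) * pred_ensemble - y) ** 2)
              for w in grid]
    return float(grid[int(np.argmin(losses))])

rng = np.random.default_rng(8)
y = rng.normal(size=500)
pred_merged = y + rng.normal(scale=1.0, size=500)    # noisier predictor
pred_ensemble = y + rng.normal(scale=0.5, size=500)  # more accurate predictor
w = blend_weight(pred_merged, pred_ensemble, y)      # tilts toward the ensemble
```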

Keywords

machine learning

domain generalization

ensemble learning

multi-study prediction 

Co-Author(s)

Prasad Patil, Boston University
Kevin Lane, Boston University - Department of Environmental Health

First Author

Daniel Kojis

Presenting Author

Daniel Kojis

Modification to the LASSO Regression Model via its Bayesian Interpretation

This study presents a generalized LASSO regression model based on the generalized Laplace (GL) distribution. Within the T-R{Y} framework, a family of GL distributions is developed, with a particular case offering a Bayesian perspective on LASSO. This perspective introduces additional terms to the standard LASSO constraint. We examine these terms geometrically, along with the impact of the GL distribution's parameters on the generalized LASSO model. Finally, the model's adaptability and effectiveness in variable selection and prediction are illustrated using a real-world dataset. 

Keywords

LASSO regression

beta-Laplace distribution

T-Laplace family

Variable selection

Prediction 

Co-Author(s)

Felix Famoye, Central Michigan University
Carl Lee, Central Michigan University

First Author

Gayan Warahena Liyanage, University of Dayton

Presenting Author

Gayan Warahena Liyanage, University of Dayton

Nonparametric Estimation of Spatial Covariance Function using Mixtures of Gaussian Kernels

Estimating the covariance function of a spatial process is important for model estimation and spatial prediction. Many spatial models, such as Gaussian processes, rely on covariance functions to define their structure. However, parametric estimation can suffer from model misspecification, leading to biased predictions if the chosen covariance structure is incorrect. In this work, we study a nonparametric approach to estimating the covariance function of an isotropic stationary process in $\mathbb{R}^d$. We focus on a class of covariance functions that are valid in all dimensions $d \geq 1$, which includes popular parametric kernels such as the Matérn kernel. Leveraging the fact that such covariance functions can be represented as infinite mixtures of scaled Gaussian kernels, we propose two estimation methods, least squares and nonparametric maximum likelihood, for estimating the mixing measure of scaled Gaussian kernels. We also develop computationally efficient methods to solve the resulting optimization problems using non-negative least squares and Fisher-scoring updates. Finally, we evaluate our proposed methods through simulations and real data, comparing them against parametric and nonparametric approaches. 
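
The least-squares variant over a fixed grid of Gaussian scales reduces to a non-negative least squares problem, sketched below on a hypothetical two-scale "empirical" covariance. The grid and target values are illustrative, not from the paper.

```python
import numpy as np
from scipy.optimize import nnls

def fit_mixture_weights(h, c_hat, scales):
    """Fit C(h) = sum_k w_k * exp(-h^2 / (2 * scale_k^2)) with w_k >= 0 by
    non-negative least squares over a fixed grid of Gaussian scales."""
    A = np.exp(-h[:, None] ** 2 / (2.0 * scales[None, :] ** 2))
    w, _ = nnls(A, c_hat)
    return w

# hypothetical "empirical" covariances generated from a two-scale mixture
h = np.linspace(0.0, 5.0, 60)
c_true = (0.7 * np.exp(-h ** 2 / (2 * 0.5 ** 2))
          + 0.3 * np.exp(-h ** 2 / (2 * 2.0 ** 2)))
scales = np.geomspace(0.1, 5.0, 25)
w = fit_mixture_weights(h, c_true, scales)
c_fit = np.exp(-h[:, None] ** 2 / (2.0 * scales[None, :] ** 2)) @ w
```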

Keywords

Stationary isotropic processes

Spatial covariance function

Nonparametric estimation

Gaussian mixtures

Fast computation 

Co-Author(s)

Hyebin Song, Penn State
Stephen Berg

First Author

Kanahela Muhandiramge Manushi Hansani Siriwardana

Presenting Author

Kanahela Muhandiramge Manushi Hansani Siriwardana

Self-normalization Tests for Change Points in Functional Time Series

Change point detection for functional time series has attracted considerable attention from researchers. Existing methods either rely on functional principal component analysis (FPCA), which may perform poorly with complex data, or use bootstrap approaches in forms that fall short in effectively detecting diverse types of changes. In our study, we propose a novel self-normalization (SN) test for functional time series implemented via a non-overlapping block bootstrap to circumvent the reliance on FPCA. The test statistic is a normalized cumulative sum (CUSUM) where the normalizing factor allows the capture of subtle local changes in the mean function. The theory contains the weak convergence and test consistency for both the original and the bootstrap versions of the test statistic. We further extend the test to detect changes in the lag-1 autocovariance operator. Simulation studies confirm the superior performance of our test across various settings, and real-world applications further illustrate its practical utility. 
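
The CUSUM process at the heart of such tests can be sketched for curves stored on a common grid; the plain (not self-normalized) version below locates a simulated mean shift. This is a generic illustration, not the paper's bootstrap-calibrated statistic.

```python
import numpy as np

def cusum_stats(X):
    """CUSUM process for curves stored as rows of X (n x m grid values):
    T(k) = || S_k - (k/n) * S_n || / sqrt(n) for k = 1, ..., n-1, where
    S_k is the partial sum of the first k curves."""
    n = X.shape[0]
    S = np.cumsum(X, axis=0)
    return np.array([np.linalg.norm(S[k - 1] - (k / n) * S[-1]) / np.sqrt(n)
                     for k in range(1, n)])

rng = np.random.default_rng(6)
n, m = 200, 50
X = rng.normal(size=(n, m))
X[120:] += 1.0                      # mean shift after time 120
stats = cusum_stats(X)
k_hat = int(np.argmax(stats)) + 1   # estimated change location
```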

Keywords

Change point detection

Functional time series

Self-normalization

Non-overlapping block bootstrap 

Co-Author

Pang Du, Virginia Tech

First Author

Zhiyuan Du

Presenting Author

Zhiyuan Du

Semi-parametric Spatial Intensity Estimation with Bandwidth Selection on KDE-NN based model

In spatial point process intensity estimation, traditional methods such as kernel estimators and regression models have been effective at estimating the intensity function of a spatial point pattern. However, they fall short when dealing with nonlinear correlations. Deep learning models, such as Neural Networks (NNs) and Variational AutoEncoders (VAEs), offer a promising alternative for addressing these limitations and are widely acknowledged for their flexibility and capacity to handle complex, nonlinear relationships. In this study, we additionally incorporate a KDE layer with trainable bandwidth into our model; the resulting KDE-NN based model provides additional flexibility to capture spatial correlation in the data while also controlling the degree of smoothness. 

Keywords

Spatial Intensity Estimation

Deep Learning Model

Bandwidth Selection

Kernel Density Estimation 

Co-Author

Ji Meng Loh, New Jersey Institute of Technology

First Author

Zhiwen Wang, New Jersey Institute of Technology

Presenting Author

Zhiwen Wang, New Jersey Institute of Technology

Sparse-Input Neural Network using Group Concave Regularization

Simultaneous feature selection and non-linear function estimation are challenging, especially in high-dimensional settings where the number of variables exceeds the available sample size. We investigate feature selection in neural networks and address the limitations of group LASSO, which tends to select unimportant variables due to over-shrinkage. To overcome this, we propose a sparse-input neural network framework using group concave regularization for feature selection in both low- and high-dimensional settings. The key idea is to apply a concave penalty to the $l_2$ norm of weights from all outgoing connections of each input node, yielding a neural net that uses only a small subset of variables. We also develop an efficient algorithm based on backward path-wise optimization to produce stable solution paths and tackle complex optimization landscapes. Extensive simulations and real data examples demonstrate the proposed estimator's strong performance in feature selection and prediction for continuous, binary, and time-to-event outcomes. 
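
The group concave penalty acts on the $l_2$ norm of each input's outgoing weights; a single MCP (firm-thresholding) proximal step on a first-layer weight matrix illustrates how weak inputs are zeroed as groups. The weight matrix and tuning values are hypothetical, and this is one step of one possible optimizer, not the paper's backward path-wise algorithm.

```python
import numpy as np

def mcp_group_prox(W, lam, gamma=3.0):
    """Groupwise firm-thresholding (MCP proximal) step on first-layer
    weights W (p inputs x h hidden units): each input's outgoing-weight
    vector is shrunk as a group, zeroing out weak inputs entirely."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.ones_like(norms)
    small = norms <= lam                          # group killed outright
    mid = (norms > lam) & (norms <= gamma * lam)  # group shrunk
    scale[small] = 0.0
    scale[mid] = ((norms[mid] - lam) / norms[mid]) / (1.0 - 1.0 / gamma)
    return W * scale                              # large groups untouched

rng = np.random.default_rng(7)
W = rng.normal(scale=0.05, size=(20, 8))  # mostly weak inputs
W[3] *= 40.0                              # one strong input
W_new = mcp_group_prox(W, lam=0.5)
selected = np.flatnonzero(np.linalg.norm(W_new, axis=1))
```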

Keywords

Neural networks

Feature selection

High dimensionality

LASSO

nonconvex penalty 

Co-Author

Susan Halabi, Duke University

First Author

Bin Luo, Kennesaw State University

Presenting Author

Bin Luo, Kennesaw State University

Visualize your fitted non-linear dimension reduction model in the high-dimensional data space

Non-linear dimension reduction (NLDR) techniques such as t-SNE and UMAP provide a low-dimensional representation of high-dimensional data by applying a non-linear transformation. The methods and parameter choices can create wildly different representations, so much so that it is difficult to decide which is best, or whether any or all are accurate or misleading. NLDR often exaggerates random patterns, sometimes due to the samples observed, but NLDR views have an important role in data analysis because, if done well, they provide a concise visual (and conceptual) summary of high-dimensional distributions. To help evaluate an NLDR fit, we have developed a way to take the fitted model, as represented by the positions of points in 2D, and turn it into a high-dimensional wireframe to overlay on the data, viewing it with a tour. Viewing a model in the data space is an ideal way to examine the fit. It is used here to help with the difficult decision of which 2D layout best represents the high-dimensional distribution, whether a 2D layout is displaying mostly random structure, and whether different methods yield the same summary or exhibit particular quirks. The method is available in the R package `quollr`. 

Keywords

high-dimensional data visualization

non-linear dimension reduction

tour 

Co-Author(s)

Dianne Cook, Monash University
Paul Harrison, Monash University
Michael Lydeamore
Thiyanga S. Talagala, University of Sri Jayewardenepura, Sri Lanka

First Author

Piyadi Gamage Jayani Lakshika

Presenting Author

Piyadi Gamage Jayani Lakshika