Monday, Aug 4: 10:30 AM - 12:20 PM
4059
Contributed Speed
Music City Center
Room: CC-104A
Presentations
Brownian motion is typically introduced as a stochastic process indexed by the half-line [0, ∞), while a Brownian sheet is indexed by an orthant of Euclidean space. Recent research has focused on extending these concepts to non-Euclidean index sets (primarily Riemannian manifolds), seeking to define stochastic processes over them that merit the name 'Brownian motion.' This extension is not merely a mathematical exercise: it aims to provide rigorous foundations for the 'SPDE approach' when analyzing data over such spaces, particularly addressing questions about the sparsity of Matérn covariance functions in these settings. In this work, we identify a critical gap in existing approaches: the lack of guaranteed path continuity in the processes explored so far. We present a modification that resolves this limitation, thereby establishing a more robust theoretical foundation for this emerging line of research.
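For orientation, the standard Whittle-Matérn link that underlies the SPDE approach on Euclidean index sets (background material, not a result of this abstract) is: a stationary Gaussian field $u$ on $\mathbb{R}^d$ with Matérn covariance of smoothness $\nu$ solves
\[
(\kappa^2 - \Delta)^{\alpha/2}\, u(s) = \mathcal{W}(s), \qquad \alpha = \nu + \tfrac{d}{2},
\]
where $\Delta$ is the Laplacian and $\mathcal{W}$ is Gaussian white noise, and the corresponding covariance is
\[
C(h) = \frac{\sigma^2}{2^{\nu-1}\Gamma(\nu)}\,(\kappa\|h\|)^{\nu} K_{\nu}(\kappa\|h\|).
\]
Extending this characterization to non-Euclidean index sets is what raises the path-continuity question addressed above.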
Keywords
SPDE Approach
Brownian motion
Matérn covariance
Handling missing data in studies with mixed multivariate responses is a critical challenge in statistical research. We propose a multiple imputation technique for datasets with binary and ordinal variables. This method, based on a multivariate probit model using Markov chain Monte Carlo, captures the correlation structure among variables while respecting their categorical nature. We evaluate the method under various missing data scenarios: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Comparisons with standard imputation techniques, such as multivariate normal-based imputation and multiple imputation by chained equations (MICE), reveal that our approach outperforms existing methods. It better preserves the joint distribution of the data and provides unbiased parameter estimates, particularly under complex missingness patterns. Our findings highlight the multivariate probit model's potential as a robust and flexible tool for multiple imputation in datasets with mixed ordinal and binary responses. This advancement enhances the reliability of statistical inference in applied research involving such data structures.
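The latent-Gaussian idea behind probit-based imputation can be sketched in a deliberately simplified form: two binary variables, a single known latent correlation, and one Gibbs-style imputation step, rather than the full multivariate probit MCMC of the abstract. All names and values below are hypothetical.

```python
# Minimal sketch: impute a missing binary y2 given an observed binary y1 under a
# latent bivariate normal (probit) model with a *known* latent correlation rho.
# In the actual method, rho and the full correlation matrix are sampled by MCMC.
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)
n, rho = 500, 0.6

# Simulate complete latent data and binary responses, then mask some y2 values (MCAR).
z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
y = (z > 0).astype(int)
miss = rng.random(n) < 0.3
y2_obs = np.where(miss, -1, y[:, 1])        # -1 marks a missing entry

def impute_once(y1, y2_obs, rho, rng):
    """One draw of the missing y2 entries given observed y1 under the latent probit model."""
    # Draw latent z1 consistent with the observed y1 (truncated standard normal).
    lo = np.where(y1 == 1, 0.0, -np.inf)
    hi = np.where(y1 == 1, np.inf, 0.0)
    z1 = truncnorm.rvs(lo, hi, random_state=rng)
    # Conditional latent z2 | z1 ~ N(rho * z1, 1 - rho^2); missing y2 = 1{z2 > 0}.
    z2 = rng.normal(rho * z1, np.sqrt(1 - rho**2))
    return np.where(y2_obs >= 0, y2_obs, (z2 > 0).astype(int))

imputations = [impute_once(y[:, 0], y2_obs, rho, rng) for _ in range(5)]  # 5 imputed datasets
print(np.mean(imputations, axis=1))          # imputed prevalence of y2 in each dataset
```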
Keywords
Multiple Imputation
Multivariate probit model
Markov chain Monte Carlo (MCMC)
Missing completely at random (MCAR)
Missing at random (MAR)
Missing not at random (MNAR)
Genome-Wide Association Studies (GWAS) with imaging phenotypes pose significant challenges due to the complex interplay between high-dimensional genetic data and intricate spatial structures inherent in imaging data. In this paper, we develop an ultra-high-dimensional functional regression model tailored for GWAS with imaging phenotypes, incorporating genetic and non-visual contextual information. We approximate the coefficient functions using bivariate penalized splines and propose a forward selection procedure based on a functional Bayesian Information Criterion. This procedure is designed to identify critical main effects and interactions, adapting to imaging data characteristics. It achieves consistent variable selection in moderately high-dimensional settings and exhibits the sure screening property in ultra-high-dimensional scenarios. Extensive simulation studies and an analysis of data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) demonstrate the superior performance of the proposed method.
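The forward-selection logic can be illustrated with a deliberately simplified stand-in: ordinary scalar covariates and the usual linear-model BIC in place of the paper's bivariate penalized splines and functional BIC. All data and names below are synthetic.

```python
# Illustrative sketch only: greedy forward selection driven by BIC in an ordinary
# linear model, showing the control flow of a BIC-based forward selection procedure.
import numpy as np

def bic(y, X_sel):
    """BIC of a least-squares fit with an intercept plus the selected columns."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X_sel]) if X_sel.size else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    rss = np.sum((y - X1 @ beta) ** 2)
    return n * np.log(rss / n) + X1.shape[1] * np.log(n)

def forward_select(y, X, max_steps=10):
    active, remaining = [], list(range(X.shape[1]))
    best = bic(y, X[:, []])
    for _ in range(max_steps):
        scores = {j: bic(y, X[:, active + [j]]) for j in remaining}
        j_star = min(scores, key=scores.get)
        if scores[j_star] >= best:          # stop when BIC no longer improves
            break
        best = scores[j_star]
        active.append(j_star)
        remaining.remove(j_star)
    return active

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))
y = 2 * X[:, 3] - 1.5 * X[:, 17] + rng.normal(size=200)
print(forward_select(y, X))                 # expected to recover columns 3 and 17
```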
Keywords
Bayesian Information Criterion
Functional linear model
Bivariate splines
Forward selection
GWAS
The MPST (Multivariate Penalized Spline over Triangulation) package provides a robust and efficient framework for statistical modeling of large-scale 2D and 3D data. Using advanced multivariate penalized splines, MPST effectively handles irregular domains, noisy observations, and sparse datasets. It supports global and distributed learning, enabling seamless large-scale analysis. Its distributed framework employs domain decomposition, partitioning data into subsets based on triangulation, processing them in parallel, and integrating results efficiently. This approach enhances computational performance without sacrificing accuracy. A key strength of MPST is its ability to achieve precise local fitting with varying smoothness across subdomains, ensuring smooth global transitions and overcoming traditional spline limitations. Additionally, MPST provides user-friendly 2D and 3D visualization tools, aiding result interpretation. Numerical studies show MPST outperforms existing smoothing methods in accuracy, efficiency, and scalability. By integrating state-of-the-art smoothing techniques with distributed computing, MPST is a powerful tool for complex, high-dimensional data modeling.
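The partition / fit-in-parallel / combine idea can be sketched generically. The code below is not the MPST interface: square tiles stand in for a triangulation and a local quadratic surface stands in for penalized splines, purely to keep the example short.

```python
# Generic illustration of distributed smoothing by domain decomposition:
# partition the domain, fit a local surface per partition in parallel, combine.
import numpy as np
from joblib import Parallel, delayed

rng = np.random.default_rng(2)
xy = rng.uniform(0, 1, size=(4000, 2))
z = np.sin(4 * xy[:, 0]) * np.cos(4 * xy[:, 1]) + rng.normal(scale=0.1, size=4000)

def design(xy):
    x, y = xy[:, 0], xy[:, 1]
    return np.column_stack([np.ones(len(x)), x, y, x * y, x**2, y**2])

def fit_tile(idx):
    coef, *_ = np.linalg.lstsq(design(xy[idx]), z[idx], rcond=None)
    return coef

# Partition the unit square into a 4 x 4 grid of tiles (stand-in for a triangulation).
tile_id = np.minimum((xy * 4).astype(int), 3)
tiles = {(i, j): np.where((tile_id[:, 0] == i) & (tile_id[:, 1] == j))[0]
         for i in range(4) for j in range(4)}

# Fit each tile in parallel and assemble the local fits into one global predictor.
coefs = Parallel(n_jobs=4)(delayed(fit_tile)(idx) for idx in tiles.values())
coef_map = dict(zip(tiles.keys(), coefs))

def predict(xy_new):
    tid = np.minimum((xy_new * 4).astype(int), 3)
    out = np.empty(len(xy_new))
    for k, key in enumerate(map(tuple, tid)):
        out[k] = (design(xy_new[k:k + 1]) @ coef_map[key])[0]
    return out

print(predict(np.array([[0.25, 0.75], [0.9, 0.1]])))
```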
Keywords
Complex multidimensional data
Computational efficiency
Distributed learning
Nonparametric smoothing
Multivariate spline smoothing
MPST package
Multivariate binary data arise in various scientific fields. The Multivariate Probit (MP) model is widely used for analyzing such data. However, it can fail even within a feasible range of binary variable correlations due to its requirement for a positive definite latent correlation matrix. To address this limitation, we propose a pair copula model using D-vine with an assumed dependence structure of either first-order autoregressive or equicorrelation, which overcomes the difficulties associated with the MP model. Our presentation begins by introducing copulas and discussing the differences between D-vine and C-vine pair copula models. We present visualizations illustrating the relationship between the copula parameter and the binary variable correlation coefficient. We then derive the probability mass function (PMF) for bivariate and trivariate binary variables and provide numerical examples. Finally, we present an application of our model to a real-life dataset analysis.
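A small worked example of how the cell probabilities of binary variables follow from a copula, using a Gaussian copula for brevity rather than the D-vine pair copulas of the talk; the marginal probabilities and copula parameter below are arbitrary.

```python
# Bivariate binary PMF from a copula: P(Y1=0, Y2=0) = C(1-p1, 1-p2), the rest by
# inclusion-exclusion. Here C is a Gaussian copula purely for illustration.
import numpy as np
from scipy.stats import norm, multivariate_normal

p1, p2, rho = 0.4, 0.7, 0.5          # P(Y1 = 1), P(Y2 = 1), copula parameter

def C(u1, u2, rho):
    """Gaussian copula C(u1, u2) = Phi_2(Phi^{-1}(u1), Phi^{-1}(u2); rho)."""
    return multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]]).cdf(
        [norm.ppf(u1), norm.ppf(u2)])

p00 = C(1 - p1, 1 - p2, rho)
p01 = (1 - p1) - p00                  # P(Y1=0, Y2=1)
p10 = (1 - p2) - p00                  # P(Y1=1, Y2=0)
p11 = 1 - p00 - p01 - p10             # P(Y1=1, Y2=1)
print(np.round([p00, p01, p10, p11], 4), "sum =", round(p00 + p01 + p10 + p11, 6))
```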
Keywords
Multivariate Binary
Copula
D-Vine
This study presents a deep learning framework for calibrating Agent-Based Models (ABMs), focusing on the Susceptible-Infected-Recovered (SIR) model. By leveraging Convolutional Neural Networks (CNNs) for pattern extraction and Recurrent Neural Networks (RNNs) for temporal dependencies, the approach enhances parameter estimation accuracy and efficiency. A synthetic dataset generated using epiworldR enabled model training, with RNNs achieving lower Mean Absolute Errors (MAEs).
To support real-world applications, we developed epiworldRcalibrate, an R package for real-time SIR parameter estimation and epidemic visualization. Validated on 10,000 simulated datasets, the framework proved robust and adaptable. This method offers a scalable solution for real-time epidemiological modeling, improving decision-making in public health and beyond.
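A minimal sketch of the recurrent-network half of such a calibration pipeline, using a toy deterministic SIR recursion in place of epiworldR and a small GRU in place of the full architecture; all settings below are illustrative and not the epiworldRcalibrate implementation.

```python
# Sketch: a GRU that maps a daily incidence curve to two SIR parameters
# (transmission and recovery rates), trained with mean absolute error.
import numpy as np
import torch
from torch import nn

def sir_curve(beta, gamma, T=60, n=10_000, i0=10):
    """Daily new infections from a toy deterministic SIR recursion."""
    s, i, out = n - i0, i0, []
    for _ in range(T):
        new_inf = beta * s * i / n
        new_rec = gamma * i
        s, i = s - new_inf, i + new_inf - new_rec
        out.append(new_inf)
    return np.array(out, dtype=np.float32)

rng = np.random.default_rng(3)
params = rng.uniform([0.1, 0.05], [0.5, 0.2], size=(2000, 2)).astype(np.float32)
curves = np.stack([sir_curve(b, g) for b, g in params])
X = torch.tensor(curves).unsqueeze(-1)   # (batch, time, 1) input sequences
Y = torch.tensor(params)                 # targets: (transmission rate, recovery rate)

class SIRRegressor(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.gru = nn.GRU(1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)
    def forward(self, x):
        _, h = self.gru(x)               # final hidden state: (1, batch, hidden)
        return self.head(h[-1])

model = SIRRegressor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.L1Loss()                    # mean absolute error
for epoch in range(5):                   # short demonstration run
    opt.zero_grad()
    loss = loss_fn(model(X), Y)
    loss.backward()
    opt.step()
    print(f"epoch {epoch}: MAE = {loss.item():.4f}")
```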
Keywords
Parameter Calibration
Agent-Based Models (ABMs)
Convolutional Neural Networks (CNNs)
Recurrent Neural Networks (RNNs)
Susceptible-Infected-Recovered (SIR) Model
A plethora of dimension reduction methods have been developed to visualize high-dimensional data in low dimensions. However, different dimension reduction methods often output different visualizations, and many challenges make it difficult for researchers to determine which visualization is best. We thus propose a novel consensus dimension reduction framework, which summarizes multiple visualizations into a single "consensus" visualization. Here, we leverage ideas from data integration (or data fusion) to identify the patterns that are most stable or shared across the many different dimension reduction visualizations and subsequently visualize this shared structure in a single low-dimensional plot. We demonstrate that this consensus visualization effectively identifies and preserves the shared low-dimensional data structure through extensive simulations and real-world case studies. We further highlight our method's robustness to the choice of dimension reduction method and/or hyperparameters, a highly desirable property when working towards trustworthy and reproducible data science.
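A naive version of the consensus idea, not the authors' data-integration method: run several dimension reduction methods, align each 2D embedding to a reference, and average the aligned layouts. The dataset and method choices below are arbitrary.

```python
# Simple consensus sketch: align embeddings with an orthogonal Procrustes rotation
# (after centering and scaling) and average them into a single 2D layout.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, MDS
from sklearn.preprocessing import StandardScaler
from scipy.linalg import orthogonal_procrustes

X = StandardScaler().fit_transform(load_iris().data)
embeddings = [
    PCA(n_components=2).fit_transform(X),
    TSNE(n_components=2, random_state=0).fit_transform(X),
    MDS(n_components=2, random_state=0).fit_transform(X),
]

def center_scale(E):
    E = E - E.mean(axis=0)
    return E / np.linalg.norm(E)

ref = center_scale(embeddings[0])
aligned = []
for E in embeddings:
    E = center_scale(E)
    R, _ = orthogonal_procrustes(E, ref)   # rotation/reflection mapping E onto ref
    aligned.append(E @ R)

consensus = np.mean(aligned, axis=0)       # one "consensus" 2D layout
print(consensus[:5])
```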
Keywords
dimension reduction
data integration
data visualization
In this work, we propose a high-dimensional Graphical Latent Gaussian Copula Model that extends traditional Gaussian graphical models by incorporating external covariates. The model assumes a latent Gaussian structure where observed variables arise through monotonic transformations, allowing for a flexible representation of conditional dependencies. We introduce a novel approach in which the mean and precision matrix of the latent variables are modeled as functions of covariates, capturing population-level and individual-specific network structures.
To estimate the model parameters, we develop an efficient estimation procedure that leverages bridge functions to infer latent correlations from observed data. The estimation is further refined using a sparse group lasso penalty to encourage structured sparsity.
Simulation studies and real-world applications demonstrate the model's ability to recover latent dependency structures and identify covariate-driven variations in network connectivity. This framework has broad applicability in biomedical and social sciences, where latent interactions play a crucial role in data analysis.
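One bridge-function step can be illustrated for the simplest case of continuous margins under a Gaussian copula, where Kendall's tau and the latent correlation satisfy rho = sin(pi * tau / 2); binary or ordinal margins and covariate-dependent parameters require different bridge functions than this sketch shows.

```python
# Sketch: recover a latent correlation through the Kendall's-tau bridge without
# estimating the unknown monotone marginal transformations.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(4)
rho_true = 0.6
z = rng.multivariate_normal([0, 0], [[1, rho_true], [rho_true, 1]], size=2000)
x = np.column_stack([np.exp(z[:, 0]), z[:, 1] ** 3])   # unknown monotone transforms

tau, _ = kendalltau(x[:, 0], x[:, 1])
rho_hat = np.sin(np.pi * tau / 2)                      # bridge-function inversion
print(round(rho_hat, 3), "vs true", rho_true)
```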
Keywords
Copula Model
High Dimensional Data
Graphical Model
Precision Matrix Estimation
Sparse Group Lasso
Covariate-Dependent Networks
Over 3 million Americans currently have glaucoma, a group of eye conditions that damage the optic nerve and can lead to vision loss. Diagnosis and monitoring of glaucoma can be accomplished through examination of fundus images for features such as thinning of the neuroretinal rim. While traditional feature selection techniques can be applied to pixelated fundus image data, they often struggle with high dimensionality, computational inefficiency, and procedural rigidity. To resolve these issues and control the false discovery rate (FDR), we present a novel approach that leverages latent representation learning to construct higher-level features from image data and generate knockoffs of the latent features, followed by knockoff feature selection with FDR control. Called ImgKnock, our four-step procedure uses a deep latent representation learning-based approach integrated with a model-X knockoffs framework. Simulations are conducted using the common MNIST and CIFAR-10 datasets to demonstrate the efficacy of ImgKnock. Results indicate proper FDR control, particularly with MNIST data, showing an AUC metric of up to 0.889. The proposed ImgKnock is also applied to fundus images from the UCLA Stein Eye Institute.
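The final selection step of any model-X knockoff procedure can be sketched on its own; constructing latent features and their knockoffs (the ImgKnock-specific steps) is omitted here, and the feature statistics below are simulated.

```python
# Sketch of knockoff+ selection: given feature statistics W_j (importance of
# feature j minus importance of its knockoff), select features at target FDR q.
import numpy as np

def knockoff_plus_select(W, q=0.1):
    W = np.asarray(W, dtype=float)
    candidates = np.sort(np.abs(W[W != 0]))
    for t in candidates:
        fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return np.where(W >= t)[0]      # indices of selected features
    return np.array([], dtype=int)

rng = np.random.default_rng(5)
W = np.concatenate([rng.normal(3, 1, 10),   # signals: large positive W
                    rng.normal(0, 1, 90)])  # nulls: roughly symmetric around 0
print(knockoff_plus_select(W, q=0.1))
```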
Keywords
knockoff selection
latent representation learning
FDR control
fundus images
self-supervised learning
Co-Author
Zhe Fei, University of California, Riverside
First Author
Jericho Lawson, University of California, Riverside
Presenting Author
Jericho Lawson, University of California, Riverside
Minimum distance estimation methodology based on an empirical distribution function has been popular due to its desirable properties, including robustness. Even though the statistical literature is awash with research on minimum distance estimation, most of it is confined to theoretical findings: only a few statisticians have studied the application of the method to real-world problems. In this paper, we extend the domain of application of this methodology to various applied fields by providing a solution to a rather challenging and complicated computational problem. The problem this paper tackles is image segmentation, which has been used in various fields. We propose a novel method based on classical minimum distance estimation theory to solve the image segmentation problem. The performance of the proposed method is then further elevated by integrating it with the "segmenting-together" strategy. We demonstrate that the proposed method combined with the segmenting-together strategy successfully completes the segmentation problem when it is applied to complex images such as magnetic resonance images.
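The classical minimum distance principle can be sketched in its simplest form, a location-scale normal model fit by minimizing the Cramér-von Mises criterion; the talk's application to image segmentation goes far beyond this illustration.

```python
# Minimum Cramér-von Mises distance estimation for N(mu, sigma^2):
# minimize the CvM criterion between the empirical CDF and the model CDF.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def cvm_distance(theta, x_sorted):
    """Cramér-von Mises criterion: 1/(12n) + sum_i (F(x_(i)) - (2i-1)/(2n))^2."""
    mu, log_sigma = theta
    n = len(x_sorted)
    F = norm.cdf(x_sorted, loc=mu, scale=np.exp(log_sigma))
    i = np.arange(1, n + 1)
    return 1 / (12 * n) + np.sum((F - (2 * i - 1) / (2 * n)) ** 2)

rng = np.random.default_rng(6)
x = np.sort(rng.normal(loc=2.0, scale=1.5, size=400))
res = minimize(cvm_distance, x0=[0.0, 0.0], args=(x,), method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(round(mu_hat, 3), round(sigma_hat, 3))   # should be near (2.0, 1.5)
```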
Keywords
Empirical distribution
Cramér-von Mises
magnetic resonance
minimum distance
segmenting together
Co-Author(s)
Jinhee Jang, Seoul St. Mary Hospital, College of Medicine, The Catholic University of Korea
Kun Bu, Department of Mathematics and Statistics
First Author
Jiwoong Kim, Department of Mathematics and Statistics University of South Florida
Presenting Author
Jiwoong Kim, Department of Mathematics and Statistics University of South Florida
Interpretable deep learning is critical in fields such as healthcare, finance, and autonomous systems, where transparency is essential. This study presents a computationally efficient framework integrating Random Fourier Features (RFF) with softmax-weighted kernel density estimation to introduce interpretability in deep learning models. By employing RFF for kernel approximation and refining kernel density estimation, the method provides a structured approach to modeling complex data distributions while maintaining accuracy and efficiency. To assess robustness, a sensitivity analysis is conducted on the dimensionality (D) of the mapped space to evaluate its impact on computational complexity. Additionally, the study examines the integration of multiple kernels within deep learning models, allowing flexible representation of high-dimensional data. This is particularly relevant when distinct feature sets, such as gene collections, require separate kernel representations. The framework's performance is assessed through benchmarking in a conditional density estimation setting using real-world data.
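The Random Fourier Feature step itself is standard and can be sketched on its own; the softmax-weighted kernel density estimation and the deep-model integration described in the abstract are not shown, and the kernel bandwidth and dimensions below are arbitrary.

```python
# RFF sketch: the Gaussian (RBF) kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
# is approximated by the inner product of D-dimensional random cosine features.
import numpy as np

def rff_map(X, D=500, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, D))   # spectral frequencies of the RBF kernel
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(7)
X = rng.normal(size=(5, 3))
Z = rff_map(X, D=2000)
K_exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
K_approx = Z @ Z.T
print(np.abs(K_exact - K_approx).max())   # small approximation error
```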
Keywords
interpretable deep learning
machine learning
learning with kernels
random features
nonparametric conditional density estimation
We present methods for estimating multiple precision matrices for high-dimensional time series within the framework of Gaussian graphical models, with a specific focus on analyzing functional magnetic resonance imaging (fMRI) data collected from multiple subjects. Our goal is to estimate both individual brain networks and a collective structure representing a group of subjects. To achieve this, we propose a method that utilizes group Graphical Lasso and regularized aggregation to simultaneously estimate individual and group precision matrices, assigning varying weights to each individual based on their outlier status within the group. We investigate the convergence rates of the precision matrix estimators across different norms and expectations, assessing their performance under both sub-Gaussian and heavy-tailed assumptions. The effectiveness of our methods is demonstrated through simulations and real fMRI data analysis.
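A simplified sketch of the two ingredients, per-subject precision estimation and weighted aggregation, is shown below; it is not the paper's group Graphical Lasso / regularized aggregation procedure, and the weighting rule and placeholder data are illustrative only.

```python
# Sketch: one graphical-lasso precision matrix per subject, then a weighted
# group-level precision matrix that down-weights subjects far from the group mean.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(8)
p, T, n_subjects = 10, 300, 5
subjects = [rng.normal(size=(T, p)) for _ in range(n_subjects)]  # placeholder fMRI series

omegas = []
for X in subjects:
    gl = GraphicalLasso(alpha=0.1).fit(X)
    omegas.append(gl.precision_)

mean_omega = np.mean(omegas, axis=0)
dist = np.array([np.linalg.norm(o - mean_omega, ord="fro") for o in omegas])
weights = 1.0 / (dist + 1e-8)
weights /= weights.sum()                        # outlying subjects get smaller weight
group_omega = sum(w * o for w, o in zip(weights, omegas))
print(np.round(group_omega[:3, :3], 3))
```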
Keywords
Aggregation
Brain connectivity
Joint estimation
Precision matrix estimation
Regularization
Long-memory
In the context of time series, principal component analysis of spectral density matrices can provide valuable, parsimonious information about the behavior of the underlying process. Given a high-dimensional weakly stationary time series, it is of interest to obtain principal components of the spectral density matrices that are interpretable as being sparse in coordinates and localized in frequency. In this talk, we introduce a formulation of this novel problem and an algorithm for estimating the object of interest. In addition, we propose a smoothing procedure that improves estimation of eigenvector trajectories over the frequency range. The method is motivated by and used to understand neurological mechanisms from high-density resting-state EEG in a patient hospitalized for a first psychotic episode and compared with a healthy control individual.
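The object being analyzed can be sketched with a basic, non-sparse estimate: smooth periodogram matrices over neighboring Fourier frequencies and eigendecompose at each frequency. The talk's contribution, sparse and frequency-localized eigenvectors with smoothed trajectories, is not shown here, and the smoothing span below is arbitrary.

```python
# Sketch: smoothed-periodogram estimate of the spectral density matrix of a
# p-dimensional series, followed by an eigendecomposition at every frequency.
import numpy as np

rng = np.random.default_rng(9)
n, p = 1024, 8
X = rng.normal(size=(n, p))
X[:, 0] = np.convolve(rng.normal(size=n + 20), np.ones(21) / 21, mode="valid")  # one autocorrelated channel

X = X - X.mean(axis=0)
d = np.fft.rfft(X, axis=0) / np.sqrt(2 * np.pi * n)      # DFT of each channel
I = d[:, :, None] * np.conj(d[:, None, :])               # periodogram matrices I(w_k), p x p

m = 10                                                   # smoothing half-width
S = np.empty_like(I)
for k in range(I.shape[0]):
    lo, hi = max(k - m, 0), min(k + m + 1, I.shape[0])
    S[k] = I[lo:hi].mean(axis=0)                         # smoothed spectral density estimate

eigvals, eigvecs = np.linalg.eigh(S)                     # eigen-structure at every frequency
print(eigvals[5, -1], eigvecs[5, :, -1].round(3))        # leading component near one frequency
```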
Keywords
Principal Component Analysis
High Dimensional Time Series
Spectral Density Matrix
Sparse Estimation
EEG Data
In multi-domain settings, where observations come from distinct but related data sources, heterogeneity often exists across domains due to shifts in data distributions. In cases of high heterogeneity, (1) training individual models on each domain and ensembling their predictions (ensemble approach) has been shown to outperform (2) combining domain datasets and fitting a single model (merged approach). However, determining when to choose each approach is less clear. This paper presents Multi-Study Adaptive Blend (MSAB), a method for optimally combining predictions from the ensemble and merged approaches adaptively across varying levels of heterogeneity. First, we provide theoretical insights on optimizing the combination weight in a linear model setting. Second, we propose a domain-wise cross-validation strategy for estimating the optimal blending weight as a practical, data-driven approach for broader applications. For a given heterogeneity level, MSAB performs comparably to or better than the best individual strategy (merged or ensemble), offering robust performance across low and high heterogeneity settings. MSAB offers potential improvements in predictive performance and mitigates the risk of selecting a suboptimal approach in multi-domain settings.
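The blending idea can be sketched in a stripped-down form, not the MSAB algorithm itself: combine the "merged" prediction and the "ensemble" prediction with a weight chosen by leave-one-domain-out cross-validation. The ridge learner, weight grid, and synthetic domains below are illustrative.

```python
# Sketch: estimate a blending weight between ensemble and merged predictions by
# holding out one domain at a time and minimizing held-out squared error.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(10)
domains = []
for _ in range(4):                                   # 4 related domains with shifted coefficients
    beta = np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=3)
    X = rng.normal(size=(150, 3))
    domains.append((X, X @ beta + rng.normal(size=150)))

def merged_pred(train, X_new):
    Xs = np.vstack([X for X, _ in train]); ys = np.concatenate([y for _, y in train])
    return Ridge().fit(Xs, ys).predict(X_new)

def ensemble_pred(train, X_new):
    return np.mean([Ridge().fit(X, y).predict(X_new) for X, y in train], axis=0)

grid, losses = np.linspace(0, 1, 21), []
for k in range(len(domains)):                        # leave one domain out
    train = [d for i, d in enumerate(domains) if i != k]
    X_te, y_te = domains[k]
    pm, pe = merged_pred(train, X_te), ensemble_pred(train, X_te)
    losses.append([np.mean((y_te - (w * pe + (1 - w) * pm)) ** 2) for w in grid])

w_hat = grid[np.argmin(np.mean(losses, axis=0))]
print("estimated blending weight on the ensemble:", w_hat)
```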
Keywords
machine learning
domain generalization
ensemble learning
multi-study prediction
This study presents a generalized LASSO regression model based on the generalized Laplace (GL) distribution. Within the T-R{Y} framework, a family of GL distributions is developed, with a particular case offering a Bayesian perspective on LASSO. This perspective introduces additional terms to the standard LASSO constraint. We examine these terms geometrically, along with the impact of the GL distribution's parameters on the generalized LASSO model. Finally, the model's adaptability and effectiveness in variable selection and prediction are illustrated using a real-world dataset.
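For orientation, the standard Bayesian reading of the classical LASSO (background material, not the talk's T-R{Y} construction) is that the LASSO solution is the posterior mode under independent Laplace priors:
\[
\hat{\beta} \;=\; \arg\min_{\beta}\; \|y - X\beta\|_2^2 + \lambda\|\beta\|_1
\;=\; \arg\max_{\beta}\; \log p(y \mid X, \beta, \sigma^2) + \sum_{j}\log \pi(\beta_j),
\]
with $y \mid X, \beta, \sigma^2 \sim N(X\beta, \sigma^2 I)$ and $\pi(\beta_j) \propto \exp\{-\lambda|\beta_j|/(2\sigma^2)\}$. Replacing the Laplace prior with a generalized Laplace distribution is what introduces the additional constraint terms described in the abstract.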
Keywords
LASSO regression
beta-Laplace distribution
T-Laplace family
Variable selection
Prediction
Estimating the covariance function of a spatial process is important for model estimation and spatial prediction. Many spatial models, such as Gaussian processes, rely on covariance functions to define their structure. However, parametric estimation can suffer from model misspecification, leading to biased predictions if the chosen covariance structure is incorrect. In this work, we study a nonparametric approach to estimate the covariance function of an isotropic stationary process in R^d. We focus on a class of covariance functions that are valid in all dimensions d>=1, which includes popular parametric kernels such as the Matérn kernel. Leveraging the fact that such covariance functions can be represented as infinite mixtures of scaled Gaussian kernels, we propose two estimation methods: least squares and nonparametric maximum likelihood estimation for estimating the mixing measure of scaled Gaussian kernels. We also develop computationally efficient methods to solve the resulting optimization problems using non-negative least squares and Fisher-scoring updates. Finally, we evaluate our proposed methods through simulations and real data, comparing them against parametric and nonparametric approaches.
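A least-squares version of the mixture idea can be sketched on a fixed grid of scales: represent the covariance as a nonnegative mixture of Gaussian kernels and fit the weights by non-negative least squares. The scale grid, noise level, and Matérn target below are illustrative, and the grid-based sketch is only a stand-in for estimating the full mixing measure.

```python
# Sketch: fit C(h) ~= sum_j w_j * exp(-h^2 / (2 s_j^2)), w_j >= 0, to a noisy
# covariogram with non-negative least squares on a grid of Gaussian scales s_j.
import numpy as np
from scipy.optimize import nnls
from scipy.special import kv, gamma

def matern(h, nu=1.5, rho=1.0):
    """Matérn correlation function (target covariance for this illustration)."""
    h = np.maximum(h, 1e-12)
    a = np.sqrt(2 * nu) * h / rho
    return (2 ** (1 - nu) / gamma(nu)) * a ** nu * kv(nu, a)

h = np.linspace(0.0, 5.0, 200)            # lags at which the covariogram is "observed"
c_emp = matern(h) + np.random.default_rng(11).normal(scale=0.01, size=h.size)

scales = np.geomspace(0.05, 5.0, 40)      # grid of Gaussian kernel scales s_j
A = np.exp(-h[:, None] ** 2 / (2 * scales[None, :] ** 2))
w, _ = nnls(A, c_emp)                     # nonnegative mixture weights

c_fit = A @ w
print("active scales:", scales[w > 1e-6].round(3))
print("max abs fit error:", np.abs(c_fit - matern(h)).max().round(4))
```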
Keywords
Stationary isotropic processes
Spatial covariance function
Nonparametric estimation
Gaussian mixtures
Fast computation
Change point detection for functional time series has attracted considerable attention from researchers. Existing methods either rely on functional principal component analysis (FPCA), which may perform poorly with complex data, or use bootstrap approaches in forms that fall short in effectively detecting diverse types of changes. In our study, we propose a novel self-normalization (SN) test for functional time series implemented via a non-overlapping block bootstrap to circumvent the reliance on FPCA. The test statistic is a normalized cumulative sum (CUSUM) where the normalizing factor allows the capture of subtle local changes in the mean function. Our theory establishes weak convergence and test consistency for both the original and bootstrap versions of the test statistic. We further extend the test to detect changes in the lag-1 autocovariance operator. Simulation studies confirm the superior performance of our test across various settings, and real-world applications further illustrate its practical utility.
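The raw CUSUM ingredient can be sketched for a mean change in curves discretized on a grid; the self-normalizing factor and block-bootstrap calibration that make up the actual test are omitted, and the synthetic curves below are illustrative.

```python
# Sketch: squared L2 norm of the CUSUM process
# CUSUM(k) = n^{-1/2} * (sum_{t<=k} X_t - (k/n) * sum_{t<=n} X_t)
# for curves observed on a common grid; its maximizer locates a mean change.
import numpy as np

rng = np.random.default_rng(12)
n, grid = 200, np.linspace(0, 1, 50)
curves = np.array([np.sin(2 * np.pi * grid) + rng.normal(scale=0.3, size=50) for _ in range(n)])
curves[120:] += 0.4 * grid                      # mean shift after time 120

total = curves.sum(axis=0)
cusum = np.array([
    (((curves[:k].sum(axis=0) - (k / n) * total) / np.sqrt(n)) ** 2).mean()
    for k in range(1, n)
])                                              # discretized squared L2 norm at each k
print("estimated change point:", np.argmax(cusum) + 1)   # should be near 120
```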
Keywords
Change point detection
Functional time series
Self-normalization
Non-overlapping block bootstrap
In spatial point process intensity estimation, traditional methods such as kernel estimators and regression models have been effective in estimating the intensity function of a spatial point pattern. However, they fall short when dealing with nonlinear correlations. Deep learning models, such as Neural Networks (NNs) and Variational AutoEncoders (VAEs), offer a promising alternative to address these limitations due to their inherent properties and settings. These are widely used and acknowledged for their flexibility and capability to handle complex, nonlinear relationships. In this study, we additionally incorporate a bandwidth-trainable KDE layer into our model; the resulting KDE-NN model provides additional flexibility to capture spatial correlation in the data while also controlling the degree of smoothness.
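The general idea of a KDE layer with a trainable bandwidth can be sketched as a toy module; this is not the authors' KDE-NN architecture, and the kernel, initialization, and simulated point pattern below are assumptions for illustration.

```python
# Toy sketch: a Gaussian KDE intensity layer whose log-bandwidth is a learnable
# parameter, so the degree of smoothing is optimized jointly with the network.
import torch
from torch import nn

class TrainableKDE(nn.Module):
    """Gaussian kernel intensity estimate over 2D event locations."""
    def __init__(self, events, init_bandwidth=0.1):
        super().__init__()
        self.events = events                                   # (n_events, 2), fixed point pattern
        self.log_h = nn.Parameter(torch.log(torch.tensor(init_bandwidth)))
    def forward(self, locations):                              # (m, 2) query locations
        h = torch.exp(self.log_h)
        d2 = torch.cdist(locations, self.events) ** 2          # squared distances to events
        kern = torch.exp(-d2 / (2 * h**2)) / (2 * torch.pi * h**2)
        return kern.sum(dim=1)                                 # estimated intensity at each location

torch.manual_seed(0)
events = torch.rand(300, 2)                                    # simulated point pattern on [0,1]^2
layer = TrainableKDE(events)
grid = torch.rand(10, 2)
intensity = layer(grid)
intensity.sum().backward()                                     # the bandwidth receives a gradient
print(intensity.detach().round(decimals=2), layer.log_h.grad)
```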
Keywords
Spatial Intensity Estimation
Deep Learning Model
Bandwidth Selection
Kernel Density Estimation
Co-Author
Ji Meng Loh, New Jersey Institute of Technology
First Author
Zhiwen Wang, New Jersey Institute of Technology
Presenting Author
Zhiwen Wang, New Jersey Institute of Technology
Simultaneous feature selection and non-linear function estimation are challenging, especially in high-dimensional settings where the number of variables exceeds the available sample size. We investigate feature selection in neural networks and address the limitations of group LASSO, which tends to select unimportant variables due to over-shrinkage. To overcome this, we propose a sparse-input neural network framework using group concave regularization for feature selection in both low- and high-dimensional settings. The key idea is to apply a concave penalty to the $l_2$ norm of weights from all outgoing connections of each input node, yielding a neural net that uses only a small subset of variables. We also develop an efficient algorithm based on backward path-wise optimization to produce stable solution paths and tackle complex optimization landscapes. Extensive simulations and real data examples demonstrate the proposed estimator's strong performance in feature selection and prediction for continuous, binary, and time-to-event outcomes.
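The penalty idea can be sketched in a minimal reading of the abstract, not the authors' full algorithm: apply a concave penalty (MCP here) to the $l_2$ norm of all outgoing weights of each input node, so whole inputs are shrunk together. Plain gradient descent, as below, will not produce exact zeros; the paper uses a backward path-wise optimization algorithm, and the data here are synthetic noise used only to show the mechanics.

```python
# Sketch: group MCP penalty on the column norms of a network's input layer.
import torch
from torch import nn

def mcp(t, lam=0.1, gamma=3.0):
    """Minimax concave penalty applied elementwise to nonnegative t."""
    quad = lam * t - t**2 / (2 * gamma)
    return torch.where(t <= gamma * lam, quad, torch.full_like(t, 0.5 * gamma * lam**2))

net = nn.Sequential(nn.Linear(50, 32), nn.ReLU(), nn.Linear(32, 1))
x, y = torch.randn(128, 50), torch.randn(128, 1)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(100):
    opt.zero_grad()
    mse = nn.functional.mse_loss(net(x), y)
    group_norms = net[0].weight.norm(dim=0)       # one l2 norm per input feature (50 groups)
    loss = mse + mcp(group_norms).sum()
    loss.backward()
    opt.step()

# Approximate support; exact zeros would require a proximal / path-wise algorithm.
selected = (net[0].weight.norm(dim=0) > 1e-3).nonzero().squeeze()
print("inputs kept:", selected.numel())
```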
Keywords
Neural networks
Feature selection
High dimensionality
LASSO
nonconvex penalty
Co-Author
Susan Halabi, Duke University
First Author
Bin Luo, Kennesaw State University
Presenting Author
Bin Luo, Kennesaw State University
Non-linear dimension reduction (NLDR) techniques such as t-SNE and UMAP provide a low-dimensional representation of high-dimensional data by applying a non-linear transformation. The methods and parameter choices can create wildly different representations, so much so that it is difficult to decide which is best, or whether any or all are accurate or misleading. NLDR often exaggerates random patterns, sometimes due to the samples observed, but NLDR views have an important role in data analysis because, if done well, they provide a concise visual (and conceptual) summary of high-dimensional distributions. To help evaluate an NLDR layout, we have developed a way to take the fitted model, as represented by the positions of points in 2D, and turn it into a high-dimensional wireframe to overlay on the data, viewing it with a tour. Viewing a model in the data space is an ideal way to examine the fit. It is used here to help with the difficult decision of which 2D layout is the best representation of the high-dimensional distribution, whether the 2D layout is displaying mostly random structure, and whether different methods give the same summary or have particular quirks. The method is available in the R package `quollr`.
Keywords
high-dimensional data visualization
non-linear dimension reduction
tour