05/24/2023: 3:45 PM - 5:15 PM CDT
Lightning
Room: Grand Ballroom C
This session will be followed by e-poster presentations on Thursday, 5/25 at 9:55 AM.
Chair
Joyce Robbins, Columbia University
Tracks
Computational Statistics
Machine Learning
Practice and Applications
Symposium on Data Science and Statistics (SDSS) 2023
Presentations
Cluster analysis plays an important role in spatial data mining because it allows interesting structures and clusters to be discovered directly from the data without the use of any background knowledge. Commonly used clustering algorithms tend to identify ellipsoidal, spherical, or other regularly shaped clusters, but encounter difficulties when dealing with complex underlying groups that possess non-linear shapes, varying densities, and noisy connections between multiple clusters. In this article, we propose a graph-based spatial clustering technique that utilizes Delaunay triangulation along with conventional mechanisms such as DBSCAN (density-based spatial clustering of applications with noise) and KNN (k-nearest neighbors). Using these mechanisms, we take the distributions of triangle areas, angles, and relative side lengths into account as criteria for separating clusters with different densities. Moreover, by integrating Otsu's segmentation algorithm, our proposed method is able to resolve the issue of adjacent clusters touching one another. In performance evaluations using synthetic data, as well as real data with regular and irregular structures, our methodology maintains top performance in clustering accuracy and in the separability of neighboring clusters compared to traditional clustering techniques.
Presenting Author
Sihan Zhou
First Author
Sihan Zhou
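A minimal sketch of the core idea described in the abstract above (not the authors' full method): build a Delaunay triangulation of the points, apply Otsu's threshold to the edge lengths to drop unusually long edges, and take connected components of the pruned graph as clusters. The toy data and the use of scipy/scikit-image are illustrative assumptions.

import numpy as np
from scipy.spatial import Delaunay
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components
from skimage.filters import threshold_otsu

rng = np.random.default_rng(0)
# Two irregular point clouds plus background noise (toy data).
pts = np.vstack([
    rng.normal([0, 0], 0.3, size=(200, 2)),
    rng.normal([3, 3], 0.5, size=(200, 2)),
    rng.uniform(-2, 5, size=(30, 2)),
])

tri = Delaunay(pts)
# Collect the unique Delaunay edges and their lengths.
edge_set = set()
for simplex in tri.simplices:
    for i in range(3):
        a, b = sorted((simplex[i], simplex[(i + 1) % 3]))
        edge_set.add((a, b))
edges = np.array(sorted(edge_set))
lengths = np.linalg.norm(pts[edges[:, 0]] - pts[edges[:, 1]], axis=1)

# Otsu's threshold separates "short" within-cluster edges from long bridges.
cut = threshold_otsu(lengths)
keep = edges[lengths < cut]

# Connected components of the pruned graph are the spatial clusters.
n = len(pts)
adj = coo_matrix((np.ones(len(keep)), (keep[:, 0], keep[:, 1])), shape=(n, n))
n_clusters, labels = connected_components(adj, directed=False)
print(n_clusters, np.bincount(labels)[:10])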
Statistical analysis of spatio-temporal data has evolved over time to handle increasingly large data sets. For example, the North American CORDEX program is producing daily values of climate-related variables on spatial grids with approximately 100,000 locations over 150 years. Smoothing of such massive and noisy data is essential to understanding their spatio-temporal features. It also reduces the size of the data by representing them in terms of suitable basis functions, which facilitates further computations and statistical analysis. Traditional tensor-based methods break down under the size of such massive data. We develop a penalized spline method for representing such data using a generalization of the sandwich smoother proposed by Xiao et al. (2013). Unlike the original method, our generalization treats the spatial and temporal dimensions distinctly and allows the methodology to be applied directly to non-gridded data. Additionally, the new method can exploit parallel computing architectures. We demonstrate the practicality of the methodology using both simulated and real data. Both the new smoother and the original sandwich smoother are implemented in the hero R package.
Presenting Author
Joshua French, University of Colorado-Denver
First Author
Joshua French, University of Colorado-Denver
CoAuthor
Piotr Kokoszka, Colorado State University
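The "sandwich" structure mentioned above can be illustrated with a toy example: a Kronecker-product smoother applied to vectorized gridded data is equivalent to smoothing rows and then columns. The sketch below uses simple Whittaker-type difference-penalty smoothers purely for illustration; it is not the generalized sandwich smoother implemented in the hero package.

import numpy as np

def whittaker_smoother(n, lam, order=2):
    """(I + lam * D'D)^{-1}: a simple difference-penalty smoother matrix."""
    D = np.diff(np.eye(n), n=order, axis=0)
    return np.linalg.inv(np.eye(n) + lam * D.T @ D)

rng = np.random.default_rng(1)
n_space, n_time = 60, 80
s = np.linspace(0, 1, n_space)[:, None]
t = np.linspace(0, 1, n_time)[None, :]
signal = np.sin(2 * np.pi * s) * np.cos(2 * np.pi * t)
Y = signal + rng.normal(scale=0.5, size=(n_space, n_time))

# "Sandwich" structure: the Kronecker-product smoother (S_t kron S_s) applied
# to vec(Y) is equivalent to smoothing rows and columns separately: S_s Y S_t'.
S_s = whittaker_smoother(n_space, lam=50.0)
S_t = whittaker_smoother(n_time, lam=50.0)
Y_hat = S_s @ Y @ S_t.T

print("noisy RMSE:", np.sqrt(np.mean((Y - signal) ** 2)).round(3))
print("smoothed RMSE:", np.sqrt(np.mean((Y_hat - signal) ** 2)).round(3))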
In many scientific applications, measured time series are corrupted by noise or distortions. Traditional denoising techniques often fail to recover the signal of interest, particularly when the signal-to-noise ratio is low or when certain assumptions on the signal and noise are violated. In this work, we demonstrate that deep learning-based denoising methods can outperform traditional techniques while exhibiting greater robustness to variation in noise and signal characteristics. Our motivating example is magnetic resonance spectroscopy, in which a primary goal is to detect the presence of short-duration, low-amplitude radio frequency signals that are often obscured by strong interference that can be difficult to separate from the signal using traditional methods. We explore various deep learning architecture choices to capture the inherently complex-valued nature of magnetic resonance signals. On both synthetic and experimental data, we show that our deep learning-based approaches can exceed the performance of traditional techniques, providing a powerful new class of methods for the analysis of scientific time series data.
Presenting Author
Amber Day, University of Texas at Austin
First Author
Amber Day, University of Texas at Austin
CoAuthor(s)
Natalie Klein
Sinead Williamson
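As a rough illustration of the general approach (not the authors' architecture), one common way to handle complex-valued signals is to stack the real and imaginary parts as two channels of a 1D convolutional network. The toy signal, the network, and the PyTorch dependency below are all assumptions.

import math
import torch
import torch.nn as nn

class ConvDenoiser(nn.Module):
    """A small 1D convolutional denoiser; real and imaginary parts are
    treated as two input/output channels."""
    def __init__(self, channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2, channels, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(channels, 2, kernel_size=9, padding=4),
        )
    def forward(self, x):
        return self.net(x)

# Toy decaying complex exponential buried in noise.
torch.manual_seed(0)
t = torch.linspace(0, 1, 512)
clean = torch.exp(2j * math.pi * 30 * t) * torch.exp(-3 * t)
noisy = clean + 0.5 * (torch.randn(512) + 1j * torch.randn(512))

def to_channels(z):           # complex (L,) -> real (1, 2, L)
    return torch.stack([z.real, z.imag]).unsqueeze(0).float()

model = ConvDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = to_channels(noisy), to_channels(clean)
for step in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
print("final training MSE:", loss.item())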
Biological networks, such as protein-protein and gene-gene networks, are crucial to the physiology and function of organisms. High-throughput technology has led to significant progress in understanding individual biological entities, but comprehending the interactions between them remains challenging due to the complexity and vastness of these networks. To address this, we explore the use of network curvature, a mathematical concept that captures natural behaviors of a graph, including diffusion, information flow, and network resilience. Analyzing biological networks with network curvature offers insights into the fundamental structure and dynamics of networks underlying biological phenomena, which can lead to a better understanding of disease mechanisms and treatment options. We investigate the application of well-defined curvatures, including Ollivier-Ricci, Balanced Forman, Diffusion, and Bakry-Emery, to protein and gene networks. Our findings provide new insights into the structure and dynamics of these networks, with potential implications for understanding disease mechanisms and identifying effective treatments. In particular, Ollivier-Ricci curvature can be used to measure network clustering, Balanced Forman curvature can identify critical genes and proteins, Diffusion curvature can study disease spread and intervention effectiveness, and Bakry-Emery curvature can identify key metabolic pathways. Our work contributes to the development of network curvature as a valuable tool for understanding biological networks.
Presenting Author
Yun Jin Park
First Author
Yun Jin Park
CoAuthor
Didong Li, University of North Carolina, Chapel Hill
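For concreteness, a minimal computation of Ollivier-Ricci curvature on a single edge, using lazy random-walk measures and a small optimal-transport linear program. The example graph, the laziness parameter, and the use of networkx/scipy are illustrative choices, not the authors' pipeline.

import networkx as nx
import numpy as np
from scipy.optimize import linprog

def ollivier_ricci(G, u, v, alpha=0.5):
    """kappa(u, v) = 1 - W1(mu_u, mu_v) / d(u, v), where mu_x places mass
    alpha at x and spreads (1 - alpha) uniformly over the neighbors of x."""
    def measure(x):
        nbrs = list(G.neighbors(x))
        return [x] + nbrs, [alpha] + [(1 - alpha) / len(nbrs)] * len(nbrs)

    su, mu = measure(u)
    sv, mv = measure(v)
    dist = dict(nx.all_pairs_shortest_path_length(G))
    cost = np.array([[dist[a][b] for b in sv] for a in su], float)

    # Transport LP: minimize sum_ij cost_ij * x_ij subject to the marginals.
    nA, nB = len(su), len(sv)
    A_eq = []
    for i in range(nA):                       # row sums = mu
        row = np.zeros((nA, nB)); row[i, :] = 1; A_eq.append(row.ravel())
    for j in range(nB):                       # column sums = mv
        col = np.zeros((nA, nB)); col[:, j] = 1; A_eq.append(col.ravel())
    res = linprog(cost.ravel(), A_eq=np.array(A_eq), b_eq=np.array(mu + mv),
                  bounds=(0, None), method="highs")
    return 1 - res.fun / dist[u][v]

G = nx.karate_club_graph()
print(round(ollivier_ricci(G, 0, 1), 3))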
Spatial clustering is a common unsupervised learning problem with many applications in areas such as public health, urban planning, and transportation, where the goal is to identify clusters of similar locations based on regionalization as well as on patterns in characteristics over those locations. Unlike standard clustering, a well-studied area with a rich literature including methods such as K-means clustering, spectral clustering, and hierarchical clustering, spatial clustering is a relatively sparse area of study due to inherent differences between the spatial domain of the data and its corresponding covariates. For example, in the American Community Survey dataset, spatial differences between tracts cannot be directly compared to differences in participant survey responses to indicators such as employment status or income. In this paper, we develop a spatial clustering algorithm, called Gaussian Process Spatial Clustering (GPSC), which clusters functions between data by leveraging the flexibility of Gaussian processes, and we extend it to the case of clustering geospatial data. We provide theoretical guarantees and demonstrate its ability to recover true clusters in several simulation studies and in a real-world dataset, identifying clusters of tracts in North Carolina based on socioeconomic and environmental indicators associated with health and cancer risk.
Presenting Author
Hongqian Niu
First Author
Hongqian Niu
CoAuthor(s)
Melissa Troester, University of North Carolina - Chapel Hill
Didong Li, University of North Carolina, Chapel Hill
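A rough sketch of the general idea of comparing smoothed relationships across spatial units (this is not the GPSC algorithm): fit a Gaussian process regression per unit, evaluate the posterior mean on a common grid, and cluster the resulting curves, here with an ad hoc K-means step. All data-generating choices below are assumptions.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
grid = np.linspace(0, 1, 50)[:, None]

# Toy data: 20 "regions", each with a noisy covariate-outcome relationship
# drawn from one of two underlying functions.
curves, truth = [], []
for r in range(20):
    k = r % 2
    f = (lambda x: np.sin(3 * x)) if k == 0 else (lambda x: x ** 2)
    x = rng.uniform(0, 1, size=(30, 1))
    y = f(x[:, 0]) + rng.normal(scale=0.2, size=30)
    gp = GaussianProcessRegressor(kernel=RBF(0.2) + WhiteKernel(0.04))
    gp.fit(x, y)
    curves.append(gp.predict(grid))   # posterior-mean curve on a common grid
    truth.append(k)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(np.array(curves))
print(list(zip(truth, labels)))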
Agglomerative Hierarchical Clustering (AHC) is a very popular statistical method that allows objects to be grouped into homogeneous clusters. This clustering method has the advantage of being able to represent its steps in a graphical form with a dendrogram.
In such a cluster analysis, the choice of the number of clusters is crucial. In practice, statistical experts often use the dendrogram to determine it. Indeed, a large gap in the dendrogram characterises two heterogeneous clusters, whereas a small gap implies that the two aggregated clusters are close.
Much has been written to help expert and non-expert users choose the correct number of clusters, and many indices have been proposed (Charrad et al., 2014). However, the vast majority of these indices have been constructed to satisfy the most classical case: Euclidean distance and Ward's criterion. Although they perform well in this case, they become obsolete when the distance or the aggregation method changes. Indeed, depending on the type of data (numerical, categorical) and on their preferences, users may choose a distance such as Canberra or an aggregation method such as single linkage while still needing guidance on the number of clusters to select. Therefore, we propose to generalise two well-known indices to any distance and aggregation method: the Hartigan index (Hartigan, 1975) and the Calinski-Harabasz index (Calinski & Harabasz, 1974). As we demonstrate, these indices can be obtained directly from the dendrogram values in the Euclidean/Ward's case. Moreover, they are related to the heterogeneity gap, which is usually interpreted graphically by experts. Thanks to these properties, we show that we can generalise these two indices by directly using the dendrogram values, regardless of the distance and aggregation method chosen by the user.
Finally, the limitations of using the two raw indices outside the Euclidean/Ward context and the benefits of the proposed generalisation are illustrated with XLSTAT software.
Presenting Author
Fabien Llobell, Addinsoft, XLSTAT
First Author
Fabien Llobell, Addinsoft, XLSTAT
CoAuthor
Nour Selmi, Lumivero, XLSTAT, Paris, France
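A small illustration of reading the heterogeneity gap directly off the dendrogram merge heights for an arbitrary distance/linkage combination (here Canberra distance with single linkage). This is only the graphical heuristic, not the generalized Hartigan or Calinski-Harabasz indices proposed in the talk; the simulated data are an assumption.

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (50, 4)),
               rng.normal(2, 0.3, (50, 4)),
               rng.normal(5, 0.3, (50, 4))])

# Any distance / aggregation method supported by SciPy can be plugged in here.
Z = linkage(pdist(X, metric="canberra"), method="single")

# Merge heights, last merges first; the biggest gap between consecutive
# heights suggests where to cut the dendrogram.
heights = Z[:, 2][::-1]
gaps = heights[:-1] - heights[1:]
k = int(np.argmax(gaps[:9])) + 2          # consider 2..10 clusters
labels = fcluster(Z, t=k, criterion="maxclust")
print("suggested number of clusters:", k)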
Statistical models are used for explaining and/or predicting an outcome of interest. For explanation, the focus is on estimating the parameters that describe an independent variable's effect. In this regard, the effect of a misclassified outcome variable and how to correct for it have been studied extensively, with one popular method being MCSIMEX. However, a relevant question yet to be addressed is how misclassification affects predictive performance. We investigate this through extensive simulation studies. Motivated by a real-world example, we generated a binary event status Y that is subject to misclassification. We fit a logistic regression model using the misclassified Y* and assessed model performance on test data simulated from the same underlying model without misclassification. We show that the predictive performance on test data is similar regardless of whether or not the misclassified Y* was corrected, and is always better than the performance on the training data.
Presenting Author
Zorina Han, University of Alberta
First Author
Zorina Han, University of Alberta
CoAuthor
Yan Yuan, University of Alberta
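A minimal version of this kind of simulation (with assumed data-generating values, not the authors' settings): train a logistic regression on misclassified labels and compare test-set discrimination against training on the clean labels.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)

def simulate(n, beta=(0.5, -1.0, 1.5)):
    X = rng.normal(size=(n, 2))
    p = 1 / (1 + np.exp(-(beta[0] + X @ np.array(beta[1:]))))
    return X, rng.binomial(1, p)

def misclassify(y, sens=0.85, spec=0.90):
    """Flip labels according to assumed sensitivity and specificity."""
    keep = np.where(y == 1, rng.binomial(1, sens, y.size),
                            1 - rng.binomial(1, 1 - spec, y.size))
    return np.where(keep == 1, y, 1 - y)

X_tr, y_tr = simulate(2000)
X_te, y_te = simulate(2000)          # clean test outcomes
y_star = misclassify(y_tr)           # misclassified training outcomes

for label, y_fit in [("clean Y", y_tr), ("misclassified Y*", y_star)]:
    m = LogisticRegression().fit(X_tr, y_fit)
    auc = roc_auc_score(y_te, m.predict_proba(X_te)[:, 1])
    print(f"trained on {label:18s} test AUC = {auc:.3f}")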
It has been shown that network models with community structure, such as the stochastic block model and its generalizations, can be defined as generalized random dot product graphs with various community-wise structures. As an example, the stochastic block model is a generalized random dot product graph in which the communities are represented by point masses in the latent space. Based on this connection, we define the manifold block model as a generalized random dot product graph in which the communities are represented by manifolds in the latent space. The manifold block model is motivated by networks observed in real data. This leads to the K-curves clustering algorithm for community detection in this setting. We derive asymptotic properties for a semi-supervised version of this algorithm and demonstrate them via simulation.
Presenting Author
John Koo, Indiana University
First Author
John Koo, Indiana University
CoAuthor(s)
Minh Tang, North Carolina State University
Michael Trosset, Indiana University
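A toy generalized random dot product graph whose latent positions lie on two curves, with adjacency spectral embedding used to recover the latent structure. K-means stands in for the curve-based clustering step, so this is not the K-curves algorithm itself; the curves and graph size are assumptions.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
n = 300

# Latent positions on two one-dimensional curves inside the unit square,
# chosen so that all inner products lie in [0, 1].
t = rng.uniform(0, 1, n)
z = rng.integers(0, 2, n)                     # true curve membership
X = np.where(z[:, None] == 0,
             np.column_stack([0.2 + 0.3 * t, 0.2 + 0.1 * np.sin(4 * t)]),
             np.column_stack([0.2 + 0.1 * np.sin(4 * t), 0.2 + 0.3 * t]))

P = X @ X.T                                   # edge probabilities (RDPG)
A = rng.binomial(1, P)
A = np.triu(A, 1); A = A + A.T                # symmetric, no self-loops

# Adjacency spectral embedding: top-d scaled eigenvectors of A.
vals, vecs = np.linalg.eigh(A)
idx = np.argsort(vals)[::-1][:2]
Xhat = vecs[:, idx] * np.sqrt(np.abs(vals[idx]))

# K-means here is only a stand-in for the curve-aware clustering step.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Xhat)
print("agreement with true curves:", max(np.mean(labels == z), np.mean(labels != z)))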
In statistics, truncation occurs when the values of a given probability distribution are limited to being above or below a specific threshold or to lying within a specific range, with no information available for values outside the bounds of truncation. Truncated data can appear in a wide variety of settings, including the fields of reliability and econometrics. In addition, another application of truncated distributions is the modeling of proportion data. For example, when the arcsine square root transformation sin^(-1)(√p) is applied to a given proportion p, the transformed data can be modeled using a truncated Gaussian distribution, where the region of truncation for the transformed values is [0, π/2]. One area of statistical modeling where the truncated Gaussian distribution has been used to model proportion data is small area estimation (SAE). For example, area-level SAE models have been used to model county-level proportions (arcsine square root transformed) of various health outcomes using truncated Gaussian distributions via Markov Chain Monte Carlo (MCMC).
An essential feature of MCMC modeling is determining whether or not the MCMC sample has converged to a stationary distribution. There are several ways to evaluate convergence including graphical (e.g., trace plots, autocorrelation plots, density plots) and statistical (e.g., Geweke, Heidelberger-Welch, Gelman-Rubin, and Raftery-Lewis tests), but there has been limited research into the impact truncation may have on the various methods used to evaluate MCMC convergence. For this work, we will primarily focus on the statistical tests most commonly used in assessing MCMC convergence to determine how the statistical derivations of each MCMC convergence diagnostic are impacted by truncation. In addition, simulations will be used to evaluate how the type and degree of truncation impact statistical tests used to assess MCMC convergence.
Presenting Author
John Pleis, National Center for Health Statistics
First Author
John Pleis, National Center for Health Statistics
CoAuthor(s)
Diba Khan
Benmei Liu, National Cancer Institute
Yulei He, National Center for Health Statistics
Van Parsons, National Center for Health Statistics
Bill Cai, National Center for Health Statistics
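A small illustration of the kind of setting studied: Metropolis chains targeting a Gaussian truncated to [0, π/2], with a basic Gelman-Rubin statistic computed by hand. The target parameters and chain settings are assumptions, and the diagnostic is the simplified (non-split) version rather than the talk's analysis.

import numpy as np

rng = np.random.default_rng(6)
LO, HI = 0.0, np.pi / 2          # truncation region on the transformed scale

def log_target(x, mu=0.9, sigma=0.3):
    """Unnormalized log density of N(mu, sigma^2) truncated to [LO, HI]."""
    if x < LO or x > HI:
        return -np.inf
    return -0.5 * ((x - mu) / sigma) ** 2

def metropolis(n, start, step=0.2):
    x, out = start, np.empty(n)
    for i in range(n):
        prop = x + rng.normal(scale=step)
        if np.log(rng.uniform()) < log_target(prop) - log_target(x):
            x = prop
        out[i] = x
    return out

chains = np.array([metropolis(4000, s)[2000:] for s in (0.1, 0.5, 1.0, 1.4)])

# Gelman-Rubin potential scale reduction factor (basic version).
m, n = chains.shape
W = chains.var(axis=1, ddof=1).mean()                    # within-chain variance
B = n * chains.mean(axis=1).var(ddof=1)                  # between-chain variance
R_hat = np.sqrt(((n - 1) / n * W + B / n) / W)
print("R-hat:", round(R_hat, 4))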
The longitudinal cluster randomized trial (LCRT) is a type of cluster randomized trial that has been frequently used in clinical research. In LCRTs, clusters of subjects are randomly assigned to different treatment groups or to sequences with various treatment orders, and each subject has repeated measurements over time during the study. These features, however, present challenges that need to be addressed at both the experimental design and data analysis stages. Two salient features of LCRTs are the complicated correlation structure constituted by longitudinal and between-subject correlations and the missing-data scenarios caused by the prolonged study period. To handle them, we propose closed-form sample size and power formulas for detecting the intervention effect in LCRTs with different types of outcomes and distinct design features, which offer great flexibility to account for unbalanced designs, various design matrices, different missing patterns, and complicated correlation structures. Extensive simulation studies showed that the proposed methods achieve good performance over a wide spectrum of design configurations.
Presenting Author
Jijia Wang, UT Southwestern Medical Center
First Author
Jijia Wang, UT Southwestern Medical Center
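For reference, a textbook two-sample power calculation for a (non-longitudinal) cluster randomized trial using the usual design effect 1 + (m - 1)ρ; the proposed closed-form formulas for LCRTs extend well beyond this simple case, which is shown here only as a familiar baseline.

from math import sqrt
from scipy.stats import norm

def crt_power(delta, sigma, n_clusters_per_arm, cluster_size, icc, alpha=0.05):
    """Power to detect mean difference `delta` in a two-arm cluster randomized
    trial, using the standard design effect 1 + (m - 1) * icc."""
    deff = 1 + (cluster_size - 1) * icc
    n_eff = n_clusters_per_arm * cluster_size / deff     # effective n per arm
    se = sigma * sqrt(2 / n_eff)
    z_alpha = norm.ppf(1 - alpha / 2)
    return norm.cdf(abs(delta) / se - z_alpha)

# Example: 12 clusters of 25 subjects per arm, ICC = 0.05, effect 0.3 SD.
print(round(crt_power(delta=0.3, sigma=1.0,
                      n_clusters_per_arm=12, cluster_size=25, icc=0.05), 3))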
Functional time series (FTS) are constituted by dependent functions and can be used to model several applied processes. Several machine-learning approaches have been developed in the literature to gain insight into the stochastic processes that generate FTS. In this work, we present regularization techniques in the analysis of FTS. Singular spectrum analysis (SSA) is a non-parametric technique for decomposing time series into trends, periodicities, and noise components. Functional SSA (FSSA) is the functional extension of SSA applied to FTS. We begin by representing FTS as multivariate time series (MTS) data and develop a regularization technique for multivariate singular spectrum analysis (MSSA). MSSA is a decomposition technique for MTS, and we denote the regularized version of the algorithm as reMSSA. reMSSA is formulated as a penalized loss minimization problem where we employ regularized singular value decomposition (RSVD) to find low-rank trajectory matrix approximations of the data. Next, we develop a similar regularization technique for FSSA. Regularized FSSA (reFSSA) is developed as an extension of FSSA. A penalty function with a smoothing parameter is added to the loss function measuring the reconstruction error of a low-rank trajectory operator approximation. Regularized functional SVD (RfSVD) is used to solve the minimization problem. RfSVD allows the derivation of a closed-form generalized cross-validation (GCV) criterion for selecting smoothing parameters. Hilbert SSA (HSSA) is the application of SSA to FTS objects created by defining a basis system in the Hilbert space. The basis system for HSSA is different from the known basis function systems (monomial, Fourier, b-spline, etc.) used for FSSA. We develop a regularization technique for HSSA. Finally, we apply reMSSA, reFSSA, and regularization based on HSSA to call center data that contains the number of incoming calls to a bank's call center in Israel. We show that the proposed regularization techniques, reMSSA, reFSSA, and regularization based on HSSA, outperform MSSA, FSSA, and HSSA, respectively, by effectively smoothing the rough components generated by MSSA, FSSA, and HSSA of the MTS and FTS objects.
Presenting Author
Jesse Adikolrey, Marquette University
First Author
Jesse Adikolrey, Marquette University
CoAuthor
Mehdi Maadooliat, Marquette University
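As background for the methods above, a sketch of the basic (unregularized, univariate) SSA building block: embed the series into a trajectory matrix, take the SVD, and map rank-one components back to series by diagonal averaging. The example series and window length are assumptions.

import numpy as np

def ssa_components(y, L):
    """Basic singular spectrum analysis: embed a series into its trajectory
    matrix, take the SVD, and return one reconstructed series per component."""
    N, K = len(y), len(y) - L + 1
    X = np.column_stack([y[i:i + L] for i in range(K)])   # L x K trajectory matrix
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    comps = []
    for r in range(len(s)):
        Xr = s[r] * np.outer(U[:, r], Vt[r])
        # Diagonal (Hankel) averaging maps the rank-1 matrix back to a series.
        comp = np.array([np.mean(np.diag(Xr[:, ::-1], k)) for k in range(K - 1, -L, -1)])
        comps.append(comp)
    return np.array(comps)

rng = np.random.default_rng(7)
t = np.arange(300)
y = 0.02 * t + np.sin(2 * np.pi * t / 25) + rng.normal(scale=0.3, size=300)
comps = ssa_components(y, L=60)
trend_plus_cycle = comps[:3].sum(axis=0)     # leading components ~ trend + periodicity
print("residual std:", np.std(y - trend_plus_cycle).round(3))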
A number of density functions of interest can be written in the form of a weighted density: the product of a base density and a nonnegative weight function that provides an adjustment. Generation of random variates from such a distribution may be nontrivial and can involve an intractable normalizing constant. Rejection sampling may be used to generate exact draws but requires determination of a proposal distribution. To be practical for an intended application, the proposal must both be convenient to sample from and accept draws with large enough probability. A well-known approach to obtain a proposal involves decomposing the target density into a finite mixture where components may correspond to a partition of the support. This work considers focusing such a construction on an envelope for the weight function. This may be applicable when assumptions for adaptive rejection sampling and related algorithms are not met. An upper bound on rejection probability from this proposal construction can be expressed and potentially reduced to a desired tolerance by making suitable refinements. Several example applications will be considered to illustrate the method.
Presenting Author
Andrew Raim, U.S. Census Bureau
First Author
Andrew Raim, U.S. Census Bureau
CoAuthor(s)
James Livsey, U.S. Census Bureau
Kyle Irimata
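The simplest version of the construction described above uses a single global envelope for the weight function with the base density as the proposal; a partition refines this by using one bound per cell. The base density and weight function below are chosen only for illustration.

import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

# Target: f(x) proportional to g(x) * w(x), with base density g = N(0, 1)
# and a bounded weight function w (both chosen only for illustration).
base = stats.norm(0, 1)
def w(x):
    return np.exp(-0.5 * np.sin(3 * x) ** 2)      # bounded above by 1

W_MAX = 1.0          # sup of w; with a partition, one bound per cell instead

def sample(n):
    draws, n_tries = [], 0
    while len(draws) < n:
        x = base.rvs(size=n, random_state=rng)
        u = rng.uniform(size=n)
        accepted = x[u < w(x) / W_MAX]      # accept with probability w(x) / W_MAX
        draws.extend(accepted.tolist())
        n_tries += n
    print(f"acceptance rate ~ {len(draws) / n_tries:.2f}")
    return np.array(draws[:n])

xs = sample(10000)
print("sample mean, sd:", xs.mean().round(3), xs.std().round(3))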
The paper considers the DIverse MultiPLEx (DIMPLE) network model, introduced in Pensky and Wang (2021), where all layers of the network have the same collection of nodes and are equipped with Stochastic Block Models. In addition, all layers can be partitioned into groups with the same community structures, although layers in the same group may have different matrices of block connection probabilities. The DIMPLE model generalizes a multitude of papers that study multilayer networks with the same community structures in all layers, as well as the Mixture Multilayer Stochastic Block Model (MMLSBM), where layers in the same group have identical matrices of block connection probabilities. While Pensky and Wang (2021) applied spectral clustering to a proxy of the adjacency tensor, the present paper uses Sparse Subspace Clustering (SSC) to identify groups of layers with identical community structures. Under mild conditions, the latter leads to strongly consistent between-layer clustering. In addition, SSC can handle much larger networks than the methodology of Pensky and Wang (2021) and is well suited to parallel computing.
Presenting Author
Majid Noroozi, University of Memphis
First Author
Majid Noroozi, University of Memphis
CoAuthor
Marianna Pensky, University of Central Florida
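A generic sparse subspace clustering sketch on synthetic points from two subspaces (not the paper's layer-clustering construction): lasso-regress each point on the others, symmetrize the absolute coefficient matrix, and apply spectral clustering. The data, penalty level, and use of scikit-learn are illustrative assumptions.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(9)

# Points drawn from two low-dimensional subspaces of R^20 plus small noise.
def subspace_points(n, dim_ambient=20, dim_sub=3):
    basis = np.linalg.qr(rng.normal(size=(dim_ambient, dim_sub)))[0]
    return (basis @ rng.normal(size=(dim_sub, n))).T

X = np.vstack([subspace_points(60), subspace_points(60)])
X += 0.01 * rng.normal(size=X.shape)
truth = np.repeat([0, 1], 60)

# Sparse self-expression: represent each point as a sparse combination of the others.
n = X.shape[0]
C = np.zeros((n, n))
for i in range(n):
    others = np.delete(np.arange(n), i)
    coef = Lasso(alpha=0.01, fit_intercept=False, max_iter=5000).fit(X[others].T, X[i]).coef_
    C[i, others] = coef

# Spectral clustering on the symmetrized affinity |C| + |C|'.
A = np.abs(C) + np.abs(C).T
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(A)
print("agreement:", max(np.mean(labels == truth), np.mean(labels != truth)))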