Monday, Aug 4: 10:30 AM - 12:20 PM
4047
Contributed Papers
Music City Center
Room: CC-102B
Main Sponsor
Section on Statistical Learning and Data Science
Presentations
Ordinary imputation methods may not be able to handle heterogeneous co-missing data, such as the lung function measures from spirometry tests in population-based studies. This work reviews and evaluates various statistical and machine learning imputation methods for estimating the prevalence of impaired lung function, such as chronic obstructive pulmonary disease (COPD), using data from public surveys in aging studies. Unsupervised learning (clustering) methods improve multiple imputation. The k-prototypes method outperforms DBSCAN because it handles categorical data more effectively. Direct imputation based on the predicted values of random forests and artificial neural networks is unsatisfactory. When combined with multiple imputation, k-prototypes clustering appears to be the most suitable method for imputing missing spirometry values. Even when the imputation functions differ from those used in simulation, the k-prototypes method improves the estimates of the MI methods.
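As a rough illustration of the mixed-type distance underlying the k-prototypes method discussed above, here is a minimal numpy sketch on toy data (not the authors' implementation; the gamma weight and farthest-point initialization are illustrative choices):

```python
import numpy as np

def kprototypes(X_num, X_cat, k, gamma=1.0, n_iter=20, seed=0):
    """Minimal k-prototypes sketch: squared Euclidean cost on numeric
    columns plus a gamma-weighted mismatch cost on categorical columns."""
    rng = np.random.default_rng(seed)
    n = X_num.shape[0]
    # farthest-point initialization on the numeric part
    idx = [int(rng.integers(n))]
    for _ in range(k - 1):
        d = ((X_num[:, None, :] - X_num[idx][None, :, :]) ** 2).sum(-1).min(1)
        idx.append(int(np.argmax(d)))
    cent_num, cent_cat = X_num[idx].copy(), X_cat[idx].copy()
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # assignment step: mixed-type distance to each prototype
        d_num = ((X_num[:, None, :] - cent_num[None, :, :]) ** 2).sum(-1)
        d_cat = (X_cat[:, None, :] != cent_cat[None, :, :]).sum(-1)
        labels = np.argmin(d_num + gamma * d_cat, axis=1)
        # update step: mean for numeric columns, mode for categorical ones
        for j in range(k):
            m = labels == j
            if m.any():
                cent_num[j] = X_num[m].mean(0)
                for c in range(X_cat.shape[1]):
                    vals, cnt = np.unique(X_cat[m, c], return_counts=True)
                    cent_cat[j, c] = vals[np.argmax(cnt)]
    return labels

# toy mixed data: two groups differing in numeric level and in category
rng = np.random.default_rng(1)
X_num = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
X_cat = np.array([[0]] * 30 + [[1]] * 30)
labels = kprototypes(X_num, X_cat, k=2)
```

In the imputation setting described above, such cluster labels would stratify the multiple-imputation model; here they only demonstrate the mixed-type assignment step.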
Keywords
co-missing
heterogeneous
multiple imputations
machine learning
This paper studies a factor modeling-based approach for clustering high-dimensional data. Statistical modeling with correlated structures pervades modern applications in economics, finance, genomics, wireless sensing, etc. Standard techniques for high-dimensional clustering, e.g., the naive spectral method, often fail to yield good results in highly correlated setups. To address the problem in such scenarios, we propose the Factor Adjusted Spectral Clustering (FASC) algorithm, which uses an additional data denoising step that eliminates the factor component to cope with data dependency. We prove that the FASC algorithm achieves an exponentially low mislabeling rate with respect to the signal-to-noise ratio under general assumptions. Our assumption bridges many classical factor models in the literature, such as the pervasive factor model, the weak factor model, and the sparse factor model. FASC is also efficient, requiring only near-linear sample complexity with respect to the data dimension. We also show the applicability of FASC with real data experiments and numerical studies, and establish that FASC delivers significant improvements in many cases where traditional spectral clustering fails.
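The factor-adjustment idea can be sketched in a few lines of numpy: project out the top singular directions (the estimated factor component) before clustering. This is a toy rendering under an assumed one-factor, two-cluster model, not the FASC algorithm itself:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 200, 50, 1          # samples, dimension, number of factors

# latent cluster labels and two mean vectors along coordinate axis e2
z = np.repeat([0, 1], n // 2)
mu = np.zeros(p); mu[1] = 2.0
signal = np.where(z[:, None] == 0, -mu, mu)

# strong factor along e1 that corrupts naive spectral clustering
b = np.zeros(p); b[0] = 6.0
f = rng.normal(size=(n, 1))
X = f * b + signal + rng.normal(scale=1.0, size=(n, p))

def spectral_two_clusters(M):
    # sign of the top left singular vector splits two balanced clusters
    u = np.linalg.svd(M - M.mean(0), full_matrices=False)[0][:, 0]
    return (u > 0).astype(int)

# naive spectral clustering on raw data (the factor dominates)
naive = spectral_two_clusters(X)

# FASC-style denoising: remove the top-r directions first, then cluster
_, _, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
X_adj = X - (X @ Vt[:r].T) @ Vt[:r]
fasc = spectral_two_clusters(X_adj)

def error(pred, truth):
    # mislabeling rate up to label permutation
    e = np.mean(pred != truth)
    return min(e, 1 - e)
```

On this toy model, the factor-adjusted labels recover the clusters while the naive spectral labels are near-random, mirroring the failure mode the abstract describes.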
Keywords
Dependency modeling
dimensionality reduction
data denoising
mislabeling
Topic modeling offers valuable insights into disease mechanisms, patient subgroups, and personalized treatment strategies. While traditional methods such as Latent Dirichlet Allocation can be time-consuming and sensitive to initialization, Topic-SCORE leverages a stable, interpretable, polynomial-time approach based on singular value decomposition (SVD). However, single-institution EHR datasets often lack sufficient sample size and population diversity, limiting accurate and generalizable topic estimation. To address this challenge, we propose a privacy-preserving federated algorithm that implements Topic-SCORE on heterogeneous datasets across multiple institutions by aggregating summary-level projection matrices of singular vectors. In simulations, the federated approach achieves near-pooled performance and reduces the L1 error relative to the ground-truth topic loadings by a median of 76% (IQR: 73%-79%) compared with single-site models, particularly when topic weight distributions differ substantially across sites. We demonstrate our algorithm on a multi-institutional EHR database to extract meaningful topics for patients from both codified EHR fields and unstructured clinical notes.
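The aggregation step can be sketched as follows: each site shares only the projection matrix of its top-K left singular vectors, and the server averages these summaries and re-extracts a shared singular subspace. This is a toy numpy sketch under an assumed common topic-word matrix, not the full Topic-SCORE pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
p, K = 40, 3                                 # vocabulary size, number of topics
A = np.abs(rng.normal(size=(p, K)))          # shared topic-word matrix
A /= A.sum(0)                                # columns are topic distributions

def site_projection(n_docs):
    # each site shares only the projection of its top-K left singular vectors
    W = rng.dirichlet(np.ones(K), size=n_docs).T         # site-specific topic weights
    D = A @ W + 0.001 * rng.normal(size=(p, n_docs))     # noisy word frequencies
    U = np.linalg.svd(D, full_matrices=False)[0][:, :K]
    return U @ U.T                                       # p x p summary matrix

# server: average the summary projections and re-extract a shared basis
P = sum(site_projection(n) for n in [100, 150, 200]) / 3
evals, evecs = np.linalg.eigh(P)
U_fed = evecs[:, -K:]                        # federated singular subspace

# subspace distance between the federated basis and the true span of A
Q = np.linalg.qr(A)[0]
gap = np.linalg.norm(U_fed @ U_fed.T - Q @ Q.T)
```

Because only p x p projection matrices leave each site, no document-level data is exchanged, which is the privacy-preserving point of the design.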
Keywords
Federated Learning
Topic Modeling
Electronic Health Records
Finite mixture models have been used to cluster data into groups based on their statistical distributions. In flow cytometry, for example, we applied a finite mixture model with a multivariate log-normal distribution and multivariate normal distributions to identify the cell populations in a mixture of pollen. An expectation-maximization (EM) algorithm is used to approximate the parameters by maximum likelihood estimation. Because maximum likelihood estimation is used in the M step, we also apply other optimization methods, such as gradient descent, stochastic gradient descent, and Newton-Raphson, to estimate the parameters of the finite mixture models.
For comparison, we simulated a data set with three clusters. Samples in the first cluster follow a multivariate log-normal distribution, while samples in the other clusters follow multivariate normal distributions with different means. Processing time, accuracy, bias, and MSE will be reported to compare the performance of these optimization methods.
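A minimal 1-D EM sketch with closed-form M-step updates, using two normal components as a stand-in for the multivariate log-normal/normal mixture in the abstract; the gradient descent, stochastic gradient descent, and Newton-Raphson variants compared in the work would replace the closed-form M-step lines:

```python
import numpy as np

rng = np.random.default_rng(0)
# 1-D toy mixture: two well-separated normal components
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 1.0, 300)])

def em_two_normals(x, n_iter=100):
    # crude initialization from the data quantiles
    mu = np.array([np.quantile(x, 0.25), np.quantile(x, 0.75)])
    sd = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E step: posterior responsibilities of each component
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
        r = dens / dens.sum(1, keepdims=True)
        # M step: closed-form maximum likelihood updates
        nk = r.sum(0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(0) / nk
        sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(0) / nk)
    return pi, mu, sd

pi, mu, sd = em_two_normals(x)
```

Replacing the three M-step assignment lines with an iterative optimizer over the same weighted log-likelihood is exactly the comparison the abstract sets up.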
Keywords
Finite mixture models
EM algorithm
Optimization
Graph Neural Networks (GNNs) have demonstrated an exceptional ability to model relationships within graph data, achieving remarkable results on tasks such as node clustering, node classification, and link prediction. However, most existing approaches rely on arbitrary or simplistic node embedding initialization, which can yield slow convergence and degrade performance. To address these challenges, we introduce a GEE-driven GNN (GG), which employs One-Hot Graph Encoder Embedding (GEE) to provide structured and expressive initialization for the GNN. We evaluate GG on node clustering tasks; simulations show that it converges faster and achieves superior results on certain graphs. Moreover, experiments on real-world datasets further demonstrate GG's strong performance, highlighting its potential as a powerful tool for diverse graph-related applications.
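The One-Hot Graph Encoder Embedding at the core of GG can be sketched in a few lines: a class-size-normalized one-hot label matrix multiplied by the adjacency matrix. The toy two-block SBM below is illustrative; in GG the resulting embedding would initialize the GNN's node features:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 120, 2
y = np.repeat(np.arange(K), n // K)

# two-block stochastic block model: dense within blocks, sparse between
P = np.where(y[:, None] == y[None, :], 0.3, 0.05)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1); A = A + A.T               # symmetric, no self-loops

def gee_embedding(A, y, K):
    # one-hot label matrix scaled by class size, then one adjacency product
    W = np.zeros((len(y), K))
    W[np.arange(len(y)), y] = 1.0
    W /= W.sum(0)
    return A @ W                             # n x K embedding

Z = gee_embedding(A, y, K)
```

Each row of Z records a node's average connectivity to each class, so nodes from the same block land near each other, which is what makes the embedding a structured initialization.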
Keywords
Graph Neural Networks
Graph Embedding
One-Hot Encoding
Node Features
Node Clustering
Random Forests can be used for both classification and clustering. An unsupervised Random Forest estimates the proximity matrix needed for clustering. Clustering algorithms use data to form groups of similar subjects that share distinct properties. Phenotypes can be identified from the proximity matrix generated by an unsupervised Random Forest, with subsequent clustering by the Partitioning Around Medoids (PAM) algorithm. PAM uses the dissimilarity matrix in its partitioning and is more robust to noise and outliers than the more commonly used k-means algorithm.
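The clustering step above can be sketched as a minimal k-medoids (PAM-style) procedure operating directly on a precomputed dissimilarity matrix; the Euclidean matrix in the demo stands in for one minus the Random Forest proximity matrix (an illustrative sketch, not the study's pipeline):

```python
import numpy as np

def pam(D, k, n_iter=50):
    """Minimal PAM (k-medoids) sketch on a precomputed dissimilarity matrix D."""
    n = D.shape[0]
    # BUILD-style init: first medoid minimizes total dissimilarity, rest greedy
    medoids = [int(np.argmin(D.sum(1)))]
    while len(medoids) < k:
        d_near = D[:, medoids].min(1)
        gains = np.maximum(d_near[None, :] - D, 0).sum(1)
        gains[medoids] = -1.0
        medoids.append(int(np.argmax(gains)))
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        changed = False
        # SWAP-style step: move each medoid to the in-cluster point of minimal cost
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            best = int(members[np.argmin(D[np.ix_(members, members)].sum(0))])
            if best != medoids[j]:
                medoids[j] = best
                changed = True
        if not changed:
            break
    return np.argmin(D[:, medoids], axis=1), medoids

# toy dissimilarities from two well-separated groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (25, 2)), rng.normal(4, 0.5, (25, 2))])
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
labels, medoids = pam(D, 2)
```

Because PAM needs only the dissimilarity matrix, swapping in 1 − proximity from an unsupervised Random Forest requires no change to the clustering code, which is the design point the abstract relies on.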
Headache is the most common type of pain resulting from mild traumatic brain injury. Roughly half of those with persistent post-traumatic headache (PPTH) also report neck pain associated with greater headache severity. Identification of biologically based phenotypes could improve our mechanistic understanding and management of PPTH with concomitant neck pain. The purpose of this study was to identify PPTH subgroups who share common biological impairments in cervical muscle health, pain sensitivity, and/or functional connectivity of brain networks inv
Keywords
magnetic resonance imaging (MRI)
neck pain
Topological data analysis (TDA) is a powerful tool for detecting hidden structures in complex data such as biological signals and networks. A key TDA algorithm, persistent homology (PH), captures multi-scale topological features that are robust to noise, summarized by persistence diagrams (PDs). However, the non-Euclidean nature of PDs complicates traditional analysis. Recent topological inference methods use a heat kernel (HK) expansion of PDs in multi-group permutation tests. Extending these methods, we develop a topological clustering framework based on the HK expansion of PDs. This flexible framework allows Euclidean covariates to be incorporated into topological clustering, and includes an automated data-driven procedure for selecting the optimal number of topological clusters and the covariates most significantly linked to them. We demonstrate our method's effectiveness in detecting clusters with varying degrees of topological dissimilarity through simulations and applications to brain signals and networks.
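The embedding idea can be illustrated by smoothing each persistence diagram's points with a Gaussian kernel on a grid and comparing the resulting Euclidean vectors. This is a toy sketch: the HK expansion in the work is a basis expansion rather than a grid evaluation, and the diagrams, grid, and bandwidth below are illustrative:

```python
import numpy as np

def hk_vectorize(pd_points, grid, sigma=0.3):
    """Gaussian (heat-kernel-style) smoothing of a persistence diagram on a grid."""
    gx, gy = np.meshgrid(grid, grid)
    g = np.zeros(gx.size)
    for b, d in pd_points:
        g += np.exp(-((gx.ravel() - b) ** 2 + (gy.ravel() - d) ** 2) / (2 * sigma ** 2))
    return g

rng = np.random.default_rng(0)
grid = np.linspace(0, 3, 20)
# two toy groups of diagrams: one prominent loop vs. two loops, with jitter
pds_A = [[(0.5 + rng.normal(0, 0.02), 2.5 + rng.normal(0, 0.02))]
         for _ in range(5)]
pds_B = [[(0.5 + rng.normal(0, 0.02), 1.2 + rng.normal(0, 0.02)),
          (1.5 + rng.normal(0, 0.02), 2.8 + rng.normal(0, 0.02))]
         for _ in range(5)]
V = np.array([hk_vectorize(p, grid) for p in pds_A + pds_B])

# pairwise distances in the embedded (Euclidean) space
Dist = np.linalg.norm(V[:, None] - V[None, :], axis=-1)
```

Once the diagrams live in a Euclidean space, standard clustering machinery (and Euclidean covariates) can be applied directly, which is the practical payoff of the HK-style representation.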
Keywords
Topological data analysis
Topological clustering
Heat kernel expansion
Co-Author(s)
Jian Yin, City University of Hong Kong
Parth Desai, University of California, Berkeley
Rahul Ghosal, Arnold School of Public Health, University of South Carolina
Yuan Wang, Arnold School of Public Health, University of South Carolina
First Author
Jiaying Yi, University of South Carolina
Presenting Author
Jiaying Yi, University of South Carolina