Monday, Aug 4: 10:30 AM - 12:20 PM
4047
Contributed Papers
Music City Center
Room: CC-102B
Main Sponsor
Section on Statistical Learning and Data Science
Presentations
Ordinary imputation methods may not be able to handle heterogeneous co-missing data, such as the lung function measures from spirometry tests in population-based studies. This work reviews and evaluates various statistical and machine learning imputation methods for estimating the prevalence of impaired lung function, such as chronic obstructive pulmonary disease (COPD), using data from public surveys in aging studies. Unsupervised learning (clustering) methods improve multiple imputation. The k-prototypes method outperforms DBSCAN because it handles categorical data more effectively. Direct imputation based on the predicted values of random forests and artificial neural networks is unsatisfactory. When combined with multiple imputation, k-prototypes clustering appears to be the most suitable method for imputing missing spirometry values. Even when the imputation functions differ from those used in simulation, the k-prototypes method improves the estimates of the MI methods.
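As a rough illustration of the mixed-type distance underlying the k-prototypes method discussed above, here is a minimal numpy sketch on toy data (not the authors' implementation; the gamma weight and farthest-point initialization are illustrative choices):

```python
import numpy as np

def kprototypes(X_num, X_cat, k, gamma=1.0, n_iter=20, seed=0):
    """Minimal k-prototypes sketch: squared Euclidean cost on numeric
    columns plus a gamma-weighted mismatch cost on categorical columns."""
    rng = np.random.default_rng(seed)
    n = X_num.shape[0]
    # farthest-point initialization on the numeric part
    idx = [int(rng.integers(n))]
    for _ in range(k - 1):
        d = ((X_num[:, None, :] - X_num[idx][None, :, :]) ** 2).sum(-1).min(1)
        idx.append(int(np.argmax(d)))
    cent_num, cent_cat = X_num[idx].copy(), X_cat[idx].copy()
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # assignment step: mixed-type distance to each prototype
        d_num = ((X_num[:, None, :] - cent_num[None, :, :]) ** 2).sum(-1)
        d_cat = (X_cat[:, None, :] != cent_cat[None, :, :]).sum(-1)
        labels = np.argmin(d_num + gamma * d_cat, axis=1)
        # update step: mean for numeric columns, mode for categorical ones
        for j in range(k):
            m = labels == j
            if m.any():
                cent_num[j] = X_num[m].mean(0)
                for c in range(X_cat.shape[1]):
                    vals, cnt = np.unique(X_cat[m, c], return_counts=True)
                    cent_cat[j, c] = vals[np.argmax(cnt)]
    return labels

# toy mixed data: two groups differing in numeric level and in category
rng = np.random.default_rng(1)
X_num = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
X_cat = np.array([[0]] * 30 + [[1]] * 30)
labels = kprototypes(X_num, X_cat, k=2)
```

In the imputation setting described above, such cluster labels would stratify the multiple-imputation model; here they only demonstrate the mixed-type assignment step.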
Keywords
co-missing
heterogeneous
multiple imputations
machine learning
This paper studies a factor modeling-based approach for clustering high-dimensional data. Statistical modeling with correlated structures pervades modern applications in economics, finance, genomics, wireless sensing, etc. Standard techniques for high-dimensional clustering, e.g., the naive spectral method, often fail to yield good results in highly correlated setups. To address the problem in such scenarios, we propose the Factor Adjusted Spectral Clustering (FASC) algorithm, which uses an additional data denoising step that eliminates the factor component to cope with data dependency. We prove that the FASC algorithm achieves an exponentially low mislabeling rate with respect to the signal-to-noise ratio under general assumptions. Our assumption bridges many classical factor models in the literature, such as the pervasive factor model, the weak factor model, and the sparse factor model. FASC is also efficient, requiring only near-linear sample complexity with respect to the data dimension. We also show the applicability of FASC with real data experiments and numerical studies, and establish that FASC delivers significant improvements in many cases where traditional spectral clustering fails.
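The factor-adjustment idea can be sketched in a few lines of numpy: project out the top singular directions (the estimated factor component) before clustering. This is a toy rendering under an assumed one-factor, two-cluster model, not the FASC algorithm itself:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 200, 50, 1          # samples, dimension, number of factors

# latent cluster labels and two mean vectors along coordinate axis e2
z = np.repeat([0, 1], n // 2)
mu = np.zeros(p); mu[1] = 2.0
signal = np.where(z[:, None] == 0, -mu, mu)

# strong factor along e1 that corrupts naive spectral clustering
b = np.zeros(p); b[0] = 6.0
f = rng.normal(size=(n, 1))
X = f * b + signal + rng.normal(scale=1.0, size=(n, p))

def spectral_two_clusters(M):
    # sign of the top left singular vector splits two balanced clusters
    u = np.linalg.svd(M - M.mean(0), full_matrices=False)[0][:, 0]
    return (u > 0).astype(int)

# naive spectral clustering on raw data (the factor dominates)
naive = spectral_two_clusters(X)

# FASC-style denoising: remove the top-r directions first, then cluster
_, _, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
X_adj = X - (X @ Vt[:r].T) @ Vt[:r]
fasc = spectral_two_clusters(X_adj)

def error(pred, truth):
    # mislabeling rate up to label permutation
    e = np.mean(pred != truth)
    return min(e, 1 - e)
```

On this toy model, the factor-adjusted labels recover the clusters while the naive spectral labels are near-random, mirroring the failure mode the abstract describes.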
Keywords
Dependency modeling
dimensionality reduction
data denoising
mislabeling
Topic modeling offers valuable insights into disease mechanisms, patient subgroups, and personalized treatment strategies. While traditional methods such as Latent Dirichlet Allocation can be time-consuming and sensitive to initialization, Topic-SCORE leverages a stable, interpretable, polynomial-time approach based on singular value decomposition (SVD). However, single-institution EHR datasets often lack sufficient sample size and population diversity, limiting accurate and generalizable topic estimation. To address this challenge, we propose a privacy-preserving federated algorithm that implements Topic-SCORE on heterogeneous datasets across multiple institutions by aggregating summary-level projection matrices of singular vectors. In simulations, the federated approach achieves near-pooled performance and reduces the L1 error relative to the ground-truth topic loadings by a median of 76% (IQR: 73%-79%) compared with single-site models, particularly when topic weight distributions differ substantially across sites. We demonstrate our algorithm on a multi-institutional EHR database to extract meaningful topics for patients from both codified EHR fields and unstructured clinical notes.
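The aggregation step can be sketched as follows: each site shares only the projection matrix of its top-K left singular vectors, and the server averages these summaries and re-extracts a shared singular subspace. This is a toy numpy sketch under an assumed common topic-word matrix, not the full Topic-SCORE pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
p, K = 40, 3                                 # vocabulary size, number of topics
A = np.abs(rng.normal(size=(p, K)))          # shared topic-word matrix
A /= A.sum(0)                                # columns are topic distributions

def site_projection(n_docs):
    # each site shares only the projection of its top-K left singular vectors
    W = rng.dirichlet(np.ones(K), size=n_docs).T         # site-specific topic weights
    D = A @ W + 0.001 * rng.normal(size=(p, n_docs))     # noisy word frequencies
    U = np.linalg.svd(D, full_matrices=False)[0][:, :K]
    return U @ U.T                                       # p x p summary matrix

# server: average the summary projections and re-extract a shared basis
P = sum(site_projection(n) for n in [100, 150, 200]) / 3
evals, evecs = np.linalg.eigh(P)
U_fed = evecs[:, -K:]                        # federated singular subspace

# subspace distance between the federated basis and the true span of A
Q = np.linalg.qr(A)[0]
gap = np.linalg.norm(U_fed @ U_fed.T - Q @ Q.T)
```

Because only p x p projection matrices leave each site, no document-level data is exchanged, which is the privacy-preserving point of the design.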
Keywords
Federated Learning
Topic Modeling
Electronic Health Records
Finite mixture models have been used to cluster data into groups based on their statistical distributions. In flow cytometry, for example, we applied a finite mixture model with a multivariate log-normal distribution and multivariate normal distributions to identify the cell populations in a mixture of pollen. An expectation-maximization (EM) algorithm is used to approximate the parameters by maximum likelihood estimation. Because maximum likelihood estimation is used in the M step, we also apply other optimization methods, such as gradient descent, stochastic gradient descent, and Newton-Raphson, to estimate the parameters of the finite mixture models.
For comparison, we simulated a data set with three clusters. Samples in the first cluster follow a multivariate log-normal distribution, while samples in the other clusters follow multivariate normal distributions with different means. Processing time, accuracy, bias, and MSE will be reported to compare the performance of these optimization methods.
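A minimal 1-D EM sketch with closed-form M-step updates, using two normal components as a stand-in for the multivariate log-normal/normal mixture in the abstract; the gradient descent, stochastic gradient descent, and Newton-Raphson variants compared in the work would replace the closed-form M-step lines:

```python
import numpy as np

rng = np.random.default_rng(0)
# 1-D toy mixture: two well-separated normal components
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 1.0, 300)])

def em_two_normals(x, n_iter=100):
    # crude initialization from the data quantiles
    mu = np.array([np.quantile(x, 0.25), np.quantile(x, 0.75)])
    sd = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E step: posterior responsibilities of each component
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
        r = dens / dens.sum(1, keepdims=True)
        # M step: closed-form maximum likelihood updates
        nk = r.sum(0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(0) / nk
        sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(0) / nk)
    return pi, mu, sd

pi, mu, sd = em_two_normals(x)
```

Replacing the three M-step assignment lines with an iterative optimizer over the same weighted log-likelihood is exactly the comparison the abstract sets up.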
Keywords
Finite mixture models
EM algorithm
Optimization
Graph Neural Networks (GNNs) have demonstrated an exceptional ability to model relationships within graph data, achieving remarkable results on tasks such as node clustering, node classification, and link prediction. However, most existing approaches rely on arbitrary or simplistic node embedding initialization, which can yield slow convergence and degrade performance. To address these challenges, we introduce a GEE-driven GNN (GG), which employs One-Hot Graph Encoder Embedding (GEE) to provide structured and expressive initialization for the GNN. We evaluate GG on node clustering tasks; simulations show that it converges faster and achieves superior results on certain graphs. Moreover, experiments on real-world datasets further demonstrate GG's strong performance, highlighting its potential as a powerful tool for diverse graph-related applications.
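The One-Hot Graph Encoder Embedding at the core of GG can be sketched in a few lines: a class-size-normalized one-hot label matrix multiplied by the adjacency matrix. The toy two-block SBM below is illustrative; in GG the resulting embedding would initialize the GNN's node features:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 120, 2
y = np.repeat(np.arange(K), n // K)

# two-block stochastic block model: dense within blocks, sparse between
P = np.where(y[:, None] == y[None, :], 0.3, 0.05)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1); A = A + A.T               # symmetric, no self-loops

def gee_embedding(A, y, K):
    # one-hot label matrix scaled by class size, then one adjacency product
    W = np.zeros((len(y), K))
    W[np.arange(len(y)), y] = 1.0
    W /= W.sum(0)
    return A @ W                             # n x K embedding

Z = gee_embedding(A, y, K)
```

Each row of Z records a node's average connectivity to each class, so nodes from the same block land near each other, which is what makes the embedding a structured initialization.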
Keywords
Graph Neural Networks
Graph Embedding
One-Hot Encoding
Node Features
Node Clustering
Random Forests can be used for both classification and clustering. An unsupervised Random Forest estimates the proximity matrix needed for clustering. Clustering algorithms use data to form groups of similar subjects that share distinct properties. Phenotypes can be identified from the proximity matrix generated by an unsupervised Random Forest, with subsequent clustering by the Partitioning Around Medoids (PAM) algorithm. PAM uses the dissimilarity matrix in its partitioning and is more robust to noise and outliers than the more commonly used k-means algorithm.
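The clustering step above can be sketched as a minimal k-medoids (PAM-style) procedure operating directly on a precomputed dissimilarity matrix; the Euclidean matrix in the demo stands in for one minus the Random Forest proximity matrix (an illustrative sketch, not the study's pipeline):

```python
import numpy as np

def pam(D, k, n_iter=50):
    """Minimal PAM (k-medoids) sketch on a precomputed dissimilarity matrix D."""
    n = D.shape[0]
    # BUILD-style init: first medoid minimizes total dissimilarity, rest greedy
    medoids = [int(np.argmin(D.sum(1)))]
    while len(medoids) < k:
        d_near = D[:, medoids].min(1)
        gains = np.maximum(d_near[None, :] - D, 0).sum(1)
        gains[medoids] = -1.0
        medoids.append(int(np.argmax(gains)))
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        changed = False
        # SWAP-style step: move each medoid to the in-cluster point of minimal cost
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            best = int(members[np.argmin(D[np.ix_(members, members)].sum(0))])
            if best != medoids[j]:
                medoids[j] = best
                changed = True
        if not changed:
            break
    return np.argmin(D[:, medoids], axis=1), medoids

# toy dissimilarities from two well-separated groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (25, 2)), rng.normal(4, 0.5, (25, 2))])
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
labels, medoids = pam(D, 2)
```

Because PAM needs only the dissimilarity matrix, swapping in 1 − proximity from an unsupervised Random Forest requires no change to the clustering code, which is the design point the abstract relies on.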
Headache is the most common type of pain resulting from mild traumatic brain injury. Roughly half of those with persistent post-traumatic headache (PPTH) also report neck pain associated with greater headache severity. Identification of biologically based phenotypes could improve our mechanistic understanding and management of PPTH with concomitant neck pain. The purpose of this study was to identify PPTH subgroups who share common biological impairments in cervical muscle health, pain sensitivity, and/or functional connectivity of brain networks inv
Keywords
magnetic resonance imaging (MRI)
neck pain
Topological data analysis (TDA) is a powerful tool for detecting hidden structures in complex data such as biological signals and networks. A key TDA algorithm, persistent homology (PH), captures multi-scale topological features that are robust to noise, summarized by persistence diagrams (PDs). However, the non-Euclidean nature of PDs complicates traditional analysis. Recent topological inference methods use a heat kernel (HK) expansion of PDs in multi-group permutation tests. Extending these methods, we develop a topological clustering framework based on the HK expansion of PDs. This flexible framework allows Euclidean covariates to be incorporated into topological clustering, and includes an automated data-driven procedure for selecting the optimal number of topological clusters and the covariates most significantly linked to them. We demonstrate our method's effectiveness in detecting clusters with varying degrees of topological dissimilarity through simulations and applications to brain signals and networks.
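The embedding idea can be illustrated by smoothing each persistence diagram's points with a Gaussian kernel on a grid and comparing the resulting Euclidean vectors. This is a toy sketch: the HK expansion in the work is a basis expansion rather than a grid evaluation, and the diagrams, grid, and bandwidth below are illustrative:

```python
import numpy as np

def hk_vectorize(pd_points, grid, sigma=0.3):
    """Gaussian (heat-kernel-style) smoothing of a persistence diagram on a grid."""
    gx, gy = np.meshgrid(grid, grid)
    g = np.zeros(gx.size)
    for b, d in pd_points:
        g += np.exp(-((gx.ravel() - b) ** 2 + (gy.ravel() - d) ** 2) / (2 * sigma ** 2))
    return g

rng = np.random.default_rng(0)
grid = np.linspace(0, 3, 20)
# two toy groups of diagrams: one prominent loop vs. two loops, with jitter
pds_A = [[(0.5 + rng.normal(0, 0.02), 2.5 + rng.normal(0, 0.02))]
         for _ in range(5)]
pds_B = [[(0.5 + rng.normal(0, 0.02), 1.2 + rng.normal(0, 0.02)),
          (1.5 + rng.normal(0, 0.02), 2.8 + rng.normal(0, 0.02))]
         for _ in range(5)]
V = np.array([hk_vectorize(p, grid) for p in pds_A + pds_B])

# pairwise distances in the embedded (Euclidean) space
Dist = np.linalg.norm(V[:, None] - V[None, :], axis=-1)
```

Once the diagrams live in a Euclidean space, standard clustering machinery (and Euclidean covariates) can be applied directly, which is the practical payoff of the HK-style representation.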
Keywords
Topological data analysis
Topological clustering
Heat kernel expansion
Co-Author(s)
Jian Yin, City University of Hong Kong
Parth Desai, University of California, Berkeley
Rahul Ghosal, Arnold School of Public Health, University of South Carolina
Yuan Wang, Arnold School of Public Health, University of South Carolina
First Author
Jiaying Yi, University of South Carolina
Presenting Author
Jiaying Yi, University of South Carolina