Wednesday, Aug 6: 10:30 AM - 12:20 PM
0857
Topic-Contributed Paper Session
Music City Center
Room: CC-104B
Applied: Yes
Main Sponsor
IMS
Co Sponsors
International Chinese Statistical Association
Royal Statistical Society
Presentations
Clustering methods play a crucial role in Electronic Health Records (EHR) research, where patient subpopulations exhibit complex structures. While traditional clustering assumes hard assignments, soft clustering techniques such as Fuzzy C-Means (FCM) allow for probabilistic memberships, capturing inherent uncertainty. However, statistical inference on soft clustering remains an underexplored area.
In this work, we introduce a novel post-clustering inference framework for FCM, enabling hypothesis testing and uncertainty quantification for soft clustering assignments. Specifically, we extend traditional FCM with a weighted formulation in which clusters with high similarity are identified and adjusted accordingly. For instance, when multiple clusters share similar centroids, they can be reweighted to reflect their collective contribution, ensuring that redundant splits do not distort the clustering structure. This weighted formulation acknowledges that some clusters contribute disproportionately, improving interpretability and robustness.
We present theoretical properties, simulation studies, and an application to real-world EHR data, demonstrating how our weighted FCM framework enhances clustering inference. Our approach provides a principled way to conduct hypothesis testing in soft clustering, offering new insights for data-driven decision-making in biomedical and health informatics applications.
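For context, the sketch below implements the standard FCM updates that the abstract's framework extends, assuming Euclidean distances and the usual fuzzifier m; the weighted treatment of similar clusters and the inference procedure are the paper's contribution and are not reproduced here.

```python
# Standard Fuzzy C-Means (the baseline the abstract extends): alternate
# between centroid updates and soft-membership updates.
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(n_clusters), size=X.shape[0])  # soft memberships
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        inv = np.maximum(d2, 1e-12) ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)             # FCM update rule
    return centers, U
```

In the weighted formulation sketched in the abstract, clusters whose rows of `centers` nearly coincide would then be identified and their membership mass in `U` pooled before any downstream hypothesis test.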
Co-Author(s)
Anru Zhang, Duke University
Zihan Zhu, The Wharton School of the University of Pennsylvania
Speaker
Qiuyi Wu, Duke University
Mixed multinomial logits are discrete mixtures introduced several decades ago to model the probability of choosing an attribute from among p possible candidates in heterogeneous populations. The model has recently attracted attention in the AI literature under the name softmax mixtures, where it is routinely used in the final layer of a neural network to map a large number p of vectors in ℝ^L to a probability vector. Despite its wide applicability and empirical success, statistically optimal estimators of the mixture parameters, obtained via algorithms whose running time scales polynomially in L, are not known. This paper provides a solution to this problem for contemporary applications, such as large language models, in which the mixture has a large number p of support points and the size N of the sample observed from the mixture is also large. Our proposed estimator combines two classical estimators, obtained respectively via the method of moments (MoM) and the expectation-maximization (EM) algorithm. Although both types of estimators have been studied theoretically for Gaussian mixtures, no similar results exist for softmax mixtures under either procedure. We develop a new MoM parameter estimator based on latent moment estimation that is tailored to our model, and provide the first theoretical analysis of a MoM-based procedure for softmax mixtures. Although consistent, MoM for softmax mixtures can exhibit poor numerical performance, as observed for other mixture models. Nevertheless, since the MoM estimator provably lands in a neighborhood of the target, it can be used as a warm start for any iterative algorithm. We study the EM algorithm in detail and provide its first theoretical analysis for softmax mixtures. Our final proposal for parameter estimation is the EM algorithm with a MoM warm start.
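To make the model concrete, here is a minimal EM sketch for a softmax mixture, assuming each of the p candidates has a known embedding z_j ∈ ℝ^L, each component k has a parameter β_k ∈ ℝ^L, and the M-step is approximated by a single gradient step; the `betas0` argument plays the role of the MoM warm start, whose latent-moment construction is not reproduced here.

```python
# Minimal EM sketch for a softmax mixture (illustrative; the M-step is
# approximated by one gradient step rather than solved exactly).
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def em_softmax_mixture(y, Z, betas0, weights0, n_iter=200, lr=0.5):
    """y: (N,) observed choices in {0, ..., p-1}; Z: (p, L) candidate
    vectors; betas0: (K, L) initial parameters, e.g. a MoM warm start;
    weights0: (K,) initial mixing weights."""
    betas = betas0.astype(float).copy()
    weights = weights0.astype(float).copy()
    N = len(y)
    for _ in range(n_iter):
        # E-step: responsibility of component k for observation i
        P = np.stack([softmax(Z @ b) for b in betas])   # (K, p)
        lik = weights[:, None] * P[:, y]                # (K, N)
        R = lik / lik.sum(axis=0, keepdims=True)
        # M-step: closed-form mixing weights, one gradient step per beta_k
        weights = R.mean(axis=1)
        for k in range(len(betas)):
            grad = (R[k][:, None] * (Z[y] - P[k] @ Z)).sum(axis=0) / N
            betas[k] += lr * grad
    return betas, weights
```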
We study the performance of the spectral method for the phase synchronization problem with additive Gaussian noise and incomplete data. The spectral method uses the leading eigenvector of the data matrix followed by a normalization step. We prove that it achieves the minimax lower bound of the problem, with a matching leading constant, under a squared ℓ2 loss. This shows that the spectral method has the same performance as more sophisticated procedures, including maximum likelihood estimation, the generalized power method, and semidefinite programming, whenever consistent parameter estimation is possible. To establish our result, we first make a novel choice of the population eigenvector, which enables us to establish exact recovery of the spectral method when there is no additive noise. We then develop a new perturbation analysis toolkit for the leading eigenvector and show that it is well approximated by its first-order approximation with a small ℓ2 error. We further extend our analysis to establish the exact minimax optimality of the spectral method for orthogonal group synchronization.
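The estimator itself is short enough to state in code. Below is a minimal sketch under an illustrative data model, a rank-one phase signal plus Hermitian complex Gaussian noise with each edge observed independently with probability p_obs; the noise scaling and masking conventions here are assumptions, not the paper's exact setup.

```python
# Spectral method for phase synchronization: leading eigenvector of the
# data matrix, then entrywise normalization onto the unit circle.
# The data-generating model below is illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, sigma, p_obs = 300, 0.5, 0.6

z = np.exp(1j * rng.uniform(0, 2 * np.pi, n))        # true phases
mask = np.triu(rng.random((n, n)) < p_obs, 1)
mask = mask | mask.T                                 # symmetric edge set
W = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
W = (W + W.conj().T) / 2                             # Hermitian noise
Y = (np.outer(z, z.conj()) + sigma * W) * mask

u = np.linalg.eigh(Y)[1][:, -1]                      # leading eigenvector
z_hat = u / np.abs(u)                                # normalization step

# squared l2 loss after aligning the unidentifiable global phase
g = z.conj() @ z_hat
loss = np.linalg.norm(z_hat * np.conj(g / abs(g)) - z) ** 2
```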
In this talk, I will introduce a novel bias-corrected joint spectral embedding algorithm for estimating the invariant subspace across heterogeneous multiple networks. The proposed algorithm recursively calibrates the diagonal bias of the sum of squared network adjacency matrices by leveraging a closed-form bias formula, and it iteratively updates the subspace estimator using the most recent bias estimate. Correspondingly, we establish a complete entrywise subspace estimation theory for the proposed algorithm, including a sharp entrywise subspace perturbation bound and an entrywise eigenvector central limit theorem. Leveraging these results, we settle two multiple-network inference problems: exact community detection in multilayer stochastic block models, and hypothesis testing of the equality of membership profiles in multilayer mixed membership models.
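As a rough illustration of the recursion, the sketch below assumes Bernoulli adjacency matrices, so that the diagonal bias of each squared matrix has the closed form Σ_j p_ij(1 − p_ij); this plug-in correction is one reading of the abstract's bias formula, not necessarily the authors' exact recipe.

```python
# Illustrative bias-corrected joint spectral embedding: alternate between
# (i) embedding the debiased sum of squared adjacency matrices and
# (ii) re-estimating the diagonal bias from the current low-rank fit.
import numpy as np

def joint_spectral_embedding(As, d, n_iter=20):
    """As: list of (n, n) symmetric 0/1 adjacency matrices; d: subspace dim."""
    n = As[0].shape[0]
    S = sum(A @ A for A in As)
    bias = np.zeros(n)
    for _ in range(n_iter):
        _, vecs = np.linalg.eigh(S - np.diag(bias))
        V = vecs[:, -d:]                  # current invariant-subspace estimate
        bias = np.zeros(n)
        for A in As:
            P = V @ (V.T @ A @ V) @ V.T   # low-rank estimate of E[A]
            P = np.clip(P, 0.0, 1.0)
            bias += (P * (1.0 - P)).sum(axis=1)   # sum_j p_ij (1 - p_ij)
    return V
```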
In mixture models, nonspherical (anisotropic) noise within each cluster is widely present in real-world data. We study both the minimax rate and optimal statistical procedure for clustering under high-dimensional nonspherical mixture models. In high-dimensional settings, we first establish the information-theoretic limits for clustering under Gaussian mixtures. The minimax lower bound unveils an intriguing informational dimension-reduction phenomenon: there exists a substantial gap between the minimax rate and the oracle clustering risk, with the former determined solely by the projected centers and projected covariance matrices in a low-dimensional space. Motivated by the lower bound, we propose a novel computationally efficient clustering method: Covariance Projected Spectral Clustering (COPO). Its key step is to project the high-dimensional data onto the low-dimensional space spanned by the cluster centers and then use the projected covariance matrices in this space to enhance clustering. We establish tight algorithmic upper bounds for COPO, both for Gaussian noise with flexible covariance and general noise with local dependence. Our theory indicates the minimax-optimality of COPO in the Gaussian case and highlights its adaptivity to a broad spectrum of dependent noise. Extensive simulation studies under various noise structures and real data analysis demonstrate our method's superior performance.
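The two-step structure suggests a short sketch. The version below approximates the span of the cluster centers by the top singular subspace of the centered data and replaces the projected-covariance step with a full-covariance Gaussian mixture fit in that subspace, so it illustrates the idea behind COPO rather than the exact algorithm analyzed in the paper.

```python
# COPO-style two-step clustering (illustrative): project onto a
# low-dimensional subspace approximating the span of the cluster
# centers, then cluster with cluster-specific covariances there.
import numpy as np
from sklearn.mixture import GaussianMixture

def copo_like_clustering(X, k, seed=0):
    """X: (n, p) data matrix; k: number of clusters."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:k].T            # (n, k) projected data
    gm = GaussianMixture(n_components=k, covariance_type="full",
                         random_state=seed).fit(Z)
    return gm.predict(Z)
```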