Federated Algorithm for SVD-based Topic Modeling to Analyze Electronic Health Records Data
Zhiyu Yan
First Author
Harvard T. H. Chan School of Public Health
Zhiyu Yan
Presenting Author
Harvard T. H. Chan School of Public Health
Monday, Aug 4: 11:05 AM - 11:20 AM
1612
Contributed Papers
Music City Center
Topic modeling offers valuable insights into disease mechanisms, patient subgroups, and personalized treatment strategies. While traditional methods such as Latent Dirichlet Allocation can be time-consuming and sensitive to initialization, Topic-SCORE takes a stable, interpretable, polynomial-time approach based on singular value decomposition (SVD). However, single-institution electronic health record (EHR) datasets often lack sufficient sample size and population diversity, which limits accurate and generalizable topic estimation. To address this challenge, we propose a privacy-preserving federated algorithm that implements Topic-SCORE on heterogeneous datasets across multiple institutions by aggregating summary-level projection matrices of singular vectors. In simulations, the federated approach achieves near-pooled performance and reduces the L1 error of estimated topic loadings, measured against the ground truth, by a median of 76% (IQR: 73% - 79%) compared to single-site models, particularly when topic weight distributions differ substantially across sites. We demonstrate our algorithm on a multi-institutional EHR database to extract meaningful topics for patients from both codified EHR fields and unstructured clinical notes.
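The aggregation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `local_projection` and `aggregate_projections` are hypothetical, the optional weighting of sites (for example by sample size) is an assumption, and the later Topic-SCORE steps that turn singular vectors into topic loadings are omitted. Each site is assumed to hold a p x n word-by-document (or code-by-patient) matrix and to share only the p x p projection matrix onto its top-K left singular vectors.

```python
import numpy as np


def local_projection(D, K):
    """Site-level step (hypothetical helper).

    D is a p x n matrix held at one site. Only the p x p projection
    matrix onto the span of its top-K left singular vectors is returned,
    so no document-level or patient-level columns leave the site.
    """
    U, _, _ = np.linalg.svd(D, full_matrices=False)
    U_k = U[:, :K]          # singular values come back in descending order
    return U_k @ U_k.T


def aggregate_projections(projections, K, weights=None):
    """Server-level step (hypothetical helper).

    Averages the site projection matrices (optionally weighted, which is
    an assumption of this sketch) and returns an orthonormal p x K basis
    for the leading eigenspace of the average.
    """
    P_bar = np.average(np.stack(projections), axis=0, weights=weights)
    P_bar = (P_bar + P_bar.T) / 2          # guard against round-off asymmetry
    _, eigvecs = np.linalg.eigh(P_bar)     # eigenvalues in ascending order
    return eigvecs[:, -K:][:, ::-1]        # top-K eigenvectors, largest first
```

In use, each site would call `local_projection` on its own matrix, and a coordinating server would pass the resulting list to `aggregate_projections`. Averaging projection matrices rather than the singular vectors themselves sidesteps the sign and rotation ambiguity of singular vectors, which is one plausible reason to aggregate at that level.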
Federated Learning
Topic Modeling
Electronic Health Records
Main Sponsor
Section on Statistical Learning and Data Science