Federated Algorithm for SVD-based Topic Modeling to Analyze Electronic Health Records Data

Anand Viswanathan Co-Author
Massachusetts General Hospital
 
Christopher Anderson Co-Author
Brigham and Women's Hospital
 
Rui Duan Co-Author
 
Zhiyu Yan First Author
Harvard T. H. Chan School of Public Health
 
Zhiyu Yan Presenting Author
Harvard T. H. Chan School of Public Health
 
Monday, Aug 4: 11:05 AM - 11:20 AM
1612 
Contributed Papers 
Music City Center 
Topic modeling offers valuable insights into disease mechanisms, patient subgroups, and personalized treatment strategies. Traditional methods such as Latent Dirichlet Allocation can be time-consuming and sensitive to initialization, whereas Topic-SCORE takes a stable, interpretable, polynomial-time approach based on singular value decomposition (SVD). However, single-institution electronic health record (EHR) datasets often lack sufficient sample size and population diversity, which limits accurate and generalizable topic estimation. To address this challenge, we propose a privacy-preserving federated algorithm that implements Topic-SCORE on heterogeneous datasets across multiple institutions by aggregating summary-level projection matrices of singular vectors. In simulations, the federated approach achieves near-pooled performance and reduces the L1 error relative to ground-truth topic loadings by a median of 76% (IQR: 73% - 79%) compared with single-site models, particularly when topic weight distributions differ substantially across sites. We demonstrate our algorithm on a multi-institutional EHR database to extract meaningful topics for patients from both codified EHR fields and unstructured clinical notes.
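
As a rough illustration of what "aggregating summary-level projection matrices of singular vectors" could look like, the sketch below has each site compute the top-K left singular vectors of its own word-by-document matrix and share only the resulting projection matrix, which a central server then averages and re-decomposes. The abstract does not spell out the aggregation rule, so the sample-size weighting, the eigendecomposition of the averaged projection, the function names, and the toy dimensions are all assumptions made for illustration. The later Topic-SCORE steps that turn singular vectors into topic loadings are omitted.

```python
import numpy as np


def local_projection(D, K):
    """Site-side step: p x p projection onto the top-K left singular vectors of D."""
    U, _, _ = np.linalg.svd(D, full_matrices=False)
    U_K = U[:, :K]
    return U_K @ U_K.T


def aggregate_projections(projections, weights, K):
    """Server-side step: weighted average of projections, then its top-K eigenvectors.

    Weighting by site sample size is an assumption, not taken from the abstract.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    P_bar = sum(w_k * P_k for w_k, P_k in zip(w, projections))
    _, eigvecs = np.linalg.eigh(P_bar)  # eigenvalues come back in ascending order
    return eigvecs[:, -K:]


def subspace_distance(U, V):
    """Frobenius distance between the projections onto span(U) and span(V)."""
    return np.linalg.norm(U @ U.T - V @ V.T)


# Toy, noise-free data with hypothetical sizes: p words, K topics, three sites.
rng = np.random.default_rng(0)
p, K = 30, 3
A = rng.dirichlet(np.ones(p), size=K).T  # p x K topic loadings, columns sum to 1

# Sites differ in their topic weight distributions (heterogeneous populations).
site_alphas = [(5.0, 1.0, 1.0), (1.0, 5.0, 1.0), (1.0, 1.0, 5.0)]
site_sizes = [40, 60, 80]
site_data = [A @ rng.dirichlet(alpha, size=n).T
             for alpha, n in zip(site_alphas, site_sizes)]

# Each site shares only its projection matrix; document-level data stay local.
projections = [local_projection(D, K) for D in site_data]
U_fed = aggregate_projections(projections, site_sizes, K)

# Pooled benchmark: SVD on all documents stacked together.
U_pooled = np.linalg.svd(np.hstack(site_data), full_matrices=False)[0][:, :K]
gap = subspace_distance(U_fed, U_pooled)
```

Because the toy matrices are noise-free, every site spans the same K-dimensional column space and `gap` should sit at floating-point level. With sampled word counts the federated and pooled subspaces would agree only approximately, which is the regime the simulations in the abstract evaluate.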

Keywords

Federated Learning

Topic Modeling

Electronic Health Records 

Main Sponsor

Section on Statistical Learning and Data Science