62: Weakly Supervised Transformer for Rare Disease Phenotyping

Zongxin Yang Co-Author
Harvard Medical School
 
Mengyan Li Co-Author
Bentley University
 
Han Tong Co-Author
Columbia University
 
Alon Geva Co-Author
Boston Children's Hospital
 
Kenneth Mandl Co-Author
Boston Children's Hospital
 
Tianxi Cai Co-Author
Harvard University
 
Kimberly Greco First Author
Harvard University
 
Kimberly Greco Presenting Author
Harvard University
 
Monday, Aug 4: 2:00 PM - 3:50 PM
1442 
Contributed Posters 
Music City Center 
Rare diseases affect an estimated 300-400 million people worldwide, yet individual conditions remain poorly characterized and difficult to diagnose due to their low prevalence and limited clinician familiarity. Efforts to automate rare disease detection through computational phenotyping are limited by the scarcity of labeled data and biases in available label sources. Gold-standard labels from registries or expert chart review offer high accuracy but suffer from selection bias and high ascertainment costs, while labels derived from electronic health records (EHRs) capture broader patient populations but introduce noise. To address these challenges, we propose a weakly supervised, transformer-based framework that integrates gold-standard labels with iteratively refined silver-standard labels from EHR data to train a scalable and generalizable phenotyping model. We first learn concept-level embeddings from EHR co-occurrence patterns, which are then refined and aggregated into patient-level representations using a multi-layer transformer. Using rare pulmonary diseases as a case study, we validate our framework on EHR data from Boston Children's Hospital. Our approach improves phenotype classification, uncovers clinically meaningful subphenotypes, and enhances disease progression prediction, enabling more accurate and scalable identification and stratification of rare disease patients.
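
As a rough illustration of the "EHR co-occurrence patterns" step, the sketch below counts how often pairs of concept codes appear in the same patient record and scores each pair with positive pointwise mutual information (PPMI). This is an assumption about one common way to summarize co-occurrence, not the authors' stated method: the abstract does not specify how the embeddings are learned. The function names, the toy records, and the choice of PPMI are all hypothetical. Factorizing the resulting matrix to obtain embedding vectors, and the transformer that aggregates them into patient-level representations, are not shown.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(patients):
    """Count patients per code and patients per unordered pair of distinct codes."""
    code_counts = Counter()
    pair_counts = Counter()
    for codes in patients:
        unique = sorted(set(codes))  # each code counts once per patient
        code_counts.update(unique)
        pair_counts.update(combinations(unique, 2))  # pairs in sorted order
    return code_counts, pair_counts


def ppmi(code_counts, pair_counts, n_patients):
    """Positive PMI for every pair of codes observed together at least once."""
    scores = {}
    for (a, b), n_ab in pair_counts.items():
        p_ab = n_ab / n_patients
        p_a = code_counts[a] / n_patients
        p_b = code_counts[b] / n_patients
        # Negative associations are clipped to zero (the "positive" in PPMI).
        scores[(a, b)] = max(0.0, math.log(p_ab / (p_a * p_b)))
    return scores


# Toy, made-up records (hypothetical): one list of concept codes per patient.
patients = [
    ["cough", "dyspnea", "ct_chest"],
    ["cough", "dyspnea"],
    ["fracture", "xray_arm"],
    ["fracture", "xray_arm", "cough"],
]

code_counts, pair_counts = cooccurrence_counts(patients)
scores = ppmi(code_counts, pair_counts, len(patients))

for pair, score in sorted(scores.items()):
    print(pair, round(score, 3))
```

In this toy data, codes that tend to appear together (for example "fracture" and "xray_arm") receive a positive score, while pairs that co-occur less often than chance would predict are clipped to zero. A low-rank factorization of such a matrix is one plausible route to concept-level embeddings, under the assumptions stated above.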

Keywords

Semi-Supervised Learning

Transformers

Phenotyping

Electronic Health Records

Rare Diseases

Machine Learning 

Main Sponsor

Section on Statistical Learning and Data Science