62: Weakly Supervised Transformer for Rare Disease Phenotyping

Presented During: SPAAC Poster Competition — Topic Contributed Poster Presentations

Zongxin Yang Co-Author
Harvard Medical School

Mengyan Li Co-Author
Bentley University

Han Tong Co-Author
Columbia University

Alon Geva Co-Author
Boston Children's Hospital

Kenneth Mandl Co-Author
Boston Children's Hospital

Tianxi Cai Co-Author
Harvard University

Kimberly Greco First Author
Harvard University

Kimberly Greco Presenting Author
Harvard University

Monday, Aug 4: 2:00 PM - 3:50 PM
1442
Contributed Posters

Music City Center

Rare diseases affect an estimated 300-400 million people worldwide, yet individual conditions remain poorly characterized and difficult to diagnose due to their low prevalence and limited clinician familiarity. Efforts to automate rare disease detection through computational phenotyping are limited by the scarcity of labeled data and biases in available label sources. Gold-standard labels from registries or expert chart review offer high accuracy but suffer from selection bias and high ascertainment costs, while labels derived from electronic health records (EHRs) capture broader patient populations but introduce noise. To address these challenges, we propose a weakly supervised, transformer-based framework that integrates gold-standard labels with iteratively refined silver-standard labels from EHR data to train a scalable and generalizable phenotyping model. We first learn concept-level embeddings from EHR co-occurrence patterns, which are then refined and aggregated into patient-level representations using a multi-layer transformer. Using rare pulmonary diseases as a case study, we validate our framework on EHR data from Boston Children's Hospital. Our approach improves phenotype classification, uncovers clinically meaningful subphenotypes, and enhances disease progression prediction, enabling more accurate and scalable identification and stratification of rare disease patients.
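The two representation steps named in the abstract (concept-level embeddings from co-occurrence, then aggregation into a patient-level vector) can be illustrated with a small sketch. Everything specific here is an assumption for illustration and not the authors' implementation: the abstract does not say how the embeddings are computed, so the sketch uses positive PMI followed by a truncated SVD; a single untrained self-attention layer with mean pooling stands in for the multi-layer transformer; and the toy patient records, function names, and dimensions are hypothetical. The weak-supervision step with gold- and silver-standard labels is not shown.

```python
import numpy as np


def cooccurrence_matrix(patients, n_concepts):
    """Count how often two distinct concepts appear in the same patient record."""
    counts = np.zeros((n_concepts, n_concepts))
    for codes in patients:
        unique = sorted(set(codes))
        for i in unique:
            for j in unique:
                if i != j:
                    counts[i, j] += 1.0
    return counts


def ppmi_svd_embeddings(counts, dim):
    """Positive PMI followed by truncated SVD (an assumed embedding recipe)."""
    with np.errstate(divide="ignore", invalid="ignore"):
        total = counts.sum()
        row = counts.sum(axis=1, keepdims=True)
        col = counts.sum(axis=0, keepdims=True)
        pmi = np.log(counts * total / (row * col))
        # Keep only finite, positive PMI values; everything else becomes 0.
        ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)
    u, s, _ = np.linalg.svd(ppmi)
    return u[:, :dim] * np.sqrt(s[:dim])


def attention_pool(embeddings, codes, rng):
    """One untrained self-attention layer, then mean pooling to a patient vector."""
    x = embeddings[sorted(set(codes))]  # (n_codes, dim)
    dim = x.shape[1]
    # Random projections: a trained model would learn these weights.
    wq, wk, wv = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))
    scores = (x @ wq) @ (x @ wk).T / np.sqrt(dim)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    refined = x + weights @ (x @ wv)  # residual connection
    return refined.mean(axis=0)


# Hypothetical toy data: each patient is a list of integer concept IDs.
patients = [[0, 1, 2], [0, 1], [2, 3], [3, 4, 2], [0, 1, 4]]
counts = cooccurrence_matrix(patients, n_concepts=5)
concept_emb = ppmi_svd_embeddings(counts, dim=3)
patient_vec = attention_pool(concept_emb, patients[0], np.random.default_rng(0))
```

In a full pipeline the patient vector would feed a classifier trained on the combined gold- and silver-standard labels; here it only shows the shape of the data flow from codes to a fixed-length representation.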

Keywords

Semi-Supervised Learning

Transformers

Phenotyping

Electronic Health Records

Rare Diseases

Machine Learning

Main Sponsor

Section on Statistical Learning and Data Science