Clustering-Informed Shared-Structure Variational Autoencoder for Missing Data Imputation

Kenneth Seier Co-Author
Memorial Sloan Kettering Cancer Center
 
Katherine Panageas Co-Author
Memorial Sloan Kettering Cancer Center
 
Mithat Gonen Co-Author
Memorial Sloan Kettering Cancer Center
 
Yuan Chen Co-Author
Memorial Sloan Kettering Cancer Center
 
Yasin Khadem Charvadeh First Author
Memorial Sloan Kettering Cancer Center
 
Yasin Khadem Charvadeh Presenting Author
Memorial Sloan Kettering Cancer Center
 
Sunday, Aug 3: 3:05 PM - 3:20 PM
0943 
Contributed Papers 
Music City Center 
Despite advances in healthcare data management, missing data in electronic health records (EHR) and patient-reported health data remain a challenge and compromise the usability of these data in healthcare analytics. Conventional imputation methods have difficulty capturing complex non-linear relationships, require long computation times, and handle only a limited range of missing data mechanisms. To address these limitations, we propose the clustering-informed shared-structure variational autoencoder (CISS-VAE), which builds on generative Bayesian neural networks. The model captures complex associations and accommodates various missing data mechanisms, including missing not at random (MNAR). We also develop iterative learning algorithms that further improve imputation accuracy while preventing overfitting. In comprehensive simulations, our model achieves higher imputation accuracy than traditional and contemporary methods. We apply the method to EHR data from early-stage breast cancer patients at Memorial Sloan Kettering Cancer Center, aiming to mitigate the impact of missing data and improve health monitoring and analyses.
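
For readers less familiar with the missing data mechanisms named in the abstract, the following is a minimal illustrative sketch of how missing completely at random (MCAR), missing at random (MAR), and MNAR differ. It is not the authors' simulation design or any part of CISS-VAE; the function names, sample size, and coefficients are hypothetical choices made only for illustration.

```python
import math
import random


def simulate_missingness(n=2000, seed=0):
    """Simulate one covariate x, one outcome y, and three missingness masks for y.

    All numeric settings here are illustrative assumptions, not values from the abstract.
    A mask entry of True means the corresponding y value is missing.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]       # always-observed covariate
    y = [0.8 * xi + rng.gauss(0.0, 0.6) for xi in x]  # variable subject to missingness

    def sigmoid(t):
        return 1.0 / (1.0 + math.exp(-t))

    # MCAR: the probability of missingness is constant.
    mcar = [rng.random() < 0.3 for _ in range(n)]
    # MAR: the probability depends only on the observed covariate x.
    mar = [rng.random() < sigmoid(2.0 * xi - 1.0) for xi in x]
    # MNAR: the probability depends on the (possibly unobserved) value of y itself.
    mnar = [rng.random() < sigmoid(2.0 * yi - 1.0) for yi in y]
    return x, y, {"MCAR": mcar, "MAR": mar, "MNAR": mnar}


def observed_mean(values, mask):
    """Mean of the values whose mask entry is False (i.e., observed)."""
    kept = [v for v, missing in zip(values, mask) if not missing]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    x, y, masks = simulate_missingness()
    print("full-data mean of y:", sum(y) / len(y))
    for name, mask in masks.items():
        print(name, "observed mean of y:", observed_mean(y, mask))
```

In this toy setup, larger values of y are more likely to be missing under MNAR, so the mean of the observed values is pulled downward, and the missingness cannot be explained by observed variables alone. This is the setting that makes MNAR harder for imputation methods than MCAR or MAR.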

Keywords

Missing Data Imputation

Variational Autoencoder

Missing Not at Random

Electronic Health Records 

Main Sponsor

Section on Statistics in Epidemiology