Advanced Multiple Imputation for Missing Sociodemographic and Laboratory Characterization Data in Infectious Disease Surveillance

Soyoun Park Speaker
Centers for Disease Control and Prevention
 
Jasmine Varghese Co-Author
Centers for Disease Control and Prevention
 
Namrata Prasad Co-Author
Centers for Disease Control and Prevention
 
Yunmi Chung Co-Author
Centers for Disease Control and Prevention
 
Stephen Mugel Co-Author
Centers for Disease Control and Prevention
 
Miwako Kobayashi Co-Author
Centers for Disease Control and Prevention
 
Melissa Arvay Co-Author
Centers for Disease Control and Prevention
 
Nong Shang Co-Author
Centers for Disease Control and Prevention
 
Monday, Aug 3: 2:35 PM - 2:50 PM
2590 
Contributed Papers 
Thomas M. Menino Convention & Exhibition Center 
CDC's Active Bacterial Core surveillance (ABCs) monitors invasive bacterial diseases in a population of about 45.9 million people across 10 U.S. sites. A persistent challenge in ABCs is missing sociodemographic data (e.g., race) and laboratory characterization data (e.g., bacterial subtypes). When data are not missing at random, stratified disease estimates can be biased. We therefore developed a multi-step multiple imputation approach that uses random forest models to capture complex predictor interactions, mitigate multicollinearity, and account for the hierarchical structure of laboratory characteristics. The approach leverages sociodemographic and clinical characteristics to improve imputation under non-random missingness. We further proposed a decision tree-based framework to characterize the complex missing data mechanisms inherent in ABCs, conducted simulations, and assessed multiple metrics of imputation accuracy. Results demonstrate improved precision and validity of imputed demographic data, leading to more reliable estimates; the race misclassification rate was only 2.4% despite approximately 20% missingness. This framework may be broadly applicable to public health surveillance systems with non-random missing data.
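The simulation-based accuracy assessment mentioned in the abstract can be illustrated with a minimal mask-and-impute sketch. It is not the authors' method. It uses synthetic records, hypothetical category names (`site_a`, `group_1`, and so on), and a simple within-site frequency draw standing in for the random forest models. Missingness here depends only on an observed covariate, which is simpler than the mechanisms the abstract describes. The abstract does not say how the misclassification rate was defined, so this sketch assumes it is the share of all records whose completed value differs from the truth.

```python
import random
from collections import Counter, defaultdict

# Hypothetical category distributions by site; synthetic, not ABCs data.
DISTRIBUTIONS = {
    "site_a": (["group_1", "group_2", "group_3"], [0.7, 0.2, 0.1]),
    "site_b": (["group_1", "group_2", "group_3"], [0.2, 0.6, 0.2]),
}


def simulate_records(n, rng):
    """Generate synthetic records with a 'site' covariate and a 'race' category."""
    records = []
    for _ in range(n):
        site = rng.choice(sorted(DISTRIBUTIONS))
        categories, weights = DISTRIBUTIONS[site]
        race = rng.choices(categories, weights=weights, k=1)[0]
        records.append({"site": site, "race": race})
    return records


def mask_values(records, rng, p_missing):
    """Hide 'race' with a probability that depends on the record's site."""
    masked = []
    for rec in records:
        hide = rng.random() < p_missing[rec["site"]]
        masked.append({"site": rec["site"], "race": None if hide else rec["race"]})
    return masked


def impute_once(masked, rng):
    """Fill each missing value with a random draw from the observed
    distribution within the same site (stand-in for a model-based draw)."""
    observed = defaultdict(Counter)
    for rec in masked:
        if rec["race"] is not None:
            observed[rec["site"]][rec["race"]] += 1
    completed = []
    for rec in masked:
        value = rec["race"]
        if value is None:
            counts = observed[rec["site"]]
            categories = sorted(counts)
            weights = [counts[c] for c in categories]
            value = rng.choices(categories, weights=weights, k=1)[0]
        completed.append({"site": rec["site"], "race": value})
    return completed


def misclassification_rate(truth, completed):
    """Share of all records whose completed value differs from the true value."""
    wrong = sum(t["race"] != c["race"] for t, c in zip(truth, completed))
    return wrong / len(truth)


rng = random.Random(0)
truth = simulate_records(5000, rng)
masked = mask_values(truth, rng, {"site_a": 0.1, "site_b": 0.3})

# Repeat the imputation M times and average the accuracy metric.
M = 20
rates = [misclassification_rate(truth, impute_once(masked, rng)) for _ in range(M)]
mean_rate = sum(rates) / len(rates)
missing_share = sum(rec["race"] is None for rec in masked) / len(masked)

print(f"share missing: {missing_share:.3f}")
print(f"mean misclassification rate over {M} imputations: {mean_rate:.3f}")
```

In a setup closer to the one described, the within-site frequency table would be replaced by class probabilities predicted from a fitted model, and the draw would be repeated to produce each completed dataset. Averaging the metric across imputations, as done here, summarizes accuracy only; it does not propagate imputation uncertainty into downstream estimates.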

Keywords

Multiple imputation

Missing not at random (MNAR)

Hierarchical data structure

Infectious disease surveillance 

Main Sponsor

Section on Statistics in Epidemiology