Application of machine learning methods in the imputation of heterogeneous co-missing data

Jinhui Ma Co-Author
McMaster University
 
Lauren Griffith Co-Author
 
Narayanaswamy Balakrishnan Co-Author
McMaster University
 
Hon Yiu So First Author
Oakland University
 
Hon Yiu So Presenting Author
Oakland University
 
Monday, Aug 4: 10:35 AM - 10:50 AM
2241 
Contributed Papers 
Music City Center 
Ordinary imputation methods may not handle heterogeneous co-missing data well, such as the lung function measures from spirometry tests in population-based studies. This work reviews and evaluates various statistical and machine learning imputation methods for estimating the prevalence of impaired lung function, such as chronic obstructive pulmonary disease (COPD), using public survey data from aging studies. Unsupervised learning (clustering) methods improve multiple imputation (MI). The k-prototype method outperforms DBSCAN because it handles categorical data more effectively. Direct imputation based on the predicted values of random forests and artificial neural networks is unsatisfactory. Combined with MI, k-prototype clustering appears to be the most suitable method for imputing missing spirometry values. Even when the imputation functions differ from those used in the simulation, the k-prototype method can improve the estimates of the MI methods.
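The idea of clustering mixed-type records before imputing can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several things in it are assumptions made purely for illustration: the toy k-prototypes routine; the within-cluster donor draw used as a simple stand-in for a full MI procedure; the variable names; the made-up toy records; and the fixed FEV1/FVC < 0.7 cutoff for impaired lung function. Co-missing values are represented by a whole (FEV1, FVC) pair being absent for a record.

```python
import random
from collections import Counter
from statistics import mean


def mixed_distance(num_a, cat_a, num_b, cat_b, gamma=1.0):
    """k-prototypes dissimilarity: squared Euclidean distance on numeric
    features plus gamma times the number of categorical mismatches."""
    numeric = sum((a - b) ** 2 for a, b in zip(num_a, num_b))
    mismatches = sum(a != b for a, b in zip(cat_a, cat_b))
    return numeric + gamma * mismatches


def k_prototypes(nums, cats, k, gamma=1.0, n_iter=20, init=None, seed=0):
    """Toy k-prototypes: prototypes use means (numeric) and modes (categorical)."""
    rng = random.Random(seed)
    start = init if init is not None else rng.sample(range(len(nums)), k)
    protos = [(list(nums[i]), list(cats[i])) for i in start]
    labels = [0] * len(nums)
    for _ in range(n_iter):
        # Assignment step: nearest prototype under the mixed dissimilarity.
        labels = [
            min(range(k),
                key=lambda j: mixed_distance(x, c, protos[j][0], protos[j][1], gamma))
            for x, c in zip(nums, cats)
        ]
        # Update step: recompute each prototype from its members.
        for j in range(k):
            members = [i for i, lab in enumerate(labels) if lab == j]
            if not members:
                continue  # keep the old prototype for an empty cluster
            num_proto = [mean(nums[i][d] for i in members)
                         for d in range(len(nums[0]))]
            cat_proto = [Counter(cats[i][d] for i in members).most_common(1)[0][0]
                         for d in range(len(cats[0]))]
            protos[j] = (num_proto, cat_proto)
    return labels


def impute_within_clusters(outcomes, labels, m=5, seed=0):
    """Create m completed data sets; each missing (FEV1, FVC) pair is drawn
    jointly from an observed donor in the same cluster."""
    rng = random.Random(seed)
    all_donors = [y for y in outcomes if y is not None]
    completed = []
    for _ in range(m):
        filled = []
        for y, lab in zip(outcomes, labels):
            if y is None:
                donors = [o for o, l in zip(outcomes, labels)
                          if l == lab and o is not None]
                y = rng.choice(donors or all_donors)  # fall back to all donors
            filled.append(y)
        completed.append(filled)
    return completed


def prevalence(filled, threshold=0.7):
    """Share of records with FEV1/FVC below the (assumed) threshold."""
    return mean(1.0 if fev1 / fvc < threshold else 0.0 for fev1, fvc in filled)


# Made-up toy records: age in decades (numeric), smoking status (categorical).
nums = [[7.0], [7.2], [6.8], [7.1], [4.5], [4.7], [4.4], [4.6]]
cats = [["smoker"]] * 4 + [["never"]] * 4
# (FEV1, FVC) in litres; None marks a record where both values are co-missing.
outcomes = [(1.8, 3.0), (1.9, 3.1), None, (1.7, 2.9),
            (3.5, 4.2), (3.6, 4.4), None, (3.4, 4.1)]

labels = k_prototypes(nums, cats, k=2, init=[0, 4])
completed = impute_within_clusters(outcomes, labels, m=5)
# Pooled point estimate: the average of the m per-data-set estimates.
pooled = mean(prevalence(filled) for filled in completed)
print(labels, pooled)
```

In this sketch the categorical mismatch term is what lets smoking status influence cluster membership directly, which is the property the abstract credits for the k-prototype method's advantage over DBSCAN. In practice a regression-based MI model would typically be fitted within each cluster instead of the donor draw used here.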

Keywords

co-missing

heterogeneous

multiple imputations

machine learning 

Main Sponsor

Section on Statistical Learning and Data Science