Multiple Imputation, Machine Learning, and Hot Deck Imputation Models with the 2021 NSDUH Survey

Mark Brow (First Author, Presenting Author)
Department of Health and Human Services/SAMHSA
 
Jingsheng Yan (Co-Author)
Department of Health and Human Services/SAMHSA
 
Thursday, Aug 8: 11:20 AM - 11:35 AM
3865 
Contributed Papers 
Oregon Convention Center 
This study evaluates several imputation strategies, including a novel natural language processing (NLP) deep neural network algorithm and a hot deck NLP algorithm, against conventional hot deck imputation strategies and complete case analysis on artificially created missing-at-random (MAR) 2021 NSDUH survey data. Missing rates are 1.43%, 9%, and 16%. Evaluation metrics include empirical bias (EBias), root mean square error (RMSE), percent coverage, and percentage of correct prediction (PCP). Survey-weighted and non-survey-weighted hot deck imputation methods in SAS and a weighted sequential hot deck (WSHD) method in SUDAAN were used, in addition to a multiple imputation by chained equations (MICE) model, a multiple imputation classification and regression tree (CART) model, and a gradient boosted trees (xgboost) model in R. A novel approach using Google's BERT language model involved converting numeric values to data labels to predict the true value. Results: the BERT model had the highest PCP at all three missing rates; the hot deck BERT model performed best at 9% missing, the WSHD at 1.43% missing, and the CART model at 16% missing. This study examines optimal imputation strategies for complex survey data and explores the use of NLP for imputation.
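As a rough sketch of how three of the evaluation metrics named above could be computed, the snippet below uses the standard textbook definitions of EBias, RMSE, and PCP (these definitions, and the toy data, are assumptions for illustration, not the authors' code; percent coverage is omitted because it depends on the multiple-imputation confidence intervals):

```python
import numpy as np

def ebias(imputed, true):
    """Empirical bias: mean difference between imputed and true values."""
    return np.mean(imputed - true)

def rmse(imputed, true):
    """Root mean square error between imputed and true values."""
    return np.sqrt(np.mean((imputed - true) ** 2))

def pcp(imputed, true):
    """Percentage of correct prediction: share of imputed values equal to the truth."""
    return np.mean(imputed == true) * 100

# Toy binary example (hypothetical data, not NSDUH):
true = np.array([1, 0, 1, 1, 0, 1])
imp = np.array([1, 0, 0, 1, 0, 1])
```

On this toy vector, one of six imputations is wrong, so PCP is 5/6 × 100 and EBias is negative because the single error replaced a 1 with a 0.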

Keywords

Machine learning

NSDUH 2021

Imputation

Natural Language Processing (NLP)