Techniques and Case Studies to Address Missing Data Complications in Scientific Studies

Chair: Ruoyu Wang, Harvard University

Thursday, Aug 8: 10:30 AM - 12:20 PM
5190 
Contributed Papers 
Oregon Convention Center 
Room: CC-E147 

Main Sponsor

Biometrics Section

Presentations

Bootstrap Inference with Stacked Multiple Imputations

Stacked multiple imputation for missing data modifies the usual multiple imputation approach by stacking the m imputed data sets into a single data set for analysis. Various advantages of the stacked approach have been previously demonstrated (e.g., Beesley and Taylor, 2020 & 2021). We explore bootstrap approaches with stacked multiple imputation, similar to those suggested by Schomaker and Heumann (2018) for usual multiple imputation. We demonstrate that bootstrap inference with stacked multiple imputations has modest advantages in some settings with respect to computation and estimation. 
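
The stacked bootstrap scheme can be sketched in a few lines. The following is a hypothetical illustration, not the authors' implementation: it assumes a simple mean estimand and a deliberately crude imputation model that fills missing values with draws from the observed empirical distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def impute_stacked(y, m, rng):
    """Stack m imputed copies of y into one long data set; missing entries
    are drawn from the observed empirical distribution (a toy imputation
    model standing in for a real one)."""
    obs = y[~np.isnan(y)]
    copies = []
    for _ in range(m):
        yi = y.copy()
        miss = np.isnan(yi)
        yi[miss] = rng.choice(obs, size=miss.sum(), replace=True)
        copies.append(yi)
    return np.concatenate(copies)

def boot_stacked_mean(y, m=5, B=200, rng=rng):
    """Bootstrap the original incomplete rows; re-impute and stack within
    each replicate; return the point estimate and a percentile CI."""
    est = impute_stacked(y, m, rng).mean()
    reps = [impute_stacked(rng.choice(y, size=len(y), replace=True), m, rng).mean()
            for _ in range(B)]
    lo, hi = np.percentile(reps, [2.5, 97.5])
    return est, (lo, hi)

y = rng.normal(10.0, 2.0, size=200)
y[rng.random(200) < 0.3] = np.nan   # impose ~30% MCAR missingness
est, (lo, hi) = boot_stacked_mean(y)
```

Because the m stacks are analyzed as a single data set, inference comes from the bootstrap alone rather than a between-imputation variance formula, which is the computational appeal the abstract alludes to.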

Keywords

Multiple Imputation

Stacked Multiple Imputation

Bootstrap

Missing Data 

View Abstract 3606

First Author

Paul Bernhardt, Villanova University

Presenting Author

Paul Bernhardt, Villanova University

Multiple imputation for joint modelling of longitudinal and survival data with censored covariates

Time-to-event data and longitudinal data are often encountered in medical studies, in which longitudinal biomarkers may be highly correlated with the time to event. Joint modeling is a common strategy to evaluate the association between the longitudinal marker and the occurrence of the event over time, and thus to better assess covariate effects. Challenges arise in the joint analysis when some covariates are subject to detection limits. To this end, we propose a flexible multiple imputation method to handle such fixed-censored covariates in joint modeling. Our proposed method uses the information from the fully observed covariates and the two types of outcomes to impute the censored covariates iteratively via rejection sampling. This approach ensures compatibility with the substantive joint models and significantly enhances estimation efficiency. We demonstrate its promising performance through simulation studies. To underscore its practical utility, we apply the method to data from a study on community-acquired pneumonia. 
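
The rejection-sampling idea can be illustrated in a setting far simpler than the joint longitudinal-survival model: the toy sketch below (all parameters and the single linear outcome model are assumed for illustration) imputes a covariate left-censored at a detection limit by proposing from its truncated marginal and accepting in proportion to the outcome-model likelihood, which keeps the imputations compatible with the substantive model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy setup: X ~ N(0, 1) with detection limit DL (values below DL
# are left-censored), and outcome Y | X ~ N(beta * X, sigma^2).
beta, sigma, DL = 1.5, 1.0, -0.5
n = 2000
x = rng.normal(size=n)
y = beta * x + sigma * rng.normal(size=n)
cens = x < DL                      # only "x < DL" would be observed

def impute_censored_x(y_i, rng):
    """Rejection sampler for X | X < DL, Y = y_i: propose from the marginal
    of X truncated at DL, accept in proportion to the outcome-model
    likelihood f(y_i | X)."""
    while True:
        xprop = rng.normal()
        if xprop >= DL:            # enforce the detection-limit truncation
            continue
        accept_prob = np.exp(-0.5 * ((y_i - beta * xprop) / sigma) ** 2)
        if rng.random() < accept_prob:
            return xprop

x_imp = x.copy()
x_imp[cens] = [impute_censored_x(yi, rng) for yi in y[cens]]
```

Every accepted draw respects the detection limit by construction, and using the outcome in the acceptance step is what distinguishes this from imputing from the covariate distribution alone.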

Keywords

Joint modeling

Detection limits

Multiple imputation

Rejection sampling 

View Abstract 2850

Co-Author

Liming Xiang, Nanyang Technological University

First Author

Yilin Wu

Presenting Author

Yilin Wu

Multiple imputation models for SARS-CoV-2 viral load

SARS-CoV-2 viral load is frequently used as an endpoint in Phase 2 COVID-19 treatment trials. Meta-analyses of multiple trials have missingness by design due to different sampling schedules, missingness not by design, and left-truncation at the lower limit of quantification (LLOQ). We compare three viral load imputation models. The first treats infection status as a latent variable. For uninfected people, viral load is represented as a point mass at zero; for infected people, viral load is modeled as lognormally distributed with left-truncation at LLOQ. Hence, the allotment of probability mass <LLOQ to infected and uninfected people is driven by the untestable assumption of log normality of viral load values <LLOQ for infected people. To avoid the need for this assumption, a second approach directly models "<LLOQ" as a point mass and values above LLOQ as lognormally distributed. Temporal correlation is modeled by inclusion of individual-level random intercepts. A third approach involves individual linear interpolation plus random noise estimated from the empirical covariance structure. We compare the models in simulation and apply them to COVID-19 treatment trials in outpatients. 
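
The second approach, a point mass at "&lt;LLOQ" plus a lognormal for quantifiable values, can be sketched as follows. Everything here is an assumed toy setup (cross-sectional, no random intercepts, and a naive lognormal fit that ignores truncation), not the authors' model.

```python
import numpy as np

rng = np.random.default_rng(2)
LLOQ = 100.0   # lower limit of quantification (copies/mL), assumed

# Simulated toy data: with prob 0.35 a measurement falls in the "<LLOQ"
# point mass (coded np.nan here), otherwise it is lognormal; some visits
# are missing entirely.
n = 1000
below = rng.random(n) < 0.35
value = np.where(below, np.nan, 10 ** rng.normal(4.0, 1.0, size=n))
missing = rng.random(n) < 0.2

obs_vals = value[~missing]
p_below = np.mean(np.isnan(obs_vals))            # estimated point-mass prob
logs = np.log10(obs_vals[~np.isnan(obs_vals)])
mu, sd = logs.mean(), logs.std(ddof=1)           # naive fit, ignores truncation

def impute_vl(rng):
    """One draw under the mixture: with prob p_below return the '<LLOQ'
    category (coded LLOQ/2 by convention), else draw a lognormal value,
    rejecting draws below LLOQ to respect the truncation."""
    if rng.random() < p_below:
        return LLOQ / 2
    while True:
        v = 10 ** rng.normal(mu, sd)
        if v > LLOQ:
            return v

imputed = np.array([impute_vl(rng) for _ in range(missing.sum())])
```

The appeal of this formulation, as the abstract notes, is that no distributional assumption is needed for values below the LLOQ: they are a single category rather than the left tail of an assumed lognormal.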

Keywords

multiple imputation

viral load

longitudinal data

latent variable

mixture model

interpolation 

View Abstract 2785

Co-Author(s)

Allyson Mateja
Eric Chu, Clinical Monitoring Research Program Directorate, Frederick National Laboratory for Cancer Research
Daniel Rubin, FDA
David Smith, University of California, San Diego
Victor DeGruttola, Harvard T.H. Chan School of Public Health
Michael Hughes, Harvard T.H. Chan School of Public Health

First Author

Gail Potter, National Institutes of Health / NIAID

Presenting Author

Gail Potter, National Institutes of Health / NIAID

Multiple Imputation, Machine-Learning, and Hot Deck Imputation Models with the 2021 NSDUH survey

This study evaluates several imputation strategies, including a novel natural language processing (NLP) deep neural network algorithm and a hot deck NLP algorithm, against hot deck imputation strategies and complete case analysis on artificially created missing-at-random (MAR) 2021 NSDUH survey data. Missing rates are 1.43%, 9%, and 16%. Evaluation metrics include empirical bias (EBias), root mean square error (RMSE), percent coverage, and percentage of correct prediction (PCP). Survey-weighted and non-survey-weighted hot deck imputation methods in SAS and a weighted sequential hot deck method (WSHD) in SUDAAN were used, in addition to a multiple imputation by chained equations model, a multiple imputation classification and regression tree (CART) model, and a gradient boosted trees model (xgboost) in R. A novel approach using Google's BERT language model involved converting numeric values to data labels to predict the true value. Results: the BERT model had the highest PCP for all three missing rates; the hot deck BERT model performed best at 9% missing, the WSHD at 1.43% missing, and the CART model at 16% missing. This study examines optimal imputation strategies for complex survey data and explores the use of NLP for imputation. 
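
As background for the hot deck family of methods compared above, a random hot deck within adjustment cells (the simplest relative of the weighted sequential hot deck) can be sketched in numpy. The survey variables, cells, and MAR mechanism below are invented for illustration and are unrelated to the NSDUH data.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented toy survey: income varies by an education adjustment cell, and
# missingness depends on the cell (MAR) but not on income itself.
n = 2000
edu = rng.integers(0, 3, size=n)                         # 3 adjustment cells
income = 30000 + 10000 * edu + rng.normal(0, 2000, n)
miss = rng.random(n) < np.take([0.05, 0.15, 0.25], edu)  # MAR missingness
income_obs = np.where(miss, np.nan, income)

def hot_deck(y, cells, rng):
    """Random hot deck within adjustment cells: each missing value is
    replaced by a value drawn (with replacement) from the observed donors
    in the same cell."""
    y = y.copy()
    for c in np.unique(cells):
        in_cell = cells == c
        donors = y[in_cell & ~np.isnan(y)]
        recips = in_cell & np.isnan(y)
        y[recips] = rng.choice(donors, size=recips.sum(), replace=True)
    return y

imputed = hot_deck(income_obs, edu, rng)
```

Because donors come from the same cell as recipients, cell means are preserved under MAR; weighted variants additionally account for the survey weights when selecting donors.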

Keywords

Machine learning

NSDUH 2021

Imputation

Natural Language Processing (NLP) 

View Abstract 3865

Co-Author

Jingsheng Yan, Department of Health and Human Services/SAMHSA

First Author

Mark Brow, Department of Health and Human Services/SAMHSA

Presenting Author

Mark Brow, Department of Health and Human Services/SAMHSA

Predicting risk for a new patient with missing risk factor: a submodel approach for binary outcome

A massive number of clinical prediction models have been published in the medical literature in the past decades; however, very few of these models have been implemented in electronic health record (EHR) systems to aid decision-making in clinical practice. One obstacle to real-time implementation is handling missing information when calculating a risk score. In this paper, we propose a new submodel approximation approach for binary outcomes. Under certain assumptions, this approach relies only on the original coefficients and the first two moments of a function involving all missing risk factors. The proposed approach has the advantage of borrowing information from the target population. The asymptotic properties of the proposed estimator were derived and assessed through comprehensive simulations. The model performance was also assessed in simulation studies and compared with existing approaches, including the one-step-sweep (OSS) submodel and imputation by fixed chained equations approaches. The proposed method was applied to address missing risk factors in predicting the risk of 30-day adverse events for acute heart failure patients. 
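
One generic way a moment-based submodel approximation can work (illustrative only; the abstract's actual estimator is not specified here) is the logit-probit marginalization trick: a logistic risk is averaged over a missing factor using nothing but the original coefficients and the factor's first two moments. All coefficients and moments below are assumed values.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed full-model logistic coefficients: logit p = b0 + b1*x1 + b2*x2
b0, b1, b2 = -1.0, 0.8, 0.6
# First two moments of the missing factor x2 in the target population
# (in practice these would be estimated, possibly given the observed x1)
mu2, var2 = 0.5, 1.5

def risk_submodel(x1):
    """Approximate E[expit(b0 + b1*x1 + b2*X2)] using only the original
    coefficients and the first two moments of the missing factor, via the
    logit-probit marginalization approximation."""
    k = 16 * np.sqrt(3) / (15 * np.pi)       # approx 0.5881
    lin = b0 + b1 * x1 + b2 * mu2
    return expit(lin / np.sqrt(1 + k**2 * b2**2 * var2))

# Monte Carlo check: average the full-model risk over X2 ~ N(mu2, var2)
rng = np.random.default_rng(4)
x2 = rng.normal(mu2, np.sqrt(var2), size=200_000)
p_mc = expit(b0 + b1 * 1.0 + b2 * x2).mean()
p_sub = risk_submodel(1.0)
```

The attraction in an EHR setting is that the prediction needs no refitting and no imputed data set at scoring time, only published coefficients plus population moments.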

Keywords

submodel approximation

binary outcome

missing risk factors

EHR

prediction model 

View Abstract 2285

Co-Author(s)

Allison B. McCoy, Vanderbilt University Medical Center
Alan B. Storrow, Vanderbilt University Medical Center
Dandan Liu, Vanderbilt University Medical Center

First Author

Tianyi Sun, Vanderbilt University

Presenting Author

Tianyi Sun, Vanderbilt University