Thursday, Aug 8: 10:30 AM - 12:20 PM
5190
Contributed Papers
Oregon Convention Center
Room: CC-E147
Main Sponsor
Biometrics Section
Presentations
Stacked multiple imputation for missing data modifies the usual multiple imputation approach by stacking the m imputed data sets into a single data set for analysis. Various advantages of the stacked approach have been demonstrated previously (e.g., Beesley and Taylor, 2020 and 2021). We explore bootstrap approaches with stacked multiple imputation, similar to those suggested by Schomaker and Heumann (2018) for usual multiple imputation. We demonstrate that bootstrap inference with stacked multiple imputation offers modest computational and estimation advantages in some settings.
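To make the stacking idea concrete, here is a minimal sketch, assuming a deliberately crude normal imputation model and one of several possible bootstrap orderings (resample first, then impute and stack); the column names, the 1/m weight per stacked copy, and the toy data are illustrative, not the authors' implementation.

```python
# Minimal sketch: stack m imputed data sets, analyze once with weights
# 1/m, and bootstrap the whole pipeline. The normal imputation model,
# toy data, and resample-then-impute ordering are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)

def impute_and_stack(df, m=20):
    """Impute missing x m times from its observed mean/sd, then stack."""
    mu, sd = df["x"].mean(), df["x"].std()   # NaNs are skipped by default
    stacks = []
    for _ in range(m):
        d = df.copy()
        miss = d["x"].isna().to_numpy()
        d.loc[miss, "x"] = rng.normal(mu, sd, miss.sum())
        d["w"] = 1.0 / m              # each stacked copy gets weight 1/m
        stacks.append(d)
    return pd.concat(stacks, ignore_index=True)

def stacked_estimate(df):
    stacked = impute_and_stack(df)
    X = sm.add_constant(stacked[["x"]])
    return sm.WLS(stacked["y"], X, weights=stacked["w"]).fit().params["x"]

# toy data: y = 1 + 0.5 x + noise, with ~30% of x missing at random
n = 200
x = rng.normal(size=n)
y = 1 + 0.5 * x + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({"x": x, "y": y})
df.loc[rng.random(n) < 0.3, "x"] = np.nan

# bootstrap the resample -> impute -> stack -> fit pipeline
boots = [stacked_estimate(df.sample(frac=1, replace=True, random_state=b))
         for b in range(200)]
print("point estimate:", stacked_estimate(df))
print("95% bootstrap CI:", np.percentile(boots, [2.5, 97.5]))
```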
Keywords
Multiple Imputation
Stacked Multiple Imputation
Bootstrap
Missing Data
Time-to-event and longitudinal data are often encountered in medical studies, in which longitudinal biomarkers may be highly correlated with the time to event. Joint modeling is a common strategy for evaluating the association between a longitudinal marker and the occurrence of the event over time, and thus for better assessing covariate effects. Challenges arise in the joint analysis when some covariates are subject to detection limits. To this end, we propose a flexible multiple imputation method to handle such covariates, censored at fixed limits, in joint modeling. The proposed method uses the information from the fully observed covariates and the two types of outcomes to impute the censored covariates iteratively via rejection sampling. This approach ensures compatibility with the substantive joint models and substantially improves estimation efficiency. We demonstrate its promising performance through simulation studies and, to underscore its practical utility, apply the method to data from a study of community-acquired pneumonia.
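As a concrete illustration of the rejection-sampling step, the sketch below imputes a covariate censored below a fixed detection limit so that the draw is compatible with an outcome model. For brevity, a single continuous outcome stands in for the full longitudinal-survival joint likelihood, and the normal covariate and outcome models, parameter values, and function names are assumptions for illustration.

```python
# Hedged sketch: impute x given x < DL and the outcome y, by proposing
# from the covariate model truncated at the detection limit and accepting
# with probability proportional to the outcome likelihood f(y | x).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def impute_censored_x(y, dl, mu, sigma, beta0, beta1, tau, max_tries=10_000):
    """Draw from p(x | x < dl, y) via rejection sampling, assuming
    x ~ N(mu, sigma^2) and y | x ~ N(beta0 + beta1 * x, tau^2)."""
    a, b = -np.inf, (dl - mu) / sigma        # truncate proposal at the DL
    m_bound = stats.norm.pdf(0, scale=tau)   # bound on f(y | x) at its mode
    for _ in range(max_tries):
        x = stats.truncnorm.rvs(a, b, loc=mu, scale=sigma, random_state=rng)
        accept = stats.norm.pdf(y, loc=beta0 + beta1 * x, scale=tau) / m_bound
        if rng.random() < accept:
            return x
    raise RuntimeError("rejection sampler failed to accept a draw")

# example: a subject with y = 2.0 whose biomarker fell below DL = -0.5
print(impute_censored_x(y=2.0, dl=-0.5, mu=0.0, sigma=1.0,
                        beta0=1.0, beta1=0.8, tau=0.5))
```

Because the acceptance probability is the outcome likelihood, accepted draws are automatically compatible with the substantive model, which is the property the abstract emphasizes.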
Keywords
Joint modeling
Detection limits
Multiple imputation
Rejection sampling
SARS-CoV-2 viral load is frequently used as an endpoint in Phase 2 COVID-19 treatment trials. Meta-analyses of multiple trials involve missingness by design due to differing sampling schedules, missingness not by design, and left-truncation at the lower limit of quantification (LLOQ). We compare three viral load imputation models. The first treats infection status as a latent variable: for uninfected people, viral load is represented as a point mass at zero; for infected people, viral load is modeled as lognormal with left-truncation at the LLOQ. Hence, the allotment of probability mass below the LLOQ to infected versus uninfected people is driven by the untestable assumption that viral loads below the LLOQ are lognormal for infected people. To avoid this assumption, a second approach directly models "<LLOQ" as a point mass and values above the LLOQ as lognormal. Temporal correlation is modeled by including individual-level random intercepts. A third approach uses individual linear interpolation plus random noise estimated from the empirical covariance structure. We compare the models in simulation and apply them to COVID-19 treatment trials in outpatients.
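As one concrete example, the third approach can be sketched as below; the log10 scale, the fixed noise standard deviation, and the crude LLOQ floor are illustrative stand-ins (the actual analysis estimates the noise from the empirical covariance structure).

```python
# Illustrative sketch of the third approach: within-person linear
# interpolation of log10 viral load over study day, plus random noise.
# The noise SD is a placeholder for a value estimated from the
# empirical covariance structure in the real analysis.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
LLOQ = 2.0   # assumed lower limit of quantification on the log10 scale

def impute_interpolate(df, noise_sd=0.4):
    """Impute missing log10 viral loads by per-person interpolation."""
    out = []
    for pid, g in df.groupby("id"):
        g = g.sort_values("day").copy()
        obs = g.dropna(subset=["log10_vl"])
        # piecewise-linear interpolation over day (flat beyond endpoints)
        filled = np.interp(g["day"], obs["day"], obs["log10_vl"])
        miss = g["log10_vl"].isna().to_numpy()
        filled[miss] += rng.normal(scale=noise_sd, size=miss.sum())
        # crude floor at the LLOQ; the truncation models above treat
        # values below the LLOQ more carefully
        g["log10_vl_imp"] = np.maximum(filled, LLOQ)
        out.append(g)
    return pd.concat(out)

# toy longitudinal data: two subjects with different sampling schedules
df = pd.DataFrame({
    "id":  [1, 1, 1, 1, 2, 2, 2],
    "day": [0, 3, 7, 14, 0, 7, 14],
    "log10_vl": [6.1, np.nan, 3.5, np.nan, 5.0, np.nan, 2.4],
})
print(impute_interpolate(df))
```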
Keywords
multiple imputation
viral load
longitudinal data
latent variable
mixture model
interpolation
This study evaluates several imputation strategies, including a novel natural language processing (NLP) deep neural network algorithm and a hot deck NLP algorithm, against conventional hot deck imputation strategies and complete case analysis on artificially created missing-at-random (MAR) 2021 NSDUH survey data with missing rates of 1.43%, 9%, and 16%. Evaluation metrics include empirical bias (EBias), root mean square error (RMSE), percent coverage, and percentage of correct prediction (PCP). Survey-weighted and non-survey-weighted hot deck imputation methods in SAS and a weighted sequential hot deck (WSHD) method in SUDAAN were used, in addition to a multiple imputation by chained equations model, a multiple imputation classification and regression tree (CART) model, and a gradient boosted trees model (xgboost) in R. A novel approach using Google's BERT language model involved converting numeric values to data labels to predict the true value. Results: the BERT model had the highest PCP at all three missing rates, the hot deck BERT model performed best at 9% missing, the WSHD at 1.43% missing, and the CART model at 16% missing. This study examines optimal imputation strategies for complex survey data and explores the use of NLP for imputation.
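The masked-language-model idea can be sketched with an off-the-shelf BERT checkpoint: serialize a respondent's fields to text, replace the missing value with the mask token, and score candidate labels. The record template and category labels below are invented for illustration and are not the study's actual NSDUH encoding.

```python
# Hedged sketch of masked-language-model imputation: survey fields are
# serialized to text and the missing categorical value becomes [MASK].
# Requires the transformers package; downloads bert-base-uncased on
# first use. The record wording and labels are illustrative only.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

record = ("respondent age group 18 to 25, sex female, education "
          "college graduate, smoked cigarettes in past month [MASK]")
# restrict predictions to the plausible category labels for this item
for cand in fill(record, targets=["yes", "no"]):
    print(cand["token_str"], round(cand["score"], 4))
```

The imputed value would be the highest-scoring label; a hot deck variant could instead use the scores to choose a donor pool.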
Keywords
Machine learning
NSDUH 2021
Imputation
Natural Language Processing (NLP)
Co-Author
Jingsheng Yan, Department of Health and Human Services/SAMHSA
First Author
Mark Brow, Department of Health and Human Services/SAMHSA
Presenting Author
Mark Brow, Department of Health and Human Services/SAMHSA
A large number of clinical prediction models have been published in the medical literature over the past decades; however, very few of these models have been implemented in electronic health record (EHR) systems to aid decision-making in clinical practice. One obstacle to real-time implementation is handling missing information when calculating risk scores. In this paper, we propose a new submodel approximation approach for binary outcomes. Under certain assumptions, this approach relies only on the original coefficients and the first two moments of a function involving all missing risk factors. The proposed approach has the advantage of borrowing information from the target population. The asymptotic properties of the proposed estimator were derived and assessed through comprehensive simulations. Model performance was also assessed in simulation studies and compared with existing approaches, including the one-step-sweep (OSS) submodel and imputation by chained equations. The proposed method was applied to address missing risk factors in predicting the risk of 30-day adverse events in patients with acute heart failure.
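To illustrate the general idea, the sketch below approximates a published logistic model's predicted risk when one risk factor is missing, using only the original coefficients and the first two moments of the missing factor given the observed data. The closed form used here is the classical logistic-normal approximation, standing in for the paper's estimator, which may differ in its details; all coefficients and moments are illustrative.

```python
# Hedged sketch of the submodel-approximation idea for a binary outcome:
# average the full-model risk over the missing factor x given observed
# covariates, using only the original coefficients and the first two
# moments of x. Uses the classical logistic-normal approximation.
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def submodel_risk(lin_pred_obs, beta_x, mu_x, sd_x):
    """Approximate E[expit(lin_pred_obs + beta_x * x)] when
    x | observed data has mean mu_x and standard deviation sd_x."""
    k = 16.0 * np.sqrt(3.0) / (15.0 * np.pi)   # logistic-normal constant
    return expit((lin_pred_obs + beta_x * mu_x)
                 / np.sqrt(1.0 + k**2 * beta_x**2 * sd_x**2))

# example: published model logit(p) = -2.0 + 0.8*z + 0.6*x, with x
# missing; suppose the target population gives x | z ~ N(1.0, 0.5^2)
z = 1.0
eta = -2.0 + 0.8 * z
print("approx risk with x missing:", submodel_risk(eta, 0.6, 1.0, 0.5))

# Monte Carlo check of the same expectation
rng = np.random.default_rng(3)
xs = rng.normal(1.0, 0.5, 100_000)
print("Monte Carlo risk:          ", expit(eta + 0.6 * xs).mean())
```

The moments of x given the observed covariates are exactly where the method can borrow information from the target population rather than from the model-development cohort.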
Keywords
submodel approximation
binary outcome
missing risk factors
EHR
prediction model