Impacts of Missing Data Imputation on Statistical Models for Environmental Mixtures

Conference: Symposium on Data Science and Statistics (SDSS) 2025
05/02/2025: 8:55 AM - 9:20 AM MDT
Refereed 

Description

Humans are consistently exposed to complex chemical mixtures, including metals and per- and polyfluoroalkyl substances (PFAS), that are known to have detrimental health effects. Concurrently, individuals accumulate an allostatic load (AL) from chronic stressors that affect behavior, systemic physiology, and critical health metrics, contributing to physiological dysfunction. Traditional statistical methods struggle to capture the complex relationships within multipollutant mixtures, which often exhibit interactive, non-linear, and non-additive associations with health metrics; analyzing them requires advanced statistical and machine learning techniques.
Datasets with missing values, which arise from incomplete or inconsistent data collection, add further complexity. Because most machine learning techniques require complete datasets and missing data are common in surveys and electronic records, researchers typically impute missing values before fitting statistical or machine learning models. The choice of imputation technique can significantly influence model performance and inferences, underscoring the need for careful data handling.
This study investigates how different strategies for handling missing data, namely listwise deletion and imputation by mean, median, Multivariate Imputation by Chained Equations (MICE), and Amelia, affect the performance of environmental mixture modeling techniques, including Weighted Quantile Sum (WQS), Bayesian Weighted Quantile Sum (BWQS), Quantile G-Computation (Q-gcomp), Bayesian Kernel Machine Regression (BKMR), Elastic Net, and Lasso. Assuming the data are missing completely at random (MCAR) or missing at random (MAR), the study uses extensive Monte Carlo simulations to compare the performance of these models under each strategy. The findings can inform future research in environmental health and statistics on the proper treatment of missing data.
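As a rough illustration of the kind of Monte Carlo comparison described above, the following minimal Python sketch contrasts listwise deletion with mean and median imputation under MCAR. It is not the authors' simulation design: the single-exposure data-generating model, the sample size, the missingness rate, the use of the Pearson correlation as the performance summary, and all function names (`pearson`, `simulate_once`, `run_simulation`) are assumptions chosen for brevity. MICE, Amelia, and the mixture models named in the abstract are not implemented here.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_once(rng, n=200, slope=2.0, missing_rate=0.3):
    """One replicate: simulate data, delete exposure values MCAR, re-estimate."""
    # Hypothetical model: one exposure x and one outcome y = slope * x + noise.
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [slope * a + rng.gauss(0.0, 1.0) for a in x]

    # MCAR: every exposure value is dropped with the same probability,
    # independently of both x and y.
    x_miss = [a if rng.random() >= missing_rate else None for a in x]
    observed = [a for a in x_miss if a is not None]

    # Listwise deletion keeps only the rows where the exposure was observed.
    complete = [(a, b) for a, b in zip(x_miss, y) if a is not None]
    estimates = {
        "listwise": pearson([p[0] for p in complete], [p[1] for p in complete])
    }

    # Single-value imputation fills every gap with one summary statistic.
    fills = {
        "mean": statistics.fmean(observed),
        "median": statistics.median(observed),
    }
    for name, fill in fills.items():
        x_imputed = [fill if a is None else a for a in x_miss]
        estimates[name] = pearson(x_imputed, y)
    return estimates


def run_simulation(replicates=500, seed=2025, **kwargs):
    """Average each strategy's estimate over independent replicates."""
    rng = random.Random(seed)
    totals = {"listwise": 0.0, "mean": 0.0, "median": 0.0}
    for _ in range(replicates):
        for name, value in simulate_once(rng, **kwargs).items():
            totals[name] += value
    return {name: total / replicates for name, total in totals.items()}


if __name__ == "__main__":
    for name, value in run_simulation().items():
        print(f"{name:>8}: {value:.3f}")
```

Under these assumptions, listwise deletion should recover the correlation implied by the model, while mean and median imputation should pull it toward zero by roughly the square root of the observed fraction, because the filled-in values add no covariance with the outcome. A fuller study along the lines of the abstract would replace the correlation with the fitted mixture models and add MAR mechanisms and multiple-imputation methods.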

Keywords

Environmental mixture models

Missing data

Imputation techniques

Presenting Author

Yvonne Boafo, North Carolina A & T

First Author

Yvonne Boafo, North Carolina A & T

CoAuthor(s)

Sayed Mostafa, Department of Mathematics and Statistics, North Carolina A & T State University
Emmanuel Obeng-Gyasi, North Carolina A & T

Tracks

Statistical Data Science
Symposium on Data Science and Statistics (SDSS) 2025