
CS033 Navigating Missing Data

Conference: Symposium on Data Science and Statistics (SDSS) 2025

05/02/2025: 8:25 AM - 9:55 AM MDT
Refereed

Room: Alpine East

Chair

Jing Cao, Southern Methodist University

Target Audience

Mid-Level

Tracks

Practice and Applications

Statistical Data Science


Presentations

Addressing Missing Data in Multisite Learning Health Systems: Statistical Imputation Using MICE

A learning health system (LHS), as defined by the Institute of Medicine, is an organizational approach that integrates research and practice in a feedback loop, ensuring that knowledge gained from practice directly informs improvements in care and policy. These systems are increasingly using client data collected in real-world settings to enhance clinical knowledge, innovation, and quality of care. However, data collected in service settings are prone to data quality challenges, including higher rates of missingness than controlled research settings, requiring innovative statistical solutions to reduce biases associated with missing data.

The National Institute of Mental Health's Early Psychosis Intervention Network (EPINET) is an LHS comprising over 100 clinics across the United States, embedded within 8 regional scientific hubs, that provide Coordinated Specialty Care (CSC) services to individuals experiencing a first episode of psychosis (FEP). All EPINET clinics administer a Core Assessment Battery (CAB) that measures several key domains of FEP treatment and recovery, which treatment teams use to inform clinical decision-making and measure client progress. CAB data consolidated across EPINET clinics comprise a rich national data set, providing a valuable resource for FEP researchers.

This paper discusses the application of multiple imputation by chained equations (MICE) to handle missing data in the consolidated CAB dataset. We present data on fractions of missing information (FMI) within each regional hub and in aggregate across hubs, and the results from MICE for generating imputed cross-hub CAB data. These findings may be helpful for other researchers, such as those working within learning health systems, to handle missing data collected from multisite service settings in which site is a determinant of missingness.
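As a rough illustration of the FMI quantity mentioned above, the sketch below pools estimates from several imputed datasets with Rubin's rules and reports the fraction of missing information. It is not the authors' code: the function name `pool_rubin` and the input numbers are hypothetical, chosen only to show the arithmetic that would be repeated per hub and in aggregate.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m imputed-data estimates with Rubin's rules (hypothetical helper).

    estimates: point estimates of one quantity, one per imputed dataset
    variances: the squared standard errors of those estimates
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")

    pooled = mean(estimates)   # pooled point estimate
    within = mean(variances)   # within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between

    if between == 0:
        # All imputations agree, so no information is lost to missingness.
        return {"estimate": pooled, "total_var": total, "lambda": 0.0, "fmi": 0.0}

    # Proportion of total variance attributable to missing data.
    lam = (1 + 1 / m) * between / total
    # Relative increase in variance and Rubin's degrees of freedom.
    r = (1 + 1 / m) * between / within
    df = (m - 1) * (1 + 1 / r) ** 2
    # FMI with the small-m degrees-of-freedom adjustment.
    fmi = (r + 2 / (df + 3)) / (r + 1)
    return {"estimate": pooled, "total_var": total, "lambda": lam, "fmi": fmi}


# Made-up inputs for illustration only: five imputations of one mean score.
example = pool_rubin([10.0, 10.4, 9.8, 10.2, 10.1], [0.25] * 5)
print(example)
```

In a multisite setting, one might call such a function once per hub and once on the pooled data, then compare the resulting FMI values; whether the paper computes FMI this way is an assumption.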

Presenting Author

John Riddles, Westat

First Author

Robert Baskin

CoAuthor(s)

John Cosgrove, Westat
Gizem Korkmaz, Westat
Nick Askew, Westat
John Riddles, Westat
Alexander Devora, Westat
Abram Rosenblatt, Westat

Impacts of Missing Data Imputation on Statistical Models for Environmental Mixtures

Humans are consistently exposed to complex chemical mixtures, including metals and per- and polyfluoroalkyl substances (PFAS), known to have detrimental health effects. Concurrently, individuals accumulate an allostatic load (AL) from chronic stressors that impact behavior, systemic physiology, and critical health metrics, contributing to physiological dysfunction. Traditional statistical methods face challenges in capturing complex relationships within multipollutant mixtures, which often exhibit interactive, non-linear, and non-additive associations with health metrics, necessitating advanced statistical and machine learning techniques for analysis.

Further complexity arises from missing values caused by incomplete or inconsistent data collection. Since most machine learning techniques require complete datasets and missing data are common in surveys and electronic records, researchers typically employ imputation techniques to handle these gaps before fitting statistical or machine learning models. These techniques can significantly influence model performance and inferences, underscoring the need for careful data handling in this research.

This study aims to investigate how different data imputation methods (mean, median, Multivariate Imputation by Chained Equations (MICE), and Amelia) and listwise deletion affect the performance of environmental mixture modeling techniques, including Weighted Quantile Sum (WQS), Bayesian Weighted Quantile Sum (BWQS), Quantile G-Computation (Q-gcomp), Bayesian Kernel Machine Regression (BKMR), Elastic Net, and Lasso. Assuming the data are missing completely at random (MCAR) or missing at random (MAR), the study uses extensive Monte Carlo simulations to compare the performance of these models under various strategies for handling missing data. The findings can significantly impact environmental health and statistics by informing future research on properly treating missing data.
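To make the kind of comparison described above concrete, the following is a minimal Monte Carlo sketch under MCAR that contrasts only two of the listed strategies, listwise deletion and mean imputation, on a single simulated variable. It does not reproduce the study's design or any of the mixture models; the function names, sample size, missingness rate, and seed are all assumptions made for illustration.

```python
import random
from statistics import mean, variance


def simulate_once(rng, n=200, miss_rate=0.3):
    """One replicate: draw data, delete values MCAR, apply both strategies."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # true variance is 1

    # MCAR: every value has the same deletion probability, independent of the data.
    observed = [v for v in x if rng.random() >= miss_rate]
    while len(observed) < 2:  # redraw the mask in the unlikely case too few remain
        observed = [v for v in x if rng.random() >= miss_rate]

    # Listwise deletion: analyze only the observed cases.
    var_listwise = variance(observed)

    # Mean imputation: fill every gap with the observed mean.
    fill = mean(observed)
    imputed = observed + [fill] * (n - len(observed))
    var_mean_imputed = variance(imputed)

    return var_listwise, var_mean_imputed


def run_simulation(reps=500, seed=2025):
    """Average each strategy's variance estimate over many replicates."""
    rng = random.Random(seed)
    results = [simulate_once(rng) for _ in range(reps)]
    avg_listwise = mean(r[0] for r in results)
    avg_mean_imputed = mean(r[1] for r in results)
    return avg_listwise, avg_mean_imputed


listwise, mean_imputed = run_simulation()
print(f"listwise deletion: {listwise:.3f}  mean imputation: {mean_imputed:.3f}")
```

Because imputed values sit exactly at the observed mean, mean imputation understates the variance, while listwise deletion stays roughly unbiased under MCAR at the cost of a smaller sample. A fuller study would swap in MICE or Amelia and evaluate a downstream model rather than a single variance.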

Presenting Author

Yvonne Boafo, North Carolina A & T

First Author

Yvonne Boafo, North Carolina A & T

CoAuthor(s)

Sayed Mostafa, Department of Mathematics and Statistics, North Carolina A & T State University
Emmanuel Obeng-Gyasi, North Carolina A & T