04/29/2026: 1:15 PM - 2:45 PM CDT
Lightning
Although substantial research has applied machine learning (ML) methods to cross-sectional datasets, comparatively little work has evaluated ML approaches within longitudinal study designs. Accurate prediction of type 2 diabetes mellitus (T2DM) using longitudinal data is critical for early detection, risk stratification, and targeted prevention. This study evaluates and compares traditional longitudinal statistical models and ML approaches for predicting incident T2DM using nurse visit data from waves 2, 4, and 6 of the English Longitudinal Study of Ageing (ELSA). The analysis included 8,368 repeated observations from adults aged 50 years and older, with diabetes status assessed over time. Traditional models, including Generalized Linear Mixed Models (GLMM) and Generalized Estimating Equations (GEE), were compared with ML methods, including Random Forest (RF), Mixed-Effects Random Forest (MERF), and Extreme Gradient Boosting (XGBoost). Models were trained and evaluated under consistent subject-level data splits to preserve the longitudinal structure. Predictive performance was assessed using discrimination, classification, and calibration metrics, including AUROC, PR-AUC, sensitivity, specificity, precision, F1-score, log loss, and Brier score. The comparative analysis showed that GEE achieved the strongest overall discrimination and calibration on the test set (AUROC = 0.8289; Brier = 0.0786), closely followed by RF. XGBoost and MERF demonstrated moderate discrimination, while GLMM showed comparatively lower AUROC. At the default 0.5 threshold, all models exhibited high specificity but reduced sensitivity due to class imbalance. After threshold optimization, performance improved substantially across models, with RF and GEE achieving the highest F1-scores and balanced sensitivity–specificity trade-offs.
These findings emphasize the importance of threshold tuning and appropriate longitudinal modeling strategies when predicting chronic disease risk using repeated measures data.
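As a rough illustration of two ideas named in the abstract, the sketch below shows a subject-level split (all repeated observations from one participant stay in the same partition) and a grid search for a classification threshold. It is not the authors' pipeline: the abstract does not say which criterion was used for threshold optimization, so maximizing F1 is an assumption here, and the function names, the 20% test fraction, and the toy data are all hypothetical. In practice the threshold would be tuned on training or validation data rather than on the test set.

```python
import random


def subject_level_split(subject_ids, test_fraction=0.2, seed=0):
    """Split row indices so that no subject appears in both train and test.

    Hypothetical helper: the split fraction and seed are illustrative only.
    """
    subjects = sorted(set(subject_ids))
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train_idx = [i for i, s in enumerate(subject_ids) if s not in test_subjects]
    test_idx = [i for i, s in enumerate(subject_ids) if s in test_subjects]
    return train_idx, test_idx


def f1_at_threshold(y_true, y_prob, threshold):
    """F1-score when predicting positive for probabilities >= threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_prob) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, y_prob):
    """Return the grid threshold (0.01 to 0.99) with the highest F1.

    Assumption: F1 is the tuning criterion; ties go to the lowest threshold.
    """
    grid = [i / 100 for i in range(1, 100)]
    return max(grid, key=lambda t: f1_at_threshold(y_true, y_prob, t))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy data: five hypothetical subjects, three visits each.
subject_ids = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5]
train_idx, test_idx = subject_level_split(subject_ids)

# Toy imbalanced outcome: positives receive probabilities below 0.5,
# so the default threshold predicts no positive cases at all.
y_true = [0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45]
tuned = best_f1_threshold(y_true, y_prob)
```

On the toy data, the default 0.5 threshold flags no one as positive, which mirrors the high-specificity, low-sensitivity pattern the abstract attributes to class imbalance; lowering the threshold recovers the positive cases.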
Type 2 Diabetes
Longitudinal Data
Machine Learning, Mixed-Effects Random Forest, Random Forest, and Extreme Gradient Boosting
Predictive Modeling
Traditional Modeling, Generalized Linear Mixed Model (GLMM), and Generalized Estimating Equations
UK Data Set
Presenting Author
Peggy Akabuah
First Author
Peggy Akabuah
CoAuthor
Kristina Vatcheva, University of Texas Rio Grande Valley
Tracks
Data Science Applications
Symposium on Data Science and Statistics (SDSS) 2026