Longitudinal Data-Driven Prediction of Type 2 Diabetes

Conference: Symposium on Data Science and Statistics (SDSS) 2026
04/29/2026: 1:15 PM - 2:45 PM CDT
Lightning 

Description

While substantial research has applied machine learning (ML) methods to cross-sectional datasets, comparatively little work has evaluated ML approaches within longitudinal study designs. Accurate prediction of type 2 diabetes mellitus (T2DM) using longitudinal data is critical for early detection, risk stratification, and targeted prevention. This study evaluates and compares traditional longitudinal statistical models and ML approaches for predicting incident T2DM using nurse visit data from waves 2, 4, and 6 of the English Longitudinal Study of Ageing (ELSA). The analysis included 8,368 repeated observations from adults aged 50 years and older, with diabetes status assessed over time. Traditional models, including Generalized Linear Mixed Models (GLMM) and Generalized Estimating Equations (GEE), were compared with ML methods, including Random Forest (RF), Mixed-Effects Random Forest (MERF), and Extreme Gradient Boosting (XGBoost). All models were trained and evaluated on the same subject-level data splits, so that each participant's observations stayed within a single partition and the longitudinal structure was preserved. Predictive performance was assessed using discrimination, classification, and calibration metrics, including AUROC, PR-AUC, sensitivity, specificity, precision, F1-score, log loss, and Brier score. GEE achieved the strongest overall discrimination and calibration on the test set (AUROC = 0.8289; Brier = 0.0786), closely followed by RF. XGBoost and MERF demonstrated moderate discrimination, while GLMM showed comparatively lower AUROC. At the default 0.5 threshold, all models exhibited high specificity but reduced sensitivity due to class imbalance. After threshold optimization, performance improved substantially across models, with RF and GEE achieving the highest F1-scores and balanced sensitivity–specificity trade-offs. These findings emphasize the importance of threshold tuning and appropriate longitudinal modeling strategies when predicting chronic disease risk using repeated measures data.
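
For readers unfamiliar with the evaluation steps named above, the following is a minimal, illustrative sketch of a subject-level split, the Brier score, and F1-based threshold selection. It is not the authors' code: the function names, the 30% test fraction, the 0.01-step threshold grid, and the choice of F1 as the tuning criterion are all assumptions made for illustration, and the abstract does not state which partition was used for threshold tuning.

```python
import random


def subject_level_split(subject_ids, test_fraction=0.3, seed=0):
    """Split row indices so that all rows of a subject land on one side.

    `test_fraction` and `seed` are illustrative defaults, not study values.
    """
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train_idx = [i for i, s in enumerate(subject_ids) if s not in test_subjects]
    test_idx = [i for i, s in enumerate(subject_ids) if s in test_subjects]
    return train_idx, test_idx


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def f1_at_threshold(y_true, y_prob, threshold):
    """F1-score when predicting positive for probabilities >= threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_prob) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, y_prob, grid=None):
    """Return the grid threshold with the highest F1 (first one on ties)."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # assumed grid: 0.01 .. 0.99
    return max(grid, key=lambda t: f1_at_threshold(y_true, y_prob, t))


if __name__ == "__main__":
    # Toy data only: five hypothetical participants with two visits each.
    ids = ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e"]
    train_idx, test_idx = subject_level_split(ids, test_fraction=0.4, seed=1)
    print("train rows:", train_idx, "test rows:", test_idx)

    # Toy imbalanced outcomes with low predicted probabilities.
    y = [0, 0, 0, 0, 1, 1]
    p = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45]
    t = best_f1_threshold(y, p)
    print("F1 at 0.5:", f1_at_threshold(y, p, 0.5))
    print("tuned threshold:", t, "F1:", f1_at_threshold(y, p, t))
    print("Brier:", brier_score(y, p))
```

In the toy example no predicted probability reaches 0.5, so the default threshold labels every case negative, which illustrates how class imbalance can suppress sensitivity until the threshold is lowered. In practice the threshold would normally be chosen on training or validation data rather than on the test set, so that the reported test metrics are not optimistically biased.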

Keywords

Type 2 Diabetes

Longitudinal Data

Machine Learning, Mixed-Effects Random Forest, Random Forest, and Extreme Gradient Boosting

Predictive Modeling

Traditional Modeling, Generalized Linear Mixed Model (GLMM), and Generalized Estimating Equations (GEE)

UK Data Set 

Presenting Author

Peggy Akabuah

First Author

Peggy Akabuah

CoAuthor

Kristina Vatcheva, University of Texas Rio Grande Valley

Tracks

Data Science Applications
Symposium on Data Science and Statistics (SDSS) 2026