Effects of sample weights on the performance of machine learning models using complex survey data
Dong Wang
FDA National Center for Toxicological Research (NCTR)
Sunday, Aug 3: 4:50 PM - 5:05 PM
2441
Contributed Papers
Music City Center
Recent studies have applied machine learning (ML) methods to complex survey data while ignoring the sample weights determined by the complex sampling design. This gap stems from the lack of ML software that accommodates sample-weight adjustment. We have developed the R-MLSurvey package, which suitably incorporates sample weights into ML algorithms via replicate-weights methods in the cross-validation (CV) step, as an extension of weighted LASSO regression. The ML models considered are penalized logistic regression, i.e., L1 (LASSO) and Elastic Net (EN), random forest (RF), and extreme gradient boosting (XGBoost), fitted with design-based K-fold cross-validation (dCV) and Jackknife repeated replication (JKn) for weighted ML models. The final models were evaluated with weighted performance metrics. We discuss the effects of sample weights on prediction and variable selection using two class-imbalanced examples, hypertension and diabetes, from the National Health and Nutrition Examination Survey (NHANES) data. Two under-sampling approaches were used ad hoc to balance the classes.
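The core idea above, carrying survey weights into both model fitting and the evaluation metric within each cross-validation fold, can be sketched as follows. This is a hypothetical, minimal illustration in Python using only numpy, not the R-MLSurvey implementation: the weighted logistic regression, the weighted AUC, and the simulated data are all stand-ins for the abstract's design-based procedure.

```python
import numpy as np

def weighted_auc(y, scores, w):
    # Survey-weighted AUC: weighted probability that a randomly chosen
    # positive case scores higher than a randomly chosen negative case.
    pos, neg = y == 1, y == 0
    num, den = 0.0, 0.0
    for sp, wp in zip(scores[pos], w[pos]):
        gt = (sp > scores[neg]).astype(float)
        eq = (sp == scores[neg]).astype(float)
        num += wp * np.sum(w[neg] * (gt + 0.5 * eq))
        den += wp * np.sum(w[neg])
    return num / den

def fit_weighted_logistic(X, y, w, lr=0.1, steps=500):
    # Plain logistic regression fitted by gradient descent on the
    # weighted negative log-likelihood (weights w are the survey weights).
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        beta -= lr * (X.T @ (w * (p - y))) / w.sum()
    return beta

# Simulated data: outcome driven by the first covariate, plus noise;
# survey weights drawn uniformly (purely illustrative).
rng = np.random.default_rng(0)
n = 400
X = np.c_[np.ones(n), rng.normal(size=(n, 3))]   # intercept + 3 covariates
y = (X[:, 1] + 0.5 * rng.normal(size=n) > 0).astype(float)
w = rng.uniform(0.5, 3.0, size=n)

# 5-fold CV in which the survey weights enter both the fit and the metric.
idx = rng.permutation(n)
folds = np.array_split(idx, 5)
aucs = []
for k in range(5):
    test = folds[k]
    train = np.concatenate([folds[j] for j in range(5) if j != k])
    beta = fit_weighted_logistic(X[train], y[train], w[train])
    scores = X[test] @ beta
    aucs.append(weighted_auc(y[test], scores, w[test]))

mean_auc = float(np.mean(aucs))
print(round(mean_auc, 3))
```

A design-based CV as described in the abstract would additionally form the folds and the replicate weights according to the survey's strata and clusters; here the folds are simple random splits for brevity.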
complex survey, replicate weights, NHANES, sample weights, machine learning, class imbalance
Main Sponsor
IMS