Effects of sample weights on the performance of machine learning models using complex survey data
Dong Wang
FDA National Center for Toxicological Research (NCTR)
Sunday, Aug 3: 4:50 PM - 5:05 PM
2441
Contributed Papers
Music City Center
Recent studies have applied machine learning (ML) methods to complex survey data while ignoring the sample weights determined by the complex sampling design. This gap stems from the lack of ML software that accommodates sample-weight adjustment. We have developed the R-MLSurvey package, which suitably incorporates sample weights into ML algorithms via replicate-weights methods in the cross-validation (CV) step, as an extension of weighted LASSO regression. The ML models considered are penalized logistic regression, i.e., L1 (LASSO) and Elastic Net (EN), random forest (RF), and extreme gradient boosting (XGBoost), fitted with design-based K-fold cross-validation (dCV) and Jackknife repeated replication (JKn) for weighted ML models. The final models were evaluated with weighted performance metrics. We discuss the effects of sample weights on prediction and variable selection using two class-imbalanced examples, hypertension and diabetes, from the National Health and Nutrition Examination Survey (NHANES) data. Two under-sampling approaches were used ad hoc to balance the classes.
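The core idea above, carrying survey weights into both model fitting and the evaluation metric within each cross-validation fold, can be sketched as follows. This is a hypothetical, minimal illustration in Python using only numpy, not the R-MLSurvey implementation: the weighted logistic regression, the weighted AUC, and the simulated data are all stand-ins for the abstract's design-based procedure.

```python
import numpy as np

def weighted_auc(y, scores, w):
    # Survey-weighted AUC: weighted probability that a randomly chosen
    # positive case scores higher than a randomly chosen negative case.
    pos, neg = y == 1, y == 0
    num, den = 0.0, 0.0
    for sp, wp in zip(scores[pos], w[pos]):
        gt = (sp > scores[neg]).astype(float)
        eq = (sp == scores[neg]).astype(float)
        num += wp * np.sum(w[neg] * (gt + 0.5 * eq))
        den += wp * np.sum(w[neg])
    return num / den

def fit_weighted_logistic(X, y, w, lr=0.1, steps=500):
    # Plain logistic regression fitted by gradient descent on the
    # weighted negative log-likelihood (weights w are the survey weights).
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        beta -= lr * (X.T @ (w * (p - y))) / w.sum()
    return beta

# Simulated data: outcome driven by the first covariate, plus noise;
# survey weights drawn uniformly (purely illustrative).
rng = np.random.default_rng(0)
n = 400
X = np.c_[np.ones(n), rng.normal(size=(n, 3))]   # intercept + 3 covariates
y = (X[:, 1] + 0.5 * rng.normal(size=n) > 0).astype(float)
w = rng.uniform(0.5, 3.0, size=n)

# 5-fold CV in which the survey weights enter both the fit and the metric.
idx = rng.permutation(n)
folds = np.array_split(idx, 5)
aucs = []
for k in range(5):
    test = folds[k]
    train = np.concatenate([folds[j] for j in range(5) if j != k])
    beta = fit_weighted_logistic(X[train], y[train], w[train])
    scores = X[test] @ beta
    aucs.append(weighted_auc(y[test], scores, w[test]))

mean_auc = float(np.mean(aucs))
print(round(mean_auc, 3))
```

A design-based CV as described in the abstract would additionally form the folds and the replicate weights according to the survey's strata and clusters; here the folds are simple random splits for brevity.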
complex survey, replicate weights, NHANES, sample weights, machine learning, class imbalance
Main Sponsor
IMS