009: A general meta machine learning model for constructing a binary classifier using a small training dataset

Conference: Conference on Statistical Practice (CSP) 2023
02/03/2023: 7:30 AM - 8:45 AM PST
Posters 
Room: Cyril Magnin Foyer 

Description

Predictive modeling to aid decision making is widely used in fields such as chemistry, computer science, physics, economics, finance, and statistics. Many models have been proposed to make accurate predictions, yet no single model consistently outperforms the rest. Additional practical challenges include small training datasets (n < 100), often due to rare diseases or expensive data collection, and the question of how best to handle missing data. Using a real dataset as an example, we present a general framework for advanced machine learning (ML) modeling when the training dataset is small and contains missing data. Specifically, multiple imputation is used to create imputed datasets that eliminate the missing data, and repeated K-fold cross-validation is used to robustly evaluate the predictive performance of the final predictor. Commonly used machine learning methods for predicting a binary outcome, such as penalized logistic regression, random forest, gradient boosted decision trees, support vector machines, XGBoost, and neural networks, are first applied to the training data within the cross-validation step as base machine learning models. Each model's hyper-parameters are tuned according to how well different parameter settings perform across all imputed datasets. For each imputed dataset, the out-of-fold predictions from each base machine learning method are then stacked using a logistic regression model to create the final predictive model. Predictive performance measures, such as balanced accuracy (the average of the accuracies within the two outcome categories), accuracy, area under the curve (AUC), sensitivity, and specificity, together with their standard errors (SE), are summarized across the final predictive models from the imputed datasets using Rubin's rule.
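The pipeline above can be sketched in a few lines. This is a minimal illustration, assuming scikit-learn (the abstract names no software): a single `IterativeImputer` inside the cross-validated pipeline stands in, as a simplification, for the multiple-imputation step, and `StackingClassifier` fits the logistic-regression meta-model on out-of-fold predictions from the base learners. The `rubin_pool` helper and the per-imputation AUC values fed to it are hypothetical, included only to show the form of Rubin's rule.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=80, n_features=10, random_state=0)
mask = rng.random(X.shape) < 0.1  # mimic a small, incomplete training set
X[mask] = np.nan

# A subset of the base learners named in the abstract.
base_learners = [
    ("lr", LogisticRegression(penalty="l2", C=1.0, max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]
# StackingClassifier trains the logistic-regression meta-model on
# out-of-fold predictions from the base learners.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
model = make_pipeline(IterativeImputer(random_state=0), stack)

# Repeated K-fold cross-validation of the full impute-then-stack pipeline.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")
print(f"balanced accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

def rubin_pool(estimates, variances):
    # Rubin's rule over m imputed datasets: pooled estimate is the mean;
    # total variance is the mean within-imputation variance plus
    # (1 + 1/m) times the between-imputation variance.
    m = len(estimates)
    qbar = float(np.mean(estimates))
    total_var = float(np.mean(variances)
                      + (1 + 1 / m) * np.var(estimates, ddof=1))
    return qbar, total_var

# Illustrative per-imputation AUCs and their squared SEs (not real results).
aucs = [0.78, 0.81, 0.76, 0.80, 0.79]
auc_vars = [0.004, 0.005, 0.004, 0.006, 0.005]
pooled_auc, pooled_var = rubin_pool(aucs, auc_vars)
print(f"pooled AUC: {pooled_auc:.3f} (SE {np.sqrt(pooled_var):.3f})")
```

In a full multiple-imputation analysis one would instead fit the stacked model separately on each imputed dataset and pool each performance measure with `rubin_pool`.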
Permutation importance values, which quantify how much each feature contributes to the prediction, can be obtained for any of the base machine learning models. The importance values of all features used in the final base ML models can then be averaged across the imputed datasets to identify which features are most important for predicting the binary outcome. A feature selection strategy based on this importance ranking could be applied as a further step.
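A permutation-importance ranking for one fitted base model can be sketched as follows, again assuming scikit-learn's `permutation_importance` utility (the abstract does not specify an implementation); a random forest stands in for any base learner:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=80, n_features=8, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# A feature's importance is the drop in score when its values are
# randomly shuffled; repeats are averaged for stability. In the full
# pipeline these values would additionally be averaged over the
# imputed datasets.
result = permutation_importance(rf, X_te, y_te, n_repeats=20,
                                scoring="balanced_accuracy", random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("features ranked by importance:", ranking)
```

Features with importance near zero are candidates for removal under the feature selection strategy mentioned above.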

Keywords

machine learning

stacking

missing data

feature selection

importance ranking

predictive performance measure 

Presenting Author(s)

Junying Wang, Stony Brook University
David Wu, Stony Brook University

First Author

David Wu, Stony Brook University

CoAuthor(s)

Junying Wang, Stony Brook University
Christine DeLorenzo, Stony Brook University
Jie Yang, Stony Brook University

Tracks

Implementation and Analysis