009: A general meta machine learning model for constructing a binary classifier using a small training dataset
Conference: Conference on Statistical Practice (CSP) 2023
02/03/2023: 7:30 AM - 8:45 AM PST
Posters
Room: Cyril Magnin Foyer
Predictive modeling to aid decision making has been widely used in fields such as chemistry, computer science, physics, economics, finance, and statistics. Many models have been proposed to make accurate predictions, yet no single model consistently outperforms the rest. Additional challenges in practice include a relatively small training dataset (n < 100), often due to rare diseases or expensive data collection, and the question of how best to handle missing data. We use a real dataset as an example to present a general framework for building an advanced machine learning (ML) model when the training dataset is small and contains missing data. Specifically, multiple imputation is used to create completed datasets that account for the missing values. Repeated K-fold cross-validation is used to robustly evaluate the predictive performance of the final predictor. Commonly used machine learning methods for predicting a binary outcome, such as penalized logistic regression, random forest, gradient boosted decision trees, support vector machine, XGBoost, and neural network, are first applied to the training data within the cross-validation step as base machine learning models. Each base model has associated hyper-parameters that are tuned according to how well candidate settings perform across all imputed datasets. For each imputed dataset, the out-of-fold predictions from each base model are then stacked using a logistic regression model to create the final predictive model. Predictive performance measures such as balanced accuracy (the average of the accuracies within the two outcome categories), accuracy, area under the curve (AUC), sensitivity, and specificity, together with their standard errors (SE), are summarized across the final predictive models from the imputed datasets using Rubin's rule. Permutation importance ranking values, which quantify how much each feature contributes to the prediction, can be obtained for any of the base machine learning models. The importance rankings of all features from the final base ML models fitted to each imputed dataset can then be averaged to identify which features are most important for predicting the binary outcome. A feature selection strategy based on this importance ranking could be applied further.
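The workflow above can be illustrated with a short Python sketch. This is an assumption-laden outline rather than the authors' implementation: the choice of scikit-learn and xgboost, the specific estimator settings, the number of imputations, and the cross-validation scheme are placeholders, and hyper-parameter tuning is omitted for brevity.

```python
# Illustrative sketch only: library choices, settings, and the number of
# imputations are assumptions, not the authors' actual implementation.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier, StackingClassifier)
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.metrics import make_scorer, recall_score
from xgboost import XGBClassifier


def evaluate_stacked_model(X, y, n_imputations=5, random_state=0):
    """Impute, stack base learners, and score one small binary-outcome dataset."""
    base_learners = [
        ("plr", LogisticRegression(penalty="l2", max_iter=5000)),          # penalized logistic regression
        ("rf", RandomForestClassifier(n_estimators=500, random_state=random_state)),
        ("gbdt", GradientBoostingClassifier(random_state=random_state)),
        ("svm", SVC(probability=True, random_state=random_state)),
        ("xgb", XGBClassifier(eval_metric="logloss", random_state=random_state)),
        ("nn", MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                             random_state=random_state)),
    ]
    scoring = {
        "balanced_accuracy": "balanced_accuracy",
        "accuracy": "accuracy",
        "auc": "roc_auc",
        "sensitivity": "recall",
        # specificity = recall of the negative class
        "specificity": make_scorer(recall_score, pos_label=0),
    }
    per_imputation = []
    for m in range(n_imputations):
        # One stochastic imputation per iteration -> m completed datasets.
        imputer = IterativeImputer(sample_posterior=True, random_state=m)
        X_imp = imputer.fit_transform(X)

        # StackingClassifier combines the out-of-fold predicted probabilities
        # of the base learners with a logistic regression meta-model.
        stack = StackingClassifier(
            estimators=base_learners,
            final_estimator=LogisticRegression(max_iter=5000),
            stack_method="predict_proba",
            cv=5,
        )

        # Repeated K-fold CV gives a more stable performance estimate on small n.
        rkf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3,
                                      random_state=random_state)
        scores = cross_validate(stack, X_imp, y, cv=rkf, scoring=scoring)
        per_imputation.append({k: v for k, v in scores.items()
                               if k.startswith("test_")})
    return per_imputation
```

In practice the base learners would first be tuned, for example by grid search over each imputed dataset, and the per-imputation cross-validation scores returned above would then feed into the pooling step described in the abstract.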
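The pooling and feature-importance steps can likewise be sketched. Under Rubin's rule the pooled point estimate is the mean of the per-imputation estimates, and the pooled variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The helper names and inputs below are hypothetical.

```python
import numpy as np
from sklearn.inspection import permutation_importance


def rubin_pool(estimates, within_se):
    """Pool a performance measure across m imputed datasets via Rubin's rule.

    estimates : per-imputation point estimates (e.g. mean CV balanced accuracy)
    within_se : per-imputation standard errors of those estimates
    """
    estimates, within_se = np.asarray(estimates), np.asarray(within_se)
    m = len(estimates)
    q_bar = estimates.mean()                 # pooled estimate
    w = (within_se ** 2).mean()              # average within-imputation variance
    b = estimates.var(ddof=1)                # between-imputation variance
    total_var = w + (1 + 1 / m) * b          # Rubin's total variance
    return q_bar, np.sqrt(total_var)


def averaged_importance(fitted_models, imputed_datasets, y, n_repeats=20):
    """Average permutation importances of fitted models across imputed datasets."""
    scores = [
        permutation_importance(model, X_imp, y, n_repeats=n_repeats,
                               random_state=0).importances_mean
        for model, X_imp in zip(fitted_models, imputed_datasets)
    ]
    return np.mean(scores, axis=0)  # one averaged importance value per feature
```

The averaged importance values could then rank features and support the feature-selection strategy mentioned above.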
machine learning
stacking
missing data
feature selection
importance ranking
predictive performance measure
Presenting Author(s)
Junying Wang, Stony Brook University
David Wu, Stony Brook University
First Author
David Wu, Stony Brook University
CoAuthor(s)
Junying Wang, Stony Brook University
Christine DeLorenzo, Stony Brook University
Jie Yang, Stony Brook University
Tracks
Implementation and Analysis