When cross-validation meets Cook's distance

Yoonkyung Lee (Co-Author), The Ohio State University
Zhenbang Jiao (First Author, Presenting Author)

Wednesday, Aug 7: 11:50 AM - 12:05 PM
3674: Contributed Papers
Oregon Convention Center
We introduce a new feature selection method for regression models based on cross-validation (CV) and Cook's distance (CD). Leave-one-out (LOO) CV measures the difference between the LOO fitted values and the observed responses, while CD measures their difference from the full-data fitted values. CV selects a model based on its prediction accuracy and is prone to selecting overfitting models. To improve on CV, we account for model robustness using CD, which can be shown to be effective in differentiating overfitting models. We therefore propose a linear combination of the CV error and the average Cook's distance as a feature selection criterion. Under mild assumptions, we show that the probability of this criterion selecting the true model in least squares linear regression converges to 1, which is not the case for CV. Our simulation studies also demonstrate that this criterion yields significantly better feature selection performance than CV for both linear regression and penalized linear regression. As for computational efficiency, the criterion requires no calculation beyond that of CV, since CD is built from the same fitted values CV already needs.
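As a concrete illustration, the sketch below computes both quantities from a single least squares fit using the standard leverage shortcuts: the LOO residual e_i/(1 - h_ii) and Cook's distance D_i = e_i^2 h_ii / (p s^2 (1 - h_ii)^2), which share the same fitted values and leverages. The function name `cv_cd_criterion` and the weight `lam` on the average Cook's distance are hypothetical placeholders; the abstract does not specify how the linear combination is weighted.

```python
# Minimal sketch of a CV + Cook's distance selection score for ordinary
# least squares, assuming X is an n x p design matrix (intercept included,
# n > p, no leverage equal to 1) and y is the response vector.
import numpy as np

def cv_cd_criterion(X, y, lam=1.0):
    n, p = X.shape
    # Hat matrix H = X (X'X)^{-1} X'; its diagonal holds the leverages h_ii.
    H = X @ np.linalg.solve(X.T @ X, X.T)
    h = np.diag(H)
    resid = y - H @ y                   # full-data residuals e_i
    s2 = resid @ resid / (n - p)        # residual variance estimate s^2
    # LOO residuals via e_i / (1 - h_ii): no model refitting needed.
    loo_resid = resid / (1 - h)
    cv_error = np.mean(loo_resid ** 2)  # LOO CV mean squared error
    # Cook's distance from the same residuals and leverages.
    cook = resid ** 2 * h / (p * s2 * (1 - h) ** 2)
    return cv_error + lam * np.mean(cook)
```

In use, one would fit each candidate feature subset and keep the subset minimizing this combined score; since both terms come from a single fit, the cost matches that of LOO CV itself.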

Keywords

Cook's distance

Cross-validation

Linear regression

Model robustness

Model selection

Main Sponsor

Section on Statistical Learning and Data Science