40: Optimal Sampling under Class Imbalance: A Kernel-Based IPW Estimator for Efficient Classification

JooChul Lee Co-Author
Auburn University
 
Hyelim Jung First Author
Auburn University
 
Hyelim Jung Presenting Author
Auburn University
 
Tuesday, Aug 5: 10:30 AM - 12:20 PM
2667 
Contributed Posters 
Music City Center 
Many studies have designed classification models for settings where human error is present or where the population distribution is not precisely known. Research that explicitly addresses imbalanced data, however, is still in its early stages. We propose a novel optimal sampling method that improves classification performance without requiring additional data collection or sacrificing the desirable distributional properties of the classification model. Among optimal sampling methods, we use the inverse probability weighted (IPW) estimator to subsample the more informative instances from the dataset. Under imbalanced data, the amount of available information is tied more closely to the number of positive instances than to the total sample size. We therefore retain all positive instances and substantially reduce the negative instances through a non-uniform sampling strategy, which improves estimation efficiency. We derive the asymptotic distribution of the IPW estimator combined with a kernel-based method and show that the proposed estimator is both unbiased and consistent. Extensive simulation studies and an application to a real dataset demonstrate that the proposed method remains effective under imbalanced data and unspecified model settings, and the results confirm that the proposed estimator is more efficient than existing methods.
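The sampling scheme described in the abstract (keep every positive, subsample negatives non-uniformly, correct with inverse probability weights) can be illustrated with a minimal sketch. This is not the authors' estimator: the kernel-based component is omitted, and the sampling rule used here (negative inclusion probabilities proportional to a pilot model's predicted probability) is an assumption chosen only for illustration. The function names `fit_weighted_logistic`, `ipw_subsample_fit`, and `simulate`, along with the simulated coefficients, are hypothetical.

```python
import numpy as np

TRUE_BETA = np.array([-4.0, 1.0, -1.0])  # hypothetical coefficients; the intercept makes positives rare


def sigmoid(z):
    # Clip the linear predictor to avoid overflow in exp.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))


def fit_weighted_logistic(X, y, w, n_iter=100, ridge=1e-8):
    """Maximize the weighted logistic log-likelihood with Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess + ridge * np.eye(X.shape[1]), grad)
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta


def ipw_subsample_fit(X, y, n_neg, rng):
    """Keep all positives, Poisson-sample about n_neg negatives, fit with IPW."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)

    # Pilot fit: uniform subsample of negatives, weighted by 1 / r0.
    r0 = min(1.0, len(pos) / len(neg))
    pilot_neg = neg[rng.random(len(neg)) < r0]
    idx0 = np.concatenate([pos, pilot_neg])
    w0 = np.concatenate([np.ones(len(pos)), np.full(len(pilot_neg), 1.0 / r0)])
    beta_pilot = fit_weighted_logistic(X[idx0], y[idx0], w0)

    # Assumed rule: inclusion probability of a negative is proportional
    # to its pilot predicted probability, capped at 1.
    p_neg = sigmoid(X[neg] @ beta_pilot)
    pi = np.minimum(1.0, n_neg * p_neg / p_neg.sum())
    keep = rng.random(len(neg)) < pi

    # Positives get weight 1; sampled negatives get weight 1 / pi.
    idx = np.concatenate([pos, neg[keep]])
    w = np.concatenate([np.ones(len(pos)), 1.0 / pi[keep]])
    return fit_weighted_logistic(X[idx], y[idx], w), idx


def simulate(n, rng):
    """Draw an imbalanced binary dataset from a logistic model."""
    X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
    y = (rng.random(n) < sigmoid(X @ TRUE_BETA)).astype(float)
    return X, y


rng = np.random.default_rng(0)
X, y = simulate(20000, rng)
beta_hat, kept = ipw_subsample_fit(X, y, n_neg=2000, rng=rng)
print("positives:", int(y.sum()), "kept rows:", len(kept))
print("estimate:", beta_hat)
```

Because each sampled negative is weighted by the inverse of its inclusion probability, the weighted score has the same expectation as the full-data score, which is what lets the subsample stand in for the full dataset in this sketch.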

Keywords

Active learning

Optimal sampling

Imbalanced data

Label noise

Binary classification

Semi-supervised learning 

Main Sponsor

Section on Statistical Learning and Data Science