40: Optimal Sampling under Class Imbalance: A Kernel-Based IPW Estimator for Efficient Classification
Tuesday, Aug 5: 10:30 AM - 12:20 PM
2667
Contributed Posters
Music City Center
Various studies have been conducted to design classification models in situations where human error is present or where the population distribution is not precisely known. However, research explicitly addressing imbalanced data is still in its early stages. In this context, we propose a novel optimal sampling method that enhances classification performance without requiring additional data collection or sacrificing the desirable distributional properties of the classification model. Among optimal sampling methods, the Inverse Probability Weighted (IPW) estimator is used to subsample the more informative instances from the dataset. In particular, under imbalanced data settings, the amount of available information is tied more closely to the number of positive instances than to the total sample size. Therefore, all positive instances are retained, and the negative instances are substantially reduced through a non-uniform sampling strategy, thereby improving estimation efficiency. This study derives the asymptotic distribution of the IPW estimator combined with a kernel-based method and shows that the proposed estimator is both unbiased and consistent. Furthermore, through extensive simulation studies and an application to a real dataset, we demonstrate that the proposed method remains effective under imbalanced data and unspecified model settings. The results confirm that the proposed estimator achieves superior efficiency compared to existing methods.
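The sketch below illustrates the general idea described in the abstract (keep all positives, subsample negatives non-uniformly, and reweight by inverse inclusion probabilities); it is not the authors' kernel-based estimator. The pilot-model-based sampling probabilities, the 5x-positives negative subsample size, and the use of scikit-learn's LogisticRegression with sample weights as a stand-in for the weighted estimating equations are all illustrative assumptions.

```python
# Minimal sketch of IPW-style subsampling under class imbalance
# (assumptions noted above; not the paper's implementation).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic, heavily imbalanced data: positives are rare.
n, p = 50_000, 5
X = rng.normal(size=(n, p))
beta = np.array([1.5, -1.0, 0.5, 0.0, 0.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta - 4.0))))

pos = np.flatnonzero(y == 1)
neg = np.flatnonzero(y == 0)

# Pilot fit on a small uniform subsample to obtain rough "informativeness"
# scores for the negatives (an illustrative choice, not the paper's rule).
pilot_idx = rng.choice(n, size=2_000, replace=False)
pilot = LogisticRegression(max_iter=1_000).fit(X[pilot_idx], y[pilot_idx])
scores = pilot.predict_proba(X[neg])[:, 1]

# Non-uniform inclusion probabilities for negatives: negatives the pilot
# model finds harder to separate are sampled more often. The target
# negative subsample size (5x the number of positives) is arbitrary.
m = 5 * len(pos)
pi_neg = np.minimum(1.0, m * scores / scores.sum())
keep = rng.random(len(neg)) < pi_neg  # Poisson sampling of negatives

# Final subsample: every positive plus the sampled negatives,
# weighted by the inverse of their inclusion probabilities (IPW).
idx = np.concatenate([pos, neg[keep]])
weights = np.concatenate([np.ones(len(pos)), 1.0 / pi_neg[keep]])

ipw_fit = LogisticRegression(max_iter=1_000).fit(
    X[idx], y[idx], sample_weight=weights
)
print("IPW-weighted coefficient estimates:", ipw_fit.coef_.round(2))
```

Because all positives are kept and only negatives are thinned, the weighted fit uses far fewer observations than the full data while the inverse-probability weights keep the estimating equations unbiased for the full-data target, which is the efficiency mechanism the abstract describes.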
Active learning
Optimal sampling
Imbalanced data
Label noise
Binary classification
Semi-supervised learning
Main Sponsor
Section on Statistical Learning and Data Science