55: Mitigating Data Imbalance in Credit Card Fraud Detection

Yisong Chen Co-Author
 
Chuanhao Nie Co-Author
 
Yixin Xu Co-Author
 
Chuqing Zhao First Author
Harvard University
 
Chuqing Zhao Presenting Author
Harvard University
 
Tuesday, Aug 5: 10:30 AM - 12:20 PM
1276 
Contributed Posters 
Music City Center 
Credit card fraud poses a significant challenge and leads to substantial financial losses. Although machine learning and deep learning models have been extensively studied in this domain, few studies address the issue of data imbalance, which can bias predictions. In this paper, we explore techniques to address data imbalance, including the Synthetic Minority Oversampling Technique (SMOTE), simple oversampling, and Variational Autoencoders (VAEs). These methods are evaluated using metrics tailored for imbalanced datasets. In real-world scenarios, there is often a trade-off between recall and precision, both of which significantly impact revenue.
Our preliminary results show that SMOTE favors recall (0.897) over precision (0.098) but generates distributionally similar synthetic data, while the VAE achieves better precision (0.903) and generalizability. Combining VAE-generated data with a baseline logistic regression model significantly improves performance (ROC-AUC 0.978), offering a computationally efficient solution for large-scale fraud detection in imbalanced datasets. This study highlights the trade-offs between different techniques and provides a practical solution for fraud detection.
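
As a rough illustration of two of the resampling ideas named above, the following is a minimal sketch using only the Python standard library. It is not the authors' implementation: the toy data, the function names (random_oversample, smote_like, precision_recall), and the parameters k and seed are all assumptions made for this example. Simple oversampling duplicates existing minority rows, while the SMOTE-style routine creates new points by interpolating between a minority point and one of its nearest minority neighbours. The precision and recall helper shows how the two reported metrics are computed; for reference, a precision of 0.098 means that roughly one in ten flagged transactions is actually fraudulent.

```python
import random


def random_oversample(X, y, minority=1, seed=0):
    """Simple oversampling: duplicate random minority rows until classes are balanced."""
    rng = random.Random(seed)
    minority_rows = [x for x, label in zip(X, y) if label == minority]
    n_needed = (len(y) - len(minority_rows)) - len(minority_rows)
    extra = [list(rng.choice(minority_rows)) for _ in range(n_needed)]
    return X + extra, y + [minority] * len(extra)


def smote_like(X, y, minority=1, k=2, seed=0):
    """SMOTE-style oversampling (illustrative): interpolate between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    pts = [x for x, label in zip(X, y) if label == minority]
    n_needed = (len(y) - len(pts)) - len(pts)
    synthetic = []
    for _ in range(n_needed):
        base = rng.choice(pts)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in pts if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return X + synthetic, y + [minority] * len(synthetic)


def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data: 8 legitimate transactions (0) and 3 fraudulent ones (1).
X = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [0.3, 0.3], [0.2, 0.4],
     [0.4, 0.1], [0.1, 0.0], [0.3, 0.2], [2.0, 2.0], [2.2, 1.9], [1.9, 2.3]]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

X_dup, y_dup = random_oversample(X, y)
X_syn, y_syn = smote_like(X, y)
print(y_dup.count(0), y_dup.count(1))
print(y_syn.count(0), y_syn.count(1))
```

Both routines return a balanced training set, but the duplicated rows add no new information, whereas the interpolated rows stay inside the region spanned by the observed fraud cases. In practice, resampling should be applied to the training split only, so that the evaluation metrics reflect the original class imbalance.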

Keywords

Fraud Detection

Synthetic Data

Machine Learning

Neural Network

Deep Learning 

Main Sponsor

Section on Statistical Learning and Data Science