Abstract:
The class imbalance problem is prevalent in many real-world domains and has therefore become an area
of increasing interest for many researchers. In binary classification problems, imbalanced learning
refers to learning from a dataset that is heavily skewed toward the negative class. This
skew causes traditional classification algorithms to perform poorly when predicting
the positive class for new examples. Data resampling is among the most commonly used
techniques for dealing with this problem. It involves manipulating the training data before
applying standard classification techniques. This study presents a new hybrid sampling technique
that can improve the overall performance of a wide range of traditional machine
learning algorithms. The proposed method, HCBST, uses an undersampling technique based on CUST to
undersample majority instances and an oversampling technique derived from SNOCC to
oversample minority instances. The method was implemented in Python 3.5 on Windows.
Its performance was evaluated using classification algorithms from the scikit-learn machine learning
library, namely KNN, SVM, Decision Tree, Random Forest, Neural Network, AdaBoost, Naïve
Bayes, and Quadratic Discriminant Analysis. Eleven datasets with various degrees of imbalance
were used. The performance of each classifier with the proposed technique was compared
with its performance when no sampling was performed. In addition, the proposed method was
compared with eight other sampling techniques:
ROS, RUS, SMOTE, ADASYN, CUST, SBC, CLUS, and OSS. The experimental results
showed that HCBST outperformed the other sampling techniques with most of the classifiers in terms of AUC, G-Mean, and
MCC. On average, HCBST also performed better on most of the
datasets, achieving the highest average scores of 0.73, 0.67, and 0.35 in AUC, G-Mean, and MCC,
respectively, across all the classifiers used in this study. Extensive testing across these
algorithms and performance metrics yielded promising results. A graphical user interface
(GUI) was also developed to support interactive machine learning on class-imbalanced data,
giving users flexibility in choosing the algorithms for a given dataset to achieve higher accuracy.
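
The three metrics reported above can be made concrete with a small sketch. The code below is illustrative only and is not taken from the study: it uses plain Python rather than scikit-learn so that the definitions are visible, the function names are hypothetical, and the labels and scores are made-up toy values. It assumes the positive (minority) class is labeled 1 and the negative (majority) class is labeled 0.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fn, fp, tn), treating label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fn, fp, tn


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity (positive recall) and specificity (negative recall)."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; defined as 0 when the denominator is 0."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy imbalanced data (made up): 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * len(y_true)  # a classifier that ignores the minority class
toy_scores = [0.9, 0.4, 0.5, 0.1, 0.2, 0.3, 0.1, 0.2, 0.3, 0.1]

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
print("accuracy:", accuracy)
print("G-Mean:", g_mean(y_true, always_negative))
print("MCC:", mcc(y_true, always_negative))
print("AUC:", auc(y_true, toy_scores))
```

On this toy data, a classifier that always predicts the negative class reaches 80% accuracy yet scores 0 on both G-Mean and MCC, which illustrates why metrics of this kind are used instead of accuracy when the classes are imbalanced.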