Improving Software Defect Prediction Using Cluster Undersampling
Publisher
University of Ghana
Abstract
Adequately learning from and classifying highly imbalanced datasets has become
one of the most challenging tasks in the Data Mining and Machine Learning
disciplines. Many datasets are affected by the class imbalance problem because
positive examples occur only rarely, and this undermines the ability of
classification algorithms to learn from the data and to correctly classify
positive examples in new datasets. Data sampling techniques presented in the
Data Mining and Machine Learning literature are often used to manipulate the
training data in order to minimize the level of imbalance before classification
models are trained.
This study presents an undersampling technique, termed Cluster Undersampling
Technique (CUST), that can further improve the performance of classification
algorithms when learning from imbalanced datasets. The technique targets the
removal of potentially problematic instances from the majority class in the
course of undersampling: it uses Tomek links to detect and remove noisy or
inconsistent instances, and data clustering to detect and remove outliers and
redundant instances from the majority class (both steps are sketched in the
code after the abstract). The proposed technique is implemented in Java within
the framework of the WEKA machine learning tool, and its performance has been
evaluated in WEKA using the C4.5 and OneR classification algorithms on sixteen
datasets with varying degrees of imbalance. Models trained with CUST are
compared against random undersampling (RUS), random oversampling (ROS),
cluster-based undersampling (CBU), SMOTE, one-sided selection (OSS), and
training without any prior sampling (NONE); a sketch of the evaluation protocol
also follows the abstract. The results for CUST are encouraging compared with
the other techniques, particularly on datasets that have less than 2% minority
instances and large quantities of repeated instances. The experimental results,
measured by AUC and G-Mean, showed that CUST achieved higher performance than
the other methods on most of the datasets.
The average performance of the classification algorithms across the datasets
for each technique also showed that CUST achieved the highest average
performance in all test cases. Statistical comparison of the mean performance
further revealed that CUST performed statistically better than ROS, SMOTE, OSS,
and NONE in all test cases. CUST, however, performed statistically the same as
RUS and CBU, though with a higher mean performance. The results confirm that
CUST is a viable alternative to existing sampling techniques, particularly when
datasets are highly imbalanced with large quantities of repeated instances,
noisy instances, and outliers.
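
The two cleaning steps named above can be illustrated with a short sketch. The class below is a hypothetical illustration, not the thesis implementation: removeTomekLinks drops the majority-class member of each Tomek link, i.e. a pair of opposite-class instances that are each other's nearest neighbour, and clusterMajority groups the majority class with WEKA's SimpleKMeans, discarding singleton clusters as outliers and duplicate instances as redundant. The number of clusters k, the singleton-cluster outlier rule, and the string-based duplicate test are assumptions made for the sketch.

import java.util.HashSet;
import java.util.Set;
import weka.clusterers.SimpleKMeans;
import weka.core.EuclideanDistance;
import weka.core.Instances;

// Hypothetical sketch of the two cleaning steps described in the abstract;
// not the thesis implementation.
public class ClusterUndersampler {

    /** Remove the majority-class member of every Tomek link: a pair of
     *  opposite-class instances that are each other's nearest neighbour.
     *  Assumes a binary class attribute. */
    public static Instances removeTomekLinks(Instances data, double majorityClass) {
        EuclideanDistance dist = new EuclideanDistance(data); // skips the class attribute
        boolean[] drop = new boolean[data.numInstances()];
        for (int i = 0; i < data.numInstances(); i++) {
            int nn = nearestNeighbour(data, dist, i);
            if (nn >= 0 && nearestNeighbour(data, dist, nn) == i
                    && data.instance(i).classValue() != data.instance(nn).classValue()) {
                // Drop only the majority-class end of the link.
                drop[data.instance(i).classValue() == majorityClass ? i : nn] = true;
            }
        }
        Instances cleaned = new Instances(data, data.numInstances());
        for (int i = 0; i < data.numInstances(); i++) {
            if (!drop[i]) cleaned.add(data.instance(i));
        }
        return cleaned;
    }

    /** Cluster majority-class instances; treat singleton clusters as outliers
     *  and keep only the first copy of duplicated instances in each cluster. */
    public static Instances clusterMajority(Instances majority, int k) throws Exception {
        Instances noClass = new Instances(majority);
        noClass.setClassIndex(-1); // SimpleKMeans rejects data with a class attribute set;
                                   // the class column is constant here, so it adds no distance
        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(k);
        km.buildClusterer(noClass);

        int[] assignment = new int[noClass.numInstances()];
        int[] size = new int[k];
        for (int i = 0; i < noClass.numInstances(); i++) {
            assignment[i] = km.clusterInstance(noClass.instance(i));
            size[assignment[i]]++;
        }

        Instances kept = new Instances(majority, majority.numInstances());
        Set<String> seen = new HashSet<>();
        for (int i = 0; i < majority.numInstances(); i++) {
            if (size[assignment[i]] <= 1) continue; // singleton cluster: treated as an outlier
            if (seen.add(assignment[i] + "|" + majority.instance(i))) {
                kept.add(majority.instance(i));     // redundant copies are skipped
            }
        }
        return kept;
    }

    private static int nearestNeighbour(Instances data, EuclideanDistance dist, int idx) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int j = 0; j < data.numInstances(); j++) {
            if (j == idx) continue;
            double d = dist.distance(data.instance(idx), data.instance(j));
            if (d < bestDist) { bestDist = d; best = j; }
        }
        return best;
    }
}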
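
The evaluation protocol can be sketched in the same spirit. The snippet below is again illustrative rather than the thesis code: it runs WEKA's J48, WEKA's implementation of C4.5, under ten-fold cross-validation and reports AUC together with G-Mean, the geometric mean of the per-class true-positive rates. The dataset file name and the choice of class index 1 as the minority class are assumptions; OneR (weka.classifiers.rules.OneR) can be substituted for J48 to reproduce the study's second learner.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EvaluateSampling {
    public static void main(String[] args) throws Exception {
        // Placeholder file name; the thesis evaluates sixteen imbalanced datasets.
        Instances data = DataSource.read("defects.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // C4.5 is available in WEKA as J48.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        double auc = eval.areaUnderROC(1); // AUC, taking class index 1 as the minority class
        // G-Mean for two classes: sqrt(TPR of class 0 * TPR of class 1).
        double gMean = Math.sqrt(eval.truePositiveRate(0) * eval.truePositiveRate(1));
        System.out.printf("AUC = %.3f, G-Mean = %.3f%n", auc, gMean);
    }
}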
Description
Thesis (MPhil) - University of Ghana, 2014