Improving Software Defect Prediction Using Cluster Undersampling
Publisher
University of Ghana
Abstract
Adequately learning from and classifying highly imbalanced datasets has become
one of the most challenging tasks in the Data Mining and Machine Learning
disciplines. Many datasets are affected by the class imbalance problem because
positive examples occur only rarely, and this undermines the ability of
classification algorithms to learn from the data and to correctly classify
positive examples in new datasets. Data sampling techniques presented in the
Data Mining and Machine Learning literature are often used to manipulate the
training data in order to minimize the level of imbalance before classification
models are trained.
This study presents an undersampling technique, termed Cluster Undersampling
Technique (CUST), that can further improve the performance of classification
algorithms when learning from imbalanced datasets. The technique targets the
removal of potentially problematic instances from the majority class in the
course of undersampling: it uses Tomek links to detect and remove noisy or
inconsistent instances, and data clustering to detect and remove outliers and
redundant instances from the majority class (both steps are sketched in the
code after the abstract). The proposed technique is implemented in Java within
the framework of the WEKA machine learning tool, and its performance has been
evaluated in WEKA using the C4.5 and OneR classification algorithms on sixteen
datasets with varying degrees of imbalance. Models trained with CUST are
compared against random undersampling (RUS), random oversampling (ROS),
cluster-based undersampling (CBU), SMOTE, one-sided selection (OSS), and
training without any prior sampling (NONE); a sketch of the evaluation protocol
also follows the abstract. The results for CUST are encouraging compared with
the other techniques, particularly on datasets that have less than 2% minority
instances and large quantities of repeated instances. The experimental results,
measured by AUC and G-Mean, showed that CUST achieved higher performance than
the other methods on most of the datasets.
The average performance of the classification algorithms across the datasets
for each technique also showed that CUST achieved the highest average
performance in all test cases. Statistical comparison of the mean performance
further revealed that CUST performed statistically better than ROS, SMOTE, OSS,
and NONE in all test cases. CUST, however, performed statistically the same as
RUS and CBU, though with a higher mean performance. The results confirm that
CUST is a viable alternative to existing sampling techniques, particularly when
datasets are highly imbalanced with large quantities of repeated instances,
noisy instances, and outliers.
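
The two cleaning steps named above can be illustrated with a short sketch. The class below is a hypothetical illustration, not the thesis implementation: removeTomekLinks drops the majority-class member of each Tomek link, i.e. a pair of opposite-class instances that are each other's nearest neighbour, and clusterMajority groups the majority class with WEKA's SimpleKMeans, discarding singleton clusters as outliers and duplicate instances as redundant. The number of clusters k, the singleton-cluster outlier rule, and the string-based duplicate test are assumptions made for the sketch.

import java.util.HashSet;
import java.util.Set;
import weka.clusterers.SimpleKMeans;
import weka.core.EuclideanDistance;
import weka.core.Instances;

// Hypothetical sketch of the two cleaning steps described in the abstract;
// not the thesis implementation.
public class ClusterUndersampler {

    /** Remove the majority-class member of every Tomek link: a pair of
     *  opposite-class instances that are each other's nearest neighbour.
     *  Assumes a binary class attribute. */
    public static Instances removeTomekLinks(Instances data, double majorityClass) {
        EuclideanDistance dist = new EuclideanDistance(data); // skips the class attribute
        boolean[] drop = new boolean[data.numInstances()];
        for (int i = 0; i < data.numInstances(); i++) {
            int nn = nearestNeighbour(data, dist, i);
            if (nn >= 0 && nearestNeighbour(data, dist, nn) == i
                    && data.instance(i).classValue() != data.instance(nn).classValue()) {
                // Drop only the majority-class end of the link.
                drop[data.instance(i).classValue() == majorityClass ? i : nn] = true;
            }
        }
        Instances cleaned = new Instances(data, data.numInstances());
        for (int i = 0; i < data.numInstances(); i++) {
            if (!drop[i]) cleaned.add(data.instance(i));
        }
        return cleaned;
    }

    /** Cluster majority-class instances; treat singleton clusters as outliers
     *  and keep only the first copy of duplicated instances in each cluster. */
    public static Instances clusterMajority(Instances majority, int k) throws Exception {
        Instances noClass = new Instances(majority);
        noClass.setClassIndex(-1); // SimpleKMeans rejects data with a class attribute set;
                                   // the class column is constant here, so it adds no distance
        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(k);
        km.buildClusterer(noClass);

        int[] assignment = new int[noClass.numInstances()];
        int[] size = new int[k];
        for (int i = 0; i < noClass.numInstances(); i++) {
            assignment[i] = km.clusterInstance(noClass.instance(i));
            size[assignment[i]]++;
        }

        Instances kept = new Instances(majority, majority.numInstances());
        Set<String> seen = new HashSet<>();
        for (int i = 0; i < majority.numInstances(); i++) {
            if (size[assignment[i]] <= 1) continue; // singleton cluster: treated as an outlier
            if (seen.add(assignment[i] + "|" + majority.instance(i))) {
                kept.add(majority.instance(i));     // redundant copies are skipped
            }
        }
        return kept;
    }

    private static int nearestNeighbour(Instances data, EuclideanDistance dist, int idx) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int j = 0; j < data.numInstances(); j++) {
            if (j == idx) continue;
            double d = dist.distance(data.instance(idx), data.instance(j));
            if (d < bestDist) { bestDist = d; best = j; }
        }
        return best;
    }
}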
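
The evaluation protocol can be sketched in the same spirit. The snippet below is again illustrative rather than the thesis code: it runs WEKA's J48, WEKA's implementation of C4.5, under ten-fold cross-validation and reports AUC together with G-Mean, the geometric mean of the per-class true-positive rates. The dataset file name and the choice of class index 1 as the minority class are assumptions; OneR (weka.classifiers.rules.OneR) can be substituted for J48 to reproduce the study's second learner.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EvaluateSampling {
    public static void main(String[] args) throws Exception {
        // Placeholder file name; the thesis evaluates sixteen imbalanced datasets.
        Instances data = DataSource.read("defects.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // C4.5 is available in WEKA as J48.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        double auc = eval.areaUnderROC(1); // AUC, taking class index 1 as the minority class
        // G-Mean for two classes: sqrt(TPR of class 0 * TPR of class 1).
        double gMean = Math.sqrt(eval.truePositiveRate(0) * eval.truePositiveRate(1));
        System.out.printf("AUC = %.3f, G-Mean = %.3f%n", auc, gMean);
    }
}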
Description
Thesis (MPhil) - University of Ghana, 2014