Improving Software Defect Prediction Using Cluster Undersampling

dc.contributor.advisor: Sowah, R.A.
dc.contributor.advisor: Amanquah, N.
dc.contributor.author: Agebure, M.P.
dc.contributor.other: University of Ghana, College of Basic and Applied Sciences, School of Engineering, Department of Computer Engineering
dc.date.accessioned: 2016-11-22T11:03:13Z
dc.date.accessioned: 2017-10-13T17:35:16Z
dc.date.available: 2016-11-22T11:03:13Z
dc.date.available: 2017-10-13T17:35:16Z
dc.date.issued: 2014-07
dc.description: Thesis (MPhil) - University of Ghana, 2014
dc.description.abstract: Adequately learning from and classifying highly unbalanced datasets has become one of the most challenging tasks in the Data Mining and Machine Learning disciplines. Most datasets are adversely affected by the class imbalance problem owing to the limited occurrence of positive examples. This phenomenon adversely affects the ability of classification algorithms to learn from these data and to correctly classify positive examples in new datasets. Data sampling techniques presented in the Data Mining and Machine Learning literature are often used to manipulate the training data in order to minimize the level of imbalance prior to training classification models. This study presents a cluster undersampling technique (CUST) capable of further improving the performance of classification algorithms when learning from imbalanced datasets. The technique targets the removal of potentially problematic instances from the majority class in the course of undersampling: it uses Tomek links to detect and remove noisy/inconsistent instances, and data clustering to detect and remove outliers and redundant instances from the majority class. The proposed technique is implemented in Java within the framework of the WEKA machine learning tool, and its performance has been evaluated in WEKA using the C4.5 and OneR classification algorithms on sixteen datasets with varying degrees of imbalance. The performance of models trained with CUST is compared to that obtained with random undersampling (RUS), random oversampling (ROS), cluster-based undersampling (CBU), SMOTE, one-sided selection (OSS), and with no sampling prior to training (NONE). The results of CUST are encouraging compared to the other techniques, particularly on datasets that have less than 2% minority instances and larger quantities of repeated instances. Experimental results using AUC and G-Mean showed that CUST achieved higher performance than the other methods on most of the datasets.
The average performance of the classification algorithms across the datasets for each technique also showed that CUST attained the highest average performance in all test cases. Statistical comparison of the mean performance revealed that CUST performed statistically better than ROS, SMOTE, OSS, and NONE in all test cases; CUST performed statistically the same as RUS and CBU, but with a higher mean performance. The results confirm that CUST is a viable alternative to existing sampling techniques, particularly when the datasets are highly unbalanced and contain large quantities of repeated instances, noisy instances, and outliers.
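The abstract describes a two-stage pruning of the majority class: first remove majority-class members of Tomek links (noisy/borderline instances), then cluster the remaining majority instances and discard redundant ones. The thesis's actual implementation is in Java within WEKA; the following is only a minimal, self-contained Python sketch of those two ideas, where the majority label being 0, the number of clusters, and the keep-per-cluster budget are illustrative assumptions, not details taken from the thesis.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def tomek_links(X, y, majority=0):
    """Indices of majority-class members of Tomek links.
    A Tomek link is a pair of opposite-class points that are
    each other's nearest neighbours."""
    n = len(X)
    # Nearest neighbour of each point (brute force; fine for a sketch).
    nn = [min(((euclidean(X[i], X[j]), j) for j in range(n) if j != i))[1]
          for i in range(n)]
    links = set()
    for i in range(n):
        j = nn[i]
        if nn[j] == i and y[i] != y[j]:
            links.add(i if y[i] == majority else j)
    return links

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns the final clusters (lists of points)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            c = min(range(k), key=lambda c: euclidean(p, centroids[c]))
            clusters[c].append(p)
        centroids = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl
                     else centroids[c] for c, cl in enumerate(clusters)]
    return clusters

def cluster_undersample(X, y, k=2, keep_per_cluster=2):
    """CUST-style sketch: Tomek-link cleaning, then cluster-based
    reduction of the majority class; the minority class is kept whole."""
    noisy = tomek_links(X, y)
    majority = [X[i] for i in range(len(X)) if y[i] == 0 and i not in noisy]
    minority = [X[i] for i in range(len(X)) if y[i] == 1]
    kept = []
    for cl in kmeans(majority, k):
        if not cl:
            continue
        centre = tuple(sum(v) / len(cl) for v in zip(*cl))
        # Keep only the points nearest each cluster centre,
        # dropping redundant/outlying members.
        kept.extend(sorted(cl, key=lambda p: euclidean(p, centre))[:keep_per_cluster])
    return kept + minority, [0] * len(kept) + [1] * len(minority)
```

On a toy set with two tight majority blobs, one borderline majority point next to the minority pair, and two minority points, the borderline point is removed as a Tomek link and each blob is thinned to its most central members, while the minority class survives intact.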
dc.format.extent: xi, 103p. ill.
dc.identifier.uri: http://197.255.68.203/handle/123456789/8994
dc.language.iso: en
dc.publisher: University of Ghana
dc.rights.holder: University of Ghana
dc.subject: IMPROVING
dc.subject: SOFTWARE
dc.subject: DEFECT PREDICTION
dc.subject: CLUSTER
dc.subject: UNDERSAMPLING
dc.title: Improving Software Defect Prediction Using Cluster Undersampling
dc.type: Thesis

Files

Original bundle

Name: Improving Software Defect Prediction Using Cluster Under sampling_ 2014.pdf
Size: 5.39 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.82 KB
Format: Item-specific license agreed upon to submission