Improving Software Defect Prediction Using Cluster Undersampling

dc.contributor.advisor: Sowah, R.A.
dc.contributor.advisor: Amanquah, N.
dc.contributor.author: Agebure, M.P.
dc.contributor.other: University of Ghana, College of Basic and Applied Sciences, School of Engineering, Department of Computer Engineering
dc.date.accessioned: 2016-11-22T11:03:13Z
dc.date.accessioned: 2017-10-13T17:35:16Z
dc.date.available: 2016-11-22T11:03:13Z
dc.date.available: 2017-10-13T17:35:16Z
dc.date.issued: 2014-07
dc.description: Thesis (MPhil) - University of Ghana, 2014
dc.description.abstract: Adequately learning from and classifying highly unbalanced datasets has become one of the most challenging tasks in the Data Mining and Machine Learning disciplines. Most datasets are adversely affected by the class imbalance problem owing to the limited occurrence of positive examples. This phenomenon adversely affects the ability of classification algorithms to learn from these data and to correctly classify positive examples in new datasets. Data sampling techniques presented in the Data Mining and Machine Learning literature are often used to manipulate the training data in order to minimize the level of imbalance prior to training classification models. This study presents a cluster undersampling technique (CUST) capable of further improving the performance of classification algorithms when learning from imbalanced datasets. The technique targets the removal of potentially problematic instances from the majority class in the course of undersampling: it uses Tomek links to detect and remove noisy/inconsistent instances, and data clustering to detect and remove outliers and redundant instances from the majority class. The proposed technique is implemented in Java within the framework of the WEKA machine learning tool, and its performance has been evaluated in WEKA using the C4.5 and OneR classification algorithms on sixteen datasets with varying degrees of imbalance. The performance of models trained with CUST is compared to that obtained with random undersampling (RUS), random oversampling (ROS), cluster-based undersampling (CBU), SMOTE, one-sided selection (OSS), and with no sampling prior to training (NONE). The results of CUST are encouraging compared to the other techniques, particularly on datasets that have less than 2% minority instances and larger quantities of repeated instances. Experimental results using AUC and G-Mean showed that CUST achieved higher performance than the other methods on most of the datasets.
The average performance of the classification algorithms across the datasets for each technique also showed that CUST attained the highest average performance in all test cases. Statistical comparison of the mean performance revealed that CUST performed statistically better than ROS, SMOTE, OSS, and NONE in all test cases; CUST performed statistically the same as RUS and CBU, but with a higher mean performance. The results confirm that CUST is a viable alternative to existing sampling techniques, particularly when the datasets are highly unbalanced and contain large quantities of repeated instances, noisy instances, and outliers.
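The abstract describes a two-stage pruning of the majority class: first remove majority-class members of Tomek links (noisy/borderline instances), then cluster the remaining majority instances and discard redundant ones. The thesis's actual implementation is in Java within WEKA; the following is only a minimal, self-contained Python sketch of those two ideas, where the majority label being 0, the number of clusters, and the keep-per-cluster budget are illustrative assumptions, not details taken from the thesis.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def tomek_links(X, y, majority=0):
    """Indices of majority-class members of Tomek links.
    A Tomek link is a pair of opposite-class points that are
    each other's nearest neighbours."""
    n = len(X)
    # Nearest neighbour of each point (brute force; fine for a sketch).
    nn = [min(((euclidean(X[i], X[j]), j) for j in range(n) if j != i))[1]
          for i in range(n)]
    links = set()
    for i in range(n):
        j = nn[i]
        if nn[j] == i and y[i] != y[j]:
            links.add(i if y[i] == majority else j)
    return links

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns the final clusters (lists of points)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            c = min(range(k), key=lambda c: euclidean(p, centroids[c]))
            clusters[c].append(p)
        centroids = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl
                     else centroids[c] for c, cl in enumerate(clusters)]
    return clusters

def cluster_undersample(X, y, k=2, keep_per_cluster=2):
    """CUST-style sketch: Tomek-link cleaning, then cluster-based
    reduction of the majority class; the minority class is kept whole."""
    noisy = tomek_links(X, y)
    majority = [X[i] for i in range(len(X)) if y[i] == 0 and i not in noisy]
    minority = [X[i] for i in range(len(X)) if y[i] == 1]
    kept = []
    for cl in kmeans(majority, k):
        if not cl:
            continue
        centre = tuple(sum(v) / len(cl) for v in zip(*cl))
        # Keep only the points nearest each cluster centre,
        # dropping redundant/outlying members.
        kept.extend(sorted(cl, key=lambda p: euclidean(p, centre))[:keep_per_cluster])
    return kept + minority, [0] * len(kept) + [1] * len(minority)
```

On a toy set with two tight majority blobs, one borderline majority point next to the minority pair, and two minority points, the borderline point is removed as a Tomek link and each blob is thinned to its most central members, while the minority class survives intact.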
dc.format.extent: xi, 103p. ill.
dc.identifier.uri: http://197.255.68.203/handle/123456789/8994
dc.language.iso: en
dc.publisher: University of Ghana
dc.rights.holder: University of Ghana
dc.subject: IMPROVING
dc.subject: SOFTWARE
dc.subject: DEFECT PREDICTION
dc.subject: CLUSTER
dc.subject: UNDERSAMPLING
dc.title: Improving Software Defect Prediction Using Cluster Undersampling
dc.type: Thesis

Files

Original bundle

Name: Improving Software Defect Prediction Using Cluster Under sampling_ 2014.pdf
Size: 5.39 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.82 KB
Format: Item-specific license agreed upon to submission