Author: Abedu, S.
Date issued: 2021-01
Date accessioned: 2021-04-13
Date available: 2021-04-13
URI: http://ugspace.ug.edu.gh/handle/123456789/36174
Description: MPhil. Computer Science
Language: en
Type: Thesis
Title: Machine Learning Algorithms on Small-Sized Datasets in Software Effort Estimation: A Comparative Study

Abstract:
Context: Software effort estimation is crucial in the software development process. Overestimating or underestimating the effort for a software project can have consequences for a company's bidding or development process. Over the years, there has been growing research interest in machine learning approaches to software effort estimation. Although deep learning has been described as the state of the art in machine learning, little has been done to assess the performance of deep learning approaches in this domain.

Objective: This study defines a discretization scheme for setting a threshold for a small-sized dataset in software effort estimation. It also investigates the performance of selected machine learning models on small-sized datasets.

Method: Software effort estimation datasets, together with their numbers of project instances and features, were identified from the existing literature and ranked by number of project instances. Eubank's optimal spacing theory was used to discretize the ranking of the project instances into three classes. The performance of selected conventional machine learning models and two deep learning models was assessed on the datasets classified as small-sized. Leave-one-out cross-validation, as recommended by Kitchenham, was adopted to address the training and validation needs of the selected models. The performance of each model on the selected datasets was measured using the mean absolute error (MAE). Robust statistical tests were conducted using Yuen's t-test and Cliff's delta effect size.

Results: The conventional machine learning models achieved better prediction performance than the deep learning models. However, after early stopping regularisation was applied, the deep learning models achieved better prediction accuracy than the conventional machine learning models but still failed to outperform the Automatically Transformed Linear Model (ATLM).

Conclusion: The study concluded that conventional machine learning approaches achieve better performance than deep learning approaches on small-sized datasets. However, applying the early stopping regularisation technique to the deep learning models can improve their performance. Also, a given software effort estimation dataset can be classified as small-sized if it contains fewer than 43 project instances.

Keywords: Deep Learning, Machine Learning, Software Effort Estimation, Small-Sized Dataset
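The evaluation protocol named in the Method section (leave-one-out cross-validation scored with MAE, compared with Yuen's t-test and Cliff's delta) can be illustrated with a minimal sketch. The snippet below is not the thesis code: it uses a synthetic stand-in dataset of 40 projects, a plain linear regressor as a placeholder model, SciPy's trimmed ttest_ind as Yuen's t-test, and a hand-rolled Cliff's delta. All data, model choices, and parameter values are assumptions for illustration only.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut
from scipy.stats import ttest_ind  # trim=0.2 gives Yuen's trimmed t-test (SciPy >= 1.7)

rng = np.random.default_rng(0)
# Hypothetical stand-in for a small effort-estimation dataset (< 43 project instances).
X = rng.random((40, 5))
y = 10 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 1, 40)

def loocv_absolute_errors(model, X, y):
    """Per-project absolute errors under leave-one-out cross-validation."""
    errors = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        errors.append(abs(pred[0] - y[test_idx][0]))
    return np.array(errors)

errs_a = loocv_absolute_errors(LinearRegression(), X, y)  # placeholder "model A"
errs_b = errs_a + rng.normal(0.5, 0.2, len(errs_a))       # placeholder errors for "model B"

# MAE is simply the mean of the per-project absolute errors.
print("MAE model A:", errs_a.mean())
print("MAE model B:", errs_b.mean())

# Yuen's trimmed t-test on the two error distributions.
t_stat, p_value = ttest_ind(errs_a, errs_b, trim=0.2)
print("Yuen's t-test p-value:", p_value)

def cliffs_delta(a, b):
    """Cliff's delta effect size: P(a > b) - P(a < b) over all pairs."""
    a, b = np.asarray(a), np.asarray(b)
    greater = (a[:, None] > b[None, :]).sum()
    less = (a[:, None] < b[None, :]).sum()
    return (greater - less) / (len(a) * len(b))

print("Cliff's delta:", cliffs_delta(errs_a, errs_b))

Leave-one-out cross-validation suits datasets below the 43-instance threshold because every project but one contributes to training in each fold, which is one reason the abstract follows Kitchenham's recommendation for small samples.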