University of Ghana http://ugspace.ug.edu.gh UNIVERSITY OF GHANA COLLEGE OF BASIC AND APPLIED SCIENCES MULTIPLE BUG DETECTION AND EFFORT ESTIMATION FRAMEWORK FOR OPEN-SOURCE PROJECTS BY ANAS HARA (10804091) A THESIS SUBMITTED TO THE SCHOOL OF GRADUATE STUDIES IN PARTIAL FULFILLMENT OF THE AWARD OF DEGREE OF MASTER OF PHILOSOPHY IN COMPUTER SCIENCE DEPARTMENT OF COMPUTER SCIENCE SEPTEMBER, 2021 University of Ghana http://ugspace.ug.edu.gh DECLARATION I Anas Hara, author of this thesis, do now declare that the work presented in this thesis/dissertation; "Multiple Bug Detection and Effort Estimation Framework for Open- source Projects" is my work, produced from research undertaken at the Department of Computer Science, University of Ghana, Legon under the supervision of my thesis supervisors. Other people's work has been duly acknowledged and cited. STUDENT Name: Hara Anas Signature: Date: September 20, 2021 SUPERVISOR Name: Dr. Solomon Mensah Signature: Date: September 22, 2021 CO-SUPERVISOR Name: Dr. Ebenezer Owusu Signature: Date: September 23, 2021 i University of Ghana http://ugspace.ug.edu.gh ABSTRACT Bug reports are essential in the development and maintenance of software. Bug tracking systems allow testers to submit bug reports which allow for report analysis and assignment of reports to fixers to address them. A given bug x is described as multiple bug when it is reported by more than two bug reporters. It is described as a duplicate bug when it was reported by two reporters. In a given pool of bug reports from a tracking system, estimating the effort required to identify multiple bug is a challenge, and hence the need to conduct this study. Thus, a plausible solution based on an effort estimation framework to detect multiple bugs will reduce the effort software testers spend in analyzing bug reports and also improve software reliability and productivity. Although several studies are attempting to solve the problem, there is the need to introduce an effort estimation framework to detect multiple bugs in software projects, specifically open-source projects. However, the following constraints exist when detecting multiple bugs in open-source projects: - (1) a large number of existing bug reports, and (2) much effort is required when detecting and analyzing multiple bug reports. This study seeks to develop a framework to detect multiple bugs and estimate the effort required in identifying such bugs in open-source projects. This study implements the bugdetector tool, which uses bug information and code features to find similar bugs. It will first extract features from bug information in a bug tracking system, next it locates bug methods in source code and extracts bug method code features. It calculates similarities between each overridden and overload method, and finally, it determines which method may cause potentially related or similar bugs. Empirical analysis was conducted on bug reports from two open-source projects, namely Mozilla Firefox and Eclipse. Thus, empirical analysis was conducted on the extracted bug reports by the bugdetector tool. The analysis was conducted using Deep learning algorithms (LSTM, Bidirectional LSTM and CNN) and conventional machine learning algorithms (SVM and Random Forest). Accuracy, Precision, Recall, and F1-score metrics were used to evaluate ii University of Ghana http://ugspace.ug.edu.gh the models' performance. Estimating the required effort for identifying multiple bugs was done using a proposed effort estimation metric. 
Empirical result shows that the deep learning method, namely the Bidirectional LSTM algorithm yielded improved performance for multiple bug detection across the two-studied datasets. Thus, for Mozilla Firefox, the Bidirectional LSTM yielded the best performance accuracy (71.09%), precision (68.30%), and recall (45.7%). For Eclipse, Bidirectional LSTM dominated the best performance about accuracy (82.6%) and F1-score (50.9%). The effort required for detecting multiple bugs on average ranges from 1255.7 to 1383.2 days for the studied Eclipse bug repository, and 1049.8 to 1139.2 days for the studied Mozilla Firefox bug repository. The study concluded that the deep learning method has a better tendency in detecting multiple bugs in open-source projects as compared to the conventional machine learning approach. An effort estimation metric is introduced to compute the effort required to detect multiple bugs in open-source projects. This will assist software testers/fixers to differentiate between severity levels of the detected bugs based on the respective efforts computed. Keywords: Duplicate bugs, Effort estimation, Bug detection, Deep learning, Open-source projects. iii University of Ghana http://ugspace.ug.edu.gh DEDICATION To my family especially my mum (Mrs. Salamatu Hara). iv University of Ghana http://ugspace.ug.edu.gh ACKNOWLEDGEMENT I want to show my sincere appreciation to my supervisors Dr Solomon Mensah and Dr Ebenezer Owusu for the care and support given throughout this study. Also, my gratitude to my lovely family, my brothers and friends that cheered me on to this end. May God bless you all. v University of Ghana http://ugspace.ug.edu.gh TABLE OF CONTENTS DECLARATION ........................................................................................................................ i ABSTRACT .............................................................................................................................. ii DEDICATION ......................................................................................................................... iv ACKNOWLEDGEMENT ......................................................................................................... v TABLE OF CONTENTS ......................................................................................................... vi LIST OF FIGURES .................................................................................................................. ix LIST OF TABLES .................................................................................................................... x LIST OF ABBREVIATIONS .................................................................................................. xi CHAPTER ONE ........................................................................................................................ 1 INTRODUCTION ..................................................................................................................... 1 1.1 Background and Motivation ....................................................................................... 1 1.2 Statement of Problem .................................................................................................. 3 1.3 Scope of the Study ...................................................................................................... 4 1.4 Research Objectives .................................................................................................... 
4 1.4.1 Global Objective .................................................................................................. 4 1.4.2 Specific objectives ............................................................................................... 4 1.5 Research Input and Contribution ................................................................................ 5 1.5.1 Research Input ..................................................................................................... 5 1.5.2 Academic Input ................................................................................................... 5 1.5.3 Practical Contribution .......................................................................................... 5 1.6 Organization of the Thesis .......................................................................................... 5 CHAPTER TWO ....................................................................................................................... 7 LITERATURE REVIEW .......................................................................................................... 7 2.1 Introduction ................................................................................................................. 7 2.2 Overview ..................................................................................................................... 7 2.3 Find Duplicate Bugs from Program Inconsistency ..................................................... 8 2.4 Techniques for Finding Duplicate Bugs ..................................................................... 9 2.5 Duplicate Bugs Fix Effort Estimation using Machine Learning .............................. 13 2.6 Estimation Approach in Software project ................................................................. 15 2.7 Effort Estimation Accuracy ...................................................................................... 18 2.8 Chapter Summary ..................................................................................................... 19 CHAPTER THREE ................................................................................................................. 20 RESEARCH METHODOLOGY ............................................................................................ 20 vi University of Ghana http://ugspace.ug.edu.gh 3.1 Introduction ............................................................................................................... 20 3.1.1 Selection Criteria ................................................................................................... 20 3.1.2 Inclusion Criteria ................................................................................................... 20 3.1.3 Exclusion Criteria .................................................................................................. 20 3.2 Search process ........................................................................................................... 21 3.3 Classification of Papers ............................................................................................ 21 3.4 Case Study Setup ...................................................................................................... 22 3.4.1 Bug Reports from Mozilla Firefox ........................................................................ 23 3.4.2 Bug Reports from Eclipse ..................................................................................... 
24 3.5 Dataset Preprocessing ............................................................................................... 25 3.6 Feature Extraction ..................................................................................................... 25 3.6.1 TF-IDF .................................................................................................................. 26 3.6.2 GloVE ................................................................................................................... 27 3.7 Traditional Machine Learning and Deep Learning Models ...................................... 27 3.7.1 Traditional Machine Learning ........................................................................... 27 3.7.2 Deep Learning ................................................................................................... 28 3.7.3 Bidirectional LSTM model ................................................................................ 28 3.8 Multiple Bug Detection and Effort Estimation Framework ..................................... 31 3.9 Performance Evaluation ............................................................................................ 32 3.10 How much effort is required to identify and resolve a multiple bug? ................... 32 3.10.1 Path Extraction .................................................................................................. 34 3.11 Bugs detection stages ............................................................................................ 35 3.12 Effort Collection Requirements to Improve Bug Detection Process .................... 36 3.13 Processes of Dataset Framework .......................................................................... 38 3.14 Bug Detection and Evaluation Tools .................................................................... 39 3.15 Chapter Summary .................................................................................................. 40 CHAPTER FOUR ................................................................................................................... 41 4.1 Review of Existing Techniques for Bug Detection after Inclusion/Exclusion Criteria 41 4.2 Results from Techniques for Multiple Bug Detection .............................................. 41 4.3 Research questions .................................................................................................... 42 4.4 Bidirectional LSTM neural network ......................................................................... 43 4.5 Comparison between performance results from applying BiLSTM, CNN, LSTM, SVM, and Random Forest ................................................................................................... 44 4.6 Comparison between BiLSTM, LSTM, CNN, SVM and Random Forest results .... 46 vii University of Ghana http://ugspace.ug.edu.gh 4.7 Mean Identification Duration/Estimation Results for Multiple Bugs ....................... 48 4.8 Chapter Summary ..................................................................................................... 49 4.9 Threats to validity ..................................................................................................... 49 CONCLUSION ....................................................................................................................... 51 5.1 Conclusion ................................................................................................................ 
51 5.2 Recommendation and Future Work .......................................................................... 52 REFERENCES ........................................................................................................................ 53 viii University of Ghana http://ugspace.ug.edu.gh LIST OF FIGURES Figure 1. Order of the selection process (SLR protocol) ......................................................... 22 Figure 2. Bidirectional LSTM framework (Sanchoy et al.[69]) ............................................. 30 Figure 3. Multiple bug detection and effort estimation framework(mBugDEE) .................... 31 Figure 4. Severity levels of bugs ............................................................................................. 35 Figure 5. Bugs detection cycle ................................................................................................ 36 Figure 6. Bugs detection process ............................................................................................. 38 Figure 7. Multiple bugs detection and fixing algorithm .......................................................... 39 Figure 8. Search results of overall papers and final papers after inclusion and exclusion criteria ................................................................................................................................................. 42 Figure 9. Final papers obtained after inclusion and exclusion criteria .................................... 42 Figure 10. Comparison between training and validation (loss and accuracy) ......................... 44 Figure 11. Comparison between BiLSTM, CNN, SVM, LSTM and Random Forest results in Mozilla Firefox dataset ............................................................................................................ 47 Figure 12. Comparison between BiLSTM, CNN, SVM, LSTM and Random Forest results in Eclipse dataset ......................................................................................................................... 47 ix University of Ghana http://ugspace.ug.edu.gh LIST OF TABLES Table 1.Distribution of search results from selected journals ................................................. 22 Table 2. Characteristics of multiple bug reports used in the experiment ................................ 23 Table 3. Description of case study projects ............................................................................. 25 Table 4.Machine learning dataset training details ................................................................... 28 Table 5. Deep learning dataset training details ........................................................................ 28 Table 6. Performance of the models based on the Mozilla Firefox dataset ............................. 45 Table 7. Performance of the models based on the Eclipse dataset .......................................... 45 Table 8. BiLSTM improvement compared to SVM, CNN, LSTM and Random Forest ........ 46 Table 9. Mean identification duration for multiple bug reports (Mozilla Firefox) ................. 48 Table 10. Mean identification duration for multiple bug reports (Eclipse) ............................. 
48 x University of Ghana http://ugspace.ug.edu.gh LIST OF ABBREVIATIONS AST Abstract syntax trees BFC Base functional component EA Exact-accuracy GloVe Global Vectors for Word Representation GUI Graphical user interface IR Information retrieval ML Machine learning MMRE Mean magnitude of relative error MAE Mean absolute error NLP Natural language processing SEE Software effort estimation SVM Support vector machine LSTM Long-short term memory LDA Latent dirichlet allocation PLSA Probabilistic latent semantic analyse TF-IDF Term frequency Inverse Document Frequency xi University of Ghana http://ugspace.ug.edu.gh CHAPTER ONE INTRODUCTION 1.1 Background and Motivation Estimating the effort required to identify bugs in a pool of bug reports from a tracking system is a challenge for software maintenance researchers and practitioners [1]. Users can submit bug reports through a bug tracking system that allows for report analysis and the assignment of reports to fixers to address them [2]. However, it has been observed that bug reports are multiple bugs when more than two testers report the same bug, normally from the source code. A bug is regarded as a duplicated bug when it is reported by two testers or reporters [2]. Bug reports can then be used to guide software maintenance behavior and help in the development of more secure software systems. Recognizing software bug reports can aid in the triaging process, allowing developers to prioritize and address the most critical complaints first [3]. Developers frequently get a large number of bug reports and may be unable to handle them due to various constraints such as effort and time limits. Prioritization of bug reports is a time-consuming and difficult process [3]. Thus, a plausible solution for an effort estimation framework to detect multiple bugs will reduce the effort the developer and tester spend on analyzing bug reports while also improving software reliability and productivity. Advaita bug detection has been proven to be a good approach to assist developers in finding issues early in the software development process, saving time and effort [4]. However, they are limited in their ability to detect flaws that use numerous approaches and have a high percentage of false positives [5]. Regrettably, it appears that the research community has overlooked this crucial aspect of software engineering. Furthermore, the available databases 1 University of Ghana http://ugspace.ug.edu.gh are massive and difficult to study manually [5]. As a result, an effort estimation framework model is required to anticipate the effort required in identifying and resolving all reported bugs [6]. Given all the advantages, someone would expect a well-founded theory and several applications in the area of effort estimation framework from available data [6]. Although rapid developments in computing technology have resulted in powerful multi-gigahertz CPUs, software stability has not caught up. Despite the rising need for software to be trustworthy, software program bugs continue to be common. System requirements must meet strict uptime requirements and must continue to operate even if hardware or software failures occur [7]. They may also be required to be monitored and debugged. Programs, like anything else created by humans, contain thoughtless errors or misconceptions, which we refer to as bugs or flaws [8]. 
Some bugs, such as null pointer dereference and double-free, cause system failures; others, such as memory leaks, degrade system performance; and yet others, such as memory leaks, do not cause failure but do not produce the expected results. Testing, human code review, static analysis, dynamic analysis, and other methods have all been developed to find bugs and repair them [9]. This study was motivated by the fact that many overridden or overloaded methods can be found in large software projects written in programming languages [10]. Because all overridden or overloaded methods follow the same logic, it is evident that they could produce related or comparable errors [10]. Meanwhile, open-source software project developers frequently keep open bug mines. Users can report issues they find while using software in a "bug mine," and developers can confirm and solve the reported bugs in the next version. 2 University of Ghana http://ugspace.ug.edu.gh 1.2 Statement of Problem Previous studies [1] [11] [12] have shown that in a given pool of bug reports from a tracking system, estimating the effort required to identify multiple bugs is a challenge because bug fixers normally go through a series of challenges identifying and estimating highly prioritized bugs from a given bug report repository to fix or resolve. As a result, a series of techniques have been introduced. One of the techniques is triage [11], which attempts to identify if a given report is of high priority and meaningful. This study seeks to prioritize bugs that have been reported by more than two reporters, hence the name "multiple bugs." Due to the vast nature of the existing bug reports, it is problematic for the triage [11] to evaluate all previous bug reports to discover multiple bugs. Only a small number of similar bug reports are obtained by the triage [12]. which then compares the new bug report to each retrieved bug report to check if it has been multiplied. Otherwise, the triage [12] must conclude that no multiplication has occurred. The difficulty of finding the main bug report for each new report inside the matching proposed bug list may then be considered as multiple-bug-report detection [13]. The potential problems or risks may be discovered while managing the projects that may use some shared resources. Therefore, the risk may be shared across multiple software projects. Less effort estimation techniques have been developed to accommodate the detection of multiple bugs in a pool of bug reports by software testers [14]. If traditional techniques are used for estimating the effort required for detecting multiple bugs in software projects, the findings will undoubtedly be erroneous [14]. On the other hand, current practise in software project work estimating is based on a single iteration [15]. In the case of open-source projects, some features, such as persistent changes in levels of scope or product, are not emphasised [15]. Software complexity and human resource issues, such as technical and analytical skills 3 University of Ghana http://ugspace.ug.edu.gh [16]. As a result, to anticipate the development work of software projects, an effort estimation model for detecting multiple bugs is being theoretically and empirically investigated to assist fixers to identify and resolve such priority bugs [17]. 1.3 Scope of the Study This study focuses on the detection of multiple bugs and estimates the effort required to detect multiple bugs. 
Established bug report data from Mozilla Firefox and Eclipse open-source projects archived in the Github repository is analyzed for this study. 1.4 Research Objectives 1.4.1 Global Objective The main objective of the study is to develop a framework to detect multiple bugs and estimate the effort required to detect multiple bugs in open-source projects. 1.4.2 Specific objectives The study addresses the following objectives: 1. To review existing techniques from the literature for detecting bugs in software projects. 2. To develop a framework to detect instances of multiple bugs in open-source projects. 3. To define an effort estimation metric for detecting multiple bugs in open-source projects. 4 University of Ghana http://ugspace.ug.edu.gh 1.5 Research Input and Contribution 1.5.1 Research Input According to Li and Reformat [18] the knowledge input of a thesis can be a strategic technique, prototypical, construct, method, or instantiation. This study adds to the pool of information by considering the prospect of multiple bug detection techniques in a software project that is lacking. The research covers the application of a known tool, triage, in the software management system. Thus, this is an exaptation study according to Li and Reformat [18]. 1.5.2 Academic Input The study will ascertain whether the deep learning model has a better tendency than the conventional machine learning model at detecting multiple bugs in open-source projects. 1.5.3 Practical Contribution The study makes the following applicable contributions: 1. A framework to detect multiple bugs in open-source projects. 2. An effort estimation metric for estimating the time required to identify multiple bugs. 1.6 Organization of the Thesis Chapter two presents a literature review on bug detection in software projects. Chapter three discusses the framework and methodology employed for the study. The dataset used is discussed, with data pre-processing and experimental setup explained. 5 University of Ghana http://ugspace.ug.edu.gh Chapter four presents the experimental results and discussions from the empirical analysis conducted in this study. Chapter five presents the conclusion and future work of the study. 6 University of Ghana http://ugspace.ug.edu.gh CHAPTER TWO LITERATURE REVIEW 2.1 Introduction As mentioned in the previous chapter, the aim of this thesis is to develop a framework to estimate the effort required to detect multiple bugs in open-source projects by comparing its performance with machine learning techniques. Therefore, this chapter presents the literature review. The literature review focuses on machine learning techniques in bug detection and the need for this study. The chapter is divided into seven sections. The first section presents the introduction. The second section presents an overview of a framework for finding bugs in software projects. The third section presents a review of bug finders in a software project. The fourth section identifies the duplicate bug tracking system. The fifth section presents techniques for finding duplicate bugs. The sixth section uses a machine learning technique to describe approaches for estimating effort in software development. The seventh section discusses software project estimation methods. 2.2 Overview This section introduces the DeepBugs framework [19] for constructing name-based bug detectors using machine learning. The fundamental idea is to teach a classifier to distinguish between codes with and without name-related bugs [19]. 
A "bug pattern" is a series of programming errors that break the same routines [20]. Bug patterns include accidentally switching a role's arguments, using the wrong API calls, and using the wrong binary operator [19]. Hand-built bug checkers like FindBugs and Error-Prone use bug patterns that require distinct analysis [21]. Using DeepBugs to create a bug detector needs several processes [19]. 7 University of Ghana http://ugspace.ug.edu.gh (1) Create training data from the corpus. This stage selects good instances from a corpus and produces negative examples [17]. Inferred positive code examples are unlikely to be affected by the bug pattern because we assume the bulk of the codes are accurate [22]. DeepBugs uses basic code alterations to build negative training examples [19]. (2) Develop a model that can distinguish between correct and incorrect data. Given two sets of codes containing positive and negative examples, this stage trains a classifier to distinguish between them. The classifier is a feedforward neural network. (3) Detects previously unknown code flaws. This phase employs the previous classifier to determine if a new piece of code is affected by the bug pattern [17]. The approach warns the developer if the trained model thinks the code is wrong. An example of this is a bug detector that looks for incorrectly ordered function arguments [21]. A possible problem in Step 6 requires swapping the dim and x-dim variables. As a result, the approach may detect such inaccuracies because the learnt classifier generalizes beyond the training data. That width and x dim are pairwise semantically similar in the case that allows DeepBugs to find the bugs [19]. 2.3 Find Duplicate Bugs from Program Inconsistency The find duplicate bugs from programme inconsistency technique is used by prominent and successful commercial static analysers, such as Coverity [23] . One of the most difficult aspects of discovering duplicate defects is determining which rule a piece of programme code should obey[24]. Allowing the programmer to specify it, either by saving it to an extra file or using a special comment type, is one option [25]. Program inconsistency identification, the approach presented in this section, may automatically extract such rules from the source code rather than the 8 University of Ghana http://ugspace.ug.edu.gh programmer [26]. Program "beliefs" are the key to this technique [27]. A dereference of a pointer p indicates that p is non-null, and a function call unlock (l) implies that the lock l was locked, which are examples of programme "beliefs." They are divided into two categories for "beliefs," one for "must believe" and the other for "may believe." The phrase "must believe" is pulled from code, and there is no doubt that the programmer believes it. When a pointer is dereferenced, the programmer assumes that the pointer is not null [27]. "May believe" refers to when we see a pattern in the code that suggests a fact, but it could just be a coincidence [27]. A call to the function "adds" followed by a call to "dec" could indicate that the programmer believes they must be called painstakingly, but it could merely be a coincidence [28] . While this "may belief" is highly likely to be true if this pairing occurs 999 times out of 1000 [22]. 2.4 Techniques for Finding Duplicate Bugs Automatic systems for finding software faults need to know the rules that a program must follow, or "specifications" [29]. Finding duplicate bugs in a software project can be done in a variety of ways [30]. 
Code inspections can be quite effective in detecting duplicate bugs, but they come with the apparent drawback of needing a lot of manual labor [31]. Human observers are also techniques, but they are susceptible to being influenced by what a piece of code is supposed to perform [32]. The advantage of automatic approaches is their relative objectivity [33]. Static analysis evaluates code without running it and can uncover potential security violations (e.g., SQL injection), as well as runtime faults (e.g., dereferencing a null pointer), as well as logical contradictions [21]. While there is a wealth of literature on the algorithms and analytical frameworks employed by such tools, publications describing real-world experiences with such tools are harder to come by in terms of the sophistication of their analysis methodologies. FindBugs does not push the envelope; rather, it is intended to determine what types of faults may be effectively discovered using relatively simple 9 University of Ghana http://ugspace.ug.edu.gh procedures, as well as to assist us in understanding how such tools might be integrated into the software development process [21] . FindBugs has been downloaded over 580,000 times and is used by several large corporations and software projects [21]. Testing and assertions are dynamic procedures that rely on a program's runtime behavior [34]. This can be both a benefit and a drawback. The benefit is that dynamic techniques do not consider infeasible routes [34]. The problem is that they are usually confined to discovering duplicate bugs in the executed program pathways [34]. Furthermore, the time necessary to execute a full suite of tests can be prohibitively long. Static approaches, in contrast to dynamic techniques, can examine abstractions of all potential program behaviors and are therefore not constrained by the quality of test cases to be effective [33]. The complexity of static approaches varies, as does their ability to find or eradicate duplicate bugs. A formal proof of correctness is the most effective (and difficult) static technique for removing duplicate bugs [35]. Having a correctness proof is probably the best way to ensure that software is free from duplicate bugs [35]. These strategies demonstrate that a program's intended property holds for all conceivable executions. These techniques may be complete or incomplete; if incomplete, the analysis may be unable to verify that the required attribute exists for some right program [36]. Finally, while ineffective approaches can detect "possible" duplicate bugs, they can also miss true duplicate bugs and inaccurate warnings [37]. There are several attempts to find duplicate bugs from available data reported in the literature. Sharma et al. [38] introduced a priority prediction strategy based on SVM, neural networks, NB, and KNN. The proposed approach enables the prioritization of bug reports. The results indicated that, with the exception of the NB technique, the accuracy of the machine learning techniques utilized in estimating the importance of bug reports inside the project was greater than 70%. 10 University of Ghana http://ugspace.ug.edu.gh Badashian et al. [39] developed a new strategy for bug triage that involves the utilization of Q&A community sites such as GitHub, which is a source of developer expertise. This approach extracts keywords from bug reports by utilizing keywords (from metadata), project languages, tags from "GitHub" or "Stack overflow," as well as title and description fields. 
These terms are synonymous with social expertise. They examined this strategy using bug reports from 20 GitHub projects and discovered that it had an accuracy of 89.43 %. Mani et al. [40] suggested a deep bidirectional recurrent neural network (DBRNN-A) model for the approach. The model is for categorizing appropriate developers based on the title and features of individual software bug reports, utilizing Naive Bayes, cosine distance, SVM, and SoftMax. Experiments with software project bug reports (e.g., Google Chromium, Mozilla Core, and Mozilla Firefox). It demonstrated a precision of 47% for Google Chromium, 43% for Mozilla Core, and 56% for Mozilla Firefox. Lyubinets et al. [3] describe an RNN-based model for labelling bug reports. Accuracy was achieved in the 56–88 percent range. Nguyen et al. [41] introduced DBTM, a method for detecting duplicate bug reports that takes into account not only IR-based but also topic-based aspects. Each bug report is modelled by DBTM as a textual document explaining one or even more technical difficulties, and duplicate bug reports are modelled as those reporting the same technical issue. With historical data, including duplicate reports that have been detected, DBTM may learn sets of phrases expressing the same technical concerns and detect further not-yet-identified duplicates. Our empirical examination of real-world systems demonstrates that DBTM can increase the accuracy of state-of-the-art techniques by up to 20%. 11 University of Ghana http://ugspace.ug.edu.gh Shokripour et al. [2] presented a two-phased approach based on allocation recommendations on the bug's expected location. This approach makes use of the source code as well as the title and description fields, from which meaningful terms are retrieved. They used bug reports from Eclipse and Firefox to reach an assessment accuracy of 89.41 percent and 59.76 percent, respectively. However, the quantity of bug complaints is significantly lower than in previous studies. Shokripour et al. [42] introduced an ABA-time-TF-IDF approach, a strategy for automatic assignment based on TF-IDF time metadata. A corpus is built using words, and developers are identified and recommended for specific expertise. Shokripour compared the ABA-time-TF-IDF to the ABA-time-TF-IDF, the NB, the VSM, the SUM, and the SVM in Eclipse, NetBean , and ArgoUML projects. As a result, accuracy and mean reciprocal rank (MBR) improved by up to 11.8 and 8.94%, respectively. Bhattacharya and Neamtiu [43] suggested an approach based on enhanced classification to increase the accuracy of bug report triage and shorten the tossing pathways. It utilizes the NB and a tossing graph to extract features such as TF-IDF or bag-of-words (BOW). It takes advantage of meta data from various sorts of bug reports, title, description, keywords, product, component, and the most recent developer actions. They evaluated this technique using bug reports from Eclipse and Mozilla, achieving 77.43 percent accuracy for Eclipse and 77.87 percent accuracy for Mozilla. Bashar et al. [3] used a 5-layer deep learning RNN-LSTM neural network to estimate the priority of bug reports, comparing the results to those from Support Vector Machine (SVM) and K-nearest neighbors (KNN). The suggested RNN-LSTM model's performance was evaluated using the JIRA dataset, which contains over 2000 bug reports. The proposed model was found to be 90% accurate when compared to KNN (74% accuracy) and SVM (87 percent). 12 University of Ghana http://ugspace.ug.edu.gh Sawarka et al. 
[44] introduced a new technique for estimating the time required to fix a newly reported bug. The developed algorithm is validated against a publicly available database. A comparative analysis is performed in which the output of the algorithm is compared to the output of the algorithm using various machine learning approaches. A highest accuracy of 64% is achieved, which is superior to any previously proposed algorithm in this domain. 2.5 Duplicate Bugs Fix Effort Estimation using Machine Learning There are several attempts at estimating efforts from available data reported in the literature. A study by Mahmood et al. [45] used NASA's KC1 dataset to present a solution for estimating software defect fix effort using self-organizing neural networks. The basic idea behind their work was to identify the most similar earlier-issued duplicate bug and use the reported normal time as an estimate. To investigate the effort for new issue reports or duplicate bug reports, they used the nearest neighbour approach [46]. They found out that the association rule mining technique is better at predicting the effort compared with other machine learning techniques like PART, C4.5, and Naive Bayes. The empirical evaluation is based on the Eclipse bug database [46]. The proposed model can correctly predict up to 34.9% of the bugs into a discretized log scaled lifetime class. Although the authors introduced a method for developing an effort estimation model, they did not present such a model. A study by Ahsan et al. [47] concluded that the effort estimation using programmer involvement is lower than the estimation from product metrics. He obtained effort data from a commercial company and also collected a set of product metrics like lines of code, Halsted vocabulary, object and complexity metrics. He designed a duplicate set of neural networks based on a different combination of hidden layers and the number of perceptron per layer. He used different 13 University of Ghana http://ugspace.ug.edu.gh combinations of product metrics to train the model. He performed the experiments using 33,000 different models of neural networks and collected the best data [47]. His result shows that neural networks can be used for an effort estimation model. A huge number of experiments are necessary to obtain the best framework. The majority of effort estimation work is associated with a commercial or closed software system [47]. Very few research papers for OSS are available, especially in the area of duplicate bug fix effort estimation [48]. Most of the mentioned work used machine learning techniques to train the model using available effort data, whereas in our case we do not have effort data. Therefore, we have to extract the effort data from OSS to perform the experiments [48]. A study by Sharma et al.[49] showed that SVM and neural networks yielded improved performance as compared to Naive Bayes and KNN in predicting the priority of reported bugs. A study by Chaturvedi and Singh [50] classified bug severity and found that neural networks yielded an improved performance as compared to KNN, SVM, Naive Bayes Multinomial, RIPPER, and J48 in defining the class of bug severity. A recent study by Sawarkar et al. [51] recommended SVM with sigmoid functions as the best classifier, followed by Random Forest for predicting bug estimation times for newly reported bugs. Malhotra et al. 
[52] showed that the single-layer perceptron (neural networks) is the best approach among all the techniques used in this study for the development of a defect prediction model. A study by Baarah et al. [53] determined the class of bug severity and found that Logistic Model Trees yielded improved performance as compared to Naive Bayes, Naive, SVM, KNN, and Random Forest in predicting the severity level of software bug reports in closed-source projects. 14 University of Ghana http://ugspace.ug.edu.gh A study by Ahsan et al. [54] used NASA's SEL bug dataset and applied association rule mining to the dataset to detect "duplicate bugs" using intervals. They discovered that association rule mining outperforms other machine learning algorithms such as PART, C4.5, and Naive Bayes in detecting duplicate bugs [54]. Although the authors introduced a method for developing duplicate bug detection models, they did not present such a model. The problems of programmer engagement and effort modelling for OSS were discussed[7]. He worked with data from the ECLIPSE project and calculated the effort based on programmer participation and product metrics [55]. He concluded that detecting duplicate bugs based on programmer engagement is less accurate than detecting duplicate bugs based on product metrics [56]. They used Microsoft Project Visual Studio data for their research work [56]. Their statistical research revealed that the real detection error is positively connected with the featured size, as well as the process metrics and the detection error. They developed a duplicate detection model using a neural network [57]. He designed a duplicate set of neural networks based on a different combination of veiled layers and the sum of perceptron per layer [57]. He performed the experiments using different models of a neural network and collected the best data. His result shows that a neural network can be used to find duplicate bugs in software projects [57]. A huge number of experiments are necessary to obtain the best framework. Most of the duplicate bug detection work is related to commercial or closed software systems [29]. Very few research papers for OSS are available, especially in the area of duplicate bug fixing. 2.6 Estimation Approach in Software project The expedition for better software systems, value, and consistency is not an ending point. Several techniques have been presented to help testers or developers detect and resolve duplicate bugs 15 University of Ghana http://ugspace.ug.edu.gh [24]. To test the application starting with static techniques (database analysis, duplicate bugs, duplicate bug prediction, classical inspection, and authentication). Bug detection, for example, assists testers or developers in detecting several bugs early by analyzing source code directly and determining whether or not a specific source code is buggy [58]. It has been demonstrated that detecting duplicate bugs improves software quality and dependability [59]. The existing state-of- the-art bug detection techniques are divided into the following phases or categories: • Duplicate bug detection based on rules: Several programming criteria are set in this method to statically detect common programming errors or problems. FindBugs is a common example of this kind of model [21]. Although this kind of method is quite active, novel directions for detecting new forms of duplicate bugs. 
• Bug identification based on data mining: Mining-based techniques depend on removal of current source code to get around pre-defined rules [60]. Typically, exploitation data mining techniques generate implicit software design instructions after database source codes and detect violations of the retrieved rules as probable duplicate bugs [60]. These mining-based techniques still have a significant flaw in that they cannot distinguish between wrong codes and infrequent or rare codes, resulting in a high false-positive rate. • Duplicate bug identification with machine learning [46]: Through the advancement of machine learning (ML), particularly deep learning models, numerous ways to learn from previously known and reported duplicate defects and detect bugs in new code have been presented [61]. While ML-based duplicate bug detection models depend on feature selection techniques, duplicate bug detection models use the capacity to learn essential features from training data. 16 University of Ghana http://ugspace.ug.edu.gh This technique is shown to have advantages over existing ML-based duplicate bug identification algorithms. Deep learning techniques are still limited to detecting duplicate defects in single operations without taking their interdependencies into account. In practice, there are various instances when duplicate problems occur in multiple methods [62]. To determine whether a method is flawed or not, a model must take into account other methods that share data or control with the method under review. Thus, deep learning-based algorithms have high false-positive rates, making their regular use by software engineers impracticable [23]. That is, approximately one in three reported duplicate problems is a false positive, wasting engineers' time [63] . Duplicate bug reports can be bundled and processed sequentially[12]. Techniques for decreasing duplicate bug reports include NLP, IR, and similarity-based classification. Artificial Intelligence (AI) is used to find duplicate bug reports [46]. Because bug reports are written in human language, distinct terminology is commonly used [64]. Methods (including stemming, lemmatization, and stopword processing) are used to improve accuracy. Topic modelling is the most often used NLP technique, processing words from bug reports or summary text and selecting criterion words. The topic picked represents a few features of the problem report and specifies a duplicate bug report. The most common subject modelling technique is the LDA, NB, and NB polynomial [65]. We discovered duplicate bug reports using the LDA, NB, and NB polynomial [2]. This was done using Eclipse bug reports to compare the results [66]. As a result, nearly 80% of duplicate bug reports were discovered. They also showed substantial differences in employing statistical approaches, including the LDA and N-gram (LNG) technology [2]. A linearly connected weight-based N-gram similarity and topic modelling using the existing LDA approach. The N-gram (LNG) approach has better recall, precision, and EA rates than existing algorithms in detecting bug report duplication [2]. 17 University of Ghana http://ugspace.ug.edu.gh When DTBM, a cutting-edge approach for detecting bug report duplication, was used, the recall rate increased by 2.96 percent to 10.53 percent [2]. Meta data such as bug type, operating system, priority, and severity aid in the identification of similar bug reports [37]. 
Approaches to identifying duplicate bug reports based on information retrieval are frequently used together with other methods that focus on contextual information to deduplicate bug reports [63]. One such approach extracted context from bug reports using BM25F, a scoring algorithm commonly used in information retrieval [63]. The term list was constructed from a non-functional requirement relating to software quality as well as the subject of the bug report, and as a result the number of duplicate bug reports was cut in half. Even in the same setting, bug reports can have diverse solutions. A related approach likewise combined information retrieval with context-focused methods for bug report deduplication [67] and employed BM25F to extract context from bug reports [68]. In that work, non-functional requirements linked to software quality and the nature of the bug reports influenced the term list, and the number of duplicate bug reports fell by 11.55 percent.

2.7 Effort Estimation Accuracy

A variety of prediction accuracy statistics have been used in the literature to determine whether one effort estimation model produces better results than another. The Mean Magnitude of Relative Error (MMRE) is an average error that indicates how much the predictions overestimate or underestimate the true value [45]. The magnitude of relative error (MRE) is derived by dividing the absolute difference between the actual and estimated values by the actual value [45]. The mean MRE (MMRE) is therefore the average value of this indicator across all observations in the dataset; in general, a lower MMRE score suggests a more accurate model. The coefficient of determination (R2), on the other hand, indicates how much of the variation can be explained by the independent variables: when R2 approaches 1, the independent and dependent variables have a strong relationship. The PRED(k) prediction-level parameter represents the quality of the predictions as the proportion of estimates whose relative error is within k percent of the actual value. The standard formulations of these measures are given below.
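With y_i denoting the actual value, \hat{y}_i the estimated value, n the number of observations, and \mathbf{1}[\cdot] the indicator function, the standard formulations consistent with the descriptions above are:

\mathrm{MRE}_i = \frac{|y_i - \hat{y}_i|}{y_i}, \qquad
\mathrm{MMRE} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{MRE}_i, \qquad
\mathrm{PRED}(k) = \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\!\left[\mathrm{MRE}_i \le \frac{k}{100}\right]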
2.8 Chapter Summary

The chapter reviewed literature related to duplicate bugs and effort estimation. The review acknowledged that there has been considerable progress in research on machine learning algorithms for effort estimation and for finding duplicate bugs. Beyond the related work that provided important background for the review, the publications indicate that no work has been done to investigate an effort estimation framework for detecting multiple bugs.

CHAPTER THREE
RESEARCH METHODOLOGY

3.1 Introduction

This chapter presents the empirical framework and the experimental set-up of the study. The empirical framework covers the data pre-processing procedures, feature selection techniques, and model evaluation techniques used in the study, together with the datasets and models that were used. The experimental setup section describes the model design technique.

3.1.1 Selection Criteria

To select the studies used to inform the multiple bug detection models, the following inclusion and exclusion criteria were defined.

3.1.2 Inclusion Criteria
i. The paper was published between 2010 and 2020.
ii. The paper was published in a reputable journal or conference.
iii. The study used machine learning techniques for the estimation.

3.1.3 Exclusion Criteria
i. Studies not published in English were excluded.
ii. Working papers or work-in-progress articles were excluded.

3.2 Search Process

An automated search strategy was used to find suitable studies for the review in four separate databases: ScienceDirect, IEEE Xplore, ACM Digital Library, and Springer. Because the chosen databases are common in the fields of engineering and computer science, it was assumed that they would cover the major papers on the research issue. In addition, the percentage of content updates in these databases was examined as part of the selection process.

3.3 Classification of Papers

Papers used in this study are categorised by the publishers or journals in which they were published. The following full-text database platforms offering journals were used to find research materials for this study:
1. ACM Digital Library
2. Science Direct
3. IEEE Xplore
4. Springer

Figure 1. Order of the selection process (SLR protocol)

Table 1. Distribution of search results from selected journals
Journal               Overall database results retrieved   Final results after exclusion criteria
ACM Digital Library   39                                    14
Springer              1                                     1
Science Direct        175                                   27
IEEE Xplore           19                                    8
TOTAL                 234                                   50

3.4 Case Study Setup

The dataset used for the experiment consists of publicly available bug reports from Mozilla Firefox, a free and open-source browser, and Eclipse, an integrated development environment used in computer programming. The datasets obtained from the two platforms have 115814 and 85156 records, respectively, each with 11 attributes: priority, issue_id, component, duplicated_issue, title, descriptions, status, resolution, version, created time, and resolved time. The datasets contain an additional, manually labelled attribute, Classes, which indicates the presence or absence of multiple reports made on a bug issue and is derived from the 'duplicated_issue' attribute. In the 'Classes' attribute, 0 represents the absence of multiple reports and 1 represents their presence. For Mozilla Firefox there were 80,000 records labelled 0 and 35,814 labelled 1, and for Eclipse there were 70,752 records labelled 0 and 14,404 labelled 1.

Table 2. Characteristics of multiple bug reports used in the experiment
Dataset           Total number of bugs   Multiple bugs
Mozilla Firefox   115814                 35814
Eclipse           85156                  14404

Bug reports stored as Bugzilla issues were extracted from the two open-source projects, Mozilla Firefox and Eclipse, sourced from recognized repositories; these two projects were used for our case studies. A minimal sketch of how the binary 'Classes' label can be derived is shown below.
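As a simple illustration of this labelling step, the snippet below derives the binary 'Classes' attribute from the 'duplicated_issue' column using pandas. The file name and the rule that a non-empty 'duplicated_issue' marks a multiple report are assumptions made for the example; the thesis only states that 'Classes' is derived from 'duplicated_issue'.

```python
import pandas as pd

# Hypothetical CSV export of the Bugzilla bug reports described above.
reports = pd.read_csv("firefox_bug_reports.csv")

# Assumed rule: a report whose 'duplicated_issue' field is filled in has been
# reported more than once, so it is labelled 1; otherwise it is labelled 0.
reports["Classes"] = reports["duplicated_issue"].notna().astype(int)

# For Mozilla Firefox the thesis reports 80,000 records labelled 0 and 35,814 labelled 1.
print(reports["Classes"].value_counts())
```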
3.4.1 Bug Reports from Mozilla Firefox

Mozilla Firefox is a well-known Internet browser used by millions of people around the world. Firefox uses Mozilla's crash reporting system and Bugzilla for tracking bug reports. The dataset used for the study is a publicly available Mozilla Firefox bug report dataset containing 115814 records and 11 attributes. The dataset contains an additional, manually labelled attribute, 'Classes', which indicates the presence or absence of a multiple bug report and is derived from the 'duplicated_issue' attribute; 0 represents the absence of a multiple bug and 1 its presence, with 80,000 records labelled 0 and 35,814 labelled 1. There is a clear separation between crash reports and bug reports for Mozilla Firefox products. Crash reports are submitted automatically when a crash occurs, and each report contains several fields, including the stack trace, priority, product, component, and OS version [64]. We extracted multiple bug reports from the Mozilla Firefox website, from BR # 1 to BR # 115814. Mozilla uses "description" and "classes" to group bugs. We extracted reports with the status "RESOLVED MULTIPLE" or "VERIFIED MULTIPLE" and found their corresponding master bug report. In addition, we only included bug reports that have links to their corresponding crash reports, since our method uses stack traces, which are only saved in crash reports. We only added multiple bug reports that have at least two stack traces, the minimum number of traces needed to construct the training and testing sets.

3.4.2 Bug Reports from Eclipse

The Eclipse dataset comprises bug reports from the Eclipse bug repository. In this repository, stack traces are usually embedded in the bug report description. For Eclipse, the extracted traces were preprocessed to remove noise in the data, such as native methods that are invoked when a call to the Java library is performed. Unlike Firefox, which uses separate systems to track stack traces, Eclipse uses a single system: a bug report carries its stack traces as part of the report description. We generate multiple-bug groups by placing the stack traces of the reports associated with a master bug report into the same group. We only kept multiple bugs with a minimum of two stack traces to construct the training and testing sets. Table 3 describes the two case study projects.

Table 3. Description of case study projects
Metric       Mozilla Firefox   Eclipse
LOC          23738195          129278
Comment      4934094           36637
Developers   16                8
Release      May 2021          June 2018
Version      91.0              4.8

3.5 Dataset Preprocessing

Since we are dealing with text data written by humans, the data is noisy: it contains excessive punctuation, slang words, repeated words, mixed-case letters, and misspellings. Because of this, the text has to undergo a number of data cleaning steps before it can be used for training and testing. The techniques used here are standard preprocessing techniques in natural language processing. One of them is lowercasing, in which all upper-case letters are transformed to lowercase for the purpose of dimensionality reduction. Text data generally comes with a lot of unnecessary punctuation, which would have a negative impact on the final result and therefore has to be removed. Stemming, another NLP normalization technique, removes the suffixes and prefixes of words to reduce them to their root form. Another technique considered was the removal of stop words, which are common words in the English language that do not convey any meaning; examples include "he", "the", "is", and "have". A minimal sketch of these preprocessing steps is given below.
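As a minimal sketch of these steps, the snippet below lowercases a bug report, strips punctuation, removes stop words, and stems the remaining tokens. The use of NLTK's stop-word list and Porter stemmer, and the exact regular expression, are illustrative assumptions; the thesis does not prescribe a specific library.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)   # one-time download of the stop-word list

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text: str) -> str:
    """Clean a bug report title/description before vectorization."""
    text = text.lower()                                   # lowercasing
    text = re.sub(r"[^a-z0-9\s]", " ", text)              # strip punctuation and symbols
    tokens = text.split()
    tokens = [t for t in tokens if t not in STOP_WORDS]   # remove stop words
    tokens = [STEMMER.stem(t) for t in tokens]            # reduce words to their root form
    return " ".join(tokens)

print(preprocess("Firefox crashes when the tab is closed!!"))
# e.g. -> "firefox crash tab close"
```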
3.6 Feature Extraction
Although data sometimes comes in small sizes, it most often comes in tremendously large sizes, which makes processing tiring and time-consuming. This is a result of the number of features in the dataset: the higher the number of features, the larger the file size, even though the features may come in different forms. Such low-value features often contribute less to predictive modelling than the critical features do. Problems caused by these features during implementation include unnecessary resource allocation, poor model performance, and longer, more exhausting training. To address this and maximize the performance of the model, feature extraction is applied to select the most significant features from the dataset. Two feature extraction techniques were used in the study: TF-IDF and GloVe.

3.6.1 TF-IDF
TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates the relevance of a word within a pool of documents. It is the product of the term frequency (TF) and the inverse document frequency (IDF) and is used to weight the document-term matrix. A major issue when applying machine learning and deep learning to natural language is that these algorithms perform their computations on numbers, whereas natural language is text; the text therefore needs to be converted into numbers, a step known as text vectorization. Different vectorization techniques have a significant impact on the results, so the one that best fits the expected result needs to be chosen. Once the words have been transformed into numbers that the machine learning algorithms can understand, the TF-IDF output can be fed to algorithms such as Naive Bayes and Support Vector Machines, greatly improving on basic representations such as raw word counts.

3.6.2 GloVe
GloVe is an unsupervised learning algorithm for obtaining vector representations of words. Word embeddings are dense word representations that bridge the gap between human understanding of language and that of a machine. Words with the same meaning have a similar representation in an n-dimensional space, meaning that vectors placed very close to each other in the vector space correspond to words with similar meanings. GloVe (Global Vectors for Word Representation) generates word embeddings based on matrix factorization of the word-context matrix. A large matrix of co-occurrence data is constructed in which, for each "word" (the rows), a count is kept of how often the word appears in each "context" (the columns) across a large corpus. The corpus is scanned so that, for each term, context terms are searched for within an area defined by a window size before and after the term, and less weight is given to words that are more distant.
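The two feature-extraction routes described above can be sketched as follows: TF-IDF vectors for the conventional classifiers and pre-trained GloVe vectors for the deep learning models. The toy bug titles are illustrative, and the GloVe file path is an assumption (the appendix uses the 100-dimensional glove.6B vectors).

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

titles = ["firefox crashes on startup", "crash when opening new tab"]  # toy bug titles

# Route 1: TF-IDF weighting of the document-term matrix.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(titles)        # sparse matrix, one row per report
print(X_tfidf.shape)

# Route 2: look up pre-trained GloVe embeddings for each token.
embeddings = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:   # path is an assumption
    for line in f:
        parts = line.split()
        embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

vector_for_crash = embeddings.get("crash")   # 100-dimensional dense vector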
3.7 Traditional Machine Learning and Deep Learning Models

3.7.1 Traditional Machine Learning
Once there are representations of the input data, the model goes through training to fit the data so that it is capable of making predictions on unseen reports. With the results from TF-IDF, the data is split into two parts, the training data and the testing data, in a ratio of 90% for training and 10% for testing using the scikit-learn function train_test_split. The training data is then passed to a machine learning classifier, specifically the Random Forest classifier. Once it has been trained on the different training samples, the machine learning model can begin to make predictions. These predictions from machine learning serve as a baseline for the deep learning algorithm.

Table 4. Machine learning dataset training details
Dataset         | Training #Multiple | Training #All | Testing #Multiple | Testing #All
Mozilla Firefox | 32145              | 104232        | 3669              | 11582
Eclipse         | 12934              | 76640         | 1470              | 8516

3.7.2 Deep Learning
In deep learning, processes similar to those in machine learning are repeated. However, for feature extraction the deep learning model makes use of GloVe word embeddings. With the results from GloVe, a model is built using a bidirectional LSTM trained for 5 epochs with a batch size of 128.

Table 5. Deep learning dataset training details
Dataset         | Training #Multiple | Training #All | Testing #Multiple | Testing #All
Mozilla Firefox | 32246              | 104232        | 3568              | 11582
Eclipse         | 12946              | 76640         | 1458              | 8516

3.7.3 Bidirectional LSTM Model
Researchers and practitioners have presented a number of unique and successful neural network and probabilistic models to handle bug identification challenges. For example, the proposed Bidirectional LSTM model evaluates both the prior and the following context of code to detect mistakes and offer solutions that allow testers and developers to make the necessary changes quickly. Our approach uses a BiLSTM neural network to model language. The LSTM network computes the output vectors based on the input data. Let I = (i_1, i_2, i_3, ..., i_n) be the set of encoded IDs of the source code. An RNN is then executed for each encoded ID for t = 1 to n. The output vector y_t of the RNN can be expressed by the following equations:

h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)    (1)
y_t = W_{hy} h_t + b_y                            (2)

where h_t is the hidden state output and W is a weight matrix (W_{xh} is the weight connecting the input x to the hidden layer h). The first equation calculates the hidden state output, which receives the preceding state's results. However, because of the vanishing or exploding gradient problem, not all input sequences are handled successfully by an RNN. The RNN is extended into an LSTM to combat this and improve results. An LSTM network is conceptually similar to an RNN, except that the hidden layer update is replaced with a memory cell. The LSTM is implemented by the following equations:

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)      (3)
f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)      (4)
c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)      (5)
o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)          (6)
h_t = o_t \tanh(c_t)                                                  (7)

where \sigma is the sigmoid function, c the cell state, f the forget gate, i the input gate, o the output gate, and b the biases.
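A minimal NumPy sketch of a single LSTM step implementing equations (3)-(7) above is given below; the matrix shapes and the random initialisation are illustrative only and are not the weights learned in this study.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + W["ci"] @ c_prev + b["i"])    # eq. (3) input gate
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + W["cf"] @ c_prev + b["f"])    # eq. (4) forget gate
    c_t = f_t * c_prev + i_t * np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])  # eq. (5) cell state
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + W["co"] @ c_t + b["o"])       # eq. (6) output gate
    h_t = o_t * np.tanh(c_t)                                                       # eq. (7) hidden output
    return h_t, c_t

# Toy dimensions: 4-dimensional input, 3-dimensional hidden/cell state.
rng = np.random.default_rng(0)
W = {}
for k in ["xi", "xf", "xc", "xo"]:
    W[k] = rng.normal(size=(3, 4))          # input-to-hidden weights
for k in ["hi", "hf", "hc", "ho", "ci", "cf", "co"]:
    W[k] = rng.normal(size=(3, 3))          # recurrent and peephole weights
b = {k: np.zeros(3) for k in ["i", "f", "c", "o"]}

h_t, c_t = lstm_step(rng.normal(size=4), np.zeros(3), np.zeros(3), W, b)
print(h_t, c_t)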
However, the LSTM poses a few flaws. LSTM networks are slow to train, so it takes a significant amount of time for the neural network to learn, and the standard LSTM predominantly considers only the previous context of the input and cannot take any subsequent text context into account. When the network learns the left-to-right and right-to-left contexts separately and then concatenates them, some of the true meaning of the context is lost. As such, in this study we propose the bidirectional LSTM model as a better fit for our task, where one direction processes the input forwards and the other processes it backwards. A Bidirectional LSTM effectively increases the amount of information readily available to the network, thus improving the context available to the algorithm.

Figure 2. Bidirectional LSTM framework (Sanchoy et al. [69])

3.8 Multiple Bug Detection and Effort Estimation Framework
Figure 3 describes the multiple bug detection and effort estimation framework for the two studied datasets, that is, the Mozilla Firefox and Eclipse bug report repositories. Given a bug report repository, we extract bug reports with stack traces using the bug detector tool (bug tracking system) and select instances of multiple bug reports based on their severity levels, namely critical or less critical. Note that, as shown in Figure 4, we classified low- and medium-severity instances (based on the bug report description) as less critical, and instances with high and critical severity as critical. Descriptions of the low, medium, high, and critical levels are provided in Figure 4. If an instance is active or critical, we compute the identification duration, that is, the effort needed to detect the bug. A primary fix means the identified instance of multiple bugs is very critical or of high priority and demands attention for fixing, whereas a secondary fix demands less critical attention.

Figure 3. Multiple bug detection and effort estimation framework (mBugDEE)

3.9 Performance Evaluation
The metrics accuracy, precision, recall, and F1-score are used to evaluate the performance of the models. Precision quantifies the proportion of positive class predictions that actually belong to the positive class. Recall quantifies the proportion of all positive instances that are correctly predicted as positive. The F1-score provides a single score that balances the concerns of both precision and recall.

Precision: P = TP / (TP + FP)    (8)
Recall:    R = TP / (TP + FN)    (9)
F1-score:  F = 2RP / (R + P)     (10)

The symbol TP ("True Positive") counts cases where the predicted output is positive, which for this study indicates a multiple bug, and the actual instance is also positive. The symbol FP ("False Positive") counts cases where the predicted output is positive but the actual instance is negative, and FN ("False Negative") counts positive instances that are predicted as negative. The F1-score is the harmonic mean of R (recall) and P (precision).
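A minimal sketch of how the metrics in equations (8)-(10), together with accuracy, can be computed with scikit-learn is shown below; y_test and y_pred are placeholder label vectors, not results from this study.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_test = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = multiple bug present, 0 = absent (placeholder labels)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical classifier output

print("Accuracy: %.4f" % accuracy_score(y_test, y_pred))
print("Precision: %.4f" % precision_score(y_test, y_pred))   # TP / (TP + FP)
print("Recall: %.4f" % recall_score(y_test, y_pred))         # TP / (TP + FN)
print("F1-score: %.4f" % f1_score(y_test, y_pred))           # 2RP / (R + P)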
3.10 How Much Effort is Required to Identify and Resolve a Multiple Bug?
Motivation: There are many duplicate bug reports. According to a recent study, 20-40% of the issue reports in Mozilla and Eclipse are duplicates [70]. Duplicate issue reports waste developers' time and resources, and as a result many researchers have suggested automatic duplicate report detection methods [61]. However, estimating the effort of identifying and resolving duplicate reports has received little attention in recent years. Hence, this study investigates a feasible technique to estimate the effort of identifying and resolving duplicate as well as multiple bug reports. Note that when a bug is reported by one reporter on two or more occasions, it is referred to as a "duplicate bug" [70], whereas in the case of a multiple bug, the same bug is reported by different reporters.

Method: The identification duration statistic measures the time it takes to identify multiple bug reports; the longer it takes to find and fix a large number of bugs, the more work developers must do [71]. We measure the time between a reporter triaging a bug report and another reporter marking it as a multiple problem. For non-triaged bug reports, we record the days between reporting and being recognized as a multiple bug. Given a corpus of bug issue reports, the more time (in days) it takes for bug reports to be identified and triaged by different reporters, the more developer effort is required, and vice versa. The identification duration (time) is therefore directly proportional to the developers' effort required.

For example, consider a bug X that was first reported by reporter A on day one (d1) at a given time (t1), by reporter B on day four (d4) at time (t4), and lastly by reporter C on day seven (d7) at time (t7). Assuming that the release date for the given version of the open-source project is Day 0 (d0) at time (t0), the duration needed to identify bug X as a multiple bug by each of the reporters is given as follows:

Identification_duration_by_Reporter_A = d0 - d1 @ t0 - t1
Identification_duration_by_Reporter_B = d0 - d4 @ t0 - t4
Identification_duration_by_Reporter_C = d0 - d7 @ t0 - t7

In general, the identification duration by a reporter is given by d0 - di @ t0 - tj, where d0 denotes the first day on which the given bug X was identified and reported, at the initial time t0, and di denotes the ith subsequent day on which a different reporter also identified and reported the same bug X, at the jth time tj. Thus, the timestamp for a given bug to be identified and reported by a reporter is di @ tj. From the above example, the average duration for a given bug to be classified as a multiple bug is:

Mean_identification_duration = (|d0 - d1 @ t0 - t1| + |d0 - d4 @ t0 - t4| + |d0 - d7 @ t0 - t7|) / 3    (11)

More generally, the average effort required for identifying a given bug X as a multiple bug reported by at least two different reporters is defined as:

Mean_identification_duration = ( \sum_{i,j=1}^{n} |d0 - di @ t0 - tj| ) / n    (12)

where n denotes the number of different reporters identifying and reporting the given bug X, and i denotes the ith day on which the bug is identified and reported by a given reporter at the jth time. Since the identification duration is directly proportional to the developers' effort needed to address a reported bug [70], we can deduce that

Mean_identification_duration = ( \sum_{i,j=1}^{n} |d0 - di @ t0 - tj| ) / n    (13)

can also be used as the average effort of resolving the identified multiple bug X. Thus, we assume that the developers' effort needed to resolve the reported bug will be at most the duration needed to identify the given bug, that is, mean_developers'_effort ≤ mean_identification_duration. Note that this assumption is based on experienced developers resolving the identified bug.
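A minimal sketch of equations (11)-(12) follows: the mean identification duration is the average absolute gap, in days, between the reference day d0 and the days di on which other reporters filed the same bug. Using the worked example above (d0 = 0 and reports on days 1, 4, and 7) gives a mean duration of 4 days.

def mean_identification_duration(d0, report_days):
    # |d0 - di| for each subsequent report, averaged over the n reporters.
    durations = [abs(d0 - di) for di in report_days]
    return sum(durations) / len(durations)

print(mean_identification_duration(0, [1, 4, 7]))   # -> 4.0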
3.10.1 Path Extraction
We extract paths from an AST built for each method rather than from the raw source code, because an AST usually helps in capturing code features better [72]. By representing the features explicitly through the AST, the technique can better distinguish between different bug reports and non-bug code structures [72]. We used the widely recognized Eclipse JDT package to build an AST for a given Java method [23]. Since we do not want to lose important or relevant data for bug detection, we find the smallest set of four long paths that can cover an AST's nodes; we therefore built the AST and extracted the four long paths. In an AST, a long path connects two leaf nodes via the root node, passing through multiple intermediate nodes [23] (Figure 4).

Figure 4. Severity levels of bugs

3.11 Bug Detection Stages
Before a bug can go through its detection and fixing cycle, the user involved must suspect, detect, or discover the anomaly or malfunctioning of the software project [73]. Using the software's terminology, the user then categorizes the anomaly into single or multiple bugs, or into whatever categories are available for the type of operation. The user has to inform the test manager or IT personnel, who put in place the necessary executable actions to resolve the problem. After the resolution, the test manager has to verify whether the bugs have truly been resolved. If the resolution fails, the action has to be re-launched. If the results verify that the bugs have been fixed, the dataset repository is closed and the report is transmitted to the central system. This is shown in Figure 5.

Figure 5. Bug detection cycle

3.12 Effort Collection Requirements to Improve the Bug Detection Process
Data for effort estimation should be gathered based on defined stages and activities [74]. If the collection of the effort variable is not homogeneous across projects, the overall dataset effort spans a variety of phase combinations, causing the bug analysis to be erroneous [74]. Thorough planning, such as establishing shorter activities and smaller tasks, improves estimation accuracy and minimizes the extent of estimation errors. To avoid "noisy" information in the effort value, unrelated activities conducted during project execution should be excluded from the analysis. Time spent on other software projects, as well as time spent on unrelated tasks, should not be included in the effort collection tool, because the quantity of functionality offered may be directly tied to various aspects of effort estimation [75]. However, it is not possible to construct a relationship with the same capabilities for other types of effort, such as documentation, meetings, and demos. The amount of time spent on multiple projects and on unrelated tasks should therefore be differentiated.

Figure 6. Bug detection process

3.13 Processes of the Dataset Framework
The dataset framework consists of five processes. Software managers are in charge of providing a high-level overview of all bug estimation methods [3]. The dataset execution level is where all of the data analysis and calibration stages are carried out. The data analysis process uses the effort collection results and BFC size measurements of all the projects for analysis [76] and then produces "effort models" for each application type. Other important data collected during project implementation is analyzed for the final effort estimation of the new dataset. The calibration process is carried out regularly at the dataset implementation level [76]. This method is used to perform two distinct calibration processes.
These are established frameworks for describing bug detection stages, activities, unexpected events, and unplanned events. The Measurement and Analysis Group determines the calibration period primarily based on the rate of arrival of available data.

Figure 7. Multiple bug detection and fixing algorithm

3.14 Bug Detection and Evaluation Tools
The process of running test cases and assessing the results is known as bug discovery and evaluation. Selecting test cases for implementation, setting up the environment, running the selected tests, recording the execution activities, analyzing probable product failures, and assessing the effort's effectiveness are all part of this process. Execution tools are primarily focused on making running tests easier. They include:
1. Capture and playback tools
2. Memory testing tools
3. Monitoring tools
4. System log reporting tools
5. Coverage analysis tools
6. Mapping tools
Implementation tools are also used for test execution. Implementation tools take the place of the software or hardware that interacts with the dataset to be tested within the system.

3.15 Chapter Summary
This chapter presented the experimental methodology of the study. It includes descriptions of the two datasets used for the study as well as a description of the framework. It also described the experimental design, which included a discussion of the deep learning and traditional machine learning algorithms used for detecting multiple bugs. Finally, it provided an effort estimation framework for estimating the effort required to detect multiple bugs.

CHAPTER FOUR
EXPERIMENTAL RESULTS AND DISCUSSION

4.1 Review of Existing Techniques for Bug Detection after Inclusion/Exclusion Criteria
The search of the four electronic databases in the review of existing studies identified fifty (50) papers after applying the inclusion and exclusion criteria. Figure 8 depicts a clustered bar graph showing both the original number of papers sourced from the four electronic databases and the respective number remaining after the inclusion and exclusion criteria. As shown in Figure 9, twenty-seven (27) of the primary studies, representing 54% of the total, were obtained from ScienceDirect. The majority of the papers were obtained from the ScienceDirect database, which contains more journals dedicated to the study area than the other three databases (IEEE Xplore, Springer, and ACM Digital Library). Fourteen (14) of the primary studies were obtained from the ACM Digital Library, accounting for 28% of the total, eight (8) from IEEE Xplore, accounting for 16%, and one (1) from Springer, accounting for 2% (Figure 9).

4.2 Results from Techniques for Multiple Bug Detection
The proposed bidirectional LSTM model was tested on datasets from two separate projects, namely Mozilla Firefox and Eclipse, each with a different number of multiple bug reports, as indicated in Table 2. The proposed model was compared against existing machine learning techniques. The model was trained and tested with more than 1000 bug reports.

Figure 8. Search results of overall papers and final papers after inclusion and exclusion criteria
Figure 9. Final papers obtained after inclusion and exclusion criteria

4.3 Research Questions
Examining the proposed framework requires answering the following questions:
• What are the existing techniques for detecting duplicate bugs?
• What are the techniques for detecting multiple bugs?
• What is the mean identification duration for estimating multiple bugs?
The research questions compare the selected deep neural network (BiLSTM) against other alternatives and also examine the performance improvement of the proposed model, as indicated in Tables 6, 7, and 8.

4.4 Bidirectional LSTM Neural Network
The experiments were conducted on a bidirectional LSTM neural network trained for the stated number of epochs. The bidirectional LSTM neural network model was created in Python with the help of the Keras deep learning library [3]. Validation accuracy and loss in a Keras model may vary according to the study's specific circumstances. The number of epochs is inversely related to the loss, which means that as the number of epochs increases, the loss decreases and the accuracy increases. The following scenarios are considered when evaluating Keras validation loss and accuracy:
• When the validation loss increases while the accuracy declines, the model is said to be not learning.
• When both the validation loss and the accuracy increase, the model is said to be overfitting.
• When the validation loss begins to decline and the accuracy begins to increase, the model is considered to be learning.
Figures 10(i) and 10(ii) show the loss and accuracy metrics, which reflect the variance between the training and validation data.

Figure 10. Comparison between training and validation: (i) model loss, (ii) model accuracy

Figures 10(i) and 10(ii) show that the model has a higher training accuracy but a lower validation accuracy, implying that the model is still learning. The training loss is found to be decreasing, which shows that the model is learning to recognize the training set.

4.5 Comparison of Performance Results from BiLSTM, CNN, LSTM, SVM, and Random Forest
After training and testing, the accuracy, precision, recall, and F1-score metrics were used to determine the performance of both the traditional machine learning and the deep learning models. Table 6 shows the overall performance results of Support Vector Machines (SVM), Random Forest, Convolutional Neural Network (CNN), LSTM, and the proposed BiLSTM model on the Mozilla Firefox dataset. Table 7 shows the overall performance results on the Eclipse dataset and compares them with the proposed BiLSTM model. On average, the proposed model performed better in accuracy as well as precision on the Mozilla dataset.

Table 6. Performance of the models based on the Mozilla Firefox dataset
Model              | Accuracy | Precision | Recall  | F1 score
Random Forest      | 0.6789   | 0.4587    | 0.2334  | 0.3094
SVM                | 0.6838   | 0.5088    | 0.0547  | 0.0989
CNN                | 0.6221   | 0.3941    | 0.3589  | 0.3757*
LSTM               | 0.6609   | 0.4341    | 0.2316  | 0.3021
Bidirectional LSTM | 0.7109*  | 0.6830*   | 0.4570* | 0.0855
* denotes the best model for a given performance metric (Accuracy, Precision, Recall, F1 score)

Table 6 shows the comparative results when the models were applied to the Mozilla Firefox dataset.
Once again, the proposed bidirectional LSTM model outperformed the LSTM, CNN, SVM, and Random Forest models, with an accuracy of 71.09%, a precision of 68.30%, and a recall of 45.7%. The bidirectional LSTM model had the highest precision (68.30%), meaning that only a few reports were wrongly classified as multiple bugs, and its recall of 45.7%, the highest among the compared models, indicates that it also produced the fewest false negatives. As discussed before, the F1-score is the harmonic mean of the recall and precision ratios and is a key metric for describing model performance. As shown in Table 6, the F1-score of the BiLSTM model was average among the compared models, showing that the model produced more true positives with a low percentage of false positives. Table 7 shows the comparative results when the models were applied to the Eclipse dataset; the proposed BiLSTM and CNN proved the best compared to LSTM, SVM, and Random Forest.

Table 7. Performance of the models based on the Eclipse dataset
Model              | Accuracy | Precision | Recall  | F1 score
Random Forest      | 0.8123   | 0.2953    | 0.0692  | 0.1122
SVM                | 0.8203   | 0.5000*   | 0.0027  | 0.0054
CNN                | 0.7353   | 0.2219    | 0.2129* | 0.2173
LSTM               | 0.8014   | 0.2721    | 0.0897  | 0.1350
Bidirectional LSTM | 0.8260*  | 0.4225    | 0.0204  | 0.5089*
* denotes the best model for a given performance metric (Accuracy, Precision, Recall, F1 score)

4.6 Comparison between the BiLSTM, LSTM, CNN, SVM, and Random Forest Results
As shown in Figures 11 and 12, the results of our experiments show that the proposed framework based on bidirectional LSTM neural networks appropriately detects multiple bugs in software projects and that performance can be increased compared to LSTM, CNN, SVM, and Random Forest. Based on Table 8 and Figures 11 and 12, we draw the following conclusions. The proposed model achieves a modest performance improvement: we calculated the improvement of the bidirectional LSTM over other machine learning techniques, namely LSTM, CNN, SVM, and Random Forest. The accuracy results show a 3% improvement for the bidirectional LSTM compared with SVM, a 9% improvement compared with CNN, a 5% improvement compared with LSTM, and a 4% improvement compared with Random Forest. Precision improved by 17% compared to SVM, 29% compared to CNN, 25% compared to LSTM, and 23% compared to Random Forest. Recall improved by 39.6% compared to SVM, 9% compared to CNN, 22% compared to LSTM, and 22% compared to Random Forest, which shows that the bidirectional LSTM model outperforms the other techniques in detecting multiple bugs in software projects.

Table 8. BiLSTM improvement compared to SVM, CNN, LSTM, and Random Forest
Metric    | SVM   | BiLSTM | Improv. | CNN  | BiLSTM | Improv. | LSTM | BiLSTM | Improv. | Random Forest | BiLSTM | Improv.
Accuracy  | 0.68  | 0.71   | 3%      | 0.62 | 0.71   | 9%      | 0.66 | 0.71   | 5%      | 0.67          | 0.71   | 4%
Precision | 0.51  | 0.68   | 17%     | 0.39 | 0.68   | 29%     | 0.43 | 0.68   | 25%     | 0.45          | 0.68   | 23%
Recall    | 0.054 | 0.45   | 39.6%   | 0.36 | 0.45   | 9%      | 0.23 | 0.45   | 22%     | 0.233         | 0.45   | 22%

Figures 11 and 12 show a comparison between the BiLSTM, CNN, SVM, LSTM, and Random Forest results.

Figure 11. Comparison between BiLSTM, CNN, SVM, LSTM and Random Forest results on the Mozilla Firefox dataset
Figure 12. Comparison between BiLSTM, CNN, SVM, LSTM and Random Forest results on the Eclipse dataset

4.7 Mean Identification Duration/Estimation Results for Multiple Bugs
Table 9 shows the mean identification duration results obtained from 10 sampled bug reports drawn from the entire Mozilla Firefox dataset. Each row shows one referenced bug identified and reported by a user, the reports made by three different users (reporters 1, 2, and 3) on the referenced bug, and the dates on which those reports were made.

Table 9. Mean identification duration for multiple bug reports (Mozilla Firefox)
Bug ID | Reporter 1 ID | Reporter 2 ID | Reporter 3 ID | Ref Date  | Date 1     | Date 2     | Date 3     | Duration 1 | Duration 2 | Duration 3 | Mean Duration
14871  | 269442        | 143180        | 515027        | 9/24/1999 | 11/12/2004 | 5/8/2002   | 9/7/2009   | 1876       | 957        | 3636       | 2156.3
153509 | 340604        | 610714        | 334876        | 6/21/2002 | 6/6/2006   | 11/9/2010  | 4/20/2006  | 1446       | 3063       | 1399       | 1969.3
513391 | 475603        | 505371        | 506588        | 8/28/2009 | 1/27/2009  | 7/20/2009  | 7/26/2009  | 213        | 39         | 33         | 95
513401 | 495999        | 482190        | 540641        | 8/28/2009 | 6/2/2009   | 3/9/2009   | 1/19/2010  | 87         | 172        | 144        | 134.3
513415 | 447571        | 427280        | 529958        | 8/28/2009 | 7/22/2008  | 4/5/2008   | 1/15/2010  | 402        | 510        | 140        | 350.6
513424 | 501115        | 457572        | 505298        | 8/28/2009 | 6/29/2009  | 9/28/2008  | 7/20/2009  | 60         | 334        | 39         | 144.3
513459 | 465673        | 403402        | 493606        | 8/29/2009 | 11/18/2008 | 11/11/2007 | 5/18/2009  | 284        | 657        | 103        | 348
516003 | 494373        | 502405        | 502185        | 9/11/2009 | 5/22/2009  | 7/4/2009   | 7/3/2009   | 112        | 69         | 70         | 83.6
516108 | 64488         | 121839        | 176312        | 9/11/2009 | 2/5/2001   | 1/25/2001  | 10/23/2002 | 3140       | 3151       | 2515       | 2935.3
516115 | 64488         | 187930        | 190785        | 9/11/2009 | 1/5/2001   | 1/6/2003   | 1/27/2003  | 3171       | 2440       | 2419       | 2676.6
Average identification duration                                                                  | 1079.1     | 1139.2     | 1049.8     | 1089.4
NB: Duration in days

Table 10. Mean identification duration for multiple bug reports (Eclipse)
Bug ID | Reporter 1 ID | Reporter 2 ID | Reporter 3 ID | Ref Date  | Date 1     | Date 2     | Date 3     | Duration 1 | Duration 2 | Duration 3 | Mean Duration
108813 | 138959        | 141014        | 155890        | 9/6/2005  | 4/27/2006  | 5/10/2006  | 8/31/2006  | 233        | 246        | 359        | 279.3
135956 | 139047        | 138154        | 138301        | 4/10/2006 | 4/27/2006  | 4/23/2006  | 4/24/2006  | 17         | 13         | 14         | 14.7
5835   | 177407        | 27381         | 45659         | 11/7/2002 | 3/14/2007  | 11/29/2002 | 10/28/2003 | 1588       | 22         | 355        | 655
21383  | 49261         | 72982         | 103659        | 7/8/2002  | 12/22/2003 | 8/31/2004  | 7/13/2005  | 532        | 785        | 1101       | 806
47927  | 49277         | 47960         | 48254         | 12/2/2003 | 12/22/2003 | 12/3/2003  | 12/8/2003  | 20         | 1          | 6          | 9
293551 | 313480        | 294392        | 296556        | 9/10/2028 | 5/19/2010  | 11/5/2009  | 12/1/2009  | 6689       | 6884       | 6858       | 6810.3
38821  | 49293         | 51392         | 41677         | 6/12/2003 | 12/22/2003 | 2/9/2004   | 8/19/2003  | 193        | 242        | 68         | 167.7
39409  | 49333         | 41052         | 44098         | 6/27/2003 | 12/23/2003 | 8/1/2003   | 10/2/2003  | 179        | 35         | 97         | 103.7
29923  | 312103        | 144234        | 285735        | 1/21/2003 | 5/7/2010   | 5/29/2006  | 9/8/2005   | 2663       | 1224       | 961        | 1616
241155 | 313406        | 249677        | 264043        | 8/7/2016  | 10/5/2018  | 8/10/2004  | 2/7/2009   | 789        | 4380       | 2738       | 2635.7
Average identification duration                                                                  | 1290.3     | 1383.2     | 1255.7     | 1309.7

The duration from the date on which the first (reference) reporter makes the report to the dates on which the other reporters make the same report is taken into consideration, and with all this information extracted, the mean identification duration is calculated. The mean identification duration is directly proportional to the estimated duration for resolving the multiple bugs. A similar result is shown for the Eclipse dataset in Table 10.
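A minimal pandas sketch of how the per-report durations and the mean identification duration in Table 9 can be computed from the report dates is given below; the row shown is the first sampled Mozilla Firefox bug (ID 14871), and the column names are illustrative.

import pandas as pd

row = pd.DataFrame([{
    "bug_id": 14871,
    "ref_date": "9/24/1999",
    "date_1": "11/12/2004",
    "date_2": "5/8/2002",
    "date_3": "9/7/2009",
}])

ref = pd.to_datetime(row["ref_date"])
for k in ["date_1", "date_2", "date_3"]:
    # Absolute gap in days between the reference report and each later report.
    row["duration_" + k[-1]] = (pd.to_datetime(row[k]) - ref).dt.days.abs()

row["mean_duration"] = row[["duration_1", "duration_2", "duration_3"]].mean(axis=1)
print(row[["bug_id", "duration_1", "duration_2", "duration_3", "mean_duration"]])
# Durations of 1876, 957 and 3636 days give a mean of 2156.3, matching Table 9.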
4.8 Chapter Summary
This chapter presented the results of the empirical experiment. It includes a review of various existing techniques for bug detection. The chapter also presented the results of the traditional machine learning algorithms and the deep learning algorithms in detecting multiple bugs from two open-source bug repositories. In addition, results of the novel metric for estimating the effort of detecting multiple bugs in open-source projects were presented; that is, the mean identification duration results, computed on 10 sampled bug reports from the Mozilla Firefox and Eclipse bug repositories, respectively.

4.9 Threats to Validity
As in other studies, some factors may affect the proposed model's performance. The following are the threats to our study's validity. Internal validity relates to the implementation of the bidirectional LSTM rather than other approaches. We selected the bidirectional LSTM because it has been shown to be effective for feature extraction and text categorization [3], and the findings were checked to avoid errors and mistakes. External validity concerns how far the results can be generalized. As stated earlier, the dataset used was extracted from bug reports of two open-source projects; for datasets drawn from other projects, it is not certain that the same results would be reached.

CHAPTER FIVE
CONCLUSION

5.1 Conclusion
This thesis provides a framework for multiple bug detection and an effort estimation metric for open-source projects to reduce the time and effort software testers spend on analyzing bug reports while also improving software reliability and productivity. A bidirectional LSTM effort estimation model based on a deep learning approach is proposed for multiple bug detection and effort estimation for open-source projects. The proposed framework involves the extraction of relevant features, text preprocessing (lemmatization, stemming, and stop-word removal), and then the extraction of relevant keywords from bug report descriptions. We extracted a dataset from the Bugzilla bug tracking system, which comprises two open-source projects and contains more than 1,000 bug reports. The dataset was divided into training and test examples (90% training, 10% test), and the dataset's variation was recorded. The proposed model was validated using two publicly available datasets, Eclipse and Mozilla Firefox. The experimental results were evaluated using accuracy, precision, recall, and the F1-score. The proposed Bidirectional LSTM model was compared to four well-known machine learning algorithms: Random Forest, SVM, CNN, and LSTM. The proposed model outperformed the benchmark models in terms of accuracy (62.60-71.09%) and precision (50.0-68.30%). The proposed model also outperforms the studies by Sawarkar et al. [44] (52.42-64.61%), Shokripour et al. [2] (44.01-59.76%), and Sharma et al. [38] (68.20-70.00%), and its precision rate improves on the Mani et al. [40] (43.10-47.00%) and Mahmood et al. [45] (28.32-42.25%) systems, respectively.

5.2 Recommendation and Future Work
We measure effort using three metrics: identification duration, the number of people involved, and identification delay [70]. These metrics do not, however, fully capture the work required to find and detect multiple bug reports, and there may be additional ways to quantify the work required to detect multiple bugs in software projects.
In future case studies, we intend to investigate other indicators. We plan to test other deep learning techniques on closed-source projects like Maharah and Hashfood. These covers testing the performance of different classifiers such as KNN, SVM, and Random Forest. Integration of the proposed framework with Bugzilla software may be a future work direction. 52 University of Ghana http://ugspace.ug.edu.gh REFERENCES [1] M. Suresh, M. Amarnath, G. Baranikumar, M. Jagadheeswaran, and B. Tech, “a Survey on Bug Tracking System for Effective Bug Clearance,” Int. Res. J. Eng. Technol., pp. 1622– 1630, 2016, [Online]. Available: www.irjet.net. [2] D. G. Lee and Y. S. Seo, “Systematic review of Bug report processing techniques to improve software management performance,” J. Inf. Process. Syst., vol. 15, no. 4, pp. 967–985, 2019, doi: 10.3745/JIPS.04.0130. [3] H. Bani-Salameh, M. Sallam, and B. Al shboul, “A deep-learning-based bug priority prediction using RNN-LSTM neural networks,” E-Informatica Softw. Eng. J., vol. 15, no. 1, pp. 29–45, 2021, doi: 10.37190/E-INF210102. [4] A. Kumar, M. Madanu, H. Prakash, L. Jonnavithula, and S. R. Aravilli, “Advaita: Bug Duplicity Detection System,” no. August, pp. 0–5, 2020, [Online]. Available: http://arxiv.org/abs/2001.10376. [5] S. Hangal and M. S. Lam, “Tracking down software bugs using automatic anomaly detection,” p. 291, 2002, doi: 10.1145/581376.581377. [6] J. E. Corter, Y. J. Rho, D. Zahner, J. Nickerson, and B. Tversky, “Bugs and Biases: Diagnosing Misconceptions in the Understanding of Diagrams,” Proc. 31st Annu. Conf. Cogn. Sci. Soc., pp. 756–761, 2009, [Online]. Available: http://csjarchive.cogsci.rpi.edu/Proceedings/2009/papers/133. [7] I. Gomes, P. Morgado, T. Gomes, and R. Moreira, “An overview on the Static Code Analysis approach in Software Development,” Fac. Eng. da Univ. do Porto, Port., p. 16, 53 University of Ghana http://ugspace.ug.edu.gh 2009, [Online]. Available: https://pdfs.semanticscholar.org/ce3c/584c906eea668954f6a1a0ddbb295c6ec5a2.pdf%0A http://paginas.fe.up.pt/~ei05021/TQSO - An overview on the Static Code Analysis approach in Software Development.pdf. [8] M. J. Coblenz, “User-Centered Design of Principled Programming Languages,” no. August, 2020. [9] A. Goyal and N. Sardana, “Performance Assessment of Bug Fixing Process in Open Source Repositories,” Procedia Comput. Sci., vol. 167, no. Iccids 2019, pp. 2070–2079, 2020, doi: 10.1016/j.procs.2020.03.247. [10] J. Xuan et al., “Towards effective bug triage with software data reduction techniques,” IEEE Trans. Knowl. Data Eng., vol. 27, no. 1, pp. 264–280, 2015, doi: 10.1109/TKDE.2014.2324590. [11] W. Xiaoyin, Z. Lu, X. Tao, J. Anvik, and J. Sun, “An approach to detecting duplicate bug reports using natural language and execution information,” Proc. - Int. Conf. Softw. Eng., no. January, pp. 461–470, 2008, doi: 10.1145/1368088.1368151. [12] W. Xiaoyin, Z. Lu, X. Tao, J. Anvik, and J. Sun, “An approach to detecting duplicate bug reports using natural language and execution information,” Proc. - Int. Conf. Softw. Eng., pp. 461–470, 2008, doi: 10.1145/1368088.1368151. [13] K. R. Subramanian, “Analytical Skills – The Wake up Call for Human Resource,” Int. J. Eng. Manag. Res., vol. 7, no. 1, pp. 14–19, 2017. [14] G. Decker and J. Mendling, “Process instantiation,” Data Knowl. Eng., vol. 68, no. 9, pp. 54 University of Ghana http://ugspace.ug.edu.gh 777–792, 2009, doi: 10.1016/j.datak.2009.02.013. [15] P. Mell, V. Hu, R. Lippmann, J. Haines, and M. 
Zissman, “An overview of issues in testing intrusion detection systems,” Contract, p. 22, 2003, [Online]. Available: https://pdfs.semanticscholar.org/e87b/4ea56a5cc4e985158c8eaff13ff1c5e0828b.pdf%5Cn http://nvlpubs.nist.gov/nistpubs/Legacy/IR/nistir7007.pdf%5Cnhttp://citeseerx.ist.psu.edu/ viewdoc/summary?doi=10.1.1.8.5163. [16] A. Nistor, P. C. Chang, C. Radoi, and S. Lu, “CARAMEL: Detecting and fixing performance problems that have non-intrusive fixes,” Proc. - Int. Conf. Softw. Eng., vol. 1, pp. 902–912, 2015, doi: 10.1109/ICSE.2015.100. [17] D. Hovemeyer and W. Pugh, “Finding bugs is easy,” Proc. Conf. Object-Oriented Program. Syst. Lang. Appl. OOPSLA, no. June, pp. 132–135, 2004, doi: 10.1145/1028664.1028717. [18] A. Kloptchenko, Text Mining Based on the Prototype Matching Method the Prototype Matching Method, no. 47. 2003. [19] M. Pradel and K. Sen, “DeepBugs: a learning approach to name-based bug detection,” Proc. ACM Program. Lang., vol. 2, no. OOPSLA, pp. 1–25, 2018, doi: 10.1145/3276517. [20] K. Liu, A. Koyuncu, D. Kim, and T. F. Bissyande, “AVATAR: Fixing Semantic Bugs with Fix Patterns of Static Analysis Violations,” SANER 2019 - Proc. 2019 IEEE 26th Int. Conf. Softw. Anal. Evol. Reengineering, pp. 456–467, 2019, doi: 10.1109/SANER.2019.8667970. [21] N. Ayewah, D. Hovemeyer, D. J. Morgenthaler, J. Penix, and W. Pugh, “Using static analysis to find bugs,” IEEE Softw., vol. 25, no. 5, pp. 22–29, 2008, doi: 10.1109/MS.2008.130. 55 University of Ghana http://ugspace.ug.edu.gh [22] D. Engler, D. Y. Chen, S. Hallem, A. Chou, and B. Chelf, “Bugs as deviant behavior,” ACM SIGOPS Oper. Syst. Rev., vol. 35, no. 5, pp. 57–72, 2001, doi: 10.1145/502059.502041. [23] Y. Li, S. Wang, T. N. Nguyen, and S. Van Nguyen, “Improving bug detection via context- based code representation learning and attention-based neural networks,” Proc. ACM Program. Lang., vol. 3, no. OOPSLA, pp. 1–30, 2019, doi: 10.1145/3360588. [24] A. Kumar, M. Madanu, H. Prakash, L. Jonnavithula, and S. R. Aravilli, “Advaita: Bug Duplicity Detection System,” pp. 1–5, 2020, [Online]. Available: http://arxiv.org/abs/2001.10376. [25] S. Koch, “Effort modeling and programmer participation in open source software projects,” Inf. Econ. Policy, vol. 20, no. 4, pp. 345–355, 2008, doi: 10.1016/j.infoecopol.2008.06.004. [26] S. Panthaplackel, J. J. Li, M. Gligoric, and R. J. Mooney, “Deep Just-In-Time Inconsistency Detection Between Comments and Source Code,” 2020, [Online]. Available: http://arxiv.org/abs/2010.01625. [27] D. Engler, D. Y. Chen, S. Hallem, A. Chou, and B. Chelf, “Bugs as deviant behavior: A general approach to inferring errors in systems code,” Oper. Syst. Rev., vol. 35, no. 5, pp. 57–72, 2001, doi: 10.1145/502059.502041. [28] N. Cooper, C. Bernal-Cardenas, O. Chaparro, K. Moran, and D. Poshyvanyk, “A Replication Package for It Takes Two to Tango: Combining Visual and Textual Information for Detecting Duplicate Video-Based Bug Reports,” Proc. - Int. Conf. Softw. Eng., no. Cv, pp. 160–161, 2021, doi: 10.1109/ICSE-Companion52605.2021.00067. [29] J. Zou, L. Xu, M. Yang, X. Zhang, J. Zeng, and S. Hirokawa, “Automated duplicate bug 56 University of Ghana http://ugspace.ug.edu.gh report detection using multi-factor analysis,” IEICE Trans. Inf. Syst., vol. E99D, no. 7, pp. 1762–1775, 2016, doi: 10.1587/transinf.2016EDP7052. [30] A. Kukkar, R. Mohana, Y. Kumar, A. Nayyar, M. Bilal, and K. S. Kwak, “Duplicate Bug Report Detection and Classification System Based on Deep Learning Technique,” IEEE Access, vol. 8, no. November, pp. 
200749–200763, 2020, doi: 10.1109/ACCESS.2020.3033045. [31] T. Chappelly, C. Cifuentes, P. Krishnan, and S. Gevay, “Machine learning for finding bugs: An initial report,” MaLTeSQuE 2017 - IEEE Int. Work. Mach. Learn. Tech. Softw. Qual. Eval. co-located with SANER 2017, pp. 21–26, 2017, doi: 10.1109/MALTESQUE.2017.7882012. [32] M. Thorstensson, Using Observers for Model Based Data Collection in Distributed Tactical Operations, vol. Licentiate, no. 1386. 2008. [33] F. Brown, D. Stefan, and D. Engler, “Sys: A static/symbolic tool for finding good bugs in good (browser) code,” Proc. 29th USENIX Secur. Symp., pp. 199–216, 2020. [34] L. P. Binamungu, “Detecting and Correcting Duplication in Behaviour Driven Development,” 2020. [35] P. Fonseca, K. Zhang, X. Wang, and A. Krishnamurthy, “An empirical study on the correctness of formally verified distributed systems,” Proc. 12th Eur. Conf. Comput. Syst. EuroSys 2017, pp. 328–343, 2017, doi: 10.1145/3064176.3064183. [36] P. Runeson, M. Alexandersson, and O. Nyholm, “Detection of duplicate defect reports using natural language processing,” Proc. - Int. Conf. Softw. Eng., no. November, pp. 499–508, 57 University of Ghana http://ugspace.ug.edu.gh 2007, doi: 10.1109/ICSE.2007.32. [37] A. Baarah, A. Aloqaily, Z. Salah, M. Zamzeer, and M. Sallam, “Machine learning approaches for predicting the severity level of software bug reports in closed source projects,” Int. J. Adv. Comput. Sci. Appl., vol. 10, no. 8, pp. 285–294, 2019, doi: 10.14569/ijacsa.2019.0100836. [38] H. A. Ahmed, N. Z. Bawany, and J. A. Shamsi, “Capbug-a framework for automatic bug categorization and prioritization using nlp and machine learning algorithms,” IEEE Access, vol. 9, no. April, pp. 50496–50512, 2021, doi: 10.1109/ACCESS.2021.3069248. [39] S. F. A. Zaidi, F. M. Awan, M. Lee, H. Woo, and C. G. Lee, “Applying Convolutional Neural Networks with Different Word Representation Techniques to Recommend Bug Fixers,” IEEE Access, vol. 8, no. November, pp. 213729–213747, 2020, doi: 10.1109/ACCESS.2020.3040065. [40] T. Thongtan and T. Phienthrakul, “余弦相似度分类,” pp. 407–414, 2019, [Online]. Available: https://www.aclweb.org/anthology/papers/P/P19/P19-2057/. [41] S. Gupta and S. K. Gupta, “A Systematic Study of Duplicate Bug Report Detection,” Int. J. Adv. Comput. Sci. Appl., vol. 12, no. 1, pp. 578–589, 2021, doi: 10.14569/IJACSA.2021.0120167. [42] A. Yadav and S. K. Singh, “Survey Based Classification of Bug Triage Approaches,” APTIKOM J. Comput. Sci. Inf. Technol., vol. 1, no. 1, pp. 1–11, 2016, doi: 10.34306/csit.v1i1.37. [43] T. W. W. Aung, Y. Wan, H. Huo, and Y. Sui, “Multi-triage: A multi-task learning 58 University of Ghana http://ugspace.ug.edu.gh framework for bug triage,” J. Syst. Softw., vol. 184, p. 111133, 2022, doi: 10.1016/j.jss.2021.111133. [44] Ö. Köksal and B. Tekinerdogan, “Automated classification of unstructured bilingual software bug reports: An industrial case study research,” Appl. Sci., vol. 12, no. 1, 2022, doi: 10.3390/app12010338. [45] Y. Mahmood, N. Kama, A. Azmi, A. S. Khan, and M. Ali, “Software effort estimation accuracy prediction of machine learning techniques: A systematic performance evaluation,” Softw. - Pract. Exp., 2021, doi: 10.1002/spe.3009. [46] Y. Tian, C. Sun, and D. Lo, “Improved duplicate bug report identification,” Proc. Eur. Conf. Softw. Maint. Reengineering, CSMR, pp. 385–390, 2012, doi: 10.1109/CSMR.2012.48. [47] S. N. Ahsan, J. Ferzund, and F. Wotawa, “Program file bug fix effort estimation using machine learning methods for OSS,” Proc. 
21st Int. Conf. Softw. Eng. Knowl. Eng. SEKE 2009, pp. 129–134, 2009. [48] S. N. Ahsan, M. T. Afzal, S. Zaman, C. Gütel, and F. Wotawa, “Mining Effort Data From the Oss Repository of Developer’S Bug Fix Activity,” J. IT Asia, vol. 3, no. 1, pp. 107–128, 2016, doi: 10.33736/jita.38.2010. [49] S. Iqbal, R. Naseem, S. Jan, S. Alshmrany, M. Yasar, and A. Ali, “Determining Bug Prioritization Using Feature Reduction and Clustering with Classification,” IEEE Access, vol. 8, no. August 2009, pp. 215661–215678, 2020, doi: 10.1109/ACCESS.2020.3035063. [50] S. Gujral, G. Sharma, S. Sharma, and Diksha, “Classifying bug severity using dictionary based approach,” 2015 1st Int. Conf. Futur. Trends Comput. Anal. Knowl. Manag. ABLAZE 59 University of Ghana http://ugspace.ug.edu.gh 2015, no. February, pp. 599–602, 2015, doi: 10.1109/ABLAZE.2015.7154933. [51] N. Pandey, D. K. Sanyal, A. Hudait, and A. Sen, “Automated classification of software issue reports using machine learning techniques: an empirical study,” Innov. Syst. Softw. Eng., vol. 13, no. 4, pp. 279–297, 2017, doi: 10.1007/s11334-017-0294-1. [52] F. Porto, L. Minku, E. Mendes, and A. Simao, “A Systematic Study of Cross-Project Defect Prediction With Meta-Learning,” 2018, [Online]. Available: http://arxiv.org/abs/1802.06025. [53] A. Baarah, A. Aloqaily, Z. Salah, and E. Alshdaifat, “Sentiment-based machine learning and lexicon-based approaches for predicting the severity of bug reports,” J. Theor. Appl. Inf. Technol., vol. 99, no. 6, pp. 1386–1401, 2021. [54] H. Valdivia-Garcia and A, “Understanding the Impact of Diversity in Software Bugs on Bug Prediction Models by A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computing and Information Sciences PhD Program in Computing an,” ProQuest, 2016. [55] V. R. Basili, “Models and Metrics for Software Management and Engineering.,” Instrumentation in the Pulp and Paper Industry, Proceedings, vol. 1. pp. 278–289, 1980. [56] A. Hindle, A. Alipour, and E. Stroulia, “A contextual approach towards more accurate duplicate bug report detection and ranking,” Empir. Softw. Eng., vol. 21, no. 2, pp. 368– 410, 2016, doi: 10.1007/s10664-015-9387-3. [57] M. Padmanaban and T. Bhuvaneswari, “An Approach Based on Artificial Neural Network for Data Deduplication,” Int. J. Comput. Sci. Inf. Technol., vol. 3, no. 4, pp. 4637–4644, 60 University of Ghana http://ugspace.ug.edu.gh 2012. [58] M. Allamanis, H. Jackson-Flux, and M. Brockschmidt, “Self-Supervised Bug Detection and Repair,” no. NeurIPS, 2021, [Online]. Available: http://arxiv.org/abs/2105.12787. [59] N. Jalbert and W. Weimer, “Automated duplicate detection for bug tracking systems,” Proc. Int. Conf. Dependable Syst. Networks, pp. 52–61, 2008, doi: 10.1109/DSN.2008.4630070. [60] L. Wu, B. Xie, G. Kaiser, and R. Passonneau, “BugMiner: Software reliability analysis via data mining of bug reports,” SEKE 2011 - Proc. 23rd Int. Conf. Softw. Eng. Knowl. Eng., no. December, pp. 95–100, 2011. [61] N. Ebrahimi, A. Trabelsi, M. S. Islam, A. Hamou-Lhadj, and K. Khanmohammadi, “An HMM-based approach for automatic detection and classification of duplicate bug reports,” Inf. Softw. Technol., vol. 113, no. May, pp. 98–109, 2019, doi: 10.1016/j.infsof.2019.05.007. [62] J. Deshmukh, K. M. Annervaz, S. Podder, S. Sengupta, and N. Dubash, “Towards accurate duplicate bug retrieval using deep learning techniques,” Proc. - 2017 IEEE Int. Conf. Softw. Maint. Evol. ICSME 2017, pp. 115–124, 2017, doi: 10.1109/ICSME.2017.69. 
[63] A. T. Nguyen, T. T. Nguyen, T. N. Nguyen, D. Lo, and C. Sun, “Duplicate bug report detection with a combination of information retrieval and topic modeling,” 2012 27th IEEE/ACM Int. Conf. Autom. Softw. Eng. ASE 2012 - Proc., pp. 70–79, 2012, doi: 10.1145/2351676.2351687. [64] N. Ebrahimi, A. Trabelsi, M. S. Islam, A. Hamou-Lhadj, and K. Khanmohammadi, “An HMM-based approach for automatic detection and classification of duplicate bug reports,” 61 University of Ghana http://ugspace.ug.edu.gh Inf. Softw. Technol., vol. 113, pp. 98–109, 2019, doi: 10.1016/j.infsof.2019.05.007. [65] D. Sharma, B. Kumar, and S. Chand, “A Survey on Journey of Topic Modeling Techniques from SVD to Deep Learning,” Int. J. Mod. Educ. Comput. Sci., vol. 9, no. 7, pp. 50–62, 2017, doi: 10.5815/ijmecs.2017.07.06. [66] N. Bettenburg, S. Just, A. Schröter, C. Weiß, R. Premraj, and T. Zimmermann, “Quality of bug reports in Eclipse,” no. May 2014, pp. 21–25, 2007, doi: 10.1145/1328279.1328284. [67] A. T. Nguyen, T. T. Nguyen, T. N. Nguyen, D. Lo, and C. Sun, “Duplicate bug report detection with a combination of information retrieval and topic modeling,” 2012 27th IEEE/ACM Int. Conf. Autom. Softw. Eng. ASE 2012 - Proc., no. September 2015, pp. 70– 79, 2012, doi: 10.1145/2351676.2351687. [68] A. Alipour, A. Hindle, and E. Stroulia, “A contextual approach towards more accurate duplicate bug report detection,” IEEE Int. Work. Conf. Min. Softw. Repos., pp. 183–192, 2013, doi: 10.1109/MSR.2013.6624026. [69] M. Rahman, Y. Watanobe, and K. Nakamura, “A Bidirectional LSTM Language Model for Code Evaluation and Repair,” pp. 1–15, 2021. [70] M. S. Rakha, W. Shang, and A. E. Hassan, “Studying the needed effort for identifying duplicate issues,” Empir. Softw. Eng., vol. 21, no. 5, pp. 1960–1989, 2016, doi: 10.1007/s10664-015-9404-6. [71] C. Weiß, R. Premraj, T. Zimmermann, and A. Zeller, “How long will it take to fix this bug?,” Proc. - ICSE 2007 Work. Fourth Int. Work. Min. Softw. Repos. MSR 2007, no. June 2007, 2007, doi: 10.1109/MSR.2007.13. 62 University of Ghana http://ugspace.ug.edu.gh [72] U. Alon, M. Zilberstein, O. Levy, and E. Yahav, “A general path-based representation for predicting program properties,” ACM SIGPLAN Not., vol. 53, no. 4, pp. 404–419, 2018, doi: 10.1145/3192366.3192412. [73] J. Anvik, L. Hiew, and G. C. Murphy, “Coping with an open bug repository,” Proc. 2005 OOPSLA Work. Eclipse Technol. Exch. eclipse’05, pp. 35–39, 2005, doi: 10.1145/1117696.1117704. [74] E. Kocaguneli and T. Menzies, “Software effort models should be assessed via leave-one- out validation,” J. Syst. Softw., vol. 86, no. 7, pp. 1879–1890, 2013, doi: 10.1016/j.jss.2013.02.053. [75] V. Lenarduzzi, S. Morasca, and D. Taibi, “Estimating software development effort based on phases,” Proc. - 40th Euromicro Conf. Ser. Softw. Eng. Adv. Appl. SEAA 2014, no. December 2015, pp. 305–308, 2014, doi: 10.1109/SEAA.2014.54. [76] D. Baca, Software, Developing Secure - in an Agile Process. 2012. 
APPENDIX: IMPLEMENTATION CODES

CODES FOR BIDIRECTIONAL LSTM

Importing libraries
from tensorflow import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, Bidirectional
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

Reading dataset
df = pd.read_csv('/content/drive/MyDrive/Anas datasets/eclipse_cleaned_data.csv')

Exploring dataset
df.sample()
X = df['Title']
y = df['Classes']

Data splitting
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

Data preprocessing
vocab_size = 1000
oov_token = ""
max_length = 100
padding_type = "post"
trunction_type = "post"
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_token)
tokenizer.fit_on_texts(X_train.astype(str))
word_index = tokenizer.word_index

Create sequences
X_train_sequences = tokenizer.texts_to_sequences(X_train.astype(str))
X_train_sequences[10:15]

Padding sequences
X_train_padded = pad_sequences(X_train_sequences, maxlen=max_length, padding=padding_type, truncating=trunction_type)
X_train_padded
X_test_sequences = tokenizer.texts_to_sequences(X_test.astype(str))
X_test_padded = pad_sequences(X_test_sequences, maxlen=max_length, padding=padding_type, truncating=trunction_type)

Prepare GloVe embeddings
embeddings_index = {}
f = open('/content/drive/MyDrive/Anas datasets/glove.6B.100d.txt')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Found %s word vectors.' % len(embeddings_index))
embeddings_index['attention']
embedding_matrix = np.zeros((len(word_index) + 1, max_length))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in the embedding index will be all-zeros
        embedding_matrix[i] = embedding_vector

Embedding layer
embedding_layer = Embedding(len(word_index) + 1, max_length,
                            weights=[embedding_matrix],
                            input_length=max_length,
                            trainable=False)

Model definition
embedding_dim = 16
input_length = 100
model = Sequential([
    embedding_layer,
    Bidirectional(LSTM(embedding_dim, return_sequences=True)),
    Bidirectional(LSTM(embedding_dim)),
    Dense(6, activation='relu'),
    Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Train model
history = model.fit(X_train_padded, y_train, epochs=20,
                    validation_data=(X_test_padded, y_test))
import pickle
f = open('/content/drive/MyDrive/Anas datasets/eclipse_history.pckl', 'wb')
pickle.dump(history.history, f)
f.close()

Visualizing results
import matplotlib.pyplot as plt
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

y_pred = model.predict(X_test_padded)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy = accuracy_score(y_test, np.round(abs(y_pred)))
print('Accuracy: %f' % accuracy)
precision = precision_score(y_test, np.round(abs(y_pred)))
print('Precision: %f' % precision)
recall = recall_score(y_test, np.round(abs(y_pred)))
print('Recall: %f' % recall)
# f1: 2 tp / (2 tp + fp + fn)
f1 = f1_score(y_test, np.round(abs(y_pred)))
print('F1 score: %f' % f1)

import numpy as np
(unique, counts) = np.unique(y_train, return_counts=True)
frequencies = np.asarray((unique, counts)).T
print(frequencies)
(unique, counts) = np.unique(y_test, return_counts=True)
frequencies = np.asarray((unique, counts)).T
print(frequencies)
y_train.shape

Evaluate the model
train_acc = model.evaluate(X_train_padded, y_train)
test_acc = model.evaluate(X_test_padded, y_test)
print("train acc", train_acc)
print("test acc", test_acc)

CODES FOR LSTM

Importing libraries
import pandas as pd
from nltk.corpus import stopwords
import re
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

Reading data
df = pd.read_csv('/content/drive/MyDrive/Anas datasets/eclipse_cleaned_data.csv')
df.head()

Data tokenization and embeddings
# The maximum number of words to be used (most frequent)
MAX_NB_WORDS = 50000
# Max number of words in each report title
MAX_SEQUENCE_LENGTH = 250
EMBEDDING_DIM = 100
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(df['Title'].astype(str))
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
CODES FOR LSTM

Importing libraries
import pandas as pd
from nltk.corpus import stopwords
import re
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

Reading data
df = pd.read_csv('/content/drive/MyDrive/Anas datasets/eclipse_cleaned_data.csv')
df.head()

Data tokenization and embeddings
# The maximum number of words to be used (most frequent).
MAX_NB_WORDS = 50000
# Max number of words in each bug report title.
MAX_SEQUENCE_LENGTH = 250
EMBEDDING_DIM = 100
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(df['Title'].astype(str))
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
X = tokenizer.texts_to_sequences(df['Title'].astype(str))

Padding sequences
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)
Y = pd.get_dummies(df['Classes']).values
print('Shape of label tensor:', Y.shape)

Data splitting
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.10, random_state=42)
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)

Defining model
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM, SpatialDropout1D
from keras.layers.embeddings import Embedding
from keras.callbacks import EarlyStopping
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
epochs = 5
batch_size = 128
history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size, validation_split=0.1, callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
model.summary()

Evaluate the model
train_acc = model.evaluate(X_train, Y_train)
test_acc = model.evaluate(X_test, Y_test)
print("train acc", train_acc)
print("test acc", test_acc)
# model.save('/content/drive/MyDrive/Anas datasets/mozilla_model.h5')

Visualizing results
from matplotlib import pyplot
# Plot loss during training
pyplot.subplot(211)
pyplot.title('Loss')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
# Plot accuracy during training
pyplot.subplot(212)
pyplot.title('Accuracy')
pyplot.plot(history.history['accuracy'], label='train')
pyplot.plot(history.history['val_accuracy'], label='test')
pyplot.legend()
pyplot.show()

Evaluation metrics
classes = model.predict(X_test).argmax(axis=1)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np
accuracy = accuracy_score(np.argmax(Y_test, axis=1), classes)
print('Accuracy: %f' % accuracy)
precision = precision_score(np.argmax(Y_test, axis=1), classes)
print('Precision: %f' % precision)
recall = recall_score(np.argmax(Y_test, axis=1), classes)
print('Recall: %f' % recall)
f1 = f1_score(np.argmax(Y_test, axis=1), classes)
print('F1 score: %f' % f1)
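The next listing is a small illustrative sketch (with hypothetical probabilities, and assuming class 1 denotes a multiple bug report) of how argmax over the two-unit softmax output is turned into predicted class labels in the evaluation above.

# Hypothetical softmax outputs for two bug report titles.
import numpy as np

toy_probs = np.array([[0.91, 0.09],   # higher probability for class 0
                      [0.23, 0.77]])  # higher probability for class 1
toy_classes = toy_probs.argmax(axis=1)
print(toy_classes)  # [0 1]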
CODES FOR NLP (SVM AND RANDOM FOREST)

Import libraries and load data
import nltk
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import string

Reading data
data = pd.read_csv('/content/drive/MyDrive/Anas datasets/eclipse_cleaned_data.csv')
data.head()
data['Classes'].value_counts()

Extract features
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=0, lowercase=False)
vectorizer.fit(data['Title'].astype(str))
vectorizer.vocabulary_

Data splitting
from sklearn.model_selection import train_test_split
X = data[['Title']]
y = data['Classes']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
nltk.download('stopwords')

Data preprocessing
stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()
def clean_text(text):
    # Remove punctuation, split into tokens, drop stop words and stem the rest.
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

Vectorize text
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
tfidf_vect_fit = tfidf_vect.fit(X_train['Title'].values.astype('U'))
tfidf_train = tfidf_vect_fit.transform(X_train['Title'].values.astype('U'))
tfidf_test = tfidf_vect_fit.transform(X_test['Title'].values.astype('U'))

Final evaluation of models
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm

Random Forest classifier
rf = RandomForestClassifier(n_estimators=150, max_depth=None, n_jobs=-1)
rf_model = rf.fit(tfidf_train, y_train)
y_pred = rf_model.predict(tfidf_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall:", metrics.recall_score(y_test, y_pred))
print("F1score:", metrics.f1_score(y_test, y_pred))

Create SVM classifier
clf = svm.SVC(kernel='linear')  # Linear kernel

Train the model using the training sets
clf.fit(tfidf_train, y_train)

Predict the response for the test dataset
svm_y_pred = clf.predict(tfidf_test)
print("Accuracy:", metrics.accuracy_score(y_test, svm_y_pred))
print("Precision:", metrics.precision_score(y_test, svm_y_pred))
print("Recall:", metrics.recall_score(y_test, svm_y_pred))
print("F1score:", metrics.f1_score(y_test, svm_y_pred))
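Finally, a minimal usage sketch, assuming the fitted objects above (tfidf_vect_fit, rf_model and clf) are still in memory; the bug title shown is hypothetical and only illustrates how a new report could be screened with the trained models.

# Screen a new, hypothetical bug report title with the fitted TF-IDF vectorizer
# and the trained Random Forest and SVM classifiers.
new_title = ["Editor crashes when opening large project"]
new_features = tfidf_vect_fit.transform(new_title)
print("Random Forest prediction:", rf_model.predict(new_features))
print("SVM prediction:", clf.predict(new_features))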