University of Ghana http://ugspace.ug.edu.gh

UNIVERSITY OF GHANA

MISCLASSIFICATION COST SENSITIVE LEARNING FOR PREDICTING GONORRHEA INFECTION STATUS IN GHANA

BY

BEHENE ERIC (10244343)

THIS THESIS IS SUBMITTED TO THE UNIVERSITY OF GHANA, LEGON IN PARTIAL FULFILMENT OF THE REQUIREMENT FOR THE AWARD OF MPHIL STATISTICS DEGREE

JUNE, 2017

DECLARATION

I hereby declare that, with the exception of cited references to other people's work, which have been acknowledged, this work is the result of my own research done under supervision and has not been presented elsewhere, in part or in whole, for another degree.

Student: Behene Eric (10244343)
Signature .................................................. Date ..........................................................

Principal Supervisor: Dr. Isaac Baidoo
Signature: ................................................. Date: .........................................................

Co-Supervisor: Dr. F.O. Mettle
Signature: ................................................. Date: .........................................................

DEDICATION

Every challenging work needs sacrifice and dedication. This work is dedicated to God Almighty and to my family for their support and care.

ACKNOWLEDGEMENT

I thank my academic supervisor Dr. Isaac Baidoo for his advice, encouragement and motivation. I also express my gratitude to Dr. F.O. Mettle for his support in making this thesis a reality. I appreciate the support of the staff of the Noguchi Memorial Institute for Medical Research and the staff of the Naval Medical Research Unit 3 Ghana Detachment (NAMRU-3GD), especially Mrs. Naiki Pupulampu Attram. Lastly, I thank my mother Theresa Nyamah and my brother Samuel Behene for their prayers and sacrifice.
ABSTRACT

Gonorrhea, one of the most frequently reported sexually transmitted infections, is caused by the bacterium Neisseria gonorrhoeae. The disease is a serious public health problem worldwide, with about 88 million new infections occurring each year. Failure to treat it can result in pelvic inflammatory disease (PID), chronic pain and damage to the female reproductive organs; in males it can lead to reduced fertility and sterility. In developing countries, the lack of diagnostic capacity due to cost, equipment shortages and a scarcity of trained personnel has led to the syndromic management of sexually transmitted infections (STIs). These challenges create the need for statistical models for gonorrhoea diagnosis that can be obtained and implemented easily with the appropriate expertise. In diagnosing a sexually transmitted infection, a false positive has a different impact from a false negative. Assuming equal misclassification costs in such models can lead to incorrect decisions and can incur financial cost and harm to the patient. Many classifiers do not allow cost to be integrated into the model development process; instead, they are designed to improve prediction accuracy under the assumption of equal misclassification costs. The aim of the study is to develop cost-sensitive statistical models for predicting gonorrhoea infection. Of the data used for the study, 80% was used for training and 20% for testing. The results indicated that the cost-sensitive classifiers had a lower total classification cost than the cost-insensitive classifiers. The classification cost of every laboratory diagnostic method except culture was lower than that of both the cost-sensitive and cost-insensitive models. The class distribution weakly affected the cost-sensitive classifiers but not the cost-insensitive classifiers.
TABLE OF CONTENT

DECLARATION ii
DEDICATION iii
ACKNOWLEDGEMENT iv
ABSTRACT v
TABLE OF CONTENT vi
LIST OF TABLES ix
LIST OF FIGURES x
LIST OF ABBREVIATIONS xi

CHAPTER ONE 1
Introduction 1
1.1 Diagnostic test 1
1.2 Cost Sensitive Learning 3
1.3 Problem Statement 3
1.4 Objectives 4
1.5 Significance 4
1.6 Limitation 5

CHAPTER TWO 6
LITERATURE REVIEW 6
2.1 Brief History 6
2.2 Statistical model applications to sexually transmitted infection 8
2.3 Application of classification trees to other medical diagnoses 9
2.4 Research work on cost-sensitive methods 13
2.5 Conclusion 15

CHAPTER THREE 16
METHODOLOGY 16
3.1 Source of Data 16
3.2 Description of Data 17
3.3 Definition of Some Medical and Statistical Terminology 18
3.4 Logistic Regression 18
3.4.1 Assumptions of logistic regression 18
3.4.2 Model Specification 18
3.4.3 Model Evaluation and Diagnostics 19
3.4.3.1 Likelihood Ratio test 19
3.4.3.2 Hosmer-Lemeshow test 20
3.4.3.3 Pearson Residual 21
3.4.3.4 Deviance Residual 21
3.4.3.5 Wald Statistic 22
3.4.3.6 Linearity test 22
3.4.4 Parameter estimation of logistic regression 23
3.4.4.1 Maximum likelihood estimation of parameters of logistic regression 23
3.4.4.2 Variable Selection 24
3.4.4.3 Bayesian Estimation 24
3.4.4.4 Prior Distribution 25
3.4.4.5 Posterior Distribution 25
3.4.4.6 Markov Chain Monte Carlo (MCMC) 26
3.4.4.7 Gibbs sampling 26
3.4.4.8 Burn-in 27
3.5 Classification trees 27
3.5.1 Basic definitions used in classification trees 27
3.5.2 Tree Construction 28
3.5.3 ID3 28
3.5.4 C4.5 29
3.5.5 CART 29
3.5.6 Splitting Criterion 29
3.5.7 Information gain 30
3.5.7.1 Gain Ratio 30
3.5.7.2 Entropy 30
3.5.7.3 Gini index 31
3.5.8 Pruning 31
3.5.8.1 Reduced error pruning 31
3.5.8.2 Cost complexity pruning 32
3.6 Random forest 32
3.6.1 Random forest construction 34
3.6.2 Bagging 35
3.7 Cost-sensitive modelling 35
3.7.1 Modifying the classification of the predicted probability scores obtained from logistic regression to include unequal misclassification cost 37
3.8 Sampling 38
3.9 Performance Measures 39
F-measure 41
3.9.1 Receiver Operating Characteristic (ROC) 42

CHAPTER FOUR 43
DATA PRESENTATION AND ANALYSIS 43
4.1 Data and Preliminary Analysis 44
4.2 Training Data and Model Fitting 47
4.2.1 Logistic Regression 49
4.2.2 Classification tree 51
4.2.3 Random Forest 52
4.3 Cost-Sensitive Models 53
4.3.1 Classification of logistic regression predicted probability scores to include unequal misclassification cost 54
4.3.2 Classification tree with unequal misclassification cost 55
4.4 Comparing the performance of the models using training data 56
4.5 Effect of total classification cost on cost-sensitive and insensitive methods 56
4.6 Model Validation 57
4.6.1 Comparing laboratory diagnostic methods with cost-sensitive and insensitive models on the testing data 58
4.6.2 Effect of class distribution and cost-sensitive method on classification cost 59
4.7 Summary of results 60

CHAPTER FIVE 62
CONCLUSIONS AND RECOMMENDATIONS 62
5.1 Discussion 62
5.1.1 Comparing logistic regression, classification tree and random forest 62
5.1.2 Effect of classification cost on laboratory diagnostic method and skewed class distribution of cost-sensitive and insensitive classifiers 64
5.2 Conclusion 65
5.3 Recommendation 66
REFERENCE 67
APPENDIX 73

LIST OF TABLES

Table 4.1a: Background information of the respondents 46
Table 4.1b: Background information of the respondents 47
Table 4.2a: Description of Training Data 48
Table 4.2b: Description of Training Data (continued) 49
Table 4.3: Logistic regression model using maximum likelihood estimation 50
Table 4.4: Confusion Matrix 54
Table 4.5: Comparing performance of the various models using training data 56
Table 4.6: Comparing performance of the various models using test data 57
Table 1a: Bayesian logistic regression 73

LIST OF FIGURES

Figure 4.1: Distribution of the predicted probability scores of the training data 51
Figure 4.2: Tree structure for gonorrhea data 52
Figure 4.3: Variable importance for random forest 53
Figure 4.4: Determination of optimal cut-off of predicted scores using unequal classification cost 54
Figure 4.5: Tree structure for gonorrhea data using a cost ratio of 1:4 55
Figure 4.6: Effect of classification cost on cost-sensitive and cost-insensitive classifiers 57
Figure 4.7: Total cost of classification of laboratory method, cost-sensitive and insensitive classifiers 58
Figure 4.8: Laboratory diagnostic methods and cost-sensitive models 59
Figure 4.9: Effect of class distribution on classification cost of the classifiers 60
Figure 1a: Pearson residuals plotted against predictors one by one 73
Figure 1b: Posterior distribution of the model parameters 74
Figure 1b: Posterior distribution of the model parameters (Cont.) 74
Figure 1c: Posterior distribution of the model parameters (Cont.) 75
Figure 1d: Posterior distribution of the model parameters (Cont.) 75
Figure 1e: Error rate for the number of trees 75

LIST OF ABBREVIATIONS

WHO World Health Organization
CDC Centers for Disease Control and Prevention
NAAT Nucleic Acid Amplification Test
AUC Area Under the Curve

CHAPTER ONE

Introduction

Gonorrhea is one of the most frequently reported sexually transmitted infections, caused by the bacterium Neisseria gonorrhoeae. It is a serious public health problem worldwide, with about 88 million new infections occurring each year (Smith, 2016), and it is the third most prevalent sexually transmitted infection (STI) worldwide (WHO, 2005). A Centers for Disease Control and Prevention (CDC) report in 2010 stated that about 700,000 new cases of Neisseria gonorrhoeae are diagnosed yearly in the United States, which makes it the second most frequently reported STI after Chlamydia trachomatis. Like any STI, the disease can be acquired through unprotected vaginal, oral or anal sex with an infected person. It can also be transmitted from an infected mother to her child during birth. Gonorrhoea can be avoided through abstinence or the use of condoms during sexual intercourse, as both partners can thereby reduce their chances of acquiring it. Failure to treat the disease can result in pelvic inflammatory disease (PID), chronic pain and damage to the female reproductive organs; in males it can lead to reduced fertility and sterility (Handsfield et al., 1974). Men are more often symptomatic than women, mostly presenting with penile discharge and painful, frequent urination. Women normally present with increased vaginal discharge, painful urination, lower back pain and spotting between menstrual periods, which may occur alone or in combination and may range from hardly visible to severe spotting (Smith, 2016).
1.1 Diagnostic test

In developing countries, the lack of diagnostic capacity due to cost, equipment shortages and a scarcity of trained personnel has led to the syndromic management of STIs. Even where equipment and trained personnel are available to help diagnose these diseases, they are usually found in the urban centres (Meade & Cornelius, 2012). The main diagnostic techniques for gonorrhoea are culture, direct microscopic testing and Nucleic Acid Amplification Testing (NAAT), which is currently recommended by the CDC (Papp et al., 2014). Culture, the gold standard for identification of the bacterium, has high specificity and sensitivity and is also optimal for antimicrobial susceptibility testing (Murray et al., 2003). Another useful method is direct microscopy, which is preferred for diagnosing symptomatic gonococcal urethritis in men. This test is not appropriate for the diagnosis of extra-genital infections, since non-pathogenic Gram-negative diplococci may be present and may result in false positives (Bignell et al., 2006). The method requires highly trained personnel, and the specimens required for testing need to be stored and transported under appropriate conditions to maintain organism viability (Whiley et al., 2006). This is, however, often not the case in developing countries. The challenges associated with the specimen collection and transportation required for culture-based diagnosis have resulted in the development and application of nucleic acid detection methods such as the Nucleic Acid Amplification Test (NAAT), which utilises urine samples (Whiley et al., 2006). Results from these methods are obtained from urine and specimens that can be collected in minimally invasive ways. This diagnostic method is often not available in developing countries, since it requires specialised equipment that is very expensive and requires experienced personnel.
In recent years, molecular diagnostics such as the nucleic acid amplification technique (NAAT) have captured attention and been recommended as the optimal test for diagnosing gonorrhoea. Compared to culture, NAAT is more sensitive and provides faster results (Cosentino et al., 2012). In some developing countries, NAAT may not be available in public hospitals, so patients are requested to go to private laboratories where costs are much higher. These laboratories may provide better results but may not be reliable due to the lack of national quality assurance programs to certify them (Ndongmo, 2005). These challenges create the need for statistical models for gonorrhoea diagnosis that can be obtained and implemented easily with the appropriate expertise.

1.2 Cost Sensitive Learning

Cost-sensitive learning is a type of learning in data mining that takes misclassification cost (and possibly other types of cost, such as test cost) into account, with the aim of minimizing the total cost. Total cost (the cost of classification) refers to the overall penalty incurred by the misclassified cases, each weighted by its misclassification cost; under equal unit costs it reduces to the number of people misclassified. In diagnosing a sexually transmitted infection, a false positive has a different impact from a false negative. To address this, a cost-sensitive classifier is built that accounts for the varying misclassification costs (false positive and false negative) using a cost matrix (Zhou & Liu, 2006). The cost matrix is used during the model building process and is quite subjective (Zadrozny & Elkan, 2001).

1.3 Problem Statement

There are several classifiers designed for medical diagnosis (Chou & Shapiro, 2003). However, many of these do not allow misclassification cost to be integrated into their model development process. Instead, they are designed to improve prediction accuracy under the assumption of equal misclassification costs (Jiang & Cukic, 2009).
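As a minimal sketch of the cost-matrix idea (not the thesis's own implementation), the decision rule that minimizes expected cost can be written in a few lines. The probabilities and the 1:4 cost ratio below are illustrative assumptions; the thesis happens to use a 1:4 ratio in one of its trees, but the helper names here are invented.

```python
# Sketch: cost-sensitive classification from predicted probabilities.
# All numbers here are illustrative assumptions, not values from the study.

def expected_cost(p_positive, predict_positive, cost_fp, cost_fn):
    """Expected cost of a decision given P(gonorrhoea positive)."""
    if predict_positive:
        return (1 - p_positive) * cost_fp   # risk of a false positive
    return p_positive * cost_fn             # risk of a false negative

def cost_sensitive_label(p_positive, cost_fp=1.0, cost_fn=4.0):
    """Predict positive when that choice has the lower expected cost.
    Equivalent to thresholding p at cost_fp / (cost_fp + cost_fn)."""
    cost_pos = expected_cost(p_positive, True, cost_fp, cost_fn)
    cost_neg = expected_cost(p_positive, False, cost_fp, cost_fn)
    return int(cost_pos <= cost_neg)

# With a 1:4 cost ratio the decision threshold drops from 0.5 to 0.2,
# so borderline cases are flagged positive.
print(cost_sensitive_label(0.30))  # 1: predicted positive at p = 0.30
print(cost_sensitive_label(0.10))  # 0: predicted negative at p = 0.10
```

The design choice this illustrates: making a classifier cost-sensitive need not change how probabilities are estimated, only how they are converted to labels.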
The dataset used for gonorrhoea prediction is usually unbalanced with regard to the proportion of gonorrhoea positives to gonorrhoea negatives. The class distribution is usually skewed in favour of the gonorrhoea negatives, which may cause poor performance in detecting gonorrhoea-positive instances. Assuming equal misclassification costs in such models can lead to incorrect diagnostic conclusions, which can result in significant harm to the patient. Therefore, in developing predictive models for gonorrhoea diagnosis, there is a need to reduce the number of people who could be misclassified by including misclassification cost in the model, since this cost usually varies in real life. From the literature, very few studies, as reported in Ling et al. (2004) and Domingos (1999), have considered the role of misclassification cost when developing predictive models.

1.4 Objectives

The aim of the study is to develop cost-sensitive statistical models for predicting gonorrhoea infection.

Specific Objectives
• To fit cost-insensitive classifiers for predicting gonorrhoea status using logistic regression, classification trees and random forest, and to induce a cost-sensitive criterion into these classifiers
• To determine the effect of classification cost on cost-sensitive and insensitive methods
• To determine whether classification cost is affected by both the skewed class distribution and cost sensitivity
• To compare traditional laboratory diagnostic methods with cost-sensitive and insensitive classifiers

1.5 Significance

According to the World Health Organization's (WHO) Western Pacific region manual of tests, an ideal diagnostic tool for reproductive tract infections is one whose results are easily made available to patients, is inexpensive, highly sensitive and specific, requires no specialised equipment, and uses samples obtained by non-invasive procedures (Verma et al., 2009).
This study seeks to achieve this goal and to propose an alternative tool for diagnosing gonorrhoea in resource-constrained environments, which will benefit clinicians by limiting the syndromic management of the disease.

1.6 Limitation

The limitations of the study include:
• The data considered only symptomatic patients, hence the results obtained cannot be generalised to asymptomatic patients, who may or may not have gonorrhoea
• Data for the study were obtained from only three health facilities; better results might have been produced if other health facilities across the country had been included. This was not achieved due to the scarcity of epidemiological data on gonorrhoea.

Future Research

The study focused only on misclassification cost, which was used to determine the total classification cost; it would therefore be important to consider other forms of cost in future research.

CHAPTER TWO

LITERATURE REVIEW

This chapter reviews related work by previous authors on statistical models for medical diagnosis, sexually transmitted diseases and cost-sensitive models. The review covers methods, data, findings and conclusions.

2.1 Brief History

Logistic regression evolved in the 19th century, where it was used to describe population growth and chain reactions (Cramer, 2002). It is a popular model in biomedical informatics for studying the relationship between a response and predictors that include physiological data; one medical area in which it has been applied is cancer prediction (Yusulf et al., 2012). Another method used for classification is the Naïve Bayes classifier, which has been studied since the 1950s. This is a probabilistic classifier that applies Bayes' theorem. Logistic regression and Naïve Bayes are similar in that both are linear classifiers.
Logistic regression estimates the conditional probability of the response variable given the predictors by minimizing the error, and is therefore termed a discriminative model, while Naïve Bayes estimates the joint probability of the response variable and the predictors, and is therefore a generative model (Kolluru, 2014). In some instances, researchers resort to nonparametric models such as decision trees and random forests for classification. There are two types of decision trees: classification trees, which have a discrete response variable, and regression trees, which use a continuous response variable. Decision trees were first introduced in the 1960s as one of the proactive methods in data mining and have been widely used in fields such as agriculture, astronomy, image processing, medicine, software development, finance, manufacturing and production (Hastie et al., 2009). The most used decision tree algorithms are Classification and Regression Trees (CART) (Breiman et al., 1984), Iterative Dichotomiser 3 (ID3), developed by Quinlan (1979, 1983, 1986), and C4.5 (Quinlan, 1993), an improvement on ID3. These classification methods identify the class to which an object belongs from its descriptive traits. In 2001, Leo Breiman developed random forests to improve the performance of CART. The name came from "random decision forests", first proposed by Tin Kam Ho. The method combines Breiman's "bagging" idea with Ho's random subspace method, producing a collection of decision trees with controlled variation. In most data mining applications, different misclassification errors incur different costs, so traditional techniques that aim at minimizing the error rate and assume equal misclassification costs tend to perform poorly in this area.
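The tree algorithms named above differ mainly in their impurity measure: ID3 and C4.5 split on entropy (via information gain), while CART uses the Gini index. A minimal sketch of these measures follows; the small label lists are invented toy data, not study data.

```python
# Sketch of the impurity measures behind ID3/C4.5 (entropy) and CART (Gini).
# The label lists below are invented toy data.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a class distribution (used by ID3/C4.5)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index of a class distribution (used by CART)."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Reduction in entropy achieved by a binary split."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ["pos"] * 4 + ["neg"] * 4      # perfectly mixed node
left, right = ["pos"] * 4, ["neg"] * 4  # a split that separates the classes
print(entropy(parent))                   # 1.0, the maximum for two classes
print(information_gain(parent, left, right))  # 1.0, the split removes all impurity
```

At each node, a tree algorithm evaluates candidate splits with one of these measures and keeps the split with the largest impurity reduction.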
Cost-sensitive learning (CSL) is an extension of traditional inductive learning that tackles the imbalance in misclassification errors, i.e., it minimizes classification costs. The issue is practical but challenging, because different classification errors often have distinct costs in real-world applications. In 1995, Turney accordingly developed cost-sensitive learning to address classification with non-uniform costs. In a paper published in 2000, he identified nine major types of cost:
• Misclassification
• Test
• Teacher
• Computational
• Interventional
• Unwanted achievement
• Human-computer interaction
• Cost of cases
• Cost of instability

2.2 Statistical model applications to sexually transmitted infection

Researchers have adopted various statistical techniques to model sexually transmitted diseases. In epidemiological research, the focus is on identifying risk factors of a disease, for which standard analytical techniques (e.g. logistic regression) are used. Some literature in the field of sexually transmitted diseases is reviewed below. Gardella et al. (2005) conducted a study to determine the risk factors for herpes simplex virus (HSV) acquisition among pregnant women at risk. The women and their partners were enrolled and tested for HSV, and the risk factors for HSV susceptibility, exposure and acquisition were determined using logistic regression. Hupert et al. (2006) conducted a study to determine the association between urinary tract infections and sexually transmitted infections using the history, clinical and laboratory findings of symptomatic women. A cross-section of 296 sexually active women between the ages of 14 and 22 years was recruited. Logistic regression and CART were used for the analysis, and both methods identified virtually the same risk factors.
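Of Turney's categories, this thesis concentrates on misclassification cost, summarised as a total classification cost. A hedged sketch of how that total is computed from a confusion matrix and a cost matrix; the counts and unit costs below are invented for illustration, not taken from the study.

```python
# Sketch: total classification cost from a confusion matrix and a cost matrix.
# Counts and unit costs are illustrative assumptions, not the study's values.

def total_cost(tp, fp, fn, tn, cost_fp=1.0, cost_fn=4.0):
    """Sum the penalties of the two error types; correct predictions cost 0."""
    return fp * cost_fp + fn * cost_fn

# A cost-insensitive classifier with few false positives but many misses...
insensitive = total_cost(tp=30, fp=5, fn=20, tn=145)   # 5*1 + 20*4 = 85
# ...versus a cost-sensitive one trading extra false positives for
# fewer, costlier false negatives.
sensitive = total_cost(tp=45, fp=15, fn=5, tn=135)     # 15*1 + 5*4 = 35
print(insensitive, sensitive)
```

Note that accuracy alone would rank these two classifiers as nearly equal (25 vs. 20 errors), while the cost-weighted total separates them sharply, which is the point of the cost-sensitive comparison the thesis performs.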
The misclassification rate for CART was 38%, but that of logistic regression was not stated in the manuscript, since the focus of the study was not on evaluating the performance of the two models. Kershaw et al. (2007) aimed to use individual, family and community level characteristics to construct a clinical classification tree to help identify women at risk of acquiring a sexually transmitted infection during pregnancy. The results of the study were intended to assist clinicians, who normally use informal decision trees, in making clinical decisions.
2.3 Application of classification trees to other medical diagnosis
Long et al. (1993) compared the performance of logistic regression to decision tree induction in classifying patients as having acute cardiac ischemia. The data for the study were obtained from six different New England hospitals, ranging from urban teaching to rural non-teaching hospitals. The training data had a total of 3,453 patients with 59 clinical features, while the test data set comprised 2,320 patients. Comparing error rates, logistic regression performed better than ID3 on the test set. Comparing the area under the receiver operating characteristic (ROC) curve for the test data, LR (0.89) was still better than ID3 (0.82). The ID3 tree was pruned to improve its performance, but the results indicated that LR still outperformed it, even though pruning showed an improvement over the default ID3. Ture et al. (2005) compared the performance of classification techniques used for the prediction of essential hypertension.
Among the classifiers were three decision trees (Chi-square Automatic Interaction Detector, CART and Quick Unbiased Efficient Statistical Tree), three statistical algorithms (logistic regression, flexible discriminant analysis and multivariate adaptive regression splines) and two neural networks (multilayer perceptron and radial basis function). For the study, a retrospective analysis was done on a total of 694 patients obtained from the Cardiology Clinic of Trakya University Medical Faculty in Turkey, 2002-2003. The dataset was split into 75% for training and 25% for testing. Findings from the study indicated that the two neural network methods performed better than the other classifiers, but the decision trees remained more advantageous than the statistical algorithms and neural networks, since the probabilities available at each terminal node depend on the tree structure and their interpretation may not be the same as for the other classifiers. The work of de Queiroz Mello et al. (2006) aimed to develop a predictive model, using logistic regression and classification trees, for smear negative pulmonary tuberculosis (SNPT) for outpatients in areas with scarce resources. The study enrolled 551 patients with clinical symptoms of SNPT, and the data were divided into training and validation sets. Model performance was evaluated using sensitivity, specificity and area under the ROC curve. Classification tree models performed better than logistic regression on the training data, but on the validation data logistic regression performed slightly better. Kurt et al. (2008) compared the performance of logistic regression, classification and regression trees and neural networks for predicting coronary artery disease (CAD). A retrospective dataset of 1,225 patients, obtained at the Cardiology Clinic of Trakya University Medical Faculty in Turkey between January 2002 and February 2003, was used for the study.
Findings indicated that the neural network outperformed the other classifiers in terms of the area under the ROC curve. The difference between the areas under the ROC curve of logistic regression and CART was statistically insignificant. Lavanya and Rani (2011) evaluated the performance of the ID3, C4.5 and CART classifiers on some medical datasets. These were the Diabetes, Heart Statlog, Thyroid, Breast cancer and Arrhythmia datasets, which were obtained from the UCI machine learning repository. Performance measures such as accuracy and time complexity were assessed using 10-fold cross-validation on the various datasets. The results on the experimental data indicated that CART performed better than the other two algorithms and also had an improved classification for the medical datasets. The study of Abdullah and Rajalaxmi (2012) used Random forest to improve prediction accuracy and to investigate the various events related to coronary heart disease (CHD). Data were obtained from the UCI machine learning repository. The classifiers were evaluated using the Kappa statistic, classification error and root mean square error. Results indicated that Random forest performed better than the decision trees based on the evaluation measures. In the research of Adeyemo and Adeyeye (2015), the performance of ID3, C4.5 and a multilayer perceptron (MLP) artificial neural network in the prediction of typhoid fever was compared. The data were obtained from a Nigerian hospital and divided into a training and a testing set. Classifiers were evaluated based on accuracy, root mean square error, F-measure, area under the ROC curve, mean absolute error, relative absolute error, mean relative square error and the Kappa statistic. The results indicated that MLP had a higher accuracy and performed better on the other evaluation measures than the other two classifiers.
In comparing the two decision tree classifiers, C4.5 outperformed ID3 in terms of area under the ROC curve, misclassification rate, root mean square error and the other evaluation measures. Mohammed (2016) analysed and compared the performance of various classification methods used to diagnose Parkinson's disease. These classification methods were Naïve Bayes, support vector machine (SVM) and decision trees. The dataset used contained 22 features obtained from 31 people, of whom 23 had Parkinson's disease (PD). Two datasets were used in the study: the actual PD dataset and a discretised PD dataset, in which the continuous variables were discretised. PD was diagnosed using numerous features obtained from the human voice. The dataset was divided into training (70%) and testing (30%) sets; cross-validation was also used without splitting the dataset into training and testing sets. Accuracy was used to compare the performance of the various models. Naïve Bayes performed better on the discretised dataset with cross-validation, yielding 84.6% accuracy, compared to using the actual dataset. SVM and decision trees also obtained high accuracies of 96.5% and 89.6% respectively on the discretised dataset. Hsieh et al. (2010) evaluated the performance of Random forest, support vector machine and artificial neural network compared with logistic regression in diagnosing acute appendicitis. Data for the study were from January 2006 to December 2008, during which patients suspected of acute appendicitis were enrolled. Sixteen input variables commonly used in diagnosing acute appendicitis were used. The operation note and pathology report were used to confirm the diagnosis of acute appendicitis. Those who did not have an operation note were followed up to make sure they were not false negatives.
The dataset was divided into two: seventy-five percent for training (i.e. used for the development of the various models) and twenty-five percent for testing. The area under the receiver operating characteristic curve (AUC), accuracy (AC), sensitivity (SN), specificity (SP), positive predictive value (PPV) and negative predictive value (NPV) were used to evaluate model performance. A total of 180 patients were enrolled, of whom 135 were used for training and 45 for testing. The AUC values on the testing dataset for the various models were Random forest (0.98), support vector machine (0.96), artificial neural network (0.91) and logistic regression (0.77). Random forest had higher AC, SN, SP, PPV and NPV than logistic regression; also, the SN (0.94) and SP (1.0) of Random forest were the same as those of the artificial neural network and support vector machine respectively. This is an indication that Random forest can predict acute appendicitis more accurately and can be an effective tool for clinical decision making. Jin et al. (2014) compared the predictive performance of several data mining algorithms using a dataset containing liver disease patients. The dataset was collected from Andhra Pradesh, India, and is made up of 414 confirmed liver disease patients and 165 people suspected of liver disease. Eleven predictor variables were used. The classification algorithms used were Naïve Bayes, decision tree, multilayer perceptron, Random forest, logistic regression and K-nearest neighbour. The performance of these algorithms was evaluated based on precision, sensitivity, specificity, accuracy, AUC and root mean square error (RMSE). The results indicate that, in terms of precision and specificity, Naïve Bayes is superior to the other classification algorithms. Also, logistic regression together with Random forest showed the highest AUC value.
The RMSE value of logistic regression was the lowest (0.42), which means the difference between the actual and expected values is small, indicating a relatively low error rate compared to the other models. The study of Danjuma and Osofisan (2015) aimed to identify the best performing predictive data mining algorithms used in the diagnosis of Erythemato-squamous disease. The predictive models developed were Naïve Bayes, multilayer perceptron and the J48 decision tree. A 10-fold cross-validation and a set of performance measures were used to evaluate the predictive performance of the models. The results indicated that Naïve Bayes had the best accuracy (97.4%) compared with the other classifiers.
2.4 Research work on Cost sensitive Methods
Cost sensitive learning is a machine learning approach which considers the cost of misclassification. Its methods can basically be grouped into two: direct methods and meta-learning. Turney (1995) contributed to the direct methods of cost sensitive learning by developing the ICET algorithm, which incorporated misclassification cost into the fitness function of a genetic algorithm. Ling et al. (2004) also contributed to the direct methods by considering classification cost in the tree generating process, selecting attributes with reduced expected total cost instead of attributes with minimum entropy. Meta-learning methods convert cost insensitive classifiers into cost sensitive ones. These methods can be grouped into two: thresholding and sampling. Domingos (1999) developed a method called MetaCost, which used cost insensitive bagging on decision trees to produce estimated probabilities from the training data and then applied thresholding to obtain the predicted class. Witten and Frank (2005) also used a cost insensitive algorithm to obtain probability estimates and then applied thresholding to predict the class labels.
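The thresholding idea can be sketched as follows (a minimal illustration, not the published MetaCost algorithm): for two classes with zero cost for correct decisions, the expected-cost-minimizing rule predicts the positive class whenever its estimated probability exceeds C_FP / (C_FP + C_FN), so any cost-insensitive probability estimator becomes cost-sensitive by moving this threshold.

```python
# Sketch of cost-sensitive thresholding: a cost-insensitive model supplies
# probability estimates, and only the decision threshold is changed.

def cost_sensitive_threshold(cost_fp, cost_fn):
    """Probability threshold minimizing expected misclassification cost,
    assuming correct classifications cost nothing."""
    return cost_fp / (cost_fp + cost_fn)

def classify(probabilities, threshold):
    """Apply the threshold to probability estimates of the positive class."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical probability estimates from any cost-insensitive classifier:
probs = [0.05, 0.15, 0.40, 0.80]

# Equal costs reproduce the usual 0.5 rule ...
print(classify(probs, cost_sensitive_threshold(1, 1)))   # [0, 0, 0, 1]
# ... while a false negative 10 times as costly lowers the threshold to 1/11.
print(classify(probs, cost_sensitive_threshold(1, 10)))  # [0, 1, 1, 1]
```

Lowering the threshold trades extra false positives for fewer, more expensive false negatives, which is exactly the behaviour desired when a missed infection is the costlier error.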
The sampling method modifies the class distribution of the training data and then applies cost insensitive classifiers directly to the sampled data. In the work of Weiss (2003), the effect of class distribution on decision trees was investigated using under-sampling and over-sampling, with performance measured by accuracy and area under the curve (AUC). The conclusion was that both the under-sampling and the over-sampling methods were effective for dealing with the class imbalance problem. Chawla et al. (2002) proposed an approach termed the Synthetic Minority Over-sampling Technique, which tends to reduce the overfitting problem of the over-sampling technique. The method creates synthetic data based on the minority class and has proven to be effective. Pazzani et al. (1994) compared various cost sensitive decision trees to cost insensitive CART and C4.5. The methods used in selecting the decision split in the tree were the Gini criterion with altered priors, and the misclassification cost as the test selection metric. The study found that the original CART and C4.5 performed better than the cost sensitive trees in terms of minimizing misclassification cost. Sahin et al. (2013) developed a new cost sensitive decision tree which minimizes the cost of misclassification while selecting the splitting attribute at each node. The model's performance was compared with known traditional classification models, such as CART and C5, on a credit dataset. The performance measures used were accuracy, true positive rate and saved loss rate. The findings of the study indicated that the cost sensitive tree algorithm outperformed the existing well known methods.
2.5 Conclusion
The review of literature from the various authors indicates that most studies in medical diagnosis use cost insensitive models to predict the infection status of various diseases.
This thesis seeks to introduce the cost of misclassification into the various models, since in real life there is variation in the cost of misclassification. The review of previous studies has shown that few studies have accounted for the different costs of misclassification resulting from type I and type II errors when developing classifiers, especially for sexually transmitted disease; most classifiers assume equal costs even though that is not the case in real life.
CHAPTER THREE
METHODOLOGY
This chapter of the thesis describes the various cost insensitive statistical models, namely logistic regression, classification trees and Random forest, which were used to fit the data. It also describes how the cost of misclassification was included in these models, together with the various performance measures used to evaluate them.
3.1 Source of Data
Secondary data from the U.S. Naval Medical Research Unit 3, Ghana Detachment (NAMRU-3 GD) were used for the thesis. The data form a four-year dataset spanning 2012 to 2016. Patients enrolled into the main study, titled "Sexually Transmitted Disease (STD) Surveillance Characterizing Gonorrhea and Chlamydia Prevalence and Gonorrhea Resistance Profile in Ghana", were from 37 Military Hospital, Adabraka Polyclinic and three garrison clinics in Takoradi (Naval Sick Bay, Airforce Medical Center and 2 Medical Reception Station). To participate in the study, patients needed to be eligible for enrolment.
Inclusion Criteria
• Aged ≥ 18: may provide independent (autonomous) consent
• Aged > 11 and < 18: may participate but will require parental consent and the child's assent
• Patients presenting with an STI syndrome (i.e., urethritis in men and cervicitis in females)
• Pregnant women may be included.
This is a group which may particularly benefit from information regarding STI transmission in order to protect themselves and their foetus.
If a patient fulfils the above criteria, a questionnaire is administered, a urine sample is obtained (for Nucleic Acid Amplification Test (NAAT) testing) and two swabs of discharge from the penis or the vagina/cervix are obtained (for culture and gram stain testing).
Exclusion Criteria
Patients presenting without an STI syndrome or suspicion of gonorrhea.
3.2 Description of Data
The variables used in the study are described below.
Dependent Variable
The dependent variable is the gonorrhoea status of patients. This infection status was obtained using the nucleic acid amplification test, which is a molecular diagnostic method.
Independent Variables
The table below provides a description of the various independent variables used.
Table 3.1: Independent variables used in modelling
Demographic:
  Gender - Binary
  Age - Count
  Marital status - Discrete
  Educational level - Discrete
Clinical Presentation:
  Painful urination - Binary
  Discharge - Binary
  Pain in penis or vagina - Binary
  Foul smell - Binary
  Painful sex - Binary
  Bleeding from penis or vagina - Binary
  Itching of genitals - Binary
Sexual Behaviour:
  Alcohol intake - Binary
  Use of condom - Discrete
  Having more than one sexual partner in the past month - Discrete
3.3 Definition of Some Medical and Statistical Terminology
Culture: a laboratory diagnostic method used for diagnosing gonorrhea infection.
Majority voting: majority voting occurs when the class predicted by the majority of classifiers is chosen over the other classes.
Machine learning: machine learning is a field of computer science in which programs learn from data without relying on rule-based programming.
3.4 Logistic Regression
Logistic regression is the most common model used in medical diagnosis to fit a binary or dichotomous response variable (Hilbe, 2011).
It was first used in the 19th century to describe population growth, and it has now been adopted in biomedical research to model the log odds of the response variable using the logistic function (Cramer, 2002). The response variable (Y) takes two values, 0 and 1: the event Y = 1 is the success and Y = 0 is the failure.

\[ Y_i = \begin{cases} 1 & \text{if the outcome is observed} \\ 0 & \text{otherwise} \end{cases} \]

3.4.1 Assumptions of logistic regression
• A normal distribution is not necessary or assumed for the response variable
• Normally distributed errors are not assumed
• Equal variance is not assumed for each level of the independent variables
• Linearity is assumed between the log odds of the response variable and the covariates
3.4.2 Model Specification
Let us consider a response variable Y which is binary and has a Bernoulli distribution, \( y \sim B(1, \pi) \):

\[ \log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \sum_{i=1}^{K}\beta_i X_i \qquad (3.0) \]

where π is the probability of success, \( X = (x_1, \ldots, x_K)' \) are the independent variables and \( \beta = (\beta_0, \ldots, \beta_K)' \) are the unknown parameters.
3.4.3 Model Evaluation and Diagnostics
The goodness of fit of the logistic regression is assessed using the likelihood ratio test (LRT) and the Hosmer-Lemeshow (HL) test. The LRT is used to determine the overall significance of the predictors in the model (Bewick, Cheek, & Ball, 2005). The Hosmer-Lemeshow test is used to determine whether some form of unmodelled interaction or non-linearity exists in the model. Other diagnostic measures used to evaluate the model are the deviance, the Wald statistic and the linearity test. The Wald statistic is used to determine the significance of each predictor in the model (Hosmer Jr, Lemeshow, & Sturdivant, 2013). The linearity test is used to determine whether the linearity between the log odds of the dependent variable and the covariates is assumed correctly. Below is a detailed description of the methods.
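Before the individual diagnostics, the fitting of the model in (3.0) can be made concrete with a minimal sketch: the logistic log likelihood is maximized by gradient ascent on a small hypothetical dataset with a single binary predictor (the data and learning rate are illustrative only).

```python
# Minimal sketch of fitting the logistic model in (3.0) by maximizing the
# log likelihood with gradient ascent, on a small hypothetical dataset.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, iterations=5000):
    """Estimate (beta0, beta1) for a single predictor by gradient ascent."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        # Gradient of the log likelihood: sum over i of (y_i - pi_i) * x_ik
        g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys))
        g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys))
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

# Hypothetical data: a binary symptom (x) and infection status (y).
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [0, 0, 0, 1, 0, 1, 1, 1]
beta0, beta1 = fit_logistic(x, y)

# The fitted slope is the log odds ratio; with odds of 3/1 when x = 1 and
# 1/3 when x = 0, beta1 should approach log(9) ≈ 2.20.
print(round(beta1, 2))
```

With one binary predictor the maximum likelihood estimates simply reproduce the observed log odds in each group, which makes the result easy to verify by hand.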
3.4.3.1 Likelihood Ratio Test
The likelihood ratio test compares the likelihood of the data under the full model against the likelihood under a model with fewer predictors. The likelihood ratio statistic is asymptotically distributed as chi-square, with degrees of freedom equal to the difference in the number of parameters between the two models (one degree of freedom when a single predictor is tested). It is used to determine the overall significance of the predictors. The LRT test statistic is given by

\[ LRT = 2(\log L_F - \log L_R) \qquad (3.2) \]

where \( \log L_F \) is the log likelihood of the full model and \( \log L_R \) is the log likelihood of the reduced model:

\[ \log L_F = \sum_{i=1}^{n}\left[ y_i\left(\beta_0 + \sum_{k=1}^{K}\beta_k X_{ik}\right) - \log\left(1 + e^{\beta_0 + \sum_{k=1}^{K}\beta_k X_{ik}}\right) \right] \]

\[ \log L_R = \sum_{i=1}^{n}\left[ y_i\beta_0 - \log\left(1 + e^{\beta_0}\right) \right] \]

When the p-value obtained is less than 0.05, the full model fits the data significantly better than the reduced model.
3.4.3.2 Hosmer-Lemeshow Test
The Hosmer-Lemeshow test is another approach to determining goodness of fit, with the data divided into subgroups having similar predicted probabilities. The test seeks to find out whether the proportion of events observed in each subgroup matches the predicted probabilities, using a Pearson chi-square statistic. The Hosmer-Lemeshow goodness of fit statistic is calculated as

\[ \chi^2_{HL} = \sum_{i=1}^{G}\frac{(O_i - N_i\bar{\pi}_i)^2}{N_i\bar{\pi}_i(1-\bar{\pi}_i)} \qquad (3.3) \]

where G is the number of subgroups, \( O_i \) is the number of responses in the i-th group, \( N_i \) is the number of observations in the i-th group, and \( \bar{\pi}_i \) is the average predicted probability in the i-th group. The test statistic approaches a chi-square distribution with G − 2 degrees of freedom. Small p-values indicate a poor fit to the data, while large p-values (0.05 or more) indicate otherwise.
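The Hosmer-Lemeshow statistic in (3.3) can be computed directly from labels and predicted probabilities. The sketch below uses equal-sized groups and hypothetical data; in practice ten groups based on deciles of risk are the usual choice.

```python
# Sketch of the Hosmer-Lemeshow statistic (3.3): observations are grouped
# by predicted probability, and observed events are compared with the
# expected count N_i * pi_bar in each group.

def hosmer_lemeshow(y, p, groups=4):
    """HL chi-square statistic for labels y and predicted probabilities p."""
    pairs = sorted(zip(p, y))  # order observations by predicted probability
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        n_i = len(chunk)
        observed = sum(yy for _, yy in chunk)
        pi_bar = sum(pp for pp, _ in chunk) / n_i
        expected = n_i * pi_bar
        stat += (observed - expected) ** 2 / (expected * (1 - pi_bar))
    return stat  # compare with chi-square on (groups - 2) degrees of freedom

# Hypothetical, well-calibrated predictions give a small statistic:
y = [0, 0, 0, 1, 0, 1, 1, 1]
p = [0.1, 0.1, 0.3, 0.3, 0.6, 0.6, 0.9, 0.9]
print(round(hosmer_lemeshow(y, p), 3))  # 0.909
```

A statistic this small, on G − 2 = 2 degrees of freedom, would give a large p-value, i.e. no evidence of lack of fit for these hypothetical data.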
3.4.3.3 Pearson Residual
The Pearson residual is the difference between the observed value and the predicted probability from the model, divided by the binomial standard deviation of the predicted probability; it is used to correct for the uneven variation in the raw residuals. Mathematically it can be expressed as

\[ PR_i = \frac{Y_i - \hat{\pi}_i}{\left(\hat{\pi}_i(1-\hat{\pi}_i)\right)^{1/2}} \qquad (3.4) \]

The Pearson test statistic follows a chi-square distribution with N − k degrees of freedom and is given by

\[ \chi^2 = \sum_{i} \frac{(y_i - \hat{\pi}_i)^2}{\hat{\pi}_i(1-\hat{\pi}_i)} \qquad (3.4.1) \]

3.4.3.4 Deviance Residual
The deviance residual is used to determine whether specific observations fit the model properly. The residual is positive when the observed value is greater than the predicted probability and negative otherwise. It is expressed mathematically as

\[ RD_i = \pm\left\{ 2\left[ y_i\log\left(\frac{y_i}{\hat{\pi}_i}\right) + (1-y_i)\log\left(\frac{1-y_i}{1-\hat{\pi}_i}\right) \right] \right\}^{1/2} \qquad (3.5) \]

The deviance test statistic is expressed as

\[ D = -2\sum_{i=1}^{n}\left[ y_i\log(\hat{\pi}_i) + (1-y_i)\log(1-\hat{\pi}_i) \right] \]

3.4.3.5 Wald Statistic
The Wald statistic is the ratio of the square of the estimated parameter to the square of its estimated standard error. It is asymptotically distributed as chi-square with one degree of freedom and is used to test whether an individual predictor variable has a significant influence on the response:

\[ W_i = \frac{\hat{\beta}_i^2}{SE_{\hat{\beta}_i}^2} \qquad (3.6) \]

where \( SE_{\hat{\beta}_i} \) is the standard error of the i-th estimated parameter, i.e. the square root of the i-th diagonal element of the estimated covariance matrix.
3.4.3.6 Linearity Test
One way of assessing the linearity assumption between the log odds of the response variable and a continuous covariate is the use of locally weighted scatterplot smoothing (LOESS). It combines the simple form of least squares regression with the flexible form of non-linear regression.
It fits simple models to localised subsets of the data to build up a function that describes the deterministic component of the variation in the data. A functional form does not need to be specified to fit the model; hence it is able to show complex relationships in the data that could otherwise be missed. The model can be expressed as

\[ y_i = m(x_i) + \varepsilon_i \qquad (3.7) \]

where m is an unspecified regression function and \( \varepsilon_i \) is the random error. The LOESS method is used to estimate the function m.
3.4.4 Parameter Estimation of Logistic Regression
The parameters of the logistic regression model were estimated using the maximum likelihood and Bayesian estimation methods. Below is a detailed description of the methods.
3.4.4.1 Maximum likelihood estimation of parameters of logistic regression
Maximum likelihood estimation finds the parameters which maximize the likelihood function, i.e. which give the smallest possible deviance between the observed and predicted values, where

\[ \pi_i = \frac{1}{1 + e^{-\sum_{k=0}^{K}\beta_k x_{ik}}}, \qquad i = 1, 2, \ldots, n \qquad (3.8) \]

The probability distribution of the response is represented as

\[ f(y_i) = \pi^{y_i}(1-\pi)^{1-y_i} \qquad (3.9) \]

The likelihood function can be expressed as

\[ L(y_i, \beta_0, \beta_1, \ldots, \beta_K) = \prod_{i=1}^{n}\pi^{y_i}(1-\pi)^{1-y_i} \qquad (3.10) \]

Taking the log of both sides,

\[ l(y_i, \beta_0, \beta_1, \ldots, \beta_K) = \sum_{i=1}^{n} y_i\log\pi + \sum_{i=1}^{n}(1-y_i)\log(1-\pi) = \sum_{i=1}^{n} y_i\log\frac{\pi}{1-\pi} + \sum_{i=1}^{n}\log(1-\pi) \]

Differentiating with respect to \( \beta_k \), and using π as in equation (3.8), yields

\[ \frac{\partial l(\beta)}{\partial \beta_k} = \sum_{i=1}^{n} y_i x_{ik} - \sum_{i=1}^{n}\frac{e^{\sum_{k=0}^{K}x_{ik}\beta_k}}{1 + e^{\sum_{k=0}^{K}x_{ik}\beta_k}}\, x_{ik} = \sum_{i=1}^{n}\left(y_i - \pi_i\right)x_{ik} \]

The maximum likelihood estimates are obtained by setting each of the k + 1 equations above to zero and solving for each \( \beta_k \), where k is the number of independent variables specified in the model.
3.4.4.2 Variable Selection
The Akaike information criterion (AIC) is used to select variables for the model. It is a measure of the relative quality of statistical models: given a set of models, AIC estimates the quality of each model relative to the others and is hence used to select the best model. Mathematically it is expressed as

\[ AIC = 2k - 2\log L = 2k + \text{Deviance} \]

where L is the maximum value of the likelihood function.
3.4.4.3 Bayesian Estimation
The Bayesian method of estimation is flexible and does not demand compliance with the assumptions required by the maximum likelihood approach used by classical techniques. The flexibility of the Bayesian method is enhanced by the use of Markov chain Monte Carlo (MCMC) methods, which are used as the basis for sampling (Acquah, 2013). It has become a method for fitting various non-linear regression models. The method is not often used because little is understood about its concepts in Bayesian analysis and its application to logistic regression. In the Bayesian framework, there are three components associated with parameter estimation: the prior distribution, the likelihood function and the posterior distribution. The likelihood function can be expressed as

\[ L(y_i, \beta_0, \beta_1, \ldots, \beta_K) = \prod_{i=1}^{n}\pi^{y_i}(1-\pi)^{1-y_i} \]

3.4.4.4 Prior Distribution
There are two types of prior distribution which are often used: informative priors and non-informative priors. Informative prior distributions are used when the likely values of the unknown parameters are known; non-informative priors are used when little is known about the values of the parameters.
In this study, the prior used is a multivariate normal prior on β, with each component expressed as

\[ \beta_j \sim N(\mu_j, \sigma_j^2) \qquad (3.11) \]

The common choice of \( \mu_j \) is zero, and \( \sigma_j \) is set to 1000, which is large enough for the prior to be considered non-informative (Acquah, 2013).
3.4.4.5 Posterior Distribution
The posterior is obtained by multiplying the prior and the likelihood function, as given by

\[ P(\beta \mid y, x) \propto \prod_{i=1}^{n}\pi^{y_i}(1-\pi)^{1-y_i} \prod_{j=1}^{K}\frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\left\{ -\frac{1}{2}\left(\frac{\beta_j - \mu_j}{\sigma_j}\right)^2 \right\} \qquad (3.12) \]

The above equation has no closed form; hence the Gibbs sampler was used to approximate the properties of the marginal posterior distribution of each parameter.
3.4.4.6 Markov Chain Monte Carlo (MCMC)
MCMC is a computational method in Bayesian estimation which is used to obtain a sequence of random samples from a probability distribution. In this method, values of the parameters β are drawn from an approximate distribution, and the draws are then corrected to better approximate the desired posterior distribution. For a Markov chain, the sample at step t + 1 depends on the sample at step t. Two MCMC algorithms are commonly used: the Gibbs sampler and the Metropolis-Hastings algorithm (Medova, 2008). In this thesis, the Gibbs sampler was used.
3.4.4.7 Gibbs Sampling
The Gibbs sampler produces a sequence of samples from the joint distribution of two or more random variables. It can be used to sample from the joint distribution when the full conditional distribution of each parameter is known, using an iterative procedure to sample from the posterior. The algorithm is as follows:
0. Set arbitrary starting values \( \{\beta_0^{(0)}, \ldots, \beta_k^{(0)}\} \)
1. Draw \( \beta_0^{(1)} \) from \( P(\beta_0 \mid \beta_1^{(0)}, \ldots, \beta_k^{(0)}, y, x) \)
2. Draw \( \beta_1^{(1)} \) from \( P(\beta_1 \mid \beta_0^{(1)}, \beta_2^{(0)}, \ldots, \beta_k^{(0)}, y, x) \)
3. Draw \( \beta_2^{(1)} \) from \( P(\beta_2 \mid \beta_0^{(1)}, \beta_1^{(1)}, \beta_3^{(0)}, \ldots, \beta_k^{(0)}, y, x) \)
...
k. Draw \( \beta_k^{(1)} \) from \( P(\beta_k \mid \beta_0^{(1)}, \beta_1^{(1)}, \ldots, \beta_{k-1}^{(1)}, y, x) \)
This completes iteration 1 of the Gibbs sampler, giving \( \{\beta_0^{(1)}, \ldots, \beta_k^{(1)}\} \). For the second iteration:
1. Draw \( \beta_0^{(2)} \) from \( P(\beta_0 \mid \beta_1^{(1)}, \ldots, \beta_k^{(1)}, y, x) \)
2. Draw \( \beta_1^{(2)} \) from \( P(\beta_1 \mid \beta_0^{(2)}, \beta_2^{(1)}, \ldots, \beta_k^{(1)}, y, x) \)
3. Draw \( \beta_2^{(2)} \) from \( P(\beta_2 \mid \beta_0^{(2)}, \beta_1^{(2)}, \beta_3^{(1)}, \ldots, \beta_k^{(1)}, y, x) \)
...
k. Draw \( \beta_k^{(2)} \) from \( P(\beta_k \mid \beta_0^{(2)}, \beta_1^{(2)}, \ldots, \beta_{k-1}^{(2)}, y, x) \)
This completes iteration 2, giving \( \{\beta_0^{(2)}, \ldots, \beta_k^{(2)}\} \). The procedure continues until T iterations are obtained, yielding \( \{\beta_0^{(T)}, \ldots, \beta_k^{(T)}\} \).
3.4.4.8 Burn-in
The time it takes for a Markov chain to converge depends on the starting point; in practice, a certain number of initial draws, known as the burn-in, are thrown out. This makes the retained draws closer to the stationary distribution and less dependent on the starting point.
3.5 Classification Trees
This method is a form of decision tree based on the repeated partitioning of the data into increasingly homogeneous groups in order to estimate the conditional probabilities of the outcome given the predictor variables. When the response variable is discrete, a classification tree is used, and when the response variable is continuous, a regression tree is used (De'ath & Fabricius, 2000). The main components of a classification tree are nodes and branches. The important processes for constructing a classification tree are splitting, stopping and pruning.
3.5.1 Basic Definitions Used in Classification Trees
Nodes: A classification tree has three types of nodes: root, internal and leaf nodes. The root node results in the subdivision of the observations into two or more mutually exclusive subsets. An internal node has its top part connected to a parent node and its bottom part connected to child or leaf nodes. A leaf node represents a final result of the classification tree.
Branches: Branches represent chance outcomes that originate from root nodes and internal nodes.
A classification tree model is formed from ordered branches, in which each path from the root node through internal nodes to a leaf node represents a classification decision rule.
3.5.2 Tree Construction
The classification tree is constructed using the steps below:
1. Select the feature with the highest information gain to be the root node.
2. For each value of the feature at the node, create a new descendant node.
3. Sort the training examples to the leaf nodes.
4. When the training examples are perfectly classified, stop; otherwise, iterate over the new leaf nodes.
There are various algorithms for constructing classification trees; the commonly known ones are ID3, C4.5 and CART.
3.5.3 ID3
This tree algorithm is constructed in two phases: tree building and pruning. It uses information gain to choose the splitting attribute and accepts only categorical attributes in building a tree model. It performs poorly when there is noise (Peng, Chen, & Zhou, 2009). To build the tree, the information gain, measured via entropy, is calculated for all the attributes, and the attribute with the highest information gain is assigned to the root node. Continuous attributes are handled by discretizing them, or by directly finding the best split point by taking a threshold value on the attribute. ID3 does not support pruning.
3.5.4 C4.5
This algorithm is also based on Hunt's algorithm (Kohavi & Quinlan, 2002). It handles both continuous and categorical attributes in building the tree. To handle continuous attributes, C4.5 splits the attribute values into two partitions based on a selected threshold, one partition as one child node and the other as the second. It uses the gain ratio, as defined in Section 3.5.7.1, as the attribute selection measure to build the tree. It removes the bias of information gain toward attributes with many outcome values.
It also handles missing attribute values.
3.5.5 CART
This algorithm is also based on Hunt’s algorithm and handles both continuous and categorical attributes when constructing the decision tree. CART uses the Gini index for selecting attributes for the tree. Unlike the other methods, it makes no probability assumptions. CART produces binary splits and uses cost-complexity pruning to eliminate unreliable branches from the decision tree and improve accuracy. It can handle missing values (Lewis, 2007).
3.5.6 Splitting criterion
In constructing a decision tree, it is important to choose a split that provides the most information about the class label, that is, one that yields nodes that are predominantly of one class. The goal is to obtain nodes that are pure. Impurity measures the extent to which a node is not completely homogeneous. There are many impurity measures (splitting criteria) used for classification trees, including information gain, the Gini index and the gain ratio. For regression trees, the splitting criterion used is the sum of squares about the group mean or the sum of absolute deviations about the median.
3.5.7 Information gain
Information gain is the change in entropy. It is used to determine which attribute in a given set of training features is useful for discriminating between the classes, and it indicates how important a given attribute is for improving the ordering of the nodes of a decision tree. Information gain is calculated as the entropy of the examples at the node being split minus the weighted entropy of the examples after a split on an attribute.
Mathematically, information gain is defined as
Gain(X, A) = Entropy(X) − Σ_{v∈Values(A)} (|X_v|/|X|) Entropy(X_v)   (3.13)
or
Gain(X, A) = Gini(X) − Gini(X)_split
where Gini(X) is given in (3.15).
3.5.7.1 Gain ratio
The information gain ratio is the ratio of the information gain to the intrinsic value (IV). It reduces the bias of the tree towards attributes with a large number of distinct values, which addresses the drawback of information gain:
IV(X, A) = − Σ_{v∈Values(A)} (|X_v|/|X|) log2 (|X_v|/|X|)
GR(X, A) = Gain(X, A) / IV(X, A)
3.5.7.2 Entropy
Entropy is a measure of the uncertainty (impurity) associated with an attribute. The entropy of a discrete random variable X taking values x1, …, xn with probabilities p(xi) is defined as
Entropy(X) = − Σ_{i=1}^{C} p(xi) log2 p(xi)   (3.14)
where C is the number of classes, and p(xi) log2 p(xi) is taken to be zero when p(xi) = 0 or p(xi) = 1. Note that when entropy is low, information gain is high.
3.5.7.3 Gini index
The Gini index is the probability that a randomly chosen sample would be incorrectly classified if it were labelled at random according to the class distribution in the node. The aim of the Gini index is to reduce the misclassification rate. Mathematically it is expressed as
Gini(X) = 1 − Σ_{i=1}^{C} (p(xi))^2   (3.15)
When a node is partitioned by attribute A, the quality of the split is
Gini(X)_split = Σ_{v∈Values(A)} (|X_v|/|X|) Gini(X_v)
3.5.8 Pruning
Pruning is the removal of portions of the decision tree that have no significant influence and provide little power in classifying labels; it improves prediction accuracy and reduces overfitting. There are several methods of pruning; the commonly used ones are reduced error pruning and cost-complexity pruning.
3.5.8.1 Reduced error pruning
Reduced error pruning starts at the leaves, where each node is assigned the majority class. If the prediction accuracy does not change, the pruned tree is kept. This method of pruning is simple and fast (Patel & Upadhyay, 2012).
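As an illustration of these impurity measures, the following minimal Python sketch computes entropy, the Gini index and information gain from lists of class labels. The function names are illustrative only; the thesis analysis itself was carried out in R.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, as in (3.14)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values() if c > 0)

def gini(labels):
    """Gini index of a list of class labels, as in (3.15)."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, attribute):
    """Information gain (3.13): entropy at the node minus the weighted
    entropy of the children after splitting on `attribute` (a parallel list)."""
    n = len(labels)
    groups = {}
    for a, y in zip(attribute, labels):
        groups.setdefault(a, []).append(y)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted
```

For a node with two equally frequent classes the entropy is 1 bit and the Gini index is 0.5; an attribute that separates the classes perfectly yields an information gain equal to the parent node's entropy.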
3.5.8.2 Cost-complexity pruning
This method generates a series of trees T0, …, Tmax, where T0 is the initial tree and Tmax the largest tree. For any subtree T, the complexity |T̃| is defined as the number of terminal nodes in T. The cost complexity Rα(T) is then the resubstitution error rate plus a complexity penalty. The resubstitution error rate on its own is not an appropriate measure for selecting subtrees because it always favours the bigger ones; adding a complexity penalty makes the measure favour smaller trees, hence reducing the cost complexity when pruning the tree. Mathematically, it can be expressed as
Rα(T) = R(T) + α|T̃|
where α ≥ 0 is the complexity parameter and R(T) is the resubstitution error rate. Note that as α approaches infinity, the minimising subtree shrinks to the tree of size one (a single root node). For a pre-selected α, the subtree T(α) that minimises the cost complexity is obtained, such that
Rα(T(α)) = min_{T ⪯ Tmax} Rα(T)
3.6 Random forest
Random forests are ensemble methods, meaning they are made up of smaller models, namely classification or regression trees, whose outputs are combined to obtain predictions. They are useful for exploratory analysis and for detecting interactions and non-linearity without specifying them in the model beforehand. The training algorithm for random forests uses the technique of bootstrap aggregating, or bagging: the observations for each tree are selected at random, and likewise the variables. The prediction by majority vote is
H(x) = argmax_y (1/K) Σ_{i=1}^{K} I(hi(x) = y)   (3.16)
where H(x) is the class label obtained through majority voting, K is the number of decision trees, and hi is the i-th tree of the random forest.
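The majority vote in (3.16) can be sketched in a few lines of Python. This is illustrative only: the stub functions below stand in for fitted classification trees, which would be trained models in practice.

```python
from collections import Counter

def majority_vote(trees, x):
    """Predict the class of x by majority vote over an ensemble, as in (3.16)."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Three stub "trees" standing in for fitted classifiers.
ensemble = [lambda x: 1, lambda x: 0, lambda x: 1]
print(majority_vote(ensemble, x=None))  # 1, since two of the three trees vote for class 1
```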
In order to fine-tune the forest, two parameters must be considered:
• the number of trees that corresponds to a stable classifier
• the number of random variables used in each tree
Variable importance is estimated from the margin of the cases, defined as the proportion of votes for the true class minus the maximum proportion of votes for any other class. It measures the association between a given variable and the classification results. For a variable j and tree t,
VI_j^(t) = Σ_{i∈βt} I(yi = ŷi^(t)) / |βt| − Σ_{i∈βt} I(yi = ŷi,a^(t)) / |βt|   (3.17)
where βt is the out-of-bag (OOB) sample for tree t ∈ (1, 2, …, ntree), ŷi^(t) is the predicted class of the i-th case before permutation of variable j in tree t, and ŷi,a^(t) is the predicted class after permutation. The variable importance of variable j in the random forest (RF) is then
VI_j = Σ_{t=1}^{ntree} VI_j^(t) / ntree
The merit of RF over logistic regression is that the essential variables are selected automatically, no matter how many are used initially; it does not use stepwise regression to select variables. On average, about 63% of the cases are used to construct each tree, while the remaining 37%, the out-of-bag (OOB) cases, are used to evaluate the performance of the tree (Steinberg, Golovnya, & Cardell, 2004).
3.6.1 Random forest construction
1. Obtain ntree independent and identically distributed bootstrap samples from the original data with N available cases. The samples are obtained with replacement.
2. Grow each tree on its bootstrap sample; the cases not selected, called the out-of-bag (OOB) cases, are used to validate the tree.
Out-of-bag estimate: the out-of-bag estimate is the average prediction error over the training samples, using for each sample only the trees that did not contain that observation in their bootstrap sample. It measures the prediction error of the random forest and of other machine learning models, which makes cross-validation unnecessary for these models.
3.
If there are M variables, m < M randomly selected variables are considered at each node, and the best split on these m variables is used to split the node. The number m remains the same throughout the forest-growing process, although the particular variables vary from node to node.
4. Predictions are obtained by taking the majority vote over the ntree trees.
5. The error rate is obtained by averaging the prediction errors on the OOB data from the bootstrap iterations.
3.6.2 Bagging
Bagging consists of obtaining repeated samples from a dataset in order to produce C different bootstrapped training datasets, where the c-th bootstrapped training set is used to train a model fc. The separate predictions are then averaged to obtain a statistical learning model with low variance:
f(x) = (1/C) Σ_{c=1}^{C} fc(x)   (3.18)
3.7 Cost-sensitive modelling
In this method, a loss matrix is incorporated into the classification tree and random forest and used to weight misclassifications differently. In medical diagnosis, a false positive (type I error) and a false negative (type II error) have different costs. The classifier takes into consideration how much to penalise each incorrect classification at a given choice of split. The cost matrix L is stated as
L = ( C_TP   C_FP )
    ( C_FN   C_TN )
where C_FP is the cost of a false positive, C_FN the cost of a false negative, C_TP the cost of a true positive, and C_TN the cost of a true negative. The cost matrix is used to adjust the way splits are chosen: the tree is constructed using the splitting criterion that minimises the misclassification cost rather than the entropy. It can also be used to tune the threshold on the predicted probability of classification.
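As a minimal sketch, assigning an instance to the class with the smaller expected misclassification cost under such a cost matrix can be written as follows. This is illustrative Python (the thesis analysis was performed in R), and the helper function name is hypothetical.

```python
def min_cost_class(p_pos, c_fp, c_fn, c_tp=0.0, c_tn=0.0):
    """Return 1 (positive) if predicting positive has the lower expected cost.

    p_pos is the model's predicted probability of the positive class.
    The expected cost of each prediction weights the cost entries by the
    class probabilities, as in the cost matrix L above.
    """
    cost_pred_pos = (1 - p_pos) * c_fp + p_pos * c_tp  # expected cost of predicting positive
    cost_pred_neg = p_pos * c_fn + (1 - p_pos) * c_tn  # expected cost of predicting negative
    return 1 if cost_pred_pos <= cost_pred_neg else 0

# With C_FP = 1 and C_FN = 4, the decision threshold becomes 1/(1+4) = 0.2,
# so a predicted probability of 0.26 is classified positive.
print(min_cost_class(0.26, c_fp=1, c_fn=4))  # 1
```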
The theory of cost-sensitive learning, as reported in Elkan (2001) and Zadrozny and Elkan (2001), describes how misclassification costs play an important role in cost-sensitive learning algorithms. Since the study is focused on binary classification, let i, j represent the two classes. For an instance to be classified so as to minimise the expected cost, the expected cost of predicting that an instance x belongs to class i can be represented as
R(i|x) = Σ_j P(j|x) C(i, j),  i = 0, 1 and j = 0, 1   (3.19)
where P(j|x) is the probability that instance x belongs to class j, C_TP = C(1,1), C_TN = C(0,0), C_FN = C(0,1) and C_FP = C(1,0). The classifier assigns an instance to the positive class if and only if
P(0|x)C(1,0) + P(1|x)C(1,1) ≤ P(0|x)C(0,0) + P(1|x)C(0,1)
or
P(0|x)C_FP + P(1|x)C_TP ≤ P(0|x)C_TN + P(1|x)C_FN.
In this thesis we assume C_TP = C_TN = 0, so the condition becomes
P(0|x)C_FP ≤ P(1|x)C_FN.
Noting that P(0|x) = 1 − P(1|x),
(1 − P(1|x))C_FP ≤ P(1|x)C_FN
P(1|x) ≥ C_FP / (C_FN + C_FP)   (3.20)
From the above it can be seen that C_FP/(C_FN + C_FP) is the threshold above which the classifier assigns an instance to the positive class. When the misclassification costs are equal, that is C_FP = C_FN = 1, the threshold is 0.5.
3.7.1 Modifying the classification of the predicted probability scores obtained from logistic regression to include unequal misclassification costs
In order to determine the optimal decision threshold for the predicted probability scores of the logistic regression, the receiver operating characteristic (ROC) analysis expressed in (3.8.1) was used, taking into account the relative costs of false negatives and false positives:
Total expected cost (C_Total) = π0(1 − SP)C_FP + π1(1 − SN)C_FN
where SP is the specificity, SN the sensitivity, π0 the prior probability of negative cases, and π1 the prior probability of positive cases. Sensitivity (SN) is a function of the false positive rate along the ROC curve.
The total expected cost is therefore equivalently
C_Total = π0 · FPR · C_FP + π1[1 − ROC(FPR)]C_FN
For minimal cost, the optimal cut-off is obtained by differentiating with respect to FPR and setting the derivative to zero, which gives
dROC(FPR)/dFPR = (π0/π1)(C_FP/C_FN)
Alternatively, the optimal cut-off can be obtained by evaluating the cut-offs corresponding to each value of (1 − SP) and choosing the one at which the total expected cost is minimal.
3.8 Sampling
Let (X, Y) be the original unbalanced training sample and (ℵ, γ) the balanced sample, so that (ℵ, γ) ⊆ (X, Y). Suppose s is a binary selection variable taking the value 1 if a point is in (ℵ, γ) and 0 otherwise; then the posterior distribution of the positive class on the balanced data can be related to that on the original data. Let β = p(s = 1 | y = 0) be the probability of selecting a negative instance under under-sampling, p = p(y = 1 | x) the posterior probability of the positive class on the original dataset, and ps = p(y = 1 | x, s = 1) the posterior after sampling. Then
ps = p / (p + β(1 − p))   (3.21)
which can also be expressed as
p = βps / (βps − ps + 1)
To balance an unbalanced class distribution according to the cases,
β = p(y = 1)/p(y = 0) ≈ N+/N−
where N+ is the number of positive instances, N− the number of negative instances in the dataset, and N+/N− the minimum value for β. When β = 1, all the negative instances are used for training, while for β < 1 a subset of the negative instances is used. For β < N+/N− we would have more positive than negative cases. In under-sampling, the number of negatives is Ns− = βN− while the number of positives is Ns+ = N+. In this thesis, the under-sampling method was used to vary the ratio of gonorrhoea-negative to gonorrhoea-positive cases (i.e. 40:60, 50:50 and 60:40).
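A minimal sketch of under-sampling the negative class and mapping a posterior estimated on the under-sampled data back to the original scale via (3.21), in illustrative Python (the thesis analysis was done in R; the function names are not from the thesis):

```python
import random

def undersample_negatives(data, beta, seed=0):
    """Keep every positive case and each negative case with probability beta."""
    rng = random.Random(seed)
    return [(x, y) for x, y in data if y == 1 or rng.random() < beta]

def correct_posterior(p_s, beta):
    """Map a posterior p_s estimated on the under-sampled data back to the
    original class distribution: p = beta*p_s / (beta*p_s - p_s + 1)."""
    return beta * p_s / (beta * p_s - p_s + 1)

# A model trained on a balanced sample that outputs p_s = 0.5 corresponds to a
# much smaller probability once the original negative:positive ratio is restored.
print(round(correct_posterior(0.5, beta=0.4), 3))  # 0.286
```

The two formulas are inverses of each other: pushing a probability through (3.21) and then through the correction recovers the original value.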
3.9 Performance Measures
Assessing a model’s performance is an essential part of machine learning; without evaluation it is impossible to compare models. The performance measures used were the confusion matrix, which identifies the errors a classifier makes; the G-mean, which combines sensitivity and specificity into a single measure; the ROC curve; the F-measure; and Kappa. The ROC curve is used to determine the discriminative power of the classifiers. Kappa is used to determine the agreement between the classifiers. The F-measure is used to assess a classifier’s ability to predict the positive class, while the G-mean is used to determine whether the negative class has been over-fitted or the positive class under-fitted.
Confusion Matrix
The confusion matrix shows the types of classification error a classifier makes, as below:

                        Predicted class
                         +        −
Actual class    +       TP       FN
                −       FP       TN

TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
Accuracy: defined over all the classification errors and calculated as
Accuracy = (TP + TN) / (TP + TN + FN + FP)
Recall: the proportion of truly positive examples that are classified by the system as positive (Borges, 2016). This is also known as sensitivity.
Recall = TP / (TP + FN)
Specificity: the proportion of truly negative examples that are classified by the system as negative.
Specificity = TN / (TN + FP)
Precision: the proportion of examples classified as positive that are truly positive.
Precision = TP / (TP + FP)
Negative predictive value: the proportion of examples classified as negative that are truly negative.
Negative predictive value = TN / (TN + FN)
F-measure: a measure that combines the trade-off between recall and precision into a single value; it gauges the goodness of a classifier in the presence of rare events (Sokolova, Japkowicz, & Szpakowicz, 2006).
F-measure = 2 × Recall × Precision / (Recall + Precision)
Geometric mean: the square root of the product of sensitivity and specificity. It indicates the classification performance on the majority and minority classes (Bekkar, Djemaa, & Alitouche, 2013).
G-mean = √(sensitivity × specificity)
Kappa: the Kappa statistic compares the observed agreement with the expected agreement. It adjusts the accuracy by accounting for correct predictions arising by chance alone. The maximum value is one, which indicates perfect agreement between the predictions of the models; Kappa values less than one indicate imperfect agreement (Biswas, 2006).
Kappa = (Po − Pe) / (1 − Pe)
where
Po = (TP + TN)/N,
Pe = (1/N^2)[(TP + FN)(TP + FP) + (FP + TN)(FN + TN)],
and N = TP + TN + FP + FN.
3.9.1 Receiver Operating Characteristic (ROC)
The ROC curve is a plot that shows the trade-off between the true positive rate (sensitivity) and the false positive rate (1 − specificity) across a series of cut-off points. The area under the ROC curve is considered an effective measure of the intrinsic validity of a diagnostic test. The curve is useful for evaluating the discriminatory ability of a test, comparing the efficiency of two or more medical tests for the same disease, and finding the optimal cut-off point that least misclassifies diseased and non-diseased cases. There are parametric and non-parametric methods for obtaining the ROC curve. Non-parametric methods use the trapezoidal rule to obtain the area under the ROC curve; other methods use the Mann-Whitney statistic, also known as the Wilcoxon rank-sum, to calculate the area under the curve.
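The confusion-matrix measures above can be computed directly from the four cell counts, as in this illustrative Python sketch (not the thesis’s R code):

```python
from math import sqrt

def metrics(tp, fn, fp, tn):
    """Performance measures computed from the four confusion-matrix counts."""
    n = tp + tn + fp + fn
    recall = tp / (tp + fn)            # sensitivity
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / n
    f_measure = 2 * recall * precision / (recall + precision)
    g_mean = sqrt(recall * specificity)
    p_o = (tp + tn) / n                # observed agreement
    p_e = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n**2  # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)
    return {"accuracy": accuracy, "recall": recall, "specificity": specificity,
            "precision": precision, "f_measure": f_measure,
            "g_mean": g_mean, "kappa": kappa}

print(metrics(tp=40, fn=20, fp=30, tn=110))
```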
Parametric methods are used when the statistical distributions of the test values for diseased and non-diseased cases are known. Often a binormal model is used, which gives a smooth ROC curve.

CHAPTER FOUR
DATA PRESENTATION AND ANALYSIS
This chapter presents the analysis of cost-insensitive and cost-sensitive learning for predicting gonorrhoea infection status and a discussion of the results obtained. Percentages and frequencies were used to describe both the training and testing data. Logistic regression was used to fit a model to the training data, with the parameters estimated by maximum likelihood. Variables were selected into the model using the Akaike information criterion (AIC). Diagnostic measures such as the log-likelihood ratio, residual deviance and a linearity test were assessed. In order to use the model as a classifier, the probability distribution of the predicted scores was examined. An optimal cut-off of 0.5, which assumes equal misclassification costs, was initially used as the classification threshold. Another optimal cut-off of 0.26 was obtained by considering unequal misclassification costs, and this threshold was also used for classification. The parameters of the logistic regression were additionally estimated using the Bayesian method, and the corresponding model diagnostics were likewise assessed. A classification tree model assuming equal misclassification costs was developed, with variables selected using information gain. To include unequal misclassification costs in the model, a loss matrix was used in the classification tree. Finally, a random forest model with equal misclassification costs was fitted to the training data; to include unequal misclassification costs, a loss matrix was used to classify the predicted probabilities of the random forest.
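The workflow just described, holding out a test fraction of the data and classifying probability scores against both the equal-cost threshold of 0.5 and the cost-derived cut-off of 0.26, can be sketched as follows. This is illustrative Python (the thesis used R), and the helper functions and example scores are hypothetical.

```python
import random

def train_test_split(data, test_frac=0.2, seed=1):
    """Randomly hold out a fraction of the data for testing (80/20 here)."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def classify_scores(probs, cutoff):
    """Classify predicted probability scores against a decision cut-off."""
    return [1 if p >= cutoff else 0 for p in probs]

train, test = train_test_split(list(range(100)))
scores = [0.12, 0.31, 0.55, 0.27, 0.80]        # hypothetical predicted probabilities
print(classify_scores(scores, cutoff=0.5))      # [0, 0, 1, 0, 1] equal-cost threshold
print(classify_scores(scores, cutoff=0.26))     # [0, 1, 1, 1, 1] cost-sensitive cut-off
```

Lowering the cut-off from 0.5 to 0.26 converts borderline scores into positive predictions, trading type I errors for fewer, more costly type II errors.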
The performance of the various cost-insensitive and cost-sensitive models was evaluated mainly by total classification cost, with measures such as accuracy, receiver operating characteristic area under the curve (AUC), F-measure, G-mean, and type I and type II errors also used. The models were validated on a testing dataset using the optimal cut-offs obtained from the training data; these thresholds were used to classify the probability scores. The cost-sensitive and cost-insensitive classification tree and random forest models were evaluated on the testing data. These classifiers were also compared with laboratory diagnostic methods such as Gram stain and culture using total classification cost (benefit), and they were evaluated on the other performance measures as well. The under-sampling method was used to adjust the class distribution; the cost-sensitive and cost-insensitive methods were trained on the adjusted datasets and then tested on the held-out data, and total classification cost was used to compare the effect of the class adjustment on the two approaches. Data were analysed using R version 3.3.3.
4.1 Data and Preliminary Analysis
As shown in Tables 4.1a and 4.1b below, 906 patients were enrolled from the clinical facilities during the study period, of whom 28% were diagnosed with gonorrhoea. The study enrolled more females than males, yet a higher proportion of the males were diagnosed with the infection. The majority (70%) of the patients who participated in the study were 31 years old or below, and a high proportion (31%) of them were diagnosed with gonorrhoea. The study enrolled 581 (64%) patients who were not married at the time of enrolment; five percent (5%) of the patients had previously been married and were now single through divorce or the death of a spouse. This indicates that the majority of the patients enrolled had never been married.
The most common clinical presentations were discharge (88%) and pain during urination (49%). Other symptoms reported were itching of the genitals, foul smell, painful sex, ulcers, bleeding from the penis or vagina, and pain in the penis or vagina. Regarding sexual behaviour, only 12% of the patients had had more than one sexual partner in the past month, of whom 51% were diagnosed with gonorrhoea. Forty-one percent (41%) of the enrolled patients never used a condom during sexual intercourse, while 59% used a condom at least once. Among the patients who always used a condom during sexual intercourse, 21% were diagnosed with gonorrhoea. Only 289 (32%) of the patients enrolled reported drinking alcohol, of whom 38% were diagnosed with gonorrhoea.

Table 4.1a: Background information of the respondents
Variable                         Total N(%)    Gonorrhoea +ve N(%)   Gonorrhoea -ve N(%)
Demographic
Gender
  Male                           390(43.0)     169(43.3)             221(56.7)
  Female                         516(57.0)     85(16.4)              431(83.6)
Age
  18-24 years                    247(27.3)     70(28.3)              177(71.7)
  25-31 years                    385(42.5)     123(31.9)             262(68.1)
  32-38 years                    171(18.9)     45(26.3)              126(73.7)
  39 years and above             103(11.4)     16(15.5)              87(84.5)
Marital status
  Single                         551(60.8)     161(29.2)             390(70.8)
  Previously married             30(3.3)       10(33.3)              20(66.7)
  Married                        325(35.9)     83(25.5)              242(74.5)
Clinical presentation
Painful urination
  Yes                            445(49.1)     150(33.7)             295(66.3)
  No                             461(50.9)     104(22.6)             357(77.4)
Discharge
  Yes                            800(88.3)     228(28.5)             572(71.5)
  No                             106(11.7)     26(24.5)              80(75.5)
Pain in penis or vagina
  Yes                            256(28.3)     70(27.3)              186(72.7)
  No                             650(71.7)     184(28.3)             466(71.7)
Foul smell
  Yes                            214(23.6)     50(23.4)              164(76.6)
  No                             692(76.4)     204(29.5)             488(70.5)
Painful sex
  Yes                            129(14.2)     30(23.3)              99(76.7)
  No                             777(85.8)     224(28.8)             553(71.2)
Bleeding from penis or vagina
  Yes                            69(7.6)       21(30.4)              48(69.6)
  No                             837(92.4)     233(25.7)             604(74.3)
Itching of genitals
  Yes                            162(17.9)     32(19.8)              130(80.2)
  No                             744(82.1)
222(29.8)             522(70.2)
Ulcers
  Yes                            53(5.8)       15(28.3)              38(71.7)
  No                             853(94.2)     239(28.0)             614(72.0)

Table 4.1b: Background information of the respondents (continued)
Variable                         Total N(%)    Gonorrhoea +ve N(%)   Gonorrhoea -ve N(%)
Sexual behaviour
Alcohol intake
  Yes                            289(31.9)     109(37.7)             180(62.3)
  No                             617(68.1)     145(23.5)             472(76.5)
Use of condom
  Never                          375(41.4)     107(28.5)             268(71.5)
  Rarely                         244(26.9)     74(30.3)              170(69.7)
  Most occasions                 147(16.2)     43(29.3)              104(70.7)
  Always                         140(15.5)     30(21.4)              110(78.6)
More than one sexual partner in the past month
  Yes                            112(12.4)     57(50.9)              55(49.1)
  No                             794(87.6)     197(24.8)             597(75.2)

4.2 Training Data and Model Fitting
The results presented in Tables 4.2a and 4.2b describe the patients in the training data, which consists of 80% of the original data. The distribution of the characteristics is approximately the same as in Tables 4.1a and 4.1b.

Table 4.2a: Description of the training data
Variable                         Total N(%)    Gonorrhoea +ve N(%)   Gonorrhoea -ve N(%)
Demographic
Gender
  Male                           318(43.7)     142(44.7)             176(55.3)
  Female                         410(56.3)     63(15.4)              347(84.6)
Age
  18-24 years                    199(27.3)     58(29.1)              141(70.9)
  25-31 years                    313(43.0)     103(32.9)             210(67.1)
  32-38 years                    133(18.3)     31(23.3)              102(76.7)
  39 years and above             83(11.4)      13(15.7)              70(84.3)
Marital status
  Single                         446(61.3)     133(26.8)             313(73.2)
  Previously married             26(3.6)       8(30.8)               18(69.2)
  Married                        256(35.2)     64(25.0)              192(75.0)
Clinical presentation
Painful urination
  Yes                            367(50.4)     128(34.9)             239(65.1)
  No                             361(49.6)     77(21.3)              284(78.7)
Discharge
  Yes                            646(88.7)     181(28.1)             465(71.9)
  No                             82(11.3)      24(29.3)              58(70.7)
Pain in penis or vagina
  Yes                            210(28.8)     58(27.6)              152(72.4)
  No                             518(71.2)     147(28.4)             371(71.6)
Foul smell
  Yes                            170(23.4)     41(24.1)              129(75.9)
  No                             558(76.6)     164(29.4)             394(70.6)
Painful sex
  Yes                            105(14.2)     26(24.8)              79(75.2)
  No                             623(85.6)     179(28.7)             504(71.3)
Bleeding from penis or vagina
  Yes                            55(7.6)       13(23.6)              42(76.4)
  No                             673(92.4)     192(28.5)             481(71.5)
Itching of genitals
  Yes                            136(18.7)
26(19.1)              110(80.9)
  No                             592(81.3)     179(30.2)             413(69.8)
Ulcers
  Yes                            44(6.0)       12(27.3)              32(72.7)
  No                             684(94.0)     193(28.2)             491(71.8)

Table 4.2b: Description of the training data (continued)
Variable                         Total N(%)    Gonorrhoea +ve N(%)   Gonorrhoea -ve N(%)
Wartsong
  Yes                            26(3.6)       7(26.9)               19(73.1)
  No                             702(96.4)     198(28.2)             504(71.8)
Sexual behaviour
Alcohol intake
  Yes                            239(32.8)     88(36.8)              151(63.2)
  No                             489(67.2)     117(23.9)             372(76.1)
Use of condom
  Never                          301(41.3)     85(28.2)              216(71.8)
  Rarely                         200(27.5)     63(31.5)              137(68.5)
  Most occasions                 114(15.7)     35(30.7)              79(69.3)
  Always                         113(15.5)     22(19.5)              91(80.5)
More than one sexual partner in the past month
  Yes                            93(12.8)      48(51.6)              45(48.4)
  No                             635(87.2)     157(24.7)             478(75.3)

The models fitted to the training data were logistic regression, a classification tree and a random forest.
4.2.1 Logistic Regression
The logistic regression model was developed from the 14 variables listed in Tables 4.2a and 4.2b. The reduced final model retained 5 significant variables, obtained using a stepwise selection procedure and choosing the model with the lowest AIC. The variables that remained in the model were age, gender, pain during urination, condom usage and having more than one sexual partner in the past month. All of these variables were significantly associated with gonorrhoea infection except pain during urination. Similar results were obtained using Markov chain Monte Carlo Bayesian regression, as shown in Table 1a in the Appendix.
Table 4.3: Logistic regression model using maximum likelihood estimation
Variable                                      Estimate   SE     P-value
Intercept                                     -2.58      0.33   <0.001**
Age
  25-31 years                                 0.06       0.22   0.77919
  32-38 years                                 -0.47      0.28   0.09632
  39 years and above                          -1.05      0.36   0.003858*
Male                                          1.37       0.20   <0.001**
Burning                                       0.35       0.19   0.06252
Condom usage
  Never                                       0.88       0.30   0.00324*
  Rarely                                      0.81       0.31   0.00834*
  Most occasions                              0.70       0.34   0.03744*
More than one sexual partner in past month    0.67       0.25   0.00788*
AIC = 770.93; residual deviance = 748.93

In Figure 4.1, the distributions of the predicted probabilities of both the positive and the negative cases are slightly skewed to the left. The reason is that the dataset consists mostly of the negative class, which pulls the predicted scores towards lower values. When developing models for prediction, the aim is to make them as accurate as possible; the density distribution, however, clearly shows that accuracy alone is not a suitable measure for this model. Since logistic regression predicts probabilities, it is necessary to incorporate unequal misclassification costs in order to determine the optimal cut-off with the lowest total classification cost. This makes the classifier cost-effective.
Figure 4.1: Distribution of the predicted probability scores of the training data
4.2.2 Classification tree
The classification tree was grown using the Gini splitting criterion under the assumption of equal misclassification costs, producing the tree in Figure 4.2. The prior probabilities were assumed proportional to the numbers of gonorrhoea-positive and gonorrhoea-negative cases in the training data (0.28 and 0.72 respectively). The classification tree had a resubstitution misclassification estimate of 0.23 and a cross-validated classification error rate of 0.25. The tree structure defines five decision profiles for gonorrhoea infection, of which two predict a positive and three a negative infection status.
Fifty-six percent (56%) of the cases fell in the female leaf node and were predicted negative for the disease. Males who had pain during urination and were under 30 years old were predicted positive for the infection, while males without the clinical symptom of pain during urination but with multiple sexual partners in the past month were also predicted positive.
Figure 4.2: Tree structure for the gonorrhoea data
4.2.3 Random Forest
The variable importance plot in Figure 4.3 shows how important each variable is in classifying the data. Variable importance was measured using the mean decrease in Gini, a measure of how much each variable contributes to the purity of the nodes. The five most important variables were gender, age, condom usage, marital status and pain during urination. The model gave an out-of-bag (OOB) error rate of 26.4%. Regarding the number of trees to include in the model, Figure 1a in the Appendix gives an indication: the results show that increasing the number of trees decreases the OOB error.
Figure 4.3: Variable importance for the random forest (mean decrease in accuracy and mean decrease in Gini)
4.3 Cost-Sensitive Models
The costs of false positives and false negatives differ in medical diagnosis; hence, in this study, the misclassification costs were assumed to be unequal.
Therefore, the cost of a false positive was set at 1, while the cost of a false negative was varied from 1 to 25, as shown in Table 4.4. The effects of the misclassification errors are stated below:
False negative
• Spread of the disease
• Loss of fertility
False positive
• Drug abuse
• Unnecessary financial burden

Table 4.4: Confusion matrix
                               Confirmed test (NAAT)
                               Positive     Negative
Statistical model  Positive    TP = 0       FP = 1
                   Negative    FN           TN = 0

4.3.1 Classification of the logistic regression predicted probability scores to include unequal misclassification costs
As seen in Figure 4.1, the predicted probabilities were skewed to the left; hence an optimal threshold had to be obtained for classifying with the model. In Figure 4.4, the false negative was given a higher cost than the false positive, which yielded a probability cut-off (threshold) of 0.26.
Figure 4.4: Determination of the optimal cut-off of the predicted scores using unequal misclassification costs
4.3.2 Classification tree with unequal misclassification costs
The assumption in this study was that misclassifying the minority class incurs a higher cost; hence the cost ratio between false negatives and false positives was adjusted until a decision tree with a reduced type II error was obtained. In Figure 4.5, a cost ratio of 1:4 gave seven decision profiles. Forty-four percent (44%) of the males ended up in a leaf node in which they were predicted negative for gonorrhoea, while 6% of the females, who reported never or inconsistently using condoms and were aged 28 years or older, ended up in a leaf node in which they were predicted positive for the infection. Also, the 4% of females who were predicted positive in another leaf node were married and they or their partners never used condoms. These results differ from those obtained under the assumption of equal misclassification costs.
Figure 4.5: Tree structure for the gonorrhoea data using a cost ratio of 1:4
4.4 Comparing the performance of the models on the training data
Assessing the performance of the various models on the training data shows that the random forest had the highest F-measure, AUC and accuracy. Adding costs to the tree models also makes them better than logistic regression on the training data.

Table 4.5: Performance of the various models on the training data
Model                              Accuracy   AUC    F-measure   G-mean
LR                                 0.74       0.74   0.43        0.56
LR with optimal cut-off 0.26       0.71       0.74   0.58        0.71
CART                               0.77       0.72   0.53        0.64
CART (cost ratio 1:4)              0.72       0.72   0.55        0.67
RF                                 0.86       0.91   0.71        0.76
RF (cost ratio 1:4)                0.84       0.91   0.74        0.84

Adjusting the misclassification costs to obtain the optimal cut-off of 0.26 for the logistic regression model reduced its accuracy relative to logistic regression with the 0.5 threshold, which assumes equal misclassification costs; the F-measure and G-mean, however, were higher. Likewise, the other classifiers that considered unequal misclassification costs showed a reduction in accuracy but higher F-measure and G-mean.
4.5 Effect of total classification cost on the cost-sensitive and cost-insensitive methods
To calculate the total classification cost, the cost matrix defined in Table 4.4 was used. The results in Figure 4.6 indicate that the cost-sensitive classifiers produced a lower total classification cost than the cost-insensitive classifiers.
Adjusting the misclassification cost made the cost sensitive classifiers fall below 1000 cost units, while the cost insensitive classifiers stayed above 2000 cost units. This indicates that the cost sensitive methods incur a lower cost-weighted penalty for the people they misclassify than the cost insensitive methods.

[Figure 4.6 is a bar chart of total classification cost on the training data for LR, CRT, RF, CS-CRT and CS-RF at cost ratios 1:1, 1:4, 1:15 and 1:25.]

Figure 4.6: Effect of classification cost on cost sensitive and cost insensitive classifiers

4.6 Model Validation

The results from the cost insensitive and cost sensitive models obtained on the training data were tested on holdout data, comprising 20% (178) of the observations in the dataset. From Table 4.6 below, logistic regression with the default threshold of 0.5 performed better in terms of accuracy than the other classifiers. Changing the threshold of the logistic regression classifier to the optimal value of 0.26 improves the F-measure and G-mean. Making the tree-based methods cost-sensitive also reduces the accuracy of the models. Cost sensitive trees had a better F-measure and G-mean than logistic regression with the 0.5 threshold.

Table 4.6: Comparing performance of the various models using test data

Model                            Accuracy   AUC    F-measure   G-mean
LR                               0.74       0.64   0.37        0.51
LR with optimal cut-off 0.26     0.64       0.64   0.40        0.60
CART                             0.70       0.61   0.27        0.43
CART (cost ratio 1:4)            0.57       0.61   0.46        0.59
RF                               0.74       0.66   0.36        0.49
RF (cost ratio 1:4)              0.65       0.66   0.45        0.56

4.6.1 Comparing laboratory diagnostic methods with cost sensitive and insensitive models on the testing data

The laboratory diagnostic methods had a lower total classification cost (higher benefit) than the cost sensitive and insensitive methods, with the exception of culture, as seen in Figure 4.7.
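The F-measure and G-mean reported in Tables 4.5 and 4.6 can be computed directly from confusion-matrix counts. A minimal sketch, using a hypothetical confusion matrix (the counts below are not from the thesis data):

```python
import numpy as np

def imbalance_metrics(tp, fp, fn, tn):
    """F-measure (harmonic mean of sensitivity and PPV) and
    G-mean (geometric mean of sensitivity and specificity)."""
    sens = tp / (tp + fn)     # sensitivity / recall
    ppv = tp / (tp + fp)      # positive predictive value / precision
    spec = tn / (tn + fp)     # specificity
    f_measure = 2 * sens * ppv / (sens + ppv)
    g_mean = np.sqrt(sens * spec)
    return f_measure, g_mean

# hypothetical confusion matrix: sens = 0.75, ppv = 0.6, spec = 0.875
f, g = imbalance_metrics(tp=30, fp=20, fn=10, tn=140)
```

Unlike accuracy, both measures collapse toward zero when the positive (minority) class is poorly recovered, which is why they are preferred for this imbalanced data.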
The cost sensitive classifiers had a classification cost of less than 300 cost units, while the cost insensitive classifiers had a classification cost of more than 1000 cost units.

[Figure 4.7 is a bar chart of total classification cost on the testing data for LR, CRT, RF, CS-CRT, CS-RF, culture and Gram stain at cost ratios 1:1, 1:4, 1:15 and 1:25.]

Figure 4.7: Total cost of classification of laboratory methods, cost sensitive and insensitive classifiers

The results in Figure 4.8 indicate that the Gram stain test was perfect. The cost sensitive trees had a lower type II error than culture. These models also outperformed culture in terms of F-measure, geometric mean and Kappa, which is a measure of agreement between classifiers. The reference test used was the result of the nucleic acid amplification test.

[Figure 4.8 is a bar chart of type I error, type II error, F-measure, G-mean and Kappa for CS-CART, CS-RF, culture and Gram stain.]

Figure 4.8: Laboratory diagnostic methods and cost sensitive models

4.6.2 Effect of class distribution and cost sensitive method on classification cost

To evaluate the effect of the class distribution on classification cost, the total cost for the cost insensitive and sensitive classifiers was calculated for each dataset in which the class distribution was adjusted (i.e. the class distribution of gonorrhea negative to positive was adjusted in ratios of 40:60, 50:50 and 60:40 using the under-sampling method). The results in Figure 4.9 indicate that when the ratio between the two classes was 60:40, a lower classification cost was obtained using the cost insensitive classifiers. This indicates that a lower classification cost is obtained when the data contains more gonorrhea positive cases than negative cases. Regarding the cost sensitive classifiers, adjusting the ratio of the class distribution only weakly affected the classification cost.
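The under-sampling used to vary the class distribution can be sketched as follows. The `undersample` helper is a hypothetical illustration, assuming negatives are the majority class, not the thesis's exact implementation:

```python
import numpy as np

def undersample(X, y, neg_to_pos=(60, 40), seed=0):
    """Randomly drop majority-class (negative) rows until the
    negative:positive ratio matches neg_to_pos."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_neg = int(len(pos) * neg_to_pos[0] / neg_to_pos[1])
    keep_neg = rng.choice(neg, size=min(n_neg, len(neg)), replace=False)
    idx = rng.permutation(np.concatenate([pos, keep_neg]))
    return X[idx], y[idx]

# hypothetical data: 80 negatives, 20 positives -> 50:50 keeps 20 of each
X = np.arange(100).reshape(100, 1)
y = np.array([0] * 80 + [1] * 20)
Xb, yb = undersample(X, y, neg_to_pos=(50, 50))
```

Refitting each classifier on the 40:60, 50:50 and 60:40 resamples and recomputing the total cost reproduces the kind of comparison shown in Figure 4.9.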
[Figure 4.9 is a bar chart of total classification cost for LR, CRT, RF, CS-CRT and CS-RF at class distributions of 40:60, 50:50 and 60:40.]

Figure 4.9: Effect of class distribution on classification cost of the classifiers

4.7 Summary of results

The models were fitted with logistic regression, classification tree and Random forest using equal costs of misclassification. For the logistic regression model, goodness of fit was tested using the Hosmer-Lemeshow test, log likelihood and deviance, and the p-values greater than 0.05 obtained indicated a good fit to the training data. The misclassification rate of the model was 26%, the F-measure was 43% and the G-mean was 56%. Regarding the classification tree, the important variables selected were gender, age, pain during urination and more than one partner in the past month, similar to those obtained in the logistic regression model. The model's misclassification rate was 23%, the F-measure was 53% and the G-mean was 64%. For Random forest, the misclassification rate was 14%, the F-measure was 71% and the G-mean was 76%. The cost of misclassification was then varied (the cost of a false negative was considered higher than the cost of a false positive) for the various models, which yielded a reduction in the total classification cost and in the accuracy of the classifiers. For logistic regression, an optimal threshold of 0.26 was obtained when the ROC curve incorporating the cost of misclassification was used to obtain a cut-off. This increased the misclassification rate but improved the F-measure and G-mean. Similar results were obtained with the inclusion of the cost matrix in the classification tree and Random forest. The cost sensitive models performed better on the training data than the cost insensitive models in terms of reduction in the total classification cost.
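The cost-weighted cut-off selection summarized above (Section 4.3.1) can be sketched as a scan over candidate thresholds. This is an illustration on hypothetical scores, not the thesis's exact ROC-based procedure: it simply picks the threshold minimizing total cost under the Table 4.4 cost matrix.

```python
import numpy as np

def optimal_threshold(y_true, p_hat, cost_fp=1, cost_fn=4):
    """Scan candidate thresholds and return the one with the lowest
    total misclassification cost (TP and TN cost 0)."""
    y_true = np.asarray(y_true)
    p_hat = np.asarray(p_hat)
    best_t, best_cost = 0.5, np.inf
    for t in np.linspace(0.01, 0.99, 99):
        pred = (p_hat >= t).astype(int)
        fp = np.sum((pred == 1) & (y_true == 0))
        fn = np.sum((pred == 0) & (y_true == 1))
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# toy illustration (hypothetical scores, not the thesis data)
y = np.array([0, 0, 0, 0, 1, 1])
p = np.array([0.1, 0.2, 0.3, 0.6, 0.35, 0.8])
t, c = optimal_threshold(y, p, cost_fp=1, cost_fn=4)
```

Because false negatives are penalized more heavily, the chosen threshold sits below 0.5, mirroring the 0.26 cut-off reported in the thesis.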
Also, the class distribution affected the cost insensitive classifiers but only weakly affected the cost sensitive classifiers. When the models were evaluated on the testing data, they performed worse than on the training data, but the cost sensitive models still outperformed the cost insensitive models on F-measure and G-mean. The misclassification rate was not a good measure for evaluating these models since the data was imbalanced.

CHAPTER FIVE

CONCLUSIONS AND RECOMMENDATIONS

The final chapter of this thesis discusses the results and presents the conclusions and recommendations from the study.

5.1 Discussion

The results obtained from the study are discussed below.

5.1.1 Comparing logistic regression, classification tree and Random forest

In this study, three traditional models were developed for the prediction of gonorrhea infection: logistic regression, classification trees and Random forest. The models were evaluated using classification cost; in addition, accuracy, sensitivity, specificity, area under the ROC curve, F-measure and G-mean were used to determine their performance. The logistic regression identified four features (age, gender, painful urination and condom usage) as significant for predicting gonorrhoea infection status, and these were similar for the classification trees and Random forest. The choice of these features for the classification tree and Random forest was based on information gain and mean decrease in Gini respectively. The LR model was transformed into a classifier by applying a 0.5 probability threshold for classification of gonorrhea infection, while for Random forest the most important variables were selected based on the mean decrease in Gini. The tree-based models had a lower misclassification error in the learning phase than the logistic regression (Table 4.5) but a higher misclassification error on the holdout data (Table 4.6).
The area under the ROC curve for LR on the test data set is 0.64, while those of the classification tree and Random forest are 0.61 and 0.63 respectively. The difference between the AUC of LR and the tree classifiers was not statistically significant, nor was the difference between the tree classifiers themselves, using DeLong's method as reported in Table 2a. This indicates that the classifiers have virtually the same probability of ranking a randomly chosen positive case higher than a randomly chosen negative case. LR assumes that the log odds of the response is a linear combination of the predictors. Since all the variables used in the model construction are discrete, the linearity assumption is satisfied, minimizing specification error; this reduces the chance of overfitting in LR and makes it more robust on the testing data than the tree classifiers. The tree classifiers' poor performance on the testing data may be due to the pattern of features in the learning data set not being similar to that of the testing data set, resulting in overfitting. One way of curbing this situation is pruning the classification tree, which did not yield much difference when it was implemented. For Random forest, a portion of the data called the out-of-bag sample is used to validate the model; the model's prediction is obtained by a bootstrap process and is an average over the bootstrap trees. In terms of accuracy, AUC, F-measure and G-mean, Random forest performed better than the classification tree on the test data (Table 4.6). Its performance compared to logistic regression was slightly higher in terms of accuracy and area under the ROC curve, which is similar to the findings of Jin et al. (2007).
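The rank interpretation of AUC quoted above can be computed directly. A minimal sketch with hypothetical scores, counting ties as one half (the Mann-Whitney form of the statistic):

```python
import numpy as np

def auc_rank(y_true, scores):
    """AUC via its rank interpretation: the probability that a randomly
    chosen positive case scores higher than a randomly chosen negative
    case, with ties counting one half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# hypothetical scores: 3 wins and 1 tie out of 6 pairs -> AUC = 3.5/6
y = [0, 0, 1, 1, 0]
s = [0.1, 0.4, 0.35, 0.8, 0.8]
auc = auc_rank(y, s)
```

An AUC of 0.5 under this reading is random ranking, which is why the small differences between 0.61 and 0.64 translate into virtually indistinguishable ranking behaviour.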
The type I error (false positives) for the classifiers was much lower than the type II error (false negatives), a result of the imbalanced class distribution in the data set, which made the models more likely to predict the negative class than the positive. These classification algorithms assume an equal class distribution, hence without any adjustment they are biased toward the majority class. In medical diagnosis, type II errors are more severe than type I errors (Freitas et al., 2009): a type II error means that individuals who have the disease are misclassified as not having it. To address class imbalance in a data set, some researchers use random sampling methods such as under-sampling and over-sampling, which lack a theoretical basis. These methods adjust the prior distribution of the learning data in order to obtain a balanced class distribution. Another approach is cost sensitive learning, which can accept cost information and assign different costs to the various misclassification errors. This can be difficult to implement since in most cases the misclassification costs are unknown and must be assumed.

5.1.2 Effect of classification cost on laboratory diagnostic methods and skewed class distribution for cost sensitive and insensitive classifiers

Since each type of error carries its own financial cost and harm to the individual, in medical diagnosis much of the focus is on reducing type II errors. From the study, cost sensitive trees helped to reduce the classification cost when the costs of the misclassification errors were adjusted. In comparing the cost-sensitive classification tree and Random forest to traditional classifiers, which assume equal costs of misclassification, the cost sensitive classifiers performed better in terms of reduction in total classification cost.
They also performed better in terms of F-measure and G-mean than the cost insensitive models. Even though the traditional classifiers had a higher accuracy than the cost-sensitive tree-based methods, accuracy is not an appropriate measure for evaluating these classifiers since the data is imbalanced and the measure is biased towards the majority class (Weng & Pong, 2006). F-measure and G-mean are appropriate in instances where the data is imbalanced: the F-measure combines sensitivity and positive predictive value, which is effective when there is an imbalanced class distribution, while the G-mean combines sensitivity and specificity. The Gram stain test was perfect, performing far better than the cost sensitive models. The other laboratory method, culture, had a higher classification cost than the cost sensitive and insensitive classifiers. The poor performance of this laboratory diagnostic method might be due to lack of adherence to standard operating procedures when collecting, transporting and testing samples. In order to determine the effect of class distribution on the classification cost, the under-sampling method was used, with varying ratios of gonorrhea negative to positive cases. The results in Figure 4.9 clearly show that the class distribution weakly affects the classification cost of the cost sensitive classifiers, whereas for the cost insensitive classifiers the class distribution does affect the classification cost. The reason the class distribution did not affect the classification cost of the cost sensitive classifiers is that the method uses a cost matrix which encodes the penalty of misclassifying a data sample.

5.2 Conclusion

In this work, a number of cost sensitive and cost insensitive algorithms were proposed for gonorrhea prediction.
The study investigated whether the cost-sensitive approach to gonorrhoea prediction results in a lower total classification cost than the cost insensitive and laboratory diagnostic methods. It also investigated whether the class distribution of the data had an effect on the total classification cost of the cost sensitive and insensitive classifiers. The results indicated that cost sensitive methods outperform cost insensitive methods in terms of reduction of total classification cost (higher benefit). However, they did not outperform the laboratory diagnostic methods, except for culture, which performed worse even than the cost insensitive methods. The class distribution of the data weakly affected the cost sensitive methods but affected the cost insensitive methods.

5.3 Recommendation

The following are some recommendations to consider:

 Misclassification cost, rather than accuracy, should be considered when selecting a statistical diagnostic model. This will help clinicians make more effective diagnostic decisions by minimizing the number of false negatives, which have a serious impact on patients and society at large.

 Cost sensitive classifiers can be used when the data for statistical prediction is skewed (i.e. there is an imbalance problem in the dataset).

 Findings from the classification tree can serve as a guide for clinicians when diagnosing gonorrhea. In most cases they rely on informal decisions to provide treatment for patients.

REFERENCES

Abdullah, A. S., & Rajalaxmi, R. (2012, April). A data mining model for predicting the coronary heart disease using random forest classifier. In International Conference in Recent Trends in Computational Methods, Communication and Controls.

Acquah, H. D. (2013). Bayesian logistic regression modelling via Markov chain Monte Carlo algorithm. Journal of Social and Development Sciences, 4(4), 193-197.
Adeyemo, O., Adeyeye, T., & Ogunbiyi, D. (2015). Comparative study of ID3/C4.5 decision tree and multilayer perceptron algorithms for the prediction of typhoid fever. African Journal of Computing & ICT, 8(1), 103-112.

Archer, K. J., & Lemeshow, S. (2006). Goodness-of-fit test for a logistic regression model fitted using survey sample data. Stata Journal, 6(1), 97-105.

Austin, P. C. (2006). A comparison of regression trees, logistic regression, generalized additive models, and multivariate adaptive regression splines for predicting AMI mortality. Statistics in Medicine, 26(15), 2937-2957.

Bekkar, M., Djemaa, H. K., & Alitouche, T. A. (2013). Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications, 3(10), 27-38.

Biswas, B. (2006). Assessing agreement for diagnostic devices. In FDA/Industry Statistics Workshop. FDA.

Borges, L. S. R. (2016). Diagnostic accuracy measures in cardiovascular research. International Journal of Cardiovascular Sciences, 29(3), 218-222.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.

Centers for Disease Control and Prevention. (2017). 10 ways STDs impact women differently from men. CDC Fact Sheet, (April), 1. Retrieved from http://www.cdc.gov/nchhstp/newsroom/docs/STDs-Women-042011.pdf

Centers for Disease Control and Prevention. (2010). National Center for HIV/AIDS, Viral Hepatitis, STD, and TB Prevention, Division of STD Prevention.

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.

Chen, J. J., Tsai, C. A., Moon, H., Ahn, H., Young, J. J., & Chen, C. H. (2006). The use of decision threshold adjustment in classification for cancer prediction.

Chou, Y. Y., & Shapiro, L. G. (2003). A hierarchical multiple classifier learning algorithm. Pattern Analysis & Applications, 6(2), 150-168.
Cosentino, L. A., Campbell, T., Jett, A., Macio, I., Zamborsky, T., Cranston, R. D., & Hillier, S. L. (2012). Use of nucleic acid amplification testing for diagnosis of anorectal sexually transmitted infections. Journal of Clinical Microbiology, 50(6), 2005-2008.

Cramer, J. S. (2002). The origins of logistic regression.

Dal Pozzolo, A., Caelen, O., Johnson, R. A., & Bontempi, G. (2015, December). Calibrating probability with undersampling for unbalanced classification. In 2015 IEEE Symposium Series on Computational Intelligence (pp. 159-166). IEEE.

Danjuma, K., & Osofisan, A. O. (2015). Evaluation of predictive data mining algorithms in erythemato-squamous disease diagnosis. International Journal of Computer Science, 11(6), 85-94.

De'ath, G., & Fabricius, K. E. (2000). Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology, 81(11), 3178-3192.

De Queiroz Mello, F. C., do Valle Bastos, L. G., Soares, S. L. M., Rezende, V. M., Conde, M. B., Chaisson, R. E., & Werneck, G. L. (2006). Predicting smear negative pulmonary tuberculosis with classification trees and logistic regression: a cross-sectional study. BMC Public Health, 6(1), 43.

Domingos, P. (1999, August). MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 155-164). ACM.

Effects of STIs on Pregnancy | SexInfo Online. (n.d.). Retrieved November 25, 2016, from http://www.soc.ucsb.edu/sexinfo/article/effects-stis-pregnancy

Gardella, C., Brown, Z., Wald, A., Selke, S., Zeh, J., Morrow, R. A., & Corey, L. (2005). Risk factors for herpes simplex virus transmission to pregnant women: a couples study. American Journal of Obstetrics and Gynecology, 193(6), 1891-1899.

Handsfield, H. H., Lipman, T. O., Harnisch, J. P., Tronca, E., & Holmes, K. K. (1974). Asymptomatic gonorrhea in men: diagnosis, natural course, prevalence and significance.
New England Journal of Medicine, 290(3), 117-123.

Hastie, T. J., & Tibshirani, R. J. (1990). Generalized Additive Models. New York: Chapman & Hall.

Hsieh, C. H., Lu, R. H., Lee, N. H., Chiu, W. T., Hsu, M. H., & Li, Y. C. (2010). Novel solution for an old disease: diagnosis of acute appendicitis with random forest, support vector machines, and artificial neural networks. Surgery, 149(1), 87-93.

Hssina, B., Merbouha, A., Ezzikouri, H., & Erritali, M. (2014). A comparative study of decision tree ID3 and C4.5. International Journal of Advanced Computer Science and Applications, 4(2), 13-19.

Huppert, J. S., Biro, F., Lan, D., Mortensen, J. E., Reed, J., & Slap, G. B. (2007). Urinary symptoms in adolescent females: STI or UTI? Journal of Adolescent Health, 40(5), 418-424.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112). New York: Springer.

Jiang, Y., & Cukic, B. (2009, May). Misclassification cost-sensitive fault prediction models. In Proceedings of the 5th International Conference on Predictor Models in Software Engineering (p. 20). ACM.

Jin, H., Kim, S., & Kim, J. (2014). Decision factors on effective liver patient data prediction. International Journal of Bio-Science and Bio-Technology, 6(4), 167-178.

Kazemnejad, A., Zayeri, F., Aishah, H., Gharaaghaji, R., & Salehi, M. (2010). A Bayesian analysis of bivariate ordered categorical response using a latent variable regression model: Application to diabetic retinopathy data. Scientific Research and Essays, 5(11), 1264-1273.

Kershaw, T. S., Lewis, J., Westdahl, C., Wang, Y. F., Rising, S. S., Massey, Z., & Ickovics, J. (2007). Using clinical classification trees to identify individuals at risk of STDs during pregnancy. Perspectives on Sexual and Reproductive Health, 39(3), 141-148.

King, G., & Zeng, L. (2001). Logistic regression in rare events data. Political Analysis, 137-163.

Kohavi, R., & Quinlan, J. R. (2002).
Data mining tasks and methods: Classification: decision-tree discovery. In Handbook of Data Mining and Knowledge Discovery.

Kolluru, M. (n.d.). What is the difference between logistic regression and Naive Bayes? - Quora. Retrieved September 13, 2016, from https://www.quora.com/What-is-the-difference-between-logistic-regression-and-Naive-Bayes

Kurt, I., Ture, M., & Kurum, A. T. (2008). Comparing performances of logistic regression, classification and regression tree, and neural networks for predicting coronary artery disease. Expert Systems with Applications, 34(1), 366-374.

Lavanya, D., & Rani, K. U. (2011). Performance evaluation of decision tree classifiers on medical datasets. International Journal of Computer Applications, 26(4).

Lecture 14: Diagnostics and model checking for logistic regression. (2004). Retrieved from https://courses.washington.edu/b515/l14.pdf

Leung, K. M. (2007). Naive Bayesian classifier. Polytechnic University, Department of Computer Science/Finance and Risk Engineering.

Lewis, R. J. (2000). An introduction to classification and regression tree (CART) analysis. Paper presented at the 2000 Annual Meeting of the Society for Academic Emergency Medicine, San Francisco.

Ling, C. X., Yang, Q., Wang, J., & Zhang, S. (2004). Decision trees with minimal costs. In Proceedings of the 2004 International Conference on Machine Learning (ICML 2004).

Liu, Y. (2007). On goodness-of-fit of logistic regression model. (PhD thesis), Kansas State University.

Long, W. J., Griffith, J. L., Selker, H. P., & D'Agostino, R. B. (1993). A comparison of logistic regression to decision-tree induction in a medical domain. Computers and Biomedical Research, 26(1), 74-97.

Mohammed, G. S. (2016).
Parkinson's disease diagnosis: Detecting the effect of attribute selection and discretization of Parkinson's disease dataset on the performance of classifier algorithms. Open Access Library Journal, 3(11), 1-11.

Meade, J. C., & Cornelius, D. C. (2012). Sexually transmitted infections in the tropics. In A. Rodriguez-Morales (Ed.), Current Topics in Tropical Medicine. InTech. ISBN: 978-953-51-0274-8. Available from: http://www.intechopen.com/books/current-topics-in-tropical-medicine/sexually-transmitted-infections-in-the-tropics

Medova, E. (2008). Bayesian Analysis and Markov Chain Monte Carlo Simulation. Wiley Online Library.

Murray, P. R., Baron, E. J., Pfaller, M. A., Jorgensen, J. H., & Yolken, R. H. (2003). Manual of Clinical Microbiology (8th ed.). Washington, DC: American Society for Microbiology.

Ndongmo, B. C. (2005). Clinical laboratory diagnostics in Africa. African Technology Development Forum Journal, 2(3), 21-22.

Papp, J. R., Schachter, J., Gaydos, C. A., & Van Der Pol, B. (2014). Recommendations for the laboratory-based detection of Chlamydia trachomatis and Neisseria gonorrhoeae, 2014. MMWR Recommendations and Reports, 63, 1-19.

Parker, C. (2011, December). An analysis of performance measures for binary classifiers. In 2011 IEEE 11th International Conference on Data Mining (ICDM) (pp. 517-526). IEEE.

Patel, N., & Upadhyay, S. (2012). Study of various decision tree pruning methods with their empirical comparison in WEKA. International Journal of Computer Applications, 60(12).

Pazzani, M., Merz, C., Murphy, P., Ali, K., Hume, T., & Brunk, C. (1994). Reducing misclassification costs. In Proceedings of the Eleventh International Conference on Machine Learning (pp. 217-225).

Peng, C. Y. J., Lee, K. L., & Ingersoll, G. M. (2002). An introduction to logistic regression analysis and reporting.
The Journal of Educational Research, 96(1), 3-14.

Peng, W., Chen, J., & Zhou, H. (2009). An implementation of ID3 decision tree learning algorithm. Retrieved from web.arch.usyd.edu.au/wpeng/DecisionTree2.pdf

Sahin, Y., Bulkan, S., & Duman, E. (2013). A cost-sensitive decision tree approach for fraud detection. Expert Systems with Applications, 40(15), 5916-5923.

Salameh, P., Waked, M., Khayat, G., & Dramaix, M. (2014). Bayesian and frequentist comparison for epidemiologists: A non-mathematical application on logistic regressions. The Open Epidemiology Journal, 7(1), 17-26.

Schachter, J., Moncada, J., & Liska, S. (2008). Nucleic acid amplification tests in the diagnosis of chlamydial and gonococcal infections of the oropharynx and rectum in men who have sex with men. Sexually Transmitted Diseases, 35(7), 637-642.

Smith, L. (2016). Gonorrhea: Causes, symptoms and treatments. Retrieved December 13, 2016, from http://www.medicalnewstoday.com/articles/155653.php

Smith, A. F., & Roberts, G. O. (1993). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. Journal of the Royal Statistical Society, Series B (Methodological), 55(1), 3-23.

Sokolova, M., Japkowicz, N., & Szpakowicz, S. (2006). Beyond accuracy, F-score and ROC: a family of discriminant measures for performance evaluation. Paper presented at the Australian Conference on Artificial Intelligence.

Steinberg, D., Golovnya, M., & Cardell, N. S. (2004). Data mining with Random Forests.

Therneau, T. M., & Atkinson, E. J. (1997). An introduction to recursive partitioning using the RPART routines (Technical report, Vol. 61, p. 452). Mayo Foundation.

Ture, M., Kurt, I., Kurum, A. T., & Ozdamar, K. (2005). Comparing classification techniques for predicting essential hypertension. Expert Systems with Applications, 29(3), 583-588.

Turney, P. D. (1995). Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm.
Journal of Artificial Intelligence Research, 2, 369-409.

Verma, R., Sood, S., Kapil, A., & Sharma, V. K. (2009). Diagnostic approach to gonorrhoea: Limitations. Indian Journal of Sexually Transmitted Diseases and AIDS, 30(1), 61.

Weiss, G. (2003). The effect of small disjuncts and class distribution on decision tree learning. PhD dissertation, Department of Computer Science, Rutgers University, New Brunswick, New Jersey.

Whiley, D. M., Tapsall, J. W., & Sloots, T. P. (2006). Nucleic acid amplification testing for Neisseria gonorrhoeae: an ongoing challenge. The Journal of Molecular Diagnostics, 8(1), 3-15.

Yusuff, H., Mohamad, N., Ngah, U., & Yahaya, A. (2012). Breast cancer analysis using logistic regression. International Journal of Research and Reviews in Applied Sciences, 10(1), 14-22.

Zadrozny, B., & Elkan, C. (2001). Learning and making decisions when costs and probabilities are both unknown. In Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining (pp. 204-213). ACM Press.

Zhou, Z. H., & Liu, X. Y. (2006). On multi-class cost-sensitive learning. In Proceedings of the 21st National Conference on Artificial Intelligence (pp. 567-572). Boston, MA.

APPENDIX

Logistic Regression Diagnostic Measures

Figure 1a: Pearson residuals plotted against each predictor one by one.

Table 1a: Bayesian logistic regression

Variable                                      Mean    Std. Dev.   Lower   Upper
Intercept                                     -2.58   0.30        -3.19   -2.00
Age
  25-31 years                                  0.06   0.21        -0.34    0.49
  32-38 years                                 -0.46   0.28        -1.01    0.10
  39 years and above                          -1.08   0.37        -1.85   -0.41
Male                                           1.38   0.19         1.02    1.76
Pain during urination                          0.36   0.19        -0.03    0.73
Condom usage
  Never                                        0.85   0.28         0.31    1.33
  Rarely                                       0.78   0.29         0.31    1.37
  Most occasions                               0.66   0.32         0.22    1.37
More than one sexual partner in past month     0.65   0.27         0.13    1.16

Bayesian regression diagnostic measures

On the left is the time series of each parameter across MCMC iterations (trace plot); on the right is the probability density estimate of each parameter, whose most likely value occurs at the peak of the distribution (the posterior mode).

Figure 1b: Posterior distribution of the model parameters

Figure 1b: Posterior distribution of the model parameters (cont.)

Figure 1c: Posterior distribution of the model parameters (cont.)

Figure 1d: Posterior distribution of the model parameters (cont.)

Figure 1e: Error rate for the number of trees
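Posterior summaries like those in Table 1a come from an MCMC run. As an illustration only, a random-walk Metropolis sampler for a Bayesian logistic regression can be written in a few lines; the data, priors and tuning below are hypothetical, not those used in the thesis:

```python
import numpy as np

def log_post(beta, X, y, prior_sd=10.0):
    """Log posterior for logistic regression with independent
    Normal(0, prior_sd^2) priors on the coefficients."""
    eta = X @ beta
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    logprior = -0.5 * np.sum(beta ** 2) / prior_sd ** 2
    return loglik + logprior

def metropolis(X, y, n_iter=4000, step=0.1, seed=0):
    """Random-walk Metropolis: propose a Gaussian jump, accept it with
    probability min(1, posterior ratio), otherwise stay put."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    lp = log_post(beta, X, y)
    draws = np.empty((n_iter, X.shape[1]))
    for i in range(n_iter):
        prop = beta + rng.normal(scale=step, size=beta.size)
        lp_prop = log_post(prop, X, y)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept/reject
            beta, lp = prop, lp_prop
        draws[i] = beta
    return draws

# hypothetical data generated with intercept -1 and slope 1
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(300), rng.normal(size=300)])
y = (rng.uniform(size=300) < 1 / (1 + np.exp(-(X @ [-1.0, 1.0])))).astype(float)
draws = metropolis(X, y)
post_mean = draws[2000:].mean(axis=0)   # discard burn-in before summarizing
```

Posterior means, standard deviations and credible bounds (as in Table 1a) are then just summaries of the retained `draws`; production analyses would use tuned or gradient-based samplers rather than this toy random walk.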