University of Ghana http://ugspace.ug.edu.gh

UNIVERSITY OF GHANA

PREDICTIVE MODELS FOR IDENTIFYING CRITICAL UNITS FOR INSPECTION IN A REGULATORY BODY

BY

FELIX DELA DJOKOTO
10598534

THIS THESIS IS SUBMITTED TO THE UNIVERSITY OF GHANA, LEGON IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE AWARD OF THE MPHIL STATISTICS DEGREE.

November 14, 2018

DECLARATION

I hereby declare that this submission is my own work towards the award of the MPhil degree and that, to the best of my knowledge, it contains no material previously published by another person nor material which has been accepted for the award of any other degree of the university, except where due acknowledgement has been made in the text.

FELIX DELA DJOKOTO .......................... ....................
Student (10598534) Signature Date

Certified by:
DR. RICHARD MINKAH .......................... ....................
Principal Supervisor Signature Date

Certified by:
DR. LOUIS ASIEDU .......................... ....................
Co-Supervisor Signature Date

DEDICATION

I dedicate this work to my loving father, Mr Christopher Yao Djokoto.

ACKNOWLEDGEMENT

This work has been by the grace of God, and for that I will always be grateful. He has kept me growing stronger in every facet of life. My sincerest gratitude goes to Dr Richard Minkah and Dr Louis Asiedu, my supervisors, for continually guiding and helping me throughout this research work. I greatly appreciate their advice and support. Also, I would like to thank every member of my family for their support throughout the duration of my study, especially my parents, Mr and Mrs Djokoto, who have been a pillar of strength and support throughout my education. I surely cannot forget my aunt, Mrs Faustina Lawson, and her family for their sacrifices; it is deeply appreciated.
I also thank the Chicago Department of Public Health for making the data available online for public use. My gratitude also goes to the developers of the statistical packages SPSS, R and MATLAB. Finally, I say thank you to all lecturers at the faculty and colleagues at the Department of Statistics, University of Ghana.

ABSTRACT

Routine inspections conducted at various food establishments yield large data sets that capture attributes useful for data mining algorithms to predict critical violations. Critical violations at food establishments cause serious public health problems, which may arise from an unhygienic environment leading to food contamination. This study presents predictive models to detect critical violations in food establishments, employing Logistic Regression (LR), Support Vector Machine (SVM) and K-Nearest Neighbour (KNN). A database from the City of Chicago data portal containing food inspections from 2011 to 2014 was used. In the preliminary analysis, Principal Component Analysis was used to select ten (10) relatively relevant variables, independent of each other, from the original twenty-eight (28) as inputs to the models. In the SVM family, several kernels were tried and the optimal model was selected based on the performance measures Receiver Operating Characteristic (ROC), sensitivity and specificity; the optimal KNN model was selected on the same measures. The out-of-sample classification accuracies for the LR, SVM and KNN classifiers were 92.7872%, 92.7873% and 92.6650% respectively. The models showed no large differences in classification accuracy; however, the SVM model appears to provide better discrimination ability than the LR and KNN.

CONTENTS

DECLARATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES
1 INTRODUCTION
  1.1 Background of the study
  1.2 Problem Statement
  1.3 Objectives
  1.4 Significance of the study
  1.5 Scope
  1.6 Limitations
  1.7 Organisation of the study
2 LITERATURE REVIEW
  2.1 Food safety in the food establishment industry
  2.2 Approach to reduce food safety risks
  2.3 Factors that affect food safety in food establishments
  2.4 Reviewed literature on support vector machine
    2.4.1 Feature Selection
    2.4.2 Kernel function selection
    2.4.3 Theoretical review of SVM
  2.5 Reviewed literature on k-nearest neighbour
    2.5.1 Selection of K-value
    2.5.2 Distance Metric
    2.5.3 Theoretical review of KNN
  2.6 Application of predictive models in inspections
  2.7 Comparing Logistic Regression, K-Nearest Neighbour and Support Vector Machine
  2.8 Performance measures of the algorithms
  2.9 Summary
3 METHODOLOGY
  3.1 Preprocessing the data
    3.1.1 Data formatting
    3.1.2 Feature Extraction
  3.2 Logistic regression
    3.2.1 The Logistic Regression Model
    3.2.2 Assumptions of Logistic Regression
    3.2.3 Odds and Odds Ratio
    3.2.4 Parameter estimation of logistic regression coefficients
    3.2.5 Testing the Goodness-of-Fit
    3.2.6 Confidence Interval Estimation
  3.3 Support Vector Machine
    3.3.1 Parameter selection
    3.3.2 Proposed Procedure
    3.3.3 Data Preprocessing
    3.3.4 Model Selection
    3.3.5 Cross-validation and Grid-search
  3.4 K-Nearest Neighbour
    3.4.1 K-value selection
    3.4.2 Training and testing of K-NN Classifier
    3.4.3 Steps
  3.5 Criteria for selection of algorithms
  3.6 Performance evaluation of the models
    3.6.1 Receiver Operating Characteristic (ROC) curve
  3.7 Summary
4 DATA ANALYSIS AND DISCUSSIONS
  4.1 Data collection and description
  4.2 Research Design
  4.3 Preliminary analysis
    4.3.1 Extraction of features
    4.3.2 Descriptive statistics of the normalised data
  4.4 Logistic Regression (LR)
    4.4.1 Logistic regression model
  4.5 Support Vector Machine (SVM)
    4.5.1 Model selection for SVM
    4.5.2 Linear, RBF and Polynomial kernels
  4.6 K-Nearest Neighbour (KNN)
  4.7 Comparing prediction performance of SVM models, Logistic Regression (LR) and KNN
  4.8 Summary
5 CONCLUSION AND RECOMMENDATIONS
  5.1 Conclusion
  5.2 Recommendation
REFERENCES
APPENDIX

LIST OF ABBREVIATIONS

AMA – Accra Metropolitan Assembly
ANOVA – Analysis of Variance
AUC – Area Under Curve
ANN – Artificial Neural Networks
CFIA – Canadian Food Inspection Agency
CDC – Center for Disease Control
CDPH – Chicago Department of Public Health
CHAID – Chi-squared Automatic Interaction Detection
CCNND – Class Conditional Nearest Neighbor Distribution
CART – Classification & Regression Trees
ECG – Electricity Company of Ghana
DFPA – Discriminative Function Pruning Analysis
FDA – Food and Drugs Authority
fmGA – fast-messy Genetic Algorithm
GLM – Generalised Linear Models
GASVM – Genetic Algorithm Support Vector Machine
GSA – Ghana Standards Authority
HACCP – Hazard Analysis Critical Control Point
IG – Information Gain
KNN – K-Nearest Neighbour
KMA – Kumasi Metropolitan Assembly
LSSVM – Least Squares Support Vector Machine
LR – Logistic Regression
MR – Magnetic Resonance
MCC – Matthews Correlation Coefficient
MLE – Maximum Likelihood Estimator
NIST – National Institute of Standards and Technology
NN – Neural Network
NRA – National Restaurant Association
NPA – National Petroleum Authority
OR – Odds Ratio
OVA – one-vs-all
OP-KNN – Optimally Pruned K-Nearest Neighbors
OC – Output Coding
PIM – Partition Index Maximization
PCA – Principal Component Analysis
PNDCL – Provisional National Defence Council Law
QUEST – Quick, Unbiased, Efficient Statistical Tree
RBF – Radial Basis Function
ROC – Receiver Operating Characteristic
SOFM – Self-Organising Feature Map
SPSS – Statistical Package for the Social Sciences
SVM – Support Vector Machine
SVM-RFE – Support Vector Machines Recursive Feature Elimination
USA – United States of America
WDNN – Weighted Distance Nearest Neighbor
WHO – World Health Organisation

LIST OF TABLES

3.1 Confusion Matrix
4.1 Description of variables
4.2 Rotated Component Matrix
4.3 Features used in the model
4.4 Descriptive statistics of the normalised data
4.5 Summary of the logistic regression model
4.6 Resampling results across the tuning parameters of the linear kernel
4.7 Resampling results across the tuning parameters of the RBF kernel
4.8 Resampling results across the tuning parameters of the polynomial kernel
4.9 Receiver Operating Characteristic
4.10 Sensitivity
4.11 Specificity
4.12 Resampling results across the tuning parameter of KNN
4.13 The performance of SVM models
4.14 The performance of KNN models
4.15 The optimal prediction accuracy of LR, SVM and KNN
5.1 Descriptives of categorical features
5.2 Communalities
5.3 Analysis of Deviance Table
5.4 Resampling results across the tuning parameters of the RBF kernel
5.5 Resampling results across the tuning parameters of the polynomial kernel
5.6 Resampling results across different K values

LIST OF FIGURES

3.1 Framework for classifying food establishments
3.2 ROC curve
4.1 ROC of the logistic regression model
4.2 Plot of the three kernels against ROC
4.3 Comparison of Accuracy against K
4.4 Accuracies and misclassification of LR, SVM and KNN
5.1 Scree plot
CHAPTER 1

INTRODUCTION

Inspections of many organisations and establishments are routinely conducted to ascertain that the rules and regulations of a particular domain are followed. Here, critical units are critical violations found in food establishments, which can lead to food-related threats to the health of patrons, such as food contamination (Murphy, DiPietro, Kock, & Lee, 2011). A predictive model in this setting therefore helps to predict the probability of detecting food establishments with critical violations. For the safety of patrons of food establishments, recognised bodies like the Canadian Food Inspection Agency (CFIA), the Chicago Department of Public Health (CDPH) and the Ghana Food and Drugs Authority (FDA) are legally empowered to enforce and implement food-related rules and regulations. In light of this, several food establishments are inspected so that the duration of patrons' exposure to unsafe food establishments is reduced. For example, the Chicago Department of Public Health maintains a database that captures the inspections conducted. In this study, the algorithms Logistic Regression (LR), Support Vector Machine (SVM) and K-Nearest Neighbour (KNN) are used to develop models that can prioritise inspections by detecting the riskiest food establishments, based on food inspection data sets from the City of Chicago open data portal. This chapter is composed of Section 1.1, discussing the background of the study; Section 1.2, exploring the problem statement; Section 1.3, highlighting the objectives; Section 1.4, showing the significance of the study; Section 1.5, presenting the scope of the study; Section 1.6, outlining some limitations of the study; and Section 1.7, covering the organisation of the study.
1.1 Background of the study

In a world of ever-increasing accumulation of data, the fields of statistics and computer science offer several algorithms to analyse these large data sets. An algorithm here means a set of computations and problem-solving approaches that learns from data to produce a model. Algorithms such as Neural Network (NN), K-means, Support Vector Machine, K-Nearest Neighbour, Classification & Regression Trees (CART) and Naive Bayes are employed on large data sets in clustering, classification and regression problems. These algorithms, known as data mining algorithms, are used to sift through large data sets for patterns and relationships, yielding understanding that can be translated into informed decisions. With respect to large data sets and food safety, routine inspections conducted at various food establishments yield large data sets that capture attributes useful for data mining algorithms. For example, in 2014 the Chicago Department of Public Health inspected more than 15,000 restaurants with fewer than three dozen inspectors, which meant every inspector was responsible for approximately 417 food establishments. Given the huge number of inspections required of the inspectors, an increase in the exposure time of patrons to food establishments riddled with critical violations can be inferred. The cost and irredeemable exposure time involved make prioritising food establishments, based on a validated model, a necessity. Schenk et al. (2014) used logistic regression in this regard to prioritise inspections based on food inspection data from the City of Chicago open data portal. However, Logistic Regression (LR) has some deficiencies as compared to Support Vector Machine and K-Nearest Neighbour.
Conventionally, Logistic Regression (LR) seeks to fit a model as best it can, even on a training data set with outliers, which may lead to misclassification (Pochet & Suykens, 2006). Again, LR cannot detect potential non-linear structures in a set of observations; that is, a non-linear relationship would require a non-linear discriminant/decision boundary for better performance. Therefore, this study uses SVM and KNN to assess the performance of these algorithms, relative to LR, in predicting critical violations in food establishments so as to help prioritise inspections. SVM is an algorithm that generates a mathematical function able to classify linearly or non-linearly separable data into two distinct categories (Vapnik, 1998). The main thrust of SVM in classification is to present a model capable of precisely predicting the class labels of test data (new points), having already learned from training data. SVM has many desirable traits such as accuracy, robustness and effectiveness (Golub et al., 1999; Wang & Huan, 2011), even in non-linearly separable problems. It also exhibits greater generalisation ability (Thome, 2012), i.e. the ability to perform similarly on the training data and any new data set. K-Nearest Neighbour is a non-parametric algorithm (Duda, Hart & Stork, 1973), conceptualised by Fix and Hodges (1951), based on the natural intuition of classifying new cases or objects by finding their nearest training examples. It is an instance-based learning algorithm in which a new object or feature vector is classified by a majority vote of its nearest training examples, with a distance metric determining closeness. KNN is robust to noisy training data (i.e. meaningless data that may result from data corruption or inaccurate recording) and very effective with large training data sets (Kalakuntla, 2017).
It is an effective method that is relatively easy to execute (Bhatia, 2010). Data mining algorithms have performed extremely well in areas such as bankruptcy prediction and fraud detection (Kumar, Krovi & Rajagopalan, 1997; Nagi et al., 2008), facial recognition (Swiniarski, 2000) and database marketing (Brachman & Anand, 1996). Many such techniques/algorithms have been used in various fields; however, there are lingering questions about their relative performance, which make some algorithms more suitable to certain problem terrains than others (Becker, 2001). In this study, LR, SVM and KNN are applied to the food inspection data set from the City of Chicago open data portal to compare their performance in correctly identifying any critical violations in a food establishment. The three algorithms suit the classification of food establishments into those with critical violations and those without. The primary process through which the CDPH detects a critical violation in a food establishment is inspection, which may arise through new licence inspections, voluntary calls from concerned citizens and daily routine inspections. This requires a large workforce of inspectors, which is logistically costly. The application of SVM and KNN to the field of food inspections will be greatly valued, as it provides an alternative approach to identifying an unsafe food establishment by analysing data sets on the inspections. This study seeks to present predictive models for detecting critical violations that can help make informed decisions, based on historical data on food inspections.

1.2 Problem Statement

Food safety is an established requirement, backed by laws in many countries, for every food establishment to follow, and it is enforced by inspections of all food establishments.
According to statistics from the World Health Organisation (WHO), about 600 million people fall ill and 420,000 die every year as a result of consuming unsafe food ("Food safety", 2017). In Ghana, the habit of patronising food establishments is increasing, particularly due to changing lifestyles and modernisation (Monney, Agyei & Owusu, 2013). Wilson et al. (1997) revealed that caterers cause about 70% of bacteria-related food poisoning. Therefore, there is the need for routine inspections of food establishments to ascertain safe and good environments for operation. In Ghana, the Food and Drugs Authority (FDA), backed by the Food and Drugs Act PNDCL 305 of 1992, is required to enforce food policies and ensure the safety and wholesomeness of food across the country (Ababio & Lovatt, 2015). With the growing number of food establishments, prioritising inspections of the riskiest restaurants would traditionally depend on the wit and experience of the inspector. The traditional routine inspection is not only time-consuming and revenue-draining, but also stalls the detection of these unsafe food establishments. This could ultimately prove dangerous to the health of patrons. Schenk et al. (2014) performed a study in Chicago using Logistic Regression to help prioritise inspections, in order to reduce the exposure time of patrons to unsafe food establishments. The model identified critical violations about 7.44 days earlier over a 60-day evaluation period relative to the normal daily inspections. Arguably, limited research has been done on comparing and evaluating the performance of SVM and KNN under different parameter settings for finding critical violations at food establishments.
In light of this, the supervised learning algorithms logistic regression, SVM and KNN will be compared to determine which is better able to identify these unsafe food establishments early, with respect to the previous study. Prioritising inspections based on an effective model is of utmost importance. Therefore, this study focuses on developing and validating mathematical models that can be used to detect critical violations, which will help prioritise inspections.

1.3 Objectives

The main objective of this study is to develop a validated set of models to detect food establishments with critical violations. Other specific objectives are:

• To identify and select suitable predictor variables that can be used as inputs in the predictive models.
• To examine the effects of parameter settings on the K-Nearest Neighbour (KNN) and Support Vector Machine (SVM) algorithms.
• To develop models that are able to detect critical violations in food establishments.
• To compare Logistic Regression (LR), K-Nearest Neighbour (KNN) and Support Vector Machine (SVM) to determine which gives the highest detection rate.

1.4 Significance of the study

This study is of utmost importance to many food safety organisations around the world (like the FDA, Ghana), as it will sensitise them to use data mining algorithms on their large accumulated inspection data to prioritise inspections. This will help reduce the cost and time involved in their traditional routine inspections. Literature on such a study is arguably non-existent in Ghana; therefore, it will benefit academia and the research community. Stakeholders in the food establishment sector will be better informed to invest in data mining algorithms to draw statistically backed inferences that will make food establishments safer, which would inure to the benefit of the Ghanaian populace.
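The structural difference motivating this comparison, namely that LR is limited to a linear decision boundary while an SVM with a non-linear kernel and KNN are not, can be illustrated with a small sketch. This is purely illustrative: it uses Python with scikit-learn on a synthetic two-class data set, not the Chicago inspection data or the R/MATLAB/SPSS workflow used in this study.

```python
# Illustrative sketch only: synthetic non-linearly separable data
# (two concentric circles), not the Chicago food inspection data.
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

lr = LogisticRegression().fit(X_tr, y_tr)                  # linear boundary
svm = SVC(kernel="rbf").fit(X_tr, y_tr)                    # non-linear boundary
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)  # local majority vote

for name, clf in [("LR", lr), ("SVM (RBF)", svm), ("KNN", knn)]:
    print(f"{name}: out-of-sample accuracy = {clf.score(X_te, y_te):.3f}")
```

On such data the linear LR boundary performs near chance level, while the RBF-kernel SVM and KNN separate the classes almost perfectly; this is the kind of structural difference the model comparison in later chapters probes on real inspection data.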
1.5 Scope

This study used data collected from the City of Chicago data portal, covering the periods September 2011 to March 2014 and September 2014 to October 2014. The secondary data was on food inspections conducted at various food establishments in the City of Chicago and contained thirty-seven variables (features or attributes). The algorithms Logistic Regression, Support Vector Machine and K-Nearest Neighbour were employed to find the model with a relatively high detection rate. A defined grid parameter setting was used in selecting the best model of the SVM and KNN. All the analyses were carried out in MATLAB (R2017a), R and SPSS.

1.6 Limitations

Below are some of the limitations encountered in the course of the study.

• A defined grid search was used due to the processing speed of the computer used for parameter tuning.
• Several efforts were made to acquire data from regulatory bodies like the Food and Drugs Authority, National Petroleum Authority (NPA) and Electricity Company of Ghana (ECG), but none yielded data appropriate for this study. The issue of privacy of the customer or entity was a challenge, particularly for the ECG, in providing the full data needed. Hence, local data was not used.

1.7 Organisation of the study

The thesis is organised as follows: Chapter one introduces the study while also giving the background, the problem statement, objectives and significance of the study. Chapter two explores relevant literature on food safety and related works on the use of Support Vector Machine and K-Nearest Neighbour. Chapter three provides a detailed explanation of the methods of analysis. Chapter four provides the analysis of data, model building and the discussion of results. Chapter five covers the conclusions and recommendations from the study.
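The defined grid parameter setting mentioned in the Scope and Limitations above can be sketched as follows. This is a hedged illustration in Python with scikit-learn rather than the R/MATLAB/SPSS workflow of the study; the grid values and the synthetic stand-in data are assumptions for illustration, not those actually used in the thesis.

```python
# Hedged sketch of a defined grid search with cross-validation for the
# SVM and KNN tuning parameters; grid values are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Stand-in data with 10 features, mirroring the ten PCA-selected inputs.
X, y = make_classification(n_samples=400, n_features=10, random_state=1)

svm_grid = GridSearchCV(SVC(kernel="rbf"),
                        {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
knn_grid = GridSearchCV(KNeighborsClassifier(),
                        {"n_neighbors": [3, 5, 7, 9, 11]}, cv=5)
svm_grid.fit(X, y)
knn_grid.fit(X, y)

print("best SVM setting:", svm_grid.best_params_)
print("best KNN setting:", knn_grid.best_params_)
```

A coarser ("defined") grid trades tuning quality for computation time, which is exactly the trade-off noted as a limitation in Section 1.6.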
CHAPTER 2

LITERATURE REVIEW

The early identification of critical violations in food establishments is a priority for every food safety organisation, since such violations pose serious health implications for patrons of these unsafe food establishments. Making statistically sound arguments to detect critical violations early in a food establishment is a growing area of research. This chapter presents the relevant literature related to the study, bringing to the fore useful facts and findings realised by previous researchers. It begins with food safety in food establishments, followed by various approaches used in reducing the risks. Some probable variables that predict an unsafe food establishment are explored, while SVM and KNN, and comparisons with logistic regression, are also discussed. Some relevant applications of predictive models to inspections are also outlined. Finally, the measures of performance of the algorithms are explained.

2.1 Food safety in the food establishment industry

Food safety can be described as the practice of handling, preparing and storing food in ways that minimise the danger of people contracting foodborne illness. Foodborne illness occurs when food consumed is contaminated with pathogens such as viruses, bacteria or parasites, or by toxic chemicals (Torgerson et al., 2014). Foods can be rendered unsafe at any time from production to the point of consumption, but lawsuits are more likely to be directed at food establishments in the event of an outbreak than at any other player in the chain (Buzby, Jenkins & Fernandez, 2001). Notwithstanding the rise in global consciousness of foodborne diseases, food safety remains seemingly disregarded (World Health Organisation [WHO], 2015). One contributing factor is the dearth of precise data on the full scope and cost of food-related illness that would propel the allocation of funds by policy makers (WHO, 2015).
Conducting inspections in food establishments is an effective way of gathering data on food safety. Food establishments provide a solution for individuals who are not able to regularly prepare their own food at home. Food at such places is prepared on a large scale, which means the involvement of many hands. In such situations unintentional contamination of food may occur, and its spread has serious health implications for patrons and the country as a whole when outbreaks occur (Omaye, 2004). Factors such as improper handling, preparation and storage of food, and uncontrolled environments where pathogens like bacteria spread easily (Fielding, Aguirre & Palaiologos, 2001), account for critical violations in food service establishments. Food safety organisations such as the Canadian Food Inspection Agency (CFIA), the Chicago Department of Public Health (CDPH) and the Ghana Food and Drugs Authority (FDA) are particularly interested in finding critical violations in food service establishments as early as possible, since any delay would have health and financial implications for patrons and industry players (Knight, Worosz & Todd, 2007). A chronicle of food poisoning events from pathogens such as E. coli and Campylobacter has made food safety a top priority for governments across Europe. Outbreaks from food establishments occur locally, regionally, nationally and internationally. Notable among them is the E. coli outbreak in the USA at the "Jack in the Box" restaurant chain, where 700 people suffered food-borne illness and four children died as a result of eating contaminated meat bought at the chain (Golan et al., 2004). This resulted in an estimated cost of $160 million to the Jack in the Box restaurant through reduced sales and lawsuits by its patrons (Fielding et al., 2001). In 2014, there was a cholera outbreak in Ghana with the Greater Accra region as the epicentre.
The reported numbers of cases and deaths were 28,975 and 243 respectively (WHO, 2015). Statistics from the FDA Ghana in 2013 showed that about 77% of total food-borne illnesses can be traced to food establishments (Ghana Standards Authority [GSA], 2013). Reflecting on such outbreaks calls for stringent measures to prevent recurrence. In light of this, food safety organisations, through trained local health professionals, perform routine inspections as a control mechanism to decrease or entirely eradicate the risk factors related to food-borne diseases (Reske et al., 2007). The various ways identified to reduce the risk of contracting food-borne disease at food establishments are outlined in the next section.

2.2 Approach to reduce food safety risks

The implementation of food safety in food establishments is still a challenge because of the large number of people involved. For example, 13 million employees working at 980,000 restaurants oversee 190 million meals served daily in the USA (National Restaurant Association [NRA], 2012). This presents many opportunities for prepared food to become contaminated. Therefore, the implementation of the Hazard Analysis Critical Control Point (HACCP) system is highly recommended. HACCP is a process control system that identifies where hazards might occur in the food production process and sets strict measures to prevent those hazards from happening (WHO, 2015). It has placed more responsibility on industry players to ensure protection of the consumer from food-borne illness. With the goal of decreasing or eradicating food safety risks, one shared approach is to conduct routine inspections by trained professionals at every food establishment to look for any violations of food safety regulations such as the Food and Drugs Act PNDCL 305 of 1992. This activity, however, is focused on preventing imminent food-borne disease outbreaks.
The inspection of food establishments is not just to reduce the risk of an outbreak but also to build a database that can support further studies to mitigate future violations. Routine inspections alone will not suffice, as the Centers for Disease Control and Prevention (CDC) in the USA estimates that almost half of the 9 million people who suffer food-borne illnesses every year contract them from restaurants (CDC, 2013). Therefore, food establishments need to take responsibility to help reduce food-borne illnesses. Following the HACCP principles helps establish a well-monitored environment at the food establishment site. The principles of HACCP are: conducting a hazard analysis, determining critical control points, setting critical limits, setting monitoring points, acting on corrective measures, setting verification procedures and keeping databases of all documentation (WHO, 2015). Several studies in Ghana revealed that locally owned businesses have limited regulatory systems for food safety and that the level of food safety education among food handlers in Accra and Kumasi is low (Ababio & Adi, 2012; Ababio et al., 2012; Feglo & Sakyi, 2012; Tomlins et al., 2002). Another study revealed that both food handlers' and patrons' interest in proximity, the appearance of the environment or vendor, and price cloud their core responsibility of good hygiene practices (Rheinlander et al., 2008). Agyei-Baffour et al. (2013) established that existing food safety rules incorporate some HACCP principles, but sensitisation among food operators in Ghana is low. In Ghana, the FDA, through the Ghana Tourist Board and local government authorities such as the KMA and AMA, sensitises food operators to the food safety guidelines and the HACCP principles through training (Agyei-Baffour, Sekyere & Addy, 2013).
For example, according to the 2016 FDA report, 5,430 street food vendors, travellers and market women were trained on food safety, and 7,983 pupils were also educated on food safety and hygiene (Food & Drugs Authority [FDA], 2016). Apart from following the regulations spelt out for food safety at food establishments, guidelines such as owners or managers taking an active role in acquiring foodstuffs, making stricter rules on the premises and being responsive to the complaints of customers can also help reduce food-related illness.

2.3 Factors that affect food safety in food establishments

The main health problem in relation to food in the world is food-borne disease. As the population of the world increases, the responsibility of ensuring the safety of our food also becomes more burdensome. In many developing countries, the challenge in food safety is most evident in the poor handling, preparation and storage of food, weak monitoring systems and lack of training for food handlers (Tessema, Gelaye & Chercos, 2014). Food hygiene is the most basic necessity for all food service establishments to observe; it involves the handling, preparation and storage of food. The WHO has outlined five practices essential to avoiding food-borne illness: keep food clean, separate raw from cooked foods, cook food thoroughly, keep food at safe temperatures, and use safe water and raw materials (WHO, 2006). Flouting these essential practices affects food safety negatively. Almost all five practices can be controlled in the food establishment (Arendt, Strohbehn & Jun, 2015). Practising them, however, has some challenges, such as time constraints, inconvenience and inadequacy of resources, as expressed by employees of a restaurant in a focus group discussion (Howells et al., 2008).
In a bid to find the factors that influence the use of safe food handling practices by restaurant employees, several studies have concentrated on food safety knowledge, training, attitudes and motivation (Allwood et al., 2004; Lynch et al., 2003). A study by Cushman et al. (2001) showed that the length of stay in a particular restaurant also affects employees' practice of personal hygiene, as part-time student employees were found to practise personal hygiene more properly than the main employees. Some environmental factors such as weather also affect food safety, since an increase in temperature increases the likelihood of spoilage of foods such as dairy products, meat and fish. Other factors, such as physical damage to food and prolonged storage, also affect the safety of food. Food safety culture is the shared conduct of employer and employees in the handling of food in the environment of a food establishment. Research findings showed that the creation of a food safety culture is vital, as it lets employees know how indispensable food safety is to the establishment (Yiannas, 2008). Elements of food safety culture in food establishments include support from administration, communication, and employees' attitudes and manners (Abidin, Arendt & Strohbehn, 2014). Following a food safety culture creates an atmosphere for food safety, undergirded by adherence to the food safety laws. The challenge of facilitating food safety places a huge obligation on food producers and handlers. No matter how small an outbreak may be, it can quickly grow into an international emergency because of how fast products move across borders. Therefore, deliberate effort is required from everyone (governments, the private sector, industry players, customers, etc.).
2.4 Reviewed literature on support vector machine

The Support Vector Machine (SVM), a family of supervised learning methods introduced by Vladimir Vapnik, has been shown to work effectively in regression and classification problems and in the detection of outliers (Ekici, 2012). The SVM is a useful machine learning method which attains top predictive accuracy by learning from a training data set to develop an optimal hyperplane to classify data. It has gained popularity in the pattern recognition and computer vision communities because of its good generalisation capabilities and high accuracy (Du et al., 2017). It is widely applied in the biological and other sciences, in fields such as computational neuroscience, pharmaceutical data analysis, drug design and fraud detection (Arvey et al., 2012). In separating two groups of observations by a hyperplane with a high separating margin, two problems emerge: how well the separating hyperplane generalises, and the computational challenge. In 1965, Vapnik (see Vapnik, 1982) gave a solution to the first part of the problem by presenting the best hyperplane for separable classes. The optimal hyperplane in this instance represents a linear decision function able to divide the points (vectors) into two classes while leaving a wide margin. The support vectors are therefore defined by the margin of widest separation between the two classes. The study posits that, with a small number of support vectors in comparison to the size of the training set, there will be high generalisation. The second problem remained until 1992, when Boser, Guyon and Vapnik (1992) changed the process of operation. A comparison was also made between two vectors before transforming them non-linearly, as this helped to create better decision surfaces.
Hence the name support vector network. The concept was later extended to the more complex setting of non-separable data by using different kernel functions for better precision, and a new learning machine as powerful and widespread as neural networks was born. Comprehensive information on the SVM classifier is available in Cortes and Vapnik (1995) and Tsai et al. (2009). The performance of an SVM invariably depends on the method used to select the optimal number of features and on how the best kernel parameters are fixed (Frohlich & Chapelle, 2003).

2.4.1 Feature Selection

The process of feature selection helps to identify the most predictive subset of fields in the available database so that a relevant (reduced) number is offered to the algorithm for further processing (Huang & Wang, 2006). This process helps to draw out the relevant information from the available data set, which in turn reduces computational time (Huang & Wang, 2006). The choice of features affects various aspects of the classification, such as the algorithm's accuracy, the computational time, the training examples required and the cost associated with the features. Selecting a subset of features from a database is significant in SVM as it extracts the important information from the available data set and reduces the computation time. As stated by Yang and Honavar (1998), the choice of features affects not only the classification algorithm's accuracy but also the time required to learn a classification function, the costs related to the features and the number of examples required for learning. Not many algorithms have been proposed for SVM feature selection in the literature (Bradley, Mangasarian & Street, 1998; Bradley & Mangasarian, 1998; Weston et al., 2001; Guyon, Weston, Barnhill & Vapnik, 2002; Mao, 2004). Mao (2004) proposed a feature selection technique known as Discriminative Function Pruning Analysis (DFPA).
The intuition behind the DFPA technique is to learn the SVM's classification function from the training data by first making use of every input variable, and then selecting the number of features through pruning analysis. The DFPA technique employs both wrapper and filter methods, i.e., it first uses the filter method to avoid training a huge number of SVM classifiers and later uses the wrapper method to assess the selected features based on the classifier's performance. Bradley et al. (1998) proposed a mathematical programming technique that minimises a concave function on a polyhedral set. In a different study, Bradley and Mangasarian (1998) used another term to penalise the size of the feature subset. Weston et al. (2001) presented a binary vector denoting the presence or absence of each feature, with the aim of finding the best such vector; a real-valued vector can also be used, so that a gradient descent approach determines its best value and the matching feature subset. The approaches used in these studies all assess features one by one. Guyon et al. (2002), however, proposed another feature selection method known as SVM Recursive Feature Elimination (SVM-RFE), which assesses features collectively. The method removes irrelevant or redundant features and also entails fewer computations than the wrapper method. A discriminative measure for choosing a feature selection method based on its appropriateness should therefore always be considered. Principal Component Analysis (PCA) is a popular approach used in several studies (Guyon & Elisseeff, 2003; Song, Guo & Mei, 2010; Uğuz, 2011) to select or extract relatively important features that are independent of each other. In this study, PCA was employed to reduce the dimensionality of the data before feeding it to the algorithms.
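The dimensionality-reduction step just described can be illustrated with a minimal sketch of PCA's core computation: centre the data, form the covariance matrix and extract the leading principal axis by power iteration. The function names and the toy data below are illustrative, not the thesis's actual Chicago inspection data.

```python
# Minimal sketch of PCA-style dimensionality reduction (leading component
# only, via power iteration); the toy data and names are illustrative.
import math

def center(data):
    """Subtract the column means so each variable has mean zero."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in data]

def covariance_matrix(data):
    """Sample covariance matrix of mean-centred data."""
    n, p = len(data), len(data[0])
    return [[sum(row[i] * row[j] for row in data) / (n - 1)
             for j in range(p)] for i in range(p)]

def leading_component(cov, iters=200):
    """Power iteration: the dominant eigenvector is the first principal axis."""
    p = len(cov)
    v = [1.0] * p
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy data: two positively correlated variables, so the first principal
# axis should point roughly along the diagonal.
X = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
     [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]]
Xc = center(X)
axis = leading_component(covariance_matrix(Xc))
scores = [sum(a * b for a, b in zip(row, axis)) for row in Xc]  # 1-D projection
```

Projecting each centred observation onto the leading axes in this way yields the reduced set of uncorrelated inputs that the study feeds to the classifiers.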
2.4.2 Kernel function selection

A kernel is simply a function for measuring the similarity between two observations (Sahami & Heilman, 2006). The kernel function maps the data from the input space into a high-dimensional feature space, in which an optimal hyperplane is fitted to separate the data into their respective classes. The kernels commonly used for SVM classification are the polynomial kernel function, the radial basis function and the linear kernel function. Selecting a Support Vector Machine kernel can be tricky, since the choice typically depends on the distribution of the input values (z) of the training data set. To solve the support vector classifier problem, the inner products of the observations are used instead of the observations themselves. The inner product of two observations is

\[ \langle z_i, z_{i'} \rangle = \sum_{j=1}^{p} z_{ij} z_{i'j}. \tag{2.1} \]

The linear support vector classifier can therefore be expressed as

\[ f(z) = \beta_0 + \sum_{i=1}^{n} \alpha_i \langle z, z_i \rangle, \tag{2.2} \]

where the \(\alpha_i\), \(i = 1, 2, \ldots, n\), are one parameter per training observation. The \(\binom{n}{2}\) inner products between all pairs of training observations are used to estimate the parameters \(\alpha_1, \ldots, \alpha_n\) and \(\beta_0\). Some kernels are explained below (James, Witten, Hastie & Tibshirani, 2013).

• Linear kernel. The linear kernel essentially computes the similarity of a pair of observations using the Pearson (standard) correlation. Training an SVM with a linear kernel is generally much faster than with any other kernel. Suppose the inner product in (2.1) is generalised to the form

\[ K(z_i, z_{i'}), \tag{2.3} \]

where K is referred to as a kernel. The equation below is known as the linear kernel:

\[ K(z_i, z_{i'}) = \sum_{j=1}^{p} z_{ij} z_{i'j}. \tag{2.4} \]

Fewer parameters therefore need to be optimised when training an SVM with a linear kernel.

• Polynomial kernel. The polynomial kernel can be described as a function representing vectors (training samples) of similar type in a feature space over polynomials of the original variables, allowing the learning of non-linear models. Replacing each instance of \(\sum_{j=1}^{p} z_{ij} z_{i'j}\) with the quantity

\[ K(z_i, z_{i'}) = \Big(1 + \sum_{j=1}^{p} z_{ij} z_{i'j}\Big)^{d} \tag{2.5} \]

yields the polynomial kernel of degree d, where d is a positive integer. Taking d > 1 amounts to moving from the linear realm into a higher-dimensional space. Combining equation (2.5) with the support vector classifier yields an SVM with a polynomial kernel of the form

\[ f(z) = \beta_0 + \sum_{i \in S} \alpha_i K(z, z_i). \tag{2.6} \]

• Radial basis function (RBF) kernel. The RBF kernel, also called the Gaussian kernel, is a well-known kernel function used in SVM classification to draw fully non-linear hyperplanes. It takes the form

\[ K(z_i, z_{i'}) = \exp\Big(-\gamma \sum_{j=1}^{p} (z_{ij} - z_{i'j})^2\Big), \tag{2.7} \]

where \(\gamma\) is a positive constant that sets the "spread" of the kernel.

2.4.3 Theoretical review of SVM

Some significant contributions to SVM and its accomplishments in data mining across diverse fields are surveyed below. In the biomedical sciences, Valentini (2002) used gene expression data and proposed approaches built on non-linear SVMs with Gaussian and polynomial kernels, and on Output Coding (OC) ensembles of learning machines, to separate normal tissue from malignant tissue, to categorise lymphomas of different types, and to examine the roles of collections of coordinately expressed genes in the processes of cancerous lymphoid tissues. The study showed that the SVM has the ability to appropriately separate normal tissues from tumour-ridden ones, and that different types of lymphoma can be classified using OC ensembles. Yang et al. (2005) used three classifiers, Learning Vector Quantisation (LVQ), the Self-Organising Feature Map (SOFM) and Support Vector Machines, to study the construction of an innovative signal classifier for small reciprocating refrigerator compressors by means of vibration and noise signals.
A novel approach was proposed to identify goods at the semi-finished stage in automated bulk production of reciprocating compressors for household refrigerators. SOFM with LVQ was found to exhibit high accuracy and proved to be the best technique for categorising healthy and faulty conditions of small reciprocating compressors. Polat and Güneş (2007) used the Least Squares Support Vector Machine (LSSVM) to develop a medical decision-support system in which LSSVM was used to detect breast cancer. Evaluation was done to check the robustness of LSSVM using specificity and sensitivity analysis, classification accuracy, the confusion matrix and the k-fold cross-validation method. The Wisconsin Breast Cancer Diagnosis data set was used in the study, and a classification accuracy of 98.53% was realised, which suggests that LSSVM can help in diagnosing breast cancer. The study proposed that further exploration on larger data sets would increase the accuracy. Chaplot et al. (2006) suggested a different method for classifying MR images by using wavelets as input to support vector machines and neural network self-organising maps. A data set of 52 MR brain images was used, with the data separated into two groups, normal or abnormal. The neural network self-organising maps and support vector machines achieved good classification rates of 94% and 98% respectively; in comparison, the SVM classifier showed a higher classification rate than the self-organising map. The method was applied only to T2-weighted images at a specific depth inside the brain. The study proposed exploring it on T1-weighted, proton density and other kinds of MR images, from which a software diagnostic system could be developed for identifying brain disorders such as Alzheimer's, Parkinson's and Huntington's diseases. Hong et al. (2008) proposed a novel method that integrates SVMs with the one-vs-all (OVA) scheme and naive Bayes classifiers in multi-class fingerprint classification systems. To train the OVA SVMs and naive Bayes classifiers, indicative fingerprint features such as the FingerCode, singularities and pseudo ridges were used. The NIST-4 database was used to validate the proposed method, and classification accuracies of 90.8% for the five-class classification problem and 94.9% for the four-class classification problem were realised. Zhang et al. (2008) presented a study describing the effect of employing multi-words for text representation on the performance of text classification. To use multi-words for text representation, two strategies based on different semantic levels of the multi-words were developed. The first strategy extracts multi-words from documents according to their syntactic structure; the second is a combination strategy based on the subtopics of the general proposal for representation. Robust classification performance was achieved by using the Information Gain (IG) method to prune multi-words from the feature set. Finally, SVMs with linear and non-linear kernels were applied to a series of text classification tasks. It was realised that the effect of using distinct representation strategies outweighs the effect of using different kernels on classification performance. It further illustrated that individual-word representation outclasses multi-word representation, and it confirmed the power of SVMs in text classification. Supporting the claim of SVM accuracy in classification problems, studies such as Parikh et al. (2010) showed the classification accuracy of a newly proposed SVM to be around 98%.
The input features proposed by the new SVM fault-classification algorithm to identify the faulted phases were three samples of the phase currents together with the zero-sequence current. Tests were carried out on a data set of 25,200 test cases to check the feasibility of the technique, and they indicated that the proposed technique is accurate and robust to any fault condition and system variation. Este et al. (2009) introduced a new classification technique based on SVM and described an algorithm that permits the classifier to perform correctly with only a few hundred training samples. The proposed classifier was tested on three sets of traffic traces, and classification accuracy exceeded 90%. Even with training data sets of reduced size, the study confirmed the SVM classifier to be very effective. The Support Vector Machine is also well known to be effective in dealing with outliers. Qu and Zuo (2010) proposed an algorithm for effective data cleaning, data processing and feature selection. The algorithm was based on SVM and random sub-sampling validation. Outliers and irrelevant features were identified by measuring the misclassification rate while adopting the backward selection method of feature selection. Three data sets were used to test the performance of the data cleaning algorithm, which exhibited a good capability for detecting outliers in all the data sets. Again, Wu et al. (2014), in order to address the sensitivity of SVM to noise and outliers in the training data, applied fuzzy methods to SVM. The proposed Partition Index Maximisation (PIM) clustering-based Fuzzy SVM (FSVM) algorithm showed more reasonable membership when applied to five benchmark data sets. It also showed that the PIM-FSVM algorithm is more robust to noise, indicating that the algorithm is effective.
Lo and Wang (2012) used SVM-based classifiers to classify MR images in a twofold experiment: one part made up of computer-generated phantom images and the other of MR images. The SVM showed better classification accuracy than C-Means (CM) when the efficiency and feasibility of the methods were evaluated, and it again proved robust against noise. Chou et al. (2014) proposed a high-performing hybrid artificial intelligence model that merges a fast messy Genetic Algorithm (fmGA) with SVM, aimed at early prediction of dispute propensity in the initial phase of public-private partnership projects. The fmGA optimises the parameters of the SVM, and the SVM provides the learning and curve fitting. Several classifiers, CART, QUEST, C5.0 and CHAID, were compared with GASVM, the hybrid approach. Considering the precision, accuracy, AUC and sensitivity of all the models, GASVM produced the topmost overall performance score of 0.871. Rai and Yadav (2014) introduced a new and efficient way of recognising and extracting features from the iris by utilising both SVM and the Hamming distance. The approach used distinct feature extraction techniques for the Hamming distance and the SVM-based classifier, which they claim increases efficiency. The proposed method proved accurate and computationally successful, with recognition rates on the CASIA and Chek image databases of 99.91% and 99.98% respectively. In summary, the reviewed literature suggests that the Support Vector Machine is an effective algorithm with high predictive power, particularly for regression and classification problems.
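The kernels reviewed in Section 2.4.2 can be made concrete with a minimal sketch. The function names, the degree d and the value of gamma below are illustrative choices, not parameters from the thesis's fitted models.

```python
# Minimal sketch of the SVM kernels in equations (2.4), (2.5) and (2.7);
# names, degree d and gamma are illustrative choices.
import math

def linear_kernel(zi, zj):
    """Equation (2.4): plain inner product of two observations."""
    return sum(a * b for a, b in zip(zi, zj))

def polynomial_kernel(zi, zj, d=2):
    """Equation (2.5): (1 + <zi, zj>)^d for a positive integer degree d."""
    return (1 + linear_kernel(zi, zj)) ** d

def rbf_kernel(zi, zj, gamma=0.5):
    """Equation (2.7): exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(zi, zj))
    return math.exp(-gamma * sq_dist)

z1, z2 = [1.0, 2.0], [2.0, 1.0]
print(linear_kernel(z1, z2))      # 4.0
print(polynomial_kernel(z1, z2))  # (1 + 4)^2 = 25.0
print(rbf_kernel(z1, z1))         # identical points give similarity 1.0
```

Note how the RBF kernel returns its maximum of 1 for identical observations and decays towards 0 as they move apart, which is what lets it carve fully non-linear decision boundaries.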
2.5 Reviewed literature on k nearest neighbour

The K-nearest neighbour is a simple algorithm conceptualised by Fix and Hodges (1951) which is based on the natural intuition of classifying a new case or point by finding its nearest training examples. It is used particularly in regression and classification problems and is considered one of the earliest, most accurate and simplest algorithms (Hamamoto, Uchimura & Tomita, 1997; Alpaydin, 1997). The underlying assumption of KNN is that similar instances have similar class labels (in classification) or similar target values (in regression). A machine learning algorithm learns automatically from data and improves with experience; among machine learning practitioners, KNN is referred to as a lazy learner. Algorithms like the K-nearest neighbour were introduced to deal with pattern classification (Yang, 1999), as was the Support Vector Machine (SVM) (Japkowicz, 2000). Two choices primarily affect the performance of the KNN algorithm: the selection of K and the distance metric used (Latourrette, 2000).

2.5.1 Selection of K-value

A very important and sensitive parameter in the KNN algorithm is the choice of K, since it affects the performance of the classifier. Studies seldom explain the method employed to choose the KNN parameters. Sun and Huang (2010), in their quest to identify an optimal K, put forward an adaptive KNN algorithm that finds, for each training example, a K that attains the right class label. Having identified a limitation in the conventional KNN algorithm, which uses the same number of nearest neighbours for every test example, the adaptive KNN algorithm was proposed and tested on numerous data sets. Empirical results suggested that the adaptive KNN algorithm is more effective than the conventional KNN algorithm. Some published studies use varying choices of K; for instance, Rosenfeld et al. (2008) made use of a KNN algorithm to predict the origin of cancer tissue from microRNA profiles. By considering a limited range of values of K in the parameter space, K = 3 was found to be the best value. Similarly, Lu et al. (2005) took this kind of approach to selecting an optimal K. Guo et al. (2003) transformed the training set into a model that assigns a group to a set of similar examples from the available data. The output model consists of the category of each group and the similarity of the farthest points in a group relative to the point at the centre of that group. In this way, the size of the training data is reduced and the choice of K is determined automatically, since the best value of K can be taken as the number of points in each group. The model was applied to six data sets to test its performance and proved more accurate than the traditional KNN. Suguna and Thanushkodi (2010), in another study, employed a Genetic Algorithm together with KNN in order to increase performance. The traditional KNN algorithm considers all the training samples when selecting the k neighbours, but this method employed the GA to select the k neighbours directly, after which the distance is computed to classify the test samples appropriately. The combined algorithm was tested on five distinct data sets, and empirical results suggested an improvement in classification accuracy. Selecting a desirable K rests entirely on the data. A larger K greatly reduces noise in classification but fails to set clear boundaries among the classes, while a small K leads to a large variance in prediction. Therefore, K should be large enough to reduce misclassification yet small enough that the K nearest cases remain close to the new case (Hassanat, Abbadi, Altarawneh & Alhasanat, 2014).
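The simplest form of this selection, scanning candidate values of K against held-out data, can be sketched as follows. The two-class toy data, the Euclidean distance and the candidate grid are illustrative, not the thesis's actual setup.

```python
# Minimal sketch of selecting K for a KNN classifier on a hold-out set;
# the two-cluster toy data and the candidate values of K are illustrative.
import math
from collections import Counter

def euclidean(c, d):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c, d)))

def knn_predict(train, k, point):
    """Majority vote among the k training points nearest to `point`."""
    neighbours = sorted(train, key=lambda xy: euclidean(xy[0], point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def accuracy(train, test, k):
    hits = sum(knn_predict(train, k, x) == y for x, y in test)
    return hits / len(test)

# Two well-separated clusters: class 0 near the origin, class 1 near (5, 5).
train = [([0, 0], 0), ([1, 0], 0), ([0, 1], 0), ([1, 1], 0),
         ([5, 5], 1), ([6, 5], 1), ([5, 6], 1), ([6, 6], 1)]
test = [([0.5, 0.5], 0), ([5.5, 5.5], 1), ([1, 2], 0), ([4, 5], 1)]

# Grid of odd K values (odd values avoid ties in two-class voting).
best_k = max([1, 3, 5, 7], key=lambda k: accuracy(train, test, k))
print(best_k, accuracy(train, test, best_k))
```

This is the hold-out form of the idea; in practice the thesis's grid of K values would be scored on validation folds rather than on a single test split.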
2.5.2 Distance Metric

In the KNN algorithm, the distances from a new point to its nearest neighbours are very important in making predictions. Some popular choices of distance metric are the Euclidean, Hamming, Manhattan and Minkowski distances. Let c and d be vectors of numeric attributes, \(c = (c_1, c_2, \ldots, c_n)\) and \(d = (d_1, d_2, \ldots, d_n)\), where \(c_i\) and \(d_i\) are the ith components of c and d respectively.

• Minkowski distance. Several distance metrics have a special relationship with the Minkowski distance, namely the Manhattan, Chebyshev and Euclidean distances. The Minkowski distance is given by

\[ D_{M}(c, d) = \Big( \sum_{i=1}^{n} |c_i - d_i|^{z} \Big)^{1/z}, \tag{2.8} \]

where z is a positive value. When z = 1 it becomes the Manhattan distance, and when z = 2 it becomes the Euclidean distance. The Chebyshev distance is the Minkowski distance in the limit \(z = \infty\).

• Euclidean distance. The square root of the sum of squared differences between the corresponding components of the vectors:

\[ D_{E}(c, d) = \sqrt{ \sum_{i=1}^{n} |c_i - d_i|^{2} }. \tag{2.9} \]

• Manhattan distance. The sum of absolute differences between the corresponding components of the vectors:

\[ D_{Ma}(c, d) = \sum_{i=1}^{n} |c_i - d_i|. \tag{2.10} \]

• Chebyshev distance. A measure of the distance between two vectors in which the distance is the largest difference on any coordinate dimension:

\[ D_{C}(c, d) = \max_{i} |c_i - d_i|. \tag{2.11} \]

• Hamming distance. A metric that counts the number of mismatches between two vectors. It is usually used for nominal data and string analyses, but it is also applicable to numerical data:

\[ D_{H}(c, d) = \sum_{i=1}^{n} 1_{c_i \neq d_i}. \tag{2.12} \]

2.5.3 Theoretical review of KNN

Some significant accomplishments in the application of the K-nearest neighbour to data mining in diverse fields are surveyed below. In a bid to improve the performance of the nearest neighbour, a method known as the Weighted Distance Nearest Neighbour (WDNN) (Jahromi, Parvinnia & John, 2009) was proposed.
The WDNN assigns non-negative weights to each training instance to compensate for the nearest neighbour's sensitivity to the distance function, while utilising all the training instances when generalising. The technique can also be described as an effective way of reducing the number of instances in a training set. In a related work, Liu and Chawla (2011) proposed a new KNN weighting strategy to tackle problems in traditional KNN that arise when one class has far more samples than the other. The method uses class confidence weights, which involve utilising the probability of attribute values given class labels to weight prototypes in KNN. The bias towards the majority class is corrected, which translates into improved performance. Kriminger et al. (2012) also proposed using the geometric structure of the data to lessen the influence of class imbalance on KNN's performance. The method is known as Class Conditional Nearest Neighbor Distribution (CCNND). Existing approaches address class imbalance through some form of sampling scheme or by applying error costs. CCNND was applied to imbalanced data sets fetched from the UCI Machine Learning Repository and to real-world oil pipeline data, and it performed markedly better than traditional KNN. Another work, by Yu et al. (2010), proposed a method known as Optimally Pruned K-Nearest Neighbors (OP-KNN), which proved competitive with advanced methods while reducing computational time. By using KNN as the kernel for regression, a one-hidden-layer feedforward neural network is created; the method showed good performance while remaining a relatively simple model. Several factors contribute to the performance of KNN, so Parry et al. (2010) considered varying factors such as the number of features, distance metric, number of neighbours, vote weighting and decision threshold, which gave 463,320 KNN models.
The models were validated using data on 478 neuroblastoma patients, and by varying these factors the optimal KNN model was identified. From the literature reviewed, other methods have been integrated into the traditional KNN in order to either reduce computational time or select the optimal K. This study, however, makes use of varied values of K to select the optimal model for the KNN algorithm.

2.6 Application of predictive models in inspections

Predictive modelling can be described as the process of building, testing and evaluating models to predict the likelihood of an event happening. In most human institutions, inspections are conducted to ascertain that rules and regulations are followed, and predictive models have been applied in various areas to help detect critical units. Some of these applications are surveyed below. Schenk et al. (2014) used data on food inspections, 311 complaints, crime and other sources to build a predictive model able to detect critical violations in food establishments. The main purpose of the study was to prioritise inspections, so as to reduce the time patrons are exposed to unsafe food establishments. A logistic regression model was used, and it was able to identify critical violations approximately 7.44 days earlier over an evaluation period of two months. In a similar study reported by Kassel (2017), Azavea (a geospatial technology company) partnered with government agencies and other private companies to build a model to support informed decisions. Machine learning algorithms were used to build models that predict the probability that a building will fail an inspection, based on historical data from the City of Philadelphia Department of Licenses and Inspections. Classification accuracy was the evaluation criterion for the model, and it achieved 74.19% accuracy on the test set used. The study suggested that more data-driven tools could be used to improve on the model developed.
The recent fire incident at Grenfell Tower in London, among others, motivated the City of Pittsburgh's Bureau of Fire (PBF) to use its inspection data to develop predictive models that detect risky properties likely to experience fire. Historical data on fire incidents, coupled with routine property inspections, were used. Logistic regression, AdaBoost, Random Forest and XGBoost were the machine learning algorithms employed to help prioritise inspections. The presence of an alarm or smoke detector was the predominant predictive feature among the variables in the model. The predictive model built was able to predict the occurrence of a fire incident at a specific location, and it performed better than the methods previously used to prioritise inspections (Smart Cities Initiative, 2018). Also, Madaio et al. (2016) asked the questions, "how do we help the Atlanta Fire Rescue Department (AFRD) identify new properties that need inspections?" and "how do we help AFRD prioritise their property inspections by fire risk?". To answer these questions, Random Forest, logistic regression, Gradient Boosting Trees and SVM were applied to historical inspections of properties with fire incidents. The Random Forest and SVM performed best; in the end, 69 properties were flagged as high risk and 48 violations of the fire safety codes were found. The Philadelphia Department of Licenses and Inspections (L&I) razed several buildings to the ground over concerns of possible collapse. Mosley and Steif (2018) utilised the data on previous demolitions to train models capable of predicting the likelihood that a building collapses. The predictive algorithms Naive Bayes, logistic regression, Random Forest (RF) and Gradient Boosting Machine (GBM) were compared. The properties the algorithms flagged for demolition were compared with the properties actually razed.
The outcome showed better performance from the GBM and RF. Based on the predicted probabilities from these algorithms, the GBM found about 1,800 parcels citywide to be unsafe, with probabilities greater than the 10% threshold set. Though many data mining algorithms have been used in the literature reviewed, arguably little has been applied to detecting critical violations in food establishments. Whereas many studies (Palaniappan, Sundaraj & Sundaraj, 2014; Weinberger, Blitzer & Saul, 2006; Li, Zhang & Zhao, 2017) use a fixed or single parameter setting for the SVM and KNN (for example k = 1 in KNN and C = 1 in SVM), this study made use of a grid of different values around the fixed parameters usually considered. Predicting the likelihood of identifying critical units in various sectors is a difficult undertaking, so data scientists suggest using data mining algorithms; hence the use of SVM, LR and KNN in this study.

2.7 Comparing Logistic Regression, K-Nearest Neighbour and Support Vector Machine

There are many supervised learning algorithms, such as Support Vector Machine (SVM), Logistic Regression (LR), Naïve Bayes, K-Nearest Neighbour, random forest and decision trees. A supervised learning algorithm focuses on building a model able to predict the response values for a new dataset. The main challenge posed in supervised learning is the selection of the appropriate data mining algorithm for classification. One underlying criterion vital for the selection of an algorithm is the characteristics of the training data set, such as its size, quality and nature. Duda (2001) established, and the 'no free lunch' theorem (Wolpert, 1997) further reiterates, that no particular classifier works optimally all the time; performance depends on the type of problem and the available data set.
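The grid idea mentioned above, trying several values of K rather than fixing k = 1 and keeping the value with the best hold-out performance, can be sketched in plain Python (an illustrative sketch on made-up toy data, not the Chicago inspection data used in this study):

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def best_k(train, valid, candidates):
    """Pick the k in `candidates` with the highest hold-out accuracy."""
    def accuracy(k):
        return sum(knn_predict(train, x, k) == y for x, y in valid) / len(valid)
    return max(candidates, key=accuracy)

# Two well-separated toy clusters
train = [((0, 0), "neg"), ((0, 1), "neg"), ((1, 0), "neg"),
         ((5, 5), "pos"), ((5, 6), "pos"), ((6, 5), "pos")]
valid = [((0.5, 0.5), "neg"), ((5.5, 5.5), "pos")]
k = best_k(train, valid, [1, 3, 5])
print(knn_predict(train, (0.2, 0.8), k))   # "neg"
```

In practice the hold-out step is usually replaced by cross-validation, and an analogous grid is scanned over C (and the kernel parameters) for the SVM.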
Depending on the data set, questions such as the number of training examples, the dimensionality of the features and their independence arise. Parry et al. (2010) compared KNN and logistic regression to justify the nonlinear classifiers used in their study of gene expression. KNN performed significantly better than logistic regression when their mean performances were compared using a Bonferroni-adjusted significance level of 0.005. Again, the Matthews Correlation Coefficient (MCC) showed that KNN performed significantly better than logistic regression on the responses from some tumour cells in the data set, while logistic regression performed better than KNN in the comparison of gender. Whereas linear classifiers like logistic regression separate the feature space with a straight line, nonlinear classifiers like KNN and Support Vector Machine can build more complex decision surfaces (Parry et al., 2010). LR, KNN and SVM are well suited to objects that can be classified into separate class labels. Kuramochi (2005), working with gene profiles, found that KNN performed better than SVM, which supposedly has a more complex structure. Logistic regression is at a disadvantage compared with an algorithm such as SVM that uses kernel functions to map input vectors into a high-dimensional feature space for classification; one significant property of a hyperplane in such a space is that it can be better tuned to capture the details of the data. In a study by Joachims (1998), KNN, SVM and other algorithms were applied to the Reuters data, and KNN performed best among the methods; Yang (1997) confirmed these findings in a different study. Rana et al. (2015) compared SVM, logistic regression, Naïve Bayes and KNN using online data from the UCI machine learning repository. As the researchers noted, each algorithm performs differently, largely due to parameter selection and the available dataset.
The SVM had test accuracies of 93% and 68% on the breast cancer diagnosis data and the breast cancer recurrence data respectively. The regularised logistic regression had test accuracies of 92.10% and 72% on the same two data sets, and the KNN using Euclidean distance had test accuracies of 95.63% and 72% respectively. The training accuracies of both the SVM and the KNN (using Euclidean distance) were 100% for both data sets, while those of the regularised logistic regression were 93.54% and 80% respectively. An SVM with the Radial Basis Function (RBF) kernel gave the best outcome when γ was small and C was large. The regularised logistic regression performed better than the generalised logistic regression, and the KNN proved the best in the overall methodology. The SVM is known for its accuracy (Meyer, Leisch & Hornik, 2003), which is confirmed by the findings of Übeyli (2007), where SVM showed a classification accuracy of 99.5% compared with several types of Artificial Neural Networks (ANN). All three algorithms work well on large data sets, but logistic regression and KNN are relatively simple to compute compared with SVM. SVM has many desirable traits such as accuracy, robustness and effectiveness (Golub et al., 1999; Wang & Huan, 2011), even in non-linearly separable problems. KNN is also robust to noisy training data (i.e. meaningless data that may result from data corruption or inaccurate recording) and very effective with large training data sets (Lavanya & Divya, 2017). Considering the literature reviewed, SVM and KNN appear well placed to contend with the performance of logistic regression.

2.8 Performance measures of the algorithms
A model evaluation procedure is needed to estimate how well a model generalises to a future sample, regardless of the choice of classifier, optimal tuning parameter or set of features. An evaluation metric is then needed to pair with the procedure so that model performance can be quantified. In classification, using the right performance metric to evaluate a learned classifier is fundamental to assessing its quality. One way of evaluating a model's performance is to base it on statistical significance or confidence intervals; another is to use a metric for the model's evaluation (Ferri, Hernández-Orallo & Modroiu, 2009). Evaluating a classifier depends on several factors, such as predictive accuracy, robustness, scalability and simplicity. There are diverse measures for evaluating a classifier's performance (Mulak & Talhar, 2013), but the ROC curve, sensitivity, specificity and error rate were used in this study. Sensitivity can be described as the percentage of the presence of the condition under study which is rightly identified by the classifier; for example, the percentage of food establishments with critical violations that are correctly identified as such. It is also called the true positive rate:

Sensitivity = number of true positives / total number of positives.  (2.13)

Specificity can be described as the percentage of the absence of the condition under study which is rightly identified by the classifier; for example, the percentage of food establishments without critical violations that are correctly identified as such. It is also known as the true negative rate.
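In code, sensitivity, specificity and the error rate (whose formulas follow) reduce to counts from the table of predicted versus true labels. A minimal sketch, with made-up labels and the error rate taken as the misclassified fraction:

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Sensitivity (eq. 2.13), specificity (eq. 2.14) and error rate."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    positives = sum(t == positive for t in y_true)
    negatives = len(y_true) - positives
    sensitivity = tp / positives          # true positive rate
    specificity = tn / negatives          # true negative rate
    error_rate = 1 - (tp + tn) / len(y_true)   # fraction misclassified
    return sensitivity, specificity, error_rate

# Toy example: 3 positives, 2 negatives
sens, spec, err = classification_metrics([1, 1, 1, 0, 0], [1, 0, 1, 0, 1])
print(sens, spec, err)   # 2/3, 1/2, 2/5
```

The same counts underlie the ROC curve, which traces sensitivity against 1 − specificity as the decision threshold varies.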
Specificity = number of true negatives / total number of negatives,  (2.14)

Error rate = (number of false positives + number of false negatives) / (total number of positives + total number of negatives),  (2.15)

where a true positive, as defined by Mulak and Talhar (2013), is a positive tuple correctly classified as positive, and a true negative is a negative tuple correctly classified as negative; false positives and false negatives are the corresponding misclassified tuples.

2.9 Summary

The focus of this chapter has been the reviewed literature on food establishments, K-Nearest Neighbour and Support Vector Machine (SVM). It gives insight into food safety, its related risk factors and approaches to dealing with them. Works by researchers on SVM and KNN are also reviewed, as well as comparisons between LR, SVM and KNN.

CHAPTER 3

METHODOLOGY

This chapter describes the various methods used to identify critical violations in a food establishment. It outlines the feature selection technique used for selecting the attributes that influence the response variable (presence of a critical violation or not). It also explains the criteria for the selection of the algorithms employed, namely logistic regression, K-Nearest Neighbour and Support Vector Machine (SVM). A detailed description of the three classifiers and the performance evaluation of the models are also given in this chapter.

3.1 Preprocessing the data

In the beginning stages of analysing the data, certain processes have to be followed in order to remove relatively unimportant aspects of the data that might affect the accuracy of the algorithms. These processes are collectively referred to as preprocessing the data.

3.1.1 Data formatting

Here, data formatting involves prepping the data before any further analysis. With the set objectives in mind, the data is subjected to formatting, which involves cleaning by treating the missing data (i.e.
removal or application of an imputation algorithm) and normalising it (i.e. rendering the data to the same range). In this study, missing data were removed, since the data set was large enough to accommodate the removal of the affected 0.006% of the whole dataset. The data used contained many variables on different measurement scales, so normalisation was done to put the various features on the same scale.

3.1.2 Feature Extraction

Feature extraction is the process of capturing a subset of features (attributes) from the actual feature set while maintaining interpretability and focus on the goal of the analysis. It is a dimensionality reduction technique that is vital in ensuring that only relevant features are included in the training process. Every feature carries some information, but features differ in their importance or usefulness (Nilsson, Peña, Björkegren & Tegnér, 2007). The goal of feature selection is to remove irrelevant features, i.e. to identify the relevant features in order to reduce the likelihood of over-fitting to noisy data and to decrease the computational time and space needed to run the algorithm. The importance of selecting every relevant feature is expounded further by Guyon and Elisseeff (2006). This study uses Principal Component Analysis (PCA) to reduce the number of features by extracting the relatively useful ones.

Principal Component Analysis (PCA)

Principal Component Analysis (PCA), also known as the "Hotelling transform", transforms the features into a new set of features, called principal components, which are linear combinations of the original features. PCA is an orthogonal linear transformation that maps the data to a new coordinate system such that the greatest variance under some projection of the data lies on the first principal component.
The first principal component explains the greatest variability in the data, and each subsequent component, from the second to the last, explains a share of the remaining variability. Only the first few components are usually considered and interpreted, because of the amount of variance they explain. By construction, the components are uncorrelated with one another. The following highlights the steps involved in PCA, assuming the data is n-dimensional:

• Subtract the mean from each data dimension, so that the data has zero mean.
• Calculate the covariance matrix; its dimension will be n × n.
• Calculate the eigenvectors and eigenvalues of the covariance matrix, as they convey useful information about the data.
• Select components and form the feature vector. This is the step where dimensionality reduction occurs.
• Form the new data set from the features retained.

The goal of PCA is to reduce the dimensionality of a large data set and also to detect new, important underlying features in order to explain a given phenomenon. Eigenanalysis is the technique used in PCA: the sums of squares and cross-products are used in determining the eigenvectors and eigenvalues of a square symmetric matrix. The relationship between the principal components, eigenvectors and eigenvalues is such that the eigenvector linked with the greatest eigenvalue has the same direction as the first principal component; it follows that the eigenvector linked with the second greatest eigenvalue gives the second principal component's direction. PCA conveys information not only about the patterns of variation in the features but also about the relationships that exist between them (Qi & Luo, 2015). The final output of a PCA presents components that have different degrees of correlation with the observed features.
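The steps above can be sketched for the two-dimensional case, where the eigenpairs of the 2 × 2 covariance matrix have a closed form (an illustrative sketch, not the implementation used in this study):

```python
import math

def pca_2d(points):
    """PCA for 2-D data: centre, covariance, then eigenpairs of the 2x2 matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centred = [(x - mx, y - my) for x, y in points]
    # Covariance matrix entries (dividing by n - 1)
    sxx = sum(x * x for x, _ in centred) / (n - 1)
    syy = sum(y * y for _, y in centred) / (n - 1)
    sxy = sum(x * y for x, y in centred) / (n - 1)
    # Eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]]
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    disc = math.sqrt(tr * tr / 4 - det)
    lam1, lam2 = tr / 2 + disc, tr / 2 - disc   # lam1 >= lam2
    # Eigenvector of the largest eigenvalue gives the first principal axis
    if abs(sxy) > 1e-12:
        v = (sxy, lam1 - sxx)
    else:
        v = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(*v)
    return (lam1, lam2), (v[0] / norm, v[1] / norm)

# Points lying exactly on the line y = x: all variance on one component
evals, axis = pca_2d([(0, 0), (1, 1), (2, 2), (3, 3)])
print(evals, axis)   # second eigenvalue ~ 0, axis ~ (0.707, 0.707)
```

For higher dimensions the same covariance-then-eigendecomposition steps apply, but a numerical linear algebra routine is used instead of the closed form.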
Jolliffe (2011) provides further explanation of Principal Component Analysis.

3.2 Logistic regression

Logistic regression is a special type of regression used in predictive analysis, where the probability of a dichotomous outcome is modelled from one or more predictors (numerical or categorical) by means of a logistic function. The model estimates the probability that an event occurs for any randomly selected observation against the probability that the event does not occur, and is hence very useful in classification. It explains the relationship between a response (dependent) variable and one or more explanatory (independent) variables. Depending on the response variable, logistic regression may be binomial or multinomial. With binomial logistic regression the observed outcome can take on only two categories, a typical example being "yes" or "no"; in multinomial logistic regression the observed outcome has more than two possible categories. Logistic regression is an example of a Generalised Linear Model (GLM), a broad class of models that includes linear regression, ANOVA, Poisson regression, etc. The GLM is built on the exponential family of distributions.

3.2.1 The Logistic Regression Model

Consider data of n independent observations y_1, y_2, ..., y_n and treat the ith observation as a realisation of a random variable Y_i. Assume that Y_i has a Bernoulli distribution with parameter θ, where θ = P(X = 1). The probability function (p.d.f.) of the Bernoulli distribution is of the form

p(x|θ) = θ^x (1 − θ)^(1−x) for x = 0, 1, with 0 < θ < 1, and p(x|θ) = 0 otherwise.

In exponential form we express the pdf as

p(x|θ) = exp{ log[θ^x (1 − θ)^(1−x)] } = exp{ x log θ + (1 − x) log(1 − θ) } = exp{ x log[θ/(1 − θ)] + log(1 − θ) }.  (3.1)

The general form of the single-parameter exponential family of distributions is

g_X(x|ϑ) = g(x) exp(ϑ Q(x) − B(ϑ)).  (3.2)

Comparing equations (3.1) and (3.2) gives ϑ = log[θ/(1 − θ)], Q(x) = x, B(ϑ) = −log(1 − θ) and g(x) = 1. Setting the natural parameter ϑ equal to the linear predictor X_i β and rearranging gives

θ(x_i) = 1 / (1 + e^(−X_i β)),  (3.3)

which is the logistic regression model of equation (3.3). In order to fit a logistic regression, certain assumptions have to be met; these are outlined in the next subsection.

3.2.2 Assumptions of Logistic Regression

The assumptions of the logistic regression model are as follows:
(i) The dependent variable Y_i is binomially distributed; there is no need for Y_i to be normally distributed, but it is assumed to follow a distribution from the exponential family.
(ii) The cases Y_1, Y_2, ..., Y_n are independently distributed.
(iii) A linear relationship is assumed between the logit of the dependent variable and the independent variables: logit(θ) = β_0 + β_i X_i.
(iv) The errors need not be normally distributed, but must be independent.
(v) The assumption of homoscedasticity is not a necessity.
(vi) Maximum likelihood estimation (MLE) is preferred to ordinary least squares (OLS) for parameter estimation, and it rests on large-sample approximations.
(vii) It usually requires a large sample of data.

3.2.3 Odds and Odds Ratio

The odds of a dependent variable can simply be described as the ratio of the probability of an event happening to the probability of the event not happening. Thus,

Odds = P(event happening) / P(event not happening) = θ / (1 − θ),

where P(Y = 1) = θ is the probability of the event occurring and P(Y = 0) = 1 − θ. The ratio of two odds is defined as the Odds Ratio (OR).
Writing the odds of one event (or group) as θ_0/(1 − θ_0) and of another as θ_1/(1 − θ_1), the odds ratio is

OR = [θ_0/(1 − θ_0)] / [θ_1/(1 − θ_1)].

For a simple logistic regression model with one independent variable, the odds ratio can be expressed mathematically as

OR = odds(x + 1)/odds(x) = {θ(x + 1)/[1 − θ(x + 1)]} / {θ(x)/[1 − θ(x)]} = exp[β_0 + β_1(x + 1)] / exp[β_0 + β_1 x] = exp(β_1).  (3.4)

This exponential relationship means that the odds that a particular characteristic is present are multiplied by exp(β_1) for every unit increase in X. An odds ratio equal to one means the odds do not change with X. Thus:

• the odds increase if β_1 > 0, since then exp(β_1) > 1;
• the odds decrease if β_1 < 0, since then exp(β_1) < 1.

3.2.4 Parameter estimation of logistic regression coefficients

The estimation method used in this research is maximum likelihood estimation, a standard procedure for obtaining estimators of unknown parameters from a set of data. First, the likelihood function has to be established; it can be described as the probability of observing the actual data, conditioned on the values of the parameter. The maximum likelihood estimate is the value of the parameter that maximises the likelihood function over the entire parameter space; thus, it is the parameter value that is most likely in the light of what has been observed. Estimates from MLE are known to have desirable properties such as efficiency, consistency, invariance and asymptotic normality. The maximum likelihood estimator is preferred because it utilises all the information about the parameters contained in the data and is comparatively highly flexible (Denuit et al., 2007). Assume a probability distribution is defined by a parameter α.
If the likelihood function L(α) of observations Z_j drawn from the distribution with probability density f(z) is to be constructed, the likelihood can be defined as the product of the individual likelihoods over the observations. Let the data be a vector Z = (Z_1, Z_2, ..., Z_n) with parameter vector α = (α_1, α_2, ..., α_p) defined on a multi-dimensional parameter space, from a population with pdf f(z; α_1, ..., α_p). The likelihood is given by

L(Z_1, Z_2, ..., Z_n | α_1, ..., α_p) = L(Z|α) = ∏_{i=1}^{n} f(Z_i; α_1, ..., α_p).  (3.5)

For a binary logistic regression model, the likelihood function L(α) can be expressed as

L(α) = ∏_{i=1}^{n} α_i^{y_i} (1 − α_i)^{1−y_i}.

The maximum likelihood estimate is the value of α that maximises the likelihood function; however, it is more expedient to work with the log-likelihood rather than the likelihood (Geyer, 2003). The log-likelihood is

ln L(Z|α) = ln[ ∏_{i=1}^{n} f(Z_i; α) ] = ∑_{i=1}^{n} ln f(Z_i; α).  (3.6)

For the logistic model,

ℓ(α) = ln L(Z|α) = ∑_{i=1}^{n} { y_i ln α_i + (1 − y_i) ln(1 − α_i) }.  (3.7)

The MLE is obtained by differentiating ln L(Z|α) with respect to α and equating to zero:

∂ℓ(α)/∂α |_{α=α̂} = 0.  (3.8)

3.2.5 Testing the Goodness-of-Fit

A goodness-of-fit test compares the observed sample distribution with the expected probability distribution. It involves assessing a random sample from an unknown distribution to test the null hypothesis that the unknown distribution function is actually a known distribution. The procedure is to state a hypothesis, calculate the test statistic and then compute the probability of finding data with a greater value of this test statistic than the observed value; if the hypothesis is true, this probability is the p-value of the test. Some of the techniques employed in assessing the model fit are outlined below.
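Before turning to the fit statistics, the log-likelihood of equation (3.7) can be checked numerically. For an intercept-only model, α_i = α for all i, and the maximiser should coincide with the sample mean (a grid-search sketch on toy data, for illustration only):

```python
import math

def log_likelihood(alpha, y):
    """Bernoulli log-likelihood of eq. (3.7) with a common success probability."""
    return sum(yi * math.log(alpha) + (1 - yi) * math.log(1 - alpha) for yi in y)

# Grid-maximise over (0, 1); the MLE for this model is the sample mean
y = [1, 1, 0, 1, 0, 1, 1, 0]
grid = [i / 1000 for i in range(1, 1000)]
alpha_hat = max(grid, key=lambda a: log_likelihood(a, y))
print(alpha_hat)   # 0.625 = mean(y)
```

In the full logistic regression, the same log-likelihood is maximised over the coefficient vector β, usually by an iterative method such as Newton-Raphson rather than a grid.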
Deviance and likelihood ratio tests

Several software algorithms employ the deviance rather than the log-likelihood function as the basis of convergence when using GLM to estimate logistic models (Lovric, 2011). The deviance measures the lack of fit to the data, and it is computed by comparing a given model with a saturated model:

D = −2 ln( likelihood of the fitted model / likelihood of the saturated model ).  (3.9)

Using the log-likelihood in equation (3.7), the deviance for a logistic regression model can be stated as

D = 2 ∑_{i=1}^{n} { y_i ln(y_i/α_i) + (1 − y_i) ln[(1 − y_i)/(1 − α_i)] }.  (3.10)

The quantity inside the logarithm of equation (3.9) is known as the likelihood ratio, and a test of this nature is termed a likelihood ratio test. A saturated model is a model with a theoretically perfect fit; smaller values of D indicate a better fit, since the deviation from the saturated model is small. Assessed against a chi-square distribution, a non-significant chi-square value indicates a good model fit, suggesting that much of the variance is explained and, importantly, that little remains unexplained. The value of D with and without the explanatory variables must be compared in order to assess the significance of the predictors, so the model deviance is subtracted from the null deviance:

D_null − D_model = [−2 ln L(α)_null − (−2 ln L(α)_saturated)] − [−2 ln L(α)_fitted − (−2 ln L(α)_saturated)]
= −2[ ln L(α)_null − ln L(α)_fitted ]
= −2 ln( likelihood of the null model / likelihood of the fitted model ).

To assess the significance of an individual predictor, the likelihood ratio test or the Wald statistic is preferably used. The deviance is central to the likelihood ratio test, since the test assesses the significance of the difference between the likelihoods of the fitted model and the null (reduced) model. Consider the null hypothesis H_0 : β_j = 0,
j = 1, 2, ..., p. The statistic for the likelihood ratio test is expressed as

−2 ln( L_null / L_fitted ) = −2[ ln(L_null) − ln(L_fitted) ].

With this log transformation of the likelihood functions, a chi-square statistic is obtained. Another statistic that can be used is the Wald statistic. The Wald test compares the maximum likelihood estimate of the slope parameter, β̂_j, with an estimate of its standard error. Under the null hypothesis H_0 : β_j = 0, j = 1, 2, ..., p, the resulting ratio follows a standard normal distribution. The Wald statistic (W_j) is expressed as

W_j = β̂_j / SE(β̂_j) ∼ N(0, 1).  (3.11)

The Wald statistic, however, has some limitations: for a large coefficient the standard error becomes inflated, thereby reducing the value of the Wald statistic (Menard, 1995), and it is also inclined towards bias with sparse data. Generally, the likelihood ratio test is preferred over the Wald test.

McFadden's pseudo-R squared

A different approach to assessing the effectiveness of a regression model is to measure the strength of the relationship between the independent variable(s) and the outcome. McFadden's pseudo-R squared is one of many versions founded on the log-likelihoods of the null model and the full estimated model; other versions that can be used are Hosmer and Lemeshow's R², Nagelkerke's R², and Cox and Snell's R². McFadden's R-squared measure is defined as

R²_McFadden = 1 − ln(L_full) / ln(L_null),  (3.12)

where L_full denotes the likelihood of the current fitted model and L_null the likelihood of the null model. The value obtained ranges from 0 to 1. Statistics of this type can be suggestive on their own, but are most useful when comparing competing models for the same data.

3.2.6 Confidence Interval Estimation

A confidence interval shows the accuracy with which a sample statistic estimates a population parameter, given the random sample size N and the significance level α.
Usually the confidence interval for the slope is built from the Wald statistic. A 100(1 − α)% two-sided confidence interval for β_1 is

β̂_1 ± z_{1−α/2} SE(β̂_1),  (3.13)

where SE(β̂_1) represents the model-based estimate of the standard error of the estimator and z_{1−α/2} is the upper 100(1 − α/2)% point of the standard normal distribution.

3.3 Support Vector Machine

The Support Vector Machine (SVM) (Vapnik, 1998) is an effective method that has a solid theoretical foundation and the ability to learn automatically from data and improve with experience (i.e. machine learning). SVM identifies a maximum-margin function that divides a large set of observations into two categories, where every observation is a point in a multidimensional space of feature measurements. It is known for its high prediction accuracy, which results from learning from the training set (Meyer, Leisch & Hornik, 2003) to produce an optimal hyperplane that greatly simplifies classification and regression problems. High robustness and good generalisation with a small number of samples are among its admirable features: SVM uses a minority of the observations, the support vectors, together with the complexity and learning ability of the model, to define the final optimal hyperplane. There are two problems to deal with when using SVM: how to select the optimal number of features and how to set the best kernel parameters. The two problems influence each other (Frohlich and Chapelle, 2003); hence finding the optimal number of features and the kernel parameters should occur simultaneously. Selecting a subset of features from a database is significant in SVM, as it extracts the important information from the available data set and reduces the computation time.
As stated by Yang and Honavar (1998), the number of features affects not only a learned classifier's accuracy but also the time needed to learn the classification function, the cost associated with the features and the number of examples needed for learning. A few algorithms have been proposed for SVM feature selection in the literature (Bradley et al., 1998; Bradley & Mangasarian, 1998; Weston et al., 2001; Guyon et al., 2002; Mao, 2004).

Suppose a training set {(xi, yi)}, i = 1, ..., n, where every xi represents a training element and yi ∈ {+1, −1} the matching class label. The goal of the SVM problem is to find a hyperplane that divides the two categories of points with the largest separation margin. The foundation of this technique is mapping the input vectors onto a high-dimensional feature space using a non-linear transformation function. Since exact separation between the two categories is extremely difficult, error-allowance variables known as slack variables ξi are introduced when classifying data that are difficult to separate linearly (Vapnik, 1995). The separating surface ω · xi + b = 0 satisfies the constraints

yi(ω · xi + b) ≥ 1 − ξi,  i = 1, 2, ..., n   (3.14)

where ω is a weight vector and b is the classification threshold. If 0 < ξi < 1, xi is accurately classified; if ξi ≥ 1, xi is wrongly classified. The objective function is

φ(ω) = (1/2)‖ω‖² + C Σ_{i=1}^{n} ξi   (3.15)

where (1/2)‖ω‖² is the term to be minimised, C is a regularization parameter and C Σ_{i=1}^{n} ξi is a penalty function. Equation 3.15 can be solved through its dual, the convex quadratic programme below:

max Σ_{i=1}^{n} ai − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} ai aj yi yj (xi · xj),
subject to Σ_{i=1}^{n} ai yi = 0 and 0 ≤ ai ≤ C, i = 1, 2, ..., n   (3.16)

where ai is a Lagrange multiplier.
Assuming a* is the optimal solution, then

ω* = Σ_{i=1}^{n} a*_i yi xi   (3.17)

The optimal separating surface can thus be expressed as a linear combination of the support vectors. The optimal classification function is given as

f(x) = sgn[ Σ_{i=1}^{n} a*_i yi (xi · x) + b* ]   (3.18)

where b* is the classification threshold, subject to the constraint condition a*_i [yi(ω* · xi + b*) − 1] = 0.

In situations where the outcome and the predictors are not linearly related, enlarging the feature space with functions of the predictors, such as quadratic, cubic or even higher-order polynomial terms, is considered to address the non-linearity. This could lead to an immense number of features, which in turn makes computation unmanageable. SVM allows the support vector classifier to enlarge the feature space using kernels in a way that leads to efficient computations (James et al., 2013). Hence the importance of selecting a kernel function and a feature selection procedure.

3.3.1 Parameter selection

The selection of appropriate parameters is important in improving the classification accuracy of the SVM, and it is a requirement before training the model. The regularization parameter C and the parameters of the kernel function, such as γ in the Radial Basis Function (RBF) kernel, are among the parameters that should be optimized. Many approaches to parameter selection exist, including cross-validation, particle swarm optimization, the experience-choice method, the Bayesian method, the gradient descent method and Genetic Algorithm (GA) based methods. This study made use of cross-validation to find the best parameters. A parameter range should be identified before settling on the best parameters; if the range is too small, a large deviation from the optimal parameters will result.
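As an illustration of the classifier in Equation 3.18 (a sketch using scikit-learn's SVC on toy two-dimensional data, not the thesis's R/MATLAB setup), the fitted decision rule depends only on the support vectors, the points with nonzero multipliers a*_i:

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data: three points per class, linearly separable.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [1.0, 1.0], [0.9, 1.2], [1.2, 0.8]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# f(x) = sgn(sum_i a_i* y_i (x_i . x) + b*): only support vectors contribute.
print(clf.support_vectors_)                   # the x_i with a_i* > 0
print(clf.predict([[0.1, 0.1], [1.1, 1.0]]))  # one point per class
```

Points far from the margin receive zero Lagrange multipliers and can be removed without changing the fitted hyperplane.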
3.3.2 Proposed Procedure

The following illustrates the proposed procedure for the SVM classification algorithm in this study:

• Make the collected data available and transform it to be compatible with the SVM package.
• Perform basic scaling on the data.
• Select the kernel function to use. The RBF, linear and polynomial kernels were considered.
• Use cross-validation and a defined grid-search procedure to choose the SVM parameters: C and γ for the RBF kernel, C for the linear kernel, and C, d and γ for the polynomial kernel.
• Train the algorithm on the training set using the optimal parameters obtained.
• Evaluate on the test set.

3.3.3 Data Preprocessing

Categorical Features

For the data to be compatible with the SVM package, some transformations are required. First, each data case has to be denoted by a vector of real numbers, which means any categorical attribute has to be converted into numeric data. Hsu et al. (2003) suggest using p numbers to denote an attribute with p categories. For instance, a three-category attribute such as {amateur, medium, difficult} can be denoted by (1,0,0), (0,1,0) and (0,0,1). When the number of categories in an attribute is not too large, this type of coding is suggested, since it may bring more stability than using a single number.

Scaling

Scaling is a prerequisite for applying the SVM. It brings all numeric features onto the same range, so that no feature dominates the others, and it also simplifies the numerical calculations. This is important since the kernel values hinge on the inner products of the feature vectors. Every attribute was scaled linearly to the range [0,1] in both the training and the testing data.
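The two preprocessing steps just described can be sketched in plain Python/NumPy; the category names and numeric values below are illustrative, not taken from the thesis data:

```python
import numpy as np

def one_hot(values, categories):
    """Encode a p-category attribute as p indicator numbers (Hsu et al., 2003)."""
    return np.array([[1 if v == c else 0 for c in categories] for v in values])

def scale_01(column):
    """Linearly scale a numeric attribute to the range [0, 1]."""
    lo, hi = column.min(), column.max()
    return (column - lo) / (hi - lo)

levels = ["amateur", "medium", "difficult"]
print(one_hot(["medium", "amateur"], levels))  # [[0 1 0], [1 0 0]]

x = np.array([2.0, 4.0, 10.0])
print(scale_01(x))  # [0.  0.25  1.]
```

In practice the minima and maxima computed on the training set are reused to scale the test set, so that both sets share the same linear transformation.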
3.3.4 Model Selection

Conventionally, SVM uses a kernel, a mathematical function that takes the data as input and transforms it into the required form. This study used the linear kernel and the RBF kernel. Here C, the regularisation parameter (cost parameter), measures the trade-off between maximising the width of the margin and minimising the errors (Eitrich & Lang, 2006).

The RBF kernel is a popular choice in model selection as it can accommodate data sets that are not linearly separable: it maps nonlinear samples into a higher-dimensional space. The RBF kernel involves only two parameters, C (regularisation parameter) and γ (kernel parameter). These two parameters are vital to the SVM's performance, since inappropriate choices can lead to under-fitting or over-fitting. A suggested approach to deal with this is a grid search with cross-validation (Hsu, Chang & Lin, 2003). The objective is to identify the best choice of both C and γ so that the model generalises to new data. Other kernels can behave like the RBF kernel: the sigmoid kernel does so for particular parameter values (Lin and Lin, 2003), and the linear kernel with a penalty parameter C behaves like the RBF kernel (with some C, γ) in special cases (Keerthi and Lin, 2003). Apart from its popularity, the RBF kernel has fewer hyperparameters influencing the complexity of model selection than the polynomial kernel, and it presents fewer numerical difficulties, which informed its selection: the values of a polynomial kernel can approach infinity, whereas RBF kernel values lie between 0 and 1.

Since logistic regression was used by Schenk et al. (2014), a linear kernel was also used. The linear kernel has no parameters to tune besides C, which makes it relatively flexible.
It performs best when the data are linearly separable, takes relatively less computational time to train than the RBF kernel, and is less likely to lead to overfitting. Notwithstanding these facts, the polynomial kernel was included purposely for comparison with the other kernels. The choice of kernel ultimately influences the accuracy of the SVM classifier (Asraf, Nooritawati & Rizam, 2012).

3.3.5 Cross-validation and Grid-search

The kernel parameters are unknown a priori, and selecting the optimal parameters demands some form of model selection. In order to predict the testing data accurately, the optimal C, γ and d have to be identified. Before a classifier can be said to have a high or low prediction accuracy, the data are divided into training and test sets; the classifier is trained on the training data and then applied to the test data. The capability of any classifier is rated by how accurately it predicts the test data, hence the use of cross-validation. Cross-validation measures how well the classifier generalises to unseen data by dividing the data into equal subsets. In a typical v-fold cross-validation, the training set is partitioned into v equal subsets. The classifier is trained on v − 1 subsets and tested on the remaining subset. In this way, every case in the entire training set is predicted once, and the cross-validation accuracy is the proportion of the data set that is correctly classified. The cross-validation approach is used in order to avoid overfitting. A 10-fold cross-validation was used in this study.
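The 10-fold cross-validation combined with a defined grid search can be sketched with scikit-learn (the thesis used R's caret and MATLAB); the data below are synthetic, while the C and γ values are the defined grid this study reports for the RBF kernel:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((200, 10))                # 10 features already scaled to [0, 1]
y = (X[:, 0] + X[:, 1] > 1).astype(int)  # synthetic binary label

param_grid = {"C": [0.25, 0.5, 0.75, 1, 1.25],  # defined grid for C
              "gamma": [0.01, 0.015, 0.2]}      # defined grid for the RBF gamma

# Every (C, gamma) pair is scored by 10-fold cross-validated accuracy.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10).fit(X, y)
print(search.best_params_)
```

The pair with the highest cross-validation accuracy is then used to retrain the classifier on the full training set.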
This method prevents the classifier from over-fitting. In this setting the training data are randomly divided into 10 subsets of equal size. Out of the ten (10) subsets, one (1) is held out to test the model and the 9 remaining subsets are used to train it. This process of holding out one subset and training the model on the remaining 9 is repeated so that each of the 10 subsets is used exactly once as the test (validation) data.

A defined grid search was applied in selecting the optimal C, d and γ using cross-validation. The grid search tries all combinations of C, d and γ values, and the combination with the best cross-validation accuracy is selected. Hsu et al. (2003) suggest, as a practical way of determining appropriate parameters, trying exponentially growing sequences of C and γ (for instance, C = 2^−6, 2^−4, 2^−2, ..., 2^12 and γ = 2^−12, 2^−10, 2^−8, ..., 2^4). However, this study used a defined grid, informed by the default tuning parameters in software packages such as MATLAB. The defined grid used the values 0.25, 0.5, 0.75, 1, 1.25 for C in all the kernels and 0.01, 0.015, 0.2 for γ in the RBF kernel. The γ and d of the polynomial kernel were defined by the values 0.01, 0.015, 0.2 and 1, 2, 3 respectively.

3.4 K-Nearest Neighbour

The nearest neighbour algorithm finds the class of an unlabelled data point based on its nearest neighbours, whose classes are defined beforehand. It has been used extensively in pattern recognition (Vaidehi, 2008; Xu & Wu, 2008), ranking models (Geng, 2008), event recognition (Yang, 2000) and text categorization (Elnahrawy, 2002). The algorithm is non-parametric in nature and is usually used for regression and classification. It classifies cases based on their similarity.
In machine learning, this method was developed to identify patterns in data without requiring an exact match to any stored case or pattern, and it is considered by many to be the simplest of all machine learning algorithms. It is an effective method that is relatively easy to implement (Bhatia, 2010). In simple terms, similar or matching cases lie near each other and dissimilar or mismatching cases lie far from each other; the similar cases are termed "neighbours". The distance between two cases measures their dissimilarity. When a new case is presented, the distances between the existing cases and the new case are computed, the nearest cases are tallied by class, and the new case is given the class with the highest number of nearest neighbours. The number of nearest neighbours, K, is specified beforehand after some consideration; K can be termed a user-defined constant.

3.4.1 K-value selection

Selecting a desirable K rests entirely on the data. A larger K greatly reduces noise in classification but fails to set clear boundaries among the classes, while a small K leads to a large variance in prediction. Therefore K should be set at a point where it is large enough to reduce misclassification and small enough to keep the K nearest cases close to the new case. The goal is to classify or estimate a new case contingent on the nearest neighbours around it. If a green point is a new case to be classified, consider selection boundaries (a) and (b) drawn with K = 3 and K = 5 respectively. In the first case, with K = 3, the new case is said to belong to class 1, since the selected boundary contains more class 1 nearest neighbours. Similarly, in (b) the new case is said to belong to class 2, since the selected boundary contains more class 2 nearest neighbours.
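One common way to resolve this trade-off in practice is to compare cross-validated accuracy over a range of K values; a sketch with scikit-learn on synthetic data (this study considered K from 1 to 50; the data here are not the inspection features):

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.random((300, 10))        # features already normalised to [0, 1]
y = (X[:, 0] > 0.5).astype(int)  # synthetic binary label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Majority-vote classification with Euclidean distance, K = 1..50,
# each K scored by 10-fold cross-validated accuracy on the training set.
cv_acc = {k: cross_val_score(
              KNeighborsClassifier(n_neighbors=k, metric="euclidean"),
              X_tr, y_tr, cv=10).mean()
          for k in range(1, 51)}
best_k = max(cv_acc, key=cv_acc.get)

# Evaluate the chosen model on the held-out test set.
acc = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr).score(X_te, y_te)
print(best_k, round(acc, 3))
```

The K with the highest cross-validation accuracy balances the noise reduction of large K against the blurred class boundaries it produces.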
3.4.2 Training and testing of the K-NN classifier

In classification the dependent variable is categorical, so any point introduced is classified by majority vote, and the choice of the K value greatly affects the quality of prediction. The K-Nearest Neighbours algorithm is mostly associated with regression and classification; in both cases the input consists of the closest training examples in the feature space. In this study, the food inspection data are partitioned into two sets, training and test. The K-NN classifier is set to classify the data as having a critical violation or not by learning from the labelled training data. The cross-validation technique is then applied to check the accuracy of the K-NN classifier.

3.4.3 Steps

The following steps describe the use of the KNN classifier.

1. Preparing the data, which involves:
• Normalising the data, i.e. adjusting features such that inferences drawn are not distorted by variables with wide ranges.
• Dividing the data into training and testing sets in order to assess the performance of the algorithm.

2. Storing the class labels and training samples after preprocessing.

3. Classifying by majority vote. Any new case/instance introduced follows these steps:
• The distances between the test sample and all training samples are calculated using a distance metric (Euclidean distance was used in this study).
• With a pre-defined value of K ranging from 1 to 50, the class of the new vector is determined.
• The new case is assigned the most frequent class among its K nearest samples.

4. Assessing model performance.

Figure 3.1: Framework for classifying food establishments

Figure 3.1 describes the processes involved in classifying restaurants using the logistic regression, SVM and KNN algorithms.
After obtaining the data, it is subjected to preprocessing, which involves formatting, cleaning, scaling and feature extraction. The data are then divided into a training set and a test set with 16697 and 1636 observations (restaurants) respectively. Each classifier learns from the training set and is subsequently evaluated on the test set.

3.5 Criteria for selection of algorithms

There are many supervised learning algorithms, such as the Support Vector Machine (SVM), Logistic Regression (LR), Naïve Bayes, K-Nearest Neighbour (KNN), random forests, decision trees, etc. A supervised learning algorithm focuses on building a model able to predict the response values for a new data set. The main challenge posed in supervised learning is the selection of the appropriate data mining algorithm for classification. When considering supervised learning algorithms, factors such as linearity, accuracy, training time, number of features, complexity and interpretability of the algorithm are the usual considerations, but a lot also lies in the data. One underlying criterion vital for the selection of an algorithm is the characteristics of the training data set, such as its size, quality and nature.

The LR, SVM and KNN considered in this study are all examples of data mining algorithms. The name data mining connotes sifting through large data sets to make meaningful inferences. These predictive data mining algorithms have the capacity to handle large and even noisy data, as a study by Lavanya et al. (2017) showed their strength in big data analysis. The data used in this study are considerably large, with 28 features and 18333 observations, and are obtained from the City of Chicago data repository, which is of high quality. The nature of the data also makes it possible to use these algorithms, since the data suit a classification problem.
Also, with the study aiming to prioritise inspections based on a validated set of models, one vital characteristic that cannot be overlooked is the predictive power of the algorithm. Several studies show that SVM and KNN have high predictive power (Meyer et al., 2003; Wang & Huan, 2011; Alkhatib, Najadat, Hmeidi & Shatnawi, 2013). This informed the decision to use SVM and KNN.

3.6 Performance evaluation of the models

Evaluating a classifier depends on several factors, such as predictive accuracy, robustness, scalability and simplicity, and there are diverse measures to evaluate a classifier's performance (Mulak & Talhar, 2013). This study considers the ROC curve, specificity, sensitivity and error rates to assess the performance of the classifiers.

3.6.1 Receiver Operating Characteristic (ROC) curve

The ROC curve is a visual display of a diagnostic test's evaluation measures that plots the sensitivity (true positive rate) against 1 − specificity (false positive rate). Every point on the ROC curve denotes a sensitivity/specificity pair corresponding to a specific discrimination threshold. The area under the curve is considered when one wants to determine how well a particular parameter can discriminate between two groups: it quantifies how accurately the model classifies members. A test is described as having perfect discrimination (100% specificity and 100% sensitivity) if the ROC curve passes through the upper left corner of the plot.
Figure 3.2: ROC curve

sensitivity = number of true positives / total number of positives   (3.19)

specificity = number of true negatives / total number of negatives   (3.20)

error rate = (number of false positives + number of false negatives) / (total number of positives + total number of negatives)   (3.21)

where a true positive, as defined by Mulak and Talhar (2013), is a positive tuple that is correctly classified as positive, and a true negative is a negative tuple that is correctly classified as negative. Using a simple 2 × 2 table (confusion matrix) to illustrate further:

Table 3.1: Confusion Matrix

            Event   No event
Event         Q        R
No event      S        T

Sensitivity = Q / (Q + S)   (3.22)

Specificity = T / (R + T)   (3.23)

The study also makes use of the classification accuracy (Acc_cl), which is the proportion of samples that are correctly classified (i.e. the sum of true positives and true negatives). A good model must be able to accurately classify every member of a data set. It is evaluated by the formula

Acc_cl = (c / m) × 100   (3.24)

where c is the number of correctly classified samples and m is the total number of samples.

3.7 Summary

The focus of this chapter is on the supervised learning algorithms used (i.e. logistic regression, K-Nearest Neighbour and the Support Vector Machine (SVM)). It gives insight into how the various parameters are selected and estimated, and also sets out some definitions used. The criteria for selection of the supervised learning algorithms are also spelt out. The data considered satisfy the requirements of all the classifiers, which is paramount in model building.
CHAPTER 4

DATA ANALYSIS AND DISCUSSIONS

This chapter presents the various stages of analysing the data and the explanation of the findings of the study, supporting them with relevant literature. It focuses on using the supervised learning algorithms, logistic regression, support vector machine and K-Nearest Neighbour, under varying parameter settings to detect critical violations at food establishments.

4.1 Data collection and description

This study used secondary data from the City of Chicago data portal. The desirable features considered in selecting the data have much to do with its availability, quality and accessibility, coupled with the dearth of literature on the application of SVM and KNN to such data. Several sources of data and attributes were gathered and used in the development of the models. The data sets contained information on food inspections, business licenses, detailed crime data, 311 complaints from patrons of food establishments (such as sanitation and garbage complaints) and, finally, weather data in Chicago from 2011 to 2014. Similar data were used in Schenk et al. (2014), which detailed the extracted variables. The binary response variable is whether or not a critical violation is present. The independent variables are as follows:

Table 4.1: Description of variables

No.  Feature           Description of feature
1    timeSinceLast     time since last inspection
2    pastCritical      history of previous risk level "critical" associated with each food establishment
3    pastSerious       history of previous risk level "serious" associated with each food establishment
4    criticalCount     number of risk level type "critical" at the food establishment on inspection
5    seriousCount      number of risk level type "serious" at the food establishment on inspection
6    ageAtInspection   business license age at the time of inspection
7    BlueInsp          feature identifier for the first sanitarian cluster (Blue)
8    BrownInsp         feature identifier for the second sanitarian cluster (Brown)
9    GreenInsp         feature identifier for the third sanitarian cluster (Green)
10   OrangeInsp        feature identifier for the fourth sanitarian cluster (Orange)
11   PurpleInsp        feature identifier for the fifth sanitarian cluster (Purple)
12   YellowInsp        feature identifier for the sixth sanitarian cluster (Yellow)
13   license_insp      presence of license for consumption
14   package_goods     sale of packaged goods
15   pastFail          previous failure at inspection
16   tobacco_sale      licensed for sale of tobacco over the counter
17   PubAmusement      presence of public place of amusement
18   burglary          recent burglaries
19   sanitation        recent sanitation complaints
20   garbage           recent garbage cart requests
21   precipIntensity   average expected intensity of precipitation
22   regBisLic         regulated business license
23   filling_station   nearness to a filling station
24   catLiqLic         caterer's liquor license
25   mobFoodLic        mobile food license
26   temperatureMax    daily high temperature on the day of inspection
27   windSpeed         wind speed on the day of inspection
28   humidity          humidity on the day of inspection

The data considered only regular canvass inspections and inspections resulting from a complaint. An observation in the model represents a regular inspection. A food establishment is classified as risk one, two or three based on its expected food handling practices.
A food establishment is assigned the "risk one (High)" category if it handles the preparation of food and ingredients directly, or the heating and cooling of food. A food establishment is assigned the "risk three (Low)" category if it deals in already packaged and non-perishable foods. "Risk two (Medium)" suggests the food establishment engages in both, i.e. already packaged goods and the direct handling of food and ingredients. The number of times inspections are directed at a particular food establishment in a year is determined by the establishment's food handling practices: risk one establishments are visited twice a year by food inspectors, risk two facilities are checked at least once a year and risk three facilities once every other year. The data also identified the individual sanitarian inspectors, who were grouped into clusters, with each cluster assigned a colour-coded name in order to hide their identities.

Schenk et al. (2014) used data from January 2011 to January 2014 as the training data set and evaluated the classifier on data from September 2014 to October 2014. Sufficient time was allowed between the training data set and the test data set (evaluation set) in order to decrease the likely correlation between the two periods. In order to make a case for comparing the three classifiers with the work of Schenk et al. (2014), similar data had to be used. The available data only made it possible to use data from September 2011 to March 2014 as the training set and September 2014 to October 2014 as the test set.

4.2 Research Design

The study followed these steps in designing this research:

1. The raw data set was preprocessed, which involved transforming it to the desired format by filtering, scaling and removing missing data. It was then divided into training and test data sets.

2. Principal Component Analysis (PCA) was first used to extract the relatively relevant features.

3. The SVM kernel functions, namely the linear, Radial Basis Function (RBF) and polynomial kernels, were employed using a defined range of tuning parameters.

4. Different K values were used in the K-Nearest Neighbour algorithm to determine the optimal model.

5. A 10-fold cross-validation was used to assess the reliability of the outcomes of the models, where, repeatedly, nine folds are used for training and the remaining fold for testing, such that each run uses a different fold.

6. All the models were evaluated on the test data set and compared based on their classification accuracy.

4.3 Preliminary analysis

This section focuses on the preparation of the data for model building. In order to feed the correct data to the algorithms (LR, SVM and KNN), the data have to be prepared adequately for correct analysis. The data first have to be in the right format, so that only relevant attributes are included, and this is achieved by preprocessing. In preprocessing, the data were formatted into the right shape and cleansed by removing missing data (i.e. instances where the data are incomplete). The final data used in the models comprised 16697 inspections from September 2, 2011 to March 31, 2014 (two and a half years) as the training set and 1636 inspections from September 2, 2014 to October 31, 2014 as the test set. The division of the data in this format follows Schenk et al. (2014). The data were normalised so that the ranges of the explanatory variables would have the same scale (Quackenbush, 2002). The data were then analysed to select the best features relevant to the study, as explained in the next subsection.

4.3.1 Extraction of features

In the preliminary analysis of the data, principal component analysis was used to select the best features relevant to the study. The results of the principal component analysis led to two principal components being selected.
The scree plot showed a significant reduction in eigenvalue, levelling off at the third component, which is generally regarded as the criterion for identifying the number of components to interpret (see appendix 5.1 for the scree plot). Also, only two components had eigenvalues greater than 1, hence only two components were retained. The table below shows the correlations between the principal components and the actual features.

Table 4.2: Rotated Component Matrix

Feature            Component 1   Component 2
pastSerious            0.841
pastFail               0.813
timeSinceLast         -0.608
pastCritical           0.547
ageAtInspection
license_insp
humidity
windSpeed
temperatureMax
PubAmusement
OrangeInsp
BrownInsp
mobFoodLic
sanitation                           0.638
burglary                             0.632
garbage                              0.623
tobacco_sale                         0.452
BlueInsp                             0.357
seriousCount                         0.352
package_goods
filling_station
GreenInsp
criticalCount
YellowInsp
catLiqLic
regBisLic
PurpleInsp
precipIntensity

Extraction Method: PCA

Table 4.2 shows only loadings greater than 0.3 in absolute value (Samuels, 2016; Field, 2013). These values are the farthest from zero in either direction and represent the features that are strongly correlated with the principal components. Four (4) of the original features correlated highly with the first principal component, while the second principal component correlated with six (6) of the original features. This showed 10 of the 28 features to be relatively important and worth retaining. The 10 variables retained also had relatively high communalities. The communalities indicate the effect on each observed feature of all the factors related to it. In addition, the variables that were eliminated had the lowest communalities, or amount of variance explained, compared to the rest (see appendix, table 5.2).
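The retention rule used above, keeping components whose eigenvalues exceed 1 (the Kaiser criterion), can be sketched with NumPy on synthetic data (not the inspection features):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((500, 8))
X[:, 1] = X[:, 0] + 0.1 * rng.random(500)  # make features 0, 1, 2 highly correlated
X[:, 2] = X[:, 0] + 0.1 * rng.random(500)

# Eigenvalues of the correlation matrix, largest first (the scree-plot values).
eigvals = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]

# Kaiser criterion: retain components with eigenvalue greater than 1.
retained = int((eigvals > 1).sum())
print(eigvals.round(2))
print(retained)
```

Because the three correlated columns share most of their variance, one large eigenvalue absorbs them, while the independent columns yield eigenvalues near 1, mirroring the sharp drop seen on a scree plot.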
The table below displays the relevant features retained from the pool of available features:

Table 4.3: Features used in the model

Name of Feature   Description of feature
BlueInsp          feature identifier for the first sanitarian cluster
pastFail          presence of a previous record of failures
seriousCount      number of serious violations
pastCritical      presence of a critical violation upon the last visit
pastSerious       presence of a serious violation upon the last visit
timeSinceLast     time passed since the last inspection
tobacco_sale      licensed to sell tobacco
burglary          intensity of burglaries (locally)
sanitation        intensity of recent sanitation complaints (locally)
garbage           intensity of recent garbage cart requests (locally)

4.3.2 Descriptive statistics of the normalised data

Some descriptive statistics of the selected features are given in the tables below.

Table 4.4: Descriptive statistics of the normalised data

Variable                     Mean      Standard deviation
past Critical                0.47340   0.12335
past Serious                 0.47384   0.15287
serious counts               0.47938   0.18803
Time since last inspection   0.50813   0.22228
burglary                     0.48679   0.19379
Garbage cart requests        0.48875   0.197096
Sanitation complaints        0.48334   0.16088

From table 4.4, most of the features had relatively low means, with the feature "time since last inspection" having the highest among them, suggesting most establishments are at risk level three, where establishments are inspected once every other year. The data showed that 1685 food establishments were licensed to sell tobacco whereas 16648 were not. The colour-coded cluster of inspectors named Blue did 3323 inspections out of the total 18333. There were also 1609 records of previous failures out of the 18333 inspections (see appendix, table 5.1).

4.4 Logistic Regression (LR)

This section involves fitting the model on the training set and evaluating it on the test set.
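A minimal fit of this kind can be sketched with scikit-learn on synthetic data (the thesis fitted the model in R/MATLAB on the ten retained features; the coefficients below are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.random((1000, 10))  # ten features normalised to [0, 1]

# Simulate a binary outcome whose log odds depend on the first two features.
logit = 4.0 * X[:, 0] - 3.0 * X[:, 1] - 0.5
y = (rng.random(1000) < 1 / (1 + np.exp(-logit))).astype(int)

model = LogisticRegression().fit(X, y)
print(model.coef_.round(2))       # estimated slope coefficients
print(model.intercept_.round(2))  # estimated intercept
```

The fitted coefficients recover the signs of the simulated effects, which is the kind of sign interpretation applied to the real model in the next subsection.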
4.4.1 Logistic regression model

Below is a summary of the model.

Table 4.5: Summary of the logistic regression model

                                                                              95% C.I. for exp(β)
Variable        Estimate (β)   Wald       Std. Error   z value   Pr(>|z|)       Lower    Upper
(Intercept)     10.801         463.929    0.330        32.717    < 2e-16 ***    -        -
pastFail        0.308          2.827      0.183        -1.681    0.093          0.950    1.948
pastCritical    -0.874         12.387     0.248        -3.519    4.32e-4 ***    0.256    0.679
pastSerious     -0.085         0.057      0.358        -0.238    0.812          0.455    1.854
seriousCount    -20.666        1116.304   0.619        -33.412   < 2e-16 ***    0.000    0.000
timeSinceLast   -0.601         15.761     0.151        -3.970    7.19e-05 ***   0.407    0.738
BlueInsp        1.031          216.587    0.070        14.717    < 2e-16 ***    2.443    3.215
tobacco_sale    -0.316         6.816      0.121        -2.611    0.009 **       0.575    0.924
burglary        -0.568         10.045     0.179        -3.169    0.002 **       0.399    0.805
garbage         0.745          18.353     0.174        4.284     1.84e-05 ***   1.498    2.961
sanitation      -0.138         0.405      0.216        -0.636    0.525          0.570    1.331

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From table 4.5, the variables pastFail (previous failures), pastSerious (presence of a serious violation on the last visit) and sanitation (sanitation complaints) are not statistically significant at the 5% level of significance. The variables pastCritical, seriousCount, timeSinceLast, BlueInsp, burglary and garbage are statistically significant, since their p-values are less than 0.05. The number of serious violations (seriousCount) had the smallest p-value, suggesting a very strong association between the number of serious violations and the probability of passing an inspection; its negative coefficient suggests that, with all other variables held equal, an increase in the serious count makes a failed inspection outcome more likely. Since the log odds are a linear function of the estimated coefficients, a unit change in a variable multiplies the odds by the exponential of its coefficient: for example, being a licensed tobacco seller multiplies the odds by e^−0.316 (0.7290595).
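Since a coefficient enters the model through the log odds, exponentiating it gives the multiplicative change in the odds; for the tobacco_sale estimate in Table 4.5:

```python
import math

beta_tobacco = -0.316           # estimate from Table 4.5
odds_ratio = math.exp(beta_tobacco)
print(round(odds_ratio, 4))     # 0.7291: the licence multiplies the odds by ~0.73
```

The same transformation applied to the confidence limits of β yields the odds-ratio interval reported in the table.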
Considering the other independent variables in the model, the 95% Wald confidence limits give intervals within which the true population odds ratios are expected to lie. Also, the analysis of deviance table (see Appendix, Table 5.3) shows that as each variable is added one by one, the deviance drops. In particular, the addition of pastFail, pastCritical and pastSerious causes a significant reduction in residual deviance. A large p-value would indicate that the null model accounts for a similar amount of variation; here, the results indicate that the variables contribute significantly to the model, since the p-values are below 0.05. The McFadden pseudo R² was 0.6226629, which suggests the full model performed better than the null model. The training time and processing speed of the logistic regression were 76.693 seconds and 13,000 observations per second respectively (MATLAB R2017a).

Evaluating the predictive ability of the model

To assess the predictive ability of the model, a new data set (the test set) is used. Below is a confusion matrix showing how the logistic model performed on the test set.

                                            Actual
Predicted                       Critical violation (Fail)   No critical violation (Pass)
Critical violation (Fail)       495                         118
No critical violation (Pass)    61                          1080
(R output)

From equations 3.22, 3.23 and 3.24,

Sensitivity = 495 / (495 + 61) = 0.8903
Specificity = 1080 / (118 + 1080) = 0.9015
Acc_cl = (1518 / 1636) × 100 = 92.7873%

Figure 4.1: ROC of the logistic regression model

Figure 4.1 suggests the model has a high discrimination ability, since the Receiver Operating Characteristic (ROC) curve passes very close to the top left corner. The area under the estimated curve (AUC) was 0.9199951.
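The AUC admits a pairwise-ranking interpretation: it equals the probability that a randomly chosen inspection with a critical violation receives a higher predicted risk than a randomly chosen inspection without one. An illustrative pure-Python check on toy scores (ties counted as half; the scores are hypothetical):

```python
def pairwise_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    concordant = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos_scores) * len(neg_scores))

# Toy predicted probabilities of a critical violation
auc = pairwise_auc(pos_scores=[0.9, 0.8, 0.6], neg_scores=[0.7, 0.3, 0.2])
```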
This means that if a random pair of inspections is taken, one with a critical violation and the other without, there is a 91.9995% probability that the logistic regression model ranks them correctly. The logistic regression model realised a prediction accuracy of 92.7872%.

4.5 Support Vector Machine (SVM)

The SVM algorithm is a kernel-based classifier specifically built for binary classification. The SVM uses kernel functions to map the training set into a higher-dimensional space, increasing its resemblance to a data set that can be linearly separated. Three kernel functions were used, namely the linear, Radial Basis Function (RBF) and polynomial kernels.

4.5.1 Model selection for SVM

The linear, Radial Basis Function (RBF) and polynomial kernels were compared using their ROC, sensitivity and specificity measures. The study made use of 10-fold cross validation to find the best parameters, with a parameter range defined for the tuning parameters of each kernel. The R package caret and MATLAB (R2017a) were used in running these analyses. The computer used had the following specifications: AMD E-300 APU with Radeon(tm) HD Graphics, 1.30 GHz, with 4 GB of installed memory.

Linear kernel

In the linear kernel, the regularisation parameter C, also called the cost or box constraint, is the main tuning parameter. In defining the grid for the selection of the best C, the values 0.25, 0.5, 0.75, 1, 1.25 and 1.5 were considered in training the SVM. This choice stems from the use of C = 1 in studies such as Palaniappan et al. (2014), coupled with the fact that the default setting for software packages like MATLAB is C = 1. This grid resulted in the creation of six models. The following shows the results for the various tuning parameters considered.
Table 4.6: Resampling results across the tuning parameters of the linear kernel

C      ROC        Sens       Spec
0.25   0.8919498  0.7789694  0.9981073
0.50   0.8903240  0.7789694  0.9981073
0.75   0.8919143  0.7789694  0.9981073
1.00   0.8909931  0.7789694  0.9981073
1.25   0.8919220  0.7789694  0.9981073
1.50   0.8927530  0.7789694  0.9981073

Table 4.6 shows the various cost parameters considered, together with their sensitivity, specificity and ROC measures. All the C values had the same sensitivity and specificity, which may be a result of the closeness of the tuning parameter values around the optimum cost parameter. It shows all the models have a similar ability to correctly predict inspections with critical violations as such, and inspections without critical violations as such. Their ROC measures, however, differentiate them, since the ROC compares sensitivity and specificity across the range of the tuning parameters. The ROC was used to select the optimal model, taking the largest value. The final value used for the model was C = 1.50. The computational time for training and selecting the optimal C was 0.44891 hours (Source: R statistical package).

RBF kernel

With the RBF kernel, the default tuning parameters are the regularisation parameter C and the kernel parameter γ. In defining the grid for the selection of the best C and γ, the values 0.25, 0.5, 0.75, 1, 1.25, 1.5 and 0.01, 0.015, 0.2 were respectively considered in training the SVM. This creates 18 distinct models, as each C value is paired with every γ value. The following shows some of the resampling results across the tuning parameters (for the full table, see Appendix, Table 5.4).
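Before turning to those results, it may help to recall the kernel being tuned: the RBF kernel is K(x, x') = exp(-γ‖x - x'‖²), so γ controls how quickly similarity decays with distance. A self-contained Python sketch, using the γ = 0.2 value selected below (illustrative; the actual training used caret in R):

```python
import math

def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)

k_same = rbf_kernel([0.4, 0.7], [0.4, 0.7], gamma=0.2)   # identical points
k_far = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.2)    # distant points
```

Identical points have similarity 1, and similarity falls off as the points move apart.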
Table 4.7: Resampling results across the tuning parameters of the RBF kernel

γ      C     ROC        Sens       Spec
0.010  0.25  0.8956507  0.7789700  0.9981074
0.010  1.50  0.8935073  0.7789700  0.9981074
0.015  0.25  0.8962297  0.7789700  0.9981074
0.015  1.50  0.8978426  0.7790057  0.9980533
0.200  0.25  0.9028196  0.7897895  0.9763524
0.200  0.50  0.9029592  0.7877184  0.9822641

Table 4.7 shows the various regularisation and kernel parameters considered, together with their sensitivity, specificity and ROC measures. When γ = 0.01, the sensitivity and specificity values remain the same for any value of C, although the ROC differs. The models with γ = 0.015 against C = 0.25 and 0.50 also showed identical sensitivity and specificity values. This may be a result of the closeness of both parameters to the optimum C and γ. The sensitivity and specificity begin to differ when γ = 0.015 for C = 0.75, 1.00 and 1.50, and when γ = 0.20 for all values of C. The distinguishing measure was the ROC, which was used to select the optimal model by taking the largest value. The final values used for the model were γ = 0.2 and C = 0.5. The computational time for training and selecting the optimal parameters was 18.048915 hours (Source: R statistical package).

Polynomial kernel

With the polynomial kernel, the default tuning parameters are the degree (d), cost (C) and scale. In defining the grid for the polynomial kernel, the values 0.25, 0.5, 0.75, 1, 1.25, 1.5 and 0.001, 0.01, 0.1 were defined as the cost and scale parameters respectively, while degrees 1, 2 and 3 were used in training the SVM. Pairing all the parameters creates 27 distinct models. The following shows some of the resampling results across the tuning parameters (for the full table, see Appendix, Table 5.5).
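Again for reference, the polynomial kernel has the form K(x, y) = (scale·⟨x, y⟩ + offset)^d; with degree d = 1 it is an affine function of the ordinary inner product, i.e. effectively a linear kernel. A Python sketch (the offset is fixed at 1 here, an assumption, since the thesis does not report it):

```python
def poly_kernel(x, y, degree, scale, offset=1.0):
    """Polynomial kernel (scale * <x, y> + offset) ** degree."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return (scale * dot + offset) ** degree

x, y = [0.2, 0.9], [0.5, 0.1]
k1 = poly_kernel(x, y, degree=1, scale=0.001)  # the parameters selected below
dot_xy = 0.2 * 0.5 + 0.9 * 0.1
```

With d = 1 the kernel value is exactly scale·⟨x, y⟩ + offset, which is why the optimal polynomial model reported below behaves like a linear one.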
Table 4.8: Resampling results across the tuning parameters of the polynomial kernel

degree  scale  C     ROC        Sens       Spec
1       0.001  0.25  0.9124103  0.7789683  0.9981074
1       0.100  1.00  0.8893959  0.7789683  0.9981074
2       0.001  0.25  0.8837571  0.7789683  0.9981074
2       0.100  1.00  0.8843421  0.7792181  0.9974225
3       0.001  0.25  0.8803390  0.7789683  0.9981074
3       0.100  1.00  0.8890698  0.7839677  0.9903569

The ROC was used to select the optimal model, taking the largest value. The final values used for the model were degree = 1, scale = 0.001 and C = 0.25. The computational time for training and selecting the optimal tuning parameters was 15.678910 hours (Source: R statistical package).

4.5.2 Linear, RBF and Polynomial kernels

The tables below summarise the ROC, sensitivity and specificity of the three kernels based on their best tuning parameters:

Table 4.9: Receiver Operating Characteristic

            Min.       1st Qu.    Median     Mean       3rd Qu.    Max.
Linear      0.8688288  0.8818005  0.8910184  0.8927530  0.9023934  0.9281725
Radial      0.8769418  0.8945710  0.9045632  0.9029592  0.9122074  0.9222675
Polynomial  0.8910859  0.9055932  0.9129130  0.9124103  0.9194117  0.9361986

Table 4.10: Sensitivity

            Min.       1st Qu.    Median     Mean       3rd Qu.    Max.
Linear      0.7468806  0.7665179  0.7821429  0.7789694  0.7910714  0.8196429
Radial      0.7290553  0.7767857  0.7883929  0.7877184  0.8000000  0.8232143
Polynomial  0.7464286  0.7687500  0.7769847  0.7789683  0.7910714  0.8107143

Table 4.11: Specificity

            Min.       1st Qu.    Median     Mean       3rd Qu.    Max.
Linear      0.9954914  0.9972955  0.9981974  0.9981073  0.9990991  1.0000000
Radial      0.9720721  0.9801802  0.9819820  0.9822641  0.9846812  0.9891794
Polynomial  0.9954955  0.9972973  0.9981974  0.9981074  0.9990991  1.0000000

From Tables 4.9, 4.10 and 4.11, the polynomial kernel appears to have the advantage with respect to the ROC and the specificity. Hence, comparisons were made using a resampling approach for a clearer display. Eugster et al. (2008) and Hothorn et al.
(2005) describe techniques for making decisions using resampling. Therefore, 50 resamplings were done, and the figure below plots the kernels against their ROC.

Figure 4.2: Plot of the three kernels against ROC

Clearly, the polynomial kernel with the tuning parameters degree = 1, scale = 0.001 and C = 0.25 is the optimal model.

4.6 K-Nearest Neighbour (KNN)

In the KNN algorithm, a new object is categorised by a majority vote of its neighbours; that is, it is assigned the most common category amongst its K nearest neighbours, where K is a strictly positive integer. In training the KNN classifier, the tune length was set to 20, over which the best K was selected. The studies of Lu et al. (2005) and Zhu et al. (2011) considered a similar approach to selecting an optimal K parameter. The Euclidean distance metric was used in this study. Below are some resampling results across the K values (see Appendix, Table 5.6 for all the values).

Table 4.12: Resampling results across the tuning parameter of KNN

k    ROC        Sens       Spec
5    0.9059139  0.8045366  0.9392038
11   0.9079729  0.7908606  0.9675018
19   0.9112177  0.7907181  0.9726389
25   0.9120057  0.7905747  0.9739367
31   0.9128466  0.7910390  0.9751260
37   0.9137709  0.7917529  0.9746935
41   0.9136882  0.7917172  0.9749281

Below is a plot of K against the corresponding ROC.

Figure 4.3: Comparison of Accuracy against K

Figure 4.3 displays the performance of the number of nearest neighbours on the training data. The ROC increases with an increase in the number of neighbours, with a decrease around k = 31; it peaked again, and the optimal k selected was 37. The computational time for training and selecting the optimal tuning parameter K was 1.648078 hours (Source: R statistical package).
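The KNN decision rule described above (Euclidean distance, majority vote among the K closest training points) can be sketched in a few lines of Python; the data and function name below are hypothetical (the thesis used caret in R):

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest neighbours.

    `train` is a list of (feature_vector, label) pairs.
    """
    # Sort training points by Euclidean distance to the query point
    dists = sorted((math.dist(x, query), label) for x, label in train)
    # Majority vote among the k closest labels
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical normalised feature vectors with pass/fail labels
train = [([0.10, 0.20], "pass"), ([0.15, 0.25], "pass"),
         ([0.90, 0.80], "fail"), ([0.85, 0.90], "fail"), ([0.20, 0.10], "pass")]
pred = knn_predict(train, query=[0.12, 0.22], k=3)
```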
4.7 Comparing prediction performance of the SVM models, Logistic Regression (LR) and KNN

The optimal SVM, logistic regression and KNN models were trained and tested on the same sets of data, using a 10-fold cross validation method. The analysis was made possible by the MATLAB R2017a software package. Below are the results realised from the SVM models.

Table 4.13: The performance of the SVM models

Model (kernel)  C     γ      d    Prediction accuracy (%)  Training time  Prediction speed (test data)
Linear          1.50  N/A    N/A  92.7873                  249.59 sec     4300 obs/sec
RBF             0.50  0.20   N/A  92.0538                  1355.3 sec     1900 obs/sec
Polynomial      0.25  0.001  1    90.1589                  1983.2 sec     250 obs/sec

As shown in Table 4.13, the SVM classifier with the linear kernel obtained the maximum classification accuracy of 92.7873% among the linear, RBF and polynomial models. The linear kernel also had the fastest processing speed, which translates into less computational time.

Table 4.14: The performance of the KNN models

Number of nearest neighbours (k)  Test accuracy (%)
5                                 91.1369
11                                92.2983
19                                92.5428
25                                92.5428
31                                92.6650
37                                92.6039
41                                92.5428

Zhu et al. (2011) recommend using varied values of K to select the optimal K. Table 4.14 shows that all the models generalised well to the new test data, since they all had high test classification accuracies. The KNN classifier with K = 31 had a test classification accuracy of 92.6650%, therefore having the highest generalisation capacity. Although, according to Figure 4.3 and Table 4.12, the optimal K measured by ROC was 37, after training and testing K = 31 is the optimal model according to the classification accuracies. This means K = 37 did not generalise as well to the new data set.
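The selection logic in this section amounts to two argmax operations over different criteria: cross-validated ROC during training (Table 4.12) versus out-of-sample test accuracy (Table 4.14). A Python sketch using the reported figures:

```python
# Cross-validated ROC per k (Table 4.12, selected rows)
cv_roc = {5: 0.9059139, 11: 0.9079729, 19: 0.9112177, 25: 0.9120057,
          31: 0.9128466, 37: 0.9137709, 41: 0.9136882}

# Out-of-sample classification accuracy per k, in percent (Table 4.14)
test_acc = {5: 91.1369, 11: 92.2983, 19: 92.5428, 25: 92.5428,
            31: 92.6650, 37: 92.6039, 41: 92.5428}

k_by_roc = max(cv_roc, key=cv_roc.get)       # k chosen during training
k_by_test = max(test_acc, key=test_acc.get)  # k with best generalisation
```

The two criteria disagree here, which is exactly the mismatch between k = 37 (training ROC) and k = 31 (test accuracy) discussed above.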
Table 4.15: The optimal prediction accuracies of LR, SVM and KNN

                   LR        SVM       KNN
Accuracy           92.7872%  92.7873%  92.6650%
Misclassification  7.2128%   7.2127%   7.3350%

Table 4.15 shows the prediction accuracies and misclassification rates of the best models of the three classifiers. The SVM classifier with the linear kernel had the highest detection rate relative to the logistic regression and KNN models. In other words, the SVM classifier with the linear kernel predicts more correctly, which suggests that the selected variables carry substantial power to predict critical violations in food establishments. The logistic regression followed closely with 92.7872%, whereas the KNN model had 92.6650%. However, there was only a 0.0001 percentage-point difference between the SVM classifier and the logistic regression, which suggests how similar they are in classifying critical violations at food establishments. This also indicates that the data is linearly separable, which is further confirmed by the selected degree, d = 1, in the polynomial kernel. Figure 4.4 displays the classification accuracies and misclassification rates of the LR, SVM and KNN models respectively.

Figure 4.4: Accuracies and misclassification of LR, SVM and KNN

Since KNN exercises a majority voting scheme, for a data set with p data points, the p-nearest-neighbour model would use every data point in the data set to categorise new instances. In light of this, when k = 31, only the nearest thirty-one data points are selected. An increase in the k-value means an increase in the number of nearest neighbours, which is likely to lead to a decrease in performance (Beyer, Goldstein, Ramakrishnan, & Shaft, 1999). This feature of KNN makes the case for KNN to contend with the SVM. Comparing the computational time for training and selecting the optimal tuning parameters of the SVM and KNN models also suggests that the KNN is time consuming.
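The misclassification rates in Table 4.15 are simply the complements of the accuracies, as the following check confirms (illustrative Python):

```python
# Optimal test accuracies from Table 4.15, in percent
accuracies = {"LR": 92.7872, "SVM": 92.7873, "KNN": 92.6650}

# Misclassification rate is the complement of accuracy (percentage points)
misclassification = {m: round(100.0 - a, 4) for m, a in accuracies.items()}
```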
Some studies (Bhaskar, Hoyle & Singh, 2006; Palaniappan, Sundaraj & Sundaraj, 2014) have reported on the computational complexity of the KNN algorithm, which leads to a high computational time. The logistic regression had the least computational time (76.693 seconds); therefore, the logistic regression would be preferred when the speed of the classifier is the main concern. It took a little over three times the training time of the LR for the SVM classifier to be trained (249.59 sec). The logistic regression and the SVM without a kernel can be used interchangeably, but when a kernel is used, the SVM performs better (Kumar, 2018). This suggests that the capability of an SVM lies in the use of a kernel. The performance of the kernel also reflects the nature of the data used in the study. The performance of the linear kernel again reiterates the fact that linear kernels perform better than the RBF kernel on linearly separable data (Palaniappan et al., 2014). Even under the constraints of the defined parameter settings of the SVM models, the SVM classifier with the linear kernel had the highest detection rate.

4.8 Summary

This chapter began with a description of the data set and a preliminary analysis to select the relatively relevant features using principal component analysis. The logistic regression, Support Vector Machine and K-Nearest Neighbour algorithms were used to classify the food inspections data in order to predict whether an inspection would yield a critical violation or not. The SVM classifier with the linear kernel was the best at identifying critical violations, as compared to the LR and KNN models.

CHAPTER 5

CONCLUSION AND RECOMMENDATIONS

This chapter presents a concise summary of the results of this study. It further draws conclusions that address the objectives of the study. Some suggestions and recommendations are also put forward.
5.1 Conclusion

The substantive issue of keeping our food establishments safe is a never-ending problem which needs every available resource to tackle. One effective way it can be checked is by conducting inspections of these facilities to make sure the stipulated rules and regulations are followed. Some recognised bodies, like the Canadian Food Inspection Agency (CFIA), the Chicago Department of Public Health (CDPH) and the Ghana Food and Drugs Authority (FDA), are legally empowered to enforce or implement food-related rules and regulations. Food establishments fail inspections when critical violations are found. Critical violations are food-related threats, such as food contamination, that affect the health of patrons (Murphy et al., 2011). Inspections of this nature yield large volumes of data, and data mining algorithms like the Support Vector Machine (SVM), Logistic Regression (LR) and K-Nearest Neighbour (KNN) can help make predictions to support informed decisions. Previous studies used logistic regression in this regard to prioritise inspections based on food inspections data from the City of Chicago open data portal. However, logistic regression has some deficiencies as compared to SVM and KNN. Therefore, this study used SVM and KNN to assess the performance of these algorithms relative to the logistic regression in predicting critical violations in food establishments, so as to help prioritise inspections. Using the available data set, Principal Component Analysis (PCA) was used to extract 10 features that are independent of each other. These features were used as inputs in the predictive models. The linear, Radial Basis Function (RBF) and polynomial kernels of the SVM were trained on the training data using a 10-fold cross validation method to select the optimal models under varied parameter settings. Also, using varied K values, an optimal KNN model was selected when trained using the 10-fold cross validation method.
The ROC, sensitivity and specificity were the performance measures used to select the optimal models. Similarly, the logistic regression model was trained using the 10-fold cross validation method. The logistic regression model identified the type of inspector (first sanitarian cluster), the presence of a critical violation on the last visit, the number of serious violations, licensed tobacco sellers, the intensity of local burglaries, the intensity of recent garbage cart requests and the length of time since the last inspection as the features that significantly influence the presence or absence of critical violations at food establishments. With a defined grid of parameter settings, the SVM classifier with the linear kernel and the KNN model with K = 31 had higher accuracies relative to all the other models in the grid. Therefore, the logistic model, the SVM model with the linear kernel and the KNN model with 31 nearest neighbours were able to detect critical violations in food establishments. The classification accuracies for the LR, SVM and KNN classifiers were 92.7872%, 92.7873% and 92.6650% respectively. In conclusion, the findings showed that the SVM classifier with the linear kernel has a marginally higher generalisation capacity than the LR and KNN models on this data set. Therefore, prioritising inspections using the SVM model will marginally improve on the previous work of Schenk et al. (2014) in the early detection of food establishments with critical violations.

5.2 Recommendations

In relation to the conclusions drawn from this study, the following suggestions are made for future studies.

1. This study used defined parameter settings, especially for the SVM kernels; it therefore remains interesting for further studies to use an exhaustive parameter search approach.

2. Data on food inspections in Ghana should be properly collated and made available so that such studies can be readily conducted.

REFERENCES

Ababio, P. F., & Adi, D. D. (2012).
Evaluating food hygiene awareness and practices of food handlers in the Kumasi metropolis. Internet Journal of Food Safety, 14(2), 35-43.

Ababio, P. F., Adi, D. D., & Commey, V. (2012). Food safety management systems, availability and maintenance among food industries in Ghana. Food Science and Technology.

Ababio, P. F., & Lovatt, P. (2015). A review on food safety and food hygiene studies in Ghana. Food Control, 47, 92-97.

Abidin, U. F. U. Z., Arendt, S. W., & Strohbehn, C. H. (2014). Food safety culture in onsite foodservices: Development and validation of a measurement scale. Journal of Foodservice Management and Education, 8(1), 1.

Agyei-Baffour, P., Sekyere, K. B., & Addy, E. A. (2013). Policy on Hazard Analysis and Critical Control Point (HACCP) and adherence to food preparation guidelines: a cross sectional survey of stakeholders in food service in Kumasi, Ghana. BMC Research Notes, 6(1), 442.

Alkhatib, K., Najadat, H., Hmeidi, I., & Shatnawi, M. K. A. (2013). Stock price prediction using k-nearest neighbor (kNN) algorithm. International Journal of Business, Humanities and Technology, 3(3), 32-44.

Allwood, P. B., Jenkins, T., Paulus, C., Johnson, L., & Hedberg, C. W. (2004). Hand washing compliance among retail food establishment workers in Minnesota. Journal of Food Protection, 67(12), 2825-2828.

Alpaydin, E. (1997). Voting over multiple condensed nearest neighbors. In Lazy learning (pp. 115-132). Springer, Dordrecht.

Andoh, A. H., Ackah, N. B., & Abbey, L. D. (2015). Let's adopt and implement the draft national food safety policy: feature article in The Ghanaian Times, Thursday, April 9, 2015.

Arendt, S., Strohbehn, C., & Jun, J. (2015). Motivators and barriers to safe food practices: observation and interview. Food Protection Trends, 35(5), 365-376.

Arvey, A., Agius, P., Noble, W. S., & Leslie, C. (2012).
Sequence and chromatin determinants of cell-type-specific transcription factor binding. Genome Research, 22(9), 1723-1734.

Asraf, H. M., Nooritawati, M. T., & Rizam, M. S. (2012). A comparative study in kernel-based support vector machine of oil palm leaves nutrient disease. Procedia Engineering, 41, 1353-1359.

Becker, S. (Ed.). (2001). Data warehousing and web engineering (pp. 77-99). IGI Global.

Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999, January). When is "nearest neighbor" meaningful?. In International Conference on Database Theory (pp. 217-235). Springer, Berlin, Heidelberg.

Bhatia, N. (2010). Survey of nearest neighbor techniques. arXiv preprint arXiv:1007.0085.

Bhaskar, H., Hoyle, D. C., & Singh, S. (2006). Machine learning in bioinformatics: A brief survey and recommendations for practitioners. Computers in Biology and Medicine, 36(10), 1104-1125.

Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory (pp. 144-152). ACM.

Brachman, R. J., & Anand, T. (1996). The process of knowledge discovery in databases. In Advances in knowledge discovery and data mining (pp. 37-57). American Association for Artificial Intelligence.

Bradley, P. S., & Mangasarian, O. L. (1998). Feature selection via concave minimization and support vector machines. In ICML (Vol. 98, pp. 82-90).

Bradley, P. S., Mangasarian, O. L., & Street, W. N. (1998). Feature selection via mathematical programming. INFORMS Journal on Computing, 10(2), 209-217.

Buzby, J. C., Frenzen, P. D., & Rasco, B. (2001). Product liability and microbial foodborne illness. US Department of Agriculture, Economic Research Service.

Centers for Disease Control and Prevention (CDC). (2013). Surveillance for foodborne disease outbreaks - United States, 2009-2010. MMWR.
Morbidity and Mortality Weekly Report, 62(3), 41.

Chaplot, S., Patnaik, L. M., & Jagannathan, N. R. (2006). Classification of magnetic resonance brain images using wavelets as input to support vector machine and neural network. Biomedical Signal Processing and Control, 1(1), 86-92.

Chou, J. S., Cheng, M. Y., Wu, Y. W., & Pham, A. D. (2014). Optimizing parameters of support vector machine using fast messy genetic algorithm for dispute classification. Expert Systems with Applications, 41(8), 3955-3964.

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.

Cushman, J. W., Shanklin, C. W., & Niehoff, B. P. (2001). Hygiene practices of part-time student employees in a university foodservice operation. The Journal of the National Association of College and University Food Services, 23, 37-44.

Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). Wiley.

Duda, R. O., Hart, P. E., & Stork, D. G. (1973). Pattern classification (Vol. 2). New York: Wiley.

Eitrich, T., & Lang, B. (2006). Efficient optimization of support vector machine learning parameters for unbalanced datasets. Journal of Computational and Applied Mathematics, 196(2), 425-436.

Ekici, S. (2012). Support vector machines for classification and locating faults on transmission lines. Applied Soft Computing, 12(6), 1650-1658.

Este, A., Gringoli, F., & Salgarelli, L. (2009). Support vector machines for TCP traffic classification. Computer Networks, 53(14), 2476-2490.

Eugster, M., Hothorn, T., & Leisch, F. (2008). Exploratory and inferential analysis of benchmark experiments. Ludwig-Maximilians-Universität München, Department of Statistics, Tech. Rep. 30.

Feglo, P., & Sakyi, K. (2012). Bacterial contamination of street vending food in Kumasi, Ghana. Journal of Medical and Biomedical Sciences, 1(1), 1-8.

Ferri, C., Hernández-Orallo, J., & Modroiu, R. (2009).
An experimental comparison of performance measures for classification. Pattern Recognition Letters, 30(1), 27-38.

Field, A. (2013). Discovering statistics using SPSS (4th ed.). London: SAGE.

Fielding, J. E., Aguirre, A., & Palaiologos, E. (2001). Effectiveness of altered incentives in a food safety inspection program. Preventive Medicine, 32(3), 239-244.

Fix, E., & Hodges Jr, J. L. (1951). Discriminatory analysis - nonparametric discrimination: consistency properties. California Univ Berkeley.

Food and Drugs Board. (1992). The food and drugs act. PNDCL 3058 1992. Retrieved from www.wipo.int/edocs/lexdocs/laws/en/gh/gh022en.pdf

Food and Drugs Authority. (2018). Retrieved from http://www.moh.gov.gh/foods-and-drug-authority/

Food safety. (2017). Retrieved from http://www.who.int/news-room/fact-sheets/detail/food-safety

Franco-Lopez, H., Ek, A. R., & Bauer, M. E. (2001). Estimation and mapping of forest stand density, volume, and cover type using the k-nearest neighbors method. Remote Sensing of Environment, 77(3), 251-274.

Ghana Health Service. (2012). In International Federation of Red Cross and Red Crescent Societies, Disaster Relief Emergency Fund (DREF) Final Report. Viewed 18/4/2013.

Ghana Standard Authority. (2013). Personal communication with standard documentation department.

Golan, E. H., Roberts, T., Salay, E., Caswell, J. A., Ollinger, M., & Moore, D. L. (2004). Food safety innovation in the United States: evidence from the meat industry (No. 34083). United States Department of Agriculture, Economic Research Service.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., ... & Bloomfield, C. D. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439), 531-537.

Guo, G., Wang, H., Bell, D., Bi, Y., & Greer, K. (2003). KNN model-based approach in classification.
In OTM Confederated International Conferences "On the Move to Meaningful Internet Systems" (pp. 986-996). Springer, Berlin, Heidelberg.

Guyon, I., & Elisseeff, A. (2006). An introduction to feature extraction. In Feature extraction (pp. 1-25). Springer, Berlin, Heidelberg.

Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar), 1157-1182.

Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3), 389-422.

Hamamoto, Y., Uchimura, S., & Tomita, S. (1997). A bootstrap technique for nearest neighbor classifier design. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1), 73-79.

Hassanat, A. B., Abbadi, M. A., Altarawneh, G. A., & Alhasanat, A. A. (2014). Solving the problem of the K parameter in the KNN classifier using an ensemble learning approach. arXiv preprint arXiv:1409.0919.

Hoak, J. (2010). The effects of outliers on support vector machines. Portland State University.

Hong, J. H., Min, J. K., Cho, U. K., & Cho, S. B. (2008). Fingerprint classification using one-vs-all support vector machines dynamically ordered with naive Bayes classifiers. Pattern Recognition, 41(2), 662-671.

Hornik, K., Meyer, D., & Karatzoglou, A. (2006). Support vector machines in R. Journal of Statistical Software, 15(9), 1-28.

Hothorn, T., Leisch, F., Zeileis, A., & Hornik, K. (2005). The design and analysis of benchmark experiments. Journal of Computational and Graphical Statistics, 14(3), 675-699.

Howells, A. D., Roberts, K. R., Shanklin, C. W., Pilling, V. K., Brannon, L. A., & Barrett, B. B. (2008). Restaurant employees' perceptions of barriers to three food safety practices. Journal of the Academy of Nutrition and Dietetics, 108(8), 1345-1349.

Hsu, C.-W., Chang, C.-C., & Lin, C.-J. (2004). A practical guide to support vector classification.
Technical Report, Department of Computer Science and Information Engineering, National Taiwan University. Retrieved from http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf

Huang, C. L., & Wang, C. J. (2006). A GA-based feature selection and parameters optimization for support vector machines. Expert Systems with Applications, 31(2), 231-240.

Jahromi, M. Z., Parvinnia, E., & John, R. (2009). A method of learning weighted similarity function to improve the performance of nearest neighbor. Information Sciences, 179(17), 2964-2973.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, pp. 342-369). New York: Springer.

Japkowicz, N. (2000). Learning from imbalanced data sets: a comparison of various strategies. In AAAI Workshop on Learning from Imbalanced Data Sets (Vol. 68, pp. 10-15).

Jiang, P., Missoum, S., & Chen, Z. (2014). Optimal SVM parameter selection for non-separable and unbalanced datasets. Structural and Multidisciplinary Optimization, 50(4), 523-535.

Joachims, T. (1998, April). Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning (pp. 137-142). Springer, Berlin, Heidelberg.

Jolliffe, I. (2011). Principal component analysis. In International encyclopedia of statistical science (pp. 1094-1096). Springer, Berlin, Heidelberg.

Kalakuntla, P. (2017). Performance analysis of kNN query processing on large datasets using CUDA & Pthreads: comparing between CPU & GPU.

Kassel, S. (2017). Predicting building code compliance with machine learning models. Retrieved from https://www.azavea.com/blog/2017/09/21/building-inspection-prediction/

Knight, A. J., Worosz, M. R., & Todd, E. C. D. (2007). Serving food safety: consumer perceptions of food safety at restaurants. International Journal of Contemporary Hospitality Management, 19(6), 476-484.

Kriminger, E., Príncipe, J.
C., & Lakshminarayan, C. (2012). Nearest neighbor distributions for imbalanced classification. In Neural Networks (IJCNN), The 2012 International Joint Conference on (pp. 1-5). IEEE. Kumar, A. (2018). Machine Learning - When to Use Logistic Regression vs. SVM 89 University of Ghana http://ugspace.ug.edu.gh - Reskilling IT. Retrieved from https://vitalflux.com/machine-learning-use- logistic-regression-vs-svm/ Kumar, N., Krovi, R., & Rajagopalan, B. (1997). Financial decision support with hybrid genetic and neural based modeling tools. European Journal of Operational Research, 103 (2), 339-349. Kuramochi, M., & Karypis, G. (2005). Gene classification using expression profiles: A feasibility study. International Journal on Artificial Intelligence Tools, 14 (04), 641-660. Latourrette, M. (2000). Toward an explanatory similarity measure for nearest-neighbor classification. In European Conference on Machine Learning (pp. 238-245). Springer, Berlin, Heidelberg. Lavanya, B., & Divya, B. (2017). Big data analysis using SVM and K-NN data mining techniques. International Journal of Computer Science and Mobile Computing (IJCSMC), 6 (1), 84-91 Li, Z., Zhang, Q., & Zhao, X. (2017). Performance analysis of K-nearest neighbor, support vector machine, and artificial neural network classifiers for driver drowsiness detection with different road geometries. International Journal of Distributed Sensor Networks, 13 (9), 1550147717733391. Liu, C., Jiang, D., & Yang, W. (2014). Global geometric similarity scheme for feature selection in fault diagnosis. Expert Systems with Applications, 41 (8), 3585-3595. Liu, W., & Chawla, S. (2011). Class confidence weighted knn algorithms for imbalanced data sets. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 345-356). Springer, Berlin, Heidelberg. Lo, C. S., & Wang, C. M. (2012). Support vector machine for breast MR image classification. 
Computers & Mathematics with Applications, 64 (5), 1153-1162 90 University of Ghana http://ugspace.ug.edu.gh Lovric, M. (2011). International Encyclopedia of Statistical Science. Springer pp. 591- 756. Lu, J., Getz, G., Miska, E. A., Alvarez-Saavedra, E., Lamb, J., Peck, D., ... & Downing, J. R. (2005). MicroRNA expression profiles classify human cancers. nature, 435 (7043), 834 Lynch, R. A., Elledge, B. L., Griffith, C. C., & Boatright, D. T. (2003). A comparison of food safety knowledge among restaurant managers, by source of training and experience, in Oklahoma County, Oklahoma. Journal of Environmental Health, 66(2), 9. Madaio, M., Chen, S. T., Haimson, O. L., Zhang, W., Cheng, X., Hinds-Aldrich, M., ... & Dilkina, B. (2016). Firebird: Predicting fire risk and prioritizing fire inspections in atlanta. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 185-194). ACM. Monney, I., Agyei, D., & Owusu, W. (2013). Hygienic practices among food vendors in educational institutions in Ghana: the case of Konongo. Foods, 2(3), 282-294. Mosley, S., & Steif, K. (2018). Urban Spatial. Retrieved from http://urbanspatialanalysis.com/portfolio/proof-of-concept-using-predictive- modeling-to-prioritize-building-inspections/ Min, J. H., & Lee, Y. C. (2005). Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters. Expert systems with applications, 28 (4), 603-614. Murphy, K. S., DiPietro, R. B., Kock, G., & Lee, J. S. (2011). Does mandatory food safety training and certification for restaurant employees improve inspection outcomes?. International Journal of Hospitality Management, 30 (1), 150-156. Nagi, J., Yap, K. S., Tiong, S. K., Ahmed, S. K., & Mohammad, A. M. (2008). Detection of abnormalities and electricity theft using genetic support vector machines. In TENCON 2008-2008 IEEE Region 10 Conference (pp. 1-6). IEEE. 
91 University of Ghana http://ugspace.ug.edu.gh National Restaurant Association. (2012).2013 Restaurant industry: Pocket factbook. Retrieved from http://www.restaurant.org/pdfs/research/Factbook2013_LetterSize.pdf. Nilsson, R., Peña, J. M., Björkegren, J., & Tegnér, J. (2007). Consistent feature selection for pattern recognition in polynomial time. Journal of Machine Learning Research, 8 (Mar), 589-612 Omaye, S. T. (2004). Food and nutritional toxicology. CRC press. Boca Raton, pp. 163-173 Parikh, U. B., Das, B., & Maheshwari, R. (2010). Fault classification technique for series compensated transmission line using support vector machine. International Journal of Electrical Power & Energy Systems, 32 (6), 629-636 Parry, R. M., Jones, W., Stokes, T. H., Phan, J. H., Moffitt, R. A., Fang, H., Shi L., Oberthuer., Fischer., Tong W., & Wang, M. D. (2010). k-Nearest neighbor models for microarray gene expression analysis and clinical outcome prediction. The pharmacogenomics journal, 10 (4), 292. Palaniappan, R., Sundaraj, K., & Sundaraj, S. (2014). A comparative study of the svm and k-nn machine learning algorithms for the diagnosis of respiratory pathologies using pulmonary acoustic signals. BMC bioinformatics, 15 (1), 223. Pochet, N. L. M. M., & Suykens, J. A. K. (2006). Support vector machines versus logistic regression: improving prospective performance in clinical decision- making. Ultrasound in Obstetrics & Gynecology, 27 (6), 607-608. Polat, K., & Güneş, S. (2007). Breast cancer diagnosis using least square support vector machine. Digital Signal Processing, 17 (4), 694-701. Qi, X., & Luo, R. (2015). Sparse principal component analysis in Hilbert space. Scandinavian Journal of Statistics, 42 (1), 270-289. 92 University of Ghana http://ugspace.ug.edu.gh Qian, Y., Zhou, W., Yan, J., Li, W., & Han, L. (2014). Comparing machine learning classifiers for object-based land cover classification using very high resolution imagery. 
Remote Sensing, 7 (1), 153-168 Qu, J., & Zuo, M. J. (2010). Support vector machine based data processing algorithm for wear degree classification of slurry pump systems. Measurement, 43 (6), 781- 791 Quackenbush, J. (2002). Microarray data normalization and transformation. Nature genetics, 32, 496. Rai, H., & Yadav, A. (2014). Iris recognition using combined support vector machine and Hamming distance approach. Expert systems with applications, 41 (2), 588- 593. Rana, M., Chandorkar, P., Dsouza, A., & Kazi, N. (2015). Breast cancer diagnosis and recurrence prediction using machine learning techniques. IJRET: International Journal of Research in Engineering and Technology eISSN, 2319-1163. Reske, K. A., Jenkins, T., Fernandez, C., VanAmber, D., & Hedberg, C. W. (2007). Beneficial effects of implementing an announced restaurant inspection program. Journal of Environmental Health, 69 (9), 27-35. Rheinländer, T., Olsen, M., Bakang, J. A., Takyi, H., Konradsen, F., & Samuelsen, H. (2008). Keeping up appearances: perceptions of street food safety in urban Kumasi, Ghana. Journal of Urban Health, 85 (6), 952-964. Rosenfeld, N., Aharonov, R., Meiri, E., Rosenwald, S., Spector, Y., Zepeniuk, M., ... & Lebanony, D. (2008). MicroRNAs accurately identify cancer tissue origin. Nature biotechnology, 26 (4), 462. Sahami, M., & Heilman, T. D. (2006). A web-based kernel function for measuring the similarity of short text snippets. In Proceedings of the 15th international conference on World Wide Web (pp. 377-386). AcM. 93 University of Ghana http://ugspace.ug.edu.gh Salazar, D. A., Vélez, J. I., & Salazar, J. C. (2012). Comparison between SVM and logistic regression: Which one is better to discriminate?. Revista Colombiana de Estadística, 35 (SPE2), 223-237. Samuels, P. (2016). Advice on Exploratory Factor Analysis. 10.13140/RG.2.1.5013.9766 Schenk, T., Leynes, G., Solanki, A., Collins, S., Smart, G., Abright, B., Crippen, C., (2014). "Food Inspection Forecasting - City of Chicago." 
Retrieved from https://github.com/Chicago/food- inspectionsevaluation/blob/master/REPORTS/forecasting-restaurants-with- critical-violations-in-Chicago.Rmd . Smart Cities Initiative (2018). Predictive Modeling of Building Fire Risk: Designing and evaluating predictive models of fire risk to prioritize property fire inspections. Metro21 Research Publication. Song, Y., Huang, J., Zhou, D., Zha, H., & Giles, C. L. (2007). Iknn: Informative k- nearest neighbor pattern classification. In European Conference on Principles of Data Mining and Knowledge Discovery (pp. 248-264). Springer, Berlin, Heidelberg. Song, F., Guo, Z., & Mei, D. (2010). Feature selection using principal component analysis. In System science, engineering design and manufacturing informatization (ICSEM), 2010 international conference on (Vol. 1, pp. 27-30). IEEE. Suguna, N., & Thanushkodi, K. (2010). An improved k-nearest neighbor classification using genetic algorithm. International Journal of Computer Science Issues, 7 (2), 18-21. Sun, S., & Huang, R. (2010). An adaptive k-nearest neighbor algorithm. In Fuzzy Systems and Knowledge Discovery (FSKD), 2010 Seventh International Conference on (Vol. 1, pp. 91-94). IEEE. 94 University of Ghana http://ugspace.ug.edu.gh Swiniarski, R. W. (2000). Data mining methods in face recognition. In Applications of Artificial Neural Networks in Image Processing V (Vol. 3962, pp. 52-60). International Society for Optics and Photonics. Tarigan, A., Dewi Agushinta, R., Suhendra, A., & Budiman, F. (2017). Determination of SVM-RBF Kernel Space Parameter to Optimize Accuracy Value of Indonesian Batik Images Classification. JCS, 13 (11), 590-599. Tessema, A. G., Gelaye, K. A., & Chercos, D. H. (2014). Factors affecting food handling Practices among food handlers of Dangila town food and drink establishments, North West Ethiopia. BMC public Health, 14 (1), 571 Thanh Noi, P., & Kappas, M. (2017). 
Comparison of Random Forest, k-Nearest Neighbor, and Support Vector Machine Classifiers for Land Cover Classification Using Sentinel-2 Imagery. Sensors, 18 (1), 18. Thome, A. C. G. (2012). SVM classifiers–concepts and applications to character recognition. In Advances in Character Recognition. InTech. Torgerson, P. R., de Silva, N. R., Fèvre, E. M., Kasuga, F., Rokni, M. B., Zhou, X. N., ... & Stein, C. (2014). The global burden of foodborne parasitic diseases: an update. Trends in Parasitology, 30 (1), 20-26. Tsai, C. F., Hsu, Y. F., Lin, C. Y., & Lin, W. Y. (2009). Intrusion detection by machine learning: A review. Expert Systems with Applications, 36 (10), 11994-12000. Übeyli, E. D. (2007). Comparison of different classification algorithms in clinical decision-making. Expert systems, 24 (1), 17-31. Uğuz, H. (2011). A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm. Knowledge-Based Systems, 24 (7), 1024-1032. Valentini, G. (2002). Gene expression data analysis of human lymphoma using 95 University of Ghana http://ugspace.ug.edu.gh support vector machines and output coding ensembles. Artificial Intelligence in Medicine, 26 (3), 281-304. Vapnik, V. (1998). Statistical learning theory. 1998. Wiley, New York. Vasickova, P., Dvorska, L., Lorencova, A., & Pavlik, I. (2005). Viruses as a cause of foodborne diseases: a review of the literature. Veterinární medicína, 50 (3), 89- 104. vector classification. Technical Report, Department of Computer Wainer, J. (2016). Comparison of 14 different families of classification algorithms on 115 binary datasets. University of Campinas. Wang, H., & Huang, G. (2011). Application of support vector machine in cancer diagnosis. Medical oncology, 28 (1), 613-618. Weinberger, K. Q., Blitzer, J., & Saul, L. K. (2006). Distance metric learning for large margin nearest neighbor classification. In Advances in neural information processing systems (pp. 
1473-1480). Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., & Vapnik, V. (2001). Feature selection for SVMs. In Advances in neural information processing systems (pp. 668-674). Wilson, M., Murray, A. E., Black, M. A., & McDowell, D. A. (1997). The implementation of hazard analysis and critical control points in hospital catering. Managing Service Quality: An International Journal, 7 (3), 150-156. Witten, I. H., & Frank, E. (2001). Data Mining Carl Hanser. München, Wien. Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE transactions on evolutionary computation, 1 (1), 67-82. World Health Organization (2006). Five keys to safer food manual. Retrieved from http://apps.who.int/iris/bitstream/handle/10665/43546/9789241594639_eng. pdf;jsessionid=EF9630A5F47F6217D2A3363E91B597DB?sequence=1 96 University of Ghana http://ugspace.ug.edu.gh World Health Organisation. (2015). "Situation Report on Cholera Outbreak in Ghana As of 12 April 2015 (Week 15)". Retrieved from https://reliefweb.int/report/ghana/situation-report-cholera-outbreak-ghana- 12-april-2015-week-15 World Health Organization. (2014). WHO initiative to estimate the global burden of foodborne diseases: fourth formal meeting of the Foodborne Disease Burden Epidemiology Reference Group (FERG): Sharing New Results, Making Future Plans, and Preparing Ground for the Countries. World Health Organization. (2015). Food Safety: What you should know. Retrieved from https://www.fda.gov/Food/GuidanceRegulation/HACCP/ucm2006801.html/ World Health Organization. (2015). WHO estimates of the global burden of foodborne diseases: foodborne disease burden epidemiology reference group 2007-2015. Wu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., ... & Zhou, Z. H. (2008). Top 10 algorithms in data mining. Knowledge and information systems, 14 (1), 1-37. Wu, Z., Zhang, H., & Liu, J. (2014). 
A fuzzy support vector machine algorithm for classification based on a novel PIM fuzzy clustering method. Neurocomputing, 125, 119-124. Yang, B. S., Hwang, W. W., Kim, D. J., & Tan, A. C. (2005). Condition classification of small reciprocating compressor for refrigerators using artificial neural networks and support vector machines. Mechanical Systems and Signal Processing, 19 (2), 371-390 Yang, J., & Honavar, V. (1998). Feature subset selection using a genetic algorithm. In Feature extraction, construction and selection (pp. 117-136). Springer, Boston, MA. Yang, Y. (1997). An Evaluation of Statistical Approaches to Text Categorization School of Computer Science. Carnegie Mellon University. 97 University of Ghana http://ugspace.ug.edu.gh Yang, Y. (1999). An evaluation of statistical approaches to text categorization. Information retrieval, 1(1-2), 69-90. Yiannas, F. (2008). Food safety culture: Creating a behavior-based food safety management system. Springer Science & Business Media. Yu, Q., Miche, Y., Sorjamaa, A., Guillen, A., Lendasse, A., & Séverin, E. (2010). OP- KNN: Method and applications. Advances in Artificial Neural Systems, 2010, 1. Zhang, W., Yoshida, T., & Tang, X. (2008). Text classification based on multi-word with support vector machine. Knowledge-Based Systems, 21 (8), 879-886. Zhu, X., Zhang, S., Jin, Z., Zhang, Z., & Xu, Z. (2011). Missing value estimation for mixed-attribute data sets. IEEE Transactions on Knowledge and Data Engineering, 23 (1), 110-121. 
APPENDIX

Table 5.1: Descriptive statistics of categorical features

Feature        Level  Frequency  Percentage
pastFail       Yes    16724      91.2
               No     1609       8.8
BlueInsp       Yes    15010      81.9
               No     3323       18.1
tobacco_sale   Yes    16648      90.8
               No     1685       9.2

SPSS output

Table 5.2: Communalities

Variable          Initial  Extraction
pastSerious       1.000    0.721
pastFail          1.000    0.673
timeSinceLast     1.000    0.393
pastCritical      1.000    0.302
ageAtInspection   1.000    0.088
license_insp      1.000    0.037
humidity          1.000    0.021
windSpeed         1.000    0.018
temperatureMax    1.000    0.015
PubAmusement      1.000    0.014
OrangeInsp        1.000    0.005
BrownInsp         1.000    0.003
mobFoodLic        1.000    0.000
sanitation        1.000    0.408
burglary          1.000    0.400
garbage           1.000    0.390
tobacco_sale      1.000    0.236
BlueInsp          1.000    0.132
seriousCount      1.000    0.129
package_goods     1.000    0.085
filling_station   1.000    0.077
GreenInsp         1.000    0.050
criticalCount     1.000    0.055
YellowInsp        1.000    0.028
catLiqLic         1.000    0.010
regBisLic         1.000    0.009
PurpleInsp        1.000    0.006
precipIntensity   1.000    0.001

Extraction Method: PCA

Figure 5.1: Scree plot

Table 5.3: Analysis of Deviance Table

                 Df  Deviance  Resid. Df  Resid. Dev  Pr(>Chi)
NULL                           16696      21304.4
pastFail          1      44.8  16695      21259.6     2.139e-11 ***
pastCritical      1      31.1  16694      21228.5     2.475e-08 ***
pastSerious       1      13.0  16693      21215.5     0.0003094 ***
seriousCount      1   12934.0  16692       8281.5     < 2.2e-16 ***
timeSinceLast     1      16.9  16691       8264.5     3.861e-05 ***
BlueInsp          1     192.9  16690       8071.6     < 2.2e-16 ***
tobacco_sale      1       8.5  16689       8063.1     0.0035944 **
heat_burglary     1       5.4  16688       8057.8     0.0203608 *
heat_garbage      1      18.4  16687       8039.3     1.782e-05 ***
heat_sanitation   1       0.4  16686       8038.9     0.5255327

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Table 5.4: Resampling results across the tuning parameters of the RBF kernel

sigma  C     ROC        Sens       Spec
0.010  0.25  0.8956507  0.7789700  0.9981074
0.010  0.50  0.8937628  0.7789700  0.9981074
0.010  0.75  0.8920192  0.7789700  0.9981074
0.010  1.00  0.8944685  0.7789700  0.9981074
0.010  1.25  0.8934944  0.7789700  0.9981074
0.010  1.50  0.8935073  0.7789700  0.9981074
0.015  0.25  0.8962297  0.7789700  0.9981074
0.015  0.50  0.8959974  0.7789700  0.9981074
0.015  0.75  0.8967519  0.7790057  0.9979812
0.015  1.00  0.8967626  0.7789700  0.9981074
0.015  1.25  0.8976418  0.7789700  0.9980173
0.015  1.50  0.8978426  0.7790057  0.9980533
0.200  0.25  0.9028196  0.7897895  0.9763524
0.200  0.50  0.9029592  0.7877184  0.9822641
0.200  0.75  0.9021967  0.7873969  0.9835799
0.200  1.00  0.9022667  0.7872896  0.9837600
0.200  1.25  0.9018410  0.7874679  0.9829307
0.200  1.50  0.9011346  0.7880042  0.9822819

R output

Table 5.5: Resampling results across the tuning parameters of the polynomial kernel

degree  scale  C     ROC        Sens       Spec
1       0.001  0.25  0.9124103  0.7789683  0.9981074
1       0.001  0.50  0.8944628  0.7789683  0.9981074
1       0.001  1.00  0.8877146  0.7789683  0.9981074
1       0.010  0.25  0.8875028  0.7789683  0.9981074
1       0.010  0.50  0.8871005  0.7789683  0.9981074
1       0.010  1.00  0.8877925  0.7789683  0.9981074
1       0.100  0.25  0.8857886  0.7789683  0.9981074
1       0.100  0.50  0.8892116  0.7789683  0.9981074
1       0.100  1.00  0.8893959  0.7789683  0.9981074
2       0.001  0.25  0.8837571  0.7789683  0.9981074
2       0.001  0.50  0.8813082  0.7789683  0.9981074
2       0.001  1.00  0.8858293  0.7789683  0.9981074
2       0.010  0.25  0.8828022  0.7789683  0.9981074
2       0.010  0.50  0.8850110  0.7789683  0.9981074
2       0.010  1.00  0.8843879  0.7789683  0.9981074
2       0.100  0.25  0.8815593  0.7792540  0.9976748
2       0.100  0.50  0.8824951  0.7790397  0.9977651
2       0.100  1.00  0.8843421  0.7792181  0.9974225
3       0.001  0.25  0.8803390  0.7789683  0.9981074
3       0.001  0.50  0.8821149  0.7789683  0.9981074
3       0.001  1.00  0.8864385  0.7789683  0.9981074
3       0.010  0.25  0.8914778  0.7789683  0.9981074
3       0.010  0.50  0.8895648  0.7789683  0.9981074
3       0.010  1.00  0.8914105  0.7789683  0.9981074
3       0.100  0.25  0.8882463  0.7829675  0.9915826
3       0.100  0.50  0.8888486  0.7831101  0.9908797
3       0.100  1.00  0.8890698  0.7839677  0.9903569

R output

Table 5.6: Resampling results across different K values

k   ROC        Sensitivity  Specificity
5   0.9059139  0.8045366    0.9392038
7   0.9070431  0.7974663    0.9532986
9   0.9075180  0.7946809    0.9628878
11  0.9079729  0.7908606    0.9675018
13  0.9089746  0.7910396    0.9708910
15  0.9100148  0.7900036    0.9709632
17  0.9107945  0.7901466    0.9725854
19  0.9112177  0.7907181    0.9726389
21  0.9118572  0.7909674    0.9738287
23  0.9125478  0.7903965    0.9736844
25  0.9120057  0.7905747    0.9739367
27  0.9127539  0.7900749    0.9748920
29  0.9132674  0.7906103    0.9749457
31  0.9128466  0.7910390    0.9751260
33  0.9132206  0.7913244    0.9748561
35  0.9136069  0.7913961    0.9748199
37  0.9137709  0.7917529    0.9746935
39  0.9134619  0.7915386    0.9752886
41  0.9136882  0.7917172    0.9749281
43  0.9132412  0.7913602    0.9752164

R output
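The tuning grids reported in Tables 5.4 and 5.6 were produced in R. As a minimal illustrative sketch only (this is not the thesis's code or data), the same style of cross-validated grid search, selecting the SVM's RBF parameters and the KNN neighbourhood size by ROC, can be reproduced in Python with scikit-learn. The synthetic data, grid values, and variable names below are assumptions chosen to echo the tables; scikit-learn's `gamma` plays the role of the RBF `sigma` in k(x, x') = exp(-gamma * ||x - x'||^2).

```python
# Illustrative sketch (assumed setup, not the thesis pipeline): cross-validated
# grid search over RBF-SVM (sigma/gamma, C) and KNN (k), scored by ROC AUC.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the ten PCA-selected inputs; ~22% positive class
# to mimic the imbalance visible in the Sens/Spec columns.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.78, 0.22], random_state=1)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# RBF-kernel SVM: tune gamma (sigma) and cost C on a grid like Table 5.4.
svm_grid = GridSearchCV(
    Pipeline([("scale", StandardScaler()), ("svm", SVC(kernel="rbf"))]),
    param_grid={"svm__gamma": [0.01, 0.015, 0.2],
                "svm__C": [0.25, 0.5, 1.0]},
    scoring="roc_auc", cv=cv).fit(X, y)

# KNN: tune k over the odd values 5, 7, ..., 43 as in Table 5.6.
knn_grid = GridSearchCV(
    Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())]),
    param_grid={"knn__n_neighbors": list(range(5, 45, 2))},
    scoring="roc_auc", cv=cv).fit(X, y)

print("SVM:", svm_grid.best_params_, round(svm_grid.best_score_, 4))
print("KNN:", knn_grid.best_params_, round(knn_grid.best_score_, 4))
```

To also report sensitivity and specificity per grid point, as the R output does, scikit-learn's multi-metric scoring (`scoring={...}` with `refit="roc_auc"`) can be used instead of a single scorer.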