University of Ghana http://ugspace.ug.edu.gh

UNIVERSITY OF GHANA

STATISTICAL ASSESSMENT OF IMPUTATION ALGORITHMS FOR ESTIMATION OF MISSING VALUES IN CROSS-SECTIONAL DATA

BY

OSCAR GYIMAH
10599415

THIS THESIS IS SUBMITTED TO THE UNIVERSITY OF GHANA, LEGON, IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE AWARD OF THE MPHIL STATISTICS DEGREE

October 19, 2018

DECLARATION

I hereby declare that this submission is my own work towards the award of the MPhil degree and that, to the best of my knowledge, it contains no material previously published by another person, nor material which has been accepted for the award of any other degree of this university or elsewhere, except where due acknowledgment has been made in the text.

OSCAR GYIMAH .......................... ....................
Student Signature Date
(10599415)

Certified by:
DR. ANANI LOTSI .......................... ....................
Supervisor Signature Date

Certified by:
DR. LOUIS ASIEDU .......................... ....................
Supervisor Signature Date

DEDICATION

This work is dedicated to my children: Samuel Macbeth Gyimah, Holiana Adjeiwaa Gyimah, Bomo-Yaa Gyimah, Yvonne Akua Tawiah Gyimah and Paul Nelson-Nyameyekesse Gyimah.

ACKNOWLEDGEMENT

First and foremost, I give thanks and appreciation to Almighty God, who has endowed me with the wisdom, knowledge and great opportunity to continue my education to this level. I am deeply grateful to my project supervisors, Dr. Anani Lotsi and Dr. Louis Asiedu, for their immeasurable advice, guidance and support throughout my MPhil programme. I am also indebted to my parents, Mr. and Mrs. Gyimah, for their countless assistance towards my upbringing and for their financial support.
I would also like to express my profound gratitude to Mr. Emmanuel Aidoo (PhD Statistics student) and Felix Dela Djokoto (MPhil Statistics student), who assisted me in using R to run my data, and to all the lecturers of the Department of Statistics at the University of Ghana. Last but not least, my sincere appreciation goes to my family for their invaluable support and prayers during the period of my study.

ABSTRACT

The validity and quality of data analysis rely largely on the accuracy and completeness of the data matrix. Missing values are an unavoidable problem in almost every research study and, if not handled properly, may lead to biased conclusions. This study investigated the efficacy and accuracy of convergence of five imputation algorithms: expectation maximization (EM), multiple imputation by chained equations (MICE), k nearest neighbor (KNN), mean substitution (MS) and regression substitution (RS), in estimating and replacing missing values in the cross-sectional World Population Data Sheet under the MCAR and MAR assumptions. This thesis used Little's test to assess whether the missing values in a given data matrix are MCAR. A multiple linear regression model was fitted to the complete World Population Data Sheet, after which missing values were artificially introduced into the complete data set at 5%, 10%, 20%, 30% and 40% under two missing data mechanisms (MCAR and MAR). The imputation algorithms were assessed and compared using the average coefficient difference (ACD) of the multiple linear regression (MLR) model, the mean absolute difference (MAD) and the coefficient of determination (R2). The study suggests that when data in the cross-sectional World Population Data Sheet are missing completely at random (MCAR) and normally distributed, regression substitution is the best approach.
The MICE algorithm was found to be comparatively the best method for replacing missing values under the MAR assumption. Since this thesis concentrates mainly on missing data imputation in a cross-sectional data set, it is recommended that future studies consider categorical and longitudinal data.

CONTENTS

DECLARATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES

1 INTRODUCTION
1.1 Introduction
1.2 Problem Statement
1.3 Objectives of the Study
1.4 Significance of the Study
1.5 Methodology
1.6 Motivation of the Research
1.7 Scope of the Study
1.8 Limitations
1.9 Thesis Organization

2 LITERATURE REVIEW
2.0 Introduction
2.1 Missing Values
2.2 Missing Data Mechanism
2.3 Ignorability Mechanism
2.4 Pattern of Missing Data
2.5 Traditional Methods of Treating Missing Data
2.5.1 Mean Substitution
2.5.2 K Nearest Neighbor (KNN) Imputation Algorithm
2.5.3 Regression Substitution
2.6 Modern Methods of Treating Missing Data
2.6.1 Expectation-Maximization (EM) Algorithm
2.6.2 Multiple Imputation by Chained Equation (MICE) Algorithm
2.7 Measures of Performance Assessment
2.7.1 Mean Absolute Difference (MAD)
2.7.2 Root Mean Squared Error (RMSE)
2.7.3 Coefficient of Determination (R2)
2.8 Multiple Linear Regression (MLR) Model

3 METHODOLOGY
3.1 Introduction
3.2 Source of Data
3.3 Research Design
3.4 Multiple Linear Regression (MLR)
3.4.1 The Multiple Linear Regression (MLR) Model
3.4.2 Matrix Representation of the Model
3.4.3 Assumptions of the Multiple Linear Regression
3.4.4 Testing for Overall Regression Significance
3.4.5 Testing for the Significance of the Slopes
3.4.6 Role of R2 and r2
3.4.7 Multicollinearity
3.4.8 Heteroscedasticity
3.4.9 Breusch-Pagan Test
3.4.10 Remedy for Assumption Violation
3.4.11 Outliers
3.4.12 Normality Test
3.5 Testing the Missing Data Mechanism (MCAR & MAR) Assumption
3.5.1 Little's Test of MCAR
3.6 Classification of Missing Data under the Assumptions of Various Missing Data Mechanisms
3.7 Imputation Algorithms for Treating Missing Values under the MCAR Mechanism
3.7.1 K Nearest Neighbors Imputation (KNN) Algorithm
3.7.2 Regression Substitution
3.7.3 Mean Substitution (MS)
3.8 Algorithms for Treating Missing Values under the MAR Mechanism
3.8.1 Expectation-Maximization (EM) Algorithm
3.8.2 Multiple Imputation by Chained Equation (MICE) Algorithm
3.9 Evaluation Assessment Criteria to Compare the Imputation Algorithms
3.9.1 Mean Absolute Difference (MAD)
3.9.2 Root Mean Squared Error (RMSE)
3.9.3 Coefficient of Determination
3.10 Data Analysis Procedure

4 DATA ANALYSIS AND DISCUSSION OF RESULTS
4.1 Introduction
4.2 Descriptive Statistics
4.3 Multiple Linear Regression (MLR) Model
4.4 Missing Data Mechanism Test
4.5 Comparison of Imputation Algorithms for Treating Missing Values
4.6 Comparison of Imputation Algorithms for Treating Missing Values under the MLR Model using ACD
4.6.1 Comparison of Imputation Algorithms for Treating Missingness under the MCAR Mechanism
4.6.2 Comparison of EM and MICE Algorithms for Treating Missingness under the MAR Mechanism using ACD
4.7 Comparison of Imputation Algorithms for Treating Missing Values using Mean Absolute Difference (MAD)
4.8 Comparison of Imputation Algorithms for Treating Missing Values using Coefficient of Determination (R2)

5 SUMMARY, CONCLUSION AND RECOMMENDATIONS
5.1 Introduction
5.2 Summary
5.3 Conclusion
5.4 Recommendations

REFERENCES
Appendix

LIST OF ABBREVIATIONS

ACD     Average Coefficient Difference
ANN     Artificial Neural Network
CD4     Cluster of Differentiation 4
CN2     Algorithm for rule induction
C4.5    Statistical classifier
EM      Expectation Maximization
EMI     Expectation Maximization Imputation
EMSI    Expectation Maximization Single Imputation
EMMI    Expectation Maximization Multiple Imputation
FC      Fractioning of Cases
FIML    Full Information Maximum Likelihood
LD      Listwise Deletion
KNN     K Nearest Neighbor
KNNSI   K Nearest Neighbor Single Imputation
MAD     Mean Absolute Difference
MAR     Missing at Random
MCAR    Missing Completely at Random
MCMC    Markov Chain Monte Carlo
MDTs    Missing Data Techniques
MI      Multiple Imputation
MICE    Multiple Imputation by Chained Equation
MLR     Multiple Linear Regression
MMSI    Mean or Mode Single Imputation
MSE     Mean Square Error
MS      Mean Substitution
NA      Not Available
NMAR    Not Missing at Random
OLS     Ordinary Least Squares
PD      Pairwise Deletion
RS      Regression Substitution
RMSE    Root Mean Square Error
SSE     Sum of Squares Error
SST     Sum of Squares Total
SVD     Singular Value Decomposition
Yc      Complete values of the dataset
Yo      Observed values of the dataset
Ym      Missing values of the dataset

LIST OF TABLES

3.1 The dataset with missing values
3.2 After replacement of missing values by the mean substitution technique
4.1 Classification of Life Expectancy at Birth (LEB) by 106 Countries
4.2 Correlation Matrix
4.3 Determination of Multicollinearity
4.4 Test of Normality and Constancy of Variance of Residual Terms
4.5 Summary of the Complete Original Dataset Model Coefficients (regression coefficient estimates, standard error, t-value and p-value)
4.6 Output of Little's MCAR test for MCAR
4.7 Output of Little's MCAR test for MAR
4.8 Imputation Algorithms for Treating Missing Values
4.9 Average Coefficient Difference of Missing Data under the KNN Imputation Algorithm relative to the Original Data of the MLR Model
4.10 Performance of KNN, Mean Substitution and Regression Substitution under MCAR using the ACD estimate
4.11 Performance of EM and MICE Algorithms under MAR using Average Coefficient Difference (ACD)
4.12 Performance of KNN, Mean Substitution and Regression Substitution for Treating Missing Values under the MCAR Mechanism using Mean Absolute Difference (MAD)
4.13 Performance of EM and MICE Algorithms for Treating Missing Values under the MAR Mechanism using Mean Absolute Difference (MAD)
4.14 Performance of KNN, Mean Substitution and Regression Substitution under the MCAR Mechanism using R2
4.15 Performance of EM and MICE Algorithms for Treating Missing Values under the MAR Mechanism using Coefficient of Determination (R2)

1 KNN IMPUTATION AT 5%
2 KNN IMPUTATION AT 10%
3 KNN IMPUTATION AT 20%
4 KNN IMPUTATION AT 30%
5 KNN IMPUTATION AT 40%
6 MEAN IMPUTATION AT 5%
7 MEAN IMPUTATION AT 10%
8 MEAN IMPUTATION AT 20%
9 MEAN IMPUTATION AT 30%
10 MEAN IMPUTATION AT 40%
11 REGRESSION IMPUTATION AT 5%
12 REGRESSION IMPUTATION AT 10%
13 REGRESSION IMPUTATION AT 20%
14 REGRESSION IMPUTATION AT 30%
15 REGRESSION IMPUTATION AT 40%
16 EM IMPUTATION AT 5%
17 EM IMPUTATION AT 10%
18 EM IMPUTATION AT 20%
19 EM IMPUTATION AT 30%
20 EM IMPUTATION AT 40%
21 MICE IMPUTATION AT 5%
22 MICE IMPUTATION AT 10%
23 MICE IMPUTATION AT 20%
24 MICE IMPUTATION AT 30%
25 MICE IMPUTATION AT 40%
26 The world population data sheet, 2011
27 The world population data sheet, 2011
28 The world population data sheet, 2011
29 The world population data sheet, 2011

LIST OF FIGURES

2.1 Important types of missing data
3.1 Step-by-step procedure of the research design
4.1 Graph of EM and MICE algorithms under MAR using average coefficient difference as the performance assessment criterion
4.2 Graph of KNN, mean substitution and regression substitution under MCAR using MAD as the performance assessment criterion
4.3 Graph of KNN, mean substitution and regression substitution algorithms under the MCAR mechanism using the coefficient of determination (R2) as the evaluation criterion
4.4 Graph of EM and MICE algorithms under the MAR mechanism using the coefficient of determination (R2) as the assessment criterion

CHAPTER 1
INTRODUCTION

1.1 Introduction

Governments, organizations and firms depend largely on data quality for decision making and for planning their operational activities. Data quality, the backbone of every organization, can be distorted by the massive presence of missing values or incomplete data. The presence of incomplete data is an unavoidable challenge in real-world situations and large-scale research studies. It often creates data anomalies and impurities in data analysis, and it affects the interpretation and visualization of research results.
Respondents or interviewees often fail to answer particular items of a survey questionnaire, countries do not collect statistics every year, and subjects drop out of studies, which results in missing values scattered throughout a data set (Honaker, King & Blackwell, 2015). Discarding these respondents at the analysis stage usually means throwing out a sizable amount of information, reducing the sample size, and potentially biasing parameter estimates (Little & Rubin, 2002). Missing values also reduce insight into the data and cause inefficient analyses and inaccurate decision making, which lead to a loss of statistical power and deceptive inferences.

The pervasiveness of missing values has encouraged many academic researchers to find solutions, develop models and evaluate methods for missing data treatment. Incomplete data are a serious challenge for statistical analysis, because most standard statistical techniques and software packages are programmed to work effectively and efficiently under the assumption that all records are fully observed on all variables in the analysis. To solve the problem of incomplete values in data sets, simply neglecting incomplete observations, deleting missing values, or replacing incomplete values with zero has serious limitations compared with the application of imputation algorithms (Meng & Shi, 2012).

An imputation algorithm is an iterative procedure employed to estimate and assign substitute values for incomplete values in the data matrix using closely related observed values. The beauty of imputation is that the treatment of incomplete data does not depend on the learning algorithm subsequently employed; hence, researchers from various disciplines may choose the imputation method best suited to their incomplete data problems.
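As a minimal illustration of the idea (a Python sketch rather than the R environment used in this thesis; the function name is ours), the simplest such procedure, mean substitution, replaces each missing entry of a variable with the mean of its observed values:

```python
import numpy as np

def mean_substitution(x):
    """Replace NaN entries of a 1-D array with the mean of the observed values."""
    x = np.asarray(x, dtype=float)
    filled = x.copy()
    filled[np.isnan(x)] = np.nanmean(x)  # observed mean stands in for each gap
    return filled

print(mean_substitution([2.0, np.nan, 4.0]))  # [2. 3. 4.]
```

More sophisticated algorithms such as EM, MICE and KNN follow the same contract, a data matrix with gaps in and a completed matrix out, but use the relationships between variables rather than a single column mean.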
Imputation involves replacing missing values to create a complete data set, in a way that accounts for both the natural variation in the data matrix and the uncertainty involved in replacing incomplete values. The goal of imputation is not to generate accurate predictions of the missing values themselves, but to replace them in a way that preserves the relationships among the variables, in order to exploit the available data from a partially observed individual (Little & Rubin, 2002).

This work applies some of the most widely used imputation algorithms for handling missing data and compares their efficiency in replacing missing values in a cross-sectional study. The Markov chain Monte Carlo (MCMC) approach has in the past been compared with some of the simplest imputation techniques, such as listwise deletion (LD), pairwise deletion (PD), mean or mode substitution, last value carried forward and hot deck replacement; those comparisons revealed that the MCMC approach provides the most efficient results in all situations. The simplest techniques have the disadvantages of reducing the sample size, producing inefficient parameter estimates and diminishing the sensitivity of statistical analyses, which lead to potentially biased conclusions. This study therefore compares the expectation maximization (EM) algorithm, multiple imputation by chained equations (MICE), k nearest neighbor (KNN) imputation, mean substitution (MS) and regression substitution (RS), all of which can deal with missing or incomplete values on the study variables. This work is distinctive because, to the best of our knowledge, no such substantial empirical study has previously been presented in the literature.

1.2 Problem Statement

Incomplete data normally exist in cross-sectional and longitudinal studies.
These unobserved values occur when the data set contains no recorded data points for some of the attributes. The problem of incomplete values is mostly attributed to withdrawal or non-response of respondents, unavailability of the scales of interest, loss of data due to transmission challenges, problems with monitoring and recording tools, and loss of data during coding and storage. Before completion of an intended cross-sectional study, some subjects may disappear or drop out, or any of the above-mentioned problems may occur; because of missing data on some attributes, researchers may then have to drop those cases from the analysis. More often than not, the records for such subjects are not available for statistical analysis.

The existence of unobserved values in the data matrix has severe implications. Missing data reduce the effectiveness of parameter estimates and diminish the sensitivity of the data analysis; that is, they affect the interpretation and conclusions of the study outcomes, the strength of the research design, the validity of inferences about relationships between attributes, and may decrease the representativeness of the sample (Morais, 2013). Incomplete data also diminish insight into the data and cause inefficient analyses and a loss of statistical power, leading to inaccurate and inefficient inferences about the population that are meant to guide stakeholders, decision makers and researchers. According to Horton and Kleinman (2007), data may be missing for many reasons, such as subject drop-out, interviewee non-response, non-coverage, misleading questions and confidentiality concerns, which may account for scattered missing data points in a study. Choosing the most suitable imputation approach to resolve the problem of incomplete data is a major challenge that data scientists encounter.
Moreover, missing values are often simply ignored instead of being filled by an imputation method or algorithm. Given the problems stated above, this study investigates the efficacy and accuracy of convergence of five imputation algorithms: expectation maximization (EM), multiple imputation by chained equations (MICE), k nearest neighbor imputation (KNNI), mean substitution and regression substitution, in estimating and replacing missing values in the cross-sectional World Population Data Sheet.

1.3 Objectives of the Study

The primary objective of this study is to identify the best imputation algorithm for estimating missing data. Specifically, the study seeks to:

• determine the most appropriate imputation algorithm to estimate missing values in real-life cross-sectional data;
• examine the main reasons why data are missing in cross-sectional studies;
• determine whether differences exist between the imputation algorithms' estimates and the multiple linear regression (MLR) model estimates based on the data.

1.4 Significance of the Study

The findings from this study are essential in the following ways. First, the outcomes will guide the general public, stakeholders and researchers in choosing a statistical imputation algorithm for missing value replacement. The study also explains the ideas behind statistical imputation algorithms for estimating missing cross-sectional data, especially for researchers and practitioners; effective use of these ideas could give a comprehensive picture of, and a clear path toward, solving the problem of missing data. The findings will also be of appreciable assistance to academic work, supporting existing theories and literature, and will serve as a guide for further studies in related fields.
1.5 Methodology

To identify the best imputation algorithm for reconstructing and replacing incomplete values in cross-sectional data, the following imputation algorithms were considered: expectation maximization (EM), multiple imputation by chained equations (MICE), k nearest neighbor imputation (KNNI), mean substitution (MS) and regression substitution (RS). Artificial simulation studies were created under the missing completely at random (MCAR) and missing at random (MAR) assumptions at different proportions of missing data, and the five imputation algorithms were used to reconstruct and replace the incomplete values in the data matrix. The multiple linear regression (MLR) model was then used to analyse the complete original data without missing values and each of the imputed data sets. The performance of the selected algorithms was assessed by comparing the average coefficient difference (ACD) of the multiple linear regression model, the mean absolute difference (MAD) and the coefficient of determination (R2).

1.6 Motivation of the Research

This study is motivated by the fact that improper handling of missing cross-sectional data can cause substantial bias and inappropriate results. Missing data problems occur in many research studies and are a common feature of data accumulation, for example when working with very large data sets. They are a huge challenge to researchers and practitioners, because most statistical methods and software packages are designed to perform effectively and efficiently when data are fully observed. This research focuses on deriving the most appropriate imputation algorithms and predictive models able to accommodate missing cross-sectional data.
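The evaluation loop described in Section 1.5, artificially deleting values at a chosen rate, imputing them, and scoring the imputations against the original complete data, can be sketched as follows. This is a minimal illustrative Python sketch under MCAR only, with mean substitution standing in for any of the five algorithms and simulated data in place of the World Population Data Sheet; it is not the thesis's own code.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_mcar(data, rate, rng):
    """Delete entries completely at random (MCAR) at the given rate."""
    out = data.astype(float)
    out[rng.random(out.shape) < rate] = np.nan
    return out

def impute_mean(data):
    """Mean substitution: fill each column's NaNs with its observed column mean."""
    out = data.copy()
    for j in range(out.shape[1]):
        col = out[:, j]
        col[np.isnan(col)] = np.nanmean(col)
    return out

def mad(original, imputed, mask):
    """Mean absolute difference between the true and the imputed entries."""
    return float(np.mean(np.abs(original[mask] - imputed[mask])))

complete = rng.normal(size=(100, 4))          # stands in for the complete data sheet
with_missing = make_mcar(complete, 0.10, rng)  # 10% MCAR missingness
mask = np.isnan(with_missing)
filled = impute_mean(with_missing)
print(round(mad(complete, filled, mask), 3))   # MAD score for this algorithm/rate
```

Repeating the loop over each algorithm and each missing rate (5% to 40%) yields the comparison tables reported in Chapter 4; a lower MAD (and an R2 closer to the complete-data model's) indicates a better imputation.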
In other words, applying accurate and proper imputation algorithms to the problem of missingness in a cross-sectional study reduces the loss of precision and power caused by dropping subjects with incomplete predictor variables, and reduces bias in parameter estimation. A proper imputation algorithm accounts for the natural variability in the independent variables and produces unbiased parameter estimates, so that valid statistical conclusions can be drawn.

1.7 Scope of the Study

This research used data from the 2011 World Population Data Sheet. The data are cross-sectional, because the data on 106 countries were collected at one point in time (2011) without repeated measurement. Secondary data on life expectancy at birth (LEB) and the other nine variables employed in this thesis were available for only one hundred and six countries (Population Reference Bureau, 2011). In this thesis the missingness is assumed ignorable (MCAR and MAR). This assumption implies that the reasons for the missingness in the data set are not known to anyone, including the researcher, and that the missingness occurs randomly; it may arise by chance or through factors the researcher cannot explain. The validity of the ignorable missing data assumption cannot be tested, and there is no existing theory to confirm it in this study. However, when more than one missing data mechanism leads to missing values, it is assumed that departures from the MAR mechanism are insignificant and will not distort prediction accuracy or conclusions by a wide margin.

1.8 Limitations

The following are some of the limitations encountered during the study:

• The treatment of missing data is an unavoidable statistical challenge, and researchers should be aware that no single imputation approach is known to perform best in all situations.
• The researchers encountered difficulties in creating percentages of missing values that conform to the MAR mechanism.

• The researchers also faced difficulty in obtaining data to facilitate the study.

1.9 Thesis Organization

This thesis is organised into chapters. The first chapter provides a short introduction to the research work: the background, problem statement, objectives, significance, research questions, methodology, scope and limitations of the study. The second chapter presents the literature review, which discusses work done by other researchers on the same or related topics. The third chapter describes the methodology employed in this thesis. The fourth chapter presents the results of the data analysis; it consists mainly of tables and graphical presentations of results for discussion. The last chapter offers a summary of the research findings, conclusions, recommendations and proposals for future research.

CHAPTER 2
LITERATURE REVIEW

2.0 Introduction

This chapter reviews diverse literature related to missing data imputation algorithms in order to uncover facts and findings which have previously been established and published by other investigators. A great number of methods and algorithms have been developed for estimating and replacing incomplete values in cross-sectional and longitudinal studies. The literature review consists of four sections. Firstly, various works by other researchers are reviewed. Secondly, missing data patterns and mechanisms are introduced. Thirdly, traditional and modern missing data imputation techniques (algorithms) are examined, and finally, the measures of performance assessment are reviewed.

2.1 Missing Values

The term missing (latent) or incomplete data, according to Day (1999), refers to "a data value that should have been recorded but, for some reason, was not".
Missing values create much complexity in modern research studies, since most data analysis procedures are not designed to accommodate them. Numerous published articles focus on the estimation and reconstruction of missing values in health-related data, while other studies have addressed related problems in other disciplines with varying degrees of sophistication. The issue has been considered in the context of respondents failing to answer all or some of the questions in research surveys, and of incomplete values in experiments (Little & Rubin, 2002). Rows with incomplete values may be used in further analyses after the estimation and reconstruction of the missing values. A great number of imputation algorithms exist for treating latent values, including hot deck imputation, mean imputation, regression imputation, cluster-based imputation, tree-based imputation, maximum likelihood estimation (MLE) and multiple imputation. Data scientists and other researchers have not only created many techniques for treating incomplete values, but have also characterised several kinds of missing values. Subsequent sections elaborate on the classification of the missing value mechanisms that give rise to incomplete data. As far as implementation and decision making are concerned, the presence of missing values constitutes a problem of crucial importance for end-user data analyses, since many methods and application software require complete data matrices. Susianto, Notodiputro, Kurnia and Wijayanto (2017), in their study 'A comparative work of imputation techniques for estimation of incomplete values of Per Capita Expenditure', compared and assessed three imputation procedures: the Yates method, the Expectation Maximization (EM) approach and the Markov Chain Monte Carlo (MCMC) technique.
These three methods were applied to a real data set of per capita expenditure at sub-district level in Central Java. The main objective of their study was to identify the best missing data imputation approach for imputing hidden values of per capita expenditure. The results revealed that the mean sum of squares generated by the Yates technique was smaller than that of the other two techniques, the EM and MCMC approaches. These outcomes were consistent with the mean absolute error of the Yates technique, which was also smaller than the mean absolute error produced by the other two algorithms. For these reasons, the Yates formula was advocated for substituting missing values of per capita expenditure at sub-district level in Central Java. Rahman and Islam (2011) compared two imputation algorithms, Decision tree based Missing value Imputation (DMI) and Expectation Maximisation Imputation (EMI), on two real data sets. Their investigation found that the EM technique displays more desirable imputation results on data sets with very strong interdependence between the variables. In addition, correlations between variables are natural characteristics of any given dataset; therefore, data values must not be altered or remoulded to enhance the relationships among the variables with the aim of obtaining more satisfactory imputation precision. Although DMI achieves remarkably better results than EMI on both data sets, its performance on a large data set (the Adult data) is clearly superior to its performance on a small data set (the Credit Approval data). This implies that DMI achieves more desirable results on large datasets than on smaller ones.
Because DMI uses EMI-based replacement on the records belonging to each leaf individually, for a small data set one may frequently end up with an inadequate number of records for obtaining a desirable outcome from the EM approach. Even so, in their investigation, DMI still produced better results than EMI on small data sets in most situations. Brown (1994) assessed five indirect approaches for estimating structural equation models with different percentages of incomplete values. The approaches comprised listwise deletion (LD), pairwise deletion (PD), mean imputation, hot-deck imputation, and similar response pattern imputation. Brown chose to focus on indirect techniques, explaining that in numerous cases direct techniques are not applicable in practice. Brown's work conflicts with more recent works that support the application of a direct technique for treating incomplete values in a structural equation model. He applied a simulation study of 10 attributes to structural equation modelling. Brown's study design comprised two different sample sizes (one hundred and five hundred), each with five percentage levels of unobserved values. By comparing the strengths of the indirect approaches, Brown examined four outcome criteria: problems of convergence, selection of the best-fitting model, bias in estimated parameters, and estimates of standard errors. Under the assumption that the data are MCAR, the indirect technique LD should produce very good estimators for all parameters. Batista and Monard (2003) studied the effects of four imputation techniques for treating incomplete values at various percentage levels of incomplete data. The algorithms explored were KNNSI, MMSI, and the internal approaches employed by fractioning cases (FC) and CN2 to handle incomplete values.
Incomplete data were synthetically simulated at various percentage rates of latent values and attributes in the datasets. The KNNSI algorithm displayed an excellent result compared to MMSI when incomplete values were confined to one variable. Nevertheless, both systems provided very good performance when incomplete data occurred in several variables. On the other hand, the fractioning cases (FC) algorithm achieved results as good as KNNSI. Twala, Cartwright and Shepperd (2005) evaluated the effect of the following incomplete data approaches, LD, EMSI, KNNSI, MMSI, EMMI, FC and SVD, on eight commercial datasets by synthetically reproducing three percentage levels of missingness, two patterns and three mechanisms of incomplete values. Their study revealed that EMMI displays the highest reliability rates, while other methods such as fractioning cases and EMSI also yielded good outcomes. The poorest approach was LD. Besides, their study showed that MCAR data are the cheapest to handle with multiple imputation. Batista and Monard (2001) examined the performance of ten nearest neighbour imputation (10-NNI) as an imputation technique, comparing it with three other approaches to incomplete values: mean or mode imputation, the statistical classifier C4.5 algorithm, and the CN2 technique. Their study suggested that the benefits of the technique are that it can predict both qualitative and quantitative variables, and that it does not generate explicit models, since it is a lazy learner. The work indicates that the technique offers excellent outcomes, preferable to the other three techniques (mean or mode imputation, C4.5 and CN2), especially for a very high proportion of latent values. The primary disadvantage of the 10-NNI approach, however, is that the algorithm searches through the entire dataset, which is restrictive for huge data sets, and the study relied only on the MCAR mechanism.
From the literature review, the imputation algorithms considered were used to impute missingness under different missing data mechanisms (MCAR, MAR and MNAR), but the studies failed to classify the imputation methods by missing data mechanism. Since some imputation methods work better and some work worse under different missing data mechanisms, it is important to match them to the relevant missing data mechanism before using them to replace missing data. This study therefore classifies imputation methods under different missing data mechanisms (MCAR and MAR) to examine the actual performance of the imputation algorithms under a specific missing data mechanism. Moreover, many prior studies are mainly concerned with comparing a single modern imputation algorithm, such as artificial neural networks (ANN), KNNI, EM, MCMC or full information maximum likelihood (FIML) methods, against traditional imputation techniques such as LD, PD, mean substitution, mode or median substitution, hot-decking and others. In those comparisons, modern imputation methods provide better estimates than traditional imputation methods. This study separately compares modern imputation algorithms as well as traditional imputation methods under a specific missing data mechanism to identify the best imputation method. Thus, no studies have examined the consequences of convergence for the five imputation algorithms considered here, namely expectation maximization (EM), multiple imputation by chained equation (MICE), k nearest neighbour imputation (KNNI), mean substitution (MS) and regression substitution (RS), which have the ability to substitute incomplete values and unknown parameters in large real databases.
This study therefore focuses on five imputation algorithms to estimate and reconstruct missing values in real-life application data, using the ignorable missing value mechanism assumptions (MCAR and MAR) and an arbitrary missing data pattern.

2.2 Missing Data Mechanism

Incomplete values occur for reasons beyond our control; hence, the properties of the processes that account for unobserved values need to be examined first. Basically, three kinds of missing value mechanisms, grouped into ignorable and non-ignorable missingness, are distinguished in the literature (Little & Rubin, 2002; Carpenter & Kenward, 2013). The missing value mechanism describes the connection between unobserved values and the values of variables in the data matrix, i.e. whether the missing values depend on the underlying values of the variables in the data matrix. As explained by Schafer (1997), given a complete dataset Yc, which consists of Yo, the observed part, and Ym, the missing part, the complete data matrix is Yc = (Yo, Ym). Schafer further defines a response indicator I with the same dimension as Yc: where Yc is observed, I = 1, and where Yc is missing, I = 0. The mechanisms are then defined as follows.

Missing Completely at Random (MCAR)

A dataset that is missing completely at random has no systematic arrangement of missing values among the attributes, and the missingness is connected neither to the observed values nor to the lost values (Acock, 2005; Bennett, 2001; Roth, 1994). If data satisfy the MCAR assumption, then the likelihood of obtaining a particular pattern of missing values is independent of both the observed and latent values (Hair, Black, Babin, Anderson & Tatham, 2006). In other words, the probability of having a missing value for a variable does not depend on either the observed or the missing values, i.e. Pr(I | Yc) = Pr(I).
The missing values have no dependency on any other attributes; they occur purely by chance and generally appear as a few individual points randomly distributed. Under the MCAR assumption, any missing value treatment approach can be employed without fear of introducing bias into the study. In practice, it is very difficult to establish whether data are MCAR, but Little (1988) established an omnibus statistical test of MCAR for this problem.

Missing at Random (MAR)

In a dataset with MAR, the likelihood of having a missing value is connected to another attribute in the study but not to the missing values themselves (Allison, 2001). The probability of drop-out for a variable depends on the observed values but not on the missing value itself, i.e. Pr(I | Yc) = Pr(I | Yo). In other words, under MAR the missing values are related to the observed data but not to the missing data (Roth, 1994; Schafer & Graham, 2002): the likelihood of a missing value is unrelated to the missing values in the study, the missingness depends on other observed values, and a missing value can therefore be estimated from other observed values. Missing values under MAR typically appear as a few consecutive points lost at one time, but the sets of missingness are randomly dispersed. According to Schlomer, Bauman and Card (2010), it is practicable to differentiate between MCAR and MAR by creating a dummy variable denoting whether values are missing on the attribute of interest, and then inspecting whether this dummy variable is connected with other attributes in the study. Whenever the dummy variable (missingness indicator) is unrelated to the other attributes, the missingness is regarded as MCAR rather than MAR or NMAR. Conversely, when the dummy variable is truly connected to other attributes, MAR is suggested instead of MCAR, although NMAR cannot be entirely ruled out.
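For illustration, the distinction between MCAR and MAR, together with the dummy-variable check of Schlomer, Bauman and Card (2010), can be sketched in code. This is an illustrative Python sketch on simulated data (the analyses in this thesis were carried out in R); the variable names, sample size and 30% missingness rate are arbitrary choices, not taken from the study data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)                 # fully observed covariate
y = 2.0 * x + rng.normal(size=n)       # variable that will lose values

# MCAR: missingness ignores both x and y
mcar_miss = rng.random(n) < 0.3

# MAR: missingness depends only on the OBSERVED x (larger x, more missing)
mar_miss = rng.random(n) < 1.0 / (1.0 + np.exp(-x))

# Dummy-variable check: correlate the missingness indicator with the
# observed covariate; a clear correlation points towards MAR.
r_mcar = np.corrcoef(mcar_miss.astype(float), x)[0, 1]
r_mar = np.corrcoef(mar_miss.astype(float), x)[0, 1]
print(abs(r_mcar) < 0.15, abs(r_mar) > 0.25)
```

Under MCAR the indicator is essentially uncorrelated with x, while under MAR the correlation is substantial, which is exactly the diagnostic logic described above.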
Because of NMAR, investigators cannot definitively establish whether a given data set is MAR or MCAR. However, investigators frequently presume MAR or MCAR when there is no indication to the contrary.

Not Missing at Random (NMAR)

The not missing at random (NMAR) mechanism is also called the non-ignorable mechanism (Schlomer, Bauman and Card, 2010). In a dataset with NMAR, the occurrence of the lost data points has a definite pattern: the likelihood of a missing value is connected to the outcome on that same attribute. This means that the probability of obtaining a missing value for an attribute depends on the value of that attribute itself, i.e. Pr(I | Yc) = Pr(I | Ym, Yo). The missing value is determined by other missing values and therefore cannot be estimated from an observed attribute. The difficulty in diagnosing NMAR is that the relationship between the missing values and the manner in which respondents would have answered the questions cannot be determined, since the missing values are unavailable. The mechanisms do not usually provide substantive reasons for the loss of data, but they do provide a mathematical formulation of the likelihood of incomplete data in relation to other attributes in the study. The non-ignorable missing type describes the likelihood of a lost data point as depending on its own value; it usually arises when the pattern of missingness is such that the missing values of Ym cannot be accurately predicted using other attributes in the database. The ignorable missing data type comprises the MCAR and MAR mechanisms. The diagram in Figure 2.1 describes the types of missing data.
[Figure 2.1: Important types of missing data. Missing data mechanisms are divided into non-ignorable (MNAR) and ignorable (MCAR and MAR).]

2.3 Ignorability Mechanism

Rubin (1976) emphasised that "there are two broad classes of missing data: missing data that is ignorable from the analysis, and missing data that is non-ignorable. If one can reasonably assume that missing data occur under either the MCAR or MAR conditions, then the problem is deemed ignorable, and the missingness process need not be explicitly modeled. Moreover, when data are MCAR or MAR, the likelihood-based and Bayesian frameworks allow to ignore the missingness process since they use only observed data, conditional on the model being correctly specified (Little & Rubin, 2002)". Conversely, if data exhibit NMAR, the missingness process cannot be excluded from the analysis (Little & Rubin, 2002). In the context of missing data classifications, ignorability, as it applies to missingness mechanisms, does not mean that investigators can ignore missing values; it refers to the fact that the factors causing missingness are unrelated or only weakly related to the estimated intervention effect. In a restricted sense, the term refers to whether missingness mechanisms must be modeled as part of the parameter estimation process or not (Allison, 2002). In addition, the importance of ignorability arises when one needs to evaluate the impact of missing data on the analysis and the study's conclusions. Because of the random nature of the missingness, MCAR data should show no systematic difference between complete and missing records in the results. In MAR data there is a systematic process underlying the missingness, but this effect can be modeled using the observed data (McKnight, McKnight, Sidani and Figueredo, 2007).
However, missing data are non-ignorable if the likelihood of a data point being missing depends on its value, even after controlling for other variables. Thus, the NMAR process violates the ignorability condition and requires suitable measures to account for the effects of data that are NMAR. Non-ignorable incomplete data are by far the most strenuous to handle and must be treated carefully; it is not practically easy to make an acceptable and reasonable analysis of data that are NMAR (Thijs, Molenberghs, Michiels, Verbeke & Curran, 2002).

2.4 Pattern of Missing Data

Basically, the pattern of incomplete values describes which values in the dataset are observed and which are not (missing). In cross-sectional studies with missing data in one or more variables, when the data are presented in wide format, with rows corresponding to subjects and columns corresponding to attributes, the matrix displays three main patterns of missing data: the univariate pattern, the monotone pattern and the arbitrary pattern. In a univariate pattern, missing values occur in only one attribute, and all remaining attributes are completely observed. In a monotone pattern with ordered variables, once a variable is missing, all succeeding variables are also missing, so a clear pattern can be seen among the missing data points. With an arbitrary missing pattern, there is no way to reorder the attributes to reveal an explicit pattern (SAS Institute, 2005). The arbitrary missing pattern is the most general pattern, in which different sets of attributes can be missing for different subjects. Assumptions about the pattern of missing values are therefore used to decide which algorithm is appropriate for solving the missing value problem. This thesis focuses on the arbitrary missing data pattern in multivariate datasets.
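The three patterns described above can be illustrated with small response-indicator matrices (an illustrative Python sketch; 1 denotes observed and 0 missing, following the indicator notation of Section 2.2, and the matrices are hypothetical examples).

```python
import numpy as np

# Rows are subjects, columns are ordered variables; 1 = observed, 0 = missing.
univariate = np.array([[1, 1, 0],
                       [1, 1, 1],
                       [1, 1, 0]])        # holes confined to one column

monotone = np.array([[1, 1, 1],
                     [1, 1, 0],
                     [1, 0, 0]])          # once missing, all later missing

arbitrary = np.array([[1, 0, 1],
                      [0, 1, 1],
                      [1, 1, 0]])         # no reordering reveals a pattern

def is_monotone(mask):
    """True if, within each row, an observed 1 never follows a missing 0."""
    return all(np.all(np.diff(row) <= 0) for row in mask)

print(is_monotone(monotone), is_monotone(arbitrary))  # True False
```

The check confirms that the monotone mask (and the univariate mask, as a special case) satisfies the ordering property, while the arbitrary mask does not.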
Unlike the univariate and monotone patterns, which can be handled by simple methods, the arbitrary missing data pattern may require more sophisticated algorithms.

2.5 Traditional Methods of Treating Missing Data

The traditional approaches for treating missing values are briefly discussed here: LD, PD, MS, RS, stochastic regression imputation, and hot decking. Generally speaking, missing value approaches can be separated into two classes: deletion approaches and imputation approaches. These techniques were once very popular and even dominant in applied research whenever researchers had to solve the problem of lost data. However, as research on handling incomplete values in multivariate data developed rapidly, many of these methods came to be regarded as unacceptable in structural equation modeling (Savalei & Bentler, 2009). Although many of these traditional methods are still frequently used in applied studies, researchers should be aware of their disadvantages and their consequences for analysis and parameter estimation.

2.5.1 Mean Substitution

With the mean substitution approach, the arithmetic mean of the observed values of a specific variable is estimated and then substituted into each of the missing data cells. This approach performs acceptably only if the variable considered is not nominal. Mean substitution treats missing data by substituting, for a given variable, each missing value with the mean of the observed values. The approach preserves the mean of the variable's distribution but distorts other characteristics of the variable's dispersion (Rubin, 1987; Cole, 2008). Allison (2002) showed that the mean substitution approach restricts the variability of a variable and changes its underlying distribution. Moreover, the mean substitution technique performs better when the missing data mechanism is MCAR.
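The mean substitution approach just described can be sketched as follows (an illustrative Python sketch; the small example matrix is hypothetical).

```python
import numpy as np

def mean_substitute(data):
    """Replace each NaN with the column (variable) mean of the observed values."""
    data = data.copy()
    col_means = np.nanmean(data, axis=0)       # per-variable observed means
    idx = np.where(np.isnan(data))             # locations of the holes
    data[idx] = np.take(col_means, idx[1])     # fill with the column mean
    return data

x = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])
filled = mean_substitute(x)
print(filled[1, 0], filled[2, 1])  # 2.0 15.0, the observed column means
```

Note that the imputed values reproduce the column means exactly, so the mean of each variable is preserved while its variability shrinks, which is precisely the distortion Allison (2002) describes.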
However, one disadvantage of mean substitution is that it leads to bias in parameter estimation (McDonald, Thurston, & Nelson, 2000; Pigott, 2001; Streiner, 2002).

2.5.2 K Nearest Neighbour (KNN) Imputation Algorithm

Nearest neighbour imputation (NNI) is a process of substituting the missing values of an instance B with plausible values obtained from a complete instance that is a close neighbour of B; it gives a feasible answer to this common problem. NNI is a non-parametric technique with an extensive record of implementation. The KNN imputation algorithm is an extended version of NNI which can reduce the problem of overfitting. With the KNN technique, the information in the incomplete instance is used solely for locating the nearest neighbour, or group of neighbours, of the instance with missing data. The fundamental rule is to identify the K nearest neighbours of the target variable across all N experiments, where N is the total number of experiments. If an attribute B has a missing value in experiment 1, the KNN approach locates K other attributes that have an observed value in experiment 1 and whose expression is closest to that of B over the remaining experiments. A weighted average of the values in experiment 1 from the K closest attributes is then used as the estimate of the missing value in attribute B. Usually, the Euclidean distance is employed to measure the interval between samples; for instance, the distance between two points p and q is given by

d(p, q) = d(q, p) = √(∑_{i=1}^{n} (p_i − q_i)²)   (2.1)

The process of replacing the missing values using the Euclidean distance may be summarised as follows:

1. For each attribute with missing values, compute its distance to all other attributes using the Euclidean distance in equation (2.1).

2. Sort the distances in ascending order and pick the K smallest distances.

3.
For discrete data, the mode of the K nearest neighbours' observed values is taken as the replacement value; for continuous data, the mean or median of the K nearest neighbours' observed values is used.

The main benefits of the KNN approach for estimating and replacing missing values are as follows:

• KNN may be used to predict both discrete and continuous cases.

• There is no need to build a forecasting model for each variable with missing values. In fact, KNN does not produce explicit models like other techniques; it is termed a lazy model. KNN may be easily adjusted to work with any variable as the class, by amending which variables are included in the distance metric. Besides, KNN can easily handle cases with multiple missing values.

• The major limitation of the KNN approach, however, is that before KNN finds the closest instances, the method searches through the entire dataset. This is a serious problem, because much statistical research aims at analysing large datasets.

2.5.3 Regression Substitution

The principle of the regression method is to use the observed values to fit a regression model. The attribute with missing data is the target variable, and the incomplete values are substituted by the predicted values from the regression equation. In an approach proposed by Yuan (2000), each variable with missing data is fitted with a regression model using the remaining variables as independent variables (i.e. covariates, or regressors). By applying the coefficients of the regression model, a new model is developed, and each attribute with missing values is then imputed from the developed regression model (Rubin, 1987). Let Zj be a continuous variable with missing data, which satisfies the expression

E(Zj) = β0 + β1C1 + ... + βMCM   (2.2)

formulated from the observations with recorded values for the attribute Zj and its covariates C1, C2, ..., CK.
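The KNN imputation steps of Section 2.5.2 can be sketched as follows. This illustrative Python sketch simplifies the description above: it matches complete rows (cases) rather than attributes, and uses an unweighted mean of the K neighbours' values in place of the weighted average; the example matrix is hypothetical.

```python
import numpy as np

def knn_impute(data, k=2):
    """Fill NaNs from the mean of the k nearest complete rows (Euclidean)."""
    data = data.copy()
    complete = data[~np.isnan(data).any(axis=1)]       # donor pool
    for i, row in enumerate(data):
        miss = np.isnan(row)
        if not miss.any():
            continue
        # Step 1: Euclidean distance on the observed coordinates only
        d = np.sqrt(((complete[:, ~miss] - row[~miss]) ** 2).sum(axis=1))
        # Step 2: sort and keep the k smallest distances
        nearest = complete[np.argsort(d)[:k]]
        # Step 3: mean of the neighbours' values fills the holes
        data[i, miss] = nearest[:, miss].mean(axis=0)
    return data

x = np.array([[1.0, 1.0, 1.0],
              [1.1, 0.9, 2.0],
              [9.0, 9.0, 9.0],
              [1.0, 1.0, np.nan]])
print(knn_impute(x, k=2)[3, 2])  # mean of the two nearest rows' values: 1.5
```

The incomplete fourth row is matched to the first two rows (the distant third row is ignored), and the missing entry becomes the mean of their third values, illustrating steps 1 to 3 above.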
Here K is the number of attributes in the study and 0 < M < K. Rubin (1987) emphasised that this approach presumes multivariate normality; the fitted model has residual variance σ2. The fitted regression yields the parameter estimates β = (β0, β1, ..., βM)′ and the associated covariance matrix σ2V, where V is the usual (C′C)−1 matrix derived from the intercept and the covariates C1, C2, ..., CM. In more detail, the regression algorithm works as follows. Candidate predictors of the attribute with incomplete values are identified from the correlation matrix. The best predictors are selected as explanatory attributes in a regression equation, with the attribute containing the missing values as the outcome (response) attribute. Cases with complete information on the explanatory attributes are used to build the regression equation, and the model is then used to predict the missing cases. By iteration, the values of the missing attributes are substituted, and all cases are then used to forecast the response variable. These steps are repeated until they converge. The regression coefficients obtained from the final cycle are the ones used to fill in the incomplete data. What distinguishes the regression substitution (RS) method from other kinds of imputation is that RS employs the main sources of information in the data to forecast the incomplete values and, technically, produces unbiased estimates for the incomplete data (McDonald et al., 2000). However, the disadvantages of the RS method have been judged to outweigh its advantages (Graham & Hofer, 2000; Little & Rubin, 2002). To begin with, since the substituted figures are predicted from other attributes, they typically fit too well: RS introduces no random noise, and therefore standard errors are understated (Allison, 2002).
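A minimal sketch of the regression substitution procedure just described, fitting a least-squares model on the complete cases and filling the holes with its predictions (illustrative Python on simulated data; for simplicity it performs a single pass rather than the iteration to convergence described above, and the coefficients are arbitrary).

```python
import numpy as np

def regression_substitute(z, C):
    """Impute NaNs in z from a least-squares fit of z on covariates C."""
    miss = np.isnan(z)
    X = np.column_stack([np.ones(len(z)), C])    # intercept plus covariates
    # Fit on the cases where z is observed
    beta, *_ = np.linalg.lstsq(X[~miss], z[~miss], rcond=None)
    z = z.copy()
    z[miss] = X[miss] @ beta                     # fitted values fill the holes
    return z

rng = np.random.default_rng(2)
C = rng.normal(size=(200, 2))
z = 1.0 + 3.0 * C[:, 0] - 2.0 * C[:, 1] + rng.normal(scale=0.1, size=200)
z_miss = z.copy()
z_miss[:20] = np.nan                             # delete the first 20 values
z_hat = regression_substitute(z_miss, C)
print(float(np.abs(z_hat[:20] - z[:20]).max()))  # small reconstruction error
```

Because the deleted values obey the fitted linear model up to a small noise term, the imputations land close to the true values; note also that the fills lie exactly on the regression surface, illustrating the loss of random noise criticised by Allison (2002).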
Also, the regression substitution model assumes that the relationships among the variables are linear, which in some instances may not hold. This issue can lead to overestimated parameters and smaller significance values, resulting in invalid statistical inferences. Finally, with the regression substitution approach, imputing the missing values is more difficult and less workable when the attributes with incomplete values are highly intercorrelated (Raaijmakers, 1999). The most distinctive advantage of the RS method is the ready availability of software implementing the technique.

2.6 Modern Methods of Treating Missing Data

It is significant to observe that although the traditional imputation procedures were widely employed, owing to their simplicity and general availability in application software, many of them provide unsatisfactory results (Enders, 2001; Little & Rubin, 1987). Nowadays, statisticians and other researchers have developed many methods and algorithms for imputing missing data which have undergone substantial refinement. The expectation maximization (EM) algorithm, the MICE algorithm and the full information maximum likelihood (FIML) method have gained popularity in recent times because of their superiority over the traditional methods. These algorithms provide consistent, asymptotically normal and coherent parameter estimates under the MAR assumption (Allison, 2002; Schafer & Graham, 2002).

2.6.1 Expectation-Maximization (EM) Algorithm

The EM algorithm is an iterative procedure used to compute maximum likelihood estimates in the presence of latent or missing data. With ML, we are interested in estimating the model parameters under which the observed values are most probable. The EM approach, initially developed by Dempster, Laird and Rubin (1977), is an iterative algorithm for maximising the likelihood determined by a parametric model for observed data.
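The EM iteration can be illustrated with a deliberately minimal univariate sketch (illustrative Python, assuming a normal model and MCAR missingness; in this simple setting EM converges to the observed-data maximum likelihood estimates).

```python
import numpy as np

def em_normal(y, n_iter=50):
    """EM for the mean and variance of a normal sample containing NaNs."""
    obs = y[~np.isnan(y)]
    m = int(np.isnan(y).sum())
    n = len(y)
    mu, var = obs.mean(), obs.var() + 1.0        # crude starting values
    for _ in range(n_iter):
        # E-step: expected sufficient statistics given current parameters
        s1 = obs.sum() + m * mu                  # E[sum y]
        s2 = (obs ** 2).sum() + m * (mu ** 2 + var)  # E[sum y^2]
        # M-step: re-maximise the complete-data likelihood
        mu = s1 / n
        var = s2 / n - mu ** 2
    return mu, var

rng = np.random.default_rng(3)
y = rng.normal(5.0, 2.0, size=500)
y[:100] = np.nan                                 # 20% missing, MCAR
mu, var = em_normal(y)
obs = y[~np.isnan(y)]
print(abs(mu - obs.mean()) < 1e-8, abs(var - obs.var()) < 1e-6)
```

Even though the variance is deliberately started one unit too high, the E-step/M-step cycle contracts the error at every iteration and the estimates settle at the observed-data maximum likelihood values.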
The EM technique for missing values depends largely on the maximum likelihood estimate of the covariance structure given the available data. Each repetition of the EM approach comprises two steps: the Expectation step (E-step) and the Maximization step (M-step). In the E-step, regression equations based on the available values are used to compute the expected values of the missing entries, which are replaced by the conditional means derived from the regression models. In the M-step, the estimates obtained from the E-step are updated to increase the log likelihood of the current parameters over the previous state. These two steps are repeated for a number of iterations, and under some regularity assumptions the algorithm converges to a stationary point (Allison, 2002; Dempster et al., 1977).

2.6.2 Multiple Imputation by Chained Equation (MICE) Algorithm

The MICE approach was initiated by Van Buuren and Groothuis-Oudshoorn (2011). MICE is a Markov Chain Monte Carlo (MCMC) system in which the state space is the collection of all the imputed values. As with all Monte Carlo procedures, the MICE technique has to fulfil three conditions for convergence to take place (Van Buuren, 2012):

1. The chain is irreducible. The chain should be able to reach all parts of the state space.

2. The chain is aperiodic. The chain must not oscillate back and forth between separate states.

3. The chain is recurrent. The probability of the chain starting from state j and returning to j is one.

In practice, the convergence of the MICE approach is attained after an acceptably small number of iterations, commonly between five and twenty (Liu & Brown, 2013). Liu and Brown emphasised that roughly five iterations are usually acceptable, though a few cases may demand a much larger number of iterations.
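A single chain of the chained-equation idea can be sketched as follows (illustrative Python; it uses deterministic least-squares fills rather than the draws from conditional distributions used by a full MICE implementation, and runs one chain only on simulated data).

```python
import numpy as np

def chained_imputation(data, n_iter=10):
    """Sketch of one chain: cycle regression imputations column by column."""
    data = data.copy()
    miss = np.isnan(data)
    col_means = np.nanmean(data, axis=0)
    for j in range(data.shape[1]):               # initialise with column means
        data[miss[:, j], j] = col_means[j]
    for _ in range(n_iter):                      # one pass = one cycle
        for j in range(data.shape[1]):
            if not miss[:, j].any():
                continue
            others = np.delete(data, j, axis=1)
            X = np.column_stack([np.ones(len(data)), others])
            # regress column j on the others using its observed rows
            beta, *_ = np.linalg.lstsq(X[~miss[:, j]], data[~miss[:, j], j],
                                       rcond=None)
            # overwrite the holes with the fitted values
            data[miss[:, j], j] = X[miss[:, j]] @ beta
    return data

rng = np.random.default_rng(4)
x = rng.normal(size=300)
y = 2.0 * x + rng.normal(scale=0.1, size=300)
d = np.column_stack([x, y])
d[:30, 1] = np.nan                               # 10% of y missing
filled = chained_imputation(d)
print(float(np.abs(filled[:30, 1] - y[:30]).max()))  # small reconstruction error
```

Each cycle regresses every incomplete variable on all the others and refreshes its fills, mirroring the mean-initialise, regress, replace stages of the MICE chain.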
MICE requires the researcher to specify a conditional distribution for each attribute, using the other attributes as regressors. This means each attribute can be modeled according to its own distribution; for instance, continuous data can be modeled with Bayesian linear regression and a binary variable with logistic regression (Azur, Stuart, Frangakis & Leaf, 2012). The technique operates by repeatedly replacing the incomplete values according to the specified conditional equations until convergence is achieved. A cycle is separated into three stages. In stage 1, every missing cell of each attribute is substituted by the arithmetic mean of that attribute. In stage 2, the attribute treated in stage 1 is related to the other attributes of the dataset by regression: it serves as the response variable in the model and the other attributes as predictors. In stage 3, the missing values of that attribute are replaced by the predictions from the stage-2 regression equation. Stages 1 to 3 are repeated for each attribute with missing values; one pass over all such attributes constitutes an iteration or cycle. By the end of an iteration, all incomplete cells have been substituted with quantities predicted by the regression equations. Stages 2 and 3 are then repeated for further iterations, the replacements being updated in each iteration.

2.7 Measures of Performance Assessment

The following performance metrics are used as criteria to assess the best algorithm for substituting missing values in cross-sectional data: the mean absolute difference (MAD), the root mean squared error (RMSE) and the coefficient of determination (R2).

2.7.1 Mean Absolute Difference (MAD)

The MAD is a statistical measure of dispersion.
The MAD is also described as the average absolute difference between two values drawn from a probability distribution; here, the MAD is the arithmetic mean of the absolute differences between observed and imputed values. A smaller MAD indicates less dispersion between the original and imputed data, so the algorithm with the smallest MAD is recommended for substituting missing values.

2.7.2 Root Mean Squared Error (RMSE)

The root mean squared error (RMSE) is a performance indicator that measures the average magnitude of the residuals. It compares the original data with the imputed data; essentially, it is the standard deviation of their differences. It is a valuable indicator of overall accuracy that shows how each imputation algorithm performs on a data set. In the literature, the most efficient imputation algorithm is the one with the lowest RMSE (Huang & Carriere, 2006): the smaller the RMSE, the better the performance. Chai and Draxler (2014) note that "the RMSE has been used as a standard statistical metric to measure model performance in meteorology, air quality, and climate research studies". In the geosciences, the RMSE is likewise treated as one of the standard indicators for model residuals (Savage et al., 2013), although some researchers avoid the RMSE in favour of the MAE, citing the limitations of the RMSE stated by Willmott, Matsuura and Robeson (2009). One merit of using the RMSE instead of the MAE is the avoidance of the absolute value, which is inconvenient in many statistical computations (Chai & Draxler, 2014). Mathematically, the RMSE is given as:

RMSE = √( (1/n) ∑ᵢ (X_io − X_im)² )   (2.3)

where i = 1, 2, ..., n.
Here n is the sample size, X_io are the observed values and X_im the imputed values (Schmitt, Mandel & Guedj, 2015).

2.7.3 Coefficient of Determination (R2)

The coefficient of determination (R2) gives the proportion of variability in the dependent variable that is explained by the independent variables. R2 ranges from 0 to 1: the model has strong predictive ability, with the regression line fitting the data closely, when R2 is near 1, and explains little when R2 is near 0. This metric is a good indicator of overall predictive accuracy; it measures how well the model represents the observations in the dataset. In fitting a regression line, the closer the line lies to all the points on the scatter diagram, the greater the share of total variation the model explains; conversely, when most points deviate far from the regression line, only a small amount of the variation is accounted for.

2.8 Multiple Linear Regression (MLR) Model

The study used an MLR equation to analyse the original complete dataset (without missing values). Each imputation algorithm is then used to estimate and replace missing data in order to identify the best algorithm. Linear regression analysis relates a dependent attribute to its covariates; the fundamental objective of regression analysis is to build a statistical model relating the dependent variable to the independent variables. According to Anghelache and Scala (2016), there are three kinds of regression models: the variable-based degree model (VBDD), the linear regression model, and the change-point model. These forms of regression equations employ generalized least squares regression to determine the model coefficients.
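The two accuracy measures of Section 2.7 can be computed directly from an observed and an imputed series. A minimal pure-Python sketch, with illustrative data:

```python
import math

# MAD: mean of absolute differences between observed and imputed values.
# RMSE (eq. 2.3): square root of the mean squared difference.

def mad(observed, imputed):
    n = len(observed)
    return sum(abs(o - m) for o, m in zip(observed, imputed)) / n

def rmse(observed, imputed):
    n = len(observed)
    return math.sqrt(sum((o - m) ** 2 for o, m in zip(observed, imputed)) / n)

obs = [10.0, 12.0, 14.0]
imp = [11.0, 12.0, 12.0]
print(mad(obs, imp))   # (1 + 0 + 2) / 3 = 1.0
print(rmse(obs, imp))  # sqrt((1 + 0 + 4) / 3) ≈ 1.291
```

As the squaring in the RMSE suggests, it penalizes the single 2-unit error more heavily than the MAD does, which is why the two criteria can rank imputation algorithms differently.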
This thesis adopts a multiple linear regression model to analyse the real-life application data from the World Population Data Sheet (2011). MLR is an extended version of simple linear regression: MLR equations are employed to determine the linear relationship between a dependent attribute and several regressors when fitting a straight-line model to the observed data (Coelho-Barros, Simoes, Achcar, Martinez and Shimano, 2008). The general multiple linear regression model has the form

Yi = β0 + β1X1i + β2X2i + ... + βkXki + εi   (2.4)

where Y is the dependent variable, X1, X2, ..., Xk are the independent or explanatory variables, i indexes the n sample observations, ε is the random error term and β0, β1, ..., βk are the regression coefficients.

CHAPTER 3
METHODOLOGY

3.1 Introduction

This chapter expounds the techniques used in this study and briefly discusses the algorithms employed in the investigation. It is divided into five main parts. The first part describes the source of the data and the research design. Section two deals with the methodological framework of the multiple linear regression (MLR) model. Section three describes how the missing data mechanism (MCAR or MAR) is tested. Section four covers the assumptions of MCAR and MAR, together with the classification of missing values under the various missing data mechanisms. Finally, section five briefly explains the mean absolute difference (MAD), root mean squared error (RMSE) and coefficient of determination (R2) used as performance assessment criteria to compare the imputation algorithms and identify the best. It also displays an outline of the overall data analysis procedure.

3.2 Source of Data

This study illustrates the application of imputation techniques to a real-life dataset, the World Population Data Sheet, 2011 (Population Reference Bureau, 2011); secondary data is thus used in this thesis.
The Population Reference Bureau is a non-profit organization which publishes an annual world population data sheet, a chart filled with information from about two hundred countries on essential demographic characteristics and health-related issues, for example population density, maternal mortality, life expectancy at birth, HIV/AIDS prevalence, total population estimates, poverty, and contraceptive usage (Population Reference Bureau, 2013). These data serve as a key resource for academic research, stakeholders, practitioners and policy makers. The dataset used here comprises 106 observations on 10 variables. Life expectancy at birth (LEB), the target variable, and the nine other variables utilized in this study are available for exactly one hundred and six countries (Population Reference Bureau, 2011). Based on their estimated LEB, the countries have been partitioned into three categories: countries with a small LEB estimate, countries with an average LEB estimate and countries with a large LEB estimate. The LEB is a single index of mortality that condenses mortality conditions, giving the mean number of years a cohort would be expected to live if subjected to the age-specific mortality rates of a given period (Pollard, 1988). The dependent variable, life expectancy at birth (LEB), and the nine independent variables that account for LEB in the year 2011 are as follows:

URBAN: total number of people living in urban towns;
CMW: total number of married women of child-bearing age practising birth control;
GNIPP: gross national income converted to international dollars;
DEN: number of people per square kilometre;
RWS: total number of rural people with access to purified water supply;
IMR: total number of child deaths under one year;
TFR: total fertility rate;
DEPPOP: total number of dependent people;
POVERTY: total number of people who live on less than 2 dollars per day.

3.3 Research Design

The following plan served as the research design. A multiple linear regression (MLR) model is fitted to the original, complete data matrix. Missingness at rates of 5%, 10%, 20%, 30% and 40% is then created artificially in the original data. Little's test is performed in order to confirm whether the missingness is MCAR or MAR. Selected imputation algorithms are then employed under the MCAR and MAR mechanisms to estimate and replace the missing values created in the original data. After each imputation algorithm has estimated and replaced the artificially created missing values, an MLR model is fitted to re-estimate the coefficients and their standard errors. All the estimated models are compared on the chosen evaluation criteria, and the model closest to the original-data model identifies the recommended imputation algorithm.

Diagrammatical Representation of the Research Design
Figure 3.1: Step by step procedure of the research design

3.4 Multiple Linear Regression (MLR)

It is important to note that the MLR model is among the most widely employed techniques in the field of statistics. It helps the researcher determine the relationship between a response attribute and a number of predictor attributes, and regression analysis is arguably the most robust approach for studying in detail the relationships among the variables in a given data set. This study treats missing values throughout as ignorable. The assumption of an ignorable missing-value mechanism implies that the reason why values are missing need not be modelled. Ignorable missingness is an umbrella term comprising the missing completely at random (MCAR) and missing at random (MAR) mechanisms.
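The design step that deletes a given percentage of cells completely at random can be sketched directly; the function below (names and data illustrative) blanks out `rate` of the cells of a complete matrix, marking them with `None`, which corresponds to the MCAR deletion used in this design.

```python
import random

def make_mcar(matrix, rate, seed=1):
    """Delete `rate` of all cells uniformly at random (MCAR)."""
    rng = random.Random(seed)  # fixed seed for a reproducible sketch
    out = [row[:] for row in matrix]
    cells = [(i, j) for i in range(len(out)) for j in range(len(out[0]))]
    for i, j in rng.sample(cells, round(rate * len(cells))):
        out[i][j] = None
    return out

complete = [[float(10 * i + j) for j in range(10)] for i in range(10)]
with_gaps = make_mcar(complete, rate=0.20)
missing = sum(v is None for row in with_gaps for v in row)
print(missing)  # 20 of the 100 cells
```

Because the deleted positions depend on nothing but the random draw, the resulting missingness pattern is MCAR by construction; generating MAR patterns would instead require deletion probabilities that depend on observed values.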
3.4.1 The Multiple Linear Regression (MLR) Model

The MLR model assumes a linear relationship between the response attribute Yi and a set of predictor attributes Xiᵀ = (Xi0, Xi1, ..., Xik), where the initial predictor Xi0 = 1 is a constant unless otherwise stated. Each observation Yi may then be modelled as:

Yi = β0 + β1Xi1 + β2Xi2 + ... + βKXiK + εi   (3.1)

where ε ∼ N(0, σ²). This equation is called the MLR model: the Yi are the response values, β1, β2, ..., βK are the regression coefficients, β0 is the constant term (the value when no covariates enter the model) and ε is the residual term. The mean of the dependent attribute Y, a linear expression in the coefficients β0, β1, ..., βK, is

E(Y) = β0 + β1X1 + ... + βKXK   (3.2)

3.4.2 Matrix Representation of the Model

The function connecting the response attribute Y to the predictor variables X1, X2, ..., XK is

Y = β0 + β1X1 + ... + βKXK + ε

With N independent observations on Y and the associated values of X, the model becomes

Y1 = β0 + β1X11 + β2X12 + ... + βKX1K + ε1
Y2 = β0 + β1X21 + β2X22 + ... + βKX2K + ε2
...
YN = β0 + β1XN1 + β2XN2 + ... + βKXNK + εN

In matrix notation, the model is Y = Xβ + ε, where Y = (Y1, ..., YN)ᵀ, ε = (ε1, ..., εN)ᵀ, β = (β0, β1, ..., βK)ᵀ and X is the design matrix whose ith row is (1, Xi1, ..., XiK), with E(ε) = 0 and Cov(ε) = σ²I. Thus Y is an N × 1 column vector, X an N × (K + 1) matrix, β a (K + 1) × 1 column vector and ε an N × 1 column vector.
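The least-squares solution of the matrix model above is the standard normal-equations estimate β̂ = (XᵀX)⁻¹XᵀY. As a sketch, the one-predictor-plus-intercept case is small enough that the 2 × 2 inverse can be written out explicitly in pure Python (illustrative data):

```python
def ols_normal_equations(x, y):
    """Solve b = (X'X)^{-1} X'Y for the design matrix X = [1, x]."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(a * a for a in x)
    sxy = sum(a * b for a, b in zip(x, y))
    # X'X = [[n, sx], [sx, sxx]]; invert it explicitly via its determinant.
    det = n * sxx - sx * sx
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1

b0, b1 = ols_normal_equations([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(b0, b1)  # 1.0 2.0  (the exact line y = 1 + 2x)
```

With K predictors, the same formula applies but the inverse is taken of the (K + 1) × (K + 1) matrix XᵀX, which statistical software computes numerically.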
3.4.3 Assumptions of the Multiple Linear Regression

• The residuals are assumed to be normally distributed with mean zero and an unknown common variance σ².
• The errors are uncorrelated; that is, they are independent.
• The predictors X are regarded as fixed by the investigator and measured with negligible error.
• The sum of the residuals weighted by the corresponding fitted values Ŷ is zero.

3.4.4 Testing for Overall Regression Significance

H0 : β1 = β2 = ... = βk = 0
H1 : βi ≠ 0 for at least one i = 1, 2, ..., k

The test statistic is

F* = MSR / MSE   (3.3)

where F* is the observed value of F, MSR is the mean square due to regression and MSE the mean square error. Denoting the null and alternative hypotheses by H0 and H1, the decision rule has the form: if F* ≤ F(α, a, b), fail to reject H0; if F* > F(α, a, b), reject H0, where α is the significance level of the test, a and b are the numerator and denominator degrees of freedom, and F(α, a, b) is the critical (table) value of F. Failing to reject H0 implies that the overall regression is not statistically significant; otherwise, the overall regression is statistically significant.

3.4.5 Testing for the Significance of the Slopes

To determine the significant contribution of a particular variable to the model, the appropriate hypothesis is formulated and the t-test statistic is used. To test the contribution of X1, whose regression coefficient is β1, the hypotheses are:

H0 : β1 = 0
H1 : β1 ≠ 0

The test statistic is

t = β̂1 / s.e.(β̂1) ∼ t(n−k−1)   (3.4)

where s.e.(β̂1) is the standard error of β̂1,

s.e.(β̂1) = √(σ̂² C11)   (3.5)

C11 is the second diagonal element of the (XᵀX)⁻¹ matrix and σ̂² = MSE = SSE/(n − k − 1), with n observations and k predictors.

Decision Rule and Conclusion

If |t| ≤ t(n−k−1), fail to reject H0; if |t| > t(n−k−1), reject H0. When we fail to reject H0, we conclude that the variable X1 does not contribute significantly to the model; otherwise, it does.
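The two tests above can be sketched for the one-predictor special case, where the overall F* of eq. 3.3 and the slope t of eq. 3.4 are linked by the identity t² = F*. A pure-Python sketch with illustrative data:

```python
import math

def slope_tests(x, y):
    """Return (F*, t) for the one-predictor model y = b0 + b1*x + e."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    fitted = [b0 + b1 * a for a in x]
    sse = sum((b - f) ** 2 for b, f in zip(y, fitted))
    ssr = sum((f - my) ** 2 for f in fitted)
    mse = sse / (n - 2)            # error d.f. = n - k - 1 with k = 1
    f_star = ssr / mse             # MSR = SSR / 1 for a single predictor
    t = b1 / math.sqrt(mse / sxx)  # s.e.(b1) = sqrt(MSE / Sxx)
    return f_star, t

f_star, t = slope_tests([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(abs(t ** 2 - f_star) < 1e-6)  # True
```

The observed statistics would then be compared with the F and t critical values at the chosen α, exactly as the decision rules above describe.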
3.4.6 Roles of R² and r²

R² measures the percentage of the total variation in the response attribute that is explained by the overall regression equation. The higher the value of R², the greater the percentage of the variance explained by the fitted equation, indicating a better-formulated regression model. By contrast, a partial r² measures the marginal contribution of one variable when all others are already included in the model.

R² = SSR / SST   (3.6)

R² = ∑ᵢ (Ŷi − Ȳ)² / ∑ᵢ (Yi − Ȳ)²   (3.7)

where SSR is the sum of squares due to regression and SST is the total sum of squares.

3.4.7 Multicollinearity

Multicollinearity exists in a regression model when at least two of the explanatory variables are related to each other; a linear relationship or inter-correlation between explanatory variables in a given dataset is described as multicollinearity. If multicollinearity occurs, statistical inferences drawn from the data will not be reliable. Correlated variables among the explanatory variables of a model can be found by calculating the variance inflation factor (VIF) for each explanatory variable; the VIF is a more rigorous check for collinearity than the correlation coefficient. Mathematically,

VIF = 1 / (1 − R²ᵢ)   (3.8)

where R²ᵢ is the coefficient of determination from regressing the ith explanatory variable on the remaining ones. In practice, the VIF is used in a stepwise elimination approach: the independent variable with the highest VIF is removed and the model re-run, repeating until all remaining independent variables have a VIF below the threshold of 10.

3.4.8 Heteroscedasticity

Heteroscedasticity occurs when the residuals of the estimated model do not have constant variance across observations.
The presence of heteroscedasticity does not affect the expected value of the model's coefficient estimates, but OLS underestimates the standard errors of the estimated coefficients, which distorts the t-test statistics for significance.

3.4.9 Breusch-Pagan Test

The Breusch-Pagan test is used to test for heteroscedasticity in a linear regression model. It tests whether the variance of the residuals from a regression model depends on the values of the predictor variables, based on the relation

log_e σᵢ² = γ0 + γ1Xi   (3.9)

so that σ² either increases or decreases with the level of X, depending on the sign of γ1; constancy of the error variance corresponds to γ1 = 0. The test of H0 : γ1 = 0 versus Hα : γ1 ≠ 0 is carried out by regressing the squared residuals εᵢ² against Xi in the usual manner and obtaining the regression sum of squares, denoted SSR*. The test statistic χ²BP is

χ²BP = (SSR*/2) / (SSE/n)²   (3.10)

where SSR* is the regression sum of squares from regressing ε² on X and SSE is the error sum of squares from regressing Y on X. If H0 : γ1 = 0 holds and n is reasonably large, χ²BP follows approximately the chi-square distribution with one degree of freedom. Large values of χ²BP lead to the conclusion Hα, that the error variance is not constant.

3.4.10 Remedy for Assumption Violation

The original Box-Cox transform is given by

y(γ) = (y^γ − 1)/γ for γ ≠ 0, and y(γ) = log y for γ = 0   (3.11)

The objective of Box-Cox transformations is to restore the linearity assumption of the model: the dependent variable is transformed by choosing a suitable value of γ and applying the transform accordingly.

3.4.11 Outliers

Cook's Distance

Cook's distance measures the influence of the ith observation on all the fitted values.
It is a standardized version of the sum of squared differences between the fitted values computed with and without observation i:

Di = ∑ⱼ (Ŷj − Ŷj(i))² / (p · MSE)   (3.12)

where p is the number of model parameters. A rule of thumb is that if F(Di; p, n − p) lies below the 10th or 20th percentile the case is not influential, while if F(Di; p, n − p) is near the 50th percentile or more, the case has major influence (Howard & Gordoh, 2005).

3.4.12 Normality Test

Shapiro-Wilk Test

The Shapiro-Wilk test tests the null hypothesis that a sample x1, x2, ..., xn came from a normally distributed population. The test statistic is

W = (∑ᵢ aᵢ x₍ᵢ₎)² / ∑ᵢ (xᵢ − x̄)²   (3.13)

where x₍ᵢ₎ is the ith smallest number in the sample and the coefficients are given by

(a1, ..., an) = mᵀV⁻¹ / (mᵀV⁻¹V⁻¹m)^(1/2)   (3.14)

where m = (m1, ..., mn)ᵀ holds the expected values of the order statistics of independent and identically distributed random variables sampled from the standard normal distribution, and V is the covariance matrix of those order statistics. When the Shapiro-Wilk test has a p value less than the alpha level (0.05), the null hypothesis is rejected and we conclude that the data are not normally distributed; conversely, if the p value exceeds the alpha level (0.05), we fail to reject the null hypothesis that the data are normally distributed.

3.5 Testing the Missing Data Mechanism (MCAR & MAR) Assumption

Researchers frequently face difficulties in analysing data sets with missing or incomplete observations. To analyse a data set containing missing values appropriately, the missing-value mechanism must first be investigated. If data are missing completely at random, then many incomplete-data analysis algorithms lead to valid inference (Little & Rubin, 2002); thus, a test of missing completely at random is warranted.
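A simple screen in this spirit, used later in this chapter, is to code each case of a variable as 1 if missing and 0 if observed and examine the association of this indicator with a fully observed covariate. The sketch below is illustrative (hypothetical data and names), not Little's formal test:

```python
def correlation(u, v):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

covariate = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
target = [2.1, None, 4.0, None, 6.3, 7.1]           # variable with gaps
indicator = [1.0 if v is None else 0.0 for v in target]
r = correlation(indicator, covariate)
# A correlation near zero is consistent with MCAR; a strong
# correlation suggests missingness depends on the covariate (MAR).
print(abs(r) < 0.5)  # True
```

A formal decision would of course rest on a significance test rather than an arbitrary cutoff; the sketch only illustrates the logic of the dummy-variable check.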
In missing-value analysis, Little's (1988) test is useful for testing the assumption of missing completely at random for multivariate, partially observed data.

3.5.1 Little's Test of MCAR

Little's MCAR test examines the missing completely at random assumption, which must be satisfied before substituting missing values with certain imputation algorithms. The test assesses MCAR for multivariate data with incomplete values; according to Kim & Bentler (2002), Little's MCAR test assesses the homogeneity of means and covariances using generalized least squares estimation. Little's test statistic is

d² = ∑ₖ nₖ (x̄_obs,k − μ̂_obs,k)ᵀ Σ̂_obs,k⁻¹ (x̄_obs,k − μ̂_obs,k)   (3.15)

where k indexes the (at most 2ᴾ) missing-data patterns, nₖ is the number of observations with the kth pattern, x̄_obs,k is the mean of the variables observed in pattern k, and μ̂_obs,k and Σ̂_obs,k are the corresponding sub-vector and sub-matrix of the maximum likelihood estimates of the mean vector and covariance matrix. Under MCAR, the statistic follows a χ² distribution with ∑ₖ Pₖ − P degrees of freedom, where Pₖ is the number of observed variables in pattern k and P is the total number of variables. When Little's MCAR test has a p value exceeding the alpha level (0.05), neither the assertion of normality nor the MCAR hypothesis is rejected; when the p value is lower than the alpha level (0.05), there is evidence against the null hypothesis and we conclude that the data are MAR. Data points are MCAR if the pattern of missing values depends on neither the observed nor the unobserved data.

3.6 Classification of Missing Data under the Assumptions of the Missing Data Mechanisms

To determine whether a data matrix with missing observations is MCAR or MAR, Little's MCAR test and a dummy variable of interest on the variables are employed. First, Little's MCAR test is used to test the MCAR and MAR assumptions.
If there is no evidence against the null hypothesis under Little's MCAR test, then the study can conclude that the imputation algorithms that depend on the missing completely at random assumption, namely KNN, mean substitution (MS) and regression substitution (RS), are applicable (Lin & Bentler, 2012; McKnight, 2007). Violation of the MCAR assumption may result in biased estimates from these missing-data methods. If there is significant evidence against the null hypothesis under Little's MCAR test, then the study concludes that the imputation algorithms relying on the MAR assumption, multiple imputation by chained equations (MICE) and expectation maximization (EM), are appropriate (Lin & Bentler, 2012; Rubin & Thayer, 1982); the MAR assumption permits the parameters to be properly adjusted using all accessible information. Secondly, to determine whether a data matrix with missing observations is MCAR or MAR, we compute a dummy variable that indicates whether the data in a particular attribute are missing, and check whether it is correlated with the other attributes in the dataset. When the dummy (missingness) variable is observed to be independent of the other attributes, the pattern of missing data is described in this study as MCAR rather than MAR, and the reverse holds for MAR.

3.7 The Imputation Algorithms for Treating Missing Values under the MCAR Mechanism

From the literature review, the following imputation algorithms have been used under the MCAR assumption to handle the missing-data problem effectively (Schmitt et al., 2015).

3.7.1 K Nearest Neighbors (KNN) Imputation Algorithm

The nearest-neighbour imputation method is a technique based on the notion of proximity between observations (subjects); this similarity is usually determined by a distance function (the Euclidean distance, for example). It is a technique in which the missing data of a given subject are substituted with the value observed at the same position for the nearest subject.
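This nearest-neighbour substitution can be sketched in a few lines. The function below (illustrative names and data; the non-target columns are assumed complete, matching the set H of the formal description) fills each missing cell of the target column with the value held at that position by the closest fully observed subject:

```python
import math

def knn_impute(rows, target):
    """1-nearest-neighbour imputation of column `target`."""
    complete = [r for r in rows if r[target] is not None]
    others = [j for j in range(len(rows[0])) if j != target]
    for r in rows:
        if r[target] is None:
            # Euclidean distance over the fully observed columns.
            donor = min(complete, key=lambda c: math.sqrt(
                sum((r[j] - c[j]) ** 2 for j in others)))
            r[target] = donor[target]
    return rows

data = [
    [1.0, 1.0, 10.0],
    [5.0, 5.0, 50.0],
    [1.1, 0.9, None],   # nearest complete subject is the first row
]
result = knn_impute(data, target=2)
print(result[2][2])  # 10.0
```

Practical variants average the k > 1 nearest donors and scale the variables before computing distances, but the donor-lookup logic is the same.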
Explicitly, suppose X is a matrix representing the data set, X = (X(1), X(2), ..., X(p)), where each column X(i) (i = 1, 2, ..., p) is a random variable of n observations. Let X(j) be a column with missing values, and write X(j) = (X(j)_obs, X(j)_miss), where X(j)_obs is the sub-vector of observed values of X(j) and X(j)_miss that of the missing values. Consider H = {i : X(i)_miss = ∅, i = 1, 2, ..., p}, with cardinality m, the set of indices of columns without missing values, and let Z = {X(i) : i ∈ H}. Let Z_obs and Z_miss be the two sub-matrices of Z obtained by selecting the rows corresponding to X(j)_obs and X(j)_miss respectively. Suppose l is the index of a subject with no observed value for the variable X(j). Among the subjects k that have complete measurements on the set H, the nearest neighbour j0 minimizes the distance between k and l:

j0 = argminₖ d(z_obs(l), z_obs(k)),  1 ≤ k ≤ n   (3.16)

where d is a distance measure and n is the number of subjects in the set. Here d is the Euclidean distance defined by

d(z_obs(l), z_obs(k)) = √( ∑ᵢ∈H (z(i)_obs(l) − z(i)_obs(k))² )   (3.17)

Once j0 is determined, the missing value X(j)_miss(l) is estimated by the value observed for subject j0:

X(j)_miss(l) = X(j)_obs(j0)   (3.18)

3.7.2 Regression Substitution

The principle of regression substitution is to use the observed values to create a fitted regression model: the attribute with missing data is the target variable, and the incomplete values are substituted by the predicted values from the regression equation. To describe how the regression algorithm works, the most appropriate predictors of the attribute with missing data are determined from the correlation matrix. The best predictors are selected and used as predictor attributes in a regression model, with the attribute containing the missing values as the response variable.
Cases with complete information on the selected predictor attributes are employed to generate the regression equation, and the model in turn is used to predict values for the incomplete cases. By iteration, the missing values are substituted so that all cases can be used to predict the response attribute; these steps are repeated until convergence, and the predictions obtained from the final cycle are used to fill in the incomplete data. Explicitly, suppose X is a matrix representing the data set, X = (X(1), X(2), ..., X(p)), where each column X(i) (i = 1, 2, ..., p) is a random variable of n observations. Let X(j) be a column with missing values, and write X(j) = (X(j)_obs, X(j)_miss), where X(j)_obs is the sub-vector of observed values of X(j) and X(j)_miss that of the missing values. Consider H = {i : X(i)_miss = ∅, i = 1, 2, ..., p}, with cardinality m, the set of indices of columns without missing values, and let Z = {X(i) : i ∈ H}. Let Z_obs and Z_miss be the two sub-matrices of Z obtained by selecting the rows corresponding to X(j)_obs and X(j)_miss respectively. Consider the regression model based on the observed part:

X(j)_obs = βZ_obs + μ, where μ ∼ N(0, σ²)   (3.19)

with β = (β0, β1, ..., βm) the vector of regression coefficients and the error term μ = (μ1, ..., μ_{n−q}), where q is the length of X(j)_miss. The estimates of the missing values X̂(j)_miss,i, where i ranges over the q row indices of X(j)_miss, are obtained by

X̂(j)_miss,i = β̂0 + β̂1 Z_miss,1 + ... + β̂m Z_miss,m   (3.20)

where β̂ is the usual estimator of β. The regression approach to missing values depends on the predictors entered into the regression equation, which is why Little (2002) regards this technique as a conditional one. It is more sophisticated than the mean substitution method (Rubin et al.
2007), but it can lead to overestimating the relationships between the predictors and the dependent variable (Schafer & Graham, 2002).

3.7.3 Mean Substitution (MS)

With the MS approach, the arithmetic mean of the observed values of each variable is computed and then substituted into each of the missing cells of that attribute. The MS technique yields good results if the missing-data mechanism is MCAR, and it is among the most widely employed imputation techniques for replacing incomplete data. Explicitly, if the value Yij of the kth class c_k is missing, it is replaced by

Ŷij = (1/nₖ) ∑ᵢ∈cₖ yij   (3.21)

where nₖ is the number of observed values in the jth feature of the kth class. For example, consider the following data set with missing values, before and after replacement by the mean substitution technique.

Table 3.1: The dataset with missing values

VO1   VO2   VO3
12    NA    50
NA    NA    43
20    26    67
23    64    NA
40    34    78
21    NA    21
Mean: 23.2  41.3  51.8

Table 3.2: After replacement of missing values by the mean substitution technique

VO1   VO2   VO3
12    41.3  50
23.2  41.3  43
20    26    67
23    64    51.8
40    34    78
21    41.3  21

3.8 The Algorithms for Treating Missing Values under the MAR Mechanism

From the literature review, the following two imputation algorithms have been classified under the MAR assumption for handling the missing-data problem effectively (Azur et al., 2012; Schafer & Graham, 2002).

3.8.1 Expectation-Maximization (EM) Algorithm

The EM algorithm is an iterative procedure used to compute the maximum likelihood estimate in the presence of latent or missing data; maximum likelihood estimation seeks the model parameters under which the observed values are most probable. The EM approach, initially developed by Dempster et al. (1977), is an iterative procedure to maximize the likelihood calculated by a parametric model for the observed data.
EM operates under the assumption that, given the attributes employed in the imputation approach, the unobserved data are MAR. The EM technique for missing values relies largely on maximum likelihood estimation of the mean and covariance structure given the available data. Each iteration of the EM approach comprises two steps, the Expectation step (E-step) and the Maximization step (M-step). In the E-step, regression equations based on the available values are used to calculate the expected values of the missing entries, which are replaced by the conditional means established by the regression models. In the M-step, the estimates obtained from the E-step are used to update the parameters so as to increase the log likelihood relative to the previous state. These two steps are repeated, and the algorithm converges to a stationary point under mild regularity conditions (Allison, 2002; Dempster et al., 1977). The distribution of the complete data Y can be factored as

f(Y | θ) = f(Y_obs, Y_mis | θ) = f(Y_obs | θ) f(Y_mis | Y_obs, θ)   (3.22)

where f(Y_obs | θ) is the density of the observed data and f(Y_mis | Y_obs, θ) is the conditional density of the missing data given the observed data. The log likelihood of the complete data is then

l(θ | Y) = l(θ | Y_obs, Y_mis) = l(θ | Y_obs) + ln f(Y_mis | Y_obs, θ)   (3.23)
With Y = (Y_obs, Y_mis), given the current parameter estimate and the observed part of Y, the function Q is the expected complete-data log-likelihood:

    Q(θ | θ^(k)) = E{ ln f(Y | θ) | Y_obs, θ^(k) } = Σ_{Y_mis} ln[ f(Y_obs, Y_mis | θ) ] f(Y_mis | Y_obs, θ^(k))    (3.24)

or, for continuous missing data,

    Q(θ | θ^(k)) = ∫ l(θ | Y) f(Y_mis | Y_obs, θ^(k)) dY_mis    (3.25)

Maximization step (M-step)

The M-step obtains the updated maximum likelihood parameter estimate using the Q function:

    θ^(k+1) = arg max_θ Q(θ | θ^(k))

The E-step and M-step are alternated until the difference l(θ^(k+1)) − l(θ^(k)) is negligible.

3.8.2 Multiple Imputation by Chained Equation (MICE) Algorithm

MICE is a particular multiple imputation approach used to handle missing data effectively (Raghunathan et al., 2001; Van Buuren, 2007). It works under the assumption that, given the attributes employed in the imputation, the missing data are MAR: the probability that a value is missing is related solely to the observed values and not to the missing values themselves (Schafer & Graham, 2002). In practice, MICE has been employed on data matrices with thousands of observations and hundreds of attributes (Van Buuren, 2007). In the chained-equation approach, a series of regression models is run whereby each attribute with incomplete values is regressed on the other attributes in the data (Van Buuren & Groothuis-Oudshoorn, 2011). Each attribute can therefore be modeled according to an appropriate distribution: for instance, logistic regression for binary attributes, linear regression for continuous data, a multinomial logit model for categorical data and a Poisson model for count data. MICE thus specifies the imputation model through a set of conditional densities, one per incomplete attribute; the joint distribution is only implicitly defined and need not actually exist.
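Returning to the EM cycle of Section 3.8.1, the E-step/M-step iteration can be illustrated with a toy univariate normal sample containing missing entries. This is a minimal sketch with invented data values, not the thesis's implementation; for a normal mean and variance, the E-step only needs the expected sum and sum of squares of the completed data.

```python
import statistics

def em_normal(data, iters=100):
    """Toy EM for the mean and variance of a univariate normal sample
    containing missing entries (None)."""
    obs = [x for x in data if x is not None]
    n, n_mis = len(data), len(data) - len(obs)
    mu = statistics.mean(obs)           # starting values from the observed part
    var = statistics.pvariance(obs)
    for _ in range(iters):
        # E-step: expected sufficient statistics of the completed data,
        # replacing each missing value's contribution by its expectation
        s = sum(obs) + n_mis * mu
        s2 = sum(x * x for x in obs) + n_mis * (mu ** 2 + var)
        # M-step: updated maximum likelihood estimates
        mu = s / n
        var = s2 / n - mu ** 2
    return mu, var

mu, var = em_normal([4.1, None, 5.0, 6.2, None, 4.7])
```

For this toy model with ignorable missingness, the iteration converges to the observed-data maximum likelihood estimates (here a mean of 5.0), in line with the convergence property of EM noted above.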
After a number of iterations, the Markov chain should converge to a stationary distribution; at that point the chain must be irreducible, aperiodic and recurrent (Van Buuren, 2012). The number of iterations needed for the chain to converge differs per data set, but it is usually quite small, about 5 to 10. The MICE technique is an MCMC approach and is briefly described in Algorithm 1.0 below. Beginning with initial imputations, MICE iterates over the conditional densities.

Algorithm 1.0: the MICE algorithm
1. Specify an imputation model P(Y_j^mis | Y_j^obs, Y_−j, R) for each incomplete variable Y_j, j = 1, ..., p.
2. For each incomplete variable, initialise the starting imputations Y_j^*(0) by random draws from Y_j^obs.
3. Repeat for iterations t = 1, ..., T:
4.   Repeat for incomplete variables j = 1, ..., p:
5.     Define the currently imputed data Y_−j^(t) = (Y_1^(t), ..., Y_{j−1}^(t), Y_{j+1}^(t−1), ..., Y_p^(t−1)).
6.     Draw θ_j^*(t) ∼ P(θ_j | Y_j^obs, Y_−j^(t), R).
7.     Draw the imputations Y_j^*(t) ∼ P(Y_j^mis | Y_j^obs, Y_−j^(t), R, θ_j^*(t)).
8.   End repeat j.
9. End repeat t.

3.9 Evaluation Assessment Criteria to Compare the Imputation Algorithms

The following performance assessment criteria are used to evaluate the imputation algorithms employed in this study: the mean absolute difference (MAD), the root mean square error (RMSE) and the coefficient of determination (R2).

3.9.1 Mean Absolute Difference (MAD)

The MAD is a measure of statistical dispersion: the expected absolute difference between two values drawn from a distribution, here the observed value and the imputed value. It is computed as the arithmetic mean of the absolute differences between observed and imputed values. The smaller the MAD, the better; hence the algorithm with the smallest MAD is recommended for replacing unobserved data.
Mathematically it is given by

    MAD = E|X_o − X_m|    (3.26)

where X_o denotes the observed values and X_m the imputed values.

3.9.2 Root Mean Squared Error (RMSE)

The root mean squared error (RMSE) is a performance indicator that measures the average distance of the residuals: it compares the original and substituted values, and essentially denotes the standard deviation of their differences. It is a useful indicator of overall accuracy that helps researchers see how each imputation algorithm performs on a data set. In the literature, the most efficient imputation algorithm is the one with the lowest RMSE (Huang & Carriere, 2006); the smaller the RMSE, the better the performance. The mathematical formula for the RMSE is

    RMSE = sqrt( (1/n) Σ_{i=1}^{n} (X_io − X_im)^2 )    (3.27)

where i = 1, 2, ..., n, n is the sample size, X_io are the observed values and X_im the imputed values (Schmitt et al., 2015).

3.9.3 Coefficient of Determination

The coefficient of determination (R2) measures the proportion of variability in the response variable explained by the predictor variables. R2 ranges from 0 to 1: values near one indicate that the model has strong predictive ability (the regression fits the data well), while values near zero indicate poor explanatory power. R2 is given by the formulas

    R2 = 1 − SSE/SST    (3.28)

    R2 = 1 − Σ_{i=1}^{n} (Y_i − Ŷ_i)^2 / Σ_{i=1}^{n} (Y_i − Ȳ)^2    (3.29)

where SSE is the residual sum of squares and SST is the total sum of squares corrected for the mean.

3.10 Data Analysis Procedure

This study gathered information on 106 countries for the year 2011. Several variables were measured for these countries, some of which tend to be correlated among themselves. The quantitative data analysis plan is as follows: data entry, processing, organizing output into tables, explanation of the tables and drawing conclusions.
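The three assessment indicators defined in Section 3.9 can be computed with a short sketch (the observed and imputed vectors below are illustrative values, not thesis data):

```python
import math

def mad(observed, imputed):
    # Eq. (3.26): mean absolute difference between observed and imputed values
    return sum(abs(o - m) for o, m in zip(observed, imputed)) / len(observed)

def rmse(observed, imputed):
    # Eq. (3.27): root mean squared error
    return math.sqrt(sum((o - m) ** 2 for o, m in zip(observed, imputed)) / len(observed))

def r_squared(y, y_hat):
    # Eqs. (3.28)-(3.29): R2 = 1 - SSE/SST
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst

obs = [1.0, 2.0, 3.0, 4.0]
imp = [1.0, 2.0, 2.0, 4.0]
print(mad(obs, imp), rmse(obs, imp), r_squared(obs, imp))
```

With only one cell imputed incorrectly by 1, the sketch gives MAD = 0.25, RMSE = 0.5 and R2 = 0.8, illustrating how the three criteria penalize the same error on different scales.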
Moreover, the study used R to run the considered algorithms for estimating and imputing the missing values. Chapter 4 presents the results in the form of tables, diagrams and graphical displays for easy interpretation; at the final stage of the study, the empirical outcome of each algorithm is evaluated with reference to those tables and figures.

CHAPTER 4
Data Analysis and Discussion of Results

4.1 Introduction

This chapter presents the empirical calculations of the study and their statistical interpretation. It begins with the descriptive statistics of the 2011 World Population Data Sheet, followed by the multiple linear regression model, the missing data mechanism test, the comparison of imputation algorithms under the MCAR and MAR mechanisms, the comparison of imputation algorithms under the MLR model, the evaluation of imputation algorithms using the coefficient of determination, and finally the comparison of imputation methods using the mean absolute difference (MAD). All analyses were carried out in R.

4.2 Descriptive Statistics

Table 4.1 presents descriptive statistics of data from the 2011 World Population Data Sheet. According to Polland (1988), a country's life expectancy at birth is classified as low (< 65 years), medium (65-73 years) or high (> 73 years).

Table 4.1: Classification of Life Expectancy at Birth (LEB) for 106 Countries
Level        Values of LEB   Number of countries
Low LEB      < 65            43
Medium LEB   65-73           33
High LEB     > 73            30

From Table 4.1, the values of LEB were categorized into three levels. Firstly, 43 countries (41%) had low values of LEB.
Secondly, 33 countries (31%) had medium values of LEB, and finally 30 countries (28%) had high values. In the original 2011 World Population Data Sheet, LEB varies from country to country: Guinea-Bissau and Costa Rica recorded low LEB values of 48 and 49 respectively, while Slovenia recorded the highest LEB value of 80 among the countries used in this study (see Appendix IV).

Table 4.2: Correlation Matrix
Predictors   LEB      CMW      DEN      RWS      IMR      TFR
LEB           1.000
CMW           0.432    1.000
DEN           0.802    0.496    1.000
RWS           0.756    0.271    0.557    1.000
IMR          -0.744   -0.171   -0.491   -0.501    1.000
TFR          -0.201    0.138   -0.057   -0.218   -0.002    1.000

Table 4.2 shows the correlations among the pool of predictor variables and the dependent variable. It is clearly observed that LEB (the response variable) is positively related to CMW, DEN and RWS. Among the predictors there is a positive correlation of 0.496 between DEN and CMW, 0.271 between RWS and CMW, 0.557 between RWS and DEN, and a weak positive correlation of 0.138 between CMW and TFR. The low-to-weak correlations among the independent variables indicate the absence of multicollinearity among the covariates under study.

Table 4.3: Determination of Multicollinearity
Predictors   V.I.F   1/V.I.F
CMW          1.97    0.506624
DENSITY      1.72    0.579939
RWS          1.50    0.664721
IMR          1.40    0.713982
TFR          1.12    0.893680

Table 4.3 reports the variance inflation factor (VIF) and its reciprocal (the tolerance) for each predictor variable under consideration. The VIF values indicate no evidence of multicollinearity among the predictor variables. The table also shows that two of the VIFs exceed the mean VIF of 1.542, but none exceeds the threshold of 10.
Table 4.4: Test of Normality and Constancy of Variance of the Residuals
Test                          P-value
Shapiro-Wilk normality test   0.26736
Breusch-Pagan test            0.9065

Table 4.4 shows that the residuals are normally distributed (p > 0.05), and the p-value of the Breusch-Pagan test indicates constant variance of the residuals (p > 0.05).

Table 4.5: Summary of the Complete Original Dataset Model Coefficients (regression coefficient estimates, standard errors, t-values and p-values)
Variable   Estimate   Std. Error   t-value   p-value
Constant   26.5354    3.84132       6.91     0.0000
CMW         0.08643   0.02937       2.94     0.0040
DEN         0.39793   0.04723       8.43     0.0000
RWS         0.25676   0.0387        6.63     0.0000
IMR        -0.0872    0.00891      -9.78     0.0000
TFR        -0.7075    0.18353      -3.86     0.0000
R2 = 0.8983; F(5,100) = 167.32; p-value of F-statistic = 0.000

Table 4.5 presents the regression output for the complete original dataset without missing values: the coefficient estimates, their standard errors, t-values and p-values. The F-statistic of 167.32 (p-value = 0.000) indicates that the null hypothesis that the predictor variables jointly have no impact on life expectancy at birth (LEB) can be rejected. The results also show that CMW, DEN, RWS, IMR and TFR are all significant in predicting LEB, each with p-value ≤ 0.05. In addition, the multiple R-squared of 0.8983 indicates that 89.83% of the total variation in life expectancy at birth is explained by the regression model (adjusted R-squared = 0.8887). All five covariates were significant at the 5% level of significance.
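As an illustration of how the Table 4.5 estimates are used, a predicted LEB can be computed for one hypothetical country (the covariate values below are invented for the sketch; only the coefficients come from Table 4.5):

```python
# Coefficients from Table 4.5 of the thesis
coef = {"const": 26.5354, "CMW": 0.08643, "DEN": 0.39793,
        "RWS": 0.25676, "IMR": -0.0872, "TFR": -0.7075}
# Hypothetical covariate values for one country (illustrative only)
country = {"CMW": 50.0, "DEN": 60.0, "RWS": 70.0, "IMR": 30.0, "TFR": 3.0}

# Predicted life expectancy at birth: intercept plus weighted covariates
leb_hat = coef["const"] + sum(coef[k] * v for k, v in country.items())
print(round(leb_hat, 2))
```

This is the same linear combination that Equation (4.1) in the next section writes out symbolically.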
4.3 Multiple Linear Regression (MLR) Model

From the regression output in Table 4.5, the five covariates CMW, DEN, RWS, IMR and TFR all contribute significantly to predicting life expectancy at birth (LEB). The fitted MLR model is

    Y = 26.5354 + 0.08643X1 + 0.39793X2 + 0.25676X3 − 0.0872X4 − 0.7075X5    (4.1)

where Y = LEB, X1 = CMW, X2 = DEN, X3 = RWS, X4 = IMR and X5 = TFR. A unit increase in CMW is associated with a 0.08643 average increase in LEB if all other factors remain the same. Likewise, a unit increase in DEN is associated with a 0.39793 average increase in LEB, a unit increase in RWS with a 0.25676 average increase, a unit increase in IMR with a 0.0872 average decrease, and a unit increase in TFR with a 0.7075 average decrease in LEB, with all other covariates held constant.

Based on the results in Table 4.5, the covariates CMW, DEN, RWS, IMR and TFR were selected for the final model formulation and subjected to missing values, while LEB was fully observed in all instances. Throughout this thesis, the study assumes that the missing values follow the MCAR and MAR mechanisms with an arbitrary missing pattern.

4.4 Missing Data Mechanism Test

To analyze a dataset with missing observations accurately, an in-depth knowledge of how the data are missing is required (i.e. whether at random or not at random); this helps to group the missing values under the appropriate missing data mechanism. In this thesis, missing data at rates of 5%, 10%, 20%, 30% and 40% were artificially created, in an arbitrary missing pattern, from the complete 2011 World Population Data Sheet.
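The artificial creation of completely-at-random missingness at a given rate can be sketched as follows (an illustrative amputation routine, not the thesis's R code; the dummy matrix mimics the 106 countries by 5 covariates layout):

```python
import random

def ampute_mcar(matrix, rate, seed=0):
    """Delete `rate` (e.g. 0.05 for 5%) of the cells of a complete data
    matrix completely at random, returning a copy with None for missing."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(matrix)) for j in range(len(matrix[0]))]
    n_missing = round(rate * len(cells))
    out = [row[:] for row in matrix]        # copy so the original stays complete
    for i, j in rng.sample(cells, n_missing):
        out[i][j] = None
    return out

# Dummy complete matrix: 106 rows (countries) x 5 columns (covariates)
complete = [[float(i + j) for j in range(5)] for i in range(106)]
incomplete = ampute_mcar(complete, 0.20)    # 20% of the 530 cells become missing
```

Because every cell has the same deletion probability regardless of any value in the data, the resulting pattern is MCAR by construction.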
Missingness of 10% and below is regarded as a small fraction of incomplete values, 20% as a medium amount, 30% as a large amount, and 40% and above as a very large amount of incomplete values in the data matrix. Little's test of MCAR was used to determine the mechanism, and hence the appropriate imputation algorithms, for each percentage of missing values. Tables 4.6 and 4.7 show the output of Little's MCAR test for the percentages of missing values artificially created.

Hypotheses of Little's MCAR test:
H0: The missing values in the data set are MCAR.
H1: The missing values in the data set are not MCAR.

Decision rule for Little's MCAR test:
If p-value ≥ 0.05, fail to reject H0 and conclude that the missingness mechanism is MCAR.
If p-value < 0.05, reject H0 and presume that the missingness mechanism is MAR.

After Little's MCAR test, the various imputation algorithms are applied to estimate and replace the incomplete values artificially created in the complete original data matrix. This allows the algorithms to be compared statistically, and the best one selected, for each missing data pattern.

Table 4.6: Output of Little's MCAR test for MCAR
Proportion of missing data (%)   Chi-square statistic   Degrees of freedom (df)   P-value
5                                33.2287                29                        0.2686
10                               36.0689                37                        0.5125
20                               45.3028                53                        0.7647
30                               66.8731                62                        0.3134
40                               50.6023                65                        0.9050

From Table 4.6, since all the p-values for the various proportions of missing data are greater than 0.05 (p-value ≥ 0.05), there is no evidence to reject H0 and hence the missingness mechanism is MCAR. This assumption implies that the occurrence of incomplete values in the data matrix shows no pattern, and that the incomplete values are related neither to the observed nor to the missing values.
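The decision rule stated above can be written as a one-line classifier. Note that this sketch only encodes the thesis's decision rule applied to already-computed p-values; it does not compute Little's chi-square statistic itself:

```python
def mechanism_from_littles_test(p_value, alpha=0.05):
    """Thesis decision rule: fail to reject MCAR when the Little's-test
    p-value is at least alpha; otherwise presume the mechanism is MAR."""
    return "MCAR" if p_value >= alpha else "MAR"

# Applying the rule to the p-values reported in Table 4.6 (5% to 40% missingness)
print([mechanism_from_littles_test(p) for p in (0.2686, 0.5125, 0.7647, 0.3134, 0.9050)])
```

All five p-values from Table 4.6 classify as MCAR, matching the conclusion drawn in the text.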
Table 4.7: Output of Little's MCAR test for MAR
Proportion of missing data (%)   Chi-square statistic   Degrees of freedom (df)   P-value
5                                80.590                 31                        0.000
10                               98.855                 37                        0.000
20                               136.460                64                        0.000
30                               149.485                80                        0.000
40                               165.836                80                        0.000

From Table 4.7, since all the p-values for the various proportions of missing data are less than 0.05 (p-value < 0.05), there is enough evidence to reject H0 and presume that the missingness mechanism is MAR. Under the MAR assumption, the missing values are related to the observed values but not to the missing values themselves. Based on the literature review and the results of Little's MCAR test, the imputation algorithms in Table 4.8 were grouped into the MCAR and MAR mechanisms. These algorithms are employed to estimate and replace the missing values artificially created in the complete data set, which makes it possible to assess and identify the best imputation algorithm under each missing data mechanism.

Table 4.8: Imputation Algorithms for Treating Missing Values
MCAR                      MAR
Mean substitution         EM algorithm
K nearest neighbor        MICE algorithm
Regression imputation

4.5 Comparison of Imputation Algorithms for Treating Missing Values

To compare the various imputation algorithms used in this study and select the best-performing technique among them, the following performance assessment procedure is employed:
1. The average coefficient difference (ACD) between the MLR model for the original complete data and the MLR model for the incomplete data imputed by each algorithm is calculated and assessed.
2. The mean absolute difference (MAD) between the original (complete) data and the imputed data is computed and assessed.
3. The coefficient of determination (R2) of the regression output is also used to identify the best-performing imputation algorithm.
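Criterion 1, the ACD, averages the signed differences between the slope coefficients of the original model and those of the model refitted on imputed data, then takes the absolute value of the result. A minimal sketch, using the slope estimates reported later in Table 4.9 for the original model and the KNN-imputed data at 5% missingness:

```python
def acd(original_coefs, imputed_coefs):
    """Average coefficient difference: mean of the signed coefficient
    differences, with the sign of the final average ignored."""
    diffs = [o - m for o, m in zip(original_coefs, imputed_coefs)]
    return abs(sum(diffs) / len(diffs))

# Slope estimates (CMW, DEN, RWS, IMR, TFR) from Table 4.9
original = [0.08643, 0.39793, 0.25676, -0.08720, -0.70750]
knn_5pct = [0.06778, 0.45244, 0.24528, -0.08874, -0.61280]
print(round(acd(original, knn_5pct), 5))  # reproduces the reported 0.02351
```

The intercept is excluded, as in Table 4.9, and the smaller the ACD the closer the imputed-data model is to the original one.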
4.6 Comparison of Imputation Algorithms for Treating Missing Values under the MLR Model using ACD

To compare the imputation algorithms and select the best among them, every algorithm considered was used to replace the various missing values artificially created in the complete data matrix. The MLR model was then fitted to each imputed (completed) data set and compared to the general MLR model for the original complete dataset (the dataset without missing values), which is given by

    Y = 26.5354 + 0.08643X1 + 0.39793X2 + 0.25676X3 − 0.0872X4 − 0.7075X5    (4.2)

The comparison procedure is as follows:
1. Compare the MLR model obtained under each imputation algorithm, at each proportion of missing values, to the general MLR model.
2. Estimate the coefficient differences by subtracting each coefficient of the algorithm's model from the corresponding coefficient of the complete original data model.
3. Calculate the mean, or average, coefficient difference (ACD) for each imputation algorithm.
4. Compute the total ACD over all proportions of missingness for each imputation algorithm.

The best imputation algorithm is the one with the smallest ACD estimate. Table 4.9 shows how the average coefficient difference between the KNN imputation algorithm and the general MLR model for the original data is computed under 5% missingness.
Table 4.9: Average Coefficient Difference between the KNN Imputation Algorithm (5% Missingness) and the MLR Model for the Original Data
           MLR for original data       KNN for 5% missingness
Variable   Estimate    St. error       Estimate    St. error    Coefficient difference (CD)
Constant   26.5654     6.90777         24.41543    4.45889
CMW         0.08643    0.02977          0.06778    0.03431       0.01865
DEN         0.39793    0.05458          0.45244    0.05317      -0.05451
RWS         0.25676    0.04697          0.24528    0.04604       0.01148
IMR        -0.08720    0.00938         -0.08874    0.00999       0.00154
TFR        -0.70750    0.31094         -0.61280    0.22788      -0.09470
Average (CD)                                                     0.02351

With the ACD, a negative average coefficient difference is possible. Since the various imputed-data regression models are compared against the complete original regression model, the sign of the final ACD is ignored and its absolute value is used.

4.6.1 Comparison of Imputation Algorithms for Treating Missingness under the MCAR Mechanism

The presence of missing data is inevitable in cross-sectional and longitudinal studies. In real data analysis, the missing value pattern may be described as MCAR, MAR or MNAR, which account for the reasons that give rise to missing values. The following imputation algorithms under the MCAR mechanism were empirically compared and assessed: k nearest neighbor, mean substitution and regression substitution. To judge their performance, the ACD estimate for each algorithm was computed; the algorithm with the smallest ACD estimate fits best. Table 4.10 shows the total average coefficient difference estimates for the KNN, mean substitution and regression substitution algorithms under the MCAR mechanism.

Table 4.10: Performance of KNN, Mean Substitution and Regression Substitution under MCAR using the ACD Estimate
Percentage of missingness (%)   KNN       Mean Sub.   Reg. Sub.
5                               0.02351   0.01965     0.02142
10                              0.00676   0.02092     0.00234
20                              0.03640   0.06292     0.02379
30                              0.02519   0.16870     0.06265
40                              0.09255   0.02548     0.15454
TOTAL                           0.18441   0.29778     0.26474

Table 4.10 presents the performance of the k nearest neighbor algorithm, the mean substitution method and the regression substitution method under the MCAR mechanism using ACD estimates. Among the three imputation algorithms compared, mean substitution is the worst overall. All three methods performed better when the percentage of missing data was small (5%, 10% and 20%), especially KNN and regression substitution; however, at 5% and 40% missingness, mean substitution performed very well compared with KNN and regression substitution. The performance of the KNN algorithm was consistently good throughout the missingness percentages, whereas at a large percentage of missingness (40%) regression substitution gave very poor results. In conclusion, at small missingness percentages, mean substitution and regression substitution are good choices for replacing missingness under the MCAR mechanism, while at very small or large proportions of missingness KNN is the preferred algorithm.

4.6.2 Comparison of EM and MICE Algorithms for Treating Missingness under the MAR Mechanism using ACD

Under MAR, the missingness is related to the observed values and is independent of the missing values. Under this mechanism, the study employed the expectation maximization (EM) algorithm and the multiple imputation by chained equation (MICE) algorithm to impute the missing values. The average coefficient difference (ACD) was computed for each imputation algorithm, and the algorithm with the lowest ACD is the better method. Table 4.11 shows the total average coefficient difference of the EM and MICE algorithms for treating missing values under the MAR mechanism.
Table 4.11: Performance of EM and MICE Algorithms under MAR using the Average Coefficient Difference (ACD)
Percentage (%)   EM        MICE
5                0.01191   0.02896
10               0.00394   0.05072
20               0.06718   0.02911
30               0.12249   0.06238
40               0.05914   0.05576
TOTAL            0.26466   0.22693

Table 4.11 presents the performance of the EM and MICE algorithms for treating missing values under the MAR mechanism using ACD. Of the two algorithms compared, EM had the poorer overall results. At small missingness proportions (5% and 10%) the EM method outperformed the MICE algorithm, while at the 20%, 30% and 40% levels of missing data the MICE algorithm gave the more satisfactory results. Therefore, the EM algorithm can be used to replace missing data when the missingness proportion is small, but at large missingness percentages it is prudent to impute missing data using the MICE algorithm. In a nutshell, the MICE algorithm provides the less biased estimates and more accurate conclusions in replacing missingness; hence MICE is preferred to EM under the MAR mechanism. Figure 4.1 shows a graphical representation of the performance of the EM and MICE imputation algorithms using ACD as the evaluation criterion.

Figure 4.1: Graph of EM and MICE algorithms under MAR using average coefficient difference as the performance assessment criterion

The results for the EM and MICE imputation algorithms clearly show that at small missingness percentages (5% and 10%) the EM method outperformed the MICE algorithm, whereas at medium and very large missingness percentages the MICE approach provided sufficiently good results. This suggests that under MAR, EM can be used to replace missingness when the missingness percentage is low, and MICE should be used when the missingness percentage is large (20% and above).
4.7 Comparison of Imputation Algorithms for Treating Missing Values using the Mean Absolute Difference (MAD)

The mean absolute difference (MAD) is a measure of statistical dispersion, equal to the expected absolute difference of two values drawn from a probability distribution, here the observed value and the imputed value. The algorithm with the smallest MAD is recommended for imputing missing values. Table 4.12 shows the mean absolute difference under the MCAR mechanism at the various percentages of missing values.

Table 4.12: Performance of KNN, Mean Substitution and Regression Substitution for Treating Missing Values under the MCAR Mechanism using the Mean Absolute Difference (MAD)
Percentage (%)   KNN         Mean Sub.   Reg. Sub.
5                 7.079757   12.27647     4.825816
10                8.521118   11.97301     5.787719
20                7.799661   11.23853     6.228872
30                9.111556   11.08854     7.333512
40               13.08629    12.22555     7.558301
TOTAL            45.598382   58.80210    31.73422

From Table 4.12, among the three imputation algorithms compared under the MCAR mechanism using the mean absolute difference, regression substitution is the overall best performer, while the k nearest neighbor algorithm and mean substitution gave unsatisfactory results. Although the MAD of regression substitution grows as the percentage of missingness increases, at every level of missingness from 5% to 40% its performance exceeds that of both KNN and mean substitution. The study therefore suggests that under the MCAR mechanism regression substitution should be used to replace missing data; it is the preferred choice.
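The selection rule applied to Table 4.12, preferring the algorithm with the smallest total MAD across all missingness levels, can be written directly (totals taken from the table):

```python
# Total MAD over the 5%-40% missingness levels, from Table 4.12
total_mad = {"KNN": 45.598382, "Mean Sub.": 58.8021, "Reg. Sub.": 31.73422}

# The preferred algorithm under MCAR is the one minimizing total MAD
best = min(total_mad, key=total_mad.get)
print(best)
```

The minimum is attained by regression substitution, matching the conclusion above.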
Figure 4.2 gives a graphical demonstration of KNN, mean substitution and regression substitution under MCAR using MAD as the performance assessment criterion.

Figure 4.2: Graph of KNN, Mean substitution and Regression substitution under MCAR using MAD as performance assessment criterion

Figure 4.2 clearly demonstrates that regression substitution provides extremely good results at all levels of missingness, outperforming both the KNN algorithm and the mean substitution method. The study suggests that regression substitution should be used to substitute missingness under the MCAR mechanism.

Table 4.13: Performance of EM and MICE Algorithms for Treating Missing Values under the MAR Mechanism using the Mean Absolute Difference (MAD)
Percentage (%)   EM        MICE
5                0.80660   0.77920
10               0.09016   0.13324
20               0.14379   0.06088
30               0.45585   0.05452
40               0.31113   0.17289
TOTAL            1.80753   1.20073

Table 4.13 shows the performance of the EM and MICE algorithms for treating missingness under the MAR mechanism using MAD. At the 5%, 20%, 30% and 40% missingness percentages, the MICE algorithm outperformed the EM algorithm. It is therefore suggested that under the MAR mechanism the MICE algorithm be used to replace missing values; MICE provides the less biased inference and conclusions under MAR, so multiple imputation by chained equation (MICE) is the preferred choice under this mechanism.

4.8 Comparison of Imputation Algorithms for Treating Missing Values using the Coefficient of Determination (R2)

The coefficient of determination (R2) measures the proportion of variability in the dependent variable that is explained by the predictor variables. The value of R2 lies between zero and one: the higher the value of R2, the greater the proportion of variation explained by fitting the data to the model.
The regression output for the complete original dataset gave an R-squared of 0.8983, implying that 89.83% of the total variation in life expectancy at birth is explained by the regression model. To identify the best imputation algorithm, the R2 value for each imputation algorithm is assessed, and the one closest to the R2 of the original model is chosen as the best: the closer an algorithm's R2 is to the R2 of the original complete model, the better that algorithm replaces missing values. Table 4.14 shows the coefficients of determination under the MCAR mechanism.

Table 4.14: Performance of KNN, Mean Substitution and Regression Substitution under the MCAR Mechanism using R2
Percentage (%)   KNN      Mean Sub.   Reg. Sub.
5                0.8644   0.8457      0.8854
10               0.8575   0.8226      0.8981
20               0.8449   0.7944      0.8975
30               0.8490   0.7848      0.9198
40               0.6585   0.7477      0.9425
TOTAL            4.0743   3.9952      4.5433

From Table 4.14, using the coefficient of determination under the MCAR mechanism to select the best imputation approach, the results for the KNN algorithm and mean substitution exhibit a similar pattern. As anticipated, the performance of these imputation algorithms falls as the proportion of incomplete values increases, and both give unsatisfactory results as the missingness percentage grows. At the 30% and 40% levels of missing data, the coefficients of determination of KNN and mean substitution are (84.9% and 65.9%) and (78.5% and 74.8%) respectively, compared with the 89.8% of the original data. With regression substitution, by contrast, the R2 increases with an increasing percentage of missingness.
At small missingness percentages (5% and 10%), the coefficients of determination of regression substitution are 88.5% and 89.8% respectively, meaning that 88.5% or 89.8% of the total variation in life expectancy at birth is explained by the regression model, almost the same as the R-squared of the original model (89.8%). At 40% missingness, the regression substitution algorithm recorded 94.2%, which is higher than the 89.8% of the original model. In conclusion, regression substitution gives satisfactory performance in replacing missingness under the MCAR mechanism, and it is effective to use it for this purpose, whereas both the k nearest neighbor algorithm and the mean substitution method gave unsatisfactory results, especially at large missingness percentages. Figure 4.3 gives the graphical representation of the KNN, mean substitution and regression substitution algorithms under the MCAR mechanism using the coefficient of determination (R2).

Figure 4.3: Graph of KNN, Mean substitution and Regression substitution algorithms under MCAR mechanism using coefficient of determination (R2) as evaluation assessment criterion

Figure 4.3 compares KNN, mean substitution and regression substitution to the R2 of the complete original data without missing values. The KNN and mean substitution algorithms exhibit a similar pattern: at small missingness percentages (5% and 10%) their performance was encouraging, but at larger missingness levels (20%, 30% and 40%) it fell relative to the R2 of the complete original data. The regression substitution algorithm shows outstanding performance at all levels of missingness.
At the 5%, 10% and 20% missingness percentages, the regression substitution algorithm gave results almost identical to R², the coefficient of determination of the complete original data. At the 30% and 40% missingness levels, its coefficient of determination even exceeded the original R², indicating how well regression substitution performs.

Table 4.15: Performance of the EM and MICE Algorithms for Treating Missing Values under the MAR Mechanism using the Coefficient of Determination (R²)

Percentage (%)   EM       MICE
5                0.8700   0.8529
10               0.7888   0.8171
20               0.7892   0.7496
30               0.6202   0.6777
40               0.6538   0.6410
TOTAL            3.7215   3.7383

Table 4.15 presents the coefficients of determination of the EM and MICE algorithms for treating missing values under the MAR mechanism. It can be observed that the missing data mechanism and the percentage of missingness greatly influence the performance of the imputation algorithms: as the percentage of missingness increases, the performance of both the EM and MICE algorithms deteriorates rapidly. Both algorithms performed very well at a small missingness percentage (5%) relative to the coefficient of determination of the original complete data (89.83%). It is important to note that the two imputation algorithms performed very poorly under the MAR mechanism, especially at large missingness percentages (above 5%). At a small percentage of missingness, EM and MICE can be used to replace missing values; conversely, it is not prudent to use EM and MICE to impute missing values under the MAR mechanism when the proportion of missingness is large. Figure 4.4 shows the performance of the EM and MICE algorithms under the MAR mechanism using the coefficient of determination (R²) as the assessment criterion.
Figure 4.4: Graph of the EM and MICE algorithms under the MAR mechanism using the coefficient of determination (R²) as the assessment criterion

Figure 4.4 compares the performance of the EM and MICE imputation algorithms against the coefficient of determination (R²) of the complete original data at all levels of missing data rates. It illustrates the pattern in Table 4.15 and is the main reason why neither method can be recommended for replacing missing values at every missingness percentage. The further an algorithm's fitted line departs from R², the poorer the performance of that imputation algorithm; hence the EM and MICE algorithms show unsatisfactory performance and, in general, both algorithms provide inconsistent and biased estimates.

CHAPTER 5

SUMMARY, CONCLUSION AND RECOMMENDATIONS

5.1 Introduction

This chapter provides an abridged version of the outcomes of the investigation and draws conclusions that relate to the objectives of the study. It also makes recommendations and suggestions for further studies in the research area.

5.2 Summary

The presence of missing values is an unavoidable issue in data analysis today. Even when a researcher designs the best questionnaire and employs the most efficient data collection method at the data collection stage, it is still possible for some data points to be incomplete or lost; hence the use of missing data imputation techniques is indispensable. Incomplete observations at the analysis stage produce biased results that lead to inaccurate and inefficient inferences about a population meant to guide stakeholders, decision makers and researchers. According to Horton & Kleinman (2007), data may be missing for many reasons, which they summarised as unit non-response, item non-response and non-coverage.
With unit non-response, also called subject non-response, respondents are included in the sample but fail to provide any information for the items on the questionnaire. With item non-response, subjects in the sample fail to provide all the needed information for the items on the questionnaire; some items may be left unanswered for confidentiality reasons. Finally, with non-coverage, the sample does not represent the population to which the researcher wants to generalize, because some fractions of the target population were not covered.

The main aim of the investigation was to find the best imputation algorithm for treating incomplete values under the assumptions of the various missing data mechanisms. The study grouped and compared imputation algorithms for treating missing data under both the MCAR and MAR mechanism assumptions. Under the MCAR mechanism, the k-nearest-neighbor imputation algorithm, the mean substitution method and the regression substitution method were employed to yield unbiased estimates. The expectation maximization (EM) algorithm and the multiple imputation by chained equations (MICE) algorithm require the missing data to be MAR in order to obtain accurate statistical conclusions and inferences. The average coefficient difference (ACD) of the multiple linear regression (MLR) model, the mean absolute difference (MAD) and the coefficient of determination (R²) were the assessment criteria employed to evaluate the performance of the five imputation algorithms under both the MCAR and MAR mechanisms. The MLR model for the original data set is

Y = 26.5354 + 0.08643X1 + 0.39793X2 + 0.25676X3 − 0.0872X4 − 0.7075X5    (5.1)

where Y = LEB, X1 = CMW, X2 = DEN, X3 = RWS, X4 = IMR and X5 = TFR.

The results of the average coefficient difference (ACD) of the MLR model, the assessment criterion for MCAR missing data, revealed that mean substitution is the worst method.
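The fitted model in Equation (5.1) can be evaluated directly for any set of predictor values. The sketch below (in Python, for illustration) uses hypothetical predictor values; only the coefficients come from Equation (5.1).

```python
# Worked evaluation of Equation (5.1): fitted life expectancy at birth (LEB)
# as a function of the five predictors. Coefficients are from the thesis;
# the example predictor values are hypothetical, chosen only for illustration.

def predict_leb(cmw, den, rws, imr, tfr):
    """Fitted LEB from Equation (5.1)."""
    return (26.5354 + 0.08643 * cmw + 0.39793 * den + 0.25676 * rws
            - 0.0872 * imr - 0.7075 * tfr)

# e.g. CMW=50, DEN=60, RWS=80, IMR=30, TFR=3 (illustrative values only)
print(predict_leb(50, 60, 80, 30, 3))  # about 70.5 years
```

The signs of the coefficients match the narrative: higher infant mortality (IMR) and higher total fertility (TFR) lower the fitted life expectancy, while the other three predictors raise it.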
Regression substitution and mean substitution performed better when the percentage of missing data was small; the analysis revealed that at small to medium missingness percentages (below 20%), mean substitution and regression substitution performed very well, while the k-nearest-neighbor approach provided excellent results at both small and large missingness percentages. Under the MAR mechanism, the ACD results indicated that the expectation maximization (EM) algorithm is the weaker of the two methods: EM performed better only when the percentage of missing data was small (5% and 10%), whereas the multiple imputation by chained equations (MICE) algorithm was the best method and, as the analysis revealed, performed credibly well even when the missingness percentage was large.

Using the results of the mean absolute difference (MAD) under the MCAR mechanism, regression substitution was overall the best method and mean substitution the worst. At all levels of missing data rates (5%, 10%, 20%, 30% and 40%), the performance of regression substitution exceeded both KNN and mean substitution; in this context it is clear that regression substitution performed well at both small and large missingness percentages. With the MAD assessment criterion under the MAR mechanism, the MICE algorithm performed very well compared with the EM algorithm.

Finally, using the coefficient of determination (R²) as the performance evaluation criterion under MCAR, both the KNN algorithm and mean substitution performed relatively well at small missingness percentages, but as the proportion of missing values increased, both provided unsatisfactory results. Regression substitution was in general the best method, performing very well whether the percentage of missing data was small or large.
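The two error-based criteria summarised above can be sketched as follows. The interpretations are assumptions drawn from the criteria's names and their use in this study (ACD averages the absolute gaps between coefficients refitted on imputed data and the original coefficients; MAD averages the absolute gaps between imputed values and the deleted true values), and all numeric inputs other than the Equation (5.1) coefficients are hypothetical.

```python
# Sketches of the ACD and MAD assessment criteria; smaller is better for both.

def acd(original_coefs, imputed_coefs):
    """Average absolute difference between two coefficient vectors."""
    diffs = [abs(o - i) for o, i in zip(original_coefs, imputed_coefs)]
    return sum(diffs) / len(diffs)

def mad(true_values, imputed_values):
    """Mean absolute difference between deleted true values and their imputations."""
    diffs = [abs(t - m) for t, m in zip(true_values, imputed_values)]
    return sum(diffs) / len(diffs)

# Coefficients of the original model (Equation 5.1) versus hypothetical
# coefficients refitted after imputation:
original = [26.5354, 0.08643, 0.39793, 0.25676, -0.0872, -0.7075]
refitted = [25.9000, 0.09000, 0.41000, 0.24000, -0.0900, -0.6800]
print(acd(original, refitted))

# Deleted true values versus hypothetical imputed replacements:
print(mad([70.0, 65.0, 80.0], [69.0, 66.0, 79.5]))  # 2.5/3, about 0.83
```

Ranking the algorithms by these quantities at each missingness level is exactly the comparison performed in the summary above.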
On average, 90.8% of the total variation in life expectancy at birth was explained by the regression substitution model, which is close to the R² explained by the original regression model (89.83%). Finally, under the MAR mechanism, both the EM and MICE algorithms produced unsatisfactory results at large missingness percentages (greater than 5%) compared with the coefficient of determination of the original complete data (89.83%). At a small missingness percentage (5%), the EM algorithm performed credibly well with an R-squared of 87% compared with the original 89.83%. On the whole, the MICE algorithm performed slightly better than the EM algorithm. Therefore, under the MAR mechanism, both EM and MICE can be used to replace missing values when the amount of missing data is small.

Schmitt et al. (2015) pointed out that the most popular imputation methods, such as mean, KNN, SVD and MICE, are not necessarily the most efficient, a conclusion also supported by Celton, Malpertuy, Lelandais and Brevern (2010). The current study shares this conclusion in part: mean substitution and KNN provide unsatisfactory results when data are MCAR. However, when data are MAR, the MICE algorithm produced very good results, contrary to Schmitt et al. (2015) and Celton et al. (2010). According to Lazaro, Gbeha and Kakai (2018), mean substitution provides better accuracy when missing data are MCAR, and Hening (2009) also emphasized that the mean and median methods yielded satisfactory results when comparing different missing data imputation methods. The results of the current study do not support the conclusions of Lazaro et al. (2018) and Hening (2009); rather, mean substitution performed poorly when missing data were MCAR.
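The average R² quoted for the regression substitution model can be reproduced (up to rounding) from the regression substitution column of Table 4.14:

```python
# Average R-squared of the regression substitution model across the five
# missingness levels, from the Reg. Sub. column of Table 4.14.
reg_sub_r2 = [0.8854, 0.8981, 0.8975, 0.9198, 0.9425]
average_r2 = sum(reg_sub_r2) / len(reg_sub_r2)
print(average_r2)  # about 0.9087, i.e. the roughly 90.8% quoted above
```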
The work published by Turrado, Lopez, Lasheras, Gomez, Rolle and Juez (2014) pointed out that the MICE algorithm gives very good outcomes compared with other imputation approaches such as inverse distance weighting and multiple linear regression. Likewise, the study conducted by Porto de Carvalho, Monteiro, Kakai and Assad (2017) revealed that the MICE algorithm provides better estimates of daily precipitation values than geostatistical Kriging and Co-Kriging models. The good performance of the multiple imputation by chained equations (MICE) algorithm found in this study is therefore confirmed by Turrado et al. (2014) and Porto de Carvalho et al. (2017).

5.3 Conclusion

The study compared imputation algorithms under the different missing data mechanisms. It was revealed that under the MCAR mechanism, the ACD of the MLR model produced by the k-nearest-neighbor algorithm is lower than the ACDs produced by regression substitution and the mean substitution method. The MAD of regression substitution is lower than those of mean substitution and KNN, and the coefficient of determination of regression substitution is higher than those of the mean substitution method and the KNN algorithm. Therefore, based on these three performance assessment criteria, it is concluded that regression substitution should be used to impute missing values in the world population data sheet. Thus, the regression substitution method provides a comparatively successful replacement of missing world population data sheet values, which is supported by work published in the literature (Sattari, Joudi & Kusiak, 2016). Although the KNN imputation algorithm performs very well, it is not the best in this study. Also, comparing imputation algorithms under the MAR mechanism assumption, it was observed that the ACD of the MLR model produced by the MICE algorithm is smaller than the ACD produced by the EM algorithm. Besides, the MAD of the MICE algorithm is lower than the MAD of the EM algorithm.
Finally, the analysis clearly revealed that the average coefficient of determination produced by the MICE algorithm is higher than that of the EM algorithm. Based on these three measures, MICE is a highly accurate imputation algorithm for missing values of the world population data sheet and outperforms the EM algorithm in terms of imputation error. The overall conclusion, therefore, is that the multiple imputation by chained equations (MICE) algorithm is superior to the expectation maximization (EM) algorithm, as confirmed by Turrado et al. (2014) and Porto de Carvalho et al. (2017).

5.4 Recommendations

Based on the findings and inferences drawn from the investigation, the following suggestions are made for future research studies.

1. The study suggests that when data are missing completely at random (MCAR) and normally distributed, then among the three compared imputation algorithms, regression substitution is preferred; it is therefore recommended that the regression substitution method be used to replace missing values under the MCAR mechanism. The MICE algorithm was found to be comparatively the best algorithm for replacing missing values under the MAR mechanism, so it is suggested that the MICE algorithm be used to substitute missing data under MAR.

2. On the grounds of this study, it is recommended that before undertaking a missing data imputation, the distribution of the data, the missing data mechanism and the percentage of missing data be examined before choosing the best imputation method.

3. Moreover, since the issue of missing data cannot be avoided in data analysis, it is recommended that all research studies report the reasons which account for missingness, the proportion of incomplete data in the data matrix and the imputation algorithm employed at the analysis stage.

4.
Future studies can be targeted at determining an appropriate imputation algorithm for replacing missing values in cross-sectional World Population Data Sheet data when data are missing not at random (MNAR) and normally distributed. This is essential because much of the literature suggests that comparing imputation methods under the MNAR mechanism is a complex exercise.

5. This study concentrated mainly on missing data imputation in a cross-sectional dataset; it is therefore recommended that categorical and longitudinal studies also be considered.

REFERENCES

Acock, A. C. (2005). Working with missing values. Journal of Marriage and Family, 67, 1012–1028.

Allison, P. D. (2001). Missing data. In Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-136. Sage, Thousand Oaks, CA.

Allison, P. D. (2002). Missing data. Thousand Oaks, CA: Sage Publications.

Anghelache, C., & Scala, C. (2016). Multiple regression used to analyse the correlation between GDP and some variables. Romanian Statistical Review Supplement, No. 10, 79–85.

Azur, M. J., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2012). Multiple imputation by chained equations: What is it and how does it work?

Batista, G. E. A. P. A., & Monard, M. C. (2001). A study of k-nearest neighbour as a model-based method to treat missing data. In Proceedings of the Argentine Symposium on Artificial Intelligence, Buenos Aires, Argentina, vol. 30, pp. 1–9.

Batista, G. E. A. P. A., & Monard, M. C. (2003). An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence, 17, 519–533.

Bennett, D. A. (2001). How can I deal with missing data in my study? Australian and New Zealand Journal of Public Health, 25, 464–469.

Biorn, E. (2013). Introductory Econometrics, Department of Economics, ECON3150/4150.

Brown, R. L. (1994).
Efficacy of the indirect approach for estimating structural equation models with missing data: A comparison of five methods. Structural Equation Modeling, 1, 287–316.

Carpenter, J. R., & Kenward, M. G. (2013). Multiple Imputation and its Application. Chichester, West Sussex: John Wiley & Sons.

Celton, M., Malpertuy, A., Lelandais, G., & Brevern, A. (2010). Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments.

Cole, J. C. (2008). How to deal with missing data. In J. W. Osborne (Ed.), Best practices in quantitative methods. Thousand Oaks, CA: Sage, pp. 214–238.

Coelho-Barros, E. A., Simoes, P. A., Achcar, J. A., Martinez, E. Z., & Shimano, A. C. (2008). Methods of estimation in multiple linear regression: Application to clinical data. Revista Colombiana de Estadistica, 31(1), 111–129.

Chai, T., & Draxler, R. R. (2014). Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geoscientific Model Development Discussions.

Day, S. (1999). Dictionary for clinical trials. New York: John Wiley and Sons.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1), 1–38.

Enders, C. K. (2001). The performance of the full information maximum likelihood estimator in multiple regression models with missing data. Educational and Psychological Measurement, 61, 713–740.

Fogarty, D. J. (2008). Multiple imputation as a missing data approach to reject inference on consumer credit scoring. http://interstat.statjournals.net/YEAR/2006/articles/0609001.pdf

Golan, A. (2002). Information and entropy econometrics (special issue). Journal of Econometrics, 107(1-2).

Graham, J. W., & Hofer, S. M. (2000). Multiple imputation in multivariate research. In T. D.
Little, K. U.

Hair, J., Black, W., Babin, B., Anderson, R., & Tatham, R. (2006). Multivariate data analysis (6th ed.). Pearson Education, Inc.

Hening, A. D. (2009). Missing data imputation method comparison in Ohio University student retention.

Horton, N. J., & Kleinman, K. P. (2007). Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models. The American Statistician, 61, 79–90.

Honaker, J., King, G., & Blackwell, M. (2015). Amelia II: A program for missing data. Version 1.7.4.

Howard, E., & Gordoh, G. (2005). Statistical methods.

Huang, R., & Carriere, K. C. (2006). Comparison of methods for incomplete repeated measures data analysis in small samples. Journal of Statistical Planning and Inference, 136, 235–247.

Kim, K., & Bentler, P. (2002). Tests of homogeneity of means and covariance matrices for multivariate incomplete data. Psychometrika, 67(4), 609–623.

Lazaro, M., Gbeha, M., & Kakai, R. (2018). Influence of missing value imputations on the performance of canonical correspondence analysis: Ecological applications.

Lin, J., & Bentler, P. M. (2002). Probability based test for missing completely at random data patterns.

Little, R. J. A. (1988). A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association, 83, 1198–1202.

Little, R. J. A., & Rubin, D. B. (2002). Statistical Analysis with Missing Data (2nd ed.). Hoboken, NJ: John Wiley & Sons.

Little, R. J. A., & Rubin, D. B. (1987). Statistical Analysis with Missing Data. New York: John Wiley.

Liu, Y., & Brown, S. D. (2013). Comparison of five iterative imputation methods for multivariate classification. Chemometrics and Intelligent Laboratory Systems, 120, 106–115.

McDonald, R. A., Thurston, P. W., & Nelson, M. R. (2000). A Monte Carlo study of missing item methods. Organizational Research Methods, 3, 71–92.

McKnight, P. E., McKnight, K. M., Sidani, S., & Figueredo, A. J. (2007).
Missing Data: A Gentle Introduction. Guilford Press.

Meng, Z. Q., & Shi, Z. Z. (2012). Extended rough set-based attribute reduction in inconsistent incomplete decision systems. Information Sciences, 204, 44–69.

Morais, S. F. (2013). Dealing with missing data: An application in the study of family history of hypertension. Master's dissertation, Faculty of Medicine of the University of Porto.

Nelwamondo, F. V., Mohamed, S., & Marwala, T. (2007). Missing data: Artificial neural network and expectation maximization techniques. Current Science, 93(11), 1514–1521.

Pigott, T. D. (2001). A review of methods for missing data. Educational Research and Evaluation, 7, 353–383.

Pollard, J. H. (1988). On the decomposition of changes in expectation of life and differentials in life expectancy. Demography, 25(2), 265–276.

Population Reference Bureau. (2011). World Population Data Sheet. Washington, D.C., U.S.A.

Population Reference Bureau. (2013). World Population Data Sheet. Washington, D.C., U.S.A.

Porto de Carvalho, J. R., Monteiro, J. E. B. A., Kakai, A. M., & Assad, E. D. (2017). Model for multiple imputation to estimate daily rainfall data and filling of faults.

Raaijmaken, Q. A. W. (1999). Effectiveness of different missing data treatments in surveys with Likert-type data: Introducing the relative mean substitution approach. Educational and Psychological Measurement, 59, 725–748.

Raghunathan, T. W., Lepkowski, J. M., Van Hoewyk, J., & Solenberger, P. (2001). A multivariate technique for multiply imputing missing values using a sequence of regression models. 27, 85–95.

Rahman, M. G., & Islam, M. Z. (2011). A decision tree-based missing value imputation technique for data pre-processing.

Revicki, D. A., Karen, G., Buckman, D., Chan, K., Kallich, J. D., & Woolley, M. J. (2001).
Imputing physical health status scores missing owing to mortality: Results of a simulation comparing multiple techniques. Medical Care, 39(1), 61–71.

Roth, P. L. (1994). Missing data: A conceptual review for applied psychologists. Personnel Psychology, 47, 537–570.

Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434), 473–489.

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.

Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley.

Rubin, D., & Thayer, D. (1982). EM algorithms for ML factor analysis.

SAS Institute Inc. (2005). The SAS System, Version 9.3. SAS Institute Inc., Cary, NC. http://www.sas.com/

Sattari, M. T., Joudi, A. R., & Kusiak, A. (2016). Assessment of different methods for estimation of missing data in precipitation studies.

Savage, N. H., Agnew, P., Davis, L. S., Ordonez, C., Johnson, C. E., O'Connor, F. M., & Dalvi, M. (2013). Air quality modelling using the Met Office Unified Model. Geoscientific Model Development, 6, 353–372.

Savalei, V., & Bentler, P. M. (2009). A two-stage approach to missing data: Theory and application to auxiliary variables. Structural Equation Modeling, 16, 477–497.

Schafer, J. L. (1997). Analysis of Incomplete Multivariate Data. Monographs on Statistics and Applied Probability, No. 72. Chapman and Hall, London.

Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147–177.

Schmitt, P., Mandel, J., & Guedj, M. (2015). A comparison of six methods for missing data imputation. Journal of Biometrics and Biostatistics, 6(1), 1–6.

Schlomer, C. L., Buaman, S., & Card, N. A. (2010). Best practices for missing data management in counseling psychology. 57(1), 1–10.

Streiner, D. L. (2002). The case of the missing data: Methods of dealing with dropouts and other research vagaries.
Canadian Journal of Psychiatry, 47, 68–75.

Susianto, Y., Notodiputro, K. A., Kurnia, A., & Wijayanto, H. (2017). A comparative study of imputation methods for missing values of per capita expenditure in Central Java.

Turrado, C. C., Lopez, M. C. M., Lasheras, F. S., Gomez, B. A. R., Rolle, J. L. C., & Juez, F. J. C. (2014). Missing data imputation of solar radiation data under different atmospheric conditions.

Twala, B. (2005). Effective Techniques for Handling Incomplete Data Using Decision Trees. Unpublished PhD thesis, Open University, Milton Keynes, UK.

Twala, B., Cartwright, M., & Shepperd, M. (2005). Comparison of various methods for handling incomplete data in software engineering databases. 4th International Symposium on Empirical Software Engineering, Noosa Heads, Australia, November 2005.

Thijs, H., Molenberghs, G., Micheiels, B., Verbeke, G., & Curran, D. (2002). Strategies to fit pattern-mixture models. Biostatistics, 3(2), 245–265.

Van Buuren, S. (2007). Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16(3), 219–242.

Van Buuren, S., & Groothuis-Oudshoorn, K. (2011). Multivariate imputation by chained equations (mice) in R. Journal of Statistical Software, 45, 1–67.

Van Buuren, S. (2012). Flexible Imputation of Missing Data. Chapman & Hall/CRC: London, UK, p. 110.

Willmott, C. J., Matsuura, K., & Robeson, S. M. (2009). Ambiguities inherent in sums-of-squares-based error statistics. Atmospheric Environment, 43, 749–752.

Yuan, K. H., & Bentler, P. M. (2000). Three likelihood-based methods for mean and covariance structure analysis with non-normal missing data. In M. Becker & M. Sobel (Eds.), Sociological Methodology (pp. 165–200).

Yuan, Y. (2000). Multiple imputation for missing data: Concepts and new developments. Rockville, MD: SAS Institute, 267–275.

Appendix

Appendix I.
R Codes used in this study

### Regression of original data ###
oscar <- read.csv(file.choose(), header=T)
oscar
attach(oscar)
names(oscar)
cor(oscar)
pairs(oscar[,-6])
model <- lm(y~x1+x2+x3+x4+x5, data=oscar)
par(mfrow=c(1,2))
qqnorm(model$residuals)
qqline(model$residuals)
plot(model$fitted, model$residuals, xlab="Fitted", ylab="Residuals", main="Time")
abline(h=0)
shapiro.test(model$residuals)
ncvTest(model)

### Missing ###
oscar <- read.csv(file.choose(), header=T)
oscar
attach(oscar)
names(oscar)
cor(oscar)

### MCAR ###
### MCAR 5% ###
prop.m = .05  # 5% missingness
mcar1 = runif(106, min=0, max=1)
mcar2 = runif(106, min=0, max=1)
mcar3 = runif(106, min=0, max=1)
mcar4 = runif(106, min=0, max=1)
mcar5 = runif(106, min=0, max=1)
x2 = ifelse(mcar1 < prop.m, NA, oscar$x2)

### MAR ###
## 5% ##
xx2.mar = ifelse(mam2 > min(mmm2), NA, oscar$x2)
xx2.mar
mam4 <- 1-logistic(oscar$x4)
mmm4 <- tail(sort.int(oscar$x4, partial=length(oscar$x4) - 4), 5)
mam4
min(mmm4)
xx4.mar = ifelse(oscar$x4 > min(mmm4), NA, oscar$x4)
xx4.mar
mmm5 <- tail(sort.int(oscar$x5, partial=length(oscar$x5) - 4), 5)
mmm5
min(mmm5)
xx5.mar = ifelse(oscar$x5 > 93, NA, oscar$x5)
xx5.mar
mam6 <- 1-logistic(oscar$x6)
mmm6 <- tail(sort.int(mam6, partial=length(mam6) - 4), 5)
mmm6
min(mmm6)
xx6.mar = ifelse(mam6 > 1.026188e-10, NA, oscar$x6)
xx6.mar
mam7 <- 1-logistic(oscar$x7)
mmm7 <- tail(sort.int(mam7, partial=length(mam6) - 4), 5)
mmm7
min(mmm7)
xx7.mar = ifelse(mam7 > 0.2689414, NA, oscar$x7)
xx7.mar
View(cbind(xx2.mar, xx4.mar, xx5.mar, xx6.mar, xx7.mar))
datama05 <- data.frame(cbind(xx2.mar, xx4.mar, xx5.mar, xx6.mar, xx7.mar))
names(datama05)
attach(datama05)
str(datama05)
library(BaylorEdPsych)
library(mvnmle)
LittleMCAR(datama05)
write.csv(datama05, "C:/Users/Desktop/datama05.csv")

### MAR ###
## 10% ##
logistic <- function(x) exp(x)/(1+exp(x))
mam210 <- 1-logistic(oscar$x2)
mmm210 <- tail(sort.int(mam210, partial=length(mam210)-4), 10)
min(mmm210)
xx210.mar = ifelse(mam210 > min(mmm210), NA, oscar$x2)
xx210.mar
mam410 <- 1-logistic(oscar$x4)
mmm410 <- tail(sort.int(oscar$x4, partial=length(oscar$x4)-4), 10)
mam410
min(mmm410)
xx410.mar = ifelse(oscar$x4 > min(mmm410), NA, oscar$x4)
xx410.mar
mmm510 <- tail(sort.int(oscar$x5, partial=length(oscar$x5)-4), 10)
mmm510
min(mmm510)
xx510.mar = ifelse(oscar$x5 > 89, NA, oscar$x5)
xx510.mar
mam610 <- 1-logistic(oscar$x6)
mmm610 <- tail(sort.int(mam6, partial=length(mam6)-4), 10)
mmm6
min(mmm610)
xx610.mar = ifelse(mam610 > min(mmm610), NA, oscar$x6)
xx610.mar
mam710 <- 1-logistic(oscar$x7)
mmm710 <- tail(sort.int(mam7, partial=length(mam6)-4), 10)
mmm710
min(mmm710)
xx710.mar = ifelse(mam710 > min(mmm610), NA, oscar$x7)
xx710.mar
View(cbind(xx210.mar, xx410.mar, xx510.mar, xx610.mar, xx710.mar))
datama10 <- data.frame(cbind(xx210.mar, xx410.mar, xx510.mar, xx610.mar, xx710.mar))
names(datama10)
attach(datama10)
str(datama10)
library(BaylorEdPsych)
library(mvnmle)
LittleMCAR(datama10)
write.csv(datama10, "C:/Users/Desktop/datama10.csv")

### MAR ###
## 20% ##
logistic <- function(x) exp(x)/(1+exp(x))
mam220 <- 1-logistic(oscar$x2)
mmm220 <- tail(sort.int(oscar$x2, partial=length(oscar$x2)-4), 20)
min(mmm220)
xx220.mar = ifelse(oscar$x2 > 65, NA, oscar$x2)
xx220.mar
mam420 <- 1-logistic(oscar$x4)
mmm420 <- tail(sort.int(oscar$x4, partial=length(oscar$x4)-4), 20)
mam420
min(mmm420)
xx420.mar = ifelse(oscar$x4 <= min(mmm420), NA, oscar$x4)
xx420.mar
mmm520 <- tail(sort.int(oscar$x5, partial=length(oscar$x5)-4), 20)
min(mmm520)
xx520.mar = ifelse(oscar$x5 < min(mmm520), NA, oscar$x5)
xx520.mar
mam620 <- 1-logistic(oscar$x6)
mmm620 <- tail(sort.int(oscar$x6, partial=length(oscar$x6)-4), 20)
mmm620
min(mmm620)
xx620.mar = ifelse(oscar$x6 > min(mmm620), NA, oscar$x6)
xx620.mar
mam720 <- 1-logistic(oscar$x7)
mmm720 <- tail(sort.int(mam720, partial=length(mam720)-4), 20)
mmm720
min(mmm720)
xx720.mar = ifelse(mam720 < min(mmm720), NA, oscar$x7)
xx720.mar
View(cbind(xx220.mar, xx420.mar, xx520.mar, xx620.mar, xx720.mar))
datama20 <- data.frame(cbind(xx220.mar, xx420.mar, xx520.mar, xx620.mar, xx720.mar))
names(datama20)
attach(datama20)
str(datama20)
library(BaylorEdPsych)
library(mvnmle)
LittleMCAR(datama20)
write.csv(datama20, "C:/Users/Desktop/datama20.csv")

### MAR ###
## 30% ##
logistic <- function(x) exp(x)/(1+exp(x))
mam230 <- 1-logistic(oscar$x2)
mmm230 <- tail(sort.int(oscar$x2, partial=length(oscar$x2)-4), 30)
min(mmm230)
xx230.mar = ifelse(oscar$x2 > 58, NA, oscar$x2)
xx230.mar
mam430 <- 1-logistic(oscar$x4)
mmm430 <- tail(sort.int(oscar$x4, partial=length(oscar$x4)-4), 30)
mam430
min(mmm430)
xx430.mar = ifelse(oscar$x4 <= min(mmm430), NA, oscar$x4)
xx430.mar
mmm530 <- tail(sort.int(oscar$x5, partial=length(oscar$x5)-4), 30)
mmm530
min(mmm530)
xx530.mar = ifelse(oscar$x5 < min(mmm530), NA, oscar$x5)
xx530.mar
mam630 <- 1-logistic(oscar$x6)
mmm630 <- tail(sort.int(oscar$x6, partial=length(oscar$x6)-4), 30)
mmm630
min(mmm630)
xx630.mar = ifelse(oscar$x6 > min(mmm630), NA, oscar$x6)
xx630.mar
mam730 <- 1-logistic(oscar$x7)
mmm730 <- tail(sort.int(mam730, partial=length(mam720)-4), 30)
mmm730
w = sort(mam730)
min(mmm730)
xx730.mar = ifelse(mam730 <= , NA, oscar$x7)
xx730.mar
View(cbind(xx230.mar, xx430.mar, xx530.mar, xx630.mar, xx730.mar))
datama30 <- data.frame(cbind(xx230.mar, xx430.mar, xx530.mar, xx630.mar, xx730.mar))
names(datama30)
attach(datama30)
str(datama30)
library(BaylorEdPsych)
library(mvnmle)
LittleMCAR(datama30)
write.csv(datama30, "C:/Users/Desktop/datama30.csv")

### MAR ###
## 40% ##
logistic <- function(x) exp(x)/(1+exp(x))
mam240 <- 1-logistic(oscar$x2)
mmm240 <- tail(sort.int(oscar$x2, partial=length(oscar$x2)-4), 40)
min(mmm240)
xx240.mar = ifelse(oscar$x2 <= min(mmm240), NA, oscar$x2)
xx240.mar
mmm440 <- tail(sort.int(oscar$x4, partial=length(oscar$x4)-4), 40)
mam440
min(mmm420)
xx440.mar = ifelse(oscar$x4 > min(mmm420), NA, oscar$x4)
xx440.mar
mmm540 <- tail(sort.int(oscar$x5, partial=length(oscar$x5)-4), 40)
mmm540
min(mmm540)
xx540.mar = ifelse(oscar$x5 < min(mmm540), NA, oscar$x5)
xx540.mar
mam640 <- 1-logistic(oscar$x6)
mmm640 <- tail(sort.int(oscar$x6, partial=length(oscar$x6)-4), 40)
xx640.mar = ifelse(oscar$x6 > min(mmm640), NA, oscar$x6)
xx640.mar
Y <- sort(mam740)
mam740 <- 1-logistic(oscar$x7)
mmm740 <- tail(sort.int(mam720, partial=length(mam720)-4), 20)
mmm740
min(mmm720)
xx740.mar = ifelse(mam720 > 0.0474258732, NA, oscar$x7)
xx740.mar
View(cbind(xx240.mar, xx440.mar, xx540.mar, xx640.mar, xx740.mar))
datama40 <- data.frame(cbind(xx240.mar, xx440.mar, xx540.mar, xx640.mar, xx740.mar))
names(datama40)
attach(datama40)
str(datama40)
library(BaylorEdPsych)
library(mvnmle)
LittleMCAR(datama40)
write.csv(datama40, "C:/Users/Desktop/datama40.csv")

Appendix II - KNN Imputation

Table 1: KNN IMPUTATION AT 5%

Variable   Estimate     Std. Error   t-value   p-value
Constant   24.41543     4.458885      5.48     0.0000
CMW         0.0677847   0.0343056     1.98     0.0510
DEN         0.4524383   0.0531727     8.51     0.0000
RWS         0.2452794   0.0460424     5.33     0.0000
IMR        -0.0887447   0.0099929    -8.88     0.0000
TFR        -0.6127951   0.2278845    -2.69     0.0080

R² = 0.8644; F(5,100) = 127.45; p-value of F-statistic = 0.000

Table 2: KNN REGRESSION AT 10%

Variable   Estimate     Std. Error   t-value   p-value
Constant   25.51724     4.771705      5.35     0.0000
CMW         0.0746098   0.0379418     1.97     0.0520
DEN         0.4528977   0.0551922     8.21     0.0000
RWS         0.2387563   0.0473055     5.05     0.0000
IMR        -0.0940196   0.0105682    -8.90     0.0000
TFR        -0.7596007   0.2532781    -3.00     0.0030

R² = 0.8575; F(5,100) = 120.39; p-value of F-statistic = 0.000

Table 3: KNN IMPUTATION AT 20%

Variable   Estimate     Std. Error   t-value   p-value
Constant   17.60174     4.974587      3.54     0.0010
CMW         0.0302327   0.0417292     0.72     0.4700
DEN         0.5403715   0.0611663     8.83     0.0000
RWS         0.2767879   0.0475857     5.82     0.0000
IMR        -0.0854882   0.0112150    -7.62     0.0000
TFR        -0.6334724   0.2878913    -2.20     0.0300

R² = 0.8449; F(5,100) = 108.99; p-value of F-statistic = 0.000

Table 4: KNN IMPUTATION AT 30%

Variable   Estimate     Std. Error   t-value   p-value
Constant   18.57368     5.190420      3.58     0.0010
CMW         0.0422071   0.0433263     0.97     0.3320
DEN         0.5188958   0.0625237     8.30     0.0000
RWS         0.3031963   0.0511440     5.93     0.0000
IMR        -0.1015226   0.0122141    -8.31     0.0000
TFR        -0.6903953   0.3166767    -2.18     0.0320

R² = 0.8490; F(5,100) = 112.49; p-value of F-statistic = 0.000

Table 5: KNN IMPUTATION AT 40%

Variable   Estimate     Std. Error   t-value   p-value
Constant   24.09558     4.747458      5.08     0.0000
CMW         0.0820895   0.0361516     2.27     0.0250
DEN         0.4538946   0.0547902     8.28     0.0000
RWS         0.2392104   0.0484447     4.94     0.0000
IMR        -0.0896554   0.0105635    -8.49     0.0000
TFR        -0.6408525   0.2441600    -2.62     0.0100

R² = 0.8457; F(5,100) = 109.62; p-value of F-statistic = 0.000

Appendix III - Mean Imputation

Table 6: MEAN IMPUTATION AT 5%

Variable   Estimate     Std. Error   t-value   p-value
Constant   20.75863     4.996672      4.15     0.0000
CMW         0.0904376   0.0394639     2.29     0.0240
DEN         0.4754724   0.0561035     8.47     0.0000
RWS         0.2591374   0.0512571     5.06     0.0000
IMR        -0.0874646   0.0114698    -7.63     0.0000
TFR        -0.6865630   0.2733041    -2.51     0.0140

R² = 0.8226; F(5,100) = 92.73; p-value of F-statistic = 0.000

Table 7: MEAN IMPUTATION AT 10%

Variable   Estimate     Std. Error   t-value   p-value
Constant   11.18575     5.274203      2.12     0.0360
CMW         0.0461124   0.0448393     1.03     0.3060
DEN         0.5577717   0.0636267     8.77     0.0000
RWS         0.3224677   0.0510945     6.31     0.0000
IMR        -0.0785738   0.0125041    -6.28     0.0000
TFR        -0.5867620   0.3166016    -1.85     0.0670

R² = 0.7944; F(5,100) = 77.28; p-value of F-statistic = 0.000

Table 8: MEAN IMPUTATION AT 20%

Variable   Estimate     Std. Error   t-value   p-value
Constant    8.988445    5.746310      1.56     0.1210
CMW         0.0667237   0.0471772     1.41     0.1600
DEN         0.5351353   0.0669713     7.99     0.0000
RWS         0.3652465   0.0548680     6.66     0.0000
IMR        -0.0880375   0.0136471    -6.45     0.0000
TFR        -0.4457335   0.3566515    -1.25     0.2140

R² = 0.7848; F(5,100) = 72.95; p-value of F-statistic = 0.000

Table 9: MEAN IMPUTATION AT 30%

Variable   Estimate     Std. Error   t-value   p-value
Constant   13.07426     6.731646      1.94     0.0550
CMW         0.0288930   0.0526946     0.55     0.5850
DEN         0.6370891   0.0692460     9.20     0.0000
RWS         0.2870867   0.0703405     4.08     0.0000
IMR        -0.0995959   0.0162369    -6.13     0.0000
TFR        -1.0344210   0.3313135    -3.12     0.0020

R² = 0.7477; F(5,100) = 59.28; p-value of F-statistic = 0.000

Table 10: MEAN IMPUTATION AT 40%

Variable   Estimate     Std. Error   t-value   p-value
Constant   35.67183     4.569475      7.81     0.0000
CMW         0.0131438   0.0276253     0.48     0.6350
DEN         0.3033493   0.0446717     6.79     0.0000
RWS         0.2741616   0.0432158     6.34     0.0000
IMR        -0.1174565   0.0106542   -11.02     0.0000
TFR        -0.1122018   0.1846629    -0.61     0.5450

R² = 0.9425; F(5,100) = 327.79; p-value of F-statistic = 0.000

Appendix IV - Regression Imputation

Table 11: REGRESSION IMPUTATION AT 5%

Variable   Estimate     Std. Error   t-value   p-value
Constant   26.15660     4.152190      6.30     0.0000
CMW         0.0781062   0.0314003     2.49     0.0150
DEN         0.4094124   0.0498192     8.22     0.0000
RWS         0.2534665   0.0420098     6.03     0.0000
IMR        -0.0881772   0.0092606    -9.52     0.0000
TFR        -0.5992677   0.2083631    -2.88     0.0050

R² = 0.8854; F(5,100) = 154.47; p-value of F-statistic = 0.000

Table 12: REGRESSION IMPUTATION AT 10%

Variable   Estimate     Std. Error   t-value   p-value
Constant   24.55656     3.823012      6.42     0.0000
CMW         0.0729499   0.0293616     2.48     0.0150
DEN         0.4549499   0.0487914     9.32     0.0000
RWS         0.2429570   0.0385584     6.30     0.0000
IMR        -0.0828031   0.0092185    -8.98     0.0000
TFR        -0.7533552   0.1877110    -4.01     0.0000

R² = 0.8981; F(5,100) = 176.33; p-value of F-statistic = 0.000

Table 13: REGRESSION IMPUTATION AT 20%

Variable   Estimate     Std. Error   t-value   p-value
Constant   21.71700     3.897818      5.57     0.0000
CMW         0.0609620   0.0322603     1.89     0.0620
DEN         0.4894645   0.0494193     9.90     0.0000
RWS         0.2482765   0.0387981     6.40     0.0000
IMR        -0.0833678   0.0090782    -9.18     0.0000
TFR        -0.6500357   0.2229820    -2.92     0.0040

R² = 0.8975; F(5,100) = 175.11; p-value of F-statistic = 0.000

Table 14: REGRESSION IMPUTATION AT 30%

Variable   Estimate     Std. Error   t-value   p-value
Constant   23.78943     3.929865      6.05     0.0000
CMW         0.0747011   0.0294856     2.53     0.0130
DEN         0.4150130   0.0473078     8.77     0.0000
RWS         0.2745521   0.0382235     7.18     0.0000
IMR        -0.0871608   0.0090673    -9.61     0.0000
TFR        -0.4174209   0.2166326    -1.93     0.0570

R² = 0.9198; F(5,100) = 229.26; p-value of F-statistic = 0.000

Table 15: REGRESSION IMPUTATION AT 40%

Variable   Estimate   Std.
Error t-value p-value Constant 35.67183 4.569475 7.81 0.0000 CMW 0.0131438 0.0276253 0.48 0.6350 DEN 0.3033493 0.0446717 6.79 0.0000 RWS 0.2741616 0.0432158 6.34 0.0000 IMR -0.1174565 0.0106542 -11.02 0.0000 TFR -0.1122018 0.1846629 -0.61 0.5450 R2= 0.9425; F(5,100)=327.79; P-value of F-statistic=0.000 Appendix VI–EM IMPUTATION Table 16: EM IMPUTATION AT 5% Variable Estimate Std. Error t-value p-value Constant 27.27421 4.209365 6.48 0.0000 CMW 0.0732485 0.0334635 2.19 0.0310 DEN 0.3740375 0.0533309 7.01 0.0000 RWS 0.2774389 0.0411899 6.74 0.0000 IMR -0.0892841 0.009394 -9.5 0.0000 TFR -0.6294586 0.1993551 -3.16 0.0020 R2= 0.8700; F(5,100)=133.79; P-value of F-statistic=0.000 Table 17: EM IMPUTATION AT 10% Variable Estimate Std. Error t-value p-value Constant 31.02564 6.102412 5.08 0.0000 CMW 0.0968179 0.0435426 2.22 0.0280 DEN 0.3539075 0.0720339 4.91 0.0000 RWS 0.2268911 0.0526043 4.31 0.0000 IMR -0.0840243 0.0139266 -6.03 0.0000 TFR -0.6365357 0.2584398 -2.46 0.0150 R2= 0.7883; F(5,100)=74.46; P-value of F-statistic=0.000 97 University of Ghana http://ugspace.ug.edu.gh Table 18: EM IMPUTATION AT 20% Variable Estimate Std. Error t-value p-value Constant 30.30845 4.769065 6.36 0.0000 CMW 0.0559433 0.0454289 1.23 0.2210 DEN 0.4583636 0.0799098 5.74 0.0000 RWS 0.1630587 0.0503629 3.24 0.0020 IMR -0.0685205 0.012509 -5.48 0.0000 TFR -0.9983117 0.2582418 -3.87 0.0000 R2= 0.7892; F(5,100)=74.86; P-value of F-statistic=0.000 Table 19: EM IMPUTATION AT 30% Variable Estimate Std. Error t-value p-value Constant 23.21483 7.721299 3.01 0.0030 CMW 0.0018959 0.056835 0.03 0.9730 DEN 0.3544012 0.0864141 4.1 0.0000 RWS 0.3159736 0.0802667 3.94 0.0000 IMR -0.044099 0.0179196 -2.46 0.0160 TFR -0.0692912 0.3427523 -0.2 0.8400 R2= 0.6202; F(5,100)=32.66; P-value of F-statistic=0.000 Table 20: EM IMPUTATION AT 40% Variable Estimate Std. 
Error t-value p-value Constant 32.86259 7.803779 4.21 0.0000 CMW 0.1085613 0.0587219 1.85 0.0670 DEN 0.1500739 0.0929605 1.61 0.1100 RWS 0.3603069 0.0875461 4.12 0.0000 IMR -0.0845097 0.0184588 -4.58 0.0000 TFR -0.3423179 0.497264 -0.69 0.4930 R2= 0.6538; F(5,100)=37.77; P-value of F-statistic=0.000 98 University of Ghana http://ugspace.ug.edu.gh Appendix VII - MICE IMPUTATION Table 21: MICE IMPUTATION AT 5% Variable Estimate Std. Error t-value p-value Constant 24.637 4.44824 5.54 0.0000 CMW 0.06886 0.03432 2.01 0.0480 DEN 0.43458 0.05623 7.73 0.0000 RWS 0.24994 0.04431 5.64 0.0000 IMR -0.0844 0.01033 -8.17 0.0000 TFR -0.5181 0.21508 -2.41 0.0180 R2= 0.8529; F(5,100)=115.99; P-value of F-statistic=0.000 Table 22: MICE IMPUTATION AT 10% Variable Estimate Std. Error t-value p-value Constant 30.4423 5.51822 5.52 0.0000 CMW 0.07884 0.04121 1.91 0.0590 DEN 0.3301 0.06821 4.84 0.0000 RWS 0.2637 0.04996 5.28 0.0000 IMR -0.0925 0.01341 -6.9 0.0000 TFR -0.3801 0.23723 -1.6 0.1120 R2= 0.8171; F(5,100)=89.34; P-value of F-statistic=0.000 Table 23: MICE IMPUTATION AT 20% Variable Estimate Std. Error t-value p-value Constant 24.7085 5.6094 4.4 0.0000 CMW 0.04394 0.04359 1.01 0.3160 DEN 0.35512 0.06898 5.15 0.0000 RWS 0.33901 0.05088 6.66 0.0000 IMR -0.072 0.01311 -5.49 0.0000 TFR -0.8652 0.26522 -3.26 0.0020 R2= 0.7496; F(5,100)=59.87; P-value of F-statistic=0.000 99 University of Ghana http://ugspace.ug.edu.gh Table 24: MICE IMPUTATION AT 30% Variable Estimate Std. Error t-value p-value Constant 23.1781 6.4062 3.62 0.0000 CMW 0.04657 0.0506 0.92 0.3600 DEN 0.35153 0.07774 4.52 0.0000 RWS 0.31924 0.06704 4.76 0.0000 IMR -0.0561 0.01547 -3.62 0.0000 TFR -0.4029 0.35025 -1.15 0.2530 R2= 0.6777; F(5,100)=42.04; P-value of F-statistic=0.000 Table 25: MICE IMPUTATION AT 40% Variable Estimate Std. 
Error t-value p-value Constant 34.3272 7.99457 4.29 0.0000 CMW 0.07746 0.0521 1.49 0.1400 DEN 0.21235 0.07171 2.96 0.0040 RWS 0.30229 0.05956 5.08 0.0000 IMR -0.0845 0.01983 -4.26 0.0000 TFR -0.2824 0.34182 -0.83 0.4110 R2= 0.6410; F(5,100)=35.71; P-value of F-statistic=0.000 100 University of Ghana http://ugspace.ug.edu.gh Appendix VIII-World Population Data Sheet Table 26: The world population data sheet, 2011 Country Level Y X1 X2 X3 X4 X5 Jordan M 73 43 67 83 58 4 Syria H 74 43 68 84 54 3 Yemen L 65 38 60 75 93 5 Bangladesh M 69 40 64 79 75 2 Bhutan M 69 40 64 79 75 3 India L 64 38 59 73 97 3 Kazakhstan M 69 40 64 79 75 3 Kyrgyzstan M 85 50 78 96 5 3 Maldives M 54 32 50 63 141 2 Nepals M 66 39 61 76 89 3 Pakistan L 77 45 71 88 40 4 Sri Lanka H 72 42 66 82 62 2 Tajikistan M 61 36 57 70 100 3 Uzbekistan M 76 44 70 87 45 3 Cambodia L 62 36 58 71 96 3 Indonesia M 71 42 66 81 67 2 Loas L 65 38 60 75 93 4 Phiilippines M 68 40 63 78 80 3 Thailand H 74 43 68 84 54 2 Timor-Leste L 62 36 58 71 56 6 Vietman M 73 43 67 83 58 2 China H 74 43 68 84 54 2 Mongolia M 74 43 68 84 54 3 Estonia H 76 44 70 87 45 2 Latvia M 65 38 60 757 93 1 Lesotho L 58 34 54 67 124 3 South Africa L 67 39 62 77 84 2 Swaziland L 83 48 84 94 14 4 Belize H 78 46 80 89 36 3 Costa Rica H 55 32 57 64 137 2 El Salvador M 71 42 737 81 100 2 Guatemala M 51 30 53 59 45 4 Honduras M 59 35 61 68 96 3 Mexico H 62 36 64 71 67 2 101 University of Ghana http://ugspace.ug.edu.gh Table 27: The world population data sheet, 2011 Country Level Y X1 X2 X3 X4 X5 Nicaragua H 78 46 80 89 93 6 Dominican Republic M 61 36 63 70 80 3 Jamaica M 56 50 58 65 133 2 Argentina H 44 39 46 52 185 2 Bolivia H 72 64 74 82 62 3 Brazil H 73 65 75 83 58 2 Colombia M 77 68 79 88 40 2 Ecuador H 66 59 68 76 89 3 Guyana M 58 52 60 67 124 3 Paraguay M 71 63 50 81 67 3 Peru H 64 57 61 73 97 3 Suriname M 67 59 71 77 84 2 Uruguay H 73 65 66 83 58 2 Armenia M 67 59 57 77 84 2 Azerbaijan H 85 75 70 96 5 2 Georgia H 71 63 58 81 67 2 Iraq M 53 47 66 
62 146 5 Aigeria M 66 59 60 76 89 2 Egypt M 75 66 63 86 49 3 Morocco M 64 57 68 74 97 2 Tunisia H 63 56 58 73 102 2 Benin L 65 58 67 75 93 5 Burkina Faso L 80 71 68 91 27 6 Cape Verde H 57 51 68 66 128 3 Cote Divoire L 52 46 70 60 150 5 Gambia L 59 52 61 68 119 5 Ghana L 64 57 66 74 97 4 Guinea L 54 48 56 75 141 5 Guinea-Bissau L 48 43 50 67 168 5 Liberia L 57 51 59 77 128 6 Mali L 52 46 54 94 150 6 Mauritania L 59 52 61 89 119 4 Niger L 79 70 81 64 32 7 102 University of Ghana http://ugspace.ug.edu.gh Table 28: The world population data sheet, 2011 Country Level Y X1 X2 X3 X4 X5 Nigeria L 81 72 82 81 23 6 Senegal L 56 50 58 59 133 5 Sierra Leone L 74 66 76 68 54 5 Togo L 67 59 69 71 84 5 Burundi L 63 56 65 89 102 6 Comoros L 75 66 77 70 49 5 Djibouti L 68 60 70 65 80 4 Ethopia L 69 61 71 52 75 5 Kenya L 62 55 64 72 106 5 Madagascar M 54 48 56 63 141 5 Malawi L 76 67 78 87 45 6 Mozambique L 67 59 69 77 84 6 Rwanda L 66 59 68 76 89 5 Tanzania L 53 47 55 62 146 5 Uganda L 69 61 71 79 75 6 Zambia L 66 63 68 76 89 6 Angola L 66 57 68 76 89 6 Cameroon L 66 59 68 76 89 5 Central Africa Rep. L 65 65 67 75 93 5 Chad L 82 59 83 94 18 6 Congo L 58 75 60 67 124 5 Gabon L 63 63 65 73 102 3 SaoTome &Principe L 62 47 64 72 106 5 Belarus M 71 59 73 81 67 2 Bulgaria H 74 66 76 85 54 2 Czech Rep H 78 57 80 89 36 2 Hungary H 74 56 76 85 49 1 Moldova M 69 58 71 79 97 1 Poland H 76 71 78 87 102 1 Russia M 69 51 71 79 93 2 Slovakia H 75 46 77 86 27 1 Ukraine M 69 52 71 79 128 1 Albania H 75 57 77 86 150 1 103 University of Ghana http://ugspace.ug.edu.gh Table 29: The world population data sheet, 2011 Country Level Y X1 X2 X3 X4 X5 Bosnia-Herzegovina H 76 48 78 87 119 1 Macedonia H 74 66 76 85 97 2 Montenegro H 74 66 76 85 141 2 Serbia H 74 66 76 85 168 1 Slovenia H 80 71 82 91 128 2 Papua New Guinea L 62 55 64 72 150 4 104
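For readers who want to reproduce results like those in Appendix II on data such as the sheet above, the core of k-nearest-neighbour imputation can be sketched as below. This is a minimal illustrative Python/NumPy version under stated assumptions (fully observed rows serve as the donor pool; Euclidean distance on the jointly observed columns; missing cells filled with the donor mean); it is not the exact R routine used in the thesis.

```python
import numpy as np

def knn_impute(X, k=5):
    """Fill NaNs in each row using the mean of the k nearest complete rows.

    Distances are computed only over the columns observed in the target
    row. Rows with no missing values are returned unchanged.
    """
    X = np.asarray(X, dtype=float)
    out = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]   # donor pool of complete rows
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        # squared Euclidean distance on the observed columns only
        d = ((complete[:, ~miss] - row[~miss]) ** 2).sum(axis=1)
        donors = complete[np.argsort(d)[:k]]
        out[i, miss] = donors[:, miss].mean(axis=0)
    return out
```

In practice a library implementation (for example, scikit-learn's KNNImputer, which additionally rescales distances for missing coordinates) would be preferred over this sketch.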