University of Ghana http://ugspace.ug.edu.gh

UNIVERSITY OF GHANA

STATISTICAL ASSESSMENT OF IMPUTATION ALGORITHMS FOR ESTIMATION OF MISSING VALUES IN CROSS-SECTIONAL DATA

BY

OSCAR GYIMAH
10599415

THIS THESIS IS SUBMITTED TO THE UNIVERSITY OF GHANA, LEGON, IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE AWARD OF THE MPHIL STATISTICS DEGREE

October 19, 2018

DECLARATION

I hereby declare that this submission is my own work towards the award of the MPhil degree and that, to the best of my knowledge, it contains no material previously published by another person, nor material which has been accepted for the award of any other degree of this university or elsewhere, except where due acknowledgment has been made in the text.

OSCAR GYIMAH .......................... ....................
Student Signature Date
(10599415)

Certified by:
DR. ANANI LOTSI .......................... ....................
Supervisor Signature Date

Certified by:
DR. LOUIS ASIEDU .......................... ....................
Supervisor Signature Date

DEDICATION

This work is dedicated to my children: Samuel Macbeth Gyimah, Holiana Adjeiwaa Gyimah, Bomo-Yaa Gyimah, Yvonne Akua Tawiah Gyimah and Paul Nelson-Nyameyekesse Gyimah.

ACKNOWLEDGEMENT

First and foremost, I give thanks and appreciation to Almighty God, who has endowed me with the wisdom, knowledge and great opportunity to continue my education to this level. I am deeply grateful to my project supervisors, Dr. Anani Lotsi and Dr. Louis Asiedu, for their immeasurable advice, guidance and support throughout my MPhil programme. I am also indebted to my parents, Mr. and Mrs. Gyimah, for their countless assistance towards my upbringing and for their financial support.
I would also like to express my profound gratitude to Mr. Emmanuel Aidoo (PhD Statistics student) and Felix Dela Djokoto (MPhil Statistics student), who assisted me in using R to run my data, and to all the lecturers of the Department of Statistics at the University of Ghana. Last but not least, my sincere appreciation goes to my family for their invaluable support and prayers during the period of my study.

ABSTRACT

The validity and quality of data analysis rely largely on the accuracy and completeness of the data matrix. Missing values are an unavoidable problem in almost every research study and, if not handled properly, may lead to biased conclusions. This study investigated the efficacy and accuracy of convergence of five imputation algorithms: expectation maximization (EM), multiple imputation by chained equations (MICE), k nearest neighbor (KNN), mean substitution (MS) and regression substitution (RS), in estimating and replacing missing values in the cross-sectional World Population Data Sheet under the MCAR and MAR assumptions. This thesis used Little's test to assess whether the missing values in a given data matrix are MCAR. A multiple linear regression model was fitted to the complete World Population Data Sheet, after which missing values were artificially introduced into the complete data set at 5%, 10%, 20%, 30% and 40% under two missing data mechanisms (MCAR and MAR). The imputation algorithms were assessed and compared using the average coefficient difference (ACD) of the multiple linear regression (MLR) model, the mean absolute difference (MAD) and the coefficient of determination (R2). The study suggests that when data in the cross-sectional World Population Data Sheet are missing completely at random (MCAR) and normally distributed, regression substitution is the best approach.
The MICE algorithm was found to be comparatively the best method for replacing missing values under the MAR assumption. Since this thesis concentrates mainly on missing data imputation in a cross-sectional data set, it is recommended that future studies consider categorical and longitudinal data.

CONTENTS

DECLARATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES

1 INTRODUCTION
1.1 Introduction
1.2 Problem Statement
1.3 Objectives of the Study
1.4 Significance of the Study
1.5 Methodology
1.6 Motivation of the Research
1.7 Scope of the Study
1.8 Limitations
1.9 Thesis Organization

2 LITERATURE REVIEW
2.0 Introduction
2.1 Missing Values
2.2 Missing Data Mechanism
2.3 Ignorability Mechanism
2.4 Pattern of Missing Data
2.5 Traditional Methods of Treating Missing Data
2.5.1 Mean Substitution
2.5.2 K Nearest Neighbor (KNN) Imputation Algorithm
2.5.3 Regression Substitution
2.6 Modern Methods of Treating Missing Data
2.6.1 Expectation-Maximization (EM) Algorithm
2.6.2 Multiple Imputation by Chained Equation (MICE) Algorithm
2.7 Measures of Performance Assessment
2.7.1 Mean Absolute Difference (MAD)
2.7.2 Root Mean Squared Error (RMSE)
2.7.3 Coefficient of Determination (R2)
2.8 Multiple Linear Regression (MLR) Model

3 METHODOLOGY
3.1 Introduction
3.2 Source of Data
3.3 Research Design
3.4 Multiple Linear Regression (MLR)
3.4.1 The Multiple Linear Regression (MLR) Model
3.4.2 Matrix Representation of the Model
3.4.3 Assumptions of the Multiple Linear Regression
3.4.4 Testing for Overall Regression Significance
3.4.5 Testing for the Significance of the Slopes
3.4.6 Role of R2 and r2
3.4.7 Multicollinearity
3.4.8 Heteroscedasticity
3.4.9 Breusch-Pagan Test
3.4.10 Remedy for Assumption Violation
3.4.11 Outliers
3.4.12 Normality Test
3.5 Testing the Missing Data Mechanism (MCAR & MAR) Assumption
3.5.1 Little's Test of MCAR
3.6 Classification of Missing Data under the Assumptions of Various Missing Data Mechanisms
3.7 Imputation Algorithms for Treating Missing Values under the MCAR Mechanism
3.7.1 K Nearest Neighbors Imputation (KNN) Algorithm
3.7.2 Regression Substitution
3.7.3 Mean Substitution (MS)
3.8 Algorithms for Treating Missing Values under the MAR Mechanism
3.8.1 Expectation-Maximization (EM) Algorithm
3.8.2 Multiple Imputation by Chained Equation (MICE) Algorithm
3.9 Evaluation Assessment Criteria to Compare the Imputation Algorithms
3.9.1 Mean Absolute Difference (MAD)
3.9.2 Root Mean Squared Error (RMSE)
3.9.3 Coefficient of Determination
3.10 Data Analysis Procedure

4 DATA ANALYSIS AND DISCUSSION OF RESULTS
4.1 Introduction
4.2 Descriptive Statistics
4.3 Multiple Linear Regression (MLR) Model
4.4 Missing Data Mechanism Test
4.5 Comparison of Imputation Algorithms for Treating Missing Values
4.6 Comparison of Imputation Algorithms for Treating Missing Values under the MLR Model using ACD
4.6.1 Comparison of Imputation Algorithms for Treating Missingness under the MCAR Mechanism
4.6.2 Comparison of EM and MICE Algorithms for Treating Missingness under the MAR Mechanism using ACD
4.7 Comparison of Imputation Algorithms for Treating Missing Values using Mean Absolute Difference (MAD)
4.8 Comparison of Imputation Algorithms for Treating Missing Values using Coefficient of Determination (R2)

5 SUMMARY, CONCLUSION AND RECOMMENDATIONS
5.1 Introduction
5.2 Summary
5.3 Conclusion
5.4 Recommendations

REFERENCES
Appendix

LIST OF ABBREVIATIONS

ACD     Average Coefficient Difference
ANN     Artificial Neural Network
CD4     Cluster of Differentiation 4
CN2     Algorithm for rule induction
C4.5    Statistical classifier
EM      Expectation Maximization
EMI     Expectation Maximization Imputation
EMSI    Expectation Maximization Single Imputation
EMMI    Expectation Maximization Multiple Imputation
FC      Fractioning of Cases
FIML    Full Information Maximum Likelihood
LD      Listwise Deletion
KNN     K Nearest Neighbor
KNNSI   K Nearest Neighbor Single Imputation
MAD     Mean Absolute Difference
MAR     Missing at Random
MCAR    Missing Completely at Random
MCMC    Markov Chain Monte Carlo
MDTs    Missing Data Techniques
MI      Multiple Imputation
MICE    Multiple Imputation by Chained Equation
MLR     Multiple Linear Regression
MMSI    Mean or Mode Single Imputation
MSE     Mean Square Error
MS      Mean Substitution
NA      Not Available
NMAR    Not Missing at Random
OLS     Ordinary Least Squares
PD      Pairwise Deletion
RS      Regression Substitution
RMSE    Root Mean Square Error
SSE     Sum of Squares Error
SST     Sum of Squares Total
SVD     Singular Value Decomposition
Yc      Complete values of the dataset
Yo      Observed values of the dataset
Ym      Missing values of the dataset

LIST OF TABLES

3.1 The dataset with missing values
3.2 After replacement of missing values by the mean substitution technique
4.1 Classification of Life Expectancy at Birth (LEB) by 106 Countries
4.2 Correlation Matrix
4.3 Determination of Multicollinearity
4.4 Test of Normality and Constancy of Variance of Residual Terms
4.5 Summary of the Complete Original Dataset Model Coefficients (regression coefficient estimates, standard error, t-value and p-value)
4.6 Output of Little's MCAR test for MCAR
4.7 Output of Little's MCAR test for MAR
4.8 Imputation Algorithms for Treating Missing Values
4.9 Average Coefficient Difference of Missing Data under the KNN Imputation Algorithm relative to the Original Data of the MLR Model
4.10 Performance of KNN, Mean Substitution and Regression Substitution under MCAR using the ACD estimate
4.11 Performance of EM and MICE Algorithms under MAR using Average Coefficient Difference (ACD)
4.12 Performance of KNN, Mean Substitution and Regression Substitution for Treating Missing Values under the MCAR Mechanism using Mean Absolute Difference (MAD)
4.13 Performance of EM and MICE Algorithms for Treating Missing Values under the MAR Mechanism using Mean Absolute Difference (MAD)
4.14 Performance of KNN, Mean Substitution and Regression Substitution under the MCAR Mechanism using R2
4.15 Performance of EM and MICE Algorithms for Treating Missing Values under the MAR Mechanism using Coefficient of Determination (R2)

1 KNN IMPUTATION AT 5%
2 KNN IMPUTATION AT 10%
3 KNN IMPUTATION AT 20%
4 KNN IMPUTATION AT 30%
5 KNN IMPUTATION AT 40%
6 MEAN IMPUTATION AT 5%
7 MEAN IMPUTATION AT 10%
8 MEAN IMPUTATION AT 20%
9 MEAN IMPUTATION AT 30%
10 MEAN IMPUTATION AT 40%
11 REGRESSION IMPUTATION AT 5%
12 REGRESSION IMPUTATION AT 10%
13 REGRESSION IMPUTATION AT 20%
14 REGRESSION IMPUTATION AT 30%
15 REGRESSION IMPUTATION AT 40%
16 EM IMPUTATION AT 5%
17 EM IMPUTATION AT 10%
18 EM IMPUTATION AT 20%
19 EM IMPUTATION AT 30%
20 EM IMPUTATION AT 40%
21 MICE IMPUTATION AT 5%
22 MICE IMPUTATION AT 10%
23 MICE IMPUTATION AT 20%
24 MICE IMPUTATION AT 30%
25 MICE IMPUTATION AT 40%
26 The world population data sheet, 2011
27 The world population data sheet, 2011
28 The world population data sheet, 2011
29 The world population data sheet, 2011

LIST OF FIGURES

2.1 Important types of missing data
3.1 Step-by-step procedure of the research design
4.1 Graph of EM and MICE algorithms under MAR using average coefficient difference as the performance assessment criterion
4.2 Graph of KNN, mean substitution and regression substitution under MCAR using MAD as the performance assessment criterion
4.3 Graph of KNN, mean substitution and regression substitution algorithms under the MCAR mechanism using the coefficient of determination (R2) as the evaluation criterion
4.4 Graph of EM and MICE algorithms under the MAR mechanism using the coefficient of determination (R2) as the assessment criterion

CHAPTER 1
INTRODUCTION

1.1 Introduction

Governments, organizations and firms depend largely on data quality for decision making and for planning their operational activities. Data quality, the backbone of every organization, can be distorted by the massive presence of missing values or incomplete data. The presence of incomplete data is an unavoidable challenge in real-world situations and large-scale research studies. It often creates data anomalies and impurities in data analysis, and it affects the interpretation and visualization of research results.
Respondents or interviewees often fail to answer particular items of a survey questionnaire, countries do not collect statistics every year, and subjects drop out of studies, which results in missing values scattered throughout a data set (Honaker, King & Blackwell, 2015). Discarding these respondents at the analysis stage usually means throwing out a sizable amount of information, reducing the sample size, and potentially biasing parameter estimates (Little & Rubin, 2002). Missing values also reduce insight into the data and cause inefficient analyses and inaccurate decision making, which lead to a loss of statistical power and deceptive inferences.

The pervasiveness of missing values has encouraged many academic researchers to find solutions, develop models and evaluate methods for missing data treatment. Incomplete data are a serious challenge for statistical analysis, because most standard statistical techniques and software packages are programmed to work effectively and efficiently under the assumption that all records are fully observed on all variables in the analysis. To solve the problem of incomplete values in data sets, simply neglecting incomplete observations, deleting missing values, or replacing incomplete values with zero has serious limitations compared with the application of imputation algorithms (Meng & Shi, 2012).

An imputation algorithm is an iterative procedure employed to estimate and assign substitute values for incomplete values in the data matrix using closely related observed values. The beauty of imputation is that the treatment of incomplete data does not depend on the learning algorithm subsequently employed; hence, researchers from various disciplines may choose the imputation method best suited to their incomplete data problems.
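As a minimal illustration of the idea (a Python sketch rather than the R environment used in this thesis; the function name is ours), the simplest such procedure, mean substitution, replaces each missing entry of a variable with the mean of its observed values:

```python
import numpy as np

def mean_substitution(x):
    """Replace NaN entries of a 1-D array with the mean of the observed values."""
    x = np.asarray(x, dtype=float)
    filled = x.copy()
    filled[np.isnan(x)] = np.nanmean(x)  # observed mean stands in for each gap
    return filled

print(mean_substitution([2.0, np.nan, 4.0]))  # [2. 3. 4.]
```

More sophisticated algorithms such as EM, MICE and KNN follow the same contract, a data matrix with gaps in and a completed matrix out, but use the relationships between variables rather than a single column mean.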
Imputation involves replacing missing values to create a complete data set, in a way that accounts for both the natural variation in the data matrix and the uncertainty involved in replacing incomplete values. The goal of imputation is not to generate accurate predictions of the missing values themselves, but to replace them in a way that preserves the relationships among the variables, in order to exploit the available data from a partially observed individual (Little & Rubin, 2002).

This work applies some of the most widely used imputation algorithms for handling missing data and compares their efficiency in replacing missing values in a cross-sectional study. The Markov chain Monte Carlo (MCMC) approach has in the past been compared with some of the simplest imputation techniques, such as listwise deletion (LD), pairwise deletion (PD), mean or mode substitution, last value carried forward and hot deck replacement; those comparisons revealed that the MCMC approach provides the most efficient results in all situations. The simplest techniques have the disadvantages of reducing the sample size, producing inefficient parameter estimates and diminishing the sensitivity of statistical analyses, which lead to potentially biased conclusions. This study therefore compares the expectation maximization (EM) algorithm, multiple imputation by chained equations (MICE), k nearest neighbor (KNN) imputation, mean substitution (MS) and regression substitution (RS), all of which can deal with missing or incomplete values on the study variables. This work is distinctive because, to the best of our knowledge, no such substantial empirical study has previously been presented in the literature.

1.2 Problem Statement

Incomplete data normally exist in cross-sectional and longitudinal studies.
These unobserved values occur when the data set contains no recorded data points for some of the attributes. The problem of incomplete values is mostly attributed to withdrawal or non-response of respondents, unavailability of the scales of interest, loss of data due to transmission challenges, problems with monitoring and recording tools, and loss of data during coding and storage. Before completion of an intended cross-sectional study, some subjects may disappear or drop out, or any of the above-mentioned problems may occur; because of missing data on some attributes, researchers may then have to drop those cases from the analysis. More often than not, the records for such subjects are not available for statistical analysis.

The existence of unobserved values in the data matrix has severe implications. Missing data reduce the effectiveness of parameter estimates and diminish the sensitivity of the data analysis; that is, they affect the interpretation and conclusions of the study outcomes, the strength of the research design, the validity of inferences about relationships between attributes, and may decrease the representativeness of the sample (Morais, 2013). Incomplete data also diminish insight into the data and cause inefficient analyses and a loss of statistical power, leading to inaccurate and inefficient inferences about the population that are meant to guide stakeholders, decision makers and researchers. According to Horton and Kleinman (2007), data may be missing for many reasons, such as subject drop-out, interviewee non-response, non-coverage, misleading questions and confidentiality concerns, which may account for scattered missing data points in a study. Choosing the most suitable imputation approach to resolve the problem of incomplete data is a major challenge that data scientists encounter.
Moreover, missing values are often simply ignored instead of being filled by an imputation method or algorithm. Given the problems stated above, this study investigates the efficacy and accuracy of convergence of five imputation algorithms: expectation maximization (EM), multiple imputation by chained equations (MICE), k nearest neighbor imputation (KNNI), mean substitution and regression substitution, in estimating and replacing missing values in the cross-sectional World Population Data Sheet.

1.3 Objectives of the Study

The primary objective of this study is to identify the best imputation algorithm for estimating missing data. Specifically, the study seeks to:

• determine the most appropriate imputation algorithm to estimate missing values in real-life cross-sectional data;
• examine the main reasons why data are missing in cross-sectional studies;
• determine whether differences exist between the imputation algorithms' estimates and the multiple linear regression (MLR) model estimates based on the data.

1.4 Significance of the Study

The findings from this study are essential in the following ways. First, the outcomes will guide the general public, stakeholders and researchers in choosing a statistical imputation algorithm for missing value replacement. The study also explains the ideas behind statistical imputation algorithms for estimating missing cross-sectional data, especially for researchers and practitioners; effective use of these ideas could give a comprehensive picture of, and a clear path toward, solving the problem of missing data. The findings will also be of appreciable assistance to academic work, supporting existing theories and literature, and will serve as a guide for further studies in related fields.
1.5 Methodology

To identify the best imputation algorithm for reconstructing and replacing incomplete values in cross-sectional data, the following imputation algorithms were considered: expectation maximization (EM), multiple imputation by chained equations (MICE), k nearest neighbor imputation (KNNI), mean substitution (MS) and regression substitution (RS). Artificial simulation studies were created under the missing completely at random (MCAR) and missing at random (MAR) assumptions at different proportions of missing data, and the five imputation algorithms were used to reconstruct and replace the incomplete values in the data matrix. The multiple linear regression (MLR) model was then used to analyse the complete original data without missing values and each of the imputed data sets. The performance of the selected algorithms was assessed by comparing the average coefficient difference (ACD) of the multiple linear regression model, the mean absolute difference (MAD) and the coefficient of determination (R2).

1.6 Motivation of the Research

This study is motivated by the fact that improper handling of missing cross-sectional data can cause substantial bias and inappropriate results. Missing data problems occur in many research studies and are a common feature of data accumulation, for example when working with very large data sets. They are a huge challenge to researchers and practitioners, because most statistical methods and software packages are designed to perform effectively and efficiently when data are fully observed. This research focuses on deriving the most appropriate imputation algorithms and predictive models able to accommodate missing cross-sectional data.
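The evaluation loop described in Section 1.5, artificially deleting values at a chosen rate, imputing them, and scoring the imputations against the original complete data, can be sketched as follows. This is a minimal illustrative Python sketch under MCAR only, with mean substitution standing in for any of the five algorithms and simulated data in place of the World Population Data Sheet; it is not the thesis's own code.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_mcar(data, rate, rng):
    """Delete entries completely at random (MCAR) at the given rate."""
    out = data.astype(float)
    out[rng.random(out.shape) < rate] = np.nan
    return out

def impute_mean(data):
    """Mean substitution: fill each column's NaNs with its observed column mean."""
    out = data.copy()
    for j in range(out.shape[1]):
        col = out[:, j]
        col[np.isnan(col)] = np.nanmean(col)
    return out

def mad(original, imputed, mask):
    """Mean absolute difference between the true and the imputed entries."""
    return float(np.mean(np.abs(original[mask] - imputed[mask])))

complete = rng.normal(size=(100, 4))          # stands in for the complete data sheet
with_missing = make_mcar(complete, 0.10, rng)  # 10% MCAR missingness
mask = np.isnan(with_missing)
filled = impute_mean(with_missing)
print(round(mad(complete, filled, mask), 3))   # MAD score for this algorithm/rate
```

Repeating the loop over each algorithm and each missing rate (5% to 40%) yields the comparison tables reported in Chapter 4; a lower MAD (and an R2 closer to the complete-data model's) indicates a better imputation.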
In other words, applying accurate and proper imputation algorithms to the problem of missingness in a cross-sectional study reduces the loss of precision and power caused by dropping subjects with incomplete predictor variables, and reduces bias in parameter estimation. A proper imputation algorithm accounts for the natural variability in the independent variables and produces unbiased parameter estimates, so that valid statistical conclusions can be drawn.

1.7 Scope of the Study

This research used data from the 2011 World Population Data Sheet. The data are cross-sectional, because the data on 106 countries were collected at one point in time (2011) without repeated measurement. Secondary data on life expectancy at birth (LEB) and the other nine variables employed in this thesis were available for only one hundred and six countries (Population Reference Bureau, 2011). In this thesis the missingness is assumed ignorable (MCAR and MAR). This assumption implies that the reasons for the missingness in the data set are not known to anyone, including the researcher, and that the missingness occurs randomly; it may arise by chance or through factors the researcher cannot explain. The validity of the ignorable missing data assumption cannot be tested, and there is no existing theory to confirm it in this study. However, when more than one missing data mechanism leads to missing values, it is assumed that departures from the MAR mechanism are insignificant and will not distort prediction accuracy or conclusions by a wide margin.

1.8 Limitations

The following are some of the limitations encountered during the study:

• The treatment of missing data is an unavoidable statistical challenge, and researchers should be aware that no single imputation approach is known to perform best in all situations.
• The researchers encountered difficulties in creating percentages of missing values that conform to the MAR mechanism.

• The researchers also faced difficulty in obtaining data to facilitate the study.

1.9 Thesis Organization

This thesis is organised into chapters. The first chapter provides a short introduction to the research work: the background, problem statement, objectives, significance, research questions, methodology, scope and limitations of the study. The second chapter presents the literature review, which discusses work done by other researchers on the same or related topics. The third chapter describes the methodology employed in this thesis. The fourth chapter presents the results of the data analysis; it consists mainly of tables and graphical presentations of results for discussion. The last chapter offers a summary of the research findings, conclusions, recommendations and proposals for future research.

CHAPTER 2
LITERATURE REVIEW

2.0 Introduction

This chapter reviews diverse literature related to missing data imputation algorithms in order to uncover facts and findings which have previously been established and published by other investigators. A great number of methods and algorithms have been developed for estimating and replacing incomplete values in cross-sectional and longitudinal studies. The literature review consists of four sections. Firstly, various works by other researchers are reviewed. Secondly, missing data patterns and mechanisms are introduced. Thirdly, traditional and modern missing data imputation techniques (algorithms) are examined, and finally, the measures of performance assessment are reviewed.

2.1 Missing Values

The term missing (latent) or incomplete data, according to Day (1999), refers to "a data value that should have been recorded but, for some reason, was not".
Missing values create much complexity in modern research studies, since most data analysis procedures are not designed to accommodate them. Numerous published articles focus on the estimation and reconstruction of missing values in health-related data, while other studies have addressed related problems in other disciplines with varying degrees of sophistication. The issue has been considered in the context of respondents failing to answer all or some of the questions in research surveys, and of incomplete values in experiments (Little & Rubin, 2002). Rows with incomplete values may be used in further analyses after the estimation and reconstruction of the missing values. A great number of imputation algorithms exist for treating latent values, including hot deck imputation, mean imputation, regression imputation, cluster-based imputation, tree-based imputation, maximum likelihood estimation (MLE) and multiple imputation. Data scientists and other researchers have not only created many techniques for treating incomplete values, but have also characterised several kinds of missing values. Subsequent sections elaborate on the classification of the missing value mechanisms that give rise to incomplete data. As far as implementation and decision making are concerned, the presence of missing values constitutes a problem of crucial importance for end-user data analyses, since many methods and application software require complete data matrices. Susianto, Notodiputro, Kurnia and Wijayanto (2017), in their study 'A comparative work of imputation techniques for estimation of incomplete values of Per Capita Expenditure', compared and assessed three imputation procedures: the Yates method, the Expectation Maximization (EM) approach and the Markov Chain Monte Carlo (MCMC) technique.
These three methods were applied to a real data set of per capita expenditure at sub-district level in Central Java. The main objective of their study was to identify the best missing data imputation approach for imputing hidden values of per capita expenditure. The results revealed that the mean sum of squares generated by the Yates technique was smaller than that of the other two techniques, the EM and MCMC approaches. These outcomes were consistent with the mean absolute error of the Yates technique, which was also smaller than the mean absolute error produced by the other two algorithms. For these reasons, the Yates formula was advocated for substituting missing values of per capita expenditure at sub-district level in Central Java. Rahman and Islam (2011) compared two imputation algorithms, Decision tree based Missing value Imputation (DMI) and Expectation Maximisation Imputation (EMI), on two real data sets. Their investigation found that the EM technique displays more desirable imputation results on data sets with very strong interdependence between the variables. In addition, correlations between variables are natural characteristics of any given dataset; therefore, data values must not be altered or remoulded to enhance the relationships among the variables with the aim of obtaining more satisfactory imputation precision. Although DMI achieves remarkably better results than EMI on both data sets, its performance on a large data set (the Adult data) is clearly superior to its performance on a small data set (the Credit Approval data). This implies that DMI achieves more desirable results on large datasets than on smaller ones.
Because DMI uses EMI-based replacement on the records belonging to each leaf individually, for a small data set one may frequently end up with an inadequate number of records for obtaining a desirable outcome from the EM approach. Even so, in their investigation, DMI still produced better results than EMI on small data sets in most situations. Brown (1994) assessed five indirect approaches for estimating structural equation models with different percentages of incomplete values. The approaches comprised listwise deletion (LD), pairwise deletion (PD), mean imputation, hot-deck imputation, and similar response pattern imputation. Brown chose to focus on indirect techniques, explaining that in numerous cases direct techniques are not applicable in practice. Brown's work conflicts with more recent works that support the application of a direct technique for treating incomplete values in a structural equation model. He applied a simulation study of 10 attributes to structural equation modelling. Brown's study design comprised two different sample sizes (one hundred and five hundred), each with five percentage levels of unobserved values. By comparing the strengths of the indirect approaches, Brown examined four outcome criteria: problems of convergence, selection of the best-fitting model, bias in estimated parameters, and estimates of standard errors. Under the assumption that the data are MCAR, the indirect technique LD should produce very good estimators for all parameters. Batista and Monard (2003) studied the effects of four imputation techniques for treating incomplete values at various percentage levels of incomplete data. The algorithms explored were KNNSI, MMSI, and the internal approaches employed by fractioning cases (FC) and CN2 to handle incomplete values.
Incomplete data were synthetically simulated at various percentage rates of latent values and attributes in the datasets. The KNNSI algorithm displayed an excellent result compared to MMSI when incomplete values were confined to one variable. Nevertheless, both systems provided very good performance when incomplete data occurred in several variables. On the other hand, the fractioning cases (FC) algorithm achieved results as good as KNNSI. Twala, Cartwright and Shepperd (2005) evaluated the effect of the following incomplete data approaches, LD, EMSI, KNNSI, MMSI, EMMI, FC and SVD, on eight commercial datasets by synthetically reproducing three percentage levels of missingness, two patterns and three mechanisms of incomplete values. Their study revealed that EMMI displays the highest reliability rates, while other methods such as fractioning cases and EMSI also yielded good outcomes. The poorest approach was LD. Besides, their study showed that MCAR data are the cheapest to handle with multiple imputation. Batista and Monard (2001) examined the performance of ten nearest neighbour imputation (10-NNI) as an imputation technique, comparing it with three other approaches to incomplete values: mean or mode imputation, the statistical classifier C4.5 algorithm, and the CN2 technique. Their study suggested that the benefits of the technique are that it can predict both qualitative and quantitative variables, and that it does not generate explicit models, since it is a lazy learner. The work indicates that the technique offers excellent outcomes, preferable to the other three techniques (mean or mode imputation, C4.5 and CN2), especially for a very high proportion of latent values. The primary disadvantage of the 10-NNI approach, however, is that the algorithm searches through the entire dataset, which is restrictive for huge data sets, and the study relied only on the MCAR mechanism.
From the literature review, the imputation algorithms considered were used to impute missingness under different missing data mechanisms (MCAR, MAR and MNAR), but the studies failed to classify the imputation methods by missing data mechanism. Since some imputation methods work better and some work worse under different missing data mechanisms, it is important to match them to the relevant missing data mechanism before using them to replace missing data. This study therefore classifies imputation methods under different missing data mechanisms (MCAR and MAR) to examine the actual performance of the imputation algorithms under a specific missing data mechanism. Moreover, many prior studies are mainly concerned with comparing a single modern imputation algorithm, such as artificial neural networks (ANN), KNNI, EM, MCMC or full information maximum likelihood (FIML) methods, against traditional imputation techniques such as LD, PD, mean substitution, mode or median substitution, hot-decking and others. In those comparisons, modern imputation methods provide better estimates than traditional imputation methods. This study separately compares modern imputation algorithms as well as traditional imputation methods under a specific missing data mechanism to identify the best imputation method. Thus, no studies have examined the consequences of convergence for the five imputation algorithms considered here, namely expectation maximization (EM), multiple imputation by chained equation (MICE), k nearest neighbour imputation (KNNI), mean substitution (MS) and regression substitution (RS), which have the ability to substitute incomplete values and unknown parameters in large real databases.
This study therefore focuses on five imputation algorithms to estimate and reconstruct missing values in real-life application data, using the ignorable missing value mechanism assumptions (MCAR and MAR) and an arbitrary missing data pattern.

2.2 Missing Data Mechanism

Incomplete values occur for reasons beyond our control; hence, the properties of the processes that account for unobserved values need to be examined first. Basically, three kinds of missing value mechanisms, grouped into ignorable and non-ignorable missingness, are distinguished in the literature (Little & Rubin, 2002; Carpenter & Kenward, 2013). The missing value mechanism describes the connection between unobserved values and the values of variables in the data matrix, i.e. whether the missing values depend on the underlying values of the variables in the data matrix. As explained by Schafer (1997), given a complete dataset Yc, which consists of Yo, the observed part, and Ym, the missing part, the complete data matrix is Yc = (Yo, Ym). Schafer further defines a response indicator I with the same dimension as Yc: where Yc is observed, I = 1, and where Yc is missing, I = 0. The mechanisms are then defined as follows.

Missing Completely at Random (MCAR)

A dataset that is missing completely at random has no systematic arrangement of missing values among the attributes, and the missingness is connected neither to the observed values nor to the lost values (Acock, 2005; Bennett, 2001; Roth, 1994). If data satisfy the MCAR assumption, then the likelihood of obtaining a particular pattern of missing values is independent of both the observed and latent values (Hair, Black, Babin, Anderson & Tatham, 2006). In other words, the probability of having a missing value for a variable does not depend on either the observed or the missing values, i.e. Pr(I | Yc) = Pr(I).
The missing values have no dependency on any other attributes; they occur purely by chance and generally appear as a few individual points randomly distributed. Under the MCAR assumption, any missing value treatment approach can be employed without fear of introducing bias into the study. In practice, it is very difficult to establish whether data are MCAR, but Little (1988) established an omnibus statistical test of MCAR for this problem.

Missing at Random (MAR)

In a dataset with MAR, the likelihood of having a missing value is connected to another attribute in the study but not to the missing values themselves (Allison, 2001). The probability of drop-out for a variable depends on the observed values but not on the missing value itself, i.e. Pr(I | Yc) = Pr(I | Yo). In other words, under MAR the missing values are related to the observed data but not to the missing data (Roth, 1994; Schafer & Graham, 2002): the likelihood of a missing value is unrelated to the missing values in the study, the missingness depends on other observed values, and a missing value can therefore be estimated from other observed values. Missing values under MAR typically appear as a few consecutive points lost at one time, but the sets of missingness are randomly dispersed. According to Schlomer, Bauman and Card (2010), it is practicable to differentiate between MCAR and MAR by creating a dummy variable denoting whether values are missing on the attribute of interest, and then inspecting whether this dummy variable is connected with other attributes in the study. Whenever the dummy variable (missingness indicator) is unrelated to the other attributes, the missingness is regarded as MCAR rather than MAR or NMAR. Conversely, when the dummy variable is truly connected to other attributes, MAR is suggested instead of MCAR, although NMAR cannot be entirely ruled out.
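For illustration, the distinction between MCAR and MAR, together with the dummy-variable check of Schlomer, Bauman and Card (2010), can be sketched in code. This is an illustrative Python sketch on simulated data (the analyses in this thesis were carried out in R); the variable names, sample size and 30% missingness rate are arbitrary choices, not taken from the study data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)                 # fully observed covariate
y = 2.0 * x + rng.normal(size=n)       # variable that will lose values

# MCAR: missingness ignores both x and y
mcar_miss = rng.random(n) < 0.3

# MAR: missingness depends only on the OBSERVED x (larger x, more missing)
mar_miss = rng.random(n) < 1.0 / (1.0 + np.exp(-x))

# Dummy-variable check: correlate the missingness indicator with the
# observed covariate; a clear correlation points towards MAR.
r_mcar = np.corrcoef(mcar_miss.astype(float), x)[0, 1]
r_mar = np.corrcoef(mar_miss.astype(float), x)[0, 1]
print(abs(r_mcar) < 0.15, abs(r_mar) > 0.25)
```

Under MCAR the indicator is essentially uncorrelated with x, while under MAR the correlation is substantial, which is exactly the diagnostic logic described above.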
Because of NMAR, investigators cannot definitively establish whether a given data set is MAR or MCAR. However, investigators frequently presume MAR or MCAR when there is no indication to the contrary.

Not Missing at Random (NMAR)

The not missing at random (NMAR) mechanism is also called the non-ignorable mechanism (Schlomer, Bauman and Card, 2010). In a dataset with NMAR, the occurrence of the lost data points has a definite pattern: the likelihood of a missing value is connected to the outcome on that same attribute. This means that the probability of obtaining a missing value for an attribute depends on the value of that attribute itself, i.e. Pr(I | Yc) = Pr(I | Ym, Yo). The missing value is determined by other missing values and therefore cannot be estimated from an observed attribute. The difficulty in diagnosing NMAR is that the relationship between the missing values and the manner in which respondents would have answered the questions cannot be determined, since the missing values are unavailable. The mechanisms do not usually provide substantive reasons for the loss of data, but they do provide a mathematical formulation of the likelihood of incomplete data in relation to other attributes in the study. The non-ignorable missing type describes the likelihood of a lost data point as depending on its own value; it usually arises when the pattern of missingness is such that the missing values of Ym cannot be accurately predicted using other attributes in the database. The ignorable missing data type comprises the MCAR and MAR mechanisms. The diagram in Figure 2.1 describes the types of missing data.
[Figure 2.1: Important types of missing data. Missing data mechanisms are divided into non-ignorable (MNAR) and ignorable (MCAR and MAR).]

2.3 Ignorability Mechanism

Rubin (1976) emphasised that "there are two broad classes of missing data: missing data that is ignorable from the analysis, and missing data that is non-ignorable. If one can reasonably assume that missing data occur under either the MCAR or MAR conditions, then the problem is deemed ignorable, and the missingness process need not be explicitly modeled. Moreover, when data are MCAR or MAR, the likelihood-based and Bayesian frameworks allow to ignore the missingness process since they use only observed data, conditional on the model being correctly specified (Little & Rubin, 2002)". Conversely, if data exhibit NMAR, the missingness process cannot be excluded from the analysis (Little & Rubin, 2002). In the context of missing data classifications, ignorability, as it applies to missingness mechanisms, does not mean that investigators can ignore missing values; it refers to the fact that the factors causing missingness are unrelated or only weakly related to the estimated intervention effect. In a restricted sense, the term refers to whether missingness mechanisms must be modeled as part of the parameter estimation process or not (Allison, 2002). In addition, the importance of ignorability arises when one needs to evaluate the impact of missing data on the analysis and the study's conclusions. Because of the random nature of the missingness, MCAR data should show no systematic difference between complete and missing records in the results. In MAR data there is a systematic process underlying the missingness, but this effect can be modeled using the observed data (McKnight, McKnight, Sidani and Figueredo, 2007).
However, missing data are non-ignorable if the likelihood of a data point being missing depends on its value, even after controlling for other variables. Thus, the NMAR process violates the ignorability condition and requires suitable measures to account for the effects of data that are NMAR. Non-ignorable incomplete data are by far the most strenuous to handle and must be treated carefully; it is not practically easy to make an acceptable and reasonable analysis of data that are NMAR (Thijs, Molenberghs, Michiels, Verbeke & Curran, 2002).

2.4 Pattern of Missing Data

Basically, the pattern of incomplete values describes which values in the dataset are observed and which are not (missing). In cross-sectional studies with missing data in one or more variables, when the data are presented in wide format, with rows corresponding to subjects and columns corresponding to attributes, the matrix displays three main patterns of missing data: the univariate pattern, the monotone pattern and the arbitrary pattern. In a univariate pattern, missing values occur in only one attribute, and all remaining attributes are completely observed. In a monotone pattern with ordered variables, once a variable is missing, all succeeding variables are also missing, so a clear pattern can be seen among the missing data points. With an arbitrary missing pattern, there is no way to reorder the attributes to reveal an explicit pattern (SAS Institute, 2005). The arbitrary missing pattern is the most general pattern, in which different sets of attributes can be missing for different subjects. Assumptions about the pattern of missing values are therefore used to decide which algorithm is appropriate for solving the missing value problem. This thesis focuses on the arbitrary missing data pattern in multivariate datasets.
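The three patterns described above can be illustrated with small response-indicator matrices (an illustrative Python sketch; 1 denotes observed and 0 missing, following the indicator notation of Section 2.2, and the matrices are hypothetical examples).

```python
import numpy as np

# Rows are subjects, columns are ordered variables; 1 = observed, 0 = missing.
univariate = np.array([[1, 1, 0],
                       [1, 1, 1],
                       [1, 1, 0]])        # holes confined to one column

monotone = np.array([[1, 1, 1],
                     [1, 1, 0],
                     [1, 0, 0]])          # once missing, all later missing

arbitrary = np.array([[1, 0, 1],
                      [0, 1, 1],
                      [1, 1, 0]])         # no reordering reveals a pattern

def is_monotone(mask):
    """True if, within each row, an observed 1 never follows a missing 0."""
    return all(np.all(np.diff(row) <= 0) for row in mask)

print(is_monotone(monotone), is_monotone(arbitrary))  # True False
```

The check confirms that the monotone mask (and the univariate mask, as a special case) satisfies the ordering property, while the arbitrary mask does not.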
Unlike the univariate and monotone patterns, which can be handled by simple methods, the arbitrary missing data pattern may require more sophisticated algorithms.

2.5 Traditional Methods of Treating Missing Data

The traditional approaches for treating missing values are briefly discussed here: LD, PD, MS, RS, stochastic regression imputation, and hot decking. Generally speaking, missing value approaches can be separated into two classes: deletion approaches and imputation approaches. These techniques were once very popular and even dominant in applied research whenever researchers had to solve the problem of lost data. However, as research on handling incomplete values in multivariate data developed rapidly, many of these methods came to be regarded as unacceptable in structural equation modeling (Savalei & Bentler, 2009). Although many of these traditional methods are still frequently used in applied studies, researchers should be aware of their disadvantages and their consequences for analysis and parameter estimation.

2.5.1 Mean Substitution

With the mean substitution approach, the arithmetic mean of the observed values of a specific variable is estimated and then substituted into each of the missing data cells. This approach performs acceptably only if the variable considered is not nominal. Mean substitution treats missing data by substituting, for a given variable, each missing value with the mean of the observed values. The approach preserves the mean of the variable's distribution but distorts other characteristics of the variable's dispersion (Rubin, 1987; Cole, 2008). Allison (2002) showed that the mean substitution approach restricts the variability of a variable and changes its underlying distribution. Moreover, the mean substitution technique performs better when the missing data mechanism is MCAR.
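The mean substitution approach just described can be sketched as follows (an illustrative Python sketch; the small example matrix is hypothetical).

```python
import numpy as np

def mean_substitute(data):
    """Replace each NaN with the column (variable) mean of the observed values."""
    data = data.copy()
    col_means = np.nanmean(data, axis=0)       # per-variable observed means
    idx = np.where(np.isnan(data))             # locations of the holes
    data[idx] = np.take(col_means, idx[1])     # fill with the column mean
    return data

x = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])
filled = mean_substitute(x)
print(filled[1, 0], filled[2, 1])  # 2.0 15.0, the observed column means
```

Note that the imputed values reproduce the column means exactly, so the mean of each variable is preserved while its variability shrinks, which is precisely the distortion Allison (2002) describes.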
However, one disadvantage of mean substitution is that it leads to bias in parameter estimation (McDonald, Thurston, & Nelson, 2000; Pigott, 2001; Streiner, 2002).

2.5.2 K Nearest Neighbour (KNN) Imputation Algorithm

Nearest neighbour imputation (NNI) is a process of substituting the missing values of an instance B with plausible values obtained from a complete instance that is a close neighbour of B; it gives a feasible answer to this common problem. NNI is a non-parametric technique with an extensive record of implementation. The KNN imputation algorithm is an extended version of NNI which can reduce the problem of overfitting. With the KNN technique, the information in the incomplete instance is used solely for locating the nearest neighbour, or group of neighbours, of the instance with missing data. The fundamental rule is to identify the K nearest neighbours of the target variable across all N experiments, where N is the total number of experiments. If an attribute B has a missing value in experiment 1, the KNN approach locates K other attributes that have an observed value in experiment 1 and whose expression is closest to that of B over the remaining experiments. A weighted average of the values in experiment 1 from the K closest attributes is then used as the estimate of the missing value in attribute B. Usually, the Euclidean distance is employed to measure the interval between samples; for instance, the distance between two points p and q is given by

d(p, q) = d(q, p) = √(∑_{i=1}^{n} (p_i − q_i)²)   (2.1)

The process of replacing the missing values using the Euclidean distance may be summarised as follows:

1. For each attribute with missing values, compute its distance to all other attributes using the Euclidean distance in equation (2.1).

2. Sort the distances in ascending order and pick the K smallest distances.

3.
For discrete data, the mode of the K nearest neighbours' observed values is taken as the replacement value; for continuous data, the mean or median of the K nearest neighbours' observed values is used.

The main benefits of the KNN approach for estimating and replacing missing values are as follows:

• KNN may be used to predict both discrete and continuous cases.

• There is no need to build a forecasting model for each variable with missing values. In fact, KNN does not produce explicit models like other techniques; it is termed a lazy model. KNN may be easily adjusted to work with any variable as the class, by amending which variables are included in the distance metric. Besides, KNN can easily handle cases with multiple missing values.

• The major limitation of the KNN approach, however, is that before KNN finds the closest instances, the method searches through the entire dataset. This is a serious problem, because much statistical research aims at analysing large datasets.

2.5.3 Regression Substitution

The principle of the regression method is to use the observed values to fit a regression model. The attribute with missing data is the target variable, and the incomplete values are substituted by the predicted values from the regression equation. In an approach proposed by Yuan (2000), each variable with missing data is fitted with a regression model using the remaining variables as independent variables (i.e. covariates, or regressors). By applying the coefficients of the regression model, a new model is developed, and each attribute with missing values is then imputed from the developed regression model (Rubin, 1987). Let Zj be a continuous variable with missing data, which satisfies the expression

E(Zj) = β0 + β1C1 + ... + βMCM   (2.2)

formulated from the observations with recorded values for the attribute Zj and its covariates C1, C2, ..., CK.
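The KNN imputation steps of Section 2.5.2 can be sketched as follows. This illustrative Python sketch simplifies the description above: it matches complete rows (cases) rather than attributes, and uses an unweighted mean of the K neighbours' values in place of the weighted average; the example matrix is hypothetical.

```python
import numpy as np

def knn_impute(data, k=2):
    """Fill NaNs from the mean of the k nearest complete rows (Euclidean)."""
    data = data.copy()
    complete = data[~np.isnan(data).any(axis=1)]       # donor pool
    for i, row in enumerate(data):
        miss = np.isnan(row)
        if not miss.any():
            continue
        # Step 1: Euclidean distance on the observed coordinates only
        d = np.sqrt(((complete[:, ~miss] - row[~miss]) ** 2).sum(axis=1))
        # Step 2: sort and keep the k smallest distances
        nearest = complete[np.argsort(d)[:k]]
        # Step 3: mean of the neighbours' values fills the holes
        data[i, miss] = nearest[:, miss].mean(axis=0)
    return data

x = np.array([[1.0, 1.0, 1.0],
              [1.1, 0.9, 2.0],
              [9.0, 9.0, 9.0],
              [1.0, 1.0, np.nan]])
print(knn_impute(x, k=2)[3, 2])  # mean of the two nearest rows' values: 1.5
```

The incomplete fourth row is matched to the first two rows (the distant third row is ignored), and the missing entry becomes the mean of their third values, illustrating steps 1 to 3 above.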
Here K is the number of attributes in the study and 0 < M < K. Rubin (1987) emphasised that this approach presumes multivariate normality; the fitted model has residual variance σ2. The fitted regression yields the parameter estimates β = (β0, β1, ..., βM)′ and the associated covariance matrix σ2V, where V is the usual (C′C)−1 matrix derived from the intercept and the covariates C1, C2, ..., CM. In more detail, the regression algorithm works as follows. Candidate predictors of the attribute with incomplete values are identified from the correlation matrix. The best predictors are selected as explanatory attributes in a regression equation, with the attribute containing the missing values as the outcome (response) attribute. Cases with complete information on the explanatory attributes are used to build the regression equation, and the model is then used to predict the missing cases. By iteration, the values of the missing attributes are substituted, and all cases are then used to forecast the response variable. These steps are repeated until they converge. The regression coefficients obtained from the final cycle are the ones used to fill in the incomplete data. What distinguishes the regression substitution (RS) method from other kinds of imputation is that RS employs the main sources of information in the data to forecast the incomplete values and, technically, produces unbiased estimates for the incomplete data (McDonald et al., 2000). However, the disadvantages of the RS method have been judged to outweigh its advantages (Graham & Hofer, 2000; Little & Rubin, 2002). To begin with, since the substituted figures are predicted from other attributes, they typically fit too well: RS introduces no random noise, and therefore standard errors are understated (Allison, 2002).
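A minimal sketch of the regression substitution procedure just described, fitting a least-squares model on the complete cases and filling the holes with its predictions (illustrative Python on simulated data; for simplicity it performs a single pass rather than the iteration to convergence described above, and the coefficients are arbitrary).

```python
import numpy as np

def regression_substitute(z, C):
    """Impute NaNs in z from a least-squares fit of z on covariates C."""
    miss = np.isnan(z)
    X = np.column_stack([np.ones(len(z)), C])    # intercept plus covariates
    # Fit on the cases where z is observed
    beta, *_ = np.linalg.lstsq(X[~miss], z[~miss], rcond=None)
    z = z.copy()
    z[miss] = X[miss] @ beta                     # fitted values fill the holes
    return z

rng = np.random.default_rng(2)
C = rng.normal(size=(200, 2))
z = 1.0 + 3.0 * C[:, 0] - 2.0 * C[:, 1] + rng.normal(scale=0.1, size=200)
z_miss = z.copy()
z_miss[:20] = np.nan                             # delete the first 20 values
z_hat = regression_substitute(z_miss, C)
print(float(np.abs(z_hat[:20] - z[:20]).max()))  # small reconstruction error
```

Because the deleted values obey the fitted linear model up to a small noise term, the imputations land close to the true values; note also that the fills lie exactly on the regression surface, illustrating the loss of random noise criticised by Allison (2002).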
Also, the regression substitution model assumes that the relationships among the variables are linear, which in some instances may not hold. This issue can lead to overestimated parameters and smaller significance values, resulting in invalid statistical inferences. Finally, with the regression substitution approach, imputing the missing values is more difficult and less workable when the attributes with incomplete values are highly intercorrelated (Raaijmakers, 1999). The most distinctive advantage of the RS method is the ready availability of software implementing the technique.

2.6 Modern Methods of Treating Missing Data

It is significant to observe that although the traditional imputation procedures were widely employed, owing to their simplicity and general availability in application software, many of them provide unsatisfactory results (Enders, 2001; Little & Rubin, 1987). Nowadays, statisticians and other researchers have developed many methods and algorithms for imputing missing data which have undergone substantial refinement. The expectation maximization (EM) algorithm, the MICE algorithm and the full information maximum likelihood (FIML) method have gained popularity in recent times because of their superiority over the traditional methods. These algorithms provide consistent, asymptotically normal and coherent parameter estimates under the MAR assumption (Allison, 2002; Schafer & Graham, 2002).

2.6.1 Expectation-Maximization (EM) Algorithm

The EM algorithm is an iterative procedure used to compute maximum likelihood estimates in the presence of latent or missing data. With ML, we are interested in estimating the model parameters under which the observed values are most probable. The EM approach, initially developed by Dempster, Laird and Rubin (1977), is an iterative algorithm for maximising the likelihood determined by a parametric model for observed data.
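The EM iteration can be illustrated with a deliberately minimal univariate sketch (illustrative Python, assuming a normal model and MCAR missingness; in this simple setting EM converges to the observed-data maximum likelihood estimates).

```python
import numpy as np

def em_normal(y, n_iter=50):
    """EM for the mean and variance of a normal sample containing NaNs."""
    obs = y[~np.isnan(y)]
    m = int(np.isnan(y).sum())
    n = len(y)
    mu, var = obs.mean(), obs.var() + 1.0        # crude starting values
    for _ in range(n_iter):
        # E-step: expected sufficient statistics given current parameters
        s1 = obs.sum() + m * mu                  # E[sum y]
        s2 = (obs ** 2).sum() + m * (mu ** 2 + var)  # E[sum y^2]
        # M-step: re-maximise the complete-data likelihood
        mu = s1 / n
        var = s2 / n - mu ** 2
    return mu, var

rng = np.random.default_rng(3)
y = rng.normal(5.0, 2.0, size=500)
y[:100] = np.nan                                 # 20% missing, MCAR
mu, var = em_normal(y)
obs = y[~np.isnan(y)]
print(abs(mu - obs.mean()) < 1e-8, abs(var - obs.var()) < 1e-6)
```

Even though the variance is deliberately started one unit too high, the E-step/M-step cycle contracts the error at every iteration and the estimates settle at the observed-data maximum likelihood values.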
The EM technique for missing values depends largely on the maximum likelihood estimate of the covariance structure given the available data. Each repetition of the EM approach comprises two steps: the Expectation step (E-step) and the Maximization step (M-step). In the E-step, regression equations based on the available values are used to compute the expected values of the missing entries, which are replaced by the conditional means derived from the regression models. In the M-step, the estimates obtained from the E-step are updated to increase the log likelihood of the current parameters over the previous state. These two steps are repeated for a number of iterations, and under some regularity assumptions the algorithm converges to a stationary point (Allison, 2002; Dempster et al., 1977).

2.6.2 Multiple Imputation by Chained Equation (MICE) Algorithm

The MICE approach was initiated by Van Buuren and Groothuis-Oudshoorn (2011). MICE is a Markov Chain Monte Carlo (MCMC) system in which the state space is the collection of all the imputed values. As with all Monte Carlo procedures, the MICE technique has to fulfil three conditions for convergence to take place (Van Buuren, 2012):

1. The chain is irreducible. The chain should be able to reach all parts of the state space.

2. The chain is aperiodic. The chain must not oscillate back and forth between separate states.

3. The chain is recurrent. The probability of the chain starting from state j and returning to j is one.

In practice, the convergence of the MICE approach is attained after an acceptably small number of iterations, commonly between five and twenty (Liu & Brown, 2013). Liu and Brown emphasised that roughly five iterations are usually acceptable, though a few cases may demand a much larger number of iterations.
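A single chain of the chained-equation idea can be sketched as follows (illustrative Python; it uses deterministic least-squares fills rather than the draws from conditional distributions used by a full MICE implementation, and runs one chain only on simulated data).

```python
import numpy as np

def chained_imputation(data, n_iter=10):
    """Sketch of one chain: cycle regression imputations column by column."""
    data = data.copy()
    miss = np.isnan(data)
    col_means = np.nanmean(data, axis=0)
    for j in range(data.shape[1]):               # initialise with column means
        data[miss[:, j], j] = col_means[j]
    for _ in range(n_iter):                      # one pass = one cycle
        for j in range(data.shape[1]):
            if not miss[:, j].any():
                continue
            others = np.delete(data, j, axis=1)
            X = np.column_stack([np.ones(len(data)), others])
            # regress column j on the others using its observed rows
            beta, *_ = np.linalg.lstsq(X[~miss[:, j]], data[~miss[:, j], j],
                                       rcond=None)
            # overwrite the holes with the fitted values
            data[miss[:, j], j] = X[miss[:, j]] @ beta
    return data

rng = np.random.default_rng(4)
x = rng.normal(size=300)
y = 2.0 * x + rng.normal(scale=0.1, size=300)
d = np.column_stack([x, y])
d[:30, 1] = np.nan                               # 10% of y missing
filled = chained_imputation(d)
print(float(np.abs(filled[:30, 1] - y[:30]).max()))  # small reconstruction error
```

Each cycle regresses every incomplete variable on all the others and refreshes its fills, mirroring the mean-initialise, regress, replace stages of the MICE chain.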
MICE requires the researcher to specify a conditional distribution for each attribute, using the other attributes as regressors. This means each attribute can be modeled according to its own distribution; for instance, continuous data can be modeled with Bayesian linear regression and a binary variable with logistic regression (Azur, Stuart, Frangakis & Leaf, 2012). The technique operates by repeatedly replacing the incomplete values according to the specified conditional equations until convergence is achieved. A cycle is separated into three stages. In stage 1, every missing cell of each attribute is substituted by the arithmetic mean of that attribute. In stage 2, the attribute treated in stage 1 is related to the other attributes of the dataset by regression: it serves as the response variable in the model and the other attributes as predictors. In stage 3, the missing values of that attribute are replaced by the predictions from the stage-2 regression equation. Stages 1 to 3 are repeated for each attribute with missing values; one pass over all such attributes constitutes an iteration or cycle. By the end of an iteration, all incomplete cells have been substituted with quantities predicted by the regression equations. Stages 2 and 3 are then repeated for further iterations, the replacements being updated in each iteration.

2.7 Measures of Performance Assessment

The following performance metrics are used as criteria to assess the best algorithm for substituting missing values in cross-sectional data: the mean absolute difference (MAD), the root mean squared error (RMSE) and the coefficient of determination (R2).

2.7.1 Mean Absolute Difference (MAD)

The MAD is a statistical measure of dispersion.
The MAD is also described as the average absolute difference between two values drawn from a probability distribution; here, the MAD is the arithmetic mean of the absolute differences between observed and imputed values. A smaller MAD indicates less dispersion between the original and imputed data, so the algorithm with the smallest MAD is recommended for substituting missing values.

2.7.2 Root Mean Squared Error (RMSE)

The root mean squared error (RMSE) is a performance indicator that measures the average magnitude of the residuals. It compares the original data with the imputed data; essentially, it is the standard deviation of their differences. It is a valuable indicator of overall accuracy that shows how each imputation algorithm performs on a data set. In the literature, the most efficient imputation algorithm is the one with the lowest RMSE (Huang & Carriere, 2006): the smaller the RMSE, the better the performance. Chai and Draxler (2014) note that "the RMSE has been used as a standard statistical metric to measure model performance in meteorology, air quality, and climate research studies". In the geosciences, the RMSE is likewise treated as one of the standard indicators for model residuals (Savage et al., 2013), although some researchers avoid the RMSE in favour of the MAE, citing the limitations of the RMSE stated by Willmott, Matsuura and Robeson (2009). One merit of using the RMSE instead of the MAE is the avoidance of the absolute value, which is inconvenient in many statistical computations (Chai & Draxler, 2014). Mathematically, the RMSE is given as:

RMSE = √( (1/n) ∑ᵢ (X_io − X_im)² )   (2.3)

where i = 1, 2, ..., n.
Here n is the sample size, X_io are the observed values and X_im the imputed values (Schmitt, Mandel & Guedj, 2015).

2.7.3 Coefficient of Determination (R2)

The coefficient of determination (R2) gives the proportion of variability in the dependent variable that is explained by the independent variables. R2 ranges from 0 to 1: the model has strong predictive ability, with the regression line fitting the data closely, when R2 is near 1, and explains little when R2 is near 0. This metric is a good indicator of overall predictive accuracy; it measures how well the model represents the observations in the dataset. In fitting a regression line, the closer the line lies to all the points on the scatter diagram, the greater the share of total variation the model explains; conversely, when most points deviate far from the regression line, only a small amount of the variation is accounted for.

2.8 Multiple Linear Regression (MLR) Model

The study used an MLR equation to analyse the original complete dataset (without missing values). Each imputation algorithm is then used to estimate and replace missing data in order to identify the best algorithm. Linear regression analysis relates a dependent attribute to its covariates; the fundamental objective of regression analysis is to build a statistical model relating the dependent variable to the independent variables. According to Anghelache and Scala (2016), there are three kinds of regression models: the variable-based degree model (VBDD), the linear regression model, and the change-point model. These forms of regression equations employ generalized least squares regression to determine the model coefficients.
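The two accuracy measures of Section 2.7 can be computed directly from an observed and an imputed series. A minimal pure-Python sketch, with illustrative data:

```python
import math

# MAD: mean of absolute differences between observed and imputed values.
# RMSE (eq. 2.3): square root of the mean squared difference.

def mad(observed, imputed):
    n = len(observed)
    return sum(abs(o - m) for o, m in zip(observed, imputed)) / n

def rmse(observed, imputed):
    n = len(observed)
    return math.sqrt(sum((o - m) ** 2 for o, m in zip(observed, imputed)) / n)

obs = [10.0, 12.0, 14.0]
imp = [11.0, 12.0, 12.0]
print(mad(obs, imp))   # (1 + 0 + 2) / 3 = 1.0
print(rmse(obs, imp))  # sqrt((1 + 0 + 4) / 3) ≈ 1.291
```

As the squaring in the RMSE suggests, it penalizes the single 2-unit error more heavily than the MAD does, which is why the two criteria can rank imputation algorithms differently.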
This thesis adopts a multiple linear regression model to analyse the real-life application data from the World Population Data Sheet (2011). MLR is an extended version of simple linear regression: MLR equations are employed to determine the linear relationship between a dependent attribute and several regressors when fitting a straight-line model to the observed data (Coelho-Barros, Simoes, Achcar, Martinez and Shimano, 2008). The general multiple linear regression model has the form

Yi = β0 + β1X1i + β2X2i + ... + βkXki + εi   (2.4)

where Y is the dependent variable, X1, X2, ..., Xk are the independent or explanatory variables, i indexes the n sample observations, ε is the random error term and β0, β1, ..., βk are the regression coefficients.

CHAPTER 3
METHODOLOGY

3.1 Introduction

This chapter expounds the techniques used in this study and briefly discusses the algorithms employed in the investigation. It is divided into five main parts. The first part describes the source of the data and the research design. Section two deals with the methodological framework of the multiple linear regression (MLR) model. Section three describes how the missing data mechanism (MCAR or MAR) is tested. Section four covers the assumptions of MCAR and MAR, together with the classification of missing values under the various missing data mechanisms. Finally, section five briefly explains the mean absolute difference (MAD), root mean squared error (RMSE) and coefficient of determination (R2) used as performance assessment criteria to compare the imputation algorithms and identify the best. It also displays an outline of the overall data analysis procedure.

3.2 Source of Data

This study illustrates the application of imputation techniques to a real-life dataset, the World Population Data Sheet, 2011 (Population Reference Bureau, 2011); secondary data is thus used in this thesis.
The Population Reference Bureau is a non-profit organization which publishes an annual world population data sheet, a chart filled with information from about two hundred countries on essential demographic characteristics and health-related issues, for example population density, maternal mortality, life expectancy at birth, HIV/AIDS prevalence, total population estimates, poverty, and contraceptive usage (Population Reference Bureau, 2013). These data serve as a key resource for academic research, stakeholders, practitioners and policy makers. The dataset used here comprises 106 observations on 10 variables. Life expectancy at birth (LEB), the target variable, and the nine other variables utilized in this study are available for exactly one hundred and six countries (Population Reference Bureau, 2011). Based on their estimated LEB, the countries have been partitioned into three categories: countries with a small LEB estimate, countries with an average LEB estimate and countries with a large LEB estimate. The LEB is a single index of mortality that condenses mortality conditions, giving the mean number of years a cohort would be expected to live if subjected to the age-specific mortality rates of a given period (Pollard, 1988). The dependent variable, life expectancy at birth (LEB), and the nine independent variables that account for LEB in the year 2011 are as follows:

URBAN: total number of people living in urban towns;
CMW: total number of married women of child-bearing age practising birth control;
GNIPP: gross national income converted to international dollars;
DEN: number of people per square kilometre;
RWS: total number of rural people with access to purified water supply;
IMR: total number of child deaths under one year;
TFR: total fertility rate;
DEPPOP: total number of dependent people;
POVERTY: total number of people who live on less than 2 dollars per day.

3.3 Research Design

The following plan served as the research design. A multiple linear regression (MLR) model is fitted to the original, complete data matrix. Missingness at rates of 5%, 10%, 20%, 30% and 40% is then created artificially in the original data. Little's test is performed in order to confirm whether the missingness is MCAR or MAR. Selected imputation algorithms are then employed under the MCAR and MAR mechanisms to estimate and replace the missing values created in the original data. After each imputation algorithm has estimated and replaced the artificially created missing values, an MLR model is fitted to re-estimate the coefficients and their standard errors. All the estimated models are compared on the chosen evaluation criteria, and the model closest to the original-data model identifies the recommended imputation algorithm.

Diagrammatical Representation of the Research Design
Figure 3.1: Step by step procedure of the research design

3.4 Multiple Linear Regression (MLR)

It is important to note that the MLR model is among the most widely employed techniques in the field of statistics. It helps the researcher determine the relationship between a response attribute and a number of predictor attributes, and regression analysis is arguably the most robust approach for studying in detail the relationships among the variables in a given data set. This study treats missing values throughout as ignorable. The assumption of an ignorable missing-value mechanism implies that the reason why values are missing need not be modelled. Ignorable missingness is an umbrella term comprising the missing completely at random (MCAR) and missing at random (MAR) mechanisms.
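The design step that deletes a given percentage of cells completely at random can be sketched directly; the function below (names and data illustrative) blanks out `rate` of the cells of a complete matrix, marking them with `None`, which corresponds to the MCAR deletion used in this design.

```python
import random

def make_mcar(matrix, rate, seed=1):
    """Delete `rate` of all cells uniformly at random (MCAR)."""
    rng = random.Random(seed)  # fixed seed for a reproducible sketch
    out = [row[:] for row in matrix]
    cells = [(i, j) for i in range(len(out)) for j in range(len(out[0]))]
    for i, j in rng.sample(cells, round(rate * len(cells))):
        out[i][j] = None
    return out

complete = [[float(10 * i + j) for j in range(10)] for i in range(10)]
with_gaps = make_mcar(complete, rate=0.20)
missing = sum(v is None for row in with_gaps for v in row)
print(missing)  # 20 of the 100 cells
```

Because the deleted positions depend on nothing but the random draw, the resulting missingness pattern is MCAR by construction; generating MAR patterns would instead require deletion probabilities that depend on observed values.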
3.4.1 The Multiple Linear Regression (MLR) Model

The MLR model assumes a linear relationship between the response attribute Yi and a set of predictor attributes Xiᵀ = (Xi0, Xi1, ..., Xik), where the initial predictor Xi0 = 1 is a constant unless otherwise stated. Each observation Yi may then be modelled as:

Yi = β0 + β1Xi1 + β2Xi2 + ... + βKXiK + εi   (3.1)

where ε ∼ N(0, σ²). This equation is called the MLR model: the Yi are the response values, β1, β2, ..., βK are the regression coefficients, β0 is the constant term (the value when no covariates enter the model) and ε is the residual term. The mean of the dependent attribute Y, a linear expression in the coefficients β0, β1, ..., βK, is

E(Y) = β0 + β1X1 + ... + βKXK   (3.2)

3.4.2 Matrix Representation of the Model

The function connecting the response attribute Y to the predictor variables X1, X2, ..., XK is

Y = β0 + β1X1 + ... + βKXK + ε

With N independent observations on Y and the associated values of X, the model becomes

Y1 = β0 + β1X11 + β2X12 + ... + βKX1K + ε1
Y2 = β0 + β1X21 + β2X22 + ... + βKX2K + ε2
...
YN = β0 + β1XN1 + β2XN2 + ... + βKXNK + εN

In matrix notation, the model is Y = Xβ + ε, where Y = (Y1, ..., YN)ᵀ, ε = (ε1, ..., εN)ᵀ, β = (β0, β1, ..., βK)ᵀ and X is the design matrix whose ith row is (1, Xi1, ..., XiK), with E(ε) = 0 and Cov(ε) = σ²I. Thus Y is an N × 1 column vector, X an N × (K + 1) matrix, β a (K + 1) × 1 column vector and ε an N × 1 column vector.
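The least-squares solution of the matrix model above is the standard normal-equations estimate β̂ = (XᵀX)⁻¹XᵀY. As a sketch, the one-predictor-plus-intercept case is small enough that the 2 × 2 inverse can be written out explicitly in pure Python (illustrative data):

```python
def ols_normal_equations(x, y):
    """Solve b = (X'X)^{-1} X'Y for the design matrix X = [1, x]."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(a * a for a in x)
    sxy = sum(a * b for a, b in zip(x, y))
    # X'X = [[n, sx], [sx, sxx]]; invert it explicitly via its determinant.
    det = n * sxx - sx * sx
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1

b0, b1 = ols_normal_equations([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(b0, b1)  # 1.0 2.0  (the exact line y = 1 + 2x)
```

With K predictors, the same formula applies but the inverse is taken of the (K + 1) × (K + 1) matrix XᵀX, which statistical software computes numerically.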
3.4.3 Assumptions of the Multiple Linear Regression

• The residuals are assumed to be normally distributed with mean zero and an unknown common variance σ².
• The errors are uncorrelated; that is, they are independent.
• The predictors X are regarded as fixed by the investigator and measured with negligible error.
• The sum of the residuals weighted by the corresponding fitted values Ŷ is zero.

3.4.4 Testing for Overall Regression Significance

H0 : β1 = β2 = ... = βk = 0
H1 : βi ≠ 0 for at least one i = 1, 2, ..., k

The test statistic is

F* = MSR / MSE   (3.3)

where F* is the observed value of F, MSR is the mean square due to regression and MSE the mean square error. Denoting the null and alternative hypotheses by H0 and H1, the decision rule has the form: if F* ≤ F(α, a, b), fail to reject H0; if F* > F(α, a, b), reject H0, where α is the significance level of the test, a and b are the numerator and denominator degrees of freedom, and F(α, a, b) is the critical (table) value of F. Failing to reject H0 implies that the overall regression is not statistically significant; otherwise, the overall regression is statistically significant.

3.4.5 Testing for the Significance of the Slopes

To determine the significant contribution of a particular variable to the model, the appropriate hypothesis is formulated and the t-test statistic is used. To test the contribution of X1, whose regression coefficient is β1, the hypotheses are:

H0 : β1 = 0
H1 : β1 ≠ 0

The test statistic is

t = β̂1 / s.e.(β̂1) ∼ t(n−k−1)   (3.4)

where s.e.(β̂1) is the standard error of β̂1,

s.e.(β̂1) = √(σ̂² C11)   (3.5)

C11 is the second diagonal element of the (XᵀX)⁻¹ matrix and σ̂² = MSE = SSE/(n − k − 1), with n observations and k predictors.

Decision Rule and Conclusion

If |t| ≤ t(n−k−1), fail to reject H0; if |t| > t(n−k−1), reject H0. When we fail to reject H0, we conclude that the variable X1 does not contribute significantly to the model; otherwise, it does.
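The two tests above can be sketched for the one-predictor special case, where the overall F* of eq. 3.3 and the slope t of eq. 3.4 are linked by the identity t² = F*. A pure-Python sketch with illustrative data:

```python
import math

def slope_tests(x, y):
    """Return (F*, t) for the one-predictor model y = b0 + b1*x + e."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    fitted = [b0 + b1 * a for a in x]
    sse = sum((b - f) ** 2 for b, f in zip(y, fitted))
    ssr = sum((f - my) ** 2 for f in fitted)
    mse = sse / (n - 2)            # error d.f. = n - k - 1 with k = 1
    f_star = ssr / mse             # MSR = SSR / 1 for a single predictor
    t = b1 / math.sqrt(mse / sxx)  # s.e.(b1) = sqrt(MSE / Sxx)
    return f_star, t

f_star, t = slope_tests([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(abs(t ** 2 - f_star) < 1e-6)  # True
```

The observed statistics would then be compared with the F and t critical values at the chosen α, exactly as the decision rules above describe.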
3.4.6 Roles of R² and r²

R² measures the percentage of the total variation in the response attribute that is explained by the overall regression equation. The higher the value of R², the greater the percentage of the variance explained by the fitted equation, indicating a better-formulated regression model. By contrast, a partial r² measures the marginal contribution of one variable when all others are already included in the model.

R² = SSR / SST   (3.6)

R² = ∑ᵢ (Ŷi − Ȳ)² / ∑ᵢ (Yi − Ȳ)²   (3.7)

where SSR is the sum of squares due to regression and SST is the total sum of squares.

3.4.7 Multicollinearity

Multicollinearity exists in a regression model when at least two of the explanatory variables are related to each other; a linear relationship or inter-correlation between explanatory variables in a given dataset is described as multicollinearity. If multicollinearity occurs, statistical inferences drawn from the data will not be reliable. Correlated variables among the explanatory variables of a model can be found by calculating the variance inflation factor (VIF) for each explanatory variable; the VIF is a more rigorous check for collinearity than the correlation coefficient. Mathematically,

VIF = 1 / (1 − R²ᵢ)   (3.8)

where R²ᵢ is the coefficient of determination from regressing the ith explanatory variable on the remaining ones. In practice, the VIF is used in a stepwise elimination approach: the independent variable with the highest VIF is removed and the model re-run, repeating until all remaining independent variables have a VIF below the threshold of 10.

3.4.8 Heteroscedasticity

Heteroscedasticity occurs when the residuals of the estimated model do not have constant variance across observations.
The presence of heteroscedasticity does not affect the expected value of the model's coefficient estimates, but OLS underestimates the standard errors of the estimated coefficients, which distorts the t-test statistics for significance.

3.4.9 Breusch-Pagan Test

The Breusch-Pagan test is used to test for heteroscedasticity in a linear regression model. It tests whether the variance of the residuals from a regression model depends on the values of the predictor variables, based on the relation

log_e σᵢ² = γ0 + γ1Xi   (3.9)

so that σ² either increases or decreases with the level of X, depending on the sign of γ1; constancy of the error variance corresponds to γ1 = 0. The test of H0 : γ1 = 0 versus Hα : γ1 ≠ 0 is carried out by regressing the squared residuals εᵢ² against Xi in the usual manner and obtaining the regression sum of squares, denoted SSR*. The test statistic χ²BP is

χ²BP = (SSR*/2) / (SSE/n)²   (3.10)

where SSR* is the regression sum of squares from regressing ε² on X and SSE is the error sum of squares from regressing Y on X. If H0 : γ1 = 0 holds and n is reasonably large, χ²BP follows approximately the chi-square distribution with one degree of freedom. Large values of χ²BP lead to the conclusion Hα, that the error variance is not constant.

3.4.10 Remedy for Assumption Violation

The original Box-Cox transform is given by

y(γ) = (y^γ − 1)/γ for γ ≠ 0, and y(γ) = log y for γ = 0   (3.11)

The objective of Box-Cox transformations is to restore the linearity assumption of the model: the dependent variable is transformed by choosing a suitable value of γ and applying the transform accordingly.

3.4.11 Outliers

Cook's Distance

Cook's distance measures the influence of the ith observation on all the fitted values.
It is a standardized version of the sum of squared differences between the fitted values computed with and without observation i:

Di = ∑ⱼ (Ŷj − Ŷj(i))² / (p · MSE)   (3.12)

where p is the number of model parameters. A rule of thumb is that if F(Di; p, n − p) lies below the 10th or 20th percentile the case is not influential, while if F(Di; p, n − p) is near the 50th percentile or more, the case has major influence (Howard & Gordoh, 2005).

3.4.12 Normality Test

Shapiro-Wilk Test

The Shapiro-Wilk test tests the null hypothesis that a sample x1, x2, ..., xn came from a normally distributed population. The test statistic is

W = (∑ᵢ aᵢ x₍ᵢ₎)² / ∑ᵢ (xᵢ − x̄)²   (3.13)

where x₍ᵢ₎ is the ith smallest number in the sample and the coefficients are given by

(a1, ..., an) = mᵀV⁻¹ / (mᵀV⁻¹V⁻¹m)^(1/2)   (3.14)

where m = (m1, ..., mn)ᵀ holds the expected values of the order statistics of independent and identically distributed random variables sampled from the standard normal distribution, and V is the covariance matrix of those order statistics. When the Shapiro-Wilk test has a p value less than the alpha level (0.05), the null hypothesis is rejected and we conclude that the data are not normally distributed; conversely, if the p value exceeds the alpha level (0.05), we fail to reject the null hypothesis that the data are normally distributed.

3.5 Testing the Missing Data Mechanism (MCAR & MAR) Assumption

Researchers frequently face difficulties in analysing data sets with missing or incomplete observations. To analyse a data set containing missing values appropriately, the missing-value mechanism must first be investigated. If data are missing completely at random, then many incomplete-data analysis algorithms lead to valid inference (Little & Rubin, 2002); thus, a test of missing completely at random is warranted.
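A simple screen in this spirit, used later in this chapter, is to code each case of a variable as 1 if missing and 0 if observed and examine the association of this indicator with a fully observed covariate. The sketch below is illustrative (hypothetical data and names), not Little's formal test:

```python
def correlation(u, v):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

covariate = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
target = [2.1, None, 4.0, None, 6.3, 7.1]           # variable with gaps
indicator = [1.0 if v is None else 0.0 for v in target]
r = correlation(indicator, covariate)
# A correlation near zero is consistent with MCAR; a strong
# correlation suggests missingness depends on the covariate (MAR).
print(abs(r) < 0.5)  # True
```

A formal decision would of course rest on a significance test rather than an arbitrary cutoff; the sketch only illustrates the logic of the dummy-variable check.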
In missing-value analysis, Little's (1988) test is useful for testing the assumption of missing completely at random for multivariate, partially observed data.

3.5.1 Little's Test of MCAR

Little's MCAR test examines the missing completely at random assumption, which must be satisfied before substituting missing values with certain imputation algorithms. The test assesses MCAR for multivariate data with incomplete values; according to Kim & Bentler (2002), Little's MCAR test assesses the homogeneity of means and covariances using generalized least squares estimation. Little's test statistic is

d² = ∑ₖ nₖ (x̄_obs,k − μ̂_obs,k)ᵀ Σ̂_obs,k⁻¹ (x̄_obs,k − μ̂_obs,k)   (3.15)

where k indexes the (at most 2ᴾ) missing-data patterns, nₖ is the number of observations with the kth pattern, x̄_obs,k is the mean of the variables observed in pattern k, and μ̂_obs,k and Σ̂_obs,k are the corresponding sub-vector and sub-matrix of the maximum likelihood estimates of the mean vector and covariance matrix. Under MCAR, the statistic follows a χ² distribution with ∑ₖ Pₖ − P degrees of freedom, where Pₖ is the number of observed variables in pattern k and P is the total number of variables. When Little's MCAR test has a p value exceeding the alpha level (0.05), neither the assertion of normality nor the MCAR hypothesis is rejected; when the p value is lower than the alpha level (0.05), there is evidence against the null hypothesis and we conclude that the data are MAR. Data points are MCAR if the pattern of missing values depends on neither the observed nor the unobserved data.

3.6 Classification of Missing Data under the Assumptions of the Missing Data Mechanisms

To determine whether a data matrix with missing observations is MCAR or MAR, Little's MCAR test and a dummy variable of interest on the variables are employed. First, Little's MCAR test is used to test the MCAR and MAR assumptions.
If there is no evidence against the null hypothesis under Little's MCAR test, then the study can conclude that the imputation algorithms that depend on the missing completely at random assumption, namely KNN, mean substitution (MS) and regression substitution (RS), are applicable (Lin & Bentler, 2012; McKnight, 2007). Violation of the MCAR assumption may result in biased estimates from these missing-data methods. If there is significant evidence against the null hypothesis under Little's MCAR test, then the study concludes that the imputation algorithms relying on the MAR assumption, multiple imputation by chained equations (MICE) and expectation maximization (EM), are appropriate (Lin & Bentler, 2012; Rubin & Thayer, 1982); the MAR assumption permits the parameters to be properly adjusted using all accessible information. Secondly, to determine whether a data matrix with missing observations is MCAR or MAR, we compute a dummy variable that indicates whether the data in a particular attribute are missing, and check whether it is correlated with the other attributes in the dataset. When the dummy (missingness) variable is observed to be independent of the other attributes, the pattern of missing data is described in this study as MCAR rather than MAR, and the reverse holds for MAR.

3.7 The Imputation Algorithms for Treating Missing Values under the MCAR Mechanism

From the literature review, the following imputation algorithms have been used under the MCAR assumption to handle the missing-data problem effectively (Schmitt et al., 2015).

3.7.1 K Nearest Neighbors (KNN) Imputation Algorithm

The nearest-neighbour imputation method is a technique based on the notion of proximity between observations (subjects); this similarity is usually determined by a distance function (the Euclidean distance, for example). It is a technique in which the missing data of a given subject are substituted with the value observed at the same position for the nearest subject.
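This nearest-neighbour substitution can be sketched in a few lines. The function below (illustrative names and data; the non-target columns are assumed complete, matching the set H of the formal description) fills each missing cell of the target column with the value held at that position by the closest fully observed subject:

```python
import math

def knn_impute(rows, target):
    """1-nearest-neighbour imputation of column `target`."""
    complete = [r for r in rows if r[target] is not None]
    others = [j for j in range(len(rows[0])) if j != target]
    for r in rows:
        if r[target] is None:
            # Euclidean distance over the fully observed columns.
            donor = min(complete, key=lambda c: math.sqrt(
                sum((r[j] - c[j]) ** 2 for j in others)))
            r[target] = donor[target]
    return rows

data = [
    [1.0, 1.0, 10.0],
    [5.0, 5.0, 50.0],
    [1.1, 0.9, None],   # nearest complete subject is the first row
]
result = knn_impute(data, target=2)
print(result[2][2])  # 10.0
```

Practical variants average the k > 1 nearest donors and scale the variables before computing distances, but the donor-lookup logic is the same.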
Explicitly, suppose X is a matrix representing the data set, X = (X(1), X(2), ..., X(p)), where each column X(i) (i = 1, 2, ..., p) is a random variable of n observations. Let X(j) be a column with missing values, and write X(j) = (X(j)_obs, X(j)_miss), where X(j)_obs is the sub-vector of observed values of X(j) and X(j)_miss that of the missing values. Consider H = {i : X(i)_miss = ∅, i = 1, 2, ..., p}, with cardinality m, the set of indices of columns without missing values, and let Z = {X(i) : i ∈ H}. Let Z_obs and Z_miss be the two sub-matrices of Z obtained by selecting the rows corresponding to X(j)_obs and X(j)_miss respectively. Suppose l is the index of a subject with no observed value for the variable X(j). Among the subjects k that have complete measurements on the set H, the nearest neighbour j0 minimizes the distance between k and l:

j0 = argminₖ d(z_obs(l), z_obs(k)),  1 ≤ k ≤ n   (3.16)

where d is a distance measure and n is the number of subjects in the set. Here d is the Euclidean distance defined by

d(z_obs(l), z_obs(k)) = √( ∑ᵢ∈H (z(i)_obs(l) − z(i)_obs(k))² )   (3.17)

Once j0 is determined, the missing value X(j)_miss(l) is estimated by the value observed for subject j0:

X(j)_miss(l) = X(j)_obs(j0)   (3.18)

3.7.2 Regression Substitution

The principle of regression substitution is to use the observed values to create a fitted regression model: the attribute with missing data is the target variable, and the incomplete values are substituted by the predicted values from the regression equation. To describe how the regression algorithm works, the most appropriate predictors of the attribute with missing data are determined from the correlation matrix. The best predictors are selected and used as predictor attributes in a regression model, with the attribute containing the missing values as the response variable.
Cases with complete information on the selected predictor attributes are employed to generate the regression equation, and the model in turn is used to predict values for the incomplete cases. By iteration, the missing values are substituted so that all cases can be used to predict the response attribute; these steps are repeated until convergence, and the predictions obtained from the final cycle are used to fill in the incomplete data. Explicitly, suppose X is a matrix representing the data set, X = (X(1), X(2), ..., X(p)), where each column X(i) (i = 1, 2, ..., p) is a random variable of n observations. Let X(j) be a column with missing values, and write X(j) = (X(j)_obs, X(j)_miss), where X(j)_obs is the sub-vector of observed values of X(j) and X(j)_miss that of the missing values. Consider H = {i : X(i)_miss = ∅, i = 1, 2, ..., p}, with cardinality m, the set of indices of columns without missing values, and let Z = {X(i) : i ∈ H}. Let Z_obs and Z_miss be the two sub-matrices of Z obtained by selecting the rows corresponding to X(j)_obs and X(j)_miss respectively. Consider the regression model based on the observed part:

X(j)_obs = βZ_obs + μ, where μ ∼ N(0, σ²)   (3.19)

with β = (β0, β1, ..., βm) the vector of regression coefficients and the error term μ = (μ1, ..., μ_{n−q}), where q is the length of X(j)_miss. The estimates of the missing values X̂(j)_miss,i, where i ranges over the q row indices of X(j)_miss, are obtained by

X̂(j)_miss,i = β̂0 + β̂1 Z_miss,1 + ... + β̂m Z_miss,m   (3.20)

where β̂ is the usual estimator of β. The regression approach to missing values depends on the predictors entered into the regression equation, which is why Little (2002) regards this technique as a conditional one. It is more sophisticated than the mean substitution method (Rubin et al.
2007), but it can lead to overestimating the relationships between the predictors and the dependent variable (Schafer & Graham, 2002).

3.7.3 Mean Substitution (MS)

With the MS approach, the arithmetic mean of the observed values of each variable is computed and then substituted into each of the missing cells of that attribute. The MS technique yields good results if the missing-data mechanism is MCAR, and it is among the most widely employed imputation techniques for replacing incomplete data. Explicitly, if the value Yij of the kth class c_k is missing, it is replaced by

Ŷij = (1/nₖ) ∑ᵢ∈cₖ yij   (3.21)

where nₖ is the number of observed values in the jth feature of the kth class. For example, consider the following data set with missing values, before and after replacement by the mean substitution technique.

Table 3.1: The dataset with missing values

VO1   VO2   VO3
12    NA    50
NA    NA    43
20    26    67
23    64    NA
40    34    78
21    NA    21
Mean: 23.2  41.3  51.8

Table 3.2: After replacement of missing values by the mean substitution technique

VO1   VO2   VO3
12    41.3  50
23.2  41.3  43
20    26    67
23    64    51.8
40    34    78
21    41.3  21

3.8 The Algorithms for Treating Missing Values under the MAR Mechanism

From the literature review, the following two imputation algorithms have been classified under the MAR assumption for handling the missing-data problem effectively (Azur et al., 2012; Schafer & Graham, 2002).

3.8.1 Expectation-Maximization (EM) Algorithm

The EM algorithm is an iterative procedure used to compute the maximum likelihood estimate in the presence of latent or missing data; maximum likelihood estimation seeks the model parameters under which the observed values are most probable. The EM approach, initially developed by Dempster et al. (1977), is an iterative procedure to maximize the likelihood calculated by a parametric model for the observed data.
EM operates under the assumption that, given the attributes employed in the imputation approach, the unobserved data are MAR. The EM technique for missing values relies largely on maximum likelihood estimation of the mean and covariance structure given the available data. Each iteration of the EM approach comprises two steps, the Expectation step (E-step) and the Maximization step (M-step). In the E-step, regression equations based on the available values are used to calculate the expected values of the missing entries, which are replaced by the conditional means established by the regression models. In the M-step, the estimates obtained from the E-step are used to update the parameters so as to increase the log likelihood relative to the previous state. These two steps are repeated, and the algorithm converges to a stationary point under mild regularity conditions (Allison, 2002; Dempster et al., 1977). The distribution of the complete data Y can be factored as

f(Y | θ) = f(Y_obs, Y_mis | θ) = f(Y_obs | θ) f(Y_mis | Y_obs, θ)   (3.22)

where f(Y_obs | θ) is the density of the observed data and f(Y_mis | Y_obs, θ) is the conditional density of the missing data given the observed data. The log likelihood of the complete data is then

l(θ | Y) = l(θ | Y_obs, Y_mis) = l(θ | Y_obs) + ln f(Y_mis | Y_obs, θ)   (3.23)
With Y = (Y_obs, Y_mis), given the current parameter estimate and the observed part of Y, the function Q is the expected complete-data log-likelihood:

    Q(θ | θ^(k)) = E{ ln f(Y | θ) | Y_obs, θ^(k) } = Σ_{Y_mis} ln[ f(Y_obs, Y_mis | θ) ] f(Y_mis | Y_obs, θ^(k))    (3.24)

or, for continuous missing data,

    Q(θ | θ^(k)) = ∫ l(θ | Y) f(Y_mis | Y_obs, θ^(k)) dY_mis    (3.25)

Maximization step (M-step)

The M-step obtains the updated maximum likelihood parameter estimate using the Q function:

    θ^(k+1) = arg max_θ Q(θ | θ^(k))

The E-step and M-step are alternated until the difference l(θ^(k+1)) − l(θ^(k)) is negligible.

3.8.2 Multiple Imputation by Chained Equation (MICE) Algorithm

MICE is a particular multiple imputation approach used to handle missing data effectively (Raghunathan et al., 2001; Van Buuren, 2007). It works under the assumption that, given the attributes employed in the imputation, the missing data are MAR: the probability that a value is missing is related solely to the observed values and not to the missing values themselves (Schafer & Graham, 2002). In practice, MICE has been employed on data matrices with thousands of observations and hundreds of attributes (Van Buuren, 2007). In the chained-equation approach, a series of regression models is run whereby each attribute with incomplete values is regressed on the other attributes in the data (Van Buuren & Groothuis-Oudshoorn, 2011). Each attribute can therefore be modeled according to an appropriate distribution: for instance, logistic regression for binary attributes, linear regression for continuous data, a multinomial logit model for categorical data and a Poisson model for count data. MICE thus specifies the imputation model through a set of conditional densities, one per incomplete attribute; the joint distribution is only implicitly defined and need not actually exist.
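Returning to the EM cycle of Section 3.8.1, the E-step/M-step iteration can be illustrated with a toy univariate normal sample containing missing entries. This is a minimal sketch with invented data values, not the thesis's implementation; for a normal mean and variance, the E-step only needs the expected sum and sum of squares of the completed data.

```python
import statistics

def em_normal(data, iters=100):
    """Toy EM for the mean and variance of a univariate normal sample
    containing missing entries (None)."""
    obs = [x for x in data if x is not None]
    n, n_mis = len(data), len(data) - len(obs)
    mu = statistics.mean(obs)           # starting values from the observed part
    var = statistics.pvariance(obs)
    for _ in range(iters):
        # E-step: expected sufficient statistics of the completed data,
        # replacing each missing value's contribution by its expectation
        s = sum(obs) + n_mis * mu
        s2 = sum(x * x for x in obs) + n_mis * (mu ** 2 + var)
        # M-step: updated maximum likelihood estimates
        mu = s / n
        var = s2 / n - mu ** 2
    return mu, var

mu, var = em_normal([4.1, None, 5.0, 6.2, None, 4.7])
```

For this toy model with ignorable missingness, the iteration converges to the observed-data maximum likelihood estimates (here a mean of 5.0), in line with the convergence property of EM noted above.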
After a number of iterations, the Markov chain should converge to a stationary distribution; at that point the chain must be irreducible, aperiodic and recurrent (Van Buuren, 2012). The number of iterations needed for the chain to converge differs per data set, but it is usually quite small, about 5 to 10. The MICE technique is an MCMC approach and is briefly described in Algorithm 1.0 below. Beginning with initial imputations, MICE iterates over the conditional densities.

Algorithm 1.0: the MICE algorithm
1. Specify an imputation model P(Y_j^mis | Y_j^obs, Y_−j, R) for each incomplete variable Y_j, j = 1, ..., p.
2. For each incomplete variable, initialise the starting imputations Y_j^*(0) by random draws from Y_j^obs.
3. Repeat for iterations t = 1, ..., T:
4.   Repeat for incomplete variables j = 1, ..., p:
5.     Define the currently imputed data Y_−j^(t) = (Y_1^(t), ..., Y_{j−1}^(t), Y_{j+1}^(t−1), ..., Y_p^(t−1)).
6.     Draw θ_j^*(t) ∼ P(θ_j | Y_j^obs, Y_−j^(t), R).
7.     Draw the imputations Y_j^*(t) ∼ P(Y_j^mis | Y_j^obs, Y_−j^(t), R, θ_j^*(t)).
8.   End repeat j.
9. End repeat t.

3.9 Evaluation Assessment Criteria to Compare the Imputation Algorithms

The following performance assessment criteria are used to evaluate the imputation algorithms employed in this study: the mean absolute difference (MAD), the root mean square error (RMSE) and the coefficient of determination (R2).

3.9.1 Mean Absolute Difference (MAD)

The MAD is a measure of statistical dispersion: the expected absolute difference between two values drawn from a distribution, here the observed value and the imputed value. It is computed as the arithmetic mean of the absolute differences between observed and imputed values. The smaller the MAD, the better; hence the algorithm with the smallest MAD is recommended for replacing unobserved data.
Mathematically it is given by

    MAD = E|X_o − X_m|    (3.26)

where X_o denotes the observed values and X_m the imputed values.

3.9.2 Root Mean Squared Error (RMSE)

The root mean squared error (RMSE) is a performance indicator that measures the average distance of the residuals: it compares the original and substituted values, and essentially denotes the standard deviation of their differences. It is a useful indicator of overall accuracy that helps researchers see how each imputation algorithm performs on a data set. In the literature, the most efficient imputation algorithm is the one with the lowest RMSE (Huang & Carriere, 2006); the smaller the RMSE, the better the performance. The mathematical formula for the RMSE is

    RMSE = sqrt( (1/n) Σ_{i=1}^{n} (X_io − X_im)^2 )    (3.27)

where i = 1, 2, ..., n, n is the sample size, X_io are the observed values and X_im the imputed values (Schmitt et al., 2015).

3.9.3 Coefficient of Determination

The coefficient of determination (R2) measures the proportion of variability in the response variable explained by the predictor variables. R2 ranges from 0 to 1: values near one indicate that the model has strong predictive ability (the regression fits the data well), while values near zero indicate poor explanatory power. R2 is given by the formulas

    R2 = 1 − SSE/SST    (3.28)

    R2 = 1 − Σ_{i=1}^{n} (Y_i − Ŷ_i)^2 / Σ_{i=1}^{n} (Y_i − Ȳ)^2    (3.29)

where SSE is the residual sum of squares and SST is the total sum of squares corrected for the mean.

3.10 Data Analysis Procedure

This study gathered information on 106 countries for the year 2011. Several variables were measured for these countries, some of which tend to be correlated among themselves. The quantitative data analysis plan is as follows: data entry, processing, organizing output into tables, explanation of the tables and drawing conclusions.
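The three assessment indicators defined in Section 3.9 can be computed with a short sketch (the observed and imputed vectors below are illustrative values, not thesis data):

```python
import math

def mad(observed, imputed):
    # Eq. (3.26): mean absolute difference between observed and imputed values
    return sum(abs(o - m) for o, m in zip(observed, imputed)) / len(observed)

def rmse(observed, imputed):
    # Eq. (3.27): root mean squared error
    return math.sqrt(sum((o - m) ** 2 for o, m in zip(observed, imputed)) / len(observed))

def r_squared(y, y_hat):
    # Eqs. (3.28)-(3.29): R2 = 1 - SSE/SST
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst

obs = [1.0, 2.0, 3.0, 4.0]
imp = [1.0, 2.0, 2.0, 4.0]
print(mad(obs, imp), rmse(obs, imp), r_squared(obs, imp))
```

With only one cell imputed incorrectly by 1, the sketch gives MAD = 0.25, RMSE = 0.5 and R2 = 0.8, illustrating how the three criteria penalize the same error on different scales.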
Moreover, the study used R to run the considered algorithms for estimating and imputing the missing values. Chapter 4 presents the results in the form of tables, diagrams and graphical displays for easy interpretation; at the final stage of the study, the empirical outcome of each algorithm is evaluated with reference to those tables and figures.

CHAPTER 4
Data Analysis and Discussion of Results

4.1 Introduction

This chapter presents the empirical calculations of the study and their statistical interpretation. It begins with the descriptive statistics of the 2011 World Population Data Sheet, followed by the multiple linear regression model, the missing data mechanism test, the comparison of imputation algorithms under the MCAR and MAR mechanisms, the comparison of imputation algorithms under the MLR model, the evaluation of imputation algorithms using the coefficient of determination, and finally the comparison of imputation methods using the mean absolute difference (MAD). All analyses were carried out in R.

4.2 Descriptive Statistics

Table 4.1 presents descriptive statistics of data from the 2011 World Population Data Sheet. According to Polland (1988), a country's life expectancy at birth is classified as low (< 65 years), medium (65-73 years) or high (> 73 years).

Table 4.1: Classification of Life Expectancy at Birth (LEB) for 106 Countries
Level        Values of LEB   Number of countries
Low LEB      < 65            43
Medium LEB   65-73           33
High LEB     > 73            30

From Table 4.1, the values of LEB were categorized into three levels. Firstly, 43 countries (41%) had low values of LEB.
Secondly, 33 countries (31%) had medium values of LEB, and finally 30 countries (28%) had high values. In the original 2011 World Population Data Sheet, LEB varies from country to country: Guinea-Bissau and Costa Rica recorded low LEB values of 48 and 49 respectively, while Slovenia recorded the highest LEB value of 80 among the countries used in this study (see Appendix IV).

Table 4.2: Correlation Matrix
Predictors   LEB      CMW      DEN      RWS      IMR      TFR
LEB           1.000
CMW           0.432    1.000
DEN           0.802    0.496    1.000
RWS           0.756    0.271    0.557    1.000
IMR          -0.744   -0.171   -0.491   -0.501    1.000
TFR          -0.201    0.138   -0.057   -0.218   -0.002    1.000

Table 4.2 shows the correlations among the pool of predictor variables and the dependent variable. It is clearly observed that LEB (the response variable) is positively related to CMW, DEN and RWS. Among the predictors there is a positive correlation of 0.496 between DEN and CMW, 0.271 between RWS and CMW, 0.557 between RWS and DEN, and a weak positive correlation of 0.138 between CMW and TFR. The low-to-weak correlations among the independent variables indicate the absence of multicollinearity among the covariates under study.

Table 4.3: Determination of Multicollinearity
Predictors   V.I.F   1/V.I.F
CMW          1.97    0.506624
DENSITY      1.72    0.579939
RWS          1.50    0.664721
IMR          1.40    0.713982
TFR          1.12    0.893680

Table 4.3 reports the variance inflation factor (VIF) and its reciprocal (the tolerance) for each predictor variable under consideration. The VIF values indicate no evidence of multicollinearity among the predictor variables. The table also shows that two of the VIFs exceed the mean VIF of 1.542, but none exceeds the threshold of 10.
Table 4.4: Test of Normality and Constancy of Variance of the Residuals
Test                          P-value
Shapiro-Wilk normality test   0.26736
Breusch-Pagan test            0.9065

Table 4.4 shows that the residuals are normally distributed (p > 0.05), and the p-value of the Breusch-Pagan test indicates constant variance of the residuals (p > 0.05).

Table 4.5: Summary of the Complete Original Dataset Model Coefficients (regression coefficient estimates, standard errors, t-values and p-values)
Variable   Estimate   Std. Error   t-value   p-value
Constant   26.5354    3.84132       6.91     0.0000
CMW         0.08643   0.02937       2.94     0.0040
DEN         0.39793   0.04723       8.43     0.0000
RWS         0.25676   0.0387        6.63     0.0000
IMR        -0.0872    0.00891      -9.78     0.0000
TFR        -0.7075    0.18353      -3.86     0.0000
R2 = 0.8983; F(5,100) = 167.32; p-value of F-statistic = 0.000

Table 4.5 presents the regression output for the complete original dataset without missing values: the coefficient estimates, their standard errors, t-values and p-values. The F-statistic of 167.32 (p-value = 0.000) indicates that the null hypothesis that the predictor variables jointly have no impact on life expectancy at birth (LEB) can be rejected. The results also show that CMW, DEN, RWS, IMR and TFR are all significant in predicting LEB, each with p-value ≤ 0.05. In addition, the multiple R-squared of 0.8983 indicates that 89.83% of the total variation in life expectancy at birth is explained by the regression model (adjusted R-squared = 0.8887). All five covariates were significant at the 5% level of significance.
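As an illustration of how the Table 4.5 estimates are used, a predicted LEB can be computed for one hypothetical country (the covariate values below are invented for the sketch; only the coefficients come from Table 4.5):

```python
# Coefficients from Table 4.5 of the thesis
coef = {"const": 26.5354, "CMW": 0.08643, "DEN": 0.39793,
        "RWS": 0.25676, "IMR": -0.0872, "TFR": -0.7075}
# Hypothetical covariate values for one country (illustrative only)
country = {"CMW": 50.0, "DEN": 60.0, "RWS": 70.0, "IMR": 30.0, "TFR": 3.0}

# Predicted life expectancy at birth: intercept plus weighted covariates
leb_hat = coef["const"] + sum(coef[k] * v for k, v in country.items())
print(round(leb_hat, 2))
```

This is the same linear combination that Equation (4.1) in the next section writes out symbolically.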
4.3 Multiple Linear Regression (MLR) Model

From the regression output in Table 4.5, the five covariates CMW, DEN, RWS, IMR and TFR all contribute significantly to predicting life expectancy at birth (LEB). The fitted MLR model is

    Y = 26.5354 + 0.08643X1 + 0.39793X2 + 0.25676X3 − 0.0872X4 − 0.7075X5    (4.1)

where Y = LEB, X1 = CMW, X2 = DEN, X3 = RWS, X4 = IMR and X5 = TFR. A unit increase in CMW is associated with a 0.08643 average increase in LEB if all other factors remain the same. Likewise, a unit increase in DEN is associated with a 0.39793 average increase in LEB, a unit increase in RWS with a 0.25676 average increase, a unit increase in IMR with a 0.0872 average decrease, and a unit increase in TFR with a 0.7075 average decrease in LEB, with all other covariates held constant.

Based on the results in Table 4.5, the covariates CMW, DEN, RWS, IMR and TFR were selected for the final model formulation and subjected to missing values, while LEB was fully observed in all instances. Throughout this thesis, the study assumes that the missing values follow the MCAR and MAR mechanisms with an arbitrary missing pattern.

4.4 Missing Data Mechanism Test

To analyze a dataset with missing observations accurately, an in-depth knowledge of how the data are missing is required (i.e. whether at random or not at random); this helps to group the missing values under the appropriate missing data mechanism. In this thesis, missing data at rates of 5%, 10%, 20%, 30% and 40% were artificially created, in an arbitrary missing pattern, from the complete 2011 World Population Data Sheet.
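The artificial creation of completely-at-random missingness at a given rate can be sketched as follows (an illustrative amputation routine, not the thesis's R code; the dummy matrix mimics the 106 countries by 5 covariates layout):

```python
import random

def ampute_mcar(matrix, rate, seed=0):
    """Delete `rate` (e.g. 0.05 for 5%) of the cells of a complete data
    matrix completely at random, returning a copy with None for missing."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(matrix)) for j in range(len(matrix[0]))]
    n_missing = round(rate * len(cells))
    out = [row[:] for row in matrix]        # copy so the original stays complete
    for i, j in rng.sample(cells, n_missing):
        out[i][j] = None
    return out

# Dummy complete matrix: 106 rows (countries) x 5 columns (covariates)
complete = [[float(i + j) for j in range(5)] for i in range(106)]
incomplete = ampute_mcar(complete, 0.20)    # 20% of the 530 cells become missing
```

Because every cell has the same deletion probability regardless of any value in the data, the resulting pattern is MCAR by construction.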
Missingness of 10% and below is regarded as a small fraction of incomplete values, 20% as a medium amount, 30% as a large amount, and 40% and above as a very large amount of incomplete values in the data matrix. Little's test of MCAR was used to determine the mechanism, and hence the appropriate imputation algorithms, for each percentage of missing values. Tables 4.6 and 4.7 show the output of Little's MCAR test for the percentages of missing values artificially created.

Hypotheses of Little's MCAR test:
H0: The missing values in the data set are MCAR.
H1: The missing values in the data set are not MCAR.

Decision rule for Little's MCAR test:
If p-value ≥ 0.05, fail to reject H0 and conclude that the missingness mechanism is MCAR.
If p-value < 0.05, reject H0 and presume that the missingness mechanism is MAR.

After Little's MCAR test, the various imputation algorithms are applied to estimate and replace the incomplete values artificially created in the complete original data matrix. This allows the algorithms to be compared statistically, and the best one selected, for each missing data pattern.

Table 4.6: Output of Little's MCAR test for MCAR
Proportion of missing data (%)   Chi-square statistic   Degrees of freedom (df)   P-value
5                                33.2287                29                        0.2686
10                               36.0689                37                        0.5125
20                               45.3028                53                        0.7647
30                               66.8731                62                        0.3134
40                               50.6023                65                        0.9050

From Table 4.6, since all the p-values for the various proportions of missing data are greater than 0.05 (p-value ≥ 0.05), there is no evidence to reject H0 and hence the missingness mechanism is MCAR. This assumption implies that the occurrence of incomplete values in the data matrix shows no pattern, and that the incomplete values are related neither to the observed nor to the missing values.
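The decision rule stated above can be written as a one-line classifier. Note that this sketch only encodes the thesis's decision rule applied to already-computed p-values; it does not compute Little's chi-square statistic itself:

```python
def mechanism_from_littles_test(p_value, alpha=0.05):
    """Thesis decision rule: fail to reject MCAR when the Little's-test
    p-value is at least alpha; otherwise presume the mechanism is MAR."""
    return "MCAR" if p_value >= alpha else "MAR"

# Applying the rule to the p-values reported in Table 4.6 (5% to 40% missingness)
print([mechanism_from_littles_test(p) for p in (0.2686, 0.5125, 0.7647, 0.3134, 0.9050)])
```

All five p-values from Table 4.6 classify as MCAR, matching the conclusion drawn in the text.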
Table 4.7: Output of Little's MCAR test for MAR
Proportion of missing data (%)   Chi-square statistic   Degrees of freedom (df)   P-value
5                                80.590                 31                        0.000
10                               98.855                 37                        0.000
20                               136.460                64                        0.000
30                               149.485                80                        0.000
40                               165.836                80                        0.000

From Table 4.7, since all the p-values for the various proportions of missing data are less than 0.05 (p-value < 0.05), there is enough evidence to reject H0 and presume that the missingness mechanism is MAR. Under the MAR assumption, the missing values are related to the observed values but not to the missing values themselves. Based on the literature review and the results of Little's MCAR test, the imputation algorithms in Table 4.8 were grouped into the MCAR and MAR mechanisms. These algorithms are employed to estimate and replace the missing values artificially created in the complete data set, which makes it possible to assess and identify the best imputation algorithm under each missing data mechanism.

Table 4.8: Imputation Algorithms for Treating Missing Values
MCAR                      MAR
Mean substitution         EM algorithm
K nearest neighbor        MICE algorithm
Regression imputation

4.5 Comparison of Imputation Algorithms for Treating Missing Values

To compare the various imputation algorithms used in this study and select the best-performing technique among them, the following performance assessment procedure is employed:
1. The average coefficient difference (ACD) between the MLR model for the original complete data and the MLR model for the incomplete data imputed by each algorithm is calculated and assessed.
2. The mean absolute difference (MAD) between the original (complete) data and the imputed data is computed and assessed.
3. The coefficient of determination (R2) of the regression output is also used to identify the best-performing imputation algorithm.
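Criterion 1, the ACD, averages the signed differences between the slope coefficients of the original model and those of the model refitted on imputed data, then takes the absolute value of the result. A minimal sketch, using the slope estimates reported later in Table 4.9 for the original model and the KNN-imputed data at 5% missingness:

```python
def acd(original_coefs, imputed_coefs):
    """Average coefficient difference: mean of the signed coefficient
    differences, with the sign of the final average ignored."""
    diffs = [o - m for o, m in zip(original_coefs, imputed_coefs)]
    return abs(sum(diffs) / len(diffs))

# Slope estimates (CMW, DEN, RWS, IMR, TFR) from Table 4.9
original = [0.08643, 0.39793, 0.25676, -0.08720, -0.70750]
knn_5pct = [0.06778, 0.45244, 0.24528, -0.08874, -0.61280]
print(round(acd(original, knn_5pct), 5))  # reproduces the reported 0.02351
```

The intercept is excluded, as in Table 4.9, and the smaller the ACD the closer the imputed-data model is to the original one.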
4.6 Comparison of Imputation Algorithms for Treating Missing Values under the MLR Model using ACD

To compare the imputation algorithms and select the best among them, every algorithm considered was used to replace the various missing values artificially created in the complete data matrix. The MLR model was then fitted to each imputed (completed) data set and compared to the general MLR model for the original complete dataset (the dataset without missing values), which is given by

    Y = 26.5354 + 0.08643X1 + 0.39793X2 + 0.25676X3 − 0.0872X4 − 0.7075X5    (4.2)

The comparison procedure is as follows:
1. Compare the MLR model obtained under each imputation algorithm, at each proportion of missing values, to the general MLR model.
2. Estimate the coefficient differences by subtracting each coefficient of the algorithm's model from the corresponding coefficient of the complete original data model.
3. Calculate the mean, or average, coefficient difference (ACD) for each imputation algorithm.
4. Compute the total ACD over all proportions of missingness for each imputation algorithm.

The best imputation algorithm is the one with the smallest ACD estimate. Table 4.9 shows how the average coefficient difference between the KNN imputation algorithm and the general MLR model for the original data is computed under 5% missingness.
Table 4.9: Average Coefficient Difference between the KNN Imputation Algorithm (5% Missingness) and the MLR Model for the Original Data
           MLR for original data       KNN for 5% missingness
Variable   Estimate    St. error       Estimate    St. error    Coefficient difference (CD)
Constant   26.5654     6.90777         24.41543    4.45889
CMW         0.08643    0.02977          0.06778    0.03431       0.01865
DEN         0.39793    0.05458          0.45244    0.05317      -0.05451
RWS         0.25676    0.04697          0.24528    0.04604       0.01148
IMR        -0.08720    0.00938         -0.08874    0.00999       0.00154
TFR        -0.70750    0.31094         -0.61280    0.22788      -0.09470
Average (CD)                                                     0.02351

With the ACD, a negative average coefficient difference is possible. Since the various imputed-data regression models are compared against the complete original regression model, the sign of the final ACD is ignored and its absolute value is used.

4.6.1 Comparison of Imputation Algorithms for Treating Missingness under the MCAR Mechanism

The presence of missing data is inevitable in cross-sectional and longitudinal studies. In real data analysis, the missing value pattern may be described as MCAR, MAR or MNAR, which account for the reasons that give rise to missing values. The following imputation algorithms under the MCAR mechanism were empirically compared and assessed: k nearest neighbor, mean substitution and regression substitution. To judge their performance, the ACD estimate for each algorithm was computed; the algorithm with the smallest ACD estimate fits best. Table 4.10 shows the total average coefficient difference estimates for the KNN, mean substitution and regression substitution algorithms under the MCAR mechanism.

Table 4.10: Performance of KNN, Mean Substitution and Regression Substitution under MCAR using the ACD Estimate
Percentage of missingness (%)   KNN       Mean Sub.   Reg. Sub.
5                               0.02351   0.01965     0.02142
10                              0.00676   0.02092     0.00234
20                              0.03640   0.06292     0.02379
30                              0.02519   0.16870     0.06265
40                              0.09255   0.02548     0.15454
TOTAL                           0.18441   0.29778     0.26474

Table 4.10 presents the performance of the k nearest neighbor algorithm, the mean substitution method and the regression substitution method under the MCAR mechanism using ACD estimates. Among the three imputation algorithms compared, mean substitution is the worst overall. All three methods performed better when the percentage of missing data was small (5%, 10% and 20%), especially KNN and regression substitution; however, at 5% and 40% missingness, mean substitution performed very well compared with KNN and regression substitution. The performance of the KNN algorithm was consistently good throughout the missingness percentages, whereas at a large percentage of missingness (40%) regression substitution gave very poor results. In conclusion, at small missingness percentages, mean substitution and regression substitution are good choices for replacing missingness under the MCAR mechanism, while at very small or large proportions of missingness KNN is the preferred algorithm.

4.6.2 Comparison of EM and MICE Algorithms for Treating Missingness under the MAR Mechanism using ACD

Under MAR, the missingness is related to the observed values and is independent of the missing values. Under this mechanism, the study employed the expectation maximization (EM) algorithm and the multiple imputation by chained equation (MICE) algorithm to impute the missing values. The average coefficient difference (ACD) was computed for each imputation algorithm, and the algorithm with the lowest ACD is the better method. Table 4.11 shows the total average coefficient difference of the EM and MICE algorithms for treating missing values under the MAR mechanism.
Table 4.11: Performance of EM and MICE Algorithms under MAR using the Average Coefficient Difference (ACD)
Percentage (%)   EM        MICE
5                0.01191   0.02896
10               0.00394   0.05072
20               0.06718   0.02911
30               0.12249   0.06238
40               0.05914   0.05576
TOTAL            0.26466   0.22693

Table 4.11 presents the performance of the EM and MICE algorithms for treating missing values under the MAR mechanism using ACD. Of the two algorithms compared, EM had the poorer overall results. At small missingness proportions (5% and 10%) the EM method outperformed the MICE algorithm, while at the 20%, 30% and 40% levels of missing data the MICE algorithm gave the more satisfactory results. Therefore, the EM algorithm can be used to replace missing data when the missingness proportion is small, but at large missingness percentages it is prudent to impute missing data using the MICE algorithm. In a nutshell, the MICE algorithm provides the less biased estimates and more accurate conclusions in replacing missingness; hence MICE is preferred to EM under the MAR mechanism. Figure 4.1 shows a graphical representation of the performance of the EM and MICE imputation algorithms using ACD as the evaluation criterion.

Figure 4.1: Graph of EM and MICE algorithms under MAR using average coefficient difference as the performance assessment criterion

The results for the EM and MICE imputation algorithms clearly show that at small missingness percentages (5% and 10%) the EM method outperformed the MICE algorithm, whereas at medium and very large missingness percentages the MICE approach provided sufficiently good results. This suggests that under MAR, EM can be used to replace missingness when the missingness percentage is low, and MICE should be used when the missingness percentage is large (20% and above).
4.7 Comparison of Imputation Algorithms for Treating Missing Values using the Mean Absolute Difference (MAD)

The mean absolute difference (MAD) is a measure of statistical dispersion, equal to the expected absolute difference of two values drawn from a probability distribution, here the observed value and the imputed value. The algorithm with the smallest MAD is recommended for imputing missing values. Table 4.12 shows the mean absolute difference under the MCAR mechanism at the various percentages of missing values.

Table 4.12: Performance of KNN, Mean Substitution and Regression Substitution for Treating Missing Values under the MCAR Mechanism using the Mean Absolute Difference (MAD)
Percentage (%)   KNN         Mean Sub.   Reg. Sub.
5                 7.079757   12.27647     4.825816
10                8.521118   11.97301     5.787719
20                7.799661   11.23853     6.228872
30                9.111556   11.08854     7.333512
40               13.08629    12.22555     7.558301
TOTAL            45.598382   58.80210    31.73422

From Table 4.12, among the three imputation algorithms compared under the MCAR mechanism using the mean absolute difference, regression substitution is the overall best performer, while the k nearest neighbor algorithm and mean substitution gave unsatisfactory results. Although the MAD of regression substitution grows as the percentage of missingness increases, at every level of missingness from 5% to 40% its performance exceeds that of both KNN and mean substitution. The study therefore suggests that under the MCAR mechanism regression substitution should be used to replace missing data; it is the preferred choice.
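The selection rule applied to Table 4.12, preferring the algorithm with the smallest total MAD across all missingness levels, can be written directly (totals taken from the table):

```python
# Total MAD over the 5%-40% missingness levels, from Table 4.12
total_mad = {"KNN": 45.598382, "Mean Sub.": 58.8021, "Reg. Sub.": 31.73422}

# The preferred algorithm under MCAR is the one minimizing total MAD
best = min(total_mad, key=total_mad.get)
print(best)
```

The minimum is attained by regression substitution, matching the conclusion above.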
Figure 4.2 gives a graphical demonstration of KNN, mean substitution and regression substitution under MCAR using MAD as the performance assessment criterion.

Figure 4.2: Graph of KNN, Mean substitution and Regression substitution under MCAR using MAD as performance assessment criterion

Figure 4.2 clearly demonstrates that regression substitution provides extremely good results at all levels of missingness, outperforming both the KNN algorithm and the mean substitution method. The study suggests that regression substitution should be used to substitute missingness under the MCAR mechanism.

Table 4.13: Performance of EM and MICE Algorithms for Treating Missing Values under the MAR Mechanism using the Mean Absolute Difference (MAD)
Percentage (%)   EM        MICE
5                0.80660   0.77920
10               0.09016   0.13324
20               0.14379   0.06088
30               0.45585   0.05452
40               0.31113   0.17289
TOTAL            1.80753   1.20073

Table 4.13 shows the performance of the EM and MICE algorithms for treating missingness under the MAR mechanism using MAD. At the 5%, 20%, 30% and 40% missingness percentages, the MICE algorithm outperformed the EM algorithm. It is therefore suggested that under the MAR mechanism the MICE algorithm be used to replace missing values; MICE provides the less biased inference and conclusions under MAR, so multiple imputation by chained equation (MICE) is the preferred choice under this mechanism.

4.8 Comparison of Imputation Algorithms for Treating Missing Values using the Coefficient of Determination (R2)

The coefficient of determination (R2) measures the proportion of variability in the dependent variable that is explained by the predictor variables. The value of R2 lies between zero and one: the higher the value of R2, the greater the proportion of variation explained by fitting the data to the model.
The regression output for the complete original dataset gave an R-squared of 0.8983, implying that 89.83% of the total variation in life expectancy at birth is explained by the regression model. To identify the best imputation algorithm, the R2 value for each imputation algorithm is assessed, and the one closest to the R2 of the original model is chosen as the best: the closer an algorithm's R2 is to the R2 of the original complete model, the better that algorithm replaces missing values. Table 4.14 shows the coefficients of determination under the MCAR mechanism.

Table 4.14: Performance of KNN, Mean Substitution and Regression Substitution under the MCAR Mechanism using R2
Percentage (%)   KNN      Mean Sub.   Reg. Sub.
5                0.8644   0.8457      0.8854
10               0.8575   0.8226      0.8981
20               0.8449   0.7944      0.8975
30               0.8490   0.7848      0.9198
40               0.6585   0.7477      0.9425
TOTAL            4.0743   3.9952      4.5433

From Table 4.14, using the coefficient of determination under the MCAR mechanism to select the best imputation approach, the results for the KNN algorithm and mean substitution exhibit a similar pattern. As anticipated, the performance of these imputation algorithms falls as the proportion of incomplete values increases, and both give unsatisfactory results as the missingness percentage grows. At the 30% and 40% levels of missing data, the coefficients of determination of KNN and mean substitution are (84.9% and 65.9%) and (78.5% and 74.8%) respectively, compared with the 89.8% of the original data. With regression substitution, by contrast, the R2 increases with an increasing percentage of missingness.
At small missingness percentages (5% and 10%), the coefficients of determination of regression substitution are 88.5% and 89.8% respectively, meaning that 88.5% or 89.8% of the total variation in life expectancy at birth is explained by the regression model, almost the same as the R-squared of the original model (89.8%). At 40% missingness, the regression substitution algorithm recorded 94.2%, which is higher than the 89.8% of the original model. In conclusion, regression substitution gives satisfactory performance in replacing missingness under the MCAR mechanism, and it is effective to use it for this purpose, whereas both the k nearest neighbor algorithm and the mean substitution method gave unsatisfactory results, especially at large missingness percentages. Figure 4.3 gives the graphical representation of the KNN, mean substitution and regression substitution algorithms under the MCAR mechanism using the coefficient of determination (R2).

Figure 4.3: Graph of KNN, Mean substitution and Regression substitution algorithms under MCAR mechanism using coefficient of determination (R2) as evaluation assessment criterion

Figure 4.3 compares KNN, mean substitution and regression substitution to the R2 of the complete original data without missing values. The KNN and mean substitution algorithms exhibit a similar pattern: at small missingness percentages (5% and 10%) their performance was encouraging, but at larger missingness levels (20%, 30% and 40%) it fell relative to the R2 of the complete original data. The regression substitution algorithm shows outstanding performance at all levels of missingness.
At the 5%, 10% and 20% missingness percentages, the regression substitution algorithm gave results almost identical to R², the coefficient of determination of the complete original data. At the 30% and 40% missingness levels, its coefficient of determination even exceeded the original R², indicating how well regression substitution performs.

Table 4.15: Performance of the EM and MICE Algorithms for Treating Missing Values under the MAR Mechanism using the Coefficient of Determination (R²)

Percentage (%)   EM       MICE
5                0.8700   0.8529
10               0.7888   0.8171
20               0.7892   0.7496
30               0.6202   0.6777
40               0.6538   0.6410
TOTAL            3.7215   3.7383

Table 4.15 presents the coefficients of determination of the EM and MICE algorithms for treating missing values under the MAR mechanism. It can be observed that the missing data mechanism and the percentage of missingness greatly influence the performance of the imputation algorithms: as the percentage of missingness increases, the performance of both the EM and MICE algorithms deteriorates rapidly. Both algorithms performed very well at a small missingness percentage (5%) relative to the coefficient of determination of the original complete data (89.83%). It is important to note that the two imputation algorithms performed very poorly under the MAR mechanism, especially at large missingness percentages (above 5%). At a small percentage of missingness, EM and MICE can be used to replace missing values; conversely, it is not prudent to use EM and MICE to impute missing values under the MAR mechanism when the proportion of missingness is large. Figure 4.4 shows the performance of the EM and MICE algorithms under the MAR mechanism using the coefficient of determination (R²) as the assessment criterion.
Figure 4.4: Graph of the EM and MICE algorithms under the MAR mechanism using the coefficient of determination (R²) as the assessment criterion

Figure 4.4 compares the performance of the EM and MICE imputation algorithms against the coefficient of determination (R²) of the complete original data at all levels of missing data rates. It illustrates the pattern in Table 4.15 and is the main reason why neither method can be recommended for replacing missing values at every missingness percentage. The further an algorithm's fitted line departs from R², the poorer the performance of that imputation algorithm; hence the EM and MICE algorithms show unsatisfactory performance and, in general, both algorithms provide inconsistent and biased estimates.

CHAPTER 5

SUMMARY, CONCLUSION AND RECOMMENDATIONS

5.1 Introduction

This chapter provides an abridged version of the outcomes of the investigation and draws conclusions that relate to the objectives of the study. It also makes recommendations and suggestions for further studies in the research area.

5.2 Summary

The presence of missing values is an unavoidable issue in data analysis today. Even when a researcher designs the best questionnaire and employs the most efficient data collection method at the data collection stage, it is still possible for some data points to be incomplete or lost; hence the use of missing data imputation techniques is indispensable. Incomplete observations at the analysis stage produce biased results that lead to inaccurate and inefficient inferences about a population meant to guide stakeholders, decision makers and researchers. According to Horton & Kleinman (2007), data may be missing for many reasons, which they summarised as unit non-response, item non-response and non-coverage.
With unit non-response, also called subject non-response, respondents are included in the sample but fail to provide any information for the items on the questionnaire. With item non-response, subjects in the sample fail to provide all the needed information for the items on the questionnaire; some items may be left unanswered for confidentiality reasons. Finally, with non-coverage, the sample does not represent the population to which the researcher wants to generalize, because some fractions of the target population were not covered.

The main aim of the investigation was to find the best imputation algorithm for treating incomplete values under the assumptions of the various missing data mechanisms. The study grouped and compared imputation algorithms for treating missing data under both the MCAR and MAR mechanism assumptions. Under the MCAR mechanism, the k-nearest-neighbor imputation algorithm, the mean substitution method and the regression substitution method were employed to yield unbiased estimates. The expectation maximization (EM) algorithm and the multiple imputation by chained equations (MICE) algorithm require the missing data to be MAR in order to obtain accurate statistical conclusions and inferences. The average coefficient difference (ACD) of the multiple linear regression (MLR) model, the mean absolute difference (MAD) and the coefficient of determination (R²) were the assessment criteria employed to evaluate the performance of the five imputation algorithms under both the MCAR and MAR mechanisms. The MLR model for the original data set is

Y = 26.5354 + 0.08643X1 + 0.39793X2 + 0.25676X3 − 0.0872X4 − 0.7075X5    (5.1)

where Y = LEB, X1 = CMW, X2 = DEN, X3 = RWS, X4 = IMR and X5 = TFR.

The results of the average coefficient difference (ACD) of the MLR model, the assessment criterion for MCAR missing data, revealed that mean substitution is the worst method.
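The fitted model in Equation (5.1) can be evaluated directly for any set of predictor values. The sketch below (in Python, for illustration) uses hypothetical predictor values; only the coefficients come from Equation (5.1).

```python
# Worked evaluation of Equation (5.1): fitted life expectancy at birth (LEB)
# as a function of the five predictors. Coefficients are from the thesis;
# the example predictor values are hypothetical, chosen only for illustration.

def predict_leb(cmw, den, rws, imr, tfr):
    """Fitted LEB from Equation (5.1)."""
    return (26.5354 + 0.08643 * cmw + 0.39793 * den + 0.25676 * rws
            - 0.0872 * imr - 0.7075 * tfr)

# e.g. CMW=50, DEN=60, RWS=80, IMR=30, TFR=3 (illustrative values only)
print(predict_leb(50, 60, 80, 30, 3))  # about 70.5 years
```

The signs of the coefficients match the narrative: higher infant mortality (IMR) and higher total fertility (TFR) lower the fitted life expectancy, while the other three predictors raise it.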
Regression substitution and mean substitution performed better when the percentage of missing data was small; the analysis revealed that at small to medium missingness percentages (below 20%), mean substitution and regression substitution performed very well, while the k-nearest-neighbor approach provided excellent results at both small and large missingness percentages. Under the MAR mechanism, the ACD results indicated that the expectation maximization (EM) algorithm is the weaker of the two methods: EM performed better only when the percentage of missing data was small (5% and 10%), whereas the multiple imputation by chained equations (MICE) algorithm was the best method and, as the analysis revealed, performed credibly well even when the missingness percentage was large.

Using the results of the mean absolute difference (MAD) under the MCAR mechanism, regression substitution was overall the best method and mean substitution the worst. At all levels of missing data rates (5%, 10%, 20%, 30% and 40%), the performance of regression substitution exceeded both KNN and mean substitution; in this context it is clear that regression substitution performed well at both small and large missingness percentages. With the MAD assessment criterion under the MAR mechanism, the MICE algorithm performed very well compared with the EM algorithm.

Finally, using the coefficient of determination (R²) as the performance evaluation criterion under MCAR, both the KNN algorithm and mean substitution performed relatively well at small missingness percentages, but as the proportion of missing values increased, both provided unsatisfactory results. Regression substitution was in general the best method, performing very well whether the percentage of missing data was small or large.
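The two error-based criteria summarised above can be sketched as follows. The interpretations are assumptions drawn from the criteria's names and their use in this study (ACD averages the absolute gaps between coefficients refitted on imputed data and the original coefficients; MAD averages the absolute gaps between imputed values and the deleted true values), and all numeric inputs other than the Equation (5.1) coefficients are hypothetical.

```python
# Sketches of the ACD and MAD assessment criteria; smaller is better for both.

def acd(original_coefs, imputed_coefs):
    """Average absolute difference between two coefficient vectors."""
    diffs = [abs(o - i) for o, i in zip(original_coefs, imputed_coefs)]
    return sum(diffs) / len(diffs)

def mad(true_values, imputed_values):
    """Mean absolute difference between deleted true values and their imputations."""
    diffs = [abs(t - m) for t, m in zip(true_values, imputed_values)]
    return sum(diffs) / len(diffs)

# Coefficients of the original model (Equation 5.1) versus hypothetical
# coefficients refitted after imputation:
original = [26.5354, 0.08643, 0.39793, 0.25676, -0.0872, -0.7075]
refitted = [25.9000, 0.09000, 0.41000, 0.24000, -0.0900, -0.6800]
print(acd(original, refitted))

# Deleted true values versus hypothetical imputed replacements:
print(mad([70.0, 65.0, 80.0], [69.0, 66.0, 79.5]))  # 2.5/3, about 0.83
```

Ranking the algorithms by these quantities at each missingness level is exactly the comparison performed in the summary above.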
On average, 90.8% of the total variation in life expectancy at birth was explained by the regression substitution model, which is close to the R² explained by the original regression model (89.83%). Finally, under the MAR mechanism, both the EM and MICE algorithms produced unsatisfactory results at large missingness percentages (greater than 5%) compared with the coefficient of determination of the original complete data (89.83%). At a small missingness percentage (5%), the EM algorithm performed credibly well with an R-squared of 87% compared with the original 89.83%. On the whole, the MICE algorithm performed slightly better than the EM algorithm. Therefore, under the MAR mechanism, both EM and MICE can be used to replace missing values when the amount of missing data is small.

Schmitt et al. (2015) pointed out that the most popular imputation methods, such as mean, KNN, SVD and MICE, are not necessarily the most efficient, a conclusion also supported by Celton, Malpertuy, Lelandais and Brevern (2010). The current study shares this conclusion in part: mean substitution and KNN provide unsatisfactory results when data are MCAR. However, when data are MAR, the MICE algorithm produced very good results, contrary to Schmitt et al. (2015) and Celton et al. (2010). According to Lazaro, Gbeha and Kakai (2018), mean substitution provides better accuracy when missing data are MCAR, and Hening (2009) also emphasized that the mean and median methods yielded satisfactory results when comparing different missing data imputation methods. The results of the current study do not support the conclusions of Lazaro et al. (2018) and Hening (2009); rather, mean substitution performed poorly when missing data were MCAR.
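The average R² quoted for the regression substitution model can be reproduced (up to rounding) from the regression substitution column of Table 4.14:

```python
# Average R-squared of the regression substitution model across the five
# missingness levels, from the Reg. Sub. column of Table 4.14.
reg_sub_r2 = [0.8854, 0.8981, 0.8975, 0.9198, 0.9425]
average_r2 = sum(reg_sub_r2) / len(reg_sub_r2)
print(average_r2)  # about 0.9087, i.e. the roughly 90.8% quoted above
```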
The work published by Turrado, Lopez, Lasheras, Gomez, Rolle and Juez (2014) pointed out that the MICE algorithm gives very good outcomes compared with other imputation approaches such as inverse distance weighting and multiple linear regression. Likewise, the study conducted by Porto de Carvalho, Monteiro, Kakai and Assad (2017) revealed that the MICE algorithm provides better estimates of daily precipitation values than geostatistical Kriging and Co-Kriging models. The good performance of the multiple imputation by chained equations (MICE) algorithm found in this study is therefore confirmed by Turrado et al. (2014) and Porto de Carvalho et al. (2017).

5.3 Conclusion

The study compared imputation algorithms under the different missing data mechanisms. It was revealed that under the MCAR mechanism, the ACD of the MLR model produced by the k-nearest-neighbor algorithm is lower than the ACDs produced by regression substitution and the mean substitution method. The MAD of regression substitution is lower than those of mean substitution and KNN, and the coefficient of determination of regression substitution is higher than those of the mean substitution method and the KNN algorithm. Therefore, based on these three performance assessment criteria, it is concluded that regression substitution should be used to impute missing values in the world population data sheet. Thus, the regression substitution method provides a comparatively successful replacement of missing world population data sheet values, which is supported by work published in the literature (Sattari, Joudi & Kusiak, 2016). Although the KNN imputation algorithm performs very well, it is not the best in this study. Also, comparing imputation algorithms under the MAR mechanism assumption, it was observed that the ACD of the MLR model produced by the MICE algorithm is smaller than the ACD produced by the EM algorithm. Besides, the MAD of the MICE algorithm is lower than the MAD of the EM algorithm.
Finally, the analysis clearly revealed that the average coefficient of determination produced by the MICE algorithm is higher than that of the EM algorithm. Based on these three measures, MICE is a highly accurate imputation algorithm for missing values of the world population data sheet and outperforms the EM algorithm in terms of imputation error. The overall conclusion, therefore, is that the multiple imputation by chained equations (MICE) algorithm is superior to the expectation maximization (EM) algorithm, as confirmed by Turrado et al. (2014) and Porto de Carvalho et al. (2017).

5.4 Recommendations

Based on the findings and inferences drawn from the investigation, the following suggestions are made for future research studies.

1. The study suggests that when data are missing completely at random (MCAR) and normally distributed, then among the three compared imputation algorithms, regression substitution is preferred; it is therefore recommended that the regression substitution method be used to replace missing values under the MCAR mechanism. The MICE algorithm was found to be comparatively the best algorithm for replacing missing values under the MAR mechanism, so it is suggested that the MICE algorithm be used to substitute missing data under MAR.

2. On the grounds of this study, it is recommended that before undertaking a missing data imputation, the distribution of the data, the missing data mechanism and the percentage of missing data be examined before choosing the best imputation method.

3. Moreover, since the issue of missing data cannot be avoided in data analysis, it is recommended that all research studies report the reasons which account for missingness, the proportion of incomplete data in the data matrix and the imputation algorithm employed at the analysis stage.

4.
Future studies can be targeted at determining an appropriate imputation algorithm for replacing missing values in cross-sectional World Population Data Sheet data when data are missing not at random (MNAR) and normally distributed. This is essential because much of the literature suggests that comparing imputation methods under the MNAR mechanism is a complex exercise.

5. This study concentrated mainly on missing data imputation in a cross-sectional dataset; it is therefore recommended that categorical and longitudinal studies also be considered.

REFERENCES

Acock, A. C. (2005). Working with missing values. Journal of Marriage and Family, 67, 1012–1028.

Allison, P. D. (2001). Missing data. In Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-136. Sage, Thousand Oaks, CA.

Allison, P. D. (2002). Missing data. Thousand Oaks, CA: Sage Publications.

Anghelache, C., & Scala, C. (2016). Multiple regression used to analyse the correlation between GDP and some variables. Romanian Statistical Review Supplement, No. 10, 79–85.

Azur, M. J., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2012). Multiple imputation by chained equations: What is it and how does it work?

Batista, G. E. A. P. A., & Monard, M. C. (2001). A study of k-nearest neighbour as a model-based method to treat missing data. In Proceedings of the Argentine Symposium on Artificial Intelligence, Buenos Aires, Argentina, vol. 30, pp. 1–9.

Batista, G. E. A. P. A., & Monard, M. C. (2003). An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence, 17, 519–533.

Bennett, D. A. (2001). How can I deal with missing data in my study? Australian and New Zealand Journal of Public Health, 25, 464–469.

Biorn, E. (2013). Introductory Econometrics, Department of Economics, ECON3150/4150.

Brown, R. L. (1994).
Efficacy of the indirect approach for estimating structural equation models with missing data: A comparison of five methods. Structural Equation Modeling, 1, 287–316.

Carpenter, J. R., & Kenward, M. G. (2013). Multiple Imputation and its Application. Chichester, West Sussex: John Wiley & Sons.

Celton, M., Malpertuy, A., Lelandais, G., & Brevern, A. (2010). Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments.

Cole, J. C. (2008). How to deal with missing data. In J. W. Osborne (Ed.), Best practices in quantitative methods. Thousand Oaks, CA: Sage, pp. 214–238.

Coelho-Barros, E. A., Simoes, P. A., Achcar, J. A., Martinez, E. Z., & Shimano, A. C. (2008). Methods of estimation in multiple linear regression: Application to clinical data. Revista Colombiana de Estadistica, 31(1), 111–129.

Chai, T., & Draxler, R. R. (2014). Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geoscientific Model Development Discussions.

Day, S. (1999). Dictionary for clinical trials. New York: John Wiley and Sons.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1), 1–38.

Enders, C. K. (2001). The performance of the full information maximum likelihood estimator in multiple regression models with missing data. Educational and Psychological Measurement, 61, 713–740.

Fogarty, D. J. (2008). Multiple imputation as a missing data approach to reject inference on consumer credit scoring. http://interstat.statjournals.net/YEAR/2006/articles/0609001.pdf

Golan, A. (2002). Information and entropy econometrics (special issue). Journal of Econometrics, 107(1-2).

Graham, J. W., & Hofer, S. M. (2000). Multiple imputation in multivariate research. In T. D.
Little, K. U.

Hair, J., Black, W., Babin, B., Anderson, R., & Tatham, R. (2006). Multivariate data analysis (6th ed.). Pearson Education, Inc.

Hening, A. D. (2009). Missing data imputation method comparison in Ohio University student retention.

Horton, N. J., & Kleinman, K. P. (2007). Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models. The American Statistician, 61, 79–90.

Honaker, J., King, G., & Blackwell, M. (2015). Amelia II: A program for missing data. Version 1.7.4.

Howard, E., & Gordoh, G. (2005). Statistical methods.

Huang, R., & Carriere, K. C. (2006). Comparison of methods for incomplete repeated measures data analysis in small samples. Journal of Statistical Planning and Inference, 136, 235–247.

Kim, K., & Bentler, P. (2002). Tests of homogeneity of means and covariance matrices for multivariate incomplete data. Psychometrika, 67(4), 609–623.

Lazaro, M., Gbeha, M., & Kakai, R. (2018). Influence of missing value imputations on the performance of canonical correspondence analysis: Ecological applications.

Lin, J., & Bentler, P. M. (2002). Probability based test for missing completely at random data patterns.

Little, R. J. A. (1988). A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association, 83, 1198–1202.

Little, R. J. A., & Rubin, D. B. (2002). Statistical Analysis with Missing Data (2nd ed.). Hoboken, NJ: John Wiley & Sons.

Little, R. J. A., & Rubin, D. B. (1987). Statistical Analysis with Missing Data. New York: John Wiley.

Liu, Y., & Brown, S. D. (2013). Comparison of five iterative imputation methods for multivariate classification. Chemometrics and Intelligent Laboratory Systems, 120, 106–115.

McDonald, R. A., Thurston, P. W., & Nelson, M. R. (2000). A Monte Carlo study of missing item methods. Organizational Research Methods, 3, 71–92.

McKnight, P. E., McKnight, K. M., Sidani, S., & Figueredo, A. J. (2007).
Missing Data: A Gentle Introduction. Guilford Press.

Meng, Z. Q., & Shi, Z. Z. (2012). Extended rough set-based attribute reduction in inconsistent incomplete decision systems. Information Sciences, 204, 44–69.

Morais, S. F. (2013). Dealing with missing data: An application in the study of family history of hypertension. Master's dissertation, Faculty of Medicine of the University of Porto.

Nelwamondo, F. V., Mohamed, S., & Marwala, T. (2007). Missing data: Artificial neural network and expectation maximization techniques. Current Science, 93(11), 1514–1521.

Pigott, T. D. (2001). A review of methods for missing data. Educational Research and Evaluation, 7, 353–383.

Pollard, J. H. (1988). On the decomposition of changes in expectation of life and differentials in life expectancy. Demography, 25(2), 265–276.

Population Reference Bureau. (2011). World Population Data Sheet. Washington, D.C., U.S.A.

Population Reference Bureau. (2013). World Population Data Sheet. Washington, D.C., U.S.A.

Porto de Carvalho, J. R., Monteiro, J. E. B. A., Kakai, A. M., & Assad, E. D. (2017). Model for multiple imputation to estimate daily rainfall data and filling of faults.

Raaijmaken, Q. A. W. (1999). Effectiveness of different missing data treatments in surveys with Likert-type data: Introducing the relative mean substitution approach. Educational and Psychological Measurement, 59, 725–748.

Raghunathan, T. W., Lepkowski, J. M., Van Hoewyk, J., & Solenberger, P. (2001). A multivariate technique for multiply imputing missing values using a sequence of regression models. 27, 85–95.

Rahman, M. G., & Islam, M. Z. (2011). A decision tree-based missing value imputation technique for data pre-processing.

Revicki, D. A., Karen, G., Buckman, D., Chan, K., Kallich, J. D., & Woolley, M. J. (2001).
Imputing physical health status scores missing owing to mortality: Results of a simulation comparing multiple techniques. Medical Care, 39(1), 61–71.

Roth, P. L. (1994). Missing data: A conceptual review for applied psychologists. Personnel Psychology, 47, 537–570.

Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434), 473–489.

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.

Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley.

Rubin, D., & Thayer, D. (1982). EM algorithms for ML factor analysis.

SAS Institute Inc. (2005). The SAS System, Version 9.3. SAS Institute Inc., Cary, NC. http://www.sas.com/

Sattari, M. T., Joudi, A. R., & Kusiak, A. (2016). Assessment of different methods for estimation of missing data in precipitation studies.

Savage, N. H., Agnew, P., Davis, L. S., Ordonez, C., Johnson, C. E., O'Connor, F. M., & Dalvi, M. (2013). Air quality modelling using the Met Office Unified Model. Geoscientific Model Development, 6, 353–372.

Savalei, V., & Bentler, P. M. (2009). A two-stage approach to missing data: Theory and application to auxiliary variables. Structural Equation Modeling, 16, 477–497.

Schafer, J. L. (1997). Analysis of Incomplete Multivariate Data. Monographs on Statistics and Applied Probability, No. 72. Chapman and Hall, London.

Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147–177.

Schmitt, P., Mandel, J., & Guedj, M. (2015). A comparison of six methods for missing data imputation. Journal of Biometrics and Biostatistics, 6(1), 1–6.

Schlomer, C. L., Buaman, S., & Card, N. A. (2010). Best practices for missing data management in counseling psychology. 57(1), 1–10.

Streiner, D. L. (2002). The case of the missing data: Methods of dealing with dropouts and other research vagaries.
Canadian Journal of Psychiatry, 47, 68–75.

Susianto, Y., Notodiputro, K. A., Kurnia, A., & Wijayanto, H. (2017). A comparative study of imputation methods for missing values of per capita expenditure in Central Java.

Turrado, C. C., Lopez, M. C. M., Lasheras, F. S., Gomez, B. A. R., Rolle, J. L. C., & Juez, F. J. C. (2014). Missing data imputation of solar radiation data under different atmospheric conditions.

Twala, B. (2005). Effective Techniques for Handling Incomplete Data Using Decision Trees. Unpublished PhD thesis, Open University, Milton Keynes, UK.

Twala, B., Cartwright, M., & Shepperd, M. (2005). Comparison of various methods for handling incomplete data in software engineering databases. 4th International Symposium on Empirical Software Engineering, Noosa Heads, Australia, November 2005.

Thijs, H., Molenberghs, G., Micheiels, B., Verbeke, G., & Curran, D. (2002). Strategies to fit pattern-mixture models. Biostatistics, 3(2), 245–265.

Van Buuren, S. (2007). Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16(3), 219–242.

Van Buuren, S., & Groothuis-Oudshoorn, K. (2011). Multivariate imputation by chained equations (mice) in R. Journal of Statistical Software, 45, 1–67.

Van Buuren, S. (2012). Flexible Imputation of Missing Data. Chapman & Hall/CRC: London, UK, p. 110.

Willmott, C. J., Matsuura, K., & Robeson, S. M. (2009). Ambiguities inherent in sums-of-squares-based error statistics. Atmospheric Environment, 43, 749–752.

Yuan, K. H., & Bentler, P. M. (2000). Three likelihood-based methods for mean and covariance structure analysis with non-normal missing data. In M. Becker & M. Sobel (Eds.), Sociological Methodology (pp. 165–200).

Yuan, Y. (2000). Multiple imputation for missing data: Concepts and new developments. Rockville, MD: SAS Institute, 267–275.

Appendix

Appendix I.
R Codes used in this study

### Regression of original data ###
oscar <- read.csv(file.choose(), header=T)
oscar
attach(oscar)
names(oscar)
cor(oscar)
pairs(oscar[,-6])
model <- lm(y~x1+x2+x3+x4+x5, data=oscar)
par(mfrow=c(1,2))
qqnorm(model$residuals)
qqline(model$residuals)
plot(model$fitted, model$residuals, xlab="Fitted", ylab="Residuals", main="Time")
abline(h=0)
shapiro.test(model$residuals)
ncvTest(model)

### Missing ###
oscar <- read.csv(file.choose(), header=T)
oscar
attach(oscar)
names(oscar)
cor(oscar)

### MCAR ###
### MCAR 5% ###
prop.m = .05  # 5% missingness
mcar1 = runif(106, min=0, max=1)
mcar2 = runif(106, min=0, max=1)
mcar3 = runif(106, min=0, max=1)
mcar4 = runif(106, min=0, max=1)
mcar5 = runif(106, min=0, max=1)
x2 = ifelse(mcar1 < prop.m, NA, oscar$x2)

### MAR ###
## 5% ##
xx2.mar = ifelse(mam2 > min(mmm2), NA, oscar$x2)
xx2.mar
mam4 <- 1-logistic(oscar$x4)
mmm4 <- tail(sort.int(oscar$x4, partial=length(oscar$x4) - 4), 5)
mam4
min(mmm4)
xx4.mar = ifelse(oscar$x4 > min(mmm4), NA, oscar$x4)
xx4.mar
mmm5 <- tail(sort.int(oscar$x5, partial=length(oscar$x5) - 4), 5)
mmm5
min(mmm5)
xx5.mar = ifelse(oscar$x5 > 93, NA, oscar$x5)
xx5.mar
mam6 <- 1-logistic(oscar$x6)
mmm6 <- tail(sort.int(mam6, partial=length(mam6) - 4), 5)
mmm6
min(mmm6)
xx6.mar = ifelse(mam6 > 1.026188e-10, NA, oscar$x6)
xx6.mar
mam7 <- 1-logistic(oscar$x7)
mmm7 <- tail(sort.int(mam7, partial=length(mam6) - 4), 5)
mmm7
min(mmm7)
xx7.mar = ifelse(mam7 > 0.2689414, NA, oscar$x7)
xx7.mar
View(cbind(xx2.mar, xx4.mar, xx5.mar, xx6.mar, xx7.mar))
datama05 <- data.frame(cbind(xx2.mar, xx4.mar, xx5.mar, xx6.mar, xx7.mar))
names(datama05)
attach(datama05)
str(datama05)
library(BaylorEdPsych)
library(mvnmle)
LittleMCAR(datama05)
write.csv(datama05, "C:/Users/Desktop/datama05.csv")

### MAR ###
## 10% ##
logistic <- function(x) exp(x)/(1+exp(x))
mam210 <- 1-logistic(oscar$x2)
mmm210 <- tail(sort.int(mam210, partial=length(mam210)-4), 10)
min(mmm210)
xx210.mar = ifelse(mam210 > min(mmm210), NA, oscar$x2)
xx210.mar
mam410 <- 1-logistic(oscar$x4)
mmm410 <- tail(sort.int(oscar$x4, partial=length(oscar$x4)-4), 10)
mam410
min(mmm410)
xx410.mar = ifelse(oscar$x4 > min(mmm410), NA, oscar$x4)
xx410.mar
mmm510 <- tail(sort.int(oscar$x5, partial=length(oscar$x5)-4), 10)
mmm510
min(mmm510)
xx510.mar = ifelse(oscar$x5 > 89, NA, oscar$x5)
xx510.mar
mam610 <- 1-logistic(oscar$x6)
mmm610 <- tail(sort.int(mam6, partial=length(mam6)-4), 10)
mmm6
min(mmm610)
xx610.mar = ifelse(mam610 > min(mmm610), NA, oscar$x6)
xx610.mar
mam710 <- 1-logistic(oscar$x7)
mmm710 <- tail(sort.int(mam7, partial=length(mam6)-4), 10)
mmm710
min(mmm710)
xx710.mar = ifelse(mam710 > min(mmm610), NA, oscar$x7)
xx710.mar
View(cbind(xx210.mar, xx410.mar, xx510.mar, xx610.mar, xx710.mar))
datama10 <- data.frame(cbind(xx210.mar, xx410.mar, xx510.mar, xx610.mar, xx710.mar))
names(datama10)
attach(datama10)
str(datama10)
library(BaylorEdPsych)
library(mvnmle)
LittleMCAR(datama10)
write.csv(datama10, "C:/Users/Desktop/datama10.csv")

### MAR ###
## 20% ##
logistic <- function(x) exp(x)/(1+exp(x))
mam220 <- 1-logistic(oscar$x2)
mmm220 <- tail(sort.int(oscar$x2, partial=length(oscar$x2)-4), 20)
min(mmm220)
xx220.mar = ifelse(oscar$x2 > 65, NA, oscar$x2)
xx220.mar
mam420 <- 1-logistic(oscar$x4)
mmm420 <- tail(sort.int(oscar$x4, partial=length(oscar$x4)-4), 20)
mam420
min(mmm420)
xx420.mar = ifelse(oscar$x4 <= min(mmm420), NA, oscar$x4)
xx420.mar
mmm520 <- tail(sort.int(oscar$x5, partial=length(oscar$x5)-4), 20)
min(mmm520)
xx520.mar = ifelse(oscar$x5 < min(mmm520), NA, oscar$x5)
xx520.mar
mam620 <- 1-logistic(oscar$x6)
mmm620 <- tail(sort.int(oscar$x6, partial=length(oscar$x6)-4), 20)
mmm620
min(mmm620)
xx620.mar = ifelse(oscar$x6 > min(mmm620), NA, oscar$x6)
xx620.mar
mam720 <- 1-logistic(oscar$x7)
mmm720 <- tail(sort.int(mam720, partial=length(mam720)-4), 20)
mmm720
min(mmm720)
xx720.mar = ifelse(mam720 < min(mmm720), NA, oscar$x7)
xx720.mar
View(cbind(xx220.mar, xx420.mar, xx520.mar, xx620.mar, xx720.mar))
datama20 <- data.frame(cbind(xx220.mar, xx420.mar, xx520.mar, xx620.mar, xx720.mar))
names(datama20)
attach(datama20)
str(datama20)
library(BaylorEdPsych)
library(mvnmle)
LittleMCAR(datama20)
write.csv(datama20, "C:/Users/Desktop/datama20.csv")

### MAR ###
## 30% ##
logistic <- function(x) exp(x)/(1+exp(x))
mam230 <- 1-logistic(oscar$x2)
mmm230 <- tail(sort.int(oscar$x2, partial=length(oscar$x2)-4), 30)
min(mmm230)
xx230.mar = ifelse(oscar$x2 > 58, NA, oscar$x2)
xx230.mar
mam430 <- 1-logistic(oscar$x4)
mmm430 <- tail(sort.int(oscar$x4, partial=length(oscar$x4)-4), 30)
mam430
min(mmm430)
xx430.mar = ifelse(oscar$x4 <= min(mmm430), NA, oscar$x4)
xx430.mar
mmm530 <- tail(sort.int(oscar$x5, partial=length(oscar$x5)-4), 30)
mmm530
min(mmm530)
xx530.mar = ifelse(oscar$x5 < min(mmm530), NA, oscar$x5)
xx530.mar
mam630 <- 1-logistic(oscar$x6)
mmm630 <- tail(sort.int(oscar$x6, partial=length(oscar$x6)-4), 30)
mmm630
min(mmm630)
xx630.mar = ifelse(oscar$x6 > min(mmm630), NA, oscar$x6)
xx630.mar
mam730 <- 1-logistic(oscar$x7)
mmm730 <- tail(sort.int(mam730, partial=length(mam720)-4), 30)
mmm730
w = sort(mam730)
min(mmm730)
xx730.mar = ifelse(mam730 <= , NA, oscar$x7)
xx730.mar
View(cbind(xx230.mar, xx430.mar, xx530.mar, xx630.mar, xx730.mar))
datama30 <- data.frame(cbind(xx230.mar, xx430.mar, xx530.mar, xx630.mar, xx730.mar))
names(datama30)
attach(datama30)
str(datama30)
library(BaylorEdPsych)
library(mvnmle)
LittleMCAR(datama30)
write.csv(datama30, "C:/Users/Desktop/datama30.csv")

### MAR ###
## 40% ##
logistic <- function(x) exp(x)/(1+exp(x))
mam240 <- 1-logistic(oscar$x2)
mmm240 <- tail(sort.int(oscar$x2, partial=length(oscar$x2)-4), 40)
min(mmm240)
xx240.mar = ifelse(oscar$x2 <= min(mmm240), NA, oscar$x2)
xx240.mar
mmm440 <- tail(sort.int(oscar$x4, partial=length(oscar$x4)-4), 40)
mam440
min(mmm420)
xx440.mar = ifelse(oscar$x4 > min(mmm420), NA, oscar$x4)
xx440.mar
mmm540 <- tail(sort.int(oscar$x5, partial=length(oscar$x5)-4), 40)
mmm540
min(mmm540)
xx540.mar = ifelse(oscar$x5 < min(mmm540), NA, oscar$x5)
xx540.mar
mam640 <- 1-logistic(oscar$x6)
mmm640 <- tail(sort.int(oscar$x6, partial=length(oscar$x6)-4), 40)
xx640.mar = ifelse(oscar$x6 > min(mmm640), NA, oscar$x6)
xx640.mar
Y <- sort(mam740)
mam740 <- 1-logistic(oscar$x7)
mmm740 <- tail(sort.int(mam720, partial=length(mam720)-4), 20)
mmm740
min(mmm720)
xx740.mar = ifelse(mam720 > 0.0474258732, NA, oscar$x7)
xx740.mar
View(cbind(xx240.mar, xx440.mar, xx540.mar, xx640.mar, xx740.mar))
datama40 <- data.frame(cbind(xx240.mar, xx440.mar, xx540.mar, xx640.mar, xx740.mar))
names(datama40)
attach(datama40)
str(datama40)
library(BaylorEdPsych)
library(mvnmle)
LittleMCAR(datama40)
write.csv(datama40, "C:/Users/Desktop/datama40.csv")

Appendix II - KNN Imputation

Table 1: KNN IMPUTATION AT 5%

Variable   Estimate     Std. Error   t-value   p-value
Constant   24.41543     4.458885      5.48     0.0000
CMW         0.0677847   0.0343056     1.98     0.0510
DEN         0.4524383   0.0531727     8.51     0.0000
RWS         0.2452794   0.0460424     5.33     0.0000
IMR        -0.0887447   0.0099929    -8.88     0.0000
TFR        -0.6127951   0.2278845    -2.69     0.0080

R² = 0.8644; F(5,100) = 127.45; p-value of F-statistic = 0.000

Table 2: KNN REGRESSION AT 10%

Variable   Estimate     Std. Error   t-value   p-value
Constant   25.51724     4.771705      5.35     0.0000
CMW         0.0746098   0.0379418     1.97     0.0520
DEN         0.4528977   0.0551922     8.21     0.0000
RWS         0.2387563   0.0473055     5.05     0.0000
IMR        -0.0940196   0.0105682    -8.90     0.0000
TFR        -0.7596007   0.2532781    -3.00     0.0030

R² = 0.8575; F(5,100) = 120.39; p-value of F-statistic = 0.000

Table 3: KNN IMPUTATION AT 20%

Variable   Estimate     Std. Error   t-value   p-value
Constant   17.60174     4.974587      3.54     0.0010
CMW         0.0302327   0.0417292     0.72     0.4700
DEN         0.5403715   0.0611663     8.83     0.0000
RWS         0.2767879   0.0475857     5.82     0.0000
IMR        -0.0854882   0.0112150    -7.62     0.0000
TFR        -0.6334724   0.2878913    -2.20     0.0300

R² = 0.8449; F(5,100) = 108.99; p-value of F-statistic = 0.000

Table 4: KNN IMPUTATION AT 30%

Variable   Estimate     Std. Error   t-value   p-value
Constant   18.57368     5.190420      3.58     0.0010
CMW         0.0422071   0.0433263     0.97     0.3320
DEN         0.5188958   0.0625237     8.30     0.0000
RWS         0.3031963   0.0511440     5.93     0.0000
IMR        -0.1015226   0.0122141    -8.31     0.0000
TFR        -0.6903953   0.3166767    -2.18     0.0320

R² = 0.8490; F(5,100) = 112.49; p-value of F-statistic = 0.000

Table 5: KNN IMPUTATION AT 40%

Variable   Estimate     Std. Error   t-value   p-value
Constant   24.09558     4.747458      5.08     0.0000
CMW         0.0820895   0.0361516     2.27     0.0250
DEN         0.4538946   0.0547902     8.28     0.0000
RWS         0.2392104   0.0484447     4.94     0.0000
IMR        -0.0896554   0.0105635    -8.49     0.0000
TFR        -0.6408525   0.2441600    -2.62     0.0100

R² = 0.8457; F(5,100) = 109.62; p-value of F-statistic = 0.000

Appendix III - Mean Imputation

Table 6: MEAN IMPUTATION AT 5%

Variable   Estimate     Std. Error   t-value   p-value
Constant   20.75863     4.996672      4.15     0.0000
CMW         0.0904376   0.0394639     2.29     0.0240
DEN         0.4754724   0.0561035     8.47     0.0000
RWS         0.2591374   0.0512571     5.06     0.0000
IMR        -0.0874646   0.0114698    -7.63     0.0000
TFR        -0.6865630   0.2733041    -2.51     0.0140

R² = 0.8226; F(5,100) = 92.73; p-value of F-statistic = 0.000

Table 7: MEAN IMPUTATION AT 10%

Variable   Estimate     Std. Error   t-value   p-value
Constant   11.18575     5.274203      2.12     0.0360
CMW         0.0461124   0.0448393     1.03     0.3060
DEN         0.5577717   0.0636267     8.77     0.0000
RWS         0.3224677   0.0510945     6.31     0.0000
IMR        -0.0785738   0.0125041    -6.28     0.0000
TFR        -0.5867620   0.3166016    -1.85     0.0670

R² = 0.7944; F(5,100) = 77.28; p-value of F-statistic = 0.000

Table 8: MEAN IMPUTATION AT 20%

Variable   Estimate     Std. Error   t-value   p-value
Constant    8.988445    5.746310      1.56     0.1210
CMW         0.0667237   0.0471772     1.41     0.1600
DEN         0.5351353   0.0669713     7.99     0.0000
RWS         0.3652465   0.0548680     6.66     0.0000
IMR        -0.0880375   0.0136471    -6.45     0.0000
TFR        -0.4457335   0.3566515    -1.25     0.2140

R² = 0.7848; F(5,100) = 72.95; p-value of F-statistic = 0.000

Table 9: MEAN IMPUTATION AT 30%

Variable   Estimate     Std. Error   t-value   p-value
Constant   13.07426     6.731646      1.94     0.0550
CMW         0.0288930   0.0526946     0.55     0.5850
DEN         0.6370891   0.0692460     9.20     0.0000
RWS         0.2870867   0.0703405     4.08     0.0000
IMR        -0.0995959   0.0162369    -6.13     0.0000
TFR        -1.0344210   0.3313135    -3.12     0.0020

R² = 0.7477; F(5,100) = 59.28; p-value of F-statistic = 0.000

Table 10: MEAN IMPUTATION AT 40%

Variable   Estimate     Std. Error   t-value   p-value
Constant   35.67183     4.569475      7.81     0.0000
CMW         0.0131438   0.0276253     0.48     0.6350
DEN         0.3033493   0.0446717     6.79     0.0000
RWS         0.2741616   0.0432158     6.34     0.0000
IMR        -0.1174565   0.0106542   -11.02     0.0000
TFR        -0.1122018   0.1846629    -0.61     0.5450

R² = 0.9425; F(5,100) = 327.79; p-value of F-statistic = 0.000

Appendix IV - Regression Imputation

Table 11: REGRESSION IMPUTATION AT 5%

Variable   Estimate     Std. Error   t-value   p-value
Constant   26.15660     4.152190      6.30     0.0000
CMW         0.0781062   0.0314003     2.49     0.0150
DEN         0.4094124   0.0498192     8.22     0.0000
RWS         0.2534665   0.0420098     6.03     0.0000
IMR        -0.0881772   0.0092606    -9.52     0.0000
TFR        -0.5992677   0.2083631    -2.88     0.0050

R² = 0.8854; F(5,100) = 154.47; p-value of F-statistic = 0.000

Table 12: REGRESSION IMPUTATION AT 10%

Variable   Estimate     Std. Error   t-value   p-value
Constant   24.55656     3.823012      6.42     0.0000
CMW         0.0729499   0.0293616     2.48     0.0150
DEN         0.4549499   0.0487914     9.32     0.0000
RWS         0.2429570   0.0385584     6.30     0.0000
IMR        -0.0828031   0.0092185    -8.98     0.0000
TFR        -0.7533552   0.1877110    -4.01     0.0000

R² = 0.8981; F(5,100) = 176.33; p-value of F-statistic = 0.000

Table 13: REGRESSION IMPUTATION AT 20%

Variable   Estimate     Std. Error   t-value   p-value
Constant   21.71700     3.897818      5.57     0.0000
CMW         0.0609620   0.0322603     1.89     0.0620
DEN         0.4894645   0.0494193     9.90     0.0000
RWS         0.2482765   0.0387981     6.40     0.0000
IMR        -0.0833678   0.0090782    -9.18     0.0000
TFR        -0.6500357   0.2229820    -2.92     0.0040

R² = 0.8975; F(5,100) = 175.11; p-value of F-statistic = 0.000

Table 14: REGRESSION IMPUTATION AT 30%

Variable   Estimate     Std. Error   t-value   p-value
Constant   23.78943     3.929865      6.05     0.0000
CMW         0.0747011   0.0294856     2.53     0.0130
DEN         0.4150130   0.0473078     8.77     0.0000
RWS         0.2745521   0.0382235     7.18     0.0000
IMR        -0.0871608   0.0090673    -9.61     0.0000
TFR        -0.4174209   0.2166326    -1.93     0.0570

R² = 0.9198; F(5,100) = 229.26; p-value of F-statistic = 0.000

Table 15: REGRESSION IMPUTATION AT 40%

Variable   Estimate   Std.
Error t-value p-value Constant 35.67183 4.569475 7.81 0.0000 CMW 0.0131438 0.0276253 0.48 0.6350 DEN 0.3033493 0.0446717 6.79 0.0000 RWS 0.2741616 0.0432158 6.34 0.0000 IMR -0.1174565 0.0106542 -11.02 0.0000 TFR -0.1122018 0.1846629 -0.61 0.5450 R2= 0.9425; F(5,100)=327.79; P-value of F-statistic=0.000 Appendix VI–EM IMPUTATION Table 16: EM IMPUTATION AT 5% Variable Estimate Std. Error t-value p-value Constant 27.27421 4.209365 6.48 0.0000 CMW 0.0732485 0.0334635 2.19 0.0310 DEN 0.3740375 0.0533309 7.01 0.0000 RWS 0.2774389 0.0411899 6.74 0.0000 IMR -0.0892841 0.009394 -9.5 0.0000 TFR -0.6294586 0.1993551 -3.16 0.0020 R2= 0.8700; F(5,100)=133.79; P-value of F-statistic=0.000 Table 17: EM IMPUTATION AT 10% Variable Estimate Std. Error t-value p-value Constant 31.02564 6.102412 5.08 0.0000 CMW 0.0968179 0.0435426 2.22 0.0280 DEN 0.3539075 0.0720339 4.91 0.0000 RWS 0.2268911 0.0526043 4.31 0.0000 IMR -0.0840243 0.0139266 -6.03 0.0000 TFR -0.6365357 0.2584398 -2.46 0.0150 R2= 0.7883; F(5,100)=74.46; P-value of F-statistic=0.000 97 University of Ghana http://ugspace.ug.edu.gh Table 18: EM IMPUTATION AT 20% Variable Estimate Std. Error t-value p-value Constant 30.30845 4.769065 6.36 0.0000 CMW 0.0559433 0.0454289 1.23 0.2210 DEN 0.4583636 0.0799098 5.74 0.0000 RWS 0.1630587 0.0503629 3.24 0.0020 IMR -0.0685205 0.012509 -5.48 0.0000 TFR -0.9983117 0.2582418 -3.87 0.0000 R2= 0.7892; F(5,100)=74.86; P-value of F-statistic=0.000 Table 19: EM IMPUTATION AT 30% Variable Estimate Std. Error t-value p-value Constant 23.21483 7.721299 3.01 0.0030 CMW 0.0018959 0.056835 0.03 0.9730 DEN 0.3544012 0.0864141 4.1 0.0000 RWS 0.3159736 0.0802667 3.94 0.0000 IMR -0.044099 0.0179196 -2.46 0.0160 TFR -0.0692912 0.3427523 -0.2 0.8400 R2= 0.6202; F(5,100)=32.66; P-value of F-statistic=0.000 Table 20: EM IMPUTATION AT 40% Variable Estimate Std. 
Error t-value p-value Constant 32.86259 7.803779 4.21 0.0000 CMW 0.1085613 0.0587219 1.85 0.0670 DEN 0.1500739 0.0929605 1.61 0.1100 RWS 0.3603069 0.0875461 4.12 0.0000 IMR -0.0845097 0.0184588 -4.58 0.0000 TFR -0.3423179 0.497264 -0.69 0.4930 R2= 0.6538; F(5,100)=37.77; P-value of F-statistic=0.000 98 University of Ghana http://ugspace.ug.edu.gh Appendix VII - MICE IMPUTATION Table 21: MICE IMPUTATION AT 5% Variable Estimate Std. Error t-value p-value Constant 24.637 4.44824 5.54 0.0000 CMW 0.06886 0.03432 2.01 0.0480 DEN 0.43458 0.05623 7.73 0.0000 RWS 0.24994 0.04431 5.64 0.0000 IMR -0.0844 0.01033 -8.17 0.0000 TFR -0.5181 0.21508 -2.41 0.0180 R2= 0.8529; F(5,100)=115.99; P-value of F-statistic=0.000 Table 22: MICE IMPUTATION AT 10% Variable Estimate Std. Error t-value p-value Constant 30.4423 5.51822 5.52 0.0000 CMW 0.07884 0.04121 1.91 0.0590 DEN 0.3301 0.06821 4.84 0.0000 RWS 0.2637 0.04996 5.28 0.0000 IMR -0.0925 0.01341 -6.9 0.0000 TFR -0.3801 0.23723 -1.6 0.1120 R2= 0.8171; F(5,100)=89.34; P-value of F-statistic=0.000 Table 23: MICE IMPUTATION AT 20% Variable Estimate Std. Error t-value p-value Constant 24.7085 5.6094 4.4 0.0000 CMW 0.04394 0.04359 1.01 0.3160 DEN 0.35512 0.06898 5.15 0.0000 RWS 0.33901 0.05088 6.66 0.0000 IMR -0.072 0.01311 -5.49 0.0000 TFR -0.8652 0.26522 -3.26 0.0020 R2= 0.7496; F(5,100)=59.87; P-value of F-statistic=0.000 99 University of Ghana http://ugspace.ug.edu.gh Table 24: MICE IMPUTATION AT 30% Variable Estimate Std. Error t-value p-value Constant 23.1781 6.4062 3.62 0.0000 CMW 0.04657 0.0506 0.92 0.3600 DEN 0.35153 0.07774 4.52 0.0000 RWS 0.31924 0.06704 4.76 0.0000 IMR -0.0561 0.01547 -3.62 0.0000 TFR -0.4029 0.35025 -1.15 0.2530 R2= 0.6777; F(5,100)=42.04; P-value of F-statistic=0.000 Table 25: MICE IMPUTATION AT 40% Variable Estimate Std. 
Error t-value p-value Constant 34.3272 7.99457 4.29 0.0000 CMW 0.07746 0.0521 1.49 0.1400 DEN 0.21235 0.07171 2.96 0.0040 RWS 0.30229 0.05956 5.08 0.0000 IMR -0.0845 0.01983 -4.26 0.0000 TFR -0.2824 0.34182 -0.83 0.4110 R2= 0.6410; F(5,100)=35.71; P-value of F-statistic=0.000 100 University of Ghana http://ugspace.ug.edu.gh Appendix VIII-World Population Data Sheet Table 26: The world population data sheet, 2011 Country Level Y X1 X2 X3 X4 X5 Jordan M 73 43 67 83 58 4 Syria H 74 43 68 84 54 3 Yemen L 65 38 60 75 93 5 Bangladesh M 69 40 64 79 75 2 Bhutan M 69 40 64 79 75 3 India L 64 38 59 73 97 3 Kazakhstan M 69 40 64 79 75 3 Kyrgyzstan M 85 50 78 96 5 3 Maldives M 54 32 50 63 141 2 Nepals M 66 39 61 76 89 3 Pakistan L 77 45 71 88 40 4 Sri Lanka H 72 42 66 82 62 2 Tajikistan M 61 36 57 70 100 3 Uzbekistan M 76 44 70 87 45 3 Cambodia L 62 36 58 71 96 3 Indonesia M 71 42 66 81 67 2 Loas L 65 38 60 75 93 4 Phiilippines M 68 40 63 78 80 3 Thailand H 74 43 68 84 54 2 Timor-Leste L 62 36 58 71 56 6 Vietman M 73 43 67 83 58 2 China H 74 43 68 84 54 2 Mongolia M 74 43 68 84 54 3 Estonia H 76 44 70 87 45 2 Latvia M 65 38 60 757 93 1 Lesotho L 58 34 54 67 124 3 South Africa L 67 39 62 77 84 2 Swaziland L 83 48 84 94 14 4 Belize H 78 46 80 89 36 3 Costa Rica H 55 32 57 64 137 2 El Salvador M 71 42 737 81 100 2 Guatemala M 51 30 53 59 45 4 Honduras M 59 35 61 68 96 3 Mexico H 62 36 64 71 67 2 101 University of Ghana http://ugspace.ug.edu.gh Table 27: The world population data sheet, 2011 Country Level Y X1 X2 X3 X4 X5 Nicaragua H 78 46 80 89 93 6 Dominican Republic M 61 36 63 70 80 3 Jamaica M 56 50 58 65 133 2 Argentina H 44 39 46 52 185 2 Bolivia H 72 64 74 82 62 3 Brazil H 73 65 75 83 58 2 Colombia M 77 68 79 88 40 2 Ecuador H 66 59 68 76 89 3 Guyana M 58 52 60 67 124 3 Paraguay M 71 63 50 81 67 3 Peru H 64 57 61 73 97 3 Suriname M 67 59 71 77 84 2 Uruguay H 73 65 66 83 58 2 Armenia M 67 59 57 77 84 2 Azerbaijan H 85 75 70 96 5 2 Georgia H 71 63 58 81 67 2 Iraq M 53 47 66 
62 146 5 Aigeria M 66 59 60 76 89 2 Egypt M 75 66 63 86 49 3 Morocco M 64 57 68 74 97 2 Tunisia H 63 56 58 73 102 2 Benin L 65 58 67 75 93 5 Burkina Faso L 80 71 68 91 27 6 Cape Verde H 57 51 68 66 128 3 Cote Divoire L 52 46 70 60 150 5 Gambia L 59 52 61 68 119 5 Ghana L 64 57 66 74 97 4 Guinea L 54 48 56 75 141 5 Guinea-Bissau L 48 43 50 67 168 5 Liberia L 57 51 59 77 128 6 Mali L 52 46 54 94 150 6 Mauritania L 59 52 61 89 119 4 Niger L 79 70 81 64 32 7 102 University of Ghana http://ugspace.ug.edu.gh Table 28: The world population data sheet, 2011 Country Level Y X1 X2 X3 X4 X5 Nigeria L 81 72 82 81 23 6 Senegal L 56 50 58 59 133 5 Sierra Leone L 74 66 76 68 54 5 Togo L 67 59 69 71 84 5 Burundi L 63 56 65 89 102 6 Comoros L 75 66 77 70 49 5 Djibouti L 68 60 70 65 80 4 Ethopia L 69 61 71 52 75 5 Kenya L 62 55 64 72 106 5 Madagascar M 54 48 56 63 141 5 Malawi L 76 67 78 87 45 6 Mozambique L 67 59 69 77 84 6 Rwanda L 66 59 68 76 89 5 Tanzania L 53 47 55 62 146 5 Uganda L 69 61 71 79 75 6 Zambia L 66 63 68 76 89 6 Angola L 66 57 68 76 89 6 Cameroon L 66 59 68 76 89 5 Central Africa Rep. L 65 65 67 75 93 5 Chad L 82 59 83 94 18 6 Congo L 58 75 60 67 124 5 Gabon L 63 63 65 73 102 3 SaoTome &Principe L 62 47 64 72 106 5 Belarus M 71 59 73 81 67 2 Bulgaria H 74 66 76 85 54 2 Czech Rep H 78 57 80 89 36 2 Hungary H 74 56 76 85 49 1 Moldova M 69 58 71 79 97 1 Poland H 76 71 78 87 102 1 Russia M 69 51 71 79 93 2 Slovakia H 75 46 77 86 27 1 Ukraine M 69 52 71 79 128 1 Albania H 75 57 77 86 150 1 103 University of Ghana http://ugspace.ug.edu.gh Table 29: The world population data sheet, 2011 Country Level Y X1 X2 X3 X4 X5 Bosnia-Herzegovina H 76 48 78 87 119 1 Macedonia H 74 66 76 85 97 2 Montenegro H 74 66 76 85 141 2 Serbia H 74 66 76 85 168 1 Slovenia H 80 71 82 91 128 2 Papua New Guinea L 62 55 64 72 150 4 104
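For readers who want to reproduce results like those in Appendix II on data such as the sheet above, the core of k-nearest-neighbour imputation can be sketched as below. This is a minimal illustrative Python/NumPy version under stated assumptions (fully observed rows serve as the donor pool; Euclidean distance on the jointly observed columns; missing cells filled with the donor mean); it is not the exact R routine used in the thesis.

```python
import numpy as np

def knn_impute(X, k=5):
    """Fill NaNs in each row using the mean of the k nearest complete rows.

    Distances are computed only over the columns observed in the target
    row. Rows with no missing values are returned unchanged.
    """
    X = np.asarray(X, dtype=float)
    out = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]   # donor pool of complete rows
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        # squared Euclidean distance on the observed columns only
        d = ((complete[:, ~miss] - row[~miss]) ** 2).sum(axis=1)
        donors = complete[np.argsort(d)[:k]]
        out[i, miss] = donors[:, miss].mean(axis=0)
    return out
```

In practice a library implementation (for example, scikit-learn's KNNImputer, which additionally rescales distances for missing coordinates) would be preferred over this sketch.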