UNIVERSITY OF GHANA
DEPARTMENT OF STATISTICS AND ACTUARIAL SCIENCE

L1-L2 REGULARIZATION OF COLLINEAR DATA

BY
BOATENG OWUSU-ANSAH (10600402)

THIS THESIS IS SUBMITTED TO THE UNIVERSITY OF GHANA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE AWARD OF THE M.PHIL STATISTICS DEGREE

JULY, 2018

DECLARATION

Candidate's Declaration

I, Boateng Owusu-Ansah, hereby declare that, except for references cited from the work of others, which have been duly acknowledged, this thesis is the result of original research carried out by me and has not been presented in whole or in part elsewhere for another degree.

Signature: ................................ Date: ........................
BOATENG OWUSU-ANSAH (10600402)

Supervisors' Declaration

We hereby certify that this thesis was prepared from the candidate's own work and supervised in accordance with the guidelines on supervision of theses laid down by the University of Ghana.

Signature: ................................ Date: ........................
DR. F. O. METTLE (Principal Supervisor)

Signature: ................................ Date: ........................
DR. ISAAC BAIDOO (Co-Supervisor)

DEDICATION

This thesis is dedicated to the Almighty God for His mercies and provision throughout the years, and to my parents for their support and encouragement.

ABSTRACT

Multiple linear regression analysis may be used to describe the relationship between a response variable and several independent variables. When the predictor variables are highly correlated, the least squares estimates of the regression coefficients are unstable: replicate samples can give widely differing coefficient values. Ridge and lasso regression are regularization techniques that mitigate the effect of this high collinearity on the regression analysis. They produce estimates that are biased but have smaller mean square errors between the coefficients and their estimates. The lasso and ridge trace plots of the coefficients against λ, together with cross validation, are ways of determining the regularization constant λ and the regression coefficients from the data. Ridge and lasso regression therefore lead to a more trustworthy interpretation of the results of multiple regression with highly correlated covariates.

ACKNOWLEDGEMENTS

I would like to thank Dr. F. O. Mettle for his guidance and patience over the years. His door was always open whenever I had questions about my research writing. I am gratefully indebted to him for his very valuable comments on this thesis. I am equally indebted to Dr. Isaac Baidoo for his significant remarks on this thesis.

Contents

Declaration
Dedication
Abstract
Acknowledgements
List of Tables
List of Figures
List of Abbreviations
1 INTRODUCTION
  1.1 Background of the Study
  1.2 Statement of the Problem
  1.3 Objectives of the Study
  1.4 Significance of the Study
  1.5 Scope of the Study
  1.6 Organization of the Study
2 LITERATURE REVIEW
  2.1 Introduction
  2.2 Multiple Linear Regression
  2.3 Multicollinearity
  2.4 Nature of Multicollinearity
  2.5 Sources of Multicollinearity
  2.6 Effects of Multicollinearity
    2.6.1 Tests for Multicollinearity
    2.6.2 Solutions to Multicollinearity
    2.6.3 Dropping Collinear Variables
    2.6.4 Recoding Variables
    2.6.5 Principal Component Regression
    2.6.6 Stepwise Regression
    2.6.7 Regularization Techniques (Ridge and Lasso Regression)
3 METHODOLOGY
  3.1 Introduction
  3.2 Review of Ordinary Least Squares
    3.2.1 Model Performance and Accuracy of the OLS Estimator
  3.3 Ridge Regression
    3.3.1 Properties of the Ridge Estimator
  3.4 Lasso Regression
  3.5 Standard Errors
  3.6 Cross Validation
  3.7 The Use of Monte Carlo Simulation
  3.8 Simulation Design
4 DATA ANALYSIS
  4.1 Introduction
  4.2 Performance of Estimators as Sample Size Increases
  4.3 Lasso, Ridge and OLS Coefficients
  4.4 Standard Errors of the Regression Coefficients
  4.5 Performance of OLS, Ridge and Lasso Estimators at Different Correlation Coefficients for Two Predictor Variables
  4.6 Application of L1-L2 Regularization to Bodyfat Data
5 DISCUSSIONS, CONCLUSIONS AND RECOMMENDATIONS
  5.1 Introduction
  5.2 Discussions and Conclusions
  5.3 Recommendations
REFERENCES

List of Tables

4.1 Correlation Matrix of the Simulated Dataset for n=25
4.2 Correlation Matrix of the Simulated Dataset for n=50
4.3 Correlation Matrix of the Simulated Dataset for n=200
4.4 Correlation Matrix of the Simulated Dataset for n=1000
4.5 OLS Regression Output of the Simulated Dataset for n=25
4.6 OLS Regression Output of the Simulated Dataset for n=50
4.7 OLS Regression Output of the Simulated Dataset for n=200
4.8 OLS Regression Output of the Simulated Dataset for n=1000
4.9 VIFs of the Simulated Dataset for Different Sample Sizes
4.10 Regression Coefficients for n=25
4.11 Regression Coefficients for n=50
4.12 Regression Coefficients for n=200
4.13 Regression Coefficients for n=1000
4.14 Eigenvalues for the Independent Variables
4.15 Shrinkage Parameters for Ridge and Lasso Regression
4.16 MSEs and MAEs for OLS, RR and LR
4.17 Standard Errors for n=25
4.18 Standard Errors for n=50
4.19 Standard Errors for n=200
4.20 Standard Errors for n=1000
4.21 MSEs with Varying Correlation Coefficients for Two Predictor Variables
4.22 Correlation Matrices for Bodyfat Data
4.23 Correlation Matrices for Bodyfat Data
4.24 Correlation Matrices for Bodyfat Data
4.25 OLS Output for n=25
4.26 OLS Output for n=100
4.27 OLS Output for n=200
4.28 MSEs Across Three Different Sample Sizes
4.29 MAEs Across Three Different Sample Sizes
4.30 Standard Errors of Regression Coefficients for n=25
4.31 Standard Errors of Regression Coefficients for n=100
4.32 Standard Errors of Regression Coefficients for n=200

List of Figures

3.1 Data Partitioning for Cross Validation
4.1 Scatter Plot of Simulated Data for Different Sample Sizes
4.2 Cross Validation Diagrams for Lasso Regression
4.3 Cross Validation Diagrams for Ridge Regression
4.4 Ridge Trace Plot for Simulated Dataset
4.5 Lasso Plot of Independent Variables
4.6 Lasso Plot of Coefficients Against Lambda
4.7 Shrinkage Cross Validation Diagrams
4.8 Standard Errors of Regression Coefficients Across Different Sample Sizes
4.9 Matrix Plot of Predictor Variables of Bodyfat Data

List of Abbreviations

LS    Least Squares
OLS   Ordinary Least Squares
RR    Ridge Regression
LR    Lasso Regression
MSE   Mean Square Error
MAE   Mean Absolute Error
PCA   Principal Component Analysis
PCR   Principal Component Regression
RMSE  Root Mean Square Error
BLUE  Best Linear Unbiased Estimator

Chapter 1
INTRODUCTION

1.1 Background of the Study

One of the main precautions in fitting a statistical model is controlling underfitting and reducing overfitting. A good statistical model should exhibit the following properties.

• Stability: small changes in the data should not produce large differences in the predicted outcomes.
• Model performance: the model should give accurate predictions.
• Interpretability: the model should be easy to use and explain. We often seek a model with a small number of predictor variables that still gives accurate predictions.
• Low bias: we often seek a model that is unbiased, so that the values estimated from the model are approximately equal to the true population parameters.

Least squares is one of the most popular statistical modeling techniques for fitting linear models. This technique, known to give the best linear unbiased estimator (BLUE), falls short of these model fitting goals under multicollinearity conditions. Regularization techniques are proposed remedies for the resulting prediction error: they introduce a small amount of bias to gain a model with a lower MSE, and hence one that is more stable and precise in its predictions. This thesis deals with the theory of multicollinearity as well as with ways that have been proposed to detect and correct it. The study compares the L1 (lasso regression) and L2 (ridge regression) regularization techniques to ordinary least squares using the MSE criterion. Ridge regression, first proposed by Hoerl and Kennard (1970), has proven to be a useful technique for handling the multicollinearity effect in multiple linear regression models. The thesis presents the ridge and lasso estimators and their properties, as well as ways of selecting the regularization constant.

1.2 Statement of the Problem

Multiple regression analysis is a statistical technique for determining the effects of several predictor variables on a dependent variable and for forecasting. Strong correlation between the predictors and the dependent variable is desirable, as opposed to high correlations among the independent variables themselves. High correlation among two or more predictors introduces a statistical problem known as multicollinearity, which inflates the standard errors of the regression model. According to Wahab et al. (2018), the presence of multicollinearity can render some predictor variables statistically insignificant when they ought to be significant, and vice versa. Hence, meaningful interpretations and conclusions cannot be drawn from a regression analysis when multicollinearity is present, and the predictive power of the fitted regression model is reduced. Statisticians have developed many approaches for detecting and solving problems associated with multicollinearity in regression analysis.
Variance inflation factors (VIF) and tolerance are commonly used to identify variables contributing to multicollinearity, although different desirable thresholds of the VIF have been proposed across studies. A general rule of thumb, according to O'Brien (2007), is that the unbiased ordinary least squares (OLS) estimates of the regression coefficients may not be desirable when the VIF values of the predictors are greater than 10. This suggests that other, more appropriate regression models are necessary for predicting the dependent variable in the presence of high multicollinearity. In addition, sample size can contribute to the problem of standard errors, since small samples inflate standard errors relative to large samples, as proposed by Mogessie and Bekele (2017).

This study therefore sought to address the problems of multicollinearity in regression modeling by adopting and comparing the L1 and L2 regularization methods across different sample sizes, using simulated data and a real-life dataset with multicollinearity.

1.3 Objectives of the Study

This research makes a comparative analysis of the L1 and L2 shrinkage methods when multicollinearity is present in a dataset. The goal is to find the estimator that minimizes the mean square error and standard error for a collinear dataset. Specifically, the study aims:

• to use L1 and L2 regularization to solve the multicollinearity problem;
• to determine the effect of increasing the sample size of a multicollinear dataset;
• to compare the performance of OLS and L1-L2 shrinkage on multicollinear data with four predictor variables;
• to assess the performance of OLS and L1-L2 regularization at different correlation coefficients for two predictor variables using the MSE criterion.

1.4 Significance of the Study

The property of minimum variance is not destroyed by multicollinearity: LS estimators have the minimum variance in the class of linear unbiased estimators, that is, they are the most efficient. This does not imply, however, that the variance of an OLS estimator will be small in any given sample. Multicollinearity is a sample phenomenon in the sense that predictors may be correlated in the sample at hand even if they are not linearly dependent in the population. In postulating the theoretical regression function, we assume that all the predictor variables (X) are independent and that each has a separate impact on the response variable Y in a multiple linear regression. If the X variables in a given sample are highly collinear, we cannot gain insight into their separate influences on the response variable Y. Ridge and lasso regression are shrinkage techniques that help us control the weights of the regression coefficients when the explanatory variables are linearly dependent. Regularization is useful when we know the estimates should not be too large, and it allows the problem to be optimized when it otherwise could not be if X^T X is singular. This research shows the effect of shrinkage regression on multicollinear data. It also shows the effect of collecting additional samples from a highly homogeneous population when multicollinearity exists, and it helps us understand the degree of collinearity for which OLS should be preferred over shrinkage regression.
This research identifies the estimation technique that gives the least mean square error and the smallest standard errors under multicollinearity conditions for a given number of predictor variables.

1.5 Scope of the Study

This research focuses on solving the multicollinearity problem in multiple linear regression using the shrinkage regression techniques L1 (lasso regression) and L2 (ridge regression). The study sets the tone with a review of the least squares estimator and the difficulties of using least squares when multicollinearity is present in a dataset, and it assesses and identifies the indications and effects of multicollinearity. Ridge and lasso estimators are shrinkage estimators normally used when there are more predictors than observations in a dataset. The thesis applies these shrinkage estimators to a simulated dataset to compare their performance as the sample size increases, and investigates the effect of sample size on the covariance of the predictor variables. The shrinkage parameter λ, how it can be derived, and its effect on linear regression models are also investigated. A comparison is made between L1 and L2 to determine which performs better on multicollinear data under the MSE criterion. The study also investigates which regression technique minimizes the standard errors in a multicollinear dataset.

1.6 Organization of the Study

This study gives a broad overview of ridge and lasso regression and their application as alternatives to ordinary least squares in the presence of multicollinearity. The first chapter contains the background of the study, the statement of the problem, the objectives of the study and its significance, and the scope and organization of the study. Chapter 2 reviews the empirical and theoretical literature pertaining to least squares and multicollinearity, and some of the proposed ways of dealing with ill-conditioned data. Chapter 3 deals with the research methodology, focusing on the theory of L1 and L2 regularization; the study compares the MSEs and standard errors of these two regression techniques at different values of the covariance and different sample sizes, which reveals the level of collinearity at which ridge and lasso regression are more efficient than ordinary least squares. Chapter 4 deals with the analysis and discussion of the results: the behavior of the ridge and lasso estimators at different correlation coefficient values and for increasing sample sizes based on the simulated data, a comparison of their MSEs, and the best way to achieve minimum variance and smaller standard errors under collinearity conditions. In Chapter 5, the study discusses the research findings and draws conclusions based on the simulation results and the application to multicollinear datasets.

Chapter 2
LITERATURE REVIEW

2.1 Introduction

This chapter discusses the collinearity problem and how the issue of multicollinearity can be resolved. The study reviews past research on shrinkage regression and its findings in attempting to solve the problem of multicollinearity in a dataset.

2.2 Multiple Linear Regression

Multiple linear regression (MLR) is an extension of simple linear regression to the case where we have two or more explanatory variables.
The goal of MLR is to determine the set of parameters for which the predicted values of the dependent variable are close to the actual values (Orlov, 1996). Consider the multiple linear regression model

Y = β0 + β1 X1 + β2 X2 + ξ    (2.1)

where β1 is the change in Y for a unit change in X1 while X2 is held constant, and β2 is the change in Y for a unit change in X2 while X1 is held constant. Mathematically, β1 = ∂Y/∂X1 and β2 = ∂Y/∂X2. In multiple linear regression the predictor variables are assumed to be independent, but in practice they may be correlated, as stated by Johnson and Wichern (2004). The degree of correlation between the predictor variables is known as multicollinearity.

2.3 Multicollinearity

Consider the model in equation (2.1), which has two predictor variables, X1 and X2. The variables X1 and X2 are said to be collinear when they are correlated with each other (Belsley, 2004). Suppose, for example, that we try to classify students as good, average or bad using their test scores in mathematics and economics. On average, a student's performance in economics depends on the strength of their numeracy skills, so the results might show a strong positive correlation between mathematics and economics performance, leading to the problem of multicollinearity since the independent variables are linearly dependent.

2.4 Nature of Multicollinearity

Multicollinearity may be classified as perfect or partial.

Perfect multicollinearity: when two or more explanatory variables overlap completely, with one variable a perfect linear function of the others, such that the method of analysis cannot distinguish one from the other, we say there is perfect multicollinearity. This condition does not allow the coefficients of a multiple regression model to be estimated, since the design matrix is not of full rank and the equations for estimating the regression parameters become unsolvable.

Partial multicollinearity: this is the condition where two or more explanatory variables overlap such that they are correlated with each other but still contain independent variation; that is, not all the predictor variables are perfect linear combinations of each other. This condition restricts the extent to which the analysis can distinguish their causal significance, making it difficult to identify the most genuinely significant variables in the regression model.

2.5 Sources of Multicollinearity

Identifying the source of multicollinearity is paramount in solving the multicollinearity problem in a dataset; wherever the problem arises, it affects the analysis and interpretation of results. This research discusses five sources of multicollinearity.

• Data collection: this happens when the data are assembled from a small subspace of the predictor variables, that is, collinearity created by a poor sampling methodology. Adding samples over an expanded range will solve this multicollinearity problem. An example is trying to fit a line to a single point.
• Physical constraints of the model or population: this collinearity arises from constraints put on the predictor variables (as to their range), whether legal, political or physical.
This source of collinearity will exist irrespective of the sampling technique.

• Over-fitted model: this happens when we have more predictor variables than observations, and it can easily be avoided.
• Model specification: this form of collinearity arises from using predictor variables that are powers or linear combinations of some other original set of variables. If the sampling subspace for the predictor variables is narrow, collinearity rises further with any combination of the original variables.
• Extreme values in a subspace of the predictor variables induce multicollinearity. This can be corrected by removing the extreme values from the dataset of observations.

2.6 Effects of Multicollinearity

• If multicollinearity is perfect, the regression coefficients of the predictor variables are indeterminate and their standard errors are infinite.
• The variances and standard errors of the regression coefficient estimates βi are inflated, i.e., Var(βi) is too large, so the coefficients cannot be estimated with great precision or accuracy.
• The magnitudes of the βi may differ from what we expect, affecting the accuracy of predicted outcomes.
• The signs of the βi may differ from what we expect. Consider the model in equation (2.1), where X1 and X2 are mathematics and economics scores respectively and Y represents the level of credibility of an actuary. We would expect economics scores to increase as mathematics scores increase; a negative coefficient in such a model might be a result of multicollinearity.
• Adding or removing a predictor variable can cause huge changes in the regression coefficients βi.
• In some cases the F statistic is significant while the t statistics, ti = bi / SE(bi), are insignificant.

2.6.1 Tests for Multicollinearity

There are several ways of identifying multicollinearity in a dataset; a few are discussed below.

2.6.1.1 Correlation Coefficient

Calculate the correlation coefficient r = cov(Xi, Xj) / (σ_Xi σ_Xj) for each pair of predictor variables. If any of these values is significantly different from zero, the predictor variables involved may be collinear. One limitation of using the correlation coefficient to check the degree of multicollinearity is that, although two predictor variables Xi and Xj may not be highly correlated pairwise, three or more predictor variables X1, X2 and X3 may be correlated as a group.

2.6.1.2 Eigenvalues

Eigenvalues, condition indices and the condition number can also be used to check for multicollinearity. The condition number K is the square root of the ratio of the largest eigenvalue λmax to the smallest eigenvalue λmin, that is, K = sqrt(λmax / λmin). When there is no multicollinearity, the eigenvalues and the condition number all equal one; as multicollinearity increases, the eigenvalues spread to values both greater and less than one.

2.6.1.3 Variance Inflation Factor (VIF)

The rule of thumb states that multicollinearity exists if VIF > 10. A VIF of 10, say, means that Var(βi) is 10 times what it would have been if no multicollinearity existed in the dataset. The VIF is a more rigorous check for collinearity than the correlation coefficient. It is defined as VIF = 1 / (1 − Ri²), where Ri² is the coefficient of determination obtained by regressing the ith predictor on the remaining predictors; the denominator 1 − Ri² is known as the tolerance. In the regression model

Y = β0 + β1 X1 + β2 X2 + ξ    (2.2)

R1² is obtained by regressing X1 on X2. In each case we find the coefficient of determination and substitute it into the VIF formula. The coefficient of determination R² is a measure of goodness of fit: it tells how much of the variance in the response variable is explained by the predictor variables. A minimal sketch of this computation follows.
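The sketch below illustrates the VIF definition above in R: each predictor is regressed on the others and VIF = 1/(1 − R²) is applied. The data frame name `dat` and the column names X1 to X4 are hypothetical; packaged implementations such as car::vif(lm(Y ~ ., data = dat)) return the same quantities.

# A minimal sketch of the VIF computation, assuming a data frame `dat`
# whose columns X1..X4 hold the predictors (hypothetical names).
vif_manual <- function(preds) {
  sapply(names(preds), function(v) {
    others <- setdiff(names(preds), v)
    # Regress predictor v on all the other predictors and record R^2
    r2 <- summary(lm(reformulate(others, response = v), data = preds))$r.squared
    1 / (1 - r2)                     # VIF_i = 1 / (1 - R_i^2)
  })
}
# Example (hypothetical): vif_manual(dat[, c("X1", "X2", "X3", "X4")])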
2.6.2 Solutions to Multicollinearity

In this subsection we discuss various ways of dealing with a dataset under multicollinearity conditions.

2.6.3 Dropping Collinear Variables

Drop the variable causing the problem. When using a large number of X variables, a stepwise regression procedure can be used to determine which of the variables to drop. Removing collinear X variables is the simplest method of solving the multicollinearity problem. If all the X variables are retained, it is advisable to avoid making inferences about the individual β parameters, and to limit inferences about the mean value of Y to values of X that lie within the experimental region.

2.6.4 Recoding Variables

Recode the form of the independent variables. For instance, if X1 and X2 are collinear, we might try using X1 and the ratio X1/X2. Recoding variables is very effective in controlling multicollinearity when it was caused not by a sampling problem but by the design of the experiment or the model specification.

2.6.5 Principal Component Regression

This is a linear regression preceded by principal component analysis. In PCR, the principal components of the explanatory variables are used as regressors instead of regressing the response variable on the predictor variables directly (Filzmoser and Croux, 2003). The principal components with higher variances are usually chosen as regressors, even though those with low variances may occasionally be important for precise forecasts. PCR helps overcome multicollinearity by excluding the low-variance components from the regression step. In addition, by regressing on only a subset of the principal components, PCR reduces the number of parameters in the underlying model, which is particularly beneficial in settings with high-dimensional covariates. Through appropriate selection of the principal components used for regression, PCR can also lead to efficient forecasts by fitting a parsimonious model. A small sketch of the procedure follows.
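The sketch below illustrates the PCR idea under stated assumptions: a numeric predictor matrix `X` and response vector `y` (both hypothetical), with the number of retained components `k` chosen by the analyst.

# A minimal sketch of principal component regression, assuming a numeric
# predictor matrix X and response vector y (hypothetical objects).
pcs     <- prcomp(X, center = TRUE, scale. = TRUE)  # PCA of the predictors
k       <- 2                                        # analyst-chosen number of components
scores  <- pcs$x[, 1:k]                             # high-variance component scores
pcr_fit <- lm(y ~ scores)                           # regress y on the retained scores
# Map the score coefficients back to the standardized predictor scale
beta_pcr <- pcs$rotation[, 1:k] %*% coef(pcr_fit)[-1]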
2.6.6 Stepwise Regression

Stepwise regression is a variable selection procedure for the independent variables. Selection is done by adding variables that best satisfy a criterion (forward stepwise regression) or dropping variables that least satisfy a criterion (backward stepwise regression). At each step of the procedure, each predictor variable X is evaluated to see whether it should be kept in the model; an example of a criterion used in stepwise regression is the t value (Xiaobo et al., 2010). Variables with high β coefficients in the regression model are retained in a forward selection approach, and variables with the smallest β coefficients are dropped in a backward selection approach.

2.6.7 Regularization Techniques (Ridge and Lasso Regression)

This is an alternative estimation procedure to OLS and is discussed in Chapter 3.

Duzan and Shariff (2015) investigated the shortcomings of using ordinary least squares (OLS) when multicollinearity is present in a regression analysis; the only alternative method considered was ridge regression. The goal of the study was to find an appropriate value of the ridge parameter in a two-variable regression model. The investigation was done by simulating 1000 samples with n = 10, and the performance of different ridge regression estimators was compared to OLS. Mean square errors (MSE), variance inflation factors (VIF) and regression weights (beta estimates) were computed using few predictor variables (two and four). Random values of the ridge parameter λ were chosen to see which produced the lowest MSE and smallest VIF, and the coefficient of determination, a measure of goodness of fit, was calculated from linear models with correlated explanatory variables. The regression coefficients obtained using ridge regression and ordinary least squares (the case λ = 0) were compared for the 1000 observations in every simulation, with the ridge parameter computed from the ridge trace. The analysis found that the value of the ridge parameter is directly proportional to the covariance between the two explanatory variables, and that higher values of λ produced smaller β estimates. The researchers did not, however, consider the behavior of the ridge estimator for small sample sizes, and no comparative analysis was done between RR and OLS at different degrees of collinearity.

Pourbasheer et al. (2014) also did a comparative analysis of collinear data using ridge regression and OLS. The simulation was done using the SAS package with three different sample sizes (100, 50, 25), the goal being to observe the behavior of the estimators as the sample size increases. Six highly intercorrelated predictor variables were used, and eigenvalues were computed from the correlation matrices developed for the three sample sizes. With the smallest of the three sample sizes, some eigenvalues turned out not to be distinct, producing results in the complex plane (having a real part and an imaginary part). Finally, a comparison was made between the MSE and R-squared of ordinary ridge regression (ORR) and OLS. The results showed that the RR method produces a lower MSE than OLS, and that the estimated coefficients of ORR had smaller R-squared values than OLS. An increase in sample size was found to stabilize the estimated regression model, so the conclusion was that RR is a better estimator than OLS under collinearity conditions.

Santos-Cortez et al. (2014) published an article on the effect of ridge regression on regression estimates when the sample size is small, varying the correlation coefficients. The RR estimator was the only alternative to OLS introduced, and different ways of evaluating the ridge constant were discussed. Simulations were done using R software to achieve moderate and high correlations between the independent variables (0.5, 0.7 and 0.9). The sample size used was 20, with 10 explanatory variables considered in each case. The explanatory variables were standardized to correlation form, and standard error values of 0.1, 0.5, 1.0, 5.0 and 10.0 were considered. The experiment was repeated 1000 times, and in each case the value of the ridge parameter λ was estimated.
The performance of these estimators was measured using the average MSE,

MSE(β̂_RR) = (1/1000) Σ_{i=1}^{1000} (β̂_RR − β)^T (β̂_RR − β)    (2.3)

Levy et al. (2005) used principal component analysis to solve the problem of multicollinearity by reducing the size of the covariance matrix. However, PCA cannot always fix the problems with parameter estimation associated with multicollinearity, and the interpretation of the resulting model becomes cumbersome since each of the principal components used is a linear combination of all the other variables. Grewal et al. (2004) proposed controlling multicollinearity using structural equation models, recommending the inclusion of all significant influences on a response variable. Unfortunately, behavioral models frequently have low explanatory power (Grewal et al., 2004), so that may not be an option; another suggestion was to use an adequate sample size, which may not always be available. Chong and Jun (2005) and Jun et al. (2009) used variable selection methods to control the impact of high covariance among predictor variables, establishing that a model with good fit may not guarantee good variable selection performance, especially in moderation studies where every predictor variable is of paramount relevance. The problems of stepwise regression were investigated by Hauser (1974): apart from dropping some important variables, its significance tests are misleading, since the data used to generate the model are used again to test it.

Kaufinger (2013) admonished researchers to introduce categorical variables into models when multicollinearity is present in a dataset. It was troublesome to get a closed-form expression in the general case, as there could be numerous possible combinations of dummy and quantitative variables in linear regression models; the issue was investigated in more detail by choosing different combinations of dummy and quantitative variables. It was found that the presence of a dummy variable and the choice of reference category can themselves cause multicollinearity, and the approach of including an interaction term can often be difficult to set up and to interpret.

Chapter 3
METHODOLOGY

3.1 Introduction

In this chapter, the study reviews the concept of least squares regression and how its parameters are estimated. The study considers the difficulties of the OLS estimator when multicollinearity is present in a dataset and the ways in which they can be corrected, and introduces two shrinkage estimators, L2 (ridge regression) and L1 (lasso regression) regularization, which are known to be good fits under multicollinearity conditions.

Data were simulated for four independent variables from a multivariate normal distribution with highly correlated covariates using R. The sample size is set small (25) and then increased (50, 200, 1000), and the study observes the behavior of the estimators as the sample size increases. To set the regularization constant, cross validation was done on the simulated dataset to find the fixed parameter value at which the mean square error is minimum. The study seeks to find the best estimator using the MSE criterion.
The study also compares the effect of these regularization techniques on standard errors at different sample sizes. Next, the procedure deals with a simulation study with two predictor variables, varying the bivariate correlation coefficient between the predictors from zero to one in steps of 0.1. The study seeks the level of covariance between two predictor variables at which the OLS estimator performs better than the regularized shrinkage estimators. An application study was done using the bodyfat dataset in R, with 8 predictor variables, to investigate the results when the number of predictor variables increases.

3.2 Review of Ordinary Least Squares

By the Gauss-Markov theorem, least squares gives the best linear unbiased estimator (BLUE). "Best" means that among all linear unbiased estimators it has the minimum variance, or the least MSE; "unbiased" means that the mean of the estimated beta parameters over repeated samples is close to the true population parameter. Consider the linear model

Y_i = β0 + β1 X_i1 + β2 X_i2 + ... + βk X_ik + ξ_i = β0 + Σ_{p=1}^{k} βp X_ip + ξ_i    (3.1)

or, in matrix form,

Y = Xβ + ξ    (3.2)

Here Y is an (n × 1) vector of dependent variables, X is an [n × (k + 1)] matrix of observations on the k predictor variables (with a column of ones for the intercept), β is a [(k + 1) × 1] vector of regression coefficients to be estimated from the data, and ξ is an (n × 1) vector of error terms following a normal distribution with mean zero and constant variance, ξ ~ N(0, σ²). The expectation of Y is

E(Y) = E(Xβ + ξ) = Xβ    (3.4)

and the variance of Y is

Var(Y_i) = Var(X_i β + ξ_i) = σ²    (3.5)

The goal of ordinary least squares is to minimize the sum of squared differences between the observed and predicted values. From Y = Xβ + ξ we have ξ = Y − Xβ, so

ξ^T ξ = (Y − Xβ)^T (Y − Xβ) = Y^T Y − Y^T Xβ − β^T X^T Y + β^T X^T Xβ    (3.9)

Differentiating with respect to β and setting the derivative to zero gives the optimal value of β:

∂(ξ^T ξ)/∂β = −2X^T Y + 2X^T Xβ = 0    (3.12)

so that X^T Xβ = X^T Y, and rearranging gives

β̂ = (X^T X)^{-1} X^T Y    (3.14)

This equation is used to compute the β parameter estimates.

3.2.1 Model Performance and Accuracy of the OLS Estimator

The expectation of the estimator is

E(β̂) = E[(X^T X)^{-1} X^T Y] = (X^T X)^{-1} X^T E(Y) = (X^T X)^{-1} X^T Xβ = β    (3.18)

so β̂ is unbiased for β. Its variance is

Var(β̂) = (X^T X)^{-1} X^T Var(Y) X (X^T X)^{-1} = σ² (X^T X)^{-1}    (3.22)

and its mean square error is

MSE(β̂) = σ̂² trace[(X^T X)^{-1}] = σ̂² Σ_{i=1}^{p} (1/K_i)    (3.24)

where the K_i are the eigenvalues of X^T X. When the predictors are nearly collinear, some K_i are close to zero and the MSE blows up. A minimal sketch of the closed-form solve follows.
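As a hedged illustration of equation (3.14), the R sketch below solves the normal equations directly; the objects X and Y are hypothetical, and lm() performs the same fit with a numerically safer decomposition.

# A minimal sketch of the closed-form OLS solve in equation (3.14).
# Assumes X includes a column of ones for the intercept (hypothetical objects).
ols_beta <- function(X, Y) {
  solve(t(X) %*% X, t(X) %*% Y)   # beta_hat = (X'X)^{-1} X'Y
}
# Under strong multicollinearity X'X is near-singular and solve() becomes
# unstable -- the failure mode that motivates the ridge and lasso estimators below.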
3.3 Ridge Regression

Ridge regression (RR) was introduced by Hoerl and Kennard (1970) as a solution to problems of multicollinearity in multiple linear regression, where the ordinary least squares estimator is unstable. The assumptions of ridge regression are the same as those of ordinary least squares. When the predictor variables are highly correlated in a linear regression model, the matrix X^T X nears singularity and hence is effectively not invertible (Stone and Brooks, 1990). This creates imprecise parameter estimates with large variances, and some variables that explain variability in the response might appear insignificant in the analysis because of the presence of correlated covariates. Ridge regression reduces the impact of correlated inputs by penalizing the norm of the beta vector:

β̂_RR = argmin_β J(β) + λ‖β‖₂²    (3.25)

where λ is the ridge constant. For any vector, the l_p norm is defined as

‖β‖_p ≡ (|β1|^p + |β2|^p + ... + |βk|^p)^{1/p}    (3.26)

For a linear regression, the penalized loss function takes the form

J(β) = Σ_{i=1}^{N} (Y_i − β^T X_i)² + λ‖β‖₂²    (3.28)

We want the value of β that minimizes this function. Writing it as

J(β) = (Y − Xβ)^T (Y − Xβ) + λβ^T β    (3.29)

differentiating with respect to β, and setting the derivative to zero,

∂J(β)/∂β = −2X^T (Y − Xβ) + 2λβ = 0    (3.31)

solving for β generates the solution

β̂_RR = (X^T X + λI)^{-1} X^T Y    (3.32)

where I is the identity matrix.

3.3.1 Properties of the Ridge Estimator

The main properties of the ridge solution are:

• the ridge estimator β̂_RR is a linear transformation of the least squares estimator β̂;
• the length of β̂_RR is a decreasing function of λ;
• the residual sum of squares increases monotonically as a function of λ.

The expectation of β̂_RR is

E(β̂_RR) = (X^T X + λI)^{-1} X^T E(Y) = (X^T X + λI)^{-1} X^T Xβ = [I + λ(X^T X)^{-1}]^{-1} β    (3.37)

so the ridge estimator is biased for λ > 0. Its covariance is

Cov(β̂_RR) = σ² (X^T X + λI)^{-1} X^T X (X^T X + λI)^{-1}    (3.42)

and its mean square error is

MSE(β̂_RR) = E[(β̂_RR − β)^T (β̂_RR − β)] = σ² Σ_{i=1}^{k} K_i/(K_i + λ)² + λ² β^T (X^T X + λI)^{-2} β    (3.44)

which decomposes as

MSE(β̂_RR) = Σ Var(β̂_RR) + Σ [Bias(β̂_RR)]²    (3.45)

where the K_i are the eigenvalues of X^T X and k is the number of explanatory variables. The first term is the trace of the dispersion matrix of β̂_RR and the second is the squared length of the bias vector. The variance term is monotonically decreasing for λ > 0, while the squared bias is monotonically increasing in λ; a suitable choice of λ strikes a balance between the two terms, reducing the variance by more than it increases the bias. A minimal sketch of the ridge solve follows.
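The R sketch below implements equation (3.32) directly, under the same hypothetical objects as the OLS sketch (standardized predictors X, centered response Y); glmnet with alpha = 0 is the packaged equivalent.

# A minimal sketch of the ridge solve in equation (3.32), assuming
# standardized predictors X and a centered response Y.
ridge_beta <- function(X, Y, lambda) {
  p <- ncol(X)
  solve(t(X) %*% X + lambda * diag(p), t(X) %*% Y)  # (X'X + lambda I)^{-1} X'Y
}
# lambda = 0 recovers OLS; increasing lambda shrinks the coefficients and
# keeps the system invertible even when X'X is singular.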
It is suitable for data that exhibit high degree of 22 University of Ghana http://ugspace.ug.edu.gh multicollinearity. It is also helpful in doing variable selection or parameter elimination when we don’t want to use filter approaches. The larger the penalty term, the smaller the coefficients of the regression parameters. Lasso regression is also known as L1 regularization. Ridge regression (L2) regularization doesn’t result in parsimonious models as coefficients are not zeroed. In contrast, the Lasso does variable selection and parameter shrinkage automatically. This makes interpretation of Lasso models far easier that Ridge models. Mathematically, we seek to find the value of β that minimizes the expression ∑n 2 ( ) ∑p‖Y −Xβ‖2 + λ ‖ 2β‖ = Y −Xiβ + λ ‖βj‖ (3.46) i=1 j=1 where the first term is the sum of squares and the second term is the Lasso penalty. Expanding out the first term, we get ( ) Y TY − Y TXβ − βTXTY + βTXTXβ (3.47) In the orthonormal case XTX = I = (XTX)−1, hence β̂LS = XTY . Since Y TY does not contain any of the variables of interest, we can discard it. Hence we have, ( ∑p ) Y TY − 2Y TXβ − βTXTY + βTβ + λ ‖βj‖ (3.48)( [ ] ∑ j=1p ) − Tβ̂OLS β − βT β̂ TOLS + β β + λ ‖βj‖ (3.49) ∑ j=1p ( ) − 2β̂ 2OLSβj + βj + λ1 ‖βj‖ (3.50) ∑j=1p ( ) min − 2β̂ 2OLSβj + βj + λ1 ‖βj‖ (3.51) βj j=1 23 University of Ghana http://ugspace.ug.edu.gh Minimization can be done per regression coefficient ( )   min− 2β̂ 2  OLSβj + βj − λ1 ‖βj‖ , for β > 0  min − β2β̂OLSβj+β2 jj+λ1 ‖βj‖ = βj  min− 2β̂OLSβj + β2j − λ1 ‖βj‖ , for β < 0  βj Solving the right-hand side yields    β̂ 1 OLS − λ , for β > 02 1 β̂LASSO(λ1) =  β̂ 1 OLS + λ1, for β < 02 Both the sum of squares and the lasso penalty are convex, and so is the lasso loss func- tion. Consequently, there exist a global minimum. However, the lasso loss function is not strictly convex. Consequently, there may be multiple values of β′s that minimize the lasso loss function. 3.5 Standard Errors The Lasso is a non-linear and a non-differentiable function of the response values, it is difficult to estimate its standard errors accurately even for a fixed value of λ. However the standard errors can be estimated via bootstrap, that is either λ can be fixed or we may optimize over λ for each bootstrap sample. Getting a fixed λ value is comparable to selecting a best subset and then using the least squares standard e∑rror for that∑subset.β2 An approximate estimate may be derived by writing the penalty | βj | as j|βj | . Hence at the Lasso estimate β̂, we may approximate the solution by a ridge regression of the form β̂ = (XTX + λV −)−1XTY , where V is a diagonal matrix with diagonal ∑elements | βj |, V −denotes the generalized inverse of W and λ is chosen such that | βj |= λ. The covariance matrix of the estimates may then be approximated by β̂ = (XTX + λV −)−1XTX(XTX + λV −)−1σ2 (3.52) 24 University of Ghana http://ugspace.ug.edu.gh Figure 3.1: This figure shows data partitioning for cross validation. where σ2 is an estimate of the error variance. The cumbersomeness of the above for- mula is that, it gives error variance of zero for predictors with β̂j = 0 but it does prove to be useful for selection of the lasso shrinkage parameter λ. 3.6 Cross Validation Cross validation is a method used in selecting the best estimator with the smallest RMSE from a group of competing estimators formed from the same postulate stated by Stone (1974). 
3.6 Cross Validation

Cross validation is a method for selecting, from a group of competing estimators formed from the same postulate, the estimator with the smallest RMSE (Stone, 1974). Some cross validation methods are outlined below.

• Holdout method: a given dataset is partitioned into two parts, A and B say. Part A is used as the training data to generate the model, and part B is used to validate the model generated. This method is prone to sample bias since the split is done by random sampling without any criterion.
• K-fold method: the dataset is divided into K folds, as illustrated in Figure 3.1. One partition is held out to validate the model generated from the other partitions, and the process is repeated so that each of the k partitions is held out once, generating k models. The model giving the smallest RMSE after validation is chosen as best and used for analysis.
• Leave-one-out method: a special case of the K-fold method in which the number of folds equals the number of observations, so each held-out partition contains a single observation. It requires a large computational time for large samples.
• Bootstrap method: used if the samples come from the same parent distribution and are independent of each other. Random samples are drawn, with replacement, from the training dataset; the models are fitted on the bootstrap samples and examined to find the model most consistent across them (Madsen and Thyregod, 2010).

In this research the K-fold method was used to select the shrinkage parameter. Given a training dataset (x_i, y_i), i = 1, 2, ..., n, we construct an estimator θ̂ of some unknown function θ. Suppose θ̂ = θ̂_λ depends on a tuning parameter λ; cross validation offers a way to choose the value of λ (the penalizing constant) that optimizes predictive accuracy. The idea is to divide the training data into N folds (N fixed, e.g. N = 4 as shown in Figure 3.1). We then hold out each fold one at a time, train on the remaining data, and predict the held-out observations for each value of the tuning parameter. The cross validation error for each value of the tuning parameter is

CV(λ) = (1/n) Σ_{i=1}^{n} (y_i − θ̂(x_i))²    (3.53)

and we choose the tuning parameter that minimizes the CV error curve,

λ̂ = argmin_λ CV(λ)    (3.54)

In this research, R was used to estimate the ridge and lasso parameters by performing cross validation on the simulated data.

3.7 The Use of Monte Carlo Simulation

The Monte Carlo method is a stochastic technique that uses random numbers and probability to investigate problems. The Monte Carlo strategy can be used to solve physical problems and aids in examining complex systems: with the Monte Carlo method, we can sample a large system in a number of random configurations. Midi et al. (2010) also conducted a Monte Carlo simulation study of a robust approach in the presence of multicollinearity.

3.8 Simulation Design

In this research, we simulate a four-variable, highly correlated dataset in R with different sample sizes (25, 50, 200, 1000), and examine the effect of sample size on multicollinearity, the regression coefficients, the mean square errors and the mean absolute errors. A sketch of the design follows.
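A hedged sketch of this design, assuming MASS::mvrnorm; the common correlation of 0.9, the seed and the coefficient vector are illustrative choices, not the exact values used in the study.

# A sketch of the simulation design in Section 3.8, assuming the MASS package.
library(MASS)
set.seed(1)                                   # illustrative seed, for reproducibility
p     <- 4
Sigma <- matrix(0.9, p, p); diag(Sigma) <- 1  # highly correlated predictors
for (n in c(25, 50, 200, 1000)) {
  X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
  Y <- X %*% c(0.2, 0.3, 0.2, 0.6) + rnorm(n, sd = 0.1)  # illustrative coefficients
  # ... fit OLS, ridge and lasso on (X, Y); record coefficients, MSE, MAE and SEs
}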
Chapter 4
DATA ANALYSIS

4.1 Introduction

This chapter analyses and discusses the outcomes of the simulated and application datasets. We compared the behavior of the estimators as the sample size increased and at different covariances for a two-predictor model. The MSEs of the estimators, as well as the standard errors of the regression coefficients (OLS, RR and LR), were computed and compared to determine the best estimator for the simulated and applied highly correlated datasets.

4.2 Performance of Estimators as Sample Size Increases

Figure 4.1 is the matrix plot of the simulated dataset for different sample sizes; it shows the scatter plots of the response and predictor variables.

Figure 4.1: Scatter Plot of Simulated Data for Different Sample Sizes; panels (a) n=25, (b) n=50, (c) n=200, (d) n=1000.

From Figure 4.1 it can be seen that all the independent variables are highly positively correlated with each other and that a linear relationship exists among the variables. The data points in the matrix plot become more compact as the sample size increases. Pragmatically, this might be the case when additional samples are collected from a highly homogeneous population, since all the samples exhibit very similar characteristics.

The correlation matrices of the simulated dataset for the different sample sizes are shown below. Correlation measures the strength of the linear relationship between two variables; the diagonal elements represent the variances and the off-diagonal elements the covariances (here in correlation form).

Table 4.1: Correlation Matrix of the Simulated Dataset for n=25

      Y        X1       X2       X3       X4
Y     1        0.88322  0.92835  0.92660  0.97103
X1             1        0.91273  0.83321  0.91033
X2                      1        0.90425  0.87658
X3                               1        0.85925
X4                                        1

Table 4.2: Correlation Matrix of the Simulated Dataset for n=50

      Y        X1       X2       X3       X4
Y     1        0.87771  0.92616  0.94428  0.97250
X1             1        0.90807  0.84841  0.90175
X2                      1        0.91918  0.87299
X3                               1        0.88772
X4                                        1

Table 4.3: Correlation Matrix of the Simulated Dataset for n=200

      Y        X1       X2       X3       X4
Y     1        0.90780  0.92770  0.95635  0.97259
X1             1        0.91883  0.88993  0.90216
X2                      1        0.92713  0.85399
X3                               1        0.90826
X4                                        1

Table 4.4: Correlation Matrix of the Simulated Dataset for n=1000

      Y        X1       X2       X3       X4
Y     1        0.92105  0.94131  0.96338  0.97527
X1             1        0.90807  0.84841  0.90175
X2                      1        0.91918  0.87299
X3                               1        0.85925
X4                                        1

Table 4.5: OLS Regression Output of the Simulated Dataset for n=25

            Estimate   Std. Error   t-value   P-value
Intercept   0.00084    0.01929      0.044     0.96570
X1          -0.01339   0.08425      -0.159    0.87535
X2          0.30626    0.07540      4.282     0.00360
X3          0.17686    0.07540      2.346     0.02942
X4          0.57121    0.06732      8.485     0.00000

F(4,20) = 540, p = 0.000, RSE = 0.09347, R² = 0.9908, Adj. R² = 0.989

From Table 4.5 it can be seen that X1 has a negative coefficient even though all the predictors are positively correlated with the response variable Y. The variability we would expect under repeated sampling is low, given the small standard error values. Some variables, X1 for example, are not significant at the α = 0.05 level. The table below shows the OLS output of the simulated dataset for a sample size of 50.

Table 4.6: OLS Regression Output of the Simulated Dataset for n=50

            Estimate   Std. Error   t-value   P-value
Intercept   0.00064    0.01219      0.052     0.95900
X1          -0.06785   0.04776      -0.421    0.16200
X2          0.29021    0.04765      6.068     0.00000
X3          0.21466    0.04765      4.505     0.00000
X4          0.60932    0.03664      16.631    0.00000

F(4,45) = 1469, p = 0.000, RSE = 0.08601, R² = 0.9924, Adj. R² = 0.9917
From Table 4.6 it can be seen that the number of significant variables has increased (at α = 0.05, say) and the standard errors of the predictor variables have decreased. The table below shows the OLS output of the simulated dataset for a sample size of 200.

Table 4.7: OLS Regression Output of the Simulated Dataset for n=200

            Estimate    Std. Error   t-value   P-value
Intercept   0.00184     0.00657      0.281     0.77900
X1          -0.010502   0.02319      -0.4528   0.00001
X2          0.34074     0.02581      13.201    0.00000
X3          0.18364     0.02637      6.963     0.00000
X4          0.61130     0.02003      30.516    0.00000

F(4,195) = 4986, p = 0.000, RSE = 0.09226, R² = 0.9903, Adj. R² = 0.9901

From Table 4.7, all the variables would be considered significant at α = 0.05, say; the residual standard error has risen to 9.2%, but the significance of the predictors has improved, as the p-values are decreasing. The R-squared tells us the percentage of variation in the response variable that is explained by the predictor variables; it is a measure of the goodness of fit of the model. The table below shows the OLS output of the simulated dataset for a sample size of 1000.

Table 4.8: OLS Regression Output of the Simulated Dataset for n=1000

            Estimate   Std. Error   t-value   P-value
Intercept   -0.00425   0.00295      -1.438    0.15100
X1          -0.13916   0.00988      -14.091   0.00000
X2          0.33910    0.01110      30.564    0.00000
X3          0.20930    0.01086      19.270    0.00000
X4          0.61502    0.00887      69.360    0.00000

F(4,995) = 2.813 × 10⁴, p = 0.000, RSE = 0.09327, R² = 0.9912, Adj. R² = 0.9912

From Table 4.8 it can be seen that the increase in sample size has drastically reduced the standard errors; the stability of a model is strongly affected by the sample size of the dataset. The significance of the predictor variables has increased compared with the cases n=25, n=50 and n=200. However, the negative sign attached to X1 is maintained across all the sample sizes even though X1 is positively correlated with Y. This might be due to multicollinearity in the data and is investigated further using the VIFs.

Table 4.9 shows the variance inflation factors of the four predictor variables of the simulated data for the different sample sizes. The VIF measures how much the variances of the regression coefficients are inflated relative to the case where the independent variables are strictly uncorrelated; VIF values greater than 5 indicate serious multicollinearity.

Table 4.9: VIFs of the Simulated Dataset for Different Sample Sizes

      n=25      n=50      n=200     n=1000
X1    17.9780   12.4855   10.5131   11.0825
X2    13.5777   12.7499   12.5500   13.8634
X3    12.2999   13.0577   13.9575   13.2335
X4    8.9445    7.9278    8.3723    8.9871

From Table 4.9 it can be seen that most of the predictor variables have VIFs > 10 across all sample sizes, meaning that severe multicollinearity exists in the simulated dataset. The shrinkage parameters for ridge and lasso were therefore selected by cross validation, as in the sketch below.
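A hedged sketch of the cross validation step, assuming glmnet and the simulated X and Y from Section 3.8; plot(cv_lasso) produces error curves of the kind shown in Figures 4.2 and 4.3.

# A sketch of K-fold selection of the shrinkage parameter with cv.glmnet,
# assuming the simulated X and Y from Section 3.8.
library(glmnet)
cv_lasso <- cv.glmnet(X, Y, alpha = 1, nfolds = 10)  # alpha = 1: lasso penalty
cv_ridge <- cv.glmnet(X, Y, alpha = 0, nfolds = 10)  # alpha = 0: ridge penalty
plot(cv_lasso)        # error curve with the two vertical lines discussed below
cv_lasso$lambda.min   # lambda minimizing the cross-validated error
cv_lasso$lambda.1se   # largest lambda within one standard error of the minimum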
Figure 4.2 shows the cross-validation diagrams of Lasso regression for the simulated dataset.

Figure 4.2: Cross Validation Diagrams for Lasso Regression
(a) n=25  (b) n=50  (c) n=200  (d) n=1000

The two vertical lines in the diagrams above represent two candidate choices of λ: the minimum value is the one that minimizes the estimated loss in cross-validation, while the maximum value is the largest λ whose error lies within one standard error of that minimum. The red dots represent the cross-validated error estimates, with bars marking the confidence intervals of these estimates, and the vertical lines show the locations of the maximum and minimum λ values. The numbers across the top give the number of non-zero coefficient estimates.

It can be observed across all the cross-validation plots that the number of non-zero coefficients is three. Hence it is expected that one of the independent variables will be dropped (will have its coefficient zeroed) in the Lasso regression model.

Figure 4.3: Cross Validation Diagrams for Ridge Regression
(a) n=25  (b) n=50  (c) n=200  (d) n=1000

The lowest point on each curve indicates the optimal λ value, that is, the value of log λ that best minimizes the error in cross-validation.

4.3 Lasso, Ridge and OLS Coefficients

This section discusses the regression coefficients of the three estimators fitted to the simulated dataset. From the tables below it can be seen that X1 has a negative coefficient under ordinary least squares across all four sample sizes even though it was positively correlated with the response variable. This is an indication that multicollinearity might exist in the dataset, which is affirmed by the table of VIFs. The ridge regression technique has retained all the variables, whereas Lasso has zeroed the coefficient of X1. This illustrates one of the properties of lasso discussed in Chapter 3, namely that it performs variable selection by first removing the least significant of the variables in the model.

The estimates tend to become more significant as the sample size increases; that is, the precision of the coefficient estimates improves with the sample size. However, most ridge regression and lasso regression estimates are smaller in magnitude than the OLS estimates. A sketch of how these fits and coefficient comparisons could be produced is given below, followed by the tables themselves.
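The description of Figures 4.2 and 4.3 (red error dots, two vertical lines, non-zero counts across the top) matches the cross-validation plots of the R package glmnet, so a plausible sketch of this step, reusing the simulated data frame sim from above, is:

    library(glmnet)

    x <- as.matrix(sim[, c("X1", "X2", "X3", "X4")])
    cv_lasso <- cv.glmnet(x, sim$Y, alpha = 1)   # alpha = 1 gives the lasso (L1)
    cv_ridge <- cv.glmnet(x, sim$Y, alpha = 0)   # alpha = 0 gives ridge (L2)
    plot(cv_lasso)                               # cf. Figure 4.2
    plot(cv_ridge)                               # cf. Figure 4.3

    # Coefficients at the CV-selected lambda, side by side (cf. Table 4.10)
    data.frame(OLS = round(coef(ols), 5),
               RR  = round(as.vector(coef(cv_ridge, s = "lambda.min")), 5),
               LR  = round(as.vector(coef(cv_lasso, s = "lambda.min")), 5))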
Table 4.10: Regression Coefficients for n=25

n=25        Estimate(OLS)   Estimate(RR)   Estimate(LR)
Intercept   0.00084         0.00407        0.00096
X1          -0.01339        0.18021        0.00000
X2          0.30626         0.20519        0.29481
X3          0.17686         0.22241        0.17890
X4          0.57121         0.38143        0.55879

The equation of the fitted OLS model is

ŶOLS = 0.00084 − 0.01339X1 + 0.30626X2 + 0.17686X3 + 0.57121X4    (4.1)

The equation of the fitted RR model is

ŶRR = 0.00407 + 0.18021X1 + 0.20519X2 + 0.22241X3 + 0.38143X4    (4.2)

The equation of the fitted Lasso model is

ŶLR = 0.00096 + 0.29481X2 + 0.17890X3 + 0.55879X4    (4.3)

Table 4.11: Regression Coefficients for n=50

n=50        Estimate(OLS)   Estimate(RR)   Estimate(LR)
Intercept   0.00064         0.00176        0.00088
X1          -0.06785        0.14878        0.00000
X2          0.29021         0.19981        0.24509
X3          0.21466         0.25636        0.22180
X4          0.60932         0.41149        0.57280

The equation of the fitted OLS model is

ŶOLS = 0.00064 − 0.06785X1 + 0.29021X2 + 0.21466X3 + 0.60932X4    (4.4)

The equation of the fitted RR model is

ŶRR = 0.00176 + 0.14878X1 + 0.19981X2 + 0.25636X3 + 0.41149X4    (4.5)

The equation of the fitted Lasso model is

ŶLR = 0.00088 + 0.24509X2 + 0.22180X3 + 0.57280X4    (4.6)

Table 4.12: Regression Coefficients for n=200

n=200       Estimate(OLS)   Estimate(RR)   Estimate(LR)
Intercept   -0.00426        -0.00604       -0.00529
X1          -0.13916        0.09565        0.00000
X2          0.33910         0.21952        0.23535
X3          0.20930         0.27964        0.24147
X4          0.61502         0.39904        0.54228

The equation of the fitted OLS model is

ŶOLS = −0.00426 − 0.13916X1 + 0.33910X2 + 0.20930X3 + 0.61502X4    (4.7)

The equation of the fitted RR model is

ŶRR = −0.00604 + 0.09565X1 + 0.21952X2 + 0.27964X3 + 0.39904X4    (4.8)

The equation of the fitted Lasso model is

ŶLR = −0.00529 + 0.23535X2 + 0.24147X3 + 0.54228X4    (4.9)

Table 4.13: Regression Coefficients for n=1000

n=1000      Estimate(OLS)   Estimate(RR)   Estimate(LR)
Intercept   -0.00951        0.00234        0.00238
X1          -0.19151        0.11208        0.00000
X2          0.38046         0.21660        0.26348
X3          0.17039         0.26522        0.20393
X4          0.66239         0.40529        0.55640

The equation of the fitted OLS model is

ŶOLS = −0.00951 − 0.19151X1 + 0.38046X2 + 0.17039X3 + 0.66239X4    (4.10)

The equation of the fitted RR model is

ŶRR = 0.00234 + 0.11208X1 + 0.21660X2 + 0.26522X3 + 0.40529X4    (4.11)

The equation of the fitted Lasso model is

ŶLR = 0.00238 + 0.26348X2 + 0.20393X3 + 0.55640X4    (4.12)

Table 4.14: Eigenvalues for the Independent Variables

Variable      X1       X2       X3       X4
Eigenvalues   0.1376   0.1105   0.0370   0.0062

Table 4.14 shows the eigenvalues of the design matrix of the simulated dataset. It can be observed that the ratio of the maximum to the minimum eigenvalue is large, which indicates severe multicollinearity among the explanatory variables.

Figure 4.4 shows the ridge trace plots of the coefficients against log λ, where λ represents the shrinkage parameter.

Figure 4.4: Ridge Trace Plot for Simulated Dataset
(a) n=25  (b) n=50  (c) n=200  (d) n=1000
X1 − black, X2 − green, X3 − red and X4 − blue

From the diagrams in Figure 4.4, it can be seen that the coefficient values shrink towards zero as the λ values increase, across all sample sizes. However, all four independent variables are retained in the model over some range of λ values. The shrinkage effect depends on the significance of the independent variables: the least significant variables shrink fastest. When λ is zero, the ridge solution coincides with the OLS solution.
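The eigenvalue diagnostic and the trace plots could be reproduced along the following lines; taking the eigenvalues from the correlation matrix of the predictors is an assumption, since the thesis does not state which cross-product matrix it decomposed.

    ev <- eigen(cor(x))$values   # eigenvalues of the predictor correlation matrix
    ev                           # cf. Table 4.14
    max(ev) / min(ev)            # a large ratio signals severe multicollinearity

    plot(glmnet(x, sim$Y, alpha = 0), xvar = "lambda")  # ridge trace (cf. Figure 4.4)
    plot(glmnet(x, sim$Y, alpha = 1), xvar = "lambda")  # lasso paths (cf. Figures 4.5 and 4.6)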
Figure 4.5 shows the Lasso plots of the coefficients against the λ values for the different sample sizes.

Figure 4.5: Lasso Plot of Independent Variables
(a) n=25  (b) n=50  (c) n=200  (d) n=1000
X1 − black, X2 − green, X3 − red and X4 − blue

From the diagrams in Figure 4.5, it can be seen that the coefficients shrink as the shrinkage parameter λ increases. The topmost part of each graph shows the number of independent variables retained in the model at the given λ value. For instance, it can be seen that the lasso regression model has four independent variables at λ = 1.0.

The diagrams in Figure 4.6 show the Lasso plots of the coefficients against the λ values for different sample sizes.

Figure 4.6: Lasso Plot of Coefficients against Lambda
(a) n=50  (b) n=200
X1 − black, X2 − green, X3 − red and X4 − blue

From these diagrams it can be seen that the number of predictor variables retained in the model varies with the value of λ. Lasso regression in this case performs both shrinkage and parameter elimination.

The diagrams below show the effect of shrinkage on the independent variables for the different sample sizes.

Figure 4.7: Shrinkage Cross Validation Diagrams
(a) n=25  (b) n=50  (c) n=200  (d) n=1000

From Figure 4.7, it can be seen that each of the Lasso regression models for the different sample sizes retains three predictor variables (X2, X3, X4), as was seen in our Lasso models. The spread between the coefficients decreases considerably as the sample size is increased from 25 to 50. However, it increases again as the sample size is increased to 200, and the most rigorous shrinkage occurs when the sample size is further increased to 1000. From the correlation matrices, the correlations between the variables at n=50 were higher than those at n=200. This means that the higher the correlations, the greater the shrinkage and hence the smaller the coefficients. The study therefore concludes that shrinkage depends on the correlations and not on the sample size.

Table 4.15 shows the shrinkage parameters for RR and LR after cross-validation.

Table 4.15: Shrinkage Parameters for Ridge and Lasso Regression

Sample size (n)   RR λ1SE   RR λmin   LR λ1SE   LR λmin
25                0.1774    0.0925    0.0471    0.0067
50                0.1446    0.0971    0.0384    0.0066
200               0.1305    0.0987    0.0288    0.0065
1000              0.1063    0.1063    0.0195    0.0070

Here λ1SE and λmin bracket the range of λ values for which the regularized solutions have a smaller MSE than the OLS solution, λ1SE being the maximum and λmin the minimum. λ1SE decreases as the sample size increases, whereas λmin tends to increase with the sample size. The minimum value is always chosen as our shrinkage parameter. It can be observed that all the λmin values of the lasso are smaller than the λmin values of the ridge solutions in cross-validation.
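A sketch of how a table like Table 4.15 could be assembled, where sim_data() is a hypothetical wrapper around the data-generation code of the first sketch:

    sim_data <- function(n) {
      S <- matrix(0.9, 4, 4); diag(S) <- 1           # assumed correlation level
      X <- MASS::mvrnorm(n, rep(0, 4), S)
      list(x = X, y = as.vector(X %*% c(0.1, 0.3, 0.2, 0.6)) + rnorm(n, sd = 0.1))
    }
    t(sapply(c(25, 50, 200, 1000), function(n) {
      d <- sim_data(n)
      cr <- glmnet::cv.glmnet(d$x, d$y, alpha = 0)   # ridge
      cl <- glmnet::cv.glmnet(d$x, d$y, alpha = 1)   # lasso
      c(n = n, RR_1se = cr$lambda.1se, RR_min = cr$lambda.min,
        LR_1se = cl$lambda.1se, LR_min = cl$lambda.min)
    }))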
Table 4.16 shows the mean square errors and mean absolute errors across the four different sample sizes for the three estimators.

Table 4.16: MSEs and MAEs for OLS, RR and LR

n      MSE(OLS)   MAE(OLS)   MSE(RR)   MAE(RR)   MSE(Lasso)   MAE(Lasso)
25     2.1050     1.2309     0.0102    0.0845    0.0070       0.0681
50     2.3977     1.2497     0.0121    0.0876    0.0070       0.0646
200    2.3237     1.2482     0.0144    0.0955    0.0092       0.0788
1000   2.3309     1.2339     0.0162    0.1016    0.0104       0.0813

From Table 4.16, RR has smaller mean absolute errors (MAE) than OLS, and RR also has smaller MSEs of the regression coefficients than OLS. Lasso regression, in turn, outperforms both OLS and RR in both categories. Consequently, ridge and lasso regression methods are better than OLS when the multicollinearity problem exists in a dataset, with Lasso being the best in this case for the four-predictor-variable model.

4.4 Standard Errors of the Regression Coefficients

The standard error of a β̂ estimate is a measure of how consistent β̂ would be under repeated re-sampling; it measures the sampling variation in estimating β, according to Ludvigson and Ng (2009). Tables 4.17, 4.18, 4.19 and 4.20 show the standard errors of the estimates for the simulated dataset across the different sample sizes.

Table 4.17: Standard Errors for n=25

n=25   OLS       RR        LR
X1     0.08425   0.00761   0.00887
X2     0.07152   0.00489   0.01881
X3     0.07540   0.00213   0.00294
X4     0.06732   0.00546   0.02395

Table 4.18: Standard Errors for n=50

n=50   OLS       RR        LR
X1     0.04776   0.00337   0.00746
X2     0.04782   0.00239   0.01947
X3     0.04765   0.00148   0.00677
X4     0.03664   0.00276   0.01850

Table 4.19: Standard Errors for n=200

n=200  OLS       RR        LR
X1     0.04776   0.00337   0.00746
X2     0.04782   0.00239   0.01947
X3     0.04765   0.00148   0.00677
X4     0.03664   0.00276   0.01850

Table 4.20: Standard Errors for n=1000

n=1000 OLS       RR        LR
X1     0.00988   0.00021   0.02127
X2     0.01110   0.00017   0.02383
X3     0.01086   0.00013   0.00419
X4     0.00887   0.00018   0.02082

Figure 4.8: Standard Errors of the Regression Coefficients across Different Sample Sizes

From Figure 4.8, it can be observed that the β coefficients of RR have the smallest standard errors across all sample sizes. The standard errors of the OLS coefficients decrease as the sample size increases. For a very large sample size (n=1000), the most significant OLS variable has a coefficient standard error smaller than the corresponding LR coefficient standard error.

4.5 Performance of OLS, Ridge and Lasso Estimators at Different Correlation Coefficients for Two Predictor Variables

A simulation study was conducted using 200 samples with two predictor variables. The sample size was kept constant while the correlation coefficient between the predictors was varied between 0 and 1, to determine the extent to which ridge and lasso outperform ordinary least squares under the MSE criterion. A sketch of this simulation is given below.
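This sketch assumes, following the abstract, that the reported MSE is the error between the true coefficients and their estimates; the true coefficient vector and the noise level are illustrative assumptions.

    set.seed(42)
    beta2 <- c(1, 1)                                  # hypothetical true coefficients
    mse_at_r <- function(r, n = 200) {
      X <- MASS::mvrnorm(n, c(0, 0), matrix(c(1, r, r, 1), 2, 2))
      y <- as.vector(X %*% beta2) + rnorm(n)
      b_ols <- coef(lm(y ~ X))[-1]                    # drop the intercept
      b_rr  <- as.vector(coef(glmnet::cv.glmnet(X, y, alpha = 0), s = "lambda.min"))[-1]
      b_lr  <- as.vector(coef(glmnet::cv.glmnet(X, y, alpha = 1), s = "lambda.min"))[-1]
      c(r = r, OLS = mean((b_ols - beta2)^2),
        RIDGE = mean((b_rr - beta2)^2), LASSO = mean((b_lr - beta2)^2))
    }
    t(sapply(seq(0.05, 0.95, by = 0.1), mse_at_r))    # cf. Table 4.21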
The results are summarized in Table 4.21, which shows the MSEs of the three estimators for varying correlation coefficients between the two predictor variables.

Table 4.21: MSEs with Varying Correlation Coefficients for Two Predictor Variables

r12               MSE(OLS)   MSE(RIDGE)   MSE(LASSO)
0 < r12 < 0.1     1.08806    1.09098      1.09109
0.1 < r12 < 0.2   1.27009    0.97483      0.97115
0.2 < r12 < 0.3   1.20087    0.94655      0.94516
0.3 < r12 < 0.4   1.38027    0.90043      0.89886
0.4 < r12 < 0.5   1.48580    0.83960      0.83852
0.5 < r12 < 0.6   1.60904    0.71552      0.71463
0.6 < r12 < 0.7   1.74448    0.57992      0.57905
0.7 < r12 < 0.8   1.87089    0.45410      0.45288
0.8 < r12 < 0.9   2.03481    0.29197      0.29015
0.9 < r12 < 1     2.18323    0.14604      0.14367

From Table 4.21, it can be seen that ridge and lasso regression have smaller MSEs than OLS whenever the correlation between the two predictor variables exceeds 0.1. Between 0 and 0.1, however, OLS has a smaller MSE than the proposed shrinkage techniques. Hence we can say that OLS is best when the predictors are nearly uncorrelated.

4.6 Application of L1-L2 Regularization to Bodyfat Data

L1-L2 regularization was compared with ordinary least squares on the bodyfat dataset. The sample sizes used were 25 (a small sample), 100 and 200 (large samples). This thesis investigated the behavior of these estimators as the sample size increases, focusing on the effect of increasing sample size on the significance and stability of the regression estimates. We also compared the MSEs and MAEs of the estimators as the sample size was increased. The results are shown and discussed below.

Figure 4.9 shows the matrix plot of the bodyfat data for the different sample sizes.

Figure 4.9: Matrix Plot of Predictor Variables of Bodyfat Data
(a) n=25  (b) n=100  (c) n=200

The data points become very compact as the sample size increases, and a positive linear relationship exists among the variables.

Tables 4.22 to 4.24 display the Pearson product-moment correlation coefficient between each pair of predictor variables. It measures the strength of the linear relationship between two variables and lies between -1 and 1.

Table 4.22: Correlation Matrix for the Bodyfat Data, n=25

n=25      weight   neck   chest   abdomen   hip    thigh   knee   biceps
weight    1        0.80   0.83    0.70      0.90   0.79    0.77   0.86
neck               1      0.72    0.54      0.69   0.69    0.54   0.75
chest                     1       0.75      0.79   0.72    0.65   0.70
abdomen                           1         0.80   0.78    0.79   0.58
hip                                         1      0.81    0.73   0.77
thigh                                              1       0.67   0.84
knee                                                       1      0.59
biceps                                                            1

Table 4.23: Correlation Matrix for the Bodyfat Data, n=100

n=100     weight   neck   chest   abdomen   hip    thigh   knee   biceps
weight    1        0.85   0.90    0.89      0.94   0.89    0.86   0.82
neck               1      0.79    0.76      0.76   0.75    0.71   0.73
chest                     1       0.93      0.85   0.79    0.76   0.75
abdomen                           1         0.88   0.79    0.77   0.70
hip                                         1      0.91    0.83   0.79
thigh                                              1       0.83   0.79
knee                                                       1      0.72
biceps                                                            1

Table 4.24: Correlation Matrix for the Bodyfat Data, n=200

n=200     weight   neck   chest   abdomen   hip    thigh   knee   biceps
weight    1        0.83   0.89    0.89      0.94   0.87    0.85   0.80
neck               1      0.78    0.75      0.73   0.70    0.67   0.73
chest                     1       0.92      0.83   0.73    0.72   0.73
abdomen                           1         0.87   0.77    0.74   0.68
hip                                         1      0.90    0.82   0.74
thigh                                              1       0.80   0.76
knee                                                       1      0.68
biceps                                                            1

From Table 4.24, it can be seen that the correlations are very high and the independence assumption has been violated. This is an indication that multicollinearity might exist in our dataset.
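The thesis does not state the source of its bodyfat data. One public copy of the classical body fat dataset ships with the R package mfp as bodyfat, which is used below purely for illustration; the column holding percent body fat is assumed here to be siri, and the first 25 rows stand in for the small-sample subset.

    library(mfp)
    data(bodyfat)
    vars  <- c("weight", "neck", "chest", "abdomen", "hip", "thigh", "knee", "biceps")
    sub25 <- bodyfat[1:25, ]                        # assumed n = 25 subset
    round(cor(sub25[, vars]), 2)                    # cf. Table 4.22
    fit25 <- lm(siri ~ ., data = sub25[, c("siri", vars)])
    summary(fit25)                                  # cf. Table 4.25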
Table 4.25 shows the OLS regression output of the bodyfat dataset for a sample size of 25.

Table 4.25: OLS Output for n=25

n=25        Coeff      Std. Error   t-value   P-value
Intercept   -81.2625   47.8353      -1.699    0.1087
Weight      -0.3836    0.1581       -2.426    0.0274
Neck        -0.7064    0.6411       -1.102    0.2868
Chest       -0.1571    0.2816       -0.588    0.5847
Abdomen     0.7510     0.2910       2.581     0.0201
Hip         0.6870     0.5094       1.349     0.1962
Thigh       0.1247     0.6101       0.204     0.8406
Knee        1.0783     0.9054       1.191     0.2510
Biceps      0.6944     0.7265       0.956     0.3534

F Statistic: F8,16 = 5.82, P-value = 0.001398

From the OLS output for n=25, most of the independent variables are not significant at an α level of 0.05, yet the F-statistic for the overall model is significant. This is due to multicollinearity among the independent variables; only two variables are significant at α = 0.05.

Table 4.26 shows the regression output of the bodyfat dataset for a sample size of 100.

Table 4.26: OLS Output for n=100

n=100       Coeff      Std. Error   t-value   P-value
Intercept   -43.6160   19.1798      -2.274    0.0253
Weight      -0.2095    0.0614       -3.413    0.0010
Neck        -0.6319    0.3177       -1.989    0.0497
Chest       -0.1438    0.1506       -0.955    0.3419
Abdomen     1.1128     0.1119       9.943     0.0000
Hip         -0.3628    0.1990       -1.823    0.0715
Thigh       0.4970     0.2064       2.408     0.0180
Knee        0.2897     0.3700       0.783     0.4354
Biceps      0.0643     0.2419       0.266     0.7910

F8,91 = 44.2, P-value = 0.0000

From Tables 4.25, 4.26 and 4.27, it can be observed that the number of significant variables has increased with the sample size, and the overall model has become more significant. This implies that collecting additional samples can increase the significance of variables in a regression analysis, making the regression model more stable.

Table 4.27 shows the regression output of the bodyfat dataset for n=200.

Table 4.27: OLS Output for n=200

n=200       Coeff      Std. Error   t-value   P-value
Intercept   -35.4794   12.9555      -2.739    0.0066
Weight      -0.1433    0.0443       -3.232    0.0014
Neck        -0.5700    0.2186       -2.607    0.0097
Chest       0.0274     0.0977       0.281     0.7791
Abdomen     1.0253     0.0753       13.617    0.0000
Hip         -0.2081    0.1429       -1.456    0.1467
Thigh       0.2454     0.1312       1.870     0.0627
Knee        0.0448     0.2292       0.196     0.8451
Biceps      0.2712     0.1657       1.642     0.1018

F8,191 = 83.88, P-value = 0.0000

It can be observed that the number of significant variables did not change between n=100 and n=200, since samples larger than 30 are already considered large. However, the standard errors reduced as the sample size increased. This implies that large samples produce more stable estimates, and that the multicollinearity problem can only get worse in a normally distributed population when the sample size is small.

Table 4.28 shows the MSEs of the bodyfat data across the three different sample sizes.

Table 4.28: MSEs Across Three Different Sample Sizes

Sample size (n)   MSE(OLS)   MSE(RR)     MSE(LR)
25                11.03316   0.250453    0.2524211
100               16.52934   0.2605047   0.2028121
200               18.54474   0.3058488   0.2663728

Table 4.29 shows the MAEs of the bodyfat data across the three different sample sizes.

Table 4.29: MAEs Across Three Different Sample Sizes

Sample size (n)   MAE(OLS)   MAE(RR)     MAE(LR)
25                2.776839   0.4399902   0.4261684
100               3.40597    0.4292737   0.3774634
200               3.547428   0.4487484   0.4240274

From Table 4.29, the mean absolute errors of the regularized estimators are smaller than those of OLS across the three sample sizes used in this experiment, with Lasso regression producing the smallest MAE among the competing estimators. Hence, L1-L2 regularization techniques are better alternatives to OLS under multicollinearity conditions, with Lasso being a better alternative than ridge regression.
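Tables 4.28 and 4.29 do not state exactly how the errors were computed; the sketch below shows one plausible in-sample version, reusing sub25 and fit25 from the sketch above.

    xb <- as.matrix(sub25[, vars])
    yb <- sub25$siri
    cvr <- glmnet::cv.glmnet(xb, yb, alpha = 0)     # ridge
    cvl <- glmnet::cv.glmnet(xb, yb, alpha = 1)     # lasso
    pred <- cbind(OLS = fitted(fit25),
                  RR  = as.vector(predict(cvr, xb, s = "lambda.min")),
                  LR  = as.vector(predict(cvl, xb, s = "lambda.min")))
    apply(pred, 2, function(p) mean((yb - p)^2))    # MSEs (cf. Table 4.28)
    apply(pred, 2, function(p) mean(abs(yb - p)))   # MAEs (cf. Table 4.29)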
Table 4.30: Standard Errors of Regression Coefficients for n=25

n=25      OLS      RR       LR
weight    0.1581   0.0225   0.0524
neck      0.6411   0.0032   0.0311
chest     0.2816   0.0020   0.0638
abdomen   0.2910   0.0002   0.0049
hip       0.5094   0.0102   0.2470
thigh     0.6101   0.0055   0.0481
knee      0.9054   0.0050   0.0178
biceps    0.7265   0.0095   0.0636

Table 4.31: Standard Errors of Regression Coefficients for n=100

n=100     OLS      RR       LR
weight    0.2220   0.0031   0.2281
neck      0.0929   0.0002   0.0475
chest     0.1500   0.0012   0.0362
abdomen   0.1457   0.0039   0.0402
hip       0.1833   0.0016   0.0915
thigh     0.1329   0.0016   0.0882
knee      0.0986   0.0003   0.0172
biceps    0.0881   0.0004   0.0040

Table 4.32: Standard Errors of Regression Coefficients for n=200

n=200     OLS      RR       LR
weight    0.0465   0.1509   0.0478
neck      0.2294   0.0216   0.1142
chest     0.1043   0.1358   0.1451
abdomen   0.0801   0.0016   0.4020
hip       0.1532   0.0682   0.0920
thigh     0.1445   0.0367   0.0541
knee      0.2671   0.0335   0.0461
biceps    0.1783   0.0636   0.0762

Chapter 5

DISCUSSIONS, CONCLUSIONS AND RECOMMENDATIONS

5.1 Introduction

This chapter discusses the research findings, makes inferences based on the outputs and offers recommendations for further studies.

5.2 Discussions and Conclusions

In this research, the researcher addressed the multicollinearity problem, methods of detecting it and its effect on the results of a multiple regression model. In the simulation study, the high correlations among the covariates caused multicollinearity. The matrix plot showed a linear relationship between the response and each of the predictor variables, so the researcher was able to fit a multiple linear regression to the data using OLS. The coefficient of X1 was negative even though it was positively correlated with Y, as depicted in Table 4.6, and the intercept was very small. This was the first indication that multicollinearity might exist in the dataset, even though the standard errors were not especially high. This procedure was repeated for different sample sizes (n=25, n=50, n=200 and n=1000). The standard errors of the predictors reduced as the sample size increased, and the significance of the predictor variables increased as well. Even though some predictors would not have been significant at, say, α = 0.05, the overall model was always significant. This was a second indication that the predictor variables were collinear, even without looking at the correlation matrix. A more rigorous check for multicollinearity (the VIF) was adopted to affirm the suspicion of collinear variables. From Table 4.9 of VIFs, it could be observed that most of the VIFs of the predictor variables were greater than ten, which indicates multicollinearity according to the rule of thumb.

L1 and L2 regularization techniques were adopted to solve the problem of collinearity in the simulated dataset. To determine which of these techniques produced a more efficient and stable model while also reducing the standard errors of the regression coefficients, the study compared the MSEs and standard errors of OLS, L1 and L2 regularization. The regularization (shrinkage) techniques penalized the coefficients of the regression model towards zero. The least significant variables shrank fastest, as seen in the trace plots of Figure 4.4. In Figure 4.6 (a lasso plot of the coefficients against the shrinkage constant) it was observed that the variable X1 was dropped while the other variables experienced shrinkage across the different sample sizes (also seen in equations 4.3, 4.6, 4.9 and 4.12).
This affirms the parameter-elimination property of lasso regression discussed in Chapter 3. Cross-validation was used to select the optimal λ value (shrinkage constant). From Figures 4.2 and 4.3 it was observed that there is a range of λ values for which the MSEs of the regularized estimators are less than the MSE of OLS; the minimum of these lambdas was chosen as the optimal value after the cross-validation of Table 4.15. From Table 4.16 of MSEs, it was observed that L1 produced the smallest MSEs compared with L2 and OLS. The smaller the MSE of an estimator, the smaller the prediction error. This means that unbiasedness, though very important, should not be the ultimate criterion when selecting between competing estimators. Likewise, L1 and L2 had the smallest MAEs across all sample sizes for the collinear simulated dataset. This means that, in the presence of multicollinearity, regularized regression approaches are more efficient. However, in moderation studies L2 regression should be chosen over the L1 approach, since ridge regression does not eliminate parameters. With respect to the standard errors, RR performed best across all sample sizes. The OLS standard errors decreased as the sample size increased. The Lasso had the better of OLS in Figure 4.8, but for a very large sample size (n=1000) the most significant OLS variable had a coefficient standard error smaller than the corresponding Lasso value.

The level of collinearity at which OLS outperforms L1 and L2 regularization in a two-predictor-variable model was investigated by simulation, with the results summarized in Table 4.21. It was found that, for all levels of correlation above 0.1 between the two predictor variables, ridge and lasso produced smaller MSEs than OLS, with Lasso attaining the minimum MSE. However, for correlations between 0 and 0.1, the OLS estimator had a smaller MSE than the ridge and lasso estimators. Hence the study affirms that OLS is best when the predictors are uncorrelated.

In the analysis summarized in Table 4.15, the smaller the shrinkage parameter, the better the estimator. The lasso method produced shrinkage parameters that were smaller than the ridge shrinkage values across all sample sizes.

From our simulated data, the relationship between increasing sample size and correlation was irregular. It was observed that the sample size does not always affect the degree of collinearity; however, it affects the estimated values, as seen in the application to the bodyfat dataset. Whenever the sample size increases, the results of the estimation methods become more stable.

The following conclusions were made based on the above discussion:

• L1 and L2 regularization techniques help to reduce the standard errors of the regression coefficients as well as the prediction error of the fitted model.

• L1 regression is best and produces parsimonious models in the presence of multicollinearity.

• The higher the degree of multicollinearity, the smaller the shrinkage parameter. This means there is an optimal value of λ for every change in the dataset values, which makes λ a random variable.

• Increasing the sample size gives more stable outcomes after estimation, as it helps to reduce the standard errors of the regression coefficients of the predictor variables.

• L2 regularization would be the best alternative in moderation studies where we would like to keep all of the predictor variables.
It is also the best technique to employ when the standard errors of the OLS regression coefficients are highly inflated.

• OLS is best for independent samples, but the modern regression approaches (L1 and L2) should be embraced for correlated covariates.

5.3 Recommendations

From the findings of the study, we recommend that, in the presence of high multicollinearity in a dataset, the best approach is L2, since it yields smaller standard errors. However, L1 regularization yielded the smallest MSE across all sample sizes, and hence a combination of the two penalties applied concurrently, such as the elastic net of Ogutu et al. (2012), could also be investigated to measure its performance. Further research should focus on the level of multicollinearity of the data and obtain the best robust approach for each level.

REFERENCES

Belsley, D. A. (2004). Conditioning diagnostics. Encyclopedia of Statistical Sciences, 2.

Chong, I.-G. and Jun, C.-H. (2005). Performance of some variable selection methods when multicollinearity is present. Chemometrics and Intelligent Laboratory Systems, 78(1-2):103–112.

Duzan, H. and Shariff, N. S. B. M. (2015). Ridge regression for solving the multicollinearity problem: review of methods and models. Journal of Applied Sciences, 15(3):392.

Filzmoser, P. and Croux, C. (2003). Dimension reduction of the explanatory variables in multiple linear regression. Pliska Studia Mathematica Bulgarica, 14(1):59–70.

Grewal, R., Cote, J. A., and Baumgartner, H. (2004). Multicollinearity and measurement error in structural equation models: Implications for theory testing. Marketing Science, 23(4):519–529.

Hauser, D. (1974). Some problems in the use of stepwise regression techniques in geographical research. Canadian Geographer/Le Géographe canadien, 18(2):148–158.

Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.

Johnson, R. A. and Wichern, D. W. (2004). Multivariate analysis. Encyclopedia of Statistical Sciences, 8.

Jun, C.-H., Lee, S.-H., Park, H.-S., and Lee, J.-H. (2009). Use of partial least squares regression for variable selection and quality prediction. In Computers & Industrial Engineering (CIE 2009), pages 1302–1307. IEEE.

Kaufinger, G. G. (2013). Earnings management motivations in gift card breakage recognition decisions. Anderson University.

Levy, G., Louis, E. D., Cote, L., Perez, M., Mejia-Santana, H., Andrews, H., Harris, J., Waters, C., Ford, B., Frucht, S., et al. (2005). Contribution of aging to the severity of different motor signs in Parkinson disease. Archives of Neurology, 62(3):467–472.

Ludvigson, S. C. and Ng, S. (2009). A factor analysis of bond risk premia. Technical report, National Bureau of Economic Research.

Madsen, H. and Thyregod, P. (2010). Introduction to General and Generalized Linear Models. CRC Press.

Midi, H., Bagheri, A., and Imon, A. (2010). The application of robust multicollinearity diagnostic method based on robust coefficient determination to a non-collinear data. Journal of Applied Sciences, 10(8):611–619.

Mogessie, E. M. and Bekele, G. (2017). Households' willingness to pay for community based health insurance scheme: in Kewiot and Efratanagedem districts of Amhara region, Ethiopia. Business and Economic Research, 7(2):212–233.

O'Brien, R. M. (2007).
A caution regarding rules of thumb for variance inflation factors. Quality & Quantity, 41(5):673–690.

Ogutu, J. O., Schulz-Streeck, T., and Piepho, H.-P. (2012). Genomic selection using regularized linear regression models: ridge regression, lasso, elastic net and their extensions. In BMC Proceedings, volume 6, page S10. BioMed Central.

Orlov, M. L. (1996). Multiple linear regression analysis using Microsoft Excel. Chemistry Department, Oregon State University.

Pourbasheer, E., Aalizadeh, R., Shokouhi Tabar, S., Ganjali, M. R., Norouzi, P., and Shadmanesh, J. (2014). 2D and 3D quantitative structure–activity relationship study of hepatitis C virus NS5B polymerase inhibitors by comparative molecular field analysis and comparative molecular similarity indices analysis methods. Journal of Chemical Information and Modeling, 54(10):2902–2914.

Santos-Cortez, R. L. P., Lee, K., Giese, A. P., Ansar, M., Amin-Ud-Din, M., Rehn, K., Wang, X., Aziz, A., Chiu, I., Hussain Ali, R., et al. (2014). Adenylate cyclase 1 (ADCY1) mutations cause recessive hearing impairment in humans and defects in hair cell function and hearing in zebrafish. Human Molecular Genetics, 23(12):3289–3298.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B (Methodological), pages 111–147.

Stone, M. and Brooks, R. J. (1990). Continuum regression: cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal components regression. Journal of the Royal Statistical Society, Series B (Methodological), pages 237–269.

Wahab, N. S., Rusiman, M. S., Mohamad, M., Azmi, N. A., Him, N. C., Kamardan, M. G., and Ali, M. (2018). A technique of fuzzy c-mean in multiple linear regression model toward paddy yield. In Journal of Physics: Conference Series, volume 995, page 012010. IOP Publishing.

Xiaobo, Z., Jiewen, Z., Povey, M. J., Holmes, M., and Hanpin, M. (2010). Variables selection methods in near-infrared spectroscopy. Analytica Chimica Acta, 667(1-2):14–32.