UNIVERSITY OF GHANA
DEPARTMENT OF STATISTICS AND ACTUARIAL SCIENCE

L1-L2 REGULARIZATION OF COLLINEAR DATA

BY
BOATENG OWUSU-ANSAH (10600402)

THIS THESIS IS SUBMITTED TO THE UNIVERSITY OF GHANA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE AWARD OF THE M.PHIL STATISTICS DEGREE

JULY, 2018

DECLARATION

Candidate's Declaration

I, Boateng Owusu-Ansah, hereby declare that, except for references cited from the work of others, which have been duly acknowledged, this thesis is the result of original research carried out by me and has not been presented in whole or in part elsewhere for another degree.

Signature: ................................ Date: ........................
BOATENG OWUSU-ANSAH (10600402)

Supervisors' Declaration

We hereby certify that this thesis was prepared from the candidate's own work and supervised in accordance with the guidelines on supervision of theses laid down by the University of Ghana.

Signature: ................................ Date: ........................
DR. F. O. METTLE (Principal Supervisor)

Signature: ................................ Date: ........................
DR. ISAAC BAIDOO (Co-Supervisor)

DEDICATION

This thesis is dedicated to the Almighty God for His mercies and provision throughout the years, and to my parents for their support and encouragement.

ABSTRACT

Multiple linear regression analysis may be used to describe the relationship between a response variable and several independent variables. When the predictor variables are highly correlated, the least squares estimates of the regression coefficients are unstable: replicate samples can give widely differing coefficient values. Ridge and lasso regression are regularization techniques that mitigate the effect of this high collinearity on the regression analysis. They produce estimates that are biased but have smaller mean square errors between the coefficients and their estimates. The lasso and ridge trace plots of the coefficients against λ, together with cross validation, are ways of determining the regularization constant λ and the regression coefficients from the data. Ridge and lasso regression therefore lead to a more trustworthy interpretation of the results of multiple regression with highly correlated covariates.

ACKNOWLEDGEMENTS

I would like to thank Dr. F. O. Mettle for his guidance and patience over the years. His door was always open whenever I had questions about my research writing. I am gratefully indebted to him for his very valuable comments on this thesis. I am equally indebted to Dr. Isaac Baidoo for his significant remarks on this thesis.

Contents

Declaration
Dedication
Abstract
Acknowledgements
List of Tables
List of Figures
List of Abbreviations
1 INTRODUCTION
  1.1 Background of the Study
  1.2 Statement of the Problem
  1.3 Objectives of the Study
  1.4 Significance of the Study
  1.5 Scope of the Study
  1.6 Organization of the Study
2 LITERATURE REVIEW
  2.1 Introduction
  2.2 Multiple Linear Regression
  2.3 Multicollinearity
  2.4 Nature of Multicollinearity
  2.5 Sources of Multicollinearity
  2.6 Effects of Multicollinearity
    2.6.1 Tests for Multicollinearity
    2.6.2 Solutions to Multicollinearity
    2.6.3 Dropping Collinear Variables
    2.6.4 Recoding Variables
    2.6.5 Principal Component Regression
    2.6.6 Stepwise Regression
    2.6.7 Regularization Techniques (Ridge and Lasso Regression)
3 METHODOLOGY
  3.1 Introduction
  3.2 Review of Ordinary Least Squares
    3.2.1 Model Performance and Accuracy of the OLS Estimator
  3.3 Ridge Regression
    3.3.1 Properties of the Ridge Estimator
  3.4 Lasso Regression
  3.5 Standard Errors
  3.6 Cross Validation
  3.7 The Use of Monte Carlo Simulation
  3.8 Simulation Design
4 DATA ANALYSIS
  4.1 Introduction
  4.2 Performance of Estimators as Sample Size Increases
  4.3 Lasso, Ridge and OLS Coefficients
  4.4 Standard Errors of the Regression Coefficients
  4.5 Performance of OLS, Ridge and Lasso Estimators at Different Correlation Coefficients for Two Predictor Variables
  4.6 Application of L1-L2 Regularization to Bodyfat Data
5 DISCUSSIONS, CONCLUSIONS AND RECOMMENDATIONS
  5.1 Introduction
  5.2 Discussions and Conclusions
  5.3 Recommendations
REFERENCES

List of Tables

4.1 Correlation Matrix of the Simulated Dataset for n=25
4.2 Correlation Matrix of the Simulated Dataset for n=50
4.3 Correlation Matrix of the Simulated Dataset for n=200
4.4 Correlation Matrix of the Simulated Dataset for n=1000
4.5 OLS Regression Output of the Simulated Dataset for n=25
4.6 OLS Regression Output of the Simulated Dataset for n=50
4.7 OLS Regression Output of the Simulated Dataset for n=200
4.8 OLS Regression Output of the Simulated Dataset for n=1000
4.9 VIFs of the Simulated Dataset for Different Sample Sizes
4.10 Regression Coefficients for n=25
4.11 Regression Coefficients for n=50
4.12 Regression Coefficients for n=200
4.13 Regression Coefficients for n=1000
4.14 Eigenvalues for the Independent Variables
4.15 Shrinkage Parameters for Ridge and Lasso Regression
4.16 MSEs and MAEs for OLS, RR and LR
4.17 Standard Errors for n=25
4.18 Standard Errors for n=50
4.19 Standard Errors for n=200
4.20 Standard Errors for n=1000
4.21 MSEs with Varying Correlation Coefficients for Two Predictor Variables
4.22 Correlation Matrices for Bodyfat Data
4.23 Correlation Matrices for Bodyfat Data
4.24 Correlation Matrices for Bodyfat Data
4.25 OLS Output for n=25
4.26 OLS Output for n=100
4.27 OLS Output for n=200
4.28 MSEs Across Three Different Sample Sizes
4.29 MAEs Across Three Different Sample Sizes
4.30 Standard Errors of Regression Coefficients for n=25
4.31 Standard Errors of Regression Coefficients for n=100
4.32 Standard Errors of Regression Coefficients for n=200

List of Figures

3.1 Data Partitioning for Cross Validation
4.1 Scatter Plot of Simulated Data for Different Sample Sizes
4.2 Cross Validation Diagrams for Lasso Regression
4.3 Cross Validation Diagrams for Ridge Regression
4.4 Ridge Trace Plot for Simulated Dataset
4.5 Lasso Plot of Independent Variables
4.6 Lasso Plot of Coefficients Against Lambda
4.7 Shrinkage Cross Validation Diagrams
4.8 Standard Errors of Regression Coefficients Across Different Sample Sizes
4.9 Matrix Plot of Predictor Variables of Bodyfat Data

List of Abbreviations

LS    Least Squares
OLS   Ordinary Least Squares
RR    Ridge Regression
LR    Lasso Regression
MSE   Mean Square Error
MAE   Mean Absolute Error
PCA   Principal Component Analysis
PCR   Principal Component Regression
RMSE  Root Mean Square Error
BLUE  Best Linear Unbiased Estimator

Chapter 1
INTRODUCTION

1.1 Background of the Study

One of the main precautions in fitting a statistical model is controlling underfitting and reducing overfitting. A good statistical model should exhibit the following properties.

• Stability: small changes in the data should not produce large differences in the predicted outcomes.
• Model performance: the model should give accurate predictions.
• Interpretability: the model should be easy to use and explain. We often seek a model with a small number of predictor variables that still gives accurate predictions.
• Low bias: we often seek a model that is unbiased, so that the values estimated from the model are approximately equal to the true population parameters.

Least squares is one of the most popular statistical modeling techniques for fitting linear models. This technique, known to give the best linear unbiased estimator (BLUE), falls short of these model fitting goals under multicollinearity conditions. Regularization techniques are proposed remedies for the resulting prediction error: they introduce a small amount of bias to gain a model with a lower MSE, and hence one that is more stable and precise in its predictions. This thesis deals with the theory of multicollinearity as well as with ways that have been proposed to detect and correct it. The study compares the L1 (lasso regression) and L2 (ridge regression) regularization techniques to ordinary least squares using the MSE criterion. Ridge regression, first proposed by Hoerl and Kennard (1970), has proven to be a useful technique for handling the multicollinearity effect in multiple linear regression models. The thesis presents the ridge and lasso estimators and their properties, as well as ways of selecting the regularization constant.

1.2 Statement of the Problem

Multiple regression analysis is a statistical technique for determining the effects of several predictor variables on a dependent variable and for forecasting. Strong correlation between the predictors and the dependent variable is desirable, as opposed to high correlations among the independent variables themselves. High correlation among two or more predictors introduces a statistical problem known as multicollinearity, which inflates the standard errors of the regression model. According to Wahab et al. (2018), the presence of multicollinearity can render some predictor variables statistically insignificant when they ought to be significant, and vice versa. Hence, meaningful interpretations and conclusions cannot be drawn from a regression analysis when multicollinearity is present, and the predictive power of the fitted regression model is reduced. Statisticians have developed many approaches for detecting and solving problems associated with multicollinearity in regression analysis.
Variance inflation factors (VIF) and tolerance are commonly used to identify variables contributing to multicollinearity, although different desirable thresholds of the VIF have been proposed across studies. A general rule of thumb, according to O'Brien (2007), is that the unbiased ordinary least squares (OLS) estimates of the regression coefficients may not be desirable when the VIF values of the predictors are greater than 10. This suggests that other, more appropriate regression models are necessary for predicting the dependent variable in the presence of high multicollinearity. In addition, sample size can contribute to the problem of standard errors, since small samples inflate standard errors relative to large samples, as proposed by Mogessie and Bekele (2017).

This study therefore sought to address the problems of multicollinearity in regression modeling by adopting and comparing the L1 and L2 regularization methods across different sample sizes, using simulated data and a real-life dataset with multicollinearity.

1.3 Objectives of the Study

This research makes a comparative analysis of the L1 and L2 shrinkage methods when multicollinearity is present in a dataset. The goal is to find the estimator that minimizes the mean square error and standard error for a collinear dataset. Specifically, the study aims:

• to use L1 and L2 regularization to solve the multicollinearity problem;
• to determine the effect of increasing the sample size of a multicollinear dataset;
• to compare the performance of OLS and L1-L2 shrinkage on multicollinear data with four predictor variables;
• to assess the performance of OLS and L1-L2 regularization at different correlation coefficients for two predictor variables using the MSE criterion.

1.4 Significance of the Study

The property of minimum variance is not destroyed by multicollinearity: LS estimators have the minimum variance in the class of linear unbiased estimators, that is, they are the most efficient. This does not imply, however, that the variance of an OLS estimator will be small in any given sample. Multicollinearity is a sample phenomenon in the sense that predictors may be correlated in the sample at hand even if they are not linearly dependent in the population. In postulating the theoretical regression function, we assume that all the predictor variables (X) are independent and that each has a separate impact on the response variable Y in a multiple linear regression. If the X variables in a given sample are highly collinear, we cannot gain insight into their separate influences on the response variable Y. Ridge and lasso regression are shrinkage techniques that help us control the weights of the regression coefficients when the explanatory variables are linearly dependent. Regularization is useful when we know the estimates should not be too large, and it allows the problem to be optimized when it otherwise could not be if X^T X is singular. This research shows the effect of shrinkage regression on multicollinear data. It also shows the effect of collecting additional samples from a highly homogeneous population when multicollinearity exists, and it helps us understand the degree of collinearity for which OLS should be preferred over shrinkage regression.
This research identifies the estimation technique that gives the least mean square error and the smallest standard errors under multicollinearity conditions for a given number of predictor variables.

1.5 Scope of the Study

This research focuses on solving the multicollinearity problem in multiple linear regression using the shrinkage regression techniques L1 (lasso regression) and L2 (ridge regression). The study sets the tone with a review of the least squares estimator and the difficulties of using least squares when multicollinearity is present in a dataset, and it assesses and identifies the indications and effects of multicollinearity. Ridge and lasso estimators are shrinkage estimators normally used when there are more predictors than observations in a dataset. The thesis applies these shrinkage estimators to a simulated dataset to compare their performance as the sample size increases, and investigates the effect of sample size on the covariance of the predictor variables. The shrinkage parameter λ, how it can be derived, and its effect on linear regression models are also investigated. A comparison is made between L1 and L2 to determine which performs better on multicollinear data under the MSE criterion. The study also investigates which regression technique minimizes the standard errors in a multicollinear dataset.

1.6 Organization of the Study

This study gives a broad overview of ridge and lasso regression and their application as alternatives to ordinary least squares in the presence of multicollinearity. The first chapter contains the background of the study, the statement of the problem, the objectives of the study and its significance, and the scope and organization of the study. Chapter 2 reviews the empirical and theoretical literature pertaining to least squares and multicollinearity, and some of the proposed ways of dealing with ill-conditioned data. Chapter 3 deals with the research methodology, focusing on the theory of L1 and L2 regularization; the study compares the MSEs and standard errors of these two regression techniques at different values of the covariance and different sample sizes, which reveals the level of collinearity at which ridge and lasso regression are more efficient than ordinary least squares. Chapter 4 deals with the analysis and discussion of the results: the behavior of the ridge and lasso estimators at different correlation coefficient values and for increasing sample sizes based on the simulated data, a comparison of their MSEs, and the best way to achieve minimum variance and smaller standard errors under collinearity conditions. In Chapter 5, the study discusses the research findings and draws conclusions based on the simulation results and the application to multicollinear datasets.

Chapter 2
LITERATURE REVIEW

2.1 Introduction

This chapter discusses the collinearity problem and how the issue of multicollinearity can be resolved. The study reviews past research on shrinkage regression and its findings in attempting to solve the problem of multicollinearity in a dataset.

2.2 Multiple Linear Regression

Multiple linear regression (MLR) is an extension of simple linear regression to the case where we have two or more explanatory variables.
The goal of MLR is to determine the set of parameters for which the predicted values of the dependent variable are close to the actual values (Orlov, 1996). Consider the multiple linear regression model

Y = β0 + β1 X1 + β2 X2 + ξ    (2.1)

where β1 is the change in Y for a unit change in X1 while X2 is held constant, and β2 is the change in Y for a unit change in X2 while X1 is held constant. Mathematically, β1 = ∂Y/∂X1 and β2 = ∂Y/∂X2. In multiple linear regression the predictor variables are assumed to be independent, but in practice they may be correlated, as stated by Johnson and Wichern (2004). The degree of correlation between the predictor variables is known as multicollinearity.

2.3 Multicollinearity

Consider the model in equation (2.1), which has two predictor variables, X1 and X2. The variables X1 and X2 are said to be collinear when they are correlated with each other (Belsley, 2004). Suppose, for example, that we try to classify students as good, average or bad using their test scores in mathematics and economics. On average, a student's performance in economics depends on the strength of their numeracy skills, so the results might show a strong positive correlation between mathematics and economics performance, leading to the problem of multicollinearity since the independent variables are linearly dependent.

2.4 Nature of Multicollinearity

Multicollinearity may be classified as perfect or partial.

Perfect multicollinearity: when two or more explanatory variables overlap completely, with one variable a perfect linear function of the others, such that the method of analysis cannot distinguish one from the other, we say there is perfect multicollinearity. This condition does not allow the coefficients of a multiple regression model to be estimated, since the design matrix is not of full rank and the equations for estimating the regression parameters become unsolvable.

Partial multicollinearity: this is the condition where two or more explanatory variables overlap such that they are correlated with each other but still contain independent variation; that is, not all the predictor variables are perfect linear combinations of each other. This condition restricts the extent to which the analysis can distinguish their causal significance, making it difficult to identify the most genuinely significant variables in the regression model.

2.5 Sources of Multicollinearity

Identifying the source of multicollinearity is paramount in solving the multicollinearity problem in a dataset; wherever the problem arises, it affects the analysis and interpretation of results. This research discusses five sources of multicollinearity.

• Data collection: this happens when the data are assembled from a small subspace of the predictor variables, that is, collinearity created by a poor sampling methodology. Adding samples over an expanded range will solve this multicollinearity problem. An example is trying to fit a line to a single point.
• Physical constraints of the model or population: this collinearity arises from constraints put on the predictor variables (as to their range), whether legal, political or physical.
This source of collinearity will exist irrespective of the sampling technique.

• Over-fitted model: this happens when we have more predictor variables than observations, and it can easily be avoided.
• Model specification: this form of collinearity arises from using predictor variables that are powers or linear combinations of some other original set of variables. If the sampling subspace for the predictor variables is narrow, collinearity rises further with any combination of the original variables.
• Extreme values in a subspace of the predictor variables induce multicollinearity. This can be corrected by removing the extreme values from the dataset of observations.

2.6 Effects of Multicollinearity

• If multicollinearity is perfect, the regression coefficients of the predictor variables are indeterminate and their standard errors are infinite.
• The variances and standard errors of the regression coefficient estimates βi are inflated, i.e., Var(βi) is too large, so the coefficients cannot be estimated with great precision or accuracy.
• The magnitudes of the βi may differ from what we expect, affecting the accuracy of predicted outcomes.
• The signs of the βi may differ from what we expect. Consider the model in equation (2.1), where X1 and X2 are mathematics and economics scores respectively and Y represents the level of credibility of an actuary. We would expect economics scores to increase as mathematics scores increase; a negative coefficient in such a model might be a result of multicollinearity.
• Adding or removing a predictor variable can cause huge changes in the regression coefficients βi.
• In some cases the F statistic is significant while the t statistics, ti = bi / SE(bi), are insignificant.

2.6.1 Tests for Multicollinearity

There are several ways of identifying multicollinearity in a dataset; a few are discussed below.

2.6.1.1 Correlation Coefficient

Calculate the correlation coefficient r = cov(Xi, Xj) / (σ_Xi σ_Xj) for each pair of predictor variables. If any of these values is significantly different from zero, the predictor variables involved may be collinear. One limitation of using the correlation coefficient to check the degree of multicollinearity is that, although two predictor variables Xi and Xj may not be highly correlated pairwise, three or more predictor variables X1, X2 and X3 may be correlated as a group.

2.6.1.2 Eigenvalues

Eigenvalues, condition indices and the condition number can also be used to check for multicollinearity. The condition number K is the square root of the ratio of the largest eigenvalue λmax to the smallest eigenvalue λmin, that is, K = sqrt(λmax / λmin). When there is no multicollinearity, the eigenvalues and the condition number all equal one; as multicollinearity increases, the eigenvalues spread to values both greater and less than one.

2.6.1.3 Variance Inflation Factor (VIF)

The rule of thumb states that multicollinearity exists if VIF > 10. A VIF of 10, say, means that Var(βi) is 10 times what it would have been if no multicollinearity existed in the dataset. The VIF is a more rigorous check for collinearity than the correlation coefficient. It is defined as VIF = 1 / (1 − Ri²), where Ri² is the coefficient of determination obtained by regressing the ith predictor on the remaining predictors; the denominator 1 − Ri² is known as the tolerance. In the regression model

Y = β0 + β1 X1 + β2 X2 + ξ    (2.2)

R1² is obtained by regressing X1 on X2. In each case we find the coefficient of determination and substitute it into the VIF formula. The coefficient of determination R² is a measure of goodness of fit: it tells how much of the variance in the response variable is explained by the predictor variables. A minimal sketch of this computation follows.
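The sketch below illustrates the VIF definition above in R: each predictor is regressed on the others and VIF = 1/(1 − R²) is applied. The data frame name `dat` and the column names X1 to X4 are hypothetical; packaged implementations such as car::vif(lm(Y ~ ., data = dat)) return the same quantities.

# A minimal sketch of the VIF computation, assuming a data frame `dat`
# whose columns X1..X4 hold the predictors (hypothetical names).
vif_manual <- function(preds) {
  sapply(names(preds), function(v) {
    others <- setdiff(names(preds), v)
    # Regress predictor v on all the other predictors and record R^2
    r2 <- summary(lm(reformulate(others, response = v), data = preds))$r.squared
    1 / (1 - r2)                     # VIF_i = 1 / (1 - R_i^2)
  })
}
# Example (hypothetical): vif_manual(dat[, c("X1", "X2", "X3", "X4")])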
2.6.2 Solutions to Multicollinearity

In this subsection we discuss various ways of dealing with a dataset under multicollinearity conditions.

2.6.3 Dropping Collinear Variables

Drop the variable causing the problem. When using a large number of X variables, a stepwise regression procedure can be used to determine which of the variables to drop. Removing collinear X variables is the simplest method of solving the multicollinearity problem. If all the X variables are retained, it is advisable to avoid making inferences about the individual β parameters, and to limit inferences about the mean value of Y to values of X that lie within the experimental region.

2.6.4 Recoding Variables

Recode the form of the independent variables. For instance, if X1 and X2 are collinear, we might try using X1 and the ratio X1/X2. Recoding variables is very effective in controlling multicollinearity when it was caused not by a sampling problem but by the design of the experiment or the model specification.

2.6.5 Principal Component Regression

This is a linear regression preceded by principal component analysis. In PCR, the principal components of the explanatory variables are used as regressors instead of regressing the response variable on the predictor variables directly (Filzmoser and Croux, 2003). The principal components with higher variances are usually chosen as regressors, even though those with low variances may occasionally be important for precise forecasts. PCR helps overcome multicollinearity by excluding the low-variance components from the regression step. In addition, by regressing on only a subset of the principal components, PCR reduces the number of parameters in the underlying model, which is particularly beneficial in settings with high-dimensional covariates. Through appropriate selection of the principal components used for regression, PCR can also lead to efficient forecasts by fitting a parsimonious model. A small sketch of the procedure follows.
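The sketch below illustrates the PCR idea under stated assumptions: a numeric predictor matrix `X` and response vector `y` (both hypothetical), with the number of retained components `k` chosen by the analyst.

# A minimal sketch of principal component regression, assuming a numeric
# predictor matrix X and response vector y (hypothetical objects).
pcs     <- prcomp(X, center = TRUE, scale. = TRUE)  # PCA of the predictors
k       <- 2                                        # analyst-chosen number of components
scores  <- pcs$x[, 1:k]                             # high-variance component scores
pcr_fit <- lm(y ~ scores)                           # regress y on the retained scores
# Map the score coefficients back to the standardized predictor scale
beta_pcr <- pcs$rotation[, 1:k] %*% coef(pcr_fit)[-1]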
2.6.6 Stepwise Regression

Stepwise regression is a variable selection procedure for the independent variables. Selection is done by adding variables that best satisfy a criterion (forward stepwise regression) or dropping variables that least satisfy a criterion (backward stepwise regression). At each step of the procedure, each predictor variable X is evaluated to see whether it should be kept in the model; an example of a criterion used in stepwise regression is the t value (Xiaobo et al., 2010). Variables with high β coefficients in the regression model are retained in a forward selection approach, and variables with the smallest β coefficients are dropped in a backward selection approach.

2.6.7 Regularization Techniques (Ridge and Lasso Regression)

This is an alternative estimation procedure to OLS and is discussed in Chapter 3.

Duzan and Shariff (2015) investigated the shortcomings of using ordinary least squares (OLS) when multicollinearity is present in a regression analysis; the only alternative method considered was ridge regression. The goal of the study was to find an appropriate value of the ridge parameter in a two-variable regression model. The investigation was done by simulating 1000 samples with n = 10, and the performance of different ridge regression estimators was compared to OLS. Mean square errors (MSE), variance inflation factors (VIF) and regression weights (beta estimates) were computed using few predictor variables (two and four). Random values of the ridge parameter λ were chosen to see which produced the lowest MSE and smallest VIF, and the coefficient of determination, a measure of goodness of fit, was calculated from linear models with correlated explanatory variables. The regression coefficients obtained using ridge regression and ordinary least squares (the case λ = 0) were compared for the 1000 observations in every simulation, with the ridge parameter computed from the ridge trace. The analysis found that the value of the ridge parameter is directly proportional to the covariance between the two explanatory variables, and that higher values of λ produced smaller β estimates. The researchers did not, however, consider the behavior of the ridge estimator for small sample sizes, and no comparative analysis was done between RR and OLS at different degrees of collinearity.

Pourbasheer et al. (2014) also did a comparative analysis of collinear data using ridge regression and OLS. The simulation was done using the SAS package with three different sample sizes (100, 50, 25), the goal being to observe the behavior of the estimators as the sample size increases. Six highly intercorrelated predictor variables were used, and eigenvalues were computed from the correlation matrices developed for the three sample sizes. With the smallest of the three sample sizes, some eigenvalues turned out not to be distinct, producing results in the complex plane (having a real part and an imaginary part). Finally, a comparison was made between the MSE and R-squared of ordinary ridge regression (ORR) and OLS. The results showed that the RR method produces a lower MSE than OLS, and that the estimated coefficients of ORR had smaller R-squared values than OLS. An increase in sample size was found to stabilize the estimated regression model, so the conclusion was that RR is a better estimator than OLS under collinearity conditions.

Santos-Cortez et al. (2014) published an article on the effect of ridge regression on regression estimates when the sample size is small, varying the correlation coefficients. The RR estimator was the only alternative to OLS introduced, and different ways of evaluating the ridge constant were discussed. Simulations were done using R software to achieve moderate and high correlations between the independent variables (0.5, 0.7 and 0.9). The sample size used was 20, with 10 explanatory variables considered in each case. The explanatory variables were standardized to correlation form, and standard error values of 0.1, 0.5, 1.0, 5.0 and 10.0 were considered. The experiment was repeated 1000 times, and in each case the value of the ridge parameter λ was estimated.
The performance of these estimators was measured using the average MSE,

MSE(β̂_RR) = (1/1000) Σ_{i=1}^{1000} (β̂_RR − β)^T (β̂_RR − β)    (2.3)

Levy et al. (2005) used principal component analysis to solve the problem of multicollinearity by reducing the size of the covariance matrix. However, PCA cannot always fix the problems with parameter estimation associated with multicollinearity, and the interpretation of the resulting model becomes cumbersome since each of the principal components used is a linear combination of all the other variables. Grewal et al. (2004) proposed controlling multicollinearity using structural equation models, recommending the inclusion of all significant influences on a response variable. Unfortunately, behavioral models frequently have low explanatory power (Grewal et al., 2004), so that may not be an option; another suggestion was to use an adequate sample size, which may not always be available. Chong and Jun (2005) and Jun et al. (2009) used variable selection methods to control the impact of high covariance among predictor variables, establishing that a model with good fit may not guarantee good variable selection performance, especially in moderation studies where every predictor variable is of paramount relevance. The problems of stepwise regression were investigated by Hauser (1974): apart from dropping some important variables, its significance tests are misleading, since the data used to generate the model are used again to test it.

Kaufinger (2013) admonished researchers to introduce categorical variables into models when multicollinearity is present in a dataset. It was troublesome to get a closed-form expression in the general case, as there could be numerous possible combinations of dummy and quantitative variables in linear regression models; the issue was investigated in more detail by choosing different combinations of dummy and quantitative variables. It was found that the presence of a dummy variable and the choice of reference category can themselves cause multicollinearity, and the approach of including an interaction term can often be difficult to set up and to interpret.

Chapter 3
METHODOLOGY

3.1 Introduction

In this chapter, the study reviews the concept of least squares regression and how its parameters are estimated. The study considers the difficulties of the OLS estimator when multicollinearity is present in a dataset and the ways in which they can be corrected, and introduces two shrinkage estimators, L2 (ridge regression) and L1 (lasso regression) regularization, which are known to be good fits under multicollinearity conditions.

Data were simulated for four independent variables from a multivariate normal distribution with highly correlated covariates using R. The sample size is set small (25) and then increased (50, 200, 1000), and the study observes the behavior of the estimators as the sample size increases. To set the regularization constant, cross validation was done on the simulated dataset to find the fixed parameter value at which the mean square error is minimum. The study seeks to find the best estimator using the MSE criterion.
The study also compares the effect of these regularization techniques on standard errors at different sample sizes. Next, the procedure deals with a simulation study with two predictor variables, varying the bivariate correlation coefficient between the predictors from zero to one in steps of 0.1. The study seeks the level of covariance between two predictor variables at which the OLS estimator performs better than the regularized shrinkage estimators. An application study was done using the bodyfat dataset in R, with 8 predictor variables, to investigate the results when the number of predictor variables increases.

3.2 Review of Ordinary Least Squares

By the Gauss-Markov theorem, least squares gives the best linear unbiased estimator (BLUE). "Best" means that among all linear unbiased estimators it has the minimum variance, or the least MSE; "unbiased" means that the mean of the estimated beta parameters over repeated samples is close to the true population parameter. Consider the linear model

Y_i = β0 + β1 X_i1 + β2 X_i2 + ... + βk X_ik + ξ_i = β0 + Σ_{p=1}^{k} βp X_ip + ξ_i    (3.1)

or, in matrix form,

Y = Xβ + ξ    (3.2)

Here Y is an (n × 1) vector of dependent variables, X is an [n × (k + 1)] matrix of observations on the k predictor variables (with a column of ones for the intercept), β is a [(k + 1) × 1] vector of regression coefficients to be estimated from the data, and ξ is an (n × 1) vector of error terms following a normal distribution with mean zero and constant variance, ξ ~ N(0, σ²). The expectation of Y is

E(Y) = E(Xβ + ξ) = Xβ    (3.4)

and the variance of Y is

Var(Y_i) = Var(X_i β + ξ_i) = σ²    (3.5)

The goal of ordinary least squares is to minimize the sum of squared differences between the observed and predicted values. From Y = Xβ + ξ we have ξ = Y − Xβ, so

ξ^T ξ = (Y − Xβ)^T (Y − Xβ) = Y^T Y − Y^T Xβ − β^T X^T Y + β^T X^T Xβ    (3.9)

Differentiating with respect to β and setting the derivative to zero gives the optimal value of β:

∂(ξ^T ξ)/∂β = −2X^T Y + 2X^T Xβ = 0    (3.12)

so that X^T Xβ = X^T Y, and rearranging gives

β̂ = (X^T X)^{-1} X^T Y    (3.14)

This equation is used to compute the β parameter estimates.

3.2.1 Model Performance and Accuracy of the OLS Estimator

The expectation of the estimator is

E(β̂) = E[(X^T X)^{-1} X^T Y] = (X^T X)^{-1} X^T E(Y) = (X^T X)^{-1} X^T Xβ = β    (3.18)

so β̂ is unbiased for β. Its variance is

Var(β̂) = (X^T X)^{-1} X^T Var(Y) X (X^T X)^{-1} = σ² (X^T X)^{-1}    (3.22)

and its mean square error is

MSE(β̂) = σ̂² trace[(X^T X)^{-1}] = σ̂² Σ_{i=1}^{p} (1/K_i)    (3.24)

where the K_i are the eigenvalues of X^T X. When the predictors are nearly collinear, some K_i are close to zero and the MSE blows up. A minimal sketch of the closed-form solve follows.
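As a hedged illustration of equation (3.14), the R sketch below solves the normal equations directly; the objects X and Y are hypothetical, and lm() performs the same fit with a numerically safer decomposition.

# A minimal sketch of the closed-form OLS solve in equation (3.14).
# Assumes X includes a column of ones for the intercept (hypothetical objects).
ols_beta <- function(X, Y) {
  solve(t(X) %*% X, t(X) %*% Y)   # beta_hat = (X'X)^{-1} X'Y
}
# Under strong multicollinearity X'X is near-singular and solve() becomes
# unstable -- the failure mode that motivates the ridge and lasso estimators below.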
3.3 Ridge Regression

Ridge regression (RR) was introduced by Hoerl and Kennard (1970) as a solution to problems of multicollinearity in multiple linear regression, where the ordinary least squares estimator is unstable. The assumptions of ridge regression are the same as those of ordinary least squares. When the predictor variables are highly correlated in a linear regression model, the matrix X^T X nears singularity and hence is effectively not invertible (Stone and Brooks, 1990). This creates imprecise parameter estimates with large variances, and some variables that explain variability in the response might appear insignificant in the analysis because of the presence of correlated covariates. Ridge regression reduces the impact of correlated inputs by penalizing the norm of the beta vector:

β̂_RR = argmin_β J(β) + λ‖β‖₂²    (3.25)

where λ is the ridge constant. For any vector, the l_p norm is defined as

‖β‖_p ≡ (|β1|^p + |β2|^p + ... + |βk|^p)^{1/p}    (3.26)

For a linear regression, the penalized loss function takes the form

J(β) = Σ_{i=1}^{N} (Y_i − β^T X_i)² + λ‖β‖₂²    (3.28)

We want the value of β that minimizes this function. Writing it as

J(β) = (Y − Xβ)^T (Y − Xβ) + λβ^T β    (3.29)

differentiating with respect to β, and setting the derivative to zero,

∂J(β)/∂β = −2X^T (Y − Xβ) + 2λβ = 0    (3.31)

solving for β generates the solution

β̂_RR = (X^T X + λI)^{-1} X^T Y    (3.32)

where I is the identity matrix.

3.3.1 Properties of the Ridge Estimator

The main properties of the ridge solution are:

• the ridge estimator β̂_RR is a linear transformation of the least squares estimator β̂;
• the length of β̂_RR is a decreasing function of λ;
• the residual sum of squares increases monotonically as a function of λ.

The expectation of β̂_RR is

E(β̂_RR) = (X^T X + λI)^{-1} X^T E(Y) = (X^T X + λI)^{-1} X^T Xβ = [I + λ(X^T X)^{-1}]^{-1} β    (3.37)

so the ridge estimator is biased for λ > 0. Its covariance is

Cov(β̂_RR) = σ² (X^T X + λI)^{-1} X^T X (X^T X + λI)^{-1}    (3.42)

and its mean square error is

MSE(β̂_RR) = E[(β̂_RR − β)^T (β̂_RR − β)] = σ² Σ_{i=1}^{k} K_i/(K_i + λ)² + λ² β^T (X^T X + λI)^{-2} β    (3.44)

which decomposes as

MSE(β̂_RR) = Σ Var(β̂_RR) + Σ [Bias(β̂_RR)]²    (3.45)

where the K_i are the eigenvalues of X^T X and k is the number of explanatory variables. The first term is the trace of the dispersion matrix of β̂_RR and the second is the squared length of the bias vector. The variance term is monotonically decreasing for λ > 0, while the squared bias is monotonically increasing in λ; a suitable choice of λ strikes a balance between the two terms, reducing the variance by more than it increases the bias. A minimal sketch of the ridge solve follows.
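The R sketch below implements equation (3.32) directly, under the same hypothetical objects as the OLS sketch (standardized predictors X, centered response Y); glmnet with alpha = 0 is the packaged equivalent.

# A minimal sketch of the ridge solve in equation (3.32), assuming
# standardized predictors X and a centered response Y.
ridge_beta <- function(X, Y, lambda) {
  p <- ncol(X)
  solve(t(X) %*% X + lambda * diag(p), t(X) %*% Y)  # (X'X + lambda I)^{-1} X'Y
}
# lambda = 0 recovers OLS; increasing lambda shrinks the coefficients and
# keeps the system invertible even when X'X is singular.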
It is suitable for data that exhibit high degree of 22 University of Ghana http://ugspace.ug.edu.gh multicollinearity. It is also helpful in doing variable selection or parameter elimination when we don’t want to use filter approaches. The larger the penalty term, the smaller the coefficients of the regression parameters. Lasso regression is also known as L1 regularization. Ridge regression (L2) regularization doesn’t result in parsimonious models as coefficients are not zeroed. In contrast, the Lasso does variable selection and parameter shrinkage automatically. This makes interpretation of Lasso models far easier that Ridge models. Mathematically, we seek to find the value of β that minimizes the expression ∑n 2 ( ) ∑p‖Y −Xβ‖2 + λ ‖ 2β‖ = Y −Xiβ + λ ‖βj‖ (3.46) i=1 j=1 where the first term is the sum of squares and the second term is the Lasso penalty. Expanding out the first term, we get ( ) Y TY − Y TXβ − βTXTY + βTXTXβ (3.47) In the orthonormal case XTX = I = (XTX)−1, hence β̂LS = XTY . Since Y TY does not contain any of the variables of interest, we can discard it. Hence we have, ( ∑p ) Y TY − 2Y TXβ − βTXTY + βTβ + λ ‖βj‖ (3.48)( [ ] ∑ j=1p ) − Tβ̂OLS β − βT β̂ TOLS + β β + λ ‖βj‖ (3.49) ∑ j=1p ( ) − 2β̂ 2OLSβj + βj + λ1 ‖βj‖ (3.50) ∑j=1p ( ) min − 2β̂ 2OLSβj + βj + λ1 ‖βj‖ (3.51) βj j=1 23 University of Ghana http://ugspace.ug.edu.gh Minimization can be done per regression coefficient ( )   min− 2β̂ 2  OLSβj + βj − λ1 ‖βj‖ , for β > 0  min − β2β̂OLSβj+β2 jj+λ1 ‖βj‖ = βj  min− 2β̂OLSβj + β2j − λ1 ‖βj‖ , for β < 0  βj Solving the right-hand side yields    β̂ 1 OLS − λ , for β > 02 1 β̂LASSO(λ1) =  β̂ 1 OLS + λ1, for β < 02 Both the sum of squares and the lasso penalty are convex, and so is the lasso loss func- tion. Consequently, there exist a global minimum. However, the lasso loss function is not strictly convex. Consequently, there may be multiple values of β′s that minimize the lasso loss function. 3.5 Standard Errors The Lasso is a non-linear and a non-differentiable function of the response values, it is difficult to estimate its standard errors accurately even for a fixed value of λ. However the standard errors can be estimated via bootstrap, that is either λ can be fixed or we may optimize over λ for each bootstrap sample. Getting a fixed λ value is comparable to selecting a best subset and then using the least squares standard e∑rror for that∑subset.β2 An approximate estimate may be derived by writing the penalty | βj | as j|βj | . Hence at the Lasso estimate β̂, we may approximate the solution by a ridge regression of the form β̂ = (XTX + λV −)−1XTY , where V is a diagonal matrix with diagonal ∑elements | βj |, V −denotes the generalized inverse of W and λ is chosen such that | βj |= λ. The covariance matrix of the estimates may then be approximated by β̂ = (XTX + λV −)−1XTX(XTX + λV −)−1σ2 (3.52) 24 University of Ghana http://ugspace.ug.edu.gh Figure 3.1: This figure shows data partitioning for cross validation. where σ2 is an estimate of the error variance. The cumbersomeness of the above for- mula is that, it gives error variance of zero for predictors with β̂j = 0 but it does prove to be useful for selection of the lasso shrinkage parameter λ. 3.6 Cross Validation Cross validation is a method used in selecting the best estimator with the smallest RMSE from a group of competing estimators formed from the same postulate stated by Stone (1974). 
3.6 Cross Validation

Cross validation is a method for selecting, from a group of competing estimators formed from the same postulate, the estimator with the smallest RMSE (Stone, 1974). Some cross validation methods are outlined below.

• Holdout method: a given dataset is partitioned into two parts, A and B say. Part A is used as the training data to generate the model, and part B is used to validate the model generated. This method is prone to sample bias since the split is done by random sampling without any criterion.
• K-fold method: the dataset is divided into K folds, as illustrated in Figure 3.1. One partition is held out to validate the model generated from the other partitions, and the process is repeated so that each of the k partitions is held out once, generating k models. The model giving the smallest RMSE after validation is chosen as best and used for analysis.
• Leave-one-out method: a special case of the K-fold method in which the number of folds equals the number of observations, so each held-out partition contains a single observation. It requires a large computational time for large samples.
• Bootstrap method: used if the samples come from the same parent distribution and are independent of each other. Random samples are drawn, with replacement, from the training dataset; the models are fitted on the bootstrap samples and examined to find the model most consistent across them (Madsen and Thyregod, 2010).

In this research the K-fold method was used to select the shrinkage parameter. Given a training dataset (x_i, y_i), i = 1, 2, ..., n, we construct an estimator θ̂ of some unknown function θ. Suppose θ̂ = θ̂_λ depends on a tuning parameter λ; cross validation offers a way to choose the value of λ (the penalizing constant) that optimizes predictive accuracy. The idea is to divide the training data into N folds (N fixed, e.g. N = 4 as shown in Figure 3.1). We then hold out each fold one at a time, train on the remaining data, and predict the held-out observations for each value of the tuning parameter. The cross validation error for each value of the tuning parameter is

CV(λ) = (1/n) Σ_{i=1}^{n} (y_i − θ̂(x_i))²    (3.53)

and we choose the tuning parameter that minimizes the CV error curve,

λ̂ = argmin_λ CV(λ)    (3.54)

In this research, R was used to estimate the ridge and lasso parameters by performing cross validation on the simulated data.

3.7 The Use of Monte Carlo Simulation

The Monte Carlo method is a stochastic technique that uses random numbers and probability to investigate problems. The Monte Carlo strategy can be used to solve physical problems and aids in examining complex systems: with the Monte Carlo method, we can sample a large system in a number of random configurations. Midi et al. (2010) also conducted a Monte Carlo simulation study of a robust approach in the presence of multicollinearity.

3.8 Simulation Design

In this research, we simulate a four-variable, highly correlated dataset in R with different sample sizes (25, 50, 200, 1000), and examine the effect of sample size on multicollinearity, the regression coefficients, the mean square errors and the mean absolute errors. A sketch of the design follows.
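A hedged sketch of this design, assuming MASS::mvrnorm; the common correlation of 0.9, the seed and the coefficient vector are illustrative choices, not the exact values used in the study.

# A sketch of the simulation design in Section 3.8, assuming the MASS package.
library(MASS)
set.seed(1)                                   # illustrative seed, for reproducibility
p     <- 4
Sigma <- matrix(0.9, p, p); diag(Sigma) <- 1  # highly correlated predictors
for (n in c(25, 50, 200, 1000)) {
  X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
  Y <- X %*% c(0.2, 0.3, 0.2, 0.6) + rnorm(n, sd = 0.1)  # illustrative coefficients
  # ... fit OLS, ridge and lasso on (X, Y); record coefficients, MSE, MAE and SEs
}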
Chapter 4
DATA ANALYSIS

4.1 Introduction

This chapter analyses and discusses the outcomes of the simulated and application datasets. We compared the behavior of the estimators as the sample size increased and at different covariances for a two-predictor model. The MSEs of the estimators, as well as the standard errors of the regression coefficients (OLS, RR and LR), were computed and compared to determine the best estimator for the simulated and applied highly correlated datasets.

4.2 Performance of Estimators as Sample Size Increases

Figure 4.1 is the matrix plot of the simulated dataset for different sample sizes; it shows the scatter plots of the response and predictor variables.

Figure 4.1: Scatter Plot of Simulated Data for Different Sample Sizes; panels (a) n=25, (b) n=50, (c) n=200, (d) n=1000.

From Figure 4.1 it can be seen that all the independent variables are highly positively correlated with each other and that a linear relationship exists among the variables. The data points in the matrix plot become more compact as the sample size increases. Pragmatically, this might be the case when additional samples are collected from a highly homogeneous population, since all the samples exhibit very similar characteristics.

The correlation matrices of the simulated dataset for the different sample sizes are shown below. Correlation measures the strength of the linear relationship between two variables; the diagonal elements represent the variances and the off-diagonal elements the covariances (here in correlation form).

Table 4.1: Correlation Matrix of the Simulated Dataset for n=25

      Y        X1       X2       X3       X4
Y     1        0.88322  0.92835  0.92660  0.97103
X1             1        0.91273  0.83321  0.91033
X2                      1        0.90425  0.87658
X3                               1        0.85925
X4                                        1

Table 4.2: Correlation Matrix of the Simulated Dataset for n=50

      Y        X1       X2       X3       X4
Y     1        0.87771  0.92616  0.94428  0.97250
X1             1        0.90807  0.84841  0.90175
X2                      1        0.91918  0.87299
X3                               1        0.88772
X4                                        1

Table 4.3: Correlation Matrix of the Simulated Dataset for n=200

      Y        X1       X2       X3       X4
Y     1        0.90780  0.92770  0.95635  0.97259
X1             1        0.91883  0.88993  0.90216
X2                      1        0.92713  0.85399
X3                               1        0.90826
X4                                        1

Table 4.4: Correlation Matrix of the Simulated Dataset for n=1000

      Y        X1       X2       X3       X4
Y     1        0.92105  0.94131  0.96338  0.97527
X1             1        0.90807  0.84841  0.90175
X2                      1        0.91918  0.87299
X3                               1        0.85925
X4                                        1

Table 4.5: OLS Regression Output of the Simulated Dataset for n=25

            Estimate   Std. Error   t-value   P-value
Intercept   0.00084    0.01929      0.044     0.96570
X1          -0.01339   0.08425      -0.159    0.87535
X2          0.30626    0.07540      4.282     0.00360
X3          0.17686    0.07540      2.346     0.02942
X4          0.57121    0.06732      8.485     0.00000

F(4,20) = 540, p = 0.000, RSE = 0.09347, R² = 0.9908, Adj. R² = 0.989

From Table 4.5 it can be seen that X1 has a negative coefficient even though all the predictors are positively correlated with the response variable Y. The variability we would expect under repeated sampling is low, given the small standard error values. Some variables, X1 for example, are not significant at the α = 0.05 level. The table below shows the OLS output of the simulated dataset for a sample size of 50.

Table 4.6: OLS Regression Output of the Simulated Dataset for n=50

            Estimate   Std. Error   t-value   P-value
Intercept   0.00064    0.01219      0.052     0.95900
X1          -0.06785   0.04776      -0.421    0.16200
X2          0.29021    0.04765      6.068     0.00000
X3          0.21466    0.04765      4.505     0.00000
X4          0.60932    0.03664      16.631    0.00000

F(4,45) = 1469, p = 0.000, RSE = 0.08601, R² = 0.9924, Adj. R² = 0.9917
From Table 4.6 it can be seen that the number of significant variables has increased (at α = 0.05, say) and the standard errors of the predictor variables have decreased. The table below shows the OLS output of the simulated dataset for a sample size of 200.

Table 4.7: OLS Regression Output of the Simulated Dataset for n=200

            Estimate    Std. Error   t-value   P-value
Intercept   0.00184     0.00657      0.281     0.77900
X1          -0.010502   0.02319      -0.4528   0.00001
X2          0.34074     0.02581      13.201    0.00000
X3          0.18364     0.02637      6.963     0.00000
X4          0.61130     0.02003      30.516    0.00000

F(4,195) = 4986, p = 0.000, RSE = 0.09226, R² = 0.9903, Adj. R² = 0.9901

From Table 4.7, all the variables would be considered significant at α = 0.05, say; the residual standard error has risen to 9.2%, but the significance of the predictors has improved, as the p-values are decreasing. The R-squared tells us the percentage of variation in the response variable that is explained by the predictor variables; it is a measure of the goodness of fit of the model. The table below shows the OLS output of the simulated dataset for a sample size of 1000.

Table 4.8: OLS Regression Output of the Simulated Dataset for n=1000

            Estimate   Std. Error   t-value   P-value
Intercept   -0.00425   0.00295      -1.438    0.15100
X1          -0.13916   0.00988      -14.091   0.00000
X2          0.33910    0.01110      30.564    0.00000
X3          0.20930    0.01086      19.270    0.00000
X4          0.61502    0.00887      69.360    0.00000

F(4,995) = 2.813 × 10⁴, p = 0.000, RSE = 0.09327, R² = 0.9912, Adj. R² = 0.9912

From Table 4.8 it can be seen that the increase in sample size has drastically reduced the standard errors; the stability of a model is strongly affected by the sample size of the dataset. The significance of the predictor variables has increased compared with the cases n=25, n=50 and n=200. However, the negative sign attached to X1 is maintained across all the sample sizes even though X1 is positively correlated with Y. This might be due to multicollinearity in the data and is investigated further using the VIFs.

Table 4.9 shows the variance inflation factors of the four predictor variables of the simulated data for the different sample sizes. The VIF measures how much the variances of the regression coefficients are inflated relative to the case where the independent variables are strictly uncorrelated; VIF values greater than 5 indicate serious multicollinearity.

Table 4.9: VIFs of the Simulated Dataset for Different Sample Sizes

      n=25      n=50      n=200     n=1000
X1    17.9780   12.4855   10.5131   11.0825
X2    13.5777   12.7499   12.5500   13.8634
X3    12.2999   13.0577   13.9575   13.2335
X4    8.9445    7.9278    8.3723    8.9871

From Table 4.9 it can be seen that most of the predictor variables have VIFs > 10 across all sample sizes, meaning that severe multicollinearity exists in the simulated dataset. The shrinkage parameters for ridge and lasso were therefore selected by cross validation, as in the sketch below.
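A hedged sketch of the cross validation step, assuming glmnet and the simulated X and Y from Section 3.8; plot(cv_lasso) produces error curves of the kind shown in Figures 4.2 and 4.3.

# A sketch of K-fold selection of the shrinkage parameter with cv.glmnet,
# assuming the simulated X and Y from Section 3.8.
library(glmnet)
cv_lasso <- cv.glmnet(X, Y, alpha = 1, nfolds = 10)  # alpha = 1: lasso penalty
cv_ridge <- cv.glmnet(X, Y, alpha = 0, nfolds = 10)  # alpha = 0: ridge penalty
plot(cv_lasso)        # error curve with the two vertical lines discussed below
cv_lasso$lambda.min   # lambda minimizing the cross-validated error
cv_lasso$lambda.1se   # largest lambda within one standard error of the minimum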
Figure 4.2 shows the cross-validation diagrams of Lasso regression for the simulated dataset.

Figure 4.2: Cross Validation Diagrams for Lasso Regression
(a) n=25  (b) n=50  (c) n=200  (d) n=1000

The two vertical lines in the diagrams above represent two candidate choices of λ: the minimum value is the one that minimizes the estimated loss in cross-validation, while the maximum value is the largest λ whose error lies within one standard error of that minimum. The red dots represent the cross-validated error estimates, with bars marking the confidence intervals of these estimates, and the vertical lines show the locations of the maximum and minimum λ values. The numbers across the top give the number of non-zero coefficient estimates.

It can be observed across all the cross-validation plots that the number of non-zero coefficients is three. Hence it is expected that one of the independent variables will be dropped (will have its coefficient zeroed) in the Lasso regression model.

Figure 4.3: Cross Validation Diagrams for Ridge Regression
(a) n=25  (b) n=50  (c) n=200  (d) n=1000

The lowest point on each curve indicates the optimal λ value, that is, the value of log λ that best minimizes the error in cross-validation.

4.3 Lasso, Ridge and OLS Coefficients

This section discusses the regression coefficients of the three estimators fitted to the simulated dataset. From the tables below it can be seen that X1 has a negative coefficient under ordinary least squares across all four sample sizes even though it was positively correlated with the response variable. This is an indication that multicollinearity might exist in the dataset, which is affirmed by the table of VIFs. The ridge regression technique has retained all the variables, whereas Lasso has zeroed the coefficient of X1. This illustrates one of the properties of lasso discussed in Chapter 3, namely that it performs variable selection by first removing the least significant of the variables in the model.

The estimates tend to become more significant as the sample size increases; that is, the precision of the coefficient estimates improves with the sample size. However, most ridge regression and lasso regression estimates are smaller in magnitude than the OLS estimates. A sketch of how these fits and coefficient comparisons could be produced is given below, followed by the tables themselves.
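The description of Figures 4.2 and 4.3 (red error dots, two vertical lines, non-zero counts across the top) matches the cross-validation plots of the R package glmnet, so a plausible sketch of this step, reusing the simulated data frame sim from above, is:

    library(glmnet)

    x <- as.matrix(sim[, c("X1", "X2", "X3", "X4")])
    cv_lasso <- cv.glmnet(x, sim$Y, alpha = 1)   # alpha = 1 gives the lasso (L1)
    cv_ridge <- cv.glmnet(x, sim$Y, alpha = 0)   # alpha = 0 gives ridge (L2)
    plot(cv_lasso)                               # cf. Figure 4.2
    plot(cv_ridge)                               # cf. Figure 4.3

    # Coefficients at the CV-selected lambda, side by side (cf. Table 4.10)
    data.frame(OLS = round(coef(ols), 5),
               RR  = round(as.vector(coef(cv_ridge, s = "lambda.min")), 5),
               LR  = round(as.vector(coef(cv_lasso, s = "lambda.min")), 5))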
Table 4.10: Regression Coefficients for n=25

n=25        Estimate(OLS)   Estimate(RR)   Estimate(LR)
Intercept   0.00084         0.00407        0.00096
X1          -0.01339        0.18021        0.00000
X2          0.30626         0.20519        0.29481
X3          0.17686         0.22241        0.17890
X4          0.57121         0.38143        0.55879

The equation of the fitted OLS model is

ŶOLS = 0.00084 − 0.01339X1 + 0.30626X2 + 0.17686X3 + 0.57121X4    (4.1)

The equation of the fitted RR model is

ŶRR = 0.00407 + 0.18021X1 + 0.20519X2 + 0.22241X3 + 0.38143X4    (4.2)

The equation of the fitted Lasso model is

ŶLR = 0.00096 + 0.29481X2 + 0.17890X3 + 0.55879X4    (4.3)

Table 4.11: Regression Coefficients for n=50

n=50        Estimate(OLS)   Estimate(RR)   Estimate(LR)
Intercept   0.00064         0.00176        0.00088
X1          -0.06785        0.14878        0.00000
X2          0.29021         0.19981        0.24509
X3          0.21466         0.25636        0.22180
X4          0.60932         0.41149        0.57280

The equation of the fitted OLS model is

ŶOLS = 0.00064 − 0.06785X1 + 0.29021X2 + 0.21466X3 + 0.60932X4    (4.4)

The equation of the fitted RR model is

ŶRR = 0.00176 + 0.14878X1 + 0.19981X2 + 0.25636X3 + 0.41149X4    (4.5)

The equation of the fitted Lasso model is

ŶLR = 0.00088 + 0.24509X2 + 0.22180X3 + 0.57280X4    (4.6)

Table 4.12: Regression Coefficients for n=200

n=200       Estimate(OLS)   Estimate(RR)   Estimate(LR)
Intercept   -0.00426        -0.00604       -0.00529
X1          -0.13916        0.09565        0.00000
X2          0.33910         0.21952        0.23535
X3          0.20930         0.27964        0.24147
X4          0.61502         0.39904        0.54228

The equation of the fitted OLS model is

ŶOLS = −0.00426 − 0.13916X1 + 0.33910X2 + 0.20930X3 + 0.61502X4    (4.7)

The equation of the fitted RR model is

ŶRR = −0.00604 + 0.09565X1 + 0.21952X2 + 0.27964X3 + 0.39904X4    (4.8)

The equation of the fitted Lasso model is

ŶLR = −0.00529 + 0.23535X2 + 0.24147X3 + 0.54228X4    (4.9)

Table 4.13: Regression Coefficients for n=1000

n=1000      Estimate(OLS)   Estimate(RR)   Estimate(LR)
Intercept   -0.00951        0.00234        0.00238
X1          -0.19151        0.11208        0.00000
X2          0.38046         0.21660        0.26348
X3          0.17039         0.26522        0.20393
X4          0.66239         0.40529        0.55640

The equation of the fitted OLS model is

ŶOLS = −0.00951 − 0.19151X1 + 0.38046X2 + 0.17039X3 + 0.66239X4    (4.10)

The equation of the fitted RR model is

ŶRR = 0.00234 + 0.11208X1 + 0.21660X2 + 0.26522X3 + 0.40529X4    (4.11)

The equation of the fitted Lasso model is

ŶLR = 0.00238 + 0.26348X2 + 0.20393X3 + 0.55640X4    (4.12)

Table 4.14: Eigenvalues for the Independent Variables

Variable      X1       X2       X3       X4
Eigenvalues   0.1376   0.1105   0.0370   0.0062

Table 4.14 shows the eigenvalues of the design matrix of the simulated dataset. It can be observed that the ratio of the maximum to the minimum eigenvalue is large, which indicates severe multicollinearity among the explanatory variables.

Figure 4.4 shows the ridge trace plots of the coefficients against log λ, where λ represents the shrinkage parameter.

Figure 4.4: Ridge Trace Plot for Simulated Dataset
(a) n=25  (b) n=50  (c) n=200  (d) n=1000
X1 − black, X2 − green, X3 − red and X4 − blue

From the diagrams in Figure 4.4, it can be seen that the coefficient values shrink towards zero as the λ values increase, across all sample sizes. However, all four independent variables are retained in the model over some range of λ values. The shrinkage effect depends on the significance of the independent variables: the least significant variables shrink fastest. When λ is zero, the ridge solution coincides with the OLS solution.
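The eigenvalue diagnostic and the trace plots could be reproduced along the following lines; taking the eigenvalues from the correlation matrix of the predictors is an assumption, since the thesis does not state which cross-product matrix it decomposed.

    ev <- eigen(cor(x))$values   # eigenvalues of the predictor correlation matrix
    ev                           # cf. Table 4.14
    max(ev) / min(ev)            # a large ratio signals severe multicollinearity

    plot(glmnet(x, sim$Y, alpha = 0), xvar = "lambda")  # ridge trace (cf. Figure 4.4)
    plot(glmnet(x, sim$Y, alpha = 1), xvar = "lambda")  # lasso paths (cf. Figures 4.5 and 4.6)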
Figure 4.5 shows the Lasso plots of the coefficients against the λ values for the different sample sizes.

Figure 4.5: Lasso Plot of Independent Variables
(a) n=25  (b) n=50  (c) n=200  (d) n=1000
X1 − black, X2 − green, X3 − red and X4 − blue

From the diagrams in Figure 4.5, it can be seen that the coefficients shrink as the shrinkage parameter λ increases. The topmost part of each graph shows the number of independent variables retained in the model at the given λ value. For instance, it can be seen that the lasso regression model has four independent variables at λ = 1.0.

The diagrams in Figure 4.6 show the Lasso plots of the coefficients against the λ values for different sample sizes.

Figure 4.6: Lasso Plot of Coefficients against Lambda
(a) n=50  (b) n=200
X1 − black, X2 − green, X3 − red and X4 − blue

From these diagrams it can be seen that the number of predictor variables retained in the model varies with the value of λ. Lasso regression in this case performs both shrinkage and parameter elimination.

The diagrams below show the effect of shrinkage on the independent variables for the different sample sizes.

Figure 4.7: Shrinkage Cross Validation Diagrams
(a) n=25  (b) n=50  (c) n=200  (d) n=1000

From Figure 4.7, it can be seen that each of the Lasso regression models for the different sample sizes retains three predictor variables (X2, X3, X4), as was seen in our Lasso models. The spread between the coefficients decreases considerably as the sample size is increased from 25 to 50. However, it increases again as the sample size is increased to 200, and the most rigorous shrinkage occurs when the sample size is further increased to 1000. From the correlation matrices, the correlations between the variables at n=50 were higher than those at n=200. This means that the higher the correlations, the greater the shrinkage and hence the smaller the coefficients. The study therefore concludes that shrinkage depends on the correlations and not on the sample size.

Table 4.15 shows the shrinkage parameters for RR and LR after cross-validation.

Table 4.15: Shrinkage Parameters for Ridge and Lasso Regression

Sample size (n)   RR λ1SE   RR λmin   LR λ1SE   LR λmin
25                0.1774    0.0925    0.0471    0.0067
50                0.1446    0.0971    0.0384    0.0066
200               0.1305    0.0987    0.0288    0.0065
1000              0.1063    0.1063    0.0195    0.0070

Here λ1SE and λmin bracket the range of λ values for which the regularized solutions have a smaller MSE than the OLS solution, λ1SE being the maximum and λmin the minimum. λ1SE decreases as the sample size increases, whereas λmin tends to increase with the sample size. The minimum value is always chosen as our shrinkage parameter. It can be observed that all the λmin values of the lasso are smaller than the λmin values of the ridge solutions in cross-validation.
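A sketch of how a table like Table 4.15 could be assembled, where sim_data() is a hypothetical wrapper around the data-generation code of the first sketch:

    sim_data <- function(n) {
      S <- matrix(0.9, 4, 4); diag(S) <- 1           # assumed correlation level
      X <- MASS::mvrnorm(n, rep(0, 4), S)
      list(x = X, y = as.vector(X %*% c(0.1, 0.3, 0.2, 0.6)) + rnorm(n, sd = 0.1))
    }
    t(sapply(c(25, 50, 200, 1000), function(n) {
      d <- sim_data(n)
      cr <- glmnet::cv.glmnet(d$x, d$y, alpha = 0)   # ridge
      cl <- glmnet::cv.glmnet(d$x, d$y, alpha = 1)   # lasso
      c(n = n, RR_1se = cr$lambda.1se, RR_min = cr$lambda.min,
        LR_1se = cl$lambda.1se, LR_min = cl$lambda.min)
    }))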
Table 4.16 shows the mean square errors and mean absolute errors across the four different sample sizes for the three estimators.

Table 4.16: MSEs and MAEs for OLS, RR and LR

n      MSE(OLS)   MAE(OLS)   MSE(RR)   MAE(RR)   MSE(Lasso)   MAE(Lasso)
25     2.1050     1.2309     0.0102    0.0845    0.0070       0.0681
50     2.3977     1.2497     0.0121    0.0876    0.0070       0.0646
200    2.3237     1.2482     0.0144    0.0955    0.0092       0.0788
1000   2.3309     1.2339     0.0162    0.1016    0.0104       0.0813

From Table 4.16, RR has smaller mean absolute errors (MAE) than OLS, and RR also has smaller MSEs of the regression coefficients than OLS. Lasso regression, in turn, outperforms both OLS and RR in both categories. Consequently, ridge and lasso regression methods are better than OLS when the multicollinearity problem exists in a dataset, with Lasso being the best in this case for the four-predictor-variable model.

4.4 Standard Errors of the Regression Coefficients

The standard error of a β̂ estimate is a measure of how consistent β̂ would be under repeated re-sampling; it measures the sampling variation in estimating β, according to Ludvigson and Ng (2009). Tables 4.17, 4.18, 4.19 and 4.20 show the standard errors of the estimates for the simulated dataset across the different sample sizes.

Table 4.17: Standard Errors for n=25

n=25   OLS       RR        LR
X1     0.08425   0.00761   0.00887
X2     0.07152   0.00489   0.01881
X3     0.07540   0.00213   0.00294
X4     0.06732   0.00546   0.02395

Table 4.18: Standard Errors for n=50

n=50   OLS       RR        LR
X1     0.04776   0.00337   0.00746
X2     0.04782   0.00239   0.01947
X3     0.04765   0.00148   0.00677
X4     0.03664   0.00276   0.01850

Table 4.19: Standard Errors for n=200

n=200  OLS       RR        LR
X1     0.04776   0.00337   0.00746
X2     0.04782   0.00239   0.01947
X3     0.04765   0.00148   0.00677
X4     0.03664   0.00276   0.01850

Table 4.20: Standard Errors for n=1000

n=1000 OLS       RR        LR
X1     0.00988   0.00021   0.02127
X2     0.01110   0.00017   0.02383
X3     0.01086   0.00013   0.00419
X4     0.00887   0.00018   0.02082

Figure 4.8: Standard Errors of the Regression Coefficients across Different Sample Sizes

From Figure 4.8, it can be observed that the β coefficients of RR have the smallest standard errors across all sample sizes. The standard errors of the OLS coefficients decrease as the sample size increases. For a very large sample size (n=1000), the most significant OLS variable has a coefficient standard error smaller than the corresponding LR coefficient standard error.

4.5 Performance of OLS, Ridge and Lasso Estimators at Different Correlation Coefficients for Two Predictor Variables

A simulation study was conducted using 200 samples with two predictor variables. The sample size was kept constant while the correlation coefficient between the predictors was varied between 0 and 1, to determine the extent to which ridge and lasso outperform ordinary least squares under the MSE criterion. A sketch of this simulation is given below.
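This sketch assumes, following the abstract, that the reported MSE is the error between the true coefficients and their estimates; the true coefficient vector and the noise level are illustrative assumptions.

    set.seed(42)
    beta2 <- c(1, 1)                                  # hypothetical true coefficients
    mse_at_r <- function(r, n = 200) {
      X <- MASS::mvrnorm(n, c(0, 0), matrix(c(1, r, r, 1), 2, 2))
      y <- as.vector(X %*% beta2) + rnorm(n)
      b_ols <- coef(lm(y ~ X))[-1]                    # drop the intercept
      b_rr  <- as.vector(coef(glmnet::cv.glmnet(X, y, alpha = 0), s = "lambda.min"))[-1]
      b_lr  <- as.vector(coef(glmnet::cv.glmnet(X, y, alpha = 1), s = "lambda.min"))[-1]
      c(r = r, OLS = mean((b_ols - beta2)^2),
        RIDGE = mean((b_rr - beta2)^2), LASSO = mean((b_lr - beta2)^2))
    }
    t(sapply(seq(0.05, 0.95, by = 0.1), mse_at_r))    # cf. Table 4.21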
The results are summarized in Table 4.21, which shows the MSEs of the three estimators for varying correlation coefficients between the two predictor variables.

Table 4.21: MSEs with Varying Correlation Coefficients for Two Predictor Variables

r12               MSE(OLS)   MSE(RIDGE)   MSE(LASSO)
0 < r12 < 0.1     1.08806    1.09098      1.09109
0.1 < r12 < 0.2   1.27009    0.97483      0.97115
0.2 < r12 < 0.3   1.20087    0.94655      0.94516
0.3 < r12 < 0.4   1.38027    0.90043      0.89886
0.4 < r12 < 0.5   1.48580    0.83960      0.83852
0.5 < r12 < 0.6   1.60904    0.71552      0.71463
0.6 < r12 < 0.7   1.74448    0.57992      0.57905
0.7 < r12 < 0.8   1.87089    0.45410      0.45288
0.8 < r12 < 0.9   2.03481    0.29197      0.29015
0.9 < r12 < 1     2.18323    0.14604      0.14367

From Table 4.21, it can be seen that ridge and lasso regression have smaller MSEs than OLS whenever the correlation between the two predictor variables exceeds 0.1. Between 0 and 0.1, however, OLS has a smaller MSE than the proposed shrinkage techniques. Hence we can say that OLS is best when the predictors are nearly uncorrelated.

4.6 Application of L1-L2 Regularization to Bodyfat Data

L1-L2 regularization was compared with ordinary least squares on the bodyfat dataset. The sample sizes used were 25 (a small sample), 100 and 200 (large samples). This thesis investigated the behavior of these estimators as the sample size increases, focusing on the effect of increasing sample size on the significance and stability of the regression estimates. We also compared the MSEs and MAEs of the estimators as the sample size was increased. The results are shown and discussed below.

Figure 4.9 shows the matrix plot of the bodyfat data for the different sample sizes.

Figure 4.9: Matrix Plot of Predictor Variables of Bodyfat Data
(a) n=25  (b) n=100  (c) n=200

The data points become very compact as the sample size increases, and a positive linear relationship exists among the variables.

Tables 4.22 to 4.24 display the Pearson product-moment correlation coefficient between each pair of predictor variables. It measures the strength of the linear relationship between two variables and lies between -1 and 1.

Table 4.22: Correlation Matrix for the Bodyfat Data, n=25

n=25      weight   neck   chest   abdomen   hip    thigh   knee   biceps
weight    1        0.80   0.83    0.70      0.90   0.79    0.77   0.86
neck               1      0.72    0.54      0.69   0.69    0.54   0.75
chest                     1       0.75      0.79   0.72    0.65   0.70
abdomen                           1         0.80   0.78    0.79   0.58
hip                                         1      0.81    0.73   0.77
thigh                                              1       0.67   0.84
knee                                                       1      0.59
biceps                                                            1

Table 4.23: Correlation Matrix for the Bodyfat Data, n=100

n=100     weight   neck   chest   abdomen   hip    thigh   knee   biceps
weight    1        0.85   0.90    0.89      0.94   0.89    0.86   0.82
neck               1      0.79    0.76      0.76   0.75    0.71   0.73
chest                     1       0.93      0.85   0.79    0.76   0.75
abdomen                           1         0.88   0.79    0.77   0.70
hip                                         1      0.91    0.83   0.79
thigh                                              1       0.83   0.79
knee                                                       1      0.72
biceps                                                            1

Table 4.24: Correlation Matrix for the Bodyfat Data, n=200

n=200     weight   neck   chest   abdomen   hip    thigh   knee   biceps
weight    1        0.83   0.89    0.89      0.94   0.87    0.85   0.80
neck               1      0.78    0.75      0.73   0.70    0.67   0.73
chest                     1       0.92      0.83   0.73    0.72   0.73
abdomen                           1         0.87   0.77    0.74   0.68
hip                                         1      0.90    0.82   0.74
thigh                                              1       0.80   0.76
knee                                                       1      0.68
biceps                                                            1

From Table 4.24, it can be seen that the correlations are very high and the independence assumption has been violated. This is an indication that multicollinearity might exist in our dataset.
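The thesis does not state the source of its bodyfat data. One public copy of the classical body fat dataset ships with the R package mfp as bodyfat, which is used below purely for illustration; the column holding percent body fat is assumed here to be siri, and the first 25 rows stand in for the small-sample subset.

    library(mfp)
    data(bodyfat)
    vars  <- c("weight", "neck", "chest", "abdomen", "hip", "thigh", "knee", "biceps")
    sub25 <- bodyfat[1:25, ]                        # assumed n = 25 subset
    round(cor(sub25[, vars]), 2)                    # cf. Table 4.22
    fit25 <- lm(siri ~ ., data = sub25[, c("siri", vars)])
    summary(fit25)                                  # cf. Table 4.25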
Table 4.25 shows the OLS regression output of the bodyfat dataset for a sample size of 25.

Table 4.25: OLS Output for n=25

n=25        Coeff      Std. Error   t-value   P-value
Intercept   -81.2625   47.8353      -1.699    0.1087
Weight      -0.3836    0.1581       -2.426    0.0274
Neck        -0.7064    0.6411       -1.102    0.2868
Chest       -0.1571    0.2816       -0.588    0.5847
Abdomen     0.7510     0.2910       2.581     0.0201
Hip         0.6870     0.5094       1.349     0.1962
Thigh       0.1247     0.6101       0.204     0.8406
Knee        1.0783     0.9054       1.191     0.2510
Biceps      0.6944     0.7265       0.956     0.3534

F Statistic: F8,16 = 5.82, P-value = 0.001398

From the OLS output for n=25, most of the independent variables are not significant at an α level of 0.05, yet the F-statistic for the overall model is significant. This is due to multicollinearity among the independent variables; only two variables are significant at α = 0.05.

Table 4.26 shows the regression output of the bodyfat dataset for a sample size of 100.

Table 4.26: OLS Output for n=100

n=100       Coeff      Std. Error   t-value   P-value
Intercept   -43.6160   19.1798      -2.274    0.0253
Weight      -0.2095    0.0614       -3.413    0.0010
Neck        -0.6319    0.3177       -1.989    0.0497
Chest       -0.1438    0.1506       -0.955    0.3419
Abdomen     1.1128     0.1119       9.943     0.0000
Hip         -0.3628    0.1990       -1.823    0.0715
Thigh       0.4970     0.2064       2.408     0.0180
Knee        0.2897     0.3700       0.783     0.4354
Biceps      0.0643     0.2419       0.266     0.7910

F8,91 = 44.2, P-value = 0.0000

From Tables 4.25, 4.26 and 4.27, it can be observed that the number of significant variables has increased with the sample size, and the overall model has become more significant. This implies that collecting additional samples can increase the significance of variables in a regression analysis, making the regression model more stable.

Table 4.27 shows the regression output of the bodyfat dataset for n=200.

Table 4.27: OLS Output for n=200

n=200       Coeff      Std. Error   t-value   P-value
Intercept   -35.4794   12.9555      -2.739    0.0066
Weight      -0.1433    0.0443       -3.232    0.0014
Neck        -0.5700    0.2186       -2.607    0.0097
Chest       0.0274     0.0977       0.281     0.7791
Abdomen     1.0253     0.0753       13.617    0.0000
Hip         -0.2081    0.1429       -1.456    0.1467
Thigh       0.2454     0.1312       1.870     0.0627
Knee        0.0448     0.2292       0.196     0.8451
Biceps      0.2712     0.1657       1.642     0.1018

F8,191 = 83.88, P-value = 0.0000

It can be observed that the number of significant variables did not change between n=100 and n=200, since samples larger than 30 are already considered large. However, the standard errors reduced as the sample size increased. This implies that large samples produce more stable estimates, and that the multicollinearity problem can only get worse in a normally distributed population when the sample size is small.

Table 4.28 shows the MSEs of the bodyfat data across the three different sample sizes.

Table 4.28: MSEs Across Three Different Sample Sizes

Sample size (n)   MSE(OLS)   MSE(RR)     MSE(LR)
25                11.03316   0.250453    0.2524211
100               16.52934   0.2605047   0.2028121
200               18.54474   0.3058488   0.2663728

Table 4.29 shows the MAEs of the bodyfat data across the three different sample sizes.

Table 4.29: MAEs Across Three Different Sample Sizes

Sample size (n)   MAE(OLS)   MAE(RR)     MAE(LR)
25                2.776839   0.4399902   0.4261684
100               3.40597    0.4292737   0.3774634
200               3.547428   0.4487484   0.4240274

From Table 4.29, the mean absolute errors of the regularized estimators are smaller than those of OLS across the three sample sizes used in this experiment, with Lasso regression producing the smallest MAE among the competing estimators. Hence, L1-L2 regularization techniques are better alternatives to OLS under multicollinearity conditions, with Lasso being a better alternative than ridge regression.
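Tables 4.28 and 4.29 do not state exactly how the errors were computed; the sketch below shows one plausible in-sample version, reusing sub25 and fit25 from the sketch above.

    xb <- as.matrix(sub25[, vars])
    yb <- sub25$siri
    cvr <- glmnet::cv.glmnet(xb, yb, alpha = 0)     # ridge
    cvl <- glmnet::cv.glmnet(xb, yb, alpha = 1)     # lasso
    pred <- cbind(OLS = fitted(fit25),
                  RR  = as.vector(predict(cvr, xb, s = "lambda.min")),
                  LR  = as.vector(predict(cvl, xb, s = "lambda.min")))
    apply(pred, 2, function(p) mean((yb - p)^2))    # MSEs (cf. Table 4.28)
    apply(pred, 2, function(p) mean(abs(yb - p)))   # MAEs (cf. Table 4.29)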
Table 4.30: Standard Errors of Regression Coefficients for n=25

n=25      OLS      RR       LR
weight    0.1581   0.0225   0.0524
neck      0.6411   0.0032   0.0311
chest     0.2816   0.0020   0.0638
abdomen   0.2910   0.0002   0.0049
hip       0.5094   0.0102   0.2470
thigh     0.6101   0.0055   0.0481
knee      0.9054   0.0050   0.0178
biceps    0.7265   0.0095   0.0636

Table 4.31: Standard Errors of Regression Coefficients for n=100

n=100     OLS      RR       LR
weight    0.2220   0.0031   0.2281
neck      0.0929   0.0002   0.0475
chest     0.1500   0.0012   0.0362
abdomen   0.1457   0.0039   0.0402
hip       0.1833   0.0016   0.0915
thigh     0.1329   0.0016   0.0882
knee      0.0986   0.0003   0.0172
biceps    0.0881   0.0004   0.0040

Table 4.32: Standard Errors of Regression Coefficients for n=200

n=200     OLS      RR       LR
weight    0.0465   0.1509   0.0478
neck      0.2294   0.0216   0.1142
chest     0.1043   0.1358   0.1451
abdomen   0.0801   0.0016   0.4020
hip       0.1532   0.0682   0.0920
thigh     0.1445   0.0367   0.0541
knee      0.2671   0.0335   0.0461
biceps    0.1783   0.0636   0.0762

Chapter 5

DISCUSSIONS, CONCLUSIONS AND RECOMMENDATIONS

5.1 Introduction

This chapter discusses the research findings, makes inferences based on the outputs and offers recommendations for further studies.

5.2 Discussions and Conclusions

In this research, the researcher addressed the multicollinearity problem, methods of detecting it and its effect on the results of a multiple regression model. In the simulation study, the high correlations among the covariates caused multicollinearity. The matrix plot showed a linear relationship between the response and each of the predictor variables, so the researcher was able to fit a multiple linear regression to the data using OLS. The coefficient of X1 was negative even though it was positively correlated with Y, as depicted in Table 4.6, and the intercept was very small. This was the first indication that multicollinearity might exist in the dataset, even though the standard errors were not especially high. This procedure was repeated for different sample sizes (n=25, n=50, n=200 and n=1000). The standard errors of the predictors reduced as the sample size increased, and the significance of the predictor variables increased as well. Even though some predictors would not have been significant at, say, α = 0.05, the overall model was always significant. This was a second indication that the predictor variables were collinear, even without looking at the correlation matrix. A more rigorous check for multicollinearity (the VIF) was adopted to affirm the suspicion of collinear variables. From Table 4.9 of VIFs, it could be observed that most of the VIFs of the predictor variables were greater than ten, which indicates multicollinearity according to the rule of thumb.

L1 and L2 regularization techniques were adopted to solve the problem of collinearity in the simulated dataset. To determine which of these techniques produced a more efficient and stable model while also reducing the standard errors of the regression coefficients, the study compared the MSEs and standard errors of OLS, L1 and L2 regularization. The regularization (shrinkage) techniques penalized the coefficients of the regression model towards zero. The least significant variables shrank fastest, as seen in the trace plots of Figure 4.4. In Figure 4.6 (a lasso plot of the coefficients against the shrinkage constant) it was observed that the variable X1 was dropped while the other variables experienced shrinkage across the different sample sizes (also seen in equations 4.3, 4.6, 4.9 and 4.12).
This affirms the parameter-elimination property of lasso regression discussed in Chapter 3. Cross-validation was used to select the optimal λ value (shrinkage constant). From Figures 4.2 and 4.3 it was observed that there is a range of λ values for which the MSEs of the regularized estimators are less than the MSE of OLS; the minimum of these lambdas was chosen as the optimal value after the cross-validation of Table 4.15. From Table 4.16 of MSEs, it was observed that L1 produced the smallest MSEs compared with L2 and OLS. The smaller the MSE of an estimator, the smaller the prediction error. This means that unbiasedness, though very important, should not be the ultimate criterion when selecting between competing estimators. Likewise, L1 and L2 had the smallest MAEs across all sample sizes for the collinear simulated dataset. This means that, in the presence of multicollinearity, regularized regression approaches are more efficient. However, in moderation studies L2 regression should be chosen over the L1 approach, since ridge regression does not eliminate parameters. With respect to the standard errors, RR performed best across all sample sizes. The OLS standard errors decreased as the sample size increased. The Lasso had the better of OLS in Figure 4.8, but for a very large sample size (n=1000) the most significant OLS variable had a coefficient standard error smaller than the corresponding Lasso value.

The level of collinearity at which OLS outperforms L1 and L2 regularization in a two-predictor-variable model was investigated by simulation, with the results summarized in Table 4.21. It was found that, for all levels of correlation above 0.1 between the two predictor variables, ridge and lasso produced smaller MSEs than OLS, with Lasso attaining the minimum MSE. However, for correlations between 0 and 0.1, the OLS estimator had a smaller MSE than the ridge and lasso estimators. Hence the study affirms that OLS is best when the predictors are uncorrelated.

In the analysis summarized in Table 4.15, the smaller the shrinkage parameter, the better the estimator. The lasso method produced shrinkage parameters that were smaller than the ridge shrinkage values across all sample sizes.

From our simulated data, the relationship between increasing sample size and correlation was irregular. It was observed that the sample size does not always affect the degree of collinearity; however, it affects the estimated values, as seen in the application to the bodyfat dataset. Whenever the sample size increases, the results of the estimation methods become more stable.

The following conclusions were made based on the above discussion:

• L1 and L2 regularization techniques help to reduce the standard errors of the regression coefficients as well as the prediction error of the fitted model.

• L1 regression is best and produces parsimonious models in the presence of multicollinearity.

• The higher the degree of multicollinearity, the smaller the shrinkage parameter. This means there is an optimal value of λ for every change in the dataset values, which makes λ a random variable.

• Increasing the sample size gives more stable outcomes after estimation, as it helps to reduce the standard errors of the regression coefficients of the predictor variables.

• L2 regularization would be the best alternative in moderation studies where we would like to keep all of the predictor variables.
It is also the best technique to employ when the standard errors of the OLS regression coefficients are highly inflated.

• OLS is best for independent samples, but the modern regression approaches (L1 and L2) should be embraced for correlated covariates.

5.3 Recommendations

From the findings of the study, we recommend that, in the presence of high multicollinearity in a dataset, the best approach is L2, since it yields smaller standard errors. However, L1 regularization yielded the smallest MSE across all sample sizes, and hence a combination of the two penalties applied concurrently, such as the elastic net of Ogutu et al. (2012), could also be investigated to measure its performance. Further research should focus on the level of multicollinearity of the data and obtain the best robust approach for each level.

REFERENCES

Belsley, D. A. (2004). Conditioning diagnostics. Encyclopedia of Statistical Sciences, 2.

Chong, I.-G. and Jun, C.-H. (2005). Performance of some variable selection methods when multicollinearity is present. Chemometrics and Intelligent Laboratory Systems, 78(1-2):103–112.

Duzan, H. and Shariff, N. S. B. M. (2015). Ridge regression for solving the multicollinearity problem: review of methods and models. Journal of Applied Sciences, 15(3):392.

Filzmoser, P. and Croux, C. (2003). Dimension reduction of the explanatory variables in multiple linear regression. Pliska Studia Mathematica Bulgarica, 14(1):59–70.

Grewal, R., Cote, J. A., and Baumgartner, H. (2004). Multicollinearity and measurement error in structural equation models: Implications for theory testing. Marketing Science, 23(4):519–529.

Hauser, D. (1974). Some problems in the use of stepwise regression techniques in geographical research. Canadian Geographer/Le Géographe canadien, 18(2):148–158.

Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.

Johnson, R. A. and Wichern, D. W. (2004). Multivariate analysis. Encyclopedia of Statistical Sciences, 8.

Jun, C.-H., Lee, S.-H., Park, H.-S., and Lee, J.-H. (2009). Use of partial least squares regression for variable selection and quality prediction. In Computers & Industrial Engineering (CIE 2009), pages 1302–1307. IEEE.

Kaufinger, G. G. (2013). Earnings management motivations in gift card breakage recognition decisions. Anderson University.

Levy, G., Louis, E. D., Cote, L., Perez, M., Mejia-Santana, H., Andrews, H., Harris, J., Waters, C., Ford, B., Frucht, S., et al. (2005). Contribution of aging to the severity of different motor signs in Parkinson disease. Archives of Neurology, 62(3):467–472.

Ludvigson, S. C. and Ng, S. (2009). A factor analysis of bond risk premia. Technical report, National Bureau of Economic Research.

Madsen, H. and Thyregod, P. (2010). Introduction to General and Generalized Linear Models. CRC Press.

Midi, H., Bagheri, A., and Imon, A. (2010). The application of robust multicollinearity diagnostic method based on robust coefficient determination to a non-collinear data. Journal of Applied Sciences, 10(8):611–619.

Mogessie, E. M. and Bekele, G. (2017). Households' willingness to pay for community based health insurance scheme: in Kewiot and Efratanagedem districts of Amhara region, Ethiopia. Business and Economic Research, 7(2):212–233.

O'Brien, R. M. (2007).
A caution regarding rules of thumb for variance inflation factors. Quality & Quantity, 41(5):673–690.

Ogutu, J. O., Schulz-Streeck, T., and Piepho, H.-P. (2012). Genomic selection using regularized linear regression models: ridge regression, lasso, elastic net and their extensions. In BMC Proceedings, volume 6, page S10. BioMed Central.

Orlov, M. L. (1996). Multiple linear regression analysis using Microsoft Excel. Chemistry Department, Oregon State University.

Pourbasheer, E., Aalizadeh, R., Shokouhi Tabar, S., Ganjali, M. R., Norouzi, P., and Shadmanesh, J. (2014). 2D and 3D quantitative structure–activity relationship study of hepatitis C virus NS5B polymerase inhibitors by comparative molecular field analysis and comparative molecular similarity indices analysis methods. Journal of Chemical Information and Modeling, 54(10):2902–2914.

Santos-Cortez, R. L. P., Lee, K., Giese, A. P., Ansar, M., Amin-Ud-Din, M., Rehn, K., Wang, X., Aziz, A., Chiu, I., Hussain Ali, R., et al. (2014). Adenylate cyclase 1 (ADCY1) mutations cause recessive hearing impairment in humans and defects in hair cell function and hearing in zebrafish. Human Molecular Genetics, 23(12):3289–3298.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B (Methodological), pages 111–147.

Stone, M. and Brooks, R. J. (1990). Continuum regression: cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal components regression. Journal of the Royal Statistical Society, Series B (Methodological), pages 237–269.

Wahab, N. S., Rusiman, M. S., Mohamad, M., Azmi, N. A., Him, N. C., Kamardan, M. G., and Ali, M. (2018). A technique of fuzzy c-mean in multiple linear regression model toward paddy yield. In Journal of Physics: Conference Series, volume 995, page 012010. IOP Publishing.

Xiaobo, Z., Jiewen, Z., Povey, M. J., Holmes, M., and Hanpin, M. (2010). Variables selection methods in near-infrared spectroscopy. Analytica Chimica Acta, 667(1-2):14–32.