Canonical Correlation Analysis to relate a
Genomic Dataset with a Neuroimage Dataset.
Augustine Annan
(10551764)
THIS THESIS IS SUBMITTED TO THE UNIVERSITY OF GHANA,
LEGON IN PARTIAL FULFILLMENT OF THE REQUIREMENT
FOR THE AWARD OF MPHIL MATHEMATICS DEGREE
July, 2016
University of Ghana http://ugspace.ug.edu.gh
DECLARATION
This thesis was written in the Department of Mathematics, University of Ghana, Legon
from September 2015 to July 2016 in partial fulfillment of the requirements for the award
of Master of Philosophy degree in Mathematics under the supervision of Dr. Margaret
McIntyre, Dr. Douglas Adu-Gyamfi, and Dr. Eyram Schwinger of the University of Ghana.
I hereby declare that except where due acknowledgement is made, this work has never been
presented wholly or in part for the award of a degree at the University of Ghana or any other
University.
Signature: ...................................................
Student: Augustine Annan
Signature: ...................................................
Dr. Margaret McIntyre
Signature: ...................................................
Dr. Douglas Adu-Gyamfi
DEDICATION
I dedicate my research project to my family. A special feeling of gratitude to my loving
mother, Agnes Esuon whose words of encouragement and push for tenacity ring in my
ears. My brothers Stephen and Humphrey, my sister Faustina and my friend Ansbertha
have never left my side and are very special.
ACKNOWLEDGEMENTS
My warmest appreciation goes to my supervisors, Dr. Margaret McIntyre and Dr. Alessandro
Crimi, for the patience, motivation, immense knowledge and continuous support and
guidance they offered me throughout this project. I am also very grateful to my other
supervisors, Dr. Douglas Adu-Gyamfi and Dr. Eyram Schwinger, for patiently devoting so
much time to assisting me in this work.
I want to appreciate the African Institute for Mathematical Sciences (AIMS-Ghana), for
supporting this research financially.
To the Head of Department, Dr. Margaret McIntyre; and all the lecturers, I say a big thank
you for giving me such a great opportunity to step up my goals in academia.
To my mother and siblings, I am grateful for your unconditional love, support and
encouragement. My sincere, heartfelt gratitude goes to all my colleagues for all their
encouragement and fun moments.
To God be the glory.
ABSTRACT
This thesis investigates the relationship between copy number variations and neuro-image
features of Glioblastoma patients. Canonical correlation analysis was employed to elicit
these relationships. This thesis highlights some of the concepts of the technique which
enabled us to obtain our main results. We found three pairs of significant canonical variates
with correlations of 0.6704, 0.6347 and 0.5552 respectively, which were used to identify
genes and neuro-image features related to Glioblastoma.
Contents
Declaration
Dedication
Acknowledgements
Abstract
1 Introduction
1.1 Organisation of the Study
2 Definitions
2.1 Definitions of statistical and mathematical terms
3 Methodology
3.1 Canonical Correlation Analysis (CCA)
3.1.1 Canonical Correlation
3.1.2 Mathematical Formulation
3.1.3 Formulation and Derivation of the Canonical Variables
3.1.5 Properties of the Canonical Variable Pairs
3.1.6 Canonical correlation coefficient under the non-singular transformation
3.1.7 Correlation Coefficient Between Canonical Variables and the Original Variables
3.1.8 Computation of Canonical Correlation Coefficient Using Standardized Variables
3.1.9 Assessing Overall Model Fit and Canonical Dimension Reduction
3.2 Example: Computation of Canonical variables and Canonical Coefficients
4 Results
4.1 Data
4.1.1 Patient Features
4.2 Preliminaries
4.3 Main Results
4.3.1 Correlation matrix of variables
4.3.2 Assessment of Overall Model Fit
4.3.3 Interpreting Canonical Variate Pairs
4.3.4 Interpretation of Canonical Variate Using Canonical Weights
4.3.5 Interpretation of Canonical Variate Using Canonical Loadings
4.3.6 Cross Validation
4.3.7 CCA on Sub-Sample A
4.3.8 CCA on Sub-Sample B
4.4 Summary
5 Conclusion
References
List of Tables
4.1 Description of Neuro-image Features Used
4.2 Copy Number Variation Variables (Genes)
4.3 Sex and Survival Status Distribution of Patients
4.4 Age and Overall Survival Time of Patients
4.5 Frequency Distribution of Expression Subtype
4.6 Correlations for Variable Set 1
4.7 Correlations for the Copy Number Variation Variables
4.8 Correlations for the Copy Number Variation Variables
4.9 Correlations between Variable Set 1 and Variable Set 2
4.10 Raw Coefficients for the Neuro-image features
4.11 Raw Coefficients for the Copy Number Variation Variables
4.12 Test of Significance of all Canonical Correlations
4.13 Test of Significance of each Canonical Correlation
4.14 Canonical Correlations and Eigenvalues
4.15 Canonical redundancy analysis for Canonical Correlations
4.16 Standardized Coefficients for the Neuro-image features
4.17 Standardized Coefficients for the Copy Number Variation Variables
4.18 Summary of Important Related Variables
4.19 Canonical Loadings for the Neuro-image features
4.20 Canonical Loadings for the Copy Number Variation Variables
4.21 Summary of Important Related Variables
4.22 Test of Significance of each Canonical Correlation
4.23 Canonical Loadings for the Neuro-image features
4.24 Canonical Loadings for the Copy Number Variation Variables
4.25 Test of Significance of each Canonical Correlation
4.26 Canonical Loadings for the Neuro-image features
4.27 Canonical Loadings for the Copy Number Variation Variables
List of Figures
1.1 The gene amplification has created a copy number variation. The chromosome now has two copies of this section of DNA, rather than one [34].
1.2 Magnetic Resonance Imaging (MRI) images of patients with GBM [37, 13]
1.3 Fully automated Segmentation and VASARI Feature Extraction: necrotic core/contrast enhancing tumor (right) and edema (left) [37]
Chapter 1
Introduction
Many complex diseases result from the interplay of genetics and neuroimage features. As
such, understanding the biological mechanisms underlying such datasets is very important.
With the development of a wide range of genome-wide assays, it is now possible to obtain,
for a particular subject, multiple measures of genomic markers from various platforms, such
as single nucleotide polymorphisms, gene expression, copy number variation and so on. These
measurements relay information about variations of the genome. Putting together two or more
types of data not only helps in the diagnosis of diseases but also enhances comprehension
of the biological mechanisms and consequently could improve treatment strategies. There is
therefore a high demand for integrative approaches for use in large-scale genomic data
analysis, and investigating the associations between such entities is of great use.
Glioma is the most common type of primary brain tumor and arises from glial cells. It is
considered responsible for approximately 13000 deaths in the United States and more than
14000 in Europe each year [35]. Gliomas are heterogeneous and can be classified according
to their grade: low-grade glioma, anaplastic glioma, and glioblastoma. The most common
type of glioma in adults is glioblastoma (GBM). It is generally diagnosed at an average age
of 55 years, and gives the affected patient an average survival time of only 10 to 18
months. Lower grade glioma can occur at younger ages [35]. The underlying tumor pathology
and biological function can be identified by imaging and genetic biomarkers. In the context
of clinical routine, if imaging phenotypes of GBM from magnetic resonance imaging (MRI) can
be easily associated with specific gene expression signatures, they will serve as a
non-invasive alternative to biopsy, providing important information for diagnosis,
prognosis and personalized treatment. Therefore this thesis seeks to investigate the
correspondence between genetic data, in particular the copy number variations, and the
imaging phenotypes of GBM.
One of the most important means of establishing the relationships between two or more
entities or objects is to take measurements of the pertinent relationships. A measure of a
relationship depicts the strength of the relationship or association between the objects.
We use the term correlation to mean any of a broad class of statistical relationships
depicting dependence. The degree of correlation can be measured by the use of correlation
coefficients, denoted by ρ or r. The most widely used coefficient is the one developed by
Karl Pearson, the Pearson correlation coefficient. The core of the project is to present
the idea of canonical correlation analysis and use it to investigate the relationship
between the copy number variations and neuroimage features. The main highlights of the
technique that help to elicit the relationship between the datasets will be discussed. In
the next two paragraphs we introduce copy number variations and the neuroimage features of
tumors.
Copy number variation (CNV) can be defined as an alteration of the deoxyribonucleic acid
(DNA) of a genome that causes the cell to have abnormal repetitions or deletions of one or
more sections of the DNA [10]. The number of repetitions of such sections differs between
individuals in the human population [23]. It is a kind of structural variation, more
precisely a kind of duplication event that affects a considerable number of base pairs [34].
Human beings differ in the number of copies of each gene, and this leads to the idea of
copy number variants. Recent research has shown that about two thirds of the entire human
genome consists of repeats [36] and that about 4.75 to 9.46% of the entire genome can be
described as copy number variations [39]. CNVs play a very notable role in producing the
necessary variation in the population and also in disease phenotype [23].
Figure 1.1: The gene amplification has created a copy number variation. The chromosome now
has two copies of this section of DNA, rather than one [34].
Humans have two copies of most genes, one from the mother’s chromosome and the other
from the father’s chromosome. Some alterations in the chromosome may cause either a
loss or a gain of one copy. Duplications and deletions of more than 1000 nucleotides are
referred to as copy number variants [3]. CNVs are considered to be a very notable risk
factor for cancer and constitute a wide spectrum of the total genomic variation [38]. There
has been an identification of recurrent copy number variations that demonstrate that
various chromosome regions are present. Also, because cancer is an acquired disease in
whose occurrence inherited factors play a major role, early constitutional copy number
alterations have been compared with the copy number variations present in tumor biopsies
[12].
GBM is an aggressive tumor with poor prognosis. Despite the introduction of new strategies
to treat the disease, the median survival is less than one year [12]. In recent studies,
important features have been identified. Pediatric primary GBM differs from adult GBM in
both genetic profile and mean cumulative survival [29, 28, 9, 30]. Pediatric and adult GBMs
have different pathways of tumorigenesis [30]. In 35 to 50% of cases, primary adult GBMs
present amplification of the epidermal growth factor receptor (EGFR) gene and inactivation
of the phosphatase and tensin homolog (PTEN) gene [26, 8]. However, secondary adult GBMs,
which may evolve from low-grade lesions, normally have no alterations of the PTEN gene and
no EGFR duplications but most often have TP53 mutations [33]. Studies have shown that
there are differences in CNV between adult GBMs and childhood GBMs. In pediatric GBMs,
heterozygous deletions are more common, while duplications are more frequent in adult
GBMs [32].
Analyzing imaging features has revealed interesting relationships between the imaging
features and the survival of patients. For patients with malignant gliomas, some tumor
imaging features and clinical data such as age, perioperative Karnofsky performance status
and tumor resection have been established to correlate with survival [31]. The image
features include necrosis and edema. According to Pope et al. [31], edema,
noncontrast-enhancing tumor (nCET) and multifocality were the significant features related
to survival, and these features could be classified as prognostic indicators.
There have been several studies on the relationship between imaging features and survival.
These report that the level of edema and the degree of necrosis are negatively correlated
with survival [27, 21, 16].
Figure 1.2: Magnetic Resonance Imaging (MRI) images of patients with GBM [37, 13]
The importance of imaging has made accurate, informative quantities necessary. The
Visually AcceSAble Rembrandt Images (VASARI) feature set provides standards by which a
numeric score can be assigned to a feature, enabling the degree of tumor features to be
described. It is a standard imaging feature set consisting of 30 features describing the
size, location and appearance of the tumor in the MRI image set. The image presents the
global view of the tumor. A small tumor in the frontal lobe has a vastly different outcome
from a small tumor adjacent to the motor area, for instance the eloquent cortex [13]. For
more accurate results, the Columbia University Medical Center [37] designed a fully
automated computer algorithm to score glioma tumors based on the available feature set.
Figure 1.3: Fully automated Segmentation and VASARI Feature Extraction: necrotic
core/contrast enhancing tumor (right) and edema (left) [37]
Image features have also been used for exploratory radiogenomic analysis [11]. Gevaert et
al. obtained quantitative image features from MR images that characterize the radiographic
phenotype of GBM lesions. They also constructed radiogenomic maps relating the features
with particular molecular data [11]. Even after the consideration of clinical variables, imag-
ing features provide notable prognostic information. Currently, qualitative work suggests
an association between imaging phenotypes and genotypes [13].
Dongdong Lin et al. (2013) [22] investigated the correspondence between single nucleotide
polymorphism (SNP) and brain activity measured by functional magnetic resonance imag-
ing (fMRI) to understand how genetic variation influences the brain activity. They de-
veloped a group sparse canonical correlation analysis method to explore the relationship
between these two datasets. They found two pairs of significant canonical variates with
average correlations of 0.4527 and 0.4292 respectively, which were used to identify genes
and voxels associated with schizophrenia.
1.1 Organisation of the Study
Chapter 2 will present brief definitions of some of the mathematical and statistical terms
that will be used in this work. The main technique to be employed to investigate the
relationships will be reviewed in Chapter 3. In Chapter 4, results from the analysis of the
data will be presented and discussed. Chapter 5 will contain the conclusions and
recommendations and a brief discussion of possible directions for future work.
Chapter 2
Definitions
Prior to the presentation and discussion of the existing technique and methodology, this
chapter will present some definitions of concepts, terms and theorems to be used in the
sequel.
2.1 Definitions of statistical and mathematical terms
Definition 2.1.1.
Suppose we have a square matrix A of size m. A non-zero m×1 vector k is a right eigenvector
of A, with corresponding eigenvalue λ, if Ak = λk. Similarly, a non-zero 1×m row vector n
is a left eigenvector of A if nA = λn. For the symmetric positive semi-definite matrices
used in this thesis, the eigenvalues satisfy λ ≥ 0.
Definition 2.1.2.
Given an m×m matrix B, a matrix M for which $M^2 = B$ is called the square root of the
matrix B.
Several studies have examined the computation of matrix square roots [17, 6, 7, 18, 4].
Here we find the square root of an m×m matrix by the diagonalization method [4].
An m×m matrix B is diagonalizable if there exist a diagonal matrix D and an invertible
matrix K such that $B = KDK^{-1}$. The diagonal entries of D are the eigenvalues of B
and the columns of K are the corresponding m eigenvectors of B. The square root of B is
given as
\[
B^{1/2} = K\sqrt{D}\,K^{-1},
\]
where $\sqrt{D}$ is the diagonal matrix of the square roots of the eigenvalues.
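Example 2.1.3.
Given a matrix
\[
B = \begin{pmatrix} 18 & 12 \\ 12 & 28 \end{pmatrix},
\]
we find $B^{1/2}$ as follows.
The eigenvalues of B are 10 and 36, with eigenvectors (−3,2) and (2,3) respectively, so B
eigendecomposes to
\[
B = \begin{pmatrix} -3 & 2 \\ 2 & 3 \end{pmatrix}
    \begin{pmatrix} 10 & 0 \\ 0 & 36 \end{pmatrix}
    \begin{pmatrix} -3 & 2 \\ 2 & 3 \end{pmatrix}^{-1}.
\]
So we have the form $B = KDK^{-1}$. Since from Definition 2.1.2, $M^2 = B$, there is an M
of the form $K\sqrt{D}\,K^{-1}$:
\[
M = \begin{pmatrix} -3 & 2 \\ 2 & 3 \end{pmatrix}
    \begin{pmatrix} \sqrt{10} & 0 \\ 0 & \sqrt{36} \end{pmatrix}
    \begin{pmatrix} -3 & 2 \\ 2 & 3 \end{pmatrix}^{-1},
\]
\[
\sqrt{B} = \begin{pmatrix} 4.035 & 1.310 \\ 1.310 & 5.127 \end{pmatrix}.
\]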
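The diagonalization method of Example 2.1.3 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `sqrtm_by_diagonalization` is hypothetical and chosen only for illustration, and the use of `numpy.linalg.eigh` assumes the input matrix is symmetric, as B is here.

```python
import numpy as np

def sqrtm_by_diagonalization(B):
    """Square root of a symmetric positive semi-definite matrix via B = K D K^{-1}.

    Hypothetical helper for illustration; assumes B is symmetric.
    """
    eigenvalues, K = np.linalg.eigh(B)        # columns of K are eigenvectors of B
    D_sqrt = np.diag(np.sqrt(eigenvalues))    # square roots of the eigenvalues on the diagonal
    return K @ D_sqrt @ np.linalg.inv(K)

# Matrix from Example 2.1.3
B = np.array([[18.0, 12.0],
              [12.0, 28.0]])
M = sqrtm_by_diagonalization(B)

print(np.round(M, 3))      # square root of B, rounded to three decimals
print(np.round(M @ M, 6))  # multiplying M by itself should recover B
```

The eigenvectors returned by NumPy are normalized, unlike (−3,2) and (2,3) in the example, but the product $K\sqrt{D}\,K^{-1}$ does not depend on how the eigenvectors are scaled.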
Definition 2.1.4.
Let $X_1, \dots, X_p$ be a set of n×1 vectors. Then the n×1 vector $l_x$ is a linear
combination of these vectors if $l_x = a_1X_1 + \dots + a_pX_p$ for some real constants
$a_1, \dots, a_p$, which are usually called loadings.
Singular Value Decomposition
Let A be a p×q real matrix. Then it can be represented as $A = UDV'$, where U is a p×p
orthogonal matrix, V is a q×q orthogonal matrix and D is a p×q diagonal matrix with
non-negative diagonal elements $\lambda_i$, $i = 1, \dots, \min(p,q)$. The first min(p,q)
columns of U and V are the left and right singular vectors, respectively, and $\lambda_i$,
$i = 1, \dots, \min(p,q)$, are the corresponding singular values. Note that the left
singular vectors of A are eigenvectors of $AA'$, while the right singular vectors are
eigenvectors of $A'A$. The non-zero eigenvalues of $AA'$ and $A'A$ coincide, and they are
equal to the squared non-zero singular values of A.
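The stated link between singular values and the eigenvalues of $AA'$ and $A'A$ can be illustrated numerically. This is only a sketch on a randomly generated matrix (the dimensions p = 3, q = 5 and the seed are arbitrary assumptions), using NumPy's `numpy.linalg.svd` and `numpy.linalg.eigvalsh`.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 3, 5
A = rng.normal(size=(p, q))          # arbitrary p x q real matrix (synthetic)

# Full SVD: U is p x p, Vt is V' (q x q), s holds the min(p, q) singular values
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Build the p x q "diagonal" matrix D with the singular values on its diagonal
D = np.zeros((p, q))
D[:len(s), :len(s)] = np.diag(s)

reconstruction_ok = np.allclose(U @ D @ Vt, A)   # A = U D V'

# Eigenvalues of AA' (p of them) and A'A (q of them), sorted in decreasing order
eig_AAt = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]
eig_AtA = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]

print(reconstruction_ok)
print(np.round(s**2, 6))
print(np.round(eig_AAt, 6))
print(np.round(eig_AtA, 6))   # same leading values, followed by q - p zeros
```

In this sketch the q − p extra eigenvalues of $A'A$ are zero up to rounding error, which is why only the non-zero eigenvalues of the two products agree.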
Lemma 2.1.5. (The Cauchy-Schwarz Inequality)
Let H be a Hilbert space over $\mathbb{C}$. We have that
\[
|\langle x,y\rangle|^2 \le \langle x,x\rangle\,\langle y,y\rangle
\quad \forall\, x,y \in H.
\]
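Proof. If $y = 0$, then $\langle x,0\rangle = 0$ and the inequality is true. Assume
$y \neq 0$ and let
\[
a = -\frac{\langle x,y\rangle}{\langle y,y\rangle}.
\]
Clearly a is a complex number since $\langle x,y\rangle$ is a complex number and
$\langle y,y\rangle$ is a real number. Then we have
\begin{align*}
0 \le \langle x+ay,\,x+ay\rangle
 &= \langle x,\,x+ay\rangle + \langle ay,\,x+ay\rangle \\
 &= \langle x,x\rangle + \langle x,ay\rangle + \langle ay,x\rangle + \langle ay,ay\rangle \\
 &= \langle x,x\rangle + \bar{a}\langle x,y\rangle + a\langle y,x\rangle + a\langle y,ay\rangle \\
 &= \langle x,x\rangle + \bar{a}\langle x,y\rangle + a\langle y,x\rangle + a\bar{a}\langle y,y\rangle \\
 &= \langle x,x\rangle + \bar{a}\langle x,y\rangle + a\langle y,x\rangle + |a|^2\langle y,y\rangle \\
 &= \langle x,x\rangle
    - \frac{\overline{\langle x,y\rangle}}{\langle y,y\rangle}\langle x,y\rangle
    - \frac{\langle x,y\rangle}{\langle y,y\rangle}\overline{\langle x,y\rangle}
    + \left|-\frac{\langle x,y\rangle}{\langle y,y\rangle}\right|^2\langle y,y\rangle \\
 &= \langle x,x\rangle - \frac{2|\langle x,y\rangle|^2}{\langle y,y\rangle}
    + \frac{|\langle x,y\rangle|^2}{\langle y,y\rangle} \\
 &= \langle x,x\rangle - \frac{|\langle x,y\rangle|^2}{\langle y,y\rangle},
\end{align*}
where we have used $\langle y,x\rangle = \overline{\langle x,y\rangle}$.
Hence,
\begin{align*}
0 &\le \langle x,x\rangle - \frac{|\langle x,y\rangle|^2}{\langle y,y\rangle}, \\
|\langle x,y\rangle|^2 &\le \langle x,x\rangle\,\langle y,y\rangle, \\
|\langle x,y\rangle| &\le \sqrt{\langle x,x\rangle}\,\sqrt{\langle y,y\rangle},
\end{align*}
so that $|\langle x,y\rangle|^2 \le \langle x,x\rangle\,\langle y,y\rangle$ as desired.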
Definition of Statistical Terms
Definition 2.1.6.
Variance measures the spread or dispersion or compactness of a set of data. It is computed
as the average of the squared deviations from the mean score of the data set.
Definition 2.1.7.
Covariance is a measure of the degree to which two variables change together.
The covariance matrix is the matrix whose (i, j)th entry is the covariance between the ith
and jth variables. All covariance matrices are symmetric and positive semi-definite.
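As an illustration of Definitions 2.1.6 and 2.1.7, the sketch below builds a sample covariance matrix from synthetic data and checks that it is symmetric and positive semi-definite. The data are randomly generated for illustration only, and the divisor n − 1 (the usual sample covariance) is an assumption; it is also the default used by `numpy.cov`.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 4))      # 200 synthetic observations of 4 variables

# Deviations of each variable from its mean
centered = data - data.mean(axis=0)

# Sample covariance matrix: entry (i, j) is the covariance of variables i and j,
# and the diagonal entries are the variances
n = data.shape[0]
S = centered.T @ centered / (n - 1)

is_symmetric = np.allclose(S, S.T)
min_eigenvalue = np.linalg.eigvalsh(S).min()   # non-negative for a PSD matrix

print(is_symmetric)
print(min_eigenvalue >= -1e-12)
```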
The following definitions are adapted from the supplement to Hair et al.'s textbook [14].
Definition 2.1.8.
A canonical variate also known as a linear compound or a linear composite is a linear
combination that constitutes the weighted sum of two or more variables. Thus a canonical
variate can be defined for either set of variables.
Definition 2.1.9.
A canonical function depicts the relationship between two canonical variates (linear com-
posites). For each canonical function, there are two canonical variates, one variate for
one set of variables and another variate for the other set of variables. The degree of the
relationship is the canonical correlation.
Definition 2.1.10.
The canonical roots are the squared canonical correlations. They are also known as
eigenvalues. The canonical roots provide an estimate of the shared variance between the
weighted canonical variates of the two sets of variables.
Definition 2.1.11.
Orthogonality here is a mathematical constraint which specifies that the canonical functions
are independent of one another. Put differently, to arrive at statistical independence of
the canonical functions we derive the functions so that each function is perpendicular to all
others when plotted in multivariate space.
Definition 2.1.12.
The canonical loading is the measure of correlation between the original variables and their
canonical variates.
Definition 2.1.13.
The redundancy index is the measure of the amount of variance explained between a canon-
ical variate pair in a canonical function.
Chapter 3
Methodology
In this chapter, we present the idea of Canonical Correlation Analysis. The technique seeks
to identify the relationships between two datasets. Canonical correlation analysis is
presented in Section 3.1 and an example is worked through in Section 3.2. The discussion
of the technique is tailored to the datasets used in this thesis. The main
references used for this chapter are [20, 15, 24].
3.1 Canonical Correlation Analysis (CCA)
3.1.1 Canonical Correlation
Canonical correlation analysis is a technique that measures the relationship between two
multidimensional variables. It seeks to find two bases in which the correlation matrix
between the variables is diagonal and the correlations on the diagonal are maximized.
CCA was first introduced by H. Hotelling in 1936 [19]. Canonical correlation is invariant
with respect to affine transformations of the variables. This property differentiates it
from ordinary correlation analysis. CCA enables us to summarize the relationships into a
smaller number of statistics while preserving the main facets of the relationships.
We begin with the following notation:
we define two vectors X and Y as two sets of variables, where X consists of p variables and
Y consists of q variables. We select X and Y depending on the number of variables in each
set so that p≤ q for computational reasons and convenience.
So
\[
X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{pmatrix}
\quad\text{and}\quad
Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_q \end{pmatrix}.
\tag{3.1}
\]
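We define two sets of linear combinations, M and N. M will consist of linear combinations
of the variables $X_i$ in X, and N will consist of linear combinations of the variables
$Y_j$ in Y. We have
\begin{align*}
M_1 &= a_{11}X_1 + a_{12}X_2 + \cdots + a_{1p}X_p \\
M_2 &= a_{21}X_1 + a_{22}X_2 + \cdots + a_{2p}X_p \\
    &\;\;\vdots \\
M_p &= a_{p1}X_1 + a_{p2}X_2 + \cdots + a_{pp}X_p
\end{align*}
and
\begin{align*}
N_1 &= b_{11}Y_1 + b_{12}Y_2 + \cdots + b_{1q}Y_q \\
N_2 &= b_{21}Y_1 + b_{22}Y_2 + \cdots + b_{2q}Y_q \\
    &\;\;\vdots \\
N_p &= b_{p1}Y_1 + b_{p2}Y_2 + \cdots + b_{pq}Y_q.
\end{align*}
In vector form, each combination is written $M_i = a_i'X$ and $N_i = b_i'Y$, where
$a_i = (a_{i1}, \dots, a_{ip})'$ and $b_i = (b_{i1}, \dots, b_{iq})'$.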
We also define (Mi,Ni) as the ith canonical variate pair. So (M1,N1) is the first canonical
variate pair, and (M2,N2) is the second canonical variate pair and so on. There are p
canonical variate pairs.
We seek to find linear combinations that maximize the correlations between the members
of each canonical variate pair.
The correlation $\mathrm{corr}(M_i,N_j)$ between $M_i$ and $N_j$ is then calculated using (3.2):
\[
\mathrm{corr}(M_i,N_j) = \frac{\mathrm{cov}(M_i,N_j)}{\sqrt{\mathrm{var}(M_i)\,\mathrm{var}(N_j)}},
\tag{3.2}
\]
where $\mathrm{cov}(M_i,N_j)$ is the covariance between $M_i$ and $N_j$, and
$\mathrm{var}(M_i)$ and $\mathrm{var}(N_j)$ are the variances of $M_i$ and $N_j$
respectively. The canonical correlation for the ith canonical variate pair is simply the
correlation between $M_i$ and $N_i$:
\[
\rho_i = \frac{\mathrm{cov}(M_i,N_i)}{\sqrt{\mathrm{var}(M_i)\,\mathrm{var}(N_i)}}.
\tag{3.3}
\]
The quantity in (3.3) is to be maximized; thus we find linear combinations of the $X_i$'s
and linear combinations of the $Y_j$'s that maximize the above correlation.
So the main purpose of canonical correlation analysis is to explain the covariance
structure or correlation structure between two random vectors in terms of a small number
of linear combinations.
3.1.2 Mathematical Formulation
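For the p-dimensional random vector X and the q-dimensional random vector Y, we denote
$\mathrm{cov}(X,X)$, $\mathrm{cov}(Y,Y)$ and $\mathrm{cov}(X,Y)$ by $\Sigma_{11}$,
$\Sigma_{22}$ and $\Sigma_{12}$ respectively, with $\Sigma_{21} = \Sigma_{12}'$. So, the
covariance structure of X and Y is given as
\[
\mathrm{cov}\begin{pmatrix} X \\ Y \end{pmatrix}
= \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.
\]
Considering the linear combinations $a'X$ and $b'Y$, we have that
\[
\mathrm{cov}(a'X, b'Y) = a'\Sigma_{12}\,b.
\]
Since $\mathrm{var}(a'X) = a'\Sigma_{11}a$ and $\mathrm{var}(b'Y) = b'\Sigma_{22}b$, the
correlation between $a'X$ and $b'Y$ is
\[
\rho(a'X, b'Y) = \frac{a'\Sigma_{12}\,b}{\sqrt{a'\Sigma_{11}a \times b'\Sigma_{22}b}}.
\]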
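The formula for $\rho(a'X, b'Y)$ can be checked on data by comparing it with the correlation computed directly from the two linear combinations. The sketch below uses synthetic data and arbitrary weight vectors `a` and `b`, all of which are assumptions made only for illustration; the sample covariance blocks play the role of $\Sigma_{11}$, $\Sigma_{12}$ and $\Sigma_{22}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 500, 2, 3
X = rng.normal(size=(n, p))                                   # synthetic first set
Y = X @ rng.normal(size=(p, q)) + rng.normal(size=(n, q))     # second set, related to X

# Partition the sample covariance matrix of (X, Y) into blocks
S = np.cov(np.hstack([X, Y]), rowvar=False)
S11, S12, S22 = S[:p, :p], S[:p, p:], S[p:, p:]

# Arbitrary weight vectors defining the linear combinations a'X and b'Y
a = np.array([1.0, -0.5])
b = np.array([0.3, 0.2, -1.0])

# Correlation from the covariance blocks
rho_formula = (a @ S12 @ b) / np.sqrt((a @ S11 @ a) * (b @ S22 @ b))

# Correlation computed directly from the two derived variables
rho_direct = np.corrcoef(X @ a, Y @ b)[0, 1]

print(rho_formula, rho_direct)   # the two values should agree up to rounding
```

Canonical correlation analysis then amounts to searching over `a` and `b` for the weights that make this quantity as large as possible.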
3.1.3 Formulation and Derivation of the Canonical Variables
The canonical variables and associated correlation coefficients are defined iteratively.
1st Pair of Canonical Variables:
Definition: Consider $M_1 = a'X$ and $N_1 = b'Y$ such that
• $\mathrm{var}(M_1) = \mathrm{var}(N_1) = 1$ and
• $\rho(M_1,N_1) = \max_{a,b}\,\rho(a'X, b'Y)$,
then $(M_1,N_1)$ is the 1st pair of canonical variables (canonical variates) and
$\rho_1 = \max_{a,b}\,\rho(a'X, b'Y)$ is the 1st canonical correlation coefficient.
2nd pair of Canonical Variables:
Definition: Consider linear combinations a′X and b′Y such that
• cov(a′X ,M1) = 0 = cov(b′Y,N1), that is M1 is uncorrelated with the linear combina-
tions a′X and N1 is uncorrelated with b′Y and
• var(a′X) = var(b′Y ) = 1
Then maximize the correlation between a′X and b′Y subject to the above conditions. The
maximizing a′X and b′Y are called the second pair of canonical variates. The resulting
maximal correlation is the second canonical correlation coefficient.
kth pair of Canonical Variables:
Definition: The kth pair of canonical variables is the pair of linear combinations
$(M_k,N_k)$ having unit variance which maximizes the correlation among all possible linear
combinations uncorrelated with the previous (k−1) canonical variate pairs.
The following statements will help us in the derivation of the canonical variables.
\[
\mathrm{cov}(X,X) = \Sigma_{11} > 0, \qquad \mathrm{cov}(Y,Y) = \Sigma_{22} > 0,
\]
that is, the covariance matrices $\Sigma_{11}$ and $\Sigma_{22}$ are positive definite. Now
we consider the p×q matrix
\[
A = \Sigma_{11}^{-1/2}\,\Sigma_{12}\,\Sigma_{22}^{-1/2}
\]
and the following matrices:
\begin{align*}
AA' &= \Sigma_{11}^{-1/2}\,\Sigma_{12}\,\Sigma_{22}^{-1}\,\Sigma_{21}\,\Sigma_{11}^{-1/2} \quad (p\times p), \\
A'A &= \Sigma_{22}^{-1/2}\,\Sigma_{21}\,\Sigma_{11}^{-1}\,\Sigma_{12}\,\Sigma_{22}^{-1/2} \quad (q\times q).
\end{align*}
Let $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$ be the eigenvalues of $AA'$ and let
$\gamma_1 \ge \gamma_2 \ge \dots \ge \gamma_q$ be the eigenvalues of $A'A$.
We have that
(i) $A'A$ and $AA'$ are positive semi-definite, which implies that $\lambda_i \ge 0$ and
$\gamma_j \ge 0$ for all i, j.
(ii) The non-zero eigenvalues of $AA'$ are the same as the non-zero eigenvalues of $A'A$,
and the eigenvalue 0 has different multiplicities in $AA'$ and $A'A$ if $p < q$.
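Statements (i) and (ii) can be illustrated numerically. The following is a sketch under stated assumptions: the data are synthetic, sample covariance blocks stand in for $\Sigma_{11}$, $\Sigma_{12}$, $\Sigma_{22}$, and `inv_sqrt` is a hypothetical helper that computes an inverse square root by the diagonalization method of Chapter 2.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix (hypothetical helper)."""
    w, K = np.linalg.eigh(S)
    return K @ np.diag(1.0 / np.sqrt(w)) @ K.T

rng = np.random.default_rng(3)
n, p, q = 400, 2, 4
X = rng.normal(size=(n, p))
# First p columns of Y are noisy copies of X, the remaining q - p are pure noise
Y = np.hstack([X + rng.normal(size=(n, p)), rng.normal(size=(n, q - p))])

S = np.cov(np.hstack([X, Y]), rowvar=False)
S11, S12, S22 = S[:p, :p], S[:p, p:], S[p:, p:]

A = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)            # p x q matrix A

lam = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]   # p eigenvalues of AA'
gam = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]   # q eigenvalues of A'A

print(np.round(lam, 6))
print(np.round(gam, 6))   # same leading p values, then q - p (numerical) zeros
```

With p < q, the matrix $A'A$ carries q − p additional zero eigenvalues, which is the difference in multiplicity referred to in (ii).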
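Theorem 3.1.4. [20] Suppose that $p \le q$ and
\[
\mathrm{cov}\begin{pmatrix} X \\ Y \end{pmatrix}
= \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.
\]
Considering the linear combinations $M = a'X$ and $N = b'Y$, we have that
\[
\max_{a,b}\,\rho(a'X, b'Y) = \rho_1
\]
is attained by the linear combinations
\[
M_1 = e_1'\,\Sigma_{11}^{-1/2}X \quad\text{and}\quad N_1 = f_1'\,\Sigma_{22}^{-1/2}Y.
\]
$M_1$ and $N_1$ are the first pair of canonical variables. Among linear combinations
uncorrelated with $M_1$ and $N_1$,
\[
\max_{a,b}\,\rho(a'X, b'Y) = \rho_2
\]
is attained by the linear combinations
\[
M_2 = e_2'\,\Sigma_{11}^{-1/2}X \quad\text{and}\quad N_2 = f_2'\,\Sigma_{22}^{-1/2}Y.
\]
$M_2$ and $N_2$ are the second pair of canonical variables.
In general, among linear combinations uncorrelated with the previous $k-1$ canonical
variate pairs,
\[
\max_{a,b}\,\rho(a'X, b'Y) = \rho_k
\]
is attained by the linear combinations
\[
M_k = e_k'\,\Sigma_{11}^{-1/2}X \quad\text{and}\quad N_k = f_k'\,\Sigma_{22}^{-1/2}Y.
\]
Here $\rho_1^2 \ge \rho_2^2 \ge \dots \ge \rho_p^2$ are the eigenvalues of the matrix
\[
\Sigma_{11}^{-1/2}\,\Sigma_{12}\,\Sigma_{22}^{-1}\,\Sigma_{21}\,\Sigma_{11}^{-1/2}
\]
and $e_1, e_2, \dots, e_p$ are the orthonormalized eigenvectors corresponding to
$\rho_1^2, \dots, \rho_p^2$.
The values $\rho_1^2 \ge \rho_2^2 \ge \dots \ge \rho_p^2$ are also the p largest eigenvalues
of the matrix
\[
\Sigma_{22}^{-1/2}\,\Sigma_{21}\,\Sigma_{11}^{-1}\,\Sigma_{12}\,\Sigma_{22}^{-1/2}
\]
with eigenvectors $f_1, f_2, \dots, f_p$, where each $f_i$ is proportional to
$\Sigma_{22}^{-1/2}\,\Sigma_{21}\,\Sigma_{11}^{-1/2}\,e_i$.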
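The construction in Theorem 3.1.4 can be sketched in code. This is a minimal illustration rather than the computation used for the results of this thesis: it assumes synthetic data, replaces the population matrices by sample covariance blocks, and assumes all canonical correlations are non-zero so that each $f_k$ can be normalized. The names `cca` and `inv_sqrt` are hypothetical.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix (hypothetical helper)."""
    w, K = np.linalg.eigh(S)
    return K @ np.diag(1.0 / np.sqrt(w)) @ K.T

def cca(X, Y):
    """Canonical correlations and weights following Theorem 3.1.4 (illustrative sketch).

    Returns rho (length p) and weight matrices whose k-th columns are
    a_k = S11^{-1/2} e_k and b_k = S22^{-1/2} f_k.
    """
    p = X.shape[1]
    S = np.cov(np.hstack([X, Y]), rowvar=False)
    S11, S12, S22 = S[:p, :p], S[:p, p:], S[p:, p:]
    S11_is, S22_is = inv_sqrt(S11), inv_sqrt(S22)

    A = S11_is @ S12 @ S22_is                    # p x q
    eigvals, E = np.linalg.eigh(A @ A.T)         # eigenpairs of AA'
    order = np.argsort(eigvals)[::-1]            # sort in decreasing order
    eigvals, E = eigvals[order], E[:, order]
    rho = np.sqrt(np.clip(eigvals, 0.0, None))   # canonical correlations

    F = A.T @ E                                  # f_k proportional to A' e_k
    F = F / np.linalg.norm(F, axis=0)            # normalize (assumes rho_k > 0)
    return rho, S11_is @ E, S22_is @ F

rng = np.random.default_rng(4)
n, p, q = 300, 3, 5
X = rng.normal(size=(n, p))
Y = np.hstack([X[:, :2] + 0.8 * rng.normal(size=(n, 2)),
               rng.normal(size=(n, q - 2))])

rho, a, b = cca(X, Y)
M, N = X @ a, Y @ b      # canonical variates, one column per pair

print(np.round(rho, 4))
print(np.round([np.corrcoef(M[:, k], N[:, k])[0, 1] for k in range(p)], 4))
```

In this sketch the sample correlation of each pair $(M_k, N_k)$ reproduces $\rho_k$, each canonical variate has unit sample variance, and the $M_k$ are mutually uncorrelated, which is what the theorem leads one to expect.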
Derivation of the 1st pair of canonical variables
Proof. From the definitions, we have that
\[
\rho(a'X, b'Y) = \frac{a'\Sigma_{12}\,b}{\left(a'\Sigma_{11}a \; b'\Sigma_{22}b\right)^{1/2}}.
\tag{3.4}
\]
We let $\Sigma_{11}^{1/2}a = u \implies a = \Sigma_{11}^{-1/2}u$
and let $\Sigma_{22}^{1/2}b = v \implies b = \Sigma_{22}^{-1/2}v$.
So, equation 3.4 becomes
\[
\rho(a'X, b'Y) =
\frac{u'\,\Sigma_{11}^{-1/2}\,\Sigma_{12}\,\Sigma_{22}^{-1/2}\,v}{\left((u'u)(v'v)\right)^{1/2}}.
\]
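By applying the Cauchy-Schwarz inequality, we have that
\[
u'\,\Sigma_{11}^{-1/2}\,\Sigma_{12}\,\Sigma_{22}^{-1/2}\,v \le
\left(u'\,\Sigma_{11}^{-1/2}\,\Sigma_{12}\,\Sigma_{22}^{-1/2}\,\Sigma_{22}^{-1/2}\,\Sigma_{21}\,\Sigma_{11}^{-1/2}\,u\right)^{1/2}
\left(v'v\right)^{1/2}.
\tag{3.5}
\]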
We make use of the following result to find an upper bound of the expression on the right.
From matrix theory, if C (p×p) is a real symmetric matrix with eigenvalues
$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$ and orthonormalized eigenvectors
$e_1, \dots, e_p$, then we have the following result:
\[
\max_{d} \frac{d'Cd}{d'd} = \lambda_1,
\]
where $\lambda_1$ is the largest eigenvalue of the real symmetric matrix C and d is a
non-zero vector. The maximum is attained at $d = e_1$, where $e_1$ is the orthonormalized
eigenvector corresponding to the largest eigenvalue $\lambda_1$.
This implies that
\[
d'Cd \le \lambda_1\, d'd.
\]
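So we have that
\[
u'\,\Sigma_{11}^{-1/2}\,\Sigma_{12}\,\Sigma_{22}^{-1}\,\Sigma_{21}\,\Sigma_{11}^{-1/2}\,u \le \rho_1^2\, u'u.
\tag{3.6}
\]
In equation 3.6 equality holds at $u = e_1$, and in equation 3.5 equality is attained if v
is proportional to $\Sigma_{22}^{-1/2}\,\Sigma_{21}\,\Sigma_{11}^{-1/2}\,e_1$.
That is, since $u = \Sigma_{11}^{1/2}a$ and $v = \Sigma_{22}^{1/2}b$, we have
\[
a = \Sigma_{11}^{-1/2}\,e_1 \quad\text{and}\quad
b = \Sigma_{22}^{-1/2}\,\Sigma_{22}^{-1/2}\,\Sigma_{21}\,\Sigma_{11}^{-1/2}\,e_1,
\]
where b is determined up to a normalizing constant.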
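Combining these bounds,
\begin{align*}
\rho(a'X, b'Y)
 &\le \frac{\left[\left(u'\,\Sigma_{11}^{-1/2}\,\Sigma_{12}\,\Sigma_{22}^{-1}\,\Sigma_{21}\,\Sigma_{11}^{-1/2}\,u\right)(v'v)\right]^{1/2}}{\left(u'u \cdot v'v\right)^{1/2}} \\
 &= \left[\frac{u'\,\Sigma_{11}^{-1/2}\,\Sigma_{12}\,\Sigma_{22}^{-1}\,\Sigma_{21}\,\Sigma_{11}^{-1/2}\,u}{u'u}\right]^{1/2} \\
 &\le \left(\frac{\rho_1^2\,u'u}{u'u}\right)^{1/2} = \rho_1.
\end{align*}
This implies that
\[
\max_{a,b}\,\rho(a'X, b'Y) = \rho_1
\]
and
\[
\rho\!\left(e_1'\,\Sigma_{11}^{-1/2}X,\; f_1'\,\Sigma_{22}^{-1/2}Y\right)
= \frac{\mathrm{cov}\!\left(e_1'\,\Sigma_{11}^{-1/2}X,\; f_1'\,\Sigma_{22}^{-1/2}Y\right)}
       {\left(\mathrm{var}\!\left(e_1'\,\Sigma_{11}^{-1/2}X\right)\mathrm{var}\!\left(f_1'\,\Sigma_{22}^{-1/2}Y\right)\right)^{1/2}}
= \rho_1.
\]
This implies that the first pair of canonical variables is given by
$M_1 = e_1'\,\Sigma_{11}^{-1/2}X$ and $N_1 = f_1'\,\Sigma_{22}^{-1/2}Y$.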
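So we now have that
\[ \Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}e_1 = \lambda_1 e_1 \quad (\lambda_1 = \rho_1^2). \tag{3.7} \]
We multiply both sides of equation 3.7 on the left by the matrix $\Sigma_{22}^{-1/2}\Sigma_{21}\Sigma_{11}^{-1/2}$ to obtain
\[ \left(\Sigma_{22}^{-1/2}\Sigma_{21}\Sigma_{11}^{-1/2}\right)\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}e_1 = \lambda_1\Sigma_{22}^{-1/2}\Sigma_{21}\Sigma_{11}^{-1/2}e_1. \]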
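That is,
\[ \Sigma_{22}^{-1/2}\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1/2}\left(\Sigma_{22}^{-1/2}\Sigma_{21}\Sigma_{11}^{-1/2}e_1\right) = \lambda_1\left(\Sigma_{22}^{-1/2}\Sigma_{21}\Sigma_{11}^{-1/2}e_1\right). \]
Since $f_1$ is proportional to $\Sigma_{22}^{-1/2}\Sigma_{21}\Sigma_{11}^{-1/2}e_1$, we have that
\[ \Sigma_{22}^{-1/2}\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1/2}f_1 = \lambda_1 f_1. \]
Thus we conclude that if $(\lambda_1, e_1)$ is an eigenvalue-eigenvector pair of $\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}$,
then $(\lambda_1, f_1)$ is an eigenvalue-eigenvector pair of $\Sigma_{22}^{-1/2}\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1/2}$.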
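The derivation of the first canonical pair can be checked numerically. The following is a minimal NumPy sketch rather than code from the thesis: the covariance blocks `S11`, `S12`, `S22` are made-up illustrative values, and `inv_sqrt` and `first_canonical_pair` are hypothetical helper names.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root S^{-1/2} of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def first_canonical_pair(S11, S12, S22):
    """Return (rho1, a1, b1) from the eigen-decomposition described in the text."""
    S11_ih, S22_ih = inv_sqrt(S11), inv_sqrt(S22)
    K = S11_ih @ S12 @ S22_ih           # S11^{-1/2} S12 S22^{-1/2}
    lam, E = np.linalg.eigh(K @ K.T)    # K K' = S11^{-1/2} S12 S22^{-1} S21 S11^{-1/2}
    e1 = E[:, -1]                       # eigh returns eigenvalues in ascending order
    f1 = K.T @ e1                       # proportional to S22^{-1/2} S21 S11^{-1/2} e1
    f1 = f1 / np.linalg.norm(f1)
    return np.sqrt(lam[-1]), S11_ih @ e1, S22_ih @ f1

# Made-up covariance blocks, for illustration only.
S11 = np.array([[2.0, 0.5], [0.5, 1.0]])
S22 = np.array([[1.5, 0.3], [0.3, 1.0]])
S12 = np.array([[0.6, 0.2], [0.1, 0.4]])

rho1, a1, b1 = first_canonical_pair(S11, S12, S22)
corr = (a1 @ S12 @ b1) / np.sqrt((a1 @ S11 @ a1) * (b1 @ S22 @ b1))
print(rho1, corr)   # the two numbers should agree
```

The correlation of the pair $(a_1'X, b_1'Y)$ computed directly from the covariance blocks should coincide with the square root of the largest eigenvalue, which is the content of the bound derived above.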
Derivation of the second canonical variables
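$M_1$ and any linear combination of the $X$'s, say
\[ a_2'X = u_2'\Sigma_{11}^{-1/2}X, \quad\text{where } \Sigma_{11}^{1/2}a_2 = u_2, \]
are uncorrelated if
\begin{align*}
\operatorname{cov}\left(M_1, u_2'\Sigma_{11}^{-1/2}X\right) &= \operatorname{cov}\left(e_1'\Sigma_{11}^{-1/2}X,\ u_2'\Sigma_{11}^{-1/2}X\right) \\
&= e_1'\Sigma_{11}^{-1/2}\Sigma_{11}\Sigma_{11}^{-1/2}u_2 \\
&= e_1'u_2 = 0.
\end{align*}
So, $u_2$ is to be determined such that it is orthogonal to $e_1$.
We want to maximise
\[ \rho(a_2'X, b_2'Y) = \frac{\operatorname{cov}(a_2'X, b_2'Y)}{\left(\operatorname{var}(a_2'X)\cdot\operatorname{var}(b_2'Y)\right)^{1/2}} = \frac{a_2'\Sigma_{12}b_2}{\left((a_2'\Sigma_{11}a_2)(b_2'\Sigma_{22}b_2)\right)^{1/2}}. \]
We let $\Sigma_{11}^{1/2}a_2 = u_2 \implies a_2 = \Sigma_{11}^{-1/2}u_2$
and let $\Sigma_{22}^{1/2}b_2 = v_2 \implies b_2 = \Sigma_{22}^{-1/2}v_2$.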
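So we have that
\[ \rho(a_2'X, b_2'Y) = \frac{u_2'\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1/2}v_2}{\left(u_2'u_2 \cdot v_2'v_2\right)^{1/2}}. \]
We apply the Cauchy-Schwarz inequality to the numerator and have that
\[ u_2'\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1/2}v_2 \le \left(u_2'\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}u_2\right)^{1/2}\left(v_2'v_2\right)^{1/2}. \tag{3.8} \]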
So we concentrate on the expression $u_2'\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}u_2$ and look for an upper bound on it.
For this, we again recall a result from matrix theory: for a real symmetric matrix $C_{p\times p}$ with eigenvalue-eigenvector pairs $(\lambda_i, e_i)$, $i = 1, 2, \dots, p$, such that $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$, we have that
\[ \max_{d \perp e_1} \frac{d'Cd}{d'd} = \lambda_2 \implies d'Cd \le \lambda_2\,d'd \tag{3.9} \]
and
\[ \max_{d \perp e_1, e_2, \dots, e_k} \frac{d'Cd}{d'd} = \lambda_{k+1} \implies d'Cd \le \lambda_{k+1}\,d'd. \tag{3.10} \]
In equation 3.9, equality holds if $d = e_2$, and for equation 3.10, equality holds if $d = e_{k+1}$.
From 3.9, with $\lambda_2 = \rho_2^2$, we have that
\[ u_2'\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}u_2 \le \lambda_2\,(u_2'u_2) \quad\text{with equality at } u_2 = e_2. \]
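In equation 3.8 equality is attained if
\[ v_2 = \Sigma_{22}^{-1/2}\Sigma_{21}\Sigma_{11}^{-1/2}e_2 \implies b_2 = \Sigma_{22}^{-1/2}\Sigma_{22}^{-1/2}\Sigma_{21}\Sigma_{11}^{-1/2}e_2, \]
so that, up to normalisation,
\[ b_2 = \Sigma_{22}^{-1/2}f_2. \]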
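So now we have that
\begin{align*}
\rho(a_2'X, b_2'Y) &\le \frac{\left[\left(u_2'\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}u_2\right)(v_2'v_2)\right]^{1/2}}{\left(u_2'u_2 \cdot v_2'v_2\right)^{1/2}} \\
&= \left(\frac{u_2'\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}u_2}{u_2'u_2}\right)^{1/2} \\
&\le \left(\frac{\rho_2^2\,u_2'u_2}{u_2'u_2}\right)^{1/2} = \rho_2.
\end{align*}
Thus
\[ \operatorname{corr}(a_2'X, b_2'Y) \le \rho_2 \quad\text{with equality at } u_2 = e_2 \implies a_2 = \Sigma_{11}^{-1/2}e_2. \]
The second pair of canonical variables is $M_2 = e_2'\Sigma_{11}^{-1/2}X$ and $N_2 = f_2'\Sigma_{22}^{-1/2}Y$.
The second canonical correlation coefficient is $\rho_2$, as required.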
3.1.5 Properties of the Canonical Variable Pairs
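(i) $\operatorname{var}(M_k) = \operatorname{var}(N_k) = 1$.
Proof.
\[ \operatorname{var}(M_k) = \operatorname{var}\left(e_k'\Sigma_{11}^{-1/2}X\right) = e_k'\Sigma_{11}^{-1/2}\Sigma_{11}\Sigma_{11}^{-1/2}e_k = e_k'e_k = 1. \]
Similarly,
\[ \operatorname{var}(N_k) = f_k'\Sigma_{22}^{-1/2}\Sigma_{22}\Sigma_{22}^{-1/2}f_k = f_k'f_k = 1. \]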
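(ii) $\operatorname{cov}(M_k, M_t) = \operatorname{corr}(M_k, M_t) = 0,\ \forall\, k \ne t$.
Proof.
\begin{align*}
\operatorname{cov}(M_k, M_t) &= \operatorname{cov}\left(e_k'\Sigma_{11}^{-1/2}X,\ e_t'\Sigma_{11}^{-1/2}X\right) \\
&= e_k'\Sigma_{11}^{-1/2}\Sigma_{11}\Sigma_{11}^{-1/2}e_t \\
&= e_k'e_t = 0 \quad \forall\, k \ne t, \text{ since } e_k \text{ and } e_t \text{ are orthogonal.}
\end{align*}
(iii) $\operatorname{cov}(N_k, N_t) = \operatorname{corr}(N_k, N_t) = 0,\ \forall\, k \ne t$.
Proof.
\begin{align*}
\operatorname{cov}(N_k, N_t) &= \operatorname{cov}\left(f_k'\Sigma_{22}^{-1/2}Y,\ f_t'\Sigma_{22}^{-1/2}Y\right) \\
&= f_k'\Sigma_{22}^{-1/2}\Sigma_{22}\Sigma_{22}^{-1/2}f_t.
\end{align*}
Also, because of the orthogonality of $f_k$ and $f_t$,
\[ \operatorname{cov}(N_k, N_t) = f_k'f_t = 0 \quad \forall\, k \ne t. \]
(iv) $\operatorname{cov}(M_k, N_t) = \operatorname{corr}(M_k, N_t) = 0,\ \forall\, k \ne t$.
Proof.
\[ \operatorname{cov}(M_k, N_t) = \operatorname{cov}\left(e_k'\Sigma_{11}^{-1/2}X,\ f_t'\Sigma_{22}^{-1/2}Y\right) = e_k'\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1/2}f_t. \tag{3.11} \]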
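We recall that $f_k$ is proportional to $\Sigma_{22}^{-1/2}\Sigma_{21}\Sigma_{11}^{-1/2}e_k$, so that $e_k'\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1/2} = Q f_k'$ for some constant $Q$, and hence
\[ \operatorname{cov}(M_k, N_t) = Q f_k'f_t = 0 \quad \forall\, k \ne t, \text{ since } f_k \perp f_t. \]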
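Properties (i) to (iv) can be verified numerically in one step. The sketch below is illustrative only: it uses simulated data, and it obtains the eigenvectors $e_k$ and $f_k$ from a singular value decomposition of $\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1/2}$, which is an equivalent route to the two eigen-decompositions used in the text, not the procedure of the thesis. The names `inv_sqrt` and `canonical_coefficients` are hypothetical.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def canonical_coefficients(S11, S12, S22):
    """Columns of A and B hold the coefficient vectors a_k and b_k."""
    S11_ih, S22_ih = inv_sqrt(S11), inv_sqrt(S22)
    E, rho, Ft = np.linalg.svd(S11_ih @ S12 @ S22_ih)
    r = len(rho)                       # number of canonical pairs, min(p, q)
    return rho, S11_ih @ E[:, :r], S22_ih @ Ft.T[:, :r]

# Simulated data, for illustration only (2 X-variables, 3 Y-variables).
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 5))
data[:, 2:] += data[:, :2] @ np.array([[0.8, 0.2, 0.0], [0.1, 0.5, 0.3]])
S = np.cov(data, rowvar=False)
S11, S12, S22 = S[:2, :2], S[:2, 2:], S[2:, 2:]

rho, A, B = canonical_coefficients(S11, S12, S22)
cov_M = A.T @ S11 @ A      # properties (i) and (ii): identity matrix
cov_N = B.T @ S22 @ B      # properties (i) and (iii): identity matrix
cov_MN = A.T @ S12 @ B     # property (iv): diagonal, with rho_k on the diagonal
print(np.round(cov_M, 6), np.round(cov_N, 6), np.round(cov_MN, 6), sep="\n")
```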
3.1.6 Canonical correlation coefficient under the non-singular transformation
In this section we find the canonical correlations when the vectors $X$ and $Y$ are transformed. We will also demonstrate that the canonical correlation coefficients can be computed either from the covariance matrix or from the correlation matrix. We derive the canonical correlation coefficients under the transformation
\[ X_{p\times 1} \to CX \quad\text{and}\quad Y_{q\times 1} \to DY, \]
where $C$ and $D$ are non-singular matrices. We have
\[ \operatorname{cov}\begin{pmatrix} CX \\ DY \end{pmatrix} = \begin{pmatrix} C\Sigma_{11}C' & C\Sigma_{12}D' \\ D\Sigma_{21}C' & D\Sigma_{22}D' \end{pmatrix}. \]
We have seen that $\rho_1, \rho_2, \dots, \rho_p$ are the canonical correlation coefficients for the $\begin{pmatrix} X \\ Y \end{pmatrix}$ set-up. Also, $\rho_1^2, \rho_2^2, \dots, \rho_p^2$ are the eigenvalues of $\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}$. Hence, $\rho_1^2, \rho_2^2, \dots, \rho_p^2$ are the roots of
\[ \left|\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2} - \lambda I\right| = 0. \]
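So we now pre- and post-multiply by the matrices $\Sigma_{11}^{1/2}$ and $\Sigma_{11}^{-1/2}$ to get
\[ \left|\Sigma_{11}^{1/2}\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}\Sigma_{11}^{-1/2} - \lambda I\right| = 0, \]
that is,
\[ \left|\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1} - \lambda I\right| = 0. \]
The matrix $\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1}$ is transformed under $C$ and $D$ as
\[ \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1} \xrightarrow{\ C,\,D\ } (C\Sigma_{12}D')(D\Sigma_{22}D')^{-1}(D\Sigma_{21}C')(C\Sigma_{11}C')^{-1} = C\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1}C^{-1}. \]
The non-zero eigenvalues of $C\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1}C^{-1}$ are the same as the non-zero eigenvalues of $C^{-1}C\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1} = \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1}$.
Hence we conclude that the canonical correlation coefficients are unchanged under the non-singular transformations $C$ and $D$.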
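We now take a special case of such a transformation by defining $C$ and $D$ as follows:
\[ C = N_{11}^{-1/2}, \text{ where } N_{11} = \operatorname{diag}(\Sigma_{11}), \quad\text{and}\quad D = N_{22}^{-1/2}, \text{ where } N_{22} = \operatorname{diag}(\Sigma_{22}). \]
So we transform the vectors $X$ and $Y$ and compute their covariance matrices under the transformation:
\[ X \to CX = N_{11}^{-1/2}X, \qquad \operatorname{cov}\left(N_{11}^{-1/2}X\right) = N_{11}^{-1/2}\Sigma_{11}N_{11}^{-1/2} = \rho_{11}, \]
\[ Y \to DY = N_{22}^{-1/2}Y, \qquad \operatorname{cov}\left(N_{22}^{-1/2}Y\right) = N_{22}^{-1/2}\Sigma_{22}N_{22}^{-1/2} = \rho_{22}. \]
Likewise, $\operatorname{cov}\left(N_{11}^{-1/2}X, N_{22}^{-1/2}Y\right) = N_{11}^{-1/2}\Sigma_{12}N_{22}^{-1/2} = \rho_{12}$.
This implies that the eigenvalues of $\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}$ are identical to the eigenvalues of $\rho_{11}^{-1/2}\rho_{12}\rho_{22}^{-1}\rho_{21}\rho_{11}^{-1/2}$.
Therefore, computing the canonical correlation coefficients from either the covariance matrix or the correlation matrix will yield the same values.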
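The invariance result lends itself to a quick numerical check. The following sketch assumes simulated data and two arbitrarily chosen non-singular matrices `C2` and `D2`; the function names `canonical_correlations` and `transform` are hypothetical and serve only this illustration.

```python
import numpy as np

def canonical_correlations(S11, S12, S22):
    """Square roots of the eigenvalues of S12 S22^{-1} S21 S11^{-1}."""
    M = S12 @ np.linalg.solve(S22, S12.T) @ np.linalg.inv(S11)
    eig = np.sort(np.linalg.eigvals(M).real)[::-1]
    return np.sqrt(np.clip(eig, 0.0, None))

def transform(S11, S12, S22, C, D):
    """Covariance blocks of the transformed vectors (CX, DY)."""
    return C @ S11 @ C.T, C @ S12 @ D.T, D @ S22 @ D.T

# Simulated data, for illustration only (2 X-variables, 3 Y-variables).
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 5))
data[:, 2:] += data[:, :2] @ np.array([[0.8, 0.2, 0.0], [0.1, 0.5, 0.3]])
S = np.cov(data, rowvar=False)
S11, S12, S22 = S[:2, :2], S[:2, 2:], S[2:, 2:]

rho_cov = canonical_correlations(S11, S12, S22)

# Special case from the text: C = N11^{-1/2}, D = N22^{-1/2} (correlation matrix).
C = np.diag(np.diag(S11) ** -0.5)
D = np.diag(np.diag(S22) ** -0.5)
rho_corr = canonical_correlations(*transform(S11, S12, S22, C, D))

# Arbitrary non-singular matrices, chosen by hand for this example.
C2 = np.array([[2.0, 1.0], [0.5, 3.0]])
D2 = np.array([[1.0, 0.5, 0.0], [0.0, 2.0, 1.0], [1.0, 0.0, 3.0]])
rho_general = canonical_correlations(*transform(S11, S12, S22, C2, D2))

print(rho_cov, rho_corr, rho_general)   # all three should agree
```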
3.1.7 Correlation Coefficient Between Canonical Variables and the
Original Variables
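We now derive the correlation coefficients between the canonical variables $M_i$ and $N_i$, $i = 1, 2, \dots, p$, and the original variables $X$ and $Y$.
The $k$th canonical variate pair is defined as
\[ M_k = e_k'\Sigma_{11}^{-1/2}X \quad\text{and}\quad N_k = f_k'\Sigma_{22}^{-1/2}Y. \]
Stacking the canonical variates gives
\[ \underbrace{M}_{p\times 1} = \begin{pmatrix} M_1 \\ M_2 \\ \vdots \\ M_p \end{pmatrix} = \begin{pmatrix} e_1' \\ e_2' \\ \vdots \\ e_p' \end{pmatrix}\Sigma_{11}^{-1/2}X = CX, \quad\text{where } C = \begin{pmatrix} e_1' \\ e_2' \\ \vdots \\ e_p' \end{pmatrix}\Sigma_{11}^{-1/2}, \]
\[ \underbrace{N}_{q\times 1} = \begin{pmatrix} N_1 \\ N_2 \\ \vdots \\ N_q \end{pmatrix} = \begin{pmatrix} f_1' \\ f_2' \\ \vdots \\ f_q' \end{pmatrix}\Sigma_{22}^{-1/2}Y = DY, \quad\text{where } D = \begin{pmatrix} f_1' \\ f_2' \\ \vdots \\ f_q' \end{pmatrix}\Sigma_{22}^{-1/2}. \]
Then
\[ \operatorname{cov}(M, X) = \operatorname{cov}(CX, X) = C\Sigma_{11} = \begin{pmatrix} e_1' \\ e_2' \\ \vdots \\ e_p' \end{pmatrix}\Sigma_{11}^{1/2} \]
and
\[ \operatorname{cov}(N, Y) = \operatorname{cov}(DY, Y) = D\Sigma_{22} = \begin{pmatrix} f_1' \\ f_2' \\ \vdots \\ f_q' \end{pmatrix}\Sigma_{22}^{1/2}. \]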
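This implies that, since $\operatorname{var}(M_i) = 1$ and $\operatorname{var}(X_k) = \sigma_{kk}$,
\[ \operatorname{corr}(M_i, X_k) = \frac{\operatorname{cov}(M_i, X_k)}{\sigma_{kk}^{1/2}} = \operatorname{cov}\left(M_i, \sigma_{kk}^{-1/2}X_k\right). \]
In matrix form, with $N_{11} = \operatorname{diag}(\Sigma_{11}) = \operatorname{diag}(\sigma_{11}, \dots, \sigma_{pp})$,
\begin{align*}
\operatorname{corr}(M, X) &= \operatorname{cov}\left(M, N_{11}^{-1/2}X\right) \\
&= \operatorname{cov}\left(CX, N_{11}^{-1/2}X\right) \\
&= C\Sigma_{11}N_{11}^{-1/2} \\
&= CN_{11}^{1/2}N_{11}^{-1/2}\Sigma_{11}N_{11}^{-1/2} \\
&= CN_{11}^{1/2}\rho_{11}. \tag{3.12}
\end{align*}
Similarly, with $N_{22} = \operatorname{diag}(\Sigma_{22})$, the diagonal matrix of the variances of $Y_1, \dots, Y_q$,
\begin{align*}
\operatorname{corr}(M, Y) &= \operatorname{cov}\left(M, N_{22}^{-1/2}Y\right) \\
&= \operatorname{cov}\left(CX, N_{22}^{-1/2}Y\right) \\
&= C\Sigma_{12}N_{22}^{-1/2} = CN_{11}^{1/2}\rho_{12}. \tag{3.13}
\end{align*}
And
\begin{align*}
\operatorname{corr}(N, X) &= \operatorname{cov}\left(N, N_{11}^{-1/2}X\right) \\
&= \operatorname{cov}\left(DY, N_{11}^{-1/2}X\right) \\
&= D\Sigma_{21}N_{11}^{-1/2} = DN_{22}^{1/2}\rho_{21}. \tag{3.14}
\end{align*}
Finally,
\begin{align*}
\operatorname{corr}(N, Y) &= \operatorname{cov}\left(N, N_{22}^{-1/2}Y\right) \\
&= \operatorname{cov}\left(DY, N_{22}^{-1/2}Y\right) \\
&= D\Sigma_{22}N_{22}^{-1/2} = DN_{22}^{1/2}\rho_{22}. \tag{3.15}
\end{align*}
Here $\rho_{12} = N_{11}^{-1/2}\Sigma_{12}N_{22}^{-1/2}$ and $\rho_{21} = \rho_{12}'$. Equations 3.12, 3.13, 3.14 and 3.15 give the correlations between the canonical variates and the original variables.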
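As a sanity check on the formula for $\operatorname{corr}(M, X)$, one can compare $C\Sigma_{11}N_{11}^{-1/2}$ with correlations computed directly from canonical variate scores. The sketch below does this on simulated data; the data, the SVD route to the eigenvectors, and the helper name `inv_sqrt` are assumptions made for illustration, not part of the thesis.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

# Simulated data, for illustration only (2 X-variables, 3 Y-variables).
rng = np.random.default_rng(1)
data = rng.normal(size=(400, 5))
data[:, 2:] += data[:, :2] @ np.array([[0.7, 0.1, 0.2], [0.0, 0.6, 0.3]])
X = data[:, :2]
S = np.cov(data, rowvar=False)
S11, S12, S22 = S[:2, :2], S[:2, 2:], S[2:, 2:]

E, rho, Ft = np.linalg.svd(inv_sqrt(S11) @ S12 @ inv_sqrt(S22))
C = E.T @ inv_sqrt(S11)                   # rows are e_k' S11^{-1/2}
N11_ih = np.diag(np.diag(S11) ** -0.5)    # N11^{-1/2}

loadings = C @ S11 @ N11_ih               # corr(M, X) from the matrix formula
M = (X - X.mean(axis=0)) @ C.T            # canonical variate scores M = CX
empirical = np.corrcoef(np.hstack([M, X]), rowvar=False)[:2, 2:]
print(np.round(loadings, 4))
print(np.round(empirical, 4))             # should match the line above
```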
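3.1.8 Computation of Canonical Correlation Coefficient Using Standardized Variables
Here, we derive the canonical correlation coefficients by standardizing the original variables.
We denote the standardized variables as follows:
\[ Z^{(X)} = N_{11}^{-1/2}(X - \mu_X) \quad\text{and}\quad Z^{(Y)} = N_{22}^{-1/2}(Y - \mu_Y), \]
where $\mu_X$ and $\mu_Y$ are the mean vectors of $X$ and $Y$.
So the covariance matrix of the standardized variables is given by
\[ \operatorname{cov}\begin{pmatrix} Z^{(X)} \\ Z^{(Y)} \end{pmatrix} = \begin{pmatrix} \rho_{11} & \rho_{12} \\ \rho_{21} & \rho_{22} \end{pmatrix}. \]
From the correlation matrix, the derived canonical variables are
\[ M^Z_k = e_k'\Sigma_{11}^{-1/2}N_{11}^{1/2}Z^{(X)} \quad\text{and}\quad N^Z_k = f_k'\Sigma_{22}^{-1/2}N_{22}^{1/2}Z^{(Y)}. \]
In matrix form,
\[ M_Z = \begin{pmatrix} M^Z_1 \\ \vdots \\ M^Z_p \end{pmatrix} = \begin{pmatrix} e_1' \\ \vdots \\ e_p' \end{pmatrix}\Sigma_{11}^{-1/2}N_{11}^{1/2}Z^{(X)} = C_Z Z^{(X)} \tag{3.16} \]
and
\[ N_Z = \begin{pmatrix} N^Z_1 \\ \vdots \\ N^Z_q \end{pmatrix} = \begin{pmatrix} f_1' \\ \vdots \\ f_q' \end{pmatrix}\Sigma_{22}^{-1/2}N_{22}^{1/2}Z^{(Y)} = D_Z Z^{(Y)}. \tag{3.17} \]
Now we compute the correlation between the canonical variables obtained from the correlation matrix and the standardized variables. We have
\[ \rho\left(M_Z, Z^{(X)}\right) = \operatorname{cov}\left(M_Z, Z^{(X)}\right) = \operatorname{cov}\left(C_Z Z^{(X)}, Z^{(X)}\right) = C_Z\rho_{11}. \tag{3.18} \]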
\[ \rho\left(N_Z, Z^{(Y)}\right) = \operatorname{cov}\left(N_Z, Z^{(Y)}\right) = \operatorname{cov}\left(D_Z Z^{(Y)}, Z^{(Y)}\right) = D_Z\rho_{22}. \tag{3.19} \]
\[ \rho\left(M_Z, Z^{(Y)}\right) = \operatorname{cov}\left(C_Z Z^{(X)}, Z^{(Y)}\right) = C_Z\rho_{12}. \tag{3.20} \]
\[ \rho\left(N_Z, Z^{(X)}\right) = \operatorname{cov}\left(D_Z Z^{(Y)}, Z^{(X)}\right) = D_Z\rho_{21}. \tag{3.21} \]
From equations 3.16 and 3.17, we have that
\[ C_Z = \begin{pmatrix} e_1' \\ \vdots \\ e_p' \end{pmatrix}\Sigma_{11}^{-1/2}N_{11}^{1/2} = CN_{11}^{1/2} \quad\text{and}\quad D_Z = \begin{pmatrix} f_1' \\ \vdots \\ f_q' \end{pmatrix}\Sigma_{22}^{-1/2}N_{22}^{1/2} = DN_{22}^{1/2}. \]
This gives
\begin{align*}
\rho(M, X) &= CN_{11}^{1/2}\rho_{11} = C_Z\rho_{11} = \rho\left(M_Z, Z^{(X)}\right), \\
\rho(M, Y) &= CN_{11}^{1/2}\rho_{12} = C_Z\rho_{12} = \rho\left(M_Z, Z^{(Y)}\right), \\
\rho(N, X) &= DN_{22}^{1/2}\rho_{21} = D_Z\rho_{21} = \rho\left(N_Z, Z^{(X)}\right), \\
\rho(N, Y) &= DN_{22}^{1/2}\rho_{22} = D_Z\rho_{22} = \rho\left(N_Z, Z^{(Y)}\right).
\end{align*}
We conclude that standardizing the variables does not change the correlations between the canonical variates and the variables.
3.1.9 Assessing Overall Model Fit and Canonical Dimension Reduction
In this section, two techniques are discussed for judging whether interpreting fewer canonical dimensions (canonical variate pairs) is enough to capture the covariance or correlation structure. Not all canonical functions are important, and the strength of a canonical correlation coefficient indicates the importance of the corresponding canonical variate pair [2]. We are ultimately interested in the significant canonical correlation coefficients, in order to make informed decisions. The first technique uses Wilks' lambda and its corresponding F-test to test, at the 5% significance level, the null hypothesis that all canonical functions have canonical correlation coefficients equal to zero. Wilks' lambda evaluates each canonical function against the null hypothesis that the canonical correlation coefficient is zero. The second technique seeks to ascertain whether choosing $k < p$ canonical variate pairs is enough to capture the covariance structure.
Technique I
For each canonical correlation coefficient, there is an eigenvalue that is related to Wilks' lambda. The eigenvalue for each coefficient is calculated as
\[ \lambda_i = \frac{\rho_i^2}{1 - \rho_i^2} \]
and Wilks' lambda is computed as
\[ \Lambda = \frac{1}{\prod_i (1 + \lambda_i)} = \prod_i \left(1 - \rho_i^2\right). \]
The F-test value is calculated as
\[ F = \frac{1 - \Lambda^{1/w}}{\Lambda^{1/w}}\left(\frac{\text{degrees of freedom}_2}{\text{degrees of freedom}_1}\right). \]
\[ \text{degrees of freedom}_1 = p \times q, \qquad \text{degrees of freedom}_2 = vw - \frac{pq}{2} + 1, \]
\[ v = n - \frac{3}{2} - \frac{p + q}{2}, \quad\text{where } n \text{ is the sample size, and} \]
\[ w = \left(\frac{p^2q^2 - p}{p^2 + q^2 - q}\right)^{1/2}. \tag{3.22} \]
The computation of $w$ in equation 3.22 is iterative. We begin with the initial values of $p$ and $q$ and repeatedly subtract one from $p$ and $q$ until either $p$ or $q$ has been reduced to one.
We now compute the p-value or the critical value to make the final decision. The critical value is the value that the computed F value must exceed in order to reject the null hypothesis. It is obtained from the F-distribution table using the two degrees of freedom and the level of significance (5%).
The p-value is computed using the F value and the two degrees of freedom. If the p-value is less than 0.05, we reject the null hypothesis; otherwise we fail to reject the null hypothesis.
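The quantities in Technique I are simple to compute once the canonical correlations are known. The sketch below assumes the standard textbook definitions $\lambda_i = \rho_i^2/(1-\rho_i^2)$ and $\Lambda = \prod_i (1-\rho_i^2)$; this is an assumption of the example, and the F statistic and its degrees of freedom are deliberately left out. The function names are hypothetical, and the input values 0.74 and 0.03 are the canonical correlations of the worked example in Section 3.2.

```python
import math

def eigenvalues_from_correlations(rhos):
    """lambda_i = rho_i^2 / (1 - rho_i^2) for each canonical correlation."""
    return [r ** 2 / (1.0 - r ** 2) for r in rhos]

def wilks_lambda(rhos):
    """Wilks' lambda for the hypothesis that all listed correlations are zero."""
    return math.prod(1.0 - r ** 2 for r in rhos)

def sequential_wilks(rhos):
    """Lambda for testing that correlations k, k+1, ..., p are zero, k = 1..p."""
    return [wilks_lambda(rhos[k:]) for k in range(len(rhos))]

rhos = [0.74, 0.03]          # canonical correlations from Section 3.2
lams = eigenvalues_from_correlations(rhos)
overall = wilks_lambda(rhos)
stepwise = sequential_wilks(rhos)
print(lams, overall, stepwise)
```

A value of $\Lambda$ close to 1 for the remaining correlations indicates that they contribute little, which is the reasoning behind dropping the second pair in the example.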
Technique II
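We have that
\[ M = \begin{pmatrix} M_1 \\ \vdots \\ M_p \end{pmatrix} = SX \quad\text{and so}\quad X = S^{-1}M, \quad\text{where } S = \begin{pmatrix} e_1' \\ \vdots \\ e_p' \end{pmatrix}\Sigma_{11}^{-1/2}, \]
and
\[ N = \begin{pmatrix} N_1 \\ \vdots \\ N_q \end{pmatrix} = TY \quad\text{thus}\quad Y = T^{-1}N, \quad\text{and } T = \begin{pmatrix} f_1' \\ \vdots \\ f_q' \end{pmatrix}\Sigma_{22}^{-1/2}. \]
Clearly,
\[ S^{-1} = \Sigma_{11}^{1/2}(e_1, \dots, e_p) \quad\text{and}\quad T^{-1} = \Sigma_{22}^{1/2}(f_1, \dots, f_q). \]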
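Writing $S^{-1}$ and $T^{-1}$ column by column eases the computation.
We write
\[ S^{-1} = \left(s^{(1)}, \dots, s^{(p)}\right), \quad\text{where } s^{(i)} = \Sigma_{11}^{1/2}e_i,\ i = 1, 2, \dots, p, \tag{3.23} \]
and
\[ T^{-1} = \left(t^{(1)}, \dots, t^{(q)}\right), \quad\text{where } t^{(i)} = \Sigma_{22}^{1/2}f_i,\ i = 1, 2, \dots, q. \tag{3.24} \]
Using this we rewrite $X$ and $Y$ as
\[ X = \left(s^{(1)}, \dots, s^{(p)}\right)M = \sum_{i=1}^{p} s^{(i)}M_i \tag{3.25} \]
and
\[ Y = \left(t^{(1)}, \dots, t^{(q)}\right)N = \sum_{i=1}^{q} t^{(i)}N_i. \tag{3.26} \]
Since the canonical variables within each set are uncorrelated with unit variance, we can then compute the covariance matrices of $X$ and $Y$ as
\[ \operatorname{cov}(X) = \operatorname{cov}\left(\sum_{i=1}^{p} s^{(i)}M_i\right) = \sum_{i=1}^{p} s^{(i)}s^{(i)\prime} \]
and
\[ \operatorname{cov}(Y) = \operatorname{cov}\left(\sum_{i=1}^{q} t^{(i)}N_i\right) = \sum_{i=1}^{q} t^{(i)}t^{(i)\prime}. \]
So considering only the first $k$ canonical variables, we have that
\[ X^* = \sum_{i=1}^{k} s^{(i)}M_i \quad\text{and}\quad Y^* = \sum_{i=1}^{k} t^{(i)}N_i, \]
thus
\[ \operatorname{cov}(X^*) = \sum_{i=1}^{k} s^{(i)}s^{(i)\prime} \quad\text{and}\quad \operatorname{cov}(Y^*) = \sum_{i=1}^{k} t^{(i)}t^{(i)\prime}. \]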
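We then compute the covariance between $X$ and $Y$ as
\[ \operatorname{cov}(X, Y) = \operatorname{cov}\left(S^{-1}M, T^{-1}N\right) = S^{-1}\begin{pmatrix} \rho_1 & & 0 \\ & \ddots & \\ 0 & & \rho_p \end{pmatrix}\left(T^{-1}\right)' \]
and so
\[ \operatorname{cov}(X, Y) = \left(s^{(1)}, \dots, s^{(p)}\right)\begin{pmatrix} \rho_1 & & & 0 \\ & \rho_2 & & \\ & & \ddots & \\ 0 & & & \rho_p \end{pmatrix}\begin{pmatrix} t^{(1)\prime} \\ \vdots \\ t^{(q)\prime} \end{pmatrix} = \sum_{i=1}^{p} \rho_i\,s^{(i)}t^{(i)\prime}. \]
Therefore,
\[ \operatorname{cov}(X^*, Y^*) = \sum_{i=1}^{k} \rho_i\,s^{(i)}t^{(i)\prime}. \]
Having the covariance structure for the first $k$ canonical variables, we now examine how close the following three matrices are to the null matrix:
\[ \sum_{i=k+1}^{p} s^{(i)}s^{(i)\prime}, \qquad \sum_{i=k+1}^{q} t^{(i)}t^{(i)\prime} \qquad\text{and}\qquad \sum_{i=k+1}^{p} \rho_i\,s^{(i)}t^{(i)\prime}. \]
We make three observations.
(1) Since we usually choose $k$ such that $\rho_{k+1}$, and hence $\rho_{k+2}, \dots, \rho_p$, are negligible, $\sum_{i=k+1}^{p} \rho_i\,s^{(i)}t^{(i)\prime}$ will be closer to a null matrix than $\sum_{i=k+1}^{p} s^{(i)}s^{(i)\prime}$ and $\sum_{i=k+1}^{q} t^{(i)}t^{(i)\prime}$.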
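(2)
\[ \operatorname{cov}(X, M) = \operatorname{cov}\left(S^{-1}M, M\right) = S^{-1} = \left(s^{(1)}, \dots, s^{(p)}\right) = \begin{pmatrix} \operatorname{cov}(X_1, M_1) & \dots & \operatorname{cov}(X_1, M_p) \\ \vdots & & \vdots \\ \operatorname{cov}(X_p, M_1) & \dots & \operatorname{cov}(X_p, M_p) \end{pmatrix}. \]
(3) Considering $k < p$ canonical variables $M_1, \dots, M_k$, the proportion of the total variance of $X$ explained by $M_1, \dots, M_k$ is given as
\[ \frac{\operatorname{tr}\left(\operatorname{cov}(X^*)\right)}{\operatorname{tr}\left(\operatorname{cov}(X)\right)} = \frac{\operatorname{tr}\left(\sum_{i=1}^{k} s^{(i)}s^{(i)\prime}\right)}{\operatorname{tr}\Sigma_{11}}, \]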
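where tr denotes the trace of the matrix in question.
In addition,
\[ S^{-1} = \left(s^{(1)}, \dots, s^{(p)}\right) = \operatorname{cov}(X, M) \quad\text{and}\quad s^{(i)} = \begin{pmatrix} \operatorname{cov}(X_1, M_i) \\ \vdots \\ \operatorname{cov}(X_p, M_i) \end{pmatrix}, \quad i = 1, \dots, p, \]
thus
\[ s^{(i)\prime}s^{(i)} = \sum_{j=1}^{p} \operatorname{cov}(X_j, M_i)^2 \quad\text{and}\quad \sum_{i=1}^{k} s^{(i)\prime}s^{(i)} = \sum_{i=1}^{k}\sum_{j=1}^{p} \operatorname{cov}(X_j, M_i)^2. \]
Thus
\[ \frac{\operatorname{tr}\left(\sum_{i=1}^{k} s^{(i)}s^{(i)\prime}\right)}{\operatorname{tr}\Sigma_{11}} = \frac{\sum_{i=1}^{k} \operatorname{tr}\left(s^{(i)}s^{(i)\prime}\right)}{\sum_{i=1}^{p} \operatorname{tr}\left(s^{(i)}s^{(i)\prime}\right)} = \frac{\sum_{i=1}^{k} \operatorname{tr}\left(s^{(i)\prime}s^{(i)}\right)}{\sum_{i=1}^{p} \operatorname{tr}\left(s^{(i)\prime}s^{(i)}\right)}. \]
Since $s^{(i)\prime}s^{(i)}$ is a scalar quantity, we have that
\[ \frac{\sum_{i=1}^{k} \operatorname{tr}\left(s^{(i)\prime}s^{(i)}\right)}{\sum_{i=1}^{p} \operatorname{tr}\left(s^{(i)\prime}s^{(i)}\right)} = \frac{\sum_{i=1}^{k} s^{(i)\prime}s^{(i)}}{\sum_{i=1}^{p} s^{(i)\prime}s^{(i)}} = \frac{\sum_{i=1}^{k}\sum_{j=1}^{p} \operatorname{cov}(X_j, M_i)^2}{\sum_{i=1}^{p}\sum_{j=1}^{p} \operatorname{cov}(X_j, M_i)^2}. \]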
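Similarly, the proportion of total variance of $Y$ explained by $N_1, \dots, N_k$ is given by
\[ \frac{\operatorname{tr}\left(\sum_{i=1}^{k} t^{(i)}t^{(i)\prime}\right)}{\operatorname{tr}\Sigma_{22}} = \frac{\sum_{i=1}^{k}\sum_{j=1}^{q} \operatorname{cov}(Y_j, N_i)^2}{\sum_{i=1}^{q}\sum_{j=1}^{q} \operatorname{cov}(Y_j, N_i)^2}. \]
If the proportion of total variance is close to 1 or 100%, then the k dimensions are retained.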
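The proportion of total variance explained by the first $k$ canonical variables can be computed directly from the columns $s^{(i)} = \Sigma_{11}^{1/2}e_i$. The following sketch uses made-up covariance blocks, and the helper names `sym_power` and `variance_explained` are hypothetical; it is an illustration of the formula, not code from the thesis.

```python
import numpy as np

def sym_power(S, power):
    """S raised to a real power, for a symmetric positive-definite matrix S."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** power) @ V.T

def variance_explained(S11, S12, S22):
    """Cumulative proportion of tr(S11) explained by M_1, ..., M_k for k = 1..p."""
    S11_ih = sym_power(S11, -0.5)
    G = S11_ih @ S12 @ np.linalg.solve(S22, S12.T) @ S11_ih
    lam, E = np.linalg.eigh(G)
    E = E[:, ::-1]                              # order by decreasing eigenvalue
    S_inv = sym_power(S11, 0.5) @ E             # columns are s^(i) = S11^{1/2} e_i
    per_variate = np.sum(S_inv ** 2, axis=0)    # s^(i)' s^(i)
    return np.cumsum(per_variate) / np.trace(S11)

# Made-up covariance blocks, for illustration only.
S11 = np.array([[2.0, 0.5], [0.5, 1.0]])
S22 = np.array([[1.5, 0.3], [0.3, 1.0]])
S12 = np.array([[0.6, 0.2], [0.1, 0.4]])

props = variance_explained(S11, S12, S22)
print(props)   # the last entry is 1 because all p variates reproduce X exactly
```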
3.2 Example: Computation of Canonical variables and
Canonical Coefficients
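Here we use the formulas derived in this chapter to compute the canonical variable pairs and the canonical correlation coefficients for the covariance structure below. We consider a standardized vector $Z$, partitioned into two sub-vectors,
\[ Z_{4\times 1} = \begin{pmatrix} Z^{(1)} \\ Z^{(2)} \end{pmatrix}, \]
where $Z^{(1)} = Z^{(X)}$ and $Z^{(2)} = Z^{(Y)}$ are $2 \times 1$ vectors of standardized variables.
Suppose we are given
\[ \operatorname{cov}(Z) = \operatorname{cov}\begin{pmatrix} Z^{(1)} \\ Z^{(2)} \end{pmatrix} = \begin{pmatrix} \rho_{11} & \rho_{12} \\ \rho_{21} & \rho_{22} \end{pmatrix} = \left(\begin{array}{cc|cc} 1.00 & 0.40 & 0.50 & 0.60 \\ 0.40 & 1.00 & 0.30 & 0.40 \\ \hline 0.50 & 0.30 & 1.00 & 0.20 \\ 0.60 & 0.40 & 0.20 & 1.00 \end{array}\right). \]
We begin by calculating $\rho_{11}^{-1/2}$ and $\rho_{22}^{-1}$ as
\[ \rho_{11}^{-1/2} = \begin{pmatrix} 1.068 & -0.223 \\ -0.223 & 1.068 \end{pmatrix} \quad\text{and}\quad \rho_{22}^{-1} = \begin{pmatrix} 1.042 & -0.208 \\ -0.208 & 1.042 \end{pmatrix}, \]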
so
\[ \rho_{11}^{-1/2}\rho_{12}\rho_{22}^{-1}\rho_{21}\rho_{11}^{-1/2} = \begin{pmatrix} 0.437 & 0.218 \\ 0.218 & 0.110 \end{pmatrix}. \]
Now we find the eigenvalues of the matrix $\rho_{11}^{-1/2}\rho_{12}\rho_{22}^{-1}\rho_{21}\rho_{11}^{-1/2}$. The eigenvalues $\rho_1^2, \rho_2^2$ are
\[ \rho_1^2 = 0.546 \quad\text{and}\quad \rho_2^2 = 0.0009, \]
hence
\[ \rho_1 = 0.740 \quad\text{and}\quad \rho_2 = 0.030. \]
The eigenvector $e_1$ associated with $\rho_1^2$ is obtained as
\[ e_1 = \begin{pmatrix} 0.8947 \\ 0.4468 \end{pmatrix}. \]
This implies that the coefficient vector for $M_1$ is $a_1 = \rho_{11}^{-1/2}e_1 = \begin{pmatrix} 0.856 \\ 0.278 \end{pmatrix}$. So
\[ M_1 = e_1'\rho_{11}^{-1/2}Z^{(X)} = 0.856\,Z^{(X)}_1 + 0.278\,Z^{(X)}_2. \tag{3.27} \]
We now find the coefficient vector $b_1$ for $N_1$.
We have that $f_1$ is proportional to $\rho_{22}^{-1/2}\rho_{21}\rho_{11}^{-1/2}e_1$ and $b_1 = \rho_{22}^{-1/2}f_1$. Thus $f_1$ is proportional to $\rho_{22}^{-1/2}\rho_{21}a_1$. The constant of proportionality is fixed by requiring that $\operatorname{var}\left(b_1'Z^{(Y)}\right) = \operatorname{var}(N_1) = b_1'\rho_{22}b_1 = 1$.
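\begin{align*}
\rho_{22}^{1/2}b_1 &\propto \rho_{22}^{-1/2}\rho_{21}a_1 \\
b_1 &\propto \rho_{22}^{-1/2}\rho_{22}^{-1/2}\rho_{21}a_1 \\
b_1 &\propto \rho_{22}^{-1}\rho_{21}a_1, \qquad \rho_{22}^{-1}\rho_{21}a_1 = \begin{pmatrix} 0.403 \\ 0.544 \end{pmatrix}.
\end{align*}
We normalize $\rho_{22}^{-1}\rho_{21}a_1$. For this unscaled vector the quadratic form with $\rho_{22}$ equals 0.546, so
\[ b_1 = \frac{1}{\sqrt{0.546}}\begin{pmatrix} 0.403 \\ 0.544 \end{pmatrix} \]
and
\[ N_1 = b_1'Z^{(Y)} = \frac{0.403}{\sqrt{0.546}}Z^{(Y)}_1 + \frac{0.544}{\sqrt{0.546}}Z^{(Y)}_2 \approx 0.54\,Z^{(Y)}_1 + 0.74\,Z^{(Y)}_2. \]
The second canonical correlation coefficient is too small and hence further calculations will not be done. We later show why only one canonical coefficient was enough.
We now compute the correlations between the original (standardized) variables and the canonical variates $M_1$ and $N_1$.
For the first canonical variable pair, we have that
\[ C_Z' = (0.86, 0.28) \quad\text{and}\quad D_Z' = (0.54, 0.74). \]
The correlation between $M_1$ and $Z^{(X)}$ is
\[ \rho\left(M_1, Z^{(X)}\right) = C_Z\rho_{11} = (0.97, 0.62). \]
Similarly,
\[ \rho\left(N_1, Z^{(Y)}\right) = D_Z\rho_{22} = (0.69, 0.85), \]
\[ \rho\left(M_1, Z^{(Y)}\right) = C_Z\rho_{12} = (0.51, 0.63) \quad\text{and} \]
\[ \rho\left(N_1, Z^{(X)}\right) = D_Z\rho_{21} = (0.71, 0.46). \]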
We now show that only one canonical variable was sufficient to capture the correlation
structure.
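For $k = 1$, the canonical functions are as follows:
\[ M_1 = 0.86X_1 + 0.28X_2, \qquad N_1 = 0.54Y_1 + 0.74Y_2. \]
So take $a_1' = (0.86, 0.28)$ and $b_1' = (0.54, 0.74)$.
Now
\begin{align*}
\operatorname{cov}(X_1, M_1) &= 0.86\operatorname{cov}(X_1, X_1) + 0.28\operatorname{cov}(X_1, X_2) = 0.97, \\
\operatorname{cov}(Y_1, N_1) &= 0.54\operatorname{cov}(Y_1, Y_1) + 0.74\operatorname{cov}(Y_1, Y_2) = 0.69, \\
\operatorname{cov}(X_2, M_1) &= 0.86\operatorname{cov}(X_1, X_2) + 0.28\operatorname{cov}(X_2, X_2) = 0.62, \\
\operatorname{cov}(Y_2, N_1) &= 0.54\operatorname{cov}(Y_1, Y_2) + 0.74\operatorname{cov}(Y_2, Y_2) = 0.85.
\end{align*}
From the covariances computed above, we have that
\[ s^{(1)} = \begin{pmatrix} 0.97 \\ 0.62 \end{pmatrix} \quad\text{and}\quad t^{(1)} = \begin{pmatrix} 0.69 \\ 0.85 \end{pmatrix}, \]
\[ s^{(1)}s^{(1)\prime} = \begin{pmatrix} 0.95 & 0.61 \\ 0.61 & 0.4 \end{pmatrix} \quad\text{and}\quad t^{(1)}t^{(1)\prime} = \begin{pmatrix} 0.47 & 0.58 \\ 0.58 & 0.72 \end{pmatrix}, \]
\[ \rho_1 s^{(1)}t^{(1)\prime} = \begin{pmatrix} 0.5 & 0.61 \\ 0.31 & 0.39 \end{pmatrix}. \]
Thus, considering only one canonical variate pair $(M_1, N_1)$, we check whether $s^{(1)}s^{(1)\prime}$, $t^{(1)}t^{(1)\prime}$ and $\rho_1 s^{(1)}t^{(1)\prime}$ approximate $\rho_{11}$, $\rho_{22}$ and $\rho_{12}$ respectively.
From our computations, we have
\[ \begin{pmatrix} 0.5 & 0.61 \\ 0.31 & 0.39 \end{pmatrix} \approx \begin{pmatrix} 0.5 & 0.6 \\ 0.3 & 0.4 \end{pmatrix}. \]
We observe that, of the three matrices, only $\rho_1 s^{(1)}t^{(1)\prime}$ gives a reasonable approximation, namely to $\rho_{12}$. This agrees with observation (1) above, that $\sum_{i=k+1}^{p} \rho_i\,s^{(i)}t^{(i)\prime}$ is very close to the null matrix.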
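We calculate the proportion of total variance explained by $M_1$ and by $N_1$:
\[ \frac{\operatorname{tr}\left(s^{(1)}s^{(1)\prime}\right)}{\operatorname{tr}\Sigma_{11}} = \frac{0.95 + 0.4}{2} \simeq 68\%, \]
\[ \frac{\operatorname{tr}\left(t^{(1)}t^{(1)\prime}\right)}{\operatorname{tr}\Sigma_{22}} = \frac{0.47 + 0.72}{2} \simeq 60\%. \]
$M_1$ explains 68% of the total variation in $X$ and $N_1$ explains 60% of the variation in $Y$. This shows that the first canonical variate pair is enough to capture sufficient covariance structure of the sets of variables.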
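The numbers in this example can be reproduced with a few lines of NumPy. The sketch below is only a check of the arithmetic under the correlation blocks given at the start of the example (with $\rho_{21}$ taken as the transpose of $\rho_{12}$); the variable names are chosen for this illustration.

```python
import numpy as np

# Correlation blocks from the example; R21 is the transpose of R12.
R11 = np.array([[1.0, 0.4], [0.4, 1.0]])
R22 = np.array([[1.0, 0.2], [0.2, 1.0]])
R12 = np.array([[0.5, 0.6], [0.3, 0.4]])
R21 = R12.T

# R11^{-1/2} via the spectral decomposition of R11.
w, V = np.linalg.eigh(R11)
R11_ih = V @ np.diag(w ** -0.5) @ V.T

G = R11_ih @ R12 @ np.linalg.solve(R22, R21) @ R11_ih
lam, E = np.linalg.eigh(G)            # eigenvalues in ascending order
rho = np.sqrt(lam[::-1])              # canonical correlations, largest first

e1 = E[:, -1]
if e1[0] < 0:                         # fix the arbitrary sign of the eigenvector
    e1 = -e1
a1 = R11_ih @ e1                      # coefficient vector of M1

b_raw = np.linalg.solve(R22, R21 @ a1)            # proportional to b1
b1 = b_raw / np.sqrt(b_raw @ R22 @ b_raw)         # scaled so that var(N1) = 1

print(np.round(rho, 3))
print(np.round(a1, 3), np.round(b1, 3))
print(np.round(a1 @ R11, 2), np.round(b1 @ R22, 2))   # corr(M1, Z^(X)), corr(N1, Z^(Y))
```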
Chapter 4
Results
This chapter presents the results and discussion of the analysis of the available data set.
The chapter is sub-divided into four sections. The first section gives a brief description
of the data and the variables used. The second section describes the characteristics of the
glioblastoma patients and the third section will present the main results of the analysis. The
final section presents a summary of the results obtained from the analysis.
4.1 Data
The data set consists of thirty-two (32) variables. The neuroimage features are described by six (6) variables while the copy number variations of patients comprise 26 variables. We define the neuroimage feature variables as set M and the copy number variation variables as set N. Five hundred and twenty-seven (527) GBM patients were available for this analysis. Out of the 527 patients, only 267 had a corresponding MRI of their tumor available. Hence the main analysis involved 267 patients.
4.1.1 Patient Features
The VASARI lexicon for magnetic resonance imaging annotation contains several imaging descriptors based on different magnetic resonance imaging modalities [13]. The cardinal image features presented by Gutman et al. [13] are edema, necrosis, non-contrast-enhancing tumor (nCET) and enhancing tumor. We added two more features, the major axis length and minor axis length of the tumor, to the cardinal features. So the following magnetic resonance imaging features of Glioblastoma patients available on The Cancer Imaging Archive (TCIA) were used for the analysis: edema, necrosis, non-contrast-enhancing tumor, enhancing tumor, major axis length and minor axis length. Table 4.1 lists each image feature with its description.
The copy number variations of the Glioblastoma patients were obtained from The Cancer Genome Atlas (TCGA). The variables under the copy number variations are measured as homozygous deletion, hemizygous deletion, neutral/no change, gain and high level amplification. Further information about the patients was acquired from TCGA to assess some characteristic features of the patients. Table 4.2 gives the variables (genes) in the copy number variation for the patients.
Table 4.1: Description of Neuro-image Features Used
Variable Name: Description
Edema: What proportion of the abnormality is vasogenic edema? It is an accumulation of fluid in the brain that happens when the blood-brain barrier is broken. Edema should be greater in signal than nCET and somewhat lower in signal than CSF. (Pseudopods are characteristic of edema.)
Proportion Necrosis: Defined as the region within the tumor that does not enhance or shows markedly diminished enhancement, is high on T2W and proton density images, is low on T1W images, and has an irregular border.
Proportion Enhancing: Proportion of tumor that is enhancing. (Assuming that the entire abnormality may be comprised of: (1) an enhancing component, (2) a non-enhancing component, (3) a necrotic component and (4) an edema component.)
Proportion nCET: Defined as the regions of T2W hyperintensity (less than the intensity of cerebrospinal fluid, with corresponding T1W hypointensity) that are associated with mass effect and architectural distortion, including blurring of the gray-white interface. (Assuming that the entire abnormality may be comprised of: (1) an enhancing component, (2) a non-enhancing component, (3) a necrotic component and (4) an edema component.)
Major Axis: Largest perpendicular (x-y) cross-sectional diameter of T2 signal abnormality measured on a single axial image only.
Minor Axis: Smallest perpendicular (x-y) cross-sectional diameter of T2 signal abnormality measured on a single axial image only.
Table 4.2: Copy Number Variation Variables (Genes)
Variables Label
akt1 AKT serine/threonine kinase 1
akt2 AKT serine/threonine kinase 2
akt3 AKT serine/threonine kinase 3
ccnd2 cyclin D2
cdk4 cyclin dependent kinase 4
cdk6 cyclin dependent kinase 6
cdk2na cyclin dependent kinase inhibitor 2A
cdkn2c cyclin dependent kinase inhibitor 2C
egfr epidermal growth factor receptor
erbb2 erb-b2 receptor tyrosine kinase 2
foxo1 forkhead box O1
foxo3 forkhead box O3
hras HRas proto-oncogene, GTPase
kras KRAS proto-oncogene, GTPase
mdm2 MDM2 proto-oncogene
mdm4 MDM4 proto-oncogene
met MET proto-oncogene, receptor tyrosine kinase
nf1 neurofibromin 1
nras neuroblastoma RAS viral oncogene homolog
pdgfra platelet derived growth factor receptor alpha
pik3ca phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit alpha
pik3r1 phosphoinositide-3-kinase regulatory subunit 1
pten phosphatase and tensin homolog
rb1 RB transcriptional corepressor 1
spry2 sprouty RTK signaling antagonist 2
tp53 tumor protein p53
4.2 Preliminaries
This section describes some notable characteristics of the Glioblastoma patients: sex, age at diagnosis, survival status (deceased or living), expression subtype and overall survival time after diagnosis (length of time from diagnosis to death). Frequencies and descriptive statistics of these variables are presented and discussed.
Table 4.3 shows that, of the 527 GBM patients, the majority (61.5%) are male. Also, nearly eight out of every ten (77%) of the patients were deceased as of March 2016. The mean survival time from diagnosis to death was 15 months with a standard deviation of 16.54 months. The mean age at diagnosis was 58 years (Table 4.4). The survival time and age at diagnosis in our data set are consistent with the 2012 cancer statistics [35], which stated that GBM is generally diagnosed at an average age of 55 years and gives the affected patient an average survival time of only 10 to 18 months.
Table 4.3: Sex and Survival Status Distribution of Patients
Characteristic Frequency Percentage
Sex: Male 324 61.5
Female 203 38.5
Survival Status: Deceased 406 77.0
Living 121 23.0
Table 4.4: Age and Overall Survival Time of Patients
Variable Minimum Maximum Mean SD
Age (in years) 10 89 58.23 14.31
Survival time (in months) 0 128 15.10 16.54
The Cancer Genome Atlas (TCGA) in 2011 indicated four distinct expression subtypes of GBM [1]: Classical, Proneural, Neural and Mesenchymal. The Classical GBM tumors are characterized by extremely high levels of EGFR, whereas abnormalities of the EGFR gene occur at a lower rate in the other three subtypes. Furthermore, tumor protein p53 (TP53), the most frequently mutated gene in GBM, is not mutated in the Classical GBM tumors. TP53 is however significantly mutated in the Proneural tumors. Only Proneural tumors have abnormally high levels of mutations of PDGFRA. Mutations in the tumor suppressor gene NF1 are most frequent in the Mesenchymal group. Also, tumor suppressor genes such as TP53 and PTEN have frequent mutations in this group. For the Neural group, no gene stands out with an abnormally higher or lower mutation rate [1].
A CpG Island Methylator Phenotype (G-CIMP) has also been identified, and it presents a further distinct subgroup of GBM [25].
Table 4.5 shows that the largest group (26.5%) of the GBM patients in our dataset have the Mesenchymal subtype, followed by the Classical subtype (25.1%).
Table 4.5: Frequency Distribution of Expression Subtype
Subtype Frequency Percent
Classical 144 25.1
G-CIMP 38 6.6
Mesenchymal 152 26.5
Neural 83 14.5
Proneural 97 16.9
Not Available 13 2.3
4.3 Main Results
4.3.1 Correlation matrix of variables
Canonical correlation analysis requires that there be no very high correlations within each of the sets of variables. So we checked the correlations within each set.
Tables 4.6 and 4.7 list the correlation coefficients within each variable set. Variable set 1 is the VASARI neuroimage features whereas variable set 2 is the copy number variation variables. Table 4.6 shows the correlations among the VASARI neuroimage features and Table 4.7 presents the correlations among the copy number variation variables. Among the VASARI features, the correlation coefficient farthest from zero was −0.6443, between enhancing and edema. This indicates that as the proportion of edema increases, the proportion of enhancing tumor tends to decrease, and vice versa. The major axis of the tumor has a positive relationship with the minor axis and with nCET. However, the major axis showed a negative relationship with necrosis, edema and enhancing. Edema is negatively correlated with all other features. nCET recorded a positive relationship with the major axis, the minor axis and necrosis.
For the copy number variations, the correlation coefficient farthest from zero among the 26 variables was 0.8962 (Table 4.7), between the foxo1 gene and the rb1 gene. This shows that the foxo1 and rb1 genes have a strong positive relationship: amplification of a patient's foxo1 gene tends to be accompanied by amplification of the patient's rb1 gene, and vice versa.
Table 4.6: Correlations for Variable Set 1
Major Axis Minor Axis Necrosis Edema nCET Enhancing
Major Axis 1.0000
Minor Axis 0.4828 1.0000
Necrosis -0.0152 0.1356 1.0000
Edema -0.0752 -0.3974 -0.2578 1.0000
nCET 0.1168 0.1724 0.0160 -0.2488 1.0000
Enhancing -0.0203 0.2519 -0.0034 -0.6443 -0.1208 1.0000
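The within-set check described above amounts to scanning the off-diagonal entries of a correlation matrix for the value farthest from zero. A minimal sketch using the entries of Table 4.6 follows; the function name `strongest_pair` is hypothetical.

```python
# Lower triangle of Table 4.6 (correlations for variable set 1).
names = ["Major Axis", "Minor Axis", "Necrosis", "Edema", "nCET", "Enhancing"]
lower = [
    [1.0000],
    [0.4828, 1.0000],
    [-0.0152, 0.1356, 1.0000],
    [-0.0752, -0.3974, -0.2578, 1.0000],
    [0.1168, 0.1724, 0.0160, -0.2488, 1.0000],
    [-0.0203, 0.2519, -0.0034, -0.6443, -0.1208, 1.0000],
]

def strongest_pair(names, lower):
    """Return (row name, column name, r) for the off-diagonal r farthest from zero."""
    best = None
    for i, row in enumerate(lower):
        for j, r in enumerate(row[:-1]):       # row[:-1] skips the diagonal entry
            if best is None or abs(r) > abs(best[2]):
                best = (names[i], names[j], r)
    return best

result = strongest_pair(names, lower)
print(result)
```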
Ta
bl
e
4.
7:
C
or
re
la
ti
on
s
fo
r
th
e
C
op
y
N
um
be
r
V
ar
ia
ti
on
V
ar
ia
bl
es
V
ar
ia
bl
es
ak
t1
ak
t2
ak
t3
cc
nd
2
cd
k4
cd
k6
cd
k2
na
cd
k2
nc
eg
fr
er
bb
2
fo
xo
1
fo
xo
3
A
kt
1
1.
00
A
kt
2
0.
30
72
1.
00
A
kt
3
0.
02
44
0.
22
97
1.
00
cc
nd
2
-0
.0
58
5
0.
08
06
-0
.0
10
9
1.
00
cd
k4
-0
.1
00
9
0.
07
22
0.
08
87
0.
40
62
1.
00
cd
k6
0.
09
21
0.
19
25
0.
03
11
0.
05
62
0.
05
24
1.
00
00
cd
k2
na
0.
05
69
-0
.1
65
1
0.
04
70
0.
00
66
0.2427 -0.1943 1.0000
cdkn2c 0.1730 0.0992 0.4207 -0.0629 0.0176 -0.0088 0.1245 1.0000
egfr 0.2634 0.2630 -0.0267 0.0035 0.1480 0.5317 -0.1641 0.0125 1.0000
erbb2 0.0487 -0.0836 0.0751 -0.2019 0.0057 -0.1688 -0.0588 -0.0746 -0.0058 1.0000
foxo1 0.1566 -0.0414 -0.1745 0.0001 0.0118 -0.0048 -0.1429 -0.1155 0.2134 0.2780 1.0000
foxo3 0.4316 0.1600 0.0577 -0.2336 -0.1191 -0.1039 0.1712 0.1527 0.0310 0.2872 0.0328 1.0000
hras 0.0757 -0.0546 -0.1156 -0.0973 -0.0315 0.0575 -0.1985 -0.0313 0.2057 0.0090 -0.0644 0.0314
kras -0.1672 0.0850 0.1797 0.6658 0.4086 -0.0853 -0.0583 0.0909 -0.0645 -0.1893 -0.0974 -0.2586
mdm2 -0.0172 0.1293 0.2369 0.3919 0.6349 0.1119 0.0436 0.1182 0.2073 -0.0677 -0.0115 -0.1872
mdm4 0.0080 0.0050 0.4332 0.0200 0.0799 -0.0355 0.0477 0.2399 0.0619 0.0623 -0.2043 0.0954
met 0.1184 0.1034 -0.0171 0.0509 0.0340 0.6795 -0.0528 -0.0016 0.3681 -0.2262 -0.0306 -0.0205
nf1 0.0560 -0.0036 0.0588 -0.1925 0.0587 -0.1221 0.0258 -0.0979 0.0446 0.8323 0.2801 0.2985
nras -0.0228 0.0153 0.5424 -0.0938 0.0172 -0.0099 0.1558 0.4750 -0.0369 0.1158 -0.1737 0.2416
pdgfra -0.1186 -0.0684 0.0142 -0.1621 0.0568 -0.1763 0.0188 -0.0579 -0.1502 0.0691 0.0276 0.0687
pik3ca -0.1534 -0.1360 0.0782 -0.2153 -0.1913 -0.0298 -0.0921 0.0439 -0.0283 0.1617 -0.1309 -0.0823
pik3r1 -0.0123 0.1168 -0.1214 0.0781 0.0712 0.3163 -0.1567 0.1627 0.1767 -0.1206 0.1065 -0.0631
pten -0.0678 -0.0984 -0.0462 0.1408 -0.0412 -0.3242 0.2137 0.0579 -0.1502 0.0691 0.1718 0.0655
rb1 0.2131 0.0598 -0.2147 0.0782 0.0633 0.346 -0.1346 -0.1344 0.2282 0.2064 0.8962 -0.0277
spry2 0.0444 -0.0568 -0.0835 -0.0375 0.0012 0.0258 -0.0870 -0.0031 0.1697 0.2115 0.7361 0.0089
tp53 0.1030 0.0385 -0.2054 -0.0149 0.1742 -0.0396 -0.1314 -0.0800 0.0831 0.5962 0.3445 0.1749
Table 4.8: Correlations for the Copy Number Variation Variables
Variables hras kras mdm2 mdm4 met nf1 nras pdgfra pik3ca pik3r1 pten rb1 spry2
hras 1.0000
kras -0.0013 1.0000
mdm2 0.0796 0.4919 1.0000
mdm4 0.169 0.0642 -0.0025 1.0000
met 0.0723 -0.0967 0.0742 -0.0179 1.0000
nf1 0.0468 -0.1841 -0.0940 0.0228 -0.1840 1.0000
nras -0.0541 0.0751 0.0380 0.3072 0.0079 0.0950 1.0000
pdgfra -0.2558 -0.1153 0.0211 0.1107 -0.0817 0.1272 -0.0608 1.0000
pik3ca -0.0818 -0.1110 -0.1209 0.0405 -0.0035 0.0448 0.0536 -0.0132 1.0000
pik3r1 0.0034 0.0180 0.0826 -0.0591 0.3858 -0.1868 0.0174 -0.0731 -0.1010 1.0000
pten -0.0028 0.0675 -0.0102 -0.0363 -0.2668 0.0790 0.1167 -0.0271 -0.0257 0.0200 1.0000
rb1 -0.0856 -0.1199 0.0180 -0.2309 0.0040 0.2060 -0.2139 -0.0022 -0.1577 0.1360 0.1456 1.0000
spry2 -0.0330 -0.1979 -0.0595 -0.0160 0.0002 0.1884 0.0044 -0.0293 -0.1076 0.1065 0.2381 0.7404 1.0000
tp53 0.1641 -0.1804 0.0810 -0.1645 -0.1200 0.6250 -0.1912 0.1097 -0.2035 0.1164 0.0355 0.3393 0.2742
Table 4.9: Correlations between Variable Set 1 and Variable Set 2
Variables Major Axis Minor Axis Necrosis Edema nCET Enhancing
Akt1 -0.0289 -0.2096 0.1527 0.0281 -0.2009 -0.0587
Akt2 0.0271 -0.0422 -0.0062 -0.0024 -0.1415 0.0594
Akt3 0.1124 0.1127 0.0157 0.1128 -0.0196 0.0094
ccnd2 -0.0192 -0.1031 0.0303 -0.0629 0.0140 0.0103
cdk4 0.0141 -0.0816 0.1420 0.0031 -0.3819 0.0483
cdk6 -0.0236 -0.0024 -0.0659 -0.0416 -0.2304 0.0700
cdk2na -0.0265 -0.1238 0.0559 0.1030 0.1351 -0.2265
cdkn2c -0.1915 -0.0106 0.1025 -0.0262 0.1272 0.0155
egfr 0.0772 0.0195 -0.0626 0.0062 -0.1012 0.0867
erbb2 0.1583 0.1254 -0.0789 0.1716 0.0138 -0.1357
foxo1 0.0979 0.3049 -0.1601 -0.1051 -0.0991 0.1337
foxo3 0.0067 0.0066 0.1963 0.0420 -0.0899 -0.0708
hras -0.0282 -0.0790 -0.1048 0.1182 0.0174 -0.0265
kras 0.0288 -0.0641 0.0527 -0.0821 0.0586 0.0724
mdm2 -0.0116 0.0226 0.1213 -0.0244 -0.0498 0.0478
mdm4 0.0430 0.0116 0.0812 -0.0212 0.0375 0.0729
met -0.0707 0.0033 0.0124 -0.1086 -0.0251 0.0212
nf1 0.1517 0.2630 -0.0658 0.0481 0.0132 -0.0629
nras -0.1572 0.1336 0.1472 0.1360 0.0256 -0.0295
pdgfra 0.2463 0.2697 0.7260 -0.1559 -0.0174 0.0033
pik3ca -0.0046 0.0441 -0.0637 0.1438 -0.0432 -0.1074
pik3r1 -0.0621 0.0953 -0.0721 -0.0738 -0.1124 0.0897
pten -0.3747 -0.2399 0.2036 0.0265 0.0331 -0.0672
rb1 0.0462 0.0101 -0.2429 -0.1075 -0.1338 0.1660
spry2 -0.4289 -0.0147 -0.4163 -0.0987 0.4001 0.1351
tp53 0.4475 0.0736 -0.4066 -0.0740 0.3999 -0.0047
The correlations between the copy number variation variables and the image features are
presented in Table 4.9. Both negative and positive relationships occur between the two
variable sets, and most of the correlations are relatively low. The highest correlation
coefficient (0.7260) is between pdgfra and necrosis. Moderate correlations (-0.4163, -0.4066)
exist between spry2, tp53 and necrosis respectively. Moderate correlations (0.4475, -0.4289,
-0.3747) also exist between tp53, spry2, pten and major axis respectively. Moreover, nCET
is moderately correlated with cdk4 (-0.3819), spry2 (0.4001) and tp53 (0.3999). These
bivariate correlations suggest a relationship between some of the features and genes in the
study.
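As an illustration of how a cross-correlation table of this kind can be produced, the sketch below computes the Pearson correlation of every gene column with every feature column. It is not the code used in this study: the data are simulated stand-ins, the function name `cross_correlations` is hypothetical, and the use of Pearson correlations is an assumption.

```python
import numpy as np

def cross_correlations(genes, features):
    """Pearson correlation of every gene column with every feature column."""
    g = (genes - genes.mean(axis=0)) / genes.std(axis=0, ddof=1)
    f = (features - features.mean(axis=0)) / features.std(axis=0, ddof=1)
    n = genes.shape[0]
    return g.T @ f / (n - 1)          # shape: (number of genes, number of features)

rng = np.random.default_rng(0)
# Simulated stand-in data (assumption): 267 patients, 26 genes, 6 image features.
genes = rng.normal(size=(267, 26))
features = rng.normal(size=(267, 6))
features[:, 2] += 0.8 * genes[:, 19]  # plant one strong gene-feature link

r = cross_correlations(genes, features)
# Locate the gene-feature pair with the largest absolute correlation.
gene_idx, feat_idx = np.unravel_index(np.abs(r).argmax(), r.shape)
print(r.shape, gene_idx, feat_idx)
```

Scanning the resulting matrix for its largest absolute entries is the step that singles out pairs such as pdgfra and necrosis in Table 4.9.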
The raw canonical coefficients are the weights of the M-variables and the N-variables that
maximize the correlation between the two sets of variables. The coefficients are interpreted
in the same way as regression coefficients. So from Table 4.10, for the variate M1, a unit
increase in the proportion of necrosis leads to a 1.6797 increase in the first canonical
variate of the N-variable set, with all other variables held constant.
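To make the role of the raw coefficients concrete, the following sketch obtains them by whitening each variable set and taking a singular value decomposition of the whitened cross-covariance matrix. This is one standard route to the canonical weights, shown under stated assumptions: the data are simulated, the helper names `inv_sqrt` and `cca` are hypothetical, and the software actually used for the tables is not specified here.

```python
import numpy as np

def inv_sqrt(s):
    """Inverse symmetric square root of a positive-definite matrix."""
    w, v = np.linalg.eigh(s)
    return v @ np.diag(w ** -0.5) @ v.T

def cca(x, y):
    """Return canonical correlations and raw coefficient matrices (a, b)."""
    xc, yc = x - x.mean(axis=0), y - y.mean(axis=0)
    n = x.shape[0]
    sxx = xc.T @ xc / (n - 1)
    syy = yc.T @ yc / (n - 1)
    sxy = xc.T @ yc / (n - 1)
    kx, ky = inv_sqrt(sxx), inv_sqrt(syy)
    u, rho, vt = np.linalg.svd(kx @ sxy @ ky, full_matrices=False)
    return rho, kx @ u, ky @ vt.T     # columns of a and b are the raw weights

rng = np.random.default_rng(1)
x = rng.normal(size=(267, 6))         # simulated stand-in for the image features
y = rng.normal(size=(267, 26))        # simulated stand-in for the CNV variables
y[:, 0] += x[:, 0]                    # induce one shared direction

rho, a, b = cca(x, y)
u_scores = (x - x.mean(axis=0)) @ a   # canonical variates of the first set
v_scores = (y - y.mean(axis=0)) @ b   # canonical variates of the second set
print(np.round(rho, 4))
```

Because each variate is a linear combination of the original variables, raising variable j by one unit while holding the others fixed changes the score on the k-th variate of its own set by exactly `a[j, k]`.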
Table 4.10: Raw Coefficients for the Neuro-image features
1 2 3 4 5 6
Major Axis 0.4264 0.1005 0.6760 0.2438 0.2857 -0.2874
Minor Axis 0.1240 -0.6065 -0.4383 -0.2898 0.0600 0.0756
Necrosis 1.6797 2.2665 -2.3086 0.3454 1.3586 -1.6622
Edema 0.6631 -0.6532 -1.1169 1.4566 -0.7277 -1.1151
nCET -1.2893 -0.3184 -0.3771 0.8520 0.5133 -1.5226
Enhancing 0.2989 0.0219 -0.2357 0.2624 -1.1305 -1.6793
Table 4.11: Raw Coefficients for the Copy Number Variation Variables
Variables 1 2 3 4 5 6
Akt1 0.1207 0.9216 -0.0343 0.2824 0.2217 0.4739
Akt2 -0.0602 -0.1999 0.0780 0.1185 -0.1724 0.4816
Akt3 0.7972 -0.7069 1.1209 0.6279 -0.0815 -0.3316
ccnd2 0.2203 -0.6280 -0.2054 0.1681 0.4499 0.7753
cdk4 0.7373 0.7428 -0.0870 0.1356 -0.5273 0.3419
cdk6 0.3429 0.3914 0.2704 -0.0327 -0.4198 0.7079
cdk2na -0.3741 -0.3616 0.3788 0.5088 0.4692 0.2825
cdkn2c -0.6646 0.2594 -0.3989 -0.1759 0.1041 -0.3262
egfr 0.1092 -0.3501 0.2107 0.2607 0.1746 -0.6444
erbb2 0.3651 -0.2714 -0.0902 2.4166 0.1850 0.0921
foxo1 0.6733 -1.2847 0.0458 -0.5207 0.7877 1.0521
foxo3 0.5249 0.3438 -0.0674 -0.3994 -0.0492 0.0241
hras 0.3176 -0.4606 0.1801 0.7627 -0.2045 0.2671
kras -0.9254 0.6619 1.1563 -0.0723 0.3101 -1.0851
mdm2 -0.1203 -0.2112 -0.6641 -0.4173 0.0846 -0.1012
mdm4 -0.1421 0.3140 0.1615 -0.2607 -0.0856 -0.6363
met -0.8376 0.4377 -0.1256 -0.4054 0.6902 0.0058
nf1 0.1457 -0.3427 -0.0013 -1.6359 0.0401 0.2830
nras 0.1661 -0.5333 -1.9065 -0.2347 -0.1279 -0.2008
pdgfra 0.4226 -0.0704 0.0563 -0.3877 0.7704 0.0288
pik3ca 0.0430 -0.1970 -0.0217 -0.0556 -0.0546 0.8620
pik3r1 0.6607 -0.8831 0.1641 -0.4134 -0.4106 0.8242
pten 0.0186 1.3438 -0.4755 0.1518 0.0121 0.1979
rb1 0.5788 0.9265 0.4969 0.0324 -1.0624 -1.5470
spry2 -1.4400 -0.0932 0.0854 -0.1515 -0.6050 0.6273
tp53 -1.4510 -0.0840 0.3839 -0.4220 0.6305 -0.4147
4.3.2 Assessment of Overall Model Fit
We now present results on the overall statistical fit of the model. The multivariate
F-test and the corresponding Wilks’ lambda evaluate the hypotheses below.
H0 : The canonical correlation coefficients for all functions are zero.
H1 : The canonical correlation coefficient for at least one function is not zero.
In addition, for each canonical function we test the null hypothesis that its canonical
correlation coefficient is zero.
From Table 4.12, the null hypothesis for the entire model is rejected at the 0.05
significance level, hence we conclude that at least one canonical function has a non-zero
canonical correlation coefficient. Table 4.13 confirms that the first three canonical
correlation coefficients are statistically significant at the 0.05 level; that is, the null
hypothesis that the canonical correlation coefficient is zero is rejected for each of the
first three canonical functions. The remaining three correlation coefficients are not
significant based on the multivariate F-tests and Wilks’ lambda, so they will not be
interpreted.
Table 4.12: Test of Significance of all Canonical Correlations
Statistic df1 df2 F Prob>F
Wilks’ Lambda 0.127081 156 1386.69 3.7459 0.0000
Table 4.13: Test of Significance of each Canonical Correlation
Test of Canonical Correlation 1
Statistic df1 df2 F Prob>F
Wilks’ Lambda 0.127081 156 1386.69 3.7459 0.0000
Test of Canonical Correlation 2
Statistic df1 df2 F Prob>F
Wilks’ Lambda 0.230809 125 1166.35 3.2384 0.0000
Test of Canonical Correlation 3
Statistic df1 df2 F Prob>F
Wilks’ Lambda 0.38655 96 941.39 2.6591 0.0001
Test of Canonical Correlation 4
Statistic df1 df2 F Prob>F
Wilks’ Lambda 0.730162 69 350.64 1.330 0.0514
Test of Canonical Correlation 5
Statistic df1 df2 F Prob>F
Wilks’ Lambda 0.812344 44 248.03 1.2001 0.1957
Test of Canonical Correlation 6
Statistic df1 df2 F Prob>F
Wilks’ Lambda 0.894344 21 160.12 1.0831 0.3711
The canonical correlation coefficients and the eigenvalues (canonical roots) of the
functions are shown in Table 4.14. Each eigenvalue is the square of the corresponding
canonical correlation coefficient, which gives the magnitude of the relationship between
the two variates of a pair.
Table 4.14: Canonical Correlations and Eigenvalues
Coefficients 0.6704 0.6347 0.5552 0.4844 0.4285 0.3250
Eigenvalues 0.4494 0.4028 0.3082 0.2346 0.1836 0.1056
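The link between the canonical correlations and the significance tests can be sketched as follows. Wilks’ lambda for the test of functions k to 6 is the product of (1 - r^2) over those functions, and Rao's approximation converts it to an F statistic. The function names are hypothetical, and the degrees-of-freedom convention used here (with m computed from the full p and q) is an assumption, chosen because it reproduces df1 and df2 of the first three tests in Tables 4.12 and 4.13.

```python
import math

def sequential_wilks(canonical_correlations):
    """Wilks' lambda for the tests of correlations k..last, for k = 1, 2, ..."""
    lambdas = []
    running = 1.0
    for r in reversed(canonical_correlations):
        running *= 1.0 - r * r
        lambdas.append(running)
    return lambdas[::-1]

def rao_f(wilks, n, p, q, k):
    """Rao's F approximation for the test of correlations k..min(p, q)."""
    pk, qk = p - k + 1, q - k + 1
    df1 = pk * qk
    m = n - 1.5 - (p + q) / 2.0
    denom = pk ** 2 + qk ** 2 - 5
    # s is set to 1 when the usual formula is undefined (denominator zero).
    s = math.sqrt((pk ** 2 * qk ** 2 - 4) / denom) if denom > 0 else 1.0
    df2 = m * s - df1 / 2.0 + 1.0
    root = wilks ** (1.0 / s)
    return df1, df2, (1.0 - root) / root * df2 / df1

correlations = [0.6704, 0.6347, 0.5552, 0.4844, 0.4285, 0.3250]  # Table 4.14
lambdas = sequential_wilks(correlations)
# Overall test: n = 267 patients, p = 6 image features, q = 26 genes.
df1, df2, f_stat = rao_f(lambdas[0], n=267, p=6, q=26, k=1)
print(round(lambdas[0], 4), df1, round(df2, 2), round(f_stat, 2))
```

Small differences from the tabulated values are expected because the correlations in Table 4.14 are rounded to four decimals.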
Table 4.15 presents the canonical redundancy index for the canonical correlations. In the
first canonical function, the redundancy for the M-variables is 0.2012 and the redundancy
for the N-variables is 0.2101. These values show that each variate explains almost the same
amount of variance in the opposite set of variables. In the second function, the redundancy
measures for the M- and N-variables are 0.1876 and 0.1501 respectively. This means that, in
the second function, the variate for the N-variables explains less variance in the
M-variables than the variate for the M-variables explains in the set of N-variables.
Table 4.15: Canonical redundancy analysis for Canonical Correlations
Canonical redundancy analysis for Canonical Correlation 1
Canonical Correlation Coefficient 0.6704
Squared Canonical Correlation Coefficient 0.4494
Proportion of standardized variance O.V OP.V
of M variables with 0.3001 0.2101
of N variables with 0.3121 0.2112
Canonical redundancy analysis for Canonical Correlation 2
Canonical Correlation Coefficient 0.6347
Squared Canonical Correlation Coefficient 0.4028
Proportion of standardized variance O.V OP.V
of M variables with 0.4212 0.1501
of N variables with 0.3212 0.1876
Canonical redundancy analysis for Canonical Correlation 3
Canonical Correlation Coefficient 0.5552
Squared Canonical Correlation Coefficient 0.3082
Proportion of standardized variance O.V OP.V
of M variables with 0.3992 0.1001
of N variables with 0.3685 0.1019
O.V = Own Variate, OP.V= Opposite Variate
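For reference, the usual definition of the redundancy index can be written in a few lines: the proportion of a set's standardized variance explained by its own variate is the mean squared loading, and multiplying it by the squared canonical correlation gives the proportion explained by the opposite variate. The function name and the loadings in the example are hypothetical illustrations, not values from this study.

```python
def redundancy(own_loadings, canonical_correlation):
    """Return (own-variate variance share, redundancy index) for one function."""
    own_variance = sum(l * l for l in own_loadings) / len(own_loadings)
    return own_variance, own_variance * canonical_correlation ** 2

# Hypothetical two-variable set with loadings 0.6 and 0.8 and r = 0.5.
own, red = redundancy([0.6, 0.8], 0.5)
print(round(own, 4), round(red, 4))

# The index reaches 1 only when every loading is +/-1 and r = 1.
upper_own, upper_red = redundancy([1.0, -1.0], 1.0)
```

The second call shows the limiting case: the index equals 1 only if the variate reproduces every variable in its set and the canonical correlation is 1.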
4.3.3 Interpreting Canonical Variate Pairs
Based on the F-tests and Wilks’ lambda, we have concluded that only three canonical
correlation coefficients are significant, so we can interpret and report the contribution
of each original variable in these canonical functions. We use the standardized canonical
coefficients and the canonical loadings to assess the relative contributions of the
variables.
The canonical functions can be interpreted by observing the magnitude and sign of the
standardized canonical coefficient or the canonical loading assigned to each original
variable in its canonical variate. Variables with larger coefficients in absolute value
contribute more to the variate. We use a threshold of 0.5 in absolute value to identify
the most important variables in a canonical function. Original variables whose coefficients
have opposite signs are inversely associated with one another, while variables whose
coefficients have the same sign are directly associated. However, interpreting the
contribution of an original variable through its canonical coefficient faces the same
problems as interpreting beta values in a regression model, so caution is needed in
interpreting the results of a canonical analysis [2]. One of these problems is that the
weights are subject to considerable variability from one sample to another. Therefore, the
canonical loadings will also be used to assess the contribution of the original variables.
If the findings from the standardized coefficients and the canonical loadings are similar,
this supports the reliability of the results.
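The difference between weights and loadings can be illustrated with a short sketch. A loading is the correlation between an original variable and its own canonical variate; for standardized variables and a unit-variance variate it equals the correlation matrix of the set multiplied by the standardized weights. The data and the weight vector below are hypothetical and serve only to show the identity.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(200, 3))            # simulated variable set (assumption)
x[:, 1] += 0.7 * x[:, 0]                 # make two variables collinear
z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
r_xx = np.corrcoef(x, rowvar=False)

w = np.array([0.6, -0.3, 0.5])           # hypothetical standardized weights
w = w / np.sqrt(w @ r_xx @ w)            # rescale so the variate has unit variance
variate = z @ w

# Loadings computed directly as correlations with the variate ...
loadings_direct = np.array(
    [np.corrcoef(z[:, j], variate)[0, 1] for j in range(z.shape[1])]
)
# ... and through the identity loadings = R_xx @ weights.
loadings_formula = r_xx @ w
print(np.round(loadings_direct, 4))
```

Because the loadings mix in the correlations among the variables, a variable with a small weight can still have a large loading when it is correlated with a heavily weighted variable, which is why both quantities are examined.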
4.3.4 Interpretation of Canonical Variate Using Canonical Weights
Here, we present the standardized coefficients and interpret them. Standardized
coefficients allow easier comparison among variables whose standard deviations differ.
Because the canonical coefficients are standardized, we can compare the variables using
their weights. The relative sizes of the canonical weights for a given canonical root
indicate the relative importance of the variables for that root [2].
The standardized canonical coefficients for the significant functions are shown in Table
4.16. Considering the first set of variables (neuro-image features) and the first canonical
function, nCET is the most important, followed by major axis, then edema and necrosis.
A one standard deviation increase in the proportion of necrosis leads to a 0.4280 standard
deviation increase in the score on the first canonical variate in the second variable set
when the other variables are all held constant. Also, a one standard deviation increase in
nCET leads to a 0.6407 standard deviation decrease in the score on the first canonical
variate in the second variable set with the other variables held constant. In the second
canonical function, the most important features are minor axis, necrosis and edema. The
third canonical function has high coefficient values for major axis, minor axis, necrosis
and edema.
Considering the standardized coefficients of the copy number variations in Table 4.17,
spry2, tp53, cdk4, foxo1, met, pdgfra, rb1, cdk2na, cdkn2c and akt3 are most closely related
to the first canonical function since their coefficients exceed 0.3 in absolute value,
whilst foxo1, cdk4, akt1, pten, rb1, akt3, ccnd2, cdk2na, pik3r1 and kras are most closely
related to the second canonical function. For the third canonical function, nras, kras,
akt3, mdm2 and cdk2na are most closely related. Table 4.18 below summarizes the most
important features and genes for each function based on the magnitude of the standardized
coefficients, with a threshold of 0.5 in absolute value.
Table 4.16: Standardized Coefficients for the Neuro-image features
1 2 3
Major Axis 0.5317 0.1253 0.8430
Minor Axis 0.1914 -0.9363 -0.6766
Necrosis 0.4280 0.5774 -0.5882
Edema 0.4327 -0.4263 -0.7288
nCET -0.6407 -0.1582 -0.1874
Enhancing 0.2125 0.0156 -0.1675
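Tables 4.10 and 4.16 can be related through a small check. Assuming that each standardized coefficient is the raw coefficient multiplied by the standard deviation of its variable (the usual convention, taken here as an assumption), the ratio of the two should give the same implied standard deviation in every function. The sketch below uses the first two functions of both tables.

```python
features = ["Major Axis", "Minor Axis", "Necrosis", "Edema", "nCET", "Enhancing"]

# Raw coefficients, functions 1 and 2 (Table 4.10).
raw = {
    1: [0.4264, 0.1240, 1.6797, 0.6631, -1.2893, 0.2989],
    2: [0.1005, -0.6065, 2.2665, -0.6532, -0.3184, 0.0219],
}
# Standardized coefficients, functions 1 and 2 (Table 4.16).
std = {
    1: [0.5317, 0.1914, 0.4280, 0.4327, -0.6407, 0.2125],
    2: [0.1253, -0.9363, 0.5774, -0.4263, -0.1582, 0.0156],
}

# Implied standard deviation of each feature, once per function.
implied_sd = {
    name: [std[k][i] / raw[k][i] for k in (1, 2)]
    for i, name in enumerate(features)
}
for name, (sd1, sd2) in implied_sd.items():
    print(f"{name:<10} {sd1:.4f} {sd2:.4f}")
```

The two columns agree to about two decimal places for every feature, which is the behaviour expected if the two tables differ only by this rescaling.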
Table 4.17: Standardized Coefficients for the Copy Number Variation Variables
Variables 1 2 3
Akt1 0.0735 0.5615 -0.0209
Akt2 -0.0355 -0.1178 0.0459
Akt3 0.3587 -0.3181 0.5040
ccnd2 0.1246 -0.3551 -0.1162
cdk4 0.6223 0.6269 -0.0734
cdk6 0.1661 0.1896 0.1310
cdk2na -0.3365 -0.3253 0.3407
cdkn2c -0.3274 0.1278 -0.1965
egfr 0.0755 -0.2418 0.1455
erbb2 0.1871 -0.1390 -0.0462
foxo1 0.3640 -0.6945 0.0247
foxo3 0.2905 0.1903 -0.0373
hras 0.1534 -0.2224 0.0870
kras -0.0145 0.3219 0.5623
mdm2 -0.0895 -0.1571 -0.4939
mdm4 -0.0929 0.2052 0.1055
met -0.4111 0.2148 -0.0617
nf1 0.0759 -0.1785 -0.0007
nras 0.0772 -0.2479 -0.8862
pdgfra 0.3257 -0.0542 0.0434
pik3ca 0.0235 -0.1076 -0.0118
pik3r1 0.2777 -0.3712 0.0690
pten 0.0077 0.5549 -0.1963
rb1 0.3224 0.5161 0.2768
spry2 -0.7778 -0.0503 0.0461
tp53 -0.6892 -0.0399 0.1823
Table 4.18: Summary of Important Related Variables
1                         2                         3
Image features   Coeff.   Image features   Coeff.   Image features   Coeff.
nCET            -0.6407   Minor Axis      -0.9363   Major Axis        0.8430
Major axis       0.5317   Necrosis         0.5774   Edema            -0.7288
                                                    Minor Axis       -0.6766
                                                    Necrosis         -0.5882
CNV                       CNV                       CNV
spry2           -0.7778   foxo1           -0.6945   nras             -0.8862
tp53            -0.6892   cdk4             0.6269   kras              0.5623
cdk4             0.6223   Akt1             0.5615   Akt3              0.5040
                          pten             0.5549   cdk2na            0.5001
                          rb1              0.5161
4.3.5 Interpretation of Canonical Variate Using Canonical Loadings
Table 4.19 shows that major axis, nCET and necrosis are most closely related to the first
canonical function since their loadings exceed 0.3 in absolute value. The second canonical
function is closely related to minor axis, necrosis and major axis. The third function is
most related to major axis and necrosis.
From Table 4.20, tp53, spry2, cdk4, pdgfra and cdk2na are closely related to the first
function, while akt1, pten, foxo1, akt3, cdk4, nf1, erbb2 and rb1 are closely related to the
second function. Also, nras, cdkn2c, cdk2na, foxo1, mdm2, rb1, akt3 and kras are closely
related to the third. Table 4.21 below summarizes the most important features and genes for
each function based on the magnitude of the canonical loadings, with a threshold of 0.5 in
absolute value.
Table 4.19: Canonical Loadings for the Neuro-image features
1 2 3
Major Axis 0.5059 -0.3222 0.5615
Minor Axis 0.2772 -0.6514 -0.1343
Necrosis 0.3233 0.5559 -0.5072
Edema 0.2289 -0.1832 -0.2172
nCET -0.6721 -0.1915 -0.0134
Enhancing 0.0470 0.0689 0.1391
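The selection rule used to build the summary tables can be expressed as a short filter. The sketch below applies the 0.5 absolute-value threshold to the image-feature loadings of Table 4.19; the function name `important` is a hypothetical helper.

```python
# Canonical loadings of the image features for functions 1-3 (Table 4.19).
loadings = {
    "Major Axis": (0.5059, -0.3222, 0.5615),
    "Minor Axis": (0.2772, -0.6514, -0.1343),
    "Necrosis": (0.3233, 0.5559, -0.5072),
    "Edema": (0.2289, -0.1832, -0.2172),
    "nCET": (-0.6721, -0.1915, -0.0134),
    "Enhancing": (0.0470, 0.0689, 0.1391),
}

def important(table, function, threshold=0.5):
    """Variables whose loading on the given function is at least the threshold
    in absolute value, ordered from largest to smallest magnitude."""
    picked = [
        (name, values[function - 1])
        for name, values in table.items()
        if abs(values[function - 1]) >= threshold
    ]
    return sorted(picked, key=lambda item: -abs(item[1]))

for k in (1, 2, 3):
    print(k, important(loadings, k))
```

Lowering `threshold` to 0.3 reproduces the wider lists of closely related variables quoted in the text.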
Table 4.20: Canonical Loadings for the Copy Number Variation Variables
Variables 1 2 3
Akt1 0.1107 0.5473 0.0647
Akt2 0.1580 0.1002 0.1321
Akt3 0.2259 -0.4004 -0.5277
ccnd2 -0.0760 0.2148 0.1390
cdk4 0.6696 0.5968 0.0133
cdk6 0.0584 0.0010 0.1143
cdk2na -0.3552 0.2199 -0.5610
cdkn2c -0.2230 0.0574 -0.3996
egfr 0.1550 -0.0473 0.1596
erbb2 0.1655 -0.3476 -0.0178
foxo1 0.2746 -0.6825 0.3215
foxo3 0.2231 0.1626 -0.2092
hras -0.0606 -0.0688 0.0115
kras -0.0525 0.2893 0.5230
mdm2 0.1230 0.1033 -0.3419
mdm4 0.0629 0.0719 -0.0418
met -0.0866 0.0721 0.0202
nf1 0.1233 -0.3074 0.0529
nras 0.0614 -0.1925 -0.8356
pdgfra 0.3336 -0.0345 0.0154
pik3ca 0.0684 -0.2123 -0.1349
pik3r1 0.0202 -0.1385 -0.0263
pten -0.1482 0.6384 -0.2301
rb1 0.0594 -0.5262 0.3453
spry2 -0.7745 -0.1333 0.1369
tp53 -0.6364 -0.1668 0.1664
Table 4.21: Summary of Important Related Variables
1                         2                         3
Image features   Loading  Image features   Loading  Image features   Loading
nCET            -0.6721   Minor Axis      -0.6514   Major Axis        0.5615
Major axis       0.5059   Necrosis         0.5559   Necrosis         -0.5072
CNV                       CNV                       CNV
spry2           -0.7745   foxo1           -0.6825   nras             -0.8356
cdk4             0.6696   pten             0.6384   cdk2na           -0.5610
tp53            -0.6364   cdk4             0.5968   Akt3             -0.5277
                          Akt1             0.5473   kras              0.5230
                          rb1             -0.5262
Since the two methods of interpretation, using the standardized coefficients and the
canonical loadings, led to similar conclusions, we are more confident in our findings and
move on to model validation in the next section.
4.3.6 Cross Validation
In this section, we validate our model. Of the various approaches to model validation, we
use sample splitting. The entire sample (267 patients) is divided into two sub-samples and
the canonical correlation analysis is conducted separately on each sub-sample. We then
compare the results obtained from the two analyses.
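The sample-splitting procedure can be sketched as follows. The data are simulated, the random assignment of patients to the two sub-samples is an assumption (the text does not state how the split was made), and the canonical correlations are computed with a QR-based method chosen here for brevity rather than taken from the analysis itself.

```python
import numpy as np

def canonical_correlations(x, y):
    """Canonical correlations as singular values of Qx' Qy (QR-based method)."""
    xc, yc = x - x.mean(axis=0), y - y.mean(axis=0)
    qx, _ = np.linalg.qr(xc)
    qy, _ = np.linalg.qr(yc)
    return np.linalg.svd(qx.T @ qy, compute_uv=False)

rng = np.random.default_rng(3)
n = 267
shared = rng.normal(size=(n, 2))              # two simulated common factors
features = rng.normal(size=(n, 6))
features[:, :2] += shared
genes = rng.normal(size=(n, 26))
genes[:, :2] += shared

# Randomly assign patients to sub-samples of 134 and 133.
order = rng.permutation(n)
idx_a, idx_b = order[:134], order[134:]

rho_a = canonical_correlations(features[idx_a], genes[idx_a])
rho_b = canonical_correlations(features[idx_b], genes[idx_b])
print(np.round(rho_a[:2], 4), np.round(rho_b[:2], 4))
```

Agreement between the leading correlations (and between the loadings, up to sign) in the two sub-samples is the evidence of stability sought in the following subsections.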
4.3.7 CCA on Sub-Sample A
The first sub-sample contains 134 patients. Of the six canonical functions, only two were
significant according to the F-tests and Wilks’ lambda (see Table 4.22). Hence we present
the canonical loadings of each variable set for the significant functions only. Tables 4.23
and 4.24 show the contribution of each variable to each of these canonical functions. The
significant canonical correlation coefficients for this sub-sample were 0.6601 and 0.6372.
Table 4.22: Test of Significance of each Canonical Correlation
Test of Canonical Correlation 1-6
Statistic df1 df2 F Prob>F
Wilks’ Lambda 0.118385 156 606.447 1.7054 0.000
Test of Canonical Correlation 2-6
Statistic df1 df2 F Prob>F
Wilks’ Lambda 0.226496 125 511.824 1.4423 0.0033
Test of Canonical Correlation 3-6
Statistic df1 df2 F Prob>F
Wilks’ Lambda 0.383356 96 414.513 1.1824 0.1366
Test of Canonical Correlation 4-6
Statistic df1 df2 F Prob>F
Wilks’ Lambda 0.564471 69 314.54 0.9617 0.5660
Test of Canonical Correlation 5-6
Statistic df1 df2 F Prob>F
Wilks’ Lambda 0.735015 44 212 0.8018 0.8069
Test of Canonical Correlation 6
Statistic df1 df2 F Prob>F
Wilks’ Lambda 0.902262 21 107 0.5519 0.9408
Major axis and nCET were the most important variables in the first function since their
loadings were at least 0.5 in absolute value, while minor axis and necrosis were the most
important variables in the second function. In Table 4.24, we observe that spry2, tp53 and
cdk4 were the most important variables in the first function. Akt1, cdk4, pten, rb1 and
foxo1 were the most important variables in the second canonical function.
Table 4.23: Canonical Loadings for the Neuro-image features
1 2
Major Axis 0.5091 -0.3300
Minor Axis 0.2534 -0.6565
Necrosis 0.3137 0.5473
Edema 0.2633 -0.1854
nCET -0.7192 -0.1742
Enhancing 0.0638 0.0534
Table 4.24: Canonical Loadings for the Copy Number Variation Variables
Variables 1 2
Akt1 0.0866 0.5843
Akt2 0.1538 0.1053
Akt3 0.2341 -0.2273
ccnd2 -0.1114 0.2097
cdk4 0.5836 0.5938
cdk6 0.0724 0.0135
cdk2na -0.2017 0.0954
cdkn2c -0.2398 0.0821
egfr 0.1729 -0.0423
erbb2 0.1869 -0.2338
foxo1 0.1791 -0.5588
foxo3 0.2398 0.1746
hras -0.0432 -0.0606
kras -0.0989 0.1855
mdm2 0.1222 0.0973
mdm4 0.1207 0.0428
met -0.1068 0.0827
nf1 0.1457 -0.2965
nras 0.0809 -0.2096
pdgfra 0.3263 -0.0160
pik3ca 0.0965 -0.2645
pik3r1 0.0369 -0.1067
pten -0.1087 0.5057
rb1 0.0566 -0.5162
spry2 -0.6601 -0.1130
tp53 -0.5210 -0.1153
4.3.8 CCA on Sub-Sample B
Sub-sample B contains 133 patients. Again, only two of the functions were significant
according to the F-tests and Wilks’ lambda (see Table 4.25). Therefore only the results
from the significant functions are presented and interpreted. Tables 4.26 and 4.27 show
the contribution of each variable to each of these canonical functions. The significant
canonical correlation coefficients for this analysis were 0.6543 and 0.6338.
Table 4.25: Test of Significance of each Canonical Correlation
Test of Canonical Correlation 1-6
Statistic df1 df2 F Prob>F
Wilks’ Lambda 0.128719 156 600.581 1.6104 0.0000
Test of Canonical Correlation 2-6
Statistic df1 df2 F Prob>F
Wilks’ Lambda 0.225069 125 506.903 1.4354 0.0037
Test of Canonical Correlation 3-6
Statistic df1 df2 F Prob>F
Wilks’ Lambda 0.376206 96 410.551 1.1970 0.1202
Test of Canonical Correlation 4-6
Statistic df1 df2 F Prob>F
Wilks’ Lambda 0.539844 69 311.552 1.0348 0.4120
Test of Canonical Correlation 5-6
Statistic df1 df2 F Prob>F
Wilks’ Lambda 0.721821 44 210 0.8449 0.7430
Test of Canonical Correlation 6
Statistic df1 df2 F Prob>F
Wilks’ Lambda 0.880579 21 106 0.6845 0.8398
Major axis and nCET were the most important variables in the first function since their
loadings were at least 0.5 in absolute value, while minor axis and necrosis were the most
important variables in the second function. Table 4.27 shows that spry2, tp53 and cdk4
were the most important variables in the first function. Akt1, cdk4, pten, rb1 and foxo1
were the most important variables in the second canonical function. The signs of the
second-function loadings are reversed relative to sub-sample A; because a pair of canonical
variates is determined only up to a common change of sign, this reversal does not affect
the interpretation.
Table 4.26: Canonical Loadings for the Neuro-image features
1 2
Major Axis 0.5889 0.3072
Minor Axis 0.2899 0.6673
Necrosis 0.4356 -0.5342
Edema 0.1722 0.1771
nCET -0.6209 0.1970
Enhancing 0.0388 -0.0818
Table 4.27: Canonical Loadings for the Copy Number Variation Variables
Variables 1 2
Akt1 0.1421 -0.6099
Akt2 0.1683 -0.0947
Akt3 0.2054 0.1766
ccnd2 -0.0259 -0.2243
cdk4 0.5612 -0.5921
cdk6 0.0490 0.0116
cdk2na -0.1022 -0.1496
cdkn2c -0.2065 -0.0216
egfr 0.1347 0.0505
erbb2 0.1261 0.3605
foxo1 0.0707 0.5014
foxo3 0.1995 -0.1374
hras -0.0823 0.0706
kras 0.0111 -0.1975
mdm2 0.1303 -0.1008
mdm4 -0.0023 -0.0977
met -0.0541 0.0548
nf1 0.0878 0.3180
nras 0.0196 0.1945
pdgfra 0.3428 0.0677
pik3ca 0.0244 0.1641
pik3r1 -0.0001 0.1765
pten -0.1991 -0.5676
rb1 0.0668 0.5293
spry2 -0.6229 0.1449
tp53 -0.5531 0.2149
4.4 Summary
The study investigated a model that links six neuroimage features with the copy number
variations of 26 genes in Glioblastoma patients.
Wilks’ lambda and F-tests were employed to evaluate the null hypothesis that the canonical
correlation coefficients for all the canonical functions are zero. In our model, only the
first three canonical correlation coefficients are statistically significant, that is, have
a p-value less than 0.05. The other three functions were not significant and hence were not
interpreted.
For the three significant canonical variate pairs, the strength of the relationship is
given by the canonical correlation coefficient. The first pair of canonical variates (first
canonical function) had a coefficient of 0.6704, the second pair 0.6347 and the third pair
0.5552. Squaring a canonical correlation coefficient gives the proportion of variance
shared between the two optimally weighted variates.
The redundancy index measures the proportion of variance of the M-set of variables that is
predicted from the linear combination of the N-set of variables. The redundancy index can
only equal 1 if the squared canonical correlation coefficient (eigenvalue) is 1 and the
canonical variate accounts for all the variance of every variable in the set. In the first
function, the redundancy index was 0.2012 for the M-variables and 0.2101 for the
N-variables. In the second function, it was 0.1876 for the M-variables and 0.1501 for the
N-variables. In the third function, it was 0.1001 for the M-variables and 0.1019 for the
N-variables.
The canonical loadings and standardized canonical coefficients were employed to evaluate
the importance of the variables in each function. A threshold of 0.5 in absolute value was
used to select the important variables in each function. The standardized canonical
coefficients showed that, for the first function, major axis, nCET, spry2, tp53 and cdk4
were the most important variables. Minor axis, necrosis, foxo1, rb1, pten, cdk4 and akt1
were the most important variables in the second function. For the third function, major
axis, edema, minor axis, necrosis, nras, cdk2na, kras and akt3 were the most important
variables.
Using the canonical loadings, we obtained that for the first function, the most important
variables were nCET, major axis, spry2, cdk4 and tp53. The important contributing vari-
ables in the second function were minor axis, necrosis, foxo1, pten, cdk4, akt1 and rb1.
For the third function, major axis, necrosis, nras, cdk2na, akt3 and kras were the most
important variables.
We performed cross-validation to check whether the results were influenced by the sample
size. The sample of 267 patients was divided into two sub-samples and the canonical
correlation analysis was performed on each. Results from both sub-samples indicated that
only two functions were significant and hence should be interpreted. For sub-sample A, the
first canonical variate pair had a canonical correlation coefficient of 0.6601 and the
second pair 0.6372. In the first function, nCET, major axis, spry2, cdk4 and tp53 were the
most important variables. In the second function, minor axis, necrosis, akt1, cdk4, foxo1,
pten and rb1 were the most important variables. For sub-sample B, the canonical correlation
coefficients were 0.6543 and 0.6338. The same variables as in sub-sample A were found to
be important in sub-sample B.
Chapter 5
Conclusion
Canonical correlation analysis is a very powerful and important technique for investigating
the relationship between multiple independent and dependent variables. Although the tech-
nique is fundamentally descriptive, it can also be employed for predictive purposes. This
thesis provided a review of canonical correlation analysis and applied it in exploring the
relationship between the copy number variations and neuro-image features of Glioblastoma
patients.
Canonical correlation coefficients are unchanged under a non-singular transformation of
the variables, and the canonical correlation coefficients computed from the correlation
matrix and from the covariance matrix are the same. In particular, standardizing the
original variables has no effect on the canonical correlations.
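This invariance can be checked numerically. The sketch below, using simulated data and a QR-based computation of the canonical correlations (an implementation choice made here, not taken from the thesis), compares the correlations before and after a non-singular linear transformation of one set and after standardizing both sets.

```python
import numpy as np

def canonical_correlations(x, y):
    """Canonical correlations as singular values of Qx' Qy (QR-based method)."""
    xc, yc = x - x.mean(axis=0), y - y.mean(axis=0)
    qx, _ = np.linalg.qr(xc)
    qy, _ = np.linalg.qr(yc)
    return np.linalg.svd(qx.T @ qy, compute_uv=False)

rng = np.random.default_rng(4)
x = rng.normal(size=(150, 3))
y = rng.normal(size=(150, 4))
y[:, 0] += x[:, 1]                       # induce some association

# A fixed non-singular matrix (determinant 5) and an arbitrary shift.
t = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0]])
rho = canonical_correlations(x, y)
rho_transformed = canonical_correlations(x @ t + 5.0, y)

# Standardizing corresponds to working with the correlation matrix.
zx = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
zy = (y - y.mean(axis=0)) / y.std(axis=0, ddof=1)
rho_standardized = canonical_correlations(zx, zy)
print(np.round(rho, 4))
```

All three sets of canonical correlations coincide up to floating-point error, since each transformation leaves the column space of the centred data unchanged.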
We obtained from the data that the mean survival time for Glioblastoma patients is 15
months and the mean age at diagnosis is 55 years.
The two sets of variables were related in three ways. We obtained three pairs of
significant canonical variates with correlations of 0.6704, 0.6347 and 0.5552 respectively,
which were used to identify genes and features related to Glioblastoma. The important
genes and features forming these relationships are as follows. The major axis of the
tumor, the non-contrast enhancing tumor, sprouty RTK signaling antagonist 2, tumor protein
p53 and cyclin dependent kinase 4 are strongly related. The minor axis of the tumor, the
proportion of necrosis, forkhead box O1, phosphatase and tensin homolog, RB transcriptional
corepressor 1, AKT serine/threonine kinase 1 and cyclin dependent kinase 4 are also
strongly related. Finally, the major axis, the proportion of necrosis, neuroblastoma RAS
viral oncogene homolog, cyclin dependent kinase inhibitor 2A, AKT serine/threonine kinase 3
and KRAS proto-oncogene, GTPase are strongly related.
References
[1] Bartek, J., Ng, K., Fischer, W., Carter, B., and Chen, C. C. (2012). Key concepts in
glioblastoma therapy. Journal of Neurology, Neurosurgery & Psychiatry, 83(7):753–
760.
[2] Cliff, N. and Krus, D. J. (1976). Interpretation of canonical analysis: Rotated vs.
unrotated solutions. Psychometrika, 41(1):35–42.
[3] CNV (Accessed March 2016). Copy number variants. DNA Learning Center,
http://www.dnalc.org/view/552-Copy-Number-Variants.html.
[4] Davies, E. B. (2007). Approximate diagonalization. SIAM Journal on Matrix Analysis
and Applications, 29(4):1051–1064.
[5] de Koning, A. J., Gu, W., Castoe, T. A., Batzer, M. A., and Pollock, D. D. (2011).
Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet,
7(12):e1002384.
[6] Denman, E. D. (1981). Roots of real matrices. Linear Algebra and its Applications,
36:133–139.
[7] Denman, E. D. and Beavers, A. N. (1976). The matrix sign function and computations
in systems. Applied mathematics and Computation, 2(1):63–94.
[8] Duerr, E.-M., Rollbrocker, B., Hayashi, Y., Peters, N., Meyer-Puttlitz, B., Louis, D. N.,
Schramm, J., Wiestler, O. D., Parsons, R., Eng, C., et al. (1998). PTEN mutations in
gliomas and glioneuronal tumors. Oncogene, 16(17).
[9] Ganigi, P., Santosh, V., Anandh, B., Chandramouli, B., and Sastry Kolluri, V. (2005).
Expression of p53, EGFR, pRb and bcl-2 proteins in pediatric glioblastoma multiforme:
a study of 54 patients. Pediatric neurosurgery, 41(6):292–299.
[10] Genetic Variability (Accessed May 2016). Copy Number Variations. Pathway detail
- flipper e nuvola http://flipper.diff.org/app/pathways/3685.
[11] Gevaert, O., Mitchell, L. A., Achrol, A. S., Xu, J., Echegaray, S., Steinberg, G. K.,
Cheshier, S. H., Napel, S., Zaharchuk, G., and Plevritis, S. K. (2014). Glioblastoma
multiforme: exploratory radiogenomic analysis by using quantitative image features.
Radiology, 273(1):168–174.
[12] Giunti, L., Pantaleo, M., Sardi, I., Provenzano, A., Magi, A., Cardellicchio, S., Cas-
tiglione, F., Tattini, L., Novara, F., Buccoliero, A. M., et al. (2014). Genome-wide copy
number analysis in pediatric glioblastoma multiforme. Am J Cancer Res, 4:293–303.
[13] Gutman, D. A., Cooper, L. A., Hwang, S. N., Holder, C. A., Gao, J., Aurora, T. D.,
Dunn Jr, W. D., Scarpace, L., Mikkelsen, T., Jain, R., et al. (2013). MR imaging predictors
of molecular profile and survival: multi-institutional study of the TCGA glioblastoma
data set. Radiology, 267(2):560–569.
[14] Hair, J. F., Black, W. C., Babin, B. J., Anderson, R. E., Tatham, R. L., et al. (2006a).
Canonical Correlation Analysis: A Supplement to Multivariate Data Analysis, volume 6.
Pearson Prentice Hall, Upper Saddle River, NJ.
[15] Hair, J. F., Black, W. C., Babin, B. J., Anderson, R. E., Tatham, R. L., et al. (2006b).
Multivariate data analysis, volume 6. Pearson Prentice Hall, Upper Saddle River, NJ.
[16] Hammoud, M. A., Sawaya, R., Shi, W., Thall, P. F., and Leeds, N. E. (1996). Prognostic
significance of preoperative MRI scans in glioblastoma multiforme. Journal of
neuro-oncology, 27(1):65–73.
[17] Higham, N. J. (1987). Computing real square roots of a real matrix. Linear Algebra
and its applications, 88:405–430.
[18] Hoskins, W. and Walton, D. (1978). A faster method of computing the square root of
a matrix. Automatic Control, IEEE Transactions on, 23(3):494–495.
[19] Hotelling, H. (1936). Relations between two sets of variates. Biometrika,
28(3/4):321–377.
[20] Johnson, R. A., Wichern, D. W., et al. (2002). Applied multivariate statistical analysis,
volume 5. Prentice Hall, Upper Saddle River, NJ.
[21] Lacroix, M., Abi-Said, D., Fourney, D. R., Gokaslan, Z. L., Shi, W., DeMonte, F.,
Lang, F. F., McCutcheon, I. E., Hassenbusch, S. J., Holland, E., et al. (2001). A multivariate
analysis of 416 patients with glioblastoma multiforme: prognosis, extent of
resection, and survival. Journal of neurosurgery, 95(2):190–198.
[22] Lin, D., Calhoun, V. D., and Wang, Y.-P. (2014). Correspondence between fMRI
and SNP data by group sparse canonical correlation analysis. Medical image analysis,
18(6):891–902.
[23] McCarroll, S. A. and Altshuler, D. M. (2007). Copy-number variation and association
studies of human disease. Nature genetics, 39:S37–S42.
[24] Multivariate Analysis (Accessed March 2016). Multivariate Analysis. Philender,
http://www.philender.com/courses/multivariate/notes2/can1.html.
[25] Noushmehr, H., Weisenberger, D. J., Diefes, K., Phillips, H. S., Pujara, K., Berman,
B. P., Pan, F., Pelloski, C. E., Sulman, E. P., Bhat, K. P., et al. (2010). Identification of
a CpG island methylator phenotype that defines a distinct subgroup of glioma. Cancer
cell, 17(5):510–522.
[26] Ohgaki, H., Dessen, P., Jourde, B., Horstmann, S., Nishikawa, T., Di Patre, P.-L.,
Burkhard, C., Schüler, D., Probst-Hensch, N. M., Maiorka, P. C., et al. (2004). Genetic
Pathways to Glioblastoma: A Population-Based Study. Cancer research, 64(19):6892–6899.
[27] Pierallini, A., Bonamini, M., Pantano, P., Palmeggiani, F., Raguso, M., Osti, M.,
Anaveri, G., and Bozzao, L. (1998). Radiological assessment of necrosis in glioblastoma:
variability and prognostic value. Neuroradiology, 40(3):150–153.
[28] Pollack, I. F., Boyett, J. M., Yates, A. J., Burger, P. C., Gilles, F. H., Davis, R. L.,
Finlay, J. L., Children's Cancer Group, et al. (2003). The influence of central review on outcome
associations in childhood malignant gliomas: results from the CCG-945 experience.
Neuro-oncology, 5(3):197–207.
[29] Pollack, I. F., Finkelstein, S. D., Woods, J., Burnham, J., Holmes, E. J., Hamilton,
R. L., Yates, A. J., Boyett, J. M., Finlay, J. L., and Sposto, R. (2002). Expression of p53
and prognosis in children with malignant gliomas. New England Journal of Medicine,
346(6):420–427.
[30] Pollack, I. F., Hamilton, R. L., James, C. D., Finkelstein, S. D., Burnham, J., Yates,
A. J., Holmes, E. J., Zhou, T., and Finlay, J. L. (2006). Rarity of PTEN deletions and
EGFR amplification in malignant gliomas of childhood: results from the Children’s
Cancer Group 945 cohort. Journal of Neurosurgery: Pediatrics, 105(5):418–424.
[31] Pope, W. B., Sayre, J., Perlina, A., Villablanca, J. P., Mischel, P. S., and Cloughesy,
T. F. (2005). MR imaging correlates of survival in patients with high-grade gliomas.
American Journal of Neuroradiology, 26(10):2466–2474.
[32] Qu, H.-Q., Jacob, K., Fatet, S., Ge, B., Barnett, D., Delattre, O., Faury, D., Montpetit,
A., Solomon, L., Hauser, P., et al. (2010). Genome-wide profiling using single-nucleotide
polymorphism arrays identifies novel chromosomal imbalances in pediatric
glioblastomas. Neuro-oncology, 12(2):153–163.
[33] Reifenberger, G. and Collins, V. P. (2004). Pathology and molecular genetics of
astrocytic gliomas. Journal of molecular medicine, 82(10):656–670.
[34] Sharp, A. J., Locke, D. P., McGrath, S. D., Cheng, Z., Bailey, J. A., Vallente, R. U.,
Pertz, L. M., Clark, R. A., Schwartz, S., Segraves, R., et al. (2005). Segmental duplications
and copy-number variation in the human genome. The American Journal of
Human Genetics, 77(1):78–88.
[35] Siegel, R., Naishadham, D., and Jemal, A. (2012). Cancer statistics, 2012. CA: a
cancer journal for clinicians, 62(1):10–29.
[36] Taniguchi, Y., Choi, P. J., Li, G.-W., Chen, H., Babu, M., Hearn, J., Emili, A., and
Xie, X. S. (2010). Quantifying E. coli proteome and transcriptome with single-molecule
sensitivity in single cells. Science, 329(5991):533–538.
[37] Velazquez, E. R., Meier, R., Dunn Jr, W. D., Alexander, B., Wiest, R., Bauer, S.,
Gutman, D. A., Reyes, M., and Aerts, H. J. (2015). Fully automatic GBM segmentation
in the TCGA-GBM dataset: Prognosis and correlation with VASARI features. Scientific
reports, 5.
[38] Xiong, M., Dong, H., Siu, H., Peng, G., Wang, Y., and Jin, L. (2010). Genome-Wide
Association Studies of Copy Number Variation in Glioblastoma. In Bioinformatics and
Biomedical Engineering (iCBBE), 2010 4th International Conference on, pages 1–4.
IEEE.
[39] Zarrei, M., MacDonald, J. R., Merico, D., and Scherer, S. W. (2015). A copy number
variation map of the human genome. Nature Reviews Genetics, 16(3):172–183.