Canonical Correlation Analysis to relate a Genomic Dataset with a Neuroimage Dataset.

Augustine Annan (10551764)

THIS THESIS IS SUBMITTED TO THE UNIVERSITY OF GHANA, LEGON IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE AWARD OF MPHIL MATHEMATICS DEGREE

July, 2016

University of Ghana http://ugspace.ug.edu.gh

DECLARATION

This thesis was written in the Department of Mathematics, University of Ghana, Legon from September 2015 to July 2016 in partial fulfillment of the requirements for the award of Master of Philosophy degree in Mathematics under the supervision of Dr. Margaret McIntyre, Dr. Douglas Adu-Gyamfi, and Dr. Eyram Schwinger of the University of Ghana.

I hereby declare that except where due acknowledgement is made, this work has never been presented wholly or in part for the award of a degree at the University of Ghana or any other University.

Signature: ...................................................
Student: Augustine Annan

Signature: ...................................................
Dr. Margaret McIntyre

Signature: ...................................................
Dr. Douglas Adu-Gyamfi

DEDICATION

I dedicate my research project to my family. A special feeling of gratitude goes to my loving mother, Agnes Esuon, whose words of encouragement and push for tenacity ring in my ears. My brothers Stephen and Humphrey, my sister Faustina and my friend Ansbertha have never left my side and are very special.

ACKNOWLEDGEMENTS

My warmest appreciation goes to my supervisors, Dr. Margaret McIntyre and Dr. Alessandro Crimi, for the patience, motivation, immense knowledge and continuous support and guidance they offered me throughout this project. To my other supervisors, Dr. Douglas Adu-Gyamfi and Dr. Eyram Schwinger, I also show great appreciation for taking so much time to assist me in this work with so much patience.
I want to appreciate the African Institute for Mathematical Sciences (AIMS-Ghana) for supporting this research financially. To the Head of Department, Dr. Margaret McIntyre, and all the lecturers, I say a big thank you for giving me such a great opportunity to step up my goals in academia.

To my mother and siblings, I am grateful for your unconditional love, support and encouragement. My sincere, heartfelt gratitude goes to all my colleagues for all their encouragement and fun moments. To God be the glory.

ABSTRACT

This thesis investigates the relationship between copy number variations and neuro-image features of Glioblastoma patients. Canonical correlation analysis was employed to elicit these relationships. This thesis highlights some of the concepts of the technique which enabled us to obtain our main results. We found three pairs of significant canonical variates with correlations of 0.6704, 0.6347 and 0.5552 respectively, which were used to identify genes and neuro-image features related to Glioblastoma.

Contents

Declaration
Dedication
Acknowledgements
Abstract
1 Introduction
1.1 Organisation of the Study
2 Definitions
2.1 Definitions of statistical and mathematical terms
3 Methodology
3.1 Canonical Correlation Analysis (CCA)
3.1.1 Canonical Correlation
3.1.2 Mathematical Formulation
3.1.3 Formulation and Derivation of the Canonical Variables
3.1.5 Properties of the Canonical Variable Pairs
3.1.6 Canonical correlation coefficient under the non-singular transformation
3.1.7 Correlation Coefficient Between Canonical Variables and the Original Variables
3.1.8 Computation of Canonical Correlation Coefficient Using Standardized Variables
3.1.9 Assessing Overall Model Fit and Canonical Dimension Reduction
3.2 Example: Computation of Canonical variables and Canonical Coefficients
4 Results
4.1 Data
4.1.1 Patient Features
4.2 Preliminaries
4.3 Main Results
4.3.1 Correlation matrix of variables
4.3.2 Assessment of Overall Model Fit
4.3.3 Interpreting Canonical Variate Pairs
4.3.4 Interpretation of Canonical Variate Using Canonical Weights
4.3.5 Interpretation of Canonical Variate Using Canonical Loadings
4.3.6 Cross Validation
4.3.7 CCA on Sub-Sample A
4.3.8 CCA on Sub-Sample B
4.4 Summary
5 Conclusion
References

List of Tables

4.1 Description of Neuro-image Features Used
4.2 Copy Number Variation Variables (Genes)
4.3 Sex and Survival Status Distribution of Patients
4.4 Age and Overall Survival Time of Patients
4.5 Frequency Distribution of Expression Subtype
4.6 Correlations for Variable Set 1
4.7 Correlations for the Copy Number Variation Variables
4.8 Correlations for the Copy Number Variation Variables
4.9 Correlations between Variable Set 1 and Variable Set 2
4.10 Raw Coefficients for the Neuro-image features
4.11 Raw Coefficients for the Copy Number Variation Variables
4.12 Test of Significance of all Canonical Correlations
4.13 Test of Significance of each Canonical Correlation
4.14 Canonical Correlations and Eigenvalues
4.15 Canonical redundancy analysis for Canonical Correlations
4.16 Standardized Coefficients for the Neuro-image features
4.17 Standardized Coefficients for the Copy Number Variation Variables
4.18 Summary of Important Related Variables
4.19 Canonical Loadings for the Neuro-image features
4.20 Canonical Loadings for the Copy Number Variation Variables
4.21 Summary of Important Related Variables
4.22 Test of Significance of each Canonical Correlation
4.23 Canonical Loadings for the Neuro-image features
4.24 Canonical Loadings for the Copy Number Variation Variables
4.25 Test of Significance of each Canonical Correlation
4.26 Canonical Loadings for the Neuro-image features
4.27 Canonical Loadings for the Copy Number Variation Variables

List of Figures

1.1 The gene amplification has created a copy number variation. The chromosome now has two copies of this section of DNA, rather than one [34].
1.2 Magnetic Resonance Imaging (MRI) images of patients with GBM [37, 13]
1.3 Fully automated Segmentation and VASARI Feature Extraction: necrotic core/contrast enhancing tumor (right) and edema (left) [37]

Chapter 1 Introduction

Many complex diseases result from the interplay of genetics and neuroimage features. Understanding the biological mechanisms underlying such datasets is therefore very important. With the development of a wide range of genome-wide assays, it is now possible to obtain, for a particular subject, multiple measures of genomic markers from various platforms, such as single nucleotide polymorphisms, gene expression and copy number variation. These measurements relay information about variations of the genome. Putting together two or more types of data not only helps in the diagnosis of diseases but also enhances comprehension of the biological mechanisms, and could consequently improve treatment strategies. There is thus a high demand for integrative approaches for large-scale genomic data analysis, and investigating the associations between such entities is of great use.

Glioma is the most common type of primary brain tumor; it arises from glial cells. It is considered responsible for approximately 13000 deaths in the United States and more than 14000 in Europe each year [35]. Gliomas are heterogeneous and can be classified according to their grade: low-grade glioma, anaplastic glioma, and glioblastoma. The most common type of glioma in adults is glioblastoma (GBM).
It is generally diagnosed at an average age of 55 years, and gives the affected patient an average survival time of only 10 to 18 months. Lower grade glioma can occur at younger ages [35].

The underlying tumor pathology and biological function can be identified by imaging and genetic biomarkers. In the context of clinical routine, if imaging phenotypes of GBM from magnetic resonance imaging (MRI) can be easily associated with specific gene expression signatures, they will serve as a non-invasive alternative to biopsy, providing important information for diagnosis, prognosis and personalized treatment. This thesis therefore seeks to investigate the correspondence between genetic data, in particular the copy number variations, and the imaging phenotypes of GBM.

One of the most important means of establishing the relationships between two or more entities or objects is to take measurements of the pertinent relationships. A measure of a relationship depicts the strength of the association between the objects. We use the term correlation to mean any of a broad class of statistical relationships depicting dependence. The degree of correlation can be measured by correlation coefficients, denoted by $\rho$ or $r$. The most widely used coefficient is the one developed by Karl Pearson, the Pearson correlation coefficient.

The core of the project is to present the idea of canonical correlation analysis and to use it to investigate the relationship between the copy number variations and neuroimage features. The main highlights of the technique that help to elicit the relationship between the datasets will be discussed. In the next two paragraphs we introduce copy number variations and the neuroimage features of tumors.

Copy number variation (CNV) can be defined as an alteration of the deoxyribonucleic acid (DNA) of a genome that leaves the cell with abnormal repetitions or deletions of one or more sections of the DNA [10]. The number of repetitions of such sections differs between individuals in the human population [23]. It is a kind of structural variation, precisely a kind of duplication event that affects a large number of base pairs [34]. Human beings differ in the number of copies of each gene, and this leads to the idea of copy number variants. Recent research has shown that about two thirds of the entire human genome consists of repeats [36] and that about 4.75% to 9.46% of the entire genome can be described as copy number variations [39]. CNVs play a very notable role in producing the necessary variation in the population and also in disease phenotype [23].

Figure 1.1: The gene amplification has created a copy number variation. The chromosome now has two copies of this section of DNA, rather than one [34].

Humans have two copies of most genes, one from the mother's chromosome and the other from the father's chromosome. Some alterations in the chromosome may cause either a loss or a gain of one copy. Duplications and deletions of more than 1000 nucleotides are referred to as copy number variants [3]. Copy number variation is considered to be a very notable risk factor for cancer and constitutes a wide spectrum of the total genomic variation [38]. Recurrent copy number variations have been identified in various chromosome regions. Also, because cancer is an acquired disease in whose occurrence inherited factors nevertheless play a major role, early constitutional copy number alterations have been compared with the copy number variations present in tumor biopsy [12].

GBM is an aggressive tumor with poor prognosis.
Despite the introduction of new strategies to treat the disease, the median survival is less than one year [12]. In recent studies, important features have been identified. Pediatric primary GBM differs from adult GBM in both genetic profile and mean cumulative survival [29, 28, 9, 30]. Pediatric and adult GBMs follow different pathways of tumorigenesis [30]. In 35% to 50% of cases, primary adult GBM forms present amplification of the epidermal growth factor receptor (EGFR) gene and inactivation of the phosphatase and tensin homolog (PTEN) gene [26, 8]. However, secondary adult GBMs, which may evolve from low-grade lesions, normally show no alterations of the PTEN gene and no EGFR duplications, but most often carry TP53 mutations [33]. Studies have shown that there are differences in CNV between adult GBMs and childhood GBMs. In pediatric GBMs, heterozygous deletions are more common, while duplications are more frequent in adult GBMs [32].

Analysis of imaging features has revealed interesting relationships between the imaging features and the survival of patients. For patients with malignant gliomas, some tumor imaging features and clinical data such as age, perioperative Karnofsky performance status and tumor resection have been established to correlate with survival [31]. The image features include necrosis and edema. According to Pope et al. [31], edema, noncontrast-enhancing tumor (nCET) and multifocality were the significant features related to survival, and these features could be classified as prognostic indicators. There have been several studies on the relationship between imaging features and survival, and there are reports that the level of edema and the degree of necrosis are negatively correlated with survival [27, 21, 16].

Figure 1.2: Magnetic Resonance Imaging (MRI) images of patients with GBM [37, 13]

The importance of imaging makes accurate and informative quantitative measures necessary. The Visually AcceSAble Rembrandt Images (VASARI) feature set provides standards by which a numeric score can be assigned to a feature, enabling the degree of each tumor feature to be described. It is a standard set of 30 imaging features describing the size, location and appearance of the tumor in the MRI image set. The image presents the global view of the tumor. A small tumor in the frontal lobe has a vastly different outcome to a small tumor adjacent to the motor area, for instance in the eloquent cortex [13]. For more accurate results, the Columbia University Medical Center [37] designed a fully automated computer algorithm to score glioma tumors based on the available feature set.

Figure 1.3: Fully automated Segmentation and VASARI Feature Extraction: necrotic core/contrast enhancing tumor (right) and edema (left) [37]

Image features have also been used for exploratory radiogenomic analysis [11]. Gevaert et al. obtained quantitative image features from MR images that characterize the radiographic phenotype of GBM lesions. They also constructed radiogenomic maps relating the features to particular molecular data [11]. Even after clinical variables are taken into account, imaging features provide notable prognostic information. Currently, qualitative work suggests an association between imaging phenotypes and genotypes [13].

Dongdong Lin et al. (2013) [22] investigated the correspondence between single nucleotide polymorphisms (SNPs) and brain activity measured by functional magnetic resonance imaging (fMRI) to understand how genetic variation influences brain activity. They developed a group sparse canonical correlation analysis method to explore the relationship between these two datasets.
They found two pairs of significant canonical variates with average correlations of 0.4527 and 0.4292 respectively, which were used to identify genes and voxels associated with schizophrenia.

1.1 Organisation of the Study

Chapter 2 presents brief definitions of some of the mathematical and statistical terms used in this work. The main technique employed to investigate the relationships is reviewed in Chapter 3. In Chapter 4, results from the analysis of the data are presented and discussed. Chapter 5 contains the conclusions and recommendations, together with a brief discussion of possible directions for future work.

Chapter 2 Definitions

Prior to the presentation and discussion of the existing technique and methodology, this chapter presents some definitions of concepts, terms and theorems to be used in the sequel.

2.1 Definitions of statistical and mathematical terms

Definition 2.1.1.
Suppose $A$ is a square matrix of size $m$. A non-zero $m \times 1$ vector $k$ is a right eigenvector of $A$, and $\lambda$ is the corresponding eigenvalue, if $Ak = \lambda k$. For the positive semi-definite matrices used in this thesis, $\lambda \ge 0$. Similarly, a left eigenvector is a non-zero $1 \times m$ vector $n$ satisfying $nA = \lambda n$.

Definition 2.1.2.
Given an $m \times m$ matrix $B$, a matrix $M$ for which $M^2 = B$ is called a square root of the matrix $B$.

Several studies have examined the computation of matrix square roots [17, 6, 7, 18, 4]. Here we find the square root of an $m \times m$ matrix by the diagonalization method [4]. An $m \times m$ matrix $B$ is diagonalizable if there exist a diagonal matrix $D$ and an invertible matrix $K$ such that $B = KDK^{-1}$. The diagonal matrix $D$ is made up of the eigenvalues of $B$, and the columns of $K$ are the $m$ corresponding eigenvectors of $B$. The square root of $B$ is then given by
\[ B^{1/2} = K \sqrt{D}\, K^{-1}, \]
where $\sqrt{D}$ is the diagonal matrix of the square roots of the eigenvalues.

Example 2.1.3.
Given the matrix
\[ B = \begin{pmatrix} 18 & 12 \\ 12 & 28 \end{pmatrix}, \]
we find $B^{1/2}$ as follows.
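The eigenvalues of $B$ are $10$ and $36$, with eigenvectors $(-3, 2)'$ and $(2, 3)'$ respectively, so $B$ has the eigendecomposition
\[ B = \begin{pmatrix} -3 & 2 \\ 2 & 3 \end{pmatrix} \begin{pmatrix} 10 & 0 \\ 0 & 36 \end{pmatrix} \begin{pmatrix} -3 & 2 \\ 2 & 3 \end{pmatrix}^{-1}, \]
which is of the form $B = KDK^{-1}$. By Definition 2.1.2 we need $M$ with $M^2 = B$, and $M = K\sqrt{D}K^{-1}$ is such a matrix:
\[ M = \begin{pmatrix} -3 & 2 \\ 2 & 3 \end{pmatrix} \begin{pmatrix} \sqrt{10} & 0 \\ 0 & \sqrt{36} \end{pmatrix} \begin{pmatrix} -3 & 2 \\ 2 & 3 \end{pmatrix}^{-1}, \qquad \sqrt{B} = \begin{pmatrix} 4.035 & 1.310 \\ 1.310 & 5.127 \end{pmatrix}. \]

Definition 2.1.4.
Let $X_1, \dots, X_p$ be a set of $n \times 1$ vectors. The $n \times 1$ vector $l_x$ is a linear combination of these vectors if $l_x = a_1 X_1 + \dots + a_p X_p$ for some real constants $a_1, \dots, a_p$, which are usually called loadings.

Singular Value Decomposition

Let $A$ be a $p \times q$ real matrix. Then it can be represented as
\[ A = UDV', \]
where $U$ is a $p \times p$ orthogonal matrix, $V$ is a $q \times q$ orthogonal matrix and $D$ is a $p \times q$ diagonal matrix with non-negative diagonal elements $\lambda_i$, $i = 1, \dots, \min(p, q)$. The first $\min(p, q)$ columns of $U$ and $V$ are the left and right singular vectors, respectively, and $\lambda_i$, $i = 1, \dots, \min(p, q)$, are the corresponding singular values. Note that the left singular vectors of $A$ are eigenvectors of $AA'$, while the right singular vectors are eigenvectors of $A'A$. The non-zero eigenvalues of $AA'$ and $A'A$ coincide, and they are equal to the squared non-zero singular values of $A$.

Lemma 2.1.5. (The Cauchy-Schwarz Inequality)
Let $H$ be a Hilbert space over $\mathbb{C}$. Then
\[ |\langle x, y \rangle|^2 \le \langle x, x \rangle \langle y, y \rangle, \quad \forall x, y \in H. \]

Proof. If $y = 0$, then $\langle x, 0 \rangle = 0$ and the inequality is true. Assume $y \neq 0$ and let
\[ a = -\frac{\langle x, y \rangle}{\langle y, y \rangle}. \]
Clearly $a$ is a complex number, since $\langle x, y \rangle$ is a complex number and $\langle y, y \rangle$ is a positive real number. Taking the inner product to be linear in its first argument and conjugate-linear in its second, we have
\[
\begin{aligned}
0 &\le \langle x + ay, x + ay \rangle \\
&= \langle x, x + ay \rangle + \langle ay, x + ay \rangle \\
&= \langle x, x \rangle + \langle x, ay \rangle + \langle ay, x \rangle + \langle ay, ay \rangle \\
&= \langle x, x \rangle + \bar{a} \langle x, y \rangle + a \langle y, x \rangle + a \langle y, ay \rangle \\
&= \langle x, x \rangle + \bar{a} \langle x, y \rangle + a \langle y, x \rangle + a\, \overline{\langle ay, y \rangle} \\
&= \langle x, x \rangle + \bar{a} \langle x, y \rangle + a \langle y, x \rangle + a \bar{a} \langle y, y \rangle \\
&= \langle x, x \rangle + \bar{a} \langle x, y \rangle + a \langle y, x \rangle + |a|^2 \langle y, y \rangle \\
&= \langle x, x \rangle - \frac{\overline{\langle x, y \rangle}}{\langle y, y \rangle} \langle x, y \rangle - \frac{\langle x, y \rangle}{\langle y, y \rangle} \overline{\langle x, y \rangle} + \left| -\frac{\langle x, y \rangle}{\langle y, y \rangle} \right|^2 \langle y, y \rangle \\
&= \langle x, x \rangle - \frac{2\, \overline{\langle x, y \rangle} \langle x, y \rangle}{\langle y, y \rangle} + \frac{|\langle x, y \rangle|^2}{\langle y, y \rangle} \\
&= \langle x, x \rangle - \frac{2 |\langle x, y \rangle|^2}{\langle y, y \rangle} + \frac{|\langle x, y \rangle|^2}{\langle y, y \rangle} \\
&= \langle x, x \rangle - \frac{|\langle x, y \rangle|^2}{\langle y, y \rangle}.
\end{aligned}
\]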
Hence,
\[
\begin{aligned}
0 &\le \langle x, x \rangle - \frac{|\langle x, y \rangle|^2}{\langle y, y \rangle}, \\
|\langle x, y \rangle|^2 &\le \langle x, x \rangle \langle y, y \rangle, \\
|\langle x, y \rangle| &\le \sqrt{\langle x, x \rangle} \sqrt{\langle y, y \rangle},
\end{aligned}
\]
so that $|\langle x, y \rangle|^2 \le \langle x, x \rangle \langle y, y \rangle$, as desired.

Definition of Statistical Terms

Definition 2.1.6.
Variance measures the spread, dispersion or compactness of a set of data. It is computed as the average of the squared deviations from the mean score of the data set.

Definition 2.1.7.
Covariance is a measure of the degree to which two variables change together. The covariance matrix is the matrix whose $(i, j)$th entry is the covariance of the $i$th and $j$th variables. All covariance matrices are symmetric and positive semi-definite.

The following definitions are adapted from the supplement to Hair et al.'s textbook [14].

Definition 2.1.8.
A canonical variate, also known as a linear compound or a linear composite, is a linear combination that constitutes the weighted sum of two or more variables. A canonical variate can thus be defined for either set of variables.

Definition 2.1.9.
A canonical function depicts the relationship between two canonical variates (linear composites). For each canonical function there are two canonical variates, one for each set of variables. The degree of the relationship is the canonical correlation.

Definition 2.1.10.
The canonical roots are the squared canonical correlations. They are also known as eigenvalues. The canonical roots provide an estimate of the shared variance between the weighted canonical variates of the two sets of variables.

Definition 2.1.11.
Orthogonality here is a mathematical constraint which specifies that the canonical functions are independent of one another. Put differently, to arrive at statistical independence of the canonical functions, we derive them so that each function is perpendicular to all the others when plotted in multivariate space.

Definition 2.1.12.
The canonical loading is the measure of correlation between the original variables and their canonical variates.

Definition 2.1.13.
The redundancy index is the measure of the amount of variance explained between a canonical variate pair in a canonical function.

Chapter 3 Methodology

In this chapter, we present the idea of canonical correlation analysis, a technique that seeks to identify the relationships between two datasets. Canonical correlation analysis is presented in Section 3.1 and an example is given in Section 3.2. The discussion of the technique is geared towards the datasets used in this thesis. The main references for this chapter are [20, 15, 24].

3.1 Canonical Correlation Analysis (CCA)

3.1.1 Canonical Correlation

Canonical correlation analysis is a technique that measures the relationship between two multidimensional variables. It seeks two bases in which the correlation matrix between the variables is diagonal and the correlations on the diagonal are maximized. CCA was first introduced by H. Hotelling in 1936 [19]. Canonical correlation is invariant with respect to affine transformations of the variables, a property that differentiates it from ordinary correlation analysis. CCA enables us to summarize the relationships into a smaller number of statistics while preserving their main facets.

We begin with the following notation: we define two vectors $X$ and $Y$ as two sets of variables, where $X$ consists of $p$ variables and $Y$ consists of $q$ variables. For computational convenience, we label the two sets so that $p \le q$. So
\[
X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{pmatrix}
\quad \text{and} \quad
Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_q \end{pmatrix}. \tag{3.1}
\]

We define two sets of linear combinations, $M$ and $N$. $M$ consists of linear combinations of the variables $X_i$ in $X$, and $N$ consists of linear combinations of the variables $Y_j$ in $Y$. We have
\[
\begin{aligned}
M_1 &= a_{11}X_1 + a_{12}X_2 + \cdots + a_{1p}X_p \\
M_2 &= a_{21}X_1 + a_{22}X_2 + \cdots + a_{2p}X_p \\
&\;\;\vdots \\
M_p &= a_{p1}X_1 + a_{p2}X_2 + \cdots + a_{pp}X_p,
\end{aligned}
\]
each of the form $a'X$ for a coefficient vector $a$, and
\[
\begin{aligned}
N_1 &= b_{11}Y_1 + b_{12}Y_2 + \cdots + b_{1q}Y_q \\
N_2 &= b_{21}Y_1 + b_{22}Y_2 + \cdots + b_{2q}Y_q \\
&\;\;\vdots \\
N_p &= b_{p1}Y_1 + b_{p2}Y_2 + \cdots + b_{pq}Y_q,
\end{aligned}
\]
each of the form $b'Y$ for a coefficient vector $b$.

We also define $(M_i, N_i)$ as the $i$th canonical variate pair. So $(M_1, N_1)$ is the first canonical variate pair, $(M_2, N_2)$ is the second canonical variate pair, and so on. There are $p$ canonical variate pairs. We seek linear combinations that maximize the correlations between the members of each canonical variate pair.

The correlation $\mathrm{corr}(M_i, N_j)$ between $M_i$ and $N_j$ is calculated using (3.2):
\[
\mathrm{corr}(M_i, N_j) = \frac{\mathrm{cov}(M_i, N_j)}{\sqrt{\mathrm{var}(M_i)\,\mathrm{var}(N_j)}}, \tag{3.2}
\]
where $\mathrm{cov}(M_i, N_j)$ is the covariance between $M_i$ and $N_j$, and $\mathrm{var}(M_i)$ and $\mathrm{var}(N_j)$ are the variances of $M_i$ and $N_j$ respectively. The canonical correlation for the $i$th canonical variate pair is simply the correlation between $M_i$ and $N_i$:
\[
\rho_i = \frac{\mathrm{cov}(M_i, N_i)}{\sqrt{\mathrm{var}(M_i)\,\mathrm{var}(N_i)}}. \tag{3.3}
\]
The quantity in (3.3) is to be maximized; that is, we find linear combinations of the $X_i$'s and linear combinations of the $Y_j$'s that maximize the above correlation.

So the main purpose of canonical correlation analysis is to explain the covariance structure, or correlation structure, between two sets of random vectors in terms of a few linear combinations.

3.1.2 Mathematical Formulation

The $p$-dimensional random vector $X$ and the $q$-dimensional random vector $Y$ are such that $\mathrm{cov}(X, X)$, $\mathrm{cov}(Y, Y)$ and $\mathrm{cov}(X, Y)$ are denoted by $\Sigma_{11}$, $\Sigma_{22}$ and $\Sigma_{12}$ respectively, with $\Sigma_{21} = \Sigma_{12}'$. So the covariance structure of $X$ and $Y$ is given as
\[
\mathrm{cov}\begin{pmatrix} X \\ Y \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.
\]
Considering the linear combinations $a'X$ and $b'Y$, we have that
\[
\mathrm{cov}(a'X, b'Y) = a'\Sigma_{12}b.
\]
This implies that the correlation between the linear combinations $a'X$ and $b'Y$ is
\[
\rho(a'X, b'Y) = \frac{a'\Sigma_{12}b}{\sqrt{a'\Sigma_{11}a \cdot b'\Sigma_{22}b}}.
\]

3.1.3 Formulation and Derivation of the Canonical Variables

The canonical variables and associated correlation coefficients are defined iteratively.

1st Pair of Canonical Variables:
Definition: Consider $M_1 = a'X$ and $N_1 = b'Y$ such that
• $\mathrm{var}(M_1) = \mathrm{var}(N_1) = 1$ and
• $\rho(M_1, N_1) = \max_{a,b} \rho(a'X, b'Y)$.
Then $(M_1, N_1)$ is the 1st pair of canonical variables (canonical variates) and $\rho_1 = \max_{a,b} \rho(a'X, b'Y)$ is the 1st canonical correlation coefficient.

2nd Pair of Canonical Variables:
Definition: Consider linear combinations $a'X$ and $b'Y$ such that
• $\mathrm{cov}(a'X, M_1) = 0 = \mathrm{cov}(b'Y, N_1)$, that is, $M_1$ is uncorrelated with the linear combination $a'X$ and $N_1$ is uncorrelated with $b'Y$, and
• $\mathrm{var}(a'X) = \mathrm{var}(b'Y) = 1$.
Then maximize the correlation between $a'X$ and $b'Y$ subject to these conditions. The maximizing $a'X$ and $b'Y$ are called the second pair of canonical variates, and the maximized correlation is the second canonical correlation coefficient.

$k$th Pair of Canonical Variables:
Definition: The $k$th pair of canonical variables is the pair of linear combinations $(M_k, N_k)$ with unit variance which maximizes the correlation among all possible linear combinations uncorrelated with the previous $k - 1$ canonical variate pairs.

The following statements will help us in the derivation of the canonical variables. We assume
\[
\mathrm{cov}(X, X) = \Sigma_{11} \succ 0, \qquad \mathrm{cov}(Y, Y) = \Sigma_{22} \succ 0,
\]
that is, $\Sigma_{11}$ and $\Sigma_{22}$ are positive definite. Now we consider the $p \times q$ matrix
\[
A = \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1/2}
\]
and the matrices
\[
\begin{aligned}
AA' &= \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1/2} \quad (p \times p), \\
A'A &= \Sigma_{22}^{-1/2} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1/2} \quad (q \times q).
\end{aligned}
\]
Let $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$ be the eigenvalues of $AA'$ and let $\gamma_1 \ge \gamma_2 \ge \dots \ge \gamma_q$ be the eigenvalues of $A'A$.
We have that:
(i) $A'A$ and $AA'$ are positive semi-definite, which implies that $\lambda_i \ge 0$ and $\gamma_j \ge 0$ for all $i, j$.
(ii) The non-zero eigenvalues of $AA'$ are the same as the non-zero eigenvalues of $A'A$; the eigenvalue $0$ has different multiplicities in $AA'$ and $A'A$ when $p \neq q$.

Theorem 3.1.4. [20] Suppose that $p \le q$ and
\[
\mathrm{cov}\begin{pmatrix} X \\ Y \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.
\]
Considering the linear combinations $M = a'X$ and $N = b'Y$, we have that
\[
\max_{a,b} \rho(a'X, b'Y) = \rho_1
\]
is attained by the linear combinations
\[
M_1 = e_1' \Sigma_{11}^{-1/2} X \quad \text{and} \quad N_1 = f_1' \Sigma_{22}^{-1/2} Y.
\]
$M_1$ and $N_1$ are the first pair of canonical variables. Among linear combinations uncorrelated with the first pair,
\[
\max_{a,b} \rho(a'X, b'Y) = \rho_2
\]
is attained by the linear combinations
\[
M_2 = e_2' \Sigma_{11}^{-1/2} X \quad \text{and} \quad N_2 = f_2' \Sigma_{22}^{-1/2} Y.
\]
$M_2$ and $N_2$ are the second pair of canonical variables. In general, among linear combinations uncorrelated with the first $k - 1$ pairs,
\[
\max_{a,b} \rho(a'X, b'Y) = \rho_k
\]
is attained by the linear combinations
\[
M_k = e_k' \Sigma_{11}^{-1/2} X \quad \text{and} \quad N_k = f_k' \Sigma_{22}^{-1/2} Y.
\]
Here $\rho_1^2 \ge \rho_2^2 \ge \dots \ge \rho_p^2$ are the eigenvalues of the matrix $\Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1/2}$, and $e_1, e_2, \dots, e_p$ are the corresponding orthonormalized eigenvectors. The values $\rho_1^2 \ge \rho_2^2 \ge \dots \ge \rho_p^2$ are also the $p$ largest eigenvalues of the matrix $\Sigma_{22}^{-1/2} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1/2}$, with eigenvectors $f_1, f_2, \dots, f_p$, where each $f_i$ is proportional to $\Sigma_{22}^{-1/2} \Sigma_{21} \Sigma_{11}^{-1/2} e_i$.

Derivation of the 1st pair of canonical variables

Proof. From the definitions, we have that
\[
\rho(a'X, b'Y) = \frac{a'\Sigma_{12}b}{\left(a'\Sigma_{11}a \cdot b'\Sigma_{22}b\right)^{1/2}}. \tag{3.4}
\]
We let
\[
\Sigma_{11}^{1/2} a = u \implies a = \Sigma_{11}^{-1/2} u
\quad \text{and} \quad
\Sigma_{22}^{1/2} b = v \implies b = \Sigma_{22}^{-1/2} v.
\]
So equation (3.4) becomes
\[
\rho(a'X, b'Y) = \frac{u' \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1/2} v}{\left((u'u)(v'v)\right)^{1/2}}.
\]
By applying the Cauchy-Schwarz inequality to the numerator, we have that
\[
u' \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1/2} v \le \left( u' \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1/2} u \right)^{1/2} \left( v'v \right)^{1/2}. \tag{3.5}
\]
We make use of the following result to find an upper bound of the expression on the right.
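From matrix theory, if $C$ is a $p \times p$ real symmetric matrix with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$ and orthonormalized eigenvectors $e_1, \dots, e_p$, then
\[
\max_{d \neq 0} \frac{d'Cd}{d'd} = \lambda_1,
\]
where $\lambda_1$ is the largest eigenvalue of $C$ and $d$ is a vector. The maximum is attained at $d = e_1$, the orthonormalized eigenvector corresponding to the largest eigenvalue $\lambda_1$. This implies that $d'Cd \le \lambda_1 d'd$. So we have that
\[
u' \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1/2} u \le \rho_1^2\, u'u. \tag{3.6}
\]
In equation (3.6) equality holds at $u = e_1$, and in equation (3.5) equality is attained if $v = \Sigma_{22}^{-1/2} \Sigma_{21} \Sigma_{11}^{-1/2} e_1$. Since $u = \Sigma_{11}^{1/2} a$ and $v = \Sigma_{22}^{1/2} b$, this gives
\[
a = \Sigma_{11}^{-1/2} e_1 \quad \text{and} \quad b = \Sigma_{22}^{-1/2} \Sigma_{22}^{-1/2} \Sigma_{21} \Sigma_{11}^{-1/2} e_1,
\]
the latter being proportional to $\Sigma_{22}^{-1/2} f_1$. Combining (3.5) and (3.6),
\[
\begin{aligned}
\rho(a'X, b'Y) &\le \frac{\left[ \left( u' \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1/2} u \right) (v'v) \right]^{1/2}}{\left( u'u \cdot v'v \right)^{1/2}} \\
&= \left( \frac{u' \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1/2} u}{u'u} \right)^{1/2} \\
&\le \left( \frac{\rho_1^2\, u'u}{u'u} \right)^{1/2} = \rho_1.
\end{aligned}
\]
This implies that
\[
\max_{a,b} \rho(a'X, b'Y) = \rho_1
\]
and
\[
\rho\left(e_1' \Sigma_{11}^{-1/2} X, f_1' \Sigma_{22}^{-1/2} Y\right) = \frac{\mathrm{cov}\left(e_1' \Sigma_{11}^{-1/2} X, f_1' \Sigma_{22}^{-1/2} Y\right)}{\left( \mathrm{var}\left(e_1' \Sigma_{11}^{-1/2} X\right) \mathrm{var}\left(f_1' \Sigma_{22}^{-1/2} Y\right) \right)^{1/2}} = \rho_1.
\]
Hence the first pair of canonical variables is given by $M_1 = e_1' \Sigma_{11}^{-1/2} X$ and $N_1 = f_1' \Sigma_{22}^{-1/2} Y$.

So we now have that
\[
\Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1/2} e_1 = \lambda_1 e_1 \quad (\lambda_1 = \rho_1^2). \tag{3.7}
\]
We multiply both sides of equation (3.7) on the left by the matrix $\Sigma_{22}^{-1/2} \Sigma_{21} \Sigma_{11}^{-1/2}$ to obtain
\[
\left( \Sigma_{22}^{-1/2} \Sigma_{21} \Sigma_{11}^{-1/2} \right) \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1/2} e_1 = \lambda_1 \Sigma_{22}^{-1/2} \Sigma_{21} \Sigma_{11}^{-1/2} e_1.
\]
That is, writing $\Sigma_{22}^{-1} = \Sigma_{22}^{-1/2} \Sigma_{22}^{-1/2}$ and regrouping,
\[
\Sigma_{22}^{-1/2} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1/2} \left( \Sigma_{22}^{-1/2} \Sigma_{21} \Sigma_{11}^{-1/2} e_1 \right) = \lambda_1 \left( \Sigma_{22}^{-1/2} \Sigma_{21} \Sigma_{11}^{-1/2} e_1 \right).
\]
Since $f_1$ is proportional to $\Sigma_{22}^{-1/2} \Sigma_{21} \Sigma_{11}^{-1/2} e_1$, we have that
\[
\Sigma_{22}^{-1/2} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1/2} f_1 = \lambda_1 f_1.
\]
Thus we conclude that if $(\lambda_1, e_1)$ is an eigenvalue-eigenvector pair of $\Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1/2}$, then $(\lambda_1, f_1)$ is an eigenvalue-eigenvector pair of $\Sigma_{22}^{-1/2} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1/2}$.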
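The derivation of the first canonical pair can be checked numerically. The following is a minimal sketch and not the computation used in this thesis: it assumes NumPy, uses simulated data (the arrays `X`, `Y` and `W` are hypothetical), and the helper names `inv_sqrt` and `first_canonical_pair` are introduced only for illustration. The inverse square roots are obtained by the diagonalization method of Definition 2.1.2, with sample covariance blocks standing in for $\Sigma_{11}$, $\Sigma_{12}$ and $\Sigma_{22}$.

```python
import numpy as np


def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix.

    Uses the diagonalization S = K D K' (Definition 2.1.2), so that
    S^(-1/2) = K D^(-1/2) K'.
    """
    eigvals, K = np.linalg.eigh(S)
    return K @ np.diag(1.0 / np.sqrt(eigvals)) @ K.T


def first_canonical_pair(S11, S12, S22):
    """Return (rho1, a1, b1) with M1 = a1'X and N1 = b1'Y."""
    S11_ih = inv_sqrt(S11)
    S22_ih = inv_sqrt(S22)
    A = S11_ih @ S12 @ S22_ih              # the p x q matrix A
    eigvals, E = np.linalg.eigh(A @ A.T)   # eigenvalues in ascending order
    e1 = E[:, -1]                          # eigenvector of the largest eigenvalue
    rho1 = np.sqrt(eigvals[-1])            # rho1 is the square root of that eigenvalue
    f1 = A.T @ e1                          # proportional to S22^(-1/2) S21 S11^(-1/2) e1
    f1 = f1 / np.linalg.norm(f1)           # normalize so that f1'f1 = 1
    return rho1, S11_ih @ e1, S22_ih @ f1


# Hypothetical data: 500 observations, p = 2 and q = 3, with Y depending on X.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))
W = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, -0.5]])
Y = X @ W + rng.standard_normal((500, 3))

S = np.cov(np.hstack([X, Y]), rowvar=False)   # 5 x 5 sample covariance matrix
S11, S12, S22 = S[:2, :2], S[:2, 2:], S[2:, 2:]

rho1, a1, b1 = first_canonical_pair(S11, S12, S22)
print("first canonical correlation:", rho1)
print("var(M1):", a1 @ S11 @ a1, "var(N1):", b1 @ S22 @ b1)
print("cov(M1, N1):", a1 @ S12 @ b1)
```

Under these assumptions the printed variances should both be close to 1 and the printed covariance should agree with the first canonical correlation, matching the unit-variance conditions and the bound attained in the derivation.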
Derivation of the second canonical variables

$M_1$ and any linear combination of the $X$'s, say $a_2'X = u_2'\Sigma_{11}^{-1/2}X$ where $\Sigma_{11}^{1/2}a_2 = u_2$, are uncorrelated if
\[ \operatorname{cov}\left(M_1, u_2'\Sigma_{11}^{-1/2}X\right) = \operatorname{cov}\left(e_1'\Sigma_{11}^{-1/2}X,\; u_2'\Sigma_{11}^{-1/2}X\right) = e_1'\Sigma_{11}^{-1/2}\Sigma_{11}\Sigma_{11}^{-1/2}u_2 = e_1'u_2 = 0. \]
So $u_2$ is to be determined such that it is orthogonal to $e_1$. We want to maximize
\[ \rho(a_2'X, b_2'Y) = \frac{\operatorname{cov}(a_2'X, b_2'Y)}{\left(\operatorname{var}(a_2'X) \cdot \operatorname{var}(b_2'Y)\right)^{1/2}} = \frac{a_2'\Sigma_{12}b_2}{\left((a_2'\Sigma_{11}a_2)(b_2'\Sigma_{22}b_2)\right)^{1/2}}. \]
We let $\Sigma_{11}^{1/2}a_2 = u_2$, so that $a_2 = \Sigma_{11}^{-1/2}u_2$, and $\Sigma_{22}^{1/2}b_2 = v_2$, so that $b_2 = \Sigma_{22}^{-1/2}v_2$. So we have that
\[ \rho(a_2'X, b_2'Y) = \frac{u_2'\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1/2}v_2}{\left(u_2'u_2 \cdot v_2'v_2\right)^{1/2}}. \]
We apply the Cauchy-Schwarz inequality to the numerator and have that
\[ u_2'\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1/2}v_2 \leq \left(u_2'\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}u_2\right)^{1/2}\left(v_2'v_2\right)^{1/2}. \tag{3.8} \]
So we concentrate on the expression $u_2'\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}u_2$ and look for an upper bound. For this we again recall a result from matrix theory: for a real symmetric matrix $C_{p \times p}$ with eigenvalue-eigenvector pairs $(\lambda_i, e_i)$, $i = 1, 2, \dots, p$, such that $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p$, we have that
\[ \max_{d \perp e_1} \frac{d'Cd}{d'd} = \lambda_2 \implies d'Cd \leq \lambda_2\, d'd \tag{3.9} \]
and
\[ \max_{d \perp e_1, e_2, \dots, e_k} \frac{d'Cd}{d'd} = \lambda_{k+1} \implies d'Cd \leq \lambda_{k+1}\, d'd. \tag{3.10} \]
In equation 3.9 equality holds if $d = e_2$, and in equation 3.10 equality holds if $d = e_{k+1}$. From equation 3.9, with $\lambda_2 = \rho_2^2$, we have that
\[ u_2'\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}u_2 \leq \lambda_2\,(u_2'u_2) \]
with equality at $u_2 = e_2$. In equation 3.8 equality is attained if $v_2$ is proportional to $\Sigma_{22}^{-1/2}\Sigma_{21}\Sigma_{11}^{-1/2}e_2$, that is $v_2 = f_2$, which gives
\[ b_2 = \Sigma_{22}^{-1/2}f_2. \]
So now we have that
\[ \rho(a_2'X, b_2'Y) \leq \frac{\left[\left(u_2'\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}u_2\right)(v_2'v_2)\right]^{1/2}}{\left(u_2'u_2 \cdot v_2'v_2\right)^{1/2}} = \left(\frac{u_2'\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}u_2}{u_2'u_2}\right)^{1/2} \leq \left(\frac{\rho_2^2\, u_2'u_2}{u_2'u_2}\right)^{1/2} = \rho_2. \]
Thus $\operatorname{corr}(a_2'X, b_2'Y) \leq \rho_2$ with equality at $u_2 = e_2$, that is, $a_2 = \Sigma_{11}^{-1/2}e_2$. The second pair of canonical variables is
\[ M_2 = e_2'\Sigma_{11}^{-1/2}X \quad \text{and} \quad N_2 = f_2'\Sigma_{22}^{-1/2}Y. \]
The second canonical correlation coefficient is $\rho_2$, as required.

3.1.5 Properties of the Canonical Variable Pairs

(i) $\operatorname{var}(M_k) = \operatorname{var}(N_k) = 1$.

Proof.
\[ \operatorname{var}(M_k) = \operatorname{var}\left(e_k'\Sigma_{11}^{-1/2}X\right) = e_k'\Sigma_{11}^{-1/2}\Sigma_{11}\Sigma_{11}^{-1/2}e_k = e_k'e_k = 1. \]
Similarly, $\operatorname{var}(N_k) = f_k'\Sigma_{22}^{-1/2}\Sigma_{22}\Sigma_{22}^{-1/2}f_k = f_k'f_k = 1$.

(ii) $\operatorname{cov}(M_k, M_t) = \operatorname{corr}(M_k, M_t) = 0$ for all $k \neq t$.

Proof.
\[ \operatorname{cov}(M_k, M_t) = \operatorname{cov}\left(e_k'\Sigma_{11}^{-1/2}X,\; e_t'\Sigma_{11}^{-1/2}X\right) = e_k'\Sigma_{11}^{-1/2}\Sigma_{11}\Sigma_{11}^{-1/2}e_t = e_k'e_t = 0 \quad \forall\, k \neq t, \]
since $e_k$ and $e_t$ are orthogonal.

(iii) $\operatorname{cov}(N_k, N_t) = \operatorname{corr}(N_k, N_t) = 0$ for all $k \neq t$.

Proof.
\[ \operatorname{cov}(N_k, N_t) = \operatorname{cov}\left(f_k'\Sigma_{22}^{-1/2}Y,\; f_t'\Sigma_{22}^{-1/2}Y\right) = f_k'\Sigma_{22}^{-1/2}\Sigma_{22}\Sigma_{22}^{-1/2}f_t = f_k'f_t = 0 \quad \forall\, k \neq t, \]
because of the orthogonality of $f_k$ and $f_t$.

(iv) $\operatorname{cov}(M_k, N_t) = \operatorname{corr}(M_k, N_t) = 0$ for all $k \neq t$.

Proof.
\[ \operatorname{cov}(M_k, N_t) = \operatorname{cov}\left(e_k'\Sigma_{11}^{-1/2}X,\; f_t'\Sigma_{22}^{-1/2}Y\right) = e_k'\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1/2}f_t. \tag{3.11} \]
We recall that $f_k$ is proportional to $\Sigma_{22}^{-1/2}\Sigma_{21}\Sigma_{11}^{-1/2}e_k$, so $e_k'\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1/2} = Q f_k'$ for some constant $Q$, and
\[ \operatorname{cov}(M_k, N_t) = Q f_k'f_t = 0 \quad \forall\, k \neq t, \]
since $f_k \perp f_t$.

3.1.6 Canonical correlation coefficient under the non-singular transformation

In this section we seek to find the canonical correlations when the vectors $X$ and $Y$ are transformed. We will also demonstrate that we can compute the canonical correlation coefficients either from the covariance matrix or from the correlation matrix. We derive the canonical correlation coefficient under the transformation
\[ X_{p \times 1} \to CX \quad \text{and} \quad Y_{q \times 1} \to DY, \]
where $C$ and $D$ are non-singular matrices. We have
\[ \operatorname{cov}\begin{pmatrix} CX \\ DY \end{pmatrix} = \begin{pmatrix} C\Sigma_{11}C' & C\Sigma_{12}D' \\ D\Sigma_{21}C' & D\Sigma_{22}D' \end{pmatrix}. \]
We have seen that $\rho_1, \rho_2, \dots, \rho_p$ are the canonical correlation coefficients for the $\begin{pmatrix} X \\ Y \end{pmatrix}$ set-up. Also, $\rho_1^2, \rho_2^2, \dots, \rho_p^2$ are the eigenvalues of $\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}$.
Hence, $\rho_1^2, \rho_2^2, \dots, \rho_p^2$ are the roots of
\[ \left|\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2} - \lambda I\right| = 0. \]
Pre-multiplying the matrix inside the determinant by $\Sigma_{11}^{1/2}$ and post-multiplying it by $\Sigma_{11}^{-1/2}$ (a similarity transformation, which leaves the roots unchanged) gives
\[ \left|\Sigma_{11}^{1/2}\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}\Sigma_{11}^{-1/2} - \lambda I\right| = 0, \quad \text{that is,} \quad \left|\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1} - \lambda I\right| = 0. \]
Under $C$ and $D$ the matrix $\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1}$ transforms as
\[ \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1} \;\xrightarrow{\;C,\,D\;}\; (C\Sigma_{12}D')(D\Sigma_{22}D')^{-1}(D\Sigma_{21}C')(C\Sigma_{11}C')^{-1} = C\,\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1}\,C^{-1}. \]
The non-zero eigenvalues of $C\,\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1}\,C^{-1}$ are the same as the non-zero eigenvalues of $C^{-1}C\,\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1} = \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1}$. Hence we conclude that the canonical correlation coefficients are unchanged under the non-singular transformation $C, D$.

We now take a special case of such a transformation by defining $C$ and $D$ as follows:
\[ C = N_{11}^{-1/2} \text{ where } N_{11} = \operatorname{diag}(\Sigma_{11}), \quad \text{and} \quad D = N_{22}^{-1/2} \text{ where } N_{22} = \operatorname{diag}(\Sigma_{22}). \]
So we transform the vectors $X$ and $Y$ and compute their covariance matrices under the transformation:
\[ X \to CX = N_{11}^{-1/2}X, \quad \operatorname{cov}(N_{11}^{-1/2}X) = N_{11}^{-1/2}\Sigma_{11}N_{11}^{-1/2} = \rho_{11}, \]
\[ Y \to DY = N_{22}^{-1/2}Y, \quad \operatorname{cov}(N_{22}^{-1/2}Y) = N_{22}^{-1/2}\Sigma_{22}N_{22}^{-1/2} = \rho_{22}. \]
This implies that the eigenvalues of $\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}$ are identical to the eigenvalues of $\rho_{11}^{-1/2}\rho_{12}\rho_{22}^{-1}\rho_{21}\rho_{11}^{-1/2}$. Therefore, computing the canonical correlation coefficients from either the covariance matrix or the correlation matrix will yield the same values.

3.1.7 Correlation Coefficient Between Canonical Variables and the Original Variables

We now derive the correlation coefficients between the canonical variables $M_i$ and $N_i$, $i = 1, 2, \dots, p$, and the original variables $X$ and $Y$. The $k$th canonical variate pair is defined as
\[ M_k = e_k'\Sigma_{11}^{-1/2}X \quad \text{and} \quad N_k = f_k'\Sigma_{22}^{-1/2}Y. \]
Stacking the canonical variables,
\[ \underset{p \times 1}{M} = \begin{pmatrix} M_1 \\ M_2 \\ \vdots \\ M_p \end{pmatrix} = \begin{pmatrix} e_1' \\ e_2' \\ \vdots \\ e_p' \end{pmatrix}\Sigma_{11}^{-1/2}X = CX, \quad \text{where } C = \begin{pmatrix} e_1' \\ e_2' \\ \vdots \\ e_p' \end{pmatrix}\Sigma_{11}^{-1/2}, \]
\[ \underset{q \times 1}{N} = \begin{pmatrix} N_1 \\ N_2 \\ \vdots \\ N_q \end{pmatrix} = \begin{pmatrix} f_1' \\ f_2' \\ \vdots \\ f_q' \end{pmatrix}\Sigma_{22}^{-1/2}Y = DY, \quad \text{where } D = \begin{pmatrix} f_1' \\ f_2' \\ \vdots \\ f_q' \end{pmatrix}\Sigma_{22}^{-1/2}. \]
Then
\[ \operatorname{cov}(M, X) = \operatorname{cov}(CX, X) = C\Sigma_{11} = \begin{pmatrix} e_1' \\ \vdots \\ e_p' \end{pmatrix}\Sigma_{11}^{1/2} \quad \text{and} \quad \operatorname{cov}(N, Y) = \operatorname{cov}(DY, Y) = D\Sigma_{22} = \begin{pmatrix} f_1' \\ \vdots \\ f_q' \end{pmatrix}\Sigma_{22}^{1/2}. \]
Since $\operatorname{var}(M_i) = 1$ and $\operatorname{var}(X_k) = \sigma_{kk}$, this implies that
\[ \operatorname{corr}(M_i, X_k) = \frac{\operatorname{cov}(M_i, X_k)}{\sigma_{kk}^{1/2}} = \operatorname{cov}\left(M_i, \sigma_{kk}^{-1/2}X_k\right). \]
With $N_{11} = \operatorname{diag}(\Sigma_{11}) = \operatorname{diag}(\sigma_{11}, \dots, \sigma_{pp})$,
\[ \operatorname{corr}(M, X) = \operatorname{cov}(M, N_{11}^{-1/2}X) = \operatorname{cov}(CX, N_{11}^{-1/2}X) = C\Sigma_{11}N_{11}^{-1/2} = CN_{11}^{1/2}N_{11}^{-1/2}\Sigma_{11}N_{11}^{-1/2} = CN_{11}^{1/2}\rho_{11}. \tag{3.12} \]
Similarly, with $N_{22} = \operatorname{diag}(\Sigma_{22})$ and $\rho_{12} = N_{11}^{-1/2}\Sigma_{12}N_{22}^{-1/2}$,
\[ \operatorname{corr}(M, Y) = \operatorname{cov}(M, N_{22}^{-1/2}Y) = \operatorname{cov}(CX, N_{22}^{-1/2}Y) = C\Sigma_{12}N_{22}^{-1/2} = CN_{11}^{1/2}\rho_{12}. \tag{3.13} \]
And
\[ \operatorname{corr}(N, X) = \operatorname{cov}(N, N_{11}^{-1/2}X) = \operatorname{cov}(DY, N_{11}^{-1/2}X) = D\Sigma_{21}N_{11}^{-1/2} = DN_{22}^{1/2}\rho_{21}. \tag{3.14} \]
Finally,
\[ \operatorname{corr}(N, Y) = \operatorname{cov}(N, N_{22}^{-1/2}Y) = \operatorname{cov}(DY, N_{22}^{-1/2}Y) = D\Sigma_{22}N_{22}^{-1/2} = DN_{22}^{1/2}\rho_{22}. \tag{3.15} \]
Equations 3.12, 3.13, 3.14 and 3.15 give the correlations between the canonical variate pairs and the original variables.

3.1.8 Computation of Canonical Correlation Coefficient Using Standardized Variables

Here, we seek to derive the canonical coefficients by standardizing the original variables. We denote the standardized variables as follows:
\[ Z^{(X)} = N_{11}^{-1/2}(X - \mu_X) \quad \text{and} \quad Z^{(Y)} = N_{22}^{-1/2}(Y - \mu_Y), \]
where $\mu_X$ and $\mu_Y$ are the mean vectors of $X$ and $Y$. So the covariance matrix of the standardized variables is given by
\[ \operatorname{cov}\begin{pmatrix} Z^{(X)} \\ Z^{(Y)} \end{pmatrix} = \begin{pmatrix} \rho_{11} & \rho_{12} \\ \rho_{21} & \rho_{22} \end{pmatrix}. \]
From the correlation matrix, the derived canonical variables are
\[ M_{Zk} = e_k'\Sigma_{11}^{-1/2}N_{11}^{1/2}Z^{(X)} \quad \text{and} \quad N_{Zk} = f_k'\Sigma_{22}^{-1/2}N_{22}^{1/2}Z^{(Y)}, \]
so that
\[ M_Z = \begin{pmatrix} M_{Z1} \\ \vdots \\ M_{Zp} \end{pmatrix} = \begin{pmatrix} e_1' \\ \vdots \\ e_p' \end{pmatrix}\Sigma_{11}^{-1/2}N_{11}^{1/2}Z^{(X)} = C_Z Z^{(X)} \tag{3.16} \]
and
\[ N_Z = \begin{pmatrix} N_{Z1} \\ \vdots \\ N_{Zq} \end{pmatrix} = \begin{pmatrix} f_1' \\ \vdots \\ f_q' \end{pmatrix}\Sigma_{22}^{-1/2}N_{22}^{1/2}Z^{(Y)} = D_Z Z^{(Y)}. \tag{3.17} \]
Now we compute the correlations between the canonical variables obtained from the correlation matrix and the standardized variables. We have
\[ \rho(M_Z, Z^{(X)}) = \operatorname{cov}(M_Z, Z^{(X)}) = \operatorname{cov}(C_Z Z^{(X)}, Z^{(X)}) = C_Z\rho_{11}, \tag{3.18} \]
\[ \rho(N_Z, Z^{(Y)}) = \operatorname{cov}(N_Z, Z^{(Y)}) = \operatorname{cov}(D_Z Z^{(Y)}, Z^{(Y)}) = D_Z\rho_{22}, \tag{3.19} \]
\[ \rho(M_Z, Z^{(Y)}) = \operatorname{cov}(C_Z Z^{(X)}, Z^{(Y)}) = C_Z\rho_{12}, \tag{3.20} \]
\[ \rho(N_Z, Z^{(X)}) = \operatorname{cov}(D_Z Z^{(Y)}, Z^{(X)}) = D_Z\rho_{21}. \tag{3.21} \]
From equations 3.16 and 3.17, we have that
\[ C_Z = \begin{pmatrix} e_1' \\ \vdots \\ e_p' \end{pmatrix}\Sigma_{11}^{-1/2}N_{11}^{1/2} = CN_{11}^{1/2} \quad \text{and} \quad D_Z = \begin{pmatrix} f_1' \\ \vdots \\ f_q' \end{pmatrix}\Sigma_{22}^{-1/2}N_{22}^{1/2} = DN_{22}^{1/2}. \]
Comparing with equations 3.12 to 3.15, this gives
\[ \rho(M, X) = CN_{11}^{1/2}\rho_{11} = C_Z\rho_{11} = \rho(M_Z, Z^{(X)}), \]
\[ \rho(M, Y) = CN_{11}^{1/2}\rho_{12} = C_Z\rho_{12} = \rho(M_Z, Z^{(Y)}), \]
\[ \rho(N, X) = DN_{22}^{1/2}\rho_{21} = D_Z\rho_{21} = \rho(N_Z, Z^{(X)}), \]
\[ \rho(N, Y) = DN_{22}^{1/2}\rho_{22} = D_Z\rho_{22} = \rho(N_Z, Z^{(Y)}). \]
We then conclude that standardizing the variables has no effect on the correlations between the canonical variables and the original variables.

3.1.9 Assessing Overall Model Fit and Canonical Dimension Reduction

In this section, two techniques are discussed to explore whether interpreting fewer canonical dimensions, or canonical variate pairs, is enough to capture sufficient covariance or correlation structure. Not all canonical functions are important. The strength of the canonical correlation coefficient can suggest the importance of the canonical variate pairs [2]. We are ultimately interested in the significant canonical coefficients to make informed decisions.
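Before turning to the two techniques, the invariance result of Sections 3.1.6 to 3.1.8 can be checked numerically. The following is a minimal sketch in Python with NumPy under stated assumptions: the function name `canonical_correlations`, the randomly generated positive definite matrix `S`, and the matrices `C` and `D` are all hypothetical and serve only as an illustration with $p = 2$ and $q = 3$.

```python
import numpy as np


def canonical_correlations(S, p):
    """Canonical correlations from a joint covariance or correlation matrix S.

    The first p rows and columns of S belong to X, the remaining ones to Y.
    """
    S11, S12, S21, S22 = S[:p, :p], S[:p, p:], S[p:, :p], S[p:, p:]
    # Squared canonical correlations are the eigenvalues of
    # Sigma12 Sigma22^{-1} Sigma21 Sigma11^{-1}.
    G = S12 @ np.linalg.inv(S22) @ S21 @ np.linalg.inv(S11)
    lam = np.sort(np.linalg.eigvals(G).real)[::-1]
    return np.sqrt(np.clip(lam, 0.0, 1.0))


# Hypothetical positive definite joint covariance matrix (p = 2, q = 3).
rng = np.random.default_rng(0)
L = rng.normal(size=(5, 5))
S = L @ L.T + np.eye(5)

# Standardizing corresponds to C = N11^{-1/2}, D = N22^{-1/2}: the correlation matrix.
d = np.sqrt(np.diag(S))
R = S / np.outer(d, d)

# An arbitrary non-singular transformation X -> CX, Y -> DY (illustrative values).
C = np.array([[2.0, 1.0], [0.0, 1.0]])
D = np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 1.0], [1.0, 0.0, 1.0]])
T = np.block([[C, np.zeros((2, 3))], [np.zeros((3, 2)), D]])
S_t = T @ S @ T.T

rho_cov = canonical_correlations(S, p=2)    # from the covariance matrix
rho_cor = canonical_correlations(R, p=2)    # from the correlation matrix
rho_lin = canonical_correlations(S_t, p=2)  # after the transformation by C and D
print(rho_cov, rho_cor, rho_lin)
```

If the derivation holds, the three printed vectors should agree up to floating-point error.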
The first technique uses Wilks' lambda and its corresponding F-test to test, at a 5% significance level, the null hypothesis that all canonical correlation coefficients are zero. Wilks' lambda evaluates each canonical function against the null hypothesis that the canonical coefficient is zero. The second technique seeks to ascertain whether choosing $k < p$ canonical variate pairs is enough to capture the covariance structure.

Technique I

For each canonical correlation coefficient, there exists an eigenvalue that is related to Wilks' lambda. The eigenvalue for each coefficient is calculated as
\[ \lambda_i = \frac{\rho_i^2}{1 - \rho_i^2} \]
and Wilks' lambda is computed as
\[ \Lambda = \prod_i \frac{1}{1 + \lambda_i} = \prod_i \left(1 - \rho_i^2\right). \]
The F-test value is calculated as
\[ F = \frac{1 - \Lambda^{1/w}}{\Lambda^{1/w}}\left(\frac{\text{degrees of freedom}_2}{\text{degrees of freedom}_1}\right), \]
where
\[ \text{degrees of freedom}_1 = p \times q, \qquad \text{degrees of freedom}_2 = vw - \frac{pq}{2} + 1, \qquad v = n - \frac{3}{2} - \frac{p + q}{2}, \]
$n$ is the sample size, and
\[ w = \left(\frac{p^2q^2 - 4}{p^2 + q^2 - 5}\right)^{1/2}. \tag{3.22} \]
The computation in equation 3.22 is repeated for each successive test: we begin with the initial values of $p$ and $q$ and repeatedly subtract one from both $p$ and $q$ until either $p$ or $q$ has been reduced to one.

We now compute the p-value or the critical value to make the final decision. The critical value is the value that the computed F value must exceed for the null hypothesis to be rejected. It is obtained from the F-distribution table using the two degrees of freedom and the level of significance (5%). The p-value is computed using the F value and the two degrees of freedom. If the p-value is less than 0.05, we reject the null hypothesis; otherwise we fail to reject it.

Technique II

We have that
\[ M = \begin{pmatrix} M_1 \\ \vdots \\ M_p \end{pmatrix} = SX, \text{ so } X = S^{-1}M, \text{ where } S = \begin{pmatrix} e_1' \\ \vdots \\ e_p' \end{pmatrix}\Sigma_{11}^{-1/2}, \]
and
\[ N = \begin{pmatrix} N_1 \\ \vdots \\ N_q \end{pmatrix} = TY, \text{ so } Y = T^{-1}N, \text{ where } T = \begin{pmatrix} f_1' \\ \vdots \\ f_q' \end{pmatrix}\Sigma_{22}^{-1/2}. \]
Clearly, $S^{-1} = \Sigma_{11}^{1/2}(e_1, \dots, e_p)$ and $T^{-1} = \Sigma_{22}^{1/2}(f_1, \dots, f_q)$. Writing $S^{-1}$ and $T^{-1}$ column by column eases the computation:
\[ S^{-1} = \left(s^{(1)}, \dots, s^{(p)}\right), \text{ where } s^{(i)} = \Sigma_{11}^{1/2}e_i, \; i = 1, 2, \dots, p, \tag{3.23} \]
\[ T^{-1} = \left(t^{(1)}, \dots, t^{(q)}\right), \text{ where } t^{(i)} = \Sigma_{22}^{1/2}f_i, \; i = 1, 2, \dots, q. \tag{3.24} \]
Using this we rewrite $X$ and $Y$ as
\[ X = \left(s^{(1)}, \dots, s^{(p)}\right)M = \sum_{i=1}^{p} s^{(i)}M_i \tag{3.25} \]
and
\[ Y = \left(t^{(1)}, \dots, t^{(q)}\right)N = \sum_{i=1}^{q} t^{(i)}N_i. \tag{3.26} \]
Since the canonical variables within each set are uncorrelated with unit variance, the covariance matrices of $X$ and $Y$ are
\[ \operatorname{cov}(X) = \operatorname{cov}\left(\sum_{i=1}^{p} s^{(i)}M_i\right) = \sum_{i=1}^{p} s^{(i)}s^{(i)\prime} \quad \text{and} \quad \operatorname{cov}(Y) = \operatorname{cov}\left(\sum_{i=1}^{q} t^{(i)}N_i\right) = \sum_{i=1}^{q} t^{(i)}t^{(i)\prime}. \]
Considering only the first $k$ canonical variables, we have
\[ X^* = \sum_{i=1}^{k} s^{(i)}M_i \quad \text{and} \quad Y^* = \sum_{i=1}^{k} t^{(i)}N_i, \]
thus
\[ \operatorname{cov}(X^*) = \sum_{i=1}^{k} s^{(i)}s^{(i)\prime} \quad \text{and} \quad \operatorname{cov}(Y^*) = \sum_{i=1}^{k} t^{(i)}t^{(i)\prime}. \]
We then compute the covariance between $X$ and $Y$. Since $\operatorname{cov}(M, N)$ is the $p \times q$ matrix with $\rho_1, \dots, \rho_p$ on its diagonal and zeros elsewhere,
\[ \operatorname{cov}(X, Y) = \operatorname{cov}(S^{-1}M, T^{-1}N) = S^{-1}\operatorname{cov}(M, N)\left(T^{-1}\right)' = \left(s^{(1)}, \dots, s^{(p)}\right)\begin{pmatrix} \rho_1 & & & \\ & \ddots & & 0 \\ & & \rho_p & \end{pmatrix}\begin{pmatrix} t^{(1)\prime} \\ \vdots \\ t^{(q)\prime} \end{pmatrix} = \sum_{i=1}^{p} \rho_i s^{(i)}t^{(i)\prime}. \]
Therefore,
\[ \operatorname{cov}(X^*, Y^*) = \sum_{i=1}^{k} \rho_i s^{(i)}t^{(i)\prime}. \]
Having the covariance structure for the first $k$ canonical variables, we now ask how close the following three residual matrices are to a null matrix:
\[ \sum_{i=k+1}^{p} s^{(i)}s^{(i)\prime}, \quad \sum_{i=k+1}^{q} t^{(i)}t^{(i)\prime} \quad \text{and} \quad \sum_{i=k+1}^{p} \rho_i s^{(i)}t^{(i)\prime}. \]
We make three observations.

(1) Since we usually choose $k$ such that $\rho_{k+1}$, and hence $\rho_{k+2}, \dots, \rho_p$, are negligible, $\sum_{i=k+1}^{p} \rho_i s^{(i)}t^{(i)\prime}$ will be closer to a null matrix than $\sum_{i=k+1}^{p} s^{(i)}s^{(i)\prime}$ and $\sum_{i=k+1}^{q} t^{(i)}t^{(i)\prime}$.

(2)
\[ \operatorname{cov}(X, M) = \operatorname{cov}\left(S^{-1}M, M\right) = S^{-1} = \left(s^{(1)}, \dots, s^{(p)}\right) = \begin{pmatrix} \operatorname{cov}(X_1, M_1) & \dots & \operatorname{cov}(X_1, M_p) \\ \vdots & & \vdots \\ \operatorname{cov}(X_p, M_1) & \dots & \operatorname{cov}(X_p, M_p) \end{pmatrix}. \]

(3) Considering $k < p$ canonical variables $M_1, \dots, M_k$, the proportion of the total variance of $X$ explained by $M_1, \dots, M_k$ is given by
\[ \frac{\operatorname{tr}\left(\operatorname{cov}(X^*)\right)}{\operatorname{tr}\left(\operatorname{cov}(X)\right)} = \frac{\operatorname{tr}\left(\sum_{i=1}^{k} s^{(i)}s^{(i)\prime}\right)}{\operatorname{tr}\Sigma_{11}}, \]
where $\operatorname{tr}$ denotes the trace of the matrix in question. In addition, $S^{-1} = (s^{(1)}, \dots, s^{(p)}) = \operatorname{cov}(X, M)$ and
\[ s^{(i)} = \begin{pmatrix} \operatorname{cov}(X_1, M_i) \\ \vdots \\ \operatorname{cov}(X_p, M_i) \end{pmatrix}, \quad i = 1, \dots, p, \]
thus
\[ s^{(i)\prime}s^{(i)} = \sum_{j=1}^{p} \operatorname{cov}(X_j, M_i)^2 \quad \text{and} \quad \sum_{i=1}^{k} s^{(i)\prime}s^{(i)} = \sum_{i=1}^{k}\sum_{j=1}^{p} \operatorname{cov}(X_j, M_i)^2. \]
Thus
\[ \frac{\operatorname{tr}\left(\sum_{i=1}^{k} s^{(i)}s^{(i)\prime}\right)}{\operatorname{tr}\Sigma_{11}} = \frac{\sum_{i=1}^{k} \operatorname{tr}\left(s^{(i)}s^{(i)\prime}\right)}{\sum_{i=1}^{p} \operatorname{tr}\left(s^{(i)}s^{(i)\prime}\right)} = \frac{\sum_{i=1}^{k} \operatorname{tr}\left(s^{(i)\prime}s^{(i)}\right)}{\sum_{i=1}^{p} \operatorname{tr}\left(s^{(i)\prime}s^{(i)}\right)}. \]
Since $s^{(i)\prime}s^{(i)}$ is a scalar quantity, we have that
\[ \frac{\sum_{i=1}^{k} \operatorname{tr}\left(s^{(i)\prime}s^{(i)}\right)}{\sum_{i=1}^{p} \operatorname{tr}\left(s^{(i)\prime}s^{(i)}\right)} = \frac{\sum_{i=1}^{k} s^{(i)\prime}s^{(i)}}{\sum_{i=1}^{p} s^{(i)\prime}s^{(i)}} = \frac{\sum_{i=1}^{k}\sum_{j=1}^{p} \operatorname{cov}(X_j, M_i)^2}{\sum_{i=1}^{p}\sum_{j=1}^{p} \operatorname{cov}(X_j, M_i)^2}. \]
Similarly, the proportion of the total variance of $Y$ explained by $N_1, \dots, N_k$ is given by
\[ \frac{\operatorname{tr}\left(\sum_{i=1}^{k} t^{(i)}t^{(i)\prime}\right)}{\operatorname{tr}\Sigma_{22}} = \frac{\sum_{i=1}^{k}\sum_{j=1}^{q} \operatorname{cov}(Y_j, N_i)^2}{\sum_{i=1}^{q}\sum_{j=1}^{q} \operatorname{cov}(Y_j, N_i)^2}. \]
If the proportion of total variance is close to 1, or 100%, then the $k$ dimensions are retained.

3.2 Example: Computation of Canonical Variables and Canonical Coefficients

Here we use the formulas derived in this chapter to compute the canonical variable pairs and the canonical coefficients of the covariance structure below. We consider a vector $Z$ of standardized variables, partitioned into two sub-vectors,
\[ Z = \begin{pmatrix} Z^{(X)} \\ Z^{(Y)} \end{pmatrix}, \]
where $Z^{(X)}$ and $Z^{(Y)}$ are each $2 \times 1$. Suppose we are given
\[ \operatorname{cov}(Z) = \operatorname{cov}\begin{pmatrix} Z^{(X)} \\ Z^{(Y)} \end{pmatrix} = \begin{pmatrix} \rho_{11} & \rho_{12} \\ \rho_{21} & \rho_{22} \end{pmatrix} = \left(\begin{array}{cc|cc} 1.00 & 0.40 & 0.50 & 0.60 \\ 0.40 & 1.00 & 0.30 & 0.40 \\ \hline 0.50 & 0.30 & 1.00 & 0.20 \\ 0.60 & 0.40 & 0.20 & 1.00 \end{array}\right). \]
We begin by calculating $\rho_{11}^{-1/2}$ and $\rho_{22}^{-1}$ as
\[ \rho_{11}^{-1/2} = \begin{pmatrix} 1.068 & -0.223 \\ -0.223 & 1.068 \end{pmatrix} \quad \text{and} \quad \rho_{22}^{-1} = \begin{pmatrix} 1.042 & -0.208 \\ -0.208 & 1.042 \end{pmatrix}, \]
so
\[ \rho_{11}^{-1/2}\rho_{12}\rho_{22}^{-1}\rho_{21}\rho_{11}^{-1/2} = \begin{pmatrix} 0.437 & 0.218 \\ 0.218 & 0.110 \end{pmatrix}. \]
Now we seek the eigenvalues of the matrix $\rho_{11}^{-1/2}\rho_{12}\rho_{22}^{-1}\rho_{21}\rho_{11}^{-1/2}$.
The eigenvalues $\rho_1^2, \rho_2^2$ are
\[ \rho_1^2 = 0.546 \quad \text{and} \quad \rho_2^2 = 0.0009, \]
hence $\rho_1 = 0.74$ and $\rho_2 = 0.03$. The eigenvector $e_1$ associated with $\rho_1^2$ is obtained as
\[ e_1 = \begin{pmatrix} 0.8947 \\ 0.4466 \end{pmatrix}. \]
This implies that the coefficient vector for $M_1$ is
\[ a_1 = \rho_{11}^{-1/2}e_1 = \begin{pmatrix} 0.856 \\ 0.278 \end{pmatrix}. \]
So
\[ M_1 = e_1'\rho_{11}^{-1/2}Z^{(X)} = 0.856Z_1^{(X)} + 0.278Z_2^{(X)}. \tag{3.27} \]
We now find the coefficient vector $b_1$ for $N_1$. We have that $f_1$ is proportional to $\rho_{22}^{-1/2}\rho_{21}\rho_{11}^{-1/2}e_1$ and $b_1 = \rho_{22}^{-1/2}f_1$. Thus $f_1$ is proportional to $\rho_{22}^{-1/2}\rho_{21}a_1$. The constant of proportionality is fixed by requiring that $\operatorname{var}(b_1'Z^{(Y)}) = \operatorname{var}(N_1) = b_1'\rho_{22}b_1 = 1$. Then
\[ \rho_{22}^{1/2}b_1 \propto \rho_{22}^{-1/2}\rho_{21}a_1 \implies b_1 \propto \rho_{22}^{-1/2}\rho_{22}^{-1/2}\rho_{21}a_1 = \rho_{22}^{-1}\rho_{21}a_1 = \begin{pmatrix} 0.403 \\ 0.544 \end{pmatrix}. \]
We normalize $\rho_{22}^{-1}\rho_{21}a_1$. For this vector the quadratic form with $\rho_{22}$ equals 0.546, so
\[ b_1 = \frac{1}{\sqrt{0.546}}\begin{pmatrix} 0.403 \\ 0.544 \end{pmatrix} \quad \text{and} \quad N_1 = b_1'Z^{(Y)} = \frac{0.403}{\sqrt{0.546}}Z_1^{(Y)} + \frac{0.544}{\sqrt{0.546}}Z_2^{(Y)}. \]
The second canonical correlation coefficient is too small, and hence no further calculations are done for the second pair. We later show why only one canonical coefficient is enough.

We now compute the correlations between the original (standardized) variables and the canonical variates $M_1$ and $N_1$. For the first canonical variable pair, we have that
\[ C_Z = (0.86, 0.28) \quad \text{and} \quad D_Z = (0.54, 0.74). \]
The correlation between $M_1$ and $Z^{(X)}$ is
\[ \rho(M_1, Z^{(X)}) = C_Z\rho_{11} = (0.97, 0.62). \]
Similarly,
\[ \rho(N_1, Z^{(Y)}) = D_Z\rho_{22} = (0.69, 0.85), \quad \rho(M_1, Z^{(Y)}) = C_Z\rho_{12} = (0.51, 0.63) \quad \text{and} \quad \rho(N_1, Z^{(X)}) = D_Z\rho_{21} = (0.71, 0.46). \]
We now show that only one canonical variable pair is sufficient to capture the correlation structure. For $k = 1$, the canonical functions are
\[ M_1 = 0.86X_1 + 0.28X_2, \qquad N_1 = 0.54Y_1 + 0.74Y_2. \]
So take $a_1' = (0.86, 0.28)$ and $b_1' = (0.54, 0.74)$. Now
\[ \operatorname{cov}(X_1, M_1) = 0.86\operatorname{cov}(X_1, X_1) + 0.28\operatorname{cov}(X_1, X_2) = 0.97, \]
\[ \operatorname{cov}(Y_1, N_1) = 0.54\operatorname{cov}(Y_1, Y_1) + 0.74\operatorname{cov}(Y_1, Y_2) = 0.69, \]
\[ \operatorname{cov}(X_2, M_1) = 0.86\operatorname{cov}(X_1, X_2) + 0.28\operatorname{cov}(X_2, X_2) = 0.62, \]
\[ \operatorname{cov}(Y_2, N_1) = 0.54\operatorname{cov}(Y_1, Y_2) + 0.74\operatorname{cov}(Y_2, Y_2) = 0.85. \]
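The computations so far in this example can be reproduced numerically. The following is a minimal sketch in Python with NumPy, not code used in the thesis. It assumes $\rho_{21} = \rho_{12}'$, as the symmetry of a correlation matrix requires, and the variable names (`R11`, `a1`, `b1` and so on) are chosen here only for illustration. The printed values should agree with those in the text up to rounding.

```python
import numpy as np

# Correlation blocks of the worked example; R21 is taken as the transpose of R12.
R11 = np.array([[1.0, 0.4], [0.4, 1.0]])
R12 = np.array([[0.5, 0.6], [0.3, 0.4]])
R22 = np.array([[1.0, 0.2], [0.2, 1.0]])
R21 = R12.T

# Symmetric inverse square root of R11 via its eigen-decomposition.
w, V = np.linalg.eigh(R11)
R11_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T

# Matrix whose eigenvalues are the squared canonical correlations.
K = R11_inv_sqrt @ R12 @ np.linalg.inv(R22) @ R21 @ R11_inv_sqrt
lam, E = np.linalg.eigh(K)            # eigenvalues in ascending order
rho1 = np.sqrt(lam[-1])               # first canonical correlation
e1 = E[:, -1]
e1 = e1 if e1[0] > 0 else -e1         # fix the arbitrary sign of the eigenvector

a1 = R11_inv_sqrt @ e1                # coefficient vector of M1
b1 = np.linalg.solve(R22, R21 @ a1)   # proportional to R22^{-1} R21 a1
b1 = b1 / np.sqrt(b1 @ R22 @ b1)      # scale so that var(N1) = 1

print("rho1:", round(float(rho1), 2))
print("a1:", np.round(a1, 3))
print("b1:", np.round(b1, 3))
print("corr(M1, Z^(X)):", np.round(a1 @ R11, 2))
print("corr(N1, Z^(Y)):", np.round(b1 @ R22, 2))
```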
From the covariances computed above, we have that
\[ s^{(1)} = \begin{pmatrix} 0.97 \\ 0.62 \end{pmatrix} \quad \text{and} \quad t^{(1)} = \begin{pmatrix} 0.69 \\ 0.85 \end{pmatrix}, \]
\[ s^{(1)}s^{(1)\prime} = \begin{pmatrix} 0.94 & 0.60 \\ 0.60 & 0.38 \end{pmatrix}, \quad t^{(1)}t^{(1)\prime} = \begin{pmatrix} 0.48 & 0.59 \\ 0.59 & 0.72 \end{pmatrix} \quad \text{and} \quad \rho_1 s^{(1)}t^{(1)\prime} = \begin{pmatrix} 0.50 & 0.61 \\ 0.32 & 0.39 \end{pmatrix}. \]
Thus, considering only one canonical variate pair $(M_1, N_1)$, we check whether $s^{(1)}s^{(1)\prime}$, $t^{(1)}t^{(1)\prime}$ and $\rho_1 s^{(1)}t^{(1)\prime}$ approximate $\rho_{11}$, $\rho_{22}$ and $\rho_{12}$ respectively. From our computations, we have
\[ \begin{pmatrix} 0.50 & 0.61 \\ 0.32 & 0.39 \end{pmatrix} \approx \begin{pmatrix} 0.5 & 0.6 \\ 0.3 & 0.4 \end{pmatrix}. \]
We observe that of the three matrices only $\rho_1 s^{(1)}t^{(1)\prime}$ gives a reasonable approximation, namely to $\rho_{12}$. This result conforms to the observation made above that $\sum_{i=k+1}^{p} \rho_i s^{(i)}t^{(i)\prime}$ is very close to the null matrix.

We calculate the proportion of total variance explained by $M_1$ and by $N_1$:
\[ \frac{\operatorname{tr}\left(s^{(1)}s^{(1)\prime}\right)}{\operatorname{tr}\rho_{11}} = \frac{0.94 + 0.38}{2} \simeq 66\%, \qquad \frac{\operatorname{tr}\left(t^{(1)}t^{(1)\prime}\right)}{\operatorname{tr}\rho_{22}} = \frac{0.48 + 0.72}{2} \simeq 60\%. \]
$M_1$ explains 66% of the total variation in $X$ and $N_1$ explains 60% of the variation in $Y$. This shows that the first canonical variate pair is enough to capture sufficient covariance structure of the sets of variables.

Chapter 4

Results

This chapter presents the results and discussion of the analysis of the available data set. The chapter is sub-divided into four sections. The first section gives a brief description of the data and the variables used. The second section describes the characteristics of the glioblastoma patients, and the third section presents the main results of the analysis. The final section presents a summary of the results obtained from the analysis.

4.1 Data

The data set consists of thirty-two (32) variables. The neuroimage features are explored using six (6) variables, while the copy number variations of patients comprise 26 variables. We define the neuroimage feature variables as set M and the copy number variation variables as set N. Five hundred and twenty-seven (527) GBM patients were involved in this analysis.
Out of the 527 patients, only 267 had a corresponding MRI of their tumor available. Hence 267 patients were involved in the main analysis.

4.1.1 Patient Features

The VASARI lexicon for magnetic resonance imaging annotation contains several imaging descriptors based on different magnetic resonance imaging modalities [13]. The cardinal image features presented by Gutman et al. [13] are edema, necrosis, non-contrast-enhancing tumor (nCET) and enhancing tumor. We added two more features to the cardinal features: the major axis length and the minor axis length of the tumor. So the following magnetic resonance imaging features of Glioblastoma patients, available from The Cancer Imaging Archive (TCIA), were used for the analysis: edema, necrosis, non-contrast-enhancing tumor, enhancing tumor, major axis length and minor axis length. Table 4.1 lists each image feature with its description.

The copy number variations of the Glioblastoma patients were obtained from The Cancer Genome Atlas (TCGA). The copy number variation variables are measured as homozygous deletion, hemizygous deletion, neutral/no change, gain and high-level amplification. Further information about the patients was acquired from TCGA to assess some characteristic features of the patients. Table 4.2 gives the variables (genes) in the copy number variation data for the patients.

Table 4.1: Description of Neuro-image Features Used

Variable Name: Description
Edema: What proportion of the abnormality is vasogenic edema? It is an accumulation of fluid in the brain that happens when the blood-brain barrier is broken. Edema should be greater in signal than nCET and somewhat lower in signal than CSF. (Pseudopods are characteristic of edema.)
Proportion Necrosis: Defined as the region within the tumor that does not enhance or shows markedly diminished enhancement, is high on T2W and proton density images, is low on T1W images, and has an irregular border.
Proportion Enhancing: Proportion of tumor that is enhancing. (Assuming that the entire abnormality may be comprised of: (1) an enhancing component, (2) a non-enhancing component, (3) a necrotic component and (4) an edema component.)
Proportion nCET: Defined as the regions of T2W hyperintensity (less than the intensity of cerebrospinal fluid, with corresponding T1W hypointensity) that are associated with mass effect and architectural distortion, including blurring of the gray-white interface. (Assuming that the entire abnormality may be comprised of: (1) an enhancing component, (2) a non-enhancing component, (3) a necrotic component and (4) an edema component.)
Major Axis: Largest perpendicular (x-y) cross-sectional diameter of T2 signal abnormality measured on a single axial image only.
Minor Axis: Smallest perpendicular (x-y) cross-sectional diameter of T2 signal abnormality measured on a single axial image only.

Table 4.2: Copy Number Variation Variables (Genes)

Variable   Label
akt1       AKT serine/threonine kinase 1
akt2       AKT serine/threonine kinase 2
akt3       AKT serine/threonine kinase 3
ccnd2      cyclin D2
cdk4       cyclin dependent kinase 4
cdk6       cyclin dependent kinase 6
cdk2na     cyclin dependent kinase inhibitor 2A
cdkn2c     cyclin dependent kinase inhibitor 2C
egfr       epidermal growth factor receptor
erbb2      erb-b2 receptor tyrosine kinase 2
foxo1      forkhead box O1
foxo3      forkhead box O3
hras       HRas proto-oncogene, GTPase
kras       KRAS proto-oncogene, GTPase
mdm2       MDM2 proto-oncogene
mdm4       MDM4 proto-oncogene
met        MET proto-oncogene, receptor tyrosine kinase
nf1        neurofibromin 1
nras       neuroblastoma RAS viral oncogene homolog
pdgfra     platelet derived growth factor receptor alpha
pik3ca     phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit alpha
pik3r1     phosphoinositide-3-kinase regulatory subunit 1
pten       phosphatase and tensin homolog
rb1        RB transcriptional corepressor 1
spry2      sprouty RTK signaling antagonist 2
tp53       tumor protein p53

4.2 Preliminaries

This section describes some notable characteristics of the Glioblastoma patients: sex, age at diagnosis, survival status (deceased or living), expression subtype, and overall survival time after diagnosis (length of time from diagnosis to death). Frequencies and descriptive statistics of these variables are presented and discussed.

Table 4.3 shows that, of the 527 GBM patients, the majority (61.5%) are male. Also, more than three quarters (77%) of the patients were deceased as at March 2016. The mean survival time from diagnosis to death was 15.10 months with a standard deviation of 16.54 months. The mean age at diagnosis was 58 years (Table 4.4). The survival time and age at diagnosis in our data set conform to the cancer statistics of 2012 [35], which stated that GBM is generally diagnosed at an average age of 55 years and gives the affected patient an average survival time of only 10 to 18 months.

Table 4.3: Sex and Survival Status Distribution of Patients

Characteristic      Frequency   Percentage
Sex
  Male              324         61.5
  Female            203         38.5
Survival Status
  Deceased          406         77.0
  Living            121         23.0

Table 4.4: Age and Overall Survival Time of Patients

Variable                     Minimum   Maximum   Mean    SD
Age (in years)               10        89        58.23   14.31
Survival time (in months)    0         128       15.10   16.54

The Cancer Genome Atlas (TCGA) in 2011 indicated four distinct expression subtypes of GBM [1]. The four subtypes were Classical, Proneural, Neural and Mesenchymal. The Classical GBM tumors are always characterized by extremely high levels of EGFR.
However, abnormality of the EGFR gene occurs at a lower rate in the other three subtypes. Furthermore, tumor protein p53 (TP53), the most frequently mutated gene in GBM, is not mutated in the Classical GBM tumors. TP53 is, however, significantly mutated in the Proneural tumors. Only Proneural tumors have abnormally high levels of mutations of PDGFRA. The highest frequency of mutations in the tumor suppressor gene NF1 is found in the Mesenchymal group. Tumor suppressor genes such as TP53 and PTEN also have frequent mutations in this group. For the Neural group, no gene stands out with an abnormally high or low mutation rate [1]. A CpG Island Methylator Phenotype (G-CIMP) has also been identified, which presents a distinct subgroup of GBM [25].

Table 4.5 shows that the largest share (26.5%) of the GBM patients in our dataset have the Mesenchymal subtype, followed by the Classical subtype (25.1%).

Table 4.5: Frequency Distribution of Expression Subtype

Subtype          Frequency   Percent
Classical        144         25.1
G-CIMP           38          6.6
Mesenchymal      152         26.5
Neural           83          14.5
Proneural        97          16.9
Not Available    13          2.3

4.3 Main Results

4.3.1 Correlation matrix of variables

Canonical correlation analysis requires that there be no high correlations within each of the sets of variables. So we checked for correlations within the sets of variables. Tables 4.6 and 4.7 list the correlation coefficients within each variable set. Variable set 1 is the VASARI neuroimage features, whereas variable set 2 is the copy number variation variables. Table 4.6 shows the correlations between the VASARI neuroimage features and Table 4.7 presents the correlations between the copy number variation variables.

Among the VASARI features, the correlation coefficient farthest from zero was −0.6443, which is the correlation between enhancing and edema.
This indicates that as the proportion of edema increases, the proportion of enhancing tumor decreases, and vice versa. The major axis of the tumor has a positive relationship with the minor axis and with nCET. However, the major axis showed a negative relationship with necrosis, edema and enhancing. Edema is negatively correlated with all other features. nCET recorded a positive relationship with the major axis, the minor axis and necrosis.

For the copy number variations, the correlation coefficient farthest from zero among the 26 variables was 0.8962 (Table 4.7), between the foxo1 gene and the rb1 gene. This shows that the foxo1 and rb1 genes have a strong positive relationship: an amplification of a patient's foxo1 gene tends to be accompanied by an amplification of the patient's rb1 gene, and vice versa.

Table 4.6: Correlations for Variable Set 1

             Major Axis  Minor Axis    Necrosis       Edema        nCET   Enhancing
Major Axis       1.0000
Minor Axis       0.4828      1.0000
Necrosis        -0.0152      0.1356      1.0000
Edema           -0.0752     -0.3974     -0.2578      1.0000
nCET             0.1168      0.1724      0.0160     -0.2488      1.0000
Enhancing       -0.0203      0.2519     -0.0034     -0.6443     -0.1208      1.0000

Table 4.7: Correlations for the Copy Number Variation Variables

Variable     akt1     akt2     akt3    ccnd2     cdk4     cdk6   cdk2na   cdkn2c     egfr    erbb2    foxo1    foxo3
akt1       1.0000
akt2       0.3072   1.0000
akt3       0.0244   0.2297   1.0000
ccnd2     -0.0585   0.0806  -0.0109   1.0000
cdk4      -0.1009   0.0722   0.0887   0.4062   1.0000
cdk6       0.0921   0.1925   0.0311   0.0562   0.0524   1.0000
cdk2na     0.0569  -0.1651   0.0470   0.0066   0.2427  -0.1943   1.0000
cdkn2c     0.1730   0.0992   0.4207  -0.0629   0.0176  -0.0088   0.1245   1.0000
egfr       0.2634   0.2630  -0.0267   0.0035   0.1480   0.5317  -0.1641   0.0125   1.0000
erbb2      0.0487  -0.0836   0.0751  -0.2019   0.0057  -0.1688  -0.0588  -0.0746  -0.0058   1.0000
foxo1      0.1566  -0.0414  -0.1745   0.0001   0.0118  -0.0048  -0.1429  -0.1155   0.2134   0.2780   1.0000
foxo3      0.4316   0.1600   0.0577  -0.2336  -0.1191  -0.1039   0.1712   0.1527   0.0310   0.2872   0.0328   1.0000
hras       0.0757  -0.0546  -0.1156  -0.0973  -0.0315   0.0575  -0.1985  -0.0313   0.2057   0.0090  -0.0644   0.0314
kras      -0.1672   0.0850   0.1797   0.6658   0.4086  -0.0853  -0.0583   0.0909  -0.0645  -0.1893  -0.0974  -0.2586
mdm2      -0.0172   0.1293   0.2369   0.3919   0.6349   0.1119   0.0436   0.1182   0.2073  -0.0677  -0.0115  -0.1872
mdm4       0.0080   0.0050   0.4332   0.0200   0.0799  -0.0355   0.0477   0.2399   0.0619   0.0623  -0.2043   0.0954
met        0.1184   0.1034  -0.0171   0.0509   0.0340   0.6795  -0.0528  -0.0016   0.3681  -0.2262  -0.0306  -0.0205
nf1        0.0560  -0.0036   0.0588  -0.1925   0.0587  -0.1221   0.0258  -0.0979   0.0446   0.8323   0.2801   0.2985
nras      -0.0228   0.0153   0.5424  -0.0938   0.0172  -0.0099   0.1558   0.4750  -0.0369   0.1158  -0.1737   0.2416
pdgfra    -0.1186  -0.0684   0.0142  -0.1621   0.0568  -0.1763   0.0188  -0.0579  -0.1502   0.0691   0.0276   0.0687
pik3ca    -0.1534  -0.1360   0.0782  -0.2153  -0.1913  -0.0298  -0.0921   0.0439  -0.0283   0.1617  -0.1309  -0.0823
pik3r1    -0.0123   0.1168  -0.1214   0.0781   0.0712   0.3163  -0.1567   0.1627   0.1767  -0.1206   0.1065  -0.0631
pten      -0.0678  -0.0984  -0.0462   0.1408  -0.0412  -0.3242   0.2137   0.0579  -0.1502   0.0691   0.1718   0.0655
rb1        0.2131   0.0598  -0.2147   0.0782   0.0633    0.346  -0.1346  -0.1344   0.2282   0.2064   0.8962  -0.0277
spry2      0.0444  -0.0568  -0.0835  -0.0375   0.0012   0.0258  -0.0870  -0.0031   0.1697   0.2115   0.7361   0.0089
tp53       0.1030   0.0385  -0.2054  -0.0149   0.1742  -0.0396  -0.1314  -0.0800   0.0831   0.5962   0.3445   0.1749

Table 4.8: Correlations for the Copy Number Variation Variables (continued)

Variable     hras     kras     mdm2     mdm4      met      nf1     nras   pdgfra   pik3ca   pik3r1     pten      rb1    spry2
hras       1.0000
kras      -0.0013   1.0000
mdm2       0.0796   0.4919   1.0000
mdm4        0.169   0.0642  -0.0025   1.0000
met        0.0723  -0.0967   0.0742  -0.0179   1.0000
nf1        0.0468  -0.1841  -0.0940   0.0228  -0.1840   1.0000
nras      -0.0541   0.0751   0.0380   0.3072   0.0079   0.0950   1.0000
pdgfra    -0.2558  -0.1153   0.0211   0.1107  -0.0817   0.1272  -0.0608   1.0000
pik3ca    -0.0818  -0.1110  -0.1209   0.0405  -0.0035   0.0448   0.0536  -0.0132   1.0000
pik3r1     0.0034   0.0180   0.0826  -0.0591   0.3858  -0.1868   0.0174  -0.0731  -0.1010   1.0000
pten      -0.0028   0.0675  -0.0102  -0.0363  -0.2668   0.0790   0.1167  -0.0271  -0.0257   0.0200   1.0000
rb1       -0.0856  -0.1199   0.0180  -0.2309   0.0040   0.2060  -0.2139  -0.0022  -0.1577   0.1360   0.1456   1.0000
spry2     -0.0330  -0.1979  -0.0595  -0.0160   0.0002   0.1884   0.0044  -0.0293  -0.1076   0.1065   0.2381   0.7404   1.0000
tp53       0.1641  -0.1804   0.0810  -0.1645  -0.1200   0.6250  -0.1912   0.1097  -0.2035   0.1164   0.0355   0.3393   0.2742

Table 4.9: Correlations between Variable Set 1 and Variable Set 2

Variable    Major Axis  Minor Axis    Necrosis       Edema        nCET   Enhancing
akt1           -0.0289     -0.2096      0.1527      0.0281     -0.2009     -0.0587
akt2            0.0271     -0.0422     -0.0062     -0.0024     -0.1415      0.0594
akt3            0.1124      0.1127      0.0157      0.1128     -0.0196      0.0094
ccnd2          -0.0192     -0.1031      0.0303     -0.0629      0.0140      0.0103
cdk4            0.0141     -0.0816      0.1420      0.0031     -0.3819      0.0483
cdk6           -0.0236     -0.0024     -0.0659     -0.0416     -0.2304      0.0700
cdk2na         -0.0265     -0.1238      0.0559      0.1030      0.1351     -0.2265
cdkn2c         -0.1915     -0.0106      0.1025     -0.0262      0.1272      0.0155
egfr            0.0772      0.0195     -0.0626      0.0062     -0.1012      0.0867
erbb2           0.1583      0.1254     -0.0789      0.1716      0.0138     -0.1357
foxo1           0.0979      0.3049     -0.1601     -0.1051     -0.0991      0.1337
foxo3           0.0067      0.0066      0.1963      0.0420     -0.0899     -0.0708
hras           -0.0282     -0.0790     -0.1048      0.1182      0.0174     -0.0265
kras            0.0288     -0.0641      0.0527     -0.0821      0.0586      0.0724
mdm2           -0.0116      0.0226      0.1213     -0.0244     -0.0498      0.0478
mdm4            0.0430      0.0116      0.0812     -0.0212      0.0375      0.0729
met            -0.0707      0.0033      0.0124     -0.1086     -0.0251      0.0212
nf1             0.1517      0.2630     -0.0658      0.0481      0.0132     -0.0629
nras           -0.1572      0.1336      0.1472      0.1360      0.0256     -0.0295
pdgfra          0.2463      0.2697      0.7260     -0.1559     -0.0174      0.0033
pik3ca         -0.0046      0.0441     -0.0637      0.1438     -0.0432     -0.1074
pik3r1         -0.0621      0.0953     -0.0721     -0.0738     -0.1124      0.0897
pten           -0.3747     -0.2399      0.2036      0.0265      0.0331     -0.0672
rb1             0.0462      0.0101     -0.2429     -0.1075     -0.1338      0.1660
spry2          -0.4289     -0.0147     -0.4163     -0.0987      0.4001      0.1351
tp53            0.4475      0.0736     -0.4066     -0.0740      0.3999     -0.0047

The correlations between the copy number variation variables and the image features are presented in Table 4.9. There are both negative and positive relationships between the variable sets. The highest correlation coefficient (0.7260) existed between pdgfra and necrosis. Otherwise, the correlations between the two variable sets are relatively low. Moderate correlations (−0.4163, −0.4066) existed between spry2, tp53 and necrosis respectively.
Moderate correlations also exist between major axis and tp53 (0.4475), spry2 (-0.4289) and pten (-0.3747). Moreover, nCET is moderately correlated with cdk4 (-0.3819), spry2 (0.4001) and tp53 (0.3999). These bivariate correlations suggest a relationship between some of the features and genes in the study.

The raw canonical coefficients are the weights of the M-variables and the N-variables that maximize the correlation between the two sets of variables. They are interpreted in the same way as regression coefficients. So from Table 4.10, for the variate M1, a unit increase in the proportion of necrosis leads to an increase of 1.6797 in the score on the first canonical variate of the neuro-image features, with all other variables held constant.

Table 4.10: Raw Coefficients for the Neuro-image features

Variables   1        2        3        4        5        6
Major Axis  0.4264   0.1005   0.6760   0.2438   0.2857  -0.2874
Minor Axis  0.1240  -0.6065  -0.4383  -0.2898   0.0600   0.0756
Necrosis    1.6797   2.2665  -2.3086   0.3454   1.3586  -1.6622
Edema       0.6631  -0.6532  -1.1169   1.4566  -0.7277  -1.1151
nCET       -1.2893  -0.3184  -0.3771   0.8520   0.5133  -1.5226
Enhancing   0.2989   0.0219  -0.2357   0.2624  -1.1305  -1.6793

Table 4.11: Raw Coefficients for the Copy Number Variation Variables

Variables   1        2        3        4        5        6
Akt1        0.1207   0.9216  -0.0343   0.2824   0.2217   0.4739
Akt2       -0.0602  -0.1999   0.0780   0.1185  -0.1724   0.4816
Akt3        0.7972  -0.7069   1.1209   0.6279  -0.0815  -0.3316
ccnd2       0.2203  -0.6280  -0.2054   0.1681   0.4499   0.7753
cdk4        0.7373   0.7428  -0.0870   0.1356  -0.5273   0.3419
cdk6        0.3429   0.3914   0.2704  -0.0327  -0.4198   0.7079
cdk2na     -0.3741  -0.3616   0.3788   0.5088   0.4692   0.2825
cdkn2c     -0.6646   0.2594  -0.3989  -0.1759   0.1041  -0.3262
egfr        0.1092  -0.3501   0.2107   0.2607   0.1746  -0.6444
erbb2       0.3651  -0.2714  -0.0902   2.4166   0.1850   0.0921
foxo1       0.6733  -1.2847   0.0458  -0.5207   0.7877   1.0521
foxo3       0.5249   0.3438  -0.0674  -0.3994  -0.0492   0.0241
hras        0.3176  -0.4606   0.1801   0.7627  -0.2045   0.2671
kras       -0.9254   0.6619   1.1563  -0.0723   0.3101  -1.0851
mdm2       -0.1203  -0.2112  -0.6641  -0.4173   0.0846  -0.1012
mdm4       -0.1421   0.3140   0.1615  -0.2607  -0.0856  -0.6363
met        -0.8376   0.4377  -0.1256  -0.4054   0.6902   0.0058
nf1         0.1457  -0.3427  -0.0013  -1.6359   0.0401   0.2830
nras        0.1661  -0.5333  -1.9065  -0.2347  -0.1279  -0.2008
pdgfra      0.4226  -0.0704   0.0563  -0.3877   0.7704   0.0288
pik3ca      0.0430  -0.1970  -0.0217  -0.0556  -0.0546   0.8620
pik3r1      0.6607  -0.8831   0.1641  -0.4134  -0.4106   0.8242
pten        0.0186   1.3438  -0.4755   0.1518   0.0121   0.1979
rb1         0.5788   0.9265   0.4969   0.0324  -1.0624  -1.5470
spry2      -1.4400  -0.0932   0.0854  -0.1515  -0.6050   0.6273
tp53       -1.4510  -0.0840   0.3839  -0.4220   0.6305  -0.4147

4.3.2 Assessment of Overall Model Fit

We now present results on the overall statistical fit of the model. The multivariate F-test and its corresponding Wilks' lambda evaluate the hypotheses below.

H0: The canonical correlation coefficients for all functions are zero.
H1: The canonical correlation coefficient for at least one function is not zero.

We also test, for each canonical function in turn, the null hypothesis that its canonical correlation coefficient is zero.

From Table 4.12, the null hypothesis for the entire model is rejected at the 0.05 significance level, hence we conclude that at least one canonical function has a non-zero canonical correlation coefficient. Table 4.13 confirms that the first three canonical correlation coefficients are statistically significant at the 0.05 level; that is, the null hypothesis that the canonical correlation coefficient is zero is rejected for each of the first three canonical functions. The remaining three correlation coefficients are not significant according to the multivariate F-tests and Wilks' lambda, so they are not interpreted.
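The link between the canonical correlations and the test statistics reported in Tables 4.12 and 4.13 can be sketched numerically. The snippet below is a minimal illustration and not the software used in this thesis. It assumes that Wilks' lambda for the test of functions k to 6 is the product of (1 - r_i^2) over the remaining canonical correlations, and that the F statistic is Rao's approximation with n = 267 patients, p = 6 image features and q = 26 genes. The function name `wilks_and_rao_f` is hypothetical; the correlations are those reported in Table 4.14.

```python
import math

def wilks_and_rao_f(canon_corrs, n, p, q):
    """Sequential Wilks' lambda with Rao's F approximation (assumed here).

    Row k (0-based) tests H0: canonical correlations k+1, ..., m are all zero.
    Each row is (wilks_lambda, df1, df2, F).
    """
    t = n - 1 - (p + q + 1) / 2.0
    rows = []
    for k in range(len(canon_corrs)):
        # Wilks' lambda from the remaining canonical correlations.
        lam = math.prod(1.0 - r * r for r in canon_corrs[k:])
        pk, qk = p - k, q - k
        df1 = pk * qk
        denom = pk**2 + qk**2 - 5
        s = math.sqrt((pk**2 * qk**2 - 4) / denom) if denom > 0 else 1.0
        df2 = t * s - df1 / 2.0 + 1.0
        root = lam ** (1.0 / s)
        f_stat = (1.0 - root) / root * df2 / df1
        rows.append((lam, df1, df2, f_stat))
    return rows

# Canonical correlations from Table 4.14; sample size and set sizes from the text.
rows = wilks_and_rao_f([0.6704, 0.6347, 0.5552, 0.4844, 0.4285, 0.3250],
                       n=267, p=6, q=26)
for k, (lam, df1, df2, f_stat) in enumerate(rows, start=1):
    print(f"test {k}: lambda={lam:.4f}  df1={df1}  df2={df2:.2f}  F={f_stat:.4f}")
```

With the rounded correlations of Table 4.14, the first three rows give Wilks' lambda of about 0.127, 0.231 and 0.387 and F statistics of about 3.75, 3.24 and 2.66, in line with the first three tests of Table 4.13.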
Table 4.12: Test of Significance of all Canonical Correlations

                Statistic   df1   df2       F        Prob>F
Wilks' Lambda   0.127081    156   1386.69   3.7459   0.0000

Table 4.13: Test of Significance of each Canonical Correlation

Test of Canonical Correlation   Wilks' Lambda   df1   df2       F        Prob>F
1                               0.127081        156   1386.69   3.7459   0.0000
2                               0.230809        125   1166.35   3.2384   0.0000
3                               0.38655         96    941.39    2.6591   0.0001
4                               0.730162        69    350.64    1.330    0.0514
5                               0.812344        44    248.03    1.2001   0.1957
6                               0.894344        21    160.12    1.0831   0.3711

The canonical correlation coefficients and the eigenvalues (canonical roots) for each of the functions are shown in Table 4.14. The canonical correlation coefficient gives the magnitude of the relationship between the two variates of a pair.

Table 4.14: Canonical Correlations and Eigenvalues

Function       1        2        3        4        5        6
Coefficients   0.6704   0.6347   0.5552   0.4844   0.4285   0.3250
Eigenvalues    0.4494   0.4028   0.3082   0.2346   0.1836   0.1056

Table 4.15 presents the canonical redundancy index for the canonical correlations. In the first canonical function, the redundancy for the M-variables is 0.2012 and the redundancy for the N-variables is 0.2101. These values indicate that each variate explains almost the same amount of variance in the opposite set of variables in this canonical function. For the second function, the redundancy measures for the M-variables and N-variables are 0.1876 and 0.1501. This means that, in the second function, the variate for the N-variables explains less variance in the M-variables than the variate for the M-variables explains in the set of N-variables.
Table 4.15: Canonical redundancy analysis for Canonical Correlations

Canonical redundancy analysis for Canonical Correlation 1
Canonical Correlation Coefficient            0.6704
Squared Canonical Correlation Coefficient    0.4494
Proportion of standardized variance          O.V      OP.V
of M variables with                          0.3001   0.2101
of N variables with                          0.3121   0.2112

Canonical redundancy analysis for Canonical Correlation 2
Canonical Correlation Coefficient            0.6347
Squared Canonical Correlation Coefficient    0.4028
Proportion of standardized variance          O.V      OP.V
of M variables with                          0.4212   0.1501
of N variables with                          0.3212   0.1876

Canonical redundancy analysis for Canonical Correlation 3
Canonical Correlation Coefficient            0.5552
Squared Canonical Correlation Coefficient    0.3082
Proportion of standardized variance          O.V      OP.V
of M variables with                          0.3992   0.1001
of N variables with                          0.3685   0.1019

O.V = Own Variate, OP.V = Opposite Variate

4.3.3 Interpreting Canonical Variate Pairs

Based on the F-tests and Wilks' lambda, only three canonical correlation coefficients are significant, so we interpret and report the contribution of each original variable to these three canonical functions. We use the standardized canonical coefficients and the canonical loadings to assess the relative contributions of the variables.

A canonical function is interpreted by examining the magnitude and sign of the standardized canonical coefficient or the canonical loading assigned to each original variable in its canonical variate. Variables with larger coefficients in absolute value contribute more to the variate. We set a coefficient threshold of |0.5| and above to identify the most important variables in a canonical function. Moreover, original variables whose coefficients have opposite signs show an inverse association with one another.
Conversely, original variables whose coefficients have the same sign show a direct association. However, interpreting the contribution of an original variable through its canonical coefficient faces the same problems as interpreting beta values in a regression model, so the results of a canonical analysis must be interpreted with caution [2]. One such problem is that the weights (coefficients) are subject to considerable variability from one sample to another. The canonical loadings are therefore also used to assess the contribution of the original variables. If the standardized coefficients and the canonical loadings lead to similar findings, this supports the reliability of the results.

4.3.4 Interpretation of Canonical Variate Using Canonical Weights

Here we present and interpret the standardized coefficients. Standardized coefficients make comparisons among variables easier when the variables have different standard deviations; because the coefficients are standardized, their weights can be compared directly. The relative size of the canonical weights within a canonical root indicates the relative contribution of each variable to that root [2].

The standardized canonical coefficients for the significant functions are shown in Table 4.16. Considering the first set of variables (neuro-image features) and the first canonical function, nCET is the most important, followed by major axis, then edema and necrosis. A one standard deviation increase in the proportion of necrosis leads to a 0.4280 standard deviation increase in the score on the first canonical variate of the neuro-image features when all other variables are held constant. Also, a one standard deviation increase in nCET leads to a 0.6407 standard deviation decrease in the score on the same variate, with the other variables held constant.
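The relationship between the raw coefficients of Table 4.10 and the standardized coefficients of Table 4.16 can be checked numerically. The sketch below assumes the usual convention that a standardized coefficient equals the raw coefficient multiplied by the sample standard deviation of the variable, so the ratio standardized/raw should be the same for every function. The variable names and the use of NumPy are choices made for illustration, and the implied standard deviations are derived quantities that are not reported in the thesis.

```python
import numpy as np

features = ["Major Axis", "Minor Axis", "Necrosis", "Edema", "nCET", "Enhancing"]

# Raw coefficients, functions 1-3 (Table 4.10).
raw = np.array([
    [0.4264, 0.1005, 0.6760],
    [0.1240, -0.6065, -0.4383],
    [1.6797, 2.2665, -2.3086],
    [0.6631, -0.6532, -1.1169],
    [-1.2893, -0.3184, -0.3771],
    [0.2989, 0.0219, -0.2357],
])

# Standardized coefficients, functions 1-3 (Table 4.16).
standardized = np.array([
    [0.5317, 0.1253, 0.8430],
    [0.1914, -0.9363, -0.6766],
    [0.4280, 0.5774, -0.5882],
    [0.4327, -0.4263, -0.7288],
    [-0.6407, -0.1582, -0.1874],
    [0.2125, 0.0156, -0.1675],
])

# Assumption: standardized = raw * sd, so each row of ratios estimates one sd.
ratios = standardized / raw
implied_sd = ratios.mean(axis=1)
spread = np.ptp(ratios, axis=1)  # max - min across the three functions

for name, sd, s in zip(features, implied_sd, spread):
    print(f"{name:<10}  implied sd = {sd:.3f}  (spread across functions {s:.4f})")
```

The ratios agree across the three functions to within rounding (for example, about 0.255 for necrosis), which is consistent with the assumed convention.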
For the second canonical function, the most important features are minor axis, necrosis and edema. The third canonical function has high coefficient values for major axis, minor axis, necrosis and edema.

Considering the standardized coefficients of the copy number variations in Table 4.17, spry2, tp53, cdk4, foxo1, met, pdgfra, rb1, cdk2na, cdkn2c and akt3 are most closely related to the first canonical function, since their coefficients exceed |0.3|, whilst foxo1, cdk4, akt1, pten, rb1, akt3, ccnd2, cdk2na, pik3r1 and kras are most closely related to the second canonical function. The third canonical function is most closely related to nras, kras, akt3, mdm2 and cdk2na. Table 4.18 below summarizes the most important features and genes for each function based on the magnitude of the standardized coefficients, with a threshold of |0.5| and above.

Table 4.16: Standardized Coefficients for the Neuro-image features

Variables   1        2        3
Major Axis  0.5317   0.1253   0.8430
Minor Axis  0.1914  -0.9363  -0.6766
Necrosis    0.4280   0.5774  -0.5882
Edema       0.4327  -0.4263  -0.7288
nCET       -0.6407  -0.1582  -0.1874
Enhancing   0.2125   0.0156  -0.1675

Table 4.17: Standardized Coefficients for the Copy Number Variation Variables

Variables   1        2        3
Akt1        0.0735   0.5615  -0.0209
Akt2       -0.0355  -0.1178   0.0459
Akt3        0.3587  -0.3181   0.5040
ccnd2       0.1246  -0.3551  -0.1162
cdk4        0.6223   0.6269  -0.0734
cdk6        0.1661   0.1896   0.1310
cdk2na     -0.3365  -0.3253   0.3407
cdkn2c     -0.3274   0.1278  -0.1965
egfr        0.0755  -0.2418   0.1455
erbb2       0.1871  -0.1390  -0.0462
foxo1       0.3640  -0.6945   0.0247
foxo3       0.2905   0.1903  -0.0373
hras        0.1534  -0.2224   0.0870
kras       -0.0145   0.3219   0.5623
mdm2       -0.0895  -0.1571  -0.4939
mdm4       -0.0929   0.2052   0.1055
met        -0.4111   0.2148  -0.0617
nf1         0.0759  -0.1785  -0.0007
nras        0.0772  -0.2479  -0.8862
pdgfra      0.3257  -0.0542   0.0434
pik3ca      0.0235  -0.1076  -0.0118
pik3r1      0.2777  -0.3712   0.0690
pten        0.0077   0.5549  -0.1963
rb1         0.3224   0.5161   0.2768
spry2      -0.7778  -0.0503   0.0461
tp53       -0.6892  -0.0399   0.1823

Table 4.18: Summary of Important Related Variables

Function 1                 Function 2                 Function 3
Image features  Coeff.     Image features  Coeff.     Image features  Coeff.
nCET           -0.6407     Minor Axis     -0.9363     Major Axis       0.8430
Major Axis      0.5317     Necrosis        0.5774     Edema           -0.7288
                                                      Minor Axis      -0.6766
                                                      Necrosis        -0.5882
CNV             Coeff.     CNV             Coeff.     CNV             Coeff.
spry2          -0.7778     foxo1          -0.6945     nras            -0.8862
tp53           -0.6892     cdk4            0.6269     kras             0.5623
cdk4            0.6223     Akt1            0.5615     Akt3             0.5040
                           pten            0.5549     cdk2na           0.5001
                           rb1             0.5161

4.3.5 Interpretation of Canonical Variate Using Canonical Loadings

Table 4.19 shows that major axis, nCET and necrosis are most closely related to the first canonical function, since their loadings exceed |0.3|. The second canonical function is closely related to minor axis, necrosis and major axis. The third function is most closely related to major axis and necrosis. From Table 4.20, tp53, spry2, cdk4, pdgfra and cdk2na are closely related to the first function, while akt1, pten, foxo1, akt3, cdk4, nf1, erbb2 and rb1 are closely related to the second function. Also, nras, cdkn2c, cdkn2a, foxo1, mdm2, rb1, akt3 and kras are closely related to the third. Table 4.21 below summarizes the most important features and genes for each function based on the magnitude of the canonical loadings, with a threshold of |0.5| and above.
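The selection rule with a threshold of |0.5| can be sketched as follows. This is only an illustration using the neuro-image loadings reported in Table 4.19; the helper name `important_variables` is hypothetical.

```python
# Canonical loadings of the neuro-image features on functions 1-3 (Table 4.19).
loadings = {
    "Major Axis": (0.5059, -0.3222, 0.5615),
    "Minor Axis": (0.2772, -0.6514, -0.1343),
    "Necrosis": (0.3233, 0.5559, -0.5072),
    "Edema": (0.2289, -0.1832, -0.2172),
    "nCET": (-0.6721, -0.1915, -0.0134),
    "Enhancing": (0.0470, 0.0689, 0.1391),
}

def important_variables(loadings, function_index, threshold=0.5):
    """Return (name, loading) pairs with |loading| >= threshold, largest first."""
    picked = [
        (name, values[function_index])
        for name, values in loadings.items()
        if abs(values[function_index]) >= threshold
    ]
    return sorted(picked, key=lambda item: abs(item[1]), reverse=True)

for k in range(3):
    print(f"Function {k + 1}:", important_variables(loadings, k))
```

Applied to the copy number variation loadings of Table 4.20, the same rule yields the gene lists for each function.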
Table 4.19: Canonical Loadings for the Neuro-image features

Variables   1        2        3
Major Axis  0.5059  -0.3222   0.5615
Minor Axis  0.2772  -0.6514  -0.1343
Necrosis    0.3233   0.5559  -0.5072
Edema       0.2289  -0.1832  -0.2172
nCET       -0.6721  -0.1915  -0.0134
Enhancing   0.0470   0.0689   0.1391

Table 4.20: Canonical Loadings for the Copy Number Variation Variables

Variables   1        2        3
Akt1        0.1107   0.5473   0.0647
Akt2        0.1580   0.1002   0.1321
Akt3        0.2259  -0.4004  -0.5277
ccnd2      -0.0760   0.2148   0.1390
cdk4        0.6696   0.5968   0.0133
cdk6        0.0584   0.0010   0.1143
cdk2na     -0.3552   0.2199  -0.5610
cdkn2c     -0.2230   0.0574  -0.3996
egfr        0.1550  -0.0473   0.1596
erbb2       0.1655  -0.3476  -0.0178
foxo1       0.2746  -0.6825   0.3215
foxo3       0.2231   0.1626  -0.2092
hras       -0.0606  -0.0688   0.0115
kras       -0.0525   0.2893   0.5230
mdm2        0.1230   0.1033  -0.3419
mdm4        0.0629   0.0719  -0.0418
met        -0.0866   0.0721   0.0202
nf1         0.1233  -0.3074   0.0529
nras        0.0614  -0.1925  -0.8356
pdgfra      0.3336  -0.0345   0.0154
pik3ca      0.0684  -0.2123  -0.1349
pik3r1      0.0202  -0.1385  -0.0263
pten       -0.1482   0.6384  -0.2301
rb1         0.0594  -0.5262   0.3453
spry2      -0.7745  -0.1333   0.1369
tp53       -0.6364  -0.1668   0.1664

Table 4.21: Summary of Important Related Variables

Function 1                 Function 2                 Function 3
Image features  Loading    Image features  Loading    Image features  Loading
nCET           -0.6721     Minor Axis     -0.6514     Major Axis       0.5615
Major Axis      0.5059     Necrosis        0.5559     Necrosis        -0.5072
CNV             Loading    CNV             Loading    CNV             Loading
spry2          -0.7745     foxo1          -0.6825     nras            -0.8356
cdk4            0.6696     pten            0.6384     cdk2na          -0.5610
tp53           -0.6364     cdk4            0.5968     Akt3            -0.5277
                           Akt1            0.5473     kras             0.5230
                           rb1            -0.5262

Since the two methods of interpretation, using the standardized coefficients and the canonical loadings, lead to similar conclusions, we are more confident in our findings and move on to model validation in the next section.

4.3.6 Cross Validation

In this section, we subject our model to validation. Of the various approaches to model validation, we use the sample splitting approach.
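The sample splitting approach can be sketched as follows. The code is a minimal illustration under stated assumptions, not the analysis pipeline of this thesis: the data are synthetic (one shared latent factor plus noise), the function name `canonical_correlations` is hypothetical, and the canonical correlations are computed with a QR/SVD approach rather than with the statistical package used for the reported results. Only the dimensions (267 patients, 6 image features, 26 genes, sub-samples of 134 and 133) follow the study.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations of two data matrices (rows = observations)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases of the centred column spaces.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # The singular values of Qx' Qy are the canonical correlations.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

rng = np.random.default_rng(0)
n, p, q = 267, 6, 26

# Synthetic stand-in data (assumption): one latent factor shared by both sets.
z = rng.normal(size=(n, 1))
X = z + rng.normal(size=(n, p))
Y = z + rng.normal(size=(n, q))

# Random split into sub-samples of 134 and 133 observations.
order = rng.permutation(n)
idx_a, idx_b = order[:134], order[134:]

r_full = canonical_correlations(X, Y)
r_a = canonical_correlations(X[idx_a], Y[idx_a])
r_b = canonical_correlations(X[idx_b], Y[idx_b])

print("full sample :", np.round(r_full, 4))
print("sub-sample A:", np.round(r_a, 4))
print("sub-sample B:", np.round(r_b, 4))
```

Comparing the leading correlations and the pattern of loadings across the two halves indicates whether the full-sample solution is stable.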
The entire sample (267 patients) is divided into two sub-samples and the canonical correlation analysis is conducted separately on each sub-sample. We then compare the results obtained from the two analyses.

4.3.7 CCA on Sub-Sample A

The first sub-sample contains 134 patients. Of the six canonical functions, only two were significant according to the F-tests and Wilks' lambda (see Table 4.22). Hence we present the canonical loadings of each variable set for the significant functions only. Tables 4.23 and 4.24 show the contribution of each variable to each of the canonical functions. The significant canonical correlation coefficients for this sub-sample were 0.6601 and 0.6372.

Table 4.22: Test of Significance of each Canonical Correlation

Test of Canonical Correlation   Wilks' Lambda   df1   df2       F        Prob>F
1-6                             0.118385        156   606.447   1.7054   0.000
2-6                             0.226496        125   511.824   1.4423   0.0033
3-6                             0.383356        96    414.513   1.1824   0.1366
4-6                             0.564471        69    314.54    0.9617   0.5660
5-6                             0.735015        44    212       0.8018   0.8069
6                               0.902262        21    107       0.5519   0.9408

Major axis and nCET were the most important variables in the first function, since their loadings were equal to or greater than |0.5|, while minor axis and necrosis were the most important variables in the second function. From Table 4.24, spry2, tp53 and cdk4 were the most important variables in the first function. Akt1, cdk4, pten, rb1 and foxo1 were the most important variables in the second canonical function.
Table 4.23: Canonical Loadings for the Neuro-image features

Variables   1        2
Major Axis  0.5091  -0.3300
Minor Axis  0.2534  -0.6565
Necrosis    0.3137   0.5473
Edema       0.2633  -0.1854
nCET       -0.7192  -0.1742
Enhancing   0.0638   0.0534

Table 4.24: Canonical Loadings for the Copy Number Variation Variables

Variables   1        2
Akt1        0.0866   0.5843
Akt2        0.1538   0.1053
Akt3        0.2341  -0.2273
ccnd2      -0.1114   0.2097
cdk4        0.5836   0.5938
cdk6        0.0724   0.0135
cdk2na     -0.2017   0.0954
cdkn2c     -0.2398   0.0821
egfr        0.1729  -0.0423
erbb2       0.1869  -0.2338
foxo1       0.1791  -0.5588
foxo3       0.2398   0.1746
hras       -0.0432  -0.0606
kras       -0.0989   0.1855
mdm2        0.1222   0.0973
mdm4        0.1207   0.0428
met        -0.1068   0.0827
nf1         0.1457  -0.2965
nras        0.0809  -0.2096
pdgfra      0.3263  -0.0160
pik3ca      0.0965  -0.2645
pik3r1      0.0369  -0.1067
pten       -0.1087   0.5057
rb1         0.0566  -0.5162
spry2      -0.6601  -0.1130
tp53       -0.5210  -0.1153

4.3.8 CCA on Sub-Sample B

Sub-sample B contains 133 patients. Again, only two of the functions were significant according to the F-tests and Wilks' lambda (see Table 4.25), so only the results for the significant functions are presented and interpreted. Tables 4.26 and 4.27 show the contribution of each variable to each of the canonical functions. The significant canonical correlation coefficients for this analysis were 0.6543 and 0.6338.
Table 4.25: Test of Significance of each Canonical Correlation

Test of Canonical Correlation   Wilks' Lambda   df1   df2       F        Prob>F
1-6                             0.128719        156   600.581   1.6104   0.0000
2-6                             0.225069        125   506.903   1.4354   0.0037
3-6                             0.376206        96    410.551   1.1970   0.1202
4-6                             0.539844        69    311.552   1.0348   0.4120
5-6                             0.721821        44    210       0.8449   0.7430
6                               0.880579        21    106       0.6845   0.8398

Major axis and nCET were the most important variables in the first function, since their loadings were equal to or greater than |0.5|, while minor axis and necrosis were the most important variables in the second function. Table 4.27 shows that spry2, tp53 and cdk4 were the most important variables in the first function. Akt1, cdk4, pten, rb1 and foxo1 were the most important variables in the second canonical function.
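When comparing loadings across sub-samples, note that a canonical variate is only determined up to sign, so a function may appear with all signs reversed in one sub-sample. One simple way to compare the patterns, offered here as an illustrative sketch rather than a method used in the thesis, is the cosine (congruence) between loading vectors. The values are the neuro-image loadings of Tables 4.23 and 4.26; the function name `congruence` is hypothetical.

```python
import math

# Neuro-image loadings in the order
# Major Axis, Minor Axis, Necrosis, Edema, nCET, Enhancing.
sample_a = {  # Table 4.23
    1: [0.5091, 0.2534, 0.3137, 0.2633, -0.7192, 0.0638],
    2: [-0.3300, -0.6565, 0.5473, -0.1854, -0.1742, 0.0534],
}
sample_b = {  # Table 4.26
    1: [0.5889, 0.2899, 0.4356, 0.1722, -0.6209, 0.0388],
    2: [0.3072, 0.6673, -0.5342, 0.1771, 0.1970, -0.0818],
}

def congruence(u, v):
    """Cosine of the angle between two loading vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

for k in (1, 2):
    c = congruence(sample_a[k], sample_b[k])
    print(f"Function {k}: congruence = {c:+.3f}")
```

For the first function the value is close to 0.98, and for the second it is close to -1, that is, the same pattern of loadings with reversed signs in sub-sample B.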
Table 4.26: Canonical Loadings for the Neuro-image features

Variables   1        2
Major Axis  0.5889   0.3072
Minor Axis  0.2899   0.6673
Necrosis    0.4356  -0.5342
Edema       0.1722   0.1771
nCET       -0.6209   0.1970
Enhancing   0.0388  -0.0818

Table 4.27: Canonical Loadings for the Copy Number Variation Variables

Variables   1        2
Akt1        0.1421  -0.6099
Akt2        0.1683  -0.0947
Akt3        0.2054   0.1766
ccnd2      -0.0259  -0.2243
cdk4        0.5612  -0.5921
cdk6        0.0490   0.0116
cdk2na     -0.1022  -0.1496
cdkn2c     -0.2065  -0.0216
egfr        0.1347   0.0505
erbb2       0.1261   0.3605
foxo1       0.0707   0.5014
foxo3       0.1995  -0.1374
hras       -0.0823   0.0706
kras        0.0111  -0.1975
mdm2        0.1303  -0.1008
mdm4       -0.0023  -0.0977
met        -0.0541   0.0548
nf1         0.0878   0.3180
nras        0.0196   0.1945
pdgfra      0.3428   0.0677
pik3ca      0.0244   0.1641
pik3r1     -0.0001   0.1765
pten       -0.1991  -0.5676
rb1         0.0668   0.5293
spry2      -0.6229   0.1449
tp53       -0.5531   0.2149

4.4 Summary

The study investigated a model that links six neuro-image features with the copy number variations of 26 genes in Glioblastoma patients. Wilks' lambda and F-tests were employed to evaluate the null hypothesis that the canonical correlation coefficients of all the canonical functions are zero. In our model, only the first three canonical correlation coefficients are statistically significant, that is, with p-values below 0.05. The other three functions were not significant and hence were not interpreted.

For the three significant canonical variate pairs, the strength of the relationship is given by the canonical correlation coefficient. The first pair of canonical variates (first canonical function) had a coefficient of 0.6704, the second had a coefficient of 0.6347, and the third had a coefficient of 0.5552. Squaring a canonical correlation coefficient gives the proportion of variance shared between the two optimally weighted variates.
The redundancy index measures the proportion of variance of the M-set of variables that is predicted from the linear combination of the N-set of variables. The redundancy index can only equal 1 if the squared canonical correlation coefficient (eigenvalue) is 1 and the canonical variate accounts for all the variation of every variable in the set. In the first function, the M-variables had a redundancy index of 0.2012 and the N-variables a redundancy index of 0.2101. The second function had a redundancy measure of 0.1876 for the M-variables and 0.1501 for the N-variables. For the third function, the redundancy index was 0.1001 and 0.1019 for the M-variables and N-variables respectively.

The canonical loadings and standardized canonical coefficients were used to evaluate the importance of the variables in each function. A threshold of |0.5| and above was used to select the important variables in each function. The standardized canonical coefficients showed that, for the first function, major axis, nCET, spry2, tp53 and cdk4 were the most important variables. Minor axis, necrosis, foxo1, rb1, pten, cdk4 and akt1 are the most important variables in the second function. For the third function, major axis, edema, minor axis, necrosis, nras, cdk2na, kras and akt3 are the most important variables.

Using the canonical loadings, the most important variables for the first function were nCET, major axis, spry2, cdk4 and tp53. The important contributing variables in the second function were minor axis, necrosis, foxo1, pten, cdk4, akt1 and rb1. For the third function, major axis, necrosis, nras, cdk2na, akt3 and kras were the most important variables.

We performed cross validation to check whether the results were influenced by the number of samples. The sample of 267 patients was divided into two and the canonical correlation analysis was performed on both sub-samples.
Results from both sub-samples indicated that only two functions were significant and hence should be interpreted. For sample A, the first canonical variate pair had a canonical correlation coefficient of 0.6601, while the second variate pair had a canonical correlation coefficient of 0.6372. For the first function, nCET, major axis, spry2, cdk4 and tp53 are the most closely related and most important variables. For the second function, akt1, cdk4, foxo1, pten and rb1 were the most important variables. For sample B, the canonical correlation coefficients were 0.6543 and 0.6338. The variables found to be important in the first sample were also found to be important in the second sample.

Chapter 5

Conclusion

Canonical correlation analysis is a powerful technique for investigating the relationship between a set of multiple independent variables and a set of multiple dependent variables. Although the technique is fundamentally descriptive, it can also be employed for predictive purposes. This thesis provided a review of canonical correlation analysis and applied it to explore the relationship between the copy number variations and neuro-image features of Glioblastoma patients.

Canonical correlation coefficients are unchanged under a non-singular transformation, and the canonical correlation coefficients obtained from the correlation matrix and from the covariance matrix are the same. Also, standardizing the original variables before computing the correlations has no effect on the correlations.

We obtained from the data that the mean survival time for Glioblastoma is 15 months and the mean age at diagnosis is 55 years. The two sets of variables were related in three ways. We obtained three pairs of significant canonical variates with correlations of 0.6704, 0.6347 and 0.5552 respectively, which were used to identify genes and features related to Glioblastoma. The important genes and features forming these relationships are as follows.
The major axis of the tumor, the non-contrast enhancing tumor, sprouty RTK signaling antagonist 2, tumor protein p53 and cyclin dependent kinase 4 are strongly related. Also, the minor axis of the tumor, the proportion of necrosis, forkhead box O1, phosphatase and tensin homolog, RB transcriptional corepressor 1, AKT serine/threonine kinase 1 and cyclin dependent kinase 4 are strongly related. Finally, major axis, proportion of necrosis, neuroblastoma RAS viral oncogene homolog, cyclin dependent kinase inhibitor 2A, AKT serine/threonine kinase 3 and KRAS proto-oncogene, GTPase are strongly related.

References

[1] Bartek, J., Ng, K., Fischer, W., Carter, B., and Chen, C. C. (2012). Key concepts in glioblastoma therapy. Journal of Neurology, Neurosurgery & Psychiatry, 83(7):753–760.
[2] Cliff, N. and Krus, D. J. (1976). Interpretation of canonical analysis: Rotated vs. unrotated solutions. Psychometrika, 41(1):35–42.
[3] CNV (Accessed March 2016). Copy number variants. DNA Learning Center, http://www.dnalc.org/view/552-Copy-Number-Variants.html.
[4] Davies, E. B. (2007). Approximate diagonalization. SIAM Journal on Matrix Analysis and Applications, 29(4):1051–1064.
[5] de Koning, A. J., Gu, W., Castoe, T. A., Batzer, M. A., and Pollock, D. D. (2011). Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet, 7(12):e1002384.
[6] Denman, E. D. (1981). Roots of real matrices. Linear Algebra and its Applications, 36:133–139.
[7] Denman, E. D. and Beavers, A. N. (1976). The matrix sign function and computations in systems. Applied Mathematics and Computation, 2(1):63–94.
[8] Duerr, E.-M., Rollbrocker, B., Hayashi, Y., Peters, N., Meyer-Puttlitz, B., Louis, D. N., Schramm, J., Wiestler, O. D., Parsons, R., Eng, C., et al. (1998). PTEN mutations in gliomas and glioneuronal tumors. Oncogene, 16(17).
[9] Ganigi, P., Santosh, V., Anandh, B., Chandramouli, B., and Sastry Kolluri, V. (2005). Expression of p53, EGFR, pRb and bcl-2 proteins in pediatric glioblastoma multiforme: a study of 54 patients. Pediatric Neurosurgery, 41(6):292–299.
[10] Genetic Variability (Accessed May 2016). Copy Number Variations. Pathway detail - flipper e nuvola http://flipper.diff.org/app/pathways/3685.
[11] Gevaert, O., Mitchell, L. A., Achrol, A. S., Xu, J., Echegaray, S., Steinberg, G. K., Cheshier, S. H., Napel, S., Zaharchuk, G., and Plevritis, S. K. (2014). Glioblastoma multiforme: exploratory radiogenomic analysis by using quantitative image features. Radiology, 273(1):168–174.
[12] Giunti, L., Pantaleo, M., Sardi, I., Provenzano, A., Magi, A., Cardellicchio, S., Castiglione, F., Tattini, L., Novara, F., Buccoliero, A. M., et al. (2014). Genome-wide copy number analysis in pediatric glioblastoma multiforme. Am J Cancer Res, 4:293–303.
[13] Gutman, D. A., Cooper, L. A., Hwang, S. N., Holder, C. A., Gao, J., Aurora, T. D., Dunn Jr, W. D., Scarpace, L., Mikkelsen, T., Jain, R., et al. (2013). MR imaging predictors of molecular profile and survival: multi-institutional study of the TCGA glioblastoma data set. Radiology, 267(2):560–569.
[14] Hair, J. F., Black, W. C., Babin, B. J., Anderson, R. E., Tatham, R. L., et al. (2006a). Canonical Correlation Analysis: A Supplement to Multivariate Data Analysis, volume 6. Pearson Prentice Hall, Upper Saddle River, NJ.
[15] Hair, J. F., Black, W. C., Babin, B. J., Anderson, R. E., Tatham, R. L., et al. (2006b). Multivariate data analysis, volume 6. Pearson Prentice Hall, Upper Saddle River, NJ.
[16] Hammoud, M. A., Sawaya, R., Shi, W., Thall, P. F., and Leeds, N. E. (1996). Prognostic significance of preoperative MRI scans in glioblastoma multiforme. Journal of Neuro-oncology, 27(1):65–73.
[17] Higham, N. J. (1987). Computing real square roots of a real matrix. Linear Algebra and its Applications, 88:405–430.
[18] Hoskins, W. and Walton, D. (1978). A faster method of computing the square root of a matrix. Automatic Control, IEEE Transactions on, 23(3):494–495.
[19] Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3/4):321–377.
[20] Johnson, R. A., Wichern, D. W., et al. (2002). Applied multivariate statistical analysis, volume 5. Prentice Hall, Upper Saddle River, NJ.
[21] Lacroix, M., Abi-Said, D., Fourney, D. R., Gokaslan, Z. L., Shi, W., DeMonte, F., Lang, F. F., McCutcheon, I. E., Hassenbusch, S. J., Holland, E., et al. (2001). A multivariate analysis of 416 patients with glioblastoma multiforme: prognosis, extent of resection, and survival. Journal of Neurosurgery, 95(2):190–198.
[22] Lin, D., Calhoun, V. D., and Wang, Y.-P. (2014). Correspondence between fMRI and SNP data by group sparse canonical correlation analysis. Medical Image Analysis, 18(6):891–902.
[23] McCarroll, S. A. and Altshuler, D. M. (2007). Copy-number variation and association studies of human disease. Nature Genetics, 39:S37–S42.
[24] Multivariate Analysis (Accessed March 2016). Multivariate Analysis. Philender, http://www.philender.com/courses/multivariate/notes2/can1.html.
[25] Noushmehr, H., Weisenberger, D. J., Diefes, K., Phillips, H. S., Pujara, K., Berman, B. P., Pan, F., Pelloski, C. E., Sulman, E. P., Bhat, K. P., et al. (2010). Identification of a CpG island methylator phenotype that defines a distinct subgroup of glioma. Cancer Cell, 17(5):510–522.
[26] Ohgaki, H., Dessen, P., Jourde, B., Horstmann, S., Nishikawa, T., Di Patre, P.-L., Burkhard, C., Schüler, D., Probst-Hensch, N. M., Maiorka, P. C., et al. (2004). Genetic Pathways to Glioblastoma A Population-Based Study. Cancer Research, 64(19):6892–6899.
[27] Pierallini, A., Bonamini, M., Pantano, P., Palmeggiani, F., Raguso, M., Osti, M., Anaveri, G., and Bozzao, L. (1998). Radiological assessment of necrosis in glioblastoma: variability and prognostic value. Neuroradiology, 40(3):150–153.
[28] Pollack, I. F., Boyett, J. M., Yates, A. J., Burger, P. C., Gilles, F. H., Davis, R. L., Finlay, J. L., Group, C. C., et al. (2003). The influence of central review on outcome associations in childhood malignant gliomas: results from the CCG-945 experience. Neuro-oncology, 5(3):197–207.
[29] Pollack, I. F., Finkelstein, S. D., Woods, J., Burnham, J., Holmes, E. J., Hamilton, R. L., Yates, A. J., Boyett, J. M., Finlay, J. L., and Sposto, R. (2002). Expression of p53 and prognosis in children with malignant gliomas. New England Journal of Medicine, 346(6):420–427.
[30] Pollack, I. F., Hamilton, R. L., James, C. D., Finkelstein, S. D., Burnham, J., Yates, A. J., Holmes, E. J., Zhou, T., and Finlay, J. L. (2006). Rarity of PTEN deletions and EGFR amplification in malignant gliomas of childhood: results from the Children’s Cancer Group 945 cohort. Journal of Neurosurgery: Pediatrics, 105(5):418–424.
[31] Pope, W. B., Sayre, J., Perlina, A., Villablanca, J. P., Mischel, P. S., and Cloughesy, T. F. (2005). MR imaging correlates of survival in patients with high-grade gliomas. American Journal of Neuroradiology, 26(10):2466–2474.
[32] Qu, H.-Q., Jacob, K., Fatet, S., Ge, B., Barnett, D., Delattre, O., Faury, D., Montpetit, A., Solomon, L., Hauser, P., et al. (2010). Genome-wide profiling using single-nucleotide polymorphism arrays identifies novel chromosomal imbalances in pediatric glioblastomas. Neuro-oncology, 12(2):153–163.
[33] Reifenberger, G. and Collins, V. P. (2004). Pathology and molecular genetics of astrocytic gliomas. Journal of Molecular Medicine, 82(10):656–670.
[34] Sharp, A. J., Locke, D. P., McGrath, S. D., Cheng, Z., Bailey, J. A., Vallente, R. U., Pertz, L. M., Clark, R. A., Schwartz, S., Segraves, R., et al. (2005). Segmental duplications and copy-number variation in the human genome. The American Journal of Human Genetics, 77(1):78–88.
[35] Siegel, R., Naishadham, D., and Jemal, A. (2012). Cancer statistics, 2012. CA: a cancer journal for clinicians, 62(1):10–29.
[36] Taniguchi, Y., Choi, P. J., Li, G.-W., Chen, H., Babu, M., Hearn, J., Emili, A., and Xie, X. S. (2010). Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science, 329(5991):533–538.
[37] Velazquez, E. R., Meier, R., Dunn Jr, W. D., Alexander, B., Wiest, R., Bauer, S., Gutman, D. A., Reyes, M., and Aerts, H. J. (2015). Fully automatic GBM segmentation in the TCGA-GBM dataset: Prognosis and correlation with VASARI features. Scientific Reports, 5.
[38] Xiong, M., Dong, H., Siu, H., Peng, G., Wang, Y., and Jin, L. (2010). Genome-Wide Association Studies of Copy Number Variation in Glioblastoma. In Bioinformatics and Biomedical Engineering (iCBBE), 2010 4th International Conference on, pages 1–4. IEEE.
[39] Zarrei, M., MacDonald, J. R., Merico, D., and Scherer, S. W. (2015). A copy number variation map of the human genome. Nature Reviews Genetics, 16(3):172–183.