# Chapter 2 Principal Component Analysis

**Data table:** PCA is used to analyze one table of quantitative data.

**Goal:** PCA computes new variables called principal components which is equated by linear combinations of the original variables so as to find new variables that maximizes the variance of the data.

This is obtained by performing an SVD on correlation/covariance matrix

**Key ideas**

- Principle components are orthogonal to each other and are indepenent as well.
- Singular values are Standard deviation of each component
- Eigenvalues give us the variances which is same as the Sum of Squares.

NOTE: **SD > 1** Streches the data and **SD < 1** Compresses the data

**Interpretation**

```
1. Factor scores are the coordinates of the row observations. They are
interpreted by the distances between them, and their distance from the origin.
2. Loadings describe the column variables. Loadings are interpreted by the
angle between them and the principal axis, and their distance from the origin.
3. The distance from the origin is important in both maps, because squared
distance from the mean is inertia. Because of the Pythagorean Theorem, the
total information contributed by a data point (its squared distance to the
origin) is also equal to the sum of its squared factor scores.
```

## 2.1 Dataset: Survey of Autobiographical Mememory

The data was collected by Baycrest Institute at University of Toronto. Participants with different memory scores took several questionnaire.

Participants were asked to rate the extent to which a particular item applied to their memory in general, using a **5-point Likert scale** (1- completely disagree, 2-4 - intermediate degrees of agreement/disagreement, 5 completely agree).

There are 153 obseravtions(**rows**) which represents the participants who answer to 26(**Columns**) questions that comprised of 8 Episodic memory based questions 6 Semantic memory questions, 6 Spatial memory based questions and 6 Prospective memory related questions.

The subjects include both men and women with their ages in the range of 18-84 years which are also mentioned as age and sex variable. A survey based measure of AM is also used to caatogorize the participants into two groups- High memory, Normal Memory.

**Dataset Cleaning**

Steps: 1. Check for NA or incomplete data in the dataset and remove them if they exist. 2. Remove Mysterious Memory Groups : participants with conflicting responses 3. Flitering data : Use only Numeric data to perform PCA

**Preprocessing the Data**

- Centering: Refers to subtracting mean of each column from each of its points.
- Scaling: Normalization (Since SAM data consists of likert scale they all range from 0-5. There exixts homogenity in units of the data. SO refrain from scaling the data.

## 2.2 Looking at the data pattern

**Correlation Plot**

**What it does?**
The corrplot package depicts correlation matrix with a graph. One can play around with the details of the plot by alloting parameters for color, text labels, color labels and layout.

**Analyzing the plot**
There are seven visualization methods (parameter method) in corrplot package, named “circle”, “square”, “ellipse”, “number”, “shade”, “color”, “pie”.

Positive correlations are displayed in blue and negative correlations in red color. Color intensity are proportional to the correlation coefficients.

```
E <- c("#5f9ea0","#5f9ea0","#5f9ea0","#5f9ea0","#5f9ea0","#5f9ea0","#5f9ea0","#5f9ea0")
S <- c("#ff7f24", "#ff7f24", "#ff7f24", "#ff7f24", "#ff7f24", "#ff7f24")
P <- c("#76ee00","#76ee00","#76ee00","#76ee00","#76ee00","#76ee00")
Fu<- c("#305ABF","#305ABF","#305ABF","#305ABF","#305ABF","#305ABF")
color.vector <- as.matrix(c(E,Fu,P,S))
#colors for groups by rows
col4row <- d_active$memoryGroups
col4row <- recode(col4row,
Low = 'orange',
Norm = 'tomato',
High = 'darkred'
)
cor.res <- cor(d_centered)
corrplot.pca <-corrplot(cor.res, method = "color", tl.cex = .5, tl.col = color.vector,
addCoef.col = "black", number.digits = 0, number.cex = .4,
cl.pos = 'b', cl.cex = .3,
addCoefasPercent = TRUE,
col = colorRampPalette(c("darkred", "white","midnightblue"))(20))
```

## 2.3 PCA Analysis

`center = TRUE`

: substracts the mean from each column`scale = FALSE`

: after centering (or not).(Note:Likert scale data is same throughout)`DESIGN`

: colors the observations (rows)`graphs = FALSE`

: this gives you plots from`epPCA`

, but make sure to flag it`FALSE`

for Rmarkdown to run correctly

Note: We run the epPCA and epPCA.inference package by passing the data containing only quantitative variables and also the design variables(colors for plot)

```
# Using numerical data and disregarding age and gender factor
d_PCA<- d_use[ , c(6:31)]
res_pca<- epPCA(
d_PCA,
center = TRUE,
scale = FALSE,
DESIGN = d_use$memoryGroups, # Memory Groups of participants based on AM
graphs = FALSE
)
```

### 2.3.1 Scree Plot

**What it does**
The scree plot shows the eigenvalues, the amount of information on each component. The number of components (the dimensionality of the factor space) is min(nrow(DATA), ncol(DATA)) minus 1.
**Analysing the Plot**
Here, 8 columns give 7 components. The scree plot is used to determine how many of the components should be interpreted.

## 2.4 PCA inference

The inference battery package includes permutation and bootstrap tests. The inference is important to check for the stability and reliability of your results. (Just like F test)

```
res_pcaInf <- epPCA.inference.battery(d_PCA, center = TRUE,
scale = FALSE,
DESIGN = d_use$memoryGroups,
graphs = FALSE)
```

```
## [1] "It is estimated that your iterations will take 0.05 minutes."
## [1] "R is not in interactive() mode. Resample-based tests will be conducted. Please take note of the progress bar."
## ===========================================================================
```

### 2.4.1 Scree Plot

This plot includes the results from permutation with Scree plot (i.e., color the significant components) by adding the estimated p-values to the `PlotScree`

function.

### 2.4.2 Testing the eigenvalues

An alternative way to check the no : of significant components

```
zeDim = 1
pH1 <- prettyHist(
distribution = res_pcaInf$Inference.Data$components$eigs.perm[,zeDim],
observed = res_pcaInf$Fixed.Data$ExPosition.Data$eigs[zeDim],
xlim = c(0, 3000), # needs to be set by hand
breaks = 10,
border = "white",
main = paste0("Permutation Test for Eigenvalue ",zeDim),
xlab = paste0("Eigenvalue ",zeDim),
ylab = "",
counts = FALSE,
cutoffs = c( 0.975))
```

```
zeDim = 2
pH2 <- pH1 <- prettyHist(
distribution = res_pcaInf$Inference.Data$components$eigs.perm[,zeDim],
observed = res_pcaInf$Fixed.Data$ExPosition.Data$eigs[zeDim],
xlim = c(0, 1800), # needs to be set by hand
breaks = 10,
border = "white",
main = paste0("Permutation Test for Eigenvalue ",zeDim),
xlab = paste0("Eigenvalue ",zeDim),
ylab = "",
counts = FALSE,
cutoffs = c(0.975))
```

```
zeDim = 3
pH3<- pH1 <- prettyHist(
distribution = res_pcaInf$Inference.Data$components$eigs.perm[,zeDim],
observed = res_pcaInf$Fixed.Data$ExPosition.Data$eigs[zeDim],
xlim = c(0, 1000), # needs to be set by hand
breaks = 10,
border = "white",
main = paste0("Permutation Test for Eigenvalue ",zeDim),
xlab = paste0("Eigenvalue ",zeDim),
ylab = "",
counts = FALSE,
cutoffs = c(0.975))
```

### 2.4.3 Row Factor scores

*Row Factor scores*
F = XQ = P*delta
Projections of the observations onto the principal components.

```
my.fi.plot <- createFactorMap(res_pcaInf$Fixed.Data$ExPosition.Data$fi, # data
title = "SAM Row Factor Scores", # title of the plot
axis1 = 1, axis2 = 2, # which component for x and y axes
display.labels = FALSE,
pch = 19, # the shape of the dots (google `pch`)
cex = 2, # the size of the dots
text.cex = 2.5, # the size of the text
alpha.points = 0.3,
col.points = col4row, # color of the dots
col.labels = col4row # color for labels of dots
)
fi.labels <- createxyLabels.gen(1,2,
lambda = res_pcaInf$Fixed.Data$ExPosition.Data$eigs,
tau = round(res_pcaInf$Fixed.Data$ExPosition.Data$t),
axisName = "Component "
)
fi.plot <- my.fi.plot$zeMap + fi.labels # you need this line to be able to save them in the end
fi.plot
```

Color for each group:

```
# get index for the first row of each group
grp.ind <- order(d_use$memoryGroups)[!duplicated(sort(d_use$memoryGroups))]
grp.col <- res_pcaInf$Fixed.Data$Plotting.Data$fi.col[grp.ind] # get the color
grp.name <- d_use$memoryGroups[grp.ind] # get the corresponding groups
names(grp.col) <- grp.name
```

#### 2.4.3.1 With group means

```
group.mean <- PTCA4CATA::getMeans(res_pcaInf$Fixed.Data$ExPosition.Data$fi,
d_use$memoryGroups)
col4Means <- recode(rownames(group.mean),
Low = 'orange',
High = 'darkred',
Norm = 'tomato2',
)
names(col4Means) <- rownames(group.mean)
```

##### 2.4.3.1.1 Looking at Dimension 1 and 2

```
fi.mean.plot <- createFactorMap(group.mean,
alpha.points = 1,
display.labels = TRUE,
col.points = col4Means,
col.labels = col4Means,
pch = 17,
cex = 3,
text.cex = 3
)
fi.WithMean <- my.fi.plot$zeMap_background +
my.fi.plot$zeMap_dots +
fi.mean.plot$zeMap_dots +
fi.mean.plot$zeMap_text +
fi.labels
fi.WithMean
```

##### 2.4.3.1.2 Looking at dimension 2 and 3

```
my.fi.plot23 <- createFactorMap(res_pcaInf$Fixed.Data$ExPosition.Data$fi, # data
title = "SAM Row Factor Scores", # title of the plot
axis1 = 2, axis2 = 3, # which component for x and y axes
display.labels = FALSE,
pch = 19, # the shape of the dots (google `pch`)
cex = 2, # the size of the dots
text.cex = 2.5, # the size of the text
alpha.points = 0.3,
col.points = col4row, # color of the dots
col.labels = col4row # color for labels of dots
)
fi.labels23 <- createxyLabels.gen(2,3,
lambda = res_pcaInf$Fixed.Data$ExPosition.Data$eigs,
tau = round(res_pcaInf$Fixed.Data$ExPosition.Data$t),
axisName = "Component "
)
fi.mean.plot23 <- createFactorMap(group.mean,
alpha.points = 1,
axis1 = 2,
axis2 = 3,
display.labels = TRUE,
col.points = col4Means,
col.labels = col4Means,
pch = 17,
cex = 3,
text.cex = 3
)
fi.WithMean23 <- my.fi.plot23$zeMap_background+ my.fi.plot23$zeMap_dots + fi.mean.plot23$zeMap_dots +fi.mean.plot23$zeMap_text +
fi.labels23
fi.WithMean23
```

#### 2.4.3.2 Tolerance interval

The spread of the factor scores based on their corresponding groups. The groups tend to overlap each showing that the data is not segregated as per groups, although the means are separate.

##### 2.4.3.2.1 Looking at dimension 1 and 2

```
TIplot <- MakeToleranceIntervals(res_pcaInf$Fixed.Data$ExPosition.Data$fi,
design = as.factor(d_use$memoryGroups),
# line below is needed
names.of.factors = c("Dim1","Dim2"), # needed
col = col4Means,
line.size = .50,
line.type = 3,
alpha.ellipse = .2,
alpha.line = .4,
p.level = .95)
fi.WithMeanTI <- my.fi.plot$zeMap_background +
my.fi.plot$zeMap_dots+
fi.mean.plot$zeMap_dots +
fi.mean.plot$zeMap_text+
TIplot + fi.labels
fi.WithMeanTI
```

##### 2.4.3.2.2 Looking at dimension 2 and 3

```
TIplot23 <- MakeToleranceIntervals(res_pcaInf$Fixed.Data$ExPosition.Data$fi[,c(2:3)],
design = as.factor(d_use$memoryGroups),
# line below is needed
names.of.factors = c("Dim1","Dim2"), # needed
col = col4Means,
line.size = .50,
line.type = 3,
alpha.ellipse = .15,
alpha.line = .4,
p.level = .95)
fi.WithMeanTI <- my.fi.plot23$zeMap_background +
my.fi.plot23$zeMap_dots+
fi.mean.plot23$zeMap_dots +
fi.mean.plot23$zeMap_text+
TIplot23 + fi.labels23
fi.WithMeanTI
```

#### 2.4.3.3 Bootstrap interval

We can also add the bootstrap interval for the group means to see if these group means are significantly different.

```
# Depend on the size of your data, this might take a while
fi.boot <- Boot4Mean(res_pcaInf$Fixed.Data$ExPosition.Data$fi[,1:3],
design = d_use$memoryGroups,
niter = 1000)
for (i in 1:ncol(fi.boot$BootCube)){
colnames(fi.boot$BootCube)[i] <- paste("Dimension", i)
}
```

##### 2.4.3.3.1 Looking at Dimension 1 and 2

##### 2.4.3.3.2 Lookinf at dimension 2 and 3

```
axis1=2
axis2=3
bootCI4mean23 <- MakeCIEllipses(fi.boot$BootCube[,c(2:3),], # get the first two components
names.of.factors = paste0("Dimension ", c(2,3)),
col = col4Means,)
fi.WithMeanCI23 <- my.fi.plot23$zeMap_background +
fi.labels23 + bootCI4mean23+
my.fi.plot23$zeMap_dots +
fi.mean.plot23$zeMap_dots +
fi.mean.plot23$zeMap_text
fi.WithMeanCI23
```

### 2.4.4 Loadings

Loadings describe the similarity (observe the difference in angular distance) between the variables. Loadings show how the input variables relate to each other. Loadings also show which variables is helping in explaining the component.

Note: The Loading plot shows that the episodic memory questions share a smaller angular distance with the future memory questions depicting that they are positively correlated.(cos0 =1 completly correlated)

Similarly the spatial memory questions and semantic memory questions are strongly correlated.

Also, Future and Spatial memory loadings depic an angular difference of 90 deg angle. This shows they are uncorrelated with each other. (cos90 = 0 uncorrelated) Hence, in this case a varimax rotation will help with a better understanding of the components.

```
res_pcaInf$Plotting.Data$fj.col<-color.vector
cor.loading <- cor(d_PCA, res_pcaInf$Fixed.Data$ExPosition.Data$fi)
colnames(cor.loading)<- rownames(cor.loading)
loading.plot <- createFactorMap(cor.loading,
constraints = list(minx = -1, miny = -1, maxx = 1, maxy = 1),
col.points = color.vector,
col.labels = color.vector)
LoadingMapWithCircles <- loading.plot$zeMap +
addArrows(cor.loading, color = res_pcaInf$Plotting.Data$fj.col) +
addCircleOfCor() +
xlab("Component 1") +
ylab("Component 2")
LoadingMapWithCircles
```

### 2.4.5 Column Factor Scores

You can also include the variance of each component and plot the factor scores for the columns (i.e., the variables). The following looks at the column factor scores for dimension 1cand 2.

```
res_pcaInf$Plotting.Data$fj.col<-color.vector
my.fj.plot <- createFactorMap(res_pcaInf$Fixed.Data$ExPosition.Data$fj, # data
title = "SAM Column Factor Scores", # title of the plot
axis1 =1 , axis2 = 2, # which component for x and y axes
pch = 19, # the shape of the dots (google `pch`)
cex = 3, # the size of the dots
text.cex = 3, # the size of the text
col.points = color.vector , # color of the dots
col.labels = color.vector , # color for labels of dots
)
fj.plot <- my.fj.plot$zeMap + fi.labels
fj.plot
```

Similarly looking at column factor scores for dimension 2 and 3

## 2.5 Variman Rotation

The loading circle plot hinted towards a rotation of the axis. The future and spatial memory loadings appeared to share an angle of 90 deg indicating that they are uncorrelated and probably orthogonal.

```
var.fj <- varimax(res_pcaInf$Fixed.Data$ExPosition.Data$fj)
rotate.loadings<- as.matrix(var.fj[["loadings"]])
fj.rotated<-rotate.loadings[, c(1:2)]
varimax.plot <- createFactorMap(fj.rotated, # data
title = "Column Factor Scores with rotation", # title of the plot
axis1 =1 , axis2 = 2, # which component for x and y axes
pch = 19, # the shape of the dots (google `pch`)
cex = 3, # the size of the dots
text.cex = 3, # the size of the text
col.points = color.vector , # color of the dots
col.labels = color.vector , # color for labels of dots
)
fj.plot.rotated <- varimax.plot$zeMap + fi.labels
fj.plot.rotated
```

```
fj.rotated23<-rotate.loadings[, c(2:3)]
varimax.plot23 <- createFactorMap(fj.rotated23, # data
title = "Column Factor Scores with rotation", # title of the plot
axis1 =1 , axis2 = 2, # which component for x and y axes
pch = 19, # the shape of the dots (google `pch`)
cex = 3, # the size of the dots
text.cex = 3, # the size of the text
col.points = color.vector , # color of the dots
col.labels = color.vector , # color for labels of dots
)
fj.plot.rotated23 <- varimax.plot23$zeMap + fi23.labels
fj.plot.rotated23
```

This Plot shows that spatial memory P1 explains component 1 while future memory F2 helps to explain component 2 after a roation of the axis.

Component 1: Episodic memory

Component 2: Spatial memory

Component 3: Future memory

### 2.5.1 Contributions of columns and their Bootstrap Ratios

Contribution Bar plots : How the variables contribute to each component by plotting the contributions with a line that represents the threshold we compared them to.

Bootstrap ratio Bar plots: Checks the significance of contribution of variables to the priciple components.

```
grid.arrange(
as.grob(ctrJ.2),
as.grob(ctrJ.3),
as.grob(ba002.BR2),
as.grob(ba003.BR3),
ncol = 2,nrow = 2,
top = textGrob("Barplots for variables", gp = gpar(fontsize = 12, font = 3))
)
```

## 2.6 Conclusion

When we interpret the factor scores and loadings together, the PCA revealed:

**Before Rotation:**

Component 1: Participanta who were grouped as high Autobiographical Memory have higher episodic memory.

Component 2: Tends to seperate semantic memory which from memory that requires imagination like future and spatial memory.

Note: Semantic memory is based on facts, meanings, concepts and knowledge about the external world that we have acquired and is independent of personal experience and of the spatial/temporal context in which it was acquired.

- Component 3: Explains how Future memory is different or separates from Spatial Memory.

**After Rotation:**

Component 1: Participanta who were grouped as high Autobiographical Memory have higher Episodic memory.

Component 2: The positive side if the dimension is explained by Spatial Memory

Component 3: Explained by Future memory