Chapter 1 Principal Component Analysis

Advice: Use the simplest method that provides the clearest picture.

Principal component analysis (PCA) is used to analyze one table of quantitative data. PCA mixes the input variables to give new variables, called principal components. The first principal component is the line of best fit. It is the line that maximizes the inertia (similar to variance) of the cloud of data points. Subsequent components are defined as orthogonal to previous components, and maximize the remaining inertia.

PCA gives one map for the rows (called factor scores), and one map for the columns (called loadings). These 2 maps are related, because they both are described by the same components. However, these 2 maps project different kinds of information onto the components, and so they are interpreted differently. Factor scores are the coordinates of the row observations. They are interpreted by the distances between them, and their distance from the origin. Loadings describe the column variables. Loadings are interpreted by the angle between them, and their distance from the origin.

The distance from the origin is important in both maps, because squared distance from the mean is inertia (variance, information; see sum of squares as in ANOVA/regression). Because of the Pythagorean Theorem, the total information contributed by a data point (its squared distance to the origin) is also equal to the sum of its squared factor scores.

1.0.1 PCA Data: PHQ

The Patient Health Questionnaire is a survey that is a preliminary measurement for depression severity.

There are 9 questions (columns) measured on a scale from 1 i.e. “no days” to 4 “nearly every day” and 225 participants (rows).

The descriptors of the pariticpants are memory group, sex, and age. For this analysis I will be using memory group that is either high, normal, or low memory.

#head(PHQ)
kable(head(PHQ))

Pleasure	Hopeless	Sleep	Energy	Appetite	Failure	Focus	Speed	Suicide
1	1	2	3	1	1	1	1	1
1	1	1	2	1	1	1	1	1
1	1	1	1	1	1	1	1	1
1	3	4	2	3	2	1	1	1
1	1	1	1	1	1	1	1	1
1	2	3	3	1	3	2	1	1

1.1 PCA correlation plot

cor.res <- cor(PHQ)
corrplot.mixed(cor.res, tl.cex = 0.7, tl.col = "black", addCoefasPercent = TRUE)

#cor.plot <- recordPlot() # you need this line to be able to save the figure to PPT later
memory <- matrix(,nrow = nrow(PHQ), ncol = 1)
memory[1:72] <- 3 #high memory
memory[73:144] <-2 #norm memory
memory[145:216] <-1 #low memory

#Totalscore and Memory design variables correlation
TotalscoreMemory <- cbind(data.frame(totalscore), data.frame(memory))
colnames(TotalscoreMemory) <- c("Totalscore","Memory")

PHQ9BeckScatter <- ggscatter(TotalscoreMemory, x = "Totalscore", y = "Memory", 
          add = "reg.line", conf.int = TRUE, color = 'mediumorchid4',
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Depression Severity", ylab = "Memory Group")
PHQ9BeckScatter

We can see that all questions are positively correlated with each other. Feelings of failure and hopleness have the highest correlation of .74. Loss of pleasure and hopeless are also highly correlated (.68).

From the correlation plot we can see that in this data set depression severity is not significantly correlated with what memory group the participants are in.

1.2 PCA Analysis

res_pcaInf <- epPCA.inference.battery(PHQ, center = TRUE, scale = FALSE, DESIGN = GroupingVaribles$memoryGroups, graphs = FALSE)

## [1] "It is estimated that your iterations will take 0.01 minutes."
## [1] "R is not in interactive() mode. Resample-based tests will be conducted. Please take note of the progress bar."
## ===========================================================================

1.2.1 PCA Testing the eigenvalues

zeDim = 1
pH1 <- prettyHist(
  distribution = res_pcaInf$Inference.Data$components$eigs.perm[,zeDim], 
           observed = res_pcaInf$Fixed.Data$ExPosition.Data$eigs[zeDim], 
           xlim = c(200, 1300), # needs to be set by hand
           breaks = 20,
           border = "white", 
           main = paste0("Permutation Test for Eigenvalue ",zeDim),
           xlab = paste0("Eigenvalue ",zeDim), 
           ylab = "", 
           counts = FALSE, 
           cutoffs = c( 0.975))

#eigs1 <- recordPlot()
zeDim = 2
pH2 <- pH1 <- prettyHist(
  distribution = res_pcaInf$Inference.Data$components$eigs.perm[,zeDim], 
           observed = res_pcaInf$Fixed.Data$ExPosition.Data$eigs[zeDim], 
           xlim = c(200, 350), # needs to be set by hand
           breaks = 20,
           border = "white", 
           main = paste0("Permutation Test for Eigenvalue ",zeDim),
           xlab = paste0("Eigenvalue ",zeDim), 
           ylab = "", 
           counts = FALSE, 
           cutoffs = c(0.975))

#eigs2 <- recordPlot()

Eigenvalue 1 is reliable, but eigenvalue 2 doesn’t look reliable. However, that doesn’t mean we ignore the second eigenvalue!!

1.2.2 PCA Scree Plot

my.scree <- PlotScree(ev = res_pcaInf$Fixed.Data$ExPosition.Data$eigs,
                      p.ev = res_pcaInf$Inference.Data$components$p.vals, plotKaiser = TRUE)

#my.scree <- recordPlot() # you need this line to be able to save them in the end

Dimension 1 is abvoe the Kaiser cut off and dimension 2 is really close!

1.2.3 PCA Factor scores

my.fi.plot <- createFactorMap(res_pcaInf$Fixed.Data$ExPosition.Data$fi, # data
                            title = "PHQ Row Factor Scores", # title of the plot
                            axis1 = 1, axis2 = 2, # which component for x and y axes
                            pch = 19, # the shape of the dots (google `pch`)
                            cex = 2, # the size of the dots
                            display.labels = FALSE,
                            text.cex = 2.5, # the size of the text
                            alpha.points = 0.3,
                            col.points = res_pcaInf$Fixed.Data$Plotting.Data$fi.col, # color of the dots
                            col.labels = res_pcaInf$Fixed.Data$Plotting.Data$fi.col, # color for labels of dots
                            )

fi.labels <- createxyLabels.gen(1,2,
                             lambda = res_pcaInf$Fixed.Data$ExPosition.Data$eigs,
                             tau = round(res_pcaInf$Fixed.Data$ExPosition.Data$t),
                             axisName = "Component "
                             )
fi.plot <- my.fi.plot$zeMap + fi.labels # you need this line to be able to save them in the end
fi.plot

As the rows move from left to right the variance starts smaller and then gradually increases and slighlty decreases again. However, from this plot it is hard to tell what is happening on each component.

1.2.3.1 PCA With group means

#Get color for each group
# get index for the first row of each group
grp.ind <- order(GroupingVaribles$memoryGroups)[!duplicated(sort(GroupingVaribles$memoryGroups))]
grp.col <- res_pcaInf$Fixed.Data$Plotting.Data$fi.col[grp.ind] # get the color
grp.name <- GroupingVaribles$memoryGroups[grp.ind] # get the corresponding groups
names(grp.col) <- grp.name

#Get means
group.mean <- aggregate(res_pcaInf$Fixed.Data$ExPosition.Data$fi,
                     by = list(GroupingVaribles$memoryGroups), # must be a list
                     mean)

# need to format the results from `aggregate` correctly
rownames(group.mean) <- group.mean[,1] # Use the first column as row names
fi.mean <- group.mean[,-1] # Exclude the first column

#Plot it!
fi.mean.plot <- createFactorMap(fi.mean,
                                alpha.points = 0.8,
                                col.points = grp.col[rownames(fi.mean)],
                                col.labels = grp.col[rownames(fi.mean)],
                                pch = 17,
                                cex = 3,
                                text.cex = 3)
fi.WithMean <- my.fi.plot$zeMap_background + my.fi.plot$zeMap_dots + fi.mean.plot$zeMap_dots + fi.mean.plot$zeMap_text + fi.labels
fi.WithMean

All the group means fall on or very close to component 1.

1.2.4 PCA Tolerance interval

We can plot the tolerance interval

TIplot <- MakeToleranceIntervals(res_pcaInf$Fixed.Data$ExPosition.Data$fi,
                            design = as.factor(GroupingVaribles$memoryGroups),
                            # line below is needed
                            names.of.factors =  c("Dim1","Dim2"), # needed 
                            col = grp.col[rownames(fi.mean)],
                            line.size = .50, 
                            line.type = 3,
                            alpha.ellipse = .2,
                            alpha.line    = .4,
                            p.level       = .95)

fi.WithMeanTI <- my.fi.plot$zeMap_background + my.fi.plot$zeMap_dots + fi.mean.plot$zeMap_dots + fi.mean.plot$zeMap_text + TIplot + fi.labels

fi.WithMeanTI

The tolerance interval shows there is overlap of the distribution of the data points in the groups.

1.2.5 PCA Bootstrap interval

#We can also add the bootstrap interval for the group means to see if these group means are significantly different.

#First step: bootstrap the group means

# Depend on the size of your data, this might take a while
fi.boot <- Boot4Mean(res_pcaInf$Fixed.Data$ExPosition.Data$fi,
                     design = GroupingVaribles$memoryGroups,
                     niter = 1000)

# Check other parameters you can change for this function
bootCI4mean <- MakeCIEllipses(fi.boot$BootCube[,c(1:2),], # get the first two components
                              col = grp.col[rownames(fi.mean)])

fi.WithMeanCI <- my.fi.plot$zeMap_background + bootCI4mean + my.fi.plot$zeMap_dots + fi.mean.plot$zeMap_dots + fi.mean.plot$zeMap_text + fi.labels
fi.WithMeanCI

With a smaller sample size, we can conclude that the groups are not reliably different from one another due to all confidence intervals overlapping.

1.2.6 PCA Loadings

colnames(res_pcaInf$Fixed.Data$ExPosition.Data$fi) <- c('Pleasure','Hopeless','Sleep','Energy','Appetite','Failure','Focus','Speed','Suicide')
cor.loading <- cor(PHQ, res_pcaInf$Fixed.Data$ExPosition.Data$fi)
rownames(cor.loading) <- colnames(cor.loading)

#Get colors for different variables
VariableColors <- prettyGraphsColorSelection(n.colors = ncol(PHQ))

loading.plot <- createFactorMap(cor.loading,
                                constraints = list(minx = -1, miny = -1,
                                                   maxx = 1, maxy = 1),
                                col.points = VariableColors,
                                col.labels = VariableColors)
LoadingMapWithCircles <- loading.plot$zeMap + 
  addArrows(cor.loading, color = VariableColors) + 
  addCircleOfCor() + xlab("Component 1") + ylab("Component 2")

LoadingMapWithCircles

Component 1: All variables are positively correlated.

Component 2: Sleep, Energy, and Appetite seem to be less correlated to the rest of the variables (approaching orthogonality).

More of the variance for Suicide, Speed, and Focus seem to be in a different dimension.

#You can also include the variance of each component and plot the factor scores for the columns (i.e., the variables):
my.fj.plot <- createFactorMap(res_pcaInf$Fixed.Data$ExPosition.Data$fj, # data
                            title = "PHQ Column Loadings", # title of the plot
                            axis1 = 1, axis2 = 2, # which component for x and y axes
                            pch = 19, # the shape of the dots (google `pch`)
                            cex = 3, # the size of the dots
                            text.cex = 3, # the size of the text
                            col.points = VariableColors, # color of the dots
                            col.labels = VariableColors, # color for labels of dots
                            )

fj.plot <- my.fj.plot$zeMap + fi.labels # you need this line to be able to save them in the end
fj.plot

Component 1: All Symptoms
Component 2: Emotional symptoms vs physical symptoms

1.2.6.1 PCA Bootstrap Ratio of columns

_**Note: This is not the same as the contribution bars_

1.2.6.1.1 PCA Component 1 and 2

signed.ctrJ <- res_pcaInf$Fixed.Data$ExPosition.Data$cj * sign(res_pcaInf$Fixed.Data$ExPosition.Data$fj)

# plot contributions for component 1
ctrJ.1 <- PrettyBarPlot2(signed.ctrJ[,1],
                         threshold = 1 / NROW(signed.ctrJ),
                         font.size = 5,
                         color4bar = gplots::col2hex(VariableColors), # we need hex code
                         ylab = 'Contributions',
                         ylim = c(1.2*min(signed.ctrJ), 1.2*max(signed.ctrJ))
) + ggtitle("Contribution barplots", subtitle = 'Component 1: Variable Contributions (Signed)')

# plot contributions for component 2
ctrJ.2 <- PrettyBarPlot2(signed.ctrJ[,2],
                         threshold = 1 / NROW(signed.ctrJ),
                         font.size = 5,
                         color4bar = gplots::col2hex(VariableColors), # we need hex code
                         ylab = 'Contributions',
                         ylim = c(1.2*min(signed.ctrJ), 1.2*max(signed.ctrJ))
) + ggtitle("",subtitle = 'Component 2: Variable Contributions (Signed)')


BR <- res_pcaInf$Inference.Data$fj.boots$tests$boot.ratios
laDim = 1

# Plot the bootstrap ratios for Dimension 1
ba001.BR1 <- PrettyBarPlot2(BR[,laDim],
                        threshold = 2,
                        font.size = 5,
                   color4bar = gplots::col2hex(VariableColors), # we need hex code
                  ylab = 'Bootstrap ratios'
                  #ylim = c(1.2*min(BR[,laDim]), 1.2*max(BR[,laDim]))
) + ggtitle("Bootstrap ratios", subtitle = paste0('Component ', laDim))

# Plot the bootstrap ratios for Dimension 2
laDim = 2
ba002.BR2 <- PrettyBarPlot2(BR[,laDim],
                        threshold = 2,
                        font.size = 5,
                   color4bar = gplots::col2hex(VariableColors), # we need hex code
                  ylab = 'Bootstrap ratios'
                  #ylim = c(1.2*min(BR[,laDim]), 1.2*max(BR[,laDim]))
) + ggtitle("",subtitle = paste0('Component ', laDim))

#We then use the next line of code to put two figures side to side:

  grid.arrange(
    as.grob(ctrJ.1),
    as.grob(ctrJ.2),
    as.grob(ba001.BR1),
    as.grob(ba002.BR2),
    ncol = 2,nrow = 2,
    top = textGrob("Barplots for variables", gp = gpar(fontsize = 18, font = 3))
  )

#BothCtrJ <- recordPlot() # you need this line to be able to save them in the end

Component 1: The variables with the greatest contributions are all but Focus, Speed, and Suicide.

Component 2: Failure is positively contributing while Sleep, Energy, and Appetite are negatively contributing.

The barplots for bootstrap ratios show that all the variables that contributed a signficant amount are actually significant.

1.3 Summary

When we interpret the factor scores and loadings together, the PCA revealed:

Component 1: Regardless of memory group they all experience more of the symptoms, except feelings of suicide, loss of focus, and distrubances in speed i.e. feeling hyperactive or lethargic.
Component 2: All memory groups experience more appetite, sleep, and energy when they have only increased feelings of faliure.
Both: When all memory groups experience increases in only one symptom they experience the other symptoms differently than compared to when they experience increases in all symptoms together.

Multivariate Statistical Analysis using R