Chapter 1 Principal Component Analysis

Advice: Use the simplest method that provides the clearest picture.

Principal component analysis (PCA) is used to analyze one table of quantitative data. PCA combines the input variables linearly to give new variables, called principal components. The first principal component is the line of best fit through the cloud of data points: it is the line that maximizes the inertia (similar to variance) of the points projected onto it. Each subsequent component is orthogonal to the previous components and maximizes the remaining inertia.
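To make the "line of best fit" idea concrete, here is a minimal sketch in base R. It uses simulated two-variable data (an assumption for illustration, not the PHQ data) and compares a brute-force search over directions with the first component returned by prcomp(). The variable names are hypothetical.

```r
set.seed(1)
# Simulated two-variable data (hypothetical, not the PHQ data)
x <- rnorm(200)
X <- cbind(a = x + rnorm(200, sd = 0.5),
           b = 2 * x + rnorm(200, sd = 0.5))

pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Variance of the centered data projected onto a unit vector at angle theta
Xc <- scale(X, center = TRUE, scale = FALSE)
proj_var <- function(theta) {
  u <- c(cos(theta), sin(theta))
  var(as.vector(Xc %*% u))
}

# Brute-force search over directions between 0 and 180 degrees
thetas <- seq(0, pi, length.out = 1801)
proj_vars <- sapply(thetas, proj_var)
best <- thetas[which.max(proj_vars)]
best_dir <- c(cos(best), sin(best))

# Agreement between the searched direction and component 1 (sign is arbitrary)
agreement <- abs(sum(best_dir * pca$rotation[, 1]))
# Dot product of components 1 and 2 (orthogonal components give 0)
orthogonality <- sum(pca$rotation[, 1] * pca$rotation[, 2])

print(agreement)
print(orthogonality)
print(c(searched = max(proj_vars), pc1 = pca$sdev[1]^2))
```

The grid search is only there to show what is being maximized; in practice the components come directly from the singular value decomposition.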

PCA gives one map for the rows (called factor scores) and one map for the columns (called loadings). These two maps are related because both are described by the same components. However, they project different kinds of information onto the components, and so they are interpreted differently. Factor scores are the coordinates of the row observations. They are interpreted by the distances between them and by their distance from the origin. Loadings describe the column variables. They are interpreted by the angles between them and by their distance from the origin.
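The two maps can be sketched with a singular value decomposition in base R. This is an illustration on simulated data (an assumption, not the PHQ data), and it uses one common convention in which the column coordinates are the right singular vectors scaled by the singular values. Under that convention, the cosine of the angle between two variables in the full component space equals their correlation. In a two-dimensional map the angle is only an approximation of this.

```r
set.seed(2)
# Hypothetical data: 50 rows, 4 columns (not the PHQ data)
X <- matrix(rnorm(50 * 4), nrow = 50,
            dimnames = list(NULL, c("v1", "v2", "v3", "v4")))
X[, "v2"] <- X[, "v1"] + rnorm(50, sd = 0.3)  # make v1 and v2 correlated

# Center and scale the columns, then decompose
Z <- scale(X, center = TRUE, scale = TRUE)
s <- svd(Z)

row_scores <- s$u %*% diag(s$d)   # coordinates of the observations
col_scores <- s$v %*% diag(s$d)   # coordinates of the variables
rownames(col_scores) <- colnames(X)

# Cosine of the angle between two variables, using all components
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
cos_v1_v2 <- cosine(col_scores["v1", ], col_scores["v2", ])
cor_v1_v2 <- cor(X[, "v1"], X[, "v2"])

print(c(cosine = cos_v1_v2, correlation = cor_v1_v2))
```

Distances between rows are preserved in the same way: the distances between rows of row_scores equal the distances between rows of the preprocessed table Z when all components are kept.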

The distance from the origin is important in both maps, because squared distance from the mean is inertia (variance, or information; compare the sum of squares in ANOVA or regression). By the Pythagorean Theorem, the total information contributed by a data point (its squared distance to the origin) is also equal to the sum of its squared factor scores across all components.
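The Pythagorean identity can be checked numerically. The sketch below uses simulated data (an assumption for illustration) and base R only: it compares each row's squared distance to the mean with the sum of its squared factor scores, and the total inertia with the sum of the eigenvalues.

```r
set.seed(3)
# Hypothetical data: 30 rows, 5 columns
X <- matrix(rnorm(30 * 5), nrow = 30)
Z <- scale(X, center = TRUE, scale = FALSE)   # center so the origin is the mean

s <- svd(Z)
F_rows <- s$u %*% diag(s$d)                   # row factor scores

d2_origin <- rowSums(Z^2)        # squared distance of each row to the origin
d2_scores <- rowSums(F_rows^2)   # sum of squared factor scores of each row

print(all.equal(d2_origin, d2_scores))

# Total inertia of the table versus the sum of the eigenvalues
total_inertia <- sum(d2_origin)
sum_eigenvalues <- sum(s$d^2)
print(c(total_inertia = total_inertia, sum_eigenvalues = sum_eigenvalues))
```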

1.0.1 PCA Data: PHQ

The Patient Health Questionnaire (PHQ) is a survey that provides a preliminary measure of depression severity.

There are 9 questions (columns), each answered on a scale from 1 (“no days”) to 4 (“nearly every day”), and 225 participants (rows).

The descriptors of the participants are memory group, sex, and age. This analysis uses memory group, which is high, normal, or low.

Pleasure  Hopeless  Sleep  Energy  Appetite  Failure  Focus  Speed  Suicide
       1         1      2       3         1        1      1      1        1
       1         1      1       2         1        1      1      1        1
       1         1      1       1         1        1      1      1        1
       1         3      4       2         3        2      1      1        1
       1         1      1       1         1        1      1      1        1
       1         2      3       3         1        3      2      1        1

1.2 PCA Analysis

The PCA is computed together with resample-based inference tests. The result object, res_pcaInf, is used in the sections below.
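As a rough sketch of how an object like res_pcaInf might be produced, the example below calls epPCA.inference.battery() from the InPosition package on a simulated stand-in for the PHQ table. The data, the object names phq_sim and res_sim, the preprocessing choice scale = "SS1", and the number of iterations are all assumptions for illustration. The original call and its settings are not shown in this chapter.

```r
set.seed(4)
# Simulated stand-in for the PHQ table: 225 rows, 9 items scored 1 to 4
items <- c("Pleasure", "Hopeless", "Sleep", "Energy", "Appetite",
           "Failure", "Focus", "Speed", "Suicide")
phq_sim <- matrix(sample(1:4, 225 * 9, replace = TRUE), nrow = 225,
                  dimnames = list(paste0("P", 1:225), items))

# Run only if InPosition is installed
if (requireNamespace("InPosition", quietly = TRUE)) {
  res_sim <- InPosition::epPCA.inference.battery(
    DATA = phq_sim,
    center = TRUE,
    scale = "SS1",     # assumed preprocessing choice
    graphs = FALSE,    # do not draw the default plots
    test.iters = 100   # number of resampling iterations (assumed)
  )
  # The two pieces read by the plotting code later in this chapter
  print(dim(res_sim$Fixed.Data$ExPosition.Data$fj))
  print(dim(res_sim$Inference.Data$fj.boots$tests$boot.ratios))
}
```

With random data like this the components carry no meaning. The point is only the shape of the call and of the result object.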

1.2.3 PCA Factor scores

Moving from left to right across the plot, the variance of the rows starts small, gradually increases, and then decreases slightly. However, this plot alone makes it hard to tell what is happening on each component.

1.2.6 PCA Loadings

Component 1: All variables are positively correlated.

Component 2: Sleep, Energy, and Appetite seem to be less correlated with the rest of the variables (approaching orthogonality).

More of the variance for Suicide, Speed, and Focus seems to lie in a different dimension.

  • Component 1: All Symptoms

  • Component 2: Emotional symptoms vs physical symptoms

1.2.6.1 PCA Bootstrap Ratio of columns

Note: This is not the same as the contribution bars.
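To illustrate what a bootstrap ratio measures, here is a simplified sketch in base R. It resamples rows with replacement, recomputes the PCA on each bootstrap sample, and divides the mean of the bootstrapped column scores by their standard deviation. This is an assumption-laden simplification on simulated data: the procedure used by the inference package may differ in its details, and the function and variable names are hypothetical.

```r
set.seed(5)
# Hypothetical data: one shared factor drives v1 to v3; v4 is pure noise
n <- 200
g <- rnorm(n)
X <- cbind(v1 = g + rnorm(n), v2 = g + rnorm(n),
           v3 = g + rnorm(n), v4 = rnorm(n))

# Column scores on component 1 for a given table
col_scores_pc1 <- function(M) {
  s <- svd(scale(M, center = TRUE, scale = TRUE))
  s$v[, 1] * s$d[1]
}

fixed <- col_scores_pc1(X)

# Recompute the scores on bootstrap samples of the rows
n_boot <- 500
boot <- replicate(n_boot, {
  b <- col_scores_pc1(X[sample(n, replace = TRUE), ])
  if (sum(b * fixed) < 0) -b else b   # resolve the arbitrary sign of a component
})

# Bootstrap ratio: mean of the bootstrapped scores over their standard deviation
boot_ratio <- rowMeans(boot) / apply(boot, 1, sd)
names(boot_ratio) <- colnames(X)

print(round(boot_ratio, 1))
print(abs(boot_ratio) > 2)   # same threshold of 2 as in the plots of this section
```

A contribution says how much of a component's inertia a variable accounts for. A bootstrap ratio says how stable the variable's position on that component is under resampling, which is why the two barplots answer different questions.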

1.2.6.1.1 PCA Component 1 and 2

library(ggplot2)    # provides ggtitle()
library(PTCA4CATA)  # provides PrettyBarPlot2()

# Signed contributions: each contribution takes the sign of its column factor score
signed.ctrJ <- res_pcaInf$Fixed.Data$ExPosition.Data$cj *
  sign(res_pcaInf$Fixed.Data$ExPosition.Data$fj)

# Bar colors for the variables (we need hex codes)
bar.colors <- gplots::col2hex(VariableColors)

# Plot contributions for component 1 (threshold = average contribution)
ctrJ.1 <- PrettyBarPlot2(signed.ctrJ[, 1],
                         threshold = 1 / NROW(signed.ctrJ),
                         font.size = 5,
                         color4bar = bar.colors,
                         ylab = 'Contributions',
                         ylim = c(1.2 * min(signed.ctrJ), 1.2 * max(signed.ctrJ))
) + ggtitle("Contribution barplots", subtitle = 'Component 1: Variable Contributions (Signed)')

# Plot contributions for component 2 (same threshold and y-axis limits)
ctrJ.2 <- PrettyBarPlot2(signed.ctrJ[, 2],
                         threshold = 1 / NROW(signed.ctrJ),
                         font.size = 5,
                         color4bar = bar.colors,
                         ylab = 'Contributions',
                         ylim = c(1.2 * min(signed.ctrJ), 1.2 * max(signed.ctrJ))
) + ggtitle("", subtitle = 'Component 2: Variable Contributions (Signed)')

# Bootstrap ratios of the columns, one column per component
BR <- res_pcaInf$Inference.Data$fj.boots$tests$boot.ratios

# Plot the bootstrap ratios for Dimension 1 (threshold of 2)
laDim <- 1
ba001.BR1 <- PrettyBarPlot2(BR[, laDim],
                            threshold = 2,
                            font.size = 5,
                            color4bar = bar.colors,
                            ylab = 'Bootstrap ratios'
                            # ylim = c(1.2 * min(BR[, laDim]), 1.2 * max(BR[, laDim]))
) + ggtitle("Bootstrap ratios", subtitle = paste0('Component ', laDim))

# Plot the bootstrap ratios for Dimension 2 (threshold of 2)
laDim <- 2
ba002.BR2 <- PrettyBarPlot2(BR[, laDim],
                            threshold = 2,
                            font.size = 5,
                            color4bar = bar.colors,
                            ylab = 'Bootstrap ratios'
                            # ylim = c(1.2 * min(BR[, laDim]), 1.2 * max(BR[, laDim]))
) + ggtitle("", subtitle = paste0('Component ', laDim))

Component 1: All variables except Focus, Speed, and Suicide contribute strongly.

Component 2: Failure contributes positively, while Sleep, Energy, and Appetite contribute negatively.

The barplots of the bootstrap ratios show that every variable with an important contribution is also significant, with a bootstrap ratio beyond the threshold of 2.
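For readers who want to see where contributions come from, here is a minimal sketch in base R on simulated data (an assumption, not the PHQ data). It assumes the usual definition for a PCA with equal weights: a variable's contribution to a component is its squared column factor score divided by the component's eigenvalue, so the contributions to each component sum to 1 and the average contribution is 1 divided by the number of variables. The variable names are hypothetical.

```r
set.seed(6)
# Hypothetical data: 100 rows, 5 columns
X <- matrix(rnorm(100 * 5), nrow = 100,
            dimnames = list(NULL, paste0("v", 1:5)))
s <- svd(scale(X, center = TRUE, scale = TRUE))

fj <- s$v %*% diag(s$d)          # column factor scores
rownames(fj) <- colnames(X)
eig <- s$d^2                     # inertia (eigenvalue) of each component

# Contribution of each variable (row) to each component (column)
cj <- sweep(fj^2, 2, eig, "/")
print(colSums(cj))               # contributions to each component sum to 1

# Signed contributions, built the same way as signed.ctrJ
signed_cj <- cj * sign(fj)

# Variables whose contribution to component 1 exceeds the average
threshold <- 1 / nrow(cj)
print(abs(signed_cj[, 1]) > threshold)
```

The sign only records on which side of the component the variable falls. The size of the bar is the share of that component's inertia the variable accounts for.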

1.3 Summary

When we interpret the factor scores and loadings together, the PCA revealed:

  • Component 1: Regardless of memory group, participants experience more of all the symptoms together, except suicidal feelings, loss of focus, and disturbances in speed (i.e., feeling hyperactive or lethargic).

  • Component 2: In all memory groups, participants experience more appetite, sleep, and energy symptoms when only their feelings of failure have increased.

  • Both: When participants in any memory group experience an increase in only one symptom, they experience the other symptoms differently than when all symptoms increase together.