Chapter 21 PCA

Hello! In this tutorial, we will be learning about principal component analysis (PCA). PCA is considered a “data reduction” variable: it is used when researchers have a large number of variables that they want to represent with a smaller number of variables (called “[principal] components”). The idea behind this strategy is that researchers can represent their data using a linear combination of variables (i.e., a weighted average). The goal of the PCA algorithm is to identify those optimal weights.

To learn more about PCA, let’s use our survey data. PCA is often used in survey data to test whether indicies are meaningful together or to try and group multiple questions. This helps to reduce dimensionality (and, by extension, multicollinearity issues).

21.1 Set Up

Let’s set our data up by bringing in tidyverse. Because of the popularity of PCA, base R contains substantial support for conducting a PCA.

library(tidyverse)

survey_data <- read_csv("data/pew_AmTrendsPanel_110_2022_revised.csv")
colnames(survey_data)

##  [1] "...1"                       "perceived_party_difference" "gun_bill_approval"          "personal_fin"               "party"                     
##  [6] "abortion_opinion"           "age"                        "education"                  "income"                     "voter"

To use this data, we need to select the variables we want to focus on and clean them. While saving the variables as characters or factors is useful for human interpretation, PCA takes numeric variables specifically, so likert scales and other variables must be converted.

survey_data_trim <- survey_data %>% select(perceived_party_difference, gun_bill_approval, personal_fin, abortion_opinion, age)

Importantly, we will remove data rows with missing observations. Traditional PCA does not support missing data, but there are PCA variants (like bpca, or Bayesian Principal Component Analysis) that can handle missing data (for bpca, I recommend using the pcaMethods package, which you can learn about here). We can subset our data to focus on the rows with all their responses using complete.cases().

survey_data_trim[survey_data_trim == "NaN"] <-  NA  

survey_complete <- survey_data_trim[complete.cases(survey_data_trim),]

21.1.1 Centering

In some instances (i.e., when you are using variables with different levels of analysis), it is often helpful to center the variables.

#apply(survey_complete, 2, var) #use this line to calculate the variance of each variable
#?scale #use scale() center columns of a numeric matrix
scaled_df <- apply(survey_complete, 2, scale)
head(scaled_df)

##      perceived_party_difference gun_bill_approval personal_fin abortion_opinion          age
## [1,]              -0.1395452393         0.5558159  -0.16356318      -0.34543710  0.184887641
## [2,]              -0.1395452393        -0.1149440  -0.16356318      -0.34543710  0.184887641
## [3,]              -0.1395452393        -0.1149440  -0.01039481      -0.34543710  0.005124445
## [4,]               0.1405432872         0.3322293   0.14277356      -0.34543710  0.005124445
## [5,]               0.0004990239         0.1086427  -0.16356318      -0.34543710 -0.354401947
## [6,]               0.0004990239        -0.3385306   0.14277356      -0.01592245 -0.174638751

Now that our data are prepared, let’s look at this data further.

21.2 Descriptive Analysis

PCA is especially effective when the variables are highly correlated (0.3 or higher). If the variables are not strongly correlated, they will be difficult to combine into an index (“component”).

To identify how strongly correlated these variables are, we will use cor():

cor(scaled_df)

##                            perceived_party_difference gun_bill_approval personal_fin abortion_opinion          age
## perceived_party_difference                 1.00000000        0.01542172  0.032231101      0.211518972  0.029760235
## gun_bill_approval                          0.01542172        1.00000000  0.163473970      0.141640764 -0.010785667
## personal_fin                               0.03223110        0.16347397  1.000000000      0.048013644 -0.008167237
## abortion_opinion                           0.21151897        0.14164076  0.048013644      1.000000000 -0.008922412
## age                                        0.02976024       -0.01078567 -0.008167237     -0.008922412  1.000000000

It looks like quite a few of our variables are strongly correlated, so let us proceed with our analysis!

21.3 PCA Basics

In this tutorial, I will walk through two processes for conducting a PCA. The first approach is the “older” coding approach: we will calculate eigenvalues/eigenvectors from a covariance matrix and use this to identify the optimal number of components as well as to construct our components. In the newer coding approach, we use prcomp, which is great for PCA analysis and is especially useful for the interpretation of data. At this point, prcomp is the more popular approach, but the emphasis on data visualizations can obfuscate the back-end math, which is why I also discuss the older method.

PCA takes the following steps: (1) Calculate eigenvalues from covariance matrix, (2) determine the optimal number of components, and then (3) plot the data based on these components.

21.3.1 (1) Eigenvalues/vectors

This process of identifying our data’s eigenvalues and eigenvectors is fairly straightforward: we use cov() to calculate the covariate matrix and then eigen() to calculate eigenvectors/eigenvalues. Recall from our class that eigenvectors are used to represent (matrix) data because some vectors contains special properties that, in the case of PCA, are useful for aggregating multiple variables. Learn more about eigenvectors and eigenvalues here.

eigen_survey <- cov(scaled_df) %>%
  eigen() #create eigenvalues and eigenvectors
str(eigen_survey)

## List of 2
##  $ values : num [1:5] 1.315 1.085 0.996 0.865 0.739
##  $ vectors: num [1:5, 1:5] -0.471049 -0.502777 -0.401197 -0.60363 0.000212 ...
##  - attr(*, "class")= chr "eigen"

eigen_survey

## eigen() decomposition
## $values
## [1] 1.3149517 1.0845356 0.9960530 0.8651995 0.7392602
## 
## $vectors
##              [,1]       [,2]        [,3]       [,4]        [,5]
## [1,] -0.471048758  0.5675086 -0.07892895 -0.4004943  0.53797915
## [2,] -0.502777186 -0.4659175  0.10875962  0.5436650  0.47194884
## [3,] -0.401196809 -0.5391333  0.21785455 -0.6640245 -0.24492299
## [4,] -0.603630056  0.3036400 -0.17345626  0.3010754 -0.65015388
## [5,]  0.000212002  0.2792721  0.95099626  0.1116143 -0.07180104

21.3.2 (2) Component selection

Next, we will need to determine the number of components we want in our analysis. To figure this out, it is helpful to calculate the proportion of variance explained (PVE). The goal of PVE is to have as much of the variance explained by as few components as possible (because what’s the point of reducing a 28-variable dataset to 28 components? hint: there isn’t one). One straightforward way to calculate the PVE of each component is to simply divide the eigenvalue by the sum of all the eigenvalues.

pve <- eigen_survey$values / sum(eigen_survey$values)
round(pve, 2)

## [1] 0.26 0.22 0.20 0.17 0.15

Note that, in our data, the first 3 variables explain 66%+ of the data, so let’s focus on these (the percent explained drops a little after this). We’ll talk about whether this is really the right approach at the end.

To visualize how meaningful this PCA is, it is often helpful to include a visual about how the data are reduced. To do this, we’ll

Notably, R reverses the numbers (https://uc-r.github.io/pca), so you need to flip it back, which we do in the first line of this chunk. Each statistical software varies in whether they reverse the numbers or not, so it may be helpful to look into what other softwares use. For interpretability, we will also add variable names to our matrix to make this easier to interpret.

components <- -(eigen_survey$vectors[,1:3]) #negate
row.names(components) <- colnames(survey_complete) #save the variable names
colnames(components) <- c("PC1", "PC2", "PC3")
components

##                                     PC1        PC2         PC3
## perceived_party_difference  0.471048758 -0.5675086  0.07892895
## gun_bill_approval           0.502777186  0.4659175 -0.10875962
## personal_fin                0.401196809  0.5391333 -0.21785455
## abortion_opinion            0.603630056 -0.3036400  0.17345626
## age                        -0.000212002 -0.2792721 -0.95099626

Based on this data we can tell that PC1 seems to distinguish strongly between variables like covid, aca opinion, medicare4ll opinion, and opinions about a public option for healthcare and variables such as approval of Trump, party affiliation and swing vote. PC2 distinguishes between the healthcare variable and the covid variable and opinions about a public option for healthcare. We’ll explore this distinction further.

21.3.3 (3) Plotting

To plot the results of our PCA, let’s calculate the principal component score for each observation (in this case, for each participant).

PC1 <- as.matrix(scaled_df) %*% components[,1]
PC2 <- as.matrix(scaled_df) %*% components[,2]
PC3 <- as.matrix(scaled_df) %*% components[,3]

PC <- data.frame(Participant = row.names(survey_complete), PC1, PC2, PC3)
head(PC)

##   Participant         PC1          PC2        PC3
## 1           1 -0.06045748  0.303229695 -0.2715772
## 2           2 -0.39770025 -0.009289081 -0.1986256
## 3           3 -0.33621148  0.123491933 -0.0610399
## 4           4  0.08200305  0.255463288 -0.1209356
## 5           5 -0.21920398  0.166016071  0.3009731
## 6           6 -0.12226434 -0.027430121  0.1690729

We can then use this data to plot the poll data on a two-dimensional scale using the principal components as axes (i.e., using the eigenvectors as directions in a feature space):

ggplot(PC, aes(PC1, PC2, PC3)) + 
  modelr::geom_ref_line(h = 0) +
  modelr::geom_ref_line(v = 0) +
  geom_text(aes(label = Participant), size = 3) +
  xlab("First Principal Component") + 
  ylab("Second Principal Component") + 
  ggtitle("First Two Principal Components of Poll Data")

While possibly interpretable, this figure is somewhat difficult to interpret. This is a big reason why folks have move onto more advanced and interpretable visualizations for showing their principal component analysis.

21.4 PCA, Intermediate

21.4.1 (1) Eigenvectors/values

Using the function prcomp(), we can also calculate the eigenvectors/eigenvalues. We will set scale == TRUE to scale the data as well.

df_pca <- prcomp(survey_complete, scale = TRUE)
df_pca

## Standard deviations (1, .., p=5):
## [1] 1.1467134 1.0414104 0.9980245 0.9301610 0.8598024
## 
## Rotation (n x k) = (5 x 5):
##                                     PC1        PC2         PC3        PC4         PC5
## perceived_party_difference  0.471048758 -0.5675086  0.07892895  0.4004943 -0.53797915
## gun_bill_approval           0.502777186  0.4659175 -0.10875962 -0.5436650 -0.47194884
## personal_fin                0.401196809  0.5391333 -0.21785455  0.6640245  0.24492299
## abortion_opinion            0.603630056 -0.3036400  0.17345626 -0.3010754  0.65015388
## age                        -0.000212002 -0.2792721 -0.95099626 -0.1116143  0.07180104

21.4.2 (2) scree plots

While PCE is great for some straightforward numbers, it sometimes also helps to visualize the PVE to show how meaningful each component is. We can use factoextra (which is a separate library) to plot such a visualization (this is called a scree plot).

#install.packages("factoextra")
library(factoextra) #for visualization
fviz_eig(df_pca)

As we saw in the first analysis, the first component. explains a majority of the variance, and the second component explains a little more.

factoextra also has a useful function, get_eigenvalue(), for creating a dataframe of eigenvalue information.

eig.val <- get_eigenvalue(df_pca)
eig.val

##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1  1.3149517         26.29903                    26.29903
## Dim.2  1.0845356         21.69071                    47.98975
## Dim.3  0.9960530         19.92106                    67.91081
## Dim.4  0.8651995         17.30399                    85.21480
## Dim.5  0.7392602         14.78520                   100.00000

21.4.3 (3) factoextra visualizations

We can also use factoextra to visualize the meaningfulness of our components “more elegantly.” In factoextra, there is a great visualization called fviz_pca that can help you plot the participants (similar to what we did above), the variables, or both. Since you already know how to visualize the participants, let’s focus on the variables.

fviz_pca_var(df_pca,
             col.var = "contrib", # Color by contributions to the component
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE)

We can also visualize this using ggfortify, which adds additional support to ggplot2 for visualizing PCA results. One advantage of this method is that you have much more control of your visualization.

#install.package()
library(ggfortify)

## Warning: package 'ggfortify' was built under R version 4.2.2

ggplot2::autoplot(df_pca, data = survey_complete, 
                  shape = TRUE, #if you set this to false, it will show you the participant id's
                  loadings.colour = 'blue', loadings = TRUE, #style of variable arrows
                  loadings.label.colour = 'red', loadings.label = TRUE, loadings.label.size = 3, #style of variable names
                  loadings.label.repel=T) + #this is the eqivalent of repel in fviz_pca_var
  theme_minimal()

More Resources:
1. PCA tutorial which has great information about how to report on a PCA.
2. PCA tutorial using state data (this helps with the interpretation of individual data points since you’d be using state names rather than participant numbers, as we had).
3. PCA tutorial; I like the way this one is written.
4. More on prcomp()
5. If you want to use this sort of approach on categorical variable, I encourage you to look into factor analysis.
6. If you want to use this approach on mixed data (categorical or numeric), look into factor analyses of mixed data.
+ Note: sometimes, you’ll see categorical variables be called “qualitative” and numeric variables called “quantitative” in this literature.