5.2 How many factors should we retain?
The goal of principal component analysis is to reduce the number of dimensions that describe our data without losing too much information. The first step is to decide how many principal components, or factors, we want to retain. To help us decide, we’ll use the PCA function from the FactoMineR package.
To be able to use the PCA function, we need to transform the data frame first:
library(dplyr) # provides %>% and select(); can be skipped if dplyr or the tidyverse is already loaded
office.df <- office %>%
  select(-brand) %>% # the input for the PCA should contain only the dimensions, so drop the identifier column
  as.data.frame() # convert to a plain data.frame, which the PCA function needs
rownames(office.df) <- office$brand # use the brands as row names (this is important later on when making a biplot)
We can now proceed with the principal component analysis:
library(FactoMineR) # provides PCA()
office.pca <- PCA(office.df, graph = FALSE) # carry out the principal component analysis
office.pca$eig # look at the table with information on explained variance
##        eigenvalue percentage of variance cumulative percentage of variance
## comp 1  4.2656310              71.093850                          71.09385
## comp 2  1.6197932              26.996554                          98.09040
## comp 3  0.1145758               1.909596                         100.00000
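The two percentage columns follow directly from the eigenvalues: each eigenvalue is divided by the sum of all eigenvalues, and the cumulative column is the running total. The sketch below reproduces them in base R from the eigenvalues printed above; the names `eigenvalues`, `pct` and `cum_pct` are illustrative and not part of FactoMineR.

```r
# Eigenvalues copied from the office.pca$eig output above
eigenvalues <- c(4.2656310, 1.6197932, 0.1145758)

# Share of the total variance explained by each component, in percent
pct <- 100 * eigenvalues / sum(eigenvalues)

# Running total of explained variance across components
cum_pct <- cumsum(pct)

# Rebuild the table (rounded for readability)
round(cbind(eigenvalue = eigenvalues, pct = pct, cum_pct = cum_pct), 2)
```

Here the eigenvalues sum to 6, which is what one would expect if office.df holds six rating dimensions, because PCA standardizes each variable by default (scale.unit = TRUE) so that every variable contributes one unit of variance.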
The table shows that the first two components together explain 98.1 percent of the variance in the ratings. That is nearly all of it, which suggests we can safely describe our data with two dimensions. A common rule of thumb is that the retained components should together explain at least 70% of the variance.
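The rule of thumb can also be checked programmatically. The following is a minimal sketch that uses the cumulative percentages copied from the table above; `threshold` and `n_keep` are illustrative names, not FactoMineR arguments.

```r
# Cumulative percentage of variance, copied from the table above
cum_pct <- c(71.09385, 98.09040, 100.00000)

# Rule-of-thumb threshold, in percent
threshold <- 70

# Smallest number of components whose cumulative variance reaches the threshold
n_keep <- which(cum_pct >= threshold)[1]
n_keep
```

With the fitted object, the same vector should be available as office.pca$eig[, "cumulative percentage of variance"]. Note that the threshold is a minimum: for these data the first component alone already clears 70%, while keeping a second component raises the explained variance to 98.1 percent at the cost of only one extra dimension.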