7.4 PCA Analysis

PCA (principal component analysis) and factor analysis are among the most commonly used methods for dimension reduction. In a typical data science project, a dataset can have tens or hundreds of features (attributes). In text analysis, for example, counting the occurrences of each word in a document can easily produce hundreds or even thousands of dimensions. PCA is very useful when we want to reduce the dimensions to a manageable number. This matters particularly in visualization, because humans are not good at reading anything over three dimensions.

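PCA uses Eigenvalues and Eigenvectors³ to preserve as much of the original data's information and variation as possible. Carrying out PCA therefore comes down to calculating the Eigenvectors of the data's Covariance Matrix. Each Eigenvector defines one principal component, its entries show how much each attribute contributes to that component, and its Eigenvalue shows how much of the variance the component captures.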

PCA normally has the following steps:

Calculate the Covariance Matrix⁴ of the given dataset.
Calculate the Eigenvalues and Eigenvectors of the resulting Covariance Matrix.
The Eigenvector that corresponds to the largest Eigenvalue is the first principal component. It accounts for the largest fraction of the variance in the original dataset.
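
The three steps above can be sketched by hand in a few lines of R. This is only an illustration under stated assumptions: it uses the built-in USArrests dataset as a stand-in for RE_data, and the variable names X, C, e and p are made up for the example. It then checks that prcomp() arrives at the same variances.

```r
# Illustrative sketch: USArrests is a built-in R dataset, not the RE_data used in this chapter
X <- scale(USArrests)          # center and scale each column

C <- cov(X)                    # step 1: covariance matrix of the scaled data
e <- eigen(C)                  # step 2: Eigenvalues and Eigenvectors

e$values                       # variance captured along each Eigenvector
e$values[1] / sum(e$values)    # step 3: share of variance on the first Eigenvector

# prcomp() reports standard deviations, i.e. the square roots of the Eigenvalues
p <- prcomp(USArrests, center = TRUE, scale. = TRUE)
all.equal(p$sdev^2, e$values)
```

The Eigenvectors in e$vectors match the columns of p$rotation up to sign, because the direction of an Eigenvector is only defined up to a factor of -1.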

In R, the function prcomp() performs PCA. It only takes numerical values. Our RE_data dataset has 18 attributes; let us run PCA on 16 of them, leaving out passengerId and Survived. These two are excluded because passengerId is only a record identifier and Survived is the dependent variable.

# Run PCA on the first 891 rows, dropping columns 1 and 2 (passengerId and Survived)
data.pca <- prcomp(RE_data[1:891, c(-1, -2)], center = TRUE, scale. = TRUE)
summary(data.pca)

## Importance of components:
##                           PC1    PC2   PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.1734 1.9300 1.362 1.23379 0.99671 0.91959 0.79275
## Proportion of Variance 0.2952 0.2328 0.116 0.09514 0.06209 0.05285 0.03928
## Cumulative Proportion  0.2952 0.5280 0.644 0.73913 0.80122 0.85407 0.89335
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.75131 0.66502 0.55475 0.48300 0.27453 0.19723 0.15091
## Proportion of Variance 0.03528 0.02764 0.01923 0.01458 0.00471 0.00243 0.00142
## Cumulative Proportion  0.92863 0.95627 0.97550 0.99008 0.99479 0.99722 0.99865
##                           PC15      PC16
## Standard deviation     0.14717 1.534e-15
## Proportion of Variance 0.00135 0.000e+00
## Cumulative Proportion  1.00000 1.000e+00

The summary shows 16 principal components, named PC1 to PC16. Each of these explains a percentage of the total variation in the dataset: PC1 explains about 30% of the total variance and PC2 about 23%. Together, these two components capture over half (53%) of the variance in the dataset. So, knowing the position of a sample in relation to just PC1 and PC2 gives a reasonable, though incomplete, view of where it stands in relation to other samples. Note also that PC16 has a standard deviation of essentially zero, which suggests that at least one attribute is a linear combination of the others.
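
The two proportion rows of the summary can be derived directly from the standard deviations. The following is a minimal sketch, again assuming the built-in USArrests dataset in place of RE_data. The names var_explained, cum_explained and n_keep, and the 80% threshold, are arbitrary choices for the illustration.

```r
# Illustrative sketch: USArrests stands in for RE_data
p <- prcomp(USArrests, center = TRUE, scale. = TRUE)

var_explained <- p$sdev^2 / sum(p$sdev^2)   # the "Proportion of Variance" row
cum_explained <- cumsum(var_explained)      # the "Cumulative Proportion" row
round(rbind(var_explained, cum_explained), 4)

# Smallest number of components whose cumulative share reaches 80%
# (the threshold is an assumption made for this example)
n_keep <- which(cum_explained >= 0.80)[1]
n_keep
```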

Let’s call str() to have a look at the PCA object.

str(data.pca)

## List of 5
##  $ sdev    : num [1:16] 2.173 1.93 1.362 1.234 0.997 ...
##  $ rotation: num [1:16, 1:16] -0.1995 0.0756 0.3116 -0.3526 -0.2938 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:16] "Pclass" "Sex" "Age" "SibSp" ...
##   .. ..$ : chr [1:16] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:16] 2.309 1.648 29.452 0.523 0.382 ...
##   ..- attr(*, "names")= chr [1:16] "Pclass" "Sex" "Age" "SibSp" ...
##  $ scale   : Named num [1:16] 0.836 0.478 13.432 1.103 0.806 ...
##   ..- attr(*, "names")= chr [1:16] "Pclass" "Sex" "Age" "SibSp" ...
##  $ x       : num [1:891, 1:16] -0.571 1.664 -0.24 1.708 0.818 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:891] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:16] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"

The above results contain a lot of details, briefly:

The center point ($center) and scaling ($scale) of each original variable, and the standard deviation ($sdev) of each principal component
The relationship (correlation or anti-correlation, etc.) between the original variables and the principal components ($rotation); each column holds the loadings, that is the Eigenvector, of one component
The values of each sample in terms of the principal components ($x)
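
How these elements fit together can be checked with a short sketch: standardizing the data with $center and $scale and multiplying by $rotation should reproduce $x. The example assumes the built-in USArrests dataset rather than RE_data, and Z and scores are names made up for the illustration.

```r
# Illustrative sketch: USArrests stands in for RE_data
p <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Standardize the data with the stored center and scale values
Z <- scale(USArrests, center = p$center, scale = p$scale)

# Project the standardized data onto the principal components
scores <- Z %*% p$rotation

all.equal(scores, p$x)   # the projection should match the stored sample values
```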

Let us plot the PCA result to get a visual sense of it. To do so we use a biplot. A biplot is a type of plot that allows you to visualize how the samples relate to one another in the selected principal components (which samples are similar and which are different) and simultaneously reveals how each variable contributes to each principal component.
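
As a minimal sketch, base R can draw a biplot of a prcomp object with biplot(). This is not the ggplot-based function named in the figure captions. The example assumes the built-in USArrests dataset instead of RE_data.

```r
# Illustrative sketch: base R biplot on USArrests, not the ggplot-based plot used for the figures
p <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Samples are drawn as labels and variables as arrows, on the first two components
biplot(p, choices = 1:2, cex = 0.6)
```

The choices argument selects which pair of components is drawn, and cex shrinks the labels so that crowded samples remain readable.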

Figure 7.2: The 1st and the 2nd PCs plotted with ggplot_pca

Figure 7.3: The 1st and the 2nd PCs plotted with ggplot_pca

The axes of the original variables are drawn as arrows originating from the center point. Here, you see that the variables Fare_pp, Age_group, and Survived contribute to PC1, with higher values in those variables moving the records to the right on this plot. This lets you see how the data points relate to the axes.

Other principal components are also available, although they carry less weight than the first two. Each of them maps to the original variables in a different way. We can plot these components too, for example PC3 and PC4, which correspond mainly to Sex and Age_group. You may wonder what they have to do with our prediction. At the least, the plot can show how they relate to the dependent variable Survived. It can also show how both covary with other variables.
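
One way to see which original variable dominates a component is to look for the largest absolute loading in each column of $rotation. The sketch below assumes the built-in USArrests dataset in place of RE_data; top_var is a name made up for the illustration. It also draws the third and fourth components with base R's biplot().

```r
# Illustrative sketch: USArrests stands in for RE_data
p <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# For each component, name the variable with the largest absolute loading
top_var <- rownames(p$rotation)[apply(abs(p$rotation), 2, which.max)]
names(top_var) <- colnames(p$rotation)
top_var

# Plot the 3rd and 4th components instead of the first two
biplot(p, choices = 3:4, cex = 0.6)
```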

Figure 7.4: The 3rd and the 4th PCs plotted with ggplot_pca

This plot shows that the original attributes Ticket_class, Sex, Fare_PP, and Group_size contribute negatively to PC3, which corresponds mainly to Sex. It means that with lower values in those variables, the records will move to the left on this plot. Notice that the graph shows a close relationship between the original attributes and the newly created principal components, which indicates the correlation among them.
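
The correlation between the original attributes and the principal components can also be computed as numbers rather than read off a plot. When the data are scaled, each correlation equals the loading multiplied by the standard deviation of its component. The sketch below checks this on the built-in USArrests dataset, which is an assumption standing in for RE_data; cor_direct and cor_loadings are illustrative names.

```r
# Illustrative sketch: USArrests stands in for RE_data
p <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Correlation of every original variable with every component, computed directly
cor_direct <- cor(USArrests, p$x)

# The same correlations derived from the loadings: loading * sdev of the component
cor_loadings <- sweep(p$rotation, 2, p$sdev, "*")

round(cor_direct, 3)
all.equal(cor_direct, cor_loadings)
```

A negative entry means the variable and the component move in opposite directions, which is what an arrow pointing against an axis shows in the biplot.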

With these correlation and PCA analyses, we can form a pretty good idea of the attributes. Depending on the models we are constructing, we can select the number of predictors, and the specific predictors, with more confidence that our model will perform well.

Attribute selection is a parsimonious process that aims to identify a minimal set of predictors for the maximum gain (predictive accuracy). This approach is the opposite of data pre-processing, where as many meaningful attributes as possible are considered for potential use. Later on, when we talk about prediction models, we will see that many models provide a function to analyse the importance of their predictors. It is very similar to the PCA here.

It is important to recognize that attribute selection could be an iterative process that occurs throughout the model building process. It finishes after no more improvement can be achieved in terms of model accuracy.

³ In linear algebra, an Eigenvector, or characteristic vector, of a linear transformation is a non-zero vector that changes by a scalar factor when that linear transformation is applied to it.
⁴ A Covariance Matrix is a square matrix giving the covariance between each pair of elements of a given random vector.