Chapter 2 Principal Component Analysis

Data table: PCA is used to analyze one table of quantitative data.

Goal: PCA computes new variables, called principal components, as linear combinations of the original variables. Each component is chosen to capture as much of the variance of the data as possible while remaining orthogonal to the components before it.

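This is obtained by performing an SVD of the centered (and optionally scaled) data table, which is equivalent to an eigendecomposition of its covariance or correlation matrix.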

Key ideas

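  1. Principal components are orthogonal to each other and are therefore uncorrelated (which is weaker than being independent).
  2. Singular values are proportional to the standard deviation of each component.
  3. Eigenvalues are the squared singular values: they give the variance of each component, expressed as a sum of squares.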

NOTE: SD > 1 stretches the data and SD < 1 compresses the data.
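
The key ideas above can be checked numerically. The following is a minimal sketch in base R on simulated data (the matrix X is hypothetical, not the SAM data); it assumes the table is centered but not scaled.

```r
set.seed(1)
# Hypothetical data: 50 observations on 4 correlated quantitative variables
X  <- matrix(rnorm(50 * 4), nrow = 50) %*% matrix(runif(16), nrow = 4)
Xc <- scale(X, center = TRUE, scale = FALSE)    # center each column

sv <- svd(Xc)                                   # Xc = P %*% diag(d) %*% t(Q)
ev <- eigen(crossprod(Xc), symmetric = TRUE)    # eigen-decomposition of the sums-of-squares matrix

# Eigenvalues are the squared singular values
all.equal(sv$d^2, ev$values)

# Factor scores: their cross-product is diagonal, so components are orthogonal
Fs <- Xc %*% sv$v
round(crossprod(Fs), 8)

# Singular value / sqrt(n - 1) is the standard deviation of each component
all.equal(sv$d / sqrt(nrow(Xc) - 1), unname(apply(Fs, 2, sd)))
```

The diagonal of crossprod(Fs) holds the eigenvalues, which is the sense in which an eigenvalue is both a variance (up to the n - 1 divisor) and a sum of squares.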

Interpretation

1. Factor scores are the coordinates of the row observations. They are
interpreted by the distances between them, and their distance from the origin. 

2. Loadings describe the column variables. Loadings are interpreted by the
angle between them and the principal axis, and their distance from the origin.

3. The distance from the origin is important in both maps, because squared
distance from the mean is inertia. Because of the Pythagorean Theorem, the
total information contributed by a data point (its squared distance to the
origin) is also equal to the sum of its squared factor scores.
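
The Pythagorean argument can be written out as a short derivation. The symbols below are notation introduced here for illustration, and the equalities assume that all components are kept.

```latex
% x_{i,j}: centered value of observation i on variable j (J variables)
% f_{i,\ell}: factor score of observation i on component \ell (L components)
% \lambda_\ell: eigenvalue of component \ell
d_i^2 \;=\; \sum_{j=1}^{J} x_{i,j}^2 \;=\; \sum_{\ell=1}^{L} f_{i,\ell}^2 ,
\qquad
\sum_{i=1}^{I} f_{i,\ell}^2 \;=\; \lambda_\ell ,
\qquad
\sum_{i=1}^{I} d_i^2 \;=\; \sum_{\ell=1}^{L} \lambda_\ell .
```

Read left to right: a point's squared distance to the origin is the same in the original space and in the component space, each eigenvalue is the inertia carried by one component, and the total inertia is the sum of the eigenvalues.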

2.1 Dataset: Survey of Autobiographical Memory

The data were collected by the Baycrest Institute at the University of Toronto. Participants with different memory scores completed several questionnaires.

Participants were asked to rate the extent to which a particular item applied to their memory in general, using a 5-point Likert scale (1 = completely disagree, 2-4 = intermediate degrees of agreement/disagreement, 5 = completely agree).

There are 153 observations (rows), one per participant, and 26 questions (columns): 8 episodic memory questions, 6 semantic memory questions, 6 spatial memory questions and 6 prospective memory questions.

The participants include both men and women aged 18 to 84 years; age and sex are recorded as variables. A survey-based measure of AM is also used to categorize the participants into two groups: High Memory and Normal Memory.

Dataset Cleaning

Steps:

  1. Check for NA or incomplete data in the dataset and remove them if they exist.
  2. Remove Mysterious Memory Groups: participants with conflicting responses.
  3. Filtering data: use only numeric data to perform PCA.
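
The three cleaning steps could look like the sketch below. The data frame, its column names and the "Mysterious" group label are all hypothetical stand-ins; the real SAM file may be organized differently.

```r
# Hypothetical raw table (illustrative column names, not those of the SAM file)
raw <- data.frame(
  sex    = c("F", "M", "F", "M", "F"),
  age    = c(23, 41, NA, 67, 35),
  memory = c("High", "Normal", "High", "Mysterious", "Normal"),
  E1     = c(5, 3, 4, 2, 4),
  S1     = c(4, NA, 3, 3, 2)
)

complete <- na.omit(raw)                                  # 1. drop rows with any NA
kept     <- complete[complete$memory != "Mysterious", ]   # 2. drop the conflicting group
items    <- kept[, c("E1", "S1"), drop = FALSE]           # 3. keep the numeric questionnaire items
groups   <- kept$memory                                   # design variable, kept aside for coloring
```

Keeping the group labels in a separate vector, aligned with the rows of items, makes it easy to pass them later as a design variable.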

Preprocessing the Data

  1. Centering: subtracting the mean of each column from each of its values.
  2. Scaling: normalizing each column to a common spread. Since the SAM data consist of Likert-scale items that all range from 1 to 5, the units are homogeneous, so we refrain from scaling the data.
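
As a small illustration of centering without scaling, here is a sketch using base R's scale() on a made-up two-item table (the values and item names are hypothetical).

```r
# Hypothetical answers of five participants to two items
X <- matrix(c(1, 2, 3, 4, 5,
              5, 4, 4, 2, 1), ncol = 2,
            dimnames = list(NULL, c("E1", "S1")))

Xc <- scale(X, center = TRUE, scale = FALSE)   # subtract each column mean, keep the spread

colMeans(Xc)       # column means are now 0
apply(Xc, 2, sd)   # standard deviations are unchanged because scale = FALSE
```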

2.2 Looking at the data pattern

Correlation Plot

What it does: The corrplot package displays a correlation matrix as a graph. The details of the plot can be adjusted through parameters for color, text labels, color labels and layout.

Analyzing the plot: The corrplot package offers seven visualization methods (parameter method): "circle", "square", "ellipse", "number", "shade", "color" and "pie".

Positive correlations are displayed in blue and negative correlations in red. Color intensity is proportional to the correlation coefficient.
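
A correlation plot of this kind could be produced as sketched below. The item matrix is simulated and the parameter values are only one possible choice; the call is guarded so the snippet still runs if corrplot is not installed.

```r
set.seed(2)
# Hypothetical 1-5 Likert answers of 40 participants to four items
items <- as.data.frame(matrix(sample(1:5, 40 * 4, replace = TRUE), ncol = 4,
                              dimnames = list(NULL, c("E1", "E2", "S1", "P1"))))
cor.mat <- cor(items)   # Pearson correlations between the columns

if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor.mat,
                     method = "color",          # one of the seven methods
                     tl.col = "black",          # color of the text labels
                     tl.cex = 0.8,              # size of the text labels
                     addCoef.col = "grey30")    # print the coefficients in each cell
}
```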

2.3 PCA Analysis

  • center = TRUE: subtracts the mean from each column
  • scale = FALSE: leaves the columns unscaled after centering (Note: every item uses the same Likert scale)
  • DESIGN: the group membership of the observations (rows), used to color them
  • graphs = FALSE: when TRUE, epPCA draws its own plots; set it to FALSE for R Markdown to run correctly

Note: We run epPCA (ExPosition package) and epPCA.inference.battery (InPosition package) by passing the data containing only the quantitative variables, together with the design variable (used for the plot colors).
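
A call with the arguments listed above might look like the following sketch. The items matrix and the groups vector are simulated placeholders for the cleaned SAM data, and the call is guarded in case ExPosition is not installed.

```r
set.seed(3)
# Hypothetical cleaned data: 30 participants, 6 Likert items
items  <- matrix(sample(1:5, 30 * 6, replace = TRUE), ncol = 6,
                 dimnames = list(paste0("s", 1:30), paste0("Q", 1:6)))
groups <- rep(c("High", "Normal"), each = 15)   # hypothetical design variable

if (requireNamespace("ExPosition", quietly = TRUE)) {
  res <- ExPosition::epPCA(DATA = items, center = TRUE, scale = FALSE,
                           DESIGN = groups, graphs = FALSE)
  print(head(res$ExPosition.Data$fi))   # row factor scores
  print(res$ExPosition.Data$eigs)       # eigenvalues
}
```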

2.3.1 Scree Plot

What it does: The scree plot shows the eigenvalues, that is, the amount of information on each component. The number of components (the dimensionality of the factor space) is min(nrow(DATA), ncol(DATA)) minus 1.

Analysing the plot: Here, 8 columns give 7 components. The scree plot is used to determine how many of the components should be interpreted.
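
For readers who want to see where the plotted values come from, here is a base-R sketch on simulated data. The dashed line at the average eigenvalue is one common rule of thumb, added here as an assumption rather than the rule used in this chapter.

```r
set.seed(4)
X  <- matrix(rnorm(40 * 5), ncol = 5)           # hypothetical data table
Xc <- scale(X, center = TRUE, scale = FALSE)

eigs <- svd(Xc)$d^2                             # eigenvalues: inertia of each component
tau  <- 100 * eigs / sum(eigs)                  # percentage of inertia per component

plot(eigs, type = "b", xlab = "Component", ylab = "Eigenvalue", main = "Scree plot")
abline(h = mean(eigs), lty = 2)                 # assumed cut-off: the average eigenvalue
```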

2.4 PCA inference

The inference battery includes permutation and bootstrap tests. Inference is important for checking the stability and reliability of the results (in the same spirit as an F test).
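
A call to the inference battery might look like the sketch below. The data are simulated placeholders, test.iters is set to an arbitrary small value, and the call is guarded in case InPosition is not installed.

```r
set.seed(9)
# Hypothetical cleaned data: 30 participants, 6 Likert items
items  <- matrix(sample(1:5, 30 * 6, replace = TRUE), ncol = 6,
                 dimnames = list(paste0("s", 1:30), paste0("Q", 1:6)))
groups <- rep(c("High", "Normal"), each = 15)   # hypothetical design variable

if (requireNamespace("InPosition", quietly = TRUE)) {
  res.inf <- InPosition::epPCA.inference.battery(
    DATA = items, center = TRUE, scale = FALSE, DESIGN = groups,
    graphs = FALSE, test.iters = 100)           # number of permutations / bootstrap samples
  print(res.inf$Inference.Data$components$p.vals)   # permutation p-values per component
}
```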

## [1] "It is estimated that your iterations will take 0.05 minutes."
## [1] "R is not in interactive() mode. Resample-based tests will be conducted. Please take note of the progress bar."
## ===========================================================================

2.4.1 Scree Plot

This plot adds the permutation results to the scree plot: the estimated p-values are passed to the PlotScree function so that the significant components are colored.
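
To make the idea of permutation p-values for eigenvalues concrete, here is a simplified base-R sketch on simulated data. It shuffles each column independently and is not necessarily the exact scheme used by the inference battery.

```r
set.seed(5)
X <- matrix(rnorm(60 * 4), ncol = 4)            # hypothetical data table
X[, 2] <- X[, 1] + 0.3 * X[, 2]                 # build in one strong dimension
Xc  <- scale(X, center = TRUE, scale = FALSE)
obs <- svd(Xc)$d^2                              # observed eigenvalues

n.perm <- 500
perm <- replicate(n.perm, {
  Xp <- apply(Xc, 2, sample)                    # shuffle each column: breaks the correlations
  svd(Xp)$d^2
})                                              # components x n.perm matrix

p.vals <- rowMeans(perm >= obs)                 # share of permuted eigenvalues >= observed
p.vals
```

A small p-value for a component means that its eigenvalue is rarely matched once the correlation structure is destroyed, which is the criterion used to color components in the scree plot.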

2.4.3 Row Factor scores

Row factor scores: F = XQ = PΔ. These are the projections of the observations onto the principal components.
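
The two expressions for the factor scores follow from the SVD. In the derivation below, X is taken to be the preprocessed (centered) table, P and Q its left and right singular vectors, Δ the diagonal matrix of singular values and Λ the diagonal matrix of eigenvalues; this notation is assumed from the formula above.

```latex
X = P \Delta Q^{\top}, \quad P^{\top}P = Q^{\top}Q = I
\;\Longrightarrow\;
F = XQ = P \Delta Q^{\top} Q = P \Delta ,
\qquad
F^{\top}F = \Delta P^{\top} P \Delta = \Delta^{2} = \Lambda .
```

The last equality shows that the factor scores of different components are orthogonal and that the sum of squared scores on a component equals its eigenvalue.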

Color for each group:

2.4.3.1 With group means
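
Group means are the averages of the factor scores within each group. A minimal base-R sketch, using simulated factor scores and a hypothetical two-group design, could be:

```r
set.seed(6)
# Hypothetical factor scores of 20 participants on two components
Fs <- matrix(rnorm(20 * 2), ncol = 2, dimnames = list(NULL, c("Dim1", "Dim2")))
group <- rep(c("High", "Normal"), each = 10)

# One row per group: the barycenter of its observations
group.means <- aggregate(Fs, by = list(group = group), FUN = mean)

cols <- c(High = "darkorange", Normal = "steelblue")
plot(Fs, col = cols[group], pch = 19, xlab = "Dimension 1", ylab = "Dimension 2")
points(group.means$Dim1, group.means$Dim2, pch = 17, cex = 2,
       col = cols[group.means$group])           # larger triangles mark the group means
```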

2.4.4 Loadings

Loadings describe the similarity between the variables (observe the angular distance between them). Loadings show how the input variables relate to each other and which variables help to explain each component.

Note: The loading plot shows that the episodic memory questions form a small angle with the future memory questions, indicating that they are positively correlated (cos 0° = 1: completely correlated).

Similarly the spatial memory questions and semantic memory questions are strongly correlated.

Also, the Future and Spatial memory loadings are separated by an angle of about 90°. This shows that they are uncorrelated with each other (cos 90° = 0: uncorrelated). Hence, in this case a varimax rotation will help to give a better understanding of the components.
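
The link between angle and correlation can be illustrated with a small cosine computation. The loading vectors below are made-up numbers, not values from this analysis, and the cosine matches the correlation only to the extent that the variables are well represented in the plane of the two components.

```r
# Cosine of the angle between two loading vectors
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Hypothetical loadings on components 1 and 2
episodic <- c(0.8, 0.3)
future   <- c(0.6, 0.4)
spatial  <- c(-0.4, 0.6)

cosine(episodic, future)   # close to 1: small angle, positive correlation
cosine(future, spatial)    # 0: right angle, no correlation
```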

2.5 Varimax Rotation

The loading circle plot hinted towards a rotation of the axis. The future and spatial memory loadings appeared to share an angle of 90 deg indicating that they are uncorrelated and probably orthogonal.

This plot shows that spatial memory P1 explains component 1 while future memory F2 helps to explain component 2 after a rotation of the axes.

  • Component 1: Episodic memory

  • Component 2: Spatial memory

  • Component 3: Future memory
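
A varimax rotation of the first few components can be sketched with stats::varimax, as below. The data are simulated and keeping k = 3 components is an assumption made to mirror the three components listed above; the chapter's own rotation may have been run with a different function.

```r
set.seed(7)
X  <- matrix(rnorm(80 * 6), ncol = 6, dimnames = list(NULL, paste0("Q", 1:6)))
Xc <- scale(X, center = TRUE, scale = FALSE)    # hypothetical centered table
sv <- svd(Xc)

k <- 3                                          # number of components kept (assumed)
Q <- sv$v[, 1:k]                                # loadings of the first k components
rownames(Q) <- colnames(X)

rot   <- varimax(Q, normalize = FALSE)          # orthogonal rotation of the loadings
Q.rot <- unclass(rot$loadings)                  # rotated loadings
F.rot <- Xc %*% Q.rot                           # rotated factor scores
```

Because the rotation is orthogonal, the rotated loadings stay orthonormal and the total inertia of the k components is unchanged; only its split across components moves.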

2.5.1 Contributions of columns and their Bootstrap Ratios

  • Contribution bar plots: show how much each variable contributes to each component, with a line marking the threshold against which the contributions are compared.

  • Bootstrap ratio bar plots: check the significance of the contribution of each variable to the principal components.
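
Both quantities can be sketched in base R on simulated data. The threshold of 1/J for contributions and the cut-off of 2 for bootstrap ratios are conventional choices assumed here, and the bootstrap below (resampled rows projected onto the fixed solution) is a simplified variant, not necessarily what the inference battery computes.

```r
set.seed(8)
X <- matrix(rnorm(60 * 5), ncol = 5, dimnames = list(NULL, paste0("Q", 1:5)))
X[, 2] <- X[, 1] + 0.5 * X[, 2]                 # make Q1 and Q2 share a dimension
Xc <- scale(X, center = TRUE, scale = FALSE)    # hypothetical centered table
sv <- svd(Xc)

# Contributions: squared loadings; each column (component) sums to 1
ctr <- sv$v^2
rownames(ctr) <- colnames(X)
threshold <- 1 / ncol(X)                        # assumed threshold: the average contribution

# Bootstrap: resample rows, recompute column factor scores on the fixed components
n.boot <- 500
boots <- replicate(n.boot, {
  idx <- sample(nrow(Xc), replace = TRUE)
  crossprod(Xc[idx, ], sv$u[idx, ])             # variables x components
})                                              # variables x components x n.boot array

boot.ratio <- apply(boots, c(1, 2), mean) / apply(boots, c(1, 2), sd)
abs(boot.ratio[, 1]) > 2                        # assumed cut-off for "significant"
```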

2.6 Conclusion

When we interpret the factor scores and loadings together, the PCA revealed:

Before Rotation:

  • Component 1: Participants who were grouped as high Autobiographical Memory have higher episodic memory.

  • Component 2: Tends to separate semantic memory from memory that requires imagination, such as future and spatial memory.

Note: Semantic memory is based on facts, meanings, concepts and knowledge about the external world that we have acquired and is independent of personal experience and of the spatial/temporal context in which it was acquired.

  • Component 3: Separates Future memory from Spatial memory.

After Rotation:

  • Component 1: Participants who were grouped as high Autobiographical Memory have higher Episodic memory.

  • Component 2: The positive side of the dimension is explained by Spatial memory

  • Component 3: Explained by Future memory