## 2.1 What is PCA?

Core mathematical understanding of Singular Value Decomposition (SVD):

Factorization Method: Singular Value Decomposition
X: Data table
X^TX: Correlation matrix (when the columns of X are centered and standardized; a covariance matrix if they are only centered)
X = P∆Q^T, where:

1. Condition: (P^T)P = (Q^T)Q = I (Orthogonal)
2. P (size: Observations x Rank) is the projection matrix of observations
3. Q (size: Variables x Rank) is the projection matrix of variables
4. Factor scores: F = P∆ = XQ (note that F^TF = ∆², the diagonal matrix of eigenvalues)
5. ∆ (Delta) is the diagonal matrix that stores the singular values (the square roots of the eigenvalues)
6. The eigenvalues describe variances: each eigenvalue is the variance (inertia) explained by its component
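These identities can be checked numerically. A minimal sketch with NumPy on a hypothetical toy data table (NumPy's `svd` returns P, the singular values, and Q^T):

```python
import numpy as np

# Toy data table X: 5 observations x 3 variables (hypothetical values).
X = np.array([
    [2.0, 0.5, 1.0],
    [1.5, 1.0, 0.5],
    [3.0, 2.0, 2.5],
    [0.5, 0.1, 0.2],
    [2.5, 1.5, 2.0],
])
X = X - X.mean(axis=0)  # column-center so the eigenvalues describe variances

# SVD: X = P @ diag(delta) @ Q.T
P, delta, Qt = np.linalg.svd(X, full_matrices=False)
Q = Qt.T

# Orthogonality conditions: P^T P = Q^T Q = I
assert np.allclose(P.T @ P, np.eye(P.shape[1]))
assert np.allclose(Q.T @ Q, np.eye(Q.shape[1]))

# Factor scores: F = P Delta = X Q
F = P * delta                    # multiplies each column of P by its singular value
assert np.allclose(F, X @ Q)

# F^T F = Delta^2, the diagonal matrix of eigenvalues (variances)
assert np.allclose(F.T @ F, np.diag(delta ** 2))
```

The asserts pass because P and Q have orthonormal columns, so all of X's cross-product structure is carried by the singular values in ∆.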

Purpose:

Principal Component Analysis (PCA) is a multivariate technique for analyzing quantitative data. The goal of PCA is to reduce dimensionality and noise and to extract important information (features / attributes) from large amounts of data.

General Processes of PCA:

• First, PCA finds the new origin of the data by taking the average of each variable, i.e., it centers the data at the mean (in the common 2D illustration, this is the average of the horizontal and vertical coordinates of all data points).
• Second, PCA sets up a new axis (called the first principal component) that maximizes the inertia (variance) of all data points. In other words, projecting the points onto this new line yields the largest possible sum of squared distances from the origin (equivalently, the smallest perpendicular distances to the line; the two are linked by the Pythagorean theorem).
• Third, PCA sets up each subsequent axis orthogonal to the previous ones so that it maximizes the remaining inertia.
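The steps above can be sketched numerically. Assuming a hypothetical 2D point cloud, centering plus SVD recovers the component directions, and we can verify that no other direction captures more inertia than the first component:

```python
import numpy as np

rng = np.random.default_rng(0)
# A hypothetical 2-D cloud of points with one dominant direction.
pts = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])

# Step 1: move the origin to the mean of the data (centering).
centered = pts - pts.mean(axis=0)

# Step 2: the first principal component is the axis that maximizes the
# inertia, i.e. the sum of squared projections onto it; SVD finds it.
_, delta, Qt = np.linalg.svd(centered, full_matrices=False)
pc1, pc2 = Qt[0], Qt[1]

def inertia_along(direction):
    """Sum of squared projections of the centered points onto a unit vector."""
    return np.sum((centered @ direction) ** 2)

# No random candidate direction captures more inertia than pc1.
candidates = rng.normal(size=(100, 2))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
assert all(inertia_along(pc1) >= inertia_along(v) for v in candidates)

# Step 3: the second component is orthogonal to the first and takes
# whatever inertia remains.
assert np.isclose(pc1 @ pc2, 0.0)
assert np.isclose(inertia_along(pc1) + inertia_along(pc2),
                  np.sum(centered ** 2))
```

The final assert shows the decomposition is exhaustive: the inertias along the orthogonal components add up to the total inertia of the cloud.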

PCA maps and their interpretations:

• After projecting all data points onto a new plane, PCA explores the relationships in the data among both observations (rows) and variables (columns). Each kind of map calls for a different interpretation, and the principal components help explain these relationships.
• The PCA observation map plots the factor scores. We can interpret the factor scores map by assessing the distances between row data points, using techniques like group means, tolerance intervals, bootstrap intervals, and clustering (discussed in a later chapter). The distance between points represents their similarity: points close to each other are neighbors with similar profiles, and points far apart have dissimilar profiles.
• The PCA variables map plots the loadings. We can interpret the loadings map by assessing the angles between the variable vectors (best done on a correlation-circle plot, where the effects of the variables are standardized).
• The angle between two vectors approximates the correlation between the variables: a small angle indicates the variables are positively correlated, an angle of 90 degrees indicates they are uncorrelated, and an angle close to 180 degrees indicates they are negatively correlated.
• Note: It is a good idea to assess and visualize the data by different groupings (for both rows and columns), and to use scaling (on rows, columns, or both) whenever necessary. For example, use column scaling when the values are not uniform or do not share similar units; use row scaling to control for inertia-setting effects caused by subjective variables such as panelists or raters.
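The angle-correlation reading of the loadings map can be illustrated with a small sketch, assuming hypothetical standardized data that lies (almost) in a 2-D subspace so that the first two components capture nearly all the inertia:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
# Three hypothetical variables living (almost) in a 2-D subspace.
X = np.column_stack([a, 0.8 * a + 0.6 * b, -a]) + 0.01 * rng.normal(size=(n, 3))

# Standardize the columns (correlation PCA), then decompose.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
P, delta, Qt = np.linalg.svd(Z, full_matrices=False)
F = P * delta                          # factor scores

# Loadings: correlation of each variable with each of the first 2 components.
load = np.array([[np.corrcoef(Z[:, j], F[:, k])[0, 1] for k in range(2)]
                 for j in range(3)])

def cosine(u, v):
    """Cosine of the angle between two loading arrows on the map."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# The cosine of the angle between two variables' arrows approximates
# their Pearson correlation.
cos01 = cosine(load[0], load[1])
r01 = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
assert abs(cos01 - r01) < 0.05

# Variables 0 and 2 are negatively correlated: angle near 180 degrees.
cos02 = cosine(load[0], load[2])
assert cos02 < -0.9
```

The approximation is good here only because two components explain almost all the inertia; with more residual variance, angles on a 2-D loadings map can be misleading.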

Other important aids:

• Scree plot: describes how much inertia is explained by each component. Note that this is not a definitive criterion for deciding which components are important; we should use the scree plot as an exploratory tool only.
• Permutation test for eigenvalues: tests whether the eigenvalues are reliable by randomly resampling (permuting) the data.
• Correlation Plot: describes the linear relationships between variables (use when applicable). This is another exploratory tool we should use before doing PCA.
• Variables Contribution: describes the effect strength (compared to average contribution) of each variable. It is usually plotted in bar plots along with Bootstrap Ratios.
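The scree quantities and the permutation test for eigenvalues can be sketched together. A minimal version, assuming hypothetical data with one strong shared dimension; the permutation step shuffles each column independently to destroy the between-variable structure and asks how large the first eigenvalue gets by chance alone:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
shared = rng.normal(size=n)
# Hypothetical data: four variables driven by one shared dimension plus noise.
X = np.column_stack([shared + 0.3 * rng.normal(size=n) for _ in range(4)])

def eigenvalues(M):
    """Eigenvalues (squared singular values) of the centered data table."""
    C = M - M.mean(axis=0)
    return np.linalg.svd(C, compute_uv=False) ** 2

ev = eigenvalues(X)
explained = ev / ev.sum()          # the proportions a scree plot displays
assert explained[0] > 0.7          # one dominant component, by construction

# Permutation test: shuffle each column independently, recompute the first
# eigenvalue, and repeat to build a null distribution.
perm_first = [eigenvalues(np.column_stack([rng.permutation(X[:, j])
                                           for j in range(X.shape[1])]))[0]
              for _ in range(200)]
p_value = np.mean([e >= ev[0] for e in perm_first])
assert p_value < 0.05              # the first eigenvalue is reliable here
```

Shuffling within columns preserves each variable's own distribution while breaking the correlations between variables, so an observed eigenvalue that far exceeds its permuted counterparts reflects real structure rather than chance.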

Author’s notes:

PCA is mathematically elegant and powerful for interpreting large datasets. The intuition and techniques behind PCA can be built upon and are found in many modern statistical methods. Overall, I see PCA as a great exploratory technique that emphasizes feature engineering and feature extraction. It is useful for seeing what hidden effects and relationships exist between observations and variables. It is also very good at summarizing the data and explaining topical effects by reducing data that exists in many dimensions down to 2-3 important dimensions.