Module 6 Part 1: Principal Component Analysis

PCA Introduction

Important

Big Goal: to find meaning in high-dimensional data. In other words, find a low-dimensional representation. This is referred to as dimensionality reduction.

PCA is a statistical technique used to reduce the dimensionality of a dataset while retaining as much of its variability as possible.

Key Features:

  • Each PC captures maximum possible variance.

  • PCs are mutually orthogonal (uncorrelated).

  • PCA creates components as linear combinations of the original variables.

  • PCA reduces the number of dimensions by selecting the top few components that explain \(\sim 90\%\) of the variance.

  • Suppose we have \(N\) measurements on each of \(p\) variables \(\mathbf{X}_j, \quad j = 1,\dots,p\). Let \(\mathbf{X} = (\mathbf{X}_1, \dots, \mathbf{X}_p) \in \mathbb{R}^{N\times p}\).

  • Generate a set of uncorrelated variables \(Z_1, ..., Z_p\).

Ways to think about PCA:

  • Approximate \(\mathbf{X}\) by the best rank-\(q\) matrix \(\hat{\mathbf{X}}_{(q)}\). This is the usual motivation for the Singular Value Decomposition (SVD).

  • Approximate the original set of \(N\) points in \(\mathbb{R}^{p}\) by a sequence of best linear approximations to the data.

PCA in Practice

  • Applications include:

    • Dimension reduction

    • Factor Analysis

    • Data compression and reconstruction

    • Data Visualization

    • A way to address multicollinearity via PCA regression

    • A way to handle settings with \(p > N\) (more variables than observations).

Key Concept: when variables are correlated, there exists a lower-dimensional representation that captures most of the information.

Example of PCA in two dimensions

\(\mathbf{Z}_1 = \mathbf{X}\mathbf{u}_1\) is the projection of the data onto the direction of greatest spread; it has the largest variance among all one-dimensional projections of the data. It is the first principal component (PC1).

  • \(\mathbf{u}_1\): The PC1 direction, a vector in the space of the original variables, determined to maximize the variance of the data along this direction.

  • \(\mathbf{Z}_1\): The projection of the data onto the PC1 direction \(\mathbf{u}_1\). This new variable represents the data transformed into the PC1 space.

Consider a dataset with 4 observations and 2 variables:

\(x_1\) \(x_2\)
2 1
4 3
6 5
8 7

Step 1: Center the Data

Subtract the mean of each variable. This ensures that variance drives the analysis because the means are shifted to zero.

  • Mean of \(x_1 = 5\), mean of \(x_2 = 4\).

  • Centered data: \[ (-3, -3), (-1, -1), (1, 1), (3, 3). \]

Step 2: Find the Principal Components Using the Eigenvalue Decomposition (EVD)

  1. The first principal component direction (\(\mathbf{u}_1\)) is the direction with the largest variance; here \(\mathbf{u}_1 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix}\) exactly, since the centered points lie on the line \(x_2 = x_1\).
  2. The second principal component direction (\(\mathbf{u}_2\)) is orthogonal to \(\mathbf{u}_1\).

Step 3: Project the Data onto \(\mathbf{u}_1\)

The projection of each centered data point \(\mathbf{x}_i\) onto \(\mathbf{u}_1\) is:

\[ z_1 = \mathbf{x}_i' \mathbf{u}_1 = \frac{1}{\sqrt{2}} (x_{i1} + x_{i2}). \]

For example, for the 1st observation \((-3, -3)\):

\[ z_1 = \frac{-3 - 3}{\sqrt{2}} = -\frac{6}{\sqrt{2}}. \]

Step 4: Interpretation

The new variable \(\mathbf{Z}_1\) (the vector of \(z_1\) values across observations) captures the largest variance in the data. The original data can be approximated as:

\[ \hat{\mathbf{X}} = \mathbf{Z}_1 \mathbf{u}_1', \]

providing a lower-dimensional representation.
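As a quick numerical check, the sketch below (not part of the original example) reproduces these results with R's prcomp(); recall that the sign of a principal component direction is arbitrary.

# Sketch: verify the 2D worked example with prcomp()
X <- matrix(c(2, 1,
              4, 3,
              6, 5,
              8, 7), ncol = 2, byrow = TRUE)
pca <- prcomp(X, center = TRUE, scale. = FALSE)
pca$rotation[, 1]   # first PC direction: (1, 1)/sqrt(2), up to sign
pca$x[, 1]          # PC1 scores; the first is -6/sqrt(2), approximately -4.243, up to sign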

Visualization

The first principal component represents the diagonal direction of greatest spread. Projections onto this line simplify the data into one dimension while retaining most of the information.

PCA in three dimensions

The best rank-two linear approximation to the half sphere data.

Figure: The left panel displays the projection from \(\mathbb{R}^3\) to \(\mathbb{R}^2\). The right panel shows the points projected onto the principal axes \(\mathbf{u}_1\) and \(\mathbf{u}_2\), which are the subject scores \((z_{i1}, z_{i2})\).

Contrasting PCA and OLS: Supervised vs. Unsupervised Learning

  • PCA is an unsupervised learning problem.

    • For example, we have data in different categories and covariates; PCA finds “clusters” without knowing the categories.

    • PCA treats all variables \(X_1, \dots, X_p, Y\) as equal contributors to the overall variance.

    • PCA treats the data symmetrically, e.g., for \((Y, X_1, X_2)\) it asks what the “best” low-dimensional representation of all the variables is.

  • In contrast, OLS is a supervised learning method.

    • The goal is to predict \(Y\) from one or more predictors.

    • Assumes \(Y\) is dependent on \(\mathbf{X}\).

    • Prioritizes \(Y\) as the outcome variable, i.e., does not treat the variables symmetrically, and seeks to explain \(\text{var}(Y)\) using the predictors.

Optimization Problem

  • In PCA, we will find a sequence of directions that successively maximize the variance explained in the data, given the previous directions.

  • Let’s first find the rank-1 projection of the data that captures the most variance:

    • A rank-1 projection maps each data point \(x_i\) (row of \(\mathbf{X}\)) onto the line spanned by \(\mathbf{u}_1\), i.e., for each point the projection is \(\hat{x}_i = (\mathbf{u}_1' x_i) \mathbf{u}_1\).

\[ \underset{\mathbf{u}_1 \in \mathbb{R}^p:\ \mathbf{u}_1'\mathbf{u}_1=1}{\text{argmax}}\ \underbrace{\frac{1}{N} \sum_{i=1}^N (\underbrace{\mathbf{u}_1' (x_i - \bar{x})}_{\text{projection}})^2}_{\text{sample variance of the projections}} \]

  • Note the constraint is necessary for the maximum to be finite and for \(\mathbf{u}_1\) to be a unit vector.

  • Note how PCA treats the coordinates of the data symmetrically, in contrast with OLS, which for \(x_i \in \mathbb{R}^2\) treats \(x_{i2}\) as a response to be predicted from \(x_{i1}\). A numerical illustration of this variance-maximization property is given below.
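The following sketch illustrates the variance-maximization property numerically, using R's built-in USArrests data (an assumed example, not from these notes): no random unit direction yields a larger projection variance than the first PC direction.

# Sketch: the PC1 direction maximizes the variance of the one-dimensional projection
set.seed(1)
Xc <- scale(USArrests, center = TRUE, scale = FALSE)   # centered data matrix
u1 <- prcomp(USArrests)$rotation[, 1]                  # first PC direction
proj_var <- function(u) {
  u <- u / sqrt(sum(u^2))                              # normalize to a unit vector
  var(as.vector(Xc %*% u))                             # variance of the projected data
}
rand_vars <- replicate(1000, proj_var(rnorm(ncol(Xc))))
c(best_random = max(rand_vars), pc1 = proj_var(u1))    # PC1 attains the largest value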

Projecting a Data Point

Finding PC1

We want to solve

\[ \underset{\mathbf{u}_1 \in \mathbb{R}^p : \mathbf{u}_1'\mathbf{u}_1 = 1}{\text{argmax}}\ \frac{1}{N}\sum_{i=1}^{N} (\mathbf{u}_1'(x_i - \bar{x}))^2. \]

Define the sample covariance matrix \(S\) as:

\[ S = \frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})'. \]

The objective function can then be rewritten as:

\[ \underset{\mathbf{u}_1 \in \mathbb{R}^p : \mathbf{u}_1'\mathbf{u}_1 = 1}{argmax} \mathbf{u}_1' S \mathbf{u}_1. \]

We can solve this optimization problem using the method of Lagrange multipliers.

To incorporate the constraint into the objective function, we form the Lagrangian, whose stationary points correspond to solutions of the constrained problem:

\[ J(\mathbf{u}_1) = \mathbf{u}_1' S \mathbf{u}_1 + \lambda_1 (1 - \mathbf{u}_1' \mathbf{u}_1). \]

To find the optimal solution, we calculate the partial derivative of the objective function and set it equal to zero:

The partial derivative is given by:

\[ \frac{\partial J}{\partial \mathbf{u}_1} = 2 S \mathbf{u}_1 - 2 \lambda_1 \mathbf{u}_1. \]

Setting this equal to zero, we obtain:

\[ S \mathbf{u}_1 = \lambda_1 \mathbf{u}_1, \]

so \(\mathbf{u}_1\) must be an eigenvector of the covariance matrix \(S\) with eigenvalue \(\lambda_1\). Moreover, at such a solution the objective equals \(\mathbf{u}_1' S \mathbf{u}_1 = \lambda_1\), so the maximum is attained by taking \(\lambda_1\) to be the largest eigenvalue and \(\mathbf{u}_1\) its corresponding eigenvector.
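As a sketch (again using the built-in USArrests data as an assumed example), the first PC direction can be obtained directly as the top eigenvector of the sample covariance matrix. Note that R's cov() uses the divisor \(N - 1\) rather than \(N\); this rescales the eigenvalues but leaves the eigenvectors unchanged.

# Sketch: PC1 as the top eigenvector of the covariance matrix
S  <- cov(USArrests)                          # sample covariance (divisor N - 1)
ev <- eigen(S, symmetric = TRUE)              # eigenvalues returned in decreasing order
u1 <- ev$vectors[, 1]                         # eigenvector with the largest eigenvalue
p1 <- unname(prcomp(USArrests)$rotation[, 1]) # prcomp's first loading vector
max(abs(abs(u1) - abs(p1)))                   # ~ 0: same direction up to sign
ev$values[1]                                  # = u1' S u1, the variance along PC1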

Finding PC2

Assume \(p > 2\). We next aim to find \(\mathbf{u}_2\), orthogonal to \(\mathbf{u}_1\), that captures the most information. This can be formulated as:

\[ \text{argmax}_{\mathbf{u}_2 \in \mathbb{R}^p} \mathbf{u}_2' S \mathbf{u}_2, \]

subject to:

\[ \mathbf{u}_2' \mathbf{u}_2 = 1 \quad \text{(unit norm)} \qquad \text{and} \qquad \mathbf{u}_2' \mathbf{u}_1 = 0 \quad \text{(orthogonality)}. \]

Using the method of Lagrange Multipliers, we formulate the objective function as:

\[J(\mathbf{u}_2) = \mathbf{u}_2' S \mathbf{u}_2 + \lambda_2 (1 - \mathbf{u}_2' \mathbf{u}_2) + \beta \mathbf{u}_2' \mathbf{u}_1.\]

The partial derivative is given by:

\[\frac{\partial J}{\partial \mathbf{u}_2} = 2 S \mathbf{u}_2 - 2 \lambda_2 \mathbf{u}_2 + \beta \mathbf{u}_1 = 0.\]

To solve for \(\beta\), we take the dot product of this equation with \(\mathbf{u}_1\):

\[2 \mathbf{u}_1' S \mathbf{u}_2 - 2 \lambda_2 \mathbf{u}_1' \mathbf{u}_2 + \beta \mathbf{u}_1' \mathbf{u}_1 = 0.\]

Since \(\mathbf{u}_1' \mathbf{u}_2 = 0\), \(\mathbf{u}_1' \mathbf{u}_1 = 1\), and \(\mathbf{u}_1' S \mathbf{u}_2 = (S \mathbf{u}_1)' \mathbf{u}_2 = \lambda_1 \mathbf{u}_1' \mathbf{u}_2 = 0\) (using the symmetry of \(S\)), this simplifies to:

\[\beta = 0.\]

So the Lagrangian equation simplifies to:

\[J(\mathbf{u}_2) = \mathbf{u}_2' S \mathbf{u}_2 + \lambda_2 (1 - \mathbf{u}_2' \mathbf{u}_2)\]

And as before, we have:

\[S \mathbf{u}_2 = \lambda_2 \mathbf{u}_2\]

Hence, \(\lambda_2\) is the second largest eigenvalue, and \(\mathbf{u}_2\) is its corresponding eigenvector.

In \(\mathbb{R}^2\), \(\mathbf{u}_2\) is simply the unique direction orthogonal to \(\mathbf{u}_1\). This corresponds to the projection that minimizes the variance.

Spectral Decomposition of Positive Definite Symmetric Matrices

Any positive definite symmetric matrix can be decomposed using the eigenvalue decomposition (EVD), also known as the spectral decomposition:

\[ S = \mathbf{U} \Lambda \mathbf{U}', \]

where:

  • \(\Lambda\) is a diagonal matrix with elements \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p > 0\).
  • All eigenvalues are strictly positive when the matrix is positive definite, i.e., \(x' S x > 0\) for all \(x \neq 0\).
  • Intuitively, positive definite matrices generalize the notion of a positive number to matrices.
  • \(U'U = I\), meaning \(U\) is an orthogonal matrix (its columns are orthonormal). A small numerical check of this decomposition is given below.
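A minimal sketch of the spectral decomposition, using a randomly generated positive definite matrix (an assumed illustration, not from the notes):

# Sketch: spectral decomposition S = U Lambda U' of a positive definite symmetric matrix
set.seed(1)
A   <- matrix(rnorm(25), 5, 5)
S   <- crossprod(A) + diag(5)          # A'A + I is symmetric and positive definite
ed  <- eigen(S, symmetric = TRUE)
U   <- ed$vectors
Lam <- diag(ed$values)                 # eigenvalues sorted in decreasing order
max(abs(S - U %*% Lam %*% t(U)))       # ~ 0: S is recovered exactly
max(abs(crossprod(U) - diag(5)))       # ~ 0: U'U = I, so U is orthogonal
ed$values                              # all strictly positive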

Principal Component Scores and Loadings

This decomposition provides us with the principal component (PC) directions and their variances.

Let \(X_c\) represent \(X\) with the columns centered (i.e., for each variable, subtract its mean). Then the sample covariance matrix is:

\[ S = \frac{1}{N} X_c' X_c. \]

We can compute the EVD of \(S\) using the decomposition

\[ S=\mathbf{U} \Lambda \mathbf{U}' \]

Let

\[ U_{(q)} = [\mathbf{u}_1, \dots, \mathbf{u}_q], \]

where \(U_{(q)}\) contains the first \(q\) principal component directions, and \(U_{(-q)}\) represent the directions from \(q + 1\) to \(p\), which contain the remaining variance. Then:

\[ X_c = X_c U_{(q)} U_{(q)}' + X_c U_{(-q)} U_{(-q)}'. \]

The term \(X_c U_{(q)} U'_{(q)}\) is the rank-\(q\) projection of \(X_c\) with the highest variance among all rank-\(q\) linear projections.

Define \(Z = X_c U_{(q)}\). The columns of \(Z\) are the principal component scores; they span a \(q\)-dimensional subspace of \(\mathbb{R}^N\).

For the \(i\)th subject, with \(x_i \in \mathbb{R}^p\) centered, the \(k\)th score is:

\[ z_{ik} = \mathbf{u}_k' x_i. \]

In this projection, \(U_{(q)}\) weights the variables; these weights are referred to as the loadings, and the columns of \(U_{(q)}\) span a \(q\)-dimensional subspace of \(\mathbb{R}^p\).
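The scores and loadings can be computed by hand and compared with prcomp(), which stores the scores in $x and the loadings in $rotation (a sketch using the built-in USArrests data as an assumed example):

# Sketch: scores Z = Xc U_(q) computed by hand vs. prcomp()
Xc  <- scale(USArrests, center = TRUE, scale = FALSE)    # centered data (N x p)
U   <- eigen(cov(USArrests), symmetric = TRUE)$vectors   # all p loading vectors
q   <- 2
Z   <- Xc %*% U[, 1:q]                                   # scores for the first q PCs
fit <- prcomp(USArrests)
max(abs(abs(Z) - abs(fit$x[, 1:q])))                     # ~ 0: agree up to sign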

Key Terminology

  • Subject Scores: The values \(z_{ik}\) represent the projection of a data point onto the principal components, summarizing its position in the principal subspace.
  • Variable Loadings: The columns of \(U_{(q)}\) represent how much each variable contributes to the principal components.
  • The terminology can be confusing, as different authors may arrange the data matrix differently, such that the data matrix may be \(p \times n\) (variables by subjects) or \(n \times p\) (subjects by variables).

Variance of Principal Component Scores

Note that the covariance of the principal component scores, \(Z\), is given by:

\[ \text{Cov}(Z) = \frac{1}{N} Z'Z \]

Substituting \(Z = X_c U_{(q)}\), we have:

\[ \text{Cov}(Z) = \frac{1}{N} U'_{(q)} X_c' X_c U_{(q)}. \]

Using the eigenvalue decomposition \(S = U \Lambda U'\), this simplifies to:

\[ \text{Cov}(Z) = \Lambda_q, \]

where \(\Lambda_q\) is the diagonal matrix containing the top \(q\) eigenvalues.

Interpretation
  • The eigenvalues \(\lambda_1, \lambda_2, \dots, \lambda_q\) are the variances of the principal component scores.
  • Some treatments constrain the scores to have unit variance, in which case the loadings are adjusted to reflect the variance relationship.
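A quick numerical check of this relationship (a sketch with the built-in USArrests data; prcomp() and cov() use the divisor \(N - 1\) rather than \(N\), which does not affect the comparison):

# Sketch: the scores are uncorrelated and their variances are the eigenvalues
fit <- prcomp(USArrests)
round(cov(fit$x), 6)   # diagonal matrix: off-diagonal entries are ~ 0
fit$sdev^2             # the diagonal entries, i.e., the eigenvalues of the covariance matrix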

Singular Value Decomposition (SVD) and its Relation to PCA

The singular value decomposition (SVD) of \(X_c\) is closely related to PCA:

\[ X_c = V D U', \]

where:

  • \(V\) contains the left singular vectors (orthonormal columns),
  • \(D\) is a diagonal matrix of singular values, and
  • \(U\) contains the right singular vectors (orthonormal columns).

Relationship to Covariance Decomposition

From the SVD, we observe:

\[ X_c' X_c = U D V' V D U' = U D^2 U'. \]

Since \(D^2\) is proportional to the eigenvalues of the covariance matrix (\(D^2 = N \Lambda\)), we can rewrite this as:

\[ X_c' X_c = N U \Lambda U', \]

where \(\Lambda\) contains the eigenvalues of the covariance matrix.

  • The right singular vectors (\(U\)) of the SVD are identical to the eigenvectors from the covariance decomposition.
  • When \(p \gg n\) (more variables than samples), it is computationally inefficient to calculate the \(p \times p\) covariance matrix. Instead, SVD provides a more practical alternative.
  • This is particularly important for working with high-dimensional data.

Principal Component Scores via SVD

We have:

\[ Z = X_c U_{(q)}, \quad X_c = (V D) U', \]

and it follows that:

\[ Z = V_{(q)} D_{(q)}. \]

Interpretation

In words, the first \(q\) principal component scores are the first \(q\) left singular vectors, scaled by their corresponding singular values.
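This correspondence can be verified directly (a sketch with the built-in USArrests data). Note a notational clash: the notes write \(X_c = V D U'\) with \(V\) the left singular vectors, whereas R's svd() returns the left singular vectors as $u, the singular values as $d, and the right singular vectors as $v.

# Sketch: PCA via the SVD of the centered data matrix
Xc  <- scale(USArrests, center = TRUE, scale = FALSE)
sv  <- svd(Xc)                                     # Xc = sv$u %*% diag(sv$d) %*% t(sv$v)
fit <- prcomp(USArrests)
max(abs(abs(sv$v) - abs(unname(fit$rotation))))    # right singular vectors = loadings (up to sign)
scores <- sv$u %*% diag(sv$d)                      # "V D" in the notes' notation
max(abs(abs(scores) - abs(unname(fit$x))))         # ~ 0: these are the PC scores (up to sign)
sv$d^2 / (nrow(Xc) - 1)                            # = fit$sdev^2, the eigenvalues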

PCA as Minimizing Residual Error

We can also think of PCA as minimizing residual error. For notational simplicity, let \(x_i\) be centered. For \(q = 1\), PCA solves:

\[ \text{argmin}_{\mathbf{u}_1 : \mathbf{u}_1' \mathbf{u}_1 = 1} \sum_{i=1}^N || \mathbf{u}_1 \mathbf{u}_1' x_i - x_i ||_2^2. \]

It turns out that the solution is the same as the one we previously derived!

This formulation provides motivation for PCA as a data compression technique.

Rank-\(q\) Decomposition

Equivalently, for a rank-\(q\) decomposition, minimizing the reconstruction error amounts to minimizing the variance captured by the discarded \(p - q\) directions, i.e., the variance of the residual

\[ X U_{(-q)} U'_{(-q)}. \]
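A sketch of the rank-\(q\) reconstruction and the residual error it leaves behind (again using the built-in USArrests data as an assumed example):

# Sketch: rank-q approximation and its residual error
Xc   <- scale(USArrests, center = TRUE, scale = FALSE)
fit  <- prcomp(USArrests)
q    <- 2
Xhat <- fit$x[, 1:q] %*% t(fit$rotation[, 1:q])   # rank-q approximation Z_(q) U_(q)'
sum((Xc - Xhat)^2)                                # total squared residual
(nrow(Xc) - 1) * sum(fit$sdev[-(1:q)]^2)          # same number: (N - 1) times the variance in the discarded directions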

Probabilistic PCA (PPCA)

The discussion so far has not explicitly involved statistics. We can formulate a population PCA model, as in probabilistic PCA (PPCA) by Tipping and Bishop (1999).

Let \(z_i\) be latent variables (a random vector), where \(z_i \in \mathbb{R}^q\) for \(q < p\). Let \(W \in \mathbb{R}^{p \times q}\) be a mixing matrix (fixed), and let \(\epsilon_i\) represent measurement error, where \(\epsilon_i \perp\!\!\!\perp z_i\). We assume isotropic noise.

The model is defined as:

\[ x_i = W z_i + \epsilon_i, \]

where:

  • \(z_i \sim \mathcal{N}(0, \text{diag}(\gamma_1, \ldots, \gamma_q))\)
  • \(\epsilon_i \sim \mathcal{N}(0, \sigma^2 I)\)

Key Assumptions

  • \(z_i\) is a low-dimensional latent variable with a diagonal covariance structure, \(\text{diag}(\gamma_1, \ldots, \gamma_q)\).
  • \(\epsilon_i\) is independent of \(z_i\) and represents isotropic noise with variance \(\sigma^2\).

This model allows us to gain insight into the covariance structure of \(X\).

Let \(W = U_q\) from the eigenvalue decomposition (EVD), and assume that the columns of \(Z\) are orthogonal. Define \(\Gamma = \text{diag}(\gamma_1, \ldots, \gamma_q)\) and let \(\Gamma_+\) be the \(p \times p\) matrix formed by padding \(\Gamma\) with zeros.

The covariance of \(x_i\) is:

\[ \text{Cov}(x_i) = U_q \, \text{Cov}(z_i) \, U_q' + \sigma^2 I, \]

which simplifies to:

\[ \text{Cov}(x_i) = U_q \Gamma U_q' + \sigma^2 UU'. \]

In terms of the full decomposition, this can be expressed as:

\[ \text{Cov}(x_i) = U (\Gamma_+ + \sigma^2 I)U'. \]

Interpretation

From this representation:

  • The largest eigenvalues and their corresponding eigenvectors are associated with the latent factors, as their variance is \(\gamma_k + \sigma^2\) for \(k \leq q\).
  • The directions representing pure noise include only the variance \(\sigma^2\), contributing no additional structure to the covariance matrix.

Scree Plot

This motivates a scree plot:


We typically look for an “elbow”: very roughly, the point where the eigenvalues switch from steep (roughly exponential) decay to an approximately flat, linear trend. Such a flat tail is what we expect from the sorted sample estimates of the noise-direction variances under an isotropic noise model.
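The sketch below simulates data from the PPCA model above (an assumed illustration with \(q = 2\) latent factors among \(p = 10\) variables, not part of the original code) and draws the resulting scree plot: the first two eigenvalues sit near \(\gamma_k + \sigma^2\) and the flat tail sits near \(\sigma^2\), producing the characteristic elbow.

# Sketch: simulate from the PPCA model and draw the resulting scree plot
set.seed(1)
N <- 2000; p <- 10; q <- 2
sigma <- 0.5                                 # noise sd, so sigma^2 = 0.25
gamma <- c(4, 2)                             # latent factor variances
W <- qr.Q(qr(matrix(rnorm(p * q), p, q)))    # orthonormal mixing matrix (plays the role of U_q)
Z <- matrix(rnorm(N * q), N, q) %*% diag(sqrt(gamma))   # latent factors z_i
E <- matrix(rnorm(N * p, sd = sigma), N, p)             # isotropic noise eps_i
X <- Z %*% t(W) + E                                     # x_i = W z_i + eps_i
evals <- eigen(cov(X), symmetric = TRUE)$values
round(evals, 2)                              # roughly 4.25, 2.25, then a flat tail near 0.25
plot(evals, type = "b", xlab = "Component", ylab = "Eigenvalue",
     main = "Scree plot for simulated PPCA data")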

Warning: PCA is sensitive to the scaling of the variables. For example, if one variable is measured in millimeters and another in kilometers, the millimeter-scale variable will dominate the decomposition because its numeric values, and hence its variance, are much larger.
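The sketch below illustrates the point with hypothetical simulated measurements (not from the notes); prcomp()'s scale. argument standardizes each variable to unit variance before the decomposition.

# Sketch: unscaled PCA is dominated by the variable with the largest numeric scale
set.seed(1)
length_mm <- rnorm(100, mean = 500, sd = 50)      # a length recorded in millimeters
dist_km   <- rnorm(100, mean = 5,   sd = 1)       # an unrelated distance recorded in kilometers
X <- cbind(length_mm, dist_km)
round(prcomp(X)$rotation[, 1], 3)                 # unscaled: PC1 is essentially length_mm alone
round(prcomp(X, scale. = TRUE)$rotation[, 1], 3)  # standardized: both variables contribute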

Motivating Example

PCA on a Wine Dataset

Here, we perform dimension reduction on a dataset with 178 wine samples. The dataset includes 13 variables measuring physicochemical properties:

  • Alcohol
  • Malic acid
  • Ash
  • Alkalinity
  • Magnesium
  • Phenols
  • Flavanoids
  • Non-flavonoid phenols
  • Proanthocyanins
  • Color intensity
  • Hue
  • OD280/OD315 of diluted wines
  • Proline

We also have a variable, “cultivar” (cultivated variety), which can be thought of as group labels. However, we will pretend we do not have access to this variable and examine whether patterns in the multivariate data from the 13 variables can be visualized in a lower-dimensional space.

Why PCA?

Many of the variables are highly correlated, making PCA a useful tool for this analysis. PCA will identify a smaller number of orthogonal variables (principal components) that capture most of the variance in the data. This allows us to reduce the dimensionality while preserving the structure and variability of the original dataset.

Goal

The goal is to transform the high-dimensional data into a lower-dimensional space (e.g., 2D or 3D) to visualize patterns or clusters, potentially corresponding to the different cultivars, without directly using the cultivar labels.

# citation: https://www.r-bloggers.com/principal-component-analysis-in-r/
# http://archive.ics.uci.edu/ml/datasets/Wine
wine <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep=",")


# Name the variables
colnames(wine) <- c("Cvs","Alcohol","Malic acid","Ash","Alkalinity", "Magnesium", "Phenols", "Flavanoids", "NonflavPhen", "Proantho", "Color", "Hue", "OD", "Proline")

# The first column corresponds to the cultivars (varieties)
wineClasses <- factor(wine$Cvs)
wine$Cvs = wineClasses

# Use pairs
pairs(wine[,-1], col = wineClasses, upper.panel = NULL, pch = 16, cex = 0.5)
legend("topright", bty = "n", legend = c("Cv1","Cv2","Cv3"), pch = 16, col = c("black","red","green"),xpd = T, cex = 2, y.intersp = 0.5)

# cor(wine[,-1])
library(corrplot)   # load corrplot for the correlation plot
corrplot(cor(wine[,-1]))

# a fair amount of correlation, so we can reasonably hope to capture most of the information with some $q < 13$
winePCA <- prcomp(wine[,-1],center = TRUE, scale. = TRUE)
summary(winePCA)
Importance of components:
                         PC1    PC2    PC3     PC4     PC5     PC6     PC7
Standard deviation     2.169 1.5802 1.2025 0.95863 0.92370 0.80103 0.74231
Proportion of Variance 0.362 0.1921 0.1112 0.07069 0.06563 0.04936 0.04239
Cumulative Proportion  0.362 0.5541 0.6653 0.73599 0.80162 0.85098 0.89337
                           PC8     PC9   PC10    PC11    PC12    PC13
Standard deviation     0.59034 0.53748 0.5009 0.47517 0.41082 0.32152
Proportion of Variance 0.02681 0.02222 0.0193 0.01737 0.01298 0.00795
Cumulative Proportion  0.92018 0.94240 0.9617 0.97907 0.99205 1.00000
screeplot(winePCA)

# I prefer the plot with points:
#screeplot: the most important tool for selecting number of components
plot(winePCA$sdev^2,xlab='Rank of Eigenvalue',ylab='Eigenvalue',main='Screeplot for the Wine Dataset',type='b')

# here, 3 would be a good option since it corresponds to the "elbow"
cumsum(winePCA$sdev^2)/sum(winePCA$sdev^2)
 [1] 0.3619885 0.5540634 0.6652997 0.7359900 0.8016229 0.8509812 0.8933680
 [8] 0.9201754 0.9423970 0.9616972 0.9790655 0.9920479 1.0000000
# 67% of variance with 3. 90% is often a rule of thumb for data compression; for a latent variable representation, the "elbow" is a useful heuristic


# the scores can often be useful for visualizing differences between groups
# plot the scores for the first two PCs and color the points by the categories:
par(mfrow=c(1,3))
plot(winePCA$x[,1:2], col = wineClasses)
plot(winePCA$x[,c(1,3)], col = wineClasses)
plot(winePCA$x[,c(2,3)], col = wineClasses)

# The first two PCs seem to help the most with separating all three cultivars

PCA Regression

PCA regression involves estimating the first \(q\) principal components and then using these components (subject scores) as predictors in linear regression:

\[ y_i = \beta_0 + \sum_{k=1}^q \beta_k z_{ik}. \]

Advantages of PCA Regression

  1. Reduced Overfitting: By estimating \(q < p\) coefficients, we reduce the risk of overfitting, especially in high-dimensional datasets.

  2. Orthogonality of Predictors: The predictors (principal components) are orthogonal, which allows for more precise estimates of their standard errors compared to the original variables \(x_i\).

Limitation

The principal components can be more difficult to interpret because each component is a linear combination of the original variables:

\[ z_{ik} = u_{k}' x_{i}, \]

where \(u_{kj}\) are the loadings for the \(k\)th principal component.

Summary

  • While PCA regression improves model precision and reduces overfitting, the trade-off is a loss in interpretability since the predictors no longer correspond directly to the original variables.

  • When \(q < p\), PCA regression assumes that the directions containing the most variance in \(x_i\) are the ones most closely associated with the response \(y_i\). However, this assumption may not always hold, as low-variance directions could still be strongly associated with \(y_i\).

  • When \(p > n\), the usual ordinary least squares (OLS) estimate is undefined. In such scenarios, one approach is to choose some \(q\) such that \(q < n < p\), and then use principal component regression (PCR). This ensures a stable solution by reducing the number of predictors to fewer than the number of observations.

  • While PCA regression has limitations due to its independence from \(y_i\), it remains a practical method for dimensionality reduction, especially when \(p > n\) and traditional regression methods are not feasible.

Motivating Example: PCA regression

Baseball example

Example 6.7: MLB Stats and Salary Prediction

Objective

The goal is to predict Salary of MLB players in 1987 based on performance and demographic statistics from 1986.

Dataset

The dataset contains MLB statistics for 1986, with the response variable being the player’s Salary in 1987. Below is a list of variables included in the dataset.

library(ISLR)
library(pls)

Attaching package: 'pls'
The following object is masked from 'package:corrplot':

    corrplot
The following object is masked from 'package:stats':

    loadings
#Read in the data
data(Hitters)
Hitters = na.omit(Hitters)
names(Hitters)
 [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"       "Walks"    
 [7] "Years"     "CAtBat"    "CHits"     "CHmRun"    "CRuns"     "CRBI"     
[13] "CWalks"    "League"    "Division"  "PutOuts"   "Assists"   "Errors"   
[19] "Salary"    "NewLeague"
pcr.fit = pcr(Salary~., data = Hitters, scale = T)
summary(pcr.fit)
Data:   X dimension: 263 19 
    Y dimension: 263 1
Fit method: svdpc
Number of components considered: 19
TRAINING: % variance explained
        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
X         38.31    60.16    70.84    79.03    84.29    88.63    92.26    94.96
Salary    40.63    41.58    42.17    43.22    44.90    46.48    46.69    46.75
        9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
X         96.28     97.26     97.98     98.65     99.15     99.47     99.75
Salary    46.86     47.76     47.82     47.85     48.10     50.40     50.55
        16 comps  17 comps  18 comps  19 comps
X          99.89     99.97     99.99    100.00
Salary     53.01     53.85     54.61     54.61
# NOTE: make sure to scale=TRUE -- defaults to false:(
# % variance explained: first row is percent of variance in X, second row is R^2 in PCR
set.seed(2) # for cross-validation to determine number of components
pcr.fit = pcr(Salary~., data=Hitters, scale=TRUE, validation='CV')
validationplot(pcr.fit,estimate='all',legendpos='topright')

  • In this example, cross-validation is not very helpful.
  • The CV error is minimized at 16 components, but there is no clear minimum.
  • PC1 does fairly well on its own.
  • Let's take a detailed look at what is happening in the regression:
# create a matrix that can be used by prcomp:
hitters = model.matrix(Salary~.,data=Hitters)

# now use prcomp and then lm:
pca.hitters = prcomp(x=hitters[,-1],center = TRUE, scale. = TRUE)

# if we used the 95% rule of thumb, choose 9 components
# Let's try four, since these eigenvalues stand out
pcr.fit.v2 = lm(Hitters$Salary~pca.hitters$x[,1:4])
summary(pcr.fit.v2)

Call:
lm(formula = Hitters$Salary ~ pca.hitters$x[, 1:4])

Residuals:
    Min      1Q  Median      3Q     Max 
-905.49 -173.80  -24.63  115.75 2163.36 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              535.926     21.122  25.373   <2e-16 ***
pca.hitters$x[, 1:4]PC1  106.571      7.843  13.587   <2e-16 ***
pca.hitters$x[, 1:4]PC2   21.645     10.388   2.084   0.0382 *  
pca.hitters$x[, 1:4]PC3  -24.341     14.852  -1.639   0.1024    
pca.hitters$x[, 1:4]PC4  -37.056     16.962  -2.185   0.0298 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 342.5 on 258 degrees of freedom
Multiple R-squared:  0.4322,    Adjusted R-squared:  0.4234 
F-statistic:  49.1 on 4 and 258 DF,  p-value: < 2.2e-16
# look what happens with 16:
tempdata = data.frame('Salary'=Hitters$Salary,pca.hitters$x[,1:16])
pcr.fit.v3 = lm(Salary~PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10+PC11+PC12+PC13+PC14+PC15+PC16,data=tempdata)
summary(pcr.fit.v3)

Call:
lm(formula = Salary ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + 
    PC8 + PC9 + PC10 + PC11 + PC12 + PC13 + PC14 + PC15 + PC16, 
    data = tempdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-875.70 -173.28  -20.81  141.21 1909.63 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  535.926     19.678  27.235  < 2e-16 ***
PC1          106.571      7.307  14.584  < 2e-16 ***
PC2           21.645      9.678   2.236 0.026220 *  
PC3          -24.341     13.836  -1.759 0.079787 .  
PC4          -37.056     15.802  -2.345 0.019823 *  
PC5           58.525     19.729   2.966 0.003309 ** 
PC6          -62.325     21.700  -2.872 0.004433 ** 
PC7          -24.686     23.746  -1.040 0.299562    
PC8           15.858     27.526   0.576 0.565054    
PC9           29.633     39.373   0.753 0.452398    
PC10          99.838     45.860   2.177 0.030432 *  
PC11         -30.170     53.218  -0.567 0.571298    
PC12         -21.033     55.219  -0.381 0.703608    
PC13         -72.540     63.769  -1.138 0.256418    
PC14        -277.213     79.802  -3.474 0.000606 ***
PC15          74.312     86.478   0.859 0.391001    
PC16        -423.532    117.811  -3.595 0.000392 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 319.1 on 246 degrees of freedom
Multiple R-squared:  0.5301,    Adjusted R-squared:  0.4996 
F-statistic: 17.35 on 16 and 246 DF,  p-value: < 2.2e-16
# some later components are highly significant

PC1 is highly significant, so let's take a look at which variables are driving PC1:

# examine loadings for first PC:
pca.hitters$rotation[,1]
        AtBat          Hits         HmRun          Runs           RBI 
 0.1982903511  0.1958612933  0.2043689229  0.1983370917  0.2351738026 
        Walks         Years        CAtBat         CHits        CHmRun 
 0.2089237517  0.2825754503  0.3304629263  0.3307416802  0.3189794925 
        CRuns          CRBI        CWalks       LeagueN     DivisionW 
 0.3382078595  0.3403428387  0.3168029362 -0.0544708722 -0.0257252900 
      PutOuts       Assists        Errors    NewLeagueN 
 0.0776971752 -0.0008416413 -0.0078593695 -0.0419103083 

We see that career totals (CRBI, CRuns, CHits, CAtBat, CHmRun, CWalks) tend to have larger loadings than the 1986 season statistics.

Interpretation
  • Cross-validation insights:
    • Cross-validation suggests minimizing prediction error using 16 principal components, but no clear minimum is observed.
    • PC1 (the first principal component) explains a significant proportion of variance and performs well on its own.
  • Regression with PCA:
    • Using 4 principal components (pcr.fit.v2) provides a concise model, but only partially explains salary variation.
    • A model with all 16 components (pcr.fit.v3) includes later components that are surprisingly significant.
  • Driving factors of PC1:
    • The loadings for the first principal component show career statistics (e.g., CRBI, CRuns, CHits, CAtBats, CHmRun, CWalks) have higher weights compared to the 1986 season-specific statistics.
    • This suggests that career performance metrics are stronger predictors of player salary than short-term (single-season) performance.
  • Significance of later components:
    • While higher-order components explain less variance in predictors, some have significant contributions to predicting salary, highlighting that less dominant patterns in data can still be relevant.
  • Implications:
    • Career statistics provide more consistent and predictive value for salaries.
    • Dimensionality reduction using PCA helps simplify the model but may risk excluding later components with predictive importance.
  • Compare these results to OLS by fitting the following model:
# ols.fit <- lm(Salary ~., data = Hitters)
  • Look at the VIFs using car::vif(ols.fit). What are the findings?

  • Is it correct to use OLS in this case based on VIFs and based on significant coefficients from the OLS output?