## 3.6 Dimension reduction techniques

As we have seen in Section 3.2, the selection of the best linear model from a set of $$p$$ predictors is a challenging task whose difficulty increases with the dimension of the problem, that is, with $$p$$. In addition to the growth of the set of possible models70 as $$p$$ grows, the model space becomes more and more complicated to explore due to the potential multicollinearity among the predictors. In this section we will see two methods that deal with these two problems simultaneously.

### 3.6.1 Review on principal component analysis

Principal Component Analysis (PCA) is a multivariate technique designed to summarize the most important features and relations of $$p$$ numerical random variables $$X_1,\ldots,X_p$$. PCA computes a new set of variables, the principal components $$\mathrm{PC}_1,\ldots, \mathrm{PC}_p$$, that contain the same information as $$X_1,\ldots,X_p$$ but expressed in a more convenient way. The goal of PCA is to retain only a limited number $$1\leq l<p$$ of principal components that explain most of the information, therefore performing dimension reduction.

If $$X_1,\ldots,X_p$$ are centred71, then the principal components are orthonormal linear combinations of $$X_1,\ldots,X_p$$:

\begin{align} \mathrm{PC}_j:=a_{1j}X_1+a_{2j}X_2+\ldots+a_{pj}X_p=\mathbf{a}_j'\mathbf{X},\quad j=1,\ldots,p,\tag{3.5} \end{align}

where $$\mathbf{a}_j:=(a_{1j},\ldots,a_{pj})'$$, $$\mathbf{X}:=(X_1,\ldots,X_p)'$$, and the orthonormality condition is

\begin{align*} \mathbf{a}_i'\mathbf{a}_j=\begin{cases}1,&\text{if } i=j,\\0,&\text{if } i\neq j.\end{cases} \end{align*}

Remarkably, PCA computes the principal components in an ordered way: $$\mathrm{PC}_1$$ is the principal component that explains the most information (quantified as the variance) of $$X_1,\ldots,X_p$$, and then the explained information decreases monotonically down to $$\mathrm{PC}_p$$, the principal component that explains the least. Precisely:

\begin{align} \mathbb{V}\mathrm{ar}[\mathrm{PC}_1]\geq \mathbb{V}\mathrm{ar}[\mathrm{PC}_2]\geq\ldots\geq\mathbb{V}\mathrm{ar}[\mathrm{PC}_p].\tag{3.6} \end{align}

Mathematically, PCA reduces to computing the spectral decomposition72 of the covariance matrix $$\boldsymbol\Sigma:=\mathbb{V}\mathrm{ar}[\mathbf{X}]$$:

\begin{align*} \boldsymbol\Sigma=\mathbf{A}\boldsymbol\Lambda\mathbf{A}', \end{align*}

where $$\boldsymbol\Lambda=\mathrm{diag}(\lambda_1,\ldots,\lambda_p)$$ contains the eigenvalues of $$\boldsymbol\Sigma$$ and $$\mathbf{A}$$ is the orthogonal matrix73 that contains the unit-norm eigenvectors of $$\boldsymbol\Sigma$$ as columns. The matrix $$\mathbf{A}$$ gives, thus, the coefficients of the orthonormal linear combinations:

\begin{align*} \mathbf{A}=\begin{pmatrix} \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_p \end{pmatrix}=\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1p} \\ a_{21} & a_{22} & \cdots & a_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ a_{p1} & a_{p2} & \cdots & a_{pp} \end{pmatrix}. \end{align*}

If the data is not centred, the computation of the principal components is done by first subtracting $$\boldsymbol\mu=\mathbb{E}[\mathbf{X}]$$ and then premultiplying by $$\mathbf{A}'$$:

\begin{align} \mathbf{PC}=\mathbf{A}'(\mathbf{X}-\boldsymbol{\mu}),\tag{3.7} \end{align}

where $$\mathbf{PC}=(\mathrm{PC}_1,\ldots,\mathrm{PC}_p)'$$, $$\mathbf{X}=(X_1,\ldots,X_p)'$$, and $$\boldsymbol{\mu}=(\mu_1,\ldots,\mu_p)'$$. Note that, because of (1.3) and (3.7),

\begin{align} \mathbb{V}\mathrm{ar}[\mathbf{PC}]=\mathbf{A}'\boldsymbol{\Sigma}\mathbf{A}=\mathbf{A}'\mathbf{A}\boldsymbol{\Lambda}\mathbf{A}'\mathbf{A}=\boldsymbol{\Lambda}, \end{align}

therefore $$\mathbb{V}\mathrm{ar}[\mathrm{PC}_j]=\lambda_j$$, $$j=1,\ldots,p$$, and as a consequence (3.6) indeed holds.

Also, from (3.7), it is evident that we can express the random variables $$\mathbf{X}$$ in terms of $$\mathbf{PC}$$:

\begin{align} \mathbf{X}=\boldsymbol{\mu}+\mathbf{A}\mathbf{PC},\tag{3.8} \end{align}

which admits an insightful interpretation: the $$\mathbf{PC}$$ are uncorrelated74 variables that, once rotated by $$\mathbf{A}$$ and translated to the location $$\boldsymbol{\mu}$$, produce exactly $$\mathbf{X}$$. So, $$\mathbf{PC}$$ contains the same information as $$\mathbf{X}$$ but rearranged in a more convenient way because the principal components are centred and uncorrelated between them:

\begin{align*} \mathbb{E}[\mathrm{PC}_i]=0\text{ and }\mathrm{Cov}[\mathrm{PC}_i,\mathrm{PC}_j]=\begin{cases}\lambda_i,&\text{if } i=j,\\0,&\text{if } i\neq j.\end{cases} \end{align*}

Given the uncorrelation of $$\mathbf{PC}$$, we can measure the total variance of $$\mathbf{PC}$$ as $$\sum_{j=1}^p\mathbb{V}\mathrm{ar}[\mathrm{PC}_j]=\sum_{j=1}^p\lambda_j$$. Consequently, we can define the proportion of variance explained by the first $$l$$ principal components, $$1\leq l\leq p$$, as

\begin{align*} \frac{\sum_{j=1}^l\lambda_j}{\sum_{j=1}^p\lambda_j}. \end{align*}

In the sample case75 where a sample $$\mathbf{X}_1,\ldots,\mathbf{X}_n$$ is given and $$\boldsymbol\mu$$ and $$\boldsymbol\Sigma$$ are unknown, $$\boldsymbol\mu$$ is replaced by the sample mean $$\bar{\mathbf{X}}$$ and $$\boldsymbol\Sigma$$ by the sample variance-covariance matrix $$\mathbf{S}=\frac{1}{n}\sum_{i=1}^n(\mathbf{X}_i-\bar{\mathbf{X}})(\mathbf{X}_i-\bar{\mathbf{X}})'$$. Then, the spectral decomposition of $$\mathbf{S}$$ is computed76. This gives $$\hat{\mathbf{A}}$$ and produces the scores of the data:

\begin{align*} \hat{\mathbf{PC}}_1:=\hat{\mathbf{A}}'(\mathbf{X}_1-\bar{\mathbf{X}}),\ldots,\hat{\mathbf{PC}}_n:=\hat{\mathbf{A}}'(\mathbf{X}_n-\bar{\mathbf{X}}). \end{align*}

The scores are centred, uncorrelated, and the sample variances of their entries are sorted in decreasing order.
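These properties of the scores can be checked numerically. The following is a minimal sketch with simulated data; it uses `eigen()` on the output of `cov()`, which divides by $$n-1$$ rather than $$n$$, consistently with `var()` below, so the eigenvalues match the sample variances of the scores exactly.

```r
set.seed(1)
# Simulated sample with two correlated variables and an independent one
n <- 500
X1 <- rnorm(n)
X2 <- 0.5 * X1 + rnorm(n, sd = 0.5)
X3 <- rnorm(n)
X <- cbind(X1, X2, X3)

# Spectral decomposition of the covariance matrix
eig <- eigen(cov(X))
A <- eig$vectors # Unit-norm eigenvectors as columns

# Scores: center the data and rotate by A'
scores <- scale(X, center = TRUE, scale = FALSE) %*% A

# Centred, with sample variances equal to the (decreasing) eigenvalues
max(abs(colMeans(scores)))                    # Essentially zero
max(abs(apply(scores, 2, var) - eig$values))  # Essentially zero

# Uncorrelated, and proportion of variance explained by the first component
C <- cov(scores)
max(abs(C[lower.tri(C)]))     # Essentially zero
eig$values[1] / sum(eig$values)
```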

The maximum number of principal components that can be determined from a sample $$\mathbf{X}_1,\ldots,\mathbf{X}_n$$ is $$\min(n-1,p)$$, assuming that the matrix formed by $$\left(\mathbf{X}_1 \cdots \mathbf{X}_n\right)$$ is of full rank (i.e., if the rank is $$\min(n,p)$$). If $$n\geq p$$ and the variables $$X_1,\ldots,X_p$$ are such that only $$r$$ of them are linearly independent, then the maximum is $$\min(n-1,r)$$.
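This cap on the number of recoverable components can be illustrated with a quick sketch on simulated data; `prcomp()` is used here because, unlike `princomp()`, it admits $$n < p$$.

```r
set.seed(1)
n <- 5
p <- 10
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# prcomp() centers the data by default, so the centered data matrix has
# rank min(n - 1, p) = 4 and only 4 components carry positive variance
pca <- prcomp(X)
sum(pca$sdev > 1e-10) # 4
```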

Let’s see an example of these concepts with the La Liga 2015/2016 dataset, which contains the standings and team statistics for La Liga 2015/2016.

laliga <- readxl::read_excel("la-liga-2015-2016.xlsx", sheet = 1, col_names = TRUE)
laliga <- as.data.frame(laliga) # Avoid tibble since it drops row.names

A quick preprocessing gives:

rownames(laliga) <- laliga$Team # Set teams as case names to avoid factors
laliga$Team <- NULL
laliga <- laliga[, -c(2, 8)] # Do not add irrelevant information
summary(laliga)
##      Points           Wins           Draws           Loses        Goals.scored    Goals.conceded
##  Min.   :32.00   Min.   : 8.00   Min.   : 4.00   Min.   : 4.00   Min.   : 34.00   Min.   :18.00
##  1st Qu.:41.25   1st Qu.:10.00   1st Qu.: 8.00   1st Qu.:12.00   1st Qu.: 40.00   1st Qu.:42.50
##  Median :44.50   Median :12.00   Median : 9.00   Median :15.50   Median : 45.50   Median :52.50
##  Mean   :52.40   Mean   :14.40   Mean   : 9.20   Mean   :14.40   Mean   : 52.15   Mean   :52.15
##  3rd Qu.:60.50   3rd Qu.:17.25   3rd Qu.:10.25   3rd Qu.:18.25   3rd Qu.: 51.25   3rd Qu.:63.25
##  Max.   :91.00   Max.   :29.00   Max.   :18.00   Max.   :22.00   Max.   :112.00   Max.   :74.00
##  Percentage.scored.goals Percentage.conceded.goals     Shots       Shots.on.goal   Penalties.scored  Assistances
##  Min.   :0.890           Min.   :0.470             Min.   :346.0   Min.   :129.0   Min.   : 1.00    Min.   :23.00
##  1st Qu.:1.050           1st Qu.:1.115             1st Qu.:413.8   1st Qu.:151.2   1st Qu.: 1.00    1st Qu.:28.50
##  Median :1.195           Median :1.380             Median :438.0   Median :165.0   Median : 3.00    Median :32.50
##  Mean   :1.371           Mean   :1.371             Mean   :452.4   Mean   :173.1   Mean   : 3.45    Mean   :37.85
##  3rd Qu.:1.347           3rd Qu.:1.663             3rd Qu.:455.5   3rd Qu.:180.0   3rd Qu.: 4.50    3rd Qu.:36.75
##  Max.   :2.950           Max.   :1.950             Max.   :712.0   Max.   :299.0   Max.   :11.00    Max.   :90.00
##    Fouls.made    Matches.without.conceding  Yellow.cards     Red.cards       Offsides
##  Min.   :385.0   Min.   : 4.0              Min.   : 66.0   Min.   :1.00   Min.   : 72.00
##  1st Qu.:483.8   1st Qu.: 7.0              1st Qu.: 97.0   1st Qu.:4.00   1st Qu.: 83.25
##  Median :530.5   Median :10.5              Median :108.5   Median :5.00   Median : 88.00
##  Mean   :517.6   Mean   :10.7              Mean   :106.2   Mean   :5.05   Mean   : 92.60
##  3rd Qu.:552.8   3rd Qu.:13.0              3rd Qu.:115.2   3rd Qu.:6.00   3rd Qu.:103.75
##  Max.   :654.0   Max.   :24.0              Max.   :141.0   Max.   :9.00   Max.   :123.00

Let’s check that R’s function for PCA, princomp, returns the same principal components we outlined in the theory.

# PCA
pcaLaliga <- princomp(laliga)
summary(pcaLaliga)
## Importance of components:
##                             Comp.1     Comp.2      Comp.3      Comp.4      Comp.5     Comp.6      Comp.7      Comp.8
## Standard deviation     104.7782561 48.5461449 22.13337511 12.66692413 8.234215354 7.83426116 6.068864168 4.137079559
## Proportion of Variance   0.7743008  0.1662175  0.03455116  0.01131644 0.004782025 0.00432876 0.002597659 0.001207133
## Cumulative Proportion    0.7743008  0.9405183  0.97506949  0.98638593 0.991167955 0.99549671 0.998094374 0.999301507
##                              Comp.9      Comp.10      Comp.11      Comp.12      Comp.13      Comp.14      Comp.15
## Standard deviation     2.0112480391 1.8580509157 1.126111e+00 9.568824e-01 4.716064e-01 1.707105e-03 8.365534e-04
## Proportion of Variance 0.0002852979 0.0002434908 8.943961e-05 6.457799e-05 1.568652e-05 2.055361e-10 4.935768e-11
## Cumulative Proportion  0.9995868048 0.9998302956 9.999197e-01 9.999843e-01 1.000000e+00 1.000000e+00 1.000000e+00
##                             Comp.16 Comp.17
## Standard deviation     5.867584e-07       0
## Proportion of Variance 2.428208e-17       0
## Cumulative Proportion  1.000000e+00       1
# The standard deviations are the square roots of the eigenvalues
# The cumulative proportion of variance explained accumulates the
# variance explained starting at the first component

# Plot of variances of each component (screeplot)
plot(pcaLaliga, type = "l")

# Useful for detecting an "elbow" in the graph whose location gives the
# "right" number of components to retain. Ideally, this elbow appears
# when the next variances are almost similar and notably smaller when
# compared with the previous

# Alternatively: plot of the cumulated percentage of variance
# barplot(cumsum(pcaLaliga$sdev^2) / sum(pcaLaliga$sdev^2))

# Computation of PCA from the spectral decomposition
n <- nrow(laliga)
eig <- eigen(cov(laliga) * (n - 1) / n) # By default, cov() computes the
# quasi-variance-covariance matrix that divides by n - 1, but PCA and princomp
# consider the sample variance-covariance matrix that divides by n
A <- eig$vectors

# Same eigenvalues
pcaLaliga$sdev^2 - eig$values
##        Comp.1        Comp.2        Comp.3        Comp.4        Comp.5        Comp.6        Comp.7        Comp.8
## -1.818989e-12  2.728484e-12  2.273737e-13 -2.273737e-13  5.684342e-14  3.552714e-14  4.263256e-14 -3.552714e-14
##        Comp.9       Comp.10       Comp.11       Comp.12       Comp.13       Comp.14       Comp.15       Comp.16
##  1.234568e-13  5.373479e-14  1.199041e-14  1.054712e-14  1.049161e-14 -3.709614e-15 -2.191892e-15  4.476020e-13
##       Comp.17
##  2.048814e-12

# The eigenvectors (the a_j vectors) are the column vectors in $loadings
pcaLaliga$loadings
## (17 x 17 loadings matrix; blank entries correspond to loadings smaller
## than 0.1 in absolute value, and the output ends with the "SS loadings"
## (all equal to 1), "Proportion Var", and "Cumulative Var" summaries)

# The scores are the representation of the data in the principal components -
# they have the same information as laliga
head(pcaLaliga$scores)
##                     Comp.1     Comp.2    Comp.3      Comp.4     Comp.5      Comp.6     Comp.7     Comp.8     Comp.9
## Barcelona       -242.23916  21.581016 25.380103 -17.4375054 -7.1797218  -9.0814106 -5.5920449 -7.3615386  0.3715688
## Real Madrid     -313.60263 -63.202402 -8.756998   8.5557906  0.7119181  -0.2221379  6.7034894  2.4455971 -1.8388132
## Atlético Madrid  -45.99393   0.646609 38.540964  31.3888316  3.9162812  -3.2904137  0.2431925  5.0912667  3.0444029
## Villarreal        96.22013  42.932869 50.003639 -11.2420481 10.4732634  -2.4293930 -3.0183049  0.1958417 -1.2106025
## Athletic         -14.51728  16.189672 18.884019  -0.4122161 -5.6491352   6.9329640  8.0652665  2.4783231  2.6920566
## Celta             13.07483  -6.792525  5.227118  -9.0945489  6.1264750 -11.8794638 -2.6148154  6.9706627 -3.0825781
##                    Comp.10    Comp.11    Comp.12     Comp.13       Comp.14       Comp.15       Comp.16       Comp.17
## Barcelona        1.7160752  0.0264937 -1.0948280  0.15160351  0.0010244179 -0.0002177801 -1.047044e-12  7.623799e-12
## Real Madrid     -2.9660592 -0.1344557  0.3409538 -0.03316355 -0.0014744734  0.0003891502  1.587123e-12 -1.215672e-11
## Atlético Madrid  2.0974553  0.6771343 -0.3985625 -0.18088616 -0.0004984680 -0.0001116115  6.252046e-13 -1.090230e-12
## Villarreal      -1.7453143  0.1350586 -0.5735057 -0.58936061  0.0001067413 -0.0002431758  3.319980e-13  3.406707e-12
## Athletic         0.8950389  0.1542005  1.4714997  0.11090055  0.0031346887 -0.0002557828 -3.696130e-12  1.895661e-11
## Celta            0.3129865  0.0859623  1.9159241  0.37219921  0.0017960697  0.0002911046 -2.150767e-12  5.513489e-12

# Uncorrelated
corrplot::corrplot(cor(pcaLaliga$scores), addCoef.col = "gray")
# Caution! What happened in the last columns? What happened is that the
# variance for the last principal components is close to zero (because there
# are linear dependencies on the variables; e.g. Points, Wins, Loses, Draws),
# so the computation of the correlation matrix becomes unstable for those
# variables (a 0/0 division takes place)

# Better to inspect the covariance matrix
corrplot::corrplot(cov(pcaLaliga$scores), addCoef.col = "gray",
                   is.corr = FALSE)


# The scores are A' * (X_i - mu). We center the data with scale()
# and then multiply each row by A'
scores <- scale(laliga, center = TRUE, scale = FALSE) %*% A

# Same as (but this is much slower)
# scores <- t(apply(scale(laliga, center = TRUE, scale = FALSE), 1,
#                   function(x) t(A) %*% x))

# Same scores (up to possible changes in signs)
max(abs(abs(pcaLaliga$scores) - abs(scores)))
## [1] 1.427989e-11

# Reconstruct the data from all the principal components
head(sweep(pcaLaliga$scores %*% t(pcaLaliga$loadings), 2, pcaLaliga$center,
           "+"))
##                 Points Wins Draws Loses Goals.scored Goals.conceded Percentage.scored.goals Percentage.conceded.goals
## Barcelona           91   29     4     5          112             29                    2.95                      0.76
## Real Madrid         90   28     6     4          110             34                    2.89                      0.89
## Atlético Madrid     88   28     4     6           63             18                    1.66                      0.47
## Villarreal          64   18    10    10           44             35                    1.16                      0.92
## Athletic            62   18     8    12           58             45                    1.53                      1.18
## Celta               60   17     9    12           51             59                    1.34                      1.55
##                 Shots Shots.on.goal Penalties.scored Assistances Fouls.made Matches.without.conceding Yellow.cards
## Barcelona         600           277               11          79        385                        18           66
## Real Madrid       712           299                6          90        420                        14           72
## Atlético Madrid   481           186                1          49        503                        24           91
## Villarreal        346           135                3          32        534                        17          100
## Athletic          450           178                3          42        502                        13           84
## Celta             442           170                4          43        528                        10          116
##                 Red.cards Offsides
## Barcelona               1      120
## Villarreal              4      106
## Athletic                5       92
## Celta                   6      103

An important issue when doing PCA is the scale of the variables, since the variance depends on the units in which the variable is measured77. Therefore, when variables with different ranges are mixed, the variability of one may dominate the others as an artifact of the scale. To prevent this, we standardize the dataset prior to doing PCA.
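The effect of the scale can be seen in a toy example (a sketch with simulated data): measuring a weight in grams instead of kilograms inflates its variance by a factor of $$10^6$$ and makes it absorb the first principal component entirely.

```r
set.seed(1)
n <- 100
weightKg <- rnorm(n, mean = 70, sd = 10)
height <- rnorm(n, mean = 1.75, sd = 0.1)
X <- data.frame("weightGrams" = 1000 * weightKg, "height" = height)

# Unstandardized PCA: PC1 is essentially weightGrams
pcaRaw <- princomp(X)
propRaw <- pcaRaw$sdev^2 / sum(pcaRaw$sdev^2)
propRaw # PC1 explains virtually 100% of the variance

# Standardized PCA (cor = TRUE): both variables contribute
pcaStd <- princomp(X, cor = TRUE)
propStd <- pcaStd$sdev^2 / sum(pcaStd$sdev^2)
propStd # Close to 50% / 50%, as the variables are independent
```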

# Use cor = TRUE to standardize variables (all have unit variance)
# and avoid scale distortions
pcaLaligaStd <- princomp(x = laliga, cor = TRUE)
summary(pcaLaligaStd)
## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4     Comp.5     Comp.6     Comp.7      Comp.8
## Standard deviation     3.2918365 1.5511043 1.13992451 0.91454883 0.85765282 0.59351138 0.45780827 0.370649324
## Proportion of Variance 0.6374228 0.1415250 0.07643694 0.04919997 0.04326873 0.02072093 0.01232873 0.008081231
## Cumulative Proportion  0.6374228 0.7789478 0.85538472 0.90458469 0.94785342 0.96857434 0.98090308 0.988984306
##                             Comp.9     Comp.10     Comp.11      Comp.12     Comp.13      Comp.14      Comp.15
## Standard deviation     0.327182806 0.217470830 0.128381750 0.0976778705 0.083027923 2.582795e-03 1.226924e-03
## Proportion of Variance 0.006296976 0.002781974 0.000969522 0.0005612333 0.000405508 3.924019e-07 8.854961e-08
## Cumulative Proportion  0.995281282 0.998063256 0.999032778 0.9995940111 0.999999519 9.999999e-01 1.000000e+00
##                             Comp.16 Comp.17
## Standard deviation     2.222483e-08       0
## Proportion of Variance 2.905548e-17       0
## Cumulative Proportion  1.000000e+00       1

# The effects of the distortion can be clearly seen with the biplot
# Variability absorbed by Shots, Shots.on.goal, Fouls.made
biplot(pcaLaliga, cex = 0.75)

# The effects of the variables are more balanced
biplot(pcaLaligaStd, cex = 0.75)

The biplot78 provides a powerful and succinct way of displaying the relevant information contained in the first two principal components. It shows:

1. The scores of the data in $$\mathrm{PC}_1$$ and $$\mathrm{PC}_2$$ by points (with optional text labels, depending on whether there are case names). This is the representation of the data in the first two PCs.
2. The variables represented in the $$\mathrm{PC}_1$$ and $$\mathrm{PC}_2$$ plane by arrows. These arrows are centered at $$(0,0)$$.

Let’s examine the population79 arrow associated to the variable $$X_j$$. $$X_j$$ is expressed in terms of $$\mathrm{PC}_1$$ and $$\mathrm{PC}_2$$ by the weights $$a_{j1}$$ and $$a_{j2}$$:

\begin{align*} X_j=a_{j1}\mathrm{PC}_{1} + a_{j2}\mathrm{PC}_{2} + \ldots + a_{jp}\mathrm{PC}_{p}\approx a_{j1}\mathrm{PC}_{1} + a_{j2}\mathrm{PC}_{2}. \end{align*}

The weights $$a_{j1}$$ and $$a_{j2}$$ have the same sign as $$\mathrm{Cor}(X_j,\mathrm{PC}_{1})$$ and $$\mathrm{Cor}(X_j,\mathrm{PC}_{2})$$, respectively. The arrow associated to $$X_j$$ is given by the segment joining $$(0,0)$$ and $$(a_{j1},a_{j2})$$. Therefore:

• If the arrow points right ($$a_{j1}>0$$), there is positive correlation between $$X_j$$ and $$\mathrm{PC}_1$$. Analogously if the arrow points left.
• If the arrow is approximately vertical ($$a_{j1}\approx0$$), $$X_j$$ and $$\mathrm{PC}_1$$ are approximately uncorrelated.

Analogously:

• If the arrow points up ($$a_{j2}>0$$), there is positive correlation between $$X_j$$ and $$\mathrm{PC}_2$$. Analogously if the arrow points down.
• If the arrow is approximately horizontal ($$a_{j2}\approx0$$), $$X_j$$ and $$\mathrm{PC}_2$$ are approximately uncorrelated.

In addition, the magnitude of the arrow informs us about the strength of the correlation of $$X_j$$ with $$(\mathrm{PC}_{1}, \mathrm{PC}_{2})$$.

The biplot also informs about the direct relation between variables, in view of their expressions in $$\mathrm{PC}_1$$ and $$\mathrm{PC}_2$$. The angle between the arrows of variables $$X_j$$ and $$X_k$$ gives an approximation to the correlation between them, $$\mathrm{Cor}(X_j,X_k)$$:

• If $$\text{angle}\approx 0^\circ$$, the two variables are highly positively correlated.
• If $$\text{angle}\approx 90^\circ$$, they are approximately uncorrelated.
• If $$\text{angle}\approx 180^\circ$$, the two variables are highly negatively correlated.
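This angle-correlation approximation can be checked numerically. The following is a sketch with simulated data; it uses the arrows scaled by the standard deviations of the components (so that, for standardized variables, the inner product of two arrows is the two-term truncation of the correlation), rather than the raw loadings.

```r
set.seed(1)
n <- 200
X1 <- rnorm(n)
X2 <- 0.9 * X1 + sqrt(1 - 0.9^2) * rnorm(n) # Cor(X1, X2) is about 0.9
X3 <- rnorm(n)                              # Uncorrelated with X1 and X2
pca <- princomp(data.frame(X1, X2, X3), cor = TRUE)

# Arrows in the (PC1, PC2) plane, scaled by the components' standard
# deviations; row j contains the arrow of variable j
arrows <- pca$loadings[, 1:2] %*% diag(pca$sdev[1:2])

# Inner products of arrows approximate the correlations
sum(arrows[1, ] * arrows[2, ]) # Close to cor(X1, X2), about 0.9
sum(arrows[1, ] * arrows[3, ]) # Close to cor(X1, X3), about zero
```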

The insights obtained from the approximate correlations between the variables and the principal components are only as valid as the proportion of variance explained by $$\mathrm{PC}_1$$ and $$\mathrm{PC}_2$$ is large.

Some interesting insights in La Liga 2015/2016 biplot, obtained from the previous remarks, are80:

• $$-\mathrm{PC}_1$$ can be regarded as the performance of a team during the season. $$\mathrm{PC}_1$$ is negatively correlated with Wins, Points, etc. and positively correlated with Draws, Loses, Yellow.cards, etc. The best performing teams are not surprising: Barcelona, Real Madrid, and Atlético Madrid. On the other hand, among the worst-performing teams are Levante, Getafe, and Granada.
• $$-\mathrm{PC}_2$$ can be seen as the efficiency of a team (obtaining points with little participation in the game). Using this interpretation we can see that Atlético Madrid and Villarreal were the most efficient teams and that Rayo Vallecano and Real Madrid were the most inefficient.
• Offsides is approximately uncorrelated with Red.cards and Matches.without.conceding.

A 3D representation of the biplot can be computed through:

pca3d::pca3d(pcaLaligaStd, show.labels = TRUE, biplot = TRUE)
## [1] 0.1747106 0.0678678 0.0799681
rgl::rglwidget()

Finally, the biplot function allows constructing the biplot with two arbitrary principal components by specifying the choices argument. Keep in mind that these pseudo-biplots will explain a lower proportion of variance than the default choices = 1:2.

biplot(pcaLaligaStd, choices = c(1, 3)) # 0.7138 proportion of variance
biplot(pcaLaligaStd, choices = c(2, 3)) # 0.2180 proportion of variance

In view of the previous plots, can you think of an interpretation for $$\mathrm{PC}_3$$?

### 3.6.2 Principal components regression

The key idea behind Principal Components Regression (PCR) is to regress the response $$Y$$ on a set of principal components $$\mathrm{PC}_1,\ldots,\mathrm{PC}_l$$ obtained from the predictors $$X_1,\ldots,X_p$$, where $$l<p$$. The motivation is that often a small number of principal components is enough to explain most of the variability of the predictors and, consequently, their relationship with $$Y$$81. Therefore, we seek to fit the linear model82

\begin{align} Y=\alpha_0+\alpha_1\mathrm{PC}_1+\ldots+\alpha_l\mathrm{PC}_l+\varepsilon.\tag{3.9} \end{align}

The main advantages of PCR are two:

1. Multicollinearity is avoided by design: $$\mathrm{PC}_1,\ldots,\mathrm{PC}_l$$ are uncorrelated between them.
2. There are fewer coefficients to estimate ($$l$$ instead of $$p$$), hence the accuracy of the estimation increases.

However, keep in mind that PCR affects the linear model in two fundamental ways:

1. The interpretation of the coefficients is no longer in terms of the predictors, but in terms of the principal components. Hence, the interpretability of a given coefficient in the regression model is tied to the interpretability of the associated principal component.
2. Prediction needs an extra step, since it is required to obtain the scores of the new observations of the predictors in the principal components.
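The extra prediction step can be sketched with simulated data: first obtain the components with princomp, fit the linear model on the scores, and then compute the scores of the new observation before calling predict.

```r
set.seed(1)
n <- 100
X1 <- rnorm(n)
X2 <- X1 + rnorm(n, sd = 0.25) # Highly correlated with X1
X3 <- rnorm(n)
Y <- 1 + X1 + X2 + rnorm(n)
preds <- data.frame(X1, X2, X3)

# PCA on the predictors and PCR with l = 2 components
pca <- princomp(preds)
fit <- lm(Y ~ Comp.1 + Comp.2, data = data.frame("Y" = Y, pca$scores))

# To predict at a new observation, first obtain its scores (predict.princomp
# centers the new data with the sample mean of the original sample)
newX <- data.frame("X1" = 0.5, "X2" = 0.4, "X3" = -1)
newScores <- predict(pca, newdata = newX)
predict(fit, newdata = data.frame(newScores))
```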

The first point is worth discussing now. The PCR model (3.9) can be seen as a linear model expressed in terms of the original predictors. To make this point clearer, let’s re-express (3.9) as

\begin{align} Y=\alpha_0+\mathbf{PC}_{1:l}'\boldsymbol{\alpha}_{1:l}+\varepsilon,\tag{3.10} \end{align}

where the subindex $$1:l$$ denotes the inclusion of the vector entries from $$1$$ to $$l$$. Now, we can express the PCR problem (3.10) in terms of the original predictors83:

\begin{align*} Y&=\alpha_0+(\mathbf{A}_{1:l}'(\mathbf{X}-\boldsymbol{\mu}))'\boldsymbol{\alpha}_{1:l}+\varepsilon\\ &=(\alpha_0-\boldsymbol{\mu}'\mathbf{A}_{1:l}\boldsymbol{\alpha}_{1:l})+\mathbf{X}'\mathbf{A}_{1:l}\boldsymbol{\alpha}_{1:l}+\varepsilon\\ &=\gamma_0+\mathbf{X}'\boldsymbol{\gamma}_{1:p}+\varepsilon, \end{align*}

where $$\mathbf{A}_{1:l}$$ represents the $$\mathbf{A}$$ matrix with only its first $$l$$ columns and

\begin{align} \gamma_0:=\alpha_0-\boldsymbol{\mu}'\mathbf{A}_{1:l}\boldsymbol{\alpha}_{1:l},\quad \boldsymbol{\gamma}_{1:p}:=\mathbf{A}_{1:l}\boldsymbol{\alpha}_{1:l}.\tag{3.11} \end{align}

In other words, the coefficients $$\boldsymbol{\alpha}_{1:l}$$ of the PCR done with $$l$$ principal components in (3.9) translate into the coefficients (3.11) of the linear model

\begin{align*} Y=\gamma_0+\gamma_1X_1+\ldots+\gamma_pX_p+\varepsilon. \end{align*}

In the sample case, we have that

\begin{align} \hat{\gamma}_0=\hat\alpha_0-\bar{\mathbf{X}}'\hat{\mathbf{A}}_{1:l}\hat{\boldsymbol{\alpha}}_{1:l},\quad \hat{\boldsymbol{\gamma}}_{1:p}=\hat{\mathbf{A}}_{1:l}\hat{\boldsymbol{\alpha}}_{1:l}.\tag{3.12} \end{align}

Notice that $$\hat{\boldsymbol{\gamma}}$$ is not the least squares estimator, which we denoted by $$\hat{\boldsymbol{\beta}}$$, but just the coefficients of the PCR associated to the original predictors. Consequently, $$\hat{\boldsymbol{\gamma}}$$ is useful for the interpretation of the linear model produced by PCR, as it can be interpreted in the same way $$\hat{\boldsymbol{\beta}}$$ was84.
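As a sanity check of (3.12) (a sketch with simulated data), fitting the PCR with all $$l=p$$ components and mapping the coefficients back to the original predictors recovers exactly the least squares fit.

```r
set.seed(42)
n <- 100
X1 <- rnorm(n)
X2 <- rnorm(n)
X3 <- rnorm(n)
Y <- 1 + 2 * X1 - X2 + 0.5 * X3 + rnorm(n)
preds <- data.frame(X1, X2, X3)

# PCR with all l = p components
pca <- princomp(preds)
fitPCR <- lm(Y ~ ., data = data.frame("Y" = Y, pca$scores))
alpha <- coef(fitPCR)

# Map (alpha_0, alpha_{1:p}) back to (gamma_0, gamma_{1:p}) with (3.12)
A <- unclass(pca$loadings)
gamma <- drop(A %*% alpha[-1])
gamma0 <- alpha[1] - sum(colMeans(preds) * gamma)

# With l = p, (gamma_0, gamma_{1:p}) is exactly the least squares fit
fitLS <- lm(Y ~ X1 + X2 + X3, data = preds)
max(abs(c(gamma0, gamma) - coef(fitLS))) # Essentially zero
```

With $$l<p$$, the mapped coefficients no longer coincide with least squares; they are the interpretation device described above.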

Finally, remember that the usefulness of PCR relies on how well we are able to reduce the dimensionality85 of the predictors and the veracity of the assumption that the $$l$$ principal components are related with $$Y$$.

Keep in mind that PCR considers the PCA done on the set of predictors, that is, we exclude the response for obvious reasons (otherwise we would obtain a perfect but useless fit). It is important to remove the response from the call to princomp if we want to use the output in lm.

We see now two approaches for performing PCR, which we illustrate with the laliga dataset. The common objective is to predict Points using the remaining variables (except for the directly related variables Wins, Draws, Loses, and Matches.without.conceding) in order to quantify, explain, and predict the final points of a team from its performance.

The first approach combines the use of the princomp and lm functions. Its strong points are that it is able both to predict and to explain, and that it is linked with techniques we have employed so far. Its weak point is that it requires some extra coding.

# A linear model is problematic
mod <- lm(Points ~ . - Wins - Draws - Loses - Matches.without.conceding,
data = laliga)
summary(mod) # Lots of non-significant predictors
##
## Call:
## lm(formula = Points ~ . - Wins - Draws - Loses - Matches.without.conceding,
##     data = laliga)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.2117 -1.4766  0.0544  1.9515  4.1422
##
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)
## (Intercept)                 77.11798   26.12915   2.951   0.0214 *
## Goals.scored               -28.21714   17.44577  -1.617   0.1498
## Goals.conceded             -24.23628   15.45595  -1.568   0.1608
## Percentage.scored.goals   1066.98731  655.69726   1.627   0.1477
## Percentage.conceded.goals  896.94781  584.97833   1.533   0.1691
## Shots                       -0.10246    0.07754  -1.321   0.2279
## Shots.on.goal                0.02024    0.13656   0.148   0.8863
## Penalties.scored            -0.81018    0.77600  -1.044   0.3312
## Assistances                  1.41971    0.44103   3.219   0.0147 *
## Fouls.made                  -0.04438    0.04267  -1.040   0.3328
## Yellow.cards                 0.27850    0.16814   1.656   0.1416
## Red.cards                    0.68663    1.44229   0.476   0.6485
## Offsides                    -0.00486    0.14854  -0.033   0.9748
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.274 on 7 degrees of freedom
## Multiple R-squared:  0.9795, Adjusted R-squared:  0.9443
## F-statistic: 27.83 on 12 and 7 DF,  p-value: 9.784e-05

# We try to clean the model
modBIC <- MASS::stepAIC(mod, k = log(nrow(laliga)), trace = 0)
summary(modBIC) # Better, but still unsatisfactory
##
## Call:
## lm(formula = Points ~ Goals.scored + Goals.conceded + Percentage.scored.goals +
##     Percentage.conceded.goals + Shots + Assistances + Yellow.cards,
##     data = laliga)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.4830 -1.4505  0.9008  1.1813  5.8662
##
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                62.91373   10.73528   5.860 7.71e-05 ***
## Goals.scored              -23.90903   11.22573  -2.130  0.05457 .
## Goals.conceded            -12.16610    8.11352  -1.499  0.15959
## Percentage.scored.goals   894.56861  421.45891   2.123  0.05528 .
## Percentage.conceded.goals 440.76333  307.38671   1.434  0.17714
## Shots                      -0.05752    0.02713  -2.120  0.05549 .
## Assistances                 1.42267    0.28462   4.999  0.00031 ***
## Yellow.cards                0.11313    0.07868   1.438  0.17603
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.743 on 12 degrees of freedom
## Multiple R-squared:  0.973,  Adjusted R-squared:  0.9572
## F-statistic: 61.77 on 7 and 12 DF,  p-value: 1.823e-08

# Also, huge multicollinearity
car::vif(modBIC)
##              Goals.scored            Goals.conceded   Percentage.scored.goals Percentage.conceded.goals
##              77998.044758              22299.952547              76320.612577              22322.307151
##                     Shots               Assistances              Yellow.cards
##                  6.505748                 32.505831                  3.297224
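As a side note on what these numbers mean: for two standardized predictors with correlation $$r$$, the VIF of each is $$\frac{1}{1-r^2}$$. The following Python sketch (a toy illustration, not the laliga data; `pearson_r` and `vif_two_predictors` are ad hoc helper names) shows how near-collinearity inflates the VIF:

```python
# Toy two-predictor illustration: VIF_j = 1 / (1 - r^2), where r is the
# correlation between the two standardized predictors.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def vif_two_predictors(x1, x2):
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 2.2, 2.9, 4.3, 4.8]   # nearly collinear with x1
print(vif_two_predictors(x1, x2))  # large VIF, signalling multicollinearity
```

With $$r\approx 0.99$$ the VIF already exceeds 60, which hints at how extreme the correlations behind the five-digit VIFs above must be.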

# A quick way of removing columns without knowing their positions
laligaRed <- subset(laliga, select = -c(Points, Wins, Draws, Loses,
Matches.without.conceding))
# PCA without Points, Wins, Draws, Loses, and Matches.without.conceding
pcaLaligaRed <- princomp(x = laligaRed, cor = TRUE)
summary(pcaLaligaRed) # l = 3 gives 86% of variance explained
## Importance of components:
##                           Comp.1    Comp.2     Comp.3    Comp.4     Comp.5    Comp.6      Comp.7      Comp.8
## Standard deviation     2.7437329 1.4026745 0.91510249 0.8577839 0.65747209 0.5310954 0.332556029 0.263170555
## Proportion of Variance 0.6273392 0.1639580 0.06978438 0.0613161 0.03602246 0.0235052 0.009216126 0.005771562
## Cumulative Proportion  0.6273392 0.7912972 0.86108155 0.9223977 0.95842012 0.9819253 0.991141438 0.996913000
##                             Comp.9     Comp.10      Comp.11      Comp.12
## Standard deviation     0.146091551 0.125252621 3.130311e-03 1.801036e-03
## Proportion of Variance 0.001778562 0.001307352 8.165704e-07 2.703107e-07
## Cumulative Proportion  0.998691562 0.999998913 9.999997e-01 1.000000e+00

# Interpretation of PC1 and PC2
biplot(pcaLaligaRed)

# PC1: attack performance of the team

# Create a new dataset with the response + principal components
laligaPCA <- data.frame("Points" = laliga$Points, pcaLaligaRed$scores)

# Regression on all the principal components
modPCA <- lm(Points ~ ., data = laligaPCA)
summary(modPCA) # Predictors clearly significant - same R^2 as mod
##
## Call:
## lm(formula = Points ~ ., data = laligaPCA)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.2117 -1.4766  0.0544  1.9515  4.1422
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   52.4000     0.9557  54.831 1.76e-10 ***
## Comp.1         5.7690     0.3483  16.563 7.14e-07 ***
## Comp.2         2.4376     0.6813   3.578   0.0090 **
## Comp.3         3.4222     1.0443   3.277   0.0135 *
## Comp.4         3.6079     1.1141   3.238   0.0143 *
## Comp.5        -1.9713     1.4535  -1.356   0.2172
## Comp.6        -5.7067     1.7994  -3.171   0.0157 *
## Comp.7        -3.4169     2.8737  -1.189   0.2732
## Comp.8         9.0212     3.6313   2.484   0.0419 *
## Comp.9        -4.6455     6.5415  -0.710   0.5006
## Comp.10      -10.2087     7.6299  -1.338   0.2227
## Comp.11     -222.0340   305.2920  -0.727   0.4907
## Comp.12     -954.7650   530.6164  -1.799   0.1150
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.274 on 7 degrees of freedom
## Multiple R-squared:  0.9795, Adjusted R-squared:  0.9443
## F-statistic: 27.83 on 12 and 7 DF,  p-value: 9.784e-05
car::vif(modPCA) # No problems at all
##  Comp.1  Comp.2  Comp.3  Comp.4  Comp.5  Comp.6  Comp.7  Comp.8  Comp.9 Comp.10 Comp.11 Comp.12
##       1       1       1       1       1       1       1       1       1       1       1       1

# Using the first three components
modPCA3 <- lm(Points ~ Comp.1 + Comp.2 + Comp.3, data = laligaPCA)
summary(modPCA3)
##
## Call:
## lm(formula = Points ~ Comp.1 + Comp.2 + Comp.3, data = laligaPCA)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -8.178 -4.541 -1.401  3.501 16.093
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  52.4000     1.5672  33.435 3.11e-16 ***
## Comp.1        5.7690     0.5712  10.100 2.39e-08 ***
## Comp.2        2.4376     1.1173   2.182   0.0444 *
## Comp.3        3.4222     1.7126   1.998   0.0630 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.009 on 16 degrees of freedom
## Multiple R-squared:  0.8738, Adjusted R-squared:  0.8501
## F-statistic: 36.92 on 3 and 16 DF,  p-value: 2.027e-07

# Coefficients associated to each original predictor (gamma)
alpha <- modPCA3$coefficients
gamma <- pcaLaligaRed$loadings[, 1:3] %*% alpha[-1] # Slopes
gamma <- c(alpha[1] - pcaLaligaRed$center %*% gamma, gamma) # Intercept
gamma
##  [1] -44.2288551   1.7305124  -3.4048178   1.7416378  -3.3944235   0.2347716   0.8782162   2.6044699   1.4548813
## [10]  -0.1171732  -1.7826488  -2.6423211   1.3755697

# We can overpenalize to have a simpler model - also one single
# principal component does quite well
modPCABIC <- MASS::stepAIC(modPCA, k = 2 * log(nrow(laliga)), trace = 0)
summary(modPCABIC)
##
## Call:
## lm(formula = Points ~ Comp.1 + Comp.2 + Comp.3 + Comp.4 + Comp.6 +
##     Comp.8, data = laligaPCA)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.6972 -2.6418 -0.3265  2.3535  8.4944
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  52.4000     1.0706  48.946 3.95e-16 ***
## Comp.1        5.7690     0.3902  14.785 1.65e-09 ***
## Comp.2        2.4376     0.7632   3.194  0.00705 **
## Comp.3        3.4222     1.1699   2.925  0.01182 *
## Comp.4        3.6079     1.2481   2.891  0.01263 *
## Comp.6       -5.7067     2.0158  -2.831  0.01416 *
## Comp.8        9.0212     4.0680   2.218  0.04502 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.788 on 13 degrees of freedom
## Multiple R-squared:  0.9521, Adjusted R-squared:  0.9301
## F-statistic: 43.11 on 6 and 13 DF,  p-value: 7.696e-08

# Note that the order of the principal components does not correspond
# exactly to their importance in the regression!

# To perform prediction we need to compute first the scores associated to the
# new values of the predictors, conveniently preprocessed

# Predictions for FCB and RMA (although they are part of the training sample)
newPredictors <- laligaRed[1:2, ]
newPredictors <- scale(newPredictors, center = pcaLaligaRed$center,
                       scale = pcaLaligaRed$scale) # Centred and scaled
newScores <- t(apply(newPredictors, 1, function(x)
  t(pcaLaligaRed$loadings) %*% x))

# We need a data frame for prediction
newScores <- data.frame("Comp" = newScores)
predict(modPCABIC, newdata = newScores, interval = "prediction")
##                  fit      lwr      upr
## Barcelona   93.64950 80.35115 106.9478
## Real Madrid 90.05622 77.11876 102.9937

# Reality
laliga[1:2, 1]
## [1] 91 90
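The preprocessing-then-projection step used above for prediction (centre and scale the new observation with the PCA's own centers and scales, then multiply by the loadings) can be mimicked in a few lines. This is a Python sketch with made-up toy numbers, not the laliga fit; `pca_scores` is an ad hoc helper:

```python
# Project a new observation onto PCA loadings after centring and scaling,
# as scale() + the loadings product do in the R code above.
def pca_scores(x_new, center, scale, loadings):
    # loadings[j][k] is the weight of predictor j in component k
    z = [(x - c) / s for x, c, s in zip(x_new, center, scale)]
    p, l = len(loadings), len(loadings[0])
    return [sum(loadings[j][k] * z[j] for j in range(p)) for k in range(l)]

# Toy p = 2, l = 2 example: loadings of a 45-degree rotation
loadings = [[0.7071, -0.7071],
            [0.7071,  0.7071]]
center, scale = [10.0, 20.0], [2.0, 4.0]
print(pca_scores([12.0, 24.0], center, scale, loadings))
```

The centred-and-scaled observation is $$(1, 1)$$, so its scores are simply the column sums of the loadings.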

The second approach employs the function pls::pcr and is more direct, yet less connected with the techniques we have seen so far. It produces a model object that is different from the lm object and, as a consequence, functions like summary, BIC, MASS::stepAIC, or plot will not work properly on it. This implies that inference, model selection, and model diagnostics are not as straightforward. In exchange, pls::pcr makes model fitting easier and allows model selection through cross-validation. In summary, it is a more purely predictive approach, rather than both predictive and explicative.

# Create a dataset without the problematic predictors and with the response
laligaRed2 <- subset(laliga, select = -c(Wins, Draws, Loses,
Matches.without.conceding))

# Simple call to pcr
library(pls)
modPcr <- pcr(Points ~ ., data = laligaRed2, scale = TRUE)
# Notice we do not need to create a data.frame with PCA, it is automatically
# done within pcr. We also have flexibility to remove predictors from the PCA
# scale = TRUE means that the variables are scaled prior to compute PCA

# The summary of the model is different
summary(modPcr)
## Data:    X dimension: 20 12
##  Y dimension: 20 1
## Fit method: svdpc
## Number of components considered: 12
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## X         62.73    79.13    86.11    92.24    95.84    98.19    99.11    99.69    99.87    100.00       100    100.00
## Points    80.47    84.23    87.38    90.45    90.99    93.94    94.36    96.17    96.32     96.84        97     97.95
# First row: percentage of variance explained of the predictors
# Second row: percentage of variance explained of Y (the R^2)
# Note that we have the same R^2 for 3 and 12 components as in the previous
# approach
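The two rows of this summary are plain variance ratios. A quick Python sketch (toy eigenvalues, not the laliga ones; `variance_explained` is a made-up helper) of how "Proportion of Variance" and "Cumulative Proportion" follow from the component variances:

```python
# Each component's variance divided by the total variance, then accumulated,
# reproduces the two proportion rows of a PCA summary.
def variance_explained(variances):
    total = sum(variances)
    props = [v / total for v in variances]
    cum, out = 0.0, []
    for p in props:
        cum += p
        out.append(cum)
    return props, out

props, cum = variance_explained([4.0, 2.0, 1.0, 1.0])
print(props)    # [0.5, 0.25, 0.125, 0.125]
print(cum[-1])  # 1.0
```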

# Slots of information in the model - most of them as 3-dim arrays with the
# third dimension indexing the number of components considered
names(modPcr)
##  [1] "coefficients"  "scores"        "loadings"      "Yloadings"     "projection"    "Xmeans"        "Ymeans"
##  [8] "fitted.values" "residuals"     "Xvar"          "Xtotvar"       "fit.time"      "ncomp"         "method"
## [15] "scale"         "call"          "terms"         "model"

# The coefficients of the original predictors (gammas), not of the components!
modPcr$coefficients[, , 12]
##              Goals.scored            Goals.conceded   Percentage.scored.goals Percentage.conceded.goals
##            -602.85050765             -383.07010184              600.61255371              374.38729000
##                     Shots             Shots.on.goal          Penalties.scored               Assistances
##               -8.27239221                0.88174787               -2.14313238               24.42240486
##                Fouls.made              Yellow.cards                 Red.cards                  Offsides
##               -2.96044265                5.51983512                1.20945331               -0.07231723
# pcr() computes up to ncomp (in this case, 12) linear models, each one
# considering one extra principal component. $coefficients returns in a
# 3-dim array the coefficients of all the linear models

# Prediction is simpler and can be done for different number of components
predict(modPcr, newdata = laligaRed2[1:2, ], ncomp = 12)
## , , 12 comps
##
##               Points
## Barcelona   92.01244

# Selecting the number of components to retain. All the components up to ncomp
# are selected, no further flexibility is possible
modPcr2 <- pcr(Points ~ ., data = laligaRed2, scale = TRUE, ncomp = 3)
summary(modPcr2)
## Data:    X dimension: 20 12
##  Y dimension: 20 1
## Fit method: svdpc
## Number of components considered: 3
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps
## X         62.73    79.13    86.11
## Points    80.47    84.23    87.38

# Selecting the number of components to retain by Leave-One-Out
# cross-validation
modPcrCV1 <- pcr(Points ~ ., data = laligaRed2, scale = TRUE,
validation = "LOO")
summary(modPcrCV1)
## Data:    X dimension: 20 12
##  Y dimension: 20 1
## Fit method: svdpc
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 20 leave-one-out segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps  11 comps
## CV           18.57    8.505    8.390    8.588    7.571    7.688    6.743     7.09    6.224    6.603     7.547     8.375
## adjCV        18.57    8.476    8.356    8.525    7.513    7.663    6.655     7.03    6.152    6.531     7.430     8.236
##        12 comps
## CV        7.905
##
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## X         62.73    79.13    86.11    92.24    95.84    98.19    99.11    99.69    99.87    100.00       100    100.00
## Points    80.47    84.23    87.38    90.45    90.99    93.94    94.36    96.17    96.32     96.84        97     97.95

# View cross-validation Mean Squared Error in Prediction
validationplot(modPcrCV1, val.type = "MSEP") # l = 8 gives the minimum CV

# The black curve is the CV loss, the dashed red curve is the adjCV loss, a
# bias-corrected version of the MSEP (not described in these notes)

# Selecting the number of components to retain by 10-fold Cross-Validation
# (k = 10 is the default)
modPcrCV10 <- pcr(Points ~ ., data = laligaRed2, scale = TRUE,
validation = "CV")
summary(modPcrCV10)
## Data:    X dimension: 20 12
##  Y dimension: 20 1
## Fit method: svdpc
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps  11 comps
## CV           18.57    8.519    9.126    8.781    8.090    9.180    8.457    9.362    8.020    8.095     8.840     11.52
## adjCV        18.57    8.456    9.013    8.646    7.941    9.071    8.196    9.089    7.765    7.859     8.525     11.04
##        12 comps
## CV        9.945
##
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## X         62.73    79.13    86.11    92.24    95.84    98.19    99.11    99.69    99.87    100.00       100    100.00
## Points    80.47    84.23    87.38    90.45    90.99    93.94    94.36    96.17    96.32     96.84        97     97.95
validationplot(modPcrCV10, val.type = "MSEP") # l = 8 gives the minimum CV

pcr() does an internal scaling of the predictors by their quasi-standard deviations, that is, using the factor $$\frac{1}{n-1}$$ in the variance, whereas princomp applies a scaling based on the standard deviations, computed with the factor $$\frac{1}{n}$$. This results in a minor discrepancy between the scores objects of both methods that is easily patched: the scores of princomp() are the ones of pcr() multiplied by $$\sqrt{\frac{n}{n-1}}$$. The discrepancy is inherited by the coefficients, which assume scores scaled by the quasi-standard deviations. Therefore, the $$\hat{\boldsymbol{\gamma}}$$ coefficients described in (3.12) are obtained by dividing the coefficients of pcr() by $$\sqrt{\frac{n}{n-1}}$$.
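The factor relating the two scalings can be checked numerically. The following Python snippet (illustrative only; `sd_n` and `sd_n1` are ad hoc helpers, not part of any package) verifies that the quasi-standard deviation equals the standard deviation times $$\sqrt{\frac{n}{n-1}}$$:

```python
# Standard deviation (divisor n, as princomp uses) versus quasi-standard
# deviation (divisor n - 1, as pcr's internal scaling uses).
def sd_n(x):
    n = len(x)
    m = sum(x) / n
    return (sum((xi - m) ** 2 for xi in x) / n) ** 0.5

def sd_n1(x):
    n = len(x)
    m = sum(x) / n
    return (sum((xi - m) ** 2 for xi in x) / (n - 1)) ** 0.5

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)
ratio = sd_n1(x) / sd_n(x)
print(ratio, (n / (n - 1)) ** 0.5)  # the two coincide
```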

The next chunk of code illustrates the previous warning.

# Equality of loadings from princomp() and pcr()
n <- nrow(laliga) # Sample size, needed for the scaling factors below
max(abs(abs(pcaLaligaRed$loadings[, 1:3]) - abs(modPcr$loadings[, 1:3])))
## [1] 1.804112e-15

# Equality of scores from princomp() and pcr() (with the same standardization)
max(abs(abs(pcaLaligaRed$scores[, 1:3]) - abs(modPcr$scores[, 1:3] * sqrt(n / (n - 1)))))
## [1] 5.551115e-15

# Equality of the gamma coefficients obtained previously for 3 PCA
# (with the same standardization)
modPcr$coefficients[, , 3] / sqrt(n / (n - 1))
##              Goals.scored            Goals.conceded   Percentage.scored.goals Percentage.conceded.goals
##                 1.7305124                -3.4048178                 1.7416378                -3.3944235
##                     Shots             Shots.on.goal          Penalties.scored               Assistances
##                 0.2347716                 0.8782162                 2.6044699                 1.4548813
##                Fouls.made              Yellow.cards                 Red.cards                  Offsides
##                -0.1171732                -1.7826488                -2.6423211                 1.3755697
gamma[-1]
##  [1]  1.7305124 -3.4048178  1.7416378 -3.3944235  0.2347716  0.8782162  2.6044699  1.4548813 -0.1171732 -1.7826488
## [11] -2.6423211  1.3755697

# Coefficients associated to the principal components - same as in modPCA3
lm(Points ~ ., data = data.frame("Points" = laliga$Points,
                                 modPcr$scores[, 1:3] * sqrt(n / (n - 1))))
##
## Call:
## lm(formula = Points ~ ., data = data.frame(Points = laliga$Points,
##     modPcr$scores[, 1:3] * sqrt(n/(n - 1))))
##
## Coefficients:
## (Intercept)       Comp.1       Comp.2       Comp.3
##      52.400       -5.769        2.438       -3.422

modPCA3
##
## Call:
## lm(formula = Points ~ Comp.1 + Comp.2 + Comp.3, data = laligaPCA)
##
## Coefficients:
## (Intercept)       Comp.1       Comp.2       Comp.3
##      52.400        5.769        2.438        3.422

# Of course, flipping of signs is always possible with PCA

The selection of $$l$$ by cross-validation attempts to minimize the Mean Squared Error in Prediction (MSEP) or, equivalently, the Root MSEP (RMSEP), of the model86. This is a jackknife method valid for the selection of any tuning parameter $$\lambda$$ that affects the form of the estimate $$\hat m_\lambda$$ of the regression function $$m$$ (remember (1.1)). Given the sample $$\{(\mathbf{X}_i,Y_i)\}_{i=1}^n$$, leave-one-out cross-validation considers the tuning parameter

\begin{align}
\hat{\lambda}_\mathrm{CV}:=\arg\min_\lambda \sum_{i=1}^n (Y_i-\hat m_{\lambda,-i}(\mathbf{X}_i))^2,\tag{3.13}
\end{align}

where $$\hat m_{\lambda,-i}$$ represents the fit of the model $$\hat m_\lambda$$ without the $$i$$-th observation $$(\mathbf{X}_i,Y_i)$$. A less computationally expensive variation on leave-one-out cross-validation is $$k$$-fold cross-validation, which partitions the data into $$k$$ folds $$F_1,\ldots, F_k$$ of approximately equal size, trains the model $$\hat m_\lambda$$ on the aggregation of $$k-1$$ folds, and evaluates its MSEP in the remaining fold:

\begin{align*}
\hat{\lambda}_{k\mathrm{-CV}}:=\arg\min_\lambda \sum_{j=1}^k \sum_{i\in F_j} (Y_i-\hat m_{\lambda,-F_{j}}(\mathbf{X}_i))^2,
\end{align*}

where $$\hat m_{\lambda,-F_{j}}$$ represents the fit of the model $$\hat m_\lambda$$ excluding the data from the $$j$$-th fold $$F_{j}$$. Recall that $$k$$-fold cross-validation is more general than leave-one-out cross-validation, since the latter is a particular case of the former with $$k=n$$.
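The leave-one-out criterion (3.13) is easy to mimic outside R. The following Python sketch (a toy illustration, unrelated to the pls package; `fit_predict` and `loo_cv_error` are made-up helpers) selects between an intercept-only fit and a simple linear fit by minimizing the sum of squared leave-one-out errors:

```python
# LOO-CV in the spirit of (3.13): the "tuning parameter" here is simply
# whether the fitted model includes a slope.
def fit_predict(xs, ys, x0, use_slope):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    if not use_slope:
        return my
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my + b * (x0 - mx)

def loo_cv_error(xs, ys, use_slope):
    # Sum of squared errors, each prediction made without the i-th datum
    err = 0.0
    for i in range(len(xs)):
        xs_i = xs[:i] + xs[i + 1:]
        ys_i = ys[:i] + ys[i + 1:]
        err += (ys[i] - fit_predict(xs_i, ys_i, xs[i], use_slope)) ** 2
    return err

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]   # clearly linear toy data
best = min([False, True], key=lambda s: loo_cv_error(xs, ys, s))
print(best)  # True: the model with slope wins
```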
$$k$$-fold cross-validation with $$k<n$$ has an inconvenience to be aware of: it depends on the choice of the folds used to split the data (i.e., on how each datum is assigned to each of the $$k$$ folds). Some statistical software does this assignment randomly, which means that the selection of $$\hat{\lambda}_{k\mathrm{-CV}}$$ may vary from one run to another. Thus, fixing the seed prior to the parameter selection is important to ensure reproducibility. Another alternative is to aggregate, in a convenient way, the results of several $$k$$-fold cross-validations done on different random partitions. Notice that this problem is not present in leave-one-out cross-validation (where $$k=n$$).

Inference in PCR can be carried out as it was done for the standard linear model. The key point is to realize that the inference is about the coefficients $$\boldsymbol{\alpha}_{0:l}$$ associated with the $$l$$ principal components, and that it can be carried out with the summary() function on the output of lm(). The coefficients $$\boldsymbol{\alpha}_{0:l}$$ are the ones estimated by least squares when considering the scores of the $$l$$ principal components as the predictors. This transformation of the predictors does not affect the validity of the inferential results of Section 2.4 (they assumed that the predictors were deterministic). But recall that the inference is not about $$\boldsymbol{\gamma}$$.

Inference in PCR is based on the assumptions of the linear model being satisfied for the principal components we are considering. The evaluation of the assumptions can be done using the exact same tools described in Section 3.5. However, keep in mind that PCA is a linear transformation of the data, and therefore:

• If the linearity assumption fails for the predictors $$X_1,\ldots,X_p$$, then it will likely fail for $$\mathrm{PC}_1,\ldots,\mathrm{PC}_l$$, since the transformation will not introduce nonlinearities able to capture the nonlinear effects.
• Similarly, if the homoscedasticity, normality, or independence assumptions fail for $$X_1,\ldots,X_p$$, then they will likely fail for $$\mathrm{PC}_1,\ldots,\mathrm{PC}_l$$.

Exceptions to these common implications are possible, and may involve the association of one or several problematic predictors (e.g., ones with nonlinear effects on the response) with the principal components that are excluded from the model. Up to what extent the failure of the assumptions in the original predictors can be mitigated by PCR depends on each application.

### 3.6.3 Partial least squares

PCR works by replacing the predictors $$X_1,\ldots,X_p$$ by a set of principal components $$\mathrm{PC}_1,\ldots,\mathrm{PC}_l$$, under the hope that these directions, which explain most of the variability of the predictors, are also the best directions for predicting the response $$Y$$. While this is a reasonable belief, it is not a guaranteed fact. Partial Least Squares (PLS) precisely tackles this point with the idea of regressing the response $$Y$$ on a set of new variables $$\mathrm{PLS}_1,\ldots,\mathrm{PLS}_l$$ that are constructed with the objective of predicting $$Y$$ from $$X_1,\ldots,X_p$$ in the best possible way.

As with PCA, the idea is to find linear combinations of the predictors $$X_1,\ldots,X_p$$, that is, to have:

\begin{align*}
\mathrm{PLS}_1:=\sum_{j=1}^pa_{j1}X_j,\,\ldots,\,\mathrm{PLS}_p:=\sum_{j=1}^pa_{jp}X_j.
\end{align*}

The question is how to choose the coefficients $$a_{jk}$$, $$j,k=1,\ldots,p$$. PLS does it by placing the most weight on the predictors that are most strongly correlated with $$Y$$, and in such a way that the resulting $$\mathrm{PLS}_1,\ldots,\mathrm{PLS}_p$$ are uncorrelated.
After standardizing the variables, for $$\mathrm{PLS}_1$$ this is achieved by setting $$a_{j1}$$ equal to the theoretical slope coefficient of regressing $$Y$$ on $$X_j$$, that is87:

\begin{align*}
a_{j1}:=\beta_{j1}:=\frac{\mathrm{Cov}[X_j,Y]}{\mathbb{V}\mathrm{ar}[X_j]},
\end{align*}

where $$\beta_{j1}$$ stems from $$Y=\beta_{0}+\beta_{j1}X_j+\varepsilon$$.

The second partial least squares direction, $$\mathrm{PLS}_2$$, is computed in a similar way, but once the linear effects of $$\mathrm{PLS}_1$$ on $$X_1,\ldots,X_p$$ are removed. This is achieved by:

1. Regressing each of $$X_1,\ldots,X_p$$ on $$\mathrm{PLS}_1$$. That is, fitting the $$p$$ simple linear models

\begin{align*}
X_j=\beta_0+\beta_{j1}\mathrm{PLS}_1+\varepsilon_j,\quad j=1,\ldots,p.
\end{align*}

2. Regressing $$Y$$ on each of the $$p$$ random errors $$\varepsilon_{1},\ldots,\varepsilon_{p}\,$$88 from the above regressions, yielding

\begin{align*}
a_{j2}:=\beta_{j2}:=\frac{\mathrm{Cov}[\varepsilon_j,Y]}{\mathbb{V}\mathrm{ar}[\varepsilon_j]},
\end{align*}

where $$\beta_{j2}$$ stems from $$Y=\beta_{0}+\beta_{j2}\varepsilon_j+\varepsilon$$.

If the process is iterated, the coefficients for $$\mathrm{PLS}_j$$, $$j>2$$, can be computed89. Once $$\mathrm{PLS}_1,\ldots,\mathrm{PLS}_l$$ are obtained, PLS proceeds as PCR and fits the model

\begin{align*}
Y=\beta_0+\beta_1\mathrm{PLS}_1+\ldots+\beta_l\mathrm{PLS}_l+\varepsilon.
\end{align*}

The implementation of PLS can be done with the function pls::plsr, which has a syntax analogous to that of pls::pcr.
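The first PLS direction can be sketched in a few lines of Python (a toy illustration of the slope formula above, not the kernelpls algorithm that pls::plsr actually uses; `cov` and `pls1_weights` are made-up helper names):

```python
# First PLS direction: each weight a_{j1} is the slope of the simple
# regression of Y on X_j, i.e., Cov[X_j, Y] / Var[X_j].
def cov(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def pls1_weights(X_cols, y):
    # X_cols: list of predictor columns; returns one weight per predictor
    return [cov(xj, y) / cov(xj, xj) for xj in X_cols]

x1 = [-1.0, 0.0, 1.0, 2.0]
x2 = [ 2.0, 1.0, 0.0, -1.0]
y  = [ 0.5, 1.0, 2.5, 3.0]
w = pls1_weights([x1, x2], y)
print(w)  # x1 weighted positively, x2 negatively
```

Predictors strongly correlated with $$Y$$ receive large (absolute) weights, which is exactly the construction principle stated above.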
# Simple call to plsr - very similar to pcr
modPls <- plsr(Points ~ ., data = laligaRed2, scale = TRUE)

# The summary of the model
summary(modPls)
## Data:    X dimension: 20 12
##  Y dimension: 20 1
## Fit method: kernelpls
## Number of components considered: 12
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## X         62.53    76.07    84.94    88.29    92.72    95.61    98.72    99.41    99.83    100.00    100.00    100.00
## Points    83.76    91.06    93.93    95.43    95.94    96.44    96.55    96.73    96.84     96.84     97.69     97.95
# First row: percentage of variance explained of the predictors
# Second row: percentage of variance explained of Y (the R^2)
# Note we have the same R^2 for 12 components as in the linear model

# Slots of information in the model
names(modPls)
##  [1] "coefficients"    "scores"          "loadings"        "loading.weights" "Yscores"         "Yloadings"
##  [7] "projection"      "Xmeans"          "Ymeans"          "fitted.values"   "residuals"       "Xvar"
## [13] "Xtotvar"         "fit.time"        "ncomp"           "method"          "scale"           "call"
## [19] "terms"           "model"

# PLS scores
head(modPls$scores)
## [1]  7.36407868  6.70659816  2.45577740  0.06913485  0.96513341 -0.39588401

# Also uncorrelated
head(cov(modPls$scores))
##                Comp 1        Comp 2        Comp 3        Comp 4        Comp 5        Comp 6        Comp 7        Comp 8
## Comp 1   7.393810e+00  1.704822e-16  1.667417e-16 -1.716470e-16 -1.788249e-16 -8.227618e-16 -4.380633e-16 -1.202580e-16
## Comp 2   1.704822e-16  1.267859e+00  1.815810e-16  1.695817e-16  1.084887e-16  9.991690e-17  7.754899e-18  3.451971e-17
## Comp 3   1.667417e-16  1.815810e-16  9.021126e-01 -1.412726e-17 -7.389551e-17 -1.422399e-16 -3.699555e-17 -4.991681e-17
## Comp 4  -1.716470e-16  1.695817e-16 -1.412726e-17  3.130984e-01  1.130620e-17  1.088368e-17 -1.278556e-17  2.547340e-18
## Comp 5  -1.788249e-16  1.084887e-16 -7.389551e-17  1.130620e-17  2.586132e-01  8.607103e-18 -2.416915e-17 -1.698417e-17
## Comp 6  -8.227618e-16  9.991690e-17 -1.422399e-16  1.088368e-17  8.607103e-18  2.792408e-01 -2.305428e-17 -7.131527e-18
##                Comp 9       Comp 10       Comp 11       Comp 12
## Comp 1  -3.291438e-16 -3.606813e-16 -2.165469e-16 -3.535228e-16
## Comp 2   9.691056e-17  1.087954e-16  2.611563e-17 -2.943533e-17
## Comp 3  -1.134240e-17  7.717273e-18 -1.008660e-17  3.430733e-17
## Comp 4  -2.368046e-17 -1.480284e-17 -1.227829e-17 -1.358123e-17
## Comp 5  -3.565028e-17  2.589496e-17  8.527293e-18  9.912266e-18
## Comp 6  -1.579360e-17 -2.849998e-18  1.062075e-17 -2.612850e-18

# The coefficients of the original predictors, not of the components!
modPls$coefficients[, , 2]
##              Goals.scored            Goals.conceded   Percentage.scored.goals Percentage.conceded.goals
##                 1.8192870                -4.4038213                 1.8314760                -4.4045722
##                     Shots             Shots.on.goal          Penalties.scored               Assistances
##                 0.4010902                 0.9369002                -0.2006251                 2.3688050
##                Fouls.made              Yellow.cards                 Red.cards                  Offsides
##                 0.2807601                -1.6677725                -2.4952503                 1.2187529

# Obtaining the coefficients of the PLS components
lm(formula = Points ~., data = data.frame("Points" = laliga$Points, modPls$scores[, 1:3]))
##
## Call:
## lm(formula = Points ~ ., data = data.frame(Points = laliga$Points,
##     modPls$scores[, 1:3]))
##
## Coefficients:
## (Intercept)       Comp.1       Comp.2       Comp.3
##      52.400        6.093        4.341        3.232

# Prediction
predict(modPls, newdata = laligaRed2[1:2, ], ncomp = 12)
## , , 12 comps
##
##               Points
## Barcelona   92.01244

# Selecting the number of components to retain
modPls2 <- plsr(Points ~ ., data = laligaRed2, scale = TRUE, ncomp = 2)
summary(modPls2)
## Data:    X dimension: 20 12
##  Y dimension: 20 1
## Fit method: kernelpls
## Number of components considered: 2
## TRAINING: % variance explained
##         1 comps  2 comps
## X         62.53    76.07
## Points    83.76    91.06

# Selecting the number of components to retain by Leave-One-Out cross-validation
modPlsCV1 <- plsr(Points ~ ., data = laligaRed2, scale = TRUE,
validation = "LOO")
summary(modPlsCV1)
## Data:    X dimension: 20 12
##  Y dimension: 20 1
## Fit method: kernelpls
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 20 leave-one-out segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps  11 comps
## CV           18.57    8.307    7.221    6.807    6.254    6.604    6.572    6.854    7.348    7.548     7.532     7.854
## adjCV        18.57    8.282    7.179    6.742    6.193    6.541    6.490    6.764    7.244    7.430     7.416     7.717
##        12 comps
## CV        7.905
##
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## X         62.53    76.07    84.94    88.29    92.72    95.61    98.72    99.41    99.83    100.00    100.00    100.00
## Points    83.76    91.06    93.93    95.43    95.94    96.44    96.55    96.73    96.84     96.84     97.69     97.95

# View cross-validation Mean Squared Error Prediction
validationplot(modPlsCV1, val.type = "MSEP") # l = 4 gives the minimum CV


# Selecting the number of components to retain by 10-fold Cross-Validation
# (k = 10 is the default)
modPlsCV10 <- plsr(Points ~ ., data = laligaRed2, scale = TRUE,
validation = "CV")
summary(modPlsCV10)
## Data:    X dimension: 20 12
##  Y dimension: 20 1
## Fit method: kernelpls
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps  11 comps
## CV           18.57    8.376    7.389    6.906    6.280    7.040    7.360    7.635    8.514    8.959     9.363     9.382
## adjCV        18.57    8.324    7.299    6.769    6.165    6.885    7.149    7.421    8.240    8.649     9.027     9.020
##        12 comps
## CV        9.304
##
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## X         62.53    76.07    84.94    88.29    92.72    95.61    98.72    99.41    99.83    100.00    100.00    100.00
## Points    83.76    91.06    93.93    95.43    95.94    96.44    96.55    96.73    96.84     96.84     97.69     97.95
validationplot(modPlsCV10, val.type = "MSEP")

# l = 4 is close to the minimum CV

# Regress manually Points in the scores, in order to have an lm object
# Create a new dataset with the response + PLS components
laligaPLS <- data.frame("Points" = laliga$Points, cbind(modPls$scores))

# Regression on all principal components
modPLS <- lm(Points ~ Comp.1 + Comp.2, data = laligaPLS)
summary(modPLS) # Predictors clearly significant - same R^2 as in modPls2
##
## Call:
## lm(formula = Points ~ Comp.1 + Comp.2, data = laligaPLS)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.3565 -3.6157  0.4508  2.3288 12.3116
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  52.4000     1.2799  40.941  < 2e-16 ***
## Comp.1        6.0933     0.4829  12.618 4.65e-10 ***
## Comp.2        4.3413     1.1662   3.723  0.00169 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.724 on 17 degrees of freedom
## Multiple R-squared:  0.9106, Adjusted R-squared:    0.9
## F-statistic: 86.53 on 2 and 17 DF,  p-value: 1.225e-09
car::vif(modPLS) # No problems at all
## Comp.1 Comp.2
##      1      1

Let’s perform PCR and PLS on the iris dataset. Recall that this dataset contains a factor variable, Species, which cannot be treated directly by princomp but can be used in PCR/PLS (it is internally transformed into two dummy variables). Do the following:

• Compute the PCA of iris, excluding Species. What is the percentage of variability explained with two components?
• Draw the biplot and look for interpretations of the principal components.
• Plot the PCA scores in a scatterplotMatrix plot such that the scores are coloured by the levels in Species (you can use the groups argument).
• Compute the PCR and PLS regressions of Petal.Width with $$l=2$$ (exclude Species). Inspect by CV the best selection of $$l$$ for each regression.
• Plot the PLS scores of the data in a scatterplotMatrix plot such that the scores are coloured by the levels in Species. Compare with the PCA scores. (A quick way of converting the scores to a matrix is with rbind.)
For regression, PLS makes a cleverer selection of the dimension-reduction directions than PCA. However, in practice PCR and PLS tend to perform similarly, since PLS potentially increases the variance of the directions with respect to PCR.
Inference for PLS is more involved since, unlike in PCR, the PLS directions depend on the response $$Y$$. This directly breaks a core assumption made in Section 2.4: that the randomness of the regression model comes only from the error terms and not from the predictors. We do not go into details on how to perform inference for PLS in these notes.

1. Precisely, the size of the set is $$2^{p+1}$$.

2. That is, $$\mathbb{E}[X_j]=0$$, $$j=1,\ldots,p$$. This is important since PCA is sensitive to the centering of the data.

3. Recall that the covariance matrix is a real, symmetric, semi-positive definite matrix.

4. Therefore, $$\mathbf{A}^{-1}=\mathbf{A}'$$.

5. Since the variance-covariance matrix $$\mathbb{V}\mathrm{ar}[\mathbf{PC}]$$ is diagonal.

6. Up to now, the exposition has been focused exclusively on the population case.

7. Some care is needed here. The matrix $$\mathbf{S}$$ is obtained from linear combinations of the $$n$$ vectors $$\mathbf{X}_1-\bar{\mathbf{X}},\ldots,\mathbf{X}_n-\bar{\mathbf{X}}$$. Recall that these $$n$$ vectors are not linearly independent, as they are guaranteed to add to $$\mathbf{0}$$, $$\sum_{i=1}^n[\mathbf{X}_i-\bar{\mathbf{X}}]=\mathbf{0}$$, so it is possible to express any one of them perfectly from the rest. That implies that the $$p\times p$$ matrix $$\mathbf{S}$$ has rank smaller than or equal to $$n-1$$. If $$p\leq n-1$$, then the matrix has full rank $$p$$ and it is invertible. But if $$p\geq n$$, then $$\mathbf{S}$$ is singular and, as a consequence, $$\lambda_j=0$$ for $$j\geq n$$. This implies that the principal components for those eigenvalues cannot be determined uniquely.

8. Therefore, a sample of lengths measured in centimeters will have a variance $$10^4$$ times larger than the same sample measured in meters – yet it is the same information!

9. Computed with biplot.princomp or with biplot applied to a princomp object.

10. For the sample version, replace the variable $$X_j$$ by its sample $$X_{1j},\ldots,X_{nj}$$, the weights $$a_{ji}$$ by their estimates $$\hat a_{ji}$$, and the principal component $$\mathrm{PC}_j$$ by the scores $$\hat{\mathrm{PC}}_{j1},\ldots,\hat{\mathrm{PC}}_{jn}$$.

11. The first two principal components of the dataset explain $$78\%$$ of the variability of the data. Hence, the insights obtained from the biplot should be regarded as “$$78\%$$-accurate”.

12. This does not need to be true, but it is often the case.

13. For the sake of simplicity, we consider in the exposition the first $$l$$ principal components, but obviously other combinations of $$l$$ principal components are possible.

14. Since we know by (3.7) that $$\mathbf{PC}=\mathbf{A}'(\mathbf{X}-\boldsymbol{\mu})$$.

15. For example, thanks to $$\hat{\boldsymbol{\gamma}}$$ we know that the estimated conditional response of $$Y$$ increases by $$\hat{\gamma}_j$$ units for a marginal unit increment in $$X_j$$.

16. If $$l=p$$, then PCR is equivalent to the least squares estimation.

17. Alternatively, the “in Prediction” part of the latter terms is dropped and they are just referred to as the MSE and RMSE.

18. Recall (2.3) for the sample version.

19. They play the role of $$X_1,\ldots,X_p$$ in the computation of $$\mathrm{PLS}_1$$.

20. Of course, in practice, the computations need to be done in terms of the sample.