## 3.6 Dimension reduction techniques

As we have seen in Section 3.2, the selection of the best linear model from a set of \(p\) predictors is a challenging task that increases with the *dimension* of the problem, that is, with \(p\). In addition to the growth of the set of possible models^{70} as \(p\) grows, the model space becomes more and more complicated to explore due to the potential multicollinearity among the predictors. We will see in this section two methods to deal with these two problems simultaneously.

### 3.6.1 Review on principal component analysis

Principal Component Analysis (PCA) is a multivariate technique designed to summarize the most important features and relations of \(p\) numerical random variables \(X_1,\ldots,X_p\). PCA computes a new set of variables, the **principal components** \(\mathrm{PC}_1,\ldots, \mathrm{PC}_p\), that contain the same information as \(X_1,\ldots,X_p\) but expressed in a more convenient way. The goal of PCA is to retain only a limited number \(1\leq l<p\) of principal components such that they explain most of the information and perform *dimension reduction*.

If \(X_1,\ldots,X_p\) are *centred*^{71}, then the principal components are *orthonormal linear combinations* of \(X_1,\ldots,X_p\):

where \(\mathbf{a}_j:=(a_{1j},\ldots,a_{pj})'\), \(\mathbf{X}:=(X_1,\ldots,X_p)'\), and the orthonormality condition is

\[\begin{align*} \mathbf{a}_i'\mathbf{a}_j=\begin{cases}1,&\text{if } i=j,\\0,&\text{if } i\neq j.\end{cases} \end{align*}\]Remarkably, PCA computes the principal components in an *ordered* way: \(\mathrm{PC}_1\) is the principal component that explains the most of the *information* (quantified as the variance) of \(X_1,\ldots,X_p\), and then the explained information decreases monotonically down to \(\mathrm{PC}_p\), the principal component that explains the least information. Precisely:

Mathematically, PCA reduces to compute the *spectral decomposition*^{72} of the covariance matrix \(\boldsymbol\Sigma:=\mathbb{V}\mathrm{ar}[\mathbf{X}]\):

where \(\boldsymbol\Lambda=\mathrm{diag}(\lambda_1,\ldots,\lambda_p)\) contains the eigenvalues of \(\boldsymbol\Sigma\) and \(\mathbf{A}\) is the orthogonal matrix^{73} that contains the unit-norm eigenvectors of \(\boldsymbol\Sigma\) as columns. The matrix \(\mathbf{A}\) gives, thus, the coefficients of the orthonormal linear combinations:

If the data is not centered, the computation of the principal components is done by first subtracting \(\boldsymbol\mu=\mathbb{E}[\mathbf{X}]\) and then premultiplying with \(\mathbf{A}'\):

\[\begin{align} \mathbf{PC}=\mathbf{A}'(\mathbf{X}-\boldsymbol{\mu}),\tag{3.7} \end{align}\]where \(\mathbf{PC}=(\mathrm{PC}_1,\ldots,\mathrm{PC}_p)'\), \(\mathbf{X}=(X_1,\ldots,X_p)'\), and \(\boldsymbol{\mu}=(\mu_1,\ldots,\mu_p)'\). Note that, because of (1.3) and (3.7),

\[\begin{align} \mathbb{V}\mathrm{ar}[\mathbf{PC}]=\mathbf{A}'\boldsymbol{\Sigma}\mathbf{A}=\mathbf{A}'\mathbf{A}\boldsymbol{\Lambda}\mathbf{A}'\mathbf{A}=\boldsymbol{\Lambda}, \end{align}\]therefore \(\mathbb{V}\mathrm{ar}[\mathrm{PC}_j]=\lambda_j\), \(j=1,\ldots,p\), and as a consequence (3.6) indeed holds.

Also, from (3.7), it is evident than we can express the random variables \(\mathbf{X}\) in terms of \(\mathbf{PC}\):

\[\begin{align} \mathbf{X}=\boldsymbol{\mu}+\mathbf{A}\mathbf{PC},\tag{3.8} \end{align}\]which admits an insightful interpretation: the \(\mathbf{PC}\) are *uncorrelated*^{74} variables that, once *rotated* by \(\mathbf{A}\) and *translated* to the location \(\boldsymbol{\mu}\), produce exactly \(\mathbf{X}\). So, \(\mathbf{PC}\) contains the same information as \(\mathbf{X}\) but rearranged in a more convenient way because the **principal components are centred and uncorrelated** between them:

Given the uncorrelation of \(\mathbf{PC}\), we can measure the *total variance* of \(\mathbf{PC}\) as \(\sum_{j=1}^p\mathbb{V}\mathrm{ar}[\mathrm{PC}_j]=\sum_{j=1}^p\lambda_j\). Consequently, we can can define the **proportion of variance explained** by the first \(l\) principal components, \(1\leq l\leq p\), as

In the *sample case*^{75} where a sample \(\mathbf{X}_1,\ldots,\mathbf{X}_n\) is given and \(\boldsymbol\mu\) and \(\boldsymbol\Sigma\) are unknown, \(\boldsymbol\mu\) is replaced by the sample mean \(\bar{\mathbf{X}}\) and \(\boldsymbol\Sigma\) by the sample variance-covariance matrix \(\mathbf{S}=\frac{1}{n}\sum_{i=1}^n(\mathbf{X}_i-\bar{\mathbf{X}})(\mathbf{X}_i-\bar{\mathbf{X}})'\). Then, the spectral decomposition of \(\mathbf{S}\) is computed^{76}. This gives \(\hat{\mathbf{A}}\) and produces the **scores** of the data:

The scores are *centred*, *uncorrelated*, and have sample variances in each vector’s entry that are *sorted* in a decreasing way.

The maximum number of principal components that can be determined from a sample \(\mathbf{X}_1,\ldots,\mathbf{X}_n\) is \(\min(n-1,p)\), assuming that the matrix formed by \(\left(\mathbf{X}_1 \cdots \mathbf{X}_n\right)\) is of full rank (*i.e.*, if the rank is \(\min(n,p)\)). If \(n\geq p\) and the variables \(X_1,\ldots,X_p\) are such that only \(r\) of them are linearly independent, then the maximum is \(\min(n-1,r)\).

Let’s see an example of these concepts in *La Liga 2015/2016* dataset. It contains the standings and team statistics for La Liga 2015/2016.

```
laliga <- readxl::read_excel("la-liga-2015-2016.xlsx", sheet = 1, col_names = TRUE)
laliga <- as.data.frame(laliga) # Avoid tibble since it drops row.names
```

A quick preprocessing gives:

```
rownames(laliga) <- laliga$Team # Set teams as case names to avoid factors
laliga$Team <- NULL
laliga <- laliga[, -c(2, 8)] # Do not add irrelevant information
summary(laliga)
## Points Wins Draws Loses Goals.scored Goals.conceded
## Min. :32.00 Min. : 8.00 Min. : 4.00 Min. : 4.00 Min. : 34.00 Min. :18.00
## 1st Qu.:41.25 1st Qu.:10.00 1st Qu.: 8.00 1st Qu.:12.00 1st Qu.: 40.00 1st Qu.:42.50
## Median :44.50 Median :12.00 Median : 9.00 Median :15.50 Median : 45.50 Median :52.50
## Mean :52.40 Mean :14.40 Mean : 9.20 Mean :14.40 Mean : 52.15 Mean :52.15
## 3rd Qu.:60.50 3rd Qu.:17.25 3rd Qu.:10.25 3rd Qu.:18.25 3rd Qu.: 51.25 3rd Qu.:63.25
## Max. :91.00 Max. :29.00 Max. :18.00 Max. :22.00 Max. :112.00 Max. :74.00
## Percentage.scored.goals Percentage.conceded.goals Shots Shots.on.goal Penalties.scored Assistances
## Min. :0.890 Min. :0.470 Min. :346.0 Min. :129.0 Min. : 1.00 Min. :23.00
## 1st Qu.:1.050 1st Qu.:1.115 1st Qu.:413.8 1st Qu.:151.2 1st Qu.: 1.00 1st Qu.:28.50
## Median :1.195 Median :1.380 Median :438.0 Median :165.0 Median : 3.00 Median :32.50
## Mean :1.371 Mean :1.371 Mean :452.4 Mean :173.1 Mean : 3.45 Mean :37.85
## 3rd Qu.:1.347 3rd Qu.:1.663 3rd Qu.:455.5 3rd Qu.:180.0 3rd Qu.: 4.50 3rd Qu.:36.75
## Max. :2.950 Max. :1.950 Max. :712.0 Max. :299.0 Max. :11.00 Max. :90.00
## Fouls.made Matches.without.conceding Yellow.cards Red.cards Offsides
## Min. :385.0 Min. : 4.0 Min. : 66.0 Min. :1.00 Min. : 72.00
## 1st Qu.:483.8 1st Qu.: 7.0 1st Qu.: 97.0 1st Qu.:4.00 1st Qu.: 83.25
## Median :530.5 Median :10.5 Median :108.5 Median :5.00 Median : 88.00
## Mean :517.6 Mean :10.7 Mean :106.2 Mean :5.05 Mean : 92.60
## 3rd Qu.:552.8 3rd Qu.:13.0 3rd Qu.:115.2 3rd Qu.:6.00 3rd Qu.:103.75
## Max. :654.0 Max. :24.0 Max. :141.0 Max. :9.00 Max. :123.00
```

Let’s check that R’s function for PCA, `princomp`

, returns the same principal components we outlined in the theory.

```
# PCA
pcaLaliga <- princomp(laliga)
summary(pcaLaliga)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation 104.7782561 48.5461449 22.13337511 12.66692413 8.234215354 7.83426116 6.068864168 4.137079559
## Proportion of Variance 0.7743008 0.1662175 0.03455116 0.01131644 0.004782025 0.00432876 0.002597659 0.001207133
## Cumulative Proportion 0.7743008 0.9405183 0.97506949 0.98638593 0.991167955 0.99549671 0.998094374 0.999301507
## Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## Standard deviation 2.0112480391 1.8580509157 1.126111e+00 9.568824e-01 4.716064e-01 1.707105e-03 8.365534e-04
## Proportion of Variance 0.0002852979 0.0002434908 8.943961e-05 6.457799e-05 1.568652e-05 2.055361e-10 4.935768e-11
## Cumulative Proportion 0.9995868048 0.9998302956 9.999197e-01 9.999843e-01 1.000000e+00 1.000000e+00 1.000000e+00
## Comp.16 Comp.17
## Standard deviation 5.867584e-07 0
## Proportion of Variance 2.428208e-17 0
## Cumulative Proportion 1.000000e+00 1
# The standard deviations are the square roots of the eigenvalues
# The cumulative proportion of variance explained accumulates the
# variance explained starting at the first component
# Plot of variances of each component (screeplot)
plot(pcaLaliga, type = "l")
```

```
# Useful for detecting an "elbow" in the graph whose location gives the
# "right" number of components to retain. Ideally, this elbow appears
# when the next variances are almost similar and notably smaller when
# compared with the previous
# Alternatively: plot of the cumulated percentage of variance
# barplot(cumsum(pcaLaliga$sdev^2) / sum(pcaLaliga$sdev^2))
# Computation of PCA from the spectral decomposition
n <- nrow(laliga)
eig <- eigen(cov(laliga) * (n - 1) / n) # By default, cov() computes the
# quasi-variance-covariance matrix that divides by n - 1, but PCA and princomp
# consider the sample variance-covariance matrix that divides by n
A <- eig$vectors
# Same eigenvalues
pcaLaliga$sdev^2 - eig$values
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## -1.818989e-12 2.728484e-12 2.273737e-13 -2.273737e-13 5.684342e-14 3.552714e-14 4.263256e-14 -3.552714e-14
## Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15 Comp.16
## 1.234568e-13 5.373479e-14 1.199041e-14 1.054712e-14 1.049161e-14 -3.709614e-15 -2.191892e-15 4.476020e-13
## Comp.17
## 2.048814e-12
# The eigenvectors (the a_j vectors) are the column vectors in $loadings
pcaLaliga$loadings
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10 Comp.11 Comp.12
## Points -0.125 0.497 0.195 0.139 -0.340 0.425 0.379 0.129 0.166
## Wins 0.184 -0.175 0.134 0.198 0.139
## Draws 0.101 0.186 -0.608 0.175 0.185 -0.251
## Loses -0.129 -0.114 -0.157 0.410 -0.243 -0.166 0.112
## Goals.scored -0.181 0.251 -0.186 -0.169 -0.399 0.335 -0.603 0.155 0.129 0.289 -0.230
## Goals.conceded -0.471 -0.493 -0.277 -0.257 0.280 0.441 0.118 0.297
## Percentage.scored.goals
## Percentage.conceded.goals
## Shots -0.718 -0.442 -0.342 0.255 0.241 0.188
## Shots.on.goal -0.386 -0.213 0.182 -0.287 -0.532 0.163 -0.599
## Penalties.scored -0.350 -0.258 0.378 -0.661 0.456
## Assistances -0.148 0.198 -0.173 -0.362 0.216 0.215 -0.356 -0.685 -0.265
## Fouls.made 0.480 -0.844 0.166 0.110
## Matches.without.conceding 0.151 0.129 -0.182 0.176 0.369 -0.376 -0.411
## Yellow.cards 0.141 -0.144 -0.363 0.113 0.225 -0.637 -0.550 -0.126 -0.156
## Red.cards 0.123 -0.157 0.405 0.666
## Offsides -0.108 0.202 -0.696 0.647 -0.106
## Comp.13 Comp.14 Comp.15 Comp.16 Comp.17
## Points 0.138 0.278 -0.315
## Wins 0.147 0.907
## Draws -0.304 0.526 0.277
## Loses 0.156 0.803
## Goals.scored -0.153
## Goals.conceded
## Percentage.scored.goals 0.760 -0.650
## Percentage.conceded.goals -0.650 -0.760
## Shots
## Shots.on.goal
## Penalties.scored -0.114
## Assistances -0.102
## Fouls.made
## Matches.without.conceding -0.664
## Yellow.cards
## Red.cards -0.587
## Offsides
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059 0.059
## Cumulative Var 0.059 0.118 0.176 0.235 0.294 0.353 0.412 0.471 0.529 0.588 0.647 0.706 0.765 0.824
## Comp.15 Comp.16 Comp.17
## SS loadings 1.000 1.000 1.000
## Proportion Var 0.059 0.059 0.059
## Cumulative Var 0.882 0.941 1.000
# The scores is the representation of the data in the principal components -
# it has the same information as laliga
head(pcaLaliga$scores)
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## Barcelona -242.23916 21.581016 25.380103 -17.4375054 -7.1797218 -9.0814106 -5.5920449 -7.3615386 0.3715688
## Real Madrid -313.60263 -63.202402 -8.756998 8.5557906 0.7119181 -0.2221379 6.7034894 2.4455971 -1.8388132
## Atlético Madrid -45.99393 0.646609 38.540964 31.3888316 3.9162812 -3.2904137 0.2431925 5.0912667 3.0444029
## Villarreal 96.22013 42.932869 50.003639 -11.2420481 10.4732634 -2.4293930 -3.0183049 0.1958417 -1.2106025
## Athletic -14.51728 16.189672 18.884019 -0.4122161 -5.6491352 6.9329640 8.0652665 2.4783231 2.6920566
## Celta 13.07483 -6.792525 5.227118 -9.0945489 6.1264750 -11.8794638 -2.6148154 6.9706627 -3.0825781
## Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15 Comp.16 Comp.17
## Barcelona 1.7160752 0.0264937 -1.0948280 0.15160351 0.0010244179 -0.0002177801 -1.047044e-12 7.623799e-12
## Real Madrid -2.9660592 -0.1344557 0.3409538 -0.03316355 -0.0014744734 0.0003891502 1.587123e-12 -1.215672e-11
## Atlético Madrid 2.0974553 0.6771343 -0.3985625 -0.18088616 -0.0004984680 -0.0001116115 6.252046e-13 -1.090230e-12
## Villarreal -1.7453143 0.1350586 -0.5735057 -0.58936061 0.0001067413 -0.0002431758 3.319980e-13 3.406707e-12
## Athletic 0.8950389 0.1542005 1.4714997 0.11090055 0.0031346887 -0.0002557828 -3.696130e-12 1.895661e-11
## Celta 0.3129865 0.0859623 1.9159241 0.37219921 0.0017960697 0.0002911046 -2.150767e-12 5.513489e-12
# Uncorrelated
corrplot::corrplot(cor(pcaLaliga$scores), addCoef.col = "gray")
```

```
# Caution! What happened in the last columns? What happened is that the
# variance for the last principal components is close to zero (because there
# are linear dependencies on the variables; e.g. Points, Wins, Loses, Draws),
# so the computation of the correlation matrix becomes unstable for those
# variables (a 0/0 division takes place)
# Better to inspect the covariance matrix
corrplot::corrplot(cov(pcaLaliga$scores), addCoef.col = "gray", is.corr = FALSE)
```

```
# The scores are A' * (X_i - mu). We center the data with scale()
# and then multiply each row by A'
scores <- scale(laliga, center = TRUE, scale = FALSE) %*% A
# Same as (but this is much slower)
# scores <- t(apply(scale(laliga, center = TRUE, scale = FALSE), 1,
# function(x) t(A) %*% x))
# Same scores (up to possible changes in signs)
max(abs(abs(pcaLaliga$scores) - abs(scores)))
## [1] 1.427989e-11
# Reconstruct the data from all the principal components
head(
sweep(pcaLaliga$scores %*% t(pcaLaliga$loadings), 2, pcaLaliga$center, "+")
)
## Points Wins Draws Loses Goals.scored Goals.conceded Percentage.scored.goals Percentage.conceded.goals
## Barcelona 91 29 4 5 112 29 2.95 0.76
## Real Madrid 90 28 6 4 110 34 2.89 0.89
## Atlético Madrid 88 28 4 6 63 18 1.66 0.47
## Villarreal 64 18 10 10 44 35 1.16 0.92
## Athletic 62 18 8 12 58 45 1.53 1.18
## Celta 60 17 9 12 51 59 1.34 1.55
## Shots Shots.on.goal Penalties.scored Assistances Fouls.made Matches.without.conceding Yellow.cards
## Barcelona 600 277 11 79 385 18 66
## Real Madrid 712 299 6 90 420 14 72
## Atlético Madrid 481 186 1 49 503 24 91
## Villarreal 346 135 3 32 534 17 100
## Athletic 450 178 3 42 502 13 84
## Celta 442 170 4 43 528 10 116
## Red.cards Offsides
## Barcelona 1 120
## Real Madrid 5 114
## Atlético Madrid 3 84
## Villarreal 4 106
## Athletic 5 92
## Celta 6 103
```

An important issue when doing PCA is the **scale** of the variables, since the variance depends on the units in which the variable is measured^{77}. Therefore, when variables with different ranges are mixed, the variability of one may dominate the other as an artifact of the scale. To prevent this, we **standardize** the dataset prior to do a PCA.

```
# Use cor = TRUE to standardize variables (all have unit variance)
# and avoid scale distortions
pcaLaligaStd <- princomp(x = laliga, cor = TRUE)
summary(pcaLaligaStd)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation 3.2918365 1.5511043 1.13992451 0.91454883 0.85765282 0.59351138 0.45780827 0.370649324
## Proportion of Variance 0.6374228 0.1415250 0.07643694 0.04919997 0.04326873 0.02072093 0.01232873 0.008081231
## Cumulative Proportion 0.6374228 0.7789478 0.85538472 0.90458469 0.94785342 0.96857434 0.98090308 0.988984306
## Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## Standard deviation 0.327182806 0.217470830 0.128381750 0.0976778705 0.083027923 2.582795e-03 1.226924e-03
## Proportion of Variance 0.006296976 0.002781974 0.000969522 0.0005612333 0.000405508 3.924019e-07 8.854961e-08
## Cumulative Proportion 0.995281282 0.998063256 0.999032778 0.9995940111 0.999999519 9.999999e-01 1.000000e+00
## Comp.16 Comp.17
## Standard deviation 2.222483e-08 0
## Proportion of Variance 2.905548e-17 0
## Cumulative Proportion 1.000000e+00 1
# The effects of the distorsion can be clearly seen with the biplot
# Variability absorbed by Shots, Shots.on.goal, Fouls.made
biplot(pcaLaliga, cex = 0.75)
# The effects of the variables are more balanced
biplot(pcaLaligaStd, cex = 0.75)
```

The biplot^{78} provides a powerful and succinct way of displaying the relevant information contained in the first two principal components. It shows:

- The
**scores**of the data in**\(\mathrm{PC}_1\)**and**\(\mathrm{PC}_2\)**by*points*(with optional text labels, depending if there are case names). This is the representation of the data in the first two PCs. - The
**variables represented**in the**\(\mathrm{PC}_1\)**and**\(\mathrm{PC}_2\)**by the*arrows*. These arrows are centered at \((0,0)\).

Let’s examine the *population*^{79} arrow associated to the variable \(X_j\). \(X_j\) is expressed in terms of \(\mathrm{PC}_1\) and \(\mathrm{PC}_2\) by the *weights* \(a_{j1}\) and \(a_{j2}\):

The weights \(a_{j1}\) and \(a_{j2}\) have the same sign as \(\mathrm{Cor}(X_j,\mathrm{PC}_{1})\) and \(\mathrm{Cor}(X_j,\mathrm{PC}_{2})\), respectively. The arrow associated to \(X_j\) is given by the segment joining \((0,0)\) and \((a_{j1},a_{j2})\). Therefore:

- If the arrow
*points right*(\(a_{j1}>0\)), there is*positive correlation*between \(X_j\) and \(\mathrm{PC}_1\). Analogous if the arrow points left. - If the arrow is
*approximately vertical*(\(a_{j1}\approx0\)), there is*uncorrelation*between \(X_j\) and \(\mathrm{PC}_1\).

Analogously:

- If the arrow
*points up*(\(a_{j2}>0\)), there is*positive correlation*between \(X_j\) and \(\mathrm{PC}_2\). Analogous if the arrow points down. - If the arrow is
*approximately horizontal*(\(a_{j2}\approx0\)), there is*uncorrelation*between \(X_j\) and \(\mathrm{PC}_2\).

In addition, the **magnitude of the arrow** informs us about the strength of the **correlation** of \(X_j\) with \((\mathrm{PC}_{1}, \mathrm{PC}_{2})\).

The biplot also informs about the direct relation between variables, at sight of their expressions in \(\mathrm{PC}_1\) and \(\mathrm{PC}_2\). The **angle** of the arrows of variable \(X_j\) and \(X_k\) gives an **approximation to the correlation** between them, \(\mathrm{Cor}(X_j,X_k)\):

- If \(\text{angle}\approx 0^\circ\), the two variables are highly positively correlated.
- If \(\text{angle}\approx 90^\circ\), they are approximately uncorrelated.
- If \(\text{angle}\approx 180^\circ\), the two variables are highly negatively correlated.

The insights obtained from the *approximate* correlations between the variables and principal components are as *valid* as the percentage of variance explained by \(\mathrm{PC}_1\) and \(\mathrm{PC}_2\).

Some interesting insights in La Liga 2015/2016 biplot, obtained from the previous remarks, are^{80}:

- \(-\mathrm{PC}_1\) can be regarded as the
*performance of a team*during the season. \(\mathrm{PC}_1\) is negatively correlated with`Wins`

,`Points`

, etc. and positively correlated with`Draws`

,`Loses`

,`Yellow.cards`

, etc. The best performing teams are not surprising: Barcelona, Real Madrid, and Atlético Madrid. On the other hand, among the worst-performing teams are Levante, Getafe, and Granada. - \(-\mathrm{PC}_2\) can be seen as the
*efficiency of a team*(obtaining points with little participation in the game). Using this interpretation we can see that Atlético Madrid and Villareal were the most efficient teams and that Rayo Vallecano and Real Madrid were the most inefficient. `Offsides`

is approximately uncorrelated with`Red.cards`

and`Matches.without.conceding`

.

A 3D representation of the biplot can be computed through:

```
pca3d::pca3d(pcaLaligaStd, show.labels = TRUE, biplot = TRUE)
## [1] 0.1747106 0.0678678 0.0799681
rgl::rglwidget()
```

Finally, the `biplot`

function allows to construct the biplot using two arbitrary principal components by specifying the `choices`

argument. Keep in mind than these pseudo-biplots will explain a lower proportion of variance than the default `choices = 1:2`

.

```
biplot(pcaLaligaStd, choices = c(1, 3)) # 0.7138 proportion of variance
biplot(pcaLaligaStd, choices = c(2, 3)) # 0.2180 proportion of variance
```

### 3.6.2 Principal components regression

The key idea behind *Principal Components Regression* (PCR) is to regress the response \(Y\) in a set of principal components \(\mathrm{PC}_1,\ldots,\mathrm{PC}_l\) obtained from the predictors \(X_1,\ldots,X_p\), where \(l<p\). The motivation is that often a small number of principal components is enough to explain most of the variability of the predictors *and consequently their relationship with \(Y\)*^{81}. Therefore, we look for fitting the linear model^{82}

The main advantages of PCR are two:

**Multicollinearity**is**avoided**by design: \(\mathrm{PC}_1,\ldots,\mathrm{PC}_l\) are uncorrelated between them.- There are
**less coefficients to estimate**(\(l\) instead of \(p\)), hence the accuracy of the estimation increases.

However, keep in mind that PCR affects the linear model in two fundamental ways:

*Interpretation of the coefficients is not directly related with the predictors*, but with the principal components. Hence, the interpretability of a given coefficient in the regression model is tied to the interpretability of the associated principal component.*Prediction needs an extra step*, since it is required to obtain the scores of the new observations of the predictors in the principal components.

The first point is worth discussing now. The PCR model (3.9) can be seen as a linear model expressed in terms of the original predictors. To make this point clearer, let’s re-express (3.9) as

\[\begin{align} Y=\alpha_0+\mathbf{PC}_{1:l}'\boldsymbol{\alpha}_{1:l}+\varepsilon,\tag{3.10} \end{align}\]where the subindex \(1:l\) denotes the inclusion of the vector entries from \(1\) to \(l\). Now, we can express the PCR problem (3.10) in terms of the original predictors^{83}:

where \(\mathbf{A}_{1:l}\) represents the \(\mathbf{A}\) matrix with only its first \(l\) columns and

\[\begin{align} \gamma_0:=\alpha_0-\boldsymbol{\mu}'\mathbf{A}_{1:l}\boldsymbol{\alpha}_{1:l},\quad \boldsymbol{\gamma}_{1:p}:=\mathbf{A}_{1:l}\boldsymbol{\alpha}_{1:l}.\tag{3.11} \end{align}\]In other words, the coefficients \(\boldsymbol{\alpha}_{1:l}\) of the PCR done with \(l\) principal components in (3.9) translate into the coefficients (3.11) of the linear model

\[\begin{align*} Y=\gamma_0+\gamma_1X_1+\ldots+\gamma_pX_p+\varepsilon, \end{align*}\]In the sample case, we have that

\[\begin{align} \hat{\gamma}_0=\hat\alpha_0-\bar{\mathbf{X}}'\hat{\mathbf{A}}_{1:l}\hat{\boldsymbol{\alpha}}_{1:l},\quad \hat{\boldsymbol{\gamma}}_{1:p}=\hat{\mathbf{A}}_{1:l}\hat{\boldsymbol{\alpha}}_{1:l}.\tag{3.12} \end{align}\]Notice that \(\hat{\boldsymbol{\gamma}}\) is **not the least squares estimator** that we use to denote by \(\hat{\boldsymbol{\beta}}\), but just the coefficients of the PCR that are associated to the *original predictors*. Consequently, \(\hat{\boldsymbol{\gamma}}\) is useful for the interpretation of the linear model produced by PCR, as it can be interpreted in the same way \(\hat{\boldsymbol{\beta}}\) was^{84}.

Finally, remember that the **usefulness** of PCR relies on how well we are able to reduce the dimensionality^{85} of the predictors and the veracity of the assumption that the \(l\) principal components are related with \(Y\).

Keep in mind that PCR considers the **PCA done in the set of predictors**, this is, we exclude the response for obvious reasons (a perfect and useless fit). It is important to remove the response from the call to `princomp`

if we want to use the output in `lm`

.

We see now two approaches for performing PCR, which we illustrate with the `laliga`

dataset. The common objective is to predict `Points`

using the remaining variables (except from the directly related variables `Wins`

, `Draws`

, `Loses`

, and `Matches.without.conceding`

) in order to quantify, explain, and predict the final points of a team from its performance.

The first approach combines the use of the `princomp`

and `lm`

functions. Its strong points are that is both able to predict and explain, and is linked with techniques we have employed so far. The weak point is that it requires some extra coding.

```
# A linear model is problematic
mod <- lm(Points ~ . - Wins - Draws - Loses - Matches.without.conceding,
data = laliga)
summary(mod) # Lots of non-significant predictors
##
## Call:
## lm(formula = Points ~ . - Wins - Draws - Loses - Matches.without.conceding,
## data = laliga)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2117 -1.4766 0.0544 1.9515 4.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 77.11798 26.12915 2.951 0.0214 *
## Goals.scored -28.21714 17.44577 -1.617 0.1498
## Goals.conceded -24.23628 15.45595 -1.568 0.1608
## Percentage.scored.goals 1066.98731 655.69726 1.627 0.1477
## Percentage.conceded.goals 896.94781 584.97833 1.533 0.1691
## Shots -0.10246 0.07754 -1.321 0.2279
## Shots.on.goal 0.02024 0.13656 0.148 0.8863
## Penalties.scored -0.81018 0.77600 -1.044 0.3312
## Assistances 1.41971 0.44103 3.219 0.0147 *
## Fouls.made -0.04438 0.04267 -1.040 0.3328
## Yellow.cards 0.27850 0.16814 1.656 0.1416
## Red.cards 0.68663 1.44229 0.476 0.6485
## Offsides -0.00486 0.14854 -0.033 0.9748
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.274 on 7 degrees of freedom
## Multiple R-squared: 0.9795, Adjusted R-squared: 0.9443
## F-statistic: 27.83 on 12 and 7 DF, p-value: 9.784e-05
# We try to clean the model
modBIC <- MASS::stepAIC(mod, k = log(nrow(laliga)), trace = 0)
summary(modBIC) # Better, but still unsatisfactory
##
## Call:
## lm(formula = Points ~ Goals.scored + Goals.conceded + Percentage.scored.goals +
## Percentage.conceded.goals + Shots + Assistances + Yellow.cards,
## data = laliga)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4830 -1.4505 0.9008 1.1813 5.8662
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.91373 10.73528 5.860 7.71e-05 ***
## Goals.scored -23.90903 11.22573 -2.130 0.05457 .
## Goals.conceded -12.16610 8.11352 -1.499 0.15959
## Percentage.scored.goals 894.56861 421.45891 2.123 0.05528 .
## Percentage.conceded.goals 440.76333 307.38671 1.434 0.17714
## Shots -0.05752 0.02713 -2.120 0.05549 .
## Assistances 1.42267 0.28462 4.999 0.00031 ***
## Yellow.cards 0.11313 0.07868 1.438 0.17603
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.743 on 12 degrees of freedom
## Multiple R-squared: 0.973, Adjusted R-squared: 0.9572
## F-statistic: 61.77 on 7 and 12 DF, p-value: 1.823e-08
# Also, huge multicollinearity
car::vif(modBIC)
## Goals.scored Goals.conceded Percentage.scored.goals Percentage.conceded.goals
## 77998.044758 22299.952547 76320.612577 22322.307151
## Shots Assistances Yellow.cards
## 6.505748 32.505831 3.297224
# A quick way of removing columns without knowing its position
laligaRed <- subset(laliga, select = -c(Points, Wins, Draws, Loses,
Matches.without.conceding))
# PCA without Points, Wins, Draws, Loses, and Matches.without.conceding
pcaLaligaRed <- princomp(x = laligaRed, cor = TRUE)
summary(pcaLaligaRed) # l = 3 gives 86% of variance explained
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation 2.7437329 1.4026745 0.91510249 0.8577839 0.65747209 0.5310954 0.332556029 0.263170555
## Proportion of Variance 0.6273392 0.1639580 0.06978438 0.0613161 0.03602246 0.0235052 0.009216126 0.005771562
## Cumulative Proportion 0.6273392 0.7912972 0.86108155 0.9223977 0.95842012 0.9819253 0.991141438 0.996913000
## Comp.9 Comp.10 Comp.11 Comp.12
## Standard deviation 0.146091551 0.125252621 3.130311e-03 1.801036e-03
## Proportion of Variance 0.001778562 0.001307352 8.165704e-07 2.703107e-07
## Cumulative Proportion 0.998691562 0.999998913 9.999997e-01 1.000000e+00
# Interpretation of PC1 and PC2
biplot(pcaLaligaRed)
```

```
# PC1: attack performance of the team
# Create a new dataset with the response + principal components
laligaPCA <- data.frame("Points" = laliga$Points, pcaLaligaRed$scores)
# Regression on all the principal components
modPCA <- lm(Points ~ ., data = laligaPCA)
summary(modPCA) # Predictors clearly significative - same R^2 as mod
##
## Call:
## lm(formula = Points ~ ., data = laligaPCA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2117 -1.4766 0.0544 1.9515 4.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.4000 0.9557 54.831 1.76e-10 ***
## Comp.1 5.7690 0.3483 16.563 7.14e-07 ***
## Comp.2 2.4376 0.6813 3.578 0.0090 **
## Comp.3 3.4222 1.0443 3.277 0.0135 *
## Comp.4 3.6079 1.1141 3.238 0.0143 *
## Comp.5 -1.9713 1.4535 -1.356 0.2172
## Comp.6 -5.7067 1.7994 -3.171 0.0157 *
## Comp.7 -3.4169 2.8737 -1.189 0.2732
## Comp.8 9.0212 3.6313 2.484 0.0419 *
## Comp.9 -4.6455 6.5415 -0.710 0.5006
## Comp.10 -10.2087 7.6299 -1.338 0.2227
## Comp.11 -222.0340 305.2920 -0.727 0.4907
## Comp.12 -954.7650 530.6164 -1.799 0.1150
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.274 on 7 degrees of freedom
## Multiple R-squared: 0.9795, Adjusted R-squared: 0.9443
## F-statistic: 27.83 on 12 and 7 DF, p-value: 9.784e-05
car::vif(modPCA) # No problems at all
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10 Comp.11 Comp.12
## 1 1 1 1 1 1 1 1 1 1 1 1
# Using the first three components
modPCA3 <- lm(Points ~ Comp.1 + Comp.2 + Comp.3, data = laligaPCA)
summary(modPCA3)
##
## Call:
## lm(formula = Points ~ Comp.1 + Comp.2 + Comp.3, data = laligaPCA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.178 -4.541 -1.401 3.501 16.093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.4000 1.5672 33.435 3.11e-16 ***
## Comp.1 5.7690 0.5712 10.100 2.39e-08 ***
## Comp.2 2.4376 1.1173 2.182 0.0444 *
## Comp.3 3.4222 1.7126 1.998 0.0630 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.009 on 16 degrees of freedom
## Multiple R-squared: 0.8738, Adjusted R-squared: 0.8501
## F-statistic: 36.92 on 3 and 16 DF, p-value: 2.027e-07
# Coefficients associated to each original predictor (gamma)
alpha <- modPCA3$coefficients
gamma <- pcaLaligaRed$loadings[, 1:3] %*% alpha[-1] # Slopes
gamma <- c(alpha[1] - pcaLaligaRed$center %*% gamma, gamma) # Intercept
gamma
## [1] -44.2288551 1.7305124 -3.4048178 1.7416378 -3.3944235 0.2347716 0.8782162 2.6044699 1.4548813
## [10] -0.1171732 -1.7826488 -2.6423211 1.3755697
# We can overpenalize to have a simpler model - also one single
# principal component does quite well
modPCABIC <- MASS::stepAIC(modPCA, k = 2 * log(nrow(laliga)), trace = 0)
summary(modPCABIC)
##
## Call:
## lm(formula = Points ~ Comp.1 + Comp.2 + Comp.3 + Comp.4 + Comp.6 +
## Comp.8, data = laligaPCA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6972 -2.6418 -0.3265 2.3535 8.4944
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.4000 1.0706 48.946 3.95e-16 ***
## Comp.1 5.7690 0.3902 14.785 1.65e-09 ***
## Comp.2 2.4376 0.7632 3.194 0.00705 **
## Comp.3 3.4222 1.1699 2.925 0.01182 *
## Comp.4 3.6079 1.2481 2.891 0.01263 *
## Comp.6 -5.7067 2.0158 -2.831 0.01416 *
## Comp.8 9.0212 4.0680 2.218 0.04502 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.788 on 13 degrees of freedom
## Multiple R-squared: 0.9521, Adjusted R-squared: 0.9301
## F-statistic: 43.11 on 6 and 13 DF, p-value: 7.696e-08
# Note that the order of the principal components does not correspond
# exactly to its importance in the regression!
# To perform prediction we need to compute first the scores associated to the
# new values of the predictors, conveniently preprocessed
# Predictions for FCB and RMA (although they are part of the training sample)
newPredictors <- laligaRed[1:2, ]
newPredictors <- scale(newPredictors, center = pcaLaligaRed$center,
scale = pcaLaligaRed$scale) # Centred and scaled
newScores <- t(apply(newPredictors, 1,
function(x) t(pcaLaligaRed$loadings) %*% x))
# We need a data frame for prediction
newScores <- data.frame("Comp" = newScores)
predict(modPCABIC, newdata = newScores, interval = "prediction")
## fit lwr upr
## Barcelona 93.64950 80.35115 106.9478
## Real Madrid 90.05622 77.11876 102.9937
# Reality
laliga[1:2, 1]
## [1] 91 90
```

The second approach employs the function `pls::pcr`

and is more direct, yet less connected with the techniques we have seen so far. It employs a model object that is different from the `lm`

object and, as a consequence, functions like `summary`

, `BIC`

, `MASS::stepAIC`

, or `plot`

will not work properly. This implies that inference, model selection, and model diagnostics are not so straightforward. In exchange, `pls::pcr`

allows for model fitting in an easier way and model selection through the use of cross-validation. In summary, this is a more *pure predictive* approach than *predictive and explicative*.

```
# Create a dataset without the problematic predictors and with the response
laligaRed2 <- subset(laliga, select = -c(Wins, Draws, Loses,
Matches.without.conceding))
# Simple call to pcr
library(pls)
modPcr <- pcr(Points ~ ., data = laligaRed2, scale = TRUE)
# Notice we do not need to create a data.frame with PCA, it is automatically
# done within pcr. We also have flexibility to remove predictors from the PCA
# scale = TRUE means that the variables are scaled prior to compute PCA
# The summary of the model is different
summary(modPcr)
## Data: X dimension: 20 12
## Y dimension: 20 1
## Fit method: svdpc
## Number of components considered: 12
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## X 62.73 79.13 86.11 92.24 95.84 98.19 99.11 99.69 99.87 100.00 100 100.00
## Points 80.47 84.23 87.38 90.45 90.99 93.94 94.36 96.17 96.32 96.84 97 97.95
# First row: percentage of variance explained of the predictors
# Second row: percentage of variance explained of Y (the R^2)
# Note that we have the same R^2 for 3 and 12 components as in the previous
# approach
# Slots of information in the model - most of them as 3-dim arrays with the
# third dimension indexing the number of components considered
names(modPcr)
## [1] "coefficients" "scores" "loadings" "Yloadings" "projection" "Xmeans" "Ymeans"
## [8] "fitted.values" "residuals" "Xvar" "Xtotvar" "fit.time" "ncomp" "method"
## [15] "scale" "call" "terms" "model"
# The coefficients of the original predictors (gammas), not of the components!
modPcr$coefficients[, , 12]
## Goals.scored Goals.conceded Percentage.scored.goals Percentage.conceded.goals
## -602.85050765 -383.07010184 600.61255371 374.38729000
## Shots Shots.on.goal Penalties.scored Assistances
## -8.27239221 0.88174787 -2.14313238 24.42240486
## Fouls.made Yellow.cards Red.cards Offsides
## -2.96044265 5.51983512 1.20945331 -0.07231723
# pcr() computes up to ncomp (in this case, 12) linear models, each one
# considering one extra principal component. $coefficients returns in a
# 3-dim array the coefficients of all the linear models
# Prediction is simpler and can be done for different number of components
predict(modPcr, newdata = laligaRed2[1:2, ], ncomp = 12)
## , , 12 comps
##
## Points
## Barcelona 92.01244
## Real Madrid 91.38026
# Selecting the number of components to retain. All the components up to ncomp
# are selected, no further flexibility is possible
modPcr2 <- pcr(Points ~ ., data = laligaRed2, scale = TRUE, ncomp = 3)
summary(modPcr2)
## Data: X dimension: 20 12
## Y dimension: 20 1
## Fit method: svdpc
## Number of components considered: 3
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps
## X 62.73 79.13 86.11
## Points 80.47 84.23 87.38
# Selecting the number of components to retain by Leave-One-Out
# cross-validation
modPcrCV1 <- pcr(Points ~ ., data = laligaRed2, scale = TRUE,
validation = "LOO")
summary(modPcrCV1)
## Data: X dimension: 20 12
## Y dimension: 20 1
## Fit method: svdpc
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 20 leave-one-out segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps
## CV 18.57 8.505 8.390 8.588 7.571 7.688 6.743 7.09 6.224 6.603 7.547 8.375
## adjCV 18.57 8.476 8.356 8.525 7.513 7.663 6.655 7.03 6.152 6.531 7.430 8.236
## 12 comps
## CV 7.905
## adjCV 7.760
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## X 62.73 79.13 86.11 92.24 95.84 98.19 99.11 99.69 99.87 100.00 100 100.00
## Points 80.47 84.23 87.38 90.45 90.99 93.94 94.36 96.17 96.32 96.84 97 97.95
# View cross-validation Mean Squared Error in Prediction
validationplot(modPcrCV1, val.type = "MSEP") # l = 8 gives the minimum CV
```

```
# The black is the CV loss, the dashed red line is the adjCV loss, a bias
# corrected version of the MSEP (not described in the notes)
# Selecting the number of components to retain by 10-fold Cross-Validation
# (k = 10 is the default)
modPcrCV10 <- pcr(Points ~ ., data = laligaRed2, scale = TRUE,
validation = "CV")
summary(modPcrCV10)
## Data: X dimension: 20 12
## Y dimension: 20 1
## Fit method: svdpc
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps
## CV 18.57 8.519 9.126 8.781 8.090 9.180 8.457 9.362 8.020 8.095 8.840 11.52
## adjCV 18.57 8.456 9.013 8.646 7.941 9.071 8.196 9.089 7.765 7.859 8.525 11.04
## 12 comps
## CV 9.945
## adjCV 9.528
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## X 62.73 79.13 86.11 92.24 95.84 98.19 99.11 99.69 99.87 100.00 100 100.00
## Points 80.47 84.23 87.38 90.45 90.99 93.94 94.36 96.17 96.32 96.84 97 97.95
validationplot(modPcrCV10, val.type = "MSEP") # l = 6 gives the minimum CV
```

`pcr()`

does an internal scaling of the predictors by their *quasi-standard deviations*. This means that each variable is divided by \(\frac{1}{\sqrt{n-1}}\), when in

`princomp`

a scaling of \(\frac{1}{\sqrt{n}}\) is applied (the *standard deviations*are employed). This results in a minor discrepancy in the

`scores`

object of both methods that is easily patchable. The `scores`

of `princomp()`

are the ones of `pcr()`

**multiplied**by \(\sqrt{\frac{n}{n-1}}\). This problem is inherited to the coefficients, which assume scores divided by \(\frac{1}{\sqrt{n-1}}\). Therefore, the \(\hat{\boldsymbol{\gamma}}\) coefficients described in (3.12) are obtained by

**dividing**the

`coefficients`

of `pcr()`

by \(\sqrt{\frac{n}{n-1}}\).
The next chunk of code illustrates the previous warning.

```
# Equality of loadings from princomp() and pcr()
max(abs(abs(pcaLaligaRed$loadings[, 1:3]) - abs(modPcr$loadings[, 1:3])))
## [1] 1.804112e-15
# Equality of scores from princomp() and pcr() (with the same standardization)
max(abs(abs(pcaLaligaRed$scores[, 1:3]) -
abs(modPcr$scores[, 1:3] * sqrt(n / (n - 1)))))
## [1] 5.551115e-15
# Equality of the gamma coefficients obtained previously for 3 PCA
# (with the same standardization)
modPcr$coefficients[, , 3] / sqrt(n / (n - 1))
## Goals.scored Goals.conceded Percentage.scored.goals Percentage.conceded.goals
## 1.7305124 -3.4048178 1.7416378 -3.3944235
## Shots Shots.on.goal Penalties.scored Assistances
## 0.2347716 0.8782162 2.6044699 1.4548813
## Fouls.made Yellow.cards Red.cards Offsides
## -0.1171732 -1.7826488 -2.6423211 1.3755697
gamma[-1]
## [1] 1.7305124 -3.4048178 1.7416378 -3.3944235 0.2347716 0.8782162 2.6044699 1.4548813 -0.1171732 -1.7826488
## [11] -2.6423211 1.3755697
# Coefficients associated to the principal components - same as in modPCA3
lm(Points ~ ., data = data.frame("Points" = laliga$Points,
modPcr$scores[, 1:3] * sqrt(n / (n - 1))))
##
## Call:
## lm(formula = Points ~ ., data = data.frame(Points = laliga$Points,
## modPcr$scores[, 1:3] * sqrt(n/(n - 1))))
##
## Coefficients:
## (Intercept) Comp.1 Comp.2 Comp.3
## 52.400 -5.769 2.438 -3.422
modPCA3
##
## Call:
## lm(formula = Points ~ Comp.1 + Comp.2 + Comp.3, data = laligaPCA)
##
## Coefficients:
## (Intercept) Comp.1 Comp.2 Comp.3
## 52.400 5.769 2.438 3.422
# Of course, flipping of signs is always possible with PCA
```

The selection of \(l\) by **cross-validation** attempts to minimize the *Mean Squared Error in Prediction* (MSEP) or, equivalently, the *Root MSEP* (RMSEP), of the model^{86}. This is a **jackknife method** valid for the selection of any tuning parameter \(\lambda\) that affects the form of the estimate \(\hat m_\lambda\) of the regression function \(m\) (remember (1.1)). Given the sample \(\{(\mathbf{X}_i,Y_i)\}_{i=1}^n\), **leave-one-out cross-validation** considers the tuning parameter

where \(\hat m_{\lambda,-i}\) represents the fit of the model \(\hat m_\lambda\) without the \(i\)-th observation \((\mathbf{X}_i,Y_i)\).

A less computationally expensive variation on leave-one-out cross-validation is **\(k\)-fold cross-validation**, which partitions the data into \(k\) folds \(F_1,\ldots, F_k\) of approximately equal size, trains the model \(\hat m_\lambda\) in the aggregation of \(k-1\) folds and evaluates its MSEP in the remaining fold:

where \(\hat m_{\lambda,-F_{j}}\) represents the fit of the model \(\hat m_\lambda\) excluding the data from the \(j\)-th fold \(F_{j}\). Recall that \(k\)-fold cross-validation is **more general than leave-one-out cross-validation**, since the latter is a particular case of the former with \(k=n\).

*i.e.*, how each datum is assigned to each of the \(k\) folds). Some statistical softwares do this assignment

**randomly**, which means that the selection of \(\hat{\lambda}_{k\mathrm{-CV}}\) may vary from one run to another. Thus, fixing the

**seed**prior to the parameter selection is important to ensure

**reproducibility**. Another alternative is to aggregate, in a convenient way, the results of

*several*\(k\)-fold cross-validations done in different random random partitions. Notice that this problem is

*not*present in leave-one-out cross-validation (where \(k=n\)).

`summary()`

function on the output of `lm()`

. The coefficients \(\boldsymbol{\alpha}_{0:l}\) are the ones estimated by least squares when considering the scores of the \(l\) principal components as the predictors. This transformation of the predictors does not affect the validity of the inferential results of Section 2.4 (they assumed that the predictors were deterministic). But recall that the **inference is not about**\(\boldsymbol{\gamma}\).

Inference in PCR is based on the assumptions of the linear model being satisfied for the principal components we are considering. The evaluation of the assumptions can be done using the exact same tools described in Section 3.5. However, keep in mind that **PCA is a linear transformation of the data**, and therefore:

- If the
**linearity**assumption fails for the predictors \(X_1,\ldots,X_p\), then it will likely fail for \(\mathrm{PC}_1,\ldots,\mathrm{PC}_l\), since the transformation will not introduce nonlinearities able to capture the nonlinear effects. - Similarly, if the
**homoscedasticity, normality, or independence**assumptions fail for \(X_1,\ldots,X_p\), then they will likely fail for \(\mathrm{PC}_1,\ldots,\mathrm{PC}_l\).

Exceptions to the previous common implications are possible, and may involve the association of one or several *problematic* predictors (e.g., have nonlinear effects on the response) to the principal components that are excluded from the model. Up to which extent the failure of the assumptions in the original predictors can be mitigated by PCR depends on each application.

### 3.6.3 Partial least squares

PCR works by replacing the predictors \(X_1,\ldots,X_p\) by a set of principal components \(\mathrm{PC}_1,\ldots,\mathrm{PC}_l\) under the hope that these directions, that explain most of the variability of the predictors, are also the best directions for predicting the response \(Y\). While this is a reasonable belief, it is not a guaranteed fact. *Partial Least Squares* (PLS) precisely tackles this point with the idea of regressing the response \(Y\) on a set of new variables \(\mathrm{PLS}_1,\ldots,\mathrm{PLS}_l\) that are constructed with the objective of *predicting \(Y\) from \(X_1,\ldots,X_p\) in the best way*.

As with PCA, the idea to find linear combinations of the predictors \(X_1,\ldots,X_p\), that is, to have:

\[\begin{align*} \mathrm{PLS}_1:=\sum_{j=1}^pa_{j1}X_j,\,\ldots,\,\mathrm{PLS}_p:=\sum_{j=1}^pa_{jp}X_j. \end{align*}\]The question is how to choose the coefficients \(a_{jk}\), \(j,k=1,\ldots,p\). PLS does it by placing the **most weight** in the **predictors that are most strongly correlated with \(Y\)** and in such a way that the resulting \(\mathrm{PLS}_1,\ldots,\mathrm{PLS}_p\) are **uncorrelated**. After standardizing the variables, for \(\mathrm{PLS}_1\) this is achieved by setting \(a_{j1}\) equal to the theoretical slope coefficient of regressing \(Y\) into \(X_j\), that is^{87}:

where \(\beta_{j1}\) stems from \(Y=\beta_{0}+\beta_{j1}X_j+\varepsilon\).

The second partial least squares direction, \(\mathrm{PLS}_2\), is computed in a similar way, but once the linear effects of \(\mathrm{PLS}_1\) on \(X_1,\ldots,X_p\) are removed. This is achieved by:

- Regressing each of \(X_1,\ldots,X_p\) on \(\mathrm{PLS}_1\). That is, fit the \(p\) simple linear models

Regress \(Y\) on each of the \(p\) random errors \(\varepsilon_{1},\ldots,\varepsilon_{p}\,\)

\[\begin{align*} a_{j2}:=\beta_{j2}:=\frac{\mathrm{Cov}[\varepsilon_j,Y]}{\mathbb{V}\mathrm{ar}[\varepsilon_j]}, \end{align*}\]^{88}from the above regressions, yieldingwhere \(\beta_{j2}\) stems from \(Y=\beta_{0}+\beta_{j2}\varepsilon_j+\varepsilon\). If the process is iterated, the coefficients for \(\mathrm{PLS}_j\), \(j>2\), can be computed

^{89}. Once \(\mathrm{PLS}_1,\ldots,\mathrm{PLS}_l\) are obtained, then PLS proceeds as PCR and fits the model

The implementation of PLS can be done by the function `pls::plsr`

, which has an analogous syntax to `pls::pcr`

.

```
# Simple call to plsr - very similar to pcr
modPls <- plsr(Points ~ ., data = laligaRed2, scale = TRUE)
# The summary of the model
summary(modPls)
## Data: X dimension: 20 12
## Y dimension: 20 1
## Fit method: kernelpls
## Number of components considered: 12
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## X 62.53 76.07 84.94 88.29 92.72 95.61 98.72 99.41 99.83 100.00 100.00 100.00
## Points 83.76 91.06 93.93 95.43 95.94 96.44 96.55 96.73 96.84 96.84 97.69 97.95
# First row: percentage of variance explained of the predictors
# Second row: percentage of variance explained of Y (the R^2)
# Note we have the same R^2 for 12 components as in the linear model
# Slots of information in the model
names(modPls)
## [1] "coefficients" "scores" "loadings" "loading.weights" "Yscores" "Yloadings"
## [7] "projection" "Xmeans" "Ymeans" "fitted.values" "residuals" "Xvar"
## [13] "Xtotvar" "fit.time" "ncomp" "method" "scale" "call"
## [19] "terms" "model"
# PLS scores
head(modPls$scores)
## [1] 7.36407868 6.70659816 2.45577740 0.06913485 0.96513341 -0.39588401
# Also uncorrelated
head(cov(modPls$scores))
## Comp 1 Comp 2 Comp 3 Comp 4 Comp 5 Comp 6 Comp 7 Comp 8
## Comp 1 7.393810e+00 1.704822e-16 1.667417e-16 -1.716470e-16 -1.788249e-16 -8.227618e-16 -4.380633e-16 -1.202580e-16
## Comp 2 1.704822e-16 1.267859e+00 1.815810e-16 1.695817e-16 1.084887e-16 9.991690e-17 7.754899e-18 3.451971e-17
## Comp 3 1.667417e-16 1.815810e-16 9.021126e-01 -1.412726e-17 -7.389551e-17 -1.422399e-16 -3.699555e-17 -4.991681e-17
## Comp 4 -1.716470e-16 1.695817e-16 -1.412726e-17 3.130984e-01 1.130620e-17 1.088368e-17 -1.278556e-17 2.547340e-18
## Comp 5 -1.788249e-16 1.084887e-16 -7.389551e-17 1.130620e-17 2.586132e-01 8.607103e-18 -2.416915e-17 -1.698417e-17
## Comp 6 -8.227618e-16 9.991690e-17 -1.422399e-16 1.088368e-17 8.607103e-18 2.792408e-01 -2.305428e-17 -7.131527e-18
## Comp 9 Comp 10 Comp 11 Comp 12
## Comp 1 -3.291438e-16 -3.606813e-16 -2.165469e-16 -3.535228e-16
## Comp 2 9.691056e-17 1.087954e-16 2.611563e-17 -2.943533e-17
## Comp 3 -1.134240e-17 7.717273e-18 -1.008660e-17 3.430733e-17
## Comp 4 -2.368046e-17 -1.480284e-17 -1.227829e-17 -1.358123e-17
## Comp 5 -3.565028e-17 2.589496e-17 8.527293e-18 9.912266e-18
## Comp 6 -1.579360e-17 -2.849998e-18 1.062075e-17 -2.612850e-18
# The coefficients of the original predictors, not of the components!
modPls$coefficients[, , 2]
## Goals.scored Goals.conceded Percentage.scored.goals Percentage.conceded.goals
## 1.8192870 -4.4038213 1.8314760 -4.4045722
## Shots Shots.on.goal Penalties.scored Assistances
## 0.4010902 0.9369002 -0.2006251 2.3688050
## Fouls.made Yellow.cards Red.cards Offsides
## 0.2807601 -1.6677725 -2.4952503 1.2187529
# Obtaining the coefficients of the PLS components
lm(formula = Points ~., data = data.frame("Points" = laliga$Points,
modPls$scores[, 1:3]))
##
## Call:
## lm(formula = Points ~ ., data = data.frame(Points = laliga$Points,
## modPls$scores[, 1:3]))
##
## Coefficients:
## (Intercept) Comp.1 Comp.2 Comp.3
## 52.400 6.093 4.341 3.232
# Prediction
predict(modPls, newdata = laligaRed2[1:2, ], ncomp = 12)
## , , 12 comps
##
## Points
## Barcelona 92.01244
## Real Madrid 91.38026
# Selecting the number of components to retain
modPls2 <- plsr(Points ~ ., data = laligaRed2, scale = TRUE, ncomp = 2)
summary(modPls2)
## Data: X dimension: 20 12
## Y dimension: 20 1
## Fit method: kernelpls
## Number of components considered: 2
## TRAINING: % variance explained
## 1 comps 2 comps
## X 62.53 76.07
## Points 83.76 91.06
# Selecting the number of components to retain by Leave-One-Out cross-validation
modPlsCV1 <- plsr(Points ~ ., data = laligaRed2, scale = TRUE,
validation = "LOO")
summary(modPlsCV1)
## Data: X dimension: 20 12
## Y dimension: 20 1
## Fit method: kernelpls
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 20 leave-one-out segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps
## CV 18.57 8.307 7.221 6.807 6.254 6.604 6.572 6.854 7.348 7.548 7.532 7.854
## adjCV 18.57 8.282 7.179 6.742 6.193 6.541 6.490 6.764 7.244 7.430 7.416 7.717
## 12 comps
## CV 7.905
## adjCV 7.760
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## X 62.53 76.07 84.94 88.29 92.72 95.61 98.72 99.41 99.83 100.00 100.00 100.00
## Points 83.76 91.06 93.93 95.43 95.94 96.44 96.55 96.73 96.84 96.84 97.69 97.95
# View cross-validation Mean Squared Error Prediction
validationplot(modPlsCV1, val.type = "MSEP") # l = 4 gives the minimum CV
```

```
# Selecting the number of components to retain by 10-fold Cross-Validation
# (k = 10 is the default)
modPlsCV10 <- plsr(Points ~ ., data = laligaRed2, scale = TRUE,
validation = "CV")
summary(modPlsCV10)
## Data: X dimension: 20 12
## Y dimension: 20 1
## Fit method: kernelpls
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps
## CV 18.57 8.376 7.389 6.906 6.280 7.040 7.360 7.635 8.514 8.959 9.363 9.382
## adjCV 18.57 8.324 7.299 6.769 6.165 6.885 7.149 7.421 8.240 8.649 9.027 9.020
## 12 comps
## CV 9.304
## adjCV 8.934
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## X 62.53 76.07 84.94 88.29 92.72 95.61 98.72 99.41 99.83 100.00 100.00 100.00
## Points 83.76 91.06 93.93 95.43 95.94 96.44 96.55 96.73 96.84 96.84 97.69 97.95
validationplot(modPlsCV10, val.type = "MSEP")
```

```
# l = 4 is close to the minimum CV
# Regress manually Points in the scores, in order to have an lm object
# Create a new dataset with the response + PLS components
laligaPLS <- data.frame("Points" = laliga$Points, cbind(modPls$scores))
# Regression on all principal components
modPLS <- lm(Points ~ Comp.1 + Comp.2, data = laligaPLS)
summary(modPLS) # Predictors clearly significative - same R^2 as in modPls2
##
## Call:
## lm(formula = Points ~ Comp.1 + Comp.2, data = laligaPLS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3565 -3.6157 0.4508 2.3288 12.3116
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.4000 1.2799 40.941 < 2e-16 ***
## Comp.1 6.0933 0.4829 12.618 4.65e-10 ***
## Comp.2 4.3413 1.1662 3.723 0.00169 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.724 on 17 degrees of freedom
## Multiple R-squared: 0.9106, Adjusted R-squared: 0.9
## F-statistic: 86.53 on 2 and 17 DF, p-value: 1.225e-09
car::vif(modPLS) # No problems at all
## Comp.1 Comp.2
## 1 1
```

Let’s perform PCR and PLS in the `iris`

dataset. Recall that this dataset contains a factor variable, which can not be treated directly by `princomp`

but can be used in PCR/PLS (it is transformed internally to two dummy variables). Do the following:

- Compute the PCA of
`iris`

, excluding`Species`

. What is the percentage of variability explained with two components? - Draw the biplot and look for interpretations of the principal components.
- Plot the PCA scores in a
`scatterplotMatrix`

plot such that the scores are coloured by the levels in`Species`

(you can use the`groups`

argument). - Compute the PCR and PLS regressions of
`Petal.Width`

with \(l=2\) (exclude`Species`

). Inspect by CV the best selection of \(l\) for each regression. - Plot the PLS scores of the data in a
`scatterplotMatrix`

plot such that the scores are coloured by the levels in`Species`

. Compare with the PCA scores. (A quick way of converting the`scores`

to a matrix is with`rbind`

.)

**in practise PCR and PLS tend to perform similarly**, since PLS potentially increases the variance of the directions with respect to PCR.

**Inference for PLS is more involved**since, differently to what happens in PCR, the

**PLS directions are dependent on the response \(Y\)**. This directly breaks a core assumption made in Section 2.4: that the randomness of the regression model came only from the error terms and not from the predictors. We do not go into details on how to perform inference on PLS in these notes.

Precisely, the size of the set is \(2^{p+1}\).↩

That is, \(\mathbb{E}[X_j]=0\), \(j=1,\ldots,p\). This is important since PCA is sensitive to the centering of the data.↩

Recall that the covariance matrix is a real, symmetric, semi-positive definite matrix.↩

Therefore, \(\mathbf{A}^{-1}=\mathbf{A}'\).↩

Since the variance-covariance matrix \(\mathbb{V}\mathrm{ar}[\mathbf{PC}]\) is diagonal.↩

Up to now, the exposition has been focused exclusively on the population case.↩

Some care is needed here. The matrix \(\mathbf{S}\) is obtained from linear combinations of the \(n\) vectors \(\mathbf{X}_1-\bar{\mathbf{X}},\ldots,\mathbf{X}_n-\bar{\mathbf{X}}\). Recall that these \(n\) vectors are

*not*linearly independent, as they are guaranteed to add \(\mathbf{0}\), \(\sum_{i=1}^n[\mathbf{X}_i-\bar{\mathbf{X}}]=\mathbf{0}\), so it is possibly to express one perfectly on the rest. That implies that the \(p\times p\) matrix \(\mathbf{S}\) has rank smaller or equal to \(n-1\). If \(p\leq n-1\), then the matrix has full rank \(p\) and it is invertible. But if \(p\geq n\), then \(\mathbf{S}\) is singular and, as a consequence, \(\lambda_j=0\) for \(j\geq n\). This implies that the principal components for those eigenvalues can not be determined univocally.↩Therefore, a sample of lengths measured in centimeters will have a variance \(10^4\) times larger than the same sample measured in meters – yet it is the same information!↩

Computed with

`biplot.princomp`

or with`biplot`

applied to a`princomp`

object.↩For the sample version, replace the variable \(X_j\) by its sample \(X_{1j},\ldots,X_{nj}\), the weights \(a_{ji}\) by their estimates \(\hat a_{ji}\), and the principal component \(\mathrm{PC}_j\) by the scores \(\hat{\mathrm{PC}}_{j1},\ldots,\hat{\mathrm{PC}}_{jn}\).↩

The first two principal components of the dataset explain the \(78\%\) of the variability of the data. Hence, the insights obtained from the biplot should be regarded as “\(78\%\)-accurate”.↩

This does not need to be true, but it is often the case.↩

For the sake of simplicity, we considering the exposition the

*first*\(l\) principal components, but obviously other combinations of \(l\) principal components are possible.↩Since we know by (3.7) that \(\mathbf{PC}=\mathbf{A}'(\mathbf{X}-\boldsymbol{\mu})\).↩

For example, thanks to \(\hat{\boldsymbol{\gamma}}\) we know that the estimated conditional response of \(Y\) precisely increases \(\hat{\gamma}_j\) for a marginal unit increment in \(X_j\).↩

If \(l=p\), then PCR is equivalent to the least squares estimation.↩

Alternatively, the “

*in Prediction*” part of the latter terms is dropped and they are just referred as the MSE and RMSE.↩They play the role of \(X_1,\ldots,X_p\) in the computation of \(\mathrm{PLS}_1\).↩

Of course, in practice, the computations need to be done in terms of the sample.↩