# Chapter 2 Linear regression and high-dimensionality

In this chapter we will review multiple linear regression and in particular the Ordinary Least Squares (OLS) estimator. We will further investigate the challenges which appear in the high-dimensional setting where the number of covariates is large compared to the number of observations, i.e. \(p>>n\).

## 2.1 Remarks on notation

We will typically denote an input variable by the symbol \(X\). If \(X\) is a vector, its components can be accessed by subscripts \(X_j\). Quantative outputs will be denoted by \(Y\), and qualitative outputs by \(G\) (for group). We use uppercase letters such as \(X\), \(Y\) or \(G\) when referring to the generic aspects of a variable. Observed values are written in lowercase; hence \(i\)th observed value of \(X\) is written as \(x_i\) (where \(x_i\) is again a scalar or vector). Matrices are represented by bold uppercase letters; for example, a set of \(N\) input p-vectors \(x_i\), \(i=1,\ldots,N\) would be represented by the \(N\times p\) matrix \(\bf{X}\). All vectors are assumed to be column vectors, the \(i\)th row of \(\bf{X}\) is \(x_i^T\).

## 2.2 Linear regression and Least Squares

Given a vector of inputs \(X=(X_1,X_2,\ldots,X_p)\), in multivariate regression we predict the output \(Y\) via the linear model:

\[ \hat{Y}=\sum_{j=1}^{p}X_j\hat\beta_j.\]

The term \(\beta_0\) is the intercept. If we include the constant variable 1 in \(X\), include \(\hat\beta_0\) in the vector of coefficients \(\hat\beta\), then we can write

\[\hat{Y}=X^T\hat\beta.\]

How do we fit the linear model to a set of training data (i.e. how do we obtain the estimator \(\hat \beta\))? We typically use *ordinary least squares* (OLS) where we pick the coefficient \(\beta\) to minimize the residual sum of squares

\[\begin{align*} \rm{RSS}(\beta)&=\sum_{i=1}^{N}(y_i-x_i^T\beta)^2\\ &=(\bf y - \bf X \beta)^T (\bf y - \bf X \beta)\\ &=\|\bf y - \bf X \beta\|^2_2. \end{align*}\]

If the matrix \(\bf X^T \bf X\) is nonsingular, then the solution is given by

\[\hat\beta=(\bf X^T \bf X)^{-1}\bf X^T \bf y.\]

Thus, the prediction at the new input point \(X_{\rm new}\) is

\[\begin{align*} \hat{Y}&=\hat{f}(X_{\rm new})\\ &=X_{\rm new}^T\hat\beta\\ &=X_{\rm new}^T(\bf X^T \bf X)^{-1}\bf X^T \bf y.\\ \end{align*}\]

Figures 2.1 and 2.2 show two geometric representations of the OLS estimator. In Figure 2.1 the \(N\) data points \((y_i,x_{i2},\ldots,x_{ip})\) randomly spread around a \((p-1)\)-dimensional hyperplane in a \(p\)-dimensional space; the random spread only occurs parallel to the y-axis. Figure 2.2 shows a different representation where the vector \(\bf y\) is a single point in the \(N\)-dimensional space \({\bf R}^N\); the fitted values \(\hat {\bf y}\) are the orthogonal projection onto the \(p\)-dimensional subspace of \({\bf R}^N\) spanned by the vectors \({\bf x}_1,\ldots,{\bf x}_p\).

## 2.3 Overfitting

Overfitting refers to the phenomenon of modelling the noise rather than the signal. In case the true model is parsimonious (few covariates driving the response \(Y\)) and data on many covariates are available, it is likely that a linear combination of all covariates yields a higher likelihood than a combination of the few that are actually related to the response. As only the few covariates related to the response contain the signal, the model involving all covariates then cannot but explain more than the signal alone: it also models the error. Hence, it overfits the data.

In high-dimensional settings overfitting is a real threat! In the situation where \(p>n\) it is possible to form a linear combination of the covariates that perfectly explains the response, including the noise. In general, large estimates of regression coefficients are often an indication of overfitting. We illustrate overfitting in the next example. We simulate \(n=10\) training data points. We take \(p=15\) and \(X_{i1},\ldots,X_{ip}\) i.i.d \(N(0,1)\). The response depends only on the first covariate, i.e. \(Y_i=\beta_1 X_{i1}+\epsilon_i\), where \(\beta_1=2\) and \(\epsilon_i\) i.i.d \(N(0,0.5)\).

```
set.seed(1)
<- 10
n <- 15
p <- c(2,rep(0,p-1))
beta
# simulate covariates
<- matrix(rnorm(n*p),n,p)
xtrain <- xtrain%*%beta+rnorm(n,sd=0.5)
ytrain <- data.frame(xtrain)
dtrain $y <- ytrain dtrain
```

We start by fitting a linear regression model with only the first covariate.

`<- lm(y~X1,data=dtrain) fit1 `

The regression coefficients are obtained using `coef`

.

`summary(fit1)`

```
##
## Call:
## lm(formula = y ~ X1, data = dtrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59574 -0.41567 -0.06222 0.18490 0.97592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1002 0.1785 -0.561 0.59
## X1 1.8070 0.2373 7.614 6.22e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5558 on 8 degrees of freedom
## Multiple R-squared: 0.8787, Adjusted R-squared: 0.8636
## F-statistic: 57.97 on 1 and 8 DF, p-value: 6.223e-05
```

Next, we plot the fitted values against the first covariate and we add the reference line representing the “true” relationship.

We add two “noise” covariates to the model.

```
<- lm(y~X1+X2+X3+X4,data=dtrain)
fit4 summary(fit4)
```

```
##
## Call:
## lm(formula = y ~ X1 + X2 + X3 + X4, data = dtrain)
##
## Residuals:
## 1 2 3 4 5 6 7 8
## -0.08607 -0.08669 -0.18732 -0.12641 -0.37113 -0.25548 0.70950 0.29125
## 9 10
## -0.68858 0.80092
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.09973 0.21924 -0.455 0.66826
## X1 2.07658 0.40301 5.153 0.00361 **
## X2 -0.11117 0.25161 -0.442 0.67707
## X3 0.29805 0.37543 0.794 0.46325
## X4 0.25982 0.27471 0.946 0.38767
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6281 on 5 degrees of freedom
## Multiple R-squared: 0.9032, Adjusted R-squared: 0.8257
## F-statistic: 11.66 on 4 and 5 DF, p-value: 0.009503
```

The fitted values (red triangles) start to deviate from the truth (blue line).

With a total of 8 covariates the fitted values get closer to the data points (black dots). This means the fitted model also captures the deviations from the truth (blue line).

`<- lm(y~X1+X2+X3+X4+X5+X6+X7+X8,data=dtrain) fit8 `

We investigate the model fit.

`summary(fit8)`

```
##
## Call:
## lm(formula = y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data = dtrain)
##
## Residuals:
## 1 2 3 4 5 6 7 8
## 0.32017 -0.26990 -0.21987 0.10342 -0.02651 0.06748 0.31531 -0.05819
## 9 10
## -0.24516 0.01324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.25168 0.33931 -0.742 0.594
## X1 0.07841 2.14373 0.037 0.977
## X2 0.31719 0.46470 0.683 0.619
## X3 -0.73848 1.36553 -0.541 0.684
## X4 0.72203 0.83656 0.863 0.547
## X5 0.03598 0.95115 0.038 0.976
## X6 -0.57347 0.66831 -0.858 0.549
## X7 -0.37493 0.46452 -0.807 0.568
## X8 -1.46034 1.59592 -0.915 0.528
##
## Residual standard error: 0.6346 on 1 degrees of freedom
## Multiple R-squared: 0.9802, Adjusted R-squared: 0.8221
## F-statistic: 6.199 on 8 and 1 DF, p-value: 0.3015
```

We note that many regression coefficient are clearly *over-estimated* (the true coefficients are \(\beta = (2, 0, \ldots, 0)\)). The model “overfits” the data.

Finally we fit the model with all 15 covariates.

```
# fit linear regression (all 15 covariates, intercept excluded for this illustration)
<- lm(y~.,data=dtrain) fit15
```

The fitted values perfectly match the data.

We investigate the model fit.

`summary(fit15)`

```
##
## Call:
## lm(formula = y ~ ., data = dtrain)
##
## Residuals:
## ALL 10 residuals are 0: no residual degrees of freedom!
##
## Coefficients: (6 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01592 NaN NaN NaN
## X1 -0.50138 NaN NaN NaN
## X2 0.81492 NaN NaN NaN
## X3 -0.56052 NaN NaN NaN
## X4 0.72667 NaN NaN NaN
## X5 1.84831 NaN NaN NaN
## X6 0.05759 NaN NaN NaN
## X7 -1.21460 NaN NaN NaN
## X8 -1.30908 NaN NaN NaN
## X9 -1.39005 NaN NaN NaN
## X10 NA NA NA NA
## X11 NA NA NA NA
## X12 NA NA NA NA
## X13 NA NA NA NA
## X14 NA NA NA NA
## X15 NA NA NA NA
##
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: NaN
## F-statistic: NaN on 9 and 0 DF, p-value: NA
```

Clearly something went wrong. Many entries in the table of coefficients are not available. One message in the output says that there are no residual degrees of freedom. Further 6 coefficients cannot be calculated because of “singularities.” In fact, the OLS estimator as introduced before is not well defined because the design matrix \(\bf X\) is rank deficient (\({\rm rank}({\bf X})=n< p\)) and therefore the matrix \({\bf X}^T {\bf X}\) is singular.

```
<- model.matrix(fit15)
x det(t(x)%*%x)
```

`## [1] -2.8449e-81`

A useful quantity in the context of overfitting is the *residual degrees of freedom* which is defined as \(n-{\rm df}\), where \({\rm df}\) are the degrees of freedom of the model. For a linear regression model we have \({\rm df}=p+1\). A rule of thumb to avoid overfitting is that the residual degrees of freedom \(n-{\rm df}\) should be at least \(5\) (be aware that this is not a general applicable rule!).

## 2.4 Collinearity

Recall that *collinearity* in regression analysis refers to the event of two (or multiple) covariates being strongly linearly related. The case of two (or multiple) covariates being perfectly linearly dependent is referred to as *super-collinearity*.

For illustration we generate some data.

```
set.seed(1315)
<- 20
n <- rnorm(n)
x1 <- rnorm(n)
x2 <- x1+rnorm(n,sd=0.25)
x3 <- x1
x4 <- data.frame(x1,x2,x3,x4)
dat $y <- 2*dat$x1+rnorm(n) dat
```

The figures show pairs of covariates with no-, high- and super-collinearity.

In the presence of collinearity, the estimate of one variable’s impact on the dependent variable \(Y\) while controlling for the others tends to be less precise than if covariates were uncorrelated with one another. Intuitively this can be explained based on Figure 2.3 which shows the response Y and two highly correlated covariates x1 and x2. The depicted hyperplains for the true (“modell”) and estimated (“geschätzt”) coefficients are deviating from each other. Further we see that the exact position of the estimated hyperplain is stable alongside of the “hedge” of data points (but unstable in the orthogonal direction).

We will now illustrate this numerically by fitting the following two models.

```
<- lm(y~x1+x2,data=dat)
modela <- lm(y~x1+x3,data=dat) modelb
```

We obtain the variances of the estimated regression coefficients. The variances are larger for model b where x1 and x3 are correlated.

```
# variance of model a
diag(vcov(modela))
```

```
## (Intercept) x1 x2
## 0.04276263 0.12382277 0.05039847
```

```
#variance of model b
diag(vcov(modelb))
```

```
## (Intercept) x1 x3
## 0.04663828 1.63334879 1.27557431
```

In a high-dimensional setting where \(p>n\) we always have super-collinearity: the rank of the design matrix cannot exceed \(n\) which implies that the columns of \(\bf X\) are linearly dependent.

## 2.5 Prediction error

So far we investigated the fitted model visually and by inspecting the output of the regression analysis. Another important aspect is to assess how the fitted model generalizes beyond the observed data. We do this by distinguishing between model fitting and model testing, i.e. we assesses the prediction error on separate test data.

We illustrate this by simulating dummy test data.

`set.seed(2)`

```
# simulate test data
<- matrix(rnorm(n*p),n,p)
xtest <- xtest%*%beta+rnorm(n,sd=0.5)
ytest <- data.frame(xtest)
dtest $y <- ytest dtest
```

Next, we use the models obtained based on the training data to make predictions on the test data.

```
<- predict(fit4,newdata = dtest)
pred4 <- predict(fit8,newdata = dtest) pred8
```

We illustrate the predictions using a scatter plot which plots test data (in black) together with predictions from models `fit4`

(in orange) and `fit8`

(in green). The predictions from the model with 8 covariates lie way off from the truth.

We quantify the prediction error with the root-mean-square error (RMSE).

\[ {\bf RMSE}= \sqrt{\frac{\sum_{i=1}^N(y_i-\hat y_i)^2}{N}}\]

The RMSE measures how far off we should expect the prediction of our model to be.
In our dummy data example we can calculate the prediction error using the function `RMSE`

.

`RMSE(pred4,ytest)`

`## [1] 0.724046`

`RMSE(pred8,ytest)`

`## [1] 3.211847`

The RMSE=0.72 for model `fit4`

is close to the inherent error \(\sigma=0.5\). The RMSE=3.21 for model `fit8`

indicates that a predictions deviate on average 3.21 units from the observed values.