# Chapter 3 Multivariate regression and challenges in high-dimensions

In this chapter we will review multivariate regression and in particular the *Ordinary Least Squares* (OLS) estimator. We will further illustrate challenges in high dimensions.

## 3.1 Ordinary least squares

Given a vector of inputs \(X=(X_1,X_2,\ldots,X_p)\), in multivariate regression we predict the output \(Y\) via the linear model:

\[ \hat{Y}=\sum_{j=1}^{p}X_j\hat\beta_j.\]

The term \(\beta_0\) is the intercept. If we include the constant variable 1 in \(X\), include \(\hat\beta_0\) in the vector of coefficients \(\hat\beta\), then we can write

\[\hat{Y}=X^T\hat\beta.\]

How do we fit the linear model to a set of training data (i.e. how do we obtain the estimator \(\hat \beta\))? We typically use *ordinary least squares* (OLS) where we pick the coefficient \(\beta\) to minimize the residual sum of squares

\[\begin{align*} \rm{RSS}(\beta)&=\sum_{i=1}^{N}(y_i-x_i^T\beta)^2\\ &=(\bf y - \bf X \beta)^T (\bf y - \bf X \beta)\\ &=\|\bf y - \bf X \beta\|^2_2. \end{align*}\]

If the matrix \(\bf X^T \bf X\) is nonsingular, then the solution is given by

\[\hat\beta=(\bf X^T \bf X)^{-1}\bf X^T \bf y.\]

Thus, the learning rule at the new input point \(X_{\rm new}\) is

\[\begin{align*} \hat{Y}&=\hat{f}(X_{\rm new})\\ &=X_{\rm new}^T\hat\beta\\ &=X_{\rm new}^T(\bf X^T \bf X)^{-1}\bf X^T \bf y.\\ \end{align*}\]

Figures 3.1 and 3.2 show two geometric representations of the OLS estimator. In Figure 3.1 the \(N\) data points \((y_i,x_{i2},\ldots,x_{ip})\) randomly spread around a \((p-1)\)-dimensional hyperplane in a \(p\)-dimensional space; the random spread only occurs parallel to the y-axis. Figure 3.2 shows a different representation where the vector \(\bf y\) is a single point in the \(N\)-dimensional space \({\bf R}^N\); the fitted values \(\hat {\bf y}\) are the orthogonal projection onto the \(p\)-dimensional subspace of \({\bf R}^N\) spanned by the vectors \({\bf x}_1,\ldots,{\bf x}_p\).

## 3.2 High-dimensionality

The field of high-dimensional statistics deals with inference in situations where the model complexity is large compared to the number of independent observations. In the multivariate regression setup this is often paraphrased as \(p>>n\), referring to the situation where we have more covariates than observations. The challenge with high-dimensions is manifested in *overfitting* and *collinearity*.

### 3.2.1 Overfitting

Overfitting refers to the phenomenon of modelling the noise rather than the signal. In case the true model is parsimonious (few covariates driving the response \(Y\)) and data on many covariates are available, it is likely that a linear combination of all covariates yields a higher likelihood than a combination of the few that are actually related to the response. As only the few covariates related to the response contain the signal, the model involving all covariates then cannot but explain more than the signal alone: it also models the error. Hence, it overfits the data.

In high-dimensional settings overfitting is a real threat! In the situation where \(p>n\) it is possible to form a linear combination of the covariates that perfectly explains the response, including the noise. In general, large estimates of regression coefficients are often an indication of overfitting. We illustrate overfitting in the next example. We simulate \(n=10\) training data points. We take \(p=9\) and \(X_{i1},\ldots,X_{ip}\) i.i.d \(N(0,1)\). The response depends only on the first covariate, i.e. \(Y_i=\beta_1 X_{i1}+\epsilon_i\), where \(\beta_1=2\) and \(\epsilon_i\) i.i.d \(N(0,0.5)\).

```
set.seed(1)
<- 10
n <- 9
p <- c(2,rep(0,p-1))
beta
# simulate covariates
<- matrix(rnorm(n*p),n,p)
xtrain <- xtrain%*%beta+rnorm(n,sd=0.5)
ytrain <- data.frame(xtrain)
dtrain $y <- ytrain dtrain
```

We start by fitting a linear regression model with only the first covariate.

`<- lm(y~0+X1,data=dtrain) fit1 `

Next, we plot the fitted values against the values of the first covariate and we add the reference line representing the “true” relationship.

We add two “noise” covariates to the model.

`<- lm(y~X1+X2+X3,data=dtrain) fit3 `

The fitted values (red triangles) start to deviate from the truth (blue line).

With a total of 5 covariates the fitted values get closer to the data points (black dots). This means the fitted model also captures the deviations from the truth (blue line).

`<- lm(y~0+X1+X2+X3+X4+X5,data=dtrain) fit5 `

Finally we fit the model with all 9 covariates.

```
# fit linear regression (all 9 covariates, intercept excluded for this illustration)
<- lm(y~-1+.,data=dtrain) fit9
```

The fitted values perfectly match the data.

We investigated estimated regression coefficients.

`coef(fit9)`

```
## X1 X2 X3 X4 X5 X6 X7
## 4.0809328 -1.1527744 1.0301885 -1.0541565 -1.7119016 -0.3461442 1.4309021
## X8 X9
## 0.9209326 1.4237864
```

We note that many regression coefficient are clearly *over-estimated* (the true coefficients are \(\beta = (2, 0, \ldots, 0)\)). The model overfits the data. In other words, the *residual degrees of freedom* which remains to estimate the error is \(n-p\) =1. A common rule of thumb to avoid overfitting is that the residual degrees of freedom should be at least \(5\) (be aware that this is not a general applicable rule!).

### 3.2.2 Collinearity

Recall that *collinearity* in regression analysis refers to the event of two (or multiple) covariates being strongly linearly related. The case of two (or multiple) covariates being perfectly linearly dependent is referred to as *super-collinearity*.

For illustration we generate some data.

```
set.seed(1315)
<- 20
n <- rnorm(n)
x1 <- rnorm(n)
x2 <- x1+rnorm(n,sd=0.25)
x3 <- x1
x4 <- data.frame(x1,x2,x3,x4)
dat $y <- 2*dat$x1+rnorm(n) dat
```

The figures show pairs of covariates with no-, high- and super-collinearity.

In the presence of collinearity, the estimate of one variable’s impact on the dependent variable \(Y\) while controlling for the others tends to be less precise than if covariates were uncorrelated with one another. Intuitively this can be explained based on Figure 3.3 which shows the response Y and two highly correlated covariates x1 and x2. The depicted hyperplains for the true (“modell”) and estimated (“geschätzt”) coefficients are deviating from each other. Further we see that the exact position of the estimated hyperplain is stable alongside of the “hedge” of data points (but unstable in the orthogonal direction).

We will now illustrate this numerically by fitting the following two models.

```
<- lm(y~x1+x2,data=dat)
modela <- lm(y~x1+x3,data=dat) modelb
```

We obtain the variances of the estimated regression coefficients. The variances are larger for model b where x1 and x3 are correlated.

```
# variance of model a
diag(vcov(modela))
```

```
## (Intercept) x1 x2
## 0.04276263 0.12382277 0.05039847
```

```
#variance of model b
diag(vcov(modelb))
```

```
## (Intercept) x1 x3
## 0.04663828 1.63334879 1.27557431
```

If we would also include x4 into the model (scenario of *super-collinearity*), then the OLS estimate would not be well defined (rank of the design matrix is 0 and therefore the matrix \({\bf X}^T {\bf X}\) is singular).

```
<- as.matrix(dat[,-5])
x det(t(x)%*%x)
```

`## [1] 0`

High-dimensional problems with \(p>n\) always suffer from super-collinearity: the rank of the design matrix cannot exceed \(n\) which implies that the columns of \(\bf X\) are linearly dependent. Therefore the OLS estimator cannot be calculated in this situation.