Chapter 3 Multiple regression and challenges in high dimensions

In this chapter we will review multiple linear regression and in particular the Ordinary Least Squares (OLS) estimator. We will further investigate the challenges that arise in the high-dimensional setting, where the number of covariates is large compared to the number of observations, i.e. \(p\gg n\).

3.1 Multiple regression and ordinary least squares

Given a vector of inputs \(X=(X_1,X_2,\ldots,X_p)\), in multiple regression we predict the output \(Y\) via the linear model:

\[ \hat{Y}=\hat\beta_0+\sum_{j=1}^{p}X_j\hat\beta_j.\]

The term \(\hat\beta_0\) is the intercept. If we include the constant variable 1 in \(X\) and include \(\hat\beta_0\) in the vector of coefficients \(\hat\beta\), then we can write

\[\hat{Y}=X^T\hat\beta.\]

How do we fit the linear model to a set of training data (i.e. how do we obtain the estimator \(\hat\beta\))? We typically use ordinary least squares (OLS), where we pick the coefficients \(\beta\) to minimize the residual sum of squares

\[\begin{align*} {\rm RSS}(\beta)&=\sum_{i=1}^{N}(y_i-x_i^T\beta)^2\\ &=({\bf y} - {\bf X}\beta)^T ({\bf y} - {\bf X}\beta)\\ &=\|{\bf y} - {\bf X}\beta\|^2_2. \end{align*}\]

If the matrix \({\bf X}^T{\bf X}\) is nonsingular, then the solution is given by

\[\hat\beta=({\bf X}^T{\bf X})^{-1}{\bf X}^T{\bf y}.\]

Thus, the prediction at a new input point \(X_{\rm new}\) is

\[\begin{align*} \hat{Y}&=\hat{f}(X_{\rm new})\\ &=X_{\rm new}^T\hat\beta\\ &=X_{\rm new}^T({\bf X}^T{\bf X})^{-1}{\bf X}^T{\bf y}. \end{align*}\]
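As a quick sanity check, the closed-form solution can be compared with the coefficients returned by lm. Below is a minimal sketch on a small simulated data set (the data, sample size and coefficients are hypothetical and only serve as illustration).

# hypothetical example data: design matrix with a column of 1s for the intercept
set.seed(123)
X <- cbind(1, matrix(rnorm(50*2), 50, 2))
y <- drop(X %*% c(1, 2, -1)) + rnorm(50)
# closed-form OLS estimator: (X^T X)^{-1} X^T y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
# compare with lm(): both columns should agree
cbind(beta_hat, coef(lm(y ~ X[, -1])))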

Figures 3.1 and 3.2 show two geometric representations of the OLS estimator. In Figure 3.1 the \(N\) data points \((y_i,x_{i2},\ldots,x_{ip})\) spread randomly around a \((p-1)\)-dimensional hyperplane in a \(p\)-dimensional space; the random scatter occurs only parallel to the y-axis. Figure 3.2 shows a different representation in which the vector \(\bf y\) is a single point in the \(N\)-dimensional space \({\bf R}^N\); the fitted values \(\hat{\bf y}\) are the orthogonal projection of \(\bf y\) onto the \(p\)-dimensional subspace of \({\bf R}^N\) spanned by the vectors \({\bf x}_1,\ldots,{\bf x}_p\).


Figure 3.1: N data points spreading around the (p-1)-dimensional OLS hyperplane.


Figure 3.2: OLS fit as the orthogonal projection of \({\bf y}\) onto the subspace spanned by the covariates.
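In formulas, the orthogonal projection in Figure 3.2 can be expressed through the so-called hat matrix \({\bf H}={\bf X}({\bf X}^T{\bf X})^{-1}{\bf X}^T\):

\[\hat{\bf y}={\bf X}\hat\beta={\bf X}({\bf X}^T{\bf X})^{-1}{\bf X}^T{\bf y}={\bf H}{\bf y},\]

where \({\bf H}\) is symmetric and idempotent and therefore projects \({\bf y}\) orthogonally onto the column space of \({\bf X}\).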

3.2 Overfitting

Overfitting refers to the phenomenon of modelling the noise rather than the signal. If the true model is parsimonious (only a few covariates drive the response \(Y\)) and data on many covariates are available, it is likely that a linear combination of all covariates yields a higher likelihood than a combination of the few that are actually related to the response. As only the few covariates related to the response contain the signal, the model involving all covariates then inevitably explains more than the signal alone: it also models the error. Hence, it overfits the data.

In high-dimensional settings overfitting is a real threat! In the situation where \(p>n\) it is possible to form a linear combination of the covariates that perfectly explains the response, including the noise. In general, large estimates of regression coefficients are often an indication of overfitting. We illustrate overfitting in the next example. We simulate \(n=10\) training data points. We take \(p=15\) and \(X_{i1},\ldots,X_{ip}\) i.i.d. \(N(0,1)\). The response depends only on the first covariate, i.e. \(Y_i=\beta_1 X_{i1}+\epsilon_i\), where \(\beta_1=2\) and the \(\epsilon_i\) are i.i.d. normal with mean 0 and standard deviation \(\sigma=0.5\).

set.seed(1)
n <- 10                  # number of training observations
p <- 15                  # number of covariates
beta <- c(2,rep(0,p-1))  # only the first covariate carries signal

# simulate covariates and response
xtrain <- matrix(rnorm(n*p),n,p)
ytrain <- xtrain%*%beta+rnorm(n,sd=0.5)
dtrain <- data.frame(xtrain)
dtrain$y <- ytrain

We start by fitting a linear regression model with only the first covariate.

fit1 <- lm(y~X1,data=dtrain)

We inspect the model fit with summary; the regression coefficients alone can be extracted using coef.

summary(fit1)
## 
## Call:
## lm(formula = y ~ X1, data = dtrain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59574 -0.41567 -0.06222  0.18490  0.97592 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.1002     0.1785  -0.561     0.59    
## X1            1.8070     0.2373   7.614 6.22e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5558 on 8 degrees of freedom
## Multiple R-squared:  0.8787, Adjusted R-squared:  0.8636 
## F-statistic: 57.97 on 1 and 8 DF,  p-value: 6.223e-05

Next, we plot the fitted values against the first covariate and add a reference line representing the “true” relationship.
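A minimal sketch of such a plot, using the objects defined above (colours and plotting symbols chosen for illustration), could look as follows.

# data points, fitted values of fit1, and the “true” line y = 2*x1
plot(dtrain$X1, dtrain$y, pch = 16, xlab = "X1", ylab = "y")    # data (black dots)
points(dtrain$X1, fitted(fit1), col = "red", pch = 17)          # fitted values (red triangles)
abline(a = 0, b = 2, col = "blue")                              # true relationship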

We add three “noise” covariates to the model.

fit4 <- lm(y~X1+X2+X3+X4,data=dtrain)
summary(fit4)
## 
## Call:
## lm(formula = y ~ X1 + X2 + X3 + X4, data = dtrain)
## 
## Residuals:
##        1        2        3        4        5        6        7        8 
## -0.08607 -0.08669 -0.18732 -0.12641 -0.37113 -0.25548  0.70950  0.29125 
##        9       10 
## -0.68858  0.80092 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -0.09973    0.21924  -0.455  0.66826   
## X1           2.07658    0.40301   5.153  0.00361 **
## X2          -0.11117    0.25161  -0.442  0.67707   
## X3           0.29805    0.37543   0.794  0.46325   
## X4           0.25982    0.27471   0.946  0.38767   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6281 on 5 degrees of freedom
## Multiple R-squared:  0.9032, Adjusted R-squared:  0.8257 
## F-statistic: 11.66 on 4 and 5 DF,  p-value: 0.009503

The fitted values (red triangles) start to deviate from the truth (blue line).

With a total of 8 covariates the fitted values get closer to the data points (black dots). This means the fitted model also captures the deviations from the truth (blue line).

fit8 <- lm(y~X1+X2+X3+X4+X5+X6+X7+X8,data=dtrain)

We investigate the model fit.

summary(fit8)
## 
## Call:
## lm(formula = y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data = dtrain)
## 
## Residuals:
##        1        2        3        4        5        6        7        8 
##  0.32017 -0.26990 -0.21987  0.10342 -0.02651  0.06748  0.31531 -0.05819 
##        9       10 
## -0.24516  0.01324 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.25168    0.33931  -0.742    0.594
## X1           0.07841    2.14373   0.037    0.977
## X2           0.31719    0.46470   0.683    0.619
## X3          -0.73848    1.36553  -0.541    0.684
## X4           0.72203    0.83656   0.863    0.547
## X5           0.03598    0.95115   0.038    0.976
## X6          -0.57347    0.66831  -0.858    0.549
## X7          -0.37493    0.46452  -0.807    0.568
## X8          -1.46034    1.59592  -0.915    0.528
## 
## Residual standard error: 0.6346 on 1 degrees of freedom
## Multiple R-squared:  0.9802, Adjusted R-squared:  0.8221 
## F-statistic: 6.199 on 8 and 1 DF,  p-value: 0.3015

We note that many regression coefficients are clearly over-estimated in absolute value (the true coefficients are \(\beta = (2, 0, \ldots, 0)\)). The model “overfits” the data.

Finally we fit the model with all 15 covariates.

# fit linear regression with all 15 covariates (intercept included by default)
fit15 <- lm(y~.,data=dtrain)

The fitted values perfectly match the data.

We investigate the model fit.

summary(fit15)
## 
## Call:
## lm(formula = y ~ ., data = dtrain)
## 
## Residuals:
## ALL 10 residuals are 0: no residual degrees of freedom!
## 
## Coefficients: (6 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01592        NaN     NaN      NaN
## X1          -0.50138        NaN     NaN      NaN
## X2           0.81492        NaN     NaN      NaN
## X3          -0.56052        NaN     NaN      NaN
## X4           0.72667        NaN     NaN      NaN
## X5           1.84831        NaN     NaN      NaN
## X6           0.05759        NaN     NaN      NaN
## X7          -1.21460        NaN     NaN      NaN
## X8          -1.30908        NaN     NaN      NaN
## X9          -1.39005        NaN     NaN      NaN
## X10               NA         NA      NA       NA
## X11               NA         NA      NA       NA
## X12               NA         NA      NA       NA
## X13               NA         NA      NA       NA
## X14               NA         NA      NA       NA
## X15               NA         NA      NA       NA
## 
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:    NaN 
## F-statistic:   NaN on 9 and 0 DF,  p-value: NA

Clearly something went wrong. Many entries in the table of coefficients are not available. One message in the output says that there are no residual degrees of freedom. Furthermore, 6 coefficients cannot be calculated because of “singularities.” In fact, the OLS estimator as introduced before is not well defined, because the design matrix \({\bf X}\) is rank deficient (\({\rm rank}({\bf X})=n< p\)) and therefore the matrix \({\bf X}^T {\bf X}\) is singular.

x <- model.matrix(fit15)
det(t(x)%*%x)  # numerically zero: the matrix X^T X is singular
## [1] -2.8449e-81

A useful quantity in the context of overfitting is the residual degrees of freedom, defined as \(n-{\rm df}\), where \({\rm df}\) denotes the degrees of freedom of the model. For a linear regression model with intercept we have \({\rm df}=p+1\). A rule of thumb to avoid overfitting is that the residual degrees of freedom \(n-{\rm df}\) should be at least \(5\) (be aware that this is not a generally applicable rule!). In our example with \(n=10\) this rule would allow at most \(p=4\) covariates.
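In R the residual degrees of freedom of a fitted model can be read off directly with df.residual; for the models above we obtain the following (values implied by \(n=10\) and the number of estimated coefficients).

# residual degrees of freedom n - df of the fitted models
df.residual(fit1)    # 10 - 2 = 8
df.residual(fit4)    # 10 - 5 = 5
df.residual(fit8)    # 10 - 9 = 1
df.residual(fit15)   # 0: no residual degrees of freedom left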

3.3 Collinearity

Recall that collinearity in regression analysis refers to the situation where two (or more) covariates are strongly linearly related. The case of two (or more) covariates being perfectly linearly dependent is referred to as super-collinearity.

For illustration we generate some data.

set.seed(1315)
n <- 20
x1 <- rnorm(n)
x2 <- rnorm(n)             # independent of x1 (no collinearity)
x3 <- x1+rnorm(n,sd=0.25)  # strongly correlated with x1 (high collinearity)
x4 <- x1                   # exact copy of x1 (super-collinearity)
dat <- data.frame(x1,x2,x3,x4)
dat$y <- 2*dat$x1+rnorm(n)

The figures show pairs of covariates with no, high and super-collinearity.
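The degree of collinearity can also be quantified by the pairwise correlations of the covariates; a minimal check on the simulated data:

# pairwise correlations: x1-x2 close to 0, x1-x3 close to 1, x1-x4 exactly 1
round(cor(dat[, c("x1", "x2", "x3", "x4")]), 2)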

In the presence of collinearity, the estimate of one covariate’s effect on the response \(Y\) while controlling for the others tends to be less precise than if the covariates were uncorrelated with one another. Intuitively this can be explained based on Figure 3.3, which shows the response Y and two highly correlated covariates x1 and x2. The depicted hyperplanes for the true (labelled “Modell”, i.e. model) and estimated (labelled “geschätzt”) coefficients deviate from each other. Further, we see that the position of the estimated hyperplane is stable along the narrow “hedge” of data points, but unstable in the direction orthogonal to it.


Figure 3.3: This Figure illustrates OLS regression in the context of covariates with high collinearity.

We will now illustrate this numerically by fitting the following two models.

modela <- lm(y~x1+x2,data=dat)
modelb <- lm(y~x1+x3,data=dat)

We obtain the variances of the estimated regression coefficients. The variances are larger for model b where x1 and x3 are correlated.

# variance of model a
diag(vcov(modela))
## (Intercept)          x1          x2 
##  0.04276263  0.12382277  0.05039847
# variance of model b
diag(vcov(modelb))
## (Intercept)          x1          x3 
##  0.04663828  1.63334879  1.27557431

In a high-dimensional setting where \(p>n\) we always have super-collinearity: the rank of the design matrix cannot exceed \(n\), which implies that the columns of \({\bf X}\) are linearly dependent.
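This can be checked directly on the training data from the overfitting example, where \(p=15\) exceeds \(n=10\) (a minimal sketch using the xtrain matrix simulated earlier):

# the rank of the n x p design matrix is at most n
qr(xtrain)$rank   # equals n = 10, which is smaller than p = 15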

3.4 Prediction error

So far we have investigated the fitted model visually and by inspecting the output of the regression analysis. Another important aspect is to assess how the fitted model generalizes beyond the observed data. We do this by distinguishing between model fitting and model testing, i.e. we assess the prediction error on separate test data.


Figure 3.4: Training and Test data

To illustrate this with our dummy data example we simulate test data.

set.seed(2)
# simulate test data (note: n was reassigned to 20 in the collinearity example)
xtest <- matrix(rnorm(n*p),n,p)
ytest <- xtest%*%beta+rnorm(n,sd=0.5)
dtest <- data.frame(xtest)
dtest$y <- ytest

Next, we use the models obtained based on the training data to make predictions on the test data.

pred4 <- predict(fit4,newdata = dtest)
pred8 <- predict(fit8,newdata = dtest)

We illustrate the predictions using a scatter plot which shows the test data (in black) together with the predictions from models fit4 (in orange) and fit8 (in green). The predictions from the model with 8 covariates lie far off from the truth.
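A minimal sketch of such a plot against the first covariate, using the test data and predictions defined above (colours as described), could be:

# test data together with predictions from fit4 and fit8
plot(dtest$X1, dtest$y, pch = 16, xlab = "X1", ylab = "y")   # test data (black)
points(dtest$X1, pred4, col = "orange", pch = 17)            # predictions from fit4
points(dtest$X1, pred8, col = "green", pch = 17)             # predictions from fit8
abline(a = 0, b = 2, col = "blue")                           # true relationship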

We quantify the prediction error with the root-mean-square error (RMSE).

\[ {\rm RMSE}= \sqrt{\frac{\sum_{i=1}^N(y_i-\hat y_i)^2}{N}}\]

The RMSE measures how far off, on average, we should expect the predictions of our model to be. In our dummy data example we can calculate the prediction error using the function RMSE.
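The function RMSE is provided, for example, by the caret package; if no such package is loaded, a minimal definition consistent with the calls below is:

# root-mean-square error: first argument predictions, second argument observed values
RMSE <- function(pred, obs) sqrt(mean((obs - pred)^2))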

RMSE(pred4,ytest)
## [1] 0.724046
RMSE(pred8,ytest)
## [1] 3.211847

The RMSE of 0.72 for model fit4 is close to the inherent error \(\sigma=0.5\). The RMSE of 3.21 for model fit8 indicates that the predictions deviate on average by 3.21 units from the observed values.