3.2 Model formulation and estimation by least squares

The multiple linear model extends the simple linear model by describing the relation between the random variables $X_1,\ldots,X_k$ and $Y$. For example, in the last model for the wine dataset, we had $k=4$ variables, $X_1=$ WinterRain, $X_2=$ AGST, $X_3=$ HarvestRain and $X_4=$ Age, and $Y=$ Price. Therefore, as in Section 2.3, the multiple linear model is constructed by assuming that the linear relation

$$Y=\beta_0+\beta_1X_1+\cdots+\beta_kX_k+\varepsilon \tag{3.1}$$

holds between the predictors $X_1,\ldots,X_k$ and the response $Y$. In (3.1), $\beta_0$ is the intercept and $\beta_1,\ldots,\beta_k$ are the slopes, respectively. $\varepsilon$ is a random variable with mean zero and independent from $X_1,\ldots,X_k$. Another way of looking at (3.1) is

$$\mathbb{E}[Y\,|\,X_1=x_1,\ldots,X_k=x_k]=\beta_0+\beta_1x_1+\cdots+\beta_kx_k, \tag{3.2}$$

since $\mathbb{E}[\varepsilon\,|\,X_1=x_1,\ldots,X_k=x_k]=0$.

The LHS of (3.2) is the conditional expectation of $Y$ given $X_1,\ldots,X_k$. It represents how the mean of the random variable $Y$ changes according to particular values, denoted by $x_1,\ldots,x_k$, of the random variables $X_1,\ldots,X_k$. The RHS states that the mean of $Y$ changes in a linear fashion with respect to the values of $X_1,\ldots,X_k$. Hence the interpretation of the coefficients:

  • $\beta_0$: is the mean of $Y$ when $X_1=\cdots=X_k=0$.
  • $\beta_j$, $1\leq j\leq k$: is the increment in the mean of $Y$ for an increment of one unit in $X_j=x_j$, provided that the remaining variables $X_1,\ldots,X_{j-1},X_{j+1},\ldots,X_k$ do not change (a quick numerical check of this interpretation is given right after this list).
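
The interpretation of $\beta_j$ can be verified numerically. The following quick check, on simulated data made up just for this purpose, shows that increasing x1 by one unit while keeping x2 fixed shifts the estimated mean of the response by exactly the fitted slope of x1:

# Simulated data, for illustration only
set.seed(123)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- -1 + 2 * x1 + 0.5 * x2 + rnorm(100)
mod <- lm(y ~ x1 + x2)

# Predictions at a base point and at the same point with x1 increased by one
# unit (x2 fixed): their difference equals the fitted slope of x1
newdata <- data.frame(x1 = c(0, 1), x2 = c(0.5, 0.5))
diff(predict(mod, newdata = newdata))
mod$coefficients["x1"]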

Figure 3.5 illustrates the geometrical interpretation of a multiple linear model: a plane in the $(k+1)$-dimensional space. If $k=1$, the plane is the regression line for simple linear regression. If $k=2$, then the plane can be visualized in a three-dimensional plot.

Figure 3.5: The least squares regression plane $y=\hat\beta_0+\hat\beta_1x_1+\hat\beta_2x_2$ and its dependence on the kind of squared distance considered.

The estimation of $\beta_0,\beta_1,\ldots,\beta_k$ is done as in simple linear regression, by minimizing the Residual Sum of Squares (RSS). First we need to introduce some helpful matrix notation. In the following, boldface is used to distinguish vectors and matrices from scalars:

  • A sample of $(X_1,\ldots,X_k,Y)$ is $(X_{11},\ldots,X_{1k},Y_1),\ldots,(X_{n1},\ldots,X_{nk},Y_n)$, where $X_{ij}$ denotes the $i$-th observation of the $j$-th predictor $X_j$. We denote by $\mathbf{X}_i=(X_{i1},\ldots,X_{ik})$ the $i$-th observation of $(X_1,\ldots,X_k)$, so the sample simplifies to $(\mathbf{X}_1,Y_1),\ldots,(\mathbf{X}_n,Y_n)$.

  • The design matrix contains all the information of the predictors and a column of ones:

$$\mathbf{X}=\begin{pmatrix}1 & X_{11} & \cdots & X_{1k}\\ \vdots & \vdots & \ddots & \vdots\\ 1 & X_{n1} & \cdots & X_{nk}\end{pmatrix}_{n\times(k+1)}.$$

  • The vector of responses $\mathbf{Y}$, the vector of coefficients $\boldsymbol{\beta}$ and the vector of errors are, respectively¹,

$$\mathbf{Y}=\begin{pmatrix}Y_1\\ \vdots\\ Y_n\end{pmatrix}_{n\times1},\qquad \boldsymbol{\beta}=\begin{pmatrix}\beta_0\\ \beta_1\\ \vdots\\ \beta_k\end{pmatrix}_{(k+1)\times1}\qquad\text{and}\qquad \boldsymbol{\varepsilon}=\begin{pmatrix}\varepsilon_1\\ \vdots\\ \varepsilon_n\end{pmatrix}_{n\times1}.$$

Thanks to the matrix notation, we can turn the sample version of the multiple linear model, namely

$$Y_i=\beta_0+\beta_1X_{i1}+\cdots+\beta_kX_{ik}+\varepsilon_i,\quad i=1,\ldots,n,$$

into something as compact as

$$\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}.$$

Recall that if $k=1$ we have the simple linear model. In this case:

$$\mathbf{X}=\begin{pmatrix}1 & X_{11}\\ \vdots & \vdots\\ 1 & X_{n1}\end{pmatrix}_{n\times2}\qquad\text{and}\qquad\boldsymbol{\beta}=\begin{pmatrix}\beta_0\\ \beta_1\end{pmatrix}_{2\times1}.$$
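
In R, such a design matrix can be built either by hand with cbind() or with model.matrix(), which is also how lm() constructs it from a formula. A minimal sketch, with made-up values:

# Made-up predictor values, just to illustrate the construction
x1 <- c(1.5, 0.2, -0.7)
x2 <- c(0.3, -1.1, 2.0)

# Design matrix by hand (k = 2): a column of ones plus the predictors
cbind(1, x1, x2)

# The same matrix, as built from a formula by model.matrix() (used by lm())
model.matrix(~ x1 + x2)

# The k = 1 case, corresponding to the simple linear model
model.matrix(~ x1)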

The RSS for the multiple linear regression is

$$\text{RSS}(\boldsymbol{\beta})=\sum_{i=1}^n(Y_i-\beta_0-\beta_1X_{i1}-\cdots-\beta_kX_{ik})^2=(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}). \tag{3.3}$$

The RSS aggregates the squared vertical distances from the data to a regression plane given by $\boldsymbol{\beta}$. Remember that the vertical distances are considered because we want to minimize the error in the prediction of $Y$. The least squares estimators are the minimizers of the RSS²:

$$\hat{\boldsymbol{\beta}}=\arg\min_{\boldsymbol{\beta}\in\mathbb{R}^{k+1}}\text{RSS}(\boldsymbol{\beta}).$$

Luckily, thanks to the matrix form of (3.3), it is simple to compute a closed-form expression for the least squares estimates:

$$\hat{\boldsymbol{\beta}}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}. \tag{3.4}$$
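
As a sanity check of the minimization, one can compare the closed form (3.4) with a purely numerical minimization of (3.3); the sketch below uses simulated data made up just for this illustration:

# Simulated data, for illustration only
set.seed(42)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n)) # design matrix with k = 2 predictors
Y <- X %*% c(1, 2, -1) + rnorm(n) # true coefficients (1, 2, -1)

# RSS as a function of beta, as in (3.3)
rss <- function(beta) sum((Y - X %*% beta)^2)

# Numerical minimization of the RSS vs. the closed form (3.4)
# (they agree up to the tolerance of the optimizer)
optim(par = c(0, 0, 0), fn = rss)$par
solve(t(X) %*% X) %*% t(X) %*% Y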

There are some similarities between (3.4) and $\hat\beta_1=(s_x^2)^{-1}s_{xy}$ from the simple linear model: both are related to the covariance between $X$ and $Y$ weighted by the variance of $X$.
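
To make this analogy concrete, for $k=1$ a direct (standard) computation shows that (3.4) reduces to the simple linear regression estimates:

$$\mathbf{X}^T\mathbf{X}=n\begin{pmatrix}1 & \bar{X}\\ \bar{X} & \overline{X^2}\end{pmatrix},\qquad \mathbf{X}^T\mathbf{Y}=n\begin{pmatrix}\bar{Y}\\ \overline{XY}\end{pmatrix},$$

so that

$$\hat{\boldsymbol{\beta}}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}=\begin{pmatrix}\bar{Y}-\hat\beta_1\bar{X}\\ s_{xy}/s_x^2\end{pmatrix},$$

where $\bar{X}$, $\bar{Y}$, $\overline{X^2}$ and $\overline{XY}$ denote sample means, that is, $\hat\beta_1=(s_x^2)^{-1}s_{xy}$ and $\hat\beta_0=\bar{Y}-\hat\beta_1\bar{X}$.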

The data for the illustration in Figure 3.5 has been generated with the following code:

# Generates 50 points from a N(0, 1): predictors and error
set.seed(34567) # Fixes the seed for the random generator
x1 <- rnorm(50)
x2 <- rnorm(50)
x3 <- x1 + rnorm(50, sd = 0.05) # Make variables dependent
eps <- rnorm(50)

# Responses
yLin <- -0.5 + 0.5 * x1 + 0.5 * x2 + eps
yQua <- -0.5 + x1^2 + 0.5 * x2 + eps
yExp <- -0.5 + 0.5 * exp(x2) + x3 + eps

# Data
leastSquares3D <- data.frame(x1 = x1, x2 = x2, x3 = x3, yLin = yLin,
                             yQua = yQua, yExp = yExp)

Let’s check that indeed the coefficients given by lm are the ones given by equation (3.4) for the regression yLin ~ x1 + x2.

# Matrix X
X <- cbind(1, x1, x2)

# Vector Y
Y <- yLin

# Coefficients
beta <- solve(t(X) %*% X) %*% t(X) %*% Y
# %*% multiplies matrices
# solve() computes the inverse of a matrix
# t() transposes a matrix
beta
##          [,1]
##    -0.5702694
## x1  0.4832624
## x2  0.3214894

# Output from lm
mod <- lm(yLin ~ x1 + x2, data = leastSquares3D)
mod$coefficients
## (Intercept)          x1          x2 
##  -0.5702694   0.4832624   0.3214894
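
A numerical side note: inverting $\mathbf{X}^T\mathbf{X}$ explicitly is convenient for the theory, but in practice it is preferable to solve the corresponding linear system directly or to rely on a QR decomposition, which is what lm does internally. A quick sketch with the X and Y defined above:

# Solve the normal equations without explicitly inverting X'X
solve(crossprod(X), crossprod(X, Y))

# Least squares solution via the QR decomposition of X
qr.solve(X, Y)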

Compute $\hat{\boldsymbol{\beta}}$ for the regressions yLin ~ x1 + x2, yQua ~ x1 + x2 and yExp ~ x2 + x3 using:

  • equation (3.4) and
  • the function lm.
Check that the fitted plane and the coefficient estimates are coherent.

Once we have the least squares estimates $\hat{\boldsymbol{\beta}}$, we can define the following two concepts:

  • The fitted values $\hat{Y}_1,\ldots,\hat{Y}_n$, where $\hat{Y}_i=\hat\beta_0+\hat\beta_1X_{i1}+\cdots+\hat\beta_kX_{ik}$, $i=1,\ldots,n$. They are the vertical projections of $Y_1,\ldots,Y_n$ onto the fitted plane (see Figure 3.5). In matrix form, using (3.4), $$\hat{\mathbf{Y}}=\mathbf{X}\hat{\boldsymbol{\beta}}=\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}=\mathbf{H}\mathbf{Y},$$ where $\mathbf{H}=\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is called the hat matrix because it “puts the hat on $\mathbf{Y}$”: it projects $\mathbf{Y}$ onto the regression plane (see Figure 3.5).

  • The residuals (or estimated errors) $\hat\varepsilon_1,\ldots,\hat\varepsilon_n$, where $\hat\varepsilon_i=Y_i-\hat{Y}_i$, $i=1,\ldots,n$. They are the vertical distances between the actual data and the fitted data (a quick numerical check follows this list).
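
As a quick check with the X, Y and mod objects from the yLin ~ x1 + x2 example above, the fitted values and residuals can be recovered from the hat matrix:

# Hat matrix, fitted values and residuals computed by hand
H <- X %*% solve(t(X) %*% X) %*% t(X)
Yhat <- H %*% Y
epsHat <- Y - Yhat

# They coincide, up to numerical error, with the ones stored by lm
max(abs(Yhat - mod$fitted.values))
max(abs(epsHat - mod$residuals))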

We conclude with an insight into the relation between multiple and simple linear regressions, which is illustrated in Figure 3.6.

Consider the multiple linear model $Y=\beta_0+\beta_1X_1+\beta_2X_2+\varepsilon$ and its associated simple linear models $Y=\alpha_0+\alpha_1X_1+\varepsilon$ and $Y=\gamma_0+\gamma_1X_2+\varepsilon$. Assume that we have a sample $(X_{11},X_{12},Y_1),\ldots,(X_{n1},X_{n2},Y_n)$. Then, in general, $\hat\alpha_0\neq\hat\beta_0$, $\hat\alpha_1\neq\hat\beta_1$, $\hat\gamma_0\neq\hat\beta_0$ and $\hat\gamma_1\neq\hat\beta_2$. That is, in general, the inclusion of a new predictor changes the coefficient estimates.


Figure 3.6: The regression plane (blue) and its relation with the simple linear regressions (green lines). The red points represent the sample for $(X_1,X_2,Y)$ and the black points the subsamples for $(X_1,X_2)$ (bottom), $(X_1,Y)$ (left) and $(X_2,Y)$ (right).

The data employed in Figure 3.6 is:

set.seed(212542)
n <- 100
x1 <- rnorm(n, sd = 2)
x2 <- rnorm(n, mean = x1, sd = 3)
y <- 1 + 2 * x1 - x2 + rnorm(n, sd = 1)
data <- data.frame(x1 = x1, x2 = x2, y = y)

With the above data, check how the fitted coefficients change for y ~ x1, y ~ x2 and y ~ x1 + x2.
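
A possible sketch for the comparison:

# Fitted coefficients of the two simple regressions and the multiple one
lm(y ~ x1, data = data)$coefficients
lm(y ~ x2, data = data)$coefficients
lm(y ~ x1 + x2, data = data)$coefficients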


  1. The vectors are regarded as column matrices.

  2. They are unique and always exist.