2.2 Model formulation and least squares

In order to simplify the introduction of the foundations of the linear model, we first present the simple linear model and then extend it to the multiple linear model.

2.2.1 Simple linear model

The simple linear model is constructed by assuming that the linear relation

\[
Y = \beta_0 + \beta_1 X + \varepsilon \tag{2.1}
\]

holds between X and Y. In (2.1), β0 and β1 are known as the intercept and slope, respectively. The random variable ε has mean zero and is independent of X. It describes the error around the mean, or the effect of other variables that we do not model. Another way of looking at (2.1) is

\[
\mathbb{E}[Y \mid X = x] = \beta_0 + \beta_1 x, \tag{2.2}
\]

since \(\mathbb{E}[\varepsilon \mid X = x] = 0\).

The LHS of (2.2) is the conditional expectation of Y given X. It represents how the mean of the random variable Y is changing according to a particular value x of the random variable X. With the RHS, what we are saying is that the mean of Y is changing in a linear fashion with respect to the value of X. Hence the clear interpretation of the coefficients:

  • β0: is the mean of Y when X=0.
  • β1: is the increment in the mean of Y for a one-unit increment in X (see the short derivation below).
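
This interpretation of β1 follows directly from (2.2): increasing x by one unit changes the conditional mean of Y by exactly β1, since

\[
\mathbb{E}[Y \mid X = x + 1] - \mathbb{E}[Y \mid X = x] = \big(\beta_0 + \beta_1 (x + 1)\big) - \big(\beta_0 + \beta_1 x\big) = \beta_1.
\]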

If we have a sample (X1,Y1),…,(Xn,Yn) of our random variables X and Y, we can estimate the unknown coefficients β0 and β1. A possible way of doing so is by imposing an optimality criterion, for example the minimization of the Residual Sum of Squares (RSS):

\[
\mathrm{RSS}(\beta_0, \beta_1) := \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2.
\]

In other words, we look for the estimators (β^0,β^1) such that

\[
(\hat\beta_0, \hat\beta_1) = \arg\min_{(\beta_0, \beta_1) \in \mathbb{R}^2} \mathrm{RSS}(\beta_0, \beta_1).
\]

The motivation for minimizing the RSS is geometrical, as shown by Figure 2.3: we aim to minimize the sum of the squared vertical distances from the data points to the line determined by (β^0,β^1).

Figure 2.3: The effect of the kind of distance in the error criterion. The choices of intercept and slope that minimize the sum of squared distances for one kind of distance are not the optimal choices for a different kind of distance. Application available here.

It can be seen that the minimizers of the RSS¹⁷ are

\[
\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}, \quad \hat\beta_1 = \frac{s_{xy}}{s_x^2}, \tag{2.3}
\]

where:

  • \(\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i\) is the sample mean.
  • \(s_x^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2\) is the sample variance.¹⁸
  • \(s_{xy} = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})\) is the sample covariance. It measures the degree of linear association between \(X_1, \ldots, X_n\) and \(Y_1, \ldots, Y_n\). Once scaled by \(s_x s_y\), it gives the sample correlation coefficient \(r_{xy} = \frac{s_{xy}}{s_x s_y}\).

There are some important points hidden behind the choice of the RSS as the error criterion for obtaining (β^0,β^1):

  • Why the vertical distances and not horizontal or perpendicular? Because we want to minimize the error in the prediction of Y! Note that the treatment of the variables is not symmetrical.
  • Why the squares in the distances and not the absolute values? Due to mathematical convenience. Squares are easy to differentiate and are closely related to maximum likelihood estimation under the normal distribution (see Appendix A.2 and the sketch below).
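
To make the second point more concrete, here is the key step of the argument (a brief sketch; the details are in Appendix A.2). If we additionally assume that ε ~ N(0, σ²), then, conditionally on X1,…,Xn, the log-likelihood of the sample is

\[
\ell(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2 = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{\mathrm{RSS}(\beta_0, \beta_1)}{2\sigma^2},
\]

so, for any fixed σ², maximizing the likelihood in (β0, β1) is equivalent to minimizing the RSS.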

Let’s see how to automatically obtain the minimizers of the error in Figure 2.3 with the lm (linear model) function. The data in the figure was generated with the following code:

# Generates 50 points from a N(0, 1): predictor and error
set.seed(34567)
x <- rnorm(n = 50)
eps <- rnorm(n = 50)

# Responses
yLin <- -0.5 + 1.5 * x + eps
yQua <- -0.5 + 1.5 * x^2 + eps
yExp <- -0.5 + 1.5 * 2^x + eps

# Data
leastSquares <- data.frame(x = x, yLin = yLin, yQua = yQua, yExp = yExp)

For a simple linear model, lm has the syntax lm(formula = response ~ predictor, data = data), where response and predictor are the names of two variables in the data frame data. Note that the LHS of ~ represents the response and the RHS the predictors.

# Call lm
lm(yLin ~ x, data = leastSquares)
## 
## Call:
## lm(formula = yLin ~ x, data = leastSquares)
## 
## Coefficients:
## (Intercept)            x  
##     -0.6154       1.3951
lm(yQua ~ x, data = leastSquares)
## 
## Call:
## lm(formula = yQua ~ x, data = leastSquares)
## 
## Coefficients:
## (Intercept)            x  
##      0.9710      -0.8035
lm(yExp ~ x, data = leastSquares)
## 
## Call:
## lm(formula = yExp ~ x, data = leastSquares)
## 
## Coefficients:
## (Intercept)            x  
##       1.270        1.007

# The lm object
mod <- lm(yLin ~ x, data = leastSquares)
mod
## 
## Call:
## lm(formula = yLin ~ x, data = leastSquares)
## 
## Coefficients:
## (Intercept)            x  
##     -0.6154       1.3951

# We can produce a plot with the linear fit easily
plot(x, yLin)
abline(coef = mod$coefficients, col = 2)

Figure 2.4: Linear fit for the data employed in Figure 2.3 minimizing the RSS.


# Access coefficients with $coefficients
mod$coefficients
## (Intercept)           x 
##  -0.6153744   1.3950973

# Compute the minimized RSS
sum((yLin - mod$coefficients[1] - mod$coefficients[2] * x)^2)
## [1] 41.02914
sum(mod$residuals^2)
## [1] 41.02914

# mod is a list of objects whose names are
names(mod)
##  [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"        "qr"           
##  [8] "df.residual"   "xlevels"       "call"          "terms"         "model"
Check that you cannot improve the error in Figure 2.3 when using the coefficients given by lm, if vertical distances are selected. Check also that these coefficients are only optimal for vertical distances.
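
One way to begin checking the first claim numerically (a small sketch, not a full solution of the exercise) is to verify that perturbing the lm coefficients only increases the RSS:

# RSS at the lm solution vs. at a slightly perturbed pair of coefficients
rssLm <- sum(mod$residuals^2)
rssPerturbed <- sum((yLin - (mod$coefficients[1] + 0.1) -
                       (mod$coefficients[2] - 0.1) * x)^2)
c(lm = rssLm, perturbed = rssPerturbed)  # the perturbed RSS is larger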

An interesting exercise is to check that lm is actually implementing the estimates given in (2.3):

# Covariance
Sxy <- cov(x, yLin)

# Variance
Sx2 <- var(x)
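# Note: cov() and var() use the unbiased denominator n - 1 rather than the n
# in (2.3), but this factor cancels in the ratio Sxy / Sx2 below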

# Coefficients
beta1 <- Sxy / Sx2
beta0 <- mean(yLin) - beta1 * mean(x)
c(beta0, beta1)
## [1] -0.6153744  1.3950973

# Output from lm
mod <- lm(yLin ~ x, data = leastSquares)
mod$coefficients
## (Intercept)           x 
##  -0.6153744   1.3950973
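
The closed-form expressions are not strictly needed to obtain the estimates: we can also minimize the RSS numerically. A minimal sketch with optim() (the starting values c(0, 0) are arbitrary):

# Numerical minimization of the RSS as a function of (beta0, beta1)
rss <- function(beta) sum((yLin - beta[1] - beta[2] * x)^2)
optim(par = c(0, 0), fn = rss)$par  # approximately the coefficients above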

The population regression coefficients, (β0,β1), are not the same as the estimated regression coefficients, (β^0,β^1):

  • (β0,β1) are the theoretical and always unknown quantities (except under controlled scenarios).
  • (β^0,β^1) are the estimates computed from the data. They are random variables, since they are computed from the random sample (X1,Y1),…,(Xn,Yn).

In an abuse of notation, the term regression line is often used to denote both the theoretical (y=β0+β1x) and the estimated (y=β^0+β^1x) regression lines.
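
To emphasize the random nature of (β^0,β^1), the following sketch (with freshly simulated samples, not part of the chapter's datasets, and an arbitrary seed) refits the model on several samples drawn from the population model Y = -0.5 + 1.5X + ε used above and shows how the estimates fluctuate around (β0,β1) = (-0.5, 1.5):

# Coefficient estimates from four independent samples of size n = 50
set.seed(12345)
t(replicate(4, {
  xNew <- rnorm(n = 50)
  yNew <- -0.5 + 1.5 * xNew + rnorm(n = 50)
  lm(yNew ~ xNew)$coefficients
}))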

2.2.2 Case study application

Let’s get back to the wine dataset and compute some simple linear regressions. Before doing so, let’s summarize the information in Table 2.1 to get a grasp of the structure of the data. For that, we first import the dataset into R:

# Read data
wine <- read.table(file = "wine.csv", header = TRUE, sep = ",")

Now we can conduct a quick exploratory analysis to gain insight into the data:

# Numerical -- marginal distributions
summary(wine)
##       Year          Price         WinterRain         AGST        HarvestRain         Age          FrancePop    
##  Min.   :1952   Min.   :6.205   Min.   :376.0   Min.   :14.98   Min.   : 38.0   Min.   : 3.00   Min.   :43184  
##  1st Qu.:1960   1st Qu.:6.508   1st Qu.:543.5   1st Qu.:16.15   1st Qu.: 88.0   1st Qu.: 9.50   1st Qu.:46856  
##  Median :1967   Median :6.984   Median :600.0   Median :16.42   Median :123.0   Median :16.00   Median :50650  
##  Mean   :1967   Mean   :7.042   Mean   :608.4   Mean   :16.48   Mean   :144.8   Mean   :16.19   Mean   :50085  
##  3rd Qu.:1974   3rd Qu.:7.441   3rd Qu.:705.5   3rd Qu.:17.01   3rd Qu.:185.5   3rd Qu.:22.50   3rd Qu.:53511  
##  Max.   :1980   Max.   :8.494   Max.   :830.0   Max.   :17.65   Max.   :292.0   Max.   :31.00   Max.   :55110

# Graphical -- pairwise relations with linear and "smooth" regressions
car::scatterplotMatrix(wine, col = 1, regLine = list(col = 2),
                       smooth = list(col.smooth = 4, col.spread = 4))

Figure 2.5: Scatterplot matrix for the wine dataset. The diagonal plots show density estimators of the pdf of each variable (see Section 6.1.2). The (i,j)-th scatterplot shows the data of Xi vs. Xj, where the red line is the regression line of Xi (response) on Xj (predictor) and the blue curve represents a smoother that estimates nonparametrically the regression function of Xi on Xj (see Section 6.2). The dashed blue curves are the confidence intervals associated to the nonparametric smoother.

As we can see, Year and FrancePop are very dependent, and Year and Age are perfectly dependent. This is so because Age = 1983 - Year. Therefore, we opt to remove the predictor Year and use it to set the case names, which can be helpful later for identifying outliers:
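
A quick way to double-check this relation in the data frame just imported (a one-line sanity check):

# Age and Year are deterministically related: Age = 1983 - Year
with(wine, all(Age == 1983 - Year))  # should be TRUE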

# Set row names to Year -- useful for outlier identification
row.names(wine) <- wine$Year
wine$Year <- NULL

Remember that the objective is to predict Price. Based on the scatterplot matrix above, the most promising single predictors of Price in a simple linear regression seem to be AGST and HarvestRain. Let’s see which one yields the higher R2, which, as we will see in Section 2.7.1, is indicative of the performance of the linear model.

# Price ~ AGST
modAGST <- lm(Price ~ AGST, data = wine)

# Summary of the model
summary(modAGST)
## 
## Call:
## lm(formula = Price ~ AGST, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78370 -0.23827 -0.03421  0.29973  0.90198 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.5469     2.3641  -1.500 0.146052    
## AGST          0.6426     0.1434   4.483 0.000143 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4819 on 25 degrees of freedom
## Multiple R-squared:  0.4456, Adjusted R-squared:  0.4234 
## F-statistic: 20.09 on 1 and 25 DF,  p-value: 0.0001425

# The summary is also an object
sumModAGST <- summary(modAGST)
names(sumModAGST)
##  [1] "call"          "terms"         "residuals"     "coefficients"  "aliased"       "sigma"         "df"           
##  [8] "r.squared"     "adj.r.squared" "fstatistic"    "cov.unscaled"

# R^2
sumModAGST$r.squared
## [1] 0.4455894

Complete the analysis by computing the linear models Price ~ FrancePop, Price ~ Age, Price ~ WinterRain, and Price ~ HarvestRain. Name them as modFrancePop, modAge, modWinterRain, and modHarvestRain. Obtain their R2’s and display them in a table like:

  Predictor     R2
  AGST          0.4456
  HarvestRain   0.2572
  FrancePop     0.2314
  Age           0.2120
  WinterRain    0.0181
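
If you want to verify your table, one compact way of computing all these R2's at once is sketched below (base R only; the values should match the table up to rounding):

# Fit Price ~ predictor for each candidate predictor and extract its R^2
predictors <- c("AGST", "HarvestRain", "FrancePop", "Age", "WinterRain")
r2 <- sapply(predictors, function(pred) {
  summary(lm(as.formula(paste("Price ~", pred)), data = wine))$r.squared
})
round(sort(r2, decreasing = TRUE), 4)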

It seems that none of these simple linear models properly explains Price on its own. Intuitively, it would make sense to combine them to achieve a better explanation of Price. Let’s see how to do that with a more advanced model.

2.2.3 Multiple linear model

The multiple linear model extends the simple linear model by describing the relation between several random variables X1,…,Xp¹⁹ and Y. Therefore, as before, the multiple linear model is constructed by assuming that the linear relation

\[
Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon \tag{2.4}
\]

holds between the predictors X1,…,Xp and the response Y. In (2.4), β0 is the intercept and β1,…,βp are the slopes. The random variable ε has mean zero and is independent of X1,…,Xp. Another way of looking at (2.4) is

\[
\mathbb{E}[Y \mid X_1 = x_1, \ldots, X_p = x_p] = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p, \tag{2.5}
\]

since \(\mathbb{E}[\varepsilon \mid X_1 = x_1, \ldots, X_p = x_p] = 0\).

The LHS of (2.5) is the conditional expectation of Y given X1,…,Xp. It represents how the mean of the random variable Y is changing, now according to particular values of several predictors. With the RHS, what we are saying is that the mean of Y is changing in a linear fashion with respect to the values of X1,…,Xp. Hence the neat interpretation of the coefficients:

  • β0: is the mean of Y when X1=⋯=Xp=0.
  • βj, 1≤j≤p: is the increment in the mean of Y for a one-unit increment in Xj, provided that the rest of the predictors X1,…,Xj−1,Xj+1,…,Xp remain constant (see the identity below).
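
As in the simple model, this reading follows from (2.5): increasing only the \(j\)-th predictor by one unit while keeping the others fixed changes the conditional mean by exactly \(\beta_j\),

\[
\mathbb{E}[Y \mid X_1 = x_1, \ldots, X_j = x_j + 1, \ldots, X_p = x_p] - \mathbb{E}[Y \mid X_1 = x_1, \ldots, X_j = x_j, \ldots, X_p = x_p] = \beta_j.
\]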

Figure 2.6 illustrates the geometrical interpretation of a multiple linear model: a hyperplane in \(\mathbb{R}^{p+1}\). If p=1, the hyperplane is actually a line, the regression line for simple linear regression. If p=2, then the regression plane can be visualized in a 3-dimensional plot.

The estimation of β0,β1,,βp is done as in simple linear regression by minimizing the RSS, which now accounts for the sum of squared distances of the data to the vertical projections on the hyperplane. Before doing so, we need to introduce some helpful matrix notation:

  • A sample of \((X_1, \ldots, X_p, Y)\) is denoted by \(\{(X_{i1}, \ldots, X_{ip}, Y_i)\}_{i=1}^n\), where \(X_{ij}\) is the \(i\)-th observation of the \(j\)-th predictor \(X_j\). We denote by \(\mathbf{X}_i := (X_{i1}, \ldots, X_{ip})\) the \(i\)-th observation of \((X_1, \ldots, X_p)\), so the sample simplifies to \(\{(\mathbf{X}_i, Y_i)\}_{i=1}^n\).

  • The design matrix contains all the information of the predictors plus a column of ones

\[
\mathbf{X} := \begin{pmatrix}
1 & X_{11} & \cdots & X_{1p} \\
\vdots & \vdots & \ddots & \vdots \\
1 & X_{n1} & \cdots & X_{np}
\end{pmatrix}_{n \times (p+1)}.
\]

  • The vector of responses Y, the vector of coefficients β, and the vector of errors are, respectively,

\[
\mathbf{Y} := \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}_{n \times 1}, \quad
\boldsymbol{\beta} := \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}_{(p+1) \times 1}, \quad \text{and} \quad
\boldsymbol{\varepsilon} := \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}_{n \times 1}.
\]

Thanks to the matrix notation, we can turn the sample version²⁰ of the multiple linear model, namely

\[
Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \varepsilon_i, \quad i = 1, \ldots, n,
\]

into something as compact as

\[
\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}.
\]

Figure 2.6: The least squares regression plane \(y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2\) and its dependence on the kind of squared distance considered. Application available here.

Recall that if p=1 we recover the simple linear model. In this case:

\[
\mathbf{X} = \begin{pmatrix} 1 & X_{11} \\ \vdots & \vdots \\ 1 & X_{n1} \end{pmatrix}_{n \times 2} \quad \text{and} \quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}_{2 \times 1}.
\]

With this notation, the RSS for the multiple linear regression is

\[
\mathrm{RSS}(\boldsymbol{\beta}) := \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_p X_{ip})^2 = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}). \tag{2.6}
\]

The RSS aggregates the squared vertical distances from the data to a regression plane given by β. The least squares estimators are the minimizers of the RSS²¹:

\[
\hat{\boldsymbol{\beta}} := \arg\min_{\boldsymbol{\beta} \in \mathbb{R}^{p+1}} \mathrm{RSS}(\boldsymbol{\beta}).
\]

Luckily, thanks to the matrix form of (2.6), it is possible²² to compute a closed-form expression for the least squares estimates:

\[
\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}. \tag{2.7}
\]

There are some similarities between (2.7) and \(\hat\beta_1 = (s_x^2)^{-1} s_{xy}\) from the simple linear model: both take a covariance between the predictors and Y and scale it by the inverse of a variance term of the predictors.
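
Before moving to an example with two predictors, a quick sanity check (a sketch reusing the x and yLin vectors from Section 2.2.1): for p = 1, (2.7) gives back the coefficients obtained from (2.3).

# (2.7) applied to the simple linear model of Section 2.2.1
X <- cbind(1, x)
solve(t(X) %*% X) %*% t(X) %*% yLin  # matches mod$coefficients from before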

Let’s check that indeed the coefficients given by lm are the ones given by (2.7). For that purpose we consider the leastSquares3D data frame in the least-squares-3D.RData dataset. Among other variables, the data frame contains the response yLin and the predictors x1 and x2.

load(file = "least-squares-3D.RData")

Let’s compute the coefficients of the regression of yLin on the predictors x1 and x2, which is denoted by yLin ~ x1 + x2. Note the use of + for including all the predictors. This does not mean that the predictors are added and the regression is then done on their sum.²³ Instead, this syntax is designed to resemble the mathematical form of the multiple linear model.

# Output from lm
mod <- lm(yLin ~ x1 + x2, data = leastSquares3D)
mod$coefficients
## (Intercept)          x1          x2 
##  -0.5702694   0.4832624   0.3214894

# Matrix X
X <- cbind(1, leastSquares3D$x1, leastSquares3D$x2)

# Vector Y
Y <- leastSquares3D$yLin

# Coefficients
beta <- solve(t(X) %*% X) %*% t(X) %*% Y
# %*% multiplies matrices
# solve() computes the inverse of a matrix
# t() transposes a matrix
beta
##            [,1]
## [1,] -0.5702694
## [2,]  0.4832624
## [3,]  0.3214894

Compute β^ for the regressions yLin ~ x1 + x2, yQua ~ x1 + x2, and yExp ~ x2 + x3 using:

  1. Equation (2.7) and
  2. the function lm.
Check that both are the same.

Once we have the least squares estimates β^, we can define the following concepts:

  • The fitted values \(\hat{Y}_1, \ldots, \hat{Y}_n\), where

    \[
    \hat{Y}_i := \hat\beta_0 + \hat\beta_1 X_{i1} + \cdots + \hat\beta_p X_{ip}, \quad i = 1, \ldots, n.
    \]

    They are the vertical projections of \(Y_1, \ldots, Y_n\) onto the fitted plane (see Figure 2.6). In matrix form, plugging in (2.7),

    \[
    \hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{H}\mathbf{Y},
    \]

    where \(\mathbf{H} := \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\) is called the hat matrix because it “puts the hat into \(\mathbf{Y}\)”: it projects \(\mathbf{Y}\) onto the regression plane (see Figure 2.6; a numerical check follows the next code chunk).

  • The residuals (or estimated errors) \(\hat\varepsilon_1, \ldots, \hat\varepsilon_n\), where

    \[
    \hat\varepsilon_i := Y_i - \hat{Y}_i, \quad i = 1, \ldots, n.
    \]

    They are the vertical distances between actual data and fitted data.

These two objects are present in the output of lm:

# Fitted values
mod$fitted.values

# Residuals
mod$residuals
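
As a numerical illustration of the role of the hat matrix, we can build H explicitly (a sketch reusing the X and Y objects constructed above for the yLin ~ x1 + x2 fit) and check that it produces the fitted values and that it is a projection:

# Hat matrix of the yLin ~ x1 + x2 fit
H <- X %*% solve(t(X) %*% X) %*% t(X)

# H %*% Y returns the fitted values (the difference is numerically zero)
max(abs(H %*% Y - mod$fitted.values))

# H is idempotent (H %*% H = H), as corresponds to a projection matrix
max(abs(H %*% H - H))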

We conclude with an important insight on the relation of multiple and simple linear regressions that is illustrated in Figure 2.7. The data used in that figure is:

set.seed(212542)
n <- 100
x1 <- rnorm(n, sd = 2)
x2 <- rnorm(n, mean = x1, sd = 3)
y <- 1 + 2 * x1 - x2 + rnorm(n, sd = 1)
data <- data.frame(x1 = x1, x2 = x2, y = y)

Consider the multiple linear model \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon_1\) and its associated simple linear models \(Y = \alpha_0 + \alpha_1 X_1 + \varepsilon_2\) and \(Y = \gamma_0 + \gamma_1 X_2 + \varepsilon_3\), where \(\varepsilon_1, \varepsilon_2, \varepsilon_3\) are random errors. Assume that we have a sample \(\{(X_{i1}, X_{i2}, Y_i)\}_{i=1}^n\). Then, in general, \(\hat\alpha_0 \neq \hat\beta_0 \neq \hat\gamma_0\), \(\hat\alpha_1 \neq \hat\beta_1\), and \(\hat\gamma_1 \neq \hat\beta_2\). This is so even if \(\alpha_0 = \beta_0 = \gamma_0\), \(\alpha_1 = \beta_1\), and \(\gamma_1 = \beta_2\). That is, in general, the inclusion of a new predictor changes the coefficient estimates of the rest of the predictors.


Figure 2.7: The regression plane (blue) of Y on X1 and X2 and its relation with the simple linear regressions (green lines) of Y on X1 and of Y on X2. The red points represent the sample for (X1,X2,Y) and the black points the sample projections for (X1,X2) (bottom), (X1,Y) (left), and (X2,Y) (right). As can be seen, the regression plane does not extend the simple linear regressions.

With the above data, check how the fitted coefficients change for y ~ x1, y ~ x2, and y ~ x1 + x2.

2.2.4 Case study application

A natural step now is to extend these simple regressions to increase both the R2 and the prediction accuracy for Price by means of the multiple linear regression:

# Regression on all the predictors
modWine1 <- lm(Price ~ Age + AGST + FrancePop + HarvestRain + WinterRain,
               data = wine)

# A shortcut
modWine1 <- lm(Price ~ ., data = wine)
modWine1
## 
## Call:
## lm(formula = Price ~ ., data = wine)
## 
## Coefficients:
## (Intercept)   WinterRain         AGST  HarvestRain          Age    FrancePop  
##  -2.343e+00    1.153e-03    6.144e-01   -3.837e-03    1.377e-02   -2.213e-05

# Summary
summary(modWine1)
## 
## Call:
## lm(formula = Price ~ ., data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46541 -0.24133  0.00413  0.18974  0.52495 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.343e+00  7.697e+00  -0.304  0.76384    
## WinterRain   1.153e-03  4.991e-04   2.311  0.03109 *  
## AGST         6.144e-01  9.799e-02   6.270 3.22e-06 ***
## HarvestRain -3.837e-03  8.366e-04  -4.587  0.00016 ***
## Age          1.377e-02  5.821e-02   0.237  0.81531    
## FrancePop   -2.213e-05  1.268e-04  -0.175  0.86313    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.293 on 21 degrees of freedom
## Multiple R-squared:  0.8278, Adjusted R-squared:  0.7868 
## F-statistic: 20.19 on 5 and 21 DF,  p-value: 2.232e-07

The fitted regression is Price = −2.343 + 0.013 × Age + 0.614 × AGST − 0.000 × FrancePop − 0.003 × HarvestRain + 0.001 × WinterRain. Recall that the 'Multiple R-squared' has almost doubled with respect to the best simple linear regression! This tells us that combining several predictors may lead to important performance gains in the prediction of the response. However, note that the R2 of the multiple linear model is not the sum of the R2’s of the simple linear models. The performance gain of combining predictors is hard to anticipate from the single-predictor models and depends on the dependence among the predictors.
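
As a concrete check of this last point, we can compare the R2 of the multiple model with that of the best simple model (both objects were fitted above):

# R^2 of the multiple model vs. the best simple model (Price ~ AGST)
summary(modWine1)$r.squared
sumModAGST$r.squared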


  17. They are unique and always exist (except if \(s_x = 0\), that is, when all the observations of the predictor coincide). They can be obtained by solving \(\frac{\partial \mathrm{RSS}(\beta_0, \beta_1)}{\partial \beta_0} = 0\) and \(\frac{\partial \mathrm{RSS}(\beta_0, \beta_1)}{\partial \beta_1} = 0\).

  18. The sample standard deviation is \(s_x = \sqrt{s_x^2}\).

  19. Note that now X1 represents the first predictor and not the first element of a sample of X.

  20. Recall that (2.4) and (2.5) were referring to the relation of the random variable (or population) Y with the random variables X1,…,Xp. Those are population versions of the linear model and clearly generate the sample versions when they are replicated for each observation (Xi,Yi), i=1,…,n, of the random vector (X,Y).

  21. It can be seen that they are unique and that they always exist, provided that \(\mathrm{rank}(\mathbf{X}'\mathbf{X}) = p + 1\).

  22. It follows from \(\frac{\partial \mathbf{A}\mathbf{x}}{\partial \mathbf{x}} = \mathbf{A}\) and \(\frac{\partial f(\mathbf{x})' g(\mathbf{x})}{\partial \mathbf{x}} = f(\mathbf{x})' \frac{\partial g(\mathbf{x})}{\partial \mathbf{x}} + g(\mathbf{x})' \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\) for two vector-valued functions \(f, g: \mathbb{R}^p \to \mathbb{R}^m\), where \(\frac{\partial f(\mathbf{x})' g(\mathbf{x})}{\partial \mathbf{x}}\) is the gradient row vector of \(f'g\), and \(\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\) and \(\frac{\partial g(\mathbf{x})}{\partial \mathbf{x}}\) are the Jacobian matrices of \(f\) and \(g\), respectively.

  23. If you wanted to do so, you would need to use the function I() to indicate that + is not including predictors in the model, but is acting as the algebraic sum operator.