B.1 Linear regression

B.1.1 Model formulation and least squares

The multiple linear regression model employs multiple predictors \(X_1,\ldots,X_p\)¹ to explain a single response \(Y\) by assuming that a linear relation of the form

\[
Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon \tag{B.1}
\]

holds between the predictors \(X_1,\ldots,X_p\) and the response \(Y\). In (B.1), \(\beta_0\) is the intercept and \(\beta_1,\ldots,\beta_p\) are the slopes, respectively. The term \(\varepsilon\) is a random variable with mean zero that is independent of \(X_1,\ldots,X_p\). Another way of looking at (B.1) is

\[
\mathbb{E}[Y \,|\, X_1 = x_1, \ldots, X_p = x_p] = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p, \tag{B.2}
\]

since \(\mathbb{E}[\varepsilon \,|\, X_1 = x_1, \ldots, X_p = x_p] = 0\). Therefore, the expectation of \(Y\) changes linearly with the values of \(X_1,\ldots,X_p\). Hence the interpretation of the coefficients:

  • \(\beta_0\): the expectation of \(Y\) when \(X_1 = \cdots = X_p = 0\).
  • \(\beta_j\), \(1 \leq j \leq p\): the additive increment in the expectation of \(Y\) for an increment of one unit in \(X_j = x_j\), provided that the remaining variables do not change.

Figure B.1 illustrates the geometrical interpretation of a multiple linear model: a plane in the \((p+1)\)-dimensional space. If \(p=1\), the plane is the regression line for simple linear regression. If \(p=2\), then the plane can be visualized in a three-dimensional plot.


Figure B.1: The regression plane (blue) of \(Y\) on \(X_1\) and \(X_2\), and its relation with the regression lines (green lines) of \(Y\) on \(X_1\) (left) and of \(Y\) on \(X_2\) (right). The red points represent the sample for \((X_1,X_2,Y)\) and the black points the projections for \((X_1,X_2)\) (bottom), \((X_1,Y)\) (left), and \((X_2,Y)\) (right). Note that the regression plane is not a direct extension of the marginal regression lines.

The estimation of \(\beta_0, \beta_1, \ldots, \beta_p\) is done by minimizing the so-called Residual Sum of Squares (RSS). We first need to introduce some helpful notation for this and the next section:

  • A sample of \((X_1,\ldots,X_p,Y)\) is denoted by \((X_{11},\ldots,X_{1p},Y_1),\ldots,(X_{n1},\ldots,X_{np},Y_n)\), where \(X_{ij}\) denotes the \(i\)-th observation of the \(j\)-th predictor \(X_j\). We denote by \(\mathbf{X}_i = (X_{i1},\ldots,X_{ip})\) the \(i\)-th observation of \((X_1,\ldots,X_p)\), so the sample is \((\mathbf{X}_1,Y_1),\ldots,(\mathbf{X}_n,Y_n)\).

  • The design matrix contains all the information of the predictors and a column of ones

    \[
    \mathbf{X} = \begin{pmatrix}
    1 & X_{11} & \cdots & X_{1p} \\
    \vdots & \vdots & \ddots & \vdots \\
    1 & X_{n1} & \cdots & X_{np}
    \end{pmatrix}_{n \times (p+1)}.
    \]

  • The vector of responses \(\mathbf{Y}\), the vector of coefficients \(\boldsymbol{\beta}\), and the vector of errors \(\boldsymbol{\varepsilon}\) are, respectively,

    \[
    \mathbf{Y} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}_{n \times 1}, \quad
    \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}_{(p+1) \times 1}, \quad\text{and}\quad
    \boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}_{n \times 1}.
    \]

Thanks to the matrix notation, we can turn the sample version of the multiple linear model, namely

\[
Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \varepsilon_i, \quad i = 1, \ldots, n,
\]

into something as compact as

\[
\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}.
\]

The RSS for the multiple linear regression is

\[
\mathrm{RSS}(\boldsymbol{\beta}) := \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_p X_{ip})^2
= (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}). \tag{B.3}
\]

\(\mathrm{RSS}(\boldsymbol{\beta})\) aggregates the squared vertical distances from the data to a regression plane given by \(\boldsymbol{\beta}\). Note that the vertical distances are considered because we want to minimize the error in the prediction of \(Y\). Thus, the treatment of the variables is not symmetrical²; see Figure B.2. The least squares estimators are the minimizers of (B.3):

\[
\hat{\boldsymbol{\beta}} := \arg\min_{\boldsymbol{\beta} \in \mathbb{R}^{p+1}} \mathrm{RSS}(\boldsymbol{\beta}).
\]

Luckily, thanks to the matrix form of (B.3), it is simple to compute a closed-form expression for the least squares estimates:

\[
\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}. \tag{B.4}
\]

Exercise B.1 \(\hat{\boldsymbol{\beta}}\) can be obtained by differentiating (B.3). Prove it using that \(\frac{\partial \mathbf{A}\mathbf{x}}{\partial \mathbf{x}} = \mathbf{A}\) and \(\frac{\partial f(\mathbf{x})'g(\mathbf{x})}{\partial \mathbf{x}} = f(\mathbf{x})'\frac{\partial g(\mathbf{x})}{\partial \mathbf{x}} + g(\mathbf{x})'\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\) for two vector-valued functions \(f\) and \(g\).

Figure B.2: The least squares regression plane \(y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2\) and its dependence on the kind of squared distance considered. Application available here.

Let’s check that indeed the coefficients given by R’s lm are the ones given by (B.4) in a toy linear model.

# Generates 50 points from a N(0, 1): predictors and error
set.seed(34567)
x1 <- rnorm(50)
x2 <- rnorm(50)
x3 <- x1 + rnorm(50, sd = 0.05) # Make variables dependent
eps <- rnorm(50)

# Responses
y_lin <- -0.5 + 0.5 * x1 + 0.5 * x2 + eps
y_qua <- -0.5 + x1^2 + 0.5 * x2 + eps
y_exp <- -0.5 + 0.5 * exp(x2) + x3 + eps

# Data
data_animation <- data.frame(x1 = x1, x2 = x2, y_lin = y_lin,
                             y_qua = y_qua, y_exp = y_exp)

# Call lm
# lm employs formula = response ~ predictor1 + predictor2 + ...
# (names according to the data frame names) for denoting the regression
# to be done
mod <- lm(y_lin ~ x1 + x2, data = data_animation)
summary(mod)
## 
## Call:
## lm(formula = y_lin ~ x1 + x2, data = data_animation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.37003 -0.54305  0.06741  0.75612  1.63829 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.5703     0.1302  -4.380 6.59e-05 ***
## x1            0.4833     0.1264   3.824 0.000386 ***
## x2            0.3215     0.1426   2.255 0.028831 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9132 on 47 degrees of freedom
## Multiple R-squared:  0.276,  Adjusted R-squared:  0.2452 
## F-statistic: 8.958 on 2 and 47 DF,  p-value: 0.0005057

# mod is a list with a lot of information
# str(mod) # Long output

# Coefficients
mod$coefficients
## (Intercept)          x1          x2 
##  -0.5702694   0.4832624   0.3214894

# Application of formula (B.4)

# Matrix X
X <- cbind(1, x1, x2)

# Vector Y
Y <- y_lin

# Coefficients
beta <- solve(t(X) %*% X) %*% t(X) %*% Y
beta
##          [,1]
##    -0.5702694
## x1  0.4832624
## x2  0.3214894
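
As a side note, the explicit inverse in (B.4) is convenient on paper but numerically delicate. Equivalent and more stable computations solve the normal equations \(\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \mathbf{X}'\mathbf{Y}\) directly or rely on a QR decomposition of \(\mathbf{X}\) (lm relies on a QR factorization internally rather than on an explicit inversion). A minimal sketch, reusing the X and Y objects defined above:

# Two numerically preferable alternatives to the explicit inverse in (B.4)
solve(crossprod(X), crossprod(X, Y)) # Solves (X'X) beta = X'Y
qr.solve(X, Y) # Least squares solution via the QR decomposition of X

Both give the same coefficients as beta and mod$coefficients.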

Exercise B.2 Compute \(\hat{\boldsymbol{\beta}}\) for the regressions y_lin ~ x1 + x2, y_qua ~ x1 + x2, and y_exp ~ x2 + x3 using equation (B.4) and the function lm. Check that the fitted plane and the coefficient estimates are coherent.

Once we have the least squares estimate \(\hat{\boldsymbol{\beta}}\), we can define the following two concepts:

  • The fitted values \(\hat{Y}_1,\ldots,\hat{Y}_n\), where

    \[
    \hat{Y}_i := \hat\beta_0 + \hat\beta_1 X_{i1} + \cdots + \hat\beta_p X_{ip}, \quad i = 1, \ldots, n.
    \]

    They are the vertical projections of \(Y_1,\ldots,Y_n\) onto the fitted plane (see Figure B.2). In matrix form, using (B.4),

    \[
    \hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{H}\mathbf{Y},
    \]

    where \(\mathbf{H} := \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\) is called the hat matrix because it “puts the hat on \(\mathbf{Y}\)”: it projects \(\mathbf{Y}\) onto the regression plane (see Figure B.2).

  • The residuals (or estimated errors) \(\hat\varepsilon_1,\ldots,\hat\varepsilon_n\), where

    \[
    \hat\varepsilon_i := Y_i - \hat{Y}_i, \quad i = 1, \ldots, n.
    \]

    They are the vertical distances between the actual and the fitted data; a numerical check of both concepts is sketched right after this list.
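
A quick numerical check of the fitted values and the residuals, continuing with the X, Y, and mod objects from the previous code (the differences below are zero up to rounding errors):

# Fitted values and residuals from the matrix formulas vs. those stored by lm
H <- X %*% solve(t(X) %*% X) %*% t(X) # Hat matrix
Y_hat <- H %*% Y # Fitted values
eps_hat <- Y - Y_hat # Residuals
max(abs(Y_hat - fitted(mod)))
max(abs(eps_hat - residuals(mod)))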

B.1.2 Model assumptions

Observe that \(\hat{\boldsymbol{\beta}}\) was derived from purely geometrical arguments, not probabilistic ones. That is, we have not made any probabilistic assumption on the data generation process. However, some probabilistic assumptions are required to infer the unknown population coefficients \(\boldsymbol{\beta}\) from the sample \((\mathbf{X}_1,Y_1),\ldots,(\mathbf{X}_n,Y_n)\).

The assumptions of the multiple linear model are the following (a quick graphical check of them is sketched right after the list):

  i. Linearity: \(\mathbb{E}[Y \,|\, X_1 = x_1, \ldots, X_p = x_p] = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p\).
  ii. Homoscedasticity: \(\mathrm{Var}[\varepsilon_i] = \sigma^2\), with \(\sigma^2\) constant for \(i = 1, \ldots, n\).
  iii. Normality: \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\) for \(i = 1, \ldots, n\).
  iv. Independence of the errors: \(\varepsilon_1, \ldots, \varepsilon_n\) are independent.
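
These assumptions can be informally inspected through the residuals of a fitted model. A minimal graphical sketch with the mod object fitted above, relying on R's built-in diagnostics for lm objects:

# Residuals vs. fitted values (linearity and homoscedasticity) and
# Q-Q plot of the standardized residuals (normality)
plot(mod, which = 1:2)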

A good one-line summary of the linear model is the following (independence is assumed):

\[
Y \,|\, (X_1 = x_1, \ldots, X_p = x_p) \sim \mathcal{N}(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p, \sigma^2). \tag{B.5}
\]


Figure B.3: The key concepts of the simple linear model. The blue densities denote the conditional density of \(Y\) for each cut in the \(X\) axis. The yellow band denotes where 95% of the data lies, according to the model. The red points represent a sample following the model.

Inference on the parameters \(\boldsymbol{\beta}\) and \(\sigma\) can be done, conditionally³ on \(\mathbf{X}_1,\ldots,\mathbf{X}_n\), from (B.5). We do not explore this further and refer the interested reader to, e.g., Section 2.4 in García-Portugués (2025). Instead, we highlight the connection between least squares estimation and the maximum likelihood estimator derived from (B.5).

First, note that (B.5) is the population version of the linear model (it is expressed in terms of the random variables). The sample version that summarizes assumptions i–iv is

\[
\mathbf{Y} \,|\, \mathbf{X} \sim \mathcal{N}_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}_n).
\]

Using this result, it is easy to obtain the log-likelihood function of \(Y_1,\ldots,Y_n\) conditionally on \(\mathbf{X}_1,\ldots,\mathbf{X}_n\) as⁴

\[
\ell(\boldsymbol{\beta}) = \log \phi_{\sigma^2 \mathbf{I}_n}(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})
= \sum_{i=1}^{n} \log \phi_{\sigma}\big(Y_i - (\mathbf{X}\boldsymbol{\beta})_i\big). \tag{B.6}
\]
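
A quick numerical illustration of (B.6): the log-likelihood of the fitted model can be recomputed from its residuals. The sketch below uses the maximum likelihood variance estimate \(\hat\sigma^2 = \mathrm{RSS}(\hat{\boldsymbol{\beta}})/n\), which is the value employed by logLik for lm objects:

# Evaluate (B.6) at the fitted coefficients using the ML estimate of sigma^2
sigma_ml <- sqrt(sum(residuals(mod)^2) / nobs(mod))
sum(dnorm(residuals(mod), mean = 0, sd = sigma_ml, log = TRUE))
logLik(mod) # Same value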

Finally, the following result justifies the consideration of the least squares estimate: it equals the maximum likelihood estimator derived under assumptions i–iv.

Theorem B.1 Under assumptions i–iv, the maximum likelihood estimator of \(\boldsymbol{\beta}\) is the least squares estimate (B.4):

\[
\hat{\boldsymbol{\beta}}_{\mathrm{ML}} = \arg\max_{\boldsymbol{\beta} \in \mathbb{R}^{p+1}} \ell(\boldsymbol{\beta}) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}.
\]

Proof. Expanding the first equality in (B.6) gives⁵

\[
\ell(\boldsymbol{\beta}) = -\log\big((2\pi)^{n/2}\sigma^n\big) - \frac{1}{2\sigma^2}(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}).
\]

Optimizing \(\ell\) does not require knowledge of \(\sigma^2\), since differentiating with respect to \(\boldsymbol{\beta}\) and equating to zero gives (see Exercise B.1) \(\frac{1}{\sigma^2}(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'\mathbf{X} = \mathbf{0}\). Solving this equation gives the form of \(\hat{\boldsymbol{\beta}}_{\mathrm{ML}}\).

Exercise B.3 Conclude the proof of Theorem B.1.
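
A numerical counterpart of Theorem B.1 can be sketched by maximizing (B.6) with a general-purpose optimizer. Fixing \(\sigma = 1\) (which, as argued in the proof, does not affect the maximizer), the result should coincide with the least squares estimate computed before; the check below reuses the X, Y, and beta objects:

# Maximize the log-likelihood (B.6) numerically with sigma = 1 and compare
# the maximizer with the least squares estimate (B.4)
minus_loglik <- function(b) {
  0.5 * length(Y) * log(2 * pi) + 0.5 * sum((Y - X %*% b)^2)
}
opt <- optim(par = rep(0, 3), fn = minus_loglik, method = "BFGS")
cbind(ml = opt$par, least_squares = drop(beta))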

References

García-Portugués, E. 2025. Notes for Predictive Modeling. https://bookdown.org/egarpor/PM-UC3M/.

  1. Not to be confused with a sample!
  2. If that were the case, we would consider perpendicular distances, which yield Principal Component Analysis (PCA).
  3. We assume that the randomness is on the response only.
  4. Recall that \(\phi_{\boldsymbol{\Sigma}}(\cdot-\boldsymbol{\mu})\) and \(\phi_{\sigma}(\cdot-\mu)\) denote the pdfs of a \(\mathcal{N}_p(\boldsymbol{\mu},\boldsymbol{\Sigma})\) and a \(\mathcal{N}(\mu,\sigma^2)\), respectively.
  5. Since \(|\sigma^2\mathbf{I}_n|^{1/2}=\sigma^n\).