B.1 Linear regression
B.1.1 Model formulation and least squares
The multiple linear regression employs multiple predictors $X_1,\ldots,X_p$ for explaining a single response $Y$ by assuming that a linear relation of the form

$$Y=\beta_0+\beta_1X_1+\cdots+\beta_pX_p+\varepsilon \tag{B.1}$$

holds between the predictors $X_1,\ldots,X_p$ and the response $Y.$ In (B.1), $\beta_0$ is the intercept and $\beta_1,\ldots,\beta_p$ are the slopes, respectively. $\varepsilon$ is a random variable with mean zero and independent from $X_1,\ldots,X_p.$ Another way of looking at (B.1) is

$$\mathbb{E}[Y\,|\,X_1=x_1,\ldots,X_p=x_p]=\beta_0+\beta_1x_1+\cdots+\beta_px_p,$$

since $\mathbb{E}[\varepsilon\,|\,X_1=x_1,\ldots,X_p=x_p]=0.$ Therefore, the expectation of $Y$ is changing in a linear fashion with respect to the values of $x_1,\ldots,x_p.$ Hence the interpretation of the coefficients:
- $\beta_0$: is the expectation of $Y$ when $X_1=\cdots=X_p=0.$
- $\beta_j$, $1\leq j\leq p$: is the additive increment in the expectation of $Y$ for an increment of one unit in $X_j=x_j$, provided that the remaining variables do not change.
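To fix ideas, the following sketch (with made-up coefficient values that are not part of any fitted model) evaluates the conditional expectation above for $p=2$ and shows that increasing $x_1$ by one unit, while keeping $x_2$ fixed, shifts the expectation of $Y$ by exactly $\beta_1.$

# Illustrative (made-up) coefficients for a model with p = 2 predictors
beta0 <- -0.5; beta1 <- 0.5; beta2 <- 0.25
# Conditional expectation of Y given X1 = x1 and X2 = x2
mu <- function(x1, x2) beta0 + beta1 * x1 + beta2 * x2
# Increasing x1 by one unit with x2 fixed changes the expectation by beta1
mu(x1 = 2, x2 = 1) - mu(x1 = 1, x2 = 1) # 0.5 = beta1
# beta0 is the expectation of Y when both predictors are zero
mu(x1 = 0, x2 = 0) # -0.5 = beta0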
Figure B.1 illustrates the geometrical interpretation of a multiple linear model: a plane in the $(p+1)$-dimensional space of $(x_1,\ldots,x_p,y).$ If $p=1$, the plane is the regression line for simple linear regression. If $p=2$, then the plane can be visualized in a three-dimensional plot.

Figure B.1: The regression plane (blue) of $Y$ on $X_1$ and $X_2$, and its relation with the regression lines (green lines) of $Y$ on $X_1$ (left) and of $Y$ on $X_2$ (right). The red points represent the sample for $(X_1,X_2,Y)$ and the black points the projections for $(X_1,X_2)$ (bottom), $(X_1,Y)$ (left), and $(X_2,Y)$ (right). Note that the regression plane is not a direct extension of the marginal regression lines.
The estimation of $\beta_0,\beta_1,\ldots,\beta_p$ is done by minimizing the so-called Residual Sum of Squares (RSS). We first need to introduce some helpful notation for this and the next section:
A sample of $(X_1,\ldots,X_p,Y)$ is denoted by $(X_{11},\ldots,X_{1p},Y_1),\ldots,(X_{n1},\ldots,X_{np},Y_n)$, where $X_{ij}$ denotes the $i$-th observation of the $j$-th predictor $X_j.$ We denote by $Y_i$ the $i$-th observation of $Y$, so the sample is $\{(X_{i1},\ldots,X_{ip},Y_i)\}_{i=1}^n.$
The design matrix contains all the information of the predictors and a column of ones:

$$\mathbf{X}=\begin{pmatrix}1 & X_{11} & \cdots & X_{1p}\\ \vdots & \vdots & \ddots & \vdots\\ 1 & X_{n1} & \cdots & X_{np}\end{pmatrix}_{n\times(p+1)}.$$
The vector of responses $\mathbf{Y}$, the vector of coefficients $\boldsymbol{\beta}$, and the vector of errors $\boldsymbol{\varepsilon}$ are, respectively,

$$\mathbf{Y}=\begin{pmatrix}Y_1\\ \vdots\\ Y_n\end{pmatrix}_{n\times 1},\quad \boldsymbol{\beta}=\begin{pmatrix}\beta_0\\ \beta_1\\ \vdots\\ \beta_p\end{pmatrix}_{(p+1)\times 1},\quad\text{and}\quad \boldsymbol{\varepsilon}=\begin{pmatrix}\varepsilon_1\\ \vdots\\ \varepsilon_n\end{pmatrix}_{n\times 1}.$$
Thanks to the matrix notation, we can turn the sample version of the multiple linear model, namely

$$Y_i=\beta_0+\beta_1X_{i1}+\cdots+\beta_pX_{ip}+\varepsilon_i,\quad i=1,\ldots,n,$$

into something as compact as

$$\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}.$$
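As a purely illustrative sketch of this notation (the objects suffixed with _demo are made up and are not reused in the rest of the section), we can build the design matrix by binding a column of ones to the predictors, compare it with model.matrix(), and check that the compact form gives the same responses as the elementwise sample version.

# Purely illustrative sketch: a small simulated sample with n = 5 and p = 2
set.seed(1)
n_demo <- 5
x1_demo <- rnorm(n_demo)
x2_demo <- rnorm(n_demo)
eps_demo <- rnorm(n_demo)
beta_demo <- c(-0.5, 0.5, 0.25) # (beta_0, beta_1, beta_2)
# Design matrix: a column of ones plus the predictors, of size n x (p + 1)
X_demo <- cbind(1, x1_demo, x2_demo)
# model.matrix() builds the same matrix from a formula
all.equal(X_demo, model.matrix(~ x1_demo + x2_demo), check.attributes = FALSE)
# The compact form X * beta + eps equals the elementwise sample version
Y_compact <- drop(X_demo %*% beta_demo) + eps_demo
Y_element <- beta_demo[1] + beta_demo[2] * x1_demo + beta_demo[3] * x2_demo + eps_demo
max(abs(Y_compact - Y_element)) # Numerically zero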
The RSS for the multiple linear regression is

$$\mathrm{RSS}(\boldsymbol{\beta}):=\sum_{i=1}^n\left(Y_i-\beta_0-\beta_1X_{i1}-\cdots-\beta_pX_{ip}\right)^2=(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}). \tag{B.3}$$
$\mathrm{RSS}(\boldsymbol{\beta})$ aggregates the squared vertical distances from the data to a regression plane given by $\boldsymbol{\beta}.$ Note that the vertical distances are considered because we want to minimize the error in the prediction of $Y.$ Thus, the treatment of the variables is not symmetrical; see Figure B.2. The least squares estimators are the minimizers of (B.3):

$$\hat{\boldsymbol{\beta}}:=\arg\min_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}\mathrm{RSS}(\boldsymbol{\beta}).$$
Luckily, thanks to the matrix form of (B.3), it is simple to compute a closed-form expression for the least squares estimates:

$$\hat{\boldsymbol{\beta}}=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}. \tag{B.4}$$
Exercise B.1 $\hat{\boldsymbol{\beta}}$ can be obtained by differentiating (B.3). Prove it using that $\frac{\partial\mathbf{A}\mathbf{x}}{\partial\mathbf{x}}=\mathbf{A}$ and $\frac{\partial f(\mathbf{x})'g(\mathbf{x})}{\partial\mathbf{x}}=f(\mathbf{x})'\frac{\partial g(\mathbf{x})}{\partial\mathbf{x}}+g(\mathbf{x})'\frac{\partial f(\mathbf{x})}{\partial\mathbf{x}}$ for two vector-valued functions $f$ and $g.$
Figure B.2: The least squares regression plane and its dependence on the kind of squared distance considered. Application available here.
Let’s check that indeed the coefficients given by R’s lm are the ones given by (B.4) in a toy linear model.
# Generates 50 points from a N(0, 1): predictors and error
set.seed(34567)
x1 <- rnorm(50)
x2 <- rnorm(50)
x3 <- x1 + rnorm(50, sd = 0.05) # Make variables dependent
eps <- rnorm(50)
# Responses
y_lin <- -0.5 + 0.5 * x1 + 0.5 * x2 + eps
y_qua <- -0.5 + x1^2 + 0.5 * x2 + eps
y_exp <- -0.5 + 0.5 * exp(x2) + x3 + eps
# Data
data_animation <- data.frame(x1 = x1, x2 = x2, y_lin = y_lin,
y_qua = y_qua, y_exp = y_exp)
# Call lm
# lm employs formula = response ~ predictor1 + predictor2 + ...
# (names according to the data frame names) for denoting the regression
# to be done
mod <- lm(y_lin ~ x1 + x2, data = data_animation)
summary(mod)
##
## Call:
## lm(formula = y_lin ~ x1 + x2, data = data_animation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.37003 -0.54305 0.06741 0.75612 1.63829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5703 0.1302 -4.380 6.59e-05 ***
## x1 0.4833 0.1264 3.824 0.000386 ***
## x2 0.3215 0.1426 2.255 0.028831 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9132 on 47 degrees of freedom
## Multiple R-squared: 0.276, Adjusted R-squared: 0.2452
## F-statistic: 8.958 on 2 and 47 DF, p-value: 0.0005057
# mod is a list with a lot of information
# str(mod) # Long output
# Coefficients
mod$coefficients
## (Intercept) x1 x2
## -0.5702694 0.4832624 0.3214894
# Application of formula (B.4)
# Matrix X
X <- cbind(1, x1, x2)
# Vector Y
Y <- y_lin
# Coefficients
beta <- solve(t(X) %*% X) %*% t(X) %*% Y
beta
## [,1]
## -0.5702694
## x1 0.4832624
## x2 0.3214894
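As a complementary, purely numerical check (a sketch reusing the X, Y, and beta objects defined above), we can minimize the RSS in (B.3) with a general-purpose optimizer and verify that the minimizer agrees with the closed form (B.4).

# RSS as a function of the coefficient vector, as in (B.3)
rss <- function(b) sum((Y - X %*% b)^2)
# Numerical minimization, started at the origin
optim(par = c(0, 0, 0), fn = rss, method = "BFGS")$par
# Agrees (up to numerical error) with the closed-form estimates in beta
drop(beta)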
Exercise B.2 Compute $\hat{\boldsymbol{\beta}}$ for the regressions y_lin ~ x1 + x2, y_qua ~ x1 + x2, and y_exp ~ x2 + x3 using equation (B.4) and the function lm. Check that the fitted plane and the coefficient estimates are coherent.
Once we have the least squares estimates $\hat{\boldsymbol{\beta}}$, we can define the following two concepts:
The fitted values $\hat{Y}_1,\ldots,\hat{Y}_n$, where

$$\hat{Y}_i:=\hat\beta_0+\hat\beta_1X_{i1}+\cdots+\hat\beta_pX_{ip},\quad i=1,\ldots,n.$$

They are the vertical projections of $Y_1,\ldots,Y_n$ onto the fitted plane (see Figure B.2). In a matrix form, inputting (B.4),

$$\hat{\mathbf{Y}}:=\mathbf{X}\hat{\boldsymbol{\beta}}=\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}=\mathbf{H}\mathbf{Y},$$

where $\mathbf{H}:=\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ is called the hat matrix because it “puts the hat into $\mathbf{Y}$”. What it does is to project $\mathbf{Y}$ onto the regression plane (see Figure B.2).
The residuals (or estimated errors) $\hat\varepsilon_1,\ldots,\hat\varepsilon_n$, where

$$\hat\varepsilon_i:=Y_i-\hat{Y}_i,\quad i=1,\ldots,n.$$

They are the vertical distances between the actual data and the fitted data.
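The following sketch (reusing X, Y, and mod from the code above) computes the hat matrix explicitly and checks that the resulting fitted values and residuals coincide with those stored in the lm object.

# Hat matrix H = X (X'X)^{-1} X'
H <- X %*% solve(t(X) %*% X) %*% t(X)
# Fitted values: the projection of Y onto the regression plane
Y_hat <- drop(H %*% Y)
max(abs(Y_hat - mod$fitted.values)) # Numerically zero
# Residuals: vertical distances between actual and fitted data
eps_hat <- Y - Y_hat
max(abs(eps_hat - mod$residuals)) # Numerically zero
# H is idempotent (H H = H), as expected from a projection matrix
max(abs(H %*% H - H)) # Numerically zero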
B.1.2 Model assumptions
Observe that $\hat{\boldsymbol{\beta}}$ was derived from purely geometrical arguments, not probabilistic ones. That is, we have not made any probabilistic assumption on the data generation process. However, some probabilistic assumptions are required to infer the unknown population coefficients $\boldsymbol{\beta}$ from the sample $\{(X_{i1},\ldots,X_{ip},Y_i)\}_{i=1}^n.$
The assumptions of the multiple linear model are:
i. Linearity: $\mathbb{E}[Y\,|\,X_1=x_1,\ldots,X_p=x_p]=\beta_0+\beta_1x_1+\cdots+\beta_px_p.$
ii. Homoscedasticity: $\mathbb{V}\mathrm{ar}[\varepsilon_i]=\sigma^2$, with $\sigma^2$ constant, for $i=1,\ldots,n.$
iii. Normality: $\varepsilon_i\sim\mathcal{N}(0,\sigma^2)$, for $i=1,\ldots,n.$
iv. Independence of the errors: $\varepsilon_1,\ldots,\varepsilon_n$ are independent.
A good one-line summary of the linear model is the following (independence is assumed):

$$Y\,|\,(X_1=x_1,\ldots,X_p=x_p)\sim\mathcal{N}\left(\beta_0+\beta_1x_1+\cdots+\beta_px_p,\,\sigma^2\right). \tag{B.5}$$

Figure B.3: The key concepts of the simple linear model. The blue densities denote the conditional density of $Y$ for each cut in the $x$ axis. The yellow band denotes where the $95\%$ of the data is, according to the model. The red points represent a sample following the model.
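Assumptions i–iv are often assessed informally from the residuals of the fitted model. The sketch below (reusing the mod object fitted previously; graphical output not shown) relies on standard R diagnostics and is merely indicative, not a formal validation.

# Residuals vs. fitted values: helps to spot nonlinearity and heteroskedasticity
plot(mod$fitted.values, mod$residuals)
abline(h = 0, lty = 2)
# Normal QQ-plot of the residuals: informal check of the normality assumption
qqnorm(mod$residuals)
qqline(mod$residuals)
# A formal normality test on the residuals (to be interpreted with care)
shapiro.test(mod$residuals)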
Inference on the parameters $\boldsymbol{\beta}$ and $\sigma^2$ can be done, conditionally on the sample of predictors, from (B.5). We do not explore this further, referring the interested reader to, e.g., Section 2.4 in García-Portugués (2025). Instead, we highlight the connection between least squares estimation and the maximum likelihood estimator derived from (B.5).
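(As a side note, although we do not develop the inferential theory here, the standard output for the coefficients of mod is directly available in R, for instance:)

# 95% confidence intervals for the coefficients
confint(mod, level = 0.95)
# Estimates, standard errors, t-statistics, and p-values
summary(mod)$coefficients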
First, note that (B.5) is the population version of the linear model (it is expressed in terms of the random variables). The sample version that summarizes assumptions i–iv is

$$\mathbf{Y}\,|\,\mathbf{X}\sim\mathcal{N}_n\left(\mathbf{X}\boldsymbol{\beta},\sigma^2\mathbf{I}_n\right).$$
Using this result, it is easy to obtain the log-likelihood function of $\boldsymbol{\beta}$ conditionally on $\mathbf{X}$ as

$$\ell(\boldsymbol{\beta})=\log\phi_{\sigma^2\mathbf{I}_n}(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})=\sum_{i=1}^n\log\phi_\sigma\left(Y_i-(\mathbf{X}\boldsymbol{\beta})_i\right), \tag{B.6}$$

where $\phi_{\sigma^2\mathbf{I}_n}$ denotes the density of a $\mathcal{N}_n(\mathbf{0},\sigma^2\mathbf{I}_n)$ and $\phi_\sigma$ the density of a $\mathcal{N}(0,\sigma^2).$
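As an illustrative numerical check (not part of the derivation), we can evaluate (B.6), expanded as in the proof below, at the coefficients of mod and at the maximum likelihood variance estimate $\hat\sigma^2_{\mathrm{ML}}=\mathrm{RSS}(\hat{\boldsymbol{\beta}})/n$, and compare the result with R's logLik().

# Log-likelihood (B.6) evaluated at the fitted coefficients of mod
n <- length(Y)
rss_hat <- sum(mod$residuals^2)
sigma2_ml <- rss_hat / n # Maximum likelihood estimate of sigma^2
-n / 2 * log(2 * pi * sigma2_ml) - rss_hat / (2 * sigma2_ml)
# Matches the value reported by R
logLik(mod)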
Finally, the following result justifies the consideration of the least squares estimate: it equals the maximum likelihood estimator derived under assumptions i–iv.
Theorem B.1 Under assumptions i–iv, the maximum likelihood estimator of $\boldsymbol{\beta}$ is the least squares estimate (B.4):

$$\hat{\boldsymbol{\beta}}_{\mathrm{ML}}:=\arg\max_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}\ell(\boldsymbol{\beta})=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}=\hat{\boldsymbol{\beta}}.$$
Proof. Expanding the first equality at (B.6) gives

$$\ell(\boldsymbol{\beta})=-\frac{n}{2}\log\left(2\pi\sigma^2\right)-\frac{1}{2\sigma^2}(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}).$$

Optimizing $\ell$ does not require knowledge of $\sigma^2$, since differentiating with respect to $\boldsymbol{\beta}$ and equating to zero gives (see Exercise B.1)

$$\frac{1}{\sigma^2}(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})'\mathbf{X}=\mathbf{0}.$$

Solving the equation gives the form for $\hat{\boldsymbol{\beta}}_{\mathrm{ML}}.$
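The key step of the proof, namely that the maximizer of $\ell$ does not depend on $\sigma^2$, can be illustrated numerically (an informal sketch reusing the X, Y, and beta objects from the code above): maximizing $\ell(\boldsymbol{\beta})$ for two different values of $\sigma$ returns the same coefficients, which coincide with (B.4).

# Log-likelihood (B.6) as a function of the coefficients, for a fixed sigma
loglik <- function(b, sigma) {
  sum(dnorm(Y - X %*% b, mean = 0, sd = sigma, log = TRUE))
}
# Maximize for two arbitrary values of sigma: the maximizer is the same
optim(par = c(0, 0, 0), fn = loglik, sigma = 1,
      control = list(fnscale = -1), method = "BFGS")$par
optim(par = c(0, 0, 0), fn = loglik, sigma = 2,
      control = list(fnscale = -1), method = "BFGS")$par
# Both coincide (up to numerical error) with the least squares estimate (B.4)
drop(beta)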
Exercise B.3 Conclude the proof of Theorem B.1.