2.3 Model formulation and estimation by least squares
The simple linear model is a statistical tool for describing the relation between two random variables, $X$ and $Y$. For example, in the pisa dataset, $X$ could be ReadingMean and $Y$ could be MathMean. The simple linear model is constructed by assuming that the linear relation

$$Y = \beta_0 + \beta_1 X + \varepsilon \tag{2.1}$$

holds between $X$ and $Y$. In (2.1), $\beta_0$ and $\beta_1$ are known as the intercept and slope, respectively. $\varepsilon$ is a random variable with mean zero and independent from $X$. It describes the error around the mean, or the effect of other variables that we do not model. Another way of looking at (2.1) is

$$\mathbb{E}[Y \,|\, X = x] = \beta_0 + \beta_1 x, \tag{2.2}$$

since $\mathbb{E}[\varepsilon \,|\, X = x] = 0$.
The Left Hand Side (LHS) of (2.2) is the conditional expectation of $Y$ given $X$. It represents how the mean of the random variable $Y$ is changing according to a particular value, denoted by $x$, of the random variable $X$. With the Right Hand Side (RHS), what we are saying is that the mean of $Y$ is changing in a linear fashion with respect to the value $x$ of $X$. Hence the interpretation of the coefficients (illustrated with a small simulation after this list):
- $\beta_0$: is the mean of $Y$ when $X = 0$.
- $\beta_1$: is the increment in the mean of $Y$ for an increment of one unit in $X = x$.
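To make the interpretation of (2.2) concrete, here is a minimal simulation sketch (not part of the original text; the seed, sample size, and evaluation point x0 are arbitrary choices of this sketch). It checks that, for a fixed value $x$, the average of $Y$ over observations with $X$ close to $x$ is approximately $\beta_0 + \beta_1 x$:

# Simulate from the model Y = -0.5 + 1.5 * X + eps, so beta0 = -0.5 and beta1 = 1.5
set.seed(12345)  # arbitrary seed for this sketch
n <- 1e6
X <- rnorm(n)
eps <- rnorm(n)  # error with mean zero, independent from X
Y <- -0.5 + 1.5 * X + eps

# Average of Y for observations with X close to x0 = 1: approximately -0.5 + 1.5 * 1 = 1
x0 <- 1
mean(Y[abs(X - x0) < 0.01])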
If we have a sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ for our random variables $X$ and $Y$, we can estimate the unknown coefficients $\beta_0$ and $\beta_1$. In the pisa dataset, the sample consists of the observations of ReadingMean and MathMean. A possible way of estimating $(\beta_0, \beta_1)$ is by minimizing the Residual Sum of Squares (RSS):

$$\text{RSS}(\beta_0, \beta_1) := \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2.$$

In other words, we look for the estimators $(\hat\beta_0, \hat\beta_1)$ such that

$$(\hat\beta_0, \hat\beta_1) = \arg\min_{(\beta_0, \beta_1) \in \mathbb{R}^2} \text{RSS}(\beta_0, \beta_1).$$
It can be seen that the minimizers of the RSS are

$$\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}, \quad \hat\beta_1 = \frac{s_{xy}}{s_x^2}, \tag{2.3}$$

where:
- $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ is the sample mean (and analogously $\bar{Y}$).
- $s_x^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2$ is the sample variance. The sample standard deviation is $s_x = \sqrt{s_x^2}$.
- $s_{xy} = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})$ is the sample covariance. It measures the degree of linear association between $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_n$. Once scaled by $s_x s_y$, it gives the sample correlation coefficient, $r_{xy} = \frac{s_{xy}}{s_x s_y}$.

As a consequence of (2.3), $\hat\beta_1$ always has the same sign as $s_{xy}$, since $s_x^2 > 0$.
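As a quick numerical check of these definitions (a minimal sketch, not part of the original text; the two short vectors are made up for illustration), R's cov(), sd(), var(), and cor() can be combined to recover the sample correlation coefficient and the sign of the slope estimate:

# Two short illustrative vectors (hypothetical data)
u <- c(1, 3, 4, 6, 8)
v <- c(2, 3, 5, 6, 9)

# Sample correlation "by hand" versus cor()
cov(u, v) / (sd(u) * sd(v))
cor(u, v)

# Slope estimate: same sign as the sample covariance
cov(u, v) / var(u)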
There are some important points hidden behind the choice of the RSS as the error criterion for obtaining $(\hat\beta_0, \hat\beta_1)$:
- Why the vertical distances and not horizontal or perpendicular? Because we want to minimize the error in the prediction of $Y$! Note that the treatment of the variables is not symmetrical.
- Why the squares of the distances and not the absolute values? Due to mathematical convenience: squares are easy to differentiate and are closely related with the normal distribution. (A small numerical comparison of the two criteria is sketched below.)
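The following minimal sketch (not part of the original text; it reuses the seed and the linear response from the data-generating code below) minimizes both criteria numerically with optim() to show that they generally lead to different intercepts and slopes:

# Sum of squared versus sum of absolute vertical distances
set.seed(34567)
x <- rnorm(n = 50)
y <- -0.5 + 1.5 * x + rnorm(n = 50)

rss <- function(beta) sum((y - beta[1] - beta[2] * x)^2)   # squared distances
sad <- function(beta) sum(abs(y - beta[1] - beta[2] * x))  # absolute distances

optim(par = c(0, 0), fn = rss)$par  # close to the least squares coefficients
optim(par = c(0, 0), fn = sad)$par  # typically a different intercept and slope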
Figure 2.16 illustrates the influence of the distance employed in the sum of squares. Try to minimize the sum of squares for the different datasets. Is the best choice of intercept and slope independent of the type of distance?
Figure 2.16: The effect of the kind of distance in the error criterion. The choices of intercept and slope that minimize the sum of squared distances for a kind of distance are not the optimal for a different kind of distance. Application also available here.
The data of the figure has been generated with the following code:
# Generates 50 points from a N(0, 1): predictor and error
set.seed(34567) # Fixes the seed for the random generator
x <- rnorm(n = 50)
eps <- rnorm(n = 50)

# Responses
yLin <- -0.5 + 1.5 * x + eps
yQua <- -0.5 + 1.5 * x^2 + eps
yExp <- -0.5 + 1.5 * 2^x + eps

# Data
leastSquares <- data.frame(x = x, yLin = yLin, yQua = yQua, yExp = yExp)
The minimizers of the error in the above illustration are indeed the coefficients given by the lm function. Check this for the three types of responses: yLin, yQua, and yExp.
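For example, a minimal sketch of that check (assuming the leastSquares data frame generated above) is:

# Least squares coefficients for the three responses
lm(yLin ~ x, data = leastSquares)$coefficients
lm(yQua ~ x, data = leastSquares)$coefficients
lm(yExp ~ x, data = leastSquares)$coefficients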
The population regression coefficients, $(\beta_0, \beta_1)$, are not the same as the estimated regression coefficients, $(\hat\beta_0, \hat\beta_1)$:
- $(\beta_0, \beta_1)$ are the theoretical and always unknown quantities (except under controlled scenarios).
- $(\hat\beta_0, \hat\beta_1)$ are the estimates computed from the data. In particular, they are the output of lm. They are random variables, since they are computed from the random sample $(X_1, Y_1), \ldots, (X_n, Y_n)$.

In an abuse of notation, the term regression line is often used to denote both the theoretical ($y = \beta_0 + \beta_1 x$) and the estimated ($y = \hat\beta_0 + \hat\beta_1 x$) regression lines.
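To illustrate that $(\hat\beta_0, \hat\beta_1)$ are random variables, here is a small simulation sketch (not part of the original text; the seed and the number of replicates are arbitrary) that regenerates the sample several times and refits the model. The estimates change from sample to sample, while the population coefficients stay fixed at $\beta_0 = -0.5$ and $\beta_1 = 1.5$:

# Five estimates of (beta0, beta1) from five different random samples
set.seed(12345)
t(replicate(5, {
  x <- rnorm(n = 50)
  y <- -0.5 + 1.5 * x + rnorm(n = 50)
  lm(y ~ x)$coefficients
}))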
Once we have the least squares estimates $(\hat\beta_0, \hat\beta_1)$, we can define the next two concepts:
- The fitted values $\hat{Y}_1, \ldots, \hat{Y}_n$, where $\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_i$. They are the vertical projections of $Y_1, \ldots, Y_n$ onto the fitted line (see Figure 2.16).
- The residuals (or estimated errors) $\hat\varepsilon_1, \ldots, \hat\varepsilon_n$, where $\hat\varepsilon_i = Y_i - \hat{Y}_i$. They are the vertical distances between the actual data $Y_1, \ldots, Y_n$ and the fitted data $\hat{Y}_1, \ldots, \hat{Y}_n$. Hence, another way of writing the minimum RSS is $\text{RSS}(\hat\beta_0, \hat\beta_1) = \sum_{i=1}^n \hat\varepsilon_i^2$.
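As a quick computational illustration (a sketch assuming the leastSquares data frame generated above), both quantities are directly available from an lm fit:

# Fitted values and residuals of the least squares fit of yLin on x
fit <- lm(yLin ~ x, data = leastSquares)
head(fit$fitted.values)  # fitted values: beta0 hat + beta1 hat * x_i
head(fit$residuals)      # residuals: y_i minus the fitted values
sum(fit$residuals^2)     # the minimum RSS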
To conclude this section, we check that the regression coefficients given by lm are indeed the ones given in (2.3).
# Covariance
Sxy <- cov(x, yLin)

# Variance
Sx2 <- var(x)

# Coefficients
beta1 <- Sxy / Sx2
beta0 <- mean(yLin) - beta1 * mean(x)
c(beta0, beta1)
## [1] -0.6153744  1.3950973

# Output from lm
mod <- lm(yLin ~ x, data = leastSquares)
mod$coefficients
## (Intercept)           x
##  -0.6153744   1.3950973
Adapt the code conveniently for doing the same checking with (a sketch for the first case is given after the list):
- ReadingMean and MathMean from the pisa dataset.
- logGDPp and MathMean.
- Population2010 and Seats2013.2023 from the US dataset.
- Population2010 and Seats2011 from the EU dataset.
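For instance, a sketch of the first check could look as follows (assuming the pisa data frame is loaded with columns named ReadingMean and MathMean, and taking MathMean as the response; adapt the roles of the variables if needed):

# Coefficients by hand for MathMean on ReadingMean
Sxy <- cov(pisa$ReadingMean, pisa$MathMean)
Sx2 <- var(pisa$ReadingMean)
beta1 <- Sxy / Sx2
beta0 <- mean(pisa$MathMean) - beta1 * mean(pisa$ReadingMean)
c(beta0, beta1)

# Output from lm
lm(MathMean ~ ReadingMean, data = pisa)$coefficients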