2.3 Model formulation and estimation by least squares
The simple linear model is a statistical tool for describing the relation between two random variables, \(X\) and \(Y\). For example, in the `pisa` dataset, \(X\) could be `ReadingMean` and \(Y\) could be `MathMean`. The simple linear model is constructed by assuming that the linear relation
\[\begin{align}
Y = \beta_0 + \beta_1 X + \varepsilon \tag{2.1}
\end{align}\]
holds between \(X\) and \(Y\). In (2.1), \(\beta_0\) and \(\beta_1\) are known as the intercept and slope, respectively. \(\varepsilon\) is a random variable with mean zero and independent of \(X\). It describes the error around the mean, or the effect of other variables that we do not model. Another way of looking at (2.1) is
\[\begin{align}
\mathbb{E}[Y \mid X=x]=\beta_0+\beta_1x, \tag{2.2}
\end{align}\]
since \(\mathbb{E}[\varepsilon \mid X=x]=0\).
The Left Hand Side (LHS) of (2.2) is the conditional expectation of \(Y\) given \(X\). It represents how the mean of the random variable \(Y\) changes according to a particular value, denoted by \(x\), of the random variable \(X\). The Right Hand Side (RHS) states that this mean changes linearly with respect to the value of \(X\). Hence the interpretation of the coefficients:
- \(\beta_0\) is the mean of \(Y\) when \(X=0\).
- \(\beta_1\) is the increment in the mean of \(Y\) for an increment of one unit in \(X=x\).
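These interpretations can be checked numerically on a fitted line. The following sketch (simulated data with an arbitrary seed, not one of the book's datasets) shows that predictions at \(x=0,1,2\) start at the estimated intercept and grow by the estimated slope per unit of \(x\):

```r
# Illustration: slope = per-unit increment in the mean (simulated data;
# the seed is arbitrary)
set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
mod <- lm(y ~ x)

# Predicted means at x = 0, 1, 2
pred <- predict(mod, newdata = data.frame(x = c(0, 1, 2)))
unname(pred[1]) # The estimated intercept: mean of Y when X = 0
diff(pred) # Each one-unit increment in x adds the estimated slope
```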
The estimation of \((\beta_0,\beta_1)\) is done from a sample \((X_1,Y_1),\ldots,(X_n,Y_n)\). For example, in the `pisa` dataset, the sample consists of the observations for `ReadingMean` and `MathMean`. A possible way of estimating \((\beta_0,\beta_1)\) is by minimizing the Residual Sum of Squares (RSS):
\[\begin{align*}
\text{RSS}(\beta_0,\beta_1)=\sum_{i=1}^n(Y_i-\beta_0-\beta_1X_i)^2.
\end{align*}\]
In other words, we look for the estimators \((\hat\beta_0,\hat\beta_1)\) such that
\[\begin{align*}
(\hat\beta_0,\hat\beta_1)=\arg\min_{(\beta_0,\beta_1)\in\mathbb{R}^2} \text{RSS}(\beta_0,\beta_1).
\end{align*}\]
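Before looking at the closed-form solution, the minimization can be sketched numerically with R's general-purpose optimizer `optim` (the simulated data below mirrors the code given later in this section; this is just an illustrative check, not how `lm` works internally):

```r
# Simulated data (same settings as the example later in this section)
set.seed(34567)
x <- rnorm(n = 50)
eps <- rnorm(n = 50)
yLin <- 0.5 + 1.5 * x + eps

# RSS as a function of (beta0, beta1)
rss <- function(beta) sum((yLin - beta[1] - beta[2] * x)^2)

# Numerical minimization, starting from (0, 0)
fit <- optim(par = c(0, 0), fn = rss)
fit$par # Approximately the least squares estimates given by lm
```

`optim` uses the Nelder-Mead method by default, so the minimizer is approximate; it agrees with `lm`'s exact solution to a few decimal places.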
It can be seen that the minimizers of the RSS^{9} are
\[\begin{align}
\hat\beta_0=\bar{Y}-\hat\beta_1\bar{X},\quad \hat\beta_1=\frac{s_{xy}}{s_x^2},\tag{2.3}
\end{align}\]
where:
- \(\bar{X}=\frac{1}{n}\sum_{i=1}^nX_i\) is the sample mean.
- \(s_x^2=\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})^2\) is the sample variance. The sample standard deviation is \(s_x=\sqrt{s_x^2}\).
- \(s_{xy}=\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})(Y_i-\bar{Y})\) is the sample covariance. It measures the degree of linear association between \(X_1,\ldots,X_n\) and \(Y_1,\ldots,Y_n\). Once scaled by \(s_xs_y\), it gives the sample correlation coefficient, \(r_{xy}=\frac{s_{xy}}{s_xs_y}\).
As a consequence of \(\hat\beta_1=r_{xy}\frac{s_y}{s_x}\), \(\hat\beta_1\) always has the same sign as \(r_{xy}\).
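This relation is easy to verify numerically (a quick sketch with simulated data; note that R's `cov` and `var` divide by \(n-1\) rather than \(n\), but the ratio defining \(\hat\beta_1\) is unaffected):

```r
# Check that the slope estimate equals r_xy * s_y / s_x
# (simulated data; arbitrary seed)
set.seed(1)
x <- rnorm(100)
y <- 2 - 3 * x + rnorm(100)

# cov() and var() use the n - 1 denominator; the ratio is the same as with 1 / n
beta1Hat <- cov(x, y) / var(x)
all.equal(beta1Hat, cor(x, y) * sd(y) / sd(x)) # TRUE
sign(beta1Hat) == sign(cor(x, y)) # TRUE: always the same sign
```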
There are some important points hidden behind the choice of the RSS as the error criterion for obtaining \((\hat\beta_0,\hat\beta_1)\):
- Why the vertical distances and not horizontal or perpendicular ones? Because we want to minimize the error in the prediction of \(Y\)! Note that the treatment of the variables is not symmetrical^{10}.
- Why the squares of the distances and not the absolute values? Due to mathematical convenience. Squares are easy to differentiate and are closely related with the normal distribution.
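To see that the choice of criterion matters, the following sketch fits the same simulated data both by least squares and by least absolute deviations, the latter minimized numerically with `optim` (the data and seed are arbitrary illustrations, not from the book's datasets):

```r
# Least squares vs least absolute deviations on the same data
set.seed(42) # Arbitrary seed for this illustration
x <- rnorm(50)
y <- 0.5 + 1.5 * x + rnorm(50)

# Sum of absolute deviations: the criterion replacing the RSS
sad <- function(beta) sum(abs(y - beta[1] - beta[2] * x))
ladFit <- optim(par = c(0, 0), fn = sad)$par

# Least squares solution for comparison
lsFit <- unname(coef(lm(y ~ x)))
rbind(ladFit, lsFit) # Similar, but in general not identical
```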
Figure 2.16 illustrates the influence of the distance employed in the sum of squares. Try to minimize the sum of squares for the different datasets. Is the best choice of intercept and slope independent of the type of distance?
The data of the figure has been generated with the following code:
```r
# Generates 50 points from a N(0, 1): predictor and error
set.seed(34567) # Fixes the seed for the random generator
x <- rnorm(n = 50)
eps <- rnorm(n = 50)

# Responses
yLin <- 0.5 + 1.5 * x + eps
yQua <- 0.5 + 1.5 * x^2 + eps
yExp <- 0.5 + 1.5 * 2^x + eps

# Data
leastSquares <- data.frame(x = x, yLin = yLin, yQua = yQua, yExp = yExp)
```
The minimizers of the error in the above illustration are indeed the coefficients given by the `lm` function. Check this for the three types of responses: `yLin`, `yQua`, and `yExp`.
The population regression coefficients, \((\beta_0,\beta_1)\), are not the same as the estimated regression coefficients, \((\hat\beta_0,\hat\beta_1)\):

- \((\beta_0,\beta_1)\) are the theoretical and always unknown quantities (except under controlled scenarios).
- \((\hat\beta_0,\hat\beta_1)\) are the estimates computed from the data. In particular, they are the output of `lm`. They are random variables, since they are computed from the random sample \((X_1,Y_1),\ldots,(X_n,Y_n)\).

In an abuse of notation, the term regression line is often used to denote both the theoretical (\(y=\beta_0+\beta_1x\)) and the estimated (\(y=\hat\beta_0+\hat\beta_1x\)) regression lines.
Once we have the least squares estimates \((\hat\beta_0,\hat\beta_1)\), we can define the next two concepts:
- The fitted values \(\hat Y_1,\ldots,\hat Y_n\), where
\[\begin{align*}
\hat Y_i=\hat\beta_0+\hat\beta_1X_i,\quad i=1,\ldots,n.
\end{align*}\]
They are the vertical projections of \(Y_1,\ldots,Y_n\) onto the fitted line (see Figure 2.16).
- The residuals (or estimated errors) \(\hat \varepsilon_1,\ldots,\hat \varepsilon_n\), where
\[\begin{align*}
\hat\varepsilon_i=Y_i-\hat Y_i,\quad i=1,\ldots,n.
\end{align*}\]
They are the vertical distances between the actual data \((X_i,Y_i)\) and the fitted data \((X_i,\hat Y_i)\). Hence, another way of writing the minimum RSS is \(\sum_{i=1}^n(Y_i-\hat\beta_0-\hat\beta_1X_i)^2=\sum_{i=1}^n\hat\varepsilon_i^2\).
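This identity can be checked in R: `fitted` and `residuals` extract \(\hat Y_i\) and \(\hat\varepsilon_i\) from an `lm` fit, and for `lm` objects `deviance` returns the RSS (the code below reproduces the simulated data used in this section):

```r
# Reproduce the simulated data of this section
set.seed(34567)
x <- rnorm(n = 50)
eps <- rnorm(n = 50)
yLin <- 0.5 + 1.5 * x + eps

mod <- lm(yLin ~ x)
rssMin <- sum((yLin - fitted(mod))^2) # RSS evaluated at the minimizers
all.equal(rssMin, sum(residuals(mod)^2)) # TRUE
all.equal(rssMin, deviance(mod)) # TRUE: deviance() gives the RSS for lm fits
```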
To conclude this section, we check that the regression coefficients given by `lm` are indeed the ones given in (2.3).
```r
# Covariance
Sxy <- cov(x, yLin)

# Variance
Sx2 <- var(x)

# Coefficients
beta1 <- Sxy / Sx2
beta0 <- mean(yLin) - beta1 * mean(x)
c(beta0, beta1)
## [1] 0.6153744 1.3950973

# Output from lm
mod <- lm(yLin ~ x, data = leastSquares)
mod$coefficients
## (Intercept)           x
##   0.6153744   1.3950973
```
Adapt the code conveniently for doing the same checking with:

- \(X=\) `ReadingMean` and \(Y=\) `MathMean` from the `pisa` dataset.
- \(X=\) `logGDPp` and \(Y=\) `MathMean`.
- \(X=\) `Population2010` and \(Y=\) `Seats2013.2023` from the `US` dataset.
- \(X=\) `Population2010` and \(Y=\) `Seats2011` from the `EU` dataset.