2.3 Model formulation and estimation by least squares

The simple linear model is a statistical tool for describing the relation between two random variables, $X$ and $Y$ . For example, in the pisa dataset, $X$ could be ReadingMean and $Y=$ MathMean. The simple linear model is constructed by assuming that the linear relation $\begin{align} Y = \beta_0 + \beta_1 X + \varepsilon \tag{2.1} \end{align}$ holds between $X$ and $Y$ . In (2.1), $\beta_0$ and $\beta_1$ are known as the intercept and slope, respectively. $\varepsilon$ is a random variable with mean zero and independent from $X$ . It describes the error around the mean, or the effect of other variables that we do not model. Another way of looking at (2.1) is $\begin{align} \mathbb{E}[Y|X=x]=\beta_0+\beta_1x, \tag{2.2} \end{align}$ since $\mathbb{E}[\varepsilon|X=x]=0$ .

The Left Hand Side (LHS) of (2.2) is the conditional expectation of $Y$ given $X$ . It represents how the mean of the random variable $Y$ is changing according to a particular value, denoted by $x$ , of the random variable $X$ . With the RHS, what we are saying is that the mean of $Y$ is changing in a linear fashion with respect to the value of $X$ . Hence the interpretation of the coefficients:

$\beta_0$ : is the mean of $Y$ when $X=0$ .
$\beta_1$ : is the increment in mean of $Y$ for an increment of one unit in $X=x$ .

If we have a sample $(X_1,Y_1),\ldots,(X_n,Y_n)$ for our random variables $X$ and $Y$ , we can estimate the unknown coefficients $\beta_0$ and $\beta_1$ . In the pisa dataset, the sample are the observations for ReadingMean and MathMean. A possible way of estimating $(\beta_0,\beta_1)$ is by minimizing the Residual Sum of Squares (RSS): $\begin{align*} \text{RSS}(\beta_0,\beta_1)=\sum_{i=1}^n(Y_i-\beta_0-\beta_1X_i)^2. \end{align*}$ In other words, we look for the estimators $(\hat\beta_0,\hat\beta_1)$ such that $\begin{align*} (\hat\beta_0,\hat\beta_1)=\arg\min_{(\beta_0,\beta_1)\in\mathbb{R}^2} \text{RSS}(\beta_0,\beta_1). \end{align*}$

It can be seen that the minimizers of the RSS⁸ are $\begin{align} \hat\beta_0=\bar{Y}-\hat\beta_1\bar{X},\quad \hat\beta_1=\frac{s_{xy}}{s_x^2},\tag{2.3} \end{align}$ where:

$\bar{X}=\frac{1}{n}\sum_{i=1}^nX_i$ is the sample mean.
$s_x^2=\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})^2$ is the sample variance. The sample standard deviation is $s_x=\sqrt{s_x^2}$ .
$s_{xy}=\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})(Y_i-\bar{Y})$ is the sample covariance. It measures the degree of linear association between $X_1,\ldots,X_n$ and $Y_1,\ldots,Y_n$ . Once scaled by $s_xs_y$ , it gives the sample correlation coefficient, $r_{xy}=\frac{s_{xy}}{s_xs_y}$ .

As a consequence of $\hat\beta_1=r_{xy}\frac{s_y}{s_x}$ , $\hat\beta_1$ has always the same sign as $r_{xy}$ .

There are some important points hidden behind the election of RSS as the error criterion for obtaining $(\hat\beta_0,\hat\beta_1)$ :

Why the vertical distances and not horizontal or perpendicular? Because we want to minimize the error in the prediction of $Y$ ! Note that the treatment of the variables is not symmetrical⁹.
Why the squares in the distances and not the absolute value? Due to mathematical convenience. Squares are nice to differentiate and are closely related with the normal distribution.

Figure 2.16 illustrates the influence of the distance employed in the sum of squares. Try to minimize the sum of squares for the different datasets. Is the best choice of intercept and slope independent of the type of distance?

Figure 2.16: The effect of the kind of distance in the error criterion. The choices of intercept and slope that minimize the sum of squared distances for a kind of distance are not the optimal for a different kind of distance. Application also available here.

The data of the figure has been generated with the following code:

# Generates 50 points from a N(0, 1): predictor and error
set.seed(34567) # Fixes the seed for the random generator
x <- rnorm(n = 50)
eps <- rnorm(n = 50)

# Responses
yLin <- -0.5 + 1.5 * x + eps
yQua <- -0.5 + 1.5 * x^2 + eps
yExp <- -0.5 + 1.5 * 2^x + eps

# Data
leastSquares <- data.frame(x = x, yLin = yLin, yQua = yQua, yExp = yExp)

The minimizers of the error in the above illustration are indeed the coefficients given by the lm function. Check this for the three types of responses: yLin, yQua and yExp.

The population regression coefficients, $(\beta_0,\beta_1)$ , are not the same as the estimated regression coefficients, $(\hat\beta_0,\hat\beta_1)$ :

$(\beta_0,\beta_1)$ are the theoretical and always unknown quantities (except under controlled scenarios).
$(\hat\beta_0,\hat\beta_1)$ are the estimates computed from the data. In particular, they are the output of lm. They are random variables, since they are computed from the random sample $(X_1,Y_1),\ldots,(X_n,Y_n)$ .

In an abuse of notation, the term regression line is often used to denote both the theoretical ( $y=\beta_0+\beta_1x$ ) and the estimated ( $y=\hat\beta_0+\hat\beta_1x$ ) regression lines.

Once we have the least squares estimates $(\hat\beta_0,\hat\beta_1)$ , we can define the next two concepts:

The fitted values $\hat Y_1,\ldots,\hat Y_n$ , where $\begin{align*} \hat Y_i=\hat\beta_0+\hat\beta_1X_i,\quad i=1,\ldots,n. \end{align*}$ They are the vertical projections of $Y_1,\ldots,Y_n$ into the fitted line (see Figure 2.16).
The residuals (or estimated errors) $\hat \varepsilon_1,\ldots,\hat \varepsilon_n$ , where $\begin{align*} \hat\varepsilon_i=Y_i-\hat Y_i,\quad i=1,\ldots,n. \end{align*}$ They are the vertical distances between actual data $(X_i,Y_i)$ and fitted data $(X_i,\hat Y_i)$ . Hence, another way of writing the minimum RSS is $\sum_{i=1}^n(Y_i-\hat\beta_0-\hat\beta_1X_i)^2=\sum_{i=1}^n\hat\varepsilon_i^2$ .

To conclude this section, we check that the regression coefficients given by lm are indeed the ones given in (2.3).

# Covariance
Sxy <- cov(x, yLin)

# Variance
Sx2 <- var(x)

# Coefficients
beta1 <- Sxy / Sx2
beta0 <- mean(yLin) - beta1 * mean(x)
c(beta0, beta1)
## [1] -0.6153744  1.3950973

# Output from lm
mod <- lm(yLin ~ x, data = leastSquares)
mod$coefficients
## (Intercept)           x 
##  -0.6153744   1.3950973

Adapt the code conveniently for doing the same checking with

$X=$ ReadingMean and $Y=$ MathMean from the pisa dataset.
$X=$ logGDPp and $Y=$ MathMean.
$X=$ Population2010 and $Y=$ Seats2013.2023 from the US dataset.
$X=$ Population2010 and $Y=$ Seats2011 from the EU dataset.

They are unique and always exist. They can be obtained by solving $\frac{\partial}{\partial \beta_0}\text{RSS}(\beta_0,\beta_1)=0$ and $\frac{\partial}{\partial \beta_1}\text{RSS}(\beta_0,\beta_1)=0$ .↩︎
In Chapter 5 we will consider perpendicular distances.↩︎