2.2 Line of best fit

Now that we have discussed how to define a straight line, we can apply this knowledge to finding a "line of best fit". Consider again the scatter plot below, which shows average happiness versus average income for different countries (Gapminder.org 2021):

Fitting a 'line of best fit' requires choosing a slope and $y$-intercept that place the line where it fits the data best. In the above graph, we can see that, depending on the choices of the $y$-intercept and slope, we can end up with better or worse models (or lines). Simple Linear Regression uses the data to determine exactly where the line of best fit should be. Before we consider how this is done, let's see how the model is defined.

Recall that we can define a straight line as $y = mx + c$, where $m$ is the slope and $c$ is the $y$-intercept. Equivalently, we could write this equation as $y = c + mx$. Bearing this in mind, we can define the simple linear regression model as follows:

Simple linear regression model definition:

$y = \beta_0 + \beta_1 x + \epsilon,$ where:

$x$ is the explanatory variable (also referred to as the independent variable or predictor variable)
$y$ is the response variable (also referred to as the dependent variable)
$\beta_0$ is the $y$-intercept of the line (just like $c$ in the equation we looked at earlier) and is referred to as the intercept coefficient
$\beta_1$ is the slope of the line (just like $m$ in the equation we looked at earlier) and is referred to as the slope coefficient
$\epsilon$ is known as the random error term, which has expected value $\text{E}(\epsilon) = 0$.
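
To make the definition concrete, the following minimal sketch simulates data from the model. The coefficient values, the sample size, the range of $x$ and the use of normally distributed errors are all assumptions made for illustration; the definition above only requires that the error term has expected value 0.

```python
import numpy as np

# Hypothetical values chosen only for illustration; they are not from the text.
beta_0 = 2.0   # intercept coefficient
beta_1 = 0.5   # slope coefficient
n = 200        # number of simulated observations

rng = np.random.default_rng(seed=1)

# Explanatory variable: arbitrary values spread between 0 and 10.
x = rng.uniform(0.0, 10.0, size=n)

# Random error term with expected value 0 (normality is an extra assumption here).
epsilon = rng.normal(loc=0.0, scale=1.0, size=n)

# Response variable generated by the simple linear regression model.
y = beta_0 + beta_1 * x + epsilon

# The simulated errors average out close to 0, so the points scatter around the line.
print(f"mean of simulated errors: {epsilon.mean():.3f}")
```

Each simulated $y$ equals the value on the line $\beta_0 + \beta_1 x$ plus its own random error, which is why real data scatter around a line rather than sitting exactly on it.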

Pronunciation of terms
$\beta_0$ can be pronounced "beta nought"
$\beta_1$ can be pronounced "beta one"
$\epsilon$ can be pronounced "epsilon"

Then, supposing we have a data set with $n$ observations, each with a value for $x$ and a value for $y$ denoted as

$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

we can use this data to obtain the sample estimates $\widehat{\beta}_0$ and $\widehat{\beta}_1$ of the coefficients, which define the fitted line

$\widehat{y}=\widehat{\beta}_0+\widehat{\beta}_1x.$

How do we choose the line of best fit?

Using the data available to us, we need a criterion for choosing a slope $(\widehat{\beta}_1)$ and intercept $(\widehat{\beta}_0)$ that put the line in the 'best' spot. With Simple Linear Regression, this criterion is to fit the model to the data such that the sum of squared residuals is minimized. To understand this, let's consider a simple example based on the figure below:

Suppose we have a data set consisting of the three green observations displayed in the above figure. Each observation has an associated 'fit' on the blue line, represented by the black dots. Each observation also has a corresponding residual, $e$, which is the vertical distance between the observation and the corresponding fit. If we square all of these residuals and add them up, Simple Linear Regression places the line so that this sum is as small as possible.

What is a residual?

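A residual is the signed vertical distance between the observed value and the regression line, that is, $y - \widehat{y}$. It is positive when the observation lies above the line and negative when it lies below.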
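
As a small illustration of these definitions, the sketch below computes the fitted values, the residuals and the sum of squared residuals for three observations and one candidate line. The data values and the candidate intercept and slope (`b0`, `b1`) are hypothetical; they are not read off the figure.

```python
# Hypothetical three-point data set (not the values from the figure).
x = [1.0, 2.0, 3.0]
y = [2.0, 2.5, 4.5]

# Hypothetical candidate line: y_hat = b0 + b1 * x.
b0, b1 = 0.5, 1.25

# Fitted value for each observation (the point on the line at the same x).
y_hat = [b0 + b1 * xi for xi in x]

# Residual e = y - y_hat: positive above the line, negative below it.
residuals = [yi - fi for yi, fi in zip(y, y_hat)]

# Sum of squared residuals for this candidate line.
ssr = sum(e ** 2 for e in residuals)

print("fitted values:", y_hat)                # [1.75, 3.0, 4.25]
print("residuals:", residuals)                # [0.25, -0.5, 0.25]
print("sum of squared residuals:", ssr)       # 0.375
```

Squaring removes the signs, so observations above and below the line both add to the total rather than cancelling each other out.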

Returning to our happiness versus income example, let's compare the sum of squared residuals (SSR) for the three models:

As we can see, the 'best fit' has the lowest sum of squared residuals (2,529) of the three models. Simple Linear Regression chooses the values of $\widehat{\beta}_0$ and $\widehat{\beta}_1$ so that the line has the lowest sum of squared residuals possible.
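
The same kind of comparison can be sketched in code. The example below uses a small invented data set (not the Gapminder values, so it will not reproduce the figure of 2,529) and compares two hand-picked lines with the least-squares line. The candidate names and coefficients are hypothetical, and NumPy's `np.polyfit` is used here only as one convenient way to obtain the least-squares estimates.

```python
import numpy as np

# Hypothetical data, invented for illustration (not the Gapminder values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])


def sum_squared_residuals(b0, b1):
    """Return the sum of squared residuals for the line y_hat = b0 + b1 * x."""
    residuals = y - (b0 + b1 * x)
    return float(np.sum(residuals ** 2))


# Least-squares estimates; for degree 1, np.polyfit returns the slope first.
b1_hat, b0_hat = np.polyfit(x, y, deg=1)

# Three candidate lines as (intercept, slope): two picked by eye, one least squares.
candidates = {
    "too flat": (3.0, 0.3),
    "too steep": (0.0, 1.5),
    "least squares": (b0_hat, b1_hat),
}

ssr = {name: sum_squared_residuals(b0, b1) for name, (b0, b1) in candidates.items()}
for name, value in ssr.items():
    print(f"{name:>13}: SSR = {value:.3f}")
```

Nudging the least-squares intercept or slope in either direction, for example `sum_squared_residuals(b0_hat + 0.1, b1_hat)`, gives a larger sum, which is what it means for that line to be in the 'best' spot.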

Thanks to calculus, it is possible to calculate the values of $\widehat{\beta}_0$ and $\widehat{\beta}_1$ by hand; however, we will let statistical software packages do the hard work for us.
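
For reference, setting the partial derivatives of the sum of squared residuals with respect to the intercept and the slope to zero leads to the standard least-squares formulas

$\widehat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \widehat{\beta}_0 = \bar{y} - \widehat{\beta}_1\bar{x},$

where $\bar{x}$ and $\bar{y}$ are the sample means of $x$ and $y$. The sketch below applies these formulas to a small invented data set and checks the result against a software fit. The data are hypothetical, and NumPy's `np.polyfit` stands in for "statistical software" purely as an assumption; the text does not say which package is used.

```python
import numpy as np

# Hypothetical data, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# "By hand": apply the closed-form least-squares formulas.
x_bar, y_bar = x.mean(), y.mean()
b1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0_hat = y_bar - b1_hat * x_bar

# "By software": np.polyfit fits a degree-1 polynomial by least squares
# and returns the slope first, then the intercept.
b1_soft, b0_soft = np.polyfit(x, y, deg=1)

print(f"by hand:  intercept = {b0_hat:.4f}, slope = {b1_hat:.4f}")
print(f"software: intercept = {b0_soft:.4f}, slope = {b1_soft:.4f}")
```

Both routes should agree up to floating-point rounding, since the software is minimizing the same sum of squared residuals.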

References

Gapminder.org. 2021. “Free Data from World Bank via Gapminder.org, CC-BY License.” 2021. https://www.gapminder.org/data/.