3 Simple linear regression

Simple linear regression models describe the relationship between two variables with a straight line. To illustrate the modeling process, we'll use the county dataset introduced in section 1.2. Let us first look at a summary of the dataset:

library(openintro)
summary(county)
##      name                state         pop2000           pop2010       
##  Length:3142        Texas   : 254   Min.   :     67   Min.   :     82  
##  Class :character   Georgia : 159   1st Qu.:  11224   1st Qu.:  11114  
##  Mode  :character   Virginia: 133   Median :  24621   Median :  25872  
##                     Kentucky: 120   Mean   :  89650   Mean   :  98262  
##                     Missouri: 115   3rd Qu.:  61775   3rd Qu.:  66780  
##                     Kansas  : 105   Max.   :9519338   Max.   :9818605  
##                     (Other) :2256   NA's   :3                          
##     pop2017           pop_change          poverty      homeownership  
##  Min.   :      88   Min.   :-33.6300   Min.   : 2.40   Min.   : 0.00  
##  1st Qu.:   10976   1st Qu.: -1.9700   1st Qu.:11.30   1st Qu.:69.50  
##  Median :   25857   Median : -0.0600   Median :15.20   Median :74.60  
##  Mean   :  103763   Mean   :  0.5339   Mean   :15.97   Mean   :73.27  
##  3rd Qu.:   67756   3rd Qu.:  2.3750   3rd Qu.:19.40   3rd Qu.:78.40  
##  Max.   :10163507   Max.   : 37.1900   Max.   :52.00   Max.   :91.30  
##  NA's   :3          NA's   :3          NA's   :2                      
##    multi_unit    unemployment_rate  metro             median_edu  
##  Min.   : 0.00   Min.   : 1.620    no  :1974   below_hs    :   2  
##  1st Qu.: 6.10   1st Qu.: 3.520    yes :1165   hs_diploma  :1397  
##  Median : 9.70   Median : 4.360    NA's:   3   some_college:1695  
##  Mean   :12.32   Mean   : 4.611                bachelors   :  46  
##  3rd Qu.:15.90   3rd Qu.: 5.355                NA's        :   2  
##  Max.   :98.50   Max.   :19.070                                   
##                  NA's   :3                                        
##  per_capita_income median_hh_income   smoking_ban  
##  Min.   :10467     Min.   : 19264   none    :1927  
##  1st Qu.:21772     1st Qu.: 41126   partial : 635  
##  Median :25445     Median : 48073   complete:   0  
##  Mean   :26093     Mean   : 49765   NA's    : 580  
##  3rd Qu.:29276     3rd Qu.: 55771                  
##  Max.   :69533     Max.   :129588                  
##  NA's   :2         NA's   :2

The output of summary(county) gives a glimpse of the variables in the dataset: it shows which variables are numerical, which are categorical, and what values they can take.

Suppose that we would like to gain a better understanding of the relationships between the variables in the dataset and the population change from 2010 to 2017, with the ultimate goal of identifying which variable is most associated with population change. We can use the plot function to look at scatterplots (or boxplots if the explanatory variable is categorical) of pop_change (y) versus each variable (x).

plot(pop_change ~ ., data = county[,-c(1:5)])

The dot in the expression pop_change ~ . means “all variables in the dataset except pop_change”. The expression data = county[,-c(1:5)] means that the dataset to be considered is the county dataset with the exclusion of columns 1 through 5. That is, these five variables will not be plotted against pop_change.
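
To double-check which variables are being left out, we can list the names of the first five columns. If the column order ever changes, the same exclusion can also be done by name; here is a sketch (id_vars is just an illustrative name):

# The five columns excluded from the plots
names(county)[1:5]

# Equivalent exclusion by name rather than by position
id_vars <- c("name", "state", "pop2000", "pop2010", "pop2017")
plot(pop_change ~ ., data = county[, !(names(county) %in% id_vars)])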

It is possible to see from the plots that some variables seem to be associated with pop_change. For example, per_capita_income and median_hh_income seem to be positively associated with pop_change. In the remainder of this chapter, we will look at ways to quantify such relationships.

3.1 Least squares model

Let us first develop a model that considers one quantitative explanatory variable and one quantitative response variable (we will later describe how this model changes when considering a categorical explanatory variable). For a sample with \(n\) observations for the explanatory (\(X\)) and response (\(Y\)) variables, write each pair as \((X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)\). The goal is to write the equation of a line that “best” represents the relationship between the variables \(X\) and \(Y\). It turns out that there is more than one criterion that defines what the best line is. Here we go over the most used one, the least squares criterion.

Definition 3.1 (Least Squares line) The least squares (LS) line is given by \(\hat{Y} = \hat{a} + \hat{b}X\), where \(\hat{a}\) and \(\hat{b}\) are the values of \(a\) and \(b\) that minimize the quantity \[SS(a,b) = \sum_{i=1}^n(Y_i - (a+bX_i))^2.\]


Here SS stands for “Sum of Squares”. The quantity \(Y_i - (a+bX_i)\) is the error, also called the residual, of the estimate \(a+bX_i\) in predicting the value \(Y_i\). Visually, \(Y_i - (a+bX_i)\) is the signed vertical distance from \(Y_i\) to the line: positive if \(Y_i\) is above the line and negative otherwise (in the figure below, positive residuals are blue and negative ones are red).

For example, for the line \(Y=2X\), the point \((5,6)\) in the dataset has the following residual: \(6 - (0+2\times 5) = -4\). For the line \(Y=-0.5+2X\), the same point \((5,6)\) now has the following residual: \(6 - (-0.5+2\times 5) = -3.5\).
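
These calculations are easy to reproduce in R; a quick sketch for the point and the two candidate lines above:

# Residuals of the point (5, 6) under the two candidate lines
x <- 5
y <- 6
y - (0 + 2 * x)      # line Y = 2X: residual -4
y - (-0.5 + 2 * x)   # line Y = -0.5 + 2X: residual -3.5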

Theorem 3.1 The values of \(a\) and \(b\) which minimize \(SS(a,b)\) are \[ \hat{b} = \frac{\sum(Y_i-\overline{Y})(X_i-\overline{X})}{\sum(X_i-\overline{X})^2} \quad\mbox{and}\quad \hat{a} = \overline{Y}-\hat{b}\overline{X}. \]

Proof. We want to minimize a function of two variables, which can be done using calculus or, in this specific case, using properties of quadratic functions. For a fixed \(b\), \(SS(a,b)\) is a quadratic function of \(a\), so let us write \(SS(a,b)\) in a way that makes it easier to find the vertex of the parabola: \[\begin{eqnarray} SS(a,b) &=& \sum(-a+(Y_i-bX_i))^2 = \sum(a^2-2a(Y_i-bX_i)+(Y_i-bX_i)^2)\\ &=& na^2-2a(n\overline{Y}-bn\overline{X}) + \sum(Y_i-bX_i)^2, \end{eqnarray}\] which has a minimum at \[\hat{a} = \frac{2(n\overline{Y}-bn\overline{X})}{2n} = \overline{Y} - b\overline{X}.\] Here we used the formula for the x-coordinate of the vertex of a parabola. Now substitute \({a}=\overline{Y} - b\overline{X}\) into \(SS(a,b)\): \[\begin{eqnarray} SS(b) &=& \sum(Y_i - (\overline{Y} - b\overline{X} + bX_i))^2 = \sum((Y_i - \overline{Y}) - b(X_i - \overline{X}))^2\\ &=& \sum(Y_i - \overline{Y})^2 - 2b\sum(Y_i - \overline{Y})(X_i - \overline{X}) + b^2\sum(X_i - \overline{X})^2, \end{eqnarray}\] which has a minimum at \[\hat{b} = \frac{\sum(Y_i - \overline{Y})(X_i - \overline{X})}{\sum(X_i - \overline{X})^2}.\]


Let’s look at a very small dataset to illustrate the calculation of the least squares line.

Example 3.1 Consider the data \((-1,0),(1,3),(2,2)\). To find the least squares line, we first compute the means \(\overline{X} = (-1+1+2)/3 = 2/3\) and \(\overline{Y} = (0+3+2)/3 = 5/3\), then \(\hat{b}\), and finally \(\hat{a}\): \[\begin{eqnarray} \hat{b} &=& \frac{\sum(Y_i - \overline{Y})(X_i - \overline{X})}{\sum(X_i - \overline{X})^2}\\ &=& \frac{(0-5/3)(-1-2/3)+(3-5/3)(1-2/3)+(2-5/3)(2-2/3)}{(-1-2/3)^2+(1-2/3)^2+(2-2/3)^2} = 0.7857.\\ \hat{a} &=& \overline{Y} - \hat{b}\overline{X} = 5/3 - 0.7857\times 2/3 = 1.1429. \end{eqnarray}\]
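
Before turning to R's built-in tools, it is worth noting that the formulas from theorem 3.1 can be typed almost verbatim:

# Apply the formulas from theorem 3.1 to the data of example 3.1
x <- c(-1, 1, 2)
y <- c(0, 3, 2)
b_hat <- sum((y - mean(y)) * (x - mean(x))) / sum((x - mean(x))^2)
a_hat <- mean(y) - b_hat * mean(x)
c(intercept = a_hat, slope = b_hat)   # approximately 1.1429 and 0.7857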


In R, the least squares line is calculated with the function lm. For the data in example 3.1, the intercept and slope of the LS line are calculated as follows:

dat <- data.frame(x = c(-1,1,2), y = c(0,3,2))
lm(y ~ x, data = dat)
## 
## Call:
## lm(formula = y ~ x, data = dat)
## 
## Coefficients:
## (Intercept)            x  
##      1.1429       0.7857
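
The fitted model can also be saved in an object so that its components can be extracted later; a short sketch (fit is just an illustrative name):

# Store the fit and extract its pieces
fit <- lm(y ~ x, data = dat)
coef(fit)     # intercept and slope
fitted(fit)   # predicted values for the three points
resid(fit)    # residuals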

We can also plot the LS line together with the data points by adding the layer geom_smooth(method = "lm"):

ggplot(dat, aes(x = x, y = y)) +
  geom_point() + 
  geom_smooth(method = "lm", se = F)

Here, se = F (short for FALSE) suppresses the confidence band that geom_smooth draws around the fitted line by default.
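
For those who prefer base graphics over ggplot2, a minimal sketch that produces the same kind of plot with plot() and abline():

# Base-graphics alternative: scatterplot of the data plus the LS line
fit <- lm(y ~ x, data = dat)
plot(y ~ x, data = dat)
abline(fit)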

Example 3.2 The LS line for pop_change and median_hh_income in the county dataset is given by:

lm(pop_change ~ median_hh_income, data = county)
## 
## Call:
## lm(formula = pop_change ~ median_hh_income, data = county)
## 
## Coefficients:
##      (Intercept)  median_hh_income  
##       -5.7712741         0.0001267
ggplot(county, aes(x = median_hh_income, y = pop_change)) +
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", se = F)

Using the LS line for interpretation: One major advantage of linear models over more complex models is how easy they are to interpret. The slope of the line can be interpreted as the average increase or decrease (depending on the sign of the slope) in the response variable associated with a 1-unit increase in the explanatory variable. The intercept can be interpreted as the expected value of the response variable when the explanatory variable equals 0, although it is not always appropriate to interpret the intercept, because the value 0 may be far from the cloud of points and such extrapolation should be avoided. In example 3.2, the slope \(\hat{b} = 0.0001267\) can be interpreted as follows: a $1 increase in median household income is associated with an average increase of 0.0001267 percentage points in population change from 2010 to 2017. It is often more useful to rescale such a small slope; multiplying by 10,000 gives: a $10,000 increase in a county's median household income is associated with an average increase of about 1.27 percentage points in population change from 2010 to 2017.

The intercept should not be interpreted for this model, since there are no counties with a median household income near 0. If there were such counties, we would say that counties with a median household income equal to zero have an average population change of -5.77% from 2010 to 2017.
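
To see the rescaled slope in action, we can compare the model's predictions at two hypothetical income values that are $10,000 apart (the values 50,000 and 60,000 below are arbitrary choices within the range of the data, and fit_inc is just an illustrative name):

# Predicted pop_change at two income levels $10,000 apart
fit_inc <- lm(pop_change ~ median_hh_income, data = county)
preds <- predict(fit_inc, newdata = data.frame(median_hh_income = c(50000, 60000)))
diff(preds)   # about 1.27, i.e., 10,000 times the slope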

3.2 Categorical predictor

In regression, categorical explanatory variables are handled through the use of indicator variables, which are variables that can take only two values, 1 or 0. A categorical variable with two levels (that is, only two possible categories) is replaced by a single indicator variable. A variable with three or more levels is replaced by more than one indicator variable; specifically, by as many indicator variables as the number of levels minus one. This construction is developed in more detail in what follows.

3.2.1 Categorical predictor with two levels

Let’s first work on the case in which the explanatory variable is categorical with 2 levels. To illustrate that, we’ll continue to consider pop_change as the response variable and will look at metro (whether the county contains a metropolitan area) as the explanatory variable.

For a categorical variable with 2 levels, we replace it with an indicator variable that assigns the value 0 to one of the levels, called the reference level, and the value 1 to the other level. We think of 0 as the absence of a characteristic and 1 as the presence of the characteristic. For example, the variable metro has the levels no and yes, so it is conventional to assign 0 to the level no and 1 to yes. We can then find \(\hat{a}\) and \(\hat{b}\) using the formulas from theorem 3.1 (or, in the case of a LAD (least absolute deviations) regression, by solving the corresponding optimization problem).
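
To make this concrete, here is a small sketch that builds the 0/1 indicator by hand and uses it as the predictor (metro_ind is just an illustrative name); its coefficients agree with the factor-based fit shown next:

# Build the indicator for metro by hand: 1 = yes, 0 = no
county$metro_ind <- ifelse(county$metro == "yes", 1, 0)
lm(pop_change ~ metro_ind, data = county)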

The good news for the R user is that R goes through this process automatically when a categorical variable is the explanatory variable in a regression model. R calls a categorical variable a factor and lists its levels in alphabetical order, making the first listed level the reference level. Let’s check what happens when we run lm with a categorical explanatory variable:

lm(pop_change ~ metro, data = county)
## 
## Call:
## lm(formula = pop_change ~ metro, data = county)
## 
## Coefficients:
## (Intercept)     metroyes  
##      -0.678        3.261

In this case, the equation of the LS line is \(\hat{Y} = -0.678 + 3.261X\), where \(X\) is the indicator variable for metro. Since \(X\) can only take the values 0 and 1, the LS line is essentially predicting only two scenarios: the percent change in population for metro counties and for non-metro counties. For example, when \(X=0\), we get \(\hat{Y} = -0.678\), which means that non-metro counties had an average population change of -0.68% from 2010 to 2017, while metro counties had an average population change of \(\hat{Y} = -0.678 + 3.261(1) = 2.583.\) So in a simple LS regression with a categorical predictor, the intercept is the average response for the reference level and the sum of the intercept and the slope is the average response for the other level. We can check this statement by calculating such averages:

county %>%
  group_by(metro) %>%
  summarize(mean = mean(pop_change, na.rm = T))
## # A tibble: 3 × 2
##   metro   mean
##   <fct>  <dbl>
## 1 no    -0.678
## 2 yes    2.58 
## 3 <NA>   1.13

Going back to the output of lm(pop_change ~ metro, data = county), notice that R turned the variable metro into metroyes. This is to point out which level was assigned the value 1. In this case, 1 = yes, and the reference level is no, as expected because R lists levels in alphabetical order. If we want to change the reference level, we can run the function factor and list the levels in the order that we would like (the first listed level will be the reference level). For instance, we can set yes to be the reference level by running:

county$metro <- factor(county$metro, levels = c("yes", "no"))

Then running lm again gives:

lm(pop_change ~ metro, data = county)
## 
## Call:
## lm(formula = pop_change ~ metro, data = county)
## 
## Coefficients:
## (Intercept)      metrono  
##       2.583       -3.261

Now the intercept corresponds to the average percent change in population for metro areas and the sum of the intercept and the slope is the average percent change for non-metro areas. Let’s return to having no as the reference level:

county$metro <- factor(county$metro, levels = c("no", "yes"))
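
As a side note, the relevel() function offers an equivalent way to pick the reference level, and it can be used directly inside the model formula without modifying the dataset; a sketch:

# Refit the model with "yes" as the reference level, leaving county$metro unchanged
lm(pop_change ~ relevel(metro, ref = "yes"), data = county)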

3.2.2 Categorical predictor with three or more levels

We now look at cases where a categorical explanatory variable has three or more levels. We'll use the variable smoking_ban, which describes the type of county-level smoking ban in place in 2010, taking one of the values 'none', 'partial', or 'comprehensive'. In this case, we need two indicator variables instead of one. More specifically, we use indicator variables for the two levels that are not the reference level. Let's check how R handles this scenario:

lm(pop_change ~ smoking_ban, data = county)
## 
## Call:
## lm(formula = pop_change ~ smoking_ban, data = county)
## 
## Coefficients:
##        (Intercept)  smoking_banpartial  
##             0.4497              0.5455

With a three-level predictor, the model has two slopes, one for each indicator: \(\hat{Y} = \hat{a} + \hat{b}_1X_1 + \hat{b}_2X_2,\) where \(X_1\) and \(X_2\) are indicator variables satisfying \(X_1 = 1\) if there is no smoking ban and \(X_2 = 1\) if there is a partial smoking ban. Sometimes it is convenient to use actual names for the variables instead of \(X_1\) and \(X_2\), for example, \[\widehat{pop\_change} = \hat{a} + \hat{b}_1\times none + \hat{b}_2 \times partial. \] We will use both styles throughout these notes. A code sketch after the list below shows the indicator columns that R constructs. Such a model estimates the percent change in population for three scenarios, which correspond to the three levels of smoking_ban. Each level is achieved through a combination of possible values for \(X_1\) and \(X_2\), that is:

  • \(X_1 = 1\) and \(X_2 = 0\) corresponds to smoking_ban = none

  • \(X_1 = 0\) and \(X_2 = 1\) corresponds to smoking_ban = partial

  • \(X_1 = 0\) and \(X_2 = 0\) corresponds to smoking_ban = comprehensive (reference level)

  • The case \(X_1 = 1\) and \(X_2 = 1\) is not possible, since the levels cannot occur simultaneously.
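
To see exactly which indicator columns R constructs for smoking_ban, we can inspect the model matrix; a sketch (rows with missing values are dropped, and the columns that appear depend on how the levels of the factor are defined):

# Peek at the indicator (dummy) coding R builds for smoking_ban
head(model.matrix(~ smoking_ban, data = county))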

Therefore, a model with all three levels present would estimate three averages, one per level. In the lm output above, however, only the indicator for partial appears: as summary(county) shows, no county in this version of the dataset has a comprehensive (complete) ban, so the fitted model effectively compares the two observed levels, with none acting as the reference. The fitted equation is \[\widehat{pop\_change} = 0.4497 + 0.5455 \times partial, \] which says that the average percent change in population from 2010 to 2017 was about 0.45% for counties with no smoking ban and \(0.4497 + 0.5455 \approx 0.995\)% for counties with a partial smoking ban.

Let’s confirm by finding the averages with summarize:

county %>%
  group_by(smoking_ban) %>%
  summarize(mean = mean(pop_change, na.rm = T))
## # A tibble: 3 × 2
##   smoking_ban  mean
##   <fct>       <dbl>
## 1 none        0.450
## 2 partial     0.995
## 3 <NA>        0.308

The slopes in the LS model also provide a convenient way to compare each level with the reference level. For instance, the average percent change in population for counties with a partial smoking ban was higher than for counties with no ban by about 0.55 percentage points.

We have gone over how to find the LS line in the case of one predictor. In chapter 4, we look at the case of two or more predictors.

3.3 Correlations

Correlations are measures that assess the strength of the relationship between two numerical variables. We will look at two types of correlations, Pearson’s correlation coefficient (which assesses the strength of a linear relationship between two variables), and Spearman’s correlation (which assesses the strength of a monotonic relationship between two variables). Usually the term “correlation”, when no more specific information is provided, refers to Pearson’s correlation.

3.3.1 Pearson’s correlation coefficient

There is a direct connection between Pearson's correlation[1] and the slope of a least squares line. The slope of the LS line has units[2] that depend on the units of the variables \(X\) and \(Y\). To get a "unitless" slope, we can standardize \(X\) and \(Y\) by centering each around its mean and dividing the result by its standard deviation. Let \(S_X\) and \(S_Y\) be the standard deviations of the variables \(X\) and \(Y\), respectively. Then the standardized versions of \(X\) and \(Y\) are:

\[\frac{X_i-\overline{X}}{S_X}\quad \mbox{and} \quad \frac{Y_i-\overline{Y}}{S_Y}.\]

Definition 3.2 Pearson’s correlation coefficient for the sample \((X_i,Y_i), i=1,2, \dots, n,\) is the slope of the LS line between the standardized \(X\) and \(Y\). That is, \[r = \frac{\sum\frac{(X_i-\overline{X})}{S_X}\frac{(Y_i-\overline{Y})}{S_Y}}{\sum\frac{(X_i-\overline{X})^2}{S_X^2}} = \hat{b}_{X,Y}\frac{S_X}{S_Y},\] where \(\hat{b}_{X,Y}\) is the slope of the LS line between \(X\) and \(Y\). It can also be written as

\[r = \frac{\sum (X_i-\overline{X})(Y_i-\overline{Y})}{(n-1)S_X S_Y},\] because \(\sum(X_i-\overline{X})^2 = (n-1)S_X^2.\)

In R, Pearson’s correlation can be calculated with the function cor:

cor(county$median_hh_income, county$pop_change, method = "pearson", use = "complete.obs")
## [1] 0.4078856

The argument use = "complete.obs" ensures that pairs with missing values are excluded, that is, only "complete observations" are taken into account. The argument method = "pearson" can be omitted (by default, R calculates Pearson's correlation).
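
As a sanity check of definition 3.2, we can reproduce the value above from the slope of the LS line and the two standard deviations, using the same complete observations (cc is just an illustrative name):

# Verify r = b_hat * S_X / S_Y on the complete observations
cc <- na.omit(county[, c("median_hh_income", "pop_change")])
b_hat <- coef(lm(pop_change ~ median_hh_income, data = cc))[2]
unname(b_hat * sd(cc$median_hh_income) / sd(cc$pop_change))   # matches cor() above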

Pearson’s correlation coefficient has the following properties:

  • \(-1 \leq r \leq 1\).
  • Values of \(r\) closer to 1 or -1 correspond to stronger linear relationships. The sign of \(r\) tells us whether the association is positive or negative (recall the discussion in section 1.4).

IMPORTANT: Pearson's correlation coefficient is a measure of the strength of a linear relationship. A relationship that is strong but not linear can nonetheless produce a Pearson correlation close to zero. The coefficient is also not robust, meaning that a few extreme observations can significantly change its value.
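
A quick illustration of the first caveat with made-up data: below, y is completely determined by x, yet the relationship is not linear (or even monotonic), and Pearson's correlation is essentially zero.

# A strong but nonlinear relationship with Pearson correlation near zero
x <- seq(-3, 3, by = 0.1)
y <- x^2
cor(x, y)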

3.3.2 Spearman’s correlation coefficient

Spearman's correlation coefficient, also known as Spearman's rho, measures the strength of any monotonic relationship. A monotonic relationship is one that consistently increases or consistently decreases, but not necessarily in a linear fashion. To calculate Spearman's rho, we replace the original data values with their ranks within each variable. That is, the lowest value of \(X\) is replaced by the number 1, the next lowest by 2, and so on, until the highest value receives the rank \(n\). The same ranking method is applied to the variable \(Y\). In the case of ties, the tied observations receive the same rank[3]. A scatterplot of the Y-ranks against the X-ranks shows a linear trend whenever the relationship between \(X\) and \(Y\) is monotonic, which motivates the definition of Spearman's rho:

Definition 3.3 Spearman's correlation coefficient between two variables \(X\) and \(Y\) is Pearson's correlation of the ranks of \(X\) and \(Y\). It is denoted by \(\rho\) or \(r_s\).

Notice that \(-1 \leq r_s \leq 1\), and values of \(r_s\) closer to 1 or -1 correspond to stronger monotonic relationships (which may or may not be linear).

In R, the same cor function can be used to find Spearman’s rho:

cor(county$median_hh_income, county$pop_change, method = "spearman", use = "complete.obs")
## [1] 0.4089177
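
We can also verify definition 3.3 directly by applying Pearson's correlation to the ranks of the two variables, again using only the complete observations (cc is an illustrative name):

# Spearman's rho as Pearson's correlation of the ranks
cc <- na.omit(county[, c("median_hh_income", "pop_change")])
cor(rank(cc$median_hh_income), rank(cc$pop_change))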

Notice that Pearson's and Spearman's correlations are very similar in this example. This is because the relationship between pop_change and median_hh_income is fairly linear.


  [1] Pearson's correlation coefficient was developed by Karl Pearson from a related idea introduced by Francis Galton in the 1880s; the mathematical formula was derived and published by Auguste Bravais in 1844.

  [2] The slope of the LS line is measured in \(\frac{\mbox{units of }Y}{\mbox{units of }X}\).

  [3] There are other ways to handle ties; we present the simplest one here.