3 Simple linear regression
Simple linear regressions are models that describe the relationship between two variables with a straight line. To illustrate the modeling process, we’ll use the county dataset introduced in section 1.2. Let us first look at a summary of the dataset:
## name state pop2000 pop2010
## Length:3142 Texas : 254 Min. : 67 Min. : 82
## Class :character Georgia : 159 1st Qu.: 11224 1st Qu.: 11114
## Mode :character Virginia: 133 Median : 24621 Median : 25872
## Kentucky: 120 Mean : 89650 Mean : 98262
## Missouri: 115 3rd Qu.: 61775 3rd Qu.: 66780
## Kansas : 105 Max. :9519338 Max. :9818605
## (Other) :2256 NA's :3
## pop2017 pop_change poverty homeownership
## Min. : 88 Min. :-33.6300 Min. : 2.40 Min. : 0.00
## 1st Qu.: 10976 1st Qu.: -1.9700 1st Qu.:11.30 1st Qu.:69.50
## Median : 25857 Median : -0.0600 Median :15.20 Median :74.60
## Mean : 103763 Mean : 0.5339 Mean :15.97 Mean :73.27
## 3rd Qu.: 67756 3rd Qu.: 2.3750 3rd Qu.:19.40 3rd Qu.:78.40
## Max. :10163507 Max. : 37.1900 Max. :52.00 Max. :91.30
## NA's :3 NA's :3 NA's :2
## multi_unit unemployment_rate metro median_edu
## Min. : 0.00 Min. : 1.620 no :1974 below_hs : 2
## 1st Qu.: 6.10 1st Qu.: 3.520 yes :1165 hs_diploma :1397
## Median : 9.70 Median : 4.360 NA's: 3 some_college:1695
## Mean :12.32 Mean : 4.611 bachelors : 46
## 3rd Qu.:15.90 3rd Qu.: 5.355 NA's : 2
## Max. :98.50 Max. :19.070
## NA's :3
## per_capita_income median_hh_income smoking_ban
## Min. :10467 Min. : 19264 none :1927
## 1st Qu.:21772 1st Qu.: 41126 partial : 635
## Median :25445 Median : 48073 complete: 0
## Mean :26093 Mean : 49765 NA's : 580
## 3rd Qu.:29276 3rd Qu.: 55771
## Max. :69533 Max. :129588
## NA's :2 NA's :2
The output of summary(county) gives a glimpse of the variables in the dataset: it allows us to see which variables are numerical, which are categorical, and what values they can take.
Suppose that we would like to gain a better understanding of the relationships between the variables in the dataset and the population change from 2010 to 2017, with the ultimate goal of identifying which variable is most associated with population change. We can use the plot function to look at scatterplots (or boxplots if the explanatory variable is categorical) of pop_change (y) versus each variable (x).
The dot in the expression pop_change ~ . means “all variables in the dataset except pop_change”. The expression data = county[,-c(1:5)] means that the dataset to be considered is the county dataset with the exclusion of columns 1 through 5; that is, these five variables will not be plotted against pop_change.
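Putting these pieces together, the plots can be produced with a call along the following lines (a sketch of the call described above):
# one scatterplot (or boxplot) of pop_change versus each remaining variable
plot(pop_change ~ ., data = county[,-c(1:5)])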
It is possible to see from the plots that some variables seem to be associated with pop_change. For example, per_capita_income and median_hh_income seem to be positively associated with pop_change. In the remainder of this chapter, we will look at ways to quantify such relationships.
3.1 Least squares model
Let us first develop a model that considers one quantitative explanatory variable and one quantitative response variable (we will later describe how this model changes when considering a categorical explanatory variable). For a sample with \(n\) observations for the explanatory (\(X\)) and response (\(Y\)) variables, write each pair as \((X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)\). The goal is to write the equation of a line that “best” represents the relationship between the variables \(X\) and \(Y\). It turns out that there is more than one criterion that defines what the best line is. Here we go over the most used one, the least squares criterion.
Definition 3.1 (Least Squares line) The least squares (LS) line is given by \(\hat{Y} = \hat{a} + \hat{b}X\), where \(\hat{a}\) and \(\hat{b}\) are the values of \(a\) and \(b\) that minimize the quantity \[SS(a,b) = \sum_{i=1}^n(Y_i - (a+bX_i))^2.\]
Here SS stands for “Sum of Squares”. The quantity \(Y_i - (a+bX_i)\) is the error, also called the residual, of the estimate \(a+bX_i\) in predicting the value \(Y_i\). Visually, each term \(Y_i - (a+bX_i)\) is the vertical distance of \(Y_i\) from the regression line, with a positive sign if \(Y_i\) is above the line and negative otherwise (in the figure below, positive residuals are blue and negative ones are red).
For example, for the line \(Y=2X\), the point \((5,6)\) in the dataset has the following residual: \(6 - (0+2\times 5) = -4\). For the line \(Y=-0.5+2X\), the same point \((5,6)\) now has the following residual: \(6 - (-0.5+2\times 5) = -3.5\).
Theorem 3.1 The values of \(a\) and \(b\) which minimize \(SS(a,b)\) are \[ \hat{b} = \frac{\sum(Y_i-\overline{Y})(X_i-\overline{X})}{\sum(X_i-\overline{X})^2} \quad\mbox{and}\quad \hat{a} =\overline{Y}-\hat{b}\overline{X}. \]
Proof. We want to minimize a function of two variables, which can be done using calculus or, in this specific case, we can also use properties of quadratic functions. For a fixed \(b\), \(SS(a,b)\) is a quadratic function of \(a\), so let us write \(SS(a,b)\) in a way that makes it easier to find the vertex of a parabola: \[\begin{eqnarray} SS(a,b) &=& \sum(-a+(Y_i-bX_i))^2 = \sum(a^2-2a(Y_i-bX_i)+(Y_i-bX_i)^2)\\ &=& na^2-2a(n\overline{Y}-bn\overline{X}) + \sum(Y_i-bX_i)^2, \end{eqnarray}\] which has a minimum at \[\hat{a} = \frac{2(n\overline{Y}-bn\overline{X})}{2n} = \overline{Y} - b\overline{X}.\] Here we used the formula for the x-coordinate of the vertex of a parabola. Now replace \({a}=\overline{Y} - b\overline{X}\) into \(SS(a,b)\): \[\begin{eqnarray} SS(b) &=& \sum(Y_i - (\overline{Y} - b\overline{X} + bX_i))^2 = \sum((Y_i - \overline{Y}) - b(X_i - \overline{X}))^2\\ &=& \sum(Y_i - \overline{Y})^2 - 2b\sum(Y_i - \overline{Y})(X_i - \overline{X}) + b^2\sum(X_i - \overline{X})^2, \end{eqnarray}\] which has a minimum at \[\hat{b} = \frac{\sum(Y_i - \overline{Y})(X_i - \overline{X})}{\sum(X_i - \overline{X})^2}.\]
Let’s look at a very small dataset to illustrate the calculation of the least squares line.
Example 3.1 Consider the data \((-1,0),(1,3),(2,2)\). To find the least squares line, we need \(\overline{X} = (-1+1+2)/3 = 2/3\) and \(\overline{Y} = (0+3+2)/3 = 5/3\); we then find \(\hat{b}\) and then \(\hat{a}\). \[\begin{eqnarray} \hat{b} &=& \frac{\sum(Y_i - \overline{Y})(X_i - \overline{X})}{\sum(X_i - \overline{X})^2}\\ &=& \frac{(0-5/3)(-1-2/3)+(3-5/3)(1-2/3)+(2-5/3)(2-2/3)}{(-1-2/3)^2+(1-2/3)^2+(2-2/3)^2} = 0.7857.\\ \hat{a} &=& \overline{Y} - \hat{b}\overline{X} = 5/3 - 0.7857\times 2/3 = 1.1429. \end{eqnarray}\]
In R, the least squares line is calculated with the function lm. For example 3.1, the intercept and slope of the LS line are calculated as follows:
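(The three points from example 3.1 are first stored in a data frame; the name dat is taken from the output below.)
dat <- data.frame(x = c(-1, 1, 2), y = c(0, 3, 2))
lm(y ~ x, data = dat)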
##
## Call:
## lm(formula = y ~ x, data = dat)
##
## Coefficients:
## (Intercept) x
## 1.1429 0.7857
We can also plot the LS line together with the data points. We can do this with the layer geom_smooth(method = "lm"):
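For the data from example 3.1, one way to build this plot is the following sketch (it assumes dat from above and that ggplot2 is loaded):
ggplot(dat, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = F)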
Here, se = F suppresses the plotting of the standard error band.
Example 3.2 The LS line for pop_change and median_hh_income in the county dataset is given by:
##
## Call:
## lm(formula = pop_change ~ median_hh_income, data = county)
##
## Coefficients:
## (Intercept) median_hh_income
## -5.7712741 0.0001267
ggplot(county, aes(x = median_hh_income, y = pop_change)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = F)
Using the LS line for interpretation: One major advantage of linear models over more complex models is how easy they are to interpret. For instance, the slope of a line can be interpreted as the average increase or decrease (depending on the sign of the slope) in the response variable that is associated with a 1-unit increase in the explanatory variable. The intercept can be interpreted as the expected value of the response variable when the explanatory variable equals 0. It is not always appropriate to interpret the intercept because it may be far from the cloud of points, and such extrapolation should be avoided. In example 3.2, the slope \(\hat{b} = 0.0001267\) can be interpreted as follows: a $1 increase in median household income is associated with an average 0.0001267% change in population from 2010 to 2017. It may be more useful for interpretation to multiply these values by 10,000, that is, a $10,000 increase in the median household income of a county is associated with an average change in population of about 1.27% from 2010 to 2017.
The intercept should not be interpreted for this model, since there are no counties with a median household income near 0. If there were such counties, we would say that counties with a median household income equal to zero have an average population change of about -5.77% from 2010 to 2017.
3.2 Categorical predictor
In regression, categorical explanatory variables are handled through the use of indicator variables, which are variables that can take only two values, 1 or 0. For a categorical variable with two levels (that is, only two possible categories), we replace it with one indicator variable. For a variable with three or more levels, we replace it with more than one indicator variable. More specifically, we use as many indicator variables as the number of levels minus one. This construction is developed in more detail in what follows.
3.2.1 Categorical predictor with two levels
Let’s first work on the case in which the explanatory variable is categorical with 2 levels. To illustrate that, we’ll continue to consider pop_change as the response variable and will look at metro (whether the county contains a metropolitan area) as the explanatory variable.
For a categorical variable with 2 levels, we replace it with an indicator variable that assigns the value 0 to one of the levels, called the reference level, and the value 1 to the other level. We think of 0 as the absence of a characteristic and 1 as the presence of the characteristic. For example, the variable metro has the levels no and yes, so it’s conventional to assign 0 to the level no and 1 to yes. We can then find \(\hat{a}\) and \(\hat{b}\) using the formulas from theorem 3.1 or, in the case of a LAD regression, by solving the corresponding optimization problem to find \(\hat{a}\) and \(\hat{b}\).
The good news for the R user is that R goes through this process automatically when a categorical variable is the explanatory variable in a regression model. R calls a categorical variable a factor and lists its levels in alphabetical order, making the first listed level the reference level. Let’s check what happens when we run lm with a categorical explanatory variable:
##
## Call:
## lm(formula = pop_change ~ metro, data = county)
##
## Coefficients:
## (Intercept) metroyes
## -0.678 3.261
In this case, the equation of the LS line is \(\hat{Y} = -0.678 + 3.261X\), where \(X\) is an indicator variable for metro. Since \(X\) can only take the values 0 and 1, the LS line is essentially predicting only two scenarios: the percent change in population for metropolitan areas and for non-metro areas. For example, when \(X=0\), we get \(\hat{Y} = -0.678\), which means that non-metro areas had an average population change of -0.68% from 2010 to 2017, while metro areas had an average population change of \(\hat{Y} = -0.678 + 3.261(1) = 2.583.\) So in a simple LS regression with a categorical predictor, the intercept is the average response for the reference level and the sum of the intercept and the slope is the average response for the remaining level. We can check this statement by calculating such averages:
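One way to compute these group means is with a short dplyr pipeline (a sketch; the column name mean matches the output below):
county %>%
  group_by(metro) %>%
  summarize(mean = mean(pop_change, na.rm = TRUE))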
## # A tibble: 3 × 2
## metro mean
## <fct> <dbl>
## 1 no -0.678
## 2 yes 2.58
## 3 <NA> 1.13
Going back to the output of lm(pop_change ~ metro, data = county), notice that R turned the variable metro into metroyes. This is to point out which level was assigned the value 1. In this case, 1 = yes, and the reference level is no, as expected because R lists levels in alphabetical order. If we want to change the reference level, we can run the function factor and list the levels in the order that we would like (the first listed level will be the reference level). For instance, we can set yes to be the reference level by running:
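A sketch of such a call, assuming the levels are spelled exactly yes and no:
county$metro <- factor(county$metro, levels = c("yes", "no"))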
Then running lm again gives:
##
## Call:
## lm(formula = pop_change ~ metro, data = county)
##
## Coefficients:
## (Intercept) metrono
## 2.583 -3.261
Now the intercept corresponds to the average percent change in population for metro areas, and the sum of the intercept and the slope is the average percent change for non-metro areas. Let’s return to having no as the reference level:
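This can again be done with factor (a sketch):
county$metro <- factor(county$metro, levels = c("no", "yes"))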
3.2.2 Categorical predictor with three or more levels
We now look at cases where a categorical explanatory variable has three or more levels. We’ll use the variable smoking_ban, which describes the type of county-level smoking ban in place in 2010, taking one of the values ‘none’, ‘partial’, or ‘comprehensive’. In this case, we will need 2 indicator variables instead of one. More specifically, we will use indicator variables for the two levels that are not the reference level. Let’s check how R handles this scenario:
##
## Call:
## lm(formula = pop_change ~ smoking_ban, data = county)
##
## Coefficients:
## (Intercept) smoking_banpartial
## 0.4497 0.5455
This model now has two slopes, and its equation is \(\hat{Y} = 0.4302 + 0.2349X_1 + 1.2175X_2,\) where \(X_1\) and \(X_2\) are indicator variables with \(X_1 = 1\) if there is no smoking ban (and 0 otherwise) and \(X_2 = 1\) if there is a partial smoking ban (and 0 otherwise). Sometimes it’s convenient to use actual names for the variables instead of \(X_1\) and \(X_2\), for example,
\[\widehat{pop\_change} = 0.4302 + 0.2349\times none + 1.2175 \times partial. \]
We will use both styles throughout these notes. This model estimates the percent change in population for three scenarios, which correspond to the three levels of smoking_ban. Each level is achieved through a combination of possible values for \(X_1\) and \(X_2\), that is:
- \(X_1 = 1\) and \(X_2 = 0\) corresponds to smoking_ban = none
- \(X_1 = 0\) and \(X_2 = 1\) corresponds to smoking_ban = partial
- \(X_1 = 0\) and \(X_2 = 0\) corresponds to smoking_ban = comprehensive (reference level)

The case \(X_1 = 1\) and \(X_2 = 1\) is not possible, since these levels cannot occur simultaneously.
Therefore, the interpretation of this model goes as follows:
The average percent change in population was \(0.4302 + 0.2349\times 0 + 1.2175 \times 0 = 0.43\)% for counties with comprehensive smoking ban; \(0.4302 + 0.2349\times 0 + 1.2175 \times 1 = 1.65\)% for counties with partial smoking ban; and \(0.4302 + 0.2349\times 1 + 1.2175 \times 0 = 0.67\)% for counties with no smoking ban.
Let’s confirm by finding the averages with summarize:
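This is the same pipeline used for metro, now grouping by smoking_ban (a sketch):
county %>%
  group_by(smoking_ban) %>%
  summarize(mean = mean(pop_change, na.rm = TRUE))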
## # A tibble: 3 × 2
## smoking_ban mean
## <fct> <dbl>
## 1 none 0.450
## 2 partial 0.995
## 3 <NA> 0.308
The slopes in the LS model also provide a convenient way to compare the levels with the reference level. For instance, the average percent change in population for counties with no smoking ban was higher than those with a comprehensive ban by 0.23%. Likewise, the average percent change in population for counties with partial smoking ban was higher than those with a comprehensive ban by 1.22%.
We have gone over how to find the LS line in the case of one predictor. In chapter 4, we look at the case of two or more predictors.
3.3 Correlations
Correlations are measures that assess the strength of the relationship between two numerical variables. We will look at two types of correlations, Pearson’s correlation coefficient (which assesses the strength of a linear relationship between two variables), and Spearman’s correlation (which assesses the strength of a monotonic relationship between two variables). Usually the term “correlation”, when no more specific information is provided, refers to Pearson’s correlation.
3.3.1 Pearson’s correlation coefficient
There is a direct connection between Pearson’s11 correlation and the slope of a least squares line. The slope of the LS line has units12 which depend on the units of the variables \(X\) and \(Y\). To get a “unitless” slope, we can standardize the variables \(X\) and \(Y\) by centering them around their means and dividing the result by their standard deviations. Let \(S_X\) and \(S_Y\) be the standard deviations of the variables \(X\) and \(Y\), respectively. Then the standardizations of \(X\) and \(Y\) are:
\[\frac{X_i-\overline{X}}{S_X}\quad \mbox{and} \quad \frac{Y_i-\overline{Y}}{S_Y}.\]
Definition 3.2 Pearson’s correlation coefficient for the sample \((X_i,Y_i), i=1,2, \dots, n,\) is the slope of the LS line between the standardized \(X\) and \(Y\). That is, \[r = \frac{\sum\frac{(X_i-\overline{X})}{S_X}\frac{(Y_i-\overline{Y})}{S_Y}}{\sum\frac{(X_i-\overline{X})^2}{S_X^2}} = \hat{b}_{X,Y}\frac{S_X}{S_Y},\] where \(\hat{b}_{X,Y}\) is the slope of the LS line between \(X\) and \(Y\). It can also be written as
\[r = \frac{\sum (X_i-\overline{X})(Y_i-\overline{Y})}{(n-1)S_X S_Y},\] because \(\sum(X_i-\overline{X})^2 = (n-1)S_X^2.\)
In R, Pearson’s correlation can be calculated with the function cor:
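For the pair of variables considered earlier, pop_change and median_hh_income, the call is along these lines (a sketch):
cor(county$pop_change, county$median_hh_income,
    use = "complete.obs", method = "pearson")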
## [1] 0.4078856
The input use = "complete.obs" ensures that pairs with missing values are excluded, that is, only “complete observations” are taken into account. The input method = "pearson" can be omitted (by default, R calculates Pearson’s correlation).
Pearson’s correlation coefficient has the following properties:
- \(-1 \leq r \leq 1\).
- Values of \(r\) that are closer to 1 or -1 correspond to stronger linear relationships. The sign of \(r\) tells us whether the association is positive or negative (recall the discussion in section 1.4).
IMPORTANT: Pearson’s correlation coefficient is a measure of the strength of a linear relationship. Non-linear relationships, even when strong, will usually return a low Pearson’s correlation. This correlation is also not robust, meaning that extreme observations can significantly change its value.
3.3.2 Spearman’s correlation coefficient
Spearman’s correlation coefficient, also known as Spearman’s rho, can measure the strength of any monotonic relationship. A monotonic relationship is one that consistently increases or decreases, but not necessarily in a linear fashion. To calculate Spearman’s rho, we replace the original data values with their ranks within each variable. That is, we replace the lowest value in \(X\) with the rank 1, the next lowest value with 2, and so on, until the highest value is assigned the rank \(n\). The same ranking method is applied to the variable \(Y\). In the case of ties, the observations receive the same ranking13. A scatterplot of the Y-ranks against the X-ranks shows a linear trend whenever the relationship between \(X\) and \(Y\) is monotonic. This inspires the definition of Spearman’s rho:
Definition 3.3 The Spearman’s correlation coefficient between two variables \(X\) and \(Y\) is the Pearson’s correlation of the ranks of \(X\) and \(Y\). It is denoted by \(\rho\) or \(r_s\).
Notice that \(-1 \leq r_s \leq 1\). Values of \(r_s\) closer to 1 or -1 correspond to stronger monotonic relationships (which may or may not be linear).
In R, the same cor function can be used to find Spearman’s rho:
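Only the method argument changes relative to the Pearson call above (a sketch):
cor(county$pop_change, county$median_hh_income,
    use = "complete.obs", method = "spearman")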
## [1] 0.4089177
Notice that Pearson’s and Spearman’s correlations are very similar in this example. This is because the relationship between pop_change and median_hh_income is fairly linear.
Pearson’s correlation coefficient was developed by Karl Pearson from a related idea introduced by Francis Galton in the 1880s; a mathematical formula for it had been derived and published by Auguste Bravais in 1844.↩︎
The units of the LS slope are \(\frac{\mbox{units of }Y}{\mbox{units of } X}\).↩︎
There are other ways to handle ties; we present the simplest one here.↩︎