Chapter 4 Linear Regression

We’ve been able to extract meaningful insights from data sets using various transformation functions and visualizations, but these insights have been largely qualitative, meaning that we’ve mostly only been able to describe our insights rhetorically by appealing to nicely displayed tables and plots. In this chapter, we’ll use statistical models as ways to extract more quantitative insights. In the process, we’ll have to dive deeper into statistics topics than we have so far.

We’ll need a new package called modelr. You’ll have to install it first:

install.packages("modelr")

And remember to load the necessary libraries:

library(tidyverse)
library(modelr)

4.1 Simple Linear Regression

A statistical model is a mathematical equation that captures a relationship among variables in a data set. This equation often describes how one variable, known as the response (or dependent) variable, is a function of other variables, known as the predictor (or feature or explanatory or independent) variables.

For example, consider the small sample data set sim1:

sim1
## # A tibble: 30 x 2
##        x     y
##    <int> <dbl>
##  1     1  4.20
##  2     1  7.51
##  3     1  2.13
##  4     2  8.99
##  5     2 10.2 
##  6     2 11.3 
##  7     3  7.36
##  8     3 10.5 
##  9     3 10.5 
## 10     4 12.4 
## 11     4 11.9 
## 12     4 14.3 
## 13     5 19.1 
## 14     5 11.7 
## 15     5 16.0 
## 16     6 13.3 
## 17     6 16.0 
## 18     6 16.9 
## 19     7 20.1 
## 20     7 17.2 
## 21     7 19.9 
## 22     8 21.7 
## 23     8 18.4 
## 24     8 22.5 
## 25     9 26.8 
## 26     9 22.8 
## 27     9 21.1 
## 28    10 25.0 
## 29    10 23.3 
## 30    10 22.0

We’ll take x to be the predictor variable and y to be the response variable. A natural question is, “If we know the value of x, can we predict the value of y?” Technically, the answer is “no.” For example, notice that when x is 3, y could be either 7.36 or 10.50 or 10.51. But the purpose of a statistical model is to provide an approximate prediction rather than an exact one.

A good first step toward making a prediction would be to obtain a scatter plot of the data.

ggplot(data = sim1, mapping = aes(x = x, y = y)) +
  geom_point()

Before proceeding, it’s worth pointing out a shortcut in the ggplot syntax. The first argument of every ggplot call is data =, so it’s unnecessary to type data =; you can just give the name of the data set. The same is true for mapping =, which is the second argument by default, so you can just enter your aesthetics with aes. Finally, within aes, it’s understood that the first two arguments are x = and y =, so you can just enter the x and y variables directly. Thus, the above ggplot call can be shortened to:

ggplot(sim1, aes(x, y)) +
  geom_point()

We’ll be using these shortcuts from now on.

Anyway, referring to our scatter plot, it’s clear that, while our points don’t all lie exactly on the same line, they do seem to cluster around some common line. This line of best fit is called the regression line of the data set. Being a line, it can be described algebraically by an equation of the form \(y=mx + b\). If we can find the values of \(m\) (the slope) and \(b\) (the y-intercept), we’ll have a statistical model for our data set. A statistical model of this type is called a simple linear regression model. The word “simple” refers to the fact that we have only one predictor variable. When we have more than one, we call it a multiple linear regression model.

One way to find \(m\) and \(b\) would be to eyeball the scatter plot. This isn’t ideal, but it will get us started. If we were to imagine a line that passes as closely as possible to all of the data points, it looks like it might cross the \(y\)-axis around 4. Also, it looks like as x increases from 1 to 10, y increases from about 5 to 24. This means we have a rise of 19 over a run of 9, for a slope of 19/9, or about 2.1. If we have a slope and intercept, we can plot the corresponding line with geom_abline. Let’s plot our estimated regression line over our scatter plot:

ggplot(sim1) +
  geom_point(aes(x, y)) +
  geom_abline(intercept = 4, slope = 2.1)

This looks like a pretty good fit, but we’d prefer that R find \(m\) and \(b\) for us. We can achieve this with the lm function (which stands for linear model); lm is built into R, and the modelr library supplies helper functions that work alongside it. In the code below, sim1_model is the name we’re choosing for our regression model. Also, notice that the first argument of lm is y ~ x. The lm function wants the response variable on the left of ~ and the predictor on the right. (This is the way to state that we want to predict y as a function of x.)

sim1_model <- lm(y ~ x, data = sim1)

Let’s see what we have:

sim1_model
## 
## Call:
## lm(formula = y ~ x, data = sim1)
## 
## Coefficients:
## (Intercept)            x  
##       4.221        2.052

It looks like the actual intercept value is 4.221. The value 2.052 listed under x in the above output is the coefficient of x in our linear model, i.e., the slope. So, we were close. Let’s plot the actual regression line in red over our estimated one from above:

ggplot(sim1) +
  geom_point(aes(x, y)) +
  geom_abline(intercept = 4, slope = 2.1) +
  geom_abline(intercept = 4.221, slope = 2.052, color = "red")

It’s hard to tell the difference, but we’ll trust that the red line fits the data more closely.

Of course, it’s not a good practice to do all of this piecemeal, transcribing the intercept and slope over to geom_abline. An entry error could be made. We can extract these values directly from the model, though, by getting the coefficient vector:

coef(sim1_model)
## (Intercept)           x 
##    4.220822    2.051533

If we give this vector a name, we can then access the individual entries when plotting our regression line:

sim1_coefs <- coef(sim1_model)

b <- sim1_coefs[1]
m <- sim1_coefs[2]

ggplot(sim1) +
  geom_point(aes(x, y)) +
  geom_abline(intercept = b, slope = m, color = "red")
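
By the way, ggplot2 can also draw the least-squares line for you: geom_smooth with method = "lm" fits the same linear model behind the scenes and adds the line as a layer (se = FALSE suppresses the confidence ribbon it would otherwise draw around the line). We’ll keep extracting the coefficients ourselves, though, since we’ll need the actual numbers later.

ggplot(sim1, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red")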

4.1.1 Exercises

  1. Obtain the scatter plot from the mpg data set of hwy vs displ. (Recall that this means that hwy is the \(y\) variable and displ is the \(x\).) Guess the values of the slope and intercept of the regression line, and plot the line along with the scatter plot.

  2. Use lm to find the actual regression line. Then plot it in a different color on top of your scatter plot and estimated regression line from the previous exercise.

4.2 Statistics Background

The primary uses of regression models are to make predictions and to examine the effects of some variables on other variables. As we just saw, obtaining a simple regression model is easy in R, but before actually using a model, we have to determine how well it fits the data, correctly interpret its parameters, and distinguish between statistical randomness and a real relationship among the variables. Before doing any of this, we’ll need some background in basic statistics.

Recall that a variable in a data set is a column name and its values are listed in the column below that name. Variables in R are handled as vectors and can be entered in our usual way:

<VECTOR NAME> <- c(<VALUE>, <VALUE>, ..., <VALUE>)

We can also define a vector by referring to a specific column in a specific data set:

<VECTOR NAME> <- <DATA SET NAME>$<COLUMN NAME>

We should think of vectors as samples from a population. For example, the vector mpg$model contains the list of car models found in mpg, but this is not an exhaustive list of all the car models in existence; rather it’s a collection of some of them. When we compute statistics of variables, such as the mean, we’re usually using these statistics to say something about a parameter of an entire population from which the sample is taken. For example, we can find the average highway fuel efficiency among the cars in mpg:

mean(mpg$hwy)
## [1] 23.44017

We would interpret this mean as an estimate of the mean highway fuel efficiency of the entire population of cars from which the mpg data set is taken. Computing the statistics of a sample and then using them to make estimates about parameters of populations is known as inference. Inference can be a very useful way to make claims about the behaviors of populations by analyzing small subsets of those populations, but there are risks involved.

Inference only works when a sample fairly represents the population. If the sample size is small, we can’t reliably use the sample’s statistics to make claims about parameters of the entire population. For example, you wouldn’t make a claim about the average salary of adult males in your state by just taking the average of your dad’s and uncle’s salaries.

Samples can also fail to be representative when they’re biased. This means there’s a flaw in the way a sample is chosen which makes the sample’s statistics unsuitable for inference. For example, suppose you want to try again to make your claim about the average salary of adult males in your state, so you send out a survey to several businesses across the state, asking the male employees to report their salaries. You could find the average of the respondents’ salaries, but it still wouldn’t be right to use this average salary to make an inference about males’ salaries throughout the state. By only sending the surveys to businesses, you’d be excluding people who work in education, health care, government, and other industries in which the salaries could be significantly different from those in the business sector. Your sample would be biased.

There’s another kind of bias that can creep into statistical inference, and it’s the main type we’ll focus on here. Suppose we want to compute a statistic from a sample. But to compute that statistic, suppose we have to compute another statistic first, and then use that first statistic to compute the desired one. When using one sample statistic to compute another, we’re assuming that the first sample statistic is a reliable estimate of a true population parameter. However, there’s always a possibility that the first statistic is not reliable. In that case, the desired statistic we compute from that first one would be even less reliable than the first. Whenever we compute one statistic by referring to another, the desired statistic will be biased toward the first one. As we’ll see, there’s a way to correct this bias, though.


The most conspicuous example of a biased statistic is standard deviation. The standard deviation of a variable is a measure of the variable’s dispersion, i.e., the extent to which its values deviate from its mean. For example, consider the following data set:

df <- tibble(
  x = c(-50,-27,85,115,165,228),
  y = c(-0.32,0.11,0.18,0.19,0.22,0.24)
)

df
## # A tibble: 6 x 2
##       x     y
##   <dbl> <dbl>
## 1   -50 -0.32
## 2   -27  0.11
## 3    85  0.18
## 4   115  0.19
## 5   165  0.22
## 6   228  0.24

It’s clear that the x values are much more dispersed than the y values. The standard deviations of these variables, which can be computed in R using the sd function, quantify this dispersion.

X <- df$x
Y <- df$y

sd(X)
## [1] 108.1776
sd(Y)
## [1] 0.2121006

The standard deviation of X is much higher than that of Y, as we suspected.

Standard deviation plays a huge role in assessing the performance of linear regression models, so it’s worth looking closely at how it’s actually computed. As we do so, we’ll see how the type of bias mentioned above enters the picture and how we can correct it.

The standard deviation of X computed above is 108.1776, but what does this number actually measure? We’ll start with the mean deviations of the values of X, i.e., the distances between the values of X and the mean X value. For example, the mean deviation of the X value -27 is:

abs(-27 - mean(X))
## [1] 113

The absolute value function abs is necessary since a mean deviation is a distance between two numbers; a negative mean deviation therefore wouldn’t make sense.

The notion of mean deviation presents one way to measure the dispersion of the X variable: Find the average of all of the mean deviations of the values of X. This would give us the average distance that the values of X stray from the central X value.

n <- length(X)

1/n * sum(abs(X - mean(X)))
## [1] 83.33333

However, this number does not agree with the standard deviation we found above. It’s actually a statistic known as mean absolute deviation, but it is not often used, partly because formulas containing absolute values can be hard to manipulate algebraically.

As noted, the reason for using absolute values for mean deviations is to ensure that each term represents a distance between an X value and the mean. Without the absolute values, some of the mean deviations would take on negative values, and these negative values would partially negate the mean deviations with positive values, thus obscuring our desired dispersion measurement.

Another, more algebra-friendly, way to avoid negative terms would be to square each mean deviation in the sum above. This would give:

1/n * sum((X - mean(X))^2)
## [1] 9752

This is still not our standard deviation. One obvious problem with this calculation is that since we’re squaring the mean deviations, the units of the number 9752 will be different from the units of the values of our vector X. For example, suppose the values of X represent locations of a moving object along a number line and are measured in inches. The units of 9752 would therefore be square inches. We could remedy that by taking the square root:

sqrt(1/n * sum((X - mean(X))^2))
## [1] 98.75222

This is still not the standard deviation we found above, but we’re getting closer. It’s actually known as the sample standard deviation. The reason it’s different from the standard deviation is that we’re encountering the long-promised bias.

To see why sample standard deviation is biased, consider the fact that we’re comparing each value of X to the mean of X. The value of mean(X) is supposed to be an estimate of the mean of the entire population from which the values of X have been sampled. However, the values of X will naturally be closer on average to the sample mean mean(X) than to the true population mean. (The sample mean is the best possible representative value of the sample, so the true population mean can only be a worse representative. Thus the sample values are, on average, closer to the sample mean than to the population mean.)
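
If you’d like a quick numerical illustration of this claim, the sketch below draws a sample from a population whose true mean we know (10), using rnorm, which we’ll meet again in the exercises. The exact numbers will vary from sample to sample, but the first sum can never be larger than the second.

set.seed(1)  # make the random sample reproducible
samp <- rnorm(20, mean = 10, sd = 2)  # a sample whose true population mean is 10

sum((samp - mean(samp))^2)  # squared deviations from the sample mean
sum((samp - 10)^2)          # squared deviations from the true mean; always at least as large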

So by using the value of mean(X) in the above calculation rather than the true population mean, we are guaranteed to get an underestimate of the true population standard deviation. Since the true population mean is usually not attainable, we would need to correct our computation in a way that makes the underestimate we get a little bigger. This is done by changing the \(\frac{1}{n}\) factor to \(\frac{1}{n-1}\). Making the denominator smaller will make the entire fraction bigger, and it will increase the overall number obtained. The unbiased standard deviation is thus computed as follows:

sqrt(1/(n - 1) * sum((X - mean(X))^2))
## [1] 108.1776

This is the same standard deviation value R gives us, and this last calculation is the way standard deviation is actually calculated. By replacing \(n\) with \(n - 1\), we effectively correct the bias incurred in the sample standard deviation. The result is that the unbiased standard deviation is much safer to use than the sample standard deviation when making an inference about the dispersion of the entire population.

The quantity inside the square root above, 1/(n - 1) * sum((X - mean(X))^2), is also of interest: it captures essentially the same information about dispersion, but the absence of a square root makes it more amenable to algebraic manipulation. This quantity is called the variance of the variable.

Let SS(M) denote the sum of the squares of the mean deviations of the variable X. In summary, we have:

  • The variance of the variable X is:

\[\textrm{var(X)}=\frac{\textrm{SS(M)}}{n-1}.\]

  • The standard deviation of a variable X is the square root of the variance. Thus

\[\textrm{sd(X)}=\sqrt{\frac{\textrm{SS(M)}}{n-1}}.\]

In R, we can use the functions var(X) and sd(X).
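
As a quick check that these formulas agree with the built-in functions, we can compare them directly. (X and n are the objects defined earlier in this section.)

var(X)
## [1] 11702.4
1/(n - 1) * sum((X - mean(X))^2)
## [1] 11702.4

sd(X)
## [1] 108.1776
sqrt(var(X))
## [1] 108.1776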

An important point of this discussion is the way to correct for a biased calculation that is incurred when we use a sample statistic (such as the sample mean) to calculate another sample statistic (such as the sample standard deviation). We correct for this bias by replacing \(n\) with \(n - 1\). What if we have to compute a certain sample statistic that forces us to use two other sample statistics? Now we’d be dealing with a doubly biased estimate. We would thus have to correct for both biases, and it turns out that the way to do this is to replace \(n\) by \(n - 2\). In general, if we have to use \(p\) sample statistics while computing some other sample statistic, we can correct for the \(p\)-fold bias by replacing \(n\) by \(n - p\). The value of \(n - p\) is called the number of degrees of freedom of the sample statistic we’re computing. Degrees of freedom will be an important ingredient when discussing standard deviation in the context of regression.


Soon, we will have the occasion to examine the distribution of a certain variable. Recall that a variable’s distribution is a description of how the values of the variable are distributed throughout the variable’s range. As we saw in Chapter 1, distributions of continuous variables are visualized as histograms. We’ll be interested in one special type of distribution: the normal distribution.

A variable has a normal distribution when the peak of the variable’s histogram is at the variable’s mean and the heights of the bars decrease toward 0 symmetrically to the left and right of the peak. The width of the distribution is determined by the standard deviation of the variable. (There’s actually a much more precise definition of a normal distribution, but this will suffice for us.) The histogram below exhibits an approximately normally distributed variable X whose mean value is 5 and whose standard deviation is 1.6:

What do the red and blue lines signify? It turns out that about 70% of the area under a normal curve lies within 1 standard deviation of the mean. This means that about 70% of the area, and hence about 70% of the values of X, lie between \(5 - 1.6=3.4\) and \(5 + 1.6=6.6\). In other words, if we were to pick a value of X at random, we could be about 70% confident that it would lie between 3.4 and 6.6. This interval is called the 70% confidence interval (CI). The red lines are placed at 3.4 and 6.6 to help visualize the 70% CI.

It also turns out that if we stray 2 standard deviations away from the mean in either direction, we pick up about 95% of the values of X. Thus we also have a 95% confidence interval. If we pick a value of X at random, we can be about 95% confident that it lies between 1.8 (which equals \(5-2\times 1.6\)) and 8.2 (which is \(5+2\times 1.6\)). These endpoints are marked by the blue lines above.
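
If you’d rather not take these percentages on faith, base R’s pnorm function returns the area under a normal curve to the left of a given value, so subtracting two such areas gives the proportion of values between them. Using the mean of 5 and standard deviation of 1.6 from the example above:

pnorm(6.6, mean = 5, sd = 1.6) - pnorm(3.4, mean = 5, sd = 1.6)
## [1] 0.6826895
pnorm(8.2, mean = 5, sd = 1.6) - pnorm(1.8, mean = 5, sd = 1.6)
## [1] 0.9544997

These come out to roughly 68% and 95%, which is where the rule-of-thumb figures of 70% and 95% come from.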

Confidence intervals and standard deviation will play very prominent roles in the assessment of regression models, to which we’ll turn our attention in the next three sections.

4.2.1 Exercises

  1. Let X be the vector c(32.5, 29.1, 17.7, 25.1, 19.2, 26.8, 27.4, 21.1). Find the variance and standard deviation of X without using the built-in var and sd functions in R. Then check your answers using var and sd.
  2. Explain in words what the values of the variance and standard deviation from the previous problem actually mean.
  3. You can obtain a random sample of values from a population with a normal distribution with a given mean and standard deviation in R by using the function rnorm(<SAMPLE SIZE>, <MEAN>, <STANDARD DEVIATION>). For example, rnorm(25, 10, 2) will produce 25 points from a normal distribution with mean 10 and standard deviation 2. Obtain a sample of 100 points from a normally distributed population with mean 50 and standard deviation 0.6.

    1. About how many of your 100 points should you expect to be within 0.6 units of the mean? Compare this expected number to the actual number.
    2. About how many of your 100 points should you expect to be within 1.2 units of the mean? Compare this expected number to the actual number.
  4. When a data set is very large, bias isn’t as much of an issue when computing variance and standard deviation. Why not? (Hint: When \(n\) is large, how do the values of \(\frac{1}{n}\) and \(\frac{1}{n-1}\) compare in size? Test it for a few large \(n\) values.)

4.3 Residuals

Model assessment and use can be broken into three stages:

  1. Pre-assessment. During this stage, we check whether certain prerequisite conditions are satisfied that ensure that a regression model is appropriate for our data set.
  2. Post-assessment. If the model passes the pre-assessment stage, we can then check how well it fits the data by appealing to some goodness-of-fit statistics.
  3. Model interpretation. If our model has favorable goodness-of-fit statistics, we can use it to make predictions or test relationships among variables.

We’ll address the pre-assessment stage in this section.

The residual of a data point relative to a model is the difference between the actual response value of the point and the predicted response value. In other words, \[\textrm{residual} = \textrm{actual response} - \textrm{predicted response}.\]

Let’s calculate a few of the residuals of points in sim1 relative to sim1_model from Section 4.1. Recall that the formula of our model is \(y=2.052x+4.221\). One of the data points in sim1 is \(x = 2\), \(y = 11.3\). So 11.3 is the actual response value when \(x\) is 2. To get the predicted response value, we have to plug 2 in for \(x\) in our model. The predicted \(y\) value is thus \((2.052\times 2) + 4.221 = 8.325\). The residual is therefore \(11.3 - 8.325 = 2.975\). Our model underestimates the \(y\) value by 2.975.

Another of our data points in sim1 is \(x = 4\), \(y = 12.4\). The predicted \(y\) value is \((2.052\times 4) + 4.221 = 12.429\). The residual, then, is \(12.4 - 12.429 = -0.029\). The model just barely overestimates the \(y\) value.
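
As a quick check, we can redo this arithmetic in R using the unrounded slope m and intercept b that we extracted in Section 4.1. (We’re still plugging in the rounded y values printed in the tibble, so the results won’t exactly match the resid column we compute below, but at least the coefficients are no longer rounded.)

m * 2 + b           # predicted y when x = 2
11.3 - (m * 2 + b)  # residual for the point x = 2, y = 11.3

m * 4 + b           # predicted y when x = 4
12.4 - (m * 4 + b)  # residual for the point x = 4, y = 12.4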

We’d ideally like all of our residuals to be near 0. The modelr library includes two very useful functions that will help us check our residuals: add_residuals and add_predictions. Their names describe exactly what they do:

sim1_w_pred_resids <- sim1 %>%
  add_predictions(sim1_model) %>%
  add_residuals(sim1_model)

sim1_w_pred_resids
## # A tibble: 30 x 4
##        x     y  pred    resid
##    <int> <dbl> <dbl>    <dbl>
##  1     1  4.20  6.27 -2.07   
##  2     1  7.51  6.27  1.24   
##  3     1  2.13  6.27 -4.15   
##  4     2  8.99  8.32  0.665  
##  5     2 10.2   8.32  1.92   
##  6     2 11.3   8.32  2.97   
##  7     3  7.36 10.4  -3.02   
##  8     3 10.5  10.4   0.130  
##  9     3 10.5  10.4   0.136  
## 10     4 12.4  12.4   0.00763
## # ... with 20 more rows
## # i Use `print(n = ...)` to see more rows

The pred column contains the results of plugging the x values into our model. The resid column contains the residuals, the values of which are equal to y - pred. By scrolling through the resid column, we see that a lot of the residuals are in fact close to 0. We can more easily see this by obtaining a residual plot using the resid column:

ggplot(sim1_w_pred_resids) +
  geom_point(aes(x, resid))

Since we’re interested in the proximity of the residuals to 0, we can also add a horizontal reference line at 0 with geom_hline:

ggplot(sim1_w_pred_resids) +
  geom_point(aes(x, resid)) +
  geom_hline(yintercept = 0)

The clustering of the residuals around 0 is a good sign, but the residual plot also reveals another reason to believe that a simple linear regression model is appropriate for this data set: The residual plot has no obvious pattern.

To explain why the lack of a residual pattern is relevant, consider the fact that in any statistical process, there will be an element of randomness that is completely unpredictable. This is called noise. No model should predict noise. However, if there is any real underlying relationship among the variables at play in the statistical process, a good model would detect that relationship. Thus, we want our models to detect real relationships but not noise.

What does this have to do with our residual plot? Well, if our model can explain all of the behavior of our data except the random noise, then the only reason for a difference between actual and predicted values would be that noise. In other words, the residuals would measure only the noise. Since noise is totally random, this means that a good model would have a residual plot that looks totally random, as the one above does.

To contrast this situation, consider the following data set. (It’s not built in, which is why we’re importing it.)

sim1a <- readr::read_csv("https://raw.githubusercontent.com/jafox11/MS282/main/sim1_a.csv")
sim1a
## # A tibble: 10 x 2
##        x     y
##    <dbl> <dbl>
##  1     1  8.34
##  2     2  9.91
##  3     3 10.6 
##  4     4 11.2 
##  5     5 12.5 
##  6     6 14.7 
##  7     7 17.2 
##  8     8 19.0 
##  9     9 19.9 
## 10    10 20.5

Looking at a scatter plot, it seems like a linear model would be appropriate:

ggplot(sim1a) +
  geom_point(aes(x, y))

However, we should make sure that the residuals are random. Let’s build a linear model on sim1a, add the resulting residuals column to sim1a, and then get the residual plot:

sim1a_model <- lm(y ~ x, data = sim1a)

sim1a_w_resid <- sim1a %>%
  add_residuals(sim1a_model)

ggplot(sim1a_w_resid) +
  geom_point(aes(x, resid)) +
  geom_hline(yintercept = 0)

There’s a definite pattern in the residual plot. This means that there’s some relationship between x and y that sim1a_model is not detecting; its predictive power isn’t as good as it could be, and a different type of model might be more appropriate than our linear one. (However, only one residual value is more than 1 unit away from 0, so despite its shortcomings, our model is still pretty good at predicting y from x.)


The resid columns we added to sim1 and sim1a above are variables in their own right. Thus, we can talk about the mean, standard deviation, and distribution of the residuals of a model. Let’s do this for sim1_model. First, we need the mean:

mean(sim1_w_pred_resids$resid)
## [1] 5.121872e-15

This is scientific notation for the number 0.000000000000005121872. Whenever a statistical calculation produces a number with that many leading zeros after the decimal place, you should assume it’s actually 0. The very, very small deviation of the actual number from 0 is due to round-off error. So we can assume that the mean of the residuals is 0.

In fact, that will always be the case. Regression lines are calculated precisely so that they balance out the total of the positive residuals (which occur when the model makes an underestimate) with that of the negative residuals (which are due to overestimates). The fact that the residuals of a model always have a mean of 0 will be an important point in the next section.

We can also find the standard deviation of the residuals, but we’ll wait for this until the next section since it requires another look at bias and degrees of freedom. Suffice it to say for now, sd(sim1_w_pred_resids$resid) does not give quite the right answer.

Finally, we can check the distribution of the residuals by obtaining a histogram. (You might have to play with the bin width for a while to get a satisfying visualization.)

ggplot(sim1_w_pred_resids) +
  geom_histogram(aes(resid), binwidth = 0.3)

Part of the pre-assessment stage should involve examining the residual distribution histogram in addition to the residual scatter plot. Ideally, we would like the residuals to be normally distributed so we can meaningfully discuss confidence intervals of residuals. The above plot looks roughly normal, enough so to proceed past the pre-assessment stage.

By comparison, the distribution of the residuals for sim1a_model does not look normal. (Although with only 10 data points, we can’t expect much.)

ggplot(sim1a_w_resid) + 
  geom_histogram(aes(resid), binwidth = 0.25)

This doesn’t necessarily mean that we have to discard our model, but we should be aware that the residuals’ lack of normality will weaken the significance of the confidence intervals we discuss in Section 4.4.

4.3.1 Exercises

Exercises 1-5 refer to the model that was built in Exercises 4.1.1.

  1. Row 36 of the mpg data set has a displ value of 3.5 and a hwy value of 29. Using only the formula for your model (and not the add_predictions or add_residuals functions) find the hwy value predicted by your model and corresponding residual when displ = 3.5.

  2. Add columns to the mpg data set that contain the residuals and predictions relative to your model.

  3. Plot the residuals from the previous problem. Include a horizontal reference line through 0.

  4. Plot the distribution of the residuals. (Be sure to use an appropriate bin width.)

  5. Based on your residual plot and residual distribution from the previous two problems, would you say that a linear model is appropriate for this data set? Explain your answer.

  6. Which of the following residual plots best indicates that a linear model is an appropriate way to model the data? Explain your answer.

  7. Go to this web site. Copy and paste the 200 records from the table into a csv file and import it into R. Find the linear model that best fits the data. Then obtain the residual plot and residual histogram. Based on these, would you say that a linear model is appropriate for this data? Explain your answer.

4.4 Residual Standard Error

If a linear regression model passes the pre-assessment stage, the next step is to assess how well it actually captures the relationship among its variables. In this and the next section, we’ll work through the post-assessment stage by considering goodness-of-fit statistics of two kinds: the residual standard error and R2 values.

From Section 4.3, we know that the mean value of the residuals of any regression model is 0. Assuming the residuals are normally distributed, if we can find the standard deviation of the residuals, we can investigate the 70% and 95% confidence intervals of the residuals. The standard deviation of the residuals is called the residual standard error (RSE).

Recall that the standard deviation of a variable X is \[\textrm{sd(X)}=\sqrt{\frac{\textrm{SS(M)}}{n-1}},\] where SS(M) is the sum of the squares of the mean deviations of X. The key point to review at this point is the reason we use \(n-1\) in the denominator rather than \(n\): It corrects the bias incurred by using the sample mean in place of the true population mean. As stated in Section 4.2, if we have to use more than one sample statistic, say \(p\) of them, to compute another sample statistic, then we’ll pick up even more bias. The way to correct for this is to use \(n-p\) in place of \(n\). This is exactly what happens when we attempt to compute the RSE of a model.

When we compute a residual, we subtract a predicted response value from an actual response value. The predicted response is calculated from the regression equation, which is a linear equation built from two sample statistics: the slope and the intercept. Thus, any statistic we compute on the residuals will rely on two other sample statistics, and this means that any residual statistic is doubly biased. Therefore, when computing the RSE (i.e., the standard deviation of the residuals), we will have to correct for this double bias by using \(n-2\) in place of \(n-1\).

Recall that the mean value of the residuals of a model is always 0, so the mean deviation of a residual value is just the residual itself. Thus when we compute the sum of the squares of the mean deviations (i.e., the SS(M)) of the residuals, we’re really just computing the sum of the squares of the residuals. To emphasize this, we’ll denote this quantity by SS(R) rather than SS(M). The RSE is thus given by \[\textrm{RSE}=\sqrt{\frac{\textrm{SS(R)}}{n-2}}.\] The expression inside the radical above is then just the variance of the residuals. This quantity is sometimes called the mean squared residual (MSR). We thus have \[\textrm{MSR}=\frac{\textrm{SS(R)}}{n-2}.\]

For sim1 and sim1_model, we get the following MSR and RSE:

R <- sim1_w_pred_resids$resid

n <- length(R)

MSR <- 1/(n - 2) * sum(R^2)
RSE <- sqrt(MSR)

MSR
## [1] 4.852664
RSE
## [1] 2.202876

The RSE of a model is included in a model’s summary output. It’s three lines up from the bottom.

summary(sim1_model)
## 
## Call:
## lm(formula = y ~ x, data = sim1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1469 -1.5197  0.1331  1.4670  4.6516 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.2208     0.8688   4.858 4.09e-05 ***
## x             2.0515     0.1400  14.651 1.17e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.203 on 28 degrees of freedom
## Multiple R-squared:  0.8846, Adjusted R-squared:  0.8805 
## F-statistic: 214.7 on 1 and 28 DF,  p-value: 1.173e-14

Notice it mentions there are 28 degrees of freedom, which is the value of \(n - 2\) when \(n\) is 30. (The only part of the summary above that we won’t discuss in this chapter is the last line, which deals with F-statistics.)
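
If you just want the RSE without printing the whole summary, base R’s sigma function extracts it from a fitted model (summary(sim1_model)$sigma returns the same number):

sigma(sim1_model)
## [1] 2.202876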

So what does the RSE tell us about our model? We found that the RSE of sim1_model is about 2.203. What does this number actually mean?

When the residuals are approximately normally distributed, the RSE of a model determines confidence intervals. Recall that the 70% CI is within 1 standard deviation of the mean, and the 95% CI is within 2 standard deviations of the mean. The mean of the residuals is always 0, and a residual of 0 means the prediction of the model exactly agrees with the actual response value. Therefore, the CIs of a model tell us how confident we can be that our predictions are within a certain tolerance of the actual values.

Since the RSE of sim1_model is 2.203, we get a 70% CI for the residuals of -2.203 to 2.203. This means we can be 70% confident that a given prediction made by the model will be within 2.203 of the actual value. Similarly, we get a 95% CI of -4.406 to 4.406, meaning that we can be 95% confident that a given prediction will be within 4.406 of the actual value. If our sample size is large, we can use these CIs to make inferences about the entire population from which the sample is taken, thanks to the fact that we corrected the bias incurred.

We can test our confidence intervals by plotting bands around the sim1_model regression line. The red lines are 1 RSE away from the regression line, and the blue lines are 2 RSEs away.

b <- sim1_coefs[1]
m <- sim1_coefs[2]

ggplot(sim1) +
  geom_point(aes(x, y)) +
  geom_abline(slope = m, intercept = b) +
  geom_abline(slope = m, intercept = b + 2.203, color = "red") +
  geom_abline(slope = m, intercept = b - 2.203, color = "red") +
  geom_abline(slope = m, intercept = b + 4.406, color = "blue") +
  geom_abline(slope = m, intercept = b - 4.406, color = "blue")

We can see that 21 out of 30 (70%) of the points lie between the red lines, and 29 out of 30 (about 97%, which is fairly close to the stated 95% confidence probability) of the points lie between the two blue lines. (These confidence bands might remind you of the envelopes around the trend curves from Section 1.4.)
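
Rather than counting points by eye, we can count them from the residuals we computed earlier in this section (R is the vector of residuals and RSE is the residual standard error):

sum(abs(R) <= RSE)      # points within one RSE of the regression line
## [1] 21
sum(abs(R) <= 2 * RSE)  # points within two RSEs of the regression line
## [1] 29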

4.4.1 Exercises

Exercises 1-2 refer to the model that was built in Exercises 4.1.1.

  1. Find the RSE of this model. Use it to fill in the blanks in the following sentences:

    1. There is a 70% chance that for a given value of displ, the value of hwy predicted by the model is within _________ of the actual value of hwy.
    2. There is a 95% chance that for a given value of displ, the value of hwy predicted by the model is within _________ of the actual value of hwy.
  2. Use the RSE to plot a 95% confidence band around the regression line. Then find the percentage of data points in the scatter plot that actually lie within this band.

  3. Find the 95% confidence interval for sim1a_model from Section 4.3. Should you trust this confidence interval? (Hint: What did the distribution of the residuals of sim1a_model look like? How many data points do you have to work with? How many degrees of freedom are you restricted to when computing the RSE?)

4.5 R2 Values

If we were asked to predict the value of a response variable given the value of a predictor and we did not have a model to help us, the best we could do would be to use the mean response value as our prediction. The error we should expect from this prediction would just be the expected deviation of the response values from the mean; this deviation could be measured by the variance or standard deviation of the response variable.

Of course, if we had a model to help with our prediction, we should expect a better result. The expected prediction error using the model would be the expected deviation of the actual response values from the predicted response values; in other words, it would be the expected value of the residuals of the model. As we know, the expected value of the residuals can be measured by the MSR or RSE.

This leads to another way to assess the performance of a model: compare the MSR to the variance of the response value. (Or we could compare the RSE to the standard deviation, but we’ll opt for the version that avoids roots.) We’ll make this comparison by forming the ratio \(\frac{\textrm{MSR}}{\textrm{variance}}\). If the model is any good at all, the predictions obtained from the model will be much better than the one obtained by just using the mean, and thus the value of the MSR will be much less than the variance. This means the ratio \(\frac{\textrm{MSR}}{\textrm{variance}}\) will be near 0. We can think of this ratio as a rough measurement of the percentage of the total variation in the response that the model cannot explain.

In practice, we actually consider the complement of this percentage: \(1 - \frac{\textrm{MSR}}{\textrm{variance}}\). This, then, is a rough measurement of the percentage of the total variation in the response that the model can explain, and it’s known as the adjusted R2 value. (The “square” in R2 reflects the fact that we’re using squared measures of variation: MSR and variance.) We thus have \[\textrm{adjusted R}^2=1-\frac{\textrm{MSR}}{\textrm{variance of response}}.\]

Let’s consider the sim1 data set again and its regression model sim1_model. To get the variance of the response y, we’ll just use the built-in var function:

Y <- sim1$y

total_var <- var(Y)

We now need the MSR of sim1_model. As we saw in the last section, the MSR is the variance of the model’s residuals, computed with \(\frac{1}{n-2}\) rather than \(\frac{1}{n-1}\) to correct for the double bias.

X <- sim1_w_pred_resids$resid

n <- length(X)

MSR <- 1/(n - 2) * sum(X^2)

Thus, the adjusted R2 value is:

1 - MSR/total_var
## [1] 0.8804914

We can thus say that sim1_model can explain about 88% of the variation in the response variable y. Or in other words, about 88% of the variation in y is due to the influence of x. The other 12% is due to random noise or factors the model doesn’t know about. The adjusted R2 value is included in the summary output for sim1_model. (Check the second line up from the bottom.)

summary(sim1_model)
## 
## Call:
## lm(formula = y ~ x, data = sim1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1469 -1.5197  0.1331  1.4670  4.6516 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.2208     0.8688   4.858 4.09e-05 ***
## x             2.0515     0.1400  14.651 1.17e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.203 on 28 degrees of freedom
## Multiple R-squared:  0.8846, Adjusted R-squared:  0.8805 
## F-statistic: 214.7 on 1 and 28 DF,  p-value: 1.173e-14

Notice that there’s another R2 value reported in the summary that’s a little bigger than the adjusted R2: the multiple R2 value. The multiple R2 value is very similar to the adjusted R2, except that the degrees of freedom for the variance and for the MSR are taken to be the same. The reason for this is motivated by an algebraic simplification. When there are \(n\) data points, there are \(n - 1\) degrees of freedom for the variance and \(n - 2\) for the MSR. If we let SS(M) stand for the sum of the squared mean deviations and SS(R) stand for the sum of the squared residuals as in the previous section, then we have \[\textrm{adjusted R}^2=1 - \frac{\textrm{SS(R)}/(n-2)}{\textrm{SS(M)}/(n-1)}.\] But if we assume the same number of degrees of freedom for variance and MSR (let’s say \(n - 1\) for both), the right side of the formula above becomes \[1-\frac{\textrm{SS(R)}/(n-1)}{\textrm{SS(M)}/(n-1)}.\] The \(n - 1\) factors cancel each other, and we are left with the definition of the multiple R2 value: \[\textrm{multiple R}^2=1 - \frac{\textrm{SS(R)}}{\textrm{SS(M)}}.\] For sim1_model, this gives:

resids <- sim1_w_pred_resids$resid  # The residuals of sim1_model
mean_dev <- sim1$y - mean(sim1$y)  # The mean deviations of the response

SSR <- sum((resids)^2)  # The sum of the squares of the residuals
SSM <- sum((mean_dev)^2)  # The sum of the squares of the mean deviations

multRsq <- 1 - SSR/SSM

multRsq
## [1] 0.8846124

Based on this R2 value, we would still say that the model can explain about 88% of the variation in y.

In practice, the default R2 value to use is the multiple R2, and the word “multiple” is usually omitted. This is a little lazy, but it does ease the algebra, and for large \(n\), the multiple R2 value is very close to the adjusted R2.

The adjusted R2 value has the benefit of being unbiased, but it has an even more important property that makes it sometimes preferable to multiple R2, as we’ll see in Section 4.8.
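
If you want the numbers themselves rather than the printed summary, both R2 values are stored in the object returned by summary:

summary(sim1_model)$r.squared
## [1] 0.8846124
summary(sim1_model)$adj.r.squared
## [1] 0.8804914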


Residual standard errors and R2 values provide quick metrics we can use to assess a model, but as is always the case with any kind of assessment, we should employ as complete an assessment as possible.

For example, recall the data set sim1a and its model sim1a_model from Section 4.3. Let’s run the model summary.

summary(sim1a_model)
## 
## Call:
## lm(formula = y ~ x, data = sim1a)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1206 -0.4750  0.1561  0.5620  0.9511 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.36737    0.52153   12.21 1.88e-06 ***
## x            1.45886    0.08405   17.36 1.24e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7634 on 8 degrees of freedom
## Multiple R-squared:  0.9741, Adjusted R-squared:  0.9709 
## F-statistic: 301.3 on 1 and 8 DF,  p-value: 1.237e-07

Notice that the RSE and both R2 values indicate that sim1a_model is better than sim1_model. However, recall that the residuals for sim1a_model exhibited a definite pattern, meaning that sim1a_model doesn’t have as much explanatory power as it could. The residuals for sim1_model, however, are random, meaning that sim1_model is explaining as much of the behavior as can be expected by a simple linear regression model. One might argue, then, that sim1_model is actually the superior model despite its worse goodness-of-fit statistics.

In practice, the assessment of the performance of a regression model should never be the responsibility of a single method. A good rule of thumb is to perform the pre-assessment stage by checking the residual plot of a model as well as the distribution of the residuals. We want the residual plot to look random, and we want the distribution to be normal. We then move on to the post-assessment stage in which we calculate both the RSE and R2 value. We want a low RSE, indicating that there is a high probability that a given data point lies close to the regression line, and a high R2 value, indicating that a large percentage of the variation in the response variable is explained by the model.

If we pass the pre- and post-assessment stages, we’re ready to move on to the interpretation stage, which we’ll discuss in the next section.

4.5.1 Exercises

  1. Referring to the model that was built in Exercises 4.1.1, what are the adjusted and multiple R2 values? Explain precisely what these numbers tell you about your model. (You can have R compute the R2 values for you.)

  2. For the data set below, build a linear model with y as the response and state the R2 value. Then look at the residual plot. Explain why this example shows that a nearly perfect R2 value does not automatically mean that a model is nearly perfect.

df <- tibble(
  x = c(1, 2, 3, 4, 5, 6, 7, 8, 100),
  y = c(2, 1, 4, 3, 6, 5, 8, 7, 100)
)
  3. Remember that R2 values are not true percentages. For example, they can be negative. What would it mean for a model to have a negative R2 value? (Hint: If \(1 - \frac{\textrm{MSR}}{\textrm{variance}}\) is to be negative, how would the MSR compare to the variance?)

  4. Without knowing anything about a data set or its linear regression model, explain which of these statements is more meaningful on its own (and give a reason for your answer): (1) “The RSE of the model is 2.25” or (2) “The R2 value of the model is 0.78.”

4.6 Regression Coefficients and p-values

Once we’ve assessed a regression model, we’re ready to put the model to use. One of the most important applications of linear regression is to determine the extent to which a predictor variable has an impact on the response variable. For example, how much of an impact does a person’s level of education have on their salary? How much of an impact does the Federal Reserve interest rate have on your personal savings account interest rate? How much of an impact does a professional basketball player’s age have on his or her amount of playing time? In this section, we’ll see how to both quantify this type of impact and determine whether the impact is statistically significant.

We’ll again develop the main ideas in this section using the model sim1_model that we built on the data set sim1 in Section 4.1: \(y = 2.052x + 4.221\). What kind of impact does the predictor x have on the response y? The slope of our regression line is 2.052, meaning that every time x increases by 1 unit, y increases by 2.052 units. In other words, the coefficient of the predictor variable measures the change in the response as the result of increasing the predictor by 1.

Even with this interpretation in place, it’s still not clear whether a change of 2.052 is significant for y. If not, then x would have very little power as a predictor of y.

You can imagine a situation in which a change of 2.052 in a response variable would barely even register; for example, a change of $2.05 in the U.S. national debt would be completely insignificant. Is this the case for y in sim1? Scrolling through sim1, it would appear that a change of 2.052 is fairly significant, and we can verify this statistically using a hypothesis test as follows.

Suppose we start with the hypothesis that x is not a good predictor of y. This is called the null hypothesis, and it would mean that a change in x, even a big one, would produce very little or no change in y. In terms of the regression model, this would mean that the slope should be near 0. Should we consider 2.052 to be near 0?

This is really a question about expected deviations: How does a deviation of 2.052 units from 0 compare to the expected deviation? If it’s much less than the expected deviation, then the deviation is probably just due to randomness. If it’s much greater, then it’s probably a real feature of the data.

Let’s see what we mean by an “expected deviation.” The slope of a regression model is a statistic derived from the sample of data points we’re given. As with all statistics, the regression slope we get from our sample is supposed to be an estimate of the true regression slope of the entire population from which our sample is taken. If we were to take a different sample of data points from the population and use these to calculate a regression slope, we would get a different estimate of the true regression slope. If we continued this process of sampling points from the population and using them to calculate a regression slope, we would end up with an entire distribution of regression slopes. We could then determine the expected deviation of these slopes by computing their standard deviation. The standard deviation of the regression slopes is called the standard error of the regression slopes.

Now, without going into the details of the calculation, it turns out that the standard error of the regression slopes we get from the population from which sim1 is taken is 0.14. This means that, on average, any regression slope we calculate from that population will have a value that differs from the true regression slope by about 0.14.

Recall the null hypothesis we’re testing: x is not a good predictor of y, meaning that the true regression slope should be 0. Our standard error now tells us that we should expect a sample to produce a regression slope that differs from 0 by 0.14 on average. However, our sample produced a slope of 2.052, which is several standard errors away from 0. We can calculate exactly how many standard errors separate 2.052 from 0 by dividing our slope estimate by the standard error:

2.052/0.14
## [1] 14.65714

This number, 14.65714, is called the t-statistic of the regression slope. It tells us that the slope of 2.052 we obtained is more than 14 standard errors removed from 0! That’s a lot. The probability of that happening by chance should be very low. In fact, skipping the details again, we can actually calculate that probability. It turns out to be only about 0.00000000000117%. This probability is called the p-value of the regression slope. We interpret the p-value by saying that if our null hypothesis is correct (that the true slope is 0), then the probability of obtaining a slope of 2.052 or greater just by chance is about 0.00000000000117%. Since this probability is so small, the most logical conclusion is that 2.052 is not just the result of chance but a good estimate of the true slope. Thus the true slope must not be 0, and we should reject the null hypothesis that x has no significant impact on y. We can therefore claim that x does have a significant impact on y.
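
All four of these numbers (the coefficient estimate, its standard error, the t-statistic, and the p-value) live in the coefficient table of the model’s summary, so you don’t have to compute them by hand. One way to pull that table out programmatically is:

coef(summary(sim1_model))  # matrix with Estimate, Std. Error, t value, and Pr(>|t|) columns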


The process just outlined is the standard way to determine whether a predictor variable has a statistically significant impact on the response variable. Here’s a summary:

  1. Calculate the regression slope m (often called the regression coefficient) in R.
  2. Calculate the standard error se of the regression slope. This is found in the model’s summary output in the “Coefficients” section.
  3. Calculate the t-statistic, which tells you by how many standard errors m differs from 0. (This is also found in the summary output.)
  4. Calculate the p-value, which tells you the probability of getting your t-statistic or higher if the true slope is actually 0. (Also found in the summary output.)
  5. If your p-value is small, you can reject the null hypothesis that the true slope is 0 and claim that the predictor does have a significant impact on the response. If the p-value is not small, you don’t have enough evidence to say that the predictor has a significant impact on the response.

How small does the p-value have to be in order to reject the null hypothesis that the slope is 0? The most common practice is to require a p-value of 5% (0.05) or less. (However, different circumstances might warrant higher or lower cutoffs for the p-value’s significance level.) In the summary output, you’ll see a line labeled “Signif. codes.” This explains the asterisks you’ll see next to the p-values in the “Coefficients” section. When a p-value is between 0 and 0.001, it’s labeled with ***; between 0.001 and 0.01, with **; between 0.01 and 0.05, with *; between 0.05 and 0.1, with .; and above 0.1 it’s not labeled.


Consider, for example, another data set (entered manually below) that records the number of hours that students in a class spent studying for an exam and the exam grades they received.

exam_grade <- tibble(
  hours = c(1, 3, 2.5, 1, 1, 4, 3.5, 2, 2, 3.5, 1.5, 1, 0.5, 3, 2, 2.5, 3, 1, 2.5, 2),
  grade = c(80, 75, 60, 75, 90, 80, 65, 70, 75, 80, 90, 55, 75, 80, 75, 90, 95, 85, 75, 80)
)

exam_grade
## # A tibble: 20 x 2
##    hours grade
##    <dbl> <dbl>
##  1   1      80
##  2   3      75
##  3   2.5    60
##  4   1      75
##  5   1      90
##  6   4      80
##  7   3.5    65
##  8   2      70
##  9   2      75
## 10   3.5    80
## 11   1.5    90
## 12   1      55
## 13   0.5    75
## 14   3      80
## 15   2      75
## 16   2.5    90
## 17   3      95
## 18   1      85
## 19   2.5    75
## 20   2      80

Is hours a good predictor of grade? Let’s look at the scatter plot:

ggplot(exam_grade) +
  geom_point(aes(hours, grade))

There doesn’t seem to be a strong linear relationship. Let’s build a model and look at the coefficient statistics.

exam_model <- lm(grade ~ hours, data = exam_grade)

summary(exam_model)
## 
## Call:
## lm(formula = grade ~ hours, data = exam_grade)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.2830  -2.5965  -0.0241   3.9670  17.3312 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  77.0900     5.4937  14.032 3.91e-11 ***
## hours         0.1929     2.3452   0.082    0.935    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.34 on 18 degrees of freedom
## Multiple R-squared:  0.0003758,  Adjusted R-squared:  -0.05516 
## F-statistic: 0.006767 on 1 and 18 DF,  p-value: 0.9353

Extracting the relevant numbers from this summary, we have:

predictor   coefficient   std. error   t-statistic   p-value
hours       0.1929        2.3452       0.082         0.935

The coefficient of hours (i.e., the slope) is 0.1929, which seems close to 0. This is our first clue that hours is not a good predictor of grade. However, we have to compare this coefficient to the standard error, 2.3452. The coefficient is much smaller than the standard error, which gives us a very small t-statistic of 0.082. This means that our coefficient differs from 0 by only about 0.082 standard errors (or equivalently, by about 8.2% of a standard error). The probability of this happening as a result of random noise would seem pretty high. Indeed, the p-value of 0.935 confirms this: if the true value of the coefficient were 0, the probability of accidentally getting a coefficient estimate of 0.1929 or above would be about 93.5%. This is nowhere near the 5% threshold, so we cannot confidently reject the hypothesis that the coefficient is actually 0.

So what should we conclude? We got a high p-value, which means that if hours has no impact on grade, then getting a slope of at least 0.1929 is very likely. This means that our obtained slope is consistent with the null hypothesis, but it doesn’t allow us to confidently say that the null hypothesis is true. The only conclusion we can make from our p-value is that we cannot reject the null hypothesis. (You may have noticed the terrible R2 values in the summary, meaning that we shouldn’t have even moved past the post-assessment stage in the first place. We broke our rule in order to illustrate p-value interpretation, though.)

The logic of hypothesis testing relies on an argumentation method known as proof by contradiction. Suppose you disagree with a claim and want to prove that this claim is false. The idea is to suppose, for the sake of argument, that the claim is actually true. Then if we take the supposed truth of this claim as a starting point, we try to show that it logically implies another claim which cannot possibly be true. This means that if we were to accept the original claim, we would also be forced to accept a related claim which we know to be false. Since we cannot accept false statements, we cannot accept the original claim either. Therefore, the original claim must be false. Notice that this technique only allows us to prove that the original claim is false; it doesn’t allow us to prove the original claim is true.

In a hypothesis test, our goal is to show that the null hypothesis is false. So, following the steps of a proof by contradiction argument, we start by supposing that the null hypothesis is true. We then try to show that the likelihood of obtaining our regression slope is very small, i.e., that it has a small p-value. If we can do this, then we will have shown that if we were to accept the null hypothesis, we’d be forced to also accept the occurrence of an event that is very unlikely. At that point, we would have our contradiction, and we would thus have the grounds to reject the null hypothesis. However, if we get a high p-value, then we will not have contradicted the null hypothesis, which means our proof by contradiction argument did not work. Therefore, we’d have no grounds to reject the null hypothesis.


We now know that the coefficient of the predictor variable (the slope) measures the amount by which the response variable changes when the predictor variable increases by 1. We also have a way to determine whether this slope estimate indicates that the predictor has a significant impact on the response: we just check whether the p-value of the coefficient is sufficiently small.

But there’s another statistic estimated by the regression model: the intercept. In a linear equation \(y = mx + b\), the intercept is the value of \(b\). It’s where the graph of the line intercepts the \(y\)-axis. In a regression model, the intercept is the value of the response when the predictor value is 0. In other words, the intercept measures a baseline value of the response variable prior to any influence from the predictor.

For example, in the exam_grade data set above, the intercept is 77.09. This means that the expected exam grade of a student who doesn’t study at all would be about 77%. Further, the intercept, being a statistic calculated from a sample taken from a larger population, has a standard error (i.e., standard deviation), t-statistic, and p-value, and these are interpreted the same way as above. The standard error measures how much intercept estimates computed from samples like ours typically deviate from the true intercept. The t-statistic is the ratio of the intercept estimate to its standard error, and the p-value is the probability of obtaining a t-statistic at least that extreme by chance under the assumption that the true intercept is 0. For the intercept in exam_grade, we have a p-value of only 0.0000000000391. This is well below the 0.05 threshold, so we can reject the null hypothesis that the true intercept is 0. (However, in the exercises below, you’ll take a closer look at this example and determine whether this intercept estimate should be taken seriously.)
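If you’d like an explicit range of plausible values for the intercept and slope rather than reasoning from standard errors by hand, base R’s confint function computes confidence intervals for a model’s coefficients. A quick sketch, again assuming the model object is exam_model:

confint(exam_model)                 # 95% intervals for the intercept and the hours coefficient
confint(exam_model, level = 0.99)   # or any other confidence level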

In practice, the significance of the intercept is often not as important as that of the slope, and intercepts with high p-values are usually still worth using.


When using statistics like the ones covered in this section to interpret a model, it’s extremely important to use precise language. Below is how you should phrase the interpretations of your coefficient statistics.

Suppose we’re modeling a response y by a predictor x, and our statistics are as follows:

predictor   coefficient   std. error   t-statistic   p-value
x                     m           se             t         p
  • To say that the coefficient is m means, “Every 1-unit increase in x leads to a change of m units in y.”
  • To say that the standard error is se means, “We would expect an average deviation of se units from the true coefficient when using our data set to estimate the value of the coefficient.”
  • To say that the t-statistic is t means, “The value of our coefficient is t standard errors away from 0.”
  • To say that the p-value is p means, “Assuming the true value of our coefficient is 0, there is a probability p of getting a coefficient whose t-statistic is at least as far from 0 as t (in either direction).” If p < 0.05, we can claim that x has a significant impact on y. Otherwise, we cannot make that claim.

4.6.1 Exercises

Exercises 1-4 refer to the model that was built in Exercises 4.1.1.

  1. State the coefficient, standard error, t-statistic, and p-value for this model.

  2. Use the wording at the end of this section to state precisely what your numbers from the previous exercise mean.

  3. Does displ have a significant impact on hwy? Explain your answer.

  4. What is the intercept of the model? Is it meaningful in the real-life context?

  5. Using the tibble or tribble function, manually enter a 10-observation data set with columns w, x, y, and z subject to the following conditions. Then show that these conditions are satisfied.

    1. The coefficient of w as a predictor of the response z has a very large positive t-statistic.
    2. The coefficient of x as a predictor of the response z has a very large negative t-statistic.
    3. The coefficient of y as a predictor of the response z has a t-statistic which is very close to 0.
  6. Suppose you’ve built a model with a variable R as the response and E as the explanatory variable that passes the pre-assessment checks (the residuals exhibit no pattern and are normally distributed) and the post-assessment checks (the RSE is low and the R2 is high). Suppose also that the slope of the model is 3.1 and has a p-value of 0.23. Explain why each of the following common misinterpretations is incorrect, and then give the correct interpretations.

    1. “There is a 23% chance that the slope of 3.1 we obtained was the result of statistical randomness.”
    2. “Since the p-value is above the 0.05 threshold, we can accept the null hypothesis and conclude that E has no significant impact on R.”
  7. This exercise refers to the exam_grade data set from this section. We found that the intercept estimate was 77.09, and the p-value is extremely low. We might be tempted to say, “We can therefore be very confident in saying that when a student doesn’t study at all, he or she will get a 77% on the exam.” Look again at the model’s summary, and explain why we shouldn’t necessarily believe this statement.

  8. For the exam_grade data set and its model, use the RSE to fill in the blanks: “If the model predicts that a student’s score will be 60%, then we can be 95% confident that the true score will be between _______ and _______.” Explain why this statement lends further evidence to the claim that exam_model is not very good.

4.7 Multiple Linear Regression

In most real-life modeling problems, a response variable is a function of several predictor variables, not just one. A linear model that makes use of more than one predictor is known as a multiple linear regression model. The algebraic representation of a multiple regression model is a linear equation with more than one input; if \(y\) is the response variable and \(x_1, x_2,\ldots,x_n\) are the predictors, then the model would have the form \[y=\beta_1x_1+\beta_2x_2+\cdots+\beta_nx_n+b,\] where the numbers \(\beta_1,\beta_2,\ldots,\beta_n\) are called the regression coefficients, and \(b\) is still called the intercept. In this section, we’ll revisit our primary questions, but now for multiple regression models: How do we know when a model is performing well, and what can we do with it if it is? We’ll answer these questions by adapting the notions of residuals, RSEs, confidence intervals, R2 values, and p-values to the multivariable setting.

Consider the following data set:

sim_mult <- tibble(
  x = c(2.1, -3.0, 1.6, 4.2, 3.3, 2.7, 3.4, 4.9, -1.9, -2.2),
  y = c(10.2, 12.2, 9.8, 11.1, 10.5, 13.1, 9.5, 10.0, 11.3, 12.7),
  z = c(6.5, 11.1, 8.1, 4.4, 5.2, 5.7, 5.4, 3.9, 9.9, 10.6)
)

sim_mult
## # A tibble: 10 x 3
##        x     y     z
##    <dbl> <dbl> <dbl>
##  1   2.1  10.2   6.5
##  2  -3    12.2  11.1
##  3   1.6   9.8   8.1
##  4   4.2  11.1   4.4
##  5   3.3  10.5   5.2
##  6   2.7  13.1   5.7
##  7   3.4   9.5   5.4
##  8   4.9  10     3.9
##  9  -1.9  11.3   9.9
## 10  -2.2  12.7  10.6

In the simple regression case, the modeling process usually starts with a scatter plot to see whether there is a visual linear relationship between the predictor and response, but with multiple regression, such a plot is not always an option. The reason is that more than one predictor variable would necessitate a plot that can’t be drawn in a two-dimensional plane.

If we want a model with z as the response and both x and y as the predictors, we can still use the lm function but with x + y to the right of ~:

sim_mult_model <- lm(z ~ x + y, data = sim_mult)

Let’s look just at the coefficients before we ask for the entire summary:

coef(sim_mult_model)
## (Intercept)           x           y 
##  10.6179955  -0.9663994  -0.1882910

This is telling us that the formula for the model is \[z = 10.6179955 - 0.9663994x - 0.1882910y.\]

As with simple regression, we can also find residuals in multiple regression as a means of assessing a model. For example, when \(x = 2.1\) and \(y = 10.2\), our model predicts that \(z\) is \(10.6179955 - (0.9663994\times 2.1) - (0.1882910\times 10.2) = 6.67\). The actual \(z\)-value corresponding to this \((x,y)\) pair is 6.5, which means we have a residual of \(6.5 - 6.67 = -0.17\). As before, modelr can compute predictions and residuals for multiple regression models:

sim_mult_w_pred_resid <- sim_mult %>%
  add_predictions(sim_mult_model) %>%
  add_residuals(sim_mult_model)

sim_mult_w_pred_resid
## # A tibble: 10 x 5
##        x     y     z  pred   resid
##    <dbl> <dbl> <dbl> <dbl>   <dbl>
##  1   2.1  10.2   6.5  6.67 -0.168 
##  2  -3    12.2  11.1 11.2  -0.120 
##  3   1.6   9.8   8.1  7.23  0.873 
##  4   4.2  11.1   4.4  4.47 -0.0691
##  5   3.3  10.5   5.2  5.45 -0.252 
##  6   2.7  13.1   5.7  5.54  0.158 
##  7   3.4   9.5   5.4  5.54 -0.143 
##  8   4.9  10     3.9  4.00 -0.0997
##  9  -1.9  11.3   9.9 10.3  -0.426 
## 10  -2.2  12.7  10.6 10.4   0.247

We can still obtain a residual plot, but we have to pick one predictor variable to display on the x-axis; it doesn’t matter which one. We’ll choose x for the plot below.

ggplot(sim_mult_w_pred_resid) +
  geom_point(aes(x, resid))

There’s no obvious pattern in the residual plot, so it seems like a linear regression model might be appropriate. We should also still examine the distribution of the residuals:

ggplot(sim_mult_w_pred_resid) +
  geom_histogram(aes(resid), binwidth = 0.1)

The histogram looks kind of normal, but with only 10 data points, it’s not going to look perfectly normal anyway. We’ll keep going with our model.


Since RSE and R2 values are defined in terms of residuals, these goodness-of-fit statistics also exist for multiple regression models. (In this case, we’re using three sample statistics to compute RSE and R2, namely the two regression coefficients and the intercept. This means our calculations would be triply biased, but we can correct this by using a \(\frac{1}{n-3}\) factor rather than \(\frac{1}{n}\).) The RSE and R2 values measure the same thing as in the simple case:

  • The RSE is roughly the average size of the residuals, i.e., the amount by which a typical prediction misses the actual value. It’s still the case that for a large data set with normally distributed residuals, we can be about 70% confident that any given prediction made by our model is within one RSE of the actual value, and 95% confident that a prediction is within two RSEs of the actual value.
  • The R2 value (both multiple and adjusted) roughly tells us the percentage of the variation in the response variable that the model can explain.

We can calculate RSE and R2 from scratch the same way we did in the simple case, but they’re also part of the summary statistics:

summary(sim_mult_model)
## 
## Call:
## lm(formula = z ~ x + y, data = sim_mult)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4265 -0.1619 -0.1099  0.1012  0.8735 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.61800    1.47428   7.202 0.000177 ***
## x           -0.96640    0.05712 -16.918 6.18e-07 ***
## y           -0.18829    0.12860  -1.464 0.186575    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4092 on 7 degrees of freedom
## Multiple R-squared:  0.9815, Adjusted R-squared:  0.9762 
## F-statistic: 185.9 on 2 and 7 DF,  p-value: 8.579e-07

It looks like we can be about 70% confident that the model’s predictions are within about 0.41 of the actual values, and about 95% confident that they’re within about 0.82 of the actual values. (We shouldn’t take the confidence intervals very seriously, though, since our data set is very small and the residuals aren’t very normally distributed.)

Also, based on the R2 value, it looks like our model can explain about 98% of the variation in the \(z\) variable. (Notice also that we only have 7 degrees of freedom, which is the value of \(n-3\).)
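As a quick check, we can recompute both statistics directly from the residuals we added earlier, using the \(\frac{1}{n-3}\) correction for the RSE and one standard formula for the multiple R2; both values should match the summary output.

sim_mult_w_pred_resid %>%
  summarize(
    rse = sqrt(sum(resid^2) / (n() - 3)),                   # about 0.41
    r_squared = 1 - sum(resid^2) / sum((z - mean(z))^2)     # about 0.98
  )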

The random residual plot, the roughly normal residual distribution, the low RSE, and the high R2 tell us we have a reliable model. (Again, though, with only 10 data points and 7 degrees of freedom, we run the risk of having too small a sample to make good inferences.) So what can we do with it?


Just as with simple regression, we can now turn our attention to the issue of determining the impact of the predictor variables on the response. This is where the regression coefficients come in. Again, the model we’re using for the sim_mult data set is \[z = 10.6179955 - 0.9663994x - 0.1882910y.\] When \(x\) increases by 1 with \(y\) held constant, \(z\) decreases by about 0.97. When \(y\) increases by 1 with \(x\) held constant, \(z\) decreases by about 0.19. In general, the regression coefficient of a predictor measures the impact of that predictor on the response while the other predictor is held constant. This is extremely useful because it allows us to measure the effect of one variable independently of the other. In other words, we can study the effect of \(x\) on \(z\) while controlling for \(y\) (or vice versa).

But of course, it’s not enough just to look at the values of the coefficients; we also have to determine whether they’re significantly different from 0. For this, we look again at the p-values, which are found in their usual spots in the model summary.

The coefficient of \(x\) is large compared to the standard error, as measured by the large t-statistic, and this leads to a p-value of 0.000000618, which is well below the 0.05 threshold. Thus, we can reject the null hypothesis that the true coefficient of \(x\) is 0 and confidently claim that \(x\) does have a significant impact on \(z\).

However, the p-value for the \(y\) variable is 0.186575, which is above the 0.05 threshold. This means we don’t have enough evidence to reject the null hypothesis that this coefficient is actually 0. Thus, we cannot claim that \(y\) has a significant impact on \(z\).

We can gain some insight into the effects of \(x\) and \(y\) on \(z\) by comparing the scatter plots of \(z\) vs. \(x\) and \(z\) vs. \(y\): as the plots below show, \(z\) and \(x\) have a clearly linear relationship, but \(z\) and \(y\) do not.
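Here’s a quick sketch that produces those two plots using the shortcut ggplot syntax from earlier:

ggplot(sim_mult) +
  geom_point(aes(x, z))   # clearly linear

ggplot(sim_mult) +
  geom_point(aes(y, z))   # no obvious linear pattern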


Multiple regression is a very powerful technique for studying the effects of some predictor variables (the experimental variables) while controlling for others (the control variables). Consider again the interpretation of a predictor’s regression coefficient in a multiple regression model: It tells us the amount by which the response variable changes when the predictor increases by 1 while all other predictors are held constant. In other words, the coefficient states the effect a predictor has on the response, everything else being equal. A multiple regression model, then, allows us to extract the way a predictor (the experimental variable) impacts the response without having to worry about confounding effects of the other predictors (the control variables).

For example, a famous study once reported that increasing the amount of coffee a person drinks tends to lower their life expectancy. The following (made-up) data set mimics this finding:

coffee <- tibble(
  cups_per_day = c(1, 3, 2, 4, 3, 2, 1, 5, 3, 4, 5, 3, 2, 1),
  age_at_death = c(87, 76, 85, 80, 84, 87, 88, 82, 78, 81, 82, 78, 82, 84)
)

coffee
## # A tibble: 14 x 2
##    cups_per_day age_at_death
##           <dbl>        <dbl>
##  1            1           87
##  2            3           76
##  3            2           85
##  4            4           80
##  5            3           84
##  6            2           87
##  7            1           88
##  8            5           82
##  9            3           78
## 10            4           81
## 11            5           82
## 12            3           78
## 13            2           82
## 14            1           84

Let’s look at a scatter plot and then build a simple regression model.

ggplot(coffee) +
  geom_point(aes(cups_per_day, age_at_death))

coffee_model <- lm(age_at_death ~ cups_per_day, data = coffee)

summary(coffee_model)
## 
## Call:
## lm(formula = age_at_death ~ cups_per_day, data = coffee)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1144 -1.4472  0.8856  2.6019  3.4194 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   86.5132     1.9836  43.614 1.37e-14 ***
## cups_per_day  -1.4663     0.6436  -2.278   0.0418 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.176 on 12 degrees of freedom
## Multiple R-squared:  0.302,  Adjusted R-squared:  0.2438 
## F-statistic: 5.191 on 1 and 12 DF,  p-value: 0.0418

The scatter plot seems slightly linear, and the R2 value of 0.302, while fairly low, is still usable. The coefficient of cups_per_day is -1.4663, which would suggest that each additional cup of coffee per day lowers your life expectancy by about 1.5 years. The p-value for our predictor coefficient is just below the 0.05 threshold, so we can reject the null hypothesis that there is no connection between coffee intake and life expectancy.

This is troubling news for coffee drinkers. We can statistically make the claim that coffee intake lowers life expectancy. However, this is where good experiment design principles come into play. Any study that seeks to explain the behavior of a response variable (like life expectancy), especially one related to human behavior, should consider as many potential predictor variables as possible. (Although as we’ll see in the next section, it’s not always a good idea to actually use all of these variables.) Simple regression is rarely a sufficient way to model complex interactions and can greatly oversimplify the situation.

What the real-life study mentioned above did not take into account was that there’s a confounding variable. This is a variable that is closely related to both the predictor variable and the response variable. The confounding variable in this study is the smoking habits of the participants. Coffee intake and tobacco usage are positively correlated variables, meaning that when the value of one is high, the value of the other is often high, too. Also, it’s well-known that chronic smokers often have lower life expectancies. Suppose we add a column to our data set that records the number of cigarettes smoked per day:

coffee_cigs <- tibble(
  cups_per_day = c(1, 3, 2, 4, 3, 2, 1, 5, 3, 4, 5, 3, 2, 1),
  cigs_per_day = c(0, 4, 0, 2, 0, 0, 0, 4, 5, 0, 1, 2, 0, 1),
  age_at_death = c(87, 76, 85, 80, 84, 87, 88, 82, 78, 81, 82, 78, 82, 84)
)

coffee_cigs
## # A tibble: 14 x 3
##    cups_per_day cigs_per_day age_at_death
##           <dbl>        <dbl>        <dbl>
##  1            1            0           87
##  2            3            4           76
##  3            2            0           85
##  4            4            2           80
##  5            3            0           84
##  6            2            0           87
##  7            1            0           88
##  8            5            4           82
##  9            3            5           78
## 10            4            0           81
## 11            5            1           82
## 12            3            2           78
## 13            2            0           82
## 14            1            1           84
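Before building the model, we can check the claim that coffee intake and smoking habits move together in this made-up sample. A quick sketch using the cor function (covered in more detail in the next section):

cor(coffee_cigs$cups_per_day, coffee_cigs$cigs_per_day)   # positive, consistent with a confounding relationship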

Now we can build a regression model that investigates the effect of coffee drinking on life expectancy while controlling for smoking habits. Our experimental variable will be cups_per_day and the control variable will be cigs_per_day. (The lm function does not distinguish between experimental and control variables.)

coffee_cigs_model <- lm(age_at_death ~ cups_per_day + cigs_per_day, data = coffee_cigs)

summary(coffee_cigs_model)
## 
## Call:
## lm(formula = age_at_death ~ cups_per_day + cigs_per_day, data = coffee_cigs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4631 -1.8537  0.1568  1.3498  4.5291 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   86.1968     1.5678  54.980 8.89e-15 ***
## cups_per_day  -0.7414     0.5663  -1.309   0.2171    
## cigs_per_day  -1.2547     0.4354  -2.882   0.0149 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.504 on 11 degrees of freedom
## Multiple R-squared:  0.6023, Adjusted R-squared:  0.5299 
## F-statistic: 8.328 on 2 and 11 DF,  p-value: 0.006278

These results offer a striking contrast to those of the simple model above. Based on our R2 value of 0.6023, including cigs_per_day as a predictor gives us a model that explains about twice as much of the variation in the response. The RSE is also lower, giving us more confidence in the predictions.

Also, look at the coefficients: cups_per_day still has a negative coefficient (-0.7414), but its p-value jumped well above the 0.05 threshold. Statistically speaking, we can no longer claim that this coefficient differs from 0; once smoking habits are accounted for, we cannot claim that cups_per_day has a significant impact on age_at_death. The coefficient of cigs_per_day, on the other hand, is -1.2547, and its p-value is only 0.0149. We can therefore make the claim that cigarette smoking has a significant impact on life expectancy, and more specifically, it seems that each additional cigarette smoked per day lowers one’s life expectancy by about 1.25 years.

Adding a control variable allows us to make an apples-to-apples comparison of the life expectancies of coffee drinkers. We’re now only comparing coffee drinkers with similar smoking habits, which is more appropriate than comparing the life expectancies of, for example, someone who drinks one cup of coffee per day and does not smoke to someone who drinks four cups per day and also smokes a pack of cigarettes per day. The smoker will probably have a lower life expectancy, but not because he or she drinks more coffee.


This example reveals the utility of multiple regression and explains why it’s a favorite type of model among statisticians. Decoupling the effects of two predictor variables on the response variable is very easy – just include both of them in your model! This example also unfortunately shows how easy it is to mislead people with statistics by ignoring confounding variables, a problem known as “omitted variable bias.”

4.7.1 Exercises

These exercises refer to the data set found here. It consists of attributes describing the size and performance of several car makes and models from 1985 along with their retail selling price. (Click here for more information about this data set.)

  1. Download the data set above and import it into R.

  2. Create a linear regression model with price as the response variable and highway-mpg as the predictor variable.

  3. State the values of the coefficient of highway-mpg and its p-value. Interpret what each of these numbers means. What conclusion can you draw from these statistics, and why might this conclusion seem surprising?

  4. Explain why curb-weight and horsepower might be confounding variables in your model.

  5. Control for curb-weight and horsepower by adding them as predictors in your model from Exercise 2.

  6. What are the coefficient and p-value of the highway-mpg variable in the updated model from the previous exercise? Interpret what these numbers mean.

  7. In light of the previous exercise, explain why the conclusion from Exercise 3 above is an example of omitted variable bias.

  8. Do some research to find another real-life example of omitted variable bias.

4.8 Predictor Selection

The power of multiple regression lies in the fact that we can use as many predictors as we want to explain the behavior of a response variable. However, more is not always better, as we’ll soon see. When your data set gives you several predictor variables to potentially use in your model, it’s important to choose them in such a way that we find the right balance between explanatory power and redundancy. In this section, we’ll focus on the following issues we face when selecting predictors:

  • collinearity
  • irrelevant predictors
  • the influence of outliers
  • the problem of missing values

Collinearity occurs when two or more of the predictor variables exhibit a strong linear relationship with each other. We say these variables are correlated with each other. Intuitively, it would seem unnecessary to include both of the correlated variables – one of them should suffice. In addition to creating redundancy, though, collinearity can make it hard to interpret a model’s coefficients. Let’s illustrate the problem of collinearity by building a model that will predict the price values from the diamonds data set. In the next section, we’ll see how to include categorical predictors like cut, color, and clarity, but for now, we’ll just consider the continuous predictor variables from diamonds.

Since it’s so tempting to just use all of the potential predictors in our model, we’ll do just that to start:

diamonds_model_all <- lm(price ~ carat + depth + table + x + y + z, data = diamonds)

Now let’s check the statistics from the summary:

summary(diamonds_model_all)
## 
## Call:
## lm(formula = price ~ carat + depth + table + x + y + z, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23878.2   -615.0    -50.7    347.9  12759.2 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20849.316    447.562  46.584  < 2e-16 ***
## carat       10686.309     63.201 169.085  < 2e-16 ***
## depth        -203.154      5.504 -36.910  < 2e-16 ***
## table        -102.446      3.084 -33.216  < 2e-16 ***
## x           -1315.668     43.070 -30.547  < 2e-16 ***
## y              66.322     25.523   2.599  0.00937 ** 
## z              41.628     44.305   0.940  0.34744    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1497 on 53933 degrees of freedom
## Multiple R-squared:  0.8592, Adjusted R-squared:  0.8592 
## F-statistic: 5.486e+04 on 6 and 53933 DF,  p-value: < 2.2e-16

First notice that we have very high R2 values, indicating that our model fits the data well. Now look at the information for the predictors x (which is the length of a diamond in millimeters) and y (which is the width in millimeters). Both of these have p-values below the 0.05 threshold, so we are led to believe that their coefficients are significant. However, the coefficient of x is telling us that with each extra millimeter of length (all other variables held constant), the diamond’s price drops by over $1300. Notice the y coefficient, though: With each extra millimeter of width (all other variables held constant), the diamond’s price increases by about $66. If you scroll through the diamonds data set, you’ll see that the x and y values are almost always about the same. So if x increases by 1, y will also increase by about 1. Thus it wouldn’t make sense to say that the price will drop if x increases and go up if y increases. What’s going on?

The problem is that x and y are very highly correlated. We can see this by plotting them against each other:

ggplot(diamonds) +
  geom_point(aes(x, y))

There are a few extreme outliers (probably entry errors), but almost all of the points cluster very closely around a common line. Our data set doesn’t have much evidence for what happens when we increase x by 1 while holding y constant or vice versa; these phenomena just don’t happen. This leads to very unreliable coefficient values for x and y.
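To put a number on this relationship, we can compute the correlation coefficient between x and y directly (the full correlation matrix later in this section will confirm it):

cor(diamonds$x, diamonds$y)   # about 0.97, so x and y are nearly perfectly collinear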

The predictors x and y should therefore not both be used in our model. Unless we have specific domain knowledge that would motivate us to keep one over the other, we can delete either of these variables from our model and try again. Let’s delete y:

diamonds_model_2 <- lm(price ~ carat + depth + table + x + z, data = diamonds)

summary(diamonds_model_2)
## 
## Call:
## lm(formula = price ~ carat + depth + table + x + z, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23901.1   -615.4    -50.6    347.2  12760.3 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20962.789    445.450  47.060   <2e-16 ***
## carat       10691.590     63.171 169.247   <2e-16 ***
## depth        -204.549      5.478 -37.340   <2e-16 ***
## table        -102.741      3.082 -33.333   <2e-16 ***
## x           -1261.495     37.691 -33.470   <2e-16 ***
## z              57.248     43.897   1.304    0.192    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1497 on 53934 degrees of freedom
## Multiple R-squared:  0.8592, Adjusted R-squared:  0.8592 
## F-statistic: 6.582e+04 on 5 and 53934 DF,  p-value: < 2.2e-16

We now see the same situation with x and z. Not coincidentally, they’re also highly correlated:

ggplot(diamonds) +
  geom_point(aes(x, z))

Since z has a large p-value anyway, let’s remove it and try again.

diamonds_model_3 <- lm(price ~ carat + depth + table + x, data = diamonds)

summary(diamonds_model_3)
## 
## Call:
## lm(formula = price ~ carat + depth + table + x, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23894.2   -615.0    -50.6    346.9  12760.7 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20765.521    418.984   49.56   <2e-16 ***
## carat       10692.510     63.168  169.27   <2e-16 ***
## depth        -201.231      4.852  -41.48   <2e-16 ***
## table        -102.824      3.082  -33.37   <2e-16 ***
## x           -1226.773     26.678  -45.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1497 on 53935 degrees of freedom
## Multiple R-squared:  0.8592, Adjusted R-squared:  0.8592 
## F-statistic: 8.228e+04 on 4 and 53935 DF,  p-value: < 2.2e-16

No other glaring inconsistencies stand out, but we should now be a little worried that there are other correlated variables we’re missing. We could just look at the scatter plots for each pair of potentially correlated variables, but luckily, there’s a faster way. The degree to which two variables are correlated is measured by a statistic known as the correlation coefficient. This is a number from -1 to 1. Values near -1 indicate that the variables are negatively correlated – as one goes up, the other goes down. Values near 1 are positively correlated – when one changes, the other changes in the same way. Values near 0 mean that the variables are uncorrelated. We can calculate the correlation coefficients for each pair of predictor variables all at once using the correlation matrix. We first have to pare down the data set to include only the variables of interest.

diamonds_predvars <- diamonds %>%
  select(carat, depth, table, x, y, z)

cor(diamonds_predvars)
##            carat       depth      table           x           y          z
## carat 1.00000000  0.02822431  0.1816175  0.97509423  0.95172220 0.95338738
## depth 0.02822431  1.00000000 -0.2957785 -0.02528925 -0.02934067 0.09492388
## table 0.18161755 -0.29577852  1.0000000  0.19534428  0.18376015 0.15092869
## x     0.97509423 -0.02528925  0.1953443  1.00000000  0.97470148 0.97077180
## y     0.95172220 -0.02934067  0.1837601  0.97470148  1.00000000 0.95200572
## z     0.95338738  0.09492388  0.1509287  0.97077180  0.95200572 1.00000000

For example, the correlation coefficient between depth and table is -0.2957785. (Notice that the correlation coefficient between any variable and itself is always 1, which should make sense.) We can see that depth and table aren’t strongly correlated with any other variable, but carat, x, y, and z are all highly correlated with each other. Therefore, to avoid collinearity and the consequently unreliable coefficient values, we should include only one of these. Let’s use carat:

diamonds_model_4 <- lm(price ~ carat + depth + table, data = diamonds)

summary(diamonds_model_4)
## 
## Call:
## lm(formula = price ~ carat + depth + table, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18288.0   -785.9    -33.2    527.2  12486.7 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13003.441    390.918   33.26   <2e-16 ***
## carat        7858.771     14.151  555.36   <2e-16 ***
## depth        -151.236      4.820  -31.38   <2e-16 ***
## table        -104.473      3.141  -33.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1526 on 53936 degrees of freedom
## Multiple R-squared:  0.8537, Adjusted R-squared:  0.8537 
## F-statistic: 1.049e+05 on 3 and 53936 DF,  p-value: < 2.2e-16

We can now be confident that collinearity is not giving us unreliable regression coefficients.


Even if we make an effort to avoid collinearity among our predictors, it is still tempting to go with a “the more, the better” mentality and include as many noncollinear predictors as possible. After all, every time you add a predictor variable – even a completely irrelevant one – your (multiple) R2 value will never decrease and will almost always increase. Thus we can “chase” higher and higher R2 values by using more and more predictors. But should we?

Consider the following artificial data set that contains data about ten Major League Baseball teams. avg_sal is the team’s average player salary (in millions of dollars), all_stars is the number of players on the team that made the all-star team, state_pop is the population of the state in which the team plays its home games, and wins is the number of wins the team had in the most recent season.

bb_wins <- tibble(
  avg_sal = c(3.6, 4.3, 2.2, 4.8, 3.6, 6.1, 7.1, 4.2, 3.1, 5.2),
  all_stars = c(4, 1, 1, 3, 2, 5, 6, 2, 1, 3),
  state_pop = c(6.8, 4.6, 39.2, 10.0, 6.9, 10.8, 13.0, 19.8, 39.2, 5.9),
  wins = c(89, 83, 63, 88, 78, 95, 99, 69, 82, 92)
)

bb_wins
## # A tibble: 10 x 4
##    avg_sal all_stars state_pop  wins
##      <dbl>     <dbl>     <dbl> <dbl>
##  1     3.6         4       6.8    89
##  2     4.3         1       4.6    83
##  3     2.2         1      39.2    63
##  4     4.8         3      10      88
##  5     3.6         2       6.9    78
##  6     6.1         5      10.8    95
##  7     7.1         6      13      99
##  8     4.2         2      19.8    69
##  9     3.1         1      39.2    82
## 10     5.2         3       5.9    92

Let’s build a model that examines the impact of avg_sal, all_stars, and state_pop on wins. First, we check for correlations among the predictors:

bb_wins_predictors <- bb_wins %>%
  select(avg_sal:state_pop)

cor(bb_wins_predictors)
##              avg_sal  all_stars  state_pop
## avg_sal    1.0000000  0.8306749 -0.5454264
## all_stars  0.8306749  1.0000000 -0.4531666
## state_pop -0.5454264 -0.4531666  1.0000000

There’s a minor correlation between avg_sal and all_stars, but not enough to cause big problems. We’ll keep both in our model.

bb_wins_model1 <- lm(wins ~ avg_sal + all_stars + state_pop, data = bb_wins)

summary(bb_wins_model1)
## 
## Call:
## lm(formula = wins ~ avg_sal + all_stars + state_pop, data = bb_wins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.5318  -2.3922   0.7397   3.1439  10.5481 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  65.7603    12.0251   5.469  0.00156 **
## avg_sal       3.1900     3.1976   0.998  0.35698   
## all_stars     2.3371     2.4837   0.941  0.38305   
## state_pop    -0.1667     0.2194  -0.760  0.47614   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.265 on 6 degrees of freedom
## Multiple R-squared:  0.7265, Adjusted R-squared:  0.5897 
## F-statistic: 5.311 on 3 and 6 DF,  p-value: 0.0399

We have a pretty high R2 value, so we’d be tempted to say that this is a good model. However, it might bother you that we’re using the state_pop variable, as it seems irrelevant to the number of wins a team gets. Let’s build another model without it.

bb_wins_model2 <- lm(wins ~ avg_sal + all_stars, data = bb_wins)

summary(bb_wins_model2)
## 
## Call:
## lm(formula = wins ~ avg_sal + all_stars, data = bb_wins)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.046  -2.457   0.785   4.422   7.710 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   59.499      8.489   7.009  0.00021 ***
## avg_sal        4.017      2.914   1.378  0.21050    
## all_stars      2.337      2.408   0.971  0.36395    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.042 on 7 degrees of freedom
## Multiple R-squared:  0.7001, Adjusted R-squared:  0.6145 
## F-statistic: 8.172 on 2 and 7 DF,  p-value: 0.01477

Notice that the R2 value went down, as it always will when you remove predictors. However, the adjusted R2 went up. The adjusted R2 is indicating that the model that excludes state_pop as a predictor actually fits the data better than the one that includes it!

This is the benefit of having an adjusted R2 value: It doesn’t automatically increase as more predictors are added. It can detect the addition of irrelevant predictors and counter the false sense of goodness-of-fit that the multiple R2 value gives. When deciding whether to use a predictor in a model, you should compare the adjusted R2 value before and after the predictor is included. If including it makes the adjusted R2 decrease, it should be deleted from the model.
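For the two baseball models above, that comparison can be made directly by pulling the adjusted R2 out of each summary. A minimal sketch:

summary(bb_wins_model1)$adj.r.squared   # about 0.59, with state_pop included
summary(bb_wins_model2)$adj.r.squared   # about 0.61, with state_pop removed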

The reason adjusted R2 doesn’t automatically increase as more variables are added can be explained by the way it’s computed. Recall that \[\textrm{adjusted R}^2 = 1-\frac{\textrm{MSR}}{\textrm{variance of response}}.\] Let SS(R) denote the sum of the squared residuals of the model. If the data set has \(n\) observations and the model uses \(p\) predictor variables, then the number of degrees of freedom of the model is \(n-(p+1)\), since we estimate \(p\) coefficients plus the intercept. Thus, \[\textrm{MSR}=\frac{\textrm{SS(R)}}{n-(p+1)}.\] Now as \(p\) (the number of predictors) increases, \(n-(p+1)\) decreases. Unless SS(R) also decreases enough to compensate, the value of MSR will thus increase as \(p\) increases. In other words, if adding a predictor variable doesn’t significantly improve the performance of the model, the MSR value will increase, and this will cause the adjusted R2 to decrease.


We’ve seen why we should avoid collinearity and irrelevant predictors. The last issues we’ll cover regarding predictor selection have to do with the existence of outliers and missing data.

Let’s use the msleep data set to build a model that tests whether bodywt has a significant impact on sleep_total.

sleep_model1 <- lm(sleep_total ~ bodywt, data = msleep)

summary(sleep_model1)
## 
## Call:
## lm(formula = sleep_total ~ bodywt, data = msleep)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7008 -2.3787 -0.4268  3.2732  9.1731 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.7269205  0.4773797  22.470  < 2e-16 ***
## bodywt      -0.0017647  0.0005971  -2.956  0.00409 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.254 on 81 degrees of freedom
## Multiple R-squared:  0.09735,    Adjusted R-squared:  0.08621 
## F-statistic: 8.736 on 1 and 81 DF,  p-value: 0.004085

The RSE seems high (we can only be 95% confident that a prediction is within about 8.5 hours of the actual sleep_total value), and the R2 value is low (our model only explains about 9.7% of the variation in sleep_total). What might be causing such terrible performance? Let’s look at the scatter plot of sleep_total vs. bodywt:

ggplot(msleep) +
  geom_point(aes(bodywt, sleep_total))

It looks like there are two mammals in the data set whose body weights are clear outliers. Let’s filter these out and rebuild the model:

msleep2 <- msleep %>%
  filter(bodywt < 2000)

sleep_model2 <- lm(sleep_total ~ bodywt, data = msleep2)

summary(sleep_model2)
## 
## Call:
## lm(formula = sleep_total ~ bodywt, data = msleep2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.1422 -2.5334 -0.3189  2.8676  8.5670 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.33309    0.45785  24.753  < 2e-16 ***
## bodywt      -0.01290    0.00272  -4.742 9.21e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.88 on 79 degrees of freedom
## Multiple R-squared:  0.2216, Adjusted R-squared:  0.2118 
## F-statistic: 22.49 on 1 and 79 DF,  p-value: 9.209e-06

We still have a fairly high RSE and low R2, but the performance without the outliers is significantly better. Notice also that the bodywt coefficient changes dramatically after the outliers are removed. Needless to say, linear regression models are very sensitive to outliers. When selecting predictors, it’s important to check for outliers first and consider removing them if it makes sense to do so in the bigger context.

Let’s build another model on msleep that tests whether sleep_cycle has a significant impact on sleep_total.

sleep_model1.1 <- lm(sleep_total ~ sleep_cycle, data = msleep)

summary(sleep_model1.1)
## 
## Call:
## lm(formula = sleep_total ~ sleep_cycle, data = msleep)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9452 -2.7981 -0.4731  1.9073  7.4468 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   13.528      1.028  13.154 5.44e-14 ***
## sleep_cycle   -5.374      1.824  -2.946  0.00617 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.643 on 30 degrees of freedom
##   (51 observations deleted due to missingness)
## Multiple R-squared:  0.2244, Adjusted R-squared:  0.1986 
## F-statistic:  8.68 on 1 and 30 DF,  p-value: 0.006169

Our model’s goodness of fit statistics are not very impressive, but notice the parenthetical statement in the summary: “51 observations deleted due to missingness.” This is referring to the NAs in the sleep_cycle column. We can verify that there are 51 of them using our familiar method:

sum(is.na(msleep$sleep_cycle))
## [1] 51

Given that there are only 83 observations total in msleep, this is a lot of missing values. Unfortunately, the fact is that there’s no good way to handle missing data. We’ll discuss two methods here along with their pros and cons.

  1. We could drop the observations with missing values. This is what the lm function does by default. The advantage is that we can still salvage some of the information about our variable even though we know it will be incomplete. The disadvantage is that removing observations can drastically decrease the size of our sample, and this will significantly increase our RSE.

  2. We could replace the missing values with the mean of the non-missing values. This is a form of data imputation. Its advantage is that we will still be able to use all of our observations. The disadvantage is that it introduces bias into the data set, and it also reduces the overall variance of the variable. Let’s impute the mean value of sleep_cycle to the NAs and rebuild our model.

avg_sleep_cycle <- mean(msleep$sleep_cycle, na.rm = TRUE)

msleep1.2 <- msleep %>%
  mutate(sleep_cycle = replace_na(sleep_cycle, avg_sleep_cycle))

sleep_model1.2 <- lm(sleep_total ~ sleep_cycle, data = msleep1.2)

summary(sleep_model1.2)
## 
## Call:
## lm(formula = sleep_total ~ sleep_cycle, data = msleep1.2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5337 -3.1619 -0.0337  3.1657  8.9663 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   12.796      1.062  12.054   <2e-16 ***
## sleep_cycle   -5.374      2.161  -2.487    0.015 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.316 on 81 degrees of freedom
## Multiple R-squared:  0.07092,    Adjusted R-squared:  0.05945 
## F-statistic: 6.183 on 1 and 81 DF,  p-value: 0.01495

Our RSE actually went up and the R2 value went down. Mean imputation didn’t seem to help in this case.

There are several other much more sophisticated ways to impute missing values, but each of them introduces the risk of pushing the data in the direction of your own subjective point of view. Imputation is dangerous since you can unintentionally engineer your model to behave the way you think it should rather than the way the data says it should. If you must impute your missing values, always be absolutely transparent about how and why you’re doing it. When in doubt, it might be best to just not use a predictor with a lot of missing values if that’s an option.


Predictor selection should be a part of the pre-assessment stage when building a multiple regression model. Checking that there is no (or at least not much) collinearity, avoiding irrelevant predictors, checking for outliers in the data set, and handling the missing values in the most appropriate way all help ensure that our model will give us more reliable results.

4.8.1 Exercises

  1. Answer the following questions in reference to the final version of the price model from this section (the one which uses carat, depth, and table as the predictors).

    1. Why are the multiple and adjusted R2 values the same? (Hint: How many observations are in the data set?)
    2. Explain precisely what each of the regression coefficients tells us.
    3. Are the values of the regression coefficients statistically significant? Explain your answer.
    4. Are the residuals of this model normally distributed?
    5. What is the 95% confidence interval of this model?
  2. Suppose you want to build a model on the flights data set that seeks to find the main factors that influence a flight’s being delayed on arrival.

    1. Build a model with arr_delay as the response and the following variables as predictors: dep_delay, air_time, and distance.
    2. Which of the regression coefficients seem to contradict each other? Why might this be?
    3. Test for collinearity by checking whether any two of your predictors are highly correlated. (Warning: The correlation matrix will return NA when a variable has missing values. You’ll have to filter out the rows with NA in any of your three predictor columns.)
    4. Rebuild your model so that no two of the predictors are highly correlated. Do the coefficients make more sense now? Explain your answer.
  3. It was noted in this section that the x variable in diamonds has a few obvious outliers, which are probably entry errors. Build a model that predicts price from x with the outliers from x left in, and then build another one that predicts price from x with the outliers filtered out. Did removing the outliers improve your model? Explain your answer.

  4. Would it make sense to impute missing values in dep_delay from flights with the mean dep_delay value? Explain your answer. If you had to use dep_delay as a predictor variable in a model, how would you handle the NAs?

4.9 Categorical Predictors

It’s often the case that a response variable we want to model could have categorical predictors. For example, we might want to know whether a person’s gender has an impact on their salary, whether a car’s class has an impact on its fuel efficiency, or whether a diamond’s cut has an impact on its price.

It’s possible to use categorical predictors in a linear regression model. Consider the following data set, which lists the genders (M or F) of 10 adults along with their heights (in inches).

df <- tibble(
  gender = c("M", "F", "M", "M", "F", "F", "M", "M", "F", "F"),
  height = c(70, 64, 73, 67, 62, 67, 75, 68, 65, 63)
)

df
## # A tibble: 10 x 2
##    gender height
##    <chr>   <dbl>
##  1 M          70
##  2 F          64
##  3 M          73
##  4 M          67
##  5 F          62
##  6 F          67
##  7 M          75
##  8 M          68
##  9 F          65
## 10 F          63

Based on this sample, can we conclude that there is a significant difference between the heights of adult males and females? It would seem that a good way to answer this would be to build a regression model with height as the response and gender as the predictor. But how do we handle a categorical predictor variable such as gender? After all, regression models are linear equations that involve numerical variables.

Since gender is a binary variable (it only has two values), the simplest way to convert it to a numerical variable is to let one of the values of gender be represented by 0 and the other by 1. (It doesn’t matter which.) This is exactly what the lm function does. Let’s build the model and see how this looks.

df_model <- lm(height ~ gender, data = df)

summary(df_model)
## 
## Call:
## lm(formula = height ~ gender, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3.60  -1.95  -0.40   2.00   4.40 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   64.200      1.225  52.419 1.94e-11 ***
## genderM        6.400      1.732   3.695  0.00609 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.739 on 8 degrees of freedom
## Multiple R-squared:  0.6305, Adjusted R-squared:  0.5844 
## F-statistic: 13.65 on 1 and 8 DF,  p-value: 0.006086

We can see by the R2 value that this model does a pretty good job fitting the data. Look at the coefficient’s name, though: genderM. This means that the gender variable has been converted to a numerical one as described above, and the M in the name genderM indicates that we’ll be using 1 for the gender value M and 0 for F. Another helpful way to think about this is that the gender variable has been turned into a question – we can interpret the name genderM as asking, “Does gender equal M? Enter 1 for ‘yes’ or 0 for ‘no.’”

The formula for our model is \[\textrm{height} = (6.4\times \textrm{genderM}) + 64.2.\] The height predicted by a model for a male would be obtained by letting genderM = 1, which gives us 70.6 inches. The prediction for a female is obtained by letting genderM = 0, which gives us 64.2 inches.

These predictions should not seem mysterious. If you were asked to predict the height of a male or female based only on this data set, you’d probably just find the average height of males and females in the data set and use these as your predictions. This is exactly what lm does:

df %>%
  group_by(gender) %>%
  summarize(avg_height = mean(height))
## # A tibble: 2 x 2
##   gender avg_height
##   <chr>       <dbl>
## 1 F            64.2
## 2 M            70.6
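Equivalently, we can ask the model itself for these two predictions. A quick sketch using base R’s predict function:

predict(df_model, newdata = tibble(gender = c("F", "M")))   # 64.2 and 70.6, matching the group averages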

The slope and intercept of the model also have obvious interpretations: The slope is just the difference between the average heights (70.6 - 64.2 = 6.4). The intercept is the average height of the gender class represented by 0 (i.e., female).

We can now see exactly how to build a linear regression model with a binary categorical predictor: (1) Let one class of the binary variable be 1 and the other 0. (2) Find the average value of each class. (3) The intercept of the model is the average value of the class represented by 0. (4) The slope equals the average of the “1” class minus the average of the “0” class.

Recall the question we initially asked: Is there a significant difference between the heights of adult males and females? Well, given that this difference is just the regression slope, we can see that the model does indicate a difference in heights. But is it a statistically significant difference, or could the difference just be due to randomness? Of course, the way to answer this is to check the slope’s p-value. From the model summary above, we see the p-value of genderM is 0.00609. Since this is less than the 0.05 threshold, we can safely conclude that there is a significant difference in heights.


The above example represents the simplest case of using a categorical variable in a regression model: There’s only one categorical variable, and it’s binary. In practice, we’ll want to potentially use several categorical variables as predictors, each of which could have more than two values.

Consider the following built-in data set:

sim2
## # A tibble: 40 x 2
##    x          y
##    <chr>  <dbl>
##  1 a      1.94 
##  2 a      1.18 
##  3 a      1.24 
##  4 a      2.62 
##  5 a      1.11 
##  6 a      0.866
##  7 a     -0.910
##  8 a      0.721
##  9 a      0.687
## 10 a      2.07 
## 11 b      8.07 
## 12 b      7.36 
## 13 b      7.95 
## 14 b      7.75 
## 15 b      8.44 
## 16 b     10.8  
## 17 b      8.05 
## 18 b      8.58 
## 19 b      8.12 
## 20 b      6.09 
## 21 c      6.86 
## 22 c      5.76 
## 23 c      5.79 
## 24 c      6.02 
## 25 c      6.03 
## 26 c      6.55 
## 27 c      3.73 
## 28 c      8.68 
## 29 c      5.64 
## 30 c      6.21 
## 31 d      3.07 
## 32 d      1.33 
## 33 d      3.11 
## 34 d      1.75 
## 35 d      0.822
## 36 d      1.02 
## 37 d      3.07 
## 38 d      2.13 
## 39 d      2.49 
## 40 d      0.301

Given an x value, what should the predicted y value be? As above, it should just be the average y value for that x value. So before even bringing in a model, we know how to predict y from x:

sim2 %>%
  group_by(x) %>%
  summarize(avg_y = mean(y))
## # A tibble: 4 x 2
##   x     avg_y
##   <chr> <dbl>
## 1 a      1.15
## 2 b      8.12
## 3 c      6.13
## 4 d      1.91

We’d like to replicate these predictions with a linear regression model so that we can use its summary statistics to measure how good these predictions actually are. Our first task will be to see how to convert the values of the categorical variable x to numerical values.

The predictor variable x has four values: a, b, c, and d. One way to encode these values would be to just assign each one a number: a = 1, b = 2, c = 3, and d = 4. However, assigning numbers in this way imposes a ranking on the categorical values: a is less than b, which is less than c, which is less than d. This is a problem because there is no reason to assume there is any relationship among the values at all. We would also be in the strange situation where the computer would treat c as the average of b and d since 3 is the average of 2 and 4. This type of value encoding could misrepresent the data and lead to poorly performing models.

A better way to encode the values numerically is to do so in a way that doesn’t impose any kind of order. To this end, we can copy the way it was done with the binary variable gender earlier in this section. To review, we took the gender variable and turned it into a question: genderM. We read this as asking, “Does gender equal M?” If the answer is “yes,” we assign the value 1 to genderM, otherwise, we assign the value 0. Notice that there’s no need to follow up with the question, “Does gender equal F?” Once we know whether gender equals M, we can automatically deduce whether it equals F. That’s why we don’t also see a genderF variable in our model.

For the x variable in sim2, there are four possible values instead of two, but we can adapt the idea from the last paragraph as follows. Start by asking whether x equals b. (Or we can start by asking whether x equals some other value – it doesn’t matter.) We’ll do this by introducing a new variable xb and letting it equal 1 if x equals b and 0 otherwise.

Now if x does not equal b, then it will be one of a, c, or d. We’ll now have to ask a follow-up question, such as, “Does x equal c?” We’ll have to introduce another new variable named xc, letting its value be 1 if x equals c and 0 otherwise.

If x doesn’t equal b or c, then we won’t yet be able to automatically say what x is – it could still be a or d. We’ll ask one more follow-up question: “Does x equal d?” If it does, then xd = 1, otherwise xd = 0.

Once we’ve introduced the three new variables xb, xc, and xd – in effect asking whether x equals b, c, or d – we’ll be able to determine automatically whether x equals a. If any of xb, xc, or xd was assigned a value of 1, then x cannot equal a. If each of xb, xc, and xd was assigned a value of 0, then x must equal a.

We see that when a categorical predictor has \(n\) possible values, we have to expand that predictor into \(n-1\) numerical predictors, each of which can take on a value of either 0 or 1. So when \(n>2\), we’ll have a multiple regression model. To see how the lm function does this, let’s build a linear model on sim2 and look at the coefficients.

sim2_model <- lm(y ~ x, data = sim2)

sim2_model
## 
## Call:
## lm(formula = y ~ x, data = sim2)
## 
## Coefficients:
## (Intercept)           xb           xc           xd  
##      1.1522       6.9639       4.9750       0.7588

As you can see, the predictor x is expanded into xb, xc, and xd. The numbers listed under these variables are their coefficients in the model’s formula. This information is thus telling us that the algebraic form of our model is: \[y = 1.1522 + (6.9639\times \textrm{xb}) + (4.9750\times \textrm{xc}) + (0.7588\times \textrm{xd}).\] As explained above, there’s no xa variable since once we know whether x equals b, c, or d, we can automatically deduce whether it equals a.

How do we use this formula? Suppose we want to know the predicted value of y when x = c. Then we can let xb = 0, xc = 1, and xd = 0, getting y = 1.1522 + 4.9750 = 6.1272. If we want to know the predicted y value when x = a, we’d use 0 for each of xb, xc, and xd, getting y = 1.1522. Notice that these predictions agree with the average y values in the grouped summary above.
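
Rather than plugging into the formula by hand, we can also ask R for these predictions directly. Here’s a quick sketch that applies modelr’s add_predictions to a small tibble containing each possible x value; the pred column should match the avg_y values from the grouped summary at the start of this section.

tibble(x = c("a", "b", "c", "d")) %>%
  add_predictions(sim2_model)  # adds a pred column containing the model's prediction for each row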

We can even see where the coefficients come from:

  • The intercept is just the mean value of the a class.
  • The coefficient of xb equals the mean of the b class minus that of the a class.
  • The coefficient of xc equals the mean of the c class minus that of the a class.
  • The coefficient of xd equals the mean of the d class minus that of the a class.

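We can check these statements with a quick computation: each class mean minus the mean of the a class should reproduce the corresponding coefficient, up to rounding.

sim2 %>%
  group_by(x) %>%
  summarize(avg_y = mean(y)) %>%
  mutate(diff_from_a = avg_y - avg_y[x == "a"])  # rows b, c, d should match the coefficients above
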
If we want to know whether these coefficients are statistically significant, we just have to check their p-values, as always:

summary(sim2_model)
## 
## Call:
## lm(formula = y ~ x, data = sim2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.40131 -0.43996 -0.05776  0.49066  2.63938 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.1522     0.3475   3.316  0.00209 ** 
## xb            6.9639     0.4914  14.171 2.68e-16 ***
## xc            4.9750     0.4914  10.124 4.47e-12 ***
## xd            0.7588     0.4914   1.544  0.13131    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.099 on 36 degrees of freedom
## Multiple R-squared:  0.8852, Adjusted R-squared:  0.8756 
## F-statistic: 92.52 on 3 and 36 DF,  p-value: < 2.2e-16

It looks like the coefficients of xb and xc are significant, meaning that changing x from a to either b or c has a significant effect on y. However, the high p-value for xd means that we cannot conclude that changing x from a to d has a significant effect on y. This shouldn’t be too surprising, given that the mean values of the a and d classes are pretty close.


The pre-assessments (check the residuals) and post-assessments (check the RSE and R2 values) work the same way with categorical predictors as they do with continuous ones. Checking for collinearity is more complicated when categorical variables are present, and we won’t cover techniques for doing so in this introductory setting. However, this would still be a necessary step in the regression process.

For sim2_model above, we would start by checking whether the residuals are (at least approximately) normally distributed.

sim2_w_resid <- sim2 %>%
  add_residuals(sim2_model)

ggplot(sim2_w_resid) +
  geom_histogram(aes(resid), binwidth = 0.3)

This looks somewhat normal, which means we can construct confidence intervals. The model summary says the RSE is 1.099. Thus, we can be about 70% confident that our model’s predictions are within about 1.1 of the actual values, and we can be about 95% confident that they’re within about 2.2 of the actual values.
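
As a rough empirical check of these confidence claims, we can compute the fraction of residuals that fall within one and within two RSEs of 0, using the sim2_w_resid tibble from above. If the residuals are approximately normal, the results should come out to roughly 0.70 and 0.95.

sim2_w_resid %>%
  summarize(
    within_1_rse = mean(abs(resid) <= 1.099),  # proportion of residuals within one RSE of 0
    within_2_rse = mean(abs(resid) <= 2.198)   # proportion within two RSEs of 0
  )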

We can also see from the summary that both R2 values indicate that our model can explain about 88% of the variation in the y value.


Of course, it’s often necessary to use both continuous and categorical predictors in a regression model, as we’ll see in the following example.

In the diamonds data set, a mysterious phenomenon is that diamonds with the best cut (“Ideal”) are on average worth the least. We can see this in the box plot below:

ggplot(diamonds) +
  geom_boxplot(aes(cut, price))

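A grouped summary makes the claim numerical. If you run the code below, the Ideal class should come out with the lowest mean price. (These class means can also be recovered from the regression coefficients that follow.)

diamonds %>%
  group_by(cut) %>%
  summarize(avg_price = mean(price))  # mean price within each cut class
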
We can also back this claim up with a regression model.

cut_model1 <- lm(price ~ cut, data = diamonds)

summary(cut_model1)
## 
## Call:
## lm(formula = price ~ cut, data = diamonds)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4258  -2741  -1494   1360  15348 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4062.24      25.40 159.923  < 2e-16 ***
## cut.L        -362.73      68.04  -5.331  9.8e-08 ***
## cut.Q        -225.58      60.65  -3.719    2e-04 ***
## cut.C        -699.50      52.78 -13.253  < 2e-16 ***
## cut^4        -280.36      42.56  -6.588  4.5e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3964 on 53935 degrees of freedom
## Multiple R-squared:  0.01286,    Adjusted R-squared:  0.01279 
## F-statistic: 175.7 on 4 and 53935 DF,  p-value: < 2.2e-16

Before digging into this summary, let’s clear up the strange names given to the coefficients. In diamonds, cut is actually an ordered factor variable (as discussed in Section 3.2), not a plain, unordered categorical variable:

type_sum(diamonds$cut)
## [1] "ord"

When an ordered factor is used as a regression predictor, lm does not apply the 0/1 encoding described above; it uses a different scheme suited to ordered categories, which is why the coefficients have labels like cut.L and cut.Q instead of names like cutGood. We’re going to coerce cut into a character type and treat it as an ordinary categorical variable so that the model output will be easier to interpret:

diamonds_cat_cut <- diamonds %>%
  mutate(cut = as.character(cut))

type_sum(diamonds_cat_cut$cut)
## [1] "chr"

Now we’ll rebuild the model.

cut_model2 <- lm(price ~ cut, data = diamonds_cat_cut)

summary(cut_model2)
## 
## Call:
## lm(formula = price ~ cut, data = diamonds_cat_cut)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4258  -2741  -1494   1360  15348 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4358.76      98.79  44.122  < 2e-16 ***
## cutGood       -429.89     113.85  -3.776 0.000160 ***
## cutIdeal      -901.22     102.41  -8.800  < 2e-16 ***
## cutPremium     225.50     104.40   2.160 0.030772 *  
## cutVery Good  -377.00     105.16  -3.585 0.000338 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3964 on 53935 degrees of freedom
## Multiple R-squared:  0.01286,    Adjusted R-squared:  0.01279 
## F-statistic: 175.7 on 4 and 53935 DF,  p-value: < 2.2e-16

We can use the model to verify our claim that the diamonds with the best cut are worth the least. The intercept is the mean price value of the cut class not included in the coefficients, namely “Fair.” The coefficient of cutIdeal is -901.22, which means that on average, the price of a diamond with an ideal cut is $901.22 less than that of a diamond with a fair cut. Since the cutIdeal coefficient has the largest negative value among the coefficients, the model is saying that diamonds with an ideal cut are worth the least. Further, the minuscule p-value for cutIdeal means that this $900 difference in price is statistically significant. So what’s going on?

The first clue that something is wrong should have been the very low R2 value. This model only explains about 1.3% of the variation. That’s a really bad fit, and we shouldn’t even be talking about the model’s coefficients and p-values. The low R2 value is telling us that we might be experiencing omitted variable bias and that we thus need another predictor variable. Let’s use carat as well:

cut_model3 <- lm(price ~ cut + carat, data = diamonds_cat_cut)

summary(cut_model3)
## 
## Call:
## lm(formula = price ~ cut + carat, data = diamonds_cat_cut)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17540.7   -791.6    -37.6    522.1  12721.4 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3875.47      40.41  -95.91   <2e-16 ***
## cutGood       1120.33      43.50   25.75   <2e-16 ***
## cutIdeal      1800.92      39.34   45.77   <2e-16 ***
## cutPremium    1439.08      39.87   36.10   <2e-16 ***
## cutVery Good  1510.14      40.24   37.53   <2e-16 ***
## carat         7871.08      13.98  563.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1511 on 53934 degrees of freedom
## Multiple R-squared:  0.8565, Adjusted R-squared:  0.8565 
## F-statistic: 6.437e+04 on 5 and 53934 DF,  p-value: < 2.2e-16

Our R2 value jumped way up to about 0.86, so adding carat gave the model quite a bit of extra explanatory power. Also, look at the cut coefficients; the cutIdeal coefficient now has the largest positive value, meaning that once we control for the weight of the diamond (carat), the ideal cut diamonds are actually the most valuable. The low p-value tells us that we can be confident in this assertion.

What happened is this: We already know that price is heavily influenced by carat. Big diamonds are generally worth more than small ones. But it must be that the data set doesn’t contain many “unicorn” diamonds, i.e., big diamonds with ideal cuts. Diamonds with an ideal cut are usually small, and this brings their value down. That’s why we see that, on average, ideal diamonds are worth the least. It’s not because being ideal is undesirable; rather it’s because there’s a confounding variable, carat, that needs to be taken into account.
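
We can check this explanation directly by computing the average weight within each cut class. If the explanation is right, you should find that Ideal diamonds have a noticeably smaller average carat than Fair diamonds.

diamonds %>%
  group_by(cut) %>%
  summarize(avg_carat = mean(carat))  # average weight within each cut class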

Once we include carat as a predictor, we’re able to make an apples-to-apples comparison of the prices of the various cuts across diamonds of similar sizes. We’re no longer making unfair comparisons, such as comparing a 4.5-carat diamond with a Fair cut to a 0.2-carat diamond with an Ideal cut. This is analogous to the model constructed in Section 4.7 in which we first predicted life expectancy from coffee intake and then immensely improved the model by controlling for smoking behavior.

4.9.1 Exercises

  1. Build a model using mpg with hwy as the response and class as the predictor. Write down the actual algebraic formula for your model.

  2. Using the model formula from the previous exercise, predict the highway gas mileages of the following classes (without using the add_predictions function).

    1. suv
    2. subcompact
    3. 2seater
  3. Use the summary statistics for your model from Exercise 1 to assess how well your model performs.

  4. Click here to access a csv file containing average life expectancies of people from a selection of European countries in 2020, obtained from the World Bank web site.

    1. Download this file and import it into R.
    2. Use a linear model to determine whether there is a significant difference in average life expectancies in Northern European versus Western European countries. Explain your answer using your model’s statistics.
  5. We’ve known since Chapter 1 that in the mpg data set, cars with larger engines (as measured by displ) get worse gas mileage (as measured by hwy). However, there are exceptions to this trend, such as cars in the 2seater class, which get better gas mileage than would be predicted for cars with such big engines.

    1. Re-do Exercise 4 in Section 4.3.1, and explain why you see a surprising number of cars at the far right of the histogram.
    2. Build a model that predicts hwy from displ, but this time, control for class.
    3. Obtain a histogram of the residuals from the model in part (b), and compare it to the histogram from part (a). Did controlling for class make the histogram look more like a normal distribution?

4.10 Linear Regression Summary and Example

Regression modelling is a simple, yet powerful, way to investigate relationships among variables in a data set. Our treatment of linear regression has only been introductory, but we’ve developed enough tools to get started. Here’s a summary of the basic process.

  1. Pre-assessment

    • Decide what to do with any outliers or missing values in the data set. These can have a big effect on the performance and reliability of a model.
    • Select the predictors you want to use, making an effort to avoid collinearity and irrelevant predictors.
    • For a simple regression model with a continuous predictor variable, check whether the scatter plot of the response vs. predictor variables appears linear.
    • Check the residual plot against the experimental variable. The residuals should look random and should not stray too far from 0.
    • Check that the distribution of the residuals is approximately normal.

If the model passes the pre-assessment stage, we can assume that linear regression is appropriate. To see how well a regression model actually performs, we move to post-assessment.

  2. Post-assessment

    • Compute the RSE and the resulting confidence intervals. The lower the RSE, the better the model since a low RSE means most of our predictions are fairly accurate.
    • Check both of the R2 values. (They’re usually pretty close to each other.) The higher the R2, the better the model. However, avoid using irrelevant predictors to falsely increase your R2 value. If adding a predictor makes the adjusted R2 decrease, then the predictor is irrelevant and should be left out.

If we have an appropriate situation for a regression model (as determined during pre-assessment) and we can build a model that fits the data well (as determined during post-assessment), then we can use our model to investigate the effects of the predictors on the response.

  3. Interpretation

    1. Check the coefficients of your predictor variables. These will tell you the amount by which the response changes when a predictor increases by 1 while the others are held constant.
    2. Check the p-values of the coefficients to determine whether the coefficient values are statistically significant or are the result of random noise.

This process is not always carried out in the order just described. It’s often the case that we have to experiment quite a bit to fine tune our model, bouncing around among the three stages until we have something reliable.

The following example covers a good cross-section of the issues we’ve covered within linear regression.


We’ll be using the data set below, which lists insurance charges along with various health and demographic data of the insured. A full description of this data set can be found here.

insurance <- readr::read_csv('https://raw.githubusercontent.com/jafox11/MS282/main/insurance.csv')

insurance
## # A tibble: 1,338 x 7
##      age sex      bmi children smoker region    charges
##    <dbl> <chr>  <dbl>    <dbl> <chr>  <chr>       <dbl>
##  1    19 female  27.9        0 yes    southwest  16885.
##  2    18 male    33.8        1 no     southeast   1726.
##  3    28 male    33          3 no     southeast   4449.
##  4    33 male    22.7        0 no     northwest  21984.
##  5    32 male    28.9        0 no     northwest   3867.
##  6    31 female  25.7        0 no     southeast   3757.
##  7    46 female  33.4        1 no     southeast   8241.
##  8    37 female  27.7        3 no     northwest   7282.
##  9    37 male    29.8        2 no     northeast   6406.
## 10    60 female  25.8        0 no     northwest  28923.
## # ... with 1,328 more rows
## # i Use `print(n = ...)` to see more rows

We’ll investigate the question: Are men charged more for health insurance than women?

We can gain a preliminary insight with a box plot that compares the distribution of charges for men with that for women.

ggplot(insurance) +
  geom_boxplot(aes(sex, charges))

We can’t conclude very much from this box plot. Males’ charges seem a little higher, but is the difference significant? Let’s build a model to check.

insurance_model <- lm(charges ~ sex, data = insurance)

coef(insurance_model)
## (Intercept)     sexmale 
##   12569.579    1387.172

This tells us that, on average, males are charged $1387.17 more than females. However, the full model summary tells us more:

summary(insurance_model)
## 
## Call:
## lm(formula = charges ~ sex, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12835  -8435  -3980   3476  51201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12569.6      470.1  26.740   <2e-16 ***
## sexmale       1387.2      661.3   2.098   0.0361 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12090 on 1336 degrees of freedom
## Multiple R-squared:  0.003282,   Adjusted R-squared:  0.002536 
## F-statistic:   4.4 on 1 and 1336 DF,  p-value: 0.03613

The p-value for sexmale is less than 0.05, so we could conclude that the difference in insurance charges is statistically significant. However, the R2 value is extremely low, and the RSE tells us that we can be about 70% confident that the model’s prediction is within $12,090 of the actual charges value. That’s a wide interval. The goodness-of-fit statistics are telling us that using sex as the sole predictor produces a model that performs very poorly. We should not place much stock in the claim that males are charged significantly more than females; we should add more predictors as control variables.

Let’s return to the pre-assessment stage. We’ll start by checking for correlations among the continuous predictors:

insurance_predictors <- insurance %>%
  select(age, bmi, children)

cor(insurance_predictors)
##                age       bmi  children
## age      1.0000000 0.1092719 0.0424690
## bmi      0.1092719 1.0000000 0.0127589
## children 0.0424690 0.0127589 1.0000000

The correlation coefficients between any two of these continuous variables are very low, so we are free to use any of them as control variables without worrying about collinearity. Also, there are no missing values in the data set, and the distributions of the three continuous variables show that there are no obvious outliers:

ggplot(insurance) +
  geom_histogram(aes(age), binwidth = 1)

ggplot(insurance) +
  geom_histogram(aes(children))

ggplot(insurance) +
  geom_histogram(aes(bmi))
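
We can also confirm the claim that there are no missing values with a one-line check, which should return 0:

sum(is.na(insurance))  # total number of missing values across the whole data set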

Let’s see what happens when we control for age. This would ensure that we’re not accidentally seeing higher charges for men just because (hypothetically) the men in the sample are older, on average, than the women. First we’ll examine the distribution of the residuals. (Since our experimental variable is categorical, we won’t bother with the residual plot.)

insurance_model2 <- lm(charges ~ sex + age, data = insurance)

insurance_w_pred2 <- insurance %>%
  add_residuals(insurance_model2)

ggplot(insurance_w_pred2) +
  geom_histogram(aes(resid))

This distribution is definitely not normal; there seems to be an inordinate number of large negative and positive residuals, with relatively few residuals near 0. This indicates that the model tends to dramatically over- or under-predict, but that it rarely predicts accurately. We could stop here and add more predictors, but let’s check the summary anyway.

summary(insurance_model2)
## 
## Call:
## lm(formula = charges ~ sex + age, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8821  -6947  -5511   5443  48203 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2343.62     994.35   2.357   0.0186 *  
## sexmale      1538.83     631.08   2.438   0.0149 *  
## age           258.87      22.47  11.523   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11540 on 1335 degrees of freedom
## Multiple R-squared:  0.09344,    Adjusted R-squared:  0.09209 
## F-statistic:  68.8 on 2 and 1335 DF,  p-value: < 2.2e-16

The coefficient of sexmale is even larger than before, which means that controlling for age even further accentuates the difference in charges. (Maybe the men in the sample are younger than the women, on average.) Also, the p-value for sexmale is low, but it doesn’t matter since we still have a very low R2 value. More control variables are needed.
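
We could check that hunch directly with a grouped summary (a quick sketch; we won’t need the result for anything that follows):

insurance %>%
  group_by(sex) %>%
  summarize(avg_age = mean(age))  # average age of each sex in the sample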

Let’s build in the variables that quantify the health of the insured, namely bmi (body mass index) and smoker.

insurance_model3 <- lm(charges ~ sex + age + bmi + smoker, data = insurance)

insurance_w_pred3 <- insurance %>%
  add_residuals(insurance_model3)

ggplot(insurance_w_pred3) +
  geom_histogram(aes(resid))

This is a huge improvement! The residuals are much more normally distributed than before; the big cluster near 0 means that the model is making a lot of accurate predictions. Let’s move on to post-assessment and check the goodness-of-fit statistics.

summary(insurance_model3)
## 
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12364.7  -2972.2   -983.2   1475.8  29018.3 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11633.49     947.27 -12.281   <2e-16 ***
## sexmale       -109.04     334.66  -0.326    0.745    
## age            259.45      11.94  21.727   <2e-16 ***
## bmi            323.05      27.53  11.735   <2e-16 ***
## smokeryes    23833.87     414.19  57.544   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared:  0.7475, Adjusted R-squared:  0.7467 
## F-statistic: 986.5 on 4 and 1333 DF,  p-value: < 2.2e-16

The R2 values rose quite a bit; the model explains about 75% of the variation in charges. The RSE is also pretty low, and since the residuals are normally distributed, the confidence intervals determined by the RSE are reliable: We can be about 70% confident that the model predicts charges to within about $6094 of the actual value.

Given the behavior of the residuals and the goodness-of-fit statistics, we’re finally safe to move on to the interpretation stage. Recall that our question was whether males are charged more for insurance than females. Our previous inferior models seem to indicate that the answer is “yes,” but our most recent model tells a different story. The coefficient of sexmale is now negative, but more importantly, its p-value (0.745) is very high. This means that the sex variable has no significant impact on charges. Once you level the playing field and take age and health data into account, males do not get charged significantly more or less than females.
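
To see concretely what “leveling the playing field” means, we can ask insurance_model3 to predict charges for two hypothetical policyholders who are identical except for sex. (The age, bmi, and smoker values below are arbitrary choices for illustration.) The two predictions should differ only by the sexmale coefficient, about $109, which we’ve just seen is not statistically significant.

tibble(sex = c("male", "female"), age = 40, bmi = 30, smoker = "no") %>%
  add_predictions(insurance_model3)  # the two predicted charges differ only by the sexmale coefficient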

Although we’ve satisfactorily answered the question that was posed, we should still be curious about the effect of the other two variables not yet included: children and region. Let’s start by seeing what happens when we control for children:

insurance_model4 <- lm(charges ~ sex + age + bmi + smoker + children, data = insurance)

insurance_w_pred4 <- insurance %>%
  add_residuals(insurance_model4)

ggplot(insurance_w_pred4) +
  geom_histogram(aes(resid))

The histogram looks pretty similar to the most recent one. Adding children didn’t seem to improve the residuals very much. Let’s check the summary:

summary(insurance_model4)
## 
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker + children, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11837.2  -2916.7   -994.2   1375.3  29565.5 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12052.46     951.26 -12.670  < 2e-16 ***
## sexmale       -128.64     333.36  -0.386 0.699641    
## age            257.73      11.90  21.651  < 2e-16 ***
## bmi            322.36      27.42  11.757  < 2e-16 ***
## smokeryes    23823.39     412.52  57.750  < 2e-16 ***
## children       474.41     137.86   3.441 0.000597 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6070 on 1332 degrees of freedom
## Multiple R-squared:  0.7497, Adjusted R-squared:  0.7488 
## F-statistic:   798 on 5 and 1332 DF,  p-value: < 2.2e-16

The R2 values and RSE barely moved. Also, the p-value of the sexmale coefficient would lead us to the same conclusion about the difference in charges for males and females. We should probably leave children out as a control variable; including it only made a negligible difference.

Let’s run the same check for region:

insurance_model5 <- lm(charges ~ sex + age + bmi + smoker + region, data = insurance)

insurance_w_pred5 <- insurance %>%
  add_residuals(insurance_model5)

ggplot(insurance_w_pred5) +
  geom_histogram(aes(resid))

Again, this histogram doesn’t appear to have changed much. As for the summary statistics:

summary(insurance_model5)
## 
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker + region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11852.6  -3010.9   -987.8   1515.8  29467.1 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11556.96     985.63 -11.725   <2e-16 ***
## sexmale           -111.57     334.26  -0.334   0.7386    
## age                258.54      11.94  21.658   <2e-16 ***
## bmi                340.46      28.71  11.857   <2e-16 ***
## smokeryes        23862.91     414.82  57.526   <2e-16 ***
## regionnorthwest   -304.10     478.01  -0.636   0.5248    
## regionsoutheast  -1039.20     480.65  -2.162   0.0308 *  
## regionsouthwest   -916.44     479.72  -1.910   0.0563 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6087 on 1330 degrees of freedom
## Multiple R-squared:  0.7487, Adjusted R-squared:  0.7474 
## F-statistic:   566 on 7 and 1330 DF,  p-value: < 2.2e-16

The R2 values and RSE didn’t budge; in fact, the adjusted R2 came dangerously close to decreasing, which means that region is a borderline irrelevant variable. Also, the p-value for sexmale remained high. It’s not worth including region as a predictor, either.

4.11 Project

Project Description:

The purpose of this project is to build a linear regression model on a real-life data set and to interpret the results. The goal will be to investigate how various socio-economic variables affect pay rate.

You will need to install and load the wooldridge library.

Instructions:

Create a new R script file. This will be where you do your scratch work.

Preliminaries:

  1. Read the documentation for the wage1 data set from wooldridge.

  2. Convert wage1 to a tibble so that it will work better with tidyverse functions.

  3. Coerce the female and nonwhite columns into characters. (They’re originally integers, but we’ll want these to be categorical variables.)

Part I: Based on the data in wage1, was there gender-based pay inequity when this data was collected?

  1. Create a plot which will give an initial visual answer to the question. Think about the appropriate type of visualization, and be sure it’s labelled. Comment on your visualization and on whether it conveys enough information to satisfactorily answer the question.

  2. Build a simple regression model with wage as the response and female as the predictor.

    1. Discuss what the regression coefficient of female and its p-value tell you about differences in pay between men and women.
    2. Should you believe the conclusion from part (a)? Develop your answer by discussing the goodness-of-fit statistics.
    3. Create a visualization of the residual distribution and discuss its meaning.
    4. What can you conclude from your simple regression model?
  3. Add control variables to your model.

    1. Some of the variables in wage1 might be considered confounding variables. (Recall that these are variables which could affect the response but aren’t yet included in the model.) By reading the documentation for wage1, determine what the three most likely confounding variables are, and explain why.
    2. For the three variables from part (a), check for collinearity, missing values, and outliers. State how you will deal with these problems if they occur.
    3. Re-build your model with the appropriate confounding variables from part (a) added as controls.
  4. Assess your multiple regression model.

    1. Compare the residual distribution of the multiple regression model you just built to that of your initial simple regression model. Do you see improvements?
    2. Comment on the goodness-of-fit statistics of the multiple regression model. What do they mean? Are they significantly better than those of the simple model?
    3. Do you have a reliable model? Explain why or why not. If so, proceed to the next problem. If not, add new or delete existing control variables and re-assess.
  5. Interpret the results.

    1. What is the regression coefficient of female? What does it mean in the context from which the data is taken? How does it compare to the coefficient from the simple model?
    2. What is the p-value of the coefficient from part (a)? What does it tell you?
    3. State and explain your answer to the main question of this project.
  6. Would it be appropriate to use your model and its conclusion to claim anything about pay inequity today? Explain.

Part II: Is there race-based pay inequity?

  • Re-do Part I, but use the nonwhite variable in place of female.

Guidelines:

  • Include a title, your name, and the date in the heading of your R Markdown file.
  • Also include a preliminary code chunk in which you load the libraries you’ll need, and briefly say what each one is for.
  • Begin with an introductory paragraph containing, at least, a description of the data set (including what it contains, its size, and its source) and a nicely displayed data table using the datatable function. (If the data set is too big to display when you try to knit your .Rmd file, you don’t have to display it.)
  • Clearly describe what you’ll be doing with the data, and include any questions you’ll be trying to answer.
  • Follow the introduction and problem statement with the body of your report which will contain your work, broken into relevant section headings.
  • The compiled report should show all of your R code and its output, but it should not show any warning or error messages.
  • The body should also include text that provides a running narrative that guides the reader through your work. This should be thorough and complete, but it should avoid large blocks of text.
  • All graphics should look nice, be easy to read, and be fully labelled.
  • You should include insightful statements about the meaning you’re extracting from your models, graphics, and statistical calculations.
  • End with an overall concluding statement which summarizes your findings. Be sure to disclose any shortcomings of your analysis.

Grading Rubric:

  • Technical Explanations: Do you explain the technical aspects of your model, such as residuals, R2 values, RSE, confidence intervals, regression coefficients, and p-values, correctly? (30 points)
  • Arguments: Do you give fair and clear arguments based solely on the results of your models? Do your conclusions follow correctly from your models’ statistics? (30 points)
  • Narrative: Is it clear what you’re trying to accomplish with your model? Do you maintain a readable narrative throughout that guides the reader through your analysis? (20 points)
  • Professionalism: Does your report look nice? Do you provide insights based on your analysis? Is your code clear and readable? (15 points)

  1. One might ask, “But why \(n-1\)? Why is this the best way to correct bias? The reason stated above is that replacing \(n\) by \(n-1\) means that the fraction \(\frac{1}{n}\) will be replaced by the bigger fraction \(\frac{1}{n-1}\). However, there are lots of ways to make \(\frac{1}{n}\) bigger – we could replace it with \(\frac{2}{n}\), \(\frac{3}{n}\), \(\frac{1}{n-2}\), \(\frac{1}{n-3}\), etc. Why do we settle on \(\frac{1}{n-1}\)?” Think of it this way: Suppose we have an entire population of values whose standard deviation we’d like to compute. Let’s denote the true mean of the population by \(\mu\) and the true standard deviation by \(\sigma\). This means that, on average, the values of the population differ from \(\mu\) by \(\sigma\) units. Now, we should think of our vector X as a sample of \(n\) values from the population: \(x_1, x_2, \ldots, x_n\). We’ll denote the sample mean by \(\overline{X}\) and the sample standard deviation by \(S\). Thus, \(\overline{X}\) is an estimate of \(\mu\), but \(S\) is a biased estimate of \(\sigma\). The reason \(S\) is a biased estimate of \(\sigma\) comes from the fact that \(\overline{X}\) is a little different from \(\mu\). This means that the values of X will be, on average, a little closer to \(\overline{X}\) than to \(\mu\), and therefore \(S\) will always be a little smaller than \(\sigma\). So if we want to see how to correct the bias built into \(S\), we should start with the expected difference between \(\overline{X}\) and \(\mu\). To determine this, we’d have to take several samples from the population and compute the sample mean of each. We could then look at the distribution of these sample means and compute its standard deviation. This will tell us the expected difference between \(\overline{X}\) and \(\mu\). It turns out to be easier to start with the variance of the sample means rather than the standard deviation. (Recall that this is the average squared distance between \(\overline{X}\) and \(\mu\).) Without going into the details, it’s known that the variance of the sample means is \(\frac{\sigma^2}{n}\). This should make sense – the dispersion of the sample means should be directly proportional to the dispersion of the population and inversely proportional to the sample size. Therefore, we have that: (1) the expected squared distance from each \(x_i\) to \(\overline{X}\) is \(S^2\), (2) the expected squared distance from \(\overline{X}\) to \(\mu\) is \(\frac{\sigma^2}{n}\), and (3) the expected squared distance from each \(x_i\) to \(\mu\) is \(\sigma^2\). Putting these three statements together, we have \(S^2+\frac{\sigma^2}{n} =\sigma^2\). We can isolate \(\sigma^2\) in terms of \(S^2\), getting \(\sigma^2=\frac{n}{n-1}\cdot S^2\). This shows us that the true variance \(\sigma^2\) is bigger than the sample variance \(S^2\) by a factor of \(\frac{n}{n-1}\). Thus, if we want to correct the sample variance to remove the built-in bias, we should multiply it by \(\frac{n}{n-1}\). The formula for sample variance is \(\frac{\textrm{SS(M)}}{n}\). If we multiply this fraction by \(\frac{n}{n-1}\), we get the formula for the true population variance: \(\frac{\textrm{SS(M)}}{n-1}\). We thus have a purely algebraic reason for dividing by \(n-1\) rather than \(n\) when estimating population variance and standard deviation in an unbiased way.↩︎

  2. Notice that this is yet another instance of a standard deviation with a different name. Standard error is just the standard deviation of the regression slope, much like residual standard error is just the standard deviation of the residuals.↩︎