Chapter 5 Statistical Modeling

We’ve been able to extract meaningful insights from data sets using various transformation functions and visualizations, but these insights have been largely qualitative: we’ve mostly described them by appealing to nicely displayed tables and plots. In this chapter, we’ll use statistical models to extract more quantitative insights. In the process, we’ll have to dive deeper into statistics topics than we have so far.

We’ll need a new package called modelr. You’ll have to install it first:

install.packages("modelr")

And remember to load the necessary libraries:

library(tidyverse)
library(modelr)

5.1 Simple Linear Regression

A statistical model is a mathematical function that accepts input and produces an output. The way these functions are defined depends on a data set that contains (ideally several) observed input-output pairs. The model itself generalizes the input-output relationships observed in the data set. In a statistical model, the input variables are usually called predictor, feature, explanatory, or independent variables. The output is usually called the response or dependent variable.

For example, consider the small sample data set sim1:

sim1
## # A tibble: 30 x 2
##        x     y
##    <int> <dbl>
##  1     1  4.20
##  2     1  7.51
##  3     1  2.13
##  4     2  8.99
##  5     2 10.2 
##  6     2 11.3 
##  7     3  7.36
##  8     3 10.5 
##  9     3 10.5 
## 10     4 12.4 
## # ... with 20 more rows
## # i Use `print(n = ...)` to see more rows

We have 30 input-output pairs; we’ll take x to be the predictor variable and y to be the response variable. A natural question is, “If we know the value of x, can we predict the value of y?” Technically, the answer is “no,” since, for example, when x is 3, y could be either 7.36 or 10.5. But the purpose of a statistical model is to provide an approximate prediction rather than an exact one.

A good first step toward making a prediction would be to obtain a scatter plot of the data. We’ll also include a trend curve:

ggplot(data = sim1, mapping = aes(x = x, y = y)) +
  geom_point() +
  geom_smooth()

Before proceeding, it’s worth pointing out a shortcut in the ggplot syntax. The first argument of every ggplot call is data =, so it’s unnecessary to type data =; you can just give the name of the data set. The same is true for mapping =: it’s the second argument by default, so you can just enter your aesthetics with aes. Finally, within aes, the first two arguments are understood to be x = and y =, so you can just enter the x and y variables directly. Thus, the above ggplot call can be shortened to:

ggplot(sim1, aes(x, y)) +
  geom_point() +
  geom_smooth()

We’ll be using these shortcuts from now on.

Anyway, referring to our scatter plot, it’s clear that, while our points don’t all lie exactly on the same line, they do seem to cluster around some common line. This line of best fit is called the regression line of the data set. Being a line, it can be described algebraically by an equation of the form y = m(x) + b. If we can find the values of m (the slope) and b (the y-intercept), we’ll have a statistical model for our data set. A statistical model of this type is called a simple linear regression model. The word “simple” refers to the fact that we have only one predictor variable. When we have more than one, we call it multiple linear regression.

One way to find m and b would be to eyeball the scatter plot. This isn’t ideal, but it will get us started. If we were to extend the blue trend curve to the left, it looks like it might cross the y-axis around 4. Also, it looks like as x increases from 1 to 10, y increases from about 5 to 24. This means we have a rise of 19 over a run of 9, for a slope of 19/9, or about 2.1. If we have a slope and intercept in mind, we can plot the corresponding line with geom_abline. Let’s plot our eyeballed regression line over our scatter plot:

ggplot(sim1) +
  geom_point(aes(x, y)) +
  geom_abline(intercept = 4, slope = 2.1)

This looks like a pretty good fit, but we’d prefer that R do it for us. We can achieve this with the lm (= linear model) function, which is built into R. In the code below, sim1_model is the name we’re choosing for our regression model. Also, notice the first argument of lm is y ~ x. The lm function wants us to enter the response variable on the left of ~ and the predictor on the right.

sim1_model <- lm(y ~ x, data = sim1)

Let’s see what we have:

sim1_model
## 
## Call:
## lm(formula = y ~ x, data = sim1)
## 
## Coefficients:
## (Intercept)            x  
##       4.221        2.052

It looks like the actual intercept value is 4.221. The value 2.052 listed under x in the above output is the coefficient of x in our linear model, i.e., the slope. So it looks like we were close. Let’s plot the actual regression line in red over our estimated one from above:

ggplot(sim1) +
  geom_point(aes(x, y)) +
  geom_abline(intercept = 4, slope = 2.1) +
  geom_abline(intercept = 4.221, slope = 2.052, color = "red")

It’s hard to tell the difference, but we’ll trust that the red line fits the data more closely.

Of course, it’s not a good practice to do all of this piecemeal, transcribing the intercept and slope over to geom_abline. An entry error could be made. We can extract these values directly from the model, though, by getting the coefficient vector:

coef(sim1_model)
## (Intercept)           x 
##    4.220822    2.051533

If we give this vector a name, we can then access the individual entries when plotting our regression line:

sim1_coefs <- coef(sim1_model)

b <- sim1_coefs[1]
m <- sim1_coefs[2]

ggplot(sim1) +
  geom_point(aes(x, y)) +
  geom_abline(intercept = b, slope = m, color = "red")
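As a quick aside, if all you want is the picture (and not the slope and intercept themselves), ggplot2 can fit and draw the same least-squares line for you with geom_smooth using method = "lm". This is just an alternative sketch; se = FALSE hides the gray confidence band around the line:

# Let ggplot fit and draw the least-squares line directly
ggplot(sim1, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red")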

5.1.1 Exercises

  1. Obtain the scatter plot from the mpg data set of hwy vs displ. (Recall that this means that hwy is the y variable and displ is the x.) Guess the values of the slope and y-intercept of the best-fit regression line to this data, and plot the line along with the scatter plot.

  2. Use modelr to find the actual regression line. Then plot it in a different color on top of your scatter plot from the previous exercise.

5.2 Residuals

An extremely important part of statistical modeling is assessing the effectiveness of our models. We want to make sure our model captures the behavior of the phenomenon being modeled, and we want to choose predictor variables that have a significant effect on the response variable. In this and the next section, we’ll address the first of these points. We’ll address the second in Section 5.4.

The residual of a data point relative to a model is the difference between the actual response value of the point and the predicted response value. In other words, residual = actual \(-\) predicted.

Let’s calculate a few of the residuals of points in sim1 relative to sim1_model. One of the data points in sim1 is x = 2, y = 11.3. So 11.3 is the actual response value when x = 2. To get the predicted response value, we have to plug 2 in for x in our model: y = 2.052(x) + 4.221. The predicted y value is thus 2.052(2) + 4.221 = 8.325. The residual is therefore 11.3 - 8.325 = 2.975. Our model underestimates the y value by 2.975.

Another of our data points in sim1 is x = 4, y = 12.4. The predicted y value is 2.052(4) + 4.221 = 12.429. The residual, then, is 12.4 - 12.429 = -0.029. The model just barely overestimates the y value.
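We can also let R do this arithmetic for us using the slope m and intercept b we extracted in Section 5.1. Here's a quick check of the second example (the names pred_at_4 and the specific point are just for illustration):

pred_at_4 <- m * 4 + b   # predicted y when x = 4 (about 12.43)
12.4 - pred_at_4         # residual (about -0.03)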

We’d ideally like all of our residuals to be near 0. The modelr library includes two very useful functions that will help us check our residuals: add_residuals and add_predictions. Their names describe exactly what they do:

sim1_w_pred_resids <- sim1 %>%
  add_predictions(sim1_model) %>%
  add_residuals(sim1_model)

sim1_w_pred_resids
## # A tibble: 30 x 4
##        x     y  pred    resid
##    <int> <dbl> <dbl>    <dbl>
##  1     1  4.20  6.27 -2.07   
##  2     1  7.51  6.27  1.24   
##  3     1  2.13  6.27 -4.15   
##  4     2  8.99  8.32  0.665  
##  5     2 10.2   8.32  1.92   
##  6     2 11.3   8.32  2.97   
##  7     3  7.36 10.4  -3.02   
##  8     3 10.5  10.4   0.130  
##  9     3 10.5  10.4   0.136  
## 10     4 12.4  12.4   0.00763
## # ... with 20 more rows
## # i Use `print(n = ...)` to see more rows

The pred column contains the results of plugging the x values into our model. The resid column contains the residuals, the values of which are equal to y - pred. By scrolling through the resid column, we see that a lot of the residuals are in fact close to 0. We can more easily see this by obtaining a residual plot using the resid column:

ggplot(sim1_w_pred_resids) +
  geom_point(aes(x, resid))

Since we’re interested in the proximity of the residuals to 0, we can also add a horizontal reference line at 0 with geom_hline:

ggplot(sim1_w_pred_resids) +
  geom_point(aes(x, resid)) +
  geom_hline(yintercept = 0)

The clustering of the residuals around 0 is a good sign, but the residual plot also reveals another reason to believe that our model does a good job: The residual plot has no obvious pattern.

To explain this, consider the fact that in any statistical process, there will be an element of randomness that is completely unpredictable. This is called noise. No model should predict noise. However, if there is any real underlying relationship among the variables at play in the statistical process, a good model would detect that relationship. Thus, we want our models to detect real relationships but not noise.

What does this have to do with our residual plot? Well, if our model can explain all of the behavior of our data except the random noise, then the only reason for a difference between actual and predicted values would be that noise. In other words, the residuals would measure only the noise. Since noise is totally random, this means that a good model would have a residual plot that looks totally random, as the one above does.

To contrast this situation, consider the following data set. (It’s not built in, which is why we’re importing it.)

sim1a <- readr::read_csv("https://raw.githubusercontent.com/jafox11/MS282/main/sim1_a.csv")
sim1a
## # A tibble: 10 x 2
##        x     y
##    <dbl> <dbl>
##  1     1  8.34
##  2     2  9.91
##  3     3 10.6 
##  4     4 11.2 
##  5     5 12.5 
##  6     6 14.7 
##  7     7 17.2 
##  8     8 19.0 
##  9     9 19.9 
## 10    10 20.5

Looking at a scatter plot, it seems like a linear model would be appropriate:

ggplot(sim1a) +
  geom_point(aes(x, y))

However, we should make sure that the residuals are random. Let’s build a linear model on sim1a, add the resulting residuals column to sim1a, and then get the residual plot:

sim1a_model <- lm(y ~ x, data = sim1a)

sim1a_w_resid <- sim1a %>%
  add_residuals(sim1a_model)

ggplot(sim1a_w_resid) +
  geom_point(aes(x, resid)) +
  geom_hline(yintercept = 0)

There’s a definite pattern in the residual plot. This means that there’s some relationship between x and y that sim1a_model is not detecting; its predictive power isn’t as good as it could be. However, only one residual value is more than 1 unit away from 0, so despite its shortcomings, our model is still pretty good at predicting y from x.
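If you'd rather check a claim like this than eyeball it, a quick filter on the residual column does it. This is a small sketch using the sim1a_w_resid table from above:

sim1a_w_resid %>%
  filter(abs(resid) > 1)   # rows whose residuals are more than 1 unit from 0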

5.2.1 Exercises

Exercises 1-4 refer to the model that was built in Exercises 5.1.1.

  1. Row 36 of the mpg data set has a displ value of 3.5 and a hwy value of 29. Using only the formula for your model (and not the add_predictions and add_residuals functions) find the hwy value predicted by your model when displ = 3.5 and the corresponding residual.

  2. Add columns to the mpg data set that contain the residuals and predictions relative to your model.

  3. Plot the residuals from the previous problem. Include a horizontal reference line through 0.

  4. Based on your residual plot from the previous problem, would you say that a linear model is appropriate for this data set? Explain your answer.

  5. Which of the following residual plots best indicates that a linear model is an appropriate way to model the data? Explain your answer.

  6. Go to this web site. Copy and paste the 200 records from the table into a csv file and import it into R. Find the linear model that best fits the data. Then obtain the residuals plot. Based on the residuals plot, would you say that a linear model is appropriate for this data? Explain your answer.

5.3 R2 Values

Another way to assess how well a model captures the real relationships among variables is to compute the R2 value of the model. This is a number which roughly measures the percentage of the variation in the data set that can be explained by the model. When R2 = 1, the model can explain all of the variation; when R2 = 0, it can explain none of it.

To dig a little deeper, let’s consider sim1a again:

sim1a
## # A tibble: 10 x 2
##        x     y
##    <dbl> <dbl>
##  1     1  8.34
##  2     2  9.91
##  3     3 10.6 
##  4     4 11.2 
##  5     5 12.5 
##  6     6 14.7 
##  7     7 17.2 
##  8     8 19.0 
##  9     9 19.9 
## 10    10 20.5

There is definitely variation among the y values. One way to measure this variation is by comparing each y value to the mean y value, which is about 14.4. We can add a column to sim1a containing the differences between the y values and 14.4:

sim1a_v2 <- sim1a %>%
  mutate(mean_deviation = y - mean(y))

sim1a_v2
## # A tibble: 10 x 3
##        x     y mean_deviation
##    <dbl> <dbl>          <dbl>
##  1     1  8.34         -6.05 
##  2     2  9.91         -4.48 
##  3     3 10.6          -3.75 
##  4     4 11.2          -3.15 
##  5     5 12.5          -1.85 
##  6     6 14.7           0.329
##  7     7 17.2           2.77 
##  8     8 19.0           4.60 
##  9     9 19.9           5.52 
## 10    10 20.5           6.06

An overall measure of the variation among the y values is the average value of the squares of the mean deviations. (It makes sense to square the mean deviations first since otherwise the positive and negative values might partially cancel each other and hide some of the actual variation.) This is called the sample variance of the y variable.

n <- nrow(sim1a_v2) # This is the number of rows in the data set.

sample_var <- 1/n * sum((sim1a_v2$mean_deviation)^2)

sample_var
## [1] 18.02459

So the sample variance is about 18, and we can consider this to be a measurement of the natural variation of y values.

Now, how much of this variation can sim1a_model explain? To answer this, let’s find the mean squared residual error (MSRE), which is a measurement of the amount by which the y values deviate from those predicted by the model. The MSRE is just the average value of the squares of the residuals incurred by our model. The code below will add a residual column to sim1a and then use it to obtain the MSRE.

sim1a_v3 <- sim1a %>%
  add_residuals(sim1a_model)

MSRE <- 1/n * sum((sim1a_v3$resid)^2)

MSRE
## [1] 0.4662675

This number, about 0.466, is a rough measurement of the average squared residual error incurred by the model. In other words, it’s a measurement of the variation among the y values that the model cannot explain. We could find out the percentage of the natural y variation (as given by the sample variance) that the model cannot explain by computing the ratio of MSRE to sample variance:

MSRE / sample_var
## [1] 0.02586841

This means that our model can explain all but about 2.6% of the variation of y. In other words, it can explain about 97.4% of the y variation. This percentage expressed as a decimal, 0.974, is the R2 value of the model. So we see that in general,
\[\textrm{R}^2 = 1 - \frac{\textrm{MSRE}}{\textrm{sample variance}}.\] This is an involved calculation, but luckily, the summary function built into R computes R2 values for us, along with many other statistics.

summary(sim1a_model)
## 
## Call:
## lm(formula = y ~ x, data = sim1a)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1206 -0.4750  0.1561  0.5620  0.9511 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.36737    0.52153   12.21 1.88e-06 ***
## x            1.45886    0.08405   17.36 1.24e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7634 on 8 degrees of freedom
## Multiple R-squared:  0.9741, Adjusted R-squared:  0.9709 
## F-statistic: 301.3 on 1 and 8 DF,  p-value: 1.237e-07

We just computed “Multiple R-squared” by hand. We’ll explain some of the other statistics from this summary in the next two sections.
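Incidentally, if you only want the R2 value and not the whole printout, you can pull it out of the summary object directly:

summary(sim1a_model)$r.squared   # about 0.9741, matching the summary above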


R2 values provide a quick metric we can use to assess our model, but as is always the case with any kind of assessment, a single number is not enough. For example, the R2 value for sim1a_model is off the charts at 0.974. You’ll probably never see such a high R2 value in real life. However, recall that in the last section, the residual plot for sim1a_model revealed a pattern among the residuals, signaling that there is a relationship between x and y that sim1a_model does not know about. In other words, despite the high R2 value, sim1a_model is not perfect.

Compare this to the situation with sim1_model:

summary(sim1_model)
## 
## Call:
## lm(formula = y ~ x, data = sim1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1469 -1.5197  0.1331  1.4670  4.6516 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.2208     0.8688   4.858 4.09e-05 ***
## x             2.0515     0.1400  14.651 1.17e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.203 on 28 degrees of freedom
## Multiple R-squared:  0.8846, Adjusted R-squared:  0.8805 
## F-statistic: 214.7 on 1 and 28 DF,  p-value: 1.173e-14

Its R2 value comes in a little lower at 0.885, but you could argue that based on the residual plot, sim1_model actually does a better job with its data set than does sim1a_model. In the end, both the R2 value and the residual plot should be used together when assessing model effectiveness. Neither one tells the whole story all by itself.

5.3.1 Exercises

  1. What is the R2 value of your model from Exercises 5.1.1? Explain precisely what this number tells you about your model.

  2. For the data set below, build a linear model with y as the response and state the R2 value. Then look at the residual plot. Explain why this example shows that a near perfect R2 value does not automatically mean that a model is near perfect.

df <- tibble(
  x = c(1, 2, 3, 4, 5, 6, 7, 8, 100),
  y = c(2, 1, 4, 3, 6, 5, 8, 7, 100)
)

5.4 Interpreting Regression Coefficients

Residual plots and R2 values help assess how well a model can explain the variation in the response variable, but we’ll also want to assess the choice of predictor variables themselves. If a predictor variable doesn’t affect the response variable very much, then it should probably be discarded from our model. For example, you would never try to model the Dow Jones Industrial Average by using the daily high temperature in your home town as a predictor. Predictor selection is more of an issue in multiple regression as we’ll see in Section 5.7, but we can lay out the basic idea in the context of simple regression.

Recall the model sim1_model that we built on the data set sim1 in Section 5.1: y = 2.052(x) + 4.221. What kind of impact does the predictor x have on the response y? The slope of our regression line is 2.052, meaning that every time x increases by 1 unit, y increases by 2.052 units. In other words, the coefficient of the predictor variable measures the change in the response as the result of increasing the predictor by 1.


Even with this interpretation in place, it’s still not clear whether a change of 2.052 is significant for y. If not, then x would have very little predictive power as a predictor of y.

You can imagine a situation in which a change of 2.052 in a response variable would barely even register; for example, a change of $2.05 in the U.S. national debt would be completely insignificant. Is this the case for y in sim1? Scrolling through sim1, it would appear that a change of 2.052 is fairly significant, and we can verify this statistically as follows.

Suppose a skeptic tells us that x is not a good predictor of y. This would mean that the true value of the coefficient of x in our model should be 0. We respond with the rebuttal that our data suggests the coefficient is actually 2.052, not 0. How could the skeptic answer our rebuttal? If he’s stubborn, he might say, “Well, your data set is the outcome of a statistical process, and all statistical processes include some random noise. I think your 2.052 is just a side-effect of that noise. It just so happens that the data you collected is really noisy and not representative of the true relationship between x and y.”

How might you try to prove the skeptic wrong? After all, it’s possible that he’s right. One way would be to find the probability that, given the skeptic’s claim that the coefficient is 0, you would randomly get a coefficient of 2.052. If you can show that the probability of randomly getting 2.052 is very low, then the skeptic is probably wrong. You could then safely claim that x is indeed a good predictor of y.

The first step in obtaining this probability is to compute the t-statistic of the coefficient. This is a measurement of the distance from your coefficient (2.052) to the skeptic’s (0) relative to the standard error of the coefficient. Standard error is a measure of the variability of the regression coefficient; if you were to collect several data sets from the population and find the regression coefficient each time, the standard error would measure how spread out your calculations are. (It’s like the standard deviation of a variable.) The t-statistic is then defined as: \[\textrm{t} = \frac{\textrm{coefficient}}{\textrm{standard error}}.\] If the absolute value of t is big, then your coefficient is several standard errors away from 0. The larger the t-statistic, the more unlikely it is that the true coefficient value is 0.

But how unlikely? Using something called a t-distribution, which is a description of the distribution of possible t-statistics, one can compute the probability that the absolute value of your t-statistic takes on the value that it does (or a greater value), if in fact, the skeptic is correct. This probability is called the p-value of your coefficient.
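To make this less mysterious, here is a rough sketch of how such a probability can be computed in R with the pt function, which gives cumulative probabilities for the t-distribution. The 28 degrees of freedom come from the model summary shown below; treat this as an illustration rather than a calculation you'll need to do by hand:

# probability of a t-statistic at least 14.651 in absolute value
# when the true coefficient is 0, using 28 degrees of freedom
2 * pt(14.651, df = 28, lower.tail = FALSE)   # roughly 1.17e-14, matching the summary below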

Luckily, the standard error, t-statistic, and p-value are all contained in the summary of your model:

summary(sim1_model)
## 
## Call:
## lm(formula = y ~ x, data = sim1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1469 -1.5197  0.1331  1.4670  4.6516 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.2208     0.8688   4.858 4.09e-05 ***
## x             2.0515     0.1400  14.651 1.17e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.203 on 28 degrees of freedom
## Multiple R-squared:  0.8846, Adjusted R-squared:  0.8805 
## F-statistic: 214.7 on 1 and 28 DF,  p-value: 1.173e-14

We can see that the standard error is 0.14, so that the t-statistic is 2.052/0.14 = 14.651. This means that 2.052 is more than 14 standard errors away from the skeptic’s claim of 0. This is a long way. The “Pr(>|t|)” column in the summary above tells us the probability of randomly observing a t-statistic of 14.651 or greater in absolute value; in other words, this is the p-value. We can see that it’s only “1.17e-14.” This is scientific notation, and it’s short for \(1.17\times 10^{-14}\), which is 0.0000000000000117. (There are 13 leading zeros after the decimal point.) So if the skeptic were right, the chance that random noise in the data would produce a coefficient as far from 0 as 2.052 is only about 0.00000000000117% (11 leading zeros). In general, the standard rule of thumb is to reject a skeptic’s claim that the coefficient is 0 whenever the p-value is less than 0.05. It’s therefore pretty safe to reject his claim in this case.

Our argument to refute the hypothesis that the coefficient of x is 0 was based on the following information:

predictor   coefficient   std. error   t-statistic   p-value
x           2.052         0.14         14.651        1.17e-14

To summarize, this says that our regression model for the response variable y estimates a coefficient of 2.052 for the predictor x. The standard error of this estimation is 0.14, and the t-statistic of our coefficient is 14.651, meaning that 2.052 is 14.651 standard errors away from 0. Our p-value tells us that if the coefficient of x really is 0, then the probability of obtaining 2.052 (or larger) as the result of random noise is only 1.17e-14. Since this p-value is below the threshold of 0.05, we can reject the hypothesis that the coefficient is 0 and accept our 2.052 instead. This means that we can confidently say that an increase of 1 in x brings about an increase of about 2.052 in y, and that this indicates a significant impact of x on y.
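By the way, this table of numbers is available on its own as the coefficient matrix stored inside the model summary, in case you ever want it without the rest of the printout:

summary(sim1_model)$coefficients   # one row per term: estimate, std. error, t value, p-value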


Consider, for example, another data set (entered manually below) that records the number of hours that students in a class spent studying for an exam and the exam grades they received.

exam_grade <- tibble(
  hours = c(1, 3, 2.5, 1, 1, 4, 3.5, 2, 2, 3.5, 1.5, 1, 0.5, 3, 2, 2.5, 3, 1, 2.5, 2),
  grade = c(80, 75, 60, 75, 90, 80, 65, 70, 75, 80, 90, 55, 75, 80, 75, 90, 95, 85, 75, 80)
)

exam_grade
## # A tibble: 20 x 2
##    hours grade
##    <dbl> <dbl>
##  1   1      80
##  2   3      75
##  3   2.5    60
##  4   1      75
##  5   1      90
##  6   4      80
##  7   3.5    65
##  8   2      70
##  9   2      75
## 10   3.5    80
## 11   1.5    90
## 12   1      55
## 13   0.5    75
## 14   3      80
## 15   2      75
## 16   2.5    90
## 17   3      95
## 18   1      85
## 19   2.5    75
## 20   2      80

Is hours a good predictor of grade? Let’s look at the scatter plot:

ggplot(exam_grade) +
  geom_point(aes(hours, grade))

There doesn’t seem to be a strong linear relationship. Let’s build a model and look at the coefficient statistics.

exam_model <- lm(grade ~ hours, data = exam_grade)

summary(exam_model)
## 
## Call:
## lm(formula = grade ~ hours, data = exam_grade)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.2830  -2.5965  -0.0241   3.9670  17.3312 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  77.0900     5.4937  14.032 3.91e-11 ***
## hours         0.1929     2.3452   0.082    0.935    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.34 on 18 degrees of freedom
## Multiple R-squared:  0.0003758,  Adjusted R-squared:  -0.05516 
## F-statistic: 0.006767 on 1 and 18 DF,  p-value: 0.9353

Extracting out the relevant numbers from this summary, we have:

predictor   coefficient   std. error   t-statistic   p-value
hours       0.1929        2.3452       0.082         0.935

The coefficient of hours is 0.1929, which seems close to 0. This would be our first clue that hours is not a good predictor of grade. However, we have to compare this coefficient to the standard error, 2.3452. The coefficient is much smaller than the standard error, and this gives us a very small t-statistic of 0.082. This means that our coefficient is only about 8% of a standard error away from 0. The probability of this happening as a result of random noise would seem pretty high. Indeed, the p-value of 0.935 confirms this: if the true coefficient were 0, the probability of getting an estimate at least this far from 0 just from random noise would be about 93.5%. This is nowhere near the 5% threshold, so we cannot confidently reject the hypothesis that the coefficient is actually 0.

Our conclusion is that hours is not a good predictor of grade, just as we guessed based on the scatter plot. From now on, we’ll add p-values to the list of assessments of linear models, along with residual plots and R2 values.


When using statistics like the ones covered in this section to assess a model, it’s extremely important to use precise language. Below is how you should phrase the interpretations of your modeling statistics.

Suppose we’re modeling a response y by a predictor x, and our statistics are as follows:

predictor   coefficient   std. error   t-statistic   p-value
x           m             s            t             p

  • To say that the coefficient is m means, “Every 1-unit increase in x leads to a change of m units in y.”
  • To say that the standard error is s means, “We would expect an average deviation of s units when estimating the value of the coefficient.”
  • To say that the t-statistic is t means, “The value of our coefficient is t standard errors away from 0.”
  • To say that the p-value is p means, “Assuming the true value of our coefficient is 0, the probability of getting a coefficient whose absolute t-statistic is |t| or greater is p.” If p < 0.05, we can claim that x has a significant impact on y. Otherwise, we cannot make that claim.

5.4.1 Exercises

  1. State the coefficient, standard error, t-statistic, and p-value for the mpg model from Exercises 5.1.1.

  2. Use the wording at the end of this section to state precisely what your numbers from the previous exercise mean.

  3. Does displ have a significant impact on hwy? Explain your answer.

  4. Using the tibble or tribble functions, manually enter a 10-observation data set with columns w, x, y, and z subject to the following conditions. Then show that these conditions are satisfied.

    1. The coefficient of w as a predictor of the response z has a very large positive t-statistic.
    2. The coefficient of x as a predictor of the response z has a very large negative t-statistic.
    3. The coefficient of y as a predictor of the response z has a t-statistic which is very close to 0.

5.5 Categorical Predictors

It’s often the case that a response variable we want to model could have categorical predictors. For example, we might want to know whether a person’s gender has an impact on their salary, whether a car’s class has an impact on its fuel efficiency, or whether a diamond’s color has an impact on its price.

It’s possible to use categorical predictors in a linear regression model as we’ll see below. Consider the following data set:

sim2
## # A tibble: 40 x 2
##    x          y
##    <chr>  <dbl>
##  1 a      1.94 
##  2 a      1.18 
##  3 a      1.24 
##  4 a      2.62 
##  5 a      1.11 
##  6 a      0.866
##  7 a     -0.910
##  8 a      0.721
##  9 a      0.687
## 10 a      2.07 
## 11 b      8.07 
## 12 b      7.36 
## 13 b      7.95 
## 14 b      7.75 
## 15 b      8.44 
## 16 b     10.8  
## 17 b      8.05 
## 18 b      8.58 
## 19 b      8.12 
## 20 b      6.09 
## 21 c      6.86 
## 22 c      5.76 
## 23 c      5.79 
## 24 c      6.02 
## 25 c      6.03 
## 26 c      6.55 
## 27 c      3.73 
## 28 c      8.68 
## 29 c      5.64 
## 30 c      6.21 
## 31 d      3.07 
## 32 d      1.33 
## 33 d      3.11 
## 34 d      1.75 
## 35 d      0.822
## 36 d      1.02 
## 37 d      3.07 
## 38 d      2.13 
## 39 d      2.49 
## 40 d      0.301

We can use a quick visualization to see whether x seems to have an effect on y. However, a scatter plot is not appropriate since x is categorical. Recall that a box plot is the best way to visualize a relationship between a categorical and a continuous variable.

ggplot(sim2) +
  geom_boxplot(aes(x, y))

The box plot does seem to indicate that x might be a good predictor of y. To build a linear regression model with x as the predictor variable and y as the response, we would need a way to encode the values of x numerically so that we can perform the necessary computations with them.

The predictor variable x has four values: a, b, c, and d. One way to encode these values would be to just assign each one a number: a = 1, b = 2, c = 3, and d = 4. However, assigning numbers in this way imposes a ranking on the categorical values: a is less than b, which is less than c, which is less than d. This is a problem because there is no reason to assume there is any relationship among the values at all. We would also be in the strange situation where the computer would treat c as the average of b and d since 3 is the average of 2 and 4. This type of value encoding could misrepresent the data and lead to poorly performing models.

A better way to encode the values numerically is to do so in a way that doesn’t impose any kind of order. To see how R does this, let’s build a linear model on sim2 and look at the coefficients.

sim2_model <- lm(y ~ x, data = sim2)

sim2_model
## 
## Call:
## lm(formula = y ~ x, data = sim2)
## 
## Coefficients:
## (Intercept)           xb           xc           xd  
##      1.1522       6.9639       4.9750       0.7588

As you can see, the predictor x is expanded into xb, xc, and xd. The numbers listed under these variables are their coefficients in the model’s formula. The number listed under “Intercept” is the constant term (like the y-intercept in our models above). This information is thus telling us that the algebraic form of our model is:

y = 1.1522 + 6.9639(xb) + 4.9750(xc) + 0.7588(xd).

How do we use this formula? Suppose we want to know the predicted value of y when x = c. Then we can let xb = 0, xc = 1, and xd = 0, getting y = 1.1522 + 4.9750 = 6.1272. Since there is no xa variable, to make a prediction when x = a, we’d use 0 for each of xb, xc, and xd, getting y = 1.1522.
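If you're curious how R encodes these values behind the scenes, base R's model.matrix function shows the 0/1 indicator columns that lm actually works with. This is just a peek; you won't need it to build models:

# each row of sim2 gets an intercept column plus 0/1 indicators xb, xc, xd;
# a value of "a" is encoded by setting xb, xc, and xd all to 0
head(model.matrix(y ~ x, data = sim2))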

We can get a good sense of what our model is doing by returning to the box plot above and adding markers at the y values predicted by our model for each x. We’ll first add our model’s predictions to sim2 using the add_predictions function:

sim2_w_pred <- sim2 %>%
  add_predictions(sim2_model)

sim2_w_pred
## # A tibble: 40 x 3
##    x          y  pred
##    <chr>  <dbl> <dbl>
##  1 a      1.94   1.15
##  2 a      1.18   1.15
##  3 a      1.24   1.15
##  4 a      2.62   1.15
##  5 a      1.11   1.15
##  6 a      0.866  1.15
##  7 a     -0.910  1.15
##  8 a      0.721  1.15
##  9 a      0.687  1.15
## 10 a      2.07   1.15
## # ... with 30 more rows
## # i Use `print(n = ...)` to see more rows

Now let’s recreate the box plot and add big red dots above each x value at the y values predicted by sim2_model:

ggplot(sim2_w_pred) +
  geom_boxplot(aes(x, y)) +
  geom_point(aes(x, pred), color = "red", size = 3)

It turns out that the predicted y values are just the mean y values for each value of x. (Recall that the middle line in a box is the median of the y values for that x value, not the mean, which is why the dots and the middle lines don’t coincide exactly.)
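You can verify this claim directly by computing the mean y value within each x group; here's a quick sketch using group_by and summarize from dplyr (mean_y is just a label chosen here):

sim2 %>%
  group_by(x) %>%
  summarize(mean_y = mean(y))   # these group means match the model's predictions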


Assessing a model with a categorical predictor is carried out the same way as in Sections 5.2, 5.3, and 5.4. Let’s look at the residual plot, which will have to be a box plot since x is categorical.

sim2_w_resid <- sim2 %>%
  add_residuals(sim2_model)

ggplot(sim2_w_resid) +
  geom_boxplot(aes(x, resid)) +
  geom_hline(yintercept = 0)

No pattern stands out, meaning that, at least at a quick glance, there doesn’t seem to be any behavior that our model is missing. The residuals are also fairly close to 0, with the exception of those for x = d, which are more spread out than the others.

We can extract the R2 value, t-statistics, and p-values from the model summary:

summary(sim2_model)
## 
## Call:
## lm(formula = y ~ x, data = sim2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.40131 -0.43996 -0.05776  0.49066  2.63938 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.1522     0.3475   3.316  0.00209 ** 
## xb            6.9639     0.4914  14.171 2.68e-16 ***
## xc            4.9750     0.4914  10.124 4.47e-12 ***
## xd            0.7588     0.4914   1.544  0.13131    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.099 on 36 degrees of freedom
## Multiple R-squared:  0.8852, Adjusted R-squared:  0.8756 
## F-statistic: 92.52 on 3 and 36 DF,  p-value: < 2.2e-16

Our R2 value is very high at 0.8852. The coefficients for xb and xc seem far from 0, and this is confirmed by their large t-statistics, which in turn produce small p-values. Since these p-values are well below the 0.05 cutoff, we can safely say that knowing that x = b or x = c provides good predictive power in predicting y.

What about xd? It has a somewhat small coefficient value, and this leads to a t-statistic of only 1.544, meaning that the coefficient value 0.7588 is only about 1.5 standard errors away from 0. The p-value tells us there’s about a 13% chance of this occurring randomly, and that’s too high a probability to reject the hypothesis that xd doesn’t affect y. (It’s above the standard 5% cutoff.) Therefore, knowing that x = d does not help us make a confident prediction of the value of y.

Finally, what can we say if x = a? Recall that this means we’d let xb, xc, and xd all equal 0, so to assess the impact of letting x equal a, we would examine the “Intercept” statistics. We see that the p-value for the intercept is 0.00209, which falls below the 0.05 threshold. We can thus say that knowing x = a will indeed allow us to make a confident y prediction.


An important special case of simple regression with a categorical predictor occurs when the predictor is a binary variable, meaning that it only has two values. In this case, a regression model can be used to determine whether the difference in response values for the two predictor values is statistically significant.

Consider the following data set:

binary_pred <- tibble(
  x = c("a", "a", "a", "a", "a", "b", "b", "b", "b"),
  y = c(2.1, 1.9, 2.3, 2.1, 2.4, 8.6, 8.8, 8.3, 9.0)
)

binary_pred
## # A tibble: 9 x 2
##   x         y
##   <chr> <dbl>
## 1 a       2.1
## 2 a       1.9
## 3 a       2.3
## 4 a       2.1
## 5 a       2.4
## 6 b       8.6
## 7 b       8.8
## 8 b       8.3
## 9 b       9

Are the values of y when x = a significantly different from those when x = b? We can build a model to check:

binary_model <- lm(y ~ x, data = binary_pred)

coef(binary_model)
## (Intercept)          xb 
##       2.160       6.515

The formula for our model is y = 2.16 + 6.515(xb). Recall that xb is the new variable that encodes the value of x. When x = b, xb = 1, and when x = a, xb = 0.

We can thus see that when x = a, the model predicts that y is 2.16 + 6.515(0) = 2.16. This is just the mean value of y when x = a. When x = b, the model predicts that y is 2.16 + 6.515(1) = 8.675, which is just the mean value of y when x = b. The coefficient 6.515 of xb is therefore the difference between these mean values.
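As a quick check, we can compute the two group means and their difference directly from the data and compare them with the model's coefficients (mean_a and mean_b are just labels chosen here):

mean_a <- mean(binary_pred$y[binary_pred$x == "a"])   # 2.16
mean_b <- mean(binary_pred$y[binary_pred$x == "b"])   # 8.675
mean_b - mean_a                                       # 6.515, the coefficient of xb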

To see whether the difference between the mean values is significant, we should look at the summary statistics for our model:

summary(binary_model)
## 
## Call:
## lm(formula = y ~ x, data = binary_pred)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.375 -0.075 -0.060  0.140  0.325 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1600     0.1095   19.73 2.15e-07 ***
## xb            6.5150     0.1642   39.67 1.68e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2448 on 7 degrees of freedom
## Multiple R-squared:  0.9956, Adjusted R-squared:  0.9949 
## F-statistic:  1574 on 1 and 7 DF,  p-value: 1.684e-09

First, notice that the R2 value tells us that x can explain 99.56% of the variation in y. Further, the coefficient of xb (i.e., the difference in means) is much larger than the standard error, which gives us a very big t-statistic and hence a very small p-value. We can therefore reject the hypothesis that there is no significant difference in means and instead accept the hypothesis that there is a significant difference in means.

Models of this type are useful in comparing data in populations that can be divided into two groups, such as men and women, white and non-white, etc.

5.5.1 Exercises

  1. Build a model using mpg with hwy as the response and class as the predictor. Write down the actual algebraic formula for your model.

  2. Using the model formula from the previous exercise, predict the highway gas mileages of the following classes (without using the add_predictions function).

    1. suv
    2. subcompact
    3. 2seater
  3. Use the summary statistics for your model from Exercise 1 to assess how well your model performs.

  4. Click here to access a csv file containing average life expectancies of people from a selection of European countries in 2020, obtained from the World Bank web site.

    1. Download this file and import it into R.
    2. Use a linear model to determine whether there is a significant difference in average life expectancies in Northern European versus Western European countries. Explain your answer using your model’s statistics.

5.6 Exponential Regression

In Section 5.2, we noticed that the residual plot for the model sim1a_model revealed a trend, meaning that the model was not able to explain a real feature of the relationship between x and y. In this section, we’ll see one way to address problems like this, particularly when the scatter plot indicates exponential, rather than linear, growth or decay.

Download the data set found here, import it into R, and name it trout_data. It contains lengths and weights of rainbow trout found in the Spokane River in eastern Washington in 1999. (You can read more about this data set here.)

trout_data
## # A tibble: 42 x 2
##    `length (mm)` `weight (g)`
##            <dbl>        <dbl>
##  1           457          855
##  2           405          715
##  3           455          975
##  4           460          895
##  5           335          472
##  6           365          540
##  7           390          660
##  8           368          581
##  9           385          609
## 10           360          557
## # ... with 32 more rows
## # i Use `print(n = ...)` to see more rows

Let’s change the names of the columns to make them easier to work with:

trout_data_v2 <- trout_data %>%
  rename("length" = `length (mm)`,
         "weight" = `weight (g)`)

trout_data_v2
## # A tibble: 42 x 2
##    length weight
##     <dbl>  <dbl>
##  1    457    855
##  2    405    715
##  3    455    975
##  4    460    895
##  5    335    472
##  6    365    540
##  7    390    660
##  8    368    581
##  9    385    609
## 10    360    557
## # ... with 32 more rows
## # i Use `print(n = ...)` to see more rows

We’d like to model weight as a function of length, so we’ll start with a scatter plot:

ggplot(trout_data_v2) +
  geom_point(aes(length, weight))

It looks fairly linear, but we should check the residual plot:

trout_model <- lm(weight ~ length, data = trout_data_v2)

trout_data_v2_w_resid <- trout_data_v2 %>%
  add_residuals(trout_model)

ggplot(trout_data_v2_w_resid) +
  geom_point(aes(length, resid)) +
  geom_hline(yintercept = 0)

Notice that the residuals have large positive values on the left and right and large negative values in the middle. This means our model underestimates the weights of short and long trout and overestimates the weights of medium length trout. What should we do?


When we see a residual plot like the one above, it’s a sign that the relationship between our variables is exponential, not linear. To explain the difference, consider the following two sample data sets:

linear_data <- tibble(
  x = c(-2, -1, 0, 1, 2, 3, 4, 5, 6),
  y = c(1, 3, 5, 7, 9, 11, 13, 15, 17)
)

linear_data
## # A tibble: 9 x 2
##       x     y
##   <dbl> <dbl>
## 1    -2     1
## 2    -1     3
## 3     0     5
## 4     1     7
## 5     2     9
## 6     3    11
## 7     4    13
## 8     5    15
## 9     6    17

and

exp_data <- tibble(
  x = c(-2, -1, 0, 1, 2, 3, 4, 5, 6),
  y = c(1, 1.25, 1.56, 1.95, 2.44, 3.05, 3.81, 4.77, 5.96)
)

exp_data
## # A tibble: 9 x 2
##       x     y
##   <dbl> <dbl>
## 1    -2  1   
## 2    -1  1.25
## 3     0  1.56
## 4     1  1.95
## 5     2  2.44
## 6     3  3.05
## 7     4  3.81
## 8     5  4.77
## 9     6  5.96

You can probably spot the trend right away in linear_data. Every time x increases by 1, y increases by 2. A constant change in y with every 1-unit change in x is the defining feature of linear growth. The change in y is just the slope of the line that the points lie on.

The pattern in exp_data is harder to spot, but we can at least see that it’s not linear. The change in y with each 1-unit change in x is not constant. Instead of looking at the difference between each y value and the one preceding it, consider instead the ratio of each y value to its predecessor: 1.25/1.00 = 1.25, 1.56/1.25 = 1.25, 1.95/1.56 = 1.25, etc. Every time we increase x by 1, y increases by a multiplicative factor of 1.25. In other words, y increases by 25%. When increasing x by 1 leads to a constant percentage change in y, we say y grows exponentially. (There is also such a thing as exponential decay, which occurs when a variable decreases by a constant percentage. This will be explored in the exercises.)
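Rather than computing these consecutive ratios by hand, you can let R do it. Here's a quick sketch using dplyr's lag function, which shifts a column down by one position (the ratio column name is just a label chosen here):

exp_data %>%
  mutate(ratio = y / lag(y))   # each y divided by the previous y; roughly constant at 1.25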

If we build a linear model on each of these data sets and overlay the regression lines, we get:

The fit for linear_data is perfect, but for exp_data, the model underestimates on the ends and overestimates in the middle. The residual plot for exp_data confirms this:

This is the same residual pattern we saw with the trout data above, and the reason is that trout_data_v2, like exp_data, is (roughly) exponential.


Now that we know trout_data_v2 is exponential, we know that each increase of 1 millimeter in length corresponds to a constant percentage increase in weight. The questions that still remain are: (1) What is this constant percentage increase? and (2) How do we model this nonlinear behavior with a linear model? The idea is to “straighten out” the data using a logarithm. Before proceeding, let’s review some relevant facts about logarithms.

Recall that the logarithm base \(b\) of a number \(a\) is defined to be the exponent to which \(b\) must be raised to get \(a\). We denote this exponent by \(\log_b(a)\). For example, \(\log_2(8)=3\) since \(2^3=8\). You can compute logarithms directly in R:

log(8, base = 2)
## [1] 3

In many applications, a helpful base to use in a logarithm is the number \(e\), which is a non-terminating, non-repeating decimal equal to about 2.71828. (In R, a more accurate approximation of \(e\) is given by entering exp(1).) A logarithm with \(e\) as its base is called a natural logarithm. Many mathematics textbooks denote the natural logarithm of a number \(a\) by \(\ln(a)\) rather than \(\log_e(a)\). In R, the natural logarithm is the default logarithm, meaning it’s the one used when no base is specified. The natural logarithm of 2 is thus

log(2)
## [1] 0.6931472

We can check this by raising \(e\) to the power of 0.6931472. (Remember that we can’t expect exact answers when dealing with the truncated value of \(e\).)

2.71828^0.6931472
## [1] 1.999999

We’ll need an important algebraic property of logarithms before returning to our modeling problem. For any base \(b\): \[\log_b(M) - \log_b(N) = \log_b\left(\frac{M}{N}\right).\] Now let’s look at our sample exponential data set, exp_data, again. Recall that when x increases by 1, the ratio of a y value to the previous one is 1.25. For example, when x increases from 2 to 3, y increases from 2.44 to 3.05. Notice that \(\frac{3.05}{2.44} = 1.25\).

Let’s look at what happens to the logarithm of y when x increases from 2 to 3. We’ll just use the natural logarithm here, since the base doesn’t matter. When x increases from 2 to 3, the logarithm of y increases from \(\ln(2.44)\) to \(\ln(3.05)\). This is a difference of \(\ln(3.05) - \ln(2.44)\). By the property above, this difference is equal to \(\ln(\frac{3.05}{2.44})\). But we know \(\frac{3.05}{2.44} = 1.25\), so that means the logarithm of y increases by \(\ln(1.25)\) when x increases from 2 to 3.

If we look at another consecutive pair of x values, say 0 and 1, we get the same result: The logarithm of y increases from \(\ln(1.56)\) to \(\ln(1.95)\). This is a difference of \(\ln(1.95)-\ln(1.56)\), which equals \(\ln(\frac{1.95}{1.56})\), which in turn equals \(\ln(1.25)\) again. This is not a coincidence, of course. We know that when x increases by 1, the ratio of corresponding y values is 1.25. Thus, whenever x increases by 1, the logarithm of y increases by \(\ln(1.25)\), or about 0.223.
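A quick check in R confirms that these quantities agree:

log(3.05) - log(2.44)   # increase in log(y) as x goes from 2 to 3
log(1.25)               # both are about 0.223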

In other words, and here’s the main point, if y increases exponentially as a function of x, then log(y) increases linearly as a function of x. We can see this by mutating log(y) onto the exp_data table:

exp_data_w_log <- exp_data %>%
  mutate(log(y))

exp_data_w_log
## # A tibble: 9 x 3
##       x     y `log(y)`
##   <dbl> <dbl>    <dbl>
## 1    -2  1       0    
## 2    -1  1.25    0.223
## 3     0  1.56    0.445
## 4     1  1.95    0.668
## 5     2  2.44    0.892
## 6     3  3.05    1.12 
## 7     4  3.81    1.34 
## 8     5  4.77    1.56 
## 9     6  5.96    1.79

Notice the constant slope of about 0.223 in the log(y) column. We can also see it in a scatter plot of log(y) vs. x:

ggplot(exp_data_w_log) +
  geom_point(aes(x, log(y)))

This explains the above comment that logarithms “straighten out” exponential growth. Since we have a linear relationship between x and log(y), it’s appropriate to build a linear model using these two variables. (Don’t be surprised by the extremely high R2 value and extremely low p-value; since exp_data was constructed to be perfectly exponential, the straightened out version will be perfectly linear.)

log_exp_model <- lm(log(y) ~ x, data = exp_data_w_log)

summary(log_exp_model)
## 
## Call:
## lm(formula = log(y) ~ x, data = exp_data_w_log)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -9.685e-04 -7.231e-04  1.533e-05  7.153e-04  8.091e-04 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.456e-01  3.259e-04    1367   <2e-16 ***
## x           2.232e-01  9.978e-05    2237   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0007729 on 7 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 5.003e+06 on 1 and 7 DF,  p-value: < 2.2e-16

The p-value reported as “< 2e-16” is just the smallest value summary will display; it should be understood to be essentially 0.

The coefficient of x is 0.223, which we also noted above. How should we interpret this? It means that when x increases by 1, log(y) increases by 0.223, but what does that actually mean? Remember that 0.223 is the (approximate) value of \(\ln(1.25)\), and the significance of 1.25 in exp_data is that it’s the ratio of consecutive y values. As mentioned above, this means y increases by 25% each time x increases by 1. Notice that our regression coefficient of 0.223 is close to our percentage increase (converted to a decimal) of 0.25. This is not a coincidence, either. It’s the result of another property of logarithms:

When \(a\) is close to 1, \(\ln(a)\) is approximately equal to \(a-1\). (If you have a calculus background, you might recognize this as a statement of the fact that \(y = x-1\) is the tangent line to \(y=\ln(x)\) at \(x=1\).)

Thus, \(\ln(1.25)\approx 1.25-1=0.25\). The regression coefficient of x in log_exp_model should thus be seen as an approximation of the constant percentage increase on display in our exponential growth.

Here’s a summary of what we’ve just seen:

  • When a residual plot has large positive residuals at the ends and large negative ones in the middle (or vice versa), your data may be exponential, meaning the response variable changes by a constant percentage when the predictor increases by 1.
  • Replacing the response variable by its logarithm gives a linear relationship with the predictor.
  • The regression coefficient of the predictor is an approximation of the constant percentage change in the response variable.

Let’s finally revisit the trout example. The residual plot at the beginning of this section signaled that the relationship between length and weight is roughly exponential. If we build a model with log(weight) as the response and length as the predictor, we get:

log_trout_model <- lm(log(weight) ~ length, data = trout_data_v2)

summary(log_trout_model)
## 
## Call:
## lm(formula = log(weight) ~ length, data = trout_data_v2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.16294 -0.06688 -0.01063  0.05245  0.18476 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.4323875  0.0893131   38.43   <2e-16 ***
## length      0.0076185  0.0002499   30.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09124 on 40 degrees of freedom
## Multiple R-squared:  0.9587, Adjusted R-squared:  0.9577 
## F-statistic: 929.3 on 1 and 40 DF,  p-value: < 2.2e-16

We have a very high R2 value and a very low p-value, which means we can be confident that our regression coefficient of 0.0076185 is meaningful. (This may seem surprising since the coefficient looks close to 0, but because the standard error is so small, the coefficient has a very large t-statistic and thus a very small p-value.)

The interpretation of this coefficient’s value is that it’s an approximation of the constant percentage change in weight that occurs when length increases by one. We can thus confidently claim that with each extra millimeter of length, a trout’s weight increases by about 0.76%.
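One practical note: since the response in log_trout_model is log(weight), its predictions are on the logarithm scale, so you have to exponentiate to get back to grams. Here's a sketch of predicting the weight of a hypothetical 400 mm trout (the specific length is just for illustration):

# predict() returns log(weight); exp() converts the prediction back to grams
exp(predict(log_trout_model, newdata = tibble(length = 400)))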

5.6.1 Exercises

These exercises refer to the data set found here. It contains data that shows the discharge (measured in cubic feet per second) in the Raging River in western Washington for several hours after a heavy rainstorm that occurred one night at midnight. You can read more about the data set here. Download it and import it into R.

  1. Obtain a scatter plot of discharge vs. hours. Explain why an exponential decay model seems better than a linear one.
  2. Build a linear model on the data set with discharge as the response. Then obtain a residual plot.
  3. How does your residual plot confirm your answer to Exercise 1?
  4. Add a column to your data set containing the natural logarithm of discharge.
  5. Obtain a scatter plot of log(discharge) vs. hours. What do you notice about it?
  6. Build a new linear model with log(discharge) as the response and hours as the predictor. Use the summary statistics to assess your model.
  7. What is the regression coefficient of your model from the previous exercise? What is its interpretation in the context of the real-life situation?

5.7 Multiple Regression

In most real-life modeling problems, a response variable is a function of several predictor variables. A linear model that makes use of more than one predictor is known as a multiple regression model.

We’ll use the following artificial data set to see how to build and interpret the meaning of a multiple regression model in R.

sim_mult <- tibble(
  x = c(2.1, -3.0, 1.6, 4.2, 3.3, 2.7, 3.4, 4.9, -1.9, -2.2),
  y = c(10.2, 12.2, 9.8, 11.1, 10.5, 13.1, 9.5, 10.0, 11.3, 12.7),
  z = c(6.5, 11.1, 8.1, 4.4, 5.2, 5.7, 5.4, 3.9, 9.9, 10.6)
)

sim_mult
## # A tibble: 10 x 3
##        x     y     z
##    <dbl> <dbl> <dbl>
##  1   2.1  10.2   6.5
##  2  -3    12.2  11.1
##  3   1.6   9.8   8.1
##  4   4.2  11.1   4.4
##  5   3.3  10.5   5.2
##  6   2.7  13.1   5.7
##  7   3.4   9.5   5.4
##  8   4.9  10     3.9
##  9  -1.9  11.3   9.9
## 10  -2.2  12.7  10.6

In the simple regression case, our modeling process always starts with a scatter plot or box plot to see whether there is a visual relationship between the predictor and response. In multiple regression, unfortunately, this is not an option. With simple regression, we’d like to see whether the data points all lie close to a common line. For sim_mult, we’d be trying to check whether all of the points lie close to a common plane, which is hard to visualize. If there are more than two predictors, we’d be dealing with a scatter plot that can’t even be visualized in three dimensions.

If we want a model with z as the response and both x and y as the predictors, we will just go straight to the model calculation. We’ll still use the lm function but with x + y to the right of ~:

sim_mult_model <- lm(z ~ x + y, data = sim_mult)

Let’s look just at the coefficients before we ask for the entire summary:

coef(sim_mult_model)
## (Intercept)           x           y 
##  10.6179955  -0.9663994  -0.1882910

This is telling us that the formula for the model is z = 10.6179955 - 0.9663994(x) - 0.1882910(y). The coefficient of a predictor measures the impact of that predictor on the response while the other predictor is held constant. For our model, this means that when x increases by 1 with y held constant, z decreases by about 0.97. When y increases by 1 with x held constant, z decreases by about 0.19.

As with simple regression, we can find residuals of multiple regression models as a means of assessing the model. For example, when x = 2.1 and y = 10.2, our model predicts that z is 10.6179955 - 0.9663994(2.1) - 0.1882910(10.2) = 6.67. The actual z value is 6.5, which means we have a residual of 6.5 - 6.67 = -0.17. Again, modelr can compute predictions and residuals for us:

sim_mult_w_pred_resid <- sim_mult %>%
  add_predictions(sim_mult_model) %>%
  add_residuals(sim_mult_model)

sim_mult_w_pred_resid
## # A tibble: 10 x 5
##        x     y     z  pred   resid
##    <dbl> <dbl> <dbl> <dbl>   <dbl>
##  1   2.1  10.2   6.5  6.67 -0.168 
##  2  -3    12.2  11.1 11.2  -0.120 
##  3   1.6   9.8   8.1  7.23  0.873 
##  4   4.2  11.1   4.4  4.47 -0.0691
##  5   3.3  10.5   5.2  5.45 -0.252 
##  6   2.7  13.1   5.7  5.54  0.158 
##  7   3.4   9.5   5.4  5.54 -0.143 
##  8   4.9  10     3.9  4.00 -0.0997
##  9  -1.9  11.3   9.9 10.3  -0.426 
## 10  -2.2  12.7  10.6 10.4   0.247
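
As a quick sanity check, base R’s predict function reproduces the first fitted value from the table:

# should return roughly 6.67, matching the pred value in row 1
predict(sim_mult_model, newdata = tibble(x = 2.1, y = 10.2))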

Let’s look at the summary statistics:

summary(sim_mult_model)
## 
## Call:
## lm(formula = z ~ x + y, data = sim_mult)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4265 -0.1619 -0.1099  0.1012  0.8735 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.61800    1.47428   7.202 0.000177 ***
## x           -0.96640    0.05712 -16.918 6.18e-07 ***
## y           -0.18829    0.12860  -1.464 0.186575    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4092 on 7 degrees of freedom
## Multiple R-squared:  0.9815, Adjusted R-squared:  0.9762 
## F-statistic: 185.9 on 2 and 7 DF,  p-value: 8.579e-07

We have a very high R2 value, which means our model explains a very high percentage of the variation in z. The coefficient of x is large compared to the standard error, as measured by the large t-statistic, and this leads to a p-value below the 0.05 threshold. Thus, x is a good predictor of z.

However, the p-value for the y variable is well above the 0.05 threshold, which means we cannot reject the hypothesis that its coefficient is actually 0. Thus, y is not a useful predictor of z and should be dropped from the model. In this case, a simple regression model with x as the only predictor would be more appropriate than a multiple regression model. We can see this by comparing the scatter plots of z vs. x and z vs. y individually.
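
Two quick ggplot calls produce these plots:

# z vs. x
ggplot(sim_mult) +
  geom_point(aes(x, z))

# z vs. y
ggplot(sim_mult) +
  geom_point(aes(y, z))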


Multiple regression is a very powerful technique for studying the effect of one predictor variable while controlling for others. Consider again the interpretation of a predictor’s regression coefficient in a multiple regression model: It tells us the amount by which the response variable changes when that predictor increases by 1 and all other predictors are held constant. In other words, the coefficient states the effect a predictor has on the response, everything else being equal. A multiple regression model, then, lets us isolate the way a predictor impacts the response without having to worry about confounding effects of the other predictors.

For example, a famous study once reported that increasing the amount of coffee a person drinks tends to lower their life expectancy. The following (made-up) data set reflects this finding:

coffee <- tibble(
  cups_per_day = c(1, 3, 2, 4, 3, 2, 1, 5, 3, 4, 5, 3, 2, 1),
  age_at_death = c(87, 76, 85, 80, 84, 87, 88, 82, 78, 81, 82, 78, 82, 84)
)

coffee
## # A tibble: 14 x 2
##    cups_per_day age_at_death
##           <dbl>        <dbl>
##  1            1           87
##  2            3           76
##  3            2           85
##  4            4           80
##  5            3           84
##  6            2           87
##  7            1           88
##  8            5           82
##  9            3           78
## 10            4           81
## 11            5           82
## 12            3           78
## 13            2           82
## 14            1           84

Let’s look at a scatter plot and then build a simple regression model.

ggplot(coffee) +
  geom_point(aes(cups_per_day, age_at_death))

coffee_model <- lm(age_at_death ~ cups_per_day, data = coffee)

summary(coffee_model)
## 
## Call:
## lm(formula = age_at_death ~ cups_per_day, data = coffee)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1144 -1.4472  0.8856  2.6019  3.4194 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   86.5132     1.9836  43.614 1.37e-14 ***
## cups_per_day  -1.4663     0.6436  -2.278   0.0418 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.176 on 12 degrees of freedom
## Multiple R-squared:  0.302,  Adjusted R-squared:  0.2438 
## F-statistic: 5.191 on 1 and 12 DF,  p-value: 0.0418

The scatter plot shows a weak downward linear trend, and our model’s R2 value of 0.302 is consistent with that. The coefficient of cups_per_day is -1.4663, which suggests that each additional cup of coffee per day lowers life expectancy by about 1.5 years. The p-value for this coefficient is just below the 0.05 threshold, so we can reject the hypothesis that there is no connection between coffee intake and life expectancy.
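
For example, we can ask the model what it predicts for someone who drinks four cups a day (a quick check using base R’s predict function):

# roughly 86.5 - 1.47 * 4, or about 80.6 years
predict(coffee_model, newdata = tibble(cups_per_day = 4))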

This is troubling news for coffee drinkers: we now have statistical grounds to claim that coffee intake lowers life expectancy. However, this is where good experimental design principles come into play. Any study that seeks to explain the behavior of a response variable (like life expectancy), especially one related to human behavior, should consider as many potential predictor variables as possible. Simple regression is rarely sufficient for modeling complex interactions and can greatly oversimplify the situation.

What the real-life study mentioned above did not take into account was that there’s a confounding variable. This is a variable that is closely related to both the predictor variable and the response variable. The confounding variable in this study is the smoking habits of the participants. Coffee intake and tobacco usage are positively correlated variables, meaning that when the value of one is high, the value of the other often is too, and vice versa. Also, it’s well-known that chronic smokers often have lower life expectancies. Suppose we included a predictor variable that records the number of cigarettes smoked per day:

coffee_cigs <- tibble(
  cups_per_day = c(1, 3, 2, 4, 3, 2, 1, 5, 3, 4, 5, 3, 2, 1),
  cigs_per_day = c(0, 4, 0, 2, 0, 0, 0, 4, 5, 0, 1, 2, 0, 1),
  age_at_death = c(87, 76, 85, 80, 84, 87, 88, 82, 78, 81, 82, 78, 82, 84)
)

coffee_cigs
## # A tibble: 14 x 3
##    cups_per_day cigs_per_day age_at_death
##           <dbl>        <dbl>        <dbl>
##  1            1            0           87
##  2            3            4           76
##  3            2            0           85
##  4            4            2           80
##  5            3            0           84
##  6            2            0           87
##  7            1            0           88
##  8            5            4           82
##  9            3            5           78
## 10            4            0           81
## 11            5            1           82
## 12            3            2           78
## 13            2            0           82
## 14            1            1           84
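
Before fitting the new model, we can check that the two predictors really are correlated in this data set; the correlation works out to roughly 0.44, a moderate positive relationship:

cor(coffee_cigs$cups_per_day, coffee_cigs$cigs_per_day)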

coffee_cigs_model <- lm(age_at_death ~ cups_per_day + cigs_per_day, data = coffee_cigs)

summary(coffee_cigs_model)
## 
## Call:
## lm(formula = age_at_death ~ cups_per_day + cigs_per_day, data = coffee_cigs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4631 -1.8537  0.1568  1.3498  4.5291 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   86.1968     1.5678  54.980 8.89e-15 ***
## cups_per_day  -0.7414     0.5663  -1.309   0.2171    
## cigs_per_day  -1.2547     0.4354  -2.882   0.0149 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.504 on 11 degrees of freedom
## Multiple R-squared:  0.6023, Adjusted R-squared:  0.5299 
## F-statistic: 8.328 on 2 and 11 DF,  p-value: 0.006278

These results offer some very valuable insights. Based on our R2 value of 0.6023, including cigs_per_day as a predictor gave us a model that explains about twice as much of the variation in the response.

Also, look at the coefficients: cups_per_day still has a negative coefficient (-0.7414), but its p-value jumped well above the 0.05 threshold. Statistically speaking, we can no longer conclude that this coefficient differs from 0, meaning there is no real evidence that cups_per_day affects age_at_death once smoking is accounted for. The coefficient of cigs_per_day, on the other hand, is -1.2547, and its p-value is only 0.0149. We can therefore claim that cigarette smoking has a significant impact on life expectancy; more specifically, each additional cigarette smoked per day appears to lower life expectancy by about 1.25 years.


This example reveals the utility of multiple regression and explains why it’s a favorite type of model among statisticians. Decoupling the effects of correlated predictor variables on the response is straightforward: just include both of the correlated predictors in your model. This example also, unfortunately, shows how easy it is to mislead people with statistics by ignoring confounding variables, a problem known as “omitted variable bias.”

5.7.1 Exercises

These exercises refer to the data set found here. It consists of attributes describing the size and performance of several car makes and models from 1985 along with their retail selling price. (Click here for more information about this data set.)

  1. Download the data set above and import it into R.

  2. Create a linear regression model with price as the response variable and highway-mpg as the predictor variable.

  3. State the values of the coefficient of highway-mpg and its p-value. Interpret what each of these numbers means. What conclusion can you draw from these statistics, and why might this conclusion seem surprising?

  4. Explain why curb-weight and horsepower might be confounding variables in your model.

  5. Control for curb-weight and horsepower by adding them as predictors in your model from Exercise 2.

  6. What are the coefficient and p-value of the highway-mpg variable in the updated model from the previous exercise? Interpret what these numbers mean.

  7. In light of the previous exercise, explain why the conclusion from Exercise 3 above is an example of omitted variable bias.

  8. Do some research to find another real-life example of omitted variable bias.

5.8 Multiple Regression Example

In this section we’ll build a regression model on the flights data set from nycflights13 that predicts the number of flights that left New York on any given day in 2013. We’ll add one predictor at a time and track how the residuals, R2 values, and p-values change as we do.

We will need a new package called lubridate, which contains a collection of functions for handling date and time data.

install.packages("lubridate")

Of course we’ll also need the nycflights13 and lubridate libraries:

library(nycflights13)
library(lubridate)

Before we begin, let’s see what we can do with lubridate. We could enter a date, July 20, 2013, for instance, as a value in a data set in our usual way: 7-20-2013. However, R wouldn’t know this is a date. In fact, it would treat this as 7 minus 20 minus 2013 and thus store the value -2026. Putting quotes around it wouldn’t help either; “7-20-2013” would be treated as some meaningless character string.
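
You can check this in the console:

7-20-2013
## [1] -2026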

To address this, lubridate offers a function called make_date that converts our numbers into an actual date value. The syntax is make_date(YYYY, MM, DD). For July 20, 2013, we have:

make_date(2013, 07, 20)
## [1] "2013-07-20"

We can also recover the year, month, and day from a date:

year(make_date(2013, 07, 20))
## [1] 2013
month(make_date(2013, 07, 20))
## [1] 7
day(make_date(2013, 07, 20))
## [1] 20

Another useful function is wday, which states the day of the week of any given date. Dates are to be entered in the format: “YYYY-MM-DD.”

wday("2013-07-20")
## [1] 7

By default, wday returns a number from 1 to 7, with the week starting on Sunday. The output tells us that July 20, 2013, was the seventh day of the week, i.e., a Saturday. The optional label argument converts this number to the actual name of the day:

wday("2013-07-20", label = TRUE)
## [1] Sat
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

We want to model the number of flights that left New York on a given day. We’ll first have to use the year, month, and day columns from flights along with make_date to create a date variable. Then we’ll create a data set that shows the total number n of flights for each value of date.

daily <- flights %>%
  mutate(date = make_date(year, month, day)) %>%
  group_by(date) %>%
  summarize(n = n())

daily
## # A tibble: 365 x 2
##    date           n
##    <date>     <int>
##  1 2013-01-01   842
##  2 2013-01-02   943
##  3 2013-01-03   914
##  4 2013-01-04   915
##  5 2013-01-05   720
##  6 2013-01-06   832
##  7 2013-01-07   933
##  8 2013-01-08   899
##  9 2013-01-09   902
## 10 2013-01-10   932
## # ... with 355 more rows
## # i Use `print(n = ...)` to see more rows
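
As an aside, dplyr’s count function is shorthand for the group_by/summarize pattern we just used, so the same data set could have been built like this:

flights %>%
  mutate(date = make_date(year, month, day)) %>%
  count(date)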

Let’s build a model with n as the response and date as the predictor.

flight_model_v1 <- lm(n ~ date, data = daily)

Now let’s check the summary statistics and then the residual plot.

summary(flight_model_v1)
## 
## Call:
## lm(formula = n ~ date, data = daily)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -296.86  -16.89   39.09   62.98   83.20 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  49.91082  718.54057   0.069    0.945
## date          0.05493    0.04522   1.215    0.225
## 
## Residual standard error: 91.04 on 363 degrees of freedom
## Multiple R-squared:  0.004048,   Adjusted R-squared:  0.001304 
## F-statistic: 1.475 on 1 and 363 DF,  p-value: 0.2253

Our R2 value is very low, and the p-value is well above 0.05. Our model is not yet doing a good job. Maybe the residuals will tell us something. (We’ll break one of our rules and give each updated version of our data set the same name, only because we’ll be updating it a lot.)

daily <- daily %>%
  add_residuals(flight_model_v1)

ggplot(daily) +
  geom_point(aes(date, resid))

There is definitely some kind of pattern in these residuals. They seem to be separated into three distinct bands, and they look evenly spaced within each band. This might be suggesting a “day of the week” effect. Let’s check by color-coding the residuals by the day of the week. We’ll have to add a column with this information to daily using the wday function:

daily <- daily %>%
  mutate(day_of_week = wday(date, label = TRUE))

ggplot(daily) +
  geom_point(aes(date, resid, color = day_of_week)) +
  geom_hline(yintercept = 0)

It looks like Saturdays have very large negative residuals, as do Sundays to a lesser extent. This means our model is overestimating flight counts on these days; our model doesn’t know about the “day of the week” effect. We can build this into our model by including the day_of_week variable as a predictor.
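
Before we do, a box plot of the daily flight counts by day of the week makes the effect easy to see:

ggplot(daily) +
  geom_boxplot(aes(day_of_week, n))

Now let’s fit the updated model.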

flight_model_v2 <- lm(n ~ date + day_of_week, data = daily)

Let’s see how our statistics moved:

summary(flight_model_v2)
## 
## Call:
## lm(formula = n ~ date + day_of_week, data = daily)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -340.39   -8.61   11.82   24.32  103.74 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     10.37207  382.69081   0.027   0.9784    
## date             0.05742    0.02409   2.384   0.0177 *  
## day_of_week.L  -83.20776    6.72095 -12.380   <2e-16 ***
## day_of_week.Q -155.26428    6.71655 -23.117   <2e-16 ***
## day_of_week.C  -62.91584    6.71255  -9.373   <2e-16 ***
## day_of_week^4  -80.01235    6.72280 -11.902   <2e-16 ***
## day_of_week^5   -4.98892    6.70415  -0.744   0.4573    
## day_of_week^6  -16.96669    6.70759  -2.529   0.0119 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.48 on 357 degrees of freedom
## Multiple R-squared:  0.7222, Adjusted R-squared:  0.7168 
## F-statistic: 132.6 on 7 and 357 DF,  p-value: < 2.2e-16

We have a much higher R2 value and mostly low p-values. (Remember that day_of_week is categorical with seven levels, so lm expands it into six predictor variables. Because wday with label = TRUE returns an ordered factor, R encodes these six variables using polynomial contrasts by default, which is why the coefficients are labeled .L, .Q, .C, ^4, ^5, and ^6 rather than with day names.)
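
If you’re curious, you can inspect the contrast matrix R uses for this factor directly (output omitted here):

contrasts(daily$day_of_week)

How do the residuals look?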

daily <- daily %>%
  select(-resid) %>% # We remove the old residuals so as not to confuse them with the new.
  add_residuals(flight_model_v2)

ggplot(daily) +
  geom_point(aes(date, resid)) +
  geom_hline(yintercept = 0)

This is much better, but there is still an obvious pattern that seems to roughly correspond to changes in season. The large negative residuals during winter dates mean that our model is overestimating the flight count, and the large positive residuals in summer mean it’s underestimating the flight count. We can tell our model about this seasonal effect by mutating a season column onto daily and then using that column as a predictor variable. (We did this in an exercise following Section 2.5.)

daily <- daily %>%
  mutate(season = case_when(month(date) %in% c(12, 1, 2) ~ "winter",
                            month(date) %in% c(3, 4, 5) ~ "spring",
                            month(date) %in% c(6, 7, 8) ~ "summer",
                            TRUE ~ "fall"))

flight_model_v3 <- lm(n ~ date + day_of_week + season, data = daily)
summary(flight_model_v3)
## 
## Call:
## lm(formula = n ~ date + day_of_week + season, data = daily)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -332.79   -5.66    6.58   15.69  115.46 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -270.43558  457.20807  -0.591  0.55457    
## date             0.07444    0.02858   2.604  0.00960 ** 
## day_of_week.L  -84.23875    6.00102 -14.037  < 2e-16 ***
## day_of_week.Q -156.08141    5.99631 -26.030  < 2e-16 ***
## day_of_week.C  -62.75384    5.99242 -10.472  < 2e-16 ***
## day_of_week^4  -79.66318    6.00192 -13.273  < 2e-16 ***
## day_of_week^5   -5.02575    5.98476  -0.840  0.40161    
## day_of_week^6  -16.53775    5.98801  -2.762  0.00605 ** 
## seasonspring    27.40611    8.27388   3.312  0.00102 ** 
## seasonsummer    34.23232    6.91255   4.952 1.14e-06 ***
## seasonwinter   -20.98238    7.71327  -2.720  0.00684 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.28 on 354 degrees of freedom
## Multiple R-squared:  0.7805, Adjusted R-squared:  0.7743 
## F-statistic: 125.9 on 10 and 354 DF,  p-value: < 2.2e-16

Our R2 went up a little and our p-values are almost all below 0.05. Let’s see what the residuals say.

daily <- daily %>%
  select(-resid) %>%
  add_residuals(flight_model_v3)

ggplot(daily) +
  geom_point(aes(date, resid)) +
  geom_hline(yintercept = 0)

This seems to have fixed the seasonal effect. Now let’s look at some of the isolated points with very large negative or positive residuals. We can examine these points by creating a data set containing only the dates with large residuals. Let’s look at the dates with residuals above 75 or below -75:

large_resid <- daily %>%
  filter(resid > 75 | resid < -75)

large_resid
## # A tibble: 14 x 5
##    date           n day_of_week season  resid
##    <date>     <int> <ord>       <chr>   <dbl>
##  1 2013-05-26   729 Sun         spring -177. 
##  2 2013-07-04   737 Thu         summer -253. 
##  3 2013-07-05   822 Fri         summer -169. 
##  4 2013-08-31   680 Sat         summer  -92.1
##  5 2013-09-01   718 Sun         fall   -168. 
##  6 2013-11-28   634 Thu         fall   -333. 
##  7 2013-11-29   661 Fri         fall   -307. 
##  8 2013-11-30   857 Sat         fall    112. 
##  9 2013-12-01   987 Sun         winter  115. 
## 10 2013-12-21   811 Sat         winter   85.8
## 11 2013-12-24   761 Tue         winter -173. 
## 12 2013-12-25   719 Wed         winter -226. 
## 13 2013-12-28   814 Sat         winter   88.3
## 14 2013-12-31   776 Tue         winter -158.

We can see right away why these dates were hard to predict: They’re holidays or near-holidays. The large negative residuals mean the model overestimated; there were fewer flights than the date, day of the week, and season alone would suggest. Indeed, these dates are actual holidays or parts of holiday weekends: 2013-05-26 was the day before Memorial Day, 2013-07-04 was Independence Day, and so on. Every large negative residual corresponds to a day when people generally don’t fly because of a holiday.

The large positive residuals indicate an underestimate by the model, meaning there were more flights than predicted. Many of these dates are travel days just before or after holidays. (Why do you think so many people traveled for Christmas on December 21?)

Now let’s add a “holiday” column with three values: “travel” if the date is a heavy holiday travel day (i.e., it has a large positive residual), “stay” if it’s a light travel day (i.e., it has a large negative residual), and “neither” otherwise. We’ll use the dates above, first creating vectors containing these dates to make the code below clearer.

holiday_travel <- large_resid$date[large_resid$resid > 0]
holiday_stay <- large_resid$date[large_resid$resid < 0]

daily <- daily %>%
  mutate(holiday = case_when(date %in% holiday_travel ~ "travel",
                             date %in% holiday_stay ~ "stay",
                             TRUE ~ "neither"))

flight_model_v4 <- lm(n ~ date + day_of_week + season + holiday, data = daily)
summary(flight_model_v4)
## 
## Call:
## lm(formula = n ~ date + day_of_week + season + holiday, data = daily)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -128.289   -8.402    0.986   10.095  120.922 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -983.51411  250.54521  -3.925 0.000104 ***
## date             0.11940    0.01567   7.620 2.37e-13 ***
## day_of_week.L  -85.35241    3.16519 -26.966  < 2e-16 ***
## day_of_week.Q -160.48460    3.19032 -50.304  < 2e-16 ***
## day_of_week.C  -69.43631    3.15772 -21.989  < 2e-16 ***
## day_of_week^4  -79.63958    3.16208 -25.186  < 2e-16 ***
## day_of_week^5   -9.41645    3.14701  -2.992 0.002965 ** 
## day_of_week^6  -12.64957    3.14869  -4.017 7.20e-05 ***
## seasonspring    31.86309    4.38948   7.259 2.52e-12 ***
## seasonsummer    39.35238    3.64363  10.800  < 2e-16 ***
## seasonwinter   -16.34626    4.15280  -3.936 9.98e-05 ***
## holidaystay   -216.02229    7.44214 -29.027  < 2e-16 ***
## holidaytravel   92.52097   12.10831   7.641 2.06e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.73 on 352 degrees of freedom
## Multiple R-squared:  0.9398, Adjusted R-squared:  0.9377 
## F-statistic: 457.8 on 12 and 352 DF,  p-value: < 2.2e-16

Our R2 value is way up, and all of our p-values are below the threshold. As for the residuals:

daily <- daily %>%
  select(-resid) %>%
  add_residuals(flight_model_v4)

ggplot(daily) +
  geom_point(aes(date, resid)) +
  geom_hline(yintercept = 0)

This took care of some of our large residuals. We could continue adding predictors almost indefinitely, moving our R2 value ever closer to 1. However, a regression model with too many predictors will almost always exhibit a problem known as overfitting. An overfit model goes beyond capturing the real relationships among the variables and starts modeling the random noise as well, which makes it unreliable outside the data set on which it was built.
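
A quick way to see why in-sample fit can’t warn us about overfitting is to compare the models’ errors on the data they were trained on; modelr’s rmse function makes the comparison easy (a sketch, with output omitted):

rmse(flight_model_v1, daily)
rmse(flight_model_v2, daily)
rmse(flight_model_v3, daily)
rmse(flight_model_v4, daily)
# Each value is no larger than the previous one: on its own training data,
# the more complex model always looks at least as good.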

5.8.1 Exercises

  1. Investigate the large positive and negative residuals in our most recent residual plot. What are the dates? Why do you think the residuals are so big? How might you fix this?

  2. You may have noticed that in the model summary statistics above, the p-value for day_of_week^5 is usually pretty high. Try to find out why this is.

  3. What do you think is going on with the series of evenly spaced negative residuals near the end of the year? Try to find out.

  4. Adding the holiday predictor in flight_model_v4 might have been an instance of overfitting. Explain why. (Think about how flight_model_v4 would perform on, for example, 2014 New York City flight data.)

5.9 Project

Project Description:

The purpose of this project is to build a linear regression model on a real-life data set and to interpret the results. The goal will be to investigate how various socio-economic variables affect pay rate.

You will need to install and load the wooldridge library.

Instructions:

Create a new R script file. This will be where you do your scratch work. Then view and read the documentation for the wage1 data set from wooldridge. Convert it to a tibble so that it will work better with tidyverse functions.

Part I: Is there gender-based pay inequity?

  1. Build a simple regression model with wage as the response and female as the predictor. What does the value of the regression coefficient and its statistics suggest about differences in pay between men and women? What does your model’s R2 value say about your model?

  2. Explain why your model could be called into question.

  3. Why might educ, exper, and tenure be considered confounding variables in your model?

  4. For each of the above three confounding variables, examine their individual relationships with wage. Explain why there’s evidence that the relationship might be exponential with at least some of these variables.

  5. Build a multiple regression model that illustrates the relationship between pay and gender, but which controls for education level, years of experience, and years of tenure. (Remember that we’re using log(wage) now because of the exponential relationships observed in the previous problem.)

  6. Examine the summary statistics of your multiple regression model. Do you need to change your conclusion based on your original simple regression model? Explain your answer very precisely here, making sure to cite and correctly interpret your summary statistics.

  7. For your multiple regression model, rank the predictor variables by the impact they have on pay rate. (Use t-statistics.) Think of a way to visualize your ranking.

  8. Make a final conclusion. Would it be appropriate to use your model and its conclusion to claim anything about pay inequity today? Explain.

Part II: Is there race-based pay inequity?

  • Re-do Part I, but use the nonwhite variable in place of female.

Guidelines:

  • Include a title, your name, and the date in the heading of your R Markdown file.
  • Also include a preliminary code chunk in which you load the libraries you’ll need, and briefly say what each one is for.
  • Begin with an introductory paragraph containing, at least, a description of the data set (including what it contains, its size, and its source) and a nicely displayed data table using the datatable function. (If the data set is too big to display when you try to knit your .Rmd file, you don’t have to display it.)
  • Clearly describe what you’ll be doing with the data, and include any questions you’ll be trying to answer.
  • Follow the introduction and problem statement with the body of your report which will contain your work, broken into relevant section headings.
  • The compiled report should show all of your R code and its output, but it should not show any warning or error messages.
  • The body should also include text that provides a running narrative that guides the reader through your work. This should be thorough and complete, but it should avoid large blocks of text.
  • All graphics should look nice, be easy to read, and be fully labelled.
  • You should include insightful statements about the meaning you’re extracting from your models, graphics, and statistical calculations.
  • End with an overall concluding statement which summarizes your findings. Be sure to disclose any shortcomings of your analysis.

Grading Rubric:

  • Technical Explanations: Do you explain the technical aspects of your model, such as residuals, logarithms, predictor coefficients, R2 values, t-statistics, and p-values, correctly? (30 points)
  • Arguments: Do you give fair and clear arguments based solely on the results of your models? Do your conclusions follow correctly from your models’ statistics? (30 points)
  • Narrative: Is it clear what you’re trying to accomplish with your model? Do you maintain a readable narrative throughout that guides the reader through your analysis? (20 points)
  • Professionalism: Does your report look nice? Do you provide insights based on your analysis? Is your code clear and readable? (15 points)