Chapter 3 Data wrangling for model assessment

We’ll talk about the dplyr package, and assessing models.

Much of this code was adapted from Master the Tidyverse

3.1 dplyr verbs

We have already seen some dplyr verbs, but we are going to need more as we move on. So, let’s spend some time focusing on them. Here are the main verbs we will cover:

verb action example
select() take a subset of columns select(x,y), select(-x)
filter() take a subset of rows filter(x == __, y > __)
arange() reorder the rows arrange(x), arrange(desc(x))
summarize() a many-to-one or many-to-few summary summarize(mean(x), median(y))
group_by() group the rows by a specified column group_by(x) %>% something()
mutate() a many-to-many operation that creates a new variable mutate(x = ___, y = ___)

We’ll use the babynames package to play with the verbs, so let’s begin by loading that package, and then dataset.


We could skim the data, to learn something about it:


Of course, to use dplyr, we need to load it!


3.1.1 select()

select(babynames, name, prop)
  1. Alter the code to select just the n column select() helpers

  • : select range of columns, select(storms, storm:pressure)
  • - select every column but select(storms, -c(storm, pressure))
  • starts_with() select columns that start with… select(storms, starts_with("w"))
  • ends_with() select columns that end with… select(storms, ends_with("e"))
  • …and more! Check out the Data Transformation cheatsheet

3.1.2 filter()

extract rows that meet logical criteria

filter(babynames, name == "Amelia")

Notice I’m using ==, which tests if things are equal. In R, = sets something. There are other logical comparisons you can use

x < y less than
x > y greater than
x == y equal to
x <= y less than or equal to
x >= y greater than or equal to
x != y not equal to
x %in% y group membership is NA
! is not NA
  1. Now, see if you can use the logical operators to manipulate our code to show:
  • All of the names where prop is greater than or equal to 0.08
  • All of the children named “Sea”
  • All of the names that have a missing value for n (Hint: this should return an empty data set).

Common mistakes:

  • using = instead of ==
  • forgetting quotes

We can also filter rows that match every logical criteria,

filter(babynames, name == "Amelia", year == 1880)

For this, you need to use Boolean operators

a & b and
a | b or
xor(a, b) exactly or
!a not
a %in% c(a, b) one of (in)
  1. Use Boolean operators to alter the code below to return only the rows that contain:
  • Girls named Sea
  • Names that were used by exactly 5 or 6 children in 1880
  • Names that are one of Acura, Lexus, or Yugo

3.1.3 arrange()

Orders rows from smallest to largest values

arrange(babynames, n)
  1. Arrange babynames by n. Add prop as a second (tie breaking) variable to arrange by.
  2. Can you tell what the smallest value of n is? Any guesses why?

Another helpful function is desc(), which changes the ordering to largest smallest,

arrange(babynames, desc(n))
  1. Use desc() to find the names with the highest prop.
  2. Use desc() to find the names with the highest n.

3.2 %>%, the pipe

In other words, you can nest functions together in R, much like

\[ f(g(x)) \]

but, once you go beyond a function or two, that becomes hard to read.

try(come_to_life(stretch(yawn(pour(stumble(tumble(I, out_of = "bed"), to = "the kitchen"), who = "myself", unit = "cup", what = "ambition")))))

The pipe allows you to unnest your functions, and pass data along a pipeline.

I %>%
  tumble(out_of = "bed") %>%
  stumble(to = "the kitchen") %>%
  pour(who = "myself", unit = "cup", what = "ambition") %>%
  yawn() %>%
  stretch() %>%

(Those examples are not valid R code!)

We could see this with a more real-life example:

arrange(select(filter(babynames, year == 2015, 
  sex == "M"), name, n), desc(n))

What does this code do?

babynames %>%
  filter(year == 2015, sex == "M") %>%
  select(name, n) %>%

What does this code do?

We pronounce the pipe, %>%, as “then.”

[Side note: many languages use | as the pipe, but that means “or” or “given” in R, depending on the syntax.]

  1. Use %>% to write a sequence of functions that
  • Filter babynames to just the girls that were born in 2015
  • Select the name and n columns
  • Arrange the results so that the most popular names are near the top.
  1. [Combining dplyr knowledge with ggplot2!]
  • Trim babynames to just the rows that contain a particular name and sex. This could be your name/sex or that of a friend or famous person.
  • Trim the result to just the columns that you’ll need for the plot
  • Plot the results as a line graph with year on the x axis and prop on the y axis

[Hint: “trim” here is a colloquial word, you will need to translate it to the appropriate dplyr verb in each case.]

3.3 Modeling conditions

3.3.1 Least squares

When R finds the line of best fit, it is minimizing the sum of the squared residuals,

\[ SSE = \sum_{i=1}^n (y_i - \hat{y_i})^2 \] in order for the model to be appropriate, a number of conditions must be met.

  • Linearity
  • Independence
  • Normality
  • Equality of variance

These conditions are mainly related to the distribution of the residuals.

Assumption Consequence Diagnostic Solution

Indepdence | inaccurate inference | common sense/context | use a different technique/ don’t model \(E(\epsilon)=0\) | lack of model fit | plot of residuals vs. fitted values | transform \(x\) and/or \(y\) \(Var(\epsilon)=\sigma^2\) | inaccurate inference | plot of residuals v. fitted values | transform \(y\) \(\epsilon\sim N(\mu, \sigma)\) | if extreme, inaccurate inference | QQ plot | transform \(y\)

We would like to be able to work with the residuals from our models to assess whether the conditions are met, as well as to determine which model explains the most variability.

We would like to be able to work with the model objects we created yesterday using dplyr verbs, but model objects are untidy objects. This is where the broom package comes in! broom helps you tidy up your models. Its two most useful functions are augment and tidy.


Let’s re-create our simple linear regression model from before (again, I’m hoping this isn’t just hanging out in your Environment already!).

m1 <- lm(salary ~, data = Salaries)

Let’s augment() that model.

m1_augmented <- augment(m1)

Look at the new object in your environment. What is it like?

One parameter to augment() is data=. Let’s try again with that,

m1_augmented <- augment(m1, data=Salaries)

What’s different about that object?

We could use this augmented version of our dataset to do things like look for the largest residuals.

  1. Use a dplyr verb to find the rows where we over-predicted the most.
m1_augmented %>%

We could also use this dataset to plot our residuals, to see if they conform to our conditions. One way to see residual plots is to use the convenience function plot() on our original model object.


For more on QQ plots, see this post by Sean Kross

But, a more flexible approach is to create our own residual plots. The augmented data allows us to do this!

ggplot(m1_augmented, aes(x=.fitted, y=.resid)) + 
  geom_point() + 
ggplot(m1_augmented, aes(sample = .resid)) +
  stat_qq() +

One benefit to making our own residual plots is we can do things like color by a different variable, or facet the plot, to see if there is unexplained variability.

Try coloring the residual v. fitted plot by each of the categorical variables. Which categorical variable do you think explains the most additional variability? How can you tell?

3.4 More dplyr verbs

So far we have learned about filter, select and arrange. Now we want to go into the verbs that modify the data in some way. First, summarize

3.4.1 summarize()

[Note: both the British and American spellings are accepted! I use summarize() most of the time, but summarise() also works.]

This can be thought of as a many-to-one operation. We are moving from many rows of data, and condensing down to just one.

babynames %>% 
  summarise(total = sum(n), max = max(n))
  1. Use summarize() to compute three statistics about the data:
  • The first (minimum) year in the dataset
  • The last (maximum) year in the dataset
  • The total number of children represented in the data

There are a few useful helper functions for summarize(),

  • n(), which counts the number of rows in a dataset or group
  • n_distinct(), which counts number of distinct values in a variable

Right now, n() doesn’t seem that useful

babynames %>% 
  summarise(n = n())

n_distinct() might seem better,

babynames %>% 
  summarise(n = n(), nname = n_distinct(name))

But, these become even more useful when combined with…

3.4.2 group_by()

The group_by() function just groups cases by a common value of a particular variable.

babynames %>% 

When combined with other dplyr verbs, it can be very useful!

babynames %>% 
  group_by(sex) %>%
  summarise(total = sum(n))

3.4.3 mutate()

Our final single-table verb is mutate(). I think of mutate() as a many-to-many transformation. It adds additional columns (variables) to the data.

babynames %>%
  mutate(percent = round(prop*100, 2))
babynames %>%
  mutate(percent = round(prop*100, 2), nper = round(percent))

3.5 More model analysis

Since we have the residuals in m1_augmented, we can use that to compute the sum of squared residuals.

m1_augmented %>%
  summarize(SSE = sum(.resid^2))

Notice that I’m naming my summary statistic, so I could use it later as a variable name.

We can think of partitioning the variability in our response variable as follows,

\[ SST = SSM + SSE \] where

\[\begin{eqnarray*} SST &=& \sum_{i=1}^n (y_i-\bar{y})^2 \\ SSM &=& \sum_{i=1}^n (\bar{y} - \hat{y})^2 \\ SSE &=& \sum_{i=1}^n (y_i -\hat{y})^2 \end{eqnarray*}\]

Let’s find the other two sums of squares

m1_augmented %>%
  mutate(meansalary = mean(salary)) %>%
  summarize(SSE = sum(.resid^2), SSM = sum((meansalary - .fitted)^2), SST = sum((salary - meansalary)^2))

We don’t have a nice way to interpret those sums of squares, but we can use them to calculate the \(R^2\) value,

\[ R^2 = 1 - \frac{SSE}{SST} = \frac{SSM}{SST} \]

m1_augmented %>%
  mutate(meansalary = mean(salary)) %>%
  summarize(SSE = sum(.resid^2), SSM = sum((meansalary - .fitted)^2), SST = sum((salary - meansalary)^2)) %>%
  summarize(R2 = 1- SSE/SST)

We can use the \(R^2\) value to assess the model. The larger the \(R^2\) value, the more variability we can explain using the model.

Unfortunately, \(R^2\) always increases as you add predictors, so it is not a good statistic for comparing between models. Instead, we should use adjusted \(R^2\)

\[ R^2_{adj} = 1- \frac{SSE/(n-1)}{SST/(n-k-1)} \] The adjusted \(R^2\) doesn’t have a nice interpretation, but it can be used to compare between models.

The \(R^2\) and \(R^2_{adj}\) values are given by the model summary table.


We can also use the tidy function from broom to tidy up the model coefficients,


and glance to look at model summaries,

  1. Try re-making a few more of our models from yesterday, and glanceing to see which one has the highest adjusted \(R^2\).

  2. Use dplyr functions to calculate the adjusted \(R^2\) value for your favorite penguins model.

Beyond \(R^2\), another useful statistic for assessing a model is the mean squared error, or the root mean squared error

\[ MSE = \frac{1}{n}\sum_{i=1}^n (y_i-\hat{y}_i)^2 \\ RMSE = \sqrt{MSE} \] 1. Try using dplyr verbs to compute the RMSE.

3.6 Bonus: comment on theory!

Although we are trying to “minimize” the sum of squared residuals, we don’t have to use a simulation method. Regression is actually done using matrix operations.

Suppose we have a sample of \(n\) subjects. For subject \(i\in{1,...,n}\) let \(Y_i\) denote the observed response value and \((x_{i1},x_{i2},\dots,x_{ik})\) denote the observed values of the \(k\) predictors. Then we can collect our observed response values into a vector \(y\), our predictor values into a matrix \(X\), and our regression coefficients into a vector \(\beta\). Note that a column of 1s is included for an intercept term in \(X\):

\[\begin{eqnarray*}y= \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, X = \begin{pmatrix} 1 & x_{11} x_{12} \dots x_{1k} \\ 1 & x_{21} & x_{22} & \dots x_{2k} \\ \vdots & \vdots & \vdots & \dots & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{nk} \end{pmatrix}, \text{ and } \beta = \begin{pmatrix} beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix} \end{eqnarray*}\]

Then we can express the model \(y_i=\beta_0+\beta_1 x_{i1}+\dots +\beta_{k}x_{ik}\) for \(i\in{1,\dots,n}\) using linear algebra:

\[ y=X\beta \] Further, let \(\hat{\beta}\) denote the vector of sample estimated \(\hat{beta}\), and \(\hat{y}\) denote the vector of predictions/model values:

\[ \hat{y}=X\hat{\beta} \] Thus the residual vector is \[ y−\hat{y}=X\beta−X\hat{\beta} \] and the sum of squared residuals is \[ (y−\hat{y})^T(y−\hat{y}) \] Challenge: Prove that the following formula for sample coefficients \(\beta\) are the least squares estimates of \(\beta\), ie. they minimize the sum of squared residuals: \[ \hat{\beta}=(X^TX)^{-1}X^Ty \]