# Chapter 3 Data wrangling for model assessment

We’ll talk about the `dplyr`

package, and assessing models.

Much of this code was adapted from Master the Tidyverse

## 3.1 dplyr verbs

We have already seen some `dplyr`

verbs, but we are going to need more as we move on. So, let’s spend some time focusing on them. Here are the main verbs we will cover:

verb | action | example |
---|---|---|

`select()` |
take a subset of columns |
`select(x,y)` , `select(-x)` |

`filter()` |
take a subset of rows |
`filter(x == __, y > __)` |

`arange()` |
reorder the rows |
`arrange(x)` , `arrange(desc(x))` |

`summarize()` |
a many-to-one or many-to-few summary | `summarize(mean(x), median(y))` |

`group_by()` |
group the rows by a specified column |
`group_by(x) %>% something()` |

`mutate()` |
a many-to-many operation that creates a new variable | `mutate(x = ___, y = ___)` |

We’ll use the `babynames`

package to play with the verbs, so let’s begin by loading that package, and then dataset.

We could `skim`

the data, to learn something about it:

Of course, to use `dplyr`

, we need to load it!

### 3.1.1 select()

- Alter the code to select just the
`n`

column

#### 3.1.1.1 select() helpers

`:`

select range of columns,`select(storms, storm:pressure)`

`-`

select every column but`select(storms, -c(storm, pressure))`

`starts_with()`

select columns that start with…`select(storms, starts_with("w"))`

`ends_with()`

select columns that end with…`select(storms, ends_with("e"))`

- …and more! Check out the Data Transformation cheatsheet

### 3.1.2 filter()

extract rows that meet logical criteria

Notice I’m using `==`

, which tests if things are equal. In R, `=`

sets something. There are other logical comparisons you can use

x < y | less than |

x > y | greater than |

x == y | equal to |

x <= y | less than or equal to |

x >= y | greater than or equal to |

x != y | not equal to |

x %in% y | group membership |

is.na(x) | is NA |

!is.na(x) | is not NA |

- Now, see if you can use the logical operators to manipulate our code to show:

- All of the names where
`prop`

is greater than or equal to 0.08 - All of the children named “Sea”
- All of the names that have a missing value for
`n`

(Hint: this should return an empty data set).

Common mistakes:

- using
`=`

instead of`==`

- forgetting quotes

We can also filter rows that match *every* logical criteria,

For this, you need to use Boolean operators

a & b | and |

a | b | or |

xor(a, b) | exactly or |

!a | not |

a %in% c(a, b) | one of (in) |

- Use Boolean operators to alter the code below to return only the rows that contain:

- Girls named Sea
- Names that were used by exactly 5 or 6 children in 1880
- Names that are one of Acura, Lexus, or Yugo

### 3.1.3 arrange()

Orders rows from smallest to largest values

- Arrange babynames by n. Add prop as a second (tie breaking) variable to arrange by.
- Can you tell what the smallest value of n is? Any guesses why?

Another helpful function is `desc()`

, which changes the ordering to largest smallest,

- Use
`desc()`

to find the names with the highest`prop`

. - Use
`desc()`

to find the names with the highest`n`

.

## 3.2 %>%, the pipe

I %>%

— Hadley Wickham (@hadleywickham) February 11, 2021

tumble(out_of = "bed") %>%

stumble(to = "the kitchen") %>%

pour(who = "myself", unit = "cup", what = "ambition") %>%

yawn() %>%

stretch() %>%

try(come_to_live()) https://t.co/plnA2dZdJ4

In other words, you can nest functions together in R, much like

\[ f(g(x)) \]

but, once you go beyond a function or two, that becomes hard to read.

`try(come_to_life(stretch(yawn(pour(stumble(tumble(I, out_of = "bed"), to = "the kitchen"), who = "myself", unit = "cup", what = "ambition")))))`

The pipe allows you to unnest your functions, and pass data along a pipeline.

```
I %>%
tumble(out_of = "bed") %>%
stumble(to = "the kitchen") %>%
pour(who = "myself", unit = "cup", what = "ambition") %>%
yawn() %>%
stretch() %>%
try(come_to_life())
```

(Those examples are not valid R code!)

We could see this with a more real-life example:

What does this code do?

What does this code do?

We pronounce the pipe, `%>%`

, as “then.”

[Side note: many languages use `|`

as the pipe, but that means “or” or “given” in R, depending on the syntax.]

- Use
`%>%`

to write a sequence of functions that

- Filter babynames to just the girls that were born in 2015
- Select the
`name`

and`n`

columns - Arrange the results so that the most popular names are near the top.

- [Combining
`dplyr`

knowledge with`ggplot2`

!]

- Trim
`babynames`

to just the rows that contain a particular name and sex. This could be your name/sex or that of a friend or famous person. - Trim the result to just the columns that you’ll need for the plot
- Plot the results as a line graph with
`year`

on the x axis and`prop`

on the y axis

[Hint: “trim” here is a colloquial word, you will need to translate it to the appropriate `dplyr`

verb in each case.]

## 3.3 Modeling conditions

### 3.3.1 Least squares

When `R`

finds the line of best fit, it is minimizing the sum of the squared residuals,

\[ SSE = \sum_{i=1}^n (y_i - \hat{y_i})^2 \] in order for the model to be appropriate, a number of conditions must be met.

**L**inearity**I**ndependence**N**ormality**E**quality of variance

These conditions are mainly related to the distribution of the residuals.

Assumption | Consequence | Diagnostic | Solution |
---|

Indepdence | inaccurate inference | common sense/context | use a different technique/ don’t model \(E(\epsilon)=0\) | lack of model fit | plot of residuals vs. fitted values | transform \(x\) and/or \(y\) \(Var(\epsilon)=\sigma^2\) | inaccurate inference | plot of residuals v. fitted values | transform \(y\) \(\epsilon\sim N(\mu, \sigma)\) | if extreme, inaccurate inference | QQ plot | transform \(y\)

We would like to be able to work with the residuals from our models to assess whether the conditions are met, as well as to determine which model explains the most variability.

We would like to be able to work with the model objects we created yesterday using `dplyr`

verbs, but model objects are untidy objects. This is where the `broom`

package comes in! `broom`

helps you tidy up your models. Its two most useful functions are `augment`

and `tidy`

.

Let’s re-create our simple linear regression model from before (again, I’m hoping this isn’t just hanging out in your Environment already!).

Let’s `augment()`

that model.

Look at the new object in your environment. What is it like?

One parameter to `augment()`

is `data=`

. Let’s try again with that,

What’s different about that object?

We could use this augmented version of our dataset to do things like look for the largest residuals.

- Use a
`dplyr`

verb to find the rows where we over-predicted the most.

We could also use this dataset to plot our residuals, to see if they conform to our conditions. One way to see residual plots is to use the convenience function `plot()`

on our original model object.

For more on QQ plots, see this post by Sean Kross

But, a more flexible approach is to create our own residual plots. The augmented data allows us to do this!

One benefit to making our own residual plots is we can do things like color by a different variable, or facet the plot, to see if there is unexplained variability.

Try coloring the residual v. fitted plot by each of the categorical variables. Which categorical variable do you think explains the most additional variability? How can you tell?

## 3.4 More dplyr verbs

So far we have learned about `filter`

, `select`

and `arrange`

. Now we want to go into the verbs that modify the data in some way. First, `summarize`

### 3.4.1 summarize()

[Note: both the British and American spellings are accepted! I use `summarize()`

most of the time, but `summarise()`

also works.]

This can be thought of as a many-to-one operation. We are moving from many rows of data, and condensing down to just one.

- Use
`summarize()`

to compute three statistics about the data:

- The first (minimum) year in the dataset
- The last (maximum) year in the dataset
- The total number of children represented in the data

There are a few useful helper functions for `summarize()`

,

`n()`

, which counts the number of rows in a dataset or group`n_distinct()`

, which counts number of distinct values in a variable

Right now, `n()`

doesn’t seem that useful

`n_distinct()`

might seem better,

But, these become even more useful when combined with…

### 3.4.2 group_by()

The `group_by()`

function just groups cases by a common value of a particular variable.

When combined with other `dplyr`

verbs, it can be very useful!

## 3.5 More model analysis

Since we have the residuals in `m1_augmented`

, we can use that to compute the sum of squared residuals.

Notice that I’m naming my summary statistic, so I could use it later as a variable name.

We can think of partitioning the variability in our response variable as follows,

\[ SST = SSM + SSE \] where

\[\begin{eqnarray*} SST &=& \sum_{i=1}^n (y_i-\bar{y})^2 \\ SSM &=& \sum_{i=1}^n (\bar{y} - \hat{y})^2 \\ SSE &=& \sum_{i=1}^n (y_i -\hat{y})^2 \end{eqnarray*}\]

Let’s find the other two sums of squares

```
m1_augmented %>%
mutate(meansalary = mean(salary)) %>%
summarize(SSE = sum(.resid^2), SSM = sum((meansalary - .fitted)^2), SST = sum((salary - meansalary)^2))
```

We don’t have a nice way to interpret those sums of squares, but we can use them to calculate the \(R^2\) value,

\[ R^2 = 1 - \frac{SSE}{SST} = \frac{SSM}{SST} \]

```
m1_augmented %>%
mutate(meansalary = mean(salary)) %>%
summarize(SSE = sum(.resid^2), SSM = sum((meansalary - .fitted)^2), SST = sum((salary - meansalary)^2)) %>%
summarize(R2 = 1- SSE/SST)
```

We can use the \(R^2\) value to **assess** the model. The larger the \(R^2\) value, the more variability we can explain using the model.

Unfortunately, \(R^2\) always increases as you add predictors, so it is not a good statistic for comparing between models. Instead, we should use adjusted \(R^2\)

\[ R^2_{adj} = 1- \frac{SSE/(n-1)}{SST/(n-k-1)} \] The adjusted \(R^2\) doesn’t have a nice interpretation, but it can be used to compare between models.

The \(R^2\) and \(R^2_{adj}\) values are given by the model summary table.

We can also use the `tidy`

function from `broom`

to tidy up the model coefficients,

and `glance`

to look at model summaries,

Try re-making a few more of our models from yesterday, and

`glance`

ing to see which one has the highest adjusted \(R^2\).Use

`dplyr`

functions to calculate the adjusted \(R^2\) value for your favorite penguins model.

Beyond \(R^2\), another useful statistic for **assess**ing a model is the mean squared error, or the root mean squared error

\[
MSE = \frac{1}{n}\sum_{i=1}^n (y_i-\hat{y}_i)^2 \\
RMSE = \sqrt{MSE}
\]
1. Try using `dplyr`

verbs to compute the RMSE.

## 3.6 Bonus: comment on theory!

Although we are trying to “minimize” the sum of squared residuals, we don’t have to use a simulation method. Regression is actually done using matrix operations.

Suppose we have a sample of \(n\) subjects. For subject \(i\in{1,...,n}\) let \(Y_i\) denote the observed response value and \((x_{i1},x_{i2},\dots,x_{ik})\) denote the observed values of the \(k\) predictors. Then we can collect our observed response values into a vector \(y\), our predictor values into a matrix \(X\), and our regression coefficients into a vector \(\beta\). Note that a column of 1s is included for an intercept term in \(X\):

\[\begin{eqnarray*}y= \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, X = \begin{pmatrix} 1 & x_{11} x_{12} \dots x_{1k} \\ 1 & x_{21} & x_{22} & \dots x_{2k} \\ \vdots & \vdots & \vdots & \dots & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{nk} \end{pmatrix}, \text{ and } \beta = \begin{pmatrix} beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix} \end{eqnarray*}\]

Then we can express the model \(y_i=\beta_0+\beta_1 x_{i1}+\dots +\beta_{k}x_{ik}\) for \(i\in{1,\dots,n}\) using linear algebra:

\[ y=X\beta \] Further, let \(\hat{\beta}\) denote the vector of sample estimated \(\hat{beta}\), and \(\hat{y}\) denote the vector of predictions/model values:

\[ \hat{y}=X\hat{\beta} \] Thus the residual vector is \[ y−\hat{y}=X\beta−X\hat{\beta} \] and the sum of squared residuals is \[ (y−\hat{y})^T(y−\hat{y}) \] Challenge: Prove that the following formula for sample coefficients \(\beta\) are the least squares estimates of \(\beta\), ie. they minimize the sum of squared residuals: \[ \hat{\beta}=(X^TX)^{-1}X^Ty \]