Chapter 3 Data wrangling for model assessment
We’ll talk about the `dplyr` package, and assessing models.
Much of this code was adapted from Master the Tidyverse.
3.1 dplyr verbs
We have already seen some `dplyr` verbs, but we are going to need more as we move on. So, let’s spend some time focusing on them. Here are the main verbs we will cover:
| verb | action | example |
|---|---|---|
| `select()` | take a subset of columns | `select(x, y)`, `select(-x)` |
| `filter()` | take a subset of rows | `filter(x == __, y > __)` |
| `arrange()` | reorder the rows | `arrange(x)`, `arrange(desc(x))` |
| `summarize()` | a many-to-one or many-to-few summary | `summarize(mean(x), median(y))` |
| `group_by()` | group the rows by a specified column | `group_by(x) %>% something()` |
| `mutate()` | a many-to-many operation that creates a new variable | `mutate(x = ___, y = ___)` |
We’ll use the `babynames` package to play with the verbs, so let’s begin by loading that package, and then the dataset.
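A minimal setup chunk:

```r
library(babynames)
data(babynames)
```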
We could `skim()` the data to learn something about it.
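`skim()` comes from the `skimr` package:

```r
library(skimr)

# quick overview: dimensions, column types, missingness, summary statistics
skim(babynames)
```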
Of course, to use `dplyr`, we need to load it!
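That’s just:

```r
library(dplyr)
```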
3.1.1 select()
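`select()` extracts columns by name. Here is a sketch of a starting chunk to work from (the column choice is arbitrary):

```r
# keep only the year, name, and n columns
select(babynames, year, name, n)
```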
- Alter the code to select just the `n` column.
3.1.1.1 select() helpers
- `:` selects a range of columns, e.g. `select(storms, storm:pressure)`
- `-` selects every column but the ones listed, e.g. `select(storms, -c(storm, pressure))`
- `starts_with()` selects columns that start with…, e.g. `select(storms, starts_with("w"))`
- `ends_with()` selects columns that end with…, e.g. `select(storms, ends_with("e"))`
- …and more! Check out the Data Transformation cheatsheet
3.1.2 filter()
`filter()` extracts rows that meet logical criteria.
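For example (using "Garrett" as an arbitrary name to match on):

```r
# keep only the rows where name is exactly "Garrett"
filter(babynames, name == "Garrett")
```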
Notice I’m using `==`, which tests whether two things are equal. In R, a single `=` sets (assigns) something. There are other logical comparisons you can use:
| comparison | meaning |
|---|---|
| `x < y` | less than |
| `x > y` | greater than |
| `x == y` | equal to |
| `x <= y` | less than or equal to |
| `x >= y` | greater than or equal to |
| `x != y` | not equal to |
| `x %in% y` | group membership |
| `is.na(x)` | is NA |
| `!is.na(x)` | is not NA |
- Now, see if you can use the logical operators to manipulate our code to show:
  - All of the names where `prop` is greater than or equal to 0.08
  - All of the children named “Sea”
  - All of the names that have a missing value for `n` (Hint: this should return an empty data set.)
Common mistakes:

- using `=` instead of `==`
- forgetting quotes
We can also filter rows that match several logical criteria at once. For this, you need to use Boolean operators:
| operator | meaning |
|---|---|
| `a & b` | and |
| `a \| b` | or |
| `xor(a, b)` | exactly or |
| `!a` | not |
| `a %in% c(a, b)` | one of (in) |
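For example, to combine two conditions in one `filter()` call:

```r
# comma-separated conditions inside filter() are combined with &
babynames %>%
  filter(name == "Garrett", year > 1880)
```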
- Use Boolean operators to alter the code below to return only the rows that contain:
  - Girls named Sea
  - Names that were used by exactly 5 or 6 children in 1880
  - Names that are one of Acura, Lexus, or Yugo
3.1.3 arrange()
`arrange()` orders rows from smallest to largest values.
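For example:

```r
# sort from the earliest year to the most recent
babynames %>%
  arrange(year)
```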
- Arrange babynames by `n`. Add `prop` as a second (tie-breaking) variable to arrange by.
- Can you tell what the smallest value of `n` is? Any guesses why?
Another helpful function is `desc()`, which flips the ordering to largest-to-smallest.
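For example:

```r
# sort from the most recent year to the earliest
babynames %>%
  arrange(desc(year))
```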
- Use `desc()` to find the names with the highest `prop`.
- Use `desc()` to find the names with the highest `n`.
3.2 %>%, the pipe
> I %>%
>   tumble(out_of = "bed") %>%
>   stumble(to = "the kitchen") %>%
>   pour(who = "myself", unit = "cup", what = "ambition") %>%
>   yawn() %>%
>   stretch() %>%
>   try(come_to_live()) https://t.co/plnA2dZdJ4
>
> — Hadley Wickham (@hadleywickham) February 11, 2021
In other words, you can nest functions together in R, much like
\[ f(g(x)) \]
but, once you go beyond a function or two, that becomes hard to read.
try(come_to_life(stretch(yawn(pour(stumble(tumble(I, out_of = "bed"), to = "the kitchen"), who = "myself", unit = "cup", what = "ambition")))))
The pipe allows you to unnest your functions, and pass data along a pipeline.
I %>%
  tumble(out_of = "bed") %>%
  stumble(to = "the kitchen") %>%
  pour(who = "myself", unit = "cup", what = "ambition") %>%
  yawn() %>%
  stretch() %>%
  try(come_to_life())
(Those examples are not valid R code!)
We could see this with a more real-life example:
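(A sketch; any `dplyr` verbs would do.)

```r
# nested: must be read from the inside out
arrange(filter(babynames, year == 2017), desc(prop))

# piped: reads top to bottom, one step at a time
babynames %>%
  filter(year == 2017) %>%
  arrange(desc(prop))
```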
What does this code do?
We pronounce the pipe, `%>%`, as “then.”
[Side note: many languages use `|` as the pipe, but that character means “or” or “given” in R, depending on the syntax.]
- Use `%>%` to write a sequence of functions that:
  - Filter babynames to just the girls that were born in 2015
  - Select the `name` and `n` columns
  - Arrange the results so that the most popular names are near the top.
- [Combining `dplyr` knowledge with `ggplot2`!] (One possible solution is sketched after the hint below.)
  - Trim `babynames` to just the rows that contain a particular name and sex. This could be your name/sex or that of a friend or famous person.
  - Trim the result to just the columns that you’ll need for the plot.
  - Plot the results as a line graph with `year` on the x axis and `prop` on the y axis.

[Hint: “trim” here is a colloquial word; you will need to translate it to the appropriate `dplyr` verb in each case.]
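One possible solution sketch, using a hypothetical name and sex (swap in your own):

```r
library(ggplot2)

babynames %>%
  filter(name == "Hadley", sex == "M") %>%  # rows: one particular name/sex
  select(year, prop) %>%                    # columns: just what the plot needs
  ggplot(aes(x = year, y = prop)) +         # year on x, prop on y
  geom_line()
```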
3.3 Modeling conditions
3.3.1 Least squares
When R finds the line of best fit, it is minimizing the sum of the squared residuals,

\[ SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

In order for the model to be appropriate, a number of conditions must be met:
- Linearity
- Independence
- Normality
- Equality of variance
These conditions are mainly related to the distribution of the residuals.
| Assumption | Consequence | Diagnostic | Solution |
|---|---|---|---|
| Independence | inaccurate inference | common sense / context | use a different technique / don’t model |
| \(E(\epsilon)=0\) | lack of model fit | plot of residuals vs. fitted values | transform \(x\) and/or \(y\) |
| \(Var(\epsilon)=\sigma^2\) | inaccurate inference | plot of residuals vs. fitted values | transform \(y\) |
| \(\epsilon\sim N(\mu, \sigma)\) | if extreme, inaccurate inference | QQ plot | transform \(y\) |
We would like to be able to work with the residuals from our models to assess whether the conditions are met, as well as to determine which model explains the most variability.
We would like to be able to work with the model objects we created yesterday using `dplyr` verbs, but model objects are untidy objects. This is where the `broom` package comes in! `broom` helps you tidy up your models. Its two most useful functions are `augment()` and `tidy()`.
Let’s re-create our simple linear regression model from before (again, I’m hoping this isn’t just hanging out in your Environment already!).
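As a sketch, assuming yesterday’s model predicted `salary` from a single predictor (the names `years_experience` and `salaries` are placeholders; swap in your own variables and data):

```r
library(broom)

# placeholder recreation of yesterday's simple linear regression
m1 <- lm(salary ~ years_experience, data = salaries)
```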
Let’s `augment()` that model.
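With `broom` loaded, that’s:

```r
m1_augmented <- augment(m1)
```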
Look at the new object in your environment. What is it like?
One parameter to `augment()` is `data=`. Let’s try again with that:
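(Again, `salaries` is a placeholder for your data.)

```r
# data = keeps every column of the original data alongside the
# added .fitted, .resid, and other diagnostic columns
m1_augmented <- augment(m1, data = salaries)
```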
What’s different about that object?
We could use this augmented version of our dataset to do things like look for the largest residuals.
- Use a `dplyr` verb to find the rows where we over-predicted the most.
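One possible solution:

```r
# over-predictions have fitted values above the observed values,
# so their residuals (observed - fitted) are the most negative
m1_augmented %>%
  arrange(.resid)
```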
We could also use this dataset to plot our residuals, to see if they conform to our conditions. One way to see residual plots is to use the convenience function `plot()` on our original model object.
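For example:

```r
# base R diagnostic plots: residuals vs. fitted, QQ,
# scale-location, and residuals vs. leverage
plot(m1)
```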
For more on QQ plots, see this post by Sean Kross.
But, a more flexible approach is to create our own residual plots. The augmented data allows us to do this!
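For instance, a residuals vs. fitted plot built from the augmented data:

```r
library(ggplot2)

ggplot(m1_augmented, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")
```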
One benefit to making our own residual plots is that we can do things like color the points by a different variable, or facet the plot, to see if there is unexplained variability.
Try coloring the residual vs. fitted plot by each of the categorical variables. Which categorical variable do you think explains the most additional variability? How can you tell? (The mechanics are sketched below.)
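A sketch of the mechanics, with `dept` standing in for one of your categorical variables (a hypothetical name):

```r
# dept is a placeholder; substitute a categorical column from your data
ggplot(m1_augmented, aes(x = .fitted, y = .resid, color = dept)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")
```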
3.4 More dplyr verbs
So far we have learned about filter
, select
and arrange
. Now we want to go into the verbs that modify the data in some way. First, summarize
3.4.1 summarize()
[Note: both the British and American spellings are accepted! I use `summarize()` most of the time, but `summarise()` also works.]
This can be thought of as a many-to-one operation. We are moving from many rows of data, and condensing down to just one.
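For example:

```r
# many rows in, one row out
babynames %>%
  summarize(total = sum(n))
```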
- Use `summarize()` to compute three statistics about the data (one possible solution follows the list):
  - The first (minimum) year in the dataset
  - The last (maximum) year in the dataset
  - The total number of children represented in the data
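One possible solution:

```r
babynames %>%
  summarize(first = min(year), last = max(year), total = sum(n))
```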
There are a few useful helper functions for `summarize()`:

- `n()`, which counts the number of rows in a dataset or group
- `n_distinct()`, which counts the number of distinct values in a variable
Right now, `n()` doesn’t seem that useful:
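```r
# counts every row in babynames
babynames %>%
  summarize(n = n())
```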
`n_distinct()` might seem better:
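```r
# counts the distinct names used across all years
babynames %>%
  summarize(n_names = n_distinct(name))
```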
But, these become even more useful when combined with…
3.4.2 group_by()
The `group_by()` function just groups cases by a common value of a particular variable.
When combined with other `dplyr` verbs, it can be very useful!
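For example, grouping by `year` and then summarizing gives one row per year:

```r
babynames %>%
  group_by(year) %>%
  summarize(total = sum(n))
```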
3.5 More model analysis
Since we have the residuals in `m1_augmented`, we can use that to compute the sum of squared residuals.
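Something like:

```r
m1_augmented %>%
  summarize(SSE = sum(.resid^2))
```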
Notice that I’m naming my summary statistic, so I could use it later as a variable name.
We can think of partitioning the variability in our response variable as follows,
\[ SST = SSM + SSE \] where
\[\begin{eqnarray*} SST &=& \sum_{i=1}^n (y_i-\bar{y})^2 \\ SSM &=& \sum_{i=1}^n (\bar{y} - \hat{y}_i)^2 \\ SSE &=& \sum_{i=1}^n (y_i -\hat{y}_i)^2 \end{eqnarray*}\]
Let’s find the other two sums of squares:
m1_augmented %>%
  mutate(meansalary = mean(salary)) %>%
  summarize(SSE = sum(.resid^2), SSM = sum((meansalary - .fitted)^2), SST = sum((salary - meansalary)^2))
We don’t have a nice way to interpret those sums of squares, but we can use them to calculate the \(R^2\) value,
\[ R^2 = 1 - \frac{SSE}{SST} = \frac{SSM}{SST} \]
m1_augmented %>%
  mutate(meansalary = mean(salary)) %>%
  summarize(SSE = sum(.resid^2), SSM = sum((meansalary - .fitted)^2), SST = sum((salary - meansalary)^2)) %>%
  summarize(R2 = 1 - SSE/SST)
We can use the \(R^2\) value to assess the model. The larger the \(R^2\) value, the more variability we can explain using the model.
Unfortunately, \(R^2\) always increases as you add predictors, so it is not a good statistic for comparing between models. Instead, we should use adjusted \(R^2\),

\[ R^2_{adj} = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)} \]

where \(n\) is the number of observations and \(k\) is the number of predictors. The adjusted \(R^2\) doesn’t have a nice interpretation, but it can be used to compare between models.
The \(R^2\) and \(R^2_{adj}\) values are given by the model summary table.
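For example:

```r
summary(m1)
```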
We can also use the `tidy()` function from `broom` to tidy up the model coefficients:
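```r
# one row per coefficient: estimate, std.error, statistic, p.value
tidy(m1)
```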
and `glance()` to look at model summaries:
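```r
# a one-row summary: r.squared, adj.r.squared, sigma, and more
glance(m1)
```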
- Try re-making a few more of our models from yesterday, and `glance()`ing to see which one has the highest adjusted \(R^2\).
- Use `dplyr` functions to calculate the adjusted \(R^2\) value for your favorite penguins model.
Beyond \(R^2\), another useful statistic for assessing a model is the mean squared error, or its square root, the root mean squared error:

\[\begin{eqnarray*} MSE &=& \frac{1}{n}\sum_{i=1}^n (y_i-\hat{y}_i)^2 \\ RMSE &=& \sqrt{MSE} \end{eqnarray*}\]
- Try using `dplyr` verbs to compute the RMSE.
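One possible solution, again working from the augmented data:

```r
m1_augmented %>%
  summarize(MSE = mean(.resid^2), RMSE = sqrt(MSE))
```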
3.6 Bonus: comment on theory!
Although we are trying to “minimize” the sum of squared residuals, we don’t have to use a simulation method. Regression is actually done using matrix operations.
Suppose we have a sample of \(n\) subjects. For subject \(i\in\{1,\dots,n\}\) let \(Y_i\) denote the observed response value and \((x_{i1},x_{i2},\dots,x_{ik})\) denote the observed values of the \(k\) predictors. Then we can collect our observed response values into a vector \(y\), our predictor values into a matrix \(X\), and our regression coefficients into a vector \(\beta\). Note that a column of 1s is included for an intercept term in \(X\):
\[\begin{eqnarray*} y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_{11} & x_{12} & \dots & x_{1k} \\ 1 & x_{21} & x_{22} & \dots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{nk} \end{pmatrix}, \text{ and } \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} \end{eqnarray*}\]
Then we can express the model \(y_i=\beta_0+\beta_1 x_{i1}+\dots +\beta_{k}x_{ik}\) for \(i\in\{1,\dots,n\}\) using linear algebra:

\[ y=X\beta \]

Further, let \(\hat{\beta}\) denote the vector of sample estimated coefficients, and \(\hat{y}\) denote the vector of predictions/model values:

\[ \hat{y}=X\hat{\beta} \]

Thus the residual vector is

\[ y-\hat{y}=y-X\hat{\beta} \]

and the sum of squared residuals is

\[ (y-\hat{y})^T(y-\hat{y}) \]

Challenge: Prove that the following formula for the sample coefficients \(\hat{\beta}\) gives the least squares estimates of \(\beta\), i.e., that it minimizes the sum of squared residuals:

\[ \hat{\beta}=(X^TX)^{-1}X^Ty \]