Section 7 Easier analysis with the tidyverse

Now that you know the basics of R, it’s time to learn that there are much better ways to do everything you just learnt!

7.1 Introduction to the tidyverse

The tidyverse is a bundle of packages that make using R easier because they’re all designed to work together. Most “tidy” functions work well together because they:

  • Take a dataframe as their input
  • Return a dataframe as their output

You might not use every package from the tidyverse in an analysis, but you can still load them all at the start of most analyses, and know you’ll have a standard set of tools available. To load them all, just use:

library(tidyverse)

For this session, we’ll use the Cowles data again, so let’s load it up:

cow = carData::Cowles

7.2 dplyr: Turning complex analyses into simple steps

Artwork by @allison_horst

As your analyses get more complicated, your code can get more complicated as well. To get to the answers you want, you might have to:

  • Drop certain rows from your dataset because they’re invalid or not relevant.
  • Calculate stats for each treatment group separately.
  • Use summary variables like the school mean to calculate a standardized score.

More importantly, you might have to combine several of these operations for each calculation you’re doing.

The dplyr package in the tidyverse makes this easier by providing a small number of simple verbs that can be combined easily to perform complex tasks. Each one takes your current data, changes it, and returns it. The most common verbs you’ll use are:

  • filter(): choose rows to keep based on a logical test, dropping the rest.
  • arrange(): sort the data.
  • select(): choose columns to keep.
  • mutate(): add new columns.
  • summarize(): create columns that summarize the data down to a single row.
  • count(): count the number of rows in the data.
  • left_join() (and right_join(), inner_join() etc.): Merge datasets based on a common identifier.

And, possibly most importantly:

  • group_by(): Split the data into groups, so that any subsequent steps happen separately for each group.

We’ll go over examples of all of these below.

7.2.1 Bending the rules: Non-standard evaluation

dplyr and other tidyverse packages bend the rules of R syntax: inside their function calls you can type bare column names like group and time, rather than having to spell out the name of the dataframe each time (like survey1$group, survey1$time). For example:

cow %>%
    # Within the brackets, no $ needed
    count(sex, volunteer)
##      sex volunteer   n
## 1 female        no 431
## 2 female       yes 349
## 3   male        no 393
## 4   male       yes 248
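
To make the contrast concrete, here’s a minimal sketch comparing the two styles. The survey1 dataframe below is made up for illustration (it just reuses the column names mentioned above, plus an invented score column):

```r
library(dplyr)

# Hypothetical dataframe, invented for this example
survey1 = data.frame(
    group = c("control", "control", "treatment", "treatment"),
    time  = c(1, 2, 1, 2),
    score = c(5, 6, 5, 9)
)

# Base R: the dataframe name is repeated for every column
base_result = survey1[survey1$group == "treatment" & survey1$time == 2, ]

# dplyr: bare column names are looked up inside survey1
dplyr_result = filter(survey1, group == "treatment", time == 2)

# Both approaches pick out the same single row
base_result$score   # 9
dplyr_result$score  # 9
```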

7.2.2 Pipes: %>%

dplyr and the tidyverse make heavy use of pipes to make it easier to carry out multiple processing steps in a row. A pipe looks like:

%>%

If you have a calculation that takes multiple steps, it gets confusing if you run them all in one go. Here we’ll calculate the mean of extraversion for the males in the data:

summarise(filter(cow, sex == "male"), male_mean = mean(extraversion))
##   male_mean
## 1  12.31357

The first step here is actually filtering the data, but we have to write these function calls inside out. R understands it fine, but most human beings struggle to follow the logic of what’s happening, especially if there are more than two of these nested steps.

You can do the steps one by one instead, but it’s clunky, and you might have to create multiple intermediate variables:

males = filter(cow, sex == "male")
summarise(males, male_mean = mean(extraversion))

Using pipes, you can do each step, and then send it through the pipe to the next step. To get the mean extraversion for males using pipes, we can do:

cow %>%
    filter(sex == "male") %>%
    summarise(male_mean = mean(extraversion))
##   male_mean
## 1  12.31357

See how we’re not actually specifying a dataframe for the filter() and summarise() functions? That’s because the %>% pipe automatically sends the data as the first argument to the next function.
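
If you want to convince yourself that the pipe is only rearranging how the code is written, a small check like the following can help. The toy dataframe is invented for illustration, so the numbers have nothing to do with the Cowles data:

```r
library(dplyr)

# Hypothetical toy data
toy = data.frame(
    sex = c("male", "female", "male"),
    extraversion = c(10, 12, 14)
)

# Nested call: the dataframe is written out as the first argument
nested = summarise(filter(toy, sex == "male"), male_mean = mean(extraversion))

# Piped: %>% supplies the left-hand side as the first argument
piped = toy %>%
    filter(sex == "male") %>%
    summarise(male_mean = mean(extraversion))

# Compare the two results: they should hold the same values
all.equal(nested, piped)
piped$male_mean  # 12
```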

7.2.3 Common tasks with dplyr

Selecting columns with select

You can select particular columns from a dataframe using select(). Using - in front of a column name excludes that column:

# These both give the same result: inclusion vs. exclusion
cow %>%
    select(neuroticism, extraversion) %>%
    head(2)
##   neuroticism extraversion
## 1          16           13
## 2           8           14
cow %>%
    select(-sex, -volunteer) %>% 
    head(2)
##   neuroticism extraversion
## 1          16           13
## 2           8           14

Instead of column names, you can also use a range of helper functions like starts_with() and num_range() to select multiple columns that match a particular pattern.
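
The Cowles data only has four columns, so the helpers are easier to see on a wider dataset. Here’s a short sketch using a made-up dataframe (wide and all of its column names are invented for illustration):

```r
library(dplyr)

# Hypothetical wide dataset with patterned column names
wide = data.frame(
    id = 1:2,
    q1 = c(3, 4), q2 = c(5, 1), q3 = c(2, 2),
    age_years = c(30, 41),
    age_group = c("30-39", "40-49")
)

# All columns whose names begin with "age"
age_cols = wide %>%
    select(starts_with("age")) %>%
    names()

# The prefix "q" followed by the numbers 1 to 2, i.e. q1 and q2
q_cols = wide %>%
    select(num_range("q", 1:2)) %>%
    names()

age_cols  # "age_years" "age_group"
q_cols    # "q1" "q2"
```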

Creating/changing columns with mutate

We can use mutate on a dataframe to add or change one or more columns:

cow %>%
    # high_extraversion and high_neuroticism are the names
    #   of new columns that will be created
    mutate(high_extraversion = extraversion >= 15,
           high_neuroticism = neuroticism >= 15) %>%
    head()
##   neuroticism extraversion    sex volunteer high_extraversion high_neuroticism
## 1          16           13 female        no             FALSE             TRUE
## 2           8           14   male        no             FALSE            FALSE
## 3           5           16   male        no              TRUE            FALSE
## 4           8           20 female        no              TRUE            FALSE
## 5           9           19   male        no              TRUE            FALSE
## 6           6           15   male        no              TRUE            FALSE

To keep these changes, we need to assign the result back to the original variable. By default, mutate() just returns an altered copy of the original data, and won’t change the original:

# If we want to actually save the changes
cow = cow %>%
    mutate(high_extraversion = extraversion >= 15,
           high_neuroticism = neuroticism >= 15)

On its own, the main advantage mutate offers is being able to spell out your calculations without including the name of the dataframe. However, it can be very useful in combination with other verbs.
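
As one possible illustration of combining mutate with another verb, the sketch below creates a flag and then filters on it in the very next step. The toy dataframe is invented for this example; only the column names are borrowed from the Cowles data:

```r
library(dplyr)

# Hypothetical toy data
toy = data.frame(
    extraversion = c(9, 15, 20),
    neuroticism  = c(16, 8, 5)
)

high_only = toy %>%
    # Create the flag...
    mutate(high_extraversion = extraversion >= 15) %>%
    # ...then use it straight away in the next step
    filter(high_extraversion)

# The result has the new column and only the matching rows
names(high_only)  # "extraversion" "neuroticism" "high_extraversion"
nrow(high_only)   # 2

# toy itself is untouched, because the result went to a new name
names(toy)        # "extraversion" "neuroticism"
```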

Summarizing the data with summarize

summarize is similar to mutate in that it calculates new columns. However, summarize also collapses the data down to a single row and keeps only the columns you calculate, so the values need to be single values like means or counts.

cow %>%
    summarize(
        extraversion = mean(extraversion),
        volunteers = sum(volunteer == "yes")
    )
##   extraversion volunteers
## 1     12.37298        597

summarize is useful in combination with group_by, where it collapses the data down to one row per group.

Selecting rows with filter

Choosing rows by a logical test with filter() works just like subsetting your data with a logical vector. It’s just easier to do as part of a sequence of steps:

cow %>%
  filter((sex == "male") & (volunteer == "yes")) %>%
  head()
##     neuroticism extraversion  sex volunteer
## 220          17           19 male       yes
## 439           7           15 male       yes
## 440          17           12 male       yes
## 442           6           13 male       yes
## 445           8            9 male       yes
## 446           5           16 male       yes
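
The two approaches agree whenever the test comes out as TRUE or FALSE for every row. One difference worth knowing about is what happens when the test evaluates to NA (for example because of missing data): base R subsetting returns a row full of NAs, while filter() drops the row. A small sketch with made-up data (toy is invented for illustration):

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy = data.frame(
    sex = c("male", "male", "female", "male"),
    volunteer = c("yes", "no", "yes", NA)
)

# Base R subsetting with a logical vector
base_rows = toy[toy$sex == "male" & toy$volunteer == "yes", ]

# The same test with filter()
dplyr_rows = toy %>%
    filter(sex == "male" & volunteer == "yes")

nrow(base_rows)   # 2: the matching row plus an all-NA row
nrow(dplyr_rows)  # 1: filter() drops rows where the test is NA
```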

Sorting with arrange

With arrange, you can sort by one or more columns. Use desc(column) (short for descending) to sort that column in the opposite direction:

cow %>%
  arrange(sex, volunteer, desc(extraversion)) %>%
  head()
##      neuroticism extraversion    sex volunteer
## 1109          15           23 female        no
## 67            15           21 female        no
## 277            9           21 female        no
## 875           10           21 female        no
## 4              8           20 female        no
## 40             5           20 female        no

group_by for calculations within groups

group_by() is very useful for calculating stats within subgroups of your data. Many of the more complex operations you might need come down to doing a simple step separately for each group, so it unlocks a lot of possibilities.

You can do things like:

  • Mean-centering scores separately for males and females:
cow %>%
    group_by(sex) %>%
    mutate(
        extraversion_centered = extraversion - mean(extraversion)
    )
## # A tibble: 1,421 x 5
## # Groups:   sex [2]
##    neuroticism extraversion sex    volunteer extraversion_centered
##          <int>        <int> <fct>  <fct>                     <dbl>
##  1          16           13 female no                        0.578
##  2           8           14 male   no                        1.69 
##  3           5           16 male   no                        3.69 
##  4           8           20 female no                        7.58 
##  5           9           19 male   no                        6.69 
##  6           6           15 male   no                        2.69 
##  7           8           10 female no                       -2.42 
##  8          12           11 male   no                       -1.31 
##  9          15           16 male   no                        3.69 
## 10          18            7 male   no                       -5.31 
## # ... with 1,411 more rows
  • Calculating means and SDs for subgroups:
cow %>%
    group_by(sex, volunteer) %>%
    summarise(mean = mean(neuroticism),
              sd = sd(neuroticism))
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex    volunteer  mean    sd
##   <fct>  <fct>     <dbl> <dbl>
## 1 female no         12.2  4.75
## 2 female yes        12.3  4.79
## 3 male   no         10.5  4.75
## 4 male   yes        10.4  5.11
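
The same pattern extends to the standardized scores mentioned at the start of this section. The sketch below is one way to do it, using a made-up scores dataframe (the school, score and valid columns are invented for illustration). It drops invalid rows, then converts each score to a z-score using its own school’s mean and SD. The ungroup() at the end removes the grouping so that later steps apply to the whole dataset again:

```r
library(dplyr)

# Hypothetical data: two schools, one row flagged as invalid
scores = data.frame(
    school = c("A", "A", "A", "B", "B", "B", "B"),
    score  = c(10, 12, 14, 20, 30, 40, 99),
    valid  = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)
)

standardized = scores %>%
    # Drop rows that are flagged as invalid
    filter(valid) %>%
    # Do the next step separately for each school
    group_by(school) %>%
    # Standardize against the school's own mean and SD
    mutate(z = (score - mean(score)) / sd(score)) %>%
    # Remove the grouping again
    ungroup()

standardized$z  # -1 0 1 -1 0 1
```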

Frequency tables with count

count() gives you a straightforward way to create a frequency table. There’s no built-in way to calculate percentages, but you can easily add them using mutate():

cow %>%
    count(sex) %>%
    # count creates a column called 'n'
    mutate(percent = n / sum(n) * 100)
##      sex   n  percent
## 1 female 780 54.89092
## 2   male 641 45.10908

If you have multiple levels of grouping, you need to think about how you want to calculate percentages. To get the percentage who volunteered within each sex:

cow %>%
    count(sex, volunteer) %>%
    group_by(sex) %>%
    mutate(percent = n / sum(n) * 100)
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex    volunteer     n percent
##   <fct>  <fct>     <int>   <dbl>
## 1 female no          431    55.3
## 2 female yes         349    44.7
## 3 male   no          393    61.3
## 4 male   yes         248    38.7
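
To see what the group_by() step changes, it can help to run the same mutate() with and without it. Here’s a sketch on a small made-up table of counts, chosen so the arithmetic is easy to check by hand (the numbers are invented, not taken from the Cowles data):

```r
library(dplyr)

# Hypothetical counts, already in the shape that count() produces
counts = data.frame(
    sex = c("female", "female", "male", "male"),
    volunteer = c("no", "yes", "no", "yes"),
    n = c(30, 10, 45, 15)
)

# Without group_by(): each n is divided by the overall total (100)
overall = counts %>%
    mutate(percent = n / sum(n) * 100)

# With group_by(sex): each n is divided by the total for that sex
# (40 for females, 60 for males)
within_sex = counts %>%
    group_by(sex) %>%
    mutate(percent = n / sum(n) * 100) %>%
    ungroup()

overall$percent     # 30 10 45 15
within_sex$percent  # 75 25 75 25
```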

7.2.3.1 Merging data with left_join()

To combine two dataframes, you can use functions like left_join() and inner_join(). In my experience, left_join() is what you want most of the time. To merge data successfully, all you need is a column present in both datasets that the rows can be matched on, usually a participant ID or something similar.
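
As a quick sketch of the difference between the two, here are two made-up datasets that share an id column (scores, demographics and their contents are invented for illustration):

```r
library(dplyr)

# Hypothetical datasets sharing a participant ID column
scores = data.frame(id = c(1, 2, 3), score = c(10, 12, 9))
demographics = data.frame(id = c(1, 2, 4), age = c(25, 31, 40))

# left_join() keeps every row of the first (left) dataset.
# Participant 3 has no match, so their age is NA
left = scores %>%
    left_join(demographics, by = "id")

# inner_join() keeps only the IDs found in both datasets
inner = scores %>%
    inner_join(demographics, by = "id")

left$id    # 1 2 3
left$age   # 25 31 NA
inner$id   # 1 2
```

With left_join(), nobody from the main dataset disappears just because their extra information is missing, which is usually the safer default.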

Joins can also be useful when you have to calculate a complex summary. You can create a separate table with the summary info, and merge it back into the main dataset. As a simple example, let’s summarize extraversion by both sex and volunteering status and merge it back into the main dataset:

extraversion_info = cow %>%
    group_by(sex, volunteer) %>%
    summarize(mean_extraversion = mean(extraversion))

extraversion_info
## # A tibble: 4 x 3
## # Groups:   sex [2]
##   sex    volunteer mean_extraversion
##   <fct>  <fct>                 <dbl>
## 1 female no                     12.0
## 2 female yes                    12.9
## 3 male   no                     11.9
## 4 male   yes                    12.9
cow %>%
    left_join(extraversion_info, by = c("sex", "volunteer")) %>%
    head()
##   neuroticism extraversion    sex volunteer mean_extraversion
## 1          16           13 female        no          12.00696
## 2           8           14   male        no          11.91349
## 3           5           16   male        no          11.91349
## 4           8           20 female        no          12.00696
## 5           9           19   male        no          11.91349
## 6           6           15   male        no          11.91349