Session 4 Handling data: the Tidyverse

Here we will introduce the tidyverse, in particular the package ‘dplyr’.

First, read and work through Chapter 5 of Wickham & Grolemund. Then follow the ‘check-in’ exercises below to ensure that you have understood.

4.1 dplyr verbs: select, mutate, filter, arrange

While data frame manipulations can be done using the ‘base R’ approaches introduced above, using dplyr (part of the tidyverse collection of packages) can make your code more readable.

We also make use of the pipe ‘%>%’, which passes the object on its left as the first argument of the function on its right. Recall that writing x %>% f(y) is the same as writing f(x, y). For further help with pipes, see Chapter 18 of Wickham & Grolemund.
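The equivalence between nested and piped calls can be checked directly. The following is a minimal sketch rather than part of the exercises: it assumes the dplyr package is installed and uses the built-in ‘mtcars’ dataset (chosen here purely for illustration) so as not to give away the iris answers.

```r
library(dplyr)  # loading dplyr also makes the %>% pipe available

# Nested form: read from the inside out (filter first, then arrange)
nested <- arrange(filter(mtcars, cyl == 4), mpg)

# Piped form: read from left to right, top to bottom
piped <- mtcars %>%
  filter(cyl == 4) %>%
  arrange(mpg)

# Both forms perform the same computation
identical(nested, piped)
```

The piped form lists the steps in the order in which they are carried out, which is the main reason it tends to be easier to read once several verbs are chained together.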

As revision, please work through the following exercises with the iris dataset (which is built into R, so it is already available: simply type ‘iris’ to see it). For each exercise, start with the original dataset. Remember, if you need help, try ?select, ?mutate, etc.

1. Select only the columns Sepal.Length and Sepal.Width.

2. Arrange the data by increasing Sepal.Length.

3. Filter the data to only include Species setosa.

4. Select the columns Petal.Length and Petal.Width, then make (mutate) a new column Petal.Area as Petal.Length multiplied by Petal.Width, then arrange in order of decreasing petal area.
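If you are unsure how the four verbs fit together, the sketch below chains them on a different dataset. It is an illustration under stated assumptions, not a model answer: it uses the built-in ‘mtcars’ dataset, and the column name ‘hp_per_wt’ and the threshold of 25 are invented for this example.

```r
library(dplyr)

result <- mtcars %>%
  select(mpg, wt, hp) %>%           # keep only these three columns
  mutate(hp_per_wt = hp / wt) %>%   # add a new column computed from existing ones
  filter(mpg > 25) %>%              # keep only the rows satisfying the condition
  arrange(desc(hp_per_wt))          # sort rows, largest hp_per_wt first

result
```

Note that arrange() sorts in increasing order by default; wrapping a column in desc() reverses this.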

4.2 More dplyr verbs: group_by and summarise

Returning to the iris dataset, work through the following exercises to recap your knowledge of group_by and summarise:

1. Group by Species (using group_by) and calculate the mean Petal.Length for each species.

2. Group by Species, then standardise Petal.Length within each species: that is, subtract the species mean and divide by the species standard deviation. Hint: your processed dataset should still have 150 rows; you will need to use mutate rather than summarise here.
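The difference between summarise and mutate after group_by can be seen in the following minimal sketch. As an assumption made for illustration, it uses the built-in ‘mtcars’ dataset (32 rows, three distinct values of ‘cyl’) and only centres a column rather than fully standardising it; the names ‘mean_mpg’ and ‘mpg_centred’ are invented for this example.

```r
library(dplyr)

# summarise() collapses each group to a single row
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

# mutate() keeps every row; mean(mpg) is computed separately within each group
centred <- mtcars %>%
  group_by(cyl) %>%
  mutate(mpg_centred = mpg - mean(mpg)) %>%
  ungroup()  # drop the grouping so that later steps are not grouped by accident

nrow(by_cyl)   # 3: one row per value of cyl
nrow(centred)  # 32: same number of rows as mtcars
```

After a grouped mutate, summary functions such as mean() return the group value, which is repeated for every row in that group.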

4.3 Further Reading