Session 4 Handling data: the Tidyverse
Here we introduce the Tidyverse, in particular the package ‘dplyr’. We emphasise that dplyr and the Tidyverse represent one way of coding in R, and there are alternatives; however, we have chosen to introduce the Tidyverse because we believe it is the most intuitive approach for beginners in R.
First, read and work through Chapter 5 of Wickham & Grolemund. Then follow the ‘check-in’ exercises below to ensure that you have understood.
4.1 dplyr verbs: select, mutate, filter, arrange
While data frame manipulations can be done using the ‘base R’ approaches introduced above, using dplyr (part of the tidyverse collection of packages) can make your code more readable.
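As a rough illustration of the readability point, the sketch below performs the same row and column subsetting twice, once with base R indexing and once with dplyr verbs. It uses the built-in ToothGrowth dataset (our choice for illustration, not part of the exercises), and the object names are hypothetical.

```r
library(dplyr)

# Base R: subset rows and columns with bracket indexing
base_version <- ToothGrowth[ToothGrowth$dose == 2, c("len", "supp")]

# dplyr: the same subsetting expressed with named verbs
dplyr_version <- select(filter(ToothGrowth, dose == 2), len, supp)

# Both approaches return the same values in each column
identical(base_version$len, dplyr_version$len)
identical(base_version$supp, dplyr_version$supp)
```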
We also make use of the pipe ‘%>%’. Recall that writing f(x, y) is the same as writing x %>% f(y). For further help with pipes, see Chapter 18 of Wickham & Grolemund.
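A minimal sketch of this equivalence, assuming dplyr is installed and using the built-in ToothGrowth dataset as a stand-in (the object names are ours):

```r
library(dplyr)  # loading dplyr also makes the %>% pipe available

# The pipe passes its left-hand side as the first argument
# of the function on its right, so these two calls are equivalent.
direct <- head(ToothGrowth, 3)
piped  <- ToothGrowth %>% head(3)

identical(direct, piped)  # TRUE

# Pipes can be chained, which reads top to bottom as a sequence of steps
n_high_dose <- ToothGrowth %>%
  filter(dose == 2) %>%
  nrow()
```

Chaining avoids nesting calls inside one another, which is the main reason the pipe tends to be used alongside dplyr verbs.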
As revision, please work through the following exercises with the iris dataset (which is already available in R: simply type ‘iris’ to see it). For each exercise, start with the original dataset. Remember, if you need help, try ?select, ?mutate, etc.
1. Select only the columns Sepal.Length and Sepal.Width.
2. Arrange the data by increasing Sepal.Length.
3. Filter the data to only include Species setosa.
4. Select the columns Petal.Length and Petal.Width, then make (mutate) a new column Petal.Area as Petal.Length multiplied by Petal.Width, then arrange in order of decreasing petal area.
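If you would like a reminder of how the four verbs fit together before attempting the exercises, here is a short sketch on a different built-in dataset, ToothGrowth, so that the iris exercises remain yours to solve. The derived column len_per_dose is a made-up quantity used purely for illustration.

```r
library(dplyr)

result <- ToothGrowth %>%
  filter(supp == "OJ") %>%                 # keep rows for one supplement type
  select(len, dose) %>%                    # keep only two columns
  mutate(len_per_dose = len / dose) %>%    # add a derived column
  arrange(desc(len_per_dose))              # sort, largest values first

head(result)
```

Wrapping a column in desc() inside arrange() gives decreasing order; without it, arrange() sorts in increasing order.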
4.2 More dplyr verbs: group_by and summarise
Returning to the iris dataset, work through the following exercises to recap your knowledge of group_by and summarise:
1. Group by Species and calculate the mean Petal.Length for each species.
2. Group by Species, then standardise Petal.Length within each species, i.e. subtract the mean and divide by the standard deviation. Hint: your processed dataset should still have 150 rows; you will need to use mutate rather than summarise here.
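The difference between summarise and mutate after group_by can be sketched as follows, again on the built-in ToothGrowth dataset rather than iris. The column names median_len and len_rel are hypothetical, and the calculations are deliberately different from those asked for in the exercises.

```r
library(dplyr)

# summarise() collapses each group to a single row
by_supp <- ToothGrowth %>%
  group_by(supp) %>%
  summarise(median_len = median(len), n = n())

# mutate() after group_by() keeps every row; max(len) is computed
# separately within each supplement group
relative <- ToothGrowth %>%
  group_by(supp) %>%
  mutate(len_rel = len / max(len)) %>%
  ungroup()

nrow(by_supp)   # 2, one row per group
nrow(relative)  # 60, same as the original data
```

Calling ungroup() at the end is optional here, but it avoids later steps being unintentionally computed within groups.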
4.3 Further Reading
For more practice, work through Chapter 3 of the R Bootcamp.
Take the Adventures in R dplyr course.
Read Chapter 12 of R Programming for Data Science for yet more on dplyr.
Follow the Data Wrangling and Summarising Data workshops from Andrew Stewart’s course.