5 Tutorial 5: Data management with tidyverse

After working through Tutorial 5, you’ll…

know how to inspect and change data

For this tutorial, we’ll again use the data set “data_fictional_survey.csv” (via Moodle/Data for R). The data set has already been introduced and explained, so I won’t go into detail here.

survey <- read.csv2("data_fictional_survey.csv", header = TRUE)
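If you do not have the file at hand, the following sketch shows what read.csv2() does and how you can take a first look at the result. The temporary file and the object name survey_demo are made up for illustration; the two rows are taken from the survey data.

```r
# Write a tiny semicolon-separated file so the example runs on its own
# (hypothetical stand-in for data_fictional_survey.csv)
tmp <- tempfile(fileext = ".csv")
writeLines(c("name;age;outlet",
             "Alexandra;20;TV",
             "Alex;25;Online"), tmp)

# read.csv2() expects ";" as the column separator and "," as the decimal mark
survey_demo <- read.csv2(tmp, header = TRUE)

dim(survey_demo)   # number of rows and columns
str(survey_demo)   # column types and first values
head(survey_demo)  # first rows of the data
```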

5.1 What is the tidyverse?

The so-called tidyverse is a popular collection of different R packages especially useful and accessible for R “beginners”⁴.

These include…

tibble: creating data structures like tibbles, which are an enhanced type of data frame
readr, haven, readxl: reading data (e.g. readr for CSV, haven for SPSS, Stata and SAS, readxl for Excel)
tidyr, dplyr: data transformation, modification, and summary statistics
stringr, forcats, lubridate: create special, powerful object types (e.g. stringr for working with text objects, forcats for factors, lubridate for time data)
purrr: programming with R
ggplot2: graphing/charting

The most frequently used packages of the tidyverse can be installed and activated like so (you may have to do so for less frequently used packages separately):

install.packages("tidyverse")
library("tidyverse")

In the following, I’ll give you a short introduction to the tidyverse (especially dplyr for data wrangling), not a comprehensive one. For more detailed sources on this, see

The tidyverse Website, especially Wickham’s book R for data science
Tutorial 12: Tidy data by G. Grolemund and H. Wickham
Tutorial 5: Data management with dplyr by Lara Kobilke

5.2 Tibbles for our data

The tidyverse also comes with a specific data structure: tibbles.

Usually, you can forget all about vectors, lists etc. and just save your data as a tibble. Tidy data stored in a tibble follows three rules:

Columns: Every column is one variable.
Rows: Every row is one observation.
Cells: Every cell contains one single value.

Image: Structure of Tibbles Source: R for Data Science

Let’s have a look at our survey data in tibble format:

survey %>%
  as_tibble()

## # A tibble: 20 x 6
##    name         age date       outlet outlet_use outlet_trust
##    <chr>      <int> <chr>      <chr>       <int>        <int>
##  1 Alexandra     20 09.09.2021 TV              2            5
##  2 Alex          25 08.09.2021 Online          3            5
##  3 Maximilian    29 09.09.2021 Print           4            1
##  4 Moritz        22 06.09.2021 TV              2            2
##  5 Vanessa       25 07.09.2021 Online          1            3
##  6 Andrea        26 09.09.2021 Online          3            4
##  7 Fabienne      26 09.09.2021 TV              3            2
##  8 Fabio         27 09.09.2021 Online          0            1
##  9 Magdalena      8 08.09.2021 Online          1            4
## 10 Tim           26 07.09.2021 TV             NA            2
## 11 Alex          27 09.09.2021 Online         NA            2
## 12 Tobias        26 07.09.2021 Online          2            2
## 13 Michael       25 09.09.2021 Online          3            2
## 14 Sabrina       27 08.09.2021 Online          1            2
## 15 Valentin      29 09.09.2021 TV              1            5
## 16 Tristan       26 09.09.2021 TV              2            5
## 17 Martin        21 09.09.2021 Online          1            2
## 18 Anna          23 08.09.2021 TV              3            3
## 19 Andreas       24 09.09.2021 TV              2            5
## 20 Florian       26 09.09.2021 Online          1            5
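To see what the conversion does, here is a minimal sketch. It assumes a small stand-in data frame (survey_small, a name I made up) built from the first three rows shown above, so it runs without the CSV file.

```r
library(tibble)

# Hypothetical stand-in: the first three rows of the survey data shown above
survey_small <- data.frame(
  name = c("Alexandra", "Alex", "Maximilian"),
  age = c(20L, 25L, 29L),
  outlet = c("TV", "Online", "Print")
)

# Convert the data frame to a tibble
survey_tbl <- as_tibble(survey_small)

# A tibble is still a data frame, with extra classes on top
class(survey_tbl)
is.data.frame(survey_tbl)

# Tibbles can also be built directly, column by column
small_tbl <- tibble(name = c("Alexandra", "Alex"), age = c(20L, 25L))
small_tbl
```

Because a tibble is still a data frame, functions that expect a data frame will usually accept it as well.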

5.3 Pipes for our functions

Another nice feature of the tidyverse is the pipe operator. It originates in the magrittr package and becomes available when you load dplyr, one of the most popular packages belonging to the tidyverse. dplyr contains a lot of really helpful functions for data transformation. For help, see this useful cheat sheet on dplyr.

For data management, dplyr relies on the pipe operator: %>%. The %>% operator allows functions to be applied sequentially to the same object, making it easy to understand transformations step-by-step.

Here, we always call the source object first and then add each transformation step separated by the %>% operator. The pipe operator %>% takes the object on its left and passes it as the first argument to the function on its right. Several steps can be chained this way.

Remember the command from before? That is exactly what we did here:

We (1) take the object survey, we then (2) push it into the pipe %>% and we (3) then use the as_tibble() command to transform the object:

survey %>%
  as_tibble()

Remember: If you want to save the result, you have to save the resulting object, for example like so:

survey_new <- survey %>%
  as_tibble()
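To make the idea of “passing an object on” more concrete, this small sketch compares a nested call with the same steps written as a pipe. The vector ages is just an illustration using the first four ages from the survey.

```r
library(dplyr)

ages <- c(20, 25, 29, 22)  # first four ages from the survey output above

# Nested call: has to be read from the inside out
nested <- round(sqrt(mean(ages)), 1)

# Piped call: reads from top to bottom, one step per line
piped <- ages %>%
  mean() %>%
  sqrt() %>%
  round(1)

# Both versions give the same result
identical(nested, piped)
```

Note that round(1) in the pipe is the same as round(x, 1): the piped object fills the first argument, and everything you write in the parentheses comes after it.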

5.4 Data management with dplyr

dplyr comes with five main functions:

select(): select variables column by column, i.e. pick columns / variables by their names
filter(): filter observations row by row, i.e. pick observations by their values
arrange(): sort / reorder data in ascending or descending order
mutate(): calculate new variables or transform existing ones
summarize(): summarize variables (e.g. mean, standard deviation, etc.), best combined with group_by()
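As a rough preview of how these five functions work together, here is a sketch that chains all of them in one pipe. It assumes a small stand-in data frame (survey_small, a made-up name) with the first six rows of the survey, and the column names high_trust and n_high_trust are invented for this example.

```r
library(dplyr)

# Hypothetical stand-in: the first six rows of the survey data shown earlier
survey_small <- data.frame(
  name = c("Alexandra", "Alex", "Maximilian", "Moritz", "Vanessa", "Andrea"),
  age = c(20L, 25L, 29L, 22L, 25L, 26L),
  outlet = c("TV", "Online", "Print", "TV", "Online", "Online"),
  outlet_trust = c(5L, 5L, 1L, 2L, 3L, 4L)
)

result <- survey_small %>%
  select(age, outlet, outlet_trust) %>%       # keep three columns
  filter(age > 21) %>%                        # drop respondents aged 21 or younger
  mutate(high_trust = outlet_trust >= 4) %>%  # add a logical helper column
  group_by(outlet) %>%                        # form one group per outlet
  summarize(mean_age = mean(age),             # collapse each group to one row
            n_high_trust = sum(high_trust)) %>%
  arrange(desc(mean_age))                     # group with the highest mean age first
result
```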

5.4.1 Select specific variables

You will frequently encounter large data sets including hundreds of variables. The first problem in this scenario is narrowing down the variables you are truly interested in. select() helps you to easily choose a suitable subset of variables. In this selection process, the name of the data frame is the source object, followed by the pipe %>% operator. The expression that selects the columns that you are interested in comes after that.

Using the survey, we may for example want to only include specific variables for a reduced data set.

Let’s assume you want to reduce the object survey to the variables name and age using select:

survey_reduced <- survey %>%
  select(name, age)
survey_reduced

##          name age
## 1   Alexandra  20
## 2        Alex  25
## 3  Maximilian  29
## 4      Moritz  22
## 5     Vanessa  25
## 6      Andrea  26
## 7    Fabienne  26
## 8       Fabio  27
## 9   Magdalena   8
## 10        Tim  26
## 11       Alex  27
## 12     Tobias  26
## 13    Michael  25
## 14    Sabrina  27
## 15   Valentin  29
## 16    Tristan  26
## 17     Martin  21
## 18       Anna  23
## 19    Andreas  24
## 20    Florian  26

You can also delete columns by making a reverse selection with the - symbol. This means that you select all columns except the one whose name you specify.

survey_reduced <- survey %>%
  select(-name)
survey_reduced

##    age       date outlet outlet_use outlet_trust
## 1   20 09.09.2021     TV          2            5
## 2   25 08.09.2021 Online          3            5
## 3   29 09.09.2021  Print          4            1
## 4   22 06.09.2021     TV          2            2
## 5   25 07.09.2021 Online          1            3
## 6   26 09.09.2021 Online          3            4
## 7   26 09.09.2021     TV          3            2
## 8   27 09.09.2021 Online          0            1
## 9    8 08.09.2021 Online          1            4
## 10  26 07.09.2021     TV         NA            2
## 11  27 09.09.2021 Online         NA            2
## 12  26 07.09.2021 Online          2            2
## 13  25 09.09.2021 Online          3            2
## 14  27 08.09.2021 Online          1            2
## 15  29 09.09.2021     TV          1            5
## 16  26 09.09.2021     TV          2            5
## 17  21 09.09.2021 Online          1            2
## 18  23 08.09.2021     TV          3            3
## 19  24 09.09.2021     TV          2            5
## 20  26 09.09.2021 Online          1            5
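With many variables, typing every name becomes tedious. The sketch below shows a few shortcuts that select() offers: the helper starts_with(), ranges of neighbouring columns, and dropping several columns at once. It assumes a stand-in data frame (survey_small, a made-up name) with the first three rows of the survey.

```r
library(dplyr)

# Hypothetical stand-in: the first three rows of the survey data
survey_small <- data.frame(
  name = c("Alexandra", "Alex", "Maximilian"),
  age = c(20L, 25L, 29L),
  date = c("09.09.2021", "08.09.2021", "09.09.2021"),
  outlet = c("TV", "Online", "Print"),
  outlet_use = c(2L, 3L, 4L),
  outlet_trust = c(5L, 5L, 1L)
)

# All columns whose names start with "outlet"
outlet_cols <- survey_small %>% select(starts_with("outlet"))
names(outlet_cols)

# A range of neighbouring columns, from name to date
first_cols <- survey_small %>% select(name:date)
names(first_cols)

# Drop several columns at once
no_id <- survey_small %>% select(-c(name, date))
names(no_id)
```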

5.5 Select specific observations

filter() keeps only those observations whose values meet the conditions you specify. The name of the data frame is the source object, followed by the pipe %>% operator. Then follow the expressions that filter the data.

Let’s say we want to include only those observations from the object survey where respondents are older than 21 years, using filter().

survey_reduced <- survey %>% 
  filter(age > 21)
str(survey)

## 'data.frame':    20 obs. of  6 variables:
##  $ name        : chr  "Alexandra" "Alex" "Maximilian" "Moritz" ...
##  $ age         : int  20 25 29 22 25 26 26 27 8 26 ...
##  $ date        : chr  "09.09.2021" "08.09.2021" "09.09.2021" "06.09.2021" ...
##  $ outlet      : chr  "TV" "Online" "Print" "TV" ...
##  $ outlet_use  : int  2 3 4 2 1 3 3 0 1 NA ...
##  $ outlet_trust: int  5 5 1 2 3 4 2 1 4 2 ...

str(survey_reduced)

## 'data.frame':    17 obs. of  6 variables:
##  $ name        : chr  "Alex" "Maximilian" "Moritz" "Vanessa" ...
##  $ age         : int  25 29 22 25 26 26 27 26 27 26 ...
##  $ date        : chr  "08.09.2021" "09.09.2021" "06.09.2021" "07.09.2021" ...
##  $ outlet      : chr  "Online" "Print" "TV" "Online" ...
##  $ outlet_use  : int  3 4 2 1 3 3 0 NA NA 2 ...
##  $ outlet_trust: int  5 1 2 3 4 2 1 2 2 2 ...

We can even build more complicated filters: Let’s only include respondents who are older than 21 years and say they mainly use TV:

survey_reduced <- survey %>% 
  filter(age > 21 & outlet == "TV")
str(survey)

## 'data.frame':    20 obs. of  6 variables:
##  $ name        : chr  "Alexandra" "Alex" "Maximilian" "Moritz" ...
##  $ age         : int  20 25 29 22 25 26 26 27 8 26 ...
##  $ date        : chr  "09.09.2021" "08.09.2021" "09.09.2021" "06.09.2021" ...
##  $ outlet      : chr  "TV" "Online" "Print" "TV" ...
##  $ outlet_use  : int  2 3 4 2 1 3 3 0 1 NA ...
##  $ outlet_trust: int  5 5 1 2 3 4 2 1 4 2 ...

str(survey_reduced)

## 'data.frame':    7 obs. of  6 variables:
##  $ name        : chr  "Moritz" "Fabienne" "Tim" "Valentin" ...
##  $ age         : int  22 26 26 29 26 23 24
##  $ date        : chr  "06.09.2021" "09.09.2021" "07.09.2021" "09.09.2021" ...
##  $ outlet      : chr  "TV" "TV" "TV" "TV" ...
##  $ outlet_use  : int  2 3 NA 1 2 3 2
##  $ outlet_trust: int  2 2 2 5 5 3 5
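Besides & (“and”), a few other building blocks come up often in filters. The following sketch shows | (“or”), %in% for matching against a set of values, and is.na() for missing values. It assumes a stand-in data frame (survey_small, a made-up name) with rows 8 to 11 of the survey, because these include missing values.

```r
library(dplyr)

# Hypothetical stand-in: rows 8 to 11 of the survey data
survey_small <- data.frame(
  name = c("Fabio", "Magdalena", "Tim", "Alex"),
  age = c(27L, 8L, 26L, 27L),
  outlet = c("Online", "Online", "TV", "Online"),
  outlet_use = c(0L, 1L, NA, NA)
)

# OR: respondents who are younger than 18 or mainly use TV
young_or_tv <- survey_small %>% filter(age < 18 | outlet == "TV")

# %in%: outlet matches any value in a set
tv_or_print <- survey_small %>% filter(outlet %in% c("TV", "Print"))

# Keep only rows where outlet_use is not missing
complete_use <- survey_small %>% filter(!is.na(outlet_use))

# Rows for which the condition evaluates to NA are dropped as well
used <- survey_small %>% filter(outlet_use > 0)
```

The last filter is worth remembering: Tim and Alex disappear from the result even though we never asked for missing values to be removed.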

5.6 Arrange data

arrange() changes the order of observations (either ascending or descending). By default, arrange() will sort in ascending order, i.e. from 1:100 (numeric vector) and from A:Z (character vector). arrange() must always be applied to at least one column that is to be sorted.

Let’s sort our respondents by name:

survey %>%
  arrange(name)

##          name age       date outlet outlet_use outlet_trust
## 1        Alex  25 08.09.2021 Online          3            5
## 2        Alex  27 09.09.2021 Online         NA            2
## 3   Alexandra  20 09.09.2021     TV          2            5
## 4      Andrea  26 09.09.2021 Online          3            4
## 5     Andreas  24 09.09.2021     TV          2            5
## 6        Anna  23 08.09.2021     TV          3            3
## 7    Fabienne  26 09.09.2021     TV          3            2
## 8       Fabio  27 09.09.2021 Online          0            1
## 9     Florian  26 09.09.2021 Online          1            5
## 10  Magdalena   8 08.09.2021 Online          1            4
## 11     Martin  21 09.09.2021 Online          1            2
## 12 Maximilian  29 09.09.2021  Print          4            1
## 13    Michael  25 09.09.2021 Online          3            2
## 14     Moritz  22 06.09.2021     TV          2            2
## 15    Sabrina  27 08.09.2021 Online          1            2
## 16        Tim  26 07.09.2021     TV         NA            2
## 17     Tobias  26 07.09.2021 Online          2            2
## 18    Tristan  26 09.09.2021     TV          2            5
## 19   Valentin  29 09.09.2021     TV          1            5
## 20    Vanessa  25 07.09.2021 Online          1            3
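For descending order, wrap the column in desc(). You can also sort by more than one column. A minimal sketch, assuming a stand-in data frame (survey_small, a made-up name) with the first five rows of the survey:

```r
library(dplyr)

# Hypothetical stand-in: the first five rows of the survey data
survey_small <- data.frame(
  name = c("Alexandra", "Alex", "Maximilian", "Moritz", "Vanessa"),
  age = c(20L, 25L, 29L, 22L, 25L),
  outlet = c("TV", "Online", "Print", "TV", "Online")
)

# Oldest respondents first
by_age_desc <- survey_small %>% arrange(desc(age))
by_age_desc

# Sort by outlet first; within each outlet, sort by descending age
by_outlet_age <- survey_small %>% arrange(outlet, desc(age))
by_outlet_age
```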

5.7 Change values

Often you want to add new columns to a data set, e.g. when you calculate new variables or when you want to store re-coded values of other variables. With mutate(), new columns will be added to the end of your data frame, while replace() replaces existing values in an existing variable.

For example, we can create a new variable that indicates whether respondents mostly use “traditional news media” (TV, Print) or “online media” (Online) like so:

survey %>%
  mutate(traditional = "yes",
         traditional = replace(traditional,
                               outlet == "Online",
                               "no")) %>%
  head()

##         name age       date outlet outlet_use outlet_trust traditional
## 1  Alexandra  20 09.09.2021     TV          2            5         yes
## 2       Alex  25 08.09.2021 Online          3            5          no
## 3 Maximilian  29 09.09.2021  Print          4            1         yes
## 4     Moritz  22 06.09.2021     TV          2            2         yes
## 5    Vanessa  25 07.09.2021 Online          1            3          no
## 6     Andrea  26 09.09.2021 Online          3            4          no
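The same yes/no variable can also be written in one step with dplyr’s if_else(). This is an alternative sketch, not the approach used above; it assumes a stand-in data frame (survey_small, a made-up name) with the first three rows of the survey.

```r
library(dplyr)

# Hypothetical stand-in: the first three rows of the survey data
survey_small <- data.frame(
  name = c("Alexandra", "Alex", "Maximilian"),
  outlet = c("TV", "Online", "Print")
)

# if_else(condition, value if TRUE, value if FALSE)
result <- survey_small %>%
  mutate(traditional = if_else(outlet == "Online", "no", "yes"))
result
```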

Another example: Let’s say that we want to transform the variable outlet_trust in the object survey.

Remember: The variable indicates how much each student trusts a specific media outlet described in the variable outlet (from 1 = not at all to 5 = very much).

Instead of numeric values ranging from 1 to 5, we want to create a new variable named outlet_trust_new, which should include:

“low trust” instead of the numeric values 1 and 2
“medium trust” instead of the numeric value 3
“high trust” instead of the numeric values 4 and 5

We use mutate() and replace() to create new variables based on our existing data.

# get object
survey %>%
  # create empty variable
  mutate(outlet_trust_new = NA) %>%

  # transform trust scores
  mutate(outlet_trust_new = replace(outlet_trust_new, 
                                    outlet_trust<=2, 
                                    "low trust"),
         
         outlet_trust_new = replace(outlet_trust_new, 
                                    outlet_trust==3, 
                                    "medium trust"),
         
         outlet_trust_new = replace(outlet_trust_new, 
                                    outlet_trust>=4, 
                                    "high trust")) %>%
  head()

##         name age       date outlet outlet_use outlet_trust outlet_trust_new
## 1  Alexandra  20 09.09.2021     TV          2            5       high trust
## 2       Alex  25 08.09.2021 Online          3            5       high trust
## 3 Maximilian  29 09.09.2021  Print          4            1        low trust
## 4     Moritz  22 06.09.2021     TV          2            2        low trust
## 5    Vanessa  25 07.09.2021 Online          1            3     medium trust
## 6     Andrea  26 09.09.2021 Online          3            4       high trust
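When a variable is recoded into several categories, dplyr’s case_when() lets you write all conditions in one place. The sketch below is an alternative to the replace() approach above, assuming a stand-in data frame (survey_small, a made-up name) with the first six rows of the survey.

```r
library(dplyr)

# Hypothetical stand-in: the first six rows of the survey data
survey_small <- data.frame(
  name = c("Alexandra", "Alex", "Maximilian", "Moritz", "Vanessa", "Andrea"),
  outlet_trust = c(5L, 5L, 1L, 2L, 3L, 4L)
)

# Each line reads: condition ~ value to assign when the condition is TRUE
recoded <- survey_small %>%
  mutate(outlet_trust_new = case_when(
    outlet_trust <= 2 ~ "low trust",
    outlet_trust == 3 ~ "medium trust",
    outlet_trust >= 4 ~ "high trust"
  ))
recoded
```

Observations that match none of the conditions (for example missing values) receive NA in the new variable.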

5.8 Summarize values

Lastly, the summarize() function collapses a data frame into a single row that shows you summary statistics about your variables.

We can, for example, calculate the average age of respondents like so:

survey %>%
  summarize(mean_age = mean(age, na.rm = TRUE))

##   mean_age
## 1     24.4

The summarize() function grows especially powerful when it is combined with group_by() to display summary statistics for groups.

We can, for example, calculate the average age of respondents dependent on which outlet they say they use most:

survey %>%
  group_by(outlet) %>%
  summarize(mean_age = mean(age, na.rm = TRUE))

## # A tibble: 3 x 2
##   outlet mean_age
##   <chr>     <dbl>
## 1 Online     23.9
## 2 Print      29  
## 3 TV         24.5
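You can also request several statistics at once, for example the group size via n() together with the mean and standard deviation. A minimal sketch, assuming a stand-in data frame (survey_small, a made-up name) with the first six rows of the survey:

```r
library(dplyr)

# Hypothetical stand-in: the first six rows of the survey data
survey_small <- data.frame(
  age = c(20L, 25L, 29L, 22L, 25L, 26L),
  outlet = c("TV", "Online", "Print", "TV", "Online", "Online")
)

stats <- survey_small %>%
  group_by(outlet) %>%
  summarize(n = n(),                          # number of respondents per group
            mean_age = mean(age, na.rm = TRUE),
            sd_age = sd(age, na.rm = TRUE))   # NA for groups with one respondent
stats
```

In this small example, the group Print contains a single respondent, so its standard deviation is NA.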

💡 Take Aways

pipe data for subsequent transformation: %>%
select variables: select()
select observations: filter()
arrange data: arrange()
transform values: mutate(), replace()
summarize values: summarize(), group_by()

📚 More tutorials on this

You still have questions? The following tutorials & papers can help you with that:

📌 Test your knowledge

You’ve worked through all the material of Tutorial 5? Let’s see it - the following tasks will test your knowledge.

Since it is almost Halloween season, we’ll work with a data set from FiveThirtyEight on The Ultimate Halloween Candy Power Ranking.

In short, the data contains an online survey in which participants were asked to choose their favorite out of two different candy types. The data is accessible via a Creative Commons Attribution 4.0 International License here. You’ll have to copy the data into a “.txt” file. Alternatively, you can download the data from Moodle/Data for R (file called data_halloween.txt).

The data includes a range of variables, including

the name of each candy bar: competitorname
whether the candy bar contains chocolate: chocolate
whether the candy bar contains fruit flavor: fruity
whether the candy bar contains caramel: caramel
the unit price percentile the candy bar is in compared to the rest of the set: pricepercent

Task 5.1

Read the data set into R. Writing the corresponding R code, find out

how many observations and how many variables the data set contains.

Task 5.2

Writing the corresponding R code, find out

how many candy bars contain chocolate.
how many candy bars contain fruit flavor.

Task 5.3

Writing the corresponding R code, find out

the name(s) of candy bars containing both chocolate and fruit flavor.

Task 5.4

Create a new data frame called data_new. Writing the corresponding R code,

reduce the data set to only those observations containing chocolate but not caramel. The data set should also only include the variables competitorname and pricepercent.
round the variable pricepercent to two decimals.
sort the data by pricepercent in descending order, i.e., make sure that candy bars with the highest price are on top of the data frame and those with the lowest price on the bottom.

The corresponding data frame should look like this:

##                 competitorname pricepercent
## 1              Nestle Smarties         0.98
## 2            Hershey's Krackel         0.92
## 3     Hershey's Milk Chocolate         0.92
## 4       Hershey's Special Dark         0.92
## 5                  Mr Good Bar         0.92
## 6                       Mounds         0.86
## 7                     Whoppers         0.85
## 8                   Almond Joy         0.77
## 9          Nestle Butterfinger         0.77
## 10               Nestle Crunch         0.77
## 11         Peanut butter M&M's         0.65
## 12                       M&M's         0.65
## 13                 Peanut M&Ms         0.65
## 14   Reese's Peanut Butter cup         0.65
## 15              Reese's pieces         0.65
## 16 Reese's stuffed with pieces         0.65
## 17                3 Musketeers         0.51
## 18             Charleston Chew         0.51
## 19                Junior Mints         0.51
## 20                     Kit Kat         0.51
## 21        Tootsie Roll Juniors         0.51
## 22                 Tootsie Pop         0.32
## 23     Tootsie Roll Snack Bars         0.32
## 24          Reese's Miniatures         0.28
## 25            Hershey's Kisses         0.09
## 26                     Sixlets         0.08
## 27        Tootsie Roll Midgies         0.01

This is where you’ll find solutions for Tutorial 5. Let’s keep going: Tutorial 6: Intro to Scraping.

Though I never met anyone who “finished” R, so remember that we are all beginners in R - or that we are all learning R, but are at different stages of doing so.↩︎