3 Efficient Data Management in R

3.1 Tidyverse

Tidyverse is a collection of R packages designed to help data scientists to make more efficient use of R. It contains the following packages (and several more, which we will:

  • tibble provides a “modern reimagining” of the standard data.frame, in R. Tibbles (or tbl_dfs) are more flexible in terms of what they can store, but (purposefully) less flexible in terms of “sloppy code”.
  • readr provides alternative functions for reading in text data in tabular form. It provides faster and more consistent alternatives to read.table and read.csv.
  • dplyr provides a powerful suite of functions for data manipulation with a focus on allowing for clean and simple code. We will look at dplyr in more detail this week.
  • ggplot2 is a very featureful and systematic set of plotting functions, which we will focus on in this tutorial.
  • lubridate is a very useful package for handling dates and times in R. Dates and times are often tricky to deal with, and lubridate provides many useful functions for efficiently handling these.

3.2 Pipelines

Pipelines are at the centre of all the tidyverse packages. The R package magrittr provides a forward-pipe operator for R.

Suppose we have a function f defined in R

f <- function(x)
  x^2

Then we can apply f to an argument x using

x <- 3
f(x)
## [1] 9

The forward-pipes from magrittr allow us to rewrite this function call as

## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
x %>% f
## [1] 9

instead. The advantage of this alternative notation might not become immediately clear, but its advantage becomes more obvious when looking at nested function calls.

Consider the R data set mtcars, which contains data from the 1974 edition from the US magazine Motor Trend. Suppose we want to convert the fuel consumption to litres per 100 kilometres and then only retain the cars with a fuel economy better than 10 litres per 100 kilometres.

mtcars2 <- transform(mtcars, lp100k=235.21/mpg)
subset(mtcars2, lp100k<=10)
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb   lp100k
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2 9.639754
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1 7.259568
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2 7.737171
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1 6.938348
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1 8.615751
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2 9.046538
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2 7.737171

(If you are wondering where the number of 235.21 comes from: A US gallon is roughly 3.785 litres and a mile is roughly 1.609 kilometres, and \(\frac{100\times 3.785}{1.609}\approx 235.21\))

If we want to perform both steps in one go, we can nest the two calls within one another and use

subset(transform(mtcars, lp100k=235.21/mpg), lp100k<=10)

This gives exactly the same results, but is not very easy to read and understand. It is not easy to see that the argument lp100k<=10 belongs to subset. When nesting function calls, the arguments get increasingly far from the function call to which they belong.

The %>% operator however allows us to write this much more cleanly:

mtcars %>%
  transform(lp100k=235.21/mpg) %>%
  subset(lp100k<=10)

3.2.1 Task

The R code below uses pipelines. Convert it to nested function calls.

rnorm(1000) %>% sin() %>% max()

3.2.2 Solution

The code generates a random sample of size 1000 (from a standard normal distribution), computes the sine of each entry and then takes the maximum.

max(sin(rnorm(1000)))
## [1] 0.9999949

In this case the nested function call is easy to read because every function only takes one argument.

3.2.3 Task

Convert the R code below to pipelines.

library(MASS)                       # Load package MASS, which contains the data
subset(transform(mammals, ratio=brain/body), ratio==max(ratio))

3.2.4 Answer

You can use the following R code using pipelines.

library(MASS)
mammals %>%
  transform(ratio=brain/body) %>%
  subset(ratio==max(ratio))
##                  body brain    ratio
## Ground squirrel 0.101     4 39.60396

Oddly enough, ground squirrels have a higher brain-to-body weight ratio than humans.

3.2.5 Additional Resources

Pipelines for Data Analysis in R Hadley Wickham has produced a series of excellent slides about pipelines, which covers much of what we will look at in this tutorial.

Background reading: Chapter 18 of R for Data Science Chapter 18 of R for Data Science gives a detailed overview of pipes and some of the underpinning technology (though the latter is rather advanced).

3.3 Tibbles

The package tibble provides tbl_df’s (or “tibbles”, which is easier to pronounce). They are a modern take on the built-in class data.frame.

One key advantage of tibbles is that they can store anything. A data.frame can only store a single value per “cell”, for example a number or a character string. However, in a tibble, you can store a list or even another tibble in a cell. An example of this is the tibble starwars from the package dplyr. The column starships contains for each row the list of starships flown by that character (which is a list of different length depending on the character.)

library(dplyr)                             # Load library dplyr which contains the data
starwars[,c("name", "starships")]          # Print columns name and starships
## # A tibble: 87 × 2
##    name               starships
##    <chr>              <list>   
##  1 Luke Skywalker     <chr [2]>
##  2 C-3PO              <chr [0]>
##  3 R2-D2              <chr [0]>
##  4 Darth Vader        <chr [1]>
##  5 Leia Organa        <chr [0]>
##  6 Owen Lars          <chr [0]>
##  7 Beru Whitesun lars <chr [0]>
##  8 R5-D4              <chr [0]>
##  9 Biggs Darklighter  <chr [1]>
## 10 Obi-Wan Kenobi     <chr [5]>
## # … with 77 more rows
starwars[10,"starships"][[1]]              # Starships flown by Obi-Wan
## [[1]]
## [1] "Jedi starfighter"         "Trade Federation cruiser"
## [3] "Naboo star skiff"         "Jedi Interceptor"        
## [5] "Belbullab-22 starfighter"

We could not have stored this information in a data frame. We would have had to either store the information across several data frames or stored the list of starships as a character string.

3.3.1 Creating tibbles

We can create tibbles using the function tibble. We can create the tibble from above using

kidstibble <- tibble(name=c("Sarah", "John"), age=c(4,11), weight=c(15,28),
                     height=c(101,132), gender=c("f", "m"))

In other words, the function tibble assembles a tibble on a column-by-column basis (akin to using cbind).

The function tribble (“transposed tibble”) lets you create a tibble on a row-by-bow basis (akin to using rbind), which is typically more legible when creating a matrix in code.

kidstibble <- tribble(~name,   ~age, ~weight, ~height, ~gender,
                      "Sarah",    4,      15,     101,     "f",
                      "John",    11,      28,     132,     "m")

3.3.2 Working with tibbles

  • Variables/Columns can be accessed and added using tibble$varname (varname needs to be fully spelled out). You can also access a column using tibble[,"varname"] or tibble[["varname"]].
  • Rows can be selected using tibble[rowindices,] (note that you cannot use row names).
  • Individual cells can be accessed using tibble[rowindices, colindices].

3.3.3 Subsetting tibbles always results in a tibble

Tibbles are also more consistent. Subsetting tibbles always results in a tibble.

kidstibble[,1]                            # Result is a tibble
## # A tibble: 2 × 1
##   name 
##   <chr>
## 1 Sarah
## 2 John

In contrast, subsetting a data frame or matrix is not guaranteed to result in a data frame or matrix (unless you use drop=FALSE). If the result is a single column or row, subsetting a data frame or matrix results in a vector.

This “dropping” of the dimension can be very useful when using R interactively, but can be the source of many issues in more complex projects, when programmers incorrectly assume that subsetting a data frame or matrix will always result in another data frame or matrix, rather than possibly just a vector (it is thus a good idea to always use drop=FALSE when working with data frames or matrices in complex projects).

Data Import Cheat Sheet RStudio’s cheat sheet for data import also covers tibbles.

3.4 Reading in data using readr

The package readr contains alternatives to the functions read.table and read.csv. The alternative functions from readr have four main advantages.

  • They read in the data a lot faster and can show a progress bar (though this is only relevant for really big data sets).
  • They store the data straight in a tibble, rather than a data frame.
  • They allow specifying the intended data type for each column and thus make it easier to identify rows which cause problems.
  • They are less intrusive: they don’t automatically convert character strings to factors and do not change column names (read.table and read.csv will for example remove spaces from variable names and replace them by full stops). The functions from readr are also guaranteed to give the same result irrespective of the platform or operating system they are run under.

readr provides the following functions.

  • read_csv reads in comma-separated files. read_csv2 reads in files which are semicolon-separated (common in countries like France or Germany, where a comma is used as decimal separator).
  • read_tsv reads in tab-separated files.
  • read_delim is the most general function (like read.table). The delimiter has to be specified using the argument delim.
  • read_fwd reads in fixed-width files.

All functions assume that the first row contains the column/variable names. If this is not the case, set the optional argument col_names to FALSE or to a character vector containing the intended column names.

The strings used to encode missing values can be specified using the optional argument na.

For example, we can read in the file chol.txt using

library(readr)
read_delim("chol.txt", delim=" ", col_names=c("ldl", "hdl", "trig",
                                              "age", "gender", "smoke"))
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   ldl = col_double(),
##   hdl = col_double(),
##   trig = col_double(),
##   age = col_double(),
##   gender = col_character(),
##   smoke = col_character()
## )
## # A tibble: 13 × 6
##      ldl   hdl  trig   age gender smoke    
##    <dbl> <dbl> <dbl> <dbl> <chr>  <chr>    
##  1   175    25   148    39 female no       
##  2   196    36    92    32 female no       
##  3   139    65    NA    42 male   <NA>     
##  4   162    37   139    30 female ex-smoker
##  5   140   117    59    42 female ex-smoker
##  6   147    51   126    65 female ex-smoker
##  7    82    81    NA    57 male   no       
##  8   165    63   120    48 male   current  
##  9   149    49    NA    32 female no       
## 10    95    54   157    55 female ex-smoker
## 11   169    59    67    48 female no       
## 12   174   117   168    41 female no       
## 13    91    52   146    69 female current

Note that functions from readr show the data type it has used for each column. This makes it easier to spot mistakes like missing values not coded as expected, in which case a numeric column would show up as a character string.

For example, we can read in the file chol.csv using

library(readr)
read_csv("chol.csv", na=".")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   ldl = col_double(),
##   hdl = col_double(),
##   trig = col_double(),
##   age = col_double(),
##   gender = col_character(),
##   smoke = col_character()
## )
## # A tibble: 13 × 6
##      ldl   hdl  trig   age gender smoke    
##    <dbl> <dbl> <dbl> <dbl> <chr>  <chr>    
##  1   175    25   148    39 female no       
##  2   196    36    92    32 female no       
##  3   139    65    NA    42 male   NA       
##  4   162    37   139    30 female ex-smoker
##  5   140   117    59    42 female ex-smoker
##  6   147    51   126    65 female ex-smoker
##  7    82    81    NA    57 male   no       
##  8   165    63   120    48 male   current  
##  9   149    49    NA    32 female no       
## 10    95    54   157    55 female ex-smoker
## 11   169    59    67    48 female no       
## 12   174   117   168    41 female no       
## 13    91    52   146    69 female current

3.4.1 Task

Read the data files cars.csv and ships.txt into R using the functions from readr.

3.4.2 Answer

The first line of the file cars.csv contains the variable names and the fields are separated by commas. Missing values are encoded as asterisks.

cars <- read_csv("cars.csv", na="*")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Manufacturer = col_character(),
##   Model = col_character(),
##   MPG = col_double(),
##   Displacement = col_double(),
##   Horsepower = col_double()
## )
cars
## # A tibble: 20 × 5
##    Manufacturer Model     MPG Displacement Horsepower
##    <chr>        <chr>   <dbl>        <dbl>      <dbl>
##  1 Chevrolet    Camaro     19          3.4        160
##  2 Oldsmobile   Achieva    NA          2.3        155
##  3 Dodge        Spirit     22          2.5        100
##  4 Chevrolet    Astro      NA          4.3        165
##  5 Chevrolet    Corsica    25          2.2        110
##  6 Volkswagen   Corrado    18          2.8        178
##  7 Dodge        Stealth    18          3          300
##  8 Volkswagen   Fox        25          1.8         81
##  9 Cadillac     DeVille    16          4.9        200
## 10 Hyundai      Excel      29          1.5         81
## 11 Toyota       Tercel     32          1.5         82
## 12 Dodge        Colt       29          1.5         92
## 13 Volkswagen   Passat     21          2          134
## 14 Geo          Storm      30          1.6         90
## 15 Toyota       Previa     18          2.4        138
## 16 Nissan       Sentra     29          1.6        110
## 17 Toyota       Celica     25          2.2        135
## 18 Honda        Civic      42          1.5        102
## 19 Dodge        Caravan    17          3          142
## 20 Hyundai      Sonata     20          2          128

We could have also used the function read_delim.

read_delim("cars.csv", delim=",", na="*")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Manufacturer = col_character(),
##   Model = col_character(),
##   MPG = col_double(),
##   Displacement = col_double(),
##   Horsepower = col_double()
## )
## # A tibble: 20 × 5
##    Manufacturer Model     MPG Displacement Horsepower
##    <chr>        <chr>   <dbl>        <dbl>      <dbl>
##  1 Chevrolet    Camaro     19          3.4        160
##  2 Oldsmobile   Achieva    NA          2.3        155
##  3 Dodge        Spirit     22          2.5        100
##  4 Chevrolet    Astro      NA          4.3        165
##  5 Chevrolet    Corsica    25          2.2        110
##  6 Volkswagen   Corrado    18          2.8        178
##  7 Dodge        Stealth    18          3          300
##  8 Volkswagen   Fox        25          1.8         81
##  9 Cadillac     DeVille    16          4.9        200
## 10 Hyundai      Excel      29          1.5         81
## 11 Toyota       Tercel     32          1.5         82
## 12 Dodge        Colt       29          1.5         92
## 13 Volkswagen   Passat     21          2          134
## 14 Geo          Storm      30          1.6         90
## 15 Toyota       Previa     18          2.4        138
## 16 Nissan       Sentra     29          1.6        110
## 17 Toyota       Celica     25          2.2        135
## 18 Honda        Civic      42          1.5        102
## 19 Dodge        Caravan    17          3          142
## 20 Hyundai      Sonata     20          2          128

The first line of the file ships.txt contains the variable names and the fields are separated by whitespace. Missing values are encoded as “.”.

ships <- read_delim("ships.txt", delim=' ' , na=".")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   type = col_character(),
##   year = col_double(),
##   period = col_double(),
##   service = col_double(),
##   incidents = col_double()
## )
ships
## # A tibble: 40 × 5
##    type   year period service incidents
##    <chr> <dbl>  <dbl>   <dbl>     <dbl>
##  1 A        60     60     127         0
##  2 A        60     75      63         0
##  3 A        65     60      NA         3
##  4 A        65     75    1095         4
##  5 A        70     60    1512         6
##  6 A        70     75    3353        18
##  7 A        75     60       0         0
##  8 A        75     75    2244        11
##  9 B        60     60   44882        39
## 10 B        60     75   17176        29
## # … with 30 more rows

3.4.3 Specifying column types

The functions from readr allow specifying the expected column types. This is especially important when writing which will then be run automatically. It provides an easy way of ensuring that the data provided is of the expected format.

The easiest way of specifying expected column types is to provide a character string with each letters standing for a column

Letter Meaning
c character
i integer
n number
d double
l logical
D date
T date time
t time
? guess the type
_ or - skip the column

So for the data file chol.csv we would expect the first four columns to be integers and the latter two to be character strings, so we would use

chol <- read_csv("chol.csv", na=".", col_types="iiiicc")

Specifying the expected column types can help pinpointing problems when reading in data. Suppose we had forgotten that missing values are coded using “.” in this data file. If we use …

chol <- read_csv("chol.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   ldl = col_double(),
##   hdl = col_double(),
##   trig = col_character(),
##   age = col_double(),
##   gender = col_character(),
##   smoke = col_character()
## )

… we can see from the output that trig was read in as a character string, but we do not know why.

However, if we use …

chol <- read_csv("chol.csv", col_types="iiiicc")
## Warning: 3 parsing failures.
## row  col   expected actual       file
##   3 trig an integer      . 'chol.csv'
##   7 trig an integer      . 'chol.csv'
##   9 trig an integer      . 'chol.csv'

… we obtain a warning and can print the problematic rows using

problems(chol)
## # A tibble: 3 × 5
##     row col   expected   actual file      
##   <int> <chr> <chr>      <chr>  <chr>     
## 1     3 trig  an integer .      'chol.csv'
## 2     7 trig  an integer .      'chol.csv'
## 3     9 trig  an integer .      'chol.csv'

The output from problems shows us that for three rows (3, 7 and 9) the data in chol.csv was not of the expected format: a value of . is not compatible with the column being numeric. This makes it easy to identify the cause of the problem (NAs coded as “.”) and rectify the issue.

3.4.4 Additional Resources

Data Import Cheat Sheet RStudio’s cheat sheet for data import also covers readr.

Background reading: Chapter 11 of R for Data Science Chapter 11 of R for Data Science gives a detailed overview of the functions in readr. It explains in some more detail how the functions in readr parse files. over pipes and also covers the functions from readr that dead with writing files.

3.5 Efficient data manipulation using dplyr

In this section we will work with data from Paris’ Vélib’ bicycle sharing system available through JCDecaux’s API for open cycle data.

The data consists of the number of bikes available and the number of bike stands available at every Vélib’ station, recorded every five minutes over six hours on a Tuesday afternoon in October 2017.

The data consists of two tibbles. The first, bikes contains data on the number of available bikes and stands at each station.

Variable Description
name Name of the station
available_bikes Number of available at that time
available_bike_stands Number of available bike stands
time Decimal time for which the number have been recorded

The second, stations contains additional information about each station.

Variable Description
name Unique name of the station
id Internal ID number of the station
address Address of where the station is located
lng GPS coordinate (longitude)
lat GPS coordinate (latitude)
departement Département in which the station is located

You can load the data into R using

library(tibble)
load(url("https://github.com/UofGAnalyticsData/R/raw/main/Week%204/velib"))

3.5.1 Overview: the key functions (“verbs”) for dplyr

Function (“verb”) Description R base equivalent(s)
filter Select observations/rows subset
slice Select observations by row numbers [idx,]
select Select variables/column $ or [,sel]
mutate Create new variables/column transform
arrange Sort observations/rows order
group_by Group observations by variable by or aggregate
summarise Calculate summary statistics by or aggregate

The functions in dplyr are designed to be used with tibbles, but they also work with data frames. When invoked with a data frame, they will return a data frame as long as this is possible.

3.5.2 Selecting observations (rows) using filter and slice

3.5.2.1 filter

The function filter is used to select observations (or rows) in a similar way to the base R function subset.

Suppose we want to print all bike stations in Paris (rather than other départements from Île de France)

library(dplyr)
stations75 <- stations %>%
                filter(departement=="Paris")
stations75
## # A tibble: 743 × 6
##    name                                       id address   lng   lat departement
##    <chr>                                   <dbl> <chr>   <dbl> <dbl> <chr>      
##  1 PORT SOLFERINO (STATION MOBILE)           901 BERGES…  2.32  48.9 Paris      
##  2 QUAI MAURIAC  / PONT DE BERCY             903 FETE D…  2.37  48.8 Paris      
##  3 17/19 PLACE JOFFRE / ECOLE MILITAIRE      904 ECOLE …  2.30  48.9 Paris      
##  4 CONCORDE/BERGES DE SEINE (STATION MOBI…   905 BERGES…  2.32  48.9 Paris      
##  5 PORT DU GROS CAILLOU (STATION MOBILE)     908 BERGES…  2.31  48.9 Paris      
##  6 PONT D'ARCOLE (STATION MOBILE)            909 Voie G…  2.35  48.9 Paris      
##  7 ILE DE LA CITE PONT NEUF                 1001 41 QUA…  2.34  48.9 Paris      
##  8 PLACE DU CHATELET                        1002 14 AVE…  2.35  48.9 Paris      
##  9 RIVOLI SAINT DENIS                       1003 7 RUE …  2.35  48.9 Paris      
## 10 MARGUERITE DE NAVARRE                    1004 12 RUE…  2.35  48.9 Paris      
## # … with 733 more rows

Note the use of a double == to test whether the département is equal to “Paris”.

We can create more complex expressions using the standard logical operators & (“and”), | (“or”) and ! (“not”). Note that you cannot use && and || in this context, as they only work with scalar arguments.

For example, if we want to extract the stations which are in Paris or Hauts-de-Seine we can use

stations7592 <- stations %>%
                  filter(departement=="Paris" | departement=="Hauts-de-Seine")

Rather than using a logical or we could have used %in%:

stations7592 <- stations %>%
                  filter(departement %in% c("Paris" , "Hauts-de-Seine"))

Even though the functions from dplyr are designed to be used with pipelines, you can also provide the data set as first argument:

stations7592 <- filter(stations, departement %in% c("Paris" , "Hauts-de-Seine"))

3.5.2.2 slice

You can use the function slice to select observations based on their row numbers.

stations %>%
  slice(5:7)
## # A tibble: 3 × 6
##   name                                     id address      lng   lat departement
##   <chr>                                 <dbl> <chr>      <dbl> <dbl> <chr>      
## 1 PORT DU GROS CAILLOU (STATION MOBILE)   908 BERGES DE…  2.31  48.9 Paris      
## 2 PONT D'ARCOLE (STATION MOBILE)          909 Voie Geor…  2.35  48.9 Paris      
## 3 ILE DE LA CITE PONT NEUF               1001 41 QUAI D…  2.34  48.9 Paris

selects the observations in rows 5 to 7 and is equivalent to

stations[5:7,]
## # A tibble: 3 × 6
##   name                                     id address      lng   lat departement
##   <chr>                                 <dbl> <chr>      <dbl> <dbl> <chr>      
## 1 PORT DU GROS CAILLOU (STATION MOBILE)   908 BERGES DE…  2.31  48.9 Paris      
## 2 PONT D'ARCOLE (STATION MOBILE)          909 Voie Geor…  2.35  48.9 Paris      
## 3 ILE DE LA CITE PONT NEUF               1001 41 QUAI D…  2.34  48.9 Paris

3.5.3 Task

Identify the stations which had more than 60 bikes available at 3pm (i.e. time taking the value 15).

3.5.4 Answer

You can use the following R code:

bikes %>%
  filter(time==15 & available_bikes>60)
## # A tibble: 6 × 4
##   name                available_bikes available_bike_stands  time
##   <chr>                         <int>                 <int> <dbl>
## 1 MUSÉE D'ORSAY                    63                     2    15
## 2 DUPLEIX                          65                     3    15
## 3 ASSEMBLEE NATIONALE              62                     0    15
## 4 SAINT EMILION                    65                     1    15
## 5 METZ                             63                     1    15
## 6 PRIMO LEVI                       61                     1    15

3.5.5 Selecting variables (columns) using select

The function select can be used to subset the variables (columns) of a data set.

You can either specify the columns to retain or (with a minus) those you do not want to retain.

We can only retain the name and département of each station using either

stations.small <- stations %>%
                     select(name, departement)
stations.small
## # A tibble: 928 × 2
##    name                                      departement
##    <chr>                                     <chr>      
##  1 PORT SOLFERINO (STATION MOBILE)           Paris      
##  2 QUAI MAURIAC  / PONT DE BERCY             Paris      
##  3 17/19 PLACE JOFFRE / ECOLE MILITAIRE      Paris      
##  4 CONCORDE/BERGES DE SEINE (STATION MOBILE) Paris      
##  5 PORT DU GROS CAILLOU (STATION MOBILE)     Paris      
##  6 PONT D'ARCOLE (STATION MOBILE)            Paris      
##  7 ILE DE LA CITE PONT NEUF                  Paris      
##  8 PLACE DU CHATELET                         Paris      
##  9 RIVOLI SAINT DENIS                        Paris      
## 10 MARGUERITE DE NAVARRE                     Paris      
## # … with 918 more rows

or

stations.small <- stations %>% select(-id, -address, -lng, -lat)

You can also use select to change the order of the columns of a data set.

3.5.6 Adding new variables using mutate

The function mutate can be used to create new variables (columns) in a data set. mutate is similar in functionality to the base R function transform.

We can add the total number of stands to the data set bikes using

bikes <- bikes %>%
           mutate(total_stands = available_bikes+available_bike_stands)

More than one new variable can be defined by adding further arguments to mutate.

transmute is a sibling of mutate. Just like mutate it creates new columns. It however also removes all existing columns so that only the new columns remain.

3.5.7 Task

The time is currently encoded as decimal (e.g. 13.5 for 13:30). Create two columns time_hours, which contains the hour (13 in our example), and time_minutes, which contains the minutes, (30 in our example).

You can calculate time_hours as the floor of time (R function floor) and time_minutes as the remainder after integer division of 60 times time by 60 (R operator %%).

3.5.8 Answer

We can create both columns in one call to mutate.

bikes %>%
  mutate(time_hour=floor(time), time_minutes=(60*time)%%60)
## # A tibble: 67,354 × 7
##    name            available_bikes available_bike_…  time total_stands time_hour
##    <chr>                     <int>            <int> <dbl>        <int>     <dbl>
##  1 CHAMPEAUX (BAG…               9               41    13           50        13
##  2 POISSONNIÈRE -…              33                0    13           33        13
##  3 METRO ROME                    6               38    13           44        13
##  4 DE GAULLE (PAN…               2               16    13           18        13
##  5 PARC DE BELLEV…               4               22    13           26        13
##  6 SOLJENITSYNE (…              56                4    13           60        13
##  7 SERRES                        5               18    13           23        13
##  8 PYRAMIDE ARTIL…              14               40    13           54        13
##  9 SAINT GEORGES                12               10    13           22        13
## 10 MUSÉE D'ORSAY                65                0    13           65        13
## # … with 67,344 more rows, and 1 more variable: time_minutes <dbl>

The output does not show the new columns (as they would take the output of a single row to more than one line). We can show them all, for example, if we remove the station name.

bikes %>%
  mutate(time_hour=floor(time), time_minutes=(60*time)%%60) %>%
  select(-name)
## # A tibble: 67,354 × 6
##    available_bikes available_bike_sta…  time total_stands time_hour time_minutes
##              <int>               <int> <dbl>        <int>     <dbl>        <dbl>
##  1               9                  41    13           50        13            0
##  2              33                   0    13           33        13            0
##  3               6                  38    13           44        13            0
##  4               2                  16    13           18        13            0
##  5               4                  22    13           26        13            0
##  6              56                   4    13           60        13            0
##  7               5                  18    13           23        13            0
##  8              14                  40    13           54        13            0
##  9              12                  10    13           22        13            0
## 10              65                   0    13           65        13            0
## # … with 67,344 more rows

Alternatively, we can explicitly invoke the print method of the tibble and ask it to print everything.

bikes %>%
  mutate(time_hour=floor(time), time_minutes=(60*time)%%60) %>%
  print(width=Inf)
## # A tibble: 67,354 × 7
##    name                       available_bikes available_bike_stands  time
##    <chr>                                <int>                 <int> <dbl>
##  1 CHAMPEAUX (BAGNOLET)                     9                    41    13
##  2 POISSONNIÈRE - ENGHIEN                  33                     0    13
##  3 METRO ROME                               6                    38    13
##  4 DE GAULLE (PANTIN)                       2                    16    13
##  5 PARC DE BELLEVILLE (20040)               4                    22    13
##  6 SOLJENITSYNE (PUTEAUX)                  56                     4    13
##  7 SERRES                                   5                    18    13
##  8 PYRAMIDE ARTILLERIE                     14                    40    13
##  9 SAINT GEORGES                           12                    10    13
## 10 MUSÉE D'ORSAY                           65                     0    13
##    total_stands time_hour time_minutes
##           <int>     <dbl>        <dbl>
##  1           50        13            0
##  2           33        13            0
##  3           44        13            0
##  4           18        13            0
##  5           26        13            0
##  6           60        13            0
##  7           23        13            0
##  8           54        13            0
##  9           22        13            0
## 10           65        13            0
## # … with 67,344 more rows

3.5.9 Sorting data sets using arrange

The function arrange can be used to sort a data set by one or more variables. We can sort the data set bikes by the number of available bikes suing

bikes %>%
  arrange(available_bikes)
## # A tibble: 67,354 × 5
##    name                      available_bikes available_bike_…  time total_stands
##    <chr>                               <int>            <int> <dbl>        <int>
##  1 KARMAN (AUBERVILLIERS)                  0                0    13            0
##  2 PIGALLE GERMAIN PILLON                  0               20    13           20
##  3 ROND POINT DES CHAMPS EL…               0                0    13            0
##  4 MONTCALM                                0               47    13           47
##  5 PLACE HENOCQUE VERSION 2                0               34    13           34
##  6 PLACE DES FETES                         0               19    13           19
##  7 MANIN SECRETAN                          0               20    13           20
##  8 MARTINIE (VANVES)                       0               24    13           24
##  9 HORTENSIAS (LES LILAS)                  0               22    13           22
## 10 HAIES REUNION                           0               22    13           22
## # … with 67,344 more rows

You can use the function desc to sort in descending order

bikes %>%
  arrange(desc(available_bikes))
## # A tibble: 67,354 × 5
##    name    available_bikes available_bike_stands  time total_stands
##    <chr>             <int>                 <int> <dbl>        <int>
##  1 DUPLEIX              68                     0  16.2           68
##  2 DUPLEIX              68                     0  16.2           68
##  3 DUPLEIX              67                     1  15.4           68
##  4 DUPLEIX              67                     1  15.5           68
##  5 DUPLEIX              67                     1  15.8           68
##  6 DUPLEIX              67                     1  16.1           68
##  7 DUPLEIX              67                     1  16.3           68
##  8 SAHEL                67                     0  17.6           67
##  9 SAHEL                67                     0  18             67
## 10 SAHEL                67                     0  18.1           67
## # … with 67,344 more rows

3.5.10 Task

Identify the three bike stations that are furthest to the West (i.e. the ones with the smallest longitude lng).

3.5.11 Answer

We first sort the stations by the longitude and the select to top three observations.

stations %>%
  arrange(lng) %>%
  slice(1:3)
## # A tibble: 3 × 6
##   name                            id address               lng   lat departement
##   <chr>                        <dbl> <chr>               <dbl> <dbl> <chr>      
## 1 GARE ROUTIERE ( SAINT CLOUD) 22101 GARE ROUTIERE - AR…  2.22  48.8 Hauts-de-S…
## 2 SELLIER (SURESNES)           21501 RUE DE SAINT CLOUD…  2.23  48.9 Hauts-de-S…
## 3 VERDUN (SURESNES)            21502 18 BIS RUE DE VERD…  2.23  48.9 Hauts-de-S…

We could have also used the function filter and the ranking function min_rank:

stations %>%
  filter(min_rank(lng)<=3)
## # A tibble: 3 × 6
##   name                            id address               lng   lat departement
##   <chr>                        <dbl> <chr>               <dbl> <dbl> <chr>      
## 1 SELLIER (SURESNES)           21501 RUE DE SAINT CLOUD…  2.23  48.9 Hauts-de-S…
## 2 VERDUN (SURESNES)            21502 18 BIS RUE DE VERD…  2.23  48.9 Hauts-de-S…
## 3 GARE ROUTIERE ( SAINT CLOUD) 22101 GARE ROUTIERE - AR…  2.22  48.8 Hauts-de-S…

min_rank returns the rank of the observation when considering the variable given as argument (there are many different ways of computing ranks, see ?min_rank for details.)

However, the latter answer does not show the stations in increasing order of longitude.

3.5.12 Grouping data and calculating group-wise summary statistics: group_by and summarise

Suppose we want to identify the busiest stations in the system in the sense of having, on average, the most bikes taken out (and thus the highest number of available bike stands – this is assuming JCDecaux replenish all bike stations in the same way, which is not quite what is happening in reality; there are better, but more complex, ways of defining “busy”).

To calculate the average number of available bike stands per station we need to first group the data by bike station and then compute the average number of bike stands available

bikes %>% group_by(name) %>%                             # Group by station name
  summarise(avg_stands=mean(available_bike_stands)) %>%  # Calculate averages
  arrange(desc(avg_stands))                              # Sort in descending order
## # A tibble: 928 × 2
##    name                               avg_stands
##    <chr>                                   <dbl>
##  1 PANTIN                                   70.3
##  2 BELLEVILLE (20041)                       65.1
##  3 PLACE ADOLPHE CHERIOUX                   60  
##  4 HIPPODROME D AUTEUIL                     60.0
##  5 RUE DES BOULETS ( COMPLEMENTAIRE )       55  
##  6 PLACE DE LA PORTE DE CHATILLON           54.9
##  7 PORTE DE LA CHAPELLE                     54.1
##  8 CHARMES (FONTENAY SOUS BOIS)             53.5
##  9 PORTE DE MONTROUGE                       53  
## 10 ALLENDE (PANTIN)                         52.9
## # … with 918 more rows

3.5.13 Task

Find the number of bike stations in each département.

You might find the function n() helpful, which returns the number of cases and is the dplyr equivalent of COUNT(*) in SQL (type ?n to get help).

3.5.14 Answer

We can use the following R code:

stations %>% group_by(departement) %>%         # Group by department
  summarise(n_stations=n()) %>%                # Count cases
  arrange(desc(n_stations))                    # Sort in descending order
## # A tibble: 4 × 2
##   departement       n_stations
##   <chr>                  <int>
## 1 Paris                    743
## 2 Hauts-de-Seine            75
## 3 Seine-Saint-Denis         60
## 4 Val-de-Marne              50

#####[/answer]

group_by can be also used to limit the scope of subsequent calls to other functions such as filter, arrange or slice. To make this more concrete, suppose we want to find for each time point the station which the most available bikes. We first have group the data by time and then find the station with the most available bikes.

bikes %>%                                     
  group_by(time) %>%                           # Group by time
  arrange(desc(available_bikes)) %>%           # Sort by bikes within each group
  slice (1)                                    # Return only top one per group
## # A tibble: 73 × 5
## # Groups:   time [73]
##    name          available_bikes available_bike_stands  time total_stands
##    <chr>                   <int>                 <int> <dbl>        <int>
##  1 MUSÉE D'ORSAY              65                     0  13             65
##  2 MUSÉE D'ORSAY              65                     0  13.1           65
##  3 MUSÉE D'ORSAY              65                     0  13.2           65
##  4 MUSÉE D'ORSAY              62                     3  13.2           65
##  5 METZ                       64                     0  13.3           64
##  6 DUPLEIX                    64                     4  13.4           68
##  7 METZ                       64                     0  13.5           64
##  8 MUSÉE D'ORSAY              63                     2  13.6           65
##  9 SAINT EMILION              63                     3  13.7           66
## 10 MUSÉE D'ORSAY              65                     0  13.8           65
## # … with 63 more rows

Alternatively, we can use filter and min_rank:

bikes %>%                                     
  group_by(time) %>%                           # Group by time
  filter(min_rank(desc(available_bikes))==1)   # Find largest in each group
## # A tibble: 92 × 5
## # Groups:   time [73]
##    name                      available_bikes available_bike_…  time total_stands
##    <chr>                               <int>            <int> <dbl>        <int>
##  1 MUSÉE D'ORSAY                          65                0  13             65
##  2 MUSÉE D'ORSAY                          65                0  13.1           65
##  3 MUSÉE D'ORSAY                          65                0  13.2           65
##  4 MUSÉE D'ORSAY                          62                3  13.2           65
##  5 MOUFFETARD EPEE DE BOIS                62                1  13.2           63
##  6 SAINT PLACIDE CHERCHE MI…              62                0  13.2           62
##  7 METZ                                   64                0  13.3           64
##  8 DUPLEIX                                64                4  13.4           68
##  9 METZ                                   64                0  13.4           64
## 10 METZ                                   64                0  13.5           64
## # … with 82 more rows

You might have noticed that the answers differ a little. The reason for this are ties: for example, at 1.15pm the stations at Mussée d’Orsay, Mouffetard Epée de Bois and Sainte Placide Cherche-Midi all had 62 bikes available. The former commands extracts just one of them, whereas the bottom command extracts all three. (You would obtain the same results if you replaced min_rank by row_number, which breaks ties by using in doubt the order in the data set).

3.5.15 Merging (joining) data sets using the join-type functions

Suppose we want to extract the data from bikes relating to bike stations in Hauts-de-Seine only. The table bikes does not however contain any information about the département in which the stations are located. We need to merge the information from the stations and bikes. This can be done using one of the join functions of dplyr. We will use inner_join, which only retains cases if there are corresponding entries in both data sets: this corresponds to the default behaviour of the R function merge.

The join functions will be default use the columns with common names across the two data sets (“natural join”).

bikes %>% inner_join(stations) %>%              # Merge data (using common variable: name)
  filter(departement=="Hauts-de-Seine")
## Joining, by = "name"
## # A tibble: 5,333 × 10
##    name  available_bikes available_bike_…  time total_stands    id address   lng
##    <chr>           <int>            <int> <dbl>        <int> <dbl> <chr>   <dbl>
##  1 SOLJ…              56                4    13           60 28002 BOULEV…  2.25
##  2 DE G…               3               19    13           22 22005 195 AV…  2.26
##  3 NATI…              20                3    13           23 21015 39 RUE…  2.24
##  4 MONT…              20                5    13           25 22011 7 RUE …  2.28
##  5 PETI…              22                0    13           22 21113 2 RUE …  2.30
##  6 GREN…               9               12    13           21 21013 4 AVEN…  2.25
##  7 MART…               0               24    13           24 21703 5-7 AV…  2.29
##  8 MORI…              22                3    13           25 21106 2-4 RU…  2.31
##  9 SELL…              34               17    13           51 21501 RUE DE…  2.23
## 10 VALI…              22                2    13           24 21101 4 RUE …  2.30
## # … with 5,323 more rows, and 2 more variables: lat <dbl>, departement <chr>

We could have specified the column to used to join the data sets manually by adding the argument by="name" (or by=c("name"="name"), which allows using columns with different names in the two data set).

As a side note, in this example, we could have avoided joining the two tables. We could have first extracted the names of the stations in Hauts-de-Seine and then used those to subset the data from bikes (essentially the equivalent of a subquery in SQL):

names92 <- stations %>% filter(departement=="Hauts-de-Seine") %>%
               select(name) 
bikes %>% filter(name %in% names92[[1]]) 
## # A tibble: 5,333 × 5
##    name                      available_bikes available_bike_…  time total_stands
##    <chr>                               <int>            <int> <dbl>        <int>
##  1 SOLJENITSYNE (PUTEAUX)                 56                4    13           60
##  2 DE GAULLE 3 (NEUILLY)                   3               19    13           22
##  3 NATIONALE (BOULOGNE-BILL…              20                3    13           23
##  4 MONTROSIER (NEUILLY)                   20                5    13           25
##  5 PETIT (CLICHY)                         22                0    13           22
##  6 GRENIER (BOULOGNE-BILLAN…               9               12    13           21
##  7 MARTINIE (VANVES)                       0               24    13           24
##  8 MORICE 2 (CLICHY)                      22                3    13           25
##  9 SELLIER (SURESNES)                     34               17    13           51
## 10 VALITON (CLICHY)                       22                2    13           24
## # … with 5,323 more rows

We had to use names92[[1]] to extract the entries of the tibble names92 as a character vector (we could have also used unlist(names92)).

You might notice a small difference in the results returned by the two approaches. The former retains the columns from stations which we have inserted, whereas the latter only contains the columns which bikes contained to start with.

3.5.16 Additional Resources

Data Transformation Cheat Sheet RStudio have put together a very handy and compact cheat sheet for dplyr.

Background reading: Chapter 13 of R for Data Science Chapter 13 of R for Data Science gives a detailed overview of the functions in dplyr.