3 Efficient Data Management in R
3.1 Tidyverse
Tidyverse is a collection of R packages designed to help data scientists make more efficient use of R. It contains the following packages (and several more, which we will not cover here):
- tibble provides a “modern reimagining” of the standard data.frame in R. Tibbles (or tbl_dfs) are more flexible in terms of what they can store, but (purposefully) less flexible in terms of “sloppy code”.
- readr provides alternative functions for reading in text data in tabular form. It provides faster and more consistent alternatives to read.table and read.csv.
- dplyr provides a powerful suite of functions for data manipulation with a focus on allowing for clean and simple code. We will look at dplyr in more detail this week.
- ggplot2 is a very featureful and systematic set of plotting functions, which we will focus on in this tutorial.
- lubridate is a very useful package for handling dates and times in R. Dates and times are often tricky to deal with, and lubridate provides many useful functions for efficiently handling these.
3.2 Pipelines
Pipelines are at the centre of all the tidyverse packages. The R package magrittr provides a forward-pipe operator for R.
Suppose we have a function f defined in R:
f <- function(x) x^2
Then we can apply f to an argument x using
x <- 3
f(x)
## [1] 9
The forward pipe from magrittr allows us to rewrite this function call as
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
x %>% f
## [1] 9
instead. The advantage of this alternative notation might not be immediately clear, but it becomes more obvious when looking at nested function calls.
Consider the R data set mtcars, which contains data from the 1974 edition of the US magazine Motor Trend. Suppose we want to convert the fuel consumption to litres per 100 kilometres and then only retain the cars with a fuel economy better than 10 litres per 100 kilometres.
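In base R we can do this in two steps, first adding a new column and then subsetting (a sketch reconstructed to match the output below; the constant 235.21 is explained underneath):
mtcars2 <- transform(mtcars, lp100k = 235.21 / mpg)  # fuel consumption in l/100km
subset(mtcars2, lp100k <= 10)                        # keep only the economical cars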
## mpg cyl disp hp drat wt qsec vs am gear carb lp100k
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 9.639754
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 7.259568
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 7.737171
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 6.938348
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 8.615751
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 9.046538
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 7.737171
(If you are wondering where the number of 235.21 comes from: A US gallon is roughly 3.785 litres and a mile is roughly 1.609 kilometres, and \(\frac{100\times 3.785}{1.609}\approx 235.21\))
If we want to perform both steps in one go, we can nest the two calls within one another and use
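something along these lines (a sketch equivalent to the two-step version above):
subset(transform(mtcars, lp100k = 235.21 / mpg), lp100k <= 10)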
This gives exactly the same results, but is not very easy to read and understand. It is not easy to see that the argument lp100k<=10 belongs to subset. When nesting function calls, the arguments get increasingly far from the function call to which they belong.
The %>% operator however allows us to write this much more cleanly:
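(a sketch of the piped version, not the original code)
mtcars %>%
  transform(lp100k = 235.21 / mpg) %>%
  subset(lp100k <= 10)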
3.2.2 Solution
The code generates a random sample of size 1000 (from a standard normal distribution), computes the sine of each entry and then takes the maximum.
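The code being described will have looked something like the following nested call (a sketch; the piped equivalent would be rnorm(1000) %>% sin %>% max):
max(sin(rnorm(1000)))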
## [1] 0.9999949
In this case the nested function call is easy to read because every function only takes one argument.
3.2.4 Answer
You can use the following R code using pipelines.
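The exact code is not reproduced here; the following sketch, which assumes the mammals data set from the package MASS (brain weight in g, body weight in kg), produces the output shown:
MASS::mammals %>%
  transform(ratio = brain / body) %>%  # brain-to-body weight ratio
  subset(ratio == max(ratio))          # species with the largest ratio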
## body brain ratio
## Ground squirrel 0.101 4 39.60396
Oddly enough, ground squirrels have a higher brain-to-body weight ratio than humans.
3.2.5 Additional Resources
Pipelines for Data Analysis in R Hadley Wickham has produced a series of excellent slides about pipelines, which covers much of what we will look at in this tutorial.
Background reading: Chapter 18 of R for Data Science Chapter 18 of R for Data Science gives a detailed overview of pipes and some of the underpinning technology (though the latter is rather advanced).
3.3 Tibbles
The package tibble provides tbl_dfs (or “tibbles”, which is easier to pronounce). They are a modern take on the built-in class data.frame.
One key advantage of tibbles is that they can store anything. A data.frame can only store a single value per “cell”, for example a number or a character string. In a tibble, however, you can store a list or even another tibble in a cell. An example of this is the tibble starwars from the package dplyr. The column starships contains, for each row, the list of starships flown by that character (a list whose length differs from character to character).
library(dplyr) # Load library dplyr which contains the data
starwars[,c("name", "starships")] # Print columns name and starships
## # A tibble: 87 × 2
## name starships
## <chr> <list>
## 1 Luke Skywalker <chr [2]>
## 2 C-3PO <chr [0]>
## 3 R2-D2 <chr [0]>
## 4 Darth Vader <chr [1]>
## 5 Leia Organa <chr [0]>
## 6 Owen Lars <chr [0]>
## 7 Beru Whitesun lars <chr [0]>
## 8 R5-D4 <chr [0]>
## 9 Biggs Darklighter <chr [1]>
## 10 Obi-Wan Kenobi <chr [5]>
## # … with 77 more rows
starwars[10,"starships"][[1]] # Starships flown by Obi-Wan
## [[1]]
## [1] "Jedi starfighter" "Trade Federation cruiser"
## [3] "Naboo star skiff" "Jedi Interceptor"
## [5] "Belbullab-22 starfighter"
We could not have stored this information in a data frame. We would have had to either store the information across several data frames or store the list of starships as a character string.
3.3.1 Creating tibbles
We can create tibbles using the function tibble. We can create the tibble from above using
kidstibble <- tibble(name=c("Sarah", "John"), age=c(4,11), weight=c(15,28),
height=c(101,132), gender=c("f", "m"))
In other words, the function tibble assembles a tibble on a column-by-column basis (akin to using cbind).
The function tribble (“transposed tibble”) lets you create a tibble on a row-by-row basis (akin to using rbind), which is typically more legible when entering a small data set in code.
kidstibble <- tribble(~name, ~age, ~weight, ~height, ~gender,
"Sarah", 4, 15, 101, "f",
"John", 11, 28, 132, "m")
3.3.2 Working with tibbles
- Variables/columns can be accessed and added using tibble$varname (varname needs to be fully spelled out). You can also access a column using tibble[,"varname"] or tibble[["varname"]].
- Rows can be selected using tibble[rowindices,] (note that you cannot use row names).
- Individual cells can be accessed using tibble[rowindices, colindices].
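For example, using the kidstibble created above:
kidstibble$age            # the column age as a vector
kidstibble[["weight"]]    # the column weight as a vector
kidstibble[, "name"]      # the column name as a one-column tibble
kidstibble[2, ]           # the second row
kidstibble[1, "height"]   # a single cell (returned as a 1x1 tibble)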
3.3.3 Subsetting tibbles always results in a tibble
Tibbles are also more consistent. Subsetting tibbles always results in a tibble.
kidstibble[,1] # Result is a tibble
## # A tibble: 2 × 1
## name
## <chr>
## 1 Sarah
## 2 John
In contrast, subsetting a data frame or matrix is not guaranteed to result in a data frame or matrix (unless you use drop=FALSE): if the result is a single column or row, subsetting a data frame or matrix returns a vector.
This “dropping” of the dimension can be very useful when using R interactively, but it can be the source of many issues in more complex projects, when programmers incorrectly assume that subsetting a data frame or matrix will always result in another data frame or matrix rather than possibly just a vector. It is thus a good idea to always use drop=FALSE when working with data frames or matrices in complex projects.
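To illustrate the difference (a small sketch using the kids data):
kidsframe <- as.data.frame(kidstibble)
kidsframe[, "name"]                 # dimension dropped: a character vector
kidsframe[, "name", drop = FALSE]   # a one-column data frame
kidstibble[, "name"]                # a tibble: the dimension is never dropped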
Data Import Cheat Sheet RStudio’s cheat sheet for data import also covers tibbles.
3.4 Reading in data using readr
The package readr contains alternatives to the functions read.table and read.csv. The alternative functions from readr have four main advantages.
- They read in the data a lot faster and can show a progress bar (though this is only relevant for really big data sets).
- They store the data straight in a tibble, rather than a data frame.
- They allow specifying the intended data type for each column and thus make it easier to identify rows which cause problems.
- They are less intrusive: they don’t automatically convert character strings to factors and do not change column names (read.table and read.csv will, for example, remove spaces from variable names and replace them by full stops). The functions from readr are also guaranteed to give the same result irrespective of the platform or operating system they are run under.
readr provides the following functions.
- read_csv reads in comma-separated files. read_csv2 reads in files which are semicolon-separated (common in countries like France or Germany, where a comma is used as decimal separator).
- read_tsv reads in tab-separated files.
- read_delim is the most general function (like read.table). The delimiter has to be specified using the argument delim.
- read_fwf reads in fixed-width files.
All functions assume that the first row contains the column/variable names. If this is not the case, set the optional argument col_names to FALSE or to a character vector containing the intended column names.
The strings used to encode missing values can be specified using the optional argument na.
For example, we can read in the file chol.txt using
library(readr)
read_delim("chol.txt", delim=" ", col_names=c("ldl", "hdl", "trig",
"age", "gender", "smoke"))
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## ldl = col_double(),
## hdl = col_double(),
## trig = col_double(),
## age = col_double(),
## gender = col_character(),
## smoke = col_character()
## )
## # A tibble: 13 × 6
## ldl hdl trig age gender smoke
## <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 175 25 148 39 female no
## 2 196 36 92 32 female no
## 3 139 65 NA 42 male <NA>
## 4 162 37 139 30 female ex-smoker
## 5 140 117 59 42 female ex-smoker
## 6 147 51 126 65 female ex-smoker
## 7 82 81 NA 57 male no
## 8 165 63 120 48 male current
## 9 149 49 NA 32 female no
## 10 95 54 157 55 female ex-smoker
## 11 169 59 67 48 female no
## 12 174 117 168 41 female no
## 13 91 52 146 69 female current
Note that the functions from readr show the data type used for each column. This makes it easier to spot mistakes like missing values not being coded as expected, in which case a numeric column would show up as a character column.
For example, we can read in the file chol.csv using
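(a sketch of the call, assuming — as in the example further down — that missing values in chol.csv are coded as “.”)
read_csv("chol.csv", na = ".")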
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## ldl = col_double(),
## hdl = col_double(),
## trig = col_double(),
## age = col_double(),
## gender = col_character(),
## smoke = col_character()
## )
## # A tibble: 13 × 6
## ldl hdl trig age gender smoke
## <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 175 25 148 39 female no
## 2 196 36 92 32 female no
## 3 139 65 NA 42 male NA
## 4 162 37 139 30 female ex-smoker
## 5 140 117 59 42 female ex-smoker
## 6 147 51 126 65 female ex-smoker
## 7 82 81 NA 57 male no
## 8 165 63 120 48 male current
## 9 149 49 NA 32 female no
## 10 95 54 157 55 female ex-smoker
## 11 169 59 67 48 female no
## 12 174 117 168 41 female no
## 13 91 52 146 69 female current
3.4.2 Answer
The first line of the file cars.csv
contains the variable names and the fields are separated by commas. Missing values are encoded as asterisks.
cars <- read_csv("cars.csv", na="*")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Manufacturer = col_character(),
## Model = col_character(),
## MPG = col_double(),
## Displacement = col_double(),
## Horsepower = col_double()
## )
cars
## # A tibble: 20 × 5
## Manufacturer Model MPG Displacement Horsepower
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Chevrolet Camaro 19 3.4 160
## 2 Oldsmobile Achieva NA 2.3 155
## 3 Dodge Spirit 22 2.5 100
## 4 Chevrolet Astro NA 4.3 165
## 5 Chevrolet Corsica 25 2.2 110
## 6 Volkswagen Corrado 18 2.8 178
## 7 Dodge Stealth 18 3 300
## 8 Volkswagen Fox 25 1.8 81
## 9 Cadillac DeVille 16 4.9 200
## 10 Hyundai Excel 29 1.5 81
## 11 Toyota Tercel 32 1.5 82
## 12 Dodge Colt 29 1.5 92
## 13 Volkswagen Passat 21 2 134
## 14 Geo Storm 30 1.6 90
## 15 Toyota Previa 18 2.4 138
## 16 Nissan Sentra 29 1.6 110
## 17 Toyota Celica 25 2.2 135
## 18 Honda Civic 42 1.5 102
## 19 Dodge Caravan 17 3 142
## 20 Hyundai Sonata 20 2 128
We could have also used the function read_delim
.
read_delim("cars.csv", delim=",", na="*")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Manufacturer = col_character(),
## Model = col_character(),
## MPG = col_double(),
## Displacement = col_double(),
## Horsepower = col_double()
## )
## # A tibble: 20 × 5
## Manufacturer Model MPG Displacement Horsepower
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Chevrolet Camaro 19 3.4 160
## 2 Oldsmobile Achieva NA 2.3 155
## 3 Dodge Spirit 22 2.5 100
## 4 Chevrolet Astro NA 4.3 165
## 5 Chevrolet Corsica 25 2.2 110
## 6 Volkswagen Corrado 18 2.8 178
## 7 Dodge Stealth 18 3 300
## 8 Volkswagen Fox 25 1.8 81
## 9 Cadillac DeVille 16 4.9 200
## 10 Hyundai Excel 29 1.5 81
## 11 Toyota Tercel 32 1.5 82
## 12 Dodge Colt 29 1.5 92
## 13 Volkswagen Passat 21 2 134
## 14 Geo Storm 30 1.6 90
## 15 Toyota Previa 18 2.4 138
## 16 Nissan Sentra 29 1.6 110
## 17 Toyota Celica 25 2.2 135
## 18 Honda Civic 42 1.5 102
## 19 Dodge Caravan 17 3 142
## 20 Hyundai Sonata 20 2 128
The first line of the file ships.txt
contains the variable names and the fields are separated by whitespace. Missing values are encoded as “.”.
ships <- read_delim("ships.txt", delim = " ", na = ".")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## type = col_character(),
## year = col_double(),
## period = col_double(),
## service = col_double(),
## incidents = col_double()
## )
ships
## # A tibble: 40 × 5
## type year period service incidents
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 A 60 60 127 0
## 2 A 60 75 63 0
## 3 A 65 60 NA 3
## 4 A 65 75 1095 4
## 5 A 70 60 1512 6
## 6 A 70 75 3353 18
## 7 A 75 60 0 0
## 8 A 75 75 2244 11
## 9 B 60 60 44882 39
## 10 B 60 75 17176 29
## # … with 30 more rows
3.4.3 Specifying column types
The functions from readr allow specifying the expected column types. This is especially important when writing code which will then be run automatically, as it provides an easy way of ensuring that the data provided is of the expected format.
The easiest way of specifying the expected column types is to provide a character string, with each letter standing for a column:
Letter | Meaning |
---|---|
c | character |
i | integer |
n | number |
d | double |
l | logical |
D | date |
T | date time |
t | time |
? | guess the type |
_ or - | skip the column |
So for the data file chol.csv we would expect the first four columns to be integers and the latter two to be character strings, so we would use
chol <- read_csv("chol.csv", na=".", col_types="iiiicc")
Specifying the expected column types can help pinpoint problems when reading in data. Suppose we had forgotten that missing values are coded using “.” in this data file. If we use …
chol <- read_csv("chol.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## ldl = col_double(),
## hdl = col_double(),
## trig = col_character(),
## age = col_double(),
## gender = col_character(),
## smoke = col_character()
## )
… we can see from the output that trig
was read in as a character string, but we do not know why.
However, if we use …
chol <- read_csv("chol.csv", col_types="iiiicc")
## Warning: 3 parsing failures.
## row col expected actual file
## 3 trig an integer . 'chol.csv'
## 7 trig an integer . 'chol.csv'
## 9 trig an integer . 'chol.csv'
… we obtain a warning and can print the problematic rows using
problems(chol)
## # A tibble: 3 × 5
## row col expected actual file
## <int> <chr> <chr> <chr> <chr>
## 1 3 trig an integer . 'chol.csv'
## 2 7 trig an integer . 'chol.csv'
## 3 9 trig an integer . 'chol.csv'
The output from problems
shows us that for three rows (3, 7 and 9) the data in chol.csv
was not of the expected format: a value of .
is not compatible with the column being numeric. This makes it easy to identify the cause of the problem (NAs coded as “.
”) and rectify the issue.
3.4.4 Additional Resources
Data Import Cheat Sheet
RStudio’s cheat sheet for data import also covers readr
.
Background reading: Chapter 11 of R for Data Science
Chapter 11 of R for Data Science gives a detailed overview of the functions in readr. It explains in some more detail how the functions in readr parse files and also covers the functions from readr that deal with writing files.
3.5 Efficient data manipulation using dplyr
In this section we will work with data from Paris’ Vélib’ bicycle sharing system available through JCDecaux’s API for open cycle data.
The data consists of the number of bikes available and the number of bike stands available at every Vélib’ station, recorded every five minutes over six hours on a Tuesday afternoon in October 2017.
The data consists of two tibbles. The first, bikes, contains data on the number of available bikes and stands at each station.
Variable | Description |
---|---|
name | Name of the station |
available_bikes | Number of available bikes at that time |
available_bike_stands | Number of available bike stands |
time | Decimal time for which the numbers have been recorded |
The second, stations, contains additional information about each station.
Variable | Description |
---|---|
name | Unique name of the station |
id | Internal ID number of the station |
address | Address of where the station is located |
lng | GPS coordinate (longitude) |
lat | GPS coordinate (latitude) |
departement | Département in which the station is located |
You can load the data into R using
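(The file name below is hypothetical — use the path given in the course materials.)
# hypothetical file name; loading it provides the tibbles bikes and stations
load("velib.RData")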
3.5.1 Overview: the key functions (“verbs”) for dplyr
Function (“verb”) | Description | R base equivalent(s) |
---|---|---|
filter | Select observations/rows | subset |
slice | Select observations by row numbers | [idx,] |
select | Select variables/columns | $ or [,sel] |
mutate | Create new variables/columns | transform |
arrange | Sort observations/rows | order |
group_by | Group observations by variable | by or aggregate |
summarise | Calculate summary statistics | by or aggregate |
The functions in dplyr
are designed to be used with tibbles, but they also work with data frames. When invoked with a data frame, they will return a data frame as long as this is possible.
3.5.2 Selecting observations (rows) using filter and slice
3.5.2.1 filter
The function filter is used to select observations (or rows) in a similar way to the base R function subset.
Suppose we want to print all bike stations in Paris (rather than other départements from Île de France)
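(a sketch of the call that produces the output below)
stations %>% filter(departement == "Paris")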
## # A tibble: 743 × 6
## name id address lng lat departement
## <chr> <dbl> <chr> <dbl> <dbl> <chr>
## 1 PORT SOLFERINO (STATION MOBILE) 901 BERGES… 2.32 48.9 Paris
## 2 QUAI MAURIAC / PONT DE BERCY 903 FETE D… 2.37 48.8 Paris
## 3 17/19 PLACE JOFFRE / ECOLE MILITAIRE 904 ECOLE … 2.30 48.9 Paris
## 4 CONCORDE/BERGES DE SEINE (STATION MOBI… 905 BERGES… 2.32 48.9 Paris
## 5 PORT DU GROS CAILLOU (STATION MOBILE) 908 BERGES… 2.31 48.9 Paris
## 6 PONT D'ARCOLE (STATION MOBILE) 909 Voie G… 2.35 48.9 Paris
## 7 ILE DE LA CITE PONT NEUF 1001 41 QUA… 2.34 48.9 Paris
## 8 PLACE DU CHATELET 1002 14 AVE… 2.35 48.9 Paris
## 9 RIVOLI SAINT DENIS 1003 7 RUE … 2.35 48.9 Paris
## 10 MARGUERITE DE NAVARRE 1004 12 RUE… 2.35 48.9 Paris
## # … with 733 more rows
Note the use of a double == to test whether the département is equal to “Paris”.
We can create more complex expressions using the standard logical operators & (“and”), | (“or”) and ! (“not”). Note that you cannot use && and || in this context, as they only work with scalar arguments.
For example, if we want to extract the stations which are in Paris or Hauts-de-Seine we can use
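(again a sketch, not the original code)
stations %>% filter(departement == "Paris" | departement == "Hauts-de-Seine")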
Rather than using a logical or, we could have used %in%:
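(a sketch)
stations %>% filter(departement %in% c("Paris", "Hauts-de-Seine"))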
Even though the functions from dplyr are designed to be used with pipelines, you can also provide the data set as the first argument:
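(a sketch)
filter(stations, departement == "Paris")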
3.5.2.2 slice
You can use the function slice to select observations based on their row numbers.
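(a sketch matching the explanation and output below)
stations %>% slice(5:7)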
## # A tibble: 3 × 6
## name id address lng lat departement
## <chr> <dbl> <chr> <dbl> <dbl> <chr>
## 1 PORT DU GROS CAILLOU (STATION MOBILE) 908 BERGES DE… 2.31 48.9 Paris
## 2 PONT D'ARCOLE (STATION MOBILE) 909 Voie Geor… 2.35 48.9 Paris
## 3 ILE DE LA CITE PONT NEUF 1001 41 QUAI D… 2.34 48.9 Paris
selects the observations in rows 5 to 7 and is equivalent to
stations[5:7,]
## # A tibble: 3 × 6
## name id address lng lat departement
## <chr> <dbl> <chr> <dbl> <dbl> <chr>
## 1 PORT DU GROS CAILLOU (STATION MOBILE) 908 BERGES DE… 2.31 48.9 Paris
## 2 PONT D'ARCOLE (STATION MOBILE) 909 Voie Geor… 2.35 48.9 Paris
## 3 ILE DE LA CITE PONT NEUF 1001 41 QUAI D… 2.34 48.9 Paris
3.5.3 Task
Identify the stations which had more than 60 bikes available at 3pm (i.e. time taking the value 15).
3.5.4 Answer
You can use the following R code:
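(a sketch producing the output shown)
bikes %>% filter(available_bikes > 60 & time == 15)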
## # A tibble: 6 × 4
## name available_bikes available_bike_stands time
## <chr> <int> <int> <dbl>
## 1 MUSÉE D'ORSAY 63 2 15
## 2 DUPLEIX 65 3 15
## 3 ASSEMBLEE NATIONALE 62 0 15
## 4 SAINT EMILION 65 1 15
## 5 METZ 63 1 15
## 6 PRIMO LEVI 61 1 15
3.5.5 Selecting variables (columns) using select
The function select can be used to subset the variables (columns) of a data set.
You can either specify the columns to retain or (with a minus) those you do not want to retain.
We can retain only the name and département of each station using either
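(a sketch)
stations %>% select(name, departement)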
## # A tibble: 928 × 2
## name departement
## <chr> <chr>
## 1 PORT SOLFERINO (STATION MOBILE) Paris
## 2 QUAI MAURIAC / PONT DE BERCY Paris
## 3 17/19 PLACE JOFFRE / ECOLE MILITAIRE Paris
## 4 CONCORDE/BERGES DE SEINE (STATION MOBILE) Paris
## 5 PORT DU GROS CAILLOU (STATION MOBILE) Paris
## 6 PONT D'ARCOLE (STATION MOBILE) Paris
## 7 ILE DE LA CITE PONT NEUF Paris
## 8 PLACE DU CHATELET Paris
## 9 RIVOLI SAINT DENIS Paris
## 10 MARGUERITE DE NAVARRE Paris
## # … with 918 more rows
or
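(a sketch using negative selection)
stations %>% select(-id, -address, -lng, -lat)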
You can also use select to change the order of the columns of a data set.
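For example (a sketch; everything() is a tidyselect helper that keeps all remaining columns):
stations %>% select(departement, name, everything())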
3.5.6 Adding new variables using mutate
The function mutate can be used to create new variables (columns) in a data set. mutate is similar in functionality to the base R function transform.
We can add the total number of stands to the data set bikes using
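(a sketch; the column name total_stands matches the outputs further below)
bikes <- bikes %>%
  mutate(total_stands = available_bikes + available_bike_stands)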
More than one new variable can be defined by adding further arguments to mutate.
transmute is a sibling of mutate. Just like mutate it creates new columns; however, it also removes all existing columns so that only the new columns remain.
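For instance (a sketch; columns listed by their bare names, such as name and time, are kept as they are):
bikes %>% transmute(name, time, total_stands = available_bikes + available_bike_stands)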
3.5.7 Task
The time is currently encoded as a decimal (e.g. 13.5 for 13:30). Create two columns: time_hours, which contains the hour (13 in our example), and time_minutes, which contains the minutes (30 in our example).
You can calculate time_hours as the floor of time (R function floor) and time_minutes as the remainder after integer division of 60 times time by 60 (R operator %%).
3.5.8 Answer
We can create both columns in one call to mutate.
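(a sketch; the printed output below uses the shorter column name time_hour, so that name is used here)
bikes %>%
  mutate(time_hour = floor(time),
         time_minutes = (60 * time) %% 60)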
## # A tibble: 67,354 × 7
## name available_bikes available_bike_… time total_stands time_hour
## <chr> <int> <int> <dbl> <int> <dbl>
## 1 CHAMPEAUX (BAG… 9 41 13 50 13
## 2 POISSONNIÈRE -… 33 0 13 33 13
## 3 METRO ROME 6 38 13 44 13
## 4 DE GAULLE (PAN… 2 16 13 18 13
## 5 PARC DE BELLEV… 4 22 13 26 13
## 6 SOLJENITSYNE (… 56 4 13 60 13
## 7 SERRES 5 18 13 23 13
## 8 PYRAMIDE ARTIL… 14 40 13 54 13
## 9 SAINT GEORGES 12 10 13 22 13
## 10 MUSÉE D'ORSAY 65 0 13 65 13
## # … with 67,344 more rows, and 1 more variable: time_minutes <dbl>
The output does not show all of the new columns (printing them all would take the output for a single row over more than one line). We can show them all if, for example, we remove the station name.
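(a sketch)
bikes %>%
  mutate(time_hour = floor(time),
         time_minutes = (60 * time) %% 60) %>%
  select(-name)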
## # A tibble: 67,354 × 6
## available_bikes available_bike_sta… time total_stands time_hour time_minutes
## <int> <int> <dbl> <int> <dbl> <dbl>
## 1 9 41 13 50 13 0
## 2 33 0 13 33 13 0
## 3 6 38 13 44 13 0
## 4 2 16 13 18 13 0
## 5 4 22 13 26 13 0
## 6 56 4 13 60 13 0
## 7 5 18 13 23 13 0
## 8 14 40 13 54 13 0
## 9 12 10 13 22 13 0
## 10 65 0 13 65 13 0
## # … with 67,344 more rows
Alternatively, we can explicitly invoke the print method of the tibble and ask it to print everything.
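(a sketch; the print method for tibbles accepts a width argument, and width = Inf prints all columns)
bikes %>%
  mutate(time_hour = floor(time),
         time_minutes = (60 * time) %% 60) %>%
  print(width = Inf)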
## # A tibble: 67,354 × 7
## name available_bikes available_bike_stands time
## <chr> <int> <int> <dbl>
## 1 CHAMPEAUX (BAGNOLET) 9 41 13
## 2 POISSONNIÈRE - ENGHIEN 33 0 13
## 3 METRO ROME 6 38 13
## 4 DE GAULLE (PANTIN) 2 16 13
## 5 PARC DE BELLEVILLE (20040) 4 22 13
## 6 SOLJENITSYNE (PUTEAUX) 56 4 13
## 7 SERRES 5 18 13
## 8 PYRAMIDE ARTILLERIE 14 40 13
## 9 SAINT GEORGES 12 10 13
## 10 MUSÉE D'ORSAY 65 0 13
## total_stands time_hour time_minutes
## <int> <dbl> <dbl>
## 1 50 13 0
## 2 33 13 0
## 3 44 13 0
## 4 18 13 0
## 5 26 13 0
## 6 60 13 0
## 7 23 13 0
## 8 54 13 0
## 9 22 13 0
## 10 65 13 0
## # … with 67,344 more rows
3.5.9 Sorting data sets using arrange
The function arrange can be used to sort a data set by one or more variables.
We can sort the data set bikes by the number of available bikes using
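(a sketch)
bikes %>% arrange(available_bikes)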
## # A tibble: 67,354 × 5
## name available_bikes available_bike_… time total_stands
## <chr> <int> <int> <dbl> <int>
## 1 KARMAN (AUBERVILLIERS) 0 0 13 0
## 2 PIGALLE GERMAIN PILLON 0 20 13 20
## 3 ROND POINT DES CHAMPS EL… 0 0 13 0
## 4 MONTCALM 0 47 13 47
## 5 PLACE HENOCQUE VERSION 2 0 34 13 34
## 6 PLACE DES FETES 0 19 13 19
## 7 MANIN SECRETAN 0 20 13 20
## 8 MARTINIE (VANVES) 0 24 13 24
## 9 HORTENSIAS (LES LILAS) 0 22 13 22
## 10 HAIES REUNION 0 22 13 22
## # … with 67,344 more rows
You can use the function desc to sort in descending order:
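(a sketch)
bikes %>% arrange(desc(available_bikes))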
## # A tibble: 67,354 × 5
## name available_bikes available_bike_stands time total_stands
## <chr> <int> <int> <dbl> <int>
## 1 DUPLEIX 68 0 16.2 68
## 2 DUPLEIX 68 0 16.2 68
## 3 DUPLEIX 67 1 15.4 68
## 4 DUPLEIX 67 1 15.5 68
## 5 DUPLEIX 67 1 15.8 68
## 6 DUPLEIX 67 1 16.1 68
## 7 DUPLEIX 67 1 16.3 68
## 8 SAHEL 67 0 17.6 67
## 9 SAHEL 67 0 18 67
## 10 SAHEL 67 0 18.1 67
## # … with 67,344 more rows
3.5.10 Task
Identify the three bike stations that are furthest to the West (i.e. the ones with the smallest longitude lng).
3.5.11 Answer
We first sort the stations by longitude and then select the top three observations.
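(a sketch producing the output below)
stations %>% arrange(lng) %>% slice(1:3)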
## # A tibble: 3 × 6
## name id address lng lat departement
## <chr> <dbl> <chr> <dbl> <dbl> <chr>
## 1 GARE ROUTIERE ( SAINT CLOUD) 22101 GARE ROUTIERE - AR… 2.22 48.8 Hauts-de-S…
## 2 SELLIER (SURESNES) 21501 RUE DE SAINT CLOUD… 2.23 48.9 Hauts-de-S…
## 3 VERDUN (SURESNES) 21502 18 BIS RUE DE VERD… 2.23 48.9 Hauts-de-S…
We could have also used the function filter and the ranking function min_rank:
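(a sketch)
stations %>% filter(min_rank(lng) <= 3)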
## # A tibble: 3 × 6
## name id address lng lat departement
## <chr> <dbl> <chr> <dbl> <dbl> <chr>
## 1 SELLIER (SURESNES) 21501 RUE DE SAINT CLOUD… 2.23 48.9 Hauts-de-S…
## 2 VERDUN (SURESNES) 21502 18 BIS RUE DE VERD… 2.23 48.9 Hauts-de-S…
## 3 GARE ROUTIERE ( SAINT CLOUD) 22101 GARE ROUTIERE - AR… 2.22 48.8 Hauts-de-S…
min_rank returns the rank of the observation when considering the variable given as argument (there are many different ways of computing ranks; see ?min_rank for details).
However, the latter answer does not show the stations in increasing order of longitude.
3.5.12 Grouping data and calculating group-wise summary statistics: group_by and summarise
Suppose we want to identify the busiest stations in the system in the sense of having, on average, the most bikes taken out (and thus the highest number of available bike stands – this is assuming JCDecaux replenish all bike stations in the same way, which is not quite what is happening in reality; there are better, but more complex, ways of defining “busy”).
To calculate the average number of available bike stands per station, we first need to group the data by bike station and then compute the average number of bike stands available:
bikes %>% group_by(name) %>% # Group by station name
summarise(avg_stands=mean(available_bike_stands)) %>% # Calculate averages
arrange(desc(avg_stands)) # Sort in descending order
## # A tibble: 928 × 2
## name avg_stands
## <chr> <dbl>
## 1 PANTIN 70.3
## 2 BELLEVILLE (20041) 65.1
## 3 PLACE ADOLPHE CHERIOUX 60
## 4 HIPPODROME D AUTEUIL 60.0
## 5 RUE DES BOULETS ( COMPLEMENTAIRE ) 55
## 6 PLACE DE LA PORTE DE CHATILLON 54.9
## 7 PORTE DE LA CHAPELLE 54.1
## 8 CHARMES (FONTENAY SOUS BOIS) 53.5
## 9 PORTE DE MONTROUGE 53
## 10 ALLENDE (PANTIN) 52.9
## # … with 918 more rows
3.5.13 Task
Find the number of bike stations in each département.
You might find the function n() helpful: it returns the number of cases and is the dplyr equivalent of COUNT(*) in SQL (type ?n to get help).
3.5.14 Answer
We can use the following R code:
stations %>% group_by(departement) %>% # Group by department
summarise(n_stations=n()) %>% # Count cases
arrange(desc(n_stations)) # Sort in descending order
## # A tibble: 4 × 2
## departement n_stations
## <chr> <int>
## 1 Paris 743
## 2 Hauts-de-Seine 75
## 3 Seine-Saint-Denis 60
## 4 Val-de-Marne 50
group_by can also be used to limit the scope of subsequent calls to other functions such as filter, arrange or slice. To make this more concrete, suppose we want to find, for each time point, the station with the most available bikes. We first have to group the data by time and then find the station with the most available bikes.
bikes %>%
group_by(time) %>% # Group by time
arrange(desc(available_bikes)) %>% # Sort by bikes within each group
slice (1) # Return only top one per group
## # A tibble: 73 × 5
## # Groups: time [73]
## name available_bikes available_bike_stands time total_stands
## <chr> <int> <int> <dbl> <int>
## 1 MUSÉE D'ORSAY 65 0 13 65
## 2 MUSÉE D'ORSAY 65 0 13.1 65
## 3 MUSÉE D'ORSAY 65 0 13.2 65
## 4 MUSÉE D'ORSAY 62 3 13.2 65
## 5 METZ 64 0 13.3 64
## 6 DUPLEIX 64 4 13.4 68
## 7 METZ 64 0 13.5 64
## 8 MUSÉE D'ORSAY 63 2 13.6 65
## 9 SAINT EMILION 63 3 13.7 66
## 10 MUSÉE D'ORSAY 65 0 13.8 65
## # … with 63 more rows
Alternatively, we can use filter and min_rank:
bikes %>%
group_by(time) %>% # Group by time
filter(min_rank(desc(available_bikes))==1) # Find largest in each group
## # A tibble: 92 × 5
## # Groups: time [73]
## name available_bikes available_bike_… time total_stands
## <chr> <int> <int> <dbl> <int>
## 1 MUSÉE D'ORSAY 65 0 13 65
## 2 MUSÉE D'ORSAY 65 0 13.1 65
## 3 MUSÉE D'ORSAY 65 0 13.2 65
## 4 MUSÉE D'ORSAY 62 3 13.2 65
## 5 MOUFFETARD EPEE DE BOIS 62 1 13.2 63
## 6 SAINT PLACIDE CHERCHE MI… 62 0 13.2 62
## 7 METZ 64 0 13.3 64
## 8 DUPLEIX 64 4 13.4 68
## 9 METZ 64 0 13.4 64
## 10 METZ 64 0 13.5 64
## # … with 82 more rows
You might have noticed that the answers differ a little. The reason for this is ties: for example, at 1.15pm the stations at Musée d'Orsay, Mouffetard Epée de Bois and Saint Placide Cherche-Midi all had 62 bikes available. The former command extracts just one of them, whereas the latter command extracts all three.
(You would obtain the same results if you replaced min_rank by row_number, which breaks ties using the order in which the observations appear in the data set.)
3.5.15 Merging (joining) data sets using the join-type functions
Suppose we want to extract the data from bikes relating to bike stations in Hauts-de-Seine only. The table bikes does not, however, contain any information about the département in which the stations are located. We need to merge the information from stations and bikes. This can be done using one of the join functions of dplyr. We will use inner_join, which only retains cases for which there are corresponding entries in both data sets: this corresponds to the default behaviour of the R function merge.
The join functions will by default use the columns with common names across the two data sets (“natural join”).
bikes %>% inner_join(stations) %>% # Merge data (using common variable: name)
filter(departement=="Hauts-de-Seine")
## Joining, by = "name"
## # A tibble: 5,333 × 10
## name available_bikes available_bike_… time total_stands id address lng
## <chr> <int> <int> <dbl> <int> <dbl> <chr> <dbl>
## 1 SOLJ… 56 4 13 60 28002 BOULEV… 2.25
## 2 DE G… 3 19 13 22 22005 195 AV… 2.26
## 3 NATI… 20 3 13 23 21015 39 RUE… 2.24
## 4 MONT… 20 5 13 25 22011 7 RUE … 2.28
## 5 PETI… 22 0 13 22 21113 2 RUE … 2.30
## 6 GREN… 9 12 13 21 21013 4 AVEN… 2.25
## 7 MART… 0 24 13 24 21703 5-7 AV… 2.29
## 8 MORI… 22 3 13 25 21106 2-4 RU… 2.31
## 9 SELL… 34 17 13 51 21501 RUE DE… 2.23
## 10 VALI… 22 2 13 24 21101 4 RUE … 2.30
## # … with 5,323 more rows, and 2 more variables: lat <dbl>, departement <chr>
We could have specified the column used to join the data sets manually by adding the argument by="name" (or by=c("name"="name"), which allows using columns with different names in the two data sets).
As a side note, in this example we could have avoided joining the two tables. We could have first extracted the names of the stations in Hauts-de-Seine and then used those to subset the data from bikes (essentially the equivalent of a subquery in SQL):
names92 <- stations %>% filter(departement=="Hauts-de-Seine") %>%
select(name)
bikes %>% filter(name %in% names92[[1]])
## # A tibble: 5,333 × 5
## name available_bikes available_bike_… time total_stands
## <chr> <int> <int> <dbl> <int>
## 1 SOLJENITSYNE (PUTEAUX) 56 4 13 60
## 2 DE GAULLE 3 (NEUILLY) 3 19 13 22
## 3 NATIONALE (BOULOGNE-BILL… 20 3 13 23
## 4 MONTROSIER (NEUILLY) 20 5 13 25
## 5 PETIT (CLICHY) 22 0 13 22
## 6 GRENIER (BOULOGNE-BILLAN… 9 12 13 21
## 7 MARTINIE (VANVES) 0 24 13 24
## 8 MORICE 2 (CLICHY) 22 3 13 25
## 9 SELLIER (SURESNES) 34 17 13 51
## 10 VALITON (CLICHY) 22 2 13 24
## # … with 5,323 more rows
We had to use names92[[1]] to extract the entries of the tibble names92 as a character vector (we could have also used unlist(names92)).
You might notice a small difference in the results returned by the two approaches: the former retains the columns from stations that were joined in, whereas the latter only contains the columns bikes had to start with.
3.5.16 Additional Resources
Data Transformation Cheat Sheet RStudio have put together a very handy and compact cheat sheet for dplyr.
Background reading: Chapter 13 of R for Data Science
Chapter 13 of R for Data Science gives a detailed overview of the functions in dplyr
.