5 Introduction to data wrangling
5.1 Setup
This chunk of R code loads the packages that we will be using.
Data is abundant, and as data analysts we often have to spend time working with our data tables to get it ready for whatever analysis we need to do. “Data wrangling” and “data transformation” are two terms that are used to describe the process of altering the structure of our data.3
There are three broad categories of data transformation:
alter the structure of the existing data tables. This could be through adding or removing rows or columns to our data table, including calculation of new variables, or importation of additional rows (observations) and joining two tables together.
summarize the content of the source data tables. For example, if our data table has the population of all the countries in the world, we might want a summary table that has the table with the total for each continent.
with both our existing data or our summary tables, we may want change the layout, either for analysis purposes or to make it easier for human readers to absorb the information.
In this section, we will begin to work with all three transformations.
5.1.1 Reading: Data wrangling
Please refer to Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund R for Data Science, 2nd ed., Data transformation for more information about this section.
For this section, and for other exercises throughout this course, we will be using a subset of the Gapminder data. These data tables contain information about the countries of the world, and were used by Hans Rosling to explain that there have been significant changes in the quality of people’s lives around the world.
The best stats you’ve ever seen— this is the video that introduced me to Hans Rosling.
200 Countries, 200 Years, 4 Minutes—recreating the chart in this video is one of the capstone project options
It’s outside the scope of this course, but I also recommend Rosling’s book Factfulness(Rosling, Rosling, and Rönnlund 2018), which goes deeper into the changes in the world and how data explains those changes.
An article related to the work of Hans Rosling can be found here: “The world is awful. The world is much better. The world can be much better.”, by Max Roser, the Founder and Director of Our World In Data, an organization dedicated to sharing data and analysis to make progress against the world’s largest problems.
5.2 Transform Data: 1a Alter the structure
Data Transformation Cheat Sheet (PDF file)
When working with data, particularly large data sets, you encounter a situation where you need to:
subset the data so that it contains only those records (rows) you are interested in (Why deal with a data table that has every province in Canada when you just need the British Columbia data?)
subset the data so that it contains only those variables (columns) you are interested in (Why keep looking at variables that do not factor into your analysis, or you will not print in your report?)
create new variables, often through calculations based on variables in your data
To achieve these three goals, we will be using functions from the tidyverse package {dplyr}. The functions are verbs that describe what they do. Here’s a list of the {dplyr} functions that you’ll be using:
function | action |
---|---|
filter() |
keep rows |
select() |
keep variables (or drop them -var or re-order them) |
mutate() |
create a new variable |
arrange() |
sort from smallest to largest |
arrange(desc()) |
sort from largest to smallest |
Other operators
In addition to these functions, R has a number of operators that add significantly to your code.
Arithmetic operators
Operator | Description |
---|---|
+ | plus |
- | minus |
* | multiplication |
/ | division |
^ | exponential |
Boolean operators are a special type of operator that return TRUE or FALSE.
Operator | Description |
---|---|
== | equal, tests equality |
!= | not equal, tests inequality |
> | greater than, tests greater than (also >=) |
< | less than, tests less than (also <=) |
%in% | contains, tests inclusion |
Boolean operators can be combined with and or or.
Operator | Description |
---|---|
& |
and, returns true if preceding and following are both true, else false |
| | or, returns true if either preceding and following are true, else false |
see Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund R for Data Science, 2nd ed., “Comparisions” and “Boolean algebra”.]
5.3 Introducing the pipe operator |>
(see R for Data Science, 2nd ed., The Pipe for a deep dive into pipes)
The pipe operator |>
allows us to name the data frame first, and then have the next function work with the outcome of the first line.
For example, we could write
function(data, argument)
or, with a pipe
data |> function(argument)
Read the pipe symbol as “then”. We start with our data, then we apply the function.
If we want to do anything with the output from the first statement, we have to assign it to an intermediate object, and then use that intermediate object in the second statement. With the pipe, we can add another pipe and the results of the first statement are passed to the second statement without the intermediate object.
As we get to increasingly complex statements, the power of this will become more evident.
This code calculates the mean of life expectancy (the lifeExp
variable) in the gapminder data (note that we are accessing the lifeExp
variable by adding the dollar sign to the dataframe name).
mean(gapminder$lifeExp)
## [1] 59.47444
Another way to do this would be using the pipe operator. The symbol is |>
and links one line of script to the next. The way to read this is as meaning “and then…”
At this point, we only have one step in our chain, but in future code, you might have multiple steps linked one after the other.
So this little chunk would be read as
we start with the dataframe “gapminder”, “and then…”
we calculate a variable called “mean_of_life_expectancy” by running the
mean()
function on the variable “lifeExp”
## # A tibble: 1 × 1
## mean_of_life_expectancy
## <dbl>
## 1 59.5
IMPORTANT NOTE The pipe operator
|>
is in base R, but it wasn’t always. Prior to its introduction there was a pipe symbol available through the {magrittr} package, which looks like this:%>%
. You will still see this version in older texts (such as the first edition of R for Data Science) and course materials. It was used in BIDA302, so if you come across older course material, it might use the%>%
pipe operator.
5.4 Filter for rows
The calculations above are for the entire dataframe. Often in data analysis, we want to analyze a single group. We might want to examine life expectancy changes in a single country, or for calculate the average across the world for a single year. For this type of analysis, we can use the filter()
function.
In the code chunk below, we use the filter()
function to get the rows for Canada. This will return the 12 rows that have the records for Canada.
gapminder |>
filter(country == "Canada")
## # A tibble: 12 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Canada Americas 1952 68.8 14785584 11367.
## 2 Canada Americas 1957 70.0 17010154 12490.
## 3 Canada Americas 1962 71.3 18985849 13462.
## 4 Canada Americas 1967 72.1 20819767 16077.
## 5 Canada Americas 1972 72.9 22284500 18971.
## 6 Canada Americas 1977 74.2 23796400 22091.
## 7 Canada Americas 1982 75.8 25201900 22899.
## 8 Canada Americas 1987 76.9 26549700 26627.
## 9 Canada Americas 1992 78.0 28523502 26343.
## 10 Canada Americas 1997 78.6 30305843 28955.
## 11 Canada Americas 2002 79.8 31902268 33329.
## 12 Canada Americas 2007 80.7 33390141 36319.
In our next code chunk, we use filter to get the rows for every country, for the single year 2002. Note that because the “year” variable is a numeric type (specifically, an integer) we do not put the year value (2002) inside quotation marks.
gapminder |>
filter(year == 2002)
## # A tibble: 142 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2002 42.1 25268405 727.
## 2 Albania Europe 2002 75.7 3508512 4604.
## 3 Algeria Africa 2002 71.0 31287142 5288.
## 4 Angola Africa 2002 41.0 10866106 2773.
## 5 Argentina Americas 2002 74.3 38331121 8798.
## 6 Australia Oceania 2002 80.4 19546792 30688.
## 7 Austria Europe 2002 79.0 8148312 32418.
## 8 Bahrain Asia 2002 74.8 656397 23404.
## 9 Bangladesh Asia 2002 62.0 135656790 1136.
## 10 Belgium Europe 2002 78.3 10311970 30486.
## # ℹ 132 more rows
We can use this filter as the first step in a pipe that allows us to calculate the average life expectancy for all of the countries in 2002.
## # A tibble: 1 × 1
## global_life_expectancy
## <dbl>
## 1 65.7
5.5 Creating lists
In this example, we filter to get three countries, “New Zealand”, “Canada”, and “United States”.
Note that to get all three, the statement is OR, not AND! There are no cases where country == "New Zealand"
and simultaneously country == "Canada"
…
gapminder |>
filter(country == "New Zealand" |
country == "Canada" |
country == "United States")
## # A tibble: 36 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Canada Americas 1952 68.8 14785584 11367.
## 2 Canada Americas 1957 70.0 17010154 12490.
## 3 Canada Americas 1962 71.3 18985849 13462.
## 4 Canada Americas 1967 72.1 20819767 16077.
## 5 Canada Americas 1972 72.9 22284500 18971.
## 6 Canada Americas 1977 74.2 23796400 22091.
## 7 Canada Americas 1982 75.8 25201900 22899.
## 8 Canada Americas 1987 76.9 26549700 26627.
## 9 Canada Americas 1992 78.0 28523502 26343.
## 10 Canada Americas 1997 78.6 30305843 28955.
## # ℹ 26 more rows
Note as well that the order of the values in the filter()
function does not matter—the table that results is in the same order as the source table.
Instead of this rather verbose syntax, we can achieve the same result by creating a list.
To create a list in R, we use the c()
function.
## # A tibble: 36 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Canada Americas 1952 68.8 14785584 11367.
## 2 Canada Americas 1957 70.0 17010154 12490.
## 3 Canada Americas 1962 71.3 18985849 13462.
## 4 Canada Americas 1967 72.1 20819767 16077.
## 5 Canada Americas 1972 72.9 22284500 18971.
## 6 Canada Americas 1977 74.2 23796400 22091.
## 7 Canada Americas 1982 75.8 25201900 22899.
## 8 Canada Americas 1987 76.9 26549700 26627.
## 9 Canada Americas 1992 78.0 28523502 26343.
## 10 Canada Americas 1997 78.6 30305843 28955.
## # ℹ 26 more rows
Another option would be to create a named object that contains our list. That list can then be referenced in a filter()
statement using the %in%
operator as the evaluator.
# alternate answer
country_list <- c("New Zealand", "Canada", "United States") # create object that is list of countries
gapminder |>
filter(country %in% country_list) # filter uses that object
## # A tibble: 36 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Canada Americas 1952 68.8 14785584 11367.
## 2 Canada Americas 1957 70.0 17010154 12490.
## 3 Canada Americas 1962 71.3 18985849 13462.
## 4 Canada Americas 1967 72.1 20819767 16077.
## 5 Canada Americas 1972 72.9 22284500 18971.
## 6 Canada Americas 1977 74.2 23796400 22091.
## 7 Canada Americas 1982 75.8 25201900 22899.
## 8 Canada Americas 1987 76.9 26549700 26627.
## 9 Canada Americas 1992 78.0 28523502 26343.
## 10 Canada Americas 1997 78.6 30305843 28955.
## # ℹ 26 more rows
This final option has a significant benefit. Now if we need to use that list of countries in another part of our analysis, we can just summon the object, rather than typing the entire list again.
This also reduces the risk of inadvertent error if we want to change the countries we are analyzing. If we have a named list object, we only need to change it in the one place, and that change will be incorporated throughout our code. Contrast that with a find-and-replace approach, where you might accidentally miss one of the places it appears, or you might have a typo in one place.
NOTE: Watch your quotation marks! In many cases double and single quotes are interchangable, but fancy “smart” quotes to start and end are not valid.
This example uses the flights
table from the {nycflights13} package, and filters for Alaska Airlines Inc., which has the value “AS” in the variable “carrier”.
5.6 Select columns
You’ve already seen filter()
to choose particular rows in your dataframe. What if you want to remove some columns (variables)?
Here we will use select()
to reduce the number of variables in our gapminder table to “country”, “year”, and “lifeExp”:
gapminder |>
select(country, year, lifeExp)
## # A tibble: 1,704 × 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Afghanistan 1952 28.8
## 2 Afghanistan 1957 30.3
## 3 Afghanistan 1962 32.0
## 4 Afghanistan 1967 34.0
## 5 Afghanistan 1972 36.1
## 6 Afghanistan 1977 38.4
## 7 Afghanistan 1982 39.9
## 8 Afghanistan 1987 40.8
## 9 Afghanistan 1992 41.7
## 10 Afghanistan 1997 41.8
## # ℹ 1,694 more rows
You can use the minus sign to drop variable–that is, to define ones to not select.
In this example, note we have to use c()
to create a list:
## # A tibble: 1,704 × 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Afghanistan 1952 28.8
## 2 Afghanistan 1957 30.3
## 3 Afghanistan 1962 32.0
## 4 Afghanistan 1967 34.0
## 5 Afghanistan 1972 36.1
## 6 Afghanistan 1977 38.4
## 7 Afghanistan 1982 39.9
## 8 Afghanistan 1987 40.8
## 9 Afghanistan 1992 41.7
## 10 Afghanistan 1997 41.8
## # ℹ 1,694 more rows
You can also define a range of columns to select:
gapminder |>
select(country, year:pop)
## # A tibble: 1,704 × 4
## country year lifeExp pop
## <fct> <int> <dbl> <int>
## 1 Afghanistan 1952 28.8 8425333
## 2 Afghanistan 1957 30.3 9240934
## 3 Afghanistan 1962 32.0 10267083
## 4 Afghanistan 1967 34.0 11537966
## 5 Afghanistan 1972 36.1 13079460
## 6 Afghanistan 1977 38.4 14880372
## 7 Afghanistan 1982 39.9 12881816
## 8 Afghanistan 1987 40.8 13867957
## 9 Afghanistan 1992 41.7 16317921
## 10 Afghanistan 1997 41.8 22227415
## # ℹ 1,694 more rows
5.7 Add new variables
You can calculate new variables that may be useful in your analysis. The {dplyr} package function for this is mutate()
.
For this example, we will use the “mpg” data set, which is a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov. For more information about the data set, see the reference page for the dataset: https://ggplot2.tidyverse.org/reference/mpg.html
The first thing we want to do is to take a quick look at the data so we get a sense of its structure, using the ls.str()
function:
ls.str(mpg)
## class : chr [1:234] "compact" "compact" "compact" "compact" "compact" "compact" "compact" "compact" ...
## cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## drv : chr [1:234] "f" "f" "f" "f" "f" "f" "f" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "r" "r" ...
## fl : chr [1:234] "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "r" "e" ...
## hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## manufacturer : chr [1:234] "audi" "audi" "audi" "audi" "audi" "audi" "audi" "audi" "audi" "audi" "audi" ...
## model : chr [1:234] "a4" "a4" "a4" "a4" "a4" "a4" "a4" "a4 quattro" "a4 quattro" "a4 quattro" ...
## trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" "auto(l5)" "manual(m5)" "auto(av)" ...
## year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
We can also view the data as a table:
mpg
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compact
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compact
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
## 7 audi a4 3.1 2008 6 auto(av) f 18 27 p compact
## 8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact
## 9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact
## 10 audi a4 quattro 2 2008 4 manual(m6) 4 20 28 p compact
## # ℹ 224 more rows
vectors of the same length
As part of our analysis, we want to understand the ratio of gas consumption to the size of the engine, measured as the variable displ
(short for “displacement”). To calculate that ratio, we divide the highway consumption (hwy
) by the displacement.
The {dplyr} function to create a new variable is mutate()
. This is shown in the code below:
mpg <- mpg |>
mutate(mpg_per_cubic_litre = hwy / displ)
Notice that there’s now a new variable added to the data called mpg_per_cubic_litre
.
In that example, the two columns are of the same length—for each car (each row in the data) there is a hwy
value and a displ
value.
Vector of the length = 1
In other instances, we might want to have a calculation using a constant—we might want to multiply every one of our cases by 2, or add 10 to each.
In the example below, we will add a new variable to our data set using mutate()
.
This will convert miles per US gallon to litres per 100 kilometers. The first step is to create a constant value, based on the number of gallons in a litre and the number of kilometers in a mile.
Once we have that constant, we can use it in an equation to calculate the litres per 100 kilometers for every car in the data.
# calculate conversion constant
# a US gallon is 3.785 litres
# a kilometer is 1.61 miles
lp100km <- (100 * 3.785) / (1.61)
lp100km
## [1] 235.0932
# divide constant by hwy mpg to get hwy litres per 100 km
mpg |>
mutate(hwy_lp100km = lp100km / hwy)
## # A tibble: 234 × 13
## manufacturer model displ year cyl trans drv cty hwy fl class mpg_per_cubic_litre
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr> <dbl>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp… 16.1
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp… 16.1
## 3 audi a4 2 2008 4 manu… f 20 31 p comp… 15.5
## 4 audi a4 2 2008 4 auto… f 21 30 p comp… 15
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp… 9.29
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp… 9.29
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp… 8.71
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp… 14.4
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp… 13.9
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp… 14
## # ℹ 224 more rows
## # ℹ 1 more variable: hwy_lp100km <dbl>
5.8 Summarize (or Summarise)
Take a quick look at the data in the {gapminder} package.
gapminder
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ℹ 1,694 more rows
R has a built-in function to generate some descriptive (or summary) statistics. The first is summary()
which gives a few key statistics about the distribution.
summary(gapminder)
## country continent year lifeExp pop
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60 Min. :6.001e+04
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20 1st Qu.:2.794e+06
## Algeria : 12 Asia :396 Median :1980 Median :60.71 Median :7.024e+06
## Angola : 12 Europe :360 Mean :1980 Mean :59.47 Mean :2.960e+07
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85 3rd Qu.:1.959e+07
## Australia : 12 Max. :2007 Max. :82.60 Max. :1.319e+09
## (Other) :1632
## gdpPercap
## Min. : 241.2
## 1st Qu.: 1202.1
## Median : 3531.8
## Mean : 7215.3
## 3rd Qu.: 9325.5
## Max. :113523.1
##
The summarize()
(or summarise()
) functions allows us to use a pipe sequence to compute statistics about the data. In the example below, we will calculate three statistics:
- The first (minimum) year in the dataset
- The last (maximum) year in the dataset
- The number of unique countries
# answer
gapminder |>
summarize(year_min = min(year),
year_max = max(year),
n_countries = n_distinct(country))
## # A tibble: 1 × 3
## year_min year_max n_countries
## <int> <int> <int>
## 1 1952 2007 142
Everything we have done so far has been with to manipulate the entire data file (filtering and selecting), or calculating descriptive statistics for all the values in our data. Often in our data analysis we need to compare within our data. Here are some questions we can answer through the gapminder data:
What was the median life expectancy in African countries in 2007?
What was the median life expectancy by continent in 2007?
What was the median life expectancy by continent for each year in the data?
To answer the first of these, we can filter the data to get only the African countries and 2007.
But for the second and third, we need to group our data by continent and year, and then summarize those groups.
The R package {dplyr} has a grouping function group_by()
. Once the data is grouped, we apply a summarize()
(or summarise()
if you prefer!) function with our descriptive statistics.
Let’s walk through the steps for the first question:
Extract the rows for African countries in 2007.
Then find:
- The number of unique countries
- The median life expectancy of African countries as a group
# answer
gapminder |>
# filter for the continent and the year
filter(continent == "Africa" & year == 2007) |>
# calculate the summary statistics
summarise(n_countries = n_distinct(country),
lifeExp_med = median(lifeExp))
## # A tibble: 1 × 2
## n_countries lifeExp_med
## <int> <dbl>
## 1 52 52.9
5.8.1 Grouping
We can use the {dplyr} function group_by()
to define our grouping variables. If we group all of the countries by the continent they are part of, we can calculate a statistic for the whole continent.
The second question was to find the median life expectancy by continent in 2007.
# answer
gapminder |>
# filter for the year
filter(year == 2007) |>
# apply the grouping variable, and then summarize
group_by(continent) |>
summarize(lifeExp_mean = mean(lifeExp))
## # A tibble: 5 × 2
## continent lifeExp_mean
## <fct> <dbl>
## 1 Africa 54.8
## 2 Americas 73.6
## 3 Asia 70.7
## 4 Europe 77.6
## 5 Oceania 80.7
5.8.1.1 Grouping with multiple variables
We can group by multiple variables.
Find the median life expectancy by continent for each year in the dataframe.
## # A tibble: 60 × 3
## # Groups: continent [5]
## continent year lifeExp_med
## <fct> <int> <dbl>
## 1 Africa 1952 38.8
## 2 Africa 1957 40.6
## 3 Africa 1962 42.6
## 4 Africa 1967 44.7
## 5 Africa 1972 47.0
## 6 Africa 1977 49.3
## 7 Africa 1982 50.8
## 8 Africa 1987 51.6
## 9 Africa 1992 52.4
## 10 Africa 1997 52.8
## # ℹ 50 more rows
Create a summary table with the population and total GDP by continent for the year 1952
# answer
gapminder |>
filter(year == 1952) |> # filter to get the records we need
mutate(totalGDP = pop * gdpPercap) |> # note that we need to add the new variable calculated above!
group_by(continent) |> # this defines the grouping category
summarize(pop = sum(pop), #
totalGDP = sum(totalGDP)) #
## # A tibble: 5 × 3
## continent pop totalGDP
## <fct> <int> <dbl>
## 1 Africa 237640501 3.12e11
## 2 Americas 345152446 2.94e12
## 3 Asia 1395357351 1.13e12
## 4 Europe 418120846 2.55e12
## 5 Oceania 10686006 1.08e11
5.8.2 Counting the number of cases
A common step in a data analysis project is to count the number of cases in our dataset, often by groups.
For these examples, we will use the mpg
dataset, and produce a summary table that shows the number of vehicles by class (compact, suv, etc) and number of cylinders in the engine (the variable cyl
).
In this section, we will look at three different approaches to get to the same result.
Here we start with the original ungrouped dataset, applying the group_by()
, then use the n()
function inside a summarise()
to count the number of cases (or observations).
## # A tibble: 19 × 3
## # Groups: class [7]
## class cyl n
## <chr> <int> <int>
## 1 2seater 8 5
## 2 compact 4 32
## 3 compact 5 2
## 4 compact 6 13
## 5 midsize 4 16
## 6 midsize 6 23
## 7 midsize 8 2
## 8 minivan 4 1
## 9 minivan 6 10
## 10 pickup 4 3
## 11 pickup 6 10
## 12 pickup 8 20
## 13 subcompact 4 21
## 14 subcompact 5 2
## 15 subcompact 6 7
## 16 subcompact 8 5
## 17 suv 4 8
## 18 suv 6 16
## 19 suv 8 38
To get a sorted result, we would need to add an arrange(desc())
function to our pipe:
## # A tibble: 19 × 3
## # Groups: class [7]
## class cyl n
## <chr> <int> <int>
## 1 suv 8 38
## 2 compact 4 32
## 3 midsize 6 23
## 4 subcompact 4 21
## 5 pickup 8 20
## 6 midsize 4 16
## 7 suv 6 16
## 8 compact 6 13
## 9 minivan 6 10
## 10 pickup 6 10
## 11 suv 4 8
## 12 subcompact 6 7
## 13 2seater 8 5
## 14 subcompact 8 5
## 15 pickup 4 3
## 16 compact 5 2
## 17 midsize 8 2
## 18 subcompact 5 2
## 19 minivan 4 1
Another approach to count the number of rows in each group is to use the tally()
function, which takes the place of the summarize(n = n())
function.
## # A tibble: 19 × 3
## # Groups: class [7]
## class cyl n
## <chr> <int> <int>
## 1 2seater 8 5
## 2 compact 4 32
## 3 compact 5 2
## 4 compact 6 13
## 5 midsize 4 16
## 6 midsize 6 23
## 7 midsize 8 2
## 8 minivan 4 1
## 9 minivan 6 10
## 10 pickup 4 3
## 11 pickup 6 10
## 12 pickup 8 20
## 13 subcompact 4 21
## 14 subcompact 5 2
## 15 subcompact 6 7
## 16 subcompact 8 5
## 17 suv 4 8
## 18 suv 6 16
## 19 suv 8 38
If we want to sort in the tally()
function, we can add a sort = TRUE
argument. Note that this behaves differently than arrange()
; sort = TRUE
is largest to smallest.
## # A tibble: 19 × 3
## # Groups: class [7]
## class cyl n
## <chr> <int> <int>
## 1 suv 8 38
## 2 compact 4 32
## 3 midsize 6 23
## 4 subcompact 4 21
## 5 pickup 8 20
## 6 midsize 4 16
## 7 suv 6 16
## 8 compact 6 13
## 9 minivan 6 10
## 10 pickup 6 10
## 11 suv 4 8
## 12 subcompact 6 7
## 13 2seater 8 5
## 14 subcompact 8 5
## 15 pickup 4 3
## 16 compact 5 2
## 17 midsize 8 2
## 18 subcompact 5 2
## 19 minivan 4 1
The most streamlined way we could get to our grouped summary table with counts is by using the count()
function. This function merges both the grouping and tallying functions.
The categories we want to count become the arguments of the count()
function.
mpg |>
count(class, cyl)
## # A tibble: 19 × 3
## class cyl n
## <chr> <int> <int>
## 1 2seater 8 5
## 2 compact 4 32
## 3 compact 5 2
## 4 compact 6 13
## 5 midsize 4 16
## 6 midsize 6 23
## 7 midsize 8 2
## 8 minivan 4 1
## 9 minivan 6 10
## 10 pickup 4 3
## 11 pickup 6 10
## 12 pickup 8 20
## 13 subcompact 4 21
## 14 subcompact 5 2
## 15 subcompact 6 7
## 16 subcompact 8 5
## 17 suv 4 8
## 18 suv 6 16
## 19 suv 8 38
Similar to tally, we can include a sort = TRUE
argument.
mpg |>
count(class, cyl, sort = TRUE)
## # A tibble: 19 × 3
## class cyl n
## <chr> <int> <int>
## 1 suv 8 38
## 2 compact 4 32
## 3 midsize 6 23
## 4 subcompact 4 21
## 5 pickup 8 20
## 6 midsize 4 16
## 7 suv 6 16
## 8 compact 6 13
## 9 minivan 6 10
## 10 pickup 6 10
## 11 suv 4 8
## 12 subcompact 6 7
## 13 2seater 8 5
## 14 subcompact 8 5
## 15 pickup 4 3
## 16 compact 5 2
## 17 midsize 8 2
## 18 subcompact 5 2
## 19 minivan 4 1
Without the grouping variables, these three statements produce the same result:
## # A tibble: 1 × 1
## n
## <int>
## 1 234
mpg |> tally()
## # A tibble: 1 × 1
## n
## <int>
## 1 234
mpg |> count()
## # A tibble: 1 × 1
## n
## <int>
## 1 234
There are other counting functions within {dplyr}, see the reference page https://dplyr.tidyverse.org/reference/count.html for more details.
We will return to data wrangling later in the course, and explore more complex calculations.
5.9 Take aways
Extract cases with
filter()
. For filtering, we have to specify the values we want and the variable that they are in.Choose the rows we want from our dataframes with
select()
. To select the columns we want, we have to specify the variable(s) that we want.Create new variables, with
mutate()
Filter using a list with
%in%
Connect operations with the pipe symbol
|>
Make tables of summaries with
summarise()
Do groupwise operations with
group_by()
5.10 Functions for data wrangling
5.10.1 Reference: {dplyr} functions
The functions in {dplyr} (and many other packages) are verbs that describe what they do.
function | action |
---|---|
filter() |
keep rows |
mutate() |
create a new variable |
group_by() |
declare subsets in data |
select() |
keep variables (or drop them -var or re-order them) |
rename() |
renaming variables |
summarize() |
summarize the data, by groups if they have been declared |
distinct() |
returns only rows that are unique |
case_when() |
is used for “recoding” variable, often used with mutate() |
function | action |
---|---|
length dplyr::count()
|
number of cases |
tally() |
counting (by groups if group_by() applied) |
The full list of {dplyr} functions is here: https://dplyr.tidyverse.org/reference/index.html
5.10.2 Reference: Descriptive statistics
These statistics can all be generated individually, using the functions in the table below. Other statistical measures about a variable can be calculated with the appropriate function. Note that these are just a few of the functions that are included in base R; there are also other statistical measures available through various packages.
statistic | function |
---|---|
mean | mean() |
median | median() |
mode | mode() |
minimum and maximum |
min() max()
|
range | range() |
quantile |
quantile() (default: quartiles) |
— | — |
sum | sum() |
standard deviation | sd() |
variance | var() |
kurtosis | kurtosis() |
skewness | skewness() |
correlation coefficient | cor() |
correlation coefficient (two variables) | cor() |
— | — |
count (number of cases) | length() |
number of unique cases | n_distinct() |
5.10.3 Reference: Boolean operators
Boolean operators are a special type of operator that return TRUE or FALSE. They are used when we want to compare values.
Operator | Description |
---|---|
== | equal, tests equality |
!= | not equal, tests inequality |
> | greater than, tests greater than (also >=) |
< | less than, tests less than (also <=) |
%in% | contains, tests inclusion |
Boolean operators can be combined with and or or.
Operator | Description |
---|---|
& |
and, returns true if preceeding and following are both true, else false |
| | or, returns true if either preceeding and following are true, else false |
.footnote[see Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund R for Data Science, 2nd ed., “Comparisions”” and “Boolean algebra”.]
-30-