4 Introduction to data wrangling

Data is abundant, and as data analysts we often have to spend time working with our data tables to get it ready for whatever analysis we need to do. “Data wrangling” and “data transformation” are two terms that are used to describe the process of altering the structure of our data.2

There are three broad categories of data transformation:

  • alter the structure of the existing data tables. This could be through adding or removing rows or columns to our data table, including calculation of new variables, or importation of additional rows (observations) and joining two tables together.

  • summarize the content of the source data tables. For example, if our data table has the population of all the countries in the world, we might want a summary table that has the table with the total for each continent.

  • with both our existing data or our summary tables, we may want change the layout, either for analysis purposes or to make it easier for human readers to absorb the information.

In this section, we will begin to work with all three transformations.

4.0.1 Reading: Data wrangling

Please refer to R for Data Science, Data transformation for more information about this section.

For this section, and for other exercises throughout this course, we will be using a subset of the Gapminder data. These data tables contain information about the countries of the world, and were used by Hans Rosling to explain that there have been significant changes in the quality of people’s lives around the world.

4.1 Transform Data: 1a Alter the structure

dplyr hex

The {dplyr} reference site

Data Transformation Cheat Sheet (PDF file)

When working with data, particularly large data sets, you encounter a situation where you need to:

  • subset the data so that it contains only those records (rows) you are interested in (Why deal with a data table that has every province in Canada when you just need the British Columbia data?)

  • subset the data so that it contains only those variables (columns) you are interested in (Why keep looking at variables that do not factor into your analysis, or you will not print in your report?)

  • create new variables, often through calculations based on variables in your data

To achieve these three goals, we will be using functions from the tidyverse package {dplyr}. The functions are verbs that describe what they do. Here’s a list of the {dplyr} functions that you’ll be using:

function action
filter() keep rows
select() keep variables (or drop them -var)
mutate() create a new variable
arrange() sort from smallest to largest
arrange(desc()) sort from largest to smallest

Other operators

In addition to these functions, R has a number of operators that add significantly to your code.

Arithmetic operators

Operator Description
+ plus
- minus
* multiplication
/ division
^ exponential

Boolean operators are a special type of operator that return TRUE or FALSE.

Operator Description
== equal, tests equality
!= not equal, tests inequality
> greater than, tests greater than (also >=)
< less than, tests less than (also <=)
%in% contains, tests inclusion

Boolean operators can be combined with and or or.

Operator Description
& and, returns true if preceding and following are both true, else false
| or, returns true if either preceding and following are true, else false

see R for Data Science, “Comparisions” and “Logical Operators”.]

4.2 Introducing the pipe operator %>%

(see R for Data Science, Pipes for a deep dive into pipes)

The pipe operator %>% allows us to name the data frame first, and then have the next function work with the outcome of the first line.

For example, we could write

function(data, argument)

or, with a pipe

data %>% function(argument)

Read the pipe symbol as “then”. We start with our data, then we apply the function.

If we want to do anything with the output from the first statement, we have to assign it to an intermediate object, and then use that intermediate object in the second statement. With the pipe, we can add another pipe and the results of the first statement are passed to the second statement without the intermediate object.

As we get to increasingly complex statements, the power of this will become more evident.

This code calculates the mean of life expectancy (the lifeExp variable) in the gapminder data (note that we are accessing the lifeExp variable by adding the dollar sign to the dataframe name).

mean(gapminder$lifeExp)
## [1] 59.47444

Another way to do this would be using the pipe operator. The symbol is %>% and links one line of script to the next. The way to read this is as meaning “and then…”

At this point, we only have one step in our chain, but in future code, you might have multiple steps linked one after the other.

So this little chunk would be read as

  • we start with the dataframe “exp_dem”, “and then…”

  • we calculate a variable called “mean_of_expense” by running the mean() function on the variable “Expense”

gapminder %>% 
  summarise(mean_of_life_expectancy = mean(lifeExp))
## # A tibble: 1 × 1
##   mean_of_life_expectancy
##                     <dbl>
## 1                    59.5

4.3 Creating lists

R allows us to create lists using the c() function.

A list can be referenced in a filter() statement using the %in% as the evaluator.

In this example, we filter to get three countries, “New Zealand”, “Canada”, and “United States”.

Note that to get all three, the statement is OR, not AND! There are no cases where country == "New Zealand" and simultaneously country == "Canada"

gapminder %>% 
  filter(country == "New Zealand" |
           country == "Canada" |
           country == "United States")
## # A tibble: 36 × 6
##    country continent  year lifeExp      pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Canada  Americas   1952    68.8 14785584    11367.
##  2 Canada  Americas   1957    70.0 17010154    12490.
##  3 Canada  Americas   1962    71.3 18985849    13462.
##  4 Canada  Americas   1967    72.1 20819767    16077.
##  5 Canada  Americas   1972    72.9 22284500    18971.
##  6 Canada  Americas   1977    74.2 23796400    22091.
##  7 Canada  Americas   1982    75.8 25201900    22899.
##  8 Canada  Americas   1987    76.9 26549700    26627.
##  9 Canada  Americas   1992    78.0 28523502    26343.
## 10 Canada  Americas   1997    78.6 30305843    28955.
## # … with 26 more rows

Instead of this rather verbose syntax, we can achieve the same result by creating a list.

# answer
gapminder %>%
  filter(country %in% c("New Zealand", "Canada", "United States"))
## # A tibble: 36 x 6
##    country continent  year lifeExp      pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Canada  Americas   1952    68.8 14785584    11367.
##  2 Canada  Americas   1957    70.0 17010154    12490.
##  3 Canada  Americas   1962    71.3 18985849    13462.
##  4 Canada  Americas   1967    72.1 20819767    16077.
##  5 Canada  Americas   1972    72.9 22284500    18971.
##  6 Canada  Americas   1977    74.2 23796400    22091.
##  7 Canada  Americas   1982    75.8 25201900    22899.
##  8 Canada  Americas   1987    76.9 26549700    26627.
##  9 Canada  Americas   1992    78.0 28523502    26343.
## 10 Canada  Americas   1997    78.6 30305843    28955.
## # ... with 26 more rows

Another option would be to create a named object that contains our list, and then use the named object in our filter().

# alternate answer
country_list <- c("New Zealand", "Canada", "United States")   # create object that is list of countries

gapminder %>%
  filter(country %in% country_list)                           # filter uses that object 
## # A tibble: 36 × 6
##    country continent  year lifeExp      pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Canada  Americas   1952    68.8 14785584    11367.
##  2 Canada  Americas   1957    70.0 17010154    12490.
##  3 Canada  Americas   1962    71.3 18985849    13462.
##  4 Canada  Americas   1967    72.1 20819767    16077.
##  5 Canada  Americas   1972    72.9 22284500    18971.
##  6 Canada  Americas   1977    74.2 23796400    22091.
##  7 Canada  Americas   1982    75.8 25201900    22899.
##  8 Canada  Americas   1987    76.9 26549700    26627.
##  9 Canada  Americas   1992    78.0 28523502    26343.
## 10 Canada  Americas   1997    78.6 30305843    28955.
## # … with 26 more rows

This final option has a significant benefit. Now if we need to use that list of countries in another part of our analysis, we can just summon the object, rather than typing the entire list again.

This also reduces the risk of inadvertent error if we want to change the countries we are analyzing. If we have a named list object, we only need to change it in the one place, and that change will be incorporated throughout. Contrast that with a find-and-replace approach, where you might accidentally miss one of the places it appears, or you might have a typo in one place.

NOTE: Watch your quotation marks! In many cases double and single quotes are interchangable, but fancy “smart” quotes to start and end are not valid.

This example uses the flights table from the {nycflights13} package, and filters for Alaska Airlines Inc., which has the value “AS” in the variable “carrier”.

# will not work due to "smart" quotes
nycflights13::flights %>% filter(carrier == “AS”)

# will work...double quotes
nycflights13::flights %>% filter(carrier == "AS")

# will work...single quotes
nycflights13::flights %>% filter(carrier == 'AS')

4.4 Select columns

You’ve already seen filter() to choose particular rows in your dataframe. What if you want to remove some columns (variables)?

Here we will use select() to reduce the number of variables in our gapminder table to “country”, “year”, and “lifeExp”:

gapminder %>%                                
  select(country, year, lifeExp) 
## # A tibble: 1,704 × 3
##    country      year lifeExp
##    <fct>       <int>   <dbl>
##  1 Afghanistan  1952    28.8
##  2 Afghanistan  1957    30.3
##  3 Afghanistan  1962    32.0
##  4 Afghanistan  1967    34.0
##  5 Afghanistan  1972    36.1
##  6 Afghanistan  1977    38.4
##  7 Afghanistan  1982    39.9
##  8 Afghanistan  1987    40.8
##  9 Afghanistan  1992    41.7
## 10 Afghanistan  1997    41.8
## # … with 1,694 more rows

You can use the minus sign to drop variable–that is, to define ones to not select.

In this example, note we have to use c() to create a list:

gapminder %>%                                
  select(-c(continent, pop, gdpPercap))
## # A tibble: 1,704 × 3
##    country      year lifeExp
##    <fct>       <int>   <dbl>
##  1 Afghanistan  1952    28.8
##  2 Afghanistan  1957    30.3
##  3 Afghanistan  1962    32.0
##  4 Afghanistan  1967    34.0
##  5 Afghanistan  1972    36.1
##  6 Afghanistan  1977    38.4
##  7 Afghanistan  1982    39.9
##  8 Afghanistan  1987    40.8
##  9 Afghanistan  1992    41.7
## 10 Afghanistan  1997    41.8
## # … with 1,694 more rows

You can define a range of columns to select:

gapminder %>%
  select(country, year:pop)
## # A tibble: 1,704 × 4
##    country      year lifeExp      pop
##    <fct>       <int>   <dbl>    <int>
##  1 Afghanistan  1952    28.8  8425333
##  2 Afghanistan  1957    30.3  9240934
##  3 Afghanistan  1962    32.0 10267083
##  4 Afghanistan  1967    34.0 11537966
##  5 Afghanistan  1972    36.1 13079460
##  6 Afghanistan  1977    38.4 14880372
##  7 Afghanistan  1982    39.9 12881816
##  8 Afghanistan  1987    40.8 13867957
##  9 Afghanistan  1992    41.7 16317921
## 10 Afghanistan  1997    41.8 22227415
## # … with 1,694 more rows

4.5 Functions for data wrangling

4.5.1 Reference: {dplyr} functions

The functions in {dplyr} (and many other packages) are verbs that describe what they do.

function action
filter() keep rows
mutate() create a new variable
group_by() declare subsets in data
select() keep variables (or drop them -var)
rename() renaming variables
summarize() summarize the data, by groups if they have been declared
distinct() returns only rows that are unique
case_when() is used for “recoding” variable, often used with mutate()
function action
length
dplyr::count()
number of cases
tally() counting (by groups if group_by() applied)

The full list of {dplyr} functions is here: https://dplyr.tidyverse.org/reference/index.html

4.5.2 Reference: Boolean operators

Boolean operators are a special type of operator that return TRUE or FALSE. They are used when we want to compare values.

Operator Description
== equal, tests equality
!= not equal, tests inequality
> greater than, tests greater than (also >=)
< less than, tests less than (also <=)
%in% contains, tests inclusion

Boolean operators can be combined with and or or.

Operator Description
& and, returns true if preceeding and following are both true, else false
| or, returns true if either preceeding and following are true, else false

.footnote[see R for Data Science, “Comparisions”” and “Logical Operators”.]

4.6 Descriptive statistics

R has a built-in function to generate some descriptive (or summary) statistics. The first is summary() which gives a few key statistics about the distribution.

summary(gapminder)
##         country        continent        year         lifeExp           pop           
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60   Min.   :6.001e+04  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20   1st Qu.:2.794e+06  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71   Median :7.024e+06  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47   Mean   :2.960e+07  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85   3rd Qu.:1.959e+07  
##  Australia  :  12                  Max.   :2007   Max.   :82.60   Max.   :1.319e+09  
##  (Other)    :1632                                                                    
##    gdpPercap       
##  Min.   :   241.2  
##  1st Qu.:  1202.1  
##  Median :  3531.8  
##  Mean   :  7215.3  
##  3rd Qu.:  9325.5  
##  Max.   :113523.1  
## 

These statistics can all be generated individually, using the functions in the table below. Other statistical measures about a variable can be calculated with the appropriate function. Note that these are just a few of the functions that are included in base R; there are also other statistical measures available through various packages.

statistic function
mean mean()
median median()
mode mode()
minimum and maximum min() max()
range range()
quantile quantile() (default: quartiles)
sum sum()
standard deviation sd()
variance var()
kurtosis kurtosis()
skewness skewness()
correlation coefficient cor()
correlation coefficient (two variables) cor()
count (number of cases) length()
number of unique cases n_distinct()

4.7 Tangent: Arguments within functions

Virtually every function we use has default arguments, and options to change those. This allows us to use one function for a variety of circumstances. Instead of having separate functions to split a variable into its quartiles (four parts), quintiles (five parts), deciles (ten parts), and so on, there is one function that uses the generic name for this process: quantile.

The default setting of the quantile() function splits the distribution into four parts, known as quartiles. This is the most common quantile used to summarize a distribution.

quantile(gapminder$lifeExp)
##      0%     25%     50%     75%    100% 
## 23.5990 48.1980 60.7125 70.8455 82.6030

If you want other groupings, the function has arguments that allow you define those. For example, if you want deciles—10 groups, each with 10% of the cases—the second example shows how to do that.

# deciles
quantile(gapminder$lifeExp, probs = seq(0, 1, 0.10))
##      0%     10%     20%     30%     40%     50%     60%     70%     80%     90%    100% 
## 23.5990 41.5108 45.8992 50.6021 55.7292 60.7125 66.0814 69.7465 72.0288 75.0970 82.6030

4.8 Transform Data: 1b Add new variables

mutate()

You can calculate new variables that may be useful in your analysis

vectors of the same length

For this example, we will use the “mpg” data set, which is a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov. For more information about the data set, see the reference page for the dataset: https://ggplot2.tidyverse.org/reference/mpg.html

The first thing we want to do is to take a quick look at the data so we get a sense of its structure, using the ls.str() function:

ls.str(mpg)
## class :  chr [1:234] "compact" "compact" "compact" "compact" "compact" "compact" "compact" "compact" ...
## cty :  int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## cyl :  int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## displ :  num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## drv :  chr [1:234] "f" "f" "f" "f" "f" "f" "f" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "r" "r" ...
## fl :  chr [1:234] "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "r" "e" ...
## hwy :  int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## manufacturer :  chr [1:234] "audi" "audi" "audi" "audi" "audi" "audi" "audi" "audi" "audi" "audi" "audi" ...
## model :  chr [1:234] "a4" "a4" "a4" "a4" "a4" "a4" "a4" "a4 quattro" "a4 quattro" "a4 quattro" ...
## trans :  chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" "auto(l5)" "manual(m5)" "auto(av)" ...
## year :  int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...

We can also view the data as a table:

mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans      drv     cty   hwy fl    class  
##    <chr>        <chr>      <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>  
##  1 audi         a4           1.8  1999     4 auto(l5)   f        18    29 p     compact
##  2 audi         a4           1.8  1999     4 manual(m5) f        21    29 p     compact
##  3 audi         a4           2    2008     4 manual(m6) f        20    31 p     compact
##  4 audi         a4           2    2008     4 auto(av)   f        21    30 p     compact
##  5 audi         a4           2.8  1999     6 auto(l5)   f        16    26 p     compact
##  6 audi         a4           2.8  1999     6 manual(m5) f        18    26 p     compact
##  7 audi         a4           3.1  2008     6 auto(av)   f        18    27 p     compact
##  8 audi         a4 quattro   1.8  1999     4 manual(m5) 4        18    26 p     compact
##  9 audi         a4 quattro   1.8  1999     4 auto(l5)   4        16    25 p     compact
## 10 audi         a4 quattro   2    2008     4 manual(m6) 4        20    28 p     compact
## # … with 224 more rows

Vector of the length = 1

As part of our analysis, we want to understand the ratio of gas consumption to the size of the engine, measured as the variable displ (short for “displacement”). To calculate that ratio, we divide the highway consumption (hwy) by the displacement.

The {dplyr} function to create a new variable is mutate(). This is shown in the code below:

mpg <- mpg %>%
  mutate(mpg_per_cubic_litre = hwy / displ)

Notice that there’s now a new variable added to the data called mpg_per_cubic_litre.

In that example, the two columns are of the same length—for each car (each row in the data) there is a hwy value and a displ value.

In other instances, we might want to have a calculation using a constant—we might want to multiply every one of our cases by 2, or add 10 to each.

In the example below, we will add a new variable to our data set using mutate().

This will convert miles per US gallon to litres per 100 kilometers. The first step is to create a constant value, based on the number of gallons in a litre and the number of kilometers in a mile.

Once we have that constant, we can use it in an equation to calculate the litres per 100 kilometers for every car in the data.

# calculate conversion constant
# a US gallon is 3.785 litres
# a kilometer is 1.61 miles
lp100km <- (100 * 3.785) / (1.61)
lp100km
## [1] 235.0932
# divide constant by hwy mpg to get hwy litres per 100 km
mpg %>%
  mutate(hwy_lp100km = lp100km / hwy)
## # A tibble: 234 × 13
##    manufacturer model      displ  year   cyl trans    drv     cty   hwy fl    class mpg_per_cubic_l…
##    <chr>        <chr>      <dbl> <int> <int> <chr>    <chr> <int> <int> <chr> <chr>            <dbl>
##  1 audi         a4           1.8  1999     4 auto(l5) f        18    29 p     comp…            16.1 
##  2 audi         a4           1.8  1999     4 manual(… f        21    29 p     comp…            16.1 
##  3 audi         a4           2    2008     4 manual(… f        20    31 p     comp…            15.5 
##  4 audi         a4           2    2008     4 auto(av) f        21    30 p     comp…            15   
##  5 audi         a4           2.8  1999     6 auto(l5) f        16    26 p     comp…             9.29
##  6 audi         a4           2.8  1999     6 manual(… f        18    26 p     comp…             9.29
##  7 audi         a4           3.1  2008     6 auto(av) f        18    27 p     comp…             8.71
##  8 audi         a4 quattro   1.8  1999     4 manual(… 4        18    26 p     comp…            14.4 
##  9 audi         a4 quattro   1.8  1999     4 auto(l5) 4        16    25 p     comp…            13.9 
## 10 audi         a4 quattro   2    2008     4 manual(… 4        20    28 p     comp…            14   
## # … with 224 more rows, and 1 more variable: hwy_lp100km <dbl>

4.9 Transform Data 2: summarize

Everything we have done so far has been with to manipulate the entire data file (filtering and selecting), or calculating descriptive statistics for all the values in our data. Often we need to compare within our data:

  • What was the median life expectancy in African countries in 2007?

  • What was the median life expectancy by continent in 2007?

  • What was the median life expectancy by continent for each year in the data?

To achieve the first of these things, we can filter the data to get only the African countries and 2007.

But for the second and third, we need to group our data by continent and year, and then summarize those groups.

The R package {dplyr} has a grouping function group_by(). Once the data is grouped, we apply a summarize() (or summarise() if you prefer!) function with our descriptive statistics.

Take a quick look at the data in the {gapminder} package.

gapminder
## # A tibble: 1,704 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # … with 1,694 more rows

Use summarize() or summarise() to compute three statistics about the data:

  • The first (minimum) year in the dataset
  • The last (maximum) year in the dataset
  • The number of unique countries
# answer
gapminder %>%
  summarize(year_min = min(year),
            year_max = max(year),
            n_countries = n_distinct(country))
## # A tibble: 1 × 3
##   year_min year_max n_countries
##      <int>    <int>       <int>
## 1     1952     2007         142

Extract the rows for African countries in 2007.

Then find:

  1. The number of unique countries
  2. The median life expectancy of African countries as a group
# answer
gapminder %>% 
  filter(continent == "Africa" & year == 2007) %>% 
  summarise(n_countries = n_distinct(country),
            lifeExp_med = median(lifeExp))
## # A tibble: 1 × 2
##   n_countries lifeExp_med
##         <int>       <dbl>
## 1          52        52.9

4.9.1 Grouping

We can use the {dplyr} function group_by() to define our grouping variables. If we group all of the countries by the continent they are part of, we can calculate a statistic for the whole continent.

Find the median life expectancy by continent in 2007.

# answer
gapminder %>%
  filter(year == 2007) %>% 
  group_by(continent) %>%
  summarize(lifeExp_mean = mean(lifeExp))
## # A tibble: 5 × 2
##   continent lifeExp_mean
##   <fct>            <dbl>
## 1 Africa            54.8
## 2 Americas          73.6
## 3 Asia              70.7
## 4 Europe            77.6
## 5 Oceania           80.7

4.9.1.1 Grouping with multiple variables

We can group by multiple variables.

Find the median life expectancy by continent for each year in the dataframe.

gapminder %>%
  group_by(continent, year) %>%
  summarize(lifeExp_med = median(lifeExp))
## # A tibble: 60 × 3
## # Groups:   continent [5]
##    continent  year lifeExp_med
##    <fct>     <int>       <dbl>
##  1 Africa     1952        38.8
##  2 Africa     1957        40.6
##  3 Africa     1962        42.6
##  4 Africa     1967        44.7
##  5 Africa     1972        47.0
##  6 Africa     1977        49.3
##  7 Africa     1982        50.8
##  8 Africa     1987        51.6
##  9 Africa     1992        52.4
## 10 Africa     1997        52.8
## # … with 50 more rows

Create a summary table with the population and GDP by continent for the year 1952

# answer
gapminder %>%
  filter(year == 1952) %>%               # filter to get the records we need
  mutate(totalGDP = pop * gdpPercap) %>% # note that we need to add the new variable calculated above!
  group_by(continent) %>%                # this defines the grouping category
  summarize(pop = sum(pop),              # 
            totalGDP = sum(totalGDP))    #
## # A tibble: 5 × 3
##   continent        pop totalGDP
##   <fct>          <int>    <dbl>
## 1 Africa     237640501  3.12e11
## 2 Americas   345152446  2.94e12
## 3 Asia      1395357351  1.13e12
## 4 Europe     418120846  2.55e12
## 5 Oceania     10686006  1.08e11

4.9.2 Counting the number of cases

A common step in a data analysis project is to count the number of cases in our dataset, often by groups.

For these examples, we will use the mpg dataset, and produce a summary table that shows the number of vehicles by class (compact, suv, etc) and number of cylinders in the engine (the variable cyl).

In this section, we will look at three different approaches to get to the same result.

4.9.2.1 n() ·

Here we start with the original ungrouped dataset, applying the group_by(), then use the n() function inside a summarise() to count the number of cases (or observations).

mpg %>% 
  group_by(class, cyl) %>% 
  summarise(n = n()) 
## # A tibble: 19 × 3
## # Groups:   class [7]
##    class        cyl     n
##    <chr>      <int> <int>
##  1 2seater        8     5
##  2 compact        4    32
##  3 compact        5     2
##  4 compact        6    13
##  5 midsize        4    16
##  6 midsize        6    23
##  7 midsize        8     2
##  8 minivan        4     1
##  9 minivan        6    10
## 10 pickup         4     3
## 11 pickup         6    10
## 12 pickup         8    20
## 13 subcompact     4    21
## 14 subcompact     5     2
## 15 subcompact     6     7
## 16 subcompact     8     5
## 17 suv            4     8
## 18 suv            6    16
## 19 suv            8    38

To get a sorted result, we would need to add an arrange(desc()) function to our pipe:

mpg %>% 
  group_by(class, cyl) %>% 
  summarise(n = n()) %>% 
  arrange(desc(n))
## # A tibble: 19 × 3
## # Groups:   class [7]
##    class        cyl     n
##    <chr>      <int> <int>
##  1 suv            8    38
##  2 compact        4    32
##  3 midsize        6    23
##  4 subcompact     4    21
##  5 pickup         8    20
##  6 midsize        4    16
##  7 suv            6    16
##  8 compact        6    13
##  9 minivan        6    10
## 10 pickup         6    10
## 11 suv            4     8
## 12 subcompact     6     7
## 13 2seater        8     5
## 14 subcompact     8     5
## 15 pickup         4     3
## 16 compact        5     2
## 17 midsize        8     2
## 18 subcompact     5     2
## 19 minivan        4     1

4.9.2.2 tally() ·

Another approach to count the number of rows in each group is to use the tally() function, which takes the place of the summarize(n = n()) function.

mpg %>% 
  group_by(class, cyl) %>% 
  tally()
## # A tibble: 19 × 3
## # Groups:   class [7]
##    class        cyl     n
##    <chr>      <int> <int>
##  1 2seater        8     5
##  2 compact        4    32
##  3 compact        5     2
##  4 compact        6    13
##  5 midsize        4    16
##  6 midsize        6    23
##  7 midsize        8     2
##  8 minivan        4     1
##  9 minivan        6    10
## 10 pickup         4     3
## 11 pickup         6    10
## 12 pickup         8    20
## 13 subcompact     4    21
## 14 subcompact     5     2
## 15 subcompact     6     7
## 16 subcompact     8     5
## 17 suv            4     8
## 18 suv            6    16
## 19 suv            8    38

If we want to sort in the tally() function, we can add a sort = TRUE argument. Note that this behaves differently than arrange(); sort = TRUE is largest to smallest.

mpg %>% 
  group_by(class, cyl) %>% 
  tally(sort = TRUE)
## # A tibble: 19 × 3
## # Groups:   class [7]
##    class        cyl     n
##    <chr>      <int> <int>
##  1 suv            8    38
##  2 compact        4    32
##  3 midsize        6    23
##  4 subcompact     4    21
##  5 pickup         8    20
##  6 midsize        4    16
##  7 suv            6    16
##  8 compact        6    13
##  9 minivan        6    10
## 10 pickup         6    10
## 11 suv            4     8
## 12 subcompact     6     7
## 13 2seater        8     5
## 14 subcompact     8     5
## 15 pickup         4     3
## 16 compact        5     2
## 17 midsize        8     2
## 18 subcompact     5     2
## 19 minivan        4     1

4.9.2.3 count() ·

The most streamlined way we could get to our grouped summary table with counts is by using the count() function. This function merges both the grouping and tallying functions.

The categories we want to count become the arguments of the count() function.

mpg %>% 
  count(class, cyl)
## # A tibble: 19 × 3
##    class        cyl     n
##    <chr>      <int> <int>
##  1 2seater        8     5
##  2 compact        4    32
##  3 compact        5     2
##  4 compact        6    13
##  5 midsize        4    16
##  6 midsize        6    23
##  7 midsize        8     2
##  8 minivan        4     1
##  9 minivan        6    10
## 10 pickup         4     3
## 11 pickup         6    10
## 12 pickup         8    20
## 13 subcompact     4    21
## 14 subcompact     5     2
## 15 subcompact     6     7
## 16 subcompact     8     5
## 17 suv            4     8
## 18 suv            6    16
## 19 suv            8    38

Similar to tally, we can include a sort = TRUE argument.

mpg %>% 
  count(class, cyl, sort = TRUE)
## # A tibble: 19 × 3
##    class        cyl     n
##    <chr>      <int> <int>
##  1 suv            8    38
##  2 compact        4    32
##  3 midsize        6    23
##  4 subcompact     4    21
##  5 pickup         8    20
##  6 midsize        4    16
##  7 suv            6    16
##  8 compact        6    13
##  9 minivan        6    10
## 10 pickup         6    10
## 11 suv            4     8
## 12 subcompact     6     7
## 13 2seater        8     5
## 14 subcompact     8     5
## 15 pickup         4     3
## 16 compact        5     2
## 17 midsize        8     2
## 18 subcompact     5     2
## 19 minivan        4     1

4.9.2.4 counting wrapup

Without the grouping variables, these three statements produce the same result:

mpg %>% summarise(n = n())
## # A tibble: 1 × 1
##       n
##   <int>
## 1   234
mpg %>% tally()
## # A tibble: 1 × 1
##       n
##   <int>
## 1   234
mpg %>% count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1   234

There are other counting functions within {dplyr}, see the reference page https://dplyr.tidyverse.org/reference/count.html for more details.

We will return to data wrangling later in the course, and explore more complex calculations.

4.10 Take aways

  • Extract cases with filter(). For filtering, we have to specify the values we want and the variable that they are in.

  • Choose the rows we want from our dataframes with select(). To select the columns we want, we have to specify the variable(s) that we want.

  • Create new variables, with mutate()

  • Filter using a list with %in%

  • Connect operations with the pipe symbol %>%

  • Make tables of summaries with summarise()

  • Do groupwise operations with group_by()

  • Count the number of cases with tally() and count()

-30-