5 Introduction to data wrangling

5.1 Setup

This chunk of R code loads the packages that we will be using.

Data is abundant, and as data analysts we often have to spend time working with our data tables to get it ready for whatever analysis we need to do. “Data wrangling” and “data transformation” are two terms that are used to describe the process of altering the structure of our data.3

There are three broad categories of data transformation:

  • alter the structure of the existing data tables. This could be through adding or removing rows or columns to our data table, including calculation of new variables, or importation of additional rows (observations) and joining two tables together.

  • summarize the content of the source data tables. For example, if our data table has the population of all the countries in the world, we might want a summary table that has the table with the total for each continent.

  • with both our existing data or our summary tables, we may want change the layout, either for analysis purposes or to make it easier for human readers to absorb the information.

In this section, we will begin to work with all three transformations.

5.1.1 Reading: Data wrangling

Please refer to Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund R for Data Science, 2nd ed., Data transformation for more information about this section.

For this section, and for other exercises throughout this course, we will be using a subset of the Gapminder data. These data tables contain information about the countries of the world, and were used by Hans Rosling to explain that there have been significant changes in the quality of people’s lives around the world.

5.2 Transform Data: 1a Alter the structure

dplyr hex
dplyr hex

The {dplyr} reference site

Data Transformation Cheat Sheet (PDF file)

When working with data, particularly large data sets, you encounter a situation where you need to:

  • subset the data so that it contains only those records (rows) you are interested in (Why deal with a data table that has every province in Canada when you just need the British Columbia data?)

  • subset the data so that it contains only those variables (columns) you are interested in (Why keep looking at variables that do not factor into your analysis, or you will not print in your report?)

  • create new variables, often through calculations based on variables in your data

To achieve these three goals, we will be using functions from the tidyverse package {dplyr}. The functions are verbs that describe what they do. Here’s a list of the {dplyr} functions that you’ll be using:

function action
filter() keep rows
select() keep variables (or drop them -var or re-order them)
mutate() create a new variable
arrange() sort from smallest to largest
arrange(desc()) sort from largest to smallest

Other operators

In addition to these functions, R has a number of operators that add significantly to your code.

Arithmetic operators

Operator Description
+ plus
- minus
* multiplication
/ division
^ exponential

Boolean operators are a special type of operator that return TRUE or FALSE.

Operator Description
== equal, tests equality
!= not equal, tests inequality
> greater than, tests greater than (also >=)
< less than, tests less than (also <=)
%in% contains, tests inclusion

Boolean operators can be combined with and or or.

Operator Description
& and, returns true if preceding and following are both true, else false
| or, returns true if either preceding and following are true, else false

see Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund R for Data Science, 2nd ed., “Comparisions” and “Boolean algebra”.]

5.3 Introducing the pipe operator |>

(see R for Data Science, 2nd ed., The Pipe for a deep dive into pipes)

The pipe operator |> allows us to name the data frame first, and then have the next function work with the outcome of the first line.

For example, we could write

function(data, argument)

or, with a pipe

data |> function(argument)

Read the pipe symbol as “then”. We start with our data, then we apply the function.

If we want to do anything with the output from the first statement, we have to assign it to an intermediate object, and then use that intermediate object in the second statement. With the pipe, we can add another pipe and the results of the first statement are passed to the second statement without the intermediate object.

As we get to increasingly complex statements, the power of this will become more evident.

This code calculates the mean of life expectancy (the lifeExp variable) in the gapminder data (note that we are accessing the lifeExp variable by adding the dollar sign to the dataframe name).

mean(gapminder$lifeExp)
## [1] 59.47444

Another way to do this would be using the pipe operator. The symbol is |> and links one line of script to the next. The way to read this is as meaning “and then…”

At this point, we only have one step in our chain, but in future code, you might have multiple steps linked one after the other.

So this little chunk would be read as

  • we start with the dataframe “gapminder”, “and then…”

  • we calculate a variable called “mean_of_life_expectancy” by running the mean() function on the variable “lifeExp”

gapminder |> 
  summarise(mean_of_life_expectancy = mean(lifeExp))
## # A tibble: 1 × 1
##   mean_of_life_expectancy
##                     <dbl>
## 1                    59.5

IMPORTANT NOTE The pipe operator |> is in base R, but it wasn’t always. Prior to its introduction there was a pipe symbol available through the {magrittr} package, which looks like this: %>%. You will still see this version in older texts (such as the first edition of R for Data Science) and course materials. It was used in BIDA302, so if you come across older course material, it might use the %>% pipe operator.

5.4 Filter for rows

The calculations above are for the entire dataframe. Often in data analysis, we want to analyze a single group. We might want to examine life expectancy changes in a single country, or for calculate the average across the world for a single year. For this type of analysis, we can use the filter() function.

In the code chunk below, we use the filter() function to get the rows for Canada. This will return the 12 rows that have the records for Canada.

gapminder |> 
  filter(country == "Canada")
## # A tibble: 12 × 6
##    country continent  year lifeExp      pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Canada  Americas   1952    68.8 14785584    11367.
##  2 Canada  Americas   1957    70.0 17010154    12490.
##  3 Canada  Americas   1962    71.3 18985849    13462.
##  4 Canada  Americas   1967    72.1 20819767    16077.
##  5 Canada  Americas   1972    72.9 22284500    18971.
##  6 Canada  Americas   1977    74.2 23796400    22091.
##  7 Canada  Americas   1982    75.8 25201900    22899.
##  8 Canada  Americas   1987    76.9 26549700    26627.
##  9 Canada  Americas   1992    78.0 28523502    26343.
## 10 Canada  Americas   1997    78.6 30305843    28955.
## 11 Canada  Americas   2002    79.8 31902268    33329.
## 12 Canada  Americas   2007    80.7 33390141    36319.

In our next code chunk, we use filter to get the rows for every country, for the single year 2002. Note that because the “year” variable is a numeric type (specifically, an integer) we do not put the year value (2002) inside quotation marks.

gapminder |> 
  filter(year == 2002)
## # A tibble: 142 × 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Afghanistan Asia       2002    42.1  25268405      727.
##  2 Albania     Europe     2002    75.7   3508512     4604.
##  3 Algeria     Africa     2002    71.0  31287142     5288.
##  4 Angola      Africa     2002    41.0  10866106     2773.
##  5 Argentina   Americas   2002    74.3  38331121     8798.
##  6 Australia   Oceania    2002    80.4  19546792    30688.
##  7 Austria     Europe     2002    79.0   8148312    32418.
##  8 Bahrain     Asia       2002    74.8    656397    23404.
##  9 Bangladesh  Asia       2002    62.0 135656790     1136.
## 10 Belgium     Europe     2002    78.3  10311970    30486.
## # ℹ 132 more rows

We can use this filter as the first step in a pipe that allows us to calculate the average life expectancy for all of the countries in 2002.

gapminder |> 
  filter(year == 2002) |> 
  summarise(global_life_expectancy = mean(lifeExp))
## # A tibble: 1 × 1
##   global_life_expectancy
##                    <dbl>
## 1                   65.7

5.5 Creating lists

In this example, we filter to get three countries, “New Zealand”, “Canada”, and “United States”.

Note that to get all three, the statement is OR, not AND! There are no cases where country == "New Zealand" and simultaneously country == "Canada"

gapminder |> 
  filter(country == "New Zealand" |
           country == "Canada" |
           country == "United States")
## # A tibble: 36 × 6
##    country continent  year lifeExp      pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Canada  Americas   1952    68.8 14785584    11367.
##  2 Canada  Americas   1957    70.0 17010154    12490.
##  3 Canada  Americas   1962    71.3 18985849    13462.
##  4 Canada  Americas   1967    72.1 20819767    16077.
##  5 Canada  Americas   1972    72.9 22284500    18971.
##  6 Canada  Americas   1977    74.2 23796400    22091.
##  7 Canada  Americas   1982    75.8 25201900    22899.
##  8 Canada  Americas   1987    76.9 26549700    26627.
##  9 Canada  Americas   1992    78.0 28523502    26343.
## 10 Canada  Americas   1997    78.6 30305843    28955.
## # ℹ 26 more rows

Note as well that the order of the values in the filter() function does not matter—the table that results is in the same order as the source table.

Instead of this rather verbose syntax, we can achieve the same result by creating a list.

To create a list in R, we use the c() function.

# answer
gapminder |>
  filter(country %in% c("New Zealand", "Canada", "United States"))
## # A tibble: 36 × 6
##    country continent  year lifeExp      pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Canada  Americas   1952    68.8 14785584    11367.
##  2 Canada  Americas   1957    70.0 17010154    12490.
##  3 Canada  Americas   1962    71.3 18985849    13462.
##  4 Canada  Americas   1967    72.1 20819767    16077.
##  5 Canada  Americas   1972    72.9 22284500    18971.
##  6 Canada  Americas   1977    74.2 23796400    22091.
##  7 Canada  Americas   1982    75.8 25201900    22899.
##  8 Canada  Americas   1987    76.9 26549700    26627.
##  9 Canada  Americas   1992    78.0 28523502    26343.
## 10 Canada  Americas   1997    78.6 30305843    28955.
## # ℹ 26 more rows

Another option would be to create a named object that contains our list. That list can then be referenced in a filter() statement using the %in% operator as the evaluator.

# alternate answer
country_list <- c("New Zealand", "Canada", "United States")   # create object that is list of countries

gapminder |>
  filter(country %in% country_list)                           # filter uses that object 
## # A tibble: 36 × 6
##    country continent  year lifeExp      pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Canada  Americas   1952    68.8 14785584    11367.
##  2 Canada  Americas   1957    70.0 17010154    12490.
##  3 Canada  Americas   1962    71.3 18985849    13462.
##  4 Canada  Americas   1967    72.1 20819767    16077.
##  5 Canada  Americas   1972    72.9 22284500    18971.
##  6 Canada  Americas   1977    74.2 23796400    22091.
##  7 Canada  Americas   1982    75.8 25201900    22899.
##  8 Canada  Americas   1987    76.9 26549700    26627.
##  9 Canada  Americas   1992    78.0 28523502    26343.
## 10 Canada  Americas   1997    78.6 30305843    28955.
## # ℹ 26 more rows

This final option has a significant benefit. Now if we need to use that list of countries in another part of our analysis, we can just summon the object, rather than typing the entire list again.

This also reduces the risk of inadvertent error if we want to change the countries we are analyzing. If we have a named list object, we only need to change it in the one place, and that change will be incorporated throughout our code. Contrast that with a find-and-replace approach, where you might accidentally miss one of the places it appears, or you might have a typo in one place.

NOTE: Watch your quotation marks! In many cases double and single quotes are interchangable, but fancy “smart” quotes to start and end are not valid.

This example uses the flights table from the {nycflights13} package, and filters for Alaska Airlines Inc., which has the value “AS” in the variable “carrier”.

# will work...double quotes
nycflights13::flights |> filter(carrier == "AS")

# will work...single quotes
nycflights13::flights |> filter(carrier == 'AS')

# will not work due to "smart" quotes
nycflights13::flights |> filter(carrier == “AS”)

5.6 Select columns

You’ve already seen filter() to choose particular rows in your dataframe. What if you want to remove some columns (variables)?

Here we will use select() to reduce the number of variables in our gapminder table to “country”, “year”, and “lifeExp”:

gapminder |>                                
  select(country, year, lifeExp) 
## # A tibble: 1,704 × 3
##    country      year lifeExp
##    <fct>       <int>   <dbl>
##  1 Afghanistan  1952    28.8
##  2 Afghanistan  1957    30.3
##  3 Afghanistan  1962    32.0
##  4 Afghanistan  1967    34.0
##  5 Afghanistan  1972    36.1
##  6 Afghanistan  1977    38.4
##  7 Afghanistan  1982    39.9
##  8 Afghanistan  1987    40.8
##  9 Afghanistan  1992    41.7
## 10 Afghanistan  1997    41.8
## # ℹ 1,694 more rows

You can use the minus sign to drop variable–that is, to define ones to not select.

In this example, note we have to use c() to create a list:

gapminder |>                                
  select(-c(continent, pop, gdpPercap))
## # A tibble: 1,704 × 3
##    country      year lifeExp
##    <fct>       <int>   <dbl>
##  1 Afghanistan  1952    28.8
##  2 Afghanistan  1957    30.3
##  3 Afghanistan  1962    32.0
##  4 Afghanistan  1967    34.0
##  5 Afghanistan  1972    36.1
##  6 Afghanistan  1977    38.4
##  7 Afghanistan  1982    39.9
##  8 Afghanistan  1987    40.8
##  9 Afghanistan  1992    41.7
## 10 Afghanistan  1997    41.8
## # ℹ 1,694 more rows

You can also define a range of columns to select:

gapminder |>
  select(country, year:pop)
## # A tibble: 1,704 × 4
##    country      year lifeExp      pop
##    <fct>       <int>   <dbl>    <int>
##  1 Afghanistan  1952    28.8  8425333
##  2 Afghanistan  1957    30.3  9240934
##  3 Afghanistan  1962    32.0 10267083
##  4 Afghanistan  1967    34.0 11537966
##  5 Afghanistan  1972    36.1 13079460
##  6 Afghanistan  1977    38.4 14880372
##  7 Afghanistan  1982    39.9 12881816
##  8 Afghanistan  1987    40.8 13867957
##  9 Afghanistan  1992    41.7 16317921
## 10 Afghanistan  1997    41.8 22227415
## # ℹ 1,694 more rows

5.7 Add new variables

You can calculate new variables that may be useful in your analysis. The {dplyr} package function for this is mutate().

For this example, we will use the “mpg” data set, which is a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov. For more information about the data set, see the reference page for the dataset: https://ggplot2.tidyverse.org/reference/mpg.html

The first thing we want to do is to take a quick look at the data so we get a sense of its structure, using the ls.str() function:

ls.str(mpg)
## class :  chr [1:234] "compact" "compact" "compact" "compact" "compact" "compact" "compact" "compact" ...
## cty :  int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## cyl :  int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## displ :  num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## drv :  chr [1:234] "f" "f" "f" "f" "f" "f" "f" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "r" "r" ...
## fl :  chr [1:234] "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "r" "e" ...
## hwy :  int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## manufacturer :  chr [1:234] "audi" "audi" "audi" "audi" "audi" "audi" "audi" "audi" "audi" "audi" "audi" ...
## model :  chr [1:234] "a4" "a4" "a4" "a4" "a4" "a4" "a4" "a4 quattro" "a4 quattro" "a4 quattro" ...
## trans :  chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" "auto(l5)" "manual(m5)" "auto(av)" ...
## year :  int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...

We can also view the data as a table:

mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans      drv     cty   hwy fl    class  
##    <chr>        <chr>      <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>  
##  1 audi         a4           1.8  1999     4 auto(l5)   f        18    29 p     compact
##  2 audi         a4           1.8  1999     4 manual(m5) f        21    29 p     compact
##  3 audi         a4           2    2008     4 manual(m6) f        20    31 p     compact
##  4 audi         a4           2    2008     4 auto(av)   f        21    30 p     compact
##  5 audi         a4           2.8  1999     6 auto(l5)   f        16    26 p     compact
##  6 audi         a4           2.8  1999     6 manual(m5) f        18    26 p     compact
##  7 audi         a4           3.1  2008     6 auto(av)   f        18    27 p     compact
##  8 audi         a4 quattro   1.8  1999     4 manual(m5) 4        18    26 p     compact
##  9 audi         a4 quattro   1.8  1999     4 auto(l5)   4        16    25 p     compact
## 10 audi         a4 quattro   2    2008     4 manual(m6) 4        20    28 p     compact
## # ℹ 224 more rows

vectors of the same length

As part of our analysis, we want to understand the ratio of gas consumption to the size of the engine, measured as the variable displ (short for “displacement”). To calculate that ratio, we divide the highway consumption (hwy) by the displacement.

The {dplyr} function to create a new variable is mutate(). This is shown in the code below:

mpg <- mpg |>
  mutate(mpg_per_cubic_litre = hwy / displ)

Notice that there’s now a new variable added to the data called mpg_per_cubic_litre.

In that example, the two columns are of the same length—for each car (each row in the data) there is a hwy value and a displ value.

Vector of the length = 1

In other instances, we might want to have a calculation using a constant—we might want to multiply every one of our cases by 2, or add 10 to each.

In the example below, we will add a new variable to our data set using mutate().

This will convert miles per US gallon to litres per 100 kilometers. The first step is to create a constant value, based on the number of gallons in a litre and the number of kilometers in a mile.

Once we have that constant, we can use it in an equation to calculate the litres per 100 kilometers for every car in the data.

# calculate conversion constant
# a US gallon is 3.785 litres
# a kilometer is 1.61 miles
lp100km <- (100 * 3.785) / (1.61)
lp100km
## [1] 235.0932
# divide constant by hwy mpg to get hwy litres per 100 km
mpg |>
  mutate(hwy_lp100km = lp100km / hwy)
## # A tibble: 234 × 13
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class mpg_per_cubic_litre
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>               <dbl>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…               16.1 
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…               16.1 
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…               15.5 
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…               15   
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…                9.29
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…                9.29
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…                8.71
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…               14.4 
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…               13.9 
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…               14   
## # ℹ 224 more rows
## # ℹ 1 more variable: hwy_lp100km <dbl>

5.8 Summarize (or Summarise)

Take a quick look at the data in the {gapminder} package.

gapminder
## # A tibble: 1,704 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ℹ 1,694 more rows

R has a built-in function to generate some descriptive (or summary) statistics. The first is summary() which gives a few key statistics about the distribution.

summary(gapminder)
##         country        continent        year         lifeExp           pop           
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60   Min.   :6.001e+04  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20   1st Qu.:2.794e+06  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71   Median :7.024e+06  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47   Mean   :2.960e+07  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85   3rd Qu.:1.959e+07  
##  Australia  :  12                  Max.   :2007   Max.   :82.60   Max.   :1.319e+09  
##  (Other)    :1632                                                                    
##    gdpPercap       
##  Min.   :   241.2  
##  1st Qu.:  1202.1  
##  Median :  3531.8  
##  Mean   :  7215.3  
##  3rd Qu.:  9325.5  
##  Max.   :113523.1  
## 

The summarize() (or summarise()) functions allows us to use a pipe sequence to compute statistics about the data. In the example below, we will calculate three statistics:

  • The first (minimum) year in the dataset
  • The last (maximum) year in the dataset
  • The number of unique countries
# answer
gapminder |>
  summarize(year_min = min(year),
            year_max = max(year),
            n_countries = n_distinct(country))
## # A tibble: 1 × 3
##   year_min year_max n_countries
##      <int>    <int>       <int>
## 1     1952     2007         142

Everything we have done so far has been with to manipulate the entire data file (filtering and selecting), or calculating descriptive statistics for all the values in our data. Often in our data analysis we need to compare within our data. Here are some questions we can answer through the gapminder data:

  1. What was the median life expectancy in African countries in 2007?

  2. What was the median life expectancy by continent in 2007?

  3. What was the median life expectancy by continent for each year in the data?

To answer the first of these, we can filter the data to get only the African countries and 2007.

But for the second and third, we need to group our data by continent and year, and then summarize those groups.

The R package {dplyr} has a grouping function group_by(). Once the data is grouped, we apply a summarize() (or summarise() if you prefer!) function with our descriptive statistics.

Let’s walk through the steps for the first question:

Extract the rows for African countries in 2007.

Then find:

  1. The number of unique countries
  2. The median life expectancy of African countries as a group
# answer
gapminder |> 
  # filter for the continent and the year
  filter(continent == "Africa" & year == 2007) |> 
  # calculate the summary statistics
  summarise(n_countries = n_distinct(country),
            lifeExp_med = median(lifeExp))
## # A tibble: 1 × 2
##   n_countries lifeExp_med
##         <int>       <dbl>
## 1          52        52.9

5.8.1 Grouping

We can use the {dplyr} function group_by() to define our grouping variables. If we group all of the countries by the continent they are part of, we can calculate a statistic for the whole continent.

The second question was to find the median life expectancy by continent in 2007.

# answer
gapminder |>
  # filter for the year
  filter(year == 2007) |> 
  # apply the grouping variable, and then summarize
  group_by(continent) |>
  summarize(lifeExp_mean = mean(lifeExp))
## # A tibble: 5 × 2
##   continent lifeExp_mean
##   <fct>            <dbl>
## 1 Africa            54.8
## 2 Americas          73.6
## 3 Asia              70.7
## 4 Europe            77.6
## 5 Oceania           80.7

5.8.1.1 Grouping with multiple variables

We can group by multiple variables.

Find the median life expectancy by continent for each year in the dataframe.

gapminder |>
  group_by(continent, year) |>
  summarize(lifeExp_med = median(lifeExp))
## # A tibble: 60 × 3
## # Groups:   continent [5]
##    continent  year lifeExp_med
##    <fct>     <int>       <dbl>
##  1 Africa     1952        38.8
##  2 Africa     1957        40.6
##  3 Africa     1962        42.6
##  4 Africa     1967        44.7
##  5 Africa     1972        47.0
##  6 Africa     1977        49.3
##  7 Africa     1982        50.8
##  8 Africa     1987        51.6
##  9 Africa     1992        52.4
## 10 Africa     1997        52.8
## # ℹ 50 more rows

Create a summary table with the population and total GDP by continent for the year 1952

# answer
gapminder |>
  filter(year == 1952) |>               # filter to get the records we need
  mutate(totalGDP = pop * gdpPercap) |> # note that we need to add the new variable calculated above!
  group_by(continent) |>                # this defines the grouping category
  summarize(pop = sum(pop),              # 
            totalGDP = sum(totalGDP))    #
## # A tibble: 5 × 3
##   continent        pop totalGDP
##   <fct>          <int>    <dbl>
## 1 Africa     237640501  3.12e11
## 2 Americas   345152446  2.94e12
## 3 Asia      1395357351  1.13e12
## 4 Europe     418120846  2.55e12
## 5 Oceania     10686006  1.08e11

5.8.2 Counting the number of cases

A common step in a data analysis project is to count the number of cases in our dataset, often by groups.

For these examples, we will use the mpg dataset, and produce a summary table that shows the number of vehicles by class (compact, suv, etc) and number of cylinders in the engine (the variable cyl).

In this section, we will look at three different approaches to get to the same result.

n()

Here we start with the original ungrouped dataset, applying the group_by(), then use the n() function inside a summarise() to count the number of cases (or observations).

mpg |> 
  group_by(class, cyl) |> 
  summarise(n = n()) 
## # A tibble: 19 × 3
## # Groups:   class [7]
##    class        cyl     n
##    <chr>      <int> <int>
##  1 2seater        8     5
##  2 compact        4    32
##  3 compact        5     2
##  4 compact        6    13
##  5 midsize        4    16
##  6 midsize        6    23
##  7 midsize        8     2
##  8 minivan        4     1
##  9 minivan        6    10
## 10 pickup         4     3
## 11 pickup         6    10
## 12 pickup         8    20
## 13 subcompact     4    21
## 14 subcompact     5     2
## 15 subcompact     6     7
## 16 subcompact     8     5
## 17 suv            4     8
## 18 suv            6    16
## 19 suv            8    38

To get a sorted result, we would need to add an arrange(desc()) function to our pipe:

mpg |> 
  group_by(class, cyl) |> 
  summarise(n = n()) |> 
  arrange(desc(n))
## # A tibble: 19 × 3
## # Groups:   class [7]
##    class        cyl     n
##    <chr>      <int> <int>
##  1 suv            8    38
##  2 compact        4    32
##  3 midsize        6    23
##  4 subcompact     4    21
##  5 pickup         8    20
##  6 midsize        4    16
##  7 suv            6    16
##  8 compact        6    13
##  9 minivan        6    10
## 10 pickup         6    10
## 11 suv            4     8
## 12 subcompact     6     7
## 13 2seater        8     5
## 14 subcompact     8     5
## 15 pickup         4     3
## 16 compact        5     2
## 17 midsize        8     2
## 18 subcompact     5     2
## 19 minivan        4     1

tally()

Another approach to count the number of rows in each group is to use the tally() function, which takes the place of the summarize(n = n()) function.

mpg |> 
  group_by(class, cyl) |> 
  tally()
## # A tibble: 19 × 3
## # Groups:   class [7]
##    class        cyl     n
##    <chr>      <int> <int>
##  1 2seater        8     5
##  2 compact        4    32
##  3 compact        5     2
##  4 compact        6    13
##  5 midsize        4    16
##  6 midsize        6    23
##  7 midsize        8     2
##  8 minivan        4     1
##  9 minivan        6    10
## 10 pickup         4     3
## 11 pickup         6    10
## 12 pickup         8    20
## 13 subcompact     4    21
## 14 subcompact     5     2
## 15 subcompact     6     7
## 16 subcompact     8     5
## 17 suv            4     8
## 18 suv            6    16
## 19 suv            8    38

If we want to sort in the tally() function, we can add a sort = TRUE argument. Note that this behaves differently than arrange(); sort = TRUE is largest to smallest.

mpg |> 
  group_by(class, cyl) |> 
  tally(sort = TRUE)
## # A tibble: 19 × 3
## # Groups:   class [7]
##    class        cyl     n
##    <chr>      <int> <int>
##  1 suv            8    38
##  2 compact        4    32
##  3 midsize        6    23
##  4 subcompact     4    21
##  5 pickup         8    20
##  6 midsize        4    16
##  7 suv            6    16
##  8 compact        6    13
##  9 minivan        6    10
## 10 pickup         6    10
## 11 suv            4     8
## 12 subcompact     6     7
## 13 2seater        8     5
## 14 subcompact     8     5
## 15 pickup         4     3
## 16 compact        5     2
## 17 midsize        8     2
## 18 subcompact     5     2
## 19 minivan        4     1

count()

The most streamlined way we could get to our grouped summary table with counts is by using the count() function. This function merges both the grouping and tallying functions.

The categories we want to count become the arguments of the count() function.

mpg |> 
  count(class, cyl)
## # A tibble: 19 × 3
##    class        cyl     n
##    <chr>      <int> <int>
##  1 2seater        8     5
##  2 compact        4    32
##  3 compact        5     2
##  4 compact        6    13
##  5 midsize        4    16
##  6 midsize        6    23
##  7 midsize        8     2
##  8 minivan        4     1
##  9 minivan        6    10
## 10 pickup         4     3
## 11 pickup         6    10
## 12 pickup         8    20
## 13 subcompact     4    21
## 14 subcompact     5     2
## 15 subcompact     6     7
## 16 subcompact     8     5
## 17 suv            4     8
## 18 suv            6    16
## 19 suv            8    38

Similar to tally, we can include a sort = TRUE argument.

mpg |> 
  count(class, cyl, sort = TRUE)
## # A tibble: 19 × 3
##    class        cyl     n
##    <chr>      <int> <int>
##  1 suv            8    38
##  2 compact        4    32
##  3 midsize        6    23
##  4 subcompact     4    21
##  5 pickup         8    20
##  6 midsize        4    16
##  7 suv            6    16
##  8 compact        6    13
##  9 minivan        6    10
## 10 pickup         6    10
## 11 suv            4     8
## 12 subcompact     6     7
## 13 2seater        8     5
## 14 subcompact     8     5
## 15 pickup         4     3
## 16 compact        5     2
## 17 midsize        8     2
## 18 subcompact     5     2
## 19 minivan        4     1

Without the grouping variables, these three statements produce the same result:

mpg |> summarise(n = n())
## # A tibble: 1 × 1
##       n
##   <int>
## 1   234
mpg |> tally()
## # A tibble: 1 × 1
##       n
##   <int>
## 1   234
mpg |> count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1   234

There are other counting functions within {dplyr}, see the reference page https://dplyr.tidyverse.org/reference/count.html for more details.

We will return to data wrangling later in the course, and explore more complex calculations.

5.9 Take aways

  • Extract cases with filter(). For filtering, we have to specify the values we want and the variable that they are in.

  • Choose the rows we want from our dataframes with select(). To select the columns we want, we have to specify the variable(s) that we want.

  • Create new variables, with mutate()

  • Filter using a list with %in%

  • Connect operations with the pipe symbol |>

  • Make tables of summaries with summarise()

  • Do groupwise operations with group_by()

  • Count the number of cases with tally() and count()

5.10 Functions for data wrangling

5.10.1 Reference: {dplyr} functions

The functions in {dplyr} (and many other packages) are verbs that describe what they do.

function action
filter() keep rows
mutate() create a new variable
group_by() declare subsets in data
select() keep variables (or drop them -var or re-order them)
rename() renaming variables
summarize() summarize the data, by groups if they have been declared
distinct() returns only rows that are unique
case_when() is used for “recoding” variable, often used with mutate()
function action
length
dplyr::count()
number of cases
tally() counting (by groups if group_by() applied)

The full list of {dplyr} functions is here: https://dplyr.tidyverse.org/reference/index.html

5.10.2 Reference: Descriptive statistics

These statistics can all be generated individually, using the functions in the table below. Other statistical measures about a variable can be calculated with the appropriate function. Note that these are just a few of the functions that are included in base R; there are also other statistical measures available through various packages.

statistic function
mean mean()
median median()
mode mode()
minimum and maximum min() max()
range range()
quantile quantile() (default: quartiles)
sum sum()
standard deviation sd()
variance var()
kurtosis kurtosis()
skewness skewness()
correlation coefficient cor()
correlation coefficient (two variables) cor()
count (number of cases) length()
number of unique cases n_distinct()

5.10.3 Reference: Boolean operators

Boolean operators are a special type of operator that return TRUE or FALSE. They are used when we want to compare values.

Operator Description
== equal, tests equality
!= not equal, tests inequality
> greater than, tests greater than (also >=)
< less than, tests less than (also <=)
%in% contains, tests inclusion

Boolean operators can be combined with and or or.

Operator Description
& and, returns true if preceeding and following are both true, else false
| or, returns true if either preceeding and following are true, else false

.footnote[see Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund R for Data Science, 2nd ed., “Comparisions”” and “Boolean algebra”.]

-30-