4 Introduction to data wrangling
Data is abundant, and as data analysts we often have to spend time working with our data tables to get it ready for whatever analysis we need to do. “Data wrangling” and “data transformation” are two terms that are used to describe the process of altering the structure of our data.2
There are three broad categories of data transformation:
alter the structure of the existing data tables. This could be through adding or removing rows or columns to our data table, including calculation of new variables, or importation of additional rows (observations) and joining two tables together.
summarize the content of the source data tables. For example, if our data table has the population of all the countries in the world, we might want a summary table that has the table with the total for each continent.
with both our existing data or our summary tables, we may want change the layout, either for analysis purposes or to make it easier for human readers to absorb the information.
In this section, we will begin to work with all three transformations.
4.0.1 Reading: Data wrangling
Please refer to R for Data Science, Data transformation for more information about this section.
For this section, and for other exercises throughout this course, we will be using a subset of the Gapminder data. These data tables contain information about the countries of the world, and were used by Hans Rosling to explain that there have been significant changes in the quality of people’s lives around the world.
The best stats you’ve ever seen— this is the video that introduced me to Hans Rosling.
200 Countries, 200 Years, 4 Minutes—recreating the chart in this video is one of the capstone project options
It’s outside the scope of this course, but I also recommend Rosling’s book Factfulness(Rosling, Rosling, and Rönnlund 2018), which goes deeper into the changes in the world and how data explains those changes.
4.1 Transform Data: 1a Alter the structure
Data Transformation Cheat Sheet (PDF file)
When working with data, particularly large data sets, you encounter a situation where you need to:
subset the data so that it contains only those records (rows) you are interested in (Why deal with a data table that has every province in Canada when you just need the British Columbia data?)
subset the data so that it contains only those variables (columns) you are interested in (Why keep looking at variables that do not factor into your analysis, or you will not print in your report?)
create new variables, often through calculations based on variables in your data
To achieve these three goals, we will be using functions from the tidyverse package {dplyr}. The functions are verbs that describe what they do. Here’s a list of the {dplyr} functions that you’ll be using:
function | action |
---|---|
filter() |
keep rows |
select() |
keep variables (or drop them -var) |
mutate() |
create a new variable |
arrange() |
sort from smallest to largest |
arrange(desc()) |
sort from largest to smallest |
Other operators
In addition to these functions, R has a number of operators that add significantly to your code.
Arithmetic operators
Operator | Description |
---|---|
+ | plus |
- | minus |
* | multiplication |
/ | division |
^ | exponential |
Boolean operators are a special type of operator that return TRUE or FALSE.
Operator | Description |
---|---|
== | equal, tests equality |
!= | not equal, tests inequality |
> | greater than, tests greater than (also >=) |
< | less than, tests less than (also <=) |
%in% | contains, tests inclusion |
Boolean operators can be combined with and or or.
Operator | Description |
---|---|
& |
and, returns true if preceding and following are both true, else false |
| | or, returns true if either preceding and following are true, else false |
see R for Data Science, “Comparisions” and “Logical Operators”.]
4.2 Introducing the pipe operator %>%
(see R for Data Science, Pipes for a deep dive into pipes)
The pipe operator %>%
allows us to name the data frame first, and then have the next function work with the outcome of the first line.
For example, we could write
function(data, argument)
or, with a pipe
data %>% function(argument)
Read the pipe symbol as “then”. We start with our data, then we apply the function.
If we want to do anything with the output from the first statement, we have to assign it to an intermediate object, and then use that intermediate object in the second statement. With the pipe, we can add another pipe and the results of the first statement are passed to the second statement without the intermediate object.
As we get to increasingly complex statements, the power of this will become more evident.
This code calculates the mean of life expectancy (the lifeExp
variable) in the gapminder data (note that we are accessing the lifeExp
variable by adding the dollar sign to the dataframe name).
mean(gapminder$lifeExp)
## [1] 59.47444
Another way to do this would be using the pipe operator. The symbol is %>%
and links one line of script to the next. The way to read this is as meaning “and then…”
At this point, we only have one step in our chain, but in future code, you might have multiple steps linked one after the other.
So this little chunk would be read as
we start with the dataframe “exp_dem”, “and then…”
we calculate a variable called “mean_of_expense” by running the
mean()
function on the variable “Expense”
## # A tibble: 1 × 1
## mean_of_life_expectancy
## <dbl>
## 1 59.5
4.3 Creating lists
R allows us to create lists using the c()
function.
A list can be referenced in a filter()
statement using the %in%
as the evaluator.
In this example, we filter to get three countries, “New Zealand”, “Canada”, and “United States”.
Note that to get all three, the statement is OR, not AND! There are no cases where country == "New Zealand"
and simultaneously country == "Canada"
…
## # A tibble: 36 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Canada Americas 1952 68.8 14785584 11367.
## 2 Canada Americas 1957 70.0 17010154 12490.
## 3 Canada Americas 1962 71.3 18985849 13462.
## 4 Canada Americas 1967 72.1 20819767 16077.
## 5 Canada Americas 1972 72.9 22284500 18971.
## 6 Canada Americas 1977 74.2 23796400 22091.
## 7 Canada Americas 1982 75.8 25201900 22899.
## 8 Canada Americas 1987 76.9 26549700 26627.
## 9 Canada Americas 1992 78.0 28523502 26343.
## 10 Canada Americas 1997 78.6 30305843 28955.
## # … with 26 more rows
Instead of this rather verbose syntax, we can achieve the same result by creating a list.
## # A tibble: 36 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Canada Americas 1952 68.8 14785584 11367.
## 2 Canada Americas 1957 70.0 17010154 12490.
## 3 Canada Americas 1962 71.3 18985849 13462.
## 4 Canada Americas 1967 72.1 20819767 16077.
## 5 Canada Americas 1972 72.9 22284500 18971.
## 6 Canada Americas 1977 74.2 23796400 22091.
## 7 Canada Americas 1982 75.8 25201900 22899.
## 8 Canada Americas 1987 76.9 26549700 26627.
## 9 Canada Americas 1992 78.0 28523502 26343.
## 10 Canada Americas 1997 78.6 30305843 28955.
## # ... with 26 more rows
Another option would be to create a named object that contains our list, and then use the named object in our filter()
.
# alternate answer
country_list <- c("New Zealand", "Canada", "United States") # create object that is list of countries
gapminder %>%
filter(country %in% country_list) # filter uses that object
## # A tibble: 36 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Canada Americas 1952 68.8 14785584 11367.
## 2 Canada Americas 1957 70.0 17010154 12490.
## 3 Canada Americas 1962 71.3 18985849 13462.
## 4 Canada Americas 1967 72.1 20819767 16077.
## 5 Canada Americas 1972 72.9 22284500 18971.
## 6 Canada Americas 1977 74.2 23796400 22091.
## 7 Canada Americas 1982 75.8 25201900 22899.
## 8 Canada Americas 1987 76.9 26549700 26627.
## 9 Canada Americas 1992 78.0 28523502 26343.
## 10 Canada Americas 1997 78.6 30305843 28955.
## # … with 26 more rows
This final option has a significant benefit. Now if we need to use that list of countries in another part of our analysis, we can just summon the object, rather than typing the entire list again.
This also reduces the risk of inadvertent error if we want to change the countries we are analyzing. If we have a named list object, we only need to change it in the one place, and that change will be incorporated throughout. Contrast that with a find-and-replace approach, where you might accidentally miss one of the places it appears, or you might have a typo in one place.
NOTE: Watch your quotation marks! In many cases double and single quotes are interchangable, but fancy “smart” quotes to start and end are not valid.
This example uses the flights
table from the {nycflights13} package, and filters for Alaska Airlines Inc., which has the value “AS” in the variable “carrier”.
# will not work due to "smart" quotes
::flights %>% filter(carrier == “AS”)
nycflights13
# will work...double quotes
::flights %>% filter(carrier == "AS")
nycflights13
# will work...single quotes
::flights %>% filter(carrier == 'AS') nycflights13
4.4 Select columns
You’ve already seen filter()
to choose particular rows in your dataframe. What if you want to remove some columns (variables)?
Here we will use select()
to reduce the number of variables in our gapminder table to “country”, “year”, and “lifeExp”:
## # A tibble: 1,704 × 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Afghanistan 1952 28.8
## 2 Afghanistan 1957 30.3
## 3 Afghanistan 1962 32.0
## 4 Afghanistan 1967 34.0
## 5 Afghanistan 1972 36.1
## 6 Afghanistan 1977 38.4
## 7 Afghanistan 1982 39.9
## 8 Afghanistan 1987 40.8
## 9 Afghanistan 1992 41.7
## 10 Afghanistan 1997 41.8
## # … with 1,694 more rows
You can use the minus sign to drop variable–that is, to define ones to not select.
In this example, note we have to use c()
to create a list:
## # A tibble: 1,704 × 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Afghanistan 1952 28.8
## 2 Afghanistan 1957 30.3
## 3 Afghanistan 1962 32.0
## 4 Afghanistan 1967 34.0
## 5 Afghanistan 1972 36.1
## 6 Afghanistan 1977 38.4
## 7 Afghanistan 1982 39.9
## 8 Afghanistan 1987 40.8
## 9 Afghanistan 1992 41.7
## 10 Afghanistan 1997 41.8
## # … with 1,694 more rows
You can define a range of columns to select:
## # A tibble: 1,704 × 4
## country year lifeExp pop
## <fct> <int> <dbl> <int>
## 1 Afghanistan 1952 28.8 8425333
## 2 Afghanistan 1957 30.3 9240934
## 3 Afghanistan 1962 32.0 10267083
## 4 Afghanistan 1967 34.0 11537966
## 5 Afghanistan 1972 36.1 13079460
## 6 Afghanistan 1977 38.4 14880372
## 7 Afghanistan 1982 39.9 12881816
## 8 Afghanistan 1987 40.8 13867957
## 9 Afghanistan 1992 41.7 16317921
## 10 Afghanistan 1997 41.8 22227415
## # … with 1,694 more rows
4.5 Functions for data wrangling
4.5.1 Reference: {dplyr} functions
The functions in {dplyr} (and many other packages) are verbs that describe what they do.
function | action |
---|---|
filter() |
keep rows |
mutate() |
create a new variable |
group_by() |
declare subsets in data |
select() |
keep variables (or drop them -var) |
rename() |
renaming variables |
summarize() |
summarize the data, by groups if they have been declared |
distinct() |
returns only rows that are unique |
case_when() |
is used for “recoding” variable, often used with mutate() |
function | action |
---|---|
length dplyr::count()
|
number of cases |
tally() |
counting (by groups if group_by() applied) |
The full list of {dplyr} functions is here: https://dplyr.tidyverse.org/reference/index.html
4.5.2 Reference: Boolean operators
Boolean operators are a special type of operator that return TRUE or FALSE. They are used when we want to compare values.
Operator | Description |
---|---|
== | equal, tests equality |
!= | not equal, tests inequality |
> | greater than, tests greater than (also >=) |
< | less than, tests less than (also <=) |
%in% | contains, tests inclusion |
Boolean operators can be combined with and or or.
Operator | Description |
---|---|
& |
and, returns true if preceeding and following are both true, else false |
| | or, returns true if either preceeding and following are true, else false |
.footnote[see R for Data Science, “Comparisions”” and “Logical Operators”.]
4.6 Descriptive statistics
R has a built-in function to generate some descriptive (or summary) statistics. The first is summary()
which gives a few key statistics about the distribution.
summary(gapminder)
## country continent year lifeExp pop
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60 Min. :6.001e+04
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20 1st Qu.:2.794e+06
## Algeria : 12 Asia :396 Median :1980 Median :60.71 Median :7.024e+06
## Angola : 12 Europe :360 Mean :1980 Mean :59.47 Mean :2.960e+07
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85 3rd Qu.:1.959e+07
## Australia : 12 Max. :2007 Max. :82.60 Max. :1.319e+09
## (Other) :1632
## gdpPercap
## Min. : 241.2
## 1st Qu.: 1202.1
## Median : 3531.8
## Mean : 7215.3
## 3rd Qu.: 9325.5
## Max. :113523.1
##
These statistics can all be generated individually, using the functions in the table below. Other statistical measures about a variable can be calculated with the appropriate function. Note that these are just a few of the functions that are included in base R; there are also other statistical measures available through various packages.
statistic | function |
---|---|
mean | mean() |
median | median() |
mode | mode() |
minimum and maximum |
min() max()
|
range | range() |
quantile |
quantile() (default: quartiles) |
— | — |
sum | sum() |
standard deviation | sd() |
variance | var() |
kurtosis | kurtosis() |
skewness | skewness() |
correlation coefficient | cor() |
correlation coefficient (two variables) | cor() |
— | — |
count (number of cases) | length() |
number of unique cases | n_distinct() |
4.7 Tangent: Arguments within functions
Virtually every function we use has default arguments, and options to change those. This allows us to use one function for a variety of circumstances. Instead of having separate functions to split a variable into its quartiles (four parts), quintiles (five parts), deciles (ten parts), and so on, there is one function that uses the generic name for this process: quantile.
The default setting of the quantile()
function splits the distribution into four parts, known as quartiles. This is the most common quantile used to summarize a distribution.
quantile(gapminder$lifeExp)
## 0% 25% 50% 75% 100%
## 23.5990 48.1980 60.7125 70.8455 82.6030
If you want other groupings, the function has arguments that allow you define those. For example, if you want deciles—10 groups, each with 10% of the cases—the second example shows how to do that.
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 23.5990 41.5108 45.8992 50.6021 55.7292 60.7125 66.0814 69.7465 72.0288 75.0970 82.6030
4.8 Transform Data: 1b Add new variables
You can calculate new variables that may be useful in your analysis
vectors of the same length
For this example, we will use the “mpg” data set, which is a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov. For more information about the data set, see the reference page for the dataset: https://ggplot2.tidyverse.org/reference/mpg.html
The first thing we want to do is to take a quick look at the data so we get a sense of its structure, using the ls.str()
function:
ls.str(mpg)
## class : chr [1:234] "compact" "compact" "compact" "compact" "compact" "compact" "compact" "compact" ...
## cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## drv : chr [1:234] "f" "f" "f" "f" "f" "f" "f" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "4" "r" "r" ...
## fl : chr [1:234] "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "p" "r" "e" ...
## hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## manufacturer : chr [1:234] "audi" "audi" "audi" "audi" "audi" "audi" "audi" "audi" "audi" "audi" "audi" ...
## model : chr [1:234] "a4" "a4" "a4" "a4" "a4" "a4" "a4" "a4 quattro" "a4 quattro" "a4 quattro" ...
## trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" "auto(l5)" "manual(m5)" "auto(av)" ...
## year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
We can also view the data as a table:
mpg
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compact
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compact
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
## 7 audi a4 3.1 2008 6 auto(av) f 18 27 p compact
## 8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact
## 9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact
## 10 audi a4 quattro 2 2008 4 manual(m6) 4 20 28 p compact
## # … with 224 more rows
Vector of the length = 1
As part of our analysis, we want to understand the ratio of gas consumption to the size of the engine, measured as the variable displ
(short for “displacement”). To calculate that ratio, we divide the highway consumption (hwy
) by the displacement.
The {dplyr} function to create a new variable is mutate()
. This is shown in the code below:
Notice that there’s now a new variable added to the data called mpg_per_cubic_litre
.
In that example, the two columns are of the same length—for each car (each row in the data) there is a hwy
value and a displ
value.
In other instances, we might want to have a calculation using a constant—we might want to multiply every one of our cases by 2, or add 10 to each.
In the example below, we will add a new variable to our data set using mutate()
.
This will convert miles per US gallon to litres per 100 kilometers. The first step is to create a constant value, based on the number of gallons in a litre and the number of kilometers in a mile.
Once we have that constant, we can use it in an equation to calculate the litres per 100 kilometers for every car in the data.
# calculate conversion constant
# a US gallon is 3.785 litres
# a kilometer is 1.61 miles
lp100km <- (100 * 3.785) / (1.61)
lp100km
## [1] 235.0932
# divide constant by hwy mpg to get hwy litres per 100 km
mpg %>%
mutate(hwy_lp100km = lp100km / hwy)
## # A tibble: 234 × 13
## manufacturer model displ year cyl trans drv cty hwy fl class mpg_per_cubic_l…
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr> <dbl>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p comp… 16.1
## 2 audi a4 1.8 1999 4 manual(… f 21 29 p comp… 16.1
## 3 audi a4 2 2008 4 manual(… f 20 31 p comp… 15.5
## 4 audi a4 2 2008 4 auto(av) f 21 30 p comp… 15
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p comp… 9.29
## 6 audi a4 2.8 1999 6 manual(… f 18 26 p comp… 9.29
## 7 audi a4 3.1 2008 6 auto(av) f 18 27 p comp… 8.71
## 8 audi a4 quattro 1.8 1999 4 manual(… 4 18 26 p comp… 14.4
## 9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p comp… 13.9
## 10 audi a4 quattro 2 2008 4 manual(… 4 20 28 p comp… 14
## # … with 224 more rows, and 1 more variable: hwy_lp100km <dbl>
4.9 Transform Data 2: summarize
Everything we have done so far has been with to manipulate the entire data file (filtering and selecting), or calculating descriptive statistics for all the values in our data. Often we need to compare within our data:
What was the median life expectancy in African countries in 2007?
What was the median life expectancy by continent in 2007?
What was the median life expectancy by continent for each year in the data?
To achieve the first of these things, we can filter the data to get only the African countries and 2007.
But for the second and third, we need to group our data by continent and year, and then summarize those groups.
The R package {dplyr} has a grouping function group_by()
. Once the data is grouped, we apply a summarize()
(or summarise()
if you prefer!) function with our descriptive statistics.
Take a quick look at the data in the {gapminder} package.
gapminder
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # … with 1,694 more rows
Use summarize()
or summarise()
to compute three statistics about the data:
- The first (minimum) year in the dataset
- The last (maximum) year in the dataset
- The number of unique countries
# answer
gapminder %>%
summarize(year_min = min(year),
year_max = max(year),
n_countries = n_distinct(country))
## # A tibble: 1 × 3
## year_min year_max n_countries
## <int> <int> <int>
## 1 1952 2007 142
Extract the rows for African countries in 2007.
Then find:
- The number of unique countries
- The median life expectancy of African countries as a group
# answer
gapminder %>%
filter(continent == "Africa" & year == 2007) %>%
summarise(n_countries = n_distinct(country),
lifeExp_med = median(lifeExp))
## # A tibble: 1 × 2
## n_countries lifeExp_med
## <int> <dbl>
## 1 52 52.9
4.9.1 Grouping
We can use the {dplyr} function group_by()
to define our grouping variables. If we group all of the countries by the continent they are part of, we can calculate a statistic for the whole continent.
Find the median life expectancy by continent in 2007.
# answer
gapminder %>%
filter(year == 2007) %>%
group_by(continent) %>%
summarize(lifeExp_mean = mean(lifeExp))
## # A tibble: 5 × 2
## continent lifeExp_mean
## <fct> <dbl>
## 1 Africa 54.8
## 2 Americas 73.6
## 3 Asia 70.7
## 4 Europe 77.6
## 5 Oceania 80.7
4.9.1.1 Grouping with multiple variables
We can group by multiple variables.
Find the median life expectancy by continent for each year in the dataframe.
## # A tibble: 60 × 3
## # Groups: continent [5]
## continent year lifeExp_med
## <fct> <int> <dbl>
## 1 Africa 1952 38.8
## 2 Africa 1957 40.6
## 3 Africa 1962 42.6
## 4 Africa 1967 44.7
## 5 Africa 1972 47.0
## 6 Africa 1977 49.3
## 7 Africa 1982 50.8
## 8 Africa 1987 51.6
## 9 Africa 1992 52.4
## 10 Africa 1997 52.8
## # … with 50 more rows
Create a summary table with the population and GDP by continent for the year 1952
# answer
gapminder %>%
filter(year == 1952) %>% # filter to get the records we need
mutate(totalGDP = pop * gdpPercap) %>% # note that we need to add the new variable calculated above!
group_by(continent) %>% # this defines the grouping category
summarize(pop = sum(pop), #
totalGDP = sum(totalGDP)) #
## # A tibble: 5 × 3
## continent pop totalGDP
## <fct> <int> <dbl>
## 1 Africa 237640501 3.12e11
## 2 Americas 345152446 2.94e12
## 3 Asia 1395357351 1.13e12
## 4 Europe 418120846 2.55e12
## 5 Oceania 10686006 1.08e11
4.9.2 Counting the number of cases
A common step in a data analysis project is to count the number of cases in our dataset, often by groups.
For these examples, we will use the mpg
dataset, and produce a summary table that shows the number of vehicles by class (compact, suv, etc) and number of cylinders in the engine (the variable cyl
).
In this section, we will look at three different approaches to get to the same result.
4.9.2.1 n()
·
Here we start with the original ungrouped dataset, applying the group_by()
, then use the n()
function inside a summarise()
to count the number of cases (or observations).
## # A tibble: 19 × 3
## # Groups: class [7]
## class cyl n
## <chr> <int> <int>
## 1 2seater 8 5
## 2 compact 4 32
## 3 compact 5 2
## 4 compact 6 13
## 5 midsize 4 16
## 6 midsize 6 23
## 7 midsize 8 2
## 8 minivan 4 1
## 9 minivan 6 10
## 10 pickup 4 3
## 11 pickup 6 10
## 12 pickup 8 20
## 13 subcompact 4 21
## 14 subcompact 5 2
## 15 subcompact 6 7
## 16 subcompact 8 5
## 17 suv 4 8
## 18 suv 6 16
## 19 suv 8 38
To get a sorted result, we would need to add an arrange(desc())
function to our pipe:
## # A tibble: 19 × 3
## # Groups: class [7]
## class cyl n
## <chr> <int> <int>
## 1 suv 8 38
## 2 compact 4 32
## 3 midsize 6 23
## 4 subcompact 4 21
## 5 pickup 8 20
## 6 midsize 4 16
## 7 suv 6 16
## 8 compact 6 13
## 9 minivan 6 10
## 10 pickup 6 10
## 11 suv 4 8
## 12 subcompact 6 7
## 13 2seater 8 5
## 14 subcompact 8 5
## 15 pickup 4 3
## 16 compact 5 2
## 17 midsize 8 2
## 18 subcompact 5 2
## 19 minivan 4 1
4.9.2.2 tally()
·
Another approach to count the number of rows in each group is to use the tally()
function, which takes the place of the summarize(n = n())
function.
## # A tibble: 19 × 3
## # Groups: class [7]
## class cyl n
## <chr> <int> <int>
## 1 2seater 8 5
## 2 compact 4 32
## 3 compact 5 2
## 4 compact 6 13
## 5 midsize 4 16
## 6 midsize 6 23
## 7 midsize 8 2
## 8 minivan 4 1
## 9 minivan 6 10
## 10 pickup 4 3
## 11 pickup 6 10
## 12 pickup 8 20
## 13 subcompact 4 21
## 14 subcompact 5 2
## 15 subcompact 6 7
## 16 subcompact 8 5
## 17 suv 4 8
## 18 suv 6 16
## 19 suv 8 38
If we want to sort in the tally()
function, we can add a sort = TRUE
argument. Note that this behaves differently than arrange()
; sort = TRUE
is largest to smallest.
## # A tibble: 19 × 3
## # Groups: class [7]
## class cyl n
## <chr> <int> <int>
## 1 suv 8 38
## 2 compact 4 32
## 3 midsize 6 23
## 4 subcompact 4 21
## 5 pickup 8 20
## 6 midsize 4 16
## 7 suv 6 16
## 8 compact 6 13
## 9 minivan 6 10
## 10 pickup 6 10
## 11 suv 4 8
## 12 subcompact 6 7
## 13 2seater 8 5
## 14 subcompact 8 5
## 15 pickup 4 3
## 16 compact 5 2
## 17 midsize 8 2
## 18 subcompact 5 2
## 19 minivan 4 1
4.9.2.3 count()
·
The most streamlined way we could get to our grouped summary table with counts is by using the count()
function. This function merges both the grouping and tallying functions.
The categories we want to count become the arguments of the count()
function.
## # A tibble: 19 × 3
## class cyl n
## <chr> <int> <int>
## 1 2seater 8 5
## 2 compact 4 32
## 3 compact 5 2
## 4 compact 6 13
## 5 midsize 4 16
## 6 midsize 6 23
## 7 midsize 8 2
## 8 minivan 4 1
## 9 minivan 6 10
## 10 pickup 4 3
## 11 pickup 6 10
## 12 pickup 8 20
## 13 subcompact 4 21
## 14 subcompact 5 2
## 15 subcompact 6 7
## 16 subcompact 8 5
## 17 suv 4 8
## 18 suv 6 16
## 19 suv 8 38
Similar to tally, we can include a sort = TRUE
argument.
## # A tibble: 19 × 3
## class cyl n
## <chr> <int> <int>
## 1 suv 8 38
## 2 compact 4 32
## 3 midsize 6 23
## 4 subcompact 4 21
## 5 pickup 8 20
## 6 midsize 4 16
## 7 suv 6 16
## 8 compact 6 13
## 9 minivan 6 10
## 10 pickup 6 10
## 11 suv 4 8
## 12 subcompact 6 7
## 13 2seater 8 5
## 14 subcompact 8 5
## 15 pickup 4 3
## 16 compact 5 2
## 17 midsize 8 2
## 18 subcompact 5 2
## 19 minivan 4 1
4.9.2.4 counting wrapup
Without the grouping variables, these three statements produce the same result:
## # A tibble: 1 × 1
## n
## <int>
## 1 234
## # A tibble: 1 × 1
## n
## <int>
## 1 234
## # A tibble: 1 × 1
## n
## <int>
## 1 234
There are other counting functions within {dplyr}, see the reference page https://dplyr.tidyverse.org/reference/count.html for more details.
We will return to data wrangling later in the course, and explore more complex calculations.
4.10 Take aways
Extract cases with
filter()
. For filtering, we have to specify the values we want and the variable that they are in.Choose the rows we want from our dataframes with
select()
. To select the columns we want, we have to specify the variable(s) that we want.Create new variables, with
mutate()
Filter using a list with
%in%
Connect operations with the pipe symbol
%>%
Make tables of summaries with
summarise()
Do groupwise operations with
group_by()
-30-