Section 7 Easier analysis with the tidyverse
Now that you know the basics of R, it’s time to learn that there are much better ways to do everything you just learnt!
7.1 Introduction to the tidyverse
The tidyverse is a bundle of packages that make using R easier because they’re all designed to work together. Most “tidy” functions work well together because they:
- Take a dataframe as their input
- Return a dataframe as their output
You might not use every package from the tidyverse in an analysis, but you can still load them all at the start of most analyses, and know you’ll have a standard set of tools available. To load them all, just use:
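The loading step is presumably the standard call below. This is a minimal sketch that assumes the tidyverse package has already been installed:

```r
# Attach the core tidyverse packages (dplyr, ggplot2, tidyr and others).
# Assumes a one-off install.packages("tidyverse") has been run beforehand.
library(tidyverse)
```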
For this session, we’ll use the cowles data again, so let’s load it up:
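The loading code itself is not shown here. One possible way to get the same data, assuming it is the Cowles dataset distributed with the carData package (an assumption, not something this section confirms), would be:

```r
# Assumption: the "cowles" data is carData::Cowles. How it was loaded in the
# earlier session is not shown, so treat this as one plausible route.
library(carData)

# `cow` is the name the later examples in this section use for the data
cow = carData::Cowles
head(cow)
```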
7.2 dplyr: Turning complex analyses into simple steps
As your analyses get more complicated, your code can get more complicated as well. To get to the answers you want, you might have to:
- Drop certain rows from your dataset because they’re invalid or not relevant.
- Calculate stats for each treatment group separately.
- Use summary variables like the school mean to calculate a standardized score.
More importantly, you might have to combine several operations like these for each calculation you’re doing.
The dplyr package in the tidyverse makes this easier by providing a small number of simple verbs that can be combined easily to perform complex tasks. Each one takes your current data, changes it, and returns it. The most common verbs you’ll use are:
- filter(): choose rows to keep based on a logical test, dropping the rest.
- arrange(): sort the data.
- select(): choose columns to keep.
- mutate(): add new columns.
- summarize(): create columns that summarise the data down to a single row.
- count(): count the number of rows in the data.
- left_join() (and right_join(), inner_join() etc.): merge datasets based on a common identifier.
And, possibly most importantly:
- group_by(): split the data into groups, so that any subsequent steps happen separately for each group.
We’ll go over examples of all of these below.
7.2.1 Bending the rules: Non-standard evaluation
dplyr and other tidyverse packages bend the rules of R syntax, allowing you to just type column names like group and time rather than having to spell out the name of the dataframe each time (like survey1$group, survey1$time). Within tidyverse function calls you can just type the column name, e.g.
## # A tibble: 4 x 3
## sex volunteer n
## <fct> <fct> <int>
## 1 female no 431
## 2 female yes 349
## 3 male no 393
## 4 male yes 248
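The code that produced the table above is not shown; it was presumably a count() call that refers to the columns by bare name. A minimal sketch of the idea follows, using a small stand-in data frame (cow_demo, a hypothetical name) built from rows printed elsewhere in this section, so the counts differ from those above:

```r
library(dplyr)

# Hypothetical stand-in for `cow`: eight rows copied from outputs in this section
cow_demo = data.frame(
    neuroticism  = c(16, 8, 5, 8, 9, 6, 17, 7),
    extraversion = c(13, 14, 16, 20, 19, 15, 19, 15),
    sex          = c("female", "male", "male", "female", "male", "male", "male", "male"),
    volunteer    = c("no", "no", "no", "no", "no", "no", "yes", "yes")
)

# Inside a dplyr verb, columns are referred to by bare name
counts = cow_demo %>%
    count(sex, volunteer)
counts

# The base R equivalent has to spell out the dataframe each time
table(cow_demo$sex, cow_demo$volunteer)
```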
7.2.2 Pipes: %>%
dplyr and the tidyverse make heavy use of pipes to make it easier to carry out multiple processing steps in a row. A pipe looks like:
%>%
If you have a calculation that takes multiple steps, it gets confusing if you run them all in one go. Here we’ll calculate the mean of extraversion for the males in the data:
## male_mean
## 1 12.31357
The first step here is actually filtering the data, but we have to write these function calls inside out. R understands it fine, but most human beings struggle to follow the logic of what’s happening, especially when more than two of these steps are nested.
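The nested call discussed here was presumably along the lines of the sketch below. It uses a small stand-in data frame (cow_demo, a hypothetical name) built from rows printed in this section, so the mean will not match the value shown above:

```r
library(dplyr)

# Hypothetical stand-in for `cow`: eight rows copied from outputs in this section
cow_demo = data.frame(
    neuroticism  = c(16, 8, 5, 8, 9, 6, 17, 7),
    extraversion = c(13, 14, 16, 20, 19, 15, 19, 15),
    sex          = c("female", "male", "male", "female", "male", "male", "male", "male"),
    volunteer    = c("no", "no", "no", "no", "no", "no", "yes", "yes")
)

# Read inside out: filter() runs first, then summarise() works on its result
male_nested = summarise(filter(cow_demo, sex == "male"),
                        male_mean = mean(extraversion))
male_nested
```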
You can do the steps one by one instead, but it’s clunky, and you might have to create multiple intermediate variables:
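A sketch of that step-by-step version, under the same assumptions (cow_demo is a hypothetical stand-in built from rows printed in this section, and males_only is an illustrative name for the intermediate variable):

```r
library(dplyr)

# Hypothetical stand-in for `cow`: eight rows copied from outputs in this section
cow_demo = data.frame(
    neuroticism  = c(16, 8, 5, 8, 9, 6, 17, 7),
    extraversion = c(13, 14, 16, 20, 19, 15, 19, 15),
    sex          = c("female", "male", "male", "female", "male", "male", "male", "male"),
    volunteer    = c("no", "no", "no", "no", "no", "no", "yes", "yes")
)

# Step 1: keep only the males, stored in an intermediate variable
males_only = filter(cow_demo, sex == "male")

# Step 2: summarise the intermediate result
male_steps = summarise(males_only, male_mean = mean(extraversion))
male_steps
```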
Using pipes, you can do each step, and then send it through the pipe to the next step. To get the mean extraversion for males using pipes, we can do:
## male_mean
## 1 12.31357
See how we’re not actually specifying a dataframe for the filter() and summarise() functions? That’s because the %>% pipe automatically sends the data as the first argument to the next function.
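The piped version might look like the sketch below. As before, cow_demo is a hypothetical stand-in built from rows printed in this section, so the result differs from the output above:

```r
library(dplyr)

# Hypothetical stand-in for `cow`: eight rows copied from outputs in this section
cow_demo = data.frame(
    neuroticism  = c(16, 8, 5, 8, 9, 6, 17, 7),
    extraversion = c(13, 14, 16, 20, 19, 15, 19, 15),
    sex          = c("female", "male", "male", "female", "male", "male", "male", "male"),
    volunteer    = c("no", "no", "no", "no", "no", "no", "yes", "yes")
)

# Steps now read top to bottom; each %>% passes its left-hand side
# to the next function as that function's first argument
male_piped = cow_demo %>%
    filter(sex == "male") %>%
    summarise(male_mean = mean(extraversion))
male_piped
```

In other words, cow_demo %>% filter(sex == "male") is just another way of writing filter(cow_demo, sex == "male").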
7.2.3 Common tasks with dplyr
Selecting columns with select
You can select particular columns from a dataframe using select(). Using - in front of a column name excludes that column:
# These both give the same result: inclusion vs. exclusion
cow %>%
    select(neuroticism, extraversion) %>%
    head(2)
## neuroticism extraversion
## 1 16 13
## 2 8 14
cow %>%
    select(-sex, -volunteer) %>%
    head(2)
## neuroticism extraversion
## 1 16 13
## 2 8 14
Instead of column names, you can also use a range of helper functions like starts_with() and num_range() to select multiple columns that match a particular pattern.
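As a rough illustration of these helpers, the sketch below uses a made-up data frame (survey_demo and its columns are hypothetical, invented only to give the helpers something to match):

```r
library(dplyr)

# Hypothetical data: three numbered question columns plus two others
survey_demo = data.frame(
    id    = 1:2,
    q1    = c(3, 4),
    q2    = c(5, 1),
    q3    = c(2, 2),
    notes = c("a", "b")
)

# starts_with() keeps every column whose name begins with the prefix
by_prefix = survey_demo %>% select(starts_with("q"))

# num_range() keeps prefix + number combinations, here q1 and q2
by_range = survey_demo %>% select(num_range("q", 1:2))

names(by_prefix)
names(by_range)
```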
Creating/changing columns with mutate
We can use mutate on a dataframe to add or change one or more columns:
cow %>%
    # high_extraversion and high_neuroticism are the names
    # of new columns that will be created
    mutate(high_extraversion = extraversion >= 15,
           high_neuroticism = neuroticism >= 15) %>%
    head()
## neuroticism extraversion sex volunteer high_extraversion
## 1 16 13 female no FALSE
## 2 8 14 male no FALSE
## 3 5 16 male no TRUE
## 4 8 20 female no TRUE
## 5 9 19 male no TRUE
## 6 6 15 male no TRUE
## high_neuroticism
## 1 TRUE
## 2 FALSE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 FALSE
If we want to keep these changes, we need to assign the result back to the original variable. By default, mutate() just returns an altered copy of the data and leaves the original unchanged:
# If we want to actually save the changes
cow = cow %>%
    mutate(high_extraversion = extraversion >= 15,
           high_neuroticism = neuroticism >= 15)
On its own, the main advantage mutate offers is being able to spell out your calculations without including the name of the dataframe. However, it can be very useful in combination with other verbs.
Summarizing the data with summarize
summarize is similar to mutate: it adds or changes columns. However, summarize also collapses the data down to a single row, so the values you calculate need to be single values like means or counts.
## extraversion volunteers
## 1 12.37298 597
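The call behind the output above is not shown. Judging by the column names, it was probably something like the sketch below, which runs on a small stand-in data frame (cow_demo, a hypothetical name) built from rows printed in this section, so the numbers differ:

```r
library(dplyr)

# Hypothetical stand-in for `cow`: eight rows copied from outputs in this section
cow_demo = data.frame(
    neuroticism  = c(16, 8, 5, 8, 9, 6, 17, 7),
    extraversion = c(13, 14, 16, 20, 19, 15, 19, 15),
    sex          = c("female", "male", "male", "female", "male", "male", "male", "male"),
    volunteer    = c("no", "no", "no", "no", "no", "no", "yes", "yes")
)

# Each expression must return a single value; the result has one row
overall = cow_demo %>%
    summarize(extraversion = mean(extraversion),
              volunteers = sum(volunteer == "yes"))
overall
```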
summarize is useful in combination with group_by, where it collapses the data down to one row per group.
Selecting rows with filter
Choosing rows with a logical test using filter() works just like subsetting your data with a logical vector; it’s just easier to do as part of a sequence of steps:
## neuroticism extraversion sex volunteer
## 1 17 19 male yes
## 2 7 15 male yes
## 3 17 12 male yes
## 4 6 13 male yes
## 5 8 9 male yes
## 6 5 16 male yes
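The rows above are all male volunteers, so the filter was presumably on sex and volunteer. A sketch of that kind of call, next to its base R equivalent, on a small stand-in data frame (cow_demo, a hypothetical name) built from rows printed in this section:

```r
library(dplyr)

# Hypothetical stand-in for `cow`: eight rows copied from outputs in this section
cow_demo = data.frame(
    neuroticism  = c(16, 8, 5, 8, 9, 6, 17, 7),
    extraversion = c(13, 14, 16, 20, 19, 15, 19, 15),
    sex          = c("female", "male", "male", "female", "male", "male", "male", "male"),
    volunteer    = c("no", "no", "no", "no", "no", "no", "yes", "yes")
)

# Conditions separated by commas must all be TRUE for a row to be kept
male_volunteers = cow_demo %>%
    filter(sex == "male", volunteer == "yes")
male_volunteers

# Base R equivalent: subsetting rows with a logical vector
base_version = cow_demo[cow_demo$sex == "male" & cow_demo$volunteer == "yes", ]
```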
Sorting with arrange
With arrange, you can sort by one or more columns. Use desc(column) (short for descending) to sort that column in the opposite direction:
## neuroticism extraversion sex volunteer
## 1 15 23 female no
## 2 15 21 female no
## 3 9 21 female no
## 4 10 21 female no
## 5 8 20 female no
## 6 5 20 female no
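The output above looks like the data sorted by sex and then by extraversion from highest to lowest, though the exact call is not shown. A sketch under that assumption, on a small stand-in data frame (cow_demo, a hypothetical name) built from rows printed in this section:

```r
library(dplyr)

# Hypothetical stand-in for `cow`: eight rows copied from outputs in this section
cow_demo = data.frame(
    neuroticism  = c(16, 8, 5, 8, 9, 6, 17, 7),
    extraversion = c(13, 14, 16, 20, 19, 15, 19, 15),
    sex          = c("female", "male", "male", "female", "male", "male", "male", "male"),
    volunteer    = c("no", "no", "no", "no", "no", "no", "yes", "yes")
)

# Sort by sex first (ascending), then by extraversion within sex (descending)
sorted = cow_demo %>%
    arrange(sex, desc(extraversion))
sorted
```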
group_by for calculations within groups
group_by() is very useful for calculating different stats in subgroups of your data. This covers a lot of the more complex operations you might need to do with your data, so it unlocks a lot of possibilities.
You can do things like:
- Mean-centering scores separately for males and females:
## # A tibble: 1,421 x 5
## # Groups: sex [2]
## neuroticism extraversion sex volunteer extraversion_centered
## <int> <int> <fct> <fct> <dbl>
## 1 16 13 female no 0.578
## 2 8 14 male no 1.69
## 3 5 16 male no 3.69
## 4 8 20 female no 7.58
## 5 9 19 male no 6.69
## 6 6 15 male no 2.69
## 7 8 10 female no -2.42
## 8 12 11 male no -1.31
## 9 15 16 male no 3.69
## 10 18 7 male no -5.31
## # ... with 1,411 more rows
- Calculating means and SDs for subgroups:
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex volunteer mean sd
## <fct> <fct> <dbl> <dbl>
## 1 female no 12.2 4.75
## 2 female yes 12.3 4.79
## 3 male no 10.5 4.75
## 4 male yes 10.4 5.11
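The code for the two examples above is not shown. The sketch below illustrates both patterns on a small stand-in data frame (cow_demo, a hypothetical name) built from rows printed in this section. Which column the second table summarised is not stated, so using extraversion there is an assumption:

```r
library(dplyr)

# Hypothetical stand-in for `cow`: eight rows copied from outputs in this section
cow_demo = data.frame(
    neuroticism  = c(16, 8, 5, 8, 9, 6, 17, 7),
    extraversion = c(13, 14, 16, 20, 19, 15, 19, 15),
    sex          = c("female", "male", "male", "female", "male", "male", "male", "male"),
    volunteer    = c("no", "no", "no", "no", "no", "no", "yes", "yes")
)

# Mean-centering within sex: mean() is computed separately for each group
centered = cow_demo %>%
    group_by(sex) %>%
    mutate(extraversion_centered = extraversion - mean(extraversion)) %>%
    ungroup()

# One row of summary stats per sex/volunteer combination
group_stats = cow_demo %>%
    group_by(sex, volunteer) %>%
    summarize(mean = mean(extraversion),
              sd = sd(extraversion),
              .groups = "drop")

centered
group_stats
```

Calling ungroup(), or passing .groups = "drop" to summarize(), removes the grouping afterwards so later steps are not silently applied per group.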
Frequency tables with count
count() gives you a straightforward way to create a frequency table. There’s no built-in way to calculate percentages, but you can easily add them using mutate():
## # A tibble: 2 x 3
## sex n percent
## <fct> <int> <dbl>
## 1 female 780 54.9
## 2 male 641 45.1
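A table like the one above can be produced by following count() with mutate(). The sketch below shows the pattern on a small stand-in data frame (cow_demo, a hypothetical name) built from rows printed in this section, so the percentages differ:

```r
library(dplyr)

# Hypothetical stand-in for `cow`: eight rows copied from outputs in this section
cow_demo = data.frame(
    neuroticism  = c(16, 8, 5, 8, 9, 6, 17, 7),
    extraversion = c(13, 14, 16, 20, 19, 15, 19, 15),
    sex          = c("female", "male", "male", "female", "male", "male", "male", "male"),
    volunteer    = c("no", "no", "no", "no", "no", "no", "yes", "yes")
)

# count() creates the column n; sum(n) is the total number of rows
sex_table = cow_demo %>%
    count(sex) %>%
    mutate(percent = n / sum(n) * 100)
sex_table
```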
If you have multiple levels of grouping, you need to think about how you want to calculate percentages. To get the percentage who volunteered within each sex:
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex volunteer n percent
## <fct> <fct> <int> <dbl>
## 1 female no 431 55.3
## 2 female yes 349 44.7
## 3 male no 393 61.3
## 4 male yes 248 38.7
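The "Groups: sex" line in the output above suggests the counts were grouped by sex before the percentages were calculated. A sketch of that approach on a small stand-in data frame (cow_demo, a hypothetical name) built from rows printed in this section:

```r
library(dplyr)

# Hypothetical stand-in for `cow`: eight rows copied from outputs in this section
cow_demo = data.frame(
    neuroticism  = c(16, 8, 5, 8, 9, 6, 17, 7),
    extraversion = c(13, 14, 16, 20, 19, 15, 19, 15),
    sex          = c("female", "male", "male", "female", "male", "male", "male", "male"),
    volunteer    = c("no", "no", "no", "no", "no", "no", "yes", "yes")
)

# After group_by(sex), sum(n) is the total within each sex,
# so the percentages add up to 100 for each sex separately
by_sex = cow_demo %>%
    count(sex, volunteer) %>%
    group_by(sex) %>%
    mutate(percent = n / sum(n) * 100)
by_sex
```

Without the group_by(sex) step, sum(n) would be the overall total and the percentages would add up to 100 across the whole table instead.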
7.2.3.1 Merging data with left_join()
To combine two dataframes, you can use functions like left_join() and inner_join(). In my experience, left_join() is what you want most of the time. To merge data successfully, all you need is a column that’s present in both datasets that they can be matched on, usually a participant ID or something similar.
Joins can also be useful when you have to calculate a complex summary. You can create a separate table with the summary info, and merge it back into the main dataset. As a simple example, let’s summarize extraversion by both sex and volunteering status and merge it back into the main dataset:
extraversion_info = cow %>%
    group_by(sex, volunteer) %>%
    summarize(mean_extraversion = mean(extraversion))
extraversion_info
## # A tibble: 4 x 3
## # Groups: sex [2]
## sex volunteer mean_extraversion
## <fct> <fct> <dbl>
## 1 female no 12.0
## 2 female yes 12.9
## 3 male no 11.9
## 4 male yes 12.9
## neuroticism extraversion sex volunteer mean_extraversion
## 1 16 13 female no 12.00696
## 2 8 14 male no 11.91349
## 3 5 16 male no 11.91349
## 4 8 20 female no 12.00696
## 5 9 19 male no 11.91349
## 6 6 15 male no 11.91349
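The merge step that produced the last output is not shown; it was presumably a left_join() on the two grouping columns. A sketch of the whole pattern on a small stand-in data frame (cow_demo and extraversion_info_demo are hypothetical names, with rows taken from outputs printed in this section):

```r
library(dplyr)

# Hypothetical stand-in for `cow`: eight rows copied from outputs in this section
cow_demo = data.frame(
    neuroticism  = c(16, 8, 5, 8, 9, 6, 17, 7),
    extraversion = c(13, 14, 16, 20, 19, 15, 19, 15),
    sex          = c("female", "male", "male", "female", "male", "male", "male", "male"),
    volunteer    = c("no", "no", "no", "no", "no", "no", "yes", "yes")
)

# Summary table: one row per sex/volunteer combination
extraversion_info_demo = cow_demo %>%
    group_by(sex, volunteer) %>%
    summarize(mean_extraversion = mean(extraversion), .groups = "drop")

# Every row of cow_demo is kept and picks up the matching group mean
merged = cow_demo %>%
    left_join(extraversion_info_demo, by = c("sex", "volunteer"))
head(merged)
```

Naming the key columns in by is optional here, since left_join() would otherwise match on all columns the two tables share, but being explicit makes the matching rule easier to see.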