Chapter 7 Handout 5: practice + more dplyr
Handout 5 goes over for-loops, which work the same way in the tidyverse as they do in base R. So, this is a good opportunity to practice all of the skills you have learned so far, and to see how the tidyverse approach differs in practice!
The following is a “repeat” of Handout 5 using the tidyverse. We will review past concepts, as well as introduce a few new dplyr verbs.
7.1 Review: What do we do?
Don’t forget! Load the tidyverse package!
library(tidyverse)
data_isolate <- read.table("ANES_isolate.tsv")
First, the handout explores interventionist sentiment by party identification. It wants us to calculate the mean interventionist sentiment (a summary statistic) for each group (party identification).
Exercise: How do we do this via tidyverse?
Solution: group_by() and summarize()!
data_isolate_summary <- data_isolate %>%
  group_by(partyid) %>% # group by party ID
  summarize(mean_isolation = mean(isolate)) %>% # calculate the mean for each group
  ungroup() # don't forget to ungroup!
data_isolate_summary
## # A tibble: 7 × 2
## partyid mean_isolation
## <int> <dbl>
## 1 1 -0.405
## 2 2 -0.438
## 3 3 -0.453
## 4 4 -0.407
## 5 5 -0.560
## 6 6 -0.515
## 7 7 -0.624
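To see the same pattern on a smaller scale, here is a minimal sketch on a made-up data frame. The name toy_data and all of its values are invented for illustration; they are not taken from the ANES data.

```r
library(dplyr)

# toy data, invented for illustration only
toy_data <- tibble(
  partyid = c(1, 1, 4, 4, 7, 7),
  isolate = c(0, -1, -1, -1, 0, -1)
)

toy_summary <- toy_data %>%
  group_by(partyid) %>%                          # one group per party ID
  summarize(mean_isolation = mean(isolate)) %>%  # collapse each group to one row
  ungroup()

toy_summary
```

The result has one row per party ID, with the mean of isolate computed within each group.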
Next, the handout references adding a new column for whether a year has a Democratic president or not.
Exercise: How do we add this new column?
Solution: This is a task for mutate()!
data_isolate <- data_isolate %>%
  mutate(dempres = ifelse(year %in% c(1968, 1980, 1994, 1996, 1998, 2000), 1, 0))
# dempres = 1 if in one of the above years
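As a quick check of how mutate() and ifelse() work together, here is a small sketch on an invented data frame (toy_years is hypothetical, and the list of years is shortened for the example):

```r
library(dplyr)

# toy data, invented for illustration only
toy_years <- tibble(year = c(1956, 1968, 1972, 1980))

toy_years <- toy_years %>%
  # ifelse(test, value_if_true, value_if_false) is applied row by row
  mutate(dempres = ifelse(year %in% c(1968, 1980), 1, 0))

toy_years
```

Each row gets a 1 when its year appears in the vector and a 0 otherwise; the existing year column is left untouched.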
7.2 Skip the For-loop!
Consider the first for-loop. We needed it to reproduce the previous analysis (mean isolation sentiment) by both party ID and year. But is there a better way to do that via tidyverse?
Yes! One cool thing is: group_by() can group by multiple variables! So, we can forgo the for-loop completely:
data_isolate_years <- data_isolate %>%
  group_by(year, partyid) %>% # group by both!
  summarize(mean_isolation = mean(isolate)) %>%
  ungroup() # don't forget to ungroup!
head(data_isolate_years)
## # A tibble: 6 × 3
## year partyid mean_isolation
## <int> <int> <dbl>
## 1 1956 1 -0.267
## 2 1956 2 -0.351
## 3 1956 3 -0.356
## 4 1956 4 -0.354
## 5 1956 5 -0.451
## 6 1956 6 -0.396
See how the grouping above allowed us to produce the relevant statistics for both party ID and year at the same time! All is not well, though.
Our next task, plotting, requires us to find the difference between the extreme party IDs (1 and 7) and the moderates (4). So, we’ll need to do a bit more tidyverse coding.
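A side note on the ungroup() step used in this section: with more than one grouping variable, summarize() by default drops only the last one, so without ungroup() the result would still be grouped by year. Assuming dplyr 1.0.0 or later, the .groups argument can do the same job in one step. The sketch below uses an invented toy_data frame:

```r
library(dplyr)

# toy data, invented for illustration only
toy_data <- tibble(
  year    = c(1956, 1956, 1956, 1958, 1958, 1958),
  partyid = c(1, 1, 7, 1, 7, 7),
  isolate = c(0, -1, -1, -1, 0, -1)
)

toy_by_year <- toy_data %>%
  group_by(year, partyid) %>%
  # .groups = "drop" returns an ungrouped tibble, so no ungroup() is needed
  summarize(mean_isolation = mean(isolate), .groups = "drop")

toy_by_year
```

Either style works; the explicit ungroup() is arguably easier to read when you are still learning the verbs.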
7.3 rename() and select()
NB: The following solution is not the ideal tidyverse solution to this problem. Instead, we will use it as an example to introduce some new dplyr verbs.
Our central challenge: how do we get the differences above (between party ID 4 and each of party IDs 1 and 7) all in the same dataframe?
First, let’s create 3 filters of the dataset, for each of the relevant party IDs (1, 4, 7), so we have that data available:
# three different filtered datasets
data_isolate_1 <- data_isolate_years %>%
  filter(partyid == 1)
data_isolate_4 <- data_isolate_years %>%
  filter(partyid == 4)
data_isolate_7 <- data_isolate_years %>%
  filter(partyid == 7)
But how do we combine those? You might think we can simply bind the columns together (i.e., use cbind(), or column bind). But examine the issues that emerge:
# try cbind
data_cbind <- cbind(data_isolate_1, data_isolate_4, data_isolate_7)
head(data_cbind)
## year partyid mean_isolation year partyid mean_isolation year partyid mean_isolation
## 1 1956 1 -0.2674419 1956 4 -0.3541667 1956 7 -0.3483607
## 2 1958 1 -0.3342776 1958 4 -0.3440860 1958 7 -0.4248366
## 3 1960 1 -0.4180328 1960 4 -0.5612245 1960 7 -0.5609756
## 4 1968 1 -0.4960938 1968 4 -0.4615385 1968 7 -0.5468750
## 5 1972 1 -0.5775401 1972 4 -0.4553846 1972 7 -0.7052239
## 6 1976 1 -0.1944444 1976 4 -0.4688797 1976 7 -0.6787879
There’s a lot to unpack here. First, we have a lot of extra data: we don’t need the years multiple times. Second, we have multiple columns with the same name, which is a major no-no with R (because then if we tried accessing a column by name, which one would it pick?).
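The duplicate-name problem can be seen directly in a tiny base R sketch. The data frames left and right are invented for illustration only:

```r
# toy data frames, invented for illustration only
left  <- data.frame(mean_isolation = c(-0.27, -0.33))
right <- data.frame(mean_isolation = c(-0.35, -0.42))

both <- cbind(left, right)

names(both)          # both columns are called "mean_isolation"
both$mean_isolation  # only the first matching column comes back
```

R does not raise an error here; it quietly returns the first column with that name, which makes the second one easy to lose track of.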
This is where we introduce select() and rename().
select() takes column names as arguments (either bare names or a vector of names). It then “selects” those columns and excludes the rest. In this example, year is repeated three times, which is unnecessary. We also don’t need party ID in every dataset, since we only really care about the mean_isolation column. So, for two of the datasets (let’s choose data_isolate_1 and data_isolate_7), we can select just the mean_isolation column.
# we want only the mean_isolation column
data_isolate_1 <- data_isolate_1 %>%
  select(mean_isolation)
data_isolate_7 <- data_isolate_7 %>%
  select(mean_isolation)
head(data_isolate_1) # now we only have 1 column!
## # A tibble: 6 × 1
## mean_isolation
## <dbl>
## 1 -0.267
## 2 -0.334
## 3 -0.418
## 4 -0.496
## 5 -0.578
## 6 -0.194
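select() can also keep several columns at once, or drop a column with a minus sign. Here is a short sketch on an invented data frame (toy_years and its values are hypothetical):

```r
library(dplyr)

# toy data, invented for illustration only
toy_years <- tibble(
  year           = c(1956, 1958),
  partyid        = c(1, 1),
  mean_isolation = c(-0.27, -0.33)
)

only_mean <- toy_years %>% select(mean_isolation)        # keep one column
two_cols  <- toy_years %>% select(year, mean_isolation)  # keep several columns
no_party  <- toy_years %>% select(-partyid)              # drop one column, keep the rest
```

The last two calls give the same columns here; which form reads better depends on whether it is shorter to list what you keep or what you drop.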
Let’s try cbind() again now, and see the result:
data_cbind <- cbind(data_isolate_1, data_isolate_4, data_isolate_7)
head(data_cbind)
## mean_isolation year partyid mean_isolation mean_isolation
## 1 -0.2674419 1956 4 -0.3541667 -0.3483607
## 2 -0.3342776 1958 4 -0.3440860 -0.4248366
## 3 -0.4180328 1960 4 -0.5612245 -0.5609756
## 4 -0.4960938 1968 4 -0.4615385 -0.5468750
## 5 -0.5775401 1972 4 -0.4553846 -0.7052239
## 6 -0.1944444 1976 4 -0.4688797 -0.6787879
So, we have fewer columns now, but we still need to fix the issue of multiple columns with the same name. Along the same lines, the column names should be a little more descriptive, so we can tell which party each mean_isolation column corresponds to.
Let’s try again, this time renaming the columns. The rename() verb renames our columns for us; when we use rename(), we simply use the syntax rename(newname = oldname):
# remember: rename(newname = oldname)
data_isolate_1 <- data_isolate_1 %>%
  rename(mean_isolation_party1 = mean_isolation)
data_isolate_7 <- data_isolate_7 %>%
  rename(mean_isolation_party7 = mean_isolation)
# our new dataframe
data_cbind <- cbind(data_isolate_1, data_isolate_4, data_isolate_7)
head(data_cbind)
## mean_isolation_party1 year partyid mean_isolation mean_isolation_party7
## 1 -0.2674419 1956 4 -0.3541667 -0.3483607
## 2 -0.3342776 1958 4 -0.3440860 -0.4248366
## 3 -0.4180328 1960 4 -0.5612245 -0.5609756
## 4 -0.4960938 1968 4 -0.4615385 -0.5468750
## 5 -0.5775401 1972 4 -0.4553846 -0.7052239
## 6 -0.1944444 1976 4 -0.4688797 -0.6787879
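As a small aside, select() accepts the same newname = oldname syntax as rename(), so selecting and renaming can be done in a single step. The sketch below uses an invented data frame (toy_years is hypothetical):

```r
library(dplyr)

# toy data, invented for illustration only
toy_years <- tibble(
  year           = c(1956, 1958),
  partyid        = c(1, 1),
  mean_isolation = c(-0.27, -0.33)
)

# rename() keeps every column and changes only the name given
renamed <- toy_years %>%
  rename(mean_isolation_party1 = mean_isolation)

# select() with newname = oldname keeps only that column, under the new name
selected <- toy_years %>%
  select(mean_isolation_party1 = mean_isolation)
```

Keeping the two steps separate, as in the text, makes each verb's job easier to see; combining them is just a shortcut.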
Finally, we want to go back to our original goal: create new columns calculating the differences between the moderates (party ID 4) and the extremes (party ID 1/7). As you probably remember, mutate() handles creating new columns for us:
# mutate will create new columns, called repsvindep and demsvindep, for us
data_final <- data_cbind %>%
  mutate(repsvindep = mean_isolation - mean_isolation_party7) %>%
  mutate(demsvindep = mean_isolation - mean_isolation_party1)
head(data_final)
## mean_isolation_party1 year partyid mean_isolation mean_isolation_party7 repsvindep demsvindep
## 1 -0.2674419 1956 4 -0.3541667 -0.3483607 -0.005806011 -0.086724806
## 2 -0.3342776 1958 4 -0.3440860 -0.4248366 0.080750580 -0.009808401
## 3 -0.4180328 1960 4 -0.5612245 -0.5609756 -0.000248880 -0.143191703
## 4 -0.4960938 1968 4 -0.4615385 -0.5468750 0.085336538 0.034555288
## 5 -0.5775401 1972 4 -0.4553846 -0.7052239 0.249839265 0.122155492
## 6 -0.1944444 1976 4 -0.4688797 -0.6787879 0.209908211 -0.274435224
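The cbind() approach relies on all three filtered datasets having their rows in the same year order. One possible alternative, which is only a sketch and not necessarily the solution the note at the top of this section has in mind, is to reshape the data with pivot_wider() from tidyr so that each party ID becomes its own column, matched by year. The toy_years data frame and the party1/party4/party7 column names are invented for this example:

```r
library(dplyr)
library(tidyr)

# toy data, invented for illustration only
toy_years <- tibble(
  year           = rep(c(1956, 1958), each = 3),
  partyid        = rep(c(1, 4, 7), times = 2),
  mean_isolation = c(-0.27, -0.35, -0.35, -0.33, -0.34, -0.42)
)

toy_wide <- toy_years %>%
  filter(partyid %in% c(1, 4, 7)) %>%      # keep only the three party IDs of interest
  pivot_wider(
    names_from   = partyid,                # one new column per party ID
    values_from  = mean_isolation,
    names_prefix = "party"                 # columns become party1, party4, party7
  ) %>%
  mutate(
    repsvindep = party4 - party7,          # same direction as in the text: 4 minus 7
    demsvindep = party4 - party1           # and 4 minus 1
  )

toy_wide
```

Because the reshaping matches values by year, a missing or reordered year shows up as an NA instead of silently misaligning the rows.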
7.4 (more) Advanced ggplot2
You might have noticed that in previous ggplots, we’ve only been plotting with two different columns. So, how do we plot the same thing in the handout, with different colors for different things?
(under construction)