Chapter 7 Handout 5: practice + more dplyr

Handout 5 goes over for-loops (something that also does not change when moving to the tidyverse). So, this is a good opportunity to practice all of the skills you have learned so far and to see how the tidyverse approach differs in practice!

The following is a “repeat” of Handout 5 using the tidyverse. We will review past concepts, as well as introduce a few new dplyr verbs.

7.1 Review: What do we do?

Don’t forget! Load the tidyverse package!

library(tidyverse)
data_isolate <- read.table("ANES_isolate.tsv")

First, the handout explores interventionist sentiment by party identification. It wants us to calculate the mean interventionist sentiment (a summary statistic) for each group (party identification).

Exercise: How do we do this via tidyverse?

Solution: group_by() and summarize()!

data_isolate_summary <- data_isolate %>%
  group_by(partyid) %>% # group by party ID
  summarize(mean_isolation = mean(isolate)) %>% # calculate the mean for each group
  ungroup() # don't forget to ungroup!
data_isolate_summary
## # A tibble: 7 × 2
##   partyid mean_isolation
##     <int>          <dbl>
## 1       1         -0.405
## 2       2         -0.438
## 3       3         -0.453
## 4       4         -0.407
## 5       5         -0.560
## 6       6         -0.515
## 7       7         -0.624
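One thing to watch for with this pattern: mean() returns NA for any group that contains a missing value. The sketch below uses a made-up toy tibble (toy_data and its values are invented, not taken from the ANES file) to show the difference that na.rm = TRUE makes.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_data <- tibble(
  partyid = c(1, 1, 2, 2),
  isolate = c(-1, 0, NA, -1)
)

toy_summary <- toy_data %>%
  group_by(partyid) %>%
  summarize(
    mean_default = mean(isolate),               # NA if the group has any NA
    mean_no_na   = mean(isolate, na.rm = TRUE)  # drops NAs before averaging
  )
toy_summary
```

Whether dropping missing values is appropriate depends on the analysis, so it is worth checking how many NAs a column has before relying on na.rm = TRUE.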

Next, the handout references adding a new column for whether a year has a Democratic president or not.

Exercise: How do we add this new column?

Solution: This is a task for mutate()!

data_isolate <- data_isolate %>%
  mutate(dempres = ifelse(year %in% c(1968, 1980, 1994, 1996, 1998, 2000), 1, 0))
# dempres = 1 if in one of the above years
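
To see what the ifelse() and %in% combination is doing, here is a minimal sketch on a made-up vector of years (toy_years is hypothetical, not the ANES data):

```r
# Hypothetical years, invented for illustration
toy_years <- c(1956, 1968, 1972, 1980)
dem_years <- c(1968, 1980, 1994, 1996, 1998, 2000)

# %in% checks each element of toy_years against dem_years
in_dem_years <- toy_years %in% dem_years
in_dem_years

# ifelse() turns each TRUE into 1 and each FALSE into 0
dempres <- ifelse(in_dem_years, 1, 0)
dempres
```

Because both functions are vectorized, mutate() can apply them to the whole year column at once, with no loop needed.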

7.2 Skip the For-loop!

Consider the first for-loop. We needed a for-loop in order to reproduce the previous analysis (mean isolation sentiment) by both party ID and year. But is there a better way to do that via tidyverse?

Yes! One cool thing is: group_by() can group by multiple variables! So, we can forgo the for-loop completely:

data_isolate_years <- data_isolate %>%
  group_by(year, partyid) %>% # group by both!
  summarize(mean_isolation = mean(isolate)) %>%
  ungroup() # don't forget to ungroup!
head(data_isolate_years)
## # A tibble: 6 × 3
##    year partyid mean_isolation
##   <int>   <int>          <dbl>
## 1  1956       1         -0.267
## 2  1956       2         -0.351
## 3  1956       3         -0.356
## 4  1956       4         -0.354
## 5  1956       5         -0.451
## 6  1956       6         -0.396

See how the grouping above allowed us to produce the relevant statistics for both party ID and year at the same time! All is not well, though.

Our next task, plotting, requires us to find the difference between the extreme party IDs (1 and 7) and the moderates (4). So, we’ll need to do a bit more tidyverse coding.
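
A side note on the grouped summary earlier in this section: when you group by more than one variable, summarize() by default drops only the last grouping variable, so the result is still grouped by year until you call ungroup(). Assuming dplyr 1.0.0 or later, the .groups argument is another way to get an ungrouped result. A minimal sketch on a made-up tibble (toy_data is hypothetical):

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_data <- tibble(
  year    = c(1956, 1956, 1956, 1958),
  partyid = c(1, 1, 4, 1),
  isolate = c(-1, 0, -1, 0)
)

# Option 1: summarize, then ungroup explicitly
with_ungroup <- toy_data %>%
  group_by(year, partyid) %>%
  summarize(mean_isolation = mean(isolate)) %>%
  ungroup()

# Option 2: ask summarize() to drop all grouping itself
with_groups_arg <- toy_data %>%
  group_by(year, partyid) %>%
  summarize(mean_isolation = mean(isolate), .groups = "drop")

# FALSE means no grouping is left on the result
is_grouped_df(with_groups_arg)
```

Both options should give the same table; which one to use is mostly a matter of taste.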

7.3 rename() and select()

NB: The following solution is not the ideal tidyverse solution to this problem. Instead, we will use it as an example to introduce some new dplyr verbs.

Our central challenge: how do we get the differences above (party ID 4 minus party ID 1, and party ID 4 minus party ID 7) all in the same dataframe?

First, let’s create 3 filters of the dataset, for each of the relevant party IDs (1, 4, 7), so we have that data available:

# three different filtered datasets
data_isolate_1 <- data_isolate_years %>%
  filter(partyid == 1)
data_isolate_4 <- data_isolate_years %>%
  filter(partyid == 4)
data_isolate_7 <- data_isolate_years %>%
  filter(partyid == 7)

But how do we combine those? You might think we can simply bind the columns together (i.e., use cbind(), or column bind). But examine the issues that emerge:

# try cbind
data_cbind <- cbind(data_isolate_1, data_isolate_4, data_isolate_7)

head(data_cbind)
##   year partyid mean_isolation year partyid mean_isolation year partyid mean_isolation
## 1 1956       1     -0.2674419 1956       4     -0.3541667 1956       7     -0.3483607
## 2 1958       1     -0.3342776 1958       4     -0.3440860 1958       7     -0.4248366
## 3 1960       1     -0.4180328 1960       4     -0.5612245 1960       7     -0.5609756
## 4 1968       1     -0.4960938 1968       4     -0.4615385 1968       7     -0.5468750
## 5 1972       1     -0.5775401 1972       4     -0.4553846 1972       7     -0.7052239
## 6 1976       1     -0.1944444 1976       4     -0.4688797 1976       7     -0.6787879

There’s a lot to unpack here. First, we have a lot of extra data: we don’t need the years multiple times. Second, we have multiple columns with the same name, which is a major no-no with R (because then if we tried accessing a column by name, which one would it pick?).
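
To make the second problem concrete, here is a minimal sketch with two made-up data frames (left and right are hypothetical names). After cbind(), asking for a column by name quietly returns only the first match:

```r
# Two hypothetical data frames with identical column names
left  <- data.frame(year = c(1956, 1958), mean_isolation = c(-0.1, -0.2))
right <- data.frame(year = c(1956, 1958), mean_isolation = c(-0.3, -0.4))

combined <- cbind(left, right)
names(combined)                 # both names appear twice
anyDuplicated(names(combined))  # nonzero: position of the first repeated name

# Access by name only ever reaches the first matching column
combined$mean_isolation
```

The values from right are still in the data frame, but they can only be reached by position, which is easy to get wrong.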

This is where we introduce select() and rename().

select() takes column names as its arguments (either listed directly or supplied as a vector of names). It then “selects” those columns and excludes the rest. In this example, we have year repeated three times, which is unnecessary. We also don’t really want party ID, since we only really care about the mean_isolation column. So, for two of the datasets (let’s choose data_isolate_1 and data_isolate_7), we can select just the mean_isolation column.

# we want only the mean_isolation column
data_isolate_1 <- data_isolate_1 %>%
  select(mean_isolation)
data_isolate_7 <- data_isolate_7 %>%
  select(mean_isolation)

head(data_isolate_1) # now we only have 1 column!
## # A tibble: 6 × 1
##   mean_isolation
##            <dbl>
## 1         -0.267
## 2         -0.334
## 3         -0.418
## 4         -0.496
## 5         -0.578
## 6         -0.194
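
select() has a few other forms worth knowing. The sketch below uses a made-up tibble (toy_years is hypothetical) to show keeping several columns at once and dropping a column with a minus sign:

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_years <- tibble(
  year = c(1956, 1958),
  partyid = c(1, 1),
  mean_isolation = c(-0.27, -0.33)
)

# Keep two columns, in the order listed
kept <- toy_years %>%
  select(year, mean_isolation)

# Drop one column and keep everything else
dropped <- toy_years %>%
  select(-partyid)

names(kept)
names(dropped)
```

Here both calls end up with the same two columns; the minus form is handy when it is shorter to list what you want to remove.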

Let’s try cbind() again now, and see the result:

data_cbind <- cbind(data_isolate_1, data_isolate_4, data_isolate_7)
head(data_cbind)
##   mean_isolation year partyid mean_isolation mean_isolation
## 1     -0.2674419 1956       4     -0.3541667     -0.3483607
## 2     -0.3342776 1958       4     -0.3440860     -0.4248366
## 3     -0.4180328 1960       4     -0.5612245     -0.5609756
## 4     -0.4960938 1968       4     -0.4615385     -0.5468750
## 5     -0.5775401 1972       4     -0.4553846     -0.7052239
## 6     -0.1944444 1976       4     -0.4688797     -0.6787879

So, we have fewer columns now, but we still need to fix the issue of multiple columns with the same name. Along the same lines, the column names should be a little more descriptive, so we can tell which party ID each mean_isolation column corresponds to.

Let’s try again, except by renaming the columns. The rename() verb renames our columns for us; when we use rename(), we simply use the syntax rename(newname = oldname):

# remember: rename(newname = oldname)
data_isolate_1 <- data_isolate_1 %>%
  rename(mean_isolation_party1 = mean_isolation)
data_isolate_7 <- data_isolate_7 %>%
  rename(mean_isolation_party7 = mean_isolation)

# our new dataframe
data_cbind <- cbind(data_isolate_1, data_isolate_4, data_isolate_7)
head(data_cbind)
##   mean_isolation_party1 year partyid mean_isolation mean_isolation_party7
## 1            -0.2674419 1956       4     -0.3541667            -0.3483607
## 2            -0.3342776 1958       4     -0.3440860            -0.4248366
## 3            -0.4180328 1960       4     -0.5612245            -0.5609756
## 4            -0.4960938 1968       4     -0.4615385            -0.5468750
## 5            -0.5775401 1972       4     -0.4553846            -0.7052239
## 6            -0.1944444 1976       4     -0.4688797            -0.6787879
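
rename() can also handle several columns in one call, and select() accepts the same newname = oldname syntax if you want to pick and rename in a single step. A minimal sketch with a made-up tibble (toy_party1 and the new column names are hypothetical):

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_party1 <- tibble(
  year = c(1956, 1958),
  partyid = c(1, 1),
  mean_isolation = c(-0.27, -0.33)
)

# Rename two columns at once; all other columns are kept as they are
renamed <- toy_party1 %>%
  rename(survey_year = year, mean_isolation_party1 = mean_isolation)

# Select and rename together; unlisted columns are dropped
selected <- toy_party1 %>%
  select(mean_isolation_party1 = mean_isolation)

names(renamed)
names(selected)
```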

Finally, we want to go back to our original goal: create a new column calculating differences between the moderates (party ID 4) and the extremes (party ID 1/7). As you probably remember, mutate() handles creating new columns for us:

# mutate will create new columns, called repsvindep and demsvindep, for us
data_final <- data_cbind %>%
  mutate(repsvindep = mean_isolation - mean_isolation_party7) %>%
  mutate(demsvindep = mean_isolation - mean_isolation_party1)

head(data_final)
##   mean_isolation_party1 year partyid mean_isolation mean_isolation_party7   repsvindep   demsvindep
## 1            -0.2674419 1956       4     -0.3541667            -0.3483607 -0.005806011 -0.086724806
## 2            -0.3342776 1958       4     -0.3440860            -0.4248366  0.080750580 -0.009808401
## 3            -0.4180328 1960       4     -0.5612245            -0.5609756 -0.000248880 -0.143191703
## 4            -0.4960938 1968       4     -0.4615385            -0.5468750  0.085336538  0.034555288
## 5            -0.5775401 1972       4     -0.4553846            -0.7052239  0.249839265  0.122155492
## 6            -0.1944444 1976       4     -0.4688797            -0.6787879  0.209908211 -0.274435224
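
One caveat about the cbind() approach: it lines rows up purely by position, so it only works when every party ID has a row for every year, in the same order. As the NB at the top of this section says, this is not the ideal tidyverse solution. One possible alternative (a sketch, not necessarily the approach the handout has in mind) is to reshape the summary with pivot_wider() from tidyr, so that each party ID becomes its own column matched on year. The toy data and the party_ prefix are made up for illustration:

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for data_isolate_years
toy_years <- tibble(
  year = c(1956, 1956, 1956, 1958, 1958, 1958),
  partyid = c(1, 4, 7, 1, 4, 7),
  mean_isolation = c(-0.25, -0.35, -0.30, -0.30, -0.40, -0.45)
)

toy_final <- toy_years %>%
  filter(partyid %in% c(1, 4, 7)) %>%
  pivot_wider(
    names_from = partyid,         # one new column per party ID
    values_from = mean_isolation,
    names_prefix = "party_"       # gives party_1, party_4, party_7
  ) %>%
  mutate(
    repsvindep = party_4 - party_7,  # same direction as the mutate() above
    demsvindep = party_4 - party_1
  )
toy_final
```

Because the reshaping matches rows on year, a missing year for one party ID would show up as an NA rather than silently shifting the rows out of alignment.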

7.4 (more) Advanced ggplot2

You might have noticed that in previous ggplots, we’ve only been plotting with two different columns. So, how do we plot the same thing in the handout, with different colors for different things?

(under construction)