6 GGplot 2: Making things pretty

6.1 Something new: R Notebooks and Quarto

library(readxl)
library(tidyverse)

When you open Rstudio, you’ll notice that besides an R Script, you have a bunch of other options.

Figure 6.1: Quarto, R Notebooks, and R Markdown all do basically the same thing, but for this class, we’lll use Quarto.

Most of these are other options are different ways to make notebooks with R code; ways to mix your code and text into a single document.

These are useful for sharing your work with others, keeping a record of your work, or publishing your work in different types of document. This workbook your reading right now is actually a collection of R Notebooks; I wrote the text in Rstudio, and then ran the code in the same document.

The newest of these options is Quarto, which is a polished way to make documents that mix code and text. To create a Quarto document, you can select “Quarto Document” from the “New File” menu in Rstudio.

Figure 6.2: Pretty smooth, huh?

Thinking back, we’ve learned a couple of keyboard shortcuts. We have:

Cmd-shift-M to make a |> pipe,
Cmd-Enter to run a block of code

Now, we have a third option: Cmd-shift-I to insert a new code block. Everything outside these code blocks is text, and everything inside is where you put your R code. Lets try it out by deleting the example code blocks, and adding a new one at the top of your document with Cmd-shift-I.

Inside the code block, let’s start by loading the Tidyverse, then in a second code block, let’s load some data about the Bevölkerung nach Religion, Herkunft und Statistischer Zone. Find the URL to the data, and load it into your document directly using read_csv().

Figure 6.3: Pretty smooth, huh?

Usually when we’re programming, we load all of our libraries at the top of the document. This is because it’s easier to keep track of what libraries we’re using, and of someone else runs our code, they’ll know what libraries they might need to install right off the bat. Let’s keep to this convention and load the Tidyverse at the top of the document.

We then can use read_csv(), and enter a URL to load some data into our document.

After you’ve done this, we can run an entire block of code by clicking the green play button in the top right of the code block, or by pressing Cmd-shift-Enter.You also have a handy little option to “Run all chunks above”, also on the top right of the code block. This will run all the code blocks above the one you’re currently in.

Let’s look at the data, and figure out what each of the columns mean:

religion |> glimpse()

## Rows: 39,308
## Columns: 13
## $ StichtagDatJahr  <dbl> 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993, 19…
## $ StatZoneSort     <dbl> 1101, 1101, 1101, 1101, 1101, 1101, 1102, 1102, 1102, 1102, 1102, 1102, 1103, 1103, 1103, 1103, 1103, 1103, 1104, 1104, 1104, 1104, 1104, 1104, 1201, 1201, 1201, 1201, 1201, 1201, 1202, 1202, 1202, 1202, 1202, 1202, 1203, 1203, 1203, 1203, 12…
## $ StatZoneLang     <chr> "Zähringerstrasse", "Zähringerstrasse", "Zähringerstrasse", "Zähringerstrasse", "Zähringerstrasse", "Zähringerstrasse", "Prediger", "Prediger", "Prediger", "Prediger", "Prediger", "Prediger", "Grossmünster", "Grossmünster", "Grossmünster", "G…
## $ QuarSort         <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14…
## $ QuarLang         <chr> "Rathaus", "Rathaus", "Rathaus", "Rathaus", "Rathaus", "Rathaus", "Rathaus", "Rathaus", "Rathaus", "Rathaus", "Rathaus", "Rathaus", "Rathaus", "Rathaus", "Rathaus", "Rathaus", "Rathaus", "Rathaus", "Rathaus", "Rathaus", "Rathaus", "Rathaus", …
## $ KreisSort        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ HerkunftSort     <dbl> 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1,…
## $ HerkunftCd       <dbl> 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1,…
## $ HerkunftLang     <chr> "Schweizer*in", "Schweizer*in", "Schweizer*in", "Ausländer*in", "Ausländer*in", "Ausländer*in", "Schweizer*in", "Schweizer*in", "Schweizer*in", "Ausländer*in", "Ausländer*in", "Ausländer*in", "Schweizer*in", "Schweizer*in", "Schweizer*in", "A…
## $ Kon2AggSort_noDM <dbl> 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3,…
## $ Kon2AggCd_noDM   <dbl> 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3,…
## $ Kon2AggLang_noDM <chr> "Evangelisch-Reformiert", "Römisch-Katholisch", "Andere, ohne, unbekannt", "Evangelisch-Reformiert", "Römisch-Katholisch", "Andere, ohne, unbekannt", "Evangelisch-Reformiert", "Römisch-Katholisch", "Andere, ohne, unbekannt", "Evangelisch-Refo…
## $ AnzBestWir       <dbl> 157, 122, 129, 8, 73, 95, 333, 208, 269, 14, 66, 89, 314, 176, 209, 24, 68, 94, 260, 141, 169, 11, 45, 59, 121, 136, 42, 15, 33, 27, 28, 63, 21, 2, 17, 13, 103, 44, 56, 15, 26, 46, 195, 111, 122, 14, 20, 32, 197, 106, 101, 6, 56, 50, 66, 38, …

Some of the ones that we want to look at are:

StichtagDatJahr: year
Kon2AggLang_noDM: religion
HerkunftLang: Swiss or foreign
AnzBestWir: number of people

Let’s see how many people are in each listed religion each year:

religion |> 
  group_by(StichtagDatJahr, Kon2AggLang_noDM) |>  # We want to group the data by religion and year
  summarise(total_people = sum(AnzBestWir)) # We want to sum the number of people in each religion and year

## # A tibble: 93 × 3
## # Groups:   StichtagDatJahr [31]
##    StichtagDatJahr Kon2AggLang_noDM        total_people
##              <dbl> <chr>                          <dbl>
##  1            1993 Andere, ohne, unbekannt        97129
##  2            1993 Evangelisch-Reformiert        129157
##  3            1993 Römisch-Katholisch            134612
##  4            1994 Andere, ohne, unbekannt       101494
##  5            1994 Evangelisch-Reformiert        126379
##  6            1994 Römisch-Katholisch            132975
##  7            1995 Andere, ohne, unbekannt       105868
##  8            1995 Evangelisch-Reformiert        123481
##  9            1995 Römisch-Katholisch            131477
## 10            1996 Andere, ohne, unbekannt       109342
## # ℹ 83 more rows

We can plot this pretty easily in GGplot, using a geom_col() to make a bar graph.

religion |> 
  group_by(StichtagDatJahr, Kon2AggLang_noDM) |>
  summarise(total_people = sum(AnzBestWir)) |> 
  ggplot() +
  geom_col(aes(x = StichtagDatJahr, y = total_people, fill = Kon2AggLang_noDM))

geom_col() takes a couple of useful arguments, like position, which can be set to “dodge” to make the bars side by side, or “fill” to make the bars fill the space.

religion |> 
  group_by(StichtagDatJahr, Kon2AggLang_noDM) |>
  summarise(total_people = sum(AnzBestWir)) |> 
  ggplot() +
  geom_col(aes(x = StichtagDatJahr, y = total_people, fill = Kon2AggLang_noDM), position = "dodge")

Here’s the same plot, but with position = "fill" instead of “dodge”:

religion |> 
  group_by(StichtagDatJahr, Kon2AggLang_noDM) |>
  summarise(total_people = sum(AnzBestWir)) |> 
  ggplot() +
  geom_col(aes(x = StichtagDatJahr, y = total_people, fill = Kon2AggLang_noDM), position = "fill")

We’re not restricted to just bar graphs. We can also make line graphs, scatter plots, and more. Here’s an example of a line graph, showing the number of people in each religion over time.

religion |> 
  group_by(StichtagDatJahr, Kon2AggLang_noDM) |>
  summarise(total_people = sum(AnzBestWir)) |> 
  ggplot() +
  geom_line(aes(x = StichtagDatJahr, y = total_people, color = Kon2AggLang_noDM))

Finally, we should always remember to set limits and add titles to our graphs. This can be done using the lims() and labs() functions.

religion |> 
  group_by(StichtagDatJahr, Kon2AggLang_noDM) |>
  summarise(total_people = sum(AnzBestWir)) |> 
  ggplot() +
  geom_line(aes(x = StichtagDatJahr, y = total_people, color = Kon2AggLang_noDM)) +
  labs(title = "Number of people in each religion in Zurich", x = "Year", y = "Number of people") +
  lims(y=c(0, 300000))

### Discussion

Of the graphs we made, which one do you think is the most informative? Why?
Besides what we’ve done above, what other ways could you think of to make these better?

6.2 Classwork: Making your own

Make a graph of your choice using this data.

Here is an example you could try to copy, but make whatever you like.

6.3 Modifying existing plots

We should always label our graphs so that people know what they’re looking at. We can do this using the labs() function. Often, we don’t want to do everything in one step, so we can save our plot as an object, and then add labels to it later.

plt <- # your plot code goes here.

plt_1 <- plt_1 + labs(
  title = "Number of people in each major religious group",
  subtitle = "Zurich, 2023",
  x = "Ratio",
  y = "Gemeinde"
)

plt_1

6.4 Themes

When you look at enough social science stuff, you’ll notice that a lot of the graphs look like these; using the default colors, fonts, and layouts provided by GGplot. This is fine, but we can do better. R comes with some built-in themes that you can use to make your graphs look a little more polished. Here are a couple examples:

plt_1 + theme_minimal()

plt_1 + theme_light()

plt_1 + theme_bw()

There are also some themes that you can install from other packages. Here are a few of my favorites:

library(hrbrthemes)
plt_1 + theme_ipsum()

library(ggdark)
plt_1 + dark_theme_gray()

## Inverted geom defaults of fill and color/colour.
## To change them back, use invert_geom_defaults().

This works a little differently on everybody’s computer, but you can also change the font of your graphs using base_family inside the theme. Here’s an example using the Iosevka font, one of my personal favorite coding fonts:

You, of course, are restricted to the fonts that you have on your computer.

plt_1 + theme_minimal(base_family = "iosevka")

6.5 Modifying themes

You can also modify themes to make them look the way you want. Here’s an example of how you can move the legend to the bottom of the graph, and rotate the x-axis labels. If you want to leave out some text from your plot, you can replace it with element_blank().

plt_1 + 
  theme_minimal(base_family = "iosevka") + 
  theme(
    legend.position = "bottom",
    legend.title = element_blank(),
    axis.text.x = element_text(angle = 45, hjust = 1)
    )

Let’s call this good, and add this to our plot object.

plt_1 <- plt_1 + 
  theme_minimal(base_family = "iosevka") + 
  theme(
    legend.position = "bottom",
    legend.title = element_blank(),
    axis.text.x = element_text(angle = 45, hjust = 1)
    )

6.6 Color schemes

In addition to changing the theme of the layout, you can also specify colors used in the plot itself. There are two ways to do this: use a pre-made color palette, or specify the colors yourself. Here are a few examples of pre-made color palettes:

The viridis color maps come with ggplot, and are a good default.

plt_1 + scale_fill_viridis_d()

This is actually a large collection of color palettes, and you can specify which one you want to use using the “option” argument.

plt_1 + scale_fill_viridis_d(option = "F")

Another option is to use the RColorBrewer package, which has a bunch of color palettes that are good for different types of data.

library(RColorBrewer)
plt_1 + scale_fill_brewer(palette = "Set1")

Finally, you can set your own colors using the scale_fill_manual() function. This function takes a list of colors that you want to use in your plot. Here’s an example of how you can set the colors to be cyan, blue, and green. Note that you can enter colors using the name of a color, a hex code, or as RGB values. We’ll go over this in more detail on week 9.

plt_1 + scale_fill_manual(
  values = c(
    "#11bbbb", 
    "blue",
    rgb(0.1, 0.8, 0.1)
    )
  )

6.7 Classwork: Making stuff look good

For some practice, let’s make some charts that investigate what things might be related to traffic fatalities. This comes from the AER package, and is a data set of traffic fatalities in the US in the 1980s.

library(AER)
data("Fatalities")

You can find some information about the data set here.

I’ve made some basic graphs below. Your job is to make them look good, with themes and color schemes.

Traffic deaths and drinking age:

Fatalities |> 
  filter(year == 1982) |>
  mutate(fatalities_per_cap = fatal / pop) |>
  mutate(drinkage = factor(drinkage)) |>
  ggplot() +
  geom_jitter(aes(x = drinkage, y = fatalities_per_cap, color = drinkage), width = 0.2) +
  labs(title = "Fatalities per capita by legal drinking age", x = "Legal drinking age", y = "Fatalities per capita")

Traffic deaths and percent of young drivers in the population:

Fatalities |> 
  filter(year == 1982) |>
  mutate(fatalities_per_cap = fatal / pop) |>
  mutate(young_drivers_rounded = round(youngdrivers, 2) * 100) |> 
  mutate(young_drivers_rounded = factor(young_drivers_rounded)) |>
  group_by(young_drivers_rounded) |>
  summarise(fatalities_per_cap = mean(fatalities_per_cap)) |> 
  ggplot() +
  geom_col(aes(x = young_drivers_rounded, y = fatalities_per_cap)) +
  labs(title = "Fatalities by young drivers in population", x = "Percent Young drivers", y = "Fatalities")

Traffic deaths and income:

library(ggrepel)

Fatalities |> 
  filter(year == 1982) |>
  mutate(fatalities_per_cap = fatal / pop * 1e6) |>
  mutate(state = toupper(state)) |>
  ggplot(aes(x = income, y = fatalities_per_cap)) +
  geom_point(color="black") +
  geom_smooth(method = "lm") +
  geom_label_repel(aes(label = state, x = income, y = fatalities_per_cap)) +
  labs(title = "Fatalities by income", x = "Income", y = "Fatalities")

Make a new plot investigating something else in the data set. Again, you can find some information about the data set here.

6.8 Discrete vs continuous values

One last thing to note is that there’s a difference between continuous and discrete values. Continuous values are things like age, height, or weight; things that can be any number. Discrete values are things like canton or religion; things that can only be a few different values.

When you’re making a graph, you need to make sure that you’re using the right type of scale. For continuous values, you can use scale_color_gradient(). For discrete values, you can use scale_color_discrete() or scale_fill_discrete().

Pre-made color sets have different versions for continuous and discrete values. For example, scale_fill_viridis_c() is for continuous values, and scale_fill_viridis_d() is for discrete values.

For gradient values, you can specify the low and high colors using the low and high arguments. Here’s an example of how you can make a gradient color scale for income:

Fatalities |> 
  filter(year == 1982) |>
  mutate(fatalities_per_cap = fatal / pop) |>
  ggplot() +
  geom_point(aes(x = income, y = unemp, color = income)) +
  scale_color_gradient(low = "#bb8800", high = "#22bb22") +
  labs(title = "Fatalities by income and unemployment", x = "Income", y = "Unemployment")

6.9 Faceting

One last important tool is faceting. This is when you make a bunch of small graphs, each showing a different part of your data. This is useful when you have a lot of data, and you want to show how different parts of your data are related.

Sometimes our graph is too busy, like this example below. It’s hard to read, and you have a difficult time understanding what’s going on.

Fatalities |>
  filter(state %in% c("ny", "ca", "tx", "fl", "il")) |> 
  mutate(fatalities_per_cap = fatal / pop) |>
  ggplot() +
  geom_col(aes(x=year, y=fatalities_per_cap, fill=state), position="dodge") +
  scale_fill_brewer(palette = "Set1")

Instead, we can add an element to GGplot, facet_wrap(), which will make a bunch of small graphs, each showing a different part of your data.

Fatalities |>
  filter(state %in% c("ny", "ca", "tx", "fl", "il")) |> 
  mutate(fatalities_per_cap = fatal / pop) |>
  ggplot() +
  geom_col(aes(x=year, y=fatalities_per_cap, fill=state), position="dodge") +
  facet_wrap(~state) +
  scale_fill_brewer(palette = "Set1")

facet_wrap() has a few options that you can use to make your graphs look better. For example, you can specify how many columns you want with the ncol argument.

Fatalities |>
  filter(state %in% c("ny", "ca", "tx", "fl", "il")) |> 
  mutate(fatalities_per_cap = fatal / pop) |>
  mutate(year = as.numeric(as.character(year))) |> 
  ggplot() +
  geom_col(aes(x=year, y=fatalities_per_cap, fill=state), position="dodge") +
  facet_wrap(~state, ncol = 5) +
  scale_fill_brewer(palette = "Set1")

6.10 Homework and practice

For the next week, you’ll be working on a larger project. But if you want some more practice, you can make some more charts using pre-assembled data sets like this one. You can find a huge list of them here.