5.4 Categorical variables (2+)

5.4.1 Data & Packages & functions

  • Data: Two or more categorical variables
  • Challenge: We need to summarize the data so as to obtain frequencies (absolute or relative)
    • The result is a long-format dataframe that contains the frequencies of the different category combinations
  • Packages & functions:
    • tidyr and pivot_longer() function
    • dplyr and functions such as summarize(), mutate() etc.

5.4.2 Graph

  • Here we’ll reproduce, critique, and possibly improve Figure 5.5 (Bauer and Clemm von Hohenberg 2020)
  • Questions:
    • What does the graph show? What are the underlying variables (and data)? [33]
    • How many scales/mappings does it use? Could we reduce them? [34]
    • What do you like, what do you dislike about the figure? What is good, what is bad?
    • What kind of information could we add to the graph (if any)?
    • How would you approach a replication of the graph?

Figure 5.5: Several categorical variables

5.4.3 Lab: Data & Code

  • The code for Figure 5.5 is shown below (and creates Figure 5.6).

  • Learning objectives

    • How to manipulate data first and visualize it thereafter
      • How to use pivot_longer (ggplot likes long format!)
    • How to visualize unordered and ordered variables
    • How to name scales and create manual ones
    • How to create labels from data
      • Make sure the only thing you might need to add for the labels is gsub(), i.e., no substantive changes to the label names
    • How to size text elements

We start by importing the original (unsummarised) data. As you can see below, all four variables are categorical string variables.

## Parsed with column specification:
## cols(
##   account_email = col_character(),
##   account_fb = col_character(),
##   account_twitter = col_character(),
##   account_whatsapp = col_character()
## )
account_email | account_fb | account_twitter | account_whatsapp
Yes | Yes | No | Yes
Yes | Yes | No | Yes
Yes | Yes, but I dont use it | No | Yes
Yes | Yes | No | Yes
Yes | No | No | Yes
Yes | Yes | Yes | Yes

Subsequently, we summarize/aggregate the data, producing a dataframe that contains the percentage of people in each category ("Yes", "Yes, but inactive", "No") across the four variables. It’s a good idea to name the data that forms the basis of our plot data_plot, so that the original dataset data stays untouched. Let’s go through the code below step by step:
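The summarizing step can be sketched as follows. This is a minimal sketch rather than the original code: it assumes the raw data look like the six rows printed above (recreated here by hand, since the data file itself is not shown), and the label lookup `platform_labels` is a hypothetical helper. With only six rows the percentages will differ from those in the full dataset.

```r
library(dplyr)
library(tidyr)

# Stand-in for the imported data: the six rows printed above
data <- tibble::tribble(
  ~account_email, ~account_fb,              ~account_twitter, ~account_whatsapp,
  "Yes",          "Yes",                    "No",             "Yes",
  "Yes",          "Yes",                    "No",             "Yes",
  "Yes",          "Yes, but I dont use it", "No",             "Yes",
  "Yes",          "Yes",                    "No",             "Yes",
  "Yes",          "No",                     "No",             "Yes",
  "Yes",          "Yes",                    "Yes",            "Yes"
)

# Hypothetical lookup from column names to display labels
platform_labels <- c(
  account_email    = "Email",
  account_fb       = "Facebook",
  account_twitter  = "Twitter",
  account_whatsapp = "Whatsapp"
)

data_plot <- data %>%
  # Wide to long: one row per respondent and platform
  pivot_longer(cols = everything(),
               names_to = "variable", values_to = "category") %>%
  # Absolute frequencies of each platform/category combination
  count(variable, category, name = "n") %>%
  # Relative frequencies (in %) within each platform
  group_by(variable) %>%
  mutate(value = round(100 * n / sum(n))) %>%
  ungroup() %>%
  mutate(
    variable = unname(platform_labels[variable]),
    # Shorten the label with gsub(), then fix the level order
    category = gsub("Yes, but I dont use it", "Yes, but inactive",
                    category, fixed = TRUE),
    category = factor(category,
                      levels = c("Yes", "Yes, but inactive", "No"))
  ) %>%
  select(variable, category, value)
```

Keeping the result in data_plot leaves data unchanged, so the summary can be rebuilt at any time. Note that count() only returns combinations that actually occur in the data.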

variable | category | value
Email | Yes | 98
Email | Yes, but inactive | 1
Email | No | 1
Facebook | Yes | 61
Facebook | Yes, but inactive | 8
Facebook | No | 30
Twitter | Yes | 15
Twitter | Yes, but inactive | 8
Twitter | No | 77
Whatsapp | Yes | 83
Whatsapp | Yes, but inactive | 1
Whatsapp | No | 16

Check out the variables in the data. Importantly, category is a factor whose levels are stored in a meaningful order ("Yes", "Yes, but inactive", "No"):

## tibble [12 x 3] (S3: tbl_df/tbl/data.frame)
##  $ variable: chr [1:12] "Email" "Email" "Email" "Facebook" ...
##  $ category: Factor w/ 3 levels "Yes","Yes, but inactive",..: 1 2 3 1 2 3 1 2 3 1 ...
##  $ value   : num [1:12] 98 1 1 61 8 30 15 8 77 83 ...

Now that we have prepared the data, we can plot it in Figure 5.6 (fairly easy!). Again, let’s go through this step by step:
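A plot of this kind could be built roughly as follows. This is a sketch under assumptions, not the original code for Figure 5.6: the data frame is retyped from the summary table above, and the geom choice (dodged bars), colors, alpha values, axis title, and text sizes are illustrative guesses. The four mappings are x, y, fill (color), and alpha.

```r
library(ggplot2)

# Summary table from above, retyped so the example is self-contained
data_plot <- data.frame(
  variable = rep(c("Email", "Facebook", "Twitter", "Whatsapp"), each = 3),
  category = factor(
    rep(c("Yes", "Yes, but inactive", "No"), times = 4),
    levels = c("Yes", "Yes, but inactive", "No")
  ),
  value = c(98, 1, 1, 61, 8, 30, 15, 8, 77, 83, 1, 16)
)

dodge <- position_dodge(width = 0.9)

p <- ggplot(data_plot, aes(x = variable, y = value)) +
  # Platform is mapped to both x and fill; category to alpha
  geom_col(aes(fill = variable, alpha = category), position = dodge) +
  # Labels are created from the data; group keeps them above their bars
  geom_text(aes(label = paste0(value, "%"), group = category),
            position = dodge, vjust = -0.5, size = 3) +
  # Named, manual scales (colors and alpha values are assumptions)
  scale_fill_manual(
    name = "Platform",
    values = c(Email = "#1b9e77", Facebook = "#3b5998",
               Twitter = "#1da1f2", Whatsapp = "#25d366")
  ) +
  scale_alpha_manual(
    name = "Account",
    values = c("Yes" = 1, "Yes, but inactive" = 0.6, "No" = 0.3)
  ) +
  labs(x = NULL, y = "Percentage (%)") +
  theme_light() +
  # Size the text elements
  theme(axis.text = element_text(size = 12),
        legend.text = element_text(size = 10))

p
```

Mapping alpha only inside geom_col() keeps the text labels fully opaque, and the explicit group aesthetic makes the labels dodge in the same order as the bars.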

Figure 5.6: Distribution of four categorical variables

Strictly speaking, the coloring in Figure 5.6 is not necessary, as the platforms are already encoded on the x-axis. Figure 5.6 uses four mappings (x, y, alpha/luminance, color) for three variables. Hence, we could also use a grayscale version of the graph, which you can see in Figure 5.7. [35]
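One way to drop the redundant color mapping is to map only the account category to fill and use a gray scale, which leaves three mappings (x, y, fill) for three variables. The sketch below is an assumption about how such a version could look, not the original code for Figure 5.7. The data frame is retyped from the summary table, and the gray range and axis title are illustrative.

```r
library(ggplot2)

# Summary table from above, retyped so the example is self-contained
data_plot <- data.frame(
  variable = rep(c("Email", "Facebook", "Twitter", "Whatsapp"), each = 3),
  category = factor(
    rep(c("Yes", "Yes, but inactive", "No"), times = 4),
    levels = c("Yes", "Yes, but inactive", "No")
  ),
  value = c(98, 1, 1, 61, 8, 30, 15, 8, 77, 83, 1, 16)
)

dodge <- position_dodge(width = 0.9)

p_gray <- ggplot(data_plot, aes(x = variable, y = value, fill = category)) +
  # Platform is encoded on the x-axis only; category by shade of gray
  geom_col(position = dodge) +
  geom_text(aes(label = paste0(value, "%"), group = category),
            position = dodge, vjust = -0.5, size = 3) +
  # Dark to light gray (the range is an assumption)
  scale_fill_grey(name = "Account", start = 0.2, end = 0.8) +
  labs(x = NULL, y = "Percentage (%)") +
  theme_light()

p_gray
```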

Figure 5.7: Distribution of four categorical variables

5.4.4 Exercise

  • Figure 5.8 uses code that is very similar to Figure 5.6. Can you recreate it?
  1. Load the summarized data (we’ll skip the data management steps).
## Parsed with column specification:
## cols(
##   variable = col_character(),
##   category = col_character(),
##   value = col_double()
## )
  2. Try to recreate Figure 5.8 using the code from Figure 5.6.
  3. How would we modify the code (data) if we wanted to show just 2 out of 4 variables or just 3 out of 5 categories?
  4. How would we visualize a fourth categorical variable?
  5. Here we used colors for differentiating. What would be an alternative way?

Figure 5.8: Distribution of four categorical variables
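A practical note on the loading step: the column specification reports category as col_character(), so the level order has to be restored before plotting. The sketch below assumes the categories appear in the file in the intended order. The data frame is a hypothetical stand-in that reuses the values from the lab's summary table, since the exercise file itself is not shown here.

```r
# Hypothetical stand-in for the loaded file: category arrives as character
data_plot <- data.frame(
  variable = rep(c("Email", "Facebook", "Twitter", "Whatsapp"), each = 3),
  category = rep(c("Yes", "Yes, but inactive", "No"), times = 4),
  value = c(98, 1, 1, 61, 8, 30, 15, 8, 77, 83, 1, 16),
  stringsAsFactors = FALSE
)

# Convert to a factor, keeping the order in which categories first appear
# (without this, ggplot2 would sort the categories alphabetically)
data_plot$category <- factor(data_plot$category,
                             levels = unique(data_plot$category))

levels(data_plot$category)
```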


Bauer, Paul C, and Bernhard Clemm von Hohenberg. 2020. “Believing and Sharing Information by Fake Sources: An Experiment.”

  33. The data consist of four categorical variables with the same ordered categories. In the graph, the four variables are combined into two categorical variables: Platform, with unordered categories, and Account, with three ordered categories.

  34. We use both the x-scale and color to encode the same variable, namely platform. This could be reduced to a single mapping.

  35. Instead of relying on luminance, one could also choose colors that remain discernible when printed in grayscale.