5.5 Categorical variables (2+)

5.5.1 Data & Packages & functions

  • Data: Two or more categorical variables
  • Challenge: We need to summarize the data so as to obtain frequencies (absolute or relative)
    • Create long format dataframe that contains the frequencies of different category combinations
  • Packages & functions:
    • tidyr and pivot_longer() function
    • dplyr and functions such as summarize(), mutate() etc.
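
The summarizing step can be sketched on a small toy dataset. This is a minimal illustration, not the chapter's own code: the tibble toy, its column names, and the result name toy_freq are hypothetical. It stacks two wide columns with pivot_longer() and then computes absolute and relative frequencies per variable.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two categorical variables in wide format
toy <- tibble(
  account_a = c("Yes", "No", "Yes", "Yes"),
  account_b = c("No", "No", "Yes", "No")
)

# Stack both columns into variable/value pairs, then count each combination
toy_freq <- toy %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "value") %>%
  count(variable, value, name = "n") %>%  # absolute frequencies
  group_by(variable) %>%
  mutate(share = n / sum(n)) %>%          # relative frequencies within each variable
  ungroup()

toy_freq
```

The result has one row per combination of variable and category, which is the long format that ggplot2 works with most easily.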

5.5.2 Graph

  • Here we’ll reproduce, critique, and possibly improve Figure 5.5 (Bauer and Clemm von Hohenberg 2020)
  • Questions:
    • What does the graph show? What are the underlying variables (and data)?[29]
    • How many scales/mappings does it use? Could we reduce them?[30]
    • What do you like, what do you dislike about the figure? What is good, what is bad?
    • What kind of information could we add to the graph (if any)?
    • How would you approach a replication of the graph?



Figure 5.5: Several categorical variables



5.5.3 Lab: Data & Code

  • The code for Figure 5.5 is shown below (and creates Figure 5.6).

  • Learning objectives

    • How to manipulate data first and visualize it thereafter
      • How to use pivot_longer (ggplot likes long format!)
    • How to visualize unordered and ordered variables
    • How to name scales and create manual ones
    • How to create labels from data
      • Make sure the only thing you need to apply to the labels is gsub(), i.e., no substantive changes to the label names
    • How to size text elements

We start by importing the original (unsummarised) data. As you can see below, the variables are categorical and stored as strings.

library(tidyverse) # loads readr, dplyr, tidyr, stringr and ggplot2

# data_account_prevalence.csv
#data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                         "1xbffXai-HqWS2Q17KoB_MCCY7YOpskyH"))

data <- read_csv("data/data_account_prevalence.csv",
                 col_types = cols())

head(data)
account_email  account_fb              account_twitter  account_whatsapp
Yes            Yes                     No               Yes
Yes            Yes                     No               Yes
Yes            Yes, but I dont use it  No               Yes
Yes            Yes                     No               Yes
Yes            No                      No               Yes
Yes            Yes                     Yes              Yes



Subsequently, we summarize (aggregate) the data, producing a dataframe that contains the percentage of people in each category ("Yes", "Yes, but inactive", "No") for each of the four variables. It’s a good idea to give the data underlying the plot its own name, data_plot, so that the original dataset data is kept intact. Let’s go through the code below step by step:

# Creating plot data
  data_plot <- data %>% 
  
      pivot_longer(cols = account_email:account_whatsapp, # formerly gather
                   names_to = "variable", 
                   values_to = "value") %>%
  
      group_by(variable) %>%
  
      summarize(pct.Yes = mean(value == "Yes", na.rm=TRUE),
                pct.Inactive = mean(value == "Yes, but I dont use it", na.rm=TRUE),
                pct.No = mean(value == "No", na.rm=TRUE)) %>% 
  
      pivot_longer(cols = pct.Yes:pct.No, # formerly gather; one row per platform and category
                   names_to = "category", 
                   values_to = "value") %>%
  
      mutate(category = factor(category, # Create factor for ordering
                               levels = c("pct.Yes", "pct.Inactive", "pct.No"), 
                               ordered = TRUE)) %>%
  
      mutate(value = round(value,2)) %>% 
      mutate(value = 100 * value)

# Change the labels and translate them to English (so we can directly pull them out later)
  data_plot$category <- gsub("Inactive", "Yes, but inactive", data_plot$category)
  data_plot$category <- gsub("pct.", "", data_plot$category, fixed = TRUE) # match the dot literally
  
  data_plot$category <- factor(data_plot$category,
                               levels = c("Yes", "Yes, but inactive", "No"),
                               ordered = TRUE)
  
  # Derive display names from the column names: drop the prefix, expand "fb", capitalize
  data_plot$variable <- str_to_title(gsub("fb", "facebook", gsub("account_", "", 
                                          data_plot$variable)))
  data_plot
variable  category           value
Email     Yes                98
Email     Yes, but inactive  1
Email     No                 1
Facebook  Yes                61
Facebook  Yes, but inactive  8
Facebook  No                 30
Twitter   Yes                15
Twitter   Yes, but inactive  8
Twitter   No                 77
Whatsapp  Yes                83
Whatsapp  Yes, but inactive  1
Whatsapp  No                 16
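
The label-cleaning steps can also be tried in isolation. The sketch below is illustrative only: the vector raw_names repeats the column names from the raw data, while platform_labels and the string "pctXYes" are hypothetical. It also shows why it is safer to match the dot in "pct." literally, since in a regular expression "." matches any character.

```r
library(stringr)

# Column names as they appear in the raw data
raw_names <- c("account_email", "account_fb", "account_twitter", "account_whatsapp")

# Strip the prefix, expand the abbreviation, then capitalize
platform_labels <- str_to_title(gsub("fb", "facebook",
                                     gsub("account_", "", raw_names, fixed = TRUE),
                                     fixed = TRUE))
platform_labels

# In a regular expression "." matches any character, so "pct." would also
# match the start of the hypothetical name "pctXYes"
regex_result   <- gsub("pct.", "", "pctXYes")                # strips "pctX"
literal_result <- gsub("pct.", "", "pctXYes", fixed = TRUE)  # leaves the string unchanged
```

Because the labels are derived from the data rather than typed by hand, they stay in sync if the set of platforms changes.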



Check out the variables in the data. Importantly, there is an ordered factor in there:

str(data_plot)
## tibble [12 x 3] (S3: tbl_df/tbl/data.frame)
##  $ variable: chr [1:12] "Email" "Email" "Email" "Facebook" ...
##  $ category: Ord.factor w/ 3 levels "Yes"<"Yes, but inactive"<..: 1 2 3 1 2 3 1 2 3 1 ...
##  $ value   : num [1:12] 98 1 1 61 8 30 15 8 77 83 ...
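
Why the explicit levels matter can be seen in a small base R sketch. The object names category_labels, default_levels, and cat_ordered are hypothetical; the labels themselves are the three categories used above.

```r
# The three category labels used in data_plot
category_labels <- c("Yes", "Yes, but inactive", "No")

# Without explicit levels, factor() sorts the levels alphabetically
default_levels <- levels(factor(category_labels))
default_levels

# With explicit levels, the substantive order is preserved
cat_ordered <- factor(category_labels,
                      levels = category_labels,
                      ordered = TRUE)
levels(cat_ordered)

# Ordered factors support comparisons that follow the level order
cat_ordered[1] < cat_ordered[3]
```

ggplot2 arranges bars and legend entries by level order, so the alphabetical default would place "No" first.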



Now that we have prepared the data, we can plot it in Figure 5.6 (fairly easy!). Again, let’s go through this step by step:

# Creating the plot
ggplot(data_plot,aes(x = variable, y = value))+
  geom_bar(stat="identity", 
           width=0.7, 
           position = position_dodge(width=0.8),
           aes(fill = factor(variable),
               alpha=category)) +
  geom_text(position = position_dodge(width=0.8),
           aes(alpha=category,
               label = paste(value,"%", sep="")), 
           vjust=1.6, 
           color="black", 
           size=2) +
  
  scale_fill_discrete(name="Platform") +
  
  scale_alpha_discrete(name="Account",
                       range=c(1, 0.5)) + 
  xlab("Platforms")+
  ylab("Percentage (%)")+ 
  theme_light() +
  theme(axis.text.x = element_text(angle=35, hjust=1, vjust=1, margin=margin(0.2,0,0.3,0,"cm")), 
        plot.title = element_text(hjust = 0.5),
        plot.margin = margin(0.5, 0.5, 0, 0.5, "cm"),      
        panel.grid.major.x = element_blank(),
        legend.title = element_text(size=10),
        legend.text = element_text(size=9))

Figure 5.6: Distribution of four categorical variables



Strictly speaking, the coloring in Figure 5.6 is not necessary, as the platforms are already encoded on the x-axis. Figure 5.6 uses four mappings (x, y, alpha/luminance, color) for three variables. Hence, we could also use a grayscale version of the graph, which you see below in Figure 5.7.[31]

# Creating the plot
ggplot(data_plot,aes(x = variable, y = value))+
  geom_bar(stat="identity", 
           width=0.7, 
           position = position_dodge(width=0.8),
           aes(#fill = factor(variable),
               alpha=factor(category))) +
  geom_text(position = position_dodge(width=0.8),
           aes(alpha=factor(category),
               label = paste(value,"%", sep="")), 
           vjust=1.6, 
           color="black", 
           size=2) +
  

  
  scale_alpha_discrete(name="Account",
                       range=c(1, 0.5),
                       labels=c("Yes", "Yes, but inactive", "No")) + 
  
  scale_x_discrete(labels=str_to_title(gsub("fb", "facebook", 
                                            gsub("account_", "", unique(data_plot$variable))))) + 
  
  xlab("Platforms")+
  ylab("Percentage (%)")+ 
  theme_light() +
  theme(axis.text.x = element_text(angle=35, hjust=1, vjust=1, margin=margin(0.2,0,0.3,0,"cm")), 
        plot.title = element_text(hjust = 0.5),
        plot.margin = margin(0.5, 0.5, 0, 0.5, "cm"),      
        panel.grid.major.x = element_blank(),
        legend.title = element_text(size=10),
        legend.text = element_text(size=9))

Figure 5.7: Distribution of four categorical variables
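
As a variation on the grayscale idea, the categories could be mapped to fill on a gray scale instead of to alpha. The sketch below is an assumption, not the code behind Figure 5.7: it rebuilds data_plot from the summarized values printed earlier so that it runs on its own, and the name p_grey as well as the start and end values of the gray range are arbitrary choices.

```r
library(ggplot2)

# Summarized values copied from the data_plot output shown earlier
data_plot <- data.frame(
  variable = rep(c("Email", "Facebook", "Twitter", "Whatsapp"), each = 3),
  category = factor(rep(c("Yes", "Yes, but inactive", "No"), times = 4),
                    levels = c("Yes", "Yes, but inactive", "No"),
                    ordered = TRUE),
  value = c(98, 1, 1, 61, 8, 30, 15, 8, 77, 83, 1, 16)
)

# Map category to fill; the gray scale runs from dark ("Yes") to light ("No")
p_grey <- ggplot(data_plot, aes(x = variable, y = value, fill = category)) +
  geom_col(width = 0.7, position = position_dodge(width = 0.8)) +
  geom_text(aes(label = paste0(value, "%")),
            position = position_dodge(width = 0.8),
            vjust = -0.4,   # place labels just above the bars
            size = 2) +
  scale_fill_grey(name = "Account", start = 0.2, end = 0.8) +
  xlab("Platforms") +
  ylab("Percentage (%)") +
  theme_light()

p_grey
```

Compared with alpha, a fill scale keeps the text labels fully opaque, at the cost of using the fill aesthetic, which is then no longer available for another variable.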

5.5.4 Exercise

  • Figure 5.8 uses different data, but its code is very similar to that of Figure 5.6. Can you recreate it?
  1. Load the summarized data (we’ll skip the data management steps).
# data_sharing_frequency_summarized.csv
#data_plot <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
                              #"16oXege9RqvtIkBppZvkQNb4gl-lmSyoU"))
data_plot <- read_csv("www/data/data_sharing_frequency_summarized.csv")
head(data_plot)
variable  category  value
Email     Daily     7
Facebook  Daily     7
Twitter   Daily     7
Whatsapp  Daily     17
Email     Rarer     67
Facebook  Rarer     53
  2. Try to recreate Figure 5.8 using the code from Figure 5.6. There is a mistake in Figure 5.8. Can you spot it?
  3. How would we modify the code (or the data) if we wanted to show just two of the four variables, or just three of the five categories?
  4. How could we visualize a fourth categorical variable?
  5. Here we used colors to differentiate. What would be an alternative?

Figure 5.8: Distribution of four categorical variables

References

Bauer, Paul C., and Bernhard Clemm von Hohenberg. 2020. “Believing and Sharing Information by Fake Sources: An Experiment.”

  29. Data: The data consist of four categorical variables that share the same ordered categories. In the graph, they are combined into two categorical variables: Platform, with unordered categories, and Account, with three ordered categories.

  30. We use both the x-scale and color for the same mapping, namely platforms. This could be reduced.

  31. Instead of luminance, one could also choose colors that remain discernible when printed in grayscale.