5.5 Categorical variables (2+)
5.5.1 Data & Packages & functions
- Data: Two or several categorical variables
- Challenge: We need to summarize the data so as to obtain frequencies (absolute or relative)
- Create long format dataframe that contains the frequencies of different category combinations
- Packages & functions:
  - tidyr and its pivot_longer() function
  - dplyr and functions such as summarize(), mutate(), etc.
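As a minimal sketch of the challenge described above, the following toy example reshapes two categorical variables into long format and computes absolute and relative frequencies per category. The data frame `toy` and the column names `n` and `share` are made up for illustration and are not part of the lab data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two categorical variables for five respondents
toy <- tibble(
  account_email = c("Yes", "Yes", "No", "Yes", "Yes"),
  account_fb    = c("Yes", "No", "No", "Yes", "No")
)

toy_freq <- toy %>%
  pivot_longer(cols = everything(),         # stack both columns
               names_to = "variable",
               values_to = "category") %>%
  count(variable, category, name = "n") %>% # absolute frequencies
  group_by(variable) %>%
  mutate(share = n / sum(n)) %>%            # relative frequencies per variable
  ungroup()

toy_freq
```

Each row of `toy_freq` is one combination of variable and category, which is the long format that ggplot2 works with most easily.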
5.5.2 Graph
- Here we’ll reproduce and maybe criticize as well as improve Figure 5.5 (Bauer and Clemm von Hohenberg 2020)
- Questions:
- What does the graph show? What are the underlying variables (and data)?[29]
- How many scales/mappings does it use? Could we reduce them?[30]
- What do you like, what do you dislike about the figure? What is good, what is bad?
- What kind of information could we add to the graph (if any)?
- How would you approach a replication of the graph?

Figure 5.5: Several categorical variables
5.5.3 Lab: Data & Code
The code for Figure 5.5 is shown below (and creates Figure 5.6).
Learning objectives
- How to manipulate data first and visualize it thereafter
- How to use pivot_longer (ggplot likes long format!)
- How to use
- How to visualize unordered and ordered variables
- How to name scales and create manual ones
- How to create labels from data
- Make sure the only thing you might need to add for the labels is gsub, i.e., no substantive changing of label names
- How to size text elements
We start by importing the original (unsummarised) data. As you can see below, we have categorical string variables.
library(tidyverse) # provides read_csv(), the pipe, tidyr, dplyr, stringr and ggplot2

# data_account_prevalence.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                          "1xbffXai-HqWS2Q17KoB_MCCY7YOpskyH"))
data <- read_csv("data/data_account_prevalence.csv",
                 col_types = cols())
head(data)
account_email | account_fb | account_twitter | account_whatsapp |
---|---|---|---|
Yes | Yes | No | Yes |
Yes | Yes | No | Yes |
Yes | Yes, but I dont use it | No | Yes |
Yes | Yes | No | Yes |
Yes | No | No | Yes |
Yes | Yes | Yes | Yes |
Subsequently, we summarize/aggregate the data, producing a dataframe that contains the percentage of people in each category (Yes; Yes, but inactive; No) across the four variables. It's a good idea to call the data that forms the basis for our plot data_plot so as to keep the original dataset data intact. Let's go through the code below step by step:
# Creating plot data
data_plot <- data %>%
  pivot_longer(cols = account_email:account_whatsapp, # formerly gather
               names_to = "variable",
               values_to = "value") %>%
  group_by(variable) %>%
  # Share of respondents per answer category within each variable
  summarize(pct.Yes = mean(value == "Yes", na.rm = TRUE),
            pct.Inactive = mean(value == "Yes, but I dont use it", na.rm = TRUE),
            pct.No = mean(value == "No", na.rm = TRUE)) %>%
  pivot_longer(cols = pct.Yes:pct.No, # formerly gather
               names_to = "category",
               values_to = "value") %>%
  mutate(category = factor(category, # Create factor for ordering
                           levels = c("pct.Yes", "pct.Inactive", "pct.No"),
                           ordered = TRUE)) %>%
  mutate(value = round(value, 2)) %>%
  mutate(value = 100 * value) # convert shares to percentages

# Change the labels and translate to English (so we can directly pull them out later)
data_plot$category <- gsub("Inactive", "Yes, but inactive", data_plot$category)
# fixed = TRUE treats "pct." literally instead of as a regular expression
data_plot$category <- gsub("pct.", "", data_plot$category, fixed = TRUE)
data_plot$category <- factor(data_plot$category,
                             levels = c("Yes", "Yes, but inactive", "No"),
                             ordered = TRUE)
data_plot$variable <- factor(data_plot$variable)
# account_email -> Email, account_fb -> Facebook, etc. (result is a character vector)
data_plot$variable <- str_to_title(gsub("fb", "facebook", gsub("account_", "",
                                                               data_plot$variable)))
data_plot
variable | category | value |
---|---|---|
Email | Yes | 98 |
Email | Yes, but inactive | 1 |
Email | No | 1 |
Facebook | Yes | 61 |
Facebook | Yes, but inactive | 8 |
Facebook | No | 30 |
Twitter | Yes | 15 |
Twitter | Yes, but inactive | 8 |
Twitter | No | 77 |
Whatsapp | Yes | 83 |
Whatsapp | Yes, but inactive | 1 |
Whatsapp | No | 16 |
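The label cleaning in the code above relies on nested gsub() calls plus str_to_title(). The short sketch below isolates that step on the four column names from the lab data and also illustrates why passing `fixed = TRUE` can be safer when the pattern contains a dot. The object names `raw_names`, `platform_labels`, `regex_result` and `fixed_result`, and the string "pctXYes", are made up for illustration.

```r
library(stringr)

# Column names as they appear in the lab data
raw_names <- c("account_email", "account_fb", "account_twitter", "account_whatsapp")

# Strip the prefix, expand the abbreviation, then capitalise
platform_labels <- str_to_title(gsub("fb", "facebook", gsub("account_", "", raw_names)))
platform_labels

# In a regular expression "." matches any character, so "pct." also matches "pctX"
regex_result <- gsub("pct.", "", "pctXYes")               # pattern read as a regex
fixed_result <- gsub("pct.", "", "pctXYes", fixed = TRUE) # pattern read literally
c(regex_result, fixed_result)
```

For the category names used in this lab both variants give the same result, so the choice only matters if other names could start with "pct" followed by a different character.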
Check out the variables in the data. Importantly, there is an ordered factor in there:
str(data_plot)
## tibble [12 x 3] (S3: tbl_df/tbl/data.frame)
## $ variable: chr [1:12] "Email" "Email" "Email" "Facebook" ...
## $ category: Ord.factor w/ 3 levels "Yes"<"Yes, but inactive"<..: 1 2 3 1 2 3 1 2 3 1 ...
## $ value : num [1:12] 98 1 1 61 8 30 15 8 77 83 ...
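To see why the ordered factor matters, here is a small sketch on a hypothetical vector `x` that uses the same three labels as the category column. Without explicit levels, factor() sorts the levels alphabetically, which would put "No" first in the legend and in the dodged bars.

```r
# Hypothetical toy vector with the same labels as data_plot$category
x <- c("No", "Yes", "Yes, but inactive", "Yes")

# Default behaviour: levels are sorted alphabetically
default_levels <- levels(factor(x))
default_levels

# Explicit levels plus ordered = TRUE keep the intended order
x_ord <- factor(x,
                levels = c("Yes", "Yes, but inactive", "No"),
                ordered = TRUE)
levels(x_ord)
is.ordered(x_ord)

# The integer codes follow the level order, which ggplot2 uses for ordering
as.integer(x_ord)
```

ggplot2 uses the level order of a factor when it arranges legend entries and dodged bars, so setting the levels once in the data avoids having to reorder things in every plot.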
Now that we have prepared the data, we can plot it in Figure 5.6 (fairly easy!). Again, let's go through this step by step:
# Creating the plot
ggplot(data_plot, aes(x = variable, y = value)) +
  geom_bar(stat = "identity",
           width = 0.7,
           position = position_dodge(width = 0.8),
           aes(fill = factor(variable),
               alpha = category)) +
  geom_text(position = position_dodge(width = 0.8),
            aes(alpha = category,
                label = paste(value, "%", sep = "")),
            vjust = 1.6,
            color = "black",
            size = 2) +
  scale_fill_discrete(name = "Platform") +
  scale_alpha_discrete(name = "Account",
                       range = c(1, 0.5)) +
  xlab("Platforms") +
  ylab("Percentage (%)") +
  theme_light() +
  theme(axis.text.x = element_text(angle = 35, hjust = 1, vjust = 1,
                                   margin = margin(0.2, 0, 0.3, 0, "cm")),
        plot.title = element_text(hjust = 0.5),
        plot.margin = margin(0.5, 0.5, 0, 0.5, "cm"),
        panel.grid.major.x = element_blank(),
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 9))

Figure 5.6: Distribution of four categorical variables
Strictly speaking, the coloring in Figure 5.6 is not necessary, as the platforms are already encoded on the x-axis. Figure 5.6 uses four mappings (x, y, alpha/luminance, color) for three variables. Hence, we could also use a grayscale version of the graph, which you see below in Figure 5.7.[31]
# Creating the plot
ggplot(data_plot, aes(x = variable, y = value)) +
  geom_bar(stat = "identity",
           width = 0.7,
           position = position_dodge(width = 0.8),
           aes(# fill = factor(variable),
               alpha = factor(category))) +
  geom_text(position = position_dodge(width = 0.8),
            aes(alpha = factor(category),
                label = paste(value, "%", sep = "")),
            vjust = 1.6,
            color = "black",
            size = 2) +
  scale_alpha_discrete(name = "Account",
                       range = c(1, 0.5),
                       labels = c("Yes", "Yes, but inactive", "No")) +
  scale_x_discrete(labels = str_to_title(gsub("fb", "facebook",
                                              gsub("account_", "", unique(data_plot$variable))))) +
  xlab("Platforms") +
  ylab("Percentage (%)") +
  theme_light() +
  theme(axis.text.x = element_text(angle = 35, hjust = 1, vjust = 1,
                                   margin = margin(0.2, 0, 0.3, 0, "cm")),
        plot.title = element_text(hjust = 0.5),
        plot.margin = margin(0.5, 0.5, 0, 0.5, "cm"),
        panel.grid.major.x = element_blank(),
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 9))

Figure 5.7: Distribution of four categorical variables
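An alternative to mapping the categories to alpha is to map them to fill and pick a grey palette. The sketch below is one possible variant, not the code behind Figure 5.7. It uses ggplot2's scale_fill_grey() and, for brevity, only the Email and Facebook values shown in the data_plot output above. The object names `toy_plot` and `p_grey` and the start/end values are assumptions chosen for illustration.

```r
library(ggplot2)

# Email and Facebook percentages copied from the data_plot output in the lab
toy_plot <- data.frame(
  variable = rep(c("Email", "Facebook"), each = 3),
  category = factor(rep(c("Yes", "Yes, but inactive", "No"), times = 2),
                    levels = c("Yes", "Yes, but inactive", "No")),
  value    = c(98, 1, 1, 61, 8, 30)
)

p_grey <- ggplot(toy_plot, aes(x = variable, y = value, fill = category)) +
  geom_col(width = 0.7, position = position_dodge(width = 0.8)) +
  geom_text(aes(label = paste0(value, "%")),
            position = position_dodge(width = 0.8),
            vjust = -0.5, size = 2) +           # labels just above each bar
  scale_fill_grey(name = "Account",
                  start = 0.2, end = 0.8) +     # dark grey for "Yes", light for "No"
  xlab("Platforms") +
  ylab("Percentage (%)") +
  theme_light()

p_grey
```

Because fill is set in the global aes(), both the bars and the text labels are dodged by category without a separate group aesthetic.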
5.5.4 Exercise
- Load the summarized data (we’ll skip the data management steps).
# data_sharing_frequency_summarized.csv
# data_plot <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                               "16oXege9RqvtIkBppZvkQNb4gl-lmSyoU"))
data_plot <- read_csv("www/data/data_sharing_frequency_summarized.csv")
head(data_plot)
variable | category | value |
---|---|---|
Daily | 7 | |
Daily | 7 | |
Daily | 7 | |
Daily | 17 | |
Rarer | 67 | |
Rarer | 53 |
- Try to recreate Figure 5.8 using the code from Figure 5.6. There is a mistake in Figure 5.8. Can you spot it?
- How would you modify the code (data) if you wanted to show just 2 out of 4 variables or just 3 out of 5 categories?
- How could we visualize a fourth categorical variable?
- Here we used colors for differentiating. What would be an alternative way?

Figure 5.8: Distribution of four categorical variables
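As a hint for the question about showing only some variables or categories, one possible approach is to subset the long-format data before plotting. The sketch below uses the account data from the lab rather than the exercise data. The values are copied from the data_plot output shown earlier, and the object names `toy_plot` and `toy_subset` are made up for illustration.

```r
library(dplyr)

# Long-format data with the values from the lab's data_plot output
toy_plot <- tibble(
  variable = rep(c("Email", "Facebook", "Twitter", "Whatsapp"), each = 3),
  category = factor(rep(c("Yes", "Yes, but inactive", "No"), times = 4),
                    levels = c("Yes", "Yes, but inactive", "No"),
                    ordered = TRUE),
  value    = c(98, 1, 1, 61, 8, 30, 15, 8, 77, 83, 1, 16)
)

toy_subset <- toy_plot %>%
  filter(variable %in% c("Email", "Facebook"),  # keep 2 of the 4 variables
         category != "Yes, but inactive") %>%   # keep 2 of the 3 categories
  mutate(category = droplevels(category))       # drop the now-unused level

toy_subset
levels(toy_subset$category)
```

Dropping the unused level keeps the legend from listing a category that no longer appears in the plot. The same plotting code can then be applied to `toy_subset` unchanged.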
References
29. Data: The data consist of four categorical variables with the same ordered categories. In the graph they are combined, i.e., the four variables are combined into two categorical variables: Platform with unordered categories and Account with 3 ordered categories.
30. We use both the x-scale and color for the same mapping, namely platforms. This could be reduced.
31. One could also think of choosing colors that remain discernible when printed in grayscale, instead of using luminance.