5.6 Numeric vs. various variables
- Here we’ll reproduce parts of Figure 5.10 (Bauer and Clemm von Hohenberg 2020)
- Questions:
- What does the graph show? What are the underlying variables (and data)? [36]
- How many scales/mappings does it use? Could we reduce them?
- What do you like, what do you dislike about the figure? What is good, what is bad?
- What kind of information could we add to the graph (if any)?
- How would you approach a replication of the graph?

Figure 5.10: Several categorical variables
5.6.1 Lab: Data & Code
The code for a subset of Figure 5.10 is shown below (and creates Figure 5.11).
Learning objectives
- How to generate ggplot plots in loops (`aes_string`)
- How to visualize a numeric variable (Y) vs. different variables (X)
- How to create graphs conditional on loop elements depending on variable types
Let’s check out (and load) the datasets that underlie the plot first.
- `data_loop`: Contains the variable names (`variable`), labels (`label`) and types (`type`) of different covariates.
  - We’ll loop over the content of this dataframe (it’s ordered by the variable importance).
- `data_heterogeneity`: Contains covariate values across individuals, as well as predictions for each individual (these are the predictions for a causal effect).
  - This is the data that gets visualized.
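Because the loop relies on the two data frames agreeing with each other, a quick consistency check before plotting can save some debugging. The following is a minimal sketch in base R; `loop_toy` and `het_toy` are made-up stand-ins for `data_loop` and `data_heterogeneity`, and their values are invented for illustration only.

```r
# Toy stand-ins for the two data frames (names and values are hypothetical)
loop_toy <- data.frame(
  variable = c("trust", "vote_afd"),
  label    = c("Media trust", "Vote choice AfD"),
  type     = c("continuous", "categorical")
)
het_toy <- data.frame(
  predictions = c(1.2, 0.7, 0.4, 0.9),
  trust       = c(3.3, 3.0, 1.0, 2.5),
  vote_afd    = c(0, 0, 1, 1)
)

# Every variable named in the looping data frame should be a column of the data
missing_vars <- setdiff(loop_toy$variable, names(het_toy))
stopifnot(length(missing_vars) == 0)

# Every type should be one that the plotting loop knows how to handle
unknown_types <- setdiff(loop_toy$type, c("continuous", "categorical"))
stopifnot(length(unknown_types) == 0)
```

The same two checks could be run on the real data frames after loading them, by swapping in `data_loop` and `data_heterogeneity`.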
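# Packages used in this lab
library(tidyverse)  # read_csv(), dplyr verbs, ggplot2
library(gridExtra)  # grid.arrange(), arrangeGrob()
# data_loop.csv
data_loop <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
                              "1ELtshmxQWS0T8mFR1uh5IH_r57ynP8MZ"))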
data_loop
Importance | variable | label | type |
---|---|---|---|
0.3680689 | trust_source_mainstream | Mainstr. media trust | continuous |
0.1248805 | vote_choice_afd_num | Vote choice AfD | categorical |
0.0786331 | income_num | Income | categorical |
# data_treatment_heterogeneity.csv
data_heterogeneity <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
                                       "1sYLKmFi4uZsDDxNZdcwjD-eTHcVWql_X"))
head(data_heterogeneity)
predictions | trust_source_mainstream | vote_choice_afd_num | income_num |
---|---|---|---|
1.1508661 | 3.2857143 | 0 | 11 |
0.6617550 | 3.0000000 | 0 | 6 |
0.4056240 | 1.0000000 | 0 | 11 |
0.3809769 | 2.0000000 | 0 | 3 |
0.6726859 | 2.5714286 | 0 | 3 |
0.2334704 | 0.5714286 | 1 | 10 |
On the basis of `data_loop` and `data_heterogeneity` we then write a loop that cycles through the rows of `data_loop` and generates the corresponding plots.
- The things that vary are the variable name, the label and the variable type.
- There are two variable types: `continuous` and `categorical`.
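The idea of choosing the plot type from the variable type can be sketched in isolation. The example below is an illustration under assumptions, not the code used for the figure: `plot_covariate()` is a hypothetical helper, the toy data are invented, and it uses the `.data` pronoun instead of `aes_string()`, which recent ggplot2 releases mark as deprecated.

```r
library(ggplot2)

# Hypothetical helper: build one plot per covariate, chosen by variable type
plot_covariate <- function(data, var_name, var_label, var_type) {
  if (var_type == "continuous") {
    # Scatterplot for continuous covariates
    ggplot(data, aes(x = .data[[var_name]], y = .data[["predictions"]])) +
      geom_point(alpha = 0.3) +
      labs(title = var_label)
  } else {
    # Boxplots for categorical covariates (x is converted to a factor)
    ggplot(data, aes(x = factor(.data[[var_name]]), y = .data[["predictions"]])) +
      geom_boxplot() +
      labs(title = var_label)
  }
}

# Toy data (values are made up for illustration)
toy <- data.frame(
  predictions = c(1.2, 0.7, 0.4, 0.9, 0.5, 0.3),
  trust       = c(3.3, 3.0, 1.0, 2.5, 2.0, 0.6),
  vote_afd    = c(0, 0, 0, 1, 1, 1)
)

p_cont <- plot_covariate(toy, "trust", "Media trust", "continuous")
p_cat  <- plot_covariate(toy, "vote_afd", "Vote choice AfD", "categorical")
```

Wrapping the branching in a function keeps the loop body short: the loop only has to look up the name, label and type for row `i` and pass them on.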
for (i in seq_len(nrow(data_loop))) {
  # Define objects, taking them from the looping dataframe
  var_name  <- data_loop$variable[i]
  var_label <- data_loop$label[i]
  var_type  <- data_loop$type[i]

  # Create a plot letter (A, B, C, ...)
  plot_number <- LETTERS[i]

  # Rotate the x-axis labels only for the income variable
  if (var_name %in% c("income_num")) {angle <- 45} else {angle <- 0}

  # Select data for plot
  # all_of() tells select() that var_name holds a column name as a string
  data_plot <- data_heterogeneity %>% select(all_of(var_name), predictions)

  # CREATE PLOT DEPENDING ON VARIABLE TYPE
  if (var_type == "continuous") {
    # Continuous covariate: scatterplot with a loess smoother
    p <- ggplot(data_plot, aes_string(x = as.name(var_name),
                                      y = as.name("predictions"))) +
      geom_point(alpha = 3/10) +
      geom_smooth(method = "loess", span = 1, se = FALSE, colour = "gray") +
      labs(title = paste0("(", plot_number, ") ", var_label)) +
      theme_light() +
      theme(axis.text.x = element_text(size = 6, angle = angle),
            axis.title.y = element_blank(),
            axis.title.x = element_blank(),
            plot.title = element_text(size = 8))
  } else {
    # Categorical covariate: round and convert to a factor,
    # so that geom_boxplot() draws one box per category
    data_plot[[var_name]] <- factor(round(data_plot[[var_name]]))
    p <- ggplot(data_plot,
                aes_string(x = as.name(var_name),
                           y = as.name("predictions"))) +
      geom_boxplot() +
      # group = 1 fits a single smoother across all categories
      geom_smooth(method = "loess", se = FALSE, aes(group = 1), colour = "gray") +
      labs(title = paste0("(", plot_number, ") ", var_label)) +
      #scale_x_discrete(labels = labels) +
      theme_light() +
      theme(axis.text.x = element_text(size = 6, angle = angle,
                                       hjust = 1, vjust = 1),
            axis.title.y = element_blank(),
            axis.title.x = element_blank(),
            plot.title = element_text(size = 8))
  }

  # Store the plot as p1, p2, p3, ...
  assign(paste0("p", i), p)
}

# Arrange the three plots in one row with a shared y-axis label
grid.arrange(arrangeGrob(p1, p2, p3, ncol = 3),
             left = grid::textGrob("Predicted source\ntreatment effect",
                                   rot = 90, vjust = 1))

Figure 5.11: Numeric vs different variable types
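As an alternative to creating `p1`, `p2`, `p3` with `assign()`, the plots can be collected in a list and handed to `arrangeGrob()` through its `grobs` argument. The sketch below assumes the ggplot2 and gridExtra packages are installed; the data frame `toy` and the variable names `x1` to `x3` are invented for illustration.

```r
library(ggplot2)
library(gridExtra)

# Toy data (values are made up for illustration)
toy <- data.frame(
  predictions = c(1.2, 0.7, 0.4, 0.9, 0.5, 0.3),
  x1 = c(3.3, 3.0, 1.0, 2.5, 2.0, 0.6),
  x2 = c(1, 2, 3, 4, 5, 6),
  x3 = c(6, 5, 4, 3, 2, 1)
)
vars <- c("x1", "x2", "x3")

# One plot per variable, stored in a list instead of separate objects
plots <- lapply(seq_along(vars), function(i) {
  ggplot(toy, aes(x = .data[[vars[i]]], y = .data[["predictions"]])) +
    geom_point() +
    labs(title = paste0("(", LETTERS[i], ") ", vars[i]))
})

# Combine all plots in one row with a shared label on the left
panel <- arrangeGrob(grobs = plots, ncol = 3,
                     left = grid::textGrob("Predicted source\ntreatment effect",
                                           rot = 90))

# Draw the combined figure on the current graphics device
# grid::grid.draw(panel)
```

With a list, the number of panels follows the number of rows in the looping data frame, so no plot names need to be typed by hand when covariates are added or removed.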
References
Bauer, Paul C, and Bernhard Clemm von Hohenberg. 2020. “Believing and Sharing Information by Fake Sources: An Experiment.”
[36] Data: The data consist of four categorical variables with the same ordered categories. In the graph, these four variables are combined into two categorical variables: `Platform`, with unordered categories, and `Account`, with 3 ordered categories.