# Visualizing descriptive statistics

• Learning outcomes: Learn how to…
• …visualize combinations of different variables.
• …make graphs for publications.
• …use different geoms.
• …add summary statistics to plots.
• …manipulate data for a plot.
• …plot ordered variables (and order categories).
• …plot graphs next to each other.
• …search for solutions online.

Sources: Original material; Wickham (2010)

# 1 Description basics

• (Research) Question: How are observations (≈ units) distributed across values/categories of a variable (or several)?
• Objective:
• Show distributions (causality?), exploration & presentation, replace summary tables
• Data
• Uni-dimensional vs. multi-dimensional data (e.g., joint distribution)
• Time is just another dimension/variable
• Data can be aggregated (e.g., means across time)
• Usage: Mostly in methods section or in the appendix (sometimes results!)
• Note: Description is just as important as explanation
• Types of graphs
• Depend on the data types and number of dimensions/variables
• Numeric (quantitative: discrete, continuous), categorical (qualitative: nominal, ordinal)
• 1, 2, 3 etc. variables
• See decision tree here (and another one here)

# 2 Exploratory summary graphs

• Learning outcomes: Learn how to…
• …produce a quick exploratory plot for your data.
• `ggpairs()` (`GGally` package): Provides quick and dirty overview of many variables and their relationships
• Don’t plot too many variables!
• Figure 1 provides an example. The data is shown in Table 1.
• Q: What does Figure 1 show? How useful is it?

## 2.1 Lab: Data & Code

``````# data_bauer_poama.csv
#                          "1QeDCki2Auu3OZK7qMRh2xbq0u7n1Fy_8"))
library(tidyverse)
library(GGally)

# Pairwise overview of three variables (`data` must hold data_bauer_poama.csv)
ggpairs(data %>% dplyr::select(sex, age, employment_status)) + # religion,
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# You can set the cardinality threshold: Maximum number of levels allowed in a character / factor column``````
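The cardinality threshold mentioned in the last comment can be tried out with a minimal sketch. It uses the built-in `iris` data as a stand-in for the course file (an assumption, so that the example runs on its own). `cardinality_threshold` is an argument of `ggpairs()` with a default of 15; the value 5 below is chosen only for illustration.

```r
library(GGally)

# Built-in data as a stand-in for the course file (assumption)
dat <- iris[, c("Sepal.Length", "Petal.Length", "Species")]

# Species has 3 levels, so a threshold of 5 is sufficient;
# a character/factor column with more levels than the threshold makes ggpairs() stop
p <- ggpairs(dat, cardinality_threshold = 5)
p
```

Raising the threshold is rarely a good idea: a plot matrix with dozens of categories per panel is hard to read, which is one more reason not to plot too many variables at once.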

# 3 Several variables: Categorical/numerical

• Learning outcomes: Learn how to…
• …spot errors in a graph.
• …visualize several single variables.
• …make summary graph for publication.
• …add summary statistics to plots.
• …manipulate data for a plot.
• …plot ordered variables (and order categories).
• …plot graphs next to each other.
• …search for solutions online.

## 3.1 Data & Packages & functions

• Data: One categorical or numerical variable (1 dimension), several plotted next to each other
• Packages & functions:
• `geom_histogram()`: Histograms show the distribution of a single numeric/quantitative variable (frequency polygons: `geom_freqpoly()`)
• Provide a lot of information about the distribution but need more space than boxplots
• Bin the data, then count the number of observations in each bin (then show either bars or lines)
• `binwidth`: Controls the width of the bins (ALWAYS experiment with it; by default ggplot2 uses `bins = 30`, i.e., 30 bins rather than a fixed width)
• `breaks`: Set manual bin cutoffs
• `geom_bar()`: Barplots show the distribution of categorical/qualitative variables
• Recommendations
• Use consistent strategy to visualize or indicate missings
• Show missings as annotation in histograms or as category in barplots (last of ordered categories)
• Survey data: Sometimes it could make sense to be more specific and also show “refusals”, “don’t know” etc. (make the sample transparent!)
• Create the final labels in the plot data (`data_plot`) rather than inside the plotting code to avoid errors
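The points on bins and on missing values can be illustrated with a short sketch. The data frame `toy` and its columns are hypothetical (simulated values, not the course data): the histogram uses manual `breaks` instead of the default 30 bins, and the barplot shows missings as an explicit last category.

```r
library(ggplot2)

# Hypothetical survey data (not the course file): ages 18-80, some missing religion
set.seed(1)
toy <- data.frame(
  age = sample(18:80, 200, replace = TRUE),
  religion = sample(c("Christianity", "Islam", "None", NA), 200, replace = TRUE)
)

# Histogram with manual bin cutoffs instead of the default 30 bins
p_hist <- ggplot(toy, aes(x = age)) +
  geom_histogram(breaks = seq(10, 90, by = 10), fill = "gray") +
  labs(x = "Age", y = "N") +
  theme_light()

# Recode NA to an explicit "Missing" category and put it last
toy$religion_lab <- ifelse(is.na(toy$religion), "Missing", toy$religion)
toy$religion_lab <- factor(toy$religion_lab,
  levels = c("Christianity", "Islam", "None", "Missing")
)

p_bar <- ggplot(toy, aes(x = religion_lab)) +
  geom_bar(fill = "gray") +
  labs(x = "Religion", y = "N") +
  theme_light()

p_hist
p_bar
```

Because the labels are created in the data, the plotting code itself stays free of recoding logic.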

## 3.2 Graph

• Here we’ll reproduce (criticize and improve) part of Figure 2.
• Questions:
• What does the graph show? What are the underlying variables (and data)?
• How many scales/mappings does it use? Could we reduce them?
• What do you like, what do you dislike about the figure? What is good, what is bad?
• What kind of information could we add to the graph (if any)?
• How would you approach a replication of the graph?

## 3.3 Lab: Data & Code

We start by inspecting the data with `View(data)` and `str(data)`: What do we see?

``````# data_bauer_poama.csv
#                          "1QeDCki2Auu3OZK7qMRh2xbq0u7n1Fy_8"))
#                          col_types = cols())

# TIP: Use styler package to style code

# View(data)
# str(data)

# Age ####
p1 <- ggplot(
data,
aes(x = age)
) +
geom_histogram(binwidth = 2, fill = "gray") +
labs(x = "Age", y = "N") +
theme_light()

# Religion ####
# Get categories in ranked order
levels_ranked <- data %>%
dplyr::select(religion) %>%
count(religion) %>%
arrange(desc(n)) %>%
dplyr::pull(religion)
# Create factor
data$religion_fac <- factor(data$religion,
  levels = levels_ranked,
  ordered = TRUE
)

p2 <- ggplot(data, aes(x = religion_fac)) +
geom_bar(fill = "gray") +
labs(x = "Religion", y = "N") +
theme_light() +
theme(
axis.text.x = element_text(
angle = 40,
hjust = 1,
vjust = 1,
margin = margin(0.2, 0, 0.3, 0, "cm")
),
plot.title = element_text(hjust = 0.5),
plot.margin = margin(0.5, 0.5, 0, 0.5, "cm")
)

# Employment status
data$employment_status_fac <- factor(data$employment_status)
p3 <- ggplot(data, aes(x = employment_status_fac)) +
geom_bar(fill = "gray") +
labs(x = "Employment status", y = "N") +
theme_light() +
theme(
axis.text.x = element_text(
angle = 35,
hjust = 1,
vjust = 1,
margin = margin(0.2, 0, 0.3, 0, "cm")
),
plot.title = element_text(hjust = 0.5),
plot.margin = margin(0.5, 0.5, 0, 0.5, "cm")
)

library(patchwork)
p1 + p2 + p3``````

## 3.4 Exercise

• We’ll split into teams and improve Figure 3 together. Use the code of the subplot above that we assign to you (p1, p2, p3) to create one of the plots below.
1. Histogram of age (p1):
• Add additional statistics (mean, median, sd)
• Tip: Use `geom_vline(aes(xintercept = mean(...)),col='red',size=1)` to add lines.
• Use `annotate("text", label = paste("Mean:", round(mean(data$age))), ...)` to add labels.
2. Barplot of religion (p2): Reduce rare categories to one category called `Other religion`.
• Tip: Recode the religion variable: `data$religion_fac <- recode(data$religion_fac, "Sikhism" = "Other")`
3. Barplot of employment status (p3)
• Add labels with absolute (relative) number for the bars
• Search solution here: https://community.rstudio.com/t/regarding-adding-bar-labels-at-the-top-of-each-bar-in-ggplot-in-rstudio/14226/4
• Order categories (see how it’s done in the code for the `religion plot` above)
• In the end explain to others what you did to improve the plot.

Exercise solution
``````# HISTOGRAM

# Age ####
ggplot(
data,
aes(x = age)
) +
geom_histogram(binwidth = 2, fill = "gray") +
labs(x = "Age", y = "N") +
theme_light() +
geom_vline(aes(xintercept = mean(age)), col = "red", size = 1) +
geom_vline(aes(xintercept = median(age)), col = "blue", size = 1) +
annotate("text",
  label = paste(
    "Mean:", round(mean(data$age)),
    "\nMedian:", round(median(data$age)),
    "\nSD:", round(sd(data$age))
  ),
  x = 60, y = 60,
  size = 3,
  colour = "black",
  hjust = 0.5,
  vjust = 0.5
)

# Religion

levels_ranked <- data %>%
dplyr::select(religion) %>%
count(religion) %>%
arrange(desc(n)) %>%
dplyr::pull(religion)
# Create factor
data$religion_fac <- factor(data$religion,
  levels = levels_ranked
)

data$religion_fac <- recode(data$religion_fac,
  "Sikhism" = "Other",
  "Jainism" = "Other",
  "Hinduism" = "Other",
  "Paganism" = "Other",
  "Spiritualism" = "Other"
)

ggplot(data, aes(x = religion_fac)) +
geom_bar(fill = "gray") +
labs(x = "Religion", y = "N") +
theme_light() +
theme(
axis.text.x = element_text(
angle = 40,
hjust = 1,
vjust = 1,
margin = margin(0.2, 0, 0.3, 0, "cm")
),
plot.title = element_text(hjust = 0.5),
plot.margin = margin(0.5, 0.5, 0, 0.5, "cm")
)

# Employment
# Get ranked categories
levels_ranked <- data %>%
dplyr::select(employment_status) %>%
count(employment_status) %>%
arrange(desc(n)) %>%
dplyr::pull(employment_status)
# Create factor
data$employment_status_fac <- factor(data$employment_status,
  levels = levels_ranked
)

ggplot(data, aes(x = employment_status_fac)) +
geom_bar(fill = "gray") +
labs(x = "Employment status", y = "N") +
theme_light() +
theme(
axis.text.x = element_text(
angle = 35,
hjust = 1,
vjust = 1,
margin = margin(0.2, 0, 0.3, 0, "cm")
),
plot.title = element_text(hjust = 0.5),
plot.margin = margin(0.5, 0.5, 0, 0.5, "cm")
) +
geom_text(stat = "count", aes(label = round(after_stat(count) / sum(after_stat(count)), 2)), vjust = -0.5)``````
Another solution
• Another solution using `geom_table_npc()`.
``````library(ggpmisc)

# data_bauer_poama.csv
#                          "1QeDCki2Auu3OZK7qMRh2xbq0u7n1Fy_8"))
#                          col_types = cols())

summ <- data %>%
summarize(Mean = mean(age),
Median = median(age),
SD = sd(age),
N = n(),
Missing = sum(is.na(age)))

ggplot(data,  aes(x=age)) +
geom_histogram(binwidth=2, fill = "gray") +
labs(x="Age", y = "N")+
theme_light() +
geom_table_npc(data = summ, label = list(summ), npcx = 0.42, npcy = 1, hjust = 0, vjust = 1) +
geom_vline(aes(xintercept = mean(age)), col='red', size=2) +
geom_vline(aes(xintercept = median(age)), col='blue', size=2)``````

## 3.5 Exercise: Highlighting things to focus attention

• Below are two quick examples of how we could highlight certain facts in histograms or barplots. In practice, it might be helpful to add annotations in the graph. Please use the code below, but this time…
• …highlight in the histogram in Figure 4 that only a small percentage is between 20 and 25 years old
• …highlight in Figure 5 that Islam, Buddhism and Judaism represent only a small share in the sample.
``````ggplot(
data,
aes(x = age)
) +
geom_histogram(binwidth = 2, fill = "gray") +
geom_histogram(data = data %>% filter(age>55),
aes(x = age),
binwidth = 2, fill = "black") +
labs(x = "Age", y = "N") +
theme_light() +
labs(title = "Only a small percentage in the sample is over 55.")``````
``````levels_ranked <- data %>%
dplyr::select(religion) %>%
count(religion) %>%
arrange(desc(n)) %>%
dplyr::pull(religion)
# Create factor
data$religion_fac <- factor(data$religion,
levels = levels_ranked
)

data$religion_fac <- recode(data$religion_fac,
"Sikhism" = "Other",
"Jainism" = "Other",
"Hinduism" = "Other",
"Paganism" = "Other",
"Spiritualism" = "Other"
)

ggplot(data, aes(x = religion_fac)) +
geom_bar(fill = "gray") +
geom_bar(data = data %>% filter(religion == "Christianity"),
aes(x = religion_fac),
fill = "red") +
labs(x = "Religion", y = "N") +
theme_light() +
theme(
axis.text.x = element_text(
angle = 40,
hjust = 1,
vjust = 1,
margin = margin(0.2, 0, 0.3, 0, "cm")
),
plot.title = element_text(color = "red"),
plot.margin = margin(0.5, 0.5, 0, 0.5, "cm")
) +
labs(title = "Christians are only the second biggest group in the sample.")``````
Exercise solution
``````ggplot(
data,
aes(x = age)
) +
geom_histogram(binwidth = 2, fill = "gray") +
geom_histogram(data = data %>% filter(age > 20, age < 25),
aes(x = age),
binwidth = 2, fill = "black") + # same binwidth so both layers share the same bins
labs(x = "Age", y = "N") +
theme_light() +
labs(title = "Only a small percentage in the sample is between 20 and 25.")``````
``````levels_ranked <- data %>%
dplyr::select(religion) %>%
count(religion) %>%
arrange(desc(n)) %>%
dplyr::pull(religion)
# Create factor
data$religion_fac <- factor(data$religion,
levels = levels_ranked
)

data$religion_fac <- recode(data$religion_fac,
"Sikhism" = "Other",
"Jainism" = "Other",
"Hinduism" = "Other",
"Paganism" = "Other",
"Spiritualism" = "Other"
)

ggplot(data, aes(x = religion_fac)) +
geom_bar(fill = "gray") +
geom_bar(data = data %>%
filter(religion == "Islam"|
religion == "Buddhism"|
religion == "Judaism"),
aes(x = religion_fac),
fill = "red") +
labs(x = "Religion", y = "N") +
theme_light() +
theme(
axis.text.x = element_text(
angle = 40,
hjust = 1,
vjust = 1,
margin = margin(0.2, 0, 0.3, 0, "cm")
),
plot.title = element_text(color = "red"),
plot.margin = margin(0.5, 0.5, 0, 0.5, "cm")
) +
labs(title = "Islam, Buddhism and Judaism only make up a small percentage of the sample.")``````
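An alternative to stacking a second, filtered layer on top of the first is to flag the categories in the data and map the flag to `fill`. The following is a minimal sketch of that idea on hypothetical data (`toy`, with invented counts), not the approach used in the course solution.

```r
library(ggplot2)

# Hypothetical respondents (not the course data)
toy <- data.frame(
  religion = rep(c("None", "Christianity", "Islam", "Buddhism"),
                 times = c(50, 40, 6, 4))
)
# Order categories by frequency
toy$religion <- factor(toy$religion,
  levels = names(sort(table(toy$religion), decreasing = TRUE))
)
# Flag the categories to highlight
toy$highlight <- toy$religion %in% c("Islam", "Buddhism")

p <- ggplot(toy, aes(x = religion, fill = highlight)) +
  geom_bar() +
  scale_fill_manual(values = c("FALSE" = "gray", "TRUE" = "red"), guide = "none") +
  labs(x = "Religion", y = "N",
       title = "Islam and Buddhism make up a small share of the sample.") +
  theme_light()
p
```

With a single layer there is no risk that the highlighted bars and the background bars are computed from different bins or factor levels.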

# 4 Non-aggregated vs. aggregated data: Barplot example

• Learning outcomes: Learn…
• …the difference between plotting aggregated and non-aggregated data.
• …logic behind ordering scales.

## 4.1 Data & Packages & functions

• Data: One categorical variable
• Challenge: Either feed original raw data or summarized/processed data to ggplot
• Packages & functions:
• `geom_bar()`: Expects unsummarised data (each observation contributes one unit to the height of each bar)
• `geom_bar(stat = "identity")`: Tells `geom_bar()` not to aggregate/summarize the data (equivalent to `geom_col()`)
• `factor(party, ordered = TRUE, levels = c(...))`: Convert variable to ordered factor
• `str(data)`: Check data types + `levels(data$party)`: Check levels and ordering
• `dput(unique(data$party))`: Quickly extract categories from character vector to reorder
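The difference between feeding raw and summarized data can be checked on a tiny example. The data frame `raw` below is hypothetical (six invented accounts); the sketch assumes that comparing the built layer data via `layer_data()` is an acceptable way to verify that both variants draw the same bars.

```r
library(ggplot2)
library(dplyr)

# Hypothetical raw data: one row per account (not the course file)
raw <- data.frame(party = c("SPD", "SPD", "AfD", "Greens", "Greens", "Greens"))

# An ordered factor fixes the order on the x-axis
raw$party <- factor(raw$party, levels = c("Greens", "SPD", "AfD"), ordered = TRUE)

# Variant 1: geom_bar() counts the rows itself
p_raw <- ggplot(raw, aes(x = party)) + geom_bar()

# Variant 2: count first, then plot the counts as they are
agg <- count(raw, party)
p_agg <- ggplot(agg, aes(x = party, y = n)) + geom_col()

# Both plots have the same bar heights
layer_data(p_raw)$count
layer_data(p_agg)$y
```

Summarizing first is often preferable for publication graphs because the plotted numbers can be inspected (and labelled) before they reach ggplot.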

## 4.2 Graph

• First, we keep the code/graph simple (no ordering, labels etc.)…
``````# data_twitter_influence.csv
#                       col_types = cols())

p1 <- ggplot(data, aes(x = party)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Or summarize the data first:
data_plot <- data %>%
group_by(party) %>%
summarize(n = n()) %>% ungroup()
p2 <- ggplot(data_plot, aes(x = party, y = n)) +
geom_bar(stat ="identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

p1 + p2 +
plot_layout(ncol = 2)``````

Now, let’s reorder the party variable according to ideology, i.e., with DieLinke as the left-most party and AfD as the right-most party. This can be done by converting the corresponding variable to an ordered factor.

``````# data_twitter_influence.csv
#                       col_types = cols())  %>%
# select(party)

data <- data %>%
mutate(party = factor(party, ordered = TRUE,
levels = c("DieLinke", "Greens", "SPD", "FDP", "CDU_CSU", "AfD")))
# str(data)
# levels(data$party)

p1 <- ggplot(data, aes(x = party)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Or summarize the data first:
data_plot <- data %>%
group_by(party) %>%
summarize(n = n()) %>% ungroup()

p2 <- ggplot(data_plot, aes(x = party, y = n)) +
geom_bar(stat ="identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

p1 + p2 +
plot_layout(ncol = 2)``````

# 5 Further examples

## 5.1 Categorical variables (2+)

• Learning outcomes: Learn…
• …to manipulate data first and visualize it thereafter
• …how to use `pivot_longer` (ggplot likes long format!)
• …how to visualize unordered and ordered variables
• …how to name scales and create manual ones
• …how to create labels from data
• Make sure the only thing that you might need to add for the labels is `gsub`, i.e., ideally no substantive changing of label names
• …how to size text elements

### 5.1.1 Data & Packages & functions

• Data: Two or several categorical variables
• Challenge: We need to summarize the data to obtain frequencies (absolute or relative)
• Create long format dataframe that contains the frequencies of different category combinations
• Packages & functions:
• `tidyr` and `pivot_longer()` function
• `dplyr` and functions such as `summarize()`, `mutate()` etc.
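The summarizing step can be sketched generically. The example below assumes a hypothetical wide data frame `wide` with two invented account variables; instead of hard-coding each category, it counts all variable/category combinations and converts them to percentages within each variable.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide survey data (stand-in for the course file)
wide <- tibble(
  account_email   = c("Yes", "Yes", "Yes", "No"),
  account_twitter = c("No", "Yes", "No", "No")
)

freq <- wide %>%
  # One row per respondent and variable
  pivot_longer(cols = everything(), names_to = "variable", values_to = "category") %>%
  # Absolute frequency of each variable/category combination
  count(variable, category) %>%
  # Relative frequency within each variable
  group_by(variable) %>%
  mutate(pct = 100 * n / sum(n)) %>%
  ungroup()

freq
```

The result is in long format (one row per variable/category combination), which is the shape ggplot2 expects for dodged or stacked bars.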

### 5.1.2 Graph

• Here we’ll reproduce and maybe criticize as well as improve Figure 8
• Questions:
• What does the graph show? What are the underlying variables (and data)?
• How many scales/mappings does it use? Could we reduce them?
• What do you like, what do you dislike about the figure? What is good, what is bad?
• What kind of information could we add to the graph (if any)?
• How would you approach a replication of the graph?

### 5.1.3 Lab: Data & Code

We start by importing the original (unsummarised) data. As you can see below, we have categorical string variables.

``````# data_account_prevalence.csv
#                         "1xbffXai-HqWS2Q17KoB_MCCY7YOpskyH"))
#                         col_types = cols())``````

| account_email | account_fb | account_twitter | account_whatsapp |
|---|---|---|---|
| Yes | Yes | No | Yes |
| Yes | Yes | No | Yes |
| Yes | Yes, but I dont use it | No | Yes |
| Yes | Yes | No | Yes |
| Yes | No | No | Yes |
| Yes | Yes | Yes | Yes |

Subsequently, we summarize/aggregate the data, producing a dataframe that contains the percentage of people in each category (`Yes`, `Yes, but inactive`, `No`) across the four variables. It’s a good idea to call the data that forms the basis for our plot `data_plot` so as to keep the original dataset `data` intact. Let’s go through the code below step by step:

``````# Creating plot data
data_plot <- data %>%

pivot_longer(cols = account_email:account_whatsapp, # formerly gather
names_to = "variable",
values_to = "value") %>%

group_by(variable) %>%

summarize(pct.Yes = mean(value == "Yes", na.rm=TRUE),
pct.Inactive = mean(value == "Yes, but I dont use it", na.rm=TRUE),
pct.No = mean(value == "No", na.rm=TRUE)) %>%

pivot_longer(cols = pct.Yes:pct.No, # formerly gather
names_to = "category",
values_to = "value") %>% # only keep variables of interest

mutate(category = factor(category, # Create factor for ordering
levels = c("pct.Yes", "pct.Inactive", "pct.No"),
ordered = TRUE)) %>%

mutate(value = round(value,2)) %>%
mutate(value = 100 * value)

# Change the labels and translate to English (so we can directly pull them out later)
data_plot$category <- gsub("Inactive", "Yes, but inactive", data_plot$category)
data_plot$category <- gsub("pct.", "", data_plot$category, fixed = TRUE)

data_plot$category <- factor(data_plot$category,
  levels = c("Yes", "Yes, but inactive", "No"),
  ordered = TRUE)
data_plot$variable <- factor(data_plot$variable)

data_plot$variable <- str_to_title(gsub("fb", "facebook", gsub("account_", "",
  data_plot$variable)))
data_plot``````
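| variable | category | value |
|---|---|---|
| Email | Yes | 98 |
| Email | Yes, but inactive | 1 |
| Email | No | 1 |
| Facebook | Yes | 61 |
| Facebook | Yes, but inactive | 8 |
| Facebook | No | 30 |
| Twitter | Yes | 15 |
| Twitter | Yes, but inactive | 8 |
| Twitter | No | 77 |
| Whatsapp | Yes | 83 |
| Whatsapp | Yes, but inactive | 1 |
| Whatsapp | No | 16 |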

Check out the variables in the data. Importantly, there is an ordered factor in there:

``str(data_plot)``
``````tibble [12 × 3] (S3: tbl_df/tbl/data.frame)
 $ variable: chr [1:12] "Email" "Email" "Email" "Facebook" ...
 $ category: Ord.factor w/ 3 levels "Yes"<"Yes, but inactive"<..: 1 2 3 1 2 3 1 2 3 1 ...
 $ value   : num [1:12] 98 1 1 61 8 30 15 8 77 83 ...``````

Now that we have prepared the data, we can plot it in Figure 9 (fairly easy!). Again, let’s go through this step by step:

``````  # CHECK
# Creating the plot
ggplot(data_plot,aes(x = variable, y = value))+
geom_bar(stat="identity",
width=0.7,
position = position_dodge(width=0.8),
aes(fill = factor(variable),
alpha=category)) +
geom_text(position = position_dodge(width=0.8),
aes(alpha=category,
label = paste(value,"%", sep="")),
vjust=1.6,
color="black",
size=2) +

scale_fill_discrete(name="Platform") +

scale_alpha_discrete(name="Account",
range=c(1, 0.5)) +
xlab("Platforms")+
ylab("Percentage (%)")+
theme_light() +
theme(axis.text.x=element_text(angle=35,hjust=1,vjust=1, margin=margin(0.2,0,0.3,0,"cm")),
plot.title = element_text(hjust = 0.5),
plot.margin = margin(0.5, 0.5, 0, 0.5, "cm"),
panel.grid.major.x = element_blank(),
legend.title = element_text(size=10),
legend.text = element_text(size=9))``````

Strictly speaking, the coloring in Figure 9 is not necessary because the platforms are already encoded on the x-axis. Figure 9 uses four mappings (x, y, alpha/luminance, color) for three variables. Hence, we could also use a grayscale version of the graph, which you see below in Figure 10.

``````# Creating the plot
ggplot(data_plot,aes(x = variable, y = value))+
geom_bar(stat="identity",
width=0.7,
position = position_dodge(width=0.8),
aes(#fill = factor(variable),
alpha=factor(category))) +
geom_text(position = position_dodge(width=0.8),
aes(alpha=factor(category),
label = paste(value,"%", sep="")),
vjust=1.6,
color="black",
size=2) +

scale_alpha_discrete(name="Account",
range=c(1, 0.5),
labels=c("Yes", "Yes, but inactive", "No")) +

xlab("Platforms")+
ylab("Percentage (%)")+
theme_light() +
theme(axis.text.x=element_text(angle=35,hjust=1,vjust=1, margin=margin(0.2,0,0.3,0,"cm")),
plot.title = element_text(hjust = 0.5),
plot.margin = margin(0.5, 0.5, 0, 0.5, "cm"),
panel.grid.major.x = element_blank(),
legend.title = element_text(size=10),
legend.text = element_text(size=9))``````

### 5.1.4 Exercise

1. Load the summarized data (we’ll skip the data management steps).
``````# data_sharing_frequency_summarized.csv
#"16oXege9RqvtIkBppZvkQNb4gl-lmSyoU"))``````

| variable | category | value |
|---|---|---|
| Email | Daily | 7 |
| Whatsapp | Daily | 17 |
| Email | Rarer | 67 |

2. Try to recreate Figure 11 using the code from Figure 9. There is a mistake in Figure 11. Can you spot it?
3. How would we modify the code (data) if we want to show just 2 out of 4 variables or just 3 out of 5 categories?
4. How could we visualize a fourth categorical variable?
5. Here we used colors for differentiating. What would be an alternative way?
Exercise solution
``````data_plot$category <- factor(data_plot$category,
  levels = c("Daily", "Once a week", "A few times a month", "A few times a year",
             "Rarer"),
  ordered = TRUE)

ggplot(data_plot, aes(x = variable, y = value)) +
geom_bar(
stat = "identity",
width = 0.7,
position = position_dodge(width = 0.8),
aes(
fill = factor(variable),
alpha = factor(category)
)
) +
geom_text(
position = position_dodge(width = 0.8),
aes(
alpha = factor(category),
label = paste(value, "%", sep = "")
),
vjust = 1.6,
color = "black",
size = 2
) +
scale_fill_discrete(name = "Platform") +
scale_alpha_discrete(name = "Sharing frequency") +
xlab("Platforms") +
ylab("Percentage (%)") +
theme_light() +
theme(
axis.text.x = element_text(angle = 35, hjust = 1, vjust = 1, margin = margin(0.2, 0, 0.3, 0, "cm")),
plot.title = element_text(hjust = 0.5),
plot.margin = margin(0.5, 0.5, 0, 0.5, "cm"),
panel.grid.major.x = element_blank(),
legend.title = element_text(size = 10),
legend.text = element_text(size = 9)
)``````

## 5.2 Numeric vs. categorical: Various plot types

• Learning outcomes: Learn…
• …about plot types to visualize numeric vs. categorical.

### 5.2.1 Data & Packages & functions

• Data: 1 categorical variable, 1 numeric variable
• Packages & functions:
• `geom_jitter()` offers the same control over aesthetics as `geom_point()` (size, color, shape)
• `geom_boxplot()`, `geom_violin()`: You can control the outline color or the internal fill color
• Strengths and weaknesses
• Boxplots summarize distribution with five numbers (minimum, first quartile, median, third quartile, and maximum)
• Jittered plots show every point but only work with relatively small datasets
• Violin plots give the richest display, but rely on the calculation of a density estimate, which can be hard to interpret (see here)
• Important: A combination of all three might be nice
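A small caveat on the five numbers: in ggplot2 the whiskers of `geom_boxplot()` by default reach only to the most extreme observation within 1.5 times the interquartile range of the box, and more extreme points are drawn individually. The sketch below uses hypothetical values and `layer_data()` to inspect the statistics that are actually drawn.

```r
library(ggplot2)

# Hypothetical values with one extreme observation
toy <- data.frame(group = "a", value = c(1:9, 30))

p <- ggplot(toy, aes(x = group, y = value)) + geom_boxplot()

# Statistics that geom_boxplot() actually draws
stats <- layer_data(p)
stats[, c("ymin", "lower", "middle", "upper", "ymax")]

# Observations beyond the whiskers are stored (and drawn) separately
stats$outliers[[1]]
```

Here the upper whisker ends at 9 rather than at the maximum of 30, which appears as a separate point.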

### 5.2.2 Graph

• Figure 12 visualizes different ways of plotting a categorical vs. a numerical variable.
• Questions:
• What does the graph show? What are the underlying variables (and data)?
• How many scales/mappings does it use? Could we reduce them?
• What do you like, what do you dislike about the figure? What is good, what is bad?
• What kind of information could we add to the graph (if any)?
• How would you approach a replication of the graph?

### 5.2.3 Lab: Data & Code

``````# data_twitter_influence.csv
#                         col_types = cols())
data_plot <- data

p1 <- ggplot(data_plot, aes(x = party, y = account_age_years)) + geom_point()+
theme(axis.text.x = element_text(angle = 30, hjust = 1))
p2 <- ggplot(data_plot, aes(x = party, y = account_age_years)) + geom_jitter()+
theme(axis.text.x = element_text(angle = 30, hjust = 1))
p3 <- ggplot(data_plot, aes(x = party, y = account_age_years)) + geom_boxplot()+
theme(axis.text.x = element_text(angle = 30, hjust = 1))
p4 <- ggplot(data_plot, aes(x = party, y = account_age_years)) + geom_violin()+
theme(axis.text.x = element_text(angle = 30, hjust = 1))
p1 + p2 + p3 + p4 +
plot_layout(ncol = 2)``````

### 5.2.4 Combining boxplot, violinplot and jitter

``````# data_twitter_influence.csv
#                         col_types = cols())
data_plot <- data
ggplot(data_plot, aes(x = party, y = account_age_years)) +
  geom_violin(alpha = 0.5, width = 1, fill = "lightblue") +
  geom_boxplot(width = 0.25, fatten = 3) + # narrow boxes inside the violins
  geom_jitter(color = "black", size = 2, alpha = 0.3, width = 0.2) +
  theme_classic() + # complete theme first so it does not reset the axis-text tweak
  theme(axis.text.x = element_text(angle = 30, hjust = 1))``````

## 5.3 Numeric vs. numeric: Correlograms

• Learning outcomes: Learn…
• …about correlograms to summarize bivariate relations between many variables.

### 5.3.1 Data & Packages & functions

• Data: Several numeric variables
• Correlogram: Visualizes correlations between continuous variables present in the same dataframe
• Package: `ggcorrplot` (github)

### 5.3.2 Graph

• Figure 13 visualizes the correlation matrix in Table 2 for the dataframe `mtcars`.
• Keep in mind that this works for numeric variables only.

### 5.3.3 Lab: Data & Code

``````library(ggcorrplot)

correlation_matrix <- round(cor(mtcars), 1)
head(correlation_matrix[, 1:6]) # Show part of matrix

# Plot
ggcorrplot(correlation_matrix,
hc.order = TRUE,
type = "lower",
lab = TRUE)``````
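To see what a correlogram does under the hood, the same matrix can be reshaped to long format and drawn with plain ggplot2. This is a rough sketch of the idea, not a reimplementation of `ggcorrplot()`: there is no clustering of variables and the full matrix is shown; the column names `var1`, `var2` and `r` are chosen here for illustration.

```r
library(ggplot2)

# Correlation matrix of the built-in mtcars data, reshaped to long format
correlation_matrix <- round(cor(mtcars), 1)
cor_long <- as.data.frame(as.table(correlation_matrix))
names(cor_long) <- c("var1", "var2", "r")

p <- ggplot(cor_long, aes(x = var1, y = var2, fill = r)) +
  geom_tile(color = "white") +
  geom_text(aes(label = r), size = 2) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1), name = "Correlation") +
  theme_light() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title = element_blank())
p
```

Each tile encodes one pairwise correlation, so the plot uses three mappings (x, y, fill) plus the text label.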

## 5.4 Numeric vs. numeric: Scatterplots + smoother

• Learning outcomes: Learn…
• …how to visualize a classic scatterplot with lines for a model and a smoother.

### 5.4.1 Data & Packages & functions

• Data: Several numeric variables
• `geom_smooth()`: Adds smoother
• `geom_smooth(se = FALSE)`: Hides the confidence interval around the smoother (it is shown by default, `se = TRUE`)
• `method = "loess"`
• Default for small n; uses a smooth local regression (as described in `?loess`)
• Wiggliness of the line is controlled by the span parameter, which ranges from 0 (exceedingly wiggly) to 1 (not so wiggly)
• If n ≥ 1000, an alternative smoothing algorithm (`mgcv::gam()`) is used
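The effect of `span` and `se` can be explored with the built-in `mtcars` data (used here only as a convenient stand-in). The span values 0.5 and 1 are arbitrary choices for illustration.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# Smaller span: wigglier line that follows local patterns
p_wiggly <- base +
  geom_smooth(method = "loess", formula = y ~ x, span = 0.5, se = FALSE)

# Larger span: smoother line
p_smooth <- base +
  geom_smooth(method = "loess", formula = y ~ x, span = 1, se = FALSE)

# Linear model with confidence band for comparison
p_lm <- base +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE)

p_wiggly
p_smooth
p_lm
```

Stating `method` and `formula` explicitly also silences the message that `geom_smooth()` otherwise prints about its automatic choice.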

### 5.4.2 Graph

• Figure 14 and Figure 15 provide two examples:
• Questions:
• What does the graph show? What are the underlying variables (and data)?
• How many scales/mappings does it use? Could we reduce them?
• What do you like, what do you dislike about the figure? What is good, what is bad?
• What kind of information could we add to the graph (if any)?
• How would you approach a replication of the graph?

### 5.4.3 Lab: Data & Code

``````# data_twitter_influence.csv
#                         col_types = cols())

ggplot(data %>% filter(followers_count<50000),
aes(x = account_age_years,
y = followers_count)) +
geom_point(alpha =0.5) +
facet_wrap(~party) +
ylab("Number of followers") +
xlab("Account age (in years)") +
scale_x_continuous(breaks = c(0, 5, 10), limits = c(0,10)) +
geom_smooth(method=lm,  color = "black", fill="lightgray") +
geom_smooth(span =  0.3) +
theme_light()``````

``````ggplot(data %>% filter(followers_count<50000),
aes(x = account_age_years,
y = followers_count,
color = factor(party))) +
geom_point(alpha =0.5) +
#facet_wrap(~party) +
ylab("Number of followers") +
xlab("Account age (in years)") +
scale_x_continuous(breaks = c(0, 5, 10), limits = c(0,10)) +
geom_smooth(method=lm,  aes(fill=party, color=party)) +
theme_light()``````

## 5.5 Numeric vs. various variables

• Learning outcomes: Learn…
• …how to generate ggplot plots in loops (`aes_string`)
• …how to visualize a numeric variable (Y) vs. different variables (X)
• …how to create graphs conditional on loop elements depending on variable types
• …use elements in ggplot2 object and assign loop objects globally.

### 5.5.1 Graph

• Here we’ll reproduce parts of Figure 16
• Questions:
• What does the graph show? What are the underlying variables (and data)?
• How many scales/mappings does it use? Could we reduce them?
• What do you like, what do you dislike about the figure? What is good, what is bad?
• What kind of information could we add to the graph (if any)?
• How would you approach a replication of the graph?

### 5.5.2 Lab: Data & Code

Let’s check out (and load) the datasets that underlie the plot first.

• `data_loop`: Contains the variable names (`variable`), labels (`label`) and types (`type`) of different covariates.
• We’ll loop over the content of this dataframe (it’s ordered by the variable importance)
• `data_heterogeneity`: Contains covariate values across individuals, as well as predictions for each individual (these are the predictions for a causal effect)
• This is the data that is visualized.
``````# data_loop.csv
#                               "1ELtshmxQWS0T8mFR1uh5IH_r57ynP8MZ"))
kable(data_loop)``````

| Importance | variable | label | type |
|---|---|---|---|
| 0.3680689 | trust_source_mainstream | Mainstr. media trust | continuous |
| 0.1248805 | vote_choice_afd_num | Vote choice AfD | categorical |
| 0.0786331 | income_num | Income | categorical |

``````# data_treatment_heterogeneity.csv
#                                             "1sYLKmFi4uZsDDxNZdcwjD-eTHcVWql_X"))``````

| predictions | trust_source_mainstream | vote_choice_afd_num | income_num |
|---|---|---|---|
| 1.1508661 | 3.2857143 | 0 | 11 |
| 0.6617550 | 3.0000000 | 0 | 6 |
| 0.4056240 | 1.0000000 | 0 | 11 |
| 0.3809769 | 2.0000000 | 0 | 3 |
| 0.6726859 | 2.5714286 | 0 | 3 |
| 0.2334704 | 0.5714286 | 1 | 10 |

On the basis of `data_loop` and `data_heterogeneity` we then write a loop that cycles through the rows of `data_loop` and generates the corresponding plots (the code could be rewritten to directly check the class of those variables).

• The things that vary are the variable name, the label and the variable type.
• There are two variable types: `continuous` and `categorical`.
``````for(i in 1:nrow(data_loop)){
#print(i)

# Try this out (understand the loop) with i <- 1

# Define objects taking them from the looping dataframe
var_name <- data_loop$variable[i]
var_label <- data_loop$label[i]
var_type <- data_loop$type[i]

# Create a plot number
plot_number <- LETTERS[seq(from = 1, to = nrow(data_loop))][i]

# Define angle conditionally
if (var_name %in% c("income_num")){angle <- 45}else{angle <- 0}

# Select data for plot
data_plot <- data_heterogeneity %>% select(all_of(var_name), predictions)
# all_of() selects the column whose name is stored in var_name

# CREATE PLOT DEPENDING ON VARIABLE TYPE
if(var_type == "continuous") { # Continuous variable

p <- ggplot(data_plot, aes_string(x = as.name(var_name),
y = as.name("predictions"))) +
geom_point(alpha = 3/10) +
geom_smooth(method = "loess", span = 1, se=F, colour="gray") +
labs(title = paste0("(", plot_number, ") ", var_label)) +
theme_light() +
theme(axis.text.x = element_text(size = 6, angle = angle),
axis.title.x = element_blank(),
plot.title = element_text(size = 8))

} else { # Categorical variable

# Convert from tibble
data_plot[,var_name] <- factor(round(data_plot[,var_name])%>% dplyr::pull(1))

p <- ggplot(data_plot,
aes_string(x = as.name(var_name),
y = as.name("predictions"))) +
geom_boxplot() +
geom_smooth(method = "loess", se=FALSE, aes(group=1), colour="gray") +
labs(title = paste0("(", plot_number, ") ", var_label)) +
theme_light() +
theme(axis.text.x = element_text(size = 6, angle = angle,
hjust = 1, vjust = 1),
#axis.title.y = element_blank(),
axis.title.x = element_blank(),
plot.title = element_text(size = 8))
}
assign(paste("p", i, sep=""), p) # Create object
}

p1$labels$y <- "Predicted source\ntreatment effect" # Q:?
p2$labels$y <- p3$labels$y <- " " # Q:?

p1 + p2 + p3 + plot_layout(ncol = 3)``````
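The loop above relies on `aes_string()` (deprecated in recent ggplot2 versions) and on `assign()` to create `p1`, `p2`, … in the global environment. A possible alternative, sketched here on the built-in `mtcars` data with a hypothetical lookup table `vars` and helper `make_plot()`, uses the `.data` pronoun and collects the plots in a named list. Wrapping the plot in a function matters: `aes()` is evaluated lazily, so inside a plain `for` loop every plot would end up using the last value of the loop variable.

```r
library(ggplot2)

# Hypothetical lookup table in the spirit of data_loop
vars <- data.frame(
  variable = c("wt", "cyl", "carb"),
  label    = c("Weight", "Cylinders", "Carburetors"),
  type     = c("continuous", "categorical", "categorical")
)

# Hypothetical helper: one plot per row of the lookup table
make_plot <- function(var_name, label, type, letter) {
  p <- if (type == "continuous") {
    # .data[[...]] looks up the column named by the string
    ggplot(mtcars, aes(x = .data[[var_name]], y = mpg)) +
      geom_point(alpha = 0.3)
  } else {
    ggplot(mtcars, aes(x = factor(.data[[var_name]]), y = mpg)) +
      geom_boxplot()
  }
  p + labs(title = paste0("(", letter, ") ", label), x = NULL) + theme_light()
}

# Named list of plots; the names come from the variable column
plots <- Map(make_plot, vars$variable, vars$label, vars$type,
             LETTERS[seq_len(nrow(vars))])
names(plots)
```

The list can then be combined in one step, for example with `patchwork::wrap_plots(plots, ncol = 3)`.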

## 5.6 Time: Line charts & events

• Learning outcomes: Learn…
• …how to plot dates
• …how to make line plots
• …how to create manual legends for various elements
• …how to visualize events & data collection periods

### 5.6.1 Data & Packages & functions

• Data: 1+ Numeric variables vs. time variable
• Line and path plots typically used for time series data (see Appendix [Line vs. path plots])
• Time dimension is shown on the x-axis from left to right
• Line plots (`geom_line()`): join the points from left to right
• Have time on the x-axis, showing how a single variable has changed over time
• Path plots (`geom_path()`): join them in the order that they appear in the dataset (in other words, a line plot is a path plot of the data sorted by x value)
• Below we’ll also use `gtrends()` from the `gtrendsR` package to obtain search frequencies.
• And we’ll use `pivot_wider()` from the `tidyr` package as well as `as.Date()` for conversion to a date variable (see here for formats)
• `new_scale_color()`: Can be used to reset a scale if we want to generate several legends (`ggnewscale` package)
• And we’ll use scale modification to show proper legends in line plots
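The difference between line and path plots, and the date conversion, can be seen in a tiny sketch. The data frame `toy` is hypothetical, with dates deliberately stored as text in day.month.year format and out of chronological order.

```r
library(ggplot2)

# Hypothetical weekly values, deliberately not in chronological order
toy <- data.frame(
  date_chr = c("16.04.2018", "02.04.2018", "23.04.2018", "09.04.2018"),
  hits = c(40, 10, 25, 30)
)
# Convert the character column to class Date (the format string must match the input)
toy$date <- as.Date(toy$date_chr, format = "%d.%m.%Y")

# geom_line() connects the points from left to right along the x-axis
p_line <- ggplot(toy, aes(x = date, y = hits)) + geom_line()

# geom_path() connects the points in row order, so the line zigzags here
p_path <- ggplot(toy, aes(x = date, y = hits)) + geom_path()

p_line
p_path
```

Sorting `toy` by `date` before plotting would make both plots identical.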

### 5.6.2 Graph

• Here we’ll reproduce Figure 18 (but with ggplot2) (see also Bauer & Clemm von Hohenberg 2022)
• Questions:
• What does the graph show? What are the underlying variables (and data)?
• How many scales/mappings does it use? Could we reduce them?
• What do you like, what do you dislike about the figure? What is good, what is bad?
• What kind of information could we add to the graph (if any)?
• How would you approach a replication of the graph?

### 5.6.3 Lab: Data & code

We’ll start by preparing the data.

1. We download data from Google on Google Searches. Table 3 shows the first few rows. Currently, gtrendsR access to Google is buggy (see here for error code 429). Hence, data below are loaded locally.
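``````library(tidyverse) # dplyr, tidyr, stringr
library(gtrendsR)
# Words to search for
search.words <- c("GDPR", "DSGVO")

# Live query (does not work anymore), kept for reference
# google.trends <- gtrends(search.words,
#                          gprop = "web",
#                          time = "2018-03-01 2018-11-16",
#                          geo = "DE")[[1]]

# Assumption: `google.trends` holds the locally loaded gtrends output
# (long format); the loading step is not shown here.
google.trends <- google.trends %>%
  pivot_wider(names_from = c("keyword", "geo"), values_from = "hits") %>%
  dplyr::select(-time, -gprop, -category)

# Replace "<1" with 0
google.trends <- google.trends %>%
  mutate_all(~ str_replace(., "<1", "0"))

# Convert date variable to 'Date' class
google.trends <- google.trends %>%
  mutate(date = as.Date(date))

# Mutate factor to numeric and reorder
google.trends <- google.trends %>%
  #mutate_if(is.factor, as.character) %>%
  mutate_if(is.character, as.numeric)``````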
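
To see what the reshaping and cleaning steps do, here is a minimal sketch on made-up data. The object `toy_long` and its values are hypothetical; only the column names (`date`, `hits`, `keyword`, `geo`) follow the code above.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical long-format data mimicking the columns used above
toy_long <- tibble(
  date    = rep(c("2018-03-04", "2018-03-11"), times = 2),
  hits    = c("<1", "12", "35", "100"),
  keyword = rep(c("GDPR", "DSGVO"), each = 2),
  geo     = "DE"
)

toy_wide <- toy_long %>%
  # One column per keyword/geo combination, e.g. GDPR_DE
  pivot_wider(names_from = c("keyword", "geo"), values_from = "hits") %>%
  # Very low search volumes are reported as the text "<1"
  mutate(across(-date, ~ str_replace(.x, "<1", "0"))) %>%
  # Convert the text columns to their proper classes
  mutate(date = as.Date(date),
         across(-date, as.numeric))

toy_wide
```

The result has one row per date and one numeric column per search term, which is the shape the plotting code expects.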

2. We plot the data and add our own annotations in Figure 19. Let’s go through the code together.

The code for a simple line plot is as follows:

``````# The simple line plot
ggplot(data = google.trends) +
geom_line(aes(x = date, y = DSGVO_DE),
color = "black") +
geom_line(aes(x = date, y = GDPR_DE),
color = "blue")``````

The more complicated version is below.

``````library(ggnewscale)
# The complicated version
ggplot(data = google.trends) +
geom_rect(aes(fill = "fieldperiod"),
xmin = as.Date("2018-04-16", "%Y-%m-%d"),
xmax = as.Date("2018-04-23", "%Y-%m-%d"),
ymin = 0, ymax = 100, alpha = 0.2) +
geom_rect(aes(fill = "fieldperiod"),
xmin = as.Date("2018-07-24", "%Y-%m-%d"),
xmax = as.Date("2018-08-02", "%Y-%m-%d"),
ymin = 0, ymax = 100, alpha = 0.2) +
geom_rect(aes(fill = "fieldperiod"),
xmin = as.Date("2018-10-29", "%Y-%m-%d"),
xmax = as.Date("2018-11-07", "%Y-%m-%d"),
ymin = 0, ymax = 100, alpha = 0.2) +
geom_line(aes(x = date, y = DSGVO_DE, color = "dsgvocolor")) +
geom_line(aes(x = date, y = GDPR_DE, color = "gdprcolor")) +
theme_light() +
ylab("Searches (100 = max. interest in time period/territory)") +
xlab("Month (2018)") +
scale_colour_manual(name="Google Searches", values=c(gdprcolor = "darkgreen",
dsgvocolor = "black"),
breaks = c("gdprcolor", "dsgvocolor"), # keep each label attached to its own line
labels = c("GDPR Searches",
"DSGVO Searches")) +
scale_fill_manual(name="Field periods",
values=c(fieldperiod="gray"),
labels = c("Wave 1, 2 and 3")) +
new_scale_color() +
scale_colour_manual(name="Events",
values=c(Policy_implementation = "red"),
labels = c("Policy implementation (25th of May)")) +
geom_vline(aes(xintercept = as.Date("2018-05-25"), color = "Policy_implementation")) +
theme(
legend.position = c(.95, .95),
legend.justification = c("right", "top"),
legend.box.just = "right",
legend.margin = margin(6, 6, 6, 6),
legend.background = element_rect(fill=alpha('white', 0.8)))``````
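
The three `geom_rect()` layers above inherit `google.trends`, so ggplot2 draws each rectangle once per data row and the shading ends up darker than `alpha = 0.2` suggests. One possible variation (a sketch, not the approach used for the figure) is to store the field periods in their own data frame and pass it to a single `geom_rect()` layer. The objects `searches` and `field_periods` are hypothetical; the search values are made up, while the dates are the field periods from the code above.

```r
library(ggplot2)

# Hypothetical stand-in for the search data (values are made up)
dates <- seq(as.Date("2018-03-04"), as.Date("2018-11-11"), by = "week")
searches <- data.frame(
  date     = dates,
  DSGVO_DE = rep(c(20, 60, 35), length.out = length(dates))
)

# One row per survey wave
field_periods <- data.frame(
  wave  = c("Wave 1", "Wave 2", "Wave 3"),
  start = as.Date(c("2018-04-16", "2018-07-24", "2018-10-29")),
  end   = as.Date(c("2018-04-23", "2018-08-02", "2018-11-07"))
)

p <- ggplot() +
  # The layer has its own data, so each rectangle is drawn exactly once
  geom_rect(data = field_periods,
            aes(xmin = start, xmax = end, ymin = 0, ymax = 100,
                fill = "fieldperiod"),
            alpha = 0.2) +
  geom_line(data = searches,
            aes(x = date, y = DSGVO_DE, color = "dsgvocolor")) +
  scale_fill_manual(name = "Field periods",
                    values = c(fieldperiod = "gray"),
                    labels = "Wave 1, 2 and 3") +
  scale_colour_manual(name = "Google Searches",
                      values = c(dsgvocolor = "black"),
                      labels = "DSGVO Searches") +
  theme_light()
```

With this layout, adding a further wave only requires another row in `field_periods`.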

### 5.6.4 Exercise

1. Use the code from above and investigate Google searches for two other topics (e.g. “COVID” and “Hydroxychloroquine”). Choose a sensible time period for your search. And choose a sensible geographic area (e.g., `geo = "US"`).
2. Convert the data into long format etc. (following the steps above) so that you can visualize it as a line plot in ggplot.
3. Add events to your line plots (e.g., one could take one of Trump’s tweets as an event).
4. Try to visualize a legend (it’s challenging!).

### 5.6.5 Newer graph: Salience of events across time

As published in Bauer & Clemm von Hohenberg (2022). Currently, gtrendsR access to Google is buggy (see here for error code 429).

• Potentially, it might be difficult to collect older data from Google trends.
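``````library(tidyverse) # dplyr, tidyr, stringr, ggplot2
library(lubridate) # week(), floor_date(), ceiling_date()
library(gtrendsR)
library(ggnewscale) # new_scale_color()

# Words to search for
search.words <- c("Einwanderung", "Flüchtlinge", "Asyl", "Migration", "Lagerfeld")

google.trends <- gtrends(search.words,
gprop = "web",
time = "2014-12-31 2019-06-03",
geo = "DE",
onlyInterest = TRUE
)[[1]]

google.trends <- google.trends %>%
pivot_wider(names_from = c("keyword", "geo"), values_from = "hits") %>%
dplyr::select(-time, -gprop, -category) %>%
# Assumption: rename the umlaut column so it matches the name used below
rename(Fluechtlinge_DE = "Flüchtlinge_DE")

# Replace "<1" with 0
google.trends <- google.trends %>%
mutate_all(~ str_replace(., "<1", "0"))

# Convert date variable
google.trends <- google.trends %>%
mutate(date = as.Date(date))

# Mutate factor to numeric and reorder
google.trends <- google.trends %>%
mutate_if(is.factor, as.character) %>%
mutate_if(is.character, as.numeric)

# Aggregate to weekly means
google.trends <- google.trends %>%
mutate(
week = week(date),
week_start = floor_date(date, "weeks", week_start = 1),
week_end = ceiling_date(date, "weeks", week_start = 1)
) %>%
group_by(week_start) %>%
summarise(
date = first(date),
week_end = first(week_end),
Einwanderung_DE = mean(Einwanderung_DE),
Fluechtlinge_DE = mean(Fluechtlinge_DE),
Asyl_DE = mean(Asyl_DE),
Migration_DE = mean(Migration_DE),
Lagerfeld_DE = mean(Lagerfeld_DE)
)

ggplot(google.trends,
aes(x = week_start)
) +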
geom_rect(aes(fill = "fieldperiod"),
xmin = as.Date("2019-03-14", "%Y-%m-%d"),
xmax = as.Date("2019-03-29", "%Y-%m-%d"),
ymin = 0, ymax = 100, alpha = 0.2
) +
geom_line(aes(x = week_start, y = Einwanderung_DE, color = "Einwanderung")) +
geom_line(aes(x = week_start, y = Fluechtlinge_DE, color = "Fluechtlinge")) +
geom_line(aes(x = week_start, y = Asyl_DE, color = "Asyl")) +
geom_line(aes(x = week_start, y = Migration_DE, color = "Migration")) +
geom_line(aes(x = week_start, y = Lagerfeld_DE, color = "Lagerfeld")) +
theme_light() +
ylab("Searches (100 = max. interest\nin time period/territory)") +
xlab("Weekly averages (2019)") +
scale_colour_manual(name = "Search terms", values = c(
Einwanderung = "darkgreen",
Fluechtlinge = "black",
Asyl = "red",
Migration = "yellow",
Lagerfeld = "orange"
)) +
scale_fill_manual(
name = "Data collection",
values = c(fieldperiod = "gray"),
labels = c("Field period")
) +
new_scale_color() + # Add a new scale (ignore previous color scale)
geom_vline(aes(
xintercept = as.Date("2015-09-07"),
linetype = "dashed"
)) +
geom_vline(aes(
xintercept = as.Date("2015-12-31"),
linetype = "dotted"
)) +
geom_vline(aes(
xintercept = as.Date("2019-02-19"),
linetype = "twodash"
)) +
scale_linetype_manual(
name = "Events",
values = c(
"dashed",
"dotted",
"twodash"
),
labels = c(
"Refugee crisis\n(Summer 2015)",
"New Year's Eve assaults\n(2015/12/31)",
"Lagerfeld's death\n(2019/02/19)"
)
) +
scale_x_date(
date_breaks = "8 weeks",
date_labels = "%Y-%m-%d" # ,
# limits = c(as.Date("2018-12-31"), as.Date("2019-06-03"))
) +
theme(
legend.position = c(.80, .99),
legend.justification = c("right", "top"),
legend.box.just = "right",
legend.box = "horizontal",
legend.direction = "vertical",
legend.margin = margin(6, 6, 6, 6),
axis.text.x = element_text(angle = 45, hjust = 1, size = 7),
legend.title = element_text(size = 9),
legend.text = element_text(size = 8),
legend.background = element_rect(fill = adjustcolor("white", alpha.f = 0.7)),
legend.key = element_rect(fill = adjustcolor("white", alpha.f = 0.7), color = NA),
legend.key.size = unit(0.6, "cm")
)``````
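
The weekly aggregation and the five separate `geom_line()` layers can be written more compactly. The sketch below assumes made-up daily data (`toy_daily` and its values are hypothetical): it averages all search-term columns at once with `across()` and reshapes to long format, so that one `geom_line()` layer with a color mapping produces the legend automatically.

```r
library(dplyr)
library(tidyr)
library(lubridate)
library(ggplot2)

# Hypothetical daily data in wide format (two full weeks, Monday to Sunday)
toy_daily <- tibble(
  date         = seq(as.Date("2019-02-04"), as.Date("2019-02-17"), by = "day"),
  Asyl_DE      = rep(c(10, 20), each = 7),
  Lagerfeld_DE = rep(c(0, 70), each = 7)
)

toy_weekly <- toy_daily %>%
  # Weeks start on Monday, as in the code above
  mutate(week_start = floor_date(date, "weeks", week_start = 1)) %>%
  group_by(week_start) %>%
  # across() averages every search-term column at once
  summarise(across(ends_with("_DE"), mean), .groups = "drop")

toy_weekly_long <- toy_weekly %>%
  pivot_longer(cols = ends_with("_DE"),
               names_to = "keyword", values_to = "hits")

# A single geom_line() layer; the legend follows from the color mapping
p_weekly <- ggplot(toy_weekly_long,
                   aes(x = week_start, y = hits, color = keyword)) +
  geom_line() +
  theme_light()
```

The trade-off is less manual control: custom colors and labels would then be set through `scale_colour_manual()` with values named after the entries of `keyword`.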

## 5.7 Time: Means across time (or other categories)

• Learning outcomes: Learn…
• …how to plot error bars
• …how to dodge graph elements

### 5.7.1 Data & Packages & functions

• Data: Various one-dimensional distributions (several single variables)
• Plot type: Dot plot with error bars
• `geom_errorbar()`: To create error bars
• `position=position_dodge(0.6)`: Dodge graph elements
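
As a minimal sketch of how these two pieces fit together, the example below plots made-up means with error bars for two groups per wave. The data frame `toy_means` and all of its values are hypothetical.

```r
library(ggplot2)

# Hypothetical summary data: one mean and CI half-width per wave and group
toy_means <- data.frame(
  wave  = factor(rep(c("W1", "W2"), each = 2)),
  group = rep(c("Full sample", "Panel"), times = 2),
  mean  = c(55, 60, 70, 72),
  ci_95 = c(4, 6, 4, 6)
)

# Using the same dodge object for bars and points keeps them aligned
pd <- position_dodge(0.6)

p_dodge <- ggplot(toy_means,
                  aes(x = wave, y = mean, color = group, group = group)) +
  geom_errorbar(aes(ymin = mean - ci_95, ymax = mean + ci_95),
                width = 0.1, position = pd) +
  geom_point(size = 3, position = pd) +
  ylim(0, 100) +
  theme_light()
```

Without `position = pd`, the two groups within a wave would be drawn on top of each other at the same x position.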

### 5.7.2 Graph

• We’ll reproduce and maybe criticize as well as improve Figure 20
• Questions:
• What does the graph show? What are the underlying variables (and data)?
• How many scales/mappings does it use? Could we reduce them?
• What do you like, what do you dislike about the figure? What is good, what is bad?
• What kind of information could we add to the graph (if any)?
• How would you approach a replication of the graph?

### 5.7.3 Lab: Data & Code

The data has already been pre-processed: we have a data frame that contains the means as well as 90% and 95% confidence intervals for different subsamples of the data (and for the full sample). The subsamples are constructed from information on whether respondents participated in all waves or only in some. The data frame also indicates how these means should be grouped.
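
The pre-processing itself is not shown here. As a rough sketch of how such a summary could be produced, the example below computes group means and confidence-interval half-widths from made-up raw data. The normal approximation and the object `toy_raw` are assumptions for illustration; only the column names `gdpr.know.num.mean`, `ci_90` and `ci_95` follow the plotting code below.

```r
library(dplyr)

# Hypothetical raw data: a 0/100 awareness indicator per respondent and wave
set.seed(1)
toy_raw <- tibble(
  wave          = rep(c(1, 2), each = 200),
  gdpr.know.num = sample(c(0, 100), size = 400, replace = TRUE)
)

toy_summary <- toy_raw %>%
  group_by(wave) %>%
  summarise(
    n_obs              = n(),
    gdpr.know.num.mean = mean(gdpr.know.num),
    se                 = sd(gdpr.know.num) / sqrt(n_obs),
    # Half-widths of normal-approximation confidence intervals (assumption)
    ci_90              = qnorm(0.95) * se,
    ci_95              = qnorm(0.975) * se,
    .groups = "drop"
  )

toy_summary
```

Storing half-widths rather than interval bounds matches how the plot uses them, namely as `mean - ci_90` and `mean + ci_90`.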

``````# data_gdpr_means_time.csv
#                          "1Ay7g1iIaCyxuDj2ce8UNc4kSRubForMN"))
pd <- position_dodge(0.6)
ggplot(data, aes(x = wave,
y = gdpr.know.num.mean,
color = factor(label),
group = factor(label))) +
geom_errorbar(aes(ymin=gdpr.know.num.mean - ci_90,
ymax=gdpr.know.num.mean + ci_90,
color = factor(label)),
colour="black",
size = 1,
width=.0,
position=pd) +
geom_errorbar(aes(ymin=gdpr.know.num.mean - ci_95,
ymax=gdpr.know.num.mean + ci_95,
color = factor(label)),
colour="black",
size = 0.4,
width=.1,
position=pd) +
geom_point(size = 3,
position=pd) +
scale_shape(solid = FALSE) +
ylim(0, 100) +
ylab("% GDPR Awareness") +
scale_x_discrete(labels = c(
"Wave 1 (N = 2093)\nApr 16 - 23, 2018",
"Wave 2 (N = 2043)\nJul 24 - Aug 02, 2018",
"Wave 3 (N = 2112)\nOct 29 - Nov 07, 2018"
)) +
theme_light() +
theme(axis.title.x = element_blank()) +
scale_color_manual(
values = c("black", "#e41a1c", "#377eb8", "#4daf4a", "#984ea3", "#ff7f00"),
name = "Participation",
breaks = levels(factor(data$label)),
labels = c(
"Full Sample",
"Only W1 (N = 532)",
"Only W2 (N = 482)",
"Only W3 (N = 843)",
"W1 and W2 (N = 292)",
"W1, W2 and W3 (N = 1269)"
)
)``````

## 5.8 Time: Slope charts

### 5.8.1 Data & Packages & functions

• Data: Panel data with time points on one variable
• Example is taken from the vignette of the package
• `newggslopegraph()` function from the `CGPfunctions` package

### 5.8.2 Graph(s)

• Figure 22 (Data from 2002) and Figure 23 depict slope graphs for the data in Table 4 and Table 5.
• Questions:
• What does the graph show? What are the underlying variables (and data)?
• How many scales/mappings does it use? Could we reduce them?
• What do you like, what do you dislike about the figure? What is good, what is bad?
• What kind of information could we add to the graph (if any)?
• How would you approach a replication of the graph?

### 5.8.3 Lab: Data & Code

``````library(CGPfunctions) # provides newggslopegraph() and the newcancer / newgdp data
head(newcancer %>% select(Year, Survival, Type))
newggslopegraph(dataframe = newcancer,
Times = Year,
Measurement = Survival,
Grouping = Type,
Title = "Estimates of Percent Survival Rates",
SubTitle = "Based on: Edward Tufte, Beautiful Evidence, 174, 176.",
Caption = NULL
)``````

``````head(newgdp)
custom_colors <- tidyr::pivot_wider(newgdp,
id_cols = Country,
names_from = Year,
values_from = GDP) %>%
mutate(difference = Year1979 - Year1970) %>%
mutate(trend = case_when(
difference >= 2 ~ "green",
difference <= -1 ~ "red",
TRUE ~ "gray"
)
) %>%
select(Country, trend) %>%
tibble::deframe()
newggslopegraph(newgdp,
Year,
GDP,
Country,
Title = "Gross GDP",
SubTitle = NULL,
Caption = NULL,
LineThickness = .5,
YTextSize = 4,
LineColor = custom_colors
)``````
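
In the second example, `custom_colors` is a named character vector: the names are the groups (countries) and the values are the colors, which is the form passed to `LineColor` above. A minimal sketch of that construction on made-up data (the tibble `toy_trend` and its values are hypothetical):

```r
library(dplyr)
library(tibble)

# Hypothetical two-column tibble: group name and the color it should get
toy_trend <- tibble(
  Country    = c("A", "B", "C"),
  difference = c(3.2, -1.5, 0.4)
) %>%
  mutate(trend = case_when(
    difference >= 2  ~ "green",
    difference <= -1 ~ "red",
    TRUE             ~ "gray"
  )) %>%
  select(Country, trend)

# deframe(): the first column becomes the names, the second the values
toy_colors <- deframe(toy_trend)
toy_colors
```

Because the colors are looked up by name, the order of the rows in the tibble does not matter.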

## References

Bauer, Paul C, Frederic Gerdon, Florian Keusch, and Frauke Kreuter. 2020. “The Impact of the GDPR Policy on Data Sharing/Privacy Attitudes.” Preliminary Draft, 1–22.
Bauer, Paul C, and Andrei Poama. 2020. “Does Suffering Suffice? An Experimental Assessment of Desert Retributivism.” PLoS One 15 (4): e0230304.
Wickham, Hadley. 2010. “A Layered Grammar of Graphics.” J. Comput. Graph. Stat. 19 (1): 3–28.
———. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer.

## Footnotes

1. Boxplot: A boxplot graphically represents the distribution of a dataset. A central rectangular box spans the lower quartile (25th percentile) to the upper quartile (75th percentile), with a line at the median. Whiskers extend to the smallest and largest values that lie within 1.5 times the interquartile range of the box, and more extreme values are drawn as individual points (potential outliers). (Adapted from ChatGPT).↩︎

2. `geom_density()`: The underlying computations are more complex and rest on assumptions that do not hold for all data (continuous, unbounded, and smooth) → use the others.↩︎

3. The figure was published in Bauer and Poama (2020) that is based on a survey experiment studying the effect of an offender’s suffering on perceived justice of punishment. The figure shows individual-level data on socio-demographics.↩︎

4. Focus on scale categories, distributions, information etc.↩︎

5. Data: Data is four categorical variables with the same ordered categories. In the graph they are combined, i.e., the four variables are combined into two categorical variables: `Platform` with unordered categories and `Account` with 3 ordered categories.↩︎

6. We use both the x-scale and color for the same mapping namely platforms. This could be reduced.↩︎

7. One could also think of choosing colors that are also discernible when printed in grayscale instead of luminance.↩︎

8. Data: Data is four categorical variables with the same ordered categories. In the graph they are combined, i.e., the four variables are combined into two categorical variables: `Platform` with unordered categories and `Account` with 3 ordered categories.↩︎