Visualizing descriptive statistics

Sources: Original material; Wickham (2010)

1 Description basics

  • (Research) Question: How are observations (≈ units) distributed across values/categories of a variable (or several)?
  • Objective:
    • Show distributions (causality?), exploration & presentation, replace summary tables
  • Data
    • Uni-dimensional vs. multi-dimensional data (e.g., joint distribution)
    • Time is just another dimension/variable
    • Data can be aggregated (e.g., means across time)
  • Usage: Mostly in methods section or in the appendix (sometimes results!)
  • Note: Description is just as important as explanation
  • Types of graphs
    • Depend on the data types and number of dimensions/variables
      • Numeric (quantitative: discrete, continuous), categorical (qualitative: nominal, ordinal)
      • 1, 2, 3 etc. variables
    • See decision tree here (and another one here)

2 Exploratory summary graphs

  • Learning outcomes: Learn how to…
    • …produce a quick exploratory plot for your data.
  • ggpairs() (GGally package): Provides quick and dirty overview of many variables and their relationships
    • Don’t plot too many variables!
  • Figure 1 provides an example. The data are shown in Table 1.
    • Q: What does Figure 1 show? How useful is it?
Table 1: Data: Experiment on punishment
sex     age  religion      treatment_text              ethnicity_census  personal_income  employment_status
Male    41   Christianity  Unhappy (No Moral Change)   Mixed             [10K, 20K)       Part-Time
Female  55   Spiritualism  Happy (No Moral Change)     White             <10K             Not in paid work
Male    45   NonReligious  Neutral (Yes Moral Change)  White             [20K, 30K)       Full-Time
Male    25   NonReligious  Neutral (No Moral Change)   White             [30K, 40K)       Full-Time
Female  52   NonReligious  Neutral (Yes Moral Change)  White             <10K             Part-Time
Male    53   Christianity  Unhappy (Yes Moral Change)  White             <10K             Part-Time
Figure 1: Exploratory descriptive graphs

2.1 Lab: Data & Code

# data_bauer_poama.csv
library(tidyverse) # read_csv(), %>%, ggplot2
library(GGally)    # ggpairs()

data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
                         "1QeDCki2Auu3OZK7qMRh2xbq0u7n1Fy_8"))
ggpairs(data %>% dplyr::select(sex, age, employment_status)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# To include a variable with many levels (e.g., religion), raise the cardinality
# threshold, i.e., the maximum number of levels allowed in a character/factor
# column: ggpairs(..., cardinality_threshold = 15) is the default.

3 Several variables: Categorical/numerical

  • Learning outcomes: Learn how to…
    • …spot errors in a graph.
    • …visualize several single variables.
    • …make summary graph for publication.
    • …add summary statistics to plots.
    • …manipulate data for a plot.
    • …plot ordered variables (and order categories).
    • …plot graphs next to each other.
    • …search for solutions online.

3.1 Data & Packages & functions

  • Data: One categorical or numerical variable (1 dimension), several plotted next to each other
  • Packages & functions:
    • geom_histogram(): Histograms show the distribution of a single numeric/quantitative variable (frequency polygons: geom_freqpoly())
      • Provide a lot of information about the distribution but need more space than boxplots
      • Bin the data, then count the number of observations in each bin (then show either bars or lines)
        • binwidth: Controls the width of the bins (ALWAYS experiment with it; by default ggplot2 uses bins = 30, i.e., 30 bins regardless of the data)
        • breaks: Set manual bin cutoffs
    • geom_bar(): Barplots show the distribution of categorical/qualitative variables
  • Recommendations
    • Use consistent strategy to visualize or indicate missings
    • Show missings as annotation in histograms or as category in barplots (last of ordered categories)
    • Survey data: Sometimes it could make sense to be more specific and also show “refusals”, “don’t know” etc. (make the sample transparent!)
    • Create the final labels in the plot data (data_plot) rather than in the plotting code to avoid errors
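
The binning options listed above can be compared on any numeric variable. The following is a minimal sketch that uses the built-in mtcars data rather than the course data; the bin width of 2.5 and the break points are arbitrary choices for illustration.

```r
library(ggplot2)

# Same variable, three binning choices (mtcars ships with R)
p_default <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram() # no binning argument: falls back to bins = 30

p_width <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2.5) # every bin is 2.5 mpg wide

p_breaks <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(breaks = seq(10, 35, by = 5)) # manual cutoffs

# layer_data() returns the bins that ggplot2 computed for a layer
bins_breaks <- layer_data(p_breaks)
bins_breaks[, c("xmin", "xmax", "count")]
```

Printing p_default, p_width and p_breaks next to each other (e.g., with patchwork) shows how much the impression of the distribution depends on the binning.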

3.2 Graph

  • Here we’ll reproduce (criticize and improve) part of Figure 2.
  • Questions:
    • What does the graph show? What are the underlying variables (and data)?
    • How many scales/mappings does it use? Could we reduce them?
    • What do you like, what do you dislike about the figure? What is good, what is bad?
    • What kind of information could we add to the graph (if any)?
    • How would you approach a replication of the graph?


Figure 2: Graph summarizing a dataset

3.3 Lab: Data & Code

We start by inspecting the data with View(data) and str(data): What do we see?

# data_bauer_poama.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                          "1QeDCki2Auu3OZK7qMRh2xbq0u7n1Fy_8"))

data <- read_csv("data/data_bauer_poama.csv",
                 col_types = cols())

# TIP: Use styler package to style code

# View(data)
# str(data)

# Age ####
p1 <- ggplot(
  data,
  aes(x = age)
) +
  geom_histogram(binwidth = 2, fill = "gray") +
  labs(x = "Age", y = "N") +
  theme_light()

# Religion ####
# Get categories in ranked order
levels_ranked <- data %>%
  dplyr::select(religion) %>%
  count(religion) %>%
  arrange(desc(n)) %>%
  dplyr::pull(religion)
# Create factor
data$religion_fac <- factor(data$religion,
  levels = levels_ranked,
  ordered = TRUE
)

p2 <- ggplot(data, aes(x = religion_fac)) +
  geom_bar(fill = "gray") +
  labs(x = "Religion", y = "N") +
  theme_light() +
  theme(
    axis.text.x = element_text(
      angle = 40,
      hjust = 1,
      vjust = 1,
      margin = margin(0.2, 0, 0.3, 0, "cm")
    ),
    plot.title = element_text(hjust = 0.5),
    plot.margin = margin(0.5, 0.5, 0, 0.5, "cm")
  )

# Employment status
data$employment_status_fac <- factor(data$employment_status)
p3 <- ggplot(data, aes(x = employment_status_fac)) +
  geom_bar(fill = "gray") +
  labs(x = "Employment status", y = "N") +
  theme_light() +
  theme(
    axis.text.x = element_text(
      angle = 35,
      hjust = 1,
      vjust = 1,
      margin = margin(0.2, 0, 0.3, 0, "cm")
    ),
    plot.title = element_text(hjust = 0.5),
    plot.margin = margin(0.5, 0.5, 0, 0.5, "cm")
  )

library(patchwork)
p1 + p2 + p3
Figure 3: Descriptive graphs of various variables in a sample

3.4 Exercise

  • We’ll split into teams and improve Figure 3 together. Use the code of the subplot above that we assign to you (p1, p2, p3) to create one of the plots below.
  1. Histogram of age (p1):
    • Add additional statistics (mean, median, sd)
      • Tip: Use geom_vline(aes(xintercept = mean(...)), col = 'red', linewidth = 1) to add lines (linewidth replaces size for lines since ggplot2 3.4.0).
      • Use annotate("text", label = paste("Mean:", round(mean(data$age))), ...) to add labels.
  2. Barplot of religion (p2): Reduce rare categories to one category called Other religion.
    • Tip: Recode the religion variable: data$religion_fac <- recode(data$religion_fac, "Sikhism" = "Other")
    • Tip: Or use the fct_lump() function.
  3. Barplot of employment status (p3)
    • Add labels with absolute (relative) number for the bars
      • Search solution here: https://community.rstudio.com/t/regarding-adding-bar-labels-at-the-top-of-each-bar-in-ggplot-in-rstudio/14226/4
      • Order categories (see how it’s done in the code for the religion plot above)
  • In the end explain to others what you did to improve the plot.
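
As a hedged illustration of the fct_lump() route for task 2, the sketch below uses fct_lump_n() from forcats on a made-up religion vector (not the course data). The cutoff n = 2 and the label "Other religion" are assumptions chosen for the example.

```r
library(forcats)

# Made-up religion vector, for illustration only
religion <- c(
  rep("NonReligious", 5), rep("Christianity", 4),
  "Sikhism", "Jainism", "Hinduism"
)

# Keep the 2 most frequent categories and lump the rest together
religion_fac <- fct_lump_n(factor(religion), n = 2, other_level = "Other religion")

# Order by frequency, then make sure the lumped category comes last
religion_fac <- fct_relevel(fct_infreq(religion_fac), "Other religion", after = Inf)

table(religion_fac)
```

Compared with recode(), this avoids listing every rare category by hand, at the price of choosing the cutoff by count rather than by substantive meaning.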



Exercise solution
# HISTOGRAM

# Age ####
ggplot(
  data,
  aes(x = age)
) +
  geom_histogram(binwidth = 2, fill = "gray") +
  labs(x = "Age", y = "N") +
  theme_light() +
  geom_vline(aes(xintercept = mean(age)), col = "red", linewidth = 1) + # linewidth replaces size for lines (ggplot2 >= 3.4.0)
  geom_vline(aes(xintercept = median(age)), col = "blue", linewidth = 1) +
  annotate("text",
    label = paste(
      "Mean:", round(mean(data$age)),
      "\nMedian:", round(median(data$age)),
      "\nSD:", round(sd(data$age))
    ),
    x = 60, y = 60,
    size = 3,
    colour = "black",
    hjust = 0.5,
    vjust = 0.5
  )





# Religion

levels_ranked <- data %>%
  dplyr::select(religion) %>%
  count(religion) %>%
  arrange(desc(n)) %>%
  dplyr::pull(religion)
# Create factor
data$religion_fac <- factor(data$religion,
  levels = levels_ranked
)

data$religion_fac <- recode(data$religion_fac,
  "Sikhism" = "Other",
  "Jainism" = "Other",
  "Hinduism" = "Other",
  "Paganism" = "Other",
  "Spiritualism" = "Other"
)
# Alternative: data$religion_fac <- fct_lump(data$religion_fac, n = 10)

ggplot(data, aes(x = religion_fac)) + # refer to the column by name, not via data$
  geom_bar(fill = "gray") +
  labs(x = "Religion", y = "N") +
  theme_light() +
  theme(
    axis.text.x = element_text(
      angle = 40,
      hjust = 1,
      vjust = 1,
      margin = margin(0.2, 0, 0.3, 0, "cm")
    ),
    plot.title = element_text(hjust = 0.5),
    plot.margin = margin(0.5, 0.5, 0, 0.5, "cm")
  )



# Employment
# Get ranked categories
levels_ranked <- data %>%
  dplyr::select(employment_status) %>%
  count(employment_status) %>%
  arrange(desc(n)) %>%
  dplyr::pull(employment_status)
# Create factor
data$employment_status_fac <- factor(data$employment_status,
  levels = levels_ranked
)



ggplot(data, aes(x = employment_status_fac)) +
  geom_bar(fill = "gray") +
  labs(x = "Employment status", y = "N") +
  theme_light() +
  theme(
    axis.text.x = element_text(
      angle = 35,
      hjust = 1,
      vjust = 1,
      margin = margin(0.2, 0, 0.3, 0, "cm")
    ),
    plot.title = element_text(hjust = 0.5),
    plot.margin = margin(0.5, 0.5, 0, 0.5, "cm")
  ) +
  # Relative frequencies as bar labels; vjust is a fixed setting, so it goes outside aes()
  geom_text(
    stat = "count",
    aes(label = round(after_stat(count) / sum(after_stat(count)), 2)),
    vjust = -0.5
  )
Another solution
  • Another solution using geom_table_npc().
library(ggpmisc)

# data_bauer_poama.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                          "1QeDCki2Auu3OZK7qMRh2xbq0u7n1Fy_8"))

data <- read_csv("data/data_bauer_poama.csv",
                 col_types = cols())

summ <- data %>%
  summarize(Mean = mean(age),
            Median = median(age),
            SD = sd(age),
            N = n(),
            Missing = sum(is.na(age)))

ggplot(data,  aes(x=age)) +
  geom_histogram(binwidth=2, fill = "gray") +
  labs(x="Age", y = "N")+
  theme_light() +
  geom_table_npc(data = summ, label = list(summ), npcx = 0.42, npcy = 1, hjust = 0, vjust = 1) +
  geom_vline(aes(xintercept = mean(age)), col = "red", linewidth = 2) +
  geom_vline(aes(xintercept = median(age)), col = "blue", linewidth = 2)

3.5 Exercise: Highlighting things to focus attention

  • Below are two quick examples of how we could highlight certain facts in histograms or barplots. In practice it might be helpful to add annotations in the graph.
    • Data in the graph can be colored by simply adding additional geoms on top.
    • In order to change the color of words you need to use <span style='color:orange;'>...</span> in combination with theme(plot.title = element_markdown()).
  • Please use the code below but this time…
    • …highlight in the histogram in Figure 4 that only a small percentage is younger than 25 years old
    • …highlight in Figure 5 that Islam, Buddhism and Judaism represent only a small share in the sample.
# data_bauer_poama.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                          "1QeDCki2Auu3OZK7qMRh2xbq0u7n1Fy_8"))

data <- read_csv("data/data_bauer_poama.csv",
                 col_types = cols())

library(ggtext) 
ggplot(
  data,
  aes(x = age)
) +
  geom_histogram(binwidth = 2, fill = "gray") +
  geom_histogram(
    data = data %>% filter(age > 55),
    aes(x = age),
    binwidth = 2, fill = "blue"
  ) +
  labs(x = "Age", y = "N") +
  theme_light() +
  labs(title = "Only a small percentage in the sample <span style='color:blue;'>is over 55</span>.") +
  theme(plot.title = element_markdown())
Figure 4: Highlighting parts of a histogram
library(ggtext) 
levels_ranked <- data %>%
  dplyr::select(religion) %>%
  count(religion) %>%
  arrange(desc(n)) %>%
  dplyr::pull(religion)
# Create factor
data$religion_fac <- factor(data$religion,
  levels = levels_ranked
)

data$religion_fac <- recode(data$religion_fac,
  "Sikhism" = "Other",
  "Jainism" = "Other",
  "Hinduism" = "Other",
  "Paganism" = "Other",
  "Spiritualism" = "Other"
)

ggplot(data, aes(x = religion_fac)) +
  geom_bar(fill = "gray") +
  geom_bar(
    data = data %>% filter(religion == "Christianity"),
    aes(x = religion_fac),
    fill = "red"
  ) +
  labs(x = "Religion", y = "N") +
  theme_light() +
  theme(
    axis.text.x = element_text(
      angle = 40,
      hjust = 1,
      vjust = 1,
      margin = margin(0.2, 0, 0.3, 0, "cm")
    ),
    plot.title = element_markdown(),
    plot.margin = margin(0.5, 0.5, 0, 0.5, "cm")
  ) +
    labs(title = " <span style='color:red;'>Christians</span> are only the second biggest group in the sample.")
Figure 5: Highlighting parts of a barplot
Exercise solution
ggplot(
  data,
  aes(x = age)
) +
  geom_histogram(
    fill = "gray",
    breaks = seq(15, 75, 5)
  ) +
  geom_histogram(
    data = data %>% filter(age < 25),
    fill = "blue",
    breaks = seq(15, 75, 5)
  ) +
  labs(x = "Age", y = "N") +
  theme_light() +
  labs(title = "Only a small percentage in the sample <span style='color:blue;'>is younger than 25</span>.") +
  theme(plot.title = element_markdown())
Figure 6: Highlighting parts of a histogram
levels_ranked <- data %>%
  dplyr::select(religion) %>%
  count(religion) %>%
  arrange(desc(n)) %>%
  dplyr::pull(religion)
# Create factor
data$religion_fac <- factor(data$religion,
  levels = levels_ranked
)

data$religion_fac <- recode(data$religion_fac,
  "Sikhism" = "Other",
  "Jainism" = "Other",
  "Hinduism" = "Other",
  "Paganism" = "Other",
  "Spiritualism" = "Other"
)

ggplot(data, aes(x = religion_fac)) +
  geom_bar(fill = "gray") +
  geom_bar(
    data = data %>%
      filter(religion %in% c("Islam", "Buddhism", "Judaism")),
    aes(x = religion_fac),
    fill = "red"
  ) +
  labs(x = "Religion", y = "N") +
  theme_light() +
  theme(
    axis.text.x = element_text(
      angle = 40,
      hjust = 1,
      vjust = 1,
      margin = margin(0.2, 0, 0.3, 0, "cm")
    ),
    plot.title = element_markdown(),
    plot.margin = margin(0.5, 0.5, 0, 0.5, "cm")
  ) +
    labs(title = "<span style='color:red;'>Islam, Buddhism</span> and <span style='color:red;'>Judaism</span> only make up a small percentage of the sample.")
Figure 7: Highlighting parts of a barplot

4 Non-aggregated vs. aggregated data: Barplot example

  • Learning outcomes: Learn…
    • …the difference between plotting aggregated and non-aggregated data.
    • …the logic behind ordering scales.

4.1 Data & Packages & functions

  • Data: One categorical variable
  • Challenge: Either feed original raw data or summarized/processed data to ggplot
  • Packages & functions:
    • geom_bar(): Expects unsummarised data (each observation contributes one unit to the height of each bar)
    • geom_bar(stat = "identity"): Tells geom_bar() not to aggregate/summarize the data (geom_col() is the shorthand for this)
    • factor(party, ordered = TRUE, levels = c(...)): Convert variable to ordered factor
    • str(data): Check data types + levels(data$party): Check levels and ordering
    • dput(unique(data$party)): Quickly extract categories from character vector to reorder
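
The difference between the two input formats can be checked directly. The following minimal sketch uses the built-in mtcars data (number of cylinders as the categorical variable) instead of the Twitter data; the object names are chosen for illustration only.

```r
library(ggplot2)
library(dplyr)

# Raw data: geom_bar() counts the rows per category itself
p_raw <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

# Summarized data: we count first, then draw the values as they are
cyl_counts <- mtcars %>% count(cyl)
p_sum <- ggplot(cyl_counts, aes(x = factor(cyl), y = n)) +
  geom_col() # same as geom_bar(stat = "identity")

# Both routes yield the same bar heights
heights_raw <- layer_data(p_raw)$count
heights_sum <- layer_data(p_sum)$y
```

Summarizing first is usually the better choice once you want to add labels, percentages or a custom order, because the numbers behind the plot can be inspected in cyl_counts.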

4.2 Graph

  • First, we keep the code/graph simple (no ordering, labels etc.)…
# data_twitter_influence.csv
#data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#"1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"),
 #                       col_types = cols())

data <- read_csv("data/data_twitter_influence.csv",
                 col_types = cols())

  p1 <- ggplot(data, aes(x = party)) + 
    geom_bar() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Or summarize the data first:
  data_plot <- data %>% 
    group_by(party) %>% 
    summarize(n = n()) %>% ungroup()
  p2 <- ggplot(data_plot, aes(x = party, y = n)) + 
    geom_bar(stat ="identity") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  
  p1 + p2 + 
  plot_layout(ncol = 2)

Figure: Two barplots: Summarize vs. identity

Now, let’s reorder the party variable according to ideology, i.e., with DieLinke as the left-most party and AfD as the right-most party. This can be done by converting the variable to an ordered factor.

# data_twitter_influence.csv
#data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#"1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"),
 #                       col_types = cols())  %>%
    # select(party)

data <- read_csv("data/data_twitter_influence.csv",
                 col_types = cols()) %>%
    select(party)

data <- data %>%
  mutate(party = factor(party,
    ordered = TRUE,
    levels = c("DieLinke", "Greens", "SPD", "FDP", "CDU_CSU", "AfD")
  ))
# str(data)
# levels(data$party)

  p1 <- ggplot(data, aes(x = party)) + 
    geom_bar() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Or summarize the data first:
  data_plot <- data %>% 
    group_by(party) %>% 
    summarize(n = n()) %>% ungroup()

  p2 <- ggplot(data_plot, aes(x = party, y = n)) + 
    geom_bar(stat ="identity") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  
  p1 + p2 + 
  plot_layout(ncol = 2)

Figure: Two barplots: Summarize vs. identity (parties ordered by ideology)

5 Further examples

5.1 Categorical variables (2+)

  • Learning outcomes: Learn…
    • …to manipulate data first and visualize it thereafter
    • …how to use pivot_longer (ggplot likes long format!)
    • …how to visualize unordered and ordered variables
    • …how to name scales and create manual ones
    • …how to create labels from data
      • Make sure the only thing you might need to apply to the labels is gsub(), i.e., ideally no substantive changes to label names
    • …how to size text elements

5.1.1 Data & Packages & functions

  • Data: Two or several categorical variables
  • Challenge: We need to summarize the data so as to obtain frequencies (absolute or relative)
    • Create long format dataframe that contains the frequencies of different category combinations
  • Packages & functions:
    • tidyr and pivot_longer() function
    • dplyr and functions such as summarize(), mutate() etc.
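
The reshaping and summarizing steps named above can be sketched on a tiny made-up dataset. The column names account_a and account_b and their values are hypothetical and only mimic the structure of the course data.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one column per categorical variable
toy <- tibble(
  account_a = c("Yes", "Yes", "No", "Yes"),
  account_b = c("No", "No", "Yes", "No")
)

# Wide -> long: one row per respondent x variable
toy_long <- toy %>%
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "value"
  )

# Absolute (n) and relative (pct) frequencies per variable and category
toy_freq <- toy_long %>%
  count(variable, value) %>%
  group_by(variable) %>%
  mutate(pct = 100 * n / sum(n)) %>%
  ungroup()

toy_freq
```

The resulting long dataframe has one row per variable-category combination, which is the shape ggplot expects when x, fill or alpha are mapped to those columns.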

5.1.2 Graph

  • Here we’ll reproduce and maybe criticize as well as improve Figure 8
  • Questions:
    • What does the graph show? What are the underlying variables (and data)?
    • How many scales/mappings does it use? Could we reduce them?
    • What do you like, what do you dislike about the figure? What is good, what is bad?
    • What kind of information could we add to the graph (if any)?
    • How would you approach a replication of the graph?


Figure 8: Several categorical variables



5.1.3 Lab: Data & Code

We start by importing the original (unsummarised) data. As you can see below, we have categorical string variables.

# data_account_prevalence.csv
#data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                         "1xbffXai-HqWS2Q17KoB_MCCY7YOpskyH"))

data <- read_csv("data/data_account_prevalence.csv",
                 col_types = cols())

kable(head(data))
account_email  account_fb              account_twitter  account_whatsapp
Yes            Yes                     No               Yes
Yes            Yes                     No               Yes
Yes            Yes, but I dont use it  No               Yes
Yes            Yes                     No               Yes
Yes            No                      No               Yes
Yes            Yes                     Yes              Yes



Subsequently, we summarize/aggregate the data, producing a dataframe that contains the percentage of people in each category (Yes, Yes, but inactive, No) across the four variables. It’s a good idea to call the data that forms the basis for our plot data_plot so as to keep the original dataset data. Let’s go through the code below step by step:

# Creating plot data
  data_plot <- data %>% 
  
      pivot_longer(cols = account_email:account_whatsapp, # formerly gather
                   names_to = "variable", 
                   values_to = "value") %>%
  
      group_by(variable) %>%
  
      summarize(pct.Yes = mean(value == "Yes", na.rm=TRUE),
                pct.Inactive = mean(value == "Yes, but I dont use it", na.rm=TRUE),
                pct.No = mean(value == "No", na.rm=TRUE)) %>% 
  
      pivot_longer(cols = pct.Yes:pct.No, # formerly gather
                   names_to = "category", 
                   values_to = "value") %>% # back to long format: one row per variable x category
  
      mutate(category = factor(category, # Create factor for ordering
                               levels = c("pct.Yes", "pct.Inactive", "pct.No"), 
                               ordered = TRUE)) %>%
  
      mutate(value = round(value,2)) %>% 
      mutate(value = 100 * value)

# Change the labels and translate to English (so we can directly pull them out later)
  data_plot$category <- gsub("Inactive", "Yes, but inactive", data_plot$category)
  data_plot$category <- gsub("pct.", "", data_plot$category, fixed = TRUE) # fixed = TRUE: match the dot literally, not as a regex wildcard
  
  data_plot$category <- factor(data_plot$category,
                               levels = c("Yes", "Yes, but inactive", "No"),
                               ordered = TRUE)
  data_plot$variable <- factor(data_plot$variable)
  
  data_plot$variable <- str_to_title(gsub("fb", "facebook", gsub("account_", "", 
                                          data_plot$variable)))
  data_plot
variable  category           value
Email     Yes                98
Email     Yes, but inactive  1
Email     No                 1
Facebook  Yes                61
Facebook  Yes, but inactive  8
Facebook  No                 30
Twitter   Yes                15
Twitter   Yes, but inactive  8
Twitter   No                 77
Whatsapp  Yes                83
Whatsapp  Yes, but inactive  1
Whatsapp  No                 16



Check out the variables in the data. Importantly, there is an ordered factor in there:

str(data_plot)
tibble [12 × 3] (S3: tbl_df/tbl/data.frame)
 $ variable: chr [1:12] "Email" "Email" "Email" "Facebook" ...
 $ category: Ord.factor w/ 3 levels "Yes"<"Yes, but inactive"<..: 1 2 3 1 2 3 1 2 3 1 ...
 $ value   : num [1:12] 98 1 1 61 8 30 15 8 77 83 ...



Now that we have prepared the data, we can plot it in Figure 9 (fairly easy!). Again, let’s go through this step by step:

# Creating the plot
ggplot(data_plot,aes(x = variable, y = value))+
  geom_bar(stat="identity", 
           width=0.7, 
           position = position_dodge(width=0.8),
           aes(fill = factor(variable),
               alpha=category)) +
  geom_text(position = position_dodge(width=0.8),
           aes(alpha=category,
               label = paste(value,"%", sep="")), 
           vjust=1.6, 
           color="black", 
           size=2) +
  
  scale_fill_discrete(name="Platform") +
  
  scale_alpha_discrete(name="Account",
                       range=c(1, 0.5)) + 
  xlab("Platforms")+
  ylab("Percentage (%)")+ 
  theme_light() +
theme(axis.text.x=element_text(angle=35,hjust=1,vjust=1, margin=margin(0.2,0,0.3,0,"cm")), 
      plot.title = element_text(hjust = 0.5),
      plot.margin = margin(0.5, 0.5, 0, 0.5, "cm"),      
      panel.grid.major.x = element_blank(),
      legend.title = element_text(size=10),
      legend.text = element_text(size=9))
Figure 9: Distribution of four categorical variables



Strictly speaking, the coloring in Figure 9 is not necessary because the platforms are already encoded on the x-axis. Figure 9 uses 4 mappings (x, y, alpha/luminance, color) for three variables. Hence, we could also use a grayscale version of the graph, shown below in Figure 10.

# Creating the plot
ggplot(data_plot,aes(x = variable, y = value))+
  geom_bar(stat="identity", 
           width=0.7, 
           position = position_dodge(width=0.8),
           aes(#fill = factor(variable),
               alpha=factor(category))) +
  geom_text(position = position_dodge(width=0.8),
           aes(alpha=factor(category),
               label = paste(value,"%", sep="")), 
           vjust=1.6, 
           color="black", 
           size=2) +
  

  
  scale_alpha_discrete(name="Account",
                       range=c(1, 0.5),
                       labels=c("Yes", "Yes, but inactive", "No")) + 
  
  scale_x_discrete(labels=str_to_title(gsub("fb", "facebook", 
                                            gsub("account_", "", unique(data_plot$variable))))) + 
  
  xlab("Platforms")+
  ylab("Percentage (%)")+ 
  theme_light() +
theme(axis.text.x=element_text(angle=35,hjust=1,vjust=1, margin=margin(0.2,0,0.3,0,"cm")), 
      plot.title = element_text(hjust = 0.5),
      plot.margin = margin(0.5, 0.5, 0, 0.5, "cm"),      
      panel.grid.major.x = element_blank(),
      legend.title = element_text(size=10),
      legend.text = element_text(size=9))
Figure 10: Distribution of four categorical variables

5.1.4 Exercise

  1. Load the summarized data (we’ll skip the data management steps).
# data_sharing_frequency_summarized.csv
#data_plot <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
                              #"16oXege9RqvtIkBppZvkQNb4gl-lmSyoU"))
data_plot <- read_csv("www/data/data_sharing_frequency_summarized.csv")
head(data_plot)
variable category value
Email Daily 7
Facebook Daily 7
Twitter Daily 7
Whatsapp Daily 17
Email Rarer 67
Facebook Rarer 53
  2. Try to recreate Figure 11 using the code from Figure 9. There is a mistake in Figure 11. Can you spot it?
  3. How would we modify the code (data) if we wanted to show just 2 out of 4 variables or just 3 out of 5 categories?
  4. How could we visualize a fourth categorical variable?
  5. Here we used colors for differentiating. What would be an alternative way?
Figure 11: Distribution of four categorical variables
Exercise solution
data_plot$category <- factor(data_plot$category,
  levels = c(
    "Daily", "Once a week", "A few times a month",
    "A few times a year", "Rarer"
  ),
  ordered = TRUE
)

ggplot(data_plot, aes(x = variable, y = value)) +
  geom_bar(
    stat = "identity",
    width = 0.7,
    position = position_dodge(width = 0.8),
    aes(
      fill = factor(variable),
      alpha = factor(category)
    )
  ) +
  geom_text(
    position = position_dodge(width = 0.8),
    aes(
      alpha = factor(category),
      label = paste(value, "%", sep = "")
    ),
    vjust = 1.6,
    color = "black",
    size = 2
  ) +
  scale_fill_discrete(name = "Platform") +
  scale_alpha_discrete(name = "Sharing frequency") +
  xlab("Platforms") +
  ylab("Percentage (%)") +
  theme_light() +
  theme(
    axis.text.x = element_text(angle = 35, hjust = 1, vjust = 1, margin = margin(0.2, 0, 0.3, 0, "cm")),
    plot.title = element_text(hjust = 0.5),
    plot.margin = margin(0.5, 0.5, 0, 0.5, "cm"),
    panel.grid.major.x = element_blank(),
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 9)
  )

5.2 Numeric vs. categorical: Various plot types

  • Learning outcomes: Learn…
    • …about plot types to visualize numeric vs. categorical.

5.2.1 Data & Packages & functions

  • Data: 1 categorical variable, 1 numeric variable
  • Packages & functions:
    • geom_jitter() offers the same control over aesthetics as geom_point() (size, color, shape)
    • geom_boxplot(), geom_violin(): You can control the outline color or the internal fill color
  • Strengths and weaknesses
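    • Boxplots summarize the distribution with five numbers (minimum, first quartile, median, third quartile, and maximum); note that geom_boxplot() draws the whiskers only up to 1.5 × IQR from the box and shows more extreme observations as individual points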
    • Jittered plots show every point but only work with relatively small datasets
    • Violin plots give the richest display, but rely on the calculation of a density estimate, which can be hard to interpret (see here)
  • Important: A combination of all three might be nice
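
To make the five numbers behind a boxplot concrete, here is a small base-R sketch on the built-in mtcars data (not the Twitter data). It also computes the 1.5 × IQR fences that geom_boxplot() uses by default to separate whiskers from individually plotted points.

```r
x <- mtcars$mpg

# The five numbers: minimum, first quartile, median, third quartile, maximum
five <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))

# Fences at 1.5 * IQR beyond the quartiles
iqr <- IQR(x)
lower_fence <- five[["25%"]] - 1.5 * iqr
upper_fence <- five[["75%"]] + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
inside <- x >= lower_fence & x <= upper_fence
whiskers <- range(x[inside])

# Observations outside the fences are drawn as individual points
outliers <- x[!inside]

five
whiskers
outliers
```

A jittered layer on top of the boxplot shows what these summaries hide, for example gaps or clusters within the box.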

5.2.2 Graph

  • Figure 12 visualizes different ways of plotting a categorical vs. a numerical variable.
  • Questions:
    • What does the graph show? What are the underlying variables (and data)?
    • How many scales/mappings does it use? Could we reduce them?
    • What do you like, what do you dislike about the figure? What is good, what is bad?
    • What kind of information could we add to the graph (if any)?
    • How would you approach a replication of the graph?
Figure 12: Boxplots and jittered points

5.2.3 Lab: Data & Code

# data_twitter_influence.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                                 "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"),
#                         col_types = cols())
data <- read_csv("data/data_twitter_influence.csv")
data_plot <- data

p1 <- ggplot(data_plot, aes(x = party, y = account_age_years)) + geom_point()+
    theme(axis.text.x = element_text(angle = 30, hjust = 1))
p2 <- ggplot(data_plot, aes(x = party, y = account_age_years)) + geom_jitter()+
    theme(axis.text.x = element_text(angle = 30, hjust = 1))
p3 <- ggplot(data_plot, aes(x = party, y = account_age_years)) + geom_boxplot()+
    theme(axis.text.x = element_text(angle = 30, hjust = 1))
p4 <- ggplot(data_plot, aes(x = party, y = account_age_years)) + geom_violin()+
    theme(axis.text.x = element_text(angle = 30, hjust = 1))
  p1 + p2 + p3 + p4 +
  plot_layout(ncol = 2)

5.2.4 Combining boxplot, violinplot and jitter

# data_twitter_influence.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                                 "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"),
#                         col_types = cols())
data <- read_csv("data/data_twitter_influence.csv")
data_plot <- data
ggplot(data_plot, aes(x = party, y = account_age_years)) +
  geom_violin(alpha = 0.5, width = 1, fill = "lightblue") +
  geom_boxplot(width = 0.25, fatten = 3) + # width was passed twice before, which is an error in R
  geom_jitter(color = "black", size = 2, alpha = 0.3, width = 0.2) +
  theme_classic() + # complete theme first, otherwise it overrides the tweak below
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

5.3 Numeric vs. numeric: Correlograms

  • Learning outcomes: Learn…
    • …about correlograms to summarize bivariate relations between many variables.

5.3.1 Data & Packages & functions

  • Data: Several numeric variables
  • Correlogram: Visualizes correlations between continuous variables present in the same dataframe
  • Package: ggcorrplot (github)

5.3.2 Graph

  • Figure 13 visualizes the correlation matrix in Table 2 for the dataframe mtcars.
    • Keep in mind that this works for numeric variables only.
Table 2: Correlation matrix: Motor Trend Car Road Tests (first 6 rows and columns)
       mpg   cyl  disp    hp  drat    wt
mpg    1.0  -0.9  -0.8  -0.8   0.7  -0.9
cyl   -0.9   1.0   0.9   0.8  -0.7   0.8
disp  -0.8   0.9   1.0   0.8  -0.7   0.9
hp    -0.8   0.8   0.8   1.0  -0.4   0.7
drat   0.7  -0.7  -0.7  -0.4   1.0  -0.7
wt    -0.9   0.8   0.9   0.7  -0.7   1.0
Figure 13: Correlation plot

5.3.3 Lab: Data & Code

library(ggcorrplot)

# Pairwise correlations of all (numeric) columns, rounded to one decimal
correlation_matrix <- round(cor(mtcars), 1)
head(correlation_matrix[, 1:6]) # Show part of matrix

# Plot
ggcorrplot(correlation_matrix,
           hc.order = TRUE,
           type = "lower",
           lab = TRUE)

5.4 Numeric vs. numeric: Scatterplots + smoother

  • Learning outcomes: Learn…
    • …how to visualize a classic scatterplot with lines for a model and a smoother.

5.4.1 Data & Packages & functions

  • Data: Several numeric variables
  • geom_smooth(): Adds smoother
    • geom_smooth(se = FALSE): Hides the confidence interval around the smooth (it is displayed by default, se = TRUE)
  • method = "loess"
    • Default for small n, uses a smooth local regression (as described in ?loess)
    • Wiggliness of the line is controlled by the span parameter, which ranges from 0 (exceedingly wiggly) to 1 (not so wiggly)
  • If n > 1000, an alternative smoothing algorithm is used (method = "gam" from the mgcv package; Wickham 2016, 19)
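
The effect of span and se can be seen by stacking several smoothers on one scatterplot. This is a minimal sketch on the built-in mtcars data (not the Twitter data); the span values 0.5 and 1 and the colors are arbitrary choices for illustration.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(alpha = 0.5) +
  # Smaller span: the curve follows the data more closely (wigglier)
  geom_smooth(
    method = "loess", formula = y ~ x,
    span = 0.5, se = FALSE, color = "orange"
  ) +
  # Larger span: a much smoother curve
  geom_smooth(
    method = "loess", formula = y ~ x,
    span = 1, se = FALSE, color = "blue"
  ) +
  # Linear model with its confidence band (se = TRUE is the default)
  geom_smooth(
    method = "lm", formula = y ~ x,
    color = "black", fill = "lightgray"
  ) +
  theme_light()
```

Print p to draw the plot. Setting formula explicitly only silences the message ggplot2 otherwise prints about the formula it picked.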

5.4.2 Graph

  • Figure 14 and Figure 15 provide two examples:
  • Questions:
    • What does the graph show? What are the underlying variables (and data)?
    • How many scales/mappings does it use? Could we reduce them?
    • What do you like, what do you dislike about the figure? What is good, what is bad?
    • What kind of information could we add to the graph (if any)?
    • How would you approach a replication of the graph?
Figure 14: Small multiples of scatterplots



Figure 15: Scatterplot with colored subsets

5.4.3 Lab: Data & Code

# data_twitter_influence.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                          "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"))
data <- read_csv("data/data_twitter_influence.csv",
                 col_types = cols())

ggplot(data %>% filter(followers_count<50000), 
       aes(x = account_age_years, 
           y = followers_count)) + 
           geom_point(alpha =0.5) + 
           facet_wrap(~party) +
  ylab("Number of followers") +
  xlab("Account age (in years)") +
  scale_x_continuous(breaks = c(0, 5, 10), limits = c(0,10)) +
  geom_smooth(method=lm,  color = "black", fill="lightgray") +
  geom_smooth(span =  0.3) +
  theme_light()



ggplot(data %>% filter(followers_count<50000), 
       aes(x = account_age_years, 
           y = followers_count, 
           color = factor(party))) + 
           geom_point(alpha =0.5) + 
           #facet_wrap(~party) +
  ylab("Number of followers") +
  xlab("Account age (in years)") +
  scale_x_continuous(breaks = c(0, 5, 10), limits = c(0,10)) +
  geom_smooth(method=lm,  aes(fill=party, color=party)) +
  theme_light()

5.5 Numeric vs. various variables

  • Learning outcomes: Learn…
    • …how to generate ggplot plots in loops (aes_string)
    • …how to visualize a numeric variable (Y) vs. different variables (X)
    • …how to create graphs conditional on loop elements depending on variable types
    • …how to modify elements of a ggplot2 object and assign objects created in a loop to the global environment.

5.5.1 Graph

  • Here we’ll reproduce parts of Figure 16
  • Questions:
    • What does the graph show? What are the underlying variables (and data)?8
    • How many scales/mappings does it use? Could we reduce them?
    • What do you like, what do you dislike about the figure? What is good, what is bad?
    • What kind of information could we add to the graph (if any)?
    • How would you approach a replication of the graph?


Figure 16: Numeric vs different variable types



5.5.2 Lab: Data & Code



Let’s check out (and load) the datasets that underlie the plot first.

  • data_loop: Contains the variable names (variable), labels (label) and types (type) of different covariates.
    • We’ll loop over the content of this dataframe (it’s ordered by the variable importance)
  • data_heterogeneity: Contains covariate values across individuals, as well as predictions for each individual (these are the predictions for a causal effect)
    • This is the data that is visualized.
# data_loop.csv
# data_loop <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                               "1ELtshmxQWS0T8mFR1uh5IH_r57ynP8MZ"))
data_loop <- read_csv("data/data_loop.csv")
kable(data_loop)
Importance variable label type
0.3680689 trust_source_mainstream Mainstr. media trust continuous
0.1248805 vote_choice_afd_num Vote choice AfD categorical
0.0786331 income_num Income categorical
# data_treatment_heterogeneity.csv
# data_heterogeneity <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                                             "1sYLKmFi4uZsDDxNZdcwjD-eTHcVWql_X"))
data_heterogeneity <- read_csv("data/data_treatment_heterogeneity.csv")
kable(head(data_heterogeneity))
predictions trust_source_mainstream vote_choice_afd_num income_num
1.1508661 3.2857143 0 11
0.6617550 3.0000000 0 6
0.4056240 1.0000000 0 11
0.3809769 2.0000000 0 3
0.6726859 2.5714286 0 3
0.2334704 0.5714286 1 10



On the basis of data_loop and data_heterogeneity we then write a loop that cycles through the rows of data_loop and generates the corresponding plots (the code could be rewritten to directly check the class of those variables).

  • The things that vary are variable name, label and variable type.
  • There are two variable types: continuous and categorical.
for(i in 1:nrow(data_loop)){
  #print(i)
  
  # Try this out (understand the loop) with i <- 1
  
  # Define objects taking them from the looping dataframe
  var_name <- data_loop$variable[i]
  var_label <- data_loop$label[i]
  var_type <- data_loop$type[i]
  
  # Create a plot number
  plot_number <- LETTERS[seq(from = 1, to = nrow(data_loop))][i]

  # Define angle conditionally
  if (var_name %in% c("income_num")){angle <- 45}else{angle <- 0}
  
  # Select data for plot
  # all_of() makes explicit that var_name holds a column name as a string
  data_plot <- data_heterogeneity %>% select(all_of(var_name), predictions)
  
  # CREATE PLOT DEPENDING ON VARIABLE TYPE
  if(var_type == "continuous") { # Continuous variable
    
    # Note: aes_string() is deprecated in recent ggplot2 versions
    p <- ggplot(data_plot, aes_string(x = as.name(var_name), 
                                      y = as.name("predictions"))) +
      geom_point(alpha = 3/10) +
      geom_smooth(method = "loess", span = 1, se = FALSE, colour = "gray") +
      labs(title = paste0("(", plot_number, ") ", var_label)) + 
      theme_light() +
      theme(axis.text.x = element_text(size = 6, angle = angle),
            axis.title.x = element_blank(),
            plot.title = element_text(size = 8))
    
    } else { # Categorical variable

      # Round the covariate and convert it to a factor (one box per category)
      data_plot[[var_name]] <- factor(round(data_plot[[var_name]]))

      p <- ggplot(data_plot, 
                  aes_string(x = as.name(var_name),
                             y = as.name("predictions"))) + 
        geom_boxplot() +
        geom_smooth(method = "loess", se=FALSE, aes(group=1), colour="gray") +
        labs(title = paste0("(", plot_number, ") ", var_label)) + 
        theme_light() +
        theme(axis.text.x = element_text(size = 6, angle = angle,
                                         hjust = 1, vjust = 1),
              #axis.title.y = element_blank(),
              axis.title.x = element_blank(),
              plot.title = element_text(size = 8))
    }
    assign(paste("p", i, sep=""), p) # Create object
}

  p1$labels$y <- "Predicted source\ntreatment effect" # Q:?
  p2$labels$y <- p3$labels$y <- " " # Q:?
  
  p1 + p2 + p3 + plot_layout(ncol = 3)
Figure 17: Numeric vs different variable types
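
Because aes_string() is deprecated in recent ggplot2 versions, the same idea can be sketched with the .data pronoun, storing the plots in a list instead of creating objects with assign(). The example uses mtcars and hypothetical variable names as stand-ins for the lab data. It relies on lapply() rather than a for loop because aes() is evaluated lazily: inside a for loop, all plots would refer to the last value of the loop variable once they are printed.

```r
library(ggplot2)

# Hypothetical stand-ins for the variable names stored in data_loop
plot_vars <- c("wt", "hp", "disp")

# One scatterplot per variable; .data[[var_name]] looks up the column by name
plots <- lapply(plot_vars, function(var_name) {
  ggplot(mtcars, aes(x = .data[[var_name]], y = mpg)) +
    geom_point(alpha = 0.3) +
    labs(title = var_name) +
    theme_light()
})
names(plots) <- plot_vars

# Individual plots are then available by name, e.g. plots[["hp"]]
```

A list of plots can also be handed to patchwork::wrap_plots() to arrange them in a grid.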

5.6 Time: Line charts & events

  • Learning outcomes: Learn…
    • …how to plot dates
    • …how to make line plots
    • …how to create manual legends for various elements
    • …how to visualize events & data collection periods

5.6.1 Data & Packages & functions

  • Data: 1+ Numeric variables vs. time variable
  • Line and path plots typically used for time series data (see Appendix: Line vs. path plots)
    • Time dimension is shown on the x-axis from left to right
  • Line plots (geom_line()): join the points from left to right
    • Have time on the x-axis, showing how a single variable has changed over time
  • Path plots (geom_path()): join them in the order that they appear in the dataset (in other words, a line plot is a path plot of the data sorted by x value)
  • Below we’ll also use gtrends() from the gtrendsR package to obtain search frequencies.
  • And we’ll use pivot_wider() from the tidyr package as well as as.Date() for conversion to a date variable (see here for formats)
  • new_scale_color(): Can be used to reset a scale if we want to generate several legends (ggnewscale package)
    • And we’ll use scale modification to show proper legends in line plots
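
To illustrate the date conversion and the difference between line and path plots, here is a small base R sketch. The data frame and the German-style date strings are made up for illustration.

```r
# Toy data in arbitrary row order, dates stored as text (day.month.year)
raw <- data.frame(
  day  = c("03.03.2018", "01.03.2018", "02.03.2018"),
  hits = c(1, 4, 4)
)

# as.Date() needs the format that matches the text representation
raw$date <- as.Date(raw$day, format = "%d.%m.%Y")

# geom_path() would connect the points in the row order of `raw`;
# geom_line() connects them in the order of `sorted` (by x value)
sorted <- raw[order(raw$date), ]
sorted
```

Once the column has class Date, ggplot2 treats it as a continuous date scale on the x-axis.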

5.6.2 Graph

  • Here we’ll reproduce Figure 18 (but with ggplot2) (Bauer et al. 2020) (see also Bauer & Clemm von Hohenberg 2022)
  • Questions:
    • What does the graph show? What are the underlying variables (and data)?
    • How many scales/mappings does it use? Could we reduce them?
    • What do you like, what do you dislike about the figure? What is good, what is bad?
    • What kind of information could we add to the graph (if any)?
    • How would you approach a replication of the graph?


Figure 18: Lines (trends) and events



5.6.3 Lab: Data & code

We’ll start by preparing the data.

  1. We download data on Google searches from Google Trends. Table 3 shows the first few rows. Currently, gtrendsR access to Google is buggy (see here for error code 429). Hence, the data below are loaded from a local file.
library(gtrendsR)
# Words to search for
search.words <- c("GDPR", "DSGVO") # Does not work anymore


# Download google trends
    # google.trends <- gtrends(search.words, 
    #                          gprop = "web", 
    #                          time = "2018-03-01 2018-11-16", 
    #                          geo = "DE")[[1]]
#write_csv(google.trends, "data_google_trends.csv")
google.trends <- read_csv("www/data_google_trends.csv")


google.trends <- google.trends %>% 
    pivot_wider(names_from = c("keyword", "geo"), values_from = "hits") %>%
    dplyr::select(-time, -gprop, -category)

# Replace "<1" with 0 (funs() is deprecated, hence across())
google.trends <- google.trends %>%  
        mutate(across(everything(), ~ str_replace(.x, "<1", "0")))

# Convert date variable to 'Date' class
google.trends$date <- as.Date(google.trends$date, "%Y-%m-%d")

# Convert the remaining character columns to numeric
google.trends <- google.trends %>%
  mutate(across(where(is.character), as.numeric))
Table 3: Google trends data (first rows)
date GDPR_DE DSGVO_DE
2018-03-01 1 4
2018-03-02 1 4
2018-03-03 0 1
2018-03-04 0 1
2018-03-05 1 5
2018-03-06 2 5
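
The reshaping steps can be tried out on a toy dataset. The sketch below assumes a long-format input with the columns date, hits, keyword and geo (a simplified, hypothetical version of what gtrends() returns) and reproduces the column names shown in Table 3. Here the "<1" entries are recoded before reshaping, which avoids converting the date column to text.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format input: one row per date and keyword
toy_trends <- tibble(
  date    = as.Date(c("2018-03-01", "2018-03-02", "2018-03-01", "2018-03-02")),
  hits    = c("1", "<1", "4", "4"),
  keyword = c("GDPR", "GDPR", "DSGVO", "DSGVO"),
  geo     = "DE"
)

toy_wide <- toy_trends %>%
  # "<1" makes hits a character column; recode it to 0 and convert to numeric
  mutate(hits = as.numeric(sub("<1", "0", hits, fixed = TRUE))) %>%
  # One column per keyword/territory combination, e.g. GDPR_DE
  pivot_wider(names_from = c(keyword, geo), values_from = hits)

toy_wide
```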



  2. We plot the data and add our own annotations in Figure 19. Let’s go through the code together.

The code for a simple line plot is as follows:

# The simple line plot
ggplot(data = google.trends) +
      geom_line(aes(x = date, y = DSGVO_DE), 
                    color = "black") +
      geom_line(aes(x = date, y = GDPR_DE), 
                    color = "blue")

The more complicated version is below.

library(ggnewscale)
# The complicated version
ggplot(data = google.trends) +
      geom_rect(aes(fill = "fieldperiod"), 
                xmin = as.Date("2018-04-16", "%Y-%m-%d"), 
                xmax = as.Date("2018-04-23", "%Y-%m-%d"), 
                ymin = 0, ymax = 100, alpha = 0.2) +
        geom_rect(aes(fill = "fieldperiod"),
                  xmin = as.Date("2018-07-24", "%Y-%m-%d"), 
                  xmax = as.Date("2018-08-02", "%Y-%m-%d"), 
                ymin = 0, ymax = 100, alpha = 0.2) +
        geom_rect(aes(fill = "fieldperiod"),
                  xmin = as.Date("2018-10-29", "%Y-%m-%d"), 
                  xmax = as.Date("2018-11-07", "%Y-%m-%d"), 
                ymin = 0, ymax = 100, alpha = 0.2) +
      geom_line(aes(x = date, y = DSGVO_DE, color = "dsgvocolor")) +
      geom_line(aes(x = date, y = GDPR_DE, color = "gdprcolor")) +
      theme_light() +
      ylab("Searches (100 = max. interest in time period/territory)") +
      xlab("Month (2018)") +
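      scale_colour_manual(name="Google Searches", values=c(gdprcolor = "darkgreen",
                                                  dsgvocolor = "black"),
                          # breaks fixes the legend order so that each label matches its line
                          breaks = c("gdprcolor", "dsgvocolor"),
                          labels = c("GDPR Searches",
                                     "DSGVO Searches")) +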
      scale_fill_manual(name="Field periods", 
                        values=c(fieldperiod="gray"),
                        labels = c("Wave 1, 2 and 3")) +
      new_scale_color() +
           scale_colour_manual(name="Events", 
                               values=c(Policy_implementation = "red"),
                               labels = c("Policy implementation (25th of May)")) +
      geom_vline(aes(xintercept = as.Date("2018-05-25"), color = "Policy_implementation")) + 
  theme(
    legend.position = c(.95, .95),
    legend.justification = c("right", "top"),
    legend.box.just = "right",
    legend.margin = margin(6, 6, 6, 6), 
    legend.background = element_rect(fill=alpha('white', 0.8)))
Figure 19: Lines (trends) and events
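
The core of the manual legend is that a constant string placed inside aes() is treated like a variable with one level, so ggplot2 draws a legend entry for it. A minimal sketch of just this mechanism, using the six rows from Table 3 as toy data, could look as follows. Setting breaks explicitly is a design choice here: it pins the legend order, so the labels cannot end up attached to the wrong line.

```r
library(ggplot2)

# The first rows of Table 3 as a small stand-alone data frame
toy <- data.frame(
  date     = as.Date("2018-03-01") + 0:5,
  DSGVO_DE = c(4, 4, 1, 1, 5, 5),
  GDPR_DE  = c(1, 1, 0, 0, 1, 2)
)

p <- ggplot(toy, aes(x = date)) +
  # The strings are arbitrary keys, not color names
  geom_line(aes(y = DSGVO_DE, colour = "dsgvocolor")) +
  geom_line(aes(y = GDPR_DE, colour = "gdprcolor")) +
  scale_colour_manual(
    name   = "Google Searches",
    values = c(dsgvocolor = "black", gdprcolor = "darkgreen"), # key -> color
    breaks = c("gdprcolor", "dsgvocolor"),                     # legend order
    labels = c("GDPR searches", "DSGVO searches")              # same order as breaks
  )
p
```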



5.6.4 Exercise

  1. Use the code from above and investigate Google searches for two other topics (e.g. “COVID” and “Hydroxychloroquine”). Choose a sensible time period for your search. And choose a sensible geographic area (e.g., geo = "US").
  2. Reshape and clean the data (following the steps above) so that you can visualize it as a line plot in ggplot.
  3. Add events to your lineplots (e.g., one could take one of Trump’s tweets as an event).
  4. Try to visualize a legend (it’s challenging!).

5.6.5 Newer graph: Salience of events across time

As published in Bauer & Clemm von Hohenberg (2022). Currently, gtrendsR access to Google is buggy (see here for error code 429).

  • Potentially, it might be difficult to collect older data from Google trends.
library(tidyverse)
library(lubridate) # week(), floor_date(), ceiling_date()
library(gtrendsR)  # gtrends()

# Words to search for
search.words <- c("Einwanderung", "Flüchtlinge", "Asyl", "Migration", "Lagerfeld")

# Download google trends
data_google_trends <- gtrends(search.words,
                                                          gprop = "web",
                                                          time = "2014-12-31 2019-06-03",
                                                          geo = "DE", 
                                                            onlyInterest = TRUE
                                                            )[[1]]



data_google_trends <- data_google_trends %>%
  pivot_wider(names_from = c("keyword", "geo"), values_from = "hits") %>%
  dplyr::select(-time, -gprop, -category)

# Replace "<1" with 0 (funs() is deprecated, hence across())
data_google_trends <- data_google_trends %>%
  mutate(across(everything(), ~ str_replace(.x, "<1", "0")))

# Convert date variable
data_google_trends$date <- as.Date(data_google_trends$date, "%Y-%m-%d")

# Convert the remaining character columns to numeric
data_google_trends <- data_google_trends %>%
  mutate(across(where(is.character), as.numeric))

# Replace umlauts in column names (e.g., "Flüchtlinge_DE" -> "Fluechtlinge_DE")
names(data_google_trends) <- str_replace_all(names(data_google_trends), "ü", "ue")

# Aggregate to weekly means
data_google_trends <- data_google_trends %>%
  mutate(
    week = week(date),
    week_start = floor_date(date, "weeks", week_start = 1),
    week_end = ceiling_date(date, "weeks", week_start = 1)
  ) %>%
  group_by(week_start) %>%
  summarise(
    date = first(date),
    week_end = first(week_end),
    Einwanderung_DE = mean(Einwanderung_DE),
    Fluechtlinge_DE = mean(Fluechtlinge_DE),
    Asyl_DE = mean(Asyl_DE),
    Migration_DE = mean(Migration_DE),
    Lagerfeld_DE = mean(Lagerfeld_DE)
  )



ggplot(
  data = data_google_trends,
  aes(x = week_start)
) +
  geom_rect(aes(fill = "fieldperiod"),
    xmin = as.Date("2019-03-14", "%Y-%m-%d"),
    xmax = as.Date("2019-03-29", "%Y-%m-%d"),
    ymin = 0, ymax = 100, alpha = 0.2
  ) +
  geom_line(aes(x = week_start, y = Einwanderung_DE, color = "Einwanderung")) +
  geom_line(aes(x = week_start, y = Fluechtlinge_DE, color = "Fluechtlinge")) +
  geom_line(aes(x = week_start, y = Asyl_DE, color = "Asyl")) +
  geom_line(aes(x = week_start, y = Migration_DE, color = "Migration")) +
  geom_line(aes(x = week_start, y = Lagerfeld_DE, color = "Lagerfeld")) +
  theme_light() +
  ylab("Searches (100 = max. interest\nin time period/territory)") +
  xlab("Week (weekly averages)") +
  scale_colour_manual(name = "Search terms", values = c(
    Einwanderung = "darkgreen",
    Fluechtlinge = "black",
    Asyl = "red",
    Migration = "yellow",
    Lagerfeld = "orange"
  )) +
  scale_fill_manual(
    name = "Data collection",
    values = c(fieldperiod = "gray"),
    labels = c("Field period")
  ) +
  new_scale_color() + # Add a new scale (ignore previous color scale)
  geom_vline(aes(
    xintercept = as.Date("2015-09-07"),
    linetype = "dashed"
  )) +
  geom_vline(aes(
    xintercept = as.Date("2015-12-31"),
    linetype = "dotted"
  )) +
  geom_vline(aes(
    xintercept = as.Date("2019-02-19"),
    linetype = "twodash"
  )) +
  scale_linetype_manual(
    name = "Events",
    values = c(
      "dashed",
      "dotted",
      "twodash"
    ),
    labels = c(
      "Refugee crisis\n(Summer 2015)",
      "New Year's Eve assaults\n(2015/12/31)",
      "Lagerfeld's death\n(2019/02/19)"
    )
  ) +
  scale_x_date(
    date_breaks = "8 weeks",
    date_labels = "%Y-%m-%d" # ,
    # limits = c(as.Date("2018-12-31"), as.Date("2019-06-03"))
  ) +
  theme(
    legend.position = c(.80, .99),
    legend.justification = c("right", "top"),
    legend.box.just = "right",
    legend.box = "horizontal",
    legend.direction = "vertical",
    legend.margin = margin(6, 6, 6, 6),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 7),
    legend.title = element_text(size = 9),
    legend.text = element_text(size = 8),
    legend.background = element_rect(fill = adjustcolor("white", alpha.f = 0.7)),
    legend.key = element_rect(fill = adjustcolor("white", alpha.f = 0.7), color = NA),
    legend.key.size = unit(0.6, "cm")
  )
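
The weekly aggregation step can be illustrated in isolation. The sketch below uses base R and made-up daily values; the week start computed here (the Monday of each week) is meant to mirror what floor_date(date, "weeks", week_start = 1) returns in the code above.

```r
# Two weeks of made-up daily search values (2019-02-18 is a Monday)
daily <- data.frame(
  date = seq(as.Date("2019-02-18"), as.Date("2019-03-03"), by = "day"),
  hits = c(rep(10, 7), rep(20, 7))
)

# "%u" returns the weekday as a number with Monday = 1, so subtracting
# (weekday - 1) days moves every date back to the Monday of its week
daily$week_start <- daily$date - (as.integer(format(daily$date, "%u")) - 1)

# Average the daily values within each week
weekly <- aggregate(hits ~ week_start, data = daily, FUN = mean)
weekly
```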

5.7 Time: Means across time (or other categories)

  • Learning outcomes: Learn…
    • …how to plot error bars
    • …how to dodge graph elements

5.7.1 Data & Packages & functions

  • Data: Various one-dimensional distributions (several single variables)
  • Plot type: Dot plot with error bars
  • geom_errorbar(): To create error bars
  • position=position_dodge(0.6): Dodge graph elements

5.7.2 Graph

  • We’ll reproduce and maybe criticize as well as improve Figure 20
  • Questions:
    • What does the graph show? What are the underlying variables (and data)?
    • How many scales/mappings does it use? Could we reduce them?
    • What do you like, what do you dislike about the figure? What is good, what is bad?
    • What kind of information could we add to the graph (if any)?
    • How would you approach a replication of the graph?


Figure 20: Means across time/categories



5.7.3 Lab: Data & Code

The data has already been pre-processed, i.e., we have a dataframe that contains both our means as well as 90% and 95% confidence intervals for different subsamples of the data (as well as the full sample). The subsamples are constructed from information on whether certain respondents participated across all waves or not. The dataframe also provides information on how these means should be grouped.

# data_gdpr_means_time.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                          "1Ay7g1iIaCyxuDj2ce8UNc4kSRubForMN"))
data <- read_csv("data/data_gdpr_means_time.csv")
pd <- position_dodge(0.6)
ggplot(data, aes(x = wave, 
                 y = gdpr.know.num.mean, 
                 color = factor(label), 
                 group = factor(label))) +
       # 90% interval: thick bar without caps
       geom_errorbar(aes(ymin=gdpr.know.num.mean - ci_90, 
                         ymax=gdpr.know.num.mean + ci_90), 
                colour="black",
                size = 1,
                width=.0, 
                position=pd) + 
       # 95% interval: thin bar with small caps
       geom_errorbar(aes(ymin=gdpr.know.num.mean - ci_95,
                         ymax=gdpr.know.num.mean + ci_95), 
                colour="black",
                size = 0.4,
                width=.1, 
                position=pd) +
       geom_point(size = 3,
                  position=pd) +
  scale_shape(solid = FALSE) +
  ylim(0, 100) +
  ylab("% GDPR Awareness") +
  scale_x_discrete(labels = c(
    "Wave 1 (N = 2093)\nApr 16 - 23, 2018",
    "Wave 2 (N = 2043)\nJul 24 - Aug 02, 2018",
    "Wave 3 (N = 2112)\nOct 29 - Nov 07, 2018"
  )) +
  theme_light() +
  theme(axis.title.x = element_blank()) + scale_color_manual(
    values = c("black", "#e41a1c", "#377eb8", "#4daf4a", "#984ea3", "#ff7f00"),
    name = "Participation",
    breaks = levels(factor(data$label)),
    labels = c(
      "Full Sample",
      "Only W1 (N = 532)",
      "Only W2 (N = 482)",
      "Only W3 (N = 843)",
      "W1 and W2 (N = 292)",
      "W1, W2 and W3 (N = 1269)"
    )
  )
Figure 21: Means across time/categories
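
The plotting code adds and subtracts ci_90 and ci_95 from the mean, so these columns hold interval half-widths. How they were computed is not shown here; the sketch below illustrates one common way, assuming a normal approximation and simulated raw data (both are assumptions, not the original pre-processing).

```r
set.seed(42)

# Hypothetical raw data: awareness coded 0/100 for 500 respondents in one wave
aware <- sample(c(0, 100), size = 500, replace = TRUE, prob = c(0.4, 0.6))

m  <- mean(aware)                      # mean in percent
se <- sd(aware) / sqrt(length(aware))  # standard error of the mean

# Half-widths of the intervals under a normal approximation
ci_90 <- qnorm(0.95) * se
ci_95 <- qnorm(0.975) * se

c(mean = m, lower_95 = m - ci_95, upper_95 = m + ci_95)
```

Repeating this per wave and subsample would yield one row per plotted point.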

5.8 Time: Slope charts

5.8.1 Data & Packages & functions

  • Data: Panel data with time points on one variable
  • newggslopegraph() function from the CGPfunctions package
  • Examples are taken from the vignette of that package

5.8.2 Graph(s)

  • Figure 22 (Data from 2002) and Figure 23 depict slope graphs for the data in Table 4 and Table 5.
  • Questions:
    • What does the graph show? What are the underlying variables (and data)?
    • How many scales/mappings does it use? Could we reduce them?
    • What do you like, what do you dislike about the figure? What is good, what is bad?
    • What kind of information could we add to the graph (if any)?
    • How would you approach a replication of the graph?



Table 4: Data (first rows)
Year Survival Type
5 Year 99 Prostate
10 Year 95 Prostate
15 Year 87 Prostate
20 Year 81 Prostate
5 Year 96 Thyroid
10 Year 96 Thyroid
Figure 22: Slope graph 1



Table 5: Data (first rows)
Year Country GDP
Year1970 Sweden 46.9
Year1979 Sweden 57.4
Year1970 Netherlands 44.0
Year1979 Netherlands 55.8
Year1970 Norway 43.5
Year1979 Norway 52.2
Figure 23: Slope graph 2

5.8.3 Lab: Data & Code

library(CGPfunctions) # provides newggslopegraph() and the example data newcancer, newgdp
head(newcancer %>% select(Year, Survival, Type))
newggslopegraph(dataframe = newcancer,
                Times = Year,
                Measurement = Survival,
                Grouping = Type,
                Title = "Estimates of Percent Survival Rates",
                SubTitle = "Based on: Edward Tufte, Beautiful Evidence, 174, 176.",
                Caption = NULL
                )



head(newgdp)
custom_colors <- tidyr::pivot_wider(newgdp, 
                   id_cols = Country, 
                   names_from = Year, 
                   values_from = GDP) %>% 
  mutate(difference = Year1979 - Year1970) %>%
  mutate(trend = case_when(
    difference >= 2 ~ "green",
    difference <= -1 ~ "red",
    TRUE ~ "gray"
    )
  ) %>%
  select(Country, trend) %>%
  tibble::deframe()
newggslopegraph(newgdp, 
                Year, 
                GDP, 
                Country, 
                Title = "Gross GDP", 
                SubTitle = NULL, 
                Caption = NULL,
                LineThickness = .5,
                YTextSize = 4,
                LineColor = custom_colors
)
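
The object custom_colors passed to LineColor above is a named character vector that maps each country to a color. As a sketch of what the pivot_wider()/case_when()/deframe() pipeline produces, the same logic can be written in base R for the three countries shown in Table 5.

```r
# The rows of Table 5 in wide format
gdp <- data.frame(
  Country  = c("Sweden", "Netherlands", "Norway"),
  Year1970 = c(46.9, 44.0, 43.5),
  Year1979 = c(57.4, 55.8, 52.2),
  stringsAsFactors = FALSE
)

# Same thresholds as in the case_when() call above
difference <- gdp$Year1979 - gdp$Year1970
trend <- ifelse(difference >= 2, "green",
                ifelse(difference <= -1, "red", "gray"))

# Named vector: names are the groups, values are the line colors
custom_colors <- setNames(trend, gdp$Country)
custom_colors
```

All three countries gain more than two points between 1970 and 1979, so each is assigned "green".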

5.9 Sankey diagrams

  • Google: “A sankey diagram is a visualization used to depict a flow from one set of values to another. The things being connected are called nodes and the connections are called links. Sankeys are best used when you want to show a many-to-many mapping between two domains (e.g., universities and majors) or multiple paths through a set of stages (for instance, Google Analytics uses sankeys to show how traffic flows from pages to other pages on your web site).”
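
To make the terms nodes and links concrete, here is a minimal sketch of the data behind a Sankey diagram, using the universities-and-majors example from the quote. All names and numbers are made up; plotting packages typically expect a table of links similar to this one, but the exact input format depends on the package.

```r
# Links: one row per flow from a source node to a target node
links <- data.frame(
  source = c("University A", "University A", "University B"),
  target = c("Sociology", "Statistics", "Statistics"),
  value  = c(30, 20, 50),
  stringsAsFactors = FALSE
)

# Nodes: every distinct value that appears as source or target
nodes <- data.frame(name = unique(c(links$source, links$target)))

# Total flow arriving at each target node (the width of its incoming bands)
inflow <- tapply(links$value, links$target, sum)
inflow
```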

References

Bauer, Paul C, Frederic Gerdon, Florian Keusch, and Frauke Kreuter. 2020. “The Impact of the GDPR Policy on Data Sharing/Privacy Attitudes.” Preliminary Draft, 1–22.
Bauer, Paul C, and Andrei Poama. 2020. “Does Suffering Suffice? An Experimental Assessment of Desert Retributivism.” PLoS One 15 (4): e0230304.
Wickham, Hadley. 2010. “A Layered Grammar of Graphics.” J. Comput. Graph. Stat. 19 (1): 3–28.
———. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer.

Footnotes

  1. Boxplot: A boxplot graphically represents the distribution of a dataset by depicting its median, lower quartile (25th percentile) and upper quartile (75th percentile) as a central rectangular box, whiskers that extend to the most extreme values within 1.5 times the interquartile range from the box, and individual data points for potential outliers beyond the whiskers. (Adapted from ChatGPT).↩︎

  2. geom_density(): Underlying computations are more complex and rest on assumptions that are not true for all data (continuous, unbounded, and smooth) → use the others (Wickham 2016, 23).↩︎

  3. The figure was published in Bauer and Poama (2020), which is based on a survey experiment studying the effect of an offender’s suffering on perceived justice of punishment. The figure shows individual-level data on socio-demographics.↩︎

  4. Focus on scale categories, distributions, information etc.↩︎

  5. Data: Data is four categorical variables with the same ordered categories. In the graph they are combined, i.e., the four variables are combined into two categorical variables: Platform with unordered categories and Account with 3 ordered categories.↩︎

  6. We use both the x-scale and color for the same mapping namely platforms. This could be reduced.↩︎

  7. One could also think of choosing colors that are also discernible when printed in grayscale instead of luminance.↩︎

  8. Data: Data is four categorical variables with the same ordered categories. In the graph they are combined, i.e., the four variables are combined into two categorical variables: Platform with unordered categories and Account with 3 ordered categories.↩︎