Appendix C: Data examples & more stuff

1 Data examples

1.1 Long-format for visualization

  • Long format lends itself to ggplot2 visualization
    • Each mapping (x, y, shape, color) corresponds to exactly one variable (column)
  • Discuss example here..
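
As a minimal sketch of the point above, the following reshapes a small wide table into long format and maps each aesthetic to one column. The data frame `wide` and its columns (`platform`, `y2019`, `y2020`, and the derived `year` and `share`) are made up for illustration and are not part of the course data.

```r
library(tidyr)
library(ggplot2)

# Hypothetical wide data: one row per platform, one column per year
wide <- data.frame(
  platform = c("Email", "Facebook", "Twitter"),
  y2019 = c(10, 40, 25),
  y2020 = c(12, 35, 30)
)

# Long format: one row per platform-year combination
long <- pivot_longer(wide, cols = c(y2019, y2020),
                     names_to = "year", values_to = "share")

# Each aesthetic (x, y, color, shape) is mapped to one variable
p <- ggplot(long, aes(x = year, y = share,
                      color = platform, shape = platform)) +
  geom_point(size = 3)
p
```

In the wide table the years are spread over two columns, so there is no single variable that could be mapped to x; after reshaping, `year` fills that role.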

2 More stuff

2.1 Summary tables with sparklines (CHECK!)



  • Fascinating, more recent stuff: sparklines (see footnote 1)

  • Sparklines in tables

    • The datasummary_skim() function in the modelsummary package
library(tidyverse)
library(modelsummary)
# data_twitter_influence.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                                 "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"))

data <- readr::read_csv("data/data_twitter_influence.csv",
                 col_types = cols())
# Numeric data
datasummary_skim(data, type = "numeric", output = "html")
                     Unique (#)  Missing (%)     Mean       SD   Min   Median       Max
n_retweets                  237            0    429.4   2332.7   0.0     36.0   48568.0
followers_count             485            0  13184.0  51672.6  12.0   2647.5  693125.0
account_age_months          504            0     84.1     40.7   5.0     87.3     143.6
account_age_years           504            0      7.0      3.4   0.4      7.3      12.0
female                        2            0      0.3      0.5   0.0      0.0       1.0
# Categorical data (we first have to convert the variables to factors)
datasummary_skim(data %>% mutate(party = factor(party),
                                 female = factor(female)), 
                 type = "categorical", output = "html")
                         N      %
party        AfD        76   15.1
             CDU_CSU   131   26.0
             DieLinke   58   11.5
             FDP        73   14.5
             Greens     61   12.1
             SPD       105   20.8
party_color  black     131   26.0
             blue       76   15.1
             deeppink   58   11.5
             gold       73   14.5
             green      61   12.1
             red       105   20.8
female       0         351   69.6
             1         153   30.4

2.2 SKIP - Escaping ordering hell!

Load the data:

# data_sharing_frequency_summarized.csv
# data_plot <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                               "17EJuNwAwIhReW4j6v1w6FowzMjWJXbxH"))
data_plot <- read_csv("data/data_sharing_frequency_summarized.csv")

The old plot:

ggplot(data_plot, aes(x = variable, y = value)) +
  geom_bar(
    stat = "identity",
    width = 0.7,
    position = position_dodge(width = 0.8),
    aes(fill = variable, alpha = category)) +
  geom_text(
    position = position_dodge(width = 0.8),
    aes(alpha = category,
        label = paste(value, "%", sep = "")
    ),
    vjust = 1.6,
    color = "black",
    size = 2
  ) +
  scale_fill_discrete(name = "Platform") +
  scale_alpha_discrete(name = "Sharing frequency",
    range = c(1, 0.5)) +
  xlab("Platforms") +
  ylab("Percentage (%)") +
  theme_light()
Figure 1: Distribution of four categorical variables

Rename the category names:

data_plot <- data_plot %>%
  mutate(variable = recode(variable,
                           "sharing_email" = "Email",
                           "sharing_fb" = "Facebook",
                           "sharing_twitter" = "Twitter",
                           "sharing_whatsapp" = "Whatsapp"),
         category = recode(category,
                           "pct.Seltener" = "Rarer",
                           "pct.EinpaarMalimJahr" = "A few times a year",
                           "pct.EinpaarMalimMonat" = "A few times a month",
                           "pct.EinmalproWoche" = "Once a week",
                           "pct.Taeglich" = "Daily"))

str(data_plot)
tibble [20 × 3] (S3: tbl_df/tbl/data.frame)
 $ variable: chr [1:20] "Email" "Facebook" "Twitter" "Whatsapp" ...
 $ category: chr [1:20] "Daily" "Daily" "Daily" "Daily" ...
 $ value   : num [1:20] 7 7 7 17 67 53 52 46 8 11 ...

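Because the data are already aggregated, we always use stat = "identity" from here on: the bar heights are taken directly from the value column instead of being counted from the rows.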
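
A small sketch of what stat = "identity" does, using a made-up two-row data frame (`agg` is hypothetical). It also shows geom_col(), which is ggplot2's shorthand for the same thing; the plots in this section keep the geom_bar() form.

```r
library(ggplot2)

# Hypothetical pre-aggregated data: one row per bar
agg <- data.frame(platform = c("Email", "Facebook"), value = c(7, 17))

# Bar heights are read from `value` rather than counted
p_identity <- ggplot(agg, aes(x = platform, y = value)) +
  geom_bar(stat = "identity")

# geom_col() uses the identity stat by default
p_col <- ggplot(agg, aes(x = platform, y = value)) +
  geom_col()

# Extract the bar heights that each layer actually draws
heights_identity <- layer_data(p_identity)$y
heights_col <- layer_data(p_col)$y
```

Both `heights_identity` and `heights_col` hold the values stored in `agg$value`, so the two plots are the same.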

  • Data without factor variables: ggplot2 has to decide the order itself, so character categories are ordered alphabetically!
ggplot(data_plot, aes(x = variable, y = value)) +
  geom_bar(
    stat = "identity",
    width = 0.7,
    position = position_dodge(width = 0.8),
    aes(fill = variable, alpha = category)) +
  geom_text(
    position = position_dodge(width = 0.8),
    aes(alpha = category,
        label = paste(value, "%", sep = "")
    ),
    vjust = 1.6,
    color = "black",
    size = 2
  ) +
  scale_fill_discrete(name = "Platform") +
  scale_alpha_discrete(name = "Sharing frequency",
    range = c(1, 0.5)) +
  xlab("Platforms") +
  ylab("Percentage (%)") +
  theme_light()
Figure 2: Distribution of four categorical variables
  • Now we want to change the order of “Platform” to c("Facebook", "Twitter", "Email", "Whatsapp")
  • and the order of “Sharing frequency” to c("Daily", "Once a week", "A few times a month", "A few times a year", "Rarer").
    • It is sufficient to convert each variable to a factor and to assign its levels in the desired order. The factor does not have to be an ordered factor; ggplot2 takes the order of axes and legends from the levels.
data_plot$variable <- factor(data_plot$variable, 
                             levels = c("Facebook",
                                        "Twitter",
                                        "Email",
                                        "Whatsapp"))
levels(data_plot$variable)
[1] "Facebook" "Twitter"  "Email"    "Whatsapp"
data_plot$category <- factor(data_plot$category, 
                             levels = c("Daily", 
                                        "Once a week", 
                                        "A few times a month", 
                                        "A few times a year", 
                                        "Rarer"))
levels(data_plot$category)
[1] "Daily"               "Once a week"         "A few times a month"
[4] "A few times a year"  "Rarer"              
ggplot(data_plot, aes(x = variable, y = value)) +
  geom_bar(
    stat = "identity",
    width = 0.7,
    position = position_dodge(width = 0.8),
    aes(fill = variable, alpha = category)) +
  geom_text(
    position = position_dodge(width = 0.8),
    aes(alpha = category,
        label = paste(value, "%", sep = "")
    ),
    vjust = 1.6,
    color = "black",
    size = 2
  ) +
  scale_fill_discrete(name = "Platform") +
  scale_alpha_discrete(name = "Sharing frequency",
    range = c(1, 0.5)) +
  xlab("Platforms") +
  ylab("Percentage (%)") +
  theme_light()
Figure 3: Distribution of four categorical variables
  • Insights
    • Do all the labeling in the data that you create for the plot (here: data_plot)
    • Unordered factors are sufficient to tell ggplot2 about a non-alphabetical order
      • If you want to reorder categories, do so in the data that you feed to the plot (and inspect the data)
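
The following base R sketch illustrates the second insight on a made-up character vector (`x` is hypothetical): without explicit levels the order is alphabetical, with explicit levels it is whatever we specify, and the factor remains unordered.

```r
# Hypothetical categories stored as character strings
x <- c("Email", "Facebook", "Twitter", "Whatsapp")

# Default conversion: levels are sorted alphabetically
lv_default <- levels(factor(x))

# Explicit levels fix the order; ordered = TRUE is not needed
f <- factor(x, levels = c("Facebook", "Twitter", "Email", "Whatsapp"))
lv_custom <- levels(f)

# The factor is still an unordered (nominal) factor
is_ordered <- is.ordered(f)
```

Calling levels() on the plot data before plotting is a quick way to check which order ggplot2 will use.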

2.3 SKIP: Time: Wave participation & time-point presence

2.3.1 Data & Packages & functions

  • Plot type: Stacked bar plot
  • tidyr::expand(): To create observations/rows for non-observed variable combinations

2.3.2 Graph

  • Here we’ll reproduce, and possibly criticize and improve, Figure 4 (Bauer et al. 2020)
  • Questions:
    • What does the graph show? What are the underlying variables (and data)?
    • How many scales/mappings does it use? Could we reduce them?
    • What do you like, what do you dislike about the figure? What is good, what is bad?
    • What kind of information could we add to the graph (if any)?
    • How would you approach a replication of the graph?


Figure 4: Presence/participation at/in different time points/waves



2.3.3 Lab: Data & Code

  • The code for Figure 4 is shown below (and creates Figure 5).

  • Learning objectives

    • How to make stacked barplots
    • How to expand data

We’ll start by preparing the data for our plot. As you can see below, the data are already in long format and contain an individual identifier, pid, as well as two variables that hold the same information (the wave identifier) in different formats: wave.num and wave.

If you want, you can move directly to the plot…

# data_wave_participation.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                          "1Y9z1shAjyaHgqpOxwt2T8uSoe-3RW-WI"),
#                  col_types = cols())
data <- read_csv("data/data_wave_participation.csv")
head(data)
# A tibble: 6 × 3
        pid wave.num wave  
      <dbl>    <dbl> <chr> 
1 421518540        1 Wave 1
2 441620046        1 Wave 1
3 454072144        1 Wave 1
4 477478244        1 Wave 1
5 481214044        1 Wave 1
6 453648542        1 Wave 1
nrow(data)
[1] 6258



We expand the data, creating a new data frame that we join with the original one. This way we end up with a data frame that contains explicit missing values (NA) for every respondent \(\times\) wave combination that was not observed.

# Expand to get dataset with rows for non observations
data.expand <- data %>% tidyr::expand(pid, wave.num)
head(data.expand)
# A tibble: 6 × 2
        pid wave.num
      <dbl>    <dbl>
1 401008246        1
2 401008246        2
3 401008246        3
4 401008443        1
5 401008443        2
6 401008443        3
nrow(data.expand)
[1] 10269
# Right join with the long-format data to recover the actual presence of respondents
data.expand <- data %>% right_join(data.expand, by = c("pid", "wave.num")) %>% arrange(pid, wave.num)
head(data.expand)
# A tibble: 6 × 3
        pid wave.num wave  
      <dbl>    <dbl> <chr> 
1 401008246        1 <NA>  
2 401008246        2 <NA>  
3 401008246        3 Wave 3
4 401008443        1 Wave 1
5 401008443        2 <NA>  
6 401008443        3 <NA>  
nrow(data.expand)
[1] 10269
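
As a sketch on made-up data, the expand-then-join logic can be checked on a tiny panel. The tibble `toy` is hypothetical. The last step uses tidyr::complete(), which is not used in the original code but performs the expansion and the join in one call.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy panel: respondent 1 took part in waves 1 and 2,
# respondent 2 only in wave 3
toy <- tibble(
  pid = c(1, 1, 2),
  wave.num = c(1, 2, 3),
  wave = c("Wave 1", "Wave 2", "Wave 3")
)

# Two-step version: build all pid x wave.num combinations, then join back
two_step <- toy %>%
  right_join(tidyr::expand(toy, pid, wave.num), by = c("pid", "wave.num")) %>%
  arrange(pid, wave.num)

# One-step version: complete() adds the missing combinations directly
one_step <- toy %>% complete(pid, wave.num)
```

Both results have one row per respondent and wave (2 × 3 = 6 rows), with NA in wave wherever the respondent was not observed.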



Next, we go through several steps to summarize participation across waves and to delete the categories with the smallest numbers (participants only in W2/W3 (N = 3) and only in W1/W3 (N = 2)). If you like, you can skip this whole part and go directly to the resulting plot data below.

# Subset variables
#data.expand <- data.expand %>% select(pid, wave.num, wave) # %>% distinct()

# Spread dataset and arrange
data.expand <- data.expand %>% 
               pivot_wider(names_from = wave.num, values_from = wave) %>% 
               arrange(pid)

# Rename wave variables
data.expand <- rename(data.expand, 
                      wave1 = "1", 
                      wave2 = "2", 
                      wave3 = "3")

# Create "across_waves" with information of presence in single waves
data.expand <- unite(data.expand, across_waves, -pid)

# Aggregate to get observations per presence in different waves
data.expand <- data.expand %>% group_by(across_waves) %>% summarize(n = n())

# Separate united variable
data.expand <- data.expand %>% separate(across_waves, c("Wave1", "Wave2", "Wave3"), sep = "_")

# Replace values of wave variables with N values
data.expand$Wave1[data.expand$Wave1 != "NA"] <- data.expand$n[data.expand$Wave1 != "NA"]
data.expand$Wave2[data.expand$Wave2 != "NA"] <- data.expand$n[data.expand$Wave2 != "NA"]
data.expand$Wave3[data.expand$Wave3 != "NA"] <- data.expand$n[data.expand$Wave3 != "NA"]

# Delete groups only W2/W3 (N = 3) and only W1/W3 (N = 2)
data.expand <- data.expand %>% filter(n > 5) %>% select(-n)

# Reshape to long format for the barplot illustrating the samples across waves
data.expand <- pivot_longer(data.expand, Wave1:Wave3, names_to = "wave", values_to = "samples")
data.expand$samples <- as.numeric(data.expand$samples)

data.expand$samples_labels <- dplyr::recode(data.expand$samples,
  "532" = "Only W1 (N = 532)",
  "292" = "W1 and W2 (N = 292)",
  "1269" = "W1, W2 and W3 (N = 1269)",
  "482" = "Only W2 (N = 482)",
  "843" = "Only W3 (N = 843)"
)

data.expand <- data.expand %>% filter(!is.na(samples))
data.expand <- data.expand %>% arrange(wave)
data_plot <- data.expand
data_plot
# A tibble: 8 × 3
  wave  samples samples_labels          
  <chr>   <dbl> <chr>                   
1 Wave1     532 Only W1 (N = 532)       
2 Wave1     292 W1 and W2 (N = 292)     
3 Wave1    1269 W1, W2 and W3 (N = 1269)
4 Wave2     482 Only W2 (N = 482)       
5 Wave2     292 W1 and W2 (N = 292)     
6 Wave2    1269 W1, W2 and W3 (N = 1269)
7 Wave3     843 Only W3 (N = 843)       
8 Wave3    1269 W1, W2 and W3 (N = 1269)
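
The counts per participation pattern that the steps above derive via unite() and separate() could also be computed more directly. The following is only a sketch on made-up data (`toy` and the pattern labels such as "W1+W2" are hypothetical), not the code used for the figure.

```r
library(dplyr)

# Hypothetical long-format panel: one row per respondent and attended wave
toy <- tibble(
  pid = c(1, 1, 1, 2, 3, 3),
  wave.num = c(1, 2, 3, 1, 1, 2)
)

# Collapse each respondent's waves into one pattern string, then count patterns
patterns <- toy %>%
  arrange(pid, wave.num) %>%
  group_by(pid) %>%
  summarise(pattern = paste0("W", wave.num, collapse = "+"),
            .groups = "drop") %>%
  count(pattern, name = "n")
```

Here `patterns` has one row per pattern ("W1", "W1+W2", "W1+W2+W3") with the number of respondents in column n.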

Finally, we plot the participation across waves in Figure 5.
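
The plotting code itself is not shown here. A minimal sketch of a stacked bar plot for these data could look as follows; the data frame is re-entered from the printed tibble above so that the block runs on its own, and the layout choices (labels, width, theme) are assumptions rather than the settings of the original figure.

```r
library(ggplot2)

# Plot data re-entered from the printed tibble above
data_plot <- data.frame(
  wave = c("Wave1", "Wave1", "Wave1", "Wave2", "Wave2", "Wave2",
           "Wave3", "Wave3"),
  samples = c(532, 292, 1269, 482, 292, 1269, 843, 1269),
  samples_labels = c("Only W1 (N = 532)", "W1 and W2 (N = 292)",
                     "W1, W2 and W3 (N = 1269)", "Only W2 (N = 482)",
                     "W1 and W2 (N = 292)", "W1, W2 and W3 (N = 1269)",
                     "Only W3 (N = 843)", "W1, W2 and W3 (N = 1269)")
)

# One bar per wave, stacked by participation pattern
p <- ggplot(data_plot, aes(x = wave, y = samples, fill = samples_labels)) +
  geom_col(position = "stack", width = 0.7) +
  # Place each count in the middle of its bar segment
  geom_text(aes(label = samples, group = samples_labels),
            position = position_stack(vjust = 0.5), size = 3) +
  labs(x = "Wave", y = "Number of respondents",
       fill = "Participation pattern") +
  theme_light()
p
```

Because the data are already aggregated, geom_col() (equivalent to geom_bar(stat = "identity")) is used, and position_stack(vjust = 0.5) keeps the text labels aligned with the stacked segments.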

Figure 5: Presence/participation at/in different time points/waves

2.3.4 Exercise

  • Try to produce such a graph with a panel survey that you are currently using. Store the panel data in long format, keep only the participant ID and the wave number, rename these to pid and wave.num, and then start with the code.

2.4 Another animation

Bauer, Paul C, Frederic Gerdon, Florian Keusch, and Frauke Kreuter. 2020. “The Impact of the GDPR Policy on Data Sharing/Privacy Attitudes.” Preliminary Draft, 1–22.

Footnotes

  1. See the sparkline package and http://rstudio-pubs-static.s3.amazonaws.com/237078_159e5997707f44d69e63f71efb38481d.html.