5.2 Summary statistics/graphs for a paper
5.2.1 Data & Packages & functions
geom_histogram(): Histograms show the distribution of a single numeric variable (frequency polygons: geom_freqpoly())
- Provide a lot of information about the distribution but need more space than boxplots
- Bin the data, then count the number of observations in each bin (then show either bars or lines)
binwidth: Controls the width of the bins (ALWAYS experiment with this; the default is 30 bins, not a bin width of 30)
breaks: Set manual bin cutoffs
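The effect of these two arguments is easiest to see side by side. The following is a minimal sketch on simulated ages; the data frame `toy` and the chosen cutoffs are illustrative assumptions and not part of the survey data used in this chapter.

```r
library(ggplot2)

# Simulated ages (hypothetical data, not the survey used in this chapter)
set.seed(1)
toy <- data.frame(age = round(runif(500, min = 18, max = 80)))

# Default: 30 bins; ggplot2 prints a message asking you to pick a better value
p_default <- ggplot(toy, aes(x = age)) + geom_histogram()

# binwidth: each bin covers 5 years
p_binwidth <- ggplot(toy, aes(x = age)) + geom_histogram(binwidth = 5)

# breaks: manual cutoffs, here four unequal age groups
p_breaks <- ggplot(toy, aes(x = age)) +
  geom_histogram(breaks = c(18, 30, 45, 65, 80))

# Frequency polygon: same binning idea, drawn as a line instead of bars
p_freqpoly <- ggplot(toy, aes(x = age)) + geom_freqpoly(binwidth = 5)
```

Printing the four plot objects one after another shows how strongly the impression of the distribution depends on the binning.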
geom_bar(): Barplots for categorical variables
- Use a consistent strategy to visualize or indicate missing values
- Histograms: Show the number of missing values as an annotation
- Barplots: Show the number of missing values as a separate category (the last category, to the right of the bars ordered by size)
- Survey data: Sometimes it could make sense to be more specific and also show “refusals”, “don’t know” etc.
- All the final labels should already be created in the data you create for the plot
- Then there is less room for manual errors
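The points above on missing values and on preparing labels in the data could be sketched as follows. This is only one possible approach: the data frame `toy`, the label "Missing", and the use of a plot caption as the annotation are assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with missing values (not the chapter's survey data)
toy <- tibble(
  age    = c(23, 35, 41, NA, 58, 62, NA, 30),
  status = c("Employed", "Student", NA, "Employed",
             "Retired", "Employed", NA, "Student")
)

# Histogram: report the number of missing values as an annotation (caption)
n_missing_age <- sum(is.na(toy$age))
p_hist <- ggplot(filter(toy, !is.na(age)), aes(x = age)) +
  geom_histogram(binwidth = 10, fill = "gray") +
  labs(x = "Age", y = "N",
       caption = paste("Missing values:", n_missing_age)) +
  theme_light()

# Barplot: create the final labels and their order in the data,
# sorted by size with "Missing" as the last level
status_counts <- toy %>%
  mutate(status = if_else(is.na(status), "Missing", status)) %>%
  count(status) %>%
  arrange(status == "Missing", desc(n)) %>%
  mutate(status = factor(status, levels = status))

p_bar <- ggplot(status_counts, aes(x = status, y = n)) +
  geom_col(fill = "gray") +
  labs(x = "Employment status", y = "N") +
  theme_light()
```

Because the category order and the "Missing" label live in `status_counts`, the plotting code itself needs no manual relabeling.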
- Here we’ll reproduce (criticize and improve) part of Figure 5.2 [31].
- What does the graph show? What are the underlying variables (and data)?
- How many scales/mappings does it use? Could we reduce them?
- What do you like, what do you dislike about the figure? What is good, what is bad? [32]
- What kind of information could we add to the graph (if any)?
- How would you approach a replication of the graph?
5.2.3 Lab: Data & Code
- The code for Figure 5.2 is shown below (and creates Figure 5.3).
- Learning objectives
- How to use different geoms
- How to plot ordered variables (and order categories)
- How to add summary statistics
- How to use various other arguments
- How to search for solutions online
We start by inspecting the data with str(data): What do we see?
# Packages: read_csv() and %>% come with the tidyverse, grid.arrange() with gridExtra
library(tidyverse)
library(gridExtra)

# data_bauer_poama.csv
data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
                         "1QeDCki2Auu3OZK7qMRh2xbq0u7n1Fy_8"))
# View(data)
# str(data)

# Age ####
p1 <- ggplot(data, aes(x = age)) +
  geom_histogram(binwidth = 2, fill = "gray") +
  labs(x = "Age", y = "N") +
  theme_light()

# Religion ####
# Get categories ranked by frequency (most frequent first)
levels_ranked <- data %>%
  dplyr::select(religion) %>%
  count(religion) %>%
  arrange(desc(n)) %>%
  dplyr::pull(religion)

# Create factor with levels in ranked order
data$religion_fac <- factor(data$religion, levels = levels_ranked)

# Refer to the column by name inside aes(), not via data$
p2 <- ggplot(data, aes(x = religion_fac)) +
  geom_bar(fill = "gray") +
  labs(x = "Religion", y = "N") +
  theme_light() +
  theme(axis.text.x = element_text(angle = 40, hjust = 1, vjust = 1,
                                   margin = margin(0.2, 0, 0.3, 0, "cm")),
        plot.title = element_text(hjust = 0.5),
        plot.margin = margin(0.5, 0.5, 0, 0.5, "cm"))

# Employment status ####
data$employment_status_fac <- factor(data$employment_status)

p3 <- ggplot(data, aes(x = employment_status_fac)) +
  geom_bar(fill = "gray") +
  labs(x = "Employment status", y = "N") +
  theme_light() +
  theme(axis.text.x = element_text(angle = 35, hjust = 1, vjust = 1,
                                   margin = margin(0.2, 0, 0.3, 0, "cm")),
        plot.title = element_text(hjust = 0.5),
        plot.margin = margin(0.5, 0.5, 0, 0.5, "cm"))

# Arrange the three plots side by side
grid.arrange(p1, p2, p3, ncol = 3)
- We’ll split into teams and improve Figure 5.3 together.
- Use the code of the subplot that we assign to you (p1, p2, p3, p4).
- Histogram of age:
- Add additional statistics (mean, median, sd, N, missings)
- Barplot of employment status
- Add labels with absolute (relative) number for the bars
- Order categories (see how it’s done in the code for the religion barplot)
- Barplot of religion: Reduce rare categories to a single combined category
- Histogram of age:
- Explain to others what you did to improve your plot.
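As a starting point for the team tasks, here is a minimal sketch of two of the improvements: summary statistics on a histogram, and a barplot with lumped rare categories plus absolute (relative) labels. The same labeling approach would carry over to the employment status barplot. It runs on a simulated stand-in (`toy`) rather than the survey data, and the category names, the label "Other", the choice to keep three categories, and the use of the forcats package are all assumptions for illustration, not the solution used for the published figure.

```r
library(dplyr)
library(ggplot2)
library(forcats)

# Simulated stand-in for the survey data (hypothetical values)
set.seed(42)
toy <- tibble(
  age = c(round(rnorm(195, mean = 45, sd = 15)), rep(NA, 5)),
  religion = rep(c("None", "Catholic", "Protestant", "Muslim", "Jewish", "Hindu"),
                 times = c(90, 50, 40, 10, 6, 4))
)

# Histogram of age: compute the statistics once, then show them in the plot
age_stats <- toy %>%
  summarise(mean = mean(age, na.rm = TRUE),
            median = median(age, na.rm = TRUE),
            sd = sd(age, na.rm = TRUE),
            n = sum(!is.na(age)),
            missing = sum(is.na(age)))
stats_label <- sprintf("Mean = %.1f, Median = %.1f, SD = %.1f, N = %d, Missing = %d",
                       age_stats$mean, age_stats$median, age_stats$sd,
                       age_stats$n, age_stats$missing)
p_age <- ggplot(filter(toy, !is.na(age)), aes(x = age)) +
  geom_histogram(binwidth = 2, fill = "gray") +
  geom_vline(xintercept = age_stats$mean, linetype = "dashed") +
  labs(x = "Age", y = "N", caption = stats_label) +
  theme_light()

# Barplot of religion: keep the 3 most frequent categories, lump the rest,
# order bars by size with "Other" last, and build the labels in the data
religion_counts <- toy %>%
  mutate(religion = fct_lump_n(religion, n = 3, other_level = "Other")) %>%
  count(religion) %>%
  mutate(label = sprintf("%d (%.0f%%)", n, 100 * n / sum(n)),
         religion = fct_relevel(fct_reorder(religion, n, .desc = TRUE),
                                "Other", after = Inf))
p_religion <- ggplot(religion_counts, aes(x = religion, y = n)) +
  geom_col(fill = "gray") +
  geom_text(aes(label = label), vjust = -0.3) +
  labs(x = "Religion", y = "N") +
  theme_light()
```

Keeping the statistics and labels in small summary data frames makes it easy to check the numbers before they appear in the figure.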
[31] The figure was published in Bauer and Poama (2020), which is based on a survey experiment studying the effect of an offender’s suffering on the perceived justice of punishment. The figure shows individual-level data on socio-demographics.
[32] Focus on scale categories, distributions, information etc.