5.2 Summary statistics/graphs for a paper

5.2.1 Data & Packages & functions

  • geom_histogram(): Histograms show the distribution of a single numeric variable (frequency polygons: geom_freqpoly()) [1]
    • Provide a lot of information about the distribution but need more space than boxplots
    • Bin the data, then count the number of observations in each bin (then show either bars or lines)
      • binwidth: Control the width of bins (ALWAYS experiment with it; if you set nothing, the default is 30 bins via the bins argument, not a bin width of 30)
      • breaks: Set manual bin cutoffs
  • geom_bar(): Barplots for categorical variables
  • Recommendations:
    • Use a consistent strategy to visualize or indicate missing values
    • Histograms: Show the number of missing values as an annotation
    • Barplots: Show the number of missing values as one category (placed last, to the right of the other bars, which are ordered by size)
    • Survey data: Sometimes it could make sense to be more specific and also show “refusals”, “don’t know” etc.
    • All the final labels should already be created in the data you create for the plot
      • Then there is less room for manual errors
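The histogram options and the recommendations on missing values above can be sketched as follows. This is a minimal sketch on simulated data: the data frame `df` and its variables `age` and `employment` are hypothetical and not taken from the data behind the figure, and the bin width of 5 is only a starting point to experiment with.

```r
library(ggplot2)

set.seed(1)
# Hypothetical example data (NOT the data used for the figure in this section)
df <- data.frame(
  age = c(round(runif(195, 18, 80)), rep(NA, 5)),
  employment = c(
    sample(c("Employed", "Unemployed", "Retired"), 190,
           replace = TRUE, prob = c(0.6, 0.1, 0.3)),
    rep(NA, 10)
  ),
  stringsAsFactors = FALSE
)

# Histogram: set binwidth explicitly instead of relying on the default of 30 bins,
# and report the number of missing values as a text annotation
n_missing_age <- sum(is.na(df$age))
p_hist <- ggplot(df[!is.na(df$age), ], aes(x = age)) +
  geom_histogram(binwidth = 5, boundary = 0) +
  annotate("text", x = Inf, y = Inf, hjust = 1.1, vjust = 1.5,
           label = paste0("Missing: ", n_missing_age)) +
  labs(x = "Age", y = "Count")

# Barplot: create the final labels in the data, order categories by size,
# and put "Missing" last (to the right of the ordered bars)
counts <- table(df$employment, useNA = "no")
level_order <- names(sort(counts, decreasing = TRUE))
df$employment_plot <- ifelse(is.na(df$employment), "Missing",
                             as.character(df$employment))
df$employment_plot <- factor(df$employment_plot,
                             levels = c(level_order, "Missing"))

p_bar <- ggplot(df, aes(x = employment_plot)) +
  geom_bar() +
  labs(x = "Employment status", y = "Count")
```

Printing `p_hist` or `p_bar` draws the plots. Because the category order and the "Missing" label live in the data rather than in the plotting call, the same labels can be reused in tables without retyping them.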

5.2.2 Graph

  • Here we’ll reproduce (criticize and improve) part of Figure 5.2 [2].
  • Questions:
    • What does the graph show? What are the underlying variables (and data)?
    • How many scales/mappings does it use? Could we reduce them?
    • What do you like, what do you dislike about the figure? What is good, what is bad? [3]
    • What kind of information could we add to the graph (if any)?
    • How would you approach a replication of the graph?



Figure 5.2: Graph summarizing a dataset



5.2.3 Lab: Data & Code

  • The code for Figure 5.2 is shown below (and creates Figure 5.3).
  • Learning objectives
    • How to use different geoms
    • How to plot ordered variables (and order categories)
    • How to add summary statistics
    • How to use various other arguments
    • How to search for solutions online

We start by inspecting the data with View(data) and str(data): What do we see?
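A first inspection could look roughly like the sketch below. The object `data` here is a small hypothetical stand-in with made-up values, not the lab's actual dataset; `View(data)` is left as a comment because it opens an interactive viewer.

```r
# Hypothetical stand-in for the lab's `data` object (made-up values)
data <- data.frame(
  age = c(34, 51, NA, 27),
  employment = factor(c("Employed", "Retired", "Employed", NA)),
  religion = c("Catholic", "None", "Protestant", "None"),
  stringsAsFactors = FALSE
)

# View(data)          # interactive spreadsheet-style viewer (e.g. in RStudio)
str(data)             # variable types and the first few values
summary(data)         # ranges, plus NA counts for numeric and factor columns

# Number of missing values per variable
na_counts <- colSums(is.na(data))
na_counts
```

Things to look for are the type of each variable (numeric, factor, character), whether categories are already labelled, and how missing values are coded.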


Figure 5.3: Describing various variables in a sample

5.2.4 Exercise

  • We’ll split into teams and improve Figure 5.3 together.
  1. Use the code of the subplot that we assign to you (p1, p2, p3, p4).
    1. Histogram of age
    2. Barplot of employment status
    3. Barplot of religion: Collapse rare categories into a single category called “Other religion”.
  2. Explain to others what you did to improve your plot.
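One possible way to collapse rare categories for the religion barplot is sketched below in base R. The vector `religion`, its values, and the threshold `min_n` are hypothetical choices for illustration, not part of the original data or code.

```r
library(ggplot2)

# Hypothetical religion variable (made-up counts)
religion <- c(rep("None", 50), rep("Catholic", 30), rep("Protestant", 15),
              rep("Muslim", 3), rep("Hindu", 1), rep("Jewish", 1))

# Assumption: categories with fewer than `min_n` observations count as rare
min_n <- 5
counts <- table(religion)
rare <- names(counts)[counts < min_n]
religion_lumped <- ifelse(religion %in% rare, "Other religion", religion)

# Order the remaining categories by frequency and put "Other religion" last
lumped_counts <- sort(table(religion_lumped), decreasing = TRUE)
level_order <- c(setdiff(names(lumped_counts), "Other religion"),
                 "Other religion")
plot_df <- data.frame(religion = factor(religion_lumped, levels = level_order))

p_religion <- ggplot(plot_df, aes(x = religion)) +
  geom_bar() +
  labs(x = "Religion", y = "Count")
```

The forcats package offers helpers for the same task (for example `fct_lump_min()`); the base R version is used here only to keep the sketch free of extra dependencies.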

  1. geom_density(): The underlying computations are more complex and rest on assumptions that do not hold for all data (continuous, unbounded, and smooth), so prefer histograms or frequency polygons (Wickham 2016, 23).

  2. The figure was published in Bauer and Poama (2020), which is based on a survey experiment studying the effect of an offender’s suffering on the perceived justice of punishment. The figure shows individual-level data on socio-demographics.

  3. Focus on scale categories, distributions, information etc.