## 5.2 Summary statistics/graphs for a paper

### 5.2.1 Data & Packages & functions

• geom_histogram(): Histograms show the distribution of a single numeric variable (frequency polygons: geom_freqpoly())
  • Provide a lot of information about the distribution but need more space than boxplots
  • Bin the data, then count the number of observations in each bin (then show either bars or lines)
  • binwidth: Controls the width of the bins (ALWAYS experiment with it; by default the data are split into 30 bins)
  • breaks: Sets manual bin cutoffs
• geom_bar(): Barplots for categorical variables
• Recommendations:
  • Use a consistent strategy to visualize or indicate missings
    • Histograms: Show the number of missings as an annotation
    • Barplots: Show the number of missings as one category (the last category, to the right of the bars ordered by size)
    • Survey data: Sometimes it makes sense to be more specific and also show “refusals”, “don’t know” etc.
  • All the final labels should already be created in the data you prepare for the plot
    • Then there is less room for manual errors
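The recommendations above can be sketched on toy data. The data frame `toy`, its columns, the bin width of 5, and the manual cutoffs are all made up for illustration; they are not the course data or the settings behind Figure 5.2. The sketch counts the missings in the data step, shows them as an annotation in the histogram, and shows them as the last category in the barplot.

```r
library(ggplot2)

# Hypothetical toy data (NOT the course data): 5 missing ages, 10 missing religions
set.seed(1)
toy <- data.frame(
  age = c(sample(18:80, 95, replace = TRUE), rep(NA, 5)),
  religion = c(sample(c("Catholic", "Protestant", "None"), 90,
                      replace = TRUE, prob = c(0.5, 0.3, 0.2)),
               rep(NA, 10)),
  stringsAsFactors = FALSE
)

# Histogram: build the missing-value label from the data, not by hand
n_missing_age <- sum(is.na(toy$age))
label_missing <- paste0("Missing: ", n_missing_age)

p_hist <- ggplot(subset(toy, !is.na(age)), aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "gray") +  # assumed width; try e.g. 2 or 10
  annotate("text", x = Inf, y = Inf, label = label_missing,
           hjust = 1.1, vjust = 1.5) +           # top-right corner of the panel
  labs(x = "Age", y = "N") +
  theme_light()

# Same variable with manual bin cutoffs instead of a fixed width
p_breaks <- ggplot(subset(toy, !is.na(age)), aes(x = age)) +
  geom_histogram(breaks = c(18, 30, 45, 60, 81), fill = "gray") +
  labs(x = "Age", y = "N") +
  theme_light()

# Barplot: make "Missing" an explicit category and place it last,
# after the observed categories ordered by size
counts <- sort(table(toy$religion), decreasing = TRUE)  # table() drops NA by default
toy$religion_fac <- factor(
  ifelse(is.na(toy$religion), "Missing", toy$religion),
  levels = c(names(counts), "Missing")
)

p_bar <- ggplot(toy, aes(x = religion_fac)) +
  geom_bar(fill = "gray") +
  labs(x = "Religion", y = "N") +
  theme_light()
```

Printing `p_hist`, `p_breaks`, or `p_bar` draws the plots. Because the label and the factor levels are created in the data, the same strategy for missings carries over to every plot built from `toy`.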

### 5.2.2 Graph

• Here we’ll reproduce (criticize and improve) part of Figure 5.2.
• Questions:
  • What does the graph show? What are the underlying variables (and data)?
  • How many scales/mappings does it use? Could we reduce them?
  • What do you like, what do you dislike about the figure? What is good, what is bad?
  • What kind of information could we add to the graph (if any)?
  • How would you approach a replication of the graph?

Figure 5.2: Graph summarizing a dataset

### 5.2.3 Lab: Data & Code

• The code for Figure 5.2 is shown below (and creates Figure 5.3).
• Learning objectives:
  • How to use different geoms
  • How to plot ordered variables (and order categories)
  • How to add summary statistics
  • How to use various other arguments
  • How to search for solutions online

We start by inspecting the data with View(data) and str(data): What do we see?

```r
library(ggplot2)
library(dplyr)
library(gridExtra)

# data_bauer_poama.csv
# data <- …("1QeDCki2Auu3OZK7qMRh2xbq0u7n1Fy_8"))  # start of the import call is missing

# View(data)
# str(data)

# Age ####
p1 <- ggplot(data, aes(x = age)) +
  geom_histogram(binwidth = 2, fill = "gray") +
  labs(x = "Age", y = "N") +
  theme_light()

# Religion ####
# Get categories ranked by frequency (most frequent first)
levels_ranked <- data %>%
  dplyr::select(religion) %>%
  count(religion) %>%
  arrange(desc(n)) %>%
  dplyr::pull(religion)
# Create factor with the levels in that order
data$religion_fac <- factor(data$religion,
                            levels = levels_ranked)

# Map the column name, not data$religion_fac, inside aes()
p2 <- ggplot(data, aes(x = religion_fac)) +
  geom_bar(fill = "gray") +
  labs(x = "Religion", y = "N") +
  theme_light() +
  theme(axis.text.x = element_text(angle = 40,
                                   hjust = 1,
                                   vjust = 1,
                                   margin = margin(0.2, 0, 0.3, 0, "cm")),
        plot.title = element_text(hjust = 0.5),
        plot.margin = margin(0.5, 0.5, 0, 0.5, "cm"))

# Employment status ####
data$employment_status_fac <- factor(data$employment_status)
p3 <- ggplot(data, aes(x = employment_status_fac)) +
  geom_bar(fill = "gray") +
  labs(x = "Employment status", y = "N") +
  theme_light() +
  theme(axis.text.x = element_text(angle = 35,
                                   hjust = 1,
                                   vjust = 1,
                                   margin = margin(0.2, 0, 0.3, 0, "cm")),
        plot.title = element_text(hjust = 0.5),
        plot.margin = margin(0.5, 0.5, 0, 0.5, "cm"))

# Arrange the three plots side by side
grid.arrange(p1, p2, p3, ncol = 3)
```

Figure 5.3: Describing various variables in a sample
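The frequency-ordering step used for the religion plot can be checked in isolation. The sketch below applies the same count, arrange, pull pattern to a made-up vector (`toy` and `status` are hypothetical names, not course data) and compares it with `fct_infreq()` from the forcats package, which is an alternative not used in the lab code.

```r
library(dplyr)
library(forcats)

# Hypothetical toy data (NOT the course data); counts: c = 3, a = 2, b = 1
toy <- data.frame(status = c("a", "c", "b", "c", "a", "c"),
                  stringsAsFactors = FALSE)

# Pattern from the lab code: rank categories by frequency, then set the levels
levels_ranked <- toy %>%
  count(status) %>%
  arrange(desc(n)) %>%
  pull(status)
toy$status_fac <- factor(toy$status, levels = levels_ranked)

# Alternative: forcats orders the levels by frequency in one step
toy$status_fac2 <- fct_infreq(toy$status)

levels(toy$status_fac)   # "c" "a" "b"
levels(toy$status_fac2)  # "c" "a" "b"
```

Both routes give the same level order here. With tied counts the two may break ties differently, so it is worth checking the levels before plotting.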

### 5.2.4 Exercise

• We’ll split into teams and improve Figure 5.3 together.
1. Use the code of the subplot that we assign to you (p1, p2, p3, p4).
   1. Histogram of age:
      • Order categories (see how it’s done in the code for the religion plot)
   3. Barplot of religion: Reduce rare categories to one category called “Other religion”.
Note on geom_density(): The underlying computations are more complex and rest on assumptions that are not true for all data (continuous, unbounded, and smooth), so prefer histograms or frequency polygons (Wickham 2016, 23).
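For the religion barplot task in the exercise, one possible approach is to recode rare categories in the data before plotting. This is a minimal sketch on made-up data, not the intended solution: the data frame `toy`, the cutoff `min_n`, and the choice of base R for the recoding are all assumptions.

```r
library(ggplot2)

# Hypothetical toy data (NOT the course data)
toy <- data.frame(
  religion = c(rep("Catholic", 5), rep("Protestant", 4),
               "Muslim", "Jewish", "Buddhist"),
  stringsAsFactors = FALSE
)

min_n <- 3  # assumed cutoff: categories with fewer cases are treated as rare

counts <- table(toy$religion)
rare <- names(counts)[counts < min_n]

# Recode rare categories in the data so the plot label comes from the data
toy$religion_lumped <- ifelse(toy$religion %in% rare,
                              "Other religion", toy$religion)

# Order the remaining categories by size and put "Other religion" last
kept <- names(sort(counts[counts >= min_n], decreasing = TRUE))
toy$religion_fac <- factor(toy$religion_lumped,
                           levels = c(kept, "Other religion"))

p_lumped <- ggplot(toy, aes(x = religion_fac)) +
  geom_bar(fill = "gray") +
  labs(x = "Religion", y = "N") +
  theme_light()
```

The cutoff is a judgment call and depends on the data. A count-based rule like this one keeps the decision visible in the code instead of hiding it in manual relabeling.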