5.5 Numeric vs. categorical: Various plot types
5.5.1 Data & Packages & functions
- Data: 1 categorical variable, 1 numeric variable
- Packages & functions:
- geom_jitter() offers the same control over aesthetics as geom_point() (size, color, shape)
- geom_boxplot(), geom_violin(): You can control the outline color or the internal fill color
- Strengths and weaknesses
- Boxplots summarize the bulk of the distribution with only five numbers
- Jittered plots show every point but only work with relatively small datasets
- Violin plots give the richest display, but rely on the calculation of a density estimate, which can be hard to interpret
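To make the "five numbers" behind a boxplot concrete, here is a minimal sketch in base R. The vector x is made up for illustration and is not the Twitter data used in the lab. It uses fivenum() and boxplot.stats(), which belong to base R, not ggplot2.

```r
# Hypothetical toy vector (not the course data): nine small values and one extreme one
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)

# boxplot.stats() returns the values a base R boxplot draws
bs <- boxplot.stats(x)
whiskers_and_box <- bs$stats  # whisker ends, hinges and median
outliers <- bs$out            # points more than 1.5 box lengths beyond the hinges

print(five)
print(whiskers_and_box)
print(outliers)
```

The extreme value appears in the plain five-number summary but is flagged as an outlier by boxplot.stats(), so the upper whisker stops at the largest remaining point. geom_boxplot() computes its hinges as quartiles, so its values can differ slightly from fivenum() in small samples.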
5.5.2 Graph
- Figure 5.9 shows different ways of plotting a numeric variable against a categorical variable.
- Questions:
- What does the graph show? What are the underlying variables (and data)?
- How many scales/mappings does it use? Could we reduce them?
- What do you like, what do you dislike about the figure? What is good, what is bad?
- What kind of information could we add to the graph (if any)?
- How would you approach a replication of the graph?
## Parsed with column specification:
## cols(
## screen_name = col_character(),
## n_retweets = col_double(),
## followers_count = col_double(),
## party = col_character(),
## party_color = col_character(),
## first_name = col_character(),
## account_created_at = col_datetime(format = ""),
## account_age_months = col_double(),
## account_age_years = col_double(),
## last_name = col_character(),
## female = col_double()
## )

Figure 5.9: Boxplots and jittered points
5.5.3 Lab: Data & Code
# Load packages: tidyverse (read_csv, %>%, filter, ggplot2) and gridExtra (grid.arrange)
library(tidyverse)
library(gridExtra)

# data_twitter_influence.csv
data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
                         "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"),
                 col_types = cols())

# Keep rows with fewer than 15,000 retweets
data_plot <- data %>% filter(n_retweets < 15000)

# Same mapping in every panel; only the geom changes
p1 <- ggplot(data_plot, aes(x = party, y = n_retweets)) + geom_point() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))
p2 <- ggplot(data_plot, aes(x = party, y = n_retweets)) + geom_jitter() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))
p3 <- ggplot(data_plot, aes(x = party, y = n_retweets)) + geom_boxplot() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))
p4 <- ggplot(data_plot, aes(x = party, y = n_retweets)) + geom_violin() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

# Arrange the four panels in a 2 x 2 grid
grid.arrange(p1, p2, p3, p4, ncol = 2)
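As a possible variation on the lab code, the sketch below layers jittered points over a boxplot and sets a few of the aesthetics mentioned in 5.5.1. The data frame toy and the chosen colors are assumptions made for illustration. They are not part of data_twitter_influence.csv and not necessarily how Figure 5.9 was produced.

```r
library(ggplot2)

# Hypothetical toy data (not the course data): three groups of 40 skewed values
set.seed(42)
toy <- data.frame(
  group = rep(c("A", "B", "C"), each = 40),
  value = c(rexp(40, rate = 1 / 100),
            rexp(40, rate = 1 / 300),
            rexp(40, rate = 1 / 200))
)

p <- ggplot(toy, aes(x = group, y = value)) +
  # Boxplot: color sets the outline, fill sets the interior;
  # outlier markers are hidden because the jitter layer draws every point
  geom_boxplot(color = "grey30", fill = "grey90", outlier.shape = NA) +
  # Jittered points: spread horizontally only, so the y values stay exact
  geom_jitter(width = 0.2, height = 0, size = 1.5, shape = 16,
              color = "steelblue", alpha = 0.6)

print(p)
```

Setting height = 0 keeps the numeric values unchanged, so the points remain comparable with the boxplot summary drawn beneath them.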