# Grammar of graphics & ggplot2

• Learning outcomes: Learn…
• …about the grammar of graphics.
• …about the various components of a graph.
• …how to build graphs from these components.

Sources: Original material; Wickham (2010)

# 1 Grammar

• Grammar answers the question: What is a statistical graphic?
• “The Grammar of Graphics” (Wilkinson)
• “A Layered Grammar of Graphics” (Wickham 2010)
• Good grammar
• provides insights into the composition of complicated graphics
• reveals unexpected connections between seemingly different graphics
• is just the first step in creating a good sentence
• Grammar tells us that
• a statistical graphic is a mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars)
• A plot may contain statistical transformations and is drawn on a specific coordinate system
• Faceting can be used to generate the same plot for different subsets of the dataset
• It is the combination of these independent components that makes up a graphic
• A graph can be grammatically correct but nonsensical
• Grammar = conceptually useful even if you don’t use ggplot2 (e.g., if you use Plotly)

# 2 Grammar components & ggplot2

## 2.1 Grammar components & ggplot2 (1)

• Let’s go into detail
• Data & mappings
• data: Information you want to visualize
• mappings: describe how data’s variables are mapped to aesthetic attributes (e.g., x-axis) that you can perceive
• There are five mapping components
• (1) Layers
• Geometric objects (geoms): what you actually see on the plot: points, lines, etc.
• Statistical transformations (stats): Summarize data in many useful ways (e.g., a linear model)
• (2) Scales: map values in the data space to values in an aesthetic space
• e.g., color, size or shape
• Scales draw legend or axes (allow us to decode graph)
• (3) Coordinate system (coord): describes how data coordinates are mapped to the plane of the graphic
• Provides axes/gridlines to make it possible to read the graph
• (4) Faceting specification (facet): how to break up data and visualize subsets
• (5) Theme: Control finer points of display (e.g., font, background color etc.)
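The components listed above can be sketched in a single ggplot2 call. This is a minimal illustration, not the code behind the figures in these notes: it assumes the mpg dataset that ships with ggplot2, and the choice of variables, palette and facet variable is arbitrary.

```r
library(ggplot2)

# Data & mappings: mpg ships with ggplot2; displ -> x, hwy -> y
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  # (1) Layers: a geom (points) and a stat (linear model fit)
  geom_point(aes(colour = class)) +
  stat_smooth(method = "lm", formula = y ~ x) +
  # (2) Scales: control the x-axis breaks and the colour legend
  scale_x_continuous(breaks = seq(2, 7, 1)) +
  scale_colour_brewer(palette = "Dark2") +
  # (3) Coordinate system: Cartesian is the default, stated explicitly here
  coord_cartesian() +
  # (4) Faceting: one panel per drive type
  facet_wrap(~drv) +
  # (5) Theme: non-data elements such as background and fonts
  theme_light()

p
```

Removing one line at a time and re-running the call is a quick way to see what each component contributes.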

## 2.2 Grammar components & ggplot2 (2): Exercise

• See some of the components in Figure 1 below that visualizes the data in Table 1.
• Q: How many/which variables are mapped to which aesthetic attributes, i.e., x, y, color and statistical summary in Figure 1?
• Q: Where do the components (Layers with geometric objects and stat. transformations; Scales with axes and legend: color, size, shape; Coordinate system; Theme; Faceting) turn up in the graph below?

## 2.3 Grammar components & ggplot2 (3): Exercise

• See some of the components in Figure 2 below.
• Q: How could we read the code?
• Q: How would we execute that code to understand it?
• Q: How can we best remember what a function/package/argument does?
• Q: How do you write down code to make it readable?

Code:

ggplot(data, aes(x = av_fulltime_salary, y = population)) +
geom_point(aes(color = factor(av_age_fac))) + # Point layer
stat_smooth(method = "lm") + # Statistical transformation layer
theme_light() + # Theme
xlab("Yearly full time salary (average)") + # Lab
ylab("Population") + # Lab
guides(color = guide_legend(title = "Age (average) in 4 groups")) # Legend

## 2.4 Grammar doesn’t

• Grammar doesn’t
• …suggest what graphics you should use to answer the research questions
• …specify what a graphic should look like (cf. theming system)
• …specify how to make an attractive graphic (cf. defaults should be sensible)
• …describe interaction (only static graphs)

## 2.5 Steps of ggplot visualizing

1. data = ... step: Tell the ggplot() function what our data is
2. mapping = aes(...) step: Tell ggplot() what relationships we want to see
3. Choose a geom step: Tell ggplot how we want to see the relationships in our data
4. Layer on geoms as needed, by adding them one at a time
5. Use the scale_*() family, labs() and guides() functions to adjust scales, labels and legends.
• We can either store plot object and add things (p <- p + ...) or concatenate everything with +.
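The steps above can be sketched as follows. This is an illustrative example only: it assumes the mpg dataset shipped with ggplot2, and the axis labels and limits are chosen for the example rather than taken from these notes.

```r
library(ggplot2)

# Steps 1 and 2: data and mappings (only empty axes are drawn so far)
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# Step 3: choose a geom
p <- p + geom_point()

# Step 4: layer on a further geom
p <- p + geom_smooth(method = "lm", formula = y ~ x)

# Step 5: scales and labels
p <- p +
  scale_y_continuous(limits = c(0, 50)) +
  labs(x = "Engine displacement (litres)", y = "Highway miles per gallon")

# The same plot written as a single chain with +
p_chain <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  scale_y_continuous(limits = c(0, 50)) +
  labs(x = "Engine displacement (litres)", y = "Highway miles per gallon")

p
```

Storing the intermediate object makes it easy to print the plot after each step; the chained version is more compact once the plot is settled.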

## 2.6 Explore ggplot2 object: Exercise

• Once stored we can explore our ggplot2 object (check class with class(p))
• p$...: Explore different elements of plot object
• e.g., p$data, p$layers, p$mapping, p$labels etc.
# data_eurostat.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                          "1Sv7QqkLRAUEspM58dL-kZF2nrr6pl3PH"),
#                  col_types = cols())
data <- read_csv("data/data_eurostat.csv",
                 col_types = cols())

p <- ggplot(data, aes(x = av_fulltime_salary, y = population)) +
  geom_point(aes(color = factor(av_age_fac))) + # Point layer
  stat_smooth(method = "lm")
• Q: What can you decipher from the summary below?
summary(p) # Describe structure of plot object
data: country_name, country, av_fulltime_salary, av_age_fac, population
  [24x5]
mapping:  x = ~av_fulltime_salary, y = ~population
faceting: <ggproto object: Class FacetNull, Facet, gg>
    compute_layout: function
    draw_back: function
    draw_front: function
    draw_labels: function
    draw_panels: function
    finish_data: function
    init_scales: function
    map_data: function
    params: list
    setup_data: function
    setup_params: function
    shrink: TRUE
    train_scales: function
    vars: function
    super:  <ggproto object: Class FacetNull, Facet, gg>
-----------------------------------
mapping: colour = ~factor(av_age_fac)
geom_point: na.rm = FALSE
stat_identity: na.rm = FALSE
position_identity

geom_smooth: se = TRUE, na.rm = FALSE, orientation = NA
stat_smooth: method = lm, formula = NULL, se = TRUE, n = 80, fullrange = FALSE, level = 0.95, na.rm = FALSE, orientation = NA, method.args = list(), span = 0.75
position_identity

• Important: When we layer on a ggplot2 plot object p <- p + ... we simply change these elements
• e.g., we could change the axes of an existing plot that is stored in p

And we can also access sub elements:

p$layers
[[1]]
mapping: colour = ~factor(av_age_fac)
geom_point: na.rm = FALSE
stat_identity: na.rm = FALSE
position_identity

[[2]]
geom_smooth: se = TRUE, na.rm = FALSE, orientation = NA
stat_smooth: method = lm, formula = NULL, se = TRUE, n = 80, fullrange = FALSE, level = 0.95, na.rm = FALSE, orientation = NA, method.args = list(), span = 0.75
position_identity 
p$mapping
Aesthetic mapping:
* x -> av_fulltime_salary
* y -> population

p$labels
$x
[1] "av_fulltime_salary"

$y
[1] "population"

$colour
[1] "factor(av_age_fac)"

# 3 Data and mappings

## 3.1 Data: raw vs. processed

• Data could be “raw” or processed
• Processed data is, for instance, …
• …aggregated data
• …data summarized through a statistical model (e.g., coefficients from a model)
• Distinction is important: Software (like ggplot2) has processing abilities
• Decision: (1) Do all the processing ourselves OR (2) let ggplot2 do some of the processing
• (1) means more control but also more work (and (2) vice versa)
• Always ask: Which parts in the graph have been computed by ggplot2 (graphics software)?
• ggplot2 has the argument stat = "identity": Q: Does anyone know what that does?
• (My) Recommendation:
• Try to automate everything (reproducibility!)
• Do most processing yourself for better control, i.e., reduce distance between data and graph (e.g., labels = variable names etc.)
• But depends…

## 3.2 Data, aesthetics mappings and layers: Exercise

• See the four ggplot2 statements below. Please describe the data, aesthetic mappings and layers used for each of the following plots.
• You’ll need to guess a little because you haven’t seen all the datasets and functions yet, but use your common sense! See if you can predict what the plot will look like (see Section 3.3) before running the code.
1. ggplot(mpg, aes(cty, hwy)) + geom_point()
2. ggplot(diamonds, aes(carat, price)) + geom_point()
3. ggplot(economics, aes(date, unemploy)) + geom_line()
4. ggplot(mpg, aes(cty)) + geom_histogram()
• The fact that we can guess the outcome from the code attests to its readability!
• Lesson: Choose readable variable names (e.g., cty = bad!)!

## 3.3 Data, aesthetics mappings and layers: Exercise Reveal

• Figure 3 visualizes the four different statements.
p1 <- ggplot(mpg, aes(cty, hwy)) + geom_point()
p2 <- ggplot(diamonds, aes(carat, price)) + geom_point()
p3 <- ggplot(economics, aes(date, unemploy)) + geom_line()
p4 <- ggplot(mpg, aes(cty)) + geom_histogram()
grid.arrange(p1, p2, p3, p4, ncol = 2) # grid.arrange() comes from the gridExtra package

# 4 Geoms

## 4.1 Geoms (1)

• Geoms = geometric objects!
• geom_area() draws an area plot, which is a line plot filled to the y-axis (filled lines)
• geom_bar(stat = "identity") makes a bar chart (default = count the data, stat = "identity" to leave data unchanged)
• geom_line() makes a line plot (group aesthetic determines which observations are connected)
• geom_line() connects points from left to right; geom_path() is similar but connects points in the order they appear in the data
• geom_point() produces a scatterplot
• geom_polygon() draws polygons, which are filled paths
• geom_rect(), geom_tile() and geom_raster() draw rectangles
• geom_text() adds text to a plot (requires a label aesthetic that provides the text to display)
• This list is not exhaustive; ggplot2 offers further geoms!
• Important
• If aes() is set within ggplot(), the mappings are inherited by all geoms that are added afterwards.
• If aes() is set within geom_*(), it will only be used within that geom.

## 4.2 Geoms (2): Individual vs. collective

• 3rd edition of ggplot2 book contrasts individual geoms with collective geoms
• Individual geom draws a distinct graphical object for each observation (row), e.g., one point per row
• Collective geom displays multiple observations with one geometric object
• e.g., as a result of a statistical summary, like a boxplot, or fundamental to the display of the geom, like a polygon
• Lines and paths fall in between: each line is composed of a set of straight segments, but each segment represents two points
• How to control this? Use the group aesthetic!
• Default: group aesthetic mapped to interaction of all discrete variables in the plot
• When not: necessary to explicitly define the grouping structure by mapping group to a discrete variable, e.g., aes(group = interaction(school_id, student_id))
• See example in Table 2 and Figure 4
# Oxboys ships with the nlme package
ggplot(Oxboys, aes(age, height)) +
  geom_point() +
  geom_line()
ggplot(Oxboys, aes(age, height, group = Subject)) +
  geom_point() +
  geom_line()

# 5 Scales

• Scales in ggplot2 control the mapping from data to aesthetics
• Turn data into something that you can see, like size, colour, position or shape
• Provide tools to interpret plot: the axes and legends
• ggplot2 automatically produces guides based on the layers in your plot
• Scales toolbox divides scales into three main groups
• 1 Position scales and axes
• 2 Colour scales and legends
• Pick colours wisely, e.g., no red/green (see ColorBrewer)
• 3 Scales for other aesthetics, e.g., size, shape, line width, line type etc.

## 5.1 Axes

• We can modify axes using various functions as depicted in Figure 5. The data is shown in Table 3.
• xlab() and ylab(): Set axis labels
• scale_x_continuous(): Manually set a numeric scale (same for y)
• scale_x_discrete(): Manually set categorical scale
• scale_x_*()/scale_y_*(): Various other options
• xlim() and ylim(): Set limits
• seq(): helper function to create breaks, e.g., seq(0, 10, 2)
# data_twitter_influence.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                          "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"),
#                  col_types = cols())
data <- read_csv("data/data_twitter_influence.csv",
                 col_types = cols())

ggplot(data %>% filter(followers_count < 50000),
       aes(x = account_age_years,
           y = followers_count,
           color = factor(female))) +
  geom_point(alpha = 0.9) +
  ylab("Number of followers") +
  xlab("Account age (in years)") +
  scale_x_continuous(breaks = c(0, 5, 10), limits = c(0, 10)) +
  xlim(0, 5) # Replaces x-scale characteristics before

## 5.2 Axes: Exercise

1. Using the code from Figure 5 above, try to recreate Figure 6 below. Among other things you have to modify scale_x_continuous() and scale_y_continuous().

Exercise solution
data <- read_csv("data/data_twitter_influence.csv",
                 col_types = cols())

ggplot(data %>% filter(followers_count < 50000),
       aes(x = account_age_years,
           y = followers_count,
           color = factor(female))) +
  geom_point(alpha = 0.9) +
  ylab("Number of followers (absolute)") +
  xlab("Account age (YEARS)") +
  scale_x_continuous(breaks = seq(0, 10, 2), limits = c(0, 10)) +
  scale_y_continuous(breaks = seq(0, 15000, 2500), limits = c(0, 15000))

## 5.3 Aesthetic scales (1): Color, Size, Shape

• colour, size and shape: Additional aesthetics that can be specified in aes(...)
• ggplot2 takes care of mapping data to aesthetic scale
• 1 scale for each aesthetic mapping (i.e., for each variable)
• Scale creates a guide (axis or legend) for reading/decoding
• Decoding = Convert aesthetic values to data values
• Default scales vs. manual scales (…we’ll see later)
• Use scale_color_manual() to manually set the color scale (same logic for other scales) (see ?scale_color_manual)
• Setting an aesthetic to a fixed value is done outside of aes(...). Compare…
• ggplot(economics, aes(date, unemploy)) + geom_point(aes(colour = "blue"))
• ggplot(economics, aes(date, unemploy)) + geom_point(colour = "blue")

## 5.4 Aesthetic scales (2): Color, Size, Shape

• Figure 7 visualizes data of German politicians’ Twitter followers and retweets (April 2020) and provides a simple example of a color scale.
# data_twitter_influence.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                          "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"))
data <- read_csv("data/data_twitter_influence.csv",
                 col_types = cols())
data <- data %>%
  filter(n_retweets < mean(n_retweets), # What happens here?
         followers_count < mean(followers_count))

ggplot(data, aes(x = followers_count, y = n_retweets, colour = party)) +
  geom_point() +
  scale_colour_manual(name = "Partei",
                      values = c("DieLinke" = "magenta",
                                 "FDP" = "orange",
                                 "CDU_CSU" = "black",
                                 "SPD" = "red",
                                 "Greens" = "darkgreen",
                                 "AfD" = "blue")) # Good colors?

## 5.5 Aesthetic scales (3): Change legend order

• If we want to change the order of the legend we have to convert the corresponding variable to a factor as shown for Figure 8.
# data_twitter_influence.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                          "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"))
data <- read_csv("data/data_twitter_influence.csv",
                 col_types = cols())

# Change datatype of party to ordered factor
data <- data %>%
  mutate(party = factor(party,
                        ordered = TRUE,
                        levels = c("DieLinke", "FDP", "CDU_CSU", "SPD", "Greens", "AfD")))

data <- data %>%
  filter(n_retweets < mean(n_retweets), # What happens here?
         followers_count < mean(followers_count))

ggplot(data, aes(x = followers_count, y = n_retweets, colour = party)) +
  geom_point() +
  scale_colour_manual(name = "Partei",
                      values = c("DieLinke" = "magenta",
                                 "FDP" = "orange",
                                 "CDU_CSU" = "black",
                                 "SPD" = "red",
                                 "Greens" = "darkgreen",
                                 "AfD" = "blue")) # Good colors?

## 5.6 Aesthetic scales (4): Exercise 1

• Q: Just looking at the code below (not running it!), what would Figure 7 look like if we additionally set size = followers_count and shape = party within aes(...) as we do below?
• How many variables/mappings are we dealing with now? (How many would we need?)
• How many legends are created?
# data_twitter_influence.csv
data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
                         "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"),
                 col_types = cols()) %>%
  filter(n_retweets < mean(n_retweets), # What?
         followers_count < mean(followers_count))

ggplot(data,
       aes(x = followers_count,
           y = n_retweets,
           colour = party,
           size = followers_count,
           shape = party)) +
  geom_point()

Exercise solution
• How many variables/mappings are we dealing with now? (How many would we need?)
• 3 variables and 5 mappings. We would only need 3 mappings (= number of variables).
• How many legends are created?
• 2 legends are created.

## 5.7 Aesthetic scales (4): Exercise 2

1. Load the data on Twitter influence with data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download", "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6")) or data <- read_csv("data/data_twitter_influence.csv").
2. Inspect the dataset with str().
3. Please rebuild Figure 9 below.
• Hint: You might need to create a factor variable with factor() and it might be necessary to filter on the variable followers_count. Besides, you might need the boxplot geom.
4. Pick three different variables (or more) out of data and map them to different (aesthetic) scales. Can you find anything interesting?

Exercise solution
# 1.
data <- read_csv("www/data/data_twitter_influence.csv",
                 col_types = cols())
# 2.
str(data)
# 3.
ggplot(data,
       aes(x = party,
           y = followers_count,
           colour = party,
           shape = factor(female))) +
  geom_point()
# 4.
# The color scale because it is encoded in the x-axis (= x-scale)
# 5.
# Various solutions

# 6 Data labels & Annotations

## 6.1 Data labels & Annotations (1)

• Labels
• geom_text(): Add text, e.g., labels as in Figure 10
• Annotations
• geom_text(): Add text descriptions or label points
• Most plots won’t benefit from labelling every observation, but labelling outliers/important points is useful
• geom_rect(): Highlight interesting rectangular regions of the plot
• has aesthetics xmin, xmax, ymin and ymax
• geom_vline(), geom_hline() and geom_abline(): Add reference lines
• annotate(): Annotate text in textbox

## 6.2 Data labels & Annotations (2)

• Figure 11 illustrates various geoms for labeling and annotations.
• geom_text_repel(): Automatic label location available from the ggrepel package (install.packages("ggrepel"))
• See the ggrepel vignette
• Delete labels beforehand in the data (Q: Another solution?)
• The use of hjust and vjust is explained in this wonderful answer.
# data_twitter_influence.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                          "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"))
data <- read_csv("data/data_twitter_influence.csv",
                 col_types = cols())

# Modify data
data$last_name[data$followers_count < 50000] <- NA # Explain! (delete labels)

# Visualize
ggplot(data = data,
aes(x = followers_count,
y = n_retweets,
colour = party)) +
geom_point() +
geom_text_repel(aes(label = last_name, color = party),
segment.color = 'red',
size = 3) +
geom_vline(aes(xintercept = mean(followers_count))) +
geom_hline(aes(yintercept = mean(n_retweets))) +
annotate("rect", xmin = 300000, xmax = 700000,
ymin = 30000, ymax = 50000,
color = "#386cb0",
fill = "#386cb0",
alpha = 0.1) +
annotate("text",
label = "No outliers\n in this area!",
x = 500000, y = 40000,
size = 5,
colour = "#386cb0",
hjust = 0.5,
vjust = 0.5) +
scale_colour_manual(name = "Partei",
values = c("DieLinke" = "magenta",
"FDP" = "orange",
"CDU_CSU" = "black",
"SPD" = "red",
"Greens" = "darkgreen",
"AfD" = "blue"))

# + scale_x_continuous(breaks = seq(0,700000, 100000),
#                      labels = paste(seq(0,700, 100), "tsd", sep="")) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
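The annotation layers used for Figure 11 can be tried out without the Twitter data. The sketch below is a self-contained variant that assumes the mpg dataset shipped with ggplot2; the rectangle coordinates and the label text are chosen for this example only, and ggrepel is left out to keep the dependencies minimal.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  # Reference lines at the means of both variables
  geom_vline(xintercept = mean(mpg$displ), linetype = "dashed") +
  geom_hline(yintercept = mean(mpg$hwy), linetype = "dashed") +
  # Highlight a rectangular region (coordinates are in data units)
  annotate("rect", xmin = 5, xmax = 7, ymin = 35, ymax = 45,
           fill = "#386cb0", alpha = 0.1) +
  # Text centred inside the rectangle (hjust/vjust = 0.5 centre the label)
  annotate("text", x = 6, y = 40, label = "No cars\nin this area!",
           colour = "#386cb0", hjust = 0.5, vjust = 0.5)

p
```

Because annotate() takes fixed values rather than mappings, it adds exactly one rectangle and one label regardless of the number of rows in the data.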

## 6.3 Data labels & Annotations (3): Exercise

• Use the code from Figure 11 and try to build Figure 12 below.
• Hint: The graph only shows labels for persons that have high values on at least one of the two axes (but not necessarily the other one). First, use the code below to keep last names for the respective persons (it creates the helper variables last_name1 and last_name2, which are then combined to replace the old last_name).
• To change the background theme use + theme_light()
• Also you will need both annotate("rect",...) and annotate("text",...)
• Beware: x-Scale is shown in scientific notation, e.g., the real value behind 4e+05 is 400000 and you have to pick xmin, xmax accordingly.
• The colors used for the boxes are #386cb0 and #984ea3.
# data_twitter_influence.csv
data <- read_csv("data/data_twitter_influence.csv",
col_types = cols())

# Modify labeling data for graph
data <- data %>%
mutate(last_name1 = ifelse(followers_count < 300000, NA, last_name),
last_name2 = ifelse(n_retweets < 5000, NA, last_name),
last_name = coalesce(last_name1, last_name2))

head(data %>% select(last_name, party, followers_count, n_retweets))

Exercise solution
data <- read_csv("www/data/data_twitter_influence.csv")
data <- data %>%
mutate(last_name1 = ifelse(followers_count < 300000, NA, last_name),
last_name2 = ifelse(n_retweets < 5000, NA, last_name),
last_name = coalesce(last_name1, last_name2))

ggplot(data = data,
aes(x = followers_count,
y = n_retweets,
colour = party)) +
geom_point() +
geom_text_repel(aes(label = last_name),
segment.color = 'grey50',
size = 4) +
scale_colour_manual(name = "Partei",
values = c("DieLinke" = "magenta",
"FDP" = "orange",
"CDU_CSU" = "black",
"SPD" = "red",
"Greens" = "darkgreen",
"AfD" = "blue")) +
geom_vline(aes(xintercept = median(followers_count))) +
geom_hline(aes(yintercept = median(n_retweets))) +
annotate("rect", xmin = 0, xmax = 700000,
ymin = 25000, ymax = 50000,
alpha = 0.1,
fill="#386cb0") +
annotate("text",
label = "Outliers:\n Retweets",
x = 150000, y = 40000,
size = 5,
colour = "#386cb0",
hjust = 0.5,
vjust = 0.5) +
annotate("rect", xmin = 300000, xmax = 700000,
ymin = 0, ymax = 50000,
alpha = 0.1,
fill="#984ea3") +
annotate("text",
label = "Outliers:\n Followers",
x = 500000, y = 15000,
size = 5,
colour = "#984ea3",
hjust = 0.5,
vjust = 0.5) +
theme_light()

# 7 Theme & Captions

## 7.1 Theme (1)

• The theme system gives fine control over the non-data elements of a plot
• Does not change rendering (geoms) or transformation (scales), i.e., perceptual properties
• Helps to make the plot aesthetically pleasing: control fonts, ticks, panel strips etc.
• Separation between data and non-data parts
1. Determine how data is displayed when creating plot
2. Edit details of rendering using theming system
• Components (× 4)
• Theme elements: Specify non-data elements to control, e.g., plot.title, axis.ticks.x, legend.key.height
• Element functions: Describe visual properties of element, e.g., element_text() set font size, color etc. of text elements
• theme() function: Can be used to override default theme elements
• theme(plot.title = element_text(colour = "red"))
• Complete themes, e.g., theme_light() (check them out)
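The four theme components can be seen together in a short sketch. It assumes the mpg dataset shipped with ggplot2; the title text and colours are illustrative choices, not taken from the figures in these notes.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  ggtitle("Engine size and fuel economy") +
  # Complete theme first, so that theme() below overrides it
  theme_light() +
  # theme(): theme element on the left, element function on the right
  theme(
    plot.title = element_text(colour = "red", face = "bold"), # text element
    axis.ticks.x = element_line(colour = "grey50"),           # line element
    panel.grid.minor = element_blank()                        # remove element
  )

p
```

The order matters: a complete theme added after theme() would reset the individual settings again.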

## 7.2 Theme (2): Exercise

• Q: Run the code below and try to modify the elements marked with # ?. What do the different theme specifications (marked by # ?) in the code change?
# data_twitter_influence.csv

data <- read_csv("data/data_twitter_influence.csv",
col_types = cols())
data <- data %>% filter(followers_count<50000)

ggplot(data,
aes(x = account_age_years,
y = followers_count,
color = factor(female))) +
geom_point(alpha =0.7) +
scale_colour_manual(name = "Gender",
values = c("0" = "#d95f02", "1" = "#1f78b4"),
labels = c("0" = "male", "1" = "female")) +
ylab("Number of followers") +
xlab("Account age (in years)") +
ggtitle("Twitter: Account age, follower numbers and gender") +
theme_bw() +
theme(
plot.title = element_text(face = "bold", size = 12), # ?
legend.background = element_rect(
fill = "white", # ?
size = 4, colour = "white"
),
legend.justification = c(1, 1), # ?
legend.position = c(1, 1), # ?
axis.ticks = element_line(colour = "grey70", size = 0.2), # ?
panel.grid.major = element_line(colour = "grey70", size = 0.2), # ?
panel.grid.minor = element_blank() # ?
) #+
#theme_light() # ?

## 7.3 Theme (3): Storing & re-using themes + fonts

• We may want to re-use a theme across our organization (e.g., the BKA)
• First we need to install the respective fonts (here: Times New Roman). This can take several minutes.
install.packages("extrafont")
remotes::install_version("Rttf2pt1", version = "1.3.8")
library(extrafont) # provides font_import(), loadfonts() and fonts()
font_import(pattern = "times") # Import system fonts
# AFFIRM with y + enter

loadfonts(device="win") # Register fonts so they can be used for output devices
fonts() # show registered fonts
• Below we store the theme with font Times New Roman and apply it to a graph.
# Load the data
data <- read_csv("data/data_twitter_influence.csv",
col_types = cols())
data <- data %>% filter(followers_count<50000)

# Store the theme
bka_theme <- theme_light() + # use before theme()!
theme(text=element_text(family="Times New Roman",
face="bold",
size=12))

# Store the plot and show it
p1 <- ggplot(data,
aes(x = account_age_years,
y = followers_count,
color = factor(female))) +
geom_point(alpha =0.7) +
scale_colour_manual(name = "Gender",
values = c("0" = "#d95f02", "1" = "#1f78b4"),
labels = c("0" = "male", "1" = "female")) +
ylab("Number of followers") +
xlab("Account age (in years)") +
ggtitle("Twitter: Account age, follower numbers and gender")

# Show the plot
p1

# Combine plot with stored theme
p1 + bka_theme

# Store them locally (working directory)
# TimesNewRoman only works when loaded like above
save(bka_theme, file = "bka_theme.RData")
load(file = "bka_theme.RData")
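A stored theme can also be made the default for a whole session instead of being added to every plot. The following sketch assumes the mpg dataset shipped with ggplot2; house_theme is a hypothetical name, and no custom font is set so that the example runs without the font import step.

```r
library(ggplot2)

# A reusable theme object (hypothetical name, no custom font)
house_theme <- theme_light() +
  theme(text = element_text(face = "bold", size = 12),
        legend.position = "bottom")

p <- ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_point()

# Option 1: add the stored theme to an individual plot
p_themed <- p + house_theme
p_themed

# Option 2: make it the session default; theme_set() returns the previous theme
old_theme <- theme_set(house_theme)
p                     # now drawn with house_theme
theme_set(old_theme)  # restore the previous default
```

Option 2 is convenient in reports with many figures, but it changes global state, so restoring the old theme at the end is a sensible habit.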

## 7.4 Theme (4): Legend position

• Specify position with theme(legend.position = "left")
• Replace with one of right, top, bottom, c(0.5, 0.5) (move into plot)
• legend.position=c(0.5, 0.5): Position coordinate on the plot, e.g., middle = c(0.5, 0.5)
• lower-left = c(0, 0), upper-right = c(1, 1), etc.
• legend.justification = c(0, 1): Legend justification from position coordinate
• c(0, 0) = Position coordinate is lower left corner, c(1, 1) = upper right corner, c(0, 1) = upper left corner, c(1, 0) = lower right corner
• legend.box = "horizontal": Change legend to horizontal
• Below a simple example:
# data_twitter_influence.csv
data <- read_csv("data/data_twitter_influence.csv",
col_types = cols())
ggplot(data = data,
aes(x = followers_count,
y = n_retweets,
colour = party)) +
geom_point() +
theme(legend.position="bottom")

## 7.5 Theme (5): Legend position exercise

• Exercise: Take the code from above. Do you manage to position the legend as below using the arguments legend.position and legend.justification?
Exercise solution
ggplot(data = data,
aes(x = followers_count,
y = n_retweets,
colour = party)) +
geom_point() +
theme(legend.position=c(1, 1),
legend.justification = c(1, 1))
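A note on versions: to my knowledge, ggplot2 releases from 3.5.0 onwards deprecate passing coordinates directly to legend.position and instead expect legend.position = "inside" together with legend.position.inside. The sketch below handles both cases. It assumes the mpg dataset shipped with ggplot2 and should be checked against the documentation of the installed version (?theme).

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_point()

if (packageVersion("ggplot2") >= "3.5.0") {
  # Newer releases: request an inside legend, give coordinates separately
  p_inside <- p + theme(legend.position = "inside",
                        legend.position.inside = c(1, 1),
                        legend.justification = c(1, 1))
} else {
  # Older releases: coordinates go directly into legend.position
  p_inside <- p + theme(legend.position = c(1, 1),
                        legend.justification = c(1, 1))
}

p_inside
```

In both branches the legend's upper right corner is anchored at the upper right corner of the plotting area.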

## 7.6 Titles & Captions (1)

• ggtitle("...", subtitle = "..."): Adding title and subtitle with ggtitle()
• labs(title = "...", subtitle = "...", caption = "..."): Adding title, subtitle and caption with labs()
• Modifying titles/captions in theme()
• element_text(): Change appearance and position of text
• Q: How does the code below change title, subtitle and caption?
+ theme(
plot.title = element_text(color = "black", size = 14, face = "bold"),
plot.subtitle = element_text(color = "black", size = 12),
plot.caption = element_textbox_simple(color = "black", face = "italic", hjust = 0)
)
• element_textbox_simple() (ggtext package): Fold text to fit plot size
• Tip: If you add the figure notes (caption = "...") directly in the plot file it will travel with the plot once it is shared.
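A self-contained sketch of titles and captions follows. It assumes the mpg dataset shipped with ggplot2 and uses plain element_text() for the caption instead of element_textbox_simple(), so that it runs without the ggtext package; the drawback is that long captions are not wrapped automatically.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  labs(title = "Engine size and fuel economy",
       subtitle = "Larger engines tend to use more fuel",
       caption = "Note: Data are the mpg dataset shipped with ggplot2.",
       x = "Engine displacement (litres)",
       y = "Highway miles per gallon") +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 12),
    plot.caption = element_text(face = "italic", hjust = 0) # left-aligned
  )

p
```

The labels are stored in the plot object (p$labels), which is why they travel with the plot when it is saved or shared.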

## 7.7 Titles & Captions (2): Exercise

• Please modify and add to the code below so that you reach the graph that is visualized.
# data_twitter_influence.csv
options(scipen = 999) # ?
library(ggtext)
data <- read_csv("data/data_twitter_influence.csv",
col_types = cols())

ggplot(data = data,
aes(x = followers_count,
y = n_retweets,
colour = party)) +
geom_point() +
labs(title = ...,
subtitle = ...,
caption = ...,
x = "Number of followers",
y = "Number of retweets") +
theme(
legend.position="bottom",
plot.title = element_text(color = ..., size = 14, face = "bold"),
plot.subtitle = element_text(color = ..., size = 12),
plot.caption = element_textbox_simple(color = ..., face = "italic", hjust = 0)
)
Exercise solution
# data_twitter_influence.csv
options(scipen = 999) # ?
library(ggtext)
data <- read_csv("data/data_twitter_influence.csv",
col_types = cols())
ggplot(data = data,
aes(x = followers_count,
y = n_retweets,
colour = party)) +
geom_point() +
labs(title = "Politician's influence on Twitter",
subtitle = "Followers vs. retweets",
caption = "Note: The graph shows politicians (= unit of analysis) and their number of followers (x-axis) and retweets (y-axis, summed over one month) in April 2020. Politician's party is encoded through colors.",
x = "Number of followers",
y = "Number of retweets") +
theme(
legend.position="bottom",
plot.title = element_text(color = "black", size = 14, face = "bold"),
plot.subtitle = element_text(color = "black", size = 12),
plot.caption = element_textbox_simple(color = "black", face = "italic", hjust = 0)
)

# 8 Putting all together

# data_twitter_influence.csv
options(scipen = 999) # ?
library(ggtext)
library(ggrepel)
data <- read_csv("data/data_twitter_influence.csv",
col_types = cols()) %>%
filter(last_name!="Lauterbach") # Omit Lauterbach

# Modify data
data <- data %>%
mutate(last_name1 = ifelse(followers_count < 150000, NA, last_name),
last_name2 = ifelse(n_retweets < 3000, NA, last_name),
last_name = coalesce(last_name1, last_name2))

ggplot(data = data,
aes(x = followers_count,
y = n_retweets,
colour = party)) +
geom_point() +
geom_point(data=data %>% filter(party == "AfD", n_retweets >=3000),
aes(x =followers_count,
y = n_retweets),
pch=21,
fill=NA, size=4,
colour="blue", stroke=1) +
geom_text_repel(aes(label = last_name,
color = party),
segment.color = 'red',
size = 3) +
labs(title = "German politician's influence on Twitter",
subtitle = "The intriguing retweet strength of AfD politicians.",
caption = "Note: The graph shows politicians (unit of analysis) and their number of followers (x-axis, measured on 24th of April) and retweets (y-axis, summed over March 2020) in April 2020. Politician's party is encoded through colors. Lauterbach was omitted from the graph.",
x = "Number of followers",
y = "Number of retweets") +
theme_light() +
theme(
legend.position="bottom",
plot.title = element_text(color = "black", size = 14, face = "bold"),
plot.subtitle = element_text(color = "black", size = 12),
plot.caption = element_textbox_simple(color = "black", face = "italic", hjust = 0)
) +
scale_colour_manual(name = "Partei",
values = c("DieLinke" = "magenta",
"FDP" = "orange",
"CDU_CSU" = "black",
"SPD" = "red",
"Greens" = "darkgreen",
"AfD" = "blue"))

# 9 Facetting (small multiples)

## 9.1 Facetting (1)

• Small multiples: Display a plot across values/categories of another variable
• Aim of facetting: Increase density while keeping readability
• Types of facetting: Grid (facet_grid()) and wrapped (facet_wrap()) (Show Figure 14.1)
• Q: According to which variable does Figure 13 split the data? (note the ~ in facet_wrap(~variable)) What can we see in the graph? How many vars have been mapped to how many scales?
ggplot(data %>% filter(followers_count<50000), # Filter!
aes(x = account_age_years,
y = followers_count, color = party)) +
geom_point() +
facet_wrap(~party)  +
scale_colour_manual(name = "Partei",
values = c("DieLinke" = "magenta",
"FDP" = "orange",
"CDU_CSU" = "black",
"SPD" = "red",
"Greens" = "darkgreen",
"AfD" = "blue"))
• If the aim is to plot several independent plots next to each other we have to resort to the package patchwork or the grid.arrange() function from the gridExtra package
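The two faceting types can be compared in a minimal sketch. It assumes the mpg dataset shipped with ggplot2; the facet variables are picked for illustration only.

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

# Wrapped: one panel per value of a single variable, wrapped into rows
p_wrap <- base + facet_wrap(~class)

# Grid: rows defined by one variable, columns by another
p_grid <- base + facet_grid(drv ~ cyl)

p_wrap
p_grid
```

facet_wrap() suits one splitting variable with many categories; facet_grid() suits two variables whose combinations should be laid out as a table, including empty cells.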

## 9.2 Facetting: Exercise (2)

1. Create a new R script. In this script load the data either locally (after downloading the file) with data <- read_csv("data_twitter_influence.csv") or from the server with data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download", "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"))
2. What do you think happens if you try to facet by a continuous variable like followers_count? What about party? What’s the key difference?
3. Take the subset of politicians that have fewer than 50000 followers (data %>% filter(followers_count<50000)) and use facetting (+ facet_grid(female ~ party)) to explore the four-way relationship between account_age_years, followers_count, female (0 = male) and party.
4. Read the documentation for facet_wrap() (?facet_wrap). What arguments can you use to control how many rows and columns appear in the output?
5. What does the scales argument in facet_wrap() do? When might you use it?
Exercise solution
# 3.
ggplot(data %>% filter(followers_count<50000),
aes(x = account_age_years,
y = followers_count)) +
geom_point() +
facet_grid(party ~ female)

ggplot(data %>% filter(followers_count<50000),
aes(x = account_age_years,
y = followers_count,
color = factor(female))) +
geom_point() +
facet_wrap(~party)

# 10 Output & saving

• print(): Render it on screen (inside a loop or function, you’ll need to do it yourself)
• ggsave(): Save it to disk
• plot =: Defaults to the last plot displayed; alternatively pass a ggplot object
• width =, height =: Plot size, in inches by default
• DIN A4 paper is 8.27 × 11.69 inches, so use, e.g., width = 7
• For other units use, e.g., units = "cm"
• device =: Output format, e.g., "eps", "pdf", "jpeg", "tiff", "png", "bmp", "svg"
• dpi =: Set plot print resolution (dots per inch vs. ppi = pixels per inch)
• PDFs are vector graphics: preferable!
• saveRDS(): Save a cached copy of it to disk
• Saves complete copy of plot object (you can easily recreate it)
• Use walk() in combination with ggsave() to store several formats
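# data_twitter_influence.csv
data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
"1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"),
col_types = cols())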
p <- ggplot(data,
aes(x = followers_count,
y = n_retweets,
colour = party)) +
geom_point()
print(p)
ggsave(plot = p,
filename = "plot3.pdf", # file extension should match the device
path = "www/images/test", # adapt path!
width = 7,
height = 4,
device = "pdf", # e.g. change to "png"
dpi = 300)
# Or simply provide the filename including the path
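The notes on print() inside a loop and on units = can be sketched as follows. This is an illustration under assumptions: it uses the mpg data that ships with ggplot2, writes to a temporary directory, and the names p_loop and out_file are made up.

```r
library(ggplot2)

# Inside a loop (or function) a plot is not rendered automatically,
# so print() has to be called explicitly
for (v in c("cty", "hwy")) {
  p_loop <- ggplot(mpg, aes(x = displ, y = .data[[v]])) +
    geom_point() +
    labs(y = v)
  print(p_loop)
}

# Save the last plot created in the loop; size given in centimetres
out_file <- file.path(tempdir(), "plot_loop.png")
ggsave(filename = out_file, plot = p_loop,
       width = 16, height = 10, units = "cm", dpi = 300)
file.exists(out_file)
```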

Saving a plot object directly to the hard disk:

# .RData format
save(p, file = "myplot.RData")

# RDS format
saveRDS(p, file = "www/images/plot.rds")  # adapt path!
q <- readRDS("www/images/plot.rds")  # adapt path!

Saving a plot in several formats at once:

walk(c('pdf', 'eps', 'png'),
~ ggsave(filename = file.path(paste0("www/images/figure-in-different-formats.", .x)),
device = .x,
plot = p))

# 11 Add interactivity (ggplot + plotly)

• ggplot objects can be converted to plotly objects
• The conversion works more or less well (see below)
• See [Interactive data visualization: Plotly]
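# data_twitter_influence.csv
data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
"1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"),
col_types = cols())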

p1 <- ggplot(data,
aes(x = followers_count,
y = n_retweets,
colour = party,
label = last_name)) +
geom_point()

# Turn interactive
ggplotly(p1)

Interactivity with small multiples in Figure 15:

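# data_twitter_influence.csv
data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
"1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"),
col_types = cols())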

# Create plot
p2 <- ggplot(data %>% filter(followers_count<50000), # Filter!
aes(x = account_age_years,
y = followers_count,
color = party,
label = screen_name)) + # Label shown in the hover tooltip
geom_point() +
facet_wrap(~party)

# Turn interactive
ggplotly(p2)
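By default ggplotly() shows all mapped aesthetics in the hover tooltip. A minimal sketch of restricting it with the tooltip argument, using the mpg data that ships with ggplot2 so that it runs without the download (the object names p_mpg and p_int are made up for illustration):

```r
library(ggplot2)
library(plotly)

# label is not used by geom_point() itself but is passed on to the tooltip
p_mpg <- ggplot(mpg,
                aes(x = displ, y = hwy, colour = class, label = model)) +
  geom_point()

# Show only the label and the x/y values when hovering over a point
p_int <- ggplotly(p_mpg, tooltip = c("label", "x", "y"))
p_int
```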

## References

Healy, Kieran. 2018. Data Visualization: A Practical Introduction. Princeton University Press.
Wickham, Hadley. 2010. “A Layered Grammar of Graphics.” J. Comput. Graph. Stat. 19 (1): 3–28.
———. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer.
Wilkinson, Leland. 2013. The Grammar of Graphics. Springer Science & Business Media.

## Footnotes

1. Objective: Describe all features underlying statistical graphs↩︎

2. Focuses on the primacy of layers and adapts it for R↩︎

3. Normally Cartesian.↩︎

4. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).↩︎

5. From the inside of brackets to the outside. Then from line(+) to line(+).↩︎

6. Execute it line by line (or from inside to outside)! Same is true for dplyr code!↩︎

7. Either by its name directly, or often by knowing what an abbreviation stands for.↩︎

8. Use line breaks, spaces etc. Check out the tidyverse style guide.↩︎

9. See plotly and other packages.↩︎

10. Here raw means that we didn’t summarize the data, e.g., we didn’t aggregate it or summarize it through certain statistical models.↩︎

11. Tells the ggplot function to visualize the data “as is”, e.g., we could use geom_bar() and either feed it data that is then summarized/aggregated by the geom_bar() function or we feed it data that we summarized ourselves beforehand and tell geom_bar() not to summarize it.↩︎

12. This dataset contains a subset of the fuel economy data that the EPA makes available. cty = city miles per gallon. hwy = highway miles per gallon↩︎

13. A dataset containing the prices and other attributes of almost 54,000 diamonds. carat = weight of the diamond (0.2–5.01). price = price in US dollars.↩︎

14. This dataset was produced from US economic time series data available. date = Month of data collection. unemploy = number of unemployed in thousands.↩︎

15. When using geom_histogram(), ggplot produces/calculates values for the y-axis itself. This dataset contains a subset of the fuel economy data that the EPA makes available. cty = city miles per gallon↩︎

16. ?Oxboys: These data are described in Goldstein (1987) as data on the height of a selection of boys from Oxford, England versus a standardized age.↩︎

17. Data includes German MPs that have a Twitter account (not complete). The number of retweets refers to all retweets in March 2020. The number of followers was measured on 24th of April 2020.↩︎

18. Determines whether scales are fixed across all plots or not.↩︎