Grammar of graphics & Ggplot

Sources: Original material; Wickham (2010)

1 Grammar

  • Grammar answers question: What is a statistical graphic?
  • “The Grammar of Graphics” (Wilkinson 2013)
  • “A Layered Grammar of Graphics” (Wickham 2010)
  • Good grammar (Wickham 2010, 3)
    • provides insights into the composition of complicated graphics
    • reveals unexpected connections between seemingly different graphics
    • is just the first step in creating a good sentence
  • Grammar tells us that (Wickham 2016)
    • a statistical graphic is a mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars)
    • A plot may also contain statistical transformations of the data and is drawn on a specific coordinate system
    • Faceting can be used to generate the same plot for different subsets of the dataset
    • It is the combination of these independent components that makes up a graphic
  • Possibility of grammatically correct but nonsensical graph
  • Grammar = conceptually useful even if you don’t use Ggplot2 (e.g., if you use Plotly)

2 Grammar components & ggplot2

2.1 Grammar components & ggplot2 (1)

  • Let’s go into detail (Wickham 2016, 4–5)
  • Data & mappings
    • data: Information you want to visualize
    • mappings: describe how data’s variables are mapped to aesthetic attributes (e.g., x-axis) that you can perceive
    • There are five mapping components
  • (1) Layers
    • Geometric objects (geoms): what you actually see on the plot: points, lines, etc.
    • Statistical transformations (stats): Summarize data in many useful ways (e.g., by fitting a linear model)
  • (2) Scales: map values in the data space to values in an aesthetic space
    • e.g., color, size or shape
    • Scales draw legend or axes (allow us to decode graph)
  • (3) Coordinate system (coord): describes how data coordinates are mapped to the plane of the graphic
    • Provides axes/gridlines to make it possible to read the graph
  • (4) Faceting specification (facet): how to break up data and visualize subsets
  • (5) Theme: Control finer points of display (e.g., font, background color etc.)
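As a rough sketch, the components listed above can be lined up with individual ggplot2 calls. The example assumes the mpg dataset that ships with ggplot2 (not the Eurostat data used in these slides); the variables and the palette are chosen purely for illustration.

```r
library(ggplot2)

p_components <- ggplot(mpg, aes(x = displ, y = hwy)) +   # data & mappings
  geom_point(aes(colour = drv)) +                        # layer: geometric object
  stat_smooth(method = "lm", formula = y ~ x) +          # layer: statistical transformation
  scale_colour_brewer(name = "Drive train",
                      palette = "Dark2") +               # scale (draws the legend)
  coord_cartesian(xlim = c(1, 7)) +                      # coordinate system
  facet_wrap(~ year) +                                   # faceting specification
  theme_light()                                          # theme

p_components
```

Removing any single line still leaves a valid plot, which illustrates that the components are independent of each other.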

2.2 Grammar components & ggplot2 (2): Exercise

  • See some of the components in Figure 1 below that visualizes the data in Table 1.
    • Q: How many/which variables are mapped to which aesthetic attributes, i.e., x, y, color and statistical summary in Figure 1?
    • Q: Where do the components – Layers (Geometric objects, stat. transformations), Scales (axes, legend: color, size, shape), Coordinate system, Theme, Faceting – turn up in the graph below?
Table 1: Yearly salary, age and population across European countries (Eurostat, first 5 rows)
country_name | av_fulltime_salary | av_age_fac       | population
Austria      | 50849              | High (42.8,45.2] | 8955797
Belgium      | 52466              | Low (40.4,42.8]  | 11586195
Bulgaria     | 11850              | High (42.8,45.2] | 6877743
Cyprus       | 23129              | Lowest [38,40.4] | 900356
Czechia      | 20434              | High (42.8,45.2] | 10505772
Figure 1: Ggplot2 grammar components

2.3 Grammar components & ggplot2 (3): Exercise

  • See some of the components in Figure 2 below.
    • Q: How could we read the code?
    • Q: How would we execute that code to understand it?
    • Q: How can we best remember what a function/package/argument does?
    • Q: How do you write down code to make it readable?

Code:

ggplot(data, aes(x = av_fulltime_salary, y = population)) +
       geom_point(aes(color = factor(av_age_fac))) + # Point layer
       stat_smooth(method = "lm") + # Statistical transformation layer
       theme_light() + # Theme
       xlab("Yearly full time salary (average)") + # Lab
       ylab("Population") + # Lab
       guides(color = guide_legend(title = "Age (average) in 4 groups")) # Legend
Figure 2: Ggplot2 grammar components

2.4 Grammar doesn’t

  • Grammar doesn’t (Wickham 2016, 5)
    • …suggest what graphics you should use to answer the research questions
    • …specify what a graphic should look like (cf. theming system)
    • …specify how to make an attractive graphic (cf. defaults should be sensible)
    • …describe interaction (only static graphs)

2.5 Steps of ggplot visualizing

  1. data = ... step: Tell the ggplot() function what our data is
  2. mapping = aes(...) step: Tell ggplot() what relationships we want to see
  3. Choose a geom step: Tell ggplot how we want to see the relationships in our data
  4. Layer on geoms as needed, by adding them one at a time
  5. Use the scale_*() family, labs() and guides() functions to adjust scales, labels and legends.
  • We can either store plot object and add things (p <- p + ...) or concatenate everything with +.
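The five steps can be sketched as follows. This is a minimal example assuming the mpg dataset shipped with ggplot2; the axis labels and breaks are illustrative choices, not part of the original material.

```r
library(ggplot2)

# Steps 1 + 2: data and aesthetic mappings (no geom yet, so nothing is drawn)
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# Step 3: choose a geom
p <- p + geom_point()

# Step 4: layer on a further geom
p <- p + geom_smooth(method = "lm", formula = y ~ x)

# Step 5: adjust scales and labels
p <- p +
  scale_x_continuous(breaks = seq(2, 7, 1)) +
  labs(x = "Engine displacement (litres)",
       y = "Highway miles per gallon")

p
```

The same plot results if all calls are chained with + in a single statement; storing intermediate objects mainly helps when exploring step by step.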

2.6 Explore ggplot2 object: Exercise

  • Once stored we can explore our ggplot2 object (check class with class(p))
  • p$...: Explore different elements of plot object
    • e.g., p$data, p$layers, p$mapping, p$labels etc.
# data_eurostat.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                                 "1Sv7QqkLRAUEspM58dL-kZF2nrr6pl3PH"),
#                         col_types = cols())

data <- read_csv("data/data_eurostat.csv",
                 col_types = cols()) 


p <- ggplot(data, aes(x = av_fulltime_salary, y = population)) +
       geom_point(aes(color = factor(av_age_fac))) + # Point layer
       stat_smooth(method = "lm")
  • Q: What can you decipher from the summary below?
summary(p) # Describe structure of plot object
data: country_name, country, av_fulltime_salary, av_age_fac, population
  [24x5]
mapping:  x = ~av_fulltime_salary, y = ~population
faceting: <ggproto object: Class FacetNull, Facet, gg>
    compute_layout: function
    draw_back: function
    draw_front: function
    draw_labels: function
    draw_panels: function
    finish_data: function
    init_scales: function
    map_data: function
    params: list
    setup_data: function
    setup_params: function
    shrink: TRUE
    train_scales: function
    vars: function
    super:  <ggproto object: Class FacetNull, Facet, gg>
-----------------------------------
mapping: colour = ~factor(av_age_fac) 
geom_point: na.rm = FALSE
stat_identity: na.rm = FALSE
position_identity 

geom_smooth: se = TRUE, na.rm = FALSE, orientation = NA
stat_smooth: method = lm, formula = NULL, se = TRUE, n = 80, fullrange = FALSE, level = 0.95, na.rm = FALSE, orientation = NA, method.args = list(), span = 0.75
position_identity 
  • Important: When we layer on a ggplot2 plot object p <- p + ... we simply change these elements
    • e.g., we could change the axes of an existing plot that is stored in p
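A minimal sketch of this idea, assuming the mpg dataset that ships with ggplot2 rather than the Eurostat data: adding to a stored plot appends to or replaces elements of the plot object.

```r
library(ggplot2)

p_demo <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
n_layers_before <- length(p_demo$layers)   # point layer only

# Adding a geom appends a layer to the stored object
p_demo <- p_demo + geom_smooth(method = "lm", formula = y ~ x)
n_layers_after <- length(p_demo$layers)    # point layer + smooth layer

# Adding a scale changes the x-axis of the already stored plot
p_demo <- p_demo + scale_x_continuous(limits = c(1, 8))
p_demo
```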



And we can also access sub elements:

p$layers
[[1]]
mapping: colour = ~factor(av_age_fac) 
geom_point: na.rm = FALSE
stat_identity: na.rm = FALSE
position_identity 

[[2]]
geom_smooth: se = TRUE, na.rm = FALSE, orientation = NA
stat_smooth: method = lm, formula = NULL, se = TRUE, n = 80, fullrange = FALSE, level = 0.95, na.rm = FALSE, orientation = NA, method.args = list(), span = 0.75
position_identity 
p$mapping
Aesthetic mapping: 
* `x` -> `av_fulltime_salary`
* `y` -> `population`
p$labels
$x
[1] "av_fulltime_salary"

$y
[1] "population"

$colour
[1] "factor(av_age_fac)"

3 Data and mappings

3.1 Data: raw vs. processed

  • Data could be “raw” or processed

  • Processed data is, for instance, …

    • …aggregated data
    • …data summarized through a statistical model (e.g., coefficients from a model)
  • Distinction is important: Software (like ggplot) has processing abilities

  • Decision: (1) Do all the processing ourselves OR (2) let ggplot do some of the processing

    • (1) means more control but also more work (and vice versa for (2))
  • Always ask: Which parts in the graph have been computed by ggplot (graphics software)?

  • Ggplot2 has the argument stat = "identity": Q: Does anyone know what that does?

  • (My) Recommendation:

    • Try to automate everything (reproducibility!)
    • Do most processing yourself for better control, i.e., reduce distance between data and graph (e.g., labels = variable names etc.)
    • But depends…
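The two options can be contrasted in a short sketch. It assumes the mpg dataset from ggplot2 and dplyr for the manual aggregation; class_counts is a name introduced here for illustration only.

```r
library(ggplot2)
library(dplyr)

# (2) Let ggplot2 do the processing: geom_bar() counts the rows per class
p_auto <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# (1) Do the processing ourselves, then plot the values exactly as they are
class_counts <- mpg %>% count(class)

p_manual <- ggplot(class_counts, aes(x = class, y = n)) +
  geom_bar(stat = "identity")   # leaves the data unchanged

p_auto
p_manual
```

Both plots show the same bars; in the second one the counts exist as a data frame that can be inspected, labelled or exported before plotting.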

3.2 Data, aesthetics mappings and layers: Exercise

  • See the four ggplot2 statements below. Please describe the data, aesthetic mappings and layers used for each of the following plots (Wickham 2016, 14).
    • You’ll need to guess a little because you haven’t seen all the datasets and functions yet, but use your common sense! See if you can predict what the plot will look like (see Section 3.3) before running the code.
  1. ggplot(mpg, aes(cty, hwy)) + geom_point()
  2. ggplot(diamonds, aes(carat, price)) + geom_point()
  3. ggplot(economics, aes(date, unemploy)) + geom_line()
  4. ggplot(mpg, aes(cty)) + geom_histogram()
  • The fact that we can guess the outcome from the code attests to its readability!
  • Lesson: Choose readable variable names (e.g., cty = bad!)!

3.3 Data, aesthetics mappings and layers: Exercise Reveal

  • Figure 3 visualizes the four different statements.
library(ggplot2)
library(gridExtra) # provides grid.arrange()

p1 <- ggplot(mpg, aes(cty, hwy)) + geom_point()
p2 <- ggplot(diamonds, aes(carat, price)) + geom_point()
p3 <- ggplot(economics, aes(date, unemploy)) + geom_line()
p4 <- ggplot(mpg, aes(cty)) + geom_histogram()
grid.arrange(p1, p2, p3, p4, ncol = 2) # arrange the four plots in a 2 x 2 grid
Figure 3: Different geoms

4 Geoms

4.1 Geoms (1)

  • Geoms = geometric objects!
  • geom_area() draws an area plot, which is a line plot filled to the y-axis (filled lines)
  • geom_bar(stat = "identity") makes a bar chart (default = count the data, stat = "identity" to leave data unchanged)
  • geom_line() makes a line plot (group aesthetic determines which observations are connected)
    • geom_line() connects points from left to right; geom_path() is similar but connects points in the order they appear in the data
  • geom_point() produces a scatterplot
  • geom_polygon() draws polygons, which are filled paths
  • geom_rect(), geom_tile() and geom_raster() draw rectangles
  • geom_text() adds text to a plot (requires a label aesthetic that provides the text to display)
  • These are probably not all geoms!
  • Important
    • If aes() is set within ggplot(), the mappings are inherited by all geoms that are added afterwards.
    • If aes() is set within geom_*(), the mappings are only used within that geom.
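The inheritance rule can be illustrated with a small sketch, assuming the mpg dataset from ggplot2. The only difference between the two plots is where colour = drv is mapped.

```r
library(ggplot2)

# colour mapped in ggplot(): inherited by both layers,
# so the smooth layer fits one line per drv group
p_global <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# colour mapped in geom_point() only: the smooth layer does not see it
# and fits a single line through all points
p_local <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = drv)) +
  geom_smooth(method = "lm", formula = y ~ x)

p_global
p_local
```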

4.2 Geoms (2): Individual vs. collective

  • 3rd edition of ggplot2 book contrasts individual geoms with collective geoms
  • Individual geom draws a distinct graphical object for each observation (row), e.g., one point per row
  • Collective geom displays multiple observations with one geometric object
    • e.g., as a result of a statistical summary, like a boxplot, or fundamental to the display of the geom, like a polygon
  • Lines and paths fall in between: each line is composed of a set of straight segments, but each segment represents two points
  • How to control this? Use group aesthetic!
    • Default: group aesthetic mapped to interaction of all discrete variables in the plot
    • Otherwise, the grouping structure has to be defined explicitly by mapping group to a discrete variable, e.g., aes(group = interaction(school_id, student_id))
    • See example in Table 2 and Figure 4
Table 2: Data: Oxboys, heights of boys in Oxford (first 5 rows)
Subject | age     | height | Occasion
1       | -1.0000 | 140.5  | 1
1       | -0.7479 | 143.4  | 2
1       | -0.4630 | 144.8  | 3
1       | -0.1643 | 147.1  | 4
1       | -0.0027 | 147.7  | 5
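# Oxboys ships with the nlme package
data(Oxboys, package = "nlme")

# Without grouping: a single line runs through all observations
ggplot(Oxboys, aes(age, height)) +
  geom_point() +
  geom_line()

# With grouping: one line per boy
ggplot(Oxboys, aes(age, height, group = Subject)) +
  geom_point() +
  geom_line()
(a) Grouping variable not specified
(b) Grouping variable specified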
Figure 4: Individual vs. collective geoms
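The interaction() variant of the group aesthetic can be sketched with a small made-up data frame. The data, the object name scores and the variables school_id, student_id, occasion and score are hypothetical and only serve to show the mechanics.

```r
library(ggplot2)

# Hypothetical data: 2 schools x 2 students x 3 measurement occasions
scores <- expand.grid(school_id = 1:2, student_id = 1:2, occasion = 1:3)
scores$score <- 50 + 5 * scores$occasion + 3 * scores$school_id + scores$student_id

# The ids are numeric (not discrete), so without group all rows form one line
p_one_line <- ggplot(scores, aes(x = occasion, y = score)) +
  geom_line()

# One line per student within each school
p_by_student <- ggplot(scores,
                       aes(x = occasion, y = score,
                           group = interaction(school_id, student_id))) +
  geom_line()

p_one_line
p_by_student
```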

5 Scales

  • Scales in ggplot2 control the mapping from data to aesthetics
    • Turn data into something that you can see, like size, colour, position or shape
    • Provide tools to interpret plot: the axes and legends
  • Ggplot2 automatically produces guides based on the layers in your plot
  • Scales toolbox divides scales into three main groups
    • 1 Position scales and axes
    • 2 Colour scales and legends
      • Pick colours wisely, e.g., no red/green (see ColorBrewer)
    • 3 Scales for other aesthetics, e.g., size, shape, line width, line type etc.

5.1 Axes

  • We can modify axes using various functions as depicted in Figure 5. The data is shown in Table 3.
  • xlab() and ylab(): Set axis labels
  • scale_x_continuous(): Manually set a numeric scale (same for y)
    • scale_x_discrete(): Manually set categorical scale
    • scale_x_*()/scale_y_*(): Various other options
  • xlim() and ylim(): Set limits
  • seq(): helper function to create breaks, e.g., seq(0, 10, 2)
Table 3: Data: German politicians on Twitter
screen_name   | n_retweets | followers_count | party    | party_color | first_name | account_created_at  | account_age_months | account_age_years | last_name   | female
MartinSchulz  | 250        | 693125          | SPD      | red         | Martin     | 2008-11-27 11:49:00 | 137.15303          | 11.427730         | Schulz      | 0
SWagenknecht  | 3178       | 430367          | DieLinke | deeppink    | Sahra      | 2009-06-15 18:33:09 | 130.54368          | 10.877897         | Wagenknecht | 1
c_lindner     | 2182       | 397943          | FDP      | gold        | Christian  | 2010-03-11 17:11:51 | 121.67889          | 10.140617         | Lindner     | 0
HeikoMaas     | 3942       | 372902          | SPD      | red         | Heiko      | 2009-03-13 12:37:00 | 133.61859          | 11.135660         | Maas        | 0
GregorGysi    | 1974       | 362697          | DieLinke | deeppink    | Gregor     | 2012-10-18 09:20:12 | 90.45648           | 7.537416          | Gysi        | 0
peteraltmaier | 103        | 268181          | CDU_CSU  | black       | Peter      | 2011-09-23 19:00:07 | 103.27639          | 8.604622          | Altmaier    | 0
# data_twitter_influence.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                                 "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"),
#                         col_types = cols())

data <- read_csv("data/data_twitter_influence.csv",
                 col_types = cols())

ggplot(data %>% filter(followers_count<50000), 
       aes(x = account_age_years, 
           y = followers_count, 
           color = factor(female))) + 
           geom_point(alpha =0.9) + 
  ylab("Number of followers") +
  xlab("Account age (in years)") +
  scale_x_continuous(breaks = c(0, 5, 10), limits = c(0,10)) +
  xlim(0,5) # Replaces the x scale (breaks and limits) set in the line above
Figure 5: Modifying scales and axes

5.2 Axes: Exercise

  1. Using the code from Figure 5, try to recreate Figure 6 below. Among other things, you have to modify scale_x_continuous() and scale_y_continuous().
Figure 6: Modifying scales and axes
Exercise solution
data <- read_csv("data/data_twitter_influence.csv",
                 col_types = cols())

ggplot(data %>% filter(followers_count<50000), 
       aes(x = account_age_years, 
           y = followers_count, 
           color = factor(female))) + 
           geom_point(alpha =0.9) + 
  ylab("Number of followers (absolute)") +
  xlab("Account age (YEARS)") +
  scale_x_continuous(breaks = seq(0,10, 2), limits = c(0,10))  +
  scale_y_continuous(breaks = seq(0,15000, 2500), limits = c(0,15000))

5.3 Aesthetic scales (1): Color, Size, Shape

  • colour, size and shape: Additional aesthetics that can be specified in aes(...)
  • Ggplot2 takes care of mapping data to aesthetic scale
    • 1 scale for each aesthetic mapping (i.e., for each variable)
  • Scale creates a guide (axis or legend) for reading/decoding
    • Decoding = Convert aesthetic values to data values
    • Default scales vs. manual scales (…we’ll see later)
    • Use scale_color_manual() to manually set the color scale (same logic for other scales; see ?scale_color_manual)
  • Setting an aesthetic to fixed value is done outside of aes(...). Compare…
    • ggplot(economics, aes(date, unemploy)) + geom_point(aes(colour ="blue"))
    • ggplot(economics, aes(date, unemploy)) + geom_point(colour ="blue")
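The difference between the two statements can be made explicit in a short sketch (economics ships with ggplot2; the object names p_mapped and p_set are introduced here for illustration).

```r
library(ggplot2)

# Mapped inside aes(): "blue" is treated as a data value with a single level.
# The colour scale assigns it a default palette colour and draws a legend.
p_mapped <- ggplot(economics, aes(date, unemploy)) +
  geom_point(aes(colour = "blue"))

# Set outside aes(): the points are drawn in blue and no legend is created.
p_set <- ggplot(economics, aes(date, unemploy)) +
  geom_point(colour = "blue")

p_mapped
p_set
```

Calling layer_data(p_mapped) and layer_data(p_set) shows the colour values that are actually drawn in each case.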

5.4 Aesthetic scales (2): Color, Size, Shape

  • Figure 7 visualizes data of German politicians’ Twitter followers and retweets (April 2020) and provides a simple example of a color scale.
# data_twitter_influence.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                                 "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"))

data <- read_csv("data/data_twitter_influence.csv",
                 col_types = cols())

data <- data %>% 
        filter(n_retweets < mean(n_retweets), # What happens here?
        followers_count < mean(followers_count))

ggplot(data, 
       aes(x = followers_count, 
           y = n_retweets, 
           colour = party)) + 
       geom_point() +
             scale_colour_manual(name = "Partei",
                                 values = c("DieLinke" = "magenta", 
                                                     "FDP" = "orange",
                                                     "CDU_CSU" = "black",
                                                     "SPD" = "red",
                                                     "Greens" = "darkgreen", 
                                                     "AfD" = "blue")) # Good colors?
Figure 7: A simple color scale

5.5 Aesthetic scales (3): Change legend order

  • If we want to change the order of the legend we have to convert the corresponding variable to a factor as shown for Figure 8.
# data_twitter_influence.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                                 "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"))

data <- read_csv("data/data_twitter_influence.csv",
                 col_types = cols())

# Change datatype of party to ordered factor
data <- data %>%
    mutate(party = factor(party, ordered = TRUE,
                        levels = c("DieLinke", "FDP",
                                           "CDU_CSU", "SPD",
                                           "Greens", "AfD")))


data <- data %>% 
        filter(n_retweets < mean(n_retweets), # What happens here?
        followers_count < mean(followers_count))

ggplot(data, 
       aes(x = followers_count, 
           y = n_retweets, 
           colour = party)) + 
       geom_point() +
             scale_colour_manual(name = "Partei",
                                 values = c("DieLinke" = "magenta", 
                                                     "FDP" = "orange",
                                                     "CDU_CSU" = "black",
                                                     "SPD" = "red",
                                                     "Greens" = "darkgreen", 
                                                     "AfD" = "blue")) # Good colors?
Figure 8: A simple color scale

5.6 Aesthetic scales (4): Exercise 1

  • Q: Just looking at the code below (not running it!), what would Figure 7 look like if we additionally set size = followers_count and shape = party within aes(...) as we do below?
    • How many variables/mappings are we dealing with now? (How many would we need?)
    • How many legends are created?
# data_twitter_influence.csv
data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
                                "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"),
                        col_types = cols()) %>% 
        filter(n_retweets < mean(n_retweets), # What?
               followers_count < mean(followers_count))
ggplot(data, 
       aes(x = followers_count, 
           y = n_retweets, 
           colour = party,
           size = followers_count,
           shape = party)) + 
    geom_point()
Exercise solution
  • How many variables/mappings are we dealing with now? (How many would we need?)
    • 3 variables and 5 mappings. We would only need 3 mappings (= number of variables).
  • How many legends are created?
    • 2 legends are created.

5.7 Aesthetic scales (4): Exercise 2

  1. Load the data on twitter influence with data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download", "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6")) or data <- read_csv("data/data_twitter_influence.csv").
  2. Inspect the dataset with str().
  3. Please rebuild Figure 9 below.
    • Hint: You might need to create a factor variable with factor() and it might be necessary to filter on the variable followers_count. Besides you might need the boxplot geom.
Figure 9: Aesthetic Attributes 2
  1. Pick three different variables (or more) out of data and map them to different (aesthetic) scales. Can you find anything interesting?
Exercise solution
# 1.
data <- read_csv("data/data_twitter_influence.csv", col_types = cols())

# 2.
str(data)

# 3.
ggplot(data, 
       aes(x = party, 
           y = followers_count, 
           colour = party,
           shape = factor(female))) + 
    geom_point()

# 4.
# The color scale because it is encoded in the x-axis (= x-scale) 

# 5. 
# Various solutions

6 Data labels & Annotations

6.1 Data labels & Annotations (1)

  • Labels
    • geom_text(): Add text, e.g., labels as in Figure 10
  • Annotations
    • geom_text(): Add text descriptions or to label points
      • Most plots won’t benefit from labelling every observation, but labeling outliers/important points is useful
    • geom_rect(): to highlight interesting rectangular regions of the plot
      • has aesthetics xmin, xmax, ymin and ymax
    • geom_vline(), geom_hline() and geom_abline(): Add reference lines
    • annotate(): Add a single annotation (e.g., "text" or "rect") whose position is given as fixed values rather than mapped from the data
Figure 10: Labels and annotations

6.2 Data labels & Annotations (2)

  • Figure 11 illustrates various geoms for labeling and annotations.
  • geom_text_repel(): Automatic label location available from the ggrepel package (install.packages("ggrepel"))
    • See the ggrepel vignette
    • Delete labels beforehand in the data (Q: Another solution?)
    • The use of hjust and vjust is explained in this wonderful answer.
# data_twitter_influence.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                                 "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"))

library(ggrepel) # provides geom_text_repel()

data <- read_csv("data/data_twitter_influence.csv",
                 col_types = cols())

# Modify data
data$last_name[data$followers_count<50000] <- NA # Explain! (delete labels)

# Visualize
ggplot(data = data,
       aes(x = followers_count, 
           y = n_retweets, 
           colour = party)) + 
    geom_point() + 
    geom_text_repel(aes(label = last_name, color = party), 
                      segment.color = 'red', 
                      size = 3) +
    geom_vline(aes(xintercept = mean(followers_count))) +
    geom_hline(aes(yintercept = mean(n_retweets))) +
    annotate("rect", xmin = 300000, xmax = 700000,
              ymin = 30000, ymax = 50000,
              color = "#386cb0",
              fill = "#386cb0",
             alpha = 0.1) +
  annotate("text", 
           label = "No outliers\n in this area!", 
           x = 500000, y = 40000, 
           size = 5, 
           colour = "#386cb0",
           hjust = 0.5,
           vjust = 0.5) + 
             scale_colour_manual(name = "Partei",
                                 values = c("DieLinke" = "magenta", 
                                                     "FDP" = "orange",
                                                     "CDU_CSU" = "black",
                                                     "SPD" = "red",
                                                     "Greens" = "darkgreen", 
                                                     "AfD" = "blue"))

# + scale_x_continuous(breaks = seq(0,700000, 100000),
#                      labels = paste(seq(0,700, 100), "tsd", sep="")) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Figure 11: Labels and annotations

6.3 Data labels & Annotations (3): Exercise

  • Use the code from Figure 11 and try to build Figure 12 below.
    • Hint: The graph only shows labels for persons that have a high value on one of the axes (but not necessarily the other one). First, use the code below to keep the last names of these persons (it creates two helper variables, last_name1 and last_name2, which are then combined to replace the old last_name).
    • To change the background theme use + theme_light()
    • Also you will need both annotate("rect",...) and annotate("text",...)
    • Beware: x-Scale is shown in scientific notation, e.g., the real value behind 4e+05 is 400000 and you have to pick xmin, xmax accordingly.
    • The colors used for the boxes are #386cb0 and #984ea3.
# data_twitter_influence.csv
data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
                                "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"),
                        col_types = cols())


# Modify labeling data for graph
data <- data %>%
    mutate(last_name1 = ifelse(followers_count < 300000, NA, last_name),
                 last_name2 = ifelse(n_retweets < 5000, NA, last_name),
                 last_name = coalesce(last_name1, last_name2))

head(data %>% select(last_name, party, followers_count, n_retweets))



Figure 12: Labels and annotations
Exercise solution
data <- read_csv("data/data_twitter_influence.csv", col_types = cols())
data <- data %>%
    mutate(last_name1 = ifelse(followers_count < 300000, NA, last_name),
                 last_name2 = ifelse(n_retweets < 5000, NA, last_name),
                 last_name = coalesce(last_name1, last_name2))


ggplot(data = data,
       aes(x = followers_count, 
           y = n_retweets, 
           colour = party)) + 
    geom_point() + 
    geom_text_repel(aes(label = last_name), 
                      segment.color = 'grey50', 
                      size = 4) +
             scale_colour_manual(name = "Partei",
                                 values = c("DieLinke" = "magenta", 
                                                     "FDP" = "orange",
                                                     "CDU_CSU" = "black",
                                                     "SPD" = "red",
                                                     "Greens" = "darkgreen", 
                                                     "AfD" = "blue")) +
    geom_vline(aes(xintercept = median(followers_count))) +
    geom_hline(aes(yintercept = median(n_retweets))) +
    annotate("rect", xmin = 0, xmax = 700000,
              ymin = 25000, ymax = 50000,
              alpha = 0.1,
              fill="#386cb0") +
  annotate("text", 
           label = "Outliers:\n Retweets", 
           x = 150000, y = 40000, 
           size = 5, 
           colour = "#386cb0",
           hjust = 0.5,
           vjust = 0.5) +
    annotate("rect", xmin = 300000, xmax = 700000,
              ymin = 0, ymax = 50000,
              alpha = 0.1,
              fill="#984ea3") +
  annotate("text", 
           label = "Outliers:\n Followers", 
           x = 500000, y = 15000, 
           size = 5, 
           colour = "#984ea3",
           hjust = 0.5,
           vjust = 0.5) +
  theme_light()

7 Theme & Captions

7.1 Theme (1)

  • Theme system allows fine control over the non-data elements of a plot
    • Does not change rendering (geoms) or transformation (scales), i.e., perceptual properties
    • Helps to make the plot aesthetically pleasing: control fonts, ticks, panel strips etc.
  • Separation between data and non-data parts
    1. Determine how data is displayed when creating plot
    2. Edit details of rendering using theming system
  • Components (× 4)
    • Theme elements: Specify non-data elements to control, e.g., plot.title, axis.ticks.x, legend.key.height
    • Element functions: Describe visual properties of an element, e.g., element_text() sets font size, color etc. of text elements
    • theme() function: Can be used to override default theme elements
      • theme(plot.title = element_text(colour = "red"))
    • Complete themes, e.g., theme_light() (check them out)

7.2 Theme (2): Exercise

  • Q: Run the code below and try to modify the elements marked with # ?. What do the different theme specifications (marked by # ?) in the code change?
# data_twitter_influence.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                                 "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"))

data <- read_csv("data/data_twitter_influence.csv",
                 col_types = cols())
data <- data %>% filter(followers_count<50000)

ggplot(data, 
       aes(x = account_age_years, 
           y = followers_count, 
           color = factor(female))) + 
           geom_point(alpha =0.7) + 
  scale_colour_manual(name = "Gender",
                     values = c("0" = "#d95f02", "1" = "#1f78b4"),
                     labels = c("0" = "male", "1" = "female")) +
  ylab("Number of followers") +
  xlab("Account age (in years)") +
  ggtitle("Twitter: Account age, follower numbers and gender") +
  theme_bw() +
  theme(
    plot.title = element_text(face = "bold", size = 12), # ?
    legend.background = element_rect(
      fill = "white", # ?
      size = 4, colour = "white"
    ),
    legend.justification = c(1, 1), # ?
    legend.position = c(1, 1), # ?
    axis.ticks = element_line(colour = "grey70", size = 0.2), # ?
    panel.grid.major = element_line(colour = "grey70", size = 0.2), # ?
    panel.grid.minor = element_blank() # ?
  ) #+
  #theme_light() # ?

7.3 Theme (3): Storing & re-using themes + fonts

  • We may want to re-use a theme across our organization (e.g., the BKA)
  • First we need to install the respective fonts (here: Times New Roman). This can take several minutes.
install.packages("extrafont")
remotes::install_version("Rttf2pt1", version = "1.3.8")
library(extrafont) # load package
font_import(pattern = "times") # Import system fonts
# AFFIRM with y + enter

loadfonts(device="win") # Register fonts so they can be used for output devices
fonts() # show registered fonts
  • Below we store the theme with font Times New Roman and apply it to a graph.
# Load the data
# data_twitter_influence.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                                 "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"))
data <- read_csv("data/data_twitter_influence.csv",
                 col_types = cols())
data <- data %>% filter(followers_count<50000)


# Store the theme
bka_theme <- theme_light() + # use before theme()!
  theme(text=element_text(family="Times New Roman",
                                     face="bold", 
                                     size=12))


# Store the plot and show it
p1 <- ggplot(data, 
       aes(x = account_age_years, 
           y = followers_count, 
           color = factor(female))) + 
           geom_point(alpha =0.7) + 
  scale_colour_manual(name = "Gender",
                     values = c("0" = "#d95f02", "1" = "#1f78b4"),
                     labels = c("0" = "male", "1" = "female")) +
  ylab("Number of followers") +
  xlab("Account age (in years)") +
  ggtitle("Twitter: Account age, follower numbers and gender")

# Show the plot
p1

# Combine plot with stored theme
p1 + bka_theme


# Store them locally (working directory)
# TimesNewRoman only works when loaded like above
save(bka_theme, file = "bka_theme.RData")
load(file = "bka_theme.RData")

7.4 Theme (4): Legend position

  • Specify position with theme(legend.position="left") (see here)
    • Replace with one of right, top, bottom, c(0.5, 0.5) (move into plot)
    • legend.position=c(0.5, 0.5): Position coordinate on the plot, e.g., middle = c(0.5, 0.5)
      • lower-left = c(0, 0), upper-right = c(1, 1), etc.
  • legend.justification = c(0, 1): Legend justification from position coordinate
    • c(0, 0) = Position coordinate is lower left corner, c(1, 1) = upper right corner, c(0, 1) = upper left corner, c(1, 0) = lower right corner
  • legend.box = "horizontal": Arrange multiple legends horizontally (side by side)
  • Below a simple example:
# data_twitter_influence.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                                 "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"))
data <- read_csv("data/data_twitter_influence.csv",
                 col_types = cols())
ggplot(data = data,
       aes(x = followers_count, 
           y = n_retweets, 
           colour = party)) + 
    geom_point() +
  theme(legend.position="bottom")

7.5 Theme (5): Legend position exercise

  • Exercise: Take the code from above. Do you manage to position the legend as below using the arguments legend.position and legend.justification?

Exercise solution
ggplot(data = data,
       aes(x = followers_count,
           y = n_retweets,
           colour = party)) +
  geom_point() +
  theme(legend.position = c(1, 1),
        legend.justification = c(1, 1))
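
A note for newer ggplot2 releases: from version 3.5.0 on, a numeric legend.position is deprecated in favour of legend.position = "inside" combined with legend.position.inside. The sketch below assumes ggplot2 >= 3.5.0 and uses the built-in mpg data as a stand-in for the Twitter data, so the variable names differ from the exercise.

```r
library(ggplot2)

# Assumes ggplot2 >= 3.5.0; mpg is a built-in example dataset (stand-in data)
p_inside <- ggplot(data = mpg,
                   aes(x = displ,
                       y = hwy,
                       colour = drv)) +
  geom_point() +
  theme(legend.position = "inside",        # draw the legend inside the panel
        legend.position.inside = c(1, 1),  # position coordinate: upper-right
        legend.justification = c(1, 1))    # anchor the legend's upper-right corner there

p_inside
```

On older ggplot2 versions the solution above with a numeric legend.position remains the way to go.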

7.6 Titles & Captions (1)

  • Functions to add titles/captions
    • ggtitle("...", subtitle = "..."): Adding title and subtitle with ggtitle()
    • labs(title = "...", subtitle = "...", caption = "..."): Adding title, subtitle and caption with labs()
  • Modifying titles/captions in theme()
    • element_text(): Change appearance and position of text
    • Q: How does the code below change title, subtitle and caption?
+ theme(
  plot.title = element_text(color = "black", size = 14, face = "bold"),
  plot.subtitle = element_text(color = "black", size = 12),
  plot.caption = element_textbox_simple(color = "black", face = "italic", hjust = 0)
  )
  • element_textbox_simple() (ggtext package): Wraps text to fit the plot width
  • Tip: If you add the figure notes (caption = "...") directly to the plot, they travel with the plot whenever it is shared.

7.7 Titles & Captions (2): Exercise

  • Please modify and add to the code below so that you reach the graph that is visualized.
# data_twitter_influence.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                                 "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"))
options(scipen = 999) # ?
library(ggtext)
data <- read_csv("data/data_twitter_influence.csv",
                 col_types = cols())

ggplot(data = data,
       aes(x = followers_count,
           y = n_retweets,
           colour = party)) +
  geom_point() +
  labs(title = ...,
       subtitle = ...,
       caption = ...,
       x = "Number of followers",
       y = "Number of retweets") +
  theme(
    legend.position = "bottom",
    plot.title = element_text(color = ..., size = 14, face = "bold"),
    plot.subtitle = element_text(color = ..., size = 12),
    plot.caption = element_textbox_simple(color = ..., face = "italic", hjust = 0)
  )

Exercise solution
# data_twitter_influence.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                                 "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"))
options(scipen = 999) # ?
library(ggtext)
data <- read_csv("data/data_twitter_influence.csv",
                 col_types = cols())
ggplot(data = data,
       aes(x = followers_count,
           y = n_retweets,
           colour = party)) +
  geom_point() +
  labs(title = "Politicians' influence on Twitter",
       subtitle = "Followers vs. retweets",
       caption = "Note: The graph shows politicians (= unit of analysis) and their number of followers (x-axis) and retweets (y-axis, summed over one month) in April 2020. Each politician's party is encoded through color.",
       x = "Number of followers",
       y = "Number of retweets") +
  theme(
    legend.position = "bottom",
    plot.title = element_text(color = "black", size = 14, face = "bold"),
    plot.subtitle = element_text(color = "black", size = 12),
    plot.caption = element_textbox_simple(color = "black", face = "italic", hjust = 0)
  )
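
The options(scipen = 999) line in the code above is marked with a question mark. A small base-R sketch of what it does: scipen is a penalty that R applies before switching to scientific notation, so a large value keeps numbers in fixed notation, which typically also carries over to axis labels for large follower counts. The number used below is an arbitrary example value.

```r
# Temporarily set R's default penalty and remember the previous options
old_opts <- options(scipen = 0)
default_label <- format(100000)  # scientific notation is shorter, so R picks it

# Heavily penalise scientific notation
options(scipen = 999)
fixed_label <- format(100000)    # fixed notation is kept

# Restore the previous setting
options(old_opts)

print(c(default_label, fixed_label))
```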

8 Putting it all together
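
The full example in this section labels only the most prominent politicians. A minimal sketch of that labelling step on hypothetical toy data (the names and numbers in toy are invented for illustration; the thresholds are the ones used in this section): ifelse() blanks out a name that fails a threshold, and coalesce() keeps the first non-missing value, so a name survives if either condition holds.

```r
library(dplyr)

# Hypothetical toy data, not taken from data_twitter_influence.csv
toy <- data.frame(last_name       = c("A", "B", "C"),
                  followers_count = c(200000, 1000, 5000),
                  n_retweets      = c(100, 5000, 10),
                  stringsAsFactors = FALSE)

toy <- toy %>%
  mutate(last_name1 = ifelse(followers_count < 150000, NA, last_name), # keep if many followers
         last_name2 = ifelse(n_retweets < 3000, NA, last_name),        # keep if many retweets
         label = coalesce(last_name1, last_name2))                     # first non-missing value

toy$label
```

Rows whose label is NA are simply left without a text label by geom_text_repel() (with a warning about removed rows).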

# data_twitter_influence.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                                 "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"))
options(scipen = 999) # ?
library(ggtext)
library(ggrepel)
data <- read_csv("data/data_twitter_influence.csv",
                 col_types = cols()) %>%
  filter(last_name != "Lauterbach") # Omit Lauterbach

# Modify data: keep a name only for politicians with many followers
# (>= 150000) or many retweets (>= 3000); all others get NA (no label)
data <- data %>%
  mutate(last_name1 = ifelse(followers_count < 150000, NA, last_name),
         last_name2 = ifelse(n_retweets < 3000, NA, last_name),
         last_name = coalesce(last_name1, last_name2))

ggplot(data = data,
       aes(x = followers_count,
           y = n_retweets,
           colour = party)) +
  geom_point() +
  # Highlight AfD politicians with many retweets using hollow circles
  geom_point(data = data %>% filter(party == "AfD", n_retweets >= 3000),
             aes(x = followers_count,
                 y = n_retweets),
             pch = 21,
             fill = NA, size = 4,
             colour = "blue", stroke = 1) +
  geom_text_repel(aes(label = last_name,
                      color = party),
                  segment.color = "red",
                  size = 3) +
  labs(title = "German politicians' influence on Twitter",
       subtitle = "The intriguing retweet strength of AfD politicians.",
       caption = "Note: The graph shows politicians (unit of analysis), their number of followers (x-axis, measured on 24 April 2020) and their number of retweets (y-axis, summed over March 2020). Each politician's party is encoded through color. Lauterbach was omitted from the graph.",
       x = "Number of followers",
       y = "Number of retweets") +
  theme_light() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(color = "black", size = 14, face = "bold"),
    plot.subtitle = element_text(color = "black", size = 12),
    plot.caption = element_textbox_simple(color = "black", face = "italic", hjust = 0)
  ) +
  scale_colour_manual(name = "Party",
                      values = c("DieLinke" = "magenta",
                                 "FDP" = "orange",
                                 "CDU_CSU" = "black",
                                 "SPD" = "red",
                                 "Greens" = "darkgreen",
                                 "AfD" = "blue"))

9 Faceting (small multiples)

9.1 Faceting (1)

  • Small multiples: Display a plot across values/categories of another variable
  • Aim of faceting: Increase information density while keeping readability
  • Types of faceting: Grid (facet_grid()) and wrapped (facet_wrap()) (Show Figure 14.1)
    • Q: According to which variable does Figure 13 split the data? (note the ~ in facet_wrap(~variable)) What can we see in the graph? How many variables have been mapped to how many scales?
ggplot(data %>% filter(followers_count < 50000), # Filter!
       aes(x = account_age_years,
           y = followers_count,
           color = party)) +
  geom_point() +
  facet_wrap(~party) +
  scale_colour_manual(name = "Party",
                      values = c("DieLinke" = "magenta",
                                 "FDP" = "orange",
                                 "CDU_CSU" = "black",
                                 "SPD" = "red",
                                 "Greens" = "darkgreen",
                                 "AfD" = "blue"))
Figure 13: Faceting according to the party variable
  • If the aim is to plot several independent plots next to each other, we have to resort to the patchwork package or the grid.arrange() function from the gridExtra package
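
A minimal sketch of the patchwork route mentioned above, using the built-in mpg data as a stand-in. It assumes the patchwork package is installed; wrap_plots() is one of its functions for arranging independent plots, and the requireNamespace() guard merely skips the step if the package is missing.

```r
library(ggplot2)

# Two independent plots (mpg is a built-in example dataset)
p_left  <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
p_right <- ggplot(mpg, aes(x = class)) + geom_bar()

# Arrange them next to each other (only if patchwork is available)
if (requireNamespace("patchwork", quietly = TRUE)) {
  combined <- patchwork::wrap_plots(p_left, p_right, ncol = 2)
  print(combined)
}
```

With gridExtra, gridExtra::grid.arrange(p_left, p_right, ncol = 2) is the comparable call.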

9.2 Faceting (2): Exercise

  1. Create a new R script. In this script, load the data either locally with data <- read_csv("data_twitter_influence.csv") (after downloading the file) or from the server with data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download", "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6")).
  2. What do you think happens if you try to facet by a continuous variable like followers_count? What about party? What’s the key difference?
  3. Take the subset of politicians that have fewer than 50000 followers (data %>% filter(followers_count<50000)) and use faceting (+ facet_grid(female ~ party)) to explore the four-way relationship between account_age_years, followers_count, female (0 = male) and party.
  4. Read the documentation for facet_wrap() (?facet_wrap). What arguments can you use to control how many rows and columns appear in the output?
  5. What does the scales argument in facet_wrap() do? When might you use it?18
Exercise solution
# 3.
ggplot(data %>% filter(followers_count<50000), 
       aes(x = account_age_years, 
           y = followers_count)) + 
  geom_point() + 
  facet_grid(party ~ female)

ggplot(data %>% filter(followers_count<50000), 
       aes(x = account_age_years, 
           y = followers_count, 
           color = factor(female))) + 
  geom_point() + 
  facet_wrap(~party)

10 Output & saving

  • print(): Render the plot on screen (inside a loop or function, you’ll need to call it yourself)
  • ggsave(): Save the plot to disk
    • plot =: Defaults to the last plot displayed; alternatively, provide a ggplot object
    • width =, height =: Plot size, in inches by default
      • DIN A4 paper is 8.27 × 11.69 inches, so use, e.g., width = 7
      • For other units, use, e.g., units = "cm"
    • device =: Output format, e.g., "eps", "pdf", "jpeg", "tiff", "png", "bmp", "svg"
    • dpi =: Set the plot resolution for raster formats (dpi = dots per inch; cf. ppi = pixels per inch)
    • PDFs are vector graphics: Preferable!
  • saveRDS(): Save a cached copy of the plot to disk
    • Saves a complete copy of the plot object (you can easily recreate the plot)
  • Use walk() (purrr package) in combination with ggsave() to store several formats
# data_twitter_influence.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                                 "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"))

data <- read_csv("data/data_twitter_influence.csv",
                 col_types = cols())
p <- ggplot(data,
       aes(x = followers_count, 
           y = n_retweets, 
           colour = party)) + 
    geom_point()
print(p)

ggsave(plot = p,
       filename = "plot3.pdf", # extension should match the device, e.g., "plot3.png"
       path = "www/images/test", # adapt path!
       width = 7,
       height = 4,
       device = "pdf", # e.g., change to "png"
       dpi = 300)
# Or simply provide the file name including the path via filename

Saving a plot object directly to the hard disk:

# .RData format
save(p, file = "myplot.RData")
load(file = "myplot.RData")

# RDS format
saveRDS(p, "www/images/plot.rds")  # adapt path!
q <- readRDS("www/images/plot.rds")  # adapt path!

Saving a plot in several formats at once:

# walk() comes from the purrr package (part of the tidyverse)
walk(c("pdf", "eps", "png"),
     ~ ggsave(filename = paste0("www/images/figure-in-different-formats.", .x),
              device = .x,
              plot = p))
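
A self-contained sketch of the width, height and units arguments listed above, using the built-in mpg data and a temporary directory so that no project folder is touched (out_file and rds_file are illustrative names). It also shows the saveRDS()/readRDS() round trip.

```r
library(ggplot2)

# mpg is a built-in example dataset (stand-in for the Twitter data)
p_demo <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# Save as PDF with the size given in centimetres instead of inches
out_file <- file.path(tempdir(), "plot_demo.pdf")
ggsave(filename = out_file,
       plot = p_demo,
       width = 16,
       height = 10,
       units = "cm")

# Cache the plot object itself and read it back
rds_file <- file.path(tempdir(), "plot_demo.rds")
saveRDS(p_demo, rds_file)
p_restored <- readRDS(rds_file)

file.exists(out_file)
```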

11 Add interactivity (ggplot + plotly)

  • ggplot objects can be converted to plotly objects with ggplotly() (plotly package)
    • The conversion works more or less well (see below)
    • See [Interactive data visualization: Plotly]
# data_twitter_influence.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                                 "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"))

data <- read_csv("data/data_twitter_influence.csv",
                 col_types = cols())

p1 <- ggplot(data, 
       aes(x = followers_count, 
           y = n_retweets, 
           colour = party,
            label = last_name)) + 
       geom_point()

# Turn interactive (ggplotly() comes from the plotly package)
library(plotly)
ggplotly(p1)
Figure 14: Making ggplot interactive
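
The label aesthetic above has no visible effect in the static plot; it is there so that the politician's name shows up in the hover tooltip. A sketch of how the tooltip content can be restricted via the tooltip argument of ggplotly(), using hypothetical toy data (the values in toy are invented) and assuming the plotly package is installed:

```r
library(ggplot2)

# Hypothetical toy data with the same variable names as the Twitter data
toy <- data.frame(followers_count = c(1000, 5000, 20000),
                  n_retweets      = c(10, 300, 150),
                  party           = c("A", "B", "A"),
                  last_name       = c("X", "Y", "Z"),
                  stringsAsFactors = FALSE)

p_toy <- ggplot(toy,
                aes(x = followers_count,
                    y = n_retweets,
                    colour = party,
                    label = last_name)) +
  geom_point()

# Show only the name and the two plotted values when hovering
# (skipped if plotly is not available)
if (requireNamespace("plotly", quietly = TRUE)) {
  p_interactive <- plotly::ggplotly(p_toy, tooltip = c("label", "x", "y"))
}
```

Typing p_interactive in an interactive session then opens the widget in the viewer or browser.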

Interactivity with small multiples in Figure 15:

# data_twitter_influence.csv
# data <- read_csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
#                                 "1dLSTUJ5KA-BmAdS-CHmmxzqDFm2xVfv6"))

data <- read_csv("data/data_twitter_influence.csv",
                 col_types = cols())

# Create plot
p2 <- ggplot(data %>% filter(followers_count<50000), # Filter!
       aes(x = account_age_years, 
           y = followers_count, 
            color = party, 
            label = screen_name)) + # Name shown in the hover tooltip
      geom_point() + 
    facet_wrap(~party)

# Turn interactive
ggplotly(p2)
Figure 15: Interactivity with small multiples

References

Healy, Kieran. 2018. Data Visualization: A Practical Introduction. Princeton University Press.
Wickham, Hadley. 2010. “A Layered Grammar of Graphics.” J. Comput. Graph. Stat. 19 (1): 3–28.
———. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer.
Wilkinson, Leland. 2013. The Grammar of Graphics. Springer Science & Business Media.

Footnotes

  1. Objective: Describe all features underlying statistical graphs↩︎

  2. Focuses on the primacy of layers and adapts it for R↩︎

  3. Normally Cartesian.↩︎

  4. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).↩︎

  5. From the inside of brackets to the outside. Then from line(+) to line(+).↩︎

  6. Execute it line by line (or from inside to outside)! Same is true for dplyr code!↩︎

  7. Either by its name directly or, often, by knowing what an abbreviation stands for.↩︎

  8. Use line breaks, spaces etc. Check out the tidyverse styleguide.↩︎

  9. See plotly and other packages.↩︎

  10. Here, raw means that we did not summarize the data, e.g., we did not aggregate it or summarize it through statistical models.↩︎

  11. Tells the ggplot function to visualize the data “as is”, e.g., we could use geom_bar() and either feed it data that is then summarized/aggregated by the geom_bar() function or we feed it data that we summarized ourselves beforehand and tell geom_bar() not to summarize it.↩︎

  12. This dataset contains a subset of the fuel economy data that the EPA makes available. cty = city miles per gallon. hwy = highway miles per gallon↩︎

  13. A dataset containing the prices and other attributes of almost 54,000 diamonds. carat = weight of the diamond (0.2–5.01). price = price in US dollars.↩︎

  14. This dataset was produced from US economic time series data available. date = Month of data collection. unemploy = number of unemployed in thousands.↩︎

  15. When using geom_histogram(), ggplot produces/calculates values for the y-axis itself. This dataset contains a subset of the fuel economy data that the EPA makes available. cty = city miles per gallon↩︎

  16. ?Oxboys: These data are described in Goldstein (1987) as data on the height of a selection of boys from Oxford, England versus a standardized age.↩︎

  17. Data includes German MPs that have a Twitter account (not complete). The number of retweets refers to all retweets in March 2020. The number of followers was measured on 24th of April 2020.↩︎

  18. Determines whether scales are fixed across all plots or not.↩︎