4 The Anatomy of ggplot

In a typical workflow, a call to ggplot() looks something like this:

data %>%
  ggplot(aes(x = map data components to graph components)) +
  geom_xxx(arguments to modify the appearance of the geom) +
  ... +
  theme_xxx(arguments to change the overall appearance) +
  labs(add axis-labels and a title)

where aes(...) stands for aesthetics.

Dataset variables go inside the aes() function, which in turn goes inside ggplot().
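As a concrete instance of this template, here is a minimal sketch. It assumes the built-in mtcars data frame as a stand-in for `data` (it is not the dataset used in this chapter), and the labels are made up for illustration.

```r
library(ggplot2)
library(dplyr)   # provides the %>% pipe

# mtcars ships with R; it plays the role of `data` in the template
p <- mtcars %>%
  ggplot(aes(x = wt, y = mpg)) +        # map columns to the x and y axes
  geom_point(size = 2) +                # a geom, with an argument that changes its appearance
  theme_bw() +                          # overall appearance
  labs(x = "Weight (1000 lbs)",         # axis labels and a title
       y = "Miles per gallon",
       title = "Fuel economy by weight")

p  # printing the object draws the plot
```

Storing the plot in an object such as `p` is optional; the chapter's examples print the plot directly.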

Once the initial ggplot() call is in place, we start adding (+) geoms or other layers. In general, there are several steps to consider:

  1. Initializing the canvas and defining variables,
  2. Adding points,
  3. Specifying further variables inside aes(),
  4. Specifying aesthetics outside aes(),
  5. Faceting,
  6. Changing the plot theme,
  7. Adding or modifying layers.

We will now explain these seven considerations with our example dataset:


library('tidyverse')
#> -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
#> v ggplot2 3.3.3     v purrr   0.3.4
#> v tibble  3.1.2     v dplyr   1.0.6
#> v tidyr   1.1.3     v stringr 1.4.0
#> v readr   1.4.0     v forcats 0.5.1
#> -- Conflicts ------------------------------------------ tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()

A_DATA <- readRDS('C:/Users/jkyle/Documents/GitHub/Jeff_Data_Wrangling/Week_2/DATA/A_DATA.RDS')

4.1 Initializing the canvas and defining variables

Start by defining the variables, e.g., ggplot(aes(x = var1, y = var2)):

A_DATA %>% 
  ggplot(aes(x = Gender, y = Age))

That gives us an empty grid with labelled axes. Notice that there are no points, because we didn’t add (+) any!
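One way to confirm that the canvas really is empty is to inspect the plot object. This is a small sketch that assumes the built-in mtcars data frame in place of A_DATA:

```r
library(ggplot2)

# Stand-in data: mtcars instead of A_DATA
canvas <- ggplot(mtcars, aes(x = wt, y = mpg))

# The list of layers is empty because no geom has been added yet
length(canvas$layers)

canvas  # draws axes and grid lines, but no points
```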

Your workflow may look like:

data %>% 
  filter(...) %>% 
  mutate(...) %>% 
  ggplot(aes(...)) +

The lines that come before the ggplot() function are piped, whereas from ggplot() onwards you have to use + to signify that we are now adding layers and customizations to the same plot.
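A runnable version of this pattern might look like the following sketch. The dataset (mtcars), the filter condition, and the new column name wt_kg are all assumptions chosen for illustration.

```r
library(ggplot2)
library(dplyr)

p <- mtcars %>%
  filter(cyl != 8) %>%                  # data steps are chained with %>%
  mutate(wt_kg = wt * 453.592) %>%      # wt is recorded in units of 1000 lbs
  ggplot(aes(x = wt_kg, y = mpg)) +     # from ggplot() onwards, use +
  geom_point()

p
```

The filtered and mutated data frame is what ggplot() receives, so the new column wt_kg can be used inside aes() straight away.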

4.2 Adding points

Let’s ask ggplot() to draw a point for each observation by adding geom_point():

A_DATA %>% 
  ggplot(aes(x = AGE_AT_DIAG_DM2, y = Age)) +
  geom_point()
#> Warning: Removed 94638 rows containing missing values (geom_point).

We have now created a scatter plot.

If we copy the above code and change just one thing - the x variable from AGE_AT_DIAG_DM2 to Gender (which is a categorical variable) - we get what’s called a strip plot.

This means we are now plotting a continuous variable (Age) against a categorical one (Gender). The thing to note is that the rest of the code stays exactly the same; all we changed was the x = argument:

A_DATA %>% 
  ggplot(aes(x = Gender, y = Age)) +
  geom_point()

4.3 Specifying further variables inside aes()

Let’s use DIABETES to give the points some color. We can do this by adding color = DIABETES inside the aes():

A_DATA %>% 
  ggplot(aes(x = Gender, y = Age, color = DIABETES)) +
  geom_point()

It uses the default color scheme and will automatically include a legend.

Notice that R treats DIABETES as numerical, so it plots the color on a continuous scale. We can force R to treat it as a factor instead:

A_DATA %>% 
  ggplot(aes(x = Gender, y = Age, color = as.factor(DIABETES))) +
  geom_point()
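The same numeric-versus-factor behaviour can be reproduced on a small built-in dataset. The sketch below assumes mtcars, where cyl is stored as a number, as an analogue for DIABETES:

```r
library(ggplot2)

# cyl is numeric in mtcars, so ggplot would use a continuous color gradient
is.numeric(mtcars$cyl)

p_numeric <- ggplot(mtcars, aes(x = wt, y = mpg, color = cyl)) +
  geom_point()

# Wrapping the column in as.factor() yields one distinct color per level
p_factor <- ggplot(mtcars, aes(x = wt, y = mpg, color = as.factor(cyl))) +
  geom_point()

levels(as.factor(mtcars$cyl))
```

Printing p_numeric and p_factor side by side shows the legend switching from a color bar to discrete keys.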

4.4 Specifying aesthetics outside aes()

It is very important to understand the difference between including ggplot arguments inside or outside of the aes() function.

The main aesthetics (things we can see) are x, y, color, fill, shape, and size, and any of these can appear inside or outside the aes() function. Use ?geom_point to see the full list of aesthetics that can be used with geom_point().

Variables (columns of your dataset) have to be defined inside aes(), whereas to apply a modification to everything, we set an aesthetic to a constant value outside of aes().
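The distinction can be checked directly on a toy plot. This sketch assumes the built-in mtcars data, and the color "blue" is an arbitrary choice; layer_data() returns the data ggplot computed for a layer.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg))

# Inside aes(): color is mapped to a column, so it varies by group and gets a legend
p_mapped <- base + geom_point(aes(color = as.factor(cyl)))

# Outside aes(): color is a constant applied to every point, with no legend
p_constant <- base + geom_point(color = "blue")

# Every point in the constant version carries the same color value
unique(layer_data(p_constant)$colour)
```

A common slip is to write aes(color = "blue"), which maps the literal string as if it were a one-level variable rather than setting the color.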

The default shape used by geom_point() is number 19, a solid circle.

To make all of the points in our figure hollow, let’s set their shape to 1. We do this by adding shape = 1 inside the geom_point():

A_DATA %>% 
  ggplot(aes(x = Gender, y = Age, color = as.factor(DIABETES))) +
  geom_point(shape = 1)

Let’s look at some of the other shapes in ggplot. While there are 26 possible shapes in total, having too many on one graph can be confusing, so the default shape palette handles at most 6:

A_DATA %>% 
  mutate(age_ntile = as.factor(ntile(Age,6))) %>%
  ggplot(aes(x = Gender, y = Age, color = as.factor(DIABETES), shape = age_ntile)) +
  geom_point() 

Any more and R believes you’re getting out of control (you Rebel) and need to rethink what you are trying to display:

A_DATA %>% 
  mutate(age_ntile = as.factor(ntile(Age,7))) %>%
  ggplot(aes(x = Gender, y = Age, color = as.factor(DIABETES), shape = age_ntile)) +
  geom_point() 
#> Warning: The shape palette can deal with a maximum of 6 discrete values because
#> more than 6 becomes difficult to discriminate; you have 7. Consider
#> specifying shapes manually if you must have them.
#> Warning: Removed 14473 rows containing missing values (geom_point).

It did not add a 7th shape, and it provided a friendly warning:

The shape palette can deal with a maximum of 6 discrete values because more than 6 becomes difficult to discriminate; you have X. Consider specifying shapes manually if you must have them.
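If more than 6 shapes really are needed, the warning's suggestion can be followed with scale_shape_manual(). The sketch below is only an illustration: it assumes the built-in mtcars data, a hypothetical grouping column mpg_ntile, and an arbitrary choice of seven shape codes.

```r
library(ggplot2)
library(dplyr)

p_shapes <- mtcars %>%
  mutate(mpg_ntile = as.factor(ntile(mpg, 7))) %>%   # 7 groups: one more than the default palette allows
  ggplot(aes(x = wt, y = mpg, shape = mpg_ntile)) +
  geom_point() +
  scale_shape_manual(values = c(0, 1, 2, 5, 6, 15, 17))  # one shape code per group

p_shapes
```

Supplying the shapes by hand silences the warning, but it does not make seven shapes any easier to tell apart.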

Rather than specifying shapes manually, I will swap the two mappings, using shape = as.factor(DIABETES) and color = age_ntile:

A_DATA %>% 
  mutate(age_ntile = as.factor(ntile(Age,9))) %>%
  ggplot(aes(x = Gender, y = Age, shape  =  as.factor(DIABETES), color = age_ntile)) +
  geom_point()
#> Warning: Removed 5769 rows containing missing values (geom_point).

4.5 Faceting

Faceting is a way to efficiently create the same plot for subgroups within the dataset. For example, we can separate each DIABETES group into its own facet by adding facet_wrap(~DIABETES) to our plot:

A_DATA %>% 
  mutate(age_ntile = as.factor(ntile(Age,9))) %>%
  ggplot(aes(x = Gender, y = Age, shape  =  as.factor(DIABETES), color = age_ntile)) +
  geom_point() +
  facet_wrap(~DIABETES)
#> Warning: Removed 5769 rows containing missing values (geom_point).

Let’s remove the group with missing DIABETES values. Note where we insert filter(!is.na(DIABETES)) %>% into the pipeline above:

A_DATA %>% 
  filter(!is.na(DIABETES)) %>%
  mutate(age_ntile = as.factor(ntile(Age,9))) %>%
  ggplot(aes(x = Gender, y = Age, shape  =  as.factor(DIABETES), color = age_ntile)) +
  geom_point() +
  facet_wrap(~DIABETES)

Note that we have to use the tilde (~) in facet_wrap().

The tilde (~) in R denotes dependency. It is mostly used in statistical model formulas (e.g., y ~ x) to define dependent and explanatory variables, and you will see it a lot in the second part of this book.

There is a similar function called facet_grid() that will create a grid of plots based on two grouping variables, e.g., facet_grid(var1~var2):

A_DATA %>% 
  filter(!is.na(DIABETES)) %>%
  mutate(age_ntile = as.factor(ntile(Age,9))) %>%
  ggplot(aes(x = Gender, y = Age, shape  =  as.factor(DIABETES), color = age_ntile)) +
  geom_point() +
  facet_grid(DIABETES ~ age_ntile)
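To see how the two faceting functions lay out their panels, the following sketch counts the panels each one produces. It assumes the built-in mtcars data, with cyl and am as the grouping variables.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

p_wrap <- base + facet_wrap(~cyl)       # one panel per value of cyl
p_grid <- base + facet_grid(am ~ cyl)   # rows by am, columns by cyl

# ggplot_build() exposes the panel layout as a data frame, one row per panel
n_wrap <- nrow(ggplot_build(p_wrap)$layout$layout)
n_grid <- nrow(ggplot_build(p_grid)$layout$layout)
c(wrap = n_wrap, grid = n_grid)
```

In facet_grid(), the variable on the left of the tilde defines the rows and the one on the right defines the columns.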

4.6 Changing the theme

We can customize every single thing on a ggplot: font type, color, size or thickness of any lines or numbers, background, you name it. But a very quick way to change the appearance of a ggplot is to apply a different theme. The signature ggplot theme has a light grey background and white grid lines.

Some of the built-in ggplot themes are (1) the default, theme_grey(), (2) theme_bw(), (3) theme_dark(), and (4) theme_classic().

As a final step, we add theme_bw() (“black and white”) to give the plot a different look:

A_DATA %>% 
  filter(!is.na(DIABETES)) %>%
  mutate(age_ntile = as.factor(ntile(Age,9))) %>%
  ggplot(aes(x = Gender, y = Age, shape  =  as.factor(DIABETES), color = age_ntile)) +
  geom_point() +
  facet_grid(DIABETES ~ age_ntile) +
  theme_bw()

Additional themes are available from CRAN via install.packages('ggthemes'), or you can build your own theme as in this example: https://rpubs.com/mclaire19/ggplot2-custom-themes
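A custom theme does not have to be built from scratch. One common approach, sketched here under the assumption that theme_bw() is a suitable starting point, is to add a few theme() overrides to a built-in theme. The name theme_custom and the specific overrides are hypothetical.

```r
library(ggplot2)

# Start from theme_bw() and override a handful of elements
theme_custom <- theme_bw() +
  theme(
    legend.position = "bottom",              # move the legend below the plot
    panel.grid.minor = element_blank(),      # drop the minor grid lines
    plot.title = element_text(face = "bold")
  )

p_theme <- ggplot(mtcars, aes(x = wt, y = mpg, color = as.factor(cyl))) +
  geom_point() +
  theme_custom                               # applied like any built-in theme

p_theme
```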

Let’s see if we can improve on the example in Section 4.2:

A_DATA %>% 
  mutate(age_ntile = ntile(Age,10)) %>%
  ggplot(aes(x = AGE_AT_DIAG_DM2, y = Age, shape=as.factor(DIABETES), color=age_ntile)) +
  geom_point() 
#> Warning: Removed 94638 rows containing missing values (geom_point).

4.7 Adding layers

We can add a trend line with:

A_DATA %>% 
  mutate(age_ntile = ntile(Age,10)) %>%
  ggplot(aes(x = AGE_AT_DIAG_DM2, y = Age, shape=as.factor(DIABETES), color=age_ntile)) +
  geom_point() +
  geom_smooth()
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
#> Warning: Removed 94638 rows containing non-finite values (stat_smooth).
#> Warning: Removed 94638 rows containing missing values (geom_point).

This is how ggplot() works - you can build a plot by adding or modifying things one by one.
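The incremental nature of a plot can be made explicit by storing it in an object and adding one piece at a time. This sketch assumes the built-in mtcars data and, unlike the example above, fixes the smoother to a straight line with method = "lm" so the result does not depend on the default method.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg))                 # canvas only
p <- p + geom_point()                                      # layer 1: points
p <- p + geom_smooth(method = "lm", formula = y ~ x)       # layer 2: linear trend line
p <- p + theme_bw()                                        # themes change appearance, not layers

length(p$layers)  # counts geoms only; the theme is not a layer

p
```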