4 The Anatomy of ggplot
In a workflow, we might expect a call to ggplot() to look something like:
data %>%
ggplot(aes(x = map data components to graph components)) +
geom_xxx(arguments to modify the appearance of the geom) +
... +
theme_xxx(arguments to change the overall appearance) +
labs(add axis-labels and a title)
where aes(...) stands for aesthetics.
Dataset variables go inside the aes() function, which in turn sits inside ggplot(). Once the ggplot() call is in place, we start adding (+) geoms or other layers. In general, we have several options:
- initializing the canvas and defining variables,
- adding points,
- specifying further variables inside aes(),
- specifying aesthetics outside aes(),
- faceting,
- changing the plot theme,
- adding or modifying layers.
We will now explain these seven considerations with our example dataset:
library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
#> ✔ ggplot2 3.4.0 ✔ purrr 1.0.0
#> ✔ tibble 3.1.8 ✔ dplyr 1.0.10
#> ✔ tidyr 1.2.1 ✔ stringr 1.5.0
#> ✔ readr 2.1.3 ✔ forcats 0.5.2
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
A_DATA <- readRDS('DATA/Part_2/A_DATA.RDS')
4.1 Initializing the canvas and defining variables
Start by defining the variables, e.g., ggplot(aes(x = var1, y = var2)):
That gives us an empty grid: notice there are no points, because we didn’t add (+) any!
Your workflow may look like:
data %>%
filter(...) %>%
mutate(...) %>%
ggplot(aes(...)) +
The lines that come before the ggplot() function are piped, whereas from ggplot() onwards you have to use + to signify that we are now adding different layers and customizations to the same plot.
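The switch from %>% to + can be seen in a small, self-contained sketch. The data frame `toy` and its columns are hypothetical stand-ins invented for illustration; they are not part of A_DATA.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data (an assumption for illustration, not the book's dataset)
toy <- data.frame(
  Age    = c(34, 51, 47, 62, 29, 70),
  Gender = c("Male", "Female", "Female", "Male", "Female", "Male"),
  BMI    = c(22.1, 30.4, 27.8, 25.0, 19.9, 31.2)
)

p <- toy %>%
  filter(Age >= 30) %>%                # data steps are chained with %>%
  mutate(Obese = BMI >= 30) %>%
  ggplot(aes(x = Gender, y = Age)) +   # from ggplot() onwards, layers are added with +
  geom_point(aes(color = Obese)) +
  labs(title = "Age by gender")

p
```

Using %>% after ggplot() instead of + is a common slip; ggplot2 stops with an error in that case.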
4.2 Adding Points
Let’s ask ggplot() to draw a point for each observation by adding geom_point():
A_DATA %>%
ggplot(aes(x = AGE_AT_DIAG_DM2 , y = Age)) +
geom_point()
#> Warning: Removed 94638 rows containing missing values (`geom_point()`).
We have now created a scatter plot.
If we copy the above code and change just one thing - the x variable from AGE_AT_DIAG_DM2 to Gender (which is a categorical variable) - we get what’s called a strip plot.
This means we are now plotting a continuous variable (Age) against a categorical one (Gender). But the thing to note is that the rest of the code stays exactly the same; all we did was change the x = argument:
A_DATA %>%
ggplot(aes(x = Gender, y = Age)) +
geom_point()
4.3 Specifying further variables inside aes()
Let’s use DIABETES to give the points some color. We can do this by adding color = DIABETES inside aes():
A_DATA %>%
ggplot(aes(x = Gender, y = Age, color = DIABETES)) +
geom_point()
It uses the default color scheme and will automatically include a legend.
Notice that since R thinks of DIABETES as numerical, it plots it on a continuous color scale. We can force R to treat it as a factor instead:
A_DATA %>%
ggplot(aes(x = Gender, y = Age, color = as.factor(DIABETES))) +
geom_point()
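The difference between a numeric and a factor color mapping can be compared side by side on a small example. This is a minimal sketch; `toy` is a hypothetical data frame that only borrows the column names used above.

```r
library(ggplot2)

# Hypothetical toy data (an assumption; it only mimics the A_DATA column names)
toy <- data.frame(
  Gender   = c("Male", "Female", "Male", "Female"),
  Age      = c(40, 55, 63, 37),
  DIABETES = c(0, 1, 1, 0)
)

# Numeric column: ggplot picks a continuous color gradient with a color bar
p_numeric <- ggplot(toy, aes(x = Gender, y = Age, color = DIABETES)) +
  geom_point()

# Factor: ggplot picks a discrete palette with one legend key per level
p_factor <- ggplot(toy, aes(x = Gender, y = Age, color = as.factor(DIABETES))) +
  geom_point()

p_numeric
p_factor
```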
4.4 Specifying aesthetics outside aes()
It is very important to understand the difference between including ggplot arguments inside or outside of aes().
The main aesthetics (things we can see) are x, y, color, fill, shape, and size, and any of these could appear inside or outside the aes() function. Use ?geom_point to see the full list of aesthetics that can be used with this geom.
Variables (columns of your dataset) have to be defined inside aes(), whereas to apply a modification to everything, we can set an aesthetic to a constant value outside of aes().
The default shape used by geom_point() is number 19, a solid circle.
To make all of the points in our figure hollow, let’s set their shape to 1. We do this by adding shape = 1 inside geom_point():
A_DATA %>%
ggplot(aes(x = Gender, y = Age, color = as.factor(DIABETES))) +
geom_point(shape = 1)
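A common pitfall is putting a constant inside aes(). The sketch below contrasts the two cases on a hypothetical data frame `toy` (an assumption for illustration): outside aes() the value is used literally, while inside aes() it is treated as a one-level variable and colored from the default palette.

```r
library(ggplot2)

# Hypothetical toy data (an assumption for illustration)
toy <- data.frame(
  Gender = c("Male", "Female", "Male", "Female"),
  Age    = c(40, 55, 63, 37)
)

# Setting: constants outside aes() apply to every point as given
p_set <- ggplot(toy, aes(x = Gender, y = Age)) +
  geom_point(color = "blue", shape = 1)

# Mapping: inside aes(), "blue" is data, so it gets a palette color and a legend
p_mapped <- ggplot(toy, aes(x = Gender, y = Age)) +
  geom_point(aes(color = "blue"))

# layer_data() returns the values ggplot actually computed for a layer
unique(layer_data(p_set)$colour)
unique(layer_data(p_mapped)$colour)
```

In the second plot the points are not blue, which is usually the sign that a constant ended up inside aes() by mistake.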
Let’s look at some of the other shapes in ggplot. While there are 26 possible shapes in total, having too many on a graph can be confusing, so ggplot recommends using no more than 6:
A_DATA %>%
mutate(age_ntile = as.factor(ntile(Age,5))) %>%
ggplot(aes(x = Gender, y = Age, color = as.factor(DIABETES), shape = age_ntile)) +
geom_point()
Any more and R believes you’re getting out of control (you Rebel) and need to rethink what you are trying to display:
A_DATA %>%
mutate(age_ntile = as.factor(ntile(Age,7))) %>%
ggplot(aes(x = Gender, y = Age, color = as.factor(DIABETES), shape = age_ntile)) +
geom_point()
#> Warning: The shape palette can deal with a maximum of 6 discrete values because
#> more than 6 becomes difficult to discriminate; you have 7. Consider
#> specifying shapes manually if you must have them.
#> Warning: Removed 14473 rows containing missing values (`geom_point()`).
It did not add a 7th shape and provided a friendly warning:
The shape palette can deal with a maximum of 6 discrete values because more than 6 becomes difficult to discriminate; you have X. Consider specifying shapes manually if you must have them.
So instead I will set shape = as.factor(DIABETES) and make color = age_ntile:
A_DATA %>%
mutate(age_ntile = as.factor(ntile(Age,9))) %>%
ggplot(aes(x = Gender, y = Age, shape = as.factor(DIABETES), color = age_ntile)) +
geom_point()
#> Warning: Removed 5769 rows containing missing values (`geom_point()`).
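As an alternative to swapping the two aesthetics, the warning's suggestion of specifying shapes manually could be sketched with scale_shape_manual(). The data frame `toy` and the choice of shapes 0 to 6 are assumptions made for illustration only.

```r
library(ggplot2)

# Hypothetical toy data with seven groups (an assumption for illustration)
toy <- data.frame(
  x     = 1:7,
  y     = c(3, 5, 2, 6, 4, 7, 1),
  group = factor(1:7)
)

# Supplying one shape number per level avoids the 6-value limit of the default palette
p <- ggplot(toy, aes(x = x, y = y, shape = group)) +
  geom_point(size = 3) +
  scale_shape_manual(values = 0:6)

p
```

Even when the shapes are supplied by hand, seven or more symbols remain hard to tell apart, so mapping the variable to color is often the more readable choice.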
4.5 Faceting
Faceting is a way to efficiently create the same plot for subgroups within the dataset. For example, we can separate each diabetes status into its own facet by adding facet_wrap(~DIABETES) to our plot:
A_DATA %>%
mutate(age_ntile = as.factor(ntile(Age,9))) %>%
ggplot(aes(x = Gender, y = Age, shape = as.factor(DIABETES), color = age_ntile)) +
geom_point() +
facet_wrap(~DIABETES)
#> Warning: Removed 5769 rows containing missing values (`geom_point()`).
Let’s remove the group with missing DIABETES values. Note where we insert filter(!is.na(DIABETES)) %>% into the code from above:
A_DATA %>%
filter(!is.na(DIABETES)) %>%
mutate(age_ntile = as.factor(ntile(Age,9))) %>%
ggplot(aes(x = Gender, y = Age, shape = as.factor(DIABETES), color = age_ntile)) +
geom_point() +
facet_wrap(~DIABETES)
Note that we have to use the tilde (~) in facet_wrap(). The tilde (~) in R denotes dependency. It is mostly used in statistical model formulas (y ~ b + m*x) to define dependent and explanatory variables, and you will see it a lot in the second part of this book.
There is a similar function called facet_grid() that will create a grid of plots based on two grouping variables, with the first defining the rows and the second the columns, e.g., facet_grid(var1~var2):
A_DATA %>%
filter(!is.na(DIABETES)) %>%
mutate(age_ntile = as.factor(ntile(Age,9))) %>%
ggplot(aes(x = Gender, y = Age, shape = as.factor(DIABETES), color = age_ntile)) +
geom_point() +
facet_grid(DIABETES ~ age_ntile)
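To see how the two faceting functions lay out their panels, here is a minimal sketch on a hypothetical data frame `toy` (the column `age_group` and all values are invented for illustration).

```r
library(ggplot2)

# Hypothetical toy data: every combination of 2 genders, 2 statuses, 3 age groups
toy <- expand.grid(
  Gender    = c("Female", "Male"),
  DIABETES  = c(0, 1),
  age_group = c("young", "middle", "old")
)
toy$Age <- seq(25, by = 5, length.out = nrow(toy))

base <- ggplot(toy, aes(x = Gender, y = Age)) +
  geom_point()

# One grouping variable: one panel per value, wrapped into rows as needed
p_wrap <- base + facet_wrap(~DIABETES)

# Two grouping variables: rows ~ columns, one panel per combination
p_grid <- base + facet_grid(DIABETES ~ age_group)

p_wrap
p_grid
```

With this toy data, facet_wrap() produces 2 panels and facet_grid() produces a 2 by 3 grid.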
4.6 Changing the theme
We can customize every single thing on a ggplot: font type, color, size or thickness of any lines or numbers, background, you name it. But a very quick way to change the appearance of a ggplot is to apply a different theme. The signature ggplot theme has a light grey background and white grid lines.
Some of the built-in ggplot themes: (1) default, (2) theme_bw(), (3) theme_dark(), (4) theme_classic().
As a final step, we are adding theme_bw() (“background white”) to give the plot a different look.
A_DATA %>%
filter(!is.na(DIABETES)) %>%
mutate(age_ntile = as.factor(ntile(Age,10))) %>%
mutate(Diabetic_Status = if_else(DIABETES == 1, "Diabetic", "Not Diabetic")) %>%
ggplot(aes(x = Gender, y = Age, shape = Diabetic_Status, color = age_ntile)) +
geom_point() +
facet_grid(DIABETES ~ age_ntile) +
theme_bw() +
theme(axis.text.x = element_text(angle=90)) +
guides(color=guide_legend(title="Age Decile"))
Additional themes are available from CRAN via install.packages('ggthemes'), or you can customize your own theme as in this example: https://rpubs.com/mclaire19/ggplot2-custom-themes.
Let’s see if we can improve on the example in Section 4.2:
A_DATA %>%
mutate(age_ntile = ntile(Age,5)) %>%
filter(DIABETES == 1) %>%
ggplot(aes(x = AGE_AT_DIAG_DM2, y = Age, color = as.factor(age_ntile))) +
geom_point()
#> Warning: Removed 129 rows containing missing values (`geom_point()`).
4.7 Add layers
We can add a trend line with:
A_DATA %>%
mutate(age_ntile = ntile(Age,5)) %>%
filter(DIABETES == 1) %>%
ggplot(aes(x = AGE_AT_DIAG_DM2, y = Age, color = as.factor(age_ntile))) +
geom_point() +
geom_smooth(method = lm, formula = y ~ poly(x, 2), se = TRUE)
#> Warning: Removed 129 rows containing non-finite values (`stat_smooth()`).
#> Warning: Removed 129 rows containing missing values (`geom_point()`).
This is how ggplot() works - you can build a plot by adding or modifying things one by one.
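The one-by-one build-up can also be made explicit by storing the plot in an object and adding to it step by step. The sketch below uses simulated data; `toy`, its columns, and the straight-line fit are assumptions for illustration and do not reproduce the A_DATA example.

```r
library(ggplot2)

# Simulated toy data (an assumption for illustration)
set.seed(1)
toy <- data.frame(x = seq(30, 80, length.out = 40))
toy$y <- 10 + 0.9 * toy$x + rnorm(40, sd = 3)

# Step 1: canvas plus points
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point()

# Step 2: add a linear trend line with a confidence band
p <- p + geom_smooth(method = "lm", formula = y ~ x, se = TRUE)

# Step 3: add labels
p <- p + labs(x = "Simulated x", y = "Simulated y", title = "Layers added one at a time")

p
```

Each assignment returns a complete plot, so any intermediate version can be printed to check the effect of the most recent addition.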