4 The Anatomy of ggplot
In a workflow we might expect a call to ggplot to look something like:
data %>%
  ggplot(aes(x = map data components to graph components)) +
  geom_xxx(arguments to modify the appearance of the geom) +
  ... +
  theme_xxx(arguments to change the overall appearance) +
  labs(add axis-labels and a title)
where aes(...) stands for aesthetics.
Dataset variables go inside the aes() function, which in turn sits inside ggplot().
After the ggplot() call we start adding (+) geoms or other layers. In general we have several options:
- initializing the canvas and defining variables,
- adding points,
- specifying further variables inside aes(),
- specifying aesthetics outside aes(),
- faceting,
- changing the plot theme,
- adding or modifying layers.
We will now explain these seven considerations with our example data-set:
library('tidyverse')
#> -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
#> v ggplot2 3.3.3 v purrr 0.3.4
#> v tibble 3.1.2 v dplyr 1.0.6
#> v tidyr 1.1.3 v stringr 1.4.0
#> v readr 1.4.0 v forcats 0.5.1
#> -- Conflicts ------------------------------------------ tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
A_DATA <- readRDS('C:/Users/jkyle/Documents/GitHub/Jeff_Data_Wrangling/Week_2/DATA/A_DATA.RDS')
4.1 Initializing the canvas and defining variables
Start by defining the variables, e.g., ggplot(aes(x = var1, y = var2)):
A_DATA %>%
  ggplot(aes(x = Gender, y = Age))
That gives us an empty grid with the axes set up - notice there are no points, because we didn’t add (+) any!
Your work flow may look like:
data %>%
  filter(...) %>%
  mutate(...) %>%
  ggplot(aes(...)) +
The lines that come before the ggplot() function are piped with %>%, whereas from ggplot() onwards you have to use + to signify that we are now adding layers and customizations to the same plot.
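As a minimal sketch of where the pipe ends and the + begins, the example below uses a small made-up data frame. The name toy_data and its values are hypothetical stand-ins, not the A_DATA used in this chapter:

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data, only for illustration
toy_data <- data.frame(
  Gender = c("Female", "Male", "Female", "Male", "Female"),
  Age    = c(34, 51, NA, 47, 62)
)

p <- toy_data %>%
  filter(!is.na(Age)) %>%                     # data steps are chained with %>%
  mutate(Age_months = Age * 12) %>%
  ggplot(aes(x = Gender, y = Age_months)) +   # from here on, layers are added with +
  geom_point()

p
```

Swapping a + for a %>% after ggplot() (or the other way round) is a common slip, so it helps to read the pipeline as "prepare the data, then build the plot".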
4.2 Adding Points
Let’s ask ggplot() to draw a point for each observation by adding geom_point():
A_DATA %>%
  ggplot(aes(x = AGE_AT_DIAG_DM2, y = Age)) +
  geom_point()
#> Warning: Removed 94638 rows containing missing values (geom_point).
We have now created a scatter plot.
If we copy the above code and change just one thing - the x variable from AGE_AT_DIAG_DM2 to Gender (which is a categorical variable) - we get what’s called a strip plot.
This means we are now plotting a continuous variable (Age) against a categorical one (Gender). But the thing to note is that the rest of the code stays exactly the same; all we did was change the x = argument:
A_DATA %>%
  ggplot(aes(x = Gender, y = Age)) +
  geom_point()
4.3 Specifying further variables inside aes()
Let’s use DIABETES to give the points some color. We can do this by adding color = DIABETES inside the aes():
A_DATA %>%
  ggplot(aes(x = Gender, y = Age, color = DIABETES)) +
  geom_point()
It uses the default color scheme and will automatically include a legend.
Notice that since R thinks of DIABETES as numerical, it plots it on a continuous color scale. We can force R to think of it as a factor:
A_DATA %>%
  ggplot(aes(x = Gender, y = Age, color = as.factor(DIABETES))) +
  geom_point()
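The difference between the two plots above comes down to the type of the variable mapped to color. A minimal sketch, assuming a small made-up data frame (toy_data is hypothetical, with DIABETES coded as 0/1 numbers as the text describes):

```r
library(ggplot2)

# Hypothetical toy data, only for illustration
toy_data <- data.frame(
  Gender   = c("Female", "Male", "Female", "Male"),
  Age      = c(34, 51, 47, 62),
  DIABETES = c(0, 1, 1, 0)   # stored as numbers
)

# Numeric variable: ggplot2 draws a continuous color gradient
p_numeric <- ggplot(toy_data, aes(x = Gender, y = Age, color = DIABETES)) +
  geom_point()

# Factor: ggplot2 assigns one distinct color per level
p_factor <- ggplot(toy_data, aes(x = Gender, y = Age, color = as.factor(DIABETES))) +
  geom_point()

is.numeric(toy_data$DIABETES)          # TRUE
levels(as.factor(toy_data$DIABETES))   # "0" "1"
```

Converting inside aes() leaves the data untouched; if the factor is needed in several plots, it may be tidier to convert the column once with mutate() before plotting.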
4.4 Specifying aesthetics outside aes()
It is very important to understand the difference between including ggplot arguments inside or outside of the aes() function.
The main aesthetics (things we can see) are: x, y, color, fill, shape, size, and any of these could appear inside or outside the aes() function. Use ?geom_point to see the full list of aesthetics that can be used with geom_point().
Variables (or columns of your dataset) have to be defined inside aes(). To apply a modification to everything, on the other hand, we can set an aesthetic to a constant value outside of aes().
The default shape used by geom_point() is number 19 (a solid circle).
To make all of the points in our figure hollow, let’s set their shape to 1. We do this by adding shape = 1 inside the geom_point():
A_DATA %>%
  ggplot(aes(x = Gender, y = Age, color = as.factor(DIABETES))) +
  geom_point(shape = 1)
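To see the inside/outside distinction at work, the sketch below maps one variable inside aes() and sets two constants outside it, then inspects what ggplot2 computed for each point with layer_data(). The data frame toy_data is a hypothetical stand-in for A_DATA:

```r
library(ggplot2)

# Hypothetical toy data, only for illustration
toy_data <- data.frame(
  Gender   = c("Female", "Male", "Female", "Male"),
  Age      = c(34, 51, 47, 62),
  DIABETES = c(0, 1, 1, 0)
)

p <- ggplot(toy_data, aes(x = Gender, y = Age, color = as.factor(DIABETES))) +
  geom_point(shape = 1, size = 3)   # constants outside aes(): applied to every point

# layer_data() returns the aesthetic values computed for the layer
built <- layer_data(p)
unique(built$shape)   # 1
unique(built$size)    # 3
```

The color column of built varies with DIABETES because it was mapped inside aes(), while shape and size are the same for every row because they were set as constants.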
Let’s get a picture of some of the other shapes in ggplot. While there are 26 possible shapes in total, having too many on one graph can be confusing, and ggplot’s default shape palette handles at most 6:
A_DATA %>%
  mutate(age_ntile = as.factor(ntile(Age,6))) %>%
  ggplot(aes(x = Gender, y = Age, color = as.factor(DIABETES), shape = age_ntile)) +
  geom_point()
Any more and R believes you’re getting out of control (you Rebel) and need to rethink what you are trying to display:
A_DATA %>%
  mutate(age_ntile = as.factor(ntile(Age,7))) %>%
  ggplot(aes(x = Gender, y = Age, color = as.factor(DIABETES), shape = age_ntile)) +
  geom_point()
#> Warning: The shape palette can deal with a maximum of 6 discrete values because
#> more than 6 becomes difficult to discriminate; you have 7. Consider
#> specifying shapes manually if you must have them.
#> Warning: Removed 14473 rows containing missing values (geom_point).
It did not add a 7th shape and provided a friendly warning:
The shape palette can deal with a maximum of 6 discrete values because more than 6 becomes difficult to discriminate; you have X. Consider specifying shapes manually if you must have them.
So instead I will change to shape = as.factor(DIABETES) and make color = age_ntile:
A_DATA %>%
  mutate(age_ntile = as.factor(ntile(Age,9))) %>%
  ggplot(aes(x = Gender, y = Age, shape = as.factor(DIABETES), color = age_ntile)) +
  geom_point()
#> Warning: Removed 5769 rows containing missing values (geom_point).
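The shape-palette warning shown earlier suggests specifying shapes manually if more than 6 are really needed. One way this might be done is with scale_shape_manual(); the sketch below is an illustration on made-up data (toy_data and the choice of shape codes 0 to 6 are assumptions, not part of the original analysis):

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: 70 ages split into 7 equal-sized groups
toy_data <- data.frame(
  Gender = rep(c("Female", "Male"), times = 35),
  Age    = 21:90
) %>%
  mutate(age_ntile = as.factor(ntile(Age, 7)))

p <- ggplot(toy_data, aes(x = Gender, y = Age, shape = age_ntile)) +
  geom_point() +
  scale_shape_manual(values = 0:6)   # one shape code per factor level

p
```

Even when the shapes are supplied by hand, seven of them remain hard to tell apart, so switching the variable to color, as done above, is often the more readable choice.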
4.5 Faceting
Faceting is a way to efficiently create the same plot for subgroups within the dataset. For example, we can separate each value of DIABETES into its own facet by adding facet_wrap(~DIABETES) to our plot:
A_DATA %>%
  mutate(age_ntile = as.factor(ntile(Age,9))) %>%
  ggplot(aes(x = Gender, y = Age, shape = as.factor(DIABETES), color = age_ntile)) +
  geom_point() +
  facet_wrap(~DIABETES)
#> Warning: Removed 5769 rows containing missing values (geom_point).
Let’s remove the group with missing DIABETES values; note where we insert filter(!is.na(DIABETES)) %>% into the code from above:
A_DATA %>%
  filter(!is.na(DIABETES)) %>%
  mutate(age_ntile = as.factor(ntile(Age,9))) %>%
  ggplot(aes(x = Gender, y = Age, shape = as.factor(DIABETES), color = age_ntile)) +
  geom_point() +
  facet_wrap(~DIABETES)
Note that we have to use the tilde (~) in facet_wrap().
The tilde (~) in R denotes dependency. It is mostly used in statistical model formulas (y ~ b + m*x) to define dependent and explanatory variables, and you will see it a lot in the second part of this book.
There is a similar function called facet_grid() that will create a grid of plots based on two grouping variables, e.g., facet_grid(var1~var2):
A_DATA %>%
  filter(!is.na(DIABETES)) %>%
  mutate(age_ntile = as.factor(ntile(Age,9))) %>%
  ggplot(aes(x = Gender, y = Age, shape = as.factor(DIABETES), color = age_ntile)) +
  geom_point() +
  facet_grid(DIABETES ~ age_ntile)
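To make the difference between the two faceting functions concrete, the sketch below builds both on a small made-up data frame and counts the resulting panels. The names toy_data and age_group are hypothetical; only facet_wrap() and facet_grid() come from the text:

```r
library(ggplot2)

# Hypothetical toy data: every combination of 2 x 3 x 2 groups
toy_data <- expand.grid(
  DIABETES  = c(0, 1),
  age_group = c("young", "middle", "old"),
  Gender    = c("Female", "Male"),
  stringsAsFactors = FALSE
)
toy_data$Age <- seq(30, by = 3, length.out = nrow(toy_data))

base_plot <- ggplot(toy_data, aes(x = Gender, y = Age)) +
  geom_point()

p_wrap <- base_plot + facet_wrap(~DIABETES)              # one panel per DIABETES value
p_grid <- base_plot + facet_grid(DIABETES ~ age_group)   # rows by DIABETES, columns by age_group

length(unique(layer_data(p_wrap)$PANEL))   # 2
length(unique(layer_data(p_grid)$PANEL))   # 6
```

In facet_grid() the variable to the left of the tilde defines the rows and the one to the right defines the columns.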
4.6 Changing the theme
We can customize every single thing on a ggplot: font type, color, size or thickness of any lines or numbers, background, you name it. But a very quick way to change the appearance of a ggplot is to apply a different theme. The signature ggplot theme has a light grey background and white grid lines.
Some of the built-in ggplot themes are: (1) the default, theme_grey(); (2) theme_bw(); (3) theme_dark(); (4) theme_classic().
As a final step, we are adding theme_bw() (“background white”) to give the plot a different look.
A_DATA %>%
  filter(!is.na(DIABETES)) %>%
  mutate(age_ntile = as.factor(ntile(Age,9))) %>%
  ggplot(aes(x = Gender, y = Age, shape = as.factor(DIABETES), color = age_ntile)) +
  geom_point() +
  facet_grid(DIABETES ~ age_ntile) +
  theme_bw()
Additional themes are available from CRAN, e.g., via install.packages('ggthemes'), or you can customize your own theme as in this example: https://rpubs.com/mclaire19/ggplot2-custom-themes.
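A complete theme can also be combined with individual tweaks. The sketch below is one possible arrangement on made-up data: it applies theme_bw(), then adjusts two elements with theme() and adds labels with labs(). The data frame toy_data and the specific settings chosen are illustrative assumptions:

```r
library(ggplot2)

# Hypothetical toy data, only for illustration
toy_data <- data.frame(
  Gender   = c("Female", "Male", "Female", "Male"),
  Age      = c(34, 51, 47, 62),
  DIABETES = c(0, 1, 1, 0)
)

p <- ggplot(toy_data, aes(x = Gender, y = Age, color = as.factor(DIABETES))) +
  geom_point() +
  theme_bw() +                                   # complete theme first
  theme(
    legend.position = "bottom",                  # then individual adjustments
    axis.title      = element_text(size = 12)
  ) +
  labs(title = "Age by gender", color = "Diabetes")

p
```

The order matters: a complete theme such as theme_bw() replaces all theme settings, so any theme() adjustments should come after it.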
Let’s see if we can improve on the example in Section 4.2:
A_DATA %>%
  mutate(age_ntile = ntile(Age,10)) %>%
  ggplot(aes(x = AGE_AT_DIAG_DM2, y = Age, shape=as.factor(DIABETES), color=age_ntile)) +
  geom_point()
#> Warning: Removed 94638 rows containing missing values (geom_point).
4.7 Add layers
We can add a trend line with:
A_DATA %>%
  mutate(age_ntile = ntile(Age,10)) %>%
  ggplot(aes(x = AGE_AT_DIAG_DM2, y = Age, shape=as.factor(DIABETES), color=age_ntile)) +
  geom_point() +
  geom_smooth()
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
#> Warning: Removed 94638 rows containing non-finite values (stat_smooth).
#> Warning: Removed 94638 rows containing missing values (geom_point).
This is how ggplot() works - you can build a plot by adding or modifying things one by one.
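As a small illustration of layers stacking up, the sketch below adds a straight-line trend on top of points, using simulated data. Note that it swaps in method = "lm" with an explicit formula rather than the default smoother used above; toy_data and its values are made up for the example:

```r
library(ggplot2)

# Hypothetical simulated data: a noisy linear relationship
set.seed(1)
toy_data <- data.frame(x = 1:50)
toy_data$y <- 2 * toy_data$x + rnorm(50, sd = 5)

p <- ggplot(toy_data, aes(x = x, y = y)) +
  geom_point() +                                # layer 1: the raw points
  geom_smooth(method = "lm", formula = y ~ x)   # layer 2: a linear trend line

# Each layer keeps its own computed data
points_layer <- layer_data(p, 1)
smooth_layer <- layer_data(p, 2)

p
```

Because both layers inherit the aesthetics defined in ggplot(), the trend line is fitted to the same x and y as the points without repeating the mapping.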