13 Factors

13.1 Objectives

  • Understand how factor type variables differ from character strings.

  • Understand how to manipulate factors.

  • Produce bar charts and other categorical plots using {ggplot2}.

  • Set universal plot settings.

  • Extend learning of {ggplot2} functions.

13.2 Setup

This chunk of R code loads the packages that we will be using.

library(tidyverse)
library(gapminder)
library(forcats)  # note that {forcats} isn't part of the core tidyverse so has to be loaded explicitly

13.3 Categorical variables: factors

We love factors. We hate factors.

Factors make working with categorical variables a breeze: you can sort them or arrange them arbitrarily (think days of the week),

But there are some traps that you might fall into, if you’re not careful.

The package {forcats} is very helpful.

13.3.1 Factor functions

Base R functions for working with factors

function purpose
str display structure of object
class returns the class of an object
levels returns the values of the levels
nlevels return the number of levels

13.3.1.1 Your turn

Use the functions in the table above to examine the continent variable in the gapminder dataset.

Solution
str(gapminder$continent)
##  Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
class(gapminder$continent)
## [1] "factor"
# this is an important one!
levels(gapminder$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
nlevels(gapminder$continent)
## [1] 5

13.3.1.2 Your turn

Use the count() function in a pipe to get a frequency table of each level in the factors in the continent variable

Solution
# solution
gapminder |>
  count(continent)
## # A tibble: 5 × 2
##   continent     n
##   <fct>     <int>
## 1 Africa      624
## 2 Americas    300
## 3 Asia        396
## 4 Europe      360
## 5 Oceania      24

13.3.2 Dropping levels

A key thing to remember is that the factor levels exist separate from your data … even if you filter the data, the factor levels stay the same unless you drop the extras.

In this example, we see that there are 142 countries in gapminder – and if you filter the gapminder data down to 4, there are still 142 levels associated with country.

Leaving them in can cause problems, for instance when you try to plot them all.

nlevels(gapminder$country)
## [1] 142
h_countries <- c("Belgium", "India", "Denmark", "Albania")  # see what I did there?

h_gap <- gapminder |>
  filter(country %in% h_countries)

nlevels(h_gap$country)
## [1] 142

The function we need here is droplevels().

13.3.2.1 Your turn

Use the droplevels() function to delete the levels that are unused.

Solution
h_gap <- gapminder |>
  filter(country %in% h_countries) |> 
  droplevels()

After you filter the gapminder data table to include just the 4 countries in the h_countries list, the droplevels() function can be run next in the pipe–the other 138 levels are dropped.

We can then use levels() and nlevels() to check the results.

levels(h_gap$country)
## [1] "Albania" "Belgium" "Denmark" "India"
nlevels(h_gap$country)
## [1] 4

13.3.3 Changing the order of the factors

As you can see with our country names above, the default arrangement is alphabetical. This is fine in some applications, but as we will see when we plot the data, we might want to sort them by another variable.

13.3.3.1 Arranging levels

We might want to sort levels in an arbitrary way.

For example, our short country list would be more entertaining if it spelled out “BIDA”.

Here’s the way the levels are arranged:

h_gap$country |> 
  levels()
## [1] "Albania" "Belgium" "Denmark" "India"

We can use fct_relevel() to change some or all of the levels.

h_gap$country |>
  fct_relevel("Belgium", "India", "Denmark", "Albania") |> 
  levels()
## [1] "Belgium" "India"   "Denmark" "Albania"

What if we want to sort by the number of countries in each continent, that is, the number of times each factor occurs?

fct_infreq() is what we need, or fct_infreq() |> fct_rev() for backwards.

13.3.3.2 Your turn

Give fct_infreq() a try on continent in gapminder.

Solution
# solutions
gapminder$continent |>
  fct_infreq() |>
  levels()
## [1] "Africa"   "Asia"     "Europe"   "Americas" "Oceania"
gapminder$continent |>
  fct_infreq() |>
  fct_rev() |>
  levels()
## [1] "Oceania"  "Americas" "Europe"   "Asia"     "Africa"

In a pipe where we are using the entire data frame, it would look like this:

gmind <- gapminder |>
  # sort the factors
  mutate(continent = continent |> 
           fct_infreq() |> 
           fct_rev()) 

levels(gmind$continent)
## [1] "Oceania"  "Americas" "Europe"   "Asia"     "Africa"

You can also sort by another variable in the data frame. In this example, the countries are sorted by minimum life expectancy.

You can find other ways to sort in STAT545, “Reorder factors”

fct_reorder(gapminder$country, gapminder$lifeExp, min) |>
  levels() |>
  head()
## [1] "Rwanda"       "Afghanistan"  "Gambia"       "Angola"       "Sierra Leone" "Cambodia"

Recoding levels is similar to the renaming that’s possible in dplyr::select()

i_gap <- gapminder |> 
  filter(country %in% c("United States", "Sweden", "Australia")) |> 
  droplevels()

i_gap$country |> 
  levels()
## [1] "Australia"     "Sweden"        "United States"
i_gap$country |>
  fct_recode("USA" = "United States", "Oz" = "Australia") |> 
  levels()
## [1] "Oz"     "Sweden" "USA"

13.4 Plotting categorical variables

Create a dataframe object with the name “gapminder_2007” by filtering the gapminder data.

gapminder_2007 <- filter(gapminder, year == 2007)

13.4.1 Bar plot: countries in each continent

Bar plots are often used to visualize the distribution of a discrete variable. In this case, we will show how many countries there are in each continent.

  • the geom is geom_bar

  • map the x variable to continent

  • there is no y variable! {ggplot} will count the number of observations in each category

# solution
ggplot(gapminder_2007, aes(x = continent)) +
  geom_bar()

Note that colour (or color) won’t work on a bar! That’s for points and lines.

For something that occupies a block of space–such as a bar or pie chart–you need to use fill.

Add the fill attribute to continent to the code you wrote above. (Yes, you’ll be specifying continent twice!)

# solution

ggplot(gapminder_2007, aes(x = continent, fill = `continent`)) +
  geom_bar()

This is the default palette. It might be a bit too vibrant for your eyes…don’t worry, we will learn to fix that later.

Another {ggplot2} feature is that every plot is an object. If you want to save a basic version of your plot and continue to tinker with it, you can assign that basic version to an object name, and just add to it.

It would look something like:

mybar <- ggplot() + geom_()

Followed by

mybar +

# solution
mybar <- ggplot(gapminder_2007, aes(x = continent, fill = `continent`)) +
  geom_bar()

It’s possible to turn this on its side, so that the country labels are on the left, and the bars run left-to-right instead of up-and-down.

To do this, add the coord_flip() function to the chart object you assigned above (you might have called it mybar).

# solution
mybar + coord_flip()

13.4.2 Sorting factors in a plot

To sort the factors in a plot, we first mutate the variable that contains the factor variable that will plot.

In this example, we use the same code that we saw in hands-on 3.3, Your Turn 3.2. Other sort functions (such as fct_relevel) could also be used.

gapminder |>
  # sort the factors
  mutate(continent = continent |> 
           fct_infreq() |> 
           fct_rev()) |> 
  # then plot
  ggplot(aes(x = continent)) +
    geom_bar() +
  coord_flip()

A generic version of the sorting using fct_rev() to put them in reverse order is here:

df |> mutate(my_fct = fct_rev(my_fct) |> ggplot(…)

In this example, we will plot the four countries in our BIDA group by life expectancy.

h_countries <- c("Belgium", "India", "Denmark", "Albania")  # see what I did there?

h_gap <- gapminder |>
  filter(country %in% h_countries) |>
  mutate(country = country |> 
  fct_relevel("Belgium", "India", "Denmark", "Albania"))  

h_gap |>
  filter(year == 2007) |>
  ggplot(mapping = aes(x = country, y = lifeExp)) +
  geom_point(size = 6)

13.4.2.1 Your turn

Now it’s your turn: plot the four countries in our BIDA group by life expectancy.

Solution
# solution
h_gap |>
  filter(year == 2007) |>
  mutate(country = country |> 
           fct_reorder(lifeExp) |> 
           fct_rev()) |> 
  # now plot
  ggplot(mapping = aes(x = country, y = lifeExp)) +
  geom_point(size = 6)

13.5 Exercise - factors and plotting

BIDA302: Factors and Plotting

  • an introduction to factor variables in the context of making plots

13.6 Reference

This hands-on exercise draws heavily on material at the following

Some of these plotting examples are lifted directly from http://euclid.psych.yorku.ca/www/psy6135/tutorials/gapminder.html

See also Yan Holtz, “Reorder a variable with ggplot2” at the R Graph Gallery

-30-