3.2 Use factors for categorical variables

For many purposes within R, the most convenient way to handle categorical variables is to convert them to factor variables (see Section 1.7.4). Prior to conversion, the variable could have numeric or character values. When creating a factor, the levels argument contains the valid values in the original variable and the labels argument contains the labels to be attached to those levels.
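One consequence of how levels works is worth seeing on a small scale: any value in the original variable that is not listed in levels has no level to map to and is converted to a missing value without a warning. The following is a minimal sketch using a made-up vector x (not the RheumArth data); the value "U" is hypothetical and is included only to show what happens to an unlisted value.

```r
# Toy vector invented for illustration (not from the RheumArth data)
x <- c("F", "M", "F", "U", NA)

# Only "F" and "M" are declared as valid values,
# so "U" has no level to map to
f <- factor(x,
            levels = c("F", "M"),
            labels = c("Female", "Male"))

# Cross-tabulate the original values against the new factor,
# keeping missing values visible
table(x, f, useNA = "ifany")

# "U" became NA, so the factor has one more NA than the original
sum(is.na(x))
sum(is.na(f))
```

This is why the examples in this section always list the values with table() before converting and compare before and after once the conversion is done.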

Follow the steps below to create a factor from a character variable.

# Character variable

# List all the possible values
table(mydat$Sex, exclude = NULL)
## 
##   F   M 
## 428 102
# Convert to a factor
mydat$Sex <- factor(mydat$Sex,
                    levels = c("F", "M"),
                    labels = c("Female", "Male"))

# Compare before and after
table(mydat$Sex_orig, mydat$Sex, exclude = NULL)
##         
##          Female Male
##   f           2    0
##   F         420    0
##   female      6    0
##   m           0    3
##   M           0   99
# Examine the new levels
levels(mydat$Sex)
## [1] "Female" "Male"

After conversion, the levels of the factor are the values you supplied as labels.

Returning to the previous section, where we corrected data anomalies, we can combine the anomaly correction step with the factor creation step as follows. Note that you must supply exactly one label for each value listed in levels, in the same order as the values. Values that belong to the same category are given the same label, so some labels are repeated.

# Re-load the data so we can start over
load("Data/RheumArth-Chapter3.RData")

# Create a copy of the original variable
mydat$Sex_orig <- mydat$Sex

# List all the values
table(mydat$Sex, exclude = NULL)
## 
##      f      F female      m      M 
##      2    420      6      3     99
# Convert to a factor
mydat$Sex <- factor(mydat$Sex,
                    levels = c("f",      "female", "F",      "m",    "M"),
                    labels = c("Female", "Female", "Female", "Male", "Male"))

# Compare before and after
table(mydat$Sex_orig, mydat$Sex, exclude = NULL)
##         
##          Female Male
##   f           2    0
##   F         420    0
##   female      6    0
##   m           0    3
##   M           0   99
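Because the labels must line up with the values position by position, one way to reduce the risk of a mismatch is to keep each value next to its label in a single named vector and pass its names and contents to factor(). This is only a sketch of one possible approach, not the method used in this text; the object names x and lookup are hypothetical, and the toy data are invented.

```r
# Toy vector invented for illustration (not from the RheumArth data)
x <- c("f", "F", "female", "m", "M", "F")

# Each raw value (the name) sits next to its label (the element),
# so the pairing cannot drift out of order
lookup <- c(f      = "Female",
            female = "Female",
            F      = "Female",
            m      = "Male",
            M      = "Male")

# names(lookup) supplies the levels; the elements supply the labels.
# Repeated labels merge several raw values into one level.
f <- factor(x,
            levels = names(lookup),
            labels = unname(lookup))

# Compare before and after
table(x, f, useNA = "ifany")

# Examine the new levels
levels(f)
```

The result is the same as listing levels and labels separately; the only difference is where the pairing is written down.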

Follow the steps below to create a factor from a numeric variable. For this example, assume the codebook entry for AgeGp contains the following information:

Level  Label
1      40-49y
2      50-59y
3      60-69y
4      70-79y
5      80-90y

AgeGp is numeric with codes representing categories.

# Create a copy of the original variable
mydat$AgeGp_orig <- mydat$AgeGp

# List all the possible values
table(mydat$AgeGp, exclude = NULL)
## 
##   1   2   3   4   5 
##  64 203 183  40  40
# Convert to a factor
mydat$AgeGp <- factor(mydat$AgeGp,
                      levels = 1:5,
                      labels = c("40-49y",
                                 "50-59y",
                                 "60-69y",
                                 "70-79y",
                                 "80-90y"))

# Compare before and after
table(mydat$AgeGp_orig, mydat$AgeGp, exclude = NULL)
##    
##     40-49y 50-59y 60-69y 70-79y 80-90y
##   1     64      0      0      0      0
##   2      0    203      0      0      0
##   3      0      0    183      0      0
##   4      0      0      0     40      0
##   5      0      0      0      0     40
# Examine the new levels
levels(mydat$AgeGp)
## [1] "40-49y" "50-59y" "60-69y" "70-79y" "80-90y"

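If the conversion had created new missing values, or if the before vs. after table had any non-zero counts outside the cells where each original value meets its intended label (the off-diagonal cells in the AgeGp table above), then you would know there was a mistake in your code.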
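The check for new missing values can also be done with a couple of lines of code rather than by eye, which may help when a variable has many categories. The following is a minimal sketch on made-up data; the vectors old and new and the labels are hypothetical, and the code 6 is included deliberately as a value that was left out of levels.

```r
# Toy data invented for illustration: codes 1-3 are valid,
# 6 is an unexpected code, and one value is already missing
old <- c(1, 2, 3, 2, 6, NA)

# Convert to a factor, listing only the valid codes
new <- factor(old,
              levels = 1:3,
              labels = c("Low", "Mid", "High"))

# How many missing values did the conversion create?
# Anything other than 0 signals a problem.
sum(is.na(new)) - sum(is.na(old))

# Which original values were lost in the conversion?
unique(old[is.na(new) & !is.na(old)])
```

Here the first check reports one new missing value and the second identifies 6 as the code that was dropped, which points to either a data anomaly or an incomplete levels argument.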

The following code creates these same factors using tidyverse functions.

# Character variable

# List all the possible values
mydat_tibble %>%
  count(Sex)

# Convert to a factor
mydat_tibble <- mydat_tibble %>% 
  mutate(Sex = factor(Sex,
                      levels = c("F", "M"),
                      labels = c("Female", "Male")))

# Compare before and after
mydat_tibble %>% 
  count(Sex_orig, Sex)

# Examine the levels
levels(mydat_tibble$Sex)

# Numeric with codes representing categories

# Create a copy of the original variable
mydat_tibble <- mydat_tibble %>% 
  mutate(AgeGp_orig = AgeGp)

# List all the possible values
mydat_tibble %>% count(AgeGp)

# Convert to a factor
mydat_tibble <- mydat_tibble %>% 
  mutate(AgeGp = factor(AgeGp,
                        levels = 1:5,
                        labels = c("40-49y",
                                   "50-59y",
                                   "60-69y",
                                   "70-79y",
                                   "80-90y")))

# Compare before and after
mydat_tibble %>% 
  count(AgeGp_orig, AgeGp)

# Examine the levels
levels(mydat_tibble$AgeGp)

REMINDER: If you type the name of a dataset followed by the pipe operator %>%, the result of the manipulation is displayed but the dataset itself is not changed. To alter the dataset itself, start with the dataset name (or a new name), then the assignment operator <-, then the dataset name again, and then use pipes to string together any data manipulation functions you want. In the above examples, we did not use an assignment operator with count() because we just wanted to look at a frequency table, not store it. When we created the factor variable, however, we did use an assignment operator so that the change was saved.
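The difference between piping with and without assignment can be seen in a small example. This sketch assumes the dplyr package is installed and uses a made-up tibble named toy (hypothetical, not the RheumArth data).

```r
library(dplyr)

# Toy tibble invented for illustration
toy <- tibble(Sex = c("F", "M", "F"))

# No assignment: the converted data are displayed, but toy is unchanged
toy %>%
  mutate(Sex = factor(Sex,
                      levels = c("F", "M"),
                      labels = c("Female", "Male")))

# Sex is still a character variable
class(toy$Sex)

# With assignment: the result overwrites toy
toy <- toy %>%
  mutate(Sex = factor(Sex,
                      levels = c("F", "M"),
                      labels = c("Female", "Male")))

# Sex is now a factor
class(toy$Sex)
```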