3.2 Use factors for categorical variables
For many purposes within R, the most convenient way to handle categorical variables is to convert them to factor variables (see Section 1.7.4). Prior to conversion, the variable could have numeric or character values. When creating a factor, the levels argument contains the valid values in the original variable and the labels argument contains the labels to be attached to those levels.
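As a minimal sketch of how levels and labels pair up, the example below uses a small made-up vector (x is hypothetical, not the chapter's data). One value is deliberately left out of levels to show that any value not listed there becomes missing in the resulting factor.

```r
# Hypothetical toy vector (not the chapter's data); "f" is deliberately
# left out of levels to show what happens to unlisted values
x <- c("F", "M", "f", "M")

# Each entry in levels is paired with the label in the same position
f <- factor(x,
            levels = c("F", "M"),
            labels = c("Female", "Male"))

# Values not listed in levels become NA
table(f, useNA = "ifany")

# The factor's levels are the labels that were supplied
levels(f)
```

This is why it helps to tabulate the variable before converting it: any value you did not anticipate will otherwise silently turn into a missing value.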
Follow the steps below to create a factor from a character variable.
# List all the possible values
table(mydat$Sex, useNA = "ifany")
##
##   F   M
## 428 102
# Convert to a factor
mydat$Sex <- factor(mydat$Sex,
                    levels = c("F", "M"),
                    labels = c("Female", "Male"))
# Compare before and after
table(mydat$Sex_orig, mydat$Sex, useNA = "ifany")
##
##          Female Male
##   f           2    0
##   F         420    0
##   female      6    0
##   m           0    3
##   M           0   99
# Examine the levels
levels(mydat$Sex)
## [1] "Female" "Male"
After conversion, the levels of the factor are the values you supplied as labels.
Returning to the previous section, where we corrected data anomalies: we can combine the anomaly-fixing step with the factor-creation step, as follows. Note that you must supply exactly one label for each value listed in levels, in the same order, and that a label is repeated whenever several values belong to the same category.
# Re-load the data so we can start over
load("Data/RheumArth-Chapter3.RData")
# Create a copy of the original variable
mydat$Sex_orig <- mydat$Sex
# List all the values
table(mydat$Sex, useNA = "ifany")
##
##      f      F female      m      M
##      2    420      6      3     99
# Convert to a factor
mydat$Sex <- factor(mydat$Sex,
                    levels = c("f", "female", "F", "m", "M"),
                    labels = c("Female", "Female", "Female", "Male", "Male"))
# Compare before and after
table(mydat$Sex_orig, mydat$Sex, useNA = "ifany")
##
##          Female Male
##   f           2    0
##   F         420    0
##   female      6    0
##   m           0    3
##   M           0   99
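To see the repeated-labels behavior in isolation, here is a small sketch on a made-up vector (x is hypothetical, not the chapter's data). It assumes R 3.4.0 or later, where factor() accepts duplicated labels and merges the corresponding values into a single level.

```r
# Hypothetical toy vector with inconsistent codings of the same category
x <- c("f", "F", "female", "m", "M", "F")

# One label per level, in the same order; repeated labels merge levels
# (assumes R >= 3.4.0, which allows duplicated labels in factor())
f <- factor(x,
            levels = c("f", "female", "F", "m", "M"),
            labels = c("Female", "Female", "Female", "Male", "Male"))

# Five original values have been merged into two levels
levels(f)
nlevels(f)

# Cross-tabulate to confirm each original value landed where intended
table(x, f, useNA = "ifany")
```

The order of the labels matters only relative to the order of the levels: each label is attached to the value in the same position.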
Follow the steps below to create a factor from a numeric variable. For the following example, assume the codebook entry for AgeGp has the following information:
| Level | Label |
|---|---|
| 1 | 40-49y |
| 2 | 50-59y |
| 3 | 60-69y |
| 4 | 70-79y |
| 5 | 80-90y |
AgeGp is numeric with codes representing categories.
# Create a copy of the original variable
mydat$AgeGp_orig <- mydat$AgeGp
# List all the possible values
table(mydat$AgeGp, useNA = "ifany")
##
##   1   2   3   4   5
##  64 203 183  40  40
# Convert to a factor
mydat$AgeGp <- factor(mydat$AgeGp,
                      levels = 1:5,
                      labels = c("40-49y",
                                 "50-59y",
                                 "60-69y",
                                 "70-79y",
                                 "80-90y"))
# Compare before and after
table(mydat$AgeGp_orig, mydat$AgeGp, useNA = "ifany")
##
##     40-49y 50-59y 60-69y 70-79y 80-90y
##   1     64      0      0      0      0
##   2      0    203      0      0      0
##   3      0      0    183      0      0
##   4      0      0      0     40      0
##   5      0      0      0      0     40
# Examine the levels
levels(mydat$AgeGp)
## [1] "40-49y" "50-59y" "60-69y" "70-79y" "80-90y"
If there were new missing values, or if the before vs. after table had any nonzero counts off the diagonal, then you would know there was a mistake in your code.
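With many categories, it can be easier to run these two checks in code than to read them off the table. The sketch below is one possible way to do so on made-up data. The vector orig and the helper n_new_missing() are hypothetical names introduced for illustration, and the diagonal check assumes a one-to-one recoding in which every code appears in the data, so that the table is square.

```r
# Hypothetical toy codes (not the chapter's data), one genuinely missing
orig <- c(1, 2, 3, 4, 5, 2, NA)
age_labels <- c("40-49y", "50-59y", "60-69y", "70-79y", "80-90y")

# Hypothetical helper: count values that were present before conversion
# but are missing afterwards
n_new_missing <- function(before, after) {
  sum(is.na(after) & !is.na(before))
}

# A deliberate mistake: code 5 is omitted from levels
bad <- factor(orig, levels = 1:4, labels = age_labels[1:4])
n_new_missing(orig, bad)

# The intended conversion introduces no new missing values
good <- factor(orig, levels = 1:5, labels = age_labels)
n_new_missing(orig, good)

# For a square, one-to-one table, all counts should sit on the diagonal
tab <- table(orig, good)
n_off_diagonal <- sum(tab) - sum(diag(tab))
n_off_diagonal
```

A nonzero result from either check is a signal to go back and compare the levels and labels arguments against the codebook.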
The following code creates these same factors using tidyverse functions.
# Character variable
# List all the possible values
mydat_tibble %>%
  count(Sex)
# Convert to a factor
mydat_tibble <- mydat_tibble %>%
  mutate(Sex = factor(Sex,
                      levels = c("F", "M"),
                      labels = c("Female", "Male")))
# Compare before and after
mydat_tibble %>%
  count(Sex_orig, Sex)
# Examine the levels
levels(mydat_tibble$Sex)
# Numeric with codes representing categories
# Create a copy of the original variable
mydat_tibble <- mydat_tibble %>%
  mutate(AgeGp_orig = AgeGp)
# List all the possible values
mydat_tibble %>%
  count(AgeGp)
# Convert to a factor
mydat_tibble <- mydat_tibble %>%
  mutate(AgeGp = factor(AgeGp,
                        levels = 1:5,
                        labels = c("40-49y",
                                   "50-59y",
                                   "60-69y",
                                   "70-79y",
                                   "80-90y")))
# Compare before and after
mydat_tibble %>%
  count(AgeGp_orig, AgeGp)
# Examine the levels
levels(mydat_tibble$AgeGp)
REMINDER: If you type the name of a dataset followed by the pipe operator %>%, the result is computed and displayed, but the dataset itself is not changed. To alter the dataset, start with the dataset name (or a new name), followed by the assignment operator <- and the dataset name, and then use pipes to string together the data manipulation functions you want. In the above examples, we did not use an assignment operator with count() because we only wanted to look at a frequency table, not store it. When we created the factor variable, however, we did use an assignment operator.
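The difference can be seen directly on a small made-up data frame. In this sketch, toy is a hypothetical object (not the chapter's data) and the dplyr package is assumed to be installed. The same mutate() call is run once without assignment and once with it.

```r
library(dplyr)

# Hypothetical toy data frame (not the chapter's data)
toy <- data.frame(Sex = c("F", "M", "F"), stringsAsFactors = FALSE)

# Without assignment: the converted result is printed, but toy is unchanged
toy %>%
  mutate(Sex = factor(Sex,
                      levels = c("F", "M"),
                      labels = c("Female", "Male")))
class_before <- class(toy$Sex)
class_before   # still "character"

# With assignment: toy itself now holds the factor
toy <- toy %>%
  mutate(Sex = factor(Sex,
                      levels = c("F", "M"),
                      labels = c("Female", "Male")))
class_after <- class(toy$Sex)
class_after    # now "factor"
```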