4 Factor/ categorical variables: Making R recognise variables as factor (& ordered factor), tidying up such variables and combining categories. Creating categorical variables using cut-offs on numeric variables.

Data Wrangling Recipes in R: Hilary Watt

“Really helpful practical tips for those annoying things that often aren’t covered in other courses I have attended” Dr David Salmon, post-doc, School of Public Health.

The R name for categorical variables is factor variables. Variables that will be treated as categorical in data analysis should be assigned R factor format (or ordered factor format, used far less often, since it implies that many R commands look for ordered trends across categories, which is less often desirable). Sometimes categories needs to be combined for sensible analysis, or because they are different spellings of the same thing. It is really useful to order the named categories to the order that you want them to appear in tables and graphs, even when we do NOT assign it to be an ordered category.

Important: when categories are combined for analysis, it is essential to do this purely on the definitions of the categories. Resulting p-values are invalidated, when decisions on which categories to combine is based on similarity of outcomes, so essential to AVOID.

4.1 Changing to factor format whilst tidying up messy categories

When converting from numerical/ integer values, factor categories have numbered levels/ categories, so they appear as numbers in tables. More often, we convert from character data, and categories (levels) often have words/ phrases that usefully describe them.

df$var1_factor <- as.factor(df$var1_string) # convert to factor
df$var1_factor <- factor(df$var1_string) # equivalent to above, but has more useful options

You may want to specify the category names. Anything else results in NA, so check for accidental introduction of NAs. All named categories are then included into tables, even when their frequency is zero. They are tabulated in the order specified in the command (underlying numerical values are 1, 2, 3… in order specified although these mostly remain invisible and are not often visibly used by R code).

table(df$var1_string, exclude=NULL) # See what categories (levels) and how many NAs
df$var1_factor <- factor(df$var1_string,
                         levels=c(“African”, “European”, “American”, “Asian”))
summary(df$var1_factor) # check if extra NAs are accidentally introduced

View a table of the categories/ levels to see whether each category/ level is truly distinct; combine categories that are different spellings/ capitalisations of the same thing.

Here are some strategies to assist, that work on string variables (prior to converting to factor). We can convert to entirely upper case, or lower case, and we can remove leading or trailing blanks: For a variable that is already factor, it is also possible to convert to string (as.string()), then convert fully to lower case (as above str_to_lower or str_to_upper for upper case). To remove leading and trailing blanks from string variable first, use str_trim. This makes capitalisation and similar more consistent, so we have fewer different categories. When we convert back to factor, we have fewer categories/ levels that are more distinct as a result. This may be sufficient to tidy-up categories, or we may need to further combine categories/ levels. There is also the option to take only the first one or few characters, prior to converting (back) to factor using substr() function (example in script below), which may usefully give the correct number of distinct categories/ levels.

The chapter on character variables gives more ways in which characters variables can be tidied, prior to converting to factor variables.

This simple example illustrates the above, with a few different strategies being demonstrated in the following code. “Male” is coded with two different spellings/ capitalisations, namely “male” and “Male”. Clean the variable coding by reassigning “male”. Recode the empty string “” to missing, since blank empty quotes are not treated as missing data by R:

Converting to lower case to tidy categories:

table(anaemia$sex, exclude=NULL)  # shows 4 categories including blank
## 
##        Female   male   Male 
##      3    344      2    691
anaemia$sex2 <- as.factor( str_to_lower(anaemia$sex) )  # make all lower case for consistency then convert to factor
anaemia$sex2[anaemia$sex2 == ""] <- NA  # make sure blanks are appropriately treated as NA (missing data) by R
summary(anaemia$sex2)  # see result - all lower case in levels/ categories
##        female   male   NA's 
##      0    344    693      3
anaemia$sex2 <- NULL  # decided to drop this and remake below, retaining capitals at start of Male and Female levels

Tidying up with individual statements reassigning values:

anaemia$sex3 <- anaemia$sex  # copy variable to new var prior to amending
anaemia$sex3[anaemia$sex3 == "male"] <- "Male"  # recode one specific category/ level to another - alternative to above for this variable
anaemia$sex3[anaemia$sex3 == ""] <- NA  # make sure blanks are appropriately treated as NA (missing data) by R
anaemia$sex3 <- factor(anaemia$sex3, levels=c("Male", "Female"))
summary(anaemia$sex3)  # see new data in table 
##   Male Female   NA's 
##    693    344      3
table(anaemia$sex3, anaemia$sex, exclude=NULL)  # compare existing and new version of the variable - including NAs
##         
##              Female male Male
##   Male     0      0    2  691
##   Female   0    344    0    0
##   <NA>     3      0    0    0

Tidying up by taking only the first letter & converting to lower case:

anaemia$sex_first <- as.factor ( str_to_upper( substr( anaemia$sex, 1, 1)  ) ) # take substring, starting at first character, length 1 - convert to upper case - then convert to factor
summary(anaemia$sex_first)  # view the result - reduced to first letters m and f
##       F   M 
##   3 344 693

When converting to factor, we can specify the category names/ levels. Then anything else is converted to NA. It is crucial to check that we did not accidentally introduce NAs when doing this. summary() command reports number of NAs in above code, which can be compared to original number of NAs. See second example above: tidying up with individual statements reassiging values.

After tidying up categories, we may end up with levels that have zero frequency, but that are still appear in tables and other output. It is useful to tidy up, to avoid this. Whilst not an issue with the above commands, we sometimes might end up with 4 levels still (male, Male, Female and blank). To remove excess levels, we can reconvert to a factor or use the droplevels command. Only ONE of the following statements is required for this purpose:

anaemia$sex3 <- factor(anaemia$sex3, levels=c("Female", "Male")) # keeps only specified categories - other levels are coded to NA
# ALTERNATIVE CODE TO DROP LEVELS 
anaemia$sex3 <- factor(anaemia$sex3) # re-converting to factor also drops unused levels
# ALTERNATIVE CODE TO DROP LEVELS 
anaemia$sex3 <- droplevels(anaemia$sex3)   # this drops any levels/ categories with frequency zero, so unused categories
table(anaemia$sex)
## 
##        Female   male   Male 
##      3    344      2    691

The following convert several variables to factor format in a single command. Remember to check results, including with summary commands and with other checks often being needed:

anaemia[c("sex", "operat", "death", "deathICU", "return", "hpt")] <- lapply(
  anaemia[c("sex", "operat", "death", "deathICU", "return", "hpt")], as.factor)
summary(anaemia[c("sex", "operat", "death", "deathICU", "return", "hpt")])
##      sex            operat      death       deathICU     return      hpt     
##        :  3            :  9        :   9        :   9       :  9       :  9  
##  Female:344   Elective :710   alive:1000   6    :   2    yes: 96    yes:665  
##  male  :  2   Emergency: 56   died :  31   alive:1027   no  :906   no  :359  
##  Male  :691   Salvage  :  2                died :   2   NA's: 29   NA's:  7  
##               Urgent   :263

ADVANCED: After converting to factor, this might be a fairly efficient way to combine several factor categories together (to one named “other”), whilst retaining other categories. You could then repeat, to combine a different set of categories together, and repeat again as needed. Remember need to install.packages(“forcats”) if not used before. This example is on a different dataset:

library(forcats)  # need to install this package first, if not already done so
other_cats = c(“Mexican”, “Fiji”, “Solomon Isles”) # specify categories to combine
a$country <- fct_collapse(a$country, other=other_cats) # this collapses countries named into a single level

Minor example: creating a subset excluding one operation type, then removing level that has become unnecessary:

anaemia.NOTsalvage <- anaemia[anaemia$operat!="Salvage", ]  # create data-frame that excludes salvage operations
# this includes "Salvage" even though no one has this operation type in this subset
table(anaemia.NOTsalvage$operat) 
## 
##            Elective Emergency   Salvage    Urgent 
##         9       710        56         0       263
# drops levels of a factor variable that are NOT present in the current dataset
anaemia.NOTsalvage$operat <- droplevels(anaemia.NOTsalvage$operat)
table(anaemia.NOTsalvage$operat) # does not show any empty categories
## 
##            Elective Emergency    Urgent 
##         9       710        56       263

4.2 Ordering levels in tables and graphs: reording levels whilst converting to factor makes life much easier; we can rename them too

Why struggle with levels ordered in illogical manner throughout your tables, when you can readily specify their order?, whilst declaring them as factor variables? When specifying category names, remember that anything else results in NA, so check for accidental introduction of NAs. All named categories are then included into tables, in order specific, even when their frequency is zero.

table(df$var1_string, exclude=NULL) # See what categories (levels) and how many NAs
df$var1_factor <- factor(df$var1_string,
                         levels=c(“African”, “European”, “American”, “Asian”))
summary(df$var1_factor) # check if extra NAs are accidentally introduced

We can even specify the order of levels/ categories, whilst simultaneously renaming them:

df$var1_factor <- factor(df$var1_string,
                         levels=c(“African”, “European”, “American”, “Asian”),
                         labels=c(“Africa”, “Europe”, “USA”, “Asia”))

Second example of specifying order whilst renaming levels:

summary(anaemia$sex3)
## Female   Male   NA's 
##    344    693      3
anaemia$sex3 <- factor(anaemia$sex3,
                      levels=c("Male", "Female"),
                      labels=c("Men", "Women"))
summary(anaemia$sex3)
##   Men Women  NA's 
##   693   344     3

4.3 Recoding into ordered factor variables

Because R is written by many different users, different functions work in different ways. However, many R functions may seek a linear trend across numbered levels/ categories, for any variable that R recognises as (ordinal) ordered factor. This is less often what we want. Therefore, best to avoid specifying variables as ordered factors. Unless we are certain this results in the analyses that we want in specific situations. For the times when we might occasionally want to specify as ordered factor, here is code:

var1_factor <- as.ordered(var1_string)  
var1_factor <- factor(var1_string, ordered=TRUE)  

The following specifies only 3 of 4 categories for variable operat - the forth category, “Salvage” is coded to NA (it doesn’t fit into ordered categories):

anaemia$operat <- factor(anaemia$operat, ordered=TRUE, levels=c("Elective", "Urgent", "Emergency"))

Remember that the above code (with ordered=FALSE, the default), will specify the order that the categories are displayed, without calling it an ordered factor variable, which is very useful in many more situations.

4.4 Recoding to binary 0/1, from numerical, factor or string variables

It is possible to use an expression that evaluates to TRUE or FALSE, resulting in logical datatype. Then convert to integer, where TRUE=1, FALSE=0. Example: We want to create a binary variable equal to 1 for long hospital stays post-operation (stay >20 days) and equal to 0 otherwise. We want to creata binary variable for elective operation type.

# binary variable for hospital stays greater than 20 days: the following
# evaluates expression to see whether “anaemia$los_post > 20.5” is TRUE or FALSE or NA
anaemia$long_post <- as.integer(anaemia$los_post > 20.5) # recode to binary 0/1 indicating long stay post operation
anaemia$elective <- as.integer(anaemia$operat == "Elective")  # recode to binary 0/ 1 elective operations

summary(anaemia$long_post)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.1145  0.0000  1.0000       9

if_else( ) from tidyverse package [ needs library(tidyverse) or library(dplyr)) ]. This is useful for modifying variables. This retains date and factor format, which generally makes it more convenient than the R base version ifelse.

if_else (expression that evaluate as TRUE/FALSE, value when expression is TRUE, value when expression is FALSE). For some examples ifelse from base R works just as well, but a policy of always using the tidyverse version may be sensible.

Examples as above, different coding: find a binary integer variable =1 for elective operation and =0 otherwise (to compare both these groups).

# see how the categorical variable for operation type is coded
summary(anaemia$operat)
##            Elective Emergency   Salvage    Urgent 
##         9       710        56         2       263
anaemia$elective <- if_else(anaemia$operat == "Elective", 1, 0)  # recode to binary 0/ 1
summary(anaemia$elective) # see new variable
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.6827  1.0000  1.0000
table(anaemia$elective,anaemia$operat, exclude=NULL)  # compare to original variable
##    
##         Elective Emergency Salvage Urgent
##   0   9        0        56       2    263
##   1   0      710         0       0      0

Create a binary variable=1 for long hospital stays post-operation (stay >20 days) and =0 otherwise.

# binary variable for hospital stays greater than 20 days
anaemia$long_post <- if_else(anaemia$los_post > 20.5, 1, 0) # same as above using if_else
table(anaemia$long_post, exclude= NULL)  # check result
## 
##    0    1 <NA> 
##  913  118    9

4.5 Dividing numbers into categories, often according to clinically meaningful cut-offs (indicating severity of illness or similar) or using rounded cut-offs (such as 5 or 10 year age groups).

We might alternatively want to use recognised cut-offs which may have clinical significance to create the groups (e.g. for BMI, use recognised thresholds to denote underweight, normal weight, overweight, obese).

A series of nested if_else statements can be used. However, the following method is probably simpler. This uses the cut command to create a factor variable from a numeric variable, by specifying the cut-offs using breaks=… Note that the first values needs to be below the minimum and the last needs to be above the maximum, or else NAs will result:

summary(anaemia$bmi)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   14.88   24.02   26.89   27.48   30.39   58.39      19
anaemia$BMI_grp = cut(anaemia$bmi, breaks=c(1, 18.5, 25, 30, 90))
summary(anaemia$BMI_grp)
##  (1,18.5] (18.5,25]   (25,30]   (30,90]      NA's 
##        29       310       405       277        19

Note the categories/ levels are labelled with their upper and lower limits. Here is a similar example for age (which isn’t actually in dataset so would result in error).

Use seq to specify cut-points. This divides age into 10 year groups, starting at age 20 years:

seq(20, 120, 10) # starts from 20 to 120 in steps of 10: 20, 30, 40, 50,,...120
summary(aaa$age)
aaa$age_grp <- cut(aaa$age, breaks=seq(20, 120,10))
table(aaa$age_grp)

4.6 Fifths, tens and quartiles: dividing numeric variables into quintiles and similar

Hilary Watt

We might often want to use the median, quartiles or quintiles or deciles to create these groups. The result is a factor variable, and we usually keep it like this (rather than calling it ordered factor, since we generally use this to avoid the need to look for a linear trend).

Create new factor var ht_q5 that divides height into fifths

summary(anaemia$height) # see variable
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.360   1.650   1.730   1.719   1.800   1.970      15
seq(0,1,0.2)  # sequence from 0 to 1 in steps of 0.2: 0, 0.2, 0.4, 0.6, 0.8, 1.0
## [1] 0.0 0.2 0.4 0.6 0.8 1.0
cut5 <- quantile(anaemia$height, na.rm=TRUE, probs=seq(0,1,0.2)) # find quintiles - divide into fifths
cut5 # show quintiles – dividing height into fifths
##   0%  20%  40%  60%  80% 100% 
## 1.36 1.63 1.70 1.75 1.80 1.97
anaemia$ht_q5 <- cut(anaemia$height, breaks=cut5, include.lowest=TRUE)
summary(anaemia$ht_q5) # because people of the same height always go into the same category, so “fifths” 
## [1.36,1.63]  (1.63,1.7]  (1.7,1.75]  (1.75,1.8]  (1.8,1.97]        NA's 
##         217         207         195         209         197          15
# don’t contain exactly the same sample size.
table(anaemia$ht_q5) # table showing ranges for each fifth and sample sizes against each
## 
## [1.36,1.63]  (1.63,1.7]  (1.7,1.75]  (1.75,1.8]  (1.8,1.97] 
##         217         207         195         209         197

If you want to divide into tenths, use seq(0,1,0.1) in the above command. For tertiles, use seq(0,1.0002,0.3334) (rather than 1, this has been amended to three times 0.3334).

4.7 Cross tabulate variables: table(var1, var2), to check for inconsistencies, such as pregnant men**?? Never-smokers who smoke 20 cigarettes per day?

Can you check any inconsistent data back to some primary source? If not, should you recode both inconsistent variables to missing? People often make a choice as to which variable to believe, and recode the other variable in this light. It is important to document and report which variable you consider to be primary (believed to be true when there are inconsistencies), in R script files and then include into written reports.

The following example looks at 3 death variables. It finds an inconsistency between death in hospital (death variable) and death during follow-up (death_fu variable, which includes death afterwards). The deecision is to believe the deaths in the deaths in hospital variable and to amend the other variable to agree with it. This isn’t necessarily reasonable, perhaps converting both to NA would be more reasonable. It is essential to declare any such data modifications in any write-up.

table(anaemia$deathICU, exclude=NULL) # died whilst in ICU
## 
##           6 alive  died 
##     9     2  1027     2
table(anaemia$death, exclude=NULL) # died whilst in hospital
## 
##       alive  died 
##     9  1000    31
table(anaemia$death_fu, exclude=NULL) # died whilst in hospital or follow-up after that
## 
##    0    1 <NA> 
##  897  134    9
table(anaemia$death, anaemia$deathICU, exclude=NULL) # consistency - except for apparent typos in deathICU variable
##        
##               6 alive died
##           9   0     0    0
##   alive   0   2   998    0
##   died    0   0    29    2
table(anaemia$death, anaemia$death_fu, exclude=NULL) # shows inconsistency between death variables
##        
##           0   1 <NA>
##           0   0    9
##   alive 895 105    0
##   died    2  29    0
anaemia$death_fu2 <- anaemia$death_fu    # copy variable prior to amending
anaemia$death_fu2 [anaemia$death=="died"] <- 1 # recode to died=1 when other variable indicates death
table(anaemia$death, anaemia$death_fu2, exclude=NULL) # now both variables agree
##        
##           0   1 <NA>
##           0   0    9
##   alive 895 105    0
##   died    0  31    0

4.8 Count levels/ categories: how many levels does a variable take? How much missing NA data?

The code length(unique( anaemia$sex )) gives number of unique levels for the variable of interest.

# Tabulate data – keep track of missing data with exclude=NULL option
table(anaemia$sex, exclude=NULL)
# shows number of rows(=observation)
length(anaemia$sex) 
# shows unique values (without frequencies)
unique(anaemia$sex) 
# reports NUMBER of unique values
length(unique(anaemia$sex)) 
# gives vector output indicating where each item in variable is NA or not.
# is.na(anaemia$sex) 
# tabulates missing data
table( is.na(anaemia$sex3) )

Always keep a copy of the data set as sent to you and a tidy R script file that produces the data set that you use for analysis.


The main dataset is called anaemia, available here: https://github.com/hcwatt/data_wrangling_open.

Data Wrangling Recipes in R: Hilary Watt. PCPH, Imperial College London.