5 Character/ string variables: converting to R character data-type, tidying up ID and other character variables, and splitting into component parts
Data Wrangling Recipes in R: Hilary Watt
Support bug life by keeping on coding. Each bug is an opportunity to develop debugging prowess.
5.1 Converting to character datatype, so R recognises variable as this datatype
To convert variable named var_name into a character version with variable name var_char, when both variables are stored in dataframe named df:
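A minimal sketch of this conversion, using the names above (df, var_name and var_char). The small dataframe is made up purely for illustration:

```r
# Hypothetical example data: var_name is stored as a factor
df <- data.frame(var_name = factor(c("id01", "id02", "id01")))

# Create a character version of the variable within the same dataframe
df$var_char <- as.character(df$var_name)

class(df$var_name)  # "factor"
class(df$var_char)  # "character"
```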
- It may be useful to acknowledge that a variable currently stored as R “factor” is merely character (R’s name for string/ free text variables).
- This might be useful for numeric ID variables, which are inconveniently being rounded. When in character format, the whole ID would appear.
- It might be useful for factor variables, if the character format allows for easier tidying up of categories (such as converting all to lower case). Then convert back to factor as required.
5.2 Tidying character variables, including ID variables prior to merging/ other variables prior to converting to factor
Sometimes ID variables are character variables and contain letters and numbers. If we want to merge datasets matching by these ID variables, we may need to tidy them prior to merging. Sometimes their case may be inconsistent. Sometimes leading/ trailing blanks are added accidentally during data processing.
These all use tidyverse, so require either library(tidyverse) or (contained within tidyverse) library(stringr).
Making characters tidier/ more consistent:
- str_to_lower(): converts letters to lower case
- str_to_upper(): converts letters to upper case
- str_trim(): removes leading and trailing spaces from characters
- str_replace_all(var_name, "-", " "): replaces "-" with spaces
- str_remove_all(var_name, "-"): removes "-" from characters
- str_length(): returns the length of each character string
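As a rough illustration of these functions, the following sketch tidies a small made-up vector of IDs (ids is a hypothetical name, not one of the datasets used in this chapter) so that inconsistent versions of the same ID would match when merging:

```r
library(stringr)

# Hypothetical IDs with stray blanks and inconsistent case
ids <- c(" ab-12 ", "AB-12", "cd-7")

# Remove leading/trailing blanks, then make the case consistent
tidy_ids <- str_to_upper(str_trim(ids))
tidy_ids                             # "AB-12" "AB-12" "CD-7"

str_remove_all(tidy_ids, "-")        # "AB12" "AB12" "CD7"
str_replace_all(tidy_ids, "-", " ")  # "AB 12" "AB 12" "CD 7"
str_length(tidy_ids)                 # 5 5 4
```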
The other key use here is for character variables that we want to convert to factor/ categorical variables. Prior to converting, we might convert to all lower case, to avoid different capitalisations ending up in different categories (e.g. male and Male and MALE are treated as distinct before conversion).
##
##        Female   male   Male
##      3    344      2    691
anaemia$sex <- str_to_lower(anaemia$sex) # converts to lower case
table(anaemia$sex, useNA="ifany") # table more consistent now
##
##        female   male
##      3    344    693
## [1] "factor"
##        female   male
##      3    344    693
5.3 Extracting subsets of character variables
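The function substr() takes a substring starting at a specified position in the string and ending at a specified position. For example, substr(anaemia$sex, 1, 1) takes the first character of the string alone (it starts at the first character and stops at the first character). substr(anaemia$sex, 3, 4) starts at the third character and stops at the fourth, giving a substring of length two. Note that the third argument is the end position, not the length: substr(anaemia$sex, 3, 2) returns an empty string, because the end position comes before the start position.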
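A short sketch of how the start and stop positions behave, using a made-up vector x rather than the anaemia data:

```r
# Hypothetical character vector
x <- c("female", "male")

substr(x, 1, 1)  # "f" "m": first character only
substr(x, 3, 4)  # "ma" "le": third to fourth character
substr(x, 3, 2)  # "" "": stop position is before start position
```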
Tidying up by taking only the first letter & converting to upper case:
# take the substring that starts and ends at the first character,
# convert it to upper case, then convert to factor
anaemia$sex_first <- as.factor(str_to_upper(substr(anaemia$sex, 1, 1)))
summary(anaemia$sex_first) # view the result - reduced to first letters F and M
##       F   M
##   3 344 693
5.4 Dividing character variables into component parts
Occasionally the ID variable contains components to indicate cluster, and person within cluster and repeated measures within person. We may need to divide into separate components.
Perhaps we want to match on region, but because of wrong spellings, it suits us to match on only the first 3 characters (then check that this approach is valid). Perhaps we want dates, but we have combined date/times. One possible approach is then to strip out the date component whilst the variable is in character format, before converting it into R date format.
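Both ideas can be sketched with substr(). The vectors region and datetime are made up for illustration, and the sketch assumes that the date/times are stored as text in year-month-day format, so that the date occupies the first 10 characters:

```r
# Hypothetical region names, one with a spelling mistake
region <- c("London", "Londn", "Manchester")
region3 <- substr(region, 1, 3)  # first 3 characters only
region3                          # "Lon" "Lon" "Man"

# Hypothetical date/times stored as character (assumed format)
datetime <- c("2021-03-15 14:32:10", "2021-11-02 09:05:00")
date_chr <- substr(datetime, 1, 10)  # keep only the date component
date_val <- as.Date(date_chr)        # convert to R date format
class(date_val)                      # "Date"
```

It is worth tabulating the shortened variable against the original afterwards, to check that distinct regions have not been merged by mistake.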
The following are some examples of the separate() command from within tidyverse’s tidyr. They are applied to the dataset kits, which contains the character variable kit_no. Note that the examples print results to the screen. It would be usual to save into a new dataframe, by starting the command with: new_df_name <-
Because the kit_no character variable has different formats/ structures on different rows, each of the examples of the separate() command works differently on each row. There are different numbers of component parts on different rows. Hence all these examples give warning messages. It might be fine to ignore them, though it is useful to understand how they arise so we can make an informed decision on that.
Specifying a symbol divider for character variables, such as “_”; this divider does not appear in any resulting component part.
Specify a symbol to divide the character variable into 2 (or more) parts: here the symbol “_”. The following code specifies 2 new variable names for the character components (named part1 & part2). Part1 is the component of character variable kit_no before the first symbol “_”. Part2 is the component between the first and second “_” symbols. Because no further variable names are specified, the remaining components (for rows 3, 7 and 10, after the second “_”) are discarded. A warning reports these discarded elements.
# Dividing one character variable into 2 vars at a given separator character -
# here "_"
# this separates "kit_no" var into two separate variables
print(separate(kits,
  kit_no,
  # name the two newly created vars "part1" and "part2"
  into = c("part1", "part2"),
  # this defines how they are separated - by "_" character
  sep = "_"
))
## Warning: Expected 2 pieces. Additional pieces discarded in 3 rows [3, 7, 10].
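If discarding the extra pieces is a deliberate choice, the extra argument of separate() can state this explicitly, which also silences the warning. A minimal sketch on made-up data (toy is a hypothetical dataframe, not the kits data):

```r
library(tidyr)

# Hypothetical data: the second row has three pieces
toy <- data.frame(kit_no = c("a_1", "b_2_x"))

# extra = "drop": discard any pieces beyond the second, without a warning
two_parts <- separate(toy, kit_no, into = c("part1", "part2"),
                      sep = "_", extra = "drop")
two_parts$part2  # "1" "2"

# extra = "merge": keep everything after the first "_" in part2 instead
merged <- separate(toy, kit_no, into = c("part1", "part2"),
                   sep = "_", extra = "merge")
merged$part2     # "1" "2_x"
```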
If we want to keep all component parts of the character variable kit_no with separator “_”, we need to specify 4 names for the resulting variables as follows. This gives a different warning about resulting NA entries, since most rows don’t have 4 elements separated by “_”:
# Dividing one character variable into 4 vars at a given separator character -
# here "_"
# this separates "kit_no" var into four separate variables
print(separate(kits,
  kit_no,
  # name four newly created vars
  into = c("part1", "part2", "part3", "part4"),
  # this defines how they are separated - by "_" character
  sep = "_"
))
## Warning: Expected 4 pieces. Missing pieces filled with `NA` in 7 rows [1, 2, 4, 5, 6,
## 8, 9].
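Where NA entries for the missing pieces are what we want, the fill argument of separate() can request this explicitly, so that no warning is given. A minimal sketch on made-up data (toy is hypothetical, not the kits data):

```r
library(tidyr)

# Hypothetical data: the first row has only two pieces, the second has four
toy <- data.frame(kit_no = c("a_1", "b_2_x_9"))

# fill = "right": pad rows that have too few pieces with NA on the right
four_parts <- separate(toy, kit_no,
                       into = c("part1", "part2", "part3", "part4"),
                       sep = "_", fill = "right")
four_parts$part3  # NA  "x"
four_parts$part4  # NA  "9"
```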
Specifying the character combination that determines where to split; the characters specified are retained in the component parts
The following version looks for anywhere that the letter “t” is followed by any digit. The “t” and anything to the left of it form the first part (stored into a new variable named text here), and the digit (plus anything to the right of it) form the second part (stored into a new variable named numplus). For most strings in this example, there isn’t a letter t followed by a digit; the entire string is then placed into the first named variable text, with the second element numplus being NA. Again, there is a warning.
# Dividing one character variable into 2 variables - split where letter t is
# followed by a digit [0-9]
# this separates "kit_no" var into two separate variables
print(separate(kits,
  kit_no,
  # name the two newly created vars "text" and "numplus"
  into = c("text", "numplus"),
  # this defines how they are separated: at the first point where a
  # number [0-9] follows letter "t"
  sep = "(?<=t)(?=[0-9])"
))
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 7 rows [1, 2, 4, 5, 6,
## 8, 9].
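The pattern "(?<=t)(?=[0-9])" combines a look-behind (?<=t) with a look-ahead (?=[0-9]). It matches the empty position between a “t” and a digit, so nothing is removed at the split. A minimal sketch on made-up data (toy is hypothetical, not the kits data):

```r
library(tidyr)

# Hypothetical data: only the first value has a "t" directly before a digit
toy <- data.frame(kit_no = c("kit12", "box7"))

split_t <- separate(toy, kit_no, into = c("text", "numplus"),
                    sep = "(?<=t)(?=[0-9])",
                    fill = "right")  # NA on the right where there is no split
split_t$text     # "kit" "box7"
split_t$numplus  # "12"  NA
```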
The following version looks for anywhere that one of the letters “s”, “S”, “t” or “T” is followed by any digit (0 to 9). That letter and anything to the left of it form the first part (stored into a new variable named text here), and the digit (plus anything to the right of it) form the second part (stored into a new variable named numplus). For most strings, there isn’t a letter “s”, “S”, “t” or “T” followed by a digit; the entire string is then placed into text and an NA entry results for numplus. Again, there is a warning.
# Dividing one character variable into 2 variables - separate where any one of
# a few named letters is followed by a digit [0-9]
# this separates "kit_no" var into two separate variables
print(separate(kits,
  kit_no,
  # name the two newly created vars "text" and "numplus"
  into = c("text", "numplus"),
  # this defines how they are separated: at the first point where a
  # number [0-9] follows letter "t", "T", "S" or "s"
  sep = "(?<=[sStT])(?=[0-9])"
))
The result is identical to that of the previous example, including the warning.
The following version specifies that strings are split where any letter (capital or lower case) is followed by any digit (0-9). The output is familiar, since for this example, the only letter that is followed by a digit is the letter t. However, this may be useful in many situations.
# Dividing one character variable into 2 variables - separate where ANY letter,
# (a-z, lower case or upper case) is followed by ANY digit
# this separates "kit_no" var into two separate variables
print(separate(kits,
  kit_no,
  # name the TWO newly created vars "text" and "numplus"
  into = c("text", "numplus"),
  # this defines how they are separated: at the first point where a
  # number follows a letter
  sep = "(?<=[A-Za-z])(?=[0-9])"
))
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 7 rows [1, 2, 4, 5, 6,
## 8, 9].
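To make an informed decision about such warnings, one option is to save the result and list the rows that could not be split. A minimal sketch on made-up data (toy and split_any are hypothetical names). Setting remove = FALSE keeps the original variable alongside the new ones, which makes checking easier:

```r
library(tidyr)

# Hypothetical data: the last value has no letter followed by a digit
toy <- data.frame(kit_no = c("kit12", "box7", "abc"))

split_any <- separate(toy, kit_no, into = c("text", "numplus"),
                      sep = "(?<=[A-Za-z])(?=[0-9])",
                      fill = "right",  # NA on the right where there is no split
                      remove = FALSE)  # keep the original kit_no column

# Rows where no split took place
split_any[is.na(split_any$numplus), ]
```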
5.5 Pasting char var together: this might be used to construct columns for tables with confidence intervals in brackets or similar
Strings can be pasted together using the paste() command. It is possible to specify a separator (sep = ) between elements of the string, such as a blank " ":
## [1] "hello I am happy"
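A minimal sketch of paste() with different separators. The strings are chosen only to reproduce output like that shown above and are not necessarily the original example:

```r
# A blank separator (this is also the default)
paste("hello", "I am", "happy", sep = " ")  # "hello I am happy"

# Any other separator can be given
paste("hello", "I am", "happy", sep = "_")  # "hello_I am_happy"

# paste0() pastes with no separator at all
paste0("ID", 1:3)                           # "ID1" "ID2" "ID3"
```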
This is useful when constructing tables, where estimates are stored in one column and standard errors are in a different column. Then we can construct a column that contains the estimate, with 95% CI in brackets. The following illustrates the code (although here the first two lines type in the values by hand, whereas R regression output or similar would better be used for that purpose):
est_a <- c(4.563425, 4.54223,3.5422) # sets value of estimate
se_a <- c(0.655422, 0.453262, 0.345323) # sets value of its standard error
paste( est_a, " (CI ", est_a-1.96*se_a, ", ", est_a+1.96*se_a , ")" )
## [1] "4.563425 (CI 3.27879788 , 5.84805212 )"
## [2] "4.54223 (CI 3.65383648 , 5.43062352 )"
## [3] "3.5422 (CI 2.86536692 , 4.21903308 )"
k <- 2 # used to round to k = 2 decimal places in the following code:
paste( round(est_a, k), " (CI ", round( est_a-1.96*se_a, k), ", ", round(est_a+1.96*se_a, k) , ")" )
## [1] "4.56 (CI 3.28 , 5.85 )" "4.54 (CI 3.65 , 5.43 )"
## [3] "3.54 (CI 2.87 , 4.22 )"
This is printed in a row, whereas it would be printed as a column if printed as part of a table that has various columns.
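An alternative sketch uses base R's sprintf(), which gives tighter control over spacing and always shows the requested number of decimal places (round() drops trailing zeros, so 4.50 would print as 4.5). The values are the same illustrative ones as above; lower, upper and ci_text are hypothetical names:

```r
est_a <- c(4.563425, 4.54223, 3.5422)     # estimates
se_a  <- c(0.655422, 0.453262, 0.345323)  # their standard errors

lower <- est_a - 1.96 * se_a  # lower 95% confidence limit
upper <- est_a + 1.96 * se_a  # upper 95% confidence limit

# "%.2f" prints each number with exactly 2 decimal places
ci_text <- sprintf("%.2f (CI %.2f, %.2f)", est_a, lower, upper)
ci_text[1]  # "4.56 (CI 3.28, 5.85)"
```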
The main dataset is called kittens2, available here: https://github.com/hcwatt/data_wrangling_open.
Data Wrangling Recipes in R: Hilary Watt. PCPH, Imperial College London