Essentials of this chapter: Which base R commands and R packages are covered?
9.2.1 Text basics
In R, text is represented as a sequence of characters, which can contain letters, numbers, and other symbols.
To turn a sequence of characters into a text object, it needs to be enclosed in quotation marks (either double or single quotes).
The data type R provides for storing sequences of characters is character and the mode of an object that holds a character string is character:
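A minimal sketch of this point (the object names a and b are only illustrative):

```r
# Double and single quotes both create a character object
a <- "Hello"
b <- 'world'

typeof(a)  # "character"
mode(a)    # "character"
class(b)   # "character"
```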
Sometimes we want to talk about a word, rather than use it in a sentence. This is typically signaled by using quotation marks around the word that is talked about. When using quotation marks within a string, the opening and closing marks of each pair must be of the same type, and the inner quotation marks must differ from the outer ones (or be escaped with a backslash). For instance, the following 2 definitions yield valid character objects:
whereas the following 4 definitions would all yield errors:
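The definitions below are an illustrative sketch of both cases, not the original examples. The object names q1 to q3 and the sentence itself are assumptions. The invalid variants are kept as comments, since each of them would stop the script with a syntax error.

```r
# Valid: inner quotes differ from the outer quotes
q1 <- "The word 'text' has four letters."
q2 <- 'The word "text" has four letters.'

# Also valid: inner quotes of the same type, escaped with a backslash
q3 <- "The word \"text\" has four letters."
identical(q2, q3)  # TRUE: both store the same characters

# Invalid (each line would raise a syntax error if uncommented):
# "The word "text" has four letters."
# 'The word 'text' has four letters.'
# "The word 'text' has four letters.'
# 'The word "text" has four letters."
```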
Just like with other R objects, a sequence of strings can be combined into a vector:
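As a small sketch, the vector v1 below is assumed to hold the three strings that appear in the exercise output later in this section:

```r
# Combine several strings into one character vector
v1 <- c("Hello", "!", "This is a sentence.")

v1          # "Hello" "!" "This is a sentence."
length(v1)  # 3 (number of elements, not number of characters)
typeof(v1)  # "character"
```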
Strings in vectors are as contagious as NA values in calculations: when numbers or other data types are combined with a character object, the entire vector is coerced to type and mode character:
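The following sketch illustrates this coercion (the object name v2 is only illustrative):

```r
# A single NA makes the result of a calculation NA
sum(1, 2, NA)       # NA

# A single string turns all elements of a vector into strings
v2 <- c(1, 2, "three")
v2                  # "1" "2" "three"
typeof(v2)          # "character"
mode(v2)            # "character"

# The same happens with logical and numeric values
c(TRUE, 3.5, "a")   # "TRUE" "3.5" "a"
```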
The function character() takes a numeric argument length and creates a vector of a corresponding number of empty strings.
This may seem unnecessary at this point, but becomes useful when initializing data structures to be filled by the results of a vector operation or a for loop (see Chapter 12 for examples):
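A minimal sketch of this initialization pattern, with the illustrative names n and out:

```r
n <- 3

# Initialize a vector of n empty strings
out <- character(n)
out  # "" "" ""

# Fill the vector element by element in a for loop
for (i in seq_len(n)) {
  out[i] <- paste0("item_", i)
}
out  # "item_1" "item_2" "item_3"
```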
The corresponding functions as.character() and is.character() coerce objects into text strings or test whether objects are text strings, respectively:
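For instance (x and y are illustrative names):

```r
x <- 123.4
is.character(x)               # FALSE: x is numeric

# Coerce the number into a text string
y <- as.character(x)
y                             # "123.4"
is.character(y)               # TRUE

# Coercion also works on vectors of other types
as.character(c(TRUE, FALSE))  # "TRUE" "FALSE"
```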
R contains a few built-in constants. Apart from the constant pi (which evaluates to the numeric double 3.1415927), the following ones are vectors of type character:
LETTERS: 26 upper-case letters of the Roman alphabet;
letters: 26 lower-case letters of the Roman alphabet;
month.abb: 3-letter abbreviations for the English month names;
month.name: English names for the months of the year.
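These constants can be inspected and indexed like any other vector, as the following short sketch shows:

```r
LETTERS[1:3]        # "A" "B" "C"
letters[24:26]      # "x" "y" "z"
month.abb[1:3]      # "Jan" "Feb" "Mar"
month.name[12]      # "December"

length(LETTERS)     # 26
typeof(month.name)  # "character"
typeof(pi)          # "double"
```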
Base R contains a range of functions to deal with text objects. As you will encounter them for simple string manipulations in this book and in the code of others, it is good to have seen them at some point.
However, the set of text-related functions in base R is a bit like a zoo: it has amazing and colorful individual creatures, but it can be hard to see the connections or any organizing principle. Thus, we will only cover the most important functions here and use the stringr package (see Section 9.2.3 below) for more advanced forms of string manipulation.
To organize a set of otherwise relatively inconsistent functions, we present these basic string manipulation functions as solutions to specific tasks.
A. Basic tasks to do with text:
- Measuring text length
- Changing the case of characters
- Combining text strings
- Splitting text strings
- Reading text input
- Printing text output
- Extracting substrings (based on positions)
B. Advanced tasks to do with text:
- Detecting text or patterns
- Locating text or patterns
- Extracting text or patterns
- Replacing text or patterns
- Task: Measuring text length.
Rather than using the length() function (which provides the number of elements in a vector), the length of a text object is measured by nchar(), which counts the characters in each string.
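The difference between the two functions is easy to see on a small vector (v1 is assumed to hold the strings used in the exercises below):

```r
v1 <- c("Hello", "!", "This is a sentence.")

length(v1)  # 3: number of elements in the vector
nchar(v1)   # 5 1 19: number of characters in each element
```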
- Task: Changing the case of characters.
A specific property of letters in many alphabets is that they exist in lowercase and uppercase forms.
The R functions toupper() and tolower() allow changing the case of text strings:
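A brief sketch, assuming that the object s holds an example sentence:

```r
s <- "This is a sentence."  # assumed example sentence

toupper(s)  # "THIS IS A SENTENCE."
tolower(s)  # "this is a sentence."

# Both functions work on entire vectors
toupper(c("a", "b", "c"))  # "A" "B" "C"
```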
- Task: Extracting parts of strings.
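One way to approach this task in base R is substr(), which extracts the characters between a start and a stop position. The sentence s below is an assumed example.

```r
s <- "This is a sentence."  # assumed example sentence

# Extract characters by their positions
substr(s, start = 1, stop = 4)  # "This"
substr(s, 11, 18)               # "sentence"

# substr() is vectorized over its first argument
substr(c("Monday", "Tuesday"), 1, 3)  # "Mon" "Tue"
```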
- Task: Detecting characters in strings.
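A possible base R approach to this task is grepl() with fixed = TRUE, which searches for literal characters rather than patterns. This is a sketch with an assumed example vector v1, not necessarily the function used in the original example.

```r
v1 <- c("Hello", "!", "This is a sentence.")  # assumed example vector

# Does each string contain the character "e"?
grepl("e", v1, fixed = TRUE)  # TRUE FALSE TRUE

# Does each string start with an upper-case "T"?
startsWith(v1, "T")           # FALSE FALSE TRUE
```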
- Task: Replacing characters in strings.
substr() can also be used to replace characters in strings:
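A minimal sketch of such a replacement, using an assumed example sentence s. The call to sub() is shown only as an alternative that replaces by content rather than by position.

```r
s <- "This is a sentence."  # assumed example sentence

# Replace the characters at positions 1 to 4
substr(s, 1, 4) <- "That"
s  # "That is a sentence."

# Alternative: replace the first occurrence of a literal string
s2 <- sub("sentence", "string", s, fixed = TRUE)
s2  # "That is a string."
```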
- Task: Combining strings (into 1 string).
Note that the paste() commands are vectorized: they work element by element on vectors and recycle shorter arguments.
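The following sketch shows paste() and paste0() with single strings and with vectors:

```r
# Combine single strings
paste("Hello", "world")       # "Hello world" (default separator is a space)
paste0("Hello", "world")      # "Helloworld" (no separator)
paste("a", "b", sep = "-")    # "a-b"

# With vectors, pasting works element by element
paste0("x", 1:3)              # "x1" "x2" "x3"

# collapse merges all elements into one string
paste(c("a", "b", "c"), collapse = "")  # "abc"
```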
- Task: Splitting a string (into multiple parts).
Note: The opposite of pasting is splitting. Use strsplit() to split a string into parts at a given separator. strsplit() returns a list. To obtain a vector, we need to unlist() the result.
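A short sketch of splitting and re-combining a string (s is an assumed example sentence; parts and words are illustrative names):

```r
s <- "This is a sentence."  # assumed example sentence

# Split the string at each space
parts <- strsplit(s, split = " ")
class(parts)  # "list"

# Turn the list into a character vector
words <- unlist(parts)
words         # "This" "is" "a" "sentence."

# Pasting the parts back together restores the original string
paste(words, collapse = " ")  # "This is a sentence."
```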
- Task: Reading text input (from a file or user console).
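As a sketch of reading text from a file, the example below first writes two lines to a temporary file and then reads them back with readLines(). The file and object names are assumptions. Input from the console can be requested with readline(), which is only meaningful in an interactive session and is therefore guarded here.

```r
# Create a small text file to read from
tmp <- tempfile(fileext = ".txt")
writeLines(c("First line", "Second line"), tmp)

# Read the file: each line becomes one element of a character vector
txt <- readLines(tmp)
txt  # "First line" "Second line"

unlink(tmp)  # remove the temporary file

# Read a line of user input from the console (interactive sessions only)
if (interactive()) {
  answer <- readline(prompt = "Please enter some text: ")
}
```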
- Task: Providing text output (to the console or a file).
cat() concatenates and prints several objects, which is useful for printing the output of user-defined functions. It converts its arguments to character vectors, concatenates them to a single vector, appends the given string separator sep to each element, and then prints them (to the console or a file).
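A minimal sketch of cat() printing to the console and to a file (the file name out_file and the object written are assumptions):

```r
# Print to the console (the default separator is a space)
cat("Hello", "world", "\n")

# Use a custom separator
cat("a", "b", "c", sep = "-")
cat("\n")

# Numbers are converted to text before printing
cat("The mean is", mean(c(1, 2, 3)), "\n")

# Print to a file instead of the console
out_file <- tempfile(fileext = ".txt")
cat("Line 1\n", "Line 2\n", sep = "", file = out_file)
written <- readLines(out_file)
written  # "Line 1" "Line 2"
```

Unlike print(), cat() does not add a newline at the end, which is why "\n" is included explicitly.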
- Verify that the elements of letters are equal to the elements of LETTERS in lower case, and that the elements of LETTERS are equal to those in letters in upper case.
- Extract the substring “his” out of the sentence
- Study the documentation of substring() (see ?substring), compare it to substr(), and then try to explain the results of the following examples:
# Extracting substrings:
substring(text = "ABCDE", first = 2, last = 3)
#> [1] "BC"
substring(LETTERS, 2, 3)
#>  [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
#> [26] ""
substring(a, 1:nchar(a), 1:nchar(a))
#> [1] "H" "e" "l" "l" "o"

# Replacing substrings (with recycling):
v5 <- v1
v5
#> [1] "Hello" "!" "This is a sentence."
substring(v5, 1) <- c("A", "Z")
v5
#> [1] "Aello" "Z" "Ahis is a sentence."
- Use a combination of paste0 commands to change the sentence s into an R object “This is a great sentence!”
9.2.2 Regular expressions
A regular expression (a.k.a. regex or regexp) is a sequence of characters that defines a search pattern. Many computer languages provide regex functionality to find patterns in strings of text (see Wikipedia for details).
Think of regular expressions as a compact and formal language to define patterns in text.
Regexps can seem very cryptic, and learning them can be daunting.
- Characters are the basic building blocks of regular expressions: letters and digits typically match themselves.
- Meta-characters (like the wildcard .) have a special meaning rather than matching themselves.
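The difference between ordinary characters and meta-characters can be sketched with grepl() on a small, made-up vector x:

```r
x <- c("cat", "cot", "ct", "c.t")  # made-up example strings

# Ordinary characters match themselves
grepl("cat", x)                # TRUE FALSE FALSE FALSE

# The wildcard . matches any single character
grepl("c.t", x)                # TRUE TRUE FALSE TRUE

# To match a literal dot, escape it or switch off regex matching
grepl("c\\.t", x)              # FALSE FALSE FALSE TRUE
grepl("c.t", x, fixed = TRUE)  # FALSE FALSE FALSE TRUE
```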
9.2.3 String manipulation with stringr
- Task: Detect text or patterns in strings:
# String match:
grep(s, strings)
#> [1] 1 3 5 7 8
grep(s, strings, value = TRUE)
#> [1] "Allosaurus" "Brachiosaurus" "Stegosaurus" "Tyrannosaurus"
#> [5] "Thesaurus"
grepl(s, strings)
#> [1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE
stringr::str_detect(strings, s)
#> [1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE

# Pattern match:
grep(pattern, strings)
#> [1] 1 3 5 7 8
grep(pattern, strings, value = TRUE)
#> [1] "Allosaurus" "Brachiosaurus" "Stegosaurus" "Tyrannosaurus"
#> [5] "Thesaurus"
stringr::str_detect(strings, pattern)
#> [1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE
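The objects s, strings, and pattern are not defined in the code above. The sketch below uses hypothetical definitions that happen to reproduce the printed results: the elements at positions 2, 4, and 6 of strings, the search string s, and the regular expression pattern are all assumptions. The stringr call only runs if the package is installed.

```r
# Hypothetical data: the non-matching elements (2, 4, 6) are made up
strings <- c("Allosaurus", "Ankylosaur", "Brachiosaurus", "Diplodocus",
             "Stegosaurus", "Triceratops", "Tyrannosaurus", "Thesaurus")
s <- "saurus"         # assumed search string
pattern <- "saurus$"  # assumed regex: "saurus" at the end of a string

# grep() returns the positions (or values) of matching elements
grep(s, strings)                # 1 3 5 7 8
grep(s, strings, value = TRUE)

# grepl() returns one logical value per element
grepl(pattern, strings)

# stringr::str_detect() takes the string first and the pattern second
if (requireNamespace("stringr", quietly = TRUE)) {
  print(stringr::str_detect(strings, pattern))
}
```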
Blurring the boundaries between data types is creative and fun — and can have scientific or artistic motivations and consequences.
In science: Quantifying text: Counting character or word frequency of literary works.
Visual arts: Plotting text, creating word search puzzles or crossword puzzles, etc.