9.3 Basic text manipulation

This section covers the elementary constants and functions for manipulating character data that are provided by base R. Base R also provides support for regular expressions and more advanced tasks, which are addressed by Appendix E and used in Section 9.4.

Above (in Section 9.2.1), we defined some simple objects of type character:

a
#> [1] "Hello"
b
#> [1] "!"
c
#> [1] "This is a sentence."
d
#> [1] "The word 'word' is a 4-letter word."

In the following, we first explicate some facts that we have encountered repeatedly throughout this book: R comes with pre-defined elements of data and we can collect text objects in vectors and add them to other (rectangular) data structures. To work with text-related data, R provides a range of commands that allow basic manipulations of character objects. Some straightforward examples of such functions have already been used and include:

nchar(a)
#> [1] 5
paste0(a, b)
#> [1] "Hello!"
c(c, d)
#> [1] "This is a sentence."                 "The word 'word' is a 4-letter word."

The rest of this section answers the following three questions:

  • Which text-related constants exist in R?
  • How can we combine and add text objects to other data structures (vectors, matrices, and tables)?
  • Which functions for solving basic text-related tasks should we know?

As the first two questions can be answered quickly, most of this section will cover text-related functions provided by base R.

9.3.1 Text constants

R contains a few built-in constants. Constants are pre-defined data that should be changed only for good reason. Apart from the constant pi, which evaluates to the numeric double 3.1415927 (unless we assign a different value to it), the following ones are vectors of type character:

  1. LETTERS: 26 upper-case letters of the Roman alphabet;
  2. letters: 26 lower-case letters of the Roman alphabet;
  3. month.abb: 3-letter abbreviations for the English month names;
  4. month.name: English names for the months of the year.

As these four constants come as character vectors, we can select their elements by indexing (just as with any other vector in R):

# Numeric indexing:
LETTERS[1:3]
#> [1] "A" "B" "C"
letters[24:26]
#> [1] "x" "y" "z"
month.abb[11:12]
#> [1] "Nov" "Dec"
month.name[7:9]
#> [1] "July"      "August"    "September"

# Logical indexing:
month.name[nchar(month.name) < 5]
#> [1] "May"  "June" "July"
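
Because month.abb and month.name are parallel vectors, a position found in one can be used to index the other. The following minimal sketch (the vector abbs is made up for illustration) uses the base R function match() to translate abbreviations into full month names:

```r
# Hypothetical input: some month abbreviations
abbs <- c("Mar", "Oct", "Dec")

# Position of each abbreviation within month.abb:
pos <- match(abbs, month.abb)
pos
#> [1]  3 10 12

# Use these positions to index the parallel vector month.name:
full <- month.name[pos]
full
#> [1] "March"    "October"  "December"
```

An abbreviation that does not occur in month.abb would yield NA in both steps.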

In rare instances, it may make sense to name the elements of a character vector so that they can be selected by name (names are also character objects). Above, we encountered the named Umlaut vector from the ds4psy package (in Section 9.2.2):

Umlaut         # a named character vector
#>   a   o   u   A   O   U   s 
#> "ä" "ö" "ü" "Ä" "Ö" "Ü" "ß"
names(Umlaut)  # names are characters
#> [1] "a" "o" "u" "A" "O" "U" "s"

# Selecting elements by name:
Umlaut["o"]
#>   o 
#> "ö"

Having encountered many character vectors and tables that contain character objects, we need to ask:

  • How can we create character vectors and other data structures that collect and store text data?

9.3.2 Data structures for text

Which data structures have we encountered so far? Back in Chapter 1: Basic R concepts and commands, we learned that R relies primarily on linear and rectangular data structures (i.e., vectors and tables) to store data (see Sections 1.4 and 1.5). Fortunately, this applies to character data just as it does to numerical, logical, or temporal data. Hence, we do not need to learn any new commands for integrating character objects into larger data structures. Nevertheless, we will briefly refresh our memory by providing some examples.

Character vectors

Just as with other R objects, the concatenate (or combine) function c() turns a sequence of character objects (or strings of text) into a vector of type character:

c(a, b)
#> [1] "Hello" "!"
v1 <- c(c, d)
v1
#> [1] "This is a sentence."                 "The word 'word' is a 4-letter word."
typeof(v1)
#> [1] "character"

Many R functions that we previously used with numeric vectors also work with character data:

length(v1)
#> [1] 2
rev(v1)
#> [1] "The word 'word' is a 4-letter word." "This is a sentence."
sort(c("B", "A", "D", "C"))
#> [1] "A" "B" "C" "D"

while others clearly would make no sense (as they require logical or numeric arguments):

sum(v1)   # error: invalid 'type' (character) of argument
mean(v1)  # returns NA (with a warning)

As the length() function provides the number of elements in a vector, measuring the length of text objects (in terms of their number of characters) requires a different function, nchar():

nchar(a)
#> [1] 5
nchar(v1)
#> [1] 19 35

Accessing the elements of a character vector works exactly like accessing the elements of any other vector. For instance, we can use numeric or logical indexing:

# numeric indexing:
v1[2]
#> [1] "The word 'word' is a 4-letter word."

# logical indexing:
v1[nchar(v1) < 20]
#> [1] "This is a sentence."

Actually, strings in vectors are as addictive as NA values in calculations: When numbers or other data types are combined with a character object, the entire vector is coerced to type and mode character:

v2 <- c(1, 2, "C")
v2
#> [1] "1" "2" "C"

v3 <- v2[1:2]
v3
#> [1] "1" "2"

typeof(v3)
#> [1] "character"

The reason for the addictive nature of characters is clear: While it is difficult to interpret an object of type character as a number (which number corresponds to the word “tree”?), it is straightforward to interpret numbers as a series of character symbols (i.e., digits). Characters simply are the common denominator of numbers and text.
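
The coercion can be observed step by step. As a rough sketch (the object names mixed and nums are illustrative), c() settles on the most flexible type among its arguments, and converting back with as.numeric() only recovers those elements that look like numbers:

```r
# Coercion order: logical < integer < double < character
typeof(c(TRUE, 2L))        # logical + integer
#> [1] "integer"
typeof(c(TRUE, 2L, 3.5))   # ... + double
#> [1] "double"

mixed <- c(TRUE, 2L, 3.5, "four")  # ... + character
mixed
#> [1] "TRUE" "2"    "3.5"  "four"

# Going back: strings that are not numbers become NA
# (the coercion warning is suppressed here on purpose):
nums <- suppressWarnings(as.numeric(mixed))
is.na(nums)
#> [1]  TRUE FALSE FALSE  TRUE
```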

The function character() takes a numeric argument length and creates a vector of a corresponding number of empty strings:

v4 <- character(length = 4)
v4
#> [1] "" "" "" ""

This may seem unnecessary at this point, but becomes useful when initializing data structures to be filled by the results of a vector operation or a for loop (see Chapter 12 on Iteration for examples):

v4[1] <- "uno"
v4[3] <- "tres"
v4
#> [1] "uno"  ""     "tres" ""
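
As a small sketch of this initialization pattern (the names n and ids are hypothetical), a pre-allocated character vector can be filled element by element in a for loop:

```r
n   <- 3
ids <- character(length = n)  # pre-allocate n empty strings

for (i in seq_len(n)) {
  ids[i] <- paste0("item_", i)  # fill the i-th slot
}
ids
#> [1] "item_1" "item_2" "item_3"

# For a task this simple, the vectorized version needs no loop:
identical(ids, paste0("item_", seq_len(n)))
#> [1] TRUE
```

Pre-allocating avoids growing the vector on every iteration, which matters mostly for long loops.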

The corresponding functions as.character() and is.character() coerce objects into text strings or test whether objects are text strings:

1:4                   # a numeric vector
#> [1] 1 2 3 4
as.character(c(1:4))  # as character
#> [1] "1" "2" "3" "4"

is.character(1:4)
#> [1] FALSE
is.character(as.character(1:4))
#> [1] TRUE

Tabular data structures

Just as with other data types, it is possible and common to store character data in a matrix:

matrix(data = month.abb, ncol = 3, byrow = TRUE)
#>      [,1]  [,2]  [,3] 
#> [1,] "Jan" "Feb" "Mar"
#> [2,] "Apr" "May" "Jun"
#> [3,] "Jul" "Aug" "Sep"
#> [4,] "Oct" "Nov" "Dec"

Matrices are useful and efficient structures for storing data, but have the same limitation as vectors in R: They can only store data of a single data type. As we are usually working with data of multiple types (e.g., combinations of character, numeric, and logical variables), our data is typically stored in the form of data frames or tibbles:

tibble::tibble(participant = LETTERS,  
               initials = paste0(sample(LETTERS), ".", sample(LETTERS), "."),
               age = sample(18:30, size = length(LETTERS), replace = TRUE),
               may_drink = (age > 20))
#> # A tibble: 26 × 4
#>    participant initials   age may_drink
#>    <chr>       <chr>    <int> <lgl>    
#>  1 A           H.E.        29 TRUE     
#>  2 B           Y.I.        26 TRUE     
#>  3 C           K.T.        20 FALSE    
#>  4 D           G.S.        20 FALSE    
#>  5 E           M.D.        21 TRUE     
#>  6 F           B.M.        19 FALSE    
#>  7 G           P.G.        29 TRUE     
#>  8 H           Q.N.        19 FALSE    
#>  9 I           S.P.        28 TRUE     
#> 10 J           Z.Q.        20 FALSE    
#> # … with 16 more rows

As data frames and tibbles were covered extensively in Chapters 1 and 5 (e.g., see Sections 1.5.2 and 5.2), we can devote the rest of this section to functions and the tasks we can address with them.

9.3.3 Text functions

The base R package contains a range of dedicated functions to deal with text objects. As we see them frequently wherever strings are being manipulated (both in this book and in the code of others), it is good to be familiar with them. And even if we should eventually decide to use the alternative functions provided by the stringr package, we should know why we are doing so.

Actually, we have encountered quite a few text-related functions in the previous chapters of this book. For instance, even Chapter 1: Basic R concepts and commands contained the functions nchar() and substr(), as well as the constant LETTERS. Similarly, the paste() function and its variant paste0() appeared in Chapters 1, 2, 5, and in this chapter.

As we will see, the set of text-related functions in base R is a bit like a zoo. There is a mix of some mundane and some amazing creatures, but it can be hard to see their connections or an organizing principle. Thus, we only cover basic text-manipulation functions here and rely on the stringr package for more advanced tasks of string manipulation (in Section 9.4).

9.3.4 Basic tasks with text

Table 9.2 contains the basic tasks of our summary table (from Section 9.2.4).

Table 9.2: Basic tasks of text manipulation (involving a string s).

Task                                | R base                         | stringr
A: Basic tasks                      |                                |
Measure the length of strings s:    | nchar(s)                       | str_length(s)
Change chars in s to lower case:    | tolower(s)                     | str_to_lower(s)\(^{2}\)
Change chars in s to upper case:    | toupper(s)                     | str_to_upper(s)\(^{2}\)
Combine or collapse strings ...:    | paste(...)\(^{1}\)             | str_c(...)
Split a string s:                   | strsplit(s, split)             | str_split(s, pattern)\(^{2}\)
Sort a character vector s:          | sort(s)                        | str_sort(s)\(^{2}\)
Extract or replace substrings in s: | substr(s, start, stop)\(^{1}\) | str_sub(s, start, end)
Translate old into new chars in s:  | chartr(old, new, s)            |
– Text as input or output:          | print(), cat(), format(), readLines(), scan(), writeLines()\(^{1}\) |

Table notes

  • \(^{1}\): base R functions with additional variants that tweak their functionality (see their documentation).

  • \(^{2}\): stringr functions with additional variants that tweak their functionality (see their documentation).

For these basic tasks, we will focus on the base R functions and only briefly show their stringr alternatives.

Measuring string length

We have already seen above that the nchar() function differs from the length() function. Whereas length() provides the number of elements in a vector, the nchar() function provides the length of text objects:

length("Hello")
#> [1] 1
nchar("Hello")
#> [1] 5

length(v1)
#> [1] 2
nchar(v1)
#> [1] 19 35

length(month.name)
#> [1] 12
nchar(month.name)
#>  [1] 7 8 5 5 3 4 4 6 9 7 8 8

The stringr function str_length() provides an alternative to nchar():

stringr::str_length("Hello")
#> [1] 5

stringr::str_length(v1)
#> [1] 19 35

stringr::str_length(month.name)
#>  [1] 7 8 5 5 3 4 4 6 9 7 8 8

Changing character case

A characteristic property of letters in many alphabets is that they exist in lowercase and uppercase forms. As R is case-sensitive (e.g., x and X are different objects), it makes sense that character objects in lower- vs. uppercase are also distinguished (e.g., the vowels aeiou are different from those in AEIOU).

The base R functions tolower() and toupper() allow changing the case of text strings:

tolower(month.abb[7:9])
#> [1] "jul" "aug" "sep"
toupper(month.name[11:12])
#> [1] "NOVEMBER" "DECEMBER"

s <- c("A tiny dog chased a large rat.", "Is POTUS mad?", "Big dad, so sad!")
tolower(s)
#> [1] "a tiny dog chased a large rat." "is potus mad?"                 
#> [3] "big dad, so sad!"
toupper(s)
#> [1] "A TINY DOG CHASED A LARGE RAT." "IS POTUS MAD?"                 
#> [3] "BIG DAD, SO SAD!"
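
A frequent use of tolower() is comparing text without regard to case. The following is a minimal sketch with made-up survey answers (the vector answers is hypothetical); trimws() is a base R function that removes leading and trailing whitespace:

```r
answers <- c("Yes", "YES", "no", " yes ")

# A direct comparison is case- and whitespace-sensitive:
answers == "yes"
#> [1] FALSE FALSE FALSE FALSE

# Normalize first, then compare:
is_yes <- tolower(trimws(answers)) == "yes"
is_yes
#> [1]  TRUE  TRUE FALSE  TRUE
```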

The direct stringr alternatives to tolower() and toupper() are str_to_lower() and str_to_upper(). The functions str_to_sentence() and str_to_title() provide further variations on the same theme with some locale-specific aspects:

stringr::str_to_lower(s)
#> [1] "a tiny dog chased a large rat." "is potus mad?"                 
#> [3] "big dad, so sad!"
stringr::str_to_upper(s)
#> [1] "A TINY DOG CHASED A LARGE RAT." "IS POTUS MAD?"                 
#> [3] "BIG DAD, SO SAD!"
stringr::str_to_title(s)
#> [1] "A Tiny Dog Chased A Large Rat." "Is Potus Mad?"                 
#> [3] "Big Dad, So Sad!"
stringr::str_to_sentence(s)
#> [1] "A tiny dog chased a large rat." "Is potus mad?"                 
#> [3] "Big dad, so sad!"

As issues of capitalization come up regularly when working with text, the capitalize() and caseflip() functions of the ds4psy package provide similar functionality with slightly different options, output formats, and trade-offs:

ds4psy::capitalize(s)
#> [1] "A tiny dog chased a large rat." "Is POTUS mad?"                 
#> [3] "Big dad, so sad!"
ds4psy::capitalize(s, n =  2)
#> [1] "A tiny dog chased a large rat." "IS POTUS mad?"                 
#> [3] "BIg dad, so sad!"
ds4psy::capitalize(s, n = 10)
#> [1] "A TINY DOG chased a large rat." "IS POTUS Mad?"                 
#> [3] "BIG DAD, So sad!"
ds4psy::capitalize(s, as_text = FALSE)
#> [1] "A tiny dog chased a large rat." "Is POTUS mad?"                 
#> [3] "Big dad, so sad!"

ds4psy::caseflip(s[2])
#> [1] "iS potus MAD?"

Combining or collapsing strings

The paste() function is the string-related workhorse of the base R package. In its most basic form, paste() combines multiple character objects into one character object. Although this behavior is also referred to as “concatenating vectors” (e.g., in R’s documentation), it is instructive to contrast the behavior of paste() with that of the c() function:

paste(a, b)
#> [1] "Hello !"
length(paste(a, b))
#> [1] 1

c(a, b)
#> [1] "Hello" "!"
length(c(a, b))
#> [1] 2

Thus, whereas c() combines multiple objects into a vector, paste() combines multiple strings into a single one.

Interestingly, paste() also works with non-character objects, but coerces them into character objects:

paste(2, 4, 6)
#> [1] "2 4 6"
paste(TRUE, FALSE)
#> [1] "TRUE FALSE"

The paste() function has two arguments, sep and collapse. The sep argument of paste() allows selecting the separation character(s) between the terms to be pasted (and is set to " " by default):

paste(a, b)
#> [1] "Hello !"
paste(a, b, sep = " ")
#> [1] "Hello !"
paste(a, b, sep = "<--->")
#> [1] "Hello<--->!"

As we often want to create a character string without any separation between the original terms, the paste0() variant of paste() is a convenient wrapper for paste(..., sep = ""):

paste0(a, b)
#> [1] "Hello!"
paste(a, b, sep = "")
#> [1] "Hello!"

The collapse argument of paste() is set to NULL by default. To understand its usefulness, consider the following examples:

paste(letters[1:3])
#> [1] "a" "b" "c"
length(paste(letters[1:3]))
#> [1] 3

Contrary to what we may have expected, this paste() command did not yield a single character string, but a vector of three character objects. The reason for this is simple: The input we provided to paste() was not multiple character objects, but a single vector of character objects (here: letters[1:3], which evaluates to a, b, c).

But as R usually works with vectors, we often want to create a single character object (or string of text) out of a vector of character objects. This can also be achieved with paste(), but requires setting the collapse argument to some value other than NULL:

paste(letters[1:3], collapse = " ")
#> [1] "a b c"
length(paste(letters[1:3], collapse = " "))
#> [1] 1

Thus, collapse works just like sep, but is used when we want to combine a vector of strings into a single one:

paste(letters[1:10], collapse = "")
#> [1] "abcdefghij"
paste(letters[1:10], collapse = "-")
#> [1] "a-b-c-d-e-f-g-h-i-j"

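If both sep and collapse are provided, both are applied in turn: sep joins the corresponding elements of the arguments, and collapse then joins the elements of the resulting vector. Which of the two has a visible effect thus depends on the type of input provided: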

# paste character objects:
paste("a", "b", "c", sep = "_", collapse = "|")
#> [1] "a_b_c"

# paste a character vector:
paste(c("a", "b", "c"), sep = "_", collapse = "|")
#> [1] "a|b|c"
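
Both arguments leave a visible trace when paste() receives several vectors with more than one element each. A short sketch (the objects x, y, step1, step2, and both are illustrative):

```r
x <- c("a", "b", "c")
y <- 1:3

# Step 1: sep joins corresponding elements across the arguments:
step1 <- paste(x, y, sep = "_")
step1
#> [1] "a_1" "b_2" "c_3"

# Step 2: collapse joins the elements of that result:
step2 <- paste(step1, collapse = "|")

# Supplying both arguments performs both steps at once:
both <- paste(x, y, sep = "_", collapse = "|")
both
#> [1] "a_1|b_2|c_3"
identical(both, step2)
#> [1] TRUE
```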

A nifty feature of paste() is that it also works with vectors of different length, in which case the term with fewer elements is recycled (i.e., repeated to match the maximum number of elements):

paste("n", 1:10, sep = "_")
#>  [1] "n_1"  "n_2"  "n_3"  "n_4"  "n_5"  "n_6"  "n_7"  "n_8"  "n_9"  "n_10"
paste(LETTERS[1:3], 1:10)
#>  [1] "A 1"  "B 2"  "C 3"  "A 4"  "B 5"  "C 6"  "A 7"  "B 8"  "C 9"  "A 10"

paste(month.abb, 2020)
#>  [1] "Jan 2020" "Feb 2020" "Mar 2020" "Apr 2020" "May 2020" "Jun 2020"
#>  [7] "Jul 2020" "Aug 2020" "Sep 2020" "Oct 2020" "Nov 2020" "Dec 2020"
paste(month.abb, 2020, collapse = ", ")
#> [1] "Jan 2020, Feb 2020, Mar 2020, Apr 2020, May 2020, Jun 2020, Jul 2020, Aug 2020, Sep 2020, Oct 2020, Nov 2020, Dec 2020"
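
When plain recycling does not produce the desired pairing, one option is to expand a term explicitly with rep() before pasting. A small sketch (the object cells is hypothetical) that labels every combination of two letters and three numbers:

```r
# Each letter is repeated 3 times; 1:3 is then recycled twice:
cells <- paste0(rep(LETTERS[1:2], each = 3), 1:3)
cells
#> [1] "A1" "A2" "A3" "B1" "B2" "B3"
```

Without rep(), paste0(LETTERS[1:2], 1:3) would pair the terms as "A1", "B2", "A3".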

The stringr alternative to paste() is str_c(), which emphasizes the connection to c(). The str_c() function also takes two arguments sep and collapse, but the default of sep is set to "" (i.e., no separation):

stringr::str_c(a, b)
#> [1] "Hello!"
stringr::str_c(a, b, sep = " ")
#> [1] "Hello !"

stringr::str_c("a", "b", "c", sep = "_", collapse = "|")
#> [1] "a_b_c"
stringr::str_c(c("a", "b", "c"), sep = "_", collapse = "|")
#> [1] "a|b|c"

In addition, the tidyverse contains a glue package (Hester, 2020) that makes it easier to combine text strings and other data elements (see https://github.com/tidyverse/glue for details).

Splitting strings

Splitting a string into multiple strings is the opposite of pasting (i.e., combining or collapsing) strings. The strsplit() function takes two arguments: x is a string (or multiple strings) and split denotes a character object, or a pattern defined as a regular expression (see Appendix E), at which the string x is to be split. Intuitively, we would expect a sequence of x <- paste(v, collapse = " ") and strsplit(x, split = " ") to yield v again, but see what happens:

# A character vector:
v <- letters[1:3]
v
#> [1] "a" "b" "c"

# Paste into 1 string: 
x <- paste(v, collapse = " ")
x
#> [1] "a b c"

# Split this string:
y <- strsplit(x, split = " ")
y
#> [[1]]
#> [1] "a" "b" "c"

# Is y equal to original v? 
all.equal(v, y)
#> [1] "Modes: character, list"              
#> [2] "Lengths: 3, 1"                       
#> [3] "target is character, current is list"

It turns out that strsplit() returns a list, rather than a vector. To obtain the original vector v, we need to unlist() the list y:

# Turn a list into a vector:
unlist(y)
#> [1] "a" "b" "c"

# Is unlist(y) equal to the original vector v? 
all.equal(unlist(y), v)
#> [1] TRUE
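
The list output makes more sense once strsplit() receives several strings: each input string gets its own list element, so the pieces of different strings stay apart. A minimal sketch (the objects phrases, parts, and first are made up for illustration):

```r
phrases <- c("a b", "c d e")
parts   <- strsplit(phrases, split = " ")

length(parts)   # one list element per input string
#> [1] 2
lengths(parts)  # number of pieces per string
#> [1] 2 3

# Extract the first piece of every string:
first <- vapply(parts, function(w) w[1], character(1))
first
#> [1] "a" "c"
```

Calling unlist(parts) would merge all pieces into one vector and lose the information about which string they came from.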

What do we need strsplit() for? A typical application of using strsplit() (often followed by unlist()) consists in splitting a longer string of text (e.g., a paragraph or chapter) into individual words or sentences:

# Create a vector of 3 sentences:
s1 <- c("This is the first sentence.", 
        "Yet another short sentence.", 
        "A third and final sentence.")

# Paste into 1 paragraph:
p <- paste(s1, collapse = " ")
p
#> [1] "This is the first sentence. Yet another short sentence. A third and final sentence."

# Split p into words:
w <- strsplit(p, split = " ")
w
#> [[1]]
#>  [1] "This"      "is"        "the"       "first"     "sentence." "Yet"      
#>  [7] "another"   "short"     "sentence." "A"         "third"     "and"      
#> [13] "final"     "sentence."

Unfortunately, we cannot just enter strsplit(p, split = ".") to split the paragraph p at every full stop (i.e., the punctuation symbol .) into the sentences it contains. The solution for this task also involves a version of strsplit() but the split argument looks unusually cryptic:

# Split p into sentences:
s2 <- unlist(strsplit(p, split = "\\.|\\. "))
s2
#> [1] "This is the first sentence" "Yet another short sentence"
#> [3] "A third and final sentence"

# More general solution (allowing for other punctuation marks):
s2 <- unlist(strsplit(p, split = "[[:punct:]]|[[:punct:]] "))
s2
#> [1] "This is the first sentence" "Yet another short sentence"
#> [3] "A third and final sentence"

We already know the reason for this (from Section 9.2.2): The punctuation symbol . is one of the 12 so-called metacharacters that have a special meaning in matching patterns:

  • . \ | ( ) [ { ^ $ * + ? (see the metachar vector of ds4psy).

Unfortunately, understanding the solution(s) shown here requires that we first learn more about regular expressions (see Appendix E). What we can do at this point, however, is add the missing full stops to the end of each sentence in s2 and verify that we then have regained our original vector of sentences s1:

# Paste "." to each sentence in s2:
s3 <- paste0(s2, ".")

all.equal(s1, s3)
#> [1] TRUE
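
If the split point is a literal string rather than a pattern, a possible alternative is the fixed = TRUE argument of strsplit(), which switches off the special meaning of metacharacters. A brief sketch, using the same paragraph as above (stored here as p1 to keep the example self-contained):

```r
p1 <- paste("This is the first sentence.",
            "Yet another short sentence.",
            "A third and final sentence.")

# Without fixed = TRUE, "." matches ANY character, so nothing remains:
all(strsplit(p1, split = ".")[[1]] == "")
#> [1] TRUE

# With fixed = TRUE, ". " is taken literally (full stop plus space):
parts <- strsplit(p1, split = ". ", fixed = TRUE)[[1]]
length(parts)
#> [1] 3
parts[3]  # the final full stop remains, as no space follows it
#> [1] "A third and final sentence."
```

The trade-off is flexibility: a fixed string cannot express alternatives such as "a full stop, optionally followed by a space".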

As splitting text into sentences or words is a frequent task, but turns out to be surprisingly difficult, the ds4psy package provides two corresponding helper functions: text_to_sentences() and text_to_words(). However, under the hood, these functions are really just convenient wrappers for a series of strsplit() and unlist() commands:

p <- c("This is a paragraph. A second sentence here.", 
       "A third sentence, etc., as another string.",
       "A question? The end, finally!")

ds4psy::text_to_sentences(p)
#> [1] "This is a paragraph."                      
#> [2] "A second sentence here."                   
#> [3] "A third sentence, etc., as another string."
#> [4] "A question?"                               
#> [5] "The end, finally!"
ds4psy::text_to_words(p)
#>  [1] "This"      "is"        "a"         "paragraph" "A"         "second"   
#>  [7] "sentence"  "here"      "A"         "third"     "sentence"  "etc"      
#> [13] "as"        "another"   "string"    "A"         "question"  "The"      
#> [19] "end"       "finally"

The stringr alternative to strsplit() is str_split(). The str_split() function uses an argument pattern (instead of the split argument) and provides some additional options (e.g., for returning a character matrix). By default, it also returns a list of character vectors, rather than a vector:

x  # from above
#> [1] "a b c"

(y <- stringr::str_split(x, pattern = " "))
#> [[1]]
#> [1] "a" "b" "c"

unlist(y)  # turn list into vector
#> [1] "a" "b" "c"

Sorting strings

Sorting strings into alphabetical order is easy. As we have briefly seen above (and in Chapter 1, Section 1.4.3), we can use the sort() function that also handles numeric or logical objects:

f <- c("Banana", "Lemon", "Apple", "Zucchini", "Cucumber")

sort(f)
#> [1] "Apple"    "Banana"   "Cucumber" "Lemon"    "Zucchini"
sort(f, decreasing = TRUE)
#> [1] "Zucchini" "Lemon"    "Cucumber" "Banana"   "Apple"

Actually, sorting strings is easy, as long as we do not confuse sort() with order(). The order() function yields a numeric vector that provides a permutation that rearranges its first argument into ascending (or descending, if decreasing = TRUE) order:

order(f)
#> [1] 3 1 5 2 4
order(f, decreasing = TRUE)
#> [1] 4 2 5 1 3

The connection between both functions can be seen by using the output of order(f) to sort the vector f (by numerical indexing):

f[order(f)]
#> [1] "Apple"    "Banana"   "Cucumber" "Lemon"    "Zucchini"
f[order(f, decreasing = TRUE)]
#> [1] "Zucchini" "Lemon"    "Cucumber" "Banana"   "Apple"

If this is confusing, just remember to use sort(), rather than order(), to sort vectors into order.
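
Where order() becomes useful is sorting by a criterion other than the values themselves. As a hedged illustration (the object by_length is hypothetical), the following arranges the fruit names by their number of characters and breaks ties alphabetically:

```r
f <- c("Banana", "Lemon", "Apple", "Zucchini", "Cucumber")

# 1st key: number of characters; 2nd key: the names themselves
by_length <- f[order(nchar(f), f)]
by_length
#> [1] "Apple"    "Lemon"    "Banana"   "Cucumber" "Zucchini"
```

The same idiom, df[order(df$x), ], is a common way to sort the rows of a data frame by one of its columns.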

The stringr alternatives to sort() and order() are, perhaps not surprisingly, str_sort() and str_order(), which provide a few additional options (e.g., a numeric argument to treat numeric character objects like numbers):

stringr::str_sort(f)
#> [1] "Apple"    "Banana"   "Cucumber" "Lemon"    "Zucchini"
stringr::str_sort(f, decreasing = TRUE)
#> [1] "Zucchini" "Lemon"    "Cucumber" "Banana"   "Apple"

stringr::str_order(f)
#> [1] 3 1 5 2 4
stringr::str_order(f, decreasing = TRUE)
#> [1] 4 2 5 1 3

# Numeric character objects:
stringr::str_sort(c("10a", "2b", "100c", "5d"))
#> [1] "100c" "10a"  "2b"   "5d"
stringr::str_sort(c("10a", "2b", "100c", "5d"), numeric = TRUE)
#> [1] "2b"   "5d"   "10a"  "100c"

Extracting and replacing substrings (by position)

Extracting part(s) of a character string x by specifying the location (or position) to be extracted is straightforward by using the substr() function:

a  # from above
#> [1] "Hello"

# Extracting substrings: 
substr(x = a, start = 1, stop = 4)
#> [1] "Hell"
substr(x = a, start = 4, stop = 5)
#> [1] "lo"

Let’s find out what happens when we set the main argument x to a character vector and use unusual (or non-existent) start and stop values:

w <- c("a", "big", "coconut", "does", "exist", ".")

substr(w, -2,  1)  # start < 0
#> [1] "a" "b" "c" "d" "e" "."
substr(w,  2, 99)  # stop > nchar(w)
#> [1] ""       "ig"     "oconut" "oes"    "xist"   ""
substr(w,  2,  1)  # start > stop
#> [1] "" "" "" "" "" ""

This shows that substr() works for character vectors (rather than scalars) and stays well-behaved for unusual or non-existent values of start and stop. This is particularly important when writing text-related functions (see Chapter 11), where we typically do not know the character argument x in advance.

An interesting setting for start and stop could take into account the current number of characters nchar(). For instance, the following extracts the last four characters of any string in s:

s
#> [1] "A tiny dog chased a large rat." "Is POTUS mad?"                 
#> [3] "Big dad, so sad!"

# last 4 characters: 
substr(s, nchar(s) - 3, nchar(s))
#> [1] "rat." "mad?" "sad!"
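
Since this pattern recurs, it can be wrapped into a small helper function. The following is only a sketch (the function name last_chars and its argument n are made up here) and relies on the tolerant handling of start values below 1 shown above:

```r
# Return the last n characters of each string in x:
last_chars <- function(x, n) {
  substr(x, start = nchar(x) - n + 1, stop = nchar(x))
}

last_chars(c("coconut", "big", "a"), n = 3)
#> [1] "nut" "big" "a"
```

Strings with fewer than n characters are returned unchanged, as substr() treats a start position below 1 like the first character.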

Assigning a new character object to the extracted string(s) will replace the extracted string(s):

d <- s[1]
d
#> [1] "A tiny dog chased a large rat."

# Extract substring by position:
substr(d, 8, 10)
#> [1] "dog"

# Replace a substring by positions:
substr(d, 8, 10) <- "cat"
d  # Note that d has changed:
#> [1] "A tiny cat chased a large rat."

# Replace a substring by positions:
substr(d, 8, 10) <- "pig"
d  # Note that d has changed:
#> [1] "A tiny pig chased a large rat."

# Replace a substring by positions:
substr(d, 8, 10) <- "cowboy"
d  # Note that d has changed, but NOT to "cowboy": 
#> [1] "A tiny cow chased a large rat."

As the last example shows, extracting and replacing substrings only affects the positions denoted by start and stop.

The stringr alternative to substr() is str_sub(), but the stop argument of the former is called end in the latter. Additionally, negative argument values now count back from the last character:

# Extracting substrings:
stringr::str_sub(a, start = 1, end = 4)
#> [1] "Hell"
stringr::str_sub(a, start = 4, end = 5)
#> [1] "lo"

# Unusual values for start and end:
stringr::str_sub(w, -2,  1)  # start < 0
#> [1] "a" ""  ""  ""  ""  "."
stringr::str_sub(w,  2, 99)  # end > nchar(w)
#> [1] ""       "ig"     "oconut" "oes"    "xist"   ""
stringr::str_sub(w,  2,  1)  # start > end
#> [1] "" "" "" "" "" ""

# last 4 characters: 
stringr::str_sub(s,  -4,  -1)  
#> [1] "rat." "mad?" "sad!"

Assigning a new character object to str_sub() behaves similarly to, but slightly differently from, assigning one to substr():

# Extract and replace:
stringr::str_sub(d, 8, 10) <- "boy"
d  # Note that d has changed:
#> [1] "A tiny boy chased a large rat."

stringr::str_sub(d, 8, 10) <- "mannequin"
d  # Note that d HAS changed to "mannequin": 
#> [1] "A tiny mannequin chased a large rat."
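
In base R, a replacement of a different length can be approximated by pasting together the parts before and after the positions to be replaced. A minimal sketch (the objects d0 and d1 are illustrative, so that d from above is left untouched):

```r
d0 <- "A tiny dog chased a large rat."

# Keep positions 1-7, insert the new word, keep positions 11 to the end:
d1 <- paste0(substr(d0, 1, 7), "mannequin", substr(d0, 11, nchar(d0)))
d1
#> [1] "A tiny mannequin chased a large rat."
```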

Translating characters

Another way of replacing characters does not specify the parts to-be-replaced by their position, but by their identity. More specifically, the chartr() function takes a sequence of characters old and replaces them by the corresponding character in a sequence of characters new:

v2 <- v1[2]

# Translate 1 character:
chartr("o", "0", v2)
#> [1] "The w0rd 'w0rd' is a 4-letter w0rd."

# Translate multiple characters:
chartr("aeio", "AEIO", v2)
#> [1] "ThE wOrd 'wOrd' Is A 4-lEttEr wOrd."
chartr("aeio", "4310", v2)
#> [1] "Th3 w0rd 'w0rd' 1s 4 4-l3tt3r w0rd."

# Works for non-letters:
chartr(old = " .-", new = "_!/xyz", x = v2)
#> [1] "The_word_'word'_is_a_4/letter_word!"
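
Because chartr() maps every character in old to the character at the same position in new, it lends itself to simple substitution ciphers. The following sketch (the names old_chars, new_chars, and rot13 are hypothetical) shifts each letter by 13 positions, so that applying the function twice restores the original text:

```r
old_chars <- paste(c(letters, LETTERS), collapse = "")
new_chars <- paste(c(letters[c(14:26, 1:13)],
                     LETTERS[c(14:26, 1:13)]), collapse = "")

rot13 <- function(x) chartr(old_chars, new_chars, x)

rot13("Hello")
#> [1] "Uryyb"
rot13(rot13("Hello"))
#> [1] "Hello"
```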

There appears to be no direct stringr alternative to the chartr() function.

9.3.5 Text as input or output

There are a number of simple R functions that deal with reading text inputs and showing or storing text outputs. As most of them are fairly straightforward, we only briefly mention them here:

# Data:
sm <- month.abb[6:8]  # summer months
pg <- c("This is a paragraph. A second sentence here.", 
        "A third sentence, etc., as another string.",
        "A question? The end, finally!")

  • print() is a generic function for printing an argument and returning it invisibly.

print(sm)
#> [1] "Jun" "Jul" "Aug"
print(pg)
#> [1] "This is a paragraph. A second sentence here."
#> [2] "A third sentence, etc., as another string."  
#> [3] "A question? The end, finally!"

  • cat() concatenates and prints several objects, which is useful for printing the output of user-defined functions (see Chapter 11 on writing functions). cat() converts its arguments to character vectors, combines them to a single vector, appends the given string separator sep to each element, and then prints them (to the console or a file):

cat(sm)
#> Jun Jul Aug
cat(sm, sep = " | ")
#> Jun | Jul | Aug
cat(pg)
#> This is a paragraph. A second sentence here. A third sentence, etc., as another string. A question? The end, finally!

# Note the difference to:
paste(sm, collapse = " ")
#> [1] "Jun Jul Aug"
paste(sm, collapse = " | ")
#> [1] "Jun | Jul | Aug"
paste(pg, collapse = " ")
#> [1] "This is a paragraph. A second sentence here. A third sentence, etc., as another string. A question? The end, finally!"

  • format() provides support for pretty-printing objects. This will become important when displaying dates and times (see Chapter 10), but is also relevant for numbers:

mio <- 10^6

format(mio)
#> [1] "1e+06"
format(mio, scientific = FALSE, 
       big.mark = ",",  decimal.mark = ".", nsmall = 2)
#> [1] "1,000,000.00"
format(mio, scientific = FALSE, 
       big.mark = ".",  decimal.mark = ",", nsmall = 2)
#> [1] "1.000.000,00"
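
Beyond numbers, format() can also pad strings to a common width, and the related base R function formatC() can add leading zeros. A brief sketch (the objects padded and ids are made up for illustration):

```r
# Pad strings with trailing spaces to a width of 5:
padded <- format(c("a", "bbb"), width = 5)
padded
#> [1] "a    " "bbb  "

# Pad integers with leading zeros to a width of 3:
ids <- formatC(c(5L, 42L), width = 3, flag = "0")
ids
#> [1] "005" "042"
```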

  • readLines() and scan() read text from a connection (e.g., a file, URL, or the console):

# Create a file (for testing purposes): 
cat("The Title", "1 2 3", "4 5 6", "7 8 9", 
    file = "test.dat", sep = "\n")

readLines("test.dat")
#> [1] "The Title" "1 2 3"     "4 5 6"     "7 8 9"
readLines("test.dat", n = 2)  # read 2 lines
#> [1] "The Title" "1 2 3"

scan("test.dat", skip = 1)  # skip the 1st line
#> [1] 1 2 3 4 5 6 7 8 9
scan("test.dat", skip = 1, quiet = TRUE)
#> [1] 1 2 3 4 5 6 7 8 9
scan("test.dat", skip = 1, nlines = 2)  # only 2 lines after the skipped one
#> [1] 1 2 3 4 5 6

# Clean up file:
unlink("test.dat")  

  • writeLines() writes text lines to a connection (e.g., a file, URL, or the console):

writeLines(sm)
#> Jun
#> Jul
#> Aug
writeLines(pg)
#> This is a paragraph. A second sentence here.
#> A third sentence, etc., as another string.
#> A question? The end, finally!
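
Writing and reading can be combined into a round trip. The following sketch (the objects tmp, lines_out, and lines_in are hypothetical) writes a character vector to a temporary file and checks that reading it back yields the same vector:

```r
tmp <- tempfile(fileext = ".txt")  # a temporary file path
lines_out <- c("Jun", "Jul", "Aug")

writeLines(lines_out, con = tmp)   # one element per line
lines_in <- readLines(tmp)

identical(lines_in, lines_out)
#> [1] TRUE

unlink(tmp)  # clean up
```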

See each function’s documentation for details and related functions.

Before expanding our horizons by moving on to regular expressions (see Appendix E) or more advanced tasks (see Section 9.4), here are some practice tasks on some important base R commands.

Practice

  1. Verify that the elements of letters are equal to the elements of LETTERS in lower case, and that the elements of LETTERS are equal to those of letters in upper case.
all.equal(letters, tolower(LETTERS))
all.equal(toupper(letters), LETTERS)
  1. Varieties of paste():
  • Verify that and explain why the following three R commands yield the same result:
a <- "Hello"

paste(a, "!", a)
paste(a, a, sep = " ! ")
paste0(a, " ! ", a)
  • Which stringr command would yield the same result?
  1. Without knowing the identity of txt, predict and explain the result of

  2. Extract the substring “his” out of the sentence s.

s <- "This is a sentence."

substr(s, start = 2, stop = 4)
  1. Use a combination of substr() and paste0() commands to change the sentence st into an R object “This is a great sentence!”
st <- "This is a sentence."

paste0(substr(st, 1, 10), "great ", substr(st, 11, (nchar(st) - 1)), "!")
  6. Study the documentation of substring() (by evaluating ?substring), compare it to substr(), and then try to explain the results of the following examples:
# Extracting substrings:
substring(text = "ABCDE", first = 2, last = 3)
#> [1] "BC"
substring(LETTERS, 2, 3)
#>  [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
#> [26] ""
substring(a, 1:nchar(a), 1:nchar(a))
#> [1] "H" "e" "l" "l" "o"

# Replacing substrings (with recycling):
v1
#> [1] "This is a sentence."                 "The word 'word' is a 4-letter word."
substring(v1, 1) <- c("A", "Z")
v1
#> [1] "Ahis is a sentence."                 "Zhe word 'word' is a 4-letter word."
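
One way to approach the comparison of substr() and substring() is to try both on the same short string. The following sketch (using a hypothetical object x) suggests that substring() provides a default end and returns one result per element of its longest argument, whereas substr() needs an explicit stop:

```r
x <- "ABCDE"

# substr() needs both start and stop:
s1 <- substr(x, start = 2, stop = 3)
s1
#> [1] "BC"

# substring() reads to the end of the string by default:
s2 <- substring(x, first = 3)
s2
#> [1] "CDE"

# substring() recycles x over vectors of first and last positions:
s3 <- substring(x, first = 1:3, last = 3:5)
s3
#> [1] "ABC" "BCD" "CDE"
```

This may also help to explain the empty results above: every element of LETTERS contains only one character, so there is nothing to extract at positions 2 to 3.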
  7. Word processing:
  • Read in a paragraph of text containing multiple sentences (with punctuation marks) and use base R command(s) to partition it into a vector that contains its (a) individual sentences and (b) individual words.
# Data: William James (1890): The Principles of Psychology
# Chapter I: The Scope of Psychology

wj_pp <- "Psychology is the Science of Mental Life, both of its phenomena and of their conditions. The phenomena are such things as we call feelings, desires, cognitions, reasonings, decisions, and the like; and, superficially considered, their variety and complexity is such as to leave a chaotic impression on the observer. The most natural and consequently the earliest way of unifying the material was, first, to classify it as well as might be, and, secondly, to affiliate the diverse mental modes thus found, upon a simple entity, the personal Soul, of which they are taken to be so many facultative manifestations. Now, for instance, the Soul manifests its faculty of Memory, now of Reasoning, now of Volition, or again its Imagination or its Appetite. This is the orthodox 'spiritualistic' theory of scholasticism and of common-sense. Another and a less obvious way of unifying the chaos is to seek common elements in the divers mental facts rather than a common agent behind them, and to explain them constructively by the various forms of arrangement of these elements, as one explains houses by stones and bricks. The 'associationist' schools of Herbart in Germany, and of Hume, the Mills and Bain in Britain, have thus constructed a psychology without a soul by taking discrete 'ideas,' faint or vivid, and showing how, by their cohesions, repulsions, and forms [p.~2] of succession, such things as reminiscences, perceptions, emotions, volitions, passions, theories, and all the other furnishings of an individual's mind may be engendered. The very Self or ego of the individual comes in this way to be viewed no longer as the pre-existing source of the representations, but rather as their last and most complicated fruit."
wj_pp

# (a) sentences:
st <- unlist(strsplit(wj_pp, split = "\\. "))  # split at "."
st <- unlist(strsplit(st, split = "\\? "))     # split at "?"
st <- unlist(strsplit(st, split = "! "))       # split at "!"
st

# (b) words:
wd <- unlist(strsplit(wj_pp, split = " "))  # split at spaces (" ")
wd <- chartr(old = "-,.;:!?[]", new = "         ", x = wd)  # replace punctuation symbols with " "
wd <- unlist(strsplit(wd, split = " "))     # split at " " again (to separate parts and drop trailing spaces)
wd <- wd[nzchar(wd)]                        # drop empty strings (left by leading spaces)
words <- wd

words
  • Shortcut: If you find this difficult, try solving the task by using the text_to_sentences() and text_to_words() functions provided by the ds4psy package.
library(ds4psy)

# (1) Split into parts: 
text_to_sentences(wj_pp)
text_to_words(wj_pp)

# (2) Quantify words and letters: ----  

# (a) words:
(wt <- count_words(wj_pp, case_sense = FALSE))
round(wt/sum(wt), 3)  # proportion of words

# (b) letters:
(tt <- count_chars(wj_pp, case_sense = FALSE))
round(tt/sum(tt), 3)  # proportion of chars
  • Take the vector of individual words from above and sort its elements alphabetically.
sort(words)
  • Take the (unsorted) vector of individual words from above and capitalize the first letter of each word to create a new vector Words.

Hint: This is surprisingly difficult. A possible solution could split each word into two substrings — its first letter vs. the rest of the word — and then re-combine both parts (after capitalizing the first letter):

words  # data (from above)

w_len <- nchar(words)  # length of each word
n_cap <- 1             # number of characters to capitalize
first <- substr(words, 1, n_cap)      # first character of each word 
rest  <- substr(words, n_cap + 1, w_len)  # rest of each word (with explicit end)
rest  <- substring(words, n_cap + 1)  # alternative with the same result (with default end)
Words <- paste0(toupper(first), rest) # capitalize first and paste with rest

Words  # capitalized word vector
  • Shortcut: If a task turns out to be difficult, we can always look for existing functions that solve our problem. In the case of capitalization, we could use the str_to_title() function of the stringr package or the capitalize() function of the ds4psy package:
(Words <- stringr::str_to_title(words))

(Words <- ds4psy::capitalize(words, as_text = FALSE))
  • Finally, re-create the original paragraph by pasting the vector of capitalized words Words into a single text string (and note how hard it is to parse and read a paragraph of words without any punctuation).
(pg <- paste(Words, collapse = " "))  # pasting Words into 1 string
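
The individual steps of this task (splitting a text into words, capitalizing each word, and pasting the words back together) can also be combined into a small pipeline. The following is a minimal sketch, assuming a short sample string without punctuation; the helper capitalize_first() and the objects smp, w, W, and out are hypothetical names introduced here for illustration and are not part of base R:

```r
# Hypothetical helper: capitalize the first letter of each element of x
capitalize_first <- function(x) {
  first <- substr(x, 1, 1)   # first character of each word
  rest  <- substring(x, 2)   # rest of each word (empty for 1-letter words)
  paste0(toupper(first), rest)
}

smp <- "this is a short sample sentence"  # hypothetical sample text
w   <- unlist(strsplit(smp, split = " ")) # split into words
W   <- capitalize_first(w)                # capitalize each word
out <- paste(W, collapse = " ")           # re-combine into 1 string
out
#> [1] "This Is A Short Sample Sentence"
```

Wrapping the steps in a function makes them reusable, but the sketch leaves all characters after the first unchanged, whereas stringr::str_to_title() also converts them to lower case.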

This concludes our overview of basic text-manipulation tasks, for which we primarily use base R functions. Before proceeding to more advanced text manipulation, consider taking a detour to acquire or refresh your familiarity with regular expressions (see Appendix E).
