9.4 Advanced text-manipulation

In Section 9.2.4, we emphasized that base R provides a range of functions to deal with advanced tasks of text-manipulation. However, as these are a bit cryptic and confusing (e.g., by existing in many variants and returning their results as lists), using the stringr package (Wickham, 2019b) included in the tidyverse (Wickham et al., 2019) is easier and more straightforward:

library(stringr)

As both base R and stringr functions support pattern matching, advanced text-manipulation assumes some familiarity with using regular expressions (see Appendix E).

9.4.1 Advanced tasks with text

This section is structured according to the advanced tasks identified in Section 9.2.4. The following Table 9.3 repeats the advanced tasks of our original summary table:

**Table 9.3:** Advanced tasks of text-manipulation (involving a string `s` and pattern `p`).
| Task | R base | stringr |
|:-----|:-------|:--------|
| **B: Advanced tasks** | | |
| – View strings in `s` that match `p`: | | `str_view(s, p)`$^{a}$ |
| – Detect pattern `p` in strings `s`: | `grep(p, s)` `grepl(p, s)`$^{1}$ | `str_detect(s, p)` |
| – Locate pattern `p` in strings `s`: | `gregexpr(p, s)`$^{1}$ | `str_locate(s, p)`$^{a}$ |
| – Obtain strings in `s` that match `p`: | `grep(p, s, value = TRUE)`$^{1}$ | `str_subset(s, p)` |
| – Extract matches of `p` in strings `s`: | `regmatches(s, gregexpr(p, s))`$^{1}$ | `str_extract(s, p)`$^{a}$ |
| – Replace matches of `p` by `r` in `s`: | `gsub(p, r, s)`$^{1}$ | `str_replace(s, p, r)`$^{a}$ |
| – Count matches of `p` in strings `s`: | | `str_count(s, p)` |

Table notes

$^{1}$: base R functions with additional variants that tweak their functionality (see their documentation).
$^{2}$: stringr functions with additional variants that tweak their functionality.
$^{a}$: stringr functions with an additional suffix _all() that applies the function to all matches (rather than just to the first match).

Why stringr?

There are three main reasons for switching to the stringr package to address these more advanced tasks:

Organization: The stringr functions are named and structured in a more systematic fashion than their base R ancestors.
Specialization: The stringr functions are more specialized than the corresponding base R functions. Each function is designed to accomplish a specific task rather well, rather than the base R approach of providing a family of related functions that do many things in different ways.
Functionality: The outputs of str_view() and str_count() are difficult to reproduce with base R functions. Additionally, the pattern argument of many stringr functions can be used in combination with so-called modifiers that govern the interpretation of a regular expression and allow setting additional arguments (e.g., ignore_case = TRUE, see ?stringr::regex for the documentation).

Personally, I use the base R functions nchar(), paste(), and substr() on a regular basis, appreciate the flexibility of grep() and strsplit(), but find the details of gregexpr(), gsub(), and regmatches() too confusing to remember. Regarding the stringr package, I enjoy the convenience of str_view_all() for displaying the matches of regular expressions, the fact that a uniform set of commands is named after the tasks they perform, and the invaluable functionality of str_count() for quantifying pattern matches.

Data

To have some character data to match, here is a delightfully nonsensical character vector s:

# (1) Define some data: ----
s <- c("The cat sat on the mat.", 
       "The mad hatter had heard her, so what?", 
       "The fat dad was so sad.")

Additionally, we will use some of the data included in stringr and ds4psy:

# (2) from ds4psy: ------ 

# (a) words: ---- 
fruits <- ds4psy::fruits
# length(fruits)  # 122
countries <- ds4psy::countries
# length(countries)  # 197
Trumpisms <- ds4psy::Trumpisms
# length(Trumpisms)  # 168

# (b) sentences: ---- 
flowery <- ds4psy::flowery
# length(flowery)  # 60
Bushisms <- ds4psy::Bushisms 
# length(Bushisms)  # 22

# (3) from stringr: ------ 

words <- stringr::words
# length(words)  # 980
sentences <- stringr::sentences
# length(sentences)  # 720

9.4.2 Essential stringr commands

Viewing pattern matches

The stringr function str_view(s, p) is convenient for showing and highlighting the first match of a pattern p in a string s:

str_view(string = s, pattern = "at")

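Its variant str_view_all(s, p) shows all matches of a pattern p in s (note that as of stringr 1.5.0, str_view() itself shows all matches and str_view_all() is deprecated):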

str_view_all(string = s, pattern = "at")

Importantly, the pattern argument allows for different interpretations, which are accessible via so-called modifiers. The default interpretation is pattern = regex(p), which interprets p as a regular expression:

# regex: A 3-letter word:
p <- "\\b[:alpha:]{3}\\b"

# str_view_all(s, p)  
# is the same as:
str_view_all(s, regex(p))

If the "\\b[:alpha:]{3}\\b" looks like symbol salad to you, see Appendix E for using regular expressions.

Besides the default regex(), the other modifiers of pattern are coll(), fixed(), and boundary(). The boundary() modifier is useful for detecting various text-related boundaries:

str_view_all(s, boundary("word"))

str_view_all(s, boundary("sentence"))

If case is to be ignored, setting ignore_case = TRUE inside the modifier is indicated:

str_view_all("Abra CAD abrA", "A")

str_view_all("Abra CAD abrA", regex("A", ignore_case = TRUE))
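The choice of modifier matters most when a pattern contains regex metacharacters. The following minimal sketch contrasts the default regex() interpretation with fixed(). The string x is a made-up example (an assumption for illustration, not part of this chapter's data):

```r
library(stringr)

# Hypothetical example string (assumed for illustration):
x <- "1.5 + 2.5"

# As a regular expression, "." matches any character:
n_regex <- str_count(x, ".")         # counts all 9 characters

# As a fixed string, "." only matches a literal dot:
n_fixed <- str_count(x, fixed("."))  # counts the 2 dots

c(regex = n_regex, fixed = n_fixed)
```

Wrapping a pattern in fixed() is therefore a simple way to search for literal text without escaping special symbols.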

As s gets large, setting match = TRUE or match = FALSE allows selectively showing matching or non-matching strings, respectively:

str_view_all(s, "hat", match = TRUE)

str_view_all(s, "hat", match = FALSE)

The str_view() commands have no direct equivalent in base R. The closest similar command is grep() with setting value = TRUE. Both function families share the pattern argument, but the argument x of grep() corresponds to the string argument in stringr functions. Importantly, the order of arguments is reversed, which matters whenever we get lazy and omit argument names.⁵⁵ Hence, the following function calls — note their reversal of arguments — both yield the element of s with a positive match:

p <- "hat"

grep(p, s, value = TRUE)
#> [1] "The mad hatter had heard her, so what?"
str_view_all(s, p, match = TRUE)

but only str_view_all() highlights these matches, which also reveals that there are two matches.

To complicate matters, Table 9.3 (above) lists str_subset() as the direct equivalent of grep(p, s, value = TRUE):

str_subset(s, p)
#> [1] "The mad hatter had heard her, so what?"

Essentially, the str_view() family of functions are convenient tools, but also complex hybrids that include several other commands and tasks. The task of viewing matches of a pattern p in a string s combines several steps to perform a seemingly simple task:

str_view_all(s, p): View all occurrences of the pattern p in string s.

If we were to program this function, we would realize that viewing all pattern matches is far from simple, but requires a combination of several simpler tasks:

detecting strings matching a pattern,
locating matching patterns within the string, and
obtaining strings that match a pattern, or
extracting matching patterns (e.g., for highlighting them).

As we will see next, these tasks all contain an identical step (i.e., matching patterns in strings of text), but differ in the outputs they produce.

Detecting pattern matches

The task of detecting matches of a pattern p in a string s is performed by str_detect() and answers the question:

str_detect(s, p): Does pattern p occur in string s?

The answer to this question is a vector of logical values:

# Task: Detect matches of a pattern in string s:
str_detect(s, "hat")
#> [1] FALSE  TRUE FALSE

Obtaining a logical vector as the result of detecting patterns may seem like a limitation. However, pattern detection combined with regular expressions and logical indexing (and R’s convention of treating TRUE as 1 and FALSE as 0 when applying arithmetic functions to logical vectors) can answer quite sophisticated questions:

# How many fruits contain a letter twice in a row?
sum(str_detect(fruits, "(.)\\1"))
#> [1] 46

# What proportion of words start or end on a vowel?
mean(str_detect(words, "^[aeiou]"))
#> [1] 0.1785714
mean(str_detect(words, "[aeiou]$"))
#> [1] 0.2765306

The base R equivalent to str_detect(s, p) is grepl(p, s) (note the reversal of arguments):

grepl("hat", x = s)
#> [1] FALSE  TRUE FALSE

Locating pattern matches

The task of locating pattern matches is similar to detecting matches. But rather than asking if a pattern p occurs in a string s, the question answered by str_locate(s, p) is:

str_locate(s, p): Where in string s does a pattern p occur?

For example, let’s reconsider our example from above:

Where in string s does the character sequence "hat" occur?

We see that “hat” occurs twice in the second element of the character vector s.

If we were only interested in the location of our first pattern match, the str_locate() command provides an answer:

# Task: Locate the first match of a pattern in string s:
str_locate(s, "hat")
#>      start end
#> [1,]    NA  NA
#> [2,]     9  11
#> [3,]    NA  NA

Note that the result of str_locate(s, p) is a matrix that contains the integer values of the start and end positions (as columns) of the first match in each string (or line) of s (in separate rows). As this matrix would no longer suffice if we allow for multiple matches of p in each string of s, the output of str_locate_all() becomes a list of values:

# Task: Locate all matches of a pattern in string s:
str_locate_all(s, "hat")
#> [[1]]
#>      start end
#> 
#> [[2]]
#>      start end
#> [1,]     9  11
#> [2,]    35  37
#> 
#> [[3]]
#>      start end

Each list element reports the start and end positions of all matches in the corresponding element/row of the string s. Using their values requires some skill in processing lists and matrices (see Sections 1.5.1 and 1.6.3):

# Task: Locate all matches and process a list element:
l_hat <- str_locate_all(s, "hat")  # as a list

mx_e2 <- l_hat[[2]]  # 2nd list element
is.matrix(mx_e2)     # a matrix
mx_e2[ , 1]          # 1st column: start positions of 2 matches
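As a small sketch of what such list processing can look like, the following code counts the matches per string and uses a location matrix to cut the matches out of a string again. It relies on str_sub() accepting a two-column matrix of start and end positions. The object names n_hat and hats are assumptions chosen for this illustration:

```r
library(stringr)

s <- c("The cat sat on the mat.", 
       "The mad hatter had heard her, so what?", 
       "The fat dad was so sad.")

l_hat <- str_locate_all(s, "hat")  # list of start/end matrices

# Number of matches per element of s (rows of each matrix):
n_hat <- sapply(l_hat, nrow)
n_hat

# Use the matrix of the 2nd element to extract its matches from s[2]:
hats <- str_sub(s[2], l_hat[[2]])
hats
```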

The need for dealing with different output types is preserved when using base R commands for locating pattern matches:

grep(p, s) returns the integer position(s) of all matching strings (in character vector s)
regexpr(p, s) returns an integer vector with additional attributes
gregexpr(p, s) returns a list of the same length as s

grep("hat", x = s)         # integer position (in vector)
#> [1] 2
regexpr("hat", text = s)   # integer vector with attributes
#> [1] -1  9 -1
#> attr(,"match.length")
#> [1] -1  3 -1
#> attr(,"index.type")
#> [1] "chars"
#> attr(,"useBytes")
#> [1] TRUE
gregexpr("hat", text = s)  # list of integer vectors 
#> [[1]]
#> [1] -1
#> attr(,"match.length")
#> [1] -1
#> attr(,"index.type")
#> [1] "chars"
#> attr(,"useBytes")
#> [1] TRUE
#> 
#> [[2]]
#> [1]  9 35
#> attr(,"match.length")
#> [1] 3 3
#> attr(,"index.type")
#> [1] "chars"
#> attr(,"useBytes")
#> [1] TRUE
#> 
#> [[3]]
#> [1] -1
#> attr(,"match.length")
#> [1] -1
#> attr(,"index.type")
#> [1] "chars"
#> attr(,"useBytes")
#> [1] TRUE

If you find the latter two outputs confusing, you are in good company. Extracting their various attributes typically requires using the base R function attr(), plus remembering that the n-th element of a list l is l[[n]]. But note that all the information you may want about the location of matches is provided.
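To illustrate the use of attr(), here is a minimal sketch that converts the output of regexpr() into a matrix of start and end positions (resembling the output of str_locate()). The variable names are assumptions made for this example:

```r
s <- c("The cat sat on the mat.", 
       "The mad hatter had heard her, so what?", 
       "The fat dad was so sad.")

m <- regexpr("hat", text = s)

starts <- as.integer(m)              # start positions (-1: no match)
len    <- attr(m, "match.length")    # lengths of matches (-1: no match)

starts[starts == -1] <- NA           # mark non-matches as missing
ends <- starts + len - 1             # end positions (NA for non-matches)

loc <- cbind(start = starts, end = ends)
loc
```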

Obtaining strings that match patterns

Obtaining all strings in a character vector s that match a pattern p can be achieved by str_subset(s, p), answering the question:

str_subset(s, p): Which elements of a string s match a pattern p?

# Task: Obtain strings in s that match a pattern:
str_subset(s, "hat")
#> [1] "The mad hatter had heard her, so what?"

The base R equivalent to str_subset(s, p) is grep(p, s, value = TRUE) (note the reversal of arguments):

grep("hat", s, value = TRUE)
#> [1] "The mad hatter had heard her, so what?"

As we have seen above, obtaining matching strings can also be achieved by first detecting matching strings and then using logical indexing:

# Obtain strings in s that match a pattern 
# (by detecting matching strings and logical indexing):
s[str_detect(s, "hat")]
#> [1] "The mad hatter had heard her, so what?"

But since obtaining all strings that match a pattern is a common task, the str_subset() function is a welcome shortcut for the two-step operation of detecting and subsetting.

Using base R variants of grep(), we could use either logical or numerical indexing to obtain matching strings:

# (with logical indexing):
s[grepl("hat", x = s)]
#> [1] "The mad hatter had heard her, so what?"

# (with numerical indexing):
s[grep("hat", s)]
#> [1] "The mad hatter had heard her, so what?"

Counting pattern matches

A task closely related to detecting, locating, and obtaining strings that match a pattern is counting the number of occurrences of a pattern p in a string s:

str_count(s, p): How often does pattern p occur in string s?

The str_count() function is a key tool for quantifying pattern matches. Like most functions for advanced text-manipulation, the function is quite mundane when used with highly specific patterns:

str_count(s, "a")
#> [1] 3 5 4
str_count(s, "at")
#> [1] 3 2 1
str_count(s, "hat")
#> [1] 0 2 0

but becomes powerful when combined with regular expressions and other functions:

# Proportion of words beginning with a vowel:
sum(str_count(words, "^[aeiou]"))/length(words)
#> [1] 0.1785714

# Proportion of words ending on a vowel:
sum(str_count(words, "[aeiou]$"))/length(words)
#> [1] 0.2765306

# Mean proportion of vowels per word in words:
mean(str_count(words, "[aeiou]")/nchar(words))
#> [1] 0.3798123

# Words in words containing more than 70% vowels:
words[str_count(words, "[aeiou]")/nchar(words) > .70]
#> [1] "a"    "area" "idea"

Note that the str_count(s, p) function counts all occurrences of a pattern p in strings s, not just the first occurrence in each element of s. As it has no direct equivalent in base R, the str_count() command is one of the main reasons for using the stringr package.
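Although base R lacks a dedicated counting function, counts can be obtained indirectly by combining functions that locate and extract matches. The following is only a sketch: count_matches() is a hypothetical helper (not part of any package), and base R uses a different regex engine than stringr, so results may differ for more complex patterns:

```r
s <- c("The cat sat on the mat.", 
       "The mad hatter had heard her, so what?", 
       "The fat dad was so sad.")

# Hypothetical helper: extract all matches per string and count them
count_matches <- function(s, p) {
  lengths(regmatches(s, gregexpr(p, s)))
}

n_at  <- count_matches(s, "at")
n_hat <- count_matches(s, "hat")

n_at
n_hat
```

For the simple patterns used here, the results agree with those of str_count() above.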

Extracting pattern matches

After detecting, locating, or counting pattern matches, a logical next step is extracting those matches. Extracting matches to a pattern p in a character vector s can be achieved by str_extract(s, p) and answers the question:

str_extract(s, p): Which character sequences in string s match a pattern p?

As with other stringr functions, the str_extract() function comes in two varieties: The function str_extract(s, p) will only extract the first match of p in each element of s:

# Task: Extract the first match of a pattern from string s:
str_extract(s, "hat")
#> [1] NA    "hat" NA

whereas the function str_extract_all(s, p) will extract all matches of p in each element of s:

# Task: Extract all matches of a pattern from string s:
str_extract_all(s, "hat")
#> [[1]]
#> character(0)
#> 
#> [[2]]
#> [1] "hat" "hat"
#> 
#> [[3]]
#> character(0)

As we have seen before, the price of the more complete result of str_extract_all(s, p) is a more complex output format: The resulting list contains a vector of matches for each element of the matched string s.

These examples also show that extracting matches to highly specific patterns is pretty pointless. After all, the only character sequence that can match the pattern “hat” is “hat,” and extracting matches will not change that. However, extracting pattern matches can yield insights and surprises when combining it with the flexibility of regular expressions (see Appendix E). For instance, the following two commands both use this regex functionality and extract

all words that end on “at,” and
all words containing exactly three characters:

# Task: Extract all words ending on "at" from string s:
str_extract_all(s, "\\b[:alpha:]+at\\b")

# Task: Extract all 3-letter words from string s:
str_extract_all(s, "\\b[:alpha:]{3}\\b")

As our collections of strings s get larger and our regular expressions grow in complexity, we can extract matching patterns for asking and answering genuine questions:

has_animal <- paste(c("cat", "dog", "bird", "elephant", "fish", "fox", 
                      "mouse", "sheep", "trout", "zebra"), collapse = "|")

# Which of these animals occur in sentences? How often?
m <- str_extract_all(sentences, has_animal, simplify = TRUE)
v <- as.vector(m)  # matrix as vector
v <- v[v != ""]    # remove instances of ""
table(v)

# Obtain corresponding sentences:
str_subset(sentences, has_animal)

The base R equivalent to str_extract_all(s, p) is ugly, but straightforward. It first requires locating all pattern matches (with gregexpr(p, s), see above) and then providing the results to a function regmatches(x, m) (note the change in argument names):

regmatches(x = s, m = gregexpr("hat", text = s))
#> [[1]]
#> character(0)
#> 
#> [[2]]
#> [1] "hat" "hat"
#> 
#> [[3]]
#> character(0)
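When only the first match per string is needed, regexpr() can be used instead of gregexpr(). A caveat is that regmatches() then drops all elements without a match, so that the result can be shorter than s. The following sketch (with assumed variable names) fills an NA vector to mimic the output of str_extract():

```r
s <- c("The cat sat on the mat.", 
       "The mad hatter had heard her, so what?", 
       "The fat dad was so sad.")

m <- regexpr("hat", text = s)

# Only elements with a match are returned:
found <- regmatches(x = s, m = m)
found

# Filling a vector of NA values preserves the length of s:
has_match <- as.integer(m) != -1
first <- rep(NA_character_, length(s))
first[has_match] <- found
first
```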

Replacing pattern matches

Replacing matches to a pattern p in a character vector s can be achieved by str_replace(s, p, r):

str_replace(s, p, r): Replace character sequences in string s that match pattern p by r

The pair str_replace(s, p, r) and str_replace_all(s, p, r) follow the familiar pattern of replacing either the first or all occurrences of p by r:

str_replace(s[1], "a", "A")
#> [1] "The cAt sat on the mat."
str_replace_all(s[1], "a", "A")
#> [1] "The cAt sAt on the mAt."

Again, the power of replacing patterns is enhanced by using regular expressions:

str_replace_all(s[1], "[a-o]", "_")
#> [1] "T__ __t s_t __ t__ __t."

Note that the replacement can have a different length than the matched pattern or consist of a pattern:

# different lengths:
str_replace_all(s[1], "[aeiou]", "xxx")
#> [1] "Thxxx cxxxt sxxxt xxxn thxxx mxxxt."
str_replace_all(s[1], "[aeiou]", "")
#> [1] "Th ct st n th mt."

# with patterns:
str_replace_all(s[1], "[cmt]", toupper)          # capitalize matches
#> [1] "The CaT saT on The MaT."
str_replace_all(s[1], "[a]", NA_character_)      # replaces entire string with NA
#> [1] NA
str_replace_all(s[1], "([aeiou])", "\\1\\1\\1")  # triple all vowels
#> [1] "Theee caaat saaat ooon theee maaat."

Note that the str_replace() functions are vectorized over string, pattern, and replacement:

str_replace_all(s, "[aeiou]", c("1", "2", "3"))
#> [1] "Th1 c1t s1t 1n th1 m1t."               
#> [2] "Th2 m2d h2tt2r h2d h22rd h2r, s2 wh2t?"
#> [3] "Th3 f3t d3d w3s s3 s3d."
str_replace_all(s, c("a", "e", "d"), "-")
#> [1] "The c-t s-t on the m-t."               
#> [2] "Th- mad hatt-r had h-ard h-r, so what?"
#> [3] "The fat -a- was so sa-."
str_replace_all(s, c("a", "e", "d"), c("A", "3", "D"))
#> [1] "The cAt sAt on the mAt."               
#> [2] "Th3 mad hatt3r had h3ard h3r, so what?"
#> [3] "The fat DaD was so saD."

Performing multiple replacements in a single step is possible by using str_replace_all() and providing a named character vector to the pattern argument:

# Named vectors:
changes <- c("a" = "A", "e" = "3", "i" = "1", "o" = "0", "u" = "U")
numbers <- c("2" = "two", "3" = "three", "4" = "many")

# Replacing multiple matches at once:
str_replace_all(s, pattern = changes)
#> [1] "Th3 cAt sAt 0n th3 mAt."               
#> [2] "Th3 mAd hAtt3r hAd h3Ard h3r, s0 whAt?"
#> [3] "Th3 fAt dAd wAs s0 sAd."
str_replace_all(sentences[3], changes)
#> [1] "It's 3Asy t0 t3ll th3 d3pth 0f A w3ll."
str_replace_all(paste0(2:4, " little piggies"), numbers)
#> [1] "two little piggies"   "three little piggies" "many little piggies"
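A word of caution: The pairs of a named vector are applied one after another, so that a later pair can alter the result of an earlier one. The following sketch uses a made-up sentence x and made-up replacement pairs (both are assumptions for illustration) to contrast this sequential behavior with a single pass that uses a replacement function:

```r
library(stringr)

# Hypothetical example (assumed for illustration):
x <- "I like your hat and you like my cat."
swap <- c("I" = "you", "my" = "your", "your" = "my", "you" = "I")

# (a) Named vector: pairs are applied in sequence, 
#     so later pairs undo earlier ones:
seq_res <- str_replace_all(x, swap)
seq_res

# (b) Single pass: match any of the terms (as whole words) 
#     and look up the replacement of each match:
p_swap <- "\\b(I|my|your|you)\\b"
one_pass <- str_replace_all(x, p_swap, function(m) unname(swap[m]))
one_pass
```

Whenever the replacements of some pairs are matched by the patterns of others, the single-pass variant is the safer choice.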

The closest base R equivalent to str_replace_all() is gsub(), which also contains a replacement argument (its sibling sub() replaces only the first match, like str_replace()). This function also allows for regular expressions, but lacks the vector-based bells and whistles of the stringr functions:

gsub(pattern = "a", replacement = "A", x = s[1])
#> [1] "The cAt sAt on the mAt."
gsub(pattern = "[aeuoi]", replacement = "_", x = s[3])
#> [1] "Th_ f_t d_d w_s s_ s_d."

# Substitute any letter b or B by 2 x "be" or "Be":
gsub("([bB])", "\\1e\\1e", x = "To b or not to B.")
#> [1] "To bebe or not to BeBe."
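Two details of the base R functions can be sketched briefly: sub() only replaces the first match, whereas gsub() replaces all matches, and a vectorization over patterns can be emulated by mapply(). The variable names below are assumptions for this illustration:

```r
s <- c("The cat sat on the mat.", 
       "The mad hatter had heard her, so what?", 
       "The fat dad was so sad.")

# First match vs. all matches:
first_only <- sub(pattern = "a", replacement = "A", x = s[1])
all_match  <- gsub(pattern = "a", replacement = "A", x = s[1])

# One pattern per string (emulating vectorization over pattern):
by_string <- mapply(gsub, pattern = c("a", "e", "d"), replacement = "-", 
                    x = s, USE.NAMES = FALSE)
by_string
```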

9.4.3 Additional stringr commands

The stringr package (Wickham, 2019b) contains many additional commands that facilitate working with strings (some of which were mentioned in Section 9.3.4). Here are some examples:

The str_length() function is stringr’s equivalent of nchar(). A substantial portion of text is white space between words, lines, or paragraphs. Consequently, managing white space in text is the purpose of many specialized functions.

To examine these functions, let’s first select some fruits with very short and very long names:

fruits <- ds4psy::fruits  # Data:
fs <- fruits[nchar(fruits) <  4]  # very short names
fn <- fruits[!str_detect(fruits, "[\\(]")]  # no parentheses
fl <- fn[nchar(fn) > 15]  # very long names
f2 <- c(fs, fl)           # very short and very long names
f2
#> [1] "Fig"                    "Pea"                    "Grewia asiatica phalsa"
#> [4] "Monstera Delisiousa"    "Purple mangosteen"

str_length(f2)  # string lengths
#> [1]  3  3 22 19 17

The pair of functions str_pad() and str_trim() helps managing string lengths by adding or removing leading or trailing spaces to or from strings:

# Add padding (to short strings only):
fp <- str_pad(f2, width = 10, side = "both")
fp
#> [1] "   Fig    "             "   Pea    "             "Grewia asiatica phalsa"
#> [4] "Monstera Delisiousa"    "Purple mangosteen"

# Trim padding:
str_trim(fp)
#> [1] "Fig"                    "Pea"                    "Grewia asiatica phalsa"
#> [4] "Monstera Delisiousa"    "Purple mangosteen"

Similarly, the str_squish() function trims leading and trailing spaces, but also reduces repeated whitespace inside a string to single spaces:

# Trim and remove repeated white space:
str_squish("  The is  messy  a  sentence.  A second phrase...  ")
#> [1] "The is messy a sentence. A second phrase..."

For removing leading or trailing whitespace from character strings, see also the base R function trimws():

gender <- c("male", "female", "other", "male ", " female", " other ") 

unique(gender)
#> [1] "male"    "female"  "other"   "male "   " female" " other "
unique(trimws(gender))
#> [1] "male"   "female" "other"
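Combining trimws() with gsub() yields a rough base R counterpart to str_squish(). This is only a sketch under the assumption that all whitespace is matched by the regex "\\s":

```r
x <- "  The is  messy  a  sentence.  A second phrase...  "

# Trim both ends, then collapse runs of whitespace into single spaces:
squished <- gsub(pattern = "\\s+", replacement = " ", x = trimws(x))
squished
```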

The str_trunc() function truncates long strings so that no string in s exceeds a given width (marking truncated strings by an ellipsis):

# Truncate strings (to a maximum length):
str_trunc(f2, width = 15, side = "right")
#> [1] "Fig"             "Pea"             "Grewia asiat..." "Monstera Del..."
#> [5] "Purple mango..."

The str_wrap() function helps formatting text into well-behaved paragraphs:

pg <- str_wrap(sentences[1:3], width = 20, indent = 5)

writeLines(pg)
#>      The birch canoe
#> slid on the smooth
#> planks.
#>      Glue the sheet
#> to the dark blue
#> background.
#>      It's easy to
#> tell the depth of a
#> well.

The str_starts() and str_ends() helpers are shortcuts for str_detect() with regex anchors (see Section E.2.4 in Appendix E):

f3 <- c("Apple", "Banana", "Coconut")

str_starts(f3, "C")
#> [1] FALSE FALSE  TRUE
str_starts(f3, "C", negate = TRUE)
#> [1]  TRUE  TRUE FALSE

str_ends(f3, "a")
#> [1] FALSE  TRUE FALSE
str_ends(f3, "a", negate = TRUE)
#> [1]  TRUE FALSE  TRUE
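The shortcut character of these helpers can be verified by comparing them to str_detect() with explicit anchors. The sketch also includes the base R functions startsWith() and endsWith(), which only match fixed strings (rather than regular expressions):

```r
library(stringr)

f3 <- c("Apple", "Banana", "Coconut")

# stringr helper vs. anchored regex vs. base R:
start_1 <- str_starts(f3, "C")
start_2 <- str_detect(f3, "^C")
start_3 <- startsWith(f3, "C")

end_1 <- str_ends(f3, "a")
end_2 <- str_detect(f3, "a$")
end_3 <- endsWith(f3, "a")

all.equal(start_1, start_2)
all.equal(end_1, end_2)
```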

The str_flatten() function is similar to paste() with its collapse argument not being NULL, but may be easier to remember:

str_flatten(letters)
#> [1] "abcdefghijklmnopqrstuvwxyz"
str_flatten(letters, collapse = "_")
#> [1] "a_b_c_d_e_f_g_h_i_j_k_l_m_n_o_p_q_r_s_t_u_v_w_x_y_z"

# Contrast with paste(): 
paste(letters, collapse = "")
#> [1] "abcdefghijklmnopqrstuvwxyz"
paste(letters, collapse = "_")
#> [1] "a_b_c_d_e_f_g_h_i_j_k_l_m_n_o_p_q_r_s_t_u_v_w_x_y_z"

The str_match() function is an extension of str_extract() that also obtains the parts of matches:

When using more complex regular expressions for matching patterns, it can be useful to extract not only complete matches, but also their components. This is the job of the str_match() and str_match_all() functions: str_match() returns a matrix that contains the complete match (in its first column) and additional columns for each capture group, whereas str_match_all() returns a list of such matrices. For instance, the following code extracts the first instance of an article (“a” or “the”) followed by a space and a sequence of letters in each sentence:

# regex: An article + space + character sequence:
article_word <- "(a|the) ([:alpha:]+)"

# Extract entire match vs. its parts: 
head(str_extract(sentences, article_word))
#> [1] "the smooth" "the sheet"  "the depth"  "a chicken"  NA          
#> [6] NA
head(str_match(sentences, article_word))
#>      [,1]         [,2]  [,3]     
#> [1,] "the smooth" "the" "smooth" 
#> [2,] "the sheet"  "the" "sheet"  
#> [3,] "the depth"  "the" "depth"  
#> [4,] "a chicken"  "a"   "chicken"
#> [5,] NA           NA    NA       
#> [6,] NA           NA    NA

## Getting "_all" requires lists of lists:
# str_extract_all(sentences, article_word)
# str_match_all(sentences, article_word)
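To get a feeling for the output of str_match_all(), the following sketch applies the same regular expression to two made-up strings (the vector x is an assumption for illustration). Each list element is a matrix with one row per match, so that strings without a match yield a matrix with zero rows:

```r
library(stringr)

# Hypothetical data (assumed for illustration):
x <- c("a cat and the dog", "no article here")
article_word <- "(a|the) ([:alpha:]+)"

m_all <- str_match_all(x, article_word)  # list of matrices

n_rows <- sapply(m_all, nrow)  # number of matches per string
n_rows

nouns <- m_all[[1]][ , 3]      # 2nd capture group of 1st string
nouns
```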

See the RStudio cheatsheet on stringr and the package’s documentation for additional commands.

Practice

Here are some practice tasks that allow us to test our knowledge and skills regarding stringr functions.

Rethinking functions:

The %in% operator finds elements of vectors.

Can we also use %in% to find characters in strings?
Under which conditions is sum(str_detect(s, p)) equal to sum(str_count(s, p))?

Revisit the %in% operator (from Section 1.4) that checks whether an element is found in a vector:

5 %in% 1:10
#> [1] TRUE
"X" %in% LETTERS
#> [1] TRUE

Can we also use %in% to find characters in strings?

# Data: 
st <- "I am a simple demo sentence."

# as vector:
vt <- unlist(strsplit(st, split = " "))
vt <- unlist(str_extract_all(st, boundary("word")))

# (a) Finds entire elements of vectors:
"simple" %in% vt

# (b) BUT not character sequences WITHIN strings: 
"simple" %in% st

# (c) Thus, we need to use regex:
grepl("simple", st)
str_view_all(st[1], pattern = "simple")

Under which conditions is sum(str_detect(s, p)) equal to sum(str_count(s, p))?

# Data:
s  # (from above):

# Different outputs:
str_detect(s, p = "a")  # either FALSE/0 or TRUE/1 for each element
str_count(s, p = "a")   # counts all occurrences per element

# When are the sums of outputs equal?
# When a pattern p occurs 0 or 1 times in every element of s:
dv <- str_detect(s, p = "T")  # all TRUE
cv <- str_count(s, p = "T")   # 1 1 1 
sum(dv) == sum(cv)

dv <- str_detect(s, p = "n")  # TRUE FALSE FALSE
cv <- str_count(s, p = "n")   # 1 0 0
sum(dv) == sum(cv)

Finding the right words:

How many words in words contain the letter sequence “age”? Which ones?
Find all words in words that end on “ing”.
Find all words in words that contain “ing”, but do not end on it.
Find all words in words that contain 10 or more letters.
Find all words in words consisting only of vowels (non-consonants, i.e., the set aeiou).
Find all words in words consisting only of consonants (i.e., without any vowels).
Find all words in words that begin and end with the same letter.

The tasks can be solved with stringr functions and regular expressions (see Appendix E). However, also consider simpler solutions (e.g., involving base R functions or logical indexing).

# Data:
words <- stringr::words
length(words)  # 980

How many words in words contain the letter sequence “age”? Which ones?

sum(str_detect(words, "age"))
sum(str_count(words, "age"))

str_subset(words, "age")

Find all words in words that end on “ing”:

# regex with anchor:
p <- "ing$"
str_subset(words, p)

# alternatives: 
words[str_detect(words, p)]
words[str_ends(words, "ing")]

Find all words in words that contain “ing,” but do not end on it:

# regex with wildcard
p <- "ing."
str_subset(words, p)

# in 2 steps:
ing_w <- words[str_detect(words, "ing")]  # words with "ing"
ing_w[str_ends(ing_w, "[^g]")]            # not ending on "g"

Find all words in words that contain 10 or more letters:

# regex with wildcard and repetition:
p <- "^.{10,}$"      # with anchors
p <- "\\b.{10,}\\b"  # with word boundaries

str_subset(words, p)

Note the much simpler solutions based on measuring the length of words:

words[str_length(words) >= 10]
words[nchar(words) >= 10]

Find all words in words consisting only of vowels (non-consonants, i.e., the set aeiou):

# regex with anchors:
p <- "^[aeiou]+$"

str_subset(words, p)

Note that there are many alternative solutions:

# detect and logical indexing:
only_vowel <- str_detect(words, p)
words[only_vowel]

# base R solutions:
grep(pattern = p, x = words, value = TRUE)
words[grepl(pattern = p, x = words)]

Find all words in words consisting only of consonants (i.e., without any vowels):

# negated regex with anchors:
p <- "^[^aeiou]+$"

str_subset(words, p)

Again, there are many alternative solutions:

# detect and logical indexing:
no_vowel_2 <- str_detect(words, p)
words[no_vowel_2]

# detect opposite and use logical indexing:
no_vowel <- !str_detect(words, "[aeiou]")
words[no_vowel]

# base R solutions:
grep(pattern = p, x = words, value = TRUE)
words[grepl(pattern = p, x = words)]
words[!grepl(pattern = "[aeiou]", x = words)]

Find all words in words that begin and end with the same letter:

# regex with backreference:
p <- "^(.).+\\1$"

str_subset(words, p)

Note that the last solution missed the 1-letter word “a” and would also miss “AA”.

Why — and how can we fix it?

# The regex with backreference:
p <- "^(.).+\\1$"
# requires a character that is repeated (in 1st and last position) 
# plus an intermediate character (.+): 
str_subset(c("a", "AA", "SOS"), p)

p <- "^(.).*\\1$"  # would render the intermediate character optional:
str_subset(c("a", "AA", "SOS"), p)

# A solution (adding an alternative for 1-character strings):
p <- "^(.).*\\1$|^.$"
str_subset(c("a", "AA", "SOS"), p)

# Alternative solution (the group of the 2nd alternative is \\2):
p <- "^(.).+\\1$|^(.)\\2$|^.$"
str_subset(c("a", "AA", "SOS"), p)
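In the spirit of considering simpler solutions, this task can also be addressed without regular expressions by comparing the first and last character of each string. The function same_ends() is a hypothetical helper written for this sketch, and the test vector x is made up:

```r
# Hypothetical helper: Do first and last characters match?
same_ends <- function(x) {
  substr(x, 1, 1) == substr(x, nchar(x), nchar(x))
}

# Hypothetical test data (including a non-matching string):
x <- c("a", "AA", "SOS", "ab")

x[same_ends(x)]
```

Applied to our data, words[same_ends(words)] would provide a cross-check for the regex-based solutions.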

Quantifying fruits:

Measure some aspects of fruits (from the ds4psy package, but all in lowercase letters):

# Data:
fruits <- tolower(ds4psy::fruits)

Use the vector of fruits to answer the following questions:

Are there more fruits containing the letter “a” or the letter “e”?
Are there more fruits containing the letter sequence “ana” or “po”?

# fruits with "a" vs. "e": 
sum(str_detect(fruits, pattern = "a"))
sum(str_detect(fruits, pattern = "e"))

str_subset(fruits, pattern = "a")
str_subset(fruits, pattern = "e")

# fruits with "ana" vs. "po":
sum(str_detect(fruits, "ana"))
str_view(fruits, "ana", match = TRUE)
sum(str_detect(fruits, "po"))
str_view(fruits, "po", match = TRUE)

How many and which fruits contain one (or more) of the letters “x”, “y”, or “z”?
How many and which fruits that are not berries contain one (or more) of the letters “x”, “y”, or “z”?

# letter x, y, or z:
sum(str_detect(fruits, pattern = "[xyz]"))
str_view(fruits, pattern = "[xyz]", match = TRUE)

# excluding berries:
no_berry <- str_subset(fruits, pattern = "berry", negate = TRUE)
sum(str_detect(no_berry, pattern = "[xyz]"))
str_view(no_berry, pattern = "[xyz]", match = TRUE)

Create a tibble ft that contains all names of fruits (in lowercase letters) as a column named name.
Add a column len that contains the length of each fruit’s name.
Add a column n_vow that counts the number of vowels (defined as one of “aeiou”) in each fruit’s name.
Add a column n_con that counts the number of consonants (defined as non-vowels) in each fruit’s name.
Verify that len equals the sum of n_vow and n_con for all fruits in ft.
Which names of fruits contain more than 50% of vowels?

Hint: An example in 14.4 Tools (Wickham & Grolemund, 2017) solves most of this task for the character vector of words.

library(dplyr)  # for %>% and mutate()

# Create tibble: 
ft <- tibble::tibble(
  nr = seq_along(fruits),
  name = tolower(fruits) 
)

# Adding some (count) variables:
ft <- ft %>% 
  mutate(len = nchar(name),
         n_vow = str_count(name, "[aeiou]"),
         n_con = str_count(name, "[^aeiou]"))

# Verify:
all.equal(ft$len, (ft$n_vow + ft$n_con))
#> [1] TRUE

knitr::kable(head(ft), caption = "The `head()` of `ft`.")

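Table 9.1: The `head()` of `ft`.

| nr | name | len | n_vow | n_con |
|---:|:-----|----:|------:|------:|
| 1 | acai | 4 | 3 | 1 |
| 2 | ackee | 5 | 3 | 2 |
| 3 | apple | 5 | 2 | 3 |
| 4 | apricot | 7 | 3 | 4 |
| 5 | avocado | 7 | 4 | 3 |
| 6 | banana | 6 | 3 | 3 |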

Using ft to answer:

Which names of fruits consist of more than 50% vowels?

ft %>%
  mutate(p_vowels = n_vow/len) %>%
  filter(p_vowels > .50)
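Note that the pattern `"[^aeiou]"` counts every character that is not a lowercase vowel, which includes spaces, hyphens, and digits. This is why `len` always equals the sum of `n_vow` and `n_con`. The following sketch uses a hypothetical name containing a space to contrast non-vowels with consonant letters. The explicit letter ranges are one possible way of defining consonants and treat “y” as a consonant:

```r
library(stringr)

# Hypothetical name containing a space (an assumption for illustration):
name <- "star fruit"

len   <- nchar(name)
n_vow <- str_count(name, "[aeiou]")            # lowercase vowels
n_non <- str_count(name, "[^aeiou]")           # any non-vowel character (incl. the space)
n_con <- str_count(name, "[b-df-hj-np-tv-z]")  # lowercase consonant letters only

c(len = len, n_vow = n_vow, n_non = n_non, n_con = n_con)
```

If names can contain non-letter characters, the stricter definition changes the proportion of vowels only when it is computed relative to the number of letters, rather than relative to `len`.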

Replacing characters in Trumpisms:

The vector Trumpisms (included in ds4psy) contains 168 words or short phrases frequently used by U.S. president Donald Trump. Use this vector for some character replacements:

Replace all instances of “i” by “!” and all instances of “s” by “$”.
Replace all instances of two repeated letters (e.g., “ll”) by “wall”.

Trumpisms <- ds4psy::Trumpisms  # data

# stringr:
change_T <- c("i" = "!", "s" = "$")
str_replace_all(Trumpisms, pattern = change_T)
str_replace_all(Trumpisms, "(.)\\1", "wall")

# base R:
chartr(old = "is", new = "!$", x = Trumpisms)
gsub(pattern = "(.)\\1", replacement = "wall", x = Trumpisms)
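The pattern `"(.)\\1"` matches any character that is immediately repeated, not just letters. If only repeated letters should be replaced, the capture group can be restricted to a letter class. A minimal sketch on a hypothetical vector `x` (the placeholder `"_"` is chosen to make replacements easy to see):

```r
library(stringr)

# Hypothetical toy vector (an assumption for illustration):
x <- c("wall", "too  far", "1100")

# Any repeated character (letters, spaces, digits):
any_rep <- str_replace_all(x, "(.)\\1", "_")

# Only repeated letters:
letter_rep <- str_replace_all(x, "([[:alpha:]])\\1", "_")

# The same restriction in base R:
letter_rep_base <- gsub(pattern = "([[:alpha:]])\\1", replacement = "_", x = x)

any_rep
letter_rep
letter_rep_base
```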

Replacing and translating Bushisms:

The vector Bushisms (included in ds4psy) contains 22 phrases spoken by or attributed to U.S. president George W. Bush. Well-known examples include marvels like “They misunderestimated me.” and “Rarely is the question asked: Is our children learning?”

Replace all instances of “I” by “you”, “my” by “your”, “you” by “I”, and “your” by “my”, respectively.

Bushisms <- ds4psy::Bushisms  # data

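# stringr:
# Note: A named vector of replacements is applied one pair after another,
# so that "I" -> "you" would later be undone by "you" -> "I".
# Matching all words in a single pass and looking up each match avoids this:
change_B <- c("I" = "you", "my" = "your", "My" = "Your",
              "your" = "my", "Your" = "My",
              "you" = "I", "You" = "I")
str_replace_all(Bushisms, pattern = "\\b(I|[Mm]y|[Yy]our|[Yy]ou)\\b",
                replacement = function(m) unname(change_B[m]))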
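Swapping words in both directions requires some care: when `str_replace_all()` receives a named vector, its replacements are applied one after another, so that a later pair can undo an earlier one. The following sketch shows this on a hypothetical sentence `s` and contrasts it with a single-pass approach that passes a function as `replacement` (the names `swap`, `seq_result`, and `one_pass` are chosen for illustration):

```r
library(stringr)

# Hypothetical sentence (an assumption for illustration):
s <- "I like your dog and you like my cat"

swap <- c("I" = "you", "my" = "your", "your" = "my", "you" = "I")

# (a) Named vector: pairs are applied sequentially,
#     so later pairs revert earlier replacements:
seq_result <- str_replace_all(s, swap)

# (b) Single pass: match any of the words once,
#     then look up the replacement for each match:
one_pass <- str_replace_all(s, "\\b(I|my|your|you)\\b",
                            function(m) unname(swap[m]))

seq_result
one_pass
```

The word boundaries `\\b` in (b) additionally prevent matches inside longer words (e.g., the “I” in “Is”).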

As children, we sometimes talked in what we called our secret “B-language”: Every occurrence of a vowel was followed by “b” and then repeated. The resulting sentences failed to protect our secrets from our enemies and parents, but sounded pretty funny.

Translate the set of Bushisms into B-language.

str_replace_all(Bushisms, "([aeoui])", "\\1b\\1")
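To see how the backreference `\\1` works in this replacement, it helps to try it on a short string. The following sketch wraps the translation in a hypothetical helper function `to_b()` and, as an assumption that goes beyond the solution above, also matches uppercase vowels by using `regex()` with `ignore_case = TRUE`:

```r
library(stringr)

# Hypothetical helper (name chosen for illustration):
to_b <- function(x) {
  # Each vowel is captured as group 1 and replaced by: vowel, "b", vowel
  str_replace_all(x, regex("([aeiou])", ignore_case = TRUE), "\\1b\\1")
}

b_hello <- to_b("hello")
b_is    <- to_b("Is")

b_hello
b_is
```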

Flowery phrases:

After all this political talk, we crave some more decorative and charming phrases. Fortunately, the vector flowery (included in ds4psy) contains 60 versions and variations of Gertrude Stein’s popular phrase “A rose is a rose is a rose.”

Use this vector (in lowercase letters) for answering the following questions:

How often do the word “rose” and its variations occur in flowery phrases?
How many matches can we find for words belonging to some semantic field?
What is the topic or theme of each phrase?

Note that these questions all address semantic issues, which can be tricky, subject to interpretations, and often require human judgment and heuristic approaches. But let’s see how far we get with our fairly simple tools:

How often do the word “rose” and its variations occur in flowery phrases?

flowery <- tolower(ds4psy::flowery)  # data

# frequency of "Rose" etc:
set_rose <- "rose|rosa|rosy"
sum(str_count(flowery, pattern = set_rose))

str_view_all(flowery, set_rose)

Solving the task

How many matches can we find for words belonging to some semantic field?

requires that we first look through the flowery phrases and identify semantic fields as sets of words belonging to the same category. For instance, a first set could consist of “garden,” “flower,” “friend,” “love,” and “save,” which are all words with positive connotations that are associated with roses. This set could be contrasted with sets of words that address more negative topics (e.g., “murder,” “thief,” “zombie”).

# (a) define sets:
set_rose_love <- "garden|flower|friend|love|save"
set_horror_crime <- "bitch|bullet|crime|hell|lie|loss|murder|rape|thief|zombie"
set_body_parts <- "belly|breast|gut|head|leg|nose|toe"
set_objects <- "bolder|moon|rock|stein|stone|pebble|thing"

# (b) count occurrences: 
sum(str_count(flowery, pattern = set_rose_love))
sum(str_count(flowery, pattern = set_horror_crime))
sum(str_count(flowery, pattern = set_body_parts))
sum(str_count(flowery, pattern = set_objects))

Interestingly, the answers we get depend not only on the data we analyze, but also on the precise questions we ask.
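One reason for this dependence is that a pattern like `"lie"` matches anywhere in a string, including inside longer words. The following sketch uses two hypothetical phrases (not taken from `flowery`) to contrast plain substring matches with whole-word matches that use the word boundary `\\b`:

```r
library(stringr)

# Hypothetical phrases (an assumption for illustration):
phrases <- c("a lie is a lie", "i believe in flies")

# Substring matches (also inside "believe" and "flies"):
n_sub <- str_count(phrases, "lie")

# Whole-word matches only:
n_word <- str_count(phrases, "\\blie\\b")

n_sub
n_word
```

For a set of alternatives, the boundaries need to enclose the entire group, as in `"\\b(hell|lie|loss)\\b"`.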

Finally, let’s try to figure out what the flowery phrases are about:

What is the topic or theme of each phrase?

Solving this task will require some insight or heuristic. A possible approach could ask: Which noun occurs repeatedly in a phrase? The following attempt extracts the first two words of every phrase and then chooses the longer one of them:

# Get the longer of the first two words of each phrase:

# Extract the first_two words:
first_two <- str_extract(flowery, pattern = "[:alpha:]+ [:alpha:]+ ") 
first_two <- str_trim(first_two)

# Identify the first word by regex and measure word lengths: 
tb <- tibble(first_two) %>%
  mutate(wrd_1 = str_extract(first_two, "[:alpha:]+(?= )"), 
         len_1 = nchar(wrd_1),
         len_2 = nchar(first_two) - len_1 - 1,
         first = len_1 > len_2)
# tb

# simpler: 
first_two <- str_split(first_two, " ", simplify = TRUE)
colnames(first_two) <- c("V1", "V2")  # name the columns (as_tibble() expects named matrix columns)
tb <- as_tibble(first_two) %>% 
  mutate(len_1 = nchar(V1),
         len_2 = nchar(V2),
         first = len_1 > len_2)
# tb

# Use either the first or second word:
theme_1 <- tb$V1[tb$first]  # theme in 1st word
theme_2 <- tb$V2[!tb$first] # theme in 2nd word
themes <- c(theme_1, theme_2)
themes

Note that this result is still sub-optimal. Can you find a better solution?
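One possible alternative follows the heuristic mentioned above and asks which word occurs most often in a phrase. The following sketch is not the only (or necessarily the best) solution. It uses three hypothetical phrases rather than the `flowery` data, and both the helper function `get_theme()` and the short list of `stop_words` are assumptions made for illustration:

```r
library(stringr)

# Hypothetical phrases and stop words (assumptions for illustration):
phrases <- c("a rose is a rose is a rose",
             "a stone is a stone",
             "love is love")
stop_words <- c("a", "an", "the", "is")

# Return the most frequent non-stop word of a phrase:
get_theme <- function(phrase) {
  words <- str_extract_all(phrase, "[[:alpha:]]+")[[1]]  # all words
  words <- words[!words %in% stop_words]                 # drop stop words
  if (length(words) == 0) return(NA_character_)
  counts <- table(words)
  names(counts)[which.max(counts)]  # first word with the highest count
}

themes <- vapply(phrases, get_theme, character(1), USE.NAMES = FALSE)
themes
```

In contrast to combining `theme_1` and `theme_2`, this approach keeps the themes in the order of the phrases. Ties are resolved in favor of the alphabetically first word, which may or may not be desirable.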

References

Wickham, H. (2019b). stringr: Simple, consistent wrappers for common string operations. Retrieved from https://CRAN.R-project.org/package=stringr

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz

Omitting argument names is common practice when using a function, but can be dangerous when several arguments are of the same type. As regular expressions are strings in R, inadvertently reversing the string and pattern arguments is possible, and can yield disastrous results when not noticed. A cheap insurance policy against such mistakes is to always explicate argument names, particularly when programming your own functions (see Chapter 11).
