9.4 Advanced text manipulation

In Section 9.2.4, we emphasized that base R provides a range of functions for advanced text manipulation tasks, but that using the stringr package (Wickham, 2019b), which is included in the tidyverse (Wickham, Averick, et al., 2019), is easier and more straightforward.

As both base R and stringr functions support pattern matching, this section assumes some familiarity with using regular expressions (see Appendix E).

9.4.1 Advanced tasks with text

This section is structured according to the advanced tasks identified in Section 9.2.4. The following Table 9.3 repeats the advanced tasks of our original summary table:

Table 9.3: Advanced tasks of text manipulation (involving a string s and pattern p).

| Task | Base R | stringr |
|---|---|---|
| B: Advanced tasks | | |
| View strings in s that match p | | str_view(s, p)\(^{a}\) |
| Detect pattern p in strings s | grep(p, s), grepl(p, s)\(^{1}\) | str_detect(s, p) |
| Locate pattern p in strings s | gregexpr(p, s)\(^{1}\) | str_locate(s, p)\(^{a}\) |
| Obtain strings in s that match p | grep(p, s, value = TRUE)\(^{1}\) | str_subset(s, p) |
| Extract matches of p in strings s | regmatches(s, gregexpr(p, s))\(^{1}\) | str_extract(s, p)\(^{a}\) |
| Replace matches of p by r in s | gsub(p, r, s)\(^{1}\) | str_replace(s, p, r)\(^{a}\) |
| Count matches of p in strings s | | str_count(s, p) |

Table notes

  • \(^{1}\): base R functions with additional variants that tweak their functionality (see their documentation).

  • \(^{2}\): stringr functions with additional variants that tweak their functionality.

  • \(^{a}\): stringr functions with an additional suffix _all() that applies the function to all matches (rather than just to the first match).

Why stringr?

There are three main reasons for switching to the stringr package to address these more advanced tasks:

  1. Organization: The stringr functions are named and structured in a more systematic fashion than their base R ancestors.

  2. Specialization: The stringr functions are more specialized than the corresponding base R functions. Each function is designed to accomplish one specific task well, whereas base R provides families of related functions that do many things in different ways.

  3. Functionality: The outputs of str_view() and str_count() are difficult to reproduce with base R functions. Additionally, the pattern argument of many stringr functions can be used in combination with so-called modifiers that govern the interpretation of a regular expression and allow setting additional arguments (e.g., ignore_case = TRUE, see ?stringr::regex for the documentation).

Personally, I use the base R functions nchar(), paste(), and substr() on a regular basis, appreciate the flexibility of grep() and strsplit(), but find the details of gregexpr(), gsub(), and regmatches() too confusing to remember. Regarding the stringr package, I enjoy the convenience of str_view_all() for displaying the matches of regular expressions, the fact that its uniform set of commands is named after the tasks they perform, and the invaluable functionality of str_count() for quantifying pattern matches.

9.4.2 Essential stringr commands

Viewing pattern matches

The stringr function str_view(s, p) is convenient for showing and highlighting the first match of a pattern p in a string s:

Its variant str_view_all(s, p) shows all matches of a pattern p in s (as of stringr 1.5.0, str_view() itself highlights all matches and str_view_all() is deprecated):

Importantly, the pattern argument allows for different interpretations, which are accessible via so-called modifiers. The default interpretation is pattern = regex(p), which interprets p as a regular expression:

If the "\\b[:alpha:]{3}\\b" looks like symbol salad to you, see Appendix E for using regular expressions.

Besides the default regex(), the other modifiers of pattern are coll(), fixed(), and boundary(). The boundary() modifier is useful for detecting various text-related boundaries:

To ignore case, we can set ignore_case = TRUE inside the modifier:
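As a minimal sketch, the same modifiers can be tried with str_count(), whose numeric output is easy to check in plain text. The vector s below is a hypothetical example (an assumption, not the data used in this chapter), and the character class is spelled [[:alpha:]] so that it also works in base R:

```r
library(stringr)

# Hypothetical example strings (an assumption, not the chapter's original data):
s <- c("The cat sat on the mat.",
       "The mad hatter had a hat.",
       "A rat ran past.")

# regex() is the default modifier: count words with exactly three letters
n_3 <- str_count(s, regex("\\b[[:alpha:]]{3}\\b"))
n_3      # 5 4 2

# boundary("word") counts words rather than matching a character pattern
n_words <- str_count(s, boundary("word"))
n_words  # 6 6 4

# fixed() matches literally, so "." is a period (not "any character")
n_dot <- str_count(s, fixed("."))
n_dot    # 1 1 1

# ignore_case = TRUE inside regex() also matches the capitalized "The"
n_the <- str_count(s, regex("the", ignore_case = TRUE))
n_the    # 2 1 0
```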

As s gets large, setting match = TRUE or match = FALSE allows selectively showing matching or non-matching strings, respectively:

The str_view() commands have no direct equivalent in base R. The closest equivalent is grep() with value = TRUE. Both function families share the pattern argument, but the argument x of grep() corresponds to the string argument of stringr functions. Importantly, the order of arguments is reversed, which matters whenever we get lazy and omit argument names.\(^{45}\) Hence, the following function calls (note their reversal of arguments) both yield the element of s with a positive match:

but only str_view_all() highlights these matches, which also reveals that there are two matches.

To complicate matters, Table 9.3 (above) lists str_subset() as the direct equivalent of grep(p, s, value = TRUE):
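The reversed argument order can be sketched as follows. The vector s is again a hypothetical example (an assumption), and the last call illustrates what may happen when unnamed arguments are swapped by accident:

```r
library(stringr)

# Hypothetical example strings (an assumption, not the chapter's original data):
s <- c("The cat sat on the mat.",
       "The mad hatter had a hat.",
       "A rat ran past.")
p <- "hat"

# Naming the arguments makes the reversed order explicit:
base_out    <- grep(pattern = p, x = s, value = TRUE)
stringr_out <- str_subset(string = s, pattern = p)
identical(base_out, stringr_out)  # TRUE

# Swapping unnamed arguments raises no error, but treats each
# element of s as a pattern to look for in the string "hat":
swapped <- str_detect(p, s)
swapped  # FALSE FALSE FALSE
```

Because the swapped call fails silently, naming the arguments is a cheap safeguard.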

Essentially, the str_view() family of functions are convenient tools, but also complex hybrids that include several other commands and tasks. The task of viewing matches of a pattern p in a string s combines several steps to perform a seemingly simple task:

  • str_view_all(s, p): View all occurrences of the pattern p in string s.

If we were to program this function, we would realize that viewing all pattern matches is far from simple, but requires a combination of several simpler tasks:

  • detecting strings matching a pattern,
  • locating matching patterns within the string,
  • obtaining strings that match a pattern, and
  • extracting matching patterns (e.g., for highlighting them).

As we will see next, these tasks all contain an identical step (i.e., matching patterns in strings of text), but differ in the outputs they produce.

Detecting pattern matches

The task of detecting matches of a pattern p in a string s is performed by str_detect() and answers the question:

  • str_detect(s, p): Does pattern p occur in string s?

The answer to this question is a vector of logical values:

Obtaining a logical vector as the result of detecting patterns may seem like a limitation. However, pattern detection combined with regular expressions and logical indexing (and R’s convention of treating TRUE as 1 and FALSE as 0 when applying arithmetic functions to logical vectors) can answer quite sophisticated questions:

The base R equivalent to str_detect(s, p) is grepl(p, s) (note the reversal of arguments):
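A minimal sketch of pattern detection, using a hypothetical vector s (an assumption, not the chapter's original data). It shows the logical output, two ways of doing arithmetic and indexing with it, and the base R counterpart:

```r
library(stringr)

# Hypothetical example strings (an assumption):
s <- c("The cat sat on the mat.",
       "The mad hatter had a hat.",
       "A rat ran past.")

# One logical value per element of s:
has_hat <- str_detect(s, "hat")
has_hat  # FALSE TRUE FALSE

# Arithmetic on logical values (TRUE is 1, FALSE is 0):
n_hat   <- sum(has_hat)                # number of strings containing "hat"
prop_at <- mean(str_detect(s, "at"))   # proportion of strings containing "at"

# Logical indexing with a regular expression (strings starting with "The"):
starts_the <- s[str_detect(s, "^The")]

# Base R equivalent (pattern first, then the strings):
base_hat <- grepl("hat", s)
identical(base_hat, has_hat)  # TRUE
```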

Locating pattern matches

The task of locating pattern matches is similar to detecting matches. But rather than asking if a pattern p occurs in a string s, the question answered by str_locate(s, p) is:

  • str_locate(s, p): Where in string s does a pattern p occur?

For example, let’s reconsider our example from above:

  • Where in string s does the character sequence "hat" occur?

We see that “hat” occurs twice in the second element of the character vector s.

If we were only interested in the location of our first pattern match, the str_locate() command provides an answer:

Note that the result of str_locate(s, p) is a matrix that contains the integer values of the start and end positions (as columns) of the first match per string in s (in separate rows). As a single matrix would no longer suffice if we allow for multiple matches of p in each string of s, the output of str_locate_all() is a list that contains one such matrix for each string in s:
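The two output formats can be sketched with a hypothetical vector s (an assumption, chosen so that "hat" occurs twice in its second element):

```r
library(stringr)

# Hypothetical example strings (an assumption):
s <- c("The cat sat on the mat.",
       "The mad hatter had a hat.",
       "A rat ran past.")

# First match per string: an integer matrix with columns "start" and "end"
loc_first <- str_locate(s, "hat")
loc_first          # row 2 is 9 11; rows 1 and 3 are NA
dim(loc_first)     # 3 2

# All matches per string: a list with one matrix per element of s
loc_all <- str_locate_all(s, "hat")
loc_all[[2]]       # two rows: 9 11 and 22 24
nrow(loc_all[[1]]) # 0 (no match in the first string)
```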

The need to deal with different output types persists when using base R commands for locating pattern matches:

  • grep(p, s) returns the integer position(s) of all matching strings (in character vector s)
  • regexpr(p, s) returns an integer vector with additional attributes
  • gregexpr(p, s) returns a list of the same length as s

If you find the latter two outputs confusing, you are in good company. But note that all the information you may want about the location of matches is there.
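The three base R outputs can be inspected in a short sketch (base R only; the vector s is a hypothetical example):

```r
# Hypothetical example strings (an assumption):
s <- c("The cat sat on the mat.",
       "The mad hatter had a hat.",
       "A rat ran past.")
p <- "hat"

# Positions of the matching elements within the vector s:
pos_el <- grep(p, s)
pos_el                           # 2

# Start of the first match per string (-1 signals no match),
# with the match lengths stored as an attribute:
first_m <- regexpr(p, s)
as.integer(first_m)              # -1  9 -1
attr(first_m, "match.length")    # -1  3 -1

# A list with one such vector per string, covering all matches:
all_m <- gregexpr(p, s)
as.integer(all_m[[2]])           # 9 22
attr(all_m[[2]], "match.length") # 3 3
```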

Obtaining strings that match patterns

Obtaining all strings in a character vector s that match a pattern p can be achieved by str_subset(s, p) and answers the question:

  • str_subset(s, p): Which elements of a character vector s match pattern p?

The base R equivalent to str_subset(s, p) is grep(p, s, value = TRUE) (note the reversal of arguments):

As we have seen above, obtaining matching strings can also be achieved by first detecting matching strings and then using logical indexing:

But since obtaining all strings that match a pattern is a common task, the str_subset() function is a welcome shortcut for the two-step operation of detecting and subsetting.

Using base R variants of grep(), we could use either logical or numerical indexing to obtain matching strings:
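A sketch of these alternative routes to the same result, assuming a hypothetical vector s and a pattern that matches its first two elements:

```r
library(stringr)

# Hypothetical example strings (an assumption):
s <- c("The cat sat on the mat.",
       "The mad hatter had a hat.",
       "A rat ran past.")
p <- "^The"  # strings that start with "The"

# stringr: shortcut vs. the two-step operation of detecting and subsetting
by_subset <- str_subset(s, p)
by_detect <- s[str_detect(s, p)]

# base R: value = TRUE, logical indexing, and numerical indexing
by_value <- grep(p, s, value = TRUE)
by_grepl <- s[grepl(p, s)]  # logical index
by_grep  <- s[grep(p, s)]   # numerical index

by_subset  # the first two elements of s
```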

Counting pattern matches

A task closely related to detecting, locating, and obtaining strings that match a pattern is counting the number of occurrences of a pattern p in a string s:

  • str_count(s, p): How often does pattern p occur in string s?

The str_count() function is a key tool for quantifying pattern matches. Like most functions for advanced string manipulation, it is quite mundane when used with highly specific patterns:

but becomes powerful when combined with regular expressions and other functions:

Note that the str_count(s, p) function counts all occurrences of a pattern p in strings s, not just the first occurrence in each element of s. As it has no direct equivalent in base R, the str_count() command is one of the main reasons for using the stringr package.
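A minimal sketch of counting matches in a hypothetical vector s (an assumption). The last lines show an indirect base R route that yields the same counts, which is offered here only as an illustration of why a dedicated counting function is convenient:

```r
library(stringr)

# Hypothetical example strings (an assumption):
s <- c("The cat sat on the mat.",
       "The mad hatter had a hat.",
       "A rat ran past.")

# Highly specific pattern:
n_hat <- str_count(s, "hat")
n_hat  # 0 2 0

# Regular expression: number of lowercase vowels per string
n_vow <- str_count(s, "[aeiou]")
n_vow  # 6 7 3

# Combined with other functions: total number of "at" sequences in s
n_at_total <- sum(str_count(s, "at"))
n_at_total  # 6

# Indirect base R route: locate all matches, extract them, count them
n_at_base <- lengths(regmatches(s, gregexpr("at", s)))
n_at_base  # 3 2 1
```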

Extracting pattern matches

After detecting, locating, or counting pattern matches, a logical next step is extracting those matches. Extracting matches of a pattern p in a character vector s can be achieved by str_extract(s, p) and answers the question:

  • str_extract(s, p): Which character sequences in string s match pattern p?

As with other stringr functions, the str_extract() function comes in two varieties: The function str_extract(s, p) will only extract the first match of p in each element of s:

whereas the function str_extract_all(s, p) will extract all matches of p in each element of s:

As we have seen before, the price of the more complete result of str_extract_all(s, p) is a more complex output format: The resulting list contains a vector of matches for each element of the matched string s.
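The difference between the two output formats can be sketched with a hypothetical vector s (an assumption):

```r
library(stringr)

# Hypothetical example strings (an assumption):
s <- c("The cat sat on the mat.",
       "The mad hatter had a hat.",
       "A rat ran past.")

# First match per string: a character vector (NA where nothing matches)
ex_first <- str_extract(s, "hat")
ex_first       # NA "hat" NA

# All matches per string: a list of character vectors
ex_all <- str_extract_all(s, "hat")
ex_all[[2]]    # "hat" "hat"
lengths(ex_all) # 0 2 0
```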

These examples also show that extracting matches to highly specific patterns is pretty pointless. After all, the only character sequence that can match the pattern “hat” is “hat”, and extracting matches will not change that. However, extracting pattern matches can yield insights and surprises when combining it with the flexibility of regular expressions (see Appendix E). For instance, the following two commands both use this regex functionality and extract

  • all words that end on “at”, and

  • all words containing exactly three characters:
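The two extraction tasks listed above can be sketched as follows. The vector s and the exact regular expressions are assumptions chosen for illustration; other patterns would work as well:

```r
library(stringr)

# Hypothetical example strings (an assumption):
s <- c("The cat sat on the mat.",
       "The mad hatter had a hat.",
       "A rat ran past.")

# All words that end on "at" (word characters followed by "at" and a word boundary):
w_at <- str_extract_all(s, "\\b\\w*at\\b")
w_at[[1]]  # "cat" "sat" "mat"
w_at[[2]]  # "hat" (but not "hatter")

# All words containing exactly three characters:
w_3 <- str_extract_all(s, "\\b\\w{3}\\b")
w_3[[2]]   # "The" "mad" "had" "hat"
w_3[[3]]   # "rat" "ran"
```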

As our collections of strings s get larger and our regular expressions grow in complexity, we can extract matching patterns for asking and answering genuine questions:

The base R equivalent to str_extract(s, p) is ugly, but straightforward. It first requires locating all pattern matches (with gregexpr(p, s), see above) and then providing the results to a function regmatches(x, m) (note the change in argument names):
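A sketch of the base R route (no packages needed), using a hypothetical vector s. Note that combining regmatches() with regexpr() drops strings without a match, whereas str_extract() would return NA for them:

```r
# Hypothetical example strings (an assumption):
s <- c("The cat sat on the mat.",
       "The mad hatter had a hat.",
       "A rat ran past.")
p <- "hat"

# All matches: first locate them with gregexpr(), then extract them
m_all <- regmatches(x = s, m = gregexpr(pattern = p, text = s))
m_all[[2]]      # "hat" "hat"
lengths(m_all)  # 0 2 0

# First matches only: strings without a match are dropped from the result
m_first <- regmatches(x = s, m = regexpr(pattern = p, text = s))
m_first         # "hat"
```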

Replacing pattern matches

Replacing matches of a pattern p in a character vector s by a replacement r can be achieved by str_replace(s, p, r):

  • str_replace(s, p, r): Replace character sequences in string s that match pattern p by r.

The pair str_replace(s, p, r) and str_replace_all(s, p, r) follow the familiar pattern of replacing either the first or all occurrences of p by r:

Again, the power of replacing patterns is enhanced by using regular expressions:

Note that the replacement can have a different length than the matched pattern or consist of a pattern:
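A minimal sketch of these replacement variants, using a hypothetical vector s (an assumption). The last call interprets "consist of a pattern" as a replacement that contains backreferences to capture groups, which is one possible reading:

```r
library(stringr)

# Hypothetical example strings (an assumption):
s <- c("The cat sat on the mat.",
       "The mad hatter had a hat.",
       "A rat ran past.")

# First vs. all occurrences:
r_first <- str_replace(s, "at", "AT")
r_first[1]  # "The cAT sat on the mat."
r_all <- str_replace_all(s, "at", "AT")
r_all[1]    # "The cAT sAT on the mAT."

# With a regular expression (all three-letter words):
r_regex <- str_replace_all(s[3], "\\b[[:alpha:]]{3}\\b", "___")
r_regex     # "A ___ ___ past."

# The replacement can be longer than the match:
r_long <- str_replace_all(s[3], "a", "oo")
r_long      # "A root roon poost."

# The replacement can refer to capture groups (here: swapping "r" and "a"):
r_swap <- str_replace_all(s[3], "(r)(a)", "\\2\\1")
r_swap      # "A art arn past."
```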

Note that the str_replace() functions are vectorized over string, pattern, and replacement:

Performing multiple replacements in a single step is possible by using str_replace_all() and providing a named character vector to the pattern argument:
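Both features can be sketched with small hypothetical inputs (all strings below are assumptions made for illustration):

```r
library(stringr)

# Vectorized over string, pattern, and replacement:
# the i-th pattern and replacement are applied to the i-th string
v_out <- str_replace(string      = c("cat", "hat", "rat"),
                     pattern     = c("c", "h", "r"),
                     replacement = c("b", "m", "s"))
v_out  # "bat" "mat" "sat"

# Multiple replacements in one step: the names are the patterns,
# the values are their replacements
n_out <- str_replace_all("one cat, two cats",
                         c("one" = "1", "two" = "2", "cat" = "dog"))
n_out  # "1 dog, 2 dogs"
```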

The closest base R equivalent to str_replace_all() is gsub() (with sub() replacing only the first match), which also contains a replacement argument. These functions also allow for regular expressions, but are vectorized only over their text argument x, rather than over pattern and replacement:
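A short base R sketch, using a hypothetical vector s (an assumption):

```r
# Hypothetical example strings (an assumption):
s <- c("The cat sat on the mat.",
       "The mad hatter had a hat.",
       "A rat ran past.")

# sub() replaces the first match in each string:
b_first <- sub(pattern = "at", replacement = "AT", x = s)
b_first[1]  # "The cAT sat on the mat."

# gsub() replaces all matches in each string:
b_all <- gsub(pattern = "at", replacement = "AT", x = s)
b_all[1]    # "The cAT sAT on the mAT."

# Backreferences work in the replacement as well:
b_swap <- gsub("(r)(a)", "\\2\\1", s[3])
b_swap      # "A art arn past."
```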

9.4.3 Additional stringr commands

The stringr package (Wickham, 2019b) contains many additional commands that facilitate working with strings (some of which were mentioned in Section 9.3.4). Here are some examples:

  • The str_length() function is stringr’s equivalent of nchar().

A substantial portion of text is white space between words, lines, or paragraphs. Consequently, managing white space in text is the purpose of many specialized functions.

To examine these functions, let’s first select some fruits with very short and very long names:

  • The pair of functions str_pad() and str_trim() helps to manage string lengths by adding or removing leading or trailing spaces to or from strings:
  • Similarly, the str_squish() function trims leading and trailing spaces, but also reduces repeated whitespace inside a string to single spaces:
  • For removing leading or trailing whitespace from character strings, see also the base R function trimws():
  • The str_trunc() function truncates long strings to a maximum width (strings that are shorter than this width remain unchanged):
  • The str_wrap() function helps to format text into well-behaved paragraphs:
  • The str_starts() and str_ends() helpers are shortcuts for str_detect() with regex anchors (see Section E.2.4 in Appendix E):
  • The str_flatten() function is similar to paste() with its collapse argument not being NULL, but may be easier to remember:
  • The str_match() function is an extension of str_extract() that also returns the parts of a match that correspond to capture groups:
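The length and whitespace helpers in the list above can be sketched as follows. The fruit names and the sentence are hypothetical inputs (assumptions, not the selection used in the original text):

```r
library(stringr)

# Hypothetical inputs (assumptions):
fr <- c("fig", "passion fruit")

len <- str_length(fr)                               # 3 13 (same as nchar(fr))

padded   <- str_pad("fig", width = 6, side = "left") # "   fig"
trimmed  <- str_trim("  fig  ")                      # "fig"
squished <- str_squish("  a   ripe    fig ")         # "a ripe fig"
base_trm <- trimws("  fig  ")                        # "fig" (base R)

# Only strings longer than width are shortened (the ellipsis counts as 3 characters):
trunc <- str_trunc(fr, width = 10)                   # "fig" "passion..."

# Line breaks are inserted so that lines stay within the given width:
wrapped <- str_wrap("The quick brown fox jumps over the lazy dog.", width = 20)

# Shortcuts for anchored detection:
starts_p <- str_starts(fr, "p")                      # FALSE TRUE
ends_g   <- str_ends(fr, "g")                        # TRUE FALSE

# Collapsing a vector into one string:
flat <- str_flatten(c("a", "b", "c"), collapse = "-") # "a-b-c"
```

Calling cat(wrapped) prints the wrapped text on separate lines.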

When using more complex regular expressions for matching patterns, it can be useful to extract not only complete matches, but also their components. This is the job of the str_match() and str_match_all() functions, which return a matrix (or, for str_match_all(), a list of matrices) containing the complete match (in the first column) and an additional column for each capture group. For instance, the following code extracts all instances of an article (“a” or “the”) followed by a space and a sequence of characters:
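A minimal sketch of such a match with two capture groups. The strings in s_art and the exact pattern are assumptions made for illustration:

```r
library(stringr)

# Hypothetical example strings (an assumption):
s_art <- c("the cat", "a hat for the rat", "no article")

# Group 1: the article; group 2: the word that follows it
p_art <- "\\b(a|the) (\\w+)"

# First match per string: a character matrix (one row per string)
m_first <- str_match(s_art, p_art)
m_first[2, ]    # "a hat" "a" "hat"
m_first[3, ]    # NA NA NA (no match)

# All matches per string: a list of matrices (one row per match)
m_all <- str_match_all(s_art, p_art)
m_all[[2]]      # rows: "a hat" "a" "hat" and "the rat" "the" "rat"
```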

See the RStudio cheatsheet on stringr and the package’s documentation for additional commands.

Practice

Here are some practice tasks that allow us to test our knowledge and skills regarding stringr functions.

  1. Rethinking functions:

Revisit the %in% operator (from Section 1.4) that checks whether an element is found in a vector:

  • Can we also use %in% to find characters in strings?
  • Under which conditions is sum(str_detect(s, p)) equal to sum(str_count(s, p))?
  2. Finding the right words:

The tasks can be solved with stringr functions and regular expressions (see Appendix E). However, also consider simpler solutions (e.g., involving base R functions or logical indexing).

  • How many words in words contain the letter sequence “age”? Which ones?
  • Find all words in words that end on “ing”:
  • Find all words in words that contain “ing”, but do not end on it:
  • Find all words in words that contain 10 or more letters:

Note the much simpler solutions based on measuring the length of words:

  • Find all words in words consisting only of vowels (non-consonants, i.e., the set aeiou):

Note that there are many alternative solutions:

  • Find all words in words consisting only of consonants (i.e., without any vowels):

Again, there are many alternative solutions:

  • Find all words in words that begin and end with the same letter:

Note that the last solution missed the 1-letter word “a” and would also miss “AA”.

  • Why — and how can we fix it?
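One possible answer can be sketched as follows. The original solution code is not shown here, so the pattern p_flawed is only an assumption about a solution that would behave as described (it requires at least one character between the first and the last letter). The pattern p_fixed makes everything after the first letter optional:

```r
library(stringr)

# Hypothetical test words (an assumption):
w <- c("a", "AA", "dad", "ab", "area")

# Assumed flawed pattern: ".+" demands at least one character in between,
# so 1-letter and 2-letter words can never match
p_flawed <- "^(.).+\\1$"
out_flawed <- str_detect(w, p_flawed)
out_flawed  # FALSE FALSE TRUE FALSE TRUE

# Possible fix: a word is either a single character, or its first
# character followed by anything that ends with the same character
p_fixed <- "^(.)(.*\\1)?$"
out_fixed <- str_detect(w, p_fixed)
out_fixed   # TRUE TRUE TRUE FALSE TRUE
```

With the vector words from stringr, str_subset(words, p_fixed) would apply the same pattern to the actual task.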
  3. Quantifying fruits:

Measure some aspects of fruits (from the ds4psy package, but all in lowercase letters):

Use the vector of fruits to answer the following questions:

  • Are there more fruits containing the letter “a” or “e”?
  • Are there more fruits containing the letter sequence “ana” or “po”?
  • How many and which fruits contain one (or more) of the letters “x”, “y”, or “z”?
  • How many and which fruits that are not berries contain one (or more) of the letters “x”, “y”, or “z”?
  • Create a tibble ft that contains all names of fruits (in lowercase letters) in a column called name.
  • Add a column len that contains the length of each fruit’s name.
  • Add a column n_vow that counts the number of vowels (defined as one of “aeiou”) in each fruit’s name.
  • Add a column n_con that counts the number of consonants (defined as non-vowels) in each fruit’s name.
  • Verify that len equals the sum of n_vow and n_con for all fruits in ft.
  • Which names of fruits contain more than 50% of vowels?

Hint: An example in Section 14.4 (Tools) of Wickham & Grolemund (2017) solves most of this task for the character vector of words.

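Table 9.1: The head() of ft.

| nr | name | len | n_vow | n_con |
|---|---|---|---|---|
| 1 | acai | 4 | 3 | 1 |
| 2 | ackee | 5 | 3 | 2 |
| 3 | apple | 5 | 2 | 3 |
| 4 | apricot | 7 | 3 | 4 |
| 5 | avocado | 7 | 4 | 3 |
| 6 | banana | 6 | 3 | 3 |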

Using ft to answer:

  • Which names of fruits contain more than 50% of vowels?
  4. Replacing characters in Trumpisms:

The vector Trumpisms (included in ds4psy) contains 145 words or short phrases frequently used by U.S. president Donald Trump. Use this vector for some character replacements:

  • Replace all instances of “i” by “!” and all instances of “s” by “$”.
  • Replace all instances of two repeated letters (e.g., “ll”) by “wall”.
  5. Replacing and translating Bushisms:

The vector Bushisms (included in ds4psy) contains 22 phrases spoken by or attributed to U.S. president George W. Bush. Well-known examples include marvels like “They misunderestimated me.” and “Rarely is the question asked: Is our children learning?”.

  • Replace all instances of “I” by “you”, “my” by “your”, “you” by “I”, and “your” by “my”, respectively.

As children, we sometimes talked in what we called our secret “B-language”: Every occurrence of a vowel was followed by “b” and then repeated. The resulting sentences failed to protect our secrets from our enemies and parents, but sounded pretty funny.

  • Translate the set of Bushisms into B-language.
  6. Flowery phrases:

After all this political talk, we crave some more decorative and charming phrases. Fortunately, the vector flowery (included in ds4psy) contains 60 versions and variations of Gertrude Stein’s popular phrase “A rose is a rose is a rose”.

Use this vector (in lowercase letters) for answering the following questions:

  • How often do the word “Rose” and its variations occur in flowery phrases?
  • How many matches can we find for words belonging to some semantic field?
  • What is the topic or theme of each phrase?

Note that these questions all address semantic issues, which can be tricky, subject to interpretations, and often require human judgment and heuristic approaches. But let’s see how far we get with our fairly simple tools:

  • How often do the word “Rose” and its variations occur in flowery phrases?

The next task

  • How many matches can we find for words belonging to some semantic field?

requires that we first look through the flowery phrases and identify semantic fields as sets of words belonging to the same category. For instance, a first set could consist of “garden”, “flower”, “friend”, “love”, and “save”, which are all words with positive connotations that are associated with roses. This set could be contrasted with phrases that address more negative topics (e.g., “murder”, “thief”, “zombie”).

Interestingly, the answers we get depend not only on the data we analyze, but also on the precise questions we ask.

Finally, let’s try to figure out what the flowery phrases are about:

  • What is the topic or theme of each phrase?

Solving this task will require some insight or heuristic. A possible approach could ask: Which noun occurs repeatedly in a phrase? The following attempt extracts the first two words of every phrase and then chooses the longer one of them:

Note that this result is still sub-optimal. Can you find a better solution?

References

Wickham, H. (2019b). stringr: Simple, consistent wrappers for common string operations. Retrieved from https://CRAN.R-project.org/package=stringr

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz


  45. Omitting argument names is common practice when using a function, but can be dangerous when several arguments are of the same type. As regular expressions are strings in R, inadvertently reversing the string and pattern arguments is possible, and can yield disastrous results when not noticed. A cheap insurance policy against such mistakes is to always spell out argument names, particularly when programming your own functions (see Chapter 11).