9.4 Advanced text-manipulation
In Section 9.2.4, we emphasized that base R provides a range of functions to deal with advanced tasks of text-manipulation. However, as these are a bit cryptic and confusing (e.g., by existing in many variants and returning their results as lists), using the stringr package (Wickham, 2019b) included in the tidyverse (Wickham et al., 2019) is easier and more straightforward:
library(stringr)
As both base R and stringr functions support pattern matching, advanced text-manipulation assumes some familiarity with using regular expressions (see Appendix E).
9.4.1 Advanced tasks with text
This section is structured according to the advanced tasks identified in Section 9.2.4. The following Table 9.3 repeats the advanced tasks of our original summary table:
Task | R base | stringr |
---|---|---|
B: Advanced tasks | | |
– View strings in s that match p: | | str_view(s, p) \(^{a}\) |
– Detect pattern p in strings s: | grep(p, s), grepl(p, s) \(^{1}\) | str_detect(s, p) |
– Locate pattern p in strings s: | gregexpr(p, s) \(^{1}\) | str_locate(s, p) \(^{a}\) |
– Obtain strings in s that match p: | grep(p, s, value = TRUE) \(^{1}\) | str_subset(s, p) |
– Extract matches of p in strings s: | regmatches(s, gregexpr(p, s)) \(^{1}\) | str_extract(s, p) \(^{a}\) |
– Replace matches of p by r in s: | gsub(p, r, s) \(^{1}\) | str_replace(s, p, r) \(^{a}\) |
– Count matches of p in strings s: | | str_count(s, p) |
Table notes
\(^{1}\): base R functions with additional variants that tweak their functionality (see their documentation).
\(^{2}\): stringr functions with additional variants that tweak their functionality.
\(^{a}\): stringr functions with an additional suffix _all() that applies the function to all matches (rather than just to the first match).
Why stringr?
There are three main reasons for switching to the stringr package to address these more advanced tasks:
Organization: The stringr functions are named and structured in a more systematic fashion than their base R ancestors.
Specialization: The stringr functions are more specialized than the corresponding base R functions. Each function is designed to accomplish a specific task rather well, rather than the base R approach of providing a family of related functions that do many things in different ways.
Functionality: The outputs of str_view() and str_count() are difficult to reproduce with base R functions. Additionally, the pattern argument of many stringr functions can be used in combination with so-called modifiers that govern the interpretation of a regular expression and allow setting additional arguments (e.g., ignore_case = TRUE, see ?stringr::regex for the documentation).
Personally, I use the base R functions nchar(), paste(), and substr() on a regular basis, appreciate the flexibility of grep() and strsplit(), but find the details of gregexpr(), gsub(), and regmatches() too confusing to remember.
Regarding the stringr package, I enjoy the convenience of str_view_all() for displaying the matches of regular expressions, the fact that a set of uniform commands are named by the functions they perform, and the invaluable functionality of str_count() for quantifying pattern matches.
Data
To have some character data to match, here is a delightfully nonsensical string s:
# (1) Define some data: ----
s <- c("The cat sat on the mat.",
       "The mad hatter had heard her, so what?",
       "The fat dad was so sad.")
Additionally, we will use some of the data included in stringr and ds4psy:
# (2) from ds4psy: ------
# (a) words: ----
fruits <- ds4psy::fruits
# length(fruits)  # 122
countries <- ds4psy::countries
# length(countries)  # 197
Trumpisms <- ds4psy::Trumpisms
# length(Trumpisms)  # 96
# (b) sentences: ----
flowery <- ds4psy::flowery
# length(flowery)  # 60
Bushisms <- ds4psy::Bushisms
# length(Bushisms)  # 22
# (3) from stringr: ------
words <- stringr::words
# length(words)  # 980
sentences <- stringr::sentences
# length(sentences)  # 720
9.4.2 Essential stringr commands
Viewing pattern matches
The stringr function str_view(s, p) is convenient for showing and highlighting the first match of a pattern p in a string s:
str_view(string = s, pattern = "at")
Its variant str_view_all(s, p) shows all matches of a pattern p in s:
str_view_all(string = s, pattern = "at")
Importantly, the pattern argument allows for different interpretations, which are accessible via so-called modifiers.
The default interpretation is pattern = regex(p), which interprets p as a regular expression:
# regex: A 3-letter word:
p <- "\\b[:alpha:]{3}\\b"
# str_view_all(s, p)
# is the same as:
str_view_all(s, regex(p))
If the "\\b[:alpha:]{3}\\b" looks like symbol salad to you, see Appendix E for using regular expressions.
Besides the default regex(), the other modifiers of pattern are coll(), fixed(), and boundary().
The boundary() modifier is useful for detecting various text-related boundaries:
str_view_all(s, boundary("word"))
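If case is to be ignored, we can set ignore_case = TRUE inside the regex() modifier:
str_view_all("Abra CAD abrA", "A")  # case-sensitive (default)
str_view_all("Abra CAD abrA", regex("A", ignore_case = TRUE))  # ignoring case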
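To make the effect of modifiers more tangible, the following minimal sketch contrasts the default regex() interpretation with fixed() and with regex(ignore_case = TRUE). It uses str_detect() rather than a viewing function, so that the results are plain logical vectors. The vector x is made up for illustration and is not part of the chapter's data.

```r
library(stringr)

# Hypothetical demo strings (not part of the chapter's data):
x <- c("a.b", "axb", "A.B")

# Default: the pattern is a regular expression, so "." matches any character:
m_regex <- str_detect(x, "a.b")  # same as regex("a.b")

# fixed(): the pattern is taken literally, so "." only matches a period:
m_fixed <- str_detect(x, fixed("a.b"))

# regex() with ignore_case = TRUE also matches the uppercase variant:
m_nocase <- str_detect(x, regex("a.b", ignore_case = TRUE))

m_regex   # TRUE  TRUE FALSE
m_fixed   # TRUE FALSE FALSE
m_nocase  # TRUE  TRUE  TRUE
```

The same pattern string thus yields three different answers, depending only on the modifier wrapped around it.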
As s gets large, setting match = TRUE or match = FALSE allows selectively showing matching or non-matching strings, respectively:
str_view_all(s, "hat", match = TRUE)
The str_view() commands have no direct equivalent in base R. The closest similar command is grep() with setting value = TRUE.
Both function families share the pattern argument, but the argument x of grep() corresponds to the string argument in stringr functions. Importantly, the order of arguments is reversed, which matters whenever we get lazy and omit argument names.
Hence, the following function calls (note their reversal of arguments) both yield the element of s with a positive match:
p <- "hat"
grep(p, s, value = TRUE)
#> [1] "The mad hatter had heard her, so what?"
str_view_all(s, p, match = TRUE)
but only str_view_all() highlights these matches, which also reveals that there are two matches.
To complicate matters, Table 9.3 (above) lists str_subset() as the direct equivalent of grep(p, s, value = TRUE):
str_subset(s, p)
#> [1] "The mad hatter had heard her, so what?"
Essentially, the str_view() family of functions are convenient tools, but also complex hybrids that include several other commands and tasks. The task of viewing matches of a pattern p in a string s combines several steps to perform a seemingly simple task:
str_view_all(s, p): View all occurrences of the pattern p in string s.
If we were to program this function, we would realize that viewing all pattern matches is far from simple, but requires a combination of several simpler tasks:
- detecting strings matching a pattern,
- locating matching patterns within the string, and
- obtaining strings that match a pattern, or
- extracting matching patterns (e.g., for highlighting them).
As we will see next, these tasks all contain an identical step (i.e., matching patterns in strings of text), but differ in the outputs they produce.
Detecting pattern matches
The task of detecting matches of a pattern p in a string s is performed by str_detect() and answers the question:
str_detect(s, p): Does pattern p occur in string s?
The answer to this question is a vector of logical values:
# Task: Detect matches of a pattern in string s:
str_detect(s, "hat")
#> [1] FALSE TRUE FALSE
Obtaining a logical vector as the result of detecting patterns may seem like a limitation.
However, pattern detection combined with regular expressions and logical indexing (and R’s convention of treating TRUE as 1 and FALSE as 0 when applying arithmetic functions to logical vectors) can answer quite sophisticated questions:
# How many fruits contain a letter twice in a row?
sum(str_detect(fruits, "(.)\\1"))
#> [1] 46
# What proportion of words start or end on a vowel?
mean(str_detect(words, "^[aeiou]"))
#> [1] 0.1785714
mean(str_detect(words, "[aeiou]$"))
#> [1] 0.2765306
The base R equivalent to str_detect(s, p) is grepl(p, s) (note the reversal of arguments):
grepl("hat", x = s)
#> [1] FALSE TRUE FALSE
Locating pattern matches
The task of locating pattern matches is similar to detecting matches.
But rather than asking if a pattern p occurs in a string s, the question answered by str_locate(s, p) is:
str_locate(s, p): Where in string s does a pattern p occur?
For example, let’s reconsider our example from above:
- Where in string s does the character sequence "hat" occur?
We see that “hat” occurs twice in the second element of the character vector s.
If we were only interested in the location of our first pattern match, the str_locate() command provides an answer:
# Task: Locate the first match of a pattern in string s:
str_locate(s, "hat")
#> start end
#> [1,] NA NA
#> [2,] 9 11
#> [3,] NA NA
Note that the result of str_locate(s, p) is a matrix that contains the integer values of the start and end positions (as columns) of the first match in each string (or line) of s (in separate rows). As this matrix would no longer suffice if we allow for multiple matches of p in each string of s, the output of str_locate_all() becomes a list of matrices (one for each element of s):
# Task: Locate all matches of a pattern in string s:
str_locate_all(s, "hat")
#> [[1]]
#> start end
#>
#> [[2]]
#> start end
#> [1,] 9 11
#> [2,] 35 37
#>
#> [[3]]
#> start end
Each list element reports the start and end positions of all matches in the corresponding element/row of the string s.
Using their values requires some skill in processing lists and matrices (see Sections 1.5.1 and 1.6.3):
# Task: Locate all matches and process a list element:
l_hat <- str_locate_all(s, "hat")  # as a list
mx_e2 <- l_hat[[2]]  # 2nd list element
is.matrix(mx_e2)     # a matrix
mx_e2[ , 1]          # 1st column: start positions of its 2 matches
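As a hedged sketch of how such a list might be processed further, the following code counts the matches per string and collects their start positions. The object names n_hat and starts are illustrative only, and s is redefined so that the block runs on its own.

```r
library(stringr)

s <- c("The cat sat on the mat.",
       "The mad hatter had heard her, so what?",
       "The fat dad was so sad.")

l_hat <- str_locate_all(s, "hat")  # a list of matrices (one per string)

# The number of matches per string is the number of rows of each matrix:
n_hat <- vapply(l_hat, nrow, integer(1))

# Start positions per string (an empty vector if there is no match):
starts <- lapply(l_hat, function(mx) mx[, "start"])

n_hat        # 0 2 0
starts[[2]]  # 9 35
```

Using vapply() with a declared return type keeps the result a plain integer vector, whereas lapply() preserves the varying number of matches per string.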
The need for dealing with different output types remains when using base R commands for locating pattern matches:
- grep(p, s) returns the integer position(s) of all matching strings (in character vector s)
- regexpr(p, s) returns an integer vector with additional attributes
- gregexpr(p, s) returns a list of the same length as s
grep("hat", x = s) # integer position (in vector)
#> [1] 2
regexpr("hat", text = s) # integer vector with attributes
#> [1] -1 9 -1
#> attr(,"match.length")
#> [1] -1 3 -1
#> attr(,"index.type")
#> [1] "chars"
#> attr(,"useBytes")
#> [1] TRUE
gregexpr("hat", text = s) # list of integer vectors
#> [[1]]
#> [1] -1
#> attr(,"match.length")
#> [1] -1
#> attr(,"index.type")
#> [1] "chars"
#> attr(,"useBytes")
#> [1] TRUE
#>
#> [[2]]
#> [1] 9 35
#> attr(,"match.length")
#> [1] 3 3
#> attr(,"index.type")
#> [1] "chars"
#> attr(,"useBytes")
#> [1] TRUE
#>
#> [[3]]
#> [1] -1
#> attr(,"match.length")
#> [1] -1
#> attr(,"index.type")
#> [1] "chars"
#> attr(,"useBytes")
#> [1] TRUE
If you find the latter two outputs confusing, you are in good company.
Extracting their various attributes typically requires using the base R function attr(), plus remembering that the n-th element of a list l is l[[n]].
But note that all the information you may want about the location of matches is provided.
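The following base R sketch illustrates how attr() and list indexing can be combined to turn the output of gregexpr() into start and end positions. The helper to_start_end() is hypothetical (it is not part of base R or stringr), and s is redefined so that the block is self-contained.

```r
s <- c("The cat sat on the mat.",
       "The mad hatter had heard her, so what?",
       "The fat dad was so sad.")

m <- gregexpr("hat", text = s)  # a list of match data (one element per string)

# Hypothetical helper: convert one element of m into a start/end matrix
to_start_end <- function(m_i) {
  start <- as.integer(m_i)            # start positions (-1 signals no match)
  len   <- attr(m_i, "match.length")  # lengths of the matches
  hit   <- start > 0                  # keep only actual matches
  cbind(start = start[hit], end = start[hit] + len[hit] - 1L)
}

loc <- lapply(m, to_start_end)
loc[[2]]  # start: 9 35, end: 11 37
```

The resulting list of matrices has the same shape as the output of str_locate_all(s, "hat") shown above, which may make the base R result easier to compare.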
Obtaining strings that match patterns
Obtaining all strings in a character vector s that match a pattern p can be achieved by str_subset(s, p), answering the question:
str_subset(s, p): Which elements of a string s match a pattern p?
# Task: Obtain strings in s that match a pattern:
str_subset(s, "hat")
#> [1] "The mad hatter had heard her, so what?"
The base R equivalent to str_subset(s, p) is grep(p, s, value = TRUE) (note the reversal of arguments):
grep("hat", s, value = TRUE)
#> [1] "The mad hatter had heard her, so what?"
As we have seen above, obtaining matching strings can also be achieved by first detecting matching strings and then using logical indexing:
# Obtain strings in s that match a pattern
# (by detecting matching strings and logical indexing):
s[str_detect(s, "hat")]
#> [1] "The mad hatter had heard her, so what?"
But since obtaining all strings that match a pattern is a common task, the str_subset() function is a welcome shortcut for the two-step operation of detecting and subsetting.
Using base R variants of grep(), we could use either logical or numerical indexing to obtain matching strings:
# (with logical indexing):
s[grepl("hat", x = s)]
#> [1] "The mad hatter had heard her, so what?"
# (with numerical indexing):
s[grep("hat", s)]
#> [1] "The mad hatter had heard her, so what?"
Counting pattern matches
A task closely related to detecting, locating, and obtaining strings that match a pattern is counting the number of occurrences of a pattern p in a string s:
str_count(s, p): How often does pattern p occur in string s?
The str_count() function is a key tool for quantifying pattern matches.
Like most functions for advanced text-manipulation, the function is quite mundane when used with highly specific patterns:
str_count(s, "a")
#> [1] 3 5 4
str_count(s, "at")
#> [1] 3 2 1
str_count(s, "hat")
#> [1] 0 2 0
but becomes powerful when combined with regular expressions and other functions:
# Proportion of words beginning with a vowel:
sum(str_count(words, "^[aeiou]"))/length(words)
#> [1] 0.1785714
# Proportion of words ending on a vowel:
sum(str_count(words, "[aeiou]$"))/length(words)
#> [1] 0.2765306
# Mean proportion of vowels per word in words:
mean(str_count(words, "[aeiou]")/nchar(words))
#> [1] 0.3798123
# Words in words containing more than 70% vowels:
words[str_count(words, "[aeiou]")/nchar(words) > .70]
#> [1] "a" "area" "idea"
Note that the str_count(s, p) function counts all occurrences of a pattern p in strings s, not just the first occurrence in each element of s.
As it has no direct equivalent in base R, the str_count() command is one of the main reasons for using the stringr package.
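Although base R offers no single counting function, a comparable result can be approximated by combining functions that were introduced above. The following sketch assumes a hypothetical helper count_matches() (not part of base R or stringr) that extracts all matches per string and then measures how many there are.

```r
s <- c("The cat sat on the mat.",
       "The mad hatter had heard her, so what?",
       "The fat dad was so sad.")

# Hypothetical helper: count the matches of pattern p in each element of s
count_matches <- function(s, p) {
  # regmatches() yields one vector of matches per string,
  # lengths() then returns the number of matches in each vector:
  lengths(regmatches(s, gregexpr(p, s)))
}

count_matches(s, "a")    # 3 5 4
count_matches(s, "hat")  # 0 2 0
```

This reproduces the counts of str_count(s, "a") and str_count(s, "hat") shown above, but requires three nested calls instead of one.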
Extracting pattern matches
After detecting, locating, or counting pattern matches, a logical next step is extracting those matches.
Extracting matches to a pattern p in a character vector s can be achieved by str_extract(s, p) and answers the question:
str_extract(s, p): Which character sequences in string s match a pattern p?
As with other stringr functions, the str_extract() function comes in two varieties:
The function str_extract(s, p) will only extract the first match of p in each element of s:
# Task: Extract the first match of a pattern from string s:
str_extract(s, "hat")
#> [1] NA "hat" NA
whereas the function str_extract_all(s, p) will extract all matches of p in each element of s:
# Task: Extract all matches of a pattern from string s:
str_extract_all(s, "hat")
#> [[1]]
#> character(0)
#>
#> [[2]]
#> [1] "hat" "hat"
#>
#> [[3]]
#> character(0)
As we have seen before, the price of the more complete result of str_extract_all(s, p) is a more complex output format:
The resulting list contains a vector of matches for each element of the matched string s.
These examples also show that extracting matches to highly specific patterns is pretty pointless. After all, the only character sequence that can match the pattern “hat” is “hat”, and extracting matches will not change that. However, extracting pattern matches can yield insights and surprises when combining it with the flexibility of regular expressions (see Appendix E). For instance, the following two commands both use this regex functionality and extract all words that end on “at”, and all words containing exactly three characters:
# Task: Extract all words ending on "at" from string s:
str_extract_all(s, "\\b[:alpha:]+at\\b")
# Task: Extract all 3-letter words from string s:
str_extract_all(s, "\\b[:alpha:]{3}\\b")
As our collections of strings s get larger and our regular expressions grow in complexity, we can extract matching patterns for asking and answering genuine questions:
has_animal <- paste(c("cat", "dog", "bird", "elephant", "fish", "fox",
                      "mouse", "sheep", "trout", "zebra"), collapse = "|")
# Which of these animals occur in sentences? How often?
m <- str_extract_all(sentences, has_animal, simplify = TRUE)
v <- as.vector(m)  # matrix as vector
v <- v[v != ""]    # remove instances of ""
table(v)
# Obtain corresponding sentences:
str_subset(sentences, has_animal)
The base R equivalent to str_extract_all(s, p) is ugly, but straightforward.
It first requires locating all pattern matches (with gregexpr(p, s), see above) and then providing the results to a function regmatches(x, m) (note the change in argument names):
regmatches(x = s, m = gregexpr("hat", text = s))
#> [[1]]
#> character(0)
#>
#> [[2]]
#> [1] "hat" "hat"
#>
#> [[3]]
#> character(0)
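For the first-match variant, a base R counterpart needs some extra care: when regmatches() is given the output of regexpr(), it only returns the matches that exist and silently drops strings without a match. The sketch below (with the illustrative object names first_hat and hit) pads the result with NA so that it has the same shape as the output of str_extract(s, "hat").

```r
s <- c("The cat sat on the mat.",
       "The mad hatter had heard her, so what?",
       "The fat dad was so sad.")

m <- regexpr("hat", text = s)  # match data for the first match per string

# regmatches() only returns existing matches (here: a vector of length 1):
regmatches(s, m)

# Padding with NA to obtain one result per element of s:
first_hat <- rep(NA_character_, length(s))
hit <- as.vector(m) > 0  # TRUE where a match was found
first_hat[hit] <- regmatches(s, m)

first_hat  # NA "hat" NA
```

Without the padding step, the positions of the matches in s would be lost, which is easy to overlook when s is long.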
Replacing pattern matches
Replacing matches to a pattern p in a character vector s by a replacement r can be achieved by str_replace(s, p, r):
str_replace(s, p, r): Replace character sequences in string s that match pattern p by r.
The pair str_replace(s, p, r) and str_replace_all(s, p, r) follow the familiar pattern of replacing either the first or all occurrences of p by r:
str_replace(s[1], "a", "A")
#> [1] "The cAt sat on the mat."
str_replace_all(s[1], "a", "A")
#> [1] "The cAt sAt on the mAt."
Again, the power of replacing patterns is enhanced by using regular expressions:
str_replace_all(s[1], "[a-o]", "_")
#> [1] "T__ __t s_t __ t__ __t."
Note that the replacement can have a different length than the matched pattern or consist of a pattern:
# different lengths:
str_replace_all(s[1], "[aeiou]", "xxx")
#> [1] "Thxxx cxxxt sxxxt xxxn thxxx mxxxt."
str_replace_all(s[1], "[aeiou]", "")
#> [1] "Th ct st n th mt."
# with patterns:
str_replace_all(s[1], "[cmt]", toupper) # capitalize matches
#> [1] "The CaT saT on The MaT."
str_replace_all(s[1], "[a]", NA_character_) # replaces entire string with NA
#> [1] NA
str_replace_all(s[1], "([aeiou])", "\\1\\1\\1") # triple all vowels
#> [1] "Theee caaat saaat ooon theee maaat."
Note that the str_replace() functions are vectorized over string, pattern, and replacement:
str_replace_all(s, "[aeiou]", c("1", "2", "3"))
#> [1] "Th1 c1t s1t 1n th1 m1t."
#> [2] "Th2 m2d h2tt2r h2d h22rd h2r, s2 wh2t?"
#> [3] "Th3 f3t d3d w3s s3 s3d."
str_replace_all(s, c("a", "e", "d"), "-")
#> [1] "The c-t s-t on the m-t."
#> [2] "Th- mad hatt-r had h-ard h-r, so what?"
#> [3] "The fat -a- was so sa-."
str_replace_all(s, c("a", "e", "d"), c("A", "3", "D"))
#> [1] "The cAt sAt on the mAt."
#> [2] "Th3 mad hatt3r had h3ard h3r, so what?"
#> [3] "The fat DaD was so saD."
Performing multiple replacements in a single step is possible by using str_replace_all() and providing a named character vector to the pattern argument:
# Named vectors:
changes <- c("a" = "A", "e" = "3", "i" = "1", "o" = "0", "u" = "U")
numbers <- c("2" = "two", "3" = "three", "4" = "many")
# Replacing multiple matches at once:
str_replace_all(s, pattern = changes)
#> [1] "Th3 cAt sAt 0n th3 mAt."
#> [2] "Th3 mAd hAtt3r hAd h3Ard h3r, s0 whAt?"
#> [3] "Th3 fAt dAd wAs s0 sAd."
str_replace_all(sentences[3], changes)
#> [1] "It's 3Asy t0 t3ll th3 d3pth 0f A w3ll."
str_replace_all(paste0(2:4, " little piggies"), numbers)
#> [1] "two little piggies" "three little piggies" "many little piggies"
The closest base R equivalent to str_replace_all() is gsub(), which also contains a replacement argument (its sibling sub() replaces only the first match, like str_replace()).
This function also allows for regular expressions, but lacks the vector-based bells and whistles of the stringr functions:
gsub(pattern = "a", replacement = "A", x = s[1])
#> [1] "The cAt sAt on the mAt."
gsub(pattern = "[aeuoi]", replacement = "_", x = s[3])
#> [1] "Th_ f_t d_d w_s s_ s_d."
# Substitute any letter b or B by 2 x "be" or "Be":
gsub("([bB])", "\\1e\\1e", x = "To b or not to B.")
#> [1] "To bebe or not to BeBe."
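Because gsub() accepts only a single pattern and replacement per call, performing several replacements in base R requires repeating the call. The following sketch loops over a named vector of replacements (the same changes vector as above) and applies them one after another. This is one possible workaround under these assumptions, not a built-in feature of gsub().

```r
s <- c("The cat sat on the mat.",
       "The mad hatter had heard her, so what?",
       "The fat dad was so sad.")

changes <- c("a" = "A", "e" = "3", "i" = "1", "o" = "0", "u" = "U")

out <- s
for (i in seq_along(changes)) {
  # fixed = TRUE treats each name as a literal string (not as a regex):
  out <- gsub(names(changes)[i], changes[[i]], out, fixed = TRUE)
}

out[1]  # "Th3 cAt sAt 0n th3 mAt."
```

Note that the replacements are applied sequentially, so a later replacement can also change characters that were inserted by an earlier one.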
9.4.3 Additional stringr commands
The stringr package (Wickham, 2019b) contains many additional commands that facilitate working with strings (some of which were mentioned in Section 9.3.4). Here are some examples:
- The str_length() function is stringr’s equivalent of nchar().
A substantial portion of text is white space between words, lines, or paragraphs. Consequently, managing white space in text is the purpose of many specialized functions.
To examine these functions, let’s first select some fruits with very short and very long names:
fruits <- ds4psy::fruits  # Data:
fs <- fruits[nchar(fruits) < 4]  # very short names
fn <- fruits[!str_detect(fruits, "[\\(]")]  # no parentheses
fl <- fn[nchar(fn) > 15]  # very long names
f2 <- c(fs, fl)  # very short and very long names
f2
#> [1] "Fig" "Pea" "Grewia asiatica phalsa"
#> [4] "Monstera Delisiousa" "Purple mangosteen"
str_length(f2) # string lengths
#> [1] 3 3 22 19 17
- The pair of functions str_pad() and str_trim() help manage string lengths by adding or removing leading or trailing spaces to or from strings:
# Add padding (to short strings only):
fp <- str_pad(f2, width = 10, side = "both")
fp
#> [1] " Fig " " Pea " "Grewia asiatica phalsa"
#> [4] "Monstera Delisiousa" "Purple mangosteen"
# Trim padding:
str_trim(fp)
#> [1] "Fig" "Pea" "Grewia asiatica phalsa"
#> [4] "Monstera Delisiousa" "Purple mangosteen"
- Similarly, the str_squish() function trims leading and trailing spaces, but also reduces repeated whitespace inside a string to single spaces:
# Trim and remove repeated white space:
str_squish("  The   is messy  a sentence.   A second   phrase...  ")
#> [1] "The is messy a sentence. A second phrase..."
- For removing leading or trailing whitespace from character strings, see also the base R function trimws():
gender <- c("male", "female", "other", "male ", " female", " other ")
unique(gender)
#> [1] "male" "female" "other" "male " " female" " other "
unique(trimws(gender))
#> [1] "male" "female" "other"
- The str_trunc() function truncates long strings so that no string in s exceeds a maximum width (shorter strings remain unchanged):
# Truncate strings (to a maximum length):
str_trunc(f2, width = 15, side = "right")
#> [1] "Fig" "Pea" "Grewia asiat..." "Monstera Del..."
#> [5] "Purple mango..."
- The str_wrap() function helps format text into well-behaved paragraphs:
pg <- str_wrap(sentences[1:3], width = 20, indent = 5)
writeLines(pg)
#> The birch canoe
#> slid on the smooth
#> planks.
#> Glue the sheet
#> to the dark blue
#> background.
#> It's easy to
#> tell the depth of a
#> well.
- The str_starts() and str_ends() helpers are shortcuts for str_detect() with regex anchors (see Section E.2.4 in Appendix E):
f3 <- c("Apple", "Banana", "Coconut")
str_starts(f3, "C")
#> [1] FALSE FALSE TRUE
str_starts(f3, "C", negate = TRUE)
#> [1] TRUE TRUE FALSE
str_ends(f3, "a")
#> [1] FALSE TRUE FALSE
str_ends(f3, "a", negate = TRUE)
#> [1] TRUE FALSE TRUE
- The str_flatten() function is similar to paste() with its collapse argument not being NULL, but may be easier to remember:
str_flatten(letters)
#> [1] "abcdefghijklmnopqrstuvwxyz"
str_flatten(letters, collapse = "_")
#> [1] "a_b_c_d_e_f_g_h_i_j_k_l_m_n_o_p_q_r_s_t_u_v_w_x_y_z"
# Contrast with paste():
paste(letters, collapse = "")
#> [1] "abcdefghijklmnopqrstuvwxyz"
paste(letters, collapse = "_")
#> [1] "a_b_c_d_e_f_g_h_i_j_k_l_m_n_o_p_q_r_s_t_u_v_w_x_y_z"
- The str_match() function is an extension of str_extract() that also obtains the parts of matches:
When using more complex regular expressions for matching patterns, it can be useful to extract not only complete matches, but also their components. This is the job of the str_match() and str_match_all() functions, which return a matrix (or, for str_match_all(), a list of matrices) that contains the complete match (in the first column) and additional columns for each capture group.
For instance, the following code extracts the first instance of an article (“a” or “the”) followed by a space and a sequence of letters from each sentence:
# regex: An article + space + character sequence:
article_word <- "(a|the) ([:alpha:]+)"
# Extract entire match vs. its parts:
head(str_extract(sentences, article_word))
#> [1] "the smooth" "the sheet" "the depth" "a chicken" NA
#> [6] NA
head(str_match(sentences, article_word))
#> [,1] [,2] [,3]
#> [1,] "the smooth" "the" "smooth"
#> [2,] "the sheet" "the" "sheet"
#> [3,] "the depth" "the" "depth"
#> [4,] "a chicken" "a" "chicken"
#> [5,] NA NA NA
#> [6,] NA NA NA
## Getting "_all" requires lists of lists:
# str_extract_all(sentences, article_word)
# str_match_all(sentences, article_word)
See the RStudio cheatsheet on stringr and the package’s documentation for additional commands.
Practice
Here are some practice tasks that allow us to test our knowledge and skills regarding stringr functions.
- Rethinking functions:
The %in% operator finds elements of vectors.
- Can we also use %in% to find characters in strings?
- Under which conditions is sum(str_detect(s, p)) equal to sum(str_count(s, p))?
Revisit the %in% operator (from Section 1.4) that checks whether an element is found in a vector:
5 %in% 1:10
#> [1] TRUE
"X" %in% LETTERS
#> [1] TRUE
- Can we also use %in% to find characters in strings?
# Data:
st <- "I am a simple demo sentence."
# as vector:
vt <- unlist(strsplit(st, split = " "))
vt <- unlist(str_extract_all(st, boundary("word")))
vt
# (a) Finds entire elements of vectors:
"simple" %in% vt
# (b) BUT not character sequences WITHIN strings:
"simple" %in% st
# (c) Thus, we need to use regex:
grepl("simple", st)
str_view_all(st[1], pattern = "simple")
- Under which conditions is sum(str_detect(s, p)) equal to sum(str_count(s, p))?
# Data:
# (from above):
s
# Different outputs:
str_detect(s, pattern = "a")  # either FALSE/0 or TRUE/1 for each element
str_count(s, pattern = "a")   # counts all occurrences per element
# When are the sums of outputs equal?
# When a pattern p occurs 0 or 1 times in every element of s:
dv <- str_detect(s, pattern = "T")  # all TRUE
cv <- str_count(s, pattern = "T")   # 1 1 1
sum(dv) == sum(cv)
dv <- str_detect(s, pattern = "n")  # TRUE FALSE FALSE
cv <- str_count(s, pattern = "n")   # 1 0 0
sum(dv) == sum(cv)
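The condition stated in this solution can also be checked programmatically. The following sketch defines two hypothetical helpers (sums_agree() and at_most_one(), neither of which is part of stringr) and shows that they give the same answer for two patterns. It assumes that s contains no missing values.

```r
library(stringr)

s <- c("The cat sat on the mat.",
       "The mad hatter had heard her, so what?",
       "The fat dad was so sad.")

# Hypothetical helper: do the two sums agree for pattern p?
sums_agree <- function(s, p) {
  sum(str_detect(s, p)) == sum(str_count(s, p))
}

# Hypothetical helper: does p occur at most once in every element of s?
at_most_one <- function(s, p) {
  all(str_count(s, p) <= 1)
}

sums_agree(s, "n")   # TRUE
at_most_one(s, "n")  # TRUE
sums_agree(s, "a")   # FALSE
at_most_one(s, "a")  # FALSE
```

Since each detected element contributes exactly 1 to the first sum and at least 1 to the second, the sums can only be equal if no element contains more than one match.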
- Finding the right words:
- How many words in words contain the letter sequence “age”? Which ones?
- Find all words in words that end on “ing”.
- Find all words in words that contain “ing”, but do not end on it.
- Find all words in words that contain 10 or more letters.
- Find all words in words consisting only of vowels (non-consonants, i.e., the set aeiou).
- Find all words in words consisting only of consonants (i.e., without any vowels).
- Find all words in words that begin and end with the same letter.
The tasks can be solved with stringr functions and regular expressions (see Appendix E). However, also consider simpler solutions (e.g., involving base R functions or logical indexing).
# Data:
words <- stringr::words
length(words)  # 980
- How many words in words contain the letter sequence “age”? Which ones?
sum(str_detect(words, "age"))
sum(str_count(words, "age"))
str_subset(words, "age")
- Find all words in words that end on “ing”:
# regex with anchor:
p <- "ing$"
str_subset(words, p)
# alternatives:
words[str_detect(words, p)]
words[str_ends(words, "ing")]
- Find all words in words that contain “ing”, but do not end on it:
# regex with wildcard
p <- "ing."
str_subset(words, p)
# in 2 steps:
ing_w <- words[str_detect(words, "ing")]  # words with "ing"
ing_w[str_ends(ing_w, "[^g]")]  # not ending on "g"
- Find all words in words that contain 10 or more letters:
# regex with wildcard and repetition:
p <- "^.{10,}$"      # with anchors
p <- "\\b.{10,}\\b"  # with word boundaries
str_subset(words, p)
Note the much simpler solutions based on measuring the length of words:
words[str_length(words) >= 10]
words[nchar(words) >= 10]
- Find all words in words consisting only of vowels (non-consonants, i.e., the set aeiou):
# regex with anchors:
p <- "^[aeiou]+$"
str_subset(words, p)
Note that there are many alternative solutions:
# detect and logical indexing:
only_vowel <- str_detect(words, p)
words[only_vowel]
# base R solutions:
grep(pattern = p, x = words, value = TRUE)
words[grepl(pattern = p, x = words)]
- Find all words in words consisting only of consonants (i.e., without any vowels):
# negated regex with anchors:
p <- "^[^aeiou]+$"
str_subset(words, p)
Again, there are many alternative solutions:
# detect and logical indexing:
no_vowel_2 <- str_detect(words, p)
words[no_vowel_2]
# detect opposite and use logical indexing:
no_vowel <- !str_detect(words, "[aeiou]")
words[no_vowel]
# base R solutions:
grep(pattern = p, x = words, value = TRUE)
words[grepl(pattern = p, x = words)]
words[!grepl(pattern = "[aeiou]", x = words)]
- Find all words in words that begin and end with the same letter:
# regex with backreference:
p <- "^(.).+\\1$"
str_subset(words, p)
Note that the last solution missed the 1-letter word “a” and would also miss “AA”.
- Why, and how can we fix it?
# The regex with backreference:
p <- "^(.).+\\1$"
# requires a character that is repeated (in 1st and last position)
# plus at least one intermediate character (.+):
str_subset(c("a", "AA", "SOS"), p)
p <- "^(.).*\\1$"  # renders the intermediate characters optional:
str_subset(c("a", "AA", "SOS"), p)
# A solution (the alternative for 1-letter words must be anchored by $):
p <- "^(.).*\\1$|^.$"
str_subset(c("a", "AA", "SOS"), p)
# Alternative solution (the 2nd capture group is referenced by \\2):
p <- "^(.).+\\1$|^(.)\\2$|^.$"
str_subset(c("a", "AA", "SOS"), p)
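When testing such patterns, it helps to include a string that should not match. The following base R sketch compares an alternative for 1-letter words without an end anchor ("^.{1}") to one with an end anchor ("^.$"). The test vector x and the names p_loose and p_strict are made up for illustration, and perl = TRUE is set merely to use one well-defined regex engine.

```r
# Test strings: "ab" does NOT begin and end with the same letter
x <- c("a", "AA", "SOS", "ab")

# Without an end anchor, "^.{1}" matches the start of any non-empty string:
p_loose <- "^(.).*\\1$|^.{1}"
loose <- grepl(p_loose, x, perl = TRUE)

# With an end anchor, "^.$" only matches strings of exactly one character:
p_strict <- "^(.).*\\1$|^.$"
strict <- grepl(p_strict, x, perl = TRUE)

loose   # TRUE TRUE TRUE TRUE
strict  # TRUE TRUE TRUE FALSE
```

A test vector that only contains positive cases (like "a", "AA", and "SOS") cannot reveal that a pattern is too permissive.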
- Quantifying fruits:
Measure some aspects of fruits (from the ds4psy package, but all in lowercase letters):
# Data:
fruits <- tolower(ds4psy::fruits)
Use the vector of fruits to answer the following questions:
- Are there more fruits containing the letter “a” or the letter “e”?
- Are there more fruits containing the letter sequence “ana” or “po”?
# fruits with "a" vs. "e":
sum(str_detect(fruits, pattern = "a"))
sum(str_detect(fruits, pattern = "e"))
str_subset(fruits, pattern = "a")
str_subset(fruits, pattern = "e")
# fruits with "ana" vs. "po":
sum(str_detect(fruits, "ana"))
str_view(fruits, "ana", match = TRUE)
sum(str_detect(fruits, "po"))
str_view(fruits, "po", match = TRUE)
- How many and which fruits contain one (or more) of the letters “x”, “y”, or “z”?
- How many and which fruits that are not berries contain one (or more) of the letters “x”, “y”, or “z”?
# letter x, y, or z:
sum(str_detect(fruits, pattern = "[xyz]"))
str_view(fruits, pattern = "[xyz]", match = TRUE)
# excluding berries:
no_berry <- str_subset(fruits, pattern = "berry", negate = TRUE)
sum(str_detect(no_berry, pattern = "[xyz]"))
str_view(no_berry, pattern = "[xyz]", match = TRUE)
- Create a tibble ft that contains all names of fruits (in lowercase letters) as a column name.
- Add a column len that contains the length of each fruit’s name.
- Add a column n_vow that counts the number of vowels (defined as one of “aeiou”) in each fruit’s name.
- Add a column n_con that counts the number of consonants (defined as non-vowels) in each fruit’s name.
- Verify that len equals the sum of n_vow and n_con for all fruits in ft.
- Which names of fruits contain more than 50% of vowels?
Hint: An example in 14.4 Tools (Wickham & Grolemund, 2017) solves most of this task for the character vector of words.
library(dplyr)  # provides %>%, mutate(), and filter()
# Create tibble:
ft <- tibble::tibble(
  nr = 1:length(fruits),
  name = tolower(fruits)
)
# Adding some (count) variables:
ft <- ft %>%
  mutate(len = nchar(name),
         n_vow = str_count(name, "[aeiou]"),
         n_con = str_count(name, "[^aeiou]"))
# Verify:
all.equal(ft$len, (ft$n_vow + ft$n_con))
#> [1] TRUE
kable(head(ft), caption = "The `head()` of `ft`.")
nr | name | len | n_vow | n_con |
---|---|---|---|---|
1 | acai | 4 | 3 | 1 |
2 | ackee | 5 | 3 | 2 |
3 | apple | 5 | 2 | 3 |
4 | apricot | 7 | 3 | 4 |
5 | avocado | 7 | 4 | 3 |
6 | banana | 6 | 3 | 3 |
Using ft to answer:
- Which names of fruits contain more than 50% of vowels?
ft %>%
  mutate(p_vowels = n_vow/len) %>%
  filter(p_vowels > .50)
- Replacing characters in Trumpisms:
The vector Trumpisms (included in ds4psy) contains 168 words or short phrases frequently used by U.S. president Donald Trump.
Use this vector for some character replacements:
- Replace all instances of “i” by “!” and all instances of “s” by “$”.
- Replace all instances of two repeated letters (e.g., “ll”) by “wall”.
Trumpisms <- ds4psy::Trumpisms  # data
# stringr:
change_T <- c("i" = "!", "s" = "$")
str_replace_all(Trumpisms, pattern = change_T)
str_replace_all(Trumpisms, "(.)\\1", "wall")
# base R:
chartr(old = "is", new = "!$", x = Trumpisms)
gsub(pattern = "(.)\\1", "wall", x = Trumpisms)
- Replacing and translating `Bushisms`:
The vector `Bushisms` (included in ds4psy) contains 22 phrases spoken by or attributed to U.S. president George W. Bush.
Well-known examples include marvels like “They misunderestimated me.” and “Rarely is the question asked: Is our children learning?”
- Replace all instances of “I” by “you,” “my” by “your,” “you” by “I,” and “your” by “my,” respectively.
Bushisms <- ds4psy::Bushisms # data
# stringr:
change_B <- c("I" = "you", "my" = "your",
              "your" = "my", "Your" = "My",
              "you" = "I", "You" = "I")
str_replace_all(Bushisms, pattern = change_B)
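One caveat, as far as we understand the behavior of stringr: a named vector of replacements is applied one pair after another, so a later pair (e.g., “you” to “I”) can undo the effect of an earlier one (“I” to “you”). Short patterns like “I” or “my” also match inside longer words. The following base R sketch swaps whole words in a single pass. The helper `swap_words()` and the test sentences are hypothetical illustrations, not part of ds4psy or the original solution:

```r
# Swap table (names are the words to find, values are their replacements):
swap <- c("I" = "you", "my" = "your", "you" = "I", "your" = "my",
          "You" = "I", "Your" = "My")

# Match any of these words, but only as whole words (\\b marks a word boundary):
p <- paste0("\\b(", paste(names(swap), collapse = "|"), ")\\b")

swap_words <- function(x) {
  m <- gregexpr(p, x, perl = TRUE)
  # Replace each match by looking it up in the swap table:
  regmatches(x, m) <- lapply(regmatches(x, m), function(w) unname(swap[w]))
  x
}

# Hypothetical test sentences (not taken from Bushisms):
s <- c("I like you and your dog.", "You took my hat.", "Is this mystery solved?")
swap_words(s)
```

As every match is replaced exactly once, no replacement can be reversed by a later one, and words that merely contain “I” or “my” are left untouched.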
As children, we sometimes talked in what we called our secret “B-language”: Every occurrence of a vowel was followed by “b” and then repeated. The resulting sentences failed to protect our secrets from our enemies and parents, but sounded pretty funny.
- Translate the set of `Bushisms` into B-language.
str_replace_all(Bushisms, "([aeoui])", "\\1b\\1")
- Flowery phrases:
After all this political talk, we crave some more decorative and charming phrases. Fortunately, the vector `flowery` (included in ds4psy) contains 60 versions and variations of Gertrude Stein’s popular phrase “A rose is a rose is a rose.”
Use this vector (in lowercase letters) for answering the following questions:
- How often do the word “Rose” and its variations occur in `flowery` phrases?
- How many matches can we find for words belonging to some semantic field?
- What is the topic or theme of each phrase?
Note that these questions all address semantic issues, which can be tricky, subject to interpretation, and often require human judgment and heuristic approaches. But let’s see how far we get with our fairly simple tools:
- How often do the word “Rose” and its variations occur in `flowery` phrases?
flowery <- tolower(ds4psy::flowery) # data
# frequency of "Rose" etc:
set_rose <- "rose|rosa|rosy"
sum(str_count(flowery, pattern = set_rose))
str_view_all(flowery, set_rose)
Solving the task
- How many matches can we find for words belonging to some semantic field?
requires that we first look through the `flowery` phrases and identify semantic fields as sets of words belonging to the same category.
For instance, a first set could consist of “garden,” “flower,” “friend,” “love,” and “save,” which are all words with positive connotations that are associated with roses. This set could be contrasted with sets of words that address more negative topics (e.g., “murder,” “thief,” “zombie”).
# (a) define sets:
set_rose_love <- "garden|flower|friend|love|save"
set_horror_crime <- "bitch|bullet|crime|hell|lie|loss|murder|rape|thief|zombie"
set_body_parts <- "belly|breast|gut|head|leg|nose|toe"
set_objects <- "bolder|moon|rock|stein|stone|pebble|thing"
# (b) count occurrences:
sum(str_count(flowery, pattern = set_rose_love))
sum(str_count(flowery, pattern = set_horror_crime))
sum(str_count(flowery, pattern = set_body_parts))
sum(str_count(flowery, pattern = set_objects))
Interestingly, the answers we get depend not only on the data we analyze, but also on the precise questions we ask.
Finally, let’s try to figure out what the `flowery` phrases are about:
- What is the topic or theme of each phrase?
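One way in which the precise question matters is that a pattern like “lie” also matches inside longer words. The following minimal sketch uses a made-up phrase (not taken from the `flowery` data) to contrast substring matches with whole-word matches, where `\\b` marks a word boundary:

```r
library(stringr)

# Hypothetical phrase (not taken from flowery):
phrase <- "i believe a lie is a lie"

n_any   <- str_count(phrase, "lie")        # "lie" anywhere, also within "believe"
n_whole <- str_count(phrase, "\\blie\\b")  # "lie" only as a whole word

c(any = n_any, whole = n_whole)
```

Which of the two counts is appropriate depends on whether word stems or exact words are of interest.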
Solving this task will require some insight or heuristic. A possible approach could ask: Which noun occurs repeatedly in a phrase? The following attempt extracts the first two words of every phrase and then chooses the longer one of them:
# Get the longer of the first two words of each phrase:
# Extract the first two words:
first_two <- str_extract(flowery, pattern = "[:alpha:]+ [:alpha:]+ ")
first_two <- str_trim(first_two)
first_two
# Identifying the first word by regex and measure word lengths:
tb <- tibble(first_two) %>%
  mutate(wrd_1 = str_extract(first_two, "[:alpha:]+(?= )"),
         len_1 = nchar(wrd_1),
         len_2 = nchar(first_two) - len_1 - 1,
         first = len_1 > len_2)
# tb
# simpler:
first_two <- str_split(first_two, " ", simplify = TRUE)
colnames(first_two) <- c("V1", "V2") # name the two columns explicitly
tb <- as_tibble(first_two) %>%
  mutate(len_1 = nchar(V1),
         len_2 = nchar(V2),
         first = len_1 > len_2)
# tb
# Use either the first or second word:
theme_1 <- tb$V1[tb$first] # theme in 1st word
theme_2 <- tb$V2[!tb$first] # theme in 2nd word
themes <- c(theme_1, theme_2)
themes
Note that this result is still sub-optimal. Can you find a better solution?
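One possible direction (a sketch, not the authors’ solution) takes the heuristic above literally and picks the word that occurs most often in a phrase, after dropping some function words. The helper `get_theme()`, the list `stop_words`, and the example phrases are hypothetical and would need to be adapted to the actual `flowery` data:

```r
# Words to ignore (an assumed and incomplete list of function words):
stop_words <- c("a", "an", "the", "is", "are", "and", "of")

get_theme <- function(phrase) {
  w <- strsplit(tolower(phrase), "[^a-z]+")[[1]]  # split into words
  w <- w[nchar(w) > 0 & !(w %in% stop_words)]     # drop empty strings and stop words
  if (length(w) == 0) return(NA_character_)
  names(which.max(table(w)))                      # most frequent remaining word
}

# Hypothetical phrases (not taken from flowery):
p <- c("a rose is a rose is a rose", "the moon is a moon is a stone")
vapply(p, get_theme, character(1), USE.NAMES = FALSE)
```

Unlike the approach above, this keeps one theme per phrase in the original order. In case of ties, `which.max()` returns the first maximum, which here is the alphabetically first word, so the result remains a heuristic.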
References
Omitting argument names is common practice when using a function, but can be dangerous when several arguments are of the same type. As regular expressions are strings in R, inadvertently reversing the `string` and `pattern` arguments is possible, and can yield disastrous results when not noticed. A cheap insurance policy against such mistakes is to always explicate argument names, particularly when programming your own functions (see Chapter 11).