E.2 Essential regex commands
Regular expressions specify patterns in strings of text. The notion of a pattern implies a range of flexibility that can vary from very specific to highly general. For instance, the word “text” could be described as a word that contains the letter sequence t-e-x-t, as a word that begins and ends with the letter t, or as a four-letter word. Each of these descriptions is a pattern that matches the word “text.” But whereas the first pattern would only match this particular word, the second and third would also match the word “test,” and the third would additionally match the word “four.” Whenever we match a pattern, we aim for the sweet spot between searches that are too specific and searches that are too general. Thus, when learning to write regular expressions, we need means and tools that allow striking the right balance between specificity and generality.
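The three descriptions of the word “text” can be written down as regular expressions. The following is only a minimal sketch in base R: the vector candidates is a hypothetical example (not part of the data used in this section), and the anchors and quantifiers it uses are introduced later in this section.

```r
# Hypothetical example words (not part of the tests data):
candidates <- c("text", "test", "four", "toast")

# 1. Contains the letter sequence t-e-x-t:
grepl("text", candidates)
#> [1]  TRUE FALSE FALSE FALSE

# 2. Begins and ends with the letter t:
grepl("^t.*t$", candidates)
#> [1]  TRUE  TRUE FALSE  TRUE

# 3. Consists of exactly four characters:
grepl("^.{4}$", candidates)
#> [1]  TRUE  TRUE  TRUE FALSE
```

The more general a pattern is, the more strings it matches, including some that we may not have intended.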
To begin our expedition into the realm of regular expressions, we will primarily explore the character vector tests (defined above):
tests
#> [1] "abc"
#> [2] "ABC"
#> [3] "a.c"
#> [4] "a_c"
#> [5] "a\\c"
#> [6] "ac/dc"
#> [7] "2+4=6 0-9 2^3 2518 9612708"
#> [8] "The cat, sat mat, etc., fat dad."
#> [9] "Us or them?"
#> [10] "Been there, (seen it --- at last), done that."
#> [11] "Not act, is bad, so sad!"
As we proceed to more advanced aspects of regular expressions, we will use more specialized collections of data, which are either defined on the fly or were specified above (see Section E.1.2).
E.2.1 Character sequences
Regular expressions specify patterns in character strings, but are also provided as character strings (i.e., enclosed by quotation marks, as in pattern = "at" above). Consequently, character symbols are not only the basic building blocks of character strings (i.e., text objects or data of type character), but also of regular expressions (i.e., abstract descriptions of patterns in text data). As the difference between strings that are text objects and strings that are regular expressions lies only in their intended use, this similarity can be confusing. And although the functionally different roles of character strings in R are often convenient, they also create some conflicts, as we will see in the next section.
Sequences of letters — or words — and any numeric digits used in regular expressions match themselves:
# Letters (case-sensitive):
str_view_all(tests, "a", match = TRUE)
str_view_all(tests, "A", match = TRUE)
# Digits:
str_view_all(tests, "1", match = TRUE)
str_view_all(tests, "12", match = TRUE)
# Words or word parts:
str_view_all(tests, "een", match = TRUE)
str_view_all(tests, "at", match = TRUE)
The same is true for many characters that are neither letters nor digits:
# Matching non-letters:
str_view_all(tests, "-", match = TRUE)
str_view_all(tests, "=", match = TRUE)
str_view_all(tests, ",", match = TRUE)
str_view_all(tests, " ", match = TRUE)
However, we will soon see that many characters require special treatments to be used within regular expressions (see Section E.2.2).
What about the other Unicode characters that Chapter 9 (on Strings of text) taught us to appreciate and type? We can try matching Unicode characters by using our epitome of cultural insights (from Section 9.2.2):
k2 <- "Hansj\U00F6rg says: 'Der K\U00E4sereichtum \U00D6sterreichs ist ungew\U00F6hnlich gro\U00DF.'"
Just as there were different ways of typing Unicode characters in a character string, we can use these characters in regular expressions (which are character strings) in various ways.
Here are three different ways of entering an Umlaut character (e.g., ö) within a regular expression:
# Three ways of matching rare (Umlaut) characters:
str_view_all(k2, "ö", match = TRUE) # a) typing Umlaut
str_view_all(k2, "\U00F6", match = TRUE) # b) Unicode Umlaut
str_view_all(k2, Umlaut["o"], match = TRUE) # c) ds4psy Umlaut
Fortunately, all three ways of using the Umlaut character ö within a regex yield the same result.
Hence, we can use and match Unicode characters in regular expressions. (See Sections 9.2.2 and 9.8.2 for additional information and resources on Unicode characters.)
Practice
- Predict the outputs of the following commands if p is changed to "A", then verify your predictions by evaluating the commands.
p <- "A"
grep(p, tests) # returns numeric indices of hits
grepl(p, tests) # returns a logical vector
grep(p, tests, value = TRUE) # returns hits
- Predict the outputs of the following commands if p is changed to "sa", then verify your predictions by evaluating the commands.
p <- "sa"
grep(p, tests) # returns numeric indices of hits
grepl(p, tests) # returns a logical vector
grep(p, tests, value = TRUE) # returns hits
- Combine the datasets provided by Bushisms and Trumpisms into a vector BT and then search it for all elements containing the following character sequences or words:
- “big”
- “est”
- “mis”
- “child”
- “country”
- “America”
Here’s an example:
# Data:
BT <- c(Bushisms, Trumpisms)
str_view_all(BT, "mis", match = TRUE)
E.2.2 Meta-characters and escaping
In Chapter 9 (on Strings of text), the existence of metacharacters and character constants was only mentioned (in Section 9.2.3). Now we are in a position to learn which special meanings these characters have within regular expressions and how they can be matched when they appear in strings of text.
The 12 so-called metacharacters in R are:
. \ | ( ) [ { ^ $ * + ?
They are documented in ?regex. To provide convenient access to them, the ds4psy package defines a character vector metachar that contains these characters.
The . as wildcard vs. dot
The first metacharacter in metachar is the dot . (aka. “period” or “full stop”).
In regular expressions, the dot . serves as a wildcard character that matches any single character (except the newline character \n):
str_view_all("abc ABC 123 ,;: <([{-}])> .?!", ".", match = TRUE)
This degree of flexibility makes matching . pretty useless in itself, as matching everything is typically not very helpful.
However, using one or more wildcards becomes very powerful in combination with other characters:
str_view_all(tests, "a.", match = TRUE) # "a" followed by any other character
str_view_all(tests, ".a", match = TRUE) # "a" preceded by any other character
str_view_all(tests, "a...a", match = TRUE) # 2 "a" exactly 3 characters apart
But the fact that the dot symbol can appear both in our data and as a wildcard in a regular expression creates a conundrum:
If a . matches any arbitrary character, how can we match the character symbol "." (e.g., used after abbreviations or at the end of a sentence)?
The answer is that we need to signal to R that we want to use the . symbol not in its special meaning, but as an ordinary character symbol.
Thus, we need to “escape” from its special meaning.
An escape from a symbol’s special meaning is achieved by preceding the symbol by a backslash \.
But another glance at the set of metacharacters shows that \ also happens to be a character with special meaning.
So how can we use it within a regular expression?
Well, by escaping from its special meaning (i.e., by preceding it with a backslash \).
Thus, we can search for a literal dot symbol . by preceding it by two backslash characters:
str_view_all(tests, "\\.", match = TRUE)
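The difference between the wildcard and the escaped dot can also be checked with base R. This is a minimal sketch: the vector x is a hypothetical sample, and fixed = TRUE is shown as an alternative that switches off regular expression matching altogether.

```r
x <- c("abc", "a.c", "etc.")   # hypothetical sample strings

grepl(".", x)                  # wildcard: any character matches
#> [1] TRUE TRUE TRUE
grepl("\\.", x)                # escaped: only a literal dot matches
#> [1] FALSE  TRUE  TRUE
grepl(".", x, fixed = TRUE)    # alternative: treat the pattern as plain text
#> [1] FALSE  TRUE  TRUE

nchar("\\.")                   # the regex itself has 2 characters: \ and .
#> [1] 2
```

The two backslashes typed in R code become a single backslash in the stored string, which the regex engine then reads as an escape of the dot.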
Following the same logic also allows us to search for other metacharacters, like ?, ^, +, or parentheses ():
str_view_all(tests, "\\?", match = TRUE)
str_view_all(tests, "\\^", match = TRUE)
str_view_all(tests, "\\+", match = TRUE)
str_view_all(tests, "\\(", match = TRUE)
str_view_all(tests, "\\)", match = TRUE)
Matching the backslash \
Having understood the need for double-backslashes, matching a literal backslash \ is still challenging.
To write a \ inside a string, we need to escape the special meaning of \, hence write "\\".
However, to match a \ in a regular expression, we need to escape it there as well.
As a consequence, we need to use \\\\ (indeed, no fewer than four backslashes) to match a single \:
# Matching a \ in a string:
str_view_all(tests, "\\\\", match = TRUE)
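Counting characters helps to see why four backslashes are needed. The following base R sketch uses a hypothetical string x; nchar() and writeLines() show what is actually stored, as opposed to what is typed.

```r
x <- "a\\c"          # typed with two backslashes, stored with one
nchar(x)
#> [1] 3
writeLines(x)        # prints the string as stored
#> a\c

rx <- "\\\\"         # typed with four backslashes, stored as two: \\
nchar(rx)
#> [1] 2
grepl(rx, x)         # the regex \\ matches one literal backslash
#> [1] TRUE
grepl(rx, "abc")
#> [1] FALSE

grepl("\\", x, fixed = TRUE)  # without regex, one stored backslash suffices
#> [1] TRUE
```

So the first level of escaping is consumed by R when it reads the string, and the second level by the regex engine when it reads the pattern.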
Character constants
Note that, besides metacharacters, there are also so-called character constants.
These are characters with a special meaning in R that are also preceded by a backslash \.
The most common character constants are:
- \n newline
- \r carriage return
- \t tab
- \b backspace
- \f form feed
- \' ASCII apostrophe
- \" ASCII quotation mark
Evaluate ?"'" to obtain a complete list of character constants.
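A quick way to convince ourselves that each character constant is a single character is to count and match it. This is a small base R sketch with a hypothetical string s:

```r
s <- "a\tb\nc"       # contains a tab and a newline
nchar(s)             # each constant counts as a single character
#> [1] 5
writeLines(s)        # writeLines() prints the tab and the line break

grepl("\t", s)       # a character constant matches itself
#> [1] TRUE
grepl("\n", "abc")
#> [1] FALSE
```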
Practice
- Matching dots:
We saw that we can match any literal dot . by escaping its special meaning (as a wildcard):
str_view_all(tests, "\\.", match = TRUE)
But notice that there are (at least) two kinds of dots:
- Can we distinguish between inline dots (typically signaling abbreviations) and final dots (typically signaling the end of a sentence)?
Well, we can easily identify non-final dots by searching for an escaped dot \\. that is followed by another character (i.e., a wildcard dot .):
str_view_all(tests, "\\..", match = TRUE) # a non-final "."
This was easy, but finding a dot . that signals the end of a character string is tricky (as long as we do not yet know about anchors, which will be discussed in Section E.2.4 below):
str_view_all(tests, "\\.$", match = TRUE) # a final "."
However, a quick hack could add a space " " to any string and then search for a dot followed by a space:
tests_s <- paste0(tests, " ") # add a " " to any string in tests
str_view_all(tests_s, "\\. ", match = TRUE) # find "." followed by " "
- Matching meta- and other cryptic characters:
The following string cryptix contains 50 characters that are a mix of meta-characters and non-meta-characters:
cryptix
#> [1] "!=[$/]\\[%</:=),{>|/*}?(&(.<\\.!$|*,/#:.%(.*+-[\\%\\^|"
Use this string to complete the following tasks:
- Define cryptix as !=[$/]\[%</:=),{>|/*}?(&(.<\.!$|*,/#:.%(.*+-[\%\^|.
- Inspect cryptix to determine which metacharacters are contained in it.
- Inspect cryptix to determine which non-metacharacters are contained in it.
- Construct a series of stringr str_view_all() commands that selectively find every character contained in cryptix.
For metacharacters, these commands could be:
For non-metacharacters, these commands could be:
Note that the results of searching and printing the cryptix string can be rather unpredictable.
Thus, dealing with a mix of cryptic characters remains messy, even with sophisticated tools.
E.2.3 Character classes
Beyond matching specific characters, regular expressions provide means of matching entire classes (or types) of characters. We can distinguish between three types of character classes:
- Specific character classes are the most common symbols in most texts:
  - [:lower:] lower-case letters
  - [:upper:] upper-case letters
  - [:alpha:] alphabetic characters: [:lower:] and [:upper:]
  - [:digit:] digits: 0 1 2 3 4 5 6 7 8 9
  - [:punct:] punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
- Spacing and control characters are mostly invisible, yet very common and important for structuring text:
  - [:blank:] blank characters (space, tab \t, and possibly other locale-dependent characters, such as non-breaking spaces)
  - [:cntrl:] control characters (e.g., tab \t, newline \n, carriage return \r)
  - [:space:] space characters (e.g., tab, newline, form feed, carriage return, space, and others)
- Classes of character classes are generalizations of other categories:
  - [:alnum:] alphanumeric characters: [:alpha:] and [:digit:]
  - [:graph:] graphical characters: [:alnum:] and [:punct:]
  - [:print:] printable characters: [:alnum:], [:punct:], and the space character
The ds4psy package contains a (named) vector cclass that is suited to illustrate these commands.
cclass contains different character classes (as character strings) and allows selecting each class by an abbreviated name:
names(cclass)
#> [1] "ltr" "LTR" "dig" "hex" "pun" "spc"
cclass["ltr"] # lowercase letters
#> ltr
#> "abcdefghijklmnopqrstuvwxyz"
cclass["LTR"] # uppercase LETTERS
#> LTR
#> "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
cclass["dig"] # decimal digits
#> dig
#> "0123456789"
cclass["hex"] # hexadecimal digits
#> hex
#> "0123456789ABCDEFabcdef"
cclass["pun"] # punctuation characters
#> pun
#> "!#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"
cclass["spc"] # 4 different spaces
#> spc
#> " \t \n \r"
We now can try to match the contents of cclass by regular expressions that are designed for matching entire character classes.
The following commands show the results of the corresponding matches:
- Specific character classes:
# 1: Common character classes:
str_view_all(cclass, "[:lower:]", match = TRUE)
str_view_all(cclass, "[:upper:]", match = TRUE)
str_view_all(cclass, "[:alpha:]", match = TRUE)
str_view_all(cclass, "[:digit:]", match = TRUE)
str_view_all(cclass, "[:punct:]", match = TRUE)
str_view_all(cclass, "[:xdigit:]", match = TRUE)
Note that some metacharacters are not matched by [:punct:], but can be matched by escaping the corresponding symbol (see Section E.2.2 above):
mcv <- paste(ds4psy::metachar, collapse = " ")
# Matches of "[:punct:]":
str_view_all(mcv, "[:punct:]", match = TRUE)
# Matching special metachars:
str_view_all(mcv, "\\|", match = TRUE)
str_view_all(mcv, "\\^", match = TRUE)
str_view_all(mcv, "\\$", match = TRUE)
str_view_all(mcv, "\\+", match = TRUE)
- Spacing and control characters:
# 2. Spaces and control characters:
str_view_all(cclass, "[:blank:]", match = TRUE)
str_view_all(cclass, "[:cntrl:]", match = TRUE)
str_view_all(cclass, "[:space:]", match = TRUE)
- Classes of character classes:
# 3. Classes of character classes:
str_view_all(cclass, "[:alnum:]", match = TRUE)
str_view_all(cclass, "[:graph:]", match = TRUE)
str_view_all(cclass, "[:print:]", match = TRUE)
When using base R commands, the character classes enclosed in brackets must be enclosed in an additional set of brackets.
For instance, we can find all strings in cclass with alphanumeric characters by the following grep() command:
grep("[[:alnum:]]", cclass, value = TRUE)
#> ltr LTR
#> "abcdefghijklmnopqrstuvwxyz" "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
#> dig hex
#> "0123456789" "0123456789ABCDEFabcdef"
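The double-bracket form can be tried without the ds4psy package. The following base R sketch uses a small hypothetical stand-in x for cclass (its names and contents are assumptions for illustration only):

```r
# Hypothetical stand-in for cclass (named character vector):
x <- c(ltr = "abc", dig = "0123", pun = "!?", spc = " \t")

grepl("[[:alnum:]]", x)   # letters or digits
#> [1]  TRUE  TRUE FALSE FALSE
grepl("[[:punct:]]", x)   # punctuation characters
#> [1] FALSE FALSE  TRUE FALSE
grepl("[[:space:]]", x)   # space characters
#> [1] FALSE FALSE FALSE  TRUE

# Several classes can be combined within one outer pair of brackets:
grepl("[[:alpha:][:digit:]]", x)
#> [1]  TRUE  TRUE FALSE FALSE

names(x)[grepl("[[:digit:]]", x)]
#> [1] "dig"
```

The outer brackets define a set of characters, and the inner [:name:] expressions are members of that set. This is why several classes can share one outer pair of brackets.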
Alternative options for matching some classes of characters are provided by the following escape sequences:
- \d matches any digit (\D any non-digit)
- \s matches any space character (e.g., space, tab \t, newline \n; \S any non-space character)
- \w matches any word character (letter, digit, or underscore in the current locale; \W any non-word character)
And remember: To enter the character \ within a string (as regex are written as strings), it needs to be escaped by an additional \:
str_view_all(cclass, "\\d", match = TRUE)
str_view_all(cclass, "\\D", match = TRUE)
str_view_all(cclass, "\\s", match = TRUE)
str_view_all(cclass, "\\S", match = TRUE)
str_view_all(cclass, "\\w", match = TRUE)
str_view_all(cclass, "\\W", match = TRUE)
We are not showing the results of all these commands here, but feel free to try them out in your console.
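As one way of seeing what these escape sequences select, we can remove all matches of a negated class with base R's gsub(). This is a minimal sketch on a hypothetical string x:

```r
x <- "2+4=6 a_c\tEnd."   # hypothetical sample string (with a tab)

gsub("\\D", "", x)       # remove all non-digits
#> [1] "246"
gsub("\\s", "", x)       # remove all space characters (space and tab)
#> [1] "2+4=6a_cEnd."
gsub("\\W", "", x)       # remove all non-word characters
#> [1] "246a_cEnd"
```

Note that the underscore survives the last command, as it counts as a word character.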
Practice
- Non-characteristic fruits:
- Are there any fruits that contain characters that are not [:alnum:]?
Hint: Yes, there are. Just search for character classes that are not contained in [:alnum:].
str_view_all(fruits, "[:space:]", match = TRUE)
str_view_all(fruits, "[:punct:]", match = TRUE)
Note: It would be nice to instruct R to “Find anything containing elements not in [:alnum:]!”
And since there’s a regex for everything, we will learn how to negate a pattern below.
- Surrounded spaces:
- Write several regular expressions that match any space " " that is surrounded by three characters on either side in tests.
Some examples include:
# Matching a space preceded and followed by 3 characters:
str_view_all(tests, "... ...", match = TRUE)
str_view_all(tests, "...[:blank:]...", match = TRUE)
str_view_all(tests, "...[:space:]...", match = TRUE)
str_view_all(tests, "...\\s...", match = TRUE)
E.2.4 Anchors
Anchors allow matching patterns in strings at two prominent positions:
- ^ matches the start of a string
- $ matches the end of a string
Some straightforward examples for using anchors (using the grep() function) include:
grep(pattern = "^A", x = tests, value = TRUE)
#> [1] "ABC"
grep(pattern = "c$", x = tests, value = TRUE)
#> [1] "abc" "a.c" "a_c" "a\\c" "ac/dc"
Note the order of characters: As the anchor $ matches the end of the string, any character required to be at the end of the string needs to appear before it in the regular expression.
The corresponding stringr commands with anchors are:
str_view_all(tests, "^A", match = TRUE)
str_view_all(tests, "c$", match = TRUE)
Using anchors in combination with more general patterns (e.g., patterns that match entire character classes) makes them quite powerful tools. For instance, we now can search for the beginning and end of sentences:
str_view_all(tests, "^[:upper:]", match = TRUE)
str_view_all(tests, "[:punct:]$", match = TRUE)
Their functionality as anchors explains the special meaning of the metacharacters ^ and $, but note that their position also matters. And remember: To match a literal ^ and $ in a string (e.g., in metachar), we need to escape them (see Section E.2.2 above):
# Using ds4psy::metachar:
str_view_all(metachar, "\\^", match = TRUE)
str_view_all(metachar, "\\$", match = TRUE)
Additional anchors that are mostly used for matching words (rather than strings) are:
- \b matches the empty string at either boundary/edge of a word
- \B matches empty strings that are NOT at word boundaries/edges
str_view_all("This is a sentence.", "\\b", match = TRUE)
str_view_all("This is a sentence.", "\\B", match = TRUE)
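Word boundaries are most useful when combined with other characters. The following base R sketch uses a hypothetical vector x and sets perl = TRUE (an assumption made here to fix the regex flavor) to show how \b restricts a match to whole words:

```r
x <- c("cat", "concat", "cat, sat", "cats")  # hypothetical sample strings

grepl("cat", x)                      # "cat" anywhere in the string
#> [1] TRUE TRUE TRUE TRUE
grepl("\\bcat\\b", x, perl = TRUE)   # "cat" as a whole word only
#> [1]  TRUE FALSE  TRUE FALSE
```

A comma or a space marks a word boundary, but the letters n (in "concat") and s (in "cats") do not.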
Practice
- Accounting for availability:
We apparently think that there are more words starting with a specific letter than ending on the same letter. This is often explained by the so-called availability heuristic (Tversky & Kahneman, 1974): Apparently, it is easier to recall exemplars by their first letter than by their last letter.
- Test this assumption by first trying to recall fruits starting or ending on a specific letter l (for several letters, e.g., E, L, K, Y).
- Then detect and count the number of corresponding fruits by regex searches.
str_view_all(fruits, "^E", match = TRUE)
str_view_all(fruits, "e$", match = TRUE)
str_view_all(fruits, "^L", match = TRUE)
str_view_all(fruits, "l$", match = TRUE)
str_view_all(fruits, "^K", match = TRUE)
str_view_all(fruits, "k$", match = TRUE)
str_view_all(fruits, "^Y", match = TRUE)
str_view_all(fruits, "y$", match = TRUE)
- Analyzing Bushisms:
Analyze the set of Bushisms to answer the following questions:
- Are there more sentences starting with I or with You?
- Are there more Bushisms that end on a question (i.e., with a final ?) or that contain a question (i.e., with a non-final ?)?
str_view_all(Bushisms, "^I", match = TRUE)
str_view_all(Bushisms, "^You", match = TRUE)
str_view_all(Bushisms, "\\?$", match = TRUE)
str_view_all(Bushisms, "\\? ", match = TRUE)
E.2.5 Alternates, groups, and negation
Whereas anchors make our searches more specific (by requiring patterns to occur in specific positions), the use of other operators and conventions makes them more general. A key step towards more general patterns consists in specifying two or more alternatives:
- a|b matches a or b
- [abc] matches one of a, b, or c
- [a-c] matches any character in the range from a to c
- [^abc] matches anything but a, b, or c
In addition to these uses of square brackets [] to specify alternatives, this section also introduces the use of round parentheses () for grouping purposes. But let’s proceed step-by-step:
- Using the metacharacter | in a regular expression "a|b" matches a or b (or both):
str_view_all(c("abc", "123", "ab123", "abc12", "abc123"), pattern = "a|3")
str_view_all(c("abc", "123", "ab123", "abc12", "abc123"), pattern = "abc|123")
str_view_all(tests, "ad|at", match = TRUE)
str_view_all(tests, "at|ct|et|it|ot|st", match = TRUE)
- Enclosing characters in square brackets [] provides a more general way of specifying a group of characters to be matched. We have already seen the [...] construct when matching character classes above (see Section E.2.3), but now realize that the ... can also contain groups or ranges of characters to be matched:
str_view_all(c("abc", "123", "ab123", "abc12", "abc123"), pattern = "[a3]")
str_view_all(c("abc", "123", "ab123", "abc12", "abc123"), pattern = "[abc]")
str_view_all(tests, "[st]", match = TRUE)
str_view_all(tests, "[/,-=!]", match = TRUE)
# Alternative metacharacters:
str_view_all(tests, "[\\.\\+\\^\\(\\)]", match = TRUE)
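A word of caution on hyphens inside square brackets: between two characters, a - denotes a range. In a pattern like "[/,-=!]", the part ,-= is therefore read as the range from the comma to the equal sign, which also includes the digits and some other symbols. The following base R sketch (with a hypothetical vector x and perl = TRUE to obtain ranges by character code) shows the difference to a set in which the hyphen is placed last:

```r
x <- c("a-b", "2518", "a=b", "abc")   # hypothetical sample strings

# Hyphen between "," and "=" creates a range (which includes the digits):
grepl("[/,-=!]", x, perl = TRUE)
#> [1]  TRUE  TRUE  TRUE FALSE

# Hyphen at the end of the set is taken literally:
grepl("[/,=!-]", x, perl = TRUE)
#> [1]  TRUE FALSE  TRUE FALSE
```

When a literal hyphen is intended, placing it first or last within the brackets avoids the range interpretation.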
Note the subtle, but important difference between matching a pattern "ab|c" (i.e., matching ab or c) and matching a pattern "[abc]" (i.e., matching a or b or c):
str_view_all(c("ab", "abc", "a b c"), pattern = "abc")
str_view_all(c("ab", "abc", "a b c"), pattern = "ab|c")
str_view_all(c("ab", "abc", "a b c"), pattern = "[abc]")
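The difference becomes explicit when we extract the individual matches. The following base R sketch defines a hypothetical helper all_matches() (not part of any package) that returns all matches of a pattern in a single string:

```r
# Hypothetical helper: all matches of a pattern in one string
all_matches <- function(pattern, x) {
  regmatches(x, gregexpr(pattern, x))[[1]]
}

all_matches("abc", "abc")      # the sequence a-b-c
#> [1] "abc"
all_matches("ab|c", "abc")     # the sequence a-b, or c
#> [1] "ab" "c"
all_matches("[abc]", "abc")    # any single a, b, or c
#> [1] "a" "b" "c"
all_matches("ab|c", "a b c")   # only c matches here
#> [1] "c"
```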
- Square brackets [] also allow specifying alphabetic or numeric ranges by -:
str_view_all(c("abc", "123", "ab123", "abc12", "abc123"), pattern = "[b-z]")
str_view_all(c("abc", "123", "ab123", "abc12", "abc123"), pattern = "[1-2]")
str_view_all(tests, "[A-N]", match = TRUE)
str_view_all(tests, "[0-4]", match = TRUE)
- Round parentheses () allow grouping patterns. This can be used for merely illustrative purposes (which can be very helpful for clarifying complex regular expressions):
str_view_all(c("abc", "123", "ab123", "abc12", "abc123"), pattern = "(a)(b)(c)")
str_view_all(c("abc", "123", "ab123", "abc12", "abc123"), pattern = "(a)(bc)")
str_view_all(c("abc", "123", "ab123", "abc12", "abc123"), pattern = "[(a)(b)(c)]")
str_view_all(c("abc", "123", "ab123", "abc12", "abc123"), pattern = "[(a)((bc))]")
str_view_all(tests, "[(d)(st)]", match = TRUE)
but becomes essential when combining various options — like a required plus an optional part — for constructing more complex patterns:
str_view_all(c("abc", "123", "ab123", "abc12", "abc123"), pattern = "b(c|1)")
str_view_all(tests, "a(d|t)", match = TRUE)
str_view_all(tests, "[a-s]t", match = TRUE)
str_view_all(tests, "( )(s|t)", match = TRUE)
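Why parentheses are essential here is easiest to see by leaving them out. In the following base R sketch (using hypothetical example words), the alternation | without parentheses splits the entire pattern, whereas the parentheses limit it to one position:

```r
x <- c("gray", "grey", "gra", "ey")   # hypothetical sample strings

# With grouping: "gr", then "a" or "e", then "y":
grepl("^gr(a|e)y$", x)
#> [1]  TRUE  TRUE FALSE FALSE

# Without grouping: either "^gra" or "ey$":
grepl("^gra|ey$", x)
#> [1] TRUE TRUE TRUE TRUE
```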
Another powerful tool for specifying sets or ranges of characters (or general patterns) is provided by negating a given set or range (i.e., excluding characters from matches).
- Preceding a set or range of characters by ^ (e.g., [^a-z]) negates the set or range (i.e., excludes any characters within the set or range from matches):
str_view_all(c("abc", "123", "ab123", "abc12", "abc123"), pattern = "[^abc]")
str_view_all(c("abc", "123", "ab123", "abc12", "abc123"), pattern = "[^a-c]")
str_view_all(c("abc", "123", "ab123", "abc12", "abc123"), pattern = "[^1-9]")
str_view_all(c("abc", "123", "ab123", "abc12", "abc123"), pattern = "[^A-Z]")
Note that this particular use of the symbol ^ illustrates that the meaning of meta-characters is ambiguous and depends on the context in which they appear: Whereas ^ was used as an anchor when preceding a regex "^..." (see Section E.2.4), it acts as a negation symbol when it is used inside square brackets [^...]:
str_view_all(tests, "^[a-z]", match = TRUE)
str_view_all(tests, "[^a-z]", match = TRUE)
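The two roles of ^ can be contrasted on a few short strings. This base R sketch uses a hypothetical vector x and sets perl = TRUE, so that the range a-z is interpreted by character codes rather than by the current locale:

```r
x <- c("abc", "ABC", "a.c")   # hypothetical sample strings

# ^ as anchor: string starts with a lowercase letter
grepl("^[a-z]", x, perl = TRUE)
#> [1]  TRUE FALSE  TRUE

# ^ as negation: string contains a character that is not a lowercase letter
grepl("[^a-z]", x, perl = TRUE)
#> [1] FALSE  TRUE  TRUE
```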
Here are further examples for combining alternatives, ranges, and negations. Try to predict their results before evaluating them to verify your predictions:
str_view_all(tests, "a|e|i|o|u") # any (lowercase) vowel
str_view_all(tests, "[aeiou]") # any (lowercase) vowel
str_view_all(tests, "[^aeiou]") # any non-(lowercase)-vowel
str_view_all(tests, "[a-m]") # range of characters
str_view_all(tests, "[^a-m]") # negation
str_view_all(tests, "[a-n]|[A-N]") # range 1 or range 2
str_view_all(tests, "[(a-n)(A-N)]") # 2 ranges
str_view_all(tests, "[^(a-n)(A-N)]") # negation
Overall, the use of |, [], (), and the negation of ranges via ^, provides the components of a language that allows the expression of quite powerful regular expressions.
Practice
- Alternative spaces and punctuations:
- Predict and explain the results of the following commands:
str_view_all(tests, "[:space:][:punct:]")
str_view_all(tests, "[:space:]|[:punct:]")
str_view_all(tests, "[^[:space:][:punct:]]")
str_view_all(tests, "[^[:space:]]|[:punct:]")
str_view_all(tests, "[^[:space:]]|[^:punct:]")
- Mixing anchors and metacharacters:
What do the following regular expressions match?
"[\\\\]$"
"[^\\\\]"
"[^\\\\]$"
"^[\\\\\\$]"
"[\\\\\\^]$"
"[\\\\\\^\\$]"
Construct str_view_all() expressions that check and verify your predictions.
Hint: As the regex involve the backslash \ as well as the ^ and $ symbols (and the roles of the latter two as anchors), we need a test string that includes those symbols (with escaping) in different positions.
# Create test string:
backdoll <- c("\\", "$", "\\ ^ $", "$ ^ \\", " \\ $ ^", " \\ $.")
# Verify predictions:
str_view_all(backdoll, "[\\\\]$") # any final \
str_view_all(backdoll, "[^\\\\]") # any non-\
str_view_all(backdoll, "[^\\\\]$") # any final non-\
str_view_all(backdoll, "^[\\\\\\$]") # any initial \ or $
str_view_all(backdoll, "[\\\\\\^]$") # any final \ or ^
str_view_all(backdoll, "[\\\\\\^\\$]") # any \ ^ $
- Non-characteristic fruits with negation:
In an earlier task (above), we answered the following question:
- Are there any fruits that contain characters that are not [:alnum:]?
by searching for character classes not contained in [:alnum:] (like [:space:] or [:punct:]).
Knowing about the negation of ranges, we now can ask:
- Can we find the same fruits by negating [:alnum:]?
str_view_all(fruits, "[:punct:]|[:space:]", match = TRUE)
str_view_all(fruits, "[^[:alnum:]]", match = TRUE)
- Regular presidential expressions:
- Describe the goals of the following regex patterns prior to running them (to verify your predictions).
str_view_all(Trumpisms, "e[:space:](.)", match = TRUE)
str_view_all(Trumpisms, "^....$", match = TRUE)
str_view_all(Trumpisms, "[A-Z]", match = TRUE)
str_view_all(Bushisms, "[:punct:][^[:space:]]", match = TRUE)
str_view_all(Bushisms, "(^[A-Z])|([:punct:][:space:][A-Z])")
str_view_all(Bushisms, "((^.. )|( .. )|( ..$))", match = TRUE)
E.2.6 Repetition
Yet another way of fine-tuning our searches for patterns is provided by specifying how many times a (part of a) pattern is to be matched. To search for a specific number of occurrences, a regular expression (regex) may be followed by a repetition quantifier:
- ?: The preceding regex will be matched at most once (0-1).
- *: The preceding regex will be matched zero or more times (0+).
- +: The preceding regex will be matched one or more times (1+).
# Data to be matched: Pseudo-sizes
ps <- "XXS XS S M X L XL XXL XLXL XXXL XLL XLLL XLLLL"
str_view_all(ps, "XL", match = TRUE) # X and L
str_view_all(ps, "XL?", match = TRUE) # X + 0 or 1 L
str_view_all(ps, "XL*", match = TRUE) # X + 0+ L
str_view_all(ps, "XL+", match = TRUE) # X + 1+ L
Note that the quantifier in the last three examples only applied to the character L immediately preceding it.
If we wanted to quantify the repetition of “XL” or of any “X” or “L,” we would have needed to group both characters by parentheses (XL) or brackets [XL] (see Section E.2.5).
A more general way of requiring a specific number or range of repetitions is provided by enclosing one or two numbers (n, or n and m) inside of curly brackets {}:
- {n}: The preceding regex is matched exactly n times (n).
- {n,}: The preceding regex is matched n or more times (n+).
- {n,m}: The preceding regex is matched at least n times, but not more than m times (n-m).
str_view_all(ps, "XL{2}", match = TRUE)
str_view_all(ps, "XL{3,}", match = TRUE)
str_view_all(ps, "XL{2,3}", match = TRUE)
The {n,} and {n,m} constructs are more general than the use of ?, *, and +, as the latter can easily be re-written as:
- ? corresponds to {0,1}
- * corresponds to {0,}
- + corresponds to {1,}
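These correspondences can be checked by comparing the matches of both notations. The following base R sketch re-uses the string ps from above and defines a hypothetical helper all_matches() (not part of any package) for extracting all matches:

```r
ps <- "XXS XS S M X L XL XXL XLXL XXXL XLL XLLL XLLLL"

# Hypothetical helper: all matches of a pattern in ps
all_matches <- function(pattern) regmatches(ps, gregexpr(pattern, ps))[[1]]

identical(all_matches("XL?"), all_matches("XL{0,1}"))
#> [1] TRUE
identical(all_matches("XL*"), all_matches("XL{0,}"))
#> [1] TRUE
identical(all_matches("XL+"), all_matches("XL{1,}"))
#> [1] TRUE

all_matches("XL{2}")   # X followed by exactly two L
#> [1] "XLL" "XLL" "XLL"
```

Note that "XL{2}" also finds a match within "XLLL" and "XLLLL", as the pattern does not require that the match is followed by something other than an L.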
Combining multiple repetition quantifiers can be powerful, but also confusing. Here are some examples that show how minor changes can make potentially crucial differences:
str_view_all(ps, "X?L+", match = TRUE)
str_view_all(ps, "X+L?", match = TRUE)
str_view_all(ps, "X+L*", match = TRUE)
str_view_all(ps, "X+L+", match = TRUE)
str_view_all(ps, "X?[XL]+", match = TRUE)
str_view_all(ps, "X+[XL]?", match = TRUE)
Repeated matches are greedy by default, so that the maximal possible number of repetitions is found.
This can be changed to minimal by appending ? to the quantifier:
str_view_all(ps, "XL+", match = TRUE)
str_view_all(ps, "XL+?", match = TRUE)
str_view_all(ps, "X?XL{1,4}", match = TRUE)
str_view_all(ps, "X?XL{1,3}?", match = TRUE)
Note that the question mark ? in "X?XL{1,3}?" has two different meanings:
Whereas the first is a repetition quantifier, the second switches the preceding repetition quantifier to matching in a non-greedy fashion.
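The difference between greedy and non-greedy matching can be seen by extracting the first match from a single string. This is a minimal base R sketch on a hypothetical string x, using perl = TRUE (an assumption made here to fix the regex flavor):

```r
x <- "XLLLL"   # hypothetical sample string

# Greedy (default): as many L as possible
regmatches(x, regexpr("XL+", x, perl = TRUE))
#> [1] "XLLLL"
regmatches(x, regexpr("XL{1,3}", x, perl = TRUE))
#> [1] "XLLL"

# Non-greedy (lazy): as few L as possible
regmatches(x, regexpr("XL+?", x, perl = TRUE))
#> [1] "XL"
regmatches(x, regexpr("XL{1,3}?", x, perl = TRUE))
#> [1] "XL"
```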
Practice
- Specialized favours:
The following sentences ou and iz contain a messy mix of British and U.S. American spelling.
- Construct regular expressions that would pick up both the U.K. and U.S. spelling variants of o/ou, is/iz, and ys/yz.
ou <- "Rumour has it that our favorite science of behaviour is an honorable endeavour, not lacking humour or color, but with little glamour."
iz <- "We must realize, recognise, and criticize, that excessive specialisation and socializing can paralyse."
str_view_all(ou, "ou?r", match = TRUE)
str_view_all(ou, "(our)|(or)", match = TRUE)
str_view_all(iz, "(i|y)(s|z)", match = TRUE)
str_view_all(iz, "[iy][sz]+", match = TRUE)
- Can you create a regex that would pick up both the U.K. and U.S. spelling variants of o/ou, but not the words “or” and “our”?
str_view_all(ou, "[^ ]ou?r", match = TRUE)
- Roman numerals:
The function as.roman() of the utils package (included in base R) translates numbers into Roman numerals.
- Create a vector romans that contains the Roman numerals for the numbers from 1990 to 2010 as characters.
romans <- as.character(utils::as.roman(1990:2010))
romans
#> [1] "MCMXC" "MCMXCI" "MCMXCII" "MCMXCIII" "MCMXCIV" "MCMXCV"
#> [7] "MCMXCVI" "MCMXCVII" "MCMXCVIII" "MCMXCIX" "MM" "MMI"
#> [13] "MMII" "MMIII" "MMIV" "MMV" "MMVI" "MMVII"
#> [19] "MMVIII" "MMIX" "MMX"
- Predict and explain the results of the following stringr commands containing regular expressions:
str_view_all(romans, "XCI?", match = TRUE)
str_view_all(romans, "XCI*", match = TRUE)
str_view_all(romans, "XCI+", match = TRUE)
str_view_all(romans, "MI{1}", match = TRUE)
str_view_all(romans, "MI{2,}", match = TRUE)
str_view_all(romans, "MI{2,3}?", match = TRUE)
str_view_all(romans, "M{2}I?", match = TRUE)
str_view_all(romans, "M{2}I?V", match = TRUE)
str_view_all(romans, "M{2}I+", match = TRUE)
E.2.7 Back-references
In Section E.2.5, we saw that round parentheses () provide a way of disambiguating regular expressions. Parentheses also create a capturing group that can be referred to by a number (e.g., 1, 2, etc.). A capturing group stores the part of the string matched by the regular expression inside the parentheses. We can refer to the same text that was previously matched with so-called back-references (\\1, \\2, etc.).
For example, the following regular expressions find all fruits that have a repeated vowel or a repeated pair of letters:
fruits <- ds4psy::fruits # data
str_view(fruits, "([aeiou])\\1", match = TRUE)
str_view(fruits, "(..)\\1", match = TRUE)
Note that the group remembered by the back-reference is exactly the text found by the first match.
Thus, the following expression finds fruits that contain the same capital letter twice:
str_view(fruits, "([:upper:]).*\\1", match = TRUE)
whereas fruits with any two (or more) capital letters can be matched as follows:
str_view(fruits, "([:upper:]).*[:upper:]", match = TRUE)
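The distinction between repeating a pattern and repeating a match can be verified on a few strings. The following base R sketch uses hypothetical example strings (not the fruits data), the double-bracket notation for character classes required by base R, and perl = TRUE (an assumption made here to fix the regex flavor):

```r
v <- c("beet", "banana")                    # hypothetical sample strings
x <- c("ABBA style", "Apple Pie", "apple")  # hypothetical sample strings

# A repeated vowel vs. a repeated pair of characters:
grepl("([aeiou])\\1", v, perl = TRUE)
#> [1]  TRUE FALSE
grepl("(..)\\1", v, perl = TRUE)
#> [1] FALSE  TRUE

# The SAME capital letter twice (back-reference to the matched text):
grepl("([[:upper:]]).*\\1", x, perl = TRUE)
#> [1]  TRUE FALSE FALSE

# ANY two capital letters (the pattern is repeated, not the match):
grepl("([[:upper:]]).*[[:upper:]]", x, perl = TRUE)
#> [1]  TRUE  TRUE FALSE
```

Here, "banana" contains the pair "an" twice in a row, and "Apple Pie" contains two different capital letters, but no capital letter twice.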
Back-referencing earlier matches is particularly powerful in combination with the wildcard character . (which we know to match any individual character, see Section E.2.2) and the ability to match an arbitrary number of characters .* (see Section E.2.6).
To illustrate the potential of matching patterns by combining the wildcard character . with back-references, the following examples slightly modify two excellent, but difficult exercises from Section 14.3.5 Grouping and backreferences (Wickham & Grolemund, 2017):
The first exercise asks us to describe, in words, what the following regular expressions will match:
(.)\1\1
"(.)(.)\\2\\1"
(..)\1
"(.).\\1.\\1"
"(.)(.)(.).*\\3\\2\\1"
Unless someone is quite experienced with pattern matching, describing the targets of these regular expressions is challenging (and an additional difficulty is that two of them first need to be turned into strings, which includes escaping the \ symbol).
When something becomes difficult to think through, constructing or seeing an example can help a lot.
So here is some data bref that allows evaluating these regular expressions:
bref <- c("aaaa aaab aabb abba baba ahab anna",
          "abcxcba abcxabc baobab abracadabra",
          "toto motto total lotto TNT a lot of LOL toll")
str_view_all(bref, "(.)\\1\\1", match = TRUE)
str_view_all(bref, "(.)(.)\\2\\1", match = TRUE)
str_view_all(bref, "(..)\\1", match = TRUE)
str_view_all(bref, "(.).\\1.\\1", match = TRUE)
str_view_all(bref, "(.)(.)(.).*\\3\\2\\1", match = TRUE)
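To check such predictions without str_view_all(), each pattern can be tested against single tokens. The following is a sketch in base R (with perl = TRUE), assuming a hypothetical vector tokens that holds a few of the words from bref as separate strings. The pattern names are made up for illustration, and results on the full strings of bref can differ, as . also matches spaces:

```r
# A few words from bref, stored as separate strings (assumption):
tokens <- c("aaaa", "abba", "anna", "baba", "toto", "abcxcba", "abracadabra")

# The five patterns, with illustrative (hypothetical) names:
patterns <- c(triple   = "(.)\\1\\1",             # same character 3 times in a row
              mirror2  = "(.)(.)\\2\\1",          # 2 characters, then their reversal
              pair2    = "(..)\\1",               # a pair of characters, repeated
              every2nd = "(.).\\1.\\1",           # same character at positions 1, 3, 5
              mirror3  = "(.)(.)(.).*\\3\\2\\1")  # 3 characters ... their reversal

# For each pattern, keep the tokens that contain a match:
hits <- lapply(patterns, function(p) tokens[grepl(p, tokens, perl = TRUE)])
hits
```

Note that "aaaa" matches three of the patterns: when all characters are identical, different groups can capture the same letter.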
Practice
- Repetitions in words:
A second exercise in Section 14.3.5 Grouping and backreferences (Wickham & Grolemund, 2017) asks us to construct regular expressions that match words that:
- start and end with the same character.
- contain a repeated pair of letters (e.g., the word “decide” contains the letter sequence “de” twice.)
- contain a letter repeated in at least three places (e.g. “evidence” contains three “e”s.)
To solve these tasks, we can use the words data from the stringr package:
words <- stringr::words  # data
str_view_all(words, "^(.).*\\1$", match = TRUE)
str_view_all(words, "(.)(.).*\\1\\2", match = TRUE)
str_view_all(words, "(.).*\\1.*\\1", match = TRUE)
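The three solutions can also be tried out without loading stringr. The following sketch uses base R and assumes a small hypothetical vector w instead of the full words data:

```r
# Hypothetical mini-vocabulary (a stand-in for stringr::words):
w <- c("decide", "evidence", "area", "window", "church", "remember")

# Words that start and end with the same character:
same_ends <- w[grepl("^(.).*\\1$", w, perl = TRUE)]

# Words that contain a repeated pair of letters:
pair_twice <- w[grepl("(.)(.).*\\1\\2", w, perl = TRUE)]

# Words that contain the same letter in at least three places:
three_times <- w[grepl("(.).*\\1.*\\1", w, perl = TRUE)]

list(same_ends = same_ends, pair_twice = pair_twice, three_times = three_times)
```

The pattern "(.)(.).*\\1\\2" for repeated pairs could equally be written with a single group as "(..).*\\1".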
E.2.8 Look-arounds
An advanced feature of regular expressions is specifying a pattern that follows or precedes some other pattern.
So-called look-around expressions exist in two versions: a(?=b) matches an a that is followed by b (i.e., it looks ahead of a), and (?<=b)a matches an a that is preceded by b (i.e., it looks behind a).
Replacing the equal sign = by an exclamation mark ! negates both expressions.
This becomes clearer when seeing some examples:
Look-ahead
Here is a string containing some nice Roman numerals to match:
sz <- "XS S M L XL XXL LX LLX"  # data
Looking ahead implies matching something before something else is matched:
- a(?=b) matches any a followed by b
- a(?!b) matches any a not followed by b
Examples
- Find any X that is followed by an L:
str_view_all(sz, "X(?=L)", match = TRUE)
- Find any X that is not followed by an L:
str_view_all(sz, "X(?!L)", match = TRUE)
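A replacement makes it visible that a look-ahead only checks what follows, without including it in the match. The following is a minimal sketch in base R, assuming that we mark every matched X by a lowercase x:

```r
sz <- "XS S M L XL XXL LX LLX"  # data (as above)

# Replace any X that is followed by an L (the L itself is left untouched):
x_before_L <- gsub("X(?=L)", "x", sz, perl = TRUE)
x_before_L

# Replace any X that is NOT followed by an L:
x_not_before_L <- gsub("X(?!L)", "x", sz, perl = TRUE)
x_not_before_L
```

As every X is either followed by an L or not, the two commands together change each X exactly once.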
Look-behind
Looking behind implies matching something after something else has been matched:
- (?<=b)a matches any a preceded by b
- (?<!b)a matches any a not preceded by b
Examples
- Find any X that is preceded by an L:
str_view_all(sz, "(?<=L)X", match = TRUE)
- Find any X that is not preceded by an L:
str_view_all(sz, "(?<!L)X", match = TRUE)
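The positions and lengths of matches show that a look-behind does not become part of the match either. The following sketch uses the base R function gregexpr() (with perl = TRUE) on the same data:

```r
sz <- "XS S M L XL XXL LX LLX"  # data (as above)

# Locate any X that is preceded by an L:
m_behind <- gregexpr("(?<=L)X", sz, perl = TRUE)[[1]]
pos_behind <- as.integer(m_behind)                        # start positions
len_behind <- as.integer(attr(m_behind, "match.length"))  # lengths of matches

# Locate any X that is NOT preceded by an L:
m_not_behind <- gregexpr("(?<!L)X", sz, perl = TRUE)[[1]]
pos_not_behind <- as.integer(m_not_behind)

pos_behind      # positions of the final X in "LX" and "LLX"
len_behind      # each match covers only 1 character (the X)
pos_not_behind  # positions of all other X
```

Both matches of "(?<=L)X" have a length of one character, even though the pattern also inspects the preceding L.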
Notes
Both types of look-arounds only match the pattern outside of the look-ahead or look-behind expression (provided in parentheses).
Using look-arounds with base R commands (like grep()) typically requires setting their perl argument to perl = TRUE (see footnote 82).
There are more advanced aspects of regular expressions.
For instance, a (?(if)then|else) construct allows creating conditional regular expressions that can be used in combination with look-ahead or look-behind constructs.
But this functionality would clearly go beyond our gentle introduction. See http://www.regular-expressions.info or the resources mentioned in Section E.4 for these features.
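The note on perl = TRUE can be illustrated by running the same look-ahead pattern with and without this argument. This is only a sketch: it assumes that the default regex engine of the installed R version rejects the look-around syntax with an error, which tryCatch() turns into the string "error":

```r
sz <- "XS S M L XL XXL LX LLX"  # data (as above)

# With perl = TRUE, the look-ahead is understood:
with_perl <- grepl("X(?=L)", sz, perl = TRUE)

# Without perl = TRUE, the default engine is expected to fail
# (assumption: the pattern is reported as an invalid regular expression):
without_perl <- tryCatch(
  suppressWarnings(grepl("X(?=L)", sz)),
  error = function(e) "error"
)

with_perl
without_perl
```

When a base R command complains about an invalid regular expression that contains (?= or (?<=, a missing perl = TRUE is therefore a likely cause.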
A final caveat
Before we conclude this section with some final practice exercises, we should emphasize that regular expressions can be beautiful creatures, but should nevertheless be used with caution.
Quite often, an overly complicated regex can be replaced by two or three simpler steps.
For instance, if we were to search for all words with exactly 10 letters, we could use any of the following regular expressions:
str_view_all(words, "^..........$", match = TRUE)
str_view_all(words, "^.{10}$", match = TRUE)
str_view_all(words, "\\b.{10}\\b", match = TRUE)
However, this particular task could also be solved by remembering some simple base R functions:
words[nchar(words) == 10]
#> [1] "department" "difference" "experience" "individual" "particular"
#> [6] "photograph" "television" "understand" "university"
Thus, regular expressions are powerful tools, but are complemented by and used in combination with other tools (e.g., the str_count() function discussed in the Strings of text chapter; see Section 9.4).
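That both approaches select the same elements can be verified directly. The following sketch assumes a small hypothetical vector w rather than the full words data:

```r
# Hypothetical stand-in for the words data:
w <- c("department", "table", "university", "understand", "regex")

# Solution by regular expression (exactly 10 characters):
by_regex <- grep("^.{10}$", w, value = TRUE)

# Solution by counting characters:
by_nchar <- w[nchar(w) == 10]

identical(by_regex, by_nchar)  # do both solutions agree?
```

For such simple length checks, the nchar() version is arguably easier to read and to modify.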
Practice
Here are some final practice tasks to check your regular expression skills.
- Finding four-letter words:
We now can define the three patterns described in the introduction above (see Section E.2). Specifically, a pattern that
- finds only the word text;
- finds any word beginning and ending with the letter t;
- finds any four-letter word.
Write regular expressions matching these patterns and demonstrate their results on the character string tst:
tst <- "Both 'text' and 'test' are four-letter words."
Solution
# only "text":
str_view_all(tst, "text", match = TRUE)
# beginning and ending with "t":
str_view_all(tst, "t.*t", match = TRUE) # too general
str_view_all(tst, "( t\\w*t )|('t\\w*t')", match = TRUE) # cheating
str_view_all(tst, "\\bt[:alpha:]*t\\b", match = TRUE) # ok
# any 4-letter word:
str_view_all(tst, "\\b....\\b", match = TRUE) # too general
str_view_all(tst, "\\b\\w{4}\\b", match = TRUE) # ok
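To obtain the matched words as a character vector (rather than viewing them), the matches can be extracted. The following is a sketch in base R, in which extract_all() is a hypothetical helper function that combines gregexpr() and regmatches():

```r
tst <- "Both 'text' and 'test' are four-letter words."

# Hypothetical helper: return all matches of pattern in a single string x.
extract_all <- function(pattern, x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

# Words beginning and ending with "t":
t_words <- extract_all("\\bt\\w*t\\b", tst)

# Any four-letter word:
four_letters <- extract_all("\\b\\w{4}\\b", tst)

t_words
four_letters
```

Note that "four" counts as a four-letter word here, as the hyphen in "four-letter" creates a word boundary.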
- Finding fruits:
We can further practice our regex skills on the ds4psy collection of fruits (transformed into lowercase letters):
fruits <- tolower(ds4psy::fruits)  # data
# length(fruits)  # 122
Answer the following questions about fruits:
- Does fruits include “ananas” or “kiwi”?
- Which types of “berry” starting with the letters “b” or “c” are included in fruits?
- Which fruits start with the letters X to Z?
- Which fruits start and end on a vowel?
- Which fruits contain the same letter five times?
- Which fruits contain a palindrome (i.e., a letter sequence that reads the same forwards and backwards, like “anana”)?
Solution
# ananas or kiwi?
# it seems not:
"ananas" %in% fruits # FALSE (as not its own string)
"kiwi" %in% fruits # FALSE (as not its own string)
# but:
grep(pattern = "ananas|kiwi", x = fruits, value = TRUE)
str_view_all(fruits, "ananas|kiwi", match = TRUE)
# b/c...berry?
str_view_all(fruits, "^(b|c).*berry", match = TRUE)
# start on x to z?
str_view_all(fruits, "^[x-z]", match = TRUE)
# start and end on a vowel?
str_view_all(fruits, "^[aeiou].*[aeiou]$", match = TRUE)
# letter 5 times?
str_view_all(fruits, "(.).*\\1.*\\1.*\\1.*\\1", match = TRUE)
# palindrome (i.e., a letter sequence that reads the same backwards)?
str_view_all(fruits, "(.)(.).?\\2\\1", match = TRUE)
- Matching fruits:
Predict, evaluate, and explain the results of the following searches:
str_view_all(fruits, "[xz]", match = TRUE)
str_view_all(fruits, "^[d-g]", match = TRUE)
str_view_all(fruits, "[d-g]$", match = TRUE)
str_view_all(fruits, "a(n|m).*e$", match = TRUE)
str_view_all(fruits, "(an){2}", match = TRUE)
str_view_all(fruits, "[^a-z A-Z\\(\\)]", match = TRUE)
str_view_all(fruits, "(..)\\1", match = TRUE)
str_view_all(fruits, "(r).*\\1.*\\1", match = TRUE)
- Matching sentence borders:
Use a base R command for finding Bushisms that contain
- a dot ., followed by at least one space, and a capital letter
- a question mark ?, followed by at least one space, and a capital letter
Hint: Remember to set perl = TRUE for enabling look-around functionality in base R commands.
Solution
# A dot ".", followed by at least 1 space, and capital letter:
grep(x = Bushisms, pattern = "(?<=\\.) {1,}(?=[A-Z])", perl = TRUE, value = TRUE)
# A question, followed by at least 1 space, and capital letter:
grep(x = Bushisms, pattern = "(?<=\\?) {1,}(?=[A-Z])", perl = TRUE, value = TRUE)
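As Bushisms is defined elsewhere, the logic of both patterns can be checked on a few made-up strings. The following sketch assumes a hypothetical vector s (not the Bushisms data):

```r
# Hypothetical example strings (not the Bushisms data):
s <- c("First part. Second part.",
       "One sentence only.",
       "Really? Yes.",
       "e.g. lowercase follows")

# A dot, followed by at least 1 space, and a capital letter:
after_dot <- grep("(?<=\\.) {1,}(?=[A-Z])", s, perl = TRUE, value = TRUE)

# A question mark, followed by at least 1 space, and a capital letter:
after_qm <- grep("(?<=\\?) {1,}(?=[A-Z])", s, perl = TRUE, value = TRUE)

after_dot
after_qm
```

In both patterns, only the spaces are matched: The punctuation mark and the capital letter are conditions that are checked by the look-behind and look-ahead.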
- Matching articles:
Use str_view_all() commands for viewing indefinite or definite articles (i.e., “a” or “the”) in Bushisms.
Specifically, create regular expressions that match
- all instances of “a” or “the”
- all instances of “a” or “the” followed by a word
- all words preceded by “a” or “the”
Hint: Word boundaries can be matched by \b.
Solution
str_view_all(Bushisms, "\\b(a|the)\\b", match = TRUE)
str_view_all(Bushisms, "(\\b(a|the)\\b)(?=( \\b.+?\\b))", match = TRUE)
str_view_all(Bushisms, "(?<=(\\b(a|the)\\b) )(\\b.+?\\b)", match = TRUE)
- Looking back and ahead:
Use the following data:
sz <- "XS S M L XL XXL LX LLX"  # data
to construct regular expressions that:
- find any L preceded by X
- find any L not preceded by X
- find any L followed by X
- find any L not followed by X
- find any X preceded by X and followed by L
- find any L not preceded by X but followed by X
Solution
str_view_all(sz, "(?<=X)L", match = TRUE)
str_view_all(sz, "(?<!X)L", match = TRUE)
str_view_all(sz, "L(?=X)", match = TRUE)
str_view_all(sz, "L(?!X)", match = TRUE)
str_view_all(sz, "(?<=X)X(?=L)", match = TRUE)
str_view_all(sz, "(?<!X)L(?=X)", match = TRUE)
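The last two solutions combine a look-behind with a look-ahead in one pattern. Their matches can be located in base R as follows, where find_pos() is a hypothetical helper function that returns the start positions of all matches:

```r
sz <- "XS S M L XL XXL LX LLX"  # data (as above)

# Hypothetical helper: start positions of all matches of pattern in x.
find_pos <- function(pattern, x) {
  as.integer(gregexpr(pattern, x, perl = TRUE)[[1]])
}

# Any X preceded by X and followed by L (the 2nd X in "XXL"):
x_between <- find_pos("(?<=X)X(?=L)", sz)

# Any L not preceded by X, but followed by X (in "LX" and "LLX"):
l_before_x <- find_pos("(?<!X)L(?=X)", sz)

x_between
l_before_x
substr(sz, x_between - 1, x_between + 1)  # context of the X match
```

Both conditions must hold at the same position, but only the single character between them is matched.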
See Section E.4 for additional resources on regular expressions.
References
When using R or R Markdown, it can be quite confusing that the backslash \ and the ASCII accent grave (used to invoke a code environment in R Markdown) appear in many different roles and meanings.
Actually, we are cheating a bit here: Searching for patterns that begin with a capital letter or end on a punctuation mark only identifies the beginning or end of sentences when these are already stored as separate strings, in which case we could simply match their first or last character. Hence, functions that aim to identify sentences in longer passages of text (like text_to_sentences() in ds4psy) need to be smarter than this.
82. By default, R uses extended regular expressions. Setting perl = TRUE switches to the PCRE library, which is similar to Perl 5.x (see ?base::regex for details).