E.2 Essential regex commands

Regular expressions specify patterns in strings of text. The notion of a pattern implies a degree of flexibility that can range from very specific to highly general. For instance, the word “text” could be described as a word that contains the letter sequence t-e-x-t, as a word that begins and ends with the letter t, or as a four-letter word. Each of these descriptions can be viewed as a pattern that matches the word “text”. But whereas the first pattern would only match this particular word, the second and third would also match the word “test”, and the third would additionally match the word “four”. Thus, whenever matching a pattern, we aim for the sweet spot between overly specific and overly general searches. When learning to write regular expressions, we therefore need means and tools that allow striking the right balance between specificity and generality.

To begin our expedition into the realm of regular expressions, we will primarily explore the character vector tests (defined above):

As we proceed to more advanced aspects of regular expressions, we will use more specialized collections of data, which we will either define on the fly or which were specified above (see Section E.1.2).

E.2.1 Character sequences

Regular expressions specify patterns in character strings, but are also provided as character strings (i.e., enclosed by quotation marks, as in pattern = "at" above). Consequently, character symbols are not only the basic building blocks of character strings (i.e., text objects or data of type character), but also of regular expressions (i.e., abstract descriptions of patterns in text data). As the difference between strings that are text objects and strings that are regular expressions lies in their intended use, this similarity can be confusing. And although the functionally different roles of character strings in R are often convenient, they also create some conflicts, as we will see in the next section.

Sequences of letters — or words — and any numeric digits used in regular expressions match themselves:

The same is true for many characters that are neither letters nor digits:

However, we will soon see that many characters require special treatments to be used within regular expressions (see Section E.2.2).
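As a minimal sketch of such literal matches (using base R and a small made-up vector words, which is our assumption here and not the tests vector used in this chapter), we could evaluate:

```r
# Hypothetical example data (an assumption, not part of ds4psy):
words <- c("cat", "hat", "dog 42", "a-b")

# Letters and digits match themselves:
has_at <- grepl(pattern = "at", x = words)  # TRUE TRUE FALSE FALSE
has_42 <- grepl(pattern = "42", x = words)  # FALSE FALSE TRUE FALSE

# Many other characters (like a hyphen or a space) also match themselves:
has_hyphen <- grepl(pattern = "-", x = words)  # FALSE FALSE FALSE TRUE
has_space  <- grepl(pattern = " ", x = words)  # FALSE FALSE TRUE FALSE
```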

What about the other Unicode characters that Chapter 9 (on Strings of text) taught us to appreciate and type? We can try matching Unicode characters by using our epitome of cultural insights (from Section 9.2.2):

Just as there were different ways of typing Unicode characters in a character string, we can use these characters in regular expressions (which are character strings) in various ways. Here are three different ways of entering an Umlaut character (e.g., ö) within a regular expression:

Fortunately, all three ways of using the Umlaut character ö within a regex yield the same result:

Hence, we can use and match Unicode characters in regular expressions. (See Sections 9.2.2 and 9.8.2 for additional information and resources on Unicode characters.)

Practice

  1. Predict the outputs of the following commands if p is changed to "A", then verify your predictions by evaluating the commands.
  1. Predict the outputs of the following commands if p is changed to "sa", then verify your predictions by evaluating the commands.
  1. Combine the datasets provided by Bushisms and Trumpisms into a vector BT and then search it for all objects containing the following character sequences or words:
  • “big”
  • “est”
  • “mis”
  • “child”
  • “country”
  • “America”

Here’s an example:

E.2.2 Meta-characters and escaping

In Chapter 9 (on Strings of text), the existence of metacharacters and character constants was only mentioned (in Section 9.2.3). Now we are in a position to learn which special meanings these characters have within regular expressions and how they can be matched when they appear in strings of text.

The 12 so-called metacharacters in R are:

  • . \ | ( ) [ { ^ $ * + ?

and are documented in ?regex. To provide convenient access to them, the ds4psy package defines a character vector metachar that contains these characters.

The . as wildcard vs. dot

The first metacharacter in metachar is the dot . (also known as “period” or “full stop”). In regular expressions, the dot . serves as a wildcard character that matches any single character (except the newline character \n):

This degree of flexibility makes matching . pretty useless in itself, as matching everything is typically not very helpful. However, using one or more wildcards becomes very powerful in combination with other characters:
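To illustrate the wildcard, here is a small base R sketch. The vector x is made up for this purpose (an assumption, not data from ds4psy):

```r
# Hypothetical example data (an assumption):
x <- c("cat", "cut", "ct", "coat")

# A single dot matches any character, so every non-empty string matches:
any_char <- grepl(".", x)      # TRUE TRUE TRUE TRUE

# Combined with other characters, the wildcard becomes selective:
c_dot_t  <- grepl("c.t", x)    # TRUE TRUE FALSE FALSE  (c, any 1 character, t)
c_2dot_t <- grepl("c..t", x)   # FALSE FALSE FALSE TRUE (c, any 2 characters, t)
```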

But the ambiguity that the dot symbol can now appear both in our data and also as a wildcard in a regular expression creates a conundrum: If a . matches any arbitrary character, how can we match the character symbol "." (e.g., used after abbreviations or at the end of a sentence)? The answer is that we need to signal to R that we want to use the . symbol not in its special meaning, but as an ordinary character symbol. Thus, we need to “escape” from its special meaning.

An escape from a symbol’s special meaning is achieved by preceding the symbol by a backslash \. But another glance at the set of metacharacters shows that \ also happens to be a character with special meaning. So how can we use it within a regular expression? Well, by escaping from its special meaning (i.e., by preceding it with a backslash \).
Thus, we can search for a literal dot symbol . by preceding it by two backslash characters:

Following the same logic also allows us to search for other metacharacters, like ?, ^, +, or parentheses ():
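The following base R sketch shows escaped metacharacters in action. The vector x is a hypothetical example (our assumption), and the fixed = TRUE variant is mentioned only as an alternative way of matching literal text:

```r
# Hypothetical example data (an assumption):
x <- c("e.g.", "etc", "why?", "a+b", "f(x)")

# Two backslashes in the R string yield one backslash in the regex:
dot   <- grepl("\\.", x)   # TRUE FALSE FALSE FALSE FALSE
qmark <- grepl("\\?", x)   # FALSE FALSE TRUE FALSE FALSE
plus  <- grepl("\\+", x)   # FALSE FALSE FALSE TRUE FALSE
paren <- grepl("\\(", x)   # FALSE FALSE FALSE FALSE TRUE

# Alternative: treat the pattern as literal text (no regex interpretation):
dot_fixed <- grepl(".", x, fixed = TRUE)  # same result as dot
```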

Matching the backslash \

Having understood the need for double-backslashes, matching a literal backslash \ is still challenging. To write a \ inside a string, we need to escape the special meaning of \, hence write "\\". However, to match a \, we need to escape it as well. As a consequence, we need to use \\\\ (indeed, no less than four backslashes) to match a single \:
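A short base R sketch can make the four backslashes more tangible. The vector s is a made-up example (an assumption), and nchar() and writeLines() are only used to show that the first string really contains a single backslash:

```r
# Hypothetical example data (an assumption):
s <- c("a\\b", "ab")   # the first string contains ONE backslash character

n_chars <- nchar(s)    # 3 2
writeLines(s[1])       # prints: a\b

# Four backslashes in the R string yield the regex \\ (an escaped backslash):
has_bs <- grepl("\\\\", s)                    # TRUE FALSE

# With fixed = TRUE, only the string-level escape is needed:
has_bs_fixed <- grepl("\\", s, fixed = TRUE)  # TRUE FALSE
```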

Character constants

Note that — besides metacharacters — there are also so-called character constants. These are characters with a special meaning in R that are also preceded by a backslash \. The most common character constants are:

  • \n newline
  • \r carriage return
  • \t tab
  • \b backspace
  • \f form feed
  • \' ASCII apostrophe
  • \" ASCII quotation mark

Evaluate ?"'" to obtain a complete list of character constants. [1]

Practice

  1. Matching dots:

We saw that we can match any literal dot . by escaping its special meaning (as a wildcard):

But notice that there are (at least) two kinds of dots:

  • Can we distinguish between inline dots (typically signaling abbreviations) and final dots (typically signaling the end of a sentence)?

Well, we can easily identify non-final dots by searching for an escaped dot \\. that is followed by another character (i.e., a wildcard dot .):

This was easy, but finding a dot . that signals the end of a character string is tricky (as long as we do not yet know about anchors, which will be discussed in Section E.2.4 below):

However, a quick hack could add an empty space " " to any string and then search for a dot followed by a space:

  1. Matching meta- and other cryptic characters:

The following string cryptix contains 50 characters that are a mix of meta-characters and non-meta-characters:

Use this string to complete the following tasks:

  • Define cryptix as !=[$/]\[%</:=),{>|/*}?(&(.<\.!$|*,/#:.%(.*+-[\%\^|.

  • Inspect cryptix to determine which metacharacters are contained in it.

  • Inspect cryptix to determine which non-metacharacters are contained in it.

  • Construct a series of stringr str_view_all() commands that selectively finds every character contained in cryptix.

For metacharacters, these commands could be:

For non-metacharacters, these commands could be:

Note that the results of searching and printing the cryptix string can be rather unpredictable. Thus, dealing with a mix of cryptic characters remains messy, even with sophisticated tools.

E.2.3 Character classes

Beyond matching specific characters, regular expressions provide means of matching entire classes (or types) of characters. We can distinguish between three types of character classes:

  1. Specific character classes are the most common symbols in most texts:
  • [:lower:] lower-case letters
  • [:upper:] upper-case letters
  • [:alpha:] alphabetic characters: [:lower:] and [:upper:]
  • [:digit:] digits: 0 1 2 3 4 5 6 7 8 9
  • [:punct:] punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
  1. Spacing and control characters are mostly invisible, yet very common and important for structuring text:
  • [:blank:] blank characters (space, tab \t, and ideally non-breaking spaces)
  • [:cntrl:] control characters (e.g., tab \t, newline \n, carriage return \r)
  • [:space:] space characters (e.g., tab, newline, form feed, carriage return, space, and others)
  1. Classes of character classes are generalizations of other categories:
  • [:alnum:] alphanumeric characters: [:alpha:] and [:digit:]
  • [:graph:] graphical characters: [:alnum:] and [:punct:]
  • [:print:] printable characters: [:alnum:], [:punct:], and the space character

The ds4psy package contains a (named) vector cclass that is suited to illustrate these commands. cclass contains different character classes (as character strings) and allows selecting each class by an abbreviated name:

We now can try to match the contents of cclass by regular expressions that are designed for matching entire character classes. The following commands show the results of the corresponding matches:

  1. Specific character classes:

Note that some metacharacters are not matched by [:punct:], but can be matched by escaping the corresponding symbol (see Section E.2.2 above):

  1. Spacing and control characters:
  1. Classes of character classes:

When using base R commands, the character classes enclosed in brackets must be enclosed in an additional set of brackets. For instance, we can find all strings in cclass with alphanumeric characters by the following grep() command:
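As a minimal sketch of the double-bracket notation in base R, we can use a small made-up vector x (an assumption; this is not the cclass vector of ds4psy):

```r
# Hypothetical example data (an assumption):
x <- c("abc", "123", "a1", "?!", "  ")

# Character classes need an additional pair of square brackets:
alnum <- grepl("[[:alnum:]]", x)  # TRUE TRUE TRUE FALSE FALSE
digit <- grepl("[[:digit:]]", x)  # FALSE TRUE TRUE FALSE FALSE
punct <- grepl("[[:punct:]]", x)  # FALSE FALSE FALSE TRUE FALSE
space <- grepl("[[:space:]]", x)  # FALSE FALSE FALSE FALSE TRUE
```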

Alternative options for matching some classes of characters are provided by the following escape sequences:

  • \d matches any digit (\D any non-digit)
  • \s matches any space character (e.g. space, tab \t, newline \n; \S any non-space character)
  • \w matches any ‘word’ character (letter, digit, or underscore in the current locale; \W any non-word character)

And remember: To enter the character \ within a string (as regex are written as strings), it needs to be escaped by an additional \:
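For instance, the escape sequences could be used in base R as follows. This is only a sketch on a made-up vector x (an assumption):

```r
# Hypothetical example data (an assumption):
x <- c("abc", "a1", "a b", "_")

# Each regex escape needs a doubled backslash inside the R string:
has_digit    <- grepl("\\d", x)  # FALSE TRUE FALSE FALSE
has_space    <- grepl("\\s", x)  # FALSE FALSE TRUE FALSE
has_word     <- grepl("\\w", x)  # TRUE TRUE TRUE TRUE (the underscore counts as a word character)
has_non_word <- grepl("\\W", x)  # FALSE FALSE TRUE FALSE (only the space is a non-word character)
```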

We are not showing the results of all these commands here, but feel free to try them out in your console.

Practice

  1. Non-characteristic fruits:
  • Are there any fruits that contain characters that are not [:alnum:]?

Hint: Yes, there are. Just search for character classes that are not contained in [:alnum:].

Note: We will learn how to negate a pattern below.

  1. Surrounded spaces:
  • Write several regular expressions that match any space " " that is surrounded by three characters on either side in tests.

Some examples include:

E.2.4 Anchors

Anchors allow matching patterns in strings at two prominent positions:

  • ^ matches the start of a string
  • $ matches the end of a string

Some straightforward examples for using anchors (using the grep() function) include:

Note the order of characters: As the anchor $ matches the end of the string, any character required to be at the end of the string needs to appear before it in the regular expression.
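A minimal base R sketch of both anchors, using a short made-up vector fr (an assumption, not the fruits data of ds4psy), could look like this:

```r
# Hypothetical example data (an assumption):
fr <- c("apple", "banana", "grape", "pear")

starts_a <- grep("^a", fr, value = TRUE)      # "apple"
ends_e   <- grep("e$", fr, value = TRUE)      # "apple" "grape"

# Using both anchors requires the pattern to span the entire string:
only_pear <- grep("^pear$", fr, value = TRUE) # "pear"
```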

The corresponding stringr commands with anchors are:

Using anchors in combination with more general patterns (e.g., patterns that match entire character classes) makes them quite powerful tools. For instance, we now can search for the beginning and end of sentences [2]:

Their functionality as anchors explains the special meaning of the metacharacters ^ and $, but note that their position also matters. And remember: To match a literal ^ or $ in a string (e.g., in metachar), we need to escape them (see Section E.2.2 above):
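The difference between ^ and $ as anchors and as literal characters can be sketched in base R. The vector x is a made-up example (an assumption):

```r
# Hypothetical example data (an assumption):
x <- c("^start", "end$", "mid^dle", "plain")

# Escaped: a literal ^ anywhere in the string
has_caret <- grepl("\\^", x)      # TRUE FALSE TRUE FALSE

# Anchor plus escaped symbol: a literal ^ at the start of the string
caret_first <- grepl("^\\^", x)   # TRUE FALSE FALSE FALSE

# Escaped symbol plus anchor: a literal $ at the end of the string
dollar_last <- grepl("\\$$", x)   # FALSE TRUE FALSE FALSE
```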

Additional anchors that are mostly used for matching words (rather than strings) are:

  • \b matches any empty string at either edge of a word
  • \B matches empty strings that are NOT at word edges
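A brief sketch of word boundaries in base R follows. The vector x is made up (an assumption), and setting perl = TRUE is merely a cautious choice on our part, as ?regex documents \b and \B for the default engine as well:

```r
# Hypothetical example data (an assumption):
x <- c("the cat", "other", "bathe", "the")

# "the" as a separate word (word edges on both sides):
word_the <- grepl("\\bthe\\b", x, perl = TRUE)  # TRUE FALSE FALSE TRUE

# "the" that does NOT begin at a word edge (i.e., inside or at the end of a word):
inner_the <- grepl("\\Bthe", x, perl = TRUE)    # FALSE TRUE TRUE FALSE
```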

Practice

  1. Accounting for availability:

We apparently think that there are more words starting with a specific letter than ending on the same letter. This is often explained by the so-called availability heuristic (Tversky & Kahneman, 1974): Apparently, it is easier to recall exemplars by their first letter than by their last letter.

  • Test this assumption by first trying to recall fruits starting or ending on a specific letter l (for several letters, e.g., E, L, K, Y).

  • Then detect and count the number of corresponding fruits by regex searches.

  1. Analyzing Bushisms:

Analyze the set of Bushisms to answer the following questions:

  • Are there more sentences starting with I or with You?

  • Are there more Bushisms that end on a question (i.e., with a final ?) or that contain a question (i.e., with a non-final ?)?

E.2.5 Alternates, groups, and negation

Whereas anchors make our searches more specific (by requiring patterns to occur in specific positions), the use of other operators and conventions makes them more general. A key step towards more general patterns consists in specifying two or more alternatives:

  • a|b matches a or b
  • [abc] matches one of a, b, or c
  • [a-c] matches any character in the range from a to c
  • [^abc] matches anything but a, b, or c

In addition to these uses of square brackets [] to specify alternatives, this section also introduces the use of round parentheses () for grouping purposes. But let’s proceed step-by-step:

  • Using the metacharacter | in a regular expression "a|b" matches a or b (or both):
  • Enclosing characters in square brackets [] provides a more general way of specifying a group of characters to be matched. We have already seen the [...] construct when matching character classes above (see Section E.2.3), but now realize that the ... can also contain groups or ranges of characters to be matched:

Note the subtle, but important difference between matching a pattern "ab|c" (i.e., matching ab or c) and matching a pattern "[abc]" (i.e., matching a or b or c):
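This difference can be checked with a minimal base R sketch on a made-up vector x (an assumption):

```r
# Hypothetical example data (an assumption):
x <- c("ab", "a", "b", "c", "d")

# "ab|c": the sequence ab OR the character c
ab_or_c <- grepl("ab|c", x)   # TRUE FALSE FALSE TRUE FALSE

# "[abc]": any single character out of a, b, or c
any_abc <- grepl("[abc]", x)  # TRUE TRUE TRUE TRUE FALSE
```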

  • Square brackets [] also allow specifying alphabetic or numeric ranges by -:
  • Round parentheses () allow grouping patterns. This can be used for merely illustrative purposes (which can be very helpful for clarifying complex regular expressions):

but becomes essential when combining various options — like a required plus an optional part — for constructing more complex patterns:
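To see why grouping matters, consider the following base R sketch. The spelling variants in x are a made-up example (an assumption):

```r
# Hypothetical example data (an assumption):
x <- c("gray", "grey", "gry", "ey")

# With parentheses, | only chooses between a and e:
grouped <- grepl("gr(a|e)y", x)  # TRUE TRUE FALSE FALSE

# Without parentheses, | separates the entire left and right sides (gra OR ey):
ungrouped <- grepl("gra|ey", x)  # TRUE TRUE FALSE TRUE
```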

Another powerful tool for specifying sets or ranges of characters (or general patterns) is provided by negating a given set or range (i.e., excluding characters from matches).

  • Preceding a set or range of characters by ^ (e.g., [^a-z]) negates the set or range (i.e., excludes any characters within the set or range from matches):

Note that this particular use of the symbol ^ illustrates that the meaning of meta-characters is ambiguous and depends on the context in which they appear: Whereas ^ was used as an anchor when preceding a regex "^..." (see Section E.2.4), it acts as a negation symbol when it is used inside square brackets [^...]:
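The two roles of ^ can be contrasted in a small base R sketch. The vector x is made up (an assumption), and a numeric range is used here to keep the example independent of locale settings:

```r
# Hypothetical example data (an assumption):
x <- c("2024", "a1", "abc", "7up")

# ^ outside of brackets: anchor (string starts with a digit)
starts_digit <- grepl("^[0-9]", x)      # TRUE FALSE FALSE TRUE

# ^ inside of brackets: negation (string contains some non-digit)
has_non_digit <- grepl("[^0-9]", x)     # FALSE TRUE TRUE TRUE

# Both roles combined: string starts with a non-digit
starts_non_digit <- grepl("^[^0-9]", x) # FALSE TRUE TRUE FALSE
```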

Here are further examples for combining alternatives, ranges, and negations. Try to predict their results before evaluating them to verify your predictions:

Overall, the use of |, [], (), and the negation of ranges via ^, provides the components of a language that allows the expression of quite powerful regular expressions.

Practice

  1. Alternative spaces and punctuations:
  • Predict and explain the results of the following commands:
  1. Mixing anchors and metacharacters:

What do the following regular expressions match?

  • "[\\\\]$"
  • "[^\\\\]"
  • "[^\\\\]$"

  • "^[\\\\\\$]"
  • "[\\\\\\^]$"
  • "[\\\\\\^\\$]"

Construct str_view_all() expressions that check and verify your predictions.

Hint: As the regex involve the metacharacters of the backslash \, ^, and $ symbols, and the roles of the latter two as anchors, we need a test string that includes those symbols (with escaping) in different positions.

  1. Non-characteristic fruits with negation:

In an earlier task (above), we answered the following question:

  • Are there any fruits that contain characters that are not [:alnum:]?

by searching for character classes not contained in [:alnum:] (like [:space:] or [:punct:]). Knowing about the negation of ranges, we now can ask:

  • Can we find the same fruits by negating [:alnum:]?
  1. Regular presidential expressions:
  • Describe the goals of the following regex patterns prior to running them (to verify your predictions).

E.2.6 Repetition

Yet another way of fine-tuning our searches for patterns is provided by specifying how many times a (part of a) pattern is to be matched. To search for a specific number of occurrences, a regular expression (regex) may be followed by a repetition quantifier:

  • ?: The preceding regex will be matched at most once (\(0\)-\(1\)).
  • *: The preceding regex will be matched zero or more times (\(0+\)).
  • +: The preceding regex will be matched one or more times (\(1+\)).

Note that the quantifier in the last three examples only applied to the character L immediately preceding it. If we wanted to quantify the repetition of “XL” or of any “X” or “L”, we would have needed to group both characters by parentheses (XL) or brackets [XL] (see Section E.2.5).

A more general way of requiring a specific number or range of repetitions is provided by enclosing one or two numbers (n, or n and m) inside of curly brackets {}:

  • {n}: The preceding regex is matched exactly n times (\(n\)).
  • {n,}: The preceding regex is matched n or more times (\(n+\)).
  • {n,m}: The preceding regex is matched at least n times, but not more than m times (\(n\)-\(m\)).

The {n,} and {n,m} constructs are more general than the use of ?, *, and +, as the latter can easily be re-written as:

  • ? corresponds to {0,1}
  • * corresponds to {0,}
  • + corresponds to {1,}
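The quantifiers and their curly-bracket equivalents can be compared in a minimal base R sketch. The vector of sizes x is a made-up example (an assumption), and anchors are added so that each pattern must describe the entire string:

```r
# Hypothetical example data (an assumption):
x <- c("X", "XL", "XLL", "XLLL", "XXL")

opt  <- grepl("^XL?$", x)      # TRUE TRUE FALSE FALSE FALSE  (0 or 1 L)
star <- grepl("^XL*$", x)      # TRUE TRUE TRUE TRUE FALSE    (0 or more L)
plus <- grepl("^XL+$", x)      # FALSE TRUE TRUE TRUE FALSE   (1 or more L)
rng  <- grepl("^XL{2,3}$", x)  # FALSE FALSE TRUE TRUE FALSE  (2 to 3 L)

# The curly-bracket versions yield the same results:
same_opt  <- identical(opt,  grepl("^XL{0,1}$", x))
same_star <- identical(star, grepl("^XL{0,}$", x))
same_plus <- identical(plus, grepl("^XL{1,}$", x))
```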

Combining multiple repetition quantifiers can be powerful, but also confusing. Here are some examples that show how minor changes can make potentially crucial differences:

Repeated matches are greedy by default, so that the maximal possible number of repetitions is found. This can be changed to minimal by appending ? to the quantifier:

Note that the ? in "X?XL{1,3}?" has two different meanings: Whereas the first is a repetition quantifier, the second switches the preceding repetition quantifier to matching in a non-greedy fashion.
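To make the difference between greedy and non-greedy matching visible, we can extract the matched part of a string. The following is only a sketch: the string x is made up (an assumption), and perl = TRUE is set as a cautious choice on our part to obtain predictable behavior:

```r
# Hypothetical example string (an assumption):
x <- "XLLL"

# Greedy (default): match as many L as allowed
greedy <- regmatches(x, regexpr("XL{1,3}", x, perl = TRUE))   # "XLLL"

# Non-greedy: the appended ? requests as few L as possible
lazy <- regmatches(x, regexpr("XL{1,3}?", x, perl = TRUE))    # "XL"
```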

Practice

  1. Specialized favours:

The following sentences ou and iz contain a messy mix of British and U.S. American spelling.

  • Construct regular expressions that would pick up both the U.K. and U.S. spelling variants of o/ou, is/iz, and ys/yz.
  • Can you create a regex that would pick up both the U.K. and U.S. spelling variants of o/ou, but not the words “or” and “our”?
  1. Roman numerals:

The function as.roman() of the utils package (included in base R) translates numbers into Roman numerals.

  • Create a vector romans that contains the Roman numerals for the numbers from 1990 to 2010 as characters.
  • Predict and explain the results of the following stringr commands containing regular expressions:

E.2.7 Back-references

In Section E.2.5, we saw that round parentheses () provide a way of disambiguating regular expressions. Parentheses also create a capturing group that can be referred to by a number (e.g., 1, 2, etc.). A capturing group stores the part of the string matched by the regular expression inside the parentheses. We can refer to the same pattern that was previously matched with so-called back-references (\\1, \\2, etc.).

For example, the following regular expression finds all fruits that have a repeated vowel or a repeated pair of letters:

Note that the group remembered by the back-reference is exactly the one found on the first match. Thus, the following expression finds fruits that contain the same capital letter twice:

whereas fruits with any two (or more) capital letters can be matched as follows:

Back-referencing earlier matches is particularly powerful in combination with the wildcard character . (which we know to match any individual character, see Section E.2.2) and the ability to match an arbitrary number of characters .* (see Section E.2.6).
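Before turning to more demanding examples, a minimal base R sketch may help. The vector x is made up (an assumption, not the fruits data of ds4psy):

```r
# Hypothetical example data (an assumption):
x <- c("apple", "banana", "kiwi", "plum")

# Any character immediately followed by itself (e.g., "pp"):
doubled <- grepl("(.)\\1", x)          # TRUE FALSE FALSE FALSE

# Any pair of characters immediately followed by itself (e.g., "anan"):
pair_twice <- grepl("(..)\\1", x)      # FALSE TRUE FALSE FALSE

# Any character that re-occurs later, with arbitrary characters in between:
char_again <- grepl("(.).*\\1", x)     # TRUE TRUE TRUE FALSE
```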

To illustrate the potential of matching patterns by combining the wildcard character . with back-references, the following examples slightly modify two excellent, but difficult exercises from Section 14.3.5 Grouping and backreferences (Wickham & Grolemund, 2017):

The first exercise asks us to describe, in words, what the following regular expressions will match:

  • (.)\1\1
  • "(.)(.)\\2\\1"
  • (..)\1
  • "(.).\\1.\\1"
  • "(.)(.)(.).*\\3\\2\\1"

Unless someone is quite experienced with pattern matching, describing the targets of these regex is challenging (and an additional difficulty is that two of them first need to be turned into strings, which includes escaping the \ symbol).

When something becomes difficult to think through, constructing or seeing an example can help a lot. So here is some data bref that allows evaluating these regular expressions:

Practice

  1. Repetitions in words:

A second exercise in Section 14.3.5 Grouping and backreferences (Wickham & Grolemund, 2017) asks us to construct regular expressions that match words that:

  • start and end with the same character.
  • contain a repeated pair of letters (e.g., the word “decide” contains the letter sequence “de” twice.)
  • contain a letter repeated in at least three places (e.g. “evidence” contains three “e”s.)

To solve these tasks, we can use the words data from the stringr package:

E.2.8 Look-arounds

An advanced feature of regular expressions is specifying a pattern that follows or precedes some other pattern. So-called look-around expressions exist in two versions: a(?=b) indicates that a is followed by b (i.e., looks ahead of a) and (?<=b)a indicates that b precedes a (i.e., looks behind a).
Replacing the = by ! negates both expressions. This becomes clearer when seeing some examples:

Look-ahead

  • a(?=b) matches any a followed by b
  • a(?!b) matches any a not followed by b

Look-behind

  • (?<=b)a matches any a preceded by b
  • (?<!b)a matches any a not preceded by b

Notes

  • Both types of look-arounds only match the pattern outside of the look-ahead or look-behind expression (provided in parentheses).

  • Using look-arounds with base R commands (like grep()) typically requires setting their perl argument to perl = TRUE. [3]
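The four look-around variants can be tried out in a small base R sketch. The vector x is a made-up example (an assumption):

```r
# Hypothetical example data (an assumption):
x <- c("ab", "ac", "ba", "ca")

ahead      <- grepl("a(?=b)", x, perl = TRUE)   # TRUE FALSE FALSE FALSE
not_ahead  <- grepl("a(?!b)", x, perl = TRUE)   # FALSE TRUE TRUE TRUE
behind     <- grepl("(?<=b)a", x, perl = TRUE)  # FALSE FALSE TRUE FALSE
not_behind <- grepl("(?<!b)a", x, perl = TRUE)  # TRUE TRUE FALSE TRUE

# Only the part outside of the look-around is matched (here: "a", not "ab"):
matched <- regmatches("ab", regexpr("a(?=b)", "ab", perl = TRUE))  # "a"
```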

There are more advanced aspects of regular expressions. For instance, a (?(if)then|else) construct allows creating conditional regular expressions that can be used in combination with look-ahead or look-behind constructs. But this functionality would clearly go beyond our gentle introduction. See http://www.regular-expressions.info or the resources mentioned in Section E.4 for these features.

A final caveat

Before we conclude this section with some final practice exercises, we should emphasize that regular expressions can be beautiful creatures, but should nevertheless be used with caution. Quite often, an overly complicated regex can be replaced by two or three simpler steps. For instance, if we were to search for all words with exactly 10 letters, we could use any of the following regular expressions:

However, this particular task could also be solved by remembering some simple base R functions:
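As a sketch of this comparison, assume a short made-up vector of words w (an assumption; the regex and the nchar() solution shown here are just one possible version of each approach):

```r
# Hypothetical example data (an assumption):
w <- c("strawberry", "apple", "watermelon", "blackberries")

# Regex solution: exactly 10 arbitrary characters between start and end
by_regex <- grep("^.{10}$", w, value = TRUE)  # "strawberry" "watermelon"

# Base R solution without regular expressions: count the characters
by_nchar <- w[nchar(w) == 10]                 # "strawberry" "watermelon"
```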

Thus, regular expressions are powerful tools, but are complemented by and used in combination with other tools (e.g., the str_count() function discussed in the Strings of text chapter, see Section 9.4).

Practice

Here are some final practice tasks to check your regular expression skills.

  1. Finding four-letter words:

We now can define the three patterns described in the introduction above (see Section E.2). Specifically, a pattern that

  • finds only the word text;
  • finds any word beginning and ending with the letter t;
  • finds any four-letter word.

Write regular expressions matching these patterns and demonstrate their results on the character string tst:

  1. Finding fruits:

We can further practice our regex skills on the ds4psy collection of fruits (transformed into lowercase letters):

Answer the following questions about fruits:

  • Does fruits include “ananas” or “kiwi”?
  • Which types of “berry” starting with the letters “b” or “c” are included in fruits?
  • Which fruits start with the letters X to Z?
  • Which fruits start and end on a vowel?
  • Which fruits contain the same letter five times?
  • Which fruits contain a palindrome (i.e., a letter sequence that reads the same in both directions, like “anana”)?
  1. Matching fruits:

Predict, evaluate, and explain the results of the following searches:

  1. Matching sentence borders:

Use a base R command for finding Bushisms that contain

  • a dot ., followed by at least one space, and a capital letter
  • a question mark ?, followed by at least one space, and a capital letter

Hint: Remember to set perl = TRUE for enabling look-around functionality in base R commands.

  1. Matching articles:

Use str_view_all() commands for viewing indefinite or definite articles (i.e., “a” or “the”) in Bushisms. Specifically, create regular expressions that match

  • all instances of “a” or “the”
  • all instances of “a” or “the” followed by a word
  • all words preceded by “a” or “the”

Hint: Word boundaries can be matched by \b.

See Section E.4 for additional resources on regular expressions.

References

Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185(4157), 1124–1131. https://doi.org/10.1126/science.185.4157.1124

Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz


  1. When using R or R Markdown, it can be quite confusing that the backslash \ and the ASCII accent grave (used to invoke a code environment in R Markdown) appear in many different roles and meanings.

  2. Actually, we are cheating a bit here: Searching for patterns that begin with a capital letter or end on a punctuation mark only identifies the beginning or end of sentences when these are already stored as separate strings — in which case we could simply match their first or last character. Hence, functions that aim to identify sentences in longer passages of text (like text_to_sentences() in ds4psy) need to be smarter than this.

  3. By default, R uses extended regular expressions. Setting perl = TRUE switches to the PCRE library similar to Perl 5.x (see ?base::regex for details).