E.1 Introduction
Pattern matching isn’t like direct string comparison, even at its simplest level;
it’s more like string searching with mutant wildcards on steroids.
Tom Christiansen and Nathan Torkington (2003, p. 180)
The notion of a regular expression doesn’t sound like an awful lot of fun. While regex sounds a little sexier, our introductory reference to theoretical computer science and formal language theory may curb our enthusiasm. Fortunately, the above quote from the Perl Cookbook (Christiansen & Torkington, 2003) promises some excitement in pattern matching. But before the fun can start, we need to clarify some terminology and specify the data and tools that will allow us to unleash the mysteries of regex magic.
E.1.1 What are regular expressions?
A regular expression (a.k.a. regex or regexp) is a sequence of characters that defines a pattern or regularity in strings of text. Rather than always requiring an exact match for a character sequence, regular expressions allow us to generalize our searches to symbol sequences that correspond to more abstract patterns. As this is immensely useful, many computer languages provide regex functionality for finding patterns in strings of text (see Wikipedia: Regular expressions for details). Whereas most characters will simply match themselves, some symbols assume special meanings when they are used within a regular expression. And just as in natural languages, some symbols change their meaning based on the context in which they are being used.
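As a minimal sketch of this distinction (the vector x and all object names are hypothetical and not part of the test data used below), the following base R calls contrast ordinary characters with the metacharacter . (which matches any single character) and show one symbol whose meaning depends on context:

```r
# Hypothetical example strings (not from the tutorial's data):
x <- c("abc", "a.c", "a_c", "ac")

# Ordinary characters simply match themselves:
lit <- grepl(pattern = "bc", x = x)

# The dot is a metacharacter that matches any single character:
dot_any <- grepl(pattern = "a.c", x = x)

# Escaping the dot (or using fixed = TRUE) restores its literal meaning:
dot_esc <- grepl(pattern = "a\\.c", x = x)
dot_fix <- grepl(pattern = "a.c", x = x, fixed = TRUE)

# Context matters: ^ anchors a match to the start of a string,
# but negates a character class when it is the first symbol inside [ ]:
caret_start <- grepl(pattern = "^a", x = c("abc", "cab"))
caret_neg   <- grepl(pattern = "[^a]", x = c("aaa", "aba"))
```

Here, dot_any is TRUE for the first three strings, whereas dot_esc and dot_fix are TRUE only for "a.c".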
To write and use regular expressions, we must acquire a compact language for defining patterns. As regular expressions can get quite cryptic, they often look a bit scary to the uninitiated. Whereas a formal description of regex operators and rules can be daunting, learning them step-by-step is much easier. To get you started, this appendix provides a gentle introduction that covers the basics.
E.1.2 Getting started
Background
To study regular expressions, we need:
- some test data that allows us to search for pattern matches in strings of text, and
- some functions that show us the results of pattern matching operations in these strings of text.
Data to search
As practice data, we will primarily use some vectors (typically strings of characters, words, or sentences) provided by the R packages ds4psy (Neth, 2023) and stringr (Wickham, 2022), but also define additional test data that contains strings with specific properties.
Specific test data will often be defined on the fly. For instance, the vector tests contains some character objects that we will aim to match later:
# (A) Data definitions: ------
tests <- c("abc", "ABC", "a.c", "a_c", "a\\c", "ac/dc",
"2+4=6 0-9 2^3 2518 9612708",
"The cat, sat mat, etc., fat dad.", "Us or them?",
"Been there, (seen it --- at last), done that.",
"Not act, is bad, so sad!")
By contrast, test data provided by ds4psy and stringr is typically longer or more complicated. Here is an overview of the package-based data that we will use in the course of this tutorial:
# (B) from ds4psy: ------
# (a) words: ----
fruits <- ds4psy::fruits
# length(fruits) # 122
countries <- ds4psy::countries
# length(countries) # 197
Trumpisms <- ds4psy::Trumpisms
# length(Trumpisms) # 96
# (b) sentences: ----
flowery <- ds4psy::flowery
# length(flowery) # 60
Bushisms <- ds4psy::Bushisms
# length(Bushisms) # 22
# (c) other vectors: ----
cclass <- ds4psy::cclass
metachar <- ds4psy::metachar
Umlaut <- ds4psy::Umlaut
# (C) from stringr: ------
words <- stringr::words
# length(words) # 980
sentences <- stringr::sentences
# length(sentences) # 720
Functions for matching patterns
The key functions for matching patterns and viewing the results of pattern matching operations are the base R function grep() (and its variants) and the corresponding stringr functions str_detect(), str_subset(), and str_view(). All these functions come with different options and in a variety of versions. At this point, all we care about is that they allow us to show which elements of some text data s match (or fail to match) a pattern p.
All these functions distinguish between some data s and a pattern p. To illustrate them, we will use our vector tests (defined above) as the data s to be searched through and define a first pattern p that consists of the letters "at" (as a character string).
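The definitions of s and p are not shown at this point. A minimal sketch of what they presumably look like, reconstructed from the description above (the vector tests is repeated only to keep the example self-contained):

```r
# Test data (as defined above):
tests <- c("abc", "ABC", "a.c", "a_c", "a\\c", "ac/dc",
           "2+4=6 0-9 2^3 2518 9612708",
           "The cat, sat mat, etc., fat dad.", "Us or them?",
           "Been there, (seen it --- at last), done that.",
           "Not act, is bad, so sad!")

s <- tests  # data to be searched (assumed assignment)
p <- "at"   # pattern to search for (as a character string)

length(s)    # number of elements in s
nchar(s[5])  # "a\\c" stores a single backslash, i.e., 3 characters
```

With these assumed definitions, the function calls in the following examples should reproduce the outputs shown.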
- Here are examples of using the base R family of grep() functions for viewing pattern matches:
# base R functions:
grep(pattern = p, x = s) # positions of matches
#> [1] 8 10
grepl(pattern = p, x = s) # detect matches (yields logical vector)
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
grep(pattern = p, x = s, value = TRUE) # obtain matches (yields value of matches)
#> [1] "The cat, sat mat, etc., fat dad."
#> [2] "Been there, (seen it --- at last), done that."
- The stringr package provides a larger and more systematic family of functions. Their names typically begin with str_ and then add some verb that indicates the specific functionality performed by the function. For instance, str_detect() helps to detect a pattern in a string (as in grepl(pattern = p, x = s) above), whereas str_subset() returns the subset of objects in string that match the pattern (as in grep(pattern = p, x = s, value = TRUE) above):
# Corresponding stringr functions:
library(stringr)
str_detect(string = s, pattern = p) # detect matches (yields logical vector)
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
str_subset(string = s, pattern = p) # obtain matches (yields value of matches)
#> [1] "The cat, sat mat, etc., fat dad."
#> [2] "Been there, (seen it --- at last), done that."
Note that the grep() and the str_...() functions share the pattern argument, but the argument x in grep() corresponds to the string argument in str_...() functions. Also, the order of arguments is reversed, which matters if we are lazy and omit argument names. Hence, the following pairs of function calls (note the reversal of their arguments) yield exactly the same results:
# Task: Detect matches:
grepl(p, s)
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
str_detect(s, p)
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
# Task: Obtain matching objects:
grep(p, s, value = TRUE)
#> [1] "The cat, sat mat, etc., fat dad."
#> [2] "Been there, (seen it --- at last), done that."
str_subset(s, p)
#> [1] "The cat, sat mat, etc., fat dad."
#> [2] "Been there, (seen it --- at last), done that."
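When in doubt about the order of arguments, naming them makes their order irrelevant. A brief sketch of this point, using a shortened stand-in for s (an assumption made for brevity) and assuming that the stringr package is installed:

```r
# Shortened stand-in for the data s (hypothetical, for illustration only):
s <- c("abc", "The cat, sat mat, etc., fat dad.", "Us or them?")
p <- "at"

# Positional arguments: the order matters (pattern first in grepl()):
by_position <- grepl(p, s)

# Named arguments: the order no longer matters:
by_name     <- grepl(x = s, pattern = p)
str_by_name <- stringr::str_detect(pattern = p, string = s)

# All three calls should yield the same logical vector:
same_result <- identical(by_position, by_name) &&
  identical(by_position, str_by_name)
same_result
```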
Actually, the last two commands are easily replaced by the first two, as the task of obtaining matching objects can be achieved by logical indexing:
#> [1] "The cat, sat mat, etc., fat dad."
#> [2] "Been there, (seen it --- at last), done that."
#> [1] "The cat, sat mat, etc., fat dad."
#> [2] "Been there, (seen it --- at last), done that."
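The commands that produced this output are not shown. Presumably, they index s by the logical vectors returned by grepl() and str_detect(). A minimal base R sketch of this idea, using a shortened stand-in for s (an assumption made for brevity):

```r
# Shortened stand-in for the data s (hypothetical, for illustration only):
s <- c("abc", "The cat, sat mat, etc., fat dad.", "Us or them?")
p <- "at"

# Logical indexing: keep only those elements of s for which a match is detected:
via_index <- s[grepl(p, s)]

# Same result as asking grep() for the matching values directly:
via_value <- grep(p, s, value = TRUE)

identical(via_index, via_value)
# With stringr loaded, s[str_detect(s, p)] would work analogously.
```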
- If we are primarily interested in viewing matches, the stringr family of str_view() functions (and particularly str_view_all() with its argument match = TRUE) is the most convenient command to use, as it directly highlights the matches to a pattern p in the string of text s:
#> [8] │ The c<at>, s<at> m<at>, etc., f<at> dad.
#> [10] │ Been there, (seen it --- <at> last), done th<at>.
#> [1] │ abc
#> [2] │ ABC
#> [3] │ a.c
#> [4] │ a_c
#> [5] │ a\c
#> [6] │ ac/dc
#> [7] │ 2+4=6 0-9 2^3 2518 9612708
#> [8] │ The c<at>, s<at> m<at>, etc., f<at> dad.
#> [9] │ Us or them?
#> [10] │ Been there, (seen it --- <at> last), done th<at>.
#> [11] │ Not act, is bad, so sad!
#> [8] │ The c<at>, s<at> m<at>, etc., f<at> dad.
#> [10] │ Been there, (seen it --- <at> last), done th<at>.
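The calls that produced the output above are not shown. A minimal sketch of how such views could be obtained, assuming stringr version 1.5.0 or later (in which str_view() highlights all matches and its match argument selects which elements are displayed) and using a shortened stand-in for s:

```r
library(stringr)

# Shortened stand-in for the data s (hypothetical, for illustration only):
s <- c("abc", "The cat sat.", "Us or them?", "done that")
p <- "at"

# Show only those elements of s that contain a match (the default, match = TRUE):
str_view(s, p)

# Show all elements of s, highlighting matches where they occur:
str_view(s, p, match = NA)

# Positions of the elements that contain a match:
hits <- which(str_detect(s, p))
hits
```

In older versions of stringr, str_view() highlighted only the first match per string, which is why str_view_all() was the more useful variant there.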
As it can be instructive to use alternative functions to find matches or non-matches, we will occasionally vary between different functions or versions. It is good to know that their argument pairs (pattern and x for grep() vs. string and pattern for str_view()) are used by a much larger family of functions (see Sections 9.3 and 9.4 of Chapter 9 on Strings of text). But as our current focus is on defining regular expressions, we merely use the function that shows us the result of our pattern matching operations.
E.1.3 Looking up references
By the way: Unless they regularly work with text data, most people do not memorize all the details of regular expressions. So do not despair when the following range of options seems a bit overwhelming at first, and remember that our R system typically provides quite extensive documentation. For regular expressions, this documentation is available by typing ?regex at the R console.
For a compact and useful overview of R’s regex options, see the Posit cheatsheet on Basic Regular Expressions in R (contributed by Ian Kopacka).
The back page of the Posit cheatsheet on the stringr package also contains a useful overview of regular expressions.