Pattern matching isn’t like direct string comparison, even at its simplest level;
it’s more like string searching with mutant wildcards on steroids.
Tom Christiansen and Nathan Torkington (2003, p. 180)
The notion of a regular expression doesn’t sound like an awful lot of fun. While regex sounds a little sexier, our introductory reference to theoretical computer science and formal language theory may curb our enthusiasm. Fortunately, the above quote from the Perl Cookbook (Christiansen & Torkington, 2003) promises some excitement in pattern matching. But before the fun can start, we need to clarify some terminology and specify the data and tools that will allow us to unleash the mysteries of regex magic.
E.1.1 What are regular expressions?
A regular expression (a.k.a. regex or regexp) is a sequence of characters that defines a pattern or regularity in strings of text. Rather than always requiring an exact match for a character sequence, regular expressions allow us to generalize our searches to symbol sequences that correspond to more abstract patterns. As this is immensely useful, many computer languages provide regex functionality for finding patterns in strings of text (see Wikipedia: Regular expressions for details). Whereas most characters will simply match themselves, some symbols assume special meanings when they are used within a regular expression. And just as in natural languages, some symbols change their meaning based on the context in which they are being used.
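As a minimal sketch of this distinction, the following base R example uses a small hypothetical vector x (not part of the test data used in this appendix). It contrasts a pattern of ordinary characters with a pattern that contains the symbol ".", which matches any single character in a regular expression unless we request a literal comparison via fixed = TRUE:

```r
# Hypothetical test strings (for illustration only):
x <- c("cat", "c.t", "cut", "dog")

# Ordinary characters simply match themselves:
m_plain <- grepl(pattern = "at", x = x)
m_plain
#> [1]  TRUE FALSE FALSE FALSE

# Within a regular expression, "." matches any single character:
m_regex <- grepl(pattern = "c.t", x = x)
m_regex
#> [1]  TRUE  TRUE  TRUE FALSE

# With fixed = TRUE, the pattern is treated as a literal string:
m_fixed <- grepl(pattern = "c.t", x = x, fixed = TRUE)
m_fixed
#> [1] FALSE  TRUE FALSE FALSE
```

The same three characters thus select different elements, depending on whether "." is read as a special symbol or as a plain period.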
To write and use regular expressions, we must acquire a compact language for defining patterns. As regular expressions can get quite cryptic, they often look a bit scary to the uninitiated. Whereas a formal description of regex operators and rules can be daunting, learning them step-by-step is much easier. To get you started, this appendix provides a gentle introduction that covers the basics.
E.1.2 Getting started
To study regular expressions, we need
some test data that allows us to search for pattern matches in strings of text, and
some functions that show us the results of pattern matching operations in these strings of text.
Data to search
As practice data, we will primarily use some vectors (typically strings of characters, words, or sentences) provided by the R packages ds4psy (Neth, 2020) and stringr (Wickham, 2019b), but also define additional test data that contains strings with specific properties.
Specific test data will often be defined on the fly. For instance, the vector tests contains some character objects that we will aim to match later:
By contrast, test data provided by ds4psy and stringr is typically longer or more complicated. Here is an overview of the package-based data that we will use in the course of this tutorial:
# (B) from ds4psy: ------
# (a) words: ----
fruits <- ds4psy::fruits        # length(fruits)     # 122
countries <- ds4psy::countries  # length(countries)  # 197
Trumpisms <- ds4psy::Trumpisms  # length(Trumpisms)  # 96
# (b) sentences: ----
flowery <- ds4psy::flowery      # length(flowery)    # 60
Bushisms <- ds4psy::Bushisms    # length(Bushisms)   # 22
# (c) other vectors: ----
cclass <- ds4psy::cclass
metachar <- ds4psy::metachar
Umlaut <- ds4psy::Umlaut
# (C) from stringr: ------
words <- stringr::words          # length(words)      # 980
sentences <- stringr::sentences  # length(sentences)  # 720
Functions for matching patterns
The key functions for matching patterns and viewing the results of pattern matching operations are the base R functions grep() and the corresponding stringr functions str_view(). All these functions come with different options and in a variety of versions. At this point, all we care about is that they allow us to show the matches (or non-matches) of a pattern p in some string of text data s. All these functions distinguish between some data s and a pattern p. To illustrate them, we will use our vector tests (defined above) as the data s to be searched through and define a first pattern p that consists of the letters "at" (as a character string):
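A minimal sketch of this setup, assuming a short hypothetical stand-in for the data (the vector below is not the full vector tests, so the positions of matches will differ from the outputs shown in this section):

```r
# Hypothetical stand-in for the data to be searched (not the full tests vector):
s <- c("Hello world!",
       "The cat, sat mat, etc., fat dad.",
       "Been there, (seen it --- at last), done that.")

# A first pattern: the letters "at" (as a character string):
p <- "at"

# Quick check: which elements of s contain the pattern p?
grepl(pattern = p, x = s)
#> [1] FALSE  TRUE  TRUE
```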
- Here are examples of using the base R family of grep() functions for viewing pattern matches:
# base R functions:
grep(pattern = p, x = s)  # positions of matches
#> [1]  8 10
grepl(pattern = p, x = s)  # detect matches (yields logical vector)
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
grep(pattern = p, x = s, value = TRUE)  # obtain matches (yields value of matches)
#> [1] "The cat, sat mat, etc., fat dad."
#> [2] "Been there, (seen it --- at last), done that."
- The stringr package provides a larger and more systematic family of functions. Their names typically begin with str_ and then add a verb that indicates the specific functionality performed by the function. For instance, str_detect() helps to detect a pattern in a string (see grepl(pattern = p, x = s) above), whereas str_subset() returns the subset of objects in string that match the pattern (see grep(pattern = p, x = s, value = TRUE) above):
# Corresponding stringr functions:
library(stringr)
str_detect(string = s, pattern = p)  # detect matches (yields logical vector)
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
str_subset(string = s, pattern = p)  # obtain matches (yields value of matches)
#> [1] "The cat, sat mat, etc., fat dad."
#> [2] "Been there, (seen it --- at last), done that."
Note that the grep() and the str_...() functions share the pattern argument, but the argument x of grep() corresponds to the string argument in str_...() functions. Also, the order of arguments is reversed, which matters if we are lazy and omit argument names. Hence, the following pairs of function calls (note the reversal of their arguments) yield exactly the same results:
# Task: Detect matches:
grepl(p, s)
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
str_detect(s, p)
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE

# Task: Obtain matching objects:
grep(p, s, value = TRUE)
#> [1] "The cat, sat mat, etc., fat dad."
#> [2] "Been there, (seen it --- at last), done that."
str_subset(s, p)
#> [1] "The cat, sat mat, etc., fat dad."
#> [2] "Been there, (seen it --- at last), done that."
Actually, the last two commands are easily replaced by the first two, as the task of obtaining matching objects can be achieved by logical indexing:
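As a sketch of this equivalence in base R, assuming a short hypothetical stand-in for the data s (not the full vector tests): the logical vector returned by grepl() can be used to index s, which selects the same elements as grep() with value = TRUE. With stringr loaded, s[str_detect(s, p)] would play the same role for str_subset(s, p).

```r
# Hypothetical stand-in for the data and the pattern used above:
s <- c("Hello world!",
       "The cat, sat mat, etc., fat dad.",
       "Been there, (seen it --- at last), done that.")
p <- "at"

# Obtain matching objects by logical indexing:
by_index <- s[grepl(p, s)]

# Obtain matching objects directly:
by_value <- grep(p, s, value = TRUE)

# Both approaches select the same elements:
identical(by_index, by_value)
#> [1] TRUE
```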
- If we are primarily interested in viewing matches, the stringr family of str_view() functions, and particularly str_view_all() with its argument match = TRUE, is the most convenient command to use, as it directly highlights the matches to a pattern p in the string of text s:
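When stringr is not at hand, a rough base R stand-in can make matches visible in plain text. The following sketch is an assumption-laden illustration (hypothetical data, and angle brackets chosen arbitrarily as markers), not a reproduction of the highlighted output of the str_view() functions:

```r
# Hypothetical stand-in for the data and the pattern used above:
s <- c("Hello world!", "The cat, sat mat, etc., fat dad.")
p <- "at"

# Mark every match by wrapping it in angle brackets
# (the parentheses capture the match, "\\1" re-inserts it):
marked <- gsub(pattern = paste0("(", p, ")"), replacement = "<\\1>", x = s)
marked
#> [1] "Hello world!"
#> [2] "The c<at>, s<at> m<at>, etc., f<at> dad."

# Count the matches in each string:
n_matches <- lengths(regmatches(s, gregexpr(p, s)))
n_matches
#> [1] 0 4
```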
As it can be instructive to use alternative functions to find matches or non-matches, we will occasionally vary between different functions or versions. It is good to know that their argument pairs (pattern and x for grep(), string and pattern for str_view()) are used by a much larger family of functions (see Sections 9.3 and 9.4 of Chapter 9 on Strings of text). But as our current focus is on defining regular expressions, we merely use the function that shows us the result of our pattern matching operations.
E.1.3 Looking up references
By the way: Unless they regularly work with text data, most people do not memorize all the details of regular expressions. So do not despair when the following range of options seems a bit overwhelming at first, and remember that our R system typically provides quite extensive documentation. For regular expressions, this documentation is available via typing:
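In a standard R installation, one way of looking up this help page is sketched below (not necessarily the exact call intended here). The object name h is a hypothetical choice:

```r
# Look up the base R help topic on regular expressions:
h <- help("regex")

# Printing the result displays the page in an interactive session:
if (interactive()) print(h)

# Interactive shorthand for the same lookup: ?regex
```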
For a compact and useful overview of R’s regex options, see the RStudio cheatsheet on Basic Regular Expressions in R (contributed by Ian Kopacka).
The back page of the RStudio cheatsheet on the stringr package also contains a useful overview on regular expressions.
Christiansen, T., & Torkington, N. (2003). Perl Cookbook: Solutions & examples for Perl programmers (2nd ed.). O’Reilly Media, Inc.
Neth, H. (2020). ds4psy: Data science for psychologists. Retrieved from https://CRAN.R-project.org/package=ds4psy
Wickham, H. (2019b). stringr: Simple, consistent wrappers for common string operations. Retrieved from https://CRAN.R-project.org/package=stringr