E.1 Introduction

Pattern matching isn’t like direct string comparison, even at its simplest level;
it’s more like string searching with mutant wildcards on steroids.

Tom Christiansen and Nathan Torkington (2003, p. 180)

The notion of a regular expression doesn’t sound like an awful lot of fun. While regex sounds a little sexier, our introductory reference to theoretical computer science and formal language theory may curb our enthusiasm. Fortunately, the above quote from the Perl Cookbook (Christiansen & Torkington, 2003) promises some excitement in pattern matching. But before the fun can start, we need to clarify some terminology and specify the data and tools that will allow us to unleash the mysteries of regex magic.

E.1.1 What are regular expressions?

A regular expression (a.k.a. regex or regexp) is a sequence of characters that defines a pattern or regularity in strings of text. Rather than always requiring an exact match for a character sequence, regular expressions allow us to generalize our searches to symbol sequences that correspond to more abstract patterns. As this is immensely useful, many computer languages provide regex functionality for finding patterns in strings of text (see Wikipedia: Regular expressions for details). Whereas most characters will simply match themselves, some symbols assume special meanings when they are used within a regular expression. And just as in natural languages, some symbols change their meaning based on the context in which they are being used.
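As a minimal sketch of these ideas, the following lines use a small vector x that is made up for this illustration (it is not part of the test data defined below) and the base R function grepl(), which is introduced later in this section. They show a character that matches itself, a symbol with a special meaning, and the same symbol losing its special meaning in a different context:

```r
# Hypothetical mini-example (x is not part of the test data used later):
x <- c("abc", "a.c", "ac")

# Ordinary characters match themselves:
grepl(pattern = "b", x = x)
#> [1]  TRUE FALSE FALSE

# The dot is special: It matches any single character:
grepl(pattern = "a.c", x = x)
#> [1]  TRUE  TRUE FALSE

# Context matters: Inside square brackets, the dot is an ordinary dot:
grepl(pattern = "a[.]c", x = x)
#> [1] FALSE  TRUE FALSE

# Setting fixed = TRUE treats the entire pattern as literal text:
grepl(pattern = "a.c", x = x, fixed = TRUE)
#> [1] FALSE  TRUE FALSE
```

The details of such special symbols are covered step-by-step later; here, the point is only that the same character can act as a literal or as an operator.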

To write and use regular expressions, we must acquire a compact language for defining patterns. As regular expressions can get quite cryptic, they often look a bit scary to the uninitiated. Whereas a formal description of regex operators and rules can be daunting, learning them step-by-step is much easier. To get you started, this appendix provides a gentle introduction that covers the basics.

E.1.2 Getting started

Background

To study regular expressions, we need

some test data that allows us to search for pattern matches in strings of text, and
some functions that show us the results of pattern matching operations in these strings of text.

Data to search

As practice data, we will primarily use some vectors (typically strings of characters, words, or sentences) provided by the R packages ds4psy (Neth, 2022) and stringr (Wickham, 2019b), but also define additional test data that contains strings with specific properties.

Specific test data will often be defined on the fly. For instance, the vector tests contains some character objects that we will aim to match later:

# (A) Data definitions: ------ 
tests <- c("abc", "ABC", "a.c", "a_c", "a\\c", "ac/dc", 
           "2+4=6  0-9  2^3  2518 9612708", 
           "The cat, sat mat, etc., fat dad.", "Us or them?", 
           "Been there, (seen it --- at last), done that.", 
           "Not act, is bad, so sad!")

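One detail of these definitions is easy to overlook: The string "a\\c" contains only a single backslash, as the backslash must itself be escaped when typing a string in R. A minimal sketch, using a shortened copy of the data (tests_demo is a name made up for this illustration), shows the difference between the printed and the actual characters:

```r
# Illustrative subset of tests (tests_demo is a hypothetical name):
tests_demo <- c("abc", "a\\c", "ac/dc")

print(tests_demo[2])       # print() shows the escaped form
#> [1] "a\\c"
writeLines(tests_demo[2])  # writeLines() shows the actual characters
#> a\c
nchar(tests_demo[2])       # only 3 characters: a, \, and c
#> [1] 3
```
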
By contrast, test data provided by ds4psy and stringr is typically longer or more complicated. Here is an overview of the package-based data that we will use in the course of this tutorial:

# (B) from ds4psy: ------ 

# (a) words: ---- 
fruits <- ds4psy::fruits
# length(fruits)  # 122
countries <- ds4psy::countries
# length(countries)  # 197
Trumpisms <- ds4psy::Trumpisms
# length(Trumpisms)  # 96

# (b) sentences: ---- 
flowery <- ds4psy::flowery
# length(flowery)  # 60
Bushisms <- ds4psy::Bushisms 
# length(Bushisms)  # 22

# (c) other vectors: ---- 
cclass   <- ds4psy::cclass
metachar <- ds4psy::metachar
Umlaut   <- ds4psy::Umlaut


# (C) from stringr: ------ 
words <- stringr::words
# length(words)  # 980
sentences <- stringr::sentences
# length(sentences)  # 720

Functions for matching patterns

The key functions for matching patterns and viewing the results of pattern matching operations are the base R function grep() (with its variants) and the corresponding stringr functions str_detect(), str_subset(), and str_view(). All these functions come with different options and in a variety of versions. At this point, all we care about is that they show us which parts of some text data s match (or fail to match) a pattern p.

All these functions distinguish between some data s and a pattern p. To illustrate them, we will use our vector tests (defined above) as the data s to be searched through and define a first pattern p that consists of the letters "at" (as a character string):

s <- tests  # some data
p <- "at"   # some pattern

Here are examples of using the base R family of grep() functions for viewing pattern matches:

# base R functions:
grep(pattern = p,  x = s)  # positions of matches
#> [1]  8 10
grepl(pattern = p, x = s)  # detect matches (yields logical vector)
#>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
grep(pattern = p,  x = s, value = TRUE)  # obtain matches (yields value of matches)
#> [1] "The cat, sat mat, etc., fat dad."             
#> [2] "Been there, (seen it --- at last), done that."
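Non-matches can be obtained in a similar way. The following lines are a minimal sketch that assumes a shortened copy of our test data (s_demo and p_demo are names made up for this illustration, so that s and p remain unchanged). They use the invert argument of grep() or negate the result of grepl():

```r
# Shortened copy of the test data (hypothetical names s_demo and p_demo):
s_demo <- c("abc",
            "The cat, sat mat, etc., fat dad.",
            "Us or them?",
            "Not act, is bad, so sad!")
p_demo <- "at"

# Positions of elements that do NOT match the pattern:
grep(pattern = p_demo, x = s_demo, invert = TRUE)
#> [1] 1 3 4

# Values of elements that do NOT match the pattern:
non_matches <- grep(pattern = p_demo, x = s_demo, invert = TRUE, value = TRUE)

# Negating grepl() yields the corresponding logical vector:
!grepl(pattern = p_demo, x = s_demo)
#> [1]  TRUE FALSE  TRUE  TRUE
```

Note that "Not act, is bad, so sad!" counts as a non-match: It contains the letters a and t, but never directly next to each other.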

The stringr package provides a larger and more systematic family of functions. Their names typically begin with str_ and then add a verb that indicates the specific functionality performed by the function. For instance, str_detect() detects a pattern in a string (as in grepl(pattern = p, x = s) above), whereas str_subset() returns the subset of objects in string that match the pattern (as in grep(pattern = p, x = s, value = TRUE) above):

# Corresponding stringr functions:
library(stringr)
str_detect(string = s, pattern = p)  # detect matches (yields logical vector)
#>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
str_subset(string = s, pattern = p)  # obtain matches (yields value of matches)
#> [1] "The cat, sat mat, etc., fat dad."             
#> [2] "Been there, (seen it --- at last), done that."

Note that the grep() and the str_...() functions share the pattern argument, but the argument x in grep() corresponds to the string argument in str_...() functions. Also, the order of arguments is reversed, which matters if we are lazy and omit argument names. Hence, the following pairs of function calls — note the reversal of their arguments — yield exactly the same results:

# Task: Detect matches:
grepl(p, s)
#>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
str_detect(s, p)
#>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE

# Task: Obtain matching objects:
grep(p, s, value = TRUE) 
#> [1] "The cat, sat mat, etc., fat dad."             
#> [2] "Been there, (seen it --- at last), done that."
str_subset(s, p)
#> [1] "The cat, sat mat, etc., fat dad."             
#> [2] "Been there, (seen it --- at last), done that."

Actually, the last two commands are easily replaced by the first two, as the task of obtaining matching objects can be achieved by logical indexing:

# Task: Obtain matching objects (by logical indexing): 
s[grepl(p, s)]
#> [1] "The cat, sat mat, etc., fat dad."             
#> [2] "Been there, (seen it --- at last), done that."
s[str_detect(s, p)]
#> [1] "The cat, sat mat, etc., fat dad."             
#> [2] "Been there, (seen it --- at last), done that."
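The equivalence of these approaches can also be checked explicitly. As a minimal sketch (again using a shortened copy of our data under the made-up names s_demo and p_demo), identical() confirms that logical indexing yields the same objects as grep() with value = TRUE, and that which() recovers the positions returned by grep():

```r
# Shortened copy of the test data (hypothetical names s_demo and p_demo):
s_demo <- c("abc",
            "The cat, sat mat, etc., fat dad.",
            "Us or them?",
            "Been there, (seen it --- at last), done that.")
p_demo <- "at"

# Logical indexing vs. grep(..., value = TRUE):
identical(s_demo[grepl(p_demo, s_demo)],
          grep(p_demo, s_demo, value = TRUE))
#> [1] TRUE

# which() turns the logical vector into positions, as returned by grep():
identical(which(grepl(p_demo, s_demo)),
          grep(p_demo, s_demo))
#> [1] TRUE
```
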

If we are primarily interested in viewing matches, the stringr family of str_view() functions (and particularly str_view_all() with its argument match = TRUE) is the most convenient choice, as it directly highlights the matches to a pattern p in the string of text s:

# Task: View (and print) matching objects: 
str_view(string = s, pattern = p)  # show 1st matches in s

str_view_all(s, p)  # show ALL matches in s
str_view_all(s, p, match = TRUE)  # only view ALL positive matches
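
As the highlighted output of these functions is not reproduced here, a rough base R counterpart may help to see what "all matches" means. The following lines are only a sketch (not the way in which stringr works internally) and use a shortened copy of our data under the made-up names s_demo and p_demo. They combine gregexpr() and regmatches() to extract every match and count the matches per string:

```r
# Shortened copy of the test data (hypothetical names s_demo and p_demo):
s_demo <- c("abc",
            "The cat, sat mat, etc., fat dad.",
            "Been there, (seen it --- at last), done that.")
p_demo <- "at"

# Locate ALL matches in each string:
m <- gregexpr(pattern = p_demo, text = s_demo)

# Extract the matched substrings (a list with one element per string):
all_matches <- regmatches(x = s_demo, m = m)

# Count the matches per string:
lengths(all_matches)
#> [1] 0 4 2
```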

As it can be instructive to use alternative functions to find matches or non-matches, we will occasionally vary between different functions or versions. It is good to know that their argument pairs — pattern and x for grep() vs. string and pattern for str_view() — are used by a much larger family of functions (see Sections 9.3 and 9.4 of Chapter 9 on Strings of text). But as our current focus is on defining regular expressions, we merely use the function that shows us the result of our pattern matching operations.
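
To avoid confusing the two argument orders, we can inspect the argument names of a function and use named arguments, which makes the order irrelevant. A minimal sketch in base R (with s_demo and p_demo as names made up for this illustration):

```r
# Hypothetical mini-data:
s_demo <- c("abc", "The cat, sat mat, etc., fat dad.", "Us or them?")
p_demo <- "at"

# The first two arguments of grepl() are pattern and x:
names(formals(grepl))[1:2]
#> [1] "pattern" "x"

# With named arguments, the order no longer matters:
identical(grepl(p_demo, s_demo),
          grepl(x = s_demo, pattern = p_demo))
#> [1] TRUE
```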

E.1.3 Looking up references

By the way: Unless they regularly work with text data, most people do not memorize all the details of regular expressions. So do not despair when the following range of options seems a bit overwhelming at first, and remember that R typically provides quite extensive documentation. For regular expressions, this documentation is available by typing:

# Help on regular expressions in R:
?base::regex

For a compact and useful overview of R’s regex options, see the RStudio cheatsheet on Basic Regular Expressions in R (contributed by Ian Kopacka):

Figure E.1: Basic regular expressions in R (by Ian Kopacka), available at RStudio cheatsheets (https://www.rstudio.com/resources/cheatsheets/).

The back page of the RStudio cheatsheet on the stringr package also contains a useful overview on regular expressions.

References

Christiansen, T., & Torkington, N. (2003). Perl Cookbook: Solutions & examples for Perl programmers (2nd ed.). O’Reilly Media, Inc.

Neth, H. (2022). ds4psy: Data science for psychologists. Retrieved from https://CRAN.R-project.org/package=ds4psy

Wickham, H. (2019b). stringr: Simple, consistent wrappers for common string operations. Retrieved from https://CRAN.R-project.org/package=stringr