E.1 Introduction

Pattern matching isn’t like direct string comparison, even at its simplest level;
it’s more like string searching with mutant wildcards on steroids.

Tom Christiansen and Nathan Torkington (2003, p. 180)

The notion of a regular expression doesn’t sound like an awful lot of fun. While regex sounds a little sexier, our introductory reference to theoretical computer science and formal language theory may curb our enthusiasm. Fortunately, the above quote from the Perl Cookbook (Christiansen & Torkington, 2003) promises some excitement in pattern matching. But before the fun can start, we need to clarify some terminology and specify the data and tools that will allow us to unleash the mysteries of regex magic.

E.1.1 What are regular expressions?

A regular expression (a.k.a. regex or regexp) is a sequence of characters that defines a pattern or regularity in strings of text. Rather than always requiring an exact match for a character sequence, regular expressions allow us to generalize our searches to symbol sequences that correspond to more abstract patterns. As this is immensely useful, many computer languages provide regex functionality for finding patterns in strings of text (see Wikipedia: Regular expressions for details). Whereas most characters will simply match themselves, some symbols assume special meanings when they are used within a regular expression. And just as in natural languages, some symbols change their meaning based on the context in which they are being used.
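As a minimal sketch of these last two points, the following base R calls use two small hypothetical vectors (x and y, which are not part of this appendix's test data). They illustrate that the dot is a special symbol within a regular expression, and that the caret changes its meaning depending on whether it appears at the start of a pattern or inside square brackets:

```r
# Hypothetical example strings (for illustration only):
x <- c("abc", "a.c", "ac")

# Within a regular expression, "." matches any single character:
dot_regex <- grepl(pattern = "a.c", x = x)
dot_regex
#> [1]  TRUE  TRUE FALSE

# With fixed = TRUE, the pattern is a literal string and "." only matches itself:
dot_fixed <- grepl(pattern = "a.c", x = x, fixed = TRUE)
dot_fixed
#> [1] FALSE  TRUE FALSE

# Escaping the dot (as "\\.") also yields its literal meaning within a regex:
dot_escaped <- grepl(pattern = "a\\.c", x = x)
dot_escaped
#> [1] FALSE  TRUE FALSE

# Context matters: "^" anchors a pattern to the start of a string,
# but negates a character class when it is the first symbol within [ ]:
y <- c("abc", "cab", "aaa")
caret_anchor <- grepl(pattern = "^a", x = y)    # strings starting with "a"
caret_anchor
#> [1]  TRUE FALSE  TRUE
caret_negate <- grepl(pattern = "[^a]", x = y)  # strings containing a non-"a" character
caret_negate
#> [1]  TRUE  TRUE FALSE
```

The details of these symbols are not important at this point. The sketch merely shows that the same character can act as a literal or as an operator.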

To write and use regular expressions, we must acquire a compact language for defining patterns. As regular expressions can get quite cryptic, they often look a bit scary to the uninitiated. Whereas a formal description of regex operators and rules can be daunting, learning them step-by-step is much easier. To get you started, this appendix provides a gentle introduction that covers the basics.

E.1.2 Getting started

Background

To study regular expressions, we need

  1. some test data that allows us to search for pattern matches in strings of text, and

  2. some functions that show us the results of pattern matching operations in these strings of text.

Functions for matching patterns

The key functions for matching patterns and viewing the results of pattern matching operations are the base R functions grep() and grepl(), and the corresponding stringr functions str_detect(), str_subset(), and str_view(). All these functions come with different options and in a variety of versions. At this point, all we care about is that they allow us to show the matches (or non-matches) in some string of text data s to a pattern p.

All these functions distinguish between some data s and a pattern p. To illustrate them, we will use our vector tests (defined above) as the data s to be searched through and define a first pattern p that consists of the letters "at" (as a character string):

s <- tests  # some data
p <- "at"   # some pattern
  • Here are examples of using the base R family of grep() functions for viewing pattern matches:
# base R functions:
grep(pattern = p,  x = s)  # positions of matches
#> [1]  8 10
grepl(pattern = p, x = s)  # detect matches (yields logical vector)
#>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
grep(pattern = p,  x = s, value = TRUE)  # obtain matches (yields value of matches)
#> [1] "The cat, sat mat, etc., fat dad."             
#> [2] "Been there, (seen it --- at last), done that."
  • The stringr package provides a larger and more systematic family of functions. Their names typically begin with str_, followed by a verb that indicates the specific functionality of the function. For instance, str_detect() detects a pattern in a string (as in grepl(pattern = p, x = s) above), whereas str_subset() returns the subset of objects in string that match the pattern (as in grep(pattern = p, x = s, value = TRUE) above):
# Corresponding stringr functions:
library(stringr)
str_detect(string = s, pattern = p)  # detect matches (yields logical vector)
#>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
str_subset(string = s, pattern = p)  # obtain matches (yields value of matches)
#> [1] "The cat, sat mat, etc., fat dad."             
#> [2] "Been there, (seen it --- at last), done that."

Note that the grep() and the str_...() functions share the pattern argument, but the argument x in grep() corresponds to the string argument in str_...() functions. Also, the order of arguments is reversed, which matters if we are lazy and omit argument names. Hence, the following pairs of function calls — note the reversal of their arguments — yield exactly the same results:

# Task: Detect matches:
grepl(p, s)
#>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
str_detect(s, p)
#>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE

# Task: Obtain matching objects:
grep(p, s, value = TRUE) 
#> [1] "The cat, sat mat, etc., fat dad."             
#> [2] "Been there, (seen it --- at last), done that."
str_subset(s, p)
#> [1] "The cat, sat mat, etc., fat dad."             
#> [2] "Been there, (seen it --- at last), done that."
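The role of argument order can be checked with a small self-contained sketch. The vector s_demo and the pattern p_demo are hypothetical stand-ins for s and p (so that the code runs without the test data), and the stringr comparison is only evaluated if the package is installed:

```r
# Hypothetical stand-ins for the data s and the pattern p:
s_demo <- c("The cat sat.", "A dog ran.", "Look at that!")
p_demo <- "at"

# Positional matching: grepl() expects the pattern first, then the data:
by_position <- grepl(p_demo, s_demo)
by_position
#> [1]  TRUE FALSE  TRUE

# Named arguments make the order irrelevant:
by_name <- grepl(x = s_demo, pattern = p_demo)
identical(by_position, by_name)
#> [1] TRUE

# str_detect() expects the data first, then the pattern:
if (requireNamespace("stringr", quietly = TRUE)) {
  same_as_stringr <- identical(stringr::str_detect(s_demo, p_demo), by_position)
  print(same_as_stringr)
}
```

Naming the arguments costs a few keystrokes, but guards against accidentally swapping data and pattern when switching between both families of functions.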

Actually, the last two commands are easily replaced by the first two, as the task of obtaining matching objects can be achieved by logical indexing:

# Task: Obtain matching objects (by logical indexing): 
s[grepl(p, s)]
#> [1] "The cat, sat mat, etc., fat dad."             
#> [2] "Been there, (seen it --- at last), done that."
s[str_detect(s, p)]
#> [1] "The cat, sat mat, etc., fat dad."             
#> [2] "Been there, (seen it --- at last), done that."
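The same logic extends to numeric positions and to non-matches. The following sketch again uses a hypothetical vector s_demo and pattern p_demo (rather than the test data s and p) to show how the outputs of grep() and grepl() relate to each other:

```r
# Hypothetical stand-ins for the data s and the pattern p:
s_demo <- c("The cat sat.", "A dog ran.", "Look at that!")
p_demo <- "at"

pos <- grep(p_demo, s_demo)   # numeric positions of matches
hit <- grepl(p_demo, s_demo)  # logical vector of matches

# Positions and logical values carry the same information:
identical(which(hit), pos)
#> [1] TRUE

# Numeric and logical indexing return the same matching objects:
matches <- s_demo[hit]
identical(matches, s_demo[pos])
#> [1] TRUE
identical(matches, grep(p_demo, s_demo, value = TRUE))
#> [1] TRUE

# Non-matches: Negate the logical vector or use invert = TRUE:
non_matches <- s_demo[!hit]
non_matches
#> [1] "A dog ran."
identical(non_matches, grep(p_demo, s_demo, value = TRUE, invert = TRUE))
#> [1] TRUE
```

For obtaining non-matches, negating the logical vector is safer than negative numeric indexing: If a pattern matches nothing, s_demo[-grep(p_demo, s_demo)] would return an empty vector rather than all elements.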
  • If we are primarily interested in viewing matches, the stringr family of str_view() functions (and particularly str_view_all() with its argument match = TRUE) is the most convenient choice, as it directly highlights the matches to a pattern p in the string of text s:
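# Task: View (and print) matching objects: 
# (Note: As of stringr 1.5.0, str_view_all() is deprecated and 
#  str_view() shows all matches in matching elements by default.)
str_view(string = s, pattern = p)  # show 1st matches in s
str_view_all(s, p) # show ALL matches in s
str_view_all(s, p, match = TRUE) # only view ALL positive matches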

As it can be instructive to use alternative functions to find matches or non-matches, we will occasionally vary between different functions or versions. It is good to know that their argument pairs — pattern and x for grep() vs. string and pattern for str_view() — are used by a much larger family of functions (see Sections 9.3 and 9.4 of Chapter 9 on Strings of text). But as our current focus is on defining regular expressions, we merely use the function that shows us the result of our pattern matching operations.

E.1.3 Looking up references

By the way: Unless they regularly work with text data, most people do not memorize all the details of regular expressions. So do not despair if the following range of options seems a bit overwhelming at first, and remember that our R system typically provides quite extensive documentation. For regular expressions, this documentation is available by typing:

# Help on regular expressions in R:
?base::regex

For a compact and useful overview of R’s regex options, see the RStudio cheatsheet on Basic Regular Expressions in R (contributed by Ian Kopacka):

Figure E.1: Basic regular expressions in R (by Ian Kopacka)
available at RStudio cheatsheets.

The back page of the RStudio cheatsheet on the stringr package also contains a useful overview of regular expressions.

References

Christiansen, T., & Torkington, N. (2003). Perl Cookbook: Solutions & examples for Perl programmers (2nd ed.). O’Reilly Media, Inc.
Neth, H. (2022). ds4psy: Data science for psychologists. Retrieved from https://CRAN.R-project.org/package=ds4psy
Wickham, H. (2019b). stringr: Simple, consistent wrappers for common string operations. Retrieved from https://CRAN.R-project.org/package=stringr