E Using regular expressions
Regular expressions (a.k.a. regex) are character sequences that define a pattern. The concept stems from theoretical computer science and formal language theory; in practice, such patterns are used for validating text inputs and for searching, finding, and replacing strings of text.
Many R commands involving character data (e.g., the base R functions grep() and strsplit(), and most of the stringr functions discussed in Chapter 9 on Strings of text) support the use of regular expressions.
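As a minimal sketch of what this support looks like, the following base R calls pass a regular expression to grep() (via its pattern argument) and to strsplit() (via its split argument). The vector x and its contents are hypothetical and serve only as an illustration.

```r
# Hypothetical example data (not from the text above).
x <- c("apple pie", "banana split", "cherry  tart")

# grep() interprets `pattern` as a regular expression:
# "^b" matches strings that start with the letter b.
starts_with_b <- grep(pattern = "^b", x = x, value = TRUE)
starts_with_b
#> [1] "banana split"

# strsplit() interprets `split` as a regular expression:
# " +" matches one or more consecutive spaces.
parts <- strsplit(x, split = " +")
parts[[3]]
#> [1] "cherry" "tart"
```

Note that the third string contains two consecutive spaces, which the pattern " +" treats as a single separator.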
While regular expressions can be immensely powerful and time-saving tools, their abstract nature and formal appearance often seem scary and intimidating. For instance, given a vector dinos that contains the 10 character strings
dinos
#> [1] "Allosaurus" "Archaeopteryx" "Betamax" "Brachiosaurus"
#> [5] "Chameleon" "so-called US" "Stegosaurus" "Thesaurus"
#> [9] "Toys 'R' Us" "Tyrannosaurus"
two moderately cryptic grep() and str_view() commands
grep(pattern = "s.+us", x = dinos, ignore.case = TRUE, value = TRUE)
str_view(dinos, pattern = regex("s.+us", ignore_case = TRUE), match = TRUE)
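would each find the seven strings that contain an "s", followed by at least one other character, followed by "us" (ignoring case): "Allosaurus", "Brachiosaurus", "so-called US", "Stegosaurus", "Thesaurus", "Toys 'R' Us", and "Tyrannosaurus".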
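To try this out, the following sketch rebuilds dinos from the printout above (an assumption, as its original definition is not shown) and runs the base R command. It also shows two variations that are not part of the text above: the same search without ignore.case = TRUE, and a regmatches() call that extracts the part of each string that the pattern actually matched.

```r
# Rebuild the vector from the printout above (assumed definition).
dinos <- c("Allosaurus", "Archaeopteryx", "Betamax", "Brachiosaurus",
           "Chameleon", "so-called US", "Stegosaurus", "Thesaurus",
           "Toys 'R' Us", "Tyrannosaurus")

# Pattern "s.+us": an s, then one or more arbitrary characters, then us.
hits <- grep(pattern = "s.+us", x = dinos, ignore.case = TRUE, value = TRUE)
hits
#> [1] "Allosaurus"    "Brachiosaurus" "so-called US"  "Stegosaurus"
#> [5] "Thesaurus"     "Toys 'R' Us"   "Tyrannosaurus"

# Case-sensitive variant: "US" and "Us" no longer count as "us".
hits_cs <- grep(pattern = "s.+us", x = dinos, value = TRUE)
hits_cs
#> [1] "Allosaurus"    "Brachiosaurus" "Stegosaurus"   "Thesaurus"
#> [5] "Tyrannosaurus"

# Extract the matched part of each matching string.
matched <- regmatches(dinos, regexpr("s.+us", dinos, ignore.case = TRUE))
matched
#> [1] "saurus"       "saurus"       "so-called US" "Stegosaurus"  "saurus"
#> [6] "s 'R' Us"     "saurus"
```

The last result illustrates that a match starts at the first possible "s" and extends as far as possible, which is why "Stegosaurus" and "so-called US" are matched in full.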
Many text-related tasks that programmers address by using iterative or recursive functions can be tackled by regular expressions as well. While regular expressions are often shorter and faster than self-made alternatives, they can be cryptic and difficult to understand. Although there’s probably a regular expression for solving almost any text-related task, we should always aim for a good balance between functionality and transparency. To offer a glimpse into the potential of regular expressions without requiring too much formal overhead, this appendix provides a gentle introduction to using regular expressions in R.
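To make the contrast between iterative solutions and regular expressions concrete, here is a small sketch that solves one hypothetical task (counting the vowels in each string) in both ways. The task, the vector words, and the helper count_vowels_loop() are illustrative assumptions, not part of the text above.

```r
# Hypothetical example data and task: count the vowels in each string.
words <- c("Allosaurus", "Betamax", "Toys 'R' Us")

# (a) Iterative solution: inspect every character in turn.
count_vowels_loop <- function(s) {
  chars <- strsplit(s, split = "")[[1]]  # split into single characters
  n <- 0L
  for (ch in chars) {
    if (tolower(ch) %in% c("a", "e", "i", "o", "u")) n <- n + 1L
  }
  n
}
n_loop <- vapply(words, count_vowels_loop, integer(1), USE.NAMES = FALSE)

# (b) Regular expression: delete every character that is not a vowel,
#     then count the characters that remain.
n_regex <- nchar(gsub(pattern = "[^aeiouAEIOU]", replacement = "", x = words))

n_loop
#> [1] 5 3 2
n_regex
#> [1] 5 3 2
```

The regex version fits on one line, but a reader must know that "[^...]" denotes a negated character class to understand it, which is the trade-off between brevity and transparency described above.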