Chapter 7 Pattern Matching

7.1 Review: Tidyverse string expressions

library(tidyverse)

When we load the Tidyverse, as above, one of the packages included is stringr, which provides a set of functions for working with strings.

These seem deceptively simple, but they have some powerful features that we’ll want to look at. Let’s start by reviewing the basics.

For this example, we’ll look at a short Wikipedia article. We can download it and load its contents with the rvest package:

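library(rvest)

# Download and parse the article's HTML
erfurt <- read_html("https://en.wikipedia.org/wiki/Erfurt_latrine_disaster")

# Keep the text of every paragraph (<p>) and join it into one string
erfurt <- erfurt |>
    html_elements("p") |>
    html_text() |>
    str_flatten()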

erfurt
## [1] "\nThe Erfurt latrine disaster occurred on 26 July 1184, when Henry VI, King of Germany (later Holy Roman Emperor), held a Hoftag (informal assembly) in the Petersberg Citadel in Erfurt. On the morning of 26 July, the combined weight of the assembled nobles caused the wooden second story floor of the building to collapse and most of them fell through into the latrine cesspit below the ground floor, where about 60 of them drowned in liquid excrement. This event is called the Erfurter Latrinensturz (lit. 'Erfurt latrine fall') in several German sources.[1][2][3]A feud between Landgrave Louis III of Thuringia and Archbishop Conrad of Mainz, which had existed since the defeat of Henry the Lion, intensified to the point that King Henry VI was forced to intervene while he was traveling through the region during a military campaign against Poland. Henry decided to call a diet in Erfurt, where he was staying, to mediate the situation between the two and invited a number of other figures to the negotiations.[4]All of the nobles across the Holy Roman Empire were invited to the meeting, and many arrived on 25 July to attend.[5] Just as the assembly began, the wooden floor of the deanery, in which the nobles were sitting, broke under the stress, and people fell down through the first floor into the latrine in the cellar. About 60 people are said to have died,[6] including Count Gozmar III of Ziegenhain, Count Friedrich I of Abenberg [de], Burgrave Friedrich I of Kirchberg [de], Count Heinrich I of Schwarzburg [de], Count Burgrave Burchard of Wartburg [de], Count Hansteiner of Liechtenstein, Burgmeister Breuer of Wartschitt and Beringer of Meldingen.[7] King Henry was said to have survived only because he sat in an alcove with a stone floor[5] and was later saved using ladders. He departed as soon as possible. 
Landgrave Louis III of Thuringia survived as well.[5]Of those who died, many drowned in human excrement or suffocated from the fumes emitted by the decomposing waste, while others were crushed by falling debris.\n"

Some of the most important stringr functions are as follows:

str_detect() simply returns a TRUE or FALSE depending on whether the pattern is found in the string.

str_detect(erfurt, "Holy Roman Empire")
## [1] TRUE
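
As a small illustration (the vector below is made up for this example, not taken from the article), str_detect() also works on a vector of strings and returns one TRUE or FALSE per element:

```r
library(stringr)

# Hypothetical example strings, one per element
nobles <- c("King Henry", "Count Gozmar", "Landgrave Louis")

# One logical value is returned for each element of the vector
str_detect(nobles, "Count")
## [1] FALSE  TRUE FALSE
```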

str_replace() and str_replace_all() replace the pattern with a new string.

str_replace_all(erfurt, "Erfurt", "Zurich")
## [1] "\nThe Zurich latrine disaster occurred on 26 July 1184, when Henry VI, King of Germany (later Holy Roman Emperor), held a Hoftag (informal assembly) in the Petersberg Citadel in Zurich. On the morning of 26 July, the combined weight of the assembled nobles caused the wooden second story floor of the building to collapse and most of them fell through into the latrine cesspit below the ground floor, where about 60 of them drowned in liquid excrement. This event is called the Zuricher Latrinensturz (lit. 'Zurich latrine fall') in several German sources.[1][2][3]A feud between Landgrave Louis III of Thuringia and Archbishop Conrad of Mainz, which had existed since the defeat of Henry the Lion, intensified to the point that King Henry VI was forced to intervene while he was traveling through the region during a military campaign against Poland. Henry decided to call a diet in Zurich, where he was staying, to mediate the situation between the two and invited a number of other figures to the negotiations.[4]All of the nobles across the Holy Roman Empire were invited to the meeting, and many arrived on 25 July to attend.[5] Just as the assembly began, the wooden floor of the deanery, in which the nobles were sitting, broke under the stress, and people fell down through the first floor into the latrine in the cellar. About 60 people are said to have died,[6] including Count Gozmar III of Ziegenhain, Count Friedrich I of Abenberg [de], Burgrave Friedrich I of Kirchberg [de], Count Heinrich I of Schwarzburg [de], Count Burgrave Burchard of Wartburg [de], Count Hansteiner of Liechtenstein, Burgmeister Breuer of Wartschitt and Beringer of Meldingen.[7] King Henry was said to have survived only because he sat in an alcove with a stone floor[5] and was later saved using ladders. He departed as soon as possible. 
Landgrave Louis III of Thuringia survived as well.[5]Of those who died, many drowned in human excrement or suffocated from the fumes emitted by the decomposing waste, while others were crushed by falling debris.\n"
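
The difference between the two functions is easier to see on a short string. The sentence below is invented for this sketch:

```r
library(stringr)

# A short, made-up sentence so the output is easy to read
x <- "Erfurt is in Thuringia. Erfurt is old."

# str_replace() changes only the first match
str_replace(x, "Erfurt", "Zurich")
## [1] "Zurich is in Thuringia. Erfurt is old."

# str_replace_all() changes every match
str_replace_all(x, "Erfurt", "Zurich")
## [1] "Zurich is in Thuringia. Zurich is old."
```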

str_count() counts the number of times the pattern appears in the string.

str_count(erfurt, "King")
## [1] 3

str_extract() pulls out the first match of the pattern in the string, and str_extract_all() pulls out every match.

str_extract_all(erfurt, "King")
## [[1]]
## [1] "King" "King" "King"

But this last one is a little puzzling: why would we ever need to extract a pattern we already know?

7.2 Regular Expressions

The answer is that there are some special codes that we can put into the pattern to return some more complicated results. These are called regular expressions, or regex for short.

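These are incredibly powerful, but sometimes a little hard to wrap your mind around. One simple example is \d, which matches any digit. Inside an R string the backslash itself has to be escaped, so we write it as "\\d". We can use this to find every digit from 0 to 9 in the string; the character class [0-9] gives the same result here.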

str_extract_all(erfurt, "\\d")
## [[1]]
##  [1] "2" "6" "1" "1" "8" "4" "2" "6" "6" "0" "1" "2" "3" "4" "2" "5" "5" "6" "0"
## [20] "6" "7" "5" "5"
str_extract_all(erfurt, "[0-9]")
## [[1]]
##  [1] "2" "6" "1" "1" "8" "4" "2" "6" "6" "0" "1" "2" "3" "4" "2" "5" "5" "6" "0"
## [20] "6" "7" "5" "5"
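
The doubled backslash can be checked directly. The sketch below assumes R 4.0.0 or later, which added raw strings; the sample text is a fragment of the article’s first sentence:

```r
library(stringr)

# Once parsed, "\\d" is just two characters: a backslash and a d
writeLines("\\d")
## \d

# A raw string lets us write the regex without doubling the backslash
identical("\\d", r"(\d)")
## [1] TRUE

str_extract_all("26 July 1184", r"(\d)")
## [[1]]
## [1] "2" "6" "1" "1" "8" "4"
```

Both spellings produce the same pattern, so which one to use is a matter of taste.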

7.3 Counting Counts

In the middle of this article, we have a list of royalty.

...
including Count Gozmar III of Ziegenhain, Count Friedrich I of Abenberg [de],
Burgrave Friedrich I of Kirchberg [de], Count Heinrich I of Schwarzburg [de],
Count Burgrave Burchard of Wartburg [de], Count Hansteiner of Liechtenstein,
...

As a way to practice regex, let’s build up a way to extract all of these names.

First, we can see that all of this royalty follows a common pattern; an expression that is regular, if you will.

Basically, each one goes:

TITLE Name [sometimes a numeral] of Place

7.3.1 Choices

We can start from the beginning, and get all the titles. With a regex, we can use the () parentheses and | pipe to select any of the possible titles.

str_extract_all(erfurt, "(Count|Archbishop|Burgrave|Burgmeister|Landgrave)")
## [[1]]
##  [1] "Landgrave"   "Archbishop"  "Count"       "Count"       "Burgrave"   
##  [6] "Count"       "Count"       "Burgrave"    "Count"       "Burgmeister"
## [11] "Landgrave"
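
The parentheses matter once the pattern grows, because they limit how far the | reaches. A minimal sketch on an invented string:

```r
library(stringr)

# Made-up example string
x <- "Count of Abenberg, Burgrave of Kirchberg"

# With parentheses: (Count or Burgrave), followed by " of"
str_extract_all(x, "(Count|Burgrave) of")
## [[1]]
## [1] "Count of"    "Burgrave of"

# Without parentheses: "Count" on its own, or "Burgrave of"
str_extract_all(x, "Count|Burgrave of")
## [[1]]
## [1] "Count"       "Burgrave of"
```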

7.3.2 Wildcards

This is a good start! But we need the full name, not just the title. Like \\d, there are some other wildcards that can help us get the next parts of the name.

  1. \\w Matches any “word” character: a letter, a digit, or an underscore (so not spaces or punctuation).
"Henry VIII" |> str_extract_all("\\w")
## [[1]]
## [1] "H" "e" "n" "r" "y" "V" "I" "I" "I"
  2. \\s Matches any kind of white space.
"Henry VIII" |> str_extract("\\s")
## [1] " "
  3. [ABC] Matches any one of the characters A, B, or C.
"Henry VIII" |> str_extract_all("[IVXLCDM]")
## [[1]]
## [1] "V" "I" "I" "I"
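
A quick check of what \\w covers, using a made-up string that contains an underscore, a digit, and punctuation. The uppercase \\S, which matches anything that is not white space, is included for comparison:

```r
library(stringr)

# Invented test string: letters, an underscore, a digit, a space, and "!"
x <- "Henry_8 VIII!"

# \\w keeps letters, digits, and the underscore; it skips the space and "!"
str_extract_all(x, "\\w")
## [[1]]
## [1] "H" "e" "n" "r" "y" "_" "8" "V" "I" "I" "I"

# \\S keeps everything except white space, so "!" is included
str_extract_all(x, "\\S")
## [[1]]
## [1] "H" "e" "n" "r" "y" "_" "8" "V" "I" "I" "I" "!"
```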

So, we can get the next letter by saying TITLE+SPACE+LETTER:

str_extract_all(erfurt, "(Count|Archbishop|Burgrave|Burgmeister|Landgrave)\\s\\w")
## [[1]]
##  [1] "Landgrave L"   "Archbishop C"  "Count G"       "Count F"      
##  [5] "Burgrave F"    "Count H"       "Count B"       "Count H"      
##  [9] "Burgmeister B" "Landgrave L"

Better than before, but we can do better.

7.3.3 Repetition

If you want more than one of something in a regex, you can follow it with a quantifier:

  1. ? Matches 0 or 1 of something.
  2. + Matches 1 or more of something.
  3. * Matches 0 or more of something.
  4. {3} Matches exactly 3 of something.
  5. {3,} Matches 3 or more of something.
  6. {5,6} Matches 5 or 6 of something.

These are super confusing! The best thing to do is just get an example and play around with them.
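
Here is one such example to play with: a sketch that applies each quantifier to the letter I after a V. The vector of Roman numerals is invented for the purpose, and NA means the pattern did not match at all:

```r
library(stringr)

# Made-up test strings: V followed by zero to three I's
x <- c("V", "VI", "VII", "VIII")

str_extract(x, "VI?")      # V, then 0 or 1 I
## [1] "V"  "VI" "VI" "VI"

str_extract(x, "VI+")      # V, then 1 or more I
## [1] NA     "VI"   "VII"  "VIII"

str_extract(x, "VI*")      # V, then 0 or more I
## [1] "V"    "VI"   "VII"  "VIII"

str_extract(x, "VI{2}")    # V, then exactly 2 I
## [1] NA    NA    "VII" "VII"

str_extract(x, "VI{2,}")   # V, then 2 or more I
## [1] NA     NA     "VII"  "VIII"

str_extract(x, "VI{1,2}")  # V, then 1 or 2 I
## [1] NA    "VI"  "VII" "VII"
```

Note that a quantifier applies only to the item immediately before it, here the single letter I.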

We can then get the full name of our royalty by matching TITLE+SPACE+[1 or more letters]:

str_extract_all(erfurt, "(Count|Archbishop|Burgrave|Burgmeister|Landgrave)\\s\\w+")
## [[1]]
##  [1] "Landgrave Louis"    "Archbishop Conrad"  "Count Gozmar"      
##  [4] "Count Friedrich"    "Burgrave Friedrich" "Count Heinrich"    
##  [7] "Count Burgrave"     "Count Hansteiner"   "Burgmeister Breuer"
## [10] "Landgrave Louis"

Now, some of these royals have a numeral after their name, but not all of them! So we will need to use the * to capture zero or more Roman numeral letters, like so:

TITLE+SPACE+[1 or more letters]+[zero or more spaces]+[zero or more Roman numeral letters]

str_extract_all(erfurt, "(Count|Archbishop|Burgrave|Burgmeister|Landgrave)\\s\\w+\\s*[IVXLCDM]*")
## [[1]]
##  [1] "Landgrave Louis III"  "Archbishop Conrad "   "Count Gozmar III"    
##  [4] "Count Friedrich I"    "Burgrave Friedrich I" "Count Heinrich I"    
##  [7] "Count Burgrave "      "Count Hansteiner "    "Burgmeister Breuer " 
## [10] "Landgrave Louis III"
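
Some of the matches above end in a stray space, because \\s* can match a space even when no numeral follows. One possible variation, offered as a sketch and not as part of the pattern built in this chapter, groups the space and the numeral together and makes the whole group optional with ?. The test string is a shortened excerpt of the article:

```r
library(stringr)

# Shortened excerpt from the article, used only for this sketch
x <- "Count Gozmar III of Ziegenhain, Count Hansteiner of Liechtenstein"

# Space and numeral quantified separately: the space can match on its own
str_extract_all(x, "Count\\s\\w+\\s*[IVXLCDM]*")
## [[1]]
## [1] "Count Gozmar III"  "Count Hansteiner "

# Space and numeral grouped: either both match or neither does
str_extract_all(x, "Count\\s\\w+(\\s[IVXLCDM]+)?")
## [[1]]
## [1] "Count Gozmar III" "Count Hansteiner"
```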

Finally, we need the last part, “of Place”, which occurs on all of them. We just add two more words to the regex, each preceded by a space:

TITLE+SPACE+[1 or more letters]+[zero or more spaces]+[zero or more Roman numeral letters]+SPACE+of+SPACE+[1 or more letters]

str_extract_all(erfurt, "(Count|Archbishop|Burgrave|Burgmeister|Landgrave)\\s\\w+\\s*[IVXLCDM]*\\sof\\s\\w+")
## [[1]]
##  [1] "Landgrave Louis III of Thuringia"  "Archbishop Conrad of Mainz"       
##  [3] "Count Gozmar III of Ziegenhain"    "Count Friedrich I of Abenberg"    
##  [5] "Burgrave Friedrich I of Kirchberg" "Count Heinrich I of Schwarzburg"  
##  [7] "Burgrave Burchard of Wartburg"     "Count Hansteiner of Liechtenstein"
##  [9] "Burgmeister Breuer of Wartschitt"  "Landgrave Louis III of Thuringia"

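And there we have it! Every titled name in the article. Strictly speaking, these are not only the victims: Landgrave Louis III, for one, survived. The pattern also misses Beringer of Meldingen, who is listed without a title.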

7.4 A Warning

Many people consider regexes to be “write-only code”: once you’ve written one, you’ll never be able to go back and understand it. You should really only use them when you need to do something very specific.
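
One way to keep a pattern understandable later is to build it from named pieces. The sketch below rebuilds the pattern from section 7.3 with str_c(); the variable names are our own choice and are not part of stringr:

```r
library(stringr)

# Each piece of the pattern gets a descriptive (hypothetical) name
title   <- "(Count|Archbishop|Burgrave|Burgmeister|Landgrave)"
name    <- "\\w+"
numeral <- "\\s*[IVXLCDM]*"
place   <- "\\w+"

# Glue the pieces together: TITLE Name [numeral] of Place
noble <- str_c(title, "\\s", name, numeral, "\\sof\\s", place)

# The result is the same pattern we wrote by hand in section 7.3
identical(
  noble,
  "(Count|Archbishop|Burgrave|Burgmeister|Landgrave)\\s\\w+\\s*[IVXLCDM]*\\sof\\s\\w+"
)
## [1] TRUE

str_extract("including Count Gozmar III of Ziegenhain, Count Friedrich I of Abenberg", noble)
## [1] "Count Gozmar III of Ziegenhain"
```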

7.5 Resources

Regexes are hard, and sometimes you might feel like you’re experiencing your own little Erfurt latrine disaster.

Here are a few things that can help you get out of the pit:

A really complete introduction to regex in R can be found in the stringr documentation:

https://stringr.tidyverse.org/articles/regular-expressions.html

An interactive console for regexes can be found here:

https://regex101.com

Note that R is not officially supported there, but you can choose the Python flavor, then change every single backslash to a double one (e.g. \d to \\d), and it will mostly work in R.