3  stringr and RegExes

needs(tidyverse, rvest)

When working with data, a significant number of variables will be in some sort of text format. If you want to manipulate these variables, one approach would be to export the data to MS Excel and perform the manipulations by hand. This is very time-consuming, though, so we recommend the R way instead: it scales well and works fast for data sets of varying sizes.

Quick reminder: a string is an element of a character vector and can be created by simply wrapping some text in quotation marks:

string <- "Hi, how are you doing?"
vector_of_strings <- c("Hi, how are you doing?", "I'm doing well, HBY?", "Me too, thanks for asking.")

Note that you can either wrap your text in double quotation marks and use single ones inside the string, or vice versa:

single_ones <- "what's up"
double_ones <- 'he said: "I am fine"'

The stringr package (Wickham 2019) contains a multitude of commands (49 in total) that serve two main purposes: manipulating character vectors, and finding and matching patterns. These goals can also be achieved with base R functions, but stringr’s advantage is its consistency. The makers of stringr describe it as

A consistent, simple and easy-to-use set of wrappers around the fantastic stringi package. All function and argument names (and positions) are consistent, all functions deal with NA’s and zero-length vectors in the same way, and the output from one function is easy to feed into the input of another.

Every stringr function starts with str_ – which facilitates finding the proper command: just type str_ and RStudio’s auto-suggest function should take care of the rest (if it doesn’t pop up by itself, you can trigger it by hitting the tab key). Also, they take a vector of strings as their first argument, which facilitates using them in a |>-pipeline and adding them to a mutate()-call.
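
Because the string always comes first, the functions slot directly into a pipeline or a mutate() call. The following is a minimal sketch, assuming a small made-up tibble (the names greetings and text are for illustration only); the functions used are introduced in the next section.

```r
library(dplyr)
library(stringr)

# hypothetical example data
greetings <- tibble(text = c("Hi, how are you doing?", "I'm doing well, HBY?"))

# inside mutate(): the column is passed as the first argument
greetings_clean <- greetings |>
  mutate(
    text_lower = str_to_lower(text),
    n_chars    = str_length(text)
  )

# in a plain pipeline: the piped vector becomes the first argument
shouted <- greetings$text |> str_to_upper()
```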

One important component of stringr functions is regular expressions which will be introduced later as well.

3.1 Basic manipulations

In the following, we will introduce you to several different operations that can be performed on strings.

3.1.1 Changing the case of the words

A basic operation is changing words’ cases.

str_to_lower(vector_of_strings)
[1] "hi, how are you doing?"     "i'm doing well, hby?"      
[3] "me too, thanks for asking."
str_to_upper(vector_of_strings)
[1] "HI, HOW ARE YOU DOING?"     "I'M DOING WELL, HBY?"      
[3] "ME TOO, THANKS FOR ASKING."
str_to_title(vector_of_strings)
[1] "Hi, How Are You Doing?"     "I'm Doing Well, Hby?"      
[3] "Me Too, Thanks For Asking."
str_to_sentence(vector_of_strings)
[1] "Hi, how are you doing?"     "I'm doing well, hby?"      
[3] "Me too, thanks for asking."

3.1.2 Determining a string’s length

Determining the string’s number of characters goes as follows:

str_length(vector_of_strings)
[1] 22 20 26

3.1.3 Extracting particular characters

Characters can be extracted (by position) using str_sub():

str_sub(vector_of_strings, start = 1, end = 5) # extracting first to fifth character
[1] "Hi, h" "I'm d" "Me to"
str_sub(vector_of_strings, start = -5, end = -1) # extracting fifth-to-last to last character
[1] "oing?" " HBY?" "king."

You can also use str_sub() to replace strings. E.g., to replace the last character by a full stop, you can do the following:

str_sub(vector_of_strings, start = -1) <- "."
vector_of_strings
[1] "Hi, how are you doing."     "I'm doing well, HBY."      
[3] "Me too, thanks for asking."

However, in everyday use, you would probably go with str_replace() and regular expressions.
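
As a preview, here is a sketch of the same replacement with str_replace(). The pattern \\?$ (a literal question mark at the very end of the string) relies on escaping and anchors, both of which are explained in the section on regular expressions. The vector questions is made up for illustration.

```r
library(stringr)

questions <- c("Hi, how are you doing?", "I'm doing well, HBY?", "Me too, thanks for asking.")

# replace a question mark at the very end of a string with a full stop
fixed_ends <- str_replace(questions, "\\?$", ".")
fixed_ends
```

Unlike the str_sub() approach, this only touches strings that actually end with a question mark; the last character of all other strings is left alone.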

3.1.4 Concatenating strings

Similar to how c() puts together different elements (or vectors of length 1) and other vectors into a single vector, str_c() can be used to concatenate several strings into a single string. This can, for instance, be used to write some birthday invitations.

names <- c("Inger", "Peter", "Kalle", "Ingrid")

str_c("Hi", names, "I hope you're doing well. As per this letter, I invite you to my birthday party.")
[1] "HiIngerI hope you're doing well. As per this letter, I invite you to my birthday party." 
[2] "HiPeterI hope you're doing well. As per this letter, I invite you to my birthday party." 
[3] "HiKalleI hope you're doing well. As per this letter, I invite you to my birthday party." 
[4] "HiIngridI hope you're doing well. As per this letter, I invite you to my birthday party."

Well, this looks kind of ugly, as there are no spaces, and commas are lacking as well. You can fix that by determining a separator using the sep argument.

str_c("Hi", names, "I hope you're doing well. As per this letter, I invite you to my birthday party.", sep = ", ")
[1] "Hi, Inger, I hope you're doing well. As per this letter, I invite you to my birthday party." 
[2] "Hi, Peter, I hope you're doing well. As per this letter, I invite you to my birthday party." 
[3] "Hi, Kalle, I hope you're doing well. As per this letter, I invite you to my birthday party." 
[4] "Hi, Ingrid, I hope you're doing well. As per this letter, I invite you to my birthday party."

You could also collapse the strings contained in a vector together into one single string using the collapse argument.

str_c(names, collapse = ", ")
[1] "Inger, Peter, Kalle, Ingrid"
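
Two further variations, sketched with a made-up vector guests: you can put the spaces and commas directly into the pieces instead of using one global separator, and you can combine sep and collapse in a single call.

```r
library(stringr)

guests <- c("Inger", "Peter", "Kalle", "Ingrid")

# separators written into the pieces themselves
invitations <- str_c("Hi ", guests, ", I hope you're doing well.")

# `sep` joins the pieces element-wise, `collapse` then merges the results
guest_list <- str_c("Dear", guests, sep = " ", collapse = "; ")
guest_list
```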

3.1.5 Repetition

Repeating (or duplicating) strings is performed using str_dup(). The function takes two arguments: the string to be duplicated and the number of times.

str_dup("felix", 2)
[1] "felixfelix"
str_dup("felix", 1:3)
[1] "felix"           "felixfelix"      "felixfelixfelix"
str_dup(names, 2)
[1] "IngerInger"   "PeterPeter"   "KalleKalle"   "IngridIngrid"
str_dup(names, 1:4)
[1] "Inger"                    "PeterPeter"              
[3] "KalleKalleKalle"          "IngridIngridIngridIngrid"

3.1.6 Removing unnecessary whitespaces

Often text contains unnecessary whitespaces.

unnecessary_whitespaces <- c("    on the left", "on the right    ", "    on both sides   ", "   literally    everywhere  ")

Removing the ones at the beginning or the end of a string can be accomplished using str_trim().

str_trim(unnecessary_whitespaces, side = "left")
[1] "on the left"               "on the right    "         
[3] "on both sides   "          "literally    everywhere  "
str_trim(unnecessary_whitespaces, side = "right")
[1] "    on the left"            "on the right"              
[3] "    on both sides"          "   literally    everywhere"
str_trim(unnecessary_whitespaces, side = "both") # the default option
[1] "on the left"             "on the right"           
[3] "on both sides"           "literally    everywhere"

str_trim() could not fix the last string though, where unnecessary whitespaces were also present in between words. Here, str_squish() is more appropriate. It removes leading and trailing whitespaces as well as duplicated ones in between words.

str_squish(unnecessary_whitespaces)
[1] "on the left"          "on the right"         "on both sides"       
[4] "literally everywhere"

3.1.8 Exercises

  1. Run the following code that downloads movies from IMDb. Clean up the column “year_raw” in the resulting data set. Think about how you could do it with str_sub(). Could you also use it to remove the dot in the “rank” column?
needs(rvest, tidyverse)
imdb_top250 <- read_html("https://www.imdb.com/chart/top/?ref_=nv_mv_250")

raw_title <- imdb_top250 |> 
    html_elements(".cli-title .ipc-title__text") |> 
    html_text2()

movies <- tibble(
  title_raw = imdb_top250 |> 
    html_elements(".cli-title .ipc-title__text") |> 
    html_text2(),
  year_raw = imdb_top250 |> 
    html_elements(".cli-title-metadata") |> 
    html_text2()
) |> 
  separate(title_raw, sep = " ", into = c("rank", "title"), extra = "merge")

## solution:

movies_clean <- movies |> 
  mutate(year = str_sub(year_raw, start = 1, end = 4) |> as.double(), # the first four characters are the year
         rank = str_sub(rank, start = -4, end = -2) |> as.integer()) # everything but the trailing dot
  2. Convert the following sentence to different cases:
sentence <- "The quick brown fox jumps over the lazy dog."

## solution

str_to_lower(sentence)
[1] "the quick brown fox jumps over the lazy dog."
str_to_sentence(sentence)
[1] "The quick brown fox jumps over the lazy dog."
str_to_title(sentence)
[1] "The Quick Brown Fox Jumps Over The Lazy Dog."
str_to_upper(sentence)
[1] "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG."
  3. What’s the length of the following string?
text <- "I enjoy studying Sociology at Leipzig University."

## solution

str_length(text)
[1] 49
  4. Using the following vectors, create a full sentence:
start <- "I am a large language model and I am"
attributes <- c("powerful.", "dumb.", "worse at coding than your instructor.")
end <- "Haha, do you really think I asked ChatGPT to give you these exercises?"
ps <- "(Of course I did, I am lazy AF.)"

str_c(start, attributes, end, ps, sep = " ")
[1] "I am a large language model and I am powerful. Haha, do you really think I asked ChatGPT to give you these exercises? (Of course I did, I am lazy AF.)"                            
[2] "I am a large language model and I am dumb. Haha, do you really think I asked ChatGPT to give you these exercises? (Of course I did, I am lazy AF.)"                                
[3] "I am a large language model and I am worse at coding than your instructor. Haha, do you really think I asked ChatGPT to give you these exercises? (Of course I did, I am lazy AF.)"

3.2 Regular expressions

Up to now, you have been introduced to the more basic functions of the stringr package. Those are useful, for sure, yet limited. To make use of the full potential of stringr, you will first have to acquaint yourself with regular expressions (often abbreviated as “RegEx,” plural “RegExes”).

Those regular expressions are patterns that can be used to describe certain strings. Typical use cases of RegExes are identifying phone numbers or email addresses, or checking whether a password you choose on a web page consists of enough characters, an uppercase character, and at least one special character. Hence, if you want to replace certain words with another one, you can write the proper RegEx to identify the strings you want to replace, and the stringr functions (e.g., str_replace()) will take care of the rest.

Before you dive into RegExes, beware that they are quite complicated at the beginning [1]. Yet, mastering them is very rewarding and will pay off in the future.

3.2.1 Literal characters

The most basic RegEx patterns consist of literal characters only. str_view() shows you which elements of a vector contain a match and which part of each string matches the pattern.

five_largest_cities <- c("Stockholm", "Göteborg", "Malmö", "Uppsala", "Västerås")

Note that RegExes are case-sensitive.

str_view(five_largest_cities, "stockholm")
str_view(five_largest_cities, "Stockholm")
[1] │ <Stockholm>

They also match parts of words:

str_view(five_largest_cities, "borg")
[2] │ Göte<borg>

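Moreover, a pattern can occur more than once in a string. str_view() highlights every occurrence (see “Stockholm”), whereas many other stringr functions, e.g., str_extract() or str_replace(), only act on the first match: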

str_view(five_largest_cities, "o")
[1] │ St<o>ckh<o>lm
[2] │ Göteb<o>rg

This can be addressed in the stringr package by using str_._all() functions – but more on that later.

If you want to match multiple literal characters (or words, for that matter), you can connect them using the | meta character (more on meta characters later).

str_view(five_largest_cities, "Stockholm|Göteborg")
[1] │ <Stockholm>
[2] │ <Göteborg>

Every letter or digit (or any combination of those) can serve as a literal character. Those literal characters match themselves. This is, however, not the case with the other sort of characters, so-called meta characters.

3.2.2 Metacharacters

When using RegExes, the following characters are considered meta characters and have a special meaning:

. \ | ( ) { } [ ] ^ $ - * + ?

3.2.2.1 The wildcard

Did you notice how we used the dot to refer to the entirety of the str_._all() functions? This is basically what the . meta-character does: it matches every character except for a new line. The first call extracts all function names from the stringr package, the second one shows the matches (i.e., the elements of the vector where it can find the pattern).

stringr_functions <- ls("package:stringr")

str_detect(stringr_functions, "str_._all")
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE

Well, as you can see, there are none. This is because the . can only replace one character. We need some sort of multiplier (a so-called quantifier) to find them. The ones available are:

  • ? – zero or one
  • * – zero or more
  • + – one or more
  • {n} – exactly n
  • {n,} – n or more
  • {n,m} – between n and m

In our case, the appropriate one is +:

str_detect(stringr_functions, "str_.+_all")
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
[61] FALSE FALSE
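
To get a feeling for the quantifiers, here is a small sketch on the made-up string "haaaa". It uses str_extract(), which is introduced later in this chapter and returns the matched part of the string. The last line pulls out the matching function names with str_subset(); its result depends on the installed stringr version and is therefore not shown.

```r
library(stringr)

laugh <- "haaaa"

zero_or_one  <- str_extract(laugh, "ha?")     # "ha"
zero_or_more <- str_extract(laugh, "ha*")     # "haaaa"
one_or_more  <- str_extract(laugh, "ha+")     # "haaaa"
exactly_two  <- str_extract(laugh, "ha{2}")   # "haa"
two_or_more  <- str_extract(laugh, "ha{2,}")  # "haaaa"
two_to_three <- str_extract(laugh, "ha{2,3}") # "haaa"

# names of the functions matched by the pattern from above
all_functions <- str_subset(ls("package:stringr"), "str_.+_all")
```

Note that the quantifiers take as many characters as they are allowed to: a* and a+ consume all four “a”s, and a{2,3} stops at three.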

However, what if you want to match the character “.” itself? This problem may arise when searching for clock time. A naive RegEx might look like this:

vectors_with_time <- c("13500", "13M00", "13.00")

str_detect(vectors_with_time, "13.00")
[1] TRUE TRUE TRUE

Yet, it matches everything. We need some sort of literal dot. Here, the metacharacter \ comes in handy. By putting it in front of a metacharacter, the metacharacter loses its special meaning and is interpreted as a literal character. This procedure is referred to as “escaping.” Hence, \ is also referred to as the “escape character.” Note that in an R string you will need to escape the \ as well, and therefore the pattern will look like this: \\. (two backslashes followed by the dot).

str_detect(vectors_with_time, "13\\.00")
[1] FALSE FALSE  TRUE
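
The double backslash is easier to digest once you see what the RegEx engine actually receives. The sketch below prints the pattern with writeLines() and shows two alternatives that avoid the backslashes altogether: putting the dot into a character set (see the next section) and wrapping the pattern in stringr's fixed(), which switches off RegEx interpretation. The vector times is made up for illustration.

```r
library(stringr)

times <- c("13500", "13M00", "13.00")

# R first parses the string literal, so the engine only sees one backslash
writeLines("13\\.00") # prints 13\.00

# a dot inside a character set is a literal dot
in_set <- str_detect(times, "13[.]00")

# fixed() compares the pattern literally, without any RegEx interpretation
literal <- str_detect(times, fixed("13.00"))
```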

3.2.3 Sets of characters

You can also define sets of multiple characters using the [ ] meta characters. This can be used to define multiple possible characters that can appear in the same place.

sp_ce <- c("spice", "space")

str_view(sp_ce, "sp[ai]ce")
[1] │ <spice>
[2] │ <space>

You can also define certain ranges of characters using the - metacharacter. For instance, [a-z] matches any single lowercase letter from a to z.
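
A minimal sketch of a letter range, using made-up vectors: [a-c] matches exactly one of the letters a, b, or c, and several ranges can be combined within the same pair of brackets.

```r
library(stringr)

grades <- c("grade a", "grade c", "grade f") # made-up example

# one lowercase letter between a and c
passed <- str_detect(grades, "grade [a-c]")
passed

# ranges can be combined, here to cover upper and lower case
mixed_case <- str_detect(c("Grade A", "grade b"), "[Gg]rade [A-Ca-c]")
mixed_case
```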

The same holds for numbers:

american_phone_number <- "(555) 555-1234"

str_view(american_phone_number, "\\([:digit:]{3}\\) [0-9]{3}-[0-9]{4}")
[1] │ <(555) 555-1234>

There are also predefined sets of characters, for instance, digits or letters, which are called character classes. You can find them on the stringr cheatsheet.
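
The following sketch compares explicit ranges with character classes on a made-up address. For digits, the three notations give the same result here. For letters, the class [[:alpha:]] also covers non-ASCII letters such as “ö,” which the range [A-Za-z] does not. The string is written with the escape \u00f6 only to keep the example independent of the file encoding.

```r
library(stringr)

address <- "Storgatan 12, 411 38 G\u00f6teborg" # "Göteborg"

digits_range <- str_extract_all(address, "[0-9]+")       # explicit range
digits_class <- str_extract_all(address, "[[:digit:]]+") # character class
digits_short <- str_extract_all(address, "\\d+")         # shorthand for digits

letters_ascii <- str_extract_all(address, "[A-Za-z]+")    # breaks at the "ö"
letters_all   <- str_extract_all(address, "[[:alpha:]]+") # keeps the city name intact
```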

Furthermore, you can put almost every meta character inside the square brackets without escaping them. This does not apply to the caret (^) in the first position, the dash -, the closing square bracket ], and the backslash \.

str_view(vector_of_strings, "[.]")
[1] │ Hi, how are you doing<.>
[2] │ I'm doing well, HBY<.>
[3] │ Me too, thanks for asking<.>

3.2.3.1 Negating sets of characters

Sometimes you will also want to exclude certain sets of characters or words. To achieve this, you can use the ^ meta character at the beginning of the range or set you are defining.

str_view(sp_ce, "sp[^i]ce")
[2] │ <space>

3.2.4 Anchors

There is also a way to define whether you want the pattern to be present at the beginning (^) or at the end ($) of a string. sentences is a character vector of 720 predefined example sentences that comes with stringr. If we were now interested in the sentences that begin with “The,” we could write the following RegEx:

shortened_sentences <- sentences[1:10]

str_view(shortened_sentences, "^The") 
[1] │ <The> birch canoe slid on the smooth planks.
[4] │ <The>se days a chicken leg is a rare dish.
[6] │ <The> juice of lemons makes fine punch.
[7] │ <The> box was thrown beside the parked truck.
[8] │ <The> hogs were fed chopped corn and garbage.

If we wanted to know how many start with a “The” and end with a full stop, we could do this one:

str_view(shortened_sentences, "^The.+\\.$") 
[1] │ <The birch canoe slid on the smooth planks.>
[4] │ <These days a chicken leg is a rare dish.>
[6] │ <The juice of lemons makes fine punch.>
[7] │ <The box was thrown beside the parked truck.>
[8] │ <The hogs were fed chopped corn and garbage.>
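
str_view() shows the matches but does not count them. To get the actual number of sentences, one option is to sum up the logical vector returned by str_detect(); this is a sketch, and the variable name starts_with_the is chosen for illustration.

```r
library(stringr)

# TRUE for every sentence that begins with "The"
starts_with_the <- str_detect(sentences, "^The")

sum(starts_with_the)       # number of matching sentences among all 720
sum(starts_with_the[1:10]) # number among the first ten sentences
```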

3.2.4.1 Boundaries

Note that right now, the RegEx also matches the sentence which starts with a “These.” To address this, we need to tell the machine that it should only accept a “The” if the word ends right after it. In RegEx syntax, this is done using so-called boundaries: \b stands for a word boundary and \B for no word boundary. (Note that in R you will need an additional backslash, as you have to escape the escape character itself.)

In our example, we would include the former if we were to search for sentences that begin with a single “The” and the latter if we were to search for sentences that begin with a word that starts with “The” but is not “The” – such as “These.”

str_view(shortened_sentences, "^The\\b.+\\.$") 
[1] │ <The birch canoe slid on the smooth planks.>
[6] │ <The juice of lemons makes fine punch.>
[7] │ <The box was thrown beside the parked truck.>
[8] │ <The hogs were fed chopped corn and garbage.>
str_view(shortened_sentences, "^The\\B.+\\.$") 
[4] │ <These days a chicken leg is a rare dish.>

3.2.4.2 Lookarounds

A final common task is to extract certain words or values based on what comes before or after them. Look at the following example:

heights <- c("1m30cm", "2m01cm", "3m10cm")

Here, to identify the height in meters, the first task is to identify all the numbers that are followed by an “m”. The RegEx syntax for this looks like this: A(?=pattern) with A being the entity that is supposed to be found (hence, in this case, [0-9]+).

str_view(heights, "[0-9]+(?=m)")
[1] │ <1>m30cm
[2] │ <2>m01cm
[3] │ <3>m10cm

The second step now is to identify the centimeters. This could of course be achieved using the same RegEx and replacing m with cm. However, we can also harness a so-called negative look ahead, A(?!pattern), or a so-called look behind, (?<=pattern)A. The negative counterpart of the latter, the negative look behind (?<!pattern)A, could be used to extract the meters.

The negative lookahead returns everything that is not followed by the defined pattern. The look behind returns everything that is preceded by the pattern, the negative look behind returns everything that is not preceded by the pattern.

In the following, we demonstrate how you could extract the centimeters using negative look ahead and look behind.

str_view(heights, "[0-9]+(?!m)") # negative look ahead
[1] │ 1m<30>cm
[2] │ 2m<01>cm
[3] │ 3m<10>cm
str_view(heights, "(?<=m)[0-9]+") # look behind
[1] │ 1m<30>cm
[2] │ 2m<01>cm
[3] │ 3m<10>cm
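
Put together, the look behinds can be used to turn the heights into numbers. The sketch below uses str_extract(), which is introduced later and returns the first match per string. It also illustrates a caveat of the negative look ahead: with multi-digit meters (the made-up value "12m30cm"), the engine simply gives back one digit to satisfy the condition.

```r
library(stringr)

heights <- c("1m30cm", "2m01cm", "3m10cm")

# digits not preceded by "m": the meters
meters <- str_extract(heights, "(?<!m)[0-9]+") |> as.integer()

# digits preceded by "m": the centimeters
centimeters <- str_extract(heights, "(?<=m)[0-9]+") |> as.integer()

# caveat: "12" is followed by "m", but "1" is followed by "2", so "1" matches
backtracked <- str_extract("12m30cm", "[0-9]+(?!m)")
backtracked
```

For this reason, the look behind is the more robust choice for the centimeters.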

3.2.6 Exercises

  1. Write a RegEx for Swedish mobile numbers. Test it with str_detect("+46 71-738 25 33", "[insert your RegEx here]").
str_detect("+46 71-738 25 33", "\\+46 [0-9]{2}\\-[0-9]{3} [0-9]{2} [0-9]{2}")
[1] TRUE
  2. Given the vector c("apple", "banana", "cherry", "date", "elderberry"), use a regular expression to identify fruits that contain the letter “a” exactly two times.
fruits <- c("apple", "banana", "cherry", "date", "elderberry")
# exactly two "a"s: non-"a" characters, an "a", non-"a"s, another "a", non-"a"s
# note that "banana" contains three "a"s and hence does not qualify
str_detect(fruits, "^[^a]*a[^a]*a[^a]*$")
[1] FALSE FALSE FALSE FALSE FALSE
  3. Given the sentence vector c("The cat sat on the mat.", "Mat is what it sat on.", "On the mat, it sat."), write a regular expression to identify sentences that start with “The” and end with “mat.”.
cats <- c("The cat sat on the mat.", "Mat is what it sat on.", "On the mat, it sat.")
str_detect(cats, "^The.*mat\\.$") # the final dot needs to be escaped
[1]  TRUE FALSE FALSE
  4. Extract all email addresses from the following vector: c("john.doe@example.com", "alice_smith@company.net", "r.user@domain.org", "I am @ the office RN", "facebook.com").
addresses <- c("john.doe@example.com", "alice_smith@company.net", "r.user@domain.org", "I am @ the office RN", "facebook.com")

str_detect(addresses, "[a-z.\\_]+\\@[a-z]+\\.[a-z]+")
[1]  TRUE  TRUE  TRUE FALSE FALSE
  5. Check a vector of passwords for strength. A strong password should have at least 8 characters, include an uppercase and a lowercase letter, a number, and a special character (e.g., !@#$%^&*).
password <- c("Hi!123456")

if (str_detect(password, "[A-Z]{1,}") &
    str_detect(password, "[a-z]{1,}") &
    str_detect(password, "[0-9]{1,}") &
    str_detect(password, "[\\!\\@\\#\\$\\%\\^\\&\\*]{1,}") &
    str_length(password) > 7){
  "strong password"
}else{
  "weak password"
}
[1] "strong password"
  6. From “The theme of this theater is therapeutic.”, extract all words that start with “the” but are not followed by “me”.
sentence <- "The theme of this theater is therapeutic." |> str_to_lower()
str_extract_all(sentence, "\\bthe(?!me)\\w*\\b")
[[1]]
[1] "the"         "theater"     "therapeutic"

3.3 More advanced string manipulation

Now that you have learned about RegExes, you can unleash the full power of stringr.

The basic syntax of a stringr function looks as follows: str_.*(string, regex("")). Some stringr functions also have the suffix _all which implies that they operate not only on the first match but on every match.

To demonstrate the different functions, we will again rely on the subset of example sentences.

3.3.1 Detect matches

str_detect can be used to determine whether a certain pattern is present in the string.

str_detect(shortened_sentences, "The\\b")
 [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE

This also works very well in a dplyr::filter() call. Finding all action movies in the IMDB data set can be solved like this:

imdb_raw <- read_csv("https://www.dropbox.com/s/81o3zzdkw737vt0/imdb2006-2016.csv?dl=1")
Rows: 1000 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Title, Genre, Description, Director, Actors
dbl (7): Rank, Year, Runtime (Minutes), Rating, Votes, Revenue (Millions), M...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
imdb_raw |> 
  filter(str_detect(Genre, "Action"))
# A tibble: 303 × 12
    Rank Title       Genre Description Director Actors  Year `Runtime (Minutes)`
   <dbl> <chr>       <chr> <chr>       <chr>    <chr>  <dbl>               <dbl>
 1     1 Guardians … Acti… A group of… James G… Chris…  2014                 121
 2     5 Suicide Sq… Acti… A secret g… David A… Will …  2016                 123
 3     6 The Great … Acti… European m… Yimou Z… Matt …  2016                 103
 4     9 The Lost C… Acti… A true-lif… James G… Charl…  2016                 141
 5    13 Rogue One   Acti… The Rebel … Gareth … Felic…  2016                 133
 6    15 Colossal    Acti… Gloria is … Nacho V… Anne …  2016                 109
 7    18 Jason Bour… Acti… The CIA's … Paul Gr… Matt …  2016                 123
 8    25 Independen… Acti… Two decade… Roland … Liam …  2016                 120
 9    27 Bahubali: … Acti… In ancient… S.S. Ra… Prabh…  2015                 159
10    30 Assassin's… Acti… When Callu… Justin … Micha…  2016                 115
# ℹ 293 more rows
# ℹ 4 more variables: Rating <dbl>, Votes <dbl>, `Revenue (Millions)` <dbl>,
#   Metascore <dbl>

If you want to know how many matches are present in each string, you can use str_count(). Here, it might be advisable to set the ignore_case option to TRUE:

str_count(shortened_sentences, regex("the\\b", ignore_case = TRUE))
 [1] 2 2 1 0 0 1 2 1 0 0
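
To see what ignore_case changes, here is a small sketch on a made-up sentence, run once with and once without the option.

```r
library(stringr)

sentence <- "The cat and the hat"

# case-sensitive: only the lowercase "the" counts
case_sensitive <- str_count(sentence, "the\\b")

# case-insensitive: "The" and "the" both count
case_insensitive <- str_count(sentence, regex("the\\b", ignore_case = TRUE))

c(case_sensitive, case_insensitive)
```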

If you want to locate the match in the string, use str_locate(). This returns a matrix, i.e., a vector with two dimensions (rows and columns).

str_locate(shortened_sentences, regex("The\\b", ignore_case = TRUE))
      start end
 [1,]     1   3
 [2,]     6   8
 [3,]    19  21
 [4,]    NA  NA
 [5,]    NA  NA
 [6,]     1   3
 [7,]     1   3
 [8,]     1   3
 [9,]    NA  NA
[10,]    NA  NA

Moreover, this is a good example of a stringr function that only reports the first match per string. Hence, it is advisable to use str_locate_all(), which returns a list with one matrix for each element of the original vector:

str_locate_all(shortened_sentences, regex("The\\b", ignore_case = TRUE)) |> pluck(1)
     start end
[1,]     1   3
[2,]    25  27
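
The positions can be fed back into str_sub() to pull out the matched text. This is a sketch for the first example sentence, typed out here so that the block runs on its own.

```r
library(stringr)

sentence <- "The birch canoe slid on the smooth planks."

# matrix with one row per match and the columns "start" and "end"
positions <- str_locate_all(sentence, regex("the\\b", ignore_case = TRUE))[[1]]

# str_sub() is vectorized over start and end, so every match is returned
matched <- str_sub(sentence, start = positions[, "start"], end = positions[, "end"])
matched
```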

3.3.2 Mutating strings

Mutating strings usually implies the replacement of certain elements (e.g., words) with other elements (or removing them, which is a special case of replacing them with nothing). In stringr this is performed using str_replace(string, pattern, replacement) and str_replace_all(string, pattern, replacement).

If we wanted, for instance, to replace the first occurrence of the letter “m” with “meters,” we would go about this the following way:

str_replace(heights, "m", "meters")
[1] "1meters30cm" "2meters01cm" "3meters10cm"

Note that str_replace_all() would have led to the following outcome:

str_replace_all(heights, "m", "meters")
[1] "1meters30cmeters" "2meters01cmeters" "3meters10cmeters"

However, we also want to replace the “cm” with “centimeters,” hence, we can harness another feature of str_replace_all(), providing multiple replacements:

str_replace_all(heights, c("m" = "meters", "cm" = "centimeters"))
[1] "1meters30centimeterseters" "2meters01centimeterseters"
[3] "3meters10centimeterseters"

What becomes obvious is that a “simple” RegEx containing just literal characters more often than not does not suffice. It will be your task to fix this. And while on it, you can also address the meter/meters problem – a “1” needs meter instead of meters. Another feature is that the replacements are performed in order. You can harness this for solving the problem.

3.3.3 Extracting text

str_extract(_all)() can be used to extract matching strings. In the mtcars data set, the first word of each row name describes the car brand. Here, we harness another RegEx element, \\w, which stands for any word character. Its counterpart is \\W for any non-word character.

mtcars |> 
  rownames_to_column(var = "car_model") |> 
  transmute(manufacturer = str_extract(car_model, "^\\w+\\b")) |> 
  head(6)
  manufacturer
1        Mazda
2        Mazda
3       Datsun
4       Hornet
5       Hornet
6      Valiant
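
The difference between str_extract() and str_extract_all(), as well as between \\w and \\W, can be sketched with a single row name:

```r
library(stringr)

car <- "Mazda RX4 Wag"

first_word <- str_extract(car, "\\w+")     # first run of word characters only
all_words  <- str_extract_all(car, "\\w+") # every run, returned as a list
non_words  <- str_extract_all(car, "\\W")  # non-word characters, here the two spaces
```

str_extract() returns a character vector of the same length as its input, whereas str_extract_all() returns a list with one character vector per input string.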

3.3.4 Split vectors

Another use case here would have been to split it into two columns: manufacturer and model. One approach would be to use str_split(). This function splits the string at every occurrence of the predefined pattern. In this example, we use a word boundary as the pattern:

manufacturer_model <- rownames(mtcars)
str_split(manufacturer_model, "\\b") |> 
  head()
[[1]]
[1] ""      "Mazda" " "     "RX4"   ""     

[[2]]
[1] ""      "Mazda" " "     "RX4"   " "     "Wag"   ""     

[[3]]
[1] ""       "Datsun" " "      "710"    ""      

[[4]]
[1] ""       "Hornet" " "      "4"      " "      "Drive"  ""      

[[5]]
[1] ""           "Hornet"     " "          "Sportabout" ""          

[[6]]
[1] ""        "Valiant" ""       

This outputs a list containing the individual words and the characters in between them, which is not what we want here. However, the structure of the string is always roughly the same: “[manufacturer][ ][model description]”. Moreover, the manufacturer is only one word. Hence, the task can be solved by splitting the string after the first word, which should indicate the manufacturer. This can be accomplished using str_split_fixed(). Fixed means that the number of pieces is predefined. This returns a matrix that can easily be turned into a tibble.

str_split_fixed(manufacturer_model, "(?<=\\w)\\b", n = 2) |> 
  as_tibble() |> 
  rename(manufacturer = V1,
         model = V2) |> 
  mutate(model = str_squish(model))
Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
`.name_repair` is omitted as of tibble 2.0.0.
ℹ Using compatibility `.name_repair`.
# A tibble: 32 × 2
   manufacturer model       
   <chr>        <chr>       
 1 Mazda        "RX4"       
 2 Mazda        "RX4 Wag"   
 3 Datsun       "710"       
 4 Hornet       "4 Drive"   
 5 Hornet       "Sportabout"
 6 Valiant      ""          
 7 Duster       "360"       
 8 Merc         "240D"      
 9 Merc         "230"       
10 Merc         "280"       
# ℹ 22 more rows
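
A simpler variation, sketched on three row names typed out by hand: splitting at the first space instead of the word boundary makes str_squish() unnecessary, and naming the matrix columns before the conversion is one way to avoid the warning shown above.

```r
library(stringr)
library(tibble)

cars <- c("Mazda RX4 Wag", "Datsun 710", "Valiant")

# n = 2: split at the first space only, everything after it stays together
parts <- str_split_fixed(cars, " ", n = 2)

# named columns, so as_tibble() does not have to repair the names
colnames(parts) <- c("manufacturer", "model")
cars_tbl <- as_tibble(parts)
cars_tbl
```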

3.3.6 Exercises

  1. Run the following code that downloads movies from IMDb. Create a tibble with the two columns “rank” and “title” by extracting the respective part of the raw title.
needs(rvest, tidyverse)
imdb_top250 <- read_html("https://www.imdb.com/chart/top/?ref_=nv_mv_250")

raw_title <- imdb_top250 |> 
    html_elements(".cli-title .ipc-title__text") |> 
    html_text2()

tibble(
  rank = raw_title |> str_extract("^[0-9]{1,3}") |> as.integer(),
  title = raw_title |> str_remove("^[0-9]{1,3}\\. ")
)
  2. Replace m and cm appropriately in the vector of heights.
heights <- c("1m30cm", "2m01cm", "3m10cm")

str_replace_all(heights, c("(?<=[2-9]{1})m" = "meters", 
                           "(?<=[0-9]{2})m" = "meters", 
                           "(?<=1)m" = "meter", 
                           "(?<=01)cm$" = "centimeter", 
                           "cm$" = "centimeters"))
[1] "1meter30centimeters"  "2meters01centimeter"  "3meters10centimeters"
  3. Run the following code and clean up the resulting table.
     a. Remove the footnotes in the “party” column.
     b. Bring their date of birth (“born”) in proper shape.
     c. Bonus: fix their “occupation” by separating the single jobs (combine look-ahead and look-behind for that).

needs(rvest, janitor, lubridate, tidyverse)

senator_table_raw <- read_html("https://en.wikipedia.org/wiki/List_of_current_United_States_senators") |> 
  html_elements(css = "#senators") |> 
  html_table() |> 
  pluck(1) |> 
  clean_names() |> 
  select(state, senator, party = party_2, born, occupations = occupation_s, assumed_office)

#a;b
senator_table_cleaned <- senator_table_raw |> 
  mutate(party = str_remove_all(party, "\\[.\\]|\\(.+\\)"),
         born = str_extract(born, "19[0-9]{2}\\-[01][0-9]\\-[0-3][0-9]") |> ymd())

#c
occ_cleaned <- senator_table_cleaned |> 
  mutate(occupation_clean = str_replace_all(occupations, c("(?<=[a-z])(?=[A-Z])" = "; ", "CEO" = "CEO ")) |> 
           str_squish())

  [1] Comment from Felix: “honestly, I was quite overwhelmed when I encountered them first”