1 Tutorial 1: Searching & manipulating string patterns
As a first step, we will learn how to handle and manipulate text data in R.
After working through Tutorial 1, you’ll…
- understand what string patterns and regular expressions are
- understand how to write and search for string patterns
- understand how to manipulate search patterns
1.1 Regular Expressions
String patterns are sequences of characters (for instance, letters, numbers, or special characters). For instance, “hello” can be considered a string pattern that identifies the word “hello” in lowercase spelling.
Regular expressions are a type of string pattern used to match or detect other string patterns in texts. When using regular expressions, we usually include specific patterns which are not interpreted literally, but in which certain strings refer to a different, non-literal meaning. Working with regular expressions allows for more flexible searches for specific patterns.
For instance, “[Hh]ello” can be considered a regular expression.
Here, we would not look for the exact string “[Hh]ello”. Instead, we would interpret the pattern based on its non-literal meaning: the square brackets stand for “one of the enclosed characters”, so we would search for both the strings “hello” and “Hello”.
Let’s start with an easy example. Assume you are working with the following texts:
texts <- c("I'm learning to analyze texts with the program R today.",
"Learning text analysis is exciting, but you have to learn programming with R.",
"Learning R makes you realize: Programming can be exhausting, can't it?")
Suppose you want to search for a particular word, namely the word program. I’ll call this word/words we’re looking for the pattern (which is also the argument in many functions we’ll work with for automated content analysis).
Let’s say you want to know in which sentences, stored in the vector texts, the pattern program occurs. To see where the string program is used, we’ll rely on the function grep, which will be explained in more detail in the next section.
grep(pattern, x, value = TRUE) searches for a specific pattern and returns only the content of the elements of an object x in which our pattern is matched.
Arguments of the grep() function:
- pattern: string we are looking for: program.
- x: a character vector where we search for our pattern: texts.
- value: whether R should return the position of the elements containing our pattern, or the content of the elements. Since we want to analyze the content elements where pattern is matched, we set value to TRUE (if we set it to FALSE, R would simply return the position of the elements in which the word program occurs).
Using the following command, we tell R to return all elements of the vector texts in which the pattern program is matched:
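A minimal sketch of that command, put together from the arguments described above (the vector is repeated so that the snippet runs on its own):

```r
texts <- c("I'm learning to analyze texts with the program R today.",
           "Learning text analysis is exciting, but you have to learn programming with R.",
           "Learning R makes you realize: Programming can be exhausting, can't it?")

# value = TRUE returns the content of every element that contains "program"
grep(pattern = "program", x = texts, value = TRUE)
```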
## [1] "I'm learning to analyze texts with the program R today."
## [2] "Learning text analysis is exciting, but you have to learn programming with R."
As you see, the first and the second sentence in the vector texts contain the string program (in the second sentence as part of the word “programming”).
You have now searched for the pattern program. By doing so, R simply searches for the following sequence of letters: p, r, o, g, r, a, m.
In many cases, however, we need to be more flexible with our search terms.
For example, we may want to know whether the word stem program occurs (to include words like “program”, “programming”, or “Programming”).
Of course, we could search for all these terms (i.e., “program”, “programming”, or “Programming”) individually:
grep(pattern = "program", x = texts, value = TRUE)
grep(pattern = "programming", x = texts, value = TRUE)
grep(pattern = "Programming", x = texts, value = TRUE)
However, far easier (and more efficient) than conducting three separate searches would be to use logical operators, character classes, or metacharacters. All these are used in regular expressions: They contain sequences of letters, numbers, or characters that do not stand for themselves, but are assigned a certain non-literal meaning.
1.1.1 Logical operators
Before, we wanted R to return all elements of texts containing program (i.e., “program”, “programming”, or “Programming”).
It would be more practical if we could do this in one command (instead of doing three separate searches). As you saw before, this can easily be done by using the operator |:
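A minimal sketch of such a combined search, using the pattern program|Program discussed below (the vector is repeated so that the snippet runs on its own):

```r
texts <- c("I'm learning to analyze texts with the program R today.",
           "Learning text analysis is exciting, but you have to learn programming with R.",
           "Learning R makes you realize: Programming can be exhausting, can't it?")

# "|" means "or": match either "program" or "Program"
grep(pattern = "program|Program", x = texts, value = TRUE)
```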
## [1] "I'm learning to analyze texts with the program R today."
## [2] "Learning text analysis is exciting, but you have to learn programming with R."
## [3] "Learning R makes you realize: Programming can be exhausting, can't it?"
R knows that it should not interpret the character | “literally” - otherwise the program would only output the texts in which the character sequence “program|Program” occurs.
Instead, R translates the character | correctly: The program outputs texts in which the string pattern program or Program occurs.
We can write this code even more efficiently by using square brackets around the part of the pattern that can take on different values (p in lowercase or uppercase):
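A minimal sketch of the bracket version. It uses the set [Pp]; no | is needed inside the brackets, where it would be read as a literal character:

```r
texts <- c("I'm learning to analyze texts with the program R today.",
           "Learning text analysis is exciting, but you have to learn programming with R.",
           "Learning R makes you realize: Programming can be exhausting, can't it?")

# [Pp] matches exactly one character: either "P" or "p"
grep(pattern = "[Pp]rogram", x = texts, value = TRUE)
```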
## [1] "I'm learning to analyze texts with the program R today."
## [2] "Learning text analysis is exciting, but you have to learn programming with R."
## [3] "Learning R makes you realize: Programming can be exhausting, can't it?"
Why do we use square brackets here?
The square brackets are metacharacters. That is, R does not look for the square brackets in the texts. Instead, R knows that the string pattern [] defines a set of characters (given inside the square brackets, here p and P), exactly one of which has to be matched at this position.
Note that no | is needed inside square brackets: [Pp] already means “either P or p”. A pattern such as [P|p]rogram returns the same elements for our texts, but inside square brackets the | is treated as a literal character, so it would also match the string “|rogram”.
Thus, R does not search for the sequence [, P, p, ], r, o, g, r, a, m.
Instead, R looks up the sequence r, o, g, r, a, m preceded by either a lowercase p or an uppercase P. (We’ll get back to metacharacters in a second.)
1.1.2 Character Classes
Regular expressions also allow us to search for specific types of characters. This includes searching for specific character classes.
For example, if we search for the pattern [a-z], R knows that the pattern [a-z] is used to match any lowercase letter (i.e., a or b or c or d etc.).
Here is a (limited) selection of frequently used regular expressions related to different character classes:
character.classes | meaning |
---|---|
[a-z] | finds any letter (lowercase) |
[A-Z] | finds any letter (uppercase) |
[[:alpha:]] | finds any letter (lowercase and uppercase) |
[0-9] | finds any digit |
[a-zA-Z0-9] | finds any letter (lowercase and uppercase) or digit |
[[:blank:]] | finds spaces and tabs |
[[:punct:]] | finds punctuation, including ! # : ; . |
So if we don’t want to search for the words “program” and “Program” using a combined search query via |, we can use the following regular expression. It will give us all sentences in which any letter in lowercase or uppercase is followed by the string rogram:
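A minimal sketch of this search, assuming the class [[:alpha:]] from the table above for “any letter in lowercase or uppercase”:

```r
texts <- c("I'm learning to analyze texts with the program R today.",
           "Learning text analysis is exciting, but you have to learn programming with R.",
           "Learning R makes you realize: Programming can be exhausting, can't it?")

# any single letter, followed by the literal string "rogram"
grep(pattern = "[[:alpha:]]rogram", x = texts, value = TRUE)
```

Note that this pattern is broader than the search via |: it would also match strings such as “xrogram”.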
## [1] "I'm learning to analyze texts with the program R today."
## [2] "Learning text analysis is exciting, but you have to learn programming with R."
## [3] "Learning R makes you realize: Programming can be exhausting, can't it?"
1.1.3 Quantifiers
Regular expressions also allow us to specify how often exactly certain letters, numbers or characters should occur.
To specify the number of times a specific pattern should be matched, we can use so-called quantifiers.
quantifier | meaning |
---|---|
? | The preceding expression occurs at most once |
+ | The preceding expression occurs at least once |
* | The preceding expression occurs any number of times (including not at all) |
{n} | The preceding expression occurs exactly n times |
{n,} | The preceding expression occurs at least n times |
{n,m} | The preceding expression occurs at least n times and at most m times |
Say we want to find sentences in which the words “text” and “analysis” (or related words) occur right after one another. Our vector texts includes the strings “analyze texts” and “text analysis”, both of which should be matched.
To do so, we instruct R to return all elements of the vector texts in which the letter sequence analy is directly preceded or followed by the sequence text.
We use the quantifier + in combination with character classes and combine two alternative patterns with the operator |:
- The pattern “text analys” matches all strings in which the string analys is directly preceded by the string text and a blank space.
- The pattern “analy[a-z]+[[:blank:]]+text” matches all strings in which the string text is preceded by the string analy, one or more lowercase letters (as indicated by the first +), and one or more blank spaces (as indicated by the second +).
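A sketch of the combined search, joining the two patterns from the list above with |:

```r
texts <- c("I'm learning to analyze texts with the program R today.",
           "Learning text analysis is exciting, but you have to learn programming with R.",
           "Learning R makes you realize: Programming can be exhausting, can't it?")

# alternative 1: "text analys"
# alternative 2: "analy" + one or more lowercase letters + one or more blanks + "text"
grep(pattern = "text analys|analy[a-z]+[[:blank:]]+text", x = texts, value = TRUE)
```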
## [1] "I'm learning to analyze texts with the program R today."
## [2] "Learning text analysis is exciting, but you have to learn programming with R."
1.1.4 Metacharacters
Metacharacters are characters that have a special meaning in regular expressions. They include the quantifiers *, +, and ? as well as the following characters (among others).
Again, these metacharacters are not interpreted “literally”. R does not search for these exact characters in the text. Instead, R assigns a different meaning to them, in a way translating them.
Metacharacters | Escape | Fix |
---|---|---|
* | \\* | fixed = TRUE |
+ | \\+ | fixed = TRUE |
? | \\? | fixed = TRUE |
| | \\| | fixed = TRUE |
{ | \\{ | fixed = TRUE |
} | \\} | fixed = TRUE |
( | \\( | fixed = TRUE |
) | \\) | fixed = TRUE |
You already know some of these metacharacters. For instance, you know that the character * is not interpreted literally, but R knows that * stands for a quantifier.
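In the following example, R looks up the pattern “analy”, possibly (but not necessarily) preceded by the pattern “text ”. Keep in mind that a quantifier only applies to the expression directly before it, i.e. a single character unless several characters are grouped with parentheses.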
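A minimal sketch of such a search. The exact pattern is an assumption on my part: I group “text ” with parentheses so that * applies to the whole group, and I search for the stem analy so that “analyze” is matched as well.

```r
texts <- c("I'm learning to analyze texts with the program R today.",
           "Learning text analysis is exciting, but you have to learn programming with R.",
           "Learning R makes you realize: Programming can be exhausting, can't it?")

# (text )* : the group "text " may occur any number of times, including not at all
grep(pattern = "(text )*analy", x = texts, value = TRUE)
```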
## [1] "I'm learning to analyze texts with the program R today."
## [2] "Learning text analysis is exciting, but you have to learn programming with R."
But what if we actually want to search for a metacharacter?
Example: We want to know which elements in the vector texts contain a question.
Thus, we need to search for a question mark ? at the end of a sentence, i.e. a question mark occurring directly after any letter (lowercase or uppercase) or number:
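A sketch of this first, naive attempt, with the unescaped ? placed directly after the character class [a-zA-Z0-9]:

```r
texts <- c("I'm learning to analyze texts with the program R today.",
           "Learning text analysis is exciting, but you have to learn programming with R.",
           "Learning R makes you realize: Programming can be exhausting, can't it?")

# caution: "?" is read as a quantifier here, not as a question mark
grep(pattern = "[a-zA-Z0-9]?", x = texts, value = TRUE)
```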
## [1] "I'm learning to analyze texts with the program R today."
## [2] "Learning text analysis is exciting, but you have to learn programming with R."
## [3] "Learning R makes you realize: Programming can be exhausting, can't it?"
That didn’t work: R returns all elements of texts, although we can see that, in fact, only one of them contains a question mark.
Why does R return all elements?
Because ? is a metacharacter in the form of a quantifier: R searches for all sentences in which the expression before ? occurs at most once, i.e. once or not at all. Since ? is preceded by the expression [a-zA-Z0-9], R returns all elements in which a letter or number occurs once or not at all at some position. This is true for every string, and therefore for all elements in texts.
There are different ways of handling metacharacters as indicated in the table (second and third column):
Solution 1: Escape
As you can see in the table above, we can “escape” metacharacters. By inserting two backslashes \\, we can instruct R to understand the character ? not as a metacharacter but “literally”, i.e. as the character ?.
By doing so, R searches for texts in which a question mark occurs. This way we “escape” any other meaning the character may have and interpret it literally.
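A minimal sketch of the escaped search, assuming we keep the character class from the previous attempt in front of the question mark:

```r
texts <- c("I'm learning to analyze texts with the program R today.",
           "Learning text analysis is exciting, but you have to learn programming with R.",
           "Learning R makes you realize: Programming can be exhausting, can't it?")

# "\\?" is a literal question mark, here directly after a letter or number
grep(pattern = "[a-zA-Z0-9]\\?", x = texts, value = TRUE)
```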
## [1] "Learning R makes you realize: Programming can be exhausting, can't it?"
Solution 2: Fix
A second solution would be to instruct R to understand the entire search pattern “literally”, i.e. to interpret all letters, numbers, and characters exactly as they appear. In this case, R does not interpret the question mark ? as a quantifier, but searches for the character ?.
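A minimal sketch using fixed = TRUE. Since the whole pattern is now read literally, character classes such as [a-zA-Z0-9] would no longer work, so the sketch searches for the question mark alone:

```r
texts <- c("I'm learning to analyze texts with the program R today.",
           "Learning text analysis is exciting, but you have to learn programming with R.",
           "Learning R makes you realize: Programming can be exhausting, can't it?")

# fixed = TRUE: the pattern is matched exactly as written
grep(pattern = "?", x = texts, value = TRUE, fixed = TRUE)
```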
## [1] "Learning R makes you realize: Programming can be exhausting, can't it?"
1.2 Searching for & manipulating string patterns
Let’s move on to the functions used to search for or manipulate string patterns.
You have already learned about grep(), with the help of which you can make R return or identify elements that contain a certain pattern.
Base R already provides many helpful functions for this, of which grep() is one.
In addition, the stringr package contains further helpful functions.
In the following, I will give a short overview of the most important functions to search or manipulate string patterns.
Remember: There are a lot of other helpful functions not mentioned here!
Function.name | package | operation |
---|---|---|
grep(pattern, x, value=FALSE) | Base R | Returns the position of the elements in the vector x that contain a pattern |
grep(pattern, x, value=TRUE) | Base R | Returns the content of elements in vector x that contain a pattern |
grepl(pattern, x) | Base R | For all elements in vector x, indicates whether they contain a pattern |
sub(pattern, replacement, x) | Base R | Replaces the first match for a pattern in vector x with replacement |
gsub(pattern, replacement, x) | Base R | Replaces all matches for a pattern in vector x with replacement |
str_extract(string, pattern) | stringr | Extracts the first match of a pattern in the vector string |
str_extract_all(string, pattern) | stringr | Extracts all matches of a pattern in the vector string |
str_count(string, pattern) | stringr | Counts how many times a pattern occurs in the vector string |
Using our vector texts, let’s apply these functions:
## [1] "I'm learning to analyze texts with the program R today."
## [2] "Learning text analysis is exciting, but you have to learn programming with R."
## [3] "Learning R makes you realize: Programming can be exhausting, can't it?"
1.2.1 Which elements contain a pattern?
The grep(pattern, x, value=FALSE) function returns the position of elements in the vector x that contain a pattern.
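A minimal sketch, assuming the pattern [Pp]rogram from the previous section:

```r
texts <- c("I'm learning to analyze texts with the program R today.",
           "Learning text analysis is exciting, but you have to learn programming with R.",
           "Learning R makes you realize: Programming can be exhausting, can't it?")

# value = FALSE (the default) returns positions instead of content
grep(pattern = "[Pp]rogram", x = texts, value = FALSE)
```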
## [1] 1 2 3
As you see, all elements contain the pattern “[Pp]rogram”.
1.2.2 What is the content of elements that contain a pattern?
The grep(pattern, x, value=TRUE) function returns the content of those elements in the vector x that contain a pattern.
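A minimal sketch with the same pattern, this time asking for the content of the matching elements:

```r
texts <- c("I'm learning to analyze texts with the program R today.",
           "Learning text analysis is exciting, but you have to learn programming with R.",
           "Learning R makes you realize: Programming can be exhausting, can't it?")

# value = TRUE returns the matching elements themselves
grep(pattern = "[Pp]rogram", x = texts, value = TRUE)
```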
## [1] "I'm learning to analyze texts with the program R today."
## [2] "Learning text analysis is exciting, but you have to learn programming with R."
## [3] "Learning R makes you realize: Programming can be exhausting, can't it?"
1.2.3 Do elements contain a pattern?
For all elements in the vector x, grepl(pattern, x) returns whether they contain a pattern.
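A minimal sketch with the same pattern:

```r
texts <- c("I'm learning to analyze texts with the program R today.",
           "Learning text analysis is exciting, but you have to learn programming with R.",
           "Learning R makes you realize: Programming can be exhausting, can't it?")

# one logical value per element: TRUE if the pattern is matched, FALSE otherwise
grepl(pattern = "[Pp]rogram", x = texts)
```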
## [1] TRUE TRUE TRUE
1.2.4 How can I replace the first match of a pattern?
I may want to replace a pattern the first time it is matched in a vector x.
For example, let’s say the first time a punctuation mark occurs, I want to replace it with the pattern punctuation.
To do so, I need to search for the pattern [[:punct:]] and replace it with the pattern punctuation.
The function sub(pattern, replacement, x) replaces the first match of a pattern in the vector x with replacement.
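A minimal sketch of this replacement, using the pattern and replacement named above:

```r
texts <- c("I'm learning to analyze texts with the program R today.",
           "Learning text analysis is exciting, but you have to learn programming with R.",
           "Learning R makes you realize: Programming can be exhausting, can't it?")

# only the first punctuation mark in each element is replaced
sub(pattern = "[[:punct:]]", replacement = "punctuation", x = texts)
```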
## [1] "Ipunctuationm learning to analyze texts with the program R today."
## [2] "Learning text analysis is excitingpunctuation but you have to learn programming with R."
## [3] "Learning R makes you realizepunctuation Programming can be exhausting, can't it?"
1.2.5 How can I replace all the matches of a pattern?
I may want to replace a pattern not only the first time, but every time it is matched in x.
The function gsub(pattern, replacement, x) replaces all matches of a pattern in the vector x with replacement.
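A minimal sketch, identical to the previous replacement except that gsub() is used instead of sub():

```r
texts <- c("I'm learning to analyze texts with the program R today.",
           "Learning text analysis is exciting, but you have to learn programming with R.",
           "Learning R makes you realize: Programming can be exhausting, can't it?")

# every punctuation mark in each element is replaced
gsub(pattern = "[[:punct:]]", replacement = "punctuation", x = texts)
```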
## [1] "Ipunctuationm learning to analyze texts with the program R todaypunctuation"
## [2] "Learning text analysis is excitingpunctuation but you have to learn programming with Rpunctuation"
## [3] "Learning R makes you realizepunctuation Programming can be exhaustingpunctuation canpunctuationt itpunctuation"
1.2.6 How can I extract the first match of a pattern from a vector?
You may want to know exactly what words are used to talk about programming, programs, etc.
Thus, you want to extract the first match for the pattern [Pp]rogram in the vector string.
The function str_extract(string, pattern) from the stringr package extracts the very first match for a pattern in each element of the vector string.
Caution:
The functions discussed in the following are part of the stringr package. They therefore vary slightly in terms of the names of arguments and their order:
- The object in which we look for a pattern is called x in base R functions, but string in stringr functions.
- The order of arguments is different: For base R functions such as grep() or grepl(), we first give R the pattern to look for and then the vector in which R should look for this pattern. In the stringr package, the order is exactly reversed.
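A minimal sketch of the extraction, assuming the stringr package is installed and using the pattern [Pp]rogram:

```r
library(stringr)

texts <- c("I'm learning to analyze texts with the program R today.",
           "Learning text analysis is exciting, but you have to learn programming with R.",
           "Learning R makes you realize: Programming can be exhausting, can't it?")

# note the argument order: first the vector (string), then the pattern
str_extract(string = texts, pattern = "[Pp]rogram")
```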
## [1] "program" "program" "Program"
1.2.7 How can I extract all matches of a pattern from a vector?
You may want to extract not only the first match, but all matches for the pattern [Pp]rogram in the elements of the vector string.
The function str_extract_all(string, pattern) extracts all matches for a pattern in the vector string.
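A minimal sketch, again assuming the stringr package is installed. The result is a list with one entry per element of texts:

```r
library(stringr)

texts <- c("I'm learning to analyze texts with the program R today.",
           "Learning text analysis is exciting, but you have to learn programming with R.",
           "Learning R makes you realize: Programming can be exhausting, can't it?")

# returns a list; each entry holds all matches found in one element
str_extract_all(string = texts, pattern = "[Pp]rogram")
```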
## [[1]]
## [1] "program"
##
## [[2]]
## [1] "program"
##
## [[3]]
## [1] "Program"
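The overview table also lists str_count(). As a minimal sketch, applied to the same vector and pattern and assuming the stringr package is installed, it counts the matches in each element:

```r
library(stringr)

texts <- c("I'm learning to analyze texts with the program R today.",
           "Learning text analysis is exciting, but you have to learn programming with R.",
           "Learning R makes you realize: Programming can be exhausting, can't it?")

# one count per element; each sentence contains the pattern exactly once
str_count(string = texts, pattern = "[Pp]rogram")
```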
1.3 Take Aways
Vocabulary:
- String patterns: String patterns are sequences of characters such as letters, numbers, or special characters.
- Regular expressions: Regular expressions are string patterns. However, they usually contain characters that are not interpreted literally, but refer to another meaning (metacharacters).
Commands:
- Searching for patterns: grep(), grepl()
- Manipulating patterns: sub(), gsub()
- Extracting patterns: str_extract(), str_extract_all()
- Counting the occurrence of patterns: str_count()
1.4 More tutorials on this
You still have questions? The following tutorials & papers can help you with that:
1.5 Test your knowledge
You’ve worked through all the material of Tutorial 1? Let’s see it - the following tasks will test your knowledge based on our data donations.
Please load the csv file containing the data donations. Make sure that you set your working directory with setwd() - otherwise, R will not know where your data file “lies”.
1.5.1 Task 1.1
Writing the corresponding code, reduce the dataset to only the first five donations (i.e., row 1-5).
Solution:
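One possible solution, sketched under assumptions: the object name data and its toy content are hypothetical stand-ins for the data frame loaded from the csv file. Only data_task1 is a name used later in this tutorial.

```r
# hypothetical stand-in for the loaded donation data (the real data comes from the csv file)
data <- data.frame(id = 1:8,
                   search_query = paste0("query_", 1:8),
                   stringsAsFactors = FALSE)

# keep rows 1 to 5 and all columns
data_task1 <- data[1:5, ]
nrow(data_task1)
```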
1.5.2 Task 1.2
Writing the corresponding code, count how many times the pattern disco occurs in each search query included in the variable search_query.
Solution:
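One possible solution, sketched with str_count() and assuming the stringr package is installed. The data frame below is only an illustrative stand-in for the first five donations, rebuilt from the cleaned queries shown in Task 1.3.

```r
library(stringr)

# illustrative stand-in for data_task1 (not the original data file)
data_task1 <- data.frame(
  search_query = paste0("https://www.youtube.com/results?search_query=",
                        c("barbara+becker+let%27s+dance",
                          "one+singular+sensation",
                          "Mini+disco",
                          "Mini+disco+superman",
                          "spiele+selber+machen")),
  stringsAsFactors = FALSE)

# count the occurrences of "disco" in each search query
str_count(string = data_task1$search_query, pattern = "disco")
```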
## [1] 0 0 1 1 0
1.5.3 Task 1.3
Let us make these search queries more readable. For now, our uncleaned data often contains “https://www.youtube.com/results?search_query=” before the actual query and different queries are connected with a plus sign “+”.
Writing the corresponding code, reduce the search queries to only the relevant search terms. (Recommendation: First remove the URL-string, e.g. “https://www.youtube.com/results?search_query=”, then replace all “+” signs with a blank space).
Solution:
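# First approach, based on our tutorial
library(dplyr) # provides %>%, mutate(), and select()

data_task1 %>%
  # remove the URL prefix; fixed = TRUE makes R read "?" and "." literally
  mutate(search_query = gsub(pattern = "https://www.youtube.com/results?search_query=",
                             replacement = "",
                             x = search_query,
                             fixed = TRUE)) %>%
  # replace every "+" with a blank space
  mutate(search_query = gsub(pattern = "+",
                             replacement = " ",
                             x = search_query,
                             fixed = TRUE)) %>%
  select(search_query)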
## search_query
## 1 barbara becker let%27s dance
## 2 one singular sensation
## 3 Mini disco
## 4 Mini disco superman
## 5 spiele selber machen
# Second approach, as proposed by Teodora
# Keep the part after the "=" of each query, then replace every "+" with a blank space
data_task1$search_query <- sapply(strsplit(data_task1$search_query, split = "=", fixed = TRUE),
                                  function(x) x[2]) %>%
  gsub(pattern = "\\+", replacement = " ")
Let’s keep going with Tutorial 2: Preprocessing