2 Tutorial 2: Preprocessing

In this tutorial, we learn how to preprocess texts before analyzing them.

Preprocessing is a way of making sure that features (e.g., words) used for analysis are comparable (e.g., that a computer recognizes similar/the same words, even if they are spelled differently) and relevant (e.g., that we only include words that are relevant in our analysis).

As such, a big part of preprocessing is normalization: We align features that have a similar meaning to make them take on a similar form - for example, by converting them to lowercase or reducing them to their word stem.

After working through Tutorial 2, you’ll…

  • know how to deal with encoding issues
  • know how to transform text to lowercase
  • know how to stem words

You will also learn how to use quanteda, a powerful R package for automated text analysis.

For illustration, we’ll once again work with our data donations.

In this scenario, I already remove the URL prefix and the “+” signs between search terms:

library(dplyr)   #for the pipe and mutate()/across()

data <- read.csv2("sample_youtube.csv")
data <- data %>% 
  #remove the URL prefix before each search query
  mutate(across("search_query", 
                ~ gsub("https://www.youtube.com/results?search_query=", "", .x, 
                       fixed = TRUE))) %>%
  #replace the "+" signs between search terms with spaces
  mutate(across("search_query", 
                ~ gsub("+", " ", .x, fixed = TRUE)))

2.1 Check for encoding issues

When working with texts, you should always check whether there are any encoding issues.

Character encodings:

Computers store individual characters - such as the letters “w”, “o”, “r”, and “d”, which make up the word “word” - as a specific numeric code (bytes) assigned to each character. For example, “word” is stored as “01110111 01101111 01110010 01100100”.

By using a specific character encoding, R determines which numeric 0/1 code is associated with which character (for instance, which number is translated to which letter when translating bytes to character data).

In a sense, character encodings are the “key” with which the computer determines how to translate numeric 0/1 codes into characters.
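If you want to see these bytes for yourself, base R’s charToRaw() function returns the raw bytes behind a character string (in hexadecimal rather than binary notation):

charToRaw("word")
## [1] 77 6f 72 64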

The problem: Different encodings coexist.

As you know, many languages contain special characters - in German, for example, the letters “ü” and “ä”, which are not part of the English alphabet. These special characters are often read in incorrectly if the computer is not told which character encoding to use: if it falls back on its default encoding, they may be “translated” incorrectly.
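We can illustrate this with a small sketch: We take a string stored as UTF-8 and deliberately declare the wrong encoding. The bytes stay the same, but the “key” no longer fits (what exactly gets printed depends on your system’s locale):

x <- "tatsächlich"        #stored as UTF-8; the "ä" corresponds to the two bytes c3 a4
Encoding(x) <- "latin1"   #declare the wrong "key"
x                         #on a UTF-8 system, this will typically print as "tatsÃ¤chlich"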

Oftentimes, R recognizes the correct encoding on its own, for instance when reading in texts.

Let’s check if we have encoding issues by reading through some random search queries:

data$search_query[1]
## [1] "barbara becker let%27s dance"

When reading the first search query, it becomes evident that reading in the texts has apparently not worked the way it should have: the apostrophe has been replaced by the code “%27”. This text should probably read: “barbara becker let’s dance”.

Let’s look at another example:

data$search_query[12]
## [1] "tats%C3%A4chlich liebe"

Here, someone seems to have looked up the movie “tatsächlich liebe” - but the “ä”, an umlaut in German, has not been recognized.

Please note: You should preferably read in texts with the correct encoding to avoid these issues in the first place. This can often be done directly when importing the data, for instance with the readtext() function from the readtext package.
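As a rough sketch of what this could look like for our csv file (assuming the file is actually UTF-8 encoded - adjust the encoding name to your data):

#option 1: specify the encoding directly when reading in the csv file
data <- read.csv2("sample_youtube.csv", fileEncoding = "UTF-8")

#option 2: use the readtext package, which also has an encoding argument
library(readtext)
data_text <- readtext("sample_youtube.csv", text_field = "search_query", encoding = "UTF-8")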

If this is not possible or encoding issues remain, you can manually clean texts using string manipulation (exactly what you learned in Tutorial 1). Again, this is not the preferred solution, but may be necessary in some instances.

In our example, you have been provided with a dataset that contains encoding issues. In this case, we have to manually replace all wrong encodings.

To do that, we can rely on our knowledge from Tutorial 1: Searching & manipulating string patterns by using the gsub() function to replace wrong encodings:

data$search_query <- gsub("%27", "'", data$search_query)
data$search_query[1]
## [1] "barbara becker let's dance"

We could also use the grep() function to find further encoding issues. If we know that encoding issues are often characterized by the inclusion of a percentage sign % where there should not be any, we can leverage this knowledge to retrieve encoding issues:

encoding_issues <- grep('%', data$search_query, value = TRUE)
encoding_issues[1:5]
## [1] "tats%C3%A4chlich liebe"                     "I%E2%80%98m never getting that drunl agaim"
## [3] "reel f%C3%BCr fortgeschrittene"             "deutsche filmf%C3%B6rderung"               
## [5] "neuk%C3%B6lln"

2.2 Normalizing to lowercase

In many cases, you will also want to further normalize your texts by converting them to lowercase. This has the advantage that R will more easily recognize features as similar that have, for instance, only been capitalized because they appear at the beginning of a sentence.

Look at two sentences as an example:

  • Here, the word here would have a certain meaning.
  • But here, the meaning of the word here would be the same although the word is spelled differently.

In both cases, here means the same thing - but if you don’t normalize texts to lowercase, R would not recognize that the features here and Here mean exactly the same thing (aha! This is what normalization means!).

(Again, remember that there may be exceptions: Consider, for instance, the difference between the word united in the following two sentences, partly indicated by lowercase or uppercase spelling: “We stand united.” “He is in the United States.”)

In R, you can do this by using the char_tolower() function from the quanteda package. Make sure that you have the quanteda package installed before trying to load it:

library("quanteda")
lowercase <- char_tolower(data$search_query)
lowercase[1:10]
##  [1] "barbara becker let's dance"  "one singular sensation"      "mini disco"                 
##  [4] "mini disco superman"         "spiele selber machen"        "10 min upper body abs"      
##  [7] "four weddings and a funeral" "45 min workout pamela reif"  "wahlverwandschaften goethe" 
## [10] "beckenrand sheriff"

Great - now all texts have been changed to lowercase!
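If you worry about cases like the “United States” example above, note that char_tolower() also has a keep_acronyms argument, which preserves tokens written entirely in capital letters (a small sketch - whether this helps depends on your data):

char_tolower("He lives in the US", keep_acronyms = TRUE)   #keeps "US", lowercases the rest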

2.3 Lemmatizing/Stemming

Another common preprocessing step used to normalize text is to reduce words to their base form (lemmatizing) or their root (stemming). We will only cover the latter in this tutorial, i.e., how to “stem” texts (if you’re interested in lemmatizing, check out the spacyr package, a wrapper for the spaCy Python library; a short sketch follows at the end of this section).

For example, a text might contain the words “decide”, “deciding”, and “decided”.

The problem: We know that these different features substantially describe the same thing, namely that something is being decided.

In order for R to recognize this, we need to normalize these words by reducing them to their word stem. This is done with quanteda’s char_wordstem() function:

char_wordstem(c("decide", "deciding", "decided"))
## [1] "decid" "decid" "decid"

Let’s apply this command to our corpus. Note that we set the check_whitespace argument to FALSE because the command usually expects tokenized texts - something you can ignore for now:

stemmed <- char_wordstem(data$search_query, check_whitespace = FALSE)
stemmed[1:10]
##  [1] "barbara becker let's d"     "one singular sens"          "Mini disco"                 "Mini disco superman"       
##  [5] "spiele selber machen"       "10 min upper body ab"       "four weddings and a funer"  "45 min workout pamela reif"
##  [9] "wahlverwandschaften goeth"  "beckenrand sheriff"

Here, you can see, for instance, that the word “funeral” (search query 7) has been reduced to the word stem “funer” - but also that the tool does not work too well here, since most words have not been stemmed.
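One likely reason (an assumption worth checking against your own data) is that many of our search queries are German, while char_wordstem() stems in English by default. The function has a language argument that passes other Snowball stemmers on to the SnowballC package, for example:

#stem with the German Snowball stemmer instead of the English default
stemmed_de <- char_wordstem(data$search_query, language = "german", check_whitespace = FALSE)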

Again, stemming has drawn some criticism (see, for instance, this paper). Often, word stems may not accurately reflect the meaning of a feature, or features may be reduced to a wrong stem.

At the very least, you should check some texts after using stemming to see whether this has actually helped to normalize your texts - or whether you would be better off not applying stemming.
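If stemming does not help, lemmatizing may be the better option. As mentioned above, the spacyr package can be used for this. The following is only a rough sketch: it assumes you have installed spaCy and an English language model (e.g., via spacyr::spacy_install()), and model names and exact output may differ on your system:

library(spacyr)
spacy_initialize(model = "en_core_web_sm")        #start the spaCy backend
parsed <- spacy_parse("decide deciding decided")  #tokenize, tag, and lemmatize the text
parsed$lemma                                      #should contain the base form "decide"
spacy_finalize()                                  #shut the backend down again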

2.4 Take Aways

Vocabulary:

  • Preprocessing: Preprocessing refers to different transformations used to clean or normalize text, in particular to remove features not helpful for detecting similarities and differences between texts.
  • Normalization: Normalization of text is part of the preprocessing process: We align features that have a similar meaning for them to take on a similar form - for example, by converting them to lowercase or reducing them to their word stem.
  • Lemmatizing: Lemmatizing refers to the reduction of a word to its basic form.
  • Character encoding: Character encoding refers to the “key” with which the computer determines how to translate characters into bytes (as numeric information) - and vice versa.
  • Stemming: Stemming refers to the removal of suffixes and the reduction of a word to its word stem.

Commands:

  • Transforming to lowercase: char_tolower()
  • Stemming: char_wordstem()

2.5 More tutorials on this

You still have questions? The following tutorials & papers can help you with that:

2.6 Test your knowledge

You’ve worked through all the material of Tutorial 2? Let’s see - the following tasks will test your knowledge based on our data donations.

Please load the csv file containing the data donations and repeat the first preprocessing steps:

data <- read.csv2("sample_youtube.csv")
data <- data %>% 
  #remove the URL prefix before each search query
  mutate(across("search_query", 
                ~ gsub("https://www.youtube.com/results?search_query=", "", .x, 
                       fixed = TRUE))) %>%
  #replace the "+" signs between search terms with spaces
  mutate(across("search_query", 
                ~ gsub("+", " ", .x, fixed = TRUE)))

2.6.1 Task 2.1

Write the corresponding code to check for encoding issues. Then remove all encoding issues by replacing wrong encodings with the correct characters (e.g., “%C3%A4” should become “ä”, and so forth).

Solution:

#check which may contain encoding issues, as indicated by %
encoding_issues <- grep("%", data$search_query, value = TRUE)
encoding_issues[1:5]
## [1] "barbara becker let%27s dance"               "tats%C3%A4chlich liebe"                    
## [3] "I%E2%80%98m never getting that drunl agaim" "reel f%C3%BCr fortgeschrittene"            
## [5] "deutsche filmf%C3%B6rderung"
#correct for most encoding issues
data <- data %>%
  mutate(
         #Correct encoding for German "Umlaute"
         search_query = gsub("%C3%B6", "ö", search_query),
         search_query = gsub("%C3%A4", "ä", search_query),
         search_query = gsub("%C3%BC", "ü", search_query),
         search_query = gsub("%C3%9", "Ü", search_query),
         
         #Correct encoding for special signs
         search_query = gsub("%C3%9F", "ß", search_query),
         
         #Correct encoding for punctuation
         search_query = gsub("%0A", " ", search_query),
         search_query = gsub("%22", '"', search_query),
         search_query = gsub("%23", "#", search_query),
         search_query = gsub("%26", "&", search_query),
         search_query = gsub("%27|%E2%80%98|%E2%80%99|%E2%80%93|%C2%B4", "'", search_query),
         search_query = gsub("%2B", "+", search_query),
         search_query = gsub("%3D", "=", search_query),
         search_query = gsub("%3F", "?", search_query),
         search_query = gsub("%40", "@", search_query),
         
         #Correct encoding for letters from other languages
         search_query = gsub("%C3%A7", "ç", search_query),
         search_query = gsub("%C3%A9", "é", search_query),
         search_query = gsub("%C3%B1", "ñ", search_query),
         search_query = gsub("%C3%A5", "å", search_query),
         search_query = gsub("%C3%B8", "ø", search_query),
         search_query = gsub("%C3%BA", "ú", search_query),
         search_query = gsub("%C3%AE", "î", search_query))

2.6.2 Task 2.2

Discuss which preprocessing steps you would use or not use for our search queries. Are there any relevant preprocessing steps which you would need but which are not discussed here?

Let’s keep going with Tutorial 3: Rule-based Approaches & Dictionaries