11 Tutorial 11: Preprocessing

After working through Tutorial 11, you’ll…

  • know how to deal with encoding issues
  • know how to tokenize text and remove numbers, punctuation etc.
  • know how to transform text to lowercase
  • know how to remove stopwords
  • know how to do lemmatizing/stemming
  • know how to remove frequent/rare words

For illustration, we’ll once again work with a text corpus included in the quanteda.corpora package. You can find the corresponding R file in OLAT (via: Materials / Data for R) under the name immigration_news.rda.

Source of the data set: Nulty, P. & Poletti, M. (2014). “The Immigration Issue in the UK in the 2014 EU Elections: Text Mining the Public Debate.” Presentation at LSE Text Mining Conference 2014. Accessed via the quanteda corpus package.

load("immigration_news.rda")

11.1 Check for encoding issues

After you have read in texts, you should always check whether there are any encoding issues. The best way to do this is to read through a few texts to spot issues or things that need further cleaning:

data$text[1]
## [1] "support for ukip continues to grow in the labour heartlands and miliband should be scared\nby leo mckinstry   \n933 words\n10 april 2014\n1415\nexpresscouk\nexco\nenglish\ncopyright 2014   \nnigel farage's overwhelming victory in his two debates against nick clegg has transformed the political landscape\nthrough his barnstorming performance farage has not only brought new credibility and popularity to the uk independence party but has also given a highly articulate voice to large swathes of the public who feel betrayed by our arrogant metropolitan elite \nthe impact of farage's triumph was revealed in opinion polls at the weekend that showed a surge in support for ukip one looking at voting intentions for the european parliamentary elections next month put ukip on 30 per cent neck-and-neck with labour there is little doubt that ukip could emerge the biggest party in the euro contest a remarkable outcome for an organisation with no seats at westminster\nconventional wisdom holds that it is cameron's own conservatives who will suffer the most damage from the rise in ukip particularly at the general election in 2015 in this fashionable narrative farage's followers are portrayed as disgruntled right-wing home counties tories spluttering into their gin-and-tonics at the follies of the coalition and the european commission \nin fact most are workingclass and have never voted conservative that's why labour's complacency is so misguided labour strategists have seen ukip's rise as a welcome development that will destroy the tory vote and sweep ed miliband into number 10 but that could be wishful thinking for it now seems the ukip insurgency could be just as big a threat in labour's heartlands \nresearch by academics matthew goodwin and robert ford shows of the 10 most \"ukip-friendly\" constituencies eight are held by labour and \"the largest concentrations of core ukip supporters are not found in tory seats in the shires but in labour fiefdoms such as miliband's in doncaster north\" \nukip's populist message resonates most strongly with those neglected or marginalised by the political establishment all the points nigel farage made so forcefully against clegg could have been directed against miliband the liberal democrat and labour leaders are the tweedledum and tweedledee of progressive orthodoxy obsessed with mass immigration multi-cultural diversity european integration and the green agenda behind a deluge of weasel words miliband is as opposed as clegg to a referendum on britain's eu membership neither truly believes in democracy or in trusting the public\ned miliband's ideological stance shows how labour has abandoned its grassroots [wenn]\nmiliband's ideological stance shows how labour has abandoned its grassroots the party was founded more than a century ago to represent the working class but now in"

Here, reading in the texts apparently did not work the way it should have: the string pattern \n occurs over and over again in the texts.

The reason for that is that \n stands for a line break in many programming languages. In the text file that was read in, for example, there was a line break after the headline of the article, followed by the author’s name, followed by another line break.

If we want to display the text similarly to the View() function (where everything looks normal), we can do so like this:

writeLines(data$text[1])
## support for ukip continues to grow in the labour heartlands and miliband should be scared
## by leo mckinstry   
## 933 words
## 10 april 2014
## 1415
## expresscouk
## exco
## english
## copyright 2014   
## nigel farage's overwhelming victory in his two debates against nick clegg has transformed the political landscape
## through his barnstorming performance farage has not only brought new credibility and popularity to the uk independence party but has also given a highly articulate voice to large swathes of the public who feel betrayed by our arrogant metropolitan elite 
## the impact of farage's triumph was revealed in opinion polls at the weekend that showed a surge in support for ukip one looking at voting intentions for the european parliamentary elections next month put ukip on 30 per cent neck-and-neck with labour there is little doubt that ukip could emerge the biggest party in the euro contest a remarkable outcome for an organisation with no seats at westminster
## conventional wisdom holds that it is cameron's own conservatives who will suffer the most damage from the rise in ukip particularly at the general election in 2015 in this fashionable narrative farage's followers are portrayed as disgruntled right-wing home counties tories spluttering into their gin-and-tonics at the follies of the coalition and the european commission 
## in fact most are workingclass and have never voted conservative that's why labour's complacency is so misguided labour strategists have seen ukip's rise as a welcome development that will destroy the tory vote and sweep ed miliband into number 10 but that could be wishful thinking for it now seems the ukip insurgency could be just as big a threat in labour's heartlands 
## research by academics matthew goodwin and robert ford shows of the 10 most "ukip-friendly" constituencies eight are held by labour and "the largest concentrations of core ukip supporters are not found in tory seats in the shires but in labour fiefdoms such as miliband's in doncaster north" 
## ukip's populist message resonates most strongly with those neglected or marginalised by the political establishment all the points nigel farage made so forcefully against clegg could have been directed against miliband the liberal democrat and labour leaders are the tweedledum and tweedledee of progressive orthodoxy obsessed with mass immigration multi-cultural diversity european integration and the green agenda behind a deluge of weasel words miliband is as opposed as clegg to a referendum on britain's eu membership neither truly believes in democracy or in trusting the public
## ed miliband's ideological stance shows how labour has abandoned its grassroots [wenn]
## miliband's ideological stance shows how labour has abandoned its grassroots the party was founded more than a century ago to represent the working class but now in

However, if we print a single text, we see that R has stored this information (the line break) in the form of the string pattern \n.

This can become problematic: For example, if we check whether text 1 contains the string pattern 933 words 10 april 2014 (which it appears to when we look at the writeLines() output above), R indicates that this is not the case:

grepl(pattern = "933 words 10 april 2014", x = data$text[1])
## [1] FALSE

The reason is that there is a line break after the string pattern 933 words. Accordingly, the first text in data does not contain the string pattern 933 words 10 april 2014. Instead, it contains the string pattern 933 words\n10 april 2014 (i.e., including the line break \n).

If we include the line break, the computer will find the correct pattern:

grepl(pattern = "933 words\n10 april 2014", x = data$text[1])
## [1] TRUE

Please note: You should preferably read in texts with the correct encoding to avoid these issues. This can be done directly via the encoding argument of the readtext() function from the readtext package.
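
If you read in raw text files yourself, this could look like the following sketch (not run here; the file path and encoding are purely illustrative and need to be adapted to your own data):

library("readtext")
# Read all .txt files from a (hypothetical) folder, specifying the encoding
raw <- readtext("immigration_news/*.txt", encoding = "UTF-8")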

If this is not possible or encoding issues remain, you can manually clean texts using string manipulation (exactly what you learned in Tutorial 9). Again, this is not the preferred solution, but may be necessary in some instances.

For example, we could replace the string pattern that stands for a line break with a blank space:

data$text <- gsub(pattern = "\n", replacement = " ", x = data$text)
data$text[1]
## [1] "support for ukip continues to grow in the labour heartlands and miliband should be scared by leo mckinstry    933 words 10 april 2014 1415 expresscouk exco english copyright 2014    nigel farage's overwhelming victory in his two debates against nick clegg has transformed the political landscape through his barnstorming performance farage has not only brought new credibility and popularity to the uk independence party but has also given a highly articulate voice to large swathes of the public who feel betrayed by our arrogant metropolitan elite  the impact of farage's triumph was revealed in opinion polls at the weekend that showed a surge in support for ukip one looking at voting intentions for the european parliamentary elections next month put ukip on 30 per cent neck-and-neck with labour there is little doubt that ukip could emerge the biggest party in the euro contest a remarkable outcome for an organisation with no seats at westminster conventional wisdom holds that it is cameron's own conservatives who will suffer the most damage from the rise in ukip particularly at the general election in 2015 in this fashionable narrative farage's followers are portrayed as disgruntled right-wing home counties tories spluttering into their gin-and-tonics at the follies of the coalition and the european commission  in fact most are workingclass and have never voted conservative that's why labour's complacency is so misguided labour strategists have seen ukip's rise as a welcome development that will destroy the tory vote and sweep ed miliband into number 10 but that could be wishful thinking for it now seems the ukip insurgency could be just as big a threat in labour's heartlands  research by academics matthew goodwin and robert ford shows of the 10 most \"ukip-friendly\" constituencies eight are held by labour and \"the largest concentrations of core ukip supporters are not found in tory seats in the shires but in labour fiefdoms such as miliband's in doncaster north\"  ukip's populist message resonates most strongly with those neglected or marginalised by the political establishment all the points nigel farage made so forcefully against clegg could have been directed against miliband the liberal democrat and labour leaders are the tweedledum and tweedledee of progressive orthodoxy obsessed with mass immigration multi-cultural diversity european integration and the green agenda behind a deluge of weasel words miliband is as opposed as clegg to a referendum on britain's eu membership neither truly believes in democracy or in trusting the public ed miliband's ideological stance shows how labour has abandoned its grassroots [wenn] miliband's ideological stance shows how labour has abandoned its grassroots the party was founded more than a century ago to represent the working class but now in"

11.2 Tokenization & removing numbers, punctuation, etc.

You have already learned how to break text down into individual features. We can do this tokenization by relying on the tokens() command.

What I didn’t tell you: You can also remove features with (often) little informative value in the same step. This includes numbers and punctuation marks, but also, for example, URLs that may be in the text. We now remove numbers, punctuation marks & URLs because we assume that these features will not help us determine similarities and differences between texts.

Important: As mentioned in the theory session on the topic, deciding on these preprocessing steps and their order can have a large impact on your results; see, in particular, the paper by Denny & Spirling (2018). For example, numbers may well be meaningful for some analyses and the results may be quite different if you remove them (think, for instance, of the political meaning of the feature “G7”).

Thus, you should always carefully consider which features to remove and why.

We now tokenize our texts and remove punctuation, numbers, and URLs in one step (the tokens() function includes even more potential preprocessing arguments, which we will ignore for now).

This time, we directly convert our texts from our dataframe object data to a tokens object.

library("quanteda")
tokens <- tokens(data$text, what = "word",
                 remove_punct = TRUE,
                 remove_numbers = TRUE,
                 remove_url = TRUE)
tokens[1]
## Tokens consisting of 1 document.
## text1 :
##  [1] "support"    "for"        "ukip"       "continues"  "to"         "grow"       "in"         "the"       
##  [9] "labour"     "heartlands" "and"        "miliband"  
## [ ... and 422 more ]

11.3 Normalizing to lowercase

In many cases, you will also want to further normalize your texts by converting them to lowercase. This has the advantage that R will more easily recognize identical features that have, for instance, only been capitalized because they occur at the beginning of a sentence.

Look at two sentences as an example:

  • Here, the word here would have a certain meaning.
  • But here, the meaning of the word here would be the same although the word is spelled differently.

In both cases, here means the same thing - but if you don’t normalize texts to lowercase, R would not recognize that the features here and Here mean exactly the same thing.

(Again, remember that there may be exceptions: Consider, for instance, the difference between the word united in the following two sentences, partly indicated by lowercase or uppercase spelling: “We stand united.” “He is in the United States.”)
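
To see the effect in isolation, here is a small toy example (the two sentences are made up for illustration and are not part of our corpus). Without lowercasing, "Here" and "here" would be counted as two different features; after tokens_tolower(), they become identical:

tokens_tolower(tokens(c("Here we go.", "But here we stay.")))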

Our texts have already been transformed to lowercase. For the sake of completeness, I’ll still show you how to do this:

tokens <- tokens_tolower(tokens)
tokens[1]
## Tokens consisting of 1 document.
## text1 :
##  [1] "support"    "for"        "ukip"       "continues"  "to"         "grow"       "in"         "the"       
##  [9] "labour"     "heartlands" "and"        "miliband"  
## [ ... and 422 more ]

11.4 Removing stopwords

Moreover, our corpus contains stopwords.

Stopwords are words that are not very informative for detecting similarities and differences between texts (or so we assume). The quanteda package contains a number of lists of stopwords in different languages.
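
For instance, besides the English list used below, you can inspect the lists for other languages (a quick sketch; the exact entries depend on the stopword source quanteda relies on):

stopwords("german")[1:10]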

The following command calls an extract of the English-language stopword list included in quanteda. Again, you should check this list before blindly removing these words - in some cases, stopword lists may include words that are actually informative for your analysis.

stopwords("english")[1:20]
##  [1] "i"          "me"         "my"         "myself"     "we"         "our"        "ours"       "ourselves" 
##  [9] "you"        "your"       "yours"      "yourself"   "yourselves" "he"         "him"        "his"       
## [17] "himself"    "she"        "her"        "hers"

Using the following command, we can remove these stopwords from our texts (see, for instance, how the feature to vanishes):

tokens <- tokens_remove(tokens, stopwords("english"))
tokens[1]
## Tokens consisting of 1 document.
## text1 :
##  [1] "support"    "ukip"       "continues"  "grow"       "labour"     "heartlands" "miliband"   "scared"    
##  [9] "leo"        "mckinstry"  "words"      "april"     
## [ ... and 239 more ]

You can also build your own stopword list and remove the features it includes.

In our corpus, each document contains the word words at the beginning of the article (a formal string pattern indicating the length of the text). You may, in some cases, want to remove this feature.

Removing such individual features works as follows:

tokens <- tokens_remove(tokens, c("words"))
tokens[1]
## Tokens consisting of 1 document.
## text1 :
##  [1] "support"     "ukip"        "continues"   "grow"        "labour"      "heartlands"  "miliband"    "scared"     
##  [9] "leo"         "mckinstry"   "april"       "expresscouk"
## [ ... and 237 more ]

11.5 Lemmatizing/Stemming

Another common preprocessing step used to normalize text is to reduce words to their base form (lemmatizing) or their root (stemming). We will learn only the latter in this tutorial, i.e., how to “stem” texts (if you’re interested in lemmatizing, check out the spacyr package, a wrapper for the spaCy Python library).
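
For the sake of completeness, a minimal lemmatizing sketch with spacyr could look like this (not run here; it assumes that spaCy has been installed on your machine, e.g. via spacyr::spacy_install()):

library("spacyr")
spacy_initialize()
# spacy_parse() returns a data frame with one row per token,
# including a column with the lemma of each token
parsed <- spacy_parse(c("decide", "deciding", "decided"), lemma = TRUE)
parsed$lemma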

For example, a text might contain the words “decide”, “deciding”, and “decided”.

The problem: We know that these different features essentially describe the same thing, namely that something is being decided.

In order for R to recognize this, we need to normalize these words by reducing them to their word stem. This is done with the tokens_wordstem() command:

tokens_wordstem(tokens(c("decide", "deciding", "decided")))
## Tokens consisting of 3 documents.
## text1 :
## [1] "decid"
## 
## text2 :
## [1] "decid"
## 
## text3 :
## [1] "decid"

Let’s apply this command to our corpus:

#Before stemming
tokens[1]
## Tokens consisting of 1 document.
## text1 :
##  [1] "support"     "ukip"        "continues"   "grow"        "labour"      "heartlands"  "miliband"    "scared"     
##  [9] "leo"         "mckinstry"   "april"       "expresscouk"
## [ ... and 237 more ]
#Apply stemming
tokens <- tokens_wordstem(tokens)
#After stemming
tokens[1]
## Tokens consisting of 1 document.
## text1 :
##  [1] "support"     "ukip"        "continu"     "grow"        "labour"      "heartland"   "miliband"    "scare"      
##  [9] "leo"         "mckinstri"   "april"       "expresscouk"
## [ ... and 237 more ]

Here, you can see, for instance, that the feature “continues” (part of the first text) has been reduced to the word stem “continu”, “scared” to “scare”, etc.

Again, stemming has drawn some criticism (see, for instance, this paper).

In some cases, word stems may not accurately reflect the meaning of a feature, or features may be reduced to the wrong stem.
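
A small toy example (not part of our corpus) illustrates this: the semantically quite different words "universe" and "university" end up with the same stem, so R would treat them as one and the same feature:

tokens_wordstem(tokens(c("universe", "university")))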

At the very least, you should check some texts after using stemming to see whether this has actually helped to normalize your texts - or whether you would be better off not applying stemming.

11.6 Removing rare/frequent words

Lastly, we often want to remove features that appear in almost every text or in almost no text. Features that occur (almost) always or (almost) never will often not help us detect similarities or differences between texts, meaning they are not very informative.

To remove frequent or rare features, we first convert our tokens object to a document-feature matrix (DFM).

dfm <- dfm(tokens)
print(dfm)
## Document-feature matrix of: 2,833 documents, 26,542 features (99.35% sparse) and 0 docvars.
##        features
## docs    support ukip continu grow labour heartland miliband scare leo mckinstri
##   text1       3    9       1    1     10         2        7     1   1         1
##   text2       0    1       0    0      0         0        0     0   0         0
##   text3       0    1       0    0      0         0        0     0   0         0
##   text4       0    0       0    0      0         0        0     0   0         0
##   text5       0    1       0    0      0         0        0     0   0         0
##   text6       0    0       0    0      0         0        0     0   0         0
## [ reached max_ndoc ... 2,827 more documents, reached max_nfeat ... 26,532 more features ]

Here, we use the dfm_trim() command to remove all features that occur in fewer than 0.5% of all documents or in more than 99% of all documents.

This is also called relative pruning.

Doing so significantly reduces the size of our DFM: we keep far fewer features than before but - or so we hope - still include those that are informative.

dfm <- dfm_trim(dfm, 
                min_docfreq = 0.005, 
                max_docfreq = 0.99, 
                docfreq_type = "prop", 
                verbose = TRUE) 
## Removing features occurring:
##   - in fewer than 14.165 documents: 22,656
##   - in more than 2804.67 documents: 1
##   Total features removed: 22,657 (85.4%).
dfm
## Document-feature matrix of: 2,833 documents, 3,885 features (96.18% sparse) and 0 docvars.
##        features
## docs    support ukip continu grow labour heartland miliband scare leo mckinstri
##   text1       3    9       1    1     10         2        7     1   1         1
##   text2       0    1       0    0      0         0        0     0   0         0
##   text3       0    1       0    0      0         0        0     0   0         0
##   text4       0    0       0    0      0         0        0     0   0         0
##   text5       0    1       0    0      0         0        0     0   0         0
##   text6       0    0       0    0      0         0        0     0   0         0
## [ reached max_ndoc ... 2,827 more documents, reached max_nfeat ... 3,875 more features ]

More than 22,000 features were removed. Now, about 4,000 features remain - we hope that these are the most “informative” for detecting similarities and differences between texts.
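
If you want to check which features survived the pruning, quanteda's topfeatures() function gives a quick overview, for example of the 20 most frequent remaining features (output not shown here):

topfeatures(dfm, 20)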

11.7 Take Aways

Vocabulary:

  • Preprocessing: Preprocessing refers to different transformations used to clean or normalize text, in particular to remove features not helpful for detecting similarities and differences between texts.
  • Normalization: Normalization of text is part of the preprocessing process: We align features that have a similar meaning so that they take on a similar form - for example, by converting them to lowercase or reducing them to their word stem.
  • Lemmatizing: Lemmatizing refers to the reduction of a word to its basic form.
  • Stemming: Stemming refers to the removal of suffixes and the reduction of a word to its word stem.
  • Relative pruning: Relative pruning indicates the removal of relatively frequent/rare features.

Commands:

  • Tokenization & removing punctuation, numbers, etc.: tokens()
  • Transforming to lowercase: tokens_tolower()
  • Removing stopwords: tokens_remove(), stopwords("english")
  • Stemming: tokens_wordstem()
  • Relative pruning: dfm_trim()