9.5 Applications

This chapter so far has shown that working with text is challenging, but can also be rewarding and inspiring. In fact, blurring the boundaries between text and other data types is one of the most creative parts of data science. The following examples sketch the scientific and artistic potential of such transgressions and will hopefully stimulate your imagination.

9.5.1 Transl33ting text

So-called leet slang (also known as l33t, 1337, eleet, or leetspeak) is a system of modified spellings used primarily by online gaming or hacker communities (see the Wikipedia article leet).

The transl33t() function of ds4psy translates text into some variety of leet:

With a little bit of practice, it actually becomes quite easy to read text in leet. To experience this for yourself, transl33t the sentences from stringr and observe how you get faster when reading them aloud:

Table 9.2: Some sentences in l33t slang.
sentence | s3nt3nc3
The birch canoe slid on the smooth planks. | 7h3 b1Rch c4n03 5l1d 0n +h3 5m00+h pl4nk5.
Glue the sheet to the dark blue background. | Glu3 +h3 5h33+ +0 +h3 d4Rk blu3 b4ckgR0und.
It’s easy to tell the depth of a well. | 1+’5 345y +0 +3ll +h3 d3p+h 0f 4 w3ll.
These days a chicken leg is a rare dish. | 7h353 d4y5 4 ch1ck3n l3g 15 4 R4R3 d15h.
Rice is often served in round bowls. | R1c3 15 0f+3n 53Rv3d 1n R0und b0wl5.
The juice of lemons makes fine punch. | 7h3 ju1c3 0f l3m0n5 m4k35 f1n3 punch.

The rules used by transl33t() for translating individual characters are specified in a named vector l33t_rules:

By re-defining this vector, we can change the degree and the details of the conversion:

With the functions for replacing characters introduced in Sections 9.3 and 9.4, we can also design our own translations:
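As a minimal sketch of such a custom translation, the base R function chartr() swaps single characters one-for-one. The rule strings old and new below are our own illustrative choice and are not the rules used by transl33t():

```r
# Hypothetical mini-ruleset (NOT the rules used by ds4psy::transl33t()):
# the i-th character of `old` is replaced by the i-th character of `new`
old <- "aeiost"
new <- "431057"

s <- c("leet speak", "The birch canoe slid on the smooth planks.")

# chartr() is vectorized over x and case-sensitive (the capital "T" stays as is)
l33t_s <- chartr(old = old, new = new, x = s)
l33t_s

# Replacing a single character with gsub() (fixed = TRUE: no regex interpretation)
plus_s <- gsub("t", "+", s[1], fixed = TRUE)
plus_s
```

Because chartr() requires both rule strings to cover the same number of characters, replacing a character by a longer string calls for gsub() or a stringr replacement function instead.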

Given these skills, you are very close to creating your own transl33t() function (see Chapter 11: Functions).

We will re-visit character replacements in an exercise on naive cryptography below.

9.5.2 Detecting text in tibbles

In applied contexts, the character variable to be analyzed often does not come as an isolated vector of strings, but as a column in a larger table (or tibble). Fortunately, the commands discussed in this chapter can be used in combination with the functions discussed in the rest of this book. For instance, we can use the stringr function str_detect() as part of a filter() command in a dplyr pipe. As an example, the following pipe uses the tibble data_t1 from ds4psy and selects those participants whose family name (as indicated by their 2nd initial) starts with the letter “M”, “N”, or “O”:
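The following sketch illustrates the general pattern of such a pipe. To keep it self-contained, it uses a small made-up table dat (with hypothetical columns name and score) rather than data_t1, and it assumes that initials are stored in the form "A.M.":

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for a participant table (NOT the actual data_t1)
dat <- tibble::tibble(
  name  = c("A.M.", "B.K.", "C.O.", "D.S."),
  score = c(12, 15, 9, 11)
)

# Keep only rows whose second initial is M, N, or O
dat_mno <- dat %>%
  filter(str_detect(name, "^[A-Z]\\.[MNO]\\.$"))

dat_mno
```

Since str_detect() returns a logical vector with one element per row, it can be used wherever filter() expects a condition.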

9.5.3 Counting word or character frequency

Imagine we wanted to count the number of times a word or character appears in some article or book. As a dataset of non-trivial complexity, let’s use the 720 Harvard sentences (see Wikipedia) included in the stringr package:

Counting the frequency of any particular word or character is pretty straightforward with str_count():
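As a small illustration, the sketch below counts matches in only the first two sentences of Table 9.2 (typed in by hand, so that the counts are easy to verify):

```r
library(stringr)

# First two sentences of Table 9.2 (entered manually for this sketch)
s <- c("The birch canoe slid on the smooth planks.",
       "Glue the sheet to the dark blue background.")

# Matches per sentence (case-sensitive, so "The" is not counted)
n_the <- str_count(s, "the")

# Ignoring case also counts the sentence-initial "The"
n_the_ci <- str_count(s, regex("the", ignore_case = TRUE))

# Counting a single character works in the same way
n_e <- str_count(s, "e")

# Totals across all sentences
sum(n_the)
sum(n_e)
```

Note that a plain pattern like "the" would also match inside longer words (e.g., "then"). Wrapping the pattern in word boundaries ("\\bthe\\b") restricts the count to whole words.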

However, what if we wanted to know the frequencies of all possible words or characters?

To address these tasks, the ds4psy package contains two practical helpers, count_words() and count_chars():

  • count_words(s) counts the frequency of each word in s.
  • count_chars(s) counts the frequency of each character in s.

Both functions are case-sensitive by default (unless case_sense = FALSE) and sort their results by frequency (unless sort_freq = FALSE). Applied to the 720 sentences, we get:

How can we solve these tasks with the stringr commands discussed in Section 9.4? And if we succeed, do we get the same results? (And if not, why not?)

Hint: These tasks can be solved by using regular expressions or by using the boundary() modifier of stringr functions.

  • Counting the frequency of all words in sentences:

We see that the counts differ slightly, depending on the method we use. The difference between the various counts can partly be explained by counting vs. not counting occurrences of “s” or “t” as words:
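The effect can be reproduced on a single sentence. This sketch contrasts a simple letter-based regular expression with the boundary("word") modifier. The pattern "[A-Za-z]+" is only one of many possible definitions of a word and is chosen here for illustration:

```r
library(stringr)

s <- "It's easy to tell the depth of a well."

# (a) A word as any run of letters: the apostrophe splits "It's" into "It" and "s"
w_regex <- unlist(str_extract_all(s, "[A-Za-z]+"))

# (b) Word boundaries: "It's" is kept as one word
w_bound <- unlist(str_extract_all(s, boundary("word")))

length(w_regex)
length(w_bound)

# Word frequencies (ignoring case), sorted by frequency
sort(table(tolower(w_bound)), decreasing = TRUE)
```

Here, method (a) yields one word more than method (b), because it counts the isolated "s" as a word of its own.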

  • Counting the frequency of all characters in sentences:

To address this task, we could use a fairly complicated stringr function:

If we were only interested in the overall number of characters, there are much simpler solutions:
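Both variants can be sketched in base R on a toy vector (the strings below are made up for illustration). Splitting at the empty string yields individual characters, which table() can then tabulate:

```r
# Toy input (hypothetical), so that all counts can be checked by hand
s <- c("abc a.", "b b")

# Frequency of each character (including spaces and punctuation)
chars <- unlist(strsplit(s, split = ""))
char_freq <- sort(table(chars), decreasing = TRUE)
char_freq

# Overall number of characters: no splitting needed
n_total <- sum(nchar(s))
n_total
```

With stringr, sum(str_length(s)) would provide the same overall count as sum(nchar(s)).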

However, both these counts differ substantially from the one obtained by count_chars():

Again, the discrepancies between different approaches are due to different interpretations of the task. By default, the count_chars() function removes a number of special characters (e.g., spaces, hyphens, parentheses, and punctuation characters) from the count. If these are not removed (by setting rm_specials = FALSE), we get the same overall count:
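To see how much such a decision can matter, the following base R sketch counts characters with and without spaces and punctuation. The character class used here is only an approximation chosen for illustration and need not match the exact set of characters that count_chars() treats as special:

```r
s <- c("It's easy.", "Rice is often served.")

# Counting every character
n_all <- sum(nchar(s))

# Removing whitespace and punctuation first
# (an assumed approximation of "special" characters)
s_clean <- gsub("[[:space:][:punct:]]", "", s)
n_clean <- sum(nchar(s_clean))

c(all = n_all, clean = n_clean)
```

Even in these two short sentences, about a quarter of all characters are spaces or punctuation marks.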

9.5.4 Quantifying terms

Although we tried to include some meaningful examples in Sections 9.3 and 9.4, using text functions and regular expressions in real applications typically involves longer and more complicated strings of text (e.g., paragraphs, collections of articles, or books).

To illustrate a typical workflow of detecting and extracting matches, we will slightly adapt the excellent example from 14.4.2 Extract matches (Wickham & Grolemund, 2017), which detects, counts, and extracts color names in strings of text (using the sentences included in stringr). The specific tasks addressed in this example are:

  • Detect, count, and obtain all sentences that contain a set of common color names.
  • Extract the color names contained in those sentences.
  • Show the sentences containing two or more of these color names.

A key step of all these tasks is constructing a regular expression (or regex, see Appendix E) that matches any of the color names we are interested in:
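A minimal sketch of this step collapses a vector of color names into one pattern of alternatives. The six names below are assumed (they follow the example in Wickham & Grolemund, 2017), and color_match_strict is a hypothetical variant that is not part of the original example:

```r
library(stringr)

# Assumed set of color names (as in Wickham & Grolemund, 2017)
colors <- c("red", "orange", "yellow", "green", "blue", "purple")

# Combine all names into a single regex of alternatives
color_match <- str_c(colors, collapse = "|")
color_match

# Hypothetical stricter variant: only match whole words
color_match_strict <- str_c("\\b(", color_match, ")\\b")

# The simple pattern also matches "red" inside "flickered"
str_detect("The light flickered.", color_match)
str_detect("The light flickered.", color_match_strict)
```

Whether partial matches like the one in "flickered" are acceptable depends on the purpose of the analysis. The simple pattern keeps the example close to its source.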

Equipped with this regex, we can easily detect, count, and obtain all sentences that match the pattern:

Thus, it appears that 69 sentences contain one of the color names in color_match. From these sentences, we can easily extract the colors found and count (or cross-tabulate) them:
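The steps of detecting, counting, obtaining, and extracting can be sketched on a small scale. The four sentences in s are made up for illustration (so the counts differ from those for the 720 sentences), and color_match is assumed to contain the six color names used in Wickham & Grolemund (2017):

```r
library(stringr)

# Hypothetical mini-corpus and an assumed color regex
s <- c("The sky is blue.",
       "Green leaves turn red and yellow.",
       "Rice is often served in round bowls.",
       "A purple kite flew over the green hill.")
color_match <- "red|orange|yellow|green|blue|purple"

# Detect and count sentences with at least one match
n_has_color <- sum(str_detect(s, color_match))

# Obtain the matching sentences
has_color <- str_subset(s, color_match)

# Extract the (first) color name from each of these sentences
col_found <- str_extract(has_color, color_match)

n_has_color
table(col_found)
```

As the pattern is case-sensitive, the capitalized "Green" in the second sentence is not matched.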

Note that the length of the vector col_found equals the number of sentences in has_color (i.e., both are 69). This could mean that every sentence found contains exactly one color name. However, if some sentences contain more than one color name, it could also mean that the str_extract() function only extracted the first occurrence of a color from each sentence in has_color into col_found. To find out which of these two options is the case, we can use str_count() to count the number of matches to color_match in each sentence and keep only the sentences with more than one match:
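This check can be sketched with a few made-up sentences (hypothetical data, with color_match assumed to contain six common color names):

```r
library(stringr)

# Hypothetical mini-corpus and an assumed color regex
s <- c("The sky is blue.",
       "Green leaves turn red and yellow.",
       "Rice is often served in round bowls.",
       "A purple kite flew over the green hill.")
color_match <- "red|orange|yellow|green|blue|purple"

# Number of color matches in each sentence
n_match <- str_count(s, color_match)
n_match

# Sentences with more than one match (logical indexing)
multi_color <- s[n_match > 1]
multi_color

# Total number of matches vs. number of sentences with any match
sum(n_match)
sum(n_match > 0)
```

If the total number of matches exceeds the number of matching sentences, at least one sentence must contain several color names.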

Thus, some sentences contain more than one of the color names in color_match. The fact that there exist sentences with more than one match implies that str_extract() only extracted the first match of a color from each sentence in has_color. We can double check this by counting all colors in sentences:

Another indication that str_extract() only extracted first matches is provided by the existence of a str_extract_all() variant of the function. As mentioned in Section 9.4, applying this version in the context of our current tasks yields a list (here: all_col_found, rather than the 1-dimensional vector col_found above) in which some elements contain more than one color name:

The list all_col_found can be transformed into a vector (by applying unlist() to it), which then allows counting all the matching color names found:
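On a small scale, extracting all matches and flattening the resulting list could look as follows (the sentences in has_color are made up for illustration, and color_match is an assumed pattern of six color names):

```r
library(stringr)

# Hypothetical sentences that each contain at least one color name
has_color <- c("The sky is blue.",
               "Green leaves turn red and yellow.",
               "A purple kite flew over the green hill.")
color_match <- "red|orange|yellow|green|blue|purple"

# A list with one character vector per sentence
all_col_found <- str_extract_all(has_color, color_match)
lengths(all_col_found)

# Flatten the list into one vector and count each color name
all_cols <- unlist(all_col_found)
table(all_cols)
```

The function lengths() shows how many matches each list element contains, whereas length() would only return the number of sentences.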

Alternatively, setting simplify = TRUE when using str_extract_all() would return a 2-dimensional matrix in which all rows are expanded to the number of columns of the maximum number of matches:
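A brief sketch of the matrix format, again using made-up sentences and an assumed pattern of six color names:

```r
library(stringr)

# Hypothetical sentences and an assumed color regex
has_color <- c("The sky is blue.",
               "Green leaves turn red and yellow.",
               "A purple kite flew over the green hill.")
color_match <- "red|orange|yellow|green|blue|purple"

# One row per sentence, one column per match position
col_mx <- str_extract_all(has_color, color_match, simplify = TRUE)
col_mx
dim(col_mx)

# Rows with fewer matches are padded with empty strings
col_mx[1, ]
```

The empty strings used for padding should be removed (or converted to NA) before counting the entries of such a matrix.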

When quantifying text for scientific purposes (e.g., for doing sentiment analysis or latent semantic analysis, LSA), having the option of working with a variety of output formats is a boon, rather than a burden.

9.5.5 Plotting text

Our final example of applying string functions combines text and graphics. Internally, the plot_text() function of ds4psy first reads in a text file (or uses the base R function scan() to accept text input from the Console) and maps it into a tibble with columns for x- and y-coordinates (via the read_ascii() function). It then uses ggplot to create a tile plot of the entered text, using character frequency as a proxy for the background color of each tile (with darker colors indicating more frequent characters):

Figure 9.2: Plotting some text with plot_text().
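The mapping step described above can be approximated in a few lines of base R. The following is a simplified sketch under our own assumptions (two made-up lines of text, and a data frame tb with hypothetical columns x, y, char, and freq), not the actual implementation of read_ascii() or plot_text():

```r
# Hypothetical input: two lines of text
txt <- c("abba", "cab")

# Split each line into its characters
chars <- strsplit(txt, split = "")

# One row per character: x = position within a line, y = line (first line on top)
tb <- data.frame(
  x    = unlist(lapply(chars, seq_along)),
  y    = rep(rev(seq_along(chars)), lengths(chars)),
  char = unlist(chars),
  stringsAsFactors = FALSE
)

# Attach the overall frequency of each character
freq <- table(tb$char)
tb$freq <- as.integer(freq[tb$char])

tb
```

A table in this format could then be passed to ggplot2, for instance by mapping x and y to tile positions and freq to the fill color of geom_tile().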

This is not particularly creative, of course, but play a while with plot_text() and its parameters and you will see some pleasing variations:

Figure 9.3: Playing with plot_text() parameters.

In R, the boundaries between plain text and graphics are easily blurred, once we start thinking about word search problems, crossword puzzles, or word clouds. Thinking about new ways of visual expression will also change the ways you see characters and texts.

Practice

  1. Quantifying more color terms:
  • Generalize the example from Section 9.5.4 to all color names that are pre-defined in base R.

Hint: The function colors() shows all 657 predefined names of colors available in base R (see Section D.2.1 of Appendix D).

To generalize the analysis, we only need to change the first line of code (i.e., the definition of colors):

The regex for color_match now includes 657 color names, but all other code can be recycled from above.
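A possible sketch of this change, using only base R (the test sentence is made up for illustration):

```r
# All predefined color names of base R
# (re-using the name `colors` still allows calling the function colors())
colors <- colors()
length(colors)

# Combine all names into one long regex of alternatives
color_match <- paste(colors, collapse = "|")

# Quick check on a made-up sentence
grepl(color_match, "They painted the door steelblue.")
```

As many of these names are contained in others (e.g., "blue" in "blue1" or "steelblue"), which name gets extracted from a sentence can depend on the order of the alternatives and on the regex engine used, so the extracted names deserve a closer look.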

  2. Finding and extracting sentences containing common number words:
  • Adapt the example from Section 9.5.4 to extract and count sentences containing number words (i.e., the 10 count words “one”, “two”, …, “ten”).

References

Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz