This chapter so far has shown that working with text is challenging, but can also be rewarding and inspiring. Blurring the boundaries between text and other data types is one of the most creative parts of data science. The following examples sketch the scientific or artistic potential of such transgressions and will hopefully stimulate your imagination.
9.5.1 Transl33ting text
So-called leet slang (aka. l33t, 1337, eleet, or leetspeak) is a system of modified spellings used primarily by online gaming or hacker communities (see the Wikipedia article leet).
The transl33t() function of ds4psy translates text into some variety of leet:
test <- "This is a simple test of leet slang."  # data
transl33t(test)
#> [1] "7h15 15 4 51mpl3 +35+ 0f l33+ 5l4ng."
paste(transl33t(letters), collapse = " ")
#> [1] "4 b c d 3 f g h 1 j k l m n 0 p q R 5 + u v w x y z"
transl33t(s)
#> [1] "7h3 c4+ 54+ 0n +h3 m4+."          "7h3 m4d h4++3R h4d h3R, 50 wh4+?"
#> [3] "7h3 f4+ d4d w45 50 54d."
With a little bit of practice, it actually becomes quite easy to read text in leet.
To experience this for yourself, read the following leet versions of some
sentences from stringr and observe how you get faster when reading them aloud:
| Sentence | In leet |
|---|---|
| The birch canoe slid on the smooth planks. | 7h3 b1Rch c4n03 5l1d 0n +h3 5m00+h pl4nk5. |
| Glue the sheet to the dark blue background. | Glu3 +h3 5h33+ +0 +h3 d4Rk blu3 b4ckgR0und. |
| It’s easy to tell the depth of a well. | 1+’5 345y +0 +3ll +h3 d3p+h 0f 4 w3ll. |
| These days a chicken leg is a rare dish. | 7h353 d4y5 4 ch1ck3n l3g 15 4 R4R3 d15h. |
| Rice is often served in round bowls. | R1c3 15 0f+3n 53Rv3d 1n R0und b0wl5. |
| The juice of lemons makes fine punch. | 7h3 ju1c3 0f l3m0n5 m4k35 f1n3 punch. |
The rules used by
transl33t() for translating individual characters are specified in a named vector.
By re-defining this vector, we can change the degree and the details of the conversion:
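The following is a minimal sketch of this idea, not the ds4psy implementation. It assumes a hypothetical rule vector my_rules and a hypothetical helper to_leet(), and relies on the fact that str_replace_all() of stringr accepts a named vector whose names are the patterns and whose values are the replacements:

```r
library(stringr)

# Hypothetical rules (names = characters to replace, values = replacements):
my_rules <- c("a" = "4", "e" = "3", "i" = "1", "o" = "0", "s" = "5", "t" = "+")

# Hypothetical helper: apply all rules (one after the other) to a character vector
to_leet <- function(txt, rules = my_rules) {
  str_replace_all(txt, rules)
}

test <- "This is a simple test of leet slang."
to_leet(test)

# Re-defining the rules changes the degree of the conversion:
to_leet(test, rules = c("e" = "3", "o" = "0"))
```

As the hypothetical rules only cover lowercase letters, uppercase characters (like the initial "T") remain unchanged in this sketch.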
Given these skills, you are very close to creating your own
transl33t() function (see Chapter 11: Functions).
We will re-visit character replacements in an exercise on naive cryptography below.
9.5.2 Detecting text in tibbles
In applied contexts, the character variable to be analyzed often does not come as an isolated vector of strings, but as a column in a larger table (or tibble). Fortunately, the commands discussed in this chapter can be used in combination with the functions discussed in the rest of this book.
For instance, we can use the stringr function
str_detect() as part of a
filter() command in a dplyr pipe.
As an example, the following pipe uses the tibble
data_t1 from ds4psy and selects those participants whose family
name (as indicated by their 2nd initial) starts with the letter “M”, “N”, or “O”:
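A pipe of this kind could look as follows. This is only a sketch: it uses a small hypothetical tibble participants instead of data_t1, and the column name initials and its format (e.g., "A.M.") are assumptions that may differ from the actual data:

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the participant data (column name and format assumed):
participants <- tibble(initials = c("A.M.", "M.B.", "J.O.", "N.N.", "K.Z."),
                       score    = c(3, 5, 2, 4, 1))

# Keep rows whose 2nd initial is M, N, or O:
# "^."    = any first character (1st initial)
# "\\."   = a literal dot
# "[MNO]" = one of the letters M, N, or O (2nd initial)
m_to_o <- participants %>%
  filter(str_detect(initials, "^.\\.[MNO]"))

m_to_o
```

Note that "M.B." is not selected here, as its "M" is the first rather than the second initial.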
9.5.3 Counting word or character frequency
Imagine we wanted to count the number of times a word or character appears in some article or book.
As a dataset of non-trivial complexity, let’s use the 720 Harvard
sentences (see Wikipedia) included in the stringr package:
Counting the frequency of any particular word or character is fairly straightforward with the stringr functions discussed in this chapter.
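For instance, a single word or character could be counted with str_count(). This is a sketch of one possible approach (the object name s is chosen here for illustration), and the exact counts depend on how the pattern is defined:

```r
library(stringr)

s <- stringr::sentences  # the Harvard sentences included in stringr

n_the_sub  <- sum(str_count(s, "the"))        # "the" as a substring (also in "other")
n_the_word <- sum(str_count(s, "\\bthe\\b"))  # "the" as a separate lowercase word
n_e        <- sum(str_count(s, "e"))          # the character "e"

c(n_the_sub, n_the_word, n_e)
```

The first two counts differ whenever "the" occurs inside longer words, which shows that even simple counting tasks require a precise definition of what counts as a match.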
However, what if we wanted to know the frequencies of all possible words or characters?
To address these tasks, the ds4psy package contains two practical helpers:
- count_words(s) counts the frequency of each word in s.
- count_chars(s) counts the frequency of each character in s.
Both functions are case-sensitive by default (unless
case_sense = FALSE) and sort their results by frequency (unless
sort_freq = FALSE). Applied to the 720
sentences, we get:
library(ds4psy)

# Frequency of words and characters:
count_words(sentences)[1:10]  # top 10 words
#> w
#>  the  The   of    a   to  and   in   is    A  was
#>  489  262  132  130  119  118   85   81   72   66
count_chars(sentences)[1:10]  # top 10 characters
#> char_s
#>    e    t    a    h    o    s    r    n    i    l
#> 3054 2026 1649 1620 1552 1535 1350 1210 1191  989

# Total sums:
sum(count_words(sentences))  # number of words
#> [1] 5760
sum(count_chars(sentences))  # number of characters
#> [1] 22570
How can we solve these tasks with the stringr commands discussed in Section 9.4? And if we succeed, do we get the same results? (And if not, why not?)
Hint: These tasks can be solved by using regular expressions or by using the
boundary() modifier of stringr functions.
- Counting the frequency of all words in sentences:
We see that the counts differ slightly, depending on the method we use. The difference between the various counts can partly be explained by counting vs. not counting occurrences of “s” or “t” as words:
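Where such differences come from can be illustrated on a small scale. The following sketch uses two example strings (not the original analysis) and compares the boundary() modifier with a simple regular expression that extracts runs of letters. With the regex, the "s" of "It's" is counted as a separate word:

```r
library(stringr)

s <- c("It's easy to tell the depth of a well.", "The cat sat on the mat.")

# (a) Using word boundaries:
words_b <- unlist(str_extract_all(s, boundary("word")))

# (b) Using a regex for sequences of letters:
words_r <- unlist(str_extract_all(s, "[[:alpha:]]+"))

length(words_b)
length(words_r)

# Frequency of each word (case-sensitive):
sort(table(words_r), decreasing = TRUE)
```

Comparing words_b and words_r element by element shows which strings are treated differently by the two methods.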
- Counting the frequency of all characters in sentences:
To address this task, we could use a fairly complicated stringr function:
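One possible solution is sketched here on a short example vector (the object names are chosen for illustration): each string is split into single characters, and the resulting characters are then tabulated:

```r
library(stringr)

s <- c("The cat sat on the mat.", "The fat dad was so sad.")

chars <- unlist(str_split(s, ""))                    # split into single characters
char_freq <- sort(table(chars), decreasing = TRUE)   # count and sort by frequency

char_freq
```

Note that this count includes spaces and punctuation characters, and distinguishes between "T" and "t".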
If we were only interested in the overall number of characters, there are much simpler solutions:
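Two such solutions, sketched on the same kind of short example vector, use str_length() of stringr or nchar() of base R:

```r
library(stringr)

s <- c("The cat sat on the mat.", "The fat dad was so sad.")

n_stringr <- sum(str_length(s))  # stringr
n_base    <- sum(nchar(s))       # base R

c(n_stringr, n_base)
```

Both functions count every character of a string, including spaces and punctuation characters.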
However, both these counts differ substantially from the one obtained by count_chars() above.
Again, the discrepancies between different approaches are due to different interpretations of the task.
By default, the
count_chars() function removes a number of special characters (e.g., spaces, hyphens, parentheses, and punctuation characters) from the count. If these are not removed (by setting
rm_specials = FALSE), we get the same overall count:
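The role of special characters can also be illustrated with stringr alone. The following sketch uses a short example vector and treats every character that is not a letter as special, which only approximates the set of characters that count_chars() removes:

```r
library(stringr)

s <- c("The cat sat on the mat.", "The fat dad was so sad.")

n_all     <- sum(str_length(s))                  # all characters
n_letters <- sum(str_count(s, "[[:alpha:]]"))    # letters only
n_other   <- sum(str_count(s, "[^[:alpha:]]"))   # spaces, punctuation, etc.

c(n_all, n_letters, n_other)
```

As the letters and the other characters add up to the overall number of characters, the choice of which characters to include fully accounts for the difference in this example.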
9.5.4 Quantifying terms
Although we tried to include some meaningful examples in Sections 9.3 and 9.4, using text functions and regular expressions in real applications typically involves longer and more complicated strings of text (e.g., paragraphs, collections of articles, or books).
To illustrate a typical workflow of detecting and extracting matches, we will slightly adapt the excellent example from 14.4.2 Extract matches (Wickham & Grolemund, 2017), which detects, counts, and extracts color names in strings of text (using the
sentences included in stringr). The specific tasks addressed in this example are:
- Detect, count, and obtain all sentences that contain a set of common color names.
- Extract the color names contained in those sentences.
- Show the sentences containing two or more of these color names.
A key step of all these tasks is constructing a regular expression (or regex, see Appendix E) that matches any of the color names we are interested in:
# Create a regex (i.e., the pattern to search for):
colors <- c("red", "green", "blue", "black", "gray", "grey", "white",
            "orange", "yellow", "pink", "purple")

## Generalize example to ALL color names used in base R:
# colors <- colors()

color_match <- str_c(colors, collapse = "|")
color_match
#> [1] "red|green|blue|black|gray|grey|white|orange|yellow|pink|purple"
Equipped with this regex, we can easily detect, count, and obtain all
sentences that match the pattern:
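These steps could be sketched as follows. The name has_color is taken from the text below, whereas the exact commands are an assumption about how the steps could be implemented with stringr:

```r
library(stringr)

colors <- c("red", "green", "blue", "black", "gray", "grey", "white",
            "orange", "yellow", "pink", "purple")
color_match <- str_c(colors, collapse = "|")

# Detect and count matching sentences:
sum(str_detect(sentences, color_match))

# Obtain matching sentences:
has_color <- str_subset(sentences, color_match)
length(has_color)  # same count as above
head(has_color)
```

As color_match is not restricted to entire words, it also matches color names that appear as parts of longer words.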
Thus, it appears that 69 sentences contain at least one of the color names in colors.
From these sentences, we can easily extract the colors found and count (or cross-tabulate) them:
# Extract matching strings:
col_found <- str_extract(has_color, color_match)
length(col_found)  # Note: Same as length(has_color) above
#> [1] 69

# Count colors found:
table(col_found)  # base R
#> col_found
#>  black   blue   gray  green orange   pink purple    red  white yellow
#>      5      8      1      8      1      2      1     36      5      2
# tibble(col_found) %>% group_by(col_found) %>% count()  # dplyr
Note that the length of the vector
col_found equals the number of sentences in
has_color (i.e., both are 69).
This could mean that every sentence found contains exactly one color name.
However, if some sentences contain more than one color name, it could also mean that the
str_extract() function only extracted the first occurrence of a color from each sentence in has_color.
To find out which of these two options is the case, we can use
str_count() to count the number of matches to
color_match and filter
sentences by the logical vector of sentences with more than one match:
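A sketch of this step is shown below. It follows the same idiom as the number word example at the end of this section, and the object names n_colors and more_colors are chosen here for illustration:

```r
library(stringr)

color_match <- "red|green|blue|black|gray|grey|white|orange|yellow|pink|purple"

n_colors <- str_count(sentences, color_match)  # number of matches per sentence
more_colors <- sentences[n_colors > 1]         # sentences with 2 or more matches

more_colors
sum(n_colors)  # total number of matches in all sentences
```

If every sentence contained at most one color name, more_colors would be empty and sum(n_colors) would equal the number of matching sentences.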
Thus, some sentences contain more than one of the color names in colors.
The fact that there exist sentences with more than one match implies that
str_extract() only extracted the first match of a color from each sentence in
has_color. We can double-check this by counting all color matches:
Another indication that
str_extract() only extracted first matches is provided by the existence of a
str_extract_all() variant of the function. As mentioned in Section 9.4, applying this version in the context of our current tasks yields a list (here:
all_col_found, rather than the 1-dimensional vector
col_found above) in which some elements contain more than one color name:
all_col_found can be transformed into a vector (by applying
unlist() to it), which then allows counting all the matching color names found:
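The difference between the list and the vector can be sketched with three example strings (rather than the sentences of stringr):

```r
library(stringr)

color_match <- "red|green|blue|black|gray|grey|white|orange|yellow|pink|purple"

s <- c("The sky is blue.", "A red and green flag.", "A pink and white and red rose.")

all_found <- str_extract_all(s, color_match)  # a list (with 1 element per string)
lengths(all_found)                            # number of matches per string

all_vec <- unlist(all_found)  # a vector of all matches
table(all_vec)                # count all matches
```

Here, the list all_found contains three elements, whereas the vector all_vec contains all six matches.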
Setting simplify = TRUE when using
str_extract_all() returns a 2-dimensional matrix in which each row is expanded to the maximum number of matches:
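For three example strings (chosen here for illustration) that contain one, two, and three color names, this yields a matrix with three rows and three columns, in which missing matches are filled with empty strings:

```r
library(stringr)

color_match <- "red|green|blue|black|gray|grey|white|orange|yellow|pink|purple"

s <- c("The sky is blue.", "A red and green flag.", "A pink and white and red rose.")

col_matrix <- str_extract_all(s, color_match, simplify = TRUE)

dim(col_matrix)  # rows = strings, columns = maximum number of matches
col_matrix
```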
When quantifying text for scientific purposes (e.g., for doing sentiment analysis or LSA), having the option of working with a variety of output formats is a boon, rather than a burden.
9.5.5 Plotting text
Our final example of applying string functions combines text and graphics.
The plot_text() function of ds4psy first reads in a text file (or uses the base R function
scan() to accept text input from the Console) and maps it into a tibble with columns for x- and y-coordinates.
It then uses ggplot to create a tile plot of the entered text, using character frequency as a proxy for the background color of each tile (with darker colors indicating more frequent characters):
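The principle of mapping text to coordinates can be sketched in base R. This is not the implementation used by plot_text(), and all object names are chosen for illustration: each character is assigned an x-coordinate (its position within a line), a y-coordinate (its line number), and its overall frequency:

```r
txt <- c("Hello world!", "This is a test.")  # 2 lines of text

chars <- strsplit(txt, "")  # a list of characters (per line)

tb <- data.frame(
  char = unlist(chars),
  x    = unlist(lapply(chars, seq_along)),        # position within line
  y    = rep(-seq_along(chars), lengths(chars)),  # line number (top line: -1)
  stringsAsFactors = FALSE
)

# Add the frequency of each character (in the entire text):
freq <- table(tb$char)
tb$freq <- as.integer(freq[tb$char])

head(tb)
```

A table of this form could then be plotted with ggplot2 (e.g., by using geom_tile() and mapping freq to the fill color of each tile).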
This is not particularly creative, of course, but play a while with
plot_text() and its parameters and you will see some pleasing variations:
# Plot text (from file):
cat("Hello world!", "This is just a test.",
    "Can you read this text?", "If so, this is good!",
    "Try using plot_text()", " for plotting text.",
    "Does this work?", "If so, this is good.",
    "Try some examples", " and then carry on...",
    file = "test.txt", sep = "\n")

plot_text("test.txt", lbl_rotate = TRUE,
          pal = unikn::pal_bordeaux[2:5], col_lbl = "white")

# Clean up file:
unlink("test.txt")
In R, the boundaries between plain text and graphics are easily blurred, once we start thinking about word search problems, crossword puzzles, or word clouds. Thinking about new ways of visual expression will also change the ways you see characters and texts.
- Quantifying more color terms:
- Generalize the example from Section 9.5.4 to all color names that are pre-defined in base R.
To generalize the analysis, we only need to change the first line of code (i.e., the definition of colors).
The regex for
color_match now includes 657 color names, but all other code can be recycled from above.
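A sketch of the changed definitions is shown below (calling colors() via grDevices:: is a choice made here to be explicit about where the function comes from):

```r
library(stringr)

colors <- grDevices::colors()                 # all color names pre-defined in base R
color_match <- str_c(colors, collapse = "|")  # regex with 1 alternative per name

length(colors)  # number of color names
str_detect("The sky turned a deep blue.", color_match)
```

As the alternatives of a regex are tried in order, it may be worth checking which of several overlapping names (e.g., "gray" vs. "gray1") is actually extracted.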
- Finding and extracting sentences containing common number words:
- Adapt the example from Section 9.5.4 to extract and count sentences containing number words (i.e., the 10 count words “one”, “two”, …, “ten”).
sentences <- stringr::sentences

# Create a regex (i.e., the pattern to search for):
numbers <- c("one", "two", "three", "four", "five",
             "six", "seven", "eight", "nine", "ten")
n_match <- str_c(numbers, collapse = "|")
n_match
#> [1] "one|two|three|four|five|six|seven|eight|nine|ten"

# Get matching strings:
has_number <- str_subset(sentences, n_match)
length(has_number)
#> [1] 67

# Extract matching strings:
n_matches <- str_extract(has_number, n_match)
table(n_matches)
#> n_matches
#> eight  five  four   one seven   six   ten three   two
#>     2     1     1    27     2     3    20     5     6

# Multiple matches:
mult_n_match <- sentences[str_count(sentences, n_match) > 1]
str_view_all(mult_n_match, n_match)
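Note that n_match is not restricted to entire words, so that it also matches number words inside of other words (e.g., the "ten" in "often"). A possible refinement, sketched here with a hypothetical pattern n_match_b and two example strings, wraps the alternatives in word boundaries:

```r
library(stringr)

numbers <- c("one", "two", "three", "four", "five",
             "six", "seven", "eight", "nine", "ten")

n_match   <- str_c(numbers, collapse = "|")   # also matches parts of words
n_match_b <- str_c("\\b(", n_match, ")\\b")   # only matches entire words

s <- c("Rice is often served in round bowls.", "The shop sold ten pens.")

str_count(s, n_match)    # counts "ten" in both strings
str_count(s, n_match_b)  # counts "ten" only in the 2nd string
```

Whether such matches should be counted depends on the task. As both patterns are case-sensitive, capitalized number words (e.g., "Two") are not matched by either of them.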
Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz