9.7 Exercises

Here are some exercises on manipulating strings of text with base R, regular expressions, and stringr commands.

9.7.1 Exercise 1

Escaping into Unicode

Use your knowledge of how basic text strings and special characters are represented in R (and consult a list of Unicode symbols, e.g., Wikipedia: Unicode characters) to define two R strings (ideally from different languages) that each contain at least two special symbols, like:

Der Käsereichtum Österreichs ist ungewöhnlich groß.
LaTeX commands begin with a backslash “\”. For instance, the LaTeX command for emphasis is “\emph{}”.

Hint: See Section 9.2.2 and the R help page ?"'" for general information on strings in R.
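The two escape mechanisms that this exercise relies on can be sketched as follows. The strings cafe and path are hypothetical illustrations, not the strings to be defined above:

```r
# Unicode escape: \u followed by the hexadecimal code point
cafe <- "caf\u00e9"     # U+00E9 is "e" with an acute accent
nchar(cafe)             # 4 characters
utf8ToInt("\u00e9")     # 233, the decimal value of code point U+00E9

# A literal backslash must itself be escaped in an R string:
path <- "C:\\temp"
nchar(path)             # 7 characters: the doubled backslash counts once
writeLines(path)        # writeLines() shows the string without escape notation
```

Printing a string with print() shows its escaped form, whereas writeLines() shows the characters a reader would see.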

9.7.2 Exercise 2

Pasting vectors

Suppose you wanted to create names for 50 image files. The 1st of them should be called “img_1.png”, and the last should be called “img_50.png”.

Can you create all 50 file names in 1 R command?

Hint: Yes, you can — use paste() or paste0() in combination with a numeric vector.

These file names do not sort in numeric order, because the first 9 names (up to “img_9.png”) are shorter than the others (from “img_10.png” onwards).

Can you make all file names the same length?

Hint: One solution could be to insert a “0” to make the first 9 names “img_01.png” to “img_09.png”.
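As a minimal sketch of both hints, here is how the two approaches behave on a small hypothetical range of numbers (ids) that crosses the boundary between 1-digit and 2-digit values. Using sprintf() for the padding is one option among several:

```r
ids <- 8:11   # hypothetical range (not the full 1:50)

# paste0() is vectorized: one call yields one name per element of ids
plain <- paste0("img_", ids, ".png")
plain
nchar(plain)    # 9 9 10 10: the lengths differ

# The format "%02d" pads integers with a leading zero to a width of 2
padded <- sprintf("img_%02d.png", ids)
padded
nchar(padded)   # 10 10 10 10: all names have the same length
```

Names of equal length sort alphabetically in the same order as their numbers.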

9.7.3 Exercise 3

This exercise requires using regular expressions. (See Appendix E for a primer on using regular expressions.)

Matching countries

The character vector countries included in ds4psy contains the names of 197 countries of the world:

countries <- ds4psy::countries  # data
length(countries)  # 197

Use the names of countries to answer the following questions:

Find all countries with “ee”, “ll”, or “oro”.
Which countries have names that contain the word “and” but not “land”?
Which countries have names that contain the letters “z” or “Z”?
Which countries have names that are 13 letters long?
Which names of countries contain punctuation characters?
Which names of countries contain exactly 1 or more than 2 spaces?
Which countries have names starting with a cardinal direction (i.e., North, East, South, West)?
Which countries have names ending in “land” vs. containing “land” without ending in it?
Which countries have names with a repeated letter?
Which countries have names containing the same letter more than 3 times?
Which countries have names containing 3 or more capital letters?
Which countries have names containing the same capital letter twice?

Hint: Most of these tasks can be solved in many different ways.
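A few of the regular expression building blocks that are useful here can be sketched on a small toy vector. The vector toy is a hypothetical stand-in for countries, and base R's grep() is only one of several possible tools:

```r
# Hypothetical toy vector (not the full countries data):
toy <- c("Greenland", "Sri Lanka", "New Zealand", "Togo")

grep("ee|oo", toy, value = TRUE)    # alternation: contains "ee" or "oo"
grep("land$", toy, value = TRUE)    # anchor: ends in "land"
grep("[zZ]", toy, value = TRUE)     # character class: contains "z" or "Z"
grep("(.)\\1", toy, value = TRUE)   # backreference: same character twice in a row
toy[nchar(toy) == 9]                # exact length (spaces count as characters)
```

The same patterns should also work with stringr functions such as str_detect() or str_subset().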

9.7.4 Exercise 4

This is another exercise requiring regular expressions (see Appendix E).

Quantifying and removing white space

Counting spaces:

In Section 9.5.3, we have seen how we can use the count_chars() function (of ds4psy) to determine the frequency of characters in sentences:

sts <- tolower(sentences)  # data
tb <- count_chars(sts, rm_specials = FALSE)
tb
#> chars
#>             e      t      a      h      s      o      r      n      i      l 
#>   5021   3061   2354   1734   1660   1584   1561   1357   1222   1208   1000 
#>      d      .      c      w      f      u      p      g      m      b      k 
#>    949    724    605    597    563    527    492    425    401    370    327 
#>      y      v      j      ,      z      x      q      '      ? \u0092      - 
#>    299    139     40     31     30     28     18     15      6      3      2 
#>      !      & 
#>      1      1
tb[1]/sum(tb)
#>           
#> 0.1770764

This shows that the most frequent character in sentences is " ", which accounts for 17.7 percent of all characters.

Use stringr commands to quantify the percentage of spaces in sentences.
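To cross-check a stringr result, the same quantity can be computed with base R. The following is a minimal sketch on a hypothetical two-element vector toy (a stand-in for sts), not a stringr solution:

```r
toy <- c("the cat sat", "on a mat")   # hypothetical stand-in for sts

# gregexpr() locates every match, regmatches() extracts the matches,
# and lengths() counts them per string:
hits <- regmatches(toy, gregexpr(" ", toy, fixed = TRUE))
n_spaces <- sum(lengths(hits))   # 4
n_chars  <- sum(nchar(toy))      # 19
n_spaces / n_chars               # proportion of spaces
```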

Mimicking str_squish():

Assuming a simple vector xs <- c(" A B C D "), the stringr function str_squish(xs) removes repeated spaces and any leading and trailing spaces.

Achieve the result of str_squish(xs) with regular expressions.

9.7.5 Exercise 5

Parts of this exercise benefit from using regular expressions (see Appendix E), but it is possible to solve most without them as well.

Searching color names

The function colors() (from the R core package grDevices) returns the 657 valid color names in R.

Define the following 10 strings as a character vector color_candidates and use a base R function to check which of them are actual color names in R.

#>  [1] "blanchedalmond" "honeydew"       "hotpink3"       "palevioletred1"
#>  [5] "royalpink"      "sadblue"        "saddlebrown2"   "snowwhite"     
#>  [9] "tan4"           "yello3"

Hint: Half of these names are actual R color names, whereas the others are not. Take a guess which are which prior to checking it! Also, prefer simple solutions over more complex ones.
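One simple way to test set membership in base R is the %in% operator. A minimal sketch with two hypothetical candidates (not the ten strings above):

```r
candidates <- c("red", "notacolor")   # hypothetical candidates

candidates %in% colors()              # TRUE FALSE: one logical value per candidate
candidates[candidates %in% colors()]  # keep only the valid color names
```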

How many of the 657 valid color names begin with either gray or grey?
(Try solving this twice: Once by using only base R functions and once with functions from the stringr package.)

Hint: We can either add up all colors starting with gray and grey (as its first four characters) or specify a regular expression that searches for both at once (i.e., (a|e)), but requires that hits start with the pattern.
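Both options from the hint can be sketched on a hypothetical toy vector (rather than on colors() itself):

```r
# Hypothetical toy vector:
toy <- c("gray", "grey10", "lightgray", "green", "slategrey4")

# Option 1: add up two prefix checks (startsWith() needs no regular expression)
sum(startsWith(toy, "gray")) + sum(startsWith(toy, "grey"))   # 2

# Option 2: one anchored regular expression; ^ ties the match to the start
sum(grepl("^gr(a|e)y", toy))                                  # 2
```

With stringr, the same anchored pattern could be passed to str_detect().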

How many of the 657 valid color names contain gray or grey?
Which of the 657 valid color names contain gray or grey, but neither begin nor end with gray or grey?
Hint: We could solve this by first computing 3 sets of color names and then using setdiff() on them. Or we use a regular expression that requires characters before and after gray or grey.
Which of the 657 valid color names begin and end with a vowel?
Which color names in colors() contain the character sequence “po”, “pp”, or “oo”?
Which color name in colors() contains the character “e” four times?
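The two strategies for finding names that contain gray or grey only in their middle (sets combined with setdiff() vs. a single regular expression) can be sketched on a hypothetical toy vector. They agree on this toy data, but it is worth checking that they also agree on colors():

```r
# Hypothetical toy vector:
toy <- c("gray", "grey10", "lightgray", "lightgrey3", "darkslategray4", "green")

# Strategy 1: three sets, then remove the unwanted ones
both   <- grep("gr(a|e)y",  toy, value = TRUE)   # contains gray or grey
begins <- grep("^gr(a|e)y", toy, value = TRUE)   # begins with gray or grey
ends   <- grep("gr(a|e)y$", toy, value = TRUE)   # ends with gray or grey
setdiff(both, union(begins, ends))

# Strategy 2: require at least one character before and after the match
grep(".+gr(a|e)y.+", toy, value = TRUE)
```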

9.7.6 Exercise 6

Detecting patterns in pi

The mathematical constant \(\pi\) denotes the ratio of a circle’s circumference to its diameter and is one of the most famous numbers (see Wikipedia on pi for details). In R, pi is a built-in constant that evaluates to 3.1415927. This is an approximate value, of course. Being an irrational number, the decimal representation of \(\pi\) contains an infinite number of digits and never settles into a permanently repeating pattern.

In this exercise, we are trying to find patterns in \(\pi\) by treating it as a sequence of (text) symbols. To this end, the ds4psy package contains a character object pi_100k that provides the first 100,000 digits of \(\pi\):

pi_char <- ds4psy::pi_100k

# Check object:
typeof(pi_char)         # type?
#> [1] "character"
nchar(pi_char)          # number of characters?
#> [1] 100001
substr(pi_char, 1, 10)  # first 10 characters
#> [1] "3.14159265"

Does the sequence “1234” occur within the first 100,000 digits of pi? (Use an R command that answers this question by yielding TRUE or FALSE.)
At which location does the sequence “1234” occur within the first 100,000 digits of pi?
How often and at which locations does the sequence “1234” occur within the first 100,000 digits of pi?
Locate and extract all occurrences of "2_4_6_8" out of the first 100,000 digits of pi (where _ could match an arbitrary digit).
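The base R functions for detecting, locating, and extracting matches can be sketched on a short hypothetical string digits (a stand-in for pi_char). Note that the reported positions count every character of the string, including the leading "3.":

```r
digits <- "3.14159265358979"   # short hypothetical stand-in for pi_char

grepl("59", digits, fixed = TRUE)                      # TRUE: the sequence occurs
as.integer(regexpr("59", digits, fixed = TRUE))        # 6: start of the first match
as.integer(gregexpr("5", digits, fixed = TRUE)[[1]])   # 6 10 12: starts of all matches

# [0-9] matches any single digit, so this finds "5", any digit, "9":
m <- gregexpr("5[0-9]9", digits)
regmatches(digits, m)[[1]]                             # "589"
```

The stringr functions str_detect(), str_locate_all(), and str_extract_all() cover the same three steps. Both gregexpr() and the stringr functions report non-overlapping matches only.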

9.7.7 Exercise 7

In Section 9.5.1, we used the chartr() or str_replace_all() functions to translate text into leet slang. This exercise takes this a step further by first encrypting a text and then decrypting it again.

Naive cryptography

Kids often invent “secret codes” to hide written messages from the curious eyes of others. A very simple type of “encryption” consists in systematically replacing each occurrence of a number of characters by a different character.

Create the following strings:

txt: a string of text (for testing purposes)
org: a string of characters to be replaced
new: a string of characters used to replace the corresponding ones in org

Encryption: Use the chartr() function to encrypt txt by replacing each character of org with the corresponding character in new.
Decryption: Use the chartr() function to decrypt the result of the encryption step and recover the original txt.
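A minimal sketch of such a round trip, using deliberately tiny hypothetical values for txt, org, and new:

```r
txt <- "a cab"   # hypothetical test string
org <- "abc"     # characters to be replaced
new <- "xyz"     # their replacements (same number of characters as org)

enc <- chartr(old = org, new = new, x = txt)   # "x zxy"
dec <- chartr(old = new, new = org, x = enc)   # "a cab"
identical(dec, txt)                            # TRUE
```

Swapping the roles of org and new only recovers the original text if the mapping is one-to-one and the characters of new do not also occur unchanged in txt. Otherwise, decryption cannot tell original characters from replaced ones.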

9.7.8 Exercise 8

Known unknowns

According to this Wikipedia article on known knowns, Donald Rumsfeld (then the United States Secretary of Defense) famously stated at a U.S. Department of Defense news briefing on February 12, 2002:

Reports that say that something hasn’t happened are always interesting to me,
because as we know, there are known knowns; there are things we know we know.
We also know there are known unknowns; that is to say we know
there are some things we do not know.
But there are also unknown unknowns — the ones we don’t know we don’t know.
And if one looks throughout the history of our country and other free countries,
it is the latter category that tend to be the difficult ones.

Donald Rumsfeld (2002)

Store this as a string kk and then use R commands to answer the following questions:

How often do the words “know”, “known”, or “knowns” occur in this statement?
How often do the words “unknow”, “unknown”, or “unknowns” occur in this statement?

Hints:

Solving this task with base R commands is tricky, but possible (with regular expressions and probably splitting kk into individual words first). Using an appropriate stringr function is much easier. Doing both allows checking your results.
To distinguish between a “know” in “known” vs. in “unknown”, consider searching for “ know” vs. “ unknow” (i.e., include spaces in the search patterns). Alternatively, use regular expressions with anchors for word boundaries (see Section E.2.4 of Appendix E).
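The effect of a word boundary anchor can be sketched in base R on a short hypothetical sentence toy (not the statement kk). The helper count_hits() is a name introduced here for illustration only:

```r
toy <- "we know the unknown is not known"   # hypothetical test sentence

# Hypothetical helper: counts the matches of a pattern in each element of x
count_hits <- function(pattern, x) {
  lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

count_hits("know", toy)        # 3: also counts the "know" inside "unknown"
count_hits("\\bknow", toy)     # 2: "know" and "known"
count_hits("\\bunknow", toy)   # 1: "unknown"
```

The anchor \\b matches at the start of a word without consuming a character, so it also works at the very beginning of a string, where a search for " know" would miss a match.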

9.7.9 Exercise 9

Bonus task: Literature search

Download a book from https://gutenberg.org or https://www.projekt-gutenberg.org and use R to perform some quantitative analysis (e.g., counting the frequency of certain names, of key terms, or contrasting the frequency of the pronouns “he” vs. “she”) on it.

Hint: There are several R packages that provide (access to) text data. Examples include:

The R package gutenbergr (Johnston & Robinson, 2023) makes it possible to search, download, and process public domain works from the https://www.gutenberg.org collection. Here’s an example to obtain a book by William James:

# install.packages('gutenbergr')
library(gutenbergr)
library(stringr)  # provides str_detect(), used below

# Inspect metadata:
# gutenberg_metadata

# Search for an author:
wj_works <- gutenberg_works(str_detect(author, "James, William"))
wj_works

# Download a book by its id:
meaning_of_truth <- gutenberg_download(5117)

# Text of book (first lines):
meaning_of_truth$text[1:30]

The R package bardr (Billings, 2021) provides R data structures for the complete works of William Shakespeare, as provided by Project Gutenberg.

This concludes our exercises on manipulating strings of text.

References

Billings, Z. (2021). bardr: Complete works of William Shakespeare in tidy format. Retrieved from https://CRAN.R-project.org/package=bardr

Johnston, M., & Robinson, D. (2023). gutenbergr: Download and process public domain works from Project Gutenberg. Retrieved from https://docs.ropensci.org/gutenbergr/