9.7 Exercises
Here are some exercises on manipulating strings of text with base R, regular expressions, and stringr commands.
9.7.1 Exercise 1
Escaping into Unicode
Use your knowledge of representing basic text strings and special characters (by consulting a list of Unicode symbols, e.g., Wikipedia: Unicode characters) to define two R strings (ideally from different languages) that each contain at least two special symbols, such as:
Der Käsereichtum Österreichs ist ungewöhnlich groß.
LaTeX commands begin with a backslash “\”. For instance, the LaTeX command for emphasis is “\emph{}”.
Hint: See Section 9.2.2 and the help page ?"'" for general information on strings in R.
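One possible way to approach this is sketched below. It is a minimal illustration rather than a model solution: the object names `s1` and `s2` are assumptions, and the escape codes are taken from a standard Unicode table. Non-ASCII letters are written as `\uXXXX` escapes, and each literal backslash or double quote inside a string is escaped with a preceding backslash.

```r
# German umlauts and sharp s as Unicode escapes (U+00E4, U+00D6, U+00F6, U+00DF):
s1 <- "Der K\u00e4sereichtum \u00d6sterreichs ist ungew\u00f6hnlich gro\u00df."

# A literal backslash is typed as "\\", a literal double quote as "\"":
s2 <- "LaTeX commands begin with a backslash \"\\\". For instance, the LaTeX command for emphasis is \"\\emph{}\"."

# writeLines() shows the strings as a reader would see them (without escapes):
writeLines(c(s1, s2))
```

Printing `s2` with `print()` would show the escape characters again, which is why `writeLines()` is used for checking the result.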
9.7.2 Exercise 2
Pasting vectors
Suppose you wanted to create names for 50 image files. The 1st of them should be called “img_1.png”, and the last should be called “img_50.png”.
- Can you create all 50 file names in 1 R command?
Hint: Yes, you can: use paste() or paste0() in combination with a numeric vector.
When sorted alphabetically, these files do not appear in numeric order, because the first 9 names (up to “img_9.png”) are shorter than the others (from “img_10.png” onwards).
- Can you make all file names the same length?
Hint: One solution could be to insert a “0” to make the first 9 names “img_01.png” to “img_09.png”.
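The two hints above can be sketched as follows. This is one of several possible approaches; the object names `img_names` and `img_names_padded` are assumptions, and `sprintf()` is used here as an alternative to inserting the “0” by hand.

```r
# All 50 names in one command (paste0() recycles the fixed parts):
img_names <- paste0("img_", 1:50, ".png")

# Same-length names: "%02d" pads numbers to 2 digits with a leading zero:
img_names_padded <- sprintf("img_%02d.png", 1:50)

head(img_names_padded, 3)
sort(img_names_padded)[1:3]  # alphabetical order now matches numeric order
```

A call like `formatC(1:50, width = 2, flag = "0")` inside `paste0()` would be another way to obtain the padded numbers.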
9.7.3 Exercise 3
This exercise requires using regular expressions. (See Appendix E for a primer on using regular expressions.)
Matching countries
The character vector countries included in ds4psy contains the names of 197 countries of the world.
Use the names of countries to answer the following questions:
- Find all countries with “ee”, “ll”, or “oro”.
- Which countries have names that contain the word “and” but not “land”?
- Which countries have names that contain the letters “z” or “Z”?
- Which countries have names that are 13 letters long?
- Which names of countries contain punctuation characters?
- Which names of countries contain exactly 1 or more than 2 spaces?
- Which countries have names starting with a cardinal direction (i.e., North, East, South, West)?
- Which countries have names ending in “land” vs. contain “land” without ending in it?
- Which countries have names with a repeated letter?
- Which countries have names containing the same letter more than 3 times?
- Which countries have names containing 3 or more capital letters?
- Which countries have names containing the same capital letter twice?
Hint: Most of these tasks can be solved in many different ways.
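To illustrate the kind of patterns involved, here is a minimal base R sketch for a few of the questions. It assumes a small, hand-picked sample vector `cs` (a hypothetical stand-in for the full countries vector), so its results do not answer the questions above.

```r
# Hypothetical sample of names (the full vector is ds4psy::countries):
cs <- c("Greece", "Seychelles", "Morocco", "Finland", "Andorra",
        "Trinidad and Tobago", "New Zealand", "South Africa", "Guinea-Bissau")

# Alternation: names containing "ee", "ll", or "oro"
has_seq <- grep("ee|ll|oro", cs, value = TRUE)

# Combining two logical tests: "and" but not "land" (case-sensitive)
and_not_land <- cs[grepl("and", cs) & !grepl("land", cs)]

# Anchoring at the start of a name:
cardinal <- grep("^(North|East|South|West)", cs, value = TRUE)

# Character class for punctuation:
punct <- grep("[[:punct:]]", cs, value = TRUE)

list(has_seq, and_not_land, cardinal, punct)
```

The same ideas carry over to stringr, where `str_detect()` and `str_subset()` could replace `grepl()` and `grep(..., value = TRUE)`.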
9.7.4 Exercise 4
This is another exercise requiring regular expressions (see Appendix E).
Quantifying and removing white space
- Counting spaces:
In Section 9.5.3, we have seen how we can use the count_chars() function (of ds4psy) to determine the frequency of characters in sentences:
sts <- tolower(sentences) # data
tb <- count_chars(sts, rm_specials = FALSE)
tb
#> chars
#>             e      t      a      h      s      o      r      n      i      l
#>   5021   3061   2354   1734   1660   1584   1561   1357   1222   1208   1000
#>      d      .      c      w      f      u      p      g      m      b      k
#>    949    724    605    597    563    527    492    425    401    370    327
#>      y      v      j      ,      z      x      q      '      ? \u0092      -
#>    299    139     40     31     30     28     18     15      6      3      2
#>      !      &
#>      1      1
tb[1]/sum(tb)
#>
#> 0.1770764
This shows that the most frequent character in sentences is " ", accounting for 17.7 percent of all characters.
- Use stringr commands to quantify the percentage of spaces in sentences.
- Mimicking str_squish():
Assuming a simple vector xs <- c(" A B C D "), the stringr function str_squish(xs) removes repeated spaces and any leading and trailing spaces.
- Achieve the result of str_squish(xs) with regular expressions.
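Both parts of this exercise can be approached along the following lines. The sketch deliberately uses base R so that it runs without extra packages, and it assumes two hypothetical inputs: a short vector `x` standing in for sentences and an `xs` with visibly repeated spaces. With stringr, `str_count()` and `str_length()` could take the place of the `gsub()` and `nchar()` steps in the first part.

```r
x <- c("the cat sat", "on the mat")  # hypothetical stand-in for sentences

# Percentage of spaces: number of spaces divided by number of all characters
n_spaces  <- sum(nchar(gsub("[^ ]", "", x)))  # remove everything except spaces
pc_spaces <- n_spaces / sum(nchar(x)) * 100

# Mimicking str_squish(): collapse runs of white space, then trim both ends
xs <- "  A   B  C D  "  # hypothetical input with repeated spaces
squished <- gsub("^[[:space:]]+|[[:space:]]+$", "",
                 gsub("[[:space:]]+", " ", xs))
squished
```

Collapsing first and trimming second means that the outer pattern only ever has to remove a single leading and a single trailing space.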
9.7.5 Exercise 5
Parts of this exercise benefit from using regular expressions (see Appendix E), but it is possible to solve most without them as well.
Searching color names
The function colors() (from the R core package grDevices) returns the 657 valid color names in R.
- Define the following 10 strings as a character vector color_candidates and use a base R function to check which of them are actual color names in R.
#> [1] "blanchedalmond" "honeydew" "hotpink3" "palevioletred1"
#> [5] "royalpink" "sadblue" "saddlebrown2" "snowwhite"
#> [9] "tan4" "yello3"
Hint: Half of these names are actual R color names, whereas the others are not. Take a guess which are which before checking them! Also, prefer simple solutions over more complex ones.
- How many of the 657 valid color names begin with either gray or grey?
(Try solving this twice: Once by using only base R functions and once with functions from the stringr package.)
Hint: We can either add up all colors starting with gray and grey (as their first four characters) or specify a regular expression that searches for both at once (i.e., using (a|e)) and requires that matches start with the pattern.
- How many of the 657 valid color names contain gray or grey?
- Which of the 657 valid color names contain gray or grey, but neither begin nor end with gray or grey?
Hint: We could solve this by first computing 3 sets of color names and then using setdiff() on them. Or we use a regular expression that requires characters before and after gray or grey.
- Which of the 657 valid color names begin and end with a vowel?
- Which color names in colors() contain the character sequence “po”, “pp”, or “oo”?
- Which color name in colors() contains the character “e” four times?
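The first parts of this exercise can be sketched in base R as follows. This is a minimal illustration under the assumption that the names `is_color`, `n_regex`, and `n_prefix` are free to use; results are not shown so that the guessing step remains meaningful.

```r
color_candidates <- c("blanchedalmond", "honeydew", "hotpink3", "palevioletred1",
                      "royalpink", "sadblue", "saddlebrown2", "snowwhite",
                      "tan4", "yello3")

# Which candidates are valid R color names?
is_color <- color_candidates %in% colors()
color_candidates[is_color]

# Names beginning with "gray" or "grey": two base R counts that should agree
n_regex  <- sum(grepl("^gr(a|e)y", colors()))
n_prefix <- sum(startsWith(colors(), "gray")) + sum(startsWith(colors(), "grey"))

# Names with at least one character before and after "gray" or "grey":
grep("^.+gr(a|e)y.+$", colors(), value = TRUE)
```

Comparing `n_regex` with `n_prefix` is a cheap way of checking the regular expression against the simpler prefix test. With stringr, `str_detect(colors(), "^gr(a|e)y")` could be used in place of `grepl()`.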
9.7.6 Exercise 6
Detecting patterns in pi
The mathematical constant \(\pi\) denotes the ratio of a circle’s circumference to its diameter and is one of the most famous numbers (see Wikipedia on pi for details).
In R, pi is a built-in constant that evaluates to 3.1415927. This is an approximate value, of course. Being an irrational number, the decimal representation of \(\pi\) contains an infinite number of digits and never settles into a permanently repeating pattern.
In this exercise, we are trying to find patterns in \(\pi\) by treating it as a sequence of (text) symbols.
To this end, the ds4psy package contains a character object pi_100k that provides the first 100,000 digits of \(\pi\):
pi_char <- ds4psy::pi_100k
# Check object:
typeof(pi_char) # type?
#> [1] "character"
nchar(pi_char) # number of characters?
#> [1] 100001
substr(pi_char, 1, 10) # first 10 characters
#> [1] "3.14159265"
- Does the sequence “1234” occur within the first 100,000 digits of pi? (Use an R command that answers this question by yielding TRUE or FALSE.)
- At which location does the sequence “1234” occur within the first 100,000 digits of pi?
- How often and at which locations does the sequence “1234” occur within the first 100,000 digits of pi?
- Locate and extract all occurrences of "2_4_6_8" out of the first 100,000 digits of pi (where _ could match an arbitrary digit).
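The detect, locate, and extract steps can be tried out on a shorter string first. The following base R sketch assumes a hypothetical object `pi_short` (the first 40 decimals of pi) and searches for the shorter patterns “26” and “3_8” rather than those asked for above.

```r
# First 40 decimals of pi as a short stand-in for ds4psy::pi_100k:
pi_short <- "3.1415926535897932384626433832795028841971"

# Does a sequence occur at all? (fixed = TRUE treats the pattern literally)
has_26 <- grepl("26", pi_short, fixed = TRUE)

# All start positions of a sequence (their number tells us how often it occurs):
pos_26 <- as.integer(gregexpr("26", pi_short, fixed = TRUE)[[1]])

# Wildcard digits: "3_8", with [0-9] standing in for an arbitrary digit
hits <- regmatches(pi_short, gregexpr("3[0-9]8", pi_short))[[1]]

has_26; pos_26; hits
```

Note that `gregexpr()` reports non-overlapping matches only, which can matter for patterns with wildcards. In stringr, `str_detect()`, `str_locate_all()`, and `str_extract_all()` cover the same three steps.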
9.7.7 Exercise 7
In Section 9.5.1, we used the chartr() or str_replace_all() functions to translate text into leet slang.
This exercise takes this a step further by first encrypting and then decrypting the characters of a text.
Naive cryptography
Kids often invent “secret codes” to hide written messages from the curious eyes of others. A very simple type of “encryption” consists in systematically replacing each occurrence of a number of characters by a different character.
Create the following strings:
- txt: a string of text (for testing purposes)
- org: a string of characters to be replaced
- new: a string of characters used to replace the corresponding ones in org
Encryption: Use the chartr() function to encrypt txt by replacing all characters of org by the corresponding character in new.
Decryption: Use the chartr() function to decrypt the result of the encryption step to obtain the original txt.
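A minimal sketch of this scheme is shown below. It assumes a particular choice of org and new (the lowercase alphabet and the same alphabet shifted by 13 positions) and the hypothetical names `enc` and `dec` for the results; any other pair works as long as new is a rearrangement of the characters in org, so that the mapping can be reversed.

```r
txt <- "hello world"

# org: lowercase alphabet; new: the same letters, shifted by 13 positions
org <- paste(letters, collapse = "")
new <- paste(c(letters[14:26], letters[1:13]), collapse = "")

enc <- chartr(old = org, new = new, x = txt)  # encryption
dec <- chartr(old = new, new = org, x = enc)  # decryption: swap old and new

enc
dec
```

Decryption works by swapping the roles of the two strings. Characters that occur in neither string (here: the space) are left unchanged.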
9.7.8 Exercise 8
Known unknowns
According to this Wikipedia article on known knowns, Donald Rumsfeld (then the United States Secretary of Defense) famously stated at a U.S. Department of Defense news briefing on February 12, 2002:
Reports that say that something hasn’t happened are always interesting to me,
because as we know, there are known knowns; there are things we know we know.
We also know there are known unknowns; that is to say we know
there are some things we do not know.
But there are also unknown unknowns — the ones we don’t know we don’t know.
And if one looks throughout the history of our country and other free countries,
it is the latter category that tend to be the difficult ones.
Donald Rumsfeld (2002)
Store this as a string kk and then use R commands to answer the following questions:
- How often do the words “know”, “known”, or “knowns” occur in this statement?
- How often do the words “unknow”, “unknown”, or “unknowns” occur in this statement?
Hints:
- Solving this task with base R commands is tricky, but possible (with regular expressions and probably splitting kk into individual words first). Using an appropriate stringr function is much easier. Doing both allows checking your results.
- To distinguish between a “know” in “known” vs. in “unknown”, consider searching for “ know” vs. “ unknow” (i.e., include spaces in the search patterns). Alternatively, use regular expressions with anchors for word boundaries (see Section E.2.4 of Appendix E).
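The word-boundary approach can be sketched in base R as follows. The example assumes a shortened excerpt of the statement as kk (so its counts differ from those for the full quote) and a hypothetical helper function `count_matches()`.

```r
# Shortened excerpt of the statement (not the full quote):
kk <- paste("there are known knowns; there are things we know we know.",
            "We also know there are known unknowns; that is to say we know",
            "there are some things we do not know.")

# Helper (hypothetical name): count all matches of a regular expression in a string
count_matches <- function(pattern, string) {
  lengths(regmatches(string, gregexpr(pattern, string, perl = TRUE)))
}

# \\b marks a word boundary, so "know" inside "unknown" is not counted:
n_know   <- count_matches("\\bknow(n|ns)?\\b", kk)    # know, known, knowns
n_unknow <- count_matches("\\bunknow(n|ns)?\\b", kk)  # unknow, unknown, unknowns

c(know = n_know, unknow = n_unknow)
```

With stringr, `str_count(kk, "\\bknow(n|ns)?\\b")` could serve as an independent check of the first count.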
9.7.9 Exercise 9
Bonus task: Literature search
Download a book from https://gutenberg.org or https://www.projekt-gutenberg.org and use R to perform some quantitative analysis (e.g., counting the frequency of certain names, of key terms, or contrasting the frequency of the pronouns “he” vs. “she”) on it.
Hint: There are several R packages that provide (access to) text data. Examples include:
- The R package gutenbergr (Johnston & Robinson, 2023) allows searching, downloading, and processing public domain works from the https://www.gutenberg.org collection. Here’s an example that obtains a book by William James:
# install.packages('gutenbergr')
library(gutenbergr)
library(stringr)  # provides str_detect()
# Inspect metadata:
# gutenberg_metadata
# Search for an author:
wj_works <- gutenberg_works(str_detect(author, "James, William"))
wj_works
# Download a book by its id:
meaning_of_truth <- gutenberg_download(5117)
# Text of book (first lines):
meaning_of_truth$text[1:30]
- The R package bardr (Billings, 2021) provides R data structures for the complete works of William Shakespeare, as provided by Project Gutenberg.
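Once a text is available as a character vector of lines, a simple word count can be set up along the following lines. This base R sketch assumes two made-up lines in a hypothetical object `text_lines`; for a downloaded book, the text column of the downloaded object could take their place.

```r
# Hypothetical lines of text (stand-in for the text of a downloaded book):
text_lines <- c("He said that she was late.",
                "She replied: he is wrong, and she left.")

# Split into lowercase words (any run of non-letters acts as a separator):
words <- unlist(strsplit(tolower(text_lines), "[^a-z]+"))
words <- words[words != ""]  # drop empty strings from leading separators

# Contrast the frequency of two pronouns:
n_he  <- sum(words == "he")
n_she <- sum(words == "she")
c(he = n_he, she = n_she)

# Frequency table of all words, most frequent first:
head(sort(table(words), decreasing = TRUE))
```

The pattern `[^a-z]+` is only adequate for English text without accented letters; texts in other languages would need a broader character class such as `[^[:alpha:]]+`.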
This concludes our exercises on manipulating strings of text.