A.9 Solutions (09)

ds4psy: Solutions 09: Text and strings

Here are the solutions to the exercises of Chapter 9 (Section 9.7) on manipulating strings of text with base R, regular expressions, and stringr commands.

A.9.1 Exercise 1

Escaping into Unicode

Use your knowledge of representing basic text strings and special characters (by consulting a list of Unicode symbols, e.g., Wikipedia: Unicode characters) to define two R strings (ideally from different languages) that each contain at least two special symbols, like:

Hamlet says: “2b ∨ ¬2b”
LaTeX commands begin with a backslash “\”. For instance, the LaTeX command for emphasis is “\emph{}”.
Der Käsereichtum Österreichs ist ungewöhnlich groß.

Hint: See Section 9.2.2 and ?"'" for general information on strings in R.

Solution

Here are some examples (many of which were provided by former students of this course):

English

# String: Hamlet says: “2b ∨ ¬2b”
e1 <- "Hamlet says: \u201C2b \u2228 \u00AC2b\u201D"
writeLines(e1)
#> Hamlet says: “2b ∨ ¬2b”

# String: LaTeX commands begin with a backslash "\". 
#         For instance, the LaTeX command for emphasis is "\emph{}".
e2 <- 'LaTeX commands begin with a backslash "\\".\nFor instance, the LaTeX command for emphasis is "\\emph{}".'
writeLines(e2)
#> LaTeX commands begin with a backslash "\".
#> For instance, the LaTeX command for emphasis is "\emph{}".

# String: E-mail me at: john.doe1988@gmail.biz
e3 <- "E\u002Dmail me at\u003A john.doe1988\u0040gmail.biz"
writeLines(e3)
#> E-mail me at: john.doe1988@gmail.biz

German

# String: Der Käsereichtum Österreichs ist ungewöhnlich groß.
de_01 <- "Der K\u00E4sereichtum \u00D6sterreichs ist ungew\u00F6hnlich gro\u00DF."
writeLines(de_01)
#> Der Käsereichtum Österreichs ist ungewöhnlich groß.

# String: Das Märchen von Schneeweißchen und Rosenrot ist schön.
de_02 <- "Das M\u00E4rchen von Schneewei\u00DFchen und Rosenrot ist sch\u00F6n."
writeLines(de_02)
#> Das Märchen von Schneeweißchen und Rosenrot ist schön.

# String: Große Frösche fahren gerne mit Öltankern. 
de_03 <- "Gro\u00DFe Fr\u00F6sche fahren gerne mit \u00D6ltankern."
writeLines(de_03)
#> Große Frösche fahren gerne mit Öltankern.

Additional examples (from former student solutions):

Die Worte ‘Ähre’ und ‘Ehre’ klingen ähnlich, haben aber eine unterschiedliche Bedeutung. Die größte Änderungsschneiderei der Stadt ist schön. Ominöse Öhrchen hören weiße Radieschen in Málaga. Ich esse am liebsten Käse aus Österreich. Die süße Hündin läuft in die Höhle des Bären, der sie zum Teekränzchen eingeladen hat. Sören will einen großen Käse aus Oberösterreich. Ich weiß, dass ich nichts weiß. Alle Hähnchen hüpfen hoch. Wie häufig sollte man spazieren gehen? Öfter, als das im Onlinesemester möglich ist. Ludwig der ⅩⅣ war der Sonnenkönig von Frankreich.

French

# String: << La beauté commence au moment où vous décidez d'être vous-même. >>
fr_01 <- "\u00AB La beaut\u00E9 commence au moment o\u00F9 vous d\u00E9cidez d'\u00EAtre vous-m\u00EAme. \u00BB"
writeLines(fr_01)
#> « La beauté commence au moment où vous décidez d'être vous-même. »

fr_02 <- "Apr\u00E8s le petit-d\u00E9jeuner je vais aller \u00E0 l'\u00E9cole."
writeLines(fr_02)
#> Après le petit-déjeuner je vais aller à l'école.
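
To find the escape sequence for a symbol without consulting a table, base R can report a character's Unicode code point. The following is a minimal sketch; the helper name to_escape is an illustrative assumption and not part of the chapter or of ds4psy:

```r
# Hypothetical helper: return the \uXXXX escape for a single character
# (assumes a code point that fits into four hexadecimal digits).
to_escape <- function(ch) {
  sprintf("\\u%04X", utf8ToInt(ch))
}

sz <- "\u00DF"      # the German sharp s
utf8ToInt(sz)       # code point as an integer
to_escape(sz)       # the escape as it would be typed in R code
intToUtf8(0x00E4)   # from a code point back to a character
```

Going in both directions (utf8ToInt() and intToUtf8()) is a quick way to check that an escape in a string really denotes the intended symbol.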

Additional examples (from former student solutions):

Les œufs coûtent 50 ¢. Voici comment on revient à la ligne. Voici comment on fait une “double croix” ‡. À cheval donné on ne regarde pas la bride. « J’habite à Constance. Le garçon est arrivé.

Other languages

‘Lena’ würde man auf griechisch ‘Λενα’ schreiben. En Español los acentos son importantes: La frase ‘Mi papa tiene 47 anos.’ tiene un significado muy diferente a la frase ‘Mi papá tiene 47 años.’ My favorite Norwegian word is ‘grønnsaker,’ which means vegetables and translates to ‘légumes’ in French. Gillar du R på morgonen? Nej, det är inte roligt att läser det hela dagen.

Math and special symbols

7 + 8 = 15. The time is 1:15 on Wednesday, June 24. Beim Tetris gibt es verschiedene Bausteine, wie zum Beispiel diesen ▟ oder diesen ▌. My name is ℰℓⅇℕÅ. Der Löwe beißt in das Gemüse: 🦁 🍆. Starsigns can be expressed through emojis: ♈ ♉ ♊. When I was a child, I drew a he♥rt and a fl✿wer for my mother. R seems to censor the word #$%&!, so I have to write it like this: f r e e t i m e ⌛ 🙂 Ich habe eine ⚅ gewürfelt und bin jetzt im Gefängnis, aber nur zu Besuch! 🎲💸 Hast du das süße Kätzchen gesehen? 😍

A.9.2 Exercise 2

Pasting vectors

Suppose you wanted to create names for 50 image files. The 1st of them should be called “img_1.png,” and the last should be called “img_50.png.”

Can you create all 50 file names in 1 R command?

Hint: Yes, you can — use paste() or paste0() in combination with a numeric vector.

The files do not sort automatically when the first 9 names (up to “img_9.png”) are shorter than the others (from “img_10.png” onwards).

Can you make all file names the same length?

Hint: One solution could be to insert a “0” to make the first 9 names “img_01.png” to “img_09.png.”

Solution

paste("img_",  1:50, ".png", sep = "")
#>  [1] "img_1.png"  "img_2.png"  "img_3.png"  "img_4.png"  "img_5.png" 
#>  [6] "img_6.png"  "img_7.png"  "img_8.png"  "img_9.png"  "img_10.png"
#> [11] "img_11.png" "img_12.png" "img_13.png" "img_14.png" "img_15.png"
#> [16] "img_16.png" "img_17.png" "img_18.png" "img_19.png" "img_20.png"
#> [21] "img_21.png" "img_22.png" "img_23.png" "img_24.png" "img_25.png"
#> [26] "img_26.png" "img_27.png" "img_28.png" "img_29.png" "img_30.png"
#> [31] "img_31.png" "img_32.png" "img_33.png" "img_34.png" "img_35.png"
#> [36] "img_36.png" "img_37.png" "img_38.png" "img_39.png" "img_40.png"
#> [41] "img_41.png" "img_42.png" "img_43.png" "img_44.png" "img_45.png"
#> [46] "img_46.png" "img_47.png" "img_48.png" "img_49.png" "img_50.png"
paste0("img_", 1:50, ".png")
#>  [1] "img_1.png"  "img_2.png"  "img_3.png"  "img_4.png"  "img_5.png" 
#>  [6] "img_6.png"  "img_7.png"  "img_8.png"  "img_9.png"  "img_10.png"
#> [11] "img_11.png" "img_12.png" "img_13.png" "img_14.png" "img_15.png"
#> [16] "img_16.png" "img_17.png" "img_18.png" "img_19.png" "img_20.png"
#> [21] "img_21.png" "img_22.png" "img_23.png" "img_24.png" "img_25.png"
#> [26] "img_26.png" "img_27.png" "img_28.png" "img_29.png" "img_30.png"
#> [31] "img_31.png" "img_32.png" "img_33.png" "img_34.png" "img_35.png"
#> [36] "img_36.png" "img_37.png" "img_38.png" "img_39.png" "img_40.png"
#> [41] "img_41.png" "img_42.png" "img_43.png" "img_44.png" "img_45.png"
#> [46] "img_46.png" "img_47.png" "img_48.png" "img_49.png" "img_50.png"

Note: The numeric vector 1:50 was converted into a character sequence. However, it can be somewhat annoying that — due to the different number of numeric digits in the file names — your list of files is not sorted correctly in directory views.

Can you make all file names the same length?

A possible solution to this would be:

paste("img_",  c(paste0("0", 1:9), 10:50), ".png", sep = "")
#>  [1] "img_01.png" "img_02.png" "img_03.png" "img_04.png" "img_05.png"
#>  [6] "img_06.png" "img_07.png" "img_08.png" "img_09.png" "img_10.png"
#> [11] "img_11.png" "img_12.png" "img_13.png" "img_14.png" "img_15.png"
#> [16] "img_16.png" "img_17.png" "img_18.png" "img_19.png" "img_20.png"
#> [21] "img_21.png" "img_22.png" "img_23.png" "img_24.png" "img_25.png"
#> [26] "img_26.png" "img_27.png" "img_28.png" "img_29.png" "img_30.png"
#> [31] "img_31.png" "img_32.png" "img_33.png" "img_34.png" "img_35.png"
#> [36] "img_36.png" "img_37.png" "img_38.png" "img_39.png" "img_40.png"
#> [41] "img_41.png" "img_42.png" "img_43.png" "img_44.png" "img_45.png"
#> [46] "img_46.png" "img_47.png" "img_48.png" "img_49.png" "img_50.png"
paste0("img_", c(paste0("0", 1:9), 10:50), ".png")
#>  [1] "img_01.png" "img_02.png" "img_03.png" "img_04.png" "img_05.png"
#>  [6] "img_06.png" "img_07.png" "img_08.png" "img_09.png" "img_10.png"
#> [11] "img_11.png" "img_12.png" "img_13.png" "img_14.png" "img_15.png"
#> [16] "img_16.png" "img_17.png" "img_18.png" "img_19.png" "img_20.png"
#> [21] "img_21.png" "img_22.png" "img_23.png" "img_24.png" "img_25.png"
#> [26] "img_26.png" "img_27.png" "img_28.png" "img_29.png" "img_30.png"
#> [31] "img_31.png" "img_32.png" "img_33.png" "img_34.png" "img_35.png"
#> [36] "img_36.png" "img_37.png" "img_38.png" "img_39.png" "img_40.png"
#> [41] "img_41.png" "img_42.png" "img_43.png" "img_44.png" "img_45.png"
#> [46] "img_46.png" "img_47.png" "img_48.png" "img_49.png" "img_50.png"

A more general solution is provided by the num_as_char() function of the ds4psy package. This function turns numbers (i.e., objects of type integer or double) into character sequences (i.e., numeric digits of type character) and allows specifying n_pre_dec (i.e., a desired number of digits prior to the decimal separator) and n_dec (i.e., the desired number of digits after the decimal separator):

library(ds4psy)  # requires version 0.1.0+

# Explore num_as_char function:
num_as_char(1/3, n_pre_dec = 2, n_dec = 2)
#> [1] "00.33"
num_as_char(2/3, n_pre_dec = 2, n_dec = 2)  # rounding up
#> [1] "00.67"
num_as_char(2/3, n_pre_dec = 2, n_dec = 0)  # rounding up
#> [1] "01"

As num_as_char() also works with vector inputs, using this function in combination with paste0() solves our problem:

# Use num_as_char:
paste0("img_", num_as_char(1:50, n_dec = 0), ".png")
#>  [1] "img_01.png" "img_02.png" "img_03.png" "img_04.png" "img_05.png"
#>  [6] "img_06.png" "img_07.png" "img_08.png" "img_09.png" "img_10.png"
#> [11] "img_11.png" "img_12.png" "img_13.png" "img_14.png" "img_15.png"
#> [16] "img_16.png" "img_17.png" "img_18.png" "img_19.png" "img_20.png"
#> [21] "img_21.png" "img_22.png" "img_23.png" "img_24.png" "img_25.png"
#> [26] "img_26.png" "img_27.png" "img_28.png" "img_29.png" "img_30.png"
#> [31] "img_31.png" "img_32.png" "img_33.png" "img_34.png" "img_35.png"
#> [36] "img_36.png" "img_37.png" "img_38.png" "img_39.png" "img_40.png"
#> [41] "img_41.png" "img_42.png" "img_43.png" "img_44.png" "img_45.png"
#> [46] "img_46.png" "img_47.png" "img_48.png" "img_49.png" "img_50.png"

We will further examine num_as_char() in Chapter 11: Functions (see Section 11.6.6).
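
As a base R alternative that the chapter does not use, a format string can pad the numbers with leading zeros. This is a sketch under the assumption that two digits suffice (i.e., fewer than 100 files):

```r
# "%02d" formats an integer with at least two digits, padding with zeros:
file_names <- sprintf("img_%02d.png", 1:50)

head(file_names, 3)
unique(nchar(file_names))  # a single value means all names have the same length
```

For more files, the width in the format string (here 2) would have to grow accordingly.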

A.9.3 Exercise 3

This exercise requires using regular expressions. (See Appendix E for a primer on using regular expressions.)

Matching countries

The character vector countries included in ds4psy contains the names of 197 countries of the world:

countries <- ds4psy::countries  # data
length(countries)  # 197
#> [1] 197

Use the names of countries to answer the following questions:

Find all countries with “ee,” “ll,” or “oro.”
Which countries have names that contain the word “and” but not “land”?
Which countries have names that contain the letters “z” or “Z”?
Which countries have names that are 13 letters long?
Which names of countries contain punctuation characters?
Which names of countries contain exactly 1 or more than 2 spaces?
Which countries have names starting with a cardinal direction (i.e., North, East, South, West)?
Which countries have names ending on “land” vs. contain “land” without ending on it?
Which countries have names with a repeated letter?
Which countries have names containing the same letter more than 3 times?
Which countries have names containing 3 or more capital letters?
Which countries have names containing the same capital letter twice?

Hint: Most of these tasks can be solved in many different ways.

Solution

Find all countries with “ee,” “ll,” or “oro”:

# ee:
grep("e{2}", countries, value = TRUE)
#> [1] "Greece"   "Holy see"
# ll:
grep("ll", countries, value = TRUE)
#> [1] "Marshall Islands" "Seychelles"
# oro:
str_view_all(countries, "oro", match = TRUE)

Which countries have names that contain the word “and” but not “land”?

# and: 
grep(" and ", countries, value = TRUE)
#> [1] "Antigua and Barbuda"            "Bosnia and Herzegovina"        
#> [3] "St. Kitts and Nevis"            "St. Vincent and the Grenadines"
#> [5] "Sao Tome and Principe"          "Trinidad and Tobago"
grep("\\band\\b", countries, value = TRUE)
#> [1] "Antigua and Barbuda"            "Bosnia and Herzegovina"        
#> [3] "St. Kitts and Nevis"            "St. Vincent and the Grenadines"
#> [5] "Sao Tome and Principe"          "Trinidad and Tobago"
str_view_all(countries, "\\band\\b", match = TRUE)

Which countries have names that contain the letters “z” or “Z”?

# z or Z:
grep("z|Z", countries, value = TRUE)
#>  [1] "Azerbaijan"             "Belize"                 "Bosnia and Herzegovina"
#>  [4] "Brazil"                 "Czech Republic"         "Kazakhstan"            
#>  [7] "Kyrgyz Republic"        "Mozambique"             "New Zealand"           
#> [10] "Swaziland"              "Switzerland"            "Tanzania"              
#> [13] "Uzbekistan"             "Venezuela"              "Zambia"                
#> [16] "Zimbabwe"
str_view_all(countries, "z|Z", match = TRUE)

str_view_all(countries, "[zZ]", match = TRUE)

Which countries have names that are 13 letters long?

# 13 letters: 
countries[nchar(countries) == 13]
#> [1] "Cote d'Ivoire" "Guinea-Bissau" "Liechtenstein" "United States"
str_view_all(countries, "^.............$", match = TRUE)

Which names of countries contain punctuation characters?

# punctuation:
str_view_all(countries, "[:punct:]", match = TRUE)
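
Since str_view_all() only displays its matches in a viewer, a base R call that returns the matching names may be useful for checking. The sketch below uses a small illustrative subset of names (taken from the outputs in this section) rather than the full countries vector:

```r
# Illustrative subset of names (not the full ds4psy::countries vector):
some_countries <- c("Brazil", "Cote d'Ivoire", "Guinea-Bissau",
                    "Congo, Rep.", "New Zealand", "St. Lucia")

# In base R, a POSIX class must sit inside a bracket expression:
with_punct <- grep("[[:punct:]]", some_countries, value = TRUE)
with_punct
```

Note the double brackets: base R expects "[[:punct:]]", whereas the stringr call above accepts "[:punct:]".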

Which names of countries contain exactly 1 or more than 2 spaces?

## Any space: 
# grep(" ", countries, value = TRUE)
# grep(" {1}", countries, value = TRUE)
# str_view_all(countries, "[:space:]", match = TRUE)

# 1 space: 
countries[str_count(countries, " ") == 1]
#>  [1] "Burkina Faso"       "Cape Verde"         "Congo, Rep."       
#>  [4] "Costa Rica"         "Cote d'Ivoire"      "Czech Republic"    
#>  [7] "Dominican Republic" "El Salvador"        "Equatorial Guinea" 
#> [10] "Holy see"           "North Korea"        "South Korea"       
#> [13] "Kyrgyz Republic"    "Macedonia, FYR"     "Marshall Islands"  
#> [16] "New Zealand"        "St. Lucia"          "San Marino"        
#> [19] "Saudi Arabia"       "Sierra Leone"       "Slovak Republic"   
#> [22] "Solomon Islands"    "South Africa"       "South Sudan"       
#> [25] "Sri Lanka"          "United Kingdom"     "United States"
# more than 2 spaces: 
countries[str_count(countries, " ") > 2]
#> [1] "St. Kitts and Nevis"            "St. Vincent and the Grenadines"
#> [3] "Sao Tome and Principe"

Which countries have names starting with a cardinal direction (i.e., North, East, South, West)?

# initial anchor:
grep("^North", countries, value = TRUE)
#> [1] "North Korea"
grep("^East", countries, value = TRUE)
#> character(0)
grep("^South", countries, value = TRUE)
#> [1] "South Korea"  "South Africa" "South Sudan"
grep("^West", countries, value = TRUE)
#> character(0)

Which countries have names ending on “land” vs. contain “land” without ending on it?

# final anchor:
grep("land$", countries, value = TRUE)
#> [1] "Finland"     "Iceland"     "Ireland"     "New Zealand" "Poland"     
#> [6] "Swaziland"   "Switzerland" "Thailand"
str_view_all(countries, "land.", match = TRUE)

Which countries have names with a repeated letter?

# repetition:
str_view(countries, "([a-z])\\1", match = TRUE)

Which countries have names containing the same letter more than three times?

# assuming the letter is "a":
countries[str_count(tolower(countries), "a") > 3]  
#> [1] "Antigua and Barbuda" "Madagascar"          "Saudi Arabia"

# repetition with wildcards and quantifiers:
str_view(tolower(countries), "([a-z]).*\\1.*\\1.*\\1", match = TRUE)
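
The first solution above assumes a particular letter. A more general approach without regular expressions is to tabulate the letters of each name and inspect the highest count. This is a minimal sketch; the helper max_letter_count() and the short example vector are illustrative assumptions:

```r
# Hypothetical helper: highest frequency of any single letter in each name
max_letter_count <- function(x) {
  vapply(x, function(s) {
    # keep only lowercase letters a-z, then split into single characters
    chars <- strsplit(gsub("[^a-z]", "", tolower(s)), "")[[1]]
    as.integer(max(table(chars)))
  }, integer(1))
}

some_countries <- c("Brazil", "Greece", "Madagascar", "Saudi Arabia")
counts <- max_letter_count(some_countries)
counts
names(counts)[counts > 3]  # names with the same letter more than 3 times
```

Applied to the full countries vector, the same logic should cover any repeated letter, not only "a".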

Which countries have names containing three or more capital letters?

# counting capital letters:
countries[str_count(countries, "[A-Z]") > 2]
#>  [1] "Central African Republic"       "Congo, Dem. Rep."              
#>  [3] "Hong Kong, China"               "Macedonia, FYR"                
#>  [5] "Micronesia, Fed. Sts."          "Papua New Guinea"              
#>  [7] "St. Kitts and Nevis"            "St. Vincent and the Grenadines"
#>  [9] "Sao Tome and Principe"          "United Arab Emirates"
str_view(countries, "[:upper:].*[:upper:].*[:upper:]", match = TRUE)

Which countries have names containing the same capital letter twice?

# repetition:
str_view(countries, "([A-Z]).*\\1", match = TRUE)

A.9.4 Exercise 4

This is another exercise requiring regular expressions (see Appendix E).

Quantifying and removing white space

Counting spaces:

In Section 9.5.3, we have seen how we can use the count_chars() function (of ds4psy) to determine the frequency of characters in sentences:

sts <- tolower(sentences)  # data
tb <- count_chars(sts, rm_specials = FALSE)
tb
#> chars
#>             e      t      a      h      s      o      r      n      i      l 
#>   5021   3061   2354   1734   1660   1584   1561   1357   1222   1208   1000 
#>      d      .      c      w      f      u      p      g      m      b      k 
#>    949    724    605    597    563    527    492    425    401    370    327 
#>      y      v      j      ,      z      x      q      '      ? \u0092      - 
#>    299    139     40     31     30     28     18     15      6      3      2 
#>      !      & 
#>      1      1
tb[1]/sum(tb)
#>           
#> 0.1770764

This shows that the most frequent character in sentences is " " (the space), which accounts for 17.7 percent of all characters.

Use stringr commands to quantify the percentage of spaces in sentences.

Mimicking str_squish():

Assuming a simple vector xs <- c("  A  B   C    D    "), the stringr function str_squish(xs) removes repeated spaces and any leading and trailing spaces.

Achieve the result of str_squish(xs) with regular expressions.

Solution

Use stringr commands to quantify the percentage of spaces in sentences.

# count " ":
sum(str_count(sentences, " "))
#> [1] 5021
sum(str_count(sentences, "[:space:]"))
#> [1] 5021

# percentage:
sum(str_count(sentences, " "))/sum(str_length(sentences))
#> [1] 0.1770764

Achieve the result of str_squish(xs) with regular expressions.

# Mimicking str_squish():
xs <- c("  A  B   C    D    ")
str_squish(xs)
#> [1] "A B C D"

# Using regular expressions:
x1 <- str_replace_all(xs, " {2,}", " ")  # replace repeated by single spaces
x2 <- str_replace(x1, "^ ", "")  # remove leading space
str_replace(x2, " $", "")        # remove trailing space
#> [1] "A B C D"
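
The same result can be obtained without stringr. The following base R sketch mirrors the steps above with gsub() and also shows a shorter variant using trimws():

```r
xs <- "  A  B   C    D    "

# Collapse runs of spaces, then strip one leading and one trailing space:
x1 <- gsub(" +", " ", xs)
x2 <- gsub("^ | $", "", x1)
x2

# Shorter variant: trimws() removes leading and trailing whitespace
x3 <- trimws(gsub(" +", " ", xs))
x3
```

Collapsing the repeated spaces first ensures that at most one space remains at either end of the string.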

A.9.5 Exercise 5

Parts of this exercise benefit from using regular expressions (see Appendix E), but it is possible to solve most without them as well.

Searching color names

The function colors() (from the R core package grDevices) returns the 657 valid color names in R.

Define the following 10 strings as a character vector color_candidates and use a base R function to check which of them are actual color names in R.

#>  [1] "blanchedalmond" "honeydew"       "hotpink3"       "palevioletred1"
#>  [5] "royalpink"      "sadblue"        "saddlebrown2"   "snowwhite"     
#>  [9] "tan4"           "yello3"

Hint: Half of these names are actual R color names, whereas the others are not. Take a guess which are which prior to checking it! Also, prefer simple solutions over more complex ones.

Solution

We first need to define the vector of color_candidates:

color_candidates <- c("blanchedalmond", "honeydew", "hotpink3",       
                      "palevioletred1", "royalpink", "sadblue",        
                      "saddlebrown2", "snowwhite", "tan4", "yello3")

Two simple solutions are:

# (1) using %in%: 
color_candidates %in% colors()
#>  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE

true_colors <- color_candidates[color_candidates %in% colors()]
true_colors
#> [1] "blanchedalmond" "honeydew"       "hotpink3"       "palevioletred1"
#> [5] "tan4"

not_colors <- color_candidates[!(color_candidates %in% colors())]
not_colors
#> [1] "royalpink"    "sadblue"      "saddlebrown2" "snowwhite"    "yello3"

# (2) using is.element: 
is.element(el = color_candidates, set = colors())
#>  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE

Two more complex solutions — involving regular expressions — would be:

# (+) construct regex:
col_set <- paste(color_candidates, collapse = "|")

# (3) base R + regex:
grep(x = colors(), pattern = col_set, value = TRUE) 
#> [1] "blanchedalmond" "honeydew"       "honeydew1"      "honeydew2"     
#> [5] "honeydew3"      "honeydew4"      "hotpink3"       "palevioletred1"
#> [9] "tan4"

# (4) using stringr + regex:
str_subset(colors(), col_set)  
#> [1] "blanchedalmond" "honeydew"       "honeydew1"      "honeydew2"     
#> [5] "honeydew3"      "honeydew4"      "hotpink3"       "palevioletred1"
#> [9] "tan4"

However, note that these particular solutions find some extra shades of “honeydew.”
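
The extra shades appear because the pattern may match anywhere within a name. One way to require exact matches is to anchor the alternatives at both ends. This is a sketch of that idea (the name col_exact is an illustrative assumption):

```r
color_candidates <- c("blanchedalmond", "honeydew", "hotpink3",
                      "palevioletred1", "royalpink", "sadblue",
                      "saddlebrown2", "snowwhite", "tan4", "yello3")

# Wrap the alternatives in ^( ... )$ so that only complete names match:
col_exact <- paste0("^(", paste(color_candidates, collapse = "|"), ")$")
exact_hits <- grep(pattern = col_exact, x = colors(), value = TRUE)
exact_hits
```

With the anchors in place, "honeydew1" to "honeydew4" no longer qualify, so the result should agree with the %in% solution.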

Answer

True color names: "blanchedalmond", "honeydew", "hotpink3", "palevioletred1", "tan4".
Not color names: "royalpink", "sadblue", "saddlebrown2", "snowwhite", "yello3".

How many of the 657 valid color names begin with either gray or grey?
(Try solving this twice: Once by using only base R functions and once with functions from the stringr package.)

Hint: We can either add up all colors starting with gray and grey (as its first four characters) or specify a regular expression that searches for both at once (i.e., (a|e)), but requires that hits start with the pattern.

Solution

# (1) base R:
# in 2 parts:
sum(substr(colors(), 1, 4) == "gray")  # 102
#> [1] 102
sum(substr(colors(), 1, 4) == "grey")  # 102
#> [1] 102
# together:
sum( (substr(colors(), 1, 4) == "gray") | (substr(colors(), 1, 4) == "grey") )
#> [1] 204

# (2) using grep() and regex: 
length(grep(x = colors(), pattern = "^gr(a|e)y"))  # 204
#> [1] 204

# (3) stringr and regex:
# stringr::str_count(colors(), "^gr(a|e)y")
sum(stringr::str_count(colors(), "^gr(a|e)y"))     # Solution 1: Sum of 204 strings starting with pattern 
#> [1] 204
length(stringr::str_subset(colors(), "^gr(a|e)y")) # Solution 2: Length of set of 204 strings that start with pattern
#> [1] 204
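
Because only the beginning of each name matters here, base R's startsWith() offers another option that avoids both substr() and regular expressions. A brief sketch:

```r
cols <- colors()

# Names that start with either spelling:
n_gray <- sum(startsWith(cols, "gray") | startsWith(cols, "grey"))
n_gray

# Cross-check with a character class instead of (a|e):
n_gray == sum(grepl("^gr[ae]y", cols))
```

The character class [ae] is equivalent to the alternation (a|e) for single characters.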

How many of the 657 valid color names contain gray or grey?

Solution

Since 204 colors begin with gray or grey, the number of colors that contain gray or grey must be even larger:

# (a) base R and regex: 
length(grep(x = colors(), pattern = "gr(a|e)y"))  # 224
#> [1] 224

# (b) stringr and regex: 
sum(stringr::str_count(colors(), "gr(a|e)y"))       # 224
#> [1] 224

Which of the 657 valid color names contain gray or grey, but neither begin nor end with gray or grey?

Hint: We could solve this by first computing three sets of color names and then using setdiff() on them. Or we use a regular expression that requires characters before and after gray or grey.

Solution

# (1) Complicated solution:
set1 <- str_subset(colors(), "gr(a|e)y")   # contain "gray" or "grey"
set2 <- str_subset(colors(), "^gr(a|e)y")  # start with "gray" or "grey"
set3 <- str_subset(colors(), "gr(a|e)y$")  # end on "gray" or "grey"

# As 2 set differences: 
set4 <- setdiff(set1, set2)  # containing but NOT start
setdiff(set4, set3)          # containing, but NOT end
#> [1] "darkslategray1" "darkslategray2" "darkslategray3" "darkslategray4"
#> [5] "slategray1"     "slategray2"     "slategray3"     "slategray4"

# (2) Simpler solution:
str_subset(colors(), ".gr(a|e)y.")
#> [1] "darkslategray1" "darkslategray2" "darkslategray3" "darkslategray4"
#> [5] "slategray1"     "slategray2"     "slategray3"     "slategray4"

Which of the 657 valid color names begin and end with a vowel?

Solution

# (1) in 2 steps:
vow_begin <- str_subset(colors(), "^[aeiou]")  # begin with a vowel
vow_end   <- str_subset(colors(), "[aeiou]$")  # end on a vowel
intersect(vow_begin, vow_end)
#> [1] "aliceblue"    "antiquewhite" "aquamarine"   "azure"        "oldlace"     
#> [6] "orange"

# (2) in 1 regex:
str_subset(colors(), "^[aeiou].*[aeiou]$")
#> [1] "aliceblue"    "antiquewhite" "aquamarine"   "azure"        "oldlace"     
#> [6] "orange"

Which color names in colors() contain the character sequence “po,” “pp,” or “oo”?

Solution

str_subset(colors(), "po|pp|oo")
#>  [1] "burlywood"  "burlywood1" "burlywood2" "burlywood3" "burlywood4"
#>  [6] "deeppink"   "deeppink1"  "deeppink2"  "deeppink3"  "deeppink4" 
#> [11] "maroon"     "maroon1"    "maroon2"    "maroon3"    "maroon4"   
#> [16] "powderblue"

Which color name in colors() contains the character “e” four times?

Solution

colors()[str_count(colors(), "e") == 4]
#> [1] "mediumseagreen"
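
A base R counterpart of str_count() can be assembled from gregexpr() and regmatches(). The following sketch counts a fixed character per color name:

```r
cols <- colors()

# gregexpr() locates all matches, regmatches() extracts them,
# and lengths() counts the matches per name:
n_e <- lengths(regmatches(cols, gregexpr("e", cols, fixed = TRUE)))
cols[n_e == 4]
```

Names without any match yield an empty match vector and thus a count of zero.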

A.9.6 Exercise 6

Detecting patterns in pi

The mathematical constant $\pi$ denotes the ratio of a circle’s circumference to its diameter and is one of the most famous numbers (see Wikipedia on pi for details). In R, pi is a built-in constant that evaluates to 3.1415927. This is an approximate value, of course. Being an irrational number, the decimal representation of $\pi$ contains an infinite number of digits and never settles into a permanently repeating pattern.

In this exercise, we try to find patterns in $\pi$ by treating it as a sequence of (text) symbols. For this purpose, the ds4psy package contains a character object pi_100k that provides the first 100,000 digits of $\pi$:

pi_char <- ds4psy::pi_100k

# Check:
typeof(pi_char)         # type?
#> [1] "character"
nchar(pi_char)          # number of characters?
#> [1] 100001
substr(pi_char, 1, 10)  # first 10 characters
#> [1] "3.14159265"

Does the sequence “1234” occur within the first 100,000 digits of pi? (Use an R command that answers this question by yielding TRUE or FALSE.)

Solution

s <- ds4psy::pi_100k  # load data
p <- "1234"

# Does p occur in s?
grepl(pattern = p, x = s)
#> [1] TRUE
str_detect(string = s, pattern = p)
#> [1] TRUE

At which location does the sequence “1234” occur within the first 100,000 digits of pi?

Solution

# Detecting the 1st occurrence of a pattern in a string:
str_locate(s, p)
#>      start   end
#> [1,] 13809 13812

How often and at which locations does the sequence “1234” occur within the first 100,000 digits of pi?

Solution

# Detecting all occurrences of a pattern in a string:
gregexpr(p, s)
#> [[1]]
#> [1] 13809 26291 49704 57132 73923 80404 82047 96424
#> attr(,"match.length")
#> [1] 4 4 4 4 4 4 4 4
#> attr(,"index.type")
#> [1] "chars"
#> attr(,"useBytes")
#> [1] TRUE
str_locate_all(string = s, pattern = p)
#> [[1]]
#>      start   end
#> [1,] 13809 13812
#> [2,] 26291 26294
#> [3,] 49704 49707
#> [4,] 57132 57135
#> [5,] 73923 73926
#> [6,] 80404 80407
#> [7,] 82047 82050
#> [8,] 96424 96427

Locate and extract all occurrences of "2_4_6_8" out of the first 100,000 digits of pi (where _ could match an arbitrary digit).

Solution

p <- "2.4.6.8"         # "." matches any character
p <- "2\\d4\\d6\\d8"   # stricter alternative: "\\d" matches only digits (used below)

str_locate_all(string = s, pattern = p)  # 4 matches
#> [[1]]
#>      start   end
#> [1,]  5840  5846
#> [2,] 65714 65720
#> [3,] 69308 69314
#> [4,] 85440 85446

# Extract all matches:
str_extract_all(string = s, pattern = p, simplify = FALSE)  # as list
#> [[1]]
#> [1] "2242648" "2243608" "2942648" "2240658"
str_extract_all(string = s, pattern = p, simplify = TRUE)   # as matrix
#>      [,1]      [,2]      [,3]      [,4]     
#> [1,] "2242648" "2243608" "2942648" "2240658"
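
The locating and extracting steps can also be done in base R by combining gregexpr() with regmatches(). The sketch below uses only the first 50 decimal digits of pi as a short stand-in for pi_100k, so its positions are not comparable to those above:

```r
# First 50 decimal digits of pi (a short stand-in for ds4psy::pi_100k):
pi_short <- "3.14159265358979323846264338327950288419716939937510"

# Locate all occurrences of "26":
hits <- gregexpr("26", pi_short)[[1]]
as.integer(hits)  # start positions of the matches

# Extract all matches of "3", followed by any digit, followed by "3":
m <- gregexpr("3\\d3", pi_short)
found <- regmatches(pi_short, m)[[1]]
found
```

As with str_extract_all(), matches are found from left to right and do not overlap.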

A.9.7 Exercise 7

In Section 9.5.1, we used the chartr() or str_replace_all() functions to translate text into leet slang. This exercise goes a step further by first encrypting a text and then decrypting it again.

Naive cryptography

Kids often invent “secret codes” to hide written messages from the curious eyes of others. A very simple type of “encryption” consists in systematically replacing each occurrence of a number of characters by a different character.

Create the following strings:

txt: a string of text (for testing purposes)
org: a string of characters to be replaced
new: a string of characters used to replace the corresponding one in org

1. Encryption: Use the chartr() function to encrypt txt by replacing all characters of org by the corresponding character in new.
2. Decryption: Use the chartr() function to decrypt the result of 1. to obtain the original txt.

Solution

ad 1. Encryption: Replace each of a set of letters of a text by another (random) symbol or letter.

# (a) Simple test:
txt <- "This is only a brief sentence for testing purposes. Does this work?"
org <- "abc def"
new <- "def|abc"
chartr(old = org, new = new, x = txt)
#> [1] "This|is|only|d|eribc|sbntbnfb|cor|tbsting|purposbs.|Dobs|this|work?"

# (b) Replacing all alphabetical letters:
org_symbols <- c(letters, LETTERS)
new_symbols <- sample(org_symbols, size = length(org_symbols), replace = FALSE)  # random permutation

org <- paste(org_symbols, collapse = "")
new <- paste(new_symbols, collapse = "")

txt_encrypt <- chartr(old = org, new = new, x = txt)
txt_encrypt
#> [1] "KUNA NA mvPt h QVNHq AHvbHvMH qmV bHAbNvu jBVjmAHA. gmHA bUNA imVf?"

ad 2. Decryption: Reverse replacement to obtain original text.

txt_decrypt <- chartr(old = new, new = org, x = txt_encrypt)
txt_decrypt
#> [1] "This is only a brief sentence for testing purposes. Does this work?"
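
Because sample() draws a new permutation on every call, the encrypted string shown above will differ between runs. A reproducible round trip can be sketched by fixing the random seed first (the seed value is arbitrary and an assumption of this example):

```r
set.seed(123)  # arbitrary seed, only to make the permutation reproducible

txt <- "This is only a brief sentence for testing purposes. Does this work?"
org_symbols <- c(letters, LETTERS)
new_symbols <- sample(org_symbols)  # random permutation of all 52 letters

org <- paste(org_symbols, collapse = "")
new <- paste(new_symbols, collapse = "")

txt_encrypt <- chartr(old = org, new = new, x = txt)          # encrypt
txt_decrypt <- chartr(old = new, new = org, x = txt_encrypt)  # decrypt

identical(txt_decrypt, txt)  # decryption restores the original text
```

The round trip works because a permutation maps every letter to exactly one other letter, so swapping the old and new arguments inverts the mapping.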

A.9.8 Exercise 8

Known unknowns

According to this Wikipedia article on known knowns, Donald Rumsfeld (then the United States Secretary of Defense) famously stated at a U.S. Department of Defense news briefing on February 12, 2002:

Reports that say that something hasn’t happened are always interesting to me,
because as we know, there are known knowns; there are things we know we know.
We also know there are known unknowns; that is to say we know
there are some things we do not know.
But there are also unknown unknowns — the ones we don’t know we don’t know.
And if one looks throughout the history of our country and other free countries,
it is the latter category that tend to be the difficult ones.

Donald Rumsfeld (2002)

Store this as a string kk and then use R commands to answer the following questions:

How often do the words “know,” “known,” or “knowns” occur in this statement?
How often do the words “unknow,” “unknown,” or “unknowns” occur in this statement?

Hints:

Solving this task with base R commands is tricky, but possible (with regular expressions and probably splitting kk into individual words first). Using an appropriate stringr function is much easier. Doing both allows checking your results.
To distinguish between a “know” in “known” vs. in “unknown,” consider searching for " know" vs. " unknow" (i.e., include spaces in your searches). Alternatively, use regular expressions with anchors for word boundaries (see Section E.2.4 of Appendix E).

Solution

# Data:
kk <- "Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns --- the ones we don't know we don't know. And if one looks throughout the history of our country and other free countries, it is the latter category that tend to be the difficult ones." 

# From: Donald Rumsfeld, United States Secretary of Defense, at a 
# U.S. Department of Defense (DoD) news briefing on February 12, 2002:
# Source: <https://en.wikipedia.org/wiki/There_are_known_knowns>

kl <- tolower(kk)  # all lowercase letters.

# 1. Counting "know":
str_count(kl, "know")       # 14
str_count(kl, "known")      #  6
str_count(kl, "knowns")     #  3

# "know" vs. "known":
str_count(kl, "^know")      #  0 (as not start of string)!
str_count(kl, "^known")     #  0 (...)!
# but:
str_count(kl, " know")      # 11
str_count(kl, " known")     #  3
str_count(kl, " knowns")    #  1

str_count(kl, "\\bknow\\b")   #  8 
str_count(kl, "\\bknown\\b")  #  2
str_count(kl, "\\bknowns\\b") #  1 

# 2. "unknown":
str_count(kl, " unknow")    #  3
str_count(kl, " unknown")   #  3
str_count(kl, " unknowns")  #  2

# Frequency of other words:
str_count(kl, "we")         #  8
str_count(kl, "are")        #  6
str_count(kl, "there")      #  5
str_count(kl, "that")       #  4


# Using ds4psy: ------ 

# Splitting into parts:
ds4psy::text_to_sentences(kk)
ds4psy::text_to_words(kk)

# Quantifying all: 
ds4psy::count_words(kk)
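
The hints mention that a base R solution is possible by splitting kk into individual words first. The following is one possible sketch of that approach, which counts whole words only (so that "unknowns" does not count as "unknown"); the names words, targets, and counts are illustrative:

```r
kk <- paste(
  "Reports that say that something hasn't happened are always interesting to me,",
  "because as we know, there are known knowns; there are things we know we know.",
  "We also know there are known unknowns; that is to say we know",
  "there are some things we do not know.",
  "But there are also unknown unknowns --- the ones we don't know we don't know.",
  "And if one looks throughout the history of our country and other free countries,",
  "it is the latter category that tend to be the difficult ones.")

# Split the lowercase text at anything that is not a letter or an apostrophe:
words <- unlist(strsplit(tolower(kk), "[^a-z']+"))

# Count exact matches for each target word:
targets <- c("know", "known", "knowns", "unknow", "unknown", "unknowns")
counts <- vapply(targets, function(w) sum(words == w), integer(1))
counts
```

The counts for "know", "known", and "knowns" should agree with the word-boundary searches above, whereas the space-prefixed searches also count longer words that begin with the pattern.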

A.9.9 Exercise 9

Bonus task: Literature search

Download a book from https://gutenberg.org or https://www.projekt-gutenberg.org and use R to perform some quantitative analysis (e.g., counting the frequency of certain names, of key terms, or contrasting the frequency of the pronouns “he” vs. “she”) on it.

Hint: The R package gutenbergr (Robinson, 2020) allows searching, downloading, and processing public domain works from the https://www.gutenberg.org collection. Here’s an example to obtain a book by William James:

# install.packages('gutenbergr')
library(gutenbergr)
library(stringr)  # provides str_detect() used below

# Inspect metadata:
# gutenberg_metadata

# Search for an author:
wj_works <- gutenberg_works(str_detect(author, "James, William"))
wj_works

# Download a book by its id:
meaning_of_truth <- gutenberg_download(5117)

# Text of book (first lines):
meaning_of_truth$text[1:30]
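
Once the text of a book is available as a character vector of lines, contrasting the frequency of pronouns could proceed along the following lines. This is a minimal base R sketch on made-up toy lines (not taken from any real book); the helper count_word() is a hypothetical name:

```r
# Toy lines standing in for a downloaded text (not from any real book):
book_lines <- c("He said that she would come.",
                "She knew he was right, and the others agreed.",
                "Then he left.")

# Hypothetical helper: count whole-word matches, ignoring case
count_word <- function(word, text) {
  pattern <- paste0("\\b", word, "\\b")  # word boundaries on both sides
  hits <- gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE)
  sum(lengths(regmatches(text, hits)))
}

n_he  <- count_word("he", book_lines)
n_she <- count_word("she", book_lines)
c(he = n_he, she = n_she)
```

The word boundaries prevent "he" from being counted inside "she", "the", or "Then". For a downloaded book, book_lines would be replaced by the text column of the object returned by gutenberg_download().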

This concludes our solutions to the exercises of Chapter 9 (Section 9.7) on manipulating strings of text.

References

Robinson, D. (2020). gutenbergr: Download and process public domain works from Project Gutenberg. Retrieved from https://CRAN.R-project.org/package=gutenbergr