9.4 Exercises

Here are some exercises on manipulating strings of text with base R, regular expressions, and stringr commands.

9.4.1 Exercise 1

Escaping and typing Unicode characters

Use your knowledge of representing basic text strings and special characters (and consult a list of Unicode symbols, e.g., Wikipedia: Unicode characters) to store 2 R strings that contain at least 2 special symbols, like:

  • LaTeX commands begin with a backslash “\”. For instance, the LaTeX command for emphasis is “\emph{}”.

  • Der Käsereichtum Österreichs ist ungewöhnlich groß.

  • Hamlet says: “2b ∨ ¬2b”

Hint: See ?"'" for general information on strings in R.

Solution

Here are some examples (some of which were provided by students of this course):

# English: ----

# String: LaTeX commands begin with a backslash "\". 
#         For instance, the LaTeX command for emphasis is "\emph{}".
e1 <- 'LaTeX commands begin with a backslash "\\". \nFor instance, the LaTeX command for emphasis is "\\emph{}".'
writeLines(e1)
#> LaTeX commands begin with a backslash "\". 
#> For instance, the LaTeX command for emphasis is "\emph{}".

# String: Hamlet says: “2b ∨ ¬2b”
e2 <- "Hamlet says: \u201C2b \u2228 \u00AC2b\u201D"
writeLines(e2)
#> Hamlet says: “2b ∨ ¬2b”

# String: E-mail me at: john.doe1988@gmail.biz
e3 <- "E\U002Dmail me at\U003A john.doe1988\U0040gmail.biz"
writeLines(e3)
#> E-mail me at: john.doe1988@gmail.biz

# French: ---- 

# String: << La beauté commence au moment où vous décidez d'être vous-même. >>
f1 <- "\u00AB La beaut\u00E9 commence au moment o\u00F9 vous d\u00E9cidez d'\u00EAtre vous-m\u00EAme. \u00BB"
writeLines(f1) 
#> « La beauté commence au moment où vous décidez d'être vous-même. »

# German: ----

# String: Der Käsereichtum Österreichs ist ungewöhnlich groß.
g1 <- "Der K\u00E4sereichtum \u00D6sterreichs ist ungew\u00F6hnlich gro\u00DF."
writeLines(g1)
#> Der Käsereichtum Österreichs ist ungewöhnlich groß.

# String: Das Märchen von Schneeweißchen und Rosenrot ist schön.
g2 <- "Das M\u00E4rchen von Schneewei\u00DFchen und Rosenrot ist sch\u00F6n."
writeLines(g2)
#> Das Märchen von Schneeweißchen und Rosenrot ist schön.

# String: Große Frösche fahren gerne mit Öltankern. 
g3 <- "Gro\u00DFe Fr\u00F6sche fahren gerne mit \u00D6ltankern."
writeLines(g3)
#> Große Frösche fahren gerne mit Öltankern.
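
When the code point of a symbol is not known, base R can help to look it up. The following is a minimal sketch using utf8ToInt() and intToUtf8(); the object names (cp, hex, oel) are chosen here for illustration and are not part of the solutions above:

```r
# Look up the Unicode code point of a character
# (the character is written as an escape so that this script stays ASCII-only):
cp <- utf8ToInt("\u00DF")   # German sharp s
cp
#> [1] 223

# Format the code point as a 4-digit hexadecimal number (as used in \u escapes):
hex <- sprintf("%04X", cp)
hex
#> [1] "00DF"

# Convert code points back into a string:
oel <- intToUtf8(c(0x00D6, 0x6C))   # capital O with umlaut, followed by "l"
nchar(oel)                          # each escaped symbol counts as 1 character
#> [1] 2
```

The hexadecimal value "00DF" is the part that follows \u in the string g1 above.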

9.4.2 Exercise 2

Color names

The function colors() returns the 657 valid color names in R.

  1. Define the following 10 strings as a character vector color_candidates and use a base R function to check which of them are actual color names in R.
#>  [1] "blanchedalmond" "honeydew"       "hotpink3"       "palevioletred1"
#>  [5] "royalpink"      "sadblue"        "saddlebrown2"   "snowwhite"     
#>  [9] "tan4"           "yello3"

Hint: Half of these names are actual R color names, whereas the others are not. Take a guess which are which prior to checking it!

Solution

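# Define the 10 candidate strings:
color_candidates <- c("blanchedalmond", "honeydew", "hotpink3", "palevioletred1",
                      "royalpink", "sadblue", "saddlebrown2", "snowwhite",
                      "tan4", "yello3")

# (1) using %in%: 
color_candidates %in% colors()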
#>  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE

true_colors <- color_candidates[color_candidates %in% colors()]
true_colors
#> [1] "blanchedalmond" "honeydew"       "hotpink3"       "palevioletred1"
#> [5] "tan4"

no_colors <- color_candidates[!(color_candidates %in% colors())]
no_colors
#> [1] "royalpink"    "sadblue"      "saddlebrown2" "snowwhite"    "yello3"

# (2) using is.element: 
is.element(el = color_candidates, set = colors())
#>  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE

Answer

  • True color names: "blanchedalmond", "honeydew", "hotpink3", "palevioletred1", "tan4".

  • Not color names: "royalpink", "sadblue", "saddlebrown2", "snowwhite", "yello3".
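
As the task is essentially a comparison of two sets, base R's set functions offer a compact alternative. This is only a sketch of another possible route (the vector is re-defined here so that the code runs on its own):

```r
# The 10 candidate strings from the exercise:
color_candidates <- c("blanchedalmond", "honeydew", "hotpink3", "palevioletred1",
                      "royalpink", "sadblue", "saddlebrown2", "snowwhite",
                      "tan4", "yello3")

# Candidates that are also valid color names:
true_colors <- intersect(color_candidates, colors())
true_colors
#> [1] "blanchedalmond" "honeydew"       "hotpink3"       "palevioletred1"
#> [5] "tan4"

# Candidates that are not valid color names:
no_colors <- setdiff(color_candidates, colors())
no_colors
#> [1] "royalpink"    "sadblue"      "saddlebrown2" "snowwhite"    "yello3"
```

Both functions keep the order of their first argument, but would drop duplicates (of which there are none here).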

  2. How many of the 657 valid color names begin with either gray or grey? Use 2 base R functions to find this out.

Hint: One of the functions is sum, the other one should check for an initial substring in colors().

Solution

# How many of the 657 valid color names begin with either gray or grey?

# (a) base R solutions:
# in 2 parts:
sum(substr(colors(), 1, 4) == "gray") 
#> [1] 102
sum(substr(colors(), 1, 4) == "grey")
#> [1] 102
# together:
sum( (substr(colors(), 1, 4) == "gray") | (substr(colors(), 1, 4) == "grey") )
#> [1] 204

# (b) stringr solutions:
# stringr::str_count(colors(), "^gr(a|e)y")
sum(stringr::str_count(colors(), "^gr(a|e)y"))     # Solution 1: Sum of 204 strings starting with pattern 
#> [1] 204
length(stringr::str_subset(colors(), "^gr(a|e)y")) # Solution 2: Length of set of 204 strings that start with pattern
#> [1] 204
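
Two further base R routes to the same count, sketched here as possible alternatives: startsWith() checks for a fixed initial substring, and grepl() checks for a regular expression. The object names n_prefix and n_regex are chosen for illustration only:

```r
# Count color names that start with "gray" or "grey":
n_prefix <- sum(startsWith(colors(), "gray") | startsWith(colors(), "grey"))
n_regex  <- sum(grepl("^gr[ae]y", colors()))   # [ae] matches either "a" or "e"

n_prefix
#> [1] 204
n_regex
#> [1] 204
```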

  3. How many of the 657 valid color names contain gray or grey?

Solution

# How many of the 657 valid color names contain gray or grey?

sum(stringr::str_count(colors(), "gr(a|e)y"))  # 224
#> [1] 224
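
Note that summing match counts answers a slightly different question than counting matching strings: the two only agree when no string contains the pattern more than once (as is the case for colors()). The following base R sketch illustrates the difference with a small, made-up vector toy that does not contain actual color names:

```r
toy <- c("gray", "graygrey", "red")   # hypothetical strings for illustration
pattern <- "gr(a|e)y"

# Number of matches per string (analogous to counting occurrences):
n_matches <- lengths(regmatches(toy, gregexpr(pattern, toy)))
n_matches
#> [1] 1 2 0
sum(n_matches)             # total number of matches
#> [1] 3

# Number of strings with at least one match (analogous to detecting):
sum(grepl(pattern, toy))
#> [1] 2

# For the color names, each name contains the pattern at most once:
sum(grepl(pattern, colors()))
#> [1] 224
```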

  4. Which of the 657 valid color names contain gray or grey, but neither begin nor end with gray or grey?

Solution

set1 <- stringr::str_subset(colors(), "gr(a|e)y")   # containing "gray" or "grey"
set2 <- stringr::str_subset(colors(), "^gr(a|e)y")  # starting with "gray" or "grey"
set3 <- stringr::str_subset(colors(), "gr(a|e)y$")  # ending with "gray" or "grey"

# Set differences: 
set4 <- setdiff(set1, set2)  # containing, but NOT starting with the pattern
setdiff(set4, set3)          # containing, but neither starting nor ending with the pattern
#> [1] "darkslategray1" "darkslategray2" "darkslategray3" "darkslategray4"
#> [5] "slategray1"     "slategray2"     "slategray3"     "slategray4"
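
The same selection can be expressed by combining logical vectors, rather than by computing set differences. This is a base R sketch of that idea (the object names are chosen for illustration):

```r
cols <- colors()

contains <- grepl("gr[ae]y",  cols)   # contains "gray" or "grey"
starts   <- grepl("^gr[ae]y", cols)   # starts with "gray" or "grey"
ends     <- grepl("gr[ae]y$", cols)   # ends with "gray" or "grey"

# Keep names that contain the pattern, but neither start nor end with it:
inner <- cols[contains & !starts & !ends]
inner
#> [1] "darkslategray1" "darkslategray2" "darkslategray3" "darkslategray4"
#> [5] "slategray1"     "slategray2"     "slategray3"     "slategray4"
```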

  5. Which of the 657 valid color names begin and end with a vowel?

Solution

set1 <- stringr::str_subset(colors(), "^[aeiou]") # beginning with a vowel
set2 <- stringr::str_subset(colors(), "[aeiou]$") # ending with a vowel
intersect(set2, set1)
#> [1] "aliceblue"    "antiquewhite" "aquamarine"   "azure"        "oldlace"     
#> [6] "orange"
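
Both conditions can also be combined in a single regular expression. The following base R sketch assumes that every name has at least 2 characters (a name consisting of a single vowel would not be matched by this pattern):

```r
# ^[aeiou] : starts with a vowel
# .*       : followed by any number of arbitrary characters
# [aeiou]$ : ends with a vowel
vowel_names <- grep("^[aeiou].*[aeiou]$", colors(), value = TRUE)
vowel_names
#> [1] "aliceblue"    "antiquewhite" "aquamarine"   "azure"        "oldlace"     
#> [6] "orange"
```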

9.4.3 Exercise 3

Pasting vectors

Suppose you wanted to create names for 50 image files. The 1st of them should be called “img_1.png”, and the last should be called “img_50.png”.

  • Can you create all 50 file names in 1 R command?

Hint: Yes, you can — use paste() or paste0() in combination with a numeric vector.

Solution

paste("img_",  1:50, ".png", sep = "")
#>  [1] "img_1.png"  "img_2.png"  "img_3.png"  "img_4.png"  "img_5.png" 
#>  [6] "img_6.png"  "img_7.png"  "img_8.png"  "img_9.png"  "img_10.png"
#> [11] "img_11.png" "img_12.png" "img_13.png" "img_14.png" "img_15.png"
#> [16] "img_16.png" "img_17.png" "img_18.png" "img_19.png" "img_20.png"
#> [21] "img_21.png" "img_22.png" "img_23.png" "img_24.png" "img_25.png"
#> [26] "img_26.png" "img_27.png" "img_28.png" "img_29.png" "img_30.png"
#> [31] "img_31.png" "img_32.png" "img_33.png" "img_34.png" "img_35.png"
#> [36] "img_36.png" "img_37.png" "img_38.png" "img_39.png" "img_40.png"
#> [41] "img_41.png" "img_42.png" "img_43.png" "img_44.png" "img_45.png"
#> [46] "img_46.png" "img_47.png" "img_48.png" "img_49.png" "img_50.png"
paste0("img_", 1:50, ".png")
#>  [1] "img_1.png"  "img_2.png"  "img_3.png"  "img_4.png"  "img_5.png" 
#>  [6] "img_6.png"  "img_7.png"  "img_8.png"  "img_9.png"  "img_10.png"
#> [11] "img_11.png" "img_12.png" "img_13.png" "img_14.png" "img_15.png"
#> [16] "img_16.png" "img_17.png" "img_18.png" "img_19.png" "img_20.png"
#> [21] "img_21.png" "img_22.png" "img_23.png" "img_24.png" "img_25.png"
#> [26] "img_26.png" "img_27.png" "img_28.png" "img_29.png" "img_30.png"
#> [31] "img_31.png" "img_32.png" "img_33.png" "img_34.png" "img_35.png"
#> [36] "img_36.png" "img_37.png" "img_38.png" "img_39.png" "img_40.png"
#> [41] "img_41.png" "img_42.png" "img_43.png" "img_44.png" "img_45.png"
#> [46] "img_46.png" "img_47.png" "img_48.png" "img_49.png" "img_50.png"

Note: paste() and paste0() automatically convert the numeric vector 1:50 into character strings. However, because the file names contain different numbers of digits, they are not sorted correctly in directory views (e.g., "img_10.png" is listed before "img_2.png"). A quick and dirty solution is to add a leading zero to the single-digit numbers:

paste("img_",  c(paste0("0", 1:9), 10:50), ".png", sep = "")
#>  [1] "img_01.png" "img_02.png" "img_03.png" "img_04.png" "img_05.png"
#>  [6] "img_06.png" "img_07.png" "img_08.png" "img_09.png" "img_10.png"
#> [11] "img_11.png" "img_12.png" "img_13.png" "img_14.png" "img_15.png"
#> [16] "img_16.png" "img_17.png" "img_18.png" "img_19.png" "img_20.png"
#> [21] "img_21.png" "img_22.png" "img_23.png" "img_24.png" "img_25.png"
#> [26] "img_26.png" "img_27.png" "img_28.png" "img_29.png" "img_30.png"
#> [31] "img_31.png" "img_32.png" "img_33.png" "img_34.png" "img_35.png"
#> [36] "img_36.png" "img_37.png" "img_38.png" "img_39.png" "img_40.png"
#> [41] "img_41.png" "img_42.png" "img_43.png" "img_44.png" "img_45.png"
#> [46] "img_46.png" "img_47.png" "img_48.png" "img_49.png" "img_50.png"
paste0("img_", c(paste0("0", 1:9), 10:50), ".png")
#>  [1] "img_01.png" "img_02.png" "img_03.png" "img_04.png" "img_05.png"
#>  [6] "img_06.png" "img_07.png" "img_08.png" "img_09.png" "img_10.png"
#> [11] "img_11.png" "img_12.png" "img_13.png" "img_14.png" "img_15.png"
#> [16] "img_16.png" "img_17.png" "img_18.png" "img_19.png" "img_20.png"
#> [21] "img_21.png" "img_22.png" "img_23.png" "img_24.png" "img_25.png"
#> [26] "img_26.png" "img_27.png" "img_28.png" "img_29.png" "img_30.png"
#> [31] "img_31.png" "img_32.png" "img_33.png" "img_34.png" "img_35.png"
#> [36] "img_36.png" "img_37.png" "img_38.png" "img_39.png" "img_40.png"
#> [41] "img_41.png" "img_42.png" "img_43.png" "img_44.png" "img_45.png"
#> [46] "img_46.png" "img_47.png" "img_48.png" "img_49.png" "img_50.png"
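
Base R also provides formatting functions that pad numbers with leading zeros. The following sketch shows 2 possible variants using sprintf() and formatC(); the object names are chosen for illustration:

```r
# sprintf(): "%02d" formats an integer with at least 2 digits, padded with zeros
files_1 <- sprintf("img_%02d.png", 1:50)

# formatC(): width = 2 and flag = "0" have the same effect
files_2 <- paste0("img_", formatC(1:50, width = 2, flag = "0"), ".png")

head(files_1, 3)
#> [1] "img_01.png" "img_02.png" "img_03.png"
identical(files_1, files_2)
#> [1] TRUE
```

Changing the width (e.g., to "%03d") accommodates larger numbers of files.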

A more general solution is provided by the num_as_char() function of the ds4psy package. This function turns numbers (i.e., objects of type integer or double) into character sequences (i.e., numeric digits of type character) and allows specifying n_pre_dec (i.e., a desired number of digits prior to the decimal separator) and n_dec (i.e., the desired number of digits after the decimal separator):

library(ds4psy)  # requires version 0.1.0+

# Explore num_as_char function:
num_as_char(1/3, n_pre_dec = 2, n_dec = 2)
#> [1] "00.33"
num_as_char(2/3, n_pre_dec = 2, n_dec = 2)  # rounding up
#> [1] "00.67"
num_as_char(2/3, n_pre_dec = 2, n_dec = 0)  # rounding up
#> [1] "01"
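
For comparison, similar results can be obtained with base R's sprintf(). This is merely a sketch that reproduces the 3 outputs above; it makes no claim about how num_as_char() works internally:

```r
# "%05.2f": total width of 5 characters (incl. the decimal point),
#           2 decimal digits, padded with leading zeros
a <- sprintf("%05.2f", 1/3)
a
#> [1] "00.33"
b <- sprintf("%05.2f", 2/3)   # rounding up
b
#> [1] "00.67"

# "%02.0f": total width of 2 characters, no decimal digits
d <- sprintf("%02.0f", 2/3)   # rounding up
d
#> [1] "01"
```

In contrast to the n_pre_dec argument, the width in sprintf() refers to the entire string, so it has to be adjusted whenever the number of decimal digits changes.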

As num_as_char() also works with vector inputs, using this function in combination with paste0 would solve our problem:

# Use num_as_char:
paste0("img_", num_as_char(1:50, n_dec = 0), ".png")
#>  [1] "img_01.png" "img_02.png" "img_03.png" "img_04.png" "img_05.png"
#>  [6] "img_06.png" "img_07.png" "img_08.png" "img_09.png" "img_10.png"
#> [11] "img_11.png" "img_12.png" "img_13.png" "img_14.png" "img_15.png"
#> [16] "img_16.png" "img_17.png" "img_18.png" "img_19.png" "img_20.png"
#> [21] "img_21.png" "img_22.png" "img_23.png" "img_24.png" "img_25.png"
#> [26] "img_26.png" "img_27.png" "img_28.png" "img_29.png" "img_30.png"
#> [31] "img_31.png" "img_32.png" "img_33.png" "img_34.png" "img_35.png"
#> [36] "img_36.png" "img_37.png" "img_38.png" "img_39.png" "img_40.png"
#> [41] "img_41.png" "img_42.png" "img_43.png" "img_44.png" "img_45.png"
#> [46] "img_46.png" "img_47.png" "img_48.png" "img_49.png" "img_50.png"

We will further examine num_as_char() in Chapter 11: Functions (see Section 11.4.6).

9.4.4 Exercise 4

Detecting patterns in pi

  1. Does the sequence “1234” occur within the first 100,000 digits of pi? (Use an R command that answers this question by yielding TRUE or FALSE.)

Hint: Use the pi_100k data provided by the ds4psy package to answer this question.

Solution

s <- ds4psy::pi_100k  # load data
p <- "1234"

stringr::str_detect(string = s, pattern = p)
#> [1] TRUE

  2. At which location does the sequence “1234” occur within the first 100,000 digits of pi?

Solution

# Detecting the 1st occurrence of a pattern in a string:
stringr::str_locate(string = s, pattern = p)
#>      start   end
#> [1,] 13809 13812

  3. How often and at which locations does the sequence “1234” occur within the first 100,000 digits of pi?

Solution

# Detecting all occurrences of a pattern in a string:
stringr::str_locate_all(string = s, pattern = p)
#> [[1]]
#>      start   end
#> [1,] 13809 13812
#> [2,] 26291 26294
#> [3,] 49704 49707
#> [4,] 57132 57135
#> [5,] 73923 73926
#> [6,] 80404 80407
#> [7,] 82047 82050
#> [8,] 96424 96427
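
The base R counterparts of these stringr functions are grepl(), regexpr(), and gregexpr(). To keep the following sketch self-contained, it uses a short stand-in string pi_50 (the first 50 decimal digits of pi, typed in by hand) and the pattern "26", rather than the pi_100k data:

```r
# Stand-in data (assumption: a short sample, NOT the pi_100k data):
pi_50 <- "3.14159265358979323846264338327950288419716939937510"
p2 <- "26"

# Does the pattern occur?
grepl(p2, pi_50, fixed = TRUE)
#> [1] TRUE

# Start of the 1st occurrence:
first <- as.integer(regexpr(p2, pi_50, fixed = TRUE))
first
#> [1] 8

# Start and end of all occurrences:
m <- gregexpr(p2, pi_50, fixed = TRUE)[[1]]
start <- as.integer(m)
end   <- start + attr(m, "match.length") - 1L
cbind(start, end)
#>      start end
#> [1,]     8   9
#> [2,]    23  24
```

The positions count all characters of the string, including the leading "3" and the decimal point.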

  4. Locate and extract all occurrences of "2_4_6_8" from the first 100,000 digits of pi (where _ could match any digit).

Solution

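p <- "2.4.6.8"         # option 1: "." matches any character
p <- "2\\d4\\d6\\d8"   # option 2: "\\d" matches only digits (replaces option 1 and is used below)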

stringr::str_locate_all(string = s, pattern = p)  # 4 matches
#> [[1]]
#>      start   end
#> [1,]  5840  5846
#> [2,] 65714 65720
#> [3,] 69308 69314
#> [4,] 85440 85446

# Extract all matches:
stringr::str_extract_all(string = s, pattern = p, simplify = FALSE)
#> [[1]]
#> [1] "2242648" "2243608" "2942648" "2240658"
stringr::str_extract_all(string = s, pattern = p, simplify = TRUE)
#>      [,1]      [,2]      [,3]      [,4]     
#> [1,] "2242648" "2243608" "2942648" "2240658"
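
The difference between the wildcards "." and "\\d" matters whenever a string contains non-digit characters, such as a decimal point. The following base R sketch illustrates this with a short stand-in string pi_50 (the first 50 decimal digits of pi, typed in by hand, rather than the pi_100k data) and shorter patterns:

```r
# Stand-in data (assumption: a short sample, NOT the pi_100k data):
pi_50 <- "3.14159265358979323846264338327950288419716939937510"

# "." matches any character, including the decimal point:
grepl("3.1", pi_50)     # matches the initial "3.1"
#> [1] TRUE

# "\\d" matches only digits, so the decimal point does not qualify:
grepl("3\\d1", pi_50)
#> [1] FALSE

# Extract all matches of "3", followed by any digit, followed by "3":
hits <- regmatches(pi_50, gregexpr("3\\d3", pi_50))[[1]]
hits
#> [1] "323" "383"
```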

9.4.5 Exercise 5