17 Strings

In this chapter, we discuss string manipulation, focusing on pattern searching, matching, and replacement.

We first introduce three groups of functions:

grep() and grepl() for pattern searching and matching
sub() and gsub() for pattern matching and replacement
stringr functions to
- detect strings
- count the number of matches in strings
- locate strings
- extract strings
- match strings
- split strings

Then we show an example of string manipulation using our demo dataset.

17.1 Regular expressions

Base R provides a group of functions that search, match, and replace strings based on a pattern.

grep(), grepl(), regexpr(), gregexpr() and regexec() search for matches of a pattern within each element of a character vector. sub() and gsub() perform replacement of the first and all matches of a pattern.

A pattern is described by a regular expression, a sequence of characters that specifies a search pattern. Regular expressions are constructed analogously to arithmetic expressions, by using various operators to combine smaller expressions. Such patterns are typically used by string-searching algorithms for “find” or “find and replace” operations on strings. Use ?"regular expression" to read about regular expressions as used in R.

As a simple case, suppose we write a program that checks whether a password entered by a user satisfies the rules below. Regular expressions in their most basic form can help.

At least 1 letter between a-z and 1 letter between A-Z.
At least 1 number between 0-9.
At least 1 character from $#@.
Minimum length 6 characters.
Maximum length 16 characters.

For instance, to check whether the input password contains at least one letter between a-z, we can use grepl("[a-z]", password).
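The rules listed above can be combined into a single check. The following is a minimal sketch, not a complete validator: the function name is_valid_password() is hypothetical, and it assumes the input is a single character string.

```r
# Hypothetical helper: TRUE only if one password string meets all rules above
is_valid_password <- function(password) {
  grepl("[a-z]", password) &&      # at least one lowercase letter
    grepl("[A-Z]", password) &&    # at least one uppercase letter
    grepl("[0-9]", password) &&    # at least one digit
    grepl("[$#@]", password) &&    # at least one of $, # or @
    nchar(password) >= 6 &&        # minimum length
    nchar(password) <= 16          # maximum length
}

is_valid_password("Abc123$")   # TRUE
is_valid_password("abc123")    # FALSE: no uppercase letter, no $#@ character
```

Each rule is one grepl() or nchar() call, so a failed rule is easy to locate by evaluating the conditions one at a time.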

17.2 `grep()`, `grepl()`

We start with the pair of functions grep() and grepl().

grep(pattern, x) and grepl(pattern, x) search for matches of a pattern in a character vector. Both functions need a pattern and an x argument. pattern is a character string containing a regular expression to be matched in the given character vector. x is the character vector from which matches are sought.

`grep()`

grep() returns an integer vector, which refers to the elements in the character vector that contain a match.

string <- c("NYU Shanghai was founded in 2012.", 
            "NYU Shanghai is China's first Sino-US research university.", 
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")

# letters
grep("[a-zA-Z]", string)

## [1] 1 2 3

# numbers
grep("[0-9]", string)

## [1] 1 3

# characters from $#@
grep("[$#@]", string)

## integer(0)

`grepl()`

grepl() returns a logical vector with TRUE or FALSE, and indicates which elements of the character vector contain a match. It returns TRUE when a pattern is found in the string.

string <- c("NYU Shanghai was founded in 2012.", 
            "NYU Shanghai is China's first Sino-US research university.", 
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")

# letters
grepl("[a-zA-Z]", string)

## [1] TRUE TRUE TRUE

# numbers
grepl("[0-9]", string)

## [1]  TRUE FALSE  TRUE

# characters from $#@
grepl("[$#@]", string)

## [1] FALSE FALSE FALSE
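The two functions report the same matches in different forms. As a small sketch on a made-up vector x (not part of the demo data), the positions from grep() equal which() applied to the logical vector from grepl(), and either result can be used to subset the matching elements.

```r
# Toy input, assumed for illustration only
x <- c("founded in 2012", "no digits here", "class of 294")

idx  <- grep("[0-9]", x)    # integer positions of the matching elements
flag <- grepl("[0-9]", x)   # one TRUE/FALSE per element

identical(idx, which(flag))                          # TRUE
identical(x[flag], grep("[0-9]", x, value = TRUE))   # TRUE
```

grepl() is usually the more convenient choice inside logical conditions or filters, because its result always has the same length as the input.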

17.3 `sub()`, `gsub()`

The second pair is sub(pattern, replacement, x) and gsub(pattern, replacement, x).

These functions search a character vector for matches, and replace the substrings where a pattern is matched.

Argument x is a character vector where matches are sought. Elements of the character vector x that are not substituted will be returned unchanged.

sub() replaces only the first occurrence of a pattern. gsub() replaces all occurrences. Compare the results from sub() and gsub().

string <- c("A_a_1", "B_b_2", "C_c_3")
sub("_", ".", string)

## [1] "A.a_1" "B.b_2" "C.c_3"

gsub("_", ".", string)

## [1] "A.a.1" "B.b.2" "C.c.3"

In another example, we remove the prefixes from variable names that end in Open. The regular expression "^.*\\." matches everything from the start of the string up to and including a period; because .* is greedy, the match extends to the last period in the string. Setting replacement = "" deletes whatever the pattern matches.

string <- c("AAPL.Open", "AFL.Open", "MMM.Open")
sub("^.*\\.","", string)

## [1] "Open" "Open" "Open"
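The greedy behaviour of .* matters once a name contains more than one period. The sketch below uses "BRK.B.Open", a hypothetical variable name assumed here for illustration, and contrasts the greedy pattern with one that stops at the first period.

```r
# "BRK.B.Open" is an assumed example with two periods
x <- c("AAPL.Open", "BRK.B.Open")

# Greedy: removes everything up to and including the LAST period
sub("^.*\\.", "", x)      # "Open" "Open"

# [^.]* cannot cross a period, so the match ends at the FIRST period
sub("^[^.]*\\.", "", x)   # "Open" "B.Open"
```

Which variant is appropriate depends on whether the part to keep is the last dot-separated piece or everything after the first one.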

17.4 `stringr`

The stringr package provides pattern matching functions for common string manipulation tasks: detecting, locating, extracting, matching, replacing, and splitting strings. The package is part of the tidyverse and is built on top of the stringi package.

library(stringr)

Each stringr pattern matching function has the same first two arguments, a character vector of strings to process (string) and a single pattern to match (pattern).

detecting strings

str_detect(string, pattern) detects the presence or absence of a pattern in a string and returns a logical vector, similar to grepl().

string <- c("NYU Shanghai was founded in 2012.", 
            "NYU Shanghai is China's first Sino-US research university.", 
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")

# letters
str_detect(string, "[a-zA-Z]")

## [1] TRUE TRUE TRUE

# numbers
str_detect(string, "[0-9]")

## [1]  TRUE FALSE  TRUE

# characters from $#@
str_detect(string, "[$#@]")

## [1] FALSE FALSE FALSE

counting the number of matches in strings

str_count(string, pattern) counts the number of matches in a string.

string <- c("NYU Shanghai was founded in 2012.", 
            "NYU Shanghai is China's first Sino-US research university.", 
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")

# letters
str_count(string, "[a-zA-Z]")

## [1]  23  48 107

# numbers
str_count(string, "[0-9]")

## [1] 4 0 7

# characters from $#@
str_count(string, "[$#@]")

## [1] 0 0 0

locating strings

str_locate_all(string, pattern) locates the positions of all matches in a string. It returns a list of integer matrices. In each of these matrices, the first column gives start positions of matches, and the second column gives their end positions.

string <- c("NYU Shanghai was founded in 2012.", 
            "NYU Shanghai is China's first Sino-US research university.", 
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")

str_locate_all(string, "NYU")

## [[1]]
##      start end
## [1,]     1   3
## 
## [[2]]
##      start end
## [1,]     1   3
## 
## [[3]]
##      start end

extracting strings

str_extract_all(string, pattern) extracts all matches and returns a list of character vectors.

string <- c("NYU Shanghai was founded in 2012.", 
            "NYU Shanghai is China's first Sino-US research university.", 
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")

str_extract_all(string, "[0-9]")

## [[1]]
## [1] "2" "0" "1" "2"
## 
## [[2]]
## character(0)
## 
## [[3]]
## [1] "2" "9" "4" "5" "1" "4" "9"

str_extract(string, pattern) extracts text corresponding to the first match, and returns a character vector.

str_extract(string, "[0-9]")

## [1] "2" NA  "2"

matching strings

str_match_all(string, pattern) extracts matched groups from all matches and returns a list of character matrices.

string <- c("NYU Shanghai was founded in 2012.", 
            "NYU Shanghai is China's first Sino-US research university.", 
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")

str_match_all(string, "[0-9]")

## [[1]]
##      [,1]
## [1,] "2" 
## [2,] "0" 
## [3,] "1" 
## [4,] "2" 
## 
## [[2]]
##      [,1]
## 
## [[3]]
##      [,1]
## [1,] "2" 
## [2,] "9" 
## [3,] "4" 
## [4,] "5" 
## [5,] "1" 
## [6,] "4" 
## [7,] "9"
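Because the pattern "[0-9]" contains no groups, the matrices above have a single column holding the whole match. The sketch below adds parenthesised groups to show the extra columns; it uses str_match(), the first-match counterpart of str_match_all(), and the pattern is an assumption chosen for these variable names.

```r
library(stringr)

x <- c("AAPL.Open", "AFL.Open", "MMM.Open")

# Two groups: the ticker before the period and the field name after it
m <- str_match(x, "^([A-Z]+)\\.([A-Za-z]+)$")

m[, 1]   # the complete match
m[, 2]   # first group:  "AAPL" "AFL" "MMM"
m[, 3]   # second group: "Open" "Open" "Open"
```

Column 1 always holds the complete match, and each further column corresponds to one group in the order the groups open in the pattern.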

splitting strings

str_split(string, pattern) splits a string into pieces and returns a list of character vectors.

string <- c("NYU Shanghai was founded in 2012.", 
            "NYU Shanghai is China's first Sino-US research university.", 
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")

str_split(string," ")

## [[1]]
## [1] "NYU"      "Shanghai" "was"      "founded"  "in"       "2012."   
## 
## [[2]]
## [1] "NYU"         "Shanghai"    "is"          "China's"     "first"       "Sino-US"    
## [7] "research"    "university."
## 
## [[3]]
##  [1] "Of"        "the"       "class"     "of"        "294"       "students,"
##  [7] "51%"       "came"      "from"      "The"       "People's"  "Republic" 
## [13] "of"        "China,"    "with"      "the"       "remaining" "49%"      
## [19] "coming"    "from"      "other"     "countries" "around"    "the"      
## [25] "world."
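Replacement, the remaining task in the list at the start of this section, follows the same (string, pattern) convention with a third replacement argument. As a brief sketch, str_replace() and str_replace_all() play the roles of sub() and gsub() from the previous section.

```r
library(stringr)

x <- c("A_a_1", "B_b_2", "C_c_3")

# First match only, like sub("_", ".", x)
str_replace(x, "_", ".")       # "A.a_1" "B.b_2" "C.c_3"

# All matches, like gsub("_", ".", x)
str_replace_all(x, "_", ".")   # "A.a.1" "B.b.2" "C.c.3"
```

The main practical difference from the base functions is the argument order: stringr takes the string first, which makes these functions convenient in pipes.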

17.5 Example

Now we use our demo dataset to show an application of the string functions introduced above for string manipulation.

We have a variable, Founded, in the dataset sp500tickers. It is the year a company was founded. Some cases may contain more than one year, where a company could have been acquired by another company or gone through some other changes in its history. The variable recorded up to three events/years for each case.

##  [1] "2017"               "1880"               "1998 (1923 / 1874)"
##  [4] "1784"               "1931"               "1897"              
##  [7] "1839"               "1966"               "1952"              
## [10] "1978"               "1988"

Founded is a character variable, and the format of its number-strings is not consistent. For instance, a value may contain parentheses, whitespace, commas, slashes, or founders’ names, as in “1994 (Northrop 1939, Grumman 1930)” or “1881/1894 (1980)”. Moreover, the years are not necessarily listed in descending order, as in “2005 (Molson 1786, Coors 1873)”.

We have three goals. First, extract the numbers (years) from each case and discard all other characters. Second, split the results and organize the years in three columns, since a case has up to three years. Finally, sort the three years in descending order, from the most recent to the most remote.

using apply family functions

We take the steps below for these tasks using apply family functions.

Extract all the 4-digit number-strings.

founded <- str_extract_all(sp500tickers$Founded, "\\d{4}")

We use the stringr function str_extract_all() to do this. We extract all 4-digit numbers with the help of the regular expression \\d{4}.

head(founded)

## [[1]]
## [1] "1902"
## 
## [[2]]
## [1] "1888"
## 
## [[3]]
## [1] "2013" "1888"
## 
## [[4]]
## [1] "1981"
## 
## [[5]]
## [1] "1989"
## 
## [[6]]
## [1] "2008"

str_extract_all(sp500tickers$Founded, "\\d{4}") returns a list of character vectors, one per case, each holding the years found in that case.
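To see how the pattern copes with the irregular formats, the sketch below applies it to the three values quoted earlier in this section. The vector x is built by hand here and does not read from sp500tickers.

```r
library(stringr)

# Irregular values quoted from the Founded variable
x <- c("1994 (Northrop 1939, Grumman 1930)",
       "1881/1894 (1980)",
       "2005 (Molson 1786, Coors 1873)")

yrs <- str_extract_all(x, "\\d{4}")

yrs[[1]]       # "1994" "1939" "1930"
lengths(yrs)   # 3 3 3
```

Names, parentheses, commas, and slashes are ignored because the pattern only matches runs of exactly four digits, and the years come back in the order in which they appear in the text.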

Extract the output strings in each list component and organize them into 3 groups one by one.

We use the sapply() function to handle the list outputs from str_extract_all().

year1 <- sapply(founded, function(x) x[1])
year2 <- sapply(founded, function(x) x[2])
year3 <- sapply(founded, function(x) x[3])

head(year1)

## [1] "1902" "1888" "2013" "1981" "1989" "2008"

head(year2)

## [1] NA     NA     "1888" NA     NA     NA

head(year3)

## [1] NA NA NA NA NA NA

The outputs are character vectors.

class(year1)

## [1] "character"
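The NA values arise because indexing past the end of a vector returns NA. The sketch below shows this on a small hand-made list, founded_toy, which stands in for the real founded object, and compares three equivalent ways of writing the extraction.

```r
# Toy stand-in for the list returned by str_extract_all()
founded_toy <- list("1902", c("2013", "1888"), c("2000", "1799", "1871"))

# Second year of each case; cases with one year give NA
second_a <- sapply(founded_toy, function(x) x[2])

# Same extraction, passing the index to `[` directly
second_b <- sapply(founded_toy, `[`, 2)

# vapply() additionally checks that every result is a single string
second_c <- vapply(founded_toy, function(x) x[2], character(1))

second_a   # NA "1888" "1799"
```

All three calls return the same character vector; vapply() is the stricter option because it fails if a result does not have the declared type and length.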

Convert the output strings to numbers.

year1 <- as.numeric(year1)
year2 <- as.numeric(year2)
year3 <- as.numeric(year3)

Organize the three year vectors into a data frame of three columns. Order the years from the most recent to the most remote for each case by row.

years <- data.frame(year1, year2, year3)
Founded1 <- apply(years, 1, function(x) x[order(x, decreasing = TRUE)][1])
Founded2 <- apply(years, 1, function(x) x[order(x, decreasing = TRUE)][2])
Founded3 <- apply(years, 1, function(x) x[order(x, decreasing = TRUE)][3])

years_1 <- data.frame(sp500tickers$Founded, year1, year2, year3,
                      Founded1, Founded2, Founded3)

years_1[c(267, 321), ]

##               sp500tickers.Founded year1 year2 year3 Founded1 Founded2 Founded3
## 267             2000 (1799 / 1871)  2000  1799  1871     2000     1871     1799
## 321 2005 (Molson 1786, Coors 1873)  2005  1786  1873     2005     1873     1786

We use the function order(x, decreasing = TRUE) to obtain the positions of the three years in descending order. function(x) x[order(x, decreasing = TRUE)][n] places the three years from the most recent to the most remote. [1] means the most recent year, and [3] means the most remote one.
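The effect of order() is easiest to see on single rows. The sketch below first sorts two individual cases and then sorts all rows of a small data frame in one apply() call, as a possible alternative to three separate calls. The data frame years_toy and the object sorted are made up for illustration.

```r
# One case with three years
x <- c(2005, 1786, 1873)
order(x, decreasing = TRUE)      # 1 3 2
x[order(x, decreasing = TRUE)]   # 2005 1873 1786

# One case with a missing year: order() places NA last by default
y <- c(2013, 1888, NA)
y[order(y, decreasing = TRUE)]   # 2013 1888 NA

# Sorting every row at once on a toy data frame
years_toy <- data.frame(year1 = c(2005, 2013),
                        year2 = c(1786, 1888),
                        year3 = c(1873, NA))

# apply() returns one column per row, so transpose the result
sorted <- t(apply(years_toy, 1, function(x) x[order(x, decreasing = TRUE)]))
colnames(sorted) <- c("Founded1", "Founded2", "Founded3")
sorted
```

The single-pass version sorts each row once instead of three times, at the cost of returning a matrix whose columns have to be named afterwards.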

using `purrr` map functions with `unnest_wider()`

Alternatively, there is a “modern” approach that combines purrr map functions with tidyr::unnest_wider(). We discussed purrr map functions in the chapter Split, Apply, Combine.

This solution is more concise and takes fewer steps. unnest_wider() splits the list column returned by str_extract_all() into separate columns. map_dbl() extracts each year component from years_sorted and returns a numeric vector.

library(tidyverse)

years_2 <- sp500tickers %>%
  select(Founded) %>%
  
  # Extract all 4-digit numbers
  mutate(years = str_extract_all(Founded, "\\d{4}")) %>%
  
  # Split into three groups (each 4 digits) 
  unnest_wider(years, 
               names_sep = "_", 
               names_repair = "unique", 
               simplify = TRUE) %>%
  
  # Convert the strings to numbers
  mutate(across(starts_with("years_"), as.numeric)) %>%
  
  # Order years from most recent to most remote for each case
  rowwise() %>%
  mutate(
    years_sorted = list(sort(c_across(starts_with("years_")), decreasing = TRUE, na.last = TRUE))
  ) %>%
  ungroup() %>%
  mutate(
    Founded1 = map_dbl(years_sorted, ~ .x[1]),
    Founded2 = map_dbl(years_sorted, ~ .x[2]),
    Founded3 = map_dbl(years_sorted, ~ .x[3])
  ) 

years_2[c(267, 321), ]

## # A tibble: 2 × 8
##   Founded               years_1 years_2 years_3 years_sorted Founded1 Founded2 Founded3
##   <chr>                   <dbl>   <dbl>   <dbl> <list>          <dbl>    <dbl>    <dbl>
## 1 2000 (1799 / 1871)       2000    1799    1871 <dbl [3]>        2000     1871     1799
## 2 2005 (Molson 1786, C…    2005    1786    1873 <dbl [3]>        2005     1873     1786
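The last step, pulling one position out of every element of a list column, can be tried in isolation. The sketch below uses a small hand-made list, sorted_toy, in place of the years_sorted column; unlike that column, its elements have different lengths, which shows how missing positions are handled.

```r
library(purrr)

# Toy stand-in for a list column of sorted years
sorted_toy <- list(c(2005, 1873, 1786), c(2013, 1888), 1902)

# Each call returns a numeric vector with one value per list element
first  <- map_dbl(sorted_toy, ~ .x[1])
second <- map_dbl(sorted_toy, ~ .x[2])

first    # 2005 2013 1902
second   # 1873 1888   NA
```

Indexing a numeric vector beyond its length yields NA, so map_dbl() still receives exactly one number per element and returns a plain numeric vector.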