17 Strings

In this chapter, we discuss string manipulation, focusing on pattern searching, matching, and replacement.

We first introduce three groups of functions:

  • grep() and grepl() for pattern searching and matching
  • sub() and gsub() for pattern matching and replacement
  • stringr functions to
    • detect strings
    • count the number of matches in strings
    • locate strings
    • extract strings
    • match strings
    • split strings

Then we show an example of string manipulation using our demo dataset.

17.1 Regular expressions

There is a group of base R functions that search, match, and replace strings based on a pattern.

grep(), grepl(), regexpr(), gregexpr() and regexec() search for matches of a pattern within each element of a character vector. sub() and gsub() replace the first match and all matches of a pattern, respectively.

A pattern is described by a regular expression, a sequence of characters that specifies a search pattern. Regular expressions are constructed analogously to arithmetic expressions, by using various operators to combine smaller expressions. Such patterns are typically used by string-searching algorithms for “find” or “find and replace” operations on strings. Use ?"regular expression" to learn about regular expressions as used in R.

As a simple case, suppose we write a program that checks whether a password entered by a user is valid according to the rules below. Regular expressions in their most basic form are enough for the job.

  • At least 1 letter between a-z and 1 letter between A-Z.
  • At least 1 number between 0-9.
  • At least 1 character from $#@.
  • Minimum length 6 characters.
  • Maximum length 16 characters.

For instance, to check whether the input password contains at least one letter between a-z, we can use grepl("[a-z]", password).
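The rules listed above can be combined into a single check. The following is a minimal sketch, not a production validator: the function name is_valid_password() is hypothetical, and the character classes assume plain ASCII input.

```r
# Hypothetical helper: test one password against the five rules listed above.
is_valid_password <- function(password) {
  n <- nchar(password)
  grepl("[a-z]", password) &&    # at least one lowercase letter
    grepl("[A-Z]", password) &&  # at least one uppercase letter
    grepl("[0-9]", password) &&  # at least one digit
    grepl("[$#@]", password) &&  # at least one of $, #, @
    n >= 6 && n <= 16            # length between 6 and 16 characters
}

is_valid_password("Abc123$")  # TRUE: every rule is satisfied
is_valid_password("abc123")   # FALSE: no uppercase letter and no special character
```

Each rule is one grepl() call, so a failing rule is easy to identify by running the corresponding line on its own.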

17.2 grep(), grepl()

We start with the pair of functions grep() and grepl().

grep(pattern, x) and grepl(pattern, x) search for matches of a pattern in a character vector. Both functions need a pattern and an x argument. pattern is a character string containing a regular expression to be matched in the given character vector. x is the character vector from which matches are sought.

grep()

grep() returns an integer vector of indices, identifying the elements of the character vector that contain a match.

string <- c("NYU Shanghai was founded in 2012.", 
            "NYU Shanghai is China's first Sino-US research university.", 
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")
# letters
grep("[a-zA-Z]", string)
## [1] 1 2 3
# numbers
grep("[0-9]", string)
## [1] 1 3
# characters from $#@
grep("[$#@]", string)
## integer(0)
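Since grep() returns positions, the matching elements themselves can be recovered by indexing, or directly with the value = TRUE argument. A short sketch with shortened, made-up strings:

```r
x <- c("founded in 2012", "first Sino-US university", "class of 294 students")

idx <- grep("[0-9]", x)   # positions of the elements that contain a digit
idx
## [1] 1 3

# Index back into the vector to get the matching elements ...
by_index <- x[idx]

# ... or ask grep() for the elements directly.
by_value <- grep("[0-9]", x, value = TRUE)

identical(by_index, by_value)
## [1] TRUE
```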

grepl()

grepl() returns a logical vector indicating which elements of the character vector contain a match: TRUE where the pattern is found, and FALSE otherwise.

string <- c("NYU Shanghai was founded in 2012.", 
            "NYU Shanghai is China's first Sino-US research university.", 
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")
# letters
grepl("[a-zA-Z]", string)
## [1] TRUE TRUE TRUE
# numbers
grepl("[0-9]", string)
## [1]  TRUE FALSE  TRUE
# characters from $#@
grepl("[$#@]", string)
## [1] FALSE FALSE FALSE
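Because grepl() returns one logical value per element, its result can be used to subset or to count. The sketch below, using a made-up vector, also shows how it relates to grep():

```r
x <- c("A_1", "B", "C_3")

has_underscore <- grepl("_", x)

kept <- x[has_underscore]        # keep only the elements that match
n_matched <- sum(has_underscore) # number of elements with a match

# which() turns the logical vector into the indices that grep() returns.
same_as_grep <- identical(which(has_underscore), grep("_", x))
same_as_grep
## [1] TRUE
```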

17.3 sub(), gsub()

The second pair is sub(pattern, replacement, x) and gsub(pattern, replacement, x).

These functions search a character vector for matches, and replace the substrings where a pattern is matched.

Argument x is a character vector where matches are sought. Elements of the character vector x that are not substituted will be returned unchanged.

sub() replaces only the first occurrence of a pattern. gsub() replaces all occurrences. Compare the results from sub() and gsub().

string <- c("A_a_1", "B_b_2", "C_c_3")
sub("_", ".", string)
## [1] "A.a_1" "B.b_2" "C.c_3"
gsub("_", ".", string)
## [1] "A.a.1" "B.b.2" "C.c.3"

In another example, we remove the prefixes from variable names that end in Open. The regular expression "^.*\\." matches everything from the start of the string up to and including a period (because .* is greedy, this is the last period when there are several). Setting replacement = "" deletes whatever the pattern matches.

string <- c("AAPL.Open", "AFL.Open", "MMM.Open")
sub("^.*\\.","", string)
## [1] "Open" "Open" "Open"
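The greedy behaviour of .* matters when a name contains more than one period. The following sketch uses made-up names to compare the pattern above with one that stops at the first period:

```r
x <- c("AAPL.Open", "BRK.B.Open")  # the second name has two periods

# Greedy: .* runs to the last period, so everything before it is removed.
greedy <- sub("^.*\\.", "", x)
greedy
## [1] "Open" "Open"

# [^.]* matches only non-period characters, so the match ends at the first period.
first_only <- sub("^[^.]*\\.", "", x)
first_only
## [1] "Open"   "B.Open"
```

Which of the two is appropriate depends on whether the part after the first or after the last period is wanted.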

17.4 stringr

The stringr package provides pattern-matching functions for common string manipulation tasks: detecting, locating, extracting, matching, replacing, and splitting strings. It is part of the tidyverse and is built on top of the stringi package.

library(stringr)

Each stringr pattern matching function has the same first two arguments, a character vector of strings to process (string) and a single pattern to match (pattern).

detecting strings

str_detect(string, pattern) detects the presence or absence of a pattern in a string and returns a logical vector, similar to grepl().

string <- c("NYU Shanghai was founded in 2012.", 
            "NYU Shanghai is China's first Sino-US research university.", 
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")
# letters
str_detect(string, "[a-zA-Z]")
## [1] TRUE TRUE TRUE
# numbers
str_detect(string, "[0-9]")
## [1]  TRUE FALSE  TRUE
# characters from $#@
str_detect(string, "[$#@]")
## [1] FALSE FALSE FALSE

counting the number of matches in strings

str_count(string, pattern) counts the number of matches in a string.

string <- c("NYU Shanghai was founded in 2012.", 
            "NYU Shanghai is China's first Sino-US research university.", 
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")
# letters
str_count(string, "[a-zA-Z]")
## [1]  23  48 107
# numbers
str_count(string, "[0-9]")
## [1] 4 0 7
# characters from $#@
str_count(string, "[$#@]")
## [1] 0 0 0
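What str_count() counts depends on what a single match is. As a brief sketch with made-up strings, "[0-9]" counts individual digits, while "[0-9]+" counts runs of digits, that is, whole numbers:

```r
library(stringr)

x <- c("founded in 2012", "no digits here", "294 students, 51% and 49%")

digits <- str_count(x, "[0-9]")    # every digit is a separate match
digits
## [1] 4 0 7

numbers <- str_count(x, "[0-9]+")  # each run of digits is one match
numbers
## [1] 1 0 3
```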

locating strings

str_locate_all(string, pattern) locates the positions of all matches in a string. It returns a list of integer matrices. In each of these matrices, the first column gives start positions of matches, and the second column gives their end positions.

string <- c("NYU Shanghai was founded in 2012.", 
            "NYU Shanghai is China's first Sino-US research university.", 
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")

str_locate_all(string, "NYU")
## [[1]]
##      start end
## [1,]     1   3
## 
## [[2]]
##      start end
## [1,]     1   3
## 
## [[3]]
##      start end

extracting strings

str_extract_all(string, pattern) extracts all matches and returns a list of character vectors.

string <- c("NYU Shanghai was founded in 2012.", 
            "NYU Shanghai is China's first Sino-US research university.", 
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")

str_extract_all(string, "[0-9]")
## [[1]]
## [1] "2" "0" "1" "2"
## 
## [[2]]
## character(0)
## 
## [[3]]
## [1] "2" "9" "4" "5" "1" "4" "9"

str_extract(string, pattern) extracts text corresponding to the first match, and returns a character vector.

str_extract(string, "[0-9]")
## [1] "2" NA  "2"
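To extract whole numbers rather than single digits, the quantifier + can be added to the pattern. A short sketch with made-up strings:

```r
library(stringr)

x <- c("founded in 2012", "no digits here", "294 students, 51%")

# First run of digits in each string; NA when there is none.
first_number <- str_extract(x, "[0-9]+")   # "2012", NA, "294"

# All runs of digits in each string, as a list of character vectors.
all_numbers <- str_extract_all(x, "[0-9]+")
all_numbers[[3]]
## [1] "294" "51"
```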

matching strings

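str_match_all(string, pattern) extracts matched groups from all matches and returns a list of character matrices. In each matrix, the first column holds the complete match, followed by one column per capture group in the pattern. The pattern below has no capture groups, so each matrix has a single column.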

string <- c("NYU Shanghai was founded in 2012.", 
            "NYU Shanghai is China's first Sino-US research university.", 
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")

str_match_all(string, "[0-9]")
## [[1]]
##      [,1]
## [1,] "2" 
## [2,] "0" 
## [3,] "1" 
## [4,] "2" 
## 
## [[2]]
##      [,1]
## 
## [[3]]
##      [,1]
## [1,] "2" 
## [2,] "9" 
## [3,] "4" 
## [4,] "5" 
## [5,] "1" 
## [6,] "4" 
## [7,] "9"
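The match functions become more useful once the pattern contains capture groups, written with parentheses. The sketch below uses str_match(), which returns one matrix row per string for the first match, on made-up variable names:

```r
library(stringr)

x <- c("AAPL.Open", "MMM.Close", "no-period")

# Group 1: uppercase ticker before the period; group 2: letters after it.
m <- str_match(x, "^([A-Z]+)\\.([A-Za-z]+)$")

m[, 1]  # complete match:  "AAPL.Open" "MMM.Close" NA
m[, 2]  # first group:     "AAPL"      "MMM"       NA
m[, 3]  # second group:    "Open"      "Close"     NA
```

A string that does not match the pattern yields a row of NA values.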

splitting strings

str_split(string, pattern) splits a string into pieces and returns a list of character vectors.

string <- c("NYU Shanghai was founded in 2012.", 
            "NYU Shanghai is China's first Sino-US research university.", 
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")

str_split(string," ")
## [[1]]
## [1] "NYU"      "Shanghai" "was"      "founded"  "in"       "2012."   
## 
## [[2]]
## [1] "NYU"         "Shanghai"    "is"          "China's"     "first"      
## [6] "Sino-US"     "research"    "university."
## 
## [[3]]
##  [1] "Of"        "the"       "class"     "of"        "294"       "students,"
##  [7] "51%"       "came"      "from"      "The"       "People's"  "Republic" 
## [13] "of"        "China,"    "with"      "the"       "remaining" "49%"      
## [19] "coming"    "from"      "other"     "countries" "around"    "the"      
## [25] "world."
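str_split() also accepts the arguments n, which limits the number of pieces, and simplify, which returns a character matrix instead of a list. A short sketch with made-up strings:

```r
library(stringr)

x <- c("A_a_1", "B_b")

# Default: a list with one character vector per string.
pieces <- str_split(x, "_")

# simplify = TRUE: a character matrix, with short rows padded by "".
pieces_matrix <- str_split(x, "_", simplify = TRUE)

# n = 2: split at the first separator only.
two_pieces <- str_split(x, "_", n = 2)
two_pieces[[1]]
## [1] "A"   "a_1"
```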

17.5 Example

Now we use our demo dataset to show an application of the functions introduced above for string manipulation.

We have a variable Founded in the dataset sp500tickers, which records the year a company was founded. Some cases contain more than one year, for example when a company was acquired by another company or went through other changes in its history. The variable records up to three events/years for each case.

head(sp500tickers$Founded, 10)
##  [1] "1902"        "1888"        "2013 (1888)" "1981"        "1989"       
##  [6] "2008"        "1982"        "1969"        "1932"        "1981"

Founded is a character variable, and the format of its number-strings is not consistent. For instance, it may contain parentheses, whitespace, commas, slashes, or founders’ names, such as in “1994 (Northrop 1939, Grumman 1930)” or “1881/1894 (1980)”. Besides, the years are not necessarily in descending order, such as in “2005 (Molson 1786, Coors 1873)”.

Our goals are, first, to extract the numbers (years) from each case, discarding all other characters; then to split the strings and organize the years in three columns, since a case holds up to three years; and finally to order the three years from the most recent to the most remote.

We take the steps below for these tasks.

  1. Extract all the number-strings.
founded <- sp500tickers$Founded
founded <- str_extract_all(founded, "[0-9]")

We use the stringr function str_extract_all() to do this. The pattern "[0-9]" matches one digit at a time, so only digits are matched and returned, each as a separate one-character string.

The function returns a list of character vectors, one per case.

head(founded)
## [[1]]
## [1] "1" "9" "0" "2"
## 
## [[2]]
## [1] "1" "8" "8" "8"
## 
## [[3]]
## [1] "2" "0" "1" "3" "1" "8" "8" "8"
## 
## [[4]]
## [1] "1" "9" "8" "1"
## 
## [[5]]
## [1] "1" "9" "8" "9"
## 
## [[6]]
## [1] "2" "0" "0" "8"
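As an alternative to extracting single digits, the pattern "[0-9]{4}" matches four consecutive digits, so each year comes back as one string. This is only a sketch under the assumption that every year in the data has exactly four digits; the input below reuses example values quoted in this section rather than the full dataset, and the object names are hypothetical.

```r
library(stringr)

founded_raw <- c("1902",
                 "2013 (1888)",
                 "1994 (Northrop 1939, Grumman 1930)",
                 "1881/1894 (1980)")

# Each match is a run of exactly four digits, i.e. one year.
years_list <- str_extract_all(founded_raw, "[0-9]{4}")

years_list[[3]]
## [1] "1994" "1939" "1930"

# Number of years recorded for each case.
lengths(years_list)
## [1] 1 2 3 3
```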

  2. Split the strings into 3 groups, each consisting of 4 digits.
year1 <- t(sapply(founded, function(x) x[1:4]))
year2 <- t(sapply(founded, function(x) x[5:8]))
year3 <- t(sapply(founded, function(x) x[9:12]))

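We use sapply() to apply the same indexing to every element of the list returned by str_extract_all(). Indexing past the end of a vector yields NA, so cases with fewer than three years are padded with NA.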

head(year1)
##      [,1] [,2] [,3] [,4]
## [1,] "1"  "9"  "0"  "2" 
## [2,] "1"  "8"  "8"  "8" 
## [3,] "2"  "0"  "1"  "3" 
## [4,] "1"  "9"  "8"  "1" 
## [5,] "1"  "9"  "8"  "9" 
## [6,] "2"  "0"  "0"  "8"
head(year2)
##      [,1] [,2] [,3] [,4]
## [1,] NA   NA   NA   NA  
## [2,] NA   NA   NA   NA  
## [3,] "1"  "8"  "8"  "8" 
## [4,] NA   NA   NA   NA  
## [5,] NA   NA   NA   NA  
## [6,] NA   NA   NA   NA
head(year3)
##      [,1] [,2] [,3] [,4]
## [1,] NA   NA   NA   NA  
## [2,] NA   NA   NA   NA  
## [3,] NA   NA   NA   NA  
## [4,] NA   NA   NA   NA  
## [5,] NA   NA   NA   NA  
## [6,] NA   NA   NA   NA

The outputs are matrices.

class(year1)
## [1] "matrix" "array"

  3. Paste the 4 single digits into a whole string to indicate a year.
y1 <- apply(year1, 1, paste0, collapse = "")
y2 <- apply(year2, 1, paste0, collapse = "")
y3 <- apply(year3, 1, paste0, collapse = "")
head(y1)
## [1] "1902" "1888" "2013" "1981" "1989" "2008"
head(y2)
## [1] "NANANANA" "NANANANA" "1888"     "NANANANA" "NANANANA" "NANANANA"
head(y3)
## [1] "NANANANA" "NANANANA" "NANANANA" "NANANANA" "NANANANA" "NANANANA"

The outputs are vectors.

class(y1)
## [1] "character"

  4. Convert the strings to numbers.
y1 <- as.numeric(y1)
y2 <- as.numeric(y2)
y3 <- as.numeric(y3)

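The "NANANANA" strings cannot be parsed as numbers, so as.numeric() coerces them to NA (and warns that NAs were introduced by coercion).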

head(y1)
## [1] 1902 1888 2013 1981 1989 2008
head(y2)
## [1]   NA   NA 1888   NA   NA   NA
head(y3)
## [1] NA NA NA NA NA NA
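The pasting and conversion steps rely on "NANANANA" failing to parse as a number. One way to avoid creating that string, and the coercion warning that comes with it, is to return NA as soon as any digit is missing. The helper name collapse_year() and the small matrix below are hypothetical, chosen only to illustrate the idea:

```r
# Hypothetical helper: paste four digits into a year, or return NA if any is missing.
collapse_year <- function(digits) {
  if (anyNA(digits)) NA_character_ else paste0(digits, collapse = "")
}

# A small stand-in for a matrix such as year2: one row per case, one column per digit.
digit_matrix <- rbind(c(NA, NA, NA, NA),
                      c("1", "8", "8", "8"))

year_strings <- apply(digit_matrix, 1, collapse_year)  # NA, "1888"
year_numbers <- as.numeric(year_strings)               # NA, 1888; nothing to coerce
```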

  5. Organize the three year vectors into a data frame of three columns, and order the years from the most recent to the most remote for each case.
years <- data.frame(y1, y2, y3)
Founded1 <- apply(years, 1, function(x) x[order(x, decreasing = TRUE)][1])
Founded2 <- apply(years, 1, function(x) x[order(x, decreasing = TRUE)][2])
Founded3 <- apply(years, 1, function(x) x[order(x, decreasing = TRUE)][3])

head(years)
##     y1   y2 y3
## 1 1902   NA NA
## 2 1888   NA NA
## 3 2013 1888 NA
## 4 1981   NA NA
## 5 1989   NA NA
## 6 2008   NA NA

order(x, decreasing = TRUE) returns the positions of the three years in descending order, with NA values placed last by default. function(x) x[order(x, decreasing = TRUE)][n] therefore sorts the years of a case from the most recent to the most remote and picks the n-th one: [1] is the most recent year, and [3] is the most remote one when the case has three years (NA otherwise).
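The three apply() calls sort every row three times. As a sketch of an equivalent one-pass version, each row can be sorted once and the result transposed back into one row per case. The small data frame below is made up from years quoted in this section, and the object name sorted is hypothetical:

```r
years <- data.frame(y1 = c(1902, 2005, 1881),
                    y2 = c(NA,   1786, 1894),
                    y3 = c(NA,   1873, 1980))

# Sort each row in descending order (NA last); apply() returns the sorted rows
# as columns, so t() restores one row per case.
sorted <- t(apply(years, 1, function(x) x[order(x, decreasing = TRUE)]))
colnames(sorted) <- c("Founded1", "Founded2", "Founded3")

unname(sorted[2, ])
## [1] 2005 1873 1786
```

The columns of sorted correspond to Founded1, Founded2, and Founded3 in the step above.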