17 Strings
In this chapter, we discuss string manipulations, and focus on pattern searching, matching and replacement.
We first introduce three groups of functions:
- grep() and grepl() for pattern searching and matching
- sub() and gsub() for pattern matching and replacement
- stringr functions to
  - detect strings
  - count the number of matches in strings
  - locate strings
  - extract strings
  - match strings
  - split strings
Then we show an example of string manipulation using our demo dataset.
17.1 Regular expressions
There is a group of base R functions that search, match, and replace strings based on a pattern.
grep(), grepl(), regexpr(), gregexpr(), and regexec() search for matches of a pattern within each element of a character vector. sub() and gsub() replace the first match and all matches of a pattern, respectively.
A pattern is described by a regular expression, a sequence of characters that specifies a search pattern. Regular expressions are constructed analogously to arithmetic expressions, by using various operators to combine smaller expressions. Such patterns are usually used by string-searching algorithms for “find” or “find and replace” operations on strings. Use ?"regular expression" to learn about regular expressions as used in R.
As a simple case, suppose we write a program that checks whether a password entered by a user satisfies the rules below. Regular expressions in their most basic form are enough for this task.
- At least 1 letter between a-z and 1 letter between A-Z.
- At least 1 number between 0-9.
- At least 1 character from $#@.
- Minimum length 6 characters.
- Maximum length 16 characters.
For instance, to check whether the input password contains at least one letter between a-z, we can use grepl("[a-z]", password).
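The rules above can be combined into a single check. The following is a minimal sketch, not a definitive implementation; the function name is_valid_password is hypothetical, and the sample passwords are made up for illustration.

```r
# Hypothetical helper: TRUE for each password that satisfies all the rules.
is_valid_password <- function(password) {
  grepl("[a-z]", password) &      # at least one lowercase letter
    grepl("[A-Z]", password) &    # at least one uppercase letter
    grepl("[0-9]", password) &    # at least one digit
    grepl("[$#@]", password) &    # at least one of $, # or @
    nchar(password) >= 6 &        # minimum length
    nchar(password) <= 16         # maximum length
}

# Made-up inputs: the second has no uppercase letter or special character,
# the third is too short.
result <- is_valid_password(c("Abc$123", "abc123", "Ab1$"))
result
## [1]  TRUE FALSE FALSE
```

Because grepl() and nchar() are vectorized, the helper checks a whole vector of passwords at once.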
17.2 grep(), grepl()
We start with the pair of functions grep() and grepl().
grep(pattern, x) and grepl(pattern, x) search for matches of a pattern in a character vector. Both functions need a pattern and an x argument. pattern is a character string containing a regular expression to be matched in the given character vector. x is the character vector in which matches are sought.
grep()
grep() returns an integer vector with the indices of the elements in the character vector that contain a match.
string <- c("NYU Shanghai was founded in 2012.",
"NYU Shanghai is China's first Sino-US research university.",
"Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")
## [1] 1 2 3
## [1] 1 3
## integer(0)
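The calls that produced the output above are not shown. The sketch below uses three patterns that are assumptions, chosen because they are consistent with that output: any letter, any digit, and the special characters from the password example.

```r
string <- c("NYU Shanghai was founded in 2012.",
            "NYU Shanghai is China's first Sino-US research university.",
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")

# Assumed patterns; the original calls are not preserved.
letters_idx  <- grep("[a-zA-Z]", string)  # elements containing a letter
digits_idx   <- grep("[0-9]", string)     # elements containing a digit
specials_idx <- grep("[$#@]", string)     # elements containing $, # or @

letters_idx
## [1] 1 2 3
digits_idx
## [1] 1 3
specials_idx
## integer(0)
```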
grepl()
grepl() returns a logical vector that indicates which elements of the character vector contain a match: TRUE when the pattern is found in an element, and FALSE otherwise.
string <- c("NYU Shanghai was founded in 2012.",
"NYU Shanghai is China's first Sino-US research university.",
"Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")
## [1] TRUE TRUE TRUE
## [1] TRUE FALSE TRUE
## [1] FALSE FALSE FALSE
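As with grep(), the calls behind the output above are not shown. The sketch below assumes a digit pattern, which is consistent with the second output line, and illustrates how the two functions relate: the positions of the TRUE values from grepl() are the indices that grep() returns.

```r
string <- c("NYU Shanghai was founded in 2012.",
            "NYU Shanghai is China's first Sino-US research university.",
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")

# Assumed pattern: any digit.
has_digit <- grepl("[0-9]", string)
has_digit
## [1]  TRUE FALSE  TRUE

# which() converts the logical vector into the indices returned by grep().
which(has_digit)
## [1] 1 3
```

The logical form is convenient for subsetting, for example string[has_digit].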
17.3 sub(), gsub()
The second pair is sub(pattern, replacement, x) and gsub(pattern, replacement, x).
These functions search a character vector for matches and replace the substrings where a pattern is matched.
Argument x is a character vector in which matches are sought. Elements of x that contain no match are returned unchanged.
sub() replaces only the first occurrence of a pattern in each element. gsub() replaces all occurrences. Compare the results from sub() and gsub().
## [1] "A.a_1" "B.b_2" "C.c_3"
## [1] "A.a.1" "B.b.2" "C.c.3"
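The input behind the two output lines above is not shown. A minimal sketch that reproduces them, assuming the input vector used underscores as separators and the pattern was a single underscore:

```r
# Assumed input; the original vector is not preserved.
x <- c("A_a_1", "B_b_2", "C_c_3")

first_only <- sub("_", ".", x)   # replaces the first "_" in each element
all_matches <- gsub("_", ".", x) # replaces every "_" in each element

first_only
## [1] "A.a_1" "B.b_2" "C.c_3"
all_matches
## [1] "A.a.1" "B.b.2" "C.c.3"
```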
In another example, we remove the prefixes from the Open variables. We can use the regular expression "^.*\\." to match all characters before the period, including the period itself. replacement = "" then removes whatever the pattern matches.
## [1] "Open" "Open" "Open"
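The variable names themselves are not shown, so the sketch below uses hypothetical names of the form prefix.Open to illustrate the pattern.

```r
# Hypothetical variable names with a ticker-style prefix.
vars <- c("AAPL.Open", "MSFT.Open", "GOOG.Open")

# "^" anchors at the start, ".*" matches any run of characters,
# and "\\." matches a literal period.
stripped <- sub("^.*\\.", "", vars)
stripped
## [1] "Open" "Open" "Open"
```

Because ".*" is greedy, the match extends to the last period in each element, so a name with several periods would keep only the part after the final one.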
17.4 stringr
The stringr package provides pattern matching functions for common string manipulation tasks: detecting, locating, extracting, matching, replacing, and splitting strings. The package is part of the tidyverse and is built on top of the stringi package.
Each stringr pattern matching function has the same first two arguments: a character vector of strings to process (string) and a single pattern to match (pattern).
detecting strings
str_detect(string, pattern) detects the presence or absence of a pattern in a string and returns a logical vector, similar to grepl().
string <- c("NYU Shanghai was founded in 2012.",
"NYU Shanghai is China's first Sino-US research university.",
"Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")
## [1] TRUE TRUE TRUE
## [1] TRUE FALSE TRUE
## [1] FALSE FALSE FALSE
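The calls for the output above are not shown. The sketch below assumes a digit pattern, consistent with the second output line, and compares the result with grepl(). Note the argument order: stringr functions take the string first and the pattern second.

```r
library(stringr)

string <- c("NYU Shanghai was founded in 2012.",
            "NYU Shanghai is China's first Sino-US research university.",
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")

# Assumed pattern: any digit.
detected <- str_detect(string, "[0-9]")
detected
## [1]  TRUE FALSE  TRUE

# Same answer as the base R equivalent, with the arguments swapped.
identical(detected, grepl("[0-9]", string))
## [1] TRUE
```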
counting the number of matches in strings
str_count(string, pattern) counts the number of matches in each string.
string <- c("NYU Shanghai was founded in 2012.",
"NYU Shanghai is China's first Sino-US research university.",
"Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")
## [1] 23 48 107
## [1] 4 0 7
## [1] 0 0 0
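The calls for the three output lines above are not shown. The patterns below are assumptions that reproduce them: the first counts letters, the second counts digits, and the third counts the special characters from the password example.

```r
library(stringr)

string <- c("NYU Shanghai was founded in 2012.",
            "NYU Shanghai is China's first Sino-US research university.",
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")

# Assumed patterns; the original calls are not preserved.
n_letters  <- str_count(string, "[a-zA-Z]")
n_digits   <- str_count(string, "[0-9]")
n_specials <- str_count(string, "[$#@]")

n_letters
## [1]  23  48 107
n_digits
## [1] 4 0 7
n_specials
## [1] 0 0 0
```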
locating strings
str_locate_all(string, pattern) locates the positions of all matches in a string. It returns a list of integer matrices. In each matrix, the first column gives the start positions of the matches, and the second column gives their end positions.
string <- c("NYU Shanghai was founded in 2012.",
"NYU Shanghai is China's first Sino-US research university.",
"Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")
str_locate_all(string, "NYU")
## [[1]]
## start end
## [1,] 1 3
##
## [[2]]
## start end
## [1,] 1 3
##
## [[3]]
## start end
extracting strings
str_extract_all(string, pattern) extracts all matches and returns a list of character vectors.
string <- c("NYU Shanghai was founded in 2012.",
"NYU Shanghai is China's first Sino-US research university.",
"Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")
str_extract_all(string, "[0-9]")
## [[1]]
## [1] "2" "0" "1" "2"
##
## [[2]]
## character(0)
##
## [[3]]
## [1] "2" "9" "4" "5" "1" "4" "9"
str_extract(string, pattern) extracts the text corresponding to the first match in each string and returns a character vector.
## [1] "2" NA "2"
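The call for the output above is not shown; it is consistent with reusing the digit pattern from the str_extract_all() example, as in this sketch.

```r
library(stringr)

string <- c("NYU Shanghai was founded in 2012.",
            "NYU Shanghai is China's first Sino-US research university.",
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")

# Only the first digit of each string is returned; NA marks no match.
first_digit <- str_extract(string, "[0-9]")
first_digit
## [1] "2" NA  "2"
```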
matching strings
str_match_all(string, pattern) extracts matched groups from all matches and returns a list of character matrices.
string <- c("NYU Shanghai was founded in 2012.",
"NYU Shanghai is China's first Sino-US research university.",
"Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")
str_match_all(string, "[0-9]")
## [[1]]
## [,1]
## [1,] "2"
## [2,] "0"
## [3,] "1"
## [4,] "2"
##
## [[2]]
## [,1]
##
## [[3]]
## [,1]
## [1,] "2"
## [2,] "9"
## [3,] "4"
## [4,] "5"
## [5,] "1"
## [6,] "4"
## [7,] "9"
splitting strings
str_split(string, pattern) splits a string into pieces and returns a list of character vectors.
string <- c("NYU Shanghai was founded in 2012.",
"NYU Shanghai is China's first Sino-US research university.",
"Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")
str_split(string," ")
## [[1]]
## [1] "NYU" "Shanghai" "was" "founded" "in" "2012."
##
## [[2]]
## [1] "NYU" "Shanghai" "is" "China's" "first" "Sino-US" "research"
## [8] "university."
##
## [[3]]
## [1] "Of" "the" "class" "of" "294" "students," "51%" "came" "from"
## [10] "The" "People's" "Republic" "of" "China," "with" "the" "remaining" "49%"
## [19] "coming" "from" "other" "countries" "around" "the" "world."
17.5 Example
Now we use our demo dataset to show an application of the functions introduced above for string manipulation.
We have a variable Founded in the dataset sp500tickers, which is the year a company was founded. Some cases contain more than one year, because a company may have been acquired by another company or gone through other changes in its history. The variable records up to three events/years for each case.
## [1] "1902" "1888" "2013 (1888)" "1981" "1989" "2008" "1982"
## [8] "1969" "1932" "1981"
Founded is a character variable, and the format of its number-strings is not consistent. For instance, a value may contain parentheses, whitespace, commas, slashes, or founders’ names, as in “1994 (Northrop 1939, Grumman 1930)” or “1881/1894 (1980)”. Moreover, the years are not necessarily in descending order, as in “2005 (Molson 1786, Coors 1873)”.
Our goals are first to extract the numbers (years) from each case, discarding all other characters; then to split the strings and organize the years in three columns, since a case has up to three years; and finally to order the three years from the most recent to the most remote.
We take the steps below for these tasks.
- Extract all the number-strings.
We use the stringr function str_extract_all() to do this. We specify the pattern "[0-9]" so that only digits are matched and returned, each as a separate one-character string.
The function returns a list of character vectors, one per case.
## [[1]]
## [1] "1" "9" "0" "2"
##
## [[2]]
## [1] "1" "8" "8" "8"
##
## [[3]]
## [1] "2" "0" "1" "3" "1" "8" "8" "8"
##
## [[4]]
## [1] "1" "9" "8" "1"
##
## [[5]]
## [1] "1" "9" "8" "9"
##
## [[6]]
## [1] "2" "0" "0" "8"
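The extraction call itself is not shown. A minimal sketch follows, using the first six values of Founded as printed earlier. The name founded_raw is hypothetical; founded is the name used in the next step.

```r
library(stringr)

# First six values of Founded; `founded_raw` is a hypothetical name.
founded_raw <- c("1902", "1888", "2013 (1888)", "1981", "1989", "2008")

# "[0-9]" matches one digit at a time, so each year comes back as four
# separate one-character strings; parentheses and spaces are dropped.
founded <- str_extract_all(founded_raw, "[0-9]")
founded[[3]]
## [1] "2" "0" "1" "3" "1" "8" "8" "8"
```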
- Split the strings into 3 groups, each consisting of 4 digits.
year1 <- t(sapply(founded, function(x) x[1:4]))
year2 <- t(sapply(founded, function(x) x[5:8]))
year3 <- t(sapply(founded, function(x) x[9:12]))
We use the sapply() function to handle the list output from str_extract_all().
## [,1] [,2] [,3] [,4]
## [1,] "1" "9" "0" "2"
## [2,] "1" "8" "8" "8"
## [3,] "2" "0" "1" "3"
## [4,] "1" "9" "8" "1"
## [5,] "1" "9" "8" "9"
## [6,] "2" "0" "0" "8"
## [,1] [,2] [,3] [,4]
## [1,] NA NA NA NA
## [2,] NA NA NA NA
## [3,] "1" "8" "8" "8"
## [4,] NA NA NA NA
## [5,] NA NA NA NA
## [6,] NA NA NA NA
## [,1] [,2] [,3] [,4]
## [1,] NA NA NA NA
## [2,] NA NA NA NA
## [3,] NA NA NA NA
## [4,] NA NA NA NA
## [5,] NA NA NA NA
## [6,] NA NA NA NA
The outputs are matrices.
## [1] "matrix" "array"
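To see why the results are matrices and where the NA values come from, here is a small sketch with two cases taken from the output above, one with a single year and one with two years.

```r
# Two cases: digits of "1902" and of "2013 (1888)".
founded <- list(c("1", "9", "0", "2"),
                c("2", "0", "1", "3", "1", "8", "8", "8"))

# sapply() returns one column per case, so t() turns cases into rows.
year1 <- t(sapply(founded, function(x) x[1:4]))
# Indexing past the end of a vector yields NA, which marks a missing year.
year2 <- t(sapply(founded, function(x) x[5:8]))

dim(year1)
## [1] 2 4
year2[1, ]
## [1] NA NA NA NA
year2[2, ]
## [1] "1" "8" "8" "8"
```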
- Paste the 4 single digits into a whole string to indicate a year.
y1 <- apply(year1, 1, paste0, collapse = "")
y2 <- apply(year2, 1, paste0, collapse = "")
y3 <- apply(year3, 1, paste0, collapse = "")
## [1] "1902" "1888" "2013" "1981" "1989" "2008"
## [1] "NANANANA" "NANANANA" "1888" "NANANANA" "NANANANA" "NANANANA"
## [1] "NANANANA" "NANANANA" "NANANANA" "NANANANA" "NANANANA" "NANANANA"
The outputs are vectors.
## [1] "character"
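The "NANANANA" strings arise because paste0() converts each NA to the text "NA" before collapsing. A short sketch on two rows like those above:

```r
# A row holding the four digits of a year.
row_with_year <- c("1", "9", "0", "2")
# A row for a case that has no year in this position.
row_without_year <- rep(NA_character_, 4)

pasted_year <- paste0(row_with_year, collapse = "")
pasted_missing <- paste0(row_without_year, collapse = "")

pasted_year
## [1] "1902"
pasted_missing
## [1] "NANANANA"
```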
- Convert the strings to numbers.
Those "NANANANA" strings are coerced to NA.
## [1] 1902 1888 2013 1981 1989 2008
## [1] NA NA 1888 NA NA NA
## [1] NA NA NA NA NA NA
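The conversion call is not shown. One way to obtain the output above, assuming as.numeric() is used, is sketched here on a shortened version of y2. as.numeric() warns when it turns non-numeric strings into NA, so the warning is suppressed deliberately.

```r
# Shortened version of y2 from the previous step.
y2 <- c("NANANANA", "NANANANA", "1888")

# Strings that are not valid numbers become NA.
y2 <- suppressWarnings(as.numeric(y2))
y2
## [1]   NA   NA 1888
```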
- Organize the three year vectors into a data frame of three columns, and order the years from the most recent to the most remote for each case.
years <- data.frame(y1, y2, y3)
Founded1 <- apply(years, 1, function(x) x[order(x, decreasing = TRUE)][1])
Founded2 <- apply(years, 1, function(x) x[order(x, decreasing = TRUE)][2])
Founded3 <- apply(years, 1, function(x) x[order(x, decreasing = TRUE)][3])
head(years)
## y1 y2 y3
## 1 1902 NA NA
## 2 1888 NA NA
## 3 2013 1888 NA
## 4 1981 NA NA
## 5 1989 NA NA
## 6 2008 NA NA
We use order(x, decreasing = TRUE) to obtain the positions of the three years in descending order. function(x) x[order(x, decreasing = TRUE)][n] then sorts the three years from the most recent to the most remote and picks the n-th one: [1] is the most recent year, and [3] is the most remote one (or NA when a case has fewer than three years).
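A short sketch of the sorting step on a single case, using the years from the “2005 (Molson 1786, Coors 1873)” example and a second case with a missing year:

```r
# Years in the order they appear in "2005 (Molson 1786, Coors 1873)".
x <- c(2005, 1786, 1873)
pos <- order(x, decreasing = TRUE)  # positions of the years, largest first
pos
## [1] 1 3 2
sorted_x <- x[pos]
sorted_x
## [1] 2005 1873 1786

# By default order() puts NA last, so missing years stay at the end.
z <- c(1888, 2013, NA)
sorted_z <- z[order(z, decreasing = TRUE)]
sorted_z
## [1] 2013 1888   NA
```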