Chapter 11 String Manipulation
library(tidyverse)
library(stringr) # tidyverse string functions, not loaded with tidyverse
library(refinr) # fuzzy string matching
Strings make up a very important class of data. Data being read into R often come in the form of character strings where different parts might mean different things. For example, a sample ID of "R1_P2_C1_2012_05_28" might represent data from Region 1, Park 2, Camera 1, taken on May 28, 2012. It is important that we have a set of utilities that allow us to split and combine character strings in an easy and consistent fashion. Unfortunately, the utilities included in the base version of R are somewhat inconsistent and were not designed to work nicely together. Hadley Wickham, the developer of ggplot2 and dplyr, has this to say:
“R provides a solid set of string operations, but because they have grown organically over time, they can be inconsistent and a little hard to learn. Additionally, they lag behind the string operations in other programming languages, so that some things that are easy to do in languages like Ruby or Python are rather hard to do in R.” – Hadley Wickham
For this chapter we will introduce the most commonly used functions from the base version of R. We then introduce the stringr package, which provides many useful functions that operate in a consistent manner. The book R for Data Science has a nice chapter on strings that serves as the motivation for what is presented here.
There are several white space characters, such as tabs and returns, that need to be represented in character strings. Most programming languages, including R, represent these using the escape character combined with another character. For example, in a character string \t represents a tab and \n represents a newline. However, because the backslash is the escape character, in order to have a literal backslash in a character string, the backslash itself needs to be escaped. It is important to note that most of these operations, and those you will use for regular expressions, require us to use two backslashes to properly represent the pattern due to the nature of the escape character in R. Thus, we will find ourselves using regular expressions such as \\d+ rather than \d. Keep this in mind as we work forward and you complete your exercises.
11.1 Base functions
11.1.1 paste()
The most basic thing we will want to do is to combine two strings or to combine a string with a numerical value. The paste() command takes one or more R objects, converts them to character strings, and then pastes them together to form one or more character strings. It has the form:
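The signature itself was not shown in this excerpt; a sketch of the base R form (see ?paste for full details) is:

```r
# Form of the paste() command:
#   paste(..., sep = " ", collapse = NULL)
paste('A', 'B')   # with the default sep, the two strings are joined by a space
```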
The ... piece means that we can pass any number of objects to be pasted together. The two most used arguments are sep, which gives the character string that separates the strings to be joined (a space by default, but commonly a comma, underscore, or period), and collapse, which specifies whether a simplification should be performed after the pasting. For example, if we collapse with collapse=',', then all pasted elements will have commas between them. Choosing proper separators and collapse values ensures you output the string you are looking for, which matters later if you are storing important information within a string.
Suppose we want to combine the strings “Peanut butter” and “Jelly” then we could execute:
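The code chunk that produced the output below is not shown in this excerpt; a call consistent with it is:

```r
paste('PeanutButter', 'Jelly')
```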
## [1] "PeanutButter Jelly"
Notice that without specifying the separator character, R chose to put a space between the two strings. We could specify whatever we wanted:
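A call consistent with the output below (the original chunk is not shown) uses the sep argument:

```r
paste('PeanutButter', 'Jelly', sep = '_')
```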
## [1] "PeanutButter_Jelly"
But really we all know the answer is
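That is, with an ampersand as the separator:

```r
paste('PeanutButter', 'Jelly', sep = '&')
```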
## [1] "PeanutButter&Jelly"
We can also combine different object types when combining strings; a common operation is to output some type of sentence with a numerical value at the end.
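A call consistent with the output below (the original chunk is not shown) pastes a string with the built-in constant pi:

```r
paste('Pi is equal to', pi)   # pi is converted to a character string
```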
## [1] "Pi is equal to 3.14159265358979"
We can combine vectors of similar or different lengths as well. By default R assumes that you want to produce a vector of character strings as output.
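A call consistent with the output below (the original chunk is not shown) recycles the single string across the numeric vector:

```r
paste('n =', c(5, 25, 100))
```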
## [1] "n = 5" "n = 25" "n = 100"
These operations are done element-wise if you provide two vectors of strings.
first.names <- c('Robb','Stannis','Daenerys')
last.names <- c('Stark','Baratheon','Targaryen')
paste( first.names, last.names)
## [1] "Robb Stark" "Stannis Baratheon" "Daenerys Targaryen"
If we want paste() to produce just a single string of output, we use the collapse= argument to paste together each element of the output vector, separated by the collapse character.
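The calls behind the two outputs below were not shown in this excerpt; calls consistent with them are:

```r
paste('n =', c(5, 25, 100))                   # a vector of three strings
paste('n =', c(5, 25, 100), collapse = ':')   # collapsed into a single string
```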
## [1] "n = 5" "n = 25" "n = 100"
## [1] "n = 5:n = 25:n = 100"
Finally, we can combine sep and collapse to create a single output string, with the elements of each pair joined by sep and, after all elements are created, collapsed together with a colon surrounded by spaces.
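A call consistent with the output below (reusing the name vectors defined earlier; redefined here so the chunk is self-contained) is:

```r
first.names <- c('Robb', 'Stannis', 'Daenerys')
last.names  <- c('Stark', 'Baratheon', 'Targaryen')
paste(first.names, last.names, sep = '.', collapse = ' : ')
```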
## [1] "Robb.Stark : Stannis.Baratheon : Daenerys.Targaryen"
Notice we could use the paste()
command with the collapse option to combine a vector of character strings together.
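A call consistent with the output below (last.names redefined so the chunk is self-contained) is:

```r
last.names <- c('Stark', 'Baratheon', 'Targaryen')
paste(last.names, collapse = ':')   # collapse a single vector into one string
```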
## [1] "Stark:Baratheon:Targaryen"
11.2 stringr package: Basic operations
The goal of stringr is to make a consistent user interface to a suite of functions to manipulate strings. "(stringr) is a set of simple wrappers that make R's string functions more consistent, simpler and easier to use. It does this by ensuring that: function and argument names (and positions) are consistent, all functions deal with NA's and zero length character appropriately, and the output data structures from each function matches the input data structures of other functions." - Hadley Wickham
We'll investigate the most commonly used functions, but there are many we will ignore. I find myself using these in almost all data cleaning I do these days, along with many of the pattern matching tools in the next section. Here is a list of what would be called basic string operations, as found in the stringr package.
Function | Description |
---|---|
str_c() | string concatenation, similar to paste |
str_length() | number of characters in the string |
str_sub() | extract a sub-string |
str_trim() | remove leading and trailing whitespace |
str_pad() | pad a string with empty space to make it a certain length |
11.2.1 str_c() : Concatenate strings
Let's first discuss how to concatenate two strings or two vectors of strings, similar to the paste() command. The str_c() function acts as a synonym of the paste() command. The syntax is:
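The signature itself was not shown in this excerpt; a sketch (note that, unlike paste(), the default sep in str_c() is the empty string) is:

```r
library(stringr)
# Form of the str_c() command:
#   str_c(..., sep = "", collapse = NULL)
str_c('A', 'B')   # default sep is '', so this yields "AB"
```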
You can think of the inputs building a matrix of strings, with each input creating a column of the matrix. For each row, str_c()
first joins all the columns (using the separator character given in sep
) into a single column of strings. If the collapse
argument is non-NULL, the function takes the vector and joins each element together using collapse as the separator character. Examples always help to see how this operates:
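The chunk that displays the two name vectors side-by-side was not shown in this excerpt; a reconstruction consistent with the output below is:

```r
first.names <- c('Robb', 'Stannis', 'Daenerys')
last.names  <- c('Stark', 'Baratheon', 'Targaryen')
cbind(first.names, last.names)   # view the two vectors as columns of a matrix
```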
## first.names last.names
## [1,] "Robb" "Stark"
## [2,] "Stannis" "Baratheon"
## [3,] "Daenerys" "Targaryen"
# join the columns together using the `sep` argument
full.names <- str_c( first.names, last.names, sep='.')
cbind( first.names, last.names, full.names)
## first.names last.names full.names
## [1,] "Robb" "Stark" "Robb.Stark"
## [2,] "Stannis" "Baratheon" "Stannis.Baratheon"
## [3,] "Daenerys" "Targaryen" "Daenerys.Targaryen"
# Join each of the rows together separated by the `collapse` argument
str_c( first.names, last.names, sep='.', collapse=' : ')
## [1] "Robb.Stark : Stannis.Baratheon : Daenerys.Targaryen"
As stated, this is much like the paste() command previously discussed, but is the built-in stringr version, which can be useful when building pipelines of string commands.
11.2.2 str_length() : Calculate string lengths
The str_length()
function calculates the length of each string in the vector of strings passed to it.
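The definition of the text vector was not shown in this excerpt; a definition consistent with the output below (and with the str_sub() output later) is:

```r
library(stringr)
text <- c('WordTesting', 'With a space', NA, 'Night')  # reconstructed example vector
str_length(text)
```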
## [1] 11 12 NA 5
Notice that str_length() correctly interprets the missing data as missing, so the corresponding length is also missing.
11.2.3 str_sub() : Extracting sub-strings
A sub-string is any contiguous subgroup of characters contained within a string. For example, we might have a common pattern that requires us to extract the \(3^{rd}\) through \(6^{th}\) letters in a string. Let's apply this to the text vector created above.
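A call consistent with the output below (the text vector is reconstructed, since its definition was not shown in this excerpt) is:

```r
library(stringr)
text <- c('WordTesting', 'With a space', NA, 'Night')  # reconstructed example vector
str_sub(text, start = 3, end = 6)
```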
## [1] "rdTe" "th a" NA "ght"
Notice for each string in the vector, the \(3^{rd}\) through \(6^{th}\) elements were extracted. If a given string is not long enough to contain all the necessary indices, str_sub() returns only the letters that were there (as in the above case for "Night", the command only returned the \(3^{rd}\) through \(5^{th}\) elements, since the string is only \(5\) elements in length).
11.2.4 str_pad() : Pad a string
Sometimes we want to make every string in a vector the same length to facilitate display or the creation of a uniform system of assigning ID numbers. Performing such an operation is typically known as padding a string. The str_pad() function will add characters at either the beginning or end of every string. The width argument must be set and is the desired length of all strings in the vector. The side argument determines whether the beginning (left) or end (right) of a string gains the padding, while the pad argument can be used to change what character is used for padding. The default is to pad the beginning (left) of a string with spaces. Below we add padding to our first.names vector created earlier:
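Calls consistent with the two outputs below (first.names redefined so the chunk is self-contained) are:

```r
library(stringr)
first.names <- c('Robb', 'Stannis', 'Daenerys')
str_pad(first.names, width = 8)                             # default: pad the left with spaces
str_pad(first.names, width = 8, side = 'right', pad = '*')  # pad the right with asterisks
```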
## [1] " Robb" " Stannis" "Daenerys"
## [1] "Robb****" "Stannis*" "Daenerys"
11.2.5 str_trim() : Trim a string
We can think of str_trim() as sort of the opposite of padding. The function str_trim() removes any leading or trailing white space, where white space is defined as spaces ' ', tabs \t, or returns \n.
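A chunk consistent with the outputs below (the exact amounts of leading and trailing white space are an assumption, since the original definition was not shown):

```r
library(stringr)
text <- '   This is some text. \n   '   # assumed leading/trailing white space
text
str_trim(text)
```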
## [1] " This is some text. \n "
## [1] "This is some text."
By trimming the string, the leading spaces and the trailing new line (with spaces) were removed, returning a string containing only the characters one might call “meaningful”. Notice that trimming did not disrupt the internal spacing of the string, only the outer elements.
11.3 stringr package: Pattern Matching Tools
The previous commands are all quite useful, but the most powerful string operation is to take a string and match some pattern within it. The following commands are available within stringr. Each of these commands adds the idea of matching a pattern within the string, which allows us to build up to the very powerful string manipulation structure known as regular expressions. For now, let's focus on these functions with simple patterns, such as splitting a string at underscores, or finding and replacing all underscores with periods.
Function | Description |
---|---|
str_detect() | Detect if a pattern occurs in the input string |
str_locate() str_locate_all() | Locates the first (or all) positions of a pattern |
str_extract() str_extract_all() | Extracts the first (or all) sub-strings corresponding to a pattern |
str_replace() str_replace_all() | Replaces the matched sub-string(s) with a new pattern |
str_split() str_split_fixed() | Splits the input string based on the input pattern |
There is one other useful command within the tidyr
package that allows us to apply string splitting within a data frame.
Function | Description |
---|---|
tidyr::separate() | Applies str_split_fixed() to a data frame column |
We will first examine these functions using very simple pattern matching, where we match a specific pattern that is consistent and known ahead of time. For most simple operations, this is all the complexity that is needed. Suppose that we have a vector of strings that each contain a date in the form "Year-Month-Day" and we want to manipulate them to extract certain information. We will use the strings object below to demonstrate how these pattern based string commands operate.
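The definition of strings was not shown in this excerpt; a vector consistent with all the outputs that follow is:

```r
strings <- c('2008-Feb-10', '2010-Sept-18', '2013-Jan-11', '2016-Jan-2')
```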
11.3.1 str_detect() : Detect if a pattern is present
Suppose we want to know which dates are in September. This would relate to detecting if the pattern "Sept" occurs in the strings. The function str_detect() does just this and returns logicals as output, indicating whether each string in the vector contained the pattern or not. Now is a great time to start seeing how to use these functions within a pipeline and the dplyr structure previously discussed.
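A pipeline consistent with the output below (strings redefined so the chunk is self-contained) is:

```r
library(dplyr)
library(stringr)
strings <- c('2008-Feb-10', '2010-Sept-18', '2013-Jan-11', '2016-Jan-2')
data.frame(string = strings) %>%
  mutate(result = str_detect(string, pattern = 'Sept'))
```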
## string result
## 1 2008-Feb-10 FALSE
## 2 2010-Sept-18 TRUE
## 3 2013-Jan-11 FALSE
## 4 2016-Jan-2 FALSE
Here we see that the second string in the test vector included the sub-string “Sept” but none of the others did.
11.3.2 str_locate() : Locate the position of a pattern
The function str_locate()
provides a bit more information than just if the pattern was present or not. Instead, maybe we need to figure out where the dash “-” characters are located within the string. This is a good application of the str_locate()
function.
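A call consistent with the output below (strings redefined so the chunk is self-contained) is:

```r
library(stringr)
strings <- c('2008-Feb-10', '2010-Sept-18', '2013-Jan-11', '2016-Jan-2')
str_locate(strings, pattern = '-')   # position of the first dash in each string
```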
## start end
## [1,] 5 5
## [2,] 5 5
## [3,] 5 5
## [4,] 5 5
Notice that the str_locate() function returned a matrix with one row per string, but only reported the position of the first dash in each. If we want to know the positions of all the dashes in each string, we instead use the function str_locate_all(), which changes the output to a list of matrices, but provides a more complete listing of where the patterns were located.
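A call consistent with the output below (strings redefined so the chunk is self-contained) is:

```r
library(stringr)
strings <- c('2008-Feb-10', '2010-Sept-18', '2013-Jan-11', '2016-Jan-2')
str_locate_all(strings, pattern = '-')   # positions of every dash, one matrix per string
```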
## [[1]]
## start end
## [1,] 5 5
## [2,] 9 9
##
## [[2]]
## start end
## [1,] 5 5
## [2,] 10 10
##
## [[3]]
## start end
## [1,] 5 5
## [2,] 9 9
##
## [[4]]
## start end
## [1,] 5 5
## [2,] 9 9
Using this information, we could grab the Year/Month/Day information out of each of the dates. It can be tedious and require a lot of storage to do this with location information, instead we will discuss the str_split()
function that will allow us to easily extract sub-strings based on patterns.
11.3.3 str_split() : Split strings into sub-strings
We can split each of the dates into three smaller sub-strings using the str_split()
command, which returns a list where each element of the list is a vector containing pieces of the original string (excluding the pattern we matched on). What is nice here is that we do not have to supply specific positions for each string, instead we allow the pattern matching to do this work for us.
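A call consistent with the output below (strings redefined so the chunk is self-contained) is:

```r
library(stringr)
strings <- c('2008-Feb-10', '2010-Sept-18', '2013-Jan-11', '2016-Jan-2')
str_split(strings, pattern = '-')   # a list with one character vector per string
```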
## [[1]]
## [1] "2008" "Feb" "10"
##
## [[2]]
## [1] "2010" "Sept" "18"
##
## [[3]]
## [1] "2013" "Jan" "11"
##
## [[4]]
## [1] "2016" "Jan" "2"
If we know that all the strings will be split into a known number of sub-strings (we have to specify how many sub-strings to expect with the n= argument), we can use str_split_fixed() to get a matrix of sub-strings instead of a list of sub-strings. This can have significant consequences for your workflow, as working with matrices rather than lists is often more straightforward. Either is useful, and which we want to use is often project dependent. Do take note that to generate the matrix version using str_split_fixed(), we must know ahead of time how many sub-strings we expect each string to produce.
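A call consistent with the output below (strings redefined so the chunk is self-contained) is:

```r
library(stringr)
strings <- c('2008-Feb-10', '2010-Sept-18', '2013-Jan-11', '2016-Jan-2')
str_split_fixed(strings, pattern = '-', n = 3)   # a 4x3 character matrix
```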
## [,1] [,2] [,3]
## [1,] "2008" "Feb" "10"
## [2,] "2010" "Sept" "18"
## [3,] "2013" "Jan" "11"
## [4,] "2016" "Jan" "2"
If the number of sub-strings is not consistent, it is possible to supply a larger number, such as n = 4 in the example above, and any string with fewer than \(4\) sub-strings will return empty strings ("") for the missing pieces. Occasionally, I want to split a column in a data frame on some pattern and store the resulting pieces in new columns attached to that same data frame. This is an instance where the tidyr::separate() command works very well. The example below shows a useful place to include the tidyr::separate() command; each approach produces the same output, but using tidyr::separate() requires less fidgeting with the data frame structure.
# Notice the `.$Date` takes the input data frame (that is the `.` part)
# and then grabs the `Date` column and passes all of that into `str_split_fixed`.
# The resulting character matrix is then column bound to the input data frame.
data.frame(
Event=c('Date','Marriage','Elise','Casey'),
Date = strings ) %>%
cbind( str_split_fixed(.$Date, pattern='-', n=3)) %>%
rename( Year=`1`, Month=`2`, Day=`3` )
## Event Date Year Month Day
## 1 Date 2008-Feb-10 2008 Feb 10
## 2 Marriage 2010-Sept-18 2010 Sept 18
## 3 Elise 2013-Jan-11 2013 Jan 11
## 4 Casey 2016-Jan-2 2016 Jan 2
# It is really annoying to have to rename it, so the tidyr package includes
# a specialized function `separate` that does this exact thing.
# `remove=FALSE` causes separate to keep the original Date column.
data.frame(
Event=c('Date','Marriage','Elise','Casey'),
Date = strings ) %>%
separate(Date, sep='-', into=c('Year','Month','Day'), remove=FALSE)
## Event Date Year Month Day
## 1 Date 2008-Feb-10 2008 Feb 10
## 2 Marriage 2010-Sept-18 2010 Sept 18
## 3 Elise 2013-Jan-11 2013 Jan 11
## 4 Casey 2016-Jan-2 2016 Jan 2
11.3.4 str_replace(): Replace sub-strings
The last common pattern based operation is replacing sub-strings. Suppose we did not like using “-” to separate the Year/Month/Day but preferred a space, or an underscore, or anything else we might like to use instead. This can be done by replacing all of the “-” with the desired character. Similar to the string splitting functions discussed above, string replacement comes with two options: str_replace()
only replaces the first match it finds, while str_replace_all()
replaces all matches in the entire string.
### only the first dash becomes a colon
data.frame( string = strings ) %>%
mutate(result = str_replace(string, pattern='-', replacement=':'))
## string result
## 1 2008-Feb-10 2008:Feb-10
## 2 2010-Sept-18 2010:Sept-18
## 3 2013-Jan-11 2013:Jan-11
## 4 2016-Jan-2 2016:Jan-2
### all dashes replaced by colons
data.frame( string = strings ) %>%
mutate(result = str_replace_all(string, pattern='-', replacement=':'))
## string result
## 1 2008-Feb-10 2008:Feb:10
## 2 2010-Sept-18 2010:Sept:18
## 3 2013-Jan-11 2013:Jan:11
## 4 2016-Jan-2 2016:Jan:2
11.4 Regular Expressions
The next section will introduce regular expressions, which are a way of precisely writing out very complicated patterns. Go look here if you would like to gain insight into just how complex regular expressions can become. I have had undergraduate students whose research projects involved writing complex regular expressions to solve really interesting data cleaning problems! The stringr pattern based functions can be utilized by supplying patterns based on standard regular expressions (not perl-style!) instead of fixed strings.
Regular expressions are extremely powerful for sifting through large amounts of text. For example, we might want to extract all of the 4 digit sub-strings (the years) out of our dates vector, or I might want to find all cases in a paragraph of text that begin with a capital letter and are at least \(5\) letters long. In another, somewhat nefarious example, spammers might have downloaded a bunch of text from web pages and want to be able to look for email addresses. So as a first pass, they want to match a pattern that might be thought about like this:
\[
\underset{\textrm{1 or more letters}}{\underbrace{\texttt{Username}}}\texttt{@}\;\;\underset{\textrm{1 or more letters}}{\underbrace{\texttt{OrganizationName}}}\;\texttt{.}\;\begin{cases}
\texttt{com}\\
\texttt{org}\\
\texttt{edu}
\end{cases}
\]
Here the Username and OrganizationName can be pretty much anything, but a valid email address always looks like this. We might get even more creative and recognize that the list of possible endings could include country codes as well. For most users, regular expressions can be an unnecessary can-of-worms to open, but it is good to know that these pattern matching utilities are available within R and you do not need to export your pattern matching problems to Perl or Python. As we have all decided to take on the challenges of this course together, let's dive into using them to extract complex patterns.
11.4.1 Regular Expression Ingredients
Regular expressions use a select number of characters to signify further meaning, in order to create recipes that can be matched within a character string. The special characters are [ \ ^ $ . | ? * + ( ), which are reserved and have special functions within regular expressions. To get started, the table below explores how we can search for common character types, such as letters, digits, and special combinations of these character types.
Character Types | Interpretation |
---|---|
abc | Letters abc exactly |
123 | Digits 123 exactly |
\d | Any Digit |
\D | Any Non-digit character |
\w | Any Alphanumeric character |
\W | Any Non-alphanumeric character |
\s | Any White space |
\S | Any Non-white space character |
. | Any Character (The wildcard!) |
^ | Beginning of input string |
$ | End of input string |
Next, we consider whether we are looking for exactly those characters or whether they may occur in different orders; this is where some of the special characters come into play. These operations are often called groupings.
Grouping | Interpretation |
---|---|
[abc] | Only a, b, or c |
[^abc] | Not a, b, nor c |
[a-z] | Characters a to z |
[A-Z] | Characters A to Z |
[0-9] | Numbers 0 to 9 |
[a-zA-Z] | Characters a to z or A to Z |
() | Capture Group |
(a(bc)) | Capture Sub-group |
(abc|def) | Matches abc or def |
Finally, we specify how many times the pattern may appear; these are often called group modifiers.
Group Modifiers | Interpretation |
---|---|
* | Zero or more repetitions of previous (greedy) |
+ | One or more repetitions of previous (greedy) |
? | Previous group is optional |
{m} | m repetitions of the previous |
{m,n} | Between m and n repetitions of the previous |
*? | Zero or more repetitions of previous (not-greedy). Obnoxiously, the ? is modifying the modifier here and so has a different interpretation than when modifying a group. |
+? | One or more repetitions of previous (not-greedy) |
The general idea is to make a recipe that combines one or more groups and add modifiers on the groups for how often the group is repeated. With just the ingredients discussed above, we can start to produce some interesting and complex regular expressions.
11.4.2 Matching a specific string
I might have a set of strings and I want to remove a specific string from them, or perhaps detect if a specific string is present. So long as the string of interest does not contain any of the special characters, you can just type out the string to be detected.
# Replace 'John' from all of the strings with '****'
# The regular expression used here is evaluating 'John'
strings <- c('John Sanderson', 'Johnathan Wilkes', 'Brendan Johnson', 'Bigjohn Smith')
data.frame( string=strings ) %>%
mutate( result = str_replace(string, 'John', '****') )
## string result
## 1 John Sanderson **** Sanderson
## 2 Johnathan Wilkes ****athan Wilkes
## 3 Brendan Johnson Brendan ****son
## 4 Bigjohn Smith Bigjohn Smith
Notice that this is case sensitive and we didn’t replace the ‘john’ in ‘Bigjohn’. We can increase the complexity of our regular expressions to make sure that any ‘John’ or ‘john’ is captured.
# Replace 'John' from all of the strings with '****'
# The regular expression used here now captures John or john
strings <- c('John Sanderson', 'Johnathan Wilkes', 'Brendan Johnson', 'Bigjohn Smith')
data.frame( string=strings ) %>%
mutate( result = str_replace(string, '[Jj]ohn', '****') )
## string result
## 1 John Sanderson **** Sanderson
## 2 Johnathan Wilkes ****athan Wilkes
## 3 Brendan Johnson Brendan ****son
## 4 Bigjohn Smith Big**** Smith
It is possible to identify and replace special characters with the proper syntax. I might have special characters in my string that I want to replace. The example below is common when working with finances, where money values are properly written out using the US-dollar symbol and commas separating thousands, millions, billions, etc., which improves readability and reduces mistakes. However, when we work with numerical data in R, none of these characters can be present, only the decimal representation with no additional syntax.
# Remove the commas and the $ sign and convert to integers.
# Because $ is a special character, we need to use the escape character, \.
# However, because R uses the escape character as well, we have to use TWO
# escape characters. The first to escape R usual interpretation of the backslash,
# and the second to cause the regular expression to not use the usual
# interpretation of the $ sign.
strings <- c('$1,000,873', '$4.53', '$672')
data.frame( string=strings ) %>%
mutate( result = str_remove_all(string, '\\$') )
## string result
## 1 $1,000,873 1,000,873
## 2 $4.53 4.53
## 3 $672 672
# We can use the Or clause built into regular expressions to grab the
# dollar signs and the commas using (Pattern1|Pattern2) notation
# the below has the regular expression (\\$|,) to capture both!
data.frame( string=strings ) %>%
mutate( result = str_remove_all(string, '(\\$|,)') )
## string result
## 1 $1,000,873 1000873
## 2 $4.53 4.53
## 3 $672 672
11.4.3 Matching arbitrary numbers
We might need to extract the numbers from a string. To do this, we want to grab \(1\) or more digits.
strings <- c('I need 653 to fix the car',
'But I only have 432.34 in the bank',
'I could get .53 from the piggy bank')
data.frame( string=strings ) %>%
mutate( result = str_extract(string, '\\d+') )
## string result
## 1 I need 653 to fix the car 653
## 2 But I only have 432.34 in the bank 432
## 3 I could get .53 from the piggy bank 53
That is not exactly what we wanted, since we lost some important information related to decimal notation. We extracted only whole numbers and left off the decimals, changing what the values really meant. To fix this, we could make the decimal portion an optional part of the recipe. The way to mark an optional section of the recipe is to enclose the optional part inside parentheses followed by a ?, that is, (...)?.
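A pipeline consistent with the output below (strings redefined so the chunk is self-contained) makes the decimal part optional:

```r
library(dplyr)
library(stringr)
strings <- c('I need 653 to fix the car',
             'But I only have 432.34 in the bank',
             'I could get .53 from the piggy bank')
data.frame(string = strings) %>%
  mutate(result = str_extract(string, '\\d+(\\.\\d+)?'))   # digits with an optional decimal part
```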
## string result
## 1 I need 653 to fix the car 653
## 2 But I only have 432.34 in the bank 432.34
## 3 I could get .53 from the piggy bank 53
That fixed the issue for the second row, but still misses the third line. Let's write three different recipes and then 'or' them together. Now, with a rather intricate regular expression, we capture all proper versions of the numbers we might like to retain.
data.frame( string=strings ) %>%
mutate( result = str_extract(string, '(\\d+\\.\\d+|\\.\\d+|\\d+)' ))
## string result
## 1 I need 653 to fix the car 653
## 2 But I only have 432.34 in the bank 432.34
## 3 I could get .53 from the piggy bank .53
11.4.4 Greedy matching
A regular expression tries to match as much as it can. The modifiers * and + try to match as many characters as possible. While often this is what we want, sometimes the result is more subtle. Consider the case where we are scanning HTML code looking for markup tags, which are of the form <blah blah>. The information inside the angled brackets will be important, but for now we just want to search the string for all instances of HTML tags. Here is a possible small piece of HTML code that includes two different types of tags.
We want to extract <b>, </b>, <em>, and </em>. To do this, we might first consider the following:
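The example string and the greedy extraction were not shown in this excerpt; a hypothetical string and call consistent with the output below are:

```r
library(stringr)
# A hypothetical snippet of HTML consistent with the output shown below:
txt <- 'There are <b>MANY</b> types of <em>awesome</em> tags.'
str_extract_all(txt, pattern = '<.+>')   # greedy: matches from the first < to the last >
```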
## [[1]]
## [1] "<b>MANY</b> types of <em>awesome</em>"
What the regular expression engine did was match as many characters as it could with the .+ until it got to the very last closing angled bracket it could find. We can force the + and * modifiers to be lazy, matching as few characters as possible to complete the match, by adding a ? suffix to the + or * modifier.
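A call consistent with the output below (reusing the hypothetical txt string, redefined so the chunk is self-contained):

```r
library(stringr)
txt <- 'There are <b>MANY</b> types of <em>awesome</em> tags.'  # same hypothetical string as above
str_extract_all(txt, pattern = '<.+?>')   # lazy: each match stops at the first closing >
```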
## [[1]]
## [1] "<b>" "</b>" "<em>" "</em>"
This reduces how greedy the regular expression is when extracting information. Although nuanced, it is important to be aware of this feature.
11.5 Fuzzy Pattern Matching - Advanced!
Now that we have tackled regular expressions that can (and will) be used in the data wrangling and cleaning process, let's discuss one more advanced topic. A common data wrangling task is to take a vector of strings that might be subtly different, where those differences are not important. For example, the address "321 S. Milton" should be the same as the address "321 South Milton". Matching each of these and outputting a common string is referred to as fuzzy pattern matching. Such strings are easy for humans to recognize as being the same, but it is much harder for a computer.
The open source project OpenRefine has a tool that does a very nice job doing fuzzy pattern grouping and several of the algorithms they created have been implemented in the R package refinr
. In this section we will utilize these tools to identify which strings represent the same items. Much of the information (and examples) we present here are taken from the OpenRefine GitHub page on
Clustering in Depth.
11.5.1 Key Collision Merge
The first fuzzy matching algorithm we will discuss works like this:
- Remove leading and trailing white space
- Change all characters to their lowercase representation
- Remove all punctuation and control characters
- Normalize extended western characters to their ASCII representation (for example “gödel” → “godel”)
- Split the string into white space separated tokens
- Sort the tokens and remove duplicates
- Join the tokens back together
From this algorithm, upper-case vs lower-case does not matter, nor does any punctuation. Furthermore, because we split the string into tokens on the internal white space, the order of the words also does not matter. This algorithm is available as refinr::key_collision_merge(). Here is a nice example of its execution, where even though all the inputs are slightly different, fuzzy matching provides a consistent output.
strings <- c("Acme Pizza", "AcMe PiZzA", " ACME PIZZA", 'Pizza, ACME')
data.frame(Input = strings, stringsAsFactors = FALSE) %>%
mutate( Result = key_collision_merge(Input) )
## Input Result
## 1 Acme Pizza ACME PIZZA
## 2 AcMe PiZzA ACME PIZZA
## 3 ACME PIZZA ACME PIZZA
## 4 Pizza, ACME ACME PIZZA
The function refinr::key_collision_merge() also includes two options for ignoring tokens. First, the ignore_strings argument allows us to specify strings that should be ignored. For example, the words "the" and "of" are often filler words that should be ignored.
strings <- c("Northern Arizona University", "University of Northern Arizona",
"The University of Northern Arizona")
data.frame(Input = strings, stringsAsFactors = FALSE) %>%
mutate( Result = key_collision_merge(Input, ignore_strings = c('the','of') ) )
## Input Result
## 1 Northern Arizona University Northern Arizona University
## 2 University of Northern Arizona Northern Arizona University
## 3 The University of Northern Arizona Northern Arizona University
Finally, there are common business suffixes that should be ignored. For example, "company", "inc.", and "LLC" all mean similar things. The key_collision_merge() function has a bus_suffix=TRUE argument that indicates whether the merging should be insensitive to these business suffixes.
11.5.2 String Distances
Non-exact pattern matching requires some notion of distance between any two strings. One measure of this (called Optimal String Alignment distance) is the number of changes it takes to transform one string into the other, where the valid types of changes are 1) deletion, 2) insertion, 3) substitution, and 4) transposition. For example, the distance between sarah and sara is \(1\) because we only need one deletion (delete the character h). But the distance between sarah and syria is \(3\): using only the valid types of changes, we could obtain the result via the sequence sarah -> syrah -> syraa -> syria.
There are other distance metrics available; the full list for the stringdist package is in the documentation: help("stringdist-metrics").
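A call consistent with the output below (the original chunk is not shown) uses stringdist with its default 'osa' method:

```r
library(stringdist)
stringdist('sarah', 'sara')    # one deletion, so distance 1
stringdist('sarah', 'syria')   # three changes under the default 'osa' method
```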
## [1] 3
With the idea of string distance, we can then match strings that have small distances between them. This type of task has become much more important in a world full of applications like large-language models, natural language processing, and semantic analysis.
11.5.3 N-gram Merge
The refinr::n_gram_merge
function performs an algorithm similar to key_collision_merge,
but it will also merge strings that are a small distance apart. The steps are:
- Change all characters to their lowercase representation
- Remove all punctuation, white space, and control characters
- Obtain all the string n-grams
- Sort the n-grams and remove duplicates
- Join the sorted n-grams back together
- Normalize extended western characters to their ASCII representation
- Match strings with distance less than edit_threshold (defaults to 2)
Instead of creating tokens using white space, n-grams
tokenize every collection of n sequential letters. For example, applying the steps above with 3-grams to “Frank” produces:
- frank (lowercased)
- fra, ran, ank (the 3-grams)
- ank, fra, ran (sorted, duplicates removed)
- ankfraran (joined back together)
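The fingerprinting steps above can be reproduced by hand with base R string functions. This is just a sketch of the idea, not refinr’s internal code:

```r
# Build the 3-gram fingerprint of "Frank" step by step
s <- tolower("Frank")                                # "frank"
grams <- substring(s, 1:(nchar(s) - 2), 3:nchar(s))  # "fra" "ran" "ank"
paste(sort(unique(grams)), collapse = "")            # "ankfraran"
```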
Here is an example of n-gram matching based on the author of the first edition of this book.
strings <- c("Derek Sonderegger",
"Derk Sondreger",
"Derek Sooonderegggggger",
"John Sonderegger")
tibble(Input = strings) %>%
mutate( Result = n_gram_merge(Input, numgram = 2, edit_threshold = 10) )
## # A tibble: 4 × 2
## Input Result
## <chr> <chr>
## 1 Derek Sonderegger Derek Sonderegger
## 2 Derk Sondreger Derek Sonderegger
## 3 Derek Sooonderegggggger Derek Sonderegger
## 4 John Sonderegger John Sonderegger
11.6 Exercises
Exercise 1
For the following regular expression, explain in words what it matches on. Then add test strings to demonstrate that it in fact does match on the pattern you claim it does. Make sure that your test set of strings has several examples that match as well as several that do not. Show at least two examples that return TRUE and two examples that return FALSE. If you copy the Rmarkdown code for these exercises directly from my source pages, make sure to remove the eval=FALSE
from the R-chunk headers.
Here is an example of what a solution might look like.
q) This regular expression matches:
Any string that contains the lower-case letter “a”.
strings <- c('Adel', 'Mathematics', 'able', 'cheese')
data.frame( string = strings ) %>%
mutate( result = str_detect(string, 'a') )
Please complete the questions below.
a) This regular expression matches:
Insert your answer here…
b) This regular expression matches:
Insert your answer here…
c) This regular expression matches:
Insert your answer here…
d) This regular expression matches:
Insert your answer here…
strings <- c()
data.frame( string = strings ) %>%
mutate( result = str_detect(string, '\\d+\\s[aA]') )
e) This regular expression matches:
Insert your answer here…
strings <- c()
data.frame( string = strings ) %>%
mutate( result = str_detect(string, '\\d+\\s*[aA]') )
f) This regular expression matches:
Insert your answer here…
g) This regular expression matches:
Insert your answer here…
strings <- c()
data.frame( string = strings ) %>%
mutate( result = str_detect(string, '^\\w{2}bar') )
h) This regular expression matches:
Insert your answer here…
Exercise 2
The following file names were used in a camera trap study. The S number represents the site, P is the plot within a site, C is the camera number within the plot, the first string of numbers is the YearMonthDay and the second string of numbers is the HourMinuteSecond.
file.names <- c( 'S123.P2.C10_20120621_213422.jpg',
'S10.P1.C1_20120622_050148.jpg',
'S187.P2.C2_20120702_023501.jpg')
Produce a data frame with columns corresponding to the site, plot, camera, year, month, day, hour, minute, and second
Exercise 3
The full text from Lincoln’s Gettysburg Address is given below. It has been provided in a form that includes lots of different types of white space. Your goal is to calculate the mean word length in Lincoln’s Gettysburg Address! Note: you may consider ‘battle-field’ as one word with 11 letters or as two words, ‘battle’ and ‘field’. The first option is a bit more difficult and technical!
Gettysburg <- 'Four score and seven years ago our fathers brought forth on this
continent, a new nation, conceived in Liberty, and dedicated to the proposition
that all men are created equal.
Now we are engaged in a great civil war, testing whether that nation, or any
nation so conceived and so dedicated, can long endure. We are met on a great
battle-field of that war. We have come to dedicate a portion of that field, as
a final resting place for those who here gave their lives that that nation might
live. It is altogether fitting and proper that we should do this.
But, in a larger sense, we can not dedicate -- we can not consecrate -- we can
not hallow -- this ground. The brave men, living and dead, who struggled here,
have consecrated it, far above our poor power to add or detract. The world will
little note, nor long remember what we say here, but it can never forget what
they did here. It is for us the living, rather, to be dedicated here to the
unfinished work which they who fought here have thus far so nobly advanced. It
is rather for us to be here dedicated to the great task remaining before us --
that from these honored dead we take increased devotion to that cause for which
they gave the last full measure of devotion -- that we here highly resolve that
these dead shall not have died in vain -- that this nation, under God, shall
have a new birth of freedom -- and that government of the people, by the people,
for the people, shall not perish from the earth.'
Exercise 4
Variable names in R may be any combination of letters, digits, periods, and underscores. However, a variable name may not start with a digit, and if it starts with a period, the period must not be followed by a digit.
The first four are valid variable names, but the last four are not.
a) First write a regular expression that determines if the string starts with a letter (upper or lower case) or an underscore and is then followed by zero or more numbers, letters, periods, or underscores. Notice below the use of start/end of string markers. This is important so that we don’t just match somewhere in the middle of the variable name.
b) Modify your regular expression so that the first group could be either [a-zA-Z_]
as before, or it could be a period followed by letters or an underscore.