9.2 Essentials of text

R is prepared to handle text in two distinct ways: First, it distinguishes between text and other data by providing a dedicated data type for storing objects of type character (aka strings or text). Second, it contains a few text constants and many functions that allow us to search and manipulate these character objects.

Before we consider the basic text constants and text-related functions of R (in Section 9.3), we need to know how we can define character data (Section 9.2.1) and enter special character symbols (Section 9.2.2). And to organize the confusing array of packages and commands that deal with text in R, we adopt a perspective that prioritizes the tasks we may want to solve with text to structure the tools we use for solving them (in Section 9.2.3).

9.2.1 Entering text data

In R, text is represented as a sequence of characters, which can contain letters, numbers, and other symbols. To turn a sequence of characters into a text object, it needs to be enclosed in quotation marks (" or '):

a <- "Hello"
b <- '!'
c <- "This is a sentence."

The data type R provides for storing sequences of characters (or strings) is character and the mode of an object that holds a character string is character:

typeof(a)
#> [1] "character"
mode(a)
#> [1] "character"
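As a brief illustration of the same idea for vectors, the following sketch (with a hypothetical object name words) checks the type of a character vector and shows that quoted digits are stored as text rather than as numbers:

```r
# A character vector can hold several strings; each element is one string:
words <- c("Hello", "!", "This is a sentence.")

is.character(words)   # TRUE: the vector is of type character
length(words)         # 3: the number of strings (elements)
nchar(words)          # 5 1 19: the number of characters in each string

# Digits enclosed in quotation marks are text, not numbers:
typeof("123")         # "character"
typeof(123)           # "double"
```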

Sometimes we want to talk about a word, rather than using it in a sentence. This is typically signaled by enclosing the corresponding word in quotation marks. However, entering the following would result in an error:

"The word "word" is a 4-letter word."

Here, the first character string is "The word " and a second one is " is a 4-letter word.", which leaves the word in between hanging in no-string-land. Fortunately, R allows us to solve such problems by distinguishing between two types of quotation marks:

"The word 'word' is a 4-letter word."
#> [1] "The word 'word' is a 4-letter word."

When using quotation marks within a string, the opening and closing marks of each pair must be of the same type, and the inner pair must differ from the outer pair. For instance, the following 2 definitions both yield valid character objects:

d <- "The word 'word' is a 4-letter word."
e <- 'The word "word" is a 4-letter word.'

whereas the following 4 definitions would all yield errors:

f <- "The word "word' is a 4-letter word.'
g <- "The word 'word" is a 4-letter word.'
h <- 'The word 'word" is a 4-letter word."
i <- 'The word "word' is a 4-letter word."

Thus, when using both types of quotation marks, they need to match up so that the inner ones are enclosed by the outer ones.
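An alternative to mixing both types of quotation marks is to escape the inner ones with a backslash. The following sketch uses hypothetical object names (w1 to w3) and assumes R version 4.0.0 or later for the raw string in its last part:

```r
# Escaping inner double quotes with a backslash:
w1 <- "The word \"word\" is a 4-letter word."

# print() shows the escape sequences, writeLines() shows the actual text:
print(w1)
writeLines(w1)

# The escaped version is identical to the one using single outer quotes:
w2 <- 'The word "word" is a 4-letter word.'
identical(w1, w2)   # TRUE

# Raw strings need no escaping at all (assumes R >= 4.0.0):
w3 <- r"(The word "word" is a 4-letter word.)"
identical(w1, w3)   # TRUE
```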

It may be a bit confusing at first that quotation marks can enclose any kind of character object — be it an individual symbol (e.g., "a" or "!"), a sentence (e.g., "The cat sat on the mat."), or much longer sequences of text (e.g., an article or book). In practice, people typically break down longer character objects into smaller pieces (e.g., vectors, matrices, or tables of character objects) before working with them.

But before we start to play with character strings, we need to take a closer look at their elementary parts. As it turns out, individual character symbols are much more complicated than most people think…

9.2.2 Entering character symbols

When thinking of character symbols, we primarily think of alphabetic letters (a, b, c, …, X, Y, Z) and numeric digits (0, 1, …, 9). However, a moment of reflection shows that we know and use many other symbols: for instance, mathematical operators (\(+\), \(-\), \(\times\), \(:\)), different types of dashes (-, –, —), parentheses and brackets ((, [, {, <, >, }, ], )), and punctuation marks (. , ; : ! ?).

As soon as we venture beyond our native language or culture, we realize that the symbols we use are highly context-dependent. In Section 6.2.1 of Chapter 6, we saw how to use the readr function parse_character() to encode character objects like “El Niño” or “こんにちは” (by specifying the encoding in an appropriate locale).

To encounter instances of context-sensitivity in characters, we do not need to travel to exotic locations. Even within our own culture, the same symbol can have different meanings based on its context (e.g., note that the symbol “/” in “he/she” typically means OR, whereas it means divided by in “1/2”) or different symbols can have the same meaning (e.g., “1/2” as “1:2”).

Using rare symbols (e.g., Umlaut letters)

A comparison between computer keyboards used in different locations shows that they differ not only in layout, but also in the symbols they contain. As different languages contain different symbols, a common problem is: How can I type a symbol that my keyboard does not seem to know? For instance, my current keyboard is set to US-English. This has the benefit that I am familiar with it and it renders some special characters that I use a lot — like / and @ — more accessible than other layouts. The price for this convenience is that my keyboard lacks dedicated keys for German Umlaut letters (aka. diaeresis/diacritic, see Wikipedia: Diaeresis_diacritic and Wikipedia: Germanic Umlaut for details), which — given a first name of Hansjörg — I also need to type quite frequently.

So how can we type symbols that our keyboard lacks? Suppose we wanted to impart an insight of refined cultural wisdom, like:

  • Hansjörg says: “Der Käsereichtum Österreichs ist ungewöhnlich groß.”
    (which is German for: “The variety of Austrian cheeses is extraordinarily rich.”)

Assuming that we find ways of typing — or copying and pasting — the foreign characters in this German phrase, we could apply our knowledge about matching quotation marks (from above) and enter it as a single character string into R. This would yield:

k1 <- "Hansjörg says: 'Der Käsereichtum Österreichs ist ungewöhnlich groß.'"
k1
#> [1] "Hansjörg says: 'Der Käsereichtum Österreichs ist ungewöhnlich groß.'"

# Write to console: 
writeLines(k1)
#> Hansjörg says: 'Der Käsereichtum Österreichs ist ungewöhnlich groß.'

Surprisingly, this seems to work pretty well, which is quite remarkable, as it implies that R, R Markdown (including the knitr and rmarkdown packages that transform an .Rmd file into an output file), and our operating system agree on sensible default character encodings (see Note 1 at the end of this section).

But what if we fail to find the Umlaut characters on our keyboard, lack a source to copy them from, or wanted to type even more exotic symbols (e.g., the ligature œ or the infinity symbol ∞)? There are many software solutions for this problem, of course, but most of them are context-dependent again (e.g., require a specific operating system or piece of software). A better solution is provided by the Unicode standard.

Using Unicode characters

A solution that generally works in R consists in finding and entering the appropriate Unicode standard character. For this, you need to know that R allows typing Unicode characters by entering \U... inside a character string and replacing the ... with the appropriate code of the desired symbol. For instance, if we wanted to type the Umlaut “ö” (i.e., a lowercase “o” with two dots on top), we could locate this symbol in a Unicode table and then type (\U00F6) inside a character string:

# Typing Unicode characters:
"\U00F6"         # ö
#> [1] "ö"
"Hansj\U00F6rg"  # Hansjörg
#> [1] "Hansjörg"

This may look a bit cryptic, but if we want to store some character data in files and exchange them with others, explicitly entering Unicode characters works more robustly across different platforms and systems.

But where can we find the Unicode character codes for the symbols that we want to type? To find these codes, we can consult long lists of Unicode characters, for instance the charts at unicode.org or the Wikipedia: List of Unicode characters. Such tables allow us to look up the other characters mentioned above:

# More Unicode characters: 
"\U0153"  # ligature oe
#> [1] "œ"
"\U221E"  # infinity
#> [1] "∞"

Equipped with these Unicode symbols, we can compose helpful spelling recommendations like:

oe <- "The name 'Hansj\U00F6rg' is transcribed as 'Hansjoerg', not with '\U0153' or '\U221E'." 

# Write to console: 
writeLines(oe)
#> The name 'Hansjörg' is transcribed as 'Hansjoerg', not with 'œ' or '∞'.

The Unicode characters in our character object oe also work in the HTML output of our R Markdown file:

  • The name ‘Hansjörg’ is transcribed as ‘Hansjoerg’, not with ‘œ’ or ‘∞’.
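Besides consulting tables, we can also ask R for the code of a character that we already have. The following sketch relies on the base R functions utf8ToInt() and intToUtf8(); the object names s_bad and s_good are hypothetical. It also illustrates a pitfall: as \U reads up to 8 hexadecimal digits (and \u up to 4), a following letter that happens to be a valid hexadecimal digit is absorbed into the code, unless the code is delimited by curly braces:

```r
# From a character to its Unicode code point (and back):
utf8ToInt("\u00F6")              # 246 (decimal)
as.hexmode(utf8ToInt("\u00F6"))  # "f6" (hexadecimal)
intToUtf8(0x00F6)                # "ö"

# Caution: A hex digit (0-9, a-f) following the code becomes part of it:
s_bad  <- "\U00E4b"   # parsed as the single code point U+0E4B, not as "ä" + "b"
s_good <- "\u{E4}b"   # curly braces delimit the code explicitly: "äb"

nchar(s_bad)          # 1
nchar(s_good)         # 2
```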

Umlaut characters

But we have yet to cope with the rich variety of Austrian cheeses — or rather the frequent need for Umlaut characters in the German language. Looking up some additional symbols in a Unicode table yields the following codes for the 7 additional characters commonly occurring in the German language:

# lowercase:
"\U00E4"  # ä
#> [1] "ä"
"\U00F6"  # ö
#> [1] "ö"
"\U00FC"  # ü
#> [1] "ü"

# uppercase:
"\U00C4"  # Ä
#> [1] "Ä"
"\U00D6"  # Ö
#> [1] "Ö"
"\U00DC"  # Ü
#> [1] "Ü"

# sharp s:
"\u00DF"  # ß
#> [1] "ß"

Using these codes within character strings allows us to type our desired phrase as follows:

# Desired phrase: 
# Hansjörg says: 'Der Käsereichtum Österreichs ist ungewöhnlich groß.'

# With Unicode characters:
k2 <- "Hansj\U00F6rg says: 'Der K\U00E4sereichtum \U00D6sterreichs ist ungew\U00F6hnlich gro\U00DF.'"
k2
#> [1] "Hansjörg says: 'Der Käsereichtum Österreichs ist ungewöhnlich groß.'"

# Write to console: 
writeLines(k2)
#> Hansjörg says: 'Der Käsereichtum Österreichs ist ungewöhnlich groß.'

German and all other non-English users of R can hope that their systems (not just R and supporting software, but also their keyboards and operating systems) are set up so that they can simply type the letters they need. Alternatively, they can look up and memorize the Unicode codes for the additional symbols they frequently need.
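To see what R stores when we enter Unicode codes, we can inspect the resulting string. The following is a minimal sketch using base R functions (with a hypothetical object name x); it shows that a string built from Unicode escapes is marked as UTF-8 and that one character can occupy more than one byte:

```r
x <- "Hansj\u00F6rg"

Encoding(x)               # "UTF-8": the string is marked as UTF-8
nchar(x, type = "chars")  # 8 characters
nchar(x, type = "bytes")  # 9 bytes, as "ö" occupies 2 bytes in UTF-8

# The raw bytes of the Umlaut character:
charToRaw("\u00F6")       # c3 b6
```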

To simplify typing the 7 German Umlaut letters without needing to remember their Unicode codes, the ds4psy package (Neth, 2020) provides a named vector Umlaut. For instance, to get the Umlaut for a character o, we can select Umlaut["o"] and combine it with other strings (e.g., by using paste0()):

library(ds4psy)

Umlaut       # a named vector
#>   a   o   u   A   O   U   s 
#> "ä" "ö" "ü" "Ä" "Ö" "Ü" "ß"
Umlaut["o"]  # select by name
#>   o 
#> "ö"

# Select Umlaut characters in (non-sensical) strings: 
paste0("Susi i", Umlaut["s"], "t gern s", Umlaut["u"], "sse ", Umlaut["A"], "pfel.")
#> [1] "Susi ißt gern süsse Äpfel."
paste0("Heisse ", Umlaut["O"], "fen knattern laut.")
#> [1] "Heisse Öfen knattern laut."
paste0("L", Umlaut["a"], "rm und ", Umlaut["U"], "belkeit sind ", Umlaut["o"], "ffentliche ", Umlaut["A"], "rgernisse.")
#> [1] "Lärm und Übelkeit sind öffentliche Ärgernisse."
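If the ds4psy package is not available, a similar helper is easy to define ourselves. The following sketch creates a hypothetical named vector my_umlaut (the name and its element names are our own choices, not part of any package) and uses it in the same way:

```r
# A hypothetical stand-in for the Umlaut vector (not part of ds4psy):
my_umlaut <- c(a = "\u00E4", o = "\u00F6", u = "\u00FC",
               A = "\u00C4", O = "\u00D6", U = "\u00DC",
               s = "\u00DF")

my_umlaut["o"]           # select one element by name
my_umlaut[c("A", "u")]   # select several elements by name

# paste0() drops the names and returns a plain character string:
paste0("Hansj", my_umlaut["o"], "rg")
```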

Additional Unicode characters

By browsing through the lists of Unicode characters (e.g., at unicode.org and Wikipedia) we can discover lots of interesting symbols that we never knew we needed. Although this can be fun, it can still be difficult to get this to work properly. In my experience, only about half of the rarer symbols appear as intended on my system. Some examples that work well for me include:

  • Greek letters:
# Greek letters:
"\U03B1/\U0391"  # alpha
#> [1] "α/Α"
"\U03B2/\U0392"  # beta
#> [1] "β/Β"
"\U03B3/\U0393"  # gamma
#> [1] "γ/Γ"
"\U03B4/\U0394"  # delta
#> [1] "δ/Δ"
"\U03B5/\U0395"  # epsilon
#> [1] "ε/Ε"
"\U03C6/\U03A6"  # phi
#> [1] "φ/Φ"
"\U03C7/\U03A7"  # chi
#> [1] "χ/Χ"
"\U03C8/\U03A8"  # psi
#> [1] "ψ/Ψ"
"\U03C9/\U03A9"  # omega
#> [1] "ω/Ω"
"\U03C3/\U03A3"  # sigma
#> [1] "σ/Σ"
  • Card suits:
# Card suits:
"\U2660/\U2664"
#> [1] "♠/♤"
"\U2665/\U2661"
#> [1] "♥/♡"
"\U2666/\U2662"
#> [1] "♦/♢"
"\U2663/\U2667"
#> [1] "♣/♧"
  • Dice symbols:
# dice:
"\U2680"   # dice 1
#> [1] "⚀"
"\U2681"   # dice 2
#> [1] "⚁"
"\U2682"   # dice 3
#> [1] "⚂"
"\U2683"   # dice 4
#> [1] "⚃"
"\U2684"   # dice 5
#> [1] "⚄"
"\U2685"   # dice 6
#> [1] "⚅"
  • Dingbats:
# Dingbats: 
"\U2702"  # scissors 
#> [1] "✂"
"\U270D"  # take note 
#> [1] "✍"
"\U270E"  # a pencil 
#> [1] "✎"
  • Pointers:
# pointers:
"\U261C"   # point left
#> [1] "☜"
"\U261D"   # point up
#> [1] "☝"
"\U261E"   # point right
#> [1] "☞"
"\U261F"   # point down
#> [1] "☟"

However, other symbols, for which we also find Unicode definitions, only partly work or fail to work:

  • Dashes (should have 3 different lengths):
# Dashes:
"\U2013"  # en dash
#> [1] "–"
"\U2014"  # em dash
#> [1] "—"
"\U2015"  # horizontal bar
#> [1] "―"
  • Emoticons (are not displayed within R, but do show up in R Markdown, below):
# Emoticons:
"\U1F642"  # slightly smiling face
#> [1] "\U0001f642"
"\U1F605"  # laugh with tear 
#> [1] "\U0001f605"
"\U1F63E"  # pouting cat face
#> [1] "\U0001f63e"
"\U1F607"  # smiling with halo
#> [1] "\U0001f607"
"\U1F631"  # face screaming in fear
#> [1] "\U0001f631"
  • Pictographs and miscellaneous other symbols (seem to be risky bets):
# Misc. symbols:
"\U2600"   # sun
#> [1] "☀"
"\U2601"   # cloud
#> [1] "☁"
"\U2602"   # umbrella
#> [1] "☂"
"\U2603"   # snow man
#> [1] "☃"
"\U2605"   # star 1
#> [1] "★"
"\U2606"   # star 2
#> [1] "☆"
"\U260E"   # telephone
#> [1] "☎"

# hazard etc.:
"\U2620"   # skull
#> [1] "☠"
"\U2622"   # radioactive
#> [1] "☢"
"\U262E"   # peace
#> [1] "☮"
"\U262F"   # yin yang
#> [1] "☯"

# planets/genders:
"\U2640"   # female sign (Venus)
#> [1] "♀"
"\U2641"   # Earth
#> [1] "♁"
"\U2642"   # male sign (Mars)
#> [1] "♂"
"\U26A7"   # transgender symbol
#> [1] "\u26a7"

# zodiac:
"\U2648"   # aries
#> [1] "♈"
"\U264B"   # cancer
#> [1] "♋"
"\U2653"   # pisces
#> [1] "♓"

# music:
"\U1D11E"  # musical G clef
#> [1] "𝄞"
"\U2669"   # quarter note
#> [1] "♩"
"\U266B"   # beamed eighth notes
#> [1] "♫"

# others:
"\U2615"   # coffee
#> [1] "\u2615"
"\U1D2E0"  # Mayan numeral zero 
#> [1] "\U0001d2e0"
"\U10CB1"  # Old Hungarian 
#> [1] "\U00010cb1"
"\U2757"   # fat exclamation mark
#> [1] "\u2757"
"\U267F"   # ISA wheelchair
#> [1] "\u267f"
"\U26BD"   # soccer
#> [1] "\u26bd"
"\U26F5"   # sailboat
#> [1] "\u26f5"

We can get some of the missing symbols to show up in R Markdown documents by using the asis_output() function of the knitr package (Xie, 2020b):

knitr::asis_output("Dashes: \U2013 \U2014 \U2015")
knitr::asis_output("Emoticons: \U1F642 \U1F605 \U1F63E \U1F607 \U1F631")
knitr::asis_output("Misc. symbols: \U2615 \U264B \U26A7 \U267F \U26BD \U26F5")

By using the knitr::asis_output() function inline (i.e., within an R code chunk in R Markdown documents, see Section E.3.3 of Appendix E) the corresponding symbols even show up in our HTML output:

  • Dashes: – — ―

  • Emoticons: 🙂 😅 😾 😇 😱

  • Misc. symbols: ☕ ♋ ⚧ ♿ ⚽ ⛵
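When a symbol is printed as an escape sequence, the character is still stored correctly; only its display depends on the system. The following sketch (with a hypothetical object name coffee) uses base R functions to check the session's settings and to compare print() with writeLines(). What actually appears on screen depends on the locale, font, and output device, so no output is shown for the last two lines:

```r
# Which character encoding does the current R session use?
l10n_info()[["UTF-8"]]       # TRUE if the session runs in a UTF-8 locale
Sys.getlocale("LC_CTYPE")    # name of the current character-type locale

# The string contains the symbol either way (as one code point):
coffee <- "\U2615"
nchar(coffee)                # 1
utf8ToInt(coffee) == 0x2615  # TRUE

# print() may fall back to an escape sequence, whereas writeLines()
# passes the character on to the console or output device:
print(coffee)
writeLines(coffee)
```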

As the universe of Unicode symbols is vast, it is easy to get lost in charts of available symbols. But before we spend too much time on this, it is important to consider yet another category of character symbols — the set of metacharacters that have a special meaning within the R language.

Metacharacters

Having dealt with foreign and exotic characters, we need to close this section with a caveat: Some familiar characters assume special meanings when a character string is used as a search pattern (i.e., as a regular expression). The 12 so-called metacharacters are:

  • . \ | ( ) [ { ^ $ * + ?

At this point, it is sufficient to take note of these characters and remember that they may cause trouble when working with text. We will explain the reasons for this trouble, together with the remedy, in our section on regular expressions (see Section 9.4 below).

To make them easily accessible, the ds4psy package (Neth, 2020) provides these metacharacters as a character vector metachar:

library(ds4psy)

metachar
#>  [1] "."  "\\" "|"  "("  ")"  "["  "{"  "^"  "$"  "*"  "+"  "?"
length(metachar) 
#> [1] 12

# Write as a character string:
writeLines(paste(metachar, collapse = " "))
#> . \ | ( ) [ { ^ $ * + ?
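As a brief preview of why these characters deserve attention, the following sketch uses the base R function grepl() on a hypothetical vector s. It shows that a dot used as a pattern matches any character, unless we treat the pattern as fixed text or escape the metacharacter:

```r
s <- c("abc", "a.c", "1+1=2")

# As a pattern, "." matches any character, so every string matches:
grepl(".", s)                 # TRUE TRUE TRUE

# Treating the pattern as fixed (literal) text:
grepl(".", s, fixed = TRUE)   # FALSE TRUE FALSE

# Escaping the metacharacter with a double backslash:
grepl("\\.", s)               # FALSE TRUE FALSE
grepl("\\+", s)               # FALSE FALSE TRUE
```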

See Section 9.8.2 for additional resources on Unicode characters and encodings.

9.2.3 Tools and tasks involving text

Given a basic understanding of what text is (short answer: data of type character, enclosed in quotation marks) and how we can enter various characters, we can ask ourselves:

  • Which tools exist in R to work with text?

The primary set of tools that come with every installation of R are the functions contained in base R packages. But as these functions have been added and developed further over time, they often lack a systematic organizing principle and can be somewhat confusing at first. Moreover, much of the functionality of text-related functions derives from the use of regular expressions, which allow specifying simple and complex patterns of characters, but can be quite challenging to construct.

As we have seen in previous chapters, R developers have written dedicated packages that address the (real or perceived) shortcomings in the base R lineup. In this chapter, we will encounter the stringr package (Wickham, 2019a), which is a core package of the tidyverse (Wickham et al., 2019) and provides a cohesive set of functions designed to make working with text (aka. “strings”) as easy as possible. stringr uses the stringi package (Gagolewski, 2020), which uses the ICU Unicode C library to provide fast and reliable implementations of common string manipulations.

The following sections provide an introduction to these tools, but to understand their organization, it is more helpful to ask a slightly different question:

  • Which tasks do we want to solve with text?

Given a large variety of tools, adopting a task-oriented view on text has the advantage that we can organize functions by the tasks that are of interest to us, rather than by the zoo of commands that has evolved over time and includes many strange creatures. In other words, to organize the tools, we sort them by the tasks they are designed to address.

Overview

Table 9.1 provides an overview of our resulting structure — and will hopefully make sense at the end of this chapter. At this point, just note that the leftmost column lists some tasks that we would like to solve with text data. The other two columns mention the names of functions that two prominent tools — base R and the stringr package — provide to address these tasks.

Table 9.1: Basic and advanced tasks of text manipulation (involving a string s and pattern p).

| Task | base R | stringr |
|---|---|---|
| A: Basic tasks | | |
| Measure the length of strings s: | nchar(s) | str_length(s) |
| Change chars in s to lower case: | tolower(s) | str_to_lower(s)\(^{2}\) |
| Change chars in s to upper case: | toupper(s) | str_to_upper(s)\(^{2}\) |
| Combine or collapse strings ...: | paste(...)\(^{1}\) | str_c(...) |
| Split a string s: | strsplit(s, split) | str_split(s, split)\(^{2}\) |
| Sort a character vector s: | sort(s) | str_sort(s)\(^{2}\) |
| Extract or replace substrings in s: | substr(s, start, stop)\(^{1}\) | str_sub(s, start, stop) |
| Translate old into new chars in s: | chartr(old, new, s) | |
| – Text as input or output: | print(), cat(), format(), readLines(), scan(), writeLines()\(^{1}\) | |
| B: Advanced tasks | | |
| View strings in s that match p: | | str_view(s, p)\(^{a}\) |
| Detect pattern p in strings s: | grep(p, s), grepl(p, s)\(^{1}\) | str_detect(s, p) |
| Locate pattern p in strings s: | gregexpr(p, s)\(^{1}\) | str_locate(s, p)\(^{a}\) |
| Obtain strings in s that match p: | grep(p, s, value = TRUE)\(^{1}\) | str_subset(s, p) |
| Extract matches of p in strings s: | regmatches(s, gregexpr(p, s))\(^{1}\) | str_extract(s, p)\(^{a}\) |
| Replace matches of p by r in s: | gsub(p, r, s)\(^{1}\) | str_replace(s, p, r)\(^{a}\) |
| Count matches of p in strings s: | | str_count(s, p) |

Table notes

  • \(^{1}\): base R functions with additional variants that tweak their functionality (see their documentation). For instance, the grep() family of functions also exists in agrep() versions that allow approximate matching of patterns to strings, using the generalized Levenshtein edit distance (i.e., the minimal, possibly weighted, number of insertions, deletions, and substitutions needed to transform one string into another). See also utils::adist() for computing the approximate string distance (as a generalized Levenshtein edit distance).

  • \(^{2}\): stringr functions with additional variants that tweak their functionality (see their documentation).

  • \(^{a}\): stringr functions with an additional suffix _all() that applies the function to all matches (rather than just to the first match).
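The approximate matching mentioned in the first table note can be illustrated by a minimal sketch. It uses the base R functions utils::adist() and agrepl() on a hypothetical vector fruits; the argument max.distance = 1 allows for at most one edit:

```r
# Edit distance between two strings
# (3 edits: substitute k by s, substitute e by i, insert g):
utils::adist("kitten", "sitting")          # 3

# Exact vs. approximate matching of a misspelled pattern:
fruits <- c("apple", "banana", "orange")
grepl("aple", fruits)                      # FALSE FALSE FALSE
agrepl("aple", fruits, max.distance = 1)   # TRUE FALSE FALSE
```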

Even without understanding anything else about the functions listed in the table, we can see that there is a large overlap in functionality. As most tasks can be addressed by both base R and stringr functions, our choice is often a matter of personal familiarity and preference.

But rather than covering every combination of task and tool, this chapter adopts the following strategy: The tasks collected in A: Basic tasks are well-addressed by base R functions. By contrast, the tasks collected in B: Advanced tasks involve the specification of symbol patterns, which can be described by regular expressions. Although base R also supports such tasks, the corresponding stringr functions are more coherently structured and more convenient to use.
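To give a first impression of both groups of tasks, the following sketch applies some base R functions from Table 9.1 to a hypothetical character vector s and pattern p. It only uses base R, so that it runs without installing additional packages:

```r
s <- c("Apple pie", "banana bread")
p <- "an"

# A: Basic tasks
nchar(s)                       # 9 12
toupper(s)                     # "APPLE PIE" "BANANA BREAD"
paste(s, collapse = " and ")   # "Apple pie and banana bread"
strsplit(s, split = " ")       # a list of 2 character vectors
substr(s, 1, 5)                # "Apple" "banan"
chartr("ab", "AB", s)          # "Apple pie" "BAnAnA BreAd"

# B: Advanced tasks (involving pattern p)
grepl(p, s)                    # FALSE TRUE
gsub(p, "AN", s)               # "Apple pie" "bANANa bread"
m <- regmatches(s, gregexpr(p, s))  # a list of all matches per string
lengths(m)                     # 0 2: number of matches per string
```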

Corresponding to this strategy, the rest of this chapter is structured into three main parts:

  • Manipulating characters and text in base R (Section 9.3)

  • Searching for patterns with regular expressions (Section 9.4)

  • More advanced manipulations of text with stringr (Wickham, 2019a) (Section 9.5)

References

Gagolewski, M. (2020). R package stringi: Character string processing facilities. Retrieved from http://www.gagolewski.com/software/stringi/

Neth, H. (2020). ds4psy: Data science for psychologists. Retrieved from https://CRAN.R-project.org/package=ds4psy

Wickham, H. (2019a). stringr: Simple, consistent wrappers for common string operations. Retrieved from https://CRAN.R-project.org/package=stringr

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Xie, Y. (2020b). knitr: A general-purpose package for dynamic report generation in R. Retrieved from https://CRAN.R-project.org/package=knitr


  1. Take a moment to ensure that your version of the RStudio IDE is set up to use UTF-8 as its default text encoding.