9.2 Essentials of text data

R is prepared to handle text in two distinct ways: First, it distinguishes between text and other data by providing a dedicated data type for storing objects of type character (aka strings or simply text). Second, it contains a few text constants and many functions that allow us to search for or modify character objects.

Before we consider the basic text constants and some basic text-related functions of R (in Section 9.3), we need to know how we can define character data (Section 9.2.1), how to enter special character symbols (Section 9.2.2), and which special characters exist in R (Section 9.2.3). And to organize the confusing array of packages and commands that deal with text in R, we adopt a perspective that prioritizes the tasks we may want to solve with text to structure the tools we use for solving them (in Section 9.2.4).

9.2.1 Entering text data

In R, text is represented as a sequence of characters, which can contain letters, numbers, and other symbols. To turn a sequence of characters into a text object, it needs to be enclosed in quotation marks (" or '):
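As a minimal sketch of this rule (the object names `greeting` and `word` are made up for illustration), both types of quotation marks create the same kind of object:

```r
# Double and single quotation marks both create character objects:
greeting <- "Hello"
word     <- 'world'

# The type of quotation mark is not stored with the object:
identical("abc", 'abc')

# Digits enclosed in quotation marks are text, not numbers:
is.numeric("123")
```

Note that R prints character objects with double quotation marks, regardless of how they were entered.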

The data type R provides for storing sequences of characters (or strings) is character and the mode of an object that holds a character string is character:
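The following sketch checks this with base R functions (the object `txt` is a hypothetical example):

```r
txt <- "The cat sat on the mat."

typeof(txt)        # data type of the object
mode(txt)          # mode of the object
is.character(txt)  # test for the character type

# A number is not of type character unless it is quoted:
is.character(123)
is.character("123")
```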

Sometimes we want to talk about a word, rather than using it in a sentence. This is typically signaled by enclosing the corresponding word in quotation marks. However, entering the following would result in an error:
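Since an invalid line cannot be part of runnable code, the following sketch stores the problematic input as text (in a hypothetical object `bad_input`) and only checks whether R could parse it:

```r
# The line we may be tempted to type (stored as text, as it is not valid R):
bad_input <- 'x <- "The word "word" is a 4-letter word."'

# parse() checks the syntax without evaluating; try() catches the error:
result <- try(parse(text = bad_input), silent = TRUE)

# Did parsing fail?
inherits(result, "try-error")
```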

Here, the first character string is "The word " and a second one is " is a 4-letter word.", which leaves the word in between hanging in no-string-land. Fortunately, R allows us to solve such problems by distinguishing between two types of quotation marks:
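A minimal sketch of this solution (with made-up object names `s1` to `s3`): we use one type of quotation mark for the string and the other type inside it. Escaping the inner quotation marks with a backslash is an alternative that anticipates the character constants of Section 9.2.3:

```r
# Single quotation marks inside double ones:
s1 <- "The word 'word' is a 4-letter word."

# Double quotation marks inside single ones:
s2 <- 'The word "word" is a 4-letter word.'

# Alternative: escape the inner quotation marks with a backslash:
s3 <- "The word \"word\" is a 4-letter word."

identical(s2, s3)  # both contain the same sequence of characters
writeLines(s2)     # shows the string without escape symbols
```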

When using quotation marks within a string, it is important that the inner quotation marks differ in type from the outer ones and that each pair is properly closed. For instance, the following 2 definitions both yield valid character objects:

whereas the following 4 definitions would all yield errors:

Thus, when using both types of quotation marks, they need to match up so that the inner ones are enclosed by the outer ones.
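The following sketch illustrates both cases with example strings of our own (not necessarily the ones used originally). The helper `is_valid_r()` is a hypothetical function that merely checks whether R can parse some input:

```r
# Hypothetical helper: TRUE if R can parse the code, FALSE otherwise
is_valid_r <- function(code) {
  !inherits(try(parse(text = code), silent = TRUE), "try-error")
}

# Candidate inputs, stored as text (with escaped quotation marks):
valid <- c("\"She said: 'Hi!'\"",     # "She said: 'Hi!'"
           "'She said: \"Hi!\"'")     # 'She said: "Hi!"'

invalid <- c("\"She said: \"Hi!\"\"", # "She said: "Hi!""
             "'She said: 'Hi!''",     # 'She said: 'Hi!''
             "\"She said: 'Hi!\"'",   # "She said: 'Hi!"'
             "'She said: \"Hi!'\"")   # 'She said: "Hi!'"

unname(sapply(valid, is_valid_r))
unname(sapply(invalid, is_valid_r))
```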

It may be a bit confusing at first that quotation marks can enclose any kind of character object — be it an individual symbol (e.g., "a" or "!"), a sentence (e.g., "The cat sat on the mat."), or much longer sequences of text (e.g., an article or book). In practice, people typically break down longer character objects into smaller pieces (e.g., vectors, matrices, or tables of character objects) before working with them.

But before we start to play with character strings, we need to take a closer look at their elementary parts. As it turns out, individual character symbols are much more complicated than most people think…

9.2.2 Entering character symbols

When thinking of character symbols, we primarily think of alphabetic letters (a, b, c, …, X, Y, Z) and numeric digits (0, 1, …, 9). However, a moment of reflection shows that we know and use many other symbols: for instance, mathematical operators (\(+\), \(-\), \(\times\), \(:\)), different types of dashes (-, –, —), parentheses and brackets ( ( [ { < > } ] ) ), and punctuation marks (. , ; : ! ?).

As soon as we venture beyond our native language or culture, we realize that the symbols we use are highly context-dependent. In Section 6.2.1 of Chapter 6, we saw how to use the readr function parse_character() to encode character objects like “El Niño” or “こんにちは” (by specifying the encoding in an appropriate locale).

To encounter instances of context-sensitivity in characters, we do not need to travel to exotic locations. Even within our own culture, the same symbol can have different meanings based on its context (e.g., note that the symbol “/” in “he/she” typically means OR, whereas it means divided by in “1/2”) or different symbols can have the same meaning (e.g., “1/2” as “1:2”).

Using rare symbols (e.g., Umlaut letters)

A comparison between computer keyboards used in different locations shows that they differ not only in layout, but also in the symbols they contain. As different languages contain different symbols, a common problem is: How can I type a symbol that my keyboard does not seem to know? For instance, my current keyboard is set to US-English. This has the benefit that I am familiar with it and it renders some special characters that I use a lot — like / and @ — more accessible than other layouts. The price for this convenience is that my keyboard lacks dedicated keys for German Umlaut letters (aka. diaeresis/diacritic, see Wikipedia: Diaeresis_diacritic and Wikipedia: Germanic Umlaut for details), which — given a first name of Hansjörg — I also need to type quite frequently.

So how can we type symbols that our keyboard lacks? Suppose we wanted to impart an insight of refined cultural wisdom, like:

  • Hansjörg says: “Der Käsereichtum Österreichs ist ungewöhnlich groß.”
    (which is German for: “The variety of Austrian cheeses is extraordinarily rich.”)

Assuming that we find ways of typing — or copying and pasting — the foreign characters in this German phrase, we could apply our knowledge about matching quotation marks (from above) and enter it as a single character string into R. This would yield:

Surprisingly, this seems to work pretty well, which is really quite remarkable, as it implies that R, R Markdown (including the knitr and rmarkdown packages that transform a .Rmd file into an output file), and our operating system agree on rather sensible settings for default character encodings.[1]

But what if we fail to find the Umlaut characters on our keyboard, lack a source to copy them from, or wanted to type even more exotic symbols (e.g., the ligature œ or the infinity symbol ∞)? There are many software solutions for this problem, of course, but most of them are context-dependent again (e.g., require a specific operating system or piece of software). A better solution is provided by the Unicode standard.

Using Unicode characters

A solution that generally works in R consists in finding and entering the appropriate Unicode standard character. For this, you need to know that R allows typing Unicode characters by entering \u... (with ... standing for a hexadecimal code of up to 4 digits) or \U... (with ... standing for a longer hexadecimal code of up to 8 digits) inside a character string and replacing the ... with the appropriate code of the desired symbol. For instance, if we wanted to type the Umlaut “ö” (i.e., a lowercase “o” with two dots on top), we could locate this symbol in a Unicode table and then type \u00F6 inside a character string:
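As a minimal sketch (the object name `name` is our own choice), the code 00F6 can be used as follows:

```r
# Typing the Umlaut "o" by its 4-digit Unicode code:
name <- "Hansj\u00F6rg"
print(name)

# The escape sequence counts as a single character:
nchar(name)

# The hexadecimal code F6 corresponds to the decimal code point 246:
utf8ToInt("\u00F6")
```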

This may look a bit cryptic, but if we want to store some character data in files and exchange them with others, explicitly entering Unicode characters works more robustly across different platforms and systems.

But where can we find the Unicode character codes for the symbols that we want to type? To find these codes, we can consult long lists of Unicode characters, for instance the charts at unicode.org or the Wikipedia: List of Unicode characters. Such tables allow us to look up the other characters mentioned above:

Equipped with these Unicode symbols, we can compose helpful spelling recommendations like:
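The following sketch looks up the two symbols (U+0153 for the ligature and U+221E for the infinity symbol) and combines them into a sentence. The object names `lig_oe` and `infty` are assumptions; the name `oe` is borrowed from the text, but the exact content of the original object is a guess:

```r
# Unicode codes of the two symbols mentioned above:
lig_oe <- "\u0153"  # ligature oe
infty  <- "\u221E"  # infinity symbol

# Composing a spelling recommendation:
oe <- paste0("The name 'Hansj\u00F6rg' is transcribed as 'Hansjoerg', ",
             "not with '", lig_oe, "' or '", infty, "'.")
writeLines(oe)
```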

The Unicode characters in our character object oe also work in the HTML output of our R Markdown file:

  • The name ‘Hansjörg’ is transcribed as ‘Hansjoerg’, not with ‘œ’ or ‘∞’.

Actually, Unicode characters with a code consisting of 4 hex-digits \(nnnn\) should be entered as \unnnn (i.e., with a lowercase letter u). These 4-digit codes often also work with \Unnnn (i.e., an uppercase \U), but this occasionally fails: when the next character in the string is also a valid hex-digit (0–9 or a–f), R can no longer decide where the Unicode sequence ends.

Here are two sentences that show potential problems when using \U... rather than \u.... Suppose we wanted to enter the following words of wisdom:

  • “Sören will den großen Käse aus Oberösterreich vergüten.”
  • “Après le petit-déjeuner je vais aller à l’école.”

Switching the problematic characters from \U... to \u... solves both problems:

Alternatively, we can disambiguate the string by isolating the problematic Unicode character with the paste0() function:
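The following sketch illustrates the problem and both remedies for one word from each sentence (all object names are made up). In “großen” and “école”, the special character is followed by a letter that is also a valid hex-digit (e or c), so that \U... absorbs it into the code:

```r
# Problem: \U reads up to 8 hex-digits, so "e" becomes part of the code:
wrong <- "gro\U00DFen"   # interpreted as \U00DFE, followed by "n"

# Solution 1: \u reads at most 4 hex-digits:
right <- "gro\u00DFen"

# Solution 2: isolate the Unicode character by paste0():
pasted <- paste0("gro", "\U00DF", "en")

identical(wrong, right)   # the two strings differ
identical(pasted, right)  # both solutions agree

# The same issue arises for the French "é" followed by "c":
wrong_fr <- "l'\U00E9cole"
right_fr <- "l'\u00E9cole"
identical(wrong_fr, right_fr)
```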

Thus, it is smart to use the lowercase version \u... whenever this is possible.

Umlaut characters

But we have yet to cope with the rich variety of Austrian cheeses — or rather the frequent need for Umlaut characters in the German language. Looking up some additional symbols in a Unicode table yields the following codes for the 7 additional characters commonly occurring in the German language:

Using these codes within character strings allows us to type our desired phrase as follows:
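A sketch of both steps, assuming the standard Unicode codes for these 7 letters (the names `uml_codes` and `phrase` are our own):

```r
# Unicode codes of the 7 additional German letters:
uml_codes <- c(ae = "\u00E4", oe = "\u00F6", ue = "\u00FC",
               Ae = "\u00C4", Oe = "\u00D6", Ue = "\u00DC",
               ss = "\u00DF")
uml_codes

# Typing the phrase by using these codes:
phrase <- "Der K\u00E4sereichtum \u00D6sterreichs ist ungew\u00F6hnlich gro\u00DF."
writeLines(phrase)
```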

German and other non-English users of R can hope that their systems (not just R and supporting software, but also their keyboards and operating systems) are set up so that they can simply type the letters they need. Alternatively, they can look up and memorize the Unicode codes for the additional symbols they frequently need.

To simplify typing the 7 German Umlaut letters without needing to remember their Unicode codes, the ds4psy package (Neth, 2020) provides a named vector Umlaut. For instance, to get or type the Umlaut for a character o, we can simply use Umlaut["o"] and combine it with other character strings:
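To keep the following sketch independent of the ds4psy package, it defines a hand-made stand-in `my_umlaut`. Its names and contents are assumptions that mimic the idea of a named vector, not the package object itself:

```r
# Hypothetical stand-in for a named vector of Umlaut letters:
my_umlaut <- c(a = "\u00E4", o = "\u00F6", u = "\u00FC",
               A = "\u00C4", O = "\u00D6", U = "\u00DC",
               s = "\u00DF")

# Selecting a letter by its name and pasting it into a word:
paste0("Hansj", my_umlaut["o"], "rg")
paste0("K", my_umlaut["a"], "se")
```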

Additional Unicode characters

By browsing through the lists of Unicode characters (e.g., at unicode.org and Wikipedia) we can discover lots of interesting symbols that we never knew we needed. Although this can be fun, it can still be difficult to get this to work properly. In my experience, only about half of the rarer symbols appear as intended on my system. Some examples that work well for me include:

  • Greek letters:
  • Card suits:
  • Dice symbols:
  • Dingbats:
  • Pointers:
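The following sketch defines a few symbols from these categories. The selection and the object names are our own, and whether the symbols are displayed correctly depends on the fonts and settings of the system:

```r
# Greek letters:
greek <- c(alpha = "\u03B1", beta = "\u03B2", gamma = "\u03B3")

# Card suits:
suits <- c(spades = "\u2660", hearts = "\u2665",
           diamonds = "\u2666", clubs = "\u2663")

# Dice symbols (6 consecutive code points, converted from integers):
dice <- intToUtf8(0x2680:0x2685, multiple = TRUE)

# Dingbats (scissors, airplane, envelope) and a pointer:
dingbats <- c("\u2702", "\u2708", "\u2709")
pointer  <- "\u261E"

greek
suits
dice
```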

However, other symbols, for which we also find Unicode definitions, only partly work or fail to work:

  • Dashes (should have 3 different lengths):
  • Emoticons (are not displayed within R, but do show up in R Markdown, below):
  • Pictographs and miscellaneous other symbols (seem to be risky bets):

We can get some of the missing symbols to show up in R Markdown documents by using the asis_output() function of the knitr package (Xie, 2020b):
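A minimal sketch of this approach, using object names of our own and a small selection of symbols. The call to knitr::asis_output() is only evaluated if the knitr package is installed:

```r
# Dashes of 3 different lengths (en dash, em dash, horizontal bar):
dashes <- c("\u2013", "\u2014", "\u2015")

# An emoticon requires a longer code and thus the uppercase \U:
smiley <- "\U0001F642"

# Marking text as output to be written "as is" into the document:
if (requireNamespace("knitr", quietly = TRUE)) {
  out <- knitr::asis_output(paste(dashes, collapse = " "))
  class(out)
}
```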

By using the knitr::asis_output() function inline (i.e., within an R code chunk in R Markdown documents, see Section F.3.3 of Appendix F) the corresponding symbols even show up in our HTML output:

  • Dashes: – — ―

  • Emoticons: 🙂 😅 😾 😇 😱

  • Misc. symbols: ☕ ♋ ⚧ ♿ ⚽ ⛵

As the universe of Unicode symbols is vast, it is easy to get lost in charts of available symbols. But before we spend too much time on this, it is important to consider yet another category of character symbols — the set of metacharacters that have a special meaning within the R language.

9.2.3 Special characters

Having dealt with foreign and exotic characters, we need to close this section with a caveat: Some familiar characters can assume special meanings inside of character strings. Here, we distinguish between so-called metacharacters and character constants.

Metacharacters

R may be a letter, a language, a challenge, or an attitude, but it is not a metacharacter. The 12 so-called metacharacters in R are:

  • . \ | ( ) [ { ^ $ * + ?

At this point, it is sufficient to take note of these characters and remember that they may cause trouble when working with text. The reasons and the remedies for this trouble are explained in Appendix E on using regular expressions.

To make them easily accessible, the ds4psy package (Neth, 2020) provides these metacharacters as a character vector metachar:
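The following sketch uses a hand-made vector `my_metachar` (a stand-in of our own, so that the code runs without the ds4psy package) and hints at the kind of trouble that metacharacters can cause when searching text:

```r
# The 12 metacharacters (the backslash needs to be doubled when typed):
my_metachar <- c(".", "\\", "|", "(", ")", "[", "{", "^", "$", "*", "+", "?")
length(my_metachar)

# As a pattern, "." matches any character, not just a dot:
grepl(".", "abc")

# Remedy 1: interpret the pattern literally:
grepl(".", "abc", fixed = TRUE)

# Remedy 2: escape the metacharacter in the pattern:
grepl("\\.", "a.c")
```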

See Section 9.8.2 for additional resources on Unicode characters and encodings.

Character constants

Besides metacharacters, there are also so-called character constants. These are characters with a special meaning in R that must be preceded by a backslash \ when they are typed inside a character string. The most common character constants are:

  • \n newline
  • \r carriage return
  • \t tab
  • \b backspace
  • \f form feed
  • \' ASCII apostrophe
  • \" ASCII quotation mark
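The following sketch (with made-up example strings) shows that each of these constants is typed as two symbols, but counts as a single character, and that print() and writeLines() display them differently:

```r
x <- "a\tb\nc"  # a, tab, b, newline, c

nchar(x)        # each constant counts as 1 character
print(x)        # shows the escape sequences
writeLines(x)   # interprets the tab and the newline

# Escaped quotation marks inside a string:
q <- "She said: \"Hi!\""
writeLines(q)
nchar("\"")
```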

Evaluate ?"'" to obtain a complete list of character constants and note that these character constants are not to be confused with the built-in constants of base R (to be discussed in Section 9.3.1), most of which are of type character.[2]

9.2.4 Tools and tasks involving text

Given a basic understanding of what text is (short answer: data of type character, enclosed in quotation marks) and how we can enter various types of characters, we can ask ourselves:

  • Which tools exist in R to work with text?

The primary set of tools that come with every installation of R are the functions contained in base R packages. But as these functions have been added and developed further over time, they often lack a systematic organizing principle and can be somewhat confusing at first. Moreover, much of the functionality of text-related functions derives from the use of regular expressions, which allow specifying simple and complex patterns of characters, but can be quite challenging to construct.

As we have seen in previous chapters, R developers have written dedicated packages that address the (real or perceived) shortcomings in the base R lineup. In this chapter, we will encounter the stringr package (Wickham, 2019b), which is a core package of the tidyverse (Wickham, Averick, et al., 2019) and provides a cohesive set of functions designed to make working with text (aka. “strings”) as easy as possible. stringr uses the stringi package (Gagolewski, 2020), which uses the ICU Unicode C library to provide fast and reliable implementations of common string manipulations.

The following sections provide an introduction to these tools, but to understand their organization, it is more helpful to ask a slightly different question:

  • Which tasks do we want to solve with text?

Given a large variety of tools, adopting a task-oriented view on text has the advantage that we can organize functions by the tasks that are of interest to us, rather than by the zoo of commands that has evolved over time and includes many strange creatures. In other words, to organize the tools, we sort them by the tasks they are designed to address.

Overview

Table 9.1 provides an overview of our resulting structure — and will hopefully make sense at the end of this chapter. At this point, just note that the leftmost column lists some tasks that we would like to solve with text data. The other two columns mention the names of functions that two prominent tools — base R and the stringr package — provide to address these tasks.

Table 9.1: Basic and advanced tasks of text manipulation (involving a string s and pattern p).
| Task | base R | stringr |
|:-----|:-------|:--------|
| A: Basic tasks | | |
| Measure the length of strings s: | nchar(s) | str_length(s) |
| Change chars in s to lower case: | tolower(s) | str_to_lower(s)\(^{2}\) |
| Change chars in s to upper case: | toupper(s) | str_to_upper(s)\(^{2}\) |
| Combine or collapse strings ...: | paste(...)\(^{1}\) | str_c(...) |
| Split a string s: | strsplit(s, split) | str_split(s, split)\(^{2}\) |
| Sort a character vector s: | sort(s) | str_sort(s)\(^{2}\) |
| Extract or replace substrings in s: | substr(s, start, stop)\(^{1}\) | str_sub(s, start, stop) |
| Translate old into new chars in s: | chartr(old, new, s) | |
| – Text as input or output: | print(), cat(), format(), readLines(), scan(), writeLines()\(^{1}\) | |
| B: Advanced tasks | | |
| View strings in s that match p: | | str_view(s, p)\(^{a}\) |
| Detect pattern p in strings s: | grep(p, s), grepl(p, s)\(^{1}\) | str_detect(s, p) |
| Locate pattern p in strings s: | gregexpr(p, s)\(^{1}\) | str_locate(s, p)\(^{a}\) |
| Obtain strings in s that match p: | grep(p, s, value = TRUE)\(^{1}\) | str_subset(s, p) |
| Extract matches of p in strings s: | regmatches(s, gregexpr(p, s))\(^{1}\) | str_extract(s, p)\(^{a}\) |
| Replace matches of p by r in s: | gsub(p, r, s)\(^{1}\) | str_replace(s, p, r)\(^{a}\) |
| Count matches of p in strings s: | | str_count(s, p) |

Table notes

  • \(^{1}\): base R functions with additional variants that tweak their functionality (see their documentation).

  • \(^{2}\): stringr functions with additional variants that tweak their functionality.

  • \(^{a}\): stringr functions with an additional suffix _all() that applies the function to all matches (rather than just to the first match).
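To give a first impression of these tasks, the following sketch applies some base R functions of Table 9.1 to an example string of our own. It only uses base R, so that it runs without any additional packages:

```r
s <- "The cat sat on the mat."

# A: Basic tasks
nchar(s)                                # measure the length of s
toupper(s)                              # change to upper case
paste("cat", "mat", sep = " & ")        # combine strings
words <- strsplit(s, split = " ")[[1]]  # split s into words
sort(words)                             # sort a character vector
substr(s, start = 5, stop = 7)          # extract a substring
chartr(old = "a", new = "u", s)         # translate characters

# B: Advanced tasks (with simple patterns)
grepl("at", words)                          # detect a pattern
grep("at", words, value = TRUE)             # obtain matching strings
regmatches(s, gregexpr("[a-z]at", s))[[1]]  # extract all matches
gsub("at", "AT", s)                         # replace all matches
lengths(regmatches(s, gregexpr("at", s)))   # count the matches
```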

Both the base R and the stringr functions of Table 9.1 provide many options that are not listed here. For instance,

  • the grep() family of functions can be made case-insensitive by setting ignore.case = TRUE, and the agrep() variants allow approximate matching of patterns to strings, using the generalized Levenshtein edit distance (i.e., the minimal, possibly weighted, number of insertions, deletions, and substitutions needed to transform one string into another). See also utils::adist() for computing this approximate string distance.

  • the pattern argument of many stringr functions can be used in combination with so-called modifiers that govern the interpretation of a regular expression and allow setting additional arguments (e.g., ignore_case = TRUE). The default modifier of a given pattern is regex(pattern) (see ?stringr::regex for the documentation).
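The following sketch illustrates some of these options with made-up example strings. The stringr part is only evaluated if the package is installed:

```r
s <- c("The cat sat.", "A CAT ran.", "No dogs here.")

# Case-sensitive vs. case-insensitive matching in base R:
grepl("cat", s)
grepl("cat", s, ignore.case = TRUE)

# Approximate matching (allowing for an edit distance of 1):
agrepl("lasy", "1 lazy 2", max.distance = 1)

# Generalized Levenshtein edit distance between two strings:
utils::adist("kitten", "sitting")

# The stringr counterpart uses a modifier for the pattern:
if (requireNamespace("stringr", quietly = TRUE)) {
  stringr::str_detect(s, stringr::regex("cat", ignore_case = TRUE))
}
```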

Even without understanding anything else about the functions listed in the table, we can see that there is a large overlap in functionality. As most tasks can be addressed by both base R and stringr functions, our choice is often a matter of personal familiarity and preference.

But rather than covering every combination of tasks and tools, this chapter adopts the following strategy: The tasks collected in A: Basic tasks are well-addressed by base R functions. By contrast, the tasks collected in B: Advanced tasks typically involve the specification of symbol patterns, which are described by regular expressions (see Appendix E for a primer on using regular expressions). Although base R also provides support for these more advanced tasks, the corresponding functions are more coherently structured and conveniently performed by the stringr package.

Corresponding to this distinction, the rest of this chapter is structured into two parts:

  • Basic text manipulation with base R commands (Section 9.3)

  • Advanced text manipulation with stringr commands (Section 9.4)

Note that even the “advanced” tasks in Section 9.4 are still pretty basic by most standards. They are distinguished from the tasks discussed in Section 9.3 by the use of the more systematic functions provided by the stringr package (Wickham, 2019b) and by assuming some familiarity with using regular expressions (see Appendix E).

References

Gagolewski, M. (2020). R package stringi: Character string processing facilities. Retrieved from http://www.gagolewski.com/software/stringi/

Neth, H. (2020). ds4psy: Data science for psychologists. Retrieved from https://CRAN.R-project.org/package=ds4psy

Wickham, H. (2019b). stringr: Simple, consistent wrappers for common string operations. Retrieved from https://CRAN.R-project.org/package=stringr

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Xie, Y. (2020b). knitr: A general-purpose package for dynamic report generation in R. Retrieved from https://yihui.org/knitr/


  1. Take a moment to ensure that your version of the RStudio IDE is set up to use UTF-8 as its default text encoding.

  2. This not only sounds confusing, but actually is. Working with text within a programming language that uses text to type commands inevitably blurs and transcends boundaries. For instance, when using R or R Markdown, it can be quite confusing that symbols like brackets, parentheses, the backslash \ or the ASCII accent grave (used to invoke a code environment in R Markdown) are used in many different roles and meanings.