9.3 Basic text manipulation

This section covers the elementary constants and functions for manipulating character data that are provided by base R. Base R also provides support for regular expressions and more advanced tasks, which are addressed by Appendix E and used in Section 9.4.

Above (in Section 9.2.1), we defined some simple objects of type character:

In the following, we first explicate some facts that we have encountered repeatedly throughout this book: R comes with pre-defined elements of data and we can collect text objects in vectors and add them to other (rectangular) data structures. To work with text-related data, R provides a range of commands that allow basic manipulations of character objects. Some straightforward examples of such functions have already been used and include:

The rest of this section answers the following three questions:

  • Which text-related constants exist in R?
  • How can we combine and add text objects to other data structures (vectors, matrices, and tables)?
  • Which functions for solving basic text-related tasks should we know?

As the first two questions can be answered quickly, most of this section will cover text-related functions provided by base R.

9.3.1 Text constants

R contains a few built-in constants. Constants are pre-defined data that should only be changed for good reason. Apart from the constant pi, which evaluates to the numeric double 3.1415927 (unless we change it by assigning it to a different value), the following ones are vectors of type character:

  1. LETTERS: 26 upper-case letters of the Roman alphabet;
  2. letters: 26 lower-case letters of the Roman alphabet;
  3. month.abb: 3-letter abbreviations for the English month names;
  4. month.name: English names for the months of the year.

As these four constants come as character vectors, we can select their elements by indexing (just as with any other vector in R):
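As a minimal sketch of such indexing (the object names on the left are chosen here for illustration only), we could select elements by numeric or logical indices:

```r
# Select elements of the built-in character constants by indexing
first_three <- LETTERS[1:3]                    # "A" "B" "C"
last_three  <- letters[24:26]                  # "x" "y" "z"
edge_months <- month.abb[c(1, 12)]             # "Jan" "Dec"
march       <- month.name[month.abb == "Mar"]  # "March" (logical indexing)
```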

In rare instances, it may make sense to provide names to character vectors so that their elements can be selected by their names (which are also character objects). Above, we encountered the named Umlaut vector from the ds4psy package (in Section 9.2.2):

Having encountered many character vectors and tables that contain character objects, we need to ask:

  • How can we create character vectors and other data structures that collect and store text data?

9.3.2 Data structures for text

Which data structures have we encountered so far? Back in Chapter 1: Basic R concepts and commands, we learned that R relies primarily on linear and rectangular data structures (i.e., vectors and tables) to store data (see Sections 1.4 and 1.5). Fortunately, this applies to character data just as it does to numerical, logical, or temporal data. Hence, we do not need to learn any new commands for integrating character objects into larger data structures. Nevertheless, we will briefly refresh our memory by providing some examples.

Character vectors

Just as with other R objects, the concatenate (or combine) function c() turns a sequence of character objects (or strings of text) into a vector of type character:

Many R functions that we previously used with numeric vectors also work with character data:

while others clearly would make no sense (as they require logical or numeric arguments):

As the length() function provides the number of elements in a vector, measuring the length of text objects (in terms of their number of characters) requires a different function, nchar():
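The following sketch illustrates these points with a hypothetical vector fruits (not an object from this book): creating a character vector, applying generic vector functions to it, failing with a numeric function, and contrasting length() with nchar():

```r
fruits <- c("apple", "banana", "cherry")  # hypothetical example vector

type_f <- typeof(fruits)                  # "character"
rev_f  <- rev(fruits)                     # "cherry" "banana" "apple"
sort_f <- sort(fruits, decreasing = TRUE) # "cherry" "banana" "apple"

# Functions that require numeric arguments fail for character data:
res <- try(sum(fruits), silent = TRUE)    # yields an error (captured by try)
failed <- inherits(res, "try-error")      # TRUE

n_elements <- length(fruits)              # 3 (number of elements)
n_chars    <- nchar(fruits)               # 5 6 6 (characters per element)
```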

Accessing the elements of a character vector works exactly like accessing the elements of any other vector. For instance, we can use numeric or logical indexing:
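A brief sketch of both kinds of indexing, again using a hypothetical vector fruits:

```r
fruits <- c("apple", "banana", "cherry", "date")  # hypothetical example vector

# Numeric indexing:
second   <- fruits[2]         # "banana"
ends     <- fruits[c(1, 4)]   # "apple" "date"
no_first <- fruits[-1]        # all elements except the first

# Logical indexing:
long_names <- fruits[nchar(fruits) > 5]  # "banana" "cherry"
no_date    <- fruits[fruits != "date"]   # "apple" "banana" "cherry"
```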

Actually, strings in vectors are as contagious as NA values in calculations: When combining numbers or other data types with a character object, the entire vector is coerced into type and mode character:

The reason for the contagious nature of characters is clear: While it is difficult to interpret an object of type character as a number (which number corresponds to the word “tree”?), it is straightforward to interpret numbers as a series of character symbols (i.e., digits). Characters simply are the common denominator of numbers and text.
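As a small illustration (with object names chosen here for the example), combining a number, a logical value, and a string yields a character vector, whereas the reverse conversion only works for strings that look like numbers:

```r
mixed <- c(1, TRUE, "tree")   # a number, a logical value, and a string
mixed                         # "1" "TRUE" "tree"
type_mixed <- typeof(mixed)   # "character"

# Interpreting text as a number only works for digit strings:
num_ok  <- as.numeric("42")                      # 42
num_bad <- suppressWarnings(as.numeric("tree"))  # NA (with a warning)
```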

The function character() takes a numeric argument length and creates a vector of a corresponding number of empty strings:

This may seem unnecessary at this point, but becomes useful when initializing data structures to be filled by the results of a vector operation or a for loop (see Chapter 12 on Iteration for examples):
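A minimal sketch of this pattern (the names n, empty, and out, as well as the "item_" prefix, are illustrative assumptions):

```r
n <- 3
empty <- character(n)   # three empty strings: "" "" ""

# Initialize a result vector and fill it in a for loop:
out <- character(n)
for (i in seq_len(n)) {
  out[i] <- paste0("item_", i)
}
out                     # "item_1" "item_2" "item_3"
```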

The corresponding functions as.character() and is.character() coerce objects into text strings or test whether objects are text strings:
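For instance, a short sketch with arbitrary example values:

```r
num_as_chr <- as.character(c(1.5, 2, 10))  # "1.5" "2" "10"
lgl_as_chr <- as.character(TRUE)           # "TRUE"

chr_test_1 <- is.character("1")            # TRUE
chr_test_2 <- is.character(1)              # FALSE
```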

Tabular data structures

Just as with other data types, it is possible and common to store character data in a matrix:

Matrices are useful and efficient structures for storing data, but have the same limitation as vectors in R: They can only store data of a single data type. As we are usually working with data of multiple types (e.g., combinations of character, numeric, and logical variables) our data is typically stored in the form of data frames or tibbles:
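The following sketch contrasts both kinds of structure. The data (names, ages, student status) are made up for illustration, and a base R data frame is used rather than a tibble to avoid any package dependency:

```r
# A character matrix (filled column-wise by default):
m <- matrix(letters[1:6], nrow = 2)
m[2, 3]      # "f"

# Mixing types in a matrix coerces everything to character:
m2 <- cbind(name = c("Ann", "Bob"), age = c(30, 25))
typeof(m2)   # "character"

# A data frame stores each column with its own type:
df <- data.frame(name = c("Ann", "Bob"),
                 age = c(30, 25),
                 student = c(TRUE, FALSE),
                 stringsAsFactors = FALSE)
col_classes <- sapply(df, class)  # "character" "numeric" "logical"
```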

As data frames and tibbles were covered extensively in Chapters 1 and 5 (e.g., see Sections 1.5.2 and 5.2), we can devote the rest of this section to functions and the tasks we can address with them.

9.3.3 Text functions

The base R package contains a range of dedicated functions to deal with text objects. As we see them frequently wherever strings are being manipulated (both in this book and in the code of others), it is good to be familiar with them. And even if we should eventually decide to use the alternative functions provided by the stringr package, we should know why we are doing so.

Actually, we have encountered quite a few text-related functions in the previous chapters of this book. For instance, even Chapter 1: Basic R concepts and commands contained the functions nchar() and substr(), as well as the constant LETTERS. Similarly, the paste() function and its variant paste0() appeared in Chapters 1, 2, 5, and in this chapter.

As we will see, the set of text-related functions in base R is a bit like a zoo. There is a mix of some mundane and some amazing creatures, but it can be hard to see their connections or an organizing principle. Thus, we only cover basic text-manipulation functions here and rely on the stringr package for more advanced tasks of string manipulation (in Section 9.4).

9.3.4 Basic tasks with text

Table 9.2 contains the basic tasks of our summary table (from Section 9.2.4).

Table 9.2: Basic tasks of text manipulation (involving a string s).
| Task | Base R | stringr |
| --- | --- | --- |
| A: Basic tasks | | |
| Measure the length of strings s | nchar(s) | str_length(s) |
| Change chars in s to lower case | tolower(s) | str_to_lower(s)\(^{2}\) |
| Change chars in s to upper case | toupper(s) | str_to_upper(s)\(^{2}\) |
| Combine or collapse strings ... | paste(...)\(^{1}\) | str_c(...) |
| Split a string s | strsplit(s, split) | str_split(s, split)\(^{2}\) |
| Sort a character vector s | sort(s) | str_sort(s)\(^{2}\) |
| Extract or replace substrings in s | substr(s, start, stop)\(^{1}\) | str_sub(s, start, stop) |
| Translate old into new chars in s | chartr(old, new, s) | |
| Text as input or output | print(), cat(), format(), readLines(), scan(), writeLines()\(^{1}\) | |

Table notes

  • \(^{1}\): base R functions with additional variants that tweak their functionality (see their documentation).

  • \(^{2}\): stringr functions with additional variants that tweak their functionality (see their documentation).

For these basic tasks, we will focus on the base R functions and only briefly show their stringr alternatives.

Measuring string length

We have already seen above that the nchar() function differs from the length() function. Whereas length() provides the number of elements in a vector, the nchar() function provides the length of text objects:

The stringr function str_length() provides an alternative to nchar():

Changing character case

A characteristic property of letters in many alphabets is that they exist in lowercase and uppercase forms. As R is case-sensitive (e.g., x and X are different objects), it makes sense that character objects in lower- vs. uppercase are also distinguished (e.g., the vowels aeiou are different from those in AEIOU).

The base R functions tolower() and toupper() allow changing the case of text strings:
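A brief sketch, using the vowels mentioned above and an arbitrary greeting as inputs:

```r
vowels <- "aeiou"

up  <- toupper(vowels)           # "AEIOU"
low <- tolower("Hello World!")   # "hello world!"

# Strings that differ only in case are not equal:
same_case <- toupper(vowels) == "AEIOU"  # TRUE
diff_case <- vowels == "AEIOU"           # FALSE
```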

The direct stringr alternatives to tolower() and toupper() are str_to_lower() and str_to_upper(). The functions str_to_sentence() and str_to_title() provide further variations on the same theme with some locale-specific aspects:

As issues of capitalization come up regularly when working with text, the capitalize() and caseflip() functions of the ds4psy package provide similar functionality with slightly different options, output formats, and trade-offs:

Combining or collapsing strings

The paste() function is the string-related workhorse of the base R package. In its most basic form, paste() combines multiple character objects into one character object. Although this behavior is also referred to as “concatenating vectors” (e.g., in R’s documentation), it is instructive to contrast the behavior of paste() with that of the c() function:
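A minimal sketch of this contrast, using three arbitrary single-letter strings:

```r
v <- c("a", "b", "c")      # a vector of three strings
p <- paste("a", "b", "c")  # one string: "a b c"

length(v)  # 3
length(p)  # 1
nchar(p)   # 5 (three letters plus two separating spaces)
```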

Thus, whereas c() combines multiple objects into a vector, paste() combines multiple strings into a single one.

Interestingly, paste() also works with non-character objects, but coerces them into character objects:
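For example (with arbitrary values), numbers and logical values are turned into text before being pasted:

```r
p1 <- paste(1, TRUE, "x")     # "1 TRUE x"
p2 <- paste("Year", 2020 + 1) # "Year 2021" (the sum is evaluated first)
is.character(p1)              # TRUE
```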

The paste() function has two arguments sep and collapse: The sep argument of paste() allows selecting the separation character(s) between the terms to be pasted (and is set to " " by default):

As we often want to create a character string without any separation between the original terms, the paste0() variant of paste() is a convenient wrapper for paste(..., sep = ""):
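The effect of sep and the equivalence of paste0() can be sketched as follows (the two words are arbitrary examples):

```r
d1 <- paste("data", "science")             # "data science" (default sep = " ")
d2 <- paste("data", "science", sep = "_")  # "data_science"
d3 <- paste("data", "science", sep = "")   # "datascience"
d4 <- paste0("data", "science")            # "datascience"

identical(d3, d4)  # TRUE
```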

The collapse argument of paste() is set to NULL by default. To understand its usefulness, consider the following examples:

Contrary to what we may have expected, this paste() command did not yield a single character string, but a vector of three character objects. The reason for this is simple: The input we provided to paste() was not multiple character objects, but a vector of character objects (here: letters[1:3], which evaluates to a, b, c).

But as R usually works with vectors, we often want to create a single character object (or string of text) out of a vector of character objects. This can also be achieved with paste(), but requires setting the collapse argument to some value other than NULL:
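A minimal sketch of the difference, assuming the vector letters[1:3] as input (the name abc is chosen here for illustration):

```r
abc <- letters[1:3]

no_collapse <- paste(abc)                   # "a" "b" "c" (still 3 elements)
collapsed   <- paste(abc, collapse = "")    # "abc"
with_commas <- paste(abc, collapse = ", ")  # "a, b, c"

length(no_collapse)  # 3
length(collapsed)    # 1
```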

Thus, collapse works just like sep, but is used when we want to combine a vector of strings into a single one:

If both sep and collapse are provided, both are used: sep separates the terms that are combined element by element, and collapse then joins the resulting strings into a single one:

A nifty feature of paste() is that it also works with vectors of different length, in which case the term with fewer elements is recycled (i.e., repeated to match the maximum number of elements):
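The interplay of recycling, sep, and collapse can be sketched with a few arbitrary inputs:

```r
# Recycling: the shorter term is repeated to match the longer one
r1 <- paste("x", 1:3)             # "x 1" "x 2" "x 3"
r2 <- paste0(c("a", "b"), 1:4)    # "a1" "b2" "a3" "b4"

# sep joins terms element by element; collapse then joins the results
r3 <- paste("x", 1:3, sep = "_", collapse = " + ")  # "x_1 + x_2 + x_3"
```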

The stringr alternative to paste() is str_c(), which emphasizes the connection to c(). The str_c() function also takes two arguments sep and collapse, but the default of sep is set to "" (i.e., no separation):

In addition, the tidyverse contains a glue package (Hester, 2020) that makes it easier to combine text strings and other data elements (see https://github.com/tidyverse/glue for details).

Splitting strings

Splitting a string into multiple strings is the opposite of pasting (i.e., combining or collapsing) strings. The strsplit() function takes two main arguments: x is a string (or multiple strings) and split denotes a character object, or a pattern defined as a regular expression (see Appendix E), at which the string x is to be split. Intuitively, we would expect a sequence of x <- paste(v, collapse = " ") and strsplit(x, split = " ") to yield v again, but see what happens:

It turns out that strsplit() returns a list, rather than a vector. To obtain the original vector v, we need to unlist() the list y:
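The round trip can be sketched as follows, assuming a short hypothetical vector v of words:

```r
v <- c("to", "be", "or", "not")  # hypothetical example vector
x <- paste(v, collapse = " ")    # "to be or not"
y <- strsplit(x, split = " ")    # a list with one element (a character vector)

class(y)                 # "list"
v2 <- unlist(y)          # back to a character vector
identical(v2, v)         # TRUE
```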

What do we need strsplit() for? A typical application of using strsplit() (often followed by unlist()) consists in splitting a longer string of text (e.g., a paragraph or chapter) into individual words or sentences:

Unfortunately, we cannot just enter strsplit(p, split = ".") to split the paragraph p at every full stop (i.e., the punctuation symbol .) into the sentences it contains. The solution for this task also involves a version of strsplit() but the split argument looks unusually cryptic:

We already know the reason for this (from Section 9.2.2): The punctuation symbol . is one of the 12 so-called metacharacters that have a special meaning in matching patterns:

  • . \ | ( ) [ { ^ $ * + ? (see the metachar vector of ds4psy).

Unfortunately, understanding the solution(s) shown here requires that we first learn more about regular expressions (see Appendix E). What we can do at this point, however, is adding the missing full stops to the end of each sentence in s2 and verifying that we then have regained our original vector of sentences s1:
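The following is only a sketch of this idea, not necessarily the solution used in the original example: the three sentences in s1 are made up, and the pattern "\\. *" (an escaped full stop followed by any number of spaces) is one of several possible choices.

```r
s1 <- c("This is a sentence.", "Here is another one.", "And a third.")
p  <- paste(s1, collapse = " ")   # one paragraph of text

# An unescaped "." matches every character, so only empty strings remain:
bad <- unlist(strsplit(p, split = "."))
all(bad == "")                    # TRUE

# Escaping the full stop splits at the end of each sentence:
s2 <- unlist(strsplit(p, split = "\\. *"))
s2                                # sentences without their full stops

# Adding the missing full stops recovers the original sentences:
s3 <- paste0(s2, ".")
identical(s3, s1)                 # TRUE
```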

As splitting text into sentences or words is a frequent task, but turns out to be surprisingly difficult, the ds4psy package provides two corresponding helper functions: text_to_sentences() and text_to_words(). However, under the hood, these functions are really just convenient wrappers for a series of strsplit() and unlist() commands:

The stringr alternative to strsplit() is str_split(). The str_split() function uses an argument pattern (instead of the split argument) and provides some additional options (e.g., for returning a character matrix). By default, it also returns a list of character vectors, rather than a vector:

Sorting strings

Sorting strings into alphabetical order is easy. As we have briefly seen above (and in Chapter 1, Section 1.4.3), we can use the sort() function that also handles numeric or logical objects:

Actually, sorting strings is easy, as long as we do not confuse sort() with order(). The order() function yields a numeric vector that provides a permutation that rearranges its first argument into ascending (or descending, if decreasing = TRUE) order:

The connection between both functions can be seen by using the output of order(f) to sort the vector f (by numerical indexing):
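A small sketch of this connection, with a hypothetical vector f of lowercase words:

```r
f <- c("kiwi", "apple", "mango", "banana")  # hypothetical example vector

sorted <- sort(f)    # "apple" "banana" "kiwi" "mango"
idx    <- order(f)   # 2 4 1 3 (positions of the elements in sorted order)

# Indexing f by order(f) yields the same result as sort(f):
identical(f[idx], sorted)  # TRUE
```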

If this is confusing, just remember to use sort(), rather than order(), to sort vectors into order.

The stringr alternatives to sort() and order() are, perhaps not surprisingly, str_sort() and str_order(). They provide a few additional options (e.g., a numeric argument to treat numeric character objects like numbers):

Extracting and replacing substrings (by position)

Extracting part(s) of a character string x by specifying the location (or position) to be extracted is straightforward by using the substr() function:

Let’s find out what happens when we set the main argument x to a character vector and use unusual start and stop values:

So substr() works for character vectors and seems well-behaved for unusual values of start and stop. An interesting setting for start and stop could take into account the current number of characters nchar(). For instance, the following extracts the last four characters of any string in s:
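As a sketch with a hypothetical vector s of three words:

```r
s <- c("psychology", "data", "science")  # hypothetical example vector

first4 <- substr(s, 1, 4)                  # "psyc" "data" "scie"
rest   <- substr(s, 5, 100)                # "hology" "" "nce" (stop beyond the end)
last4  <- substr(s, nchar(s) - 3, nchar(s))  # "logy" "data" "ence"
```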

Assigning a new character object to the extracted string(s) will replace the extracted string(s):
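A minimal sketch of such replacements (the strings are arbitrary), including cases in which the replacement is longer or shorter than the selected range:

```r
x <- "abcdef"
substr(x, 2, 4) <- "XYZ"    # replacement fits the range
x                           # "aXYZef"

y <- "abcdef"
substr(y, 2, 4) <- "12345"  # replacement is too long: only "123" is used
y                           # "a123ef"

z <- "abcdef"
substr(z, 2, 4) <- "#"      # replacement is too short: only position 2 changes
z                           # "a#cdef"
```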

As the last example shows, extracting and replacing substrings only affects the positions denoted by start and stop.

The stringr alternative to substr() is str_sub(), but the stop argument of the former is called end in the latter. Additionally, negative argument values now count back from the last character:

Assigning a new character object to str_sub() behaves similarly to, but slightly differently from, the way it does with substr():

Translating characters

Another way of replacing characters does not specify the parts to-be-replaced by their position, but by their identity. More specifically, the chartr() function takes a sequence of characters old and replaces them by the corresponding character in a sequence of characters new:
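For instance, the following sketch translates four vowels into digits (the mapping and the input string are arbitrary examples):

```r
# Each character in old is replaced by the character
# at the same position in new:
leet <- chartr(old = "aeio", new = "4310", x = "data science is cool")
leet  # "d4t4 sc13nc3 1s c00l"
```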

There appears to be no direct stringr alternative to the chartr() function.

9.3.5 Text as input or output

There are a number of simple R functions that deal with reading text inputs and showing or storing text outputs. As most of them are fairly straightforward, we only briefly mention them here:

  • print() is a generic function for printing an argument and returning it invisibly.
  • cat() concatenates and prints several objects, which is useful for printing the output of user-defined functions (see Chapter 11 on writing functions). cat() converts its arguments to character vectors, combines them to a single vector, appends the given string separator sep to each element, and then prints them (to the console or a file).
  • format() provides support for pretty-printing objects. This will become important when displaying dates and times (see Chapter 10), but is also relevant for numbers.
  • readLines() and scan() read text from a connection (e.g., a file, URL, or the console).
  • writeLines() writes text lines to a connection (e.g., a file, URL, or the console).
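A short sketch of these functions in action. The text lines and numbers are made up, and a temporary file (created by tempfile()) serves as the connection, so that no existing file is touched:

```r
lines_out <- c("first line", "second line")  # hypothetical text data

# cat() prints its arguments, separated by sep:
cat("Result:", 42, "\n")
cat(lines_out, sep = "\n")

# format() returns formatted character strings:
num_fmt <- format(1234567, big.mark = ",")   # "1,234,567"
dec_fmt <- format(0.5, nsmall = 2)           # "0.50"

# writeLines() and readLines() write to and read from a connection:
tmp <- tempfile(fileext = ".txt")
writeLines(lines_out, con = tmp)
lines_in <- readLines(tmp)
identical(lines_in, lines_out)               # TRUE
unlink(tmp)                                  # remove the temporary file
```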

See each function’s documentation for details and related functions.

Before expanding our horizons by moving on to regular expressions (see Appendix E) or more advanced tasks (see Section 9.4), here are some practice tasks on some important base R commands.

Practice

  1. Verify that the elements of letters are equal to the elements of LETTERS in lower case, and that the elements of LETTERS are equal to those of letters in upper case.
  2. Varieties of paste():
     • Verify that the following three R commands yield the same result, and explain why:
     • Which stringr command would yield the same result?
  3. Extract the substring “his” out of the sentence s.
  4. Use a combination of substr() and paste0() commands to change the sentence st into an R object “This is a great sentence!”
  5. Study the documentation of substring() (by evaluating ?substring), compare it to substr(), and then try to explain the results of the following examples:
  6. Word processing:
     • Read in a paragraph of text containing multiple sentences (with punctuation marks) and use base R command(s) to partition it into a vector that contains its (a) individual sentences and (b) individual words.
# Data: William James (1890): The Principles of Psychology
# Chapter I: The Scope of Psychology

wj_pp <- "Psychology is the Science of Mental Life, both of its phenomena and of their conditions. The phenomena are such things as we call feelings, desires, cognitions, reasonings, decisions, and the like; and, superficially considered, their variety and complexity is such as to leave a chaotic impression on the observer. The most natural and consequently the earliest way of unifying the material was, first, to classify it as well as might be, and, secondly, to affiliate the diverse mental modes thus found, upon a simple entity, the personal Soul, of which they are taken to be so many facultative manifestations. Now, for instance, the Soul manifests its faculty of Memory, now of Reasoning, now of Volition, or again its Imagination or its Appetite. This is the orthodox 'spiritualistic' theory of scholasticism and of common-sense. Another and a less obvious way of unifying the chaos is to seek common elements in the divers mental facts rather than a common agent behind them, and to explain them constructively by the various forms of arrangement of these elements, as one explains houses by stones and bricks. The 'associationist' schools of Herbart in Germany, and of Hume, the Mills and Bain in Britain, have thus constructed a psychology without a soul by taking discrete 'ideas,' faint or vivid, and showing how, by their cohesions, repulsions, and forms [p.~2] of succession, such things as reminiscences, perceptions, emotions, volitions, passions, theories, and all the other furnishings of an individual's mind may be engendered. The very Self or ego of the individual comes in this way to be viewed no longer as the pre-existing source of the representations, but rather as their last and most complicated fruit."
wj_pp

# (a) sentences:
st <- unlist(strsplit(wj_pp, split = "\\. "))  # split at ". " (escaped, as "." is a metacharacter)
st <- unlist(strsplit(st, split = "\\? "))     # split at "? " (escaped, as "?" is a metacharacter)
st <- unlist(strsplit(st, split = "! "))       # split at "! "
st

# (b) words:
wd <- unlist(strsplit(wj_pp, split = " "))  # split at spaces (" ")
wd <- chartr(old = "-,.;:!?[]", new = "         ", x = wd)  # replace punctuation symbols with " "
wd <- unlist(strsplit(wd, split = " "))     # split at " " (to remove the inserted spaces)
wd <- wd[wd != ""]                          # drop empty strings left over from splitting
words <- wd

words
  • Shortcut: If you find this difficult, try solving the task by using the text_to_sentences() and text_to_words() functions provided by the ds4psy package.
  • Take the vector of individual words from above and sort it alphabetically.
  • Take the (unsorted) vector of individual words from above and capitalize the first letter of each word to create a new vector Words.

Hint: This is surprisingly difficult. A possible solution could split each word into 2 substrings — its first letter vs. the rest of the word — and then re-combine both parts (after capitalizing the first letter).

  • Shortcut: If a task turns out to be difficult, we can always look for existing functions that solve our problem. In case of capitalization, we could use the str_to_title() function of the stringr package or the capitalize() function of the ds4psy package:
  • Finally, re-create the original paragraph by pasting the vector of capitalized words Words into a single text string (and note how hard it is to read a paragraph of words without any punctuation).

This concludes our overview of basic text manipulations, for which we primarily use base R functions. Before proceeding to more advanced text manipulations, consider taking a detour for acquiring or refreshing your familiarity with using regular expressions (see Appendix E).

References

Hester, J. (2020). glue: Interpreted string literals. Retrieved from https://CRAN.R-project.org/package=glue