9.2 Essentials

Essentials of this chapter: Which base R commands and R packages are covered?

9.2.1 Text basics

In R, text is represented as a sequence of characters, which can contain letters, numbers, and other symbols. To turn a sequence of characters into a text object, it needs to be enclosed in quotation marks (" or '):
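As a minimal illustration (the object names `a` and `b` and their contents are placeholders chosen here), both kinds of quotation marks create the same kind of object:

```r
# Hypothetical example objects: both quoting styles define text
a <- "Hello"
b <- 'world!'

a
b

# The choice of quotation marks does not change the resulting object:
identical("Hello", 'Hello')  # TRUE
```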

The data type R provides for storing sequences of characters is character and the mode of an object that holds a character string is character:
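A short check of both properties, assuming a hypothetical example object `a`:

```r
a <- "Hello"     # hypothetical example object

typeof(a)        # "character"
mode(a)          # "character"
is.character(a)  # TRUE
```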

Sometimes we want to talk about a word, rather than use it in a sentence. This is typically signaled by putting quotation marks around the word being talked about. When using quotation marks within a string, the opening and closing marks of each pair must match, and the inner quotation marks must differ from the outer ones (or be escaped with a backslash). For instance, the following 2 definitions yield valid character objects:
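One possible pair of such definitions (the sentences and the names `c1` and `c2` are illustrative assumptions) nests single quotes inside double quotes and vice versa:

```r
# Two valid definitions: inner and outer quotation marks are of different types
c1 <- "The word 'word' is a 4-letter word."
c2 <- 'The word "word" is a 4-letter word.'

writeLines(c(c1, c2))  # prints both strings without the outer quotation marks
```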

whereas the following 4 definitions would all yield errors:
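Because code containing such definitions cannot even be parsed, the sketch below stores four illustrative faulty definitions as text and checks that `parse()` rejects each of them. The examples and the helper `fails_to_parse()` are assumptions made for this illustration.

```r
# Four faulty definitions, stored as (escaped) text so that this script still runs:
bad_code <- c(
  "x <- \"Hello'",                         # opens with " but closes with '
  "x <- 'Hello\"",                         # opens with ' but closes with "
  "x <- \"The word \"word\" is short.\"",  # unescaped " inside "..."
  "x <- 'The word 'word' is short.'"       # unescaped ' inside '...'
)

writeLines(bad_code)  # show the definitions as they would be typed

# Hypothetical helper: TRUE if R cannot parse the code
fails_to_parse <- function(code) {
  inherits(try(parse(text = code), silent = TRUE), "try-error")
}

result <- vapply(bad_code, fails_to_parse, logical(1), USE.NAMES = FALSE)
result  # TRUE TRUE TRUE TRUE
```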

Character vectors

Just like with other R objects, a sequence of strings can be combined into a vector:
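For instance, with illustrative values and a hypothetical vector name `v`:

```r
# A character vector with 3 elements:
v <- c("Hello", "world", "!")

v
length(v)  # 3 (number of elements, not of characters)
v[2]       # "world"
```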

Actually, strings in vectors are as contagious as NA values in calculations: When numbers or other data types are combined with a character object, the entire vector is coerced to type and mode character:
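A small demonstration of this coercion, using made-up values:

```r
# Combining a number, a string, a logical value, and NA:
x <- c(1, "two", TRUE, NA)

x          # "1" "two" "TRUE" NA
typeof(x)  # "character"
mode(x)    # "character"
```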

The function character takes a numeric argument length and creates a vector of a corresponding number of empty strings:
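For example (the name `e` is a placeholder):

```r
e <- character(length = 3)

e          # "" "" ""
length(e)  # 3
nchar(e)   # 0 0 0 (each element is an empty string)
```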

This may seem unnecessary at this point, but becomes useful when initializing data structures to be filled by the results of a vector operation or a for loop (see Chapter 12 for examples):
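A minimal sketch of this pattern, assuming an illustrative input vector `words` and an output vector `out` that is filled step by step:

```r
words <- c("apple", "kiwi", "banana")  # illustrative input

# Initialize an empty character vector of the required length:
out <- character(length(words))

# Fill it element by element:
for (i in seq_along(words)) {
  out[i] <- toupper(words[i])
}

out  # "APPLE" "KIWI" "BANANA"
```

In this simple case, the vectorized call `toupper(words)` would yield the same result without a loop; the loop only illustrates the initialization step.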

The corresponding functions as.character and is.character coerce objects into text strings or test whether objects are text strings:
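A brief example with made-up numbers:

```r
n <- c(1, 2.5, 100)

is.character(n)      # FALSE
s_n <- as.character(n)
s_n                  # "1" "2.5" "100"
is.character(s_n)    # TRUE

as.character(TRUE)   # "TRUE"
```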

Text constants

R contains a few built-in constants. Apart from the constant pi (which evaluates to the numeric double 3.1415927), the following ones are vectors of type character:

  1. LETTERS: 26 upper-case letters of the Roman alphabet;
  2. letters: 26 lower-case letters of the Roman alphabet;
  3. month.abb: 3-letter abbreviations for the English month names;
  4. month.name: English names for the months of the year.
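The four constants can be inspected and indexed like any other character vector:

```r
length(LETTERS)     # 26
LETTERS[1:3]        # "A" "B" "C"
letters[24:26]      # "x" "y" "z"
month.abb[1:3]      # "Jan" "Feb" "Mar"
month.name[12]      # "December"
typeof(month.name)  # "character"
```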

Text functions

Base R contains a range of functions to deal with text objects. As you will encounter them for simple string manipulations in this book and in the code of others, it is good to have seen them at some point.

However, the set of text-related functions in base R is a bit like a zoo: It contains amazing and colorful individual creatures, but it can be hard to see the connections or any organizing principle. Thus, we will only cover the most important functions here and use the stringr package (see the section on stringr below) for more advanced forms of string manipulation.

Task-oriented view

To organize a set of otherwise relatively inconsistent functions, we present these basic string manipulation functions as solutions to specific tasks.

A. Basic tasks to do with text:

  1. Measuring text length
  2. Changing the case of characters
  3. Combining text strings
  4. Splitting text strings
  5. Reading text input
  6. Printing text output

Also: Extract substrings (based on positions)

B. Advanced tasks to do with text:

  1. Detecting text or patterns
  2. Locating text or patterns
  3. Extracting text or patterns
  4. Replacing text or patterns

  • Task: Measuring text length.

Rather than using the length() function (which provides the number of elements in a vector), the length of a text object is measured by nchar():
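The difference between both functions shows up in a short example (the vector `txt` is made up):

```r
txt <- c("Hello", "world!", "")

length(txt)  # 3: number of elements in the vector
nchar(txt)   # 5 6 0: number of characters in each element
```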

  • Task: Changing the case of characters.

A specific property of letters in many alphabets is that they exist in lowercase and uppercase (aeiou vs. AEIOU). The R functions toupper() and tolower() allow changing the case of text strings:
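For example:

```r
toupper("aeiou")              # "AEIOU"
tolower("AEIOU")              # "aeiou"
toupper(c("Hello", "world"))  # "HELLO" "WORLD" (works on vectors)
```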

  • Task: Extracting parts of strings.
  • Task: Detecting characters in strings.
  • Task: Replacing characters in strings.

Note that substr() can also be used to replace characters in strings:
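The three tasks above can be sketched with base R as follows. The text names only `substr()`; the use of `grepl()` for detecting and `gsub()` for replacing is an assumption about which base R functions fit these tasks, and `txt` is a made-up example string.

```r
txt <- "Hello world"

# Extracting a part of a string (by character positions):
substr(txt, start = 1, stop = 5)   # "Hello"

# Detecting characters in a string (assumed function: grepl):
grepl("wor", txt, fixed = TRUE)    # TRUE

# Replacing characters in a string (assumed function: gsub):
gsub("o", "0", txt, fixed = TRUE)  # "Hell0 w0rld"

# Replacing characters by position with the assignment form of substr():
substr(txt, 1, 1) <- "J"
txt                                # "Jello world"
```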

  • Task: Combining strings (into 1 string).

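Note that paste() and paste0() are vectorized: They combine vectors element by element, and their collapse argument merges the resulting elements into 1 string.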
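A few illustrative calls to `paste()` and `paste0()`:

```r
paste("Hello", "world")                  # "Hello world" (default sep = " ")
paste0("Hello", "world")                 # "Helloworld" (no separator)

# Vectorized: shorter arguments are recycled
paste("a", 1:3, sep = "_")               # "a_1" "a_2" "a_3"

# collapse merges all elements into 1 string:
paste(c("x", "y", "z"), collapse = "+")  # "x+y+z"
```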

  • Task: Splitting a string (into multiple parts).

Note: The opposite of pasting is splitting, which is done with strsplit(x, split).

Note: strsplit() returns a list. To obtain a vector, we need to unlist() it.
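A short example of splitting and unlisting (the string `txt` is made up):

```r
txt <- "This is a sentence"

parts <- strsplit(txt, split = " ")
class(parts)   # "list"
unlist(parts)  # "This" "is" "a" "sentence"

# With several input strings, the list contains 1 vector per string:
strsplit(c("a-b", "c-d-e"), split = "-")
```

Returning a list makes sense because different input strings can yield different numbers of parts.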

  • Task: Reading text input (from a file or user console).
  • Task: Providing text output (to the console or a file).

cat() concatenates and prints several objects, which is useful for printing the output of user-defined functions. It converts its arguments to character vectors, concatenates them to a single vector, appends the given string separator sep to each element, and then prints them (to the console or a file).
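The following sketch prints to the console, writes to a temporary file, and reads the file back. Using `readLines()` for reading is an assumption (the text does not name a reading function), and the file name comes from `tempfile()`.

```r
# Printing to the console:
cat("The answer is", 42, "\n")  # prints: The answer is 42
cat("a", "b", "c", sep = "-")   # prints: a-b-c
cat("\n")

# Writing to a (temporary) file:
tmp <- tempfile(fileext = ".txt")
cat("first line\n", "second line\n", sep = "", file = tmp)

# Reading text from a file (1 vector element per line):
lines <- readLines(tmp)
lines  # "first line" "second line"

# Note: In an interactive session, readline(prompt = "Name: ")
# reads a line of text typed by the user at the console.
```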

Practice

  • Verify that the elements of letters are equal to the elements of LETTERS in lower case, and that the elements of LETTERS are equal to those in letters in upper case.
  • Extract the substring “his” out of the sentence s.
  • Study the documentation of substring (by evaluating ?substring), compare it to substr, and then try to explain the results of the following examples:
  • Use a combination of substr and paste0 commands to change the sentence s into an R object “This is a great sentence!”

9.2.2 Regular expressions

A regular expression (a.k.a. regex or regexp) is a sequence of characters that defines a search pattern. Many computer languages provide regex functionality to find patterns in strings of text (see Wikipedia for details).

Think of regular expressions as a compact and formal language to define patterns in text.

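Regexps can seem very cryptic and learning them can be daunting, but they are built from a few kinds of elements:

  • characters are the basic building blocks of regular expressions: letters and digits typically match themselves.
  • meta-characters (like the wildcard ., which matches any single character) have a special meaning rather than matching themselves.
  • repetition quantifiers (like *, +, and ?) specify how often the preceding element may occur.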
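These building blocks can be tried out with base R's `grepl()`, which returns TRUE for every string that contains a match. This is a minimal sketch with made-up word vectors, not a complete overview of regex syntax:

```r
fruits <- c("apple", "banana", "kiwi", "plum")

# Ordinary characters match themselves:
grepl("an", fruits)   # FALSE TRUE FALSE FALSE

# The wildcard . matches any single character:
grepl("p.u", fruits)  # FALSE FALSE FALSE TRUE

# Repetition quantifiers apply to the preceding element:
words <- c("ale", "aple", "apple")
grepl("ap*le", words)  # * (zero or more p): TRUE TRUE TRUE
grepl("ap+le", words)  # + (one or more p):  FALSE TRUE TRUE
grepl("ap?le", words)  # ? (zero or one p):  TRUE TRUE FALSE
```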

9.2.4 Applications

Blurring the boundaries between data types is creative and fun, and can have scientific or artistic motivations and consequences.

Quantifying text

In science, text can be quantified, for instance, by counting the frequency of characters or words in literary works.

Example: count_char() function.
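The count_char() function itself is not shown here. As a rough idea of how character frequencies could be counted in base R, consider the following sketch; `count_char_sketch()` is a hypothetical stand-in written for this illustration, not the function referred to above.

```r
# Hypothetical helper: count how often each character occurs in x
count_char_sketch <- function(x) {
  chars <- unlist(strsplit(x, split = ""))  # split into single characters
  freq  <- table(chars)                     # tabulate their frequencies
  sort(freq, decreasing = TRUE)             # most frequent characters first
}

counts <- count_char_sketch("hello world")
counts
counts["l"]  # "l" occurs 3 times
```

Note that this sketch also counts spaces and distinguishes upper- from lowercase letters; applying `tolower()` first would merge both cases.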

Sentiment analysis

Plotting text

Visual arts: Plotting text, creating word search puzzles or crossword puzzles, etc.

Example: plot_text() function.