In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. (…) A string is generally considered as a data type and is often implemented as an array data structure of bytes (or words) that stores a sequence of elements, typically characters, using some character encoding. String may also denote more general arrays or other sequence (or list) data types and structures.
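To make this definition more tangible, the following base R sketch looks at one short string from three angles: as a sequence of characters, as the bytes that store it, and as the numeric values of those bytes. The example string `"text"` and the variable names are our own illustrative choices.

```r
s <- "text"                        # a string literal, assigned to a variable

# View the string as a sequence of individual characters:
chars <- strsplit(s, split = "")[[1]]
chars                              # "t" "e" "x" "t"

# View the same string as the bytes that store it:
bytes <- charToRaw(s)
bytes                              # 74 65 78 74 (hexadecimal)
as.integer(bytes)                  # 116 101 120 116 (decimal)

# The bytes can be turned back into the original string:
rawToChar(bytes)                   # "text"
```

For plain ASCII letters like these, each character occupies exactly one byte, which is why the character and byte views have the same length here.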
When hearing the word text, we tend to think of great literary works, scientific articles, or written messages that we exchange with others. By contrast, the above definition of the term string seems a far cry from Ulysses or War and Peace. Our habitual interaction with text rests on the premise that we are unaware of its elementary symbols, their font, spacing, and the underlying rules of spelling, grammar, or style. Only when our ordinary expectations are violated or orthografycali challenged do the underlying mechanisms of text shine through.
As a consequence of our habitual consumption of text, viewing text as data is primarily an exercise in restraint. Whereas our highly developed skills of reading and writing allow us to transcend written symbols and leap to their meanings and implications, working with text data requires interrupting this automatic process and considering several layers of abstraction. When reflecting on the nature of text, we must become aware of its constituent symbols, their shapes and types, the rules governing their combination into words and sentences, and the conventions and technologies that print and format paragraphs on pages or screens.
9.1.1 Why text is hard
Text is one of the most frequent data types — perhaps the most frequent that we consume on an everyday basis (as long as we exclude particles of light, waves of sound, and other sensory data). Given this ubiquity, it is surprising how rarely text becomes topical in introductory treatments of computer science and how difficult it remains to deal with strings of text. There are two main reasons for this:
- Focus on numbers: The primary focus of computers, statistical analysis software, and the people developing the corresponding tools and courses has traditionally been on numeric data. Most of the people using R are interested and invested in its statistical powers and graphical features, rather than its functionality for processing character strings and texts.
- Linguistic complexity: Any serious endeavor to process text data immediately involves linguistic issues. Not only does the same text in Chinese and English look pretty different (e.g., contain different symbols), but even within a single language the exact form and shape of a word varies as a function of its role in a sentence, let alone the fact that different words can mean the same thing, or the same word different things, depending on context. Thus, dealing with language in an explicit and systematic fashion is immensely difficult, even when only considering our native language.
Essentially, any attempt to deal with text inevitably struggles with the distinctions between
- symbols (i.e., alphabetic letters, additional characters, and words used to express and record text),
- syntax (i.e., how grammatical rules change the form and identity of words), and
- semantics (i.e., the sense and meaning of words and phrases).
Actually, there are additional areas (like lexicographics, phonetics, and pragmatics) and even the definitions of and demarcations between all these terms are tricky. For instance, how many different types of empty spaces are used in printed text? And if words consist of multiple symbols, should they still be considered to be symbols? We will leave most of those issues for the linguists to decide. And before we get too far ahead of ourselves, we should acknowledge that the most interesting questions will often be at the level of semantics, but our analysis tools must first address the more elementary levels of symbols and the basic chunks of data that make up words and phrases.
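The question about empty spaces can be made concrete with a small base R sketch. The four characters below are our own illustrative selection (not an exhaustive list): they all appear as blank space in print, yet each is a different symbol with its own Unicode code point.

```r
# Four characters that all render as "empty space" (an illustrative selection):
spaces <- c(space          = " ",
            tab            = "\t",
            no_break_space = "\u00a0",
            thin_space     = "\u2009")

# Look up the Unicode code point of each character
# (enc2utf8() ensures that utf8ToInt() receives UTF-8 input):
code_points <- vapply(spaces,
                      function(ch) utf8ToInt(enc2utf8(ch)),
                      integer(1))
code_points                        # 32, 9, 160, 8201

# Although they look alike in print, they are not identical as data:
same <- identical(spaces[["space"]], spaces[["no_break_space"]])
same                               # FALSE
```

This is one reason why two strings that look the same on screen can fail to match when compared or searched.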
Beware that this chapter only covers the most mundane aspects of text: Manipulating the character symbols (i.e., letters) of the (Western) alphabet and searching for patterns of characters in text. This merely scratches the surface of text and falls short with regard to its most interesting aspects.
Despite these limitations, this chapter is one of the longest in this book and covers a lot of new ground.
But before we can introduce various tools and functions for manipulating text, we need to clarify some terminology and distinguish between different tasks involving strings of text.
Just as text is a challenging type of data to work with, the terminology to refer to text can be confusing. We will adopt the following conventions:
- In most modern languages, written text consists of characters (i.e., letters, digits, and other symbols, some of which are invisible). In R, the data type and mode of text strings are also called character.
- Text objects can be called characters, strings, or text. We will try to use the term character to refer to individual symbols (e.g., the letter “a” or the mathematical operator “+”) and use the more abstract terms character data, strings, and text interchangeably to refer to data containing longer sequences of characters (e.g., words, sentences, paragraphs, etc.).[42] Calling this chapter “Strings of text” attempts to explicate both the form (“strings”) and the content (“text”) of data of type “character”.
- Larger units of text data are often called by other names (e.g., words, sentences, paragraphs, but also article, chapter, or book, which are stored in archives, documents, or files). Loading, reading, or searching through strings of text is also referred to as parsing.
- Storing characters and strings of text in computer files requires an appropriate encoding. This involves technical voodoo stuff that most users do not need or want to understand. However, it is good to know that there are standards for this, and the terms Unicode and UTF-8 are our friends (see Sections 9.2.2 and 9.8.2 for additional information and resources).
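These conventions can be checked directly in R. The following base R sketch (with example strings of our own choosing) shows that text has the type and mode character, that the number of strings in an object differs from the number of characters in each string, and that a character entered via its Unicode code point may need more than one byte when encoded as UTF-8.

```r
word <- "text"                     # one string of text
typeof(word)                       # "character"
mode(word)                         # "character"
length(word)                       # 1 (a single string)
nchar(word)                        # 4 (made up of four characters)

phrase <- c("strings", "of", "text")   # a vector of three strings
length(phrase)                     # 3
nchar(phrase)                      # 7 2 4

# A character entered via its Unicode code point (U+00FC, the letter u with diaeresis):
u <- enc2utf8("\u00fc")            # enc2utf8() makes the UTF-8 encoding explicit
n_chars <- nchar(u, type = "chars")    # 1 character ...
n_bytes <- nchar(u, type = "bytes")    # ... stored as 2 bytes in UTF-8
```

Note that `length()` counts the elements of a vector, whereas `nchar()` counts the characters within each element. Confusing the two is a common source of errors when working with text.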
Thus, even our basic terminology with regard to text is somewhat fuzzy and muddled. But this is a general feature of language and normally does not keep us from productively using it. Hopefully, our familiarity with and agreement on the meaning of these terms is sufficient to avoid ambiguity.
After working through this chapter, you should be able to:
- understand that text consists of characters and comes in strings,
- enter character data and rare symbols as Unicode characters,
- distinguish between different tasks to solve with text,
- use base R commands to define, combine, and manipulate strings of text,
- read and write basic regular expressions (see Appendix E),
- use stringr commands to detect, locate, extract, replace, and count patterns in strings of text.
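As a preview of these objectives, here is a minimal sketch that uses only base R, so that it runs without additional packages. The vector `s` is hypothetical practice data (not taken from stringr or ds4psy). The stringr functions `str_detect()`, `str_locate()`, `str_extract()`, `str_replace()`, and `str_count()` address the same pattern tasks with a more consistent interface.

```r
# Hypothetical practice data:
s <- c("apple pie", "banana split", "cherry tart")

# Define, combine, and manipulate strings:
label  <- paste("strings", "of", "text")   # combine, separated by spaces
ids    <- paste0("item_", 1:3)             # combine without a separator
loud   <- toupper(s)                       # change to upper case
starts <- substr(s, 1, 3)                  # extract characters 1 to 3

# Pattern tasks with basic regular expressions:
detected  <- grepl("an", s)                            # detect a pattern
located   <- as.integer(regexpr("an", s))              # locate first match (-1: none)
extracted <- regmatches(s, regexpr("[a-z]+", s))       # extract first run of lowercase letters
replaced  <- sub("a", "A", s)                          # replace the first "a"
counted   <- lengths(regmatches(s, gregexpr("a", s)))  # count all matches of "a"

detected                           # FALSE TRUE FALSE
extracted                          # "apple" "banana" "cherry"
counted                            # 1 3 1
```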
This chapter defines many strings of text for practice purposes, but also uses many longer vectors of character objects that are included in the R packages stringr (Wickham, 2019b) and ds4psy (Neth, 2020).
9.1.5 Getting ready
This chapter formerly assumed that you have read and worked through Chapter 14: Strings of the r4ds textbook (Wickham & Grolemund, 2017). However, that chapter is very dense and suffers from some shortcomings: In focusing primarily on the stringr package, it essentially ignores important base R functions for manipulating strings. And although the chapter contains a section on Matching patterns with regular expressions, this section is hard to follow and digest for students who have never worked with regular expressions before. Similarly, the topics of using non-standard Unicode characters, of identifying and matching meta-characters, and of distinguishing between different character classes are mentioned, but not explained in much detail. As it nevertheless provides many good examples and challenging exercises, reading Chapter 14 of r4ds is still recommended. Overall, be aware that this chapter (in combination with Appendix E on using regular expressions) goes beyond the materials covered there, but mostly in depth rather than in scope (i.e., by proceeding more slowly and mentioning many details more explicitly).
Please do the following to get started:
- Structure your document by inserting headings and empty lines between different parts. Here’s an example how your initial file could look:
- Create an initial code chunk below the header of your .Rmd file that loads the R packages of the tidyverse and the ds4psy package (and see Section F.3.3 if you want to get rid of the messages and warnings of this chunk in your HTML output).
- Save your file (e.g., as nr_name.Rmd in the R folder of your current project) and remember saving and knitting it regularly as you keep adding content to it.
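The body of such an initial code chunk could look roughly as follows. This is only a sketch under our own assumptions, not the book's template: the loop and the check for missing packages are additions that allow the code to run even when a package is not yet installed. In an .Rmd file, these lines would be placed inside an R code chunk.

```r
# Packages named in the text:
pkgs <- c("tidyverse", "ds4psy")

for (pkg in pkgs) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    # Load the package while suppressing its startup messages:
    suppressPackageStartupMessages(library(pkg, character.only = TRUE))
  } else {
    # Assumption: we only report a missing package, rather than installing it.
    message("Package '", pkg, "' is not installed; try install.packages(\"", pkg, "\").")
  }
}
```

If both packages are installed, plain calls to `library(tidyverse)` and `library(ds4psy)` achieve the same result, but may print startup messages.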
Neth, H. (2020). ds4psy: Data science for psychologists. Retrieved from https://CRAN.R-project.org/package=ds4psy
Wickham, H. (2019b). stringr: Simple, consistent wrappers for common string operations. Retrieved from https://CRAN.R-project.org/package=stringr
Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz
[42] Which term is appropriate often depends on context: The term character can denote either an R data type or an individual text symbol. Nevertheless, referring to a word or sentence as a character or as a text sounds odd. The term string seems more technical than text, but allows for both singular and plural forms. Phrases like a character/text string or strings of characters/text are somewhat repetitive, but often seem better and clearer.