Chapter 9 Strings of text


Most of the data we have been working with and wrangling so far consisted of numbers. Compared to our everyday experience, this is surprising and somewhat peculiar: unless you happen to spend your days totally immersed in statistics or stock prices, you surely encounter a lot more text than numbers.

Text is probably the most ubiquitous type of data, and certainly the most common type of data encountered by psychologists. When people (be it patients or participants) talk or write, they primarily consume, process, and produce verbal or written text. Science can transform parts of those verbal statements and other dimensions of interest into categories that can be counted and compared, but such attempts to measure verbal or natural phenomena are secondary and involve adopting a quantitative approach that is fairly artificial. And as much of the text surrounding us is not edited, formatted, and pretty-printed, any attempt at dealing with text-related material typically involves messy data. Thus, text may be ubiquitous, but it is not an easy data type to work with, which is reflected in its late consideration in (or blatant absence from) introductory courses on data science.

Figure 9.1: Strings of text: Manipulating text data with base R and stringr commands.

Although we have already encountered data of type character (e.g., in variables that store names or factor labels), we are now taking a closer look at strings of text. This chapter first introduces some terminology and distinguishes between different tasks involving text (in Section 9.1). After this introduction, the chapter contains three main parts:

Essentials of strings and text data (Section 9.2)
Manipulating strings of text with base R commands (Section 9.3)
Advanced text manipulation with stringr commands (Section 9.4)

Whereas these three sections mainly introduce new concepts and commands, Section 9.5 provides a glimpse into possible applications that creatively combine text and the tools for working with character data. Originally, this chapter also contained a section on searching for patterns with regular expressions. However, as this section turned out to be rather long and mostly self-contained, Appendix E now provides a primer on using regular expressions.

Even without this appendix, this chapter still covers a lot of new material. Nevertheless, the functionality introduced here only deals with text as sequences of character symbols. This approach only scratches the surface of text, but it is a key prerequisite for dealing with issues of syntax or semantics.
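To illustrate what treating text as a sequence of character symbols can mean in practice, here is a minimal sketch that uses only base R functions. The example sentence and all variable names are made up for illustration and do not come from later sections:

```r
# A hypothetical example sentence (an assumption, not from the chapter):
s <- "Text is messy data."

n_chars <- nchar(s)                       # number of characters in the string
chars   <- strsplit(s, split = "")[[1]]   # vector of individual characters
first_4 <- substr(s, start = 1, stop = 4) # extract characters 1 to 4
upper   <- toupper(s)                     # convert all letters to upper case

n_chars
first_4
upper
```

None of these operations knows anything about words, grammar, or meaning: each one merely counts, splits, extracts, or replaces symbols.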

Just like other types of data (e.g., dates and times), dealing with text touches upon deep issues regarding the meaning of symbols and the rules and regulations governing their representation, visualization, and interpretation. Successfully manipulating strings of text requires reflecting on and negotiating a variety of habits, idiosyncrasies, and standards, as well as mastering tools that allow us to cope with the technical challenges of symbol systems and their encodings. As we will see, text is a challenging, but also a rich and rewarding type of data. So let’s find out what we can do with it.