Chapter 9 Text data

Please note: This chapter is fragmentary; most of it is yet to be written.

This chapter contains essential commands for manipulating text data. Its tidyverse-related parts introduce the stringr package (Wickham, 2019). See also Chapter 14: Strings of the r4ds textbook (Wickham & Grolemund, 2017) and its corresponding examples and exercises.

Figure 9.1: Text data: Regular expressions and string manipulation with stringr.

Apart from numbers, text is probably the most common type of data encountered by psychologists. When people (whether patients or participants) talk, they primarily produce text. Although science usually aims to transform verbal statements into categories that can be quantified, we often have to deal with messy data that is saved as strings of text.

Most of the data we have encountered and wrangled so far consisted of numbers. This is a bit surprising: unless you happen to be a mathematician or computer scientist, you surely see far more text than numbers. Text is probably the most ubiquitous type of data. But as it is not an easy type of data to work with, it is typically tackled relatively late in data science courses.

(Actually, we have already seen some text labels when dealing with factors.)
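As a brief reminder of where text has already appeared, the following sketch shows that the labels of a factor are stored as character strings. It uses only base R; the vector `condition` and its values are invented for illustration.

```r
# Hypothetical data (invented for illustration):
condition <- factor(c("control", "treatment", "control"))

# The labels (levels) of a factor are text strings:
condition_levels <- levels(condition)
typeof(condition_levels)  # "character"

# The factor itself is stored as integer codes pointing to these labels:
typeof(condition)         # "integer"

# Converting the factor back to plain text:
condition_text <- as.character(condition)
```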

Introduction to this chapter: Text strings, objects of type character.

  • R package stringr (Wickham, 2019).

  • Regular expressions.
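To give a first impression of these topics, here is a minimal sketch that assumes the stringr package is installed. The vector `responses` and its contents are hypothetical and merely stand in for the kind of messy text that participants might produce.

```r
# A minimal sketch, assuming the stringr package is installed:
library(stringr)

# Hypothetical participant responses (invented for illustration):
responses <- c("I feel  fine.", "Not so good...", "FINE, thanks!")

# Text strings are objects of type character:
typeof(responses)

# Number of characters in each string:
n_chars <- str_length(responses)

# Basic clean-up: collapse repeated spaces, then convert to lower case:
clean <- str_to_lower(str_squish(responses))

# Regular expression: detect the whole word "fine"
# (\\b marks a word boundary):
has_fine <- str_detect(clean, "\\bfine\\b")
```

Cleaning the text before matching keeps the regular expression simple: after lowercasing, a single pattern covers both "fine" and "FINE".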

Relation to other chapters (see Chapter 11) and book parts.

References

Wickham, H. (2019). stringr: Simple, consistent wrappers for common string operations. Retrieved from https://CRAN.R-project.org/package=stringr

Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz