17 Text data

Text is a common type of data, and so familiar to us that we hardly notice it as data at all. This changes when we try to manipulate or analyze it quantitatively. Although most people are used to reading and writing text on a computer, far fewer think about how text data is encoded and parsed.

Note

This chapter is yet to be written, but Chapter 9: Text data of Data Science for Psychologists (Neth, 2023a) provides an overview of its topic.

17.1 Introduction

See Section 9.1 of the ds4psy book

17.2 Essentials

Basic text-manipulation issues and functions:

See Sections 9.2 and 9.3 of the ds4psy book
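
As a rough orientation, a few basic text-manipulation tasks can be sketched in base R. This is a minimal illustration, not a summary of the ds4psy sections; the vector `words` and all other object names are made up for the example.

```r
# Base R only; the example strings are hypothetical.
words <- c("Text", "is", "data")

# Count the characters of each element:
n_chars <- nchar(words)                        # 4 2 4

# Combine the elements into a single string:
sentence <- paste(words, collapse = " ")       # "Text is data"

# Change case and extract a substring by position:
upper  <- toupper(sentence)                    # "TEXT IS DATA"
first4 <- substr(sentence, 1, 4)               # "Text"

# Split the string back into its parts:
parts <- strsplit(sentence, split = " ")[[1]]  # "Text" "is" "data"
```

Note that `strsplit()` returns a list, which is why the sketch selects its first element with `[[1]]`.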

17.3 Advanced text-manipulation

Advanced text manipulation relies on the more powerful functions of the stringr package (Wickham, 2022) and on regular expressions:

See Section 9.4 Advanced text-manipulation of the ds4psy book
See Appendix E: Using regular expressions of the ds4psy book
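
To give a flavor of how stringr functions and regular expressions work together, here is a minimal sketch. It assumes that the stringr package is installed; the vector `fruits` and the patterns are chosen only for illustration and are not taken from the ds4psy book.

```r
library(stringr)

# Hypothetical example data:
fruits <- c("apple", "banana", "cherry", "kiwi")

# Detect elements with a doubled character (back-reference \\1):
has_double <- str_detect(fruits, "(.)\\1")        # TRUE FALSE TRUE FALSE

# Extract the first run of vowels from each element:
first_vowels <- str_extract(fruits, "[aeiou]+")   # "a" "a" "e" "i"

# Replace every vowel with an underscore:
masked <- str_replace_all(fruits, "[aeiou]", "_") # "_ppl_" "b_n_n_" "ch_rry" "k_w_"

# Count the vowels in each element:
n_vowels <- str_count(fruits, "[aeiou]")          # 2 3 1 2
```

All four functions take the string vector as their first argument and a pattern as their second, which makes them easy to combine in pipes.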

17.4 Conclusion

See some example applications in Section 9.5 of the ds4psy book

17.4.1 Summary

See Section 9.6 of the ds4psy book

17.4.2 Resources

Take a look at the Posit cheatsheet on stringr:

Figure 17.1: Text and string manipulation with stringr and regular expressions from Posit cheatsheets.

The contributed cheatsheets also contain a fine reference on Basic Regular Expressions in R.

See additional links at Section 9.8 of the ds4psy book (Neth, 2023a).

17.4.3 Preview

This chapter focused on text (also known as strings or characters) as a particular data type.

17.5 Exercises

Here are six exercises from Section 9.7: Exercises of the ds4psy book (Neth, 2023a):

17.5.1 Escaping into Unicode

17.5.2 Pasting vectors

17.5.3 Searching color names

17.5.4 Patterns in pi

17.5.5 Naive cryptography

17.5.6 Known unknowns