Chapter 3 Transforming data
In most courses on data analysis and statistics, data is treated as raw material that is read into some software program and then immediately analyzed.[18]
However, viewing data as a resource that is read in one initial step grossly over-simplifies what typically happens when we analyze real-world data. A dataset can only be used directly when it is formatted in precisely the way our analysis requires. In most realistic scenarios, a dataset needs pre-processing before it can be analyzed. This pre-processing phase has been called “data munging”, “data wrangling”, or “transforming data”, and is often extensive. In fact, transforming data into the right shape can easily take up the majority of the time needed to make sense of it.
The topic of data transformation encompasses many different tasks, including re-arranging, selecting, changing, and aggregating data. A tool supporting these tasks is the dplyr package (Wickham, François, Henry, & Müller, 2019b), which is a core component of the tidyverse.[19] dplyr provides a set of commands (best thought of as verbs) that allow slicing and dicing rectangular datasets and computing many summary statistics. While each individual command is simple, the commands can be combined into a powerful language of data manipulation. In combination with other functions, dplyr quickly provides us with quantitative overviews of datasets that amount to what psychologists often call descriptive statistics.
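As a rough illustration of the four kinds of tasks named above, the following sketch applies one dplyr verb to each of them. It assumes that the dplyr and ggplot2 packages are installed and uses the mpg dataset from ggplot2; the choice of variables and the object names (mpg_sorted, compact_cars, mpg_avg, class_summary) are only illustrative and not taken from the text.

```r
library(dplyr)    # data transformation verbs
library(ggplot2)  # only needed here for the mpg dataset

# Re-arranging: sort cars by highway mileage, highest first
mpg_sorted <- arrange(mpg, desc(hwy))

# Selecting: keep a subset of rows (cases) and columns (variables)
compact_cars <- mpg %>%
  filter(class == "compact") %>%
  select(manufacturer, model, year, cty, hwy)

# Changing: compute a new variable from existing ones
mpg_avg <- mutate(mpg, avg_mpg = (cty + hwy) / 2)

# Aggregating: one row of summary statistics per vehicle class
class_summary <- mpg %>%
  group_by(class) %>%
  summarise(n = n(),
            mean_hwy = mean(hwy),
            sd_hwy = sd(hwy))

class_summary
```

The last step shows how combining simple verbs (here group_by() and summarise()) can yield a table of descriptive statistics in a few lines.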
In this chapter, we are only concerned with the essentials, as the subsequent chapters will cover many related tasks and tools in greater detail. Similarly, the tools provided by the dplyr package extend beyond the scope of this introduction. For instance, dplyr includes additional commands for handling multiple data tables, which we will discuss later (e.g., in Chapter 8 on Joining data).
Wickham, H., François, R., Henry, L., & Müller, K. (2019b). dplyr: A grammar of data manipulation. Retrieved from https://CRAN.R-project.org/package=dplyr
[18] The same was true for our use of ggplot in the previous chapter on Visualizing data (Chapter 2): When composing a new ggplot() command, we only specified data = mpg and then accessed the variables of this dataset.
[19] Like ggplot2, dplyr is widely used by people who otherwise do not reside within the tidyverse. But as dplyr is a package that is both immensely useful and embodies many of the tidyverse principles in paradigmatic form, we can think of it as the primary citizen of the tidyverse.