Chapter 4 Exploring data
This chapter introduces the notion of exploratory data analysis (EDA), which is a key component of data science and an important prerequisite for statistics. Rather than introducing new R packages, we will now combine what we have learned about the ggplot2 and dplyr packages (in Chapters 2 and 3). As we will see, our recently acquired skills in data visualization and transformation provide a powerful foundation for exploring data by asking and answering questions. Readers should soon find that actively engaging in EDA really feels like “doing research”, as it requires us to question and scrutinize data from a variety of perspectives and to follow up on any answers and hypotheses that we find along the way.
EDA is one of the most important parts of data analysis, yet it is often ignored or underestimated in typical courses on statistics. Especially when subscribing to the principles of transparent data analysis and reproducible research (e.g., by sharing analysis notebooks in R Markdown, see Appendix E), a solid EDA helps to establish consensus about the main characteristics of a dataset. While such practices are indispensable when working in a team of colleagues and within the wider scientific community, organizing our workflow in a clear and consistent fashion also benefits our own projects and our future selves.
Although any actual EDA is tailored to the specific features of a dataset and the current research goals, this chapter highlights some common themes that are relevant in most cases. We illustrate the essential steps of an EDA by exploring a real dataset from clinical psychology (Woodworth et al., 2018), which measures the effects of positive psychology interventions (see Section B.1 of Appendix B for details on the data).
Woodworth, R. J., O’Brien-Malone, A., Diamond, M. R., & Schüz, B. (2018). Data from “Web-based positive psychology interventions: A reexamination of effectiveness”. Journal of Open Psychology Data, 6(1). https://doi.org/10.5334/jopd.35