15 Exploring data
Exploratory data analysis (EDA) is a fancy name for an ordinary, but important phase of any data analysis.
From a technical viewpoint, EDA is where our skills in transforming, reshaping, and visualizing data (e.g., using base R functions and additional tools from the dplyr, tidyr, and ggplot2 packages) meet so that we become familiar with a dataset.
Preparation
The recommended reading for this chapter is:
of the ds4psy book (Neth, 2023a).
See
of the r4ds book (Wickham & Grolemund, 2017) for an EDA of ggplot2’s diamonds data.
Preflections
Before reading, please take some time to reflect upon the following questions:
What are the first steps of any data analysis?
Which questions should we always ask before conducting any statistical test?
What is the difference between generating and testing hypotheses?
15.1 Introduction
This section contrasts exploratory data analysis (EDA) with confirmatory data analysis.
Please note: This page merely summarizes the longer version at 4.1 Introduction (Neth, 2023a).
15.1.1 What is EDA?
We can distinguish between three main interpretations of EDA:
- EDA as the opposite of (or complement to) confirmatory data analysis:
Exploratory data analysis (EDA) is a type of data analysis that John Tukey contrasted with confirmatory data analysis (CDA) (e.g., Tukey, 1977, 1980) and likened to the work of a detective (Tukey, 1969). Long before the ubiquity of personal computers, Tukey emphasized the importance of visual displays for detecting patterns or irregularities in data, while most psychologists of the same era were obsessed with statistical rituals (like null hypothesis significance testing, NHST; see Nickerson, 2000) of a rather mechanistic and mindless nature (Gigerenzer, 2004). Irrespective of your stance towards statistics, EDA approaches data in a more curious and open-minded fashion.
- EDA as an attitude or mindset.
Exploratory data analysis is an attitude,
a flexibility, and a reliance on display,
NOT a bundle of techniques, and should be so taught.
John W. Tukey (1980, p. 23)
This interpretation connects to the philosophical idea of gaining insight by hermeneutics: EDA is a data scientist’s way of doing hermeneutics to get a grip on some dataset (see the corresponding definitions in Wikipedia or The Stanford Encyclopedia of Philosophy for details).
- EDA as an inevitable process to familiarize us with new data, detect potential problems and irregularities, discover patterns, and formulate better questions.
The goal of EDA is to discover patterns in data. (…)
The role of the data analyst is to listen to the data in as many ways as possible
until a plausible “story” of the data is apparent, even if such a description
would not be borne out in subsequent samples.
John T. Behrens (1997, p. 132)
If a good exploratory technique gives you more data, then maybe
good exploratory data analysis gives you more questions, or better questions.
More refined, more focused, and with a sharper point.
Roger Peng (2019), Simply Statistics
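Tukey's emphasis on visual displays can be illustrated with a small sketch. It is not taken from the sources cited above: it assumes the `anscombe` data that ships with base R (Anscombe's quartet) and uses only base R functions. The variable names `quartet`, `set_ids`, and `stats` are chosen here for illustration.

```r
# Anscombe's quartet ships with base R (datasets package):
# four x-y pairs stored as columns x1..x4 and y1..y4.
quartet <- datasets::anscombe
set_ids <- 1:4

# Summary statistics per set: means and correlations are nearly identical.
stats <- data.frame(
  set    = set_ids,
  mean_x = sapply(set_ids, function(i) mean(quartet[[paste0("x", i)]])),
  mean_y = sapply(set_ids, function(i) mean(quartet[[paste0("y", i)]])),
  cor_xy = sapply(set_ids, function(i) {
    cor(quartet[[paste0("x", i)]], quartet[[paste0("y", i)]])
  })
)
print(round(stats, 2))

# Plotting the same data reveals four very different patterns.
op <- par(mfrow = c(2, 2))
for (i in set_ids) {
  plot(quartet[[paste0("x", i)]], quartet[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste("Set", i))
}
par(op)
```

The numeric summary alone suggests four interchangeable datasets. Only the plots show that one set is roughly linear, one is curved, and two are dominated by a single unusual point.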
Having learned what exploratory data analysis (EDA) is and what it aims for, we are curious to learn how to do an EDA.
15.2 Essentials
As any exploration depends on the data to be explored, there are no general recipes for EDA. However, we can collect a set of good practices and recommendations.
Section 4.2 Essentials of EDA of the ds4psy book (Neth, 2023a) provides the following list:
15.2.1 The principles of EDA
As a quick summary, here are the 10 principles of EDA mentioned above:
1. Start with a clean slate and explicitly load all data and all required packages.
2. Structure, document, and comment your analysis.
3. Make copies (and copies of copies) of your data.
4. Know your data (variables and observations).
5. Know and deal with unusual variables and unusual values.
6. Inspect the distributions of variables.
7. Use filter variables to identify and select sub-sets of observations.
8. Inspect relationships between variables.
9. Inspect trends over time or repeated measurements.
10. Create graphs that convey their messages as clearly as possible.
In Section 4.2 Essentials of EDA, these principles are illustrated in the context of some data collected in an investigation on the benefits of positive psychology interventions (described in Appendix B1).
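As a lighter-weight illustration that is not taken from that section, several of the principles can be sketched with the `airquality` data that ships with base R. The choice of dataset, the object names (`aq`, `aq_complete`, and so on), and the restriction to base R functions are assumptions made here for a self-contained example. The same steps could be written with dplyr and ggplot2.

```r
# Make a copy, so the original data remains untouched.
aq <- datasets::airquality

# Know your data: observations, variables, and their types.
print(dim(aq))
str(aq)
print(summary(aq))

# Know unusual values: here, count missing values per variable.
n_missing <- colSums(is.na(aq))
print(n_missing)

# Inspect the distribution of a variable.
hist(aq$Ozone, main = "Distribution of Ozone", xlab = "Ozone (ppb)")

# Use a filter variable to mark and select a sub-set of observations.
aq$complete <- complete.cases(aq[, c("Ozone", "Solar.R")])
aq_complete <- aq[aq$complete, ]

# Inspect a relationship between two variables.
plot(aq_complete$Temp, aq_complete$Ozone,
     xlab = "Temperature (degrees F)", ylab = "Ozone (ppb)")
r_temp_ozone <- cor(aq_complete$Temp, aq_complete$Ozone)
print(round(r_temp_ozone, 2))

# Inspect a trend over time: mean ozone per month.
ozone_by_month <- tapply(aq$Ozone, aq$Month, mean, na.rm = TRUE)
print(round(ozone_by_month, 1))
```

Storing the filter as a variable (`complete`) rather than deleting rows keeps the decision visible and reversible, which is in the spirit of working on copies and documenting each step.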
15.2.2 Caveat: Explaining vs. predicting in science
Discovering some pattern in data is usually interesting and exciting. However, we must be very careful to avoid drawing premature conclusions from it. Importantly, any observed relationship between variables could be spurious, merely due to chance fluctuations. Hence, before getting carried away by discovering some pattern in our data, we must always keep in mind:
- Science 101: To really find something, we need to predict it — and ideally replicate it under different conditions.
Thus, taking into account the principles of EDA never guarantees any results, but provides valuable insights into the structure and contents of a dataset. Gaining such insights before embarking on statistical tests minimizes the risk of missing something important or violating key assumptions. However, EDA does not replace solid research design and sound practices of inferential statistics.
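The warning about chance fluctuations can be made concrete with a small simulation. The following sketch is an illustration under stated assumptions, not part of the original text: the sample size, the number of variables, and the seed are arbitrary choices, and all object names are hypothetical. It generates pure noise, tests every pair of variables for a correlation, and then re-tests the flagged pairs on fresh data.

```r
set.seed(1)    # arbitrary seed, for reproducibility only

n_obs  <- 30   # observations per variable (assumed sample size)
n_vars <- 20   # number of mutually independent random variables

# Pure noise: by construction, no variable is related to any other.
noise <- matrix(rnorm(n_obs * n_vars), nrow = n_obs, ncol = n_vars)

# Exploration: test every pair of variables for a correlation.
var_pairs <- combn(n_vars, 2)
p_values <- apply(var_pairs, 2, function(ix) {
  cor.test(noise[, ix[1]], noise[, ix[2]])$p.value
})

n_tests <- length(p_values)
flagged <- which(p_values < .05)  # "patterns" found by chance alone
cat(length(flagged), "of", n_tests, "pairs have p < .05\n")

# Replication: draw fresh data and re-test only the flagged pairs.
noise_new <- matrix(rnorm(n_obs * n_vars), nrow = n_obs, ncol = n_vars)
p_rep <- vapply(flagged, function(j) {
  cor.test(noise_new[, var_pairs[1, j]], noise_new[, var_pairs[2, j]])$p.value
}, numeric(1))
cat(sum(p_rep < .05), "of", length(flagged), "flagged pairs replicate\n")
```

With a threshold of .05, roughly 5% of the pairwise tests are expected to look "significant" even though the data contain no real relationships. Patterns found by exploring many comparisons therefore need to be predicted and tested on new data before they count as findings.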
15.3 Conclusion
15.3.1 Summary
What is exploratory data analysis (EDA)?
Three possible interpretations are:
- a complement to confirmatory data analysis,
- an attitude or mindset (allowing for insights based on hermeneutics),
- an inevitable process to familiarize us with new data, detect patterns, and formulate better questions.
Instead of a fixed recipe, we collected 10 principles of EDA:
1. Start with a clean slate and explicitly load all data and all required packages.
2. Structure, document, and comment your analysis.
3. Make copies (and copies of copies) of your data.
4. Know your data (variables and observations).
5. Know and deal with unusual variables and unusual values.
6. Inspect the distributions of variables.
7. Use filter variables to identify and select sub-sets of observations.
8. Inspect relationships between variables.
9. Inspect trends over time or repeated measurements.
10. Create graphs that convey their messages as clearly as possible.
From a technical viewpoint, EDA involves a combination of base R data structures and commands, and is facilitated by additional tools from the dplyr, tidyr, and ggplot2 packages.
15.3.2 Resources
See the pointers to related resources at Section 4.5 Resources.
- Know your data. Really, really, know it (by Randy Au, 2019-02-15)
15.4 Exercises
The following four exercises are from Section 4.4 Exercises of the ds4psy book (Neth, 2023a) and explore the posPsy_wide dataset included in the ds4psy package: