4.3 Summary

The goal of EDA is to discover patterns in data. (…)
The role of the data analyst is to listen to the data in as many ways as possible
until a plausible “story” of the data is apparent, even if such a description
would not be borne out in subsequent samples.

John T. Behrens (1997, p. 132)

As we have seen, conducting an exploratory data analysis (EDA) is not like cooking a meal according to a fixed recipe, but more like actively engaging in detective work: We need to query our data and increase our understanding by iteratively asking and answering questions.⁴⁰

Although the process of EDA is tailored to the constraints and demands of a particular dataset, the attitude and mindset of EDA can still be practised and taught. As any actual EDA depends not just on the data, but also on the background and interests of the data analyst, we cannot distill a single sequence of steps that is always applicable. Nevertheless, skilled data detectives do not proceed randomly, but organize the process by honoring a set of principles.

4.3.1 The principles of EDA

As a quick summary, here are the ten principles of EDA mentioned above:

1. Start with a clean slate and explicitly load all data and all required packages.
2. Structure, document, and comment your analysis.
3. Make copies (and copies of copies) of your data.
4. Know your data (variables and observations).
5. Know and deal with unusual variables and unusual values.
6. Inspect the distributions of variables.
7. Use filter variables to identify and select sub-sets of observations.
8. Inspect relationships between variables.
9. Inspect trends over time or repeated measurements.
10. Create graphs that convey their messages as clearly as possible.
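
To make some of these principles more concrete, the following is a minimal sketch of how the first steps of an EDA might look in R. It is not the analysis conducted in this chapter: the built-in `airquality` data and the object names `aq`, `n_missing`, `p_ozone`, and `aq_complete` are assumptions chosen purely for illustration.

```r
# Start with a clean slate: restart R, then load all required packages explicitly.
library(dplyr)
library(ggplot2)

# Make a copy: work on a copy and leave the original data untouched.
# (airquality is a built-in dataset, used here only as a stand-in.)
aq <- datasets::airquality

# Know your data: how many observations and variables, and of which type?
dim(aq)
glimpse(aq)

# Unusual values: count the missing values in each variable.
n_missing <- colSums(is.na(aq))
n_missing

# Distributions: a histogram of one variable (missing values are dropped).
p_ozone <- ggplot(aq, aes(x = Ozone)) +
  geom_histogram(binwidth = 10, na.rm = TRUE) +
  labs(title = "Distribution of ozone measurements",
       x = "Ozone", y = "Frequency") +
  theme_minimal()
print(p_ozone)

# Filter variables: flag observations with complete measurements,
# then use the flag to select a sub-set.
aq <- aq %>%
  mutate(is_complete = !is.na(Ozone) & !is.na(Solar.R))

aq_complete <- aq %>%
  filter(is_complete)
```

Adding a logical filter variable like `is_complete` (rather than deleting rows right away) keeps all observations available, so that different sub-sets can be selected and compared later.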

Following these principles does not guarantee any particular result, but it provides valuable insights into the structure and contents of a dataset. Gaining such insights before embarking on statistical tests reduces the risk of missing something important or violating key assumptions. But before getting carried away by discovering some pattern in your data, always keep in mind:

Science 101: To really find something, we need to predict it — and ideally replicate it under different conditions.

4.3.2 Learning goals

After working through this chapter, you should be able to conduct an exploratory data analysis (EDA), which includes:

knowing the difference between exploratory and confirmatory data analysis,
initializing and organizing a new project,
screening your data (observations and variables),
checking for unusual values and distributions,
inspecting trends (over time or multiple measurements),
inspecting relationships between variables,
structuring and commenting your analysis and results.

From a technical viewpoint, we combined two core components of the tidyverse (Wickham, 2019c) to quickly answer questions about our data:

ggplot2 (Wickham et al., 2021) for visualizations of distributions of and relationships or trends between variables (see Chapter 2 on Visualizing data for details and resources), and
dplyr (Wickham, François, Henry, & Müller, 2022) for filtering and selecting data and creating new variables or descriptive summary tables (see Chapter 3 on Transforming data for details and resources).
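
As a rough illustration of how the two packages work together, the sketch below first uses dplyr to create a descriptive summary table and then passes this table to ggplot2 to visualize a trend. The `airquality` data and the names `temp_by_month` and `p_trend` are assumptions made for this example and do not appear in the chapter.

```r
library(dplyr)
library(ggplot2)

# Built-in example data (an assumption for illustration only).
aq <- datasets::airquality

# dplyr: descriptive summary table of temperature per month.
temp_by_month <- aq %>%
  group_by(Month) %>%
  summarise(n = n(),
            mean_temp = mean(Temp),
            sd_temp = sd(Temp),
            .groups = "drop")
temp_by_month

# ggplot2: visualize the trend in mean temperature over months.
p_trend <- ggplot(temp_by_month, aes(x = Month, y = mean_temp)) +
  geom_line() +
  geom_point(size = 3) +
  labs(title = "Mean temperature by month",
       x = "Month", y = "Mean temperature") +
  theme_minimal()
print(p_trend)
```

Answering a question in two steps (first a summary table, then a plot based on it) makes it easy to check the numbers behind a graph before interpreting it.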

Although conducting an EDA requires many skills and tools, the process and its results are primarily driven by the nature and properties of the data and your intellectual curiosity. You can assess your current knowledge and skills by completing the following exercises.

References

Behrens, J. T. (1997). Principles and procedures of exploratory data analysis. Psychological Methods, 2(2), 131–160. https://doi.org/10.1037/1082-989X.2.2.131

Wickham, H. (2019c). tidyverse: Easily install and load the 'tidyverse'. Retrieved from https://CRAN.R-project.org/package=tidyverse

Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., … Dunnington, D. (2021). ggplot2: Create elegant data visualisations using the grammar of graphics. Retrieved from https://CRAN.R-project.org/package=ggplot2

Wickham, H., François, R., Henry, L., & Müller, K. (2022). dplyr: A grammar of data manipulation. Retrieved from https://CRAN.R-project.org/package=dplyr

⁴⁰ If we wanted to stick with our cooking metaphor: EDA is like improvising a quick dish out of the ingredients and tools available, but with the goal of creating a more elaborate and refined meal soon.