13 Exploring data

Exploratory data analysis (EDA) is a fancy name for an ordinary but important phase of any data analysis.

From a technical viewpoint, EDA is where our skills in transforming, reshaping, and visualizing data (e.g., using base R functions and additional tools from the dplyr, tidyr, and ggplot2 packages) meet so that we become familiar with a dataset.

Preparation

Recommended readings for this chapter include:

of the ds4psy book (Neth, 2023a), and the corresponding chapter

of the r4ds book (Wickham & Grolemund, 2017).

Preflections

Before reading, please take some time to reflect upon the following questions:

i2ds: Preflexions

  • What are the first steps of any data analysis?

  • Which questions should we always ask before conducting any statistical test?

  • What is the difference between generating and testing hypotheses?

13.1 Introduction

This section contrasts exploratory data analysis (EDA) with confirmatory data analysis.

Please note: This page merely summarizes the longer version at 4.1 Introduction (Neth, 2023a).

13.1.1 What is EDA?

We can distinguish between three main interpretations of EDA:

  1. EDA as the opposite of (or complement to) confirmatory data analysis:

Exploratory data analysis (EDA) is a type of data analysis that John Tukey contrasted with confirmatory data analysis (CDA) (e.g., Tukey, 1977, 1980) and likened to the work of a detective (Tukey, 1969). Long before the ubiquity of personal computers, Tukey emphasized the importance of visual displays for detecting patterns or irregularities in data, while most psychologists of the same era were obsessed with statistical rituals (like null hypothesis significance testing, NHST, see Nickerson, 2000) of a rather mechanistic and mindless nature (Gigerenzer, 2004). Irrespective of your stance towards statistics, EDA approaches data in a more curious and open-minded fashion.

  2. EDA as an attitude or mindset.

Exploratory data analysis is an attitude,
a flexibility, and a reliance on display,
NOT a bundle of techniques, and should be so taught.

John W. Tukey (1980, p. 23)

This view reflects the philosophical idea of gaining insight by hermeneutics: EDA is a data scientist’s way of doing hermeneutics (see the corresponding definitions in Wikipedia or The Stanford Encyclopedia of Philosophy for details) to get a grip on some dataset.

  3. EDA as an inevitable process to familiarize us with new data, detect potential problems and irregularities, discover patterns, and formulate better questions.

The goal of EDA is to discover patterns in data. (…)
The role of the data analyst is to listen to the data in as many ways as possible
until a plausible “story” of the data is apparent, even if such a description
would not be borne out in subsequent samples.

John T. Behrens (1997, p. 132)

If a good exploratory technique gives you more data, then maybe
good exploratory data analysis gives you more questions, or better questions.
More refined, more focused, and with a sharper point.

Roger Peng (2019), Simply Statistics

Having learned what exploratory data analysis (EDA) is and wants, we are curious to learn how to do an EDA.
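Before turning to the how, Tukey's emphasis on visual displays can be illustrated with a classic example that is not taken from this chapter's sources: Anscombe's quartet, which ships with base R as the anscombe data frame. The following minimal sketch uses only base R; the names sets and quartet_stats are our own choices. It first computes some summary statistics for the four sets and then plots them:

```r
# Anscombe's quartet: four small data sets (x1..x4, y1..y4), built into R
sets <- 1:4

# Summary statistics per set
quartet_stats <- data.frame(
  set    = sets,
  mean_x = sapply(sets, function(i) mean(anscombe[[paste0("x", i)]])),
  mean_y = sapply(sets, function(i) mean(anscombe[[paste0("y", i)]])),
  cor_xy = sapply(sets, function(i) {
    cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
  })
)
round(quartet_stats, 2)

# Visual display: one scatterplot per set, arranged in a 2 x 2 grid
op <- par(mfrow = c(2, 2))
for (i in sets) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       main = paste("Set", i), xlab = "x", ylab = "y")
}
par(op)  # restore the previous graphical settings
```

The four sets agree in their means and correlations to two decimal places, yet their plots reveal very different patterns. Numerical summaries alone would have hidden these differences.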

13.2 Essentials

As any exploration depends on the data to be explored, there are no general recipes for EDA. However, we can collect a set of good practices and recommendations.

Section 4.2 Essentials of EDA of the ds4psy book (Neth, 2023a) provides the following list:

13.2.1 The principles of EDA

As a quick summary, here are the 10 principles of EDA from that section:

  1. Start with a clean slate and explicitly load all data and all required packages.

  2. Structure, document, and comment your analysis.

  3. Make copies (and copies of copies) of your data.

  4. Know your data (variables and observations).

  5. Know and deal with unusual variables and unusual values.

  6. Inspect the distributions of variables.

  7. Use filter variables to identify and select sub-sets of observations.

  8. Inspect relationships between variables.

  9. Inspect trends over time or repeated measurements.

  10. Create graphs that convey their messages as clearly as possible.

In Section 4.2 Essentials of EDA, these principles are illustrated in the context of some data collected in an investigation on the benefits of positive psychology interventions (described in Appendix B1).
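To give a first impression of what some of these principles look like in code, here is a minimal sketch. It does not use the positive psychology data, but assumes the airquality dataset that ships with R, and it relies on base R only (in practice, many steps would be written with dplyr and ggplot2). All object names are our own choices:

```r
# Principle 1: Start with a clean slate (fresh R session) and load everything
# explicitly. Here, only base R and a built-in dataset are needed.

# Principle 3: Work on a copy and leave the original data untouched.
aq <- datasets::airquality

# Principle 4: Know your data (observations and variables).
dim(aq)   # number of rows and columns
str(aq)   # names and types of variables
head(aq)  # first observations

# Principle 5: Unusual values (here: missing values per variable).
n_missing <- colSums(is.na(aq))
n_missing

# Principle 6: Distribution of a single variable.
summary(aq$Ozone)
hist(aq$Ozone, main = "Distribution of Ozone", xlab = "Ozone (ppb)")

# Principle 7: A filter variable marks a sub-set of observations
# (here: days without any missing values).
aq$complete <- complete.cases(aq)
aq_ok <- aq[aq$complete, ]

# Principle 8: Relationship between two variables.
cor_temp_ozone <- cor(aq_ok$Temp, aq_ok$Ozone)
cor_temp_ozone
plot(aq_ok$Temp, aq_ok$Ozone,
     xlab = "Temperature (degrees F)", ylab = "Ozone (ppb)")

# Principle 9: Trend over time (mean Ozone per month).
ozone_by_month <- aggregate(Ozone ~ Month, data = aq_ok, FUN = mean)
ozone_by_month
```

Adding a filter variable (rather than deleting rows) keeps all observations available, so that we can later check whether the excluded cases differ systematically from the included ones.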

13.2.2 Caveat: Explaining vs. predicting in science

Discovering some pattern in data is usually interesting and exciting. However, we must be very careful to avoid drawing premature conclusions from it. Importantly, any observed relationship between variables could be spurious, merely due to chance fluctuations. Hence, before getting carried away by discovering some pattern in our data, we must always keep in mind:

  • Science 101: To really find something, we need to predict it — and ideally replicate it under different conditions.

Thus, taking into account the principles of EDA never guarantees any results, but provides valuable insights into the structure and contents of a dataset. Gaining such insights before embarking on statistical tests minimizes the risk of missing something important or violating key assumptions. However, EDA does not replace solid research design and sound practices of inferential statistics.
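The risk of spurious patterns can be made tangible by a small simulation. The following sketch uses base R only; all names and settings, including the seed, are arbitrary choices of ours. It correlates 100 variables of pure noise with an equally random outcome and counts how many of these correlations would pass a conventional 5% significance threshold:

```r
set.seed(13)  # arbitrary seed, only used for reproducibility

n_obs  <- 30   # observations per variable
n_vars <- 100  # number of unrelated noise variables

# Random outcome and random predictors (one column per variable)
outcome    <- rnorm(n_obs)
predictors <- matrix(rnorm(n_obs * n_vars), nrow = n_obs)

# p-value of the correlation between each noise variable and the outcome
p_values <- apply(predictors, 2, function(x) cor.test(x, outcome)$p.value)

# Number of "significant" correlations that arise purely by chance
n_false_alarms <- sum(p_values < .05)
n_false_alarms
```

As all variables are unrelated by construction, any "significant" correlation here is a false alarm. With a 5% threshold, we should expect about 5 of them on average. An exploration that inspects many relationships will thus almost always find something, which is why discovered patterns need to be predicted and replicated in new data.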

13.3 Conclusion

13.3.1 Summary

What is exploratory data analysis (EDA)?

Three possible interpretations are:

  1. a complement to confirmatory data analysis

  2. an attitude or mindset (allowing for insights based on hermeneutics)

  3. an inevitable process to familiarize us with new data, detect patterns, and formulate better questions.

Instead of a fixed recipe, we collected 10 principles of EDA:

  1. Start with a clean slate and explicitly load all data and all required packages.

  2. Structure, document, and comment your analysis.

  3. Make copies (and copies of copies) of your data.

  4. Know your data (variables and observations).

  5. Know and deal with unusual variables and unusual values.

  6. Inspect the distributions of variables.

  7. Use filter variables to identify and select sub-sets of observations.

  8. Inspect relationships between variables.

  9. Inspect trends over time or repeated measurements.

  10. Create graphs that convey their messages as clearly as possible.

From a technical viewpoint, EDA involves a combination of base R data structures and commands, and is facilitated by additional tools from the dplyr, tidyr, and ggplot2 packages.

13.3.2 Resources

See the pointers to related resources at Section 4.5 Resources.

13.3.3 Preview

This concludes the current part (getting, transforming, and exploring data).

We will study specific data types (e.g., text and date-time data) next.

13.4 Exercises

i2ds: Exercises

The following four exercises are from Section 4.4 Exercises of the ds4psy book (Neth, 2023a) and explore the posPsy_wide dataset included in the ds4psy package:
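As a possible starting point for these exercises, the following sketch takes a first look at the data. It assumes that the ds4psy package is installed. The helper function first_look() is hypothetical (our own, not part of ds4psy) and only collects some basic facts about a data frame:

```r
# Hypothetical helper (not part of ds4psy): basic facts about a data frame
first_look <- function(df) {
  list(
    n_obs     = nrow(df),       # number of observations (rows)
    n_vars    = ncol(df),       # number of variables (columns)
    n_missing = sum(is.na(df)), # total number of missing values
    var_types = vapply(df, function(v) class(v)[1], character(1))
  )
}

if (requireNamespace("ds4psy", quietly = TRUE)) {
  pp_wide <- ds4psy::posPsy_wide  # work on a copy of the data
  str(first_look(pp_wide))
} else {
  message("Package 'ds4psy' is not installed; try install.packages(\"ds4psy\").")
}
```

The same function can be applied to any intermediate result, e.g., after reshaping or filtering the data, to verify that the numbers of observations and variables are as expected.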

13.4.1 Exploring wide data

13.4.2 Selective dropouts

13.4.3 Effects of income

13.4.4 Main findings

Many representations seem so simple and straightforward that we fail to notice them. For instance, we do not care about the properties of our numeral system when checking our bank accounts, hardly notice the spelling of words when reading an exciting story, or rarely consider the peculiarities of dates when planning a picnic or a holiday.

Having extensive experience with particular representations differs from the situation that we encountered with colors. In Chapter 10, we noted that most people habitually perceive colors, but have no idea how to encode them. With numbers, text, and representations of time, we do not feel as unprepared. As our education and schooling convey particular representations, we learn to calculate, read and write, and interpret dates and times according to the norms established in our society. However, the impression that these representations are simple is deceiving, as it is mostly owed to our intense familiarity with specific representational formats.

As philosophers have pointed out, we mostly tend to notice (or “see”) representations when they break down. When we cannot compute the result of a calculation, fail to understand a written message, or realize that our clock no longer shows the right time, we become aware of the properties and preconditions that we usually take for granted. And while becoming aware of representational properties can be challenging, it also enables us to evaluate and look for alternatives.

When working with data, we must de-familiarize ourselves from familiar representations to discover and understand their underlying mechanisms. This process requires effort and has effects similar to explaining a joke or a magician’s trick: Replacing a surface phenomenon by its elementary details first seems pedantic, technical, and laborious. However, it is a necessary precondition for gaining deeper insights and eventually performing similar tricks ourselves.

In the following chapters, we will disenchant three types of representations:

  • numbers and factors (Chapter 14)
  • text (Chapter 15)
  • dates and times (Chapter 16)

Our method for discovering key properties of representations lies in blurring the boundaries between them — by treating numbers as symbol sequences, manipulating the visual appearance of words and texts, or performing calculations with dates and times. Hopefully, this will reveal the mechanisms in a playful and entertaining fashion.
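As a small foretaste of this boundary-blurring, here is a minimal sketch in base R. The values and object names are arbitrary choices of ours and do not stem from the following chapters:

```r
# A number treated as a sequence of symbols
year   <- 2024
digits <- strsplit(as.character(year), "")[[1]]
digits                                         # "2" "0" "2" "4"
year_rev  <- paste(rev(digits), collapse = "") # "4202"
digit_sum <- sum(as.integer(digits))           # 8

# A word manipulated in its visual appearance
word        <- "data"
word_upper  <- toupper(word)                                  # "DATA"
word_spaced <- paste(strsplit(word, "")[[1]], collapse = " ") # "d a t a"

# Dates used in a calculation
d1 <- as.Date("2024-02-01")
d2 <- as.Date("2024-03-01")
days_between <- as.numeric(d2 - d1)  # 29 (as 2024 is a leap year)
```

In each case, converting an object into another representation (e.g., a number into characters, or a date into a number of days) makes operations possible that its familiar form would not suggest.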