Chapter 4 Exploratory Data Analysis

Exploratory data analysis is the process of exploring your data. It typically includes examining the structure and components of your dataset, the distributions of individual variables, and the relationships between two or more variables. The most heavily relied upon tool for exploratory data analysis is data visualization. It is arguably the most important tool because the information conveyed by a graphical display can be absorbed very quickly and because patterns are generally easy to recognize in a graphical display.

Exploratory data analysis has three main goals:

  1. To determine if there are any problems with your dataset.

  2. To determine whether the question you are asking can be answered by the data that you have.

  3. To develop a sketch of the answer to your question.

Your application of exploratory data analysis will be guided by your question. The example question used in this chapter is: “Do counties in the eastern United States have higher ozone levels than counties in the western United States?” In this instance, you will explore the data to determine whether there are problems with the dataset and whether you can answer your question with it.

To answer the question, of course, you need ozone, county, and US region data. The next step is to use exploratory data analysis to begin to answer your question, which could include displaying boxplots of ozone by region of the US. At the end of exploratory data analysis, you should have a good sense of what the answer to your question is and be armed with sufficient information to move on to the next steps of data analysis.
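As a rough illustration of what a boxplot of ozone by region summarizes, the sketch below computes the five numbers a boxplot draws (minimum, first quartile, median, third quartile, maximum) for each region. The county names, the region labels, and the ozone values (in parts per billion) are invented for this example and are not real measurements. The helper names are likewise hypothetical.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical records: (county, region, ozone in ppb).
# These values are made up for illustration; they are not real data.
records = [
    ("County A", "east", 40),
    ("County B", "east", 44),
    ("County C", "east", 48),
    ("County D", "east", 52),
    ("County E", "east", 56),
    ("County F", "west", 30),
    ("County G", "west", 35),
    ("County H", "west", 40),
    ("County I", "west", 45),
    ("County J", "west", 50),
]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the statistics a boxplot displays."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


def summarize_by_region(rows):
    """Group ozone values by region and summarize each group."""
    by_region = defaultdict(list)
    for _county, region, ozone in rows:
        by_region[region].append(ozone)
    return {region: five_number_summary(vals) for region, vals in by_region.items()}


summary = summarize_by_region(records)
for region, stats in sorted(summary.items()):
    print(region, stats)
```

Comparing the two summaries side by side gives the same first impression a pair of boxplots would: whether one region's values sit higher overall and how much the two ranges overlap.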

It’s important to note that here, again, the concept of the epicycle of analysis applies. You should have an expectation of what your dataset will look like and whether your question can be answered by the data you have. If the content and structure of the dataset don’t match your expectation, you will need to go back and figure out whether your expectation was correct (but there was a problem with the data) or whether your expectation was incorrect, in which case you cannot use the dataset to answer the question and will need to find another dataset.
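One way to make this comparison concrete is to write your expectations down before looking at the data and then check the dataset against them. The following is a minimal sketch, assuming the data arrive as a list of dictionaries; the column names, the minimum row count, and the sample rows are all assumptions made for illustration.

```python
# Hypothetical expectation: every row has a county, a region, and an ozone value.
EXPECTED_COLUMNS = {"county", "region", "ozone"}


def check_expectations(rows, expected_columns=EXPECTED_COLUMNS, min_rows=1):
    """Return a list of mismatches between the data and the stated expectations."""
    problems = []
    if len(rows) < min_rows:
        problems.append(f"expected at least {min_rows} rows, found {len(rows)}")
    for i, row in enumerate(rows):
        # A column that is absent altogether suggests a structural problem.
        missing = sorted(expected_columns - set(row))
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
            continue
        # A column that is present but empty suggests a data-quality problem.
        empty = sorted(c for c in expected_columns if row[c] is None)
        if empty:
            problems.append(f"row {i}: no value for {empty}")
    return problems


# Invented sample rows, two of which violate the expectations.
rows = [
    {"county": "County A", "region": "east", "ozone": 44},
    {"county": "County B", "region": "west", "ozone": None},
    {"county": "County C", "region": "west"},
]

for problem in check_expectations(rows, min_rows=3):
    print(problem)
```

An empty list means the data matched the expectations as stated. A non-empty list does not say which side is wrong; deciding whether the data or the expectation needs revisiting is still up to you.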

You should also have some expectation of what the ozone levels will be as well as whether one region’s ozone should be higher (or lower) than another’s. As you move to step 3, beginning to answer your question, you will again apply the epicycle of analysis. If, for example, the ozone levels in the dataset are lower than what you expected from looking at previously published data, you will need to pause and figure out whether there is an issue with your data or whether your expectation was incorrect. Your expectation could be incorrect, for example, if your source of information for setting it was data collected 20 years ago (when levels were likely higher) or data from only a single city in the US. We will go into more detail with the case study below, but this should give you an overview of the approach and goals of exploratory data analysis.