Exploratory Data Analysis

Adapted for ForestGEO from R for Data Science, by Hadley Wickham and Garrett Grolemund

Mauro Lepore

2018-06-24

Introduction

The goal of this article is to show you how to explore your data with general-purpose tools (not with fgeo (https://forestgeo.github.io/fgeo/)). It closely reproduces the section Exploratory Data Analysis in R for Data Science (http://r4ds.had.co.nz/), by Hadley Wickham and Garrett Grolemund, except that the examples and some of the text were changed to better engage ecologists from ForestGEO, although readers from outside ForestGEO should also be able to follow along. If anything here is poorly written, it is surely my fault.

You will learn how to explore your data systematically, using modern and powerful tools for visualization and transformation. An exploratory data analysis is an iterative process that aims to use data and some research questions to learn something that can help you refine those questions and move forward along the learning spiral.

Prerequisites

Most functions we will use are available in the tidyverse package (https://www.tidyverse.org/). If you are new to the tidyverse, you may feel a little frustrated at the beginning, but your effort will quickly pay off. The tools in the tidyverse are powerful, relatively easy to use, and consistent, so once you learn some of them you will pick up most of the others intuitively.

The data we will use are a public dataset of trees censused in a forest plot in Luquillo, Puerto Rico. They are available as luquillo_stem_random in the data package fgeo.data (https://forestgeo.github.io/fgeo.data/).
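
For reference, here is a minimal setup sketch. It assumes that the object stem used throughout this article is the luquillo_stem_random dataset loaded from fgeo.data.

library(tidyverse)
library(fgeo.data)

# Assumption: `stem` refers to the Luquillo stem data throughout this article
stem <- luquillo_stem_random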

Style

The tidyverse style guide

The code style I use here follows the tidyverse guide (http://style.tidyverse.org/).

Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.

Combining multiple operations with the pipe %>%

Often I combine multiple operations with the pipe %>% because it makes the code considerably more readable.

[The pipe] focuses on the transformations, not what’s being transformed, which makes the code easier to read. You can read it as a series of imperative statements: group, then summarise, then filter. As suggested by this reading, a good way to pronounce %>% when reading code is “then”.

Behind the scenes, x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z) and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.

Working with the pipe is one of the key criteria for belonging to the tidyverse. The only exception is ggplot2: it was written before the pipe was discovered.

-- R for Data Science (http://r4ds.had.co.nz/transform.html)
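
As a quick illustration of that rewriting, these two calls are equivalent (a toy example; any function whose first argument is the data works the same way):

# x %>% f(y) turns into f(x, y)
c(1.5, 2.5, NA) %>% mean(na.rm = TRUE)
#> [1] 2
mean(c(1.5, 2.5, NA), na.rm = TRUE)
#> [1] 2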

Questions and Definitions

Questions

To start exploring your data, you can generally ask these two questions:

  • What type of variation occurs within my variables?
  • What type of covariation occurs between my variables?

Definitions

A variable is a quantity, quality, or property that you can measure.

A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.

An observation is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a data point.

Tabular data is a set of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.

Variation is the tendency of the values of a variable to change from measurement to measurement.

Variation

Every variable varies with a particular pattern, and that pattern can be insightful. To understand the pattern of variation of a variable, we can visualize the distribution of the variable’s values. The best way to visualize a variable’s distribution depends on whether the variable is categorical or continuous.

The stem dataset has both categorical and continuous variables. In R, categorical variables are usually stored as character strings (<chr>) or factors (<fctr>), and continuous variables are stored as integers (<int>) or doubles (<dbl>).

stem is a tibble. A tibble is a subclass of data frame; compared to regular data frames, tibbles are better suited to large data, and they print more information, more nicely.

class(stem)
#> [1] "tbl_df"     "tbl"        "data.frame"

stem
#> # A tibble: 7,944 x 19
#>    treeID stemID tag    StemTag sp     quadrat    gx    gy MeasureID
#>     <int>  <int> <chr>  <chr>   <chr>  <chr>   <dbl> <dbl>     <int>
#>  1    104    143 10009  10009   DACEXC 113      10.3  245.       143
#>  2    119    158 100104 100104  MYRSPL 1021    183.   410.       158
#>  3    180    222 100171 100095  CASARB 921     165.   410.       222
#>  4    180    223 100171 100096  CASARB 921     165.   410.       223
#>  5    180    224 100171 100171  CASARB 921     165.   410.       224
#>  6    180    225 100171 100174  CASARB 921     165.   410.       225
#>  7    602    736 100649 100649  GUAGUI 821     149.   414.       736
#>  8    631    775 10069  10069   PREMON 213      38.3  245.       775
#>  9    647    793 100708 100708  SCHMOR 821     143.   411.       793
#> 10   1086   1339 10122  10122   DRYGLA 413      68.9  253.      1339
#> # ... with 7,934 more rows, and 10 more variables: CensusID <int>,
#> #   dbh <dbl>, pom <chr>, hom <dbl>, ExactDate <dbl>, DFstatus <chr>,
#> #   codes <chr>, countPOM <dbl>, status <chr>, date <dbl>

By default, tibbles print a few rows only; to print more rows use:

print(<YOUR_TIBBLE>, n = <N_ROWS>, width = <N_COLUMNS>)
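
For example, to print 15 rows and all columns of stem:

print(stem, n = 15, width = Inf)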

You can always convert a tibble into a regular data frame.

# This is awkward because it prints far more than fits on a screen
as.data.frame(stem)  

For alternative views try:

# informative and prints nicely
str(stem)  
#> Classes 'tbl_df', 'tbl' and 'data.frame':    7944 obs. of  19 variables:
#>  $ treeID   : int  104 119 180 180 180 180 602 631 647 1086 ...
#>  $ stemID   : int  143 158 222 223 224 225 736 775 793 1339 ...
#>  $ tag      : chr  "10009" "100104" "100171" "100171" ...
#>  $ StemTag  : chr  "10009" "100104" "100095" "100096" ...
#>  $ sp       : chr  "DACEXC" "MYRSPL" "CASARB" "CASARB" ...
#>  $ quadrat  : chr  "113" "1021" "921" "921" ...
#>  $ gx       : num  10.3 182.9 164.6 164.6 164.6 ...
#>  $ gy       : num  245 410 410 410 410 ...
#>  $ MeasureID: int  143 158 222 223 224 225 736 775 793 1339 ...
#>  $ CensusID : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ dbh      : num  115 16 17.2 11.7 80 19.4 24.1 100 146 165 ...
#>  $ pom      : chr  "1.2" "1.3" "1.3" "1.3" ...
#>  $ hom      : num  1.2 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 ...
#>  $ ExactDate: num  7831 8609 8610 8610 8610 ...
#>  $ DFstatus : chr  "alive" "alive" "alive" "alive" ...
#>  $ codes    : chr  "MAIN;A" "MAIN;A" "SPROUT;A" "SPROUT;A" ...
#>  $ countPOM : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ status   : chr  "A" "A" "A" "A" ...
#>  $ date     : num  NA NA NA NA NA NA NA NA NA NA ...
# like str() but shows as much data as possible
glimpse(stem)  
#> Observations: 7,944
#> Variables: 19
#> $ treeID    <int> 104, 119, 180, 180, 180, 180, 602, 631, 647, 1086, 1...
#> $ stemID    <int> 143, 158, 222, 223, 224, 225, 736, 775, 793, 1339, 1...
#> $ tag       <chr> "10009", "100104", "100171", "100171", "100171", "10...
#> $ StemTag   <chr> "10009", "100104", "100095", "100096", "100171", "10...
#> $ sp        <chr> "DACEXC", "MYRSPL", "CASARB", "CASARB", "CASARB", "C...
#> $ quadrat   <chr> "113", "1021", "921", "921", "921", "921", "821", "2...
#> $ gx        <dbl> 10.31, 182.89, 164.61, 164.61, 164.61, 164.61, 148.9...
#> $ gy        <dbl> 245.36, 410.15, 409.50, 409.50, 409.50, 409.50, 414....
#> $ MeasureID <int> 143, 158, 222, 223, 224, 225, 736, 775, 793, 1339, 1...
#> $ CensusID  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
#> $ dbh       <dbl> 115.0, 16.0, 17.2, 11.7, 80.0, 19.4, 24.1, 100.0, 14...
#> $ pom       <chr> "1.2", "1.3", "1.3", "1.3", "1.3", "1.3", "1.3", "1....
#> $ hom       <dbl> 1.20, 1.30, 1.30, 1.30, 1.30, 1.30, 1.30, 1.30, 1.30...
#> $ ExactDate <dbl> 7831, 8609, 8610, 8610, 8610, 8610, 8611, 7829, 8611...
#> $ DFstatus  <chr> "alive", "alive", "alive", "alive", "alive", "alive"...
#> $ codes     <chr> "MAIN;A", "MAIN;A", "SPROUT;A", "SPROUT;A", "MAIN;A"...
#> $ countPOM  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
#> $ status    <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A...
#> $ date      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...

If you are in RStudio, your best option might be View(), which lets you search values and filter columns.

View(stem)           

Visualizing Distributions

Categorical variables

A variable is categorical if it can only take one of a small set of values. To examine the distribution of a categorical variable, use a bar chart.

The categories A, D, and M of the variable status mean “alive”, “dead”, and “missing” (http://ctfs.si.edu/Public/DataDict/data_dict.php).
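
A minimal sketch of such a bar chart for the status variable:

ggplot(stem) +
  geom_bar(aes(status))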

Continuous Variables

A variable is continuous if it can take any of an infinite set of ordered values. To examine the distribution of a continuous variable use a histogram. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns.

Let’s explore dbh, which represents the stem diameter at breast height. For now we will focus on small stems; we’ll explore big stems later. We will try bars of different widths (with binwidth) and choose one for further analyses.
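
Here is one way to do that. Note that the cutoff defining “small stems” (dbh below 200 mm) is an arbitrary choice made for illustration.

small_stems <- filter(stem, dbh < 200)

# Narrow bars: lots of detail, possibly noisy
ggplot(small_stems) +
  geom_histogram(aes(dbh), binwidth = 5)

# Wider bars: a smoother picture of the distribution
ggplot(small_stems) +
  geom_histogram(aes(dbh), binwidth = 30)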

A binwidth of 30 seems useful.

To overlay multiple histograms in the same plot, geom_freqpoly() may produce clearer plots than geom_histogram(), because overlapping lines are easier to read than overlapping bars.

Typical and rare values

In both bar charts and histograms, tall and short bars let us explore common and less-common values. For example, we could ask:

  • Which values are the most common? Why?
  • Which values are rare? Why? Does that match your expectations?
  • Can you see any unusual patterns? What might explain them?

The latter two plots give a good estimate of which stem diameters are most common. You can compute the same counts manually by cutting the variable dbh with ggplot2::cut_width() and then counting the pieces with dplyr::count():
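
For example, using a binwidth of 30, as above:

stem %>% 
  count(cut_width(dbh, 30))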

The most common stems are those whose diameter (dbh) is between 15 mm and 45 mm.

Clustered values

In this section we will focus on the largest stems; later we will explore all stems.

Clusters of similar values suggest that subgroups exist in your data. To understand the subgroups, ask:

  • How are the observations within each cluster similar to each other?
  • How are the observations in separate clusters different from each other?
  • How can you explain or describe the clusters?
  • Why might the appearance of clusters be misleading?

The clustering, however, may be an artifact of the chosen binwidth:
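
A sketch of that comparison; the threshold used here to define the largest stems (dbh over 400 mm) is an arbitrary choice for illustration.

large_stems <- filter(stem, dbh > 400)

# With narrow bars, the large stems appear to form clusters
ggplot(large_stems) +
  geom_histogram(aes(dbh), binwidth = 10)

# With wider bars, the apparent clusters largely disappear
ggplot(large_stems) +
  geom_histogram(aes(dbh), binwidth = 50)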

Unusual values

In this section we will work with the entire stem data set.

Outliers are observations that are unusual; data points that don’t seem to fit the pattern. Sometimes outliers are data entry errors; other times outliers suggest important new science. When you have a lot of data, outliers are sometimes difficult to see in a histogram.

For example, take the distribution of the dbh variable of the stem dataset. Notice how wide the limits of the x-axis are:
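
ggplot(stem) +
  geom_histogram(aes(dbh), binwidth = 30)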

There are so many observations in the common bins that the bars for the rare bins are too short to see. To make the unusual values easier to spot, we can zoom in on small values of the y-axis with coord_cartesian():
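
For example (the exact y limit is a matter of taste):

ggplot(stem) +
  geom_histogram(aes(dbh), binwidth = 30) +
  coord_cartesian(ylim = c(0, 50))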

coord_cartesian() also has an xlim argument for when you need to zoom into the x-axis. ggplot2 also has xlim() and ylim() functions that work slightly differently: they throw away the data outside the limits.

This allows us to see that dbh values over ~625 are unusual and may be outliers. We can pluck them out with dplyr::filter() and select a subset of informative variables:
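
One way to do that; the 625 mm cutoff and the columns kept here are illustrative choices.

unusual <- stem %>% 
  filter(dbh > 625) %>% 
  select(stemID, treeID, sp, quadrat, gx, gy, dbh) %>% 
  arrange(dbh)
unusual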

You could plot the unusual stems to see where they are located.
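
For example:

ggplot(unusual) +
  geom_point(aes(gx, gy)) +
  coord_fixed()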

It’s good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to replace them with missing values, and move on. However, if they have a substantial effect on your results, you shouldn’t drop them without justification. You’ll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.

Missing Values

If you’ve encountered unusual values in your data set, and simply want to move on to the rest of your analysis, you have two options.

  1. Drop the entire row with the strange values:
are_usual <- !stem$stemID %in% unusual$stemID
usual <- filter(stem, are_usual)

# Confirm the dataset of usual stems has fewer rows than the full dataset.
nrow(stem)
#> [1] 7944
nrow(usual)
#> [1] 7920

I don’t recommend this option: just because one measurement is invalid doesn’t mean all the measurements are. Additionally, if you have low-quality data, by the time you’ve applied this approach to every variable you might find that you don’t have any data left!

  2. Instead, I recommend replacing the unusual values with missing values. The easiest way to do this is to use dplyr::mutate() to replace the variable with a modified copy. You can use the ifelse() function to replace unusual values with NA:
are_unusual <- !are_usual
with_unusual_made_NA <- stem %>% 
  mutate(dbh = ifelse(are_unusual, NA_real_, dbh))

ifelse() has three arguments. The first argument test should be a logical vector. The result will contain the value of the second argument, yes, when test is TRUE, and the value of the third argument, no, when it is FALSE.
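
A toy example:

ifelse(c(TRUE, FALSE, TRUE), 1, 0)
#> [1] 1 0 1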

# Confirm no rows have been removed,
nrow(stem)
#> [1] 7944
nrow(with_unusual_made_NA)
#> [1] 7944

# but dbh of unusual stems is NA
unusual_only <- with_unusual_made_NA %>% 
  filter(are_unusual) %>% 
  select(dbh, stemID)
unusual_only
#> # A tibble: 24 x 2
#>      dbh stemID
#>    <dbl>  <int>
#>  1    NA  30562
#>  2    NA  32585
#>  3    NA  34692
#>  4    NA  80141
#>  5    NA  30562
#>  6    NA  32585
#>  7    NA  34692
#>  8    NA  80141
#>  9    NA  30562
#> 10    NA  32585
#> # ... with 14 more rows

As an alternative to ifelse(), use dplyr::case_when(). case_when() is particularly useful inside mutate() when you want to create a new variable that relies on a complex combination of existing variables.
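
For example, this sketch does the same replacement as ifelse() above, assuming the unusual data created earlier:

stem %>% 
  mutate(
    dbh = case_when(
      stemID %in% unusual$stemID ~ NA_real_,
      TRUE                       ~ dbh
    )
  )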

It’s not obvious where you should plot missing values in a plot, so ggplot2 doesn’t include them, but it does warn that they’ve been removed (left plot below). To suppress that warning, set na.rm = TRUE (right plot below).

# Left: don't remove NAs and get a warning on the console
ggplot(unusual_only, aes(dbh)) + 
  geom_histogram(binwidth = useful_binwidth) +
  labs(title = "Expect empty plot but get a warning")
#> Warning: Removed 24 rows containing non-finite values (stat_bin).

# Right: NAs explicitly removed in geom_histogram(), so no warning
ggplot(unusual_only, aes(dbh)) + 
  geom_histogram(binwidth = useful_binwidth, na.rm = TRUE) +
  labs(title = "Expect empty plot but no warning")

Other times you want to understand what makes observations with missing values different to observations with recorded values.

For example, you might want to compare the status for stems with missing and non-missing values of dbh. You can do this by making a new variable with is.na().

missing_dbh <- stem %>% 
  mutate(missing = is.na(dbh))

ggplot(missing_dbh) + 
  geom_bar(aes(status, fill = missing))

Exercises

  1. What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?

  2. What does na.rm = TRUE do in mean() and sum()?

Covariation

If variation describes the behavior within a variable, covariation describes the behavior between variables. Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualize the relationship between two or more variables. How you do that should again depend on the type of variables involved.

A categorical and continuous variable

It’s common to want to explore the distribution of a continuous variable broken down by a categorical variable. The default appearance of geom_freqpoly() is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it’s hard to see the differences in shape. For example, let’s explore how stem diameter (dbh) varies with species (sp):
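
A sketch of that comparison, keeping only a few species so the lines stay readable (the particular species are an arbitrary choice):

some_sp <- filter(stem, sp %in% c("CASARB", "PREMON", "DACEXC"))

ggplot(some_sp, aes(dbh, colour = sp)) +
  geom_freqpoly(binwidth = 30)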

It’s hard to see the difference in distribution because the overall counts differ so much:

To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we’ll display density, which is the count standardized so that the area under each frequency polygon is one.
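
Mapping density instead of count on the y-axis (..density.. was the idiom when this article was written; newer ggplot2 versions prefer after_stat(density)):

ggplot(some_sp, aes(dbh, ..density.., colour = sp)) +
  geom_freqpoly(binwidth = 30)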

Another alternative to display the distribution of a continuous variable broken down by a categorical variable is the boxplot. A boxplot is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of:

  • A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR), with a line in the middle of the box that displays the median of the distribution.
  • Points that display observations falling more than 1.5 times the IQR from either edge of the box; these are outlying points.
  • A line (or whisker) that extends from each end of the box to the farthest non-outlying point in the distribution.

Let’s take a look at the distribution of stem diameter (dbh) by species (sp) using geom_boxplot():
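
For example:

ggplot(stem) +
  geom_boxplot(aes(sp, dbh))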

We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot).

Above we ordered the variable sp by its frequency. But now, to make the trend easier to see, we can reorder sp based on the median value of dbh. One way to do that is with reorder().
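
For example (na.rm = TRUE is passed on to median() because dbh contains missing values):

ggplot(stem) +
  geom_boxplot(aes(reorder(sp, dbh, FUN = median, na.rm = TRUE), dbh))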

If you have long variable names, geom_boxplot() will work better if you flip it 90°. You can do that with coord_flip() (left plot below), or with ggstance::geom_boxploth() and swapping x and y mappings (right plot below):
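
For example, with coord_flip():

ggplot(stem) +
  geom_boxplot(aes(reorder(sp, dbh, FUN = median, na.rm = TRUE), dbh)) +
  coord_flip()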

Alternatives

One problem with boxplots is that they were developed in an era of much smaller data sets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter-value plot.
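
A sketch, assuming the lvplot package is installed:

# install.packages("lvplot")
ggplot(stem, aes(sp, dbh)) +
  lvplot::geom_lv()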

Compare and contrast geom_violin() with a faceted geom_histogram(), or a colored geom_freqpoly(). What are the pros and cons of each method?

If you have a small data set, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter().

Two categorical variables

To visualize the covariation between categorical variables, you’ll need to count the number of observations for each combination. One way to do that is to rely on the built-in geom_count(). The size of each circle in the plot displays how many observations occurred at each combination of values (left plot below). Covariation will appear as a strong correlation between specific x values and specific y values.
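
For example:

ggplot(stem) +
  geom_count(aes(status, sp))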

To more clearly show the distribution of status within sp (or sp within status) you can map bubble size to a proportion calculated over sp (or over status):
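
One way to compute such a proportion; here it is the proportion of each status within each species, and the intermediate object sp_status is just a convenience name.

sp_status <- stem %>% 
  count(sp, status) %>% 
  group_by(sp) %>% 
  mutate(prop = n / sum(n)) %>% 
  ungroup()

ggplot(sp_status) +
  geom_point(aes(status, sp, size = prop))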

Then visualize with geom_tile() and the fill aesthetic:
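
Using the same sp_status from above:

ggplot(sp_status) +
  geom_tile(aes(status, sp, fill = prop))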

Two continuous variables

You’ve already seen one great way to visualize the covariation between two continuous variables: draw a scatterplot with geom_point(). You can see covariation as a pattern in the points. For example, we can estimate the basal area of each stem as a function of dbh with the equation ba = 1 / 4 * pi * (dbh)^2. I’ll add some error to make the pattern a little more interesting.
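
A sketch of that; the noise term is made up for illustration (multiplicative, so ba stays positive), and rows with missing dbh are dropped to keep later examples simple.

set.seed(123)
stem_ba <- stem %>% 
  filter(!is.na(dbh)) %>% 
  mutate(ba = (1 / 4) * pi * dbh^2 * runif(n(), min = 0.9, max = 1.1))

ggplot(stem_ba) +
  geom_point(aes(dbh, ba))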

Scatterplots become less useful as the size of your data set grows, because points begin to overplot, and pile up into areas of uniform black (as above). You’ve already seen one way to fix the problem: using the alpha aesthetic to add transparency.
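
For example:

ggplot(stem_ba) +
  geom_point(aes(dbh, ba), alpha = 1 / 10)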

But using transparency can be challenging for very large data sets. Another solution is to bin. Previously you used geom_histogram() and geom_freqpoly() to bin in one dimension. Now you’ll learn how to use geom_bin2d() and geom_hex() to bin in two dimensions.

geom_bin2d() and geom_hex() divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. geom_bin2d() creates rectangular bins. geom_hex() creates hexagonal bins. You will need to install the hexbin package to use geom_hex().
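
For example:

ggplot(stem_ba) +
  geom_bin2d(aes(dbh, ba))

# install.packages("hexbin")
ggplot(stem_ba) +
  geom_hex(aes(dbh, ba))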

Another option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualizing the combination of a categorical and a continuous variable that you learned about. For example, you could bin dbh and then for each group, display a boxplot:
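
For example, binning dbh into 100 mm bins (the width is an arbitrary choice):

ggplot(stem_ba, aes(cut_width(dbh, 100), ba)) +
  geom_boxplot()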

cut_width(x, width), as used above, divides x into bins of width width. By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it’s difficult to tell that each boxplot summarizes a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points with varwidth = TRUE.
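
For example:

ggplot(stem_ba, aes(cut_width(dbh, 100), ba)) +
  geom_boxplot(varwidth = TRUE)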

Another approach is to display approximately the same number of points in each bin. That’s the job of cut_number():
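
For example, with ten bins each holding roughly the same number of stems:

ggplot(stem_ba, aes(cut_number(dbh, 10), ba)) +
  geom_boxplot()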

Patterns and models

Patterns in your data provide clues about relationships. If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:

  • Could this pattern be due to coincidence (i.e. random chance)?
  • How can you describe the relationship implied by the pattern?
  • How strong is the relationship implied by the pattern?
  • What other variables might affect the relationship?
  • Does the relationship change if you look at individual subgroups of the data?

Above we learned that a scatterplot of stem diameter (dbh) versus basal area (ba) shows a pattern: larger stem diameters are associated with larger basal area.

Patterns provide one of the most useful tools for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables co-vary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.

Models are a tool for extracting patterns out of data. For example, consider the stem data; dbh and ba are tightly related. It’s possible to use a model to remove the very strong relationship between dbh and ba so we can explore the subtleties that remain. The following code fits a model that predicts ba from dbh and then computes the residuals (the difference between the predicted value and the actual value). The residuals give us a view of the basal area of a stem, once the effect of dbh has been removed.
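
A sketch of that, using the stem_ba data created above and the modelr package (assumed to be installed):

library(modelr)

# Model ba as a function of dbh on the log scale, then add residuals:
# what remains of ba once the effect of dbh is removed.
mod <- lm(log(ba) ~ log(dbh), data = stem_ba)

stem_resid <- stem_ba %>% 
  add_residuals(mod) %>% 
  mutate(resid = exp(resid))

ggplot(stem_resid) +
  geom_point(aes(dbh, resid))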

There is no pattern left. This is unsurprising: we created ba as a precise function of dbh plus some random error, so once we remove the pattern due to dbh all that remains is noise.