4 Visualization

Visualization in Exploratory Data Analysis

Figure 4.1: Visualization in Exploratory Data Analysis

In this section we’ll explore visualization methods in R. Visualization has been a key element of R since its inception, since visualization is central to the exploratory philosophy of the language. The base plot system generally does a good job in coming up with the most likely graphical output based on the data you provide.

par(mfrow=c(1,2))
plot(penguins$body_mass_g, penguins$flipper_length_mm)
plot(penguins$species, penguins$flipper_length_mm)
Flipper length by mass and by species, base plot system

Figure 4.2: Flipper length by mass and by species, base plot system

par(mfrow=c(1,1))

Antarctic peninsula penguin data is from Horst, Hill, and Gorman (2020).

4.1 ggplot2

We’ll mostly focus however on gpplot2, based on the Grammar of Graphics because it provides considerable control over your graphics while remaining fairly easily readable, as long as you buy into its grammar.

ggplot2 looks at three aspects of a graph:

  • data : where are the data coming from?
  • geometry : what type of graph are we creating?
  • aesthetics : what choices can we make about symbology and how do we connect symbology to data?

As with other tidyverse and RStudio packages, find the ggplot2 cheat sheet at https://www.rstudio.com/resources/cheatsheets/

The ggplot2 system provides plots of single and multiple variables, using various coordinate systems (including geographic).

4.2 Plotting one variable

  • continuous

    • histograms
    • density plots
    • dot plots
  • discrete

    • bar

[Create a new NDVI project]

We’ll look at a study of Normalized Difference Vegetation Index from a transect across a montane meadow in the northern Sierra Nevada, derived from multispectral drone imagery (J. D. Davis et al. 2020).

library(igisci)
library(tidyverse)
summary(XSptsNDVI)
##     DistNtoS       elevation     vegetation          geometry        
##  Min.   :  0.0   Min.   :1510   Length:29          Length:29         
##  1st Qu.: 37.0   1st Qu.:1510   Class :character   Class :character  
##  Median :175.0   Median :1511   Mode  :character   Mode  :character  
##  Mean   :164.7   Mean   :1511                                        
##  3rd Qu.:275.5   3rd Qu.:1511                                        
##  Max.   :298.8   Max.   :1511                                        
##   NDVIgrowing     NDVIsenescent   
##  Min.   :0.3255   Min.   :0.1402  
##  1st Qu.:0.5052   1st Qu.:0.2418  
##  Median :0.6169   Median :0.2817  
##  Mean   :0.5901   Mean   :0.3662  
##  3rd Qu.:0.6768   3rd Qu.:0.5407  
##  Max.   :0.7683   Max.   :0.7578
ggplot(XSptsNDVI, aes(vegetation)) + 
  geom_bar()
Simple bar graph of meadow vegetation samples

Figure 4.3: Simple bar graph of meadow vegetation samples

4.2.1 Histogram

First, to prepare the data, we need to use a pivot_longer on XSptsNDVI:

XSptsPheno <- XSptsNDVI %>%
  filter(vegetation != "pine") %>%
  pivot_longer(cols = starts_with("NDVI"), names_to = "phenology", values_to = "NDVI") %>%
  mutate(phenology = str_sub(phenology, 5, str_length(phenology)))
XSptsPheno %>%
  ggplot(aes(NDVI)) + 
  geom_histogram(binwidth=0.05)
Distribution of NDVI, Knuthson Meadow

Figure 4.4: Distribution of NDVI, Knuthson Meadow

Normal histogram: easier to visualize the distribution, see modes [sierra]

sierraData %>%
    ggplot(aes(TEMPERATURE)) +
  geom_histogram(fill="dark green")
Distribution of Average Monthly Temperatures, Sierra Nevada

Figure 4.5: Distribution of Average Monthly Temperatures, Sierra Nevada

Cumulative histogram with proportions: easier to see percentiles, median

n <- length(sierraData$TEMPERATURE)
sierraData %>%
  ggplot(aes(TEMPERATURE)) +
  geom_histogram(aes(y=cumsum(..count..)/n), fill="dark goldenrod")
Cumulative Distribution of Average Monthly Temperatures, Sierra Nevada

Figure 4.6: Cumulative Distribution of Average Monthly Temperatures, Sierra Nevada

4.2.2 Density Plot

Density represents how much out of the total. The total area (sum of widths of bins times densities of that bin) adds up to 1. [NDVI]

XSptsPheno %>% 
  ggplot(aes(NDVI)) + 
  geom_density()
Density plot of NDVI, Knuthson Meadow

Figure 4.7: Density plot of NDVI, Knuthson Meadow

Note that NDVI values are <1 so bins are very small numbers, so in this case densities can be >1.

Using alpha and mapping phenology as fill color. This illustrates two useful ggplot methods:

  • “mapping” a variable (phenology) to an aesthetic property (fill color of the density polygon)
  • setting a a property (alpha = 0.2) to all polygons of the density plot. The alpha channel of colors defines its opacity, from invisible (0) to opaque (1) so is commonly used to set as its reverse, transparency.
XSptsPheno %>%
  ggplot(aes(NDVI, fill=phenology)) +
  geom_density(alpha=0.2)
Comparative density plot using alpha setting

Figure 4.8: Comparative density plot using alpha setting

tidy_eucoak %>%
  ggplot(aes(log(runoff_L),fill=tree)) +
  geom_density(alpha=0.2)
Runoff under Eucalyptus and Oak in Bay Area sites

Figure 4.9: Runoff under Eucalyptus and Oak in Bay Area sites

4.2.3 boxplot

ggplot(data = tidy_eucoak) +
  geom_boxplot(aes(x = site, y = runoff_L))
Boxplot of runoff by site

Figure 4.10: Boxplot of runoff by site

Get color from tree within aes()

ggplot(data = tidy_eucoak) +
  geom_boxplot(aes(x=site, y=runoff_L, color=tree))
Runoff at Bay Area Sites, colored as Eucalyptus and Oak

Figure 4.11: Runoff at Bay Area Sites, colored as Eucalyptus and Oak

Visualizing soil CO2 data with a Tukey box plot

In a study of soil CO2 in the Marble Mountains of California, we sampled extracted soil air in a 11-point transect across Marble Valley in 1997. (J. D. Davis, Amato, and Kiefer 2001)

Marble Valley soil CO~2~ sampling

Figure 4.12: Marble Valley soil CO2 sampling

Figure 4.13: Marble Mountain cross-valley CO2 sample sites

soilCO2 <- soilCO2_97
soilCO2$SITE <- factor(soilCO2$SITE)  # in order to make the numeric field a factor
ggplot(data = soilCO2, mapping = aes(x = SITE, y = CO2pct)) +
  geom_boxplot()
Visualizing soil CO~2~ data with a Tukey box plot

Figure 4.14: Visualizing soil CO2 data with a Tukey box plot

4.3 Plotting two variables

4.3.1 Two continuous variables

We’ve looked at this before – the scatterplot – but let’s try some new data: daily discharge (Q) and other data (like EC, electrical conductivity, a surrogate for solute concentration) from Sagehen Creek, north of Truckee, CA, 1970 to present. I downloaded the data for this location (which I’ve visited multiple times to chat with the USGS hydrologist about calibration methods with my Sierra Nevada Field Campus students) from the https://waterdata.usgs.gov/nwis/sw site. If you visit this site, you can download similar data from thousands of surface water (as well as groundwater) gauges around the country.

library(tidyverse); library(lubridate); library(igisci)
sagehen <- read_csv(ex("sierra/sagehen_dv.csv"))
plot(sagehen$Q,sagehen$EC)
Scatter plot of Discharge (Q) and specific electrical conductance (EC) for Sagehen Creek

Figure 4.15: Scatter plot of Discharge (Q) and specific electrical conductance (EC) for Sagehen Creek

Streamflow and water quality are commonly best represented using a log transform, or we can just use log scaling, which retains the original units.

ggplot(data=sagehen, aes(x=Q, y=EC)) + geom_point() +
  scale_x_log10() + scale_y_log10()
Q and EC for Sagehen Creek, using log10 scaling on both axes

Figure 4.16: Q and EC for Sagehen Creek, using log10 scaling on both axes

  • For both graphs, the aes (“aesthetics”) function specifies the variables to use as x and y coordinates
  • geom_point creates a scatter plot of those coordinate points

Set color for all (not in aes())

ggplot(data=sierraFeb) + 
  geom_point(aes(TEMPERATURE, ELEVATION), color="blue")
Setting one color for all points

Figure 4.17: Setting one color for all points

  • color is defined outside of aes, so is applies to all points.
  • mapping is first argument of geom_point, so mapping = is not needed.

4.3.2 Two variables, one discrete

[eucoak]

ggplot(tidy_eucoak) +
  geom_bar(aes(site, runoff_L), stat="identity")
Two variables, one discrete

Figure 4.18: Two variables, one discrete

4.4 Color systems

There’s a lot to working with color, with different color schemes needed for continuous data vs discrete values, and situations like bidirectional data. We’ll look into some basics, but the reader is recommended to learn more at sources like the following:

4.4.1 Specifying colors to use for a graphical element

When a color is requested for an entire graphical element, like geom_point or geom_line, and not in the aesthetics, all feature get that color. In the following graph the same x & y values are used to display as points in blue and as lines in red:

sierraFeb %>%
  ggplot(aes(TEMPERATURE,ELEVATION)) +
  geom_point(color="blue") +
  geom_line(color="red")
Using aesthetics settings for both points and lines

Figure 4.19: Using aesthetics settings for both points and lines

Note the use of pipe to start with the data then apply ggplot.

4.4.2 Color from variable, in aesthetics

If color is connected to a variable within the aes() call, a color scheme is chosen to assign either a range (for continuous) or a set of unique colors (for discrete). In this graph, color is defined inside aes, so is based on COUNTY:

ggplot(data=sierraFeb) + 
  geom_point(aes(TEMPERATURE, ELEVATION, color=COUNTY))
Color set within aes()

Figure 4.20: Color set within aes()

Note that counties represesent discrete data, and this is detected by ggplot to assign an appropriate color scheme. Continuous data will require a different palette.

ggplot(data=sagehen, aes(x=Q, y=EC, col=waterTmax)) + geom_point() +
  scale_x_log10() + scale_y_log10()
Q and EC for Sagehen Creek, colored by temperature

Figure 4.21: Q and EC for Sagehen Creek, colored by temperature

River map & profile

We’ll build a riverData dataframe with x & y location values and elevation. We’ll need to start by creating empty vectors we’ll populate with values in a loop: d, longd and s are assigned an empty value double(), then slope s (since it only occurs between two points) needs one NA value assigned for the last point s[length(x] to have the same length as other vectors.

library(tidyverse)
x <- c(1000, 1100, 1300, 1500, 1600, 1800, 1900)
y <- c(500, 780, 820, 950, 1250, 1320, 1500)
elev <- c(0, 1, 2, 5, 25, 75, 150)
d <- double()      # creates an empty numeric vector 
longd <- double()  # ("double" means double-precision floating point)
s <- double()
for(i in 1:length(x)){
  if(i==1){longd[i] <- 0; d[i] <-0}
  else{
    d[i] <- sqrt((x[i]-x[i-1])^2 + (y[i]-y[i-1])^2)
    longd[i] <- longd[i-1] + d[i]
    s[i-1] <- (elev[i]-elev[i-1])/d[i]}}
s[length(x)] <- NA  # make the last slope value NA since we have no data past it, 
                    # and so the vector lengths are all the same
riverData <- bind_cols(x=x,y=y,elev=elev,d=d,longd=longd,s=s)
riverData
## # A tibble: 7 x 6
##       x     y  elev     d longd        s
##   <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1  1000   500     0    0     0   0.00336
## 2  1100   780     1  297.  297.  0.00490
## 3  1300   820     2  204.  501.  0.0126 
## 4  1500   950     5  239.  740.  0.0632 
## 5  1600  1250    25  316. 1056.  0.236  
## 6  1800  1320    75  212. 1268.  0.364  
## 7  1900  1500   150  206. 1474. NA

For this continuous data, a range of values is detected and a continous color scheme is assigned. The ggplot scale_color_gradient function is used to establish end points of a color range that the data are stretched between. The scale_color_gradient2 lets you use a mid color. Note that there’s a comparable scale_fill_gradient and scale_fill_gradient2 for use when specifying a fill (e.g. for a polygon) instead of a color (for polygons linked to the border).

ggplot(riverData, aes(x,y)) +
  geom_line(mapping=aes(col=s), size=1.2) + 
  geom_point(mapping=aes(col=s, size=elev)) +
  coord_fixed(ratio=1) + scale_color_gradient(low="green", high="red") +
  ggtitle("Simulated river path, elevations and slopes")
Range of colors from green to red

Figure 4.22: Range of colors from green to red

ggplot(riverData, aes(longd,elev)) + geom_line(aes(col=s), size=1.5) + geom_point()  +
  scale_color_gradient(low="green", high="red") +
  ggtitle("Elevation over longitudinal distance upstream")
Range of colors from green to red on longitudinal profile

Figure 4.23: Range of colors from green to red on longitudinal profile

ggplot(riverData, aes(longd,s)) + geom_point(aes(col=s), size=3) +
  scale_color_gradient(low="green", high="red") +
  ggtitle("Slope over longitudinal distance upstream")
Slope by longitudinal distance

Figure 4.24: Slope by longitudinal distance

4.4.3 Trend line

[sierra]

sierraFeb %>%
  ggplot(aes(TEMPERATURE,ELEVATION)) +
  geom_point(color="blue") +
  geom_smooth(color="red", method="lm")
Trend line using geom_smooth with a linear model

Figure 4.25: Trend line using geom_smooth with a linear model

4.4.4 General symbology

A useful vignette accessed by vignette("ggplot2-specs") lets you see aesthetic specifications for symbols, including:

  • Color & fill

  • Lines

    • line type, size, ends
  • Polygon

    • border color, linetype, size
    • fill
  • Points

    • shape
    • size
    • color & fill
    • stroke
  • Text

    • font face & size
    • justification

4.4.4.1 Categorical symbology

One example of a “Big Data” resource is EPA’s Toxic Release Inventory [air_quality] that tracks releases from a wide array of sources, from oil refineries on down. One way of dealing with big data in terms of exploring meaning is to use symbology to try to make sense of it.

library(igisci)
TRI <- read_csv(ex("TRI/TRI_2017_CA.csv")) %>%
  filter(`5.1_FUGITIVE_AIR` > 100 & `5.2_STACK_AIR` > 100)
ggplot(data = TRI, aes(log(`5.2_STACK_AIR`), log(`5.1_FUGITIVE_AIR`), 
                       color = INDUSTRY_SECTOR)) +
       geom_point()
EPA Toxic Release Inventory, categorical symbology for industry sector

Figure 4.26: EPA Toxic Release Inventory, categorical symbology for industry sector

4.4.4.2 Log scales instead of transform

In the above graph, we used the log() function in aes to use natural logarithms instead of the actual value. That’s a simple way to do this. But what if we want to display the original data, just using a logarithmic grid? This might communicate better since readers would see the actual values.

ggplot(data=TRI, aes(`5.2_STACK_AIR`,`5.1_FUGITIVE_AIR`,color=INDUSTRY_SECTOR)) +
       geom_point() + scale_x_log10() + scale_y_log10()
Using log scales instead of transforming

Figure 4.27: Using log scales instead of transforming

4.4.4.3 Graphs from grouped data

[NDVI]

XSptsPheno %>%
  ggplot() +
  geom_point(aes(elevation, NDVI, shape=vegetation, 
                 color = phenology), size = 3) +
  geom_smooth(aes(elevation, NDVI, 
                 color = phenology), method="lm") 
NDVI symbolized by vegetation in two seasons

Figure 4.28: NDVI symbolized by vegetation in two seasons

[eucoak] (Thompson, Davis, and Oliphant 2016)

ggplot(data = tidy_eucoak) +
  geom_point(mapping = aes(x = rain_mm, y = runoff_L, color = tree)) +
  geom_smooth(mapping = aes(x = rain_mm, y= runoff_L, color = tree), 
              method = "lm") +
  scale_color_manual(values = c("seagreen4", "orange3"))
Eucalyptus and Oak: rainfall and runoff

Figure 4.29: Eucalyptus and Oak: rainfall and runoff

4.4.4.4 Faceted graphs

This is another option to displaying groups of data, with parallel graphs

ggplot(data = tidy_eucoak) +
  geom_point(aes(x=rain_mm,y=runoff_L)) +
  geom_smooth(aes(x=rain_mm,y=runoff_L), method="lm") +
  facet_grid(tree ~ .)
Faceted graph alternative to color grouping

Figure 4.30: Faceted graph alternative to color grouping

Note that the y scale is the same for each, which is normally what you want since each graph is representing the same variable. If you were displaying different variables, however, you’d want to use the scales = "free_y" setting. But we’ll need to learn something about pivot tables in the next chapter to set up our data for multiple variables on a common x variable with differing y variables.

4.5 Titles and subtitles

ggplot(data = tidy_eucoak) +
  geom_point(aes(x=rain_mm,y=runoff_L, color=tree)) +
  geom_smooth(aes(x=rain_mm,y=runoff_L, color=tree), method="lm") +
  scale_color_manual(values=c("seagreen4","orange3")) +
  labs(title="rainfall ~ runoff", 
       subtitle="eucalyptus & oak sites, 2016")
Titles added

Figure 4.31: Titles added

4.6 Pairs Plot

[sierra] Pairs plots are an excellent exploratory tool to see which variables are correlated. Since only continuous data are useful for this, and since pairs plots can quickly get overly complex, it’s good to use dplyr::select to select the continuous variables, or maybe use a helper function like is.numeric with dplyr::select_if:

sierraFeb %>%
     dplyr::select(ELEVATION:TEMPERATURE) %>%
     pairs()
Pairs plot for Sierra Nevada stations variables

Figure 4.32: Pairs plot for Sierra Nevada stations variables

The GGally package has a very nice looking pairs plot that takes it even further. In addition to scatterplots it has boxplots, histograms, density plots, and correlations, all stratified and colored by species. The following code is mostly borrowed from Horst, Hill, and Gorman (2020) :

library(palmerpenguins)
penguins %>%
  dplyr::select(species, body_mass_g, ends_with("_mm")) %>%
  GGally::ggpairs(aes(color = species, alpha = 0.8)) +
  scale_colour_manual(values = c("darkorange","purple","cyan4")) +
  scale_fill_manual(values = c("darkorange","purple","cyan4"))
Enhanced GGally pairs plot for palmerpenguin data

Figure 4.33: Enhanced GGally pairs plot for palmerpenguin data

4.7 Exercises

  1. Create a bar graph of the counts of the species in the penguins data frame (Horst, Hill, and Gorman 2020). What can you say about what it shows?

  2. Use bind_cols in dplyr to create a tibble from built-in vectors state.abb and state.region, then use ggplot with geom_bar to create a bar graph of the four regions. [generic_methods]

  3. Convert the built-in time series treering into a tibble tr using the tibble() function with the single variable assigned as treering = treering (or just specifying treering will also work for this simple example), then create a histogram, using that tibble and variable for the data and x settings needed. Attach a screen capture of the histogram. (Also, learn about the treering data by entering ?treering in the console and read the Help displayed.)

  4. Start by clearing your environment with the broom icon in the Environment tab, then we’ll create two tibbles: Create a new tibble st using bind_cols with Name=state.name, Abb=state.abb, Region=state.region, and as_tibble(state.x77). Note that this works since all of the parts are sorted by state. Then use summary(st) and copy and paste its results for your answer.

  5. From st, create a density plot from the variable Frost (number of days with frost for that state). Attach that plot, and answer: approximately what is the modal value?

  6. From st create a a boxplot of Area by Region. Which region has the highest and which has the lowest median Area? Do the same for Frost.

  7. From st, compare murder rate (y=Murder) to Frost (x) in a scatter plot, colored by Region.

  8. Add a trend line (smooth) with method=“lm” to your scatterplot, not colored by Region (but keep the points colored by Region). What can you say about what this graph is showing you?

  9. Add a title to your graph.

  10. Change your scatterplot to place labels using the Abb variable (still colored by Region) using geom_label(aes(label=Abb, col=Region)). Any observations about outliers?

  11. Change the boxplot of CO2 soil samples by site to use a log10 scale grid but display the original numbers (i.e., not in aes()).

References

Davis, J. D., P. Amato, and R. Kiefer. 2001. “Soil Carbon Dioxide in a Summer-Dry Subalpine Karst, Marble Mountains, California, USA.” Zeitschrift Für Geomorphologie N.F. 45. https://www.researchgate.net/publication/258333952_Soil_carbon_dioxide_in_a_summer-dry_subalpine_karst_Marble_Mountains_California_USA.
Davis, J. D., L. Blesius, M. Slocombe, S. Maher, M. Vasey, P. Christian, and P. Lynch. 2020. “Unpiloted Aerial System (UAS)-Supported Biogeomorphic Analysis of Restored Sierra Nevada Montane Meadows.” Remote Sensing 12. https://www.mdpi.com/2072-4292/12/11/1828.
Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://allisonhorst.github.io/palmerpenguins/.
Thompson, A., J. D. Davis, and A. J. Oliphant. 2016. “Surface Runoff and Soil Erosion Under Eucalyptus and Oak Canopy.” Earth Surface Processes and Landforms. https://doi.org/10.1002/esp.3881.