4 Visualization

Figure 4.1: Visualization in Exploratory Data Analysis
In this section we’ll explore visualization methods in R. Visualization has been a key element of R since its inception, since visualization is central to the exploratory philosophy of the language. The base plot system generally does a good job in coming up with the most likely graphical output based on the data you provide.
par(mfrow=c(1,2))
plot(penguins$body_mass_g, penguins$flipper_length_mm)
plot(penguins$species, penguins$flipper_length_mm)

Figure 4.2: Flipper length by mass and by species, base plot system
par(mfrow=c(1,1))
Antarctic peninsula penguin data is from Horst, Hill, and Gorman (2020).
4.1 ggplot2
We’ll mostly focus however on gpplot2, based on the Grammar of Graphics because it provides considerable control over your graphics while remaining fairly easily readable, as long as you buy into its grammar.
ggplot2 looks at three aspects of a graph:
- data : where are the data coming from?
- geometry : what type of graph are we creating?
- aesthetics : what choices can we make about symbology and how do we connect symbology to data?
As with other tidyverse and RStudio packages, find the ggplot2 cheat sheet at https://www.rstudio.com/resources/cheatsheets/
The ggplot2 system provides plots of single and multiple variables, using various coordinate systems (including geographic).
4.2 Plotting one variable
continuous
- histograms
- density plots
- dot plots
discrete
- bar
[Create a new NDVI
project]
We’ll look at a study of Normalized Difference Vegetation Index from a transect across a montane meadow in the northern Sierra Nevada, derived from multispectral drone imagery (J. D. Davis et al. 2020).
library(igisci)
library(tidyverse)
summary(XSptsNDVI)
## DistNtoS elevation vegetation geometry
## Min. : 0.0 Min. :1510 Length:29 Length:29
## 1st Qu.: 37.0 1st Qu.:1510 Class :character Class :character
## Median :175.0 Median :1511 Mode :character Mode :character
## Mean :164.7 Mean :1511
## 3rd Qu.:275.5 3rd Qu.:1511
## Max. :298.8 Max. :1511
## NDVIgrowing NDVIsenescent
## Min. :0.3255 Min. :0.1402
## 1st Qu.:0.5052 1st Qu.:0.2418
## Median :0.6169 Median :0.2817
## Mean :0.5901 Mean :0.3662
## 3rd Qu.:0.6768 3rd Qu.:0.5407
## Max. :0.7683 Max. :0.7578
ggplot(XSptsNDVI, aes(vegetation)) +
geom_bar()

Figure 4.3: Simple bar graph of meadow vegetation samples
4.2.1 Histogram
First, to prepare the data, we need to use a pivot_longer on XSptsNDVI:
<- XSptsNDVI %>%
XSptsPheno filter(vegetation != "pine") %>%
pivot_longer(cols = starts_with("NDVI"), names_to = "phenology", values_to = "NDVI") %>%
mutate(phenology = str_sub(phenology, 5, str_length(phenology)))
%>%
XSptsPheno ggplot(aes(NDVI)) +
geom_histogram(binwidth=0.05)

Figure 4.4: Distribution of NDVI, Knuthson Meadow
Normal histogram: easier to visualize the distribution, see modes [sierra
]
%>%
sierraData ggplot(aes(TEMPERATURE)) +
geom_histogram(fill="dark green")

Figure 4.5: Distribution of Average Monthly Temperatures, Sierra Nevada
Cumulative histogram with proportions: easier to see percentiles, median
<- length(sierraData$TEMPERATURE)
n %>%
sierraData ggplot(aes(TEMPERATURE)) +
geom_histogram(aes(y=cumsum(..count..)/n), fill="dark goldenrod")

Figure 4.6: Cumulative Distribution of Average Monthly Temperatures, Sierra Nevada
4.2.2 Density Plot
Density represents how much out of the total. The total area (sum of widths of bins times densities of that bin) adds up to 1. [NDVI
]
%>%
XSptsPheno ggplot(aes(NDVI)) +
geom_density()

Figure 4.7: Density plot of NDVI, Knuthson Meadow
Note that NDVI values are <1 so bins are very small numbers, so in this case densities can be >1.
Using alpha and mapping phenology as fill color. This illustrates two useful ggplot methods:
- “mapping” a variable (phenology) to an aesthetic property (fill color of the density polygon)
- setting a a property (alpha = 0.2) to all polygons of the density plot. The alpha channel of colors defines its opacity, from invisible (0) to opaque (1) so is commonly used to set as its reverse, transparency.
%>%
XSptsPheno ggplot(aes(NDVI, fill=phenology)) +
geom_density(alpha=0.2)

Figure 4.8: Comparative density plot using alpha setting
%>%
tidy_eucoak ggplot(aes(log(runoff_L),fill=tree)) +
geom_density(alpha=0.2)

Figure 4.9: Runoff under Eucalyptus and Oak in Bay Area sites
4.2.3 boxplot
ggplot(data = tidy_eucoak) +
geom_boxplot(aes(x = site, y = runoff_L))

Figure 4.10: Boxplot of runoff by site
Get color from tree within aes()
ggplot(data = tidy_eucoak) +
geom_boxplot(aes(x=site, y=runoff_L, color=tree))

Figure 4.11: Runoff at Bay Area Sites, colored as Eucalyptus and Oak
Visualizing soil CO2 data with a Tukey box plot
In a study of soil CO2 in the Marble Mountains of California, we sampled extracted soil air in a 11-point transect across Marble Valley in 1997. (J. D. Davis, Amato, and Kiefer 2001)

Figure 4.12: Marble Valley soil CO2 sampling
Figure 4.13: Marble Mountain cross-valley CO2 sample sites
<- soilCO2_97
soilCO2 $SITE <- factor(soilCO2$SITE) # in order to make the numeric field a factor
soilCO2ggplot(data = soilCO2, mapping = aes(x = SITE, y = CO2pct)) +
geom_boxplot()

Figure 4.14: Visualizing soil CO2 data with a Tukey box plot
4.3 Plotting two variables
4.3.1 Two continuous variables
We’ve looked at this before – the scatterplot – but let’s try some new data: daily discharge (Q) and other data (like EC, electrical conductivity, a surrogate for solute concentration) from Sagehen Creek, north of Truckee, CA, 1970 to present. I downloaded the data for this location (which I’ve visited multiple times to chat with the USGS hydrologist about calibration methods with my Sierra Nevada Field Campus students) from the https://waterdata.usgs.gov/nwis/sw site. If you visit this site, you can download similar data from thousands of surface water (as well as groundwater) gauges around the country.
library(tidyverse); library(lubridate); library(igisci)
<- read_csv(ex("sierra/sagehen_dv.csv"))
sagehen plot(sagehen$Q,sagehen$EC)

Figure 4.15: Scatter plot of Discharge (Q) and specific electrical conductance (EC) for Sagehen Creek
Streamflow and water quality are commonly best represented using a log transform, or we can just use log scaling, which retains the original units.
ggplot(data=sagehen, aes(x=Q, y=EC)) + geom_point() +
scale_x_log10() + scale_y_log10()

Figure 4.16: Q and EC for Sagehen Creek, using log10 scaling on both axes
- For both graphs, the
aes
(“aesthetics”) function specifies the variables to use as x and y coordinates - geom_point creates a scatter plot of those coordinate points
Set color for all (not in aes())
ggplot(data=sierraFeb) +
geom_point(aes(TEMPERATURE, ELEVATION), color="blue")

Figure 4.17: Setting one color for all points
- color is defined outside of
aes
, so is applies to all points. - mapping is first argument of
geom_point
, somapping =
is not needed.
4.3.2 Two variables, one discrete
[eucoak
]
ggplot(tidy_eucoak) +
geom_bar(aes(site, runoff_L), stat="identity")

Figure 4.18: Two variables, one discrete
4.4 Color systems
There’s a lot to working with color, with different color schemes needed for continuous data vs discrete values, and situations like bidirectional data. We’ll look into some basics, but the reader is recommended to learn more at sources like the following:
4.4.1 Specifying colors to use for a graphical element
When a color is requested for an entire graphical element, like geom_point or geom_line, and not in the aesthetics, all feature get that color. In the following graph the same x & y values are used to display as points in blue and as lines in red:
%>%
sierraFeb ggplot(aes(TEMPERATURE,ELEVATION)) +
geom_point(color="blue") +
geom_line(color="red")

Figure 4.19: Using aesthetics settings for both points and lines
Note the use of pipe to start with the data then apply ggplot.
4.4.2 Color from variable, in aesthetics
If color is connected to a variable within the aes()
call, a color scheme is chosen to assign either a range (for continuous) or a set of unique colors (for discrete). In this graph, color is defined inside aes
, so is based on COUNTY:
ggplot(data=sierraFeb) +
geom_point(aes(TEMPERATURE, ELEVATION, color=COUNTY))

Figure 4.20: Color set within aes()
Note that counties represesent discrete data, and this is detected by ggplot to assign an appropriate color scheme. Continuous data will require a different palette.
ggplot(data=sagehen, aes(x=Q, y=EC, col=waterTmax)) + geom_point() +
scale_x_log10() + scale_y_log10()

Figure 4.21: Q and EC for Sagehen Creek, colored by temperature
River map & profile
We’ll build a riverData
dataframe with x & y location values and elevation. We’ll need to start by creating empty vectors we’ll populate with values in a loop: d
, longd
and s
are assigned an empty value double()
, then slope s
(since it only occurs between two points) needs one NA
value assigned for the last point s[length(x]
to have the same length as other vectors.
library(tidyverse)
<- c(1000, 1100, 1300, 1500, 1600, 1800, 1900)
x <- c(500, 780, 820, 950, 1250, 1320, 1500)
y <- c(0, 1, 2, 5, 25, 75, 150)
elev <- double() # creates an empty numeric vector
d <- double() # ("double" means double-precision floating point)
longd <- double()
s for(i in 1:length(x)){
if(i==1){longd[i] <- 0; d[i] <-0}
else{
<- sqrt((x[i]-x[i-1])^2 + (y[i]-y[i-1])^2)
d[i] <- longd[i-1] + d[i]
longd[i] -1] <- (elev[i]-elev[i-1])/d[i]}}
s[ilength(x)] <- NA # make the last slope value NA since we have no data past it,
s[# and so the vector lengths are all the same
<- bind_cols(x=x,y=y,elev=elev,d=d,longd=longd,s=s)
riverData riverData
## # A tibble: 7 x 6
## x y elev d longd s
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1000 500 0 0 0 0.00336
## 2 1100 780 1 297. 297. 0.00490
## 3 1300 820 2 204. 501. 0.0126
## 4 1500 950 5 239. 740. 0.0632
## 5 1600 1250 25 316. 1056. 0.236
## 6 1800 1320 75 212. 1268. 0.364
## 7 1900 1500 150 206. 1474. NA
For this continuous data, a range of values is detected and a continous color scheme is assigned. The ggplot scale_color_gradient
function is used to establish end points of a color range that the data are stretched between. The scale_color_gradient2
lets you use a mid
color. Note that there’s a comparable scale_fill_gradient
and scale_fill_gradient2
for use when specifying a fill (e.g. for a polygon) instead of a color (for polygons linked to the border).
ggplot(riverData, aes(x,y)) +
geom_line(mapping=aes(col=s), size=1.2) +
geom_point(mapping=aes(col=s, size=elev)) +
coord_fixed(ratio=1) + scale_color_gradient(low="green", high="red") +
ggtitle("Simulated river path, elevations and slopes")

Figure 4.22: Range of colors from green to red
ggplot(riverData, aes(longd,elev)) + geom_line(aes(col=s), size=1.5) + geom_point() +
scale_color_gradient(low="green", high="red") +
ggtitle("Elevation over longitudinal distance upstream")

Figure 4.23: Range of colors from green to red on longitudinal profile
ggplot(riverData, aes(longd,s)) + geom_point(aes(col=s), size=3) +
scale_color_gradient(low="green", high="red") +
ggtitle("Slope over longitudinal distance upstream")

Figure 4.24: Slope by longitudinal distance
4.4.3 Trend line
[sierra
]
%>%
sierraFeb ggplot(aes(TEMPERATURE,ELEVATION)) +
geom_point(color="blue") +
geom_smooth(color="red", method="lm")

Figure 4.25: Trend line using geom_smooth with a linear model
4.4.4 General symbology
A useful vignette accessed by vignette("ggplot2-specs")
lets you see aesthetic specifications for symbols, including:
Color & fill
Lines
- line type, size, ends
Polygon
- border color, linetype, size
- fill
Points
- shape
- size
- color & fill
- stroke
Text
- font face & size
- justification
4.4.4.1 Categorical symbology
One example of a “Big Data” resource is EPA’s Toxic Release Inventory [air_quality
] that tracks releases from a wide array of sources, from oil refineries on down. One way of dealing with big data in terms of exploring meaning is to use symbology to try to make sense of it.
library(igisci)
<- read_csv(ex("TRI/TRI_2017_CA.csv")) %>%
TRI filter(`5.1_FUGITIVE_AIR` > 100 & `5.2_STACK_AIR` > 100)
ggplot(data = TRI, aes(log(`5.2_STACK_AIR`), log(`5.1_FUGITIVE_AIR`),
color = INDUSTRY_SECTOR)) +
geom_point()

Figure 4.26: EPA Toxic Release Inventory, categorical symbology for industry sector
4.4.4.2 Log scales instead of transform
In the above graph, we used the log() function in aes to use natural logarithms instead of the actual value. That’s a simple way to do this. But what if we want to display the original data, just using a logarithmic grid? This might communicate better since readers would see the actual values.
ggplot(data=TRI, aes(`5.2_STACK_AIR`,`5.1_FUGITIVE_AIR`,color=INDUSTRY_SECTOR)) +
geom_point() + scale_x_log10() + scale_y_log10()

Figure 4.27: Using log scales instead of transforming
4.4.4.3 Graphs from grouped data
[NDVI
]
%>%
XSptsPheno ggplot() +
geom_point(aes(elevation, NDVI, shape=vegetation,
color = phenology), size = 3) +
geom_smooth(aes(elevation, NDVI,
color = phenology), method="lm")

Figure 4.28: NDVI symbolized by vegetation in two seasons
[eucoak
] (Thompson, Davis, and Oliphant 2016)
ggplot(data = tidy_eucoak) +
geom_point(mapping = aes(x = rain_mm, y = runoff_L, color = tree)) +
geom_smooth(mapping = aes(x = rain_mm, y= runoff_L, color = tree),
method = "lm") +
scale_color_manual(values = c("seagreen4", "orange3"))

Figure 4.29: Eucalyptus and Oak: rainfall and runoff
4.4.4.4 Faceted graphs
This is another option to displaying groups of data, with parallel graphs
ggplot(data = tidy_eucoak) +
geom_point(aes(x=rain_mm,y=runoff_L)) +
geom_smooth(aes(x=rain_mm,y=runoff_L), method="lm") +
facet_grid(tree ~ .)

Figure 4.30: Faceted graph alternative to color grouping
Note that the y scale is the same for each, which is normally what you want since each graph is representing the same variable. If you were displaying different variables, however, you’d want to use the scales = "free_y"
setting. But we’ll need to learn something about pivot tables in the next chapter to set up our data for multiple variables on a common x variable with differing y variables.
4.5 Titles and subtitles
ggplot(data = tidy_eucoak) +
geom_point(aes(x=rain_mm,y=runoff_L, color=tree)) +
geom_smooth(aes(x=rain_mm,y=runoff_L, color=tree), method="lm") +
scale_color_manual(values=c("seagreen4","orange3")) +
labs(title="rainfall ~ runoff",
subtitle="eucalyptus & oak sites, 2016")

Figure 4.31: Titles added
4.6 Pairs Plot
[sierra
] Pairs plots are an excellent exploratory tool to see which variables are correlated. Since only continuous data are useful for this, and since pairs plots can quickly get overly complex, it’s good to use dplyr::select to select the continuous variables, or maybe use a helper function like is.numeric with dplyr::select_if:
%>%
sierraFeb ::select(ELEVATION:TEMPERATURE) %>%
dplyrpairs()

Figure 4.32: Pairs plot for Sierra Nevada stations variables
The GGally package has a very nice looking pairs plot that takes it even further. In addition to scatterplots it has boxplots, histograms, density plots, and correlations, all stratified and colored by species. The following code is mostly borrowed from Horst, Hill, and Gorman (2020) :
library(palmerpenguins)
%>%
penguins ::select(species, body_mass_g, ends_with("_mm")) %>%
dplyr::ggpairs(aes(color = species, alpha = 0.8)) +
GGallyscale_colour_manual(values = c("darkorange","purple","cyan4")) +
scale_fill_manual(values = c("darkorange","purple","cyan4"))

Figure 4.33: Enhanced GGally pairs plot for palmerpenguin data
4.7 Exercises
Create a bar graph of the counts of the species in the
penguins
data frame (Horst, Hill, and Gorman 2020). What can you say about what it shows?Use bind_cols in dplyr to create a tibble from built-in vectors state.abb and state.region, then use ggplot with geom_bar to create a bar graph of the four regions. [
generic_methods
]Convert the built-in time series
treering
into a tibbletr
using thetibble()
function with the single variable assigned astreering = treering
(or just specifyingtreering
will also work for this simple example), then create a histogram, using that tibble and variable for thedata
andx
settings needed. Attach a screen capture of the histogram. (Also, learn about the treering data by entering?treering
in the console and read the Help displayed.)Start by clearing your environment with the broom icon in the Environment tab, then we’ll create two tibbles: Create a new tibble
st
usingbind_cols
withName=state.name
,Abb=state.abb
,Region=state.region
, andas_tibble(state.x77)
. Note that this works since all of the parts are sorted by state. Then usesummary(st)
and copy and paste its results for your answer.From
st
, create a density plot from the variableFrost
(number of days with frost for that state). Attach that plot, and answer: approximately what is the modal value?From
st
create a a boxplot ofArea
byRegion
. Which region has the highest and which has the lowest median Area? Do the same forFrost
.From st, compare murder rate (y=Murder) to Frost (x) in a scatter plot, colored by Region.
Add a trend line (smooth) with method=“lm” to your scatterplot, not colored by Region (but keep the points colored by Region). What can you say about what this graph is showing you?
Add a title to your graph.
Change your scatterplot to place labels using the Abb variable (still colored by Region) using
geom_label(aes(label=Abb, col=Region))
. Any observations about outliers?Change the boxplot of CO2 soil samples by site to use a log10 scale grid but display the original numbers (i.e., not in aes()).