5 Data Transformation

The goal of this section is to continue where we left off in the earlier chapter on data abstraction with dplyr, looking at the more transformational functions applied to our data, with tidyr adding other tools such as pivot tables.

  • dplyr tools:
    • joins: left_join, right_join, inner_join, full_join, semi_join, anti_join
    • set operations: intersect, union, setdiff
    • binding rows and columns: bind_cols, bind_rows
  • tidyr tools:
    • pivot tables: pivot_longer, pivot_wider

The term “data wrangling” has been used for what we’re doing with these tools, and the relevant cheat sheet is actually called “Data Wrangling”: https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

5.1 Data joins

Joins bring in variables from another data frame based on a common join field. There are multiple types of joins. Probably the most common is left_join, since it starts from the data frame (or sf) you want to continue working with and brings in data from an additional source. You’ll retain all records of the first data set; for any non-matches, NA is assigned.

library(tidyverse)
library(iGIScData)
library(sf)
csvPath <- system.file("extdata", "CA_MdInc.csv", package = "iGIScData")
income <- read_csv(csvPath) %>%
   dplyr::select(trID, HHinc2016) %>%
   mutate(HHinc2016 = as.numeric(HHinc2016),
          joinid = str_c("0", trID)) %>%   # prepend "0" so trID matches the FIPS codes in BayAreaTracts
   dplyr::select(joinid, HHinc2016)
census <- BayAreaTracts %>%
   left_join(income, by = c("FIPS" = "joinid")) %>%
   dplyr::select(FIPS, POP12_SQMI, POP2012, HHinc2016)
head(census %>% st_set_geometry(NULL))
##          FIPS POP12_SQMI POP2012 HHinc2016
## 1 06001400100   1118.797    2976    177417
## 2 06001400200   9130.435    2100    153125
## 3 06001400300  11440.476    4805     85313
## 4 06001400400  14573.077    3789     99539
## 5 06001400500  15582.609    3584     83650
## 6 06001400600  13516.667    1622     61597

Other joins are:

  • right_join where you end up retaining all the rows of the second data set and NA is assigned to non-matches
  • inner_join where you only retain records for matches
  • full_join where records are retained for both sides, and NAs assigned to non-matches
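
To see concretely what each join keeps, here’s a minimal sketch using two small made-up tibbles (x and y are hypothetical, not part of the book’s data):

x <- tibble(id = c("a", "b", "c"), xval = 1:3)
y <- tibble(id = c("b", "c", "d"), yval = 4:6)
left_join(x, y, by = "id")   # all rows of x; yval is NA for "a"
right_join(x, y, by = "id")  # all rows of y; xval is NA for "d"
inner_join(x, y, by = "id")  # only the matches "b" and "c"
full_join(x, y, by = "id")   # all of "a" through "d", with NAs on both sides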

Right join example: We need to join NCDC monthly climate data for all California weather stations to a selection of 82 stations that are in the Sierra.

  • The monthly data has 12 rows (1/month) for each station
  • The right_join gets all months for all stations, so we weed out the non-Sierra stations by removing NAs from a field that only has Sierra station data:

sierra <- right_join(sierraStations, CA_ClimateNormals, by="STATION") %>%
   filter(!is.na(STATION_NA)) %>% dplyr::select(-STATION_NA)
head(sierra %>% filter(DATE == "01") %>% dplyr::select(NAME, ELEVATION, `MLY-TAVG-NORMAL`), n=10)
## # A tibble: 10 x 3
##    NAME                              ELEVATION `MLY-TAVG-NORMAL`
##    <chr>                                 <dbl>             <dbl>
##  1 GROVELAND 2, CA US                    853.                5.6
##  2 CANYON DAM, CA US                    1390.                0.2
##  3 KERN RIVER PH 3, CA US                824.                7.6
##  4 DONNER MEMORIAL ST PARK, CA US       1810.               -1.9
##  5 BOWMAN DAM, CA US                    1641.                3  
##  6 BRUSH CREEK RANGER STATION, CA US    1085.               NA  
##  7 GRANT GROVE, CA US                   2012.                1.9
##  8 LEE VINING, CA US                    2072.               -1.2
##  9 OROVILLE MUNICIPAL AIRPORT, CA US      57.9               7.7
## 10 LEMON COVE, CA US                     156.                8.6

The same thing, however, could be accomplished with an inner_join, and it doesn’t require removing the NAs:

sierraAlso <- inner_join(sierraStations, CA_ClimateNormals, by="STATION") %>%
   dplyr::select(-STATION_NA)
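
The filtering joins semi_join and anti_join listed earlier don’t add any variables; they only keep or drop rows of the first table depending on whether a match exists in the second. A brief sketch, assuming the same sierraStations and CA_ClimateNormals data sharing the STATION key as above (result names are just for illustration):

sierraClimate <- semi_join(CA_ClimateNormals, sierraStations, by = "STATION")   # climate rows for Sierra stations, no columns added
otherClimate  <- anti_join(CA_ClimateNormals, sierraStations, by = "STATION")   # climate rows for all non-Sierra stations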

5.2 Set Operations

Set operations compare two data frames (or vectors) to handle observations or rows that are the same or different in each. The three set functions are:

  • dplyr::intersect(x,y) retains rows that appear in both x and y
  • dplyr::union(x,y) retains rows that appear in either or both of x and y
  • dplyr::setdiff(x,y) retains rows that appear in x but not in y


squares <- (1:10)^2
evens <- seq(0,100,2)
squares
##  [1]   1   4   9  16  25  36  49  64  81 100
evens
##  [1]   0   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38  40  42  44
## [24]  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76  78  80  82  84  86  88  90
## [47]  92  94  96  98 100
intersect(squares,evens)
## [1]   4  16  36  64 100
sort(union(squares,evens))
##  [1]   0   1   2   4   6   8   9  10  12  14  16  18  20  22  24  25  26  28  30  32  34  36  38
## [24]  40  42  44  46  48  49  50  52  54  56  58  60  62  64  66  68  70  72  74  76  78  80  81
## [47]  82  84  86  88  90  92  94  96  98 100
sort(setdiff(squares,evens))
## [1]  1  9 25 49 81
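
The same set operations work on data frames, where whole rows are compared. A minimal sketch with two hypothetical tibbles (with the tidyverse loaded, the dplyr methods are used):

d1 <- tibble(site = c("A", "B", "C"), value = c(1, 2, 3))
d2 <- tibble(site = c("B", "C", "D"), value = c(2, 3, 4))
intersect(d1, d2)   # rows appearing in both d1 and d2
union(d1, d2)       # distinct rows appearing in either
setdiff(d1, d2)     # rows in d1 that are not in d2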

5.3 Binding Rows and Columns

These dplyr functions are similar to cbind and rbind in base R, but they always create data frames. In contrast, cbind usually creates matrices and coerces all vectors to the same class. Note that in bind_cols the rows of the inputs must be in the same order; there is no key to match on.

states <- bind_cols(abb=state.abb,
                    name=state.name,
                    region=state.region,
                    state.x77)
head(states)
## # A tibble: 6 x 11
##   abb   name       region Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost   Area
##   <chr> <chr>      <fct>       <dbl>  <dbl>      <dbl>      <dbl>  <dbl>     <dbl> <dbl>  <dbl>
## 1 AL    Alabama    South        3615   3624        2.1       69.0   15.1      41.3    20  50708
## 2 AK    Alaska     West          365   6315        1.5       69.3   11.3      66.7   152 566432
## 3 AZ    Arizona    West         2212   4530        1.8       70.6    7.8      58.1    15 113417
## 4 AR    Arkansas   South        2110   3378        1.9       70.7   10.1      39.9    65  51945
## 5 CA    California West        21198   5114        1.1       71.7   10.3      62.6    20 156361
## 6 CO    Colorado   West         2541   4884        0.7       72.1    6.8      63.9   166 103766

To compare, note that cbind coerces everything to a common type: since some inputs are character, all fields become character, and the factors state.region and state.division end up as their integer codes stored as character, which would require manipulating them back into factors:

states <- as_tibble(cbind(abb=state.abb, 
                          name=state.name, 
                          region=state.region,
                          division=state.division,
                          state.x77))
head(states)
## # A tibble: 6 x 12
##   abb   name       region division Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost
##   <chr> <chr>      <chr>  <chr>    <chr>      <chr>  <chr>      <chr>      <chr>  <chr>     <chr>
## 1 AL    Alabama    2      4        3615       3624   2.1        69.05      15.1   41.3      20   
## 2 AK    Alaska     4      9        365        6315   1.5        69.31      11.3   66.7      152  
## 3 AZ    Arizona    4      8        2212       4530   1.8        70.55      7.8    58.1      15   
## 4 AR    Arkansas   2      5        2110       3378   1.9        70.66      10.1   39.9      65   
## 5 CA    California 4      9        21198      5114   1.1        71.71      10.3   62.6      20   
## 6 CO    Colorado   4      8        2541       4884   0.7        72.06      6.8    63.9      166  
## # ... with 1 more variable: Area <chr>
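
bind_rows is similarly more forgiving than rbind: columns are matched by name, and any columns missing from one input are filled with NA. A small sketch with hypothetical tibbles:

a <- tibble(site = "A", elev = 100, temp = 10.2)
b <- tibble(site = "B", temp = 11.5)    # no elev column
bind_rows(a, b)                         # elev is NA for site "B"
bind_rows(a, b, .id = "source")         # optionally records which input each row came from ("1" or "2")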

5.4 Pivoting data frames

Pivot tables are a popular tool in Excel, allowing you to transform your data to be more useful in a particular analysis. A common need to pivot arises when a data frame has two or more variables holding the same kind of data, and the variable names should really be values of a factor. The tidyr package provides pivot_longer and pivot_wider for this.

  • pivot_wider pivots rows into variables.
  • pivot_longer pivots variables into rows, creating factors.
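
As a minimal sketch before the real data (the tibble below is made up for illustration, and names_prefix is an optional shortcut for stripping the shared prefix), pivot_longer stacks the two NDVI columns into a name column and a value column, and pivot_wider reverses it:

wideNDVI <- tibble(plot = c("p1", "p2"),
                   NDVIgrowing   = c(0.62, 0.55),
                   NDVIsenescent = c(0.26, 0.21))
longNDVI <- wideNDVI %>%
  pivot_longer(cols = starts_with("NDVI"),
               names_to = "phenology", values_to = "NDVI",
               names_prefix = "NDVI")    # leaves "growing" / "senescent" as the factor values
longNDVI %>%
  pivot_wider(names_from = phenology, values_from = NDVI,
              names_prefix = "NDVI")     # back to the original wide form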

In our meadows study cross-section (J. D. Davis et al. 2020), created by intersecting normalized difference vegetation index (NDVI) values from multispectral drone imagery with surveyed elevation and vegetation types (xeric, mesic, and hydric), we have the fields NDVIgrowing from a July 2019 growing season and NDVIsenescent from a September 2020 dry season, but would like ‘growing’ and ‘senescent’ to be factor values of a single phenology variable, paired with a single NDVI variable. This is how we used pivot_longer to accomplish this, using data from the iGIScData data package:

XSptsPheno <- XSptsNDVI %>%
      filter(vegetation != "pine") %>%    # trees removed
      pivot_longer(cols = starts_with("NDVI"), 
                   names_to = "phenology", values_to = "NDVI") %>%
      mutate(phenology = str_sub(phenology, 5, str_length(phenology)))   # strip "NDVI" from the names, leaving "growing" / "senescent"

Then, to do the opposite, use pivot_wider:

XSptsPheno %>%
  pivot_wider(names_from = phenology, names_prefix = "NDVI", 
              values_from = NDVI)
## # A tibble: 19 x 6
##    DistNtoS elevation vegetation      geometry                    NDVIgrowing NDVIsenescent
##       <dbl>     <dbl> <chr>           <chr>                             <dbl>         <dbl>
##  1      0       1510. Artemisia       c(718649.456, 4397466.714)        0.326         0.140
##  2     16.7     1510. mixed graminoid c(718649.4309, 4397450.077)       0.627         0.259
##  3     28.6     1510. mixed graminoid c(718649.413, 4397438.222)        0.686         0.329
##  4     30.5     1510. mixed graminoid c(718649.4101, 4397436.33)        0.668         0.282
##  5     31.1     1510. mixed graminoid c(718649.4092, 4397435.732)       0.655         0.266
##  6     33.4     1510. mixed graminoid c(718649.4058, 4397433.441)       0.617         0.274
##  7     35.6     1510. mixed graminoid c(718649.4025, 4397431.249)       0.623         0.275
##  8     37       1510. mixed graminoid c(718649.4004, 4397429.854)       0.589         0.242
##  9     74       1510. mixed graminoid c(718649.3448, 4397392.994)       0.641         0.325
## 10    101       1510. mixed graminoid c(718649.3042, 4397366.097)       0.558         0.312
## 11    126.      1511. Artemisia       c(718649.2672, 4397341.59)        0.391         0.163
## 12    137       1510. mixed graminoid c(718649.25, 4397330.233)         0.583         0.272
## 13    149.      1510. Carex           c(718649.2317, 4397318.079)       0.736         0.392
## 14    154.      1510. Carex           c(718649.224, 4397312.999)        0.744         0.424
## 15    175       1511. mixed graminoid c(718649.1929, 4397292.377)       0.455         0.204
## 16    195.      1511. mixed graminoid c(718649.1633, 4397272.752)       0.512         0.237
## 17    197.      1510. mixed graminoid c(718649.1597, 4397270.361)       0.489         0.197
## 18    201.      1511. mixed graminoid c(718649.1544, 4397266.884)       0.533         0.275
## 19    259.      1511. mixed graminoid c(718649.0663, 4397208.496)       0.505         0.193
XSptsPheno
## # A tibble: 38 x 6
##    DistNtoS elevation vegetation      geometry                    phenology  NDVI
##       <dbl>     <dbl> <chr>           <chr>                       <chr>     <dbl>
##  1      0       1510. Artemisia       c(718649.456, 4397466.714)  growing   0.326
##  2      0       1510. Artemisia       c(718649.456, 4397466.714)  senescent 0.140
##  3     16.7     1510. mixed graminoid c(718649.4309, 4397450.077) growing   0.627
##  4     16.7     1510. mixed graminoid c(718649.4309, 4397450.077) senescent 0.259
##  5     28.6     1510. mixed graminoid c(718649.413, 4397438.222)  growing   0.686
##  6     28.6     1510. mixed graminoid c(718649.413, 4397438.222)  senescent 0.329
##  7     30.5     1510. mixed graminoid c(718649.4101, 4397436.33)  growing   0.668
##  8     30.5     1510. mixed graminoid c(718649.4101, 4397436.33)  senescent 0.282
##  9     31.1     1510. mixed graminoid c(718649.4092, 4397435.732) growing   0.655
## 10     31.1     1510. mixed graminoid c(718649.4092, 4397435.732) senescent 0.266
## # ... with 28 more rows
XSptsPheno %>%
  ggplot() +
  geom_point(aes(elevation, NDVI, shape=vegetation, 
                 color = phenology), size = 5) +
  geom_smooth(aes(elevation, NDVI, 
                 color = phenology), method="lm")

Pivots turn out to be commonly useful. Runoff graphing from the Eucalyptus/Oak study (Thompson, Davis, and Oliphant 2016) also benefited from a pivot_longer:

eucoakrainfallrunoffTDR %>%
  pivot_longer(cols = starts_with("runoffL"),
               names_to = "tree", values_to = "runoffL") %>%
  mutate(tree = str_sub(tree, str_length(tree)-2, str_length(tree))) %>%   # keep the last 3 characters: "euc" or "oak"
  ggplot() + geom_boxplot(aes(site, runoffL)) +
    facet_grid(tree ~ .)

Combining a pivot with bind_rows, we can create a runoff/rainfall scatterplot colored by tree:

runoffPivot <- eucoakrainfallrunoffTDR %>%
  pivot_longer(cols = starts_with("runoffL"),
               names_to = "tree", values_to = "runoffL") %>%
  mutate(tree = str_sub(tree, str_length(tree)-2, str_length(tree)),
         Date = as.Date(date, "%m/%d/%Y"))
euc <- runoffPivot %>%
  filter(tree == "euc") %>%
  mutate(rain_subcanopy = rain_euc,
         slope = slope_euc,    aspect = aspect_euc,
         surface_tension = surface_tension_euc,
         runoff_rainfall_ratio = runoff_rainfall_ratio_euc) %>%
  dplyr::select(site, `site #`, tree, Date, month, rain_mm, 
         rain_subcanopy, slope, aspect, runoffL,     
         surface_tension, runoff_rainfall_ratio)
oak <- runoffPivot %>%
  filter(tree == "oak") %>%
  mutate(rain_subcanopy = rain_oak,
         slope = slope_oak, aspect = aspect_oak,
         surface_tension = surface_tension_oak,
         runoff_rainfall_ratio = runoff_rainfall_ratio_oak) %>%
  dplyr::select(site, `site #`, tree, Date, month, rain_mm, 
         rain_subcanopy, slope, aspect, runoffL, 
         surface_tension, runoff_rainfall_ratio)
bind_rows(euc, oak) %>%
  ggplot() +
  geom_point(mapping = aes(x = rain_mm, y = runoffL, color = tree)) +
  geom_smooth(mapping = aes(x = rain_mm, y= runoffL, color = tree), 
              method = "lm") +
  scale_color_manual(values = c("seagreen4", "orange3"))

5.4.1 pivot_wider

The opposite of pivot_longer, pivot_wider is less commonly used for tidying data, but it can be useful for creating tables in a desired format. An environmental application of pivot_wider can be found in vignette("pivot") (modified below) for studying fish detected by automatic monitors, with fish_encounters data contributed by Johnston and Rudis (n.d.). This pivot makes it easier to see fish encounters by station. See Johnston and Rudis (n.d.) and vignette("pivot") for more information on the dataset and how the pivot provides this view.

library(tidyverse)
if (!file.exists("fishdata.csv")) {
  download.file(
    url = 'https://github.com/Myfanwy/ReproducibleExamples/raw/master/encounterhistories/fishdata.csv',
    destfile = "fishdata.csv"
  )
}
fish_encounters <- read_csv("fishdata.csv")
fish_encounters
## # A tibble: 209 x 3
##    TagID Station value
##    <dbl> <chr>   <dbl>
##  1  4842 Release     1
##  2  4843 Release     1
##  3  4844 Release     1
##  4  4845 Release     1
##  5  4847 Release     1
##  6  4848 Release     1
##  7  4849 Release     1
##  8  4850 Release     1
##  9  4851 Release     1
## 10  4854 Release     1
## # ... with 199 more rows
fishEncountersWide <- fish_encounters %>%
  pivot_wider(names_from = Station, values_from = value, values_fill = 0)
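
One thing to be aware of: if a given combination of identifier and names_from value occurs more than once, pivot_wider warns and stores the duplicates in list-columns; the values_fn argument lets you summarize them instead. A small sketch with a made-up tibble (not the fish data):

dups <- tibble(station = c("s1", "s1", "s2"),
               var     = c("temp", "temp", "temp"),
               value   = c(6.8, 9.0, 4.3))
dups %>% pivot_wider(names_from = var, values_from = value)                     # warns; s1 becomes a list-column
dups %>% pivot_wider(names_from = var, values_from = value, values_fn = mean)   # s1 gets the mean, 7.9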

5.5 Data wrangling exercise

by Josh von Nonn (2021)

The impetus behind this exercise came from the movie Dark Waters (https://www.youtube.com/watch?v=RvAOuhyunhY), inspired by a true story of negligent chemical waste disposal by Dupont.

First create a new RStudio project named GAMA, save this .Rmd file there, and create a folder GAMA_water_data in the project folder; the path to the data can then be specified as "GAMA_water_data/gama_pfas_statewide_v2.txt", assuming that the latter name matches what is unzipped from your download.

Then download the data from the California Water Boards GAMA groundwater website: https://gamagroundwater.waterboards.ca.gov/gama/datadownload

Then select “Statewide Downloads”, then “Statewide PFOS Data”, and extract the zip file into the GAMA_water_data folder located in your project folder. Remember this path. This is a large txt file, so if you open it with Notepad it may take some time to load. Notice that this is a space-delimited file.

Required packages:

library(tidyverse)
library(lubridate)

  1. Read in the downloaded file "gama_pfas_statewide_v2.txt", call it cal_pfas, and have a look at the data set. You can select it from the Environment pane or use view(cal_pfas).

  2. Before we clean up this data, let’s preserve the locations of all the wells. Create a new data frame, cal_pfas_loc: select GM_WELL_ID, GM_LATITUDE, and GM_LONGITUDE, and remove all the duplicate wells (hint: the dplyr cheat sheet provides a function to do this).

  3. Now to trim down the data. Create a new data frame, cal_pfas_trim: add a new column, DATE, using the associated lubridate function on GM_SAMP_COLLECTION_DATE (this will allow ggplot to recognize it as a date and not a character string); select GM_WELL_ID, GM_CHEMICAL_VVL, GM_RESULT, and the newly created DATE. Sort (arrange) the data by GM_WELL_ID.

  4. Use pivot_wider to create new columns from the chemical names, with values from the result column, and store the new data frame as cal_pfas_wide. Notice the warnings: some of the wells have multiple samples on the same day, so they will be put into a vector (e.g., c(6.8, 9, 4.3, ...)). Rerun the pivot but include the argument values_fn = mean; this will return only the average of all the samples. Once the pivot is working correctly, keep the columns GM_WELL_ID, DATE, PFOS, PFOA and pipe a mutate to create a new column, SUMPFS, that is the sum of PFOS and PFOA.

The US EPA has established a lifetime Health Advisory Level (HAL) for PFOA and PFOS of 70 ng/L. When both PFOA and PFOS are found in drinking water, the combined concentrations of PFOA and PFOS should be compared with the 70 ng/L HAL.

from the GROUNDWATER INFORMATION SHEET for PFOA (website: https://www.waterboards.ca.gov/water_issues/programs/gama/factsheets.html)

  5. For the sake of creating an interesting time series plot, let’s filter the data for wells that have a SUMPFS greater than 70 and that have more than 10 sampling dates. Start by creating a new data frame, cal_pfas_index, from the pivoted data frame. Hint: one way to do this is to use group_by, filter, count, and filter again. The resulting data frame should have 11 observations and 2 variables.

  6. Create a new data frame, cal_pfas_max, to locate the well with the most sampling dates (n). The data wrangling function from base R, subset, can do this using the max function as an argument.

  7. Now let’s pull all the data on that well by joining the max indexed well with cal_pfas_wide, and call the new data frame cal_pfas_join. Remove the “n” column using the select function.

  8. Create a new data frame, cal_pfas_long, and pivot_longer the cal_pfas_join data, creating the new columns "chemical" and "ngl".

  9. Plot the well using the wide data from cal_pfas_join with ggplot, using DATE on the x axis and plotting the three variables (PFOS, PFOA, SUMPFS) with different colors of your choice. Add a horizontal reference line (geom_hline(yintercept = 70)) for the HAL limit at 70.

  10. Plot the well using the long data from cal_pfas_long, using DATE on the x axis and ngl on the y axis. Add the horizontal reference line at 70.

References

Davis, J. D., L. Blesius, M. Slocombe, S. Maher, M. Vasey, P. Christian, and P. Lynch. 2020. “Unpiloted Aerial System (UAS)-Supported Biogeomorphic Analysis of Restored Sierra Nevada Montane Meadows.” Remote Sensing 12. https://www.mdpi.com/2072-4292/12/11/1828.
Johnston, Myfanwy, and Bob Rudis. n.d. Visualizing Fish Encounter Histories. https://fishsciences.github.io/post/visualizing-fish-encounter-histories/.
Thompson, A., J. D. Davis, and A. J. Oliphant. 2016. “Surface Runoff and Soil Erosion Under Eucalyptus and Oak Canopy.” Earth Surface Processes and Landforms. https://doi.org/10.1002/esp.3881.