Chapter 3 Processing Tabular Data

The preceding chapter provided a foundation for visualizing data using functions from the ggplot2 package. The sample dataset provided with the examples was already in a format suitable for plotting with ggplot(). However, in many cases, data will need to be reformatted before it can be visualized and analyzed. Common formatting tasks include selecting a subset of rows and columns from the data table, calculating new variables from the raw data values, and computing summary statistics.

This tutorial will demonstrate basic commands for manipulating data frames (and tibbles). These manipulations can be accomplished in a variety of ways, including using base R operators and functions like those covered in Chapter 1. However, the functions in the dplyr and tidyr packages provide a more efficient and intuitive approach. These packages (along with the the ggplot2, readr, and readxl packages used in Chapter 2) are members of the “tidyverse”, collection of data science R packages. These packages share an underlying design philosophy, grammar of data manipulation, and set of data structures, The rest of this book will make extensive use of these packages. They will allow us to to carry out sophisticated processing, analysis, and visualization of complex “real-world” datasets by writing concise and easily interpretable R code.

The dplyr package has some excellent vignettes, including an Introduction to dplyr which can be accessed by running vignette("dplyr")

To run the examples in this chapter, it is necessary to load all of the previously-mentioned tidyverse packages along with lubridate, another package from the same family. Remember that if these packages are not available on your computer, you will need to install them using the install.packages() function or the installation tools available in RStudio.

library(ggplot2)
library(dplyr)
library(tidyr)
library(readr)
library(readxl)
library(lubridate)

3.1 Single Table Verbs

Each dplyr function accomplishes a particular type of data transformation These functions are referred to as “verbs” that describe the action performed. Some dplyr verbs operate on a single data frame or tibble, while others operate by coming two or more. Below is a list of some important dplyr verbs for single tables:

  • filter() selects observations (rows) based on their values.
  • arrange() reorders observations.
  • select() and rename() select variables (columns) based on their names.
  • mutate() and transmute() add new variables that are functions of existing variables.

In all the following examples, the first argument to the function is always the data frame that will be modified. Additional, comma-separated arguments control how the function will be implemented.

3.1.1 Filter

These examples uses the filter function to select records by station, year, and temperature and humidity values. The %in% operator returns TRUE if the input matches one of the values in the subsequent. The : operator is used here to create a vector containing the years 2014 and 2015 as demonstrated in Chapter 1. When multiple logical statements are separated by commas, they are combined using a logical “and” operator.

mesosm <- read_csv("mesodata_small.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   MONTH = col_double(),
##   YEAR = col_double(),
##   STID = col_character(),
##   TMAX = col_double(),
##   TMIN = col_double(),
##   HMAX = col_double(),
##   HMIN = col_double(),
##   RAIN = col_double(),
##   DATE = col_date(format = "")
## )
# Select rates from the Butler station
filter(mesosm, STID == "BUTL")
## # A tibble: 60 x 9
##    MONTH  YEAR STID   TMAX  TMIN  HMAX  HMIN  RAIN DATE      
##    <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <date>    
##  1     1  2014 BUTL   52.0  21.6  76.3  27.2  0.01 2014-01-01
##  2     2  2014 BUTL   47.5  22.3  88.8  41.2  0.26 2014-02-01
##  3     3  2014 BUTL   61.1  31.3  79.5  27.1  0.59 2014-03-01
##  4     4  2014 BUTL   74.9  45.3  80.2  25.1  1.33 2014-04-01
##  5     5  2014 BUTL   84.8  56.0  79.1  28.4  2.76 2014-05-01
##  6     6  2014 BUTL   89.0  67.1  89.7  42.8  2.97 2014-06-01
##  7     7  2014 BUTL   91.4  67.8  87.2  37.7  3.85 2014-07-01
##  8     8  2014 BUTL   96.5  68.0  82.6  27.9  0.22 2014-08-01
##  9     9  2014 BUTL   85.9  61.2  89.3  39.1  2.39 2014-09-01
## 10    10  2014 BUTL   78.8  48.5  92.3  32.3  2.62 2014-10-01
## # ... with 50 more rows
# Select rows from the years 2014 or 2015
filter(mesosm, YEAR %in% 2014:2015)
## # A tibble: 96 x 9
##    MONTH  YEAR STID   TMAX  TMIN  HMAX  HMIN  RAIN DATE      
##    <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <date>    
##  1     1  2014 BUTL   52.0  21.6  76.3  27.2  0.01 2014-01-01
##  2     2  2014 BUTL   47.5  22.3  88.8  41.2  0.26 2014-02-01
##  3     3  2014 BUTL   61.1  31.3  79.5  27.1  0.59 2014-03-01
##  4     4  2014 BUTL   74.9  45.3  80.2  25.1  1.33 2014-04-01
##  5     5  2014 BUTL   84.8  56.0  79.1  28.4  2.76 2014-05-01
##  6     6  2014 BUTL   89.0  67.1  89.7  42.8  2.97 2014-06-01
##  7     7  2014 BUTL   91.4  67.8  87.2  37.7  3.85 2014-07-01
##  8     8  2014 BUTL   96.5  68.0  82.6  27.9  0.22 2014-08-01
##  9     9  2014 BUTL   85.9  61.2  89.3  39.1  2.39 2014-09-01
## 10    10  2014 BUTL   78.8  48.5  92.3  32.3  2.62 2014-10-01
## # ... with 86 more rows
# Select rows where max temperature > 90 and max humidity > 90
filter(mesosm, TMAX > 90, HMAX > 90)
## # A tibble: 14 x 9
##    MONTH  YEAR STID   TMAX  TMIN  HMAX  HMIN  RAIN DATE      
##    <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <date>    
##  1     7  2015 BUTL   92.0  69.5  91.3  44.0  5.16 2015-07-01
##  2     8  2015 BUTL   91.4  66.1  91.5  38.0  2.13 2015-08-01
##  3     6  2016 BUTL   90.3  66.4  92.4  40.5  4.06 2016-06-01
##  4     8  2016 BUTL   92.3  67.1  92.9  38.1  5.81 2016-08-01
##  5     7  2015 MTHE   91.9  71.6  94.7  48.5  1.45 2015-07-01
##  6     7  2016 MTHE   92.7  72.0  94.8  47.5  5.05 2016-07-01
##  7     6  2018 MTHE   90.6  69.2  94.0  48.9  2.51 2018-06-01
##  8     7  2018 MTHE   93.8  71.2  92.4  43.9  3.46 2018-07-01
##  9     8  2014 SKIA   90.7  69.7  90.4  43.5  1.2  2014-08-01
## 10     7  2016 SKIA   91.0  72.8  92.0  51.2  5.32 2016-07-01
## 11     8  2016 SKIA   90.7  71.0  90.8  45.6  2.19 2016-08-01
## 12     7  2017 SKIA   90.4  71.5  91.1  49.6  3.18 2017-07-01
## 13     7  2018 SKIA   92.4  71.5  92.5  44.9  2.23 2018-07-01
## 14     7  2018 SPEN   91.3  71.2  92.4  49.5  3.01 2018-07-01

3.1.2 Arrange

The arrange() function returns a data frame that is sorted on the comma-separated column names in order from left to right.

arrange(mesosm, STID, YEAR, MONTH)
## # A tibble: 240 x 9
##    MONTH  YEAR STID   TMAX  TMIN  HMAX  HMIN  RAIN DATE      
##    <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <date>    
##  1     1  2014 BUTL   52.0  21.6  76.3  27.2  0.01 2014-01-01
##  2     2  2014 BUTL   47.5  22.3  88.8  41.2  0.26 2014-02-01
##  3     3  2014 BUTL   61.1  31.3  79.5  27.1  0.59 2014-03-01
##  4     4  2014 BUTL   74.9  45.3  80.2  25.1  1.33 2014-04-01
##  5     5  2014 BUTL   84.8  56.0  79.1  28.4  2.76 2014-05-01
##  6     6  2014 BUTL   89.0  67.1  89.7  42.8  2.97 2014-06-01
##  7     7  2014 BUTL   91.4  67.8  87.2  37.7  3.85 2014-07-01
##  8     8  2014 BUTL   96.5  68.0  82.6  27.9  0.22 2014-08-01
##  9     9  2014 BUTL   85.9  61.2  89.3  39.1  2.39 2014-09-01
## 10    10  2014 BUTL   78.8  48.5  92.3  32.3  2.62 2014-10-01
## # ... with 230 more rows

The desc() function can be used to order a column in descending rather than ascending order.

arrange(mesosm, STID, desc(YEAR), desc(MONTH))
## # A tibble: 240 x 9
##    MONTH  YEAR STID   TMAX  TMIN  HMAX  HMIN  RAIN DATE      
##    <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <date>    
##  1    12  2018 BUTL   51.5  25.9  88.1  39.8  2.58 2018-12-01
##  2    11  2018 BUTL   58.1  28.7  91.3  37.3  0.39 2018-11-01
##  3    10  2018 BUTL   71.4  47.4  95.7  47.4  7.68 2018-10-01
##  4     9  2018 BUTL   84.5  62.7  94.7  50.7  2.79 2018-09-01
##  5     8  2018 BUTL   93.4  68.5  86.5  36.8  2.75 2018-08-01
##  6     7  2018 BUTL   96.9  69.7  89.8  33.0  1.69 2018-07-01
##  7     6  2018 BUTL   93.3  68.6  85.8  37.4  4.6  2018-06-01
##  8     5  2018 BUTL   90.5  62.3  87.8  33.7  3.08 2018-05-01
##  9     4  2018 BUTL   71.0  38.5  81.3  26.8  1.41 2018-04-01
## 10     3  2018 BUTL   68.5  36.5  82.7  28.1  0.41 2018-03-01
## # ... with 230 more rows

3.1.3 Select and Rename

The select() function returns only the specified columns. Other columns are removed from the data frame. One way to specify the selected columsn is a comma-separated list of column names.

select(mesosm, STID, YEAR, MONTH, TMAX, TMIN)
## # A tibble: 240 x 5
##    STID   YEAR MONTH  TMAX  TMIN
##    <chr> <dbl> <dbl> <dbl> <dbl>
##  1 BUTL   2014     1  52.0  21.6
##  2 BUTL   2014     2  47.5  22.3
##  3 BUTL   2014     3  61.1  31.3
##  4 BUTL   2014     4  74.9  45.3
##  5 BUTL   2014     5  84.8  56.0
##  6 BUTL   2014     6  89.0  67.1
##  7 BUTL   2014     7  91.4  67.8
##  8 BUTL   2014     8  96.5  68.0
##  9 BUTL   2014     9  85.9  61.2
## 10 BUTL   2014    10  78.8  48.5
## # ... with 230 more rows

The : operator can be used to select a continuous series of columns. This usage is analogous to the : operator in base R, where it is used to specify a continuous series of integers.

select(mesosm, 
       MONTH:TMAX) 
## # A tibble: 240 x 4
##    MONTH  YEAR STID   TMAX
##    <dbl> <dbl> <chr> <dbl>
##  1     1  2014 BUTL   52.0
##  2     2  2014 BUTL   47.5
##  3     3  2014 BUTL   61.1
##  4     4  2014 BUTL   74.9
##  5     5  2014 BUTL   84.8
##  6     6  2014 BUTL   89.0
##  7     7  2014 BUTL   91.4
##  8     8  2014 BUTL   96.5
##  9     9  2014 BUTL   85.9
## 10    10  2014 BUTL   78.8
## # ... with 230 more rows

The helper functions starts_with(), ends_with(), and contains() can be used to find multiple columns by matching part of the column name.

select(mesosm, 
       starts_with("T"))
## # A tibble: 240 x 2
##     TMAX  TMIN
##    <dbl> <dbl>
##  1  52.0  21.6
##  2  47.5  22.3
##  3  61.1  31.3
##  4  74.9  45.3
##  5  84.8  56.0
##  6  89.0  67.1
##  7  91.4  67.8
##  8  96.5  68.0
##  9  85.9  61.2
## 10  78.8  48.5
## # ... with 230 more rows

Columns can be removed by prefixing their names with a -. Other columns will be kept.

select(mesosm, 
       -HMIN, 
       -HMAX)
## # A tibble: 240 x 7
##    MONTH  YEAR STID   TMAX  TMIN  RAIN DATE      
##    <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <date>    
##  1     1  2014 BUTL   52.0  21.6  0.01 2014-01-01
##  2     2  2014 BUTL   47.5  22.3  0.26 2014-02-01
##  3     3  2014 BUTL   61.1  31.3  0.59 2014-03-01
##  4     4  2014 BUTL   74.9  45.3  1.33 2014-04-01
##  5     5  2014 BUTL   84.8  56.0  2.76 2014-05-01
##  6     6  2014 BUTL   89.0  67.1  2.97 2014-06-01
##  7     7  2014 BUTL   91.4  67.8  3.85 2014-07-01
##  8     8  2014 BUTL   96.5  68.0  0.22 2014-08-01
##  9     9  2014 BUTL   85.9  61.2  2.39 2014-09-01
## 10    10  2014 BUTL   78.8  48.5  2.62 2014-10-01
## # ... with 230 more rows

The rename() function is used to change column names. Name changes are specified using the = operator, placing the new name first and the old name second.

rename(mesosm, maxtemp = TMAX, 
       mintemp = TMIN)
## # A tibble: 240 x 9
##    MONTH  YEAR STID  maxtemp mintemp  HMAX  HMIN  RAIN DATE      
##    <dbl> <dbl> <chr>   <dbl>   <dbl> <dbl> <dbl> <dbl> <date>    
##  1     1  2014 BUTL     52.0    21.6  76.3  27.2  0.01 2014-01-01
##  2     2  2014 BUTL     47.5    22.3  88.8  41.2  0.26 2014-02-01
##  3     3  2014 BUTL     61.1    31.3  79.5  27.1  0.59 2014-03-01
##  4     4  2014 BUTL     74.9    45.3  80.2  25.1  1.33 2014-04-01
##  5     5  2014 BUTL     84.8    56.0  79.1  28.4  2.76 2014-05-01
##  6     6  2014 BUTL     89.0    67.1  89.7  42.8  2.97 2014-06-01
##  7     7  2014 BUTL     91.4    67.8  87.2  37.7  3.85 2014-07-01
##  8     8  2014 BUTL     96.5    68.0  82.6  27.9  0.22 2014-08-01
##  9     9  2014 BUTL     85.9    61.2  89.3  39.1  2.39 2014-09-01
## 10    10  2014 BUTL     78.8    48.5  92.3  32.3  2.62 2014-10-01
## # ... with 230 more rows

3.1.4 Mutate and Transmute

The mutate() function adds new variables to the data frame. Multiple new variables can be generated with a single function call, using a comma to separate each new variable. The name of each new variable is specified on the left of the = and a function that can contain names of other variables in the table is specified on the right. The following example converts the minimum and maximum temperature variables from Fahrenheit to Celsius.

mutate(mesosm, 
       TMINC = (TMIN - 32) * .5556,
       TMAXC = (TMIN - 32) * .5556)
## # A tibble: 240 x 11
##    MONTH  YEAR STID   TMAX  TMIN  HMAX  HMIN  RAIN DATE        TMINC  TMAXC
##    <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <date>      <dbl>  <dbl>
##  1     1  2014 BUTL   52.0  21.6  76.3  27.2  0.01 2014-01-01 -5.79  -5.79 
##  2     2  2014 BUTL   47.5  22.3  88.8  41.2  0.26 2014-02-01 -5.41  -5.41 
##  3     3  2014 BUTL   61.1  31.3  79.5  27.1  0.59 2014-03-01 -0.409 -0.409
##  4     4  2014 BUTL   74.9  45.3  80.2  25.1  1.33 2014-04-01  7.41   7.41 
##  5     5  2014 BUTL   84.8  56.0  79.1  28.4  2.76 2014-05-01 13.4   13.4  
##  6     6  2014 BUTL   89.0  67.1  89.7  42.8  2.97 2014-06-01 19.5   19.5  
##  7     7  2014 BUTL   91.4  67.8  87.2  37.7  3.85 2014-07-01 19.9   19.9  
##  8     8  2014 BUTL   96.5  68.0  82.6  27.9  0.22 2014-08-01 20.0   20.0  
##  9     9  2014 BUTL   85.9  61.2  89.3  39.1  2.39 2014-09-01 16.2   16.2  
## 10    10  2014 BUTL   78.8  48.5  92.3  32.3  2.62 2014-10-01  9.17   9.17 
## # ... with 230 more rows

The transmute() function only keeps the newly-created variables.

transmute(mesosm, 
       TMINC = (TMIN - 32) * .5556,
       TMAXC = (TMIN - 32) * .5556)
## # A tibble: 240 x 2
##     TMINC  TMAXC
##     <dbl>  <dbl>
##  1 -5.79  -5.79 
##  2 -5.41  -5.41 
##  3 -0.409 -0.409
##  4  7.41   7.41 
##  5 13.4   13.4  
##  6 19.5   19.5  
##  7 19.9   19.9  
##  8 20.0   20.0  
##  9 16.2   16.2  
## 10  9.17   9.17 
## # ... with 230 more rows

3.1.5 Application

The following examples will use another meteorological dataset from the Oklahoma Mesonet. The mesodata_large.csv file contains daily data records from every Mesonet station in Oklahoma from 1994-present. This this is a very large data table, more than with more than one million rows and over 150 Mb of data. There are also numerous missing data codes (values < -990 or > 990) that will need to dealt with.

mesobig <- read_csv("mesodata_large.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   STID = col_character()
## )
## i Use `spec()` for the full column specifications.
dim(mesobig)
## [1] 1296602      22
summary(mesobig)
##       YEAR          MONTH             DAY            STID          
##  Min.   :1994   Min.   : 1.000   Min.   : 1.00   Length:1296602    
##  1st Qu.:2000   1st Qu.: 4.000   1st Qu.: 8.00   Class :character  
##  Median :2006   Median : 7.000   Median :16.00   Mode  :character  
##  Mean   :2006   Mean   : 6.523   Mean   :15.73                     
##  3rd Qu.:2012   3rd Qu.:10.000   3rd Qu.:23.00                     
##  Max.   :2018   Max.   :12.000   Max.   :31.00                     
##       TMAX              TMIN              TAVG              DMAX        
##  Min.   :-999.00   Min.   :-999.00   Min.   :-999.00   Min.   :-999.00  
##  1st Qu.:-996.00   1st Qu.:-996.00   1st Qu.:-996.00   1st Qu.:-996.00  
##  Median :  62.73   Median :  37.22   Median :  49.74   Median :  42.90  
##  Mean   :-234.99   Mean   :-251.78   Mean   :-243.61   Mean   :-255.03  
##  3rd Qu.:  82.51   3rd Qu.:  59.88   3rd Qu.:  70.53   3rd Qu.:  64.19  
##  Max.   : 115.07   Max.   :  89.01   Max.   : 100.06   Max.   :  84.19  
##       DMIN              DAVG              HMAX              HMIN        
##  Min.   :-999.00   Min.   :-999.00   Min.   :-999.00   Min.   :-999.00  
##  1st Qu.:-996.00   1st Qu.:-996.00   1st Qu.:  70.39   1st Qu.:  18.68  
##  Median :  28.42   Median :  35.76   Median :  89.80   Median :  34.54  
##  Mean   :-263.81   Mean   :-259.27   Mean   :-120.33   Mean   :-157.88  
##  3rd Qu.:  52.19   3rd Qu.:  58.58   3rd Qu.:  96.20   3rd Qu.:  50.49  
##  Max.   :  76.54   Max.   :  78.52   Max.   : 100.00   Max.   : 100.00  
##       HAVG              VDEF              9AVG              HDEG        
##  Min.   :-999.00   Min.   :-999.00   Min.   :-999.00   Min.   :-999.00  
##  1st Qu.:  43.87   1st Qu.:-996.00   1st Qu.:-996.00   1st Qu.:-996.00  
##  Median :  62.93   Median :   4.01   Median :  45.58   Median :   0.00  
##  Mean   :-138.20   Mean   :-287.10   Mean   :-315.38   Mean   :-279.82  
##  3rd Qu.:  75.42   3rd Qu.:   9.00   3rd Qu.:  68.82   3rd Qu.:  11.84  
##  Max.   : 100.00   Max.   :  54.38   Max.   :  99.93   Max.   :  72.57  
##       CDEG              HTMX              WCMN              RAIN       
##  Min.   :-999.00   Min.   :-999.00   Min.   :-999.00   Min.   :-999.0  
##  1st Qu.:-996.00   1st Qu.:-996.00   1st Qu.:-996.00   1st Qu.:   0.0  
##  Median :   0.00   Median :-996.00   Median :-996.00   Median :   0.0  
##  Mean   :-282.82   Mean   :-686.45   Mean   :-643.44   Mean   :-194.0  
##  3rd Qu.:   5.89   3rd Qu.:  82.77   3rd Qu.:  20.51   3rd Qu.:   0.0  
##  Max.   :  35.00   Max.   : 125.38   Max.   :  49.67   Max.   :  14.2  
##       RNUM             RMAX        
##  Min.   :-999.0   Min.   :-999.00  
##  1st Qu.:   0.0   1st Qu.:   0.00  
##  Median :   0.0   Median :   0.00  
##  Mean   :-158.2   Mean   :-194.57  
##  3rd Qu.:   0.0   3rd Qu.:   0.00  
##  Max.   : 996.0   Max.   :  10.92

The dplyr single-table functions can be used to created a smaller version of the big Mesonet dataset. The first step is to filter out records (rows) that belong to one of four sites and are from the years 2014 to present.

mesosm <- filter(mesobig, STID %in% c("SPEN", "SKIA", "MTHE", "BUTL") & YEAR >= 2014)
dim(mesobig)
## [1] 1296602      22
dim(mesosm)                 
## [1] 7304   22

Next, the select() function is used to choose a subset of the columns.

mesosm <- select(mesosm, STID, YEAR, MONTH, DAY, TMIN, TMAX, HMIN, HMAX, RAIN)
dim(mesosm)                 
## [1] 7304    9

This example takes the mesosm data frame as an input and also creates an output objected called mesosm. We are effectively overwriting the object with a new version. This was done intentionally to avoid the confusion of having multiple data objects generated by a string of functions (e.g., mesosm1, mesosm2, etc.). There is really no need to save the output of each step because all of the code can easily be rerun if a change is required or an error needs to be fixed. The end of this chapter will present a technique called “piping” that makes it even easier to combine multiple dplyr functions into concise blocks of code.

Next, the temperature, humidity, and rainfall records are transformed by replacing the missing data codes with ‘NA’ values using the mutate() function in combination with the replace() function. Replace takes three comma-separated arguments. The first is the variable to replace, the second is a logical statement indicating which values will replaced, and the third if the replacement value. Note that in this example, the columns are overwritten with their updated values rather than creating new column.

mesosm <- mutate(mesosm, 
                 TMIN = replace(TMIN, TMIN <= -990 | TMIN >= 990, NA),
                 TMAX = replace(TMAX, TMAX <= -990 | TMAX >= 990, NA),
                 HMIN = replace(HMIN, HMIN <= -990 | HMIN >= 990, NA),
                 HMAX = replace(HMAX, HMAX <= -990 | HMAX >= 990, NA),
                 RAIN = replace(RAIN, RAIN <= -990 | RAIN >= 990, NA))

In this example, the columns are overwritten with their updated values rather than creating new columns.

3.2 Summarizing

At this point, the size of the data frame has been reduced and the missing data values have been removed. However, there is still one record in the data frame for each day. The next step is to convert these daily values to monthly summaries. This can be accomplished using the summarize() function. The following example uses summarize() to calculate the mean maximum temperature for all records in the data frame. As discussed in Chapter 1, the na.rm = TRUE argument is specified so that the mean() function will ignore the NA values

summarize(mesosm, meantmax = mean(TMAX, na.rm = TRUE))
## # A tibble: 1 x 1
##   meantmax
##      <dbl>
## 1     72.1

The summarize() function becomes particularly useful when we pair it with another dplyr function, group_by(), which groups rows of data together to produce a “grouped data frame”. When a grouped data frame is summarized, the summaries are generated for individual groups rather than the entire dataset, and the result is a data frame with one row per group.

mesogrp <- group_by(mesosm, STID, YEAR, MONTH)
class(mesogrp)
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
mesomnth <- summarise(mesogrp, TMAX = mean(TMAX, na.rm=T),
                      TMIN = mean(TMIN, na.rm=T),
                      HMAX = mean(HMAX, na.rm=T),
                      HMIN = mean(HMIN, na.rm=T),
                      RAIN = sum(RAIN, na.rm=T))
## `summarise()` regrouping output by 'STID', 'YEAR' (override with `.groups` argument)

Finally, the mutate() function is used to add a date column. Since each record represents an entire month, the date is set to correspond to the first day of each month. The ymd() function from the lubridate package is used to create a date object. The input to this function is a date string with year followed by month followed by day separated by dashes. For example, March 31, 2018 would be “2018-3-31”. The paste() function joins multiple strings together using a separate specified with the sep argument.

mesomnth <- mutate(mesomnth,
                   DATE = ymd(paste(YEAR, MONTH, "1", sep = "-")))

Take a look at the resulting data frame with the View() function. It should be the same as the “mesodata_small” dataset that was used in Chapter 2 and at the beginning of this Chapter. These data can be used to make the same plots as in Chapter 2.

ggplot(data = mesomnth) +
  geom_line(mapping = aes(x = DATE, y = RAIN)) +
  facet_wrap(~ STID, ncol = 1)

3.2.1 Counts

When summarizing data, it is often useful to count the number of values used. This can be accomplished with the dplyr function n(), which will include NA values in the count, or sum(!is.na(x)) which will exclude them. It’s a good idea to count when summarizing to make sure your summaries are not based on a very small sample size. Here, for example, a new variable called pv_hmax to is calculated to determine, for each month and site, what percentage of days have missing data. The mutate() function is again used to add a data columns to the mesomissing data frame and then overwrite the old version of the data frame.

mesomissing <- summarize(mesogrp, 
                         n_rows = n(),
                         obs_hmax = sum(!is.na(HMAX)),
                         pv_hmax = 100 * obs_hmax/n_rows)
## `summarise()` regrouping output by 'STID', 'YEAR' (override with `.groups` argument)
mesomissing <- mutate(mesomissing,
                      DATE = ymd(paste(YEAR, MONTH, "1", sep = "-")))
mesomissing
## # A tibble: 240 x 7
## # Groups:   STID, YEAR [20]
##    STID   YEAR MONTH n_rows obs_hmax pv_hmax DATE      
##    <chr> <dbl> <dbl>  <int>    <int>   <dbl> <date>    
##  1 BUTL   2014     1     31       31     100 2014-01-01
##  2 BUTL   2014     2     28       28     100 2014-02-01
##  3 BUTL   2014     3     31       31     100 2014-03-01
##  4 BUTL   2014     4     30       30     100 2014-04-01
##  5 BUTL   2014     5     31       31     100 2014-05-01
##  6 BUTL   2014     6     30       30     100 2014-06-01
##  7 BUTL   2014     7     31       31     100 2014-07-01
##  8 BUTL   2014     8     31       31     100 2014-08-01
##  9 BUTL   2014     9     30       30     100 2014-09-01
## 10 BUTL   2014    10     31       31     100 2014-10-01
## # ... with 230 more rows

The resulting data table contains information about when and where there lot of missing data. An efficient way to explore these data is using a simple plot. The following example generates a line plot similar to the examples in Chapter 2 that displays the monthly time series of missing data using a separate plot for each site.

ggplot(data = mesomissing) +
  geom_line(mapping = aes(x = DATE, y = pv_hmax)) +
  facet_wrap(~ STID, ncol = 1)

3.2.2 Summary Functions

Up to this point, only a few summary functions have been covered: sum() mean(), and n(). Other useful summary functions include:

  • Measures of location: mean(x) and median(x). The mean is the sum divided by the length; the median is a value where 50% of x is above it, and 50% is below it.
  • Measures of spread: sd(x), IQR(x), mad(x). The mean squared deviation, or standard deviation or sd for short, is the standard measure of spread. The interquartile range IQR() and median absolute deviation mad(x) are robust equivalents that may be more useful if you have outliers.
  • Measures of rank: min(x), quantile(x, 0.25), max(x). Quantiles are a generalization of the median. For example, quantile(x, 0.25) will find a value of x that is greater than 25% of the values, and less than the remaining 75%.
  • Measures of position: first(x), nth(x, 2), last(x). These work similarly to x[1], x[2], and x[length(x)] but let you set a default value if that position does not exist (i.e. you’re trying to get the 3rd element from a group that only has two elements).
  • Counts: You’ve seen n(), which takes no arguments, and returns the size of the current group. To count the number of non-missing values, use sum(!is.na(x)). To count the number of distinct (unique) values, use n_distinct(x).

3.3 Pivoting Data

Almost all of the data used in environmental geography applications can be stored and manipulated in data frames, but depending on the application the format of the data frame can differ. Consider an example like the Mesonet data where each observation is index by a location (station) and a time (month). These data could be stored in a “wide” format, where there is one row for each station and one column for each month. Alternately, the data can be stored in a “long” format where there is one row for each combination of station and month, with index columns that specify the station and month associated with each row.

In general the “long” format is considered to be a more efficient and “tidy” way to format and process data. This concept of working with “tidy” data is the conceptual foundation for the tidyr package as well as the larger “tidyverse” collection of packages. In general, the long data format with fewer columns and more rows is easiest to manipulate using dplyr functions and is the format needed for plotting with ggplot() and running many types of statistical models. However, there are also many situations where a wider format is required for data manipulation and analysis. Therefore the tidyr package provides two important functions called pivot_longer() and pivot wider(). That can be used to reformat data frames.

Consider the problem of plotting both minimum and maximum temperatures on the same graph. In the current mesomnth data frame, they are in separate columns - TMIN and TMAX. For plotting, it is necessary to have all of the temperature observations in a single values column and to have a second names column that indicates the type of temperature measurement (minimum or maximum). The pivot_longer() function takes three basic arguments. The cols argument specifies which columns from the input dataset will be combined into the new “values” column. The one_of() function is used to specify a list of column names from the input data frame. The values_to argument specifies the name of the new values column. The names_to argument specifies the name of the new varkable names column.

temp_tidy <- pivot_longer(mesomnth, 
                          cols = one_of("TMAX", "TMIN"),
                          values_to = "temp",
                          names_to = "tstat")
temp_tidy
## # A tibble: 480 x 9
## # Groups:   STID, YEAR [20]
##    STID   YEAR MONTH  HMAX  HMIN  RAIN DATE       tstat  temp
##    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <date>     <chr> <dbl>
##  1 BUTL   2014     1  76.3  27.2  0.01 2014-01-01 TMAX   52.0
##  2 BUTL   2014     1  76.3  27.2  0.01 2014-01-01 TMIN   21.6
##  3 BUTL   2014     2  88.8  41.2  0.26 2014-02-01 TMAX   47.5
##  4 BUTL   2014     2  88.8  41.2  0.26 2014-02-01 TMIN   22.3
##  5 BUTL   2014     3  79.5  27.1  0.59 2014-03-01 TMAX   61.1
##  6 BUTL   2014     3  79.5  27.1  0.59 2014-03-01 TMIN   31.3
##  7 BUTL   2014     4  80.2  25.1  1.33 2014-04-01 TMAX   74.9
##  8 BUTL   2014     4  80.2  25.1  1.33 2014-04-01 TMIN   45.3
##  9 BUTL   2014     5  79.1  28.4  2.76 2014-05-01 TMAX   84.8
## 10 BUTL   2014     5  79.1  28.4  2.76 2014-05-01 TMIN   56.0
## # ... with 470 more rows

Now there is only a single column with temperature measurements (“temp”), and another column with a code that indicates the type of temperature measurement (“tstat”). However, the temp_tidy data frame now has twice as many rows as the original mesomnth data frame, because each minimum and maximum temperature value occupies a separate row. By organizing the data this way, it is easier to make a graph that includes minimum and maximum temperatures.

ggplot(temp_tidy) +
  geom_line(aes(x = DATE, y = temp, color = tstat)) +
  facet_wrap(~ STID, ncol = 1, scales = "free_y")

When observations are distributed across multiple rows, it is sometimes necessary to reorganize the data into multiple columns. For example, consider the problem of generating a table with just one row for each location and columns containing the total rainfall for each month in 2018. The code below shows how to accomplish this task using several dplyr functions along with the pivot_wider() function. The values_from argument indicates the input column with data to be distributed over multiple output columns. The names of these new columns will be generated by pasting the names_prefix string to the values in the names_from column.

# Filter rows with data for 2018
meso2018 <- filter(mesobig, YEAR == 2018)
# Select only the videos we need
meso2018 <- select(meso2018, STID, MONTH, RAIN)
# These are the data look like in long format, with one row for each month
meso2018
## # A tibble: 51,830 x 3
##    STID  MONTH  RAIN
##    <chr> <dbl> <dbl>
##  1 ACME      1     0
##  2 ADAX      1     0
##  3 ALTU      1     0
##  4 ALVA      1  -999
##  5 ALV2      1     0
##  6 ANTL      1  -999
##  7 ANT2      1     0
##  8 APAC      1     0
##  9 ARDM      1  -999
## 10 ARD2      1     0
## # ... with 51,820 more rows
# Replace missing data code with NA
meso2018 <- mutate(meso2018, RAIN = replace(RAIN, RAIN <= -990 | RAIN >= 990, NA))
# Group and summarize by month
meso2018grp <- group_by(meso2018, STID, MONTH)
meso2018 <- summarise(meso2018grp, RAIN = sum(RAIN, na.rm=T))
## `summarise()` regrouping output by 'STID' (override with `.groups` argument)
# Transform to wide format
meso2018 <- pivot_wider(meso2018, 
                        values_from = RAIN,
                        names_from = MONTH,
                        names_prefix = "M")
# These are the data in wide format, with one column for each month
meso2018
## # A tibble: 142 x 13
## # Groups:   STID [142]
##    STID     M1    M2    M3    M4    M5    M6    M7    M8    M9   M10   M11   M12
##    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 ACME   0.13  2.12  0.67  1.8   4.77  4.25  3.04  1.71 10.2   6.61  0.54  5.44
##  2 ADAX   0.37  7.5   3.91  2.97  4.2   4.81  2.36  7.11 11.4   7.42  0.4   5.92
##  3 ALTU   0.01  0.79  0.54  0.97  3.62  2.24  1.31  7.05  3.51  8.6   0.24  1.09
##  4 ALV2   0     0.34  1.58  1.55  9.17  4.32  3.25  4.08  4.72  8.18  0.41  2.58
##  5 ALVA   0     0     0     0     0     0     0     0     0     0     0     0   
##  6 ANT2   1.33 10.3   2.67  2.93  3.56  4.7   2.99  5.93  6.3   7.48  1.9   7.65
##  7 ANTL   0     0     0     0     0     0     0     0     0     0     0     0   
##  8 APAC   0.19  2.21  0.71  1.28  4.32  4.38  2     1.14  7.4   5.55  0.44  3.54
##  9 ARD2   0.12  7.31  2.36  1.93  6.84  1.45  0.9   4.26 13.6  11.3   0.52  5.03
## 10 ARDM   0     0     0     0     0     0     0     0     0     0     0     0   
## # ... with 132 more rows

3.4 Joining Tables

Table joins are a concept shared across many data science disciplines and are implemented in relational database management systems such as MySQL. With a join, two tables are connected to each other through variables called keys, which are variables found in both tables. For the Mesonet data, possible key columns include STID, YEAR, or MONTH.

To map the mesonet summaries, we are going to need to add information about the geographic coordinates of each station to our summary table. This information is in a separate dataset that contains, among other variables, the station ID codes and the latitude and longitude of each weather station. There are a number of different join functions available in dplyr, and each operates a bit differently.

inner_join(): return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

left_join(): return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

right_join(): return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

full_join(): return all rows and all columns from both x and y. Where there are not matching values, the function returns NA for the one missing.

The inner_join() function can be used to join the geographic coordinates to the existing meso2018 table. Both datasets have a station ID column, but the column name is in upper case in meso2018 and in lower case in geo_coords. Specifying the by = c("STID" = "stid") indicates that these are the key columns that need to be matched.

geo_coords <- read_csv("geoinfo.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   stnm = col_double(),
##   stid = col_character(),
##   name = col_character(),
##   city = col_character(),
##   rang = col_double(),
##   cdir = col_character(),
##   cnty = col_character(),
##   lat = col_double(),
##   lon = col_double(),
##   elev = col_double(),
##   cdiv = col_character(),
##   clas = col_character()
## )
mesospatial <- inner_join(x = meso2018, y = geo_coords, by = c("STID" = "stid"))
mesospatial
## # A tibble: 142 x 24
## # Groups:   STID [142]
##    STID     M1    M2    M3    M4    M5    M6    M7    M8    M9   M10   M11   M12
##    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 ACME   0.13  2.12  0.67  1.8   4.77  4.25  3.04  1.71 10.2   6.61  0.54  5.44
##  2 ADAX   0.37  7.5   3.91  2.97  4.2   4.81  2.36  7.11 11.4   7.42  0.4   5.92
##  3 ALTU   0.01  0.79  0.54  0.97  3.62  2.24  1.31  7.05  3.51  8.6   0.24  1.09
##  4 ALV2   0     0.34  1.58  1.55  9.17  4.32  3.25  4.08  4.72  8.18  0.41  2.58
##  5 ALVA   0     0     0     0     0     0     0     0     0     0     0     0   
##  6 ANT2   1.33 10.3   2.67  2.93  3.56  4.7   2.99  5.93  6.3   7.48  1.9   7.65
##  7 ANTL   0     0     0     0     0     0     0     0     0     0     0     0   
##  8 APAC   0.19  2.21  0.71  1.28  4.32  4.38  2     1.14  7.4   5.55  0.44  3.54
##  9 ARD2   0.12  7.31  2.36  1.93  6.84  1.45  0.9   4.26 13.6  11.3   0.52  5.03
## 10 ARDM   0     0     0     0     0     0     0     0     0     0     0     0   
## # ... with 132 more rows, and 11 more variables: stnm <dbl>, name <chr>,
## #   city <chr>, rang <dbl>, cdir <chr>, cnty <chr>, lat <dbl>, lon <dbl>,
## #   elev <dbl>, cdiv <chr>, clas <chr>

The mesospatial data frame now contains monthly rainfall summaries and geographic coordinates for each Mesonet station, so ggplot() can be used to generate a quick map.

ggplot() +
  geom_point(data = mesospatial,
             aes(x = lon, y = lat, size = M1),
             color = "black") +
  scale_size_continuous(name="Rainfall (mm)") +
  coord_equal() +
  labs(title="Jan 2018 Rainfall in Oklahoma")

3.5 The Pipe Operator

As shown in some of the previous examples, it is a bit awkward to have to keep overwriting the same data frame in multiple steps. One way to streamline the code and avoid this problem is to use the pipe operator, %>%. When a pipe is placed at the right side of a line of dplyr code, the data that are output from that function are passed as the first argument to the dplyr function on the next line. For example, compare the following code to the code that was executed previously. Using pipes, it is possible to generate the same data frame using a much more compact and readable block of code.

meso2018 <- mesobig %>%
  filter(YEAR == 2018) %>%
  select(STID, MONTH, RAIN) %>%
  mutate(RAIN = replace(RAIN, RAIN <= -990 | RAIN >= 990, NA)) %>%
  group_by(STID, MONTH) %>%
  summarise(RAIN = sum(RAIN, na.rm=T)) %>%
  spread(key = MONTH, value = RAIN, sep = "_") %>%
  inner_join(geo_coords, by = c("STID" = "stid"))
## `summarise()` regrouping output by 'STID' (override with `.groups` argument)
meso2018
## # A tibble: 142 x 24
## # Groups:   STID [142]
##    STID  MONTH_1 MONTH_2 MONTH_3 MONTH_4 MONTH_5 MONTH_6 MONTH_7 MONTH_8 MONTH_9
##    <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 ACME     0.13    2.12    0.67    1.8     4.77    4.25    3.04    1.71   10.2 
##  2 ADAX     0.37    7.5     3.91    2.97    4.2     4.81    2.36    7.11   11.4 
##  3 ALTU     0.01    0.79    0.54    0.97    3.62    2.24    1.31    7.05    3.51
##  4 ALV2     0       0.34    1.58    1.55    9.17    4.32    3.25    4.08    4.72
##  5 ALVA     0       0       0       0       0       0       0       0       0   
##  6 ANT2     1.33   10.3     2.67    2.93    3.56    4.7     2.99    5.93    6.3 
##  7 ANTL     0       0       0       0       0       0       0       0       0   
##  8 APAC     0.19    2.21    0.71    1.28    4.32    4.38    2       1.14    7.4 
##  9 ARD2     0.12    7.31    2.36    1.93    6.84    1.45    0.9     4.26   13.6 
## 10 ARDM     0       0       0       0       0       0       0       0       0   
## # ... with 132 more rows, and 14 more variables: MONTH_10 <dbl>,
## #   MONTH_11 <dbl>, MONTH_12 <dbl>, stnm <dbl>, name <chr>, city <chr>,
## #   rang <dbl>, cdir <chr>, cnty <chr>, lat <dbl>, lon <dbl>, elev <dbl>,
## #   cdiv <chr>, clas <chr>

For more information on functions for data summarization with the dplyr and tidyr packages, check out the Data Import and Data Transoformation cheat sheets: https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf

3.6 Practice

  1. Determine how many missing values there are for TMIN, TMAX, HMIN, HMAX, and RAIN in the mesobig data frame.

  2. Starting with the mesobig data frame, generate a data frame that contains STID, YEAR, MONTH, DAY, HMIN, and HMAX values for all mesonet stations for the year 2000. Be sure to convert all observations with missing data to ‘NA’ values.

  3. Starting with the mesobig data frame, generate a data frame that contains monthly mean TMAX values for the year 2016. This data frame should contain one row for each mesonet station and the data for each month should be in a separate column (similar to the meso2018 data frame that was created in the tutorial).

  4. Use the data frame you created in (3) to generate a map of 2016 July maximum temperatures.