Chapter 3 Processing Tabular Data
The preceding chapter provided a foundation for visualizing data using functions from the ggplot2 package. The sample dataset provided with the examples was already in a format suitable for plotting with ggplot()
. However, in many cases, data will need to be reformatted before it can be visualized and analyzed. Common formatting tasks include selecting a subset of rows and columns from the data table, calculating new variables from the raw data values, and computing summary statistics.
This tutorial will demonstrate basic commands for manipulating data frames (and tibbles). These manipulations can be accomplished in a variety of ways, including using base R operators and functions like those covered in Chapter 1. However, the functions in the dplyr and tidyr packages provide a more efficient and intuitive approach. These packages (along with the the ggplot2, readr, and readxl packages used in Chapter 2) are members of the “tidyverse”, collection of data science R packages. These packages share an underlying design philosophy, grammar of data manipulation, and set of data structures, The rest of this book will make extensive use of these packages. They will allow us to to carry out sophisticated processing, analysis, and visualization of complex “real-world” datasets by writing concise and easily interpretable R code.
The dplyr package has some excellent vignettes, including an Introduction to dplyr which can be accessed by running vignette("dplyr")
To run the examples in this chapter, it is necessary to load all of the previously-mentioned tidyverse packages along with lubridate, another package from the same family. Remember that if these packages are not available on your computer, you will need to install them using the install.packages()
function or the installation tools available in RStudio.
library(ggplot2)
library(dplyr)
library(tidyr)
library(readr)
library(readxl)
library(lubridate)
3.1 Single Table Verbs
Each dplyr function accomplishes a particular type of data transformation These functions are referred to as “verbs” that describe the action performed. Some dplyr verbs operate on a single data frame or tibble, while others operate by coming two or more. Below is a list of some important dplyr verbs for single tables:
filter()
selects observations (rows) based on their values.arrange()
reorders observations.select()
andrename()
select variables (columns) based on their names.mutate()
andtransmute()
add new variables that are functions of existing variables.
In all the following examples, the first argument to the function is always the data frame that will be modified. Additional, comma-separated arguments control how the function will be implemented.
3.1.1 Filter
These examples uses the filter
function to select records by station, year, and temperature and humidity values. The %in%
operator returns TRUE
if the input matches one of the values in the subsequent. The :
operator is used here to create a vector containing the years 2014 and 2015 as demonstrated in Chapter 1. When multiple logical statements are separated by commas, they are combined using a logical “and” operator.
read_csv("mesodata_small.csv") mesosm <-
##
## -- Column specification --------------------------------------------------------
## cols(
## MONTH = col_double(),
## YEAR = col_double(),
## STID = col_character(),
## TMAX = col_double(),
## TMIN = col_double(),
## HMAX = col_double(),
## HMIN = col_double(),
## RAIN = col_double(),
## DATE = col_date(format = "")
## )
# Select rates from the Butler station
filter(mesosm, STID == "BUTL")
## # A tibble: 60 x 9
## MONTH YEAR STID TMAX TMIN HMAX HMIN RAIN DATE
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <date>
## 1 1 2014 BUTL 52.0 21.6 76.3 27.2 0.01 2014-01-01
## 2 2 2014 BUTL 47.5 22.3 88.8 41.2 0.26 2014-02-01
## 3 3 2014 BUTL 61.1 31.3 79.5 27.1 0.59 2014-03-01
## 4 4 2014 BUTL 74.9 45.3 80.2 25.1 1.33 2014-04-01
## 5 5 2014 BUTL 84.8 56.0 79.1 28.4 2.76 2014-05-01
## 6 6 2014 BUTL 89.0 67.1 89.7 42.8 2.97 2014-06-01
## 7 7 2014 BUTL 91.4 67.8 87.2 37.7 3.85 2014-07-01
## 8 8 2014 BUTL 96.5 68.0 82.6 27.9 0.22 2014-08-01
## 9 9 2014 BUTL 85.9 61.2 89.3 39.1 2.39 2014-09-01
## 10 10 2014 BUTL 78.8 48.5 92.3 32.3 2.62 2014-10-01
## # ... with 50 more rows
# Select rows from the years 2014 or 2015
filter(mesosm, YEAR %in% 2014:2015)
## # A tibble: 96 x 9
## MONTH YEAR STID TMAX TMIN HMAX HMIN RAIN DATE
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <date>
## 1 1 2014 BUTL 52.0 21.6 76.3 27.2 0.01 2014-01-01
## 2 2 2014 BUTL 47.5 22.3 88.8 41.2 0.26 2014-02-01
## 3 3 2014 BUTL 61.1 31.3 79.5 27.1 0.59 2014-03-01
## 4 4 2014 BUTL 74.9 45.3 80.2 25.1 1.33 2014-04-01
## 5 5 2014 BUTL 84.8 56.0 79.1 28.4 2.76 2014-05-01
## 6 6 2014 BUTL 89.0 67.1 89.7 42.8 2.97 2014-06-01
## 7 7 2014 BUTL 91.4 67.8 87.2 37.7 3.85 2014-07-01
## 8 8 2014 BUTL 96.5 68.0 82.6 27.9 0.22 2014-08-01
## 9 9 2014 BUTL 85.9 61.2 89.3 39.1 2.39 2014-09-01
## 10 10 2014 BUTL 78.8 48.5 92.3 32.3 2.62 2014-10-01
## # ... with 86 more rows
# Select rows where max temperature > 90 and max humidity > 90
filter(mesosm, TMAX > 90, HMAX > 90)
## # A tibble: 14 x 9
## MONTH YEAR STID TMAX TMIN HMAX HMIN RAIN DATE
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <date>
## 1 7 2015 BUTL 92.0 69.5 91.3 44.0 5.16 2015-07-01
## 2 8 2015 BUTL 91.4 66.1 91.5 38.0 2.13 2015-08-01
## 3 6 2016 BUTL 90.3 66.4 92.4 40.5 4.06 2016-06-01
## 4 8 2016 BUTL 92.3 67.1 92.9 38.1 5.81 2016-08-01
## 5 7 2015 MTHE 91.9 71.6 94.7 48.5 1.45 2015-07-01
## 6 7 2016 MTHE 92.7 72.0 94.8 47.5 5.05 2016-07-01
## 7 6 2018 MTHE 90.6 69.2 94.0 48.9 2.51 2018-06-01
## 8 7 2018 MTHE 93.8 71.2 92.4 43.9 3.46 2018-07-01
## 9 8 2014 SKIA 90.7 69.7 90.4 43.5 1.2 2014-08-01
## 10 7 2016 SKIA 91.0 72.8 92.0 51.2 5.32 2016-07-01
## 11 8 2016 SKIA 90.7 71.0 90.8 45.6 2.19 2016-08-01
## 12 7 2017 SKIA 90.4 71.5 91.1 49.6 3.18 2017-07-01
## 13 7 2018 SKIA 92.4 71.5 92.5 44.9 2.23 2018-07-01
## 14 7 2018 SPEN 91.3 71.2 92.4 49.5 3.01 2018-07-01
3.1.2 Arrange
The arrange()
function returns a data frame that is sorted on the comma-separated column names in order from left to right.
arrange(mesosm, STID, YEAR, MONTH)
## # A tibble: 240 x 9
## MONTH YEAR STID TMAX TMIN HMAX HMIN RAIN DATE
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <date>
## 1 1 2014 BUTL 52.0 21.6 76.3 27.2 0.01 2014-01-01
## 2 2 2014 BUTL 47.5 22.3 88.8 41.2 0.26 2014-02-01
## 3 3 2014 BUTL 61.1 31.3 79.5 27.1 0.59 2014-03-01
## 4 4 2014 BUTL 74.9 45.3 80.2 25.1 1.33 2014-04-01
## 5 5 2014 BUTL 84.8 56.0 79.1 28.4 2.76 2014-05-01
## 6 6 2014 BUTL 89.0 67.1 89.7 42.8 2.97 2014-06-01
## 7 7 2014 BUTL 91.4 67.8 87.2 37.7 3.85 2014-07-01
## 8 8 2014 BUTL 96.5 68.0 82.6 27.9 0.22 2014-08-01
## 9 9 2014 BUTL 85.9 61.2 89.3 39.1 2.39 2014-09-01
## 10 10 2014 BUTL 78.8 48.5 92.3 32.3 2.62 2014-10-01
## # ... with 230 more rows
The desc()
function can be used to order a column in descending rather than ascending order.
arrange(mesosm, STID, desc(YEAR), desc(MONTH))
## # A tibble: 240 x 9
## MONTH YEAR STID TMAX TMIN HMAX HMIN RAIN DATE
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <date>
## 1 12 2018 BUTL 51.5 25.9 88.1 39.8 2.58 2018-12-01
## 2 11 2018 BUTL 58.1 28.7 91.3 37.3 0.39 2018-11-01
## 3 10 2018 BUTL 71.4 47.4 95.7 47.4 7.68 2018-10-01
## 4 9 2018 BUTL 84.5 62.7 94.7 50.7 2.79 2018-09-01
## 5 8 2018 BUTL 93.4 68.5 86.5 36.8 2.75 2018-08-01
## 6 7 2018 BUTL 96.9 69.7 89.8 33.0 1.69 2018-07-01
## 7 6 2018 BUTL 93.3 68.6 85.8 37.4 4.6 2018-06-01
## 8 5 2018 BUTL 90.5 62.3 87.8 33.7 3.08 2018-05-01
## 9 4 2018 BUTL 71.0 38.5 81.3 26.8 1.41 2018-04-01
## 10 3 2018 BUTL 68.5 36.5 82.7 28.1 0.41 2018-03-01
## # ... with 230 more rows
3.1.3 Select and Rename
The select()
function returns only the specified columns. Other columns are removed from the data frame. One way to specify the selected columsn is a comma-separated list of column names.
select(mesosm, STID, YEAR, MONTH, TMAX, TMIN)
## # A tibble: 240 x 5
## STID YEAR MONTH TMAX TMIN
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 BUTL 2014 1 52.0 21.6
## 2 BUTL 2014 2 47.5 22.3
## 3 BUTL 2014 3 61.1 31.3
## 4 BUTL 2014 4 74.9 45.3
## 5 BUTL 2014 5 84.8 56.0
## 6 BUTL 2014 6 89.0 67.1
## 7 BUTL 2014 7 91.4 67.8
## 8 BUTL 2014 8 96.5 68.0
## 9 BUTL 2014 9 85.9 61.2
## 10 BUTL 2014 10 78.8 48.5
## # ... with 230 more rows
The :
operator can be used to select a continuous series of columns. This usage is analogous to the :
operator in base R, where it is used to specify a continuous series of integers.
select(mesosm,
:TMAX) MONTH
## # A tibble: 240 x 4
## MONTH YEAR STID TMAX
## <dbl> <dbl> <chr> <dbl>
## 1 1 2014 BUTL 52.0
## 2 2 2014 BUTL 47.5
## 3 3 2014 BUTL 61.1
## 4 4 2014 BUTL 74.9
## 5 5 2014 BUTL 84.8
## 6 6 2014 BUTL 89.0
## 7 7 2014 BUTL 91.4
## 8 8 2014 BUTL 96.5
## 9 9 2014 BUTL 85.9
## 10 10 2014 BUTL 78.8
## # ... with 230 more rows
The helper functions starts_with()
, ends_with()
, and contains()
can be used to find multiple columns by matching part of the column name.
select(mesosm,
starts_with("T"))
## # A tibble: 240 x 2
## TMAX TMIN
## <dbl> <dbl>
## 1 52.0 21.6
## 2 47.5 22.3
## 3 61.1 31.3
## 4 74.9 45.3
## 5 84.8 56.0
## 6 89.0 67.1
## 7 91.4 67.8
## 8 96.5 68.0
## 9 85.9 61.2
## 10 78.8 48.5
## # ... with 230 more rows
Columns can be removed by prefixing their names with a -
. Other columns will be kept.
select(mesosm,
-HMIN,
-HMAX)
## # A tibble: 240 x 7
## MONTH YEAR STID TMAX TMIN RAIN DATE
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <date>
## 1 1 2014 BUTL 52.0 21.6 0.01 2014-01-01
## 2 2 2014 BUTL 47.5 22.3 0.26 2014-02-01
## 3 3 2014 BUTL 61.1 31.3 0.59 2014-03-01
## 4 4 2014 BUTL 74.9 45.3 1.33 2014-04-01
## 5 5 2014 BUTL 84.8 56.0 2.76 2014-05-01
## 6 6 2014 BUTL 89.0 67.1 2.97 2014-06-01
## 7 7 2014 BUTL 91.4 67.8 3.85 2014-07-01
## 8 8 2014 BUTL 96.5 68.0 0.22 2014-08-01
## 9 9 2014 BUTL 85.9 61.2 2.39 2014-09-01
## 10 10 2014 BUTL 78.8 48.5 2.62 2014-10-01
## # ... with 230 more rows
The rename()
function is used to change column names. Name changes are specified using the =
operator, placing the new name first and the old name second.
rename(mesosm, maxtemp = TMAX,
mintemp = TMIN)
## # A tibble: 240 x 9
## MONTH YEAR STID maxtemp mintemp HMAX HMIN RAIN DATE
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <date>
## 1 1 2014 BUTL 52.0 21.6 76.3 27.2 0.01 2014-01-01
## 2 2 2014 BUTL 47.5 22.3 88.8 41.2 0.26 2014-02-01
## 3 3 2014 BUTL 61.1 31.3 79.5 27.1 0.59 2014-03-01
## 4 4 2014 BUTL 74.9 45.3 80.2 25.1 1.33 2014-04-01
## 5 5 2014 BUTL 84.8 56.0 79.1 28.4 2.76 2014-05-01
## 6 6 2014 BUTL 89.0 67.1 89.7 42.8 2.97 2014-06-01
## 7 7 2014 BUTL 91.4 67.8 87.2 37.7 3.85 2014-07-01
## 8 8 2014 BUTL 96.5 68.0 82.6 27.9 0.22 2014-08-01
## 9 9 2014 BUTL 85.9 61.2 89.3 39.1 2.39 2014-09-01
## 10 10 2014 BUTL 78.8 48.5 92.3 32.3 2.62 2014-10-01
## # ... with 230 more rows
3.1.4 Mutate and Transmute
The mutate()
function adds new variables to the data frame. Multiple new variables can be generated with a single function call, using a comma to separate each new variable. The name of each new variable is specified on the left of the =
and a function that can contain names of other variables in the table is specified on the right. The following example converts the minimum and maximum temperature variables from Fahrenheit to Celsius.
mutate(mesosm,
TMINC = (TMIN - 32) * .5556,
TMAXC = (TMIN - 32) * .5556)
## # A tibble: 240 x 11
## MONTH YEAR STID TMAX TMIN HMAX HMIN RAIN DATE TMINC TMAXC
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <date> <dbl> <dbl>
## 1 1 2014 BUTL 52.0 21.6 76.3 27.2 0.01 2014-01-01 -5.79 -5.79
## 2 2 2014 BUTL 47.5 22.3 88.8 41.2 0.26 2014-02-01 -5.41 -5.41
## 3 3 2014 BUTL 61.1 31.3 79.5 27.1 0.59 2014-03-01 -0.409 -0.409
## 4 4 2014 BUTL 74.9 45.3 80.2 25.1 1.33 2014-04-01 7.41 7.41
## 5 5 2014 BUTL 84.8 56.0 79.1 28.4 2.76 2014-05-01 13.4 13.4
## 6 6 2014 BUTL 89.0 67.1 89.7 42.8 2.97 2014-06-01 19.5 19.5
## 7 7 2014 BUTL 91.4 67.8 87.2 37.7 3.85 2014-07-01 19.9 19.9
## 8 8 2014 BUTL 96.5 68.0 82.6 27.9 0.22 2014-08-01 20.0 20.0
## 9 9 2014 BUTL 85.9 61.2 89.3 39.1 2.39 2014-09-01 16.2 16.2
## 10 10 2014 BUTL 78.8 48.5 92.3 32.3 2.62 2014-10-01 9.17 9.17
## # ... with 230 more rows
The transmute()
function only keeps the newly-created variables.
transmute(mesosm,
TMINC = (TMIN - 32) * .5556,
TMAXC = (TMIN - 32) * .5556)
## # A tibble: 240 x 2
## TMINC TMAXC
## <dbl> <dbl>
## 1 -5.79 -5.79
## 2 -5.41 -5.41
## 3 -0.409 -0.409
## 4 7.41 7.41
## 5 13.4 13.4
## 6 19.5 19.5
## 7 19.9 19.9
## 8 20.0 20.0
## 9 16.2 16.2
## 10 9.17 9.17
## # ... with 230 more rows
3.1.5 Application
The following examples will use another meteorological dataset from the Oklahoma Mesonet. The mesodata_large.csv
file contains daily data records from every Mesonet station in Oklahoma from 1994-present. This this is a very large data table, more than with more than one million rows and over 150 Mb of data. There are also numerous missing data codes (values < -990 or > 990) that will need to dealt with.
read_csv("mesodata_large.csv") mesobig <-
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## STID = col_character()
## )
## i Use `spec()` for the full column specifications.
dim(mesobig)
## [1] 1296602 22
summary(mesobig)
## YEAR MONTH DAY STID
## Min. :1994 Min. : 1.000 Min. : 1.00 Length:1296602
## 1st Qu.:2000 1st Qu.: 4.000 1st Qu.: 8.00 Class :character
## Median :2006 Median : 7.000 Median :16.00 Mode :character
## Mean :2006 Mean : 6.523 Mean :15.73
## 3rd Qu.:2012 3rd Qu.:10.000 3rd Qu.:23.00
## Max. :2018 Max. :12.000 Max. :31.00
## TMAX TMIN TAVG DMAX
## Min. :-999.00 Min. :-999.00 Min. :-999.00 Min. :-999.00
## 1st Qu.:-996.00 1st Qu.:-996.00 1st Qu.:-996.00 1st Qu.:-996.00
## Median : 62.73 Median : 37.22 Median : 49.74 Median : 42.90
## Mean :-234.99 Mean :-251.78 Mean :-243.61 Mean :-255.03
## 3rd Qu.: 82.51 3rd Qu.: 59.88 3rd Qu.: 70.53 3rd Qu.: 64.19
## Max. : 115.07 Max. : 89.01 Max. : 100.06 Max. : 84.19
## DMIN DAVG HMAX HMIN
## Min. :-999.00 Min. :-999.00 Min. :-999.00 Min. :-999.00
## 1st Qu.:-996.00 1st Qu.:-996.00 1st Qu.: 70.39 1st Qu.: 18.68
## Median : 28.42 Median : 35.76 Median : 89.80 Median : 34.54
## Mean :-263.81 Mean :-259.27 Mean :-120.33 Mean :-157.88
## 3rd Qu.: 52.19 3rd Qu.: 58.58 3rd Qu.: 96.20 3rd Qu.: 50.49
## Max. : 76.54 Max. : 78.52 Max. : 100.00 Max. : 100.00
## HAVG VDEF 9AVG HDEG
## Min. :-999.00 Min. :-999.00 Min. :-999.00 Min. :-999.00
## 1st Qu.: 43.87 1st Qu.:-996.00 1st Qu.:-996.00 1st Qu.:-996.00
## Median : 62.93 Median : 4.01 Median : 45.58 Median : 0.00
## Mean :-138.20 Mean :-287.10 Mean :-315.38 Mean :-279.82
## 3rd Qu.: 75.42 3rd Qu.: 9.00 3rd Qu.: 68.82 3rd Qu.: 11.84
## Max. : 100.00 Max. : 54.38 Max. : 99.93 Max. : 72.57
## CDEG HTMX WCMN RAIN
## Min. :-999.00 Min. :-999.00 Min. :-999.00 Min. :-999.0
## 1st Qu.:-996.00 1st Qu.:-996.00 1st Qu.:-996.00 1st Qu.: 0.0
## Median : 0.00 Median :-996.00 Median :-996.00 Median : 0.0
## Mean :-282.82 Mean :-686.45 Mean :-643.44 Mean :-194.0
## 3rd Qu.: 5.89 3rd Qu.: 82.77 3rd Qu.: 20.51 3rd Qu.: 0.0
## Max. : 35.00 Max. : 125.38 Max. : 49.67 Max. : 14.2
## RNUM RMAX
## Min. :-999.0 Min. :-999.00
## 1st Qu.: 0.0 1st Qu.: 0.00
## Median : 0.0 Median : 0.00
## Mean :-158.2 Mean :-194.57
## 3rd Qu.: 0.0 3rd Qu.: 0.00
## Max. : 996.0 Max. : 10.92
The dplyr single-table functions can be used to created a smaller version of the big Mesonet dataset. The first step is to filter out records (rows) that belong to one of four sites and are from the years 2014 to present.
filter(mesobig, STID %in% c("SPEN", "SKIA", "MTHE", "BUTL") & YEAR >= 2014)
mesosm <-dim(mesobig)
## [1] 1296602 22
dim(mesosm)
## [1] 7304 22
Next, the select()
function is used to choose a subset of the columns.
select(mesosm, STID, YEAR, MONTH, DAY, TMIN, TMAX, HMIN, HMAX, RAIN)
mesosm <-dim(mesosm)
## [1] 7304 9
This example takes the mesosm
data frame as an input and also creates an output objected called mesosm
. We are effectively overwriting the object with a new version. This was done intentionally to avoid the confusion of having multiple data objects generated by a string of functions (e.g., mesosm1
, mesosm2
, etc.). There is really no need to save the output of each step because all of the code can easily be rerun if a change is required or an error needs to be fixed. The end of this chapter will present a technique called “piping” that makes it even easier to combine multiple dplyr functions into concise blocks of code.
Next, the temperature, humidity, and rainfall records are transformed by replacing the missing data codes with ‘NA’ values using the mutate()
function in combination with the replace()
function. Replace takes three comma-separated arguments. The first is the variable to replace, the second is a logical statement indicating which values will replaced, and the third if the replacement value. Note that in this example, the columns are overwritten with their updated values rather than creating new column.
mutate(mesosm,
mesosm <-TMIN = replace(TMIN, TMIN <= -990 | TMIN >= 990, NA),
TMAX = replace(TMAX, TMAX <= -990 | TMAX >= 990, NA),
HMIN = replace(HMIN, HMIN <= -990 | HMIN >= 990, NA),
HMAX = replace(HMAX, HMAX <= -990 | HMAX >= 990, NA),
RAIN = replace(RAIN, RAIN <= -990 | RAIN >= 990, NA))
In this example, the columns are overwritten with their updated values rather than creating new columns.
3.2 Summarizing
At this point, the size of the data frame has been reduced and the missing data values have been removed. However, there is still one record in the data frame for each day. The next step is to convert these daily values to monthly summaries. This can be accomplished using the summarize()
function. The following example uses summarize()
to calculate the mean maximum temperature for all records in the data frame. As discussed in Chapter 1, the na.rm = TRUE
argument is specified so that the mean()
function will ignore the NA
values
summarize(mesosm, meantmax = mean(TMAX, na.rm = TRUE))
## # A tibble: 1 x 1
## meantmax
## <dbl>
## 1 72.1
The summarize()
function becomes particularly useful when we pair it with another dplyr function, group_by()
, which groups rows of data together to produce a “grouped data frame”. When a grouped data frame is summarized, the summaries are generated for individual groups rather than the entire dataset, and the result is a data frame with one row per group.
group_by(mesosm, STID, YEAR, MONTH)
mesogrp <-class(mesogrp)
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
summarise(mesogrp, TMAX = mean(TMAX, na.rm=T),
mesomnth <-TMIN = mean(TMIN, na.rm=T),
HMAX = mean(HMAX, na.rm=T),
HMIN = mean(HMIN, na.rm=T),
RAIN = sum(RAIN, na.rm=T))
## `summarise()` regrouping output by 'STID', 'YEAR' (override with `.groups` argument)
Finally, the mutate()
function is used to add a date column. Since each record represents an entire month, the date is set to correspond to the first day of each month. The ymd()
function from the lubridate package is used to create a date object. The input to this function is a date string with year followed by month followed by day separated by dashes. For example, March 31, 2018 would be “2018-3-31”. The paste()
function joins multiple strings together using a separate specified with the sep
argument.
mutate(mesomnth,
mesomnth <-DATE = ymd(paste(YEAR, MONTH, "1", sep = "-")))
Take a look at the resulting data frame with the View()
function. It should be the same as the “mesodata_small” dataset that was used in Chapter 2 and at the beginning of this Chapter. These data can be used to make the same plots as in Chapter 2.
ggplot(data = mesomnth) +
geom_line(mapping = aes(x = DATE, y = RAIN)) +
facet_wrap(~ STID, ncol = 1)
3.2.1 Counts
When summarizing data, it is often useful to count the number of values used. This can be accomplished with the dplyr function n()
, which will include NA
values in the count, or sum(!is.na(x))
which will exclude them. It’s a good idea to count when summarizing to make sure your summaries are not based on a very small sample size. Here, for example, a new variable called pv_hmax
to is calculated to determine, for each month and site, what percentage of days have missing data. The mutate()
function is again used to add a data columns to the mesomissing
data frame and then overwrite the old version of the data frame.
summarize(mesogrp,
mesomissing <-n_rows = n(),
obs_hmax = sum(!is.na(HMAX)),
pv_hmax = 100 * obs_hmax/n_rows)
## `summarise()` regrouping output by 'STID', 'YEAR' (override with `.groups` argument)
mutate(mesomissing,
mesomissing <-DATE = ymd(paste(YEAR, MONTH, "1", sep = "-")))
mesomissing
## # A tibble: 240 x 7
## # Groups: STID, YEAR [20]
## STID YEAR MONTH n_rows obs_hmax pv_hmax DATE
## <chr> <dbl> <dbl> <int> <int> <dbl> <date>
## 1 BUTL 2014 1 31 31 100 2014-01-01
## 2 BUTL 2014 2 28 28 100 2014-02-01
## 3 BUTL 2014 3 31 31 100 2014-03-01
## 4 BUTL 2014 4 30 30 100 2014-04-01
## 5 BUTL 2014 5 31 31 100 2014-05-01
## 6 BUTL 2014 6 30 30 100 2014-06-01
## 7 BUTL 2014 7 31 31 100 2014-07-01
## 8 BUTL 2014 8 31 31 100 2014-08-01
## 9 BUTL 2014 9 30 30 100 2014-09-01
## 10 BUTL 2014 10 31 31 100 2014-10-01
## # ... with 230 more rows
The resulting data table contains information about when and where there lot of missing data. An efficient way to explore these data is using a simple plot. The following example generates a line plot similar to the examples in Chapter 2 that displays the monthly time series of missing data using a separate plot for each site.
ggplot(data = mesomissing) +
geom_line(mapping = aes(x = DATE, y = pv_hmax)) +
facet_wrap(~ STID, ncol = 1)
3.2.2 Summary Functions
Up to this point, only a few summary functions have been covered: sum()
mean()
, and n()
. Other useful summary functions include:
- Measures of location:
mean(x)
andmedian(x)
. The mean is the sum divided by the length; the median is a value where 50% ofx
is above it, and 50% is below it. - Measures of spread:
sd(x)
,IQR(x)
,mad(x)
. The mean squared deviation, or standard deviation or sd for short, is the standard measure of spread. The interquartile rangeIQR()
and median absolute deviationmad(x)
are robust equivalents that may be more useful if you have outliers. - Measures of rank:
min(x)
,quantile(x, 0.25)
,max(x)
. Quantiles are a generalization of the median. For example,quantile(x, 0.25)
will find a value ofx
that is greater than 25% of the values, and less than the remaining 75%. - Measures of position:
first(x)
,nth(x, 2)
,last(x)
. These work similarly to x[1], x[2], and x[length(x)] but let you set a default value if that position does not exist (i.e. you’re trying to get the 3rd element from a group that only has two elements). - Counts: You’ve seen
n()
, which takes no arguments, and returns the size of the current group. To count the number of non-missing values, usesum(!is.na(x))
. To count the number of distinct (unique) values, usen_distinct(x)
.
3.3 Pivoting Data
Almost all of the data used in environmental geography applications can be stored and manipulated in data frames, but depending on the application the format of the data frame can differ. Consider an example like the Mesonet data where each observation is index by a location (station) and a time (month). These data could be stored in a “wide” format, where there is one row for each station and one column for each month. Alternately, the data can be stored in a “long” format where there is one row for each combination of station and month, with index columns that specify the station and month associated with each row.
In general the “long” format is considered to be a more efficient and “tidy” way to format and process data. This concept of working with “tidy” data is the conceptual foundation for the tidyr
package as well as the larger “tidyverse” collection of packages. In general, the long data format with fewer columns and more rows is easiest to manipulate using dplyr functions and is the format needed for plotting with ggplot()
and running many types of statistical models. However, there are also many situations where a wider format is required for data manipulation and analysis. Therefore the tidyr package provides two important functions called pivot_longer()
and pivot wider()
. That can be used to reformat data frames.
Consider the problem of plotting both minimum and maximum temperatures on the same graph. In the current mesomnth
data frame, they are in separate columns - TMIN
and TMAX
. For plotting, it is necessary to have all of the temperature observations in a single values column and to have a second names column that indicates the type of temperature measurement (minimum or maximum). The pivot_longer()
function takes three basic arguments. The cols
argument specifies which columns from the input dataset will be combined into the new “values” column. The one_of()
function is used to specify a list of column names from the input data frame. The values_to
argument specifies the name of the new values column. The names_to
argument specifies the name of the new varkable names column.
pivot_longer(mesomnth,
temp_tidy <-cols = one_of("TMAX", "TMIN"),
values_to = "temp",
names_to = "tstat")
temp_tidy
## # A tibble: 480 x 9
## # Groups: STID, YEAR [20]
## STID YEAR MONTH HMAX HMIN RAIN DATE tstat temp
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <date> <chr> <dbl>
## 1 BUTL 2014 1 76.3 27.2 0.01 2014-01-01 TMAX 52.0
## 2 BUTL 2014 1 76.3 27.2 0.01 2014-01-01 TMIN 21.6
## 3 BUTL 2014 2 88.8 41.2 0.26 2014-02-01 TMAX 47.5
## 4 BUTL 2014 2 88.8 41.2 0.26 2014-02-01 TMIN 22.3
## 5 BUTL 2014 3 79.5 27.1 0.59 2014-03-01 TMAX 61.1
## 6 BUTL 2014 3 79.5 27.1 0.59 2014-03-01 TMIN 31.3
## 7 BUTL 2014 4 80.2 25.1 1.33 2014-04-01 TMAX 74.9
## 8 BUTL 2014 4 80.2 25.1 1.33 2014-04-01 TMIN 45.3
## 9 BUTL 2014 5 79.1 28.4 2.76 2014-05-01 TMAX 84.8
## 10 BUTL 2014 5 79.1 28.4 2.76 2014-05-01 TMIN 56.0
## # ... with 470 more rows
Now there is only a single column with temperature measurements (“temp”), and another column with a code that indicates the type of temperature measurement (“tstat”). However, the temp_tidy
data frame now has twice as many rows as the original mesomnth
data frame, because each minimum and maximum temperature value occupies a separate row. By organizing the data this way, it is easier to make a graph that includes minimum and maximum temperatures.
ggplot(temp_tidy) +
geom_line(aes(x = DATE, y = temp, color = tstat)) +
facet_wrap(~ STID, ncol = 1, scales = "free_y")
When observations are distributed across multiple rows, it is sometimes necessary to reorganize the data into multiple columns. For example, consider the problem of generating a table with just one row for each location and columns containing the total rainfall for each month in 2018. The code below shows how to accomplish this task using several dplyr functions along with the pivot_wider()
function. The values_from
argument indicates the input column with data to be distributed over multiple output columns. The names of these new columns will be generated by pasting the names_prefix
string to the values in the names_from
column.
# Filter rows with data for 2018
filter(mesobig, YEAR == 2018)
meso2018 <-# Select only the videos we need
select(meso2018, STID, MONTH, RAIN)
meso2018 <-# These are the data look like in long format, with one row for each month
meso2018
## # A tibble: 51,830 x 3
## STID MONTH RAIN
## <chr> <dbl> <dbl>
## 1 ACME 1 0
## 2 ADAX 1 0
## 3 ALTU 1 0
## 4 ALVA 1 -999
## 5 ALV2 1 0
## 6 ANTL 1 -999
## 7 ANT2 1 0
## 8 APAC 1 0
## 9 ARDM 1 -999
## 10 ARD2 1 0
## # ... with 51,820 more rows
# Replace missing data code with NA
mutate(meso2018, RAIN = replace(RAIN, RAIN <= -990 | RAIN >= 990, NA))
meso2018 <-# Group and summarize by month
group_by(meso2018, STID, MONTH)
meso2018grp <- summarise(meso2018grp, RAIN = sum(RAIN, na.rm=T)) meso2018 <-
## `summarise()` regrouping output by 'STID' (override with `.groups` argument)
# Transform to wide format
pivot_wider(meso2018,
meso2018 <-values_from = RAIN,
names_from = MONTH,
names_prefix = "M")
# These are the data in wide format, with one column for each month
meso2018
## # A tibble: 142 x 13
## # Groups: STID [142]
## STID M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ACME 0.13 2.12 0.67 1.8 4.77 4.25 3.04 1.71 10.2 6.61 0.54 5.44
## 2 ADAX 0.37 7.5 3.91 2.97 4.2 4.81 2.36 7.11 11.4 7.42 0.4 5.92
## 3 ALTU 0.01 0.79 0.54 0.97 3.62 2.24 1.31 7.05 3.51 8.6 0.24 1.09
## 4 ALV2 0 0.34 1.58 1.55 9.17 4.32 3.25 4.08 4.72 8.18 0.41 2.58
## 5 ALVA 0 0 0 0 0 0 0 0 0 0 0 0
## 6 ANT2 1.33 10.3 2.67 2.93 3.56 4.7 2.99 5.93 6.3 7.48 1.9 7.65
## 7 ANTL 0 0 0 0 0 0 0 0 0 0 0 0
## 8 APAC 0.19 2.21 0.71 1.28 4.32 4.38 2 1.14 7.4 5.55 0.44 3.54
## 9 ARD2 0.12 7.31 2.36 1.93 6.84 1.45 0.9 4.26 13.6 11.3 0.52 5.03
## 10 ARDM 0 0 0 0 0 0 0 0 0 0 0 0
## # ... with 132 more rows
3.4 Joining Tables
Table joins are a concept shared across many data science disciplines and are implemented in relational database management systems such as MySQL. With a join, two tables are connected to each other through variables called keys, which are variables found in both tables. For the Mesonet data, possible key columns include STID, YEAR, or MONTH.
To map the mesonet summaries, we are going to need to add information about the geographic coordinates of each station to our summary table. This information is in a separate dataset that contains, among other variables, the station ID codes and the latitude and longitude of each weather station. There are a number of different join functions available in dplyr, and each operates a bit differently.
inner_join()
: return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.
left_join()
: return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
right_join()
: return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
full_join()
: return all rows and all columns from both x and y. Where there are not matching values, the function returns NA
for the one missing.
The inner_join()
function can be used to join the geographic coordinates to the existing meso2018 table. Both datasets have a station ID column, but the column name is in upper case in meso2018
and in lower case in geo_coords
. Specifying the by = c("STID" = "stid")
indicates that these are the key columns that need to be matched.
read_csv("geoinfo.csv") geo_coords <-
##
## -- Column specification --------------------------------------------------------
## cols(
## stnm = col_double(),
## stid = col_character(),
## name = col_character(),
## city = col_character(),
## rang = col_double(),
## cdir = col_character(),
## cnty = col_character(),
## lat = col_double(),
## lon = col_double(),
## elev = col_double(),
## cdiv = col_character(),
## clas = col_character()
## )
inner_join(x = meso2018, y = geo_coords, by = c("STID" = "stid"))
mesospatial <- mesospatial
## # A tibble: 142 x 24
## # Groups: STID [142]
## STID M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ACME 0.13 2.12 0.67 1.8 4.77 4.25 3.04 1.71 10.2 6.61 0.54 5.44
## 2 ADAX 0.37 7.5 3.91 2.97 4.2 4.81 2.36 7.11 11.4 7.42 0.4 5.92
## 3 ALTU 0.01 0.79 0.54 0.97 3.62 2.24 1.31 7.05 3.51 8.6 0.24 1.09
## 4 ALV2 0 0.34 1.58 1.55 9.17 4.32 3.25 4.08 4.72 8.18 0.41 2.58
## 5 ALVA 0 0 0 0 0 0 0 0 0 0 0 0
## 6 ANT2 1.33 10.3 2.67 2.93 3.56 4.7 2.99 5.93 6.3 7.48 1.9 7.65
## 7 ANTL 0 0 0 0 0 0 0 0 0 0 0 0
## 8 APAC 0.19 2.21 0.71 1.28 4.32 4.38 2 1.14 7.4 5.55 0.44 3.54
## 9 ARD2 0.12 7.31 2.36 1.93 6.84 1.45 0.9 4.26 13.6 11.3 0.52 5.03
## 10 ARDM 0 0 0 0 0 0 0 0 0 0 0 0
## # ... with 132 more rows, and 11 more variables: stnm <dbl>, name <chr>,
## # city <chr>, rang <dbl>, cdir <chr>, cnty <chr>, lat <dbl>, lon <dbl>,
## # elev <dbl>, cdiv <chr>, clas <chr>
The mesospatial
data frame now contains monthly rainfall summaries and geographic coordinates for each Mesonet station, so ggplot()
can be used to generate a quick map.
ggplot() +
geom_point(data = mesospatial,
aes(x = lon, y = lat, size = M1),
color = "black") +
scale_size_continuous(name="Rainfall (mm)") +
coord_equal() +
labs(title="Jan 2018 Rainfall in Oklahoma")
3.5 The Pipe Operator
As shown in some of the previous examples, it is a bit awkward to have to keep overwriting the same data frame in multiple steps. One way to streamline the code and avoid this problem is to use the pipe operator, %>%
. When a pipe is placed at the right side of a line of dplyr code, the data that are output from that function are passed as the first argument to the dplyr function on the next line. For example, compare the following code to the code that was executed previously. Using pipes, it is possible to generate the same data frame using a much more compact and readable block of code.
mesobig %>%
meso2018 <- filter(YEAR == 2018) %>%
select(STID, MONTH, RAIN) %>%
mutate(RAIN = replace(RAIN, RAIN <= -990 | RAIN >= 990, NA)) %>%
group_by(STID, MONTH) %>%
summarise(RAIN = sum(RAIN, na.rm=T)) %>%
spread(key = MONTH, value = RAIN, sep = "_") %>%
inner_join(geo_coords, by = c("STID" = "stid"))
## `summarise()` regrouping output by 'STID' (override with `.groups` argument)
meso2018
## # A tibble: 142 x 24
## # Groups: STID [142]
## STID MONTH_1 MONTH_2 MONTH_3 MONTH_4 MONTH_5 MONTH_6 MONTH_7 MONTH_8 MONTH_9
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ACME 0.13 2.12 0.67 1.8 4.77 4.25 3.04 1.71 10.2
## 2 ADAX 0.37 7.5 3.91 2.97 4.2 4.81 2.36 7.11 11.4
## 3 ALTU 0.01 0.79 0.54 0.97 3.62 2.24 1.31 7.05 3.51
## 4 ALV2 0 0.34 1.58 1.55 9.17 4.32 3.25 4.08 4.72
## 5 ALVA 0 0 0 0 0 0 0 0 0
## 6 ANT2 1.33 10.3 2.67 2.93 3.56 4.7 2.99 5.93 6.3
## 7 ANTL 0 0 0 0 0 0 0 0 0
## 8 APAC 0.19 2.21 0.71 1.28 4.32 4.38 2 1.14 7.4
## 9 ARD2 0.12 7.31 2.36 1.93 6.84 1.45 0.9 4.26 13.6
## 10 ARDM 0 0 0 0 0 0 0 0 0
## # ... with 132 more rows, and 14 more variables: MONTH_10 <dbl>,
## # MONTH_11 <dbl>, MONTH_12 <dbl>, stnm <dbl>, name <chr>, city <chr>,
## # rang <dbl>, cdir <chr>, cnty <chr>, lat <dbl>, lon <dbl>, elev <dbl>,
## # cdiv <chr>, clas <chr>
For more information on functions for data summarization with the dplyr and tidyr packages, check out the Data Import and Data Transoformation cheat sheets: https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf
3.6 Practice
Determine how many missing values there are for TMIN, TMAX, HMIN, HMAX, and RAIN in the mesobig data frame.
Starting with the mesobig data frame, generate a data frame that contains STID, YEAR, MONTH, DAY, HMIN, and HMAX values for all mesonet stations for the year 2000. Be sure to convert all observations with missing data to ‘NA’ values.
Starting with the mesobig data frame, generate a data frame that contains monthly mean TMAX values for the year 2016. This data frame should contain one row for each mesonet station and the data for each month should be in a separate column (similar to the meso2018 data frame that was created in the tutorial).
Use the data frame you created in (3) to generate a map of 2016 July maximum temperatures.