3 Data Abstraction

At this point, we’ve learned the basics of working with the R language. From here we’ll want to explore how to analyze data, both statistically and spatially. One part of this is abstracting information from existing data sets by selecting variables and observations and summarizing their statistics.

Some useful methods for data abstraction can be found in the various packages of “The Tidyverse” (Wickham 2017), which can be loaded all at once with the tidyverse package. We’ll start with dplyr, which includes an array of data manipulation tools: select for selecting variables, filter for subsetting observations, summarize for reducing variables to summary statistics (typically stratified by groups), and mutate for creating new variables from expressions based on existing variables. Other dplyr tools, such as data joins, we’ll look at later in the data transformation chapter.
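As a quick preview, here’s a minimal sketch of these verbs (plus group_by, which sets up the stratification), applied to the built-in mtcars data frame; the variable names created here are just for illustration:

library(dplyr)
mtcars %>%
  filter(cyl > 4) %>%                 # subset observations: 6- and 8-cylinder cars
  dplyr::select(cyl, wt, mpg) %>%     # retain just these variables
  mutate(wt_kg = wt * 453.6) %>%      # new variable: wt is in 1000s of pounds
  group_by(cyl) %>%                   # stratify by number of cylinders
  summarize(mean_mpg = mean(mpg),     # reduce each group to summary statistics
            mean_wt_kg = mean(wt_kg))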

3.1 Background: Exploratory Data Analysis

In 1961, John Tukey proposed a new approach to data analysis, defining it as “Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.”
He followed this up much later in 1977 with Exploratory Data Analysis.

Exploratory data analysis (EDA) is in part an approach to analyzing data via summaries, tables, and graphics. The key word is exploratory, in contrast with confirmatory statistics. Both are important, but ignoring exploration is ignoring enlightenment.

Some purposes of EDA are:

  • to suggest hypotheses
  • to assess assumptions on which inference will be based
  • to select appropriate inferential statistical tools
  • to guide further data collection

These concepts led to the development of S at Bell Labs (John Chambers, 1976) and later R, both built on a clear design and extensive graphics capabilities.

3.2 The Tidyverse and what we’ll explore in this chapter

The Tidyverse refers to a suite of R packages developed at RStudio (see R Studio and R for Data Science) for facilitating data processing and analysis. While R itself is designed around EDA, the Tidyverse takes it further. Some of the packages in the Tidyverse that are widely used are:

  • dplyr : data manipulation like a database
  • readr : better methods for reading and writing rectangular data
  • tidyr : reorganization methods that extend dplyr’s database capabilities
  • purrr : expanded programming toolkit including enhanced “apply” methods
  • tibble : improved data frame
  • stringr : string manipulation library
  • ggplot2 : graphing system based on the grammar of graphics

In this chapter, we’ll be mostly exploring dplyr, with a few other things thrown in like reading data frames with readr. For simplicity, we can just include library(tidyverse) to get everything.

3.3 Tibbles

Tibbles are an improved type of data frame

  • part of the Tidyverse
  • serve the same purpose as a data frame, and all data frame operations work

Advantages

  • display better
  • can be composed of more complex objects like lists, etc.
  • can be grouped

How created

  • Reading from a CSV, using one of a variety of Tidyverse functions similarly named to base functions:
    • read_csv() creates a tibble (in general, underscores are used in the Tidyverse)
  • read.csv() creates a regular data frame, which tibble() can then convert; note that the result may not be identical to the read_csv() result, since read.csv() may alter some variable names (as we’ll see below).

[air_quality project] (Toxic Release Inventory (TRI) Program, n.d.) You might consider opening or creating a new air_quality project to maintain related code and data.

library(tidyverse) # includes readr, ggplot2, and dplyr which we'll use in this chapter
library(iGIScData)
csvPath <- system.file("extdata","TRI_1987_BaySites.csv", package="iGIScData")
TRI87 <- read_csv(csvPath)
TRI87df <- read.csv(csvPath)
TRI87b <- tibble(TRI87df)
identical(TRI87, TRI87b)
## [1] FALSE
  • The tibble() function will also build a tibble from vectors.

We’ll start by looking at a couple of built-in character vectors (there are lots of things like this in R):

  • letters : lower case letters
  • LETTERS : upper case letters

letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
LETTERS
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

… then make a tibble of letters, LETTERS, and two random sets of 26 values, one normally distributed, the other uniform:

norm <- rnorm(26)
unif <- runif(26)
tibble26 <- tibble(letters,LETTERS,norm,unif)
tibble26
## # A tibble: 26 x 4
##    letters LETTERS    norm   unif
##    <chr>   <chr>     <dbl>  <dbl>
##  1 a       A       -0.394  0.167 
##  2 b       B       -0.0127 0.137 
##  3 c       C        0.796  0.282 
##  4 d       D        0.854  0.517 
##  5 e       E        0.641  0.576 
##  6 f       F        1.30   0.645 
##  7 g       G       -0.525  0.515 
##  8 h       H        1.90   0.748 
##  9 i       I       -1.85   0.0668
## 10 j       J       -1.30   0.0190
## # ... with 16 more rows

3.3.1 read_csv vs. read.csv

You might be tempted to use read.csv from base R:

  • They look a lot alike, so you might confuse them.
  • You don’t need to load library(readr).
  • read.csv “fixes” some things, and that might be desired: problematic field names like MLY-TAVG-NORMAL become MLY.TAVG.NORMAL.
  • Numbers stored as characters are converted to numbers: “01” becomes 1, “02” becomes 2, etc.

However, there are potential problems:

  • You may not want some of those changes, and may want to specify them separately.
  • There are known problems that read_csv avoids.

Recommendation: Use read_csv and write_csv.
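To see the field-name difference concretely, here’s a small sketch; the two-column CSV written to a temporary file is hypothetical, just for illustration:

library(readr)
tmp <- tempfile(fileext = ".csv")
writeLines(c("MLY-TAVG-NORMAL,site id", "12.3,A"), tmp)
names(read.csv(tmp))  # "MLY.TAVG.NORMAL" "site.id" : base R repairs the names
names(read_csv(tmp))  # "MLY-TAVG-NORMAL" "site id" : readr keeps them as-is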

3.4 Summarizing variable distributions

A simple statistical summary is very easy to do in R, and we’ll use eucoak data in the iGIScData package from a study of comparative runoff and erosion under Eucalyptus and oak canopies (Thompson, Davis, and Oliphant 2016). In this study, we looked at the amount of runoff and erosion captured in Gerlach troughs on paired eucalyptus and oak sites in the San Francisco Bay Area. You might consider creating a eucoak project, since we’ll be referencing this data set in several places.

library(iGIScData)
summary(eucoakrainfallrunoffTDR)
##      site               site #          date              month              rain_mm         rain_oak        rain_euc    
##  Length:90          Min.   :1.000   Length:90          Length:90          Min.   : 1.00   Min.   : 1.00   Min.   : 1.00  
##  Class :character   1st Qu.:2.000   Class :character   Class :character   1st Qu.:16.00   1st Qu.:16.00   1st Qu.:14.75  
##  Mode  :character   Median :4.000   Mode  :character   Mode  :character   Median :28.50   Median :30.50   Median :30.00  
##                     Mean   :4.422                                         Mean   :37.99   Mean   :35.08   Mean   :34.60  
##                     3rd Qu.:6.000                                         3rd Qu.:63.25   3rd Qu.:50.50   3rd Qu.:50.00  
##                     Max.   :8.000                                         Max.   :99.00   Max.   :98.00   Max.   :96.00  
##                                                                           NA's   :18      NA's   :2       NA's   :2      
##   runoffL_oak      runoffL_euc      slope_oak       slope_euc       aspect_oak      aspect_euc    surface_tension_oak
##  Min.   : 0.000   Min.   : 0.00   Min.   : 9.00   Min.   : 9.00   Min.   :100.0   Min.   :106.0   Min.   :37.40      
##  1st Qu.: 0.000   1st Qu.: 0.07   1st Qu.:12.00   1st Qu.:12.00   1st Qu.:143.0   1st Qu.:175.0   1st Qu.:72.75      
##  Median : 0.450   Median : 1.20   Median :24.50   Median :23.00   Median :189.0   Median :196.5   Median :72.75      
##  Mean   : 2.032   Mean   : 2.45   Mean   :21.62   Mean   :19.34   Mean   :181.9   Mean   :191.2   Mean   :68.35      
##  3rd Qu.: 2.800   3rd Qu.: 3.30   3rd Qu.:30.50   3rd Qu.:25.00   3rd Qu.:220.0   3rd Qu.:224.0   3rd Qu.:72.75      
##  Max.   :14.000   Max.   :16.00   Max.   :32.00   Max.   :31.00   Max.   :264.0   Max.   :296.0   Max.   :72.75      
##  NA's   :5        NA's   :3                                                                       NA's   :22         
##  surface_tension_euc runoff_rainfall_ratio_oak runoff_rainfall_ratio_euc
##  Min.   :28.51       Min.   :0.00000           Min.   :0.000000         
##  1st Qu.:32.79       1st Qu.:0.00000           1st Qu.:0.003027         
##  Median :37.40       Median :0.02046           Median :0.047619         
##  Mean   :43.11       Mean   :0.05357           Mean   :0.065902         
##  3rd Qu.:56.41       3rd Qu.:0.08485           3rd Qu.:0.083603         
##  Max.   :72.75       Max.   :0.42000           Max.   :0.335652         
##  NA's   :22          NA's   :5                 NA's   :3

In the summary output, how are character variables handled differently from numeric ones?

Remembering what we discussed in the previous chapter, consider the site variable, and in particular its Length. Looking at the table, what does that length represent?

Explore the map to learn more about the site variable (spelled “Site” in this CSV):

Figure 3.1: Study of runoff & erosion under Eucalyptus & Oak canopy

There are a couple of ways of seeing what unique values exist in a character variable like site, which can be considered a categorical variable (factor). Consider what these return:

unique(eucoakrainfallrunoffTDR$site)
## [1] "AB1" "AB2" "KM1" "PR1" "TP1" "TP2" "TP3" "TP4"
factor(eucoakrainfallrunoffTDR$site)
##  [1] AB1 AB1 AB1 AB1 AB1 AB1 AB1 AB1 AB1 AB1 AB1 AB1 AB2 AB2 AB2 AB2 AB2 AB2 AB2 AB2 AB2 AB2 AB2 AB2 KM1 KM1 KM1 KM1 KM1 KM1 KM1 KM1 KM1
## [34] KM1 KM1 KM1 PR1 PR1 PR1 PR1 PR1 PR1 PR1 PR1 PR1 PR1 TP1 TP1 TP1 TP1 TP1 TP1 TP1 TP1 TP1 TP1 TP1 TP2 TP2 TP2 TP2 TP2 TP2 TP2 TP2 TP2
## [67] TP2 TP2 TP3 TP3 TP3 TP3 TP3 TP3 TP3 TP3 TP3 TP3 TP3 TP4 TP4 TP4 TP4 TP4 TP4 TP4 TP4 TP4 TP4 TP4
## Levels: AB1 AB2 KM1 PR1 TP1 TP2 TP3 TP4

3.4.1 Stratifying variables by site using a Tukey box plot

A good way to look at variable distributions stratified by a sample site factor is the Tukey box plot. We’ll be looking more at this and other visualization methods in the next chapter.

ggplot(data = eucoakrainfallrunoffTDR) + geom_boxplot(mapping = aes(x=site, y=runoffL_euc))

Figure 3.2: Tukey boxplot of runoff under Eucalyptus canopy

3.5 Database operations with dplyr

As part of exploring our data, we’ll typically simplify or reduce it for our purposes. The following methods are quickly discovered to be essential as part of exploring and analyzing data.

  • select rows using logic, such as population > 10000, with filter
  • select variable columns you want to retain with select
  • add new variables and assign their values with mutate
  • sort rows based on a field with arrange
  • summarize by group

3.5.1 Select, mutate, and the pipe

The pipe %>%: Read %>% as “and then…” This is bigger than it sounds and opens up a lot of possibilities.

See the example below, and observe how the expression becomes several lines long. In the process, we’ll see examples of creating new variables with mutate and of selecting (and thereby ordering) variables. [If you haven’t already created it, this should be in a eucoak project]

runoff <- eucoakrainfallrunoffTDR %>%
  mutate(Date = as.Date(date,"%m/%d/%Y"),
         rain_subcanopy = (rain_oak + rain_euc)/2) %>%
  dplyr::select(site, Date, rain_mm, rain_subcanopy, 
         runoffL_oak, runoffL_euc, slope_oak, slope_euc)
library(DT)
DT::datatable(runoff,options=list(scrollX=T))

Another very useful way of thinking of the pipe is that whatever goes before it becomes the first parameter of the function that follows. So in the example above:

  1. eucoakrainfallrunoffTDR becomes the first parameter for mutate(), then
  2. the result of the mutate() becomes the first parameter for dplyr::select()
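To see this equivalence, compare a nested call with its piped form (a small sketch; runoff_total is just an illustrative name):

nested <- dplyr::select(
  mutate(eucoakrainfallrunoffTDR, runoff_total = runoffL_oak + runoffL_euc),
  site, runoff_total)
piped <- eucoakrainfallrunoffTDR %>%
  mutate(runoff_total = runoffL_oak + runoffL_euc) %>%
  dplyr::select(site, runoff_total)
identical(nested, piped)  # TRUE : the pipe just rearranges the same calls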

Side note: to just rename a variable, use rename instead of mutate. It will stay in position.
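For example, this keeps the rainfall variable in its original position, just renamed (a one-line sketch; rain is a hypothetical new name):

runoffRenamed <- runoff %>% rename(rain = rain_mm)  # new_name = old_name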

3.5.1.1 Review: creating penguins from penguins_raw

To review some of these methods, it’s useful to consider how the penguins data frame was created from the more complex penguins_raw data frame, both of which are part of the palmerpenguins package (Horst, Hill, and Gorman 2020). First let’s look at palmerpenguins::penguins_raw:

library(palmerpenguins)
library(tidyverse)
library(lubridate)
summary(penguins_raw)
##   studyName         Sample Number      Species             Region             Island             Stage           Individual ID     
##  Length:344         Min.   :  1.00   Length:344         Length:344         Length:344         Length:344         Length:344        
##  Class :character   1st Qu.: 29.00   Class :character   Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Median : 58.00   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                     Mean   : 63.15                                                                                                 
##                     3rd Qu.: 95.25                                                                                                 
##                     Max.   :152.00                                                                                                 
##                                                                                                                                    
##  Clutch Completion     Date Egg          Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g)      Sex           
##  Length:344         Min.   :2007-11-09   Min.   :32.10      Min.   :13.10     Min.   :172.0       Min.   :2700   Length:344        
##  Class :character   1st Qu.:2007-11-28   1st Qu.:39.23      1st Qu.:15.60     1st Qu.:190.0       1st Qu.:3550   Class :character  
##  Mode  :character   Median :2008-11-09   Median :44.45      Median :17.30     Median :197.0       Median :4050   Mode  :character  
##                     Mean   :2008-11-27   Mean   :43.92      Mean   :17.15     Mean   :200.9       Mean   :4202                     
##                     3rd Qu.:2009-11-16   3rd Qu.:48.50      3rd Qu.:18.70     3rd Qu.:213.0       3rd Qu.:4750                     
##                     Max.   :2009-12-01   Max.   :59.60      Max.   :21.50     Max.   :231.0       Max.   :6300                     
##                                          NA's   :2          NA's   :2         NA's   :2           NA's   :2                        
##  Delta 15 N (o/oo) Delta 13 C (o/oo)   Comments        
##  Min.   : 7.632    Min.   :-27.02    Length:344        
##  1st Qu.: 8.300    1st Qu.:-26.32    Class :character  
##  Median : 8.652    Median :-25.83    Mode  :character  
##  Mean   : 8.733    Mean   :-25.69                      
##  3rd Qu.: 9.172    3rd Qu.:-25.06                      
##  Max.   :10.025    Max.   :-23.79                      
##  NA's   :14        NA's   :13

Now let’s create the simpler penguins data frame. We’ll use rename for a couple of variables, but most require mutate, whether to manipulate strings (we’ll get to that later), create factors, or convert to integers. We’ll also rename some variables to avoid having to use backticks.

penguins <- penguins_raw %>%
  rename(bill_length_mm = `Culmen Length (mm)`,
         bill_depth_mm = `Culmen Depth (mm)`) %>%
  mutate(species = factor(word(Species)),
         island = factor(Island),
         flipper_length_mm = as.integer(`Flipper Length (mm)`),
         body_mass_g = as.integer(`Body Mass (g)`),
         sex = factor(str_to_lower(Sex)),
         year = as.integer(year(ymd(`Date Egg`)))) %>%
  dplyr::select(species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, year)
summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm   flipper_length_mm  body_mass_g       sex           year     
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10   Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60   1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30   Median :197.0     Median :4050   NA's  : 11   Median :2008  
##                                  Mean   :43.92   Mean   :17.15   Mean   :200.9     Mean   :4202                Mean   :2008  
##                                  3rd Qu.:48.50   3rd Qu.:18.70   3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##                                  Max.   :59.60   Max.   :21.50   Max.   :231.0     Max.   :6300                Max.   :2009  
##                                  NA's   :2       NA's   :2       NA's   :2         NA's   :2

Unfortunately, they don’t end up exactly identical, though all of the variables are identical as vectors:

identical(penguins, palmerpenguins::penguins)
## [1] FALSE

Anyone want to figure out what’s different?
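One place to start is base R’s all.equal, which describes differences that identical only detects; a sketch, with the second test confirming the note above that the variables themselves match:

all.equal(penguins, palmerpenguins::penguins)  # reports any attribute differences
all(sapply(names(penguins), function(v)
  identical(penguins[[v]], palmerpenguins::penguins[[v]])))  # TRUE, per the note above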

3.5.1.2 Helper functions for dplyr::select()

In the select() example above, we listed all of the variables, but there are a variety of helper functions for using logic to specify which variables to select (a short sketch follows the list):

  • contains("_") or any substring of interest in the variable name
  • starts_with("runoff")
  • ends_with("euc")
  • everything()
  • matches() a regular expression
  • num_range("x",1:5) for the common situation where a series of variable names combine a string and a number
  • one_of(myList) for when you have a group of variable names
  • range of variables: e.g. runoffL_oak:slope_euc could have followed rain_subcanopy above
  • all but (-): preface a variable or a set of variable names with - to select all others
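Here are a few of these applied to the runoff tibble created above (a sketch; output not shown):

runoff %>% dplyr::select(starts_with("runoff"))   # runoffL_oak and runoffL_euc
runoff %>% dplyr::select(site, contains("_"))     # site plus every name containing "_"
runoff %>% dplyr::select(-slope_oak, -slope_euc)  # all but the slope variables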

3.5.2 filter

filter lets you select observations that meet criteria, similar to an SQL WHERE clause.

runoff2007 <- runoff %>%
  filter(Date >= as.Date("01/01/2007", "%m/%d/%Y"))
DT::datatable(runoff2007,options=list(scrollX=T))

3.5.2.1 Filtering out NA with !is.na

Here’s an important one. There are many times you need to avoid NAs.
We commonly see summary statistics using na.rm = TRUE in order to ignore NAs when calculating a statistic like mean.

To simply filter out NAs from a vector or a variable, use a filter: feb_filt <- feb_s %>% filter(!is.na(TEMP))
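The feb_s data frame comes from a later example; with the runoff tibble we already have, the same pattern looks like this (oakFiltered is a hypothetical name):

oakFiltered <- runoff %>% filter(!is.na(runoffL_oak))  # drop rows where oak runoff is NA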

3.5.3 Writing a data frame to a csv

Let’s say you have created a data frame, maybe with read_csv

runoff20062007 <- read_csv(csvPath)

Then you do some processing to change it, maybe adding variables, reorganizing, etc., and you want to write out your new eucoak data frame; you just need write_csv:

write_csv(eucoak, "data/tidy_eucoak.csv")

3.5.4 Summarize by group

You’ll find that you need to use this all the time with real data. You have a bunch of data where some categorical variable defines a grouping, like our site field in the eucoak data, and we’d like to create average slope, rainfall, and runoff for each site. Note that it involves two steps: first defining which field defines the group, then the various summary statistics we’d like to store. In this case all of the slopes under oak remain the same for a given site, since slope is a site characteristic, and the same applies to the euc site, so we can just grab the first value (mean would also have worked, of course).

eucoakSiteAvg <- runoff %>%
  group_by(site) %>%
  summarize(
    rain = mean(rain_mm, na.rm = TRUE),
    rain_subcanopy = mean(rain_subcanopy, na.rm = TRUE),
    runoffL_oak = mean(runoffL_oak, na.rm = TRUE),
    runoffL_euc = mean(runoffL_euc, na.rm = TRUE),
    slope_oak = first(slope_oak),
    slope_euc = first(slope_euc)
  )
eucoakSiteAvg
## # A tibble: 8 x 7
##   site   rain rain_subcanopy runoffL_oak runoffL_euc slope_oak slope_euc
##   <chr> <dbl>          <dbl>       <dbl>       <dbl>     <dbl>     <dbl>
## 1 AB1    48.4           43.1      6.80         6.03       32          31
## 2 AB2    34.1           35.4      4.91         3.65       24          25
## 3 KM1    48             36.1      1.94         0.592      30.5        25
## 4 PR1    56.5           37.6      0.459        2.31       27          23
## 5 TP1    38.4           30.0      0.877        1.66        9           9
## 6 TP2    34.3           32.9      0.0955       1.53       12          10
## 7 TP3    32.1           27.8      0.381        0.815      25          18
## 8 TP4    32.5           35.7      0.231        2.83       12          12

Summarizing by group with TRI data [air_quality project] (Toxic Release Inventory (TRI) Program, n.d.)

csvPath <- system.file("extdata","TRI_2017_CA.csv", package="iGIScData")
TRI_BySite <- read_csv(csvPath) %>%
  mutate(all_air = `5.1_FUGITIVE_AIR` + `5.2_STACK_AIR`) %>%
  filter(all_air > 0) %>%
  group_by(FACILITY_NAME) %>%
  summarize(
    FACILITY_NAME = first(FACILITY_NAME),
    air_releases = sum(all_air, na.rm = TRUE),
    mean_fugitive = mean(`5.1_FUGITIVE_AIR`, na.rm = TRUE), 
    LATITUDE = first(LATITUDE), LONGITUDE = first(LONGITUDE))

3.5.5 Count

Count is a simple variant on summarize by group, since the only statistic is the count of events. [eucoak project]

tidy_eucoak %>% count(tree)
## # A tibble: 2 x 2
##   tree      n
##   <chr> <int>
## 1 euc      90
## 2 oak      90

Another way is to use n():

tidy_eucoak %>%
  group_by(tree) %>%
  summarize(n = n())
## # A tibble: 2 x 2
##   tree      n
##   <chr> <int>
## 1 euc      90
## 2 oak      90

3.5.6 Sorting after summarizing

Using the marine debris data from the Marine Debris Monitoring and Assessment Project (Marine Debris Program, n.d.) [in a new litter project]

shorelineLatLong <- ConcentrationReport %>%
  group_by(`Shoreline Name`) %>%
  summarize(
    latitude = mean((`Latitude Start`+`Latitude End`)/2),
    longitude = mean((`Longitude Start`+`Longitude End`)/2)
  ) %>%
  arrange(latitude)
shorelineLatLong
## # A tibble: 38 x 3
##    `Shoreline Name`   latitude longitude
##    <chr>                 <dbl>     <dbl>
##  1 Aimee Arvidson         33.6     -118.
##  2 Balboa Pier #2         33.6     -118.
##  3 Bolsa Chica            33.7     -118.
##  4 Junipero Beach         33.8     -118.
##  5 Malaga Cove            33.8     -118.
##  6 Zuma Beach, Malibu     34.0     -119.
##  7 Zuma Beach             34.0     -119.
##  8 Will Rodgers           34.0     -119.
##  9 Carbon Beach           34.0     -119.
## 10 Nicholas Canyon        34.0     -119.
## # ... with 28 more rows

3.6 The dot operator

The dot “.” operator derives from UNIX syntax, and refers to “here.”

  • For accessing files in the current folder, the path is “./filename”

A similar specification is used in piped sequences

  • The advantage of the pipe is you don’t have to keep referencing the data frame.
  • The dot is then used to connect to items inside the data frame [in air_quality]
csvPath <- system.file("extdata","TRI_1987_BaySites.csv", package="iGIScData")
TRI87 <- read_csv(csvPath)
stackrate <- TRI87 %>%
  mutate(stackrate = stack_air/air_releases) %>%
  .$stackrate
head(stackrate)
## [1] 0.0000000 0.0000000 0.0000000 0.0000000 0.6666667 1.0000000
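An alternative worth knowing: dplyr’s pull() extracts a single variable as a vector, so the following sketch should produce the same vector as the .$stackrate approach above:

stackrate2 <- TRI87 %>%
  mutate(stackrate = stack_air/air_releases) %>%
  pull(stackrate)
identical(stackrate, stackrate2)  # TRUE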

3.7 String Abstraction

Character string manipulation is surprisingly critical to data analysis, and so the stringr package was developed to provide a wider array of string processing tools than what is in base R, including functions for detecting matches, subsetting strings, managing lengths, replacing substrings with other text, and joining, splitting, and sorting strings.

We’ll look at some of the stringr functions, but a good way to learn about the wide array of functions is through the cheat sheet that can be downloaded from https://www.rstudio.com/resources/cheatsheets/.

3.7.1 Detecting matches

These functions look for patterns within existing strings, which can then be used to subset observations based on those patterns. [These can be investigated in your generic_methods project]

  • str_detect detects patterns in a string, returns true or false if detected
  • str_locate detects patterns in a string, returns start and end position if detected, or NA if not
  • str_which returns the indices of strings that match a pattern
  • str_count counts the number of matches in each string
str_detect(fruit,"qu")
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## [45] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
fruit[str_detect(fruit,"qu")]
## [1] "kumquat" "loquat"  "quince"
tail(str_locate(fruit, "qu"),15)
##       start end
## [66,]    NA  NA
## [67,]     1   2
## [68,]    NA  NA
## [69,]    NA  NA
## [70,]    NA  NA
## [71,]    NA  NA
## [72,]    NA  NA
## [73,]    NA  NA
## [74,]    NA  NA
## [75,]    NA  NA
## [76,]    NA  NA
## [77,]    NA  NA
## [78,]    NA  NA
## [79,]    NA  NA
## [80,]    NA  NA
str_which(fruit, "qu")
## [1] 43 46 67
fruit[str_which(fruit,"qu")]
## [1] "kumquat" "loquat"  "quince"
str_count(fruit,"qu")
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [67] 1 0 0 0 0 0 0 0 0 0 0 0 0 0

3.7.2 Subsetting Strings

Subsetting in this case includes its normal use of abstracting the observations specified by a match (similar to a filter for data frames), but also extracting just part of a string, specified either by start and end character positions or by a pattern match.

  • str_sub extracts part of a string from a start to an end character position
  • str_subset returns the strings that contain a pattern match
  • str_extract returns the first pattern match (or all matches, with str_extract_all)
  • str_match returns the first (or _all) pattern match as a matrix
qfruit <- str_subset(fruit, "q")
qfruit
## [1] "kumquat" "loquat"  "quince"
str_sub(qfruit,1,2)
## [1] "ku" "lo" "qu"
str_sub("94132",1,2)
## [1] "94"
str_extract(qfruit,"[aeiou]")
## [1] "u" "o" "u"

3.7.3 String Length

The length of strings is often useful in an analysis process, either just knowing the length as an integer, or purposefully increasing or reducing it.

  • str_length simply returns the length of the string as an integer
  • str_pad adds a specified character (typically a space " ") to either end of a string
  • str_trim removes whitespace from either end of a string
qfruit <- str_subset(fruit,"q")
qfruit
## [1] "kumquat" "loquat"  "quince"
str_length(qfruit)
## [1] 7 6 6
name <- "Inigo Montoya"
str_length(name)
## [1] 13
firstname <- str_sub(name,1,str_locate(name," ")[1]-1)
firstname
## [1] "Inigo"
lastname <- str_sub(name,str_locate(name," ")[1]+1,str_length(name))
lastname
## [1] "Montoya"
str_pad(qfruit,10,"both")
## [1] " kumquat  " "  loquat  " "  quince  "
str_trim(str_pad(qfruit,10,"both"),"both")
## [1] "kumquat" "loquat"  "quince"

3.7.4 Replacing substrings with other text (“mutating” strings)

These methods range from converting case to replacing substrings.

  • str_to_lower converts strings to lower case
  • str_to_upper converts strings to upper case
  • str_to_title capitalizes strings (makes the first character of each word upper case)
  • str_sub has a special assignment use that replaces a substring with a specified string
  • str_replace replaces the first matched pattern (or all with str_replace_all) with a specified string
str_to_lower(name)
## [1] "inigo montoya"
str_to_upper(name)
## [1] "INIGO MONTOYA"
str_to_title("for whom the bell tolls")
## [1] "For Whom The Bell Tolls"
str_sub(name,1,str_locate(name," ")[1]-1) <- "Diego" # replaces "Inigo": name is now "Diego Montoya"
str_replace(qfruit,"q","z")
## [1] "kumzuat" "lozuat"  "zuince"

3.7.5 Concatenating and splitting

One very common string operation is concatenating strings, and somewhat less common though useful is splitting them on a key separator like a space, comma, or line end. One use of str_c in the example below is to create a comparable join field from a numeric character string that may need a zero added at the left or right.

  • str_c concatenates strings. The paste() function in base R also works, but its default separator is " " (a space); str_c is just paste with a default "" (empty) separator, and you can still specify sep = " " when you want spaces.
  • str_split splits a string into parts based upon the detection of a specified separator like space, comma, or line end
str_split("for whom the bell tolls", " ")
## [[1]]
## [1] "for"   "whom"  "the"   "bell"  "tolls"
str_c("for","whom","the","bell","tolls",sep=" ")
## [1] "for whom the bell tolls"
csvPath <- system.file("extdata","CA_MdInc.csv",package="iGIScData")
CA_MdInc <- read_csv(csvPath)
join_id <- str_c("0",CA_MdInc$NAME) # could also use str_pad(CA_MdInc$NAME,9,side="left",pad="0")
head(CA_MdInc)
## # A tibble: 6 x 3
##         trID     NAME HHinc2016
##        <dbl>    <dbl>     <dbl>
## 1 6001400100 60014001    177417
## 2 6001400200 60014002    153125
## 3 6001400300 60014003     85313
## 4 6001400400 60014004     99539
## 5 6001400500 60014005     83650
## 6 6001400600 60014006     61597
head(join_id)
## [1] "060014001" "060014002" "060014003" "060014004" "060014005" "060014006"

3.8 Dates and times with lubridate

The lubridate package makes it easy to work with dates and times.

  • Can parse many forms
  • We’ll look at more with time series
  • See the cheat sheet for more information, but the following examples may demonstrate that it’s pretty easy to use, and does a good job of making your job easier.
library(lubridate)
dmy("20 September 2020")
## [1] "2020-09-20"
dmy_hm("20 September 2020 10:45")
## [1] "2020-09-20 10:45:00 UTC"
mdy_hms("September 20, 2020 10:48")
## [1] "2020-09-20 20:10:48 UTC"
mdy_hm("9/20/20 10:50")
## [1] "2020-09-20 10:50:00 UTC"
mdy("9.20.20")
## [1] "2020-09-20"
start704 <- dmy_hm("24 August 2020 16:00")
end704 <- mdy_hm("12/18/2020 4:45 pm")
year(start704)
## [1] 2020
month(start704)
## [1] 8
day(end704)
## [1] 18
hour(end704)
## [1] 16
end704-start704
## Time difference of 116.0312 days
as_date(end704)
## [1] "2020-12-18"
hms::as_hms(end704)
## 16:45:00

You will find that dates are typically displayed as yyyy-mm-dd, e.g. “2020-09-20” above. You can display them other ways of course, but it’s useful to write dates this way since they sort better, with the highest order coming first.
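A quick demonstration of that sorting property, treating the dates as plain character strings:

sort(c("2020-12-18", "2019-11-02", "2020-09-20"))
## [1] "2019-11-02" "2020-09-20" "2020-12-18"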

Note the use of :: after the package

Sometimes you need to specify the package and function name this way, for instance if more than one package has a function of the same name. You can also use this method to call a function without having loaded its library.

`dplyr::select(...)`

3.9 Exercises

  1. Create a tibble with 20 rows of two variables norm and unif with norm created with rnorm() and unif created with runif().

  2. Read in “TRI_2017_CA.csv” in two ways, as a normal data frame assigned to df and as a tibble assigned to tbl. What field names result for what’s listed in the CSV as 5.1_FUGITIVE_AIR?

  3. Use the summary function to investigate the variables in either the data.frame or tibble you just created. What type of field and what values are assigned to BIA_CODE?

  4. Create a boxplot of body_mass_g by species from the penguins data frame in the palmerpenguins package (Horst, Hill, and Gorman 2020). Access the data with data(package = "palmerpenguins"), and also remember library(ggplot2) or library(tidyverse).

  5. Use select, mutate, and the pipe to create a penguinMass tibble where the only original variable retained is species, but with body_mass_kg created as \(\frac{1}{1000}\) the body_mass_g. The statement should start with penguinMass <- penguins and use a pipe plus the other functions after that.

  6. Now, also with penguins, create FemaleChinstraps to include only the female Chinstrap penguins. Start with FemaleChinstraps <- penguins %>%

  7. Now, summarize by species groups to create mean and standard deviation variables from bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g. Preface the variable names with either avg. or sd. Include na.rm=T with all statistics function calls.

  8. Sort the penguins by body_mass_g.

  9. Using stringr methods, detect and print out a vector of sites in the ConcentrationReport that are parks (include the substring “Park” in the Shoreline Name), then detect and print out the longest shoreline name.

  10. The sierraStations data frame has weather stations only in California, but includes “CA US” at the end of the name. Use dplyr and stringr methods to create a new STATION_NAME that truncates the name, removing that part at the end.

  11. Use dplyr and lubridate methods to create a new date-type variable sampleDate from the existing DATE character variable in soilCO2_97, then use this to plot a graph of soil CO2 over time, with sampleDate as x, and CO2% as y. Then use code to answer the question: What’s the length of time (time difference) from beginning to end for this data set?

References

Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://allisonhorst.github.io/palmerpenguins/.
Marine Debris Program. n.d. NOAA Office of Response and Restoration. https://marinedebris.noaa.gov/.
Thompson, A., J. D. Davis, and A. J. Oliphant. 2016. “Surface Runoff and Soil Erosion Under Eucalyptus and Oak Canopy.” Earth Surface Processes and Landforms. https://doi.org/10.1002/esp.3881.
Toxic Release Inventory (TRI) Program. n.d. U.S. Environmental Protection Agency. https://www.epa.gov/toxics-release-inventory-tri-program.
Wickham, Hadley. 2017. “The Tidyverse.” R Package Ver 1 (1): 1.