1 Data manipulation

This is Chapter covers the core tabular manipulation syntax of data : useing R package dplyr

1.1 Data pipelnes:

Piping allows us to create series of connected transformations that goes from the source to the destination. When we write a code using pipes %>% we do a step by step approach and refer to the pipe operator as “and then”eg. get the iris data then filter to setosa records then create a column “area” then calculate the average”.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
is_setosa <- iris$Species=="setosa"
setosa <- iris[is_setosa, ]
setosa_area <- setosa$Sepal.Length * setosa$Sepal.Width
avg_area <- mean(setosa_area)

1.1.1 Example:

R code length(toupper(letters)) and translate it into a pipeline: 1) Get the letters object 2) Apply the toupper() function to letters 3) Count the results using length()

letters %>%
toupper() %>%
length()
## [1] 26

Now, let’s go the other way and translate a pipeline into nested functions: 1. Get the iris dataset 2. Take some rows from the top of the dataset 3. Count how many rows are in the subset * equivalent to nrow(head(iris))

iris %>%
head() %>%
nrow()
## [1] 6

1.2 Assigning results:

iris_filtered <- iris %>% head()
iris_filtered2 <- iris %>%
head()
iris_filtered
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
iris_filtered2
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Because of the way data pipelines flow - top to bottom - left to right it feels somewhat counter-intuitive to have the output and input on the same line. So we can use the right hand side (RHS) operator ->

iris %>%
head() ->
iris_filtered3

1.3 Disorganized functions:

Sometimes functions that we encounter do not put the key input as the first argument. - e.g. linear model function starts with description of the model and just after the dataset

my_lm <- lm(Sepal.Width~Sepal.Length, iris)

but using the pipe operator we can use a dot (.) as a place holder in the place of the argument data

iris %>%
lm(Sepal.Width~Sepal.Length, .) ->
my_lm

1.4 Filtering columns:

To extract columns from our dataset, we can use the select() function to provide instructions about which columns we want or don’t want. select() takes a comma separated list of instructions.

iris %>%
select(Species, Sepal.Length)%>%
head(n=3)
##   Species Sepal.Length
## 1  setosa          5.1
## 2  setosa          4.9
## 3  setosa          4.7

If we have lots of columns we want to include but a few we want to exclude, we can use a minus (-) before a column name to say to exclude it.

iris %>%
select(-Species)%>%
head(n=3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2

For a range of columns:

iris %>%
select(Sepal.Length:Petal.Length)%>%
head(n=3)
##   Sepal.Length Sepal.Width Petal.Length
## 1          5.1         3.5          1.4
## 2          4.9         3.0          1.4
## 3          4.7         3.2          1.3

To exclude a range of columns:

iris %>%
select(-(Sepal.Length:Petal.Length))%>%
head(n=3)
##   Petal.Width Species
## 1         0.2  setosa
## 2         0.2  setosa
## 3         0.2  setosa

1.5 Name-based seldection:

  1. starts_with() will return columns where the string you provide is at the beginning of a column name
  2. ends_with() will return columns where the string you provide is at the end of a column name
  3. contains() will return columns where the string you provide is anywhere in the column name
  4. num_range() will allow you to return columns with names like “year 2020” through to “year_2021” by providing a prefix and a numeric range you want to select
  5. matches() allows you to provide pattern that a column name must conform to in order to be returned 6)one_of() allows you to provide a vector of column names that you would like to be matched in their entirety

1.5.1 Examples

Starts_with

iris %>%
select(starts_with("S"))%>%
  head(n=3)
##   Sepal.Length Sepal.Width Species
## 1          5.1         3.5  setosa
## 2          4.9         3.0  setosa
## 3          4.7         3.2  setosa

Ends_with

iris %>%
select(ends_with("s"))%>%
  head(n=3)
##   Species
## 1  setosa
## 2  setosa
## 3  setosa

Contains

iris %>%
select(contains("Length"))%>%
  head(n=3)
##   Sepal.Length Petal.Length
## 1          5.1          1.4
## 2          4.9          1.4
## 3          4.7          1.3

or

iris %>%
select(Petal.Width:Species, -contains("Length"))%>%
  head(n=3)
##   Petal.Width Species
## 1         0.2  setosa
## 2         0.2  setosa
## 3         0.2  setosa

1.6 Content based on selection:

Select columns based on the boolean results of functions applied to the data in the columns with select_if() Select all numeric columns from the iris dataset.

iris %>%
select_if(is.numeric)%>%
  head(n=3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2

1.7 Advanced conditional selection:

Create conditional statements on the fly by using a tilde (~) and then providing some code that results in a boolean. You tell the statement where the column of data should go by using the place holder symbol (.).

iris %>%
select_if(~is.numeric(.) & n_distinct(.)>30)%>%
  head(n=3)
##   Sepal.Length Petal.Length
## 1          5.1          1.4
## 2          4.9          1.4
## 3          4.7          1.3

1.8 Filtering rows:

Row-position selection The slice() function takes a vector of values that denote positions

iris %>%
slice(1:5)%>%
  head(n=3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa

1.9 Conditional selection:

To select rows based on a logical statement use filter()

iris %>%
filter(Species=="virginica")%>%
  head(n=3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 1          6.3         3.3          6.0         2.5 virginica
## 2          5.8         2.7          5.1         1.9 virginica
## 3          7.1         3.0          5.9         2.1 virginica

Instead of needing to write multiple filter() commands, we can put multiple logical conditions inside a single filter() function

iris %>%
filter(Species == "virginica",
Sepal.Length > mean(Sepal.Length))%>%
  head(n=3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 1          6.3         3.3          6.0         2.5 virginica
## 2          7.1         3.0          5.9         2.1 virginica
## 3          6.3         2.9          5.6         1.8 virginica

1.10 Working with column names:

Rename columns by selecting them but providing a new name on the left-hand side of an equals operator (=) when doing a select(). This is useful if you want to filter columns and rename at the same time.

iris %>%
select(sepal_width=Sepal.Width, species=Species)%>%
  head(n=3)
##   sepal_width species
## 1         3.5  setosa
## 2         3.0  setosa
## 3         3.2  setosa

1.11 My Data Example

Water samples were collected on 6/6/18, 9/24/18, 12/17/18, 1/26/19 and ran on an autoanalyzer to determine NO3-N, NH4-N, PO4, and Chl-a concentrations across 8 tanks identified by treatments (control "A" or nutrient enriched "N"). Moreover this data was used in my master's thesis.

library(readr)
Porewater_data <- read_csv("~/Desktop/R folder/Bookdown/Porewater_data.csv")
head(Porewater_data, n=3)
## # A tibble: 3 x 7
##    Tank Date   `NO3 (uM)` `NH4 (uM)` `PO4 (uM)` `Conc of chl (ug/… Nutrient
##   <dbl> <chr>       <dbl>      <dbl>      <dbl>              <dbl> <chr>   
## 1     1 6/6/18       2.07       1.35       3.08               2.77 A       
## 2     2 6/6/18       4.68       2.48       2.45               5.3  N       
## 3     3 6/6/18       3.05       1.77       4.62               2.99 A

Here I read in my porewater dataset and headed the first three lines.

1.11.1 Evaluating datasets

Porewater_data %>%
head() %>%
nrow()
## [1] 6

Using the pipe we determined that there are 6 rows in the dataset porewater_data.

1.11.2 Assigning results:

pw_filtered <- Porewater_data %>% head()
pw_filtered
## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 6 x 7
##    Tank Date   `NO3 (uM)` `NH4 (uM)` `PO4 (uM)` `Conc of chl (ug/… Nutrient
##   <dbl> <chr>       <dbl>      <dbl>      <dbl>              <dbl> <chr>   
## 1     1 6/6/18       2.07       1.35       3.08               2.77 A       
## 2     2 6/6/18       4.68       2.48       2.45               5.3  N       
## 3     3 6/6/18       3.05       1.77       4.62               2.99 A       
## 4     4 6/6/18       5.46       3.64       4.44               4.75 N       
## 5     5 6/6/18       2.91       0.91       2.44               2.47 A       
## 6     6 6/6/18       4.82       3.46       3.11               5.25 N

Here I assigned the data set as pw_filtered. This would be useful to set aside newly created datatables using dplyr.

1.11.3 Disorganized functions:

Porewater_data %>%
lm(`NH4 (uM)`~Nutrient, .) ->
my_lm
my_lm
## 
## Call:
## lm(formula = `NH4 (uM)` ~ Nutrient, data = .)
## 
## Coefficients:
## (Intercept)    NutrientN  
##      1.8431      -0.1225

Sometimes functions that we encounter do not put the key input as the first argument. - e.g. linear model function starts with description of the model and just after the dataset but using the pipe operator we can use a dot (.) as a place holder in the place of the argument data.

1.11.4 Filtering columns:

Porewater_data %>%
select(Date: Nutrient)%>%
head(n=3)
## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 3 x 6
##   Date   `NO3 (uM)` `NH4 (uM)` `PO4 (uM)` `Conc of chl (ug/L)` Nutrient
##   <chr>       <dbl>      <dbl>      <dbl>                <dbl> <chr>   
## 1 6/6/18       2.07       1.35       3.08                 2.77 A       
## 2 6/6/18       4.68       2.48       2.45                 5.3  N       
## 3 6/6/18       3.05       1.77       4.62                 2.99 A

To extract columns from our dataset, we can use the select() function to provide instructions about which columns we want or don’t want. select() takes a comma separated list of instructions. Here we selected rows between Date and Nutrient.

1.11.5 Remove Row

Porewater_data %>%
select(-`NO3 (uM)`)%>%
head(n=3)
## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 3 x 6
##    Tank Date   `NH4 (uM)` `PO4 (uM)` `Conc of chl (ug/L)` Nutrient
##   <dbl> <chr>       <dbl>      <dbl>                <dbl> <chr>   
## 1     1 6/6/18       1.35       3.08                 2.77 A       
## 2     2 6/6/18       2.48       2.45                 5.3  N       
## 3     3 6/6/18       1.77       4.62                 2.99 A

Here I used the select function to remove NO3 uM from the previously constructed table. I would do this to deselect unimportant data rows or duplicates

1.11.6 Select using starts_with

Porewater_data %>%
select(starts_with("N"))%>%
  head(n=3)
## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 3 x 3
##   `NO3 (uM)` `NH4 (uM)` Nutrient
##        <dbl>      <dbl> <chr>   
## 1       2.07       1.35 A       
## 2       4.68       2.48 N       
## 3       3.05       1.77 A

Here I used the select function to only select colums that start with an "N". I would use this to make sure I do not have two of the same column.

1.11.7 Content based on selection:

Porewater_data %>%
select_if(is.numeric)%>%
  head(n=3)
## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 3 x 5
##    Tank `NO3 (uM)` `NH4 (uM)` `PO4 (uM)` `Conc of chl (ug/L)`
##   <dbl>      <dbl>      <dbl>      <dbl>                <dbl>
## 1     1       2.07       1.35       3.08                 2.77
## 2     2       4.68       2.48       2.45                 5.3 
## 3     3       3.05       1.77       4.62                 2.99

Here is another method using the select_if function. Using this function I specified all numeric input to be selected for the new table. Very useful for determining how rows are being characterized.

1.11.8 Conditional selection:

Porewater_data %>%
filter(Nutrient=="N")%>%
  head(n=3)
## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 3 x 7
##    Tank Date   `NO3 (uM)` `NH4 (uM)` `PO4 (uM)` `Conc of chl (ug/… Nutrient
##   <dbl> <chr>       <dbl>      <dbl>      <dbl>              <dbl> <chr>   
## 1     2 6/6/18       4.68       2.48       2.45               5.3  N       
## 2     4 6/6/18       5.46       3.64       4.44               4.75 N       
## 3     6 6/6/18       4.82       3.46       3.11               5.25 N

Here I selected rows based on a logical statement using the filter() function. In this expression I filtered by treatment "nutrient" to only show data that was nutrient enriched "N".

1.11.9 Working with renaming column names:

Porewater_data %>%
select(Nitrate=`NO3 (uM)`, Ammonium=`NH4 (uM)`,Chla= `Conc of chl (ug/L)`,  Treatment=Nutrient)%>% 
  head(n=3)
## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 3 x 4
##   Nitrate Ammonium  Chla Treatment
##     <dbl>    <dbl> <dbl> <chr>    
## 1    2.07     1.35  2.77 A        
## 2    4.68     2.48  5.3  N        
## 3    3.05     1.77  2.99 A

Here I renamed columns by selecting them but providing a new name on the left-hand side of an equals operator (=) when doing a select(). This is useful if you want to filter columns and rename at the same time.