# 1 Data manipulation

This is Chapter covers the core tabular manipulation syntax of data : useing R package dplyr

## 1.1 Data pipelnes:

Piping allows us to create series of connected transformations that goes from the source to the destination. When we write a code using pipes %>% we do a step by step approach and refer to the pipe operator as “and then”eg. get the iris data then filter to setosa records then create a column “area” then calculate the average”.

library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
is_setosa <- iris$Species=="setosa" setosa <- iris[is_setosa, ] setosa_area <- setosa$Sepal.Length * setosa\$Sepal.Width
avg_area <- mean(setosa_area)

### 1.1.1 Example:

R code length(toupper(letters)) and translate it into a pipeline: 1) Get the letters object 2) Apply the toupper() function to letters 3) Count the results using length()

letters %>%
toupper() %>%
length()
## [1] 26

Now, let’s go the other way and translate a pipeline into nested functions: 1. Get the iris dataset 2. Take some rows from the top of the dataset 3. Count how many rows are in the subset * equivalent to nrow(head(iris))

iris %>%
nrow()
## [1] 6

## 1.2 Assigning results:

iris_filtered <- iris %>% head()
iris_filtered2 <- iris %>%
iris_filtered
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
iris_filtered2
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Because of the way data pipelines flow - top to bottom - left to right it feels somewhat counter-intuitive to have the output and input on the same line. So we can use the right hand side (RHS) operator ->

iris %>%
iris_filtered3

## 1.3 Disorganized functions:

Sometimes functions that we encounter do not put the key input as the first argument. - e.g. linear model function starts with description of the model and just after the dataset

my_lm <- lm(Sepal.Width~Sepal.Length, iris)

but using the pipe operator we can use a dot (.) as a place holder in the place of the argument data

iris %>%
lm(Sepal.Width~Sepal.Length, .) ->
my_lm

## 1.4 Filtering columns:

To extract columns from our dataset, we can use the select() function to provide instructions about which columns we want or don’t want. select() takes a comma separated list of instructions.

iris %>%
select(Species, Sepal.Length)%>%
head(n=3)
##   Species Sepal.Length
## 1  setosa          5.1
## 2  setosa          4.9
## 3  setosa          4.7

If we have lots of columns we want to include but a few we want to exclude, we can use a minus (-) before a column name to say to exclude it.

iris %>%
select(-Species)%>%
head(n=3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2

For a range of columns:

iris %>%
select(Sepal.Length:Petal.Length)%>%
head(n=3)
##   Sepal.Length Sepal.Width Petal.Length
## 1          5.1         3.5          1.4
## 2          4.9         3.0          1.4
## 3          4.7         3.2          1.3

To exclude a range of columns:

iris %>%
select(-(Sepal.Length:Petal.Length))%>%
head(n=3)
##   Petal.Width Species
## 1         0.2  setosa
## 2         0.2  setosa
## 3         0.2  setosa

## 1.5 Name-based seldection:

1. starts_with() will return columns where the string you provide is at the beginning of a column name
2. ends_with() will return columns where the string you provide is at the end of a column name
3. contains() will return columns where the string you provide is anywhere in the column name
4. num_range() will allow you to return columns with names like “year 2020” through to “year_2021” by providing a prefix and a numeric range you want to select
5. matches() allows you to provide pattern that a column name must conform to in order to be returned 6)one_of() allows you to provide a vector of column names that you would like to be matched in their entirety

### 1.5.1 Examples

Starts_with

iris %>%
select(starts_with("S"))%>%
head(n=3)
##   Sepal.Length Sepal.Width Species
## 1          5.1         3.5  setosa
## 2          4.9         3.0  setosa
## 3          4.7         3.2  setosa

Ends_with

iris %>%
select(ends_with("s"))%>%
head(n=3)
##   Species
## 1  setosa
## 2  setosa
## 3  setosa

Contains

iris %>%
select(contains("Length"))%>%
head(n=3)
##   Sepal.Length Petal.Length
## 1          5.1          1.4
## 2          4.9          1.4
## 3          4.7          1.3

or

iris %>%
select(Petal.Width:Species, -contains("Length"))%>%
head(n=3)
##   Petal.Width Species
## 1         0.2  setosa
## 2         0.2  setosa
## 3         0.2  setosa

## 1.6 Content based on selection:

Select columns based on the boolean results of functions applied to the data in the columns with select_if() Select all numeric columns from the iris dataset.

iris %>%
select_if(is.numeric)%>%
head(n=3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2

Create conditional statements on the fly by using a tilde (~) and then providing some code that results in a boolean. You tell the statement where the column of data should go by using the place holder symbol (.).

iris %>%
select_if(~is.numeric(.) & n_distinct(.)>30)%>%
head(n=3)
##   Sepal.Length Petal.Length
## 1          5.1          1.4
## 2          4.9          1.4
## 3          4.7          1.3

## 1.8 Filtering rows:

Row-position selection The slice() function takes a vector of values that denote positions

iris %>%
slice(1:5)%>%
head(n=3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa

## 1.9 Conditional selection:

To select rows based on a logical statement use filter()

iris %>%
filter(Species=="virginica")%>%
head(n=3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 1          6.3         3.3          6.0         2.5 virginica
## 2          5.8         2.7          5.1         1.9 virginica
## 3          7.1         3.0          5.9         2.1 virginica

Instead of needing to write multiple filter() commands, we can put multiple logical conditions inside a single filter() function

iris %>%
filter(Species == "virginica",
Sepal.Length > mean(Sepal.Length))%>%
head(n=3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 1          6.3         3.3          6.0         2.5 virginica
## 2          7.1         3.0          5.9         2.1 virginica
## 3          6.3         2.9          5.6         1.8 virginica

## 1.10 Working with column names:

Rename columns by selecting them but providing a new name on the left-hand side of an equals operator (=) when doing a select(). This is useful if you want to filter columns and rename at the same time.

iris %>%
select(sepal_width=Sepal.Width, species=Species)%>%
head(n=3)
##   sepal_width species
## 1         3.5  setosa
## 2         3.0  setosa
## 3         3.2  setosa

## 1.11 My Data Example

Water samples were collected on 6/6/18, 9/24/18, 12/17/18, 1/26/19 and ran on an autoanalyzer to determine NO3-N, NH4-N, PO4, and Chl-a concentrations across 8 tanks identified by treatments (control "A" or nutrient enriched "N"). Moreover this data was used in my master's thesis.

library(readr)
head(Porewater_data, n=3)
## # A tibble: 3 x 7
##    Tank Date   NO3 (uM) NH4 (uM) PO4 (uM) Conc of chl (ug/… Nutrient
##   <dbl> <chr>       <dbl>      <dbl>      <dbl>              <dbl> <chr>
## 1     1 6/6/18       2.07       1.35       3.08               2.77 A
## 2     2 6/6/18       4.68       2.48       2.45               5.3  N
## 3     3 6/6/18       3.05       1.77       4.62               2.99 A

Here I read in my porewater dataset and headed the first three lines.

### 1.11.1 Evaluating datasets

Porewater_data %>%
nrow()
## [1] 6

Using the pipe we determined that there are 6 rows in the dataset porewater_data.

### 1.11.2 Assigning results:

pw_filtered <- Porewater_data %>% head()
pw_filtered
## Warning: ... is not empty.
##
## We detected these problematic arguments:
## * needs_dots
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 6 x 7
##    Tank Date   NO3 (uM) NH4 (uM) PO4 (uM) Conc of chl (ug/… Nutrient
##   <dbl> <chr>       <dbl>      <dbl>      <dbl>              <dbl> <chr>
## 1     1 6/6/18       2.07       1.35       3.08               2.77 A
## 2     2 6/6/18       4.68       2.48       2.45               5.3  N
## 3     3 6/6/18       3.05       1.77       4.62               2.99 A
## 4     4 6/6/18       5.46       3.64       4.44               4.75 N
## 5     5 6/6/18       2.91       0.91       2.44               2.47 A
## 6     6 6/6/18       4.82       3.46       3.11               5.25 N

Here I assigned the data set as pw_filtered. This would be useful to set aside newly created datatables using dplyr.

### 1.11.3 Disorganized functions:

Porewater_data %>%
lm(NH4 (uM)~Nutrient, .) ->
my_lm
my_lm
##
## Call:
## lm(formula = NH4 (uM) ~ Nutrient, data = .)
##
## Coefficients:
## (Intercept)    NutrientN
##      1.8431      -0.1225

Sometimes functions that we encounter do not put the key input as the first argument. - e.g. linear model function starts with description of the model and just after the dataset but using the pipe operator we can use a dot (.) as a place holder in the place of the argument data.

### 1.11.4 Filtering columns:

Porewater_data %>%
select(Date: Nutrient)%>%
head(n=3)
## Warning: ... is not empty.
##
## We detected these problematic arguments:
## * needs_dots
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 3 x 6
##   Date   NO3 (uM) NH4 (uM) PO4 (uM) Conc of chl (ug/L) Nutrient
##   <chr>       <dbl>      <dbl>      <dbl>                <dbl> <chr>
## 1 6/6/18       2.07       1.35       3.08                 2.77 A
## 2 6/6/18       4.68       2.48       2.45                 5.3  N
## 3 6/6/18       3.05       1.77       4.62                 2.99 A

To extract columns from our dataset, we can use the select() function to provide instructions about which columns we want or don’t want. select() takes a comma separated list of instructions. Here we selected rows between Date and Nutrient.

### 1.11.5 Remove Row

Porewater_data %>%
select(-NO3 (uM))%>%
head(n=3)
## Warning: ... is not empty.
##
## We detected these problematic arguments:
## * needs_dots
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 3 x 6
##    Tank Date   NH4 (uM) PO4 (uM) Conc of chl (ug/L) Nutrient
##   <dbl> <chr>       <dbl>      <dbl>                <dbl> <chr>
## 1     1 6/6/18       1.35       3.08                 2.77 A
## 2     2 6/6/18       2.48       2.45                 5.3  N
## 3     3 6/6/18       1.77       4.62                 2.99 A

Here I used the select function to remove NO3 uM from the previously constructed table. I would do this to deselect unimportant data rows or duplicates

### 1.11.6 Select using starts_with

Porewater_data %>%
select(starts_with("N"))%>%
head(n=3)
## Warning: ... is not empty.
##
## We detected these problematic arguments:
## * needs_dots
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 3 x 3
##   NO3 (uM) NH4 (uM) Nutrient
##        <dbl>      <dbl> <chr>
## 1       2.07       1.35 A
## 2       4.68       2.48 N
## 3       3.05       1.77 A

Here I used the select function to only select colums that start with an "N". I would use this to make sure I do not have two of the same column.

### 1.11.7 Content based on selection:

Porewater_data %>%
select_if(is.numeric)%>%
head(n=3)
## Warning: ... is not empty.
##
## We detected these problematic arguments:
## * needs_dots
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 3 x 5
##    Tank NO3 (uM) NH4 (uM) PO4 (uM) Conc of chl (ug/L)
##   <dbl>      <dbl>      <dbl>      <dbl>                <dbl>
## 1     1       2.07       1.35       3.08                 2.77
## 2     2       4.68       2.48       2.45                 5.3
## 3     3       3.05       1.77       4.62                 2.99

Here is another method using the select_if function. Using this function I specified all numeric input to be selected for the new table. Very useful for determining how rows are being characterized.

### 1.11.8 Conditional selection:

Porewater_data %>%
filter(Nutrient=="N")%>%
head(n=3)
## Warning: ... is not empty.
##
## We detected these problematic arguments:
## * needs_dots
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 3 x 7
##    Tank Date   NO3 (uM) NH4 (uM) PO4 (uM) Conc of chl (ug/… Nutrient
##   <dbl> <chr>       <dbl>      <dbl>      <dbl>              <dbl> <chr>
## 1     2 6/6/18       4.68       2.48       2.45               5.3  N
## 2     4 6/6/18       5.46       3.64       4.44               4.75 N
## 3     6 6/6/18       4.82       3.46       3.11               5.25 N

Here I selected rows based on a logical statement using the filter() function. In this expression I filtered by treatment "nutrient" to only show data that was nutrient enriched "N".

### 1.11.9 Working with renaming column names:

Porewater_data %>%
select(Nitrate=NO3 (uM), Ammonium=NH4 (uM),Chla= Conc of chl (ug/L),  Treatment=Nutrient)%>%
head(n=3)
## Warning: ... is not empty.
##
## We detected these problematic arguments:
## * needs_dots
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 3 x 4
##   Nitrate Ammonium  Chla Treatment
##     <dbl>    <dbl> <dbl> <chr>
## 1    2.07     1.35  2.77 A
## 2    4.68     2.48  5.3  N
## 3    3.05     1.77  2.99 A`

Here I renamed columns by selecting them but providing a new name on the left-hand side of an equals operator (=) when doing a select(). This is useful if you want to filter columns and rename at the same time.