1 Data manipulation
This is Chapter covers the core tabular manipulation syntax of data : useing R package dplyr
1.1 Data pipelnes:
Piping allows us to create series of connected transformations that goes from the source to the destination. When we write a code using pipes %>% we do a step by step approach and refer to the pipe operator as “and then”eg. get the iris data then filter to setosa records then create a column “area” then calculate the average”.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
is_setosa <- iris$Species=="setosa"
setosa <- iris[is_setosa, ]
setosa_area <- setosa$Sepal.Length * setosa$Sepal.Width
avg_area <- mean(setosa_area)
1.1.1 Example:
R code length(toupper(letters)) and translate it into a pipeline: 1) Get the letters object 2) Apply the toupper() function to letters 3) Count the results using length()
letters %>%
toupper() %>%
length()
## [1] 26
Now, let’s go the other way and translate a pipeline into nested functions: 1. Get the iris dataset 2. Take some rows from the top of the dataset 3. Count how many rows are in the subset * equivalent to nrow(head(iris))
iris %>%
head() %>%
nrow()
## [1] 6
1.2 Assigning results:
iris_filtered <- iris %>% head()
iris_filtered2 <- iris %>%
head()
iris_filtered
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
iris_filtered2
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Because of the way data pipelines flow - top to bottom - left to right it feels somewhat counter-intuitive to have the output and input on the same line. So we can use the right hand side (RHS) operator ->
iris %>%
head() ->
iris_filtered3
1.3 Disorganized functions:
Sometimes functions that we encounter do not put the key input as the first argument. - e.g. linear model function starts with description of the model and just after the dataset
my_lm <- lm(Sepal.Width~Sepal.Length, iris)
but using the pipe operator we can use a dot (.) as a place holder in the place of the argument data
iris %>%
lm(Sepal.Width~Sepal.Length, .) ->
my_lm
1.4 Filtering columns:
To extract columns from our dataset, we can use the select() function to provide instructions about which columns we want or don’t want. select() takes a comma separated list of instructions.
iris %>%
select(Species, Sepal.Length)%>%
head(n=3)
## Species Sepal.Length
## 1 setosa 5.1
## 2 setosa 4.9
## 3 setosa 4.7
If we have lots of columns we want to include but a few we want to exclude, we can use a minus (-) before a column name to say to exclude it.
iris %>%
select(-Species)%>%
head(n=3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
For a range of columns:
iris %>%
select(Sepal.Length:Petal.Length)%>%
head(n=3)
## Sepal.Length Sepal.Width Petal.Length
## 1 5.1 3.5 1.4
## 2 4.9 3.0 1.4
## 3 4.7 3.2 1.3
To exclude a range of columns:
iris %>%
select(-(Sepal.Length:Petal.Length))%>%
head(n=3)
## Petal.Width Species
## 1 0.2 setosa
## 2 0.2 setosa
## 3 0.2 setosa
1.5 Name-based seldection:
- starts_with() will return columns where the string you provide is at the beginning of a column name
- ends_with() will return columns where the string you provide is at the end of a column name
- contains() will return columns where the string you provide is anywhere in the column name
- num_range() will allow you to return columns with names like “year 2020” through to “year_2021” by providing a prefix and a numeric range you want to select
- matches() allows you to provide pattern that a column name must conform to in order to be returned 6)one_of() allows you to provide a vector of column names that you would like to be matched in their entirety
1.5.1 Examples
Starts_with
iris %>%
select(starts_with("S"))%>%
head(n=3)
## Sepal.Length Sepal.Width Species
## 1 5.1 3.5 setosa
## 2 4.9 3.0 setosa
## 3 4.7 3.2 setosa
Ends_with
iris %>%
select(ends_with("s"))%>%
head(n=3)
## Species
## 1 setosa
## 2 setosa
## 3 setosa
Contains
iris %>%
select(contains("Length"))%>%
head(n=3)
## Sepal.Length Petal.Length
## 1 5.1 1.4
## 2 4.9 1.4
## 3 4.7 1.3
or
iris %>%
select(Petal.Width:Species, -contains("Length"))%>%
head(n=3)
## Petal.Width Species
## 1 0.2 setosa
## 2 0.2 setosa
## 3 0.2 setosa
1.6 Content based on selection:
Select columns based on the boolean results of functions applied to the data in the columns with select_if() Select all numeric columns from the iris dataset.
iris %>%
select_if(is.numeric)%>%
head(n=3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
1.7 Advanced conditional selection:
Create conditional statements on the fly by using a tilde (~) and then providing some code that results in a boolean. You tell the statement where the column of data should go by using the place holder symbol (.).
iris %>%
select_if(~is.numeric(.) & n_distinct(.)>30)%>%
head(n=3)
## Sepal.Length Petal.Length
## 1 5.1 1.4
## 2 4.9 1.4
## 3 4.7 1.3
1.8 Filtering rows:
Row-position selection The slice() function takes a vector of values that denote positions
iris %>%
slice(1:5)%>%
head(n=3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
1.9 Conditional selection:
To select rows based on a logical statement use filter()
iris %>%
filter(Species=="virginica")%>%
head(n=3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 6.3 3.3 6.0 2.5 virginica
## 2 5.8 2.7 5.1 1.9 virginica
## 3 7.1 3.0 5.9 2.1 virginica
Instead of needing to write multiple filter() commands, we can put multiple logical conditions inside a single filter() function
iris %>%
filter(Species == "virginica",
Sepal.Length > mean(Sepal.Length))%>%
head(n=3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 6.3 3.3 6.0 2.5 virginica
## 2 7.1 3.0 5.9 2.1 virginica
## 3 6.3 2.9 5.6 1.8 virginica
1.10 Working with column names:
Rename columns by selecting them but providing a new name on the left-hand side of an equals operator (=) when doing a select(). This is useful if you want to filter columns and rename at the same time.
iris %>%
select(sepal_width=Sepal.Width, species=Species)%>%
head(n=3)
## sepal_width species
## 1 3.5 setosa
## 2 3.0 setosa
## 3 3.2 setosa
1.11 My Data Example
Water samples were collected on 6/6/18, 9/24/18, 12/17/18, 1/26/19 and ran on an autoanalyzer to determine NO3-N, NH4-N, PO4, and Chl-a concentrations across 8 tanks identified by treatments (control "A" or nutrient enriched "N"). Moreover this data was used in my master's thesis.
library(readr)
Porewater_data <- read_csv("~/Desktop/R folder/Bookdown/Porewater_data.csv")
head(Porewater_data, n=3)
## # A tibble: 3 x 7
## Tank Date `NO3 (uM)` `NH4 (uM)` `PO4 (uM)` `Conc of chl (ug/… Nutrient
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 6/6/18 2.07 1.35 3.08 2.77 A
## 2 2 6/6/18 4.68 2.48 2.45 5.3 N
## 3 3 6/6/18 3.05 1.77 4.62 2.99 A
Here I read in my porewater dataset and headed the first three lines.
1.11.1 Evaluating datasets
Porewater_data %>%
head() %>%
nrow()
## [1] 6
Using the pipe we determined that there are 6 rows in the dataset porewater_data.
1.11.2 Assigning results:
pw_filtered <- Porewater_data %>% head()
pw_filtered
## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 6 x 7
## Tank Date `NO3 (uM)` `NH4 (uM)` `PO4 (uM)` `Conc of chl (ug/… Nutrient
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 6/6/18 2.07 1.35 3.08 2.77 A
## 2 2 6/6/18 4.68 2.48 2.45 5.3 N
## 3 3 6/6/18 3.05 1.77 4.62 2.99 A
## 4 4 6/6/18 5.46 3.64 4.44 4.75 N
## 5 5 6/6/18 2.91 0.91 2.44 2.47 A
## 6 6 6/6/18 4.82 3.46 3.11 5.25 N
Here I assigned the data set as pw_filtered. This would be useful to set aside newly created datatables using dplyr.
1.11.3 Disorganized functions:
Porewater_data %>%
lm(`NH4 (uM)`~Nutrient, .) ->
my_lm
my_lm
##
## Call:
## lm(formula = `NH4 (uM)` ~ Nutrient, data = .)
##
## Coefficients:
## (Intercept) NutrientN
## 1.8431 -0.1225
Sometimes functions that we encounter do not put the key input as the first argument. - e.g. linear model function starts with description of the model and just after the dataset but using the pipe operator we can use a dot (.) as a place holder in the place of the argument data.
1.11.4 Filtering columns:
Porewater_data %>%
select(Date: Nutrient)%>%
head(n=3)
## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 3 x 6
## Date `NO3 (uM)` `NH4 (uM)` `PO4 (uM)` `Conc of chl (ug/L)` Nutrient
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 6/6/18 2.07 1.35 3.08 2.77 A
## 2 6/6/18 4.68 2.48 2.45 5.3 N
## 3 6/6/18 3.05 1.77 4.62 2.99 A
To extract columns from our dataset, we can use the select() function to provide instructions about which columns we want or don’t want. select() takes a comma separated list of instructions. Here we selected rows between Date and Nutrient.
1.11.5 Remove Row
Porewater_data %>%
select(-`NO3 (uM)`)%>%
head(n=3)
## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 3 x 6
## Tank Date `NH4 (uM)` `PO4 (uM)` `Conc of chl (ug/L)` Nutrient
## <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 1 6/6/18 1.35 3.08 2.77 A
## 2 2 6/6/18 2.48 2.45 5.3 N
## 3 3 6/6/18 1.77 4.62 2.99 A
Here I used the select function to remove NO3 uM from the previously constructed table. I would do this to deselect unimportant data rows or duplicates
1.11.6 Select using starts_with
Porewater_data %>%
select(starts_with("N"))%>%
head(n=3)
## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 3 x 3
## `NO3 (uM)` `NH4 (uM)` Nutrient
## <dbl> <dbl> <chr>
## 1 2.07 1.35 A
## 2 4.68 2.48 N
## 3 3.05 1.77 A
Here I used the select function to only select colums that start with an "N". I would use this to make sure I do not have two of the same column.
1.11.7 Content based on selection:
Porewater_data %>%
select_if(is.numeric)%>%
head(n=3)
## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 3 x 5
## Tank `NO3 (uM)` `NH4 (uM)` `PO4 (uM)` `Conc of chl (ug/L)`
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2.07 1.35 3.08 2.77
## 2 2 4.68 2.48 2.45 5.3
## 3 3 3.05 1.77 4.62 2.99
Here is another method using the select_if function. Using this function I specified all numeric input to be selected for the new table. Very useful for determining how rows are being characterized.
1.11.8 Conditional selection:
Porewater_data %>%
filter(Nutrient=="N")%>%
head(n=3)
## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 3 x 7
## Tank Date `NO3 (uM)` `NH4 (uM)` `PO4 (uM)` `Conc of chl (ug/… Nutrient
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2 6/6/18 4.68 2.48 2.45 5.3 N
## 2 4 6/6/18 5.46 3.64 4.44 4.75 N
## 3 6 6/6/18 4.82 3.46 3.11 5.25 N
Here I selected rows based on a logical statement using the filter() function. In this expression I filtered by treatment "nutrient" to only show data that was nutrient enriched "N".
1.11.9 Working with renaming column names:
Porewater_data %>%
select(Nitrate=`NO3 (uM)`, Ammonium=`NH4 (uM)`,Chla= `Conc of chl (ug/L)`, Treatment=Nutrient)%>%
head(n=3)
## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 3 x 4
## Nitrate Ammonium Chla Treatment
## <dbl> <dbl> <dbl> <chr>
## 1 2.07 1.35 2.77 A
## 2 4.68 2.48 5.3 N
## 3 3.05 1.77 2.99 A
Here I renamed columns by selecting them but providing a new name on the left-hand side of an equals operator (=) when doing a select(). This is useful if you want to filter columns and rename at the same time.