Chapter 2: Getting Data into R
install.packages("dplyr",repos = "https://cran.us.r-project.org")
install.packages("tidyr",repos = "https://cran.us.r-project.org")
install.packages("stringr",repos = "https://cran.us.r-project.org")
install.packages("lubridate",repos = "https://cran.us.r-project.org")
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)
Read in the data
compensation <- read.csv("/Users/peteapicella/Documents/R_tutorials/GSwR/compensation.csv")
head(compensation)
## Root Fruit Grazing
## 1 6.225 59.77 Ungrazed
## 2 6.487 60.98 Ungrazed
## 3 4.919 14.73 Ungrazed
## 4 5.130 19.28 Ungrazed
## 5 5.417 34.25 Ungrazed
## 6 5.359 35.53 Ungrazed
knitr::kable(head(compensation))
Root | Fruit | Grazing |
---|---|---|
6.225 | 59.77 | Ungrazed |
6.487 | 60.98 | Ungrazed |
4.919 | 14.73 | Ungrazed |
5.130 | 19.28 | Ungrazed |
5.417 | 34.25 | Ungrazed |
5.359 | 35.53 | Ungrazed |
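The absolute path used above exists only on the author's machine. As a minimal sketch of the same import step that runs anywhere, the block below writes a three-row stand-in for compensation.csv to a temporary file and reads it back. The rows are copied from the output above; the names `tmp` and `demo` are illustrative assumptions, not part of the original tutorial.

```r
# Write a three-row stand-in for compensation.csv to a temporary file.
# `tmp` and `demo` are hypothetical names used only for this sketch.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Root,Fruit,Grazing",
             "6.225,59.77,Ungrazed",
             "6.487,60.98,Ungrazed",
             "4.919,14.73,Ungrazed"), tmp)

# read.csv() returns a data frame; the header row supplies the column names.
demo <- read.csv(tmp)
head(demo)
```

With a real project, a path relative to the working directory is usually easier to share than an absolute one.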
2.1 Checking that your data are your data
Generate names of the columns/variables in the console:
names(compensation)
## [1] "Root" "Fruit" "Grazing"
Report the number of observations (rows), followed by the number of variables (columns):
dim(compensation)
## [1] 40 3
Review structure of the data:
str(compensation)
## 'data.frame': 40 obs. of 3 variables:
## $ Root : num 6.22 6.49 4.92 5.13 5.42 ...
## $ Fruit : num 59.8 61 14.7 19.3 34.2 ...
## $ Grazing: chr "Ungrazed" "Ungrazed" "Ungrazed" "Ungrazed" ...
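The checks above are done by eye. One possible way to make them automatic is to assert the expected names and types with stopifnot(), so a bad import fails loudly. This is a sketch on a toy data frame built from the first three rows shown earlier; `demo` and `n_missing` are illustrative names.

```r
# Toy stand-in for the compensation data (first three rows from the output above).
demo <- data.frame(
  Root    = c(6.225, 6.487, 4.919),
  Fruit   = c(59.77, 60.98, 14.73),
  Grazing = c("Ungrazed", "Ungrazed", "Ungrazed"),
  stringsAsFactors = FALSE
)

# Stop with an error if the data do not look the way we expect.
stopifnot(
  identical(names(demo), c("Root", "Fruit", "Grazing")),
  ncol(demo) == 3,
  is.numeric(demo$Root),
  is.numeric(demo$Fruit),
  is.character(demo$Grazing)
)

# Count missing values per column; all zeros means nothing is missing.
n_missing <- colSums(is.na(demo))
n_missing
```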
2.2 Appendix advanced activity: dealing with untidy data
nasty.format <- read.csv("/Users/peteapicella/Documents/R_tutorials/GSwR/nasty format.csv")
head(nasty.format)
## Species Bottle Temp X1.2.13 X2.2.13 X3.2.13 X4.2.13 X6.2.13 X8.2.13
## 1 P.caudatum 7-P.c 22 100.0 58.8 67.5 6.8 0.93 0.39
## 2 P.caudatum 8-P.c 22 62.5 71.3 67.5 7.9 0.90 0.36
## 3 P.caudatum 9-P.c 22 75.0 72.5 62.3 7.9 0.88 0.25
## 4 P.caudatum 22-P.c 20 75.0 73.8 76.3 31.3 3.12 1.01
## 5 P.caudatum 23-P.c 20 50.0 NA 81.3 32.5 3.75 1.06
## 6 P.caudatum 24-P.c 20 87.5 NA 62.5 28.8 3.12 1.00
## X10.2.13 X12.2.13
## 1 0.19 0.46
## 2 0.16 0.34
## 3 0.23 0.31
## 4 0.56 0.50
## 5 0.49 0.38
## 6 0.41 0.46
Review data structure:
str(nasty.format)
## 'data.frame': 37 obs. of 11 variables:
## $ Species : chr "P.caudatum" "P.caudatum" "P.caudatum" "P.caudatum" ...
## $ Bottle : chr "7-P.c" "8-P.c" "9-P.c" "22-P.c" ...
## $ Temp : int 22 22 22 20 20 20 15 15 15 22 ...
## $ X1.2.13 : num 100 62.5 75 75 50 87.5 75 50 75 37.5 ...
## $ X2.2.13 : num 58.8 71.3 72.5 73.8 NA NA NA NA NA 52.5 ...
## $ X3.2.13 : num 67.5 67.5 62.3 76.3 81.3 62.5 90 78.8 78.3 23.8 ...
## $ X4.2.13 : num 6.8 7.9 7.9 31.3 32.5 28.8 72.5 92.5 77.5 1.25 ...
## $ X6.2.13 : num 0.93 0.9 0.88 3.12 3.75 ...
## $ X8.2.13 : num 0.39 0.36 0.25 1.01 1.06 1 67.5 72.5 60 0.96 ...
## $ X10.2.13: num 0.19 0.16 0.23 0.56 0.49 0.41 37.5 52.5 60 0.33 ...
## $ X12.2.13: num 0.46 0.34 0.31 0.5 0.38 ...
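- this dataset is poorly constructed: each sampling date is a column header (X1.2.13, X2.2.13, ...) rather than a value in a Date column, so the abundance measurements are spread across eight columns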
Eliminate extra (37th) row in dataset:
nasty.format <- filter(nasty.format, Bottle != "") # '!=' symbol means '≠'
tail(nasty.format)
## Species Bottle Temp X1.2.13 X2.2.13 X3.2.13 X4.2.13 X6.2.13 X8.2.13
## 31 S. fonticola 19 20 25.0 87.5 85.0 98.8 78.75 71.25
## 32 S. fonticola 20 20 87.5 63.8 81.3 76.3 72.50 85.00
## 33 S. fonticola 21 20 50.0 77.5 83.8 97.5 68.75 71.25
## 34 S. fonticola 34 15 50.0 NA 101.3 93.8 70.00 91.25
## 35 S. fonticola 35 15 62.5 NA 65.0 72.5 61.25 72.50
## 36 S. fonticola 36 15 112.5 NA 76.3 67.5 61.25 77.50
## X10.2.13 X12.2.13
## 31 68.8 101.25
## 32 72.5 85.00
## 33 60.0 98.75
## 34 76.3 80.00
## 35 66.3 102.50
## 36 91.3 77.50
- this filter() call keeps every row in which the variable ‘Bottle’ is not an empty string, which drops the blank 37th row
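To see the effect of the filter in isolation, here is a small sketch on a made-up data frame whose last row is blank, mimicking the extra row in the file. The data frame `toy` and its values are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical three-row table; the last row imitates a blank trailing row.
toy <- data.frame(
  Bottle = c("7-P.c", "8-P.c", ""),
  Temp   = c(22L, 22L, NA),
  stringsAsFactors = FALSE
)

# Keep only rows where Bottle is not the empty string.
cleaned <- filter(toy, Bottle != "")
nrow(cleaned)
```

If the blank cell had been imported as NA instead of "", the same condition would still remove it, because filter() drops rows for which the condition evaluates to NA.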
Create new variables and sort the data into them:
tidy_data <- gather(nasty.format,
                    Date, Abundance, #the variables to be created
                    4:11) #column headers that are dates in the nasty.format dataframe
head(tidy_data)
## Species Bottle Temp Date Abundance
## 1 P.caudatum 7-P.c 22 X1.2.13 100.0
## 2 P.caudatum 8-P.c 22 X1.2.13 62.5
## 3 P.caudatum 9-P.c 22 X1.2.13 75.0
## 4 P.caudatum 22-P.c 20 X1.2.13 75.0
## 5 P.caudatum 23-P.c 20 X1.2.13 50.0
## 6 P.caudatum 24-P.c 20 X1.2.13 87.5
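In current versions of tidyr, gather() is marked as superseded in favour of pivot_longer(). The sketch below shows what appears to be the equivalent reshaping on a two-row, two-date toy table; `wide` and `long` are illustrative names, and the values are taken from the first rows of the output above.

```r
library(tidyr)

# Toy wide table: one column per sampling date, as in nasty.format.
wide <- data.frame(
  Species = c("P.caudatum", "P.caudatum"),
  Bottle  = c("7-P.c", "8-P.c"),
  Temp    = c(22L, 22L),
  X1.2.13 = c(100.0, 62.5),
  X2.2.13 = c(58.8, 71.3),
  stringsAsFactors = FALSE
)

# Move the date column headers into 'Date' and their cell values into 'Abundance'.
long <- pivot_longer(wide,
                     cols      = 4:5,        # the date columns
                     names_to  = "Date",
                     values_to = "Abundance")
long
```

One difference to keep in mind: pivot_longer() returns a tibble and orders the result row by row (all dates for the first bottle, then the second), whereas gather() orders it column by column.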
Remove the ‘X’ that precedes the date in each observation:
tidy_data <- mutate(tidy_data, Date = substr(Date, 2, 20))
head(tidy_data)
## Species Bottle Temp Date Abundance
## 1 P.caudatum 7-P.c 22 1.2.13 100.0
## 2 P.caudatum 8-P.c 22 1.2.13 62.5
## 3 P.caudatum 9-P.c 22 1.2.13 75.0
## 4 P.caudatum 22-P.c 20 1.2.13 75.0
## 5 P.caudatum 23-P.c 20 1.2.13 50.0
## 6 P.caudatum 24-P.c 20 1.2.13 87.5
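substr(Date, 2, 20) works by position: it keeps characters 2 through 20, which assumes the unwanted character is always first. An alternative sketch, using base R's sub() with a regular expression, removes a leading "X" only when one is present. The vector `dates` is a made-up example built from the column names above.

```r
# Three of the date strings as they appear after gathering.
dates <- c("X1.2.13", "X10.2.13", "X12.2.13")

# "^X" matches an X only at the start of the string; replace it with nothing.
by_pattern  <- sub("^X", "", dates)

# The positional approach from the text, for comparison.
by_position <- substr(dates, 2, 20)

# Both approaches give the same result on these inputs.
identical(by_pattern, by_position)
```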
Display all unique dates:
unique(tidy_data$Date) #this says use the observations in the variable 'Date' in the 'tidy_data' dataframe
## [1] "1.2.13" "2.2.13" "3.2.13" "4.2.13" "6.2.13" "8.2.13" "10.2.13"
## [8] "12.2.13"
Convert the dates from text to R's Date class, which prints in the unambiguous YYYY-MM-DD format:
tidy_data <- mutate(tidy_data, Date = dmy(Date))
head(tidy_data)
## Species Bottle Temp Date Abundance
## 1 P.caudatum 7-P.c 22 2013-02-01 100.0
## 2 P.caudatum 8-P.c 22 2013-02-01 62.5
## 3 P.caudatum 9-P.c 22 2013-02-01 75.0
## 4 P.caudatum 22-P.c 20 2013-02-01 75.0
## 5 P.caudatum 23-P.c 20 2013-02-01 50.0
## 6 P.caudatum 24-P.c 20 2013-02-01 87.5
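The choice of dmy() matters because a string such as "1.2.13" is ambiguous on its own. The following sketch, using two of the date strings from above, shows how lubridate's dmy() and mdy() read the same text differently, and that the parsed values support date arithmetic. The names `x`, `d`, and `gap` are illustrative.

```r
library(lubridate)

x <- c("1.2.13", "12.2.13")

# dmy() reads day, then month, then year.
d <- dmy(x)
d
class(d)

# mdy() reads the same text as month, then day, then year.
mdy("1.2.13")

# Parsed dates can be subtracted; text cannot.
gap <- as.numeric(diff(d))   # days between the two sampling dates
gap
```

Since the dataset's columns run X1.2.13, X2.2.13, X3.2.13 and so on, the first number is presumably the day, which is why dmy() is the appropriate parser here.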
separate()
Separates information present in one column to multiple new columns
unite()
Puts information from several columns into one column
rbind()
Stacks datasets with exactly the same columns on top of each other (adds rows)
cbind()
Places datasets with the same number of rows side by side (adds columns)
full_join()
Joins two datasets with one or more columns in common
merge()
Base R counterpart to the dplyr joins; it matches full_join() only with all = TRUE (by default it keeps just the matching rows)
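The six functions listed above can be tried on two tiny tables. This is a sketch on invented data; the data frames `a` and `b` and all result names are hypothetical and do not come from the tutorial's files.

```r
library(dplyr)
library(tidyr)

# Two made-up tables that share an 'id' column.
a <- data.frame(id = c(1, 2),
                date = c("2013-02-01", "2013-02-02"),
                stringsAsFactors = FALSE)
b <- data.frame(id = c(2, 3), temp = c(20, 15))

# separate(): split one column into several.
parts <- separate(a, date, into = c("year", "month", "day"), sep = "-")

# unite(): paste several columns back into one.
back <- unite(parts, "date", year, month, day, sep = "-")

# rbind(): same columns, rows stacked (2 + 2 rows).
stacked <- rbind(a, a)

# cbind(): same number of rows, columns placed side by side.
side <- cbind(a, flag = c(TRUE, FALSE))

# full_join(): keep every id from both tables, filling gaps with NA.
fj <- full_join(a, b, by = "id")

# merge(): keeps only matching ids by default; all = TRUE keeps everything.
inner <- merge(a, b, by = "id")
outer <- merge(a, b, by = "id", all = TRUE)
```

Comparing nrow(inner) with nrow(outer) is a quick way to see how many rows had no match in the other table.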