Chapter 6 Analysis data creation
6.1 Common analysis data formats
Although the types of analysis you can perform on camera trap data vary markedly, they often depend on three key dataframe structures. We introduce these structures here, then show you how to apply them in subsequent chapters.
6.2 Independent detections
The independent detections dataframe is the workhorse of the vast majority of camera trap analyses; it is from this that you build the rest of your data frames. The threshold we use for determining what is an “independent detection” is typically 30 minutes… because camera trappers are creatures of habit! If you want to dig a little deeper into the why, there is a nice summary in Rahel Sollmann’s “A gentle introduction to camera‐trap data analysis”:
Researchers have used different thresholds, typically 30 min (e.g., O’Brien, Kinnaird, & Wibisono, 2003) to an hour (Bahaa‐el‐din et al., 2016); some researchers have argued that multiple pictures within the same day may not represent independent detections (Royle, Nichols, Karanth, & Gopalaswamy, 2009). In most cases, this threshold is determined subjectively, based on the best available knowledge of the species under study. But it can also be determined based on the temporal autocorrelation (Kays & Parsons, 2014) or analysis of time intervals (Yasuda, 2004) of subsequent pictures.
Independent data has a single row for each independent event:
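For illustration, here is a minimal sketch of this structure with hypothetical values and a simplified set of columns (not the full example_data schema):

```r
library(dplyr)

# One row per independent event (hypothetical values)
ind_example <- tibble(
  placename    = c("ALG001", "ALG001", "ALG002"),
  timestamp    = as.POSIXct(c("2019-01-05 10:02:00",
                              "2019-01-05 14:40:00",
                              "2019-01-06 08:15:00"), tz = "UTC"),
  sp           = c("Alces.alces", "Alces.alces", "Canis.lupus"),
  event_id     = c("E0001", "E0002", "E0003"),
  animal_count = c(2, 1, 1)
)
```

Note that the two moose rows are separate events because they occur more than 30 minutes apart at the same station.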
6.3 Effort look-up
Image data without effort data is worthless!
There are many instances where you need to know which stations were operating on a given day. Some people like to store this information in a site x date matrix, but such matrices are not easy to wrangle. A long data frame with a site and a date column is the most flexible format (and keeps the dates in their native POSIX formats).
Effort look-ups have a single row for every day a given location has an active camera:
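As a minimal sketch (hypothetical deployment dates), a single deployment expands into one row per active day:

```r
library(lubridate)

# Expand one hypothetical deployment into one row per active day
effort_example <- data.frame(
  date      = seq(ymd("2019-01-01"), ymd("2019-01-05"), by = "days"),
  placename = "ALG001"
)
# effort_example has five rows: one for each day the camera was active
```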
6.4 Observations by time interval
We saved the most useful data format until last!
A site, time interval, effort, and species detection dataframe integrates the independent data and daily lookup described above. You can use it to create detection rates, occupancy data frames and much more (see the subsequent chapters)!
We export yearly, monthly, weekly and daily data frames from our single site exploration script - which should cover you for much of what you want to do.
We include two different types of response terms:
- Observations = the number of independent detections per time interval
- Counts = sum of the independent minimum group sizes per time interval
Example of an observation by time matrix:
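A small hypothetical sketch makes the structure clear; the effort column (`days`) sits alongside one column per species:

```r
# Hypothetical site x month observation dataframe
obs_example <- data.frame(
  placename   = c("ALG001", "ALG001", "ALG002"),
  date        = c("2019-01", "2019-02", "2019-01"),
  days        = c(31, 28, 20),   # camera-active days in the interval (effort)
  Alces.alces = c(3, 0, 1),      # independent detections per interval
  Canis.lupus = c(0, 2, 0)
)
```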
Let’s build these data frames from our example_data!
6.5 Our data
First, let’s create the folder to store our data:
```r
dir.create("data/processed_data")
```
This section follows four steps:

1. Filter to our target species
2. Create a camera activity look-up
3. Determine our “independent detections”
4. Create our analysis data frames
6.5.1 Filter to target species
```r
# Remove observations without animals detected, where we don't know the species, and non-mammals
img_sub <- img %>% filter(is_blank==0,                # Remove the blanks
                          is.na(img$species)==FALSE,  # Remove classifications which don't have species
                          class=="Mammalia",          # Subset to mammals
                          species!="sapiens")         # Subset to anything that isn't human
```
This results in the removal of 33.2% of the observations. The remaining observations comprise the following species:
```r
img_sub %>% group_by(common_name) %>% summarize(n())

## # A tibble: 14 × 2
##    common_name       `n()`
##    <chr>             <int>
##  1 american marten      41
##  2 black bear         1331
##  3 canada lynx         140
##  4 caribou             787
##  5 coyote               21
##  6 elk                   6
##  7 gray wolf           352
##  8 moose              2038
##  9 rabbit                9
## 10 red fox              39
## 11 red squirrel         34
## 12 river otter           2
## 13 snowshoe hare       629
## 14 white-tailed deer  4790
```
6.5.2 Create a daily camera activity lookup
Next we create the daily camera activity look up (remember, one row for every day a camera is active).
```r
# Remove any deployments without end dates
tmp <- dep[is.na(dep$end_date)==F,]

# Create an empty list to store our days
daily_lookup <- list()

# Loop through the deployment dataframe and create a row for every day the camera is active
for(i in 1:nrow(tmp))
{
  if(ymd(tmp$start_date[i])!=ymd(tmp$end_date[i]))
  {
    daily_lookup[[i]] <- data.frame("date"=seq(ymd(tmp$start_date[i]), ymd(tmp$end_date[i]), by="days"),
                                    "placename"=tmp$placename[i])
  }
}

# Merge the list into a dataframe
row_lookup <- bind_rows(daily_lookup)

# Remove duplicates - when start and end days are the same for successive deployments
row_lookup <- row_lookup[duplicated(row_lookup)==F,]
```
6.5.3 Determine ‘independent’ camera detections
We rarely analyse raw camera data; rather, we filter out multiple detections of the same individual within a given event. This is called creating an “independent detections” dataframe.
As stated above, it is wise to think about what you are analyzing and whether such a threshold is appropriate. For example, if your organism of interest is very abundant (say, human hikers on a busy trail), then using a 30 minute threshold may mean that multiple independent groups of hikers are rolled into a single, huge, “event”.
```r
# Set the "independence" interval in minutes
independent <- 30
```
Finally, we need to specify what a “count” means in this dataset. Some people estimate group_size in their footage, summing all of the individuals they are sure are different. Others only sum the animals they can see in each photo. Here is where you specify which to use:
```r
# Check for a `group_size` variable
table(img_sub$group_size)

##
##    1    2    3    4    5    6
## 7923 1637  597   28   24   10

# Check for a `number_of_objects` variable
table(img_sub$number_of_objects)

##
##    1    2    3
## 9772  412   35
```
Make your selection:
```r
# If `group_size` exists use that; if not, use `number_of_objects`
img_sub$animal_count <- img_sub$group_size
```
We will now break down the algorithm into subsections to make it clear what is occurring:
- Order the dataframe by deployment code and species
```r
img_tmp <- img_sub %>%
  arrange(deployment_id) %>%       # Order by deployment_id
  group_by(deployment_id, sp) %>%  # Group species together
  mutate(duration = int_length(timestamp %--% lag(timestamp))) # Calculate the gap between successive detections
```
- Determine independence of images
If subsequent detections occur outside of the independence threshold, assign it a unique ID code.
```r
library(stringr)

# Give a placeholder value to all cells
img_tmp$event_id <- 9999

# Create a counter
counter <- 1

# Make a unique code that has one more zero than rows in your dataframe
num_code <- as.numeric(paste0(nrow(img_sub), 0))

# Loop through img_tmp - if the gap is greater than the threshold, give it a new event ID
for (i in 2:nrow(img_tmp)) {
  img_tmp$event_id[i-1] <- paste0("E", str_pad(counter, nchar(num_code), pad = "0"))
  if(is.na(img_tmp$duration[i]) | abs(img_tmp$duration[i]) > (independent * 60))
  {
    counter <- counter + 1
  }
}

# Update the event ID for the last row - the loop above always updates
# the previous row, leaving the last row unchanged
if(img_tmp$duration[nrow(img_tmp)] < (independent * 60) |
   is.na(img_tmp$duration[nrow(img_tmp)]))
{
  img_tmp$event_id[nrow(img_tmp)] <- img_tmp$event_id[nrow(img_tmp)-1]
} else {
  counter <- counter + 1
  img_tmp$event_id[nrow(img_tmp)] <- paste0("E", str_pad(counter, nchar(num_code), pad = "0"))
}

# Remove the duration column
img_tmp$duration <- NULL
```
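Aside: if you prefer vectorised code, the same event assignment can be sketched with `cumsum()`; a new event starts whenever the gap to the previous image is missing (the first image of a site/species group) or exceeds the threshold. This is an equivalent sketch, not the script’s canonical method, and it assumes the data are sorted by timestamp within groups:

```r
# Vectorised sketch of the event-ID assignment (same logic as the loop above)
img_alt <- img_sub %>%
  arrange(deployment_id, sp, timestamp) %>%
  group_by(deployment_id, sp) %>%
  mutate(duration  = int_length(timestamp %--% lag(timestamp)),
         new_event = is.na(duration) | abs(duration) > (independent * 60)) %>%
  ungroup() %>%
  mutate(event_id = paste0("E", str_pad(cumsum(new_event), nchar(num_code), pad = "0")))
```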
6.5.4 Add additional data
We could stop there; however, there is other information we might like to extract about each individual event:
- the maximum number of objects detected in an event
- how long the event lasts
- how many images are in each event
```r
# Find the first and the last timestamp in each event
top <- img_tmp %>% group_by(event_id) %>% top_n(1, timestamp) %>% dplyr::select(event_id, timestamp)
bot <- img_tmp %>% group_by(event_id) %>% top_n(-1, timestamp) %>% dplyr::select(event_id, timestamp)
names(bot)[2] <- c("timestamp_end")

# Number of images in the event
img_num <- img_tmp %>% group_by(event_id) %>% summarise(event_observations=n())
# Maximum group size in the event
event_grp <- img_tmp %>% group_by(event_id) %>% summarise(event_groupsize=max(animal_count))

# Calculate the duration and add the other elements
diff <- top %>% left_join(bot, by="event_id") %>%
  mutate(event_duration=abs(int_length(timestamp %--% timestamp_end))) %>%
  left_join(event_grp, by="event_id") %>%
  left_join(img_num, by="event_id")

# Remove columns you don't need
diff$timestamp <- NULL
diff$timestamp_end <- NULL
# Remove duplicates
diff <- diff[duplicated(diff)==F,]

# Merge img_tmp with the event data
img_tmp <- img_tmp %>%
  left_join(diff, by="event_id")
```
Finally, let’s subset to the first row of each event to create our independent dataframe!
```r
# Remove duplicates
ind_dat <- img_tmp[duplicated(img_tmp$event_id)==F,]
```
Next we remove any detections which occur outside of our known camera activity periods:
```r
# Make a unique code for every day and deployment where cameras were functioning
tmp <- paste(row_lookup$date, row_lookup$placename)

# Subset ind_dat to data that matches the unique codes
ind_dat <- ind_dat[paste(substr(ind_dat$timestamp,1,10), ind_dat$placename) %in% tmp, ]
```
As a final step, we make the species column a ‘factor’ - this makes all the data frame building operations much simpler:
```r
ind_dat$sp <- as.factor(ind_dat$sp)
```
And we are ready to build our dataframes!
6.6 Creating analysis dataframes
Finally, this script outputs 12 useful data frames for future data analysis:
1. A data frame of “independent detections” at the 30 minute threshold you specified at the start:
- “data/processed_data/AlgarRestorationProject_30min_Independent.csv”
```r
write.csv(ind_dat, paste0("data/processed_data/", ind_dat$project_id[1], "_", independent, "min_independent_detections.csv"), row.names = F)

# Also write the cleaned all-detections file (some activity analyses require it)
write.csv(img_tmp, paste0("data/processed_data/", ind_dat$project_id[1], "_raw_detections.csv"), row.names = F)
```
2. The “daily_lookup”, a dataframe of all days a given camera station was active. Some people use a lookup matrix for this step, but we find the long format much easier to use in downstream analysis. - “data/processed_data/_daily_lookup.csv”
```r
write.csv(row_lookup, paste0("data/processed_data/", ind_dat$project_id[1], "_daily_lookup.csv"), row.names = F)
```
3. Unique camera locations list:
When we start to build the covariates for data analysis, it is very useful to have a list of your final project’s camera locations. We create this below in a simplified form. You can include any columns which will be of use for data analysis, then export it.
```r
# Subset the columns
tmp <- dep[, c("project_id", "placename", "longitude", "latitude", "feature_type")]

# Remove duplicated rows
tmp <- tmp[duplicated(tmp)==F,]

# Write the file
write.csv(tmp, paste0("data/processed_data/", ind_dat$project_id[1], "_camera_locations.csv"), row.names = F)
```
4. Final species list
We also want to create a final species list. We subset the data to just those included in the independent data, and then save the file.
```r
tmp <- sp_list[sp_list$sp %in% ind_dat$sp,]

# Remove the 'verified' column
tmp$verified <- NULL

# Replace the spaces in the species names with dots; this will make things easier
# for us later (column headings with spaces in them are annoying)
library(stringr)
tmp$sp <- str_replace(tmp$sp, " ", ".")

write.csv(tmp, paste0("data/processed_data/", ind_dat$project_id[1], "_species_list.csv"), row.names = F)
```
5 & 6: A ‘site x species’ matrix of the number of independent detections and species counts across the full study period:
“data/processed_data/AlgarRestorationProject_30min_Independent_total_observations.csv”
“data/processed_data/AlgarRestorationProject_30min_Independent_total_counts.csv”
```r
# Total counts
# Station / Days / Species
tmp <- row_lookup

# Calculate the number of days at each site
total_obs <- tmp %>%
  group_by(placename) %>%
  summarise(days = n())

# Convert to a data frame
total_obs <- as.data.frame(total_obs)

# Add columns for each species
total_obs[, levels(ind_dat$sp)] <- NA

# Duplicate for counts
total_count <- total_obs

# For each station, count the number of individuals/observations
for(i in 1:nrow(total_obs))
{
  tmp <- ind_dat[ind_dat$placename==total_obs$placename[i],]

  tmp_stats <- tmp %>% group_by(sp, .drop=F) %>% summarise(obs=n(), count=sum(animal_count))

  total_obs[i, as.character(tmp_stats$sp)] <- tmp_stats$obs
  total_count[i, as.character(tmp_stats$sp)] <- tmp_stats$count
}

# Save them
write.csv(total_obs, paste0("data/processed_data/", ind_dat$project_id[1], "_", independent, "min_independent_total_observations.csv"), row.names = F)
write.csv(total_count, paste0("data/processed_data/", ind_dat$project_id[1], "_", independent, "min_independent_total_counts.csv"), row.names = F)
```
7 & 8: A ‘site_month x species’ matrix of the number of independent detections and species counts for each month in the study period:
“data/processed_data/AlgarRestorationProject_30min_Monthly_total_observations.csv”
“data/processed_data/AlgarRestorationProject_30min_Monthly_total_counts.csv”
```r
# Monthly counts
# Station / Month / Days / Species
tmp <- row_lookup

# Simplify the date to monthly
tmp$date <- substr(tmp$date,1,7)

# Calculate the number of days in each month
mon_obs <- tmp %>%
  group_by(placename, date) %>%
  summarise(days = n())

# Convert to a data frame
mon_obs <- as.data.frame(mon_obs)

# Add species columns
mon_obs[, levels(ind_dat$sp)] <- NA

# Duplicate for counts
mon_count <- mon_obs

# For each month, count the number of individuals/observations
for(i in 1:nrow(mon_obs))
{
  tmp <- ind_dat[ind_dat$placename==mon_obs$placename[i] & substr(ind_dat$timestamp,1,7)==mon_obs$date[i],]

  tmp_stats <- tmp %>% group_by(sp, .drop=F) %>% summarise(obs=n(), count=sum(animal_count))

  mon_obs[i, as.character(tmp_stats$sp)] <- tmp_stats$obs
  mon_count[i, as.character(tmp_stats$sp)] <- tmp_stats$count
}

write.csv(mon_obs, paste0("data/processed_data/", ind_dat$project_id[1], "_", independent, "min_independent_monthly_observations.csv"), row.names = F)
write.csv(mon_count, paste0("data/processed_data/", ind_dat$project_id[1], "_", independent, "min_independent_monthly_counts.csv"), row.names = F)
```
9 & 10: A ‘site_week x species’ matrix of the number of independent detections and species counts for each week in the study period:
“data/processed_data/AlgarRestorationProject_30min_Weekly_total_observations.csv”
“data/processed_data/AlgarRestorationProject_30min_Weekly_total_counts.csv”
```r
# Weekly format
# Station / Week / Days / Species
tmp <- row_lookup

# Simplify the date to year-week
tmp$date <- strftime(tmp$date, format = "%Y-W%U")
# The week counter W01 starts at the first Sunday of the year;
# everything before that is W00. Weeks do not roll across years.

# Calculate the number of days in each week
week_obs <- tmp %>%
  group_by(placename, date) %>%
  summarise(days = n())

# Convert to a data frame
week_obs <- as.data.frame(week_obs)

# Add species columns
week_obs[, levels(ind_dat$sp)] <- NA

# Duplicate for counts
week_count <- week_obs

# For each week, count the number of individuals/observations
for(i in 1:nrow(week_obs))
{
  tmp <- ind_dat[ind_dat$placename==week_obs$placename[i] & strftime(ind_dat$timestamp, format = "%Y-W%U")==week_obs$date[i],]

  tmp_stats <- tmp %>% group_by(sp, .drop=F) %>% summarise(obs=n(), count=sum(animal_count))

  week_obs[i, as.character(tmp_stats$sp)] <- tmp_stats$obs
  week_count[i, as.character(tmp_stats$sp)] <- tmp_stats$count
}

write.csv(week_obs, paste0("data/processed_data/", ind_dat$project_id[1], "_", independent, "min_independent_weekly_observations.csv"), row.names = F)
write.csv(week_count, paste0("data/processed_data/", ind_dat$project_id[1], "_", independent, "min_independent_weekly_counts.csv"), row.names = F)
```
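If the `%U` convention is unfamiliar, a quick console check (with hypothetical dates) shows how it behaves; days before the year's first Sunday fall into week 00:

```r
# %U weeks start on Sunday; 2019-01-06 was the first Sunday of 2019
strftime(as.Date(c("2019-01-01", "2019-01-06", "2019-01-07")), format = "%Y-W%U")
```

Here the first date falls in W00 and the latter two in W01.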
11 & 12: A ‘site_day x species’ matrix of the number of independent detections and species counts for each day a station was active in the study period:
“data/processed_data/AlgarRestorationProject_30min_Daily_total_observations.csv”
“data/processed_data/AlgarRestorationProject_30min_Daily_total_counts.csv”
```r
# Daily format
# Station / Date / Days / Species
tmp <- row_lookup
tmp$days <- 1

# Add species columns
tmp[, levels(ind_dat$sp)] <- NA

day_obs <- tmp
day_count <- tmp

# For each day, count the number of individuals/observations
for(i in 1:nrow(day_obs))
{
  tmp <- ind_dat[ind_dat$placename==day_obs$placename[i] & strftime(ind_dat$timestamp, format = "%Y-%m-%d")==day_obs$date[i],]

  tmp_stats <- tmp %>% group_by(sp, .drop=F) %>% summarise(obs=n(), count=sum(animal_count))

  day_obs[i, as.character(tmp_stats$sp)] <- tmp_stats$obs
  day_count[i, as.character(tmp_stats$sp)] <- tmp_stats$count
}

write.csv(day_obs, paste0("data/processed_data/", ind_dat$project_id[1], "_", independent, "min_independent_daily_observations.csv"), row.names = F)
write.csv(day_count, paste0("data/processed_data/", ind_dat$project_id[1], "_", independent, "min_independent_daily_counts.csv"), row.names = F)
```
6.6.1 Final data check
As a final check that our code creates robust analysis data frames, we verify that the observations/counts are the same across each temporal scale (total/monthly/weekly/daily) using the following tables.
Observations
```r
tmp <- cbind(data.frame("Time"=c("Total", "Monthly", "Weekly", "Daily")),
             rbind(colSums(total_obs[,2:ncol(total_obs)]),
                   colSums(mon_obs[,3:ncol(mon_obs)]),
                   colSums(week_obs[,3:ncol(week_obs)]),
                   colSums(day_obs[,3:ncol(day_obs)])))

tmp %>%
  kbl() %>%
  kable_styling(full_width = T) %>%
  column_spec(1, bold = T, border_right = T) %>%
  kableExtra::scroll_box(width = "100%")
```
Time | days | Alces.alces | Canis.latrans | Canis.lupus | Cervus.canadensis | Lepus.americanus | Lontra.canadensis | Lynx.canadensis | Martes.americana | Odocoileus.virginianus | Oryctolagus.cuniculus | Rangifer.tarandus | Tamiasciurus.hudsonicus | Ursus.americanus | Vulpes.vulpes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Total | 20350 | 176 | 8 | 93 | 1 | 241 | 1 | 52 | 24 | 562 | 2 | 119 | 26 | 228 | 10 |
Monthly | 20350 | 176 | 8 | 93 | 1 | 241 | 1 | 52 | 24 | 562 | 2 | 119 | 26 | 228 | 10 |
Weekly | 20350 | 176 | 8 | 93 | 1 | 241 | 1 | 52 | 24 | 562 | 2 | 119 | 26 | 228 | 10 |
Daily | 20350 | 176 | 8 | 93 | 1 | 241 | 1 | 52 | 24 | 562 | 2 | 119 | 26 | 228 | 10 |
Counts
```r
tmp <- cbind(data.frame("Time"=c("Total", "Monthly", "Weekly", "Daily")),
             rbind(colSums(total_count[,2:ncol(total_count)]),
                   colSums(mon_count[,3:ncol(mon_count)]),
                   colSums(week_count[,3:ncol(week_count)]),
                   colSums(day_count[,3:ncol(day_count)])))

tmp %>%
  kbl() %>%
  kable_styling(full_width = T) %>%
  column_spec(1, bold = T, border_right = T) %>%
  kableExtra::scroll_box(width = "100%")
```
Time | days | Alces.alces | Canis.latrans | Canis.lupus | Cervus.canadensis | Lepus.americanus | Lontra.canadensis | Lynx.canadensis | Martes.americana | Odocoileus.virginianus | Oryctolagus.cuniculus | Rangifer.tarandus | Tamiasciurus.hudsonicus | Ursus.americanus | Vulpes.vulpes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Total | 20350 | 227 | 8 | 133 | 1 | 244 | 1 | 53 | 24 | 678 | 2 | 159 | 26 | 256 | 10 |
Monthly | 20350 | 227 | 8 | 133 | 1 | 244 | 1 | 53 | 24 | 678 | 2 | 159 | 26 | 256 | 10 |
Weekly | 20350 | 227 | 8 | 133 | 1 | 244 | 1 | 53 | 24 | 678 | 2 | 159 | 26 | 256 | 10 |
Daily | 20350 | 227 | 8 | 133 | 1 | 244 | 1 | 53 | 24 | 678 | 2 | 159 | 26 | 256 | 10 |