The computation below leverages data.table’s fread function to read the csv and answer some preliminary exploratory questions concerning the data.
# Leverage fread function from package data.table to quickly read in csv data as an R object.
<- fread("data/ABoVE_ Boutin Alberta Grey Wolf.csv")
initData
# How many wolves?
# all rows
initData[
,# data.table fast unique count for the column
# labeling the different wolves
uniqueN(`tag-local-identifier`)]
## [1] 43
# How big is the data set
::object.size(initData) utils
## 32539288 bytes
# What are the dimensions
dim(initData)
## [1] 239194 18
# View the first three rows
head(initData,3) %>% formattable(align="l") %>%
::as.datatable(rownames=FALSE,
formattableoptions=list(
searching = FALSE,
scrollX = TRUE,
columnDefs = list(list(className = 'dt-left', targets = '_all'))))
The raw data set has 239,194 observations on 46 wolves. We only need the following columns for our study:
The following computation extracts the data relevant to this study, and customizes the column labels for convenience.
# Load only the relevant columns
<- fread("data/ABoVE_ Boutin Alberta Grey Wolf.csv",
data select=c("study-local-timestamp",
"tag-local-identifier",
"location-long",
"location-lat"),
# Make sure data.table does not automatically generate factor columns
stringsAsFactors = FALSE) %>%
# Omit NAs in the data. Familiarization with how the data was collected is
# necessary to consider retaining these values and making them useful
na.omit()
# Set the column names for convenience
setnames(data,
# Vector of old names that we want to change
old=c("study-local-timestamp",
"tag-local-identifier",
"location-long",
"location-lat"),
# Vector of new more convenient names
new=c("dtg", # date time group
"cid", # component ID
"lon", # longitude
"lat")) # latitude
This study focuses on making inferences about one individual, so we use one wolf as a surrogate. The computation below identifies a wolf of interest and subsets the data to only include this wolf’s observations.
# Use data.table indexing to determine to wolf with the most data
<- data[,.(count=.N),by="cid"][count==max(count)]$cid
wolf.of.Interest
# Subset to focus on one wolf:
<- data[cid==wolf.of.Interest]
dt_init <- nrow(dt_init )
m
# Create datetime objects from dtg character strings
"dtg" := dtg %>% as.POSIXct(format="%Y-%m-%d %H:%M:%S")]
dt_init [,
# Order the data sequentially by date time group
setorder(dt_init,dtg)
# Set inter-obs time interval column
"timeInt" := dtg %>% difftime(shift(dtg),unit="secs")]
dt_init[,"timeInt" := ifelse(timeInt <= 24*3600,timeInt,NA)]
dt_init[,
# Use lubridate package to get the weekday from the date objects
"Weekday" := dtg %>% lubridate::wday(label=TRUE)]
dt_init[,
# Get the hour from the date objects
"Hour" := lubridate::hour(dtg)]
dt_init[,$Hour <- dt_init$Hour %>% Vectorize(military_time)() %>% factor(levels=Vectorize(military_time)(0:23))
dt_init
# Get the time of day for each dtg
"time" := as.ITime(dtg)]
dt_init[,
# Set group time label (for plotting)
$group <- cut(dt_init$time,
dt_initbreaks=c(0,6*3600,9*3600,12*3600,
15*3600,18*3600,21*3600,24*3600),
labels=c("0000 - 0600","0600 - 0900","0900 - 1200",
"1200 - 1500","1500 - 1800","1800 - 2100",
"2100 - 2400"))
save(dt_init,file="products/dt_init.RData")
Next we check the collection volume over the time range of the data to judge the collection bias with respect to each day of the study.
We manipulate the data and generate a bivariate heatmap to study the the collection volume on every weekday, for each hour of the day.
The wolf’s activity between his observations is unknown, which can skew our results. We need to study the time between recorded observations to judge how well the data represents the wolf’s locations.
We define the Python function below to calculate the max curvature of a line plot. We use it by setting the x-axis as the order vector and the y-axis as the distances vector, which produces a convex increasing curve. The y-value for the knee in the curve is the optimal parameter value for \(cluster\_selection\_epsilon\).
from kneed import KneeLocator
def findKnee(order,distances,smoothParam,interpMethod='polynomial'):
= KneeLocator(order,
kneedle
distances,=smoothParam,
S=interpMethod,
interp_method='convex',
curve='increasing',
direction=True)
onlinereturn kneedle.knee_y
As described here, we first calculate the spatial distance to the nearest neighbor for each observation and sort these values. We then find the maximum curvature, and plot the results.
We now apply the same method to get the optimal temporal epsilon distance.
# Imports
import numpy as np
import pandas as pd
import hdbscan
# Functions
def runHDB(points,minClusterSize,epsDist,minSampleSize=None,distMatrix=None,verbose=False):
# Organize parameters
if minSampleSize is None:
= minClusterSize # restore default
minSampleSize
# Define cluster object
if distMatrix is not None:
= hdbscan.HDBSCAN(min_cluster_size=int(minClusterSize),
clusterFinder =int(minSampleSize),
min_samples="precomputed",
metric=epsDist,
cluster_selection_epsilon="eom")
cluster_selection_method= distMatrix
X
else:
= hdbscan.HDBSCAN(min_cluster_size=int(minClusterSize),
clusterFinder =int(minSampleSize),
min_samples="haversine",
metric=(epsDist/1000)/6371,
cluster_selection_epsilon="eom")
cluster_selection_method= np.radians(points)
X if verbose:
print("Running HDBSCAN on {} observations".format(len(X)))
= clusterFinder.fit(X)
res = res.labels_
y if verbose:
print("Found {} clusters".format(pd.Series(y).max() + 1))
return y + 1
# Cluster spatially
<- runHDB(points=dt[,c("lon","lat")],
Y minClusterSize=2,
minSampleSize=100,
epsDist=epsSpat)
"spatClus" := Y]
dt[,
# Cluster temporally
<- rep(0,m)
tempClus for (i in unique(dt[spatClus != 0, spatClus])) {
<- runHDB(points=NULL,
Y distMatrix=timeDistMat[dt$spatClus==i,dt$spatClus==i],
minClusterSize=2,
minSampleSize=100,
epsDist=epsTime)
# label
$spatClus==i] <- ifelse(Y!=0,paste0(i,".",Y),0)
tempClus[dt
}
# Set cluster column
"clus" := tempClus]
dt[,save(dt,file="products/dt.RData")
# Function to find consecutive integers, and the length of each consecutive instance
<- function(x,incr=1) {
seqle if(!is.numeric(x)) x <- as.numeric(x)
<- length(x)
n <- x[-1L] != x[-n] + incr
y <- c(which(y|is.na(y)),n)
i <- list(lengths = diff(c(0L,i)),
temp values = x[head(c(0L,i)+1L,-1L)])
return(list(lengths=temp$lengths[temp$lengths > 1] - 1,
values=temp$values[temp$lengths > 1]))
}
# Quick look
seqle(which(dt$clus==1.1)) %>% as.data.table() %>% formattable(align="l") %>% as.htmlwidget()
Now that we have stored the results in our data and a list object, we can use this information to plot using leaflet. The functions available here provide a comprehensive method to convey the results of our analysis. We omit the functions in this tutorial for brevity.
The map below conveys the results of clustering spatially and temporally using HDBSCAN. The loiter information for each cluster is shown when hovering over each cluster, and a radial heatmap of time of day by weekday is shown when clicking each cluster. We group the clusters by the time of day and include interactive toggles to allow the analyst to study the wolf’s pattern of life.