Chapter 2 Use case 1: basic use of jointModel()

This first use case will show how to fit and interpret the joint model with paired eDNA and traditional survey data.



The data used in this example comes from a study by Schmelzle and Kinziger (2016) about endangered tidewater gobies (Eucyclogobius newberryi) in California. Environmental DNA samples were collected at 39 sites, along with paired traditional seine sampling. The eDNA data is detection/non-detection data generated through quantitative polymerase chain reaction (qPCR).

2.1 Prepare the data

Both eDNA and traditional survey data should have a hierarchical structure:

  • Sites (primary sample units) within a study area
  • eDNA and traditional samples (secondary sample units) collected from each site
  • eDNA subsamples (replicate observations) taken from each eDNA sample

Ensuring that your data is formatted correctly is essential for successfully using eDNAjoint. Let’s first explore the structure of the goby data.

library(eDNAjoint)
data(gobyData)
str(gobyData)
## List of 4
##  $ qPCR.N  : num [1:39, 1:22] 6 6 6 6 6 6 6 6 6 6 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:22] "1" "2" "3" "4" ...
##  $ qPCR.K  : num [1:39, 1:22] 0 0 0 0 6 0 0 0 0 5 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:22] "1" "2" "3" "4" ...
##  $ count   : int [1:39, 1:22] 0 0 0 0 0 0 0 0 0 1 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:22] "1" "2" "3" "4" ...
##  $ site.cov: num [1:39, 1:5] -0.711 -0.211 -1.16 -0.556 -0.988 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:5] "Salinity" "Filter_time" "Other_fishes" "Hab_size" ...

You can see that the data is a list of four matrices, each with the same number of rows, one per site (n = 39). Across all matrices, rows should correspond to the same sites (i.e., row one in every matrix corresponds to site one, and row 31 in every matrix corresponds to site 31).

qPCR.K, qPCR.N, and count are required for all implementations of jointModel(), and site.cov is optional and will be used in use case 2.
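
A quick way to confirm that the matrices share the same site structure is to check that each element of the list has one row per site:

# each element should have one row per site (n = 39)
sapply(gobyData, nrow)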

Let’s look first at qPCR.K. This matrix contains the number of positive qPCR detections for each site (row) and eDNA sample (column). The number of columns should equal the maximum number of eDNA samples collected at any site. Blank spaces are filled with NA at sites where fewer eDNA samples were collected than the maximum.

For example, at site one, 11 eDNA samples were collected, and at site five, 20 eDNA samples were collected.

head(gobyData$qPCR.K)
##      1 2 3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
## [1,] 0 0 0  0  0  0  0  0  0  0  0 NA NA NA NA NA NA NA NA NA NA NA
## [2,] 0 0 0  0  0  0  0  0  0  0  0 NA NA NA NA NA NA NA NA NA NA NA
## [3,] 0 0 0  0  0  0  0  0  0  0  0 NA NA NA NA NA NA NA NA NA NA NA
## [4,] 0 6 6  4  6  5  4  6  5  3 NA NA NA NA NA NA NA NA NA NA NA NA
## [5,] 6 6 4  6  6  6  5  4  2  2  0  6  5  5  6  6  6  5  5  4 NA NA
## [6,] 0 0 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
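
Because sites with fewer samples are padded with NA, you can recover the number of eDNA samples collected at each site by counting the non-NA entries in each row:

# number of eDNA samples collected at each site
rowSums(!is.na(gobyData$qPCR.K))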

Now let’s look at qPCR.N. This matrix contains the total number of qPCR eDNA subsamples (replicate observations) collected for each site (row) and eDNA sample (column). In this data, six qPCR replicate observations were collected for each eDNA sample. Notice that the locations of the NAs match those in qPCR.K.

head(gobyData$qPCR.N)
##      1 2 3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
## [1,] 6 6 6  6  6  6  6  6  6  6  6 NA NA NA NA NA NA NA NA NA NA NA
## [2,] 6 6 6  6  6  6  6  6  6  6  6 NA NA NA NA NA NA NA NA NA NA NA
## [3,] 6 6 6  6  6  6  6  6  6  6  6 NA NA NA NA NA NA NA NA NA NA NA
## [4,] 6 6 6  6  6  6  6  6  6  6 NA NA NA NA NA NA NA NA NA NA NA NA
## [5,] 6 6 6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6 NA NA
## [6,] 6 6 6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Next, let’s look at count. This data comes from seine sampling for tidewater gobies. Each value is the catch from a single seine sample (i.e., catch per unit effort, with effort = 1). Again, rows correspond to sites, and columns refer to replicate seine samples (secondary sample units) at each site, with a maximum of 22 samples. Blank spaces are filled with NA at sites where fewer seine samples were collected than the maximum.

In this example, the count data are integers, but continuous values can be used in the model (see Eq. 1.3 in the model description).
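
If your traditional data were continuous (for example, catch per unit effort with effort not equal to 1), you would choose a continuous distribution through the family argument of jointModel() rather than 'poisson' or 'negbin'. A minimal sketch, assuming your installed version of eDNAjoint supports a 'gamma' family and using a hypothetical data list:

# sketch: continuous traditional data with a gamma distribution
# (family = 'gamma' is assumed to be available; myContinuousData is hypothetical)
# fit.cont <- jointModel(data = myContinuousData, family = 'gamma',
#                        p10priors = c(1, 20), q = FALSE)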

head(gobyData$count)
##      1 2 3  4  5  6  7   8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
## [1,] 0 0 0  0  0  0  0   0  0  0  0 NA NA NA NA NA NA NA NA NA NA NA
## [2,] 0 0 0  0  0  0  0   0  0  0  0 NA NA NA NA NA NA NA NA NA NA NA
## [3,] 0 0 0  0  0  0  0   0  0  0  0 NA NA NA NA NA NA NA NA NA NA NA
## [4,] 0 4 1  0  2  1 38 112  1 15 NA NA NA NA NA NA NA NA NA NA NA NA
## [5,] 0 0 0  2  0  0  0   0  0  0  0  0  4  1  0  2  0  8 NA NA NA NA
## [6,] 0 0 0 NA NA NA NA  NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

2.1.1 Converting data from long to wide

Using eDNAjoint requires your data to be in “wide” format. Wide vs. long refers to the shape of your data, in tidyverse jargon (see the tidyr documentation on pivoting for more). Below is an example of how to convert your data to wide format.

First let’s simulate “long” qPCR data.

## data dimensions
# number of sites (primary sample units)
nsite <- 4 
# number of eDNA samples (secondary sample units)
neDNA_samples <- 6 
# number of traditional samples (secondary sample units)
ntraditional_subsamples <- 8 
# number of eDNA subsamples (replicate observations)
neDNA_subsamples <- 3

# simulate qPCR data
qPCR_long <- data.frame(
  site = rep(1:nsite, each = neDNA_samples),
  eDNA_sample = rep(1:neDNA_samples, times = nsite),
  N = rep(neDNA_subsamples, nsite*neDNA_samples),
  K = c(
    rbinom(neDNA_samples, neDNA_subsamples, 0.1), # site 1 positive detections
    rbinom(neDNA_samples, neDNA_subsamples, 0.6), # site 2 positive detections
    rbinom(neDNA_samples, neDNA_subsamples, 0), # site 3 positive detections
    rbinom(neDNA_samples, neDNA_subsamples, 0.4) # site 4 positive detections
  )
)

head(qPCR_long)
##   site eDNA_sample N K
## 1    1           1 3 0
## 2    1           2 3 0
## 3    1           3 3 0
## 4    1           4 3 0
## 5    1           5 3 0
## 6    1           6 3 0

And simulate “long” traditional data:

# simulate traditional count data
count_long <- data.frame(
  site = rep(1:nsite, each = ntraditional_subsamples),
  traditional_sample = rep(1:ntraditional_subsamples, times = nsite),
  count = c(
    rpois(ntraditional_subsamples, 0.5), # site 1 count data
    rpois(ntraditional_subsamples, 4), # site 2 count data
    rpois(ntraditional_subsamples, 0), # site 3 count data
    rpois(ntraditional_subsamples, 2) # site 4 count data
  )
)

head(count_long)
##   site traditional_sample count
## 1    1                  1     0
## 2    1                  2     3
## 3    1                  3     0
## 4    1                  4     1
## 5    1                  5     1
## 6    1                  6     1

Now let’s convert the data from “long” to “wide”.

library(tidyverse)
## ── Attaching core tidyverse packages ─────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ───────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# N: number of qPCR eDNA subsamples (i.e., replicate observations) 
# per eDNA sample (i.e., secondary sample)
qPCR_N_wide <- qPCR_long %>% 
  pivot_wider(id_cols=site,
              names_from=eDNA_sample,
              values_from=N) %>% 
  arrange(site) %>% 
  select(-site)

# K: number of positive qPCR detections in each 
# eDNA sample (i.e., secondary sample)
qPCR_K_wide <- qPCR_long %>% 
  pivot_wider(id_cols=site,
              names_from=eDNA_sample,
              values_from=K) %>% 
  arrange(site) %>% 
  select(-site)

# count: number of individuals in each traditional sample 
# (i.e., secondary sample)
count_wide <- count_long %>% 
  pivot_wider(id_cols=site,
              names_from=traditional_sample,
              values_from=count) %>% 
  arrange(site) %>% 
  select(-site)

Finally, we’ll bundle this data up into a named list of matrices.

data <- list(
  qPCR.N = as.matrix(qPCR_N_wide),
  qPCR.K = as.matrix(qPCR_K_wide),
  count = as.matrix(count_wide)
)
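
Before fitting, it can help to verify that the matrices line up: the two qPCR matrices should have identical dimensions and matching NA positions, and every matrix should have one row per site. A minimal check:

# basic structural checks on the bundled data
stopifnot(
  identical(dim(data$qPCR.N), dim(data$qPCR.K)),      # same dimensions
  identical(is.na(data$qPCR.N), is.na(data$qPCR.K)),  # NAs in the same places
  nrow(data$count) == nrow(data$qPCR.N)               # one row per site everywhere
)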

2.2 Fit the model

Now that we understand our data, let’s fit the joint model. The key arguments of this function include:

  1. data: list of qPCR.K, qPCR.N, and count matrices
  2. family: probability distribution used to model the seine count data. A Poisson distribution is chosen here.
  3. p10priors: Beta distribution parameters for the prior on the probability of false positive eDNA detection, \(p_{10}\). c(1,20) is the default specification. More on this later.
  4. q: logical value indicating the presence of multiple traditional gear types. Here, we’re only using data from one traditional method.

More parameters exist to further customize the MCMC sampling, but we’ll stick with the defaults.

# run the joint model 
goby.fit1 <- jointModel(data = gobyData, family = 'poisson', 
                        p10priors = c(1,20), q=FALSE)

goby.fit1 is a list containing:

  1. the model fit (goby.fit1$model), an object of class ‘stanfit’ that can be accessed and interpreted using functions from the rstan package
  2. the initial values used for each chain in MCMC (goby.fit1$inits)
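
For example, since goby.fit1$model is a ‘stanfit’ object, the posterior draws can be pulled out with standard coercion methods:

# extract posterior draws as a matrix (rows = draws, columns = parameters)
draws <- as.matrix(goby.fit1$model)
dim(draws)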

2.3 Model selection

We previously chose a Poisson distribution to describe the count data. This distribution assumes that the mean equals the variance of the count data at each site. Perhaps we want to test how that specification compares to a model using a negative binomial distribution, which allows the variance of the count data to exceed the mean.
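
As a rough heuristic for whether overdispersion might matter, you can compare the pooled mean and variance of the observed counts:

# pooled mean and variance of the seine counts (ignoring NAs)
mean(gobyData$count, na.rm = TRUE)
var(as.vector(gobyData$count), na.rm = TRUE)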

# fit a new model with negative binomial distribution
goby.fit2 <- jointModel(data = gobyData, family = 'negbin', 
                        p10priors = c(1,20), q=FALSE)

We can now compare how these models fit our data using the jointSelect() function, which performs leave-one-out cross-validation using functions from the loo package.

# perform model selection
jointSelect(modelfits = list(goby.fit1$model, goby.fit2$model))
##        elpd_diff se_diff
## model2     0.0       0.0
## model1 -2371.8     619.7

These results tell us that model2 has the higher Bayesian LOO estimate of the expected log pointwise predictive density (elpd_loo): it is listed in the first row, and elpd_diff shows how far the other model falls below it. This means that goby.fit2, the negative binomial model, is likely a better fit to the data.

2.4 Interpret the output

2.4.1 Summarize posterior distributions

Let’s walk through the interpretation functions using goby.fit1; the same workflow applies to goby.fit2. Use jointSummarize() to see the posterior summaries of the model parameters.

jointSummarize(goby.fit1$model, par = c('p10','beta'))
##       mean se_mean    sd  2.5% 97.5%     n_eff  Rhat
## p10  0.002   0.000 0.001 0.001 0.003 13267.232 1.000
## beta 0.656   0.001 0.095 0.469 0.841  8782.801 1.001

This summarizes the mean, sd, and quantiles of the posterior estimates of \(p_{10}\) and \(\beta\), as well as the effective sample size (n_eff) and Rhat for the parameters.

The mean estimated probability of a false positive eDNA detection is 0.002. \(\beta\) is the parameter that scales the sensitivity of eDNA sampling relative to traditional sampling.

We can also use functions from the bayesplot package to examine the posterior distributions and chain convergence.

First let’s look at the posterior distribution for \(p_{10}\).

library(bayesplot)
# plot posterior distribution, highlighting median and 80% credibility interval
mcmc_areas(as.matrix(goby.fit1$model), pars = 'p10', prob = 0.8)

Next let’s look at chain convergence for \(p_{10}\) and \(\mu_{i=1}\).

# this will plot the MCMC chains for p10 and mu at site 1
mcmc_trace(rstan::extract(goby.fit1$model, permuted = FALSE), 
           pars = c('p10', 'mu[1]'))

2.4.2 Effort necessary to detect presence

To further highlight the relative sensitivity of eDNA and traditional sampling, we can use detectionCalculate() to find the units of survey effort necessary to detect presence of the species. Here, detecting presence refers to producing at least one true positive eDNA detection or catching at least one individual in a traditional survey.

This function finds the median number of survey units necessary to detect species presence with probability 0.9, if the expected catch rate, \(\mu\), is 0.1, 0.5, or 1.

detectionCalculate(goby.fit1$model, mu=c(0.1,0.5,1), probability = 0.9)
##       mu n_traditional n_eDNA
## [1,] 0.1            24     16
## [2,] 0.5             5      4
## [3,] 1.0             3      2

We can also plot these comparisons. mu.min and mu.max define the x-axis in the plot.

detectionPlot(goby.fit1$model, mu.min=0.1, mu.max =1, probability = 0.9)

2.4.3 Calculate \(\mu_{critical}\)

The probability of a true positive eDNA detection, \(p_{11}\), is a function of the expected catch rate, \(\mu\). Low values of \(\mu\) correspond to low probability of eDNA detection. Since the probability of a false-positive eDNA detection is non-zero, the probability of a false positive detection may be higher than the probability of a true positive detection at very low values of \(\mu\).

\(\mu_{critical}\) describes the value of \(\mu\) where the probability of a false positive eDNA detection equals the probability of a true positive eDNA detection. This value can be calculated using muCritical().

muCritical(goby.fit1$model, ci = 0.9)
## $median
## [1] 0.002979823
## 
## $lower_ci
## Highest Density Interval: 1.11e-03
## 
## $upper_ci
## Highest Density Interval: 5.06e-03

This function calculates \(\mu_{critical}\) using the entire posterior distributions of parameters from the model. The lower and upper bounds correspond to the 90% credibility interval, calculated as the highest density interval of the posterior samples.

2.5 Prior sensitivity analysis

The previous model implementation used default values for the beta prior distribution for \(p_{10}\). The choice of these prior parameters can impose undue influence on the model’s inference. The best way to investigate this is to perform a prior sensitivity analysis.

Let’s look at how three prior choices affect the posterior estimates of \(p_{10}\) and \(\beta\).
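
For reference, the prior means and standard deviations quoted below follow directly from the two Beta shape parameters:

# analytic mean and sd of a Beta(a, b) prior; here the default c(1, 20)
a <- 1
b <- 20
a / (a + b)                             # mean, ~0.048
sqrt(a * b / ((a + b)^2 * (a + b + 1))) # sd, ~0.045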

Prior choice 1: default prior parameters c(1,20). The mean and sd of this prior distribution are 0.048 and 0.045, respectively.

fit.prior.1 <- goby.fit1$model
jointSummarize(fit.prior.1, par = c('p10','beta'))
##       mean se_mean    sd  2.5% 97.5%     n_eff  Rhat
## p10  0.002   0.000 0.001 0.001 0.003 13267.232 1.000
## beta 0.656   0.001 0.095 0.469 0.841  8782.801 1.001

Prior choice 2: c(1,15). The mean and sd of this prior distribution are 0.063 and 0.058, respectively.

fit.prior.2 <- jointModel(data = gobyData, family = 'poisson', 
                          p10priors = c(1,15), q=FALSE)
jointSummarize(fit.prior.2$model, par = c('p10','beta'))
##       mean se_mean    sd  2.5% 97.5%     n_eff Rhat
## p10  0.002   0.000 0.001 0.001 0.003 13495.740    1
## beta 0.662   0.001 0.093 0.476 0.843  8840.092    1

Prior choice 3: c(1,10). The mean and sd of this prior distribution are 0.091 and 0.083, respectively.

fit.prior.3 <- jointModel(data = gobyData, family = 'poisson', 
                          p10priors = c(1,10), q=FALSE)
jointSummarize(fit.prior.3$model, par = c('p10','beta'))
##       mean se_mean    sd  2.5% 97.5%     n_eff Rhat
## p10  0.002   0.000 0.001 0.001 0.004 15727.325    1
## beta 0.667   0.001 0.095 0.483 0.849  9917.415    1

You can see that the choice of the \(p_{10}\) prior within this range has little influence on the estimated parameters.

2.6 Initial values

By default, eDNAjoint will provide initial values for parameters estimated by the model, but you can provide your own initial values if you prefer. Here is an example of providing initial values for the parameters mu, p10, and alpha as an input in jointModel().

# set number of chains
n.chain <- 4

# initial values should be a list of named lists
inits <- list()
for(i in 1:n.chain){
  inits[[i]] <- list(
    # length should equal the number of sites (dim(gobyData$count)[1]) for each chain
    mu = stats::runif(dim(gobyData$count)[1], 0.01, 5), 
    # length should equal 1 for each chain 
    p10 = stats::runif(1,0.0001,0.08),
    # length should equal 1 for each chain 
    alpha = stats::runif(1,0.05,0.2)
    )
}

# now fit the model
fit.w.inits <- jointModel(data = gobyData, initial_values = inits)
## Refer to the eDNAjoint guide for visualization tips:  https://bookdown.org/abigailkeller/eDNAjoint_vignette/tips.html#visualization-tips
# check to see the initial values that were used
fit.w.inits$inits
## $chain1
## $chain1$mu_trad
##  [1] 4.06611653 1.67601907 1.68482160 4.72819379 1.40149558 2.20568115 4.32353379 3.80244156
##  [9] 4.59688765 0.72069402 0.80944044 1.10078293 1.31161444 2.30636431 2.90337682 3.49312840
## [17] 3.13355773 3.24936785 2.09153489 1.76121942 2.94220564 2.25723384 0.04624391 4.19146293
## [25] 4.46967555 0.41054091 1.58519251 3.63498577 2.67140720 4.47725638 2.58373319 2.78155226
## [33] 0.36274071 1.51552552 0.50118176 2.75825373 1.79814097 2.46610006 4.96382571
## 
## $chain1$mu
##  [1] 4.06611653 1.67601907 1.68482160 4.72819379 1.40149558 2.20568115 4.32353379 3.80244156
##  [9] 4.59688765 0.72069402 0.80944044 1.10078293 1.31161444 2.30636431 2.90337682 3.49312840
## [17] 3.13355773 3.24936785 2.09153489 1.76121942 2.94220564 2.25723384 0.04624391 4.19146293
## [25] 4.46967555 0.41054091 1.58519251 3.63498577 2.67140720 4.47725638 2.58373319 2.78155226
## [33] 0.36274071 1.51552552 0.50118176 2.75825373 1.79814097 2.46610006 4.96382571
## 
## $chain1$log_p10
## [1] -2.770258
## 
## $chain1$<NA>
## NULL
## 
## 
## $chain2
## $chain2$mu_trad
##  [1] 2.0006232 2.0364795 0.4683033 1.0430751 1.7159229 1.1128158 4.6426035 4.5108807 0.2831293
## [10] 0.5189773 0.5695098 0.2959724 2.7590754 2.0022167 0.5734731 0.3649607 3.9643658 4.1246022
## [19] 0.9382868 2.2055430 2.8199738 4.7407737 2.3843966 3.9297155 4.4910028 2.3125928 1.8660026
## [28] 3.3213043 2.9645365 3.1051004 3.4571764 3.0409846 4.9891156 3.2099551 1.7288706 4.0565366
## [37] 3.3030191 3.6403171 1.4978137
## 
## $chain2$mu
##  [1] 2.0006232 2.0364795 0.4683033 1.0430751 1.7159229 1.1128158 4.6426035 4.5108807 0.2831293
## [10] 0.5189773 0.5695098 0.2959724 2.7590754 2.0022167 0.5734731 0.3649607 3.9643658 4.1246022
## [19] 0.9382868 2.2055430 2.8199738 4.7407737 2.3843966 3.9297155 4.4910028 2.3125928 1.8660026
## [28] 3.3213043 2.9645365 3.1051004 3.4571764 3.0409846 4.9891156 3.2099551 1.7288706 4.0565366
## [37] 3.3030191 3.6403171 1.4978137
## 
## $chain2$log_p10
## [1] -3.349045
## 
## $chain2$<NA>
## NULL
## 
## 
## $chain3
## $chain3$mu_trad
##  [1] 2.99695421 0.09057625 3.98415022 3.91308833 0.50499539 1.68798139 2.55347710 0.89065027
##  [9] 0.60918770 4.86294593 1.34739243 4.59752687 2.21442222 3.55637674 4.23580476 2.20107646
## [17] 1.17671660 0.78170485 2.25673160 3.77711585 3.82682900 0.07699152 4.59123373 0.98172936
## [25] 2.18065515 0.64172538 3.89674000 2.24100328 2.14183347 3.89530170 4.15906922 3.49548558
## [33] 2.00729023 1.46407406 1.81262553 1.01111195 0.39493405 2.68817777 1.73941542
## 
## $chain3$mu
##  [1] 2.99695421 0.09057625 3.98415022 3.91308833 0.50499539 1.68798139 2.55347710 0.89065027
##  [9] 0.60918770 4.86294593 1.34739243 4.59752687 2.21442222 3.55637674 4.23580476 2.20107646
## [17] 1.17671660 0.78170485 2.25673160 3.77711585 3.82682900 0.07699152 4.59123373 0.98172936
## [25] 2.18065515 0.64172538 3.89674000 2.24100328 2.14183347 3.89530170 4.15906922 3.49548558
## [33] 2.00729023 1.46407406 1.81262553 1.01111195 0.39493405 2.68817777 1.73941542
## 
## $chain3$log_p10
## [1] -3.600834
## 
## $chain3$<NA>
## NULL
## 
## 
## $chain4
## $chain4$mu_trad
##  [1] 4.11369274 3.20815163 2.07068574 2.94038715 4.28064689 1.04783007 3.12528817 4.08612534
##  [9] 0.81744778 0.94730030 1.88968171 3.19194577 0.84335635 3.25802200 3.30216107 2.97273400
## [17] 4.73673723 4.64996030 2.23394780 1.39865302 4.44788431 3.95464463 4.06588640 2.45849753
## [25] 1.55218769 2.56381572 1.74954671 4.93318672 0.41201408 4.83840704 3.02857553 4.60127717
## [33] 0.92424307 2.56824036 0.02303776 3.62457889 4.70564250 0.54421719 4.29598154
## 
## $chain4$mu
##  [1] 4.11369274 3.20815163 2.07068574 2.94038715 4.28064689 1.04783007 3.12528817 4.08612534
##  [9] 0.81744778 0.94730030 1.88968171 3.19194577 0.84335635 3.25802200 3.30216107 2.97273400
## [17] 4.73673723 4.64996030 2.23394780 1.39865302 4.44788431 3.95464463 4.06588640 2.45849753
## [25] 1.55218769 2.56381572 1.74954671 4.93318672 0.41201408 4.83840704 3.02857553 4.60127717
## [33] 0.92424307 2.56824036 0.02303776 3.62457889 4.70564250 0.54421719 4.29598154
## 
## $chain4$log_p10
## [1] -2.527196
## 
## $chain4$<NA>
## NULL