Chapter 5 Use case 4: jointModel() with semi-paired data

This final use case shows how to fit and interpret the joint model with semi-paired eDNA and traditional survey data.

In many cases, you may have some sites where only eDNA samples were collected, while other sites have paired eDNA and traditional samples (i.e., semi-paired data). We can fit the joint model to this semi-paired data and use the sites with paired data to make predictions about the sites with only un-paired eDNA data.



Let’s use the same tidewater goby data from uses cases 1 and 2, except let’s pretend we did not have traditional seine sampling at two of the 39 sites.

5.1 Simulate semi-paired data

Both eDNA and traditional survey data should have a hierarchical structure:

  • Sites (primary sample units) within a study area
  • eDNA and traditional samples (secondary sample units) collected from each site
  • eDNA subsamples (replicate observations) taken from each eDNA sample

First replace the traditional count data at sites 4 and 34 with NA to simulate a scenario where we did not have traditional samples at these sites.

library(eDNAjoint)
data(gobyData)

# create new dataset of semi-paired goby data
gobyData_semipaired <- gobyData
gobyData_semipaired$count[4,] <- NA
gobyData_semipaired$count[34,] <- NA

We can now see that we have no data at site 4:

gobyData_semipaired$count[4,]
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 
## NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

And site 34:

gobyData_semipaired$count[34,]
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 
## NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

It’s important to ensure that we replace the data with NA so that we have an empty row indicating missing data.

5.2 Prepare the data

Now let’s look at the format of the data. You can see that the data is a list of four matrices all with the same number of rows that represent each site (n=39). Across all matrices, rows in the data should correspond to the same sites (i.e., row one in all matrices corresponds to site one, and row 31 in all matrices corresponds to site 31).

qPCR.K, qPCR.N, and count are required for all implementations of jointModel(), and site.cov is optional and is used in use case 2.

Let’s look first at qPCR.K. These are the total number of positive qPCR detections for each site (row) and eDNA sample (column). The number of columns should equal the maximum number of eDNA samples collected at any site. Blank spaces are filled with NA at sites where fewer eDNA samples were collected than the maximum.

For example, at site one, 11 eDNA samples were collected, and at site five, 20 eDNA samples were collected.

head(gobyData_semipaired$qPCR.K)
##      1 2 3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
## [1,] 0 0 0  0  0  0  0  0  0  0  0 NA NA NA NA NA NA NA NA NA NA NA
## [2,] 0 0 0  0  0  0  0  0  0  0  0 NA NA NA NA NA NA NA NA NA NA NA
## [3,] 0 0 0  0  0  0  0  0  0  0  0 NA NA NA NA NA NA NA NA NA NA NA
## [4,] 0 6 6  4  6  5  4  6  5  3 NA NA NA NA NA NA NA NA NA NA NA NA
## [5,] 6 6 4  6  6  6  5  4  2  2  0  6  5  5  6  6  6  5  5  4 NA NA
## [6,] 0 0 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Now let’s look at qPCR.N. These are the total number of qPCR eDNA subsamples (replicate observations) collected for each site (row) and eDNA sample (column). In this data, six qPCR replicate observations were collected for each eDNA sample. Notice that the locations of the NAs in the matrix match qPCR.K.

head(gobyData_semipaired$qPCR.N)
##      1 2 3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
## [1,] 6 6 6  6  6  6  6  6  6  6  6 NA NA NA NA NA NA NA NA NA NA NA
## [2,] 6 6 6  6  6  6  6  6  6  6  6 NA NA NA NA NA NA NA NA NA NA NA
## [3,] 6 6 6  6  6  6  6  6  6  6  6 NA NA NA NA NA NA NA NA NA NA NA
## [4,] 6 6 6  6  6  6  6  6  6  6 NA NA NA NA NA NA NA NA NA NA NA NA
## [5,] 6 6 6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6 NA NA
## [6,] 6 6 6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Next, let’s look at count. This data is from seine sampling for tidewater gobies. Each integer refers to the catch of each seine sample (i.e., catch per unit effort, when effort = 1). Again, the rows correspond to sites, and the columns refer to replicated seine samples (secondary sample units) at each site, with a maximum of 22 samples. Blank spaces are filled with NA at sites where fewer seine samples were collected than the maximum, as well as at sites with no seine samples. Notice that the number of rows (i.e., number of sites) in the count data still equals the number of rows in the qPCR data, even though some of these sites have no count data.

dim(gobyData_semipaired$count)[1] == dim(gobyData_semipaired$qPCR.K)[1]
## [1] TRUE

For more data formatting guidance, see section 2.1.1.

5.3 Fit the model

Now that we understand our data, let’s fit the joint model. The key arguments of this function include:

  1. data: list of qPCR.K, qPCR.N, and count matrices
  2. family: probability distribution used to model the seine count data. A poisson distribution is chosen here.
  3. p10priors: Beta distribution parameters for the prior on the probability of false positive eDNA detection, \(p_{10}\). c(1,20) is the default specification. More on this later.
  4. q: logical value indicating the presence of multiple traditional gear types. Here, we’re only using data from one traditional method.

More parameters exist to further customize the MCMC sampling, but we’ll stick with the defaults.

# run the joint model 
goby.fit.semi <- jointModel(data = gobyData_semipaired, family = 'poisson', 
                            p10priors = c(1,20), q=FALSE)

goby.fit.semi is a list containing:

  1. model fit (goby.fit.semi$model) of the class ‘stanfit’ and can be accessed and interpreted using all functions in the rstan package.
  2. initial values used for each chain in MCMC (goby.fit.semi$inits)

5.4 Interpret the output

5.4.1 Summarize posterior distributions

Let’s interpret goby.fit.semi. Use jointSummarize() to see the posterior summaries of the model parameters. Let’s look at the estimates of the expected catch rate at each site, \(\mu\).

jointSummarize(goby.fit.semi$model, par = 'mu')
##           mean se_mean    sd    2.5%   97.5%    n_eff Rhat
## mu[1]    0.021   0.000 0.022   0.001   0.079 15731.37    1
## mu[2]    0.021   0.000 0.021   0.001   0.078 16119.57    1
## mu[3]    0.021   0.000 0.021   0.001   0.078 15756.26    1
## mu[4]    5.422   0.014 1.735   2.880   9.562 14596.38    1
## mu[5]    2.429   0.002 0.296   1.892   3.039 16720.88    1
## mu[6]    0.081   0.001 0.084   0.002   0.307 14007.13    1
## mu[7]    0.338   0.001 0.103   0.170   0.570 17931.51    1
## mu[8]    0.278   0.003 0.303   0.007   1.111 14559.48    1
## mu[9]    0.048   0.000 0.049   0.001   0.179 14179.35    1
## mu[10]   1.515   0.002 0.202   1.147   1.934 17514.82    1
## mu[11]  26.957   0.007 1.095  24.836  29.176 25048.08    1
## mu[12]   0.021   0.000 0.021   0.001   0.079 16179.82    1
## mu[13]   0.021   0.000 0.021   0.001   0.078 15649.24    1
## mu[14]   0.034   0.000 0.035   0.001   0.127 13626.96    1
## mu[15]   0.101   0.000 0.063   0.018   0.255 18681.63    1
## mu[16]   0.066   0.000 0.050   0.006   0.196 15343.17    1
## mu[17]   0.026   0.000 0.026   0.001   0.099 14417.54    1
## mu[18]   0.020   0.000 0.020   0.000   0.075 18741.47    1
## mu[19]   0.047   0.000 0.047   0.001   0.174 12437.86    1
## mu[20]   0.224   0.001 0.081   0.097   0.416 17875.97    1
## mu[21]   0.081   0.001 0.085   0.002   0.311 14980.05    1
## mu[22]   0.254   0.001 0.111   0.086   0.521 20849.87    1
## mu[23]  90.537   0.021 3.243  84.377  96.990 23092.32    1
## mu[24]   1.054   0.002 0.218   0.680   1.531 16759.97    1
## mu[25]  21.043   0.010 1.596  18.059  24.286 23458.54    1
## mu[26]   1.285   0.002 0.213   0.901   1.738 17987.44    1
## mu[27]   2.724   0.003 0.453   1.913   3.699 18132.13    1
## mu[28]   8.157   0.005 0.755   6.758   9.707 20515.64    1
## mu[29]   3.230   0.003 0.453   2.422   4.187 18279.26    1
## mu[30]  13.704   0.015 2.046   9.966  18.130 19826.94    1
## mu[31]  94.525   0.031 4.331  86.174 103.229 20112.78    1
## mu[32]   0.023   0.000 0.023   0.001   0.084 15884.89    1
## mu[33]   0.081   0.001 0.083   0.002   0.306 15098.78    1
## mu[34]   0.260   0.001 0.147   0.063   0.626 17786.37    1
## mu[35]   0.126   0.001 0.132   0.003   0.490 14438.16    1
## mu[36]   0.021   0.000 0.022   0.001   0.080 15774.31    1
## mu[37]   0.023   0.000 0.023   0.001   0.086 15885.14    1
## mu[38] 168.959   0.039 5.869 157.792 180.557 22295.33    1
## mu[39]   0.023   0.000 0.024   0.001   0.085 16659.07    1

This summarizes the mean, sd, and quantiles of the posterior estimates of \(\mu\), as well as the effective sample size (n_eff) and Rhat for the parameters.

The model estimates \(\beta\) using data from sites with paired samples and uses this estimate to make predictions for \(\mu_{i=4}\) and \(\mu_{i=34}\) where no traditional seine samples were collected.

We can also use functions from the bayesplot package to examine the posterior distributions.

First let’s look at the posterior distributions for \(\mu_{i=4}\) and \(\mu_{i=29}\). Traditional seine samples were collected at site 29 but not at site 4.

library(bayesplot)
# plot posterior distribution, highlighting median and 80% credibility interval
mcmc_areas(as.matrix(goby.fit.semi$model), 
           pars = c('mu[4]', 'mu[29]'), prob = 0.8)

As you could expect, the credibility interval for the expected catch rate at site four is much wider than the credibility interval at site 29, since no traditional samples were collected at site four.

Let’s do the same for \(\mu_{i=7}\) and \(\mu_{i=34}\). Traditional seine samples were collected at site 7 but not at site 34.

# plot posterior distribution, highlighting median and 80% credibility interval
mcmc_areas(as.matrix(goby.fit.semi$model), 
           pars = c('mu[7]', 'mu[34]'), prob = 0.8)

Again, the credibility interval for the expected catch rate at site 34 is wider with a longer tail than the credibility interval at site 7, since no traditional samples were collected at site 34.

Note: It’s important to consider that fitting the joint model with semi-paired data will be more successful if there is paired data at most sites. Since the data at paired sites is leveraged to make predictions at the un-paired sites, you want to maximize the amount of data to leverage.