## 17.1 Worked Example

Michelle DiLeo

### 1. Overview

#### a. Background and Goals

In this lab, you will use the `gstudio`

package to analyse parent-offspring data for the wildflower *Pulsatilla vulgaris*. You will learn:

- how to conduct landscape genetic analyses on pollen pools,
- how to conduct paternity analysis to quantify contemporary pollen flow, and
- how to relate realized pollen flow events to landscape variables.

The data are from DiLeo et al. (in press). The data include genotypes from seeds, mothers, and all potential fathers from seven focal populations. The study area in the Franconian Jura in Germany is the same as for the *Dianthus carthusianorum* data from Week 7 (see introductory video by Yessica Rico for the Dianthus data set to learn more about the system).

Here, we will recreate one analysis from DiLeo et al. (in press), looking at the ecological factors that best explain pollen immigration rates of the seven populations.

#### b. Data set

There are three data files associated with this lab:

- “pulsatilla_genotypes.csv”, which contains the genotypes of offspring, mothers, and potential fathers from seven populations,
- “pulsatilla_momVariables.csv”, which contains ecological variables collect for each mother, and
- “pulsatilla_population.csv”, which contains landscape and population size information for each of the seven populations.

#### c. Required packages

Install some packages needed for this worked example.

```
if(!requireNamespace("popgraph", quietly = TRUE))
{
install.packages(c("tmap", "geosphere", "proto", "sampling",
"seqinr", "spacetime", "spdep"), dependencies=TRUE)
remotes::install_github("dyerlab/popgraph")
}
if(!requireNamespace("gstudio", quietly = TRUE)) remotes::install_github("dyerlab/gstudio")
```

### 2. Import Genotypes

The package `gstudio`

makes it very easy to import genotype data (see Week 1). Typically, we store our genotype data in a spreadsheet with one line per individual and two columns per locus. Using the `read_population`

function, we tell `gstudio`

which columns contain our loci. The argument `type="column"`

indicates that our raw text file stores genotypes in two columns per locus.

#### a. Import genotype data

```
dat <- gstudio::read_population(system.file("extdata",
"pulsatilla_genotypes.csv", package="LandGenCourse"),
type = "column", locus.columns = 6:19)
dat$ID <- as.factor(dat$ID)
```

Let’s take a look at the data:

```
## ID OffID Population X Y loc1_a loc2_a loc3_a loc4_a loc5_a
## 1 62 0 A21 4426941 5427173 340:340 422:422 413:425 446:448 121:121
## 2 64 0 A21 4426933 5427178 334:334 424:424 417:434 444:449 122:133
## 3 65 0 A21 4426936 5427173 338:340 417:422 417:419 446:452 135:137
## 4 66 0 A21 4426937 5427174 340:344 422:422 411:431 446:448 122:126
## 5 68 0 A21 4426934 5427171 336:342 417:422 423:438 448:448 119:123
## 6 69 0 A21 4426933 5427166 336:346 422:422 417:421 444:452 122:124
## loc6_a loc7_a
## 1 155:157 242:244
## 2 155:161 242:252
## 3 153:157 244:246
## 4 157:159 242:252
## 5 155:155 244:256
## 6 155:161 256:256
```

There are 12 columns:

**ID**: A unique identifier for every adult.**OffID**: Zero if the sample is an adult and > 0 if the sample is an offspring.**Population**: The population of origin.**2 coordinate columns, X & Y**: The geographic coordinates of the sample (Euclidean coordinate system Gauss-Krueger GK4 with CRS(“+init=epsg:31468”)).**7 locus columns**: For each locus, the microsatellite genotype is stored in a single column with alleles separated by “:”.

#### b. Plot map of sampling locations

To create a map with `qmplot`

(see Week 4), here we do the following:

- Use functions from
`dplyr`

to get mean coordinates per population - Convert the coordinates to lat-lon coordinates with
`st_project`

from the`sf`

package. We use the EPSG reference numbers (http://epsg.io/) to specify the ‘from’ and ‘to’ coordinate systems. - Convert
`Coords`

to an`sf`

object`Coords.sf`

with lat/lon coordinates.

```
Coords <- dat %>% group_by(Population) %>%
summarize(X = mean(X), Y = mean(Y))
Coords <- data.frame(Coords, sf::sf_project(from = "+init=epsg:31468",
to = "+init=epsg:4326", pts = Coords[c("X", "Y")]))
names(Coords)[4:5] <- c("Longitude", "Latitude")
Coords.sf <- sf::st_as_sf(Coords, coords=c("Longitude", "Latitude"), crs=sf::st_crs(4326))
```

- Create an interactive map
`MyMap`

with`tmap`

, specify the basemaps. - Add population labels with
`tm_text`

.

`## tmap mode set to interactive viewing`

### 3. Pollen pool genetic structure

#### a. Subtract maternal contribution to offspring genotype

Let’s look at the data for the mother with the ID=3083 and her associated offspring:

```
## ID OffID Population X Y loc1_a loc2_a loc3_a loc4_a
## 207 3083 0 A25 4422659 5425370 338:340 417:424 392:392 452:452
## 208 3083 1 A25 4422659 5425370 338:340 417:424 392:409 446:452
## 209 3083 3 A25 4422659 5425370 338:338 417:424 452:452
## 210 3083 4 A25 4422659 5425370 340:340 417:422 392:411 448:452
## 211 3083 5 A25 4422659 5425370 340:340 417:424 392:392 452:452
## 212 3083 6 A25 4422659 5425370 338:338 422:424 392:411 452:452
## 213 3083 7 A25 4422659 5425370 338:340 424:424 392:392 452:452
## 214 3083 8 A25 4422659 5425370 340:340 417:417 392:417 452:452
## loc5_a loc6_a loc7_a
## 207 121:121 157:157 244:252
## 208 121:121 157:157 240:244
## 209 121:121 157:157
## 210 121:121 155:157 236:252
## 211 121:121 157:157 244:244
## 212 121:121 157:159 242:244
## 213 121:121 157:157 244:244
## 214 121:121 155:157 242:244
```

The first row is the mother and the subsequent rows are her offspring with OffIDs 1-8.

Now, use the `minus_mom`

function to remove the mother’s contribution to the offspring genotypes. You will be left with just the paternal contribution. The allele frequecies of all paternal contributions associated with a single mother is called a “pollen pool”.

In some cases, the pollen contribution is ambiguous when the offspring has both alleles in common with the mother (e.g. see OffID 1, loc1_a and loc2_a). In such cases, both alleles are retained.

```
## ID OffID Population X Y loc1_a loc2_a loc3_a loc4_a loc5_a
## 208 3083 1 A25 4422659 5425370 338:340 417:424 409 446 121
## 209 3083 3 A25 4422659 5425370 338 417:424 452 121
## 210 3083 4 A25 4422659 5425370 340 422 411 448 121
## 211 3083 5 A25 4422659 5425370 340 417:424 392 452 121
## 212 3083 6 A25 4422659 5425370 338 422 411 452 121
## 213 3083 7 A25 4422659 5425370 338:340 424 392 452 121
## 214 3083 8 A25 4422659 5425370 340 417 417 452 121
## loc6_a loc7_a
## 208 157 240
## 209 157
## 210 155 236
## 211 157 244
## 212 159 242
## 213 157 244
## 214 155 242
```

The pollen pool data can be analysed as any other population genetic data.

Note: There are some missing genotypes for the offspring that trigger warnings, which you may ignore. Here we suppressed the warnings to reduce output.

#### b. ‘TwoGener’ analysis (AMOVA of pollen pool data)

For example, we can conduct an AMOVA (see Week 4) to test if pollen pools belonging to different mothers are signficantly differentiated from one another. When an AMOVA is applied to pollen pools it is called a ‘TwoGener’ analysis (Two-generational analysis; Smouse et al. 2001, Dyer et al. 2004).

Here we use the `amova`

function from the `pegas`

package to perform a hierarchical analysis of variance on genetic distance matrices (AMOVA), with moms nested within patches (this may take a while).

```
D <- gstudio::genetic_distance(pollen,mode="amova")
D <- as.dist(D)
Moms <- pollen$ID
Populations <- as.factor(pollen$Population)
amova.result <- pegas::amova(D ~ Populations/Moms, nperm=500)
amova.result
```

```
##
## Analysis of Molecular Variance
##
## Call: pegas::amova(formula = D ~ Populations/Moms, nperm = 500)
##
## SSD MSD df
## Populations 664.3079 110.717976 6
## Moms 1252.7296 36.844988 34
## Error 1895.1503 6.916607 274
## Total 3812.1878 12.140725 314
##
## Variance components:
## sigma2 P.value
## Populations 1.6747 0
## Moms 3.9278 0
## Error 6.9166
##
## Phi-statistics:
## Populations.in.GLOBAL (Phi_CT) Moms.in.GLOBAL (Phi_ST)
## 0.1337729 0.4475164
## Moms.in.Populations (Phi_SC)
## 0.3621954
##
## Variance coefficients:
## a b c
## 7.619648 7.920406 43.405291
```

Calculate phi from the variance components:

```
## phi
## Populations 0.1337729
## Moms 0.3137434
## Error 0.5524836
```

Looking at the variance components and the phi-statistics (analogue to F-statistics), what can be said about differentiation among pollen pools?

**Questions:**

- Are pollen pools more differentiated among populations or among mothers within populations?
- What does this mean biologically?

#### c. Including space: Isolation by distance

The ‘TwoGener’ analysis gives a summary statistic describing overall differentiation, but this does not tell us anything about **why** the pollen pools are differentiated. In order to better understand **why**, we can analyse the data spatially just like any other landscape genetic dataset.

For example, we can test a basic hypothesis of isolation by distance by asking: are pollen pools from mothers that are geographically close more similar than those that are far? We test this by calculating pairwise genetic similarity (proportion of shared alleles) and relate this to geographic distance. For a more sophistocated analysis, the data could also be analysed as a “pollination graph” (Dyer et al. 2012), which is similar to population graphs that will be included in the Bonus Material for Week 13, but we will not do this here.

First, calculate matrix of pairwise genetic similarity among pollen pools.

`## Bray distance will be assumed to be entirely additive across loci.`

Note: the proportion of shared alleles is a measure of genetic similarity, hence we expect it to decrease with geographic distance. We convert this into a measure of genetic distance by subtracting it from 1 (see Week 5).

Calculate matrix of pairwise geographic distance among pollen pools:

Plot ‘dps’ against geographic distance.

```
par(mar=c(4,4,0,0))
plot(xy.dist[lower.tri(xy.dist)], dps[lower.tri(dps)],
xlab = "Geographic distance (m)",
ylab = "1 - Proportion of shared alleles (dps)")
abline(lm(dps[lower.tri(dps)]~xy.dist[lower.tri(xy.dist)]))
```

We see that pollen pools belonging to mothers that are spatially close are more similar. Let’s quantify the strength of this relationship using a Mantel test.

```
##
## Mantel statistic based on Pearson's product-moment correlation
##
## Call:
## vegan::mantel(xdis = xy.dist, ydis = dps)
##
## Mantel statistic r: 0.4432
## Significance: 0.001
##
## Upper quantiles of permutations (null model):
## 90% 95% 97.5% 99%
## 0.114 0.144 0.166 0.193
## Permutation: free
## Number of permutations: 999
```

Note: it seems that the `mantel`

function in `vegan`

automatically tests a one-sided alternative that the Mantel correlation is greater than zero. This means that if we had used ‘Dps’ as a measure of genetic similarity, the default setting would have resulted in a p-value of 0.999. Unfortunately, for now, this is not discussed in the help file and there is no option to change the alternative.

### 4. Paternity analysis

‘TwoGener’ and related methods are sensitive to inbreeding and strong genetic structure of adult populations, although methods exist to correct for some of these things (e.g. Dyer et al. 2004). However, it is often the best (or only) option for quantifying pollen flow for high density species where other analyses such as paternity assignment cannot be conducted.

In our case, we have complete sampling of potential fathers within the seven populations (i.e., patches) from which mothers and their offspring were sampled. Therefore, we are able to conduct a paternity analysis, which will give us more detailed information about contemporary pollen flow.

Note that we do not have complete sampling of all potential fathers within the larger study region, and so we assume that any seed that is not assigned to a father is a pollen immigration event. In this section, we will conduct a paternity analysis using the `gstudio`

package and relate the results to landscape variables.

#### a. Exclusion probabilities

First let’s check if our genetic markers have sufficient information (genetic resolution) to discriminate among alternative fathers by calculating exclusion probabilites.

```
## e.clusion probabilities
pollen.freqs <- gstudio::frequencies(dat)
p.excl <- gstudio::exclusion_probability( pollen.freqs )
p.excl
```

```
## Locus Pexcl PexclMax Fraction
## 1 loc1_a 0.6817515 0.8926727 0.7637194
## 2 loc2_a 0.3359125 0.7434082 0.4518547
## 3 loc3_a 0.7614844 0.9218755 0.8260165
## 4 loc4_a 0.5451972 0.7718336 0.7063663
## 5 loc5_a 0.5984514 0.8981062 0.6663481
## 6 loc6_a 0.5998659 0.8537849 0.7025960
## 7 loc7_a 0.7002772 0.8424075 0.8312808
```

The second column, ‘Pexcl’, gives us the information we need. We can see that certain loci have poor ability to exclude alternative potential fathers (e.g. ‘loc2’) and others are much more informative (e.g. ‘loc3’). Luckily, we have multiple loci to use and the multilocus exclusion probability can be calculated by multiplying across loci.

The code below does the following:

`1 - unlist(p.excl$Pexcl)`

: extract the elements of`p.excl.Pexcl`

(the exclusion probabilities for all loci) as a vector and calculage for each locus, the complement of the exclusion probability.

`1 - prod()`

: multiplies the probabilities, and take the complement to get the multilocus exculsion probability.

`## [1] 0.9988959`

We see that when using all seven loci, we achieve a very high exlusion probability, and that on average we should be able to exclude 99.9 of alternative fathers.

#### b. Paternity analysis

We will use the `gstudio`

function `paternity`

, which conducts a fractional-allocation paternity analyis. This approach is useful when full exclusion cannot be achieved. In case multiple potential fathers cannot be exluded for a particular offspring, a likelihood of paternity is calculated for each non-excluded father based on Mendelian transition probabilites.

Note that in the original paper from which these data originate, we used a more sophisticated approach implemented in the program ‘COLONY2’. ‘COLONY2’ can account for genotyping errors, whereas the `gstudio`

function `paternity`

approach cannot and so we had to throw out some of our data here.

Now let’s conduct a paternity analysis for the seeds from mother 3083:

```
# Select all offspring of mom 3083:
offspring.3083 <- subset(dat, OffID!=0 & ID == "3083")
# Select mom 3083:
mother.3083 <- subset(dat, ID == "3083" & dat$OffID == 0 )
# Select all potential fathers in the same population as mom 3083:
fathers.3083 <- subset(dat, OffID == 0 & Population %in% offspring.3083$Population)
# Paternity analysis:
pat.3083 <- gstudio::paternity(offspring.3083, mother.3083, fathers.3083 )
pat.3083
```

```
## MomID OffID DadID Fij
## 1 3083 1 212 1.00000000
## 2 3083 3 3083 0.55172414
## 3 3083 3 218 0.13793103
## 4 3083 3 230 0.13793103
## 5 3083 3 3084 0.06896552
## 6 3083 3 255 0.03448276
## 7 3083 3 261 0.03448276
## 8 3083 3 3081 0.03448276
## 9 3083 5 3083 0.72727273
## 10 3083 5 234 0.18181818
## 11 3083 5 219 0.04545455
## 12 3083 5 239 0.04545455
## 13 3083 6 215 1.00000000
## 14 3083 7 3083 0.88888889
## 15 3083 7 219 0.05555556
## 16 3083 7 239 0.05555556
## 17 3083 8 212 1.00000000
```

The output shows the probability of paternity (Fij) for each offspring-father pair, which is calcualted from Mendelian inheritance transition probabilities. Some offspring have multiple potential fathers and so appear in the table more than once (e.g. Offsprings 3 and 5). A higher ‘Fij’ value indcates that the Offspring-father pair is more likely.

Now let’s do the paternity analysis for the entire data set. First, we define a function `get.parentage`

that does the paternity analysis for the offspring of one mom (i.e., one family).

```
# make a dataframe just for the offspring:
offspring <- subset(dat, OffID!=0)
## h.re is the function that we will apply to all mothers:
get.parentage <- function(x){
tmp.offspring <- subset(offspring, ID == x)
tmp.mother <- subset(dat, ID == x & OffID == 0)
tmp.fathers <- subset(dat, OffID == 0 & Population %in% tmp.offspring$Population)
return(gstudio::paternity(tmp.offspring, tmp.mother, tmp.fathers ))
}
```

We could use `lapply`

to apply `get.parentage`

to each mom and her offspring. Here, we use an alternative function `map`

from package `purr`

(which is part of the new ‘tidyverse’ set of packages by Hadley Wickham). This allows us to address the following issue: Because we have not sampled all potential fathers in the larger study region, `paternity`

will return an error when all offspring of a certain mother cannot be assigned to any father.

To get around this, we will use the `possibly`

function from the library `purrr`

. When an error occurs, it will simply return `NA`

instead of stopping our function. The function `map`

knows how to apply a ‘purrr-ified’ function to an object.

This will take about a minute to run:

```
# purrr-ify the function so that NA is returned when an error pops up:
possible_pat <- purrr::possibly(get.parentage, otherwise = NA_real_)
# run the function and store the output:
# list of results for each mother:
pat.all <- purrr::map(unique(offspring$ID), possible_pat)
## c.nvert the list to a dataframe:
pat.all <- do.call(rbind, pat.all[!is.na(pat.all)])
```

For the purposes of this tutorial, we will keep only the highest-probability father for each offspring. With your own data, you should choose some sort of threshold (e.g. keep only those with probability > 0.9) or you can conduct analyses using the probability as weights.

```
## c.eate a temporary ID that combines the MomID and the OffID
pat.all$tmpID <- paste(pat.all$MomID, pat.all$OffID, sep="_")
## g.t rid of all rows with duplicated tmpIDs, leaving just the first entry for each
pat.sub <- pat.all[!duplicated(pat.all$tmpID),]
## g.t rid of the tmpID column
pat.sub <- pat.sub[,1:4] ## g.t rid of the tmp_ID column
```

#### c. Visualize reconstructed pollen flow events in space

We can use the `spiderplot_data`

function from `gstudio`

to visualize the paternity assignments. `spiderplot_data`

takes the output from `paternity`

and combines it with spatial XY coordinate information. The output includes the coordinates of both mother (X, Y) and father (Xend, Yend) for each offspring. Here we will also add in the population IDs so that we can visualize each of the seven populations separately. Note that the function asks for “longitude” and “latitude” but any form of spatial XY coordinates will do.

```
pat.sub <- gstudio::spiderplot_data(pat.sub, dat, longitude = "X", latitude = "Y")
# Join data to add population IDs
pat.sub <- merge(pat.sub, dat[, c("Population", "ID" ,"OffID")],
by.x=c("MomID", "OffID"), by.y=c("ID", "OffID"), all.x=T)
head(pat.sub)
```

```
## MomID OffID DadID Fij X Xend Y Yend Population
## 1 3080 1 211 1 4422657 4422658 5425371 5425371 A25
## 2 3080 10 211 1 4422657 4422658 5425371 5425371 A25
## 3 3080 2 214 1 4422657 4422658 5425371 5425370 A25
## 4 3080 3 214 1 4422657 4422658 5425371 5425370 A25
## 5 3080 4 211 1 4422657 4422658 5425371 5425371 A25
## 6 3080 5 211 1 4422657 4422658 5425371 5425371 A25
```

Use `ggplot2`

to visualize the reconstructed pollen flow events. Let’s look at population ‘A25’ first. The arrows point to mothers, and arrows without lines indicate selfing events. The darkness of the lines is proportional to the number of pollen flow events between a particular mother and father.

```
pop <- "A25"
ggplot() +
geom_point(data=dat[dat$Population==pop,],
aes(X,Y),size=3, color="red") +
geom_segment(data=pat.sub[pat.sub$Population==pop,],
aes(x=X, y=Y, xend=Xend, yend=Yend), linewidth=0.5, alpha=0.2,
arrow = arrow(ends = "first", length = unit(0.3, "cm"))) +
theme(legend.position = "none")
```

Try modifying the code to look at patterns from other populations.

**Questions:**

- Do pollen flow events appear to be amongst near-neigbours?
- Are there clear differences in pollen flow patterns among populations?

#### d. Visualize dispersal kernel

We can calculate the distance of each pollen flow event, and then plot the distribution to visualize the dispersal kernel.

Calculate distance for each event:

```
pat.sub$pollen.dist <- unlist(lapply(1:nrow(pat.sub),
function(x) dist(rbind(c(pat.sub$X[x], pat.sub$Y[x]),
c(pat.sub$Xend[x], pat.sub$Yend[x]))) ))
```

Plot the distribution of pollination events that are greater than 0 m (i.e. excluding selfing):

### 5. Linking paternity to ecological variables

Now that we have conducted our paternity analysis, we can ask which ecological factors explain the patterns that we see.

- First we will look within populations: what factors explain pollen flow distances within populations?
- Second, we will look at the population level: what factors explain pollen immigration rates?

#### a. Explain pollen flow within populations

We have collected some information about mothers, which we now add to the ‘pat.sub’ dataframe. Specifically, we have measured:

**Mother isolation**: how far a mom is from other conspecifics, and**Local floral density**: how many flowers are within 2 m of mother plants.

We are only interested in outcrossed pollen events, so we make a new data frame that excludes selfing.

```
# read in the data
mom.vars <- read.csv(system.file("extdata",
"pulsatilla_momVariables.csv", package="LandGenCourse"))
## e.clude selfing
pat.outcrossed <- subset(pat.sub, MomID != DadID)
## a.d mom variables to pat.outcrossed
pat.outcrossed <- merge(pat.outcrossed, mom.vars, by.x = "MomID",
by.y = "ID", all.x=T)
# look at the data
head(pat.outcrossed)
```

```
## MomID OffID DadID Fij X Xend Y Yend Population pollen.dist
## 1 3080 1 211 1 4422657 4422658 5425371 5425371 A25 0.8632989
## 2 3080 10 211 1 4422657 4422658 5425371 5425371 A25 0.8632989
## 3 3080 2 214 1 4422657 4422658 5425371 5425370 A25 1.3829147
## 4 3080 3 214 1 4422657 4422658 5425371 5425370 A25 1.3829147
## 5 3080 4 211 1 4422657 4422658 5425371 5425371 A25 0.8632989
## 6 3080 5 211 1 4422657 4422658 5425371 5425371 A25 0.8632989
## mom.isolation flower.density
## 1 10.97917 11
## 2 10.97917 11
## 3 10.97917 11
## 4 10.97917 11
## 5 10.97917 11
## 6 10.97917 11
```

Let’s run some models to test if `mom.isolation`

or `flower.density`

explain pollen flow distances. We used mixed models with population and mother ID as random effects to control for multiple sampling from the same mom and populations (see Week 6 videos for linear mixed models).

Note: for model selection with varying fixed effects and constant random effects, we fit the models with maximum likelihood ML, hence we set `REML=F`

(see Week 12).

```
# specify the model
mod <- lme4::lmer(log(pollen.dist) ~ scale(log(flower.density)) +
scale(log(mom.isolation)) + (1|Population/MomID),
data=pat.outcrossed, na.action = "na.fail", REML=F)
```

`## boundary (singular) fit: see help('isSingular')`

Here we use the function `dredge`

from package `MuMIn`

to compare all nested submodels and select the ‘best’ model. This is not generally recommended (see Week 12), as submodels may not be biologically meaningful. In this case, there is good reason to hypothesize that either mom isolation or local flower density, or both, would affect pollination distance.

`## Fixed term is "(Intercept)"`

```
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
```

```
## Global model call: lme4::lmer(formula = log(pollen.dist) ~ scale(log(flower.density)) +
## scale(log(mom.isolation)) + (1 | Population/MomID), data = pat.outcrossed,
## REML = F, na.action = "na.fail")
## ---
## Model selection table
## (Int) scl(log(flw.dns)) scl(log(mom.isl)) df logLik AICc delta weight
## 4 1.075 -0.3609 0.3493 6 -102.581 218.0 0.00 0.698
## 2 1.076 -0.5151 5 -105.193 221.0 2.99 0.157
## 3 1.094 0.5977 5 -105.280 221.1 3.16 0.144
## 1 1.081 4 -110.975 230.3 12.35 0.001
## Models ranked by AICc(x)
## Random terms (all models):
## 1 | Population/MomID
```

We see that the best model includes both flower density and mom isolation (for model selection, see Week 12 worked example and Week 7 video).

Note: we should follow up with residual analysis etc., which we’ll skip here to keep it short.

Let’s plot the relationships:

```
Mom.isolation.plot <- ggplot(pat.outcrossed,
aes(x=log(mom.isolation), y=log(pollen.dist))) +
geom_point() + stat_smooth(method="lm", se=F) +
xlab("Log(mom isolation)") +
ylab("Log(pollen flow distance)")
Flower.density.plot <- ggplot(pat.outcrossed,
aes(x=log(flower.density),
y=log(pollen.dist))) + geom_point() +
stat_smooth(method="lm", se=F) +
xlab("Log(flower density)") +
ylab("Log(pollen flow distance)")
cowplot::plot_grid(Mom.isolation.plot, Flower.density.plot,
labels = c("A", "B"))
```

```
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
```

**Questions:**

- Which moms (isolated/non-isolated, surrounded by more/less flowers) tend to receive pollen from further away?
- What biological processes might explain these patterns?

#### b. Explain pollen flow at the population level

Now let’s look at the population level. First we need to calculate how many offspring were unable to be assigned a father. We do this by adding the first four columns of the paternity analysis output (which contains only successful assignments) to the offspring dataframe (which includes all offspring, regardless of assignment). If there is no assigment, ‘Fij’ will be NA in the newly merged data frame.

```
offspring <- merge(offspring, pat.sub[,1:4], by.x=c("ID", "OffID"),
by.y=c("MomID", "OffID"), all.x=T)
head(offspring)
```

```
## ID OffID Population X Y loc1_a loc2_a loc3_a loc4_a loc5_a
## 1 3080 1 A25 4422657 5425371 334:358 417:424 392:411 446:448 121:121
## 2 3080 2 A25 4422657 5425371 334:358 422:424 392:434 442:446 121:121
## 3 3080 3 A25 4422657 5425371 334:354 422:424 392:434 446:446 118:121
## 4 3080 4 A25 4422657 5425371 334:354 422:424 392:392 446:446 121:121
## 5 3080 5 A25 4422657 5425371 352:358 417:424 392:411 446:446 121:121
## 6 3080 6 A25 4422657 5425371 352:354 417:424 392:392 446:446 121:121
## loc6_a loc7_a DadID Fij
## 1 157:165 242:258 211 1
## 2 157:157 252:258 214 1
## 3 157:157 236:258 214 1
## 4 153:157 242:258 211 1
## 5 157:165 240:258 211 1
## 6 153:157 240:256 211 1
```

Calculate the number of outside pollination events per population by counting the number of NAs in the ‘Fij’ column:

Total number of pollination events per population (should be same as the number of seeds sampled):

Let’s put this information into a data frame and add ecological information that we have gathered about each population.

Specifically we have measured the proportion of forest cover within radii of 50, 100, 250, 500, and 1000 metres from population centroids, and have measured population size as the total number of flowering *P. vulgaris* plants per population.

```
# turn it into a dataframe:
pop.df <- data.frame(Population=names(num.out),
num.out=as.vector(num.out), num.tot=as.vector(num.tot))
# read in the population variable data:
pop.vars <- read.csv(system.file("extdata",
"pulsatilla_population.csv", package="LandGenCourse"))
## a.d the population variables to our outside pollination data:
pop.df <- merge(pop.df, pop.vars, by="Population")
pop.df
```

```
## Population num.out num.tot forest.50 forest.100 forest.250 forest.500
## 1 A03 2 13 0.3531695 0.5305529 0.8116160 0.8809752
## 2 A21 17 48 0.6036269 0.5291262 0.5581914 0.6131979
## 3 A25 22 72 0.3622251 0.2200647 0.2445019 0.2904836
## 4 A26 13 57 0.4948187 0.4838188 0.5212385 0.4778212
## 5 A41 11 57 0.6843467 0.8159172 0.9087349 0.7966062
## 6 A45 15 53 0.3824289 0.4297280 0.4747971 0.4027923
## 7 G05a 4 15 0.6152850 0.4102232 0.3046323 0.2119176
## forest.1000 population.size
## 1 0.8723172 44
## 2 0.4553216 21
## 3 0.2106881 57
## 4 0.2314417 22
## 5 0.6717830 15
## 6 0.2988908 22
## 7 0.2810409 46
```

Now we can run a model to see which variables best explain the proportion of immigrated pollen per population. Because we only have n=7 populations, we limit the models to a single explanatory variable.

Our response variable is a proportion, so we use a glm (generalized linear model) with binomial error distribution. This model does not include any random effects.

To automatize this, we first fit a ‘full’ model with all six potential predictors. We won’t interpret this model as it would be overfitted, but we’ll use it as input for the next step.

```
# specify the model
mod2 <- glm(cbind(num.out, num.tot-num.out) ~ forest.50 + forest.100 +
forest.250 + forest.500 + forest.1000 + population.size,
family = binomial(link = "logit"), data = pop.df,
na.action = "na.fail")
```

We use the function `dredge`

again to select the best model among those with only a single predictors (plus the intercept-only model). This is specified with the argument `m.lim=c(0,1)`

.

`## Fixed term is "(Intercept)"`

```
## Global model call: glm(formula = cbind(num.out, num.tot - num.out) ~ forest.50 +
## forest.100 + forest.250 + forest.500 + forest.1000 + population.size,
## family = binomial(link = "logit"), data = pop.df, na.action = "na.fail")
## ---
## Model selection table
## (Int) frs.100 frs.1000 frs.250 frs.50 frs.500 ppl.siz df logLik AICc
## 1 -1.0120 1 -16.061 34.9
## 5 -0.5516 -0.8899 2 -14.866 36.7
## 2 -0.5758 -0.9236 2 -15.118 37.2
## 17 -0.6049 -0.8182 2 -15.268 37.5
## 3 -0.7071 -0.8159 2 -15.318 37.6
## 33 -1.1830 0.00555 2 -15.807 38.6
## 9 -0.7084 -0.6136 2 -15.877 38.8
## delta weight
## 1 0.00 0.392
## 5 1.81 0.158
## 2 2.31 0.123
## 17 2.62 0.106
## 3 2.71 0.101
## 33 3.69 0.062
## 9 3.83 0.058
## Models ranked by AICc(x)
```

We see that the best model is the intercept-only (null) model, suggesting that none of the predictors fit the data very well. However, the next-best model includes ‘forest.250’ with a delta AICc of less than 2, and we take this model to be equally likely.

Note: Interestingly, in our original paper with found that forest.500 was the best predictor, followed by population size. This suggests that the choice of paternity analysis method really can make a difference.

Let’s plot the model including forest.250:

```
forest.250.mod <- glm(cbind(num.out, num.tot-num.out) ~ forest.250,
family=binomial(link="logit"), data=pop.df)
ggplot(pop.df, aes(x=forest.250, y=num.out/num.tot)) + geom_point() +
geom_line(aes(x=forest.250, y=predict(forest.250.mod, type="response"))) +
xlab("Proportion of forest at 250 m") +
ylab("Proportion of immigrant pollen")
```

We see that populations surrounded by more forest receive less outside pollen, although it is apparent that there is one outlier (population A21) with a high proportion of immigration pollen but intermediate levels of surrounding forest. This is likely also a product of our choice of paternity anlaysis - in our original paper (DiLeo et al. in press) we were able to assign a much higher proportion of fathers in popualtion A21, giving us lower pollen immigration rates.

### 6. References

DiLeo MF, R Holderegger and HH Wagner (

*in press*). Contemporary pollen flow as a multiscale process: evidence from the insect-pollinated herb*Pulsatilla vulgaris*. Journal of Ecology. https://doi.org/10.1111/1365-2745.12992Dyer RJ, RD Westfall, VL Sork VL and PE Smouse (2004). Two-generation analysis of pollen flow across a landscape V: a stepwise approach for extracting factors contributing to pollen structure. Heredity 92: 204-211.

Dyer RJ, Chan DM, Gardiakos VA, Meadows CA (2012) Pollination graphs: quantifying pollen pool covariance networks and the influence of intervening landscape on genetic connectivity in the North American understory tree, Cornus florida L. Landscape ecology 27:239-51.

Smouse PE, RJ Dyer, RD Westfall and VL Sork (2001). Two‐generation analysis of pollen flow across a landscape. I. Male gamete heterogeneity among females. Evolution 55: 260–271.