# 13 Permutation Testing

Permutation tests are a type of randomization test. The theoretial difference between permutation tests and inferential tests is that with permutation tests we build the sampling distribution from the observed data, rather than infering or assuming that a sampling distribution exist.

In practice, what a permutation test does is to take your observed data and then shuffle (or permute) part of it. After each shuffle, some aspect of the data is recalculated. That could be for instance the correlation coefficient, or it could be a difference in means between two groups. The data then get randomly reshuffled again, and the test-statistic is recalculated again. This goes on for thousands of times - for as many shuffles are deemed acceptable. This is usually a minimum of 1,000 but typically at least 10,000 shuffles are done. After all the permutations (shuffles) are performed, a distribution of the statistic of interest is generated from the permutations. This is comapred to the original observed statistics (e.g. correlation coefficient, difference in group means) to see if the observed value is unusually large compared to the permuted data.

If this seems a little confusing, hopefully seeing it in action will help…

## 13.1 t-test Permutation

Let’s look at our two independent samples of exam scores:

```
library(tidyverse)
anastasia <- c(65, 74, 73, 83, 76, 65, 86, 70, 80, 55, 78, 78, 90, 77, 68)
bernadette <- c(72, 66, 71, 66, 76, 69, 79, 73, 62, 69, 68, 60, 73, 68, 67, 74, 56, 74)
# put into a dataframe:
dd <- data.frame(values = c(anastasia, bernadette),
group = c(rep("Anastasia",15), rep("Bernadette", 18))
)
dd
```

```
## values group
## 1 65 Anastasia
## 2 74 Anastasia
## 3 73 Anastasia
## 4 83 Anastasia
## 5 76 Anastasia
## 6 65 Anastasia
## 7 86 Anastasia
## 8 70 Anastasia
## 9 80 Anastasia
## 10 55 Anastasia
## 11 78 Anastasia
## 12 78 Anastasia
## 13 90 Anastasia
## 14 77 Anastasia
## 15 68 Anastasia
## 16 72 Bernadette
## 17 66 Bernadette
## 18 71 Bernadette
## 19 66 Bernadette
## 20 76 Bernadette
## 21 69 Bernadette
## 22 79 Bernadette
## 23 73 Bernadette
## 24 62 Bernadette
## 25 69 Bernadette
## 26 68 Bernadette
## 27 60 Bernadette
## 28 73 Bernadette
## 29 68 Bernadette
## 30 67 Bernadette
## 31 74 Bernadette
## 32 56 Bernadette
## 33 74 Bernadette
```

We can plot these data as boxplots to get a sense of the within group variation as well as the observed differences between the groups:

```
ggplot(dd, aes(x = group, y = values, fill = group)) +
geom_boxplot(alpha=.3, outlier.shape = NA) +
geom_jitter(width=.1, size=2) +
theme_classic() +
scale_fill_manual(values = c("firebrick", "dodgerblue"))
```

Now, from our two independent samples, we can directly observe what the difference in sample means is. This is just calculated by subtracting one sample mean from the other:

`## [1] 5.477778`

So, from our samples, we observed a difference in grades of 5.48 between the groups. Typically, we would run an independent t-test to test whether these two samples came from theoretical populations that differ in their means:

```
##
## Two Sample t-test
##
## data: anastasia and bernadette
## t = 2.1154, df = 31, p-value = 0.04253
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1965873 10.7589683
## sample estimates:
## mean of x mean of y
## 74.53333 69.05556
```

This Student’s t-test (notice `var.equal=T`

) suggests that this is a significant difference, meaning that the groups do differ in their population means.

However, this test relies on several assumptions (see section 10.7). Instead, we could apply a permutation test that is free of assumptions.

Essentially what we are going to do is ask how surprising it was to get a difference of 5.48 given our real data. Put another way, if we shuffled the data into different groups of 15 and 18 (the respective sample sizes of Anastasia and Bernadette), would we get a difference in sample means of greater or lower than 5.48? If we did this thousands of times, how many times would we get differences in sample means above 5.48?

Let’s apply this theory to just one permutation.

First, we combine all the data:

```
set.seed(1) # just to keep the random number generator the same for all of us
allscores <- c(anastasia, bernadette)
allscores
```

```
## [1] 65 74 73 83 76 65 86 70 80 55 78 78 90 77 68 72 66 71 66 76 69 79 73 62 69
## [26] 68 60 73 68 67 74 56 74
```

Next, we shuffle them into new groups of 15 and 18.:

```
## $`1`
## [1] 80 78 71 73 65 68 67 74 72 74 76 83 68 70 69
##
## $`2`
## [1] 74 90 69 68 78 66 73 76 62 56 79 65 60 73 55 77 66 86
```

We have two brand new samples that contain all of the scores from our original data, but they’ve just been shuffled around. We could look at what the difference in sample means is between these two new samples:

`## [1] 80 78 71 73 65 68 67 74 72 74 76 83 68 70 69`

`## [1] 74 90 69 68 78 66 73 76 62 56 79 65 60 73 55 77 66 86`

`## [1] 72.53333`

`## [1] 70.72222`

`## [1] 1.811111`

The difference in sample means is 1.81, which is a lot smaller than our original difference in sample means.

Let’s do this same process 10,000 times! Don’t worry too much about the details of the code. What we are doing is the above process, just putting it in a loop and asking it to do it 10,000 times. We save all the results in an object called `results`

.

```
results<-vector('list',10000)
for(i in 1:10000){
x <- split(sample(allscores), rep(1:2, c(15,18)))
results[[i]]<-mean(x[[1]]) - mean(x[[2]])
}
head(unlist(results)) # these are all our mean differences from 10,000 shuffles of the data. We're just looking at the first 6.
```

`## [1] -1.8555556 -2.5888889 4.0111111 -3.9333333 0.2222222 3.5222222`

We can actually make a histogram showing the distribution of these differences in sample means.

```
df <- data.frame(difs = unlist(results))
ggplot(df, aes(x=difs)) +
geom_histogram(color="black", fill="green", alpha=.4) +
geom_vline(color="navy",lwd=1,lty=2,xintercept = 5.48) +
theme_classic()+
ggtitle("Mean Differences from \n 10000 Permutations of Raw Data")
```

`## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.`

This histogram shows that for some of our 10,000 shuffles, we actually got some differences between our two samples of higher than 5.48 (the dotted blue line), but the vast majority of shuffles led to samples that had mean differences lower than 5.48. In fact, several shuffles led to samples where the sample of size 18 (Bernadette in the original data) had a sample mean that was higher than the sample of size 15 (Anastasia in the original data).

We can directly calculate how many times out of 10,000 shuffles we got a difference in sample means that was greater than 5.48

`## [1] 215`

To convert this to a p-value, we simply divide this value by the number of shuffles we ran - which was 10,000.

`## [1] 0.0215`

So our p-value is `p=0.0215`

which is similar to a one-tailed p-value. If we wished to have a 2-tailed p-value we would simply multiply this value by 2:

`## [1] 0.043`

**Example 2:**

Let’s take a look at a second example. Here, we have various subjects rating their anxiety levels. They do this after either taking a new anxiolytic drug or a placebo. The subjects in each group are independent of each other. The placebo group has 19 subjects and the drug group has 21 subjects.

The data:

```
placebo <- c(15, 16, 19, 19, 17, 20, 18, 14, 18, 20, 20, 20, 13, 11, 16, 19, 19, 16, 10)
drug <- c(15, 15, 16, 13, 11, 19, 17, 17, 11, 14, 10, 18, 19, 14, 13, 16, 16, 17, 14, 10, 14)
length(placebo) #19
```

`## [1] 19`

`## [1] 21`

If we were interested in doing a Student’s t-test, we might want to check whether the data are approximately normal. We could perform Shapiro-Wilk tests to do this:

```
##
## Shapiro-Wilk normality test
##
## data: drug
## W = 0.95184, p-value = 0.3688
```

```
##
## Shapiro-Wilk normality test
##
## data: placebo
## W = 0.88372, p-value = 0.02494
```

From this we find that the placebo group is not approximately normally distributed (p value of the Shapiro-Wilk test is <.05). We could do a non-parametric test such as Wilcoxon Ranked Sum test (see xxx.xxx), but an alternative strategy is to perform a permutation test.

Let’s first plot the data, and then look at our observed difference in anxiety scores between our two independent samples:

```
# put into dataframe - long format
ddf <- data.frame(anxiety = c(placebo, drug),
group = c(rep("placebo", length(placebo)),
rep("drug", length(drug))
)
)
head(ddf)
```

```
## anxiety group
## 1 15 placebo
## 2 16 placebo
## 3 19 placebo
## 4 19 placebo
## 5 17 placebo
## 6 20 placebo
```

```
#boxplots
ggplot(ddf, aes(x=group, y=anxiety, fill=group)) +
geom_boxplot(outlier.shape = NA, alpha=.4) +
geom_jitter(width=.1) +
theme_classic() +
scale_fill_manual(values=c("orange", "brown"))
```

`## [1] 2.12782`

So our observed difference in sample means is 2.128. In the permutation test, what we’ll do is shuffle all the scores randomly between the two groups, creating new samples of the same size (19 and 21). Then we’ll see what difference in sample means we get from those shuffled groups. We’ll also do this 10,000 times.

```
allvalues <- c(placebo, drug)
results<-vector('list',10000)
for(i in 1:10000){
x <- split(sample(allvalues), rep(1:2, c(19,21)))
results[[i]]<-mean(x[[1]]) - mean(x[[2]])
}
head(unlist(results)) # these are the first six of all our mean differences from 10,000 shuffles of the data.
```

`## [1] -0.8796992 -0.7794486 -1.2807018 -0.4786967 2.5288221 1.1253133`

Let’s plot the distribution of these data to see what proportion of times our shuffled groups got samples that were greater than 2.128.

```
df0 <- data.frame(difs = unlist(results))
ggplot(df0, aes(x=difs)) +
geom_histogram(color="black", fill="pink", alpha=.4) +
geom_vline(color="navy",lwd=1,lty=2,xintercept = 2.128) +
theme_classic()+
ggtitle("Mean Differences from \n 10000 Permutations of Raw Data")
```

`## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.`

It looks like very few times did we get two samples that had differences in sample means that were greater than 2.128. We can calculate exactly how many times, and express this as the proportion of times we got a difference in sample means greater than 2.128:

`## [1] 113`

`## [1] 0.0113`

So, in this case we can say that the probability of getting a difference in sample means between the drug and placebo groups that was larger than our observed difference of 2.128 was `p = 0.0109`

. This is very strong evidence that the observed difference is significantly greater than we’d expect by chance.

## 13.2 Correlation Coefficient Permutation Tests

You can apply the logic of permutation tests to almost any statistical test. Let’s look at an example for Pearson correlations.

In these data, we are looking at 15 subjects who are completing a task. We measured the time they spent on the task and their high scores.

```
## Parsed with column specification:
## cols(
## subject = col_character(),
## time = col_double(),
## score = col_double()
## )
```

```
## # A tibble: 6 x 3
## subject time score
## <chr> <dbl> <dbl>
## 1 1A 5.5 3
## 2 2B 2.4 6.9
## 3 3C 8.8 17.9
## 4 4D 7 10.5
## 5 5E 9.3 12.2
## 6 6F 2.5 3.5
```

If we make a scatterplot of the data, we can see that those who spent longer on the task tended to get higher scores:

```
# scatterplot
ggplot(df, aes(x = time, y = score)) +
geom_point() +
stat_smooth(method = "lm", se=F)
```

`## `geom_smooth()` using formula 'y ~ x'`

Using a standard approach, we could find the correlation of these two variables and run a signficance test using `cor.test()`

. We can see that there is a moderate Pearson’s r of `r=0.55`

which is statistically significant (p=0.031).

```
##
## Pearson's product-moment correlation
##
## data: df$time and df$score
## t = 2.4258, df = 13, p-value = 0.03057
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.0643515 0.8324385
## sample estimates:
## cor
## 0.5582129
```

We could take an alternative tack, and decide to do a permutation test. The idea here is again, how surprising is it to get a correlation of 0.55 with these data? Were there other ways of ordering the `x`

and `y`

variables to get higher correlation coefficients?

Let’s look at our `y`

axis variable, the `score`

:

`## [1] 3.0 6.9 17.9 10.5 12.2 3.5 11.0 7.6 8.4 13.4 10.1 9.0 10.1 17.7 6.8`

This is the original order of the data. If we use `sample()`

we can shuffle the data:

`## [1] 10.5 3.5 7.6 10.1 17.9 8.4 13.4 17.7 12.2 3.0 6.9 10.1 9.0 6.8 11.0`

Let’s shuffle the score again, but this time store it in the original dataframe:

```
## # A tibble: 15 x 4
## subject time score shuffle1
## <chr> <dbl> <dbl> <dbl>
## 1 1A 5.5 3 7.6
## 2 2B 2.4 6.9 10.1
## 3 3C 8.8 17.9 10.1
## 4 4D 7 10.5 12.2
## 5 5E 9.3 12.2 8.4
## 6 6F 2.5 3.5 13.4
## 7 7G 4.8 11 6.9
## 8 8H 4.1 7.6 3.5
## 9 9I 5 8.4 3
## 10 10J 2.9 13.4 17.7
## 11 11K 6.4 10.1 6.8
## 12 12L 7.7 9 11
## 13 13M 9.3 10.1 9
## 14 14N 8.3 17.7 17.9
## 15 15O 5.1 6.8 10.5
```

If we plot this shuffled `y`

(score) against the original `x`

(time), we now get this scatterplot, which basically shows no relationship:

```
# this is what that new column looks like:
ggplot(df, aes(x = time, y = shuffle1)) +
geom_point() +
stat_smooth(method = "lm", se=F)
```

`## `geom_smooth()` using formula 'y ~ x'`

And the correlation for this new scatterplot is really close to 0! r = 0.0005:

```
##
## Pearson's product-moment correlation
##
## data: df$time and df$shuffle1
## t = 0.0016429, df = 13, p-value = 0.9987
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5119267 0.5125988
## sample estimates:
## cor
## 0.0004556502
```

We could shuffle the score variable even more times, and directly calculate the `r`

value aginst the time variable for each shuffle using `cor()`

.

`## [1] 0.3023584`

`## [1] -0.05905503`

`## [1] -0.4665168`

`## [1] -0.435933`

As you can see, the more shuffles we do, we get varied values of `r`

. What we really should do is perform 10,000 (or another really high number) shuffles of the score variable and re-calculate `r`

against the time variable for all 10,000 of these shuffles. Don’t worry about the code below, but that’s exactly what we’re doing. We’re saving the `r`

values from the 10,000 shuffles in the object called `results`

.

```
results <- vector('list',10000)
for(i in 1:10000){
results[[i]] <- cor(df$time, sample(df$score))
}
head(unlist(results)) # this are the correlations for the first 6 of 10,000 shuffles
```

```
## [1] 0.274190962 0.005288304 -0.114492469 -0.280528642 0.235874922
## [6] 0.061278049
```

We can plot the results in a histogram, and also put a vertical line at 0.56 which was our original observed correlation between time and score from the raw unshuffled data.

```
results.df <- data.frame(x = unlist(results))
ggplot(results.df, aes(x)) +
geom_histogram(color="darkgreen",fill="lightseagreen") +
geom_vline(xintercept = 0.56, lwd=1, lty=2) +
xlab("r")
```

`## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.`

As you can see, there were a few shuffles (or permutations) that we got an `r`

value of greater than 0.56, but not that many. In fact, we can directly calculate how many:

`## [1] 163`

It turns out that 163 times out of 10,000 shuffles we got a `r`

value of greater than 0.56. WE could calculate this as a proportion by dividing by 10,000:

`## [1] 0.0163`

We can use this value as our p-value. Because it is relatively low, we could argue that we were very unlikely by chance alone to have got a `r`

value of 0.56 from our data. This suggests that the correlation between time and score is significant.

The advantages of running a permutation test is that it is free of the assumptions of normality for the Pearson’s r correlation signifiance test. It’s also a cool method, and pretty intuitive.

## 13.3 Permutation test for a Paired t-test

We can apply the same principle of permutation to the paired t-test. Rememeber, essentially the paired t-test is focused on performing a one-sample t-test on the difference in scores between the paired data - testing whether the mean of the differences could potentially come from a population with \(\mu=0\).

Let’s look at the following data that record scores for the same individual over two time points - ‘before’ and ‘after’.

```
## Parsed with column specification:
## cols(
## id = col_character(),
## before = col_double(),
## after = col_double()
## )
```

```
## # A tibble: 6 x 3
## id before after
## <chr> <dbl> <dbl>
## 1 mc 5.5 5.3
## 2 ma 5.7 5.3
## 3 co 4.4 3.3
## 4 kj 3.4 3.1
## 5 ln 5.3 5.3
## 6 oe 5.2 5.1
```

We could plot these data using a scatterplot to examine the overall trend of how scores change from before to after:

```
# make a scatterplot with the x being 'before' and y being 'after'
ggplot(ba, aes(x=before, y=after)) +
geom_point() +
theme_classic()+
geom_abline(intercept =0 , slope = 1) +
xlim(2,8)+
ylim(2,8)
```

As most of these points are below the diagonal line, this seems to suggest that the scores for the ‘before’ data seem to be lower on the whole than the scores for the ‘above’ data.

Typically, we would run a paired t-test with such data to examine if there was a difference:

```
##
## Paired t-test
##
## data: ba$before and ba$after
## t = 2.6667, df = 10, p-value = 0.02363
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1315583 1.4684417
## sample estimates:
## mean of the differences
## 0.8
```

This suggests that there is a significant difference `p<.05`

with the 95% confidence interval of the true difference in means being between 0.13 and 1.47. However, the paired t-test assumes that the data are from an approximately normal distribution. In particular, that the differences scores (the difference between the ‘before’ and ‘after’ scores for each individual) are normally distributed. We can check that using a Shapiro-Wilk test:

```
# create a difference column for the difference between before and after
ba$difference <- ba$before - ba$after
# run a Shapiro test on the difference column
shapiro.test(ba$difference)
```

```
##
## Shapiro-Wilk normality test
##
## data: ba$difference
## W = 0.82621, p-value = 0.02081
```

With the p-value here being `p<.05`

, this suggests that our data are not normally distributed. One option would be to do a non-parametric Wilcoxon-signed rank test (see section 10.12). Alternatively, we could do a permutation test.

Let’s look at our data again, and focus on the difference column.

```
## # A tibble: 11 x 4
## id before after difference
## <chr> <dbl> <dbl> <dbl>
## 1 mc 5.5 5.3 0.2
## 2 ma 5.7 5.3 0.4
## 3 co 4.4 3.3 1.1
## 4 kj 3.4 3.1 0.300
## 5 ln 5.3 5.3 0
## 6 oe 5.2 5.1 0.1
## 7 mb 3.4 3 0.400
## 8 dc 7.5 5 2.5
## 9 dg 3.4 2.1 1.30
## 10 mj 6.6 3.9 2.70
## 11 kb 5 5.2 -0.2
```

Our observed mean for the differences scores is 0.8.

`## [1] 0.8`

How likely were we to get this mean difference if our ‘before’ and ‘after’ conditions were randomized? For example, for individaul ‘mj’, their before score was 6.6 and after was 3.9 leading to a difference of 2.7. But what if their before and after were switched? Then the difference score would be -2.7. What we would like to do, is to randomly flip the before and after columns for each individual and recalculate the difference scores. Each time we do this, we will calculate the mean of the difference scores. A programmatic shortcut to doing this is to multiple each difference score randomly by either +1 or -1. Here is the first shuffle we could perform:

`## [1] -0.2 -0.4 1.1 0.3 0.0 0.1 0.4 2.5 1.3 -2.7 0.2`

`## [1] 0.2363636`

In this example, the ‘before’ and ‘after’ scores were randomly flipped for individuals ‘mc’, ‘ma’, ‘mj’ and ‘kb’. Let’s do a second shuffle:

`## [1] -0.2 0.4 -1.1 0.3 0.0 0.1 0.4 -2.5 1.3 2.7 0.2`

`## [1] 0.1454545`

In this example, the ‘before’ and ‘after’ scores were randomly flipped for individuals ‘mc’, ‘co’, ‘kj’, ‘oe’, ‘mb’, ‘dg’ and ‘mj’. In both shuffles the mean of the difference scores was less than our observed mean of 0.8.

We can put this into a loop to do it 10,000 times:

```
results <- vector('list',10000)
for(i in 1:10000){
results[[i]] <- mean(ba$difference * sample(c(-1,1), 11, replace = T))
}
```

And we can plot these results as a histogram:

```
df1 <- data.frame(difs = unlist(results))
ggplot(df1, aes(x=difs)) +
geom_histogram(color="black", fill="pink", alpha=.4, binwidth = .05) +
geom_vline(color="navy",lwd=1,lty=2,xintercept = .8) +
theme_classic()+
ggtitle("Mean Differences from \n 10000 Permutations of Raw Data")
```

We can also calculate the number of times out of 10,000 that we observed a mean difference higher than the mean of 0.8 in our original data, which is only in 19 shuffles out fo 10,000:

`## [1] 17`

We divide this number by 10,000 to get our p-value:

`## [1] 0.0017`

This suggests that we have a highly significant `p=0.002`

difference between our ‘before’ and ‘after’ data within subjects.

## 13.4 Permutation tests in Packages

Above we wrote script from scratch to perform our permutation tests. In many ways, this is our preferred approach as it is more customizable. However, in some packages there are some permutation tests already available as functions. One example is the `independence_test`

from the package `coin`

that will do a permutation t-test for between subjects. The code for this is below (this requires dataframes to be in the long format):

`## Loading required package: survival`

```
## anxiety group
## 1 15 placebo
## 2 16 placebo
## 3 19 placebo
## 4 19 placebo
## 5 17 placebo
## 6 20 placebo
```

```
##
## Asymptotic General Independence Test
##
## data: anxiety by group (drug, placebo)
## Z = -2.1998, p-value = 0.01391
## alternative hypothesis: less
```

As you can see, this gives a roughly similar result to our own permutation script.

You can also do a 2-tailed version:

```
##
## Asymptotic General Independence Test
##
## data: anxiety by group (drug, placebo)
## Z = -2.1998, p-value = 0.02782
## alternative hypothesis: two.sided
```