12 Permutation testing

Permutation testing is a testing strategy that can be used when we do not have a known sampling distribution, and also cannot (or don’t want to) use a theoretical approximation of the sampling distribution. Let’s study an example.

Assume you want to test a new medical treatment to prevent type II diabetes. You have eight 6 identical (in all aspects relevant to the medical issue) patients. To test the treatment, you randomly assign three patients to receive the treatment and three other to a control group. You then check their fasting insulin level to check if the treatment worked.

12.1 Definitions

  • Remember that the null hypothesis contains equality. In the context of the example, the null hypothesis Ho is the statement that corresponds to “no real effect of the drug”, i.e. both group have equal mean insulin levels.

  • A test statistics is a numerical function of the data whose values determines the result of the test. After being evaluated for the sample data, the result is called the observed test statistic.

  • The p-value is the probability that chance alone would produce a test statistic as extreme or more extreme as the observed test statistic.

  • A result is statistically significant if it would rarely occur by chance (i.e. under the null hypothesis, or if the null hypothesis were true, or if there is no real effect). Exactly what rarely means depends on the context of the problem.

  • The null distribution is the distribution of the test statistic if the null hypothesis is true. So the null distribution is the sampling distribution of the test statistic under the null hypothesis

Let’s start with comparing two groups and testing if there is a difference between the two groups under \(H_o\). Here is the idea behind resampling: If there is no effect of the treatment, then the split into two groups was essentially random. Let’s look at all or at least many different ways to split the sample into two groups, and see what if any difference we can observe, and how unusual the original observation was.

In our example, the Null-hypothesis is
\(H_o\) : the drug has no effect, i.e.the average fasting insulin levels are the same for both groups.

As our test statistic we use the difference between the mean fasting insulin levels of the treated group and the control group. Note that this means we can re-phrase \(H_o\) as “mean (treatment) \(\le\) mean (control)” or ” mean (control)-mean (treatment) \(\ge\) 0”.

treatment <- c(8,11,12)
control <- c(20,15,10)
test_stat <- mean(control)-mean(treatment)
#> [1] 4.666667

We see that the treatment group has are 4.6666667 mcU/ml lower insulin levels, but is that difference big enough to be statistically significant?

Here is how the thinking goes: If there is no effect of the treatment, then the split of the six observations into 2 groups is essentially random. There are \(6 \choose 3\)=20 different ways of splitting 6 patients into two groups of 3.

You have two choices: you could find all 20 different groups, compute the difference in means, and see how the original difference compares. This would be using the exact Null-distribution. It is possible for 20 groups, but becomes a pain for bigger samples.

patient1_treatment <- c(8,8,8,8,8,8,8,8,8,8,10,10,10,10,10,10,11,11,11,12)
patient2_treatment<- c(10,10,10,10,11,11,11,12,12,15,11,11,11,12,12,15,12,12,15,15)
patient3_treatment<- c(11,12,15,20,12,15,20,15,20,20,12,15,20,15,20,20,15,20,20,20)
patient1_control <- c(12,11,11,11,10,10,10,10,10,10,8,8,8,8,8,8,8,8,8,8)
patient2_control <- c(15,15,12,12,15,12,12,11,11,11,15,12,12,11,11,11,10,10,10,10)
patient3_control <- c(20,20,20,15,20,20,15,20,15,12,20,20,15,20,15,12,20,15,12,11)
observed.difference <-  mean.control-mean.treatment
#>  [1]  6.0000000  5.3333333  3.3333333  0.0000000  4.6666667
#>  [6]  2.6666667 -0.6666667  2.0000000 -1.3333333 -3.3333333
#> [11]  3.3333333  1.3333333 -2.0000000  0.6666667 -2.6666667
#> [16] -4.6666667  0.0000000 -3.3333333 -5.3333333 -6.0000000

You see that there are three (out of twenty) values as large or larger than the 4.66667 that we observed in the original experiment.

Our results mean that there is a 15% chance that the observed difference - or some difference even larger - could occur by chance. 15% is pretty large, so we say the observed difference is not statistically significant.

Listing all possible groups give us the exact sampling distribution of the difference in mean insulin level for our six patients. However, it is tedious to list all possible groups when you have a large number of observations. Instead, we could run a bunch of simulations, basically creating random groups, using R. Below, I am running N=100000 simulations.

treatment <- c(8,11,12)
control<- c(20,15,10)
combined<- c(control, treatment)
observed.difference <- mean(control)-mean(treatment)
N <- 100000
difference <- numeric(N)
for (i in 1:N)
  index <- sample(6, size=3, replace =FALSE)
difference[i] <- mean(combined[index])- mean(combined[-index])

I visualize the results with a histogram. You can just use the built-in basic plot. The red line marks the observed difference.

hist(difference,xlab="difference control-treatment", breaks = seq(from = -6, to =6, by=2),main = "empirical distribution")
abline(v=observed.difference, col="red")

Or you could get all fancy and use ggplot:

#> Warning: package 'ggplot2' was built under R version 4.3.2
data <- data.frame(difference)
ggplot(data,aes(x=difference))+geom_histogram(color="pink",fill="white", binwidth=2)+
  labs(title="empirical sampling distribution, absolute frequency")
ggplot(data,aes(x=difference, y=after_stat(density)))+geom_histogram(color="pink",fill="white", binwidth=2)+
  labs(title="empirical sampling distribution, relative frequency")

Now you’ll have to find the percentage of values of the (empirical) test statistic at or above the observed value. There are different ways to do that. You could find the indices of the values at or above the observed value,and then find the length of that vector, like this:

index <- which(difference >= observed.difference)
pvalue <- (length(index)+1)/(N+1)
#> [1] 0.1516485

Alternatively, you could use the sum function. This function counts the number of times the statement inside is “TRUE”.

pvalue <- (sum(difference >= observed.difference)+1)/(N+1)
#> [1] 0.1516485

Note that you get the same result either way.

As we saw above, the actual p-value is 15%. Running simulations, you should come close to that. Again, the more runs you do the better your results are.

There is a 15% chance to get a difference as large or larger than the one observed purely due to chance. So now it is up to you to decide what to do.

12.2 Notes

  • If you have a two-sided test, you multiply the p-value by two. Two-sided tests are the default in hypothesis testing.

  • It is cheating to look at the data first to see what hypothesis it might support and then pick a test accordingly.

  • The basic procedure of permutation testing works for any test statistic. That means you can choose a test statistic that is less sensitive to outliers. For example, if you are testing a claim about repair times, one really difficult repair can throw off all your averages. Maybe the median would be a better test statistic. However….

  • It is not right to play around with different test statistics and pick the one that gives you the result you want. For example, we could have picked the minimum (fastest) time of each group, the maximum (slowest) time, the middle…

  • The advantage of permutation testing is that you do not have to make any assumptions about the population’s distribution. There is no requirement that the population be normally distributed. However, often times you test two groups that are samples from two different populations. It is required that, under \(H_o\) , both populations (not both samples) have the same distribution.

  • Unless you have samples from populations that have different spreads and very different sample sizes, if there is no other method you know how to use, you can somewhat relax the last requirement.

12.3 Assignment

Run a permutation test for the following claim: The maximum value of the treatment group is less than the minimum of the control group. Observations:
treatment group: 4,6,6,8
control group: 9, 9, 9, 11