3.2 \(2^k\) designs

3.2.1 Two-level designs

In the guinea pig experiment we have one treatment factor with 3 levels, and one with 2 levels. But a common situation is when all of the treatment factors have just two levels each.

Why is this common? Well, it’s not unusual that treatment factors are inherently binary. You eat breakfast or you don’t, you get the new drug or you don’t, you apply the protective coating or you don’t. But possibly a more common reason is that your treatment factors are quantitative.

If you’re dealing with a quantitative factor, very often, you as the experimenter get to decide what the levels are. A common thing to do is to pick two – a “low” and a “high”. This makes the analysis simpler, and it’s enough to detect a linear relationship between the predictor and the response – since you only need two points to define a line. Properly choosing the levels for quantitative predictors is a story for another time, but for now, assume that you pick a “low” and “high” level that are reasonably different. Then you estimate the factor’s effect by looking at the difference in the response between “low” runs and “high” runs.

If you have a factorial design with \(k\) factors, each measured at two levels, this is called a \(2^k\) design. This is an apt name because, to do a full factorial design where you observe each combination of factor levels, you’ll need…\(2^k\) runs.
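By the way, if you want to see all of those combinations written out, R's built-in expand.grid() will generate them for you. Here's a quick sketch for \(k=3\) (the factor names are just placeholders for now):

# all 2^3 = 8 combinations of three two-level factors, one row per run
expand.grid(A = c("-", "+"),
            B = c("-", "+"),
            C = c("-", "+"))
##   A B C
## 1 - - -
## 2 + - -
## 3 - + -
## 4 + + -
## 5 - - +
## 6 + - +
## 7 - + +
## 8 + + +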

3.2.2 A \(2^3\) design

Let’s take a new example. There is an old saying that “a watched pot never boils,” and this seems to be especially true when you are super hungry and waiting for the pasta water to be ready. So let’s look at an experiment. We have three treatment factors:

  • A: amount of water in the pot (500 mL or 600 mL)
  • B: lid off or lid on
  • C: size of pot (2 L or 3 L)

Notice how factors A and C are quantitative, but we’re just picking a high and low setting for each.

The response here is how long it takes the water to reach 90 degrees Celsius. Apparently it’s hard to measure exactly when the water starts boiling, so we just measure when it hits this temperature instead. Presumably the time between reaching 90 degrees and boiling is short enough that you can spend it realizing you forgot to open the pasta and you can’t find the scissors anywhere.

Here’s the dataset:

boil_dat
##     A   B C   y
## 1 500 Off 2 450
## 2 600 Off 2 492
## 3 500  On 2 390
## 4 600  On 2 408
## 5 500 Off 3 432
## 6 600 Off 3 450
## 7 500  On 3 318
## 8 600  On 3 312

A common way to write out this kind of experiment is to encode the factors as being + (high) or - (low) for each run. Thus:

Run   A   B   C     y
  1   -   -   -   450
  2   +   -   -   492
  3   -   +   -   390
  4   +   +   -   408
  5   -   -   +   432
  6   +   -   +   450
  7   -   +   +   318
  8   +   +   +   312
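If you'd like R to produce that coded table for you, here's one way to do it (just a sketch, assuming boil_dat is stored exactly as printed above):

# recode each factor: its higher setting becomes "+", its lower setting "-"
data.frame(Run = 1:8,
           A = ifelse(boil_dat$A == "600", "+", "-"),
           B = ifelse(boil_dat$B == "On",  "+", "-"),
           C = ifelse(boil_dat$C == "3",   "+", "-"),
           y = boil_dat$y)

which should reproduce the coded table above.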

Let’s pause and think about estimating the main effects here. For example, what’s the effect of factor A? Well, if B and C are both at “low,” then we have two runs to look at: run 2, where A is high, and run 1, where A is low. So one data point about the effect of A is the difference between runs 2 and 1: \(492-450 = 42\).

But wait! We have more data available. Any pair of runs where B and C are the same but A is different can give us information about the effect of changing A. So we could look at run 4 vs. run 3: \(408-390 = 18\). Likewise, we can use run 6 minus run 5, and run 8 minus run 7. So we actually have four relevant differences. We can average these four differences to estimate the average effect of A:

boil_times = boil_dat$y   # pull out the responses as a vector, in run order
# average the four "A high minus A low" differences
mean(c(boil_times[2] - boil_times[1],
       boil_times[4] - boil_times[3],
       boil_times[6] - boil_times[5],
       boil_times[8] - boil_times[7]))
## [1] 18

So, on average, the difference between “high” water level and “low” water level is 18. Since we like to talk about effects relative to the overall average, we would say that “high” water level adds 9 to the overall average time, while “low” water level subtracts 9 from the overall average time.
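Another way to get the same \(\pm 9\), reusing the boil_times vector from above: compare the average of the four “high A” runs to the overall average.

# mean of the "high water" runs minus the grand mean
mean(boil_times[c(2, 4, 6, 8)]) - mean(boil_times)
## [1] 9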

The same is true for the main effect of B: any pair of runs where A and C are the same, but B is different, can tell us about the effect of B. So we could compare run 3 vs. run 1, run 4 vs. run 2, and so on – we have four such pairs. Likewise, to get information about the effect of C, we could compare run 5 vs. run 1, and so on – again, there are four pairs of rows we can use.
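Just to check, here are the analogous averages for B and C, again reusing boil_times from above:

# average "lid on minus lid off" difference, holding A and C fixed
mean(c(boil_times[3] - boil_times[1],
       boil_times[4] - boil_times[2],
       boil_times[7] - boil_times[5],
       boil_times[8] - boil_times[6]))
## [1] -99

# average "big pot minus small pot" difference, holding A and B fixed
mean(c(boil_times[5] - boil_times[1],
       boil_times[6] - boil_times[2],
       boil_times[7] - boil_times[3],
       boil_times[8] - boil_times[4]))
## [1] -57

So, on average, putting the lid on knocks 99 off the time (an effect of \(\pm 49.5\) around the overall average), and the bigger pot knocks 57 off (\(\pm 28.5\)). We’ll see those numbers again in a moment.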

You may notice an interesting thing here: we have four data points we can use to estimate the effect of A, four to estimate the effect of B, and four to estimate the effect of C. And yet…we only have eight runs. Factorial designs are quite efficient: they give us lots of information in relatively few runs. That’s because each run is in some sense serving multiple purposes. For example, when we do run 8, we can use it at least three ways: compare to run 7 to see the effect of A, compare to run 6 to see the effect of B, and compare to run 4 to see the effect of C. Super handy!

Let’s see what R has to say about this, shall we?

boil_aov1 = aov(y~., data=boil_dat)
boil_aov1 %>% summary()
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## A            1    648     648   1.274 0.32207   
## B            1  19602   19602  38.549 0.00342 **
## C            1   6498    6498  12.779 0.02328 * 
## Residuals    4   2034     509                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looks like the amount of water isn’t a big deal (at least in this range), but it really matters whether you have the lid on the pot, and the size of the pot may also be important.

To get more information, we can ask R to go and calculate the actual effects associated with each level of the factors:

model.tables(boil_aov1)
## Tables of effects
## 
##  A 
## A
## 500 600 
##  -9   9 
## 
##  B 
## B
##   Off    On 
##  49.5 -49.5 
## 
##  C 
## C
##     2     3 
##  28.5 -28.5

Hey, there’s that \(\pm 9\) we calculated before for the main effect of A!
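In fact, in a balanced two-level design like this, the sums of squares in the ANOVA table are tied directly to these effects: each one is just the total number of runs times the squared effect. Quick check against the table above:

# (number of runs) times (effect)^2 for A, B, and C
8 * c(A = 9, B = 49.5, C = 28.5)^2   # 648, 19602, 6498: the Sum Sq column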

R will also, if asked nicely, give us the actual group means for each level of each factor:

model.tables(boil_aov1, type = "means")
## Tables of means
## Grand mean
##       
## 406.5 
## 
##  A 
## A
##   500   600 
## 397.5 415.5 
## 
##  B 
## B
## Off  On 
## 456 357 
## 
##  C 
## C
##   2   3 
## 435 378
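These are just the averages of the runs at each level, which we could also compute directly. For example, for factor B:

# average boiling time with the lid off vs. on
tapply(boil_dat$y, boil_dat$B, mean)
## Off  On 
## 456 357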

3.2.3 Interactions

“Hey wait,” I hear you say. “You said factorial designs were cool because they allowed us to spot interactions. Where are the interactions?”

Good point. All the analysis we did above was based on the assumption that there was no interaction between the factors: for example, that the effect of A didn’t depend on the levels of B and C. When we were estimating A, we looked at pairs of runs where B and C were the same but A was different. Each of those pairs was different: for example, in run 2 vs. run 1, B and C were - and -, but in run 8 vs. run 7, B and C were + and +. But we averaged over all four pairs to get the average effect of A.

If there’s an interaction, though, the effect of A might be different if B is - as opposed to when B is +. In context, the effect of the water level might be different if the pot lid is on vs. if it’s off. We need to calculate an adjustment to the average effect of A under each of these scenarios: the term \(\widehat{(\alpha\beta)}_{ij}\) for each \(i\) and \(j\).

Well, okay. What’s the average effect of A when B is -?

mean(c(boil_times[2] - boil_times[1],
       boil_times[6] - boil_times[5]))
## [1] 30

So, +15 for high water, -15 for low water.

What’s the average effect of A when B is +?

mean(c(boil_times[4] - boil_times[3],
       boil_times[8] - boil_times[7]))
## [1] 6

Oohh. When B is +, the effect of A is just +3 for high water, and -3 for low water. Looks like there is a difference.

So what does this mean in terms of effects?

Well, suppose A is at the high level. Overall, \(\widehat{\alpha}_+\) is 9. But then we adjust: for example, if A is + and B is +, we add \(\widehat{(\alpha\beta)}_{++}\) to the effect of A. The effect of high A is only 3 when B is +, so that means the adjustment \(\widehat{(\alpha\beta)}_{++}\) must be -6. Likewise, the effect of low A is -9 on average but only -3 when B is +, so the adjustment \(\widehat{(\alpha\beta)}_{-+}\) must be 6.
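We could also grind out all four of those adjustments at once: take each (A, B) cell mean, then subtract off the grand mean, the A effect, and the B effect; whatever is left over is the interaction adjustment. A sketch, using the boil_dat data frame:

grand = mean(boil_dat$y)                               # overall average
a_eff = tapply(boil_dat$y, boil_dat$A, mean) - grand   # main effects of A
b_eff = tapply(boil_dat$y, boil_dat$B, mean) - grand   # main effects of B
ab_means = tapply(boil_dat$y, list(boil_dat$A, boil_dat$B), mean)  # (A, B) cell means

# cell mean minus (grand mean + A effect + B effect) = interaction adjustment
ab_means - grand - outer(a_eff, b_eff, "+")
##     Off On
## 500  -6  6
## 600   6 -6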

Let’s let R do the rest of them for us, shall we? Here I’ll use an R formula shorthand, .^2. In R formulas, . means “all the other variables,” and ^2 indicates that I want to include all the second-order terms – that is, two-way interactions. So y~.^2 indicates that I want to use y as the response, and as predictors, include all the main effects and two-way interactions involving all the other variables.
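(In case the shorthand feels too magical: with this dataset, all of these should fit the identical model.)

# three equivalent ways to ask for all main effects plus all two-way interactions
aov(y ~ .^2, data = boil_dat)
aov(y ~ (A + B + C)^2, data = boil_dat)
aov(y ~ A + B + C + A:B + A:C + B:C, data = boil_dat)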

boil_aov2 = aov(y~.^2, data = boil_dat)
model.tables(boil_aov2)
## Tables of effects
## 
##  A 
## A
## 500 600 
##  -9   9 
## 
##  B 
## B
##   Off    On 
##  49.5 -49.5 
## 
##  C 
## C
##     2     3 
##  28.5 -28.5 
## 
##  A:B 
##      B
## A     Off On
##   500 -6   6
##   600  6  -6
## 
##  A:C 
##      C
## A     2  3 
##   500 -6  6
##   600  6 -6
## 
##  B:C 
##      C
## B     2     3    
##   Off -13.5  13.5
##   On   13.5 -13.5

We see that R has estimated the overall main effects for A, B, and C, as well as all the two-way interaction adjustments.

3.2.4 A glimpse of inference

So are these effects statistically significant? Let’s find out!

boil_aov2 %>% summary()
##             Df Sum Sq Mean Sq   F value   Pr(>F)    
## A            1    648     648 5.134e+29 8.88e-16 ***
## B            1  19602   19602 1.553e+31  < 2e-16 ***
## C            1   6498    6498 5.148e+30 2.81e-16 ***
## A:B          1    288     288 2.282e+29 1.33e-15 ***
## A:C          1    288     288 2.282e+29 1.33e-15 ***
## B:C          1   1458    1458 1.155e+30 5.92e-16 ***
## Residuals    1      0       0                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Well…are they? Good question, really. They do look significant, just based on those p-values. Except…I mean, look at those degrees of freedom. Each of my factors had 2 levels, so \(a=b=c=2\). It took me \(a-1=1\) df to fit the main effect for A, and so on. Then it took me \((a-1)(b-1) = 1 \times 1 = 1\) df to fit the interaction effect AB, and so on. And I used 1 df for the overall mean. So out of the 8 runs, I was left with just 1 df to use to estimate the error.
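Adding it up: \(8 - 1 - 3 - 3 = 1\). Eight runs, minus one df for the overall mean, three for the main effects, and three for the two-way interactions, leaves a single df for error, which is exactly the Residuals line in the table above.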

And on top of that, the sum of squares for error is…0?? That means my model fit exactly. Which is not completely impossible, given that the response values were rounded to whole numbers and I’m fitting a model with 7 terms to 8 points. So that’s a little uncomfortable. My model might be fitting noise, or overfitting, and it might also change substantially if I got a slightly different value for the response in one run.

And on top of that, remember: what about higher-order interactions? If there’s a three-way interaction between A, B, and C, then fitting it would require another \((a-1)(b-1)(c-1) = 1\) degree of freedom. Which would leave me none at all to estimate error!

One solution here is: do some more runs. If we boiled the water twice with each combination of water level, lid status, and pot size, we’d have \(n=16\) instead of 8, and we’d have lots more information about error.

Or maybe we could, sort of, buy some degrees of freedom back? We had lots of df for error until we went and threw those interaction terms in. Maybe we could decide to remove some terms from our model – basically, declare a priori that those terms don’t matter – and then we’d have more df for estimating error. This is an approach we’ll explore extensively later on.
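Just to illustrate the mechanics (not a recommendation for this particular dataset!), suppose we had decided up front that the interactions involving A weren’t of interest. Then we could fit a smaller model, something like:

# drop A:B and A:C; the 2 df they were using go back into the error line
boil_aov3 = aov(y ~ A + B + C + B:C, data = boil_dat)
boil_aov3 %>% summary()

This model spends \(1 + 3 + 1 = 5\) df, leaving 3 df for error instead of 1. Whether it’s legitimate to declare terms unimportant ahead of time is, of course, exactly the question we’ll come back to.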

Response moment: Context practice! Recall the meaning of each factor here:

  • A: amount of water in the pot (500 mL or 600 mL)
  • B: lid off or lid on
  • C: size of pot (2 L or 3 L)

Now recall some of the effects we found: \(\widehat{\alpha}_+\) is 9, and the interaction term \(\widehat{(\alpha\beta)}_{++}\) is -6; \(\widehat{\alpha}_-\) is -9 and \(\widehat{(\alpha\beta)}_{-+}\) is 6. In a sentence, what does this mean about water boiling?