2.2 ANOVA Math

2.2.1 Introduction

Take a look at these two possible sets of side-by-side boxplots. In the first, it definitely looks like there’s a difference between the groups. But in the second, I’m really not sure there’s a difference at all.

Yet each of these pictures has the same four group means. The groups look different in the first plot because the differences between the groups are large when compared to the spread within each group. In the second plot, the variation within each group is so large that the differences between the groups no longer look significant.

This is what we want to quantify with ANOVA!

2.2.2 ANOVA math

All right, so how do we do this with math?

Let’s call the weight variable \(Y\). We’ll indicate a specific chicken’s weight as \(y_{ij}\), where \(i\) represents what group it’s in – that is, what diet it’s getting – and \(j\) indicates the individual chick within that group. So \(y_{4,6}\) is the weight of the sixth chicken on diet 4.

So think about our null hypothesis here: no difference between the groups. On average, the weight of chickens in each diet group is the same. Note that this is a claim about the *true* mean weight for each diet, not the sample averages we happened to observe. If we write \(\mu_i\) for the true mean weight under diet \(i\), we have: \[ H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 \] And the alternative is that there’s…some difference, somewhere. I don’t care what, just that there’s some difference.

While we’re at it, let’s pick an alpha of, say, 0.05. It’s chickens. This is not high-stakes.

Now, back to the math. We can actually write out an equation for the weight of an individual chicken: \[ y_{ij} = \overline{\overline{y}} + (\overline{y}_i - \overline{\overline{y}}) + (y_{ij} - \overline{y}_i) \]

Here, \(\overline{\overline{y}}\) represents the overall mean or grand mean – the average weight of all the chicks in the whole dataset. And \(\overline{y}_i\), again, represents the group mean – the average weight of all chicks who get diet \(i\).

You will notice that this equation doesn’t actually say a lot – the terms on the right side just cancel right out to give you the left side. But thinking about it in this way turns out to be useful. An individual chicken’s weight is:

  • The overall average weight, \(\overline{\overline{y}}\)
  • Plus an adjustment based on its diet – the difference between this diet and the overall average, \((\overline{y}_i - \overline{\overline{y}})\)
  • Plus an adjustment for that particular chicken – the difference between this chicken and the average for its diet, \((y_{ij} - \overline{y}_i)\)
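To make this concrete, here’s a quick sketch in R with made-up numbers (six chicks on two toy “diets”), checking that the three pieces really do reassemble each chick’s weight:

```r
# Toy data: six weights in two groups (hypothetical numbers)
y <- c(10, 12, 14, 20, 22, 24)
g <- factor(c(1, 1, 1, 2, 2, 2))

grand <- mean(y)      # grand mean: average of everything
group <- ave(y, g)    # each chick's own group mean

# Grand mean + diet adjustment + individual adjustment:
reconstructed <- grand + (group - grand) + (y - group)
all.equal(reconstructed, y)  # TRUE: the terms cancel exactly
```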

Now recall we were particularly interested in the variation here. We’re going to do something very like (okay, identical to) working with variances. Recall how you find the sample variance of some variable: you subtract the mean from each observation, square the differences, sum them up, and divide by the degrees of freedom, \(n-1\). (Note that -1! That’s because before you could calculate the variance, you used a degree of freedom to estimate the mean.) We also used to have a rule for the variance of independent random variables, which I hope rings distant bells from intro. Namely, variances add: \[ Var(X+Y) = Var(X) + Var(Y)\] provided that \(X\) and \(Y\) are independent random variables.
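As a quick refresher, here’s that sample-variance recipe carried out by hand in R on some made-up numbers, checked against the built-in `var()`:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up observations
n <- length(x)

# Subtract the mean, square, sum, divide by the degrees of freedom (n - 1):
manual_var <- sum((x - mean(x))^2) / (n - 1)

all.equal(manual_var, var(x))    # TRUE: matches R's built-in
```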

Here’s what this looks like now. We take our equation for \(y_{ij}\), and we move \(\overline{\overline{y}}\) over to the left: \[ y_{ij} - \overline{\overline{y}} = (\overline{y}_i - \overline{\overline{y}}) + (y_{ij} - \overline{y}_i) \]

Then we take each piece, square it, and then sum up over all the chickens: \[ \sum_{ij} (y_{ij} - \overline{\overline{y}})^2 = \sum_{ij}(\overline{y}_i - \overline{\overline{y}})^2 + \sum_{ij}(y_{ij} - \overline{y}_i)^2 \]

What we see here are called sums of squares. The one on the left is the total sum of squares, called SSTot. Over on the right, we have the treatment sum of squares, SSTr, followed by the error sum of squares, SSE. (Different people use different abbreviations for the sums of squares, and it is super confusing. In my notes, I’ll use SSTot, SSTr, and SSE, because at least they aren’t ambiguous.)
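Here’s a quick R sketch, with the same kind of made-up toy data as before, checking that SSTr and SSE really do add up to SSTot:

```r
# Toy data: six weights in two groups (hypothetical numbers)
y <- c(10, 12, 14, 20, 22, 24)
g <- factor(c(1, 1, 1, 2, 2, 2))
grand <- mean(y)
group <- ave(y, g)

SSTot <- sum((y - grand)^2)      # total sum of squares
SSTr  <- sum((group - grand)^2)  # treatment sum of squares
SSE   <- sum((y - group)^2)      # error sum of squares

all.equal(SSTot, SSTr + SSE)     # TRUE: the decomposition holds
```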

Now, for the “divide by \(n\)” part. Here is where we have to think about degrees of freedom: how many values do we get from data, and how many things do we get to estimate?

Well, for SSTr, we think of the \(\overline{y}_i\)’s as data, and \(\overline{\overline{y}}\) as the thing we estimate. So if we have \(k\) levels of our grouping factor, and estimate one grand mean, we end up with \(k-1\) degrees of freedom.

For SSE, we think of the individual \(y_{ij}\)’s as data, and the group means \(\overline{y}_i\) as the things we estimate. So we start with \(N\) data points and subtract the \(k\) group means, to get \(N-k\) degrees of freedom.

Over on the left, with SSTot, we have \(N\) individual \(y_{ij}\)’s as data points, and one thing to estimate, \(\overline{\overline{y}}\). So the degrees of freedom is \(N-1\). Notice that the df for SSTr and SSE add up to the df for SSTot. So tidy!

Let’s divide each piece by its own degrees of freedom. One caution: once we divide the terms by different numbers, they no longer add up, so I’ll list the three pieces side by side rather than writing an equation: \[ \frac{\sum_{ij} (y_{ij} - \overline{\overline{y}})^2}{N-1}, \qquad \frac{\sum_{ij}(\overline{y}_i - \overline{\overline{y}})^2 }{k-1}, \qquad \frac{\sum_{ij}(y_{ij} - \overline{y}_i)^2}{N-k} \]

By the way, if you feel like I am doing extremely sketchy things to these equations here, I sympathize! It turns out that actually what I’m doing is correct, but it sure doesn’t look that way. But you can prove it later in life :)

What we have obtained here are called mean squares. On the left, again, is the total mean square, MSTot. And on the right are MSTr and MSE.

2.2.3 The ANOVA inference test

Okay! Back to our original question: is the variation between groups large compared to the variation within groups? To do this comparison, we’ll just take the ratio of the two quantities. This gives us a test statistic (remember, it’s calculated from our data, so it’s a statistic) called the \(F\) statistic: \[ F = \frac{MSTr}{MSE}\]
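Continuing the toy example from earlier (made-up numbers, two toy groups), here’s a quick R sketch computing the mean squares and the \(F\) statistic by hand, checked against R’s own ANOVA machinery:

```r
# Toy data: six weights in two groups (hypothetical numbers)
y <- c(10, 12, 14, 20, 22, 24)
g <- factor(c(1, 1, 1, 2, 2, 2))
grand <- mean(y)
group <- ave(y, g)
k <- nlevels(g)   # number of groups
N <- length(y)    # total number of observations

MSTr <- sum((group - grand)^2) / (k - 1)  # between-group mean square
MSE  <- sum((y - group)^2)     / (N - k)  # within-group mean square
F_stat <- MSTr / MSE

# Same F value that anova() reports for this little dataset:
anova(lm(y ~ g))$`F value`[1]
```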

Now, as you may recall from reviewing inference, once we have a test statistic, we need a null distribution to compare it to. What is the distribution of this \(F\) if the null hypothesis is true and there’s no real difference between the groups? Let’s prove it!

Look, if you really want to know and you just can’t wait for an actual course about distributions: The \(F\) statistic has an \(F\) distribution because of the way we calculate MSTr and MSE.

You may (?) recall that there are some rules about what happens when you transform or combine variables with certain distributions. For example, if \(X\) is normally distributed, \(5X+3\) is also normally distributed. And if \(X\) and \(Y\) are both Normal, \(X+Y\) is also Normal.

This means that if the \(y_{ij}\) are normally distributed (actually, technically, if the errors are), then the group and grand means, \(\overline{y}_i\) and \(\overline{\overline{y}}\), are also Normal (because they’re created by adding together \(y_{ij}\)’s, which are Normal, and dividing by a constant). And that means that differences like \(y_{ij}-\overline{y}_i\) are also Normal.

But when you find SSTr or SSE, you square those differences, and add them together. Well, it turns out that the square of a *standard* Normal random variable has a chi-squared (\(\chi^2\)) distribution. In fact, the sum of \(m\) independent squared standard Normal random variables has a \(\chi^2_m\) distribution (that’s right, \(\chi^2\)’s have degrees of freedom too). So, after scaling by the error variance \(\sigma^2\), we get \(SSTr/\sigma^2\sim\chi^2_{k-1}\) (when \(H_0\) is true) and \(SSE/\sigma^2\sim\chi^2_{N-k}\).

And it also turns out that the ratio of two independent chi-squared random variables, each divided by its own degrees of freedom, has an \(F\) distribution. That’s exactly what we have here: MSTr is SSTr divided by its own degrees of freedom, and likewise for MSE and SSE. (The \(\sigma^2\)’s conveniently cancel when we take the ratio.)

And that, incidentally, is why we care about independence of errors (well, one reason), and also why we have to assume the errors are normally distributed. Which we will talk about checking in the next section.

Aren’t you glad this was an optional side note?

Hahahaha no, we’re not going to prove it. That is a task for another class. I’m just going to tell you what it is: the \(F\) distribution, with \(k-1\) and \(N-k\) degrees of freedom. (Yeah, \(F\) distributions are defined by two different degrees of freedom.)
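If you’d like some evidence without the proof, here’s a simulation sketch (with a made-up balanced design): when \(H_0\) is true, the \(F\) statistics we compute really do behave like draws from an \(F\) distribution with \(k-1\) and \(N-k\) degrees of freedom.

```r
# Hypothetical setup: 4 groups of 10, and H0 is true by construction
set.seed(1)
k <- 4; n_per <- 10; N <- k * n_per
g <- factor(rep(1:k, each = n_per))

f_stats <- replicate(2000, {
  y <- rnorm(N)  # every group has the same distribution: no real difference
  anova(lm(y ~ g))$`F value`[1]
})

# If the null distribution is F(k-1, N-k), about 5% of these simulated
# statistics should land above that distribution's 95th percentile:
mean(f_stats > qf(0.95, k - 1, N - k))
```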

So now we have a test statistic and a null distribution to compare it to. So we can go about getting a p-value, just like usual: if \(H_0\) is true, what is the probability of getting an \(F\) statistic at least as extreme as the one we saw in our sample? (Actually the \(F\) distribution is inherently one-sided, so we’re only interested in large \(F\) statistics.)

If that p-value is greater than or equal to our pre-defined alpha level, then we fail to reject \(H_0\). We didn’t see evidence of a difference between the groups. We don’t know that the groups are all the same – we might have just missed something – but based on what we saw, we can’t say that they aren’t the same.

But when we try it on our actual data, we see:

library(tidyverse)  # needed for %>% and filter()
chicken_dat = ChickWeight %>% filter(Time == 20)
chicken_lm = lm(weight~Diet, data = chicken_dat)
chicken_fit = chicken_lm %>% anova()
chicken_fit
## Analysis of Variance Table
## Response: weight
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## Diet       3  55881 18627.0  5.4636 0.002909 **
## Residuals 42 143190  3409.3                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Behold R’s ANOVA output. It helpfully provides a p-value, on the right there, associated with Diet.
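That p-value is just the upper-tail area of the \(F\) distribution beyond our observed statistic, so we can reproduce it directly with R’s `pf()` function, plugging in the \(F\) value and degrees of freedom from the table:

```r
# Upper tail of the F(3, 42) distribution beyond the observed F = 5.4636:
pf(5.4636, df1 = 3, df2 = 42, lower.tail = FALSE)
## about 0.0029, matching Pr(>F) in the ANOVA table
```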

That p-value is below our pre-defined alpha level, so we reject \(H_0\). We conclude that there’s evidence of some difference between the groups: chicken diet matters! But we still don’t know how it matters. This test doesn’t tell us which group is the best, or how many of the groups are different, or anything. Just that the factor matters somehow.

To find out more, we’ll need follow-up tests…which we’ll discuss in another section.

Response moment: Look at the rest of the values in the ANOVA output. Do you see how they map to the mathematical objects SSTr, SSE, MSTr, MSE, and \(F\)? Also, how many chickens did I have in this experiment?