2.3 Diagnostics for ANOVA

2.3.1 Introduction

There’s nothing about ANOVA, in itself, that requires any preconditions or assumptions. You give me a dataset, I can find the grand mean and the group means and the residuals, and all those sums of squares, and everything.

But if we’re going to do inference, that’s different. That requires us to know the null distribution of things. I can always find an \(F\) statistic, but if I want to decide whether that \(F\) statistic is unusually large and I should reject \(H_0\), then I have to know how it behaves. And that does rely on some assumptions and conditions for the data.

We will use baby chickens as our example throughout because baby chickens are adorable.

2.3.2 ANOVA as a model

So what are these assumptions about, exactly? In ANOVA, as in many modeling contexts, we talk a lot about the errors and the residuals. These two terms do actually have distinct meanings. One is the truth, and the other is an estimate.

You may recall that in frequentist land, we assume that there is a Truth – a true value of the population parameters – but that we don’t know the Truth, so we use our observed sample to estimate it. And we use different terms to refer to True things and our estimates of them, like \(\mu\) for the population mean and \(\overline{y}\) for the sample mean.

In ANOVA, we assume that there is a True Model driving the data:

\[y_{ij} = \mu + \tau_i + \varepsilon_{ij}\]

Compare this to our previous breakdown of \(y_{ij}\):

\[ y_{ij} = \overline{\overline{y}} + (\overline{y}_i - \overline{\overline{y}}) + (y_{ij} - \overline{y}_i) \]

The first term is the grand mean. The true grand mean is \(\mu\) (familiar notation there), and the sample grand mean is \(\overline{\overline{y}}\). This is the average weight of all baby chicks at 20 days old.

Next we have the adjustment that we make based on the fact that the observation is in group \(i\), aka treatment \(i\) – in our example, the \(i\)th diet. This is the treatment effect – the difference between the average value in group \(i\) and the overall average. In the True Model, this difference is called \(\tau_i\). (\(\tau\) is the Greek letter t, for “treatment.”) Our estimated version is the sample group mean, \(\overline{y}_i\), minus the sample overall mean, \(\overline{\overline{y}}\).

And finally we have…the leftovers: the difference between the individual value, \(y_{ij}\), and the average for its group. Notice that this term still exists in the theoretical, True Model version. Even if we magically knew the true \(\mu\), and the true group adjustment \(\tau_i\), the individuals in that group would still vary around the average. So each of them would have an individual error, \(\varepsilon_{ij}\). Our \(y_{ij} - \overline{y}_i\), which we call the residual or \(e_{ij}\), is an estimate of this true error quantity.
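To make the estimated version concrete, here’s a quick sketch of this breakdown in R. I’m assuming the chicken_dat data frame used later in this section, with a weight column and a Diet factor:

# Decompose each observation: grand mean + treatment effect + residual
chicken_dat %>%
  group_by(Diet) %>%
  mutate(group_mean = mean(weight)) %>%  # the sample group mean
  ungroup() %>%
  mutate(
    grand_mean = mean(weight),           # the sample grand mean
    effect = group_mean - grand_mean,    # our estimate of tau_i
    residual = weight - group_mean       # e_ij, our estimate of epsilon_ij
  )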

These true errors, \(\varepsilon_{ij}\), are random variables, in the sense that if we picked a different chicken it would have a different error. So we can talk about their distribution. These errors are what our ANOVA assumptions and conditions are about. In particular, we assume that the errors:

  • Are independent
  • Are Normally distributed with mean 0
  • Have the same variance

(If you’ve encountered the terminology, this can be stated more concisely as the errors being iid, or independent and identically distributed, \(N(0,\sigma^2_{\varepsilon})\).)

Of course, we don’t know the True Errors, since we don’t know the True \(\mu\) or \(\tau\)’s. So we can’t check these conditions directly. But we can check whether they hold for the residuals, which are our estimates of the errors.
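The rest of this section works with those residuals in R. The later code assumes a fitted-model object called chicken_lm; here’s a sketch of how that fit might look:

# One-way ANOVA model: lm() with a single factor predictor
chicken_lm = lm(weight ~ Diet, data = chicken_dat)

# The residuals e_ij, our estimates of the true errors
head(chicken_lm$residuals)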

2.3.3 Independence

The independence bit you’ve sort of heard before. Back in intro stats, you were always talking about independence of observations as a condition for any of your inference tests. The distinction here is that we acknowledge that our observations, the \(y_{ij}\)’s, aren’t completely independent.

Remember what independence means, statistically: knowing something about A tells you nothing about B. Now, if two chickens are in the same diet group, knowing the weight of chicken A actually does tell me something about the weight of chicken B. Not for certain, of course, but if I knew that chicken A was super big and chicken B was eating the same thing, I’d bet that chicken B would be pretty big too.

But if I knew that chicken A weighed above the average for its diet, that wouldn’t necessarily tell me anything about chicken B. Even if chicken B is eating the same thing, well, who knows? Maybe chicken B is also above average for its diet, maybe it isn’t. This is what we mean by the errors being independent.

Of course, that might not actually be true. For example, if I have all the chickens in a group eating from the same feed trough, and chicken A is unusually large for its diet, that actually suggests that chicken B will be unusually small for its diet…because chicken A is probably eating all the food before B can get any. When you’re checking the assumption of independent errors, this is the kind of thinking you need to do – thinking about where the data are coming from and what might be affecting them. This would be an example of poor experimental design at a practical level. You can get as clever about randomization as you want, but you’re still going to get messed-up data if you don’t make sure that each chicken gets its own feed bag.

Another way that the independent-errors assumption gets violated is around proximity – in space but especially in time. If you have runs of an experiment that are taking place over time, take a very close look at them to see if the ones closer together are more (or less!) similar than those farther apart. This can happen because doing one run leaves some kind of side effect that affects the next run – this happens in industrial experiments. Or maybe several runs are affected by some external factor that changes over time, like wear on the equipment, or seasonal changes in temperature. If you’re dealing with data collected over time, always make a plot with time on the \(x\) axis and the residuals on the \(y\) axis to look for patterns like this.
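For example – with a hypothetical run_order column recording when each measurement happened; our chick data doesn’t actually include one – such a plot might be built like this:

# Hypothetical: residuals vs. the order in which the runs took place
chicken_dat %>%
  mutate(residuals = chicken_lm$residuals) %>%
  ggplot() +
  geom_point(aes(x = run_order, y = residuals))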

Beyond that, though, you just have to be thinking about the context of the experiment, and any real-world phenomena that might be affecting things.

2.3.4 Normal, mean zero

The good (?) news is that you can’t really check the “mean zero” part, because the mean of your residuals will always be zero; that’s just how the math works. But you can, and should, check for Normality.
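If you don’t believe the “always zero” claim, it’s a one-liner to check:

# Zero (up to floating-point rounding), by construction
mean(chicken_lm$residuals)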

Checking Normality works just the way it used to for other tests. Throw those residuals onto a histogram and/or Normal QQ plot and see what happens.

# Collect the residuals and fitted (predicted) values from the model
chicken_resids = data.frame(
  "residuals" = chicken_lm$residuals,
  "predicted" = chicken_lm$fitted.values
)

# Histogram of the residuals
chicken_resids %>%
  ggplot() +
  geom_histogram(aes(x = residuals), bins = 15)

# Normal QQ plot of the residuals, with a reference line
chicken_resids %>%
  ggplot(aes(sample = residuals)) +
  stat_qq() +
  stat_qq_line()

There’s a bit of deviation from Normality here, with a possibility of multiple modes. But it doesn’t seem like there’s a skew problem or any huge outliers.

This isn’t the end of the world, because ANOVA is reasonably robust to violations of the Normal assumption. It might be worth looking into these multiple modes (which often indicate the possibility of subgroups), but you could probably still do the test.

2.3.5 Equal variance

The other big assumption we make is that the residuals all have equal variance. This property is called homoskedasticity, and not having it is called heteroskedasticity. This is a set of words to have on hand because (1) they’re fun to say, and (2) real talk: while it is very important to be able to communicate without using jargon, sometimes you need to throw jargon around just to prove you can.

So: why is it important that the residuals be homoskedastic? Well, recall that \(F\) statistic we were using for tests: \[F = \frac{MSTr}{MSE}\]

This is the ratio of variation between groups to variation within each group. But notice that there is nothing in the denominator that’s specific to a particular group. There’s just one general term there, the MSE. That’s based on all of the residuals – it is a pooled estimate of the error variance. If the different groups have different internal variance, then this pooled estimate doesn’t make sense.
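To see the pooling concretely: writing \(N\) for the total number of chicks and \(k\) for the number of diets,

\[MSE = \frac{\sum_{i=1}^{k} \sum_{j} (y_{ij} - \overline{y}_i)^2}{N - k}\]

Every group’s residuals go into the same pot, and out comes a single estimate of \(\sigma^2_{\varepsilon}\) – one variance for everybody.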

So how do you check this? As usual, you can do it numerically but it’s probably best to make a plot. If you’re only dealing with one factor, you can get some sense of this just by looking at side-by-side box plots of the original data:

# Side-by-side box plots of weight within each diet group
chicken_dat %>%
  ggplot() +
  geom_boxplot(aes(x = Diet, y = weight))

It looks like the spread in each group is roughly the same.
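If you’d like a number to go with the picture, compute each group’s standard deviation; a common rule of thumb is to start worrying when the largest is more than about double the smallest:

# Within-group standard deviation of weight for each diet
chicken_dat %>%
  group_by(Diet) %>%
  summarize(group_sd = sd(weight))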

But when you have more than one factor in play, box plots like these can get tricky. Instead, look at the residuals vs. predicted values:

# Residuals vs. predicted values; each vertical strip is one diet group
chicken_resids %>%
  ggplot() +
  geom_point(aes(x = predicted, y = residuals))

You want to see that each group has roughly the same vertical spread – indicating equal variance in each treatment group. The nice thing about this plot is that you can also check for any other weird behaviors, like subgroups or extreme outliers.

All in all, this plot looks okay; ANOVA isn’t super sensitive to the equal variance assumption, so these spreads are close enough to equal. There does appear to be one outlier chicken; it’s not super extreme, but I might go back to my industry partner and ask whether that chicken was sick or something.

If your data seem to violate the equal variance assumption – or the Normality assumption – it can be worth trying a transformation on your response variable. For example, if you see a right skew, you could take the log or the square root. There’s a certain amount of actual methodology for this, but the short version is, try some stuff and see if it helps!
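Here’s what “try some stuff” might look like with a log transform (chicken_log_lm is just a name I’m making up for the refit model):

# Refit with a log-transformed response...
chicken_log_lm = lm(log(weight) ~ Diet, data = chicken_dat)

# ...then re-run the same diagnostics on the new residuals
data.frame(
  "residuals" = chicken_log_lm$residuals,
  "predicted" = chicken_log_lm$fitted.values
) %>%
  ggplot() +
  geom_point(aes(x = predicted, y = residuals))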

Response moment: Suppose that, yet again, I did not pay attention and accidentally included two types of baby chicks in my experiment: Li’l Fluffies, and Dino Terrors, which are much bigger. What might you see in the residuals plot(s) that could warn you of my mistake?