# Chapter 14 Comparing several means (one-way ANOVA)

This chapter introduces one of the most widely used tools in statistics, known as “the analysis of variance”, which is usually referred to as ANOVA. The basic technique was developed by Sir Ronald Fisher in the early 20th century, and it is to him that we owe the rather unfortunate terminology. The term ANOVA is a little misleading, in two respects. Firstly, although the name of the technique refers to variances, ANOVA is concerned with investigating differences in means. Secondly, there are several different things out there that are all referred to as ANOVAs, some of which have only a very tenuous connection to one another. Later on in the book we’ll encounter a range of different ANOVA methods that apply in quite different situations, but for the purposes of this chapter we’ll only consider the simplest form of ANOVA, in which we have several different groups of observations, and we’re interested in finding out whether those groups differ in terms of some outcome variable of interest. This is the question that is addressed by a one-way ANOVA.

The structure of this chapter is as follows: In Section 14.1 I’ll introduce a fictitious data set that we’ll use as a running example throughout the chapter. After introducing the data, I’ll describe the mechanics of how a one-way ANOVA actually works (Section 14.2) and then focus on how you can run one in R (Section 14.3). These two sections are the core of the chapter. The remainder of the chapter discusses a range of important topics that inevitably arise when running an ANOVA, namely how to calculate effect sizes (Section 14.4), post hoc tests and corrections for multiple comparisons (Section 14.5) and the assumptions that ANOVA relies upon (Section 14.6). We’ll also talk about how to check those assumptions and some of the things you can do if the assumptions are violated (Sections 14.7 to 14.10). At the end of the chapter we’ll talk a little about the relationship between ANOVA and other statistical tools (Section 14.11).

## 14.1 An illustrative data set

Suppose you’ve become involved in a clinical trial in which you are testing a new antidepressant drug called Joyzepam. In order to construct a fair test of the drug’s effectiveness, the study involves three separate drugs to be administered. One is a placebo, and the other is an existing antidepressant / anti-anxiety drug called Anxifree. A collection of 18 participants with moderate to severe depression are recruited for your initial testing. Because the drugs are sometimes administered in conjunction with psychological therapy, your study includes 9 people undergoing cognitive behavioural therapy (CBT) and 9 who are not. Participants are randomly assigned (doubly blinded, of course) a treatment, such that there are 3 CBT people and 3 no-therapy people assigned to each of the 3 drugs. A psychologist assesses the mood of each person after a 3 month run with each drug: and the overall improvement in each person’s mood is assessed on a scale ranging from $$-5$$ to $$+5$$.

With that as the study design, let’s now look at what we’ve got in the data file:

load( "./rbook-master/data/clinicaltrial.Rdata" ) # load data
str(clin.trial)       
## 'data.frame':    18 obs. of  3 variables:
##  $drug : Factor w/ 3 levels "placebo","anxifree",..: 1 1 1 2 2 2 3 3 3 1 ... ##$ therapy  : Factor w/ 2 levels "no.therapy","CBT": 1 1 1 1 1 1 1 1 1 2 ...
group <- clin.trial$drug gp.means <- tapply(outcome,group,mean) gp.means <- gp.means[group] dev.from.gp.means <- outcome - gp.means squared.devs <- dev.from.gp.means ^2 It might not be obvious from inspection what these commands are doing: as a general rule, the human brain seems to just shut down when faced with a big block of programming. However, I strongly suggest that – if you’re like me and tend to find that the mere sight of this code makes you want to look away and see if there’s any beer left in the fridge or a game of footy on the telly – you take a moment and look closely at these commands one at a time. Every single one of these commands is something you’ve seen before somewhere else in the book. There’s nothing novel about them (though I’ll have to admit that the tapply() function takes a while to get a handle on), so if you’re not quite sure how these commands work, this might be a good time to try playing around with them yourself, to try to get a sense of what’s happening. On the other hand, if this does seem to make sense, then you won’t be all that surprised at what happens when I wrap these variables in a data frame, and print it out… Y <- data.frame( group, outcome, gp.means, dev.from.gp.means, squared.devs ) print(Y, digits = 2) ## group outcome gp.means dev.from.gp.means squared.devs ## 1 placebo 0.5 0.45 0.050 0.0025 ## 2 placebo 0.3 0.45 -0.150 0.0225 ## 3 placebo 0.1 0.45 -0.350 0.1225 ## 4 anxifree 0.6 0.72 -0.117 0.0136 ## 5 anxifree 0.4 0.72 -0.317 0.1003 ## 6 anxifree 0.2 0.72 -0.517 0.2669 ## 7 joyzepam 1.4 1.48 -0.083 0.0069 ## 8 joyzepam 1.7 1.48 0.217 0.0469 ## 9 joyzepam 1.3 1.48 -0.183 0.0336 ## 10 placebo 0.6 0.45 0.150 0.0225 ## 11 placebo 0.9 0.45 0.450 0.2025 ## 12 placebo 0.3 0.45 -0.150 0.0225 ## 13 anxifree 1.1 0.72 0.383 0.1469 ## 14 anxifree 0.8 0.72 0.083 0.0069 ## 15 anxifree 1.2 0.72 0.483 0.2336 ## 16 joyzepam 1.8 1.48 0.317 0.1003 ## 17 joyzepam 1.3 1.48 -0.183 0.0336 ## 18 joyzepam 1.4 1.48 -0.083 0.0069 If you compare this output to the contents of the table I’ve been constructing by hand, you can see that R has done exactly the same calculations that I was doing, and much faster too. So, if we want to finish the calculations of the within-group sum of squares in R, we just ask for the sum() of the squared.devs variable: SSw <- sum( squared.devs ) print( SSw ) ## [1] 1.391667 Obviously, this isn’t the same as what I calculated, because R used all 18 observations. But if I’d typed sum( squared.devs[1:5] ) instead, it would have given the same answer that I got earlier. Okay. Now that we’ve calculated the within groups variation, $$\mbox{SS}_w$$, it’s time to turn our attention to the between-group sum of squares, $$\mbox{SS}_b$$. The calculations for this case are very similar. The main difference is that, instead of calculating the differences between an observation $$Y_{ik}$$ and a group mean $$\bar{Y}_k$$ for all of the observations, we calculate the differences between the group means $$\bar{Y}_k$$ and the grand mean $$\bar{Y}$$ (in this case 0.88) for all of the groups… group ($$k$$) group mean ($$\bar{Y}_k$$) grand mean ($$\bar{Y}$$) deviation ($$\bar{Y}_{k} - \bar{Y}$$) squared deviations ($$(\bar{Y}_{k} - \bar{Y})^2$$) placebo 0.45 0.88 -0.43 0.18 anxifree 0.72 0.88 -0.16 0.03 joyzepam 1.48 0.88 0.60 0.36 However, for the between group calculations we need to multiply each of these squared deviations by $$N_k$$, the number of observations in the group. We do this because every observation in the group (all $$N_k$$ of them) is associated with a between group difference. So if there are six people in the placebo group, and the placebo group mean differs from the grand mean by 0.19, then the total between group variation associated with these six people is $$6 \times 0.16 = 1.14$$. So we have to extend our little table of calculations… group ($$k$$) squared deviations ($$(\bar{Y}_{k} - \bar{Y})^2$$) sample size ($$N_k$$) weighted squared dev ($$N_k (\bar{Y}_{k} - \bar{Y})^2$$) placebo 0.18 6 1.11 anxifree 0.03 6 0.16 joyzepam 0.36 6 2.18 And so now our between group sum of squares is obtained by summing these “weighted squared deviations” over all three groups in the study: $\begin{array}{rcl} \mbox{SS}_{b} &=& 1.11 + 0.16 + 2.18 \\ &=& 3.45 \end{array}$ As you can see, the between group calculations are a lot shorter, so you probably wouldn’t usually want to bother using R as your calculator. However, if you did decide to do so, here’s one way you could do it: gp.means <- tapply(outcome,group,mean) grand.mean <- mean(outcome) dev.from.grand.mean <- gp.means - grand.mean squared.devs <- dev.from.grand.mean ^2 gp.sizes <- tapply(outcome,group,length) wt.squared.devs <- gp.sizes * squared.devs Again, I won’t actually try to explain this code line by line, but – just like last time – there’s nothing in there that we haven’t seen in several places elsewhere in the book, so I’ll leave it as an exercise for you to make sure you understand it. Once again, we can dump all our variables into a data frame so that we can print it out as a nice table: Y <- data.frame( gp.means, grand.mean, dev.from.grand.mean, squared.devs, gp.sizes, wt.squared.devs ) print(Y, digits = 2) ## gp.means grand.mean dev.from.grand.mean squared.devs gp.sizes ## placebo 0.45 0.88 -0.43 0.188 6 ## anxifree 0.72 0.88 -0.17 0.028 6 ## joyzepam 1.48 0.88 0.60 0.360 6 ## wt.squared.devs ## placebo 1.13 ## anxifree 0.17 ## joyzepam 2.16 Clearly, these are basically the same numbers that we got before. There are a few tiny differences, but that’s only because the hand-calculated versions have some small errors caused by the fact that I rounded all my numbers to 2 decimal places at each step in the calculations, whereas R only does it at the end (obviously, R s version is more accurate). Anyway, here’s the R command showing the final step: SSb <- sum( wt.squared.devs ) print( SSb ) ## [1] 3.453333 which is (ignoring the slight differences due to rounding error) the same answer that I got when doing things by hand. Now that we’ve calculated our sums of squares values, $$\mbox{SS}_b$$ and $$\mbox{SS}_w$$, the rest of the ANOVA is pretty painless. The next step is to calculate the degrees of freedom. Since we have $$G = 3$$ groups and $$N = 18$$ observations in total, our degrees of freedom can be calculated by simple subtraction: $\begin{array}{lclcl} \mbox{df}_b &=& G - 1 &=& 2 \\ \mbox{df}_w &=& N - G &=& 15 \end{array}$ Next, since we’ve now calculated the values for the sums of squares and the degrees of freedom, for both the within-groups variability and the between-groups variability, we can obtain the mean square values by dividing one by the other: $\begin{array}{lclclcl} \mbox{MS}_b &=& \displaystyle\frac{\mbox{SS}_b }{ \mbox{df}_b } &=& \displaystyle\frac{3.45}{ 2} &=& 1.73 \end{array}$ $\begin{array}{lclclcl} \mbox{MS}_w &=& \displaystyle\frac{\mbox{SS}_w }{ \mbox{df}_w } &=& \displaystyle\frac{1.39}{15} &=& 0.09 \end{array}$ We’re almost done. The mean square values can be used to calculate the $$F$$-value, which is the test statistic that we’re interested in. We do this by dividing the between-groups MS value by the and within-groups MS value. $F \ = \ \frac{\mbox{MS}_b }{ \mbox{MS}_w } \ = \ \frac{1.73}{0.09} \ = \ 18.6$ Woohooo! This is terribly exciting, yes? Now that we have our test statistic, the last step is to find out whether the test itself gives us a significant result. As discussed in Chapter @ref(hypothesistesting, what we really ought to do is choose an $$\alpha$$ level (i.e., acceptable Type I error rate) ahead of time, construct our rejection region, etc etc. But in practice it’s just easier to directly calculate the $$p$$-value. Back in the “old days”, what we’d do is open up a statistics textbook or something and flick to the back section which would actually have a huge lookup table… that’s how we’d “compute” our $$p$$-value, because it’s too much effort to do it any other way. However, since we have access to R, I’ll use the pf() function to do it instead. Now, remember that I explained earlier that the $$F$$-test is always one sided? And that we only reject the null hypothesis for very large $$F$$-values? That means we’re only interested in the upper tail of the $$F$$-distribution. The command that you’d use here would be this… pf( 18.6, df1 = 2, df2 = 15, lower.tail = FALSE) ## [1] 8.672727e-05 Therefore, our $$p$$-value comes to 0.0000867, or $$8.67 \times 10^{-5}$$ in scientific notation. So, unless we’re being extremely conservative about our Type I error rate, we’re pretty much guaranteed to reject the null hypothesis. At this point, we’re basically done. Having completed our calculations, it’s traditional to organise all these numbers into an ANOVA table like the one in Table@reftab:anovatable. For our clinical trial data, the ANOVA table would look like this: df sum of squares mean squares $$F$$-statistic $$p$$-value between groups 2 3.45 1.73 18.6 $$8.67 \times 10^{-5}$$ within groups 15 1.39 0.09 - - These days, you’ll probably never have much reason to want to construct one of these tables yourself, but you will find that almost all statistical software (R included) tends to organise the output of an ANOVA into a table like this, so it’s a good idea to get used to reading them. However, although the software will output a full ANOVA table, there’s almost never a good reason to include the whole table in your write up. A pretty standard way of reporting this result would be to write something like this: One-way ANOVA showed a significant effect of drug on mood gain ($$F(2,15) = 18.6, p<.001$$). Sigh. So much work for one short sentence. ## 14.3 Running an ANOVA in R I’m pretty sure I know what you’re thinking after reading the last section, especially if you followed my advice and tried typing all the commands in yourself…. doing the ANOVA calculations yourself sucks. There’s quite a lot of calculations that we needed to do along the way, and it would be tedious to have to do this over and over again every time you wanted to do an ANOVA. One possible solution to the problem would be to take all these calculations and turn them into some R functions yourself. You’d still have to do a lot of typing, but at least you’d only have to do it the one time: once you’ve created the functions, you can reuse them over and over again. However, writing your own functions is a lot of work, so this is kind of a last resort. Besides, it’s much better if someone else does all the work for you… ### 14.3.1 Using the aov() function to specify your ANOVA To make life easier for you, R provides a function called aov(), which – obviously – is an acronym of “Analysis Of Variance.”205 If you type ?aov and have a look at the help documentation, you’ll see that there are several arguments to the aov() function, but the only two that we’re interested in are formula and data. As we’ve seen in a few places previously, the formula argument is what you use to specify the outcome variable and the grouping variable, and the data argument is what you use to specify the data frame that stores these variables. In other words, to do the same ANOVA that I laboriously calculated in the previous section, I’d use a command like this: aov( formula = mood.gain ~ drug, data = clin.trial )  Actually, that’s not quite the whole story, as you’ll see as soon as you look at the output from this command, which I’ve hidden for the moment in order to avoid confusing you. Before we go into specifics, I should point out that either of these commands will do the same thing: aov( clin.trial$mood.gain ~ clin.trial$drug ) aov( mood.gain ~ drug, clin.trial )  In the first command, I didn’t specify a data set, and instead relied on the $ operator to tell R how to find the variables. In the second command, I dropped the argument names, which is okay in this case because formula is the first argument to the aov() function, and data is the second one. Regardless of how I specify the ANOVA, I can assign the output of the aov() function to a variable, like this for example:

my.anova <- aov( mood.gain ~ drug, clin.trial ) 

This is almost always a good thing to do, because there’s lots of useful things that we can do with the my.anova variable. So let’s assume that it’s this last command that I used to specify the ANOVA that I’m trying to run, and as a consequence I have this my.anova variable sitting in my workspace, waiting for me to do something with it…

### 14.3.2 Understanding what the aov() function produces

Now that we’ve seen how to use the aov() function to create my.anova we’d better have a look at what this variable actually is. The first thing to do is to check to see what class of variable we’ve created, since it’s kind of interesting in this case. When we do that…

class( my.anova )
## [1] "aov" "lm"

… we discover that my.anova actually has two classes! The first class tells us that it’s an aov (analysis of variance) object, but the second tells us that it’s also an lm (linear model) object. Later on, we’ll see that this reflects a pretty deep statistical relationship between ANOVA and regression (Chapter 15 and it means that any function that exists in R for dealing with regressions can also be applied to aov objects, which is neat; but I’m getting ahead of myself. For now, I want to note that what we’ve created is an aov object, and to also make the point that aov objects are actually rather complicated beasts. I won’t be trying to explain everything about them, since it’s way beyond the scope of an introductory statistics subject, but to give you a tiny hint of some of the stuff that R stores inside an aov object, let’s ask it to print out the names() of all the stored quantities…

names( my.anova )
##  [1] "coefficients"  "residuals"     "effects"       "rank"
##  [5] "fitted.values" "assign"        "qr"            "df.residual"
##  [9] "contrasts"     "xlevels"       "call"          "terms"
## [13] "model"

As we go through the rest of the book, I hope that a few of these will become a little more obvious to you, but right now that’s going to look pretty damned opaque. That’s okay. You don’t need to know any of the details about it right now, and most of it you don’t need at all… what you do need to understand is that the aov() function does a lot of calculations for you, not just the basic ones that I outlined in the previous sections. What this means is that it’s generally a good idea to create a variable like my.anova that stores the output of the aov() function… because later on, you can use my.anova as an input to lots of other functions: those other functions can pull out bits and pieces from the aov object, and calculate various other things that you might need.

Right then. The simplest thing you can do with an aov object is to print() it out. When we do that, it shows us a few of the key quantities of interest:

print( my.anova )
## Call:
##    aov(formula = mood.gain ~ drug, data = clin.trial)
##
## Terms:
##                     drug Residuals
## Sum of Squares  3.453333  1.391667
## Deg. of Freedom        2        15
##
## Residual standard error: 0.3045944
## Estimated effects may be unbalanced

Specificially, it prints out a reminder of the command that you used when you called aov() in the first place, shows you the sums of squares values, the degrees of freedom, and a couple of other quantities that we’re not really interested in right now. Notice, however, that R doesn’t use the names “between-group” and “within-group”. Instead, it tries to assign more meaningful names: in our particular example, the between groups variance corresponds to the effect that the drug has on the outcome variable; and the within groups variance is corresponds to the “leftover” variability, so it calls that the residuals. If we compare these numbers to the numbers that I calculated by hand in Section 14.2.5, you can see that they’re identical… the between groups sums of squares is $$\mbox{SS}_b = 3.45$$, the within groups sums of squares is $$\mbox{SS}_w = 1.39$$, and the degrees of freedom are 2 and 15 repectively.

### 14.3.3 Running the hypothesis tests for the ANOVA

Okay, so we’ve verified that my.anova seems to be storing a bunch of the numbers that we’re looking for, but the print() function didn’t quite give us the output that we really wanted. Where’s the $$F$$-value? The $$p$$-value? These are the most important numbers in our hypothesis test, but the print() function doesn’t provide them. To get those numbers, we need to use a different function. Instead of asking R to print() out the aov object, we should have asked for a summary() of it.206 When we do that…

summary( my.anova )
##             Df Sum Sq Mean Sq F value   Pr(>F)
## drug         2  3.453  1.7267   18.61 8.65e-05 ***
## Residuals   15  1.392  0.0928
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

… we get all of the key numbers that we calculated earlier. We get the sums of squares, the degrees of freedom, the mean squares, the $$F$$-statistic, and the $$p$$-value itself. These are all identical to the numbers that we calculated ourselves when doing it the long and tedious way, and it’s even organised into the same kind of ANOVA table that I showed in Table 14.1, and then filled out by hand in Section 14.2.5. The only things that are even slightly different is that some of the row and column names are a bit different.

## 14.4 Effect size

There’s a few different ways you could measure the effect size in an ANOVA, but the most commonly used measures are $$\eta^2$$ (eta squared) and partial $$\eta^2$$. For a one way analysis of variance they’re identical to each other, so for the moment I’ll just explain $$\eta^2$$. The definition of $$\eta^2$$ is actually really simple: $\eta^2 = \frac{\mbox{SS}_b}{\mbox{SS}_{tot}}$ That’s all it is. So when I look at the ANOVA table above, I see that $$\mbox{SS}_b = 3.45$$ and $$\mbox{SS}_{tot} = 3.45 + 1.39 = 4.84$$. Thus we get an $$\eta^2$$ value of $\eta^2 = \frac{3.45}{4.84} = 0.71$ The interpretation of $$\eta^2$$ is equally straightforward: it refers to the proportion of the variability in the outcome variable (mood.gain) that can be explained in terms of the predictor (drug). A value of $$\eta^2 = 0$$ means that there is no relationship at all between the two, whereas a value of $$\eta^2 = 1$$ means that the relationship is perfect. Better yet, the $$\eta^2$$ value is very closely related to a squared correlation (i.e., $$r^2$$). So, if you’re trying to figure out whether a particular value of $$\eta^2$$ is big or small, it’s sometimes useful to remember that $\eta= \sqrt{\frac{\mbox{SS}_b}{\mbox{SS}_{tot}}}$ can be interpreted as if it referred to the magnitude of a Pearson correlation. So in our drugs example, the $$\eta^2$$ value of .71 corresponds to an $$\eta$$ value of $$\sqrt{.71} = .84$$. If we think about this as being equivalent to a correlation of about .84, we’d conclude that the relationship between drug and mood.gain is strong.

The core packages in R don’t include any functions for calculating $$\eta^2$$. However, it’s pretty straightforward to calculate it directly from the numbers in the ANOVA table. In fact, since I’ve already got the SSw and SSb variables lying around from my earlier calculations, I can do this:

SStot <- SSb + SSw          # total sums of squares
eta.squared <- SSb / SStot  # eta-squared value
print( eta.squared )
## [1] 0.7127623

However, since it can be tedious to do this the long way (especially when we start running more complicated ANOVAs, such as those in Chapter 16 I’ve included an etaSquared() function in the lsr package which will do it for you. For now, the only argument you need to care about is x, which should be the aov object corresponding to your ANOVA. When we do this, what we get as output is this:

etaSquared( x = my.anova )
##         eta.sq eta.sq.part
## drug 0.7127623   0.7127623

The output here shows two different numbers. The first one corresponds to the $$\eta^2$$ statistic, precisely as described above. The second one refers to “partial $$\eta^2$$”, which is a somewhat different measure of effect size that I’ll describe later. For the simple ANOVA that we’ve just run, they’re the same number. But this won’t always be true once we start running more complicated ANOVAs.207

## 14.5 Multiple comparisons and post hoc tests

Any time you run an ANOVA with more than two groups, and you end up with a significant effect, the first thing you’ll probably want to ask is which groups are actually different from one another. In our drugs example, our null hypothesis was that all three drugs (placebo, Anxifree and Joyzepam) have the exact same effect on mood. But if you think about it, the null hypothesis is actually claiming three different things all at once here. Specifically, it claims that:

• Your competitor’s drug (Anxifree) is no better than a placebo (i.e., $$\mu_A = \mu_P$$)
• Your drug (Joyzepam) is no better than a placebo (i.e., $$\mu_J = \mu_P$$)
• Anxifree and Joyzepam are equally effective (i.e., $$\mu_J = \mu_A$$)

If any one of those three claims is false, then the null hypothesis is also false. So, now that we’ve rejected our null hypothesis, we’re thinking that at least one of those things isn’t true. But which ones? All three of these propositions are of interest: you certainly want to know if your new drug Joyzepam is better than a placebo, and it would be nice to know how well it stacks up against an existing commercial alternative (i.e., Anxifree). It would even be useful to check the performance of Anxifree against the placebo: even if Anxifree has already been extensively tested against placebos by other researchers, it can still be very useful to check that your study is producing similar results to earlier work.

When we characterise the null hypothesis in terms of these three distinct propositions, it becomes clear that there are eight possible “states of the world” that we need to distinguish between:

possibility: is $$\mu_P = \mu_A$$? is $$\mu_P = \mu_J$$? is $$\mu_A = \mu_J$$? which hypothesis?
1 $$\checkmark$$ $$\checkmark$$ $$\checkmark$$ null
2 $$\checkmark$$ $$\checkmark$$ alternative
3 $$\checkmark$$ $$\checkmark$$ alternative
4 $$\checkmark$$ alternative
5 $$\checkmark$$ $$\checkmark$$ alternative
6 $$\checkmark$$ alternative
7 $$\checkmark$$ alternative
8 alternative

By rejecting the null hypothesis, we’ve decided that we don’t believe that #1 is the true state of the world. The next question to ask is, which of the other seven possibilities do we think is right? When faced with this situation, its usually helps to look at the data. For instance, if we look at the plots in Figure 14.1, it’s tempting to conclude that Joyzepam is better than the placebo and better than Anxifree, but there’s no real difference between Anxifree and the placebo. However, if we want to get a clearer answer about this, it might help to run some tests.

### 14.5.1 Running “pairwise” $$t$$-tests

How might we go about solving our problem? Given that we’ve got three separate pairs of means (placebo versus Anxifree, placebo versus Joyzepam, and Anxifree versus Joyzepam) to compare, what we could do is run three separate $$t$$-tests and see what happens. There’s a couple of ways that we could do this. One method would be to construct new variables corresponding the groups you want to compare (e.g., anxifree, placebo and joyzepam), and then run a $$t$$-test on these new variables:

t.test( anxifree, placebo, var.equal = TRUE )   # Student t-test

anxifree <- with(clin.trial, mood.gain[drug == "anxifree"])  # mood change due to anxifree
placebo <- with(clin.trial, mood.gain[drug == "placebo"])    # mood change due to placebo 

or, you could use the subset argument in the t.test() function to select only those observations corresponding to one of the two groups we’re interested in:

t.test( formula = mood.gain ~ drug,
data = clin.trial,
subset = drug %in% c("placebo","anxifree"),
var.equal = TRUE
)

See Chapter 7 if you’ve forgotten how the %in% operator works. Regardless of which version we do, R will print out the results of the $$t$$-test, though I haven’t included that output here. If we go on to do this for all possible pairs of variables, we can look to see which (if any) pairs of groups are significantly different to each other. This “lots of $$t$$-tests idea” isn’t a bad strategy, though as we’ll see later on there are some problems with it. However, for the moment our bigger problem is that it’s a pain to have to type in such a long command over and over again: for instance, if your experiment has 10 groups, then you have to run 45 $$t$$-tests. That’s way too much typing.

To help keep the typing to a minimum, R provides a function called pairwise.t.test() that automatically runs all of the $$t$$-tests for you. There are three arguments that you need to specify, the outcome variable x, the group variable g, and the p.adjust.method argument, which “adjusts” the $$p$$-value in one way or another. I’ll explain $$p$$-value adjustment in a moment, but for now we can just set p.adjust.method = "none" since we’re not doing any adjustments. For our example, here’s what we do:

 pairwise.t.test( x = clin.trial$mood.gain, # outcome variable g = clin.trial$drug,        # grouping variable
p.adjust.method = "none"    # which correction to use?
)
##
##  Pairwise comparisons using t tests with pooled SD
##
## data:  clin.trial$mood.gain and clin.trial$drug
##
##          placebo anxifree
## anxifree 0.15021 -
## joyzepam 3e-05   0.00056
##
## P value adjustment method: none

One thing that bugs me slightly about the pairwise.t.test() function is that you can’t just give it an aov object, and have it produce this output. After all, I went to all that trouble earlier of getting R to create the my.anova variable and – as we saw in Section 14.3.2 – R has actually stored enough information inside it that I should just be able to get it to run all the pairwise tests using my.anova as an input. To that end, I’ve included a posthocPairwiseT() function in the lsr package that lets you do this. The idea behind this function is that you can just input the aov object itself,208 and then get the pairwise tests as an output. As of the current writing, posthocPairwiseT() is actually just a simple way of calling pairwise.t.test() function, but you should be aware that I intend to make some changes to it later on. Here’s an example:

posthocPairwiseT( x = my.anova, p.adjust.method = "none" )
##
##  Pairwise comparisons using t tests with pooled SD
##
## data:  mood.gain and drug
##
##          placebo anxifree
## anxifree 0.15021 -
## joyzepam 3e-05   0.00056
##
## P value adjustment method: none

In later versions, I plan to add more functionality (e.g., adjusted confidence intervals), but for now I think it’s at least kind of useful. To see why, let’s suppose you’ve run your ANOVA and stored the results in my.anova, and you’re happy using the Holm correction (the default method in pairwise.t.test(), which I’ll explain this in a moment). In that case, all you have to do is type this:

posthocPairwiseT( my.anova )

and R will output the test results. Much more convenient, I think.

### 14.5.2 Corrections for multiple testing

In the previous section I hinted that there’s a problem with just running lots and lots of $$t$$-tests. The concern is that when running these analyses, what we’re doing is going on a “fishing expedition”: we’re running lots and lots of tests without much theoretical guidance, in the hope that some of them come up significant. This kind of theory-free search for group differences is referred to as post hoc analysis (“post hoc” being Latin for “after this”).209

It’s okay to run post hoc analyses, but a lot of care is required. For instance, the analysis that I ran in the previous section is actually pretty dangerous: each individual $$t$$-test is designed to have a 5% Type I error rate (i.e., $$\alpha = .05$$), and I ran three of these tests. Imagine what would have happened if my ANOVA involved 10 different groups, and I had decided to run 45 “post hoc” $$t$$-tests to try to find out which ones were significantly different from each other, you’d expect 2 or 3 of them to come up significant by chance alone. As we saw in Chapter 11, the central organising principle behind null hypothesis testing is that we seek to control our Type I error rate, but now that I’m running lots of $$t$$-tests at once, in order to determine the source of my ANOVA results, my actual Type I error rate across this whole family of tests has gotten completely out of control.

The usual solution to this problem is to introduce an adjustment to the $$p$$-value, which aims to control the total error rate across the family of tests (see Shaffer 1995). An adjustment of this form, which is usually (but not always) applied because one is doing post hoc analysis, is often referred to as a correction for multiple comparisons, though it is sometimes referred to as “simultaneous inference”. In any case, there are quite a few different ways of doing this adjustment. I’ll discuss a few of them in this section and in Section 16.8, but you should be aware that there are many other methods out there (see, e.g., Hsu 1996).

### 14.5.3 Bonferroni corrections

The simplest of these adjustments is called the Bonferroni correction (Dunn 1961), and it’s very very simple indeed. Suppose that my post hoc analysis consists of $$m$$ separate tests, and I want to ensure that the total probability of making any Type I errors at all is at most $$\alpha$$.210 If so, then the Bonferroni correction just says “multiply all your raw $$p$$-values by $$m$$”. If we let $$p$$ denote the original $$p$$-value, and let $$p^\prime_j$$ be the corrected value, then the Bonferroni correction tells that: $p^\prime = m \times p$ And therefore, if you’re using the Bonferroni correction, you would reject the null hypothesis if $$p^\prime < \alpha$$. The logic behind this correction is very straightforward. We’re doing $$m$$ different tests; so if we arrange it so that each test has a Type I error rate of at most $$\alpha / m$$, then the total Type I error rate across these tests cannot be larger than $$\alpha$$. That’s pretty simple, so much so that in the original paper, the author writes:

The method given here is so simple and so general that I am sure it must have been used before this. I do not find it, however, so can only conclude that perhaps its very simplicity has kept statisticians from realizing that it is a very good method in some situations (pp 52-53 Dunn 1961)

To use the Bonferroni correction in R, you can use the pairwise.t.test() function,211 making sure that you set p.adjust.method = "bonferroni". Alternatively, since the whole reason why we’re doing these pairwise tests in the first place is because we have an ANOVA that we’re trying to understand, it’s probably more convenient to use the posthocPairwiseT() function in the lsr package, since we can use my.anova as the input:

posthocPairwiseT( my.anova, p.adjust.method = "bonferroni")
##
##  Pairwise comparisons using t tests with pooled SD
##
## data:  mood.gain and drug
##
##          placebo anxifree
## anxifree 0.4506  -
## joyzepam 9.1e-05 0.0017
##
## P value adjustment method: bonferroni

If we compare these three $$p$$-values to those that we saw in the previous section when we made no adjustment at all, it is clear that the only thing that R has done is multiply them by 3.

### 14.5.4 Holm corrections

Although the Bonferroni correction is the simplest adjustment out there, it’s not usually the best one to use. One method that is often used instead is the Holm correction (Holm 1979). The idea behind the Holm correction is to pretend that you’re doing the tests sequentially; starting with the smallest (raw) $$p$$-value and moving onto the largest one. For the $$j$$-th largest of the $$p$$-values, the adjustment is either $p^\prime_j = j \times p_j$ (i.e., the biggest $$p$$-value remains unchanged, the second biggest $$p$$-value is doubled, the third biggest $$p$$-value is tripled, and so on), or $p^\prime_j = p^\prime_{j+1}$ whichever one is larger. This might sound a little confusing, so let’s go through it a little more slowly. Here’s what the Holm correction does. First, you sort all of your $$p$$-values in order, from smallest to largest. For the smallest $$p$$-value all you do is multiply it by $$m$$, and you’re done. However, for all the other ones it’s a two-stage process. For instance, when you move to the second smallest $$p$$ value, you first multiply it by $$m-1$$. If this produces a number that is bigger than the adjusted $$p$$-value that you got last time, then you keep it. But if it’s smaller than the last one, then you copy the last $$p$$-value. To illustrate how this works, consider the table below, which shows the calculations of a Holm correction for a collection of five $$p$$-values:

raw $$p$$ rank $$j$$ $$p \times j$$ Holm $$p$$
.001 5 .005 .005
.005 4 .020 .020
.019 3 .057 .057
.022 2 .044 .057
.103 1 .103 .103

Hopefully that makes things clear.

Although it’s a little harder to calculate, the Holm correction has some very nice properties: it’s more powerful than Bonferroni (i.e., it has a lower Type II error rate), but – counterintuitive as it might seem – it has the same Type I error rate. As a consequence, in practice there’s never any reason to use the simpler Bonferroni correction, since it is always outperformed by the slightly more elaborate Holm correction. Because of this, the Holm correction is the default one used by pairwise.t.test() and posthocPairwiseT(). To run the Holm correction in R, you could specify p.adjust.method = "Holm" if you wanted to, but since it’s the default you can just to do this:

posthocPairwiseT( my.anova )
##
##  Pairwise comparisons using t tests with pooled SD
##
## data:  mood.gain and drug
##
##          placebo anxifree
## anxifree 0.1502  -
## joyzepam 9.1e-05 0.0011
##
## P value adjustment method: holm

As you can see, the biggest $$p$$-value (corresponding to the comparison between Anxifree and the placebo) is unaltered: at a value of $$.15$$, it is exactly the same as the value we got originally when we applied no correction at all. In contrast, the smallest $$p$$-value (Joyzepam versus placebo) has been multiplied by three.

### 14.5.5 Writing up the post hoc test

Finally, having run the post hoc analysis to determine which groups are significantly different to one another, you might write up the result like this:

Post hoc tests (using the Holm correction to adjust $$p$$) indicated that Joyzepam produced a significantly larger mood change than both Anxifree ($$p = .001$$) and the placebo ($$p = 9.1 \times 10^{-5}$$). We found no evidence that Anxifree performed better than the placebo ($$p = .15$$).

Or, if you don’t like the idea of reporting exact $$p$$-values, then you’d change those numbers to $$p<.01$$, $$p<.001$$ and $$p > .05$$ respectively. Either way, the key thing is that you indicate that you used Holm’s correction to adjust the $$p$$-values. And of course, I’m assuming that elsewhere in the write up you’ve included the relevant descriptive statistics (i.e., the group means and standard deviations), since these $$p$$-values on their own aren’t terribly informative.

## 14.6 Assumptions of one-way ANOVA

Like any statistical test, analysis of variance relies on some assumptions about the data. There are three key assumptions that you need to be aware of: normality, homogeneity of variance and independence. If you remember back to Section 14.2.4 – which I hope you at least skimmed even if you didn’t read the whole thing – I described the statistical models underpinning ANOVA, which I wrote down like this: $\begin{array}{lrcl} H_0: & Y_{ik} &=& \mu + \epsilon_{ik} \\ H_1: & Y_{ik} &=& \mu_k + \epsilon_{ik} \end{array}$ In these equations $$\mu$$ refers to a single, grand population mean which is the same for all groups, and $$\mu_k$$ is the population mean for the $$k$$-th group. Up to this point we’ve been mostly interested in whether our data are best described in terms of a single grand mean (the null hypothesis) or in terms of different group-specific means (the alternative hypothesis). This makes sense, of course: that’s actually the important research question! However, all of our testing procedures have – implicitly – relied on a specific assumption about the residuals, $$\epsilon_{ik}$$, namely that $\epsilon_{ik} \sim \mbox{Normal}(0, \sigma^2)$ None of the maths works properly without this bit. Or, to be precise, you can still do all the calculations, and you’ll end up with an $$F$$-statistic, but you have no guarantee that this $$F$$-statistic actually measures what you think it’s measuring, and so any conclusions that you might draw on the basis of the $$F$$ test might be wrong.

So, how do we check whether this assumption about the residuals is accurate? Well, as I indicated above, there are three distinct claims buried in this one statement, and we’ll consider them separately.

• Normality. The residuals are assumed to be normally distributed. As we saw in Section 13.9, we can assess this by looking at QQ plots or running a Shapiro-Wilk test. I’ll talk about this in an ANOVA context in Section 14.9.
• Homogeneity of variance. Notice that we’ve only got the one value for the population standard deviation (i.e., $$\sigma$$), rather than allowing each group to have it’s own value (i.e., $$\sigma_k$$). This is referred to as the homogeneity of variance (sometimes called homoscedasticity) assumption. ANOVA assumes that the population standard deviation is the same for all groups. We’ll talk about this extensively in Section 14.7.
• Independence. The independence assumption is a little trickier. What it basically means is that, knowing one residual tells you nothing about any other residual. All of the $$\epsilon_{ik}$$ values are assumed to have been generated without any “regard for” or “relationship to” any of the other ones. There’s not an obvious or simple way to test for this, but there are some situations that are clear violations of this: for instance, if you have a repeated-measures design, where each participant in your study appears in more than one condition, then independence doesn’t hold; there’s a special relationship between some observations… namely those that correspond to the same person! When that happens, you need to use something like repeated measures ANOVA. I don’t currently talk about repeated measures ANOVA in this book, but it will be included in later versions.

### 14.6.1 How robust is ANOVA?

One question that people often want to know the answer to is the extent to which you can trust the results of an ANOVA if the assumptions are violated. Or, to use the technical language, how robust is ANOVA to violations of the assumptions. Due to deadline constraints I don’t have the time to discuss this topic. This is a topic I’ll cover in some detail in a later version of the book.

## 14.7 Checking the homogeneity of variance assumption

There’s more than one way to skin a cat, as the saying goes, and more than one way to test the homogeneity of variance assumption, too (though for some reason no-one made a saying out of that). The most commonly used test for this that I’ve seen in the literature is the Levene test (Levene 1960), and the closely related Brown-Forsythe test (Brown and Forsythe 1974), both of which I’ll describe here. Alternatively, you could use the Bartlett test, which is implemented in R via the bartlett.test() function, but I’ll leave it as an exercise for the reader to go check that one out if you’re interested.

Levene’s test is shockingly simple. Suppose we have our outcome variable $$Y_{ik}$$. All we do is define a new variable, which I’ll call $$Z_{ik}$$, corresponding to the absolute deviation from the group mean: $Z_{ik} = \left| Y_{ik} - \bar{Y}_k \right|$ Okay, what good does this do us? Well, let’s take a moment to think about what $$Z_{ik}$$ actually is, and what we’re trying to test. The value of $$Z_{ik}$$ is a measure of how the $$i$$-th observation in the $$k$$-th group deviates from its group mean. And our null hypothesis is that all groups have the same variance; that is, the same overall deviations from the group means! So, the null hypothesis in a Levene’s test is that the population means of $$Z$$ are identical for all groups. Hm. So what we need now is a statistical test of the null hypothesis that all group means are identical. Where have we seen that before? Oh right, that’s what ANOVA is… and so all that the Levene’s test does is run an ANOVA on the new variable $$Z_{ik}$$.

What about the Brown-Forsythe test? Does that do anything particularly different? Nope. The only change from the Levene’s test is that it constructs the transformed variable $$Z$$ in a slightly different way, using deviations from the group medians rather than deviations from the group means. That is, for the Brown-Forsythe test, $Z_{ik} = \left| Y_{ik} - \mbox{median}_k(Y) \right|$ where $$\mbox{median}_k(Y)$$ is the median for group $$k$$. Regardless of whether you’re doing the standard Levene test or the Brown-Forsythe test, the test statistic – which is sometimes denoted $$F$$, but sometimes written as $$W$$ – is calculated in exactly the same way that the $$F$$-statistic for the regular ANOVA is calculated, just using a $$Z_{ik}$$ rather than $$Y_{ik}$$. With that in mind, let’s just move on and look at how to run the test in R.

### 14.7.1 Running the Levene’s test in R

Okay, so how do we run the Levene test? Obviously, since the Levene test is just an ANOVA, it would be easy enough to manually create the transformed variable $$Z_{ik}$$ and then use the aov() function to run an ANOVA on that. However, that’s the tedious way to do it. A better way to do run your Levene’s test is to use the leveneTest() function, which is in the car package. As usual, we first load the package

library( car )  
## Loading required package: carData

and now that we have, we can run our Levene test. The main argument that you need to specify is y, but you can do this in lots of different ways. Probably the simplest way to do it is actually input the original aov object. Since I’ve got the my.anova variable stored from my original ANOVA, I can just do this:

leveneTest( my.anova )
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  2  1.4672 0.2618
##       15

If we look at the output, we see that the test is non-significant $$(F_{2,15} = 1.47, p = .26)$$, so it looks like the homogeneity of variance assumption is fine. Remember, although R reports the test statistic as an $$F$$-value, it could equally be called $$W$$, in which case you’d just write $$W_{2,15} = 1.47$$. Also, note the part of the output that says center = median. That’s telling you that, by default, the leveneTest() function actually does the Brown-Forsythe test. If you want to use the mean instead, then you need to explicitly set the center argument, like this:

leveneTest( y = my.anova, center = mean )
## Levene's Test for Homogeneity of Variance (center = mean)
##       Df F value Pr(>F)
## group  2  1.4497 0.2657
##       15

That being said, in most cases it’s probably best to stick to the default value, since the Brown-Forsythe test is a bit more robust than the original Levene test.

Two more quick comments before I move onto a different topic. Firstly, as mentioned above, there are other ways of calling the leveneTest() function. Although the vast majority of situations that call for a Levene test involve checking the assumptions of an ANOVA (in which case you probably have a variable like my.anova lying around), sometimes you might find yourself wanting to specify the variables directly. Two different ways that you can do this are shown below:

leveneTest(y = mood.gain ~ drug, data = clin.trial)   # y is a formula in this case
leveneTest(y = clin.trial$mood.gain, group = clin.trial$drug)   # y is the outcome  

Secondly, I did mention that it’s possible to run a Levene test just using the aov() function. I don’t want to waste a lot of space on this, but just in case some readers are interested in seeing how this is done, here’s the code that creates the new variables and runs an ANOVA. If you are interested, feel free to run this to verify that it produces the same answers as the Levene test (i.e., with center = mean):

Y <- clin.trial $mood.gain # the original outcome variable, Y G <- clin.trial$ drug         # the grouping variable, G
gp.mean <- tapply(Y, G, mean)  # calculate group means
Ybar <- gp.mean[G]             # group mean associated with each obs
Z <- abs(Y - Ybar)             # the transformed variable, Z
summary( aov(Z ~ G) )          # run the ANOVA 
##             Df Sum Sq Mean Sq F value Pr(>F)
## G            2 0.0616 0.03080    1.45  0.266
## Residuals   15 0.3187 0.02125

That said, I don’t imagine that many people will care about this. Nevertheless, it’s nice to know that you could do it this way if you wanted to. And for those of you who do try it, I think it helps to demystify the test a little bit when you can see – with your own eyes – the way in which Levene’s test relates to ANOVA.

## 14.8 Removing the homogeneity of variance assumption

In our example, the homogeneity of variance assumption turned out to be a pretty safe one: the Levene test came back non-significant, so we probably don’t need to worry. However, in real life we aren’t always that lucky. How do we save our ANOVA when the homogeneity of variance assumption is violated? If you recall from our discussion of $$t$$-tests, we’ve seen this problem before. The Student $$t$$-test assumes equal variances, so the solution was to use the Welch $$t$$-test, which does not. In fact, Welch (1951) also showed how we can solve this problem for ANOVA too (the Welch one-way test). It’s implemented in R using the oneway.test() function. The arguments that we’ll need for our example are:

• formula. This is the model formula, which (as usual) needs to specify the outcome variable on the left hand side and the grouping variable on the right hand side: i.e., something like outcome ~ group.
• data. Specifies the data frame containing the variables.
• var.equal. If this is FALSE (the default) a Welch one-way test is run. If it is TRUE then it just runs a regular ANOVA.

The function also has a subset argument that lets you analyse only some of the observations and a na.action argument that tells it how to handle missing data, but these aren’t necessary for our purposes. So, to run the Welch one-way ANOVA for our example, we would do this:

oneway.test(mood.gain ~ drug, data = clin.trial)
##
##  One-way analysis of means (not assuming equal variances)
##
## data:  mood.gain and drug
## F = 26.322, num df = 2.0000, denom df = 9.4932, p-value = 0.000134

To understand what’s happening here, let’s compare these numbers to what we got earlier in Section 14.3 when we ran our original ANOVA. To save you the trouble of flicking back, here are those numbers again, this time calculated by setting var.equal = TRUE for the oneway.test() function:

oneway.test(mood.gain ~ drug, data = clin.trial, var.equal = TRUE)
##
##  One-way analysis of means
##
## data:  mood.gain and drug
## F = 18.611, num df = 2, denom df = 15, p-value = 8.646e-05

Okay, so originally our ANOVA gave us the result $$F(2,15) = 18.6$$, whereas the Welch one-way test gave us $$F(2,9.49) = 26.32$$. In other words, the Welch test has reduced the within-groups degrees of freedom from 15 to 9.49, and the $$F$$-value has increased from 18.6 to 26.32.

## 14.9 Checking the normality assumption

Testing the normality assumption is relatively straightforward. We covered most of what you need to know in Section 13.9. The only thing we really need to know how to do is pull out the residuals (i.e., the $$\epsilon_{ik}$$ values) so that we can draw our QQ plot and run our Shapiro-Wilk test. First, let’s extract the residuals. R provides a function called residuals() that will do this for us. If we pass our my.anova to this function, it will return the residuals. So let’s do that:

my.anova.residuals <- residuals( object = my.anova )   # extract the residuals

We can print them out too, though it’s not exactly an edifying experience. In fact, given that I’m on the verge of putting myself to sleep just typing this, it might be a good idea to skip that step. Instead, let’s draw some pictures and run ourselves a hypothesis test:

hist( x = my.anova.residuals )           # plot a histogram (similar to Figure @ref{fig:normalityanova}a)

qqnorm( y = my.anova.residuals )         # draw a QQ plot (similar to Figure @ref{fig:normalityanova}b)

shapiro.test( x = my.anova.residuals )   # run Shapiro-Wilk test
##
##  Shapiro-Wilk normality test
##
## data:  my.anova.residuals
## W = 0.96019, p-value = 0.6053

The histogram and QQ plot are both look pretty normal to me.212 This is supported by the results of our Shapiro-Wilk test ($$W = .96$$, $$p = .61$$) which finds no indication that normality is violated.

## 14.10 Removing the normality assumption

Now that we’ve seen how to check for normality, we are led naturally to ask what we can do to address violations of normality. In the context of a one-way ANOVA, the easiest solution is probably to switch to a non-parametric test (i.e., one that doesn’t rely on any particular assumption about the kind of distribution involved). We’ve seen non-parametric tests before, in Chapter 13: when you only have two groups, the Wilcoxon test provides the non-parametric alternative that you need. When you’ve got three or more groups, you can use the Kruskal-Wallis rank sum test (Kruskal and Wallis 1952). So that’s the test we’ll talk about next.

### 14.10.1 The logic behind the Kruskal-Wallis test

The Kruskal-Wallis test is surprisingly similar to ANOVA, in some ways. In ANOVA, we started with $$Y_{ik}$$, the value of the outcome variable for the $$i$$th person in the $$k$$th group. For the Kruskal-Wallis test, what we’ll do is rank order all of these $$Y_{ik}$$ values, and conduct our analysis on the ranked data. So let’s let $$R_{ik}$$ refer to the ranking given to the $$i$$th member of the $$k$$th group. Now, let’s calculate $$\bar{R}_k$$, the average rank given to observations in the $$k$$th group: $\bar{R}_k = \frac{1}{N_K} \sum_{i} R_{ik}$ and let’s also calculate $$\bar{R}$$, the grand mean rank: $\bar{R} = \frac{1}{N} \sum_{i} \sum_{k} R_{ik}$ Now that we’ve done this, we can calculate the squared deviations from the grand mean rank $$\bar{R}$$. When we do this for the individual scores – i.e., if we calculate $$(R_{ik} - \bar{R})^2$$ – what we have is a “nonparametric” measure of how far the $$ik$$-th observation deviates from the grand mean rank. When we calculate the squared deviation of the group means from the grand means – i.e., if we calculate $$(\bar{R}_k - \bar{R} )^2$$ – then what we have is a nonparametric measure of how much the group deviates from the grand mean rank. With this in mind, let’s follow the same logic that we did with ANOVA, and define our ranked sums of squares measures in much the same way that we did earlier. First, we have our “total ranked sums of squares”: $\mbox{RSS}_{tot} = \sum_k \sum_i ( R_{ik} - \bar{R} )^2$ and we can define the “between groups ranked sums of squares” like this: $\begin{array}{rcl} \mbox{RSS}_{b} &=& \sum_k \sum_i ( \bar{R}_k - \bar{R} )^2 \\ &=& \sum_k N_k ( \bar{R}_k - \bar{R} )^2 \end{array}$ So, if the null hypothesis is true and there are no true group differences at all, you’d expect the between group rank sums $$\mbox{RSS}_{b}$$ to be very small, much smaller than the total rank sums $$\mbox{RSS}_{tot}$$. Qualitatively this is very much the same as what we found when we went about constructing the ANOVA $$F$$-statistic; but for technical reasons the Kruskal-Wallis test statistic, usually denoted $$K$$, is constructed in a slightly different way: $K = (N - 1) \times \frac{\mbox{RSS}_b}{\mbox{RSS}_{tot}}$ and, if the null hypothesis is true, then the sampling distribution of $$K$$ is approximately chi-square with $$G-1$$ degrees of freedom (where $$G$$ is the number of groups). The larger the value of $$K$$, the less consistent the data are with null hypothesis, so this is a one-sided test: we reject $$H_0$$ when $$K$$ is sufficiently large.

The description in the previous section illustrates the logic behind the Kruskal-Wallis test. At a conceptual level, this is the right way to think about how the test works. However, from a purely mathematical perspective it’s needlessly complicated. I won’t show you the derivation, but you can use a bit of algebraic jiggery-pokery213 to show that the equation for $$K$$ can be rewritten as $K = \frac{12}{N(N-1)} \sum_k N_k {\bar{R}_k}^2 - 3(N+1)$ It’s this last equation that you sometimes see given for $$K$$. This is way easier to calculate than the version I described in the previous section, it’s just that it’s totally meaningless to actual humans. It’s probably best to think of $$K$$ the way I described it earlier… as an analogue of ANOVA based on ranks. But keep in mind that the test statistic that gets calculated ends up with a rather different look to it than the one we used for our original ANOVA.

But wait, there’s more! Dear lord, why is there always more? The story I’ve told so far is only actually true when there are no ties in the raw data. That is, if there are no two observations that have exactly the same value. If there are ties, then we have to introduce a correction factor to these calculations. At this point I’m assuming that even the most diligent reader has stopped caring (or at least formed the opinion that the tie-correction factor is something that doesn’t require their immediate attention). So I’ll very quickly tell you how it’s calculated, and omit the tedious details about why it’s done this way. Suppose we construct a frequency table for the raw data, and let $$f_j$$ be the number of observations that have the $$j$$-th unique value. This might sound a bit abstract, so here’s the R code showing a concrete example:

f <- table( clin.trial$mood.gain ) # frequency table for mood gain print(f) # we have some ties ## ## 0.1 0.2 0.3 0.4 0.5 0.6 0.8 0.9 1.1 1.2 1.3 1.4 1.7 1.8 ## 1 1 2 1 1 2 1 1 1 1 2 2 1 1 Looking at this table, notice that the third entry in the frequency table has a value of $$2$$. Since this corresponds to a mood.gain of 0.3, this table is telling us that two people’s mood increased by 0.3. More to the point, note that we can say that f[3] has a value of 2. Or, in the mathematical notation I introduced above, this is telling us that $$f_3 = 2$$. Yay. So, now that we know this, the tie correction factor (TCF) is: $\mbox{TCF} = 1 - \frac{\sum_j {f_j}^3 - f_j}{N^3 - N}$ The tie-corrected value of the Kruskal-Wallis statistic obtained by dividing the value of $$K$$ by this quantity: it is this tie-corrected version that R calculates. And at long last, we’re actually finished with the theory of the Kruskal-Wallis test. I’m sure you’re all terribly relieved that I’ve cured you of the existential anxiety that naturally arises when you realise that you don’t know how to calculate the tie-correction factor for the Kruskal-Wallis test. Right? ### 14.10.3 How to run the Kruskal-Wallis test in R Despite the horror that we’ve gone through in trying to understand what the Kruskal-Wallis test actually does, it turns out that running the test is pretty painless, since R has a function called kruskal.test(). The function is pretty flexible, and allows you to input your data in a few different ways. Most of the time you’ll have data like the clin.trial data set, in which you have your outcome variable mood.gain, and a grouping variable drug. If so, you can call the kruskal.test() function by specifying a formula, and a data frame: kruskal.test(mood.gain ~ drug, data = clin.trial) ## ## Kruskal-Wallis rank sum test ## ## data: mood.gain by drug ## Kruskal-Wallis chi-squared = 12.076, df = 2, p-value = 0.002386 A second way of using the kruskal.test() function, which you probably won’t have much reason to use, is to directly specify the outcome variable and the grouping variable as separate input arguments, x and g: kruskal.test(x = clin.trial$mood.gain, g = clin.trial$drug) ## ## Kruskal-Wallis rank sum test ## ## data: clin.trial$mood.gain and clin.trial\$drug
## Kruskal-Wallis chi-squared = 12.076, df = 2, p-value = 0.002386

This isn’t very interesting, since it’s just plain easier to specify a formula. However, sometimes it can be useful to specify x as a list. What I mean is this. Suppose you actually had data as three separate variables, placebo, anxifree and joyzepam. If that’s the format that your data are in, then it’s convenient to know that you can bundle all three together as a list:

mood.gain <- list( placebo, joyzepam, anxifree )
kruskal.test( x = mood.gain )

And again, this would give you exactly the same results as the command we tried originally.

## 14.11 On the relationship between ANOVA and the Student $$t$$ test

There’s one last thing I want to point out before finishing. It’s something that a lot of people find kind of surprising, but it’s worth knowing about: an ANOVA with two groups is identical to the Student $$t$$-test. No, really. It’s not just that they are similar, but they are actually equivalent in every meaningful way. I won’t try to prove that this is always true, but I will show you a single concrete demonstration. Suppose that, instead of running an ANOVA on our mood.gain ~ drug model, let’s instead do it using therapy as the predictor. If we run this ANOVA, here’s what we get:

summary( aov( mood.gain ~ therapy, data = clin.trial ))
##             Df Sum Sq Mean Sq F value Pr(>F)
## therapy      1  0.467  0.4672   1.708   0.21
## Residuals   16  4.378  0.2736

Overall, it looks like there’s no significant effect here at all but, as we’ll see in Chapter @ref(anova2 this is actually a misleading answer! In any case, it’s irrelevant to our current goals: our interest here is in the $$F$$-statistic, which is $$F(1,16) = 1.71$$, and the $$p$$-value, which is .21. Since we only have two groups, I didn’t actually need to resort to an ANOVA, I could have just decided to run a Student $$t$$-test. So let’s see what happens when I do that:

t.test( mood.gain ~ therapy, data = clin.trial, var.equal = TRUE )
##
##  Two Sample t-test
##
## data:  mood.gain by therapy
## t = -1.3068, df = 16, p-value = 0.2098
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.8449518  0.2005073
## sample estimates:
## mean in group no.therapy        mean in group CBT
##                0.7222222                1.0444444

Curiously, the $$p$$-values are identical: once again we obtain a value of $$p = .21$$. But what about the test statistic? Having run a $$t$$-test instead of an ANOVA, we get a somewhat different answer, namely $$t(16) = -1.3068$$. However, there is a fairly straightforward relationship here. If we square the $$t$$-statistic

1.3068 ^ 2
## [1] 1.707726

we get the $$F$$-statistic from before.

## 14.12 Summary

There’s a fair bit covered in this chapter, but there’s still a lot missing. Most obviously, I haven’t yet discussed any analog of the paired samples $$t$$-test for more than two groups. There is a way of doing this, known as repeated measures ANOVA, which will appear in a later version of this book. I also haven’t discussed how to run an ANOVA when you are interested in more than one grouping variable, but that will be discussed in a lot of detail in Chapter 16. In terms of what we have discussed, the key topics were:

• The basic logic behind how ANOVA works (Section 14.2) and how to run one in R (Section 14.3).
• How to compute an effect size for an ANOVA (Section 14.4)
• Post hoc analysis and corrections for multiple testing (Section 14.5).
• The assumptions made by ANOVA (Section 14.6).
• How to check the homogeneity of variance assumption (Section 14.7) and what to do if it is violated (Section 14.8).
• How to check the normality assumption (Section 14.9 and what to do if it is violated (Section 14.10).

As with all of the chapters in this book, there are quite a few different sources that I’ve relied upon, but the one stand-out text that I’ve been most heavily influenced by is Sahai and Ageel (2000). It’s not a good book for beginners, but it’s an excellent book for more advanced readers who are interested in understanding the mathematics behind ANOVA.

### References

Hays, W. L. 1994. Statistics. 5th ed. Fort Worth, TX: Harcourt Brace.

Shaffer, J. P. 1995. “Multiple Hypothesis Testing.” Annual Review of Psychology 46: 561–84.

Hsu, J. C. 1996. Multiple Comparisons: Theory and Methods. London, UK: Chapman; Hall.

Dunn, O.J. 1961. “Multiple Comparisons Among Means.” Journal of the American Statistical Association 56: 52–64.

Holm, S. 1979. “A Simple Sequentially Rejective Multiple Test Procedure.” Scandinavian Journal of Statistics 6: 65–70.

Levene, H. 1960. “Robust Tests for Equality of Variances.” In Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling, edited by I. Olkin et al, 278–92. Palo Alto, CA: Stanford University Press.

Brown, M. B., and A. B. Forsythe. 1974. “Robust Tests for Equality of Variances.” Journal of the American Statistical Association 69: 364–67.

Welch, B. 1951. “On the Comparison of Several Mean Values: An Alternative Approach.” Biometrika 38: 330–36.

Kruskal, W. H., and W. A. Wallis. 1952. “Use of Ranks in One-Criterion Variance Analysis.” Journal of the American Statistical Association 47: 583–621.

Sahai, H., and M. I. Ageel. 2000. The Analysis of Variance: Fixed, Random and Mixed Models. Boston: Birkhauser.

1. When all groups have the same number of observations, the experimental design is said to be “balanced”. Balance isn’t such a big deal for one-way ANOVA, which is the topic of this chapter. It becomes more important when you start doing more complicated ANOVAs.

2. In a later versions I’m intending to expand on this. But because I’m writing in a rush, and am already over my deadlines, I’ll just briefly note that if you read ahead to Chapter 16 and look at how the “treatment effect” at level $$k$$ of a factor is defined in terms of the $$\alpha_k$$ values (see Section 16.2), it turns out that $$Q$$ refers to a weighted mean of the squared treatment effects, $$Q=(\sum_{k=1}^G N_k\alpha_k^2)/(G-1)$$.

3. Or, if we want to be sticklers for accuracy, $$1 + \frac{2}{df_2 - 2}$$.

4. Or, to be precise, party like “it’s 1899 and we’ve got no friends and nothing better to do with our time than do some calculations that wouldn’t have made any sense in 1899 because ANOVA didn’t exist until about the 1920s”.

5. Actually, it also provides a function called anova(), but that works a bit differently, so let’s just ignore it for now.

6. It’s worth noting that you can get the same result by using the command anova( my.anova ).

7. A potentially important footnote – I wrote the etaSquared() function for the lsr package as a teaching exercise, but like all the other functions in the lsr package it hasn’t been exhaustively tested. As of this writing – lsr package version 0.5 – there is at least one known bug in the code. In some cases at least, it doesn’t work (and can give very silly answers) when you set the weights on the observations to something other than uniform. That doesn’t matter at all for this book, since those kinds of analyses are well beyond the scope, but I haven’t had a chance to revisit the package in a long time; it would probably be wise to be very cautious with the use of this function in any context other than very simple introductory analyses. Thanks to Emil Kirkegaard for finding the bug! (Oh, and while I’m here, there’s an interesting blog post by Daniel Lakens suggesting that eta-squared itself is perhaps not the best measure of effect size in real world data analysis: http://daniellakens.blogspot.com.au/2015/06/why-you-should-use-omega-squared.html

8. I should point out that there are other functions in R for running multiple comparisons, and at least one of them works this way: the TukeyHSD() function takes an aov object as its input, and outputs Tukey’s “honestly significant difference” tests. I talk about Tukey’s HSD in Chapter 16.

9. If you do have some theoretical basis for wanting to investigate some comparisons but not others, it’s a different story. In those circumstances you’re not really running “post hoc” analyses at all: you’re making “planned comparisons”. I do talk about this situation later in the book (Section 16.9), but for now I want to keep things simple.

10. It’s worth noting in passing that not all adjustment methods try to do this. What I’ve described here is an approach for controlling “family wise Type I error rate”. However, there are other post hoc tests seek to control the “false discovery rate”, which is a somewhat different thing.

11. There’s also a function called p.adjust() in which you can input a vector of raw $$p$$-values, and it will output a vector of adjusted $$p$$-values. This can be handy sometimes. I should also note that more advanced users may wish to consider using some of the tools provided by the multcomp package.

12. Note that neither of these figures has been tidied up at all: if you want to create nicer looking graphs it’s always a good idea to use the tools from Chapter 6 to help you draw cleaner looking images.

13. A technical term.