15 One-Way ANOVA Test in R

15.1 The one-way ANOVA in R

The one-way analysis of variance (ANOVA), also known as one-factor ANOVA or single factor ANOVA, is an extension of independent two-samples t-test for comparing means in a situation where there are more than two groups. In one-way ANOVA, the data is organized into several groups based on a single grouping variable (also called factor variable).

One might ask why we need a new test if we could just use the t-test to test all possible pairs of means. First of all, if you had \(k\) groups, that would mean you have to run \(\frac{k(k-1)}{2}\) different tests. Second, as we increase the number of tests we run we increase the risk of finding a difference by chance alone.

ANOVA is based on analyzing the ratio of the variance between samples to the variance within the samples. \[F=\frac{ \text{variance between samples}}{\text{variance within samples}}\] This statistic follows an F-distribution with numerator degrees of freedom = k-1 and denominator degrees of freedom = N-k, where k = number of groups (4 in the example below) and N = total number of samples in all groups combined (20 in the example). This is what the density function of an F-distribution with k-1 = 3 and N-k = 16 degrees of freedom looks like:
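
We can draw this density ourselves with base R's curve() and df() (the F density); the degrees of freedom 3 and 16 match the example below.

```r
# Density of the F-distribution with df1 = 3 and df2 = 16,
# the degrees of freedom used in the potato example below
curve(df(x, df1 = 3, df2 = 16), from = 0, to = 6,
      xlab = "F", ylab = "density",
      main = "F-distribution, df1 = 3, df2 = 16")
# The 5% critical value: we reject the null hypothesis when F exceeds it
abline(v = qf(0.95, df1 = 3, df2 = 16), lty = 2)
```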

Notation:

\(\bar{\bar{x}}\) = mean of all sample values combined

k = number of groups you have

\(n_i\) = number of values in group i

\(\overline{x_i}\) = mean of the values in group i

\(s_i^2\) = variance of the values in group i

We define: \[ \small MSB = \text{total weighted sum of squares difference between sample means and overall mean} \] \[ \small = \text{variance between} = \frac{\sum_{i=1}^{k} n_i (\overline{x}_i - \bar{\bar{x}})^2}{k-1}\] and

\[\small MSW = \text{weighted average variance within samples} \] \[ \small = \text{variance within} = \frac{\sum_{i=1}^{k} (n_i-1) s_i^2}{\sum_{i=1}^{k} (n_i-1)} = \frac{\sum_{i=1}^k \sum_{j=1}^{n_i} (x_{i,j}-\overline{x_i})^2}{N-k}\] \(\small MSB\) and \(\small MSW\) stand for mean square between and mean square within. The test statistic \(F\) is then computed as \[ \small F=\frac{MSB}{MSW}.\]

Assumptions:

  • The samples are simple random samples.

  • The samples are independent of each other.

  • The data of each factor level are normally distributed. (This is a loose requirement; the test works reasonably well unless the population is very far from normal. We should really check this, though.)

  • These normal populations have a common variance. (This is also a loose requirement. It has been shown that, as long as the sample sizes are equal or nearly equal, the smallest and largest of the variances can vary by a factor of up to nine, though other sources set a threshold of a factor of two. We should really check this, too.)

  • The different samples are from populations that are categorized in only one way.
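
Both of these checks are easy to run in R. As a sketch (using the potato data defined below, and re-creating it here so the chunk runs on its own), shapiro.test() checks normality within each group and bartlett.test() tests for equal variances; both are base R functions.

```r
# Re-create the potato data used in the example below
yield <- c(19,16,19,19,18,21,20,25,17,22,25,21,24,28,22,24,16,18,18,22)
treatment <- rep(c("none","fertilized","irrigated","irrigated_fertilized"),
                 each = 5)
# Shapiro-Wilk normality test within each group (p-values)
by(yield, treatment, function(y) shapiro.test(y)$p.value)
# Bartlett's test for equal variances across groups
bartlett.test(yield ~ as.factor(treatment))
# Rough rule of thumb: ratio of largest to smallest group variance
max(tapply(yield, treatment, var)) / min(tapply(yield, treatment, var))
```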

ANOVA test hypotheses:

\(H_o\): The means of all the different groups are the same.

\(H_A\): At least one sample mean is not equal to the others.

A large test-statistic F would lead us to reject the assumption that all the means are the same.

Example: Potato yields per acre can vary between 16 and 28 tons per acre. Assume a farmer tests different management strategies and grows potatoes under four different treatments: fertilizer, irrigation, both, or none. Here I enter the data from 20 acres:

#entering the data
yield <- c(19,16,19,19,18,21,20,25,17,22,25,21,24,28,22,24,16,18,18,22)
treatment <- rep("none",5)
treatment[6:10] <- "fertilized"
treatment[11:15] <- "irrigated"
treatment[16:20] <- "irrigated_fertilized"
potato.data <- data.frame(yield, treatment)
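
Before computing anything, side-by-side boxplots give a quick visual first impression. A minimal sketch, re-creating the data frame so the chunk runs on its own:

```r
# Re-create the potato data and draw one boxplot per treatment
yield <- c(19,16,19,19,18,21,20,25,17,22,25,21,24,28,22,24,16,18,18,22)
treatment <- rep(c("none","fertilized","irrigated","irrigated_fertilized"),
                 each = 5)
potato.data <- data.frame(yield, treatment)
boxplot(yield ~ treatment, data = potato.data,
        ylab = "yield (tons per acre)", cex.axis = 0.7)
```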

I want to know if there is a difference in mean yield between the groups. We use the dplyr library to find the mean yield for each treatment.

library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.3.2
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
potato.data %>%
  group_by(treatment) %>%
  summarize(av_yield_per_acre = mean(yield))
#> # A tibble: 4 × 2
#>   treatment            av_yield_per_acre
#>   <chr>                            <dbl>
#> 1 fertilized                        21  
#> 2 irrigated                         24  
#> 3 irrigated_fertilized              19.6
#> 4 none                              18.2

It sure looks like there is a difference in mean yield, so let’s run a formal test. Here, the groups are of the same size, so the formulas for the variances simplify. We have
\(n_1 = n_2 = n_3 = n_4 = 5\)
\(k=4\)
so \[MSB = \frac{\sum_{i=1}^{k} n_i (\overline{x}_i - \overline{\overline{x}})^2}{k-1}=5 \cdot \frac{\sum_{i=1}^{4} (\overline{x}_i - \overline{\overline{x}})^2}{3} = 5 \cdot \text{variance of the }\overline{x_i} \] and

\[MSW = \frac{\sum_{i=1}^{k} (n_i-1) s_i^2}{\sum_{i=1}^{k} (n_i-1)} = 4 \cdot \frac{\sum_{i=1}^{4} s_i^2}{\sum_{i=1}^{4} 4} = \frac{ (s_1^2 + s_2^2 + s_3^2 + 2_4^2)}{4}\]

v1 <- var(potato.data$yield[which(potato.data$treatment=="none")])
v2 <- var(potato.data$yield[which(potato.data$treatment=="irrigated_fertilized")])
v3 <- var(potato.data$yield[which(potato.data$treatment=="fertilized")])
v4 <- var(potato.data$yield[which(potato.data$treatment=="irrigated")])
m1 <- mean(potato.data$yield[which(potato.data$treatment=="none")])
m2 <- mean(potato.data$yield[which(potato.data$treatment=="irrigated_fertilized")])
m3 <- mean(potato.data$yield[which(potato.data$treatment=="fertilized")])
m4 <- mean(potato.data$yield[which(potato.data$treatment=="irrigated")])

variance.between <- var(c(m1, m2, m3, m4))*(5)
variance.between
#> [1] 30.73333
variance.within <- (v1+v2+v3+v4)/4
variance.within
#> [1] 7.125

F.stat <- variance.between/variance.within

F.stat
#> [1] 4.31345

p.value <- 1-pf(F.stat,3,16)

p.value
#> [1] 0.02075182

Here the p-value is small-ish, less than 5%, so we reject the claim that all means are equal. At least one mean is different.
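
Equivalently, pf() can return the upper-tail probability directly, avoiding the 1 - pf(...) subtraction. A minimal sketch, plugging in the values computed above:

```r
# F statistic = variance between / variance within, from above
F.stat <- 30.73333 / 7.125
# Upper-tail area of the F(3, 16) distribution beyond F.stat
pf(F.stat, df1 = 3, df2 = 16, lower.tail = FALSE)
```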

The next chunk of code runs the built-in ANOVA function. Note that we first have to make sure the data is in two columns, one for x and one for y. We already did that when we created the data frame potato.data above. I am saving the results of our ANOVA as ‘potato.aov’.

potato.aov <- aov(potato.data$yield~potato.data$treatment)
summary(potato.aov)
#>                       Df Sum Sq Mean Sq F value Pr(>F)  
#> potato.data$treatment  3   92.2  30.733   4.313 0.0208 *
#> Residuals             16  114.0   7.125                 
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that the ANOVA test only tells us that there are some means that are different, not which one(s). We can however look at the data table and make an educated guess. Here, it sure seems that irrigated fields do better.

15.2 TUKEY HSD test

If the ANOVA test shows that there are statistically significant differences between means, we can use the Tukey HSD (Tukey Honest Significant Differences) test to find them. TukeyHSD is a post-hoc test, performed after ANOVA has confirmed that there is a significant difference between means. TukeyHSD basically performs multiple pairwise comparisons between the group means. Making multiple comparisons leads to an increased chance of making a false discovery, i.e. rejecting a null hypothesis that should not have been rejected (a type 1 error), which is why we should use ANOVA first.
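
To see how quickly the error rate inflates, a back-of-the-envelope calculation: with four groups there are six pairwise tests, and if each is run at the 5% level (and, as a simplifying assumption, the tests were independent), the chance of at least one false positive is

```r
k <- 4                       # number of groups
n.tests <- k * (k - 1) / 2   # number of pairwise comparisons: 6
# Probability of at least one type 1 error across all tests,
# assuming independence and a 5% level for each test
1 - (1 - 0.05)^n.tests
#> [1] 0.2649081
```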

The function TukeyHSD() takes the fitted ANOVA as an argument.

TukeyHSD(potato.aov)
#>   Tukey multiple comparisons of means
#>     95% family-wise confidence level
#> 
#> Fit: aov(formula = potato.data$yield ~ potato.data$treatment)
#> 
#> $`potato.data$treatment`
#>                                 diff        lwr        upr
#> irrigated-fertilized             3.0  -1.829957  7.8299574
#> irrigated_fertilized-fertilized -1.4  -6.229957  3.4299574
#> none-fertilized                 -2.8  -7.629957  2.0299574
#> irrigated_fertilized-irrigated  -4.4  -9.229957  0.4299574
#> none-irrigated                  -5.8 -10.629957 -0.9700426
#> none-irrigated_fertilized       -1.4  -6.229957  3.4299574
#>                                     p adj
#> irrigated-fertilized            0.3194523
#> irrigated_fertilized-fertilized 0.8398187
#> none-fertilized                 0.3761527
#> irrigated_fertilized-irrigated  0.0807519
#> none-irrigated                  0.0161016
#> none-irrigated_fertilized       0.8398187

You can see here that there is a statistically significant difference only between irrigated and non-irrigated fields. Note how the 95% confidence intervals for the differences between the treatment methods support this result. However, whether a 5.8 ton increase is significant (vs. statistically significant) requires an analysis of the cost of irrigation vs. the improved yield.

Another example, the cabbage data from the MASS library. I want to see if there is a statistically significant difference in the mean Vitamin C content or mean head weight depending on planting date. The steps are

  1. Eyeball to see if there might be a difference

  2. Run ANOVA to confirm that there is indeed a statistically significant difference (or show that there isn’t)

  3. If ANOVA comes back as significant, run Tukey to find where the difference might be.

Remember the difference between statistically significant and significant!!

First we read in the data and have a look at the variables we are dealing with.

library(MASS)
#> 
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#> 
#>     select
str(cabbages)
#> 'data.frame':    60 obs. of  4 variables:
#>  $ Cult  : Factor w/ 2 levels "c39","c52": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ Date  : Factor w/ 3 levels "d16","d20","d21": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ HeadWt: num  2.5 2.2 3.1 4.3 2.5 4.3 3.8 4.3 1.7 3.1 ...
#>  $ VitC  : int  51 55 45 42 53 50 50 52 56 49 ...

Step 1:

cabbages %>%
  group_by(Date) %>%
  summarize(weight = mean(HeadWt), vitamin = mean(VitC))
#> # A tibble: 3 × 3
#>   Date  weight vitamin
#>   <fct>  <dbl>   <dbl>
#> 1 d16     2.72    56.4
#> 2 d20     2.96    54.2
#> 3 d21     2.10    63.3

It looks like there might be differences in mean.

Step 2: Run ANOVA

First for the head weight

weight.aov <- aov(cabbages$HeadWt~cabbages$Date)
summary(weight.aov)
#>               Df Sum Sq Mean Sq F value  Pr(>F)   
#> cabbages$Date  2   7.71   3.853   5.745 0.00533 **
#> Residuals     57  38.23   0.671                   
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Next for the vitamin C content

vit.aov <- aov(cabbages$VitC ~ cabbages$Date)
summary(vit.aov)
#>               Df Sum Sq Mean Sq F value  Pr(>F)   
#> cabbages$Date  2    909   454.6    5.05 0.00957 **
#> Residuals     57   5132    90.0                   
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looking at the p-values, we can say that the differences are statistically significant.

Step 3: Tukey to find the differences

For weight:

TukeyHSD(weight.aov)
#>   Tukey multiple comparisons of means
#>     95% family-wise confidence level
#> 
#> Fit: aov(formula = cabbages$HeadWt ~ cabbages$Date)
#> 
#> $`cabbages$Date`
#>           diff        lwr          upr     p adj
#> d20-d16  0.235 -0.3882202  0.858220165 0.6378918
#> d21-d16 -0.615 -1.2382202  0.008220165 0.0538386
#> d21-d20 -0.850 -1.4732202 -0.226779835 0.0049415
TukeyHSD(weight.aov, conf.level = 0.90)
#>   Tukey multiple comparisons of means
#>     90% family-wise confidence level
#> 
#> Fit: aov(formula = cabbages$HeadWt ~ cabbages$Date)
#> 
#> $`cabbages$Date`
#>           diff        lwr         upr     p adj
#> d20-d16  0.235 -0.3074101  0.77741009 0.6378918
#> d21-d16 -0.615 -1.1574101 -0.07258991 0.0538386
#> d21-d20 -0.850 -1.3924101 -0.30758991 0.0049415

Only the difference between d21 and d20 is statistically significant at the 95% confidence level. At 90%, the difference between d21 and d16 is also significant.
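
The confidence intervals returned by TukeyHSD() also have a base R plot method, which is often easier to read than the table. A sketch, refitting the model with a formula-plus-data call so the chunk runs on its own:

```r
library(MASS)  # for the cabbages data
weight.aov <- aov(HeadWt ~ Date, data = cabbages)
# Intervals whose range excludes zero correspond to statistically
# significant pairwise differences at the family-wise 95% level
plot(TukeyHSD(weight.aov))
```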

For vitamin C:

TukeyHSD(vit.aov)
#>   Tukey multiple comparisons of means
#>     95% family-wise confidence level
#> 
#> Fit: aov(formula = cabbages$VitC ~ cabbages$Date)
#> 
#> $`cabbages$Date`
#>          diff       lwr       upr     p adj
#> d20-d16 -2.25 -9.470345  4.970345 0.7349269
#> d21-d16  6.90 -0.320345 14.120345 0.0639292
#> d21-d20  9.15  1.929655 16.370345 0.0095834
TukeyHSD(vit.aov, conf.level = 0.90)
#>   Tukey multiple comparisons of means
#>     90% family-wise confidence level
#> 
#> Fit: aov(formula = cabbages$VitC ~ cabbages$Date)
#> 
#> $`cabbages$Date`
#>          diff        lwr       upr     p adj
#> d20-d16 -2.25 -8.5341163  4.034116 0.7349269
#> d21-d16  6.90  0.6158837 13.184116 0.0639292
#> d21-d20  9.15  2.8658837 15.434116 0.0095834

Again, only the difference between d21 and d20 is statistically significant at the 95% confidence level. At 90%, the difference between d21 and d16 is also significant.

15.3 Assignment

Suppose there are six different dogs representing six different breeds (A, B, C, D, E, F) competing for best in show at a dog show, and a panel of 15 judges (J1 thru J15) is evaluating them on a scale from 0 to 7.
I want you to consider two questions:
1. Is there a difference in the mean average scores for the six different dogs?
2. Is there a difference between the mean scores each judge gives out?
Run the code chunk below to enter the data.

score <-
 c(5,7,3,6,6,4,
 4,2,6,5,4,3,
 5,3,6,4,7,6,
 5,6,0,2,1,6,
 7,4,0,1,1,2,
 7,7,0,2,6,3,
 6,6,0,5,6,3,
 4,6,1,5,4,1,
 6,4,0,5,7,6,
 7,7,0,6,6,2,
 2,4,0,1,2,2,
 5,7,4,6,7,4,
 7,5,0,4,6,2,
 4,5,0,4,3,7,
 6,6,3,6,7,8)
dog <-  rep(c("A", "B", "C","D","E","F"), 15)
judge <- rep(c("J1","J2","J3","J4","J5","J6","J7","J8","J9","J10","J11","J12","J13","J14","J15"), rep(6,15))
dog.data <- data.frame(score, dog,judge)

Next, let's summarize the data. I am doing it for the dogs; you do it for the judges.

library(dplyr)
dog.data %>%
  group_by(dog) %>%
  summarize(avscore = mean(score))
#> # A tibble: 6 × 2
#>   dog   avscore
#>   <chr>   <dbl>
#> 1 A        5.33
#> 2 B        5.27
#> 3 C        1.53
#> 4 D        4.13
#> 5 E        4.87
#> 6 F        3.93

Looking at the output, it sure looks like dog C is unpopular.

Write \(\small H_o\) and \(\small H_a\) here:

Then run the ANOVA test and interpret. Make sure you use plain language. Finally, if the ANOVA finds a statistically significant difference, run a Tukey test as well and interpret.

Finally, discuss if this is an appropriate use of ANOVA and TUKEY.