Chapter 12 Differences Between Three or More Things (the ANOVA chapter)
12.1 Differences Between Things and Differences Within Things
Tufts University (“Tufts”) is an institution of higher learning in the Commonwealth of Massachusetts in the United States of America. Tufts has three campuses: a main campus which covers parts of each of the neighboring cities of Medford, MA and Somerville, MA; a medical center located in Boston, MA; and a veterinary school located in Grafton, MA. Here’s a handy map:
Imagine that somebody asked you – given your access to the map pictured above – to describe the distances between Tufts buildings. What if they asked you if they could comfortably walk or use a wheelchair to get between any two of the buildings? Your answer likely would be that it depends: if one wanted to travel between two buildings that are each on the Medford/Somerville campus, or each on the Boston campus, or each on the Grafton campus, then those distances are small and easily covered without motorized vehicles. If one wanted to travel between campuses, then, yeah, one would need a car or at least a bicycle and probably a bottle of water, too.
We can conceive of the distances between the nine locations marked on the map as if they were just nine unrelated places: some happen to be closer to each other, some happen to be more distant from each other. Or, we can conceive of the distances as a function of campus membership: the distances are small within each campus and the distances are relatively large between campuses. In that conception, there are two sources of variance in the distances: within-campus variance and between-campus variance, with the former much smaller than the latter. If one were trying to locate a Tufts building, getting the campus correct is much more important than getting the location within each campus correct: if you had an appointment at a building on the Medford/Somerville campus and you went to the wrong Medford/Somerville campus building, there might be time to recover and find the right building in time to make your appointment; if you went to, say, the Grafton campus, you should probably call, explain, apologize, and try to reschedule.
The example of campus buildings (hopefully) illustrates the basic logic of the Analysis of Variance, more frequently known by the acronym ANOVA. We use ANOVA when there are three or more groups being compared at a time: we could compare pairs of groups using \(t\)-tests, but this would A. inflate the type-I error rate (the more tests you run, the more chances of false alarms),207 and B. fail to give a clear analysis of the omnibus effect of a factor along which all of the groups differ.
As noted in the introduction, ANOVA is based on comparisons of sources of variance. For example, the structure of the simplest ANOVA design – the one-way, independent-groups ANOVA – creates a comparison between the variance in the data that comes from between-groups differences and the variance in the data that comes from within-groups differences: the question is whether the differences in the data that are attributable to the group membership of each datapoint are generally more important than the differences that just occur naturally (and are observable within each group). The logic of ANOVA is represented visually in Figure 12.1 and Figure 12.2. Figure 12.1 is a representation of what happens when the populations of interest are either the same or close to it: there the variance between data points in different groups is the same as the variance between data points within each group, and there is no variance added by group membership.
In Figure 12.2, group membership makes a substantial contribution to the overall variance of the data: the variance within each group is the same as in the situation depicted in Figure 12.1, but the differences between the groups slide the populations up and down the \(x\)-axis, spreading out the data as they go.
12.1.1 The \(F\)-ratio
The central inference of ANOVA is on the population-level between variance, labeled as \(\sigma^2_{between}\) in Figure 12.1 and Figure 12.2. The null hypothesis for the one-way between-groups ANOVA is:
\[H_0: \sigma^2_{between}=0\] and the alternative hypothesis is:
\[H_1: \sigma^2_{between}>0\] 208 The overall variance in the observed data can be decomposed into the sum of the two sources of variance between and within. If the null hypothesis is true and there is no contribution of between-groups differences to the overall variance – which is what is meant by \(\sigma^2_{between}=0\) – then all of the variance is attributable to the within-group differences: \(\sigma^2_{between}+\sigma^2_{within}=\sigma^2_{within}\). If, however, the null hypothesis is false and there is any contribution of between-groups differences to the overall variance – \(\sigma^2_{between}>0\) – then there will be variance beyond what is attributable to within-groups differences: \(\sigma^2_{between}+\sigma^2_{within}>\sigma^2_{within}\).
Thus, the ANOVA compares \(\sigma^2_{between}+\sigma^2_{within}\) and \(\sigma^2_{within}\). More specifically, it compares \(n\sigma^2_{between}+\sigma^2_{within}\) and \(\sigma^2_{within}\): the influence of \(\sigma^2_{between}\) multiplies with every observation that it impacts (e.g., the influence of between-groups differences cause more variation if there are \(n=30\) observations per group than if there are \(n=3\) observations per group). The way that ANOVA makes this comparison is by evaluating ratios of between-groups variances and within-groups variances: on the population-level, if the ratio is equal to 1, then the two things are the same, if the ratio is greater than one, then the between-groups variance is at least a little bit bigger than 0. Of course, as we’ve seen before with inference, if we estimate that ratio using sample data, any little difference isn’t going to impress us the way a hypothetical population-level little difference would: we’re likely going to need a sample statistic that is comfortably greater than 1.
When we estimate the between-groups plus within-groups variance and the within-groups variance using sample data, we calculate what is known as the \(F\)-ratio, also known as the \(F\)-statistic, or just plain \(F\).209 The cumulative likelihood of observed \(F\)-ratios is evaluated in terms of the \(F\)-distribution. When we calculate an \(F\)-ratio, we are calculating the ratio between two variances: between variance and within variance. As we know from the chapter on probability distributions, variances are modeled by \(\chi^2\) distributions: \(\chi^2\) distributions comprise sums of squares of normally-distributed values, and given the normality assumption, the numerator of the variance calculation \(\sum(x-\bar{x})^2\) is just such a sum of squares (and dividing by the constant \(n-1\) doesn’t disqualify it). Developed to model the ratio of variances, the \(F\)-distribution is literally the ratio of two (scaled) \(\chi^2\) distributions. The \(F\)-distribution has two parameters: \(df_{numerator}\) and \(df_{denominator}\), which correspond to the \(df\) of the \(\chi^2\) distribution in the numerator and the \(df\) of the denominator \(\chi^2\) distribution by which it is divided to get the \(F\)-distribution (see illustration in Figure 12.3).
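As a quick illustration of that construction, here is a minimal simulation sketch (the specific degrees of freedom, 3 and 16, are chosen only to match the example later in this chapter): dividing each \(\chi^2\) variable by its \(df\) and taking the ratio reproduces the quantiles of the corresponding \(F\)-distribution.

set.seed(1)
num <- rchisq(100000, df = 3) / 3      # a chi-square variable divided by its df
denom <- rchisq(100000, df = 16) / 16  # an independent chi-square divided by its df
quantile(num / denom, c(0.50, 0.95))   # simulated quantiles of the ratio...
qf(c(0.50, 0.95), df1 = 3, df2 = 16)   # ...are very close to the theoretical F(3, 16) quantiles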
The observed \(F\)-ratio is a sample-level estimate of the population-level ratio \(\frac{n\sigma^2_{between}+\sigma^2_{within}}{\sigma^2_{within}}\). If the sample data indicate no difference between the observed variance caused by a combination of between-groups plus within-groups variance and the variance observed within-groups, then \(F\approx 1\) and we will continue to assume \(H_0\). Otherwise, \(F>>1\), and we will reject \(H_0\) in favor of the \(H_1\) that there is some contribution of the differences between groups to the overall variance in the data.
Because of the assumptions involved in constructing the \(F\)-distribution – that at the population-level the residuals are normally distributed and the within-groups variance is the same across groups – the assumptions of normality and homoscedasticity apply to all flavors of ANOVA. For some of those flavors, there will be additional assumptions that we will discuss when we get to them, but normality and homoscedasticity are universally assumed for classical ANOVAs.
Conceptually, the ANOVA logic applies to all flavors of ANOVA. The tricky part, of course, is how to calculate the \(F\)-ratio, and that will differ across flavors of ANOVA. There are also a number of other features of ANOVA-related analysis that we will discuss, including estimating effect sizes, making more specific between-group comparisons post hoc, a couple of nonparametric alternatives, and the Bayesian counterpart to the classical ANOVA (don’t get too excited: it’s just a “Bayesian ANOVA”). The flavors of ANOVA are more frequently (but less colorfully) described as different ANOVA models or ANOVA designs. We will start our ANOVA journey with the one-way between-groups ANOVA, describing it and its related features before proceeding to more complex models.
12.2 One-Way Independent-Groups ANOVA models
12.2.1 The Model
To use a one-way between-groups ANOVA is to conceive of dependent-variable scores as being the product of a number of different elements. One is the grand mean \(\mu\), which is the baseline state-of-nature average value in a population. Each number is then also influenced by some condition that causes some numbers to be greater than others. In the case of the one-way between-groups model, the conditions are the levels \(j\) of the factor \(\alpha\)210. For example, if the dependent variable is the width of leaves on plants, the factor \(\alpha\) in question is the amount of time exposed to sunlight, and the levels of the factor are 8 hours/day (\(j=1\)), 10 hours/day (\(j=2\)), and 12 hours/day (\(j=3\)), then \(\alpha_1\), \(\alpha_2\), and \(\alpha_3\) are the effects on leaf width of those three exposure levels, and the between-groups difference is the difference caused by the \(j\) different levels of the between-groups factor \(\alpha\). The last element is the within-groups variance, which falls into the more general categories of error or residuals. It is the naturally-occurring variation within each condition. To continue the example of the plants and the sunlight: not every plant that is exposed to the same sunlight for the same duration will have leaves that are exactly the same width. In this model, the within-groups influence is symbolized by \(\epsilon_{ij}\), indicating that the error (\(\epsilon\)) exists within each individual observation (\(i\)) and at each level of the treatment (\(j\)).
In a between-groups design, each observation comes from an independent group, that is, each observation comes from one and only one level of the treatment.
Could we account for at least some of the within-groups variation by identifying individual differences? Sure, and that’s the point of within-groups designs, but it’s not possible unless we observe each plant under different sunlight conditions, which is not part of a between-groups design. For more on when we want to (or have to) use between-groups designs instead of within-groups designs, please refer to the chapter on differences between two things.
Thus, each value \(y\) of the dependent variable is said to be determined by the sum of the grand mean \(\mu\), the effect of the factor level \(\alpha_j\), and the individual contribution of error \(\epsilon_{ij}\); and because each \(y\) is a particular observation \(i\) within a particular factor level \(j\), it gets the subscript \(ij\):
\[y_{ij}=\mu+\alpha_j+\epsilon_{ij}\] If you think that equation vaguely resembles a regression model, that’s because it is:
ANOVA is a special case of multiple regression where there is equal \(n\) per factor-level.
ANOVA is regression-based much as the \(t\)-test can be expressed as a regression analysis. In fact, while we are making connections, the \(t\)-test is a special case of ANOVA. If, for example, you enter data with two levels of a between-groups factor, you will get precisely the same results – with an \(F\) statistic instead of a \(t\) statistic – as you would for an independent-groups \(t\)-test.211 Furthermore, we can show that the \(t\)-statistic and the \(F\)-statistic are functionally equivalent to each other: \(F\) is simply \(t^2\), and the two-tailed cumulative probability beyond \(|t|\) in a \(t\)-distribution is the same as the upper-tail probability beyond \(t^2\) in the corresponding \(F\)-distribution, meaning that if one were analyzing a factor with just two levels, one would get the same \(p\)-value for a \(t\)-test as one would with an ANOVA on the same data. You could take my word for that (which, to be fair, I kind of ask you to do all over this book), but there is also a mathematical proof included as an appendix to this chapter.
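A quick check of that equivalence, sketched with made-up data (any two-group dataset will do): the equal-variance \(t\)-test’s \(t^2\) matches the ANOVA’s \(F\), and the two \(p\)-values agree.

set.seed(2)
y <- rnorm(20)                                      # made-up data, no real effect needed
g <- factor(rep(c("group1", "group2"), each = 10))
t.out <- t.test(y ~ g, var.equal = TRUE)            # pooled-variance (equal-variance) t-test
aov.out <- summary(aov(y ~ g))[[1]]                 # one-way ANOVA with a two-level factor
c(t.squared = unname(t.out$statistic)^2, F = aov.out$"F value"[1])
c(p.from.t = t.out$p.value, p.from.F = aov.out$"Pr(>F)"[1])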
12.2.2 The ANOVA table
The results of ANOVA can be represented by an ANOVA table. The standard ANOVA table includes the following elements:
Source (or Source of Variance, or SV): the leftmost column represents the decomposition of the elements that influence values of the dependent variable (except the grand mean \(\mu\)). For the one-way between-groups ANOVA, the listed sources are typically between groups (or just between) and within groups (or just within), with an entry for total variance often included. To increase the generality of the table – which will be helpful because we are about to talk about a whole bunch of ANOVA tables – we can call the between variance \(A\) to indicate that it is the variance associated with factor \(A\), and we can call the within variance the error.
Degrees of Freedom (\(df\)): the degrees of freedom associated with each source. These are the degrees of freedom that are associated with the variance calculation (i.e., the variance denominator) for each source of variance.
Sums of Squares (\(SS\)): The numerator of the calculation for the variance (\(\sum(x-\bar{x})^2\)) associated with each source.
Mean Squares (\(MS\)): The estimate of the variance for each source. It is equal to the ratio of \(SS\) and \(df\) that are respectively associated with each source. Note: \(MS\) are not calculated for the total because there is no reason whatsoever to care about the total \(MS\).
\(F\): The \(F\)-ratio. The \(F\)-ratio is the test statistic for the ANOVA. For the one-way between-groups design (and also for the one-way within-groups design but we’ll get to that later), there is only one \(F\)-ratio, and it is the ratio of the between-groups \(MS\) (\(MS_{A}\)) to the within-groups \(MS\) (\(MS_e\)).
In addition to the five columns described above, there is a sixth column that is extremely optional to include:
- Expected Mean Squares (\(EMS\)): a purely algebraic, non-numeric representation of the population-level variances associated with each source. As noted earlier, the population-level comparison represented with sample data by the \(F\)-statistic is \(\frac{n\sigma^2_{between}+\sigma^2_{within}}{\sigma^2_{within}}\), which can somewhat more simply be annotated \(\frac{n\sigma^2_\alpha+\sigma^2_\epsilon}{\sigma^2_\epsilon}\). In other words, it is the ratio of the sum of between-groups variance – which is multiplied by the number of observations per group – plus the within-groups variance divided by the within-groups variance. The \(EMS\) for each group lists the terms that are represented by each source. For between-groups variance – or factor \(\alpha\) – the \(EMS\) is the numerator of the population-level \(F\)-ratio \(n\sigma^2_\alpha+\sigma^2_\epsilon\). For the within-groups variance – or the error (\(\epsilon\)) – it is \(\sigma^2_\epsilon\). The \(EMS\) are helpful for two things: keeping track of how to calculate the \(F\)-ratio for a particular factor in a particular ANOVA model (admittedly, this isn’t really an issue for the one-way between-groups ANOVA since there is only one way to calculate an \(F\)-statistic and it’s relatively easy to remember it), and calculating the effect-size statistic \(\omega^2\). Both of those things can be accomplished without explicitly knowing the \(EMS\) for each source of variance, which is why the \(EMS\) column of the ANOVA table is eminently skippable (but, if you do need to include it – presumably at the behest of some evil stats professor – please keep in mind that there are no numbers to be calculated for \(EMS\): they are strictly algebraic formulae).
Here is the ANOVA table – with \(EMS\) – for the one-way between-groups ANOVA model with formula guides for how to calculate the values that go in each cell:
\(SV\) | \(df\) | \(SS\) | \(MS\) | \(F\) | \(EMS\) |
---|---|---|---|---|---|
\(A\) | \(j-1\) | \(n\sum(y_{\bullet j}-y_{\bullet\bullet})^2\) | \(SS_A/df_A\) | \(MS_A/MS_e\) | \(n\sigma^2_\alpha+\sigma^2_\epsilon\) |
\(Error\) | \(j(n-1)\) | \(\sum(y_{ij}-y_{\bullet j})^2\) | \(SS_e/df_e\) | | \(\sigma^2_\epsilon\) |
\(Total\) | \(jn-1\) | \(\sum(y_{ij}-y_{\bullet\bullet})^2\) |
In the table above, the bullet (\(\bullet\)) replaces the subscript \(i\), \(j\), or both to indicate that the \(y\) values are averaged across each level of the subscript being replaced. \(y_{\bullet j}\) indicates the average of all of the \(y\) values in each \(j\) averaged across all \(i\) observations and \(y_{\bullet \bullet}\) is the grand mean of all of the \(y\) values averaged across all \(i\) observations in each of \(j\) conditions. Thus, \(SS_A\) is equal to the sum of the squared differences between each group mean \(y_{\bullet j}\) and the grand mean \(y_{\bullet \bullet}\) times the number of observations in each group \(n\), \(SS_e\) is equal to the sum of the sums of the squared differences between each observation \(y_{ij}\) and the mean \(y_{\bullet j}\) of the group to which it belongs, and \(SS_{total}\) is equal to the sum of the squared differences between each observation \(y_{ij}\) and the grand mean \(y_{\bullet \bullet}\).
In the context of ANOVA, \(n\) always refers to the number of observations per group. If one needs to refer to all of the observations (e.g. all the participants in an experiment that uses ANOVA to analyze the data), \(N\) is preferred to (hopefully) avoid confusion. In the classic ANOVA design, \(n\) is always the same between all groups: if \(n\) is not the same between groups, then technically the analysis is not an ANOVA. That’s not a dealbreaker for statistical analysis, though: if \(n\) is unequal between groups, the data can be analyzed in a regression context with the general linear model and the procedure will return output that is pretty-much-identical to the ANOVA table. The handling of unequal \(n\) is beyond the scope of this course, but I assure you, it’s not that bad, and you’ll probably get to it next semester.
12.2.3 Calculation Example for One-Way Independent-Groups ANOVA
Please imagine that the following data are observed in a between-groups design:
Condition 1 | Condition 2 | Condition 3 | Condition 4 |
---|---|---|---|
-3.10 | 7.28 | 0.12 | 8.18 |
0.18 | 3.06 | 5.51 | 9.05 |
-0.72 | 4.74 | 5.72 | 11.21 |
0.09 | 5.29 | 5.93 | 7.31 |
-1.66 | 7.88 | 6.56 | 8.83 |
Grand Mean = 4.57
12.2.3.1 Calculating \(df\)
There are \(j=4\) levels of factor \(\alpha\), thus, \(df_A=j-1=3\). There are \(n=5\) observations in each group, so \(df_e=j(n-1)=4(4)=16\). The total \(df_{total}=jn-1=(4)(5)-1=19\). Note that \(df_A+df_e=df_{total}\), and that \(df_{total}=N-1\) (or the total number of observations minus 1). Both of those facts will be good ways to check your \(df\) math as models become more complex.
12.2.3.2 Calculating Sums of Squares
\(SS_A\) is the sum of squared deviations between the level means and the grand mean multiplied by the number of observations in each level. Another way of saying that is that for each observation, we take the mean of the level to which that observation belongs, subtract the grand mean, square the difference, and sum all of those squared differences. Because \(n\) is the same across groups by the ANOVA definition, it makes no difference whether we multiply the squared differences between the group means and the grand mean by \(n\) or add up the squared difference between the group mean and the grand mean once for every observation; I happen to find the latter more visually convincing for pedagogical purposes, as in the table below:
Condition 1 | \((y_{\bullet 1}-y_{\bullet \bullet})^2\) | Condition 2 | \((y_{\bullet 2}-y_{\bullet \bullet})^2\) | Condition 3 | \((y_{\bullet 3}-y_{\bullet \bullet})^2\) | Condition 4 | \((y_{\bullet 4}-y_{\bullet \bullet})^2\) |
---|---|---|---|---|---|---|---|
-3.10 | 31.49 | 7.28 | 1.17 | 0.12 | 0.04 | 8.18 | 18.89 |
0.18 | 31.49 | 3.06 | 1.17 | 5.51 | 0.04 | 9.05 | 18.89 |
-0.72 | 31.49 | 4.74 | 1.17 | 5.72 | 0.04 | 11.21 | 18.89 |
0.09 | 31.49 | 5.29 | 1.17 | 5.93 | 0.04 | 7.31 | 18.89 |
-1.66 | 31.49 | 7.88 | 1.17 | 6.56 | 0.04 | 8.83 | 18.89 |
The sum of the values in the shaded columns is 257.94: that is the value of \(SS_A\).
The \(SS_e\) for this model – or, the within-groups variance – is the sum across levels of factor \(A\) of the sums of the squared differences between each \(y_{ij}\) value of the dependent variable and the mean \(y_{\bullet j}\) of the level to which it belongs:
Condition 1 | \((y_{i1}-y_{\bullet j})^2\) | Condition 2 | \((y_{i 2}-y_{\bullet j})^2\) | Condition 3 | \((y_{i 3}-y_{\bullet j})^2\) | Condition 4 | \((y_{i 4}-y_{\bullet j})^2\) |
---|---|---|---|---|---|---|---|
-3.10 | 4.24 | 7.28 | 2.66 | 0.12 | 21.60 | 8.18 | 0.54 |
0.18 | 1.49 | 3.06 | 6.71 | 5.51 | 0.55 | 9.05 | 0.02 |
-0.72 | 0.10 | 4.74 | 0.83 | 5.72 | 0.91 | 11.21 | 5.26 |
0.09 | 1.28 | 5.29 | 0.13 | 5.93 | 1.35 | 7.31 | 2.58 |
-1.66 | 0.38 | 7.88 | 4.97 | 6.56 | 3.21 | 8.83 | 0.01 |
The sum of the values in the shaded columns is 58.82: that is \(SS_e\).
\(SS_{total}\) is the total sums of squares of the dependent variable \(y\):
\[\sum_{i=1}^N(y_{ij}-y_{\bullet \bullet})^2\] To calculate \(SS_{total}\), we sum the squared differences between each observed value \(y_{ij}\) and the grand mean \(y_{\bullet \bullet}\):
Condition 1 | \((y_{i1}-y_{\bullet \bullet})^2\) | Condition 2 | \((y_{i 2}-y_{\bullet \bullet})^2\) | Condition 3 | \((y_{i 3}-y_{\bullet \bullet})^2\) | Condition 4 | \((y_{i 4}-y_{\bullet \bullet})^2\) |
---|---|---|---|---|---|---|---|
-3.10 | 58.83 | 7.28 | 7.34 | 0.12 | 19.80 | 8.18 | 13.03 |
0.18 | 19.27 | 3.06 | 2.28 | 5.51 | 0.88 | 9.05 | 20.07 |
-0.72 | 27.98 | 4.74 | 0.03 | 5.72 | 1.32 | 11.21 | 44.09 |
0.09 | 20.07 | 5.29 | 0.52 | 5.93 | 1.85 | 7.31 | 7.51 |
-1.66 | 38.81 | 7.88 | 10.96 | 6.56 | 3.96 | 8.83 | 18.15 |
The sum of the shaded values is 316.76, which is \(SS_{total}\). It is also equal to \(SS_A\)+\(SS_e\), which is important for two reasons:
1. It means we did the math right, and
2. The total sums of squares of the dependent variable is completely broken down into the sums of squares associated with the between-groups factor and the sums of squares associated with within-groups variation, with no overlap between the two. The fact that the total variation is exactly equal to the sum of the variation contributed by the two sources demonstrates the principle of orthogonality: two or more sources do not account for the same variance because they are uncorrelated with each other.
Orthogonality is an important statistical concept that shows up in multiple applications, including one of the post hoc tests to be discussed below and, critically, all ANOVA models. In every ANOVA model, the variance contributed to the dependent variable will be broken down orthogonally between each of the sources, meaning that \(SS_{total}\) will always be precisely equal to the sum of the \(SS\) for all sources of variance for each model.
12.2.3.3 Calculating Mean Squares and \(F\)
As noted in the ANOVA table above, the mean squares for each source of variance is the ratio of the sums of squares associated with that source divided by the degrees of freedom for that source. For the example data:
\[MS_A=\frac{SS_A}{df_A}=\frac{257.94}{3}=85.98\] \[MS_e=\frac{SS_e}{df_e}=\frac{58.82}{16}=3.68\]
The observed \(F\)-ratio is the ratio of the \(MS_A\) to \(MS_e\), and is annotated with the degrees of freedom associated with the numerator of that ratio \(df_{numerator}\) (usually abbreviated \(df_{num}\)) and degrees of freedom associated with the denominator of that ratio \(df_{denominator}\) (usually abbreviated \(df_{denom}\)) in parentheses following the letter \(F\):
\[F(3, 16)=\frac{MS_A}{MS_e}=\frac{85.98}{3.68}=23.39\]
12.2.3.4 There is nothing to calculate for \(EMS\)
The EMS for this model are simply \(n\sigma^2_\alpha + \sigma^2_\epsilon\) for factor \(A\) and \(\sigma^2_\epsilon\) for the error.
Now we are ready to fill in our ANOVA table.
12.2.3.5 Example ANOVA Table and Statistical Inference
Here is the ANOVA table for the example data:
\(SV\) | \(df\) | \(SS\) | \(MS\) | \(F\) | \(EMS\) |
---|---|---|---|---|---|
\(A\) | 3 | 257.94 | 85.98 | 23.39 | \(n\sigma^2_\alpha+\sigma^2_\epsilon\) |
\(Error\) | 16 | 58.82 | 3.68 | | \(\sigma^2_\epsilon\) |
\(Total\) | 19 | 316.76 |
The \(F\)-ratio is evaluated in the context of the \(F\)-distribution associated with the observed \(df_{num}\) and \(df_{denom}\). For our example data, the \(F\)-ratio is 23.39: the cumulative likelihood of observing an \(F\)-statistic of 23.39 or greater given \(df_{num}=3\) and \(df_{denom}=16\) – the area under the \(F\)-distribution curve with those specific parameters at or beyond 23.39 – is:
## [1] 4.30986e-06
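That tail probability can be reproduced (up to rounding of the \(F\)-statistic) with the \(F\) cumulative distribution function in base R, pf():

pf(23.39, df1 = 3, df2 = 16, lower.tail = FALSE)  # upper-tail area beyond the observed F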
Since \(4.3\times 10^{-6}\) is smaller than any reasonable value of \(\alpha\), we will reject the null hypothesis that \(\sigma^2_\alpha=0\) in favor of the alternative hypothesis that \(\sigma^2_\alpha>0\): there is a significant effect of the between-groups factor.
12.2.3.6 ANOVA in R
There are several packages to calculate ANOVA results in R, but for most purposes, the base command aov() will do. The aov() command returns an object that can then be summarized with the summary() command wrapper to give the elements of the traditional ANOVA table. aov() expects a formula and data structure similar to what we used for linear regression models and for tests of homoscedasticity. Thus, it is best to put the data into long format. Using the example data from above:
# Condition1 through Condition4 hold the observations from the example table above
Condition1<-c(-3.10, 0.18, -0.72, 0.09, -1.66)
Condition2<-c(7.28, 3.06, 4.74, 5.29, 7.88)
Condition3<-c(0.12, 5.51, 5.72, 5.93, 6.56)
Condition4<-c(8.18, 9.05, 11.21, 7.31, 8.83)
DV<-c(Condition1, Condition2, Condition3, Condition4)
Factor.A<-c(rep("Condition 1", length(Condition1)),
            rep("Condition 2", length(Condition2)),
            rep("Condition 3", length(Condition3)),
            rep("Condition 4", length(Condition4)))
one.way.between.groups.example.df<-data.frame(DV, Factor.A)
summary(aov(DV~Factor.A,
data=one.way.between.groups.example.df))
## Df Sum Sq Mean Sq F value Pr(>F)
## Factor.A 3 257.94 85.98 23.39 4.31e-06 ***
## Residuals 16 58.82 3.68
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
12.2.4 Effect Size Statistics
The effect-size statistics for ANOVA models are based on the same principle as \(R^2\): they are measures of the proportion of the overall variance data that is explained by the factor(s) in the model (as opposed to the error). In fact, one ANOVA-based effect-size statistic – \(\eta^2\) – is precisely equal to \(R^2\) for the one-way between-groups model (remember: ANOVA is a special case of multiple regression). Since they are proportions, the ANOVA-based effect-size statistics range from 0 to 1, with larger values indicating increased predictive power for the factor(s) in the model relative to the error.
We here will discuss two ANOVA-based effect-size statistics: \(\eta^2\) and \(\omega^2\).212 Estimates of \(\eta^2\) and of \(\omega^2\) tend to be extremely close: if you calculate both statistics for the same data, they will rarely disagree about the general size of an effect, and will typically be separated by only a couple of percentage points, or tenths of a point, or hundredths of a point.
Of \(\eta^2\) and \(\omega^2\), \(\eta^2\) is vastly easier to calculate, and as noted just above, not far off from estimates of \(\omega^2\) for the same data. \(\eta^2\) is also more widely used and understood. The only disadvantage is that \(\eta^2\) is a biased estimator of effect size: it’s consistently too big (one can test that with Monte Carlo simulations and/or synthetic data – using things like the rnorm() command in R – to check the relationship between the magnitude of a statistical estimate and what it should be given that the underlying distribution characteristics are known). That bias is usually more pronounced for smaller sample sizes: if you google “eta squared disadvantage,” the first few results will be about the estimation bias of \(\eta^2\) and the problems it has with relatively small \(n\). Again, it’s not a game-changing error, but if you are pitching your results to an audience of ANOVA-based effect-size purists, they may be looking for you to use the unbiased estimator \(\omega^2\).
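Here is a minimal sketch of that kind of check (the settings – four groups of \(n=5\), matching the example data above – are arbitrary): with no true group effect, \(\eta^2\) should be near 0, but its average estimate sits well above that.

set.seed(3)
n.sims <- 2000
eta.sq <- numeric(n.sims)
for (s in 1:n.sims) {
  DV.null <- rnorm(20)                     # 20 observations with NO true group effect
  grp <- factor(rep(1:4, each = 5))        # 4 groups of n = 5, as in the example data
  ss <- summary(aov(DV.null ~ grp))[[1]]$"Sum Sq"
  eta.sq[s] <- ss[1] / sum(ss)             # eta-squared = SS_A / SS_total
}
mean(eta.sq)   # comes out around 0.15 here even though the true effect is 0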
In addition to being unbiased, the other advantage that \(\omega^2\) has is that it is equivalent to the statistic known as the intraclass correlation (ICC), which is usually considered a measure of the psychometric construct of reliability. ANOVA-based effect size statistics – including both \(\eta^2\) and \(\omega^2\) – can also be interpreted as measuring reliability of a measure: if a factor or factors being measured make a big difference in the scores, then assessing dependent variables based on those factors should return reliable results over time. In the world of psychometrics – which I can tell you from lived experience is exactly as exciting and glamorous as it sounds – some analysts report ICC statistics and some report \(\omega^2\) statistics and a surprisingly large proportion don’t know that they are talking about the same thing.
12.2.4.1 Calculating \(\eta^2\)
For the one-way between-groups ANOVA model, \(\eta^2\) is a ratio of sums of squares: the sums of squares associated with the factor \(SS_A\) to the total sums of squares \(SS_{total}\). It is thus the proportion of observed sample-level variance associated with the factor:
\[\eta^2=\frac{SS_A}{SS_{total}}\] For the example data, the observed \(\eta^2\) is:
\[\eta^2=\frac{SS_A}{SS_{total}}=\frac{257.94}{316.76}=0.81\] which is a large effect (see the guidelines for interpreting ANOVA-based effect sizes below).
12.2.4.2 Calculating \(\omega^2\)
The \(\omega^2\) statistic is an estimate of the population-level variance that is explained by the factor(s) in a model relative to the population-level error.213 The theoretical population-level variances are known as population variance components: for the one-way between-groups ANOVA model, the two components are the variance from the independent-factor variable \(\sigma^2_\alpha\) and the variance associated with error \(\sigma^2_\epsilon\).
As noted above, the between-groups variance in the one-way model is represented by \(n\sigma^2_\alpha + \sigma^2_\epsilon\), and is estimated in the sample data by \(MS_A\); the within-groups variance is represented by \(\sigma^2_\epsilon\), and is estimated in the sample data by \(MS_e\): these are the terms listed for factor \(A\) and for the error, respectively, in the \(EMS\) column of the ANOVA table.
\(\omega^2\) is the ratio of the population variance component for factor A \(\sigma^2_\alpha\) divided by the sum of the population variance components \(\sigma^2_\alpha +\sigma^2_\epsilon\):
\[\omega^2=\frac{\sigma^2_\alpha}{\sigma^2_\alpha + \sigma^2_\epsilon}\]
As noted above, \(MS_e\) is an estimator of \(\sigma^2_\epsilon\) (\(\sigma^2_\epsilon\) gets a hat because it’s an estimate):
\[\hat{\sigma}^2_\epsilon=MS_e\]
That takes care of one population variance component. \(MS_A\) is an estimator of \(n\sigma^2_\alpha + \sigma^2_\epsilon\): if we subtract \(MS_e\) (being the estimate of \(\sigma^2_\epsilon\)) from \(MS_A\) and divide by \(n\), we get an estimate of the population variance component \(\sigma^2_\alpha\):
\[\hat{\sigma}^2_\alpha=\frac{MS_A-MS_e}{n}\] Please note that a simpler and probably more meaningful way to conceive of \(\sigma^2_\alpha\) is that it is literally the variance of the condition means. For the one-way between-groups ANOVA model, it is more straightforward to simply calculate the variance of the condition means than to use the above formula, but it gets moderately more tricky to apply that concept to more complex designs.
We’re not quite done with \(\hat{\sigma}^2_\alpha\) yet: there is one more consideration.
12.2.4.2.1 Random vs. Fixed Effects
The value of \(\hat{\sigma}^2_\alpha\) as calculated above has for its denominator \(df_A\), which is \(j-1\). Please recall that the formula for a sample variance – \(s^2=\frac{\sum(x-\bar{x})^2}{n-1}\) – also has an \(n-1\) term in the denominator. In the case of a sample variance, that’s an indicator that the variance refers to a sample that is much smaller than the population from which it comes. In the case of a population variance component, that indicates that the number of levels \(j\) of the independent-variable factor \(A\) is a sample of levels much smaller than the total number of possible levels of the factor. When the number of levels of the independent variable represents just some of the possible levels of the independent variable, that IV is known as a random-effects factor (the word “random” may be misleading: it doesn’t mean that factors were chosen out of a hat or via a random number generator, just that there are many more possible levels that could have been tested). For example, a health sciences researcher might be interested in the effect of different U.S. hospitals on health outcomes. They may not be able to include every hospital in the United States in the study, but rather 3 or 4 hospitals that are willing to participate in the study: in that case, hospital would be a random effect with those 3 or 4 levels (and the within-groups variable would be the patients in the hospitals).
A fixed-effect factor is one in which all (or, technically, very nearly all) possible levels of the independent variables are examined. To recycle the example of studying the effect of hospitals on patient outcomes, we might imagine that instead of hospitals in the United States, the independent variable of interest were hospitals in one particular city, in which case 3 or 4 levels might represent all possible levels of the variable. Please note that it is not the absolute quantity of levels that determines whether an effect is random or fixed but the relative quantity of observed levels to the total possible set of levels of interest.
When factor A is fixed, we have to replace the denominator \(j-1\) with \(j\) (to indicate that the variance is a measure of all of the levels of interest of the IV). We can do that pretty easily by multiplying the population variance component \(\hat{\sigma}^2_\alpha\) by \(j-1\) – which is equal to \(df_A\) – and then dividing the result by \(j\) – which is equal to \(df_A+1\). We can accomplish that more simply by multiplying the estimate of \(\sigma^2_\alpha\) by the term \(\frac{df_A}{df_A+1}\).
So, for random effects in terms of Expected Mean Squares:
\[\hat{\sigma}^2_\alpha=\frac{MS_A-MS_e}{n}\] \(\sigma^2_\alpha\), as noted above, is the variance of the means of the levels of factor \(A\). If factor \(A\) is random, then \(\sigma^2_\alpha\) is, specifically, the sample variance of the means of the levels of factor \(A\) about the grand mean \(y_{\bullet \bullet}\):
\[\hat{\sigma}^2_\alpha=\frac{\sum(y_{\bullet j}-y_{\bullet \bullet})^2}{j-1}\]
and for fixed effects:
\[\hat{\sigma}^2_\alpha=\frac{MS_A-MS_e}{n}\left[\frac{df_A}{df_A+1}\right]\]
which is equivalent to the population variance of the means of the levels of factor \(A\) about the grand mean \(y_{\bullet \bullet}\):
\[\hat{\sigma}^2_\alpha=\frac{\sum(y_{\bullet j}-y_{\bullet \bullet})^2}{j}\]
One more note on population variance components before we proceed to calculating \(\omega^2\) for the sample data: because the numerator of \(\hat{\sigma}^2_x\) (using \(x\) as a generic indicator of a factor that can be \(\alpha\), \(\beta\), \(\gamma\), etc. for more complex models) includes a subtraction term (in the above equation, it’s \(MS_A-MS_e\)), it is possible to end up with a negative estimate for \(\hat{\sigma}^2_x\) if the \(F\)-ratio for that factor is not significant. This really isn’t an issue for the one-way between-groups ANOVA design because if \(F_A\) is not significant then there is no effect to report and, as noted in the chapter on differences between two things, we do not report effect sizes when effects are not significant. But, it can happen for population variance components in factorial designs. If that is the case, we set all negative population variance components to 0.
Using the example data, assuming that factor A is a random effect:
\[\hat{\sigma}^2_\epsilon=MS_e=3.68\] \[\hat{\sigma}^2_\alpha=\frac{MS_A-MS_e}{n}=\frac{85.98-3.68}{5}=16.46\] and:
\[\omega^2=\frac{\hat{\sigma}^2_\alpha}{\hat{\sigma}^2_\alpha + \hat{\sigma}^2_\epsilon}=\frac{16.46}{16.46+3.68}=0.82\] Assuming that factor A is a fixed effect:
\[\hat{\sigma}^2_\epsilon=MS_e=3.68\]
\[\hat{\sigma}^2_\alpha=\frac{MS_A-MS_e}{n}\left[\frac{df_A}{df_A+1}\right]=\frac{85.98-3.68}{5}\left(\frac{3}{4}\right)=12.35\] and:
\[\omega^2=\frac{\hat{\sigma}^2_\alpha}{\hat{\sigma}^2_\alpha + \hat{\sigma}^2_\epsilon}=\frac{12.35}{12.35+3.68}=0.77\]
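The arithmetic above is easy to script; here is a minimal sketch using the \(MS\) values from the example ANOVA table (the object names are just illustrative):

MS.A <- 85.98; MS.e <- 3.68   # mean squares from the example ANOVA table
df.A <- 3; n <- 5
sigma2.e <- MS.e                                        # estimate of the error variance component
sigma2.a.random <- (MS.A - MS.e) / n                    # variance component if factor A is random
sigma2.a.fixed <- sigma2.a.random * df.A / (df.A + 1)   # adjustment if factor A is fixed
c(omega.sq.random = sigma2.a.random / (sigma2.a.random + sigma2.e),
  omega.sq.fixed = sigma2.a.fixed / (sigma2.a.fixed + sigma2.e))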
12.2.4.3 Effect-size Guidelines for \(\eta^2\) and \(\omega^2\)
As is the case for most classical effect-size guidelines, Cohen (2013) provided interpretations for \(\eta^2\) and \(\omega^2\). In his book Statistical Power Analysis for the Behavioral Sciences – which is the source for all of Cohen’s effect-size guidelines – Cohen claims that \(\eta^2\) and \(\omega^2\) are measures of the same thing (which is sort of true but also sort of not) and thus should have the same guidelines for what effects are small, medium, and large (which more-or-less works out because \(\eta^2\) and \(\omega^2\) are usually so similar in magnitude):
\(\eta^2\) or \(\omega^2\) | Interpretation |
---|---|
\(0.01\to 0.06\) | Small |
\(0.06 \to 0.14\) | Medium |
\(0.14\to 1\) | Large |
Thus, the \(\eta^2\) value and the \(\omega^2\) values assuming either random or fixed effects for the sample data all fall comfortably into the large effect range.214
12.2.5 Post hoc tests
Roughly translated, post hoc is Latin for after this. Post hoc tests are conducted after a significant effect of a factor has been established to learn more about the nature of an effect, i.e.: which levels of the factor are associated with higher or lower scores?
Post hoc tests can help tell the story of data beyond just a significant \(F\)-ratio in several ways. There is a test that compares experimental conditions to a single control condition (the Dunnett common control procedure). There are tests that compare the average(s) of one set of conditions to the average(s) of another set of conditions (including orthogonal contrasts and Scheffé contrasts). There are also post hoc tests that compare the differences between each possible pair of condition means (including Dunn-Bonferroni tests, Tukey’s HSD, and the Hayter-Fisher test). The choice of post hoc test depends on the story you are trying to tell with the data.
12.2.5.1 Pairwise Comparisons
12.2.5.1.1 Dunn-Bonferroni Tests
The use of Dunn-Bonferroni tests, also known as applying the Dunn-Bonferroni correction, is a catch-all term for any kind of statistical procedure where multiple hypothesis tests are conducted simultaneously and the type-I error rate is reduced for each test in order to avoid inflating the overall type-I error rate. For example, a researcher testing eight different simple regression models using eight independent variables to separately predict the dependent variable – that is, running eight different regression commands – might compare the \(p\)-values associated with each of the eight models to an \(\alpha\) that is \(1/8th\) the size of the overall desired \(\alpha\)-rate (otherwise, she would risk committing eight times the type-I errors she would with a single model).
In the specific context of post hoc tests, the Dunn-Bonferroni test is an independent-samples t-test for each of the condition means. The only difference is that one takes the \(\alpha\)-rate of one’s preference – say, \(\alpha=0.05\), and divides it by the number of possible multiple comparisons that can be performed on the condition means:
\[\alpha_{Dunn-Bonferroni}=\frac{\alpha_{overall}}{\#~of~pairwise~comparisons}\]
(it may be tempting to divide the overall \(\alpha\) by the number of comparisons performed instead of the number possible, but then one could just run one comparison with regular \(\alpha\) and thus defeat the purpose)
To use the example data from above, let’s imagine that we are interested in applying the Dunn-Bonferroni correction to a post hoc comparison of the mean of Condition 1 and Condition 2. To do so, we simply follow the same procedure as for the independent-samples t-test and adjust the \(\alpha\)-rate. Assuming we start with \(\alpha=0.05\), since there are \(4\) groups in the data, there are \(_4C_2=6\) possible pairwise comparisons, so the Dunn-Bonferroni-adjusted \(\alpha\)-rate is \(0.05/6=0.0083\).
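The comparison itself is just the \(t\)-test command we have already seen (the correction only changes the \(\alpha\) we compare the result to); the call below, using the Condition vectors from the example data, produces the output that follows:

t.test(Condition1, Condition2)   # Welch test by default; compare the p-value to 0.05/6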
##
## Welch Two Sample t-test
##
## data: Condition1 and Condition2
## t = -6.2688, df = 7.1613, p-value = 0.0003798
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -9.204768 -4.179232
## sample estimates:
## mean of x mean of y
## -1.042 5.650
The observed \(p\)-value of \(0.0003798<0.0083\), so we can say that there is a significant difference between the groups after applying the Dunn-Bonferroni correction.
Please also note that if you need to for any reason, you could also multiply the observed \(p\)-value by the number of possible pairwise comparisons and compare the result to the original \(\alpha\). It works out to be the same thing, a fact that is exploited by the base R command pairwise.t.test(), in which you can use the option p.adjust="bonferroni"215 to conduct pairwise Dunn-Bonferroni tests on all of the conditions in a set simultaneously:
DV<-c(Condition1, Condition2, Condition3, Condition4)
Factor.A<-c(rep("Condition 1", length(Condition1)),
rep("Condition 2", length(Condition2)),
rep("Condition 3", length(Condition3)),
rep("Condition 4", length(Condition4)))
pairwise.t.test(DV, Factor.A, p.adjust="bonferroni", pool.sd = FALSE, paired=FALSE)
##
## Pairwise comparisons using t tests with non-pooled SD
##
## data: DV and Factor.A
##
## Condition 1 Condition 2 Condition 3
## Condition 2 0.0023 - -
## Condition 3 0.0276 1.0000 -
## Condition 4 2.3e-05 0.1125 0.1222
##
## P value adjustment method: bonferroni
12.2.5.1.2 Tukey’s HSD
Tukey’s Honestly Significant Difference (HSD) Test is a way of determining whether the difference between two condition means is (honestly) significant. The test statistic for Tukey’s HSD test is the studentized range statistic \(q\), where for two means \(y_{\bullet1}\) and \(y_{\bullet2}\):
\[q=\frac{y_{\bullet1}-y_{\bullet2}}{\sqrt{\frac{MS_e}{n}}}\] ^[In this chapter, we are dealing exclusively with classical ANOVA models with equal \(n\) per group. The \(q\) statistic for the HSD test can also handle cases of unequal \(n\) for post hoc tests of results from general linear models by using the denominator:
\[\sqrt{\frac{MS_e}{2}\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}\] which is known as the Kramer correction. If using the Kramer correction, the more appropriate name for the test is not the Tukey test but rather the Tukey-Kramer procedure.]
The critical value of \(q\) depends on the number of conditions and on \(df_e\): those values can be found in tables such as this one.
For any given pair of means, the observed value of \(q\) is given by the difference between those means divided by \(\sqrt{MS_e/n}\), where \(MS_e\) comes from the ANOVA table and \(n\) is the number in each group: if the observed \(q\) exceeds the critical \(q\), then the difference between those means is honestly significantly different.
When performing the Tukey HSD test by hand, it also can be helpful to start with the critical \(q\) and calculate a critical HSD value such that any observed difference between means that exceeds the critical HSD is itself honestly significant:
\[HSD_{crit}=q_{crit}\sqrt{\frac{MS_e}{n}}\]
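For the example data, a sketch of that computation in base R (qtukey() gives quantiles of the studentized range distribution):

q.crit <- qtukey(0.95, nmeans = 4, df = 16)   # critical q for 4 means and df_e = 16
HSD.crit <- q.crit * sqrt(3.68 / 5)           # MS_e = 3.68, n = 5 per group
HSD.crit                                      # about 3.47: mean differences larger than this are honestly significant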
But, it is unlikely that you will be calculating HSDs by hand. The base R command TukeyHSD() wraps an aov() object – in the same way that we can wrap summary() around an aov() object to get a full-ish ANOVA table – and returns results about the differences between means with confidence intervals and \(p\)-values:
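A minimal call using the aov object built from the example data above (the same formula and data frame as before) gives the output below:

TukeyHSD(aov(DV ~ Factor.A, data = one.way.between.groups.example.df))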
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DV ~ Factor.A, data = one.way.between.groups.example.df)
##
## $Factor.A
## diff lwr upr p adj
## Condition 2-Condition 1 6.692 3.2225407 10.161459 0.0002473
## Condition 3-Condition 1 5.810 2.3405407 9.279459 0.0010334
## Condition 4-Condition 1 9.958 6.4885407 13.427459 0.0000022
## Condition 3-Condition 2 -0.882 -4.3514593 2.587459 0.8847332
## Condition 4-Condition 2 3.266 -0.2034593 6.735459 0.0687136
## Condition 4-Condition 3 4.148 0.6785407 7.617459 0.0165968
According to the results here, there are honestly significant differences (assuming \(\alpha=0.05\)) between Condition 2 and Condition 1, between Condition 3 and Condition 1, between Condition 4 and Condition 1, and between Condition 4 and Condition 3. All other differences between level means are not honestly significant.
12.2.5.1.3 Hayter-Fisher Test
The Hayter-Fisher test is identical to the Tukey HSD, but with a smaller critical \(q\) value, which makes this test more powerful than the Tukey HSD test. The critical value of \(q\) used for the Hayter-Fisher test is the value listed in tables for \(k-1\) groups given the same \(df_e\) (in the table linked above, it is the critical \(q\) listed one column immediately to the left of the value one would use for Tukey’s HSD).
The price of being able to use a smaller critical \(q\) value is that the Hayter-Fisher test can only be used following a significant \(F\)-test. However, since you shouldn’t be reporting results of post hoc tests following non-significant \(F\)-tests anyway, that’s not really an issue. The issue with the Hayter-Fisher test is that fewer people have heard about it: probably not a dealbreaker, but something you might have to write in a response to Reviewer 2 if you do use it.
12.2.5.2 Contrasts
A contrast is a comparison of multiple elements. In post hoc testing, just as a pairwise comparison will return a test statistic like \(q\) in the Tukey HSD test or the Dunn-Bonferroni-corrected \(t\), a contrast is a single statistic – which can be tested for statistical significance – that helps us evaluate a comparison that we have constructed. For example, please imagine that we have six experimental conditions and we are interested in comparing the average of the means of the first three with the average of the means of the second three. The resulting contrast would be a value equal to the differences of the averages defined by the contrast. The two methods of contrasting discussed here – orthogonal contrasts and Scheffé contrasts – differ mainly in how the statistical significance of the contrast values are evaluated.
12.2.5.2.1 Orthogonal Contrasts
As noted above, orthogonality is the condition of being uncorrelated. Take for example the \(x\)- and \(y\)-axes in the Cartesian Coordinate System: the two axes are at right angles to each other, and the \(x\) value of a point in the plane tells us nothing about the \(y\) value of the point and vice versa. We need both values to accurately locate a point. The earlier mention of orthogonality in this chapter referred to the orthogonality of between-groups variance and within-groups variance: the sum of the two is precisely equal to the total variance observed in the dependent variable, indicating that the two sources of variance do not overlap, and each explains separate variability in the data.
Orthogonal contrasts employ this principle by making comparisons between condition means that don’t compare the same things more than once. Pairwise comparisons like the ones made in Tukey’s HSD test, the Hayter-Fisher test, and the Dunn-Bonferroni test are not orthogonal because the same condition means appear in more than one of the comparisons: if there are three condition means in a dataset and thus three possible pairwise comparisons, then each condition mean will appear in two of the comparisons.^[For this example, if we call the three condition means “\(C_1\)”, “\(C_2\)”, and “\(C_3\),” the three possible pairwise comparisons are:
- \(C_1\) vs. \(C_2\)
- \(C_1\) vs. \(C_3\)
- \(C_2\) vs. \(C_3\)
\(C_1\) appears in comparisons (1) and (2), \(C_2\) appears in comparisons (1) and (3), and \(C_3\) appears in comparisons (2) and (3).]
By contrast (ha, ha), orthogonal contrasts are constructed in such a way that each condition mean is represented once and in a balanced way. This is a cleaner and more powerful way of making comparisons because it does not invite the extra statistical noise associated with the same within-groups variance being counted multiple times.
To construct orthogonal contrasts, we choose positive and negative coefficients that are going to be multiplied by the condition means. We multiply each condition mean by its respective coefficient and take the sum of all of those terms to get the value of a contrast \(\Psi\).
For two contrasts to be orthogonal, the sum across conditions of the products of their coefficients must be 0; a set of contrasts is orthogonal if every pair of contrasts in the set meets that condition.
For example, in a 4-group design like the one in the example data, the coefficients for one contrast \(\Psi_1\) could be:
\[c_{1j}=\left[\frac{1}{2}, \frac{1}{2}, -\frac{1}{2}, -\frac{1}{2} \right]\] That set of coefficients represents the average of the first two condition means (the sum of the first two condition means divided by two) minus the average of the last two condition means (the sum of the last two condition means divided by two, times negative 1).
A contrast \(\Psi_2\) that is orthogonal to \(\Psi_1\) would be:
\[c_{2j}=\left[0, 0, 1, -1\right]\] which would represent the difference between the condition 3 mean and the condition 4 mean, ignoring the means of conditions 1 and 2. \(\Psi_1\) and \(\Psi_2\) are orthogonal to each other because:
\[\left(\frac{1}{2}\right)(0)+\left(\frac{1}{2}\right)(0)+\left(-\frac{1}{2}\right)(1)+\left(-\frac{1}{2}\right)(-1)=0\]
A third possible contrast \(\Psi_3\) would have the coefficients:
\[c_{3j}=\left[1, -1, 0, 0\right]\]
which would represent the difference between the condition 1 mean and the condition 2 mean, ignoring the means of conditions 3 and 4. \(\Psi_1\) and \(\Psi_3\) are orthogonal to each other because:
\[\left(\frac{1}{2}\right)(1)+\left(\frac{1}{2}\right)(-1)+\left(-\frac{1}{2}\right)(0)+\left(-\frac{1}{2}\right)(0)=0\] and \(\Psi_2\) is orthogonal to \(\Psi_3\) because:
\[(0)(1)+(0)(-1)+(1)(0)+(-1)(0)=0\] Thus, \(\{\Psi_1, \Psi_2, \Psi_3\}\) is a set of orthogonal contrasts.
To calculate \(\Psi_1\) using the example data:
\[\Psi_1=\sum_1^j c_{1j}y_{\bullet j}=\left(\frac{1}{2}\right)(-1.04)+\left(\frac{1}{2}\right)(5.65)+\left(-\frac{1}{2}\right)(4.77)+\left(-\frac{1}{2}\right)(8.92)=-4.538\]
The significance of \(\Psi\) is tested using an \(F\)-statistic. The numerator of this \(F\)-statistic is \(\Psi^2\): contrasts always have one degree of freedom, the sums of squares is the square of the contrast, and the mean squares (which is required for an \(F\)-ratio numerator) is equal to the sums of squares whenever \(df=1\). The denominator of the \(F\)-statistic for testing \(\Psi\) has \(df\) equal to \(df_e\) from the ANOVA table and its value is given by \(MS_e\sum\frac{c_j^2}{n_j}\), where \(MS_e\) comes from the ANOVA table, \(c_j\) is the contrast \(c\) for condition mean \(j\) and \(n_j\) is the number of observations in condition \(j\) (for a classic ANOVA model, \(n\) is the same for all groups so the subscript \(j\) isn’t all that relevant):
\[F_{obs}^{\Psi}=\frac{\Psi^2}{MS_e\sum \frac{c_{ij}^2}{n_j}}\] For \(\Psi_1\) calculated above, the observed \(F\)-statistic therefore would be:
\[F_{obs}^{\Psi_1}=\frac{(-4.54)^2}{3.68\left[\frac{(1/2)^2}{5}+\frac{(1/2)^2}{5}+\frac{(-1/2)^2}{5}+\frac{(-1/2)^2}{5}\right]}=\frac{20.61}{0.74}=27.85\] The \(p\)-value for an \(F\)-ratio of 27.85 with \(df_{num}=1,~df_{denom}=16\) is:
## [1] 7.514161e-05
Unless our \(\alpha\)-rate is something absurdly small, we can say that \(\Psi_1\) – the contrast of the average of the first two condition means to average of the last two condition means – is statistically significant.
As for the other contrasts in our orthogonal set:
\[\Psi_2=\sum_1^j c_{2j}y_{\bullet j}=0(-1.04)+0(5.65)+(1)(4.77)+(-1)(8.92)=-4.15\] \[F_{obs}^{\Psi_2}=\frac{(-4.15)^2}{3.68\left[\frac{0^2}{5}+\frac{0^2}{5}+\frac{1^2}{5}+\frac{(-1)^2}{5}\right]}=\frac{17.22}{1.47}=11.71,~p=0.003\] \[\Psi_3=\sum_1^j c_{3j}y_{\bullet j}=1(-1.04)+(-1)(5.65)+0(4.77)+0(8.92)=-6.69\]
\[F_{obs}^{\Psi_3}=\frac{(-6.69)^2}{3.68\left[\frac{1^2}{5}+\frac{(-1)^2}{5}+\frac{0^2}{5}+\frac{0^2}{5}\right]}=\frac{44.76}{1.47}=30.45,~p=0.00005\]
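Those calculations are easy to script; here is a sketch using the condition means and the \(MS_e\) and \(df_e\) from the example ANOVA table (object names are just illustrative; the results match the hand calculations above up to rounding):

cond.means <- c(-1.042, 5.650, 4.768, 8.916)   # condition means from the example data
MS.e <- 3.68; df.e <- 16; n <- 5
C <- rbind(c(1/2, 1/2, -1/2, -1/2),   # coefficients for Psi 1
           c(  0,   0,    1,   -1),   # coefficients for Psi 2
           c(  1,  -1,    0,    0))   # coefficients for Psi 3
psi <- as.vector(C %*% cond.means)               # contrast values
F.psi <- psi^2 / (MS.e * rowSums(C^2) / n)       # F for each contrast (df_num = 1)
p.psi <- pf(F.psi, 1, df.e, lower.tail = FALSE)  # p-values with df_denom = df_e
round(cbind(Psi = psi, F = F.psi, p = p.psi), 5)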
12.2.5.3 Scheffé Contrasts
When using Scheffé contrasts, one does not have to be concerned with the orthogonality of any given contrast with any other given contrast. That makes the Scheffé contrast ideal for making complex comparisons (e.g., for a factor with 20 levels, one could easily calculate a Scheffé contrast comparing the average of the means of the third, fourth, and seventh levels to the average of levels 10 through 18), and/or for making many contrast comparisons. The drawback of the Scheffé contrast is that it is by design less powerful than orthogonal contrasts. That is the price of freedom (in this specific context). Although power is sacrificed, it is true that if the \(F\) test for a factor is significant, then at least one Scheffé contrast constructed from the level means of that factor will be significant.
The \(\Psi\) value for any given Scheffé contrast is found in the same way as for orthogonal contrasts, except that any coefficients can be chosen without concern for the coefficients of other contrasts. Say we were interested, for our sample data, in comparing the mean of condition 2 with the mean of the other three condition means. The Scheffé \(\Psi\) would therefore be:
\[\Psi_{Scheff\acute{e}}=\frac{1}{3}(y_{\bullet 1})+(-1)(y_{\bullet 2})+\frac{1}{3}(y_{\bullet 3})+\frac{1}{3}(y_{\bullet 4})=\frac{-1.042}{3}-5.65+\frac{4.768}{3}+\frac{8.916}{3}=-1.44\] The \(F\)-statistic formula for the Scheffé is likewise the same as the \(F\)-statistic formula for orthogonal contrasts:
\[F_{Scheff\acute{e}}=\frac{\Psi^2}{MS_e\sum\frac{c_j^2}{n_j}}\] For our Scheffé example:
\[F_{Scheff\acute{e}}=\frac{(-1.44)^2}{3.68\left[\frac{(1/3)^2}{5}+\frac{(-1)^2}{5}+\frac{(1/3)^2}{5}+\frac{(1/3)^2}{5}\right]}=\frac{2.07}{0.98}=2.11\] The more stringent significance criterion for the Scheffé contrast is given by multiplying the \(F\)-statistic that would be used for a similar orthogonal contrast – the \(F\) with \(df_{num}=1\) and \(df_{denom}=df_e\) – by the number of conditions minus one:
\[F_{crit}^{Scheff\acute{e}}=(j-1)F_{crit}(df_{num}=1, df_{denom}=df_e)\] So, to test our example \(F_{Scheff\acute{e}}\) of 2.11, we compare it to a critical value equal to \((j-1)\) times the \(F\) that puts \(\alpha\) in the upper tail of the \(F\) distribution with \(df_{num}=1\) and \(df_{denom}=df_e\):
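The critical value below comes from base R's qf() (the 0.95 quantile of \(F(1, 16)\), multiplied by \(j-1=3\)):

(4 - 1) * qf(0.95, df1 = 1, df2 = 16)   # (j - 1) times the critical F(1, df_e) for alpha = .05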
## [1] 13.482
The observed \(F\) is less than the critical value of 13.482, so this particular Scheffé contrast is not significant at the \(\alpha=0.05\) level.
12.2.5.4 Dunnett Common Control Procedure
The last post hoc test we will discuss in this chapter216 is one that is useful for only one particular data structure, but that one data structure is commonly used: it is the Dunnett Common Control Procedure, used when there is a single control group in the data. The Dunnett test calculates a modified \(t\)-statistic based on the differences between condition means and the \(MS_e\) from the ANOVA. Given the control condition and any one of the experimental conditions:
\[t_{Dunnett}=\frac{y_{\bullet control}-y_{\bullet experimental}}{\sqrt{\frac{2MS_e}{n}}}\]
Critical values of the Dunnett test can be found in tables. There is also an easy-to-use DunnettTest() command in the DescTools package. To produce the results of the Dunnett Test, enter the data array for each factor level as a list(): the DunnettTest() command will take the first array in the list to be the control condition. For example, if we treat Condition 1 of our sample data as the control condition:
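A minimal call, assuming the DescTools package is installed and using the Condition vectors from the example data, produces the output below:

library(DescTools)
DunnettTest(list(Condition1, Condition2, Condition3, Condition4))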
##
## Dunnett's test for comparing several treatments with a control :
## 95% family-wise confidence level
##
## $`1`
## diff lwr.ci upr.ci pval
## 2-1 6.692 3.548031 9.835969 0.00011 ***
## 3-1 5.810 2.666031 8.953969 0.00051 ***
## 4-1 9.958 6.814031 13.101969 4.9e-07 ***
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
12.2.6 Power Analysis for the One-Way Independent-Groups ANOVA
The basic concept behind power analysis for ANOVA is the same as it was for classical \(t\)-tests: how many observations does an experiment require for us to expect to reject the null hypothesis at a long-term rate defined by our desired power level? For the \(t\)-test, that calculation was relatively simple: given a \(t\)-distribution representing sample means drawn from a parent distribution as defined by the null hypothesis and a measure of alternative-hypothesis expectations, one can calculate the proportion of expected sample means that would fall in the rejection region of that null-defined \(t\)-distribution. With ANOVA, however, there are more parent distributions reflecting the different levels of each factor, and those parent distributions may be related to each other in complex ways (particularly in repeated-measures and factorial ANOVA designs).
There are multiple software packages designed to help with the potentially complex task of power analysis for ANOVA designs. For the between-groups one-way ANOVA, we can use the base R command power.anova.test(). The power.anova.test() command takes as its options the number of groups, the within variance (given by \(MS_e\)), the between variance (given by \(\sigma^2_\alpha\): the variance of the group means), \(n\), \(\alpha\), and the desired power. To run the power analysis, leave exactly one of those options out and power.anova.test() will estimate the missing value. For example, imagine that we wanted to run a study that produces results similar to what we see in the example data. There are 4 groups, \(\alpha=0.05\), and we desire a power of 0.9. We can estimate the within-groups variance and between-groups variance for the next study based on a rough estimate of what we would expect (we could use the precise values from our example data, but we don’t want to get too specific because that’s just one study: to be conservative, we can overestimate the within-groups variance and underestimate the between-groups variance), and the output will recommend our \(n\) per group:
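A call that produces the output below might look like this (a sketch; the variance values are the rough, conservative guesses described above):

power.anova.test(groups = 4,        # number of groups
                 between.var = 8,   # conservative guess at the variance of the group means
                 within.var = 8,    # conservative guess at the within-groups variance
                 sig.level = 0.05,  # alpha
                 power = 0.9)       # desired power; n is left out, so it is estimated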
##
## Balanced one-way analysis of variance power calculation
##
## groups = 4
## n = 5.809905
## between.var = 8
## within.var = 8
## sig.level = 0.05
## power = 0.9
##
## NOTE: n is number in each group
If we expect our between-groups variance to be 8 (about half of what we observed in our sample data) and our within-groups variance to be about 8 as well (about twice what we observed in our sample data), R tells us that we would require 5.81 participants per group. However, we are smarter than R, and we know that we can only recruit whole participants – we always round up the results for \(n\) from a power analysis to the next whole number – so we should plan for \(n=6\) for power to equal 0.9.
There are R packages and other software platforms that can handle more complex ANOVA models. However, if you can’t find one that suits your particular needs, then power analysis can be accomplished in an ad hoc manner by simulating parent distributions (using commands like rnorm()) that are related to each other in the ways one expects – in terms of things like the distance between the distributions of levels of between-groups factors and/or correlations between distributions of levels of repeated-measures factors – and then deriving \(n\) based on repeated sampling from those synthetic datasets.
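For the one-way between-groups case, such a simulation might look something like the following sketch (the group means and standard deviation here are hypothetical placeholders; substitute whatever you expect for your own study):

# Estimate power by repeatedly simulating the planned experiment and counting rejections
sim_power <- function(n, means = c(0, 1, 1, 4), sd = 2.8, alpha = 0.05, nsim = 1000) {
  reject <- replicate(nsim, {
    group <- factor(rep(seq_along(means), each = n))                      # group labels
    DV <- rnorm(n * length(means), mean = rep(means, each = n), sd = sd)  # simulated data
    p <- summary(aov(DV ~ group))[[1]][["Pr(>F)"]][1]                     # p-value for the factor
    p < alpha
  })
  mean(reject)   # proportion of simulated experiments that reject the null
}
set.seed(1234)
sapply(c(5, 10, 15, 20), sim_power)   # estimated power for several candidate values of n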
12.2.7 Nonparametric Differences Between 3 or More Independent Groups
12.2.7.1 The \(\chi^2\) Test of Statistical Independence with \(k \ge 3\)/Extension of the Median Test
The \(\chi^2\) test of statistical independence was covered in depth in the chapter on differences between two things for \(k=2\) and \(2 \times 2\) contingency tables. The exact same procedure is followed for \(k>2\) and for contingency tables with more than 2 rows and/or columns. The one thing to keep in mind is that the \(df\) for the \(\chi^2\) test is equal to \(df_{rows}\times df_{columns}\) (in all of the examples from the differences between two things chapter, \(df=1\) because there were always \(k=2\) rows and/or columns).
Everything said above about the \(\chi^2\) test applies to the extension of the median test. It’s exactly the same procedure, but with more than \(k=2\) groups.
We can’t use the sample data for the one-way between-groups ANOVA model to demonstrate these tests because the sample is too small to ensure that \(f_e\ge5\) for each cell. So, here is another group of sample data:
Condition 1 | Condition 2 | Condition 3 | Condition 4 |
---|---|---|---|
-5.41 | -5.55 | 3.79 | 0.33 |
-5.96 | 0.51 | 3.07 | 3.18 |
-3.51 | -2.10 | 0.94 | 4.05 |
-9.54 | -0.65 | 0.47 | 5.82 |
-2.61 | 0.13 | 2.55 | 7.27 |
-6.13 | 2.02 | 4.49 | 4.12 |
-10.14 | 0.32 | 3.67 | 7.70 |
-8.87 | 1.24 | 0.75 | 4.16 |
-6.87 | -2.57 | 1.59 | 6.84 |
-7.00 | -1.85 | 1.54 | 6.92 |
Median = 0.63 |
The median of all of the observed data is 0.63. We can construct a contingency table with Conditions as the columns and the greater than or equal to/less than the median status of each observed value of the dependent variable as the rows, populating the cells of the table with the counts of values that fall into each cross-tabulated category:
Condition 1 | Condition 2 | Condition 3 | Condition 4 | totals | |
---|---|---|---|---|---|
\(\ge 0.63\) | 0 | 2 | 9 | 9 | 20 |
\(<0.63\) | 10 | 8 | 1 | 1 | 20 |
totals | 10 | 10 | 10 | 10 | 40 |
We then can perform a \(\chi^2\) on the contingency table for the extension of the median test:
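The call might look like this (a sketch; the counts are entered column-wise, one condition at a time, matching the contingency table above):

chisq.test(matrix(c(0, 10, 2, 8, 9, 1, 9, 1), ncol = 4))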
##
## Pearson's Chi-squared test
##
## data: matrix(c(0, 10, 2, 8, 9, 1, 9, 1), ncol = 4)
## X-squared = 26.4, df = 3, p-value = 7.864e-06
The observed \(p\)-value is \(7.9\times 10^{-6}\), and so is likely smaller than whatever \(\alpha\)-rate we might choose. Therefore, we can conclude that there is a significant difference between the conditions.
12.2.7.2 The Kruskal-Wallis One-Way Analysis of Variance
The closest nonparametric analogue to the classical one-way between-groups ANOVA is the Kruskal-Wallis test. The Kruskal-Wallis test relies wholly on the ranks of the data, determining whether there is a difference between groups based on the distribution of the ranks in each group.
We can use the example data from the one-way between-groups ANOVA to demonstrate the Kruskal-Wallis test. The first step is to assign an overall rank (independent of group membership) to each observed value of the dependent variable from smallest to largest, with any ties receiving the average of the ranks that the tied values would otherwise occupy:
Condition 1 | ranks | Condition 2 | ranks | Condition 3 | ranks | Condition 4 | ranks |
---|---|---|---|---|---|---|---|
-3.1 | 1 | 7.28 | 14 | 0.12 | 5 | 8.18 | 17 |
0.18 | 6 | 3.06 | 7 | 5.51 | 10 | 9.05 | 19 |
-0.72 | 3 | 4.74 | 8 | 5.72 | 11 | 11.21 | 20 |
0.09 | 4 | 5.29 | 9 | 5.93 | 12 | 7.31 | 15 |
-1.66 | 2 | 7.88 | 16 | 6.56 | 13 | 8.83 | 18 |
\(R_1=\) | 16 | \(R_2=\) | 54 | \(R_3=\) | 51 | \(R_4=\) | 89 |
The sum of the ranks for each condition \(j\) is called \(R_j\), as noted in the table above. The test statistic \(H\) for the Kruskal-Wallis test is given by:
\[H=\frac{\frac{12}{N(N+1)} \left[\sum_{j=1}^k \frac{R_j^2}{n_j} \right]-3(N+1)}{1-\frac{\sum_1^T \left(t^3-t\right)}{N^3-N}}\] where \(N\) is the total number of observations, \(n_j\) is the number of observations in group \(j\) (the group sizes don’t have to be equal for the Kruskal-Wallis test), \(T\) is the number of tie clusters, and \(t\) is the number of values clustered in each tie: if there are no tied ranks, then the denominator is 1.217
The observed \(H\) for these data is:
\[H=\frac{\frac{12}{20(21)}(2738.8)-3(21)}{1}=15.25\] Observed values of \(H\) for relatively small samples can be compared to ancient-looking tables of critical values based on the \(n\) per each group. If – as in this case – the values are not included in the table, we can take advantage of the fact that \(H\) can be modeled by a \(\chi^2\) distribution with \(df=k-1\). In our case, that means that the \(p\)-value for the observed \(H=15.25\) is:
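That \(\chi^2\) approximation can be obtained with pchisq() (a sketch, using \(df = k-1 = 3\)):

pchisq(15.25, df = 3, lower.tail = FALSE)   # upper-tail probability for the observed H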
## [1] 0.001615003
which would be significant if we chose \(\alpha=0.05\) or \(\alpha=0.01\).
Of course there is, as usual, a better way afforded us through the miracle of modern computing:
## Arrange the data into long format and add Condition labels
data<-c(Condition1, Condition2, Condition3, Condition4)
Condition<-c(rep("Condition1", 5), rep("Condition2", 5), rep("Condition3", 5), rep("Condition4", 5))
one.way.between.example.long<-data.frame(data, Condition)
kruskal.test(data~Condition, data=one.way.between.example.long)
##
## Kruskal-Wallis rank sum test
##
## data: data by Condition
## Kruskal-Wallis chi-squared = 15.251, df = 3, p-value = 0.001614
12.2.8 Bayesian One-Way Between-Groups ANOVA
Bayesian ANOVA works via the same principles as the Bayesian \(t\)-tests. There is a null model that posits that there is no variance between groups: in that model, the data are normally distributed, all sitting atop each other. In the alternative model, there is some non-zero variance produced by the factor(s) of interest, and the prior probabilities of those variance estimates are distributed by a multivariate Cauchy distribution (a generalization of the univariate Cauchy distribution that served as the prior for the Bayesian \(t\)-tests).
As with the Bayesian \(t\)-tests, the BayesFactor package provides a command structure that is maximally similar to the commands used for the classical ANOVA. For the sample data, we can calculate an ANOVA and use the posterior() wrapper to estimate posterior condition means, standard deviations, and quantile estimates:
# The Bayes Factor software is very sensitive about variable types, so we have to explicitly make "Factor A" into a "Factor" variable with the following command:
one.way.between.groups.example.df$Factor.A<-as.factor(one.way.between.groups.example.df$Factor.A)
# Now we can run the Bayesian ANOVA:
anovaBF(DV~Factor.A, data=one.way.between.groups.example.df)
## Bayes factor analysis
## --------------
## [1] Factor.A : 3425.795 ±0.01%
##
## Against denominator:
## Intercept only
## ---
## Bayes factor type: BFlinearModel, JZS
posterior.data<-posterior(anovaBF(DV~Factor.A, data=one.way.between.groups.example.df), iterations=5000)
summary(posterior.data)
##
## Iterations = 1:5000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## mu 4.5728 0.4667 0.006601 0.006601
## Factor.A-Condition 1 -5.1945 0.8922 0.012618 0.017107
## Factor.A-Condition 2 0.9902 0.7992 0.011302 0.011550
## Factor.A-Condition 3 0.1972 0.7821 0.011060 0.011060
## Factor.A-Condition 4 4.0070 0.8775 0.012410 0.015078
## sig2 4.5869 2.2554 0.031896 0.056470
## g_Factor.A 5.8560 9.4623 0.133818 0.169369
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## mu 3.6493 4.2683 4.5774 4.8825 5.491
## Factor.A-Condition 1 -6.8101 -5.7774 -5.2525 -4.6613 -3.274
## Factor.A-Condition 2 -0.6330 0.4862 0.9908 1.4971 2.564
## Factor.A-Condition 3 -1.3709 -0.3073 0.1860 0.7112 1.751
## Factor.A-Condition 4 2.1818 3.4816 4.0457 4.5894 5.612
## sig2 2.1495 3.1978 4.1270 5.3515 9.974
## g_Factor.A 0.5376 1.8800 3.3909 6.3900 24.816
A useful feature offered by BayesFactor is the option to change the priors for an analysis by including a one-word option. The default prior width is “medium;” for situations where we may expect a larger difference between groups, choosing “wide” or “ultrawide” will give greater support to alternative models – but likely won’t give wildly different Bayes factor results – when the actual effect size is large. The options make no difference for our sample data, but for purposes of reviewing the syntax of the prior options:
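The calls that produce the two outputs below might look like this (a sketch; the prior width is set with the rscaleFixed argument of anovaBF()):

# "wide" and "ultrawide" priors on the fixed effect of Factor A
anovaBF(DV~Factor.A, data=one.way.between.groups.example.df, rscaleFixed = "wide")
anovaBF(DV~Factor.A, data=one.way.between.groups.example.df, rscaleFixed = "ultrawide")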
## Bayes factor analysis
## --------------
## [1] Factor.A : 3425.795 ±0.01%
##
## Against denominator:
## Intercept only
## ---
## Bayes factor type: BFlinearModel, JZS
## Bayes factor analysis
## --------------
## [1] Factor.A : 3425.795 ±0.01%
##
## Against denominator:
## Intercept only
## ---
## Bayes factor type: BFlinearModel, JZS
12.3 One-Way Repeated-Measures ANOVA
We know that, in a psychological experiment (and, in life, too) different people respond to the same stimuli in (at least somewhat) different ways; otherwise, every numeric observation within every experimental condition would be identical. Of course, some of the variance in values of dependent variables is simple randomness, and when we use independent-groups designs, we assume that it is random; more precisely, we assume those different numbers are random draws from a normal distribution: we give the variance of that distribution a special name – within-groups variance – and we measure it with the within-groups mean squares (\(MS_\epsilon\)). It’s possible that all the differences in values of a DV are totally stochastic, in the sense that the same person could produce different DV values just due to simple variance (as an exercise: try starting a stopwatch and stopping it within 1 second a few times – you probably will get several similar-but-different results). However, some differences in the DV are likely not totally random; they may be different because the people generating those values are different. That’s what we mean – in statistics, anyway – when we refer to individual differences.
Individuals differ from each other in lots of ways, as I’m sure you have noticed. Some psychological studies examine them directly, but many more acknowledge individual differences and account for them in statistical analyses. That kind of approach is the topic for this section of the book.
Independent-groups ANOVA analyses do not have any mechanism for accounting for individual differences because they have no way of separating variance due to individual differences from random, within-groups variance. That is because in an independent-groups design, each individual is measured only once and in only one level of each factor; there is no basis for knowing how an individual would react in different levels of the factor.
Repeated-measures ANOVA (as with its \(t\)-test counterpart, there are variations on the name: repeated-measures ANOVA is also known as within-groups ANOVA, within-participants ANOVA, or randomized block ANOVA) designs are those in which there is an observation of every participant in every level of a factor. Since every individual participates in every condition, repeated-measures designs do allow for measurement of the variance in the DV that is due to individual differences. In the section on repeated-measures \(t\)-tests, I noted some of the advantages of repeated-measures designs and some of the cases where repeated-measures designs are impossible and/or unadvisable: all of those absolutely apply here to repeated-measures ANOVA.
12.3.1 Assumptions of Repeated-measures ANOVA
A brief note on the organization of this section: in the independent-groups section of this chapter, we discussed the ANOVA model before the assumptions. In the case of repeated measures, the model depends in part on one of the repeated-measures-specific assumptions, so we begin with assumptions.
Repeated-measures analyses, like independent-groups analyses, assume normality of the residuals. To assess normality for independent groups, we calculate residuals as the difference between each observed score and the mean for the group from which that score comes: \(\epsilon = y - y_{\bullet j}\). For repeated-measures designs, the residuals also take into account each individual’s overall level of response: \(\epsilon = y - y_{\bullet j} - \pi_i\), where \(\pi_i\) is the effect of participant \(i\) (the difference between the mean of all of participant \(i\)’s observations across conditions and the grand mean).
Independent-groups designs also assume homoscedasticity; repeated-measures designs carry an analogous assumption known as sphericity. The sphericity assumption is that the variances of the differences between participant observations for each pair of conditions are equal. Honestly, that is a mouthful! But, it’s really a generalization of the variance in difference scores from repeated-measures \(t\)-tests. To calculate \(t\)-statistics for repeated measures, we take the difference between paired observations; we can then calculate the standard deviation \(s_d\) (from which we can go on to divide by \(\sqrt{n}\) to get \(se_d\), or use \(s_d\) in the denominator of Cohen’s \(d\)). Sphericity is the same idea, it’s just that there are more possible condition differences in ANOVA designs than the one difference in a \(t\)-test design. Please note: unlike most other assumptions of parametric tests, violations of sphericity and/or compound symmetry tend to increase type-I errors rather than type-II errors. Essentially, we expect individual differences to be consistent across different levels of the IV; if that’s not the case, then the impact of the IV relative to the error is slightly overstated and the analysis is prone to slightly more frequent type-I errors as a result. When assumptional violations err on the more-misses side, we can usually feel pretty good about our results because we didn’t miss; when they err on the more-false-alarm side, that undermines our faith in statistically significant findings. So, this one is pretty important!
Compound Symmetry
Sphericity refers only to the condition that the variances of the differences between all pairs of factor levels are equal, and the variance-covariance matrix is used to assess it. Compound symmetry is a stronger assumption associated with repeated-measures ANOVA designs.
The assumption of compound symmetry is that all of the population-level variances are equal to each other (i.e., \(\sigma^2_1=\sigma^2_2 = \cdots = \sigma^2_j\)) and all the population-level covariances are equal to each other (i.e., \(\sigma_{1,2}=\sigma_{1,3}=\cdots = \sigma_{j-1,j}\)).
You might encounter the term compound symmetry, especially in, let’s say, 20th century sources. But, you don’t really have to worry about it. Compound symmetry is a stronger assumption than sphericity – if compound symmetry is met then sphericity is also met – but it has long been established that it’s not necessary for \(F\)-values to be reliable like sphericity is (Huynh & Feldt, 1970; Rouanet & Lépine, 1970). This text box has been strictly an FYI situation.
12.3.1.1 Testing the Sphericity Assumption: Something we really don’t have to do.
Much like the normality and homoscedasticity assumptions, sphericity is an assumption that can be mathematically tested. Mauchly’s test (note: Mauchly also invented other things) is the most common sphericity test; so common that it’s included by default in multiple software platforms and packages for analyzing repeated-measures ANOVA (including R). It uses an algorithm so complex that it’s pretty much always left to software to run, although I want to note that the basis of the test is the variance-covariance matrix, which we will come back to later in this chapter. One thing to note about Mauchly’s test is that despite its ubiquity, it is considered to be unsatisfactorily powerful218; a more important thing to note about Mauchly’s test is that we don’t really need to do it. That’s because violations of sphericity don’t stop the null-hypothesis testing process, they just alter it; they are like a minor roadblock that we can go around, and the nice thing is that it’s not a problem to just take the alternate route in the first place.
There is an actual number that describes the degree of departure from sphericity; it is symbolized by \(\epsilon\), an unfortunate designation since we also use \(\epsilon\) to symbolize error in ANOVA219. \(\epsilon\), like tests of sphericity, is calculated based on the variance-covariance matrix. We use an estimate of \(\epsilon\) to run an adjusted \(F\) test, which, as the name implies, is an adjustment to the regular \(F\) test used in ANOVA.220 The remedy for violations of the sphericity assumption is to use the adjusted \(F\) test. The adjustment is proportional to the extent of the violation of the sphericity assumption: if the data indicate a big violation, then \(\epsilon\) is small, the degrees of freedom shrink, and the critical value for the adjusted \(F\) test is bigger than for the regular \(F\) test; but, if the sphericity assumption is not violated, then the adjusted \(F\) test is the same as the regular \(F\) test (and if the violation is minor, then the adjusted test is pretty close to the boring basic-ass \(F\) test). All that means is that we can just use the adjusted \(F\) test in the first place, and pretty much ignore tests of sphericity.
Additivity is the last of the repeated-measures-specific assumptions we will discuss here. Additivity is the assumption that there is no subject-by-treatment interaction. In other, hopefully clearer words: to assume additivity is to assume that each level of a factor will affect each participant in roughly the same way on average. When laboratory researchers design experiments, they allow that different levels of a treatment will affect participants – more specifically, the values of the dependent variable that participants produce – in different ways; they hope that they have designed the experiment in such a way that varying the levels of a factor leads to variance in the DV. What they don’t hope for when they assume additivity is that changing the levels will impact different participants in different ways. Physical allergies provide a pretty good example of something that may lead to violations of assumptions because they vary widely from person to person221. We may imagine a drug study in which some of the participants are allergic to chemicals in the capsule that holds the drug but not to the drug itself: those participants with that specific allergy may behave in ways that are in no way related to the levels of the IV and impact the results far more than mere (and expected) level-to-level variance.
The most common course of action is to assume additivity and therefore to use the additive repeated-measures ANOVA model. Like normality and sphericity, additivity can be tested for statistically; if violated, we can use the nonadditive model. However, we need to keep in mind that, relative to the nonadditive model, the additive model is easier to interpret, it makes calculating effect size easier, and tends to be more powerful.
12.3.2 Repeated-measures ANOVA Model
The additive repeated-measures ANOVA Model222 is symbolized by the following statement:
\[y_{ij}=\mu+\alpha_j+\pi_i+\epsilon_{ij}\]
where \(y_{ij}\) is the observed value of the DV, \(\mu\) is the population mean, \(\alpha_j\) is effect of the treatment \(\alpha\) at each level \(j\), \(\pi_i\) is the effect of individual (or block) differences, and \(\epsilon_{ij}\) is the error (or, residuals).
That model statement says that each observed value of the DV is influenced by four different elements: the population mean, which we can think of as providing a starting point, as in: if there were no other influences, all the DV values would be the value of the overall mean. The independent variable – which we will call factor A because it’s the first factor (for now it’s also the only factor. Also, A is symbolized by \(\alpha\) because that’s the Greek A) – exerts some influence on the numbers. Individual differences influence the DV numbers. And, finally, simple, otherwise-unaccounted-for within-group variance – which we also call error – moves the numbers around, too.
12.3.2.1 RM model vs. IG model
As a reminder, the model statement for the one-way independent groups ANOVA is:
\[y_{ij}=\mu+\alpha_j+\epsilon_{ij}\] In model-statement form, it looks like \(\pi_i\) is added in the repeated-measures model, but that’s a little misleading. \(\pi_i\) isn’t so much added to the model as it emerges out of \(\epsilon_{ij}\), which itself becomes that much smaller. Recall, please, that an independent-groups analysis cannot account for individual differences because there is only one measurement per person. That doesn’t mean that individual differences cease to exist in an independent-groups analysis, only that we don’t have the tools to measure them. Therefore, individual differences in the IG model are treated like any other error. Repeated-measures models can account for individual differences (it’s their whole purpose), so part of what has to be called error in an IG model is an identifiable source of variance in the RM model!
The basic logic of ANOVA applies to all ANOVA designs, including, of course, the repeated-measures design: we reject \(H_0\) when the observed variance associated with the independent variable is large relative to the variance associated with error. One of the major advantages to using a repeated-measures design (relative to an independent-groups design and, of course, when possible) is that accounting for individual differences (as long as they are meaningful, and they frequently are) reduces the amount of variance that is associated with error, leading to relatively larger \(F\)-statistics.
12.3.3 Repeated-measures ANOVA: Hypotheses
The null and alternative hypotheses for repeated-measures ANOVA are the same as for the independent-groups ANOVA:
\[H_0: \sigma^2_\alpha = 0\]
\[H_1: \sigma^2_\alpha>0\]
That is, the null is that there is no variance associated with the treatment \((\alpha)\) and the alternative is that there is variance associated with the treatment \((\alpha)\) on the population level.
Repeated-measures designs, as noted above, have an additional identifiable source of variance: variance due to participants \((\sigma^2_\pi)\). However, measuring participant variance – that is, individual differences – is typically not the point; we usually care little about the presence of individual differences simply because it is so commonplace a phenomenon, and our concern is really the independent variable. The aim of these designs is to reduce unaccounted-for variance \((\sigma^2_\epsilon)\) in order to increase statistical power for the hypothesis about \(\sigma^2_\alpha\). To use the signal-detection metaphor as applied to NHST, accounting for \(\sigma^2_\pi\) is a method of reducing the noise to get a better look at the signal.
12.3.4 Calculation Example
Almost all of the calculations for the Sums of Squares terms are the same for the repeated-measures ANOVA as they are for the independent-groups ANOVA, with one additional step. We are additionally going to calculate the Sums of Squares associated with individual differences: \(SS_{participants}\). We will then take that value and subtract it from \(SS_{within}\) to get \(SS_{e}\).
The entirely fictional data for this example are shown in Table 12.1: there are four participants \((n=4)\)223, three levels of the repeated-measures independent variable Factor A, and a false-alarm rate of \(\alpha = 0.05\).
Participant | A1 | A2 | A3 |
---|---|---|---|
P1 | 5 | 6 | 7 |
P2 | 7 | 13 | 13 |
P3 | 2 | 4 | 6 |
P4 | 6 | 9 | 12 |
12.3.4.1 Calculating \(df\)
For this example, there are \(j=3\) levels of factor \(\alpha\), so \(df_A=j-1=2\). There are \(n=4\) participants in the experiment, so \(df_p=4-1=3\). The \(df_\epsilon\) for this model is given by \((j-1)(n-1)\); in this example, that comes to \((3-1)(4-1)=6\). As with the independent-groups model, \(df_{total}=jn-1\); in this case that would mean \(df_{total}=(3)(4)-1=11\). As always, \(df_{total}\) is the sum of all the other \(df\) terms; we can check our math for this example and confirm that \(df_A+df_p+df_\epsilon = 2+3+6=11=df_{total}\).
12.3.4.2 Calculating Sums of Squares
This first step is the same as it is for the one-way independent-groups ANOVA. We take the difference between each observed value \(y_{ij}\) and the grand mean \(y_{\bullet \bullet}\) and square that difference. The squared differences are noted next to each observed value in the shaded columns of Table 12.2.
Participant | Condition 1 | \((y_{i1}-y_{\bullet \bullet})^2\) | Condition 2 | \((y_{i2}-y_{\bullet \bullet})^2\) | Condition 3 | \((y_{i3}-y_{\bullet \bullet})^2\) |
---|---|---|---|---|---|---|
P1 | 5 | 6.25 | 6 | 2.25 | 7 | 0.25 |
P2 | 7 | 0.25 | 13 | 30.25 | 13 | 30.25 |
P3 | 2 | 30.25 | 4 | 12.25 | 6 | 2.25 |
P4 | 6 | 2.25 | 9 | 2.25 | 12 | 20.25 |
The sum of those values – \(\sum \left( y_{ij}-y_{\bullet \bullet}\right)^2\) – is \(SS_{total}\):
\[SS_{total}= \sum \left( y_{ij}-y_{\bullet \bullet}\right)^2=139\]
12.3.4.3 RM One-way ANOVA Example
Next, we will calculate the Sums of Squares associated with the repeated-measures independent variable, \(SS_A\), just as we did for the independent-groups independent variable. \(SS_A\) is the sum of the squared differences between each group mean \(y_{\bullet j}\) and the grand mean, multiplied by the number of observations per condition \((n)\). As in the table accompanying the calculation of \(SS_A\) for the independent-groups model, Table 12.3 represents the concept of multiplying by the number of observations by repeatedly entering the squared difference between each group mean and the grand mean next to each observed datapoint.
Participant | Condition 1 | \((y_{\bullet 1}-y_{\bullet \bullet})^2\) | Condition 2 | \((y_{\bullet 2}-y_{\bullet \bullet})^2\) | Condition 3 | \((y_{\bullet 3}-y_{\bullet \bullet})^2\) |
---|---|---|---|---|---|---|
P1 | 5 | 6.25 | 6 | 0.25 | 7 | 4 |
P2 | 7 | 6.25 | 13 | 0.25 | 13 | 4 |
P3 | 2 | 6.25 | 4 | 0.25 | 6 | 4 |
P4 | 6 | 6.25 | 9 | 0.25 | 12 | 4 |
The sum of all the shaded values in Table 12.3 is \(SS_A\):
\[SS_{A}= n\sum \left( y_{\bullet j}-y_{\bullet \bullet}\right)^2=42\] Next, we’re going to calculate a value of \(SS\) that is not going to end up in the ANOVA table: it’s merely an intermediary step, and in the next step, we are going to break this value down into \(SS_{participants}\) and the \(SS_\epsilon\) term that is going to be the denominator for the \(F\) statistic.
As with \(SS_{total}\) and \(SS_A\), the calculation for \(SS_{within}\) for the repeated-measures design uses the exact same math as calculating \(SS_{within}\) for the independent-groups design. It’s the sum of the squared differences between each value of the dependent variable and its group mean \(y_{ij} -y_{\bullet j}\) of the level to which it belongs.
Participant | Condition 1 | \((y_{i1}-y_{\bullet 1})^2\) | Condition 2 | \((y_{i2}-y_{\bullet 2})^2\) | Condition 3 | \((y_{i3}-y_{\bullet 3})^2\) |
---|---|---|---|---|---|---|
P1 | 5 | 0 | 6 | 4 | 7 | 6.25 |
P2 | 7 | 4 | 13 | 25 | 13 | 12.25 |
P3 | 2 | 9 | 4 | 16 | 6 | 12.25 |
P4 | 6 | 1 | 9 | 1 | 12 | 6.25 |
\(SS_{within}\) is the sum of the numbers in the shaded columns in Table 12.4:
\[SS_{within}= \sum \left( y_{i j}-y_{\bullet j}\right)^2=97\]
First, we will calculate participant means \((y_{i\bullet})\) across the \(j\) conditions. Then, we will take the squared difference between each participant mean \((y_{i\bullet})\) and the grand mean \((y_{\bullet \bullet})\), and multiply the sum of those squared differences by the number of conditions \(j\).
Participant | Condition 1 | Condition 2 | Condition 3 | \(y_{i \bullet}\) | \((y_{i \bullet}-y_{\bullet \bullet})^2\) |
---|---|---|---|---|---|
P1 | 5 | 6 | 7 | 6 | 2.25 |
P2 | 7 | 13 | 13 | 11 | 12.25 |
P3 | 2 | 4 | 6 | 4 | 12.25 |
P4 | 6 | 9 | 12 | 9 | 2.25 |
\(SS_{participants}\) is the sum of the numbers in the shaded columns of Table 12.5:
\[SS_{participants}= j\sum \left( y_{i \bullet}-y_{\bullet \bullet}\right)^2=3(29)=87\]
The sums of squares associated with participants are literally subtracted from what was the error for the IG ANOVA:
\[SS_{residual}=SS_{within}-SS_{participants}=97-87=10\]
This step is the statistical expression of the benefit of repeated-measures ANOVA analysis! In addition to the practical benefits associated with requiring fewer individual participants, repeatedly-measuring the same participants allows us to literally subtract error away from what it would be in the corresponding independent-groups analysis.
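If you’d like to check the hand calculations, here is a minimal sketch in R (the scores matrix is just the Table 12.1 data entered by hand; the object names are hypothetical):

scores <- matrix(c(5, 6, 7,
                   7, 13, 13,
                   2, 4, 6,
                   6, 9, 12), nrow = 4, byrow = TRUE)   # participants in rows, conditions in columns
grand.mean <- mean(scores)
SS.total <- sum((scores - grand.mean)^2)                                   # 139
SS.A <- nrow(scores) * sum((colMeans(scores) - grand.mean)^2)              # 42
SS.within <- sum(sweep(scores, 2, colMeans(scores))^2)                     # 97
SS.participants <- ncol(scores) * sum((rowMeans(scores) - grand.mean)^2)   # 87
SS.residual <- SS.within - SS.participants                                 # 10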
12.3.4.4 RM One-way ANOVA Table
We summarize all of the above math in an ANOVA table. This ANOVA table differs from the one-way independent-groups ANOVA table only in that there is an additional identified source of variance – participants – and therefore another row in the table. The generic template for a one-way repeated-measures ANOVA table is shown in Table 12.6.
\(SV\) | \(df\) | \(SS\) | \(MS\) | \(F\) | \(EMS\) |
---|---|---|---|---|---|
Factor A | \(j-1\) | \(SS_A\) | \(MS_A\) | \(MS_A/MS_e\) | \(n\sigma^2_\alpha + \sigma^2_\epsilon\) |
Participants | \(n-1\) | \(SS_p\) | \(MS_p\) | Please ignore me | \(j\sigma^2_\pi + \sigma^2_\epsilon\) |
Error (Residuals) | \((j-1)(n-1)\) | \(SS_{resid}\) | \(MS_{resid}\) | \(\sigma^2_{\epsilon}\) | |
Total | \(jn-1\) | \(SS_{total}\) |
You see that little spot that is asking to be ignored? That’s where the \(F\)-ratio associated with participants would go, i.e., the test of the null hypothesis that individual differences on the population level cause zero variance in the dependent variable. We could calculate an \(F\)-ratio for participants: it would be \(MS_{participants}/MS_e\), and software packages (including R) will calculate it for you. However, we don’t care about it because we already know that individual differences exist. So, yes, it’s there, but we can safely leave it be.
The ANOVA table for the example is shown in Table 12.7. We enter the values for degrees of freedom and for sums of squares as previously calculated; as with the independent-groups analysis (and all ANOVA analyses!), the mean squares are the sums of squares for each row divided by the associated degrees of freedom. The \(F\)-ratio is the ratio of the mean squares associated with the repeated-measures independent variable \((MS_A)\) and the mean squares for error \((MS_e)\). And, at this point we will again note that the \(MS_e\) term is smaller than it could have been because we have accounted for the variance associated with participants, which makes the \(F\)-ratio relatively larger.
\(SV\) | \(df\) | \(SS\) | \(MS\) | \(F\) | \(EMS\) |
---|---|---|---|---|---|
Factor A | \(2\) | \(42\) | \(42/2=21\) | \(MS_A/MS_e=12.6\) | \(n\sigma^2_\alpha + \sigma^2_\epsilon\) |
Participants | \(3\) | \(87\) | \(87/3=29\) | \(MS_p/MS_e=17.4\) but whatever | \(j\sigma^2_\pi + \sigma^2_\epsilon\) |
Error (Residuals) | \(6\) | \(10\) | \(10/6=1.7\) | \(\sigma^2_{\epsilon}\) | |
Total | \(11\) | \(139\) |
The \(F\)-ratio for the example data is \(12.6\). Given that there are \(2\) degrees of freedom for the variance in the numerator of the \(F\)-ratio and \(6\) degrees of freedom for the variance in the denominator, we get the \(p\)-value for this result based on the \(F\)-distribution with \(df_{numerator}=2\) and \(df_{denominator} = 6\):
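That upper-tail probability can be found with pf() (a minimal sketch):

pf(12.6, df1 = 2, df2 = 6, lower.tail = FALSE)   # p-value for the observed F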
## [1] 0.007111971
The \(p\)-value for \(F(2, 6)=12.6\), therefore, is \(0.00711\). If we assume a false-alarm rate of \(\alpha=0.05\), we would reject the null hypothesis that there is no variance associated with factor A.
12.3.4.5 RM One-way ANOVA in R
Here’s the simplest way to do it in R; the code IV+Participant indicates that the independent variable and participants are two factors that predict variance and that don’t interact with each other.
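A call along these lines would produce output like what follows (a sketch, assuming the long-format data frame RM_1way_ex_df used later in this chapter, with a numeric DV column and IV and Participant stored as factors):

# model the DV as a function of the IV and Participant, with no interaction term
summary(aov(DV ~ IV + Participant, data = RM_1way_ex_df))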
## Df Sum Sq Mean Sq F value Pr(>F)
## IV 2 42 21.000 12.6 0.00711 **
## Participant 3 87 29.000 17.4 0.00230 **
## Residuals 6 10 1.667
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
12.3.5 The Adjusted \(F\) Test: Accounting for Violations of Sphericity
Unlike violations of the other assumptions of parametric tests covered thus far (including additivity), which increase the likelihood of type-II errors, violations of sphericity increase the likelihood of type-I errors224. When data in a repeated-measures design violate sphericity, the regular \(F\)-distribution is no longer the correct \(F\)-distribution to compare the observed \(F\) value to: the regular \(F\) distribution will give a \(p\)-value that is too small.
The proper \(F\)-distribution for testing \(H_0\) – the one that preserves the desired \(\alpha\) rate (e.g., \(\alpha = 0.05\)) – has reduced degrees of freedom in the numerator and denominator; it’s known as the adjusted \(F\). The extent to which the degrees of freedom change depends on the severity of the violation.
As noted above, degrees of departure from sphericity are quantified by values of \(\epsilon\). \(\epsilon\), as indicated by its Greekness, is a population-level construct. There are two notable ways to derive estimates based on the observed sample data: the original recipe for the estimate, \(\hat{\epsilon}\), is used in the Geisser-Greenhouse Adjusted \(F\) procedure; the second estimate, \(\tilde{\epsilon}\), is used in the Huynh-Feldt Adjusted \(F\) procedure. Both use the variance-covariance matrix of the observed data.
12.3.5.0.1 The Variance-Covariance Matrix
A variance-covariance matrix is simply a way of arranging the variances of each level of the repeated-measures factor and the covariances of the numbers in each level with each other. The row headers and the column headers of the variance-covariance matrix are the levels of the independent variable:
 | 1 | 2 | ⋯ | \(j\) |
---|---|---|---|---|
1 | (variance-covariance matrix goes here) | | | |
2 | | | | |
⋮ | | | | |
\(j\) | | | | |
Each cell represents the covariance of the factor level indicated by the row header and the factor level indicated by the column header; the covariance of a factor level with itself is the variance of that factor level (which is true of the covariance of any group of numbers and itself), so the diagonal of the matrix contains variances. The off-diagonal cells of the matrix are covariances.
Note: technically, since the variances are just self-covariances, every number in a variance-covariance matrix is a covariance, so sometimes they are simply called covariance matrices.
\[\begin{equation} \begin{pmatrix} s^2_1 & \text{cov}_{1, 2} & \cdots & \text{cov}_{1, j} \\ \text{cov}_{2, 1} & s^2_2 & \cdots & \text{cov}_{2, j} \\ \vdots & \vdots & \ddots & \vdots \\ \text{cov}_{j, 1} & \text{cov}_{j, 2} & \cdots & s^2_j \\ \end{pmatrix} \end{equation}\]
The formula225 for the Geisser-Greenhouse \(\hat{\epsilon}\) is rather beefy and is presented here mostly to enhance understanding of how the variance-covariance matrix is related to the adjusted \(F\) (if you look at this equation and decide you would prefer not to deal with it, there is plenty of software to take care of all of this for us):
\[\hat{\epsilon}=\frac{j^2\left(\bar{E}_D-\bar{E}_{\bullet \bullet}\right)^2}{(j-1)\left(\sum\sum E^2_{jj'}-2j\sum_{j=1}^j \bar{E}^2_{j \bullet}+j^2 \bar{E}^2_{\bullet \bullet}\right)}\] where:
\(j\) is the number of factor levels
\(\bar{E}_D\) is the mean of the Elements on the main Diagonal (i.e., the variances)
\(E_{jj'}\) is the value of the Element (variance or covariance) in row \(j\) and column \(j'\) (since the rows and columns of the variance-covariance matrix are indexed by \(j\), we use \(j\) to indicate the row number and \(j'\) to indicate the column number)
\(\bar{E}_{\bullet \bullet}\) is the mean of all the Elements
\(\sum \sum E^2_{jj'}\) is the sum of each Element squared
\(\bar{E}_{j \bullet}\) is the mean of the Elements in row \(j\)
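To make the formula a little more concrete, here is a minimal sketch of the calculation in R using the example data from Table 12.1 (the object names are hypothetical):

scores <- matrix(c(5, 6, 7,
                   7, 13, 13,
                   2, 4, 6,
                   6, 9, 12), nrow = 4, byrow = TRUE)   # the Table 12.1 data
E <- cov(scores)                 # the variance-covariance matrix of the factor levels
j <- ncol(E)
GG.epsilon <- (j * (mean(diag(E)) - mean(E)))^2 /
  ((j - 1) * (sum(E^2) - 2 * j * sum(rowMeans(E)^2) + j^2 * mean(E)^2))
GG.epsilon                       # 0.78125, the GGe value we will see in the ezANOVA() output below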
The Geisser-Greenhouse procedure tends to be a little too conservative for minor violations of sphericity, which led to the development of the Huynh-Feldt Adjusted \(F\) procedure, which uses a corrected \(\epsilon\) estimate called \(\tilde{\epsilon}\). The Huynh-Feldt \(\tilde{\epsilon}\) adjusts \(\hat{\epsilon}\) with the following equation:
\[\tilde{\epsilon}=\frac{n(j-1)\hat{\epsilon}-2}{(j-1)[n-1-(j-1)\hat{\epsilon}]}\]
The key features of both estimates of \(\epsilon\) (\(\hat{\epsilon}\) and \(\tilde{\epsilon}\)) are:
It is based on the number of levels of the IV, the variances of the DV within each level of the IV, and the covariances of the DV between each of the IV levels.
Its maximum value happens when there is no violation of sphericity, and it is \(\epsilon = 1\).
Its minimum value happens when there is maximal violation of sphericity, and it is \(\epsilon = \frac{1}{j-1}\) (reminder that \(j\) is the number of levels of the independent variable).
The adjusted \(F\) statistic has \(df_{numerator}= \epsilon(j-1)\) and \(df_{denominator}=\epsilon(j-1)(n-1)\), that is, we multiply the \(df\) for unadjusted \(F\) by \(\epsilon\).
As noted above, when sphericity is not violated, \(\epsilon = 1\) and the appropriate \(F\)-distribution is just the regular \(F\)-distribution with \(df_{numerator}\) and \(df_{denominator}\) as indicated in Table 12.6:
\[\text{if}~\epsilon = 1; df_{numerator}=\epsilon(j-1) = j-1\] \[\text{if}~\epsilon = 1; df_{denominator}=\epsilon(j-1)(n-1) = (j-1)(n-1)\] If there is a violation of sphericity and \(\epsilon\) is at its minimum possible value of \(1/(j-1)\), then:
\[\text{if}~\epsilon = \frac{1}{j-1}; df_{numerator}=\epsilon(j-1) = \left(\frac{1}{j-1}\right)(j-1)=1\]
\[\text{if}~\epsilon = \frac{1}{j-1}; df_{denominator}=\epsilon(j-1)(n-1) = \left(\frac{1}{j-1}\right)(j-1)(n-1)=n-1\] Those \(df\) – \(df = (1, n-1)\) – represent the smallest possible \(df\) pair for the adjusted \(F\). Therefore, the \(df\) for the adjusted \(F\) are somewhere between \((1, n-1)\) (the worst possible sphericity violation) and \((j-1, (j-1)(n-1))\) (no violation at all). Please note that the difference between the most and the fewest \(df\) is a factor of \(j-1\): the \(df\) associated with the IV. Thus, we either use all the \(df\) associated with the IV if there is no sphericity violation or completely factor out the \(df\) associated with the IV (leaving \(1\)) if there’s the worst possible violation.
Using the smallest possible \(df\) pair to test hypotheses is known as the conservative \(F\) test:
\[\text{Conservative}~F = F(1, n-1)\]
The conservative \(F\) can be used at any time without ever having to worry about type-I errors that arise from violations of sphericity. The tradeoff is that the conservatism entails less power, and therefore potentially more type-II errors.
Statistical software that runs adjusted \(F\) tests – including the R functions in the example below – will often provide results for both the Geisser-Greenhouse and the Huynh-Feldt procedures.
12.3.5.0.2 Using the Adjusted \(F\)-test in R: Example
As of this writing226, there are several R packages with functions to run repeated-measures ANOVAs with adjusted-\(F\) corrections, including car, rstatix, and ez. We’ll use the ezANOVA function from the ez package because it seems – just based on the names of the function and the package – let’s say, not that difficult to use.
The code below runs a repeated-measures ANOVA for the example data. The arguments in the ezANOVA function indicate the data to be used with data, the dependent variable with dv, the identifier of the block (in this case, the participants) with wid, and the within-participants factor with within. That’s pretty ez. By default, ezANOVA() returns a list of objects: a regular (unadjusted) ANOVA table, the results of Mauchly’s test (which, whatever), and a table of results for both the Geisser-Greenhouse adjusted \(F\)-test and the Huynh-Feldt adjusted-\(F\) procedure. For this example, we are going to save that list as an object called ezANOVAexample so that we can call out each object individually using the $ operator.
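The call might look something like this (a sketch, assuming the long-format data frame RM_1way_ex_df used elsewhere in this chapter):

library(ez)
ezANOVAexample <- ezANOVA(data = RM_1way_ex_df, # the long-format example data
                          dv = DV,              # the dependent variable
                          wid = Participant,    # the participant (block) identifier
                          within = IV)          # the within-participants factor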
First, let’s look at the unadjusted ANOVA results:
ezANOVAexample$ANOVA %>%
remove_rownames() %>% # the rownames are useless - let's get rid of them
kable() # and let's make the table look nice in the document.
Effect | DFn | DFd | F | p | p<.05 | ges |
---|---|---|---|---|---|---|
IV | 2 | 6 | 12.6 | 0.007112 | * | 0.3021583 |
The observed \(F\) does not change: the value listed here is exactly the same one that we calculated using sums of squares and observed \(df\) above, and the \(p\)-value is the one from the upper tail of the \(F\)-distribution with \(df_{numerator}=j-1=2\) and \(df_{denominator} = (j-1)(n-1)=6\) (note: ges stands for Generalized Eta Squared).
Next, we’ll take a quick glance at the results of Mauchly’s test:
Effect | W | p | p<.05 |
---|---|---|---|
IV | 0.72 | 0.72 |
According to Mauchly’s test – which, lest we forget, is considered insufficiently powerful, and that is especially true for cases (like our example) with small sample size – the example data do not represent a violation of the sphericity assumption.
Finally, and most importantly, let’s look at the last of the three objects produced by the ezANOVA() function: the results of adjusted-\(F\) analyses. The Geisser-Greenhouse results are indicated by GG and the Geisser-Greenhouse \(\hat{\epsilon}\) is indicated by GGe; the Huynh-Feldt results are indicated by HF and the Huynh-Feldt \(\tilde{\epsilon}\) is indicated by HFe. Please note that the adjusted degrees of freedom are not explicitly printed in the output, but it will be pretty easy to determine those ourselves based on the \(\epsilon\) values:
Effect | GGe | p[GG] | p[GG]<.05 | HFe | p[HF] | p[HF]<.05 |
---|---|---|---|---|---|---|
IV | 0.78125 | 0.0146429 | * | 1.478261 | 0.007112 | * |
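The p[GG] value can be reproduced by scaling both \(df\) by the Geisser-Greenhouse \(\hat{\epsilon}\) (a sketch, using the values in the table above):

pf(12.6, df1 = 0.78125 * 2, df2 = 0.78125 * 6, lower.tail = FALSE)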
## [1] 0.0146429
The Geisser-Greenhouse test, therefore, indicates that the result is statistically significant and robust to violations of sphericity. It’s like: yeah, there’s a little non-sphericity but we’re good, and we don’t have to worry about extra type-I errors.
Now, let’s look at the Hunyh-Feldt results. The Huynh-Feldt \(\tilde{\epsilon}\) is 1.48, which, as we all know, is greater than 1. Estimates of \(\epsilon\) aren’t supposed to be greater than 1.
And, it’s not a typo or a bug in the software, either. In this example, \(n=4\), \(j = 3\), and \(\hat{\epsilon}=0.78\) (the Geisser-Greenhouse \(\hat{\epsilon}\) from the GG part of the Sphericity Corrections part of the ezANOVA() output):
\[ \begin{aligned} \tilde{\epsilon}&=\frac{n(j-1)\hat{\epsilon}-2}{(j-1)[n-1-(j-1)\hat{\epsilon}]}\\&=\frac{4(3-1)(0.78)-2}{(3-1)[4-1-(3-1)(0.78)]}\\&=1.48 \end{aligned} \]
This is just one of those things that can happen to statistical estimates. Sometimes, if the data are weird – and a repeated-measures design with \(n=4\) definitely qualifies as weird – you can get absurd values, like confidence intervals on proportions extending to less than zero or greater than one, or negative estimates. A common remedy for that is to simply set such an estimate to the minimum or maximum (as appropriate) non-absurd value for that estimate, and we can see that the ezANOVA() function has done just that. The \(p\)-value for \(F_{obs}=12.6\) for the Huynh-Feldt procedure is given as \(p = 0.007112\): that’s the same \(p\)-value as we got when we ran a conventional \(F\)-test in R, and it is the upper-tail probability for \(F=12.6\) for an \(F\)-distribution with \(df_{numerator}=j-1=2\) and \(df_{denominator}= (j-1)(n-1)=(2)(3)=6\):
## [1] 0.007111971
In other words, the software treats estimates of \(\epsilon\) that are greater than \(1\) as if those estimates of \(\epsilon\) are exactly 1. For our example data, \(\tilde{\epsilon}=1\) is an indication that there is no sphericity violation. It makes sense that the two procedures would disagree somewhat, given that the Huynh-Feldt procedure is more liberal about sphericity than the more conservative Geisser-Greenhouse test.
To sum up the results for the example data with regard to sphericity corrections, both methods – the Geisser-Greenhouse adjusted-\(F\) test and the Huynh-Feldt adjusted-\(F\) procedure – indicate a significant effect of the repeated-measures factor when taking into account the increased risk of type-I error associated with possible violations of the sphericity assumption. That’s great!
Now, you might be asking: what if the Geisser-Greenhouse and the Huynh-Feldt results don’t agree? There’s no strict rule, but there is a general recommendation based on the fact that the Geisser-Greenhouse tends to be too conservative when \(\epsilon\) is large and the Huynh-Feldt correction tends to be too liberal when \(\epsilon\) is small: use the Geisser-Greenhouse results when the Geisser-Greenhouse \(\hat{\epsilon}\le0.75\) and the Huynh-Feldt correction otherwise.227
12.3.6 Effect Size and Population Variance Components
Effect size calculations for the repeated-measures design are similar to the corresponding calculations for the independent-groups design. The formula to calculate \(\eta^2\) is exactly the same:
\[\eta^2=\frac{SS_{A}}{SS_{total}}\] For the example data, \(\eta^2=42/139=0.31\), indicating a large effect.
In order to calculate \(\omega^2\), we first need to calculate the population variance components. As indicated by the model statement, there are three variance components accounted for by the repeated-measures design:
component | source |
---|---|
\(\sigma^2_\alpha\) | Factor A |
\(\sigma^2_\pi\) | Participants |
\(\sigma^2_\epsilon\) | Error |
As is the case for the one-way independent-groups design, the population variance components for repeated-measures designs are estimated using the observed mean squares values from the ANOVA table, and the formulas for those estimates are derived from the expected mean squares formulas. The component estimates \(\hat{\sigma}^2_\epsilon\) and \(\hat{\sigma}^2_\alpha\) are calculated in exactly the same way as they were for the independent-groups ANOVA:
\[EMS_e=\sigma^2_\epsilon\approx MSe, \\ \hat{\sigma}^2_\epsilon=MSe\] Measurement error is always considered a random factor, so there is never a need to correct the estimate of \(\hat{\sigma}^2_\epsilon\).
\[EMS_A=\sigma^2_\epsilon+n\sigma^2_\alpha\approx MS_A,\\ \hat{\sigma}^2_\alpha=\frac{MS_A-MS_e}{n}\] If \(A\) is a random factor, there is no need to correct the estimate; if \(A\) is a fixed factor, then we adjust by multiplying the estimate by \(df_A/(df_A+1)\):
\[\hat{\sigma}^2_\alpha=\frac{MS_A-MS_e}{n}\left(\frac{df_A}{df_A+1}\right) \text{ if A is fixed}\] The new kid on the population variance component block for repeated-measures designs is \(\sigma^2_\pi\), the variance contributed by participant differences:
\[EMS_p=\sigma^2_\epsilon+j\sigma^2_\pi\approx MS_p,\\ \hat{\sigma}^2_\pi=\frac{MS_p-MS_e}{j}\] Like \(\hat{\sigma}^2_\epsilon\), there is never a need to correct the estimate because a participants factor is always considered random.228
For the example data, \(j=3\), \(n = 4\), \(MS_A=21\), \(MS_p=29\), and \(MS_e=1.67\). If we assume that \(A\) is a random factor, then the population variance component estimates are:
\[\hat{\sigma}^2_\epsilon=MS_e=1.67\]
\[\hat{\sigma}^2_\pi=\frac{MS_p-MS_e}{j}=\frac{29-1.67}{3}=9.11\]
\[\hat{\sigma}^2_\alpha=\frac{MS_A-MS_e}{n}=\frac{21-1.67}{4}=4.83\]
\[\omega^2=\frac{\hat{\sigma}^2_\alpha}{\hat{\sigma}^2_\alpha+\hat{\sigma}^2_\pi+\hat{\sigma}^2_\epsilon}=\frac{4.83}{4.83+9.11+1.67}=0.31\]
(note the order in which the population variance components formulas were presented: there’s no rule involved here, but a little tip is that it’s easiest to calculate the components based on their position in the ANOVA table, from the bottom row of the table up)
If factor \(A\) is fixed, then the only component estimate that changes is \(\hat{\sigma}^2_\alpha\), and we recalculate \(\omega^2\) based on the updated \(\hat{\sigma}^2_\alpha\):
\[\hat{\sigma}^2_\alpha=\frac{MS_A-MS_e}{n}\left(\frac{df_A}{df_A+1}\right)=\frac{21-1.67}{4}\left(\frac{2}{3}\right)=3.22\]
\[\omega^2=\frac{\hat{\sigma}^2_\alpha}{\hat{\sigma}^2_\alpha+\hat{\sigma}^2_\pi+\hat{\sigma}^2_\epsilon}=\frac{3.22}{3.22+9.11+1.67}=0.23\]
Either way, the effect of factor \(A\) would be considered large by the Cohen (1988) guidelines.
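For reference, the variance-component arithmetic above can be reproduced in a few lines of R (a sketch; the object names are hypothetical):

MS.A <- 21; MS.p <- 29; MS.e <- 1.67; n <- 4; j <- 3; df.A <- j - 1
sigma2.e <- MS.e                                        # error component
sigma2.pi <- (MS.p - MS.e) / j                          # participant component: 9.11
sigma2.alpha <- (MS.A - MS.e) / n                       # factor A component if A is random: 4.83
sigma2.alpha.fixed <- sigma2.alpha * df.A / (df.A + 1)  # factor A component if A is fixed: 3.22
sigma2.alpha / (sigma2.alpha + sigma2.pi + sigma2.e)              # omega-squared, A random: ~0.31
sigma2.alpha.fixed / (sigma2.alpha.fixed + sigma2.pi + sigma2.e)  # omega-squared, A fixed: ~0.23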
12.3.7 Power Analysis for the One-Way Repeated-Measures ANOVA
Power analysis for repeated-measures ANOVA is almost the same as it is for independent-groups ANOVA, except for one thing: accounting for sphericity violations.
Well, and one other, super-minor thing that only matters if we are using R: the built-in stats package doesn’t have a function for repeated-measures analyses. So, we’re going to have to use a different package – WebPower – that has a wp.rmanova (the wp stands for Web Power) function with slightly different arguments than the power.anova.test that we used for independent-groups analysis.
Let’s imagine that we want to do a proper experiment based on our extremely-small-sample-size example, and that we want to find out how many participants we would need for such an experiment in order to get a power estimate of 0.8. In the following code, we are going to set the power, the number of groups, the effect size, and the \(\alpha\)-rate and solve for \(n\). Since n is the value we want returned, we set the argument n as NULL. The number of groups ng is simply \(1\) for any purely repeated-measures ANOVA (in mixed ANOVA, which uses both independent-groups and repeated-measures factors, there is more than one set of participants, based on the number(s) of level(s) of the independent-groups factor(s)). The number of measurements nm is the number of levels of the repeated-measures factor. The effect-size measure used in wp.rmanova is one we haven’t seen before called Cohen’s \(f\) (and yes, it’s that Cohen. He’s got a \(d\) and an \(f\)), and it’s directly related to \(\eta^2\):
\[f=\sqrt{\frac{\eta^2}{1-\eta^2}}\]
Effect size guidelines for \(f\) are \(0.1\) = small, \(0.25\) = medium, and \(0.4\) = large (note: like \(d\) and unlike \(\eta^2\) and \(\omega^2\), \(f\) has no upper bound). nscor is an estimate of \(\hat{\epsilon}\): if in your power analysis you want to account for the worst possible sphericity violation, you might use the least-spherical value of \(1/(j-1)\); if you are feeling bold about expecting the data not to violate sphericity, the most-spherical value for nscor is 1. The argument alpha is the \(\alpha\)-rate you will use. The type argument has three possible values, only one of which we would use for a purely repeated-measures analysis: type = 0 indicates that we want to test an independent-groups variable, which we would have in a mixed design but do not here; type = 1 indicates that we want to test a repeated-measures variable, which is what we want; and type = 2 is for testing an interaction: also not what we want. And, finally, power is the desired power rate \(1-\beta\).
So, in this example, we will fill in the arguments based on everything we have calculated for this example so far.
library(WebPower)
wp.rmanova(n = NULL, # number of participants
ng = 1, # number of groups of participants
nm = 3, # number of measurements (factor levels)
f = 0.67, # Cohen's f (plugging in the observed eta-squared)
nscor = .78, # nonsphericity correction coefficient (the observed G-G epsilon)
alpha = 0.05, # alpha rate
type = 1, # type of analysis (1 = within-participants effect)
power=0.8) # desired power
## Repeated-measures ANOVA analysis
##
## n f ng nm nscor alpha power
## 27.22575 0.67 1 3 0.78 0.05 0.8
##
## NOTE: Power analysis for within-effect test
## URL: http://psychstat.org/rmanova
Turns out we would need about 28 participants229 to correctly reject the null hypothesis 80% of the time in future experiments. Did we just get extraordinarily lucky with our example data? Sure, you could say that, but in another, more correct way, you could say that I carefully selected the numbers in the example to get a significant result with a set of numbers that fits neatly on the page.
12.3.8 Friedman Test: Nonparametric RM ANOVA
The nonparametric repeated-measures one-way ANOVA is known as the Friedman Test230. Like the Wilcoxon signed-rank test and the nonparametric \(\rho\) correlation coefficient, the Friedman test is based on the ranks of the observed data relative to each other. Specifically, each observed value of the dependent variable is ranked within each participant, so that the smallest value for a participant across levels of the repeated-measures factor gets the smallest rank for that participant and the largest value for a participant gets the largest rank.
Here again are the example repeated-measures data used in this chapter:
Participant | A1 | A2 | A3 |
---|---|---|---|
P1 | 5 | 6 | 7 |
P2 | 7 | 13 | 13 |
P3 | 2 | 4 | 6 |
P4 | 6 | 9 | 12 |
For each participant, the DV values are ranked from smallest (1) to largest. When there are ties – in this example, the \(A2\) and \(A3\) values for participant \(2\) are the same – the ranks are also tied at the average of the rank values those values otherwise would have; in this case, spots \(2\) and \(3\) are occupied and thus averaged, so the ranks would be \(\{1, 2.5, 2.5\}\). Then, we calculate the sums of the ranks for each level of the repeated-measures factor (see the last row of Table 12.10).
Participant | A1 | A2 | A3 | \(\bar{r}_{i\bullet}\) |
---|---|---|---|---|
P1 | 1.0 | 2.0 | 3.0 | 2.0 |
P2 | 1.0 | 2.5 | 2.5 | 2.0 |
P3 | 1.0 | 2.0 | 3.0 | 2.0 |
P4 | 1.0 | 2.0 | 3.0 | 2.0 |
Sum \((R_j)\) | 4.0 | 8.5 | 11.5 | |
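The ranking-and-summing step can also be reproduced with a few dplyr verbs; this is just a sketch, assuming the long-format RM_1way_ex_df data frame used with friedman.test() below:
library(dplyr)
RM_1way_ex_df %>%
  group_by(Participant) %>% # rank each participant's DV values ...
  mutate(rank_within = rank(DV)) %>% # ... with ties given averaged ranks (e.g., 2.5 and 2.5)
  group_by(IV) %>%
  summarize(R_j = sum(rank_within)) # rank sums per factor level: 4.0, 8.5, 11.5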
The formula for the Friedman test statistic is231:
\[\chi^2_{ranks}=\frac{12\sum_1^j \left(R_j - \frac{n(j+1)}{2}\right)^2}{nj(j+1)-\frac{\sum_1^n \left(t^3-t\right)}{j-1}}\] where \(n\) is the number of participants (as usual), \(j\) is the number of levels of the repeated-measures factor (as usual), \(R_j\) is the sum of the ranks for factor level \(j\), and \(t\) is the number of tied observations per participant. To get the term \(\sum_1^n \left(t^3-t\right)\) (located in the denominator), we count up the \(t\) tied DV values for each participant and find \(t^3-t\), then add the \(t^3-t\) values across all of the participants. If there are no tied values, then the whole term \(\frac{\sum_1^n \left(t^3-t\right)}{j-1}\) equals \(0\), and the denominator is simply \(nj(j+1)\).
We can plug the data from Table 12.10 into the Friedman test statistic formula to get an observed Friedman \(\chi^2\) value:
\[ \begin{aligned} \chi^2_{ranks}&=\frac{12\sum_1^j \left(R_j - \frac{n(j+1)}{2}\right)^2}{nj(j+1)-\frac{\sum_1^n \left(t^3-t\right)}{j-1}} \\ &=\frac{12\left[ \left(4 - \frac{4(3+1)}{2}\right)^2+\left(8.5 - \frac{4(3+1)}{2}\right)^2+\left(11.5 - \frac{4(3+1)}{2}\right)^2\right]}{(4)(3)(3+1)-\frac{2^3-2}{3-1}} \\ &=7.6 \end{aligned} \]
The observed value can be evaluated using a critical value table; if the number of groups and/or the number of participants exceeds the columns and/or the rows of the table, respectively, then we compare the observed Friedman \(\chi^2_{ranks}\) to a \(\chi^2\) distribution with \(df=j-1\).
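The arithmetic above can be double-checked in R using the rank sums from the table (n, j, and R_j here are just scratch names):
n <- 4 # number of participants
j <- 3 # number of levels of the repeated-measures factor
R_j <- c(4, 8.5, 11.5) # rank sums per level
tie_term <- (2^3 - 2) / (j - 1) # one participant (P2) has t = 2 tied values
12 * sum((R_j - n * (j + 1) / 2)^2) / (n * j * (j + 1) - tie_term) # 7.6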
Or, we can do this:
friedman.test(DV ~ IV | Participant, # the DV is a function of the IV by participant
data = RM_1way_ex_df)
##
## Friedman rank sum test
##
## data: DV and IV and Participant
## Friedman chi-squared = 7.6, df = 2, p-value = 0.02237
12.3.9 Bayesian RM ANOVA
The Bayesian ANOVA for repeated-measures designs follows the same logic as its independent-groups counterpart. To analyze repeated-measures data with a factor that has three or more levels using the BayesFactor
package, we simply need to A. add the variable that indicates participants to the model statement and B. indicate that the participants variable is an identifier for repeated measurements.
library(BayesFactor)
anovaBF(DV ~ IV + Participant, # the model: DV as a function of the IV plus participants
        data = RM_1way_ex_df,
        whichRandom = "Participant") # indicates which variable is the repeated-measures identifier
## Bayes factor analysis
## --------------
## [1] IV + Participant : 6.080263 ±0.82%
##
## Against denominator:
## DV ~ Participant
## ---
## Bayes factor type: BFlinearModel, JZS
For the example data, the Bayes Factor is about \(6\)232; that indicates that a model that includes the independent variable and participants is about \(6\) times more probable than a model that only includes participants. The strength of evidence for \(BF \approx 6\) is considered substantial by both the Jeffreys (1961) and the Kass & Raftery (1995) guidelines.
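If you want the Bayes factor as a plain number (for reporting or further calculation), the extractBF() function from the BayesFactor package will pull it out of the result object; a minimal sketch:
bf_result <- anovaBF(DV ~ IV + Participant,
                     data = RM_1way_ex_df,
                     whichRandom = "Participant")
extractBF(bf_result)$bf # roughly 6, give or take a little MCMC noise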
12.3.10 Nonadditivity
The additivity assumption implies that each participant responds to each level of the IV factor(s) in approximately the same way; in other words, that there is no participant-by-treatment interaction. If we have reason to believe that such an interaction exists, and we are willing to deal with the consequences of that, there is a nonadditive model for repeated-measures designs.
The one-way nonadditive model is:
\[y_{ij}=\mu+\alpha_j+\pi_i+\alpha\pi_{ij}+\epsilon_{ij}\]
where \(\alpha\pi_{ij}\) represents the participant-by-treatment (or, blocks-by-treatment) interaction.
The additivity assumption can be statistically assessed with Tukey’s test of additivity, available in the asbio package.
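Something like the following call will do it; this is a sketch, assuming asbio’s tukey.add.test() function and the RM_1way_ex_df data frame used earlier in this section:
library(asbio)
# Tukey's one-df additivity test: DV, treatment factor, and participant (block) factor
with(RM_1way_ex_df, tukey.add.test(DV, IV, Participant))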
##
## Tukey's one df test for additivity
## F = 3.2086535 Denom df = 5 p-value = 0.1332467
\(SV\) | \(df\) | \(SS\) | \(MS\) | \(F\) | \(EMS\) |
---|---|---|---|---|---|
Factor A | \(j-1\) | \(SS_A\) | \(MS_A\) | \(MS_A/MS_{e}\) | \(n\sigma^2_\alpha + \sigma^2_{\alpha\pi}+\sigma^2_\epsilon\) |
Participants | \(n-1\) | \(SS_p\) | \(MS_p\) | | \(j\sigma^2_\pi + (1-j/J)\sigma^2_{\alpha\pi}+\sigma^2_\epsilon\) |
A*Participants | \((j-1)(n-1)\) | \(SS_{e}\) | \(MS_{e}\) | | \(\sigma^2_{\alpha\pi}+\sigma^2_{\epsilon}\) |
Total | \(jn-1\) | \(SS_{total}\) | | | |
These are the exact same results we got when assuming additivity. There’s not an appreciable difference for the one-way design between the additive and nonadditive analyses except for the fact that we call the error term participants by factor A instead of simply residuals. That’s mere semantics for the one-way design; the difference is much more meaningful in factorial repeated-measures designs. To see it in action, the code below adds four more example participants to the example data and refits the model with the participant-by-IV term written out in the Error() statement.
library(dplyr)
# add four more example participants (P5 through P8) to the original example data
RM_1way_nonadd_ex_df <-
  RM_1way_ex_df %>%
  bind_rows(data.frame(Participant = paste0("P",
                                            rep(5:8, each = 3)),
                       IV = paste0("A",
                                   rep(1:3, 4)),
                       DV = c(5, 6, 7,
                              6, 7, 8,
                              7, 8, 9,
                              10, 11, 12)))
# repeated-measures ANOVA with the participant-by-IV interaction as an explicit error stratum
summary(aov(DV ~ IV + Participant + Error(IV * Participant),
            data = RM_1way_nonadd_ex_df))
##
## Error: IV
## Df Sum Sq Mean Sq
## IV 2 43 21.5
##
## Error: Participant
## Df Sum Sq Mean Sq
## Participant 7 130.5 18.64
##
## Error: IV:Participant
## Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 14 17 1.214
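The condensed table that follows lays out the same sums of squares in a single, unstratified ANOVA table; one way to get such a table is to fit the fully crossed model without an Error() term. This is a sketch of such a call, assuming the same expanded data frame (not necessarily the exact call used):
# the fully crossed model: IV, Participant, and their interaction, with no Error() strata
summary(aov(DV ~ IV * Participant, data = RM_1way_nonadd_ex_df))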
## Df Sum Sq Mean Sq
## IV 2 43.0 21.500
## Participant 7 130.5 18.643
## IV:Participant 14 17.0 1.214
12.3.10.1 Issues with Nonadditive Analyses
As noted, in the one-way case, nonadditive models are for all intents and purposes the same as additive models. However, as repeated-measures models include more factors, nonadditive models become more complex and more difficult to use and interpret. The central idea of repeated-measures models is that there are measurable differences between individuals with regard to the dependent variable; it is harder to understand and explain why those differences would differ systematically within levels of different factors. Because of that, it often makes more sense to researchers to pool the individual differences across blocks – which is what additivity does – than to use separate participant-by-treatment terms – which is what nonadditivity does. Nonadditive analyses also have generally reduced power relative to additive analyses, and make it relatively difficult to ascertain effect sizes for the repeated-measures factors233. For all these reasons, the general recommendation is to avoid nonadditive analyses if possible.
12.3.10.2 Appendix: How \(t^2 = F\)
The \(F\)-statistic is the ratio of the between-groups mean squares and the within-groups mean squares:
\[F=\frac{MS_{between}}{MS_{within}}\] To calculate \(SS_{between}\), we take for each group the squared deviation between the group mean (\(M\)) and the grand mean (\(GM\)) and multiply by the number of observations in that group \(n\). Then, we sum the results across the groups.
\[SS_{between}=\sum n(M-GM)^2\] Because we are going to compare formulas between ANOVA and a \(t\)-test, let’s call each group mean \(\bar{x}\) instead of \(M\) to match our \(t\)-test notation:
\[SS_{between}=\sum n(\bar{x}-GM)^2\] In the case that we have two groups and equal \(n\), the grand mean of the data is just the average of \(\bar{x}_1\) and \(\bar{x}_2\). In addition, we can leverage the fact that the average of two numbers is exactly halfway between those two numbers.
So, for any \(\bar{x}_1\) and \(\bar{x}_2\), the difference between \(\bar{x}_1\) and the grand mean will be a mirror reflection of the difference between \(\bar{x}_2\) and the grand mean:
\[\bar{x}_1-GM=-(\bar{x}_2-GM)\] The distance between each \(\bar{x}\) and the grand mean is half of the difference between the two means. And, since we are going to square both \((\bar{x}_1-GM)\) and \((\bar{x}_2-GM)\) in our calculations, the sign of the difference is irrelevant, thus:
\[(\bar{x}_1-GM)^2=(\bar{x}_2-GM)^2=\left(\frac{\bar{x}_1-\bar{x}_2}{2}\right)^2\] All that means that – in this case, with two groups and equal \(n\) in each group – our \(SS_{between}\) formula simplifies to:
\[SS_{between}=n\left(\frac{\bar{x}_1-\bar{x}_2}{2}\right)^2+n\left(\frac{\bar{x}_1-\bar{x}_2}{2}\right)^2 =2n\left(\frac{\bar{x}_1-\bar{x}_2}{2}\right)^2\] which further simplifies to:
\[SS_{between}=\frac{n(\bar{x}_1-\bar{x}_2)^2}{2}\] \(MS_{between}\) is equal to \(SS_{between}\) divided by \(df_{between}\). In this case (where \(k=2\)), \(df_{between}=k-1=1\). Thus:
\[MS_{between}=\frac{SS_{between}}{1}=SS_{between}=\frac{n(\bar{x}_1-\bar{x}_2)^2}{2}\]
\(SS_{within}\) is the sum of the squared deviations of each observation from the group mean:
\[SS_{within}=\sum(x-\bar{x})^2\] For the case of two groups, we can write the \(SS_{within}\) formula as:
\[SS_{within}=\sum(x-\bar{x}_1)^2+ \sum(x-\bar{x}_2)^2\] As we learned at the beginning of the semester, the sample variance of a set of observations is \(s^2=\frac{\sum(x-\bar{x})^2}{n-1}\), which means that each term of the \(SS_{within}\) formula is \(s^2\) times \(n-1\) (which removes the denominator). Therefore:
\[\sum(x-\bar{x}_1)^2+ \sum(x-\bar{x}_2)^2=(n_1-1)s_1^2+(n_2-1)s_2^2\] Because \(n\) is the same between groups, we can simplify that formula to:
\[SS_{within}=(n-1)(s_1^2+s_2^2)\] \(MS_{within}\) is the ratio of \(SS_{within}\) and \(df_{within}\):
\[MS_{within}=\frac{SS_{within}}{k(n-1)}\] In this case, \(k=2\) and \(SS_{within}=(n-1)(s_1^2+s_2^2)\), so we can rewrite \(MS_{within}\) as:
\[MS_{within}=\frac{(n-1)(s_1^2+s_2^2)}{2(n-1)}=\frac{s_1^2+s_2^2}{2}\]
Please note: for the case of equal \(n\) (which we have), the above is equal to the pooled variance \(s_p^2\).
The formula for the \(F\)-statistic is:
\[F=\frac{MS_{between}}{MS_{within}}\] Pulling together our work from above, we can rewrite the \(F\) formula as:
\[F=\frac{\frac{n(\bar{x}_1-\bar{x}_2)^2}{2}}{\frac{s_1^2+s_2^2}{2}}=\frac{n(\bar{x}_1-\bar{x}_2)^2}{s_1^2+s_2^2}\]
For the same data, the \(t\) equation is:
\[t=\frac{\bar{x}_1-\bar{x}_2}{se_{pooled}}\]
where
\[se_{pooled}=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}\]
For the case of equal \(n\), the pooled standard error is:
\[se_{pooled}=\sqrt{\frac{s_1^2+s_2^2}{n}}=\frac{\sqrt{s_1^2+s_2^2}}{\sqrt{n}}\] Putting our \(t\) math together:
\[t=\frac{\bar{x}_1-\bar{x}_2}{\frac{\sqrt{s_1^2+s_2^2}}{\sqrt{n}}}\] That’s kind of a mess! Instead of writing the \(t\) formula as the ratio of the difference \(\bar{x}_1-\bar{x}_2\) and the pooled standard error, let’s write it as the product of the difference of the means and the reciprocal of the standard error:
\[t=(\bar{x}_1-\bar{x}_2)\left(\frac{\sqrt{n}}{\sqrt{s_1^2+s_2^2}} \right)\]
Now it’s easier to see that:
\[t=\frac{\sqrt{n}(\bar{x}_1-\bar{x}_2)}{\sqrt{s_1^2+s_2^2}}\]
And, finally, if we square \(t\):
\[t^2=\left(\frac{\sqrt{n}(\bar{x}_1-\bar{x}_2)}{\sqrt{s_1^2+s_2^2}}\right)^2=\frac{n(\bar{x}_1-\bar{x}_2)^2}{s_1^2+s^2_2}=F\]
we get \(F\).
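A quick numerical check of the equivalence, using some made-up two-group data (the object names here are arbitrary):
set.seed(12) # arbitrary seed for reproducibility
two_group_df <- data.frame(DV = c(rnorm(10, mean = 5), rnorm(10, mean = 7)),
                           group = rep(c("A", "B"), each = 10))
t_obs <- t.test(DV ~ group, data = two_group_df, var.equal = TRUE)$statistic
F_obs <- summary(aov(DV ~ group, data = two_group_df))[[1]]$`F value`[1]
c(t_squared = unname(t_obs)^2, F = F_obs) # the two values match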
There are ways to conduct multiple pairwise tests that manage the overall \(\alpha\)-rate when more than two groups are being studied at a time. One way is to reduce the \(\alpha\)-rate for each test in order to keep the overall \(\alpha\)-rate where one wants it (e.g., \(\alpha=0.05\)): this approach is generally known as applying the Bonferroni correction but should be known as applying the Dunn-Bonferroni correction to honor Olive Jean Dunn as well as Carlo Bonferroni. The other way is through the use of post-hoc pairwise tests like Tukey’s Honestly Significant Difference (HSD) test and the Hayter-Fisher test, which are discussed below.↩︎
This pair of null and alternative hypotheses may appear odd, because usually when the null hypothesis includes the \(=\) sign, the alternative hypothesis features the \(\ne\) sign, and when the alternative has the \(>\) sign, the null has the \(\le\) sign. Technically, either pair of signs – \(=/\ne\) or \(\le/>\) – would be correct to use, it’s just that because variances can’t be negative, \(\sigma^2\le0\) is somewhat less meaningful than \(\sigma^2=0\), and \(\sigma^2\ne 0\) is somewhat less meaningful than \(\sigma^2>0\).↩︎
The \(F\) stands for Fisher but it was named for him, not by him, for what it’s worth.↩︎
The choice of the letter \(\alpha\) is to indicate that it is the first (\(\alpha\) being the Greek counterpart to \(A\)) of possibly multiple factors: when we encounter designs with more than one factor, the next factors will be called \(\beta\), \(\gamma\), etc.↩︎
The SAS software platform doesn’t even have a \(t\)-test command: to perform a \(t\)-test in SAS, one uses the ANOVA procedure.↩︎
To my knowledge, there are only three such effect-size statistics. The third, \(\epsilon^2\), is super-obscure: possibly so obscure as to be for ANOVA-hipsters only. The good news is that estimates of \(\epsilon^2\) tend to fall between estimates of \(\eta^2\) and estimates of \(\omega^2\) for the same data, so it can safely be ignored.↩︎
Population-level error may seem an odd term because we so frequently talk of error as coming from sampling error – that is the whole basis for the standard error statistic. However, even on the population level, there is variance that is not explained by the factor(s) that serve(s) as the independent variable(s): the naturally-occurring width of the population-level distribution of a variable.↩︎
Honestly, the effect-size statistics for the example data are absurdly large. There should be another category above large called stats-class-example large. Or something like that.↩︎
Base R is not sufficiently respectful of the contributions of Olive Jean Dunn↩︎
There are more post hoc tests, but the ones listed here (a) all control the \(\alpha\)-rate at the overall specified level and (b) are plenty.↩︎
The \(12\) is always just \(12\) and the \(3\) is always just \(3\), in case you were wondering.↩︎
Cornell, J. E., Young, D. M., Seaman, S. L., & Kirk, R. E. (1992). Power Comparisons of Eight Tests for Sphericity in Repeated Measures Designs. Journal of Educational Statistics, 17(3), 233–249. https://doi.org/10.2307/1165148↩︎
the lesson, as always: statisticians suck at naming things↩︎
a rare W for the statistic-namers↩︎
I, for one, have never tasted lobster due to allergies and likely never will, but I am well aware that other people not only tolerate lobster but seem to very much enjoy it.↩︎
it’s safe to consider the additive model the default.↩︎
this is not in any way a realistic example. Please avoid conducting experiments with only \(4\) people.↩︎
The classic paper on this topic is Box, 1954. Lane (2016) provides a slightly more readable explanation.↩︎
Kirk, R. E. (1995). Experimental design: Procedures for the behavioral sciences (3rd).↩︎
2024-05-03↩︎
Verma, J. P. (2015). Repeated measures design for empirical researchers. John Wiley & Sons.; Kirk, R. E. (1995). Experimental design: Procedures for the behavioral sciences (3rd).↩︎
for participants to be a fixed factor, one would need a number of participants that approaches the human population, but if you ever are running an experiment where you actually have a number of participants that approaches the human population, then, please, immediately stop doing inferential statistics. There is no need to do stats if you can observe all of the things.↩︎
ALWAYS round up the n in a power calculation!↩︎
like Mauchly’s test, Friedman’s test is named after someone who was more famous for other things, but Friedman’s contributions to society were decidedly less cool than Mauchly’s↩︎
Note: there are several versions of the Friedman test equation from different sources. This formula A. can correct for ties and B. is the one R uses.↩︎
I say “about” because the MCMC algorithm by its nature includes randomness, so running anovaBF multiple times on the data will yield slightly different results each time.↩︎
Kirk, R. E. (1995). Experimental design: Procedures for the behavioral sciences (3rd).↩︎