3.3 One-way ANOVA

When your (categorical) independent variable has only two groups, you can test whether the means of the (continuous) dependent variable are significantly different or not with a t-test. When your independent variable has more than two groups, you can test whether the means are different with an ANOVA.

For example, say we want to test whether there is a significant difference between the average prices of entire homes and apartments, private rooms, and shared rooms. Let’s have a look at the means per room type:

## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 4
##   room_type       count mean_price sd_price
##   <chr>           <int>      <dbl>    <dbl>
## 1 Entire home/apt 11082      113.     118. 
## 2 Private room     6416       64.3     46.5
## 3 Shared room       153       49.6     33.9

We can also plot these means in a bar plot:

Unsurprisingly, entire homes or apartments are priced higher than private rooms, which in turn are priced higher than shared rooms. We also see that there are almost twice as many entire homes and apartments than private rooms available and there are hardly any shared rooms available. Also, the standard deviation is much higher in the entire homes or apartments category than in the private or shared room categories.

An ANOVA can test whether there are any significant differences in the mean prices per room type. But before we carry out an ANOVA, we need to check whether the assumptions for ANOVA are met.

3.3.1 Assumption: normality of residuals

The first assumption is that the dependent variable (price) is normally distributed at each level of the independent variable (room_type). First, let’s visually inspect whether this assumption will hold:

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We see that there is right-skew for each room type. We can also formally test, within each room type, whether the distributions are normal with the Shapiro-Wilk test. For example, for the shared rooms:

## 
##  Shapiro-Wilk normality test
## 
## data:  airbnb.shared$price
## W = 0.83948, p-value = 1.181e-11

The p-value of this test is extremely small, so the null hypothesis that the sample comes from a normal distribution should be rejected. If we try the Shapiro-Wilk test for the private rooms:

## Error in shapiro.test(airbnb.private$price): sample size must be between 3 and 5000

We get an error saying that our sample size is too large. To circumvent this problem, we can try the Anderson-Darling test from the nortest package:

## 
##  Anderson-Darling normality test
## 
## data:  airbnb.private$price
## A = 372.05, p-value < 2.2e-16

Again, we reject the null hypothesis of normality. I leave it as an exercise to test the normality of the prices of the entire homes and apartments.

Now that we know the normality assumption is violated, what can we do? We can consider transforming our dependent variable with a log-transformation:

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As you can see, a log-transformation normalizes a right-skewed distribution. We could then perform the ANOVA on the log-transformed dependent variable. However, in reality it is often safe to ignore violations of the normality assumption (unless you’re dealing with small samples, which we are not). We’re going to simply carry on with the untransformed price as dependent variable.

3.3.2 Assumption: homogeneity of variances

A second assumption we need to check is whether the variances of our dependent variable price are equal across the categories of our independent variable room_type. Normally, a box plot is informative:

But in this case, the interquartile ranges (the heights of the boxes), which would normally give us an idea of the variance within each room type, are very narrow. This is because the range of Y-values to be plotted is very wide due to some extreme outliers. If we look at the standard deviations, however, we see that these are much higher for the entire rooms and apartments than for the private and shared rooms:

## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 4
##   room_type       count mean_price sd_price
##   <chr>           <int>      <dbl>    <dbl>
## 1 Entire home/apt 11082      113.     118. 
## 2 Private room     6416       64.3     46.5
## 3 Shared room       153       49.6     33.9

We can also carry out a formal test of homogeneity of variances. For this we need the function leveneTest from the package car:

## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)    
## group     2  140.07 < 2.2e-16 ***
##       17648                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is extremely small, so we reject the null hypothesis of equal variances. As with the normality assumption, violations of the assumption of equal variances can, however, often be ignored and we will do so in this case.

3.3.3 The actual ANOVA

To carry out an ANOVA, we need to install some packages:

We can now proceed with the actual ANOVA:

## # A tibble: 3 x 6
##   term                ss   df1   df2     f pvalue
##   <chr>            <dbl> <dbl> <int> <dbl>  <dbl>
## 1 (Intercept)   7618725.     1 17648  803.      0
## 2 room_type    10120155.     2 17648  534.      0
## 3 Residuals   167364763. 17648 17648   NA      NA

In this case the p-value associated with the effect of room_type is practically 0, which means that we reject the null hypothesis that the average price is equal for every room_type. You could report this as follows: “There were significant differences between the average prices of different room types (F(2, 17648) = 533.57, p < .001).”

3.3.4 Tukey’s honest significant difference test

Note that ANOVA tests the null hypothesis that the means in all our groups are equal. A rejection of this null hypothesis means that there is a significant difference in at least one of the possible pairs of means (i.e., in entire home/apartment vs. private and/or in entire home/apartment vs. shared and/or in private vs. shared). To get an idea of which pair of means contains a significant difference, we can follow up with Tukey’s test that will give us all pairwise comparisons. Tukey’s test corrects the p-values upwards — so it is more conservative in deciding something is significant — because the comparisons are post-hoc or exploratory:

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = price ~ room_type, data = airbnb)
## 
## $room_type
##                                   diff       lwr        upr    p adj
## Private room-Entire home/apt -49.11516 -52.69593 -45.534395 0.000000
## Shared room-Entire home/apt  -63.79178 -82.37217 -45.211381 0.000000
## Shared room-Private room     -14.67661 -33.34879   3.995562 0.155939

This shows us that there’s no significant difference in the average price of shared and private rooms, but that shared rooms and private rooms both differ significantly from entire homes and apartments.