3.2 Independent samples t-test
Say we want to test whether prices differ between large and small cities. To do this, we need a variable that denotes whether an Airbnb is in a large or in a small city. In Belgium, we consider cities with a population of more than one hundred thousand as large:
airbnb <- airbnb %>%
  mutate(large = population > 100000,
         size = factor(large, labels = c("small","large")))
# We could have also written: mutate(size = factor(population > 100000, labels = c("small","large")))
# have a look at the population variable
head(airbnb$population)
## [1] 231493 1019022 1019022 69011 1019022 1019022
# have a look at the large variable
head(airbnb$large)
## [1] TRUE TRUE TRUE FALSE TRUE TRUE
# and at the size variable
head(airbnb$size)
## [1] large large large small large large
## Levels: small large
In the above, we first create a logical variable (this is another variable type; we’ve discussed some others earlier). We call this variable large, and it is TRUE when population is larger than 100000 and FALSE if not. Afterwards, we create a new variable size that is the factorization of large. Note that we add another argument to the factor function, namely labels, to give the values of large more intuitive names. The levels of a factor are sorted, so FALSE comes before TRUE: FALSE gets the first label, small, and TRUE gets the second label, large.
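The labeling step can be tried in isolation. The sketch below uses three population values taken from the output above rather than the full airbnb data. The second call, which spells out levels explicitly, is an addition for illustration: it is a more defensive variant, because the shorter call would fail if the data happened to contain only TRUE or only FALSE values (two labels would then be supplied for a single level).

```r
# Three population values copied from the head() output above
population <- c(231493, 1019022, 69011)

# Logical variable: TRUE when the population exceeds 100000
large <- population > 100000

# Levels are sorted (FALSE, then TRUE), so labels are assigned in that order
size <- factor(large, labels = c("small", "large"))
print(levels(size))
print(size)

# Defensive variant (illustrative): state the level order explicitly
size_explicit <- factor(large,
                        levels = c(FALSE, TRUE),
                        labels = c("small", "large"))
print(all(size == size_explicit))
```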
To know which cities are large and which are small, we can ask for frequencies of size (large vs. small) and city (the actual city itself) combinations. We’ve learned how to do this in the introductory chapter (see frequency tables and descriptive statistics):
# Cities form the groups, and every listing in a city has the same population,
# so the mean population within a group is simply the population of that city.
airbnb %>%
  group_by(size, city) %>%
  summarize(count = n(), population = mean(population)) %>%
  arrange(desc(size), desc(population)) %>% # largest city on top
  print(n = Inf) # show the full frequency distribution
## `summarise()` has grouped output by 'size'. You can override using the `.groups` argument.
## # A tibble: 43 x 4
## # Groups: size [3]
## size city count population
## <fct> <chr> <int> <dbl>
## 1 large Brussel 6715 1019022
## 2 large Antwerpen 1610 459805
## 3 large Gent 1206 231493
## 4 large Charleroi 118 200132
## 5 large Brugge 1094 116709
## 6 large Namur 286 106284
## 7 small Leuven 434 92892
## 8 small Mons 129 91277
## 9 small Aalst 74 77534
## 10 small Mechelen 190 77530
## 11 small Kortrijk 107 73879
## 12 small Hasselt 151 69222
## 13 small Oostende 527 69011
## 14 small Sint-Niklaas 52 69010
## 15 small Tournai 97 67721
## 16 small Roeselare 41 56016
## 17 small Verviers 631 52824
## 18 small Moeskroen 28 52069
## 19 small Dendermonde 45 43055
## 20 small Turnhout 130 39654
## 21 small Ieper 143 35089
## 22 small Tongeren 173 29816
## 23 small Oudenaarde 110 27935
## 24 small Ath 47 26681
## 25 small Arlon 46 26179
## 26 small Soignies 58 24869
## 27 small Nivelles 505 24149
## 28 small Maaseik 93 23684
## 29 small Huy 99 19973
## 30 small Tielt 24 19299
## 31 small Eeklo 43 19116
## 32 small Marche-en-Famenne 266 16856
## 33 small Diksmuide 27 15515
## 34 <NA> Bastogne 145 NA
## 35 <NA> Dinant 286 NA
## 36 <NA> Halle-Vilvoorde 471 NA
## 37 <NA> Liege 667 NA
## 38 <NA> Neufchâteau 160 NA
## 39 <NA> Philippeville 85 NA
## 40 <NA> Thuin 81 NA
## 41 <NA> Veurne 350 NA
## 42 <NA> Virton 56 NA
## 43 <NA> Waremme 51 NA
We see that some cities have an NA value for size. This is because we do not have the population for these cities (and therefore also do not know whether it’s a large or small city). Let’s filter these observations out and then check the means and the standard deviations of price depending on city size:
airbnb.cities <- airbnb %>%
  filter(!is.na(population))
# Filter out observations for which we do not have the population.
# The exclamation mark should be read as NOT. So we want to keep the observations for which population is NOT NA.
# Check out https://r4ds.had.co.nz/transform.html#filter-rows-with-filter for more logical operators (scroll down to section 5.2.2).
airbnb.cities %>%
  group_by(size) %>%
  summarize(mean_price = mean(price),
            sd_price = sd(price),
            count = n())
## # A tibble: 2 x 4
## size mean_price sd_price count
## <fct> <dbl> <dbl> <int>
## 1 small 110. 122. 4270
## 2 large 85.8 82.9 11029
We see that prices are higher in small than in large cities, but we want to know whether this difference is significant. An independent samples t-test can provide the answer (the listings in the large cities and the listings in the small cities are the independent samples), but we need to check an assumption first: are the variances of the two independent samples equal?
install.packages("car") # For the test of equal variances, we need a package called car.
library(car)
# Levene's test of equal variances.
# Low p-value means the variances are not equal.
# First argument = continuous dependent variable, second argument = categorical independent variable.
leveneTest(airbnb.cities$price, airbnb.cities$size)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 134.45 < 2.2e-16 ***
## 15297
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
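For intuition about what this test computes: Levene's test with center = median (as indicated in the output header) amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below reproduces that idea in base R. The data are simulated and the variable names are made up for illustration; this is not the airbnb dataset, so the numbers will not match the output above.

```r
# Simulated prices for two groups (hypothetical data, not the airbnb dataset)
set.seed(123)
price <- c(rnorm(50, mean = 110, sd = 120),
           rnorm(80, mean = 85, sd = 80))
size <- factor(rep(c("small", "large"), times = c(50, 80)),
               levels = c("small", "large"))

# Absolute deviation of each price from the median of its own group
group_median <- ave(price, size, FUN = median)
abs_dev <- abs(price - group_median)

# One-way ANOVA on those deviations: a small p-value suggests unequal spread
levene_by_hand <- anova(lm(abs_dev ~ size))
print(levene_by_hand)
```

If the car package is installed, running leveneTest(price, size) on these same vectors should give the same F value.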
The null hypothesis of equal variances is rejected (p < .001), so we should continue with a t-test that assumes unequal variances:
# Test whether the average prices of large and small cities differ.
# Indicate whether the test should assume equal variances or not (set var.equal = TRUE for a test that does assume equal variances).
t.test(airbnb.cities$price ~ airbnb.cities$size, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: airbnb.cities$price by airbnb.cities$size
## t = 12.125, df = 5868.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group small and group large is not equal to 0
## 95 percent confidence interval:
## 20.55103 28.47774
## sample estimates:
## mean in group small mean in group large
## 110.31265 85.79826
You could report this as follows: “Large cities (M = 85.80, SD = 82.88) had a significantly lower mean price (t(5868.25) = 12.125, p < .001, unequal variances assumed) than small cities (M = 110.31, SD = 121.63).”
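To see where the Welch statistic and its fractional degrees of freedom come from, the sketch below recomputes them from the group summaries reported in this section: the means from the t-test output, the standard deviations from the reporting sentence, and the counts from the summary table. It uses the standard Welch formulas; the variable names are our own. Because the standard deviations are rounded, the results should match the t.test() output only approximately.

```r
# Group summaries taken from the output in this section
m_small <- 110.31265; sd_small <- 121.63; n_small <- 4270
m_large <- 85.79826;  sd_large <- 82.88;  n_large <- 11029

# Squared standard error contributed by each group
v_small <- sd_small^2 / n_small
v_large <- sd_large^2 / n_large

# Welch t statistic: difference in means divided by its standard error
t_stat <- (m_small - m_large) / sqrt(v_small + v_large)

# Welch-Satterthwaite approximation of the degrees of freedom
welch_df <- (v_small + v_large)^2 /
  (v_small^2 / (n_small - 1) + v_large^2 / (n_large - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = welch_df)

print(c(t = t_stat, df = welch_df, p = p_value))
```

The degrees of freedom are not a whole number because they are estimated from the two sample variances rather than derived from the sample sizes alone.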