3.2 Independent samples t-test

Say we want to test whether prices differ between large and small cities. To do this, we need a variable that denotes whether an Airbnb is in a large or in a small city. In Belgium, we consider cities with a population of more than one hundred thousand as large:

airbnb <- airbnb %>% 
  mutate(large = population > 100000,
         size = factor(large, labels = c("small","large")))

# We could have also written: mutate(size = factor(population > 100000, labels = c("small","large")))

# have a look at the population variable
head(airbnb$population)
## [1]  231493 1019022 1019022   69011 1019022 1019022
# have a look at the large variable
head(airbnb$large)
## [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
# and at the size variable
head(airbnb$size)
## [1] large large large small large large
## Levels: small large

In the above, we first create a logical variable (this is another variable type; we’ve discussed some others here). We call this variable large; it is TRUE when population is larger than 100000 and FALSE otherwise. Afterwards we create a new variable size that is the factorization of large. Note that we add another argument to the factor function, namely labels, to give the values of large more intuitive names. The factor function sorts the values it finds, and FALSE sorts before TRUE, so FALSE gets the first label small and TRUE gets the second label large.
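
The mapping from logical values to labels can be checked on a toy vector. The sketch below uses made-up input (the vector `population` is only illustrative) and also passes `levels` explicitly, which is an optional addition to the code above that makes the order of the labels visible in the call itself:

```r
# Illustrative populations, not the full airbnb data
population <- c(231493, 69011, 1019022, 92892)

# Logical vector: TRUE when the population exceeds 100000
large <- population > 100000

# Spell out the level order so that "small" pairs with FALSE
# and "large" pairs with TRUE
size <- factor(large,
               levels = c(FALSE, TRUE),
               labels = c("small", "large"))

levels(size)  # order of the labels
table(size)   # number of toy cities per label
```

Stating the levels is a matter of taste here, but it also keeps the call working if the data happened to contain only large or only small cities, a case in which `factor(large, labels = c("small","large"))` would complain that it received two labels for a single level.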

To know which cities are large and which are small, we can ask for frequencies of size (large vs. small) and city (the actual city itself) combinations. We’ve learned how to do this in the introductory chapter (see frequency tables and descriptive statistics):

airbnb %>% 
  group_by(size, city) %>% 
  # Cities form the groups. Every listing in a city carries that city's population,
  # so the mean of population within a group is simply the population of the city.
  summarize(count = n(), population = mean(population)) %>% 
  arrange(desc(size), desc(population)) %>% # largest city on top
  print(n = Inf) # show the full frequency distribution
## `summarise()` has grouped output by 'size'. You can override using the `.groups` argument.
## # A tibble: 43 x 4
## # Groups:   size [3]
##    size  city              count population
##    <fct> <chr>             <int>      <dbl>
##  1 large Brussel            6715    1019022
##  2 large Antwerpen          1610     459805
##  3 large Gent               1206     231493
##  4 large Charleroi           118     200132
##  5 large Brugge             1094     116709
##  6 large Namur               286     106284
##  7 small Leuven              434      92892
##  8 small Mons                129      91277
##  9 small Aalst                74      77534
## 10 small Mechelen            190      77530
## 11 small Kortrijk            107      73879
## 12 small Hasselt             151      69222
## 13 small Oostende            527      69011
## 14 small Sint-Niklaas         52      69010
## 15 small Tournai              97      67721
## 16 small Roeselare            41      56016
## 17 small Verviers            631      52824
## 18 small Moeskroen            28      52069
## 19 small Dendermonde          45      43055
## 20 small Turnhout            130      39654
## 21 small Ieper               143      35089
## 22 small Tongeren            173      29816
## 23 small Oudenaarde          110      27935
## 24 small Ath                  47      26681
## 25 small Arlon                46      26179
## 26 small Soignies             58      24869
## 27 small Nivelles            505      24149
## 28 small Maaseik              93      23684
## 29 small Huy                  99      19973
## 30 small Tielt                24      19299
## 31 small Eeklo                43      19116
## 32 small Marche-en-Famenne   266      16856
## 33 small Diksmuide            27      15515
## 34 <NA>  Bastogne            145         NA
## 35 <NA>  Dinant              286         NA
## 36 <NA>  Halle-Vilvoorde     471         NA
## 37 <NA>  Liege               667         NA
## 38 <NA>  Neufchâteau         160         NA
## 39 <NA>  Philippeville        85         NA
## 40 <NA>  Thuin                81         NA
## 41 <NA>  Veurne              350         NA
## 42 <NA>  Virton               56         NA
## 43 <NA>  Waremme              51         NA
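
The NA rows at the bottom of this table arise because comparing a missing value with a number gives NA rather than TRUE or FALSE. A minimal sketch, using three cities from the table (the named vector `pop` is just an illustration, not part of the airbnb data frame):

```r
# Populations for two cities from the table, plus one city without a population
pop <- c(Gent = 231493, Oostende = 69011, Liege = NA)

# The comparison propagates the missing value
large <- pop > 100000
large

# The factor built from it is therefore missing as well
size <- factor(large, labels = c("small", "large"))
is.na(size)

# !is.na() keeps only the entries with a known population
pop[!is.na(pop)]
```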

We see that some cities have an NA value for size. This is because we do not have the population for these cities (and therefore also do not know whether it’s a large or small city). Let’s filter these observations out and then check the means and the standard deviations of price depending on city size:

airbnb.cities <- airbnb %>% 
  filter(!is.na(population)) 
# Filter out observations for which we do not have the population. 
# The exclamation mark should be read as NOT. So we want to keep the observations for which population is NOT NA.
# Check out https://r4ds.had.co.nz/transform.html#filter-rows-with-filter for more logical operators (scroll down to section 5.2.2).

airbnb.cities %>% 
  group_by(size) %>% 
  summarize(mean_price = mean(price),
            sd_price = sd(price),
            count = n())
## # A tibble: 2 x 4
##   size  mean_price sd_price count
##   <fct>      <dbl>    <dbl> <int>
## 1 small      110.     122.   4270
## 2 large       85.8     82.9 11029

We see that prices are higher in small than in large cities, but we want to know whether this difference is significant. An independent samples t-test can provide the answer (the listings in the large cities and the listings in the small cities are the independent samples), but we need to check an assumption first: are the variances of the two independent samples equal?

install.packages("car") # For the test of equal variances, we need a package called car.
library(car)
# Levene's test of equal variances. 
# A low p-value is evidence that the variances are not equal. 
# First argument = continuous dependent variable, second argument = categorical independent variable.
leveneTest(airbnb.cities$price, airbnb.cities$size) 
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)    
## group     1  134.45 < 2.2e-16 ***
##       15297                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
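
To see what this test does, note that Levene's test with center = median amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below reproduces that idea in base R on simulated data (the variables `group` and `y` are made up and are not the airbnb data), so the numbers will not match the output above:

```r
set.seed(1)

# Simulated data: two groups with different spreads
group <- factor(rep(c("small", "large"), times = c(200, 300)),
                levels = c("small", "large"))
y <- c(rnorm(200, mean = 110, sd = 120),
       rnorm(300, mean = 86, sd = 80))

# Absolute deviation of each observation from the median of its own group
abs_dev <- abs(y - ave(y, group, FUN = median))

# One-way ANOVA on the absolute deviations:
# a large F value means the groups differ in spread
levene <- anova(lm(abs_dev ~ group))
levene
```

On the same data, `leveneTest(y, group)` from car should give the same F value and p-value, which can serve as a check of the sketch.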

The null hypothesis of equal variances is rejected (p < .001), so we should continue with a t-test that assumes unequal variances:

# Test whether the average prices of large and small cities differ. 
# Indicate whether the test should assume equal variances or not (set var.equal = TRUE for a test that does assume equal variances).
t.test(airbnb.cities$price ~ airbnb.cities$size, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  airbnb.cities$price by airbnb.cities$size
## t = 12.125, df = 5868.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group small and group large is not equal to 0
## 95 percent confidence interval:
##  20.55103 28.47774
## sample estimates:
## mean in group small mean in group large 
##           110.31265            85.79826
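
The numbers in this output can be approximately reconstructed from the group means, standard deviations and counts. The sketch below applies the Welch formulas to the means printed above and the rounded standard deviations reported in this section, so small rounding differences with the output are to be expected (the variable names are only illustrative):

```r
# Summary statistics per group (standard deviations are rounded)
m_small <- 110.31265; sd_small <- 121.63; n_small <- 4270
m_large <- 85.79826;  sd_large <- 82.88;  n_large <- 11029

# Squared standard error of each group mean
v_small <- sd_small^2 / n_small
v_large <- sd_large^2 / n_large

# Welch t statistic: difference in means over its standard error
se     <- sqrt(v_small + v_large)
t_stat <- (m_small - m_large) / se

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v_small + v_large)^2 /
  (v_small^2 / (n_small - 1) + v_large^2 / (n_large - 1))

# Two-sided p-value and 95% confidence interval for the difference
p_value <- 2 * pt(-abs(t_stat), df_welch)
ci      <- (m_small - m_large) + c(-1, 1) * qt(0.975, df_welch) * se

round(c(t = t_stat, df = df_welch), 2)
round(ci, 2)
```

The non-integer degrees of freedom in the output come from this approximation; a test that assumes equal variances would instead use the total number of observations minus 2.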

You could report this as follows: “Large cities (M = 85.80, SD = 82.88) had a significantly lower mean price (t(5868.25) = 12.125, p < .001, unequal variances assumed) than small cities (M = 110.31, SD = 121.63).”