3.2 Independent samples t-test

Say we want to test whether prices differ between large and small cities. To do this, we need a variable that denotes whether a listing is in a large or in a small city. In Belgium, we consider cities with a population of at least one hundred thousand as large:

## [1]  231493 1019022 1019022   69011 1019022 1019022
## [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## [1] large large large small large large
## Levels: small large

In the above, we first create a logical variable (this is another variable type; we discussed some others earlier). We call this variable large and it is TRUE when population is at least 100000 and FALSE if not. Afterwards we create a new variable size that is the factorization of large. Note that we add another argument to the factor function, namely labels, to give the values of large more intuitive names. FALSE sorts before TRUE and therefore gets the first label, small; TRUE comes second and gets the second label, large.
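
As a minimal sketch of these two steps, the snippet below rebuilds them on a toy data frame that holds only the six population values printed above. The name toy is hypothetical, and we assume the dplyr package is installed. We use >= so that the cutoff matches "at least one hundred thousand"; none of these six values sits exactly at the cutoff, so > would give the same result here.

```r
library(dplyr)

# Toy data (hypothetical): the six population values printed above
toy <- tibble(population = c(231493, 1019022, 1019022, 69011, 1019022, 1019022))

toy <- toy %>%
  mutate(
    # logical variable: TRUE for cities with at least 100000 inhabitants
    large = population >= 100000,
    # factor version: FALSE gets the first label, TRUE the second
    size = factor(large, labels = c("small", "large"))
  )

toy$large
toy$size
```

Printing toy$large and toy$size should reproduce the TRUE/FALSE and large/small patterns shown in the output above.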

To know which cities are large and which are small, we can ask for frequencies of size (large vs. small) and city (the actual city itself) combinations. We’ve learned how to do this in the introductory chapter (see frequency tables and descriptive statistics):

## `summarise()` regrouping output by 'size' (override with `.groups` argument)
## # A tibble: 43 x 4
## # Groups:   size [3]
##    size  city              count population
##    <fct> <chr>             <int>      <dbl>
##  1 large Brussel            6715    1019022
##  2 large Antwerpen          1610     459805
##  3 large Gent               1206     231493
##  4 large Charleroi           118     200132
##  5 large Brugge             1094     116709
##  6 large Namur               286     106284
##  7 small Leuven              434      92892
##  8 small Mons                129      91277
##  9 small Aalst                74      77534
## 10 small Mechelen            190      77530
## 11 small Kortrijk            107      73879
## 12 small Hasselt             151      69222
## 13 small Oostende            527      69011
## 14 small Sint-Niklaas         52      69010
## 15 small Tournai              97      67721
## 16 small Roeselare            41      56016
## 17 small Verviers            631      52824
## 18 small Moeskroen            28      52069
## 19 small Dendermonde          45      43055
## 20 small Turnhout            130      39654
## 21 small Ieper               143      35089
## 22 small Tongeren            173      29816
## 23 small Oudenaarde          110      27935
## 24 small Ath                  47      26681
## 25 small Arlon                46      26179
## 26 small Soignies             58      24869
## 27 small Nivelles            505      24149
## 28 small Maaseik              93      23684
## 29 small Huy                  99      19973
## 30 small Tielt                24      19299
## 31 small Eeklo                43      19116
## 32 small Marche-en-Famenne   266      16856
## 33 small Diksmuide            27      15515
## 34 <NA>  Bastogne            145         NA
## 35 <NA>  Dinant              286         NA
## 36 <NA>  Halle-Vilvoorde     471         NA
## 37 <NA>  Liege               667         NA
## 38 <NA>  Neufchâteau         160         NA
## 39 <NA>  Philippeville        85         NA
## 40 <NA>  Thuin                81         NA
## 41 <NA>  Veurne              350         NA
## 42 <NA>  Virton               56         NA
## 43 <NA>  Waremme              51         NA
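
A table like the one above could be produced along the following lines. This is only a sketch on a hypothetical toy data set with one row per listing; the column names count and population are taken from the output, while the toy values and the exact sorting are our assumptions.

```r
library(dplyr)

# Toy listings (hypothetical): one row per listing, population is NA where unknown
toy <- tibble(
  city       = c("Gent", "Gent", "Brussel", "Oostende", "Liege", "Liege"),
  population = c(231493, 231493, 1019022, 69011, NA, NA)
) %>%
  mutate(size = factor(population >= 100000, labels = c("small", "large")))

# Count listings per size and city combination and keep the city's population
freq <- toy %>%
  group_by(size, city) %>%
  summarise(count = n(), population = mean(population), .groups = "drop") %>%
  arrange(desc(size), desc(population))  # large cities first, NA rows last

freq
```

Cities without a population end up with NA for size and are listed at the bottom, just as in the table above.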

We see that some cities have an NA value for size. This is because we do not have the population for these cities (and therefore also do not know whether it’s a large or small city). Let’s filter these observations out and then check the means and the standard deviations of price depending on city size:

## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 4
##   size  mean_price sd_price count
##   <fct>      <dbl>    <dbl> <int>
## 1 small      110.     122.   4270
## 2 large       85.8     82.9 11029
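
The filtering and summarising step can be sketched as follows. The column names mean_price, sd_price and count come from the output above; the toy data and the name toy.cities are hypothetical stand-ins for the real data set.

```r
library(dplyr)

# Toy listings (hypothetical); the last listing has an unknown city size
toy <- tibble(
  size  = factor(c("small", "small", "large", "large", NA),
                 levels = c("small", "large")),
  price = c(100, 120, 80, 90, 50)
)

# Keep only the listings for which we know the city size
toy.cities <- toy %>%
  filter(!is.na(size))

# Mean, standard deviation and number of listings per city size
price_summary <- toy.cities %>%
  group_by(size) %>%
  summarise(mean_price = mean(price),
            sd_price   = sd(price),
            count      = n())

price_summary
```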

We see that prices are higher in small cities than in large cities, but we want to know whether this difference is significant. An independent samples t-test can provide the answer (the listings in the large cities and the listings in the small cities are the independent samples), but we first need to check an assumption: are the variances of the two samples equal?

## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)    
## group     1  134.45 < 2.2e-16 ***
##       15297                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
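
The output above has the format printed by leveneTest() from the car package. To show what this test does, here is a minimal base R sketch on simulated toy data (the data and the names toy and abs_dev are hypothetical): with center = median, Levene's test is a one-way ANOVA on the absolute deviations of each observation from its group median.

```r
set.seed(1)

# Simulated toy data (hypothetical): prices vary more in small cities
toy <- data.frame(
  size  = factor(rep(c("small", "large"), each = 50),
                 levels = c("small", "large")),
  price = c(rnorm(50, mean = 110, sd = 120),
            rnorm(50, mean = 85,  sd = 80))
)

# Absolute deviation of each price from the median of its own group
toy$abs_dev <- abs(toy$price - ave(toy$price, toy$size, FUN = median))

# One-way ANOVA on these deviations: a small p-value suggests unequal variances
levene <- anova(lm(abs_dev ~ size, data = toy))
levene

# With the car package installed, this should give the same F value:
# car::leveneTest(price ~ size, data = toy)
```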

The null hypothesis of equal variances is rejected (p < .001), so we should continue with a t-test that does not assume equal variances (Welch's t-test):

## 
##  Welch Two Sample t-test
## 
## data:  airbnb.cities$price by airbnb.cities$size
## t = 12.125, df = 5868.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  20.55103 28.47774
## sample estimates:
## mean in group small mean in group large 
##           110.31265            85.79826
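
The "data:" line in the output suggests that the test was called with a formula of the form price ~ size. A minimal sketch on simulated toy data (the values are hypothetical; only the name airbnb.cities in the output comes from the real analysis):

```r
set.seed(1)

# Simulated stand-in (hypothetical) for the filtered data set
toy.cities <- data.frame(
  size  = factor(rep(c("small", "large"), each = 50),
                 levels = c("small", "large")),
  price = c(rnorm(50, mean = 110, sd = 120),
            rnorm(50, mean = 85,  sd = 80))
)

# var.equal = FALSE requests Welch's t-test (this is also the default in R)
res <- t.test(toy.cities$price ~ toy.cities$size, var.equal = FALSE)
res

# Individual pieces for reporting
res$statistic  # t value
res$parameter  # degrees of freedom
res$p.value    # p-value
```

The first level of the factor (small) is the first group, so a positive t value means that the mean in small cities is higher than the mean in large cities.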

You could report this as follows: “Listings in large cities (M = 85.80, SD = 82.88) had a lower price than listings in small cities (M = 110.31, SD = 121.63), t(5868.25) = 12.125, p < .001, unequal variances assumed.”
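
As a check on the reported values, the test statistic and the degrees of freedom can be recomputed from the summary statistics alone. The sketch below applies the standard Welch formulas to the means, standard deviations and group sizes shown above. Because the standard deviations are rounded to two decimals, the results should match the output only approximately.

```r
# Summary statistics taken from the output and the report above
m_small <- 110.31265; sd_small <- 121.63; n_small <- 4270
m_large <- 85.79826;  sd_large <- 82.88;  n_large <- 11029

# Squared standard error of each group mean
se2_small <- sd_small^2 / n_small
se2_large <- sd_large^2 / n_large

# Welch t statistic: difference in means divided by its standard error
t_welch <- (m_small - m_large) / sqrt(se2_small + se2_large)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (se2_small + se2_large)^2 /
  (se2_small^2 / (n_small - 1) + se2_large^2 / (n_large - 1))

t_welch   # close to the reported t = 12.125
df_welch  # close to the reported df = 5868.3
```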