3.5 Chi-squared test
Suppose we’re interested in finding a true gem of a listing. For example, we’re interested in listings with a 5 out of 5 rating and at least 30 reviews:
airbnb <- airbnb %>%
  mutate(gem = (overall_satisfaction == 5 & reviews >= 30), # both conditions must be met before we call a listing a gem
         gem = factor(gem, labels = c("no gem","gem"))) # give the logical variable more intuitive labels
Now, say we’re interested in knowing whether we’re more likely to find gems in small or in large cities (we have already created the size variable). The chi-squared test can answer this question by testing the null hypothesis of no relationship between two categorical variables (city size: large vs. small & gem: yes vs. no). It compares the observed frequency table with the frequency table that you would expect if there were no relation between the two variables. The more the observed and the expected frequency tables diverge, the larger the chi-squared statistic, the lower the p-value, and the stronger the evidence against the null hypothesis that the two variables are unrelated.
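To make the comparison of observed and expected frequencies concrete, here is a minimal sketch that works through it by hand. It uses the four counts from the cross table reported further down in this section. The names `observed`, `expected` and `chi_sq` are ours, and the statistic is computed without a continuity correction, so it will be slightly larger than the value from a corrected test.

```r
# Observed counts, copied from the cross table reported in this section
observed <- matrix(c(4095, 175,
                     10117, 912),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(size = c("small", "large"),
                                   gem  = c("no gem", "gem")))

# Expected count per cell if size and gem were unrelated:
# row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected

# Pearson chi-squared statistic (no continuity correction)
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value: probability of a statistic at least this large under the null hypothesis
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
chi_sq
p_value
```

Small cities have fewer gems than the expected table predicts and large cities have more. That gap is what drives the statistic up.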
Before we carry out a chi-squared test, remember that some cities have a missing value for size because they have a missing value for population. Let’s filter these out first:
airbnb.cities <- airbnb %>%
  filter(!is.na(size))
# we only want those observations where size is not NA. ! stands for 'not'
# check out https://r4ds.had.co.nz/transform.html#filter-rows-with-filter for more logical operators (scroll down to section 5.2.2)
Now, print the frequencies of the city size and gem combinations:
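The call that produced the output below is not shown. A minimal sketch that gives a table of the same shape, assuming dplyr is loaded, groups by both variables and counts the rows in each group. So that the snippet runs on its own, it first builds a hypothetical stand-in for `airbnb.cities` from the four counts in the output that follows. With the real data only the final pipeline is needed.

```r
library(dplyr)

# Hypothetical stand-in for airbnb.cities, rebuilt from the four counts
# reported in this section so the snippet runs without the original data
n <- c(4095, 175, 10117, 912)
airbnb.cities <- tibble(
  size = factor(rep(c("small", "small", "large", "large"), times = n),
                levels = c("small", "large")),
  gem  = factor(rep(c("no gem", "gem", "no gem", "gem"), times = n),
                levels = c("no gem", "gem"))
)

counts <- airbnb.cities %>%
  group_by(size, gem) %>%   # one group per combination of size and gem
  summarize(count = n())    # number of listings in each group
counts
```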
## `summarise()` regrouping output by 'size' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups:   size [2]
##   size  gem    count
##   <fct> <fct>  <int>
## 1 small no gem  4095
## 2 small gem      175
## 3 large no gem 10117
## 4 large gem      912
This information is correct but the format in which the table is presented is a bit unusual. We would like to have one variable as rows and the other as columns:
##
##         no gem   gem
##   small   4095   175
##   large  10117   912
This is a bit easier to interpret. A table like this is often called a cross table. It’s quite easy to ask for proportions (percentages expressed as fractions of 1) instead of counts:
crosstable <- table(airbnb.cities$size, airbnb.cities$gem) # We need to save the cross table first.
prop.table(crosstable) # Use the prop.table() function to ask for proportions.
##
##             no gem        gem
##   small 0.26766455 0.01143866
##   large 0.66128505 0.05961174
##
##             no gem        gem
##   small 0.95901639 0.04098361
##   large 0.91730891 0.08269109
##
##            no gem       gem
##   small 0.2881368 0.1609936
##   large 0.7118632 0.8390064
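Three tables are printed above, but only the first call is shown. Judging from the numbers, the second table has rows that sum to 1 and the third has columns that sum to 1, which is what the second argument of `prop.table()` produces. The sketch below reproduces all three. It rebuilds `crosstable` from the counts reported above as a stand-in for the real cross table, and the names `overall_props`, `row_props` and `col_props` are ours.

```r
# Stand-in for table(airbnb.cities$size, airbnb.cities$gem),
# rebuilt from the counts reported above
crosstable <- as.table(matrix(c(4095, 175,
                                10117, 912),
                              nrow = 2, byrow = TRUE,
                              dimnames = list(c("small", "large"),
                                              c("no gem", "gem"))))

overall_props <- prop.table(crosstable)     # each cell divided by the grand total
row_props     <- prop.table(crosstable, 1)  # margin 1: each row sums to 1
col_props     <- prop.table(crosstable, 2)  # margin 2: each column sums to 1

overall_props
row_props
col_props
```

The row proportions are the most useful here: they give the share of gems within small cities and within large cities, which is the comparison we care about.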
Based on these frequencies or percentages, we should not expect a strong relation between size and gem. Let’s carry out the chi-squared test to test our intuition:
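The call itself is not shown before the output. A minimal sketch, assuming the test was run on the saved cross table with base R's `chisq.test()`, looks like this. `crosstable` is rebuilt here from the counts reported above so that the snippet runs on its own, and `result` is our own name. With the real data, `chisq.test(crosstable)` alone would be enough.

```r
# Stand-in for table(airbnb.cities$size, airbnb.cities$gem),
# rebuilt from the counts reported above
crosstable <- as.table(matrix(c(4095, 175,
                                10117, 912),
                              nrow = 2, byrow = TRUE,
                              dimnames = list(c("small", "large"),
                                              c("no gem", "gem"))))

# For a 2 x 2 table, chisq.test() applies Yates' continuity correction by default
result <- chisq.test(crosstable)
result

# The pieces used when reporting the test
result$statistic  # chi-squared statistic
result$parameter  # degrees of freedom
result$p.value    # p-value
```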
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: crosstable
## X-squared = 80.497, df = 1, p-value < 2.2e-16
The value of the chi-squared statistic is 80.5 and the p-value is practically 0, so we reject the null hypothesis of no relationship. This is not what we expected, but with a sample this large (15299 observations) even a modest difference in percentages yields a very low p-value. You could report this as follows: “There was a significant relationship between city size and whether or not a listing was a gem (\(\chi^2\)(1, N = 15299) = 80.5, p < .001), such that large cities (8.27%) had a higher percentage of gems than small cities (4.10%).”