3.5 Chi-squared test

Suppose we’re interested in finding a true gem of a listing. For example, we’re interested in listings with a 5 out of 5 rating and at least 30 reviews:

airbnb <- airbnb %>% 
  mutate(gem = (overall_satisfaction == 5 & reviews>=30), # two conditions should be met before saying a listing is a gem
         gem = factor(gem, labels = c("no gem","gem")))  # give the logical variable more intuitive labels

Now, say we’re interested in knowing whether we’re more likely to find gems in small or in large cities (we’ve created the size variable here). The chi-squared test can provide an answer to this question by testing the null hypothesis of no relationship between two categorical variables (city size: large vs. small & gem: yes vs. no). It compares the observed frequency table with the frequency table that you would expect when there is no relation between the two variables. The more the observed and the expected frequency tables diverge, the larger the chi-squared statistic, the lower the p-value, and the less likely it is that the two variables are unrelated.

Before we carry out a chi-squared test, remember that some cities have a missing value for size because they have a missing value for population. Let’s filter these out first:

airbnb.cities <- airbnb %>% 
  filter(!is.na(size)) 

# we only want those observations where size is not NA. ! stands for 'not'
# check out https://r4ds.had.co.nz/transform.html#filter-rows-with-filter for more logical operators (scroll down to section 5.2.2)

Now, print the frequencies of the city size and gem combinations:

airbnb.cities %>% 
  group_by(size, gem) %>% 
  summarize(count = n())
## `summarise()` has grouped output by 'size'. You can override using the `.groups` argument.
## # A tibble: 4 x 3
## # Groups:   size [2]
##   size  gem    count
##   <fct> <fct>  <int>
## 1 small no gem  4095
## 2 small gem      175
## 3 large no gem 10117
## 4 large gem      912

This information is correct but the format in which the table is presented is a bit unusual. We would like to have one variable as rows and the other as columns:

table(airbnb.cities$size, airbnb.cities$gem)
##        
##         no gem   gem
##   small   4095   175
##   large  10117   912

This is a bit easier to interpret. A table like this is often called a cross table. It’s quite to easy to ask for percentages instead of counts:

crosstable <- table(airbnb.cities$size, airbnb.cities$gem) # We need to save the cross table first.
prop.table(crosstable) # Use the prop.table() function to ask for percentages.
##        
##             no gem        gem
##   small 0.26766455 0.01143866
##   large 0.66128505 0.05961174
prop.table(crosstable,1) # This gives percentages conditional on rows, i.e., the percentages in the rows sum to 1.
##        
##             no gem        gem
##   small 0.95901639 0.04098361
##   large 0.91730891 0.08269109
prop.table(crosstable,2) # This gives percentages conditional on columns, i.e., the percentages in the columns sum to 1.
##        
##            no gem       gem
##   small 0.2881368 0.1609936
##   large 0.7118632 0.8390064

Based on these frequencies or percentages, we should not expect a strong relation between size and gem. Let’s carry out the chi-squared test to test our intuition:

chisq.test(crosstable)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  crosstable
## X-squared = 80.497, df = 1, p-value < 2.2e-16

The value of the chi statistic is 80.5 and the p-value is practically 0, so we reject the null hypothesis of no relationship. This is not what we expected, but the p-value is this low because our sample is quite large (15299 observations). You could report this as follows: “There was a significant relationship between city size and whether or not a listing was a gem (\(\chi^2\)(1, N = 15299) = 80.5, p < .001), such that large cities (8.27%) had a higher percentage of gems than small cities (4.1%).”