8 Chi-squared Test of Independence

Many statistical quantities derived from data samples are found to follow the Chi-squared distribution. Hence we can use it to test whether a population fits a particular theoretical probability distribution.

8.1 Testing Category Probabilities: One-Way Table

In this section, we consider a multinomial experiment with \(k\) outcomes that correspond to categories of a single qualitative variable. The results of such an experiment are summarized in a one-way table. The term one-way is used because only one variable is classified. Typically, we want to make inferences about the true proportions that occur in the \(k\) categories based on the sample information in the one-way table.

A population is called multinomial if its data are categorical and fall into a collection of discrete, non-overlapping classes. Qualitative data that fall into more than two categories often result from a multinomial experiment. The characteristics of a multinomial experiment with \(k\) outcomes are described in the box.

Properties of the Multinomial Experiment

1. The experiment consists of \(n\) identical trials.
2. There are \(k\) possible outcomes to each trial. These outcomes are called classes, categories, or cells.
3. The probabilities of the \(k\) outcomes, denoted by \(p_1, p_2, \ldots, p_k\), remain the same from trial to trial, where \(p_1 + p_2 + \cdots + p_k = 1\).
4. The trials are independent.
5. The random variables of interest are the cell counts \(n_1, n_2, \ldots, n_k\), i.e., the numbers of observations that fall in each of the \(k\) classes.
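
As a quick illustration (not part of the original text), a single multinomial experiment can be simulated in R with rmultinom(); the output is random, so no fixed result is shown:

rmultinom(1, size = 150, prob = c(1/3, 1/3, 1/3))  # 150 trials, 3 equally likely cells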

The chi-square goodness-of-fit test compares an observed distribution to an expected distribution when the data are categorical with two or more categories. In other words, it compares multiple observed proportions to expected probabilities.

The null hypothesis of the goodness-of-fit test for a multinomial distribution is that the observed frequency \(f_i\) equals the expected count \(e_i\) in each category. It is rejected if the p-value of the following chi-squared test statistic is less than a given significance level \(\alpha\).

Formula

\[ \chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} \]

Under \(H_0\), this statistic approximately follows a chi-squared distribution with \(k - 1\) degrees of freedom.

Example

To illustrate, suppose a large supermarket chain conducts a consumer-preference survey by recording the brand of bread purchased by customers in its stores. Assume the chain carries three brands of bread: two major brands (A and B) and its own store brand. The brand preferences of a random sample of 150 consumers are observed, and the number preferring each brand is tabulated below.

Brand         n
A            61
B            53
Store brand  36
ncount <- c(61, 53, 36)
sum(ncount)
## [1] 150

Note that our consumer-preference survey satisfies the properties of a multinomial experiment for the qualitative variable brand of bread.

The experiment consists of randomly sampling \(n = 150\) buyers from a large population of consumers containing an unknown proportion \(p_1\) who prefer brand A, a proportion \(p_2\) who prefer brand B, and a proportion \(p_3\) who prefer the store brand.

  • H0: the brands of bread are equally preferred, i.e., \(p_1 = p_2 = p_3 = 1/3\)
  • H1: at least one brand is preferred over the others
res <- chisq.test(ncount)
res
## 
##  Chi-squared test for given probabilities
## 
## data:  ncount
## X-squared = 6.52, df = 2, p-value = 0.03839

Since the computed \(\chi^2 = 6.52\) exceeds the critical value of 5.99147 (the 95th percentile of the chi-squared distribution with \(k - 1 = 2\) degrees of freedom), we conclude at the \(\alpha = .05\) level of significance that a consumer preference exists for one or more of the brands of bread.
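
As a check, both the statistic and the critical value can be computed directly; under H0 the expected count is 150/3 = 50 per brand:

f <- c(61, 53, 36)    # observed counts
e <- rep(150 / 3, 3)  # expected counts under H0: equal preference
sum((f - e)^2 / e)    # the chi-squared statistic
## [1] 6.52
qchisq(0.95, df = 2)  # critical value at alpha = .05
## [1] 5.991465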

Example

Suppose we collected wild tulips and found that 81 were red, 50 were yellow, and 27 were white.

  1. Are these colors equally common?

If the colors were equally common, the expected proportion would be 1/3 for each color.

  2. Suppose that, in the region where the data were collected, the ratio of red, yellow, and white tulips is 3:2:1 (3 + 2 + 1 = 6). This means that the expected proportions are:
  • 3/6 (= 1/2) for red
  • 2/6 (= 1/3) for yellow
  • 1/6 for white

We want to know if there is any significant difference between the observed proportions and the expected proportions.

Statistical hypothesis

  • Null hypothesis (H0): There is no significant difference between the observed and the expected value.
  • Alternative hypothesis (H1): There is a significant difference between the observed and the expected value.

Answer

tulip <- c(81, 50, 27)
sum(tulip)
## [1] 158
res <- chisq.test(tulip, p = c(1/3, 1/3, 1/3))
res
## 
##  Chi-squared test for given probabilities
## 
## data:  tulip
## X-squared = 27.886, df = 2, p-value = 8.803e-07

The p-value of the test is \(8.803 \times 10^{-7}\), which is less than the significance level \(\alpha = 0.05\). We can conclude that the tulip colors are not equally common.

# Access to the expected values
res$expected
## [1] 52.66667 52.66667 52.66667
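
These are simply \(n \times p_i = 158 \times 1/3\) for each color:

sum(tulip) * c(1/3, 1/3, 1/3)  # expected counts = n * p_i
## [1] 52.66667 52.66667 52.66667
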
tulip <- c(81, 50, 27)
res <- chisq.test(tulip, p = c(1/2, 1/3, 1/6))
res
## 
##  Chi-squared test for given probabilities
## 
## data:  tulip
## X-squared = 0.20253, df = 2, p-value = 0.9037

The p-value of the test is 0.9037, which is greater than the significance level \(\alpha = 0.05\). We can conclude that the observed proportions are not significantly different from the expected proportions.
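
Indeed, the expected counts under the 3:2:1 hypothesis are close to the observed counts of 81, 50, and 27:

sum(tulip) * c(1/2, 1/3, 1/6)  # expected counts under the 3:2:1 hypothesis
## [1] 79.00000 52.66667 26.33333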

8.2 Testing Category Probabilities: Two-Way (Contingency) Table

In the previous section, we introduced the multinomial probability distribution and considered data classified according to a single qualitative criterion. We now consider multinomial experiments in which the data are classified according to two criteria, that is, with respect to two qualitative factors.

For example, consider a study published in the Journal of Marketing on the impact of using celebrities in television advertisements. The researchers investigated the relationship between the gender of a viewer and the viewer's brand awareness. Three hundred TV viewers were asked to identify products advertised by male celebrity spokespersons.

Awareness     Male   Female   Total
Identify        95       41     136
No identify     50      114     164
Total          145      155     300
library(tidyverse)  # for tribble() and spread()
df <- tribble(
  ~awareness, ~gender, ~count,
  "identify", "M", 95,
  "no_identify", "M", 50,
  "identify", "F", 41,
  "no_identify", "F", 114
)
df <- df %>% spread(gender, count)
df
## # A tibble: 2 x 3
##   awareness       F     M
## * <chr>       <dbl> <dbl>
## 1 identify     41.0  95.0
## 2 no_identify 114    50.0

Suppose we want to know whether the two classifications, gender and brand awareness, are dependent. If we know the gender of the TV viewer, does that information give us a clue about the viewer’s brand awareness?

chisq <- chisq.test(as.matrix(df[,2:3]))
chisq
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  as.matrix(df[, 2:3])
## X-squared = 44.572, df = 1, p-value = 2.452e-11

Large values of \(\chi^2\) imply that the observed counts do not closely agree with the counts expected under independence, and hence provide evidence that the hypothesis of independence is false. Here the p-value is far below .05, so we conclude that gender and brand awareness are dependent.
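
Under independence, the expected count in a cell is the row total times the column total divided by \(n\); for example, for the (identify, Male) cell:

136 * 145 / 300  # expected count = (row total * column total) / n
## [1] 65.73333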

Example 2

A large brokerage firm wants to determine whether the service it provides to affluent clients differs from the service it provides to lower-income clients. A sample of 500 clients is selected, and each client is asked to rate his or her broker.

Broker rating   <$30,000   $30,000-$60,000   >$60,000   Total
Outstanding         48            64             41       153
Average             98           120             50       268
Poor                30            33             16        79
Total              176           217            107       500
df <- tribble(
  ~Broker, ~income, ~count,
  "Outstanding", "<30", 48,
  "Average", "<30", 98,
  "Poor", "<30", 30,
  "Outstanding", "30-60", 64,
  "Average", "30-60", 120,
  "Poor", "30-60", 33,
  "Outstanding", ">60", 41,
  "Average", ">60", 50,
  "Poor", ">60", 16
)

df <- df %>% spread(income, count)
df
## # A tibble: 3 x 4
##   Broker      `30-60` `<30` `>60`
## * <chr>         <dbl> <dbl> <dbl>
## 1 Average       120    98.0  50.0
## 2 Outstanding    64.0  48.0  41.0
## 3 Poor           33.0  30.0  16.0
chisq <- chisq.test(as.matrix(df[,2:4]))
chisq
## 
##  Pearson's Chi-squared test
## 
## data:  as.matrix(df[, 2:4])
## X-squared = 4.2777, df = 4, p-value = 0.3697

Determine whether there is evidence that broker rating and customer income are dependent.

The null and alternative hypotheses we want to test are

  • H0: The rating a client gives his or her broker is independent of the client's income.
  • H1: Broker rating and client income are dependent.

Since the p-value (0.3697) is greater than \(\alpha = .05\), this survey does not support the firm's alternative hypothesis that affluent clients receive different broker service than lower-income clients.
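
Note the degrees of freedom in the output above: for an \(r \times c\) table they are \((r - 1)(c - 1)\), which we can verify for this 3 x 3 table:

(3 - 1) * (3 - 1)  # df for a 3 x 3 contingency table
## [1] 4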

8.3 Chi-squared Test of Independence

The chi-square test of independence is used to analyze a frequency table (i.e., a contingency table) formed by two categorical variables. The test evaluates whether there is a significant association between the categories of the two variables.

file_path <- "http://www.sthda.com/sthda/RDoc/data/housetasks.txt"
housetasks <- read.delim(file_path, row.names = 1)
head(housetasks)
##            Wife Alternating Husband Jointly
## Laundry     156          14       2       4
## Main_meal   124          20       5       4
## Dinner       77          11       7      13
## Breakfeast   82          36      15       7
## Tidying      53          11       1      57
## Dishes       32          24       4      53
library("gplots")
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
# 1. convert the data as a table
dt <- as.table(as.matrix(housetasks))
# 2. Graph
balloonplot(t(dt), main ="housetasks", xlab ="", ylab="",
            label = FALSE, show.margins = FALSE)

The chi-square test examines whether the rows and columns of a contingency table are statistically significantly associated.

  • Null hypothesis (H0): the row and the column variables of the contingency table are independent.
  • Alternative hypothesis (H1): the row and column variables are dependent.
chisq <- chisq.test(housetasks)
chisq
## 
##  Pearson's Chi-squared test
## 
## data:  housetasks
## X-squared = 1944.5, df = 36, p-value < 2.2e-16

In our example, the row and the column variables are statistically significantly associated (p-value < 2.2e-16).
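
The 36 degrees of freedom come from \((r - 1)(c - 1)\): housetasks has 13 rows and 4 columns.

(nrow(housetasks) - 1) * (ncol(housetasks) - 1)  # df = (rows - 1) * (columns - 1)
## [1] 36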

8.3.1 Nature of the dependence between the row and the column variables

To identify the cells that contribute most to the total chi-square score, compute the chi-square statistic cell by cell. The building block is the Pearson residual of each cell, \(r_{ij} = \frac{o_{ij} - e_{ij}}{\sqrt{e_{ij}}}\), where \(o_{ij}\) and \(e_{ij}\) are the observed and expected counts in cell \((i, j)\); the contribution of a cell to the total statistic is then \(r_{ij}^2\).

library(corrplot)
## corrplot 0.84 loaded
# Contribution of each cell to the total chi-squared score, in percent (%)
contrib <- 100*chisq$residuals^2/chisq$statistic
round(contrib, 3)
##             Wife Alternating Husband Jointly
## Laundry    7.738       0.272   1.777   2.246
## Main_meal  4.976       0.012   1.243   1.903
## Dinner     2.197       0.073   0.600   0.560
## Breakfeast 1.222       0.615   0.408   1.443
## Tidying    0.149       0.133   1.270   0.661
## Dishes     0.063       0.178   0.891   0.625
## Shopping   0.085       0.090   0.581   0.586
## Official   0.688       3.771   0.010   0.311
## Driving    1.538       2.403   3.374   1.789
## Finances   0.886       0.037   0.028   1.700
## Insurance  1.705       0.941   0.868   1.683
## Repairs    2.919       0.947  21.921   2.275
## Holidays   2.831       1.098   1.233  12.445
corrplot(contrib, is.corr = FALSE)
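
The signed residuals can be visualized the same way (a small addition to the example above); positive residuals indicate attraction between the corresponding row and column categories, negative residuals indicate repulsion:

corrplot(chisq$residuals, is.corr = FALSE)  # signed Pearson residuals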

  • In the image above, it is evident that there is an association between the column Wife and the rows Laundry and Main_meal.
  • There is a strong positive association between the column Husband and the row Repairs.