Chapter 27 Chi-Square Test

27.1 Goodness of Fit

In the late 1990s, Mars began publishing the color breakdown of M&Ms on its website. In 2008, Mars stopped publishing it. This is the color breakdown of M&Ms as of 2008.

24% blue
20% orange
16% green
14% yellow
13% red
13% brown

A computer scientist and statistician named Rick Wicklin worked at a company whose boss loved M&Ms. Their breakroom had tons of the candies. In 2016, he had the idea of checking whether Mars’ published M&M color breakdown was accurate. Over several weeks in 2016 and 2017, Wicklin collected samples of M&Ms. He eventually collected 712 M&Ms, and this was the color breakdown he found.

18.7% blue
18.7% orange
19.5% green
14.5% yellow
15.1% red
13.5% brown

Wicklin thought that what he found was quite different from what was published. We will do a chi-square goodness of fit test to see if Wicklin had reason to question the published color breakdown. The null hypothesis is that the colors follow the published breakdown.

The chi-square test for goodness of fit function is as follows:
chisq.test(observed_vector_count, p = expected_probability_vector)

For our example, we will call the vector of observed counts observed and the vector of expected probabilities expected. Note that our observed data are percentages, so each one has to be written as a proportion and multiplied by the sample size of 712 to get a count. Because the percentages are rounded, the resulting counts are approximate and not whole numbers. Each entry of the probability vector has to be between 0 and 1, and the entries should sum to 1.

observed <- c(0.187*712, 0.187*712, 0.195*712, 0.145*712, 0.151*712, 0.135*712)
expected <- c(0.24, 0.20, 0.16, 0.14, 0.13, 0.13)
chisq.test(observed, p = expected)

## 
##  Chi-squared test for given probabilities
## 
## data:  observed
## X-squared = 17.066, df = 5, p-value = 0.004377

The chi-square test gives a P-value of 0.004377, which is well below the usual 0.05 significance level, so we reject the null hypothesis that the colors follow the published breakdown. Rick Wicklin had reason to doubt the accuracy of the 2008 M&M color breakdown published by Mars. You can read more about what happened next in Quartz.
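
To see where the X-squared value and the P-value come from, the statistic can be recomputed by hand. The following is a minimal sketch, assuming the same rounded percentages as above; the names n, expected_counts, x_squared, deg_free and p_value are illustrative and are not part of chisq.test( ).

```r
# Sample size and observed counts, reconstructed from the rounded percentages
n <- 712
observed <- c(0.187, 0.187, 0.195, 0.145, 0.151, 0.135) * n
expected <- c(0.24, 0.20, 0.16, 0.14, 0.13, 0.13)

# The expected probabilities must sum to 1
stopifnot(isTRUE(all.equal(sum(expected), 1)))

# Expected counts under the published breakdown
expected_counts <- expected * n

# Pearson chi-square statistic: sum of (O - E)^2 / E over the six colors
x_squared <- sum((observed - expected_counts)^2 / expected_counts)

# Degrees of freedom: number of categories minus 1
deg_free <- length(observed) - 1

# P-value: upper-tail probability of the chi-square distribution
p_value <- pchisq(x_squared, df = deg_free, lower.tail = FALSE)

x_squared
deg_free
p_value
```

The three values should agree with the X-squared, df and p-value lines in the chisq.test( ) output above.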

Alternatively, if you do not want to assign names to vectors, you can enter the values directly into the chi-square test function.

chisq.test(c(0.187*712, 0.187*712, 0.195*712, 0.145*712, 0.151*712, 0.135*712),
            p = c(0.24, 0.20, 0.16, 0.14, 0.13, 0.13))

## 
##  Chi-squared test for given probabilities
## 
## data:  c(0.187 * 712, 0.187 * 712, 0.195 * 712, 0.145 * 712, 0.151 *     712, 0.135 * 712)
## X-squared = 17.066, df = 5, p-value = 0.004377

The results are exactly the same.

27.2 Independence

The chi-square test of independence function is: chisq.test(matrix).

For the examples below, we will be using the datasets found in the package MASS.

# Load package MASS
library(MASS)

Let us take a look at the dataset called caith.

caith

##        fair red medium dark black
## blue    326  38    241  110     3
## light   688 116    584  188     4
## medium  343  84    909  412    26
## dark     98  48    403  681    85

The caith dataset consists of 4 rows of eye colors (blue, light, medium and dark) and 5 columns of hair colors (fair, red, medium, dark and black).

Our null hypothesis is that eye color is independent of hair color. Our alternative hypothesis is that eye color and hair color are associated. To do a chi-square test of independence, chisq.test( ) needs a two-way table of counts, such as a matrix. The caith dataset already has this layout, so we can proceed with the test.

chisq.test(caith)

## 
##  Pearson's Chi-squared test
## 
## data:  caith
## X-squared = 1240, df = 12, p-value < 2.2e-16

The P-value is extremely small, so we reject the null hypothesis of independence. We can conclude that eye color and hair color are not independent.

If you want to see the expected counts under the null hypothesis, append $expected to the chisq.test( ) call.

chisq.test(caith)$expected

##            fair      red   medium     dark    black
## blue   193.9280 38.11918 284.8275 185.3978 15.72749
## light  426.7496 83.88342 626.7793 407.9785 34.60924
## medium 479.1479 94.18303 703.7383 458.0720 38.85873
## dark   355.1745 69.81437 521.6549 339.5517 28.80453
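
The expected counts come from a simple rule: under independence, each cell is expected to hold row total times column total divided by the grand total. The sketch below reproduces them by hand. To keep it self-contained, the counts are copied from the caith output above into a matrix called caith_counts; that name, expected_by_hand, x_squared and deg_free are illustrative.

```r
# Counts copied from the caith table shown above
caith_counts <- matrix(
  c(326,  38, 241, 110,  3,
    688, 116, 584, 188,  4,
    343,  84, 909, 412, 26,
     98,  48, 403, 681, 85),
  nrow = 4, byrow = TRUE,
  dimnames = list(eye  = c("blue", "light", "medium", "dark"),
                  hair = c("fair", "red", "medium", "dark", "black"))
)

# Expected count for each cell: row total * column total / grand total
expected_by_hand <- outer(rowSums(caith_counts), colSums(caith_counts)) /
  sum(caith_counts)

# Pearson statistic: sum of (O - E)^2 / E over all 20 cells
x_squared <- sum((caith_counts - expected_by_hand)^2 / expected_by_hand)

# Degrees of freedom: (rows - 1) * (columns - 1)
deg_free <- (nrow(caith_counts) - 1) * (ncol(caith_counts) - 1)

round(expected_by_hand, 2)
x_squared
deg_free
```

The table of expected counts should match the $expected output, and the statistic and degrees of freedom should match the test output for caith.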

Let us take a look at another example where we have to form a matrix. We will use the dataset survey and work only with the variables Exer for exercise frequency and Smoke for smoking frequency.

To form the matrix of counts, we use the function table( ) and assign the result to an object. We will call this object smokex.

smokex <- table(survey$Smoke, survey$Exer)
smokex

##        
##         Freq None Some
##   Heavy    7    1    3
##   Never   87   18   84
##   Occas   12    3    4
##   Regul    9    1    7

The top row heading is the exercise frequency: Frequent, None, Some. The first column is the smoking frequency: Heavy, Never, Occasionally, Regularly.
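
To make the role of table( ) concrete, here is a small sketch on made-up data. The data frame toy and its six rows are hypothetical and are not taken from survey; they only show how table( ) turns one row per respondent into a two-way table of counts.

```r
# Hypothetical toy data: one row per respondent (not the survey data)
toy <- data.frame(
  Smoke = factor(c("Never", "Never", "Heavy", "Never", "Occas", "Heavy")),
  Exer  = factor(c("Freq",  "Some",  "Freq",  "Freq",  "None",  "Some"))
)

# table() counts how many rows fall into each combination of levels;
# the first argument gives the rows, the second gives the columns
toy_table <- table(toy$Smoke, toy$Exer)
toy_table

# addmargins() appends the row and column totals
addmargins(toy_table)
```

By default, table( ) leaves out rows where either variable is missing.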

Our null hypothesis is that the frequency of smoking is independent of the frequency of exercise. Our alternative hypothesis is that the frequency of smoking and the frequency of exercise are associated. Let us perform a chi-square test.

chisq.test(smokex)

## Warning in chisq.test(smokex): Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  smokex
## X-squared = 5.4885, df = 6, p-value = 0.4828

There is a warning that the chi-square approximation may be incorrect. The issue is that the chi-square approximation to the distribution of the test statistic relies on the counts being roughly normally distributed. If many of the expected counts are very small, the approximation may be poor. Let us take a look at the expected counts.

chisq.test(smokex)$expected

## Warning in chisq.test(smokex): Chi-squared approximation may be incorrect

##        
##              Freq      None      Some
##   Heavy  5.360169  1.072034  4.567797
##   Never 92.097458 18.419492 78.483051
##   Occas  9.258475  1.851695  7.889831
##   Regul  8.283898  1.656780  7.059322
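
Rather than scanning the table by eye, the small cells can be flagged in code. This is a sketch only: the counts are copied from the smokex output above into a matrix named smokex_counts so that the example runs on its own, and the cutoff of 5 is a common rule of thumb that is assumed here, not a hard requirement.

```r
# Counts copied from the smokex table shown above
smokex_counts <- matrix(
  c( 7,  1,  3,
    87, 18, 84,
    12,  3,  4,
     9,  1,  7),
  nrow = 4, byrow = TRUE,
  dimnames = list(Smoke = c("Heavy", "Never", "Occas", "Regul"),
                  Exer  = c("Freq", "None", "Some"))
)

# suppressWarnings() hides the approximation warning for this lookup only
expected_counts <- suppressWarnings(chisq.test(smokex_counts))$expected

# Flag cells whose expected count is below the assumed cutoff of 5
small <- expected_counts < 5
small

sum(small)   # number of cells with a small expected count
mean(small)  # share of cells with a small expected count
```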

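We see that four of the twelve expected counts are below 5. To avoid relying on the chi-square approximation, include the argument simulate.p.value = TRUE, which computes the P-value by Monte Carlo simulation and also removes the warning.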

chisq.test(smokex, simulate.p.value = TRUE)$expected

##        
##              Freq      None      Some
##   Heavy  5.360169  1.072034  4.567797
##   Never 92.097458 18.419492 78.483051
##   Occas  9.258475  1.851695  7.889831
##   Regul  8.283898  1.656780  7.059322

chisq.test(smokex, simulate.p.value = TRUE)

## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  smokex
## X-squared = 5.4885, df = NA, p-value = 0.5057

The expected counts are identical to those from the previous test and the P-value does not differ much, but the warning no longer appears. Because the P-value is simulated, it will vary slightly from run to run. Since the P-value is quite high, we fail to reject the null hypothesis that the frequency of smoking is independent of the frequency of exercise.
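
If a repeatable simulated P-value is wanted, one option is to fix the random seed before the call and, optionally, raise the number of replicates with the argument B. The sketch below assumes the counts from the smokex output above, copied into a matrix named smokex_counts; the seed 2024 and B = 10000 are arbitrary choices for illustration.

```r
# Counts copied from the smokex table shown above
smokex_counts <- matrix(
  c( 7,  1,  3,
    87, 18, 84,
    12,  3,  4,
     9,  1,  7),
  nrow = 4, byrow = TRUE,
  dimnames = list(Smoke = c("Heavy", "Never", "Occas", "Regul"),
                  Exer  = c("Freq", "None", "Some"))
)

# Fix the seed so the simulation gives the same answer every time
set.seed(2024)
sim_test <- chisq.test(smokex_counts, simulate.p.value = TRUE, B = 10000)

# Same seed, same call: the simulated P-value is reproduced exactly
set.seed(2024)
sim_again <- chisq.test(smokex_counts, simulate.p.value = TRUE, B = 10000)

sim_test$statistic
sim_test$p.value
identical(sim_test$p.value, sim_again$p.value)
```

More replicates make the simulated P-value less noisy at the cost of a longer run time.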