Chapter 4 Practical Sheets

4.1 Practical 1 - Contingency Tables

In this practical, we explore the application in R of some of the techniques learnt in the lectures. The practical may seem long; however, some of it should be a helpful refresher of R techniques already learnt, and other parts are hopefully gentle and straightforward to read and follow (although the odd more challenging exercise is thrown in!). You are encouraged to work through to the end in your own time. Note that Data Science and Statistical Computing II (DSSC II) was not a prerequisite for this course, and hence is not necessary. However, if you did take that course, you are welcome to combine the skills picked up there with what you learn in this course (particularly on the visualisation side of things).

Finally, whilst solutions to the practical sheets will be provided, it is by attempting to answer the exercises yourself first that your practical coding skills will be developed!

4.1.1 Construction of Contingency Tables

4.1.1.1 Construction from Matrices

We here consider entering \(2 \times 2\) contingency tables manually into R. We do this as follows for the data in Table 2.3.

DR_data <- matrix( c(41, 9, 
                     37, 13 ), byrow = TRUE, ncol = 2 )
dimnames( DR_data ) <- list( Dose = c("High", "Low"),
                             Result = c("Success", "Failure") )

DR_data
##       Result
## Dose   Success Failure
##   High      41       9
##   Low       37      13

We can add row and column sum margins as follows:

DR_contingency_table <- addmargins( DR_data )
DR_contingency_table
##       Result
## Dose   Success Failure Sum
##   High      41       9  50
##   Low       37      13  50
##   Sum       78      22 100

4.1.1.2 Contingency Tables of Proportions

Proportions can be obtained using prop.table.

DR_prop <- prop.table( DR_data )
DR_prop_table <- addmargins( DR_prop )
DR_prop_table
##       Result
## Dose   Success Failure Sum
##   High    0.41    0.09 0.5
##   Low     0.37    0.13 0.5
##   Sum     0.78    0.22 1.0

The row conditional proportions are derived by prop.table(DR_data, 1). Analogously, column conditional proportions can be obtained using prop.table(DR_data, 2).
The addmargins function also provides versatility for specifying margins for particular dimensions. We here specify that margins only be added for each row (trivially yielding 1 in each case as expected).

DR_prop_1 <- prop.table( DR_data, 1 )
DR_prop_1_table <- addmargins( DR_prop_1, margin = 2 )
DR_prop_1_table
##       Result
## Dose   Success Failure Sum
##   High    0.82    0.18   1
##   Low     0.74    0.26   1
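As a minimal sketch of what the margin argument of prop.table is doing, the conditional proportions can also be obtained by dividing each row (or column) by its own total. The object names row_cond and col_cond are introduced here purely for illustration.

```r
# Dose-Result counts, re-entered so that this snippet runs on its own.
DR_data <- matrix( c(41, 9,
                     37, 13), byrow = TRUE, ncol = 2,
                   dimnames = list( Dose = c("High", "Low"),
                                    Result = c("Success", "Failure") ) )

# Row-conditional proportions: divide each row by its row total.
row_cond <- sweep( DR_data, 1, rowSums( DR_data ), "/" )

# Column-conditional proportions: divide each column by its column total.
col_cond <- sweep( DR_data, 2, colSums( DR_data ), "/" )

# Both comparisons should agree with prop.table.
all.equal( row_cond, prop.table( DR_data, 1 ) )
all.equal( col_cond, prop.table( DR_data, 2 ) )
```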

We can amend the function argument FUN so that an alternative operation is performed. Since both rows contain the same total number of subjects, setting FUN = mean yields the overall proportion in each column (see footnote 1):

DR_prop_2_table <- addmargins( DR_prop_1_table, margin = 1, FUN = mean )
DR_prop_2_table
##       Result
## Dose   Success Failure Sum
##   High    0.82    0.18   1
##   Low     0.74    0.26   1
##   mean    0.78    0.22   1

Notice that

DR_prop_3_table <- addmargins( DR_prop_1, 
                               margin = c(1,2), 
                               FUN = list(mean, sum) ) 
## Margins computed over dimensions
## in the following order:
## 1: Dose
## 2: Result
DR_prop_3_table
##       Result
## Dose   Success Failure sum
##   High    0.82    0.18   1
##   Low     0.74    0.26   1
##   mean    0.78    0.22   1

yields the same result. The argument margin dictates the order of the dimensions over which operations are applied, and the function list FUN dictates which function is applied in each case. So here, R first applies mean over dimension 1 (the rows, in this case Dose), and then sum over dimension 2 (the columns, in this case Result), as is confirmed by the additional information R fed back (setting argument quiet=TRUE would suppress this).
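The use of mean above relies on the two row totals being equal. The following sketch uses a small hypothetical table (the counts and the names toy, G1 and G2 are invented for illustration) to show that, with unequal row totals, the plain mean of the row-conditional proportions no longer matches the overall column proportions, whereas a mean weighted by the row totals does.

```r
# Hypothetical counts with unequal row totals (40 and 60); invented for illustration.
toy <- matrix( c(30, 10,
                 20, 40), byrow = TRUE, ncol = 2,
               dimnames = list( Group = c("G1", "G2"),
                                Result = c("Success", "Failure") ) )

toy_prop_1 <- prop.table( toy, 1 )            # row-conditional proportions

overall    <- colSums( toy ) / sum( toy )     # overall column proportions
unweighted <- colMeans( toy_prop_1 )          # what FUN = mean would report
weighted   <- apply( toy_prop_1, 2, weighted.mean, w = rowSums( toy ) )

rbind( overall, unweighted, weighted )
```

Here the rows labelled overall and weighted agree, while unweighted does not.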

4.1.1.3 Construction from Dataframes

Here we suppose we have our data collated in a dataframe that we wish to cross-classify into a contingency table. We demonstrate doing this on the penguins dataset in the package palmerpenguins (remember to install this package using install.packages("palmerpenguins") if you have not already done so).

library(palmerpenguins)
head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>

You will notice that the penguins data is in the dataframe format known in R as a tibble. For those of you who did not take DSSC II, a tibble is a user-friendly type of dataframe that works with the tidyverse collection of R packages. For our purposes, a tibble can be treated like any other dataframe.

Firstly, find out information about the dataset using ?penguins.

We are interested in whether different types of penguin typically reside on different islands, hence we wish to tabulate the dataframe as follows.

penguins_data <- table( Species = penguins$species, Island = penguins$island )
penguins_data
##            Island
## Species     Biscoe Dream Torgersen
##   Adelie        44    56        52
##   Chinstrap      0    68         0
##   Gentoo       124     0         0

In this case we may be fairly certain that there is a connection between these variables…but we can test out some techniques using this contingency table nonetheless.

4.1.1.4 Exercises

These questions involve using the contingency table from the penguin data introduced in Section 4.1.1.3.

  1. Use addmargins to add row and column sum totals to the contingency table of penguin data.

  2. Use prop.table to obtain a contingency table of proportions.

  3. Display the column-conditional probabilities, and use addmargins to add the column sums as an extra row at the bottom of the matrix (note: this should be a row of \(1\)’s).

  4. Suppose I want the overall proportions of penguin species to appear in a final column on the right of the table. How would I achieve this?

4.1.2 Chi-Square Test of Independence

Run the command chisq.test on DR_data with argument correct set to FALSE.

chisq.test( DR_data, correct = FALSE )
## 
##  Pearson's Chi-squared test
## 
## data:  DR_data
## X-squared = 0.9324, df = 1, p-value = 0.3342

chisq.test runs a \(\chi^2\) test of independence, and setting the argument correct to FALSE tells R not to use continuity correction. Look at the help file for chisq.test, and you will see that the default is for R to use Yates’ continuity correction (see Section 2.4.3.5).

chisq.test( DR_data )
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  DR_data
## X-squared = 0.52448, df = 1, p-value = 0.4689

Notice that the \(p\)-value with continuity correction is larger, as expected. The ML estimates of the expected cell frequencies can be obtained by running

chisq.test( DR_data )$expected
##       Result
## Dose   Success Failure
##   High      39      11
##   Low       39      11
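The statistics reported above can be reproduced by hand, which may help to clarify what the continuity correction does. This is a minimal sketch for the Dose-Result table only; the object names E, X2 and X2_yates are introduced here for illustration.

```r
# Dose-Result counts, re-entered so that this snippet runs on its own.
DR_data <- matrix( c(41, 9,
                     37, 13), byrow = TRUE, ncol = 2 )

# Expected counts under independence: (row total x column total) / n.
E <- outer( rowSums( DR_data ), colSums( DR_data ) ) / sum( DR_data )

# Pearson statistic without continuity correction.
X2 <- sum( ( DR_data - E )^2 / E )

# Yates' correction subtracts 0.5 from each |observed - expected|.
# (chisq.test caps the correction at the smallest |observed - expected|,
# which makes no difference for this table.)
X2_yates <- sum( ( abs( DR_data - E ) - 0.5 )^2 / E )

# Upper-tail chi-squared probabilities on 1 degree of freedom.
p       <- pchisq( X2, df = 1, lower.tail = FALSE )
p_yates <- pchisq( X2_yates, df = 1, lower.tail = FALSE )

round( c( X2 = X2, p = p, X2_yates = X2_yates, p_yates = p_yates ), 4 )
```

The rounded values should agree with the two chisq.test outputs shown above.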

4.1.2.1 Exercise

For the penguin data, apply the \(\chi^2\) test of independence between penguin species and island of residence, and interpret the results.

4.1.3 Data Visualisation

Here we introduce several types of data visualisation methods for categorical data presented in the form of contingency tables, and apply them to the Dose-Result contingency table.

4.1.3.1 Barplots

The exercises in this section should introduce you to (or refresh your memory of, if you have seen it before) the barplot function, as well as refresh your memory of generic R plotting function arguments.

  1. Run barplot( DR_prop ). What do the plots show?

  2. Investigate the density argument of the function barplot by both running the commands below, and also looking in the help file.

barplot( DR_prop, density = 70 )
barplot( DR_prop, density = 30 )
barplot( DR_prop, density = 0 )

  3. Add a title, and x- and y-axis labels, to the plot above.

  4. Use the help file for barplot to find out how to add a legend to the plot.

  5. How would we alter the call to barplot in order to view dose proportion levels conditional on result (instead of the overall proportions corresponding to each cell)? You may wish to use some of the table manipulation commands from Section 4.1.1.

  6. Suppose instead that we wish to display each dose level in a bar, with the proportion of successes and failures illustrated by the shading in each bar. How would we do that?

4.1.3.2 Fourfold plots

Try running the following to obtain the plot shown in Figure 4.1.

fourfoldplot( DR_data )

Figure 4.1: Fourfoldplot of the Dose-Result contingency table data.

A fourfold plot provides a graphical expression of the association in a \(2 \times 2\) contingency table, visualising the odds ratio. Each cell entry is represented as a quarter-circle (denoted by the middle of the three rings).

We see that the shaded diagonal areas are represented by quarter-circles with greater area than the off-diagonal areas, hence the association between the two binary classification variables is positive, that is, the odds ratio \(r_{12}\) is greater than 1. The visual impression of the association can be emphasised by choice of colour (although which colour scheme is best is subjective…). For example, to obtain a red/blue colour scheme, we can run the following to obtain the plot shown in Figure 4.2.

fourfoldplot( DR_data, color = c("red", "blue") )

Figure 4.2: Fourfoldplot of the Dose-Result contingency table data.

The inner and outer rings of the quarter-circles correspond to confidence rings. The observed frequencies support the null hypothesis of no association between the variables if the rings for adjacent quarters overlap (we will explore this hypothesis test in the lectures…).
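A related numerical check, which is not necessarily identical to the computation behind the rings drawn by fourfoldplot, is an approximate large-sample (Wald) confidence interval for the odds ratio. The sketch below assumes the usual standard error of the log odds ratio, namely the square root of the sum of reciprocal cell counts; the object names are introduced here for illustration.

```r
# Dose-Result counts, re-entered so that this snippet runs on its own.
DR_data <- matrix( c(41, 9,
                     37, 13), byrow = TRUE, ncol = 2 )

# Sample odds ratio (cross-product ratio) and its logarithm.
or_hat <- ( DR_data[1, 1] * DR_data[2, 2] ) / ( DR_data[1, 2] * DR_data[2, 1] )
log_or <- log( or_hat )

# Large-sample standard error of the log odds ratio.
se_log_or <- sqrt( sum( 1 / DR_data ) )

# Approximate 95% Wald interval, back-transformed to the odds-ratio scale.
z  <- qnorm( 0.975 )
ci <- exp( log_or + c( -1, 1 ) * z * se_log_or )

or_hat
ci
```

For these data the interval contains 1, which is in line with the non-significant \(\chi^2\) test of Section 4.1.2.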

4.1.3.3 Sieve Diagrams

We here investigate the sieve plotting function of the package vcd. Remember to look at the help file to help you understand the various arguments for this function.

  1. Run
library(vcd)
sieve( DR_data )

What is shown?

  2. Now run
library(vcd)
sieve( DR_data, shade = TRUE )

Does this make the data easier or harder to visualise?

  3. Finally, run
sieve( DR_data, sievetype = "expected", shade = TRUE )

What is shown now?

4.1.3.4 Mosaic Plots

Run

mosaic( DR_data )

Mosaic plots for two-way tables display graphically the cells of a contingency table as rectangular areas of size proportional to the corresponding observed frequencies. Were the classification variables independent, then the areas would be perfectly aligned in rows and columns. The worse the alignment is, the stronger the lack of fit for independence. Furthermore, the specific cells of the table that deviate most from independence may be identified, which helps in explaining the underlying pattern of association.
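The geometry described above can be sketched numerically. Assuming the tiles are built by first splitting on Dose and then on Result within each dose level, one side of each tile is the marginal proportion of the dose level and the other is the conditional proportion of the result. The object names below are introduced for illustration.

```r
# Dose-Result counts, re-entered so that this snippet runs on its own.
DR_data <- matrix( c(41, 9,
                     37, 13), byrow = TRUE, ncol = 2,
                   dimnames = list( Dose = c("High", "Low"),
                                    Result = c("Success", "Failure") ) )

# One side of each tile: the share of subjects at each dose level.
dose_share <- rowSums( DR_data ) / sum( DR_data )

# Other side: the proportion of each result within a dose level.
result_within_dose <- prop.table( DR_data, 1 )

# Tile areas (side x side) recover the overall cell proportions.
areas <- result_within_dose * dose_share

# Under independence, every row of result_within_dose would equal this vector,
# so the tiles would line up exactly.
result_share <- colSums( DR_data ) / sum( DR_data )

result_within_dose
result_share
```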

4.1.4 Odds Ratios in R

This section seeks to test your understanding of odds ratios for \(2 \times 2\) contingency tables, as well as your ability to write simple functions in R.

  1. Write a function to compute the odds ratio of the success of event A with probability pA against the success of event B with probability pB.

  2. Write a function to compute the odds ratio for a \(2 \times 2\) contingency table. Test it on the Dose-Result data above.

  3. Will there be an issue running your function from part 2 if exactly one of the cell counts of the supplied matrix is equal to zero?

  4. What about if both cells of a particular row or column of the supplied matrix are equal to zero?

  5. We consider two possible options for amending the function in this case.

    1. First option: ensure that your function terminates and returns a clear error message of what has gone wrong and why when a zero would be found to be in both the numerator and denominator of the odds ratio. Hint: The command stop() can be used to halt execution of a function and display an error message.
    2. Second option: in the case that a row or column of zeroes is found, add 0.5 to each cell of the table before calculating the odds ratio in the usual way. Make sure that your function returns a clear warning (as opposed to error) message explaining that an alteration to the supplied table was made before calculating the odds ratio because there was a row or column of zeroes present. Hint: The command warning() can be used to display a warning message (but not halt execution of the function).
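The difference between stop() and warning() mentioned in the hints can be seen in the following sketch. The function safe_log is a hypothetical helper, deliberately unrelated to odds ratios, and is used only to show the mechanics.

```r
# Hypothetical helper, invented for illustration only.
safe_log <- function( x ){
  if( !is.numeric( x ) ){
    # stop() halts the function immediately and signals an error.
    stop( "x must be numeric, but an object of class '",
          class( x )[1], "' was supplied." )
  }
  if( any( x <= 0 ) ){
    # warning() reports the problem but lets the function carry on.
    warning( "Non-positive values found in x; replacing them with NA before taking logs." )
    x[x <= 0] <- NA
  }
  log( x )
}

safe_log( c(1, 10) )       # runs without any message
safe_log( c(1, 0, 10) )    # returns a result, accompanied by a warning
try( safe_log( "a" ) )     # the error message is shown; try() lets the script continue
```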

4.1.5 Further Exploration: Mushrooms

Consider the mushrooms data in Table 2.7 of Section 2.4.3.4 (note that this is the table after combining cells). Explore further the topics covered in this practical session. You will need to manually enter the data from the lecture notes as a matrix to get started.

4.2 Practical 2 - Contingency Tables

This practical gives you the opportunity to develop the techniques learnt in Practical 1 by utilising the functions presented there, as well as providing a base for exploration of new ones.

Each of the three sections below considers exploring one of three datasets. You are encouraged to explore each of these datasets using the array of techniques previously discussed, as well as utilising the presented questions and suggestions to help you learn new ones.

4.2.1 Mushrooms

We begin by returning to the mushrooms data (Table 2.7, introduced in Section 2.4.3.4). This dataset was the basis for the open-ended final exercise of Practical 1. You may have explored this dataset in some, or all, of the following ways, amongst others:

  • Manually input the data into R.

  • Investigated adding relevant margins to the tables and exploring corresponding contingency tables of proportions.

  • Performed a \(\chi^2\) test of independence.

  • Generated visual representations of the data in the form of barplots, sieve diagrams and mosaic plots.

You are encouraged to keep exploring this dataset. In particular, the following sections prompt analysis of residuals, a GLR test, and investigation into nominal odds ratios.

4.2.1.1 Residual Analysis

Pearson and Adjusted residuals can be obtained from the output of chisq.test(). Use the help file for this function to find out how.
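Before consulting the help file, it may be useful to see the residuals computed directly from their definitions. The sketch below uses the Dose-Result data from Practical 1 rather than the mushrooms data, and assumes the usual definitions: Pearson residuals are (observed - expected) / sqrt(expected), and adjusted residuals further divide by the square root of (1 - row proportion) x (1 - column proportion). The object names are introduced for illustration.

```r
# Dose-Result counts, re-entered so that this snippet runs on its own.
DR_data <- matrix( c(41, 9,
                     37, 13), byrow = TRUE, ncol = 2,
                   dimnames = list( Dose = c("High", "Low"),
                                    Result = c("Success", "Failure") ) )

n <- sum( DR_data )
E <- outer( rowSums( DR_data ), colSums( DR_data ) ) / n   # expected counts

# Pearson residuals.
pearson_res <- ( DR_data - E ) / sqrt( E )

# Adjusted (standardised) residuals.
row_p <- rowSums( DR_data ) / n
col_p <- colSums( DR_data ) / n
adjusted_res <- pearson_res / sqrt( outer( 1 - row_p, 1 - col_p ) )

pearson_res
adjusted_res
```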

4.2.1.2 GLR Test

The function below uses the appropriate parameters provided from the output of a call to chisq.test() to perform a GLR test.

  • Read through the code and try to understand what each line does.

  • What information/results are being returned at the end of the function?

  • Apply the function on the mushroom data and interpret the results.

G2 <- function( data ){
  # computes the G2 test of independence 
  # for a two-way contingency table
  # data: IxJ matrix of observed counts
  X2 <- chisq.test( data )
  Ehat <- X2$expected
  df <- X2$parameter
  
  term.G2 <- data * log( data / Ehat ) 
  term.G2[data==0] <- 0 # Because if data == 0, we get NaN
  
  Gij <- 2 * term.G2 # Individual cell contributions to G2 statistic.
  dev_res <- sign( data - Ehat ) * sqrt( abs( Gij ) )
  G2 <- sum( Gij ) # G2 statistic
  p <- 1 - pchisq( G2, df ) 
  return( list( G2 = G2, df = df, p.value = p, 
                Gij = Gij, dev_res = dev_res ) ) 
}
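For reference when reading the function, the quantities it computes can be written as follows. The notation \(n_{ij}\) for the observed counts and \(\hat{E}_{ij}\) for the estimated expected counts is an assumption made here to mirror the objects data and Ehat in the code; it may differ from the notation used in the lecture notes.

```latex
\[
G_{ij} = 2\, n_{ij} \log\!\left( \frac{n_{ij}}{\hat{E}_{ij}} \right)
\quad (\text{with } G_{ij} = 0 \text{ when } n_{ij} = 0),
\qquad
G^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} G_{ij},
\]
\[
\text{dev\_res}_{ij} = \operatorname{sign}\!\left( n_{ij} - \hat{E}_{ij} \right) \sqrt{ \lvert G_{ij} \rvert },
\qquad
p = P\!\left( \chi^2_{df} > G^2 \right),
\]
```

where df is the degrees of freedom taken from the output of chisq.test().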

4.2.1.3 Nominal Odds Ratios

In Section 4.1.4, you were encouraged to write a function to calculate the odds ratio for a \(2 \times 2\) contingency table. Such a function (without concern for zeroes occurring) may be given by

OR <- function( M ){ ( M[1,1] * M[2,2] ) / ( M[1,2] * M[2,1] ) }

Based on this, what does the following function do? Test the function out on the mushrooms data and interpret the results.

nominal_OR <- function( M, ref_x = nrow( M ), ref_y =  ncol( M ) ){
  
  # I and J
  I <- nrow(M)
  J <- ncol(M)
  
  # Odds ratio matrix.
  OR_reference_IJ <- matrix( NA, nrow = I, ncol = J )
  for( i in 1:I ){
    for( j in 1:J ){
      OR_reference_IJ[i,j] <- OR( M = M[c(i,ref_x), c(j,ref_y)] )
    }
  }
  
  OR_reference <- OR_reference_IJ[-ref_x, -ref_y, drop = FALSE]
  
  return(OR_reference)
  
}

4.2.1.4 Visualisation and Residuals

We can add residual information to mosaic plots as follows:

mosaic( mushroom_data, 
        gp = shading_hcl, 
        residuals_type = "Pearson" )

Alternatively, residuals can be directly specified using the residuals argument (see the help file).

4.2.2 Dose-Result

Manually input into R the hypothetical data presented in Table 2.15. You may wish to attempt some or all of the following.

  • Utilise previously introduced skills on this dataset.

  • Amend the function presented in Section 4.2.1.3 to write two new functions

    • one to compute the set of \((I-1) \times (J-1)\) local odds ratios for \(i = 1,...,I-1\) and \(j = 1,...,J-1\); and
    • one to compute the set of \((I-1) \times (J-1)\) global odds ratios for \(i = 1,...,I-1\) and \(j = 1,...,J-1\).
  • Write a function that produces an \((I-1) \times (J-1)\) matrix of fourfold plots, each corresponding to the submatrices associated with each of the \((I-1) \times (J-1)\) local odds ratios for \(i = 1,...,I-1\) and \(j = 1,...,J-1\).

  • Perform a linear trend test on the data, either by writing your own code to calculate the relevant quantities, or by utilising the following function, courtesy of Kateri (2014).

linear.trend <- function( table, x, y ){
  # linear trend test for a 2-way table
  # PARAMETERS:
  # table: NI x NJ matrix (or table) of observed frequencies
  # x: vector of row scores (length NI)
  # y: vector of column scores (length NJ)
  # RETURNS:
  # r: Pearson’s sample correlation
  # M2: test statistic
  # p.value: two-sided p-value of the asymptotic M2-test
  NI <- nrow( table )
  NJ <- ncol( table )
  
  rowmarg <- addmargins( table )[,NJ+1][1:NI]
  colmarg <- addmargins( table )[NI+1,][1:NJ]
  n <- addmargins( table )[NI+1,NJ+1]
  
  xmean <- sum( rowmarg * x ) / n
  ymean <- sum( colmarg * y ) / n 
  xsq <- sqrt( sum( rowmarg * ( x - xmean )^2 ) )
  ysq <- sqrt( sum( colmarg * ( y - ymean )^2 ) ) 
  
  r <- sum( ( x - xmean ) %*% table %*% ( y - ymean ) ) / ( xsq * ysq )
  M2 <- (n-1)*r^2
  p.value <- 1 - pchisq( M2, 1 ) 
  return( list( r = r, M2 = M2, p.value = p.value ) ) 
}
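One way to see what the function above computes is to expand a table into one pair of scores per subject and use cor() directly, since r is the Pearson correlation between the row and column scores. The sketch below uses a hypothetical 3 x 3 table (the counts are invented and are not those of Table 2.15) with equally spaced scores, which is itself an assumption.

```r
# Hypothetical 3 x 3 table of counts, invented purely for illustration.
tab <- matrix( c(20, 10,  5,
                 10, 15, 10,
                  5, 10, 20), byrow = TRUE, ncol = 3 )
x <- 1:3   # assumed equally spaced row scores
y <- 1:3   # assumed equally spaced column scores

# Expand the table into one (row score, column score) pair per subject.
idx        <- which( tab >= 0, arr.ind = TRUE )   # row/col index of every cell
counts     <- tab[idx]
row_scores <- rep( x[ idx[, "row"] ], times = counts )
col_scores <- rep( y[ idx[, "col"] ], times = counts )

n       <- sum( tab )
r       <- cor( row_scores, col_scores )          # Pearson's sample correlation
M2      <- ( n - 1 ) * r^2                        # linear trend test statistic
p_value <- pchisq( M2, df = 1, lower.tail = FALSE )

c( r = r, M2 = M2, p.value = p_value )
```

Calling linear.trend( tab, x, y ) on the same inputs should return matching values of r, M2 and p.value.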

4.2.3 Titanic

Type ?Titanic into R to learn about the Titanic dataset. You may wish to do this in conjunction with one or more of the following:

Titanic
dim( Titanic )
dimnames( Titanic )

We will explore this dataset further in Practical 3. For now, you are encouraged to explore associations between the variables in the contingency table. Some ideas and questions to get you started are:

  • Generate some partial tables.

  • Generate some marginal tables. You can do this using the function margin.table (look at the help file).

  • Calculate partial and marginal odds ratios, and interpret the results.

  • Perform a \(\chi^2\)-test of independence between Class and Survival, marginalising over Sex and Age.

  • Can we perform a linear trend test between Class and Survival, having marginalised over Sex and Age? If you think we can, give it a go!

  • Produce a sieve or mosaic plot of the Titanic data and interpret.
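The distinction between partial and marginal tables can be sketched on a small hypothetical three-way array before trying it on Titanic. The counts and the names toy, Treatment, Outcome and Centre are invented for illustration.

```r
# Hypothetical 2 x 2 x 2 table of counts (values invented for illustration only).
toy <- array( c(10, 20, 30, 40,
                15, 25, 35, 45),
              dim = c(2, 2, 2),
              dimnames = list( Treatment = c("A", "B"),
                               Outcome   = c("Yes", "No"),
                               Centre    = c("C1", "C2") ) )

# Partial table: Treatment by Outcome at a fixed level of Centre.
partial_C1 <- toy[, , "C1"]

# Marginal table: Treatment by Outcome, summing over Centre.
marginal_TO <- margin.table( toy, margin = c(1, 2) )

partial_C1
marginal_TO
```

The same indexing and margin.table ideas carry over to the four-dimensional Titanic table, with the dimension names given by dimnames( Titanic ).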

References

Kateri, M. 2014. Contingency Table Analysis: Methods and Implementation Using R. New York: Birkhäuser.

  1. This example serves to demonstrate how to change the function FUN. Note that the mean function works for our purposes here (to get overall proportions for each column) only because the row totals are both the same.