Chapter 5 Practical Sheet Solutions

5.1 Practical 1 - Contingency Tables

5.1.1 Contingency Table Construction

We here provide solutions to the practical exercises of Section 4.1.1.4.

These questions involve using the contingency table from the penguin data introduced in Section 4.1.1.3

  1. Use addmargins to add row and column sum totals to the contingency table of penguin data.
penguins_table <- addmargins( penguins_data )
penguins_table
##            Island
## Species     Biscoe Dream Torgersen Sum
##   Adelie        44    56        52 152
##   Chinstrap      0    68         0  68
##   Gentoo       124     0         0 124
##   Sum          168   124        52 344
  1. Use prop.table to obtain a contingency table of proportions.
penguins_prop <- prop.table( penguins_data )
penguins_prop_table <- addmargins( penguins_prop )
penguins_prop_table
##            Island
## Species        Biscoe     Dream Torgersen       Sum
##   Adelie    0.1279070 0.1627907 0.1511628 0.4418605
##   Chinstrap 0.0000000 0.1976744 0.0000000 0.1976744
##   Gentoo    0.3604651 0.0000000 0.0000000 0.3604651
##   Sum       0.4883721 0.3604651 0.1511628 1.0000000
  1. Display the column-conditional probabilities, and use addmargins to add the column sums as an extra row at the bottom of the matrix (note: this should be a row of \(1\)’s).
penguins_prop_1 <- prop.table( penguins_data, 2 )
penguins_prop_1
##            Island
## Species        Biscoe     Dream Torgersen
##   Adelie    0.2619048 0.4516129 1.0000000
##   Chinstrap 0.0000000 0.5483871 0.0000000
##   Gentoo    0.7380952 0.0000000 0.0000000
# Add margin sums only over the rows...
penguins_prop_1_table <- addmargins( penguins_prop_1, margin = 1 )
penguins_prop_1_table
##            Island
## Species        Biscoe     Dream Torgersen
##   Adelie    0.2619048 0.4516129 1.0000000
##   Chinstrap 0.0000000 0.5483871 0.0000000
##   Gentoo    0.7380952 0.0000000 0.0000000
##   Sum       1.0000000 1.0000000 1.0000000
# sum is 1 in each column as expected.
  1. Suppose I want the overall conditional proportions of penguin specie to appear in a final column on the right of the table. How would I achieve this?
# One way is as follows...

# Take the row sums (sum over the columns) on the initial 
# contingency table data.
penguins_prop_2_table <- addmargins( penguins_data, 2 ) 

# Calculate column-conditional proportions from this table.
penguins_prop_2_table <- prop.table( penguins_prop_2_table, 2 ) 

# Add column sums at the bottom of the table.
penguins_prop_2_table <- addmargins( penguins_prop_2_table, 1 ) 

penguins_prop_2_table
##            Island
## Species        Biscoe     Dream Torgersen       Sum
##   Adelie    0.2619048 0.4516129 1.0000000 0.4418605
##   Chinstrap 0.0000000 0.5483871 0.0000000 0.1976744
##   Gentoo    0.7380952 0.0000000 0.0000000 0.3604651
##   Sum       1.0000000 1.0000000 1.0000000 1.0000000

5.1.2 Chi-Square Test of Independence

We here provide solutions to the practical exercise of Section 4.1.2.1.

For the penguin data, apply the \(\chi^2\) test of independence between penguin specie and island of residence, and interpret the results.

# Apply the above tests on the Penguins data.
chisq.test( penguins_data )
## 
##  Pearson's Chi-squared test
## 
## data:  penguins_data
## X-squared = 299.55, df = 4, p-value < 2.2e-16
# Unsurprisingly the p-value is miniscule.

5.1.3 Barplots

We here provide solutions to the practical exercises of Section 4.1.3.1.

  1. Run barplot( DR_prop ). What does the plot show?
barplot( DR_prop )
Barplots of the Dose-Result contingency table data.

Figure 5.1: Barplots of the Dose-Result contingency table data.

We see the pair of barplots shown in Figure 5.1. In R, a matrix input to boxplot results in the column categories (in this case, Result) defining the bars while the row categories (Dose) form the stacked levels.

  1. Investigate the density argument of the function barplot by both running the commands below, and also looking in the help file.
par(mfrow=c(1,3))
barplot( DR_prop, density = 70 )
barplot( DR_prop, density = 30 )
barplot( DR_prop, density = 0 )
Barplots of the Dose-Result contingency table data, investigating the density parameter.

Figure 5.2: Barplots of the Dose-Result contingency table data, investigating the density parameter.

As illustrated in Figure 5.2, we can see that the density argument of boxplot changes the level of shading for the (two) categories defined by the rows (Dose) of the contingency table matrix.

  1. Add a title, and x- and y-axis labels, to the plot above.

We can achieve this with the code below, which yields the plot shown in Figure 5.3.

barplot( DR_prop, density = 30, main = "Comparison of Dose by Response",
         xlab = "treatment outcome", ylab = "proportions" )
Barplots of the Dose-Result contingency table data, now with title and axis labels.

Figure 5.3: Barplots of the Dose-Result contingency table data, now with title and axis labels.

  1. Use the help file for barplot to find out how to add a legend to the plot.

We set the argument legend.text to T. The code below thus yields the plot shown in Figure 5.4.

barplot( DR_prop, density = 30, legend.text = T, 
         main = "Comparison of Dose by Response", 
         xlab = "treatment outcome", ylab = "proportions" )
Barplots of the Dose-Result contingency table data, now with a legend.

Figure 5.4: Barplots of the Dose-Result contingency table data, now with a legend.

  1. How would we alter the call to barplot in order to view dose proportion levels conditional on result (instead of the overall proportions corresponding to each cell). You may wish to use some of the table manipulation commands from Section 4.1.1.

One possible solution is shown below, which yields the plot shown in Figure 5.5. Note that we can move the legend box using args.legend and setting the usual x and y location arguments for a legend within a list (that is, if we don’t want it covering up part of the plot itself). The exact necessary values of x and y depend on your plotting window.

barplot( prop.table( DR_data, 2 ), density = 30, legend.text = T, 
         main = "Comparison of Dose by Response", 
         xlab = "treatment outcome", ylab = "proportions", 
         args.legend = list( x = 2.5, y = 1.25 )  )
Barplots of dose level proportions corresponding to each treatment outcome.

Figure 5.5: Barplots of dose level proportions corresponding to each treatment outcome.

  1. Suppose instead that we wish to display each dose level in a bar, with the proportion of successes and failures illustrated by the shading in each bar. How would we do that?

We can alter the code as shown below, which runs prop.table() on the transposed version of the matrix DR_data, and yields the plot shown in Figure 5.6.

barplot( prop.table( t( DR_data ), 2 ), density = 30, legend.text = T,
         main = "Comparison of Response by Dose", 
         xlab = "dose level", ylab = "proportions", 
         args.legend = list( x = 2.6, y = 1.25 ) )
Barplots of treatment outcome proportions corresponding to each dose level.

Figure 5.6: Barplots of treatment outcome proportions corresponding to each dose level.

This is of course the more informative way to display the data, as we are likely more interested in the relative proportion of high dose successes to low dose successes, rather than, say, the proportion of successes which happened to be high dose, for example.

5.1.4 Sieve Diagrams

We here provide solutions to the practical exercises of Section 4.1.3.3.

  1. Run
library(vcd)
sieve( DR_data )
Sieve diagram comparing observed frequencies in relation to expected-under-independence frequencies for the Dose-Result data.

Figure 5.7: Sieve diagram comparing observed frequencies in relation to expected-under-independence frequencies for the Dose-Result data.

What is shown?

We see the plot given in Figure 5.7. A sieve (or parquet) diagram represents for an \(I \times J\) table the expected frequencies under independence as a collection of \(IJ\) rectangles, each containing a set of smaller squares/rectangles. Each of the larger rectangles has height and width proportional to the corresponding row and column marginal frequencies of the contingency table respectively. This way, the area of each rectangle is proportional to the expected-under-independence frequency for the corresponding cell. The number of smaller squares/rectangles in each larger rectangle corresponds to the observed frequency, hence a larger number of smaller squares indicate that the observed frequency was larger.

  1. Now run
sieve( DR_data, shade = T )
Sieve diagram comparing observed frequencies in relation to expected-under-independence frequencies for the Dose-Result data.

Figure 5.8: Sieve diagram comparing observed frequencies in relation to expected-under-independence frequencies for the Dose-Result data.

Does this make the data easier or harder to visualise?

We now see the plot given in Figure 5.8, which is the same plot as Figure 5.7, however now the observed frequency squares are coloured blue (undashed) if the observed frequency is larger than the expected frequency, and red (dashed) if the observed frequency is smaller than the expected frequency.

  1. Finally, run
sieve( DR_data, sievetype = "expected", shade = T )
Sieve diagram showing expected-under-independence frequencies for the Dose-Result data.

Figure 5.9: Sieve diagram showing expected-under-independence frequencies for the Dose-Result data.

What is shown now?

The plot shown in Figure 5.9, which just shows the expected frequencies under independence for each cell.

5.1.5 Odds Ratios in R

We here provide solutions to the practical exercises of Section 4.1.4.

This section seeks to test your understanding of odds ratios for \(2 \times 2\) contingency tables, as well as your ability to write simple functions in R.

  1. Write a function to compute the odds ratio of the success of event A with probability pA against the sucess of event B with probability pB.
Odds_Ratio <- function(pA, pB){(pA/(1-pA))/(pB/(1-pB))}
Odds_Ratio( 1/3, 1/2 ) # test the function.
## [1] 0.5
  1. Write a function to compute the odds ratio for a \(2 \times 2\) contingency table. Test it on the Dose-response data above.
OR <- function( M ){ ( M[1,1] * M[2,2] ) / ( M[1,2] * M[2,1] ) }
OR( DR_data )
## [1] 1.600601
# M is a 2 by 2 matrix representing contingency table cell counts.  
  1. Will there be an issue running your function from part (b) if exactly one of the cell counts of the supplied matrix is equal to zero?

Fine in this case as either 0 is returned if a cell count of 0 appears in the numerator of the odds ratio, or Inf is returned if a cell count of 0 is in the denominator of the odds ratio.

  1. What about if both cells of a particular row or column of the supplied matrix are equal to zero?

We have a problem as there will be a zero in both numerator and denominator of the odds ratio, which is undefined. For example, run…

AB <- matrix( c(3,4,0,0), ncol = 2 ) # Define such a matrix.
OR( AB ) # NaN returned.
## [1] NaN
  1. We consider two possible options for amending the function in this case.
    1. First option: ensure that your function terminates and returns a clear error message of what has gone wrong and why when a zero would be found to be in both the numerator and denominator of the odds ratio. Hint: The command stop can be used to halt execution of a function and display an error message.
OR_alt_1 <- function( M ){ 
  
  M_row_sum <- rowSums( M )
  M_col_sum <- colSums( M )
  
  if( prod( M_row_sum ) == 0 | prod( M_col_sum ) == 0 ){
    stop( "At least one row sum or column sum of M is equal 
          to zero, hence the odds ratio is undefined." )
  }
  else{ 
    ( M[1,1] * M[2,2] ) / ( M[1,2] * M[2,1] ) 
  }
}

OR_alt_1( DR_data )  # works fine with DR_data
## [1] 1.600601
OR_alt_1( AB ) # Execution is halted and our error returned.
## Error in OR_alt_1(AB): At least one row sum or column sum of M is equal 
##           to zero, hence the odds ratio is undefined.
    1. Second option: in the case that a row or column of zeroes is found, add 0.5 to each cell of the table before calculating the odds ratio in the usual way. Make sure that your function returns a clear warning (as opposed to error) message explaining that an alteration to the supplied table was made before calculating the odds ratio because there was a row or column of zeroes present. Hint: The command warning() can be used to display a warning message (but not halt execution of the function).
OR_alt_2 <- function( M ){ 
  
  M_row_sum <- rowSums( M )
  M_col_sum <- colSums( M )
  
  if( prod( M_row_sum ) == 0 | prod( M_col_sum ) == 0 ){
    M <- M + 0.5 * matrix( 1, nrow = 2, ncol = 2 )
    OR <- ( M[1,1] * M[2,2] ) / ( M[1,2] * M[2,1] )
    warning( "At least one row sum or column sum of the supplied 
             matrix M was equal to zero, hence an amendment of 0.5 was added 
             to the value of each sum prior to calculating the odds ratio." )
    return( OR )
  }
  else{ 
    ( M[1,1] * M[2,2] ) / ( M[1,2] * M[2,1] ) 
  }
}

OR_alt_2( DR_data ) # works fine with DR_data
## [1] 1.600601
OR_alt_2( AB ) # Alternative odds ratio calculated and warning returned.
## Warning in OR_alt_2(AB): At least one row sum or column sum of the supplied 
##              matrix M was equal to zero, hence an amendment of 0.5 was added 
##              to the value of each sum prior to calculating the odds ratio.
## [1] 0.7777778

5.1.6 Mushrooms

We here provide some possible analysis of the mushrooms data, with comments made in R, to get you started. The resulting barplots and sieve diagram are shown in Figures 5.10 and 5.11 respectively.

# We create a matrix with the data in.
mushroom_data <- matrix( c(101, 399, 57, 487, 
                           12, 389, 150, 428 ), byrow = TRUE, ncol = 4 )

# Add dimension names as follows.
dimnames( mushroom_data ) <- list( Edibility = c("Edible", "Poisonous"),
                                   Cap_Shape = c("bell", "flat", "knobbed",
                                                 "convex/conical") )

# Have a look.
mushroom_data
##            Cap_Shape
## Edibility   bell flat knobbed convex/conical
##   Edible     101  399      57            487
##   Poisonous   12  389     150            428
# Let's look at the proportion of each shape of mushroom that are 
# edible or poisonous.
mushroom_table <- addmargins( mushroom_data, 2, FUN = mean )
mushroom_table <- prop.table( mushroom_table, 2 )
mushroom_table <- addmargins(mushroom_table, 1 )
mushroom_table
##            Cap_Shape
## Edibility        bell      flat   knobbed convex/conical      mean
##   Edible    0.8938053 0.5063452 0.2753623      0.5322404 0.5160652
##   Poisonous 0.1061947 0.4936548 0.7246377      0.4677596 0.4839348
##   Sum       1.0000000 1.0000000 1.0000000      1.0000000 1.0000000
# Let's perform a chi-square test.
chisq.test( mushroom_data )
## 
##  Pearson's Chi-squared test
## 
## data:  mushroom_data
## X-squared = 113.84, df = 3, p-value < 2.2e-16
# This matches the result found manually in the lectures.
# Barplot
barplot( prop.table( mushroom_data, margin = 2 ), density = 50, 
         main = "Comparison of Edibility by Cap Shape", cex.main = 0.8,
         xlab = "treatment outcome", ylab = "proportions", 
         legend.text = T, args.legend = list( x = 5.1, y = 1.25 ) )
Barplots for the mushroom data of edibility proportions corresponding to each cap shape.

Figure 5.10: Barplots for the mushroom data of edibility proportions corresponding to each cap shape.

# Sieve diagram
sieve( mushroom_data  )
Sieve diagram showing expected-under-independence frequencies for the mushroom data.

Figure 5.11: Sieve diagram showing expected-under-independence frequencies for the mushroom data.

# Much can be drawn form this diagram.  For example,
# we can see that there are far more edible bell mushrooms in our 
# sample than would be expected under an assumption of independence.