Chapter 4 Practical Sheets
4.1 Practical 1 - Contingency Tables
In this practical, we explore the application in R of some of the techniques learnt in the lectures. The practical may seem long; however, some of it should be a nice refresher of R techniques already learnt, and other parts are hopefully gentle and straightforward to read and follow (although the odd more challenging exercise is thrown in!). You are encouraged to work through to the end in your own time. Note also that Data Science and Statistical Computing II (DSSC II) was not a prerequisite for this course, and hence is not necessary. However, if you did take that course, you are welcome to experiment with the skills picked up there in conjunction with what you learn here (particularly on the visualisation side of things).
Finally, whilst solutions to the practical sheets will be provided, it is by attempting to answer the exercises yourself first that your practical coding skills will be developed!
4.1.1 Construction of Contingency Tables
4.1.1.1 Construction from Matrices
We here consider entering \(2 \times 2\) contingency tables manually into R. We do this as follows for the data in Table 2.3.
DR_data <- matrix( c(41,  9,
                     37, 13 ), byrow = TRUE, ncol = 2 )
dimnames( DR_data ) <- list( Dose = c("High", "Low"),
                             Result = c("Success", "Failure") )
DR_data
##       Result
## Dose   Success Failure
##   High      41       9
##   Low       37      13
We can add row and column sum margins as follows.
DR_contingency_table <- addmargins( DR_data )
DR_contingency_table
##       Result
## Dose   Success Failure Sum
##   High      41       9  50
##   Low       37      13  50
##   Sum       78      22 100
4.1.1.2 Contingency Tables of Proportions
Proportions can be obtained using prop.table.
DR_prop <- prop.table( DR_data )
DR_prop_table <- addmargins( DR_prop )
DR_prop_table
##       Result
## Dose   Success Failure Sum
##   High    0.41    0.09 0.5
##   Low     0.37    0.13 0.5
##   Sum     0.78    0.22 1.0
The row conditional proportions are derived by prop.table(DR_data, 1). Analogously, column conditional proportions can be obtained using prop.table(DR_data, 2).
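As a quick check of the two conditioning directions, the following minimal sketch re-enters the Dose-Result counts and confirms that row-conditional proportions sum to 1 across each row, whilst column-conditional proportions sum to 1 down each column. The object names row_prop and col_prop are illustrative choices only.

```r
# Dose-Result counts as entered earlier in this section
DR_data <- matrix( c(41,  9,
                     37, 13 ), byrow = TRUE, ncol = 2,
                   dimnames = list( Dose = c("High", "Low"),
                                    Result = c("Success", "Failure") ) )

row_prop <- prop.table( DR_data, 1 )  # condition on Dose (rows)
col_prop <- prop.table( DR_data, 2 )  # condition on Result (columns)

rowSums( row_prop )  # each row of row_prop sums to 1
colSums( col_prop )  # each column of col_prop sums to 1
col_prop
```

For instance, the High/Success entry of col_prop is 41/78, the proportion of all successes that came from the high dose.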
The addmargins function also provides versatility for specifying margins for particular dimensions. Here we request only the row totals, which appear as an extra Sum column (trivially equal to 1 in each row, as expected).
DR_prop_1 <- prop.table( DR_data, 1 )
DR_prop_1_table <- addmargins( DR_prop_1, margin = 2 )
DR_prop_1_table
##       Result
## Dose   Success Failure Sum
##   High    0.82    0.18   1
##   Low     0.74    0.26   1
We can amend the function argument FUN so that an alternative operation is performed. Since both rows contain the same total number of subjects, taking the mean of each column yields the overall proportion in that column (see footnote 45):
DR_prop_2_table <- addmargins( DR_prop_1_table, margin = 1, FUN = mean )
DR_prop_2_table
##       Result
## Dose   Success Failure Sum
##   High    0.82    0.18   1
##   Low     0.74    0.26   1
##   mean    0.78    0.22   1
Notice that
DR_prop_3_table <- addmargins( DR_prop_1,
                               margin = c(1,2),
                               FUN = list(mean, sum) )
## Margins computed over dimensions
## in the following order:
## 1: Dose
## 2: Result
DR_prop_3_table
##       Result
## Dose   Success Failure sum
##   High    0.82    0.18   1
##   Low     0.74    0.26   1
##   mean    0.78    0.22   1
yields the same result.
Argument margin dictates the order of the dimensions over which the operations are applied, and the function list FUN dictates which function should be applied in each case. So here, R first performs mean over dimension 1 (the rows, in this case Dose), and then sum over dimension 2 (the columns, in this case Result), as is confirmed by the additional information R fed back (setting argument quiet=TRUE would remove this).
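To see why averaging the row-conditional proportions gives the overall column proportions only when the row totals are equal, consider the following minimal sketch. The counts in unequal_data are hypothetical (made up for illustration, with row totals 40 and 60), and the object names are not used elsewhere in these notes.

```r
# Hypothetical counts with unequal row totals (40 and 60)
unequal_data <- matrix( c(30, 10,
                          20, 40 ), byrow = TRUE, ncol = 2,
                        dimnames = list( Dose = c("High", "Low"),
                                         Result = c("Success", "Failure") ) )

# Row-conditional proportions, with the column means added as an extra row
row_prop  <- prop.table( unequal_data, 1 )
with_mean <- addmargins( row_prop, margin = 1, FUN = mean, quiet = TRUE )
with_mean

# The overall column proportions, computed directly from the counts
overall <- colSums( unequal_data ) / sum( unequal_data )
overall
```

Here the mean row gives (0.75 + 1/3)/2 for Success, whereas the overall proportion of successes is 50/100 = 0.5, so the two no longer agree.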
4.1.1.3 Construction from Dataframes
Here we suppose we have our data collated in a dataframe that we wish to cross-classify into a contingency table. We demonstrate doing this on the penguins dataset in library palmerpenguins (remember to install this library using install.packages("palmerpenguins") if you have not already done so).
library(palmerpenguins)
head(penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## # ℹ 2 more variables: sex <fct>, year <int>
You will notice that the penguins data is in the dataframe format known in R lingo as a tibble. For those of you that did not take DSSC II, a tibble is a user-friendly type of dataframe that works with the user-friendly R set of libraries tidyverse. For our purposes, a tibble can just be viewed as any other dataframe.
Firstly, find out information about the dataset using ?penguins.
We are interested in whether different types of penguin typically reside on different islands, hence we wish to tabulate the dataframe as follows.
penguins_data <- table( Species = penguins$species, Island = penguins$island )
penguins_data
##            Island
## Species     Biscoe Dream Torgersen
##   Adelie        44    56        52
##   Chinstrap      0    68         0
##   Gentoo       124     0         0
In this case we may be fairly certain that there is a connection between these variables…but we can test out some techniques using this contingency table nonetheless.
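As an aside, base R offers more than one route from a dataframe to a contingency table. The minimal sketch below compares table with the formula interface of xtabs on a small hypothetical dataframe (toy is made up for illustration and is not the penguins data), so that it runs without any extra libraries.

```r
# Hypothetical toy dataframe with two categorical variables
toy <- data.frame(
  species = factor( c("A", "A", "B", "B", "B", "C") ),
  island  = factor( c("X", "Y", "X", "X", "Y", "Y") )
)

# Cross-classify using table(), naming the dimensions explicitly
tab_table <- table( Species = toy$species, Island = toy$island )

# The same cross-classification using the formula interface of xtabs()
tab_xtabs <- xtabs( ~ species + island, data = toy )

tab_table
all( as.vector( tab_table ) == as.vector( tab_xtabs ) )  # same counts
```

Note that table drops missing values by default; its useNA argument (for example useNA = "ifany") can be used if NA should be counted as a category in its own right.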
4.1.1.4 Exercises
These questions involve using the contingency table from the penguin data introduced in Section 4.1.1.3.
(a) Use addmargins to add row and column sum totals to the contingency table of penguin data.
(b) Use prop.table to obtain a contingency table of proportions.
(c) Display the column-conditional probabilities, and use addmargins to add the column sums as an extra row at the bottom of the matrix (note: this should be a row of \(1\)’s).
(d) Suppose I want the overall proportions of each penguin species to appear in a final column on the right of the table. How would I achieve this?
4.1.2 Chi-Square Test of Independence
Run the command chisq.test on DR_data with argument correct set to FALSE.
chisq.test( DR_data, correct = FALSE )
##
## Pearson's Chi-squared test
##
## data: DR_data
## X-squared = 0.9324, df = 1, p-value = 0.3342
chisq.test runs a \(\chi^2\) test of independence, and setting the argument correct to FALSE tells R not to use continuity correction. Look at the help file for chisq.test, and you will see that the default is for R to use Yates’ continuity correction (see Section 2.4.3.5).
chisq.test( DR_data )
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: DR_data
## X-squared = 0.52448, df = 1, p-value = 0.4689
Notice that the \(p\)-value with continuity correction is larger, as expected. The ML estimates of the expected cell frequencies can be obtained by running
chisq.test( DR_data )$expected
##       Result
## Dose   Success Failure
##   High      39      11
##   Low       39      11
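The statistics reported by chisq.test above can be reproduced by hand from the expected frequencies, which may help in connecting the R output to the formulae from the lectures. The following is a minimal sketch; the object names E, X2 and X2_yates are illustrative choices.

```r
# Dose-Result counts as entered in Section 4.1.1.1
DR_data <- matrix( c(41,  9,
                     37, 13 ), byrow = TRUE, ncol = 2,
                   dimnames = list( Dose = c("High", "Low"),
                                    Result = c("Success", "Failure") ) )

n <- sum( DR_data )

# Expected frequencies under independence: (row total * column total) / n
E <- outer( rowSums( DR_data ), colSums( DR_data ) ) / n

# Pearson's statistic, degrees of freedom and p-value (no correction)
X2 <- sum( ( DR_data - E )^2 / E )
df <- ( nrow( DR_data ) - 1 ) * ( ncol( DR_data ) - 1 )
p  <- pchisq( X2, df, lower.tail = FALSE )

# Statistic with Yates' continuity correction (each |O - E| reduced by 0.5)
X2_yates <- sum( ( abs( DR_data - E ) - 0.5 )^2 / E )

c( X2 = X2, df = df, p.value = p, X2_yates = X2_yates )
```

The values agree with the two chisq.test outputs above (0.9324 with p-value 0.3342, and 0.52448 with the correction).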
4.1.3 Data Visualisation
Here we introduce several types of data visualisation methods for categorical data presented in the form of contingency tables, and apply them to the Dose-Result contingency table.
4.1.3.1 Barplots
The exercises in this section should introduce you to (or refresh your memory of, if you have seen it before) the barplot function, as well as refresh your memory of generic R plotting function arguments.
(a) Run barplot( DR_prop ). What do the plots show?
(b) Investigate the density argument of the function barplot by both running the commands below, and also looking in the help file.
barplot( DR_prop, density = 70 )
barplot( DR_prop, density = 30 )
barplot( DR_prop, density = 0 )
(c) Add a title, and x- and y-axis labels, to the plot above.
(d) Use the help file for barplot to find out how to add a legend to the plot.
(e) How would we alter the call to barplot in order to view dose proportion levels conditional on result (instead of the overall proportions corresponding to each cell)? You may wish to use some of the table manipulation commands from Section 4.1.1.
(f) Suppose instead that we wish to display each dose level in a bar, with the proportion of successes and failures illustrated by the shading in each bar. How would we do that?
4.1.3.2 Fourfold plots
Try running the following to obtain the plot shown in Figure 4.1.
fourfoldplot( DR_data )
A fourfold plot provides a graphical expression of the association in a \(2 \times 2\) contingency table, visualising the odds ratio. Each cell entry is represented as a quarter-circle (denoted by the middle of the three rings).
We see that the shaded diagonal areas are represented by quarter-circles with greater area than the off-diagonal areas, hence the association between the two binary classification variables is positive, that is, the odds ratio \(r_{12}\) is greater than 1. The strength of association can be visually emphasised by choice of colour (although it is subjective which colour scheme is best…). For example, to obtain a red/blue colour scheme, we can run the following to obtain the plot shown in Figure 4.2.
fourfoldplot( DR_data, color = c("red", "blue") )
The inner and outer rings of the quarter-circles correspond to confidence rings. The observed frequencies support the null hypothesis of no association between the variables if the rings for adjacent quarters overlap (we will explore this hypothesis test in the lectures…).
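To get a numerical feel for what the plot is conveying, the sketch below computes the sample odds ratio for the Dose-Result table, together with a standard large-sample (Wald) 95% confidence interval on the log scale. This interval is an assumption made for illustration; it is not claimed to be exactly the calculation that fourfoldplot uses to draw its confidence rings.

```r
# Dose-Result counts as entered in Section 4.1.1.1
DR_data <- matrix( c(41,  9,
                     37, 13 ), byrow = TRUE, ncol = 2,
                   dimnames = list( Dose = c("High", "Low"),
                                    Result = c("Success", "Failure") ) )

# Sample odds ratio: cross-product ratio of the four cells
or_hat <- ( DR_data[1,1] * DR_data[2,2] ) / ( DR_data[1,2] * DR_data[2,1] )

# Large-sample standard error of the log odds ratio
se_log_or <- sqrt( sum( 1 / DR_data ) )

# Wald 95% interval, computed on the log scale and transformed back
ci <- exp( log( or_hat ) + c(-1, 1) * qnorm( 0.975 ) * se_log_or )

or_hat
ci
```

The estimate exceeds 1, in line with the larger diagonal quarter-circles, whilst the interval contains 1, which is consistent with the non-significant \(\chi^2\) test in Section 4.1.2.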
4.1.3.3 Sieve Diagrams
We here investigate the sieve plotting function of library vcd. Remember to look at the help file to help you understand the various arguments for this function.
- Run
library(vcd)
sieve( DR_data )
What is shown?
- Now run
library(vcd)
sieve( DR_data, shade = TRUE )
Does this make the data easier or harder to visualise?
- Finally, run
sieve( DR_data, sievetype = "expected", shade = TRUE )
What is shown now?
4.1.3.4 Mosaic Plots
Run
mosaic( DR_data )
Mosaic plots for two-way tables graphically display the cells of a contingency table as rectangular areas of size proportional to the corresponding observed frequencies. Were the classification variables independent, the areas would be perfectly aligned in rows and columns. The worse the alignment, the stronger the evidence against independence. Furthermore, the specific cells of the table that deviate most from independence can be identified, which may help to explain the underlying pattern of association.
4.1.4 Odds Ratios in R
This section seeks to test your understanding of odds ratios for \(2 \times 2\) contingency tables, as well as your ability to write simple functions in R.
(a) Write a function to compute the odds ratio of the success of event A with probability pA against the success of event B with probability pB.
(b) Write a function to compute the odds ratio for a \(2 \times 2\) contingency table. Test it on the Dose-Result data above.
(c) Will there be an issue running your function from part (b) if exactly one of the cell counts of the supplied matrix is equal to zero?
(d) What about if both cells of a particular row or column of the supplied matrix are equal to zero?
We consider two possible options for amending the function in this case.
- First option: ensure that your function terminates and returns a clear error message of what has gone wrong and why when a zero would be found to be in both the numerator and denominator of the odds ratio. Hint: The command stop() can be used to halt execution of a function and display an error message.
- Second option: in the case that a row or column of zeroes is found, add 0.5 to each cell of the table before calculating the odds ratio in the usual way. Make sure that your function returns a clear warning (as opposed to error) message explaining that an alteration to the supplied table was made before calculating the odds ratio because there was a row or column of zeroes present. Hint: The command warning() can be used to display a warning message (but not halt execution of the function).
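If you have not used stop() and warning() before, the following minimal sketch shows how the two behave inside a function. The function safe_divide and its argument on_zero are hypothetical, invented purely for illustration; the sketch deliberately uses a simple ratio rather than an odds ratio, so that the exercise itself is left for you.

```r
# Hypothetical helper (illustration only): divide num by den, handling den == 0
safe_divide <- function( num, den, on_zero = c("stop", "warn") ){
  on_zero <- match.arg( on_zero )
  if( den == 0 ){
    if( on_zero == "stop" ){
      # stop() halts execution immediately and reports an error
      stop( "Denominator is zero, so the ratio is undefined." )
    }
    # warning() reports the problem, but execution continues
    warning( "Denominator is zero: 0.5 added to numerator and denominator." )
    num <- num + 0.5
    den <- den + 0.5
  }
  num / den
}

safe_divide( 6, 3 )

# tryCatch() lets us capture the error message instead of halting the script
err_msg <- tryCatch( safe_divide( 1, 0 ),
                     error = function( e ) conditionMessage( e ) )
err_msg

# With on_zero = "warn" a value is still returned, alongside a warning
adjusted <- safe_divide( 1, 0, on_zero = "warn" )
adjusted
```

The design choice between the two options is whether the caller should be forced to deal with the problem (an error) or simply be told that an adjustment was made (a warning).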
4.1.5 Further Exploration: Mushrooms
Consider the mushrooms data in Table 2.7 of Section 2.4.3.4 (note that this is the table after combining cells). Explore further the topics covered in this practical session. You will need to manually enter the data from the lecture notes as a matrix to get started.
4.2 Practical 2 - Contingency Tables
This practical gives you the opportunity to develop the techniques learnt in Practical 1 by utilising the functions presented there, as well as providing a base for exploration of new ones.
Each of the three sections below considers exploring one of three datasets. You are encouraged to explore each of these datasets using the array of techniques previously discussed, as well as utilising the presented questions and suggestions to help you learn new ones.
4.2.1 Mushrooms
We begin by returning to the mushrooms data (Table 2.7, introduced in Section 2.4.3.4). This dataset was the basis for the open-ended final exercise of Practical 1. You may have explored this dataset in some, or all, of the following ways, amongst others:
- Manually input the data into R.
- Investigated adding relevant margins to the tables and exploring corresponding contingency tables of proportions.
- Performed a \(\chi^2\) test of independence.
- Generated visual representations of the data in the form of barplots, sieve diagrams and mosaic plots.
You are encouraged to keep exploring this dataset. In particular, the following sections prompt analysis of residuals, a GLR test, and investigation into nominal odds ratios.
4.2.1.1 Residual Analysis
Pearson and Adjusted residuals can be obtained from the output of chisq.test(). Use the help file for this function to find out how.
4.2.1.2 GLR Test
The function below uses the appropriate parameters provided from the output of a call to chisq.test() to perform a GLR test.
- Read through the code and try to understand what each line does.
- What information/results are being returned at the end of the function?
- Apply the function on the mushroom data and interpret the results.
G2 <- function( data ){
  # computes the G2 test of independence
  # for a two-way contingency table of
  # data: IxJ matrix
  X2 <- chisq.test( data )
  Ehat <- X2$expected
  df <- X2$parameter

  term.G2 <- data * log( data / Ehat )
  term.G2[data==0] <- 0 # Because if data == 0, we get NaN

  Gij <- 2 * term.G2 # Individual cell contributions to G2 statistic.
  dev_res <- sign( data - Ehat ) * sqrt( abs( Gij ) )
  G2 <- sum( Gij ) # G2 statistic
  p <- 1 - pchisq( G2, df )
  return( list( G2 = G2, df = df, p.value = p,
                Gij = Gij, dev_res = dev_res ) )
}
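As a cross-check whilst reading the function, the \(G^2\) statistic can also be computed directly in a few lines. The minimal sketch below does this for the Dose-Result data of Practical 1 (not the mushrooms data, which is left for the exercise). It assumes there are no zero cells, so no special handling of NaN is needed, and the object names O, E and G2_stat are illustrative choices.

```r
# Dose-Result counts from Practical 1
O <- matrix( c(41,  9,
               37, 13 ), byrow = TRUE, ncol = 2,
             dimnames = list( Dose = c("High", "Low"),
                              Result = c("Success", "Failure") ) )

# Expected frequencies under independence
E <- outer( rowSums( O ), colSums( O ) ) / sum( O )

# G2 statistic: twice the sum over cells of observed * log( observed / expected )
G2_stat <- 2 * sum( O * log( O / E ) )
df      <- ( nrow( O ) - 1 ) * ( ncol( O ) - 1 )
p_value <- pchisq( G2_stat, df, lower.tail = FALSE )

# Pearson's statistic for comparison
X2_stat <- sum( ( O - E )^2 / E )

c( G2 = G2_stat, X2 = X2_stat, df = df, p.value = p_value )
```

For this table the two statistics are very close to one another, as we would expect when the observed counts lie near the expected counts.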
4.2.1.3 Nominal Odds Ratios
In Section 4.1.4, you were encouraged to write a function to calculate the odds ratio for a \(2 \times 2\) contingency table. Such a function (without concern for zeroes occurring) may be given by
OR <- function( M ){ ( M[1,1] * M[2,2] ) / ( M[1,2] * M[2,1] ) }
Based on this, what does the following function do? Test the function out on the mushrooms data and interpret the results.
nominal_OR <- function( M, ref_x = nrow( M ), ref_y = ncol( M ) ){

  # I and J
  I <- nrow(M)
  J <- ncol(M)

  # Odds ratio matrix.
  OR_reference_IJ <- matrix( NA, nrow = I, ncol = J )
  for( i in 1:I ){
    for( j in 1:J ){
      OR_reference_IJ[i,j] <- OR( M = M[c(i,ref_x), c(j,ref_y)] )
    }
  }

  OR_reference <- OR_reference_IJ[-ref_x, -ref_y, drop = FALSE]

  return(OR_reference)
}
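Two pieces of R indexing in the function above may be unfamiliar: selecting a submatrix using vectors of row and column indices, and the argument drop = FALSE. The minimal sketch below illustrates both mechanics on a hypothetical \(3 \times 3\) matrix (the entries 1 to 9 are made up for illustration), without interpreting the function itself.

```r
# Hypothetical 3x3 matrix, filled by row with the numbers 1 to 9
M <- matrix( 1:9, nrow = 3, byrow = TRUE,
             dimnames = list( X = c("x1", "x2", "x3"),
                              Y = c("y1", "y2", "y3") ) )

# Rows 1 and 3 together with columns 2 and 3 form a 2x2 submatrix
sub <- M[ c(1, 3), c(2, 3) ]
sub

# Negative indices remove rows/columns: here the last row and last column
M[ -3, -3 ]

# When only one row remains, R simplifies the result to a vector by default...
dim( M[ -c(2, 3), -3 ] )
# ...unless drop = FALSE is supplied, which keeps the matrix structure
dim( M[ -c(2, 3), -3, drop = FALSE ] )
```

Keeping the matrix structure matters when a table has only two rows or two columns, since the result of removing the reference category would otherwise no longer be a matrix.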
4.2.2 Dose-Result
Manually input into R the hypothetical data presented in Table 2.15. You may wish to attempt some or all of the following.
Utilise previously introduced skills on this dataset.
Amend the function presented in Section 4.2.1.3 to write two new functions
- one to compute the set of \((I-1) \times (J-1)\) local odds ratios for \(i = 1,...,I-1\) and \(j = 1,...,J-1\); and
- one to compute the set of \((I-1) \times (J-1)\) global odds ratios for \(i = 1,...,I-1\) and \(j = 1,...,J-1\).
Write a function that produces an \((I-1) \times (J-1)\) matrix of fourfold plots, each corresponding to the submatrices associated with each of the \((I-1) \times (J-1)\) local odds ratios for \(i = 1,...,I-1\) and \(j = 1,...,J-1\).
Perform a linear trend test on the data, either by writing your own code to calculate the relevant quantities, or by utilising the following function, courtesy of Kateri (2014).
linear.trend <- function( table, x, y ){
  # linear trend test for a 2-way table
  # PARAMETERS:
  # table: IxJ contingency table of frequencies (matrix or table)
  # x: vector of row scores
  # y: vector of column scores
  # RETURNS:
  # r: Pearson's sample correlation
  # M2: test statistic
  # p.value: two-sided p-value of the asymptotic M2-test
  NI <- nrow( table )
  NJ <- ncol( table )

  rowmarg <- addmargins( table )[,NJ+1][1:NI]
  colmarg <- addmargins( table )[NI+1,][1:NJ]
  n <- addmargins( table )[NI+1,NJ+1]

  xmean <- sum( rowmarg * x ) / n
  ymean <- sum( colmarg * y ) / n
  xsq <- sqrt( sum( rowmarg * ( x - xmean )^2 ) )
  ysq <- sqrt( sum( colmarg * ( y - ymean )^2 ) )

  r <- sum( ( x - xmean ) %*% table %*% ( y - ymean ) ) / ( xsq * ysq )
  M2 <- (n-1)*r^2
  p.value <- 1 - pchisq( M2, 1 )
  return( list( r = r, M2 = M2, p.value = p.value ) )
}
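If you would like an independent check of your linear trend calculations, one option is to expand the table into one pair of scores per subject and use the built-in cor function. The minimal sketch below does this for a hypothetical \(2 \times 3\) table (the counts and the equally spaced scores are made up for illustration, and are not the data of Table 2.15).

```r
# Hypothetical 2x3 table of counts
tab <- matrix( c(10, 20, 30,
                 30, 20, 10 ), byrow = TRUE, nrow = 2,
               dimnames = list( Row = c("r1", "r2"),
                                Col = c("c1", "c2", "c3") ) )
x <- 1:2  # assumed row scores
y <- 1:3  # assumed column scores

# One (row score, column score) pair per subject: each cell's scores are
# repeated as many times as the cell count
counts     <- as.vector( tab )
row_scores <- rep( x[ as.vector( row( tab ) ) ], times = counts )
col_scores <- rep( y[ as.vector( col( tab ) ) ], times = counts )

n  <- sum( tab )
r  <- cor( row_scores, col_scores )  # Pearson's sample correlation
M2 <- ( n - 1 ) * r^2                # linear trend test statistic
p  <- pchisq( M2, 1, lower.tail = FALSE )

c( r = r, M2 = M2, p.value = p )
```

Here the correlation is negative, reflecting that the second row places more of its counts in the lower-scoring columns than the first row does.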
4.2.3 Titanic
Type ?Titanic into R to learn about the Titanic dataset. You may wish to do this in conjunction with one or more of the following.
Titanic
dim( Titanic )
dimnames( Titanic )
We will explore this dataset further in Practical 3. For now, you are encouraged to explore associations between the variables in the contingency table. Some ideas and questions to get you started are:
- Generate some partial tables.
- Generate some marginal tables. You can do this using the function margin.table (look at the help file).
- Calculate partial and marginal odds ratios, and interpret the results.
- Perform a \(\chi^2\)-test of independence between Class and Survived, marginalising over Sex and Age.
- Can we perform a linear trend test between Class and Survived, having marginalised over Sex and Age? If you think we can, give it a go!
- Produce a sieve or mosaic plot of the Titanic data and interpret.
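To illustrate the difference between partial and marginal tables without giving away the Titanic exercises, the minimal sketch below uses another built-in three-way table, HairEyeColor (with dimensions Hair, Eye and Sex). The object names male_table and hair_eye are illustrative choices.

```r
# HairEyeColor is a built-in 4 x 4 x 2 table (Hair x Eye x Sex)
dim( HairEyeColor )
dimnames( HairEyeColor )

# Partial table: Hair by Eye at a fixed level of Sex
male_table <- HairEyeColor[ , , "Male" ]
male_table

# Marginal table: Hair by Eye, summing over Sex
hair_eye <- margin.table( HairEyeColor, c(1, 2) )
hair_eye

# The marginal table retains every subject; a partial table only a subset
sum( hair_eye ) == sum( HairEyeColor )
sum( male_table ) < sum( HairEyeColor )
```

The second argument of margin.table lists the dimensions to keep, so c(1, 2) retains Hair and Eye and sums over Sex.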
References
45. This example serves to demonstrate how to change the function FUN; I would like to reiterate that the mean function works for our purposes here (to get overall proportions for each column) only because the row totals are both the same.