Chapter 6 Practical Sheet Solutions
6.1 Practical 1 - Contingency Tables
6.1.1 Contingency Table Construction
We here provide solutions to the practical exercises of Section 5.1.1.4.
These questions involve using the contingency table from the penguin data introduced in Section 5.1.1.3
- Use
addmargins
to add row and column sum totals to the contingency table of penguin data.
<- addmargins( penguins_data )
penguins_table penguins_table
## Island
## Species Biscoe Dream Torgersen Sum
## Adelie 44 56 52 152
## Chinstrap 0 68 0 68
## Gentoo 124 0 0 124
## Sum 168 124 52 344
- Use
prop.table
to obtain a contingency table of proportions.
<- prop.table( penguins_data )
penguins_prop <- addmargins( penguins_prop )
penguins_prop_table penguins_prop_table
## Island
## Species Biscoe Dream Torgersen Sum
## Adelie 0.1279070 0.1627907 0.1511628 0.4418605
## Chinstrap 0.0000000 0.1976744 0.0000000 0.1976744
## Gentoo 0.3604651 0.0000000 0.0000000 0.3604651
## Sum 0.4883721 0.3604651 0.1511628 1.0000000
- Display the column-conditional probabilities, and use
addmargins
to add the column sums as an extra row at the bottom of the matrix (note: this should be a row of \(1\)’s).
<- prop.table( penguins_data, 2 )
penguins_prop_1 penguins_prop_1
## Island
## Species Biscoe Dream Torgersen
## Adelie 0.2619048 0.4516129 1.0000000
## Chinstrap 0.0000000 0.5483871 0.0000000
## Gentoo 0.7380952 0.0000000 0.0000000
# Add margin sums only over the rows...
<- addmargins( penguins_prop_1, margin = 1 )
penguins_prop_1_table penguins_prop_1_table
## Island
## Species Biscoe Dream Torgersen
## Adelie 0.2619048 0.4516129 1.0000000
## Chinstrap 0.0000000 0.5483871 0.0000000
## Gentoo 0.7380952 0.0000000 0.0000000
## Sum 1.0000000 1.0000000 1.0000000
# sum is 1 in each column as expected.
- Suppose I want the overall conditional proportions of penguin specie to appear in a final column on the right of the table. How would I achieve this?
# One way is as follows...
# Take the row sums (sum over the columns) on the initial
# contingency table data.
<- addmargins( penguins_data, 2 )
penguins_prop_2_table
# Calculate column-conditional proportions from this table.
<- prop.table( penguins_prop_2_table, 2 )
penguins_prop_2_table
# Add column sums at the bottom of the table.
<- addmargins( penguins_prop_2_table, 1 )
penguins_prop_2_table
penguins_prop_2_table
## Island
## Species Biscoe Dream Torgersen Sum
## Adelie 0.2619048 0.4516129 1.0000000 0.4418605
## Chinstrap 0.0000000 0.5483871 0.0000000 0.1976744
## Gentoo 0.7380952 0.0000000 0.0000000 0.3604651
## Sum 1.0000000 1.0000000 1.0000000 1.0000000
6.1.2 Chi-Square Test of Independence
We here provide solutions to the practical exercise of Section 5.1.2.1.
For the penguin data, apply the \(\chi^2\) test of independence between penguin specie and island of residence, and interpret the results.
# Apply the above tests on the Penguins data.
chisq.test( penguins_data )
##
## Pearson's Chi-squared test
##
## data: penguins_data
## X-squared = 299.55, df = 4, p-value < 2.2e-16
# Unsurprisingly the p-value is miniscule.
6.1.3 Barplots
We here provide solutions to the practical exercises of Section 5.1.3.1.
- Run
barplot( DR_prop )
. What does the plot show?
barplot( DR_prop )
We see the pair of barplots shown in Figure 6.1. In R, a matrix input to boxplot results in the column categories (in this case, Result
) defining the bars while the row categories (Dose
) form the stacked levels.
- Investigate the
density
argument of the functionbarplot
by both running the commands below, and also looking in the help file.
par(mfrow=c(1,3))
barplot( DR_prop, density = 70 )
barplot( DR_prop, density = 30 )
barplot( DR_prop, density = 0 )
As illustrated in Figure 6.2, we can see that the density
argument of boxplot
changes the level of shading for the (two) categories defined by the rows (Dose
) of the contingency table matrix.
- Add a title, and x- and y-axis labels, to the plot above.
We can achieve this with the code below, which yields the plot shown in Figure 6.3.
barplot( DR_prop, density = 30, main = "Comparison of Dose by Response",
xlab = "treatment outcome", ylab = "proportions" )
- Use the help file for
barplot
to find out how to add a legend to the plot.
We set the argument legend.text
to T
. The code below thus yields the plot shown in Figure 6.4.
barplot( DR_prop, density = 30, legend.text = T,
main = "Comparison of Dose by Response",
xlab = "treatment outcome", ylab = "proportions" )
- How would we alter the call to
barplot
in order to view dose proportion levels conditional on result (instead of the overall proportions corresponding to each cell). You may wish to use some of the table manipulation commands from Section 5.1.1.
One possible solution is shown below, which yields the plot shown in Figure 6.5. Note that we can move the legend
box using args.legend
and setting the usual x
and y
location arguments for a legend within a list (that is, if we don’t want it covering up part of the plot itself). The exact necessary values of x
and y
depend on your plotting window.
barplot( prop.table( DR_data, 2 ), density = 30, legend.text = T,
main = "Comparison of Dose by Response",
xlab = "treatment outcome", ylab = "proportions",
args.legend = list( x = 2.5, y = 1.25 ) )
- Suppose instead that we wish to display each dose level in a bar, with the proportion of successes and failures illustrated by the shading in each bar. How would we do that?
We can alter the code as shown below, which runs prop.table()
on the transposed version of the matrix DR_data
, and yields the plot shown in Figure 6.6.
barplot( prop.table( t( DR_data ), 2 ), density = 30, legend.text = T,
main = "Comparison of Response by Dose",
xlab = "dose level", ylab = "proportions",
args.legend = list( x = 2.6, y = 1.25 ) )
This is of course the more informative way to display the data, as we are likely more interested in the relative proportion of high
dose successes to low
dose successes, rather than, say, the proportion of successes
which happened to be high
dose, for example.
6.1.4 Sieve Diagrams
We here provide solutions to the practical exercises of Section 5.1.3.3.
- Run
library(vcd)
sieve( DR_data )
What is shown?
We see the plot given in Figure 6.7. A sieve (or parquet) diagram represents for an \(I \times J\) table the expected frequencies under independence as a collection of \(IJ\) rectangles, each containing a set of smaller squares/rectangles. Each of the larger rectangles has height and width proportional to the corresponding row and column marginal frequencies of the contingency table respectively. This way, the area of each rectangle is proportional to the expected-under-independence frequency for the corresponding cell. The number of smaller squares/rectangles in each larger rectangle corresponds to the observed frequency, hence a larger number of smaller squares indicate that the observed frequency was larger.
- Now run
sieve( DR_data, shade = T )
Does this make the data easier or harder to visualise?
We now see the plot given in Figure 6.8, which is the same plot as Figure 6.7, however now the observed frequency squares are coloured blue (undashed) if the observed frequency is larger than the expected frequency, and red (dashed) if the observed frequency is smaller than the expected frequency.
- Finally, run
sieve( DR_data, sievetype = "expected", shade = T )
What is shown now?
The plot shown in Figure 6.9, which just shows the expected frequencies under independence for each cell.
6.1.5 Odds Ratios in R
We here provide solutions to the practical exercises of Section 5.1.4.
This section seeks to test your understanding of odds ratios for \(2 \times 2\) contingency tables, as well as your ability to write simple functions in R.
- Write a function to compute the odds ratio of the success of event A with probability
pA
against the sucess of event B with probabilitypB
.
<- function(pA, pB){(pA/(1-pA))/(pB/(1-pB))}
Odds_Ratio Odds_Ratio( 1/3, 1/2 ) # test the function.
## [1] 0.5
- Write a function to compute the odds ratio for a \(2 \times 2\) contingency table. Test it on the Dose-response data above.
<- function( M ){ ( M[1,1] * M[2,2] ) / ( M[1,2] * M[2,1] ) }
OR OR( DR_data )
## [1] 1.600601
# M is a 2 by 2 matrix representing contingency table cell counts.
- Will there be an issue running your function from part (b) if exactly one of the cell counts of the supplied matrix is equal to zero?
Fine in this case as either 0
is returned if a cell count of 0 appears in the numerator of the odds ratio, or Inf
is returned if a cell count of 0 is in the denominator of the odds ratio.
- What about if both cells of a particular row or column of the supplied matrix are equal to zero?
We have a problem as there will be a zero in both numerator and denominator of the odds ratio, which is undefined. For example, run…
<- matrix( c(3,4,0,0), ncol = 2 ) # Define such a matrix.
AB OR( AB ) # NaN returned.
## [1] NaN
- We consider two possible options for amending the function in this case.
- First option: ensure that your function terminates and returns a clear error message of what has gone wrong and why when a zero would be found to be in both the numerator and denominator of the odds ratio. Hint: The command
stop
can be used to halt execution of a function and display an error message.
- First option: ensure that your function terminates and returns a clear error message of what has gone wrong and why when a zero would be found to be in both the numerator and denominator of the odds ratio. Hint: The command
<- function( M ){
OR_alt_1
<- rowSums( M )
M_row_sum <- colSums( M )
M_col_sum
if( prod( M_row_sum ) == 0 | prod( M_col_sum ) == 0 ){
stop( "At least one row sum or column sum of M is equal
to zero, hence the odds ratio is undefined." )
}else{
1,1] * M[2,2] ) / ( M[1,2] * M[2,1] )
( M[
}
}
OR_alt_1( DR_data ) # works fine with DR_data
## [1] 1.600601
OR_alt_1( AB ) # Execution is halted and our error returned.
## Error in OR_alt_1(AB): At least one row sum or column sum of M is equal
## to zero, hence the odds ratio is undefined.
- Second option: in the case that a row or column of zeroes is found, add 0.5 to each cell of the table before calculating the odds ratio in the usual way. Make sure that your function returns a clear warning (as opposed to error) message explaining that an alteration to the supplied table was made before calculating the odds ratio because there was a row or column of zeroes present. Hint: The command
warning()
can be used to display a warning message (but not halt execution of the function).
- Second option: in the case that a row or column of zeroes is found, add 0.5 to each cell of the table before calculating the odds ratio in the usual way. Make sure that your function returns a clear warning (as opposed to error) message explaining that an alteration to the supplied table was made before calculating the odds ratio because there was a row or column of zeroes present. Hint: The command
<- function( M ){
OR_alt_2
<- rowSums( M )
M_row_sum <- colSums( M )
M_col_sum
if( prod( M_row_sum ) == 0 | prod( M_col_sum ) == 0 ){
<- M + 0.5 * matrix( 1, nrow = 2, ncol = 2 )
M <- ( M[1,1] * M[2,2] ) / ( M[1,2] * M[2,1] )
OR warning( "At least one row sum or column sum of the supplied
matrix M was equal to zero, hence an amendment of 0.5 was added
to the value of each sum prior to calculating the odds ratio." )
return( OR )
}else{
1,1] * M[2,2] ) / ( M[1,2] * M[2,1] )
( M[
}
}
OR_alt_2( DR_data ) # works fine with DR_data
## [1] 1.600601
OR_alt_2( AB ) # Alternative odds ratio calculated and warning returned.
## Warning in OR_alt_2(AB): At least one row sum or column sum of the supplied
## matrix M was equal to zero, hence an amendment of 0.5 was added
## to the value of each sum prior to calculating the odds ratio.
## [1] 0.7777778
6.1.6 Mushrooms
We here provide some possible analysis of the mushrooms data, with comments made in R, to get you started. The resulting barplots and sieve diagram are shown in Figures 6.10 and 6.11 respectively.
# We create a matrix with the data in.
<- matrix( c(101, 399, 57, 487,
mushroom_data 12, 389, 150, 428 ), byrow = TRUE, ncol = 4 )
# Add dimension names as follows.
dimnames( mushroom_data ) <- list( Edibility = c("Edible", "Poisonous"),
Cap_Shape = c("bell", "flat", "knobbed",
"convex/conical") )
# Have a look.
mushroom_data
## Cap_Shape
## Edibility bell flat knobbed convex/conical
## Edible 101 399 57 487
## Poisonous 12 389 150 428
# Let's look at the proportion of each shape of mushroom that are
# edible or poisonous.
<- addmargins( mushroom_data, 2, FUN = mean )
mushroom_table <- prop.table( mushroom_table, 2 )
mushroom_table <- addmargins(mushroom_table, 1 )
mushroom_table mushroom_table
## Cap_Shape
## Edibility bell flat knobbed convex/conical mean
## Edible 0.8938053 0.5063452 0.2753623 0.5322404 0.5160652
## Poisonous 0.1061947 0.4936548 0.7246377 0.4677596 0.4839348
## Sum 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
# Let's perform a chi-square test.
chisq.test( mushroom_data )
##
## Pearson's Chi-squared test
##
## data: mushroom_data
## X-squared = 113.84, df = 3, p-value < 2.2e-16
# This matches the result found manually in the lectures.
# Barplot
barplot( prop.table( mushroom_data, margin = 2 ), density = 50,
main = "Comparison of Edibility by Cap Shape", cex.main = 0.8,
xlab = "treatment outcome", ylab = "proportions",
legend.text = T, args.legend = list( x = 5.1, y = 1.25 ) )
# Sieve diagram
sieve( mushroom_data )
# Much can be drawn form this diagram. For example,
# we can see that there are far more edible bell mushrooms in our
# sample than would be expected under an assumption of independence.
6.2 Practical 2 - Contingency Tables
Practical 2 is designed to be exploratory, hence it is not possible to provide comprehensive solutions. We discuss some analysis of each dataset in accordance with the suggestions provided.
6.2.1 Mushrooms
Inputting the mushrooms data and some exploratory analysis was presented in Section 6.1.6.
We can assign the result of the \(\chi^2\) test to a named object as follows
<- chisq.test( mushroom_data )
chisq_Mush chisq_Mush
##
## Pearson's Chi-squared test
##
## data: mushroom_data
## X-squared = 113.84, df = 3, p-value < 2.2e-16
This matches the result found manually in the lectures. Very strong evidence to reject the null hypothesis of independence between edibility and cap shape.
6.2.1.1 Residual Analysis
We can use the \(\chi^2\) test result to obtain Pearson residuals…
$residuals chisq_Mush
## Cap_Shape
## Edibility bell flat knobbed convex/conical
## Edible 5.589590 -0.3798221 -4.820746 0.6810948
## Poisonous -5.772167 0.3922285 4.978209 -0.7033419
…and also Adjusted residuals.
$stdres chisq_Mush
## Cap_Shape
## Edibility bell flat knobbed convex/conical
## Edible 8.269282 -0.6987975 -7.314099 1.322946
## Poisonous -8.269282 0.6987975 7.314099 -1.322946
Both residuals suggest that bell-capped mushrooms and knobbed-capped mushrooms contribute most towards the \(X^2\) statistic. For example, there are far more edible bell-capped mushrooms than would be expected if the two variables were independent.
6.2.1.2 GLR Test
We can apply the \(G^2\) GLR test on the mushroom data as follows.
<- G2( mushroom_data )
G2mush $p.value G2mush
## [1] 0
$dev_res G2mush
## Cap_Shape
## Edibility bell flat knobbed convex/conical
## Edible 10.53326 -3.895337 -8.462185 5.482674
## Poisonous -6.03326 3.933403 11.005295 -5.394469
$Gij G2mush
## Cap_Shape
## Edibility bell flat knobbed convex/conical
## Edible 110.94946 -15.17365 -71.60858 30.05971
## Poisonous -36.40022 15.47166 121.11651 -29.10030
The \(p\)-value is close to 0, as expected, hence strong evidence to reject \(\mathcal{H}_0\). The deviance residuals are as calculated in lectures. Due to the relation between deviance residuals and the \(G^2\) statistic, classes of individual variables with large marginal sums of deviance residual values contribute most towards the \(G^2\) statistic. In this case, we again notice that the bell cap-shaped mushrooms have a larger contribution toward the \(G^2\) statistic, as evidenced by the large residual values, and also the large marginal sum in residual values over the two classes of edibility for bell-shaped caps.
6.2.2 Dose-Result Data
Input of data can be performed as follows:
<- matrix( c(47, 25, 12,
DoseResult 36, 22, 18,
41, 60, 55), byrow = TRUE, ncol = 3 )
dimnames( DoseResult ) <- list( Dose=c("High",
"Medium",
"Low"),
Result=c("Success",
"Partial",
"Failure") )
<- addmargins(DoseResult) DoseResultTable
Functions for local and global odds ratios respectively, then applied on the Dose-Result data to produce the results obtained in lectures, are:
<- function( M ){
local_OR
# I and J
<- nrow(M)
I <- ncol(M)
J
# Odds ratio matrix.
<- matrix( NA, nrow = I-1, ncol = J-1 )
OR_local for( i in 1:(I-1) ){
for( j in 1:(J-1) ){
<- OR( M = M[c(i,i+1), c(j,j+1)] )
OR_local[i,j]
}
}
return(OR_local)
}
<- function( M ){
global_OR
# I and J
<- nrow(M)
I <- ncol(M)
J
# Odds ratio matrix.
<- matrix( NA, nrow = I-1, ncol = J-1 )
OR_global for( i in 1:(I-1) ){
for( j in 1:(J-1) ){
<- OR( M = matrix( c( sum( M[1:i,1:j] ),
OR_global[i,j] sum( M[(i+1):I,1:j] ),
sum( M[1:i,(j+1):J] ),
sum( M[(i+1):I,(j+1):J] ) ),
nrow = 2 ) )
}
}
return(OR_global)
}
local_OR( DoseResult )
## [,1] [,2]
## [1,] 1.148889 1.704545
## [2,] 2.394678 1.120370
global_OR( DoseResult )
## [,1] [,2]
## [1,] 2.557038 2.754717
## [2,] 3.023440 2.359736
A plot of multiple fourfold plots can be obtained using the following function, then applied on the Dose-Result data to produce the plots shown in Figure 6.12.
<- function ( data ){
ffold_local
# I and J
<- nrow(data)
I <- ncol(data)
J
par( mfrow = c(I-1, J-1) )
for( i in 1:(I-1) ){
for( j in 1:(J-1) ){
<- data[c(i,i+1),c(j,j+1)]
sub_data fourfoldplot( sub_data )
}
}
}
# Apply fourfold local OR plot function.
ffold_local( DoseResult )
To compute the statistic for the linear trend test manually, we could…
<- 1:3
score_x <- 1:3
score_y
<- crossprod( t(score_x), t(score_y) )
score_xy <- sum(DoseResult)
n
<- sum( score_xy * DoseResult )
sum_xy <- sum( score_x * rowSums(DoseResult) )
sum_x <- sum( score_y * colSums(DoseResult) )
sum_y <- sum( score_x^2 * rowSums(DoseResult) )
sum_x2 <- sum( score_y^2 * colSums(DoseResult) )
sum_y2
<- ( n * sum_xy - sum_x * sum_y ) / ( sqrt( n * sum_x2 - sum_x^2 ) *
rxy sqrt( n * sum_y2 - sum_y^2 ) )
<- (n-1) * rxy^2
M2
<- 1-pchisq(M2,1) p.value
Or using the function provided we can simply do
linear.trend( table = DoseResult, x = score_x, y = score_y )
## $r
## [1] 0.2709127
##
## $M2
## [1] 23.11902
##
## $p.value
## [1] 1.522773e-06
yielding the same result. The test provides evidence to reject the null hypothesis of independence between the two variables.
6.2.3 Titanic
A partial table cross-classifying Class
and Sex
for Age = Child
and Survived = No
can be obtained in the usual way of selecting particular elements of an array.
="Child", Survived="No"] Titanic[,, Age
## Sex
## Class Male Female
## 1st 0 0
## 2nd 0 0
## 3rd 35 17
## Crew 0 0
Marginal tables are produced using margin.table()
with argument margin
denoting the variables of interest (that is, those to still be included in the table). For example, a marginal table of Class
and Sex
summing over margins 3 and 4 (Age
and Survived
, thus ignoring them), and a marginal vector for Survived
, are obtained as follows
margin.table( Titanic, margin = c(1,2) )
## Sex
## Class Male Female
## 1st 180 145
## 2nd 179 106
## 3rd 510 196
## Crew 862 23
margin.table( Titanic, margin = 4 )
## Survived
## No Yes
## 1490 711
We can use previous odds ratio functions on relevant \(2\times2\) marginal/partial tables.
Note that whether we can treat Class
as ordinal is debatable, depending on how category Crew
might fit into that. If we feel that we can treat Class
as ordinal, then global Odds ratios might also be applicable. We calculate nominal, local and global marginal odds ratios for Class
and Sex
having marginalised over Age
and Survived
.
nominal_OR( margin.table( Titanic, margin = c(1,2) ) )
## [,1]
## [1,] 0.03312265
## [2,] 0.04505757
## [3,] 0.06942800
local_OR( margin.table( Titanic, margin = c(1,2) ) )
## [,1]
## [1,] 0.7351185
## [2,] 0.6489826
## [3,] 0.0694280
global_OR( margin.table( Titanic, margin = c(1,2) ) )
## [,1]
## [1,] 0.26012139
## [2,] 0.22830253
## [3,] 0.05187198
The nominal marginal odds ratios reflect the fact the proportion of men (or more, precisely, the odds that a randomly selected person of a particular Class
is a man) is much lower for all other levels of Class
relative to Crew
.
Both local and global marginal odds ratios provide evidence that there is an association between Class
and Sex
, particularly for Crew
in comparison with any of the other levels.
For calculating global odds ratios, we here treat Crew
as the fourth category of an ordinal variable Class
following on from 3rd
.
This position in the ordinal ranking (as opposed to being otherwise positioned in the Class
ordering) vastly influences the global odds ratios (but less so the local ones, which are applicable for nominal variables anyway).
A Chi square test for independence of Class
and Survival
, marginalising
over Sex
and Age
, is obtained by
chisq.test( margin.table( Titanic, margin = c(1,4) ) )
##
## Pearson's Chi-squared test
##
## data: margin.table(Titanic, margin = c(1, 4))
## X-squared = 190.4, df = 3, p-value < 2.2e-16
# An unsurprising overwhelming amount of evidence
# that these two are associated.
A linear trend test - applicability depends if we feel that we can treat Class
as an ordinal variable.
linear.trend( table = margin.table( Titanic, margin = c(1,4) ),
x = 1:4, y = 1:2 )
## $r
## [1] -0.2713954
##
## $M2
## [1] 162.042
##
## $p.value
## [1] 0
A sieve and mosaic plot could be obtained by running the following commands.
sieve( Titanic )
mosaic( Titanic )