14 Comparing qualitative data between individuals

So far, you have learnt to ask a RQ, design a study, collect the data, and describe the data. In this chapter, you will learn to:

  • compare qualitative data between individuals using the appropriate graphs.
  • compare qualitative data between individuals using odds ratios and summary tables.

14.1 Introduction

Relational RQs compare groups. This chapter considers how to compare qualitative variables in different groups. Tables and graphs are very useful for this purpose.

14.2 Two-way tables

When more than one qualitative variable is recorded for each individual, the data can be collated into a table. When two qualitative variables are cross-tabulated, the resulting table is called a two-way table. As always, the categories for each variable should be exhaustive (cover all values) and exclusive (observations belong to one and only one category).

Example 14.1 (Two-way tables) Charig et al. (1986) compared two treatments for kidney stones to determine which had a higher success rate. Data were collected from \(700\) UK patients, on two qualitative variables:

  • the treatment method ('A' or 'B'): the explanatory variable.
  • the result (procedure 'success' or 'failure'): the response variable.

Both variables are qualitative with two levels, and each treatment was used on \(350\) patients. Treatment A was used from 1972--1980, and Treatment B from 1980--1985; that is, treatments were not randomly allocated, and so confounding may be present. For this reason, the researchers also recorded the size of the kidney stone ('small' or 'large') as one possible confounding variable. Firstly, consider just the small stones (Julious and Mullee 1994), displayed in the two-way table in Table 14.1.

TABLE 14.1: Numbers for small kidney stones
Success Failure Total
Method A \(\phantom{0}81\) \(\phantom{0}6\) \(\phantom{0}87\)
Method B \(234\) \(36\) \(270\)
Total \(315\) \(42\) \(357\)
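
The totals in Table 14.1 can be checked with a few lines of code. The following is a minimal sketch, assuming the counts are stored as a nested dictionary; the names `counts`, `row_totals` and `column_totals` are illustrative only, and are not part of the original study.

```python
# Counts from Table 14.1 (small kidney stones).
# Rows are the methods; columns are the results.
counts = {
    "Method A": {"Success": 81, "Failure": 6},
    "Method B": {"Success": 234, "Failure": 36},
}


def row_totals(table):
    """Total count in each row (here: each method)."""
    return {row: sum(cells.values()) for row, cells in table.items()}


def column_totals(table):
    """Total count in each column (here: each result)."""
    totals = {}
    for cells in table.values():
        for column, count in cells.items():
            totals[column] = totals.get(column, 0) + count
    return totals


rows = row_totals(counts)
cols = column_totals(counts)
grand_total = sum(rows.values())

print(rows)         # {'Method A': 87, 'Method B': 270}
print(cols)         # {'Success': 315, 'Failure': 42}
print(grand_total)  # 357
```

The row totals and column totals both add to the same grand total, which is a useful check that every observation belongs to exactly one cell.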

14.3 Summary tables by rows and columns

Each variable in a two-way table can be analysed separately, using percentages or proportions (Sect. 13.4) or odds (Sect. 13.5). For instance, for the Result variable in Table 14.1:

  • the percentage of procedures that were successful is \(315/357\times 100 = 88.2\)%.
  • the odds that a procedure was successful is \(315/42 = 7.5\); that is, there were \(7.5\) times as many successful procedures as unsuccessful procedures.
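
These two calculations can be reproduced directly from the column totals of Table 14.1. This is a small sketch only; the variable names are chosen for illustration.

```python
# Column totals from Table 14.1 (small kidney stones).
successes, failures = 315, 42
total = successes + failures

# Percentage: successes relative to ALL procedures.
percentage_success = successes / total * 100

# Odds: successes relative to FAILURES only.
odds_success = successes / failures

print(f"{percentage_success:.1f}%")  # 88.2%
print(f"{odds_success:.1f}")         # 7.5
```

The only difference between the two summaries is the denominator: the total for a percentage, and the count of the other outcome for odds.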

However, to compare Methods A and B, these odds and percentages can be computed for each row (or column) separately.

Example 14.2 (Comparing by row) The data in Table 14.1 can be summarised by computing proportions or percentages by row. The rows refer to the different Methods, so this will compare success percentages for the two methods.

For the small kidney stones (Table 14.1), the row percentages (Table 14.2) give the percentage of successes for each Method, since the rows represent the counts for Methods A and B. Row proportions allow the proportions within the rows (i.e., for each Method) to be compared:

  • Method A: \(81 \div 87 = 0.931\) (or \(93.1\)%) of operations in the sample were successful; and
  • Method B: \(234\div 270 = 0.867\) (or \(86.7\)%) of operations in the sample were successful.

For small kidney stones, Method A is slightly more successful (\(93.1\)%) than Method B (\(86.7\)%) in the sample. These percentages are collated in Table 14.2.

Odds can also be computed:

  • Method A: The odds of success is \(81/6 = 13.5\). This means there are \(13.5\) times as many successful procedures as failures for Method A.
  • Method B: The odds of success is \(234/36 = 6.5\). This means there are \(6.5\) times as many successful procedures as failures for Method B.

The odds of a success is far greater for Method A than Method B in the sample.
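
The row-wise calculations in this example can be sketched as follows. The function name `summarise_row` and the dictionary layout are assumptions made for illustration.

```python
# Counts from Table 14.1 (small kidney stones).
counts = {
    "Method A": {"Success": 81, "Failure": 6},
    "Method B": {"Success": 234, "Failure": 36},
}


def summarise_row(cells, event="Success", non_event="Failure"):
    """Proportion and odds of the event within one row of a two-way table."""
    total = cells[event] + cells[non_event]
    return {
        "proportion": cells[event] / total,         # event count / row total
        "odds": cells[event] / cells[non_event],    # event count / non-event count
    }


summary = {method: summarise_row(cells) for method, cells in counts.items()}

for method, values in summary.items():
    print(f"{method}: proportion {values['proportion']:.3f}, odds {values['odds']:.1f}")
# Method A: proportion 0.931, odds 13.5
# Method B: proportion 0.867, odds 6.5
```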

TABLE 14.2: Row percentages for small kidney stones (from Table 14.1)
Success Failure Total
Method A \(93.1\) \(6.9\) \(100\)
Method B \(86.7\) \(13.3\) \(100\)
TABLE 14.3: Column percentages for small kidney stones (from Table 14.1)
Success Failure
Method A \(25.7\) \(14.3\)
Method B \(74.3\) \(85.7\)
Total \(100.0\) \(100.0\)

Rather than comparing methods (in the rows), the procedure results can be compared (i.e., the columns).

Example 14.3 (Comparing by column) For the small kidney stones (Table 14.1), the column proportions (Table 14.3) give the proportion of procedures using each Method within each column (i.e., for successes and for failures), since the columns contain the procedure results. Column proportions allow the proportions (or percentages) within columns to be compared:

  • Successful procedures: \(81 \div 315 = 0.257\) (or \(25.7\)%) in the sample were with Method A; and
  • Unsuccessful procedures: \(6\div 42 = 0.143\) (or \(14.3\)%) in the sample were with Method A.

Row percentages seem more intuitive than column percentages here: they compare the success percentage for each method.

Odds can also be computed:

  • Successes: the odds of a success coming from Method A is \(81/234 = 0.346\): there are \(0.346\) times as many Method A procedures as Method B procedures among the successes.
  • Failures: the odds of a failure coming from Method A is \(6/36 = 0.167\): there are \(0.167\) times as many Method A procedures as Method B procedures among the failures.

The odds of a success being a Method A procedure (\(0.346\)) is quite different from the odds of a failure being a Method A procedure (\(0.167\)).
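
One way to think about column percentages is as row percentages of the transposed table. The sketch below takes that approach; `transpose` and `row_percentages` are hypothetical helper names, not functions from any particular package.

```python
# Counts from Table 14.1 (small kidney stones).
counts = {
    "Method A": {"Success": 81, "Failure": 6},
    "Method B": {"Success": 234, "Failure": 36},
}


def transpose(table):
    """Swap rows and columns of a two-way table."""
    flipped = {}
    for row, cells in table.items():
        for column, count in cells.items():
            flipped.setdefault(column, {})[row] = count
    return flipped


def row_percentages(table):
    """Percentage of each row total found in each cell (rounded to 1 dp)."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: round(100 * n / total, 1) for col, n in cells.items()}
    return result


# Column percentages of the original table (compare with Table 14.3).
by_result = row_percentages(transpose(counts))
print(by_result["Success"])  # {'Method A': 25.7, 'Method B': 74.3}
print(by_result["Failure"])  # {'Method A': 14.3, 'Method B': 85.7}
```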

14.4 Graphs

When a qualitative variable is compared across different groups (i.e., comparing between individuals), options for plotting include:

  • Stacked bar charts (Sect. 14.4.1);
  • Side-by-side bar charts (Sect. 14.4.2); or
  • Dot charts (Sect. 14.4.3).

14.4.1 Stacked bar charts

The data can be graphed by using a bar for each level of one variable, and stacking the bars for the levels of the second variable.

Example 14.4 (Stacked bar charts) For the kidney-stone data in Example 14.1, a stacked bar chart can be created by producing a bar for each method, and stacking the successes and failures for each method (Fig. 14.1, top left panel).

Rather than counts, the percentages within each group can be plotted instead (Fig. 14.1, bottom left panel).
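
To draw a stacked bar, each segment needs a starting height (where the previous segment ended) and its own height. The following sketch computes only these values, for counts or for percentages; the function name `stacked_segments` is an assumption for illustration, and the drawing itself is left to whichever plotting software is used.

```python
# Counts from Table 14.1 (small kidney stones).
counts = {
    "Method A": {"Success": 81, "Failure": 6},
    "Method B": {"Success": 234, "Failure": 36},
}


def stacked_segments(cells, as_percentage=False):
    """Return (level, bottom, height) for each segment of one stacked bar."""
    total = sum(cells.values())
    scale = 100 / total if as_percentage else 1
    segments = []
    bottom = 0
    for level, count in cells.items():
        height = count * scale
        segments.append((level, bottom, height))
        bottom += height  # the next segment starts where this one ends
    return segments


for method, cells in counts.items():
    print(method, stacked_segments(cells))
# Method A [('Success', 0, 81), ('Failure', 81, 6)]
# Method B [('Success', 0, 234), ('Failure', 234, 36)]
```

With `as_percentage=True`, every bar has the same total height (\(100\)), which makes the success percentages for the two methods easier to compare when the groups have very different sizes.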

FIGURE 14.1: Six plots for the small kidney-stone data

14.4.2 Side-by-side bar charts

Instead of stacking the success and failure bars on top of each other, these bars can be placed side-by-side for each method.

Example 14.5 (Side-by-side bar charts) For the kidney-stone data in Example 14.1, a side-by-side bar chart can be created by producing two bars for each method (one for failures; one for successes), and placing these side-by-side (Fig. 14.1, centre panels). Again, numbers or percentages within each method can be graphed.

14.4.3 Dot charts

Dots (or other symbols) can be used in place of the bars in a side-by-side bar chart.

Example 14.6 (Dot charts) For the data in Example 14.1, a dot chart can be created by placing plotting symbols for each result (one for failures; one for successes) side-by-side for each method (Fig. 14.1, right panels). Again, numbers or percentages can be used.

14.4.4 Other variations

Many variations of these charts are possible, by making certain choices:

  • use a stacked bar chart, side-by-side bar chart, or dot chart.
  • use percentages or counts with one of the variables.
  • use the counts (or percentage) on either the horizontal or vertical axis.
  • decide which variable can be used as the first division of the data.

The guiding principle remains: the purpose of a graph is to display the information in the clearest, simplest possible way, to facilitate understanding the message(s) in the data.

Using a computer to create graphs is recommended, since this makes it easy to try different variations to find the graph that best displays the message in the data.

14.5 Summarising the comparison: odds ratios

The small kidney stone data (Table 14.1) can be summarised using odds:

  • Method A: the odds of success are \(13.5\) (\(13.5\) times as many successes as failures).
  • Method B: the odds of success are \(6.5\) (\(6.5\) times as many successes as failures).

The odds of success for Method A and Method B are very different. In the sample, the odds of success for Method A is \(13.5\div 6.5 = 2.08\) times the odds of a success for Method B; that is, about twice as large. This value is the odds ratio (OR). The sample odds ratio is a statistic, and the (unknown) population odds ratio is a parameter.

Definition 14.1 (Odds Ratio (OR)) The odds ratio is the ratio of the odds of an event in one group, compared to the odds of the same event in a different group: \[ \text{Odds ratio} = \frac{\text{Odds of an event in Group A}} {\text{Odds of the same event in Group B}}. \]

Example 14.7 (Odds ratios) For the small kidney stone data, the odds of a success for Method A is \(81\div6 = 13.5\). The odds of a success for Method B is \(234\div36 = 6.5\). The odds ratio is then computed as \(13.5\div 6.5 = 2.08\). The odds have been computed with the rows.

This means that the odds of a success for Method A is about \(2.08\) times the odds of a success for Method B.
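
The calculation in Example 14.7 can be sketched in code. The functions `odds` and `odds_ratio` are illustrative helpers written for this example; each group is given as a pair (event count, non-event count).

```python
def odds(event, non_event):
    """Odds of the event: event count divided by non-event count."""
    return event / non_event


def odds_ratio(group_1, group_2):
    """Odds of the event in group 1, divided by the odds in group 2.

    Each group is a pair: (event count, non-event count).
    """
    return odds(*group_1) / odds(*group_2)


method_a = (81, 6)     # (successes, failures) for Method A, Table 14.1
method_b = (234, 36)   # (successes, failures) for Method B, Table 14.1

print(odds(*method_a))                            # 13.5
print(odds(*method_b))                            # 6.5
print(round(odds_ratio(method_a, method_b), 2))   # 2.08
```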

Most software computes the odds ratio from a two-way table by using the values in the first row and first column on the top of the fractions when computing the odds and the odds ratio. In Example 14.7, for instance, the odds for both methods were computed with the Column 1 values on the top of the fraction (\(81\) and \(234\)), and the odds ratio was computed with the Row 1 odds (\(13.5\)) on top of the fraction.

However, the odds ratio could also be computed using the odds within the columns, rather than within the rows (as in Example 14.7).

Example 14.8 (Odds ratios) For the small kidney stone data, the odds of a success coming from Method A (i.e., Column 1) is \(81\div234 = 0.3462\). Likewise, the odds of a failure (i.e., Column 2) coming from Method A is \(6\div36 = 0.1667\). The odds ratio is \(0.3462\div 0.1667 = 2.08\), as in Example 14.7. This means that the odds that a success came from Method A (rather than Method B) is about \(2.08\) times the odds that a failure came from Method A.

The two odds ratio calculations (Examples 14.7 and 14.8) produce the same value.
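
The reason the two calculations agree can be seen by writing the four counts of a \(2\times 2\) table as symbols. The symbols \(a\), \(b\), \(c\) and \(d\) are introduced here for illustration only: \(a\) and \(b\) are the Row 1 counts (Columns 1 and 2), and \(c\) and \(d\) are the Row 2 counts. Then \[ \underbrace{\frac{a/b}{c/d}}_{\text{odds within rows}} = \frac{a\times d}{b\times c} = \underbrace{\frac{a/c}{b/d}}_{\text{odds within columns}}. \] For the small kidney stones (Table 14.1), \(a = 81\), \(b = 6\), \(c = 234\) and \(d = 36\), so that \[ \frac{81\times 36}{6\times 234} = \frac{2916}{1404} = 2.08. \]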

The odds ratio can be interpreted in either of these ways (i.e., both are correct):

  • The odds compare Row 1 counts to Row 2 counts, for both columns. The odds ratio then compares the Column 1 odds to the Column 2 odds.
  • The odds compare Column 1 counts to Column 2 counts. The odds ratio then compares the Row 1 odds to the Row 2 odds.

Odds and odds ratios are computed with the first row and first column values on the top of the fraction.

The OR compares the odds of the same event (e.g., success) in two different groups (e.g., Method A and Method B). This means that a \(2\times 2\) table can be summarised using one number: the odds ratio (OR).

When interpreting odds ratios (or ORs):

  • odds ratios greater than \(1\) mean the odds of the event is larger for the group on the top of the division compared to the group on the bottom.
  • odds ratios equal to \(1\) mean the odds of the event is the same for both groups (on the top and the bottom of the division).
  • odds ratios less than \(1\) mean the odds of the event is smaller for the group on the top of the division compared to the group on the bottom.
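
Which group is placed on the top of the division matters for the value of the OR, but not for the conclusion. As a brief sketch using the small kidney stone data (variable names are illustrative), swapping the two groups gives the reciprocal of the OR:

```python
odds_a = 81 / 6      # odds of success for Method A: 13.5
odds_b = 234 / 36    # odds of success for Method B: 6.5

or_a_on_top = odds_a / odds_b   # Method A on top: greater than 1
or_b_on_top = odds_b / odds_a   # Method B on top: less than 1

print(round(or_a_on_top, 2))  # 2.08
print(round(or_b_on_top, 2))  # 0.48
```

Both values describe the same sample: the odds of success are larger for Method A than for Method B.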

The numerical summary information for comparing qualitative variables can be collated in a table. The data should be summarised by one of the qualitative variables, producing percentages and odds for the other.

Example 14.9 (Numerical summary table) For the small kidney-stone data, the summary of the data can be tabulated as in Table 14.4.

TABLE 14.4: Numerical summary of the small kidney-stone data: Odds and percentage of a successful procedure
Percentage Odds Sample size
Method A \(93.1\) \(13.500\) \(\phantom{0}87\)
Method B \(86.7\) \(\phantom{0}6.500\) \(270\)
Odds ratio \(\phantom{0}2.077\)

14.6 Example: large kidney stones

The data in Table 14.1 are for small kidney stones. Data were also recorded for the large kidney stones (Table 14.5). As for small kidney stones, the success percentages can be computed for both methods:

  • Method A: Success proportion for large kidney stones: \(192/263 = 0.730\), or \(73.0\)%; and
  • Method B: Success proportion for large kidney stones: \(55/80 = 0.688\), or \(68.8\)%.

For large kidney stones, then, Method A has a higher success proportion than Method B, just as with the small kidney stones.

TABLE 14.5: Numbers for large kidney stones
Success Failure Total
Method A \(192\) \(71\) \(263\)
Method B \(\phantom{0}55\) \(25\) \(\phantom{0}80\)

So... could the data for small (Table 14.1) and large kidney stones (Table 14.5) be combined, to produce a single two-way table of just Method and Result (Table 14.6), ignoring size?

TABLE 14.6: Numbers for all kidney stones combined, ignoring the size of the kidney stone
Success Failure Total
Method A \(273\) \(77\) \(350\)
Method B \(289\) \(61\) \(350\)

Compute the success proportions for Method A and Method B when small and large stones are combined (Table 14.6):

  • For all stones, what is the success proportion for Method A?
  • For all stones, what is the success proportion for Method B?

Which method has the higher success proportion for all stones?

Method B has a higher success proportion (\(289/350 = 0.826\)) than Method A (\(273/350 = 0.780\)), for all kidney stones combined.

To summarise:

  • Method A is more successful for small stones (\(93.1\)% vs \(86.7\)%);
  • Method A is more successful for large stones (\(73.0\)% vs \(68.8\)%); but
  • Method B is more successful for all stones combined (\(82.6\)% vs \(78.0\)%).

That seems strange: Method A performs better for small and large kidney stones, but Method B performs better when ignoring size.

The size of the stone is a confounding variable (Fig. 14.2). Size is associated with the method (small stones are treated more often with Method B) and with the result (small stones have a higher success proportion for both methods).

This confounding could have been avoided by randomly allocating a treatment method to patients. However, random allocation was not possible in this study, so the researchers used a different method to manage confounding: recording the size of the kidney stones (see Sect. 8.2).

In this example, incorporating information about a potential confounder (the size of the kidney stone) is important, otherwise the wrong (opposite) conclusion is reached: Method B would be considered better if the size of the stones was ignored, when the better method really is Method A.

This is called Simpson's paradox. If the size of the kidney stone had not been recorded, size would be a lurking variable, and the incorrect conclusion would have been reached.
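
The reversal can be verified numerically from Tables 14.1 and 14.5. The sketch below is illustrative only: each method is stored as a pair (successes, failures), and the helper names `combine`, `success_proportions` and `better_method` are assumptions.

```python
# (successes, failures) for each method, from Tables 14.1 and 14.5.
small = {"Method A": (81, 6), "Method B": (234, 36)}
large = {"Method A": (192, 71), "Method B": (55, 25)}


def combine(first, second):
    """Add the counts of two tables, ignoring the variable that separated them."""
    return {
        method: (first[method][0] + second[method][0],
                 first[method][1] + second[method][1])
        for method in first
    }


def success_proportions(table):
    """Proportion of successes for each method."""
    return {method: s / (s + f) for method, (s, f) in table.items()}


def better_method(table):
    """The method with the larger sample success proportion."""
    proportions = success_proportions(table)
    return max(proportions, key=proportions.get)


combined = combine(small, large)   # reproduces Table 14.6

for label, table in [("Small", small), ("Large", large), ("Combined", combined)]:
    rounded = {m: round(p, 3) for m, p in success_proportions(table).items()}
    print(label, rounded, "->", better_method(table))
```

Method A has the higher success proportion within each stone size, but Method B has the higher proportion in the combined table, because most Method B procedures were on small stones (which have higher success proportions for both methods).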

FIGURE 14.2: The size of the stones is related to both the success percentage and the method

14.7 Example: water access

López-Serrano et al. (2022) recorded data about access to water for three rural communities in Cameroon (see Sects. 12.9 and 13.7). One purpose of the study was to determine contributors to the incidence of diarrhoea in young children (\(85\) households had children under \(5\)). A cross-tabulation (Table 14.7) shows the relationship with keeping livestock; the numerical summary table (Table 14.8) may suggest a difference due to keeping livestock. The comparison in Fig. 14.3 includes some categories with small sample sizes, so the percentages shown may not be precise estimates of the population values.

As usual, the data come from one of countless possible samples, but the RQ is about the population, so making a definitive decision is difficult.

TABLE 14.7: Cross-tabulation of having livestock in the household, and children under \(5\) having diarrhoea in the household in the last two weeks
No diarrhoea Diarrhoea
Does not have livestock \(17\) \(\phantom{0}3\)
Has livestock \(42\) \(23\)
TABLE 14.8: Numerical summary of the water-access data: odds and percentage of children with diarrhoea in the last two weeks
Percentage Odds Sample size
Household does not have livestock \(15.0\) \(0.176\) \(20\)
Household has livestock \(35.4\) \(0.548\) \(65\)
Odds ratio \(0.322\)
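
The values in Table 14.8 can be reproduced from the counts in Table 14.7. This is a minimal sketch; the dictionary `households` and the function `summarise` are illustrative names only.

```python
# Counts from Table 14.7.
households = {
    "Does not have livestock": {"Diarrhoea": 3, "No diarrhoea": 17},
    "Has livestock": {"Diarrhoea": 23, "No diarrhoea": 42},
}


def summarise(cells, event="Diarrhoea", non_event="No diarrhoea"):
    """Percentage, odds and sample size for one group."""
    n = cells[event] + cells[non_event]
    return {
        "percentage": 100 * cells[event] / n,
        "odds": cells[event] / cells[non_event],
        "n": n,
    }


summary = {group: summarise(cells) for group, cells in households.items()}

# Odds ratio with the first row of Table 14.7 on top of the division.
odds_ratio = (summary["Does not have livestock"]["odds"]
              / summary["Has livestock"]["odds"])

for group, values in summary.items():
    print(f"{group}: {values['percentage']:.1f}%, "
          f"odds {values['odds']:.3f}, n = {values['n']}")
print(f"Odds ratio: {odds_ratio:.3f}")  # Odds ratio: 0.322
```

Since the OR is less than \(1\), the sample odds of diarrhoea are smaller for households without livestock (the group on the top of the division) than for households with livestock.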

FIGURE 14.3: Percentage of children with and without diarrhoea in the last two weeks, by water source (left) and how often the water vessel was cleaned (right).

14.8 Chapter summary

Qualitative data can be compared between different groups (between-individuals comparisons) using a stacked bar chart, side-by-side bar chart or dot chart. The data can be displayed in a two-way table, then summarised numerically by comparing proportions, percentages and odds. The odds ratio (OR) can be used to compare odds in two different groups.

14.9 Quick revision questions

A study (Alley et al. 2017) examined social media use, using a representative sample of Queenslanders at least \(18\) years of age (from the \(2013\) Queensland Social Survey; Table 14.9).

  1. Compute the sample proportion of urban residents who use social media.
  2. Compute the sample proportion of rural residents who use social media.
  3. Compute the sample odds that an urban resident uses social media.
  4. Compute the sample odds that a rural resident uses social media.
  5. Compute the sample odds ratio of using social media, comparing urban to rural residents.
TABLE 14.9: The number of Queenslanders using and not using social media (SM) in rural and urban locations in 2013 in a sample
Doesn't use SM Uses SM Total
Rural residents \(\phantom{0}78\) \(\phantom{0}89\) \(167\)
Urban residents \(416\) \(568\) \(984\)

14.10 Exercises

Answers to odd-numbered exercises are available in App. E.

Exercise 14.1 Köchling et al. (2019) studied hangovers and recorded, among other information, when people vomited after consuming alcohol. Table 14.10 shows how many people vomited after consuming beer followed by wine, and how many people vomited after consuming only wine.

  1. Compute the row proportions. What do these mean?
  2. Compute the column percentages. What do these mean?
  3. Compute the overall percentage of drinkers who vomited.
  4. Compute the odds that a wine-only drinker vomited.
  5. Compute the odds that a beer-then-wine drinker vomited.
  6. Compute the odds ratio, comparing the odds of vomiting for wine-only drinkers to beer-then-wine drinkers.
  7. Compute the odds ratio, comparing the odds of vomiting for beer-then-wine drinkers to wine-only drinkers.
TABLE 14.10: How many people vomited and did not vomit, by type of alcohol consumed
Beer then wine Wine only
Vomited \(\phantom{0}6\) \(\phantom{0}6\)
Didn't vomit \(62\) \(22\)

Exercise 14.2 Stirrat (2008) recorded the sex of adult and young wallabies at the East Point Reserve, Darwin. In December 1993, \(91\) male and \(188\) female adult wallabies were recorded, and \(13\) male and \(22\) female young wallabies were recorded.

  1. Create the two-way table of counts.
  2. For adult wallabies, what proportion were males?
  3. For adult wallabies, what are the odds that a female was observed?
  4. For young wallabies, what percentage of wallabies were males?
  5. For young wallabies, what are the odds that a female was observed?
  6. What is the odds ratio of observing an adult wallaby, comparing females to males?
  7. Create a summary table.
  8. Sketch a graph to display the data.

Exercise 14.3 [Dataset: EmeraldAug] The Southern Oscillation Index (SOI) is a standardised measure of the air pressure difference between Tahiti and Darwin, shown to be related to rainfall in some parts of the world (Stone, Hammer, and Marcussen 1996), and especially Queensland, Australia (Stone and Auliciems 1992; P. K. Dunn 2001).

The rainfall at Emerald (Queensland) was recorded for Augusts from 1889 to 2002 inclusive (P. K. Dunn and Smyth 2018), for months when the monthly average SOI was positive and non-positive (zero or negative); see Table 14.11.

  1. Compute the percentage of Augusts with no rainfall.
  2. Compute the percentage of Augusts with no rainfall, in Augusts with a non-positive SOI.
  3. Compute the percentage of Augusts with no rainfall, in Augusts with a positive SOI.
  4. Compute the odds of no August rainfall.
  5. Compute the odds of no August rainfall, in Augusts with a non-positive SOI.
  6. Compute the odds of no August rainfall, in Augusts with a positive SOI.
  7. Compute the odds ratio of no August rainfall, comparing Augusts with non-positive SOI to Augusts with a positive SOI.
  8. Interpret this OR.
  9. Create a summary table.
  10. Sketch a graph to display the data.
TABLE 14.11: The SOI, and whether rainfall was recorded in Augusts between 1889 and 2002 inclusive
Non-positive SOI Positive SOI
No rainfall recorded \(14\) \(\phantom{0}7\)
Rainfall recorded \(40\) \(53\)

Exercise 14.4 Haselgrove et al. (2008) asked boys and girls in Western Australia about back pain from carrying school bags (Table 14.12).

  1. Compute the percentage of boys reporting back pain from carrying school bags.
  2. Compute the percentage of girls reporting back pain from carrying school bags.
  3. Compute the odds of boys reporting back pain from carrying school bags.
  4. Compute the odds of girls reporting back pain from carrying school bags.
  5. Compute the odds of a child reporting back pain.
  6. Compute the odds ratio of reporting back pain, comparing boys to girls.
  7. Interpret this OR.
  8. Create a summary table.
  9. Sketch a graph to display the data.
TABLE 14.12: The number of boys and girls reporting back pain from carrying school bags
Males Females
No back pain \(330\) \(226\)
Back pain \(280\) \(359\)

Exercise 14.5 Using the information in Table 13.2, create a stacked bar chart to compare the responses to the three questions.

Exercise (Road-kill possums) T. C. Russell, Herbert, and Kohen (2009) studied road-kill possums (Table 14.13).

  1. Identify the two variables, and classify them as nominal or ordinal.
  2. Sketch some graphs to display the data.
  3. What is the main message in the data? What graph shows this best?
TABLE 14.13: The number of possums found as road kill, by sex and season
Unknown sex Male Female
Autumn \(75\) \(25\) \(21\)
Winter \(74\) \(27\) \(22\)
Spring \(71\) \(10\) \(18\)
Summer \(58\) \(10\) \(12\)

Exercise 14.6 The data in Table 14.14 come from a study of Iranian children aged \(6\)--\(18\) years old (Kelishadi et al. 2017).

  1. Compute the proportion of females who skipped breakfast.
  2. Compute the proportion of males who skipped breakfast.
  3. Compute the odds of a female skipping breakfast.
  4. Compute the odds of a male skipping breakfast.
  5. Compute the odds ratio of skipping breakfast, comparing females to males.
  6. Interpret this OR.
  7. Construct a summary table.
TABLE 14.14: The number of Iranian children aged 6 to 18 who skip and do not skip breakfast
Skips breakfast Doesn't skip breakfast Total
Females \(2383\) \(4257\) \(6640\)
Males \(1944\) \(4902\) \(6846\)

Exercise 14.7 Yonekura et al. (2020) studied Japanese women and their coffee drinking habits (Table 14.15).

  1. Compute the proportion of coffee drinkers who are also smokers.
  2. Compute the proportion of non-coffee drinkers who are also smokers.
  3. Compute the odds of a coffee drinker being a smoker.
  4. Compute the odds of a non-coffee drinker being a smoker.
  5. Compute the odds ratio of being a smoker, comparing coffee drinkers to non-coffee drinkers.
  6. Interpret this OR.
  7. Construct a summary table.
TABLE 14.15: The number of Japanese women who smoked, and drank at least one cup of coffee per day
Smokers Non-smokers
Coffee drinkers \(10\) \(66\)
Non-coffee drinkers \(\phantom{0}2\) \(84\)