# 14 Numerical summaries: qualitative data

So far, you have learnt to ask a RQ, identify different ways of obtaining data, design the study, collect the data, describe the data, and graphically summarise data.

In this chapter, you will learn to numerically describe qualitative data. Both quantitative and qualitative data are described numerically in quantitative research. You will learn to:

• present and numerically summarise qualitative data.
• compute and understand row and column proportions (and percentages).
• compute and understand odds and odds ratios.
• describe relationships between qualitative variables.

## 14.1 Proportions and percentages

### 14.1.1 Introduction

In one study, the aim was to:

...compare (two) different methods of treating renal calculi... to establish which was the most [...] successful.

--- p. 879

(Renal calculi are better known as kidney stones.) Data were collected from 700 UK patients, on two qualitative variables:

• The treatment method used ('A' or 'B'): The explanatory variable. Each treatment was used on 350 patients.
• The result ('success' or 'failure' of the procedure): The response variable.

Both variables are qualitative with two levels. Treatment A was used from 1972--1980, and Treatment B from 1980--1985; that is, the treatments were not randomly allocated, and so confounding may be an issue. For this reason, the researchers also recorded the size of the kidney stone (also a qualitative variable) as a possible confounding variable, as 'small' or 'large'.

Firstly, consider just the small stones. The data can be compiled using a two-way table (Table 14.1), and graphed using a side-by-side or stacked bar chart, for example.

TABLE 14.1: Numbers for small kidney stones

| | Success | Failure | Total |
|---|---|---|---|
| Method A | 81 | 6 | 87 |
| Method B | 234 | 36 | 270 |

Qualitative data can be numerically summarised by computing proportions or percentages. These can be computed overall, by row, or by column.

These are demonstrated within each section, and in a separate Example.

Definition 14.1 (Proportion) A proportion is a fraction out of a total. Proportions are numbers between 0 and 1.

Definition 14.2 (Percentage) A percentage is a proportion, multiplied by 100. Percentages are numbers between 0% and 100%.

### 14.1.2 Overall proportions and percentages

From Table 14.1, the overall sample proportion of successes (denoted $$\hat{p}$$) is:

\begin{align*} \hat{p} &= \frac{\text{Number of successes}}{\text{Number of procedures}}\\ &= \frac{81 + 234}{6 + 81 + 36 + 234} = 0.882. \end{align*}

The sample proportion of successful procedures for small kidney stones is 0.882. The sample proportion (a statistic) is an estimate of the unknown population proportion (a parameter), which is denoted $$p$$.

The symbol $$\hat{p}$$ is pronounced 'pee-hat', and refers to the sample proportion.

The proportion could also be expressed as a percentage, by multiplying by 100:

$0.882 \times 100 = 88.2\%.$

The sample percentage of successful procedures for small kidney stones is 88.2%. The sample proportion and sample percentage are both statistics.

Notice that, when computing percentages and proportions, we divide the relevant count by the total number relevant to the context.
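The calculation above can be checked with a few lines of code. This is only a minimal sketch: the counts come from Table 14.1, while the variable names are chosen here for illustration.

```python
# Counts from Table 14.1 (small kidney stones): (successes, failures) per method.
small_stones = {
    "Method A": (81, 6),
    "Method B": (234, 36),
}

# The relevant count (successes) and the relevant total (all procedures).
total_successes = sum(s for s, f in small_stones.values())
total_procedures = sum(s + f for s, f in small_stones.values())

p_hat = total_successes / total_procedures  # overall sample proportion
percentage = 100 * p_hat                    # the same statistic as a percentage

print(f"p-hat = {p_hat:.3f}, or {percentage:.1f}%")
# prints: p-hat = 0.882, or 88.2%
```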

### 14.1.3 Row proportions and percentages

For the small kidney stones (Table 14.1), row proportions (or percentages), and column proportions (or percentages), can be computed.

The row proportions (Table 14.2) give the proportion of successes for each Method, since the rows contain the counts for Method A and Method B. Row proportions allow the proportions within the rows to be compared:

• $$81 \div 87 = 0.931$$ (or 93.1%) of operations in the sample were successful for Method A; and
• $$234 \div 270 = 0.867$$ (or 86.7%) of operations in the sample were successful for Method B.

This suggests that, for small kidney stones, Method A is more successful (93.1%) than Method B (86.7%) in the sample.

TABLE 14.2: Row percentages for small kidney stones (from Table 14.1)

| | Success | Failure | Total |
|---|---|---|---|
| Method A | 93.1 | 6.9 | 100 |
| Method B | 86.7 | 13.3 | 100 |

### 14.1.4 Column proportions and percentages

For the small kidney stones (Table 14.1), column proportions can also be computed (Table 14.3). The column proportions give the proportion of operations that used each method within each result, since the columns contain the procedure results (success or failure). Column proportions allow the proportions (or percentages) within columns to be compared:

• $$81 \div (81 + 234) = 0.257$$ (or 25.7%) of all successful operations came from using Method A; and
• $$6 \div (6 + 36) = 0.143$$ (or 14.3%) of all failed operations came from using Method A.

TABLE 14.3: Column percentages for small kidney stones (from Table 14.1)

| | Success | Failure |
|---|---|---|
| Method A | 25.7 | 14.3 |
| Method B | 74.3 | 85.7 |
| Total | 100.0 | 100.0 |

While both row and column proportions (or percentages) can be computed, row percentages seem more intuitive here: they compare the success percentage for each treatment method.
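To see how row and column percentages differ only in the total used for the division, the following sketch computes both from the counts in Table 14.1. The function names `row_percentages` and `column_percentages` are hypothetical helpers written for this illustration, not functions from any statistics package.

```python
# Counts from Table 14.1; rows are methods, columns are results.
counts = {
    "Method A": {"Success": 81, "Failure": 6},
    "Method B": {"Success": 234, "Failure": 36},
}

def row_percentages(table):
    """Divide each count by its row total, so each row sums to 100."""
    result = {}
    for row, cells in table.items():
        row_total = sum(cells.values())
        result[row] = {col: 100 * n / row_total for col, n in cells.items()}
    return result

def column_percentages(table):
    """Divide each count by its column total, so each column sums to 100."""
    columns = list(next(iter(table.values())))
    col_totals = {col: sum(cells[col] for cells in table.values()) for col in columns}
    return {
        row: {col: 100 * n / col_totals[col] for col, n in cells.items()}
        for row, cells in table.items()
    }

for label, result in [("Row", row_percentages(counts)),
                      ("Column", column_percentages(counts))]:
    print(f"{label} percentages:")
    for row, cells in result.items():
        print("  ", row, {col: round(pct, 1) for col, pct in cells.items()})
```

The rounded row percentages should match Table 14.2, and the rounded column percentages should match Table 14.3.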

### 14.1.5 Example: Large kidney stones

The data in Table 14.1 are for small kidney stones. Data were also recorded for the large kidney stones (Table 14.4).

For both small and large stones, the success proportions can be computed for Methods A and B (i.e., row percentages), and hence the better method (in the sample) can be identified.

TABLE 14.4: Numbers for large kidney stones

| | Success | Failure | Total |
|---|---|---|---|
| Method A | 192 | 71 | 263 |
| Method B | 55 | 25 | 80 |

The success proportion for Method A is greater than the success proportion for Method B for small stones (Table 14.1). Now, compute the success proportions for the large stones too (Table 14.4):

• For large stones, the success proportion with Method A is:
• For large stones, the success proportion with Method B is:

Which method has the higher success proportion for large stones?

Method A has a higher success proportion in the sample for both small (0.931 vs 0.867) and large kidney stones (0.730 vs 0.688). Perhaps the data for small (Table 14.1) and large kidney stones (Table 14.4) can therefore be combined, to produce a single two-way table of just Method and Result (Table 14.5), ignoring size.

TABLE 14.5: Numbers for all kidney stones combined, ignoring the size of the kidney stone

| | Success | Failure | Total |
|---|---|---|---|
| Method A | 273 | 77 | 350 |
| Method B | 289 | 61 | 350 |

In summary, the sample shows that:

• For small stones (Table 14.1), Method A has a higher success proportion: Method A: 0.93; Method B: 0.87
• For large stones (Table 14.4), Method A has a higher success proportion: Method A: 0.73; Method B: 0.69
• Combining all stones together (Table 14.5), Method B has a higher success proportion: Method A: 0.78; Method B: 0.83

That seems strange... Method A performs better for small and for large kidney stones, but Method B performs better when combined (and size is ignored).

How can Method A be better when small and large stones are considered separately, but Method B be better when they are combined? Can you see why?

The size of the stone is a confounding variable (Fig. 14.1): The size of the stone is related to success proportion (small stones have a greater success proportion) and the size of the stone is related to the method used (small stones are treated more often with Method B).

This confounding could have been avoided by randomly allocating treatment methods to patients. However, random allocation was not possible in this study, so the researchers used a different method to manage confounding: recording the size of the kidney stones (and other variables also: the age and sex of the patient); see Sect. 8.2.3.

In this example, acknowledging the size of the kidney stone is important, otherwise the wrong (opposite) conclusion is reached: one would think that Method B is better if the size of the stones was ignored, when the best method really is Method A.

This is called Simpson's paradox. If the size of the kidney stone had not been recorded, size would have been a lurking variable, and the incorrect conclusion would have been reached.
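The reversal can be reproduced directly from the counts in Tables 14.1 and 14.4. The sketch below is one way to lay out the arithmetic, with variable names chosen here for illustration. It also computes how much of each method's workload was small stones, which is the imbalance behind the paradox.

```python
# (successes, procedures) taken from Tables 14.1 and 14.4.
stones = {
    "small": {"A": (81, 87), "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}

# Success proportion for each method within each stone size.
for size, methods in stones.items():
    for method, (successes, procedures) in methods.items():
        print(f"{size} stones, Method {method}: {successes / procedures:.3f}")

# Ignore size: add the counts over both sizes (this reproduces Table 14.5).
combined = {
    method: tuple(sum(stones[size][method][i] for size in stones) for i in (0, 1))
    for method in ("A", "B")
}
for method, (successes, procedures) in combined.items():
    print(f"all stones, Method {method}: {successes / procedures:.3f}")

# Share of each method's procedures that involved small stones.
small_share = {
    method: stones["small"][method][1] / combined[method][1]
    for method in ("A", "B")
}
print({method: round(share, 3) for method, share in small_share.items()})
```

About 77% of the Method B procedures were on small stones, compared with about 25% for Method A. Since small stones have the higher success proportion with either method, this imbalance lifts the combined proportion for Method B above that for Method A.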

## 14.2 Odds

Consider again the small kidney stone data (Table 14.1).

For Method A, the sample contains 81 successes and 6 failures. Apart from proportions and percentages, another way to numerically summarise this information is to see that there are $$81\div 6 = 13.5$$ times as many successes as failures in the sample.

In other words, for small kidney stones, the odds of success for Method A is 13.5 (in the sample). The sample odds is a statistic, and the population odds is a parameter.

Definition 14.3 (Odds) The odds are the proportion (or percentage, or number) of times that an event happens, divided by the proportion (or percentage, or number) of times that the event does not happen:

$\text{Odds} = \frac{\text{Proportion of times that something happens}} {\text{Proportion of times that something doesn't happen}},$

or (equivalently)

$\text{Odds} = \frac{\text{Number of times that something happens}}{\text{Number of times that something doesn't happen}}.$

The odds show how many times an event happens for each time that it does not happen. Alternatively, multiplying the odds by 100 gives how many times the event happens for every 100 times that it does not happen.

Notice that, when computing odds, we divide the relevant number by the remaining number. This is different from how percentages and proportions are computed: for those, we divide the relevant number by the total number relevant to the context.

Software usually works with odds rather than percentages (for good reasons that we will not delve into). However, understanding how software computes the odds is important.

Software usually computes odds as comparing either

• Row 1 to Row 2; or
• Column 1 to Column 2.

Here then, based on Table 14.1, the odds for comparing the Methods would be computed as Method A compared to Method B (rather than Method B to Method A).

Example 14.1 (Interpreting odds) For the small kidney stone data, the odds of a success for Method A is $$81\div6 = 13.5$$ (in the sample). This can be interpreted as:

• There are $$13.5$$ times as many successes as failures (in the sample);
• There are $$13.5\times 100 = 1350$$ successes for every 100 failures (in the sample).

Either way, successes are far more common than failures, for small kidney stones using Method A.

What are the odds of finding a failure for Method A? How is this value interpreted?

Example 14.2 (Odds) Suppose that about 67% of students at a particular university were female. The population odds of finding a female is about $$67 / (100 - 67) = 2.03$$: there are about twice as many female students as non-female students.

Suppose one tutorial had 18 females and 5 non-females. The sample odds of finding a female in this class is $$18/5 = 3.60$$. Another class had 16 females and 9 non-females. The sample odds of finding a female in this class is $$16/9 = 1.78$$.

Example 14.3 (Computing odds) Consider again the small kidney stone data (Table 14.1). The odds of a success using Method B can also be found (Table 14.1):

\begin{align*} &\text{Odds}(\text{Success with Method B})\\ = &\frac{\text{Number of successes for Method B}}{\text{Number of failures for Method B}} =\frac{234}{36} = 6.50. \end{align*}

Working with the percentages (Table 14.2) rather than the numbers, almost the same value results; the small difference arises only because the percentages in Table 14.2 are rounded:

\begin{align*} & \text{Odds}(\text{Success with Method B})\\ = &\frac{\text{Percentage of successes for Method B}}{\text{Percentage of failures for Method B}} =\frac{86.7}{13.3} = 6.52. \end{align*}
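As a quick check on Example 14.3, the sketch below computes the odds for Method B both ways. The helper `odds` is a hypothetical name used only here. Note that the counts give exactly 6.5, while the rounded percentages from Table 14.2 give about 6.52, so any discrepancy between the two routes is rounding error only.

```python
def odds(happens, does_not_happen):
    """Odds: how often the event happens, divided by how often it does not."""
    return happens / does_not_happen

# From the counts in Table 14.1 (Method B: 234 successes, 36 failures).
odds_from_counts = odds(234, 36)

# From the rounded row percentages in Table 14.2 (86.7% and 13.3%).
odds_from_percentages = odds(86.7, 13.3)

print(f"Odds from counts:      {odds_from_counts:.2f}")
print(f"Odds from percentages: {odds_from_percentages:.2f}")
```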

When interpreting odds:

• When the odds are greater than one: the event is more likely to happen than to not happen.
• When the odds are equal to one: the event is just as likely to happen as it is to not happen.
• When the odds are less than one: the event is less likely to happen than to not happen.
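These three cases follow from the link between a proportion and the corresponding odds: since the odds divide the proportion of times an event happens by the proportion of times it does not, the odds equal $$p/(1 - p)$$ for a proportion $$p$$. The sketch below illustrates this conversion in both directions; the function names are hypothetical and chosen only for this illustration.

```python
def proportion_to_odds(p):
    """Convert a proportion (strictly between 0 and 1) to odds."""
    return p / (1 - p)

def odds_to_proportion(odds):
    """Convert odds back to a proportion."""
    return odds / (1 + odds)

# A proportion above 0.5 gives odds above one, 0.5 gives odds of one,
# and a proportion below 0.5 gives odds below one.
for p in (0.67, 0.50, 0.25):
    print(f"proportion {p:.2f} -> odds {proportion_to_odds(p):.2f}")

# Odds of 13.5 (Method A, small stones) correspond to a proportion of 81/87.
print(f"odds 13.5 -> proportion {odds_to_proportion(13.5):.3f}")
```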

## 14.3 Comparing odds: Odds ratios

To summarise the small kidney stone data:

• For Method A, the odds of success are 13.5; there are 13.5 times as many successes as failures. (Alternatively, there are 1350 successes for every 100 failures.)
• For Method B, the odds of success are 6.5; there are 6.5 times as many successes as failures. (Alternatively, there are 650 successes for every 100 failures.)

The odds of success for Method A and Method B are quite different: in the sample, the odds of success for Method A are about twice the odds of success for Method B. More precisely, in the sample, the odds of success for Method A is

$\frac{13.5}{6.5} = 2.08$

times the odds of a success for Method B. This value is called the odds ratio (OR); see Fig. 14.2. The sample odds ratio is a statistic, and the population odds ratio is a parameter.

Definition 14.4 (Odds Ratio (OR)) The odds ratio is how many times greater the odds of an event are in one group, compared to the odds of the same event in another group.

Understanding how software computes the odds ratio is important for understanding the output. jamovi and SPSS compute the odds ratio as either:

• The odds compare Row 1 to Row 2, then the odds ratio compares the Row 1 odds to the Row 2 odds.

• The odds compare Column 1 to Column 2, then the odds ratio compares the Column 1 odds to the Column 2 odds.

In other words, the odds and odds ratios are relative to the first row or first column.

The OR compares the odds of an event in two groups. This means that a $$2\times 2$$ table can be summarised using one number: the odds ratio (OR).

When using odds ratios (or ORs):

• When the odds ratio is greater than one: the odds of the event for the group in the top of the division is greater than the odds of the event for the group in the bottom of the division.
• When the odds ratio is equal to one: the odds of the event for the group in the top of the division is equal to the odds of the event for the group in the bottom of the division.
• When the odds ratio is less than one: the odds of the event for the group in the top of the division is less than the odds of the event for the group in the bottom of the division.
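Because the odds ratio is relative to whichever group comes first, swapping the two rows gives the reciprocal of the original value. The sketch below demonstrates this with the small kidney stone counts from Table 14.1. The function `odds_ratio` is a hypothetical helper written for this illustration, not the routine used by jamovi or SPSS.

```python
def odds_ratio(table):
    """Odds ratio for a 2x2 table of counts, comparing Row 1 to Row 2.

    Each row is (number of times the event happens, number of times it does not).
    """
    (a, b), (c, d) = table
    return (a / b) / (c / d)

# Rows: Method A, then Method B; columns: successes, failures (Table 14.1).
small_stones = [(81, 6), (234, 36)]

or_a_to_b = odds_ratio(small_stones)        # Method A is the top of the division
or_b_to_a = odds_ratio(small_stones[::-1])  # rows swapped: Method B on top

print(f"OR, Method A compared to Method B: {or_a_to_b:.2f}")
print(f"OR, Method B compared to Method A: {or_b_to_a:.2f}")
```

Both values describe the same table; only the direction of the comparison changes, which is why knowing the row (or column) order used by the software matters when reading its output.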


## 14.4 Observing relationships

For the small kidney stone data, the odds of a success for Method A differ from the odds of a success for Method B, in the sample. Broadly, two possible reasons exist to explain the differences in the sample:

• The odds in the population are the same for Method A and Method B, but a difference is observed in the sample odds simply because of who ended up in the sample. Every sample is likely to be different, and the sample we ended up with happened to show a difference. Sampling variation explains the difference in the sample odds.

• The odds in the population are different for Method A and Method B, and the difference in the sample odds simply reflects this difference between the population odds.

Similarly, the proportion (or percentage) of successes for Method A and B are quite different in the sample, and two possible reasons exist to explain the differences in the sample:

• No difference exists between the proportion (or percentage) in the population, but a difference is observed in the sample simply because of who ended up in the sample. Sampling variation explains the difference in the sample proportion (or percentage).

• A difference does exist between the proportion (or percentage) in the population, and this difference in the sample simply reflects this difference between the population proportion (or percentage).

The difficulty, of course, is knowing which of these two reasons ('hypotheses') is the most likely reason for the difference between the sample odds. This question is of prime importance (after all, it answers the RQ), and is addressed at length later in this book.
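The first explanation (sampling variation) can be illustrated with a small simulation. This is only a sketch under stated assumptions: both methods are given the same hypothetical population success proportion of 0.88, the sample sizes are those of Table 14.1 (87 and 270), and the seed is arbitrary. Even though the population odds are identical by construction, the sample odds ratios vary from sample to sample.

```python
import random

def sample_odds(n, p, rng):
    """Simulate n procedures with success probability p; return the sample odds."""
    successes = sum(rng.random() < p for _ in range(n))
    failures = n - successes
    # Guard against a sample with no failures at all.
    return successes / failures if failures else float("inf")

rng = random.Random(2024)  # arbitrary seed, so that the run is repeatable
p_common = 0.88            # assumed: the same population proportion for both methods

simulated_ors = []
for _ in range(5):
    odds_a = sample_odds(87, p_common, rng)   # a sample the size of Method A's
    odds_b = sample_odds(270, p_common, rng)  # a sample the size of Method B's
    simulated_ors.append(odds_a / odds_b)

print([round(value, 2) for value in simulated_ors])
```

The population odds ratio here is exactly one, so any departure from one in the printed values is due to sampling variation alone.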

## 14.5 Example: Skipping breakfast

The data in Table 14.6 come from a study of Iranian children aged 6--18 years old. From this table:

• The proportion of females who skipped breakfast is $$\hat{p}_F = 2\,383/6\,640 = 0.359$$;
• The proportion of males who skipped breakfast is $$\hat{p}_M = 1\,944/6\,846 = 0.284$$.

Also,

• $$\text{Odds}(\text{Skips breakfast, among F}) = 2\,383/4\,257 = 0.5598$$;
• $$\text{Odds}(\text{Skips breakfast, among M}) = 1\,944/4\,902 = 0.3966$$.

For example, about 55.98 females skip breakfast for every 100 females who eat breakfast. The odds ratio (OR) of skipping breakfast, comparing females to males, is

\begin{align*} \text{OR} &= \frac{\text{Odds}(\text{Skipping breakfast, for females})}{\text{Odds}(\text{Skipping breakfast, for males})}\\ &= \frac{0.5598}{0.3966} = 1.41; \end{align*}

the odds of females skipping breakfast are $$1.41$$ times the odds of males skipping breakfast. The data can then be summarised numerically (Table 14.7).

TABLE 14.6: The number of Iranian children aged 6 to 18 who skip and do not skip breakfast

| | Skips breakfast | Doesn't skip breakfast | Total |
|---|---|---|---|
| Females | 2383 | 4257 | 6640 |
| Males | 1944 | 4902 | 6846 |

TABLE 14.7: Numerical summary of the Iranian-breakfast data: Odds and percentage of those who skip breakfast

| | Percentage | Odds | Sample size |
|---|---|---|---|
| Females | 35.9 | 0.560 | 6640 |
| Males | 28.4 | 0.397 | 6846 |
| Odds ratio | | 1.412 | |
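The entries of Table 14.7 can be reproduced from the counts in Table 14.6. The sketch below is one possible layout of the calculation; the dictionary and variable names are chosen here for illustration.

```python
# Counts from Table 14.6: (skips breakfast, doesn't skip breakfast).
breakfast = {
    "Females": (2383, 4257),
    "Males": (1944, 4902),
}

summary = {}
for group, (skips, does_not_skip) in breakfast.items():
    n = skips + does_not_skip
    summary[group] = {
        "percentage": 100 * skips / n,  # divide by the group total
        "odds": skips / does_not_skip,  # divide by the remaining count
        "n": n,
    }
    print(f"{group}: {summary[group]['percentage']:.1f}%, "
          f"odds {summary[group]['odds']:.3f}, n = {n}")

# Females are the top of the division, so the OR compares females to males.
odds_ratio = summary["Females"]["odds"] / summary["Males"]["odds"]
print(f"Odds ratio: {odds_ratio:.3f}")
```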

## 14.6 Case Study: The NHANES data

In Sects. 12.9 and 13.7, the NHANES data were introduced, and graphs and numerical summaries used to understand the data relevant to answering this RQ:

Among Americans, is the mean direct HDL cholesterol different for those who smoke now, and those who do not smoke now?

The data can be summarised numerically: the response variable (HDL cholesterol), the explanatory variable (current smoking status), and potential extraneous and confounding variables. Different summaries are needed for quantitative (means and standard deviations; medians and IQR) and qualitative (percentages; odds) variables (Table 14.8).

TABLE 14.8: A summary of some variables in the NHANES data set, according to current smoking status (current smoking status was not reported for 6789 respondents, and some other variables were not reported for all respondents). Quantitative variables are summarised using either the mean and standard deviation, or median and IQR; qualitative variables using percentages. There are many missing values.

| Quantity | Statistic | Overall | Non-smokers | Smokers |
|---|---|---|---|---|
| Sample size | | 10000 | 1745 | 1466 |
| Direct HDL (mmol/L) | n | 8474 | 1668 | 1388 |
| | Mean | 1.36 | 1.39 | 1.31 |
| | Std. dev. | 0.4 | 0.43 | 0.42 |
| Gender | n | 10000 | 1745 | 1466 |
| | % Female | 50.2 | 43.8 | 43.5 |
| Age (years) | n | 10000 | 1745 | 1466 |
| | Mean | 36.74 | 54.28 | 42.68 |
| | Std. dev. | 22.4 | 16.64 | 14.79 |
| Height (cm) | n | 9647 | 1726 | 1459 |
| | Mean | 161.88 | 170.06 | 170.43 |
| | Std. dev. | 20.19 | 9.75 | 9.27 |
| Weight (kg) | n | 9922 | 1727 | 1458 |
| | Mean | 70.98 | 84.5 | 80.54 |
| | Std. dev. | 29.13 | 20.73 | 19.72 |
| BMI (kg/m²) | n | 9634 | 1726 | 1458 |
| | Mean | 26.66 | 29.09 | 27.7 |
| | Std. dev. | 7.38 | 6.19 | 6.42 |
| Diabetes | n | 9858 | 1743 | 1466 |
| | % Yes | 7.7 | 15.3 | 7.2 |
| Urine volume (mL) | n | 9013 | 1723 | 1447 |
| | Median | 94 | 97 | 102 |
| | IQR | 114 | 118.5 | 104 |

A number of interesting questions emerge from Table 14.8:

• How can the mean age of all respondents be 36.7 years, but the mean age for non-smokers and smokers both be much larger than this (54.3 and 42.7 years respectively)?
• Similarly, how can the percentage of females in the whole sample be 50.2%, but the percentage of females be less than this for both non-smokers and smokers (43.8% and 43.5% respectively)?

Table 14.9 summarises the relationship between current smoking status and having a diabetes diagnosis. Again, questions emerge:

• For current non-smokers, the percentage of diabetics is $$15.32$$%.
• For current smokers, the percentage of diabetics is $$7.23$$%.

The percentage of diabetics in the sample is different for non-smokers and smokers. Why? Similarly,

• For current non-smokers, the odds of finding a diabetic is $$0.181$$.
• For current smokers, the odds of finding a diabetic is $$0.078$$.

The odds of finding a diabetic in the sample is different for non-smokers and smokers. Why?

As noted before (Sect. 14.4), two possible reasons could explain this difference in percentages and odds in the sample:

• Sampling variation: The percentages (and odds) are the same in the population, but a difference occurs in the sample because of the people who happened to end up in the sample. Sampling variation explains the difference in the sample percentages (and odds).

• A difference in the population: The percentages (and odds) are different in the population for non-smokers and smokers, and the difference in the sample percentages (and odds) simply reflects a difference between non-smokers and smokers in the population.

In the next chapters, tools for deciding which of these explanations is the most likely are discussed.

TABLE 14.9: The two-way table of diabetes diagnosis against current smoking status

| | Doesn't smoke now | Smokes now |
|---|---|---|
| Not diabetic | 1476 | 1360 |
| Diabetic | 267 | 106 |
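In Table 14.9 the groups being compared (smoking status) are the columns rather than the rows, so the relevant percentages are column percentages and the odds are computed within each column. The sketch below shows one way to reproduce the percentages and odds of diabetes quoted in this section; the names are chosen here for illustration.

```python
# Counts from Table 14.9, stored one column (smoking status) at a time.
diabetes = {
    "Doesn't smoke now": {"Not diabetic": 1476, "Diabetic": 267},
    "Smokes now": {"Not diabetic": 1360, "Diabetic": 106},
}

summary = {}
for status, counts in diabetes.items():
    diabetic = counts["Diabetic"]
    not_diabetic = counts["Not diabetic"]
    summary[status] = {
        # Percentage: diabetics out of everyone in this column.
        "percentage": 100 * diabetic / (diabetic + not_diabetic),
        # Odds: diabetics compared to non-diabetics in this column.
        "odds": diabetic / not_diabetic,
    }
    print(f"{status}: {summary[status]['percentage']:.2f}% diabetic, "
          f"odds {summary[status]['odds']:.3f}")
```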

## 14.7 Summary

One qualitative variable can be numerically summarised using percentages or odds. With two qualitative variables, data can be compiled into a two-way table of counts, and the data can be numerically summarised using row percentages, column percentages, odds, or odds ratios.


## 14.8 Quick revision questions

A study examined social media (SM) use, using a

... sample of Australian adults [...] randomly selected from a database with Queensland landline telephone numbers. To be eligible, participants must be aged 18 or more and reside in Queensland.

p. 92

Part of the data are summarised in Table 14.10.

1. Compute the sample proportion of urban residents who use social media, $$\hat{p}_U$$.
2. Compute the sample proportion of rural residents who use social media, $$\hat{p}_R$$.
3. Compute the sample odds of urban residents who use social media.
4. Compute the sample odds of rural residents who use social media.
5. Compute the sample odds ratio of using social media, comparing urban to rural residents.


TABLE 14.10: The number of Queenslanders using and not using social media (SM) in rural and urban locations in a sample

| | Doesn't use SM | Uses SM | Total |
|---|---|---|---|
| Rural | 78 | 89 | 167 |
| Urban | 416 | 568 | 984 |

## 14.9 Exercises

Selected answers are available in Sect. D.14.

Exercise 14.1 A study of hangovers recorded, among other information, when people vomited after consuming alcohol.

Table 14.11 shows how many people vomited after consuming beer followed by wine, and how many people vomited after consuming just wine.

1. Compute the row proportions. What do these mean?
2. Compute the column percentages. What do these mean?
3. Compute the overall percentage of drinkers who vomited.
4. Compute the odds a wine-only drinker vomited.
5. Compute the odds that a beer-then-wine drinker vomited.
6. Compute the odds ratio, comparing the odds of vomiting for wine-only drinkers to beer-then-wine drinkers.
7. Compute the odds ratio, comparing the odds of vomiting for beer-then-wine drinkers to wine-only drinkers.

TABLE 14.11: How many people vomited and did not vomit, by type of alcohol consumed

| | Beer then wine | Wine only |
|---|---|---|
| Vomited | 6 | 6 |
| Didn't vomit | 62 | 22 |

Exercise 14.2 In a study of wallabies at the East Point Reserve (Darwin), the sex of adult and young wallabies was recorded. In December 1993, 91 male and 188 female adult wallabies were recorded. At the same time, 13 male and 22 female young wallabies were recorded.

2. For adult wallabies, what are the odds that a female was observed?
3. For young wallabies, what are the odds that a female was observed?
4. For young wallabies, what percentage of wallabies were males?
5. What is the odds ratio of observing an adult wallaby to a young wallaby, for just the female wallabies?

Exercise 14.3 The Southern Oscillation Index (SOI) is a standardised measure of the pressure difference between Tahiti and Darwin, and has been shown to be related to rainfall in some parts of the world, and especially Queensland.

As an example, the rainfall at Emerald (Queensland) was recorded for Augusts from 1889 to 2002 inclusive, for months when the monthly average SOI was positive, and for months when the SOI was non-positive (that is, zero or negative), as shown in Table 14.12.

1. Compute the percentage of Augusts with no rainfall.
2. Compute the percentage of Augusts with no rainfall, in Augusts with a non-positive SOI.
3. Compute the percentage of Augusts with no rainfall, in Augusts with a positive SOI.
4. Compute the odds of no August rainfall.
5. Compute the odds of no August rainfall, in Augusts with a non-positive SOI.
6. Compute the odds of no August rainfall, in Augusts with a positive SOI.
7. Compute the odds ratio of no August rainfall, comparing Augusts with non-positive SOI to Augusts with a positive SOI.
8. Interpret this OR.

TABLE 14.12: The SOI, and whether rainfall was recorded in Augusts between 1889 and 2002 inclusive

| | Non-positive SOI | Positive SOI |
|---|---|---|
| No rainfall recorded | 14 | 7 |
| Rainfall recorded | 40 | 53 |

Exercise 14.4 A study asked boys and girls in Western Australia about back pain from carrying school bags (Table 14.13).

1. Compute the percentage of boys reporting back pain from carrying school bags.
2. Compute the percentage of girls reporting back pain from carrying school bags.
3. Compute the odds of boys reporting back pain from carrying school bags.
4. Compute the odds of girls reporting back pain from carrying school bags.
5. Compute the odds of a child reporting back pain.
6. Compute the odds ratio of reporting back pain, comparing boys to girls.
7. Interpret this OR.

TABLE 14.13: The number of boys and girls reporting back pain from carrying school bags

| Reported back pain | Males | Females |
|---|---|---|
| No | 330 | 226 |
| Yes | 280 | 359 |