13 Summarising qualitative data

So far, you have learnt to ask a RQ, design a study, collect the data, and classify the data. In this chapter, you will learn to:

  • summarise qualitative data using the appropriate graphs.
  • summarise qualitative data numerically using, for example, medians and odds.

13.1 Introduction

Most quantitative research studies involve qualitative variables. Except for very small amounts of data, understanding data is difficult without a summary. As with quantitative data, qualitative data can be summarised by displaying the distribution of the data: the values taken by the variable, and how often they appear (Def. 12.1).

The distribution can be displayed using a frequency table (Sect. 13.2) or a graph (Sect. 13.3). The distribution of qualitative data can be summarised numerically (Sect. 13.4) by computing proportions (or percentages), odds, or modes; ordinal qualitative data can also be summarised numerically using medians.

13.2 Frequency tables for qualitative data

Qualitative data are typically collated in a frequency table. The rows (or the columns) should list the levels of the variable (Sect. 11.4), and these should be exhaustive (cover all levels) and exclusive (observations belong to only one level).

For nominal data, the levels of the variable can be displayed alphabetically, by size, by personal preference, or in some other way: use the order most likely to be useful to readers. For ordinal data, the natural order of the levels should almost always be used.

Example 13.1 (Opinions of AVs) Researchers (Pyrialakou et al. 2020) surveyed \(400\) residents of Phoenix (Arizona) about their opinions of autonomous vehicles (AVs). Demographic information was collected (Table 13.1), as well as respondents' opinions of sharing roads with AVs (Table 13.2).

The gender of the respondent is nominal (two levels), while the age group is ordinal (six levels); the levels are in the rows. The three questions about safety (Table 13.2) all yield ordinal responses (five levels, in columns).

TABLE 13.1: Demographic information for the AV data for \(400\) respondents
Number Percentage
Gender
Female \(204\) \(51\)
Male \(196\) \(49\)
Age group
\(18\) to \(24\) \(\phantom{0}52\) \(13\)
\(25\) to \(34\) \(\phantom{0}76\) \(19\)
\(35\) to \(44\) \(\phantom{0}76\) \(19\)
\(45\) to \(54\) \(\phantom{0}72\) \(18\)
\(55\) to \(64\) \(\phantom{0}56\) \(14\)
\(65\) and over \(\phantom{0}68\) \(17\)
TABLE 13.2: Responses to three scenarios for the AV data for \(400\) respondents
Response         Measure  Driving near an AV  Cycling near an AV  Walking near an AV
Unsafe           \(n\)    \(58\)              \(\phantom{0}77\)   \(\phantom{0}63\)
                 Percent  \(14\)              \(\phantom{0}19\)   \(\phantom{0}16\)
Somewhat unsafe  \(n\)    \(79\)              \(104\)             \(\phantom{0}86\)
                 Percent  \(20\)              \(\phantom{0}26\)   \(\phantom{0}22\)
Neutral          \(n\)    \(96\)              \(\phantom{0}87\)   \(103\)
                 Percent  \(24\)              \(\phantom{0}22\)   \(\phantom{0}26\)
Somewhat safe    \(n\)    \(97\)              \(\phantom{0}76\)   \(\phantom{0}82\)
                 Percent  \(24\)              \(\phantom{0}19\)   \(\phantom{0}20\)
Safe             \(n\)    \(70\)              \(\phantom{0}56\)   \(\phantom{0}66\)
                 Percent  \(18\)              \(\phantom{0}14\)   \(\phantom{0}16\)
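
As a rough illustration of how such a frequency table might be built from individual responses, the following Python sketch counts the responses in each level and computes percentages. The function name `frequency_table` is hypothetical, and the individual responses are rebuilt from the counts in Table 13.1, since the raw data are not given here.

```python
from collections import Counter

# Age-group levels in their natural (ordinal) order, from Table 13.1.
AGE_LEVELS = ["18 to 24", "25 to 34", "35 to 44",
              "45 to 54", "55 to 64", "65 and over"]


def frequency_table(responses, levels):
    """Return (level, count, percentage) rows, one per level, in the given order."""
    counts = Counter(responses)
    unknown = set(counts) - set(levels)
    if unknown:
        # The levels must be exhaustive: every response needs a row.
        raise ValueError(f"responses outside the listed levels: {sorted(unknown)}")
    total = len(responses)
    # Levels with no responses still get a row (with a count of zero).
    return [(level, counts[level], 100 * counts[level] / total) for level in levels]


# Rebuild one response per respondent from the counts in Table 13.1.
table_counts = [52, 76, 76, 72, 56, 68]
responses = [level for level, n in zip(AGE_LEVELS, table_counts) for _ in range(n)]

rows = frequency_table(responses, AGE_LEVELS)
for level, count, percent in rows:
    print(f"{level:<12} {count:>4} {percent:>5.0f}")
```

Passing the levels explicitly keeps the ordinal levels in their natural order, and flags any response that does not belong to a listed level.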

13.3 Graphs

Three options for graphing qualitative data are:

  • Dot chart (Sect. 13.3.1): Usually a good choice.
  • Bar chart (Sect. 13.3.2): Usually a good choice.
  • Pie chart (Sect. 13.3.3): Only useful in special circumstances, and can be harder to interpret.

For nominal data, the levels of the variable can be displayed alphabetically, by size, by personal preference, or in some other way: use the order most likely to be useful to readers. For ordinal data, the natural order of the levels should almost always be used.

Sometimes these graphs are also used for quantitative data with a small number of possible options. Sometimes, graphs used for quantitative data (Sect. 12.3) may be useful for discrete quantitative data if many values are possible.

The purpose of a graph is to display the information in the clearest, simplest possible way, to facilitate understanding the message(s) in the data.

13.3.1 Dot charts (qualitative data)

Dot charts indicate the counts (or the corresponding percentages) in each level, using dots on a line starting at zero. The levels can be on the horizontal or vertical axis, with the counts or percentages on the other axis. Placing the level names on the vertical axis often makes for easier reading and leaves space for long labels.

Example 13.2 (Dot charts) For the AV study in Example 13.1, a dot chart of the age group of respondents is shown in Fig. 13.1 (top left panel).


FIGURE 13.1: The age group of respondents in the AV study. All the graphs present the same data.

For dot charts:

  • place the qualitative variable on the horizontal or vertical axis (and label with the levels of the variable).
  • use counts or percentages on the other axis.
  • for nominal data, think about the most helpful order for the levels.

13.3.2 Bar charts

Bar charts indicate the counts (or the corresponding percentages) in each level using bars starting from zero. As with dot charts, the levels can be on the horizontal or vertical axis, but placing the level names on the vertical axis often makes for easier reading and leaves room for long labels.

Example 13.3 (Bar charts) For the AV study in Example 13.1, a bar chart of the age group of respondents is shown in Fig. 13.1 (top right panel).

For bar charts:

  • place the qualitative variable on the horizontal or vertical axis (and label with the levels of the variable).
  • use counts or percentages on the other axis.
  • for nominal data, levels can be ordered in any way: think about the most helpful order.
  • leave gaps between the bars, as the bars represent distinct categories.

Bar charts have gaps between all of the bars. In contrast, the bars in histograms are butted together (except when an interval has a count of zero), as the variable axis represents a continuous numerical scale.
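
The guidelines above can be sketched with a very rough text-based bar chart in Python. This is only an illustration using the age-group counts from Table 13.1, not a substitute for proper graphing software; the function name `text_bar_chart` and the choice of one `#` per two respondents are assumptions made for this sketch.

```python
def text_bar_chart(counts, scale=2):
    """Return the lines of a horizontal bar chart: one '#' per `scale` observations."""
    width = max(len(level) for level in counts)
    lines = []
    for level, count in counts.items():
        # Every bar starts from zero, so bar lengths can be compared directly.
        bar = "#" * round(count / scale)
        lines.append(f"{level:<{width}} | {bar} {count}")
    return lines


# Age group is ordinal, so the levels are listed in their natural order.
age_counts = {
    "18 to 24": 52,
    "25 to 34": 76,
    "35 to 44": 76,
    "45 to 54": 72,
    "55 to 64": 56,
    "65 and over": 68,
}

lines = text_bar_chart(age_counts)
print("\n".join(lines))
```

The level names are placed on the vertical axis, which leaves room for longer labels such as '65 and over'.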

13.3.3 Pie charts

In pie charts, a circle is divided into segments proportional to the number in each level of the qualitative variable.

Example 13.4 (Pie charts) For the AV study in Example 13.1, a pie chart of the age group of respondents is shown in Fig. 13.1 (bottom left panel).

Pie charts may present challenges:

  • Pie charts only work when graphing parts of a whole.
  • Pie charts only work when all options are present ('exhaustive').
  • Pie charts are difficult to use with levels having zero counts, or small counts.
  • Pie charts are difficult to read with many categories present.
  • Pie charts are hard to read: Humans compare lengths (as in bar and dot charts) better than angles (as in pie charts) (Friel, Curcio, and Bright 2001).

Would a pie chart be appropriate for graphing the percentage of people who use Firefox, Chrome, and Safari as web browsers?

A pie chart is not suitable. The three browsers are not mutually exclusive (people can use more than one of these browsers) nor exhaustive (some people may use browsers not listed). For example, the percentages could be that \(65\)% use Firefox, \(84\)% use Chrome, and \(20\)% use Safari. These add to more than \(100\)%.
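
One quick check before drawing a pie chart is whether the percentages add to \(100\)%. The Python sketch below applies this check to the illustrative browser percentages above and to the age-group percentages in Table 13.1. The function name `parts_of_a_whole` and the rounding tolerance are assumptions for this sketch. Passing the check is necessary but not sufficient: the levels must still be mutually exclusive and exhaustive.

```python
import math


def parts_of_a_whole(percentages, tolerance=1.0):
    """Return True if the percentages sum to 100, allowing for rounding."""
    return math.isclose(sum(percentages.values()), 100, abs_tol=tolerance)


# Illustrative browser percentages from the text: people may use several browsers.
browsers = {"Firefox": 65, "Chrome": 84, "Safari": 20}

# Age-group percentages from Table 13.1: each respondent is in exactly one group.
age_groups = {
    "18 to 24": 13, "25 to 34": 19, "35 to 44": 19,
    "45 to 54": 18, "55 to 64": 14, "65 and over": 17,
}

print(parts_of_a_whole(browsers))    # False: the percentages sum to 169
print(parts_of_a_whole(age_groups))  # True: the percentages sum to 100
```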

13.3.4 Comparing pie, bar and dot charts

Consider the pie chart in Fig. 13.1 (bottom left panel). Determining which age groups have the most respondents is hard. The equivalent bar chart (or dot chart) makes the comparison easy: clearly the youngest age group has the smallest representation, while the \(25\) to \(34\) and the \(35\) to \(44\) age groups have the most respondents. The tilted pie chart makes this comparison even harder (Fig. 13.1, bottom right panel).

Recall that the purpose of a graph is to display information in the clearest, simplest possible way, to help the reader understand the message(s) in the data. A pie chart often makes the message hard to see (Siegrist 1996).

13.4 Numerical summaries

13.4.1 Parameters and statistics

In quantitative research (Sect. 1.4), both qualitative and quantitative data are summarised and analysed numerically. In the following sections, methods for numerically summarising qualitative variables are described. As described in Sect. 12.4, numerical quantities are computed from one of countless possible samples, even though the whole population is of interest.

A statistic (a sample value) is a numerical value estimating the unknown parameter (population value). Since countless samples are possible (Sect. 5.1), countless values for the statistic are possible, all of which are estimates of the value of the parameter. The value of the statistic that is observed depends on which one of the countless possible samples is (randomly) selected.

The RQ identifies the population, but in practice only one of the many possible samples is studied. Statistics are estimates of parameters, and the value of the statistic is not the same for every possible sample. We only observe one value of the statistic from our single sample.
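
The idea that the statistic varies from sample to sample can be illustrated with a small simulation. The Python sketch below uses an entirely hypothetical population of \(10\,000\) people in which the population proportion of females is \(0.51\) (chosen only for illustration), and repeatedly takes random samples of size \(400\).

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 people, 51% of whom are female (parameter = 0.51).
population = ["Female"] * 5100 + ["Male"] * 4900


def sample_proportion(sample, level):
    """Proportion of the sample in the given level (a statistic)."""
    return sample.count(level) / len(sample)


estimates = []
for _ in range(5):
    # Each random sample of 400 people gives its own value of the statistic.
    sample = random.sample(population, 400)
    estimates.append(sample_proportion(sample, "Female"))

print(estimates)
```

Each value printed is an estimate of the same parameter (\(0.51\)); in practice, only one such sample, and hence one estimate, is observed.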

13.4.2 Proportions and percentages

Qualitative data can be summarised using proportions or percentages. These can be given instead of, or with, the counts (as in Tables 13.1 and 13.2).

Definition 13.1 (Proportion) A proportion is a fraction out of a total, and is a number between \(0\) and \(1\).

Definition 13.2 (Percentages) A percentage is a proportion, multiplied by \(100\). In this context, percentages are numbers between \(0\)% and \(100\)%.

Population proportions are almost always unknown. Instead, the population proportion (the parameter), denoted \(p\), is estimated by a sample proportion (a statistic), denoted by \(\hat{p}\).

The symbol \(\hat{p}\) is pronounced 'pee-hat', and refers to a sample proportion.

As always, only one of the many possible samples is studied. Statistics are estimates of parameters, and the value of the statistic is not the same for every possible sample.

Example 13.5 (Proportions and percentages) Consider the AV data in Table 13.1, summarising results from a sample of \(n = 400\) respondents. The sample proportion of respondents aged \(25\) to \(34\) is \(76\div 400\), or \(0.19\). The sample percentage of respondents aged \(25\) to \(34\) is \(76\div 400 \times 100\), or \(19\)%, as given in the table.
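
The arithmetic in Example 13.5 can be reproduced with a few lines of Python. The function names `proportion` and `percentage` are hypothetical names used only in this sketch.

```python
def proportion(count, total):
    """Sample proportion: a fraction out of a total, between 0 and 1."""
    if not 0 <= count <= total:
        raise ValueError("count must be between 0 and total")
    return count / total


def percentage(count, total):
    """Sample percentage: the proportion multiplied by 100."""
    if not 0 <= count <= total:
        raise ValueError("count must be between 0 and total")
    return 100 * count / total


# Respondents aged 25 to 34 in Table 13.1: 76 out of n = 400.
p_hat = proportion(76, 400)
print(p_hat)                # 0.19
print(percentage(76, 400))  # 19.0
```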

13.4.3 Odds

For the AV data in Table 13.1, the number of females is slightly larger than the number of males. More specifically, the ratio of females to males is \(204\div 196 = 1.04\); that is, there are \(1.04\) times as many females as males. This value of \(1.04\) is the odds that a respondent is female. An alternative interpretation is that there are \(1.04\times 100 = 104\) females for every \(100\) males.

While proportions and percentages are computed as the number of interest divided by the total number, the odds are computed as the number of interest divided by the remaining number.

Definition 13.3 (Odds) The odds are the number (or proportion, or percentage) of times that an event happens, divided by the number (or proportion, or percentage) of times that the event does not happen:
\[ \text{Odds} = \frac{\text{Number of times event happens}}{\text{Number of times event doesn't happen}} \] or (equivalently)
\[ \text{Odds} = \frac{\text{Proportion of times event happens}} {\text{Proportion of times event doesn't happen}}. \] The odds are how many times an event happens compared to the event not happening.
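
To see why the two forms in Definition 13.3 agree, suppose (using notation introduced only for this sketch) that the event happens \(x\) times in a sample of size \(n\), so that the sample proportion is \(\hat{p} = x/n\). Dividing the numerator and denominator by \(n\) gives
\[ \text{Odds} = \frac{x}{n - x} = \frac{x/n}{(n - x)/n} = \frac{\hat{p}}{1 - \hat{p}}. \]
For instance, for the females in Table 13.1, \(204\div 196\) and \(0.51\div 0.49\) both give \(1.04\) (to two decimal places).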

Example 13.6 (Odds males) The AV data in Table 13.1 includes \(204\) females and \(196\) males. The odds that a respondent is female is \(1.04\), as found above. The odds is greater than one, as the number of females is larger than the number of males.

The odds that a respondent is male is \(196/204 = 0.96\); that is, there are \(0.96\) times as many males as females. The odds is less than one, as the number of males is smaller than the number of females. Alternatively, there are \(96\) males for every \(100\) females.

Example 13.7 (Odds) Consider the AV data in Table 13.1, summarising results from a sample of \(n = 400\) respondents.

The percentage of respondents that are female is \(204\div400\times 100 = 51\)%. The odds that a respondent is female is \(204\div(400 - 204) = 1.04\).

Similarly, the percentage of respondents aged \(18\) to \(24\) is \(52/400\times 100 = 13\)%. The odds that a respondent is aged \(18\) to \(24\) is \(52/(400 - 52) = 0.15\). This means that the number of respondents aged \(18\) to \(24\) is \(0.15\) times the number of respondents aged over \(24\) (i.e., smaller).

The odds that a respondent is aged \(18\) to \(54\) is \((52 + 76 + 76 + 72)/(56 + 68) = 2.23\). This means that the number of respondents aged \(18\) to \(54\) is \(2.23\) times the number of respondents aged over \(54\) (i.e., larger).
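
The odds calculations in Examples 13.6 and 13.7 can be checked with a short Python sketch. The function name `odds` is hypothetical; it simply divides the number of times the event happens by the number of times it does not.

```python
def odds(count, total):
    """Odds of an event: times it happens divided by times it does not happen."""
    if not 0 <= count < total:
        # If count equals total, the event always happens and the odds are undefined.
        raise ValueError("count must be at least 0 and less than total")
    return count / (total - count)


n = 400  # sample size for the AV data (Table 13.1)

print(round(odds(204, n), 2))                # female: 1.04
print(round(odds(196, n), 2))                # male: 0.96
print(round(odds(52, n), 2))                 # aged 18 to 24: 0.15
print(round(odds(52 + 76 + 76 + 72, n), 2))  # aged 18 to 54: 2.23
```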

Take care interpreting odds:

  • odds are greater than \(1\): the event is more likely to happen than to not happen.
  • odds are equal to \(1\): the event is just as likely to happen as it is to not happen.
  • odds are less than \(1\): the event is less likely to happen than to not happen.
What are the odds of rolling a on a die?

Population odds are almost always unknown. Instead, the population odds (the parameter) is estimated by a sample odds (a statistic).

13.4.4 Modes

Because qualitative data has levels (Def. 11.5), all qualitative data (nominal; ordinal) can be numerically summarised by counting the number of observations in each level (or computing the percentage of observations in each level). The mode is the level (or levels) with the most observations.

Definition 13.4 (Mode) A mode is the level (or levels) of a qualitative variable with the most observations.

Example 13.8 (Modes) Consider again the data in Tables 13.1 and 13.2. 'Gender' is nominal qualitative; age group is ordinal qualitative. The responses to the three questions are ordinal. The mode could be used to summarise each variable:

  • The mode for gender is 'Female' (with \(204\) respondents).
  • The modal age groups are \(25\) to \(34\) and \(35\) to \(44\) (both with \(76\) respondents).
  • The modal response to the question about driving near AVs is 'Somewhat safe' (\(97\) respondents).
  • The modal response to the question about cycling near AVs is 'Somewhat unsafe' (\(104\) respondents).
  • The modal response to the question about walking near AVs is 'Neutral' (\(103\) respondents).

Population modes are almost always unknown. Instead, the population mode (the parameter) is estimated by a sample mode (a statistic).
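
Finding the modes in Example 13.8 amounts to finding the level (or levels) with the largest count. A minimal Python sketch, using a hypothetical function `modes` that returns every level tied for the largest count, is:

```python
def modes(counts):
    """Return the level or levels with the most observations."""
    top = max(counts.values())
    return [level for level, count in counts.items() if count == top]


gender = {"Female": 204, "Male": 196}
age_group = {
    "18 to 24": 52, "25 to 34": 76, "35 to 44": 76,
    "45 to 54": 72, "55 to 64": 56, "65 and over": 68,
}
cycling = {
    "Unsafe": 77, "Somewhat unsafe": 104, "Neutral": 87,
    "Somewhat safe": 76, "Safe": 56,
}

print(modes(gender))     # ['Female']
print(modes(age_group))  # ['25 to 34', '35 to 44']
print(modes(cycling))    # ['Somewhat unsafe']
```

Returning a list allows for ties, as with the two modal age groups.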

13.4.5 Medians for ordinal data

Ordinal data can be summarised in ways that nominal data cannot be, since ordinal data have levels with a natural order. Ordinal qualitative data, but not nominal data, can be summarised using medians, by finding the location of the response that is in the middle, when the levels from all individuals are placed in order.

Medians can be used to summarise quantitative data and ordinal data, but never nominal data.

Example 13.9 (Medians) Consider again the data in Tables 13.1 and 13.2. 'Gender' is nominal qualitative, so medians are not appropriate. However, the other variables are ordinal, so medians could be used to summarise each variable. Since \(n = 400\), the median response will be halfway between the location of the \(200\)th and \(201\)st response when ordered:

  • the median age group is \(35\) to \(44\) (observations \(200\) and \(201\) fall into this level).
  • the median response to the driving question is 'Neutral' (observations \(200\) and \(201\) fall in this level).
  • the median response to the cycling question is 'Neutral' (observations \(200\) and \(201\) fall in this level).
  • the median response to the walking question is 'Neutral' (observations \(200\) and \(201\) fall in this level).
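
The medians in Example 13.9 can be located from the counts by accumulating the counts through the levels in their natural order, and finding the level containing the middle observation(s). The Python sketch below does this; the function name `ordinal_median` is hypothetical, and the function returns two levels if the two middle observations fall in different levels.

```python
def ordinal_median(counts):
    """Return the level(s) holding the middle observation(s).

    `counts` maps levels to counts, with the levels in their natural order.
    """
    n = sum(counts.values())
    # Positions (counting from 1) of the middle observation(s).
    middle = [(n + 1) // 2] if n % 2 else [n // 2, n // 2 + 1]
    found = []
    for position in middle:
        cumulative = 0
        for level, count in counts.items():
            cumulative += count
            if position <= cumulative:
                if level not in found:
                    found.append(level)
                break
    return found


age_group = {
    "18 to 24": 52, "25 to 34": 76, "35 to 44": 76,
    "45 to 54": 72, "55 to 64": 56, "65 and over": 68,
}
driving = {
    "Unsafe": 58, "Somewhat unsafe": 79, "Neutral": 96,
    "Somewhat safe": 97, "Safe": 70,
}

print(ordinal_median(age_group))  # ['35 to 44']
print(ordinal_median(driving))    # ['Neutral']
```

For the age groups, the cumulative counts are \(52\), \(128\) and \(204\), so observations \(200\) and \(201\) both fall in the \(35\) to \(44\) level.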

Means are generally not suitable for numerically summarising qualitative data. However, ordinal data may be numerically summarised like quantitative data in rare and very special circumstances: only when

  • the levels are considered 'equally spaced'; and
  • assigning a number to each level is appropriate (perhaps using a mid-point for numerical groups).

13.5 Numerical summary tables

In studies with qualitative variables, these variables should be summarised in a table. The table should include, as a minimum, numbers and percentages. While useful in other contexts (see Chap. 14), odds are usually not given in such a summary. An example of a summary table is given in the next section.

13.6 Example: water access

A study of three rural communities in Cameroon (López-Serrano et al. 2022) recorded data about access to water (see Sect. 12.10). Numerous qualitative variables are recorded; some are displayed in Fig. 13.2, and a summary table in Table 13.3. Notice that the levels of the two ordinal variables are displayed in the natural order.

The distance to the nearest water source is usually less than \(1\) km, and the wait at the source is often over \(15\) mins (\(35.0\)%). The most common water source is a bore (\(68.6\)%).

TABLE 13.3: Summarising some qualitative data in the water-access study
Num. % Odds
Distance to water source
Under \(100\) m \(55\) \(45.5\) \(0.83\)
\(100\) m to \(1000\) m \(57\) \(47.1\) \(0.89\)
Over \(1000\) m \(\phantom{0}9\) \(\phantom{0}7.4\) \(0.08\)
Wait time at water source
Under \(5\) min \(50\) \(41.7\) \(0.71\)
\(5\) to \(15\) min \(28\) \(23.3\) \(0.30\)
Over \(15\) min \(42\) \(35.0\) \(0.54\)
Water source
Tap \(\phantom{0}7\) \(\phantom{0}5.8\) \(0.06\)
Bore \(83\) \(68.6\) \(2.18\)
Well \(16\) \(13.2\) \(0.15\)
River \(15\) \(12.4\) \(0.14\)

FIGURE 13.2: The age of the women, the number of people in the household, and the number of children in the household for the water-access study. (Some data are missing.)
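
The numbers, percentages and odds in a summary table such as Table 13.3 can be computed from the counts alone. The Python sketch below reproduces the 'Water source' rows; the function name `summary_rows` is hypothetical, and the sketch assumes that no single level contains every observation (otherwise the odds are undefined).

```python
def summary_rows(counts):
    """Return (level, number, percentage, odds) for each level."""
    total = sum(counts.values())
    rows = []
    for level, count in counts.items():
        percent = 100 * count / total
        level_odds = count / (total - count)  # assumes count < total
        rows.append((level, count, round(percent, 1), round(level_odds, 2)))
    return rows


# Counts for the water source, from Table 13.3.
water_source = {"Tap": 7, "Bore": 83, "Well": 16, "River": 15}

rows = summary_rows(water_source)
for level, number, percent, level_odds in rows:
    print(f"{level:<6} {number:>3} {percent:>5.1f} {level_odds:>5.2f}")
```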

13.7 Chapter summary

A qualitative variable can be numerically summarised using the mode, percentages or odds. Ordinal qualitative variables may be numerically summarised using a median.

13.8 Quick review questions

Are the following statements true or false?

  1. Nominal data can be summarised using a median.
  2. Ordinal data can be summarised using a mode.
  3. Odds are the ratio of how often an event occurs, to how often it does not occur.
  4. Proportions and percentages are the same.

13.9 Exercises

Selected answers are available in App. E.

Exercise 13.1 A study of spider monkeys (C. A. Chapman 1990) examined the types of social groups present (Table 13.4). Construct a suitable plot, and explain what the data reveal about the social groups of spider monkeys.

TABLE 13.4: Social groups in spider monkeys
Social group Number
Solitary \(\phantom{0}8\)
All males \(\phantom{0}3\)
Female + no young \(\phantom{0}2\)
Mixed young \(15\)
Mixed + no young \(\phantom{0}1\)
One female + offspring \(23\)
Many females + offspring \(48\)

FIGURE 13.3: Graphs of deaths from lung cancer in Fredericia (Denmark) from 1968 and 1971

Exercise 13.2 A study (Henderson and Velleman 1981) recorded the number of cylinders in many models of cars: eleven cars had four cylinders, seven had six cylinders, and fourteen had eight cylinders. The number of cylinders is quantitative discrete, but with so few different values, this data could be plotted with some of the graphs used for qualitative data. For these data:

  1. Produce a dot chart.
  2. Produce a histogram.
  3. Produce a bar chart.
  4. Produce a pie chart.

What graph do you think is best? Why?

Exercise 13.3 A survey of voice assistants (such as Amazon Echo; Google Home; etc.) conducted by Nielsen asked respondents to indicate how they used their voice assistant. The options were:

  • Listening to music;
  • Listen to news;
  • Chat with voice assistant for fun;
  • Use alarms, timer;
  • Search for real-time information (e.g., traffic; weather);
  • Search for factual information (e.g., trivia; history).

What would be the best graph for displaying respondents' answers? Would a pie chart be suitable? Explain your answer.

Exercise 13.4 In a study of the taste of bread with varying salt and fibre content (Gębski et al. 2019), researchers recorded information from the \(300\) subjects, including gender, place of residence, and the subjects' responses to the statement 'Rolls with lower salt content taste worse than regular ones', on a five-point ordinal scale from 'Strongly Agree' to 'Strongly Disagree'.

  1. Identify the variables, then classify them as nominal or ordinal.
  2. For which variables is a mode an appropriate summary (if any)?
  3. For which variables is a median an appropriate summary (if any)?
  4. For which variables is a mean an appropriate summary (if any)?
  5. Compute the above where appropriate.
TABLE 13.5: The bread-tasting data
Number Percentage
Gender
Female \(150\) \(50\)
Male \(150\) \(50\)
Place of residence
Rural \(49\) \(16\)
City up to \(20\, 000\) residents \(38\) \(13\)
City \(20\, 000\) to \(100\, 000\) residents \(83\) \(28\)
City more than \(100\, 000\) residents \(130\) \(43\)
Response to statement
Strongly agree \(30\) \(10\)
Agree \(84\) \(28\)
Neutral \(78\) \(26\)
Disagree \(66\) \(22\)
Strongly disagree \(42\) \(14\)

Exercise 13.5 A study asked \(231\) farmers what they considered to be the advantages and disadvantages of using reclaimed water (López-Serrano et al. 2022). The responses are shown in Table 13.6 (not all farmers responded).

  1. Produce two bar charts to display the data.
  2. Produce two dot charts to display the data.
  3. Produce two pie charts to display the data.
  4. Determine the mode for the advantages and disadvantages.
  5. Compute the percentages for the advantages and disadvantages.
  6. Compute the odds of a farmer stating 'high price' as a disadvantage, among all farmers in the study.
  7. Compute the odds of a farmer stating 'high price' as a disadvantage, among all farmers who listed a disadvantage.
TABLE 13.6: The advantages and disadvantages of using reclaimed water, reported by \(231\) farmers.
Advantage No. farmers
Water reutilization \(15\)
Availability \(27\)
Sustainability \(16\)
Disadvantage No. farmers
High price \(40\)
Growing conductivity \(12\)
Lack of proper filtering \(21\)

Exercise 13.6 A study of \(284\) university students in Joinville, Brazil, tabulated how students get to campus (Table 13.7; Henning, Ferreira Schubert, and Ceccatto Maciel 2020).

  1. What is the modal type of active transport, and of motorised transport?
  2. What is the modal type of transport overall?
  3. Is a median appropriate?
  4. Compute the odds that a randomly chosen student uses motorised transport to get to campus. Explain what this means.
  5. Compute the odds that a student walks to campus. Explain what this means.
  6. Construct appropriate plots to display the data.
TABLE 13.7: Modes of transport for students getting to campus
Number Percentage
Active
Bicycle \(\phantom{0}29\) \(10.2\)
Walking \(\phantom{0}35\) \(12.3\)
Motorised
Car \(\phantom{0}70\) \(24.7\)
Bus \(117\) \(41.2\)
Other \(\phantom{0}33\) \(11.6\)