12 Graphical summaries

So far, you have learnt to ask a RQ, identify different ways of obtaining data, design the study, collect the data and describe the data.

In this chapter, you will learn to graph the data, so we can understand the data used to answer the RQ. You will learn to:

  • select the appropriate graphic to graphically summarise data.
  • graphically summarise data using quality, appropriate graphs.
  • interpret graphs.
  • identify badly prepared graphics, giving reasons.

12.1 Introduction

To answer a RQ, a study is designed to collect data, because data are the means by which the research question is answered. But before analysis, understanding, describing and summarising the collected data is important. This chapter discusses the use of graphs to summarise data.

The purpose of a graph is to display the information in the clearest, simplest possible way, to help the reader understand the message(s) in the data.

Graphs are not produced just for the sake of it; what the graph tells us should always be explained, especially regarding what it reveals about answering the RQ.

12.2 One quantitative variable

For quantitative data, a graph shows the distribution of the data:

Definition (Distribution) The distribution of a quantitative variable describes what values are present in the data, and how often those values appear.

The graphs discussed in this section usually work best for continuous quantitative data, but may also be useful for discrete quantitative data if many possible values are present. Sometimes, discrete quantitative data with very few values are best graphed using the graphs discussed in Sect. 12.3.

Three different types of graphs can be used to show how the values of one quantitative variable are distributed:

  • Stem-and-leaf plots (Sect. 12.2.1).
  • Dot charts (Sect. 12.2.2).
  • Histograms (Sect. 12.2.3).

Whatever graph is used, what the graph shows should be described.

12.2.1 Stem-and-leaf plots

Stem-and-leaf plots (or stemplots) are best described and explained using an example, so consider the data in Fig. 12.1.

The data give the weights (in kg) of babies born in a Brisbane hospital on one day.269

The data set also includes the gender of each baby, and the number of minutes after midnight that each birth occurred.

The data are given in the order in which the births occurred.

FIGURE 12.1: The baby-births data

For these data, the weights (quantitative) are recorded to one decimal place of a kilogram. In a stemplot, part of each number is placed to the left of a vertical line (the stem), and the rest of each number to the right (the leaf).

Here, the whole number of kilograms is placed to the left (as a stem), and the first decimal place is placed on the right (as a leaf). From this plot, most birthweights are seen to be 3-point-something kilograms.

For stem-and-leaf plots:

  • Place the larger unit (e.g., kilograms) on the left (stems).
  • Place the next smallest unit (e.g., first decimal place of a kilogram) on the right (leaves).
  • Some data do not work well with stem-and-leaf plots.
  • Sometimes, data might need to be suitably rounded before creating the stem-and-leaf plot.
  • The numbers in each row should be evenly spaced, so that the numbers in the columns are under each other. This allows patterns to be seen.
  • Within each row, the observations are ordered on each stem so patterns can be seen.
  • Add an explanation for reading the stem-and-leaf plot. For example, the caption for the stem-and-leaf plot for the baby-birth data in Sect. 12.2.1 says '2 | 6 means 2.6kg'. Without this explanation, '2 | 6' could be read as 26kg or 0.26kg.
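
The construction described above can be sketched in a few lines of Python. This is only a minimal sketch: the function name `stem_and_leaf` and the weights are made up for illustration (they are not the baby-birth data), and the values are assumed to be recorded to one decimal place.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf plot for values with one decimal place."""
    leaves = defaultdict(list)
    for v in sorted(values):                      # ordered leaves on each stem
        stem, leaf = divmod(round(v * 10), 10)    # 2.6 -> stem 2, leaf 6
        leaves[stem].append(leaf)
    rows = []
    # Include every stem between the smallest and largest, even if it is empty
    for stem in range(min(leaves), max(leaves) + 1):
        row = f"{stem} | " + " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(row.rstrip())
    return rows

# Hypothetical weights (in kg), for illustration only
weights = [3.4, 2.6, 3.1, 1.9, 3.8, 3.5, 4.1, 3.3, 2.9, 3.6]

print("Key: 2 | 6 means 2.6kg")
for row in stem_and_leaf(weights):
    print(row)
```

Printing the key first mirrors the advice above: the plot is hard to read without it.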

Example 12.1 (Stem-and-leaf plots) A study of krill270 produced 15 measurements of the number of eggs. The stemplot shows that the number of eggs is usually under 10, but occasionally a large number of eggs are seen.

12.2.2 Dot charts (quantitative data)

Dot charts show the original data on a single axis, with each observation represented by a dot.

Example 12.2 (Dot charts) A study examined the serving size of fries at McDonald's,271 and produced the dot chart in Fig. 12.2 (based on Wetzel272, Fig. 2).

The mass of fries is almost always under the target, and often substantially so. An alternative way to look at these data is to measure the percentage that each serving is in relation to the target serving (Fig. 12.3), where 100% means the serving size was exactly the target weight.

FIGURE 12.2: Mass measurements for large orders of french fries

FIGURE 12.3: Percentage variation from target mass, for large orders of french fries

Example 12.3 (Dot charts) Consider again the weights (in kg) of babies born in a Brisbane hospital in one day. Again, a dot chart (Fig. 12.4) shows that most babies are between 3 and 4kg.

FIGURE 12.4: A dot chart of the baby-weight data

12.2.3 Histograms

Histograms are a series of boxes, where the width of the box represents a range of values of the variable being graphed, and the height of the box273 represents the number (or percentage) of observations within that range of values.

Example 12.4 (Histograms) Consider again the weights (in kg) of babies born in a Brisbane hospital in one day.274

A histogram (below) can be constructed for these data. When an observation occurs on a boundary between the boxes, software usually (but not universally) places it in the higher box (so 2.5kg would be counted in the '2.5 to 3.0kg' box, not the '2.0 to 2.5kg' box). The histogram shows, for example, that 17 babies weighed 3.0kg or more, but under 3.5kg.
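
The boundary convention just described (a value on a boundary goes into the higher box) can be sketched as follows. The function name `histogram_counts`, the bin edges and the weights are assumptions chosen for illustration, not the baby-birth data, and other software may use a different rule.

```python
from bisect import bisect_right

def histogram_counts(values, edges):
    """Count values in bins [edges[i], edges[i+1]); boundary values go in the higher bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bisect_right(edges, v) - 1   # index of the last edge that is <= v
        if 0 <= i < len(counts):         # values outside the edges are ignored
            counts[i] += 1
    return counts

edges = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
# Hypothetical weights (in kg); 2.5, 3.0 and 3.5 sit exactly on boundaries
weights = [1.7, 2.5, 2.9, 3.0, 3.2, 3.4, 3.5, 3.9, 4.2]

counts = histogram_counts(weights, edges)
for lower, upper, count in zip(edges, edges[1:], counts):
    print(f"{lower} to under {upper}kg: {count}")
```

Here 2.5kg is counted in the '2.5 to 3.0kg' box, matching the convention in the text. In this sketch, a value exactly equal to the final edge would fall outside every bin.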

Example 12.5 (Histograms) A study of 'headache attributed to ingestion or inhalation of a cold stimulus' (HICS), commonly known as a brain freeze from eating cold food (e.g., ice cream) or drinking a cold drink, measured the duration of the brain freeze.275

A histogram of the data (Fig. 12.5, based on Mages et al.276, Figure 2b) shows that 11 people experienced HICS symptoms lasting less than 5 seconds.

In addition, 9 people experienced symptoms for at least 5 but less than 10 seconds, and 1 person experienced symptoms for at least 35 seconds but under 40 seconds.

FIGURE 12.5: Duration of HICS (brain freeze) after drinking ice water

12.2.4 Describing the distribution

Graphs are constructed to help us understand the data. After producing a graph for one quantitative variable, then, we need to summarise what we learn. For a single quantitative variable, describe:

  1. Average: What is an "average" or typical value?
  2. Variation: How much variation is present in the bulk of the data?
  3. Shape: How are the values distributed? That is, are most of the values smaller values, or larger values, or about evenly distributed between smaller and larger values?
  4. Outliers (observations unusually large or small) or unusual features: Are there any unusual observations, or anything else of interest?

Describing the shape can be tricky, but terminology may help:

  • Skewed right: the bulk of the data is smaller, but there are some larger values (to the right).
  • Skewed left: the bulk of the data is larger, but there are some smaller values (to the left).
  • Symmetric data (and perhaps bell-shaped): There are approximately equal numbers of values that are smaller and larger.
  • Bimodal data: There are two peaks in the distribution.

Typical shapes are shown in Fig. 12.6. Sometimes, no short description is suitable.

FIGURE 12.6: Some common shapes of the distribution of quantitative data

Example 12.6 (Bimodal data) The Old Faithful geyser in Yellowstone National Park (USA) erupts regularly.277

The time between eruptions (Fig. 12.7) is clearly bimodal, with a peak near 55 minutes and another near 80 minutes.

FIGURE 12.7: Histogram of the times between eruptions for the Old Faithful geyser

The number of bins used in the histogram can change the impression of the distribution. Using software, it is usually possible to try different bin sizes to find a bin size that suitably displays the overall distribution.
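
As a rough illustration of this point, the sketch below counts the same values using two different bin widths. The waiting times are invented for illustration (they are not the Old Faithful data), and `bin_counts` is a hypothetical helper name.

```python
def bin_counts(values, start, width, n_bins):
    """Count values in n_bins equal-width bins, the first beginning at start."""
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)   # which bin the value falls into
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts

# Hypothetical waiting times (in minutes) with two clusters
times = [50, 52, 54, 55, 56, 58, 60, 75, 78, 79, 80, 81, 82, 84, 86]

narrow = bin_counts(times, start=50, width=10, n_bins=4)  # 50-60, 60-70, 70-80, 80-90
wide = bin_counts(times, start=50, width=20, n_bins=2)    # 50-70, 70-90

print("Bin width 10:", narrow)
print("Bin width 20:", wide)
```

With the narrower bins the two clusters are visible as two peaks; with the wider bins the counts look almost flat, and the bimodal shape is hidden.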

FIGURE 12.8: Changing the bin width can change the impression of the distribution

Example 12.7 (Describing quantitative data) For the baby-weight data displayed in, for example, Fig. 12.4:

  • The average weight is somewhere between 3 and 3.5 kilograms.
  • The variation in weights is between 1.5 and 4.5 kilograms approximately.
  • The shape is slightly skewed to the left. That is, occasional small birth weights appear (probably premature babies).
  • There do not appear to be any outliers or anything unusual.

Describe the histogram in Example 12.5, the brain freeze durations.

  • Average: Hard to be sure; perhaps between 10 and 15 seconds. (More observations appear at the smaller values, as the bars there are higher.)
  • Variation: From about 0 to about 40 seconds.
  • Shape: Slightly skewed right.
  • Outliers: Probably none. The observation between 35 and 40 seconds may be an outlier, but I suspect it is not, as a larger sample may very well have observations between 30 and 35 seconds. Of course, I could be wrong.

12.3 One qualitative variable

For qualitative data, graphs show how often each level of the variable occurs in the data. The three options for graphing qualitative data are:

  • Dot chart: Usually a good choice.
  • Bar chart: Usually a good choice.
  • Pie chart: Only useful in special circumstances, and can be harder to interpret.

Comparing these graphs is useful too; indeed, sometimes a graph may not even be needed.

For nominal data, the order in which the levels of the variables appear is unimportant, so categories could be ordered alphabetically, by size, by personal preference, or any other way. Since you have a choice, think about the order that is most useful to readers. For ordinal data, the natural order of the levels should almost always be used.

Sometimes these graphs are also used for discrete quantitative data with a small number of possible options.

12.3.1 Dot charts (qualitative data)

Dot charts indicate the counts (or the corresponding percentages) in each level, using dots on a line starting at zero. The levels can be on the horizontal or vertical axis; placing the level names on the vertical axis often makes for easier reading, and leaves room for long labels.

Example 12.8 (Dot plots) A study of spider monkeys278 examined the social groups present in a sample. A dot chart (Fig. 12.9) shows the most common social group has many females plus offspring (with almost 50 social groups).

FIGURE 12.9: Dot chart of spider monkey family groups

12.3.2 Bar charts

Bar charts indicate the counts in each category using a bar starting from zero. As with dot charts, the levels can be on the horizontal or vertical axis, but placing the level names on the vertical axis often makes for easier reading, and leaves room for long labels.

Example 12.9 (Bar charts) In a study of functional independence,279 the types of diagnosis were graphed using a bar chart (Fig. 12.10). For example, two people in the sample have cerebral palsy.

The reason for the different coloured bars becomes apparent in Sect. 12.3.3.

FIGURE 12.10: Diagnoses of participants

For bar charts and dot charts:

  • Place the qualitative variable on the horizontal or vertical axis (and label with the levels of the variable).
  • Use counts or percentages on the other axis.
  • For nominal data, dots and bars can be ordered any way: Think about the most helpful order.
  • Bars in bar charts have gaps between them, as the bars represent distinct categories. In contrast, the bars in histograms are butted together (except when an interval has a count of zero), as the bars represent a numerical scale.
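
The counts that a bar chart or dot chart displays are just tallies of each level. As a minimal sketch (the function name `text_bar_chart` and the travel-mode data are made up for illustration), the tallies can be found and drawn as a simple horizontal text chart, with the levels ordered from most to least common:

```python
from collections import Counter

def text_bar_chart(observations):
    """Return rows of a horizontal text bar chart, most common level first."""
    counts = Counter(observations)
    width = max(len(level) for level in counts)   # room for the longest label
    return [f"{level:<{width}} | {'#' * count} {count}"
            for level, count in counts.most_common()]

# Hypothetical nominal data: how each person travelled to work
modes = ["Car", "Bus", "Car", "Walk", "Car", "Bus", "Car", "Bicycle"]

for row in text_bar_chart(modes):
    print(row)
```

Placing the level names on the left (the vertical axis) leaves room for long labels, and each bar starts from zero.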

12.3.3 Pie charts

In pie charts, a circle is divided into segments proportional to the number in each level of the qualitative variable.

Example 12.10 (Pie charts) In a study of functional independence,280 the severity of the diagnoses was graphed using a pie chart (Fig. 12.11). This picture actually conveys one thing only ("69% of patients had a less severe injury"), so a graph of any kind is probably unnecessary.

The pie chart colours explain the colours used in the bar chart in Example 12.9. This is called encoding extra information into the bar chart.

FIGURE 12.11: Severity of diagnoses of participants

Pie charts present challenges:

  • Pie charts only work when graphing parts of a whole.
  • Pie charts only work when all options are present ('exhaustive').
  • Pie charts only work when each unit can appear in just one group ('mutually exclusive').
  • Pie charts are difficult to use when levels with zero counts, or small counts, are present.
  • Pie charts are difficult to read when many categories are present.
  • Pie charts are hard to read: In general, human brains are better at comparing lengths (as used in bar and dot charts) than comparing angles (as used in pie charts).281

In which of these situations is a pie chart appropriate?

  1. The percentage of people who use these web browsers: Firefox, Chrome, and Safari.
  2. For each state of Australia, the percentage of people who own an iPhone.
  3. The percentage of students awarded different grades in this course last semester.

  1. A pie chart is not suitable. Each individual (person) has information recorded on two qualitative variables: (a) which browser is being asked about (three levels); and (b) whether or not they use that browser ('yes' or 'no'). Also, the three browsers are not mutually exclusive (people can use more than one of these browsers) nor exhaustive (some people may use browsers not listed, such as Edge, Brave, Vivaldi, etc.). For example, the percentages could be that 65% use Firefox, 84% use Chrome, and 20% use Safari. These add to more than 100%.
  2. A pie chart is not suitable, as the percentages are not parts of a whole. Again, each individual (person) has information recorded on two qualitative variables: (a) which state the person lives in (many levels); and (b) whether or not they own an iPhone ('yes' or 'no').
    For example, the percentages could be 53% in Queensland, 61% in NSW, 41% in Victoria, and so on. They could possibly add to more than 100%.
  3. A pie chart is suitable. Only one qualitative variable is recorded for each individual (person): their grade.
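
One quick check follows from these answers: percentages that are parts of a whole must add to 100% (allowing for rounding). The sketch below applies that check to the browser percentages quoted above and to some invented grade percentages; the function name `sums_to_whole` and the tolerance are assumptions. Passing the check is necessary but not sufficient, since the levels must also be mutually exclusive and exhaustive.

```python
def sums_to_whole(percentages, tolerance=0.5):
    """Return True if the percentages add to 100, within a rounding tolerance."""
    return abs(sum(percentages.values()) - 100) <= tolerance

# Percentages quoted in the first answer above
browsers = {"Firefox": 65, "Chrome": 84, "Safari": 20}

# Hypothetical grade percentages, for illustration only
grades = {"A": 10, "B": 20, "C": 30, "D": 30, "F": 10}

print("Browsers:", sum(browsers.values()), sums_to_whole(browsers))
print("Grades:", sum(grades.values()), sums_to_whole(grades))
```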

12.3.4 Comparing pie charts and bar charts

Consider the pie chart (using data in E. B. Andersen282) in the top panel of Fig. 12.12.

The pie chart displays the number of lung cancer deaths in Fredericia between 1968 and 1971 inclusive, for various age groups (qualitative).

A pie chart is appropriate: only one variable is recorded on each individual (the age of each individual person), and the counts are parts of a whole. However, notice that determining which age groups have the most lung cancer deaths is hard.

The equivalent bar chart (lower panel) makes the comparison easy: clearly the age groups '65 to 69' and 'Over 74' have slightly fewer deaths than the other age groups.

FIGURE 12.12: Lung cancer deaths in Fredericia between 1968 and 1971, by age group: pie chart (top panel) and bar chart (lower panel)

Recall that the purpose of a graph is to display information in the clearest, simplest possible way, to help the reader understand the message(s) in the data. Adding an artificial third dimension usually makes the message hard to see;283 see Example 12.11.

Example 12.11 (Comparing graphs) In the NHANES study,284 the age of each participant was recorded.

Rank the age groups from largest group to smallest group using each graph in Fig. 12.13, all constructed from the same data.

Which graph makes it easiest to compare the sizes of the categories?

FIGURE 12.13: Four different graphs for the same data

12.3.5 Is a graph needed? Tabular summaries

Graphs are generally excellent for summarising data.

However, small amounts of qualitative data may not need a graphical summary; sometimes, the data can be collated in a table (Fig. 12.11), or a tabular summary. With small amounts of information, sometimes just writing the information is better ('69% of diagnoses were less severe').

Example 12.12 (Tables for data) The NHANES age data in Example 12.11 may not need a graphical summary. Fig. 12.13 is a graphical summary of the tabular summary in Table 12.1.

TABLE 12.1: The NHANES age distribution, displayed in a tabular summary
Age group   Number   Percentage
0-9           1391         14.4
10-19         1374         14.2
20-29         1356         14.0
30-39         1338         13.8
40-49         1398         14.5
50-59         1304         13.5
60+           1506         15.6
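
The percentages in Table 12.1 can be reproduced from the counts. A minimal sketch (the variable names are chosen for illustration):

```python
# Counts from Table 12.1
age_counts = {
    "0-9": 1391, "10-19": 1374, "20-29": 1356, "30-39": 1338,
    "40-49": 1398, "50-59": 1304, "60+": 1506,
}

total = sum(age_counts.values())

# Each percentage is the count as a share of the total, to one decimal place
percentages = {group: round(100 * n / total, 1) for group, n in age_counts.items()}

for group, n in age_counts.items():
    print(f"{group:<6} {n:>5} {percentages[group]:>5.1f}")
```

The percentages are so similar across age groups that the table conveys the message at least as well as a graph.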

12.4 One qualitative variable and one quantitative variable

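Relationships between one qualitative variable and one quantitative variable can be displayed using:

  • Back-to-back stem-and-leaf plots (Sect. 12.4.1).
  • 2-D dot charts (Sect. 12.4.2).
  • Boxplots (Sect. 12.4.3).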

12.4.1 Back-to-back stem-and-leaf

Back-to-back stem-and-leaf plots are essentially two stem-and-leaf plots (Sect. 12.2.1) sharing the same stems; one group has the leaves going left-to-right from the stem, and the second group has the leaves going right-to-left from the stem. Back-to-back stem-and-leaf plots can only be used when two groups are being compared.

Example 12.13 (Back-to-back stem-and-leaf plots) A study of krill285 produced the observations shown in Table 12.2. A back-to-back stem-and-leaf plot of these data makes it easy to compare the two groups visually (Fig. 12.14).

The plot for the Treatment group goes from right-to-left, and the plot for the Control group goes from left-to-right, sharing the same stems. The treatment group tends to produce more eggs, in general.

TABLE 12.2: The number of eggs laid by krill, for those in a treatment group and for those in a control group
Treatment group    Control group
 0    18            0     2
 0    21            0     3
 1    26            0     8
 1    30            0    16
 3    35            1    20
 8    48            1    26
 8    50            1    31
12                  2

FIGURE 12.14: The number of eggs from krill, for control and treatment groups
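
A back-to-back stem-and-leaf plot like this can be sketched in Python using the data in Table 12.2. The function name `back_to_back` is hypothetical, and the sketch assumes whole-number data with the tens as stems and the units as leaves.

```python
def back_to_back(left, right, unit=10):
    """Rows of a back-to-back stem-and-leaf plot for whole-number data."""
    top_stem = max(left + right) // unit
    left_rows, right_rows = [], []
    for stem in range(top_stem + 1):
        # Left-hand leaves are read right-to-left, so list them in decreasing order
        l = sorted((v % unit for v in left if v // unit == stem), reverse=True)
        r = sorted(v % unit for v in right if v // unit == stem)
        left_rows.append(" ".join(map(str, l)))
        right_rows.append(" ".join(map(str, r)))
    width = max(len(row) for row in left_rows)   # right-align the left-hand leaves
    return [f"{l:>{width}} | {stem} | {r}".rstrip()
            for stem, (l, r) in enumerate(zip(left_rows, right_rows))]

# Data from Table 12.2
treatment = [0, 0, 1, 1, 3, 8, 8, 12, 18, 21, 26, 30, 35, 48, 50]
control = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 8, 16, 20, 26, 31]

rows = back_to_back(treatment, control)
print("Key: 1 | 2 means 12 eggs (Treatment on the left, Control on the right)")
for row in rows:
    print(row)
```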

12.4.2 2-D dot charts

A 2-D dot chart places a dot for each observation, but separated for each level of the qualitative variable (also see Sect. 12.3.1). For the same krill data used in Example 12.13, a dot chart is shown in Fig. 12.15.

Many observations are the same, so some points would be overplotted if points were not stacked (top panel). Another way to avoid overplotting is to add a bit of randomness (called a 'jitter') in the vertical direction to the points before plotting (bottom panel).

FIGURE 12.15: Two variations of a 2-D dot chart for the krill-egg data: stacking and jittering
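
Both ways of avoiding overplotting amount to computing a small vertical offset for each point. The sketch below shows one possible approach for the Control group data in Table 12.2; the function names, the jitter amount and the random seed are all assumptions made for illustration.

```python
import random

def stack_offsets(values):
    """Offset for each point so that repeated values stack: 0, 1, 2, ..."""
    seen = {}
    offsets = []
    for v in values:
        offsets.append(seen.get(v, 0))   # how many times v has already appeared
        seen[v] = seen.get(v, 0) + 1
    return offsets

def jitter_offsets(n, amount=0.2, seed=1):
    """Small random vertical offsets, each between -amount and +amount."""
    rng = random.Random(seed)            # seeded so the plot is reproducible
    return [rng.uniform(-amount, amount) for _ in range(n)]

# Control group data from Table 12.2
control = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 8, 16, 20, 26, 31]

stacked = stack_offsets(control)
jittered = jitter_offsets(len(control))

for eggs, s, j in zip(control, stacked, jittered):
    print(f"eggs = {eggs:>2}: stack offset {s}, jitter offset {j:+.3f}")
```

In both cases the number of eggs (the horizontal position) is unchanged; only the vertical position is altered so that equal values can be told apart.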

12.4.3 Boxplots

Understanding boxplots takes some explanation, and so boxplots will be discussed again later (Sect. 13.3.3). For the same krill data used in Example 12.13, a boxplot is shown in Fig. 12.16.

FIGURE 12.16: A boxplot for the krill-egg data

To explain boxplots, first focus on just one boxplot from Fig. 12.16: the boxplot for the Treatment group. Boxplots have five horizontal lines; from the top to the bottom of the plot (Fig. 12.17):

  • Top line: The largest number of eggs is 50: This is the line at the top of the boxplot.
  • Second line from the top: About 75% of the observations are smaller than about 28, and this is represented by the line at the top of the central box. This is called the third quartile, or \(Q_3\).
  • Middle line: About 50% of the observations are smaller than about 12, and this is represented by the line in the centre of the central box. This is an 'average' value for the data, or the second quartile, or \(Q_2\).
  • Second line from the bottom: About 25% of the observations are smaller than about 2, and this is represented by the line at the bottom of the central box. This is called the first quartile, or \(Q_1\).
  • Bottom line: The smallest number of eggs is 0. This is the line at the bottom of the boxplot.

FIGURE 12.17: A boxplot for the krill-egg data; the boxplot and dotplot just for the treatment group

However, the box for the krill in the Control group is slightly different (Fig. 12.16): One observation is identified with a point, above the top line. Computer software has identified this observation as potentially unusual (in this case, unusually large), and so has plotted this point separately. (Unusually large or small observations are called outliers.)

The values of the quartiles (\(Q_1\), \(Q_2\) and \(Q_3\)) are computed as usual.

So, for the Control data, the largest observation (31 eggs) is deemed unusually large (using arbitrary rules explained in Sect. 13.5.3). Then the boxplot is constructed like this:

  • The largest number of eggs (excluding the outlier of 31 eggs) is about 26: This is the line at the top of the boxplot.

  • 75% of the observations (including the 31 eggs) are smaller than about 12, and this is represented by the line at the top of the central box. This is called the third quartile, or \(Q_3\).

  • 50% of the observations (including the 31 eggs) are smaller than about 2, and this is represented by the line in the centre of the central box. This is an 'average' value for the data, the second quartile, or \(Q_2\).

  • 25% of the observations (including the 31 eggs) are smaller than about 0.5, and this is represented by the line at the bottom of the central box. This is called the first quartile, or \(Q_1\).

    Clearly we cannot have 0.5 eggs, but with 15 observations it is not possible to exactly determine the value for which 25% of observations are smaller. Software uses approximations to compute these values. (Different software may use different rules.)

  • The smallest number of eggs is 0. This is the line at the bottom of the boxplot.

FIGURE 12.18: A boxplot for the krill-egg data; the boxplot just for the control group
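
The values used to draw the two boxplots can be reproduced, approximately, using Python's standard library. This is only a sketch under two assumptions: the quartiles use the 'inclusive' interpolation rule (one of several rules that software may use), and outliers are identified using the common '1.5 times the IQR' convention, which may differ from the rules explained in Sect. 13.5.3. The function name `boxplot_summary` is hypothetical.

```python
from statistics import quantiles

def boxplot_summary(values, k=1.5):
    """Five lines of a boxplot, plus any points plotted separately as outliers."""
    data = sorted(values)
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr   # assumed outlier rule
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {"lower": inside[0], "q1": q1, "q2": q2, "q3": q3,
            "upper": inside[-1], "outliers": outliers}

# Data from Table 12.2
treatment = [0, 0, 1, 1, 3, 8, 8, 12, 18, 21, 26, 30, 35, 48, 50]
control = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 8, 16, 20, 26, 31]

print("Treatment:", boxplot_summary(treatment))
print("Control:  ", boxplot_summary(control))
```

Under these assumptions, the Treatment quartiles are 2, 12 and 28 with no outliers, while the Control quartiles are 0.5, 2 and 12, with 31 eggs flagged as an outlier and 26 as the top line, in agreement with the descriptions above.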

Example 12.14 (Boxplots explained) The NHANES study collects large amounts of information from about 10,000 Americans each year (Sect. 12.9). Consider the boxplot of the age of these Americans.

From the boxplot of the ages of the Americans in the sample, the "average" age of the subjects is about 38 years, and the ages range from almost zero to about 80 years of age.

Example 12.15 (Boxplots) Boxplots can be plotted horizontally too, which leaves room for long labels. In Fig. 12.19 (based on Emmanuel João Nogueira Leal Silva et al.286), the three cements are quite different regarding their push-out forces.

FIGURE 12.19: Comparing three push-out values for three cements

Example 12.16 (Boxplots) A study of different engineering project delivery methods287 produced the boxplot in Fig. 12.20: the increase in the costs of projects seems to differ between the two methods.

The DB (Design/Build) method produces a smaller project cost growth on average (the centre line of the boxplot), but the DBB (Design/Bid/Build) method produces more variation in project cost growth. Notice the presence of outliers for both methods, as indicated by the dots.

FIGURE 12.20: Comparing two engineering project delivery methods

12.5 Two quantitative variables

Scatterplots display the relationship between two quantitative variables. Conventionally, the "response" variable is on the vertical axis, and the "explanatory" variable is on the horizontal axis.

As with any graph, explaining what the graph tells us is important, because the purpose of a graph is to display the information in the clearest, simplest possible way, to help the reader understand the message(s) in the data. Scatterplots can be described by briefly explaining how the variables are related to each other. Scatterplots are studied again later (Sect. 12.5).

Example 12.17 (Scatterplots) A study of mdx mice (mice with muscular dystrophy)288 recorded the lung weight of mice at various ages. The scatterplot in Fig. 12.21 shows that the average lung weight increases, then declines after about 50 weeks of age; a lot of variation exists within mice of similar ages.

FIGURE 12.21: Scatterplot of lung weight vs age for mice with muscular dystrophy

Example 12.18 (Scatterplots) A study of athletes at the Australian Institute of Sport (AIS) measured numerous physical and blood measurements from high performance athletes.289

Many relationships were of interest. Fig. 12.22 shows the relationship between the sum of skin folds (SSF) and percentage body fat.

Each point represents the percentage body fat and the SSF for one athlete. Clearly, the greater the percentage body fat, the greater the sum of skin folds, in general.

FIGURE 12.22: Scatterplot of SSF against percentage body fat for male AIS athletes

12.6 Two qualitative variables

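The relationship between two qualitative variables can be explored using:

  • Stacked bar charts (Sect. 12.6.1).
  • Side-by-side bar charts (Sect. 12.6.2).
  • Dot charts (Sect. 12.6.3).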

Many variations of these graphs are possible.

As an example, a study of road kill290 produced the data in Table 12.3. There are two qualitative variables: the season (ordinal, with four levels) and the sex (nominal, with three levels including 'Unknown').

TABLE 12.3: The number of possums found as road kill, by sex and season
Season   Unknown    M    F
Autumn        75   25   21
Winter        74   27   22
Spring        71   10   18
Summer        58   10   12

12.6.1 Stacked bar charts

The data can be graphed by using a bar for each season, stacking the bars by sex on top of each other, within each season (Fig. 12.23).

FIGURE 12.23: The number of possums found as road kill, by sex and season

12.6.2 Side-by-side bar charts

Instead of stacking the bars within each season on top of each other, the bars can be placed side-by-side within each season (Fig. 12.24).

FIGURE 12.24: The number of possums found as road kill, by sex and season

12.6.3 Dot charts

Instead of bars, dots (or other symbols) can be used in place of a side-by-side bar chart (Fig. 12.25).

FIGURE 12.25: A dot chart of the number of possums found as road kill, by sex and season

12.6.4 Other variations

Many variations of these bar charts are possible. We can choose:

  • horizontal or vertical bars;
  • percentages or counts;
  • stacked bar charts, side-by-side bar charts, or dot charts;
  • either the sex of the possum or the season as the first division of the data.

Some of these variations are shown in Fig. 12.26. Another display is to construct a two-way table, of either counts (Table 12.3) or percentages (Table 12.4).

FIGURE 12.26: Three graphs: The number or percentage of possums found as road kill, by sex and season

TABLE 12.4: The percentages of possums found as road kill by sex, within each season (rows sum to 100%)
Season   Unknown      M      F
Aut.        62.0   20.7   17.4
Wint.       60.2   22.0   17.9
Spr.        71.7   10.1   18.2
Sum.        72.5   12.5   15.0
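
The row percentages in Table 12.4 come from dividing each count in Table 12.3 by its row (season) total. A minimal sketch, where the function name `row_percentages` is chosen for illustration:

```python
# Counts from Table 12.3
road_kill = {
    "Autumn": {"Unknown": 75, "M": 25, "F": 21},
    "Winter": {"Unknown": 74, "M": 27, "F": 22},
    "Spring": {"Unknown": 71, "M": 10, "F": 18},
    "Summer": {"Unknown": 58, "M": 10, "F": 12},
}

def row_percentages(table):
    """Convert each row of counts to percentages of that row's total."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: round(100 * n / total, 1) for col, n in cells.items()}
    return result

percentages = row_percentages(road_kill)
for season, cells in percentages.items():
    print(season, cells)
```

Dividing by the column totals instead would give the percentage of each sex found in each season, which answers a different question.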

Of all these displays, which one do you think best communicates the message in the data? (Indeed, what is the main message that you would like to get across?)

12.7 Other types of graphs

The graphs studied so far suit specific types of data, and usually the best plot is one of those just described. Sometimes, however, the best plot is something different, perhaps even unique. Always remember the driving principle:

The purpose of a graph is to display the information in the clearest, simplest possible way, to help the reader understand the message(s) in the data.

Importantly, a graph is needed that helps answer the research question. In this section, graphs for some other situations are discussed:

  • Geographic plots: Useful when the RQ is about comparing geographical regions.
  • Case-profile plots: Useful when the same units are measured over a small number of time points, or are otherwise connected.
  • Histogram of differences: Useful when the same units are measured over two time points, or are otherwise connected.
  • Time plots: Useful when the units are measured over a large number of time points.

12.7.1 Geographic plots

When data are summarised over a geographic area, plotting accordingly can be useful.

Example 12.19 (Geographic plots) A study examining lower-limb amputation incidence in Australia (based on Michael P. Dillon et al.291) produced the graphs in Figs. 12.27 and 12.28.

Clearly, the incidence of amputation is higher in the Northern Territory than other parts of Australia for both females and males; furthermore, males have higher incidence of amputation than females in every state/territory.

FIGURE 12.27: Age-adjusted incidence of lower limb amputations in Australia, from August 2007 to December 2011: females. Numbers are incidents per 100 000.

FIGURE 12.28: Age-adjusted incidence of lower limb amputations in Australia, from August 2007 to December 2011: males. Numbers are incidents per 100 000.

12.7.2 Case-profile plots

Sometimes the same variable is measured on each unit of analysis more than once (i.e., many observations per unit of analysis) but only a small number of times. Then case-profile plots can be used: plots that show how the response variable changes for each unit of analysis. Examples of this type of data include:

  • Measurements of household water consumption before and after installing water-saving devices, for many households.
  • Blood pressure measurements taken from left arms and right arms, for many people.

In both cases, the data are paired (two observations per unit of analysis) as each unit of analysis gets a pair of similar measurements.

Example 12.20 (Case-profile plots) A study of children with atopic asthma292 included the graph in Fig. 12.29. Each line on the graph represents one person.

FIGURE 12.29: A case-profile plot. Each line represents one subject, joining that person's pre-intervention score to their post-intervention score

12.7.3 Histogram of differences

An alternative way to present paired data (two observations made for each unit of analysis) is to produce a histogram of the changes for each individual. This may be easier to produce in software than a case-profile plot.

Consider the person in the case-profile plot whose line is at the top of the plot in Fig. 12.29. Their 'pre' IgE level is about 5500 micrograms/L, and their 'post' IgE level is about 4500 micrograms/L, which is a change (or more descriptively, a reduction) of about 1000 micrograms/L. These reductions could be computed for each person (Table 12.5).

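TABLE 12.5: The IgE before and after an intervention, and the change in IgE (in micrograms/L)

| IgE (before) | IgE (after) | IgE reduction |
|-------------:|------------:|--------------:|
|           83 |          83 |             0 |
|          292 |         292 |             0 |
|          293 |         292 |             1 |
|          623 |         542 |            81 |
|          792 |         709 |            83 |
|         1543 |        1000 |           543 |
|         1668 |        1000 |           668 |
|         1960 |        1626 |           334 |
|         2877 |        2502 |           375 |
|         2961 |        2711 |           250 |
|         5504 |        4504 |          1000 |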

Then a histogram can be constructed based on these reductions in IgE, with one reduction for each person (Fig. 12.30).

FIGURE 12.30: An alternative to a case-profile plot: A histogram of the differences
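The calculation behind a histogram of differences can be sketched in a few lines of Python. This is only an illustrative sketch: the data are the values from Table 12.5, but the bin width of 250 micrograms/L is an assumption made here, and may not match the bins used in Fig. 12.30.

```python
from collections import Counter

# (before, after) IgE pairs from Table 12.5, in micrograms/L
ige = [
    (83, 83), (292, 292), (293, 292), (623, 542), (792, 709),
    (1543, 1000), (1668, 1000), (1960, 1626), (2877, 2502),
    (2961, 2711), (5504, 4504),
]

# One reduction per person: 'before' minus 'after'
reductions = [before - after for before, after in ige]

# Assumed bin width (not taken from the figure); each bin is [lower, lower + 250)
BIN_WIDTH = 250
counts = Counter((r // BIN_WIDTH) * BIN_WIDTH for r in reductions)

# A text version of the histogram: one '*' per person in each bin
for lower in range(0, max(reductions) + BIN_WIDTH, BIN_WIDTH):
    print(f"{lower:>4} to <{lower + BIN_WIDTH:<4} | {'*' * counts[lower]}")
```

Each person contributes exactly one value (their reduction) to the histogram, so the pairing in the data is respected even though the individual 'before' and 'after' values no longer appear.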

Example 12.21 (Graphing paired data) The Electricity Council in Bristol wanted to determine if a certain type of wall-cavity insulation was effective in reducing energy consumption in winter.293 Their RQ was:

Is there an average reduction in energy consumption due to adding insulation?

The data collected, shown below, can be graphed using a case-profile plot (Fig. 12.31, top panel): A dashed line has been used to show an increase in energy usage, and a solid line for houses where energy was saved after installing insulation. (Again, this is encoding extra information.)

For these data, finding the difference in energy consumption for each house seems sensible. The same unit of analysis is measured twice on the same variable: energy consumption before adding insulation and then after adding insulation. The difference in energy consumption (or the energy saving more specifically) for each house can be computed, then graphed using a histogram, bar chart, stemplot, or dot chart (Fig. 12.31, bottom panel).

FIGURE 12.31: Two plots of the energy consumption data. The dotted line in the top panel shows the one home where energy consumption increased.

Example 12.22 (Graphing paired data) A study294 examined the average number of bacteria on birthday cakes before and after blowing out the candles.

This could be studied by taking two measurements from each cake: one before and one after blowing out the candles. The change in the number of bacteria could be computed for each cake, and a histogram of the differences plotted.

12.7.4 Time plots

Sometimes, a variable is measured over many time points. Then a time plot can be used, which shows how the variable changes over time.

Example 12.23 (Time plots) The baby-birth data (in Sect. 12.2.1) recorded the time of each birth. A time plot shows how the weights varied over time (Fig. 12.32).

FIGURE 12.32: The weight of babies born on 21 December 1997 at a Brisbane hospital, plotted over time

Example 12.24 (Time plots) A study of the number of lynx trapped in the Mackenzie River area of Canada295 each year from 1821 to 1934 produced the data shown in Fig. 12.33. A regular cycle is apparent, with the number trapped becoming very large at regular intervals.

FIGURE 12.33: The number of lynx trapped in the Mackenzie River district of north-west Canada, from 1821 to 1934

12.8 Notes on constructing graphs

12.8.1 Comparing 2-D and 3-D graphs

Always remember the purpose of a graph: to display the information in the simplest and clearest possible way, to help the reader understand the message(s) in the data.

Instead of aiming to communicate information, sometimes graphs are prepared to look fancy or clever, or to show off the researcher's computing skills.

One way that people try to be fancy is to use a third dimension when it is not needed. This is a bad idea: the resulting graphs can be misleading and hard to read.296

In the NHANES study,297 the age and sex of each participant were recorded.

Using Fig. 12.34, can you easily determine if more females or more males are in each age group?

FIGURE 12.34: A three-dimensional bar chart of the NHANES data

The artificial third dimension makes quickly determining the heights of the bars hard. In contrast, a 2-D graph (a side-by-side bar graph; Fig. 12.35) makes it clear whether each age group has more females or more males.

FIGURE 12.35: A side-by-side bar chart of the NHANES data

12.8.2 Further comments

Always remember:

The purpose of a graph is to display the information in the clearest, simplest possible way, to help the reader understand the message(s) in the data.

Helping readers to understand the data is the essence of producing a good graph:

Data graphics should draw the viewer's attention to the sense and substance of the data, not to something else [...] essentially statistical graphics are instruments to help people reason about quantitative information.

--- Edward R Tufte et al.298, p. 91

You should be able to construct the graphics by hand, but we will generally use software (e.g., jamovi or SPSS). Importantly, given the purpose of a graph, what the graph communicates should be explained: Graphs need to be clear and well-labelled with captions.

Ensure that you:

  • Do not add artificial third dimensions, or other 'chart junk'.299
  • Do not add optical illusions.
  • Do not make any errors.

Ensure that you:

  • Do add units of measurement or value labels where appropriate.
  • Do add titles and axis labels.
  • Do ensure the axis scales are appropriate.
  • Do add any necessary explanations.
  • Do make it easy for the reader to be able to consider the RQ (for example, to easily make the comparison of greatest interest).

Example 12.25 (Truncating bar charts) One optical illusion often appearing in graphs is when the frequency (or percentage) axis on a bar chart is truncated so that it doesn't start at zero.

For example, consider data recording the number of lung cancer cases in various Danish cities.300

The animation shows a bar chart with the count (vertical axis) starting at zero (when the counts in each age group look similar), and then gradually changing where the vertical axis starts, so that the final bar chart makes the number of cases in each age group look very different. The graph is misleading when the vertical axis does not start at a count of zero.
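The size of this illusion can be quantified with a small sketch in Python. The counts used here are hypothetical (they are not the Danish lung cancer data), and the function name `drawn_height_ratio` is introduced only for illustration. The sketch computes how many times taller one bar is drawn than another, depending on where the count axis starts.

```python
def drawn_height_ratio(count_a, count_b, axis_start=0):
    """Ratio of the drawn bar heights when the count axis begins at axis_start."""
    if axis_start >= min(count_a, count_b):
        raise ValueError("axis_start must be below both counts")
    # A bar is drawn from axis_start up to its count, so its drawn height
    # is (count - axis_start), not the count itself
    return (count_a - axis_start) / (count_b - axis_start)


# Hypothetical counts for two groups: the true ratio is 12/10 = 1.2
count_a, count_b = 12, 10

for start in (0, 5, 8, 9):
    ratio = drawn_height_ratio(count_a, count_b, axis_start=start)
    print(f"Axis starts at {start}: first bar drawn {ratio:.1f} times as tall")
```

Only when the axis starts at zero does the ratio of the drawn heights equal the ratio of the counts. As the start of the axis moves closer to the smaller count, the same two counts are drawn as increasingly different.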

12.9 Case Study: The NHANES data

To demonstrate how graphs can help answer RQs, consider the following RQ:

Among Americans, is the average direct HDL cholesterol different for current smokers and non-smokers?

From the RQ, the Population is 'Americans', the Outcome is the 'average direct HDL cholesterol levels', and the Comparison is 'Between those who currently do and do not smoke'.

There is no intervention; this is a relational RQ that can be answered using an observational study.

As with any study, managing confounding should be considered, by thinking about possible extraneous variables that could be measured or observed.

Can you think of any possible extraneous variables that have the potential to be confounding variables?

Maybe weight... Weight may be related to both the HDL cholesterol concentration and the smoking status.

In this study, clearly we cannot collect primary data. However, data to answer this (and many other) RQs are obtained from the American National Health and Nutrition Examination Survey (NHANES).301 From the NHANES webpage, this survey:

... examines a nationally representative sample of about 5,000 persons each year... located in counties across the country, 15 of which are visited each year.

--- NHANES webpage (emphasis added)

The data available are equivalent to "a simple random sample from the American population".302 In total, 10,000 observations on scores of variables are available (from the 2009/2010 and the 2011/2012 surveys). Fig. 12.36 shows the first 5000 observations on the first 40 variables only.