12 Graphical summaries
So far, you have learnt to ask a RQ, identify different ways of obtaining data, design the study, collect the data and describe the data.
In this chapter, you will learn to graph the data, so we can understand the data used to answer the RQ. You will learn to:
 select the appropriate graphic to graphically summarise data.
 graphically summarise data using quality, appropriate graphs.
 interpret graphs.
 identify badly prepared graphics, giving reasons.
12.1 Introduction
To answer a RQ, a study is designed to collect data, because data are the means by which the research question is answered. But before analysis, understanding, describing and summarising the collected data is important. This chapter discusses the use of graphs to summarise data.
The purpose of a graph is to display the information in the clearest, simplest possible way, to help the reader understand the message(s) in the data.
Graphs are not produced just for the sake of it; what the graph tells us should always be explained, especially regarding what it reveals about answering the RQ.
12.2 One quantitative variable
For quantitative data, a graph shows the distribution of the data:
::: {.definition #Distribution name = "Distribution"} The distribution of a quantitative variable describes what values are present in the data, and how often those values appear. :::
The graphs discussed in this section usually work best for continuous quantitative data, but may also be useful for discrete quantitative data if many possible values are present. Sometimes, discrete quantitative data with very few values are best graphed using the graphs discussed in Sect. 12.3.
Three different types of graphs can be used to show how the values of one quantitative variable are distributed:
 Stemplot or stem-and-leaf plot: Best for small amounts of data; useful only in some cases.
 Dot chart: Best for small amounts of data; good for moderate amounts of data.
 Histogram: Best for moderate to large amounts of data.
Whatever graph is used, what the graph shows should be described.
12.2.1 Stem-and-leaf plots
Stem-and-leaf plots (or stemplots) are best described and explained using an example, so consider the data in Fig. 12.1.
The data give the weights (in kg) of babies born in a Brisbane hospital on one day.^{269}
The data set also includes the gender of each baby, and the number of minutes after midnight that each birth occurred.
The data are given in the order in which the births occurred.
For these data, the weights (quantitative) are to one decimal place of a kilogram. In a stemplot, part of each number is placed to the left (the stem) of a vertical line, and the rest of each number to the right (the leaf).
Here, the whole number of kilograms is placed to the left (as a stem), and the first decimal place is placed on the right (as a leaf). The animation below shows how the stemplot is constructed. From this plot, most birthweights are seen to be 3-point-something kilograms.
For stem-and-leaf plots:
 Place the larger unit (e.g., kilograms) on the left (stems).
 Place the next smallest unit (e.g., first decimal place of a kilogram) on the right (leaves).
 Some data do not work well with stem-and-leaf plots.
 Sometimes, data might need to be suitably rounded before creating the stem-and-leaf plot.
 The numbers in each row should be evenly spaced, so that the numbers in the columns are under each other. This allows patterns to be seen.
 Within each row, the observations are ordered on each stem so patterns can be seen.
 Add an explanation for reading the stem-and-leaf plot. For example, the caption for the stem-and-leaf plot for the baby-birth data in Sect. 12.2.1 says '2 | 6 means 2.6kg', which explains how to read the plot. Without this explanation, '2 | 6' could mean 26kg, or 0.26kg.
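The construction rules above can be sketched in a few lines of Python. This is only an illustration: the function name `stemplot` and the weights are hypothetical (they are not the baby-birth data), and the sketch assumes values recorded to one decimal place.

```python
from collections import defaultdict


def stemplot(values):
    """Return stem-and-leaf rows for values recorded to one decimal place."""
    leaves = defaultdict(list)
    for v in values:
        # 2.6 becomes 26 tenths: stem 2 (whole kilograms), leaf 6 (first decimal)
        stem, leaf = divmod(round(v * 10), 10)
        leaves[stem].append(leaf)
    rows = []
    # Include every stem between the smallest and largest, even if it has no leaves
    for stem in range(min(leaves), max(leaves) + 1):
        ordered = " ".join(str(leaf) for leaf in sorted(leaves[stem]))
        rows.append(f"{stem} | {ordered}".rstrip())
    return rows


# Hypothetical weights (in kg), for illustration only
weights = [3.1, 2.6, 3.4, 1.9, 3.8, 3.4, 4.1, 3.0]
for row in stemplot(weights):
    print(row)
print("Key: 2 | 6 means 2.6kg")
```

The leaves are sorted within each row, and the printed key plays the role of the explanation that every stem-and-leaf plot needs.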
Example 12.1 (Stem-and-leaf plots) A study of krill^{270} produced 15 measurements of the number of eggs. The stemplot shows that the number of eggs is usually under 10, but occasionally a large number of eggs are seen.
The animation below shows how the stemplot is constructed.
12.2.2 Dot charts (quantitative data)
Dot charts show the original data on a single axis, with each observation represented by a dot.
Example 12.2 (Dot charts) A study examined the serving size of fries at McDonald's,^{271} and produced the dot chart in Fig. 12.2 (based on Wetzel^{272}, Fig. 2).
The mass of fries is almost always under the target, and often substantially so. An alternative way to look at these data is to measure the percentage that each serving is in relation to the target serving (Fig. 12.3), where 100% means the serving size was exactly the target weight.
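The conversion to a percentage of the target is simple arithmetic, sketched below. The target mass and the serving masses are hypothetical values chosen for illustration; they are not the study's data.

```python
def percent_of_target(masses, target):
    """Express each serving mass as a percentage of the target mass."""
    return [round(100 * mass / target, 1) for mass in masses]


# Hypothetical values (in grams), for illustration only
target_g = 150
masses_g = [150, 135, 120, 147]

print(percent_of_target(masses_g, target_g))  # [100.0, 90.0, 80.0, 98.0]
```

A value of 100 means the serving exactly met the target; values under 100 are servings below the target.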
Example 12.3 (Dot charts) Consider again the weights (in kg) of babies born in a Brisbane hospital in one day. Again, a dot chart (Fig. 12.4) shows that most babies are between 3 and 4kg.
12.2.3 Histograms
Histograms are a series of boxes, where the width of the box represents a range of values of the variable being graphed, and the height of the box^{273} represents the number (or percentage) of observations within that range of values.
Example 12.4 (Histograms) Consider again the weights (in kg) of babies born in a Brisbane hospital in one day.^{274}
A histogram (below) can be constructed for these data. When an observation occurs on a boundary between the boxes, software usually (but not universally) places it in the higher box (so 2.5kg would be counted in the '2.5 to 3.0kg' box, not the '2.0 to 2.5kg' box). The histogram shows, for example, that 17 babies weighed 3.0kg or more, but under 3.5kg.
The animation below shows how the histogram is constructed.
Example 12.5 (Histograms) A study of 'headache attributed to ingestion or inhalation of a cold stimulus' (HICS), commonly known as a brain freeze from eating cold food (e.g., ice cream) or drinking a cold drink, measured the duration of the brain freeze.^{275}
A histogram of the data (Fig. 12.5, based on Mages et al.^{276}, Figure 2b) shows that 11 people experienced HICS symptoms lasting less than 5 seconds.
In addition, 9 people experienced symptoms for at least 5 but less than 10 seconds, and 1 person experienced symptoms for at least 35 seconds but under 40 seconds.
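The counting behind a histogram, including the common convention of placing a boundary value in the higher box, can be sketched as follows. The function name `histogram_counts` and the weights are hypothetical, and other software may treat boundary values differently.

```python
import math


def histogram_counts(values, start, width, n_bins):
    """Count observations per bin; a value on a boundary goes in the higher bin."""
    counts = [0] * n_bins
    for v in values:
        # floor() sends a boundary value such as 2.5 to the bin that starts at 2.5;
        # round() guards against tiny floating-point errors in the division
        index = math.floor(round((v - start) / width, 9))
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Hypothetical weights (in kg), for illustration only
weights = [1.9, 2.5, 2.6, 3.0, 3.1, 3.4, 3.4, 3.5, 3.8, 4.1]

# Bins: 1.5 to 2.0, 2.0 to 2.5, 2.5 to 3.0, 3.0 to 3.5, 3.5 to 4.0, 4.0 to 4.5
print(histogram_counts(weights, start=1.5, width=0.5, n_bins=6))
```

Here the weight of 2.5kg is counted in the '2.5 to 3.0kg' bin, and the weight of 3.5kg in the '3.5 to 4.0kg' bin.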
12.2.4 Describing the distribution
Graphs are constructed to help us understand the data. After producing a graph for one quantitative variable, then, we need to summarise what we learn. For a single quantitative variable, describe:
 Average: What is an "average" or typical value?
 Variation: How much variation is present in the bulk of the data?
 Shape: How are the values distributed? That is, are most of the values smaller values, or larger values, or about evenly distributed between smaller and larger values?
 Outliers (observations unusually large or small) or unusual features: Are there any unusual observations, or anything else of interest?
Describing the shape can be tricky, but terminology may help:
 Skewed right: the bulk of the data is smaller, but there are some larger values (to the right).
 Skewed left: the bulk of the data is larger, but there are some smaller values (to the left).
 Symmetric data (and perhaps bell-shaped): There are approximately equal numbers of values that are smaller and larger.
 Bimodal data: There are two peaks in the distribution.
Typical shapes are shown in the carousel below (click the left and right arrows to move through the example plots). Sometimes, no short description is suitable.
Example 12.6 (Bimodal data) The Old Faithful geyser in Yellowstone National Park (USA) erupts regularly.^{277}
The time between eruptions (Fig. 12.7) is clearly bimodal, with a peak near 55 minutes and another near 80 minutes.
The number of bins used in the histogram can change the impression of the distribution. Using software, it is usually possible to try different bin sizes to find a bin size that suitably displays the overall distribution.
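The effect of the bin size can be seen without drawing anything, just by counting. In this sketch the waiting times are hypothetical (they are not the Old Faithful data), and the function name `bin_counts` is illustrative.

```python
def bin_counts(values, start, width, n_bins):
    """Count how many values fall in each of n_bins bins of equal width."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Hypothetical waiting times (in minutes), for illustration only
waits = [48, 52, 54, 55, 56, 58, 62, 71, 76, 78, 80, 81, 83, 86]

print(bin_counts(waits, start=40, width=25, n_bins=2))  # [7, 7]
print(bin_counts(waits, start=40, width=10, n_bins=5))  # [1, 5, 1, 3, 4]
```

With two wide bins the distribution looks flat; with five narrower bins, two separate peaks appear. Too many bins can have the opposite problem, leaving most bins with only one or two observations.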
Example 12.7 (Describing quantitative data) For the baby-weight data displayed in, for example, Fig. 12.4:
 The average weight is somewhere between 3 and 3.5 kilograms.
 The weights vary between about 1.5 and 4.5 kilograms.
 The shape is slightly skewed to the left. That is, occasional small birth weights appear (probably premature babies).
 There do not appear to be any outliers or anything unusual.
Describe the histogram in Example 12.5, the brain freeze durations.
 Average: Hard to be sure... maybe between 10 and 15 seconds. (More observations appear at the smaller values, as the bars there are higher.)
 Variation: From about 0 to about 40.
 Shape: Slightly skewed right.
 Outliers: Probably none. The observation between 35 and 40 may be an outlier, but I suspect it is not, as a larger sample may very well have observations between 30 and 35. Of course, I could be wrong.
12.3 One qualitative variable
For qualitative data, graphs show how often each level of the variable occurs in the data. The three options for graphing qualitative data are:
 Dot chart: Usually a good choice.
 Bar chart: Usually a good choice.
 Pie chart: Only useful in special circumstances, and can be harder to interpret.
Comparing these graphs is useful too; indeed, sometimes a graph may not even be needed.
For nominal data, the order in which the levels of the variables appear is unimportant, so categories could be ordered alphabetically, by size, by personal preference, or any other way. Since you have a choice, think about the order that is most useful to readers. For ordinal data, the natural order of the levels should almost always be used.
Sometimes these graphs are also used for discrete quantitative data with a small number of possible options.
12.3.1 Dot charts (qualitative data)
Dot charts indicate the counts (or the corresponding percentages) in each level, using dots on a line starting at zero. The levels can be on the horizontal or vertical axis; placing the level names on the vertical axis often makes for easier reading, and room for long labels.
Example 12.8 (Dot charts) A study of spider monkeys^{278} examined the social groups present in a sample. A dot chart (Fig. 12.9) shows that the most common social group has many females plus offspring (with almost 50 social groups).
12.3.2 Bar charts
Bar charts indicate the counts in each category using a bar starting from zero. As with dot charts, the levels can be on the horizontal or vertical axis, but placing the level names on the vertical axis often makes for easier reading, and room for long labels.
Example 12.9 (Bar charts) In a study of functional independence,^{279} the types of diagnosis were graphed using a bar chart (Fig. 12.10). For example, two people in the sample have cerebral palsy.
The reason for the different coloured bars becomes apparent in Sect. 12.3.3.
For bar charts and dot charts:
 Place the qualitative variable on the horizontal or vertical axis (and label with the levels of the variable).
 Use counts or percentages on the other axis.
 For nominal data, dots and bars can be ordered any way: Think about the most helpful order.
 Bars have gaps between them, as the bars represent distinct categories. In contrast, the bars in histograms are butted together (except when an interval has a count of zero), as the bars represent a numerical scale.
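As a rough sketch of these guidelines, the following code counts the observations in each level and prints a text bar chart, with the level names on the vertical axis and every bar starting from zero. The function name `text_bar_chart` and the responses are hypothetical.

```python
from collections import Counter


def text_bar_chart(observations):
    """Return one row per level (most common first), each bar starting at zero."""
    counts = Counter(observations)
    width = max(len(level) for level in counts)
    # Level names are padded to a common width so that all bars start in the same column
    return [f"{level:<{width}} | {'#' * n} {n}" for level, n in counts.most_common()]


# Hypothetical nominal data, for illustration only
responses = ["Bus", "Car", "Car", "Walk", "Car", "Bus"]
for row in text_bar_chart(responses):
    print(row)
```

Because these data are nominal, the levels can be placed in any helpful order; here they are ordered from most to least common.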
12.3.3 Pie charts
In pie charts, a circle is divided into segments proportional to the number in each level of the qualitative variable.
Example 12.10 (Pie charts) In a study of functional independence,^{280} the severity of the diagnoses was graphed using a pie chart (Fig. 12.11). This picture actually conveys one thing only ("69% of patients had a less severe injury"), so a graph of any kind is probably unnecessary.
The pie chart colours explain the colours used in the bar chart in Example 12.9. This is called encoding extra information into the bar chart.
Pie charts present challenges:
 Pie charts only work when graphing parts of a whole.
 Pie charts only work when all options are present ('exhaustive').
 Pie charts only work when each unit can appear in just one group ('mutually exclusive').
 Pie charts are difficult to use when levels with zero counts, or small counts, are present.
 Pie charts are difficult to read when many categories are present.
 Pie charts are hard to read: In general, human brains are better at comparing lengths (as used in bar and dot charts) than comparing angles (as used in pie charts).^{281}
In which of these situations is a pie chart appropriate?
 The percentage of people who use these web browsers: Firefox, Chrome, and Safari.
 For each state of Australia, the percentage of people who own an iPhone.
 The percentage of students awarded different grades in this course last semester.
 A pie chart is not suitable. Each individual (person) has information recorded on two qualitative variables: (a) which browser is being asked about (three levels); and (b) whether or not they use that browser ('yes' or 'no'). Also, the three browsers are not mutually exclusive (people can use more than one of these browsers) nor exhaustive (some people may use browsers not listed, such as Edge, Brave, Vivaldi, etc.). For example, the percentages could be that 65% use Firefox, 84% use Chrome, and 20% use Safari. These add to more than 100%.
 A pie chart is not suitable, as the percentages are not parts of a whole. Again, each individual (person) has information recorded on two qualitative variables: (a) which state the person lives in (many levels); and (b) whether or not they own an iPhone ('yes' or 'no'). For example, the percentages could be 53% in Queensland, 61% in NSW, 41% in Victoria, and so on. They could possibly add to more than 100%.
 A pie chart is suitable. Only one qualitative variable is recorded for each individual (person): their grade.
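One quick check on whether percentages could be parts of a whole is to add them up, as in this sketch. The browser percentages are the example figures given above; the grade percentages and the function name `is_parts_of_whole` are hypothetical.

```python
def is_parts_of_whole(percentages, tolerance=0.5):
    """Return True when the percentages sum to 100, allowing for rounding."""
    return abs(sum(percentages) - 100) <= tolerance


# Example figures from the text: people can use more than one browser
browsers = {"Firefox": 65, "Chrome": 84, "Safari": 20}
print(sum(browsers.values()))                # 169
print(is_parts_of_whole(browsers.values()))  # False

# Hypothetical grade percentages: each student receives exactly one grade
grades = {"A": 12, "B": 30, "C": 38, "D": 20}
print(is_parts_of_whole(grades.values()))    # True
```

Adding to 100% is necessary but not sufficient: the levels must also be mutually exclusive and exhaustive, which depends on how the data were collected and cannot be checked by arithmetic alone.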
12.3.4 Comparing pie charts and bar charts
Consider the pie chart (using data in E. B. Andersen^{282}) in the top panel of Fig. 12.12.
The pie chart displays the number of lung cancer deaths in Fredericia between 1968 and 1971 inclusive, for various age groups (qualitative).
A pie chart is appropriate: only one variable is recorded on each individual (the age of each individual person), and the counts are parts of a whole. However, notice that determining which age groups have the most lung cancer deaths is hard.
The equivalent bar chart (lower panel) makes the comparison easy: clearly the age groups '65 to 69' and 'Over 74' have slightly fewer deaths than the other age groups.
Recall that the purpose of a graph is to display information in the clearest, simplest possible way, to help the reader understand the message(s) in the data. Adding an artificial third dimension usually makes the message hard to see;^{283} see Example 12.11.
Example 12.11 (Comparing graphs) In the NHANES study,^{284} the age of each participant was recorded.
Rank the age groups from largest group to smallest group using each graph in Fig. 12.13, all constructed from the same data.
Which graph makes it easiest to compare the sizes of the categories?
12.3.5 Is a graph needed? Tabular summaries
Graphs are generally excellent for summarising data.
However, small amounts of qualitative data may not need a graphical summary; sometimes, the data can be collated in a table (Fig. 12.11), or a tabular summary. With small amounts of information, sometimes just writing the information is better ('69% of diagnoses were less severe').
Example 12.12 (Tables for data) The NHANES age data in Example 12.11 may not need a graphical summary. Fig. 12.13 is a graphical summary of the tabular summary in Table 12.1.
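| Age group | Number | Percentage |
|:----------|-------:|-----------:|
| 0-9       |   1391 |       14.4 |
| 10-19     |   1374 |       14.2 |
| 20-29     |   1356 |       14.0 |
| 30-39     |   1338 |       13.8 |
| 40-49     |   1398 |       14.5 |
| 50-59     |   1304 |       13.5 |
| 60+       |   1506 |       15.6 |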
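The percentages in Table 12.1 follow directly from the counts, as this short sketch shows. The counts are those in the table; the variable names are illustrative.

```python
# Counts from Table 12.1 (NHANES age groups)
counts = {
    "0-9": 1391, "10-19": 1374, "20-29": 1356, "30-39": 1338,
    "40-49": 1398, "50-59": 1304, "60+": 1506,
}

total = sum(counts.values())
percentages = {group: round(100 * n / total, 1) for group, n in counts.items()}

print(total)
for group, pct in percentages.items():
    print(f"{group:>5}: {pct}%")
```

Each percentage is the count for that age group divided by the total number of participants (9667), then multiplied by 100.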
12.4 One qualitative variable and one quantitative variable
Relationships between one qualitative variable and one quantitative variable can be displayed using:
 Back-to-back stem-and-leaf plot: Best for small amounts of data when the qualitative variable only has two levels;
 2D dot chart: Best choice for small to moderate amounts of data;
 Boxplot: Best choice, except for small amounts of data.
12.4.1 Back-to-back stem-and-leaf
Back-to-back stem-and-leaf plots are essentially two stem-and-leaf plots (Sect. 12.2.1) sharing the same stems; one group has the leaves going left-to-right from the stem, and the second group has the leaves going right-to-left from the stem. Back-to-back stem-and-leaf plots can only be used when two groups are being compared.
Example 12.13 (Back-to-back stem-and-leaf plots) A study of krill^{285} produced the observations shown in Table 12.2. A back-to-back stem-and-leaf plot of these data makes it easy to compare the two groups visually (Fig. 12.14).
The plot for the Treatment group goes from right-to-left, and the plot for the Control group goes from left-to-right, sharing the same stems. The Treatment group tends to produce more eggs, in general.
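| Treatment | Treatment (cont.) | Control | Control (cont.) |
|----------:|------------------:|--------:|----------------:|
|         0 |                18 |       0 |               2 |
|         0 |                21 |       0 |               3 |
|         1 |                26 |       0 |               8 |
|         1 |                30 |       0 |              16 |
|         3 |                35 |       1 |              20 |
|         8 |                48 |       1 |              26 |
|         8 |                50 |       1 |              31 |
|        12 |                   |       2 |                 |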
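A back-to-back stem-and-leaf plot of the krill data can be sketched in text form. The values are those of Table 12.2, assuming that the first two columns of the table are the Treatment group and the last two the Control group; the function name `back_to_back` is hypothetical. The tens digit is used as the stem and the units digit as the leaf.

```python
def back_to_back(left, right):
    """Rows of a back-to-back stemplot: tens digit as stem, units digit as leaf."""
    rows = []
    for stem in range(0, max(left + right) // 10 + 1):
        # Left-hand leaves read right-to-left from the stem, so the smallest sits nearest it
        lhs = sorted((v % 10 for v in left if v // 10 == stem), reverse=True)
        rhs = sorted(v % 10 for v in right if v // 10 == stem)
        rows.append((" ".join(map(str, lhs)), stem, " ".join(map(str, rhs))))
    width = max(len(lhs) for lhs, _, _ in rows)
    return [f"{lhs:>{width}} | {stem} | {rhs}".rstrip() for lhs, stem, rhs in rows]


# Number of eggs, as read from Table 12.2 (column assignment is an assumption)
treatment = [0, 0, 1, 1, 3, 8, 8, 12, 18, 21, 26, 30, 35, 48, 50]
control = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 8, 16, 20, 26, 31]

for row in back_to_back(treatment, control):
    print(row)
print("Key: 2 | 6 means 26 eggs; Treatment on the left, Control on the right")
```

The Control leaves pile up on the stem for 0 to 9 eggs, while the Treatment leaves spread over more stems.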
12.4.2 2D dot charts
A 2D dot chart places a dot for each observation, but separated for each level of the qualitative variable (also see Sect. 12.3.1). For the same krill data used in Example 12.13, a dot chart is shown in Fig. 12.15.
Many observations are the same, so some points would be overplotted if points were not stacked (top panel). Another way to avoid overplotting is to add a bit of randomness (called a 'jitter') in the vertical direction to the points before plotting (bottom panel).
12.4.3 Boxplots
Understanding boxplots takes some explanation, and so boxplots will be discussed again later (Sect. 13.3.3). For the same krill data used in Example 12.13, a boxplot is shown in Fig. 12.16.
To explain boxplots, first focus on just one boxplot from Fig. 12.16: the boxplot for the Treatment group. Boxplots have five horizontal lines; from the top to the bottom of the plot (Fig. 12.17):
 Top line: The largest number of eggs is 50: This is the line at the top of the boxplot.
 Second line from the top: About 75% of the observations are smaller than about 28, and this is represented by the line at the top of the central box. This is called the third quartile, or \(Q_3\).
 Middle line: About 50% of the observations are smaller than about 12, and this is represented by the line in the centre of the central box. This is an 'average' value for the data, or the second quartile, or \(Q_2\).
 Second line from the bottom: About 25% of the observations are smaller than about 2, and this is represented by the line at the bottom of the central box. This is called the first quartile, or \(Q_1\).
 Bottom line: The smallest number of eggs is 0. This is the line at the bottom of the boxplot.
However, the box for the krill in the Control group is slightly different (Fig. 12.16): One observation is identified with a point, above the top line. Computer software has identified this observation as potentially unusual (in this case, unusually large), and so has plotted this point separately. (Unusually large or small observations are called outliers.)
The values of the quartiles (\(Q_1\), \(Q_2\) and \(Q_3\)) are computed as usual.
So, for the Control data, the largest observation (31 eggs) is deemed unusually large (using arbitrary rules explained in Sect. 13.5.3). Then the boxplot is constructed like this:
 The largest number of eggs (excluding the outlier of 31 eggs) is about 26: This is the line at the top of the boxplot.
 75% of the observations (including the 31 eggs) are smaller than about 12, and this is represented by the line at the top of the central box. This is called the third quartile, or \(Q_3\).
 50% of the observations (including the 31 eggs) are smaller than about 2, and this is represented by the line in the centre of the central box. This is an 'average' value for the data, the second quartile, or \(Q_2\).
 25% of the observations (including the 31 eggs) are smaller than about 0.5, and this is represented by the line at the bottom of the central box. This is called the first quartile, or \(Q_1\). Clearly we cannot have 0.5 eggs, but with 15 observations it is not possible to exactly determine the value for which 25% of observations are smaller. Software uses approximations to compute these values. (Different software may use different rules.)
 The smallest number of eggs is 0. This is the line at the bottom of the boxplot.
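The five lines of each boxplot can be computed with Python's standard library, as sketched below. Two choices here are assumptions rather than the text's method: the quartiles use the 'inclusive' interpolation rule of `statistics.quantiles`, and outliers are flagged with the common rule of being more than 1.5 times the interquartile range beyond the box (the rules used in this book are explained in Sect. 13.5.3 and may differ). The data are as read from Table 12.2, with the first two columns taken as the Treatment group.

```python
from statistics import quantiles


def boxplot_summary(values):
    """Return the five lines of a boxplot, plus any points flagged as outliers."""
    data = sorted(values)
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed outlier rule: more than 1.5 x IQR below Q1 or above Q3
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {"min": inside[0], "Q1": q1, "Q2": q2, "Q3": q3,
            "max": inside[-1], "outliers": outliers}


# Number of eggs, as read from Table 12.2 (column assignment is an assumption)
treatment = [0, 0, 1, 1, 3, 8, 8, 12, 18, 21, 26, 30, 35, 48, 50]
control = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 8, 16, 20, 26, 31]

print(boxplot_summary(treatment))
print(boxplot_summary(control))
```

Under these assumptions, the sketch gives quartiles of 2, 12 and 28 for the Treatment group, and 0.5, 2 and 12 for the Control group, with the observation of 31 eggs flagged as an outlier. Other software, or other quartile rules, may give slightly different values.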
Example 12.14 (Boxplots explained) The NHANES study collects large amounts of information from about 5,000 Americans each year (Sect. 12.9). Consider the boxplot of the age of these Americans.
The animation below shows how the boxplot of the age of the Americans in the sample is constructed. The "average" age of the subjects is about 38 years, and the ages range from almost zero to about 80 years of age.
Example 12.15 (Boxplots) Boxplots can be plotted horizontally too, which leaves room for long labels. In Fig. 12.19 (based on Emmanuel João Nogueira Leal Silva et al.^{286}), the three cements are quite different regarding their push-out forces.
Example 12.16 (Boxplots) A study of different engineering project delivery methods^{287} produced the boxplot in Fig. 12.20: the increase in the costs of projects seems to differ between the two methods.
The DB (Design/Build) method produces a smaller project cost growth on average (the centre line of the boxplot), but the DBB (Design/Bid/Build) method produces more variation in project cost growth. Notice the presence of outliers for both methods, as indicated by the dots.
12.5 Two quantitative variables
Scatterplots display the relationship between two quantitative variables. Conventionally, the "response" variable is on the vertical axis, and the "explanatory" variable is on the horizontal axis.
As with any graph, explaining what the graph tells us is important, because the purpose of a graph is to display the information in the clearest, simplest possible way, to help the reader understand the message(s) in the data. A scatterplot can be described by briefly explaining how the variables are related to each other. Scatterplots are studied again later.
Example 12.17 (Scatterplots) A study of mdx mice (mice with muscular dystrophy)^{288} recorded the lung weight of mice at various ages. The scatterplot in Fig. 12.21 shows that the average lung weight increases, then declines after about 50 weeks of age; a lot of variation exists within mice of similar ages.
Example 12.18 (Scatterplots) A study of athletes at the Australian Institute of Sport (AIS) measured numerous physical and blood measurements from high performance athletes.^{289}
Many relationships were of interest. Fig. 12.22 shows the relationship between the sum of skin folds (SSF) and percentage body fat.
Each point represents the percentage body fat and the SSF for one athlete. Clearly, the greater the percentage body fat, the greater the sum of skin folds, in general.
12.6 Two qualitative variables
The relationship between two qualitative variables can be explored using:
 Stacked bar charts;
 Side-by-side bar charts;
 Dot charts.
Many variations of these graphs are possible.
As an example, a study of road kill^{290} produced the data in Table 12.3. There are two qualitative variables: the season (ordinal, with four levels) and the sex (nominal, with three levels including 'Unknown').
| Season | Unknown |  M |  F |
|:-------|--------:|---:|---:|
| Autumn |      75 | 25 | 21 |
| Winter |      74 | 27 | 22 |
| Spring |      71 | 10 | 18 |
| Summer |      58 | 10 | 12 |
12.6.1 Stacked bar charts
The data can be graphed by using a bar for each season, stacking the bars by sex on top of each other, within each season (Fig. 12.23).
12.6.2 Side-by-side bar charts
Instead of stacking the bars within each season on top of each other, the bars can be placed side-by-side within each season (Fig. 12.24).
12.6.3 Dot charts
Instead of bars, dots (or other symbols) can be used in place of a side-by-side bar chart (Fig. 12.25).
12.6.4 Other variations
Many variations of these bar charts are possible. We can choose:
 horizontal or vertical bars;
 percentages or counts;
 stacked bar charts, side-by-side bar charts, or dot charts;
 either the sex of the possum or the season as the first division of the data.
Many variations exist; some are shown in Fig. 12.26. Another display is to construct a two-way table, of either counts (Table 12.3) or percentages (Table 12.4).
| Season | Unknown |    M |    F |
|:-------|--------:|-----:|-----:|
| Aut.   |    62.0 | 20.7 | 17.4 |
| Wint.  |    60.2 | 22.0 | 17.9 |
| Spr.   |    71.7 | 10.1 | 18.2 |
| Sum.   |    72.5 | 12.5 | 15.0 |
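The percentages in Table 12.4 are row percentages: within each season, each count in Table 12.3 is divided by that season's total. A minimal sketch, using the counts from Table 12.3 (the function name `row_percentages` is illustrative):

```python
# Counts from Table 12.3: season by sex
counts = {
    "Autumn": {"Unknown": 75, "M": 25, "F": 21},
    "Winter": {"Unknown": 74, "M": 27, "F": 22},
    "Spring": {"Unknown": 71, "M": 10, "F": 18},
    "Summer": {"Unknown": 58, "M": 10, "F": 12},
}


def row_percentages(table):
    """Convert each row of counts to percentages of that row's total."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: round(100 * n / total, 1) for col, n in cells.items()}
    return result


for season, cells in row_percentages(counts).items():
    print(season, cells)
```

Percentages could instead be computed within each sex (column percentages), or out of the overall total; which is most useful depends on the comparison of interest.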
Of all these displays, which one do you think best communicates the message in the data? (Indeed, what is the main message that you would like to get across?)
12.7 Other types of graphs
Many types of graphs have been studied, for specific types of data. But sometimes, other plots are useful. Usually the best plot is one of those just described, but sometimes the best plot is something different, perhaps even unique. Always remember the driving principle:
The purpose of a graph is to display the information in the clearest, simplest possible way, to help the reader understand the message(s) in the data.
Importantly, a graph is needed that helps answer the research question. In this section, graphs for some other situations are discussed:
 Geographic plots: Useful when the RQ is about comparing geographical regions.
 Case-profile plots: Useful when the same units are measured over a small number of time points, or are otherwise connected.
 Histogram of differences: Useful when the same units are measured over two time points, or are otherwise connected.
 Time plots: Useful when the units are measured over a large number of time points.
12.7.1 Geographic plots
When data are summarised over a geographic area, plotting accordingly can be useful.
Example 12.19 (Geographic plots) A study examining lower-limb amputation incidence in Australia (based on Michael P. Dillon et al.^{291}) produced the graphs in Figs. 12.27 and 12.28.
Clearly, the incidence of amputation is higher in the Northern Territory than other parts of Australia for both females and males; furthermore, males have higher incidence of amputation than females in every state/territory.
12.7.2 Case-profile plots
Sometimes the same variable is measured on each unit of analysis more than once (i.e., many observations per unit of analysis), but only a small number of times. Then case-profile plots can be used: plots that show how the response variable changes for each unit of analysis. Examples of this type of data include:
 Measurements of household water consumption before and after installing water-saving devices, for many households.
 Blood pressure measurements taken from left arms and right arms, for many people.
In both cases, the data are paired (two observations per unit of analysis) as each unit of analysis gets a pair of similar measurements.
Example 12.20 (Case-profile plots) A study of children with atopic asthma^{292} included the graph in Fig. 12.29. Each line on the graph represents one person.
12.7.3 Histogram of differences
An alternative way to present paired data (two observations made for each unit of analysis) is to produce a histogram of the changes for each individual. This may be easier to produce in software than a case-profile plot.
Consider the person in the case-profile plot whose line is at the top of the plot in Fig. 12.29. Their 'pre' IgE level is about 5500 micrograms/L, and their 'post' IgE level is about 4500 micrograms/L, which is a change (or more descriptively, a reduction) of about 1000 micrograms/L. These reductions could be computed for each person (Table 12.5).
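| IgE (before) | IgE (after) | IgE reduction |
|-------------:|------------:|--------------:|
|           83 |          83 |             0 |
|          292 |         292 |             0 |
|          293 |         292 |             1 |
|          623 |         542 |            81 |
|          792 |         709 |            83 |
|         1543 |        1000 |           543 |
|         1668 |        1000 |           668 |
|         1960 |        1626 |           334 |
|         2877 |        2502 |           375 |
|         2961 |        2711 |           250 |
|         5504 |        4504 |          1000 |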
Then a histogram can be constructed based on these reductions in IgE, with one reduction for each person (Fig. 12.30).
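The reductions in Table 12.5, and the counts behind a histogram of them, can be computed as in this sketch. The before and after values are those in Table 12.5; the bin width of 250 micrograms/L is an assumption made for illustration, and may not match the bins used in Fig. 12.30.

```python
# IgE levels (micrograms/L) from Table 12.5, one pair per person
before = [83, 292, 293, 623, 792, 1543, 1668, 1960, 2877, 2961, 5504]
after = [83, 292, 292, 542, 709, 1000, 1000, 1626, 2502, 2711, 4504]

# One reduction per person: the 'before' value minus the 'after' value
reductions = [b - a for b, a in zip(before, after)]
print(reductions)

# Count the reductions in bins of equal width (assumed width, for illustration)
bin_width = 250
counts = [0] * (max(reductions) // bin_width + 1)
for r in reductions:
    counts[r // bin_width] += 1  # a boundary value goes in the higher bin

for i, n in enumerate(counts):
    print(f"{i * bin_width:>4} to under {(i + 1) * bin_width:>4}: {'#' * n}")
```

Each person contributes one value to the histogram, so the pairing between the two measurements is retained.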
Example 12.21 (Graphing paired data) The Electricity Council in Bristol wanted to determine if a certain type of wall-cavity insulation was effective in reducing energy consumption in winter.^{293} Their RQ was:
Is there an average reduction in energy consumption due to adding insulation?
The data collected, shown below, can be graphed using a case-profile plot (Fig. 12.31, top panel): A dashed line has been used to show an increase in energy usage, and a solid line for houses where energy was saved after installing insulation. (Again, this is encoding extra information.)
For these data, finding the difference in energy consumption for each house seems sensible. The same unit of analysis is measured twice on the same variable: energy consumption before adding insulation and then after adding insulation. The difference in energy consumption (or the energy saving more specifically) for each house can be computed, then graphed using a histogram, bar chart, stemplot, or dot chart (Fig. 12.31, bottom panel).
Example 12.22 (Graphing paired data) A study^{294} examined the average number of bacteria on birthday cakes before and after blowing out the candles.
This question could be studied by taking two measurements from each cake: before and after blowing out candles. The change in the number of bacteria could be computed for each cake, and a histogram of the differences plotted.
12.7.4 Time plots
Sometimes, a variable is measured over many time points. A time plot can be used, which shows how the variable changes over time.
Example 12.23 (Time plots) The baby-birth data (in Sect. 12.2.1) recorded the time of each birth. A time plot shows how the weights varied over time (Fig. 12.32).
Example 12.24 (Time plots) A study of the number of lynx trapped in the Mackenzie River area of Canada^{295} each year from 1821 to 1934 produced the data shown in Fig. 12.33. A regular cycle is apparent, with periodic peaks where the number trapped is very large.
12.8 Notes on constructing graphs
12.8.1 Comparing 2D and 3D graphs
Always remember the purpose of a graph: to display the information in the simplest and clearest possible way, to help the reader understand the message(s) in the data.
Instead of aiming to communicate information, sometimes graphs are prepared to look fancy or clever, or to show off the researcher's computing skills.
One way that people try to be fancy is to use a third dimension when it is not needed. This is a bad idea: the resulting graphs can be misleading and hard to read.^{296}
In the NHANES study,^{297} the age and sex of each participant were recorded.
Using Fig. 12.34, can you easily determine if more females or more males are in each age group?
The artificial third dimension makes quickly determining the heights of the bars hard. In contrast, a 2D graph (a side-by-side bar graph; Fig. 12.35) makes it clear whether each age group has more females or more males.
12.8.2 Further comments
Always remember:
The purpose of a graph is to display the information in the clearest, simplest possible way, to help the reader understand the message(s) in the data.
Helping readers to understand the data is the essence of producing a good graph:
Data graphics should draw the viewer's attention to the sense and substance of the data, not to something else [...] essentially statistical graphics are instruments to help people reason about quantitative information.
 Edward R Tufte et al.^{298}, p. 91
You should be able to construct the graphics by hand, but we will generally use software (e.g., jamovi or SPSS). Importantly, given the purpose of a graph, what the graph communicates should be explained: Graphs need to be clear and well-labelled with captions.
Ensure that you:
 Do not add artificial third dimensions, or other 'chart junk'.^{299}
 Do not add optical illusions.
 Do not make any errors.
Ensure that you:
 Do add units of measurement or value labels where appropriate.
 Do add titles and axis labels.
 Do ensure the axis scales are appropriate.
 Do add any necessary explanations.
 Do make it easy for the reader to be able to consider the RQ (for example, to easily make the comparison of greatest interest).
Example 12.25 (Truncating bar charts) One optical illusion often appearing in graphs is when the frequency (or percentage) axis on a bar chart is truncated so that it doesn't start at zero.
For example, consider data recording the number of lung cancer cases in various Danish cities.^{300}
The animation below shows a bar chart with the count (vertical axis) starting at zero (when the counts in each age group look similar), and then gradually changing where the vertical axis starts, so that the final bar chart makes the number of cases in each age group look very different. The graph is misleading when the count axis does not start at zero.
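The size of this illusion can be quantified by comparing the drawn lengths of two bars for different axis starting points. The counts below are hypothetical, and the function name `apparent_ratio` is illustrative.

```python
def apparent_ratio(count_a, count_b, axis_start=0):
    """Ratio of the drawn bar lengths when the count axis starts at axis_start."""
    return (count_a - axis_start) / (count_b - axis_start)


# Hypothetical counts, for illustration only
print(apparent_ratio(110, 100))                 # 1.1: bars look similar
print(apparent_ratio(110, 100, axis_start=90))  # 2.0: one bar looks twice as long
```

The counts differ by only 10%, but when the axis starts at 90 the first bar is drawn twice as long as the second. Only an axis starting at zero makes the bar lengths proportional to the counts.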
12.9 Case Study: The NHANES data
To demonstrate how graphs can help answer RQs, consider the following RQ:
Among Americans, is the average direct HDL cholesterol different for current smokers and non-smokers?
From the RQ, the Population is 'Americans', the Outcome is the 'average direct HDL cholesterol levels', and the Comparison is 'Between those who currently do and do not smoke'.
There is no intervention; this is a relational RQ, that can be answered using an observational study.
As with any study, managing confounding should be considered, by thinking about possible extraneous variables that could be measured or observed.
Can you think of any possible extraneous variables that have the potential to be confounding variables?
Maybe weight... Weight may be related to both the HDL cholesterol concentration and the smoking status.
In this study, clearly we cannot collect primary data. However, data to answer this (and many other) RQs are obtained from the American National Health and Nutrition Examination Survey (NHANES).^{301} From the NHANES webpage, this survey:
... examines a nationally representative sample of about 5,000 persons each year... located in counties across the country, 15 of which are visited each year.
 NHANES webpage (emphasis added)
The data available are equivalent to "a simple random sample from the American population".^{302} In total, 10,000 observations on scores of variables are available (from the 2009/2010 and the 2011/2012 surveys). Fig. 12.36 shows the first 5000 observations on the first 40 variables only.