18 More about tables and graphs

So far, you have learnt to ask a RQ, design a study, collect the data, and describe the data. In this chapter, you will learn to:

  • construct clear and informative graphs.
  • construct clear and informative tables.

18.1 Introduction

A summary of the data is important for understanding the data, and for planning the direction of the analysis.

18.2 Preparing graphs

Always remember:

The purpose of a graph is to display the information in the clearest, simplest possible way, to facilitate understanding the message(s) in the data.

Usually a computer is used for constructing graphs. Using a computer makes it easy to try different graphs, and to change features of graphs.

Helping readers to understand the data is the essence of producing a good graph. You should be able to sketch graphs by hand, even though software (e.g., jamovi) is usually used to produce them. When creating graphs, ensure that you:

  • do make graphs clear and well-labelled.
  • do add titles and axis labels.
  • do add units of measurement and value labels where appropriate.
  • do add informative captions below the figure.
  • do make sure text and details are easy to read.
  • do ensure the axis scales are appropriate.
  • do add any necessary explanations.
  • do make it easy for the reader to see how the data answer the RQ (for example, to easily make the comparison of greatest interest).
  • do not add optical illusions, such as an artificial third dimension, or other 'chart junk' (Su 2008).
  • do not use distracting colours and fonts; only use different colours and fonts if necessary, and explain that purpose if it is not clear.
  • do not make errors.

Two specific optical illusions to avoid are given in the two examples that follow.

Example 18.1 (Truncating bar charts) One common optical illusion occurs when the frequency (or percentage) axis on a bar chart (or dot chart) does not start at zero. For example, consider data recording the number of lung cancer cases in various Danish cities (Fig. 13.3).

The animation below shows a bar chart with the count (vertical axis) starting at zero, where the counts in each age group look similar. The starting point of the vertical axis then gradually changes, so that the final bar chart makes the number of cases in each age group look very different. A bar chart is visually misleading when the count axis does not start at zero.
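The size of the distortion can be checked with simple arithmetic. The sketch below is illustrative only: the two counts are hypothetical (they are not the Danish lung cancer data), and `apparent_ratio` is a helper name invented here. It computes how many times taller the second bar is drawn than the first, for different starting points of the count axis.

```python
def apparent_ratio(count_a, count_b, axis_start=0):
    """Ratio of the drawn bar heights when the count axis starts at axis_start."""
    if axis_start >= min(count_a, count_b):
        raise ValueError("axis_start must be below both counts")
    # A bar is drawn from axis_start up to its count, so its drawn
    # height is (count - axis_start), not the count itself.
    return (count_b - axis_start) / (count_a - axis_start)


# Hypothetical counts for two groups (assumed values, for illustration only)
count_a, count_b = 50, 60

for start in (0, 20, 40, 45):
    ratio = apparent_ratio(count_a, count_b, axis_start=start)
    print(f"Axis starts at {start:2d}: second bar is drawn {ratio:.1f} times as tall")
```

Only when the axis starts at zero does the ratio of the drawn heights match the ratio of the counts themselves.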

Graphs should focus on clear communication, not trying to appear fancy simply for the sake of appearing fancy. One way people try to be fancy is to use an unnecessary third dimension. This is poor: the graphs can be misleading and hard to read (Siegrist 1996).

Example 18.2 (Two- and three-dimensional plots) In the NHANES study (CDC 1994), the age and sex of each participant were recorded. Using Fig. 18.1 (left panel), can you easily determine if more females or more males are in each age group?

The artificial third dimension makes determining the heights of the bars hard. In contrast, a \(2\)-D graph (a side-by-side bar graph; Fig. 18.1, right panel) makes it clear whether each age group has more females or more males.

FIGURE 18.1: Two plots of the NHANES participants, divided by age group and sex. Left: A three-dimensional bar chart. Right: a side-by-side bar chart

18.3 Preparing tables

A computer is helpful for constructing tables. Using a computer also makes it easy to try different orientations or layouts.

As with graphs, the purpose of tables is to help readers understand the data. When creating numerical summary tables, ensure that you:

  • do make tables clear and well-labelled.
  • do use clear row and column labels (as necessary).
  • do add units of measurement and value labels where appropriate.
  • do add informative captions above the table.
  • do make sure text and details are easy to read.
  • do round numbers appropriately (don't necessarily use all significant figures provided by software).
  • do align numbers in the table by decimal point if possible, for easier reading.
  • do construct the table to allow readers to easily make the important comparisons, as far as possible (space restriction may take precedence, for example).
  • do not use distracting colours and fonts; only use different colours and fonts if necessary, and explain that purpose if it is not clear.
  • do not use vertical lines (in general), and use very few horizontal lines. Horizontal lines can be used to group columns (for example, see Table 18.1).
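Two of these guidelines (rounding appropriately, and aligning numbers by the decimal point) can be sketched in a few lines of code. This is a minimal illustration, not a feature of jamovi or any other package: the function name `format_rows`, the row labels and the values are all hypothetical. Rounding every value to the same number of decimal places and right-aligning the results makes the decimal points line up.

```python
def format_rows(rows, decimals=1):
    """Return text lines with values rounded and aligned by decimal point."""
    label_width = max(len(label) for label, _ in rows)
    # Round every value to the same number of decimal places...
    cells = [f"{value:.{decimals}f}" for _, value in rows]
    # ...then right-align, so that the decimal points share one column.
    value_width = max(len(cell) for cell in cells)
    return [
        f"{label:<{label_width}}  {cell:>{value_width}}"
        for (label, _), cell in zip(rows, cells)
    ]


# Hypothetical software output, with more digits than readers need
rows = [("Variable A", 137.04921), ("Variable B", 7.03125), ("Variable C", 22.5)]

for line in format_rows(rows):
    print(line)
```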

18.4 Example: water access

A study of three rural communities in Cameroon (López-Serrano et al. 2022) recorded data about access to water (see Sects. 13.6 and 12.10). One of the main purposes of the study was to determine contributors to the incidence of diarrhea in young children (\(85\) households had children under \(5\)). Relationships between the incidence of diarrhea and some other variables appear in Figs. 14.3 and 16.6. A summary table of information can also be constructed (Table 18.1).

In this table, note that:

  • quantitative and qualitative variables are summarised differently but appropriately.
  • units of measurement are given where appropriate (i.e., only for age).
  • numbers in columns are aligned for easier reading.

TABLE 18.1: Numerical summary of the water-access data in \(85\) households with children, according to whether children under \(5\) had reported diarrhea in the last two weeks

| | All households: \(n\) | All households: Summary | Reported diarrhea: \(n\) | Reported diarrhea: Summary | No reported diarrhea: \(n\) | No reported diarrhea: Summary |
|:---|---:|---:|---:|---:|---:|---:|
| Age\(^a\) | \(85\) | \(37.0\) \((28.0)\) | \(59\) | \(35.0\) \((22.5)\) | \(26\) | \(46.5\) \((28.5)\) |
| Household size\(^a\) | \(85\) | \(7.0\) \((6.0)\) | \(59\) | \(6.0\) \((4.5)\) | \(26\) | \(8.5\) \((7.8)\) |
| Under 5s in household\(^a\) | \(85\) | \(2.0\) \((2.0)\) | \(59\) | \(2.0\) \((1.0)\) | \(26\) | \(2.0\) \((1.8)\) |
| Region\(^b\) | | | | | | |
| Mbeng | \(26\) | \(30.6\)% | \(14\) | \(53.8\)% | \(12\) | \(46.2\)% |
| Mbih | \(28\) | \(32.9\)% | \(19\) | \(67.9\)% | \(9\) | \(32.1\)% |
| Ntsingbeu | \(31\) | \(36.5\)% | \(26\) | \(83.9\)% | \(5\) | \(16.1\)% |
| Water source\(^b\) | | | | | | |
| Tap | \(6\) | \(7.1\)% | \(5\) | \(83.3\)% | \(1\) | \(16.7\)% |
| Bore | \(56\) | \(65.9\)% | \(40\) | \(71.4\)% | \(16\) | \(28.6\)% |
| Well | \(10\) | \(11.8\)% | \(5\) | \(50.0\)% | \(5\) | \(50.0\)% |
| River | \(13\) | \(15.3\)% | \(9\) | \(69.2\)% | \(4\) | \(30.8\)% |
| Education\(^b\) | | | | | | |
| Primary or less | \(38\) | \(44.7\)% | \(27\) | \(71.1\)% | \(11\) | \(28.9\)% |
| Secondary or higher | \(47\) | \(55.3\)% | \(32\) | \(68.1\)% | \(15\) | \(31.9\)% |
| Has livestock\(^b\) | | | | | | |
| No | \(20\) | \(23.5\)% | \(17\) | \(85.0\)% | \(3\) | \(15.0\)% |
| Yes | \(65\) | \(76.5\)% | \(42\) | \(64.6\)% | \(23\) | \(35.4\)% |

\(^a\) Quantitative variables are summarised using medians and IQR
\(^b\) Qualitative variables are summarised using counts and percentages

The table summarises the sample, but RQs are about the population. For example, one RQ could be:

Is the percentage of households with children under \(5\) having diarrhea the same for households that do and do not keep livestock?

Since the observed sample is one of countless possible samples that may have been selected, answering RQs about the population is not straightforward. In the observed sample, \(85.0\)% of households that did not keep livestock reported diarrhea in children under \(5\), while \(64.6\)% of households that did keep livestock reported diarrhea in children under \(5\). That is, a difference is seen in the sample; but RQs are about the population.
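The two sample percentages quoted above can be reproduced from the counts in the 'Has livestock' rows of Table 18.1. The sketch below is a simple check of that arithmetic; the dictionary layout and the helper name `row_percentage` are choices made here for illustration.

```python
# Counts from the 'Has livestock' rows of Table 18.1
counts = {
    "No livestock": {"diarrhea": 17, "no diarrhea": 3},
    "Livestock": {"diarrhea": 42, "no diarrhea": 23},
}


def row_percentage(row, key="diarrhea"):
    """Percentage of households in one row that fall into the category `key`."""
    total = sum(row.values())
    return 100 * row[key] / total


percentages = {group: row_percentage(row) for group, row in counts.items()}
for group, pct in percentages.items():
    print(f"{group}: {pct:.1f}% reported diarrhea")

# The difference between the two sample percentages, in percentage points
difference = percentages["No livestock"] - percentages["Livestock"]
print(f"Difference: {difference:.1f} percentage points")
```

The percentages are computed within each livestock group (row percentages), which is the comparison the RQ asks about.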

Broadly, two possible reasons could explain why the sample percentages of households reporting diarrhea in children are different:

  • the population percentages are the same. The sample percentages are different simply because of the households selected in this particular sample. Another sample, with different households, might produce different sample percentages. Sampling variation explains the difference in the sample percentages.

  • the population percentages are different. The difference between the sample percentages reflects this difference between the population percentages.

The difficulty is knowing which of these two reasons ('hypotheses') is the more likely explanation for the difference between the sample percentages. This question is of prime importance (it answers the RQ), and is addressed at length later in this book.
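The first explanation (sampling variation) can be explored informally with a simulation. The sketch below is not the formal method developed later in the book; it is a rough illustration under stated assumptions: that both groups share one population percentage (taken here to be the overall sample value, \(59/85\)), and that households report diarrhea independently of each other. The function name `simulate_differences`, the number of simulations and the seed are arbitrary choices.

```python
import random


def simulate_differences(n_no, n_yes, p_common, n_sims=10_000, seed=1):
    """Simulate differences in sample percentages (no livestock minus livestock)
    when both groups share the same population proportion p_common."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_sims):
        # Each simulated household reports diarrhea with probability p_common
        no_count = sum(rng.random() < p_common for _ in range(n_no))
        yes_count = sum(rng.random() < p_common for _ in range(n_yes))
        diffs.append(100 * (no_count / n_no - yes_count / n_yes))
    return diffs


# Assumption: the common population proportion equals the overall sample value
p_common = 59 / 85
observed = 100 * (17 / 20 - 42 / 65)  # observed difference, in percentage points

diffs = simulate_differences(n_no=20, n_yes=65, p_common=p_common)
share = sum(abs(d) >= observed for d in diffs) / len(diffs)
print(f"Observed difference: {observed:.1f} percentage points")
print(f"Share of simulated samples with a difference at least this large: {share:.3f}")
```

If differences as large as the observed one appear often in the simulated samples, sampling variation is a plausible explanation; if they appear rarely, the second explanation becomes more plausible.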

18.5 Quick revision questions

Are the following statements true or false?

  1. Graphs usually have their captions under the figure.
  2. Graphs should use as many colours as possible.
  3. Graphs should usually be carefully created using computer software.
  4. Tables should have plenty of horizontal and vertical lines.
  5. Tables usually have their captions under the table.

18.6 Exercises

Selected answers are available in App. E.

Exercise 18.1 What would be the best graph for displaying the data for these situations?

  1. To explore the relationship between the pH of water and the temperature of the water, in various creeks.
  2. Suppose, in a research study, the researchers measured the difference between each swimmer's fastest \(100\) m time and their fastest \(200\) m time. The researchers were interested in the average time difference.
  3. A research study examined the way in which students usually came to university (bus; private car; car pooling; etc.) and their postcode.

Exercise 18.2 A study of lime trees (Tilia cordata) recorded these variables for \(385\) lime trees in Russia (Schepaschenko et al. 2017):

  • the foliage biomass, in kg (the response variable);
  • the tree diameter, in cm (the explanatory variable);
  • the age of the tree, in years; and
  • the origin of the tree, one of Coppice, Natural, or Planted.

The purpose of the study is to estimate the foliage biomass from the tree diameter, and perhaps the other extraneous variables. What graphs would be useful?

Exercise 18.3 In a study of the influence of using ankle-foot orthoses in children with cerebral palsy (Swinnen et al. 2018), the data in Table 11.3 describe the \(15\) subjects. (GMFCS is an ordinal variable used to describe the impact of cerebral palsy on their motor function: the Gross Motor Function Classification System.) Sketch some graphs to explore the relationships between these variables.

Exercise 18.4 A study of fertilizer use (Lane 2002) recorded the soil nitrogen after applying different fertilizer doses. The researchers recorded:

  • the fertilizer dose, in kilograms of nitrogen per hectare;
  • the soil nitrogen, in kilograms of nitrogen per hectare; and
  • the fertilizer source; one of 'inorganic' or 'organic'.

What graphs would be useful for understanding the data?

Exercise 18.5 A study of noisy miners (a small Australian bird) counted the number of noisy miners and the number of eucalyptus trees in random quadrats (Maron 2007; P. K. Dunn and Smyth 2018). Critique the graph of the data (Fig. 18.2, left panel).

FIGURE 18.2: Left: The number of noisy miners and the number of eucalyptus trees. Right: A scatterplot of the colour of female horseshoe crabs and the condition of their spines.

Exercise 18.6 A study of \(173\) female horseshoe crabs (Brockmann 1996; P. K. Dunn and Smyth 2018) recorded, among other things, the colour of the carapace ('Light medium', 'Medium', 'Dark medium' or 'Dark') and the condition of the spines ('Both OK', 'One OK', 'None OK'). Critique the scatterplot (Fig. 18.2, right panel) used to explore the data.

Exercise 18.7 A study (Danielsson et al. 2014) examined the change in MADRS (a quantitative scale measuring level of depression) and treatment group (whether each person was treated using: exercise; body awareness; or advice).

  1. What is the response variable?
  2. What is the explanatory variable?
  3. What graphs would be useful for exploring the data and the relationships of interest?

Exercise 18.8 A study of high-performance athletes at the Australian Institute of Sport (AIS) (Telford and Cunningham 1991) recorded numerous variables about athletes. A plot for the sports played by the athletes is shown in Fig. 18.3.

How would you describe the data: Left skewed, right skewed, approximately symmetrical? Or something else?

FIGURE 18.3: Sports played by athletes in the AIS study

Exercise 18.9 The Typing dataset contains four variables: typing speed (mTS), typing accuracy (mAcc), age (Age), and sex (Sex) for \(1301\) students. Produce the graphs necessary for understanding the data, and explain what they reveal.

Does the mean typing speed or mean accuracy appear to differ by the age or sex of the student? What other questions would be useful to ask about the data?