16 Relationships: quantitative data comparisons between individuals

So far, you have learnt to ask a RQ, design a study, collect the data, and describe the data. In this chapter, you will learn to compare quantitative data in different groups. You will learn to:

  • compare quantitative data between individuals using the appropriate graphs.
  • compare quantitative data between individuals using summary tables.

16.1 Introduction

Relational RQs compare groups. This chapter considers how to compare quantitative variables in different groups. Graphs are very useful for this purpose, and a table of numerical summaries is usually also produced.

16.2 Graphs

When a quantitative variable is measured or observed in different groups (i.e., between individuals), the variable can be graphed within each group. Graphs specifically for comparing the quantitative variable across the groups include:

  • Back-to-back stemplots: Best for small amounts of data in two groups;
  • 2-D dot charts: Best choice for small to moderate amounts of data;
  • Boxplots: Best choice, except for small amounts of data.

These situations have one quantitative variable being compared in different groups (defined by one qualitative variable).

16.2.1 Back-to-back stemplot

Back-to-back stemplots are two stemplots (Sect. 12.3.2) sharing the same stems; one group has the leaves emerging left-to-right from the stem, and the second group has the leaves emerging right-to-left from the stem. As with other stemplots, one advantage over other plots is that the original data are retained. One disadvantage is that only two groups can be compared.

Example 16.1 (Back-to-back stemplots) A study recorded the number of chest-beats by gorillas (Wright et al. 2021), for gorillas under \(20\) years old ('younger') and \(20\) years and over ('older') (Table 16.1). A back-to-back stemplot allows the two groups to be compared visually (Fig. 16.1). The leaves for younger gorillas go from right-to-left, and the leaves for older gorillas go left-to-right, sharing the same stems. The younger group has a faster chest-beating rate in general. One older gorilla has a much faster rate than the other older gorillas (an outlier).

TABLE 16.1: The chest-beating rate for gorillas, for younger and older gorillas
Younger             Older
\(0.7\)  \(1.7\)    \(0.0\)  \(0.9\)
\(0.9\)  \(1.8\)    \(0.1\)  \(1.1\)
\(1.3\)  \(2.6\)    \(0.2\)  \(1.6\)
\(1.5\)  \(3.0\)    \(0.3\)  \(4.0\)
\(1.5\)  \(4.1\)    \(0.4\)
\(1.5\)  \(4.4\)    \(0.6\)
\(1.7\)  \(4.4\)    \(0.8\)

FIGURE 16.1: Stemplot for the chest-beating rate for gorillas
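As a rough illustration of how a back-to-back stemplot is assembled, the following Python sketch builds a text version from the data in Table 16.1. The function names are hypothetical, and the sketch assumes every value is recorded to one decimal place, so that the units digit is the stem and the tenths digit is the leaf. The layout may differ from the plot produced by statistical software.

```python
from collections import defaultdict

# Chest-beating rates (beats per 10 h) from Table 16.1.
younger = [0.7, 0.9, 1.3, 1.5, 1.5, 1.5, 1.7, 1.7, 1.8, 2.6, 3.0, 4.1, 4.4, 4.4]
older = [0.0, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 0.9, 1.1, 1.6, 4.0]


def stem_and_leaves(values):
    """Group one-decimal values by stem (units digit), with sorted leaves (tenths digit)."""
    leaves = defaultdict(list)
    for value in sorted(values):
        tenths = round(value * 10)  # e.g. 1.7 -> 17; round() guards against float error
        leaves[tenths // 10].append(tenths % 10)
    return leaves


def back_to_back_stemplot(left, right):
    """Return the lines of a text stemplot; the left group's leaves read right-to-left."""
    left_leaves = stem_and_leaves(left)
    right_leaves = stem_and_leaves(right)
    first_stem = min(min(left_leaves), min(right_leaves))
    last_stem = max(max(left_leaves), max(right_leaves))
    width = max(len(group) for group in left_leaves.values())
    lines = []
    for stem in range(first_stem, last_stem + 1):
        # Reverse the left leaves so the smallest leaf sits next to the stem.
        left_text = "".join(str(d) for d in reversed(left_leaves.get(stem, [])))
        right_text = "".join(str(d) for d in right_leaves.get(stem, []))
        lines.append(f"{left_text:>{width}} | {stem} | {right_text}".rstrip())
    return lines


plot_lines = back_to_back_stemplot(younger, older)
print("\n".join(plot_lines))
```

Every stem between the smallest and largest is printed, even when it has no leaves, so that gaps in the data (such as the gap before the large value for the older gorillas) remain visible.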

16.2.2 2-D dot charts

A 2-dimensional (2-D) dot chart places a dot for each observation, separated for each level of the qualitative variable (also see Sect. 13.3.1). Any number of groups can be compared.

Example 16.2 (2-D dot charts) For the chest-beating data seen in Example 16.1, a dot chart is shown in Fig. 16.2. Many observations are the same, so some points would be over-plotted unless the points are stacked (left panel) or some randomness (a 'jitter') is added in the vertical direction before plotting (right panel).


FIGURE 16.2: Two variations of a 2-D dot chart for the chest-beating data to avoid overplotting: stacking (left) and jittering (right)
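The two ways of avoiding over-plotting can be sketched in Python by computing where each dot would be drawn. This is only an illustration: the function names, the jitter amount of 0.1 and the random seed are all assumptions, and plotting software will make its own choices. Each observation is paired with a vertical offset; the horizontal position is the observed value itself.

```python
import random
from collections import Counter

# Chest-beating rates (beats per 10 h) for the younger gorillas, from Table 16.1.
younger = [0.7, 0.9, 1.3, 1.5, 1.5, 1.5, 1.7, 1.7, 1.8, 2.6, 3.0, 4.1, 4.4, 4.4]


def stacked_positions(values):
    """Pair each value with a stack level: 0 for the first dot at that value, then 1, 2, ..."""
    seen = Counter()
    positions = []
    for value in values:
        positions.append((value, seen[value]))
        seen[value] += 1
    return positions


def jittered_positions(values, amount=0.1, seed=1):
    """Pair each value with a small random vertical offset in [-amount, amount]."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) so the plot is reproducible
    return [(value, rng.uniform(-amount, amount)) for value in values]


stacked = stacked_positions(younger)
jittered = jittered_positions(younger)

for (value, level), (_, offset) in zip(stacked, jittered):
    print(f"value = {value:.1f}   stack level = {level}   jitter = {offset:+.3f}")
```

Stacking keeps every dot exactly at its observed value, while jittering relies on chance to separate tied observations, so two jittered dots can still overlap occasionally.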

16.2.3 Boxplots

A boxplot is a picture of the quantiles (Sect. 12.6.3). Any number of groups can be compared using a boxplot.

The distribution for each group is summarised by five numbers: the minimum value; \(Q_1\); the median (\(Q_2\)); \(Q_3\); and the maximum value. Outliers, identified using the IQR rule (Sect. 12.8.2), are also shown. Distributions can be compared by comparing the values of \(Q_1\), the medians, and the values of \(Q_3\). Different software may use different rules for computing quartiles, and hence may produce slightly different boxplots.

Since boxplots summarise each distribution with only five numbers, a lot of detail about the distribution is lost. For this reason, boxplots are excellent for comparing distributions, but histograms are better for displaying the distribution of a single quantitative variable.

Example 16.3 (Boxplots) An example of a boxplot, for the chest-beating data in Example 16.1, is shown in Fig. 16.3. No outliers are identified for younger gorillas; one large outlier is identified for the older gorillas. The boxplot shows a distinct difference between the chest-beating rates of older and younger gorillas.


FIGURE 16.3: A boxplot for the chest-beating data

The boxplots are explained in Fig. 16.4. Each boxplot has five horizontal lines. Focusing first on the boxplot for the younger gorillas (i.e., the left box), these lines are, from the top to the bottom of the plot:

  1. Top line: The fastest chest-beating rate is \(4.4\) per \(10\) h.
  2. Second line from top: \(75\)% of observations are smaller than about \(3\), represented by the line at the top of the central box. This is the third quartile (\(Q_3\)).
  3. Middle line: \(50\)% of observations are smaller than about \(1.7\), represented by the line inside the central box. This is an 'average' value: the median, or second quartile (\(Q_2\)).
  4. Second line from bottom: \(25\)% of observations are smaller than about \(1.5\), represented by the line at the bottom of the central box. This is the first quartile (\(Q_1\)).
  5. Bottom line: The slowest chest-beating rate is \(0.7\) per \(10\) h.

FIGURE 16.4: Explaining the boxplots for the chest-beating data
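The five values read from the boxplot for the younger gorillas can be checked numerically. The sketch below uses Python's standard `statistics.quantiles` function with its two built-in quartile rules (`"inclusive"` and `"exclusive"`); which rule, if either, was used to draw Fig. 16.3 is not known, so this only illustrates how the choice of rule changes the quartiles. The helper name `five_number_summary` is hypothetical.

```python
import statistics

# Chest-beating rates (beats per 10 h) for the younger gorillas, from Table 16.1.
younger = [0.7, 0.9, 1.3, 1.5, 1.5, 1.5, 1.7, 1.7, 1.8, 2.6, 3.0, 4.1, 4.4, 4.4]


def five_number_summary(values, method):
    """Return (minimum, Q1, median, Q3, maximum) using the named quartile rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method=method)
    return (min(values), q1, median, q3, max(values))


inclusive = five_number_summary(younger, "inclusive")
exclusive = five_number_summary(younger, "exclusive")

for name, summary in (("inclusive", inclusive), ("exclusive", exclusive)):
    print(f"{name:>9}:", [round(x, 3) for x in summary])
```

For these data, both rules give a minimum of 0.7, a median of 1.7 and a maximum of 4.4. The quartiles differ slightly: the inclusive rule gives \(Q_1 = 1.5\) and \(Q_3 = 2.9\), while the exclusive rule gives \(Q_1 = 1.45\) and \(Q_3 = 3.275\). Both agree with the approximate values read from the plot.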

The box for the older gorillas is slightly different (Fig. 16.3): one observation is identified with a point, above the top line. Computer software has identified this observation as a potential outlier (in this case, unusually large), and has plotted this point separately, using the IQR rule (Sect. 12.8.2).

The values of \(Q_1\), the median and \(Q_3\) are all substantially larger for the younger gorillas, suggesting that younger gorillas have, in general, faster chest-beating rates.
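The outlier identification can be sketched in Python. The sketch assumes that the IQR rule of Sect. 12.8.2 is the common convention of flagging observations more than \(1.5\times\text{IQR}\) below \(Q_1\) or above \(Q_3\), and it uses the `"inclusive"` quartile rule of `statistics.quantiles`; software using another quartile rule may place the fences slightly differently. The function name `iqr_outliers` is hypothetical.

```python
import statistics

# Chest-beating rates (beats per 10 h) from Table 16.1.
younger = [0.7, 0.9, 1.3, 1.5, 1.5, 1.5, 1.7, 1.7, 1.8, 2.6, 3.0, 4.1, 4.4, 4.4]
older = [0.0, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 0.9, 1.1, 1.6, 4.0]


def iqr_outliers(values, multiplier=1.5):
    """Return the observations lying more than multiplier * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - multiplier * iqr  # assumed rule: 1.5 * IQR below Q1
    upper_fence = q3 + multiplier * iqr  # assumed rule: 1.5 * IQR above Q3
    return [x for x in values if x < lower_fence or x > upper_fence]


younger_outliers = iqr_outliers(younger)
older_outliers = iqr_outliers(older)

print("Younger gorillas:", younger_outliers)
print("Older gorillas:  ", older_outliers)
```

Under these assumptions, no observation is flagged for the younger gorillas and only the rate of 4.0 is flagged for the older gorillas, matching the boxplot.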

Example 16.4 (Boxplots) Boxplots can be plotted horizontally too, which leaves space for long labels. In Fig. 16.5 (based on Silva et al. (2016)), the three dental cements are very different regarding their push-out forces.


FIGURE 16.5: Comparing the push-out values for three dental cements

16.3 Summary tables

The numerical summary information for comparing a quantitative variable between groups can be collated in a table. The table should summarise the quantitative variable for each group, and (when appropriate) the differences between the groups should also be summarised.

Example 16.5 (Numerical summary table) For the gorilla data, the summary of the data can be tabulated as in Table 16.2. Notice that no standard deviation or sample size is provided for the difference; these make no sense.

Based on the table, do you think a difference exists between the mean chest-beating rate for younger and older gorillas? Remember, the data in Table 16.2 are for a sample, but the RQ is asking about the population.

TABLE 16.2: A numerical summary of the gorillas data
             Mean                  Standard deviation    Sample size
             (in beats per 10 h)   (in beats per 10 h)
Younger      \(2.221\)             \(1.2699\)            \(14\)
Older        \(0.909\)             \(1.1309\)            \(11\)
Difference   \(1.312\)
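The entries in Table 16.2 can be reproduced from the data in Table 16.1. The following Python sketch uses the standard `statistics` module, where `stdev` is the sample standard deviation; the helper name `summarise` and the printed layout are illustrative choices only.

```python
import statistics

# Chest-beating rates (beats per 10 h) from Table 16.1.
younger = [0.7, 0.9, 1.3, 1.5, 1.5, 1.5, 1.7, 1.7, 1.8, 2.6, 3.0, 4.1, 4.4, 4.4]
older = [0.0, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 0.9, 1.1, 1.6, 4.0]


def summarise(values):
    """Return the mean, sample standard deviation and sample size for one group."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "n": len(values),
    }


summary = {"Younger": summarise(younger), "Older": summarise(older)}

# Only the difference between the means is reported; a standard deviation
# or sample size for the difference is not meaningful here.
difference = summary["Younger"]["mean"] - summary["Older"]["mean"]

print(f"{'Group':<12}{'Mean':>8}{'Std. dev.':>12}{'n':>5}")
for group, row in summary.items():
    print(f"{group:<12}{row['mean']:>8.3f}{row['sd']:>12.4f}{row['n']:>5d}")
print(f"{'Difference':<12}{difference:>8.3f}")
```

The direction of the difference (here, younger minus older) is a choice; whichever direction is used should be stated when the table is reported.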

16.4 Example: water access

A study of three rural communities in Cameroon (López-Serrano et al. 2022) recorded data about access to water (Sect. 12.10). One of the main purposes of the study was to determine contributors to the incidence of diarrhea in young children (\(85\) households had children under \(5\)).

The graphs (Fig. 16.6) and summary (Table 16.3) show that households in which diarrhea was found in children in the last two weeks had older female coordinators, more people in the household, and more children under \(5\) in the household. This may be expected: older female coordinators probably have more children, and so have more children under \(5\) in the household, and more children (and hence people) in the household in general.


FIGURE 16.6: Some plots for the water access data in \(85\) households (\(59\) households reported no diarrhea in children under \(5\); \(26\) reported diarrhea in children under \(5\)).

TABLE 16.3: A summary of the quantitative variables in the water-access study, according to whether diarrhea had been observed in the last two weeks in children under \(5\), for those households with children under \(5\)
                                  \(n\)     Mean       Median     Std. dev.   IQR
Woman's age
  All households with children    \(85\)    \(40.2\)   \(37.0\)   \(13.90\)   \(28.00\)
  Incidents of diarrhea           \(26\)    \(45.0\)   \(46.5\)   \(14.04\)   \(28.50\)
  No incidents of diarrhea        \(59\)    \(38.1\)   \(35.0\)   \(13.44\)   \(22.50\)
  Difference                                \(6.8\)
Household size
  All households with children    \(85\)    \(8.4\)    \(7.0\)    \(4.93\)    \(6.00\)
  Incidents of diarrhea           \(26\)    \(10.5\)   \(8.5\)    \(6.51\)    \(7.75\)
  No incidents of diarrhea        \(59\)    \(7.5\)    \(6.0\)    \(3.78\)    \(4.50\)
  Difference                                \(2.9\)
Children under 5 in household
  All households with children    \(85\)    \(2.2\)    \(2.0\)    \(1.56\)    \(2.00\)
  Incidents of diarrhea           \(26\)    \(2.8\)    \(2.0\)    \(2.01\)    \(1.75\)
  No incidents of diarrhea        \(59\)    \(1.9\)    \(2.0\)    \(1.26\)    \(1.00\)
  Difference                                \(0.8\)

16.5 Chapter summary

Quantitative data can be compared between different groups (between-individuals comparisons) using a back-to-back stemplot, boxplot or \(2\)-D dot chart. A summary table should show the numerical summaries for the quantitative variable in each group and, if appropriate, the between-group differences.

16.6 Quick review questions

Are the following statements true or false?

  1. A boxplot is an appropriate graph for comparing a quantitative variable in two or more groups.
  2. A back-to-back stemplot is an appropriate graph for comparing a quantitative variable in two or more groups.
  3. A case-profile plot is an appropriate graph for comparing a quantitative variable in two or more groups.
  4. When comparing a quantitative variable in two or more groups, the sample size for the difference should be included.

16.7 Exercises

Selected answers are available in App. E.

Exercise 16.1 A study of different engineering project delivery methods (Hale et al. 2009) produced the boxplot in Fig. 16.7 (left panel). The grey, horizontal line is where the projected costs are the same as the actual cost.

What does the plot reveal about the two methods?

Exercise 16.2 [Dataset: AISsub] A study of athletes at the Australian Institute of Sport (AIS) recorded numerous physical and blood measurements from high-performance athletes (Telford and Cunningham 1991). The graph in Fig. 16.7 (right panel) compares the heights of females in two similar sports: basketball and netball. (Netball was derived from basketball.) How would you interpret the graph, regarding the heights of the athletes in the two sports?


FIGURE 16.7: Left: Cost increases for two different building project delivery methods (the grey, horizontal line is where the projected costs are the same as the actual cost). Right: The heights of female basketball and netball players attending the AIS.

Exercise 16.3 Match the histograms with the corresponding boxplots in the activity below.

Exercise 16.4 A study of the productivity of construction workers (Gatti et al. 2013) recorded, among other things, the rate at which concrete panels could be installed by workers. Data for three different female workers in the study are shown in Table 16.4. Construct the boxplot comparing the three workers. What does it tell you?

TABLE 16.4: The productivity of three workers installing concrete panels (in panels per minute)
Worker 1 Worker 2 Worker 3
Mean \(1.24\) \(1.73\) \(1.36\)
Minimum \(0.59\) \(1.13\) \(0.86\)
1st quartile \(0.88\) \(1.51\) \(1.16\)
Median \(1.35\) \(1.70\) \(1.38\)
3rd quartile \(1.49\) \(1.91\) \(1.58\)
Maximum \(1.88\) \(3.00\) \(2.17\)
Range \(1.28\) \(1.87\) \(1.31\)

Exercise 16.5 In a study of the temperature in offices, Paul and Taylor (2008) compared the temperature in three offices (during working hours) at Charles Sturt University (Australia); the data are summarised in Table 16.5. Using this information, draw the boxplot comparing the three offices. What do we learn from this graph?

TABLE 16.5: A summary of the temperature (in degrees C) in three offices at CSU during working hours
Office A Office B Office C
Mean \(24.1\) \(25.3\) \(25.7\)
Minimum \(16.4\) \(15.9\) \(20.1\)
\(Q_1\) \(22.8\) \(23.8\) \(24.6\)
Median \(24.4\) \(25.5\) \(26.1\)
\(Q_3\) \(25.5\) \(26.9\) \(27.2\)
Maximum \(27.4\) \(31.0\) \(30.3\)

Exercise 16.6 In a study of the influence of using ankle-foot orthoses in children with cerebral palsy (Swinnen et al. 2018), the data in Table 11.3 describe the \(15\) subjects. (GMFCS is an ordinal variable used to describe the impact of cerebral palsy on their motor function: the Gross Motor Function Classification System.)

Sketch a graph to explore the relationship between the children's weight and GMFCS.

Exercise 16.7 [Dataset: NHANES] Consider this RQ:

Among Americans, is the mean direct HDL cholesterol different for current smokers and non-smokers?

Data to answer this RQ are available from the American National Health and Nutrition Examination Survey (NHANES) (Pruim 2015). Use the jamovi output (Fig. 16.8) to construct an appropriate table showing the numerical summary relevant to the RQ.


FIGURE 16.8: jamovi output for the NHANES data

Exercise 16.8 [Dataset: ForwardFall] A study (Wojcik et al. 1999) compared the lean-forward angle in younger and older women. An elaborate set-up was constructed to measure this angle, using a harness. Consider the RQ:

Among healthy women, what is the difference between the mean lean-forward angle for younger women compared to older women?

The data are shown in Table 16.6.

  1. What is an appropriate graph to display the data?
  2. Construct an appropriate numerical summary from the software output (Fig. 16.9).
TABLE 16.6: Lean-forward angles for older and younger women
Younger women                              Older women
\(29\) \(34\) \(33\) \(27\) \(28\)         \(18\) \(15\) \(23\) \(13\) \(12\)
\(32\) \(31\) \(34\) \(32\) \(27\)

FIGURE 16.9: jamovi output for the lean-forward angles data

Exercise 16.9 [Dataset: Speed] To reduce vehicle speeds on freeway exit ramps, a study in China investigated adding extra signage (Ma et al. 2019). At one site (Ningxuan Freeway), speeds were recorded for \(38\) vehicles before the extra signage was added, and then for \(41\) different vehicles after the extra signage was added (data below).

The researchers are hoping that the addition of extra signage will reduce the mean speed of the vehicles. The RQ is:

At this freeway exit, how much is the mean vehicle speed reduced after extra signage is added?

  1. Using the jamovi output in Fig. 16.10, summarise the data numerically, and construct a summary table.
  2. Use a computer to produce a boxplot of the data.

FIGURE 16.10: jamovi output for the speed data

Exercise 16.10 [Dataset: Deceleration] To reduce vehicle speeds on freeway exit ramps, a study in China investigated using additional signage (Ma et al. 2019). At one site (Ningxuan Freeway), speeds were recorded at various points on the freeway exit for \(38\) vehicles before the extra signage was added, and then for \(41\) vehicles after the extra signage was added.

From these data, the deceleration of each vehicle was determined (data below) as the vehicle left the \(120\) km/h speed zone and approached the \(80\) km/h speed zone. Use the data, and the summary in Table 16.7, to address this RQ:

At this freeway exit, what is the difference between the mean vehicle deceleration, comparing the times before the extra signage is added and after extra signage is added?

In this context, the researchers are hoping that the extra signage might cause cars to slow down faster (i.e., they will decelerate more, on average, after adding the extra signage).

  1. Construct a numerical summary table for the data.
  2. Use software to construct a boxplot of the data.
TABLE 16.7: A summary table for the deceleration data
             Mean          Std. dev.     Sample size   Std. error
After        \(0.0745\)    \(0.0494\)    \(38\)        \(0.0080\)
Before       \(0.0765\)    \(0.0521\)    \(41\)        \(0.0081\)
Difference   \(0.0020\)                                \(0.0114\)

Exercise 16.11 [Dataset: Typing] The Typing dataset contains information about the typing speed and accuracy for students, from an online typing test. The four variables are: typing speed (mTS), typing accuracy (mAcc), age (Age), and sex (Sex) for \(1301\) students.

  1. Produce appropriate numerical summaries for the quantitative variables.
  2. Produce appropriate numerical summaries for comparing the quantitative variables for different values of the qualitative variable.
  3. What do you learn from these numerical summaries?