13 Numerical summaries: quantitative data

So far, you have learnt to ask a RQ, identify different ways of obtaining data, design the study, collect the data, describe the data, and graphically summarise the data.

In this chapter, you will learn to numerically describe quantitative data. Both quantitative and qualitative data are described numerically in quantitative research. You will learn to:

  • numerically summarise quantitative data using the appropriate statistics.
  • describe quantitative data by average, variation, shape and unusual features.

13.1 Introduction

In the last chapter (Sect. 12.9), this RQ was posed:

Among Americans, is the average direct HDL cholesterol different for current smokers and non-smokers?

Graphs were used in Sect. 12.9 to understand the data, and the information contained in those graphs was described. In some cases, the features of the data displayed in a graph can also be described numerically. That is the purpose of this chapter: to learn how to summarise quantitative data numerically.

Example 13.1 (Describing quantitative data) For the RQ above, understanding the response variable (direct HDL cholesterol values) is important; a histogram is useful (Fig. 13.1).

What does the histogram tell us?

  • Average: The average value is about 1.5 mmol/L.
  • Variation: The values range from about 0.5 to 3 mmol/L, but with some larger values (that are hard to see on the histogram).
  • Shape: The distribution is slightly skewed right.
  • Outliers: Some large outliers are present (that are hard to see on the histogram).

Describing some of these features more precisely, with numbers, can be helpful.


FIGURE 13.1: The histogram of the direct HDL cholesterol from the NHANES study

A number that describes a feature of a population is called a parameter. The values of parameters are usually unknown.

In contrast, a number that describes a feature of a sample is called a statistic. That is:

  • Samples are numerically described by statistics;
  • Populations are numerically described by parameters.

Definition 13.1 (Parameter) A parameter is a number describing some feature of a population.

Definition 13.2 (Statistic) A statistic is a number describing some feature of a sample (to estimate a population parameter).

The RQ identifies the population, but in practice a sample is studied. Statistics are estimates of parameters, and the value of the statistic is not the same for every possible sample.

13.2 Computing the average value

The average (or location, or centre, or typical value) for quantitative sample data can be described in many ways; the two most common are:

  • the mean; and
  • the median.

In both cases, the population parameter is estimated by a sample statistic. Understanding whether to use the mean or median is important.

The word 'average' can refer to either mean or median (or other measures of centre too). Use the precise terms 'mean' or 'median', rather than 'average', when necessary!

Consider the daily river flow volume (called 'streamflow') at the Mary River from 01 October 1959 to 17 January 2019, summarised by month in Table 13.1 (from Queensland DNRM).

The 'average' daily streamflow in February could be quoted using either the mean or the median; but the two give very different values for the 'average':

  • the mean daily flow is 1123.2 ML;
  • the median daily flow is 146.1 ML.

These two common ways of measuring the same thing (the 'average' daily streamflow in February) give very different answers. Why? Which is the best 'average' to use?

To decide, both measures of average will need to be studied.

TABLE 13.1: The daily streamflow at Mary River (Bellbird Creek), in ML, from 01 October 1959 to 17 January 2019; average for each month
Month Mean Median
Jan 849.3 71.3
Feb 1123.2 146.1
Mar 793.9 194.9
Apr 622.5 141.7
May 348.4 118.4
Jun 378.7 83.6
Jul 259.3 68.8
Aug 108.6 55.5
Sep 100.9 48.0
Oct 151.2 37.6
Nov 186.6 45.3
Dec 330.8 64.1

13.2.1 Computing the average: The mean

The mean of the population is denoted by \(\mu\), and its value is almost always unknown.

Instead, the mean of the population is estimated by the mean of the sample, which is denoted by \(\bar{x}\) (an \(x\) with a line above it). In this context, the unknown parameter is \(\mu\), and the statistic is \(\bar{x}\). The sample mean is used to estimate the population mean.

The Greek letter \(\mu\) is pronounced 'myoo', as in music.
The symbol \(\bar{x}\) is pronounced 'ex-bar'.

Example 13.2 (A small data set to work with) To demonstrate ideas, consider a small data set for answering this descriptive RQ:

For mature Jersey cows, what is the average percentage butterfat in their milk?

The population is 'milk from Jersey cows', and an estimate of the population mean percentage butterfat is sought. The population mean is denoted by \(\mu\).

Clearly, milk from every Jersey cow cannot be studied; a sample is studied.315 The unknown population mean is estimated using the sample mean (\(\bar{x}\)). Measurements of the butterfat percentage were taken from the milk of 10 cows (Table 13.2).

TABLE 13.2: The butterfat percentage from a sample of milk from 10 Jersey cows
Butterfat percentages
4.8 5.2 5.2 5.4 5.2
6.5 4.5 5.7 4.8 5.2

The sample mean is what people usually think of as the 'average'. The sample mean is actually the 'balance point' of the observations. The animation below shows how the mean acts as the balance point.

Alternatively, the mean is the value such that the positive and negative distances of the observations from the mean add to zero, as shown in the animation below. Both of these explanations seem reasonable for identifying the 'average' of the data.

Definition 13.3 (Mean) The mean is one way to measure the 'average' value of quantitative data. The arithmetic mean can be considered as the 'balance point' of the data, or the value such that the positive and negative distances from the mean add to zero.

To find the value of the sample mean:

  • Add (shown using the symbol \(\sum\)) all the observations (denoted by \(x\)); then
  • Divide by the number of observations (denoted by \(n\)).

In symbols: \[ \bar{x} = \frac{\sum x}{n}. \] This means to add up (indicated by \(\sum\)) the observations (denoted by \(x\)), then divide by the size of the sample (denoted by \(n\)).

Example 13.3 (Computing a sample mean) For data for the Jersey cow data (Example 13.2), an estimate of the population mean percentage butterfat is found using the sample information: sum all \(n=10\) observations and divide by \(n\): \[\begin{align*} \overline{x} &= \frac{\sum x}{n} = \frac{4.8 + 6.5 + \cdots + 5.2}{10}\\ &= \frac{52.5}{10} = 5.25. \end{align*}\] The sample mean, the best estimate of the population mean, is 5.25 percent.
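
This calculation can also be reproduced in software. A minimal Python sketch (assuming the butterfat values are typed in from Table 13.2):

```python
# Butterfat percentages from Table 13.2
butterfat = [4.8, 5.2, 5.2, 5.4, 5.2, 6.5, 4.5, 5.7, 4.8, 5.2]

# Sample mean: add the observations, then divide by the sample size n
x_bar = sum(butterfat) / len(butterfat)
print(round(x_bar, 2))  # 5.25
```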

Usually, software (such as jamovi or SPSS) or a calculator (in Statistics Mode) will be used to compute the sample mean. However, knowing how these quantities are computed is important.

For the butterfat data (Table 13.2), what is the value of \(\mu\), the population mean?

We do not know!

We know the value of the sample mean, but not the population mean. We only have an estimate of the value of the population mean by using the sample mean.

(If we already knew the value of the population mean, why would we estimate the value from an imperfect sample?)

A study of eyes316 aimed to estimate the average corneal thickness of eyes affected by glaucoma. The collected data (in microns) are shown in Table 13.3.

Estimate the population mean corneal thickness.

The estimate of \(\mu\) is \(\bar{x} = 459\) microns.

TABLE 13.3: The thickness of the cornea (in microns) in eyes affected by glaucoma
Corneal thicknesses
484 492 436 464
478 444 398 476

Software and calculators often produce numerical answers to many decimal places, some of which may not be meaningful or useful. A useful rule-of-thumb is to round to one or two more significant figures than the original data.

For example, the butterfat data are given to one decimal place, so the sample mean can be given to two decimal places: \(\bar{x} = 5.25\)%.

13.2.2 Computing the average: The median

The median is a value separating the larger half of the data from the smaller half of the data. In a data set with \(n\) values, the median is ordered observation number \(\displaystyle \frac{n + 1}{2}\).

The median is:

  • not the value \(\displaystyle \frac{n+1}{2}\); that formula gives the position of the median among the ordered observations, not its value.
  • not halfway between the minimum and maximum values in the data.

Most calculators cannot find the median.

The median has no commonly-used symbol.

Definition 13.4 (Median) The median is one way to measure the 'average' value of some data. The median is a value such that half the values are larger than the median, and half the values are smaller than the median.

Example 13.4 (Find a sample median) To find the sample median for the Jersey cow data (Example 13.2), first arrange the data in numerical order (Table 13.4).

The median separates the larger 5 numbers from the smaller 5 numbers. With \(n=10\) observations, the median is the ordered observation located between the fifth and sixth observations (i.e., at position \((10 + 1)/2 = 5.5\); the median itself is not 5.5).

So the sample median is between \(5.2\) (ordered observation five) and \(5.2\) (ordered observation six): the sample median is \(5.20\) percent.

TABLE 13.4: The butterfat percentage from a sample of milk from 10 Jersey cows, in increasing order
Butterfat percentages
4.5 4.8 4.8 5.2 5.2
5.2 5.2 5.4 5.7 6.5

For the butterfat data (Table 13.2), what is the population median?

We do not know!

We know the value of the sample median, but not the population median. We only have an estimate of the value of the population median.

A study of eyes317 aimed to estimate the average corneal thickness of eyes affected by glaucoma.

Using the collected data (Table 13.3), estimate the population median corneal thickness. What is the population median?

With \(n = 8\) observations, the median is ordered observation number \((8 + 1)/2 = 4.5\), halfway between ordered observation numbers 4 and 5.

After sorting into increasing order, the two middle numbers (the 4th and 5th) are 464 and 476. The median could be any number between 464 and 476, but the usual answer would be that the median is \((464 + 476)/2 = 470\).

The sample median is 470 microns; the value of the population median remains unknown.

To clarify:

  • If the sample size \(n\) is odd, the median is the middle number when the observations are ordered.
  • If the sample size \(n\) is even (such as the glaucoma example), the median is halfway between the two middle numbers, when the observations are ordered.

Some software uses different rules when \(n\) is even.
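
A brief Python sketch illustrating both cases (using the corneal thickness data from Table 13.3 for the even case, and a small made-up list for the odd case):

```python
import statistics

# Corneal thicknesses from Table 13.3 (n = 8, even): the median is halfway
# between the 4th and 5th ordered values, (464 + 476)/2 = 470
cornea = [484, 492, 436, 464, 478, 444, 398, 476]
print(statistics.median(cornea))           # 470.0

# A made-up odd-length example (n = 5): the median is the middle ordered value
print(statistics.median([3, 1, 4, 1, 5]))  # 3
```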

13.2.3 Which average to use?

Consider again estimating the average daily streamflow at the Mary River (Bellbird Creek) during February (Table 13.1): the mean daily streamflow is 1123.2 ML, and the median daily streamflow is 146.1 ML. Which is the 'best' average to use?

A dot chart of the daily streamflow (Fig. 13.2) shows that the data are very highly right-skewed, with many very large outliers: the maximum value is 156,586.4 ML, more than one hundred times larger than the mean of 1123.2 ML.

In fact, about 86% of the observations are less than the mean. In contrast, about 50% of the values are less than the median (by definition). For these data, the mean is hardly a central value...


FIGURE 13.2: A dot plot of the daily streamflow at Mary River from 1960 to 2017, for February. The vertical grey line is the mean value. Many large outliers exist, so the data near zero are all squashed together

The streamflow data are very highly skewed (to the right), which is important and relevant:

  • Means are best used for approximately symmetric data: the mean is influenced by outliers and skewness.
  • Medians are best used for data that are skewed or contain outliers: the median is not influenced by outliers and skewness.

Means tend to be too large if the data contain large outliers or severe right skewness, and too small if the data contain small outliers or severe left skewness.

For the Mary River data, the large outliers---and the fact that they are so extreme and abundant---result in the mean being substantially influenced by the outliers, which explains why the mean is much larger than the median. The median is the better measure of average for these data.

The mean is generally used if possible (for practical and mathematical reasons), and is the most commonly-used measure of location. However, the mean is influenced by outliers and skewness; the median is not influenced by outliers and skewness. The mean and median are similar in approximately symmetric distributions. Sometimes, quoting both the mean and the median may be appropriate.
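
The influence of outliers on the mean (but not the median) is easy to see with a small, made-up illustration in Python (the values below are hypothetical, not the Mary River data):

```python
import statistics

flows = [40, 55, 60, 70, 85]                # hypothetical daily flows (ML)
print(statistics.mean(flows))               # 62
print(statistics.median(flows))             # 60

flows_with_flood = flows + [15000]          # add one extreme, flood-like value
print(statistics.mean(flows_with_flood))    # about 2551.7: dragged up by the outlier
print(statistics.median(flows_with_flood))  # 65.0: barely changes
```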

An engineering study318 examined a new building material to determine its average permeability time.

The time (in seconds) taken for water to permeate each of \(n = 81\) pieces of material was recorded. Using a histogram of the data (Fig. 13.3), estimate the value of the population mean and median.

Which would be best to use (for example, to quote an average permeability time on a specification sheet)?

The data are skewed, which suggests using the median.

In practice, we would probably need a larger sample anyway before giving a value to use on a specification sheet.


FIGURE 13.3: A histogram of the permeability of a type of building material

13.3 Computing the variation

For quantitative data, the amount of variation in the bulk of the data should be described. Many ways exist to measure the variation in a data set, including:

  • the range;
  • the standard deviation;
  • the inter-quartile range (IQR); and
  • percentiles.

As always, a value computed from the sample (the statistic) estimates the unknown value in the population (the parameter). Knowing which measure of variation to use is important.

13.3.1 Computing the variation: Range

The range is the simplest measure of variation.

Definition 13.5 (Range) The range is the maximum value minus the minimum value.

The range is not often used, because it depends only on the two most extreme observations, so it is highly influenced by outliers. Sometimes, the range is reported by stating the minimum and maximum values themselves, rather than their difference. The range is measured in the same measurement units as the data.

Example 13.5 (The range) For Jersey cow data (Example 13.2), the range is: \[ \text{Range} = \overbrace{6.5}^{\text{largest}} - \overbrace{4.5}^{\text{smallest}} = 2.0 \text{ percent}. \]
So the sample median percentage butterfat is 5.20 percent, with a range of 2.00 percent.

13.3.2 Computing the variation: Standard deviation

The population standard deviation is denoted by \(\sigma\) ('sigma', the parameter) and is estimated by the sample standard deviation \(s\) (the statistic).

The standard deviation is the most commonly-used measure of variation, though it is tedious to compute manually (fortunately, you won't need to!). The standard deviation is (roughly) the mean distance of the observations from the mean, which seems a reasonable way to measure the amount of variation in the data.

The Greek letter \(\sigma\) is pronounced 'sigma'.

The sample standard deviation \(s\) is mostly found using computer software (e.g., jamovi or SPSS) or a calculator (in Statistics Mode).

Definition 13.6 (Standard deviation) The standard deviation is, approximately, the average distance that observations are away from the mean.

You do not have to use the formula to calculate \(s\), but we will demonstrate for those who might find it useful to understand exactly what \(s\) calculates. The formula is:

\[ s = \sqrt{ \frac{\sum(x - \bar{x})^2}{n-1} }, \] where \(\bar{x}\) is the sample mean, \(x\) represents the data values, and \(n\) is the sample size. To use the formula, follow these steps:

  • Calculate the sample mean: \(\overline{x}\);
  • Calculate the deviations of each observation \(x\) from the mean: \(x-\bar{x}\);
  • Square these deviations (to make them all positive values): \((x-\bar{x})^2\);
  • Add these values: \(\sum(x-\bar{x})^2\);
  • Divide the answer by \(n-1\);
  • Take the (positive) square root of the answer.

You do not need to use the formula! You must know how to use software or a calculator to find the standard deviation.

Example 13.6 (Standard deviation) For the Jersey cow data (Example 13.2), the deviations of each observation from the mean of \(5.25\) can be found (Fig. 13.4). Then follow the steps outlined. You don't have to do this manually! From Fig. 13.4, the sum of the squared distances is 2.7650. Then, the sample standard deviation is:

\[ s = \sqrt{\frac{2.765}{10-1}} = \sqrt{ 0.3072222} = 0.5542763. \] The sample mean percentage butterfat is 5.25 percent, with a sample standard deviation of 0.554 percent.
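
The same steps can be followed in Python, and the answer checked against the built-in function (statistics.stdev uses the same \(n - 1\) divisor):

```python
import statistics

butterfat = [4.8, 5.2, 5.2, 5.4, 5.2, 6.5, 4.5, 5.7, 4.8, 5.2]
n = len(butterfat)
x_bar = sum(butterfat) / n                                  # sample mean: 5.25

squared_deviations = [(x - x_bar) ** 2 for x in butterfat]  # (x - x_bar)^2 for each x
s = (sum(squared_deviations) / (n - 1)) ** 0.5              # divide by n - 1, then square root

print(round(s, 4))                            # 0.5543
print(round(statistics.stdev(butterfat), 4))  # 0.5543 (the same answer)
```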


FIGURE 13.4: The standard deviation is related to the sum of the squared-distances from the mean

The standard deviation for Dataset A in Fig. 13.5 is 2.00.

What do you estimate the standard deviation of Dataset B will be: smaller than 2.00 or greater than 2.00? Why?

The standard deviation is a bit like the average distance that observations are from the mean.

In Dataset B, there seem to be many more observations close to the mean, so the average distance from the mean would be smaller.

This suggests that the standard deviation for Dataset B will be smaller than the standard deviation for Dataset A.


FIGURE 13.5: Dotplots of two sets of data

The sample standard deviation is:

  • Positive (unless all observations are the same, when it is zero: there is no variation);
  • Best used for (approximately) symmetric data;
  • Usually quoted with the mean;
  • The most commonly-used measure of variation;
  • Measured in the same units as the data;
  • Influenced by skewness and outliers, like the mean.

Consider again the Jersey cow data (Example 13.2).

Using your calculator's Statistics Mode, find the population standard deviation and the sample standard deviation.

The population standard deviation is unknown.

The best estimate is the sample standard deviation: \(s=0.554\)%.

If you do not get this value, you may be pressing the wrong button on your calculator: Ask for help!

13.3.3 Computing the variation: IQR

The standard deviation uses the value of \(\bar{x}\), so is affected by skewness like the sample mean. Another measure of variation that is not affected by skewness is the inter-quartile range, or IQR. To understand the IQR, understanding quartiles first is important.

Definition 13.7 (Quartiles) Quartiles to describe the variation and shape of data:

  • The first quartile \(Q_1\) is a value that separates the smallest 25% of observations from the largest 75%. \(Q_1\) is like the median of the smaller half of the data: in position, it lies about halfway between the minimum value and the median.
  • The second quartile \(Q_2\) is a value that separates the smallest 50% of observations from the largest 50%. (This is the median.)
  • The third quartile \(Q_3\) is a value that separates the smallest 75% of observations from the largest 25%. \(Q_3\) is like the median of the larger half of the data: in position, it lies about halfway between the median and the maximum value.

Quartiles divide the data into four parts, each containing approximately equal numbers of observations; a boxplot is a picture of the quartiles. The inter-quartile range, or IQR, is the difference between \(Q_3\) and \(Q_1\).

The IQR measures the range of the middle 50% of the data, and is a measure of variation not influenced by outliers. The IQR is measured in the same measurement units as the data.

Definition 13.8 (IQR) The IQR is the range in which the middle 50% of the data lie; the difference between the third and the first quartiles.

Quartiles were previously discussed in the context of boxplots (Sect. 12.4.3). For example, a boxplot of the egg-krill data319 was shown in Example 12.13; the data are repeated in Table 13.5, and the boxplot in Fig. 13.6.

TABLE 13.5: The number of eggs laid by krill, for those in a treatment group and for those in a control group
Treatment group    Control group
0 18 0 18
0 21 0 21
1 26 0 26
1 30 0 30
3 35 1 35
8 48 1 48
8 50 1 50
12 2

FIGURE 13.6: A boxplot for the krill-egg data; the boxplot just for the treatment group

For the Treatment group:

  • 75% of the observations are smaller than about 28, and this is represented by the line at the top of the central box. This is \(Q_3\), or the third quartile.
  • 50% of the observations are smaller than about 12, and this is represented by the line in the centre of the central box. This is \(Q_2\), the second quartile or the median.
  • 25% of the observations are smaller than about 2, and this is represented by the line at the bottom of the central box. This is \(Q_1\), the first quartile.

The IQR is \(Q_3 - Q_1\) = \(28 - 2\), so that \(\text{IQR} = 26\). The animation below shows how the IQR is found.
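
Quartiles and the IQR can also be found with software. A minimal Python sketch, assuming the treatment-group values are read off Table 13.5 (note that software packages use slightly different quartile conventions, so answers may differ a little from values read off a boxplot):

```python
import numpy as np

# Treatment-group egg counts, as read from Table 13.5
treatment = [0, 0, 1, 1, 3, 8, 8, 12, 18, 21, 26, 30, 35, 48, 50]

q1, q2, q3 = np.percentile(treatment, [25, 50, 75])
print(q1, q2, q3)  # 2.0 12.0 28.0 (under numpy's default quartile convention)
print(q3 - q1)     # IQR = 26.0
```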

Example 13.7 (Boxplots) Consider the NHANES data.320

The boxplot for the age of respondents in the NHANES data set is as shown below. For these data:

  • No outliers are identified.
  • The oldest person is 80.
  • About 75% of the subjects are aged less than about 54 (\(Q_3\)): the third quartile \(Q_3 = 54\), the median of the largest half of the data.
  • About 50% of the subjects are aged less than about 36 (\(Q_2\), the median): the second quartile \(Q_2 = 36\), the median of the data set.
  • About 25% of the subjects are aged less than about 17 (\(Q_1\)): the first quartile \(Q_1 = 17\), the median of the smallest half of the data.
  • The youngest subject is aged 0.

Then, \(Q_3 = 54\) and \(Q_1 = 17\), so the \(\text{IQR} = Q_3 - Q_1 = 54 - 17 = 37\) years. The middle 50% of the participants have an age range of 37 years.

13.3.4 Computing the variation: Percentiles

Percentiles can also be computed, which are similar to quartiles; for example:

  • The 12th percentile is a value separating the smallest 12% of the data from the rest.
  • The 67th percentile is a value separating the smallest 67% of the data from the rest.
  • The 94th percentile is a value separating the smallest 94% of the data from the rest.

Percentiles are measured in the same measurement units as the data.

Definition 13.9 (Percentiles) The \(p\)th percentile of the data is a value separating the smallest \(p\)% of the data from the rest.

By this definition, the first quartile \(Q_1\) is also the 25th percentile, the second quartile \(Q_2\) is also the 50th percentile (and the median), and the third quartile \(Q_3\) is also the 75th percentile.

Percentiles are especially useful for very skewed data and in certain applications. For instance, scientists who monitor rainfall and stream heights, and engineers who use this information, are more interested in extreme weather events rather than the 'average' event. Engineers, for example, may design structures to withstand 1-in-100 year events (the 99th percentile) or similar, which are unusual events.

Example 13.8 (Percentiles) For the streamflow data at the Mary River (Table 13.1), the February data are highly right-skewed (Fig. 13.2):

  • The median (50th percentile) is 146.1 ML.
  • The 95th percentile is 3,480 ML.
  • The 99th percentile is 19,043 ML.

Constructing infrastructure to cope with the median streamflow is clearly silly.
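
Software computes percentiles directly. A small Python sketch on made-up, right-skewed data (purely for illustration; these are not the Mary River values):

```python
import numpy as np

rng = np.random.default_rng(0)
# 1000 made-up, right-skewed 'streamflow-like' values
flows = rng.exponential(scale=500, size=1000)

print(np.percentile(flows, 50))  # the median (50th percentile)
print(np.percentile(flows, 95))  # 95th percentile: only 5% of values exceed this
print(np.percentile(flows, 99))  # 99th percentile: only 1% of values exceed this
```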

13.3.5 Which measure of variation to use?

Which is the 'best' measure of variation for quantitative data? As with measures of location, it depends on the data.

Since the standard deviation calculation uses the mean, it is impacted in the same way as the mean by outliers and skewness, so the standard deviation is best used with approximately symmetric data. The IQR is best used when data are skewed or asymmetric. Sometimes, both the standard deviation and the IQR can be quoted.

13.4 Describing shape

Describing the skewness numerically is possible; however, in this book the shape will be described just using words (skewed, approximately symmetric, bimodal, etc.) as before (Sect. 12.2.4).

Example 13.9 (Skewness) The Australian Bureau of Statistics (ABS) records the age at death of Australians.

The histograms of the age of death for females and males (Fig. 13.7) show that both distributions are left skewed: Few Australians die at a very young age, and most die at an older age.


FIGURE 13.7: Histograms of age at death for Australians in 2012

13.5 Identifying outliers

Outliers are 'unusual' observations: observations quite different (larger or smaller) from the bulk of the data. Deciding whether or not an observation is 'unusual' is arbitrary, so 'rules' for identifying outliers are somewhat arbitrary too.

Definition 13.10 (Outliers) An outlier is an observation that is 'unusual' compared to the bulk of the data (either larger or smaller). Rules for identifying outliers are arbitrary.

Two rules for identifying outliers are:

  • the standard deviation rule (Sect. 13.5.2); and
  • the IQR rule (Sect. 13.5.3).

Understanding the first rule requires studying bell-shaped distributions first. Knowing which rule to use is important.

13.5.1 Bell-shaped (normal) distributions and the 68--95--99.7 rule

To begin, identifying outliers will be studied for data approximately symmetrically distributed. More specifically, symmetric distributions with a bell shape will be studied. For example, the heights of husbands in the UK321 have an approximate bell shape (Fig. 13.8, left panel). Most men are between 160 and 185cm; a few are shorter than 160cm and a few taller than 185cm. More formally, bell-shaped distributions are called normal distributions.

These data are from a sample. Of course, every sample is likely to contain different men, and every sample of men will produce a slightly different histogram.

For convenience then, histograms may be smoothed, so that the smoothing produces a shape that represents an 'average' of all these possible sample histograms (in other words, an estimate of how the heights may be distributed in the population). For example, see the animation below. The solid line represents the average of many sample histograms.

The smoothed histogram can be considered as representing 100% of the observations; after all, every husband in the sample has a height, so is represented somewhere in the histogram. When we do this, the areas under the normal curve represent theoretical percentages of the total number of observations.


FIGURE 13.8: The heights of husbands have an approximate normal distribution

The smoothed histogram represents all of the husbands' heights (that is, 100%). Using this idea, areas of the histogram can be shaded (Fig. 13.9) to represent various percentages of the husbands' heights.

For example:

  • The middle 50% of husbands (Fig. 13.9, centre panel) are between about 168 and 178cm tall.
  • The tallest 20% of husbands (Fig. 13.9, right panel) are taller than about 179cm.

FIGURE 13.9: The heights of husbands, with certain percentages shaded

Importantly, for any normal distribution, whatever the mean or standard deviation, the areas under the smoothed curve approximately follow this important rule: The 68--95--99.7 rule.

Definition 13.11 (The 68--95--99.7 Rule (or the Empirical Rule)) For any bell-shaped distribution, approximately:

  • 68% of observations lie within one standard deviation of the mean;
  • 95% of observations lie within two standard deviations of the mean;
  • 99.7% of observations lie within three standard deviations of the mean.

The 68--95--99.7 rule, or the empirical rule, is one of the most important rules we will see.

The animation below shows how the 68--95--99.7 rule works.

The percentages given in the 68--95--99.7 rule are approximate; the exact percentages are 68.27%, 95.45% and 99.73% respectively.

The 68--95--99.7 rule can be used to understand variables that have an approximate normal distribution. For example, consider the heights of husbands again (Fig. 13.10); the sample mean height is \(\bar{x} = 173.2\)cm; the sample standard deviation is \(s = 6.88\)cm.

Using the 68--95--99.7 rule, approximately 68% of the husbands would have heights between

  • \(173.2 - 6.88 = 166.3\)cm and
  • \(173.2 + 6.88 = 180.1\)cm.

(In fact, 71% of husbands in the sample are between 166.3cm and 180.1cm tall, close to the expected 68%.) Similarly, approximately 95% of the husbands would have heights between

  • \(173.2 - (2\times 6.88) = 159.4\)cm and
  • \(173.2 + (2\times 6.88) = 187.0\)cm.

For the husbands' heights, the sample mean height is \(173.2\)cm; the sample standard deviation is \(6.88\)cm.

Using the 68--95--99.7 rule, about 99.7% of the husbands are between what heights?

Three standard deviations is \(3\times6.88 = 20.64\).

So about 99.7% of husbands are between \((173.2 - 20.64) = 152.6\)cm and \((173.2 + 20.64) = 193.8\)cm.
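
These intervals can be computed directly from the sample mean and standard deviation quoted above; a short Python sketch:

```python
x_bar, s = 173.2, 6.88   # husbands' heights: sample mean and std deviation (cm)

for k in (1, 2, 3):      # within 1, 2 and 3 standard deviations of the mean
    lower, upper = x_bar - k * s, x_bar + k * s
    print(f"about {(68, 95, 99.7)[k - 1]}% between {lower:.1f} and {upper:.1f} cm")
```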

The empirical rule indicates that 99.7% of observations are within 3 standard deviations of the mean. That is, almost all observations are within three standard deviations of the mean.

This suggests a rule for identifying outliers in approximately bell-shaped distributions: any observation more than 3 standard deviations away from the mean is unusual, so may be considered an outlier. More generally, this rule is often applied to approximately symmetric distributions.

Bell-shaped (normal) distributions are studied further later (for example, Chap. 17).


FIGURE 13.10: The heights of husbands, showing the 68--95--99.7 rule in use

13.5.2 The standard deviation rule for identifying outliers

One rule for identifying outliers is based on the 68--95--99.7 rule.

Definition 13.12 (Standard deviation rule for identifying outliers) For approximately symmetric distributions, an observation more than three standard deviations from the mean may be considered an outlier.

This rule uses the mean and the standard deviation, so this rule is suitable for approximately symmetric distributions (when means and standard deviations are sensible numerical summaries to use). Although this rule is based on normal distributions, it has proved useful for many approximately-symmetric distributions.

All rules for identifying outliers are arbitrary. For example, the standard deviation rule is sometimes given slightly differently; outliers may be identified as observations more than 2.5 standard deviations away from the mean, for instance. Since all rules for identifying outliers are arbitrary, both versions are acceptable.
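
The rule is simple to apply in software. A minimal Python sketch, using the husbands'-heights summary above and a hypothetical height of 200 cm:

```python
x_bar, s = 173.2, 6.88   # husbands' heights: sample mean and std deviation (cm)

def is_outlier(value, mean, sd, threshold=3):
    """Standard deviation rule: more than `threshold` sd from the mean."""
    return abs(value - mean) > threshold * sd

print(is_outlier(200, x_bar, s))  # True: more than 3 std deviations above the mean
print(is_outlier(185, x_bar, s))  # False: within 3 std deviations of the mean
```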

13.5.3 The IQR rule for identifying outliers

Since the standard deviation rule for identifying outliers relies on the mean and standard deviation, it is not appropriate for non-symmetric distributions. Another rule is needed for identifying outliers in these situations: the IQR rule.

Definition 13.13 (IQR rule for identifying outliers) The IQR rule identifies mild and extreme outliers as:

  • Extreme outliers: observations \(3\times \text{IQR}\) more unusual than \(Q_1\) or \(Q_3\).

  • Mild outliers: observations \(1.5\times \text{IQR}\) more unusual than \(Q_1\) or \(Q_3\) (that are not also extreme outliers).

This definition is much easier to understand using an example.

Example 13.10 (IQR rule for identifying outliers) An engineering project322 studied a new building material, to estimate the average permeability.

Measurements of permeability time (the time for water to permeate the sheets) were taken from 81 pieces of material (in seconds). For these data \(Q_1 = 24.7\) and \(Q_3 = 50.6\), so \(\text{IQR} = 50.6 - 24.7 = 25.9\). Then, extreme outliers are observations \(3\times 25.9 = 77.7\) more unusual than \(Q_1\) or \(Q_3\). That is, extreme outliers are observations:

  • more unusual than \(24.7 - 77.7 = -53.0\) (that is, less than \(-53\)); or
  • more unusual than \(50.6 + 77.7 = 128.3\) (that is, greater than \(128.3\)).

Mild outliers are observations \(1.5\times 25.9 = 38.9\) more unusual than \(Q_1\) or \(Q_3\) (that are not also extreme outliers). That is, mild outliers are observations:

  • more unusual than \(24.7 - 38.9 = -14.2\) (that is, less than \(-14.2\)); or
  • more unusual than \(50.6 + 38.9 = 89.5\) (that is, greater than \(89.5\)).
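
These fences can be computed directly; a brief Python sketch using the quartiles quoted above:

```python
q1, q3 = 24.7, 50.6          # quartiles of the permeability times (seconds)
iqr = q3 - q1                # 25.9

extreme_low, extreme_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr  # extreme-outlier fences
mild_low, mild_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr        # mild-outlier fences

print(round(extreme_low, 2), round(extreme_high, 2))  # -53.0 128.3
print(round(mild_low, 2), round(mild_high, 2))        # -14.15 89.45 (the text rounds 1.5 x IQR to 38.9)
```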

The outliers are identified when constructing a boxplot: the 'whiskers' extend to the most extreme observations remaining after excluding mild and extreme outliers; then, mild outliers are shown using a \(\circ\), and extreme outliers are shown using a \(\star\).

You don't need to do this (that's what software is for), but you do need to understand what the software is doing. Construction of the boxplot is shown in the animation below.

13.5.4 When to use which rule?

In summary, two common ways to identify outliers are:

  • For approximately symmetric distributions: use the standard deviation rule.
  • For any distribution, but primarily for those skewed or with outliers: use the IQR rule.

But remember: All rules for identifying outliers are arbitrary!

Example 13.11 (Boxplots and histograms) For the permeability data,323 compare the boxplot and histogram (Fig. 13.11).

Can you see how the boxplot identifies the observations in the histogram that seem to be outliers?


FIGURE 13.11: A boxplot and histogram for the permeability data

In an American study,324 the lung capacity (FEV) of youth aged 3 to 19 was measured.

The data are slightly skewed right, and the average FEV is about 2.6 litres. The FEV varies from about 0.8 to 5.8 litres, with no outliers.

Using this information, sketch the boxplot and the histogram for the data.

13.5.5 What to do with outliers?

What should you do if any observations in your data are identified as outliers?

In general, deleting or removing outliers simply because they are outliers is a very bad idea. After all, an outlier was obtained from your study just like every other observation, and deserves to be in your data as much as any other observation.

In addition, the rules for identifying outliers are all arbitrary rules.

However, there are some exceptions (see Dunn and Smyth325, p. 138):

  • Mistakes: If the observation is clearly wrong (e.g., someone gives their age as 222), you can either fix it if possible (the person may be 22 years old, not 222), or you can delete it. Fixing is often not possible.

    Similarly, if the observation can be traced to an error or mistake in the data collection (e.g., twice as much fertiliser was applied as intended), it can be deleted.

  • A different population is represented: If the observation comes from a different population than the other observations, it may be removed from the analysis.

    For example, consider a study of how often students exercise. One observation is found to be an outlier; closer inspection finds that this student is aged 65, while the next oldest student in the data is aged 44. In this case, the outlier can be removed from the analysis, as it comes from a different population ("students aged over 45") than the other observations ("students aged under 45"). The remaining observations can be analysed, on the understanding that the results only apply to students aged under 45 (which should be clearly communicated).

  • Unknown reason: Sometimes no obvious reason can explain the outlier. In these situations, the solution is unclear.

    Discarding the outliers routinely is not recommended, as they are probably real observations that are just as valid as all the others. Perhaps a different analysis is necessary (for example, based on using medians rather than means).

In all cases, whenever observations are removed from a dataset, this should be clearly explained and documented.

Example 13.12 (Outliers) The Mary River dataset used in Sect. 13.2 has many extremely large outliers, but each of them is plausible: they probably correspond to flooding events.

It would be silly to remove these from the analysis.

Example 13.13 (Outliers) The permeability data shown in Fig. 13.11 have some extremely large outliers. None of these seems unreasonable.

It would be silly to remove these from the analysis.

13.6 Compiling tables of numerical summary information

Here are some tips for compiling tables of numerical summary information:

  • Round numbers appropriately (don't necessarily use all decimals provided by software).
  • Place captions above tables.
  • In general, use no vertical lines and very few horizontal lines.
  • Align numbers in the table by decimal point when possible, for easier reading.
  • Ensure the table allows readers to easily make the important comparisons.

Example 13.14 (Tables for summarising data) Consider a study326 assessing the effects of probiotic and conventional yoghurt on blood glucose and antioxidant status in Type 2 diabetic patients. A randomised controlled trial (i.e., an experiment) collected data from 60 patients.

Compare the two numerical summary tables in Tables 13.6 and 13.7: Table 13.6 makes comparing the two groups easier, but Table 13.7 is the more conventional orientation (for practical purposes: fewer columns).

TABLE 13.6: Baseline characteristics of study participants. A superscript a indicates the data are summarised using means \(\pm\) standard deviation; a superscript b indicates the data are summarised using medians \(\pm\) IQR
Yoghurt Age\(^a\) Weight (kg)\(^a\) BMI (kg/m\(^2\))\(^a\) Metformin/d\(^b\) Glibenclamide/d\(^b\)
Conventional (\(n=30\)) \(51.0 \pm 7.3\) \(75.42 \pm 11.28\) \(29.14 \pm 4.30\) \(2 \pm 1.25\) \(1 \pm 1\)
Probiotic (\(n=30\)) \(50.9 \pm 7.7\) \(76.18 \pm 10.94\) \(28.95 \pm 3.65\) \(2 \pm 1.25\) \(2 \pm 2\)
TABLE 13.7: Baseline characteristics of study participants. A superscript a indicates the data are summarised using means \(\pm\) standard deviation; a superscript b indicates the data are summarised using medians \(\pm\) IQR
Variable Conventional yoghurt (\(n=30\)) Probiotic yoghurt (\(n=30\))
Age\(^a\) 51.00 \(\pm\) 7.32 50.87 \(\pm\) 7.68
Weight (kg)\(^a\) 75.42 \(\pm\) 11.28 76.18 \(\pm\) 10.94
BMI (kg/m\(^2\))\(^a\) 29.14 \(\pm\) 4.30 28.95 \(\pm\) 3.65
Metformin/d\(^b\) 2 \(\pm\) 1.25 2 \(\pm\) 1.25
Glibenclamide/d\(^b\) 1 \(\pm\) 1 2 \(\pm\) 2

Do you think a difference exists between the mean BMI in the two groups in the population, based on Tables 13.6 and 13.7?

Explain.

Very importantly: It is hard to make decisions about a population based on just a sample.

We cannot be sure if there is a difference between the population means or not, just based on the sample means.

The difference is not large, so it may be due to sampling variation: the two sample means will vary from sample to sample, which may explain the small difference that we see in the sample.

13.7 Observing relationships: The NHANES study

In Sect. 12.9, the NHANES data were introduced (Center for Disease Control and Prevention (CDC)327; Center for Disease Control and Prevention328; Pruim329), and graphs were used to understand the data relevant to answering this RQ:

Among Americans, is the mean direct HDL cholesterol different for current smokers and non-smokers?

Using the software output (jamovi: Fig. 13.12; SPSS: Fig. 13.13), the direct HDL cholesterol can be summarised numerically:

  • Average value:
    • Sample mean: \(\bar{x} = 1.36\)mmol/L.
    • Sample median: \(1.29\)mmol/L.
  • Variation:
    • Sample standard deviation: \(s=0.399\)mmol/L.
    • Sample IQR: \(0.49\)mmol/L.
  • Shape: Slightly skewed right (from Fig. 13.1 or 12.39).
  • Outliers: SPSS identified some outliers (Fig. 12.39), mostly unusually large values.

FIGURE 13.12: jamovi output for direct HDL cholesterol


FIGURE 13.13: SPSS output for direct HDL cholesterol

The RQ is about comparing the mean direct HDL cholesterol in the two smoking groups, so compiling a table of summaries for each group is useful, using different output (jamovi: Fig. 13.14; SPSS: Fig. 13.15). Table 13.8 shows the numerical summaries of direct HDL cholesterol for each group.


FIGURE 13.14: jamovi output for direct HDL cholesterol, by current smoking status


FIGURE 13.15: SPSS output for direct HDL cholesterol, by current smoking status

TABLE 13.8: Summarising quantitative data
Group Sample size Mean Median Std. dev. IQR
All participants: 8474 1.36 1.29 0.399 0.49
Smokers: 1388 1.31 1.24 0.424 0.52
Non-smokers: 1668 1.39 1.32 0.428 0.54
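
A table like Table 13.8 can be produced with most statistical software. A Python (pandas) sketch is shown below; it assumes the NHANES data are stored in a file called nhanes.csv with columns named DirectChol and SmokeNow (the file name and column names are assumptions, and may differ in your copy of the data):

```python
import pandas as pd

nhanes = pd.read_csv("nhanes.csv")   # hypothetical file holding the NHANES data

summary = nhanes.groupby("SmokeNow")["DirectChol"].agg(
    n="count",
    mean="mean",
    median="median",
    std="std",
    iqr=lambda x: x.quantile(0.75) - x.quantile(0.25),
)
print(summary.round(2))
```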

Notice that information about current smoking status is not available for every person in the study. This could impact the results, especially if those who provided smoking information and those who did not differ regarding direct HDL.

The RQ, as usual, asks about the population. The RQ cannot be answered with certainty, only using a sample, since every sample is likely to be different.

Clearly, the sample means are different, but the RQ asks if the population means are different. Broadly, two possible reasons could explain why the sample mean direct HDL cholesterol is different for current smokers and non-smokers:

  • The population means are the same, but the sample means are different simply because of the people who ended up in the sample. Another sample, with different people, might produce different sample means. Sampling variation explains the difference in the sample means.

  • The population means are different, and the difference between the sample means simply reflects this difference between the population means.

The difficulty, of course, is knowing which of these two reasons ('hypotheses') is the most likely reason for the difference between the sample means. This question is of prime importance (after all, it answers the RQ), and is addressed at length later in this book.

13.8 Summary

Quantitative data can be summarised numerically, and the most common techniques are indicated in Table 13.9.

The mean and standard deviation are usually used whenever possible, for practical and mathematical reasons. Sometimes quoting both the mean and median (and the standard deviation and IQR) may be appropriate.

The following short video may help explain some of these concepts:

TABLE 13.9: Summarising quantitative data
For distributions that are:
Feature: Approximately symmetric Not symmetric, or outliers
Average: Mean Median
Variation: Standard deviation IQR
Shape: Verbal description only Verbal description only
Outliers: Standard deviation rule IQR rule

13.9 Quick review questions

A study of fulmars (a type of seabird)330 explored the metabolic rate of the birds. The masses of the female birds (in grams) were: 635; 635; 668; 640; 645; 635

  1. From your calculator, the sample mean is
  2. From your calculator, the sample standard deviation is
  3. The sample median is
  4. The population standard deviation is

Remember that population values are not known, and are estimated by the sample values!


  1. A study of fatalities on amusement rides in the US331 recorded these numbers of fatalities each year from 1994 to 2003:
1994 1995 1996 1997 1998 1999 2000 2001 2002 2003
2 4 3 4 7 6 1 3 2 5
  • What is the mean number of fatalities per year over this period?
  • What is the median number of fatalities per year over this period?
  • What is the standard deviation of the number of fatalities per year over this period?

  1. Which of the following statements are true?

    • The IQR measures the amount of variability in a set of data
    • The mean and the median can both be called the "average"
    • The mean and the median are not always the same value
    • The range is a measure of variability in a set of data (but it is usually too simple to be useful)
    • The standard deviation measures the amount of variability in a set of data
    • Another name for the median is \(Q_2\)
    • \(Q_3\) is the median of the largest half of the data
    • The IQR is a useful measure of the amount of variation in data that are skewed
    • The IQR is the difference between the first and second quartiles


13.10 Exercises

Selected answers are available in Sect. D.13.

Exercise 13.1 The histogram of the direct HDL cholesterol from the NHANES study is shown in Fig. 13.1. Should the mean or median be used to measure location?

Exercise 13.2 The average monthly SOI values in August from 1995 to 2000 are shown in Table 13.10. Use your calculator (where possible) to calculate the:

  1. sample mean
  2. sample median.
  3. range.
  4. sample standard deviation.
TABLE 13.10: The average monthly SOI values in August from 1995 to 2000
Year Monthly average SOI
1995 0.8
1996 4.6
1997 -19.8
1998 9.8
1999 2.1
2000 5.3

Exercise 13.3 The activity below contains histograms and boxplots.

  1. Match the histogram with the corresponding boxplot.

  2. For which data sets would the mean and standard deviation be the appropriate numerical summary?

    For which data sets would the median and IQR be the appropriate numerical summary?

Exercise 13.4 A study of the productivity of construction workers332 recorded, among other things, the rate at which concrete panels could be installed by workers.

Data for three different female workers in the study are shown in Table 13.11. Construct the boxplot comparing the three workers. What does it tell you?

TABLE 13.11: The productivity of three female workers installing concrete panels (in panels per minute)
Worker 1 Worker 2 Worker 3
Mean 1.24 1.73 1.36
Minimum 0.59 1.13 0.86
1st quartile 0.88 1.51 1.16
Median 1.35 1.70 1.38
3rd quartile 1.49 1.91 1.58
Maximum 1.88 3.00 2.17
Range 1.28 1.87 1.31

Exercise 13.5 An article examined patients who had been admitted for thoracic surgical procedures at Castle Hill Hospital, looking for the presence of microplastics.333 The total number of microplastics found in the lungs of each patient is shown below:

Patient Number of microplastics Patient Number of microplastics
1 8 7 1
2 3 8 7
3 5 9 5
4 2 10 1
5 0 11 0
6 2

For these patients:

  1. What is the mean number of microplastics found?
  2. What is the median number of microplastics found?
  3. What is the standard deviation of the number of microplastics found?
  4. What is the IQR of the number of microplastics found?