13 Numerical summaries: quantitative data
So far, you have learnt to ask a RQ, identify different ways of obtaining data, design the study, collect the data, describe the data, and graphically summarise data.
In this chapter, you will learn to numerically describe quantitative data. Both quantitative and qualitative data are described numerically in quantitative research. You will learn to:
 numerically summarise quantitative data using the appropriate statistics.
 describe quantitative data by average, variation, shape and unusual features.
13.1 Introduction
In the last chapter (Sect. 12.9), this RQ was posed:
Among Americans, is the average direct HDL cholesterol different for current smokers and nonsmokers?
Graphs were used to understand the data in Sect. 12.9, and the information contained in the graphs was described in words. In some cases, the features of the data displayed in the graph can be described numerically. That is the purpose of this chapter: to learn how to summarise quantitative data numerically.
Example 13.1 (Describing quantitative data) For the RQ above, understanding the response variable (direct HDL cholesterol values) is important; a histogram is useful (Fig. 13.1).
What does the histogram tell us?
 Average: The average value is about 1.5 mmol/L.
 Variation: The values range from about 0.5 to 3 mmol/L, but with some larger values (that are hard to see on the histogram).
 Shape: The distribution is slightly skewed right.
 Outliers: Some large outliers are present (that are hard to see on the histogram).
Describing some of these features more precisely, with numbers, can be helpful.
A number that describes a feature of a population is called a parameter. The values of parameters are usually unknown.
In contrast, a number that describes a feature of a sample is called a statistic. That is:
 Samples are numerically described by statistics;
 Populations are numerically described by parameters.
Definition 13.1 (Parameter) A parameter is a number describing some feature of a population.
Definition 13.2 (Statistic) A statistic is a number describing some feature of a sample (to estimate a population parameter).
The RQ identifies the population, but in practice a sample is studied. Statistics are estimates of parameters, and the value of the statistic is not the same for every possible sample.
13.2 Computing the average value
The average (or location, or centre, or typical value) for quantitative sample data can be described in many ways; the two most common ways are:
 the sample mean (or sample arithmetic mean), which estimates the population mean; and
 the sample median, which estimates the population median.
In both cases, the population parameter is estimated by a sample statistic. Understanding whether to use the mean or median is important.
The word 'average' can refer to either mean or median (or other measures of centre too). Use the precise terms 'mean' or 'median', rather than 'average', when necessary!
Consider the daily river flow volume (called 'streamflow') at the Mary River from 01 October 1959 to 17 January 2019, summarised by month in Table 13.1 (from Queensland DNRM).
The 'average' daily streamflow in February could be quoted using either the mean or the median; but the two give very different values for the 'average':
 the mean daily flow is 1123.2ML.
 the median daily flow is 146.1ML.
These two common ways of measuring the same thing (the 'average' daily streamflow in February) give very different answers. Why? Which is the best 'average' to use?
To decide, both measures of average will need to be studied.
Table 13.1: Mean and median daily streamflow (in ML) at the Mary River, by month
Month  Mean (ML)  Median (ML)
Jan  849.3  71.3 
Feb  1123.2  146.1 
Mar  793.9  194.9 
Apr  622.5  141.7 
May  348.4  118.4 
Jun  378.7  83.6 
Jul  259.3  68.8 
Aug  108.6  55.5 
Sep  100.9  48.0 
Oct  151.2  37.6 
Nov  186.6  45.3 
Dec  330.8  64.1 
13.2.1 Computing the average: The mean
The mean of the population is denoted by \(\mu\), and its value is almost always unknown.
Instead, the mean of the population is estimated by the mean of the sample, which is denoted by \(\bar{x}\) (an \(x\) with a line above it). In this context, the unknown parameter is \(\mu\), and the statistic is \(\bar{x}\). The sample mean is used to estimate the population mean.
The Greek letter \(\mu\) is pronounced 'myoo', as in music.
The symbol \(\bar{x}\) is pronounced 'exbar'.
Example 13.2 (A small data set to work with) To demonstrate ideas, consider a small data set for answering this descriptive RQ:
For mature Jersey cows, what is the average percentage butterfat in their milk?
The population is 'milk from Jersey cows', and an estimate of the population mean percentage butterfat is sought. The population mean is denoted by \(\mu\).
Clearly, milk from every Jersey cow cannot be studied; a sample is studied.^{315} The unknown population mean is estimated using the sample mean (\(\bar{x}\)). Measurements were taken from milk from 10 cows, in percentages (Table 13.2).
4.8  5.2  5.2  5.4  5.2 
6.5  4.5  5.7  4.8  5.2 
The sample mean is what people usually think of as the 'average'. The sample mean is actually the 'balance point' of the observations. The animation below shows how the mean acts as the balance point.
Alternatively, the mean is the value such that the positive and negative distances of the observations from the mean add to zero, as shown in the animation below. Both of these explanations seem reasonable for identifying the 'average' of the data.
Definition 13.3 (Mean) The mean is one way to measure the 'average' value of quantitative data. The arithmetic mean can be considered as the 'balance point' of the data, or the value such that the positive and negative distances from the mean add to zero.
To find the value of the sample mean:
 Add (shown using the symbol \(\sum\)) all the observations (denoted by \(x\)); then
 Divide by the number of observations (denoted by \(n\)).
In symbols: \[ \bar{x} = \frac{\sum x}{n}. \] This means to add up (indicated by \(\sum\)) the observations (denoted by \(x\)), then divide by the size of the sample (denoted by \(n\)).
Example 13.3 (Computing a sample mean) For the Jersey cow data (Example 13.2), an estimate of the population mean percentage butterfat is found using the sample information: sum all \(n=10\) observations and divide by \(n\): \[\begin{align*} \overline{x} &= \frac{\sum x}{n} = \frac{4.8 + 6.5 + \cdots + 5.2}{10}\\ &= \frac{52.5}{10} = 5.25. \end{align*}\] The sample mean, the best estimate of the population mean, is 5.25 percent.
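The calculation in Example 13.3 can also be sketched in a few lines of Python. This is only an illustration of the formula \(\bar{x} = \sum x / n\) using the butterfat data of Table 13.2; the variable names (`butterfat`, `total`, `x_bar`) are chosen here for illustration and are not part of the original study.

```python
# Butterfat percentages for the 10 Jersey cows (Table 13.2)
butterfat = [4.8, 5.2, 5.2, 5.4, 5.2, 6.5, 4.5, 5.7, 4.8, 5.2]

n = len(butterfat)        # the sample size, n
total = sum(butterfat)    # add all the observations (the "sum of x")
x_bar = total / n         # divide by n to obtain the sample mean

print(f"n = {n}, sum = {total:.1f}, sample mean = {x_bar:.2f}")
```

The printed sum and mean should agree with the values 52.5 and 5.25 found by hand in Example 13.3.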
Usually, software (such as jamovi or SPSS) or a calculator (in Statistics Mode) will be used to compute the sample mean. However, knowing how these quantities are computed is important.
For the butterfat data (Table 13.2), what is the value of \(\mu\), the population mean?
We do not know!
We know the value of the sample mean, but not the population mean. We only have an estimate of the value of the population mean by using the sample mean.
(If we already knew the value of the population mean, why would we estimate the value from an imperfect sample?)
A study of eyes^{316} aimed to estimate the average corneal thickness of eyes affected by glaucoma. The collected data (in microns) are shown in Table 13.3.
Estimate the population mean corneal thickness.
The estimate of \(\mu\) is \(\bar{x} = 459\) microns.
484  492  436  464 
478  444  398  476 
Software and calculators often produce numerical answers to many decimal places, some of which may not be meaningful or useful. A useful rule of thumb is to round to one or two more significant figures than the original data.
For example, the butterfat data are given to one decimal place. The sample mean can be given to two decimal places: \(\bar{x} = 5.25\)%.
13.2.2 Computing the average: The median
The median is a value separating the larger half of the data from the smaller half of the data. In a data set with \(n\) values, the median is ordered observation number \(\displaystyle \frac{n + 1}{2}\).
The median is:
 not equal to \(\displaystyle \frac{n+1}{2}\).
 not halfway between the minimum and maximum values in the data.
Most calculators cannot find the median.
The median has no commonly-used symbol.
Definition 13.4 (Median) The median is one way to measure the 'average' value of some data. The median is a value such that half the values are larger than the median, and half the values are smaller than the median.
Example 13.4 (Find a sample median) To find the sample median for the Jersey cow data (Example 13.2), first arrange the data in numerical order (Table 13.4).
The median separates the larger 5 numbers from the smaller 5 numbers. With \(n=10\) observations, the median is the ordered observation located between the fifth and sixth observations (i.e., at position \((10 + 1)/2 = 5.5\); the median itself is not 5.5).
So the sample median is between \(5.2\) (ordered observation five) and \(5.2\) (ordered observation six): the sample median is \(5.20\) percent.
4.5  4.8  4.8  5.2  5.2 
5.2  5.2  5.4  5.7  6.5 
For the butterfat data (Table 13.2), what is the population median?
We do not know!
We know the value of the sample median, but not the population median. We only have an estimate of the value of the population median.
A study of eyes^{317} aimed to estimate the average corneal thickness of eyes affected by glaucoma.
Using the collected data (Table 13.3), estimate the population median corneal thickness. What is the population median?
With \(n = 8\) observations, the median is ordered observation number \((8 + 1)/2 = 4.5\), halfway between ordered observation numbers 4 and 5.
After sorting into increasing order, the two middle numbers (the 4th and 5th) are 464 and 476. The median could be any number between 464 and 476, but the usual answer would be that the median is \((464 + 476)/2 = 470\).
The sample median is 470 microns; the value of the population median remains unknown.
To clarify:
 If the sample size \(n\) is odd, the median is the middle number when the observations are ordered.
 If the sample size \(n\) is even (such as the glaucoma example), the median is halfway between the two middle numbers, when the observations are ordered.
Some software uses different rules when \(n\) is even.
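The odd/even rule above can be sketched as a short Python function. This is a minimal illustration of the rule as stated in the text (the function name `sample_median` is chosen here for illustration); it is applied to the butterfat data (Table 13.2) and the corneal thickness data (Table 13.3).

```python
def sample_median(values):
    """Median by the rule in the text: the middle ordered value when n is odd,
    or halfway between the two middle ordered values when n is even."""
    ordered = sorted(values)          # arrange the data in numerical order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # n odd: a single middle observation
        return ordered[middle]
    # n even: average the two middle observations
    return (ordered[middle - 1] + ordered[middle]) / 2


butterfat = [4.8, 5.2, 5.2, 5.4, 5.2, 6.5, 4.5, 5.7, 4.8, 5.2]   # n = 10
cornea = [484, 492, 436, 464, 478, 444, 398, 476]                # n = 8

butterfat_median = sample_median(butterfat)
cornea_median = sample_median(cornea)
print(butterfat_median, cornea_median)
```

Both samples have an even number of observations, so in each case the function averages the two middle ordered values, as in Example 13.4 and the glaucoma example.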
13.2.3 Which average to use?
Consider again estimating the average daily streamflow at the Mary River (Bellbird Creek) during February (Table 13.1): The mean daily streamflow is 1123.2ML, and the median daily streamflow is 146.1ML. Which is the 'best' average to use?
A dot chart of the daily stream flow (Fig. 13.2) shows that the data are very highly right-skewed, with many very large outliers: the maximum value is 156586.4ML, more than one hundred times larger than the mean of 1123.2ML.
In fact, about 86% of the observations are less than the mean. In contrast, about 50% of the values are less than the median (by definition). For these data, the mean is hardly a central value...
The streamflow data are very highly skewed (to the right), which is important and relevant:
 Means are best used for approximately symmetric data: the mean is influenced by outliers and skewness.
 Medians are best used for data that are skewed or contain outliers: the median is not influenced by outliers and skewness.
Means tend to be too large if the data contains large outliers or severe right skewness, and too small if the data contains small outliers or severe left skewness.
For the Mary River data, the large outliers (and the fact that they are so extreme and abundant) result in the mean being substantially influenced by the outliers, which explains why the mean is much larger than the median. The median is the better measure of average for these data.
The mean is generally used if possible (for practical and mathematical reasons), and is the most commonly-used measure of location. However, the mean is influenced by outliers and skewness; the median is not influenced by outliers and skewness. The mean and median are similar in approximately symmetric distributions. Sometimes, quoting both the mean and the median may be appropriate.
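The different sensitivity of the mean and median to outliers can be illustrated with a small sketch. The code below takes the butterfat data (Table 13.2) and appends one extra value of 52.0; this extra value is purely hypothetical (imagine a misplaced decimal point) and is not part of any data set in this chapter.

```python
import statistics

# Butterfat percentages for the 10 Jersey cows (Table 13.2)
butterfat = [4.8, 5.2, 5.2, 5.4, 5.2, 6.5, 4.5, 5.7, 4.8, 5.2]

# HYPOTHETICAL: one extreme value added only to illustrate the idea
with_outlier = butterfat + [52.0]

mean_before = statistics.mean(butterfat)
median_before = statistics.median(butterfat)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"Without outlier: mean = {mean_before:.2f}, median = {median_before:.2f}")
print(f"With outlier:    mean = {mean_after:.2f}, median = {median_after:.2f}")
```

In this sketch, the single large value pulls the mean well above almost every observation, while the median is unchanged.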
An engineering study^{318} was studying a new building material to determine the average permeability time.
The time (in seconds) taken for water to permeate \(n = 81\) pieces of material was recorded. Using a histogram of the data (Fig. 13.3), estimate the value of the population mean and median.
Which would be best to use (for example, to quote an average permeability time on a specification sheet)?
The data are skewed, which suggests using the median.
In practice, we would probably need a larger sample anyway before giving a value to use on a specification sheet.
13.3 Computing the variation
For quantitative data, the amount of variation in the bulk of the data should be described. Many ways exist to measure the variation in a data set, including:
 The range: very simple and simplistic, so not often used.
 The standard deviation: commonly used.
 The interquartile range (or IQR): commonly used.
 Percentiles: sometimes used.
As always, a value computed from the sample (the statistic) estimates the unknown value in the population (the parameter). Knowing which measure of variation to use is important.
13.3.1 Computing the variation: Range
The range is the simplest measure of variation.
Definition 13.5 (Range) The range is the maximum value minus the minimum value.
The range is not often used, because only the two extreme observations are used, so it is highly influenced by outliers. Sometimes, the range may be given by stating both the maximum and the minimum value in the data instead of giving the difference between the maximum and the minimum values. The range is measured in the same measurement units as the data.
Example 13.5 (The range) For the Jersey cow data (Example 13.2), the range is:
\[
\text{Range} = \overbrace{6.5}^{\text{largest}} - \overbrace{4.5}^{\text{smallest}} = 2.0 \text{ percent}.
\]
So the sample median percentage butterfat is 5.20 percent, with a range of 2.00 percent.
13.3.2 Computing the variation: Standard deviation
The population standard deviation is denoted by \(\sigma\) ('sigma', the parameter) and is estimated by the sample standard deviation \(s\) (the statistic).
The standard deviation is the most commonly-used measure of variation, but is complicated to compute manually (but you don't need to do it manually!). The standard deviation is (roughly) the mean distance that the observations are away from the mean. This seems like a reasonable way to measure the amount of variation in some data.
The Greek letter \(\sigma\) ('sigma') is pronounced as expected: 'sigma'.
The sample standard deviation \(s\) is mostly found using computer software (e.g., jamovi or SPSS) or a calculator (in Statistics Mode).
Definition 13.6 (Standard deviation) The standard deviation is, approximately, the average distance that observations are away from the mean.
You do not have to use the formula to calculate \(s\), but we will demonstrate for those who might find it useful to understand exactly what \(s\) calculates. The formula is:
\[ s = \sqrt{ \frac{\sum(x - \bar{x})^2}{n - 1} }, \] where \(\bar{x}\) is the sample mean, \(x\) represents the data values, and \(n\) is the sample size. To use the formula, follow these steps:
 Calculate the sample mean: \(\overline{x}\);
 Calculate the deviations of each observation \(x\) from the mean: \(x - \bar{x}\);
 Square these deviations (to make them all positive values): \((x - \bar{x})^2\);
 Add these values: \(\sum(x - \bar{x})^2\);
 Divide the answer by \(n - 1\);
 Take the (positive) square root of the answer.
You do not need to use the formula! You must know how to use software or a calculator to find the standard deviation.
Example 13.6 (Standard deviation) For the Jersey cow data (Example 13.2), the deviations of each observation from the mean of \(5.25\) can be found (Fig. 13.4). Then follow the steps outlined. You don't have to do this manually! From Fig. 13.4, the sum of the squared distances is 2.7650. Then, the sample standard deviation is:
\[ s = \sqrt{\frac{2.765}{10 - 1}} = \sqrt{ 0.3072222} = 0.5542763. \] The sample mean percentage butterfat is 5.25 percent, with a sample standard deviation of 0.554 percent.
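The six steps for computing \(s\) can be sketched in Python, one line per step. This is only an illustration of the formula using the butterfat data (Table 13.2); in practice, software computes \(s\) directly. The variable names are chosen here for illustration.

```python
import math

# Butterfat percentages for the 10 Jersey cows (Table 13.2)
butterfat = [4.8, 5.2, 5.2, 5.4, 5.2, 6.5, 4.5, 5.7, 4.8, 5.2]
n = len(butterfat)

x_bar = sum(butterfat) / n                    # Step 1: the sample mean
deviations = [x - x_bar for x in butterfat]   # Step 2: deviations from the mean
squared = [d ** 2 for d in deviations]        # Step 3: square the deviations
sum_sq = sum(squared)                         # Step 4: add the squared deviations
variance = sum_sq / (n - 1)                   # Step 5: divide by n - 1
s = math.sqrt(variance)                       # Step 6: take the positive square root

print(f"Sum of squared deviations: {sum_sq:.4f}")
print(f"Sample standard deviation: {s:.4f}")
```

The deviations in Step 2 add to zero (apart from tiny rounding error), which is why they are squared before being added.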
The standard deviation for Dataset A in Fig. 13.5 is 2.00.
What do you estimate the standard deviation of Dataset B will be: smaller than 2.00 or greater than 2.00? Why?
The standard deviation is a bit like the average distance that observations are from the mean.
In Dataset B, there seem to be many more observations closer to the mean, so the average distance would be a smaller number.
This suggests that the standard deviation for Dataset B will be smaller than the standard deviation for Dataset A.
The sample standard deviation is:
 Positive (unless all observations are the same, when it is zero: there is no variation);
 Best used for (approximately) symmetric data;
 Usually quoted with the mean;
 The most commonly-used measure of variation;
 Measured in the same units as the data;
 Influenced by skewness and outliers, like the mean.
Consider again the Jersey cow data (Example 13.2).
Using your calculator's Statistics Mode, find the population standard deviation and the sample standard deviation.
The population standard deviation is unknown.
The best estimate is the sample standard deviation: \(s=0.554\)%.
If you do not get this value, you may be pressing the wrong button on your calculator: Ask for help!
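One common reason for getting a different value is that many calculators have two standard deviation buttons: one divides by \(n - 1\) (the sample standard deviation \(s\) used in this chapter), and the other divides by \(n\). The sketch below uses Python's standard `statistics` module to show both calculations for the butterfat data; which button corresponds to which calculation depends on the calculator, so check its manual.

```python
import statistics

# Butterfat percentages for the 10 Jersey cows (Table 13.2)
butterfat = [4.8, 5.2, 5.2, 5.4, 5.2, 6.5, 4.5, 5.7, 4.8, 5.2]

# Divides by n - 1: the sample standard deviation s (the value wanted here)
divide_by_n_minus_1 = statistics.stdev(butterfat)

# Divides by n: treats the 10 values as if they were the whole population
divide_by_n = statistics.pstdev(butterfat)

print(f"Dividing by n - 1: {divide_by_n_minus_1:.3f}")
print(f"Dividing by n:     {divide_by_n:.3f}")
```

If your calculator gives the smaller of the two values, it is probably dividing by \(n\) rather than \(n - 1\).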
13.3.3 Computing the variation: IQR
The standard deviation uses the value of \(\bar{x}\), so is affected by skewness like the sample mean. Another measure of variation that is not affected by skewness is the interquartile range, or IQR. To understand the IQR, understanding quartiles first is important.
Definition 13.7 (Quartiles) Quartiles describe the variation and shape of data:
 The first quartile \(Q_1\) is a value that separates the smallest 25% of observations from the largest 75%. The \(Q_1\) is like the median of the smaller half of the data, halfway between the minimum value and the median.
 The second quartile \(Q_2\) is a value that separates the smallest 50% of observations from the largest 50%. (This is the median.)
 The third quartile \(Q_3\) is a value that separates the smallest 75% of observations from the largest 25%. The \(Q_3\) is like the median of the larger half of the data, halfway between the median and the maximum value.
Quartiles divide the data into four parts of approximately equal numbers of observations, and a boxplot is a picture of the quartiles. The interquartile range, or the IQR, is the difference between \(Q_3\) and \(Q_1\).
The IQR measures the range of the middle 50% of the data, and is a measure of variation not influenced by outliers. The IQR is measured in the same measurement units as the data.
Definition 13.8 (IQR) The IQR is the range in which the middle 50% of the data lie; the difference between the third and the first quartiles.
Quartiles were previously discussed in the context of boxplots (Sect. 12.4.3). For example, a boxplot of the egg-krill data^{319} was shown in Example 12.13; the data are repeated in Table 13.5, and the boxplot in Fig. 13.6.
0  18  0  18 
0  21  0  21 
1  26  0  26 
1  30  0  30 
3  35  1  35 
8  48  1  48 
8  50  1  50 
12  2 
For the Treatment group:
 75% of the observations are smaller than about 28, and this is represented by the line at the top of the central box. This is \(Q_3\), or the third quartile.
 50% of the observations are smaller than about 12, and this is represented by the line in the centre of the central box. This is \(Q_2\), the second quartile or the median.
 25% of the observations are smaller than about 2, and this is represented by the line at the bottom of the central box. This is \(Q_1\), the first quartile.
The IQR is \(Q_3 - Q_1\) = \(28 - 2\), so that \(\text{IQR} = 26\). The animation below shows how the IQR is found.
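The quartiles read from the boxplot can be checked with a short sketch. The code assumes that the first two columns of Table 13.5 hold the 15 Treatment-group values (an assumption about the table layout). Different software uses slightly different rules for quartiles; the `method="inclusive"` option of Python's `statistics.quantiles` is used here simply because it reproduces the values read from the boxplot, not because it is the rule used to draw Fig. 13.6.

```python
import statistics

# ASSUMPTION: Treatment-group values, read from the first two columns of Table 13.5
treatment = [0, 0, 1, 1, 3, 8, 8, 12, 18, 21, 26, 30, 35, 48, 50]

# n=4 asks for the three cut points that split the data into four parts
q1, q2, q3 = statistics.quantiles(treatment, n=4, method="inclusive")
iqr = q3 - q1

print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}")
print(f"IQR = Q3 - Q1 = {iqr}")
```

Other quartile rules (for example, the default `method="exclusive"`) can give slightly different values for \(Q_1\) and \(Q_3\), which is why the text describes the quartiles as 'about' 2 and 'about' 28.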
Example 13.7 (Boxplots) Consider the NHANES data.^{320}
The boxplot for the age of respondents in the NHANES data set is as shown below. For these data:
 No outliers are identified.
 The oldest person is 80.
 About 75% of the subjects are aged less than about 54 (\(Q_3\)): the third quartile \(Q_3 = 54\), the median of the largest half of the data.
 About 50% of the subjects are aged less than about 36 (\(Q_2\), the median): the second quartile \(Q_2 = 36\), the median of the data set.
 About 25% of the subjects are aged less than about 17 (\(Q_1\)): the first quartile \(Q_1 = 17\), the median of the smallest half of the data.
 The youngest subject is aged 0.
Then, \(Q_3 = 54\) and \(Q_1 = 17\), so the \(\text{IQR} = Q_3 - Q_1 = 54 - 17 = 37\) years. The middle 50% of the participants have an age range of 37 years.
13.3.4 Computing the variation: Percentiles
Percentiles, which are similar to quartiles, can also be computed; for example:
 The 12th percentile is a value separating the smallest 12% of the data from the rest.
 The 67th percentile is a value separating the smallest 67% of the data from the rest.
 The 94th percentile is a value separating the smallest 94% of the data from the rest.
Percentiles are measured in the same measurement units as the data.
Definition 13.9 (Percentiles) The \(p\)th percentile of the data is a value separating the smallest \(p\)% of the data from the rest.
By this definition, the first quartile \(Q_1\) is also the 25th percentile, the second quartile \(Q_2\) is also the 50th percentile (and the median), and the third quartile \(Q_3\) is also the 75th percentile.
Percentiles are especially useful for very skewed data and in certain applications. For instance, scientists who monitor rainfall and stream heights, and engineers who use this information, are more interested in extreme weather events than in the 'average' event. Engineers, for example, may design structures to withstand 1-in-100 year events (the 99th percentile) or similar, which are unusual events.
Example 13.8 (Percentiles) For the streamflow data at the Mary River (Table 13.1), the February data is highly right-skewed (Fig. 13.2):
 The median (50th percentile) is 146.1 ML.
 The 95th percentile is 3,480 ML.
 The 99th percentile is 19,043 ML.
Constructing infrastructure to cope with the median streamflow is clearly silly.
13.3.5 Which measure of variation to use?
Which is the 'best' measure of variation for quantitative data? As with measures of location, it depends on the data.
Since the standard deviation calculation uses the mean, it is impacted in the same way as the mean by outliers and skewness, so the standard deviation is best used with approximately symmetric data. The IQR is best used when data are skewed or asymmetric. Sometimes, both the standard deviation and the IQR can be quoted.
13.4 Describing shape
Describing the skewness numerically is possible; however, in this book the shape will be described just using words (skewed, approximately symmetric, bimodal, etc.) as before (Sect. 12.2.4).
Example 13.9 (Skewness) The Australian Bureau of Statistics (ABS) records the age at death of Australians.
The histograms of the age of death for females and males (Fig. 13.7) show that both distributions are left skewed: Few Australians die at a very young age, and most die at an older age.
13.5 Identifying outliers
Outliers are 'unusual' observations: observations quite different from (larger or smaller than) the bulk of the data. Deciding whether or not an observation is 'unusual' is arbitrary, so 'rules' for identifying outliers are somewhat arbitrary too.
Definition 13.10 (Outliers) An outlier is an observation that is 'unusual' compared to the bulk of the data (either larger or smaller). Rules for identifying outliers are arbitrary.
Two rules for identifying outliers are:
 The standard deviation rule, useful when the data have an approximately symmetric distribution.
 The IQR rule, useful in other situations.
Understanding the first rule requires studying bell-shaped distributions first. Knowing which rule to use is important.
13.5.1 Bell-shaped (normal) distributions and the 68-95-99.7 rule
To begin, identifying outliers will be studied for data approximately symmetrically distributed. More specifically, symmetric distributions with a bell shape will be studied. For example, the heights of husbands in the UK^{321} have an approximate bell shape (Fig. 13.8, left panel). Most men are between 160 and 185cm; a few are shorter than 160cm and a few taller than 185cm. More formally, bell-shaped distributions are called normal distributions.
These data are from a sample. Of course, every sample is likely to contain different men, and every sample of men will produce a slightly different histogram.
For convenience then, histograms may be smoothed, so that the smoothing produces a shape that represents an 'average' of all these possible sample histograms (in other words, an estimate of how the heights may be distributed in the population). For example, see the animation below. The solid line represents the average of many sample histograms.
The smoothed histogram can be considered as representing 100% of the observations; after all, every husband in the sample has a height, so is represented somewhere in the histogram. When we do this, the areas under the normal curve are theoretical percentages of the total number.
The smoothed histogram represents all of the husbands' heights (that is, 100%). Using this idea, areas of the histogram can be shaded (Fig. 13.9) to represent various percentages of the husbands' heights.
For example:
 The middle 50% of husbands (Fig. 13.9, centre panel) are between about 168 and 178cm tall.
 The tallest 20% of husbands (Fig. 13.9, right panel) are taller than about 179cm.
Importantly, for any normal distribution, whatever the mean or standard deviation, the areas under the smoothed curve approximately follow this important rule: The 68-95-99.7 rule.
Definition 13.11 (The 68-95-99.7 Rule (or the Empirical Rule)) For any bell-shaped distribution, approximately:
 68% of observations lie within one standard deviation of the mean;
 95% of observations lie within two standard deviations of the mean;
 99.7% of observations lie within three standard deviations of the mean.
The 68-95-99.7 rule, or the empirical rule, is one of the most important rules we will see.
The animation below shows how the 68-95-99.7 rule works.
The percentages given in the 68-95-99.7 rule are approximate; more precise percentages are 68.27%, 95.45% and 99.73% respectively.
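These more precise percentages can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module (available from Python 3.8) to find the area under a standard normal curve within one, two and three standard deviations of the mean; because the rule holds for any normal distribution, a mean of 0 and standard deviation of 1 are used for convenience.

```python
from statistics import NormalDist

# Any normal distribution gives the same areas; use mean 0 and standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

# Area within k standard deviations of the mean: (area below +k) minus (area below -k)
within = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, proportion in within.items():
    print(f"Within {k} standard deviation(s) of the mean: {100 * proportion:.2f}%")
```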
The 68-95-99.7 rule can be used to understand variables that have an approximate normal distribution. For example, consider the heights of husbands again (Fig. 13.10); the sample mean height is \(\bar{x} = 173.2\)cm; the sample standard deviation is \(s = 6.88\)cm.
Using the 68-95-99.7 rule, approximately 68% of the husbands would have heights between
 \(173.2 - 6.88 = 166.3\)cm and
 \(173.2 + 6.88 = 180.1\)cm.
(In fact, 71% of husbands in the sample are between 166.3cm and 180.1cm tall, close to the expected 68%.) Similarly, approximately 95% of the husbands would have heights between
 \(173.2 - (2\times 6.88) = 159.4\)cm and
 \(173.2 + (2\times 6.88) = 187.0\)cm.
For the husbands' heights, the sample mean height is \(173.2\)cm; the sample standard deviation is \(6.88\)cm.
Using the 68-95-99.7 rule, about 99.7% of the husbands are between what heights?
Three standard deviations is \(3\times6.88 = 20.64\).
So about 99.7% of husbands are between \((173.2 - 20.64) = 152.6\)cm and \((173.2 + 20.64) = 193.8\)cm.
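The three intervals for the husbands' heights can be computed together, as a check on the arithmetic. This is a minimal sketch using the sample mean (173.2cm) and sample standard deviation (6.88cm) quoted in the text; the function name `interval_within` is chosen here for illustration.

```python
def interval_within(mean, sd, k):
    """Return the (lower, upper) limits k standard deviations either side of the mean."""
    return (mean - k * sd, mean + k * sd)


x_bar = 173.2   # sample mean height (cm)
s = 6.88        # sample standard deviation (cm)

# One, two and three standard deviations either side of the mean
intervals = {k: interval_within(x_bar, s, k) for k in (1, 2, 3)}

for k, percentage in zip((1, 2, 3), ("68%", "95%", "99.7%")):
    lower, upper = intervals[k]
    print(f"About {percentage} of heights: between {lower:.1f}cm and {upper:.1f}cm")
```

After rounding to one decimal place, the limits should agree with those computed by hand above.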
The empirical rule indicates that 99.7% of observations are within 3 standard deviations of the mean. That is, almost all observations are within three standard deviations of the mean.
This suggests a rule for identifying outliers in approximately bell-shaped distributions: any observation more than 3 standard deviations away from the mean is unusual, so may be considered an outlier. More generally, this rule is often applied to approximately symmetric distributions.
Bell-shaped (normal) distributions are studied further later (for example, Chap. 17).
13.5.2 The standard deviation rule for identifying outliers
One rule for identifying outliers is based on the 68-95-99.7 rule.
Definition 13.12 (Standard deviation rule for identifying outliers) For approximately symmetric distributions, an observation more than three standard deviations from the mean may be considered an outlier.
This rule uses the mean and the standard deviation, so this rule is suitable for approximately symmetric distributions (when means and standard deviations are sensible numerical summaries to use). Although this rule is based on normal distributions, it has proved useful for many approximately symmetric distributions.
All rules for identifying outliers are arbitrary. For example, the standard deviation rule is sometimes given slightly differently; for example, outliers identified as observations more than 2.5 standard deviations away from the mean. Since all rules for identifying outliers are arbitrary, both rules are acceptable.
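The standard deviation rule can be sketched as a small function, with the number of standard deviations left as an argument so that either the 3 or the 2.5 version can be used. The mean and standard deviation are those of the husbands' heights quoted earlier; the four heights being checked are hypothetical values invented for illustration, not observations from the data.

```python
def is_outlier_sd_rule(value, mean, sd, k=3.0):
    """True if the value is more than k standard deviations from the mean."""
    return abs(value - mean) > k * sd


x_bar = 173.2   # sample mean height (cm), from the text
s = 6.88        # sample standard deviation (cm), from the text

# HYPOTHETICAL heights (cm), used only to illustrate the rule
hypothetical_heights = [150.0, 165.0, 191.0, 195.0]

flags_3sd = {h: is_outlier_sd_rule(h, x_bar, s, k=3.0) for h in hypothetical_heights}
flags_2_5sd = {h: is_outlier_sd_rule(h, x_bar, s, k=2.5) for h in hypothetical_heights}

print("Using 3 standard deviations:  ", flags_3sd)
print("Using 2.5 standard deviations:", flags_2_5sd)
```

In this sketch, a height of 191cm is flagged by the 2.5 version of the rule but not by the 3 version, which illustrates how arbitrary such rules are.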
13.5.3 The IQR rule for identifying outliers
Since the standard deviation rule for identifying outliers relies on the mean and standard deviation, it is not appropriate for non-symmetric distributions. Another rule is needed for identifying outliers in these situations: the IQR rule.
Definition 13.13 (IQR rule for identifying outliers) The IQR rule identifies mild and extreme outliers as:
Extreme outliers: observations \(3\times \text{IQR}\) more unusual than \(Q_1\) or \(Q_3\).
Mild outliers: observations \(1.5\times \text{IQR}\) more unusual than \(Q_1\) or \(Q_3\) (that are not also extreme outliers).
This definition is much easier to understand using an example.
Example 13.10 (IQR rule for identifying outliers) An engineering project^{322} studied a new building material, to estimate the average permeability.
Measurements of permeability time (the time for water to permeate the sheets) were taken from 81 pieces of material (in seconds). For these data \(Q_1 = 24.7\) and \(Q_3 = 50.6\), so we find that \(\text{IQR} = 50.6 - 24.7 = 25.9\). Then, extreme outliers are observations \(3\times 25.9 = 77.7\) more unusual than \(Q_1\) or \(Q_3\). That is, extreme outliers are observations:
 more unusual than \(24.7 - 77.7 = -53.0\) (that is, less than \(-53.0\)); or
 more unusual than \(50.6 + 77.7 = 128.3\) (that is, greater than \(128.3\)).
Mild outliers are observations \(1.5\times 25.9 = 38.9\) more unusual than \(Q_1\) or \(Q_3\) (that are not also extreme outliers). That is, mild outliers are
 more unusual than \(24.7 - 38.9 = -14.2\) (that is, less than \(-14.2\)); or
 more unusual than \(50.6 + 38.9 = 89.5\) (that is, greater than \(89.5\)).
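The IQR rule can be sketched as a pair of small functions that take the quartiles and return the limits beyond which observations are mild or extreme outliers. The quartiles are those given for the permeability data; the function names and the three permeability times being classified are hypothetical, chosen only for illustration.

```python
def iqr_limits(q1, q3):
    """Return the IQR, and the (lower, upper) limits for mild and extreme outliers."""
    iqr = q3 - q1
    return {
        "iqr": iqr,
        "mild": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "extreme": (q1 - 3 * iqr, q3 + 3 * iqr),
    }


def classify(value, q1, q3):
    """Label a value as an extreme outlier, a mild outlier, or not an outlier."""
    limits = iqr_limits(q1, q3)
    if value < limits["extreme"][0] or value > limits["extreme"][1]:
        return "extreme outlier"
    if value < limits["mild"][0] or value > limits["mild"][1]:
        return "mild outlier"
    return "not an outlier"


limits = iqr_limits(24.7, 50.6)   # quartiles for the permeability data
print(limits)

# HYPOTHETICAL permeability times (seconds), used only to illustrate the rule
labels = {t: classify(t, 24.7, 50.6) for t in (40.0, 100.0, 130.0)}
print(labels)
```

The mild-outlier limits from the code (about \(-14.15\) and \(89.45\)) differ slightly from the hand calculation only because the hand calculation rounds \(1.5\times 25.9 = 38.85\) to \(38.9\) before adding and subtracting.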
The outliers are identified when constructing a boxplot: the 'whiskers' extend to the most extreme observations remaining after excluding mild and extreme outliers; then, mild outliers are shown using a \(\circ\), and extreme outliers are shown using a \(\star\).
You don't need to do this (that's what software is for), but you do need to understand what the software is doing. Construction of the boxplot is shown in the animation below.
13.5.4 When to use which rule?
In summary, two common ways to identify outliers are:
 For approximately symmetric distributions: use the standard deviation rule.
 For any distribution, but primarily for those skewed or with outliers: use the IQR rule.
But remember: All rules for identifying outliers are arbitrary!
Example 13.11 (Boxplots and histograms) For the permeability data,^{323} compare the boxplot and histogram (Fig. 13.11).
Can you see how the boxplot identifies the observations in the histogram that seem to be outliers?
In an American study,^{324} the lung capacity (FEV) of youth aged 3 to 19 was measured.
The data are slightly skewed right, and the average FEV is about 2.6 litres. The FEV varies from about 0.8 to 5.8 litres, with no outliers.
Using this information, sketch the boxplot and the histogram for the data.
13.5.5 What to do with outliers?
What should you do if any observations in your data are identified as outliers?
In general, deleting outliers simply because they are outliers is a very bad idea. After all, each outlier was obtained from your study just like every other observation; it deserves to be in your data as much as any other observation.
In addition, the rules for identifying outliers are all arbitrary rules.
However, there are some exceptions (see Dunn and Smyth^{325}, p. 138):

Mistakes: If the observation is clearly wrong (e.g., someone gives their age as 222), you can fix it if possible (the person may be 22 years old, not 222 years old), though this is not often possible; otherwise, you can delete it.
Similarly, if the observation can be traced to an error or mistake in the data collection (e.g., you applied twice as much fertiliser as you should have), it can be deleted.

A different population is represented: If the observation comes from a different population than the other observations, it may be removed from the analysis.
For example, consider a study of how often students exercise. One observation is found to be an outlier; closer inspection finds that this student is aged 65. The next oldest student in the data is aged 44. In this case, the outlier can be removed from the analysis as it belongs to one population ("students aged over 45"), while all the other observations belong to a different population ("students aged under 45"). The remaining observations can be analysed, on the understanding that the results only apply to students aged under 45 (which should be clearly communicated).

Unknown reason: Sometimes no obvious reason can explain the outlier. In these situations, the solution is unclear.
Discarding the outliers routinely is not recommended, as they are probably real observations that are just as valid as all the others. Perhaps a different analysis is necessary (for example, based on using medians rather than means).
In all cases, whenever observations are removed from a dataset, this should be clearly explained and documented.
Example 13.12 (Outliers) The Mary River dataset used in Sect. 13.2 has many extreme large outliers, but each of them is reasonable. They probably correspond to flooding events.
It would be silly to remove these from the analysis.
Example 13.13 (Outliers) The permeability data shown in Fig. 13.11 has some extreme large outliers. None of these seem unreasonable.
It would be silly to remove these from the analysis.
13.6 Compiling tables of numerical summary information
Here are some tips for compiling tables of numerical summary information:
 Round numbers appropriately (don't necessarily use all decimals provided by software).
 Place captions above tables.
 In general, use no vertical lines and very few horizontal lines.
 Align numbers in the table by decimal point when possible, for easier reading.
 Ensure the table allows readers to easily make the important comparisons.
Example 13.14 (Tables for summarising data) Consider a study^{326} assessing the effects of probiotic and conventional yoghurt on blood glucose and antioxidant status in Type 2 diabetic patients. A randomised controlled trial (i.e., an experiment) collected data from 60 patients.
Compare the two numerical summary tables in Tables 13.6 and 13.7: Table 13.6 makes comparing the two groups easier, but Table 13.7 is the more conventional orientation (for practical purposes: fewer columns).
Table 13.6:
Yoghurt  Age^{\(a\)}  Weight (kg)^{\(a\)}  BMI (kg/m^{2})^{\(a\)}  Metformin/d^{\(b\)}  Glibenclamide/d^{\(b\)} 

Conventional (\(n=30\))  \(51.0 \pm 7.3\)  \(75.42 \pm 11.28\)  \(29.14 \pm 4.30\)  \(2 \pm 1.25\)  \(1 \pm 1\) 
Probiotic (\(n=30\))  \(50.9 \pm 7.7\)  \(76.18 \pm 10.94\)  \(28.95 \pm 3.65\)  \(2 \pm 1.25\)  \(2 \pm 2\) 

Table 13.7:
Variable  Conventional yoghurt (\(n=30\))  Probiotic yoghurt (\(n=30\)) 

Age^{\(a\)}  \(51.00 \pm 7.32\)  \(50.87 \pm 7.68\) 
Weight (kg)^{\(a\)}  \(75.42 \pm 11.28\)  \(76.18 \pm 10.94\) 
BMI (kg/m^{2})^{\(a\)}  \(29.14 \pm 4.30\)  \(28.95 \pm 3.65\) 
Metformin/d^{\(b\)}  \(2 \pm 1.25\)  \(2 \pm 1.25\) 
Glibenclamide/d^{\(b\)}  \(1 \pm 1\)  \(2 \pm 2\) 
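The tip about rounding can be sketched in Python. This is an illustration only: the helper name `plus_minus` is made up, and the inputs are the two numbers reported for age in each group, taken here to be a centre and a spread because the table footnotes are not reproduced.

```python
def plus_minus(centre, spread, decimals=1):
    """Format two summary numbers as 'centre ± spread' with a fixed number of decimals.

    `plus_minus` is a hypothetical helper name used only for this sketch.
    """
    return f"{centre:.{decimals}f} ± {spread:.{decimals}f}"


# The two numbers reported for age in each yoghurt group (two-decimal version).
age = {"Conventional": (51.00, 7.32), "Probiotic": (50.87, 7.68)}

for group, (centre, spread) in age.items():
    # Left-align the group name so the numbers line up in a column.
    print(f"{group:<14}{plus_minus(centre, spread)}")
```

Rounding both numbers to the same number of decimal places keeps the entries aligned and avoids suggesting more precision than is useful for ages in years.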
Do you think a difference exists between the mean BMI in the two groups in the population, based on Tables 13.6 and 13.7?
Explain.
Very importantly: It is hard to make decisions about a population based on just a sample.
We cannot be sure if there is a difference between the population means or not, just based on the sample means.
The difference is not large, so may be due to sampling variation: The two sample means will vary from sample to sample, which may explain the small difference that we see in the sample.
13.7 Observing relationships: The NHANES study
In Sect. 12.9, the NHANES data were introduced (Center for Disease Control and Prevention (CDC)^{327}, Center for Disease Control and Prevention^{328}, Pruim^{329}), and graphs were used to understand the data relevant to answering this RQ:
Among Americans, is the mean direct HDL cholesterol different for current smokers and nonsmokers?
Using the software output (jamovi: Fig. 13.12; SPSS: Fig. 13.13), the direct HDL cholesterol can be summarised numerically:
 Average value:
 Sample mean: \(\bar{x} = 1.36\) mmol/L.
 Sample median: \(1.29\) mmol/L.
 Variation:
 Sample standard deviation: \(s=0.399\) mmol/L.
 Sample IQR: \(0.49\) mmol/L.
 Shape: Slightly skewed right (from Fig. 13.1 or 12.39).
 Outliers: SPSS identified some outliers (Fig. 12.39), mostly unusually large values.
The RQ is about comparing the mean direct HDL cholesterol in the two smoking groups, so compiling a table of summaries for each group is useful, using different output (jamovi: Fig. 13.14; SPSS: Fig. 13.15). Table 13.8 shows the numerical summaries of direct HDL cholesterol for each group.
Group  Sample size  Mean  Median  Std. dev.  IQR 

All participants:  8474  1.36  1.29  0.399  0.49 
Smokers:  1388  1.31  1.24  0.424  0.52 
Nonsmokers:  1668  1.39  1.32  0.428  0.54 
Notice that information about current smoking status is not available for every person in the study: Table 13.8 lists 1388 smokers and 1668 nonsmokers, far fewer than the 8474 participants overall. This could impact the results, especially if those who provide data and those who do not are different regarding direct HDL.
The RQ, as usual, asks about the population. Using only a sample, the RQ cannot be answered with certainty, since every sample is likely to be different.
Clearly, the sample means are different, but the RQ asks if the population means are different. Broadly, two possible reasons could explain why the sample mean direct HDL cholesterol is different for current smokers and nonsmokers:
The population means are the same, but the sample means are different simply because of the people who ended up in the sample. Another sample, with different people, might produce different sample means. Sampling variation explains the difference in the sample means.
The population means are different, and the difference between the sample means simply reflects this difference between the population means.
The difficulty, of course, is knowing which of these two reasons ('hypotheses') is the most likely reason for the difference between the sample means. This question is of prime importance (after all, it answers the RQ), and is addressed at length later in this book.
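The first of these explanations (sampling variation) can be illustrated with a small simulation. The sketch below is not an analysis of the NHANES data: it assumes a single hypothetical population, roughly normal, with the overall sample mean and standard deviation from Table 13.8, and draws two samples of the same sizes as the two smoking groups.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Assumption: ONE hypothetical population for both groups, roughly normal,
# using the overall sample mean and standard deviation from Table 13.8.
POP_MEAN, POP_SD = 1.36, 0.399
N_SMOKERS, N_NONSMOKERS = 1388, 1668


def simulated_difference():
    """Difference in sample means when both samples share one population."""
    smokers = [random.gauss(POP_MEAN, POP_SD) for _ in range(N_SMOKERS)]
    nonsmokers = [random.gauss(POP_MEAN, POP_SD) for _ in range(N_NONSMOKERS)]
    return statistics.mean(nonsmokers) - statistics.mean(smokers)


# Repeat the "study" five times; each repeat draws two fresh samples.
differences = [simulated_difference() for _ in range(5)]
for d in differences:
    print(f"Difference in sample means: {d:+.3f} mmol/L")
```

Each repeat gives a slightly different difference, even though the population means are identical by construction. Deciding whether the observed difference of \(1.39 - 1.31 = 0.08\) mmol/L is more than sampling variation can explain is the question taken up later in this book.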
13.8 Summary
Quantitative data can be summarised numerically, and the most common techniques are indicated in Table 13.9.
The mean and standard deviation are usually used whenever possible, for practical and mathematical reasons. Sometimes quoting both the mean and median (and the standard deviation and IQR) may be appropriate.
Feature:  Approximately symmetric  Not symmetric, or outliers 

Average:  Mean  Median 
Variation:  Standard deviation  IQR 
Shape:  Verbal description only  Verbal description only 
Outliers:  Standard deviation rule  IQR rule 
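The summaries in Table 13.9 can be computed with software. Here is a minimal Python sketch using made-up, right-skewed data (not data from this book); Python's `statistics.quantiles` uses one of several quartile conventions, so the IQR may differ slightly from the value given by a calculator, jamovi or SPSS.

```python
import statistics

# Made-up, right-skewed data (not from the text): one value is much larger.
data = [2, 3, 3, 4, 4, 5, 21]

mean = statistics.mean(data)
median = statistics.median(data)
sd = statistics.stdev(data)                  # sample standard deviation
q1, _, q3 = statistics.quantiles(data, n=4)  # first and third quartiles
iqr = q3 - q1

print(f"Mean: {mean:.2f}   Median: {median:.2f}")
print(f"Std. dev.: {sd:.2f}   IQR: {iqr:.2f}")
```

For these made-up data the single large value pulls the mean (6) above the median (4), which is why the median and IQR are suggested when data are not symmetric or contain outliers.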
13.9 Quick review questions
A study of fulmars (a type of seabird)^{330} explored the metabolic rate of the birds. The masses of the female birds were (in grams): 635; 635; 668; 640; 645; 635
 From your calculator, the sample mean is
 From your calculator, the sample standard deviation is
 The sample median is
 The population standard deviation is
Remember that population values are not known, and are estimated by the sample values!
 A study of fatalities on amusement rides in the US^{331} recorded the following numbers of fatalities from 1994 to 2003:
1994  1995  1996  1997  1998  1999  2000  2001  2002  2003 

2  4  3  4  7  6  1  3  2  5 
 What is the mean number of fatalities per year over this period?
 What is the median number of fatalities per year over this period?
 What is the standard deviation of the number of fatalities per year over this period?

Which of the following statements are true?
 The IQR measures the amount of variability in a set of data
 The mean and the median can both be called the "average"
 The mean and the median are not always the same value
 The range is a measure of variability in a set of data (but it is usually too simple to be useful)
 The standard deviation measures the amount of variability in a set of data
 Another name for the median is \(Q_2\)
 \(Q_3\) is the median of the largest half of the data
 The IQR is a useful measure of the amount of variation in data that are skewed
 The IQR is the difference between the first and second quartiles
13.10 Exercises
Selected answers are available in Sect. D.13.
Exercise 13.1 The histogram of the direct HDL cholesterol from the NHANES study is shown in Fig. 13.1. Should the mean or median be used to measure location?
Exercise 13.2 The average monthly SOI values in August from 1995 to 2000 are shown in Table 13.10. Use your calculator (where possible) to calculate the:
 sample mean.
 sample median.
 range.
 sample standard deviation.
Year  Monthly average SOI 

1995  0.8 
1996  4.6 
1997  19.8 
1998  9.8 
1999  2.1 
2000  5.3 
Exercise 13.3 The activity below contains histograms and boxplots.
Match the histogram with the corresponding boxplot.

For which data sets would the mean and standard deviation be the appropriate numerical summary?
For which data sets would the median and IQR be the appropriate numerical summary?
Exercise 13.4 A study of the productivity of construction workers^{332} recorded, among other things, the rate at which concrete panels could be installed by workers.
Data for three different female workers in the study are shown in Table 13.11. Construct the boxplot comparing the three workers. What does it tell you?
Statistic  Worker 1  Worker 2  Worker 3 

Mean  1.24  1.73  1.36 
Minimum  0.59  1.13  0.86 
1st quartile  0.88  1.51  1.16 
Median  1.35  1.70  1.38 
3rd quartile  1.49  1.91  1.58 
Maximum  1.88  3.00  2.17 
Range  1.28  1.87  1.31 
Exercise 13.5 An article examined patients who had been admitted for thoracic surgical procedures at Castle Hill Hospital for the presence of microplastics.^{333} The total number of microplastics found in the lungs of each patient is shown below:
Patient  Number of microplastics  Patient  Number of microplastics  

1  8  7  1  
2  3  8  7  
3  5  9  5  
4  2  10  1  
5  0  11  0  
6  2 
For these patients:
 What is the mean number of microplastics found?
 What is the median number of microplastics found?
 What is the standard deviation of the number of microplastics found?
 What is the IQR of the number of microplastics found?