12 Summarising quantitative data
So far, you have learnt to ask a RQ, design a study, collect the data, and describe the data. In this chapter, you will learn to:
- summarise quantitative data using the appropriate graphs.
- summarise quantitative data using average, variation, shape and unusual features.
12.1 Introduction
Most quantitative research studies involve quantitative variables. Except for very small amounts of data, understanding the data is difficult without a summary. A distribution is a way to summarise quantitative data.
Definition 12.1 (Distribution) The distribution of a variable describes what values are present in the data, and how often those values appear.
The distribution can be displayed using a frequency table (Sect. 12.2) or a graph (Sect. 12.3). The distribution of quantitative data can be summarised numerically by computing the average value (Sect. 12.5), computing the amount of variation (Sect. 12.6), describing the shape (Sect. 12.7), and identifying outliers (Sect. 12.8).
12.2 Frequency tables for quantitative data
Quantitative data can be collated in a frequency table by grouping the values into appropriate intervals. The categories should be exhaustive (cover all values) and exclusive (observations belong to one and only one category). While not essential, usually the categories are of equal size.
Example 12.1 Consider the data in Fig. 12.1. The data are the weights of \(44\) babies born in a hospital on one day (P. K. Dunn 1999; S. Steele 1997), plus the gender of each baby, and the number of minutes after midnight of the birth. The data are given in the order in which the births occurred.
The weights can be grouped into weight categories (Table 12.1). The percentages are also added; for example, the percentage of babies weighing \(4.0\) kg or more is \(1/44 \times 100 = 2.27\)%, or about \(2\)%. Most babies in the sample are between \(3\) and \(4\) kg at birth.
Weight group | Number of babies | Percentage of babies |
---|---|---|
\(1.5\) kg to under \(2.0\) kg | \(\phantom{0}1\) | \(\phantom{0}2\) |
\(2.0\) kg to under \(2.5\) kg | \(\phantom{0}4\) | \(\phantom{0}9\) |
\(2.5\) kg to under \(3.0\) kg | \(\phantom{0}4\) | \(\phantom{0}9\) |
\(3.0\) kg to under \(3.5\) kg | \(17\) | \(39\) |
\(3.5\) kg to under \(4.0\) kg | \(17\) | \(39\) |
\(4.0\) kg to under \(4.5\) kg | \(\phantom{0}1\) | \(\phantom{0}2\) |
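The percentages in Table 12.1 can be reproduced from the counts. The following is a minimal sketch in Python (the variable names are illustrative only), using the counts copied from the table:

```python
# Counts copied from Table 12.1 (number of babies in each weight group).
counts = [1, 4, 4, 17, 17, 1]

total = sum(counts)  # total number of babies

# Percentage in each group, rounded to the nearest whole percent.
percentages = [round(100 * c / total) for c in counts]

print(total)        # 44
print(percentages)  # [2, 9, 9, 39, 39, 2]
```

Because the categories are exhaustive and exclusive, the counts add to the sample size of \(44\).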
12.3 Graphs
The graphs discussed in this section are appropriate for continuous quantitative data, but may sometimes be useful for discrete quantitative data if many values are possible. Often, discrete quantitative data (or continuous quantitative data with very few recorded values) are better graphed using the graphs in Sect. 13.3.
The purpose of a graph is to display the information in the clearest, simplest possible way, to facilitate understanding the message(s) in the data.
Graphs used to display the distribution of one quantitative variable include:
- Histogram (Sect. 12.3.1): Best for moderate to large amounts of data.
- Stemplots (Sect. 12.3.2): Best for small amounts of data; only sometimes useful.
- Dot chart (Sect. 12.3.3): Used for small to moderate amounts of data.
12.3.1 Histograms
Histograms are a series of boxes, where the width of the box represents a range of values of the variable being graphed, and the height of the box represents the number (or percentage) of observations within that range of values^{1}. Histograms are essentially a picture of a frequency table. The vertical axis can be counts (labelled as 'Counts', 'Frequency', or similar) or percentages.
When the quantitative variable is discrete and the boxes have a width of one, sometimes the labels are placed on the axis aligned with the centre of the bar (e.g., see Fig. 12.13, right panel).
Example 12.2 (Histograms) Consider again the weights (in kg) of babies born in a Brisbane hospital in one day (Fig. 12.1). A histogram can be constructed for these data (below). When an observation occurs on a boundary between the boxes, software usually (but not universally) places it in the higher box (so \(2.5\) kg would be counted in the '\(2.5\) to \(3.0\) kg' box, not the '\(2.0\) to \(2.5\) kg' box). That is, the boxes include the lower limit, but exclude the upper limit. The histogram shows, for example, that \(17\) babies weighed \(3.0\) kg or more, but under \(3.5\) kg.
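The boundary convention (include the lower limit, exclude the upper limit) can be sketched in Python. This is only an illustration: the function name `bin_label` is made up, and the weights used are hypothetical values chosen to sit on or near boundaries, not the data in Fig. 12.1.

```python
from bisect import bisect_right

# Bin edges for the weight groups: 1.5, 2.0, ..., 4.5 kg.
edges = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]

def bin_label(weight):
    """Return the bin containing `weight`; bins include the lower edge only."""
    i = bisect_right(edges, weight) - 1
    if i < 0 or i >= len(edges) - 1:
        raise ValueError("weight lies outside the bins")
    return f"{edges[i]} to under {edges[i + 1]} kg"

# Hypothetical weights (NOT the Fig. 12.1 data), including boundary values.
for w in [1.7, 2.5, 2.9, 3.0]:
    print(w, "->", bin_label(w))
```

With this convention, a weight of exactly \(2.5\) kg is placed in the '\(2.5\) to under \(3.0\) kg' bin. Software that uses the opposite convention would place it in the lower bin instead.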
The animation below shows how the histogram is constructed.
Example 12.3 (Histograms) A study of brain freezes after consuming cold food or drink measured the duration of the brain-freeze symptoms (Mages et al. 2017).
A histogram of the data (Fig. 12.2, right panel) shows that \(11\) people experienced symptoms for less than \(5\) s; nine people experienced symptoms for at least \(5\) but less than \(10\) s; and one person experienced symptoms for at least \(35\) s but under \(40\) s.
Software usually makes sensible choices for the number of bins, and the width of the bins. However, the choice of bin size can substantially change the appearance of the histogram. Software makes it easy to try different bin sizes to find one that suitably displays the overall distribution.
Example 12.4 (Bimodal data) The Old Faithful geyser in Yellowstone National Park (USA) erupts regularly (Härdle et al. 1991). A histogram for the time between eruptions is shown below. Try changing the number of bins in the interaction below to see the impact.
12.3.2 Stemplots
Stemplots (or stem-and-leaf plots) are best described and explained using an example. Consider the data in Fig. 12.1, the weights of babies born in a Brisbane hospital on one day (P. K. Dunn 1999; S. Steele 1997).
In a stemplot, part of each number is placed to the left of a vertical line (the stem), and the rest of each number to the right of the line (the leaf).
The weights in Fig. 12.1 are given to one decimal place of a kilogram, so the whole number of kilograms is placed to the left of the line (as the stem), and the first decimal place is placed on the right of the line (as a leaf).
The animation below shows how the stemplot is constructed.
The first weight, of \(1.7\) kg, is entered with the \(1\) to the left of the line, and the \(7\) to the right: 1 | 7. Similarly, \(2.1\) kg is entered as 2 | 1 and \(2.2\) kg is entered as 2 | 2, sharing the same stem as for \(2.1\) kg.
The plot shows that most birthweights are \(3\)-point-something kilograms.
One advantage of using stemplots over other plots is that the original data are visible. For stemplots:
- place the larger unit (e.g., kilograms) on the left (stems).
- place the smaller unit (e.g., first decimal of a kilogram) on the right (leaves).
- some data do not work well with stemplots.
- sometimes, data may need suitable rounding before creating the stemplot (the baby weights were originally given to three decimal places).
- the numbers in each row should be evenly spaced, placing the numbers in the columns under each other, so that the length of each stem is proportional to the number of observations.
- within each stem, the observations are ordered so patterns can be seen.
- add an explanation for reading the stemplot. For example, the stemplot for the baby-birth data says '\(2\) | \(6\) means \(2.6\) kg' (rather than, say, \(0.26\) kg, or \(2\) lb \(6\) oz).
Example 12.5 (Stemplots) A study recorded the number of chest-beats by gorillas (Wright et al. 2021). The animation below shows the stemplot being constructed. For gorillas aged under \(20\) years, the stemplot shows a lot of variation in the beating rate, but most rates are under \(2\) beats per \(10\) h.
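A stemplot for data recorded to one decimal place can also be built with a few lines of code. The following Python sketch (the function name `stemplot` is illustrative only) uses the \(14\) chest-beating rates for the young gorillas, as listed in Table 12.3:

```python
from collections import defaultdict

# Chest-beating rates (beats per 10 h) for the 14 young gorillas (Table 12.3).
rates = [0.7, 0.9, 1.3, 1.5, 1.5, 1.5, 1.7, 1.7, 1.8, 2.6, 3.0, 4.1, 4.4, 4.4]

def stemplot(values):
    """Return the rows of a stemplot for values given to one decimal place."""
    leaves = defaultdict(list)
    for v in sorted(values):  # ordering the values orders the leaves in each stem
        stem, leaf = divmod(round(v * 10), 10)  # e.g. 2.6 -> stem 2, leaf 6
        leaves[stem].append(str(leaf))
    stems = range(min(leaves), max(leaves) + 1)  # keep any empty stems
    return [f"{s} | {''.join(leaves[s])}" for s in stems]

rows = stemplot(rates)
print("\n".join(rows))
print("Key: 2 | 6 means 2.6 beats per 10 h")
```

Each leaf is a single character, so the length of each row is proportional to the number of observations on that stem.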
12.3.3 Dot charts (quantitative data)
Dot charts show the original data on a single (usually horizontal) axis, with each observation represented by a dot (or other symbol).
Example 12.6 (Dot charts) Consider again the weights (in kg) of babies born in a Brisbane hospital (Fig. 12.1). A dot chart (Fig. 12.5) shows that most babies were born between \(3\) and \(4\) kg. Observations have been jittered (i.e., placed with some added randomness in the vertical direction) to avoid overplotting.
Example 12.7 (Dot charts) The chest-beating rate of young gorillas (seen in Example 12.5) can be displayed using a dot chart (Fig. 12.4, top panel). Observations are stacked on top of each other when multiple observations are the same, or very nearly so; for example, two gorillas beat their chest at \(1.7\) beats per \(10\) h.
12.3.4 Describing the distribution
Graphs are constructed to help readers understand the data. After producing a graph for one quantitative variable, the distribution of the data should be described:
- The average: What is an average, central or typical value?
- The variation: How much variation is present in the bulk of the data?
- The shape: What is the shape of the distribution? That is, are most of the values smaller or larger, or about evenly distributed between smaller and larger values?
- Mention any outliers (observations unusually large or small) or unusual features.
These can be described in rough terms, but are usually described using numerical quantities (as in the following sections).
Example 12.8 (Describing quantitative data) The weights of babies (displayed in Example 12.2) are typically between about \(3\) kg and \(4\) kg (the average), with most between \(1.5\) kg and \(4.5\) kg (variation). A few babies have much lower weights, probably premature births (shape). No unusual values are present.
Describe the histogram in Fig. 12.2 (right panel).
- Average: Hard to be sure... maybe between \(10\) and \(15\). (More observations appear at the smaller values, as the bars are higher.)
- Variation: From about \(0\) to about \(40\).
- Shape: Slightly skewed right.
- Outliers: No outliers or unusual observations. The observation between \(35\) and \(40\) may be an outlier. I suspect it is not an outlier, as a larger sample may very well have observations between \(30\) and \(35\). Of course, I could be wrong.
12.4 Parameters and statistics
In quantitative research (Sect. 1.4), both qualitative and quantitative data are summarised and analysed numerically. In the following sections, methods for numerically summarising quantitative variables are described. Importantly, these numerical quantities are computed from a sample, even though the whole population is of interest. As a result, distinguishing parameters and statistics is important (App. C).
Definition 12.2 (Parameter) A parameter is a number, usually unknown, describing some feature of a population.
Definition 12.3 (Statistic) A statistic is a number describing some feature of a sample (to estimate an unknown population parameter).
A statistic is a numerical value estimating an unknown population value. However, countless samples are possible (Sect. 5.1), and so countless values for the statistic (all of which are estimates of the value of the parameter) are possible. The value of the statistic that is observed depends on which one of the countless possible samples is (randomly) selected.
The RQ identifies the population, but in practice only one of the many possible samples is studied. Statistics are estimates of parameters, and the value of the statistic is not the same for every possible sample. We only observe one value of the statistic from our single observed sample.
12.5 Summaries: averages
The average (or location, or central value) for quantitative sample data can be described in many ways. The most common are:
- the sample mean (or sample arithmetic mean), which estimates the population mean (Sect. 12.5.1); and
- the sample median, which estimates the population median (Sect. 12.5.2).
In both cases, the population parameter is estimated by a sample statistic. Understanding whether to use the mean or median is important.
'Average' can refer to means, medians or other measures of centre. Use the precise term 'mean' or 'median', rather than 'average', when possible!
Example 12.9 (Averages) Consider the daily river flow volume ('streamflow') at the Mary River from 01 October 1959 to 17 January 2019. The 'average' daily streamflow in February could be described using either the mean or the median:
- the mean daily streamflow is \(1\ 123.2\) ML.
- the median daily streamflow is \(146.1\) ML.
These both summarise the same data, and both give an estimate of the 'average' daily streamflow in February, yet give very different answers. This implies they measure the 'average' in different ways, and have different meanings. Which is the best 'average' to use? To decide, both measures of average will need to be studied.
12.5.1 Average: the mean
The mean of the population is denoted by \(\mu\), and its value is almost always unknown. The mean of the population is estimated by the mean of the sample, denoted \(\bar{x}\). In this context, the value of the unknown parameter is \(\mu\), and the value of the statistic is \(\bar{x}\).
The sample mean estimates the population mean, and every one of the possible samples is likely to have a different sample mean.
The Greek letter \(\mu\) is pronounced 'mew' or 'myoo'. \(\bar{x}\) is pronounced 'ex-bar'.
Example 12.10 (A small dataset) Consider a small dataset for answering this descriptive RQ: 'For gorillas aged under \(20\), what is the average chest-beating rate?' The population mean rate (denoted \(\mu\)) is to be estimated.
Clearly, every gorilla cannot be studied; a sample is studied. The unknown population mean is estimated using the sample mean (\(\bar{x}\)), and every possible sample can give a different value for \(\bar{x}\). Measurements were taken from \(14\) young gorillas (Fig. 12.5).
The sample mean is the 'balance point' of the observations. The animation below shows how the mean acts as the balance point. Alternatively, the mean is the value such that the positive and negative distances of the observations from the mean add to zero, as in the animation below. Both of these explanations seem reasonable for identifying an 'average' for the data.
Definition 12.4 (Mean) The mean is one way to measure the 'average' value of quantitative data. The arithmetic mean is the 'balance point' of the data, and the value such that the positive and negative distances from the mean add to zero.
To find the value of the sample mean, add (denoted by \(\sum\)) all the observations (denoted by \(x\)); then divide by the number of observations (denoted by \(n\)). In symbols: \[ \bar{x} = \frac{\sum x}{n}. \]
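The 'balance point' description follows directly from this formula. As a short check, using only the definition of \(\bar{x}\), the deviations from the mean always add to zero: \[ \sum (x - \bar{x}) = \sum x - n\bar{x} = \sum x - n\times\frac{\sum x}{n} = 0. \] That is, the total distance of the observations above the mean exactly cancels the total distance of the observations below the mean.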
Example 12.11 (Computing a sample mean) For the chest-beating data (Fig. 12.5), an estimate of the population mean (i.e., the sample mean) chest-beating rate is found by summing all \(n = 14\) observations and dividing by \(n\):
\[
\overline{x}
= \frac{\sum x}{n}
= \frac{0.7 + 0.9 + \cdots + 4.4}{14}
= \frac{31.1}{14}
= 2.221429.
\]
The sample mean, the best estimate of the population mean, is \(2.22\) beats per \(10\) h.
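The same calculation can be sketched in Python, as one example of using software (the variable names are illustrative only). The data are the \(14\) chest-beating rates listed in Table 12.3:

```python
# Chest-beating rates (beats per 10 h) for the 14 young gorillas (Table 12.3).
rates = [0.7, 0.9, 1.3, 1.5, 1.5, 1.5, 1.7, 1.7, 1.8, 2.6, 3.0, 4.1, 4.4, 4.4]

n = len(rates)       # number of observations
total = sum(rates)   # the sum of the observations
x_bar = total / n    # the sample mean

print(n)                # 14
print(round(total, 1))  # 31.1
print(round(x_bar, 2))  # 2.22
```

The software reports a bare number; the measurement units (beats per \(10\) h) and sensible rounding must be added by the user.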
For the chest-beating data (Fig. 12.5), what is the value of \(\mu\)?
We do not know!
We know the value of the sample mean, but not the population mean. We have an estimate of the value of the population mean by using the sample mean.
(If we already knew the value of the population mean, why would we estimate the value from an imperfect sample?)
Software (such as jamovi) or a calculator (in Statistics Mode) is usually used to compute the sample mean. However, knowing how these quantities are computed is important.
Software and calculators often produce numerical answers to many decimal places, not all of which may be meaningful or useful. A simple but useful rule-of-thumb is to round to one or two more significant figures than the original data. Software usually does not add measurement units to the answer either.
For example, the chest-beating data are given to one decimal place. The sample mean rate can be given as \(\bar{x} = 2.22\) beats per \(10\) h.
A study of bats (Griffin, Webster, and Michael 1960) recorded the distance at which flies (Drosophila) were detected for \(n = 11\) detections (Table 12.2). Estimate the population mean distance (using the sample mean) at which bats detect the flies.
The estimate of \(\mu\) is \(\bar{x} = 532/11 = 48.4\) cm.
\(62\) | \(68\) | \(34\) | \(27\) | \(83\) | \(40\) |
\(52\) | \(23\) | \(45\) | \(42\) | \(56\) |
12.5.2 Average: the median
A median is a value separating the largest \(50\)% of the data from the smallest \(50\)% of the data. In a dataset with \(n\) values, the median is ordered observation number \((n + 1)\div 2\). (The median is not equal to \((n + 1)\div 2\), and is not halfway between the minimum and maximum values in the data.)
Many calculators cannot find the median. The median has no commonly-used symbol, though \(\tilde{\mu}\) and \(\tilde{x}\) are sometimes used for the population and sample medians respectively.
Definition 12.5 (Median) The median is one way to measure the 'average' value of data. A median is a value such that half the values are larger than the median, and half the values are smaller than the median.
Example 12.12 (Find a sample median) To find a sample median for the chest-beating data (Fig. 12.5), first arrange the data in numerical order (Table 12.3). The median separates the larger \(7\) numbers from the smaller \(7\) numbers. With \(n = 14\) observations, the median is the ordered observation located between the seventh and eighth observations (i.e., at position \((14 + 1)/2 = 7.5\); the median itself is not \(7.5\)).
The sample median is between \(1.7\) (ordered observation \(7\)) and \(1.7\) (observation \(8\)). Since these values are the same, the sample median is \(1.7\) beats per \(10\) h.
0.7 | 0.9 | 1.3 | 1.5 | 1.5 | 1.5 | 1.7 | 1.7 | 1.8 | 2.6 | 3 | 4.1 | 4.4 | 4.4 |
For the chest-beating data (Table 12.3), what is the population median?
We do not know!
We know the value of the sample median, but not the population median. We only have an estimate of the value of the population median.
To clarify:
- if the sample size \(n\) is odd (such as the bats data; Table 12.2), the median is the middle number when the observations are ordered.
- if the sample size \(n\) is even (such as the chest-beating data), the median is halfway between the two middle numbers, when the observations are ordered.
Some software uses slightly different rules when \(n\) is even, producing slightly different values for the median.
The sample median estimates the population median, and every one of the possible samples is likely to have a different sample median.
For the bat data (Table 12.2), estimate the population median distance at which bats detect the flies.
With \(n = 11\), the median is the \((11 + 1)/2 = 6\)th ordered value, which is \(45\) cm.
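The rules for odd and even sample sizes can be sketched as a small Python function. This is a minimal illustration of the rule described above (the function name `median` is a choice made here); as noted, some software uses slightly different rules when \(n\) is even.

```python
def median(values):
    """Median: the middle ordered value, or halfway between the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                  # n odd: ordered observation (n + 1)/2
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # n even: halfway between

# Chest-beating rates (Table 12.3): n = 14, which is even.
rates = [0.7, 0.9, 1.3, 1.5, 1.5, 1.5, 1.7, 1.7, 1.8, 2.6, 3.0, 4.1, 4.4, 4.4]
# Bat detection distances in cm (Table 12.2): n = 11, which is odd.
distances = [62, 68, 34, 27, 83, 40, 52, 23, 45, 42, 56]

print(median(rates))      # 1.7
print(median(distances))  # 45
```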
12.5.3 Which average to use?
Consider the daily streamflow at the Mary River (Bellbird Creek) during February again (Example 12.9): the mean daily streamflow is \(1\ 123\) ML, and the median daily streamflow is \(146.1\) ML. Which is 'best' for measuring the average streamflow?
For these data, about \(86\)% of the observations are less than the mean, but \(50\)% of the values are less than the median (by definition). The mean is hardly a central value...
A dot chart of the daily streamflow (Fig. 12.6, using jittering) shows that the data are very highly right-skewed, with many very large outliers (presumably during flood events); the extreme outliers are clear in the frequency table too (Table 12.4).
Daily streamflow (ML) | Number of days |
---|---|
\(0\) to under \(20,000\) | \(1636\) |
\(20,000\) to under \(40,000\) | \(\phantom{0}\phantom{0}\phantom{0}6\) |
\(40,000\) to under \(60,000\) | \(\phantom{0}\phantom{0}\phantom{0}3\) |
\(60,000\) to under \(80,000\) | \(\phantom{0}\phantom{0}\phantom{0}4\) |
\(80,000\) to under \(100,000\) | \(\phantom{0}\phantom{0}\phantom{0}0\) |
\(100,000\) to under \(120,000\) | \(\phantom{0}\phantom{0}\phantom{0}0\) |
\(120,000\) to under \(140,000\) | \(\phantom{0}\phantom{0}\phantom{0}0\) |
\(140,000\) to under \(160,000\) | \(\phantom{0}\phantom{0}\phantom{0}1\) |
The streamflow data are very highly right skewed, which is important to note:
- Means are best used for approximately symmetric data: the mean is influenced by outliers and skewness.
- Medians are best used for data that are highly skewed or contain outliers: the median is not influenced by outliers and skewness.
Means tend to be too large if the data contain large outliers or severe right skewness, and too small if the data contain small outliers or severe left skewness. For the Mary River data, the large outliers, because they are so extreme and abundant, cause the mean to be much larger than the median. The median is the better measure of average for these data.
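The different behaviour of the mean and the median can be seen with a very small example. The following Python sketch uses hypothetical daily flows (these are made-up values, not the Mary River data), and adds one hypothetical extreme flood day:

```python
from statistics import mean, median

# Hypothetical daily flows in ML (NOT the Mary River data).
typical = [100, 120, 140, 160, 180]
with_flood = typical + [20000]  # add one hypothetical extreme flood day

print(mean(typical), median(typical))        # 140 140
print(mean(with_flood), median(with_flood))  # 3450 150.0
```

A single extreme value moves the mean from \(140\) to \(3450\) ML, a value larger than all but one of the observations, while the median changes only from \(140\) to \(150\) ML.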
The mean is generally used if possible (for practical and mathematical reasons), and is the most commonly-used measure of location. However, the mean is not always appropriate; the median is not influenced by outliers and skewness. The mean and median are similar in approximately symmetric distributions. Sometimes, quoting both the mean and the median may be appropriate.
12.6 Summaries: variation
For quantitative data, the amount of variation in the bulk of the data should be described. Many ways exist to measure the variation in a dataset, including:
- the range: very simple and simplistic, so not often used (Sect. 12.6.1).
- the standard deviation: commonly used (Sect. 12.6.2).
- the interquartile range (or IQR): commonly used (Sect. 12.6.3).
- percentiles: useful in specific situations (Sect. 12.6.4).
As always, a value computed from the sample (the statistic) estimates the unknown value in the population (the parameter), and every sample can produce a different estimate.
12.6.1 Variation: the range
The range is the simplest measure of variation, but not often used.
Definition 12.6 (Range) The range is the maximum value minus the minimum value.
The range is not often used, as it uses only two values: the extreme observations. This means the range is highly influenced by outliers. Sometimes, the range is given by stating both the maximum and the minimum value in the data instead of giving the difference between these values. The range is measured in the same measurement units as the data, and is usually quoted with the median.
Example 12.13 (The range) For the chest-beating data (Table 12.3), the largest value is \(4.4\), and the smallest value is \(0.7\); hence
\[
\text{Range} = 4.4 - 0.7 = 3.7.
\]
The sample median chest-beating rate is \(1.7\) beats per \(10\) h, with a range of \(3.7\) beats per \(10\) h.
12.6.2 Variation: the standard deviation
The population standard deviation is denoted by \(\sigma\) (the parameter) and is estimated by the sample standard deviation \(s\) (the statistic). The standard deviation is the most commonly-used measure of variation, but is tedious to compute manually. You will almost always find the sample standard deviation \(s\) using computer software (e.g., jamovi) or a calculator (in Statistics Mode).
The standard deviation is (approximately) the mean distance that observations are from the mean. This seems like a reasonable way to measure the amount of variation in data.
The Greek letter \(\sigma\) is pronounced 'sigma'.
The sample standard deviation estimates the population standard deviation, and every one of the possible samples is likely to have a different sample standard deviation.
Definition 12.7 (Standard deviation) The standard deviation is, approximately, the average distance of the observations from the mean.
Even though you do not have to use the formula to calculate \(s\), we demonstrate it here to show exactly what \(s\) calculates.
The formula is:
\[
s = \sqrt{ \frac{\sum(x - \bar{x})^2}{n - 1} },
\]
where \(\bar{x}\) is the sample mean, \(x\) represents the individual data values, \(n\) is the sample size, and the symbol '\(\sum\)' means to add (Sect. 12.5.1).
Using the formula requires these steps:
- calculate the sample mean: \(\overline{x}\);
- calculate the deviations of each observation \(x\) from the mean: \(x - \bar{x}\);
- square these deviations (to make them all positive values): \((x - \bar{x})^2\);
- add these squared deviations: \(\sum(x - \bar{x})^2\);
- divide the answer by \(n - 1\);
- take the (positive) square root of the answer.
You do not need to use the formula! You must know how to use software or a calculator to find the standard deviation, what the standard deviation measures, and how to use it.
Example 12.14 (Standard deviation) For the chest-beating data (Table 12.3), the squared deviations of each observation from the mean of \(2.221429\) (using all decimal places in calculations) are shown in Fig. 12.7.
The sum of the squared distances is \(20.96357\).
Then, the sample standard deviation is:
\[
s = \sqrt{\frac{20.96357}{14 - 1}}
= \sqrt{ 1.612582}
= 1.269875.
\]
The sample mean chest-beating rate is \(2.22\) beats per \(10\) h, with a sample standard deviation of \(1.27\) beats per \(10\) h.
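The steps of the formula can be followed one at a time in code. The following Python sketch (variable names are illustrative only) applies them to the chest-beating data in Table 12.3, and then compares the answer with the `stdev` function in Python's standard `statistics` module, which also divides by \(n - 1\):

```python
from math import sqrt
from statistics import stdev

# Chest-beating rates (beats per 10 h) for the 14 young gorillas (Table 12.3).
rates = [0.7, 0.9, 1.3, 1.5, 1.5, 1.5, 1.7, 1.7, 1.8, 2.6, 3.0, 4.1, 4.4, 4.4]

n = len(rates)
x_bar = sum(rates) / n                    # the sample mean
deviations = [x - x_bar for x in rates]   # deviations from the mean
squared = [d ** 2 for d in deviations]    # squared deviations (none negative)
sum_sq = sum(squared)                     # sum of the squared deviations
variance = sum_sq / (n - 1)               # divide by n - 1
s = sqrt(variance)                        # the (positive) square root

print(round(sum_sq, 5))         # 20.96357
print(round(s, 2))              # 1.27
print(round(stdev(rates), 2))   # 1.27
```

The value before taking the square root (about \(1.61\)) is not the standard deviation; forgetting the final step is an easy mistake to make.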
The standard deviation for Dataset A in Fig. 12.8 is \(s = 2\). Will the standard deviation of Dataset B be: smaller than or greater than \(2\)? Why?
The standard deviation is a bit like the average distance that observations are from the mean.
In Dataset B, more observations are closer to the mean, so the average distance would be a smaller number. This suggests that the standard deviation for Dataset B will be smaller than the standard deviation for Dataset A.
The sample standard deviation \(s\) is:
- positive (unless all observations are the same, when it is zero: no variation);
- best used for (approximately) symmetric data;
- usually quoted with the mean;
- the most commonly-used measure of variation;
- measured in the same units as the data;
- influenced by skewness and outliers, like the mean.
Consider again the chest-beating data (Table 12.3). Using software or your calculator's Statistics Mode, find the population standard deviation and the sample standard deviation.
The population standard deviation is unknown. The best estimate is the sample standard deviation: \(s = 1.269875\), or about \(1.27\) beats per \(10\) h. If you do not get this value, you may be pressing the wrong button on your calculator: seek help!
12.6.3 Variation: the inter-quartile range (IQR)
The standard deviation uses the value of \(\bar{x}\), so is affected by skewness like the sample mean. A measure of variation not affected by skewness is the inter-quartile range, or IQR. To understand the IQR, understanding quartiles is necessary.
Definition 12.8 (Quartiles) Quartiles describe the shape of the data:
- The first quartile \(Q_1\) is a value separating the smallest \(25\)% of observations from the largest \(75\)%. The \(Q_1\) is like the median of the smaller half of the data, halfway between the minimum value and the median.
- The second quartile \(Q_2\) is a value separating the smallest \(50\)% of observations from the largest \(50\)%. (This is also the median.)
- The third quartile \(Q_3\) is a value separating the smallest \(75\)% of observations from the largest \(25\)%. The \(Q_3\) is like the median of the larger half of the data, halfway between the median and the maximum value.
Quartiles divide the data into four parts of approximately equal numbers of observations. The inter-quartile range (or IQR) is the difference between \(Q_3\) and \(Q_1\).
Definition 12.9 (IQR) The IQR is the range in which the middle \(50\)% of the data lie: the difference between the third and the first quartiles.
Since the IQR measures the range of the central \(50\)% of the data, the IQR is not influenced by outliers. The IQR is measured in the same measurement units as the data.
The sample IQR estimates the population IQR, and every one of the possible samples is likely to have a different sample IQR.
For the chest-beating data (Table 12.3), the median is \(1.7\) (Example 12.12). The data then can be split into the smaller and the larger halves, each with seven values:
- Smaller half: \(0.7\) \(0.9\) \(1.3\) \(1.5\) \(1.5\) \(1.5\) \(1.7\)
- Larger half: \(1.7\) \(1.8\) \(2.6\) \(3.0\) \(4.1\) \(4.4\) \(4.4\)
Since each half has seven observations, the median of each half is the \((7 + 1)/2 = 4\)th value. (When \(n\) is odd, the median may or may not be included in each of these halves; we decide not to include the median in each half.) Hence:
- \(Q_1\), the first quartile, is the median of the smaller half: \(Q_1 = 1.5\). About \(25\)% of observations are smaller than \(1.5\).
- \(Q_2\), the second quartile or median, is \(1.7\). About \(50\)% of observations are smaller than \(1.7\).
- \(Q_3\), the third quartile, is the median of the larger half: \(Q_3 = 3.0\). About \(75\)% of observations are smaller than \(3.0\).
We say 'about' these values, as exact values cannot be found here; each quarter of the data is required to have \(14/4 = 3.5\) observations, which is not possible. (Software often uses different rules to compute quartiles in these situations.) Using these values, the IQR is \(Q_3 - Q_1\) = \(3.0 - 1.5 = 1.5\), as shown in Fig. 12.9.
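The 'median of each half' approach used here can be sketched in Python. This is one of several rules for finding quartiles, so library routines and other software may give slightly different values. The function names are illustrative only; when \(n\) is odd, the sketch leaves the median out of both halves, as decided above.

```python
def median(values):
    """Median: the middle ordered value, or halfway between the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median of the smaller and larger halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    smaller = ordered[:half]
    # When n is odd, skip the middle value so it is in neither half.
    larger = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(smaller), median(ordered), median(larger)

# Chest-beating rates (beats per 10 h) for the 14 young gorillas (Table 12.3).
rates = [0.7, 0.9, 1.3, 1.5, 1.5, 1.5, 1.7, 1.7, 1.8, 2.6, 3.0, 4.1, 4.4, 4.4]

q1, q2, q3 = quartiles(rates)
iqr = q3 - q1
print(q1, q2, q3)  # 1.5 1.7 3.0
print(iqr)         # 1.5
```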
12.6.4 Variation: percentiles
Percentiles are similar in principle to quartiles.
Definition 12.10 (Percentiles) The \(p\)th percentile of the data is a value separating the smallest \(p\)% of the data from the rest.
For example:
- the \(12\)th percentile separates the smallest \(12\)% of the data from the rest.
- the \(67\)th percentile separates the smallest \(67\)% of the data from the rest.
- the \(94\)th percentile separates the smallest \(94\)% of the data from the rest.
By this definition, the first quartile \(Q_1\) is the \(25\)th percentile, the second quartile \(Q_2\) is the \(50\)th percentile (and median), and the third quartile \(Q_3\) is the \(75\)th percentile.
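As a rough illustration, percentiles can be found with a 'nearest-rank' rule: order the data, then take the first ordered value with at least \(p\)% of the observations at or below it. This rule is an assumption made for this sketch (it is only one of many conventions, and software often interpolates between observations instead); the function name is illustrative only.

```python
from math import ceil

def percentile_nearest_rank(values, p):
    """p-th percentile by the nearest-rank rule (one convention among many)."""
    if not 0 < p <= 100:
        raise ValueError("p must be greater than 0 and at most 100")
    ordered = sorted(values)
    rank = ceil(p * len(ordered) / 100)  # position in the ordered data
    return ordered[rank - 1]

# Chest-beating rates (beats per 10 h) for the 14 young gorillas (Table 12.3).
rates = [0.7, 0.9, 1.3, 1.5, 1.5, 1.5, 1.7, 1.7, 1.8, 2.6, 3.0, 4.1, 4.4, 4.4]

print(percentile_nearest_rank(rates, 25))  # 1.5
print(percentile_nearest_rank(rates, 50))  # 1.7
print(percentile_nearest_rank(rates, 75))  # 3.0
```

For these data, the \(25\)th, \(50\)th and \(75\)th percentiles from this rule agree with the quartiles found in Sect. 12.6.3. That agreement is not guaranteed for other datasets or other rules.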
Percentiles are especially useful for very skewed data in certain applications. For instance, scientists who monitor rainfall and stream heights, and engineers who use this information, are more interested in extreme weather events than in the 'average' event. Structures are designed to withstand \(1\)-in-\(100\) year events (the \(99\)th percentile) or similar, rather than 'average' events. Percentiles are measured in the same measurement units as the data.
Example 12.15 (Percentiles) For the streamflow data at the Mary River (Example 12.9), the February data are highly right-skewed (Fig. 12.6). The median (50th percentile) is \(146.1\) ML. However, the 95th percentile is \(3\ 480\) ML and the 99th percentile is \(19\ 043\) ML.
Constructing infrastructure for the median streamflow would be inadequate.
12.6.5 Which measure of variation to use?
Which is the 'best' measure of variation for quantitative data? As with measures of location, it depends on the data.
Since the standard deviation formula uses the mean, it is impacted in the same way as the mean by outliers and skewness. Hence, the standard deviation is best used with approximately symmetric data. The IQR is best used when data are skewed or asymmetric. Sometimes, both the standard deviation and the IQR can be quoted.
12.7 Summaries: shape
Describing the shape can be difficult, but introducing terminology helps:
- Right (or positively) skewed: most data are smaller, with some larger values.
- Left (or negatively) skewed: most data are larger, with some smaller values.
- Symmetric data: Approximately equal numbers of values are smaller and larger.
- Bimodal data: The distribution has two peaks.
The carousel below (click the left and right arrows to move through the example plots) shows typical shapes. Sometimes, no short descriptions are suitable. While symmetry and skewness can be described numerically, we will describe shape using words (skewed, approximately symmetric, bimodal, etc.).
Example 12.16 (Bimodal data) The Old Faithful geyser in Yellowstone National Park (USA) erupts regularly (Härdle et al. 1991). The original histogram for the time between eruptions (Fig. 12.3, left panel) is bimodal, with peaks near \(55\) min and \(80\) min.
12.8 Summaries: identifying outliers
Outliers are 'unusual' observations: those quite different (larger or smaller) than the bulk of the data. Outliers are 'unusual', but not necessarily 'wrong' or 'bad' observations. Rules for deciding if an observation is an outlier are always arbitrary.
Definition 12.11 (Outliers) An outlier is an observation that is 'unusual' (either larger or smaller) compared to the bulk of the data. Rules for identifying outliers are arbitrary.
Two rules for identifying outliers are:
- the standard deviation rule, useful when the data have an approximately symmetric distribution (Sect. 12.8.1).
- the IQR rule, useful in other situations (Sect. 12.8.2).
12.8.1 The standard deviation rule
This rule for identifying outliers applies for approximately symmetric distributions.
Definition 12.12 (Standard deviation rule for identifying outliers) For approximately symmetric distributions, an observation more than three standard deviations from the mean may be considered an outlier.
This rule uses the mean and the standard deviation, so is suitable for approximately symmetric distributions (when means and standard deviations are sensible numerical summaries). The rationale behind this rule is explained in Sect. 22.3.
All rules for identifying outliers are arbitrary, and the standard deviation rule is sometimes given slightly differently. For example, outliers may be identified as observations more than \(2.5\) standard deviations away from the mean. Both rules are acceptable, since the definition is arbitrary.
Example 12.17 (Standard deviation rule for identifying outliers) An engineering project (Hald 1952) studied a new building material, to estimate the average permeability. Permeability time (the time for water to permeate the sheets) was measured from \(81\) pieces of material (in seconds).
For these data, the mean is \(\bar{x} = 43.162\) and the standard deviation is \(s = 27.358\). Using the standard deviation rule, outliers are observations smaller than \(43.162 - (3\times 27.358)\) or larger than \(43.162 + (3\times 27.358)\); that is, smaller than \(-38.9\) (which is clearly not appropriate here), or larger than \(125.2\). This is shown in Fig. 12.11; two observations are identified as outliers using the standard deviation rule.
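The limits in this example can be reproduced with a short Python sketch. Only the mean and standard deviation quoted above are used, since the \(81\) observations are not listed here. The function name `sd_rule_limits` and the parameter `k` (the number of standard deviations) are assumptions made for this illustration.

```python
def sd_rule_limits(mean, sd, k=3.0):
    """Return the (lower, upper) limits for the standard deviation rule.

    Observations below the lower limit or above the upper limit
    may be considered outliers.
    """
    return mean - k * sd, mean + k * sd


# Summary statistics quoted in Example 12.17 (permeability time, in seconds)
lower, upper = sd_rule_limits(mean=43.162, sd=27.358)
print(f"Possible outliers: below {lower:.1f} or above {upper:.1f}")

# The alternative version of the rule, using 2.5 standard deviations
lower_alt, upper_alt = sd_rule_limits(mean=43.162, sd=27.358, k=2.5)
print(f"Using k = 2.5: below {lower_alt:.1f} or above {upper_alt:.1f}")
```

The limits depend on the choice of `k`, which is another reminder that the rule is arbitrary.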
12.8.2 The IQR rule
Since the standard deviation rule for identifying outliers relies on the mean and standard deviation, it is not appropriate for non-symmetric distributions. Another rule is needed for identifying outliers in these situations: the IQR rule.
Definition 12.13 (IQR rule for identifying outliers) The IQR rule identifies mild and extreme outliers:
- Extreme outliers: observations \(3\times \text{IQR}\) more unusual than \(Q_1\) or \(Q_3\).
- Mild outliers: observations \(1.5\times \text{IQR}\) more unusual than \(Q_1\) or \(Q_3\) (that are not also extreme outliers).
This definition is easier to understand using an example.
Example 12.18 (IQR rule for identifying outliers) Using the permeability data seen in Example 12.17, a computer shows that \(Q_1 = 24.7\) and \(Q_3 = 50.6\), so \(\text{IQR} = {50.6 - 24.7 = 25.9}\). Then, extreme outliers are observations \(3\times 25.9 = 77.7\) more unusual than \(Q_1\) or \(Q_3\). That is, extreme outliers are observations:
- more unusual than \(24.7 - 77.7 = -53.0\) (that is, less than \(-53.0\)); or
- more unusual than \(50.6 + 77.7 = 128.3\) (that is, greater than \(128.3\)).
Mild outliers are observations \(1.5\times 25.9 = 38.9\) more unusual than \(Q_1\) or \(Q_3\) (that are not also extreme outliers). That is, mild outliers are
- more unusual than \(24.7 - 38.9 = -14.2\) (that is, less than \(-14.2\)); or
- more unusual than \(50.6 + 38.9 = 89.5\) (that is, greater than \(89.5\)).
Three observations are identified as outliers using the IQR rule (Fig. 12.12).
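The calculations in this example can be sketched in Python using only the quartiles quoted above. The function names `iqr_rule_limits` and `classify` are assumptions made for this illustration, and the values passed to `classify` are hypothetical rather than observations from the permeability data. The example rounds \(1.5\times \text{IQR}\) to \(38.9\) before computing the mild limits, so the unrounded limits from the code (\(-14.15\) and \(89.45\)) differ slightly in the second decimal place.

```python
def iqr_rule_limits(q1, q3):
    """Return the mild and extreme outlier limits for the IQR rule."""
    iqr = q3 - q1
    return {
        "mild": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "extreme": (q1 - 3 * iqr, q3 + 3 * iqr),
    }


def classify(x, q1, q3):
    """Label one observation using the IQR rule."""
    limits = iqr_rule_limits(q1, q3)
    low_mild, high_mild = limits["mild"]
    low_extreme, high_extreme = limits["extreme"]
    # Check the extreme limits first, so that mild outliers are
    # only those that are not also extreme outliers
    if x < low_extreme or x > high_extreme:
        return "extreme outlier"
    if x < low_mild or x > high_mild:
        return "mild outlier"
    return "not an outlier"


# Quartiles quoted in Example 12.18
limits = iqr_rule_limits(q1=24.7, q3=50.6)
print("Mild limits:", limits["mild"])
print("Extreme limits:", limits["extreme"])

# Hypothetical permeability times (in seconds), for illustration only
for time in [40.0, 100.0, 130.0]:
    print(time, "->", classify(time, q1=24.7, q3=50.6))
```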
12.8.3 Which outlier rule to use?
In summary, two common ways to identify outliers are the standard deviation rule (for approximately symmetric distributions), and the IQR rule (for any distribution, but primarily for those skewed or with outliers).
But remember: All rules for identifying outliers are arbitrary!
12.8.4 What to do with outliers?
What should be done if outliers are identified in data? Remember that outliers are unusual observations; they are not necessarily wrong. Deleting or removing outliers simply because they are identified as outliers is a very bad idea. After all, the outliers were obtained from your study like all other observations... they deserve to be in your data as much as any other observation. In addition, the rules for identifying outliers are arbitrary: some observations may be identified as outliers using one rule, but not by another.
The strategy for managing outliers depends on the reason for the outlier (P. K. Dunn and Smyth 2018, p. 138):
- The outlier is clearly a mistake (e.g., an age of \(222\)): If the mistake cannot be fixed (e.g., the completed survey form is lost), the observation can be deleted. Similarly, if the outliers come from an error or mistake in the data collection (e.g., too much fertiliser was accidentally applied), the observation can be deleted.
- The outlier represents a different population: Suppose an outlier is identified in a study of students, corresponding to a student aged 65. If the next oldest student in the data is aged \(39\), the outlier can be removed, since it belongs to a different population ('students aged over \(40\)') than the other observations ('students aged under \(40\)'). The remaining observations can be analysed, but the results only apply to students aged under 40 (which should be communicated).
- The reason for the outlier is unknown: Discarding the outliers routinely is not recommended, as the outliers are probably real observations that are just as valid as the others. Perhaps a different analysis is necessary (for example, using medians rather than means). Furthermore, very large datasets are expected to have observations incorrectly identified as outliers using the above rules.
In all cases, whenever observations are removed from a dataset, this should be clearly explained and documented.
Example 12.19 (Outliers) The Mary River dataset (Sect. 12.5) has many extremely large outliers identified by software, but each is reasonable. They probably correspond to flooding events. Removing these from the analysis would be inappropriate.
Example 12.20 (Outliers) The permeability data (Example 12.17) has large outliers, but all seem reasonable. Removing these from the analysis would be inappropriate.
12.9 Numerical summary tables
In studies with quantitative variables, these variables should be summarised in a table. The table should include, as a minimum, measures of average and variation. An example is given in the next section.
12.10 Example: water access
A study of three rural communities in Cameroon (López-Serrano et al. 2022) recorded data about access to water. Three quantitative variables are recorded. Part of understanding the data requires summarising the quantitative variables; histograms are shown in Fig. 12.13, and a summary table in Table 12.5.
A large number of households are coordinated by women in their late 50s. The number of people and number of children under 5 are both right-skewed. One household has over 30 people, and has 10 children in that household. (These are identified as outliers, but are not mistakes.)
Variable | \(n\) | Mean | Median | Std. dev. | IQR |
---|---|---|---|---|---|
Woman's age (years) | \(120\) | \(41.6\) | \(40.5\) | \(14.56\) | \(30.25\) |
Household size | \(121\) | \(\phantom{0}7.0\) | \(\phantom{0}6.0\) | \(\phantom{0}4.80\) | \(\phantom{0}4.00\) |
Children under \(5\) | \(120\) | \(\phantom{0}1.6\) | \(\phantom{0}1.0\) | \(\phantom{0}1.65\) | \(\phantom{0}2.00\) |
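A row of a numerical summary table like the one above could be produced in software. The following Python sketch is one possible approach; the function name `summarise` and the data in `household_sizes` are hypothetical (they are not the Cameroon data). Quartiles can be computed using slightly different conventions, so the IQR from other software may differ slightly; this sketch uses the `"inclusive"` method from Python's `statistics` module.

```python
import statistics


def summarise(values):
    """Return the sample size, and measures of average and variation."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # sample standard deviation
        "iqr": q3 - q1,
    }


# Hypothetical household sizes (NOT the data from the study)
household_sizes = [2, 4, 4, 5, 6, 7, 9, 11]

summary = summarise(household_sizes)
print("n | Mean | Median | Std. dev. | IQR")
print(
    f"{summary['n']} | {summary['mean']:.1f} | {summary['median']:.1f} | "
    f"{summary['sd']:.2f} | {summary['iqr']:.2f}"
)
```

Reporting the mean and median together, and the standard deviation and IQR together, allows readers to judge how much any skewness or outliers influence the summaries.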
12.11 Chapter summary
Quantitative data can be summarised numerically; the most common techniques are indicated in Table 12.6. The mean and standard deviation are usually used whenever possible, for practical and mathematical reasons. Sometimes quoting both the mean and median (and the standard deviation and IQR) may be appropriate.
Feature | Approximately symmetric | Not symmetric, or outliers |
---|---|---|
Average | Mean | Median |
Variation | Standard deviation | IQR |
Shape | Verbal description only | Verbal description only |
Outliers | Standard deviation rule | IQR rule |
12.12 Quick review questions
Are the following statements true or false?
- The IQR measures the amount of variability in data.
- The mean and the median can both be called the "average".
- The mean and the median are not always the same value.
- The range is a simple measure of variability in a set of data.
- The standard deviation measures the amount of variability in data.
- Another name for the median is \(Q_2\).
- \(Q_3\) is the median of the largest half of the data.
- The IQR is a useful measure of the amount of variation in data that are skewed.
- The IQR is the difference between the first and second quartiles.
- Another name for the \(75\)th percentile is \(Q_3\).
- The units of the standard deviation and the IQR are the same as for the original data.
12.13 Exercises
Selected answers are available in App. E.
Exercise 12.1 The Australian Bureau of Statistics (ABS) records the age at death of Australians. The histogram of the age at death for females in 2012 is shown in Fig. 12.14 (left panel). Describe the distribution.
Exercise 12.2 [Dataset: NHANES]
The histogram of the direct HDL cholesterol concentration from the American National Health and Nutrition Examination Survey (NHANES) (Pruim 2015) from \(1999\)--\(2004\) is shown in Fig. 12.14 (right panel).
Should the mean or median be used to measure the 'average' HDL cholesterol concentration? Explain.
Exercise 12.3 A study of amusement rides in the US (Levenson 2005) recorded the number of fatalities from \(1994\) to \(2003\) (Table 12.7). Using software or a calculator, compute:
- the mean number of fatalities per year over this period.
- the median number of fatalities per year over this period.
- the standard deviation of the number of fatalities per year over this period.
Year: | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 |
---|---|---|---|---|---|---|---|---|---|---|
Fatalities: | \(2\) | \(4\) | \(3\) | \(4\) | \(7\) | \(6\) | \(1\) | \(3\) | \(2\) | \(5\) |
Exercise 12.4 Furness and Bryant (1996) studied fulmars (a seabird). The masses of the female birds were (in grams): \(635\); \(635\); \(668\); \(640\); \(645\); \(635\).
- Construct a stemplot (using the first two digits as the stems).
- Using your calculator, find the value of the sample mean.
- Using your calculator, find the value of the sample standard deviation.
- Find the value of the sample median.
- Find the value of the population standard deviation.
Exercise 12.5 The average monthly SOI values in August from \(1995\) to \(2000\) are shown in Table 12.8. Draw a stemplot of the data. Then, use your calculator (where possible) to calculate the:
- sample mean.
- sample median.
- range.
- sample standard deviation.
Year: | 1995 | 1996 | 1997 | 1998 | 1999 | 2000 |
---|---|---|---|---|---|---|
Monthly average SOI: | \(0.8\) | \(4.6\) | \(-19.8\) | \(9.8\) | \(2.1\) | \(5.3\) |
Exercise 12.6 [Dataset: FriesWt]
Orders of French fries were weighed to determine how well they matched the target weight of \(171\) g (Wetzel 2005).
Using the data in Table 12.9:
- produce graphs to summarise the data.
- use software to produce numerical summary information.
Do you think the weights meet the target weight, on average?
\(117.0\) | \(132.0\) | \(134.0\) | \(139.0\) | \(141.0\) | \(143.0\) | \(146.0\) | \(152.0\) | \(154.0\) | \(155.0\) | \(157.0\) |
\(126.0\) | \(133.0\) | \(137.0\) | \(139.0\) | \(142.0\) | \(143.5\) | \(146.0\) | \(152.0\) | \(154.5\) | \(156.0\) | \(176.0\) |
\(128.0\) | \(133.0\) | \(138.0\) | \(140.0\) | \(142.5\) | \(145.0\) | \(151.0\) | \(154.0\) | \(154.5\) | \(156.5\) | \(117.0\) |
Exercise 12.7 [Dataset: Orthoses]
In a study of the influence of using ankle-foot orthoses in children with cerebral palsy (Swinnen et al. 2018), the data in Table 11.3 describe the \(15\) subjects.
- Compute the mean, median, standard deviation and IQR for the children's heights.
- Produce a stemplot of the children's heights.
- Produce a dot chart of the children's heights.
- Produce a histogram of the children's heights.
- Describe the distribution of the children's heights.
Exercise 12.8 An article studied patients who had been admitted to Castle Hill Hospital (Jenner et al. 2022). The total number of microplastics found in the lungs of each patient is shown in Table 12.10. For these patients:
- Draw a stemplot, using the numbers as (say) \(8.0\), with the decimal place as the leaves.
- What is the mean number of microplastics found?
- What is the median number of microplastics found?
- What is the standard deviation of the number of microplastics found?
- What is the IQR of the number of microplastics found?
\(8\) | \(3\) | \(5\) | \(2\) | \(0\) | \(2\) | \(1\) | \(7\) | \(5\) | \(1\) | \(0\) |