Chapter 12 Measures of spread

It is often important to know how much a set of numbers is spread out. That is, do all of the data cluster close to the mean, or are most values distant from the mean? For example, all of the numbers below are quite close to the mean of 5.0 (three numbers are exactly 5.0).

4.9, 5.3, 5.0, 4.7, 5.1, 5.0, 5.0

In contrast, all of the numbers that follow are relatively distant from the same mean of 5.0.

3.0, 5.6, 7.8, 1.2, 4.3, 8.2, 4.9

This chapter focuses on summary statistics that describe the spread of data. The approach in this chapter is similar to Chapter 11, which provided verbal and mathematical explanations of measures of central tendency. We will start with the most intuitive measures of spread: the range and inter-quartile range. Then, we will move on to some more conceptually challenging measures of spread: the variance, standard deviation, coefficient of variation, and standard error. These more challenging measures can be a bit confusing at first, but they are absolutely critical for doing statistics. The best approach to learning them is to see them and practice using them in different contexts, which we will do throughout this book.

12.1 The range

The range of a set of numbers is probably the most intuitive measure of spread. It is simply the difference between the highest and the lowest value of a dataset (Sokal & Rohlf, 1995). To calculate it, we just need to take the highest value minus the lowest value. If we want to be technical, then we can write a general equation for the range of a random variable \(X\),

\[\mathrm{Range}(X) = \max(X) - \min(X).\]

But really, all we need to worry about is finding the highest and lowest values, then subtracting. Consider again the two sets of numbers introduced at the beginning of the chapter. In examples, it is often helpful to imagine numbers as representing something concrete that has been measured, so suppose that these numbers are the measured masses (in grams) of leaves from two different plants. Below are the masses of plant A, in which leaf masses are very similar and close to the mean of 5.

4.9, 5.3, 5.0, 4.7, 5.1, 5.0, 5.0

Plant B masses are below, which are more spread out around the same mean of 5.

3.0, 5.6, 7.8, 1.2, 4.3, 8.2, 4.9

To get the range of plant A, we just need to find the highest (5.3 g) and lowest (4.7 g) mass, then subtract,

\[\mathrm{Range}(plant\:A) = 5.3 - 4.7 = 0.6\]

Plant A therefore has a range of 0.6 g. We can do the same for plant B, which has a highest value of 8.2 g and lowest value of 1.2 g,

\[\mathrm{Range}(plant\:B) = 8.2 - 1.2 = 7.0\]

Plant B therefore has a much higher range than plant A.

It is important to mention that the range is highly sensitive to outliers (Navarro & Foxcroft, 2022). Just adding a single number to either plant A or plant B could dramatically change the range. For example, imagine if we measured a leaf in plant A to have a mass of 19.7 g (i.e., we found a huge leaf!). The range of plant A would then be \(19.7 - 4.7 = 15\) instead of 0.6. Just this one massive leaf would then make the range of plant A more than double the range of plant B. This lack of robustness can really limit how useful the range is as a statistical measure of spread.
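To make the arithmetic concrete, below is a minimal Python sketch (the variable names are illustrative, not from jamovi) that reproduces the two ranges and shows how a single outlier inflates the range of plant A.

```python
# Leaf masses (in grams) for the two example plants
plant_A = [4.9, 5.3, 5.0, 4.7, 5.1, 5.0, 5.0]
plant_B = [3.0, 5.6, 7.8, 1.2, 4.3, 8.2, 4.9]

def value_range(values):
    """Return the range: highest value minus lowest value."""
    return max(values) - min(values)

print(value_range(plant_A))            # about 0.6 g
print(value_range(plant_B))            # about 7.0 g

# Adding one huge 19.7 g leaf dramatically inflates the range of plant A
print(value_range(plant_A + [19.7]))   # about 15 g
```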

12.2 The inter-quartile range

The inter-quartile range (usually abbreviated as ‘IQR’) is conceptually the same as the range. The only difference is that we are calculating the range between quartiles rather than the range between the highest and lowest numbers in the dataset. A general formula subtracting the first quartile (\(Q_{1}\)) from the third quartile (\(Q_{3}\)) is,

\[IQR = Q_{3} - Q_{1}.\]

Recall from Chapter 11 how to calculate first and third quartiles. As a reminder, we can sort the leaf masses for plant A below.

4.7, 4.9, 5.0, 5.0, 5.0, 5.1, 5.3

The first quartile is 4.95, and the third quartile is 5.05. The IQR of plant A is therefore,

\[IQR_{\mathrm{plant\:A}} = 5.05 - 4.95 = 0.1.\]

We can calculate the IQR for plant B in the same way. Here are the masses of plant B leaves sorted.

1.2, 3.0, 4.3, 4.9, 5.6, 7.8, 8.2

The first quartile of plant B is 3.65, and the third quartile is 6.70. To get the IQR of plant B,

\[IQR_{\mathrm{plant\:B}} = 6.70 - 3.65 = 3.05.\]

An important point about the IQR is that it is more robust than the range (Dytham, 2011). Recall that if we found an outlier leaf of 19.7 g on plant A, it would change the range of plant leaf mass from 0.6 to 15 g. The IQR is not nearly so sensitive. If we include the outlier, the first quartile for plant A changes from \(Q_{1} = 4.95\) to \(Q_{1} = 4.975\). The third quartile changes from \(Q_{3} = 5.05\) to \(Q_{3} = 5.150\). The resulting IQR is therefore \(5.150 - 4.975 = 0.175\). Hence, the IQR only changes from 0.1 to 0.175, rather than from 0.6 to 15. The one outlier therefore has a huge effect on the range, but only a modest effect on the IQR.
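As a check on the quartile arithmetic, the sketch below uses NumPy's default quartile method (linear interpolation), which happens to reproduce the quartiles used above. Be aware that statistical software packages differ in how they compute quartiles, so other conventions can give slightly different IQRs.

```python
import numpy as np

plant_A = [4.9, 5.3, 5.0, 4.7, 5.1, 5.0, 5.0]
plant_B = [3.0, 5.6, 7.8, 1.2, 4.3, 8.2, 4.9]

def iqr(values):
    """Inter-quartile range: third quartile minus first quartile."""
    q1, q3 = np.percentile(values, [25, 75])  # linear interpolation by default
    return q3 - q1

print(iqr(plant_A))            # about 0.1  (Q1 = 4.95, Q3 = 5.05)
print(iqr(plant_B))            # about 3.05 (Q1 = 3.65, Q3 = 6.70)

# The 19.7 g outlier only changes the IQR of plant A from 0.1 to 0.175
print(iqr(plant_A + [19.7]))   # about 0.175
```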

12.3 The variance

The range and inter-quartile range are reasonably intuitive, in the sense that it is not too difficult to think about what, for example, a range of 10 actually means in terms of the data. We now move to measures of spread that are less intuitive: the variance, standard deviation, coefficient of variation, and standard error. These can be confusing and unintuitive at first, but they are extremely useful. We will start with the variance; this section is long because I want to break the variance down carefully, step by step.

The sample variance of a dataset is a measure of the expected squared distance of data from the mean. To calculate the variance of a sample, we need to know the sample size (\(N\), i.e., how many measurements in total), and the mean of the sample (\(\bar{x}\)). We can calculate the variance of a sample (\(s^{2}\)) as follows,

\[s^{2} = \frac{1}{N - 1}\sum_{i = 1}^{N}\left(x_{i} - \bar{x} \right)^{2}.\]

This looks like a lot, but we can break down what the equation is doing verbally. First, we can look inside the summation (\(\sum\)). Here we are taking an individual measurement \(x_{i}\), subtracting the mean \(\bar{x}\), then squaring. We do this for each \(x_{i}\), summing up all of the values from \(i = 1\) to \(i = N\). This part of the equation is called the sum of squares (\(SS\)),

\[SS = \sum_{i = 1}^{N}\left(x_{i} - \bar{x} \right)^{2}.\]

That is, we need to subtract the mean from each value \(x_{i}\), square the result, and add everything up. Once we have this sum, \(SS\), then we just need to multiply by \(1 / (N - 1)\) to get the variance.

An example of how to do the actual calculation should help make it easier to understand what is going on. We can use the same values from plant A earlier.

4.9, 5.3, 5.0, 4.7, 5.1, 5.0, 5.0

To calculate the variance of plant A leaf masses, we start with the sum of squares. That is, take 4.9, subtract the sample mean of 5.0 (\(4.9 - 5.0 = -0.1\)), then square the result (\((-0.1)^{2} = 0.01\)). We do the same for 5.3, \((5.3 - 5.0)^{2} = 0.09\), and add it to the 0.01, then continue down the list of numbers finishing with 5.0. This is what the sum of squares calculation looks like written out,

\[SS = (4.9 - 5)^{2} + (5.3 - 5)^{2} + (5 - 5)^{2} + (4.7 - 5)^{2} + (5.1 - 5)^{2} + (5 - 5)^{2} + (5 - 5)^{2}.\]

Remember that the calculations in parentheses need to be done first, so the next step for calculating the sum of squares would be the following,

\[SS = (-0.1)^{2} + (0.3)^{2} + (0)^{2} + (-0.3)^{2} + (0.1)^{2} + (0)^{2} + (0)^{2}.\]

Next, we need to square all of the values,

\[SS = 0.01 + 0.09 + 0 + 0.09 + 0.01 + 0 + 0.\]

If we sum the above, we get \(SS = 0.2\). We now just need to multiply this by \(1 / (N - 1)\), where \(N = 7\) because this is the total number of measurements in the plant A dataset,

\[s^{2} = \frac{1}{7 - 1}\left(0.2\right).\]

From the above, we get a variance of approximately \(s^{2} = 0.0333\).
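For readers who like to check the arithmetic, the same step-by-step calculation can be written out in a few lines of Python. This is only a sketch that mirrors the sum of squares calculation above; the standard library's `statistics.variance` function gives the same answer directly.

```python
from statistics import mean, variance

plant_A = [4.9, 5.3, 5.0, 4.7, 5.1, 5.0, 5.0]

x_bar = mean(plant_A)                          # sample mean, 5.0
ss = sum((x - x_bar) ** 2 for x in plant_A)    # sum of squares, 0.2
s2 = ss / (len(plant_A) - 1)                   # divide by N - 1

print(s2)                  # about 0.0333
print(variance(plant_A))   # same result, computed directly
```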

Fortunately, it will almost never be necessary to calculate a variance manually in this way. Jamovi will do all of these steps and calculate the variance for us (Chapter 14 explains how). The only reason that I present the step-by-step calculation here is to help explain the equation for \(s^{2}\). The details can be helpful for understanding how the variance works as a measure of spread. For example, note that what we are really doing here is getting the distance of each value from the mean, \(x_{i} - \bar{x}\). If these distances tend to be large, then it means that most data points (\(x_{i}\)) are far away from the mean (\(\bar{x}\)), and the variance (\(s^{2}\)) will therefore increase. The differences \(x_{i} - \bar{x}\) are squared because we need all of the values to be positive, so that variance increases regardless of whether a value \(x_{i}\) is higher or lower than the mean. It does not matter if \(x_{i}\) is 0.1 lower than \(\bar{x}\) (i.e., \(x_{i} - \bar{x} = -0.1\)), or 0.1 higher (i.e., \(x_{i} - \bar{x} = 0.1\)). In both cases, the deviation from the mean is the same. Moreover, if we did not square the values, then the sum of \(x_{i} - \bar{x}\) values would always be 0 (you can try this yourself)9. Lastly, it turns out that the variance is actually a special case of a more general concept called the covariance, which we will look at later in Chapter 30 and which helps the squaring of differences make a bit more sense.

We sum up all of the squared deviations to get the \(SS\), then divide by the sample size minus 1, to get the mean squared deviation from the mean. That is, the whole process gives us the average squared deviation from the mean. But wait, why is it the sample size minus 1, \(N - 1\)? Why would we subtract 1 here? The short answer is that in calculating a sample variance, \(s^{2}\), we are almost always trying to estimate the corresponding population variance (\(\sigma^{2}\)). And if we were to just use \(N\) instead of \(N - 1\), then our \(s^{2}\) would be a biased estimate of \(\sigma^{2}\) (see Chapter 4 for a reminder on the difference between samples and populations). By subtracting 1, we are correcting for this bias to get a more accurate estimate of the population variance10. It is not necessary to do this ourselves; jamovi will do it automatically (The jamovi project, 2024). The subtraction is required due to a reduction in the degrees of freedom that occurs when using the sample mean in the equation for variance11.

Degrees of freedom is a difficult concept to understand, but we can broadly define the degrees of freedom as the number of independent pieces of information that we have when calculating a statistic (Grafen & Hails, 2022; Upton & Cook, 2014). When we need to estimate a parameter from a dataset in the process of calculating a statistic, we lose an independent piece of information, and therefore a degree of freedom (Pandey & Bright, 2008). For example, if we collect \(N = 10\) samples, then there are 10 independent pieces of information that we can use to calculate the mean. To calculate the variance, we use all of these 10 samples and \(\bar{x}\) (note that \(\bar{x}\) appears in the equation for variance above). This suggests that we actually have 11 independent pieces of information when making our calculation (all of the samples and the sample mean). But not all of these values are actually free to vary. When we know what the 10 sample values are, then the sample mean is fixed. And if we know what 9 sample values are and what the mean is, then we can work out what the last sample value must be. We therefore lose a degree of freedom by including \(\bar{x}\) in the calculation of variance, so we need to divide by \(N - 1\) rather than \(N\) to avoid bias (Fowler et al., 1998; Wardlaw, 1985). Another way of thinking about degrees of freedom is as the number of differences that can arise between what we sample versus what we expect just based on the mathematics (Fryer, 1966). In other words, how much is our calculation free to vary just due to chance? In calculating the variance, the mathematics introduces a constraint by including \(\bar{x}\), thereby reducing degrees of freedom.
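If the \(N - 1\) correction still seems mysterious, a small simulation can make the bias visible. The sketch below repeatedly draws small samples from a population whose true variance is known, then compares the average of the divide-by-\(N\) estimates with the average of the divide-by-\((N - 1)\) estimates. The population parameters, sample size, and number of repetitions are arbitrary choices for illustration.

```python
import random
from statistics import mean

random.seed(1)
true_sd = 1.0        # population standard deviation, so the true variance is 1.0
n = 5                # small samples make the bias easy to see
n_reps = 100_000

divide_by_n, divide_by_n_minus_1 = [], []
for _ in range(n_reps):
    x = [random.gauss(0.0, true_sd) for _ in range(n)]
    x_bar = mean(x)
    ss = sum((xi - x_bar) ** 2 for xi in x)
    divide_by_n.append(ss / n)
    divide_by_n_minus_1.append(ss / (n - 1))

print(mean(divide_by_n))            # about 0.8, underestimates the true variance of 1.0
print(mean(divide_by_n_minus_1))    # about 1.0, an unbiased estimate
```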

This was a lot of information. The variance is not an intuitive concept. In addition to being a challenge to calculate, the variance leaves us with a value in squared units. That is, for the example of plant leaf mass in grams, the variance is measured in grams squared, \(\mathrm{g^{2}}\), which is not particularly easy to interpret. For more on this, Navarro & Foxcroft (2022) have a really good section on the variance. Despite its challenges as a descriptive statistic, the variance has some mathematical properties that are very useful (Navarro & Foxcroft, 2022), especially in the biological and environmental sciences.

For example, variances are additive, meaning that if we are measuring two separate characteristics of a sample, A and B, then the variance of A+B equals the variance of A plus the variance of B, i.e., \(\mathrm{Var}(A + B) = \mathrm{Var}(A) + \mathrm{Var}(B)\).12 This is relevant to genetics when measuring heritability. Here, the total variance in the phenotype of a population (e.g., body mass of animals) can be partitioned into variance attributable to genetics plus variance attributable to the environment,

\[\mathrm{Var}(Phenotype) = \mathrm{Var}(Genotype) + \mathrm{Var}(Environment).\]

This is also sometimes written as \(V_{P} = V_{G} + V_{E}\). Heritability (\(H^{2} = V_{G} / V_{P}\)), calculated from these variance components, can be used to predict how a population will respond to natural selection. This is just one place where the variance reveals itself to be a highly useful statistic in practice. Nevertheless, as a descriptive statistic for communicating the spread of a variable, it usually makes more sense to calculate the standard deviation.
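Before moving on to the standard deviation, the additivity of variances is easy to demonstrate with a quick simulation. In the hedged sketch below, 'genetic' and 'environmental' values are drawn independently from normal distributions with arbitrary standard deviations, and the variance of their sum is compared with the sum of their variances.

```python
import random
from statistics import variance

random.seed(2)
n = 100_000
sd_g, sd_e = 2.0, 1.0   # arbitrary standard deviations chosen for illustration

G = [random.gauss(0.0, sd_g) for _ in range(n)]   # 'genetic' values
E = [random.gauss(0.0, sd_e) for _ in range(n)]   # independent 'environmental' values
P = [g + e for g, e in zip(G, E)]                 # phenotype = genotype + environment

print(variance(P))                    # about 5.0
print(variance(G) + variance(E))      # also about 5.0, because G and E are independent
print(variance(G) / variance(P))      # broad-sense heritability H^2, about 0.8
```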

12.4 The standard deviation

The standard deviation (\(s\)) is just the square root of the variance,

\[s = \sqrt{\frac{1}{N - 1}\sum_{i = 1}^{N}\left(x_{i} - \bar{x} \right)^{2}}.\]

Mathematically, this is a simple step, but the result is also easier to understand conceptually as a measure of spread (Navarro & Foxcroft, 2022). By taking the square root of the variance, our units are no longer squared, so we can interpret the standard deviation in the same terms as our original data. For example, the leaf masses of plant A and plant B in the example above were measured in grams. While the variance of these masses was in grams squared, the standard deviation is in grams, just like the original measurements. For plant A, we calculated a leaf mass variance of \(s^{2} = 0.0333\:\mathrm{g^{2}}\), which means that the standard deviation of leaf masses is \(s = \sqrt{0.0333\:\mathrm{g^{2}}} = 0.1825\:\mathrm{g}\). Because we are reporting \(s\) in the original units, it is a very useful measure of spread to report, and it is an important one to be able to interpret. To help with the interpretation, an interactive tool can show how the heights of trees in a forest change across different standard deviation values13. Another interactive tool can help show how the shape of a histogram changes when the standard deviation of a distribution is changed14. Chapter 14 explains how to calculate the standard deviation in jamovi.
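For a quick check outside of jamovi, the standard deviation of plant A can be computed as the square root of the variance, or directly with the standard library's `statistics.stdev` function (a sketch only):

```python
from statistics import stdev, variance

plant_A = [4.9, 5.3, 5.0, 4.7, 5.1, 5.0, 5.0]

print(variance(plant_A) ** 0.5)   # square root of the variance, about 0.18 g
print(stdev(plant_A))             # the same value, computed directly
```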

12.5 The coefficient of variation

The coefficient of variation (\(CV\)) is just the standard deviation divided by the mean,

\[CV = \frac{s}{\bar{x}}.\]

Dividing by the mean seems a bit arbitrary at first, but this can often be useful for comparing variables with different means or different units. The reason for this is that the units cancel out when dividing the standard deviation by the mean. For example, for the leaf masses of plant A, we calculated a standard deviation of 0.1825 \(\mathrm{g}\) and a mean of 5 \(\mathrm{g}\). We can see the units cancel below,

\[CV = \frac{0.1825\:\mathrm{g}}{5\:\mathrm{g}} = 0.0365.\]

The resulting \(CV\) of 0.0365 has no units; it is dimensionless (Lande, 1977). Because it has no units, it is often used to compare measurements with very different means or with different measurement units. For example, Sokal & Rohlf (1995) suggest that biologists might want to compare tail length variation between animals with very different body sizes, such as elephants and mice. The standard deviations of tail length in these two species will likely differ simply because of the difference in body size, so standardising by mean tail length makes it easier to compare their relative variation. This is a common application of the \(CV\) in biology, but it needs to be interpreted carefully (Pélabon et al., 2020).

Often, we will want to express the coefficient of variation as a percentage of the mean. To do this, we just need to multiply the \(CV\) above by 100%. For example, to express the \(CV\) as a percentage, we would multiply the 0.0365 above by 100%, which would give us a final answer of \(CV\) \(= 3.65\)%.
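For the plant A example, the whole calculation takes only a couple of lines (again, just a sketch using the same leaf masses as above):

```python
from statistics import mean, stdev

plant_A = [4.9, 5.3, 5.0, 4.7, 5.1, 5.0, 5.0]

cv = stdev(plant_A) / mean(plant_A)
print(cv)          # about 0.0365; the units cancel, so the CV is dimensionless
print(cv * 100)    # about 3.65, the CV expressed as a percentage of the mean
```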

12.6 The standard error

The standard error of the mean is the last measure of spread that I will introduce. It differs slightly from the previous measures in that it describes variation in the mean of a sample rather than variation in the sample itself. That is, the standard error tells us how far our sample mean \(\bar{x}\) is expected to deviate from the true mean \(\mu\). Technically, the standard error of the mean is the standard deviation of sample means rather than the standard deviation of individual samples. What does that even mean? It is easier to explain with a concrete example.

Imagine that we want to measure nitrogen levels in the water of Airthrey Loch (note that ‘loch’ is the Scottish word for ‘lake’). We collect 12 water samples and record the nitrogen concentration of each in milligrams per litre (mg/l). The measurements are reported below.

0.63, 0.60, 0.53, 0.72, 0.61, 0.48, 0.67, 0.59, 0.67, 0.54, 0.47, 0.87

We can calculate the mean of the above sample to be \(\bar{x} = 0.615\), and we can calculate the standard deviation of the sample to be \(s = 0.111\). We do not know what the true mean \(\mu\) is, but our best guess is the sample mean \(\bar{x}\). Suppose, however, that we then went back to the loch to collect another 12 measurements (assume that the nitrogen level of the lake has not changed in the meantime). We would expect to get values similar to our first 12 measurements, but certainly not the exact same measurements, right? The sample mean of these new measurements would also be a bit different. Maybe we actually go out and do this and get the following new sample.

0.47, 0.56, 0.72, 0.61, 0.54, 0.64, 0.68, 0.54, 0.48, 0.59, 0.62, 0.78

The mean of our new sample is 0.603, which is a bit different from our first. In other words, the sample means vary. We can therefore ask what the variance and standard deviation of the sample means are. Suppose that we kept going back out to the loch, collecting 12 new samples, and recording the sample mean each time. The standard deviation of those sample means would be the standard error. The standard error is the standard deviation of \(\bar{x}\) values around the true mean \(\mu\). But we do not actually need to go through this repetitive resampling process to estimate the standard error. We can estimate it with just the standard deviation and the sample size. To do this, we just need to take the standard deviation of the sample (\(s\)) and divide by the square root of the sample size (\(\sqrt{N}\)),

\[SE = \frac{s}{\sqrt{N}}.\]

In the case of the first 12 samples from the loch in the example above,

\[SE = \frac{0.111}{\sqrt{12}} = 0.032.\]
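The same calculation can be written in Python, using the 12 loch measurements above (the list name is illustrative):

```python
from math import sqrt
from statistics import mean, stdev

loch_sample = [0.63, 0.60, 0.53, 0.72, 0.61, 0.48,
               0.67, 0.59, 0.67, 0.54, 0.47, 0.87]

print(mean(loch_sample))                             # 0.615 mg/l
print(stdev(loch_sample))                            # about 0.111 mg/l
print(stdev(loch_sample) / sqrt(len(loch_sample)))   # standard error, about 0.032 mg/l
```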

The standard error is important because it can be used to evaluate the uncertainty of the sample mean in comparison with the true mean. We can use the standard error to place confidence intervals around our sample mean to express this uncertainty. We will calculate confidence intervals in Chapter 18, so it is important to understand what the standard error is measuring.

If the concept of standard error is still a bit unclear, we can work through one more hypothetical example. Suppose again that we want to measure the nitrogen concentration of a loch. This time, however, assume that we somehow know that the true mean nitrogen concentration is \(\mu = 0.7\) mg/l, and that the true standard deviation of water sample nitrogen concentration is \(\sigma = 0.1\) mg/l. Of course, we can never actually know the true parameter values, but we can use a computer to simulate sampling from a population in which the true parameter values are known. In Table 12.1, we simulate the process of going out and collecting 10 water samples from Airthrey Loch. This collection of 10 water samples is repeated 20 times. Each row of Table 12.1 is a different sampling effort, and the columns report the 10 measurements from that effort.

TABLE 12.1 Simulated samples (S1-S20) of nitrogen content from water samples of Airthrey Loch. Values are sampled from a normal distribution with a mean of 0.7 and a standard deviation of 0.1.
S1 0.72 0.82 0.62 0.75 0.62 0.68 0.61 0.59 0.65 0.80
S2 0.63 0.77 0.58 0.71 0.60 0.74 0.64 0.61 0.86 0.80
S3 0.62 0.70 0.68 0.50 0.89 0.72 0.83 0.64 0.79 0.69
S4 0.77 0.68 0.84 0.62 0.79 0.60 0.63 0.80 0.56 0.81
S5 0.72 0.68 0.67 0.68 0.94 0.67 0.58 0.71 0.58 0.69
S6 0.71 0.66 0.69 0.59 0.71 0.77 0.71 0.84 0.75 0.70
S7 0.83 0.54 0.75 0.58 0.61 0.68 0.61 0.65 0.69 0.79
S8 0.80 0.73 0.56 0.64 0.75 0.86 0.78 0.70 0.83 0.81
S9 0.64 0.72 1.07 0.58 0.79 0.64 0.66 0.64 0.56 0.65
S10 0.68 0.71 0.86 0.88 0.64 0.84 0.73 0.73 0.56 0.64
S11 0.77 0.62 0.82 0.82 0.74 0.78 0.90 0.62 0.68 0.76
S12 0.84 0.66 0.71 0.85 0.56 0.82 0.76 0.69 0.63 0.84
S13 0.70 0.54 0.77 0.77 0.58 0.72 0.52 0.59 0.65 0.74
S14 0.78 0.67 0.72 0.59 0.77 0.66 0.68 0.69 0.71 0.47
S15 0.71 0.71 0.71 0.73 0.80 0.62 0.63 0.86 0.55 0.64
S16 0.68 0.61 0.56 0.84 0.67 0.75 0.80 0.76 0.74 0.70
S17 0.69 0.81 0.66 0.59 0.90 0.82 0.79 0.65 0.83 0.76
S18 0.80 0.80 0.58 0.60 0.77 0.74 0.74 0.65 0.61 0.73
S19 0.58 0.69 0.63 0.69 0.75 0.82 0.67 0.55 0.62 0.74
S20 0.68 0.73 0.81 0.62 0.75 0.69 0.70 0.70 0.65 0.76

We can calculate the mean of each sample by calculating the mean of each row. These 20 means are reported below.

0.686, 0.694, 0.706, 0.710, 0.692, 0.713, 0.673, 0.746, 0.695, 0.727, 0.751, 0.736, 0.658, 0.674, 0.696, 0.711, 0.750, 0.702, 0.674, 0.709

The standard deviation of the 20 sample means reported above is 0.0265613. Now suppose that we only had Sample 1 (i.e., the first row of Table 12.1). The standard deviation of Sample 1 is \(s = 0.0824891\). We can calculate the standard error from this one sample below,

\[SE = \frac{0.0824891}{\sqrt{10}} = 0.0260853.\]

The estimate of the standard error from calculating the standard deviation of the sample means is therefore 0.0265613, and the estimate from just using the standard error formula and data from only Sample 1 is 0.0260853. These are reasonably close, and they would be even closer if we had either a larger sample size in each sample (i.e., higher \(N\)) or a larger number of samples.
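The resampling thought experiment behind Table 12.1 is straightforward to simulate. The sketch below draws 20 samples of 10 values each from a normal distribution with \(\mu = 0.7\) and \(\sigma = 0.1\), then compares the standard deviation of the 20 sample means with the standard error estimated from the first sample alone. Because the draws are random, the exact numbers will differ from those in Table 12.1, but both quantities should land near the true standard error of \(0.1 / \sqrt{10} \approx 0.032\).

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(3)
n_samples, n_per_sample = 20, 10

# 20 simulated sampling efforts of 10 water samples each
samples = [[random.gauss(0.7, 0.1) for _ in range(n_per_sample)]
           for _ in range(n_samples)]

sample_means = [mean(s) for s in samples]
print(stdev(sample_means))                         # standard deviation of the 20 sample means
print(stdev(samples[0]) / sqrt(n_per_sample))      # SE estimated from the first sample only
print(0.1 / sqrt(n_per_sample))                    # true standard error, about 0.0316
```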

References

Dytham, C. (2011). Choosing and Using Statistics: A Biologist’s Guide (p. 298). John Wiley & Sons, West Sussex, UK.
Fowler, J., Cohen, L., & Jarvis, P. (1998). Practical Statistics for Field Biology (2nd ed., p. 259). John Wiley & Sons.
Fryer, H. C. (1966). Concepts and Methods of Experimental Statistics (p. 602). Allyn & Bacon, Boston, USA.
Grafen, A., & Hails, R. (2022). Modern Statistics for the Life Sciences (p. 351). Oxford University Press, Oxford, UK.
Lande, R. (1977). On comparing coefficients of variation. Systematic Zoology, 26(2), 214–217.
Navarro, D. J., & Foxcroft, D. R. (2022). Learning Statistics with Jamovi (pp. 1–583). (Version 0.75). https://doi.org/10.24384/hgc3-7p15
Pandey, S., & Bright, C. L. (2008). What are degrees of freedom? Social Work Research, 32(2), 119–128.
Pélabon, C., Hilde, C. H., Einum, S., & Gamelon, M. (2020). On the use of the coefficient of variation to quantify and compare trait variation. Evolution Letters, 4(3), 180–188. https://doi.org/10.1002/evl3.171
Sokal, R. R., & Rohlf, F. J. (1995). Biometry (3rd ed., p. 887). W. H. Freeman & Company, New York, USA.
The jamovi project. (2024). Jamovi (version 2.5). https://www.jamovi.org
Upton, G., & Cook, I. (2014). Dictionary of Statistics (3rd ed., p. 488). Oxford University Press, Oxford, UK.
Wardlaw, A. C. (1985). Practical Statistics for Experimental Biologists (p. 290). John Wiley & Sons, Chichester, UK.

  9. If you are wondering why we square the difference \(x_{i} - \bar{x}\) instead of just taking its absolute value, this is an excellent question! You have just invented something called the mean absolute deviation. There are some reasons why the mean absolute deviation is not as good a measure of spread as the variance. Navarro & Foxcroft (2022) explain the mean absolute deviation, and how it relates to the variance, very well in section 4.2.3 of their textbook. We will not get into these points here, but it would be good to check out Navarro & Foxcroft (2022) for more explanation.↩︎

  10. To get the true population variance \(\sigma^{2}\), we would also need to know the true mean \(\mu\). But we can only estimate \(\mu\) from the sample, \(\bar{x}\). That is, what we would really want to calculate is \(x_{i} - \mu\), but the best we can do is \(x_{i} - \bar{x}\). The consequence of this is that there will be some error that underestimates the true distance of \(x_{i}\) values from the population mean, \(\mu\). Here is the really cool part: to determine the extent to which our estimate of the variance is biased by using \(\bar{x}\) instead of \(\mu\), we just need to know the expected squared difference between the two values, \((\bar{x} - \mu)^{2}\). It turns out that this difference (i.e., the bias of our estimate \(s^{2}\)) is just \(\sigma^{2} / N\), that is, the true variance of the population divided by the sample size. If we subtract this value from \(\sigma^{2}\), so \(\sigma^{2} - \sigma^{2}/N\), then we get the expected value of the variance that we would estimate from the sample if we simply divided by \(N\). We can rearrange \(\sigma^{2} - \sigma^{2}/N\) to get \(\sigma^{2} \times (N - 1)/N\), which means that we need to correct our sample variance by \(N / (N-1)\) to get an unbiased estimate of \(\sigma^{2}\). If all of this is confusing, that is okay! This is really only relevant for those interested in statistical theory, which is not the focus of this book.↩︎

  11. In the case of sample variance, note that we needed to use all the values \(x_{i}\) in the dataset and the sample mean \(\bar{x}\). But if we know what all of the \(x_{i}\) values are, then we also know \(\bar{x}\). And if we know all but one value of \(x_{i}\) and \(\bar{x}\), then we could figure out the last \(x_{i}\). Hence, while we are using \(N\) values in the calculation of \(s^{2}\), the use of \(\bar{x}\) reduces the degree to which these values are free to vary. We have lost 1 degree of freedom in the calculation of \(\bar{x}\), so we need to account for this in our calculation of \(s^{2}\) by dividing by \(N - 1\). This is another way to think about the \(N - 1\) correction factor (Sokal & Rohlf, 1995) explained in the previous footnote.↩︎

  12. This has one caveat, which is not important for now. Values of A and B must be uncorrelated. That is, A and B cannot covary. If A and B covary, i.e., \(\mathrm{Cov}(A, B) \neq 0\), then \(\mathrm{Var}(A+B) = \mathrm{Var}(A) + \mathrm{Var}(B) + 2\mathrm{Cov}(A, B)\). That is, we need to account for the covariance when calculating \(\mathrm{Var}(A+B)\).↩︎

  13. https://bradduthie.github.io/stats/app/forest/↩︎

  14. https://bradduthie.github.io/stats/app/normal_pos_neg/↩︎