Chapter 4 Measures of dispersion
When describing a distribution, we not only want to describe its location (using measures of central tendency), but also its dispersion (also called variability or spread).
Figure 4.1: An illustration of the centre and spread of a distribution.
4.1 Standard deviation
Standard deviation is perhaps the most popular measure of the dispersion of a feature’s distribution.
The standard deviation formula appears in two versions. In this script, we will denote them2 with the letters \(\widehat{\sigma}\) and \(s\), respectively.
\[\begin{equation} \widehat{\sigma}_X = \sqrt{\frac{\sum_{i=1}^n \left(x_i-\overline{x}\right)^2}{n}} \tag{4.1} \end{equation}\]
\[\begin{equation} s_X = \sqrt{\frac{\sum_{i=1}^n \left(x_i-\overline{x}\right)^2}{n-1}} \tag{4.2} \end{equation}\]
The \(X\) subscript is used in the above formulas to indicate that the standard deviation is calculated for the quantitative characteristic \(X\).
The measure \(s\) (formula (4.2)) is called the “sample” or “for-sample” standard deviation because it is usually preferred when the data being analyzed comes from a sample.
The measure \(\widehat{\sigma}\) (formula (4.1)) can also be used for a sample, but it is often called the “for-population” standard deviation. For populations, only this formula should be used. If we are dealing with a population, the “hat” over \(\sigma\) is omitted.
If you don’t know which formula to use (and there’s no one to tell you which formula they expect), use the formula (4.2).
In R, the sd() function calculates the standard deviation \(s\) using the formula (4.2). The formula for \(\widehat{\sigma}\) (4.1) is not included in standard R packages, and you must write your own function or install an additional package. The lack of a function to calculate \(\widehat{\sigma}\) in the standard package may provide additional clues as to which formula is typically preferred.
In spreadsheets, the standard deviation \(s\) (“for the sample”) is determined using the STDEV (STDEV) function – Google sheets, Excel or (equivalently) STDEV.S (STDEV.S) – Google sheets, Excel.
The standard deviation \(\widehat{\sigma}\) “for the population” can be calculated using the STDEVP function – Google Sheets, Excel or STDEV.P (STDEV.P) – Google Sheets, Excel.
4.1.1 Variance
The standard deviation is the square root of the variance. In other words, the variance is the square of the standard deviation.
There are, as you might guess, two formulas for variance:
\[\begin{equation} \widehat{\sigma}^2_X = \frac{\sum_{i=1}^n \left(x_i-\overline{x}\right)^2}{n} \tag{4.3} \end{equation}\]
\[\begin{equation} s^2_X = \frac{\sum_{i=1}^n \left(x_i-\overline{x}\right)^2}{n-1} \tag{4.4} \end{equation}\]
4.1.2 Coefficient of Variation
The coefficient of variation, which is the ratio of the standard deviation to the mean, can be a better measure of variability in some situations.
Its formula for a sample is:
\[\begin{equation} V_X = \frac{s_X}{\overline{x}} \tag{4.5} \end{equation}\]
The coefficient of variation can be used for quantitative variables on a ratio scale that take exclusively (or generally) positive values.
4.1.3 Using the standard deviation
The standard deviation is a key measure of dispersion in descriptive statistics. It indicates how much individual data points differ from the mean. This measure is commonly used for purposes such as:
Comparing variability across groups, either directly or using relative measures like the coefficient of variation.
Describing approximately normally distributed variables, where two parameters — the mean and the standard deviation — are sufficient to characterize the distribution.
Calculating effect sizes, such as Cohen’s d, which expresses the standardized difference between the means of two groups.
Computing standardized scores (z-scores) to assess how far a value lies from the mean.
Detecting extreme values or outliers within a dataset.
4.1.4 The standard deviation is not the mean deviation
One might propose to measure volatility using the mean deviation, or, more precisely, the mean absolute deviation:
\[\begin{equation} MAD_X = \frac{\sum_{i=1}^n |x_i-\overline{x}|}{n} \tag{4.6} \end{equation}\]
Although both the standard deviation and the mean absolute deviation (MAD) measure dispersion, they do so in different ways. The mean deviation represents the average of the absolute differences between each observation and the mean (or median). It gives a straightforward sense of how far, on average, data points are from the center. The standard deviation, in contrast, is based on squared differences from the mean. Because of the squaring, larger deviations have a greater influence on the result. As a consequence: the standard deviation is more sensitive to extreme values than the mean deviation and it is usually larger than the MAD.
4.2 Interquartile range
Another popular measure of dispersion based on positional measures is the interquartile range (IQR):
\[\begin{equation} IQR = Q_3 - Q_1 \tag{4.7} \end{equation}\]
where \(Q_1\) is the first quartile and \(Q_3\) is the third quartile.
4.2.1 Interquartile deviation and positional coefficient of variation
Interquartile deviation and positional coefficient of variation are sometimes introduced in the Polish literature. The interquartile deviation is half the IQR:
\[ Q = IQR/2 \]
The positional coefficient of variation is the ratio of the interquartile deviation to the median:
\[ V = Q/Me \]
4.3 Boxplot
The IQR and positional measure values allow you to create a boxplot (box-and-whisker plot). This plot most often displays the median, first quartile, third quartile, as well as the minimum and maximum.
Quite often, the minimum and maximum are determined without considering outliers, which are marked separately on the chart as dots. The typical definition of an outlier in this context assumes that outliers are either less than \(Q_1 - 1.5\cdot IQR\) or greater than \(Q_3 + 1.5\cdot IQR\), although other definitions of outliers are possible.
Figure 4.2: Understanding boxplots.
4.4 Links
Exploring and visualizing quantitative data - web application: https://istats.shinyapps.io/EDA_quantitative/
4.5 Questions
Question 4.1 (Freedman, Pisani, and Purves 2007)The two histogram sketches (density functions) are presented below. Which variable is more dispersed?

4.6 Exercises
Exercise 4.1 Select one of the following sets of numbers and compute the “population” version of the standard deviation (\(\widehat{\sigma}\)) by hand. Then, verify your result using your preferred software.
1, 7, 9, 10, 13
1, 1, 11, 11, 16
1, 1, 2, 4, 7, 7, 9, 13
1, 2, 5, 6, 7, 8, 9, 10
Exercise 4.2 Select one of the following sets of numbers and calculate the sample standard deviation (\(s\)) by hand. Then, verify your result using your preferred software.
1, 7, 13, 17
1, 8, 10, 12, 14
1, 9, 10, 12, 16, 18
1, 4, 5, 6, 6, 8, 15, 19
Exercise 4.3 According to Central Statistical Office data, there are 16 towns (cities and villages) in Poland named Dobra. Based on the table below, prepare a box plot summarizing the population (number of inhabitants) of these towns.
| town or village | population |
|---|---|
| the village of Dobra in the West Pomeranian Voivodeship in Police County | 4276 |
| the village of Dobra in the Lesser Poland Voivodeship in Limanowa County | 3217 |
| the town of Dobra in the West Pomeranian Voivodeship in Łobez County | 2103 |
| the town of Dobra in the Greater Poland Voivodeship in Turek County | 1358 |
| the village of Dobra in the Lower Silesian Voivodeship in Bolesławiec County | 1115 |
| the village of Dobra in the Opole Voivodeship in Krapkowice County | 797 |
| the village of Dobra in the Łódź Voivodeship in Zgierz County | 617 |
| the village of Dobra in the Subcarpathian Voivodeship in Przeworsk County | 468 |
| the village of Dobra in the Lower Silesian Voivodeship in Oleśnica County | 364 |
| the village of Dobra in the Subcarpathian Voivodeship in Sanok County | 286 |
| the village of Dobra in Silesian Voivodeship in Zawiercie County | 276 |
| village of Dobra in Świętokrzyskie Voivodeship in Staszów County | 261 |
| village of Dobra in Łódź Voivodeship in Łask County | 246 |
| village of Dobra in Pomeranian Voivodeship in Słupsk County | 102 |
| village of Dobra in Masovian Voivodeship in Płock County | 86 |
| village of Dobra in Greater Poland Voivodeship in Poznań County | 76 |
Exercise 4.4 Create a boxplot for the order amount based on order data from an online store orders.csv.
Exercise 4.5 Graphically compare the speeds of passenger cars and motorcycles using two side-by-side boxplots. Use SpeedRadarData.csv.
Literature
Note: These are not commonly used symbols—different texts use different symbols. The formula used should be recognized from the context (often the author provides it explicitly).↩︎