Chapter 5 Standard deviation and distribution shape

5.1 Variance and standard deviation

There are two versions of the variance formula. One is often referred to as the “population” variance, the other as the “sample” variance.

The population variance formula assumes the population has the size \(N\):

\[\begin{equation} \sigma^2_x = \frac{\sum_{i=1}^N \left(x_i-\bar{x}\right)^2}{N} \tag{5.1} \end{equation}\]

The sample formula is the following (assuming a sample of size \(n\)):

\[\begin{equation} s^2_x = \frac{\sum_{i=1}^n \left(x_i-\bar{x}\right)^2}{n-1} \tag{5.2} \end{equation}\]

We will usually use the sample variance (5.2).

In spreadsheets, the sample variance can be calculated with the VAR function (Google Sheets, Excel), which is equivalent to VAR.S (Google Sheets, Excel).

The population variance can be calculated with the VARP or VAR.P function (Google Sheets, Excel).
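As a minimal sketch, formulas (5.1) and (5.2) can also be written out in Python and compared with the standard `statistics` module. The data values and the function names `population_variance` and `sample_variance` are illustrative assumptions, not part of the course materials.

```python
import statistics

# Illustrative values (an assumption, not taken from the course data)
data = [2, 4, 4, 4, 5, 5, 7, 9]


def population_variance(xs):
    """Formula (5.1): sum of squared deviations from the mean divided by N."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def sample_variance(xs):
    """Formula (5.2): the same sum divided by n - 1."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


print(population_variance(data))  # 4.0
print(sample_variance(data))      # about 4.571 (32 / 7)

# The standard library offers the same two versions:
print(statistics.pvariance(data))  # population version, like VAR.P
print(statistics.variance(data))   # sample version, like VAR.S
```

The only difference between the two functions is the divisor, so the sample variance is always slightly larger than the population variance computed from the same values.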

The standard deviation is the square root of the variance. It also has a population version and a sample version:

\[\begin{equation} \sigma_x = \sqrt{\sigma^2_x} = \sqrt{\frac{\sum_{i=1}^N \left(x_i-\bar{x}\right)^2}{N}} \tag{5.3} \end{equation}\]

\[\begin{equation} s_x = \sqrt{s^2_x} = \sqrt{\frac{\sum_{i=1}^n \left(x_i-\bar{x}\right)^2}{n-1}} \tag{5.4} \end{equation}\]

In spreadsheets, the sample standard deviation is calculated with the STDEV function (Google Sheets, Excel) or, equivalently, STDEV.S (Google Sheets, Excel).

The population standard deviation can be obtained with the STDEVP function (Google Sheets, Excel) or STDEV.P (Google Sheets, Excel).
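For readers who prefer code to spreadsheets, the sketch below computes (5.3) and (5.4) in Python as square roots of the two variances. The data values are an illustrative assumption; `statistics.pstdev` and `statistics.stdev` are the standard-library functions for the population and sample versions.

```python
import math
import statistics

# Illustrative values (an assumption, not taken from the course data)
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Formula (5.3): square root of the population variance
sigma = math.sqrt(statistics.pvariance(data))

# Formula (5.4): square root of the sample variance
s = math.sqrt(statistics.variance(data))

print(sigma)  # 2.0
print(s)      # about 2.138

# Direct equivalents in the standard library
print(statistics.pstdev(data))  # population version, like STDEV.P
print(statistics.stdev(data))   # sample version, like STDEV.S
```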

The standard deviation describes the dispersion of the variable in question. It should not be read as an average deviation (the mean absolute deviation is usually smaller). Instead, the standard deviation is often interpreted according to the “empirical rule” (https://en.wikipedia.org/wiki/68–95–99.7_rule), which states that for many datasets, in particular those that are approximately normally distributed, roughly 68% of the observations lie in the interval between mean - 1 standard deviation and mean + 1 standard deviation, and about 95% lie in the interval from mean - 2 sd to mean + 2 sd.
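The empirical rule can be checked on any numerical variable by counting the observations that fall inside the interval. The sketch below is one possible way to do it in Python; the helper name `share_within`, the use of the sample standard deviation, and the simulated normal data are all assumptions made for illustration.

```python
import random
import statistics


def share_within(xs, k):
    """Share of observations between mean - k*sd and mean + k*sd.

    Uses the sample standard deviation (5.4); this choice is an assumption.
    """
    mean = statistics.mean(xs)
    sd = statistics.stdev(xs)
    inside = [x for x in xs if mean - k * sd <= x <= mean + k * sd]
    return len(inside) / len(xs)


# Simulated, approximately normal data (illustrative only)
random.seed(1)
normal_like = [random.gauss(0, 1) for _ in range(10_000)]

print(share_within(normal_like, 1))  # expected to be close to 0.68
print(share_within(normal_like, 2))  # expected to be close to 0.95
```

For strongly skewed or heavy-tailed variables the shares may differ noticeably from 68% and 95%, which is exactly what such a check is meant to reveal.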

5.2 Skewness/asymmetry

The skewness coefficient is one of the ways to measure asymmetry of a variable.

\[\begin{equation} g_{1} = \frac{1}{n}\sum_{i=1}^n\left(\frac{x_i-\bar{x}}{\sigma_x}\right)^3 \tag{5.5} \end{equation}\]

Formula (5.5) is analogous to the “population” version of the variance and the standard deviation.

The modified skewness can be defined as follows:

\[\begin{equation} G_{1} = \frac{\sqrt{n(n-1)}}{n-2}g_{1} \tag{5.6} \end{equation}\]

The asymmetry coefficient indicates which tail of the distribution is more extended: the left tail, below the mean (negative values of the coefficient), or the right tail, above the mean (positive values).

In spreadsheets, the modified asymmetry coefficient (\(G_1\)) can be calculated using the SKEW function (Google Sheets, Excel), and the \(g_1\) coefficient using the SKEW.P function (Google Sheets, Excel).
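Formulas (5.5) and (5.6) translate almost directly into code. The following is a minimal Python sketch; the function names `skewness_g1` and `skewness_G1` and the data values are assumptions made for illustration.

```python
import math

# Illustrative values (an assumption, not taken from the course data)
data = [2, 4, 4, 4, 5, 5, 7, 9]


def skewness_g1(xs):
    """Formula (5.5): mean of cubed deviations standardized by sigma_x."""
    n = len(xs)
    mean = sum(xs) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)  # population sd
    return sum(((x - mean) / sigma) ** 3 for x in xs) / n


def skewness_G1(xs):
    """Formula (5.6): g1 multiplied by sqrt(n(n-1)) / (n-2)."""
    n = len(xs)
    return math.sqrt(n * (n - 1)) / (n - 2) * skewness_g1(xs)


print(skewness_g1(data))  # 0.65625
print(skewness_G1(data))  # about 0.8185
```

Both values are positive here because the largest observation (9) lies further from the mean than the smallest one (2), so the right tail is more extended. Note that \(G_1\) requires at least three observations.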

Size of the skewness: let us agree on the following rule of thumb (there are many such rules, and the interpretation depends on the domain):

  • If the skewness is between -0.5 and 0.5, the data are fairly symmetrical

  • If the skewness is between -1 and -0.5 or between 0.5 and 1, the data are moderately skewed

  • If the skewness is less than -1 or greater than 1, the data are highly skewed
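The rule of thumb above can be expressed as a small helper function. This is only a sketch: the name `describe_skewness` is hypothetical, and the treatment of the exact boundary values (0.5 and 1 are assigned to the lower category) is an assumption, since the rule itself does not specify it.

```python
def describe_skewness(skewness):
    """Label a skewness coefficient using the rule of thumb from the text.

    Assumption: boundary values 0.5 and 1 fall into the lower category.
    """
    size = abs(skewness)  # the rule is symmetric around zero
    if size <= 0.5:
        return "fairly symmetrical"
    if size <= 1:
        return "moderately skewed"
    return "highly skewed"


print(describe_skewness(0.2))   # fairly symmetrical
print(describe_skewness(-0.8))  # moderately skewed
print(describe_skewness(1.7))   # highly skewed
```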

5.3 Kurtosis

Excess kurtosis has the following formula:

\[\begin{equation} g_{2} = \frac{1}{n}\sum_{i=1}^n\left(\frac{x_i-\bar{x}}{\sigma_x}\right)^4-3 \tag{5.7} \end{equation}\]

The above formula can be thought of as a formula for the population. The formula for the sample is usually as follows:

\[\begin{equation} G_{2} = \frac{n-1}{(n-2)(n-3)}\left[(n+1)g_{2}+6\right] \tag{5.8} \end{equation}\]

Some statistical packages also use the following formula, which standardizes with the sample standard deviation \(s_x\):

\[\begin{equation} b_{2} = \frac{1}{n}\sum_{i=1}^n\left(\frac{x_i-\bar{x}}{s_x}\right)^4-3 \tag{5.9} \end{equation}\]

In spreadsheets, the KURT function (Google Sheets, Excel) calculates the \(G_2\) coefficient according to formula (5.8). The \(g_2\) and \(b_2\) coefficients can be calculated using the functions for the mean and the standard deviation together with array formulas.
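The three kurtosis formulas (5.7), (5.8) and (5.9) differ only in the standard deviation used and in the correction applied. A minimal Python sketch follows; the function names and the data values are assumptions made for illustration.

```python
import math

# Illustrative values (an assumption, not taken from the course data)
data = [2, 4, 4, 4, 5, 5, 7, 9]


def _excess_fourth_moment(xs, divisor):
    """Mean of fourth powers of standardized deviations, minus 3.

    `divisor` is N for the population sd and n - 1 for the sample sd.
    """
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / divisor)
    return sum(((x - mean) / sd) ** 4 for x in xs) / n - 3


def kurtosis_g2(xs):
    """Formula (5.7): standardized with the population sd sigma_x."""
    return _excess_fourth_moment(xs, len(xs))


def kurtosis_G2(xs):
    """Formula (5.8): sample version built from g2 (needs n > 3)."""
    n = len(xs)
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * kurtosis_g2(xs) + 6)


def kurtosis_b2(xs):
    """Formula (5.9): standardized with the sample sd s_x."""
    return _excess_fourth_moment(xs, len(xs) - 1)


print(kurtosis_g2(data))  # -0.21875
print(kurtosis_G2(data))  # about 0.9406
print(kurtosis_b2(data))  # about -0.8706
```

For a sample as small as this one the three coefficients can disagree even in sign, which is a reason to always state which formula was used.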

Kurtosis (excess kurtosis) measures the extremity of the tails in comparison with the normal distribution.

Let us agree on the rule-of-thumb scale:

  • Between -0.5 and 0.5 – the distribution is approximately mesokurtic

  • Between -1 and -0.5 – the distribution is moderately platykurtic

  • Below -1 – highly platykurtic

  • Between 0.5 and 1.0 – moderately leptokurtic

  • Between 1.0 and 5.0 – highly leptokurtic

  • Above 5.0 – extremely leptokurtic
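The scale above can be sketched as a small helper function. The name `describe_kurtosis` is hypothetical, and the assignment of exact boundary values (for example, whether 0.5 counts as mesokurtic) is an assumption, since the scale does not specify it.

```python
def describe_kurtosis(excess_kurtosis):
    """Label an excess kurtosis value using the rule-of-thumb scale.

    Assumption: -1 and -0.5 fall into the category above them,
    while 0.5, 1.0 and 5.0 fall into the category below them.
    """
    if excess_kurtosis < -1:
        return "highly platykurtic"
    if excess_kurtosis < -0.5:
        return "moderately platykurtic"
    if excess_kurtosis <= 0.5:
        return "approximately mesokurtic"
    if excess_kurtosis <= 1.0:
        return "moderately leptokurtic"
    if excess_kurtosis <= 5.0:
        return "highly leptokurtic"
    return "extremely leptokurtic"


print(describe_kurtosis(-1.3))  # highly platykurtic
print(describe_kurtosis(0.1))   # approximately mesokurtic
print(describe_kurtosis(7.2))   # extremely leptokurtic
```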

5.4 Exercises

Exercise 5.1 For the numerical data (quantitative variables) from the green cards (green.csv):

  1. Calculate the mean, the median, Q1, Q3, standard deviation, skewness and kurtosis.

  2. Which variable has the highest kurtosis? How do you interpret it?

  3. Which variable has the positive/negative skewness? How do you interpret it?

  4. Check if the empirical rule (plus/minus 1 sd, plus/minus 2 sds) works for the numerical variables.

Exercise 5.2 For the football players data (football2021.csv) calculate and interpret:

  1. The mean

  2. The standard deviation

  3. The median

  4. The lower quartile

  5. The upper quartile

  6. The skewness

  7. The kurtosis