Chapter 6 Distribution shape measures

6.1 Skewness

The skewness coefficient is one way to measure the asymmetry (skewness) of a variable’s distribution.

\[\begin{equation} g_{1} = \frac{1}{n}\sum_{i=1}^n\left(\frac{x_i-\bar{x}}{\sigma_x}\right)^3 \tag{6.1} \end{equation}\]

Formula (6.1) is analogous to the “population” version of variance and standard deviation: \(\sigma_x\) is the standard deviation computed with divisor \(n\).

The modified skewness coefficient can be defined as follows:

\[\begin{equation} G_{1} = \frac{\sqrt{n(n-1)}}{n-2}g_{1} \tag{6.2} \end{equation}\]
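Formulas (6.1) and (6.2) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names `skewness_g1` and `skewness_G1` and the sample data are assumptions made for this example, not part of any package.

```python
import math
import statistics


def skewness_g1(x):
    """Formula (6.1): mean of cubed standardized deviations (population SD)."""
    n = len(x)
    mean = statistics.fmean(x)
    sigma = statistics.pstdev(x)  # population standard deviation, divisor n
    return sum(((xi - mean) / sigma) ** 3 for xi in x) / n


def skewness_G1(x):
    """Formula (6.2): modified coefficient; requires n > 2."""
    n = len(x)
    return math.sqrt(n * (n - 1)) / (n - 2) * skewness_g1(x)


# Hypothetical data with one large value, so the right tail is stretched.
data = [1, 2, 3, 10]
print(skewness_g1(data), skewness_G1(data))
```

For data with a stretched right tail, as here, both coefficients come out positive, and the correction factor in (6.2) approaches 1 as \(n\) grows.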

The skewness coefficient measures which end of the distribution plot (the “tail”) is more stretched out. If the left tail is stretched (values below the mean), the skewness coefficient is negative. If the right tail (values above the mean) is stretched, the coefficient is positive.

The interpretation of the skewness coefficient depends on the domain. For example, the following rules can be adopted:

  • If skewness is between -0.5 and 0.5, the data are fairly symmetrical (at most weak asymmetry is present).

  • If skewness is between -1 and -0.5 or between 0.5 and 1, the data are moderately skewed.

  • If skewness is less than -1 or greater than 1, the data are highly skewed.
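The rules of thumb above can be written as a small helper. The function name `describe_skewness` is hypothetical. The rules do not say which class a value of exactly 0.5 or 1 belongs to, so this sketch assumes the boundary goes to the milder class.

```python
def describe_skewness(g):
    """Label a skewness coefficient using the example thresholds 0.5 and 1."""
    size = abs(g)
    if size <= 0.5:  # assumption: the boundary 0.5 counts as fairly symmetrical
        return "fairly symmetrical"
    strength = "moderately" if size <= 1 else "highly"
    # A positive coefficient means the right tail is stretched, negative the left.
    direction = "right" if g > 0 else "left"
    return f"{strength} skewed ({direction} tail longer)"


for value in (0.2, -0.7, 1.8):
    print(value, "->", describe_skewness(value))
```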

6.2 Kurtosis

Excess kurtosis has the following formula:

\[\begin{equation} g_{2} = \frac{1}{n}\sum_{i=1}^n\left(\frac{x_i-\bar{x}}{\sigma_x}\right)^4-3 \tag{6.3} \end{equation}\]

The above formula can be treated as a “population” formula. The “sample” formula is typically as follows:

\[\begin{equation} G_{2} = \frac{n-1}{(n-2)(n-3)}\left[(n+1)g_{2}+6\right] \tag{6.4} \end{equation}\]

In some statistical packages, the following formula also appears, where \(s_x\) is the sample standard deviation (divisor \(n-1\)):

\[\begin{equation} b_{2} = \frac{1}{n}\sum_{i=1}^n\left(\frac{x_i-\bar{x}}{s_x}\right)^4-3 \tag{6.5} \end{equation}\]
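The three variants (6.3)–(6.5) differ only in the standard deviation used and in the small-sample correction. The following sketch, with hypothetical function names and sample data, computes all three using the Python standard library.

```python
import statistics


def kurtosis_g2(x):
    """Formula (6.3): mean fourth power of z-scores (population SD) minus 3."""
    n = len(x)
    mean = statistics.fmean(x)
    sigma = statistics.pstdev(x)  # divisor n
    return sum(((xi - mean) / sigma) ** 4 for xi in x) / n - 3


def kurtosis_G2(x):
    """Formula (6.4): sample version; requires n > 3."""
    n = len(x)
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * kurtosis_g2(x) + 6)


def kurtosis_b2(x):
    """Formula (6.5): like (6.3), but standardized with the sample SD."""
    n = len(x)
    mean = statistics.fmean(x)
    s = statistics.stdev(x)  # divisor n - 1
    return sum(((xi - mean) / s) ** 4 for xi in x) / n - 3


data = [1, 2, 3, 10]  # hypothetical data
print(kurtosis_g2(data), kurtosis_G2(data), kurtosis_b2(data))
```

Because \(s_x^2 = \frac{n}{n-1}\sigma_x^2\), the two simpler variants are linked by \(b_2 + 3 = (g_2 + 3)\left(\frac{n-1}{n}\right)^2\), so they converge for large \(n\).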

Excess kurtosis measures how heavy the tails of the distribution are, that is, how frequent and how large extreme values are, compared to the normal distribution, for which excess kurtosis equals 0.

The interpretation of kurtosis depends on the field. For example, the following scale can be proposed:

  • Between -0.5 and 0.5 – the distribution is approximately mesokurtic (extreme values occur with intensity similar to the intensity of extreme values in the normal distribution).

  • Between -1 and -0.5 – the distribution is moderately platykurtic (there are fewer extreme values, or they are smaller in magnitude than in the normal distribution).

  • Below -1 – the distribution is highly platykurtic.

  • From 0.5 to 1.0 – the distribution is moderately leptokurtic.

  • Between 1.0 and 5.0 – highly leptokurtic.

  • Above 5.0 – extremely leptokurtic.
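The proposed scale maps directly onto a chain of comparisons. In this sketch the function name `describe_excess_kurtosis` is hypothetical, and the handling of values that fall exactly on a boundary (which the scale leaves open) is an assumption: each boundary is assigned to the class closer to mesokurtic.

```python
def describe_excess_kurtosis(k):
    """Label an excess kurtosis value using the example scale."""
    if k < -1:
        return "highly platykurtic"
    if k < -0.5:
        return "moderately platykurtic"
    if k <= 0.5:  # assumption: -0.5 and 0.5 count as mesokurtic
        return "approximately mesokurtic"
    if k <= 1:
        return "moderately leptokurtic"
    if k <= 5:
        return "highly leptokurtic"
    return "extremely leptokurtic"


for value in (-1.2, -0.8, 0.1, 0.7, 3.0, 9.0):
    print(value, "->", describe_excess_kurtosis(value))
```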

6.3 Outliers

There is no single definition of an outlier. Most commonly, outliers are defined using position measures (quartiles) or standardized “z” values.

6.3.1 Identifying Outliers Using Position Measures

This method is the one used in box plots. Outliers are values that are either less than Q1 - 1.5 IQR or greater than Q3 + 1.5 IQR, where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1 is the interquartile range.

Of course, it is possible to adopt multipliers other than 1.5.

Some authors define outliers, and additionally distinguish “extreme values” that are either less than Q1 - 3 IQR, or greater than Q3 + 3 IQR. For others, outliers and extreme values are synonyms.
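The quartile-based rule can be sketched as follows. The function name `iqr_outliers`, its parameter `k`, and the sample data are assumptions for illustration. Quartiles can be computed in several ways; this sketch uses the "inclusive" method of Python's `statistics.quantiles`, and other quartile definitions may flag slightly different values.

```python
import statistics


def iqr_outliers(x, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [xi for xi in x if xi < lower or xi > upper]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical data
print(iqr_outliers(data))        # multiplier 1.5: outliers
print(iqr_outliers(data, k=3))   # multiplier 3: "extreme values"
```

Making the multiplier a parameter covers both the usual 1.5 rule and the stricter rule with 3 in a single function.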

6.3.2 Identifying Outliers Using Z-Scores

Another frequently used method relies on standardized values (z-scores). In this case, outliers are, for example, values that deviate from the mean by more than 3 standard deviations, that is, values with |z| > 3.
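A minimal sketch of the z-score rule is given below. The function name `zscore_outliers` and the data are hypothetical, and the choice of the sample standard deviation (divisor n - 1) is an assumption, since the rule can also be applied with the population version.

```python
import statistics


def zscore_outliers(x, threshold=3.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(x)
    s = statistics.stdev(x)  # assumption: sample SD, divisor n - 1
    if s == 0:
        return []  # all values identical, nothing can be an outlier
    return [xi for xi in x if abs((xi - mean) / s) > threshold]


# Hypothetical data: twenty observations, one of them far from the rest.
data = [9, 10, 11] * 6 + [10, 50]
print(zscore_outliers(data))
```

In very small samples no value can reach |z| > 3, because the outlier itself inflates the standard deviation; with the sample standard deviation the largest possible |z| is (n - 1)/√n, which is below 3 for n ≤ 10.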