3.1 Numerical Measures

A population and a sample differ in what they represent and in how their characteristics are described:

| Measures of | Category | Population | Sample |
|---|---|---|---|
| - | What is it? | Reality | A small fraction of reality (inference) |
| - | Characteristics described by | Parameters | Statistics |
| Central Tendency | Mean | \(\mu = E(Y)\) | \(\hat{\mu} = \overline{y}\) |
| Central Tendency | Median | 50th percentile | \(y_{(\frac{n+1}{2})}\) (for odd \(n\)) |
| Dispersion | Variance | \(\sigma^2 = var(Y) = E[(Y-\mu)^2]\) | \(s^2=\frac{1}{n-1} \sum_{i = 1}^{n} (y_i-\overline{y})^2\) |
| Dispersion | Coefficient of Variation | \(\frac{\sigma}{\mu}\) | \(\frac{s}{\overline{y}}\) |
| Dispersion | Interquartile Range | Difference between the 75th and 25th percentiles; robust to outliers | |
| Shape | Skewness: standardized 3rd central moment (unitless) | \(g_1=\frac{\mu_3}{\mu_2^{3/2}}\) | \(\hat{g_1}=\frac{m_3}{m_2\sqrt{m_2}}\) |
| Shape | Central moments | \(\mu=E(Y)\), \(\mu_2 = \sigma^2=E(Y-\mu)^2\), \(\mu_3 = E(Y-\mu)^3\), \(\mu_4 = E(Y-\mu)^4\) | \(m_2=\sum_{i=1}^{n}(y_i-\overline{y})^2/n\), \(m_3=\sum_{i=1}^{n}(y_i-\overline{y})^3/n\) |
| Shape | Kurtosis (peakedness and tail thickness): standardized 4th central moment | \(g_2^*=\frac{E(Y-\mu)^4}{\sigma^4}\) | \(\hat{g_2}=\frac{m_4}{m_2^2}-3\) |
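
As a minimal sketch of the sample-side measures of central tendency and dispersion, the following R code computes each one directly from its formula and checks it against the corresponding base R function. The data vector `y` is hypothetical, chosen only for illustration, and `quantile()` is used with its default interpolation rule, which is one of several conventions for percentiles.

```r
# Hypothetical sample (illustrative values only; n = 9 is odd)
y <- c(2.1, 3.4, 3.9, 4.4, 5.0, 5.8, 6.3, 7.7, 12.5)
n <- length(y)

y_bar    <- sum(y) / n                       # sample mean
y_sorted <- sort(y)                          # order statistics y_(1), ..., y_(n)
y_med    <- y_sorted[(n + 1) / 2]            # median as the middle order statistic (odd n)
s2       <- sum((y - y_bar)^2) / (n - 1)     # sample variance with n - 1 divisor
cv       <- sqrt(s2) / y_bar                 # coefficient of variation s / y_bar
iqr      <- unname(diff(quantile(y, c(0.25, 0.75))))  # 75th minus 25th percentile

# Each hand-computed value should agree with the built-in function
stopifnot(
  isTRUE(all.equal(y_bar, mean(y))),
  y_med == median(y),
  isTRUE(all.equal(s2, var(y))),
  isTRUE(all.equal(iqr, IQR(y)))
)
```

The single large value in `y` pulls the mean above the median, while the interquartile range depends only on the middle half of the data.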

Note:

  • Order statistics: \(y_{(1)},y_{(2)},...,y_{(n)}\), the sample values sorted so that \(y_{(1)} \le y_{(2)} \le ... \le y_{(n)}\)

  • Coefficient of variation: standard deviation divided by the mean. It is a stable, dimensionless statistic, which makes it useful for comparing variability across variables measured on different scales.

  • Symmetric: mean = median, skewness = 0

  • Skewed right: mean > median, skewness > 0

  • Skewed left: mean < median, skewness < 0

  • Central moments: \(\mu=E(Y)\) , \(\mu_2 = \sigma^2=E(Y-\mu)^2\) , \(\mu_3 = E(Y-\mu)^3\), \(\mu_4 = E(Y-\mu)^4\)

  • For normal distributions, \(\mu_3=0\), so \(g_1=0\)

  • \(\hat{g_1}\) is distributed approximately as \(N(0,6/n)\) if the sample is from a normal population (valid when \(n > 150\))

    • For large samples, inference on skewness can be based on normal tables, with a 95% confidence interval for \(g_1\) of \(\hat{g_1}\pm1.96\sqrt{6/n}\)
    • For small samples, use the special tables in Snedecor and Cochran (1989), Table A 19(i), or a Monte Carlo test

  • Kurtosis > 0 (leptokurtic): heavier tails than a normal distribution with the same \(\sigma\) (e.g., the t-distribution)

  • Kurtosis < 0 (platykurtic): lighter tails than a normal distribution with the same \(\sigma\)

  • For a normal distribution, \(g_2^*=3\). Kurtosis is therefore often redefined as excess kurtosis, \(g_2=\frac{E(Y-\mu)^4}{\sigma^4}-3\), where the 4th central moment is estimated by \(m_4=\sum_{i=1}^{n}(y_i-\overline{y})^4/n\). The signs in the two bullets above refer to this excess kurtosis.

    • The asymptotic sampling distribution of \(\hat{g_2}\) is approximately \(N(0,24/n)\) for samples from a normal population (valid when \(n > 1000\))
    • For large samples, inference on kurtosis uses standard normal tables
    • For small samples, use the tables in Snedecor and Cochran (1989), Table A 19(ii), or Geary (1936)
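
The moment-based estimators and the large-sample intervals described in the notes can be sketched in base R as follows. The helper names `sample_moment()`, `g1_hat()`, and `g2_hat()` are hypothetical, introduced here only for illustration; the seed and the sample size of 2000 are arbitrary choices made so that both normal approximations (\(n > 150\) and \(n > 1000\)) plausibly apply.

```r
# k-th sample central moment m_k = sum((y_i - y_bar)^k) / n
sample_moment <- function(y, k) sum((y - mean(y))^k) / length(y)

# Sample skewness g1_hat = m_3 / m_2^(3/2)
g1_hat <- function(y) sample_moment(y, 3) / sample_moment(y, 2)^(3 / 2)

# Sample excess kurtosis g2_hat = m_4 / m_2^2 - 3
g2_hat <- function(y) sample_moment(y, 4) / sample_moment(y, 2)^2 - 3

set.seed(1)          # arbitrary seed, for reproducibility only
y <- rnorm(2000)     # simulated sample from a normal population
n <- length(y)

g1 <- g1_hat(y)
g2 <- g2_hat(y)

# Approximate 95% intervals based on N(0, 6/n) and N(0, 24/n)
ci_g1 <- g1 + c(-1, 1) * 1.96 * sqrt(6 / n)
ci_g2 <- g2 + c(-1, 1) * 1.96 * sqrt(24 / n)

print(round(c(g1 = g1, lower = ci_g1[1], upper = ci_g1[2]), 3))
print(round(c(g2 = g2, lower = ci_g2[1], upper = ci_g2[2]), 3))
```

An interval that contains 0 is consistent with the skewness or excess kurtosis of a normal population; it does not by itself establish normality.
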
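library(e1071)       # provides skewness() and kurtosis()
data <- rnorm(100)   # simulated standard normal sample; no seed is set, so results vary between runs
skewness(data)       # sample skewness; near 0 for normal data
#> [1] -0.2046225
kurtosis(data)       # excess kurtosis (3 already subtracted); near 0 for normal data
#> [1] -0.6313715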
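
For small samples, where the \(N(0,6/n)\) approximation is not reliable, the notes mention a Monte Carlo test. One possible version, sketched in base R under stated assumptions, simulates the null distribution of \(\hat{g_1}\) from normal samples of the same size and reports a two-sided p-value. The function name `mc_skewness_test()`, the default of 5000 simulations, the seed, and the exponential example data are all hypothetical choices for illustration, not a prescribed procedure.

```r
# Sample skewness g1_hat = m_3 / m_2^(3/2), using divisor n for both moments
g1_hat <- function(y) {
  d <- y - mean(y)
  mean(d^3) / mean(d^2)^(3 / 2)
}

# Monte Carlo test of H0: the sample comes from a normal population.
# g1_hat is invariant to location and scale, so standard normal draws suffice.
mc_skewness_test <- function(y, n_sim = 5000) {
  n <- length(y)
  g1_obs  <- g1_hat(y)
  g1_null <- replicate(n_sim, g1_hat(rnorm(n)))   # simulated null distribution
  # Two-sided p-value, with +1 so the estimate is never exactly zero
  p_value <- (1 + sum(abs(g1_null) >= abs(g1_obs))) / (n_sim + 1)
  list(statistic = g1_obs, p_value = p_value)
}

set.seed(42)            # arbitrary seed, for reproducibility only
y_small <- rexp(25)     # small sample from a right-skewed (exponential) population
res <- mc_skewness_test(y_small)
print(res)
```

A small p-value indicates that the observed skewness would be unusual for a normal sample of that size. The same pattern could be applied to kurtosis by swapping in the \(\hat{g_2}\) estimator.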