3.1 Numerical Measures
There are differences between a population and a sample:
Category | Measure | Population | Sample
---|---|---|---
What is it? | | Reality | A small fraction of reality (basis for inference)
Characteristics described by | | Parameters | Statistics
Central Tendency | Mean | \(\mu = E(Y)\) | \(\hat{\mu} = \overline{y}\)
Central Tendency | Median | 50th percentile | \(y_{(\frac{n+1}{2})}\)
Dispersion | Variance | \(\sigma^2 = var(Y) = E[(Y-\mu)^2]\) | \(s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \overline{y})^2\)
Dispersion | Coefficient of Variation | \(\frac{\sigma}{\mu}\) | \(\frac{s}{\overline{y}}\)
Dispersion | Interquartile Range | Difference between the 75th and 25th percentiles; robust to outliers | Difference between the sample 75th and 25th percentiles
Shape | Skewness (standardized 3rd central moment; unitless) | \(g_1 = \frac{\mu_3}{\sigma^3}\) | \(\hat{g}_1 = \frac{m_3}{m_2^{3/2}}\)
Shape | Central moments | \(\mu = E(Y)\), \(\mu_2 = \sigma^2 = E[(Y-\mu)^2]\), \(\mu_3 = E[(Y-\mu)^3]\), \(\mu_4 = E[(Y-\mu)^4]\) | \(m_2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \overline{y})^2\), \(m_3 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \overline{y})^3\)
Shape | Kurtosis (standardized 4th central moment; peakedness and tail thickness) | \(g_2^* = \frac{E[(Y-\mu)^4]}{\sigma^4}\) | \(\hat{g}_2 = \frac{m_4}{m_2^2} - 3\)
Notes:
Order Statistics: \(y_{(1)}, y_{(2)}, \ldots, y_{(n)}\), where \(y_{(1)} < y_{(2)} < \ldots < y_{(n)}\).
Coefficient of Variation:
- Defined as the standard deviation divided by the mean.
- Unitless, which makes it useful for comparing variability across variables measured on different scales.
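As a minimal sketch (with hypothetical data), the CV is unchanged by a change of measurement units, which is what makes it suitable for such comparisons:

```r
# Coefficient of variation: standard deviation divided by the mean (unitless)
cv <- function(x) sd(x) / mean(x)

# Hypothetical data: the same measurements in two different units
heights_cm <- c(160, 165, 170, 175, 180)  # centimeters
heights_m  <- heights_cm / 100            # meters

cv(heights_cm)  # same value as cv(heights_m): units cancel in the ratio
cv(heights_m)
```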
Symmetry:
- Symmetric distributions: Mean = Median; Skewness = 0.
- Skewed Right: Mean > Median; Skewness > 0.
- Skewed Left: Mean < Median; Skewness < 0.
Central Moments:
- \(\mu = E(Y)\)
- \(\mu_2 = \sigma^2 = E[(Y-\mu)^2]\)
- \(\mu_3 = E[(Y-\mu)^3]\)
- \(\mu_4 = E[(Y-\mu)^4]\)
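As a sketch, the sample central moments above, and the moment-based skewness and excess-kurtosis estimators built from them, can be computed directly (hypothetical data):

```r
# Sample central moment of order k: m_k = (1/n) * sum((y_i - ybar)^k)
central_moment <- function(y, k) mean((y - mean(y))^k)

y <- c(2, 4, 4, 4, 5, 5, 7, 9)  # hypothetical sample, ybar = 5

m2 <- central_moment(y, 2)  # 4
m3 <- central_moment(y, 3)  # 5.25
m4 <- central_moment(y, 4)  # 44.5

g1_hat <- m3 / m2^(3/2)  # sample skewness: 0.65625
g2_hat <- m4 / m2^2 - 3  # sample excess kurtosis: -0.21875
```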
Skewness (\(\hat{g}_1\))
- Sampling Distribution: for samples drawn from a normal population, \(\hat{g}_1\) is approximately distributed as \(N(0, \frac{6}{n})\) when \(n > 150\).
- Inference:
  - Large samples: inference on skewness can be based on the standard normal distribution. The 95% confidence interval for \(g_1\) is \[ \hat{g}_1 \pm 1.96 \sqrt{\frac{6}{n}} \]
  - Small samples: consult special tables, such as:
    - Snedecor and Cochran (1989), Table A 19(i)
    - Monte Carlo test results
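A small helper sketching the large-sample interval, assuming the \(N(0, \frac{6}{n})\) approximation above holds (the function name `skew_ci` is ours, not from any package):

```r
# Approximate 95% CI for skewness under normality, using g1_hat ~ N(0, 6/n).
# Only appropriate for large samples (roughly n > 150).
skew_ci <- function(g1_hat, n, level = 0.95) {
  z  <- qnorm(1 - (1 - level) / 2)  # 1.96 for a 95% interval
  se <- sqrt(6 / n)
  c(lower = g1_hat - z * se, upper = g1_hat + z * se)
}

skew_ci(g1_hat = 0.25, n = 200)
```

If the interval contains 0, the sample gives no evidence against symmetry at that level.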
Kurtosis (\(\hat{g}_2\))
- Definitions and Relationships:
  - A normal distribution has kurtosis \(g_2^* = 3\), so kurtosis is often redefined as excess kurtosis: \[ g_2 = \frac{E[(Y - \mu)^4]}{\sigma^4} - 3 \] which equals 0 for the normal distribution. The 4th central moment is estimated by \[ m_4 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \overline{y})^4 \]
- Sampling Distribution: for large samples (\(n > 1000\)) from a normal population, \(\hat{g}_2\) is approximately distributed as \(N(0, \frac{24}{n})\).
- Inference:
Kurtosis Value | Tail Behavior | Comparison to Normal Distribution
---|---|---
\(g_2 > 0\) (Leptokurtic) | Heavier tails | Examples: \(t\)-distributions
\(g_2 < 0\) (Platykurtic) | Lighter tails | Examples: uniform or certain bounded distributions
\(g_2 = 0\) (Mesokurtic) | Normal tails | Exactly matches the normal distribution
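A corresponding sketch for a large-sample test of normal tail behavior, assuming the \(N(0, \frac{24}{n})\) approximation above (the helper `kurtosis_z` is hypothetical):

```r
# z-statistic for excess kurtosis under normality: g2_hat ~ N(0, 24/n).
# Only appropriate for very large samples (roughly n > 1000).
kurtosis_z <- function(g2_hat, n) g2_hat / sqrt(24 / n)

z <- kurtosis_z(g2_hat = 0.4, n = 2000)
p <- 2 * pnorm(-abs(z))  # two-sided p-value against g2 = 0
```

A large positive \(z\) points to leptokurtic (heavy) tails, a large negative \(z\) to platykurtic (light) tails.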
```r
# Load the e1071 package for skewness and kurtosis functions
library(e1071)

# Generate random data from a normal distribution
data <- rnorm(100)

# Calculate skewness
skewness_value <- skewness(data)
cat("Skewness:", skewness_value, "\n")
#> Skewness: 0.362615

# Calculate kurtosis (excess kurtosis, so near 0 for normal data)
kurtosis_value <- kurtosis(data)
cat("Kurtosis:", kurtosis_value, "\n")
#> Kurtosis: -0.3066409
```
References
Geary, R. C. 1936. "Moments of the Ratio of the Mean Deviation to the Standard Deviation for Normal Samples." *Biometrika*, 295–307.
Snedecor, George W., and William G. Cochran. 1989. *Statistical Methods*. 8th ed. Ames: Iowa State University Press.