Chapter 7 Statistics

Having described our data and its DGP, we now move on to describing statistics calculated using our data.

Chapter goals

In this chapter we will:

  • Use the theory of probability and random variables to model the statistics we have calculated from a data set.
  • Calculate and interpret the mean and variance of a statistic from its sampling distribution.
  • Calculate and interpret bias and mean squared error.
  • Explain the law of large numbers, central limit theorem, and Slutsky’s theorem.

7.1 Statistics and their properties

Suppose we have some statistic \(s_n =s(D_n)\), i.e., a number that is calculated from the data.

  • Since the data is observed/known, the value of the statistic is observed/known.
  • Since the elements of \(D_n\) are random variables, \(s_n\) is also a random variable with a well-defined (but unknown) probability distribution that depends on the unknown DGP.

Roulette wins

In our roulette example, the total number of wins is: \[R = x_1 + x_2 + x_3\] Since this is a number calculated from our data, it is a statistic.

Since \(x_i \sim Bernoulli(p)\), we can show that \(R \sim Binomial(3,p)\).

7.1.1 Some important statistics

I will use \(s_n\) to represent an abstract statistic, but we will often use other notation to talk about specific statistics.

The most important statistic is the sample average, which is defined as: \[\bar{x}_n = \frac{1}{n} \sum_{i=1}^n x_i\] We will also consider several other commonly-used univariate statistics (illustrated in the sketch after this list):

  • The sample variance of \(x_i\) is defined as: \[s_x^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2\] A closely-related statistic is the sample standard deviation \(s_x = \sqrt{s_x^2}\) which is the square root of the sample variance.
  • The sample frequency or relative sample frequency of the event \(x_i \in A\) is defined as the proportion of cases in which the event occurs: \[\hat{f}_A = \frac{1}{n} \sum_{i=1}^n I(x_i \in A)\] A closely-related statistic is the absolute sample frequency \(n\hat{f}_A\) which is the number of cases in which the event occurs.
  • The sample median of \(x_i\) is defined as any value \(m\) satisfying: \[\hat{m}_x = m: \begin{cases} \hat{f}_{x < m} \leq 0.5 \\ \hat{f}_{x > m} \leq 0.5 \\ \end{cases}\]
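Here is a minimal sketch of these four statistics computed in Python; the data values and the event \(x_i > 4\) are purely hypothetical choices for illustration.

```python
# A minimal sketch of the four statistics defined above,
# computed on a small hypothetical data set.
import statistics

x = [2.0, 5.0, 3.0, 7.0, 5.0, 4.0]       # hypothetical observations
n = len(x)

x_bar = sum(x) / n                        # sample average
s2 = statistics.variance(x)               # sample variance (n - 1 denominator)
s = statistics.stdev(x)                   # sample standard deviation
f_A = sum(xi > 4 for xi in x) / n         # relative sample frequency of x_i > 4
m_hat = statistics.median(x)              # sample median

print(x_bar, s2, s, f_A, m_hat)
```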

7.1.2 The sampling distribution

Since the data itself is a collection of random variables, any statistic calculated from that data is also a random variable, with a probability distribution that can be derived from the DGP.

The sampling distribution of the sample frequency

Calculating the exact probability distribution of most statistics is quite difficult, but it is easy to do for the sample frequency. Let \(p =\Pr(x_i \in A)\). Then: \[n\hat{f}_A \sim Binomial(n,p)\] In other words, we can calculate the exact probability distribution of the sample frequency using the formula for the binomial distribution.
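Since the binomial PMF is easy to evaluate, we can tabulate this sampling distribution directly. Here is a short sketch, with \(n = 10\) and \(p = 0.4\) chosen purely for illustration:

```python
# Exact sampling distribution of the sample frequency:
# Pr(f_hat_A = k/n) follows from the Binomial(n, p) PMF.
from math import comb

n, p = 10, 0.4                                     # illustrative values
for k in range(n + 1):
    prob = comb(n, k) * p**k * (1 - p)**(n - k)    # Pr(n * f_hat_A = k)
    print(f"f_hat = {k / n:.1f}   Pr = {prob:.4f}")
```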

Unfortunately, most other statistics have sampling distributions that are quite difficult to calculate.

To see why the sampling distribution of a statistic is so difficult to calculate, suppose we have a discrete random variable \(x_i\) whose support \(S_x\) has five elements. Then we need to calculate the sampling distribution of our statistic by adding up its probability across the support of \(D_n\). The support has \(5^n\) elements, a number that can quickly get very large.

For example, a typical data set in microeconomics has at least a few hundred or a few thousand observations. With 100 observations, \(D_n\) can take on \(5^{100} \approx 7.9 \times 10^{69}\) (that’s 79 followed by 68 zeros!) distinct values. With 1,000 observations, \(D_n\) can take on \(5^{1000}\) distinct values, a number too big for Excel to even calculate.
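If you want to check these magnitudes yourself, Python’s arbitrary-precision integers can represent these counts exactly, even though they are far beyond what a spreadsheet can handle:

```python
# Number of digits in the size of the support of D_n
# for a five-element S_x and n = 100 or n = 1,000 observations.
print(len(str(5**100)))     # 70 digits (about 7.9e69)
print(len(str(5**1000)))    # 699 digits
```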

7.1.3 The mean and variance

If our statistic has a probability distribution, it (usually) has a mean and variance as well. Under some circumstances, we can calculate them.

The mean of the sample average

Let \(\mu_x = E(x_i)\) be the mean of \(x_i\). Then the mean of \(\bar{x}_n\) is: \[E(\bar{x}_n) = E\left( \frac{1}{n} \sum_{i=1}^n x_i\right) = \frac{1}{n} \sum_{i=1}^n E\left( x_i\right) = \frac{1}{n} \sum_{i=1}^n \mu_x = \mu_x\]

This is an important and general result in statistics. The mean of the sample average in a random sample is identical to the mean of the random variable being averaged. \[E(\bar{x}_n) = E(x_i)\] We have shown this property specifically for a random sample, but it holds under many other sampling processes.

The variance of the sample average is not equal to the variance of the random variable being averaged, but they are closely related.

The variance of the sample average

To keep the math simple, suppose we only have \(n = 2\) observations. Then the sample average is: \[\bar{x} = \frac{1}{2}(x_1 + x_2)\] By our earlier formula for the variance: \[\begin{align} var(\bar{x}) &= var\left(\frac{1}{2}(x_1 + x_2)\right) \\ &= \left(\frac{1}{2}\right)^2 var(x_1 + x_2) \\ &= \frac{1}{4} \left( \underbrace{var(x_1)}_{\sigma_x^2} + 2 \underbrace{cov(x_1,x_2)}_{0 \textrm{(independence)}} + \underbrace{var(x_2)}_{\sigma_x^2} \right) \\ &= \frac{1}{4} \left( 2 \sigma_x^2 \right) \\ &= \frac{\sigma_x^2}{2} \\ \end{align}\]

More generally, the variance of the sample average in a random sample of size \(n\) is: \[var(\bar{x}_n) = \frac{\sigma_x^2}{n}\] where \(\sigma_x^2 = var(x_i)\).
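We can check both results by simulation. The sketch below draws many random samples of size \(n\) from a \(U(0,1)\) distribution (an illustrative choice, with \(\mu_x = 0.5\) and \(\sigma_x^2 = 1/12\)) and computes the mean and variance of the resulting sample averages:

```python
# Monte Carlo check: across many simulated random samples,
# the mean of x_bar is close to mu_x and its variance is
# close to sigma_x^2 / n.
import random
import statistics

random.seed(1)
n, reps = 25, 20_000
mu_x, sigma2_x = 0.5, 1 / 12       # mean and variance of U(0, 1)

xbars = [statistics.fmean(random.random() for _ in range(n))
         for _ in range(reps)]

print(statistics.fmean(xbars))     # close to mu_x = 0.5
print(statistics.variance(xbars))  # close to sigma2_x / n = 0.00333...
```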

Other commonly-used statistics also have a mean and variance.

The mean and variance of the sample frequency

Since the absolute sample frequency has the binomial distribution, we have already seen its mean and variance. Let \(p = \Pr(x_i \in A)\). Then \(n\hat{f}_A \sim Binomial(n,p)\) and: \[E(n\hat{f}_A) = np\] \[var(n\hat{f}_A) = np(1-p)\]

Applying the usual rules for expected values, the mean and variance of the relative sample frequency is: \[E(\hat{f}_A) = \frac{E(n\hat{f}_A)}{n} = \frac{np}{n} = p\] \[var(\hat{f}_A) = \frac{var(n\hat{f}_A)}{n^2} = \frac{np(1-p)}{n^2} = \frac{p(1-p)}{n} \]
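As a worked example, suppose we are estimating the probability of red on a single-zero roulette wheel, so that \(p = 18/37\), and we observe \(n = 100\) spins (both numbers are chosen just for illustration): \[E(\hat{f}_A) = \frac{18}{37} \approx 0.486 \qquad var(\hat{f}_A) = \frac{(18/37)(19/37)}{100} \approx 0.0025 \qquad sd(\hat{f}_A) \approx 0.05\]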

7.2 Estimation

One of the most important uses of statistics is to estimate, or guess at the value of, some unknown parameter of the DGP.

7.2.1 Parameters and estimators

A parameter is an unknown number characterizing a DGP.

Examples of parameters

Sometimes a single parameter completely describes the DGP:

  • In our roulette data set, the joint distribution of the data depends only on the single parameter \(p = \Pr(b \in Red)\).

Sometimes a group of parameters completely describe the DGP:

  • If \(x_i\) is a random sample from the \(U(L,H)\) distribution, then \(L\) and \(H\) are both parameters.

And sometimes a parameter only partially describes the DGP:

  • If \(x_i\) is a random sample from some unknown distribution with unknown mean \(\mu_x = E(x_i)\), then \(\mu_x\) is a parameter.
  • If \(x_i\) is a random sample from some unknown distribution with unknown median \(m_x = q_{0.5}(x_i)\), then \(m_x\) is a parameter.

Typically there will be particular parameters whose value we wish to know. Such a parameter is called a parameter of interest. Our model may include other parameters, which are typically called auxiliary parameters or nuisance parameters.

An estimator is a statistic that is being used to estimate (guess at the value of) an unknown parameter of interest. The distinction between estimator and estimate is a subtle one: we use “estimate” when talking about our statistic as a number calculated for a specific data set and “estimator” when talking about it as a random variable.

Commonly used estimators

Our four example statistics are commonly used as estimators:

  1. The relative sample frequency \(\hat{f}_A\) is typically used as an estimator of the probability \(p_A = \Pr(x_i \in A)\).
  2. The sample average \(\bar{x}\) is typically used as an estimator of the population mean \(\mu_x = E(x_i)\).
  3. The sample variance \(s_x^2\) is typically used as an estimator of the population variance \(\sigma_x^2 = var(x_i)\).
  4. The sample median \(\hat{m}_x\) is typically used as an estimator of the population median \(m_x = q_{0.5}(x_i)\).

Let \(s_n\) be a statistic we are using as an estimator of some parameter of interest \(\theta\). We can define its error as: \[err(s_n) = s_n - \theta\] In principle, we want \(s_n\) to be a good estimator of \(\theta\), i.e., we want \(err(s_n)\) to be as close to zero as possible.

There are several major complications to keep in mind:

  1. Since \(s_n\) is a random variable with a probability distribution, \(err(s_n)\) is also a random variable with a probability distribution.
  2. Since the value of \(\theta\) is unknown, the value of \(err(s_n)\) is also unknown.

Always remember that \(err(s_n)\) is not an inherent property of the statistic; it depends on the relationship between the statistic and the parameter of interest. A given statistic may be a good estimator of one parameter and a bad estimator of another.

7.2.2 Bias and the MVUE criterion

In choosing an estimator, we can consider several criteria.

The first is the bias of the estimator, which is defined as: \[bias(s_n) = E(err(s_n)) = E(s_n - \theta) = E(s_n) - \theta\] The bias represents the expected error.

Ideally we would want \(bias(s_n)\) to be zero, in which case we would say that \(s_n\) is an unbiased estimator of \(\theta\). Note that bias is always defined relative to the parameter we wish to estimate, and is not an inherent property of the statistic.

Two unbiased estimators of the mean

Consider the sample average \(\bar{x}_n\) in a random sample as an estimator of the parameter \(\mu_x = E(x_i)\). The bias is: \[bias(\bar{x}_n) = E(\bar{x}_n) - \mu_x = \mu_x - \mu_x = 0\] That is, the sample average is an unbiased estimator of the population mean.

However, it is not the only unbiased estimator. For example, suppose we simply take the value of \(x_i\) in the first observation and throw away the rest of the data. This “first observation estimator” is easier to calculate than the sample average, and is also an unbiased estimator of \(\mu_x\): \[bias(x_1) = E(x_1) - \mu_x = \mu_x - \mu_x = 0\] This example illustrates a general principle: there is rarely exactly one unbiased estimator. There are either none, or many.

An unbiased estimator of the variance

The sample variance is an unbiased estimator of the population variance: \[E(s_x^2) = \sigma_x^2 = var(x_i)\] This is not hard to prove, but I will skip it for now.

If there are multiple unbiased estimators available for a given parameter, we need to apply a second criterion to choose one. A natural second criterion is the variance of the estimator: \[var(s_n) = E[(s_n - E(s_n))^2]\] If \(s_n\) is unbiased, then a low variance means it is usually close to \(\theta\), while a high variance means that it is often either much larger or much smaller than \(\theta\). Clearly, low variance is better than high variance.

The minimum variance unbiased estimator (MVUE) of a parameter is the unbiased estimator with the lowest variance.

The variance of the sample average and first observation estimators

In our previous example, we found two unbiased estimators for the mean, the sample average \(\bar{x}_n\) and the first observation \(x_1\).

The variance of the sample average is: \[var(\bar{x}_n) = \sigma_x^2/n\] and the variance of the first observation estimator is: \[var(x_1) = \sigma_x^2\] For any \(n > 1\), the sample average \(\bar{x}_n\) has lower variance than the first observation estimator \(x_1\). Since they are both unbiased, it is the preferred estimator of the two.
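To put some illustrative numbers on this comparison, suppose \(\sigma_x^2 = 1\) and \(n = 100\): \[var(\bar{x}_{100}) = \frac{1}{100} = 0.01 \qquad \textrm{versus} \qquad var(x_1) = 1\] so the sample average has one-hundredth the variance of the first observation estimator.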

In fact, we can prove that \(\bar{x}_n\) is the minimum variance unbiased estimator of \(\mu_x\).

7.2.3 Mean squared error

Unfortunately, once we move beyond the simple case of estimating the population mean, we run into several complications:

The first complication is that an unbiased estimator may not exist for a particular parameter of interest. If there is no unbiased estimator, there is no minimum variance unbiased estimator. So we need some other way of choosing an estimator.

The sample median

There is no unbiased estimator of the median of a random variable with unknown distribution. To see why, consider the simplest possible data set: a random sample of size \(n=1\) on the random variable \(x_i \sim Bernoulli(p)\), where \(0 < p < 1\). The median of \(x_i\) in this case is: \[m_x = I(p > 0.5)\]

First we show that the sample median is a biased estimator of \(m_x\). The sample median is: \[\hat{m}_x = x_1\] and its expected value is: \[E(\hat{m}_x) = E(x_1) = p \neq I(p > 0.5)\] So the sample median \(\hat{m}_x\) is a biased estimator for the population median \(m_x\).

More generally, any statistic calculated from this data set must take the form \(s = a_0 + a_1x_1\): its value is \(a_0\) when \(x_1 = 0\) and \(a_0+a_1\) when \(x_1 = 1\). This statistic has expected value \(E(a_0 + a_1x_1) = a_0 + a_1p\), so any unbiased estimator would need constants \(a_0\) and \(a_1\) solving the equation: \[a_0 + a_1 p = I(p > 0.5)\] for every \(p\) in \((0,1)\). No such constants exist: the left side is a continuous (linear) function of \(p\), while the right side jumps from 0 to 1 at \(p = 0.5\).

The second complication is that we often have access to an unbiased estimator and a biased estimator with lower variance.

The relationship between age and earnings

One common question in labour economics is how earnings vary by various characteristics such as age.

Suppose we have a random sample of 800 Canadians, and we want to estimate the average earnings of 35-year-old Canadians. Assuming for simplicity that ages are equally-spaced between 0 and 80, our data set will have only 10 Canadians at each age. So we have several options:

  • Average earnings of the 10 35-year-olds in our data.

This estimator will be unbiased, but 10 observations isn’t very much and so its variance will be high. We can reduce the variance by adding more observations from people who are almost 35 years old:

  • Average earnings of the 30 34-36 year olds in our data.
  • Average earnings of the 100 30-39 year olds in our data.
  • Average earnings of the 800 0-80 year olds in our data.

By including more data, these estimators will have lower variance but will introduce bias. My guess is that including 34 and 36 year olds is a good idea, since they probably have similar earnings to 35 year olds, but including children and the elderly is not such a good idea.

This suggests that we need a criterion that

  • Can be used to choose between biased estimators
  • Can choose slightly biased estimators with low variance over unbiased estimators with high variance.

The mean squared error of an estimator is defined as: \[MSE(s_n) = E[err(s_n)^2] = E[(s_n-\theta)^2]\] We can do a little math and show that: \[MSE(s_n) = var(s_n) + [bias(s_n)]^2\] The MSE criterion allows us to choose a biased estimator with low variance over an unbiased estimator with high variance, and also allows us to choose between biased estimators when no unbiased estimator exists.
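For completeness, here is that little bit of math: add and subtract \(E(s_n)\) inside the square, and note that the cross term has expectation zero. \[\begin{align} MSE(s_n) &= E[(s_n - \theta)^2] \\ &= E\left[\left( (s_n - E(s_n)) + (E(s_n) - \theta) \right)^2\right] \\ &= \underbrace{E[(s_n - E(s_n))^2]}_{var(s_n)} + 2 (E(s_n) - \theta) \underbrace{E[s_n - E(s_n)]}_{0} + \underbrace{(E(s_n) - \theta)^2}_{[bias(s_n)]^2} \\ &= var(s_n) + [bias(s_n)]^2 \\ \end{align}\]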

The MSE of the sample mean and first observation estimators

The mean squared error of the sample average is: \[MSE(\bar{x}_n) = var(\bar{x}_n) + [bias(\bar{x}_n)]^2 = \frac{\sigma_x^2}{n} + 0^2 = \frac{\sigma_x^2}{n}\] and the mean squared error of the first observation estimator is: \[MSE(x_1) = \sigma_x^2\] The sample average is the preferred estimator by the MSE criterion, so in this case we get the same result as applying the MVUE criterion.

7.2.4 Standard errors

Parameter estimates are typically reported along with their standard errors. The standard error of a statistic is an estimate of its standard deviation.

The standard error of the average

We have shown that the sample average provides a good estimate of the population mean, and that its variance is: \[var(\bar{x}_n) = \frac{\sigma_x^2}{n} = \frac{var(x_i)}{n}\] Since \(s_x^2\) is an unbiased estimator of \(var(x_i)\), we can use it to construct an unbiased estimator of \(var(\bar{x}_n)\): \[E\left(\frac{s_x^2}{n}\right) = \frac{E(s_x^2)}{n} = \frac{var(x_i)}{n} = var(\bar{x}_n)\]

We might also want to estimate the standard deviation of \(\bar{x}\). A natural approach would be to take the square root of the estimator above, yielding: \[se(\bar{x}_n) = \frac{s_x}{\sqrt{n}}\] This is the conventional formula for the standard error of the sample average, and is typically reported next to the sample average.
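In practice this is a one-line calculation. The sketch below computes the sample average and its conventional standard error for a small hypothetical data set:

```python
# The conventional standard error of the sample average: s_x / sqrt(n).
import math
import statistics

x = [2.0, 5.0, 3.0, 7.0, 5.0, 4.0]        # hypothetical observations
n = len(x)

x_bar = statistics.fmean(x)               # sample average
se = statistics.stdev(x) / math.sqrt(n)   # standard error s_x / sqrt(n)

print(f"{x_bar:.3f} ({se:.3f})")          # average (standard error)
```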

Standard errors are usually biased estimators of the statistic’s true standard deviation, but the bias is typically small.

7.3 Asymptotics

So far, we have described statistics and estimators in terms of their probability distribution and the mean, variance and mean squared error associated with that probability distribution.

We are able to do this fairly easily with both sample averages and sample frequencies (which are themselves sample averages) because they are sums. Unfortunately, this is not so easy with other statistics (e.g. standard errors, medians, percentiles, etc.) that are nonlinear functions of the data.

In order to deal with those statistics, we need to construct approximations based on the asymptotic properties of the statistics. Asymptotic properties are properties that hold approximately, with the approximation getting closer and closer to the truth as the sample size gets larger.

Properties that hold exactly for any sample size are sometimes called exact or finite sample properties. All of the results we have discussed so far are finite sample results.

We will state three main asymptotic results, the law of large numbers, the central limit theorem, and Slutsky’s theorem. All three rely on relatively sophisticated math, so I will not expect you to do much with them. Please focus on the intuition and interpretation and don’t worry too much about the math.

7.3.1 The law of large numbers

The law of large numbers (LLN) says that for a large enough random sample, the sample average is almost identical to the corresponding population mean.

In order to state the LLN, we need to introduce some concepts. Consider a data set \(D_n\) of size \(n\), and let \(s_n\) be some statistic calculated from \(D_n\). We say that \(s_n\) converges in probability to some constant \(c\) if: \[\lim_{n \rightarrow \infty} \Pr( |s_n - c| < \epsilon) = 1\] for any positive number \(\epsilon > 0\).

Intuitively, what this means is that for a sufficiently large \(n\) (the \(\lim_{n \rightarrow \infty}\) part), \(s_n\) is almost certainly (the \(\Pr(\cdot) = 1\) part) very close to \(c\) (the \(|s_n-c| < \epsilon\) part).

We have a compact way of writing convergence in probability: \[s_n \rightarrow^p c\] means that \(s_n\) converges in probability to \(c\).

Having defined our terms we can now state the law of large numbers.

LAW OF LARGE NUMBERS: Let \(\bar{x}_n\) be the sample average from a random sample of size \(n\) on the random variable \(x_i\) with mean \(E(x_i) = \mu_x\). Then \[\bar{x}_n \rightarrow^p \mu_x\]
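The sketch below illustrates the LLN by simulation: the sample average of Bernoulli draws settles down near the true \(p\) as \(n\) grows (\(p = 0.4\) is an illustrative value).

```python
# LLN by simulation: the sample average of Bernoulli(p) draws
# gets closer to p as the sample size n grows.
import random

random.seed(2)
p = 0.4
x = [1 if random.random() < p else 0 for _ in range(100_000)]

for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, sum(x[:n]) / n)   # drifts toward p = 0.4 as n grows
```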

The LLN in the economy

The law of large numbers is extremely powerful and important, as it is the basis for the gambling industry, the insurance industry, and much of the banking industry.

A casino works by taking in a large number of independent small bets. As we have seen for the case of roulette, these bets have a small house advantage, so their average benefit to the casino is positive. The casino can lose any bet, but the LLN virtually guarantees that the gains will outweigh the losses as long as the casino takes in a large enough number of independent bets.

An insurance company works almost the same as a casino. Each of us faces a small risk of a catastrophic cost: a house that burns down, a car accident leading to serious injury, etc. Insurance companies collect a little bit of money from each of us, and pay out a lot of money to the small number of people who have claims. Although the context is quite different, the underlying economics are identical to those of a casino: the insurance company prices its products so that its revenues exceed its expected payout, and takes on a large number of independent risks.

Sometimes insurance companies do lose money, and even go bankrupt. The usual cause of this is a big systemic event like a natural disaster, pandemic or financial crisis that affects everyone. Here the independence needed for the LLN does not apply.

7.3.2 The central limit theorem

The Central Limit Theorem (CLT) is an even more powerful result. It roughly says that for a sufficiently large sample size we can approximate the entire probability distribution of \(\bar{x}_n\) by a normal distribution.

As with the LLN, we need to invest in some terminology before we can state the CLT. Let \(s_n\) be a statistic calculated from \(D_n\) and let \(F_n(\cdot)\) be its CDF. We say that \(s_n\) converges in distribution to a random variable \(s\) with CDF \(F(\cdot)\), or \[s_n \rightarrow^D s\] if \[\lim_{n \rightarrow \infty} |F_n(a) - F(a)| = 0\] for every \(a \in \mathbb{R}\).

Convergence in distribution means we can approximate the actual CDF \(F_n(\cdot)\) of \(s_n\) with its limit \(F(\cdot)\). As with most approximations, this is useful whenever \(F_n(\cdot)\) is difficult to calculate and \(F(\cdot)\) is easy to calculate.

CENTRAL LIMIT THEOREM: Let \(\bar{x}_n\) be the sample average from a random sample of size \(n\) on the random variable \(x_i\) with mean \(E(x_i) = \mu_x\) and variance \(var(x_i) = \sigma_x^2\), and let \[z_n = \frac{\bar{x}_n - \mu_x}{\sigma_x/\sqrt{n}}\] be the standardized sample average. Then \[z_n \rightarrow^D z \sim N(0,1)\]

What does the central limit theorem mean?

  • Fundamentally, it means that if \(n\) is big enough then the probability distribution of \(\bar{x}_n\) is approximately normal no matter what the original distribution of \(x_i\) looks like (the simulation sketch after this list illustrates this).
  • In order for the CLT to apply, we need to rescale \(\bar{x}_n\) so that it has zero mean (by subtracting \(E(\bar{x}_n) = \mu_x\)) and constant variance as \(n\) increases (by dividing by \(sd(\bar{x}_n) = \sigma_x/\sqrt{n}\)). That rescaled sample average is \(z_n\).
  • In practice, we don’t usually know \(\mu_x\) or \(\sigma_x\) so we can’t calculate \(z_n\) from data. Fortunately, there are some tricks for getting around this problem that we will talk about later.
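Here is a simulation sketch of the CLT in action: we standardize sample averages from a strongly skewed distribution (an Exponential(1), so \(\mu_x = \sigma_x = 1\); the distribution and sample size are illustrative choices) and compare a tail probability with the standard normal CDF.

```python
# CLT by simulation: standardized sample averages of a skewed
# distribution behave approximately like N(0, 1).
import math
import random
import statistics

random.seed(3)
n, reps = 50, 20_000
mu_x, sigma_x = 1.0, 1.0                    # mean and sd of Exponential(1)

z = [(statistics.fmean(random.expovariate(1.0) for _ in range(n)) - mu_x)
     / (sigma_x / math.sqrt(n))
     for _ in range(reps)]

share = sum(zi <= 1.0 for zi in z) / reps          # simulated Pr(z_n <= 1)
phi_1 = 0.5 * (1 + math.erf(1 / math.sqrt(2)))     # standard normal CDF at 1

print(share, phi_1)                                # both close to 0.84
```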

7.3.3 Consistent estimation

The law of large numbers and central limit theorem apply to the sample mean, but we are interested in other estimators as well.

In general, we say that the statistic \(s_n\) is a consistent estimator of a parameter \(\theta\) if: \[s_n \rightarrow^p \theta\] It will turn out that most of the statistics we use are consistent estimators of the parameters we typically use them to estimate.

The key to this property is a result called Slutsky’s theorem. Slutsky’s theorem roughly says that if the law of large numbers and central limit theorem apply to a statistic \(s_n\), they also apply to \(g(s_n)\) for any continuous function \(g(\cdot)\).

SLUTSKY THEOREM: Let \(g(\cdot)\) be a continuous function. Then: \[s_n \rightarrow^p c \implies g(s_n) \rightarrow^p g(c)\] and: \[s_n \rightarrow^D s \implies g(s_n) \rightarrow^D g(s)\]
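For example, since \(\bar{x}_n \rightarrow^p \mu_x\) by the LLN and \(g(u) = u^2\) is continuous, Slutsky’s theorem immediately gives: \[\bar{x}_n^2 \rightarrow^p \mu_x^2\] so the squared sample average is a consistent estimator of \(\mu_x^2\), even though it is a biased one (its expected value is \(\mu_x^2 + \sigma_x^2/n\)).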

What is the importance of Slutsky’s theorem? Most commonly used statistics can be written as continuous functions of a sample average (or several sample averages). Slutsky’s theorem

  • extends the LLN to these statistics
    • Most commonly used statistics are consistent estimators of the corresponding population parameter.
  • extends the CLT to these statistics
    • Most commonly used statistics are also asymptotically normal.

The math needed to make full use of Slutsky’s theorem is beyond the scope of this course, so all I am asking here is for you to know that it can be used for this purpose.