# Chapter 7 Statistics

Having described our data and its DGP, we now move on to describing statistics calculated using our data.

*Chapter goals*

In this chapter we will:

- Use the theory of probability and random variables to model the statistics we have calculated from a data set.
- Calculate and interpret the mean and variance of a statistic from its sampling distribution.
- Calculate and interpret bias and mean squared error.
- Explain the law of large numbers, central limit theorem, and Slutsky’s theorem.

## 7.1 Statistics and their properties

Suppose we have some **statistic** \(s_n = s(D_n)\), i.e., a number
that is calculated from the data.

- Since the data is observed/known, the value of the statistic is observed/known.
- Since the elements of \(D_n\) are random variables, \(s_n\) is also a random variable with a well-defined (but unknown) probability distribution that depends on the unknown DGP.

**Roulette wins**

In our roulette example, the total number of wins is: \[R = x_1 + x_2 + x_3\] Since this is a number calculated from our data, it is a statistic.

Since \(x_i \sim Bernoulli(p)\), we can show that \(R \sim Binomial(3,p)\).
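As a quick check (not part of the text), we can verify the Binomial claim by enumerating all eight possible outcomes in Python. The win probability \(p = 18/38\) (a red bet on an American wheel) is an assumption for illustration; the argument works for any \(0 < p < 1\):

```python
from itertools import product
from math import comb

p = 18 / 38  # assumed win probability for illustration; any 0 < p < 1 works

# Exact distribution of R = x1 + x2 + x3: enumerate all 2^3 outcomes
pmf = {r: 0.0 for r in range(4)}
for outcome in product([0, 1], repeat=3):
    prob = 1.0
    for x in outcome:
        prob *= p if x == 1 else (1 - p)
    pmf[sum(outcome)] += prob

# Each probability matches the Binomial(3, p) formula C(3, r) p^r (1-p)^(3-r)
for r in range(4):
    assert abs(pmf[r] - comb(3, r) * p**r * (1 - p) ** (3 - r)) < 1e-12
```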

### 7.1.1 Some important statistics

I will use \(s_n\) to represent an abstract statistic, but we will often use other notation to talk about specific statistics.

The most important statistic is the **sample average**, which is
defined as:
\[\bar{x}_n = \frac{1}{n} \sum_{i=1}^n x_i\]
We will also consider several other commonly-used univariate
statistics:

- The *sample variance* of \(x_i\) is defined as: \[s_x^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2\] A closely-related statistic is the *sample standard deviation* \(s_x = \sqrt{s_x^2}\), which is the square root of the sample variance.
- The *sample frequency* or *relative sample frequency* of the event \(x_i \in A\) is defined as the proportion of cases in which the event occurs: \[\hat{f}_A = \frac{1}{n} \sum_{i=1}^n I(x_i \in A)\] A closely-related statistic is the *absolute sample frequency* \(n\hat{f}_A\), which is the *number* of cases in which the event occurs.
- The *sample median* of \(x_i\) is defined as: \[\hat{m}_x = m: \begin{cases} \hat{f}_{x < m} \leq 0.5 \\ \hat{f}_{x > m} \leq 0.5 \\ \end{cases}\]
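To make these definitions concrete, here is a short Python sketch computing each statistic by hand for a small made-up data set. The six values and the event \(x_i > 4\) are illustrative choices, and the median uses the usual midpoint convention for even \(n\):

```python
import math

x = [2, 4, 4, 5, 7, 8]  # illustrative data set
n = len(x)

xbar = sum(x) / n                                  # sample average: 5.0
s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)   # sample variance: 4.8
s = math.sqrt(s2)                                  # sample standard deviation
f_A = sum(1 for xi in x if xi > 4) / n             # sample frequency of x > 4: 0.5
mid = sorted(x)[n // 2 - 1 : n // 2 + 1]           # middle two values (even n)
median = sum(mid) / 2                              # sample median: 4.5
```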

### 7.1.2 The sampling distribution

Since the data itself is a collection of random variables, any statistic calculated from that data is also a random variable, with a probability distribution that can be derived from the DGP.

**The sampling distribution of the sample frequency**

Calculating the exact probability distribution of most statistics is quite difficult, but it is easy to do for the sample frequency. Let \(p =\Pr(x_i \in A)\). Then: \[n\hat{f}_A \sim Binomial(n,p)\] In other words, we can calculate the exact probability distribution of the sample frequency using the formula for the binomial distribution.
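Because the distribution is exactly Binomial, we can tabulate it directly. A sketch (the values \(n = 10\) and \(p = 0.3\) are illustrative):

```python
from math import comb

n, p = 10, 0.3  # illustrative sample size and event probability

# Pr(f_hat_A = k/n) = Pr(n * f_hat_A = k), the Binomial(n, p) pmf
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

assert abs(sum(pmf) - 1) < 1e-12  # probabilities sum to one
pr_exactly_03 = pmf[3]            # Pr(sample frequency is exactly 0.3)
```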

Unfortunately, most statistics typically have sampling distributions that are quite difficult to calculate.

To see why the sampling distribution of a statistic is so difficult to calculate, suppose we have a discrete random variable \(x_i\) whose support \(S_x\) has five elements. Then we need to calculate the sampling distribution of our statistic by adding up its probability across the support of \(D_n\). That support has \(5^n\) elements, a number that can quickly get very large.

For example, a typical data set in microeconomics has at least a few hundred or a few thousand observations. With 100 observations, \(D_n\) can take on \(5^{100} \approx 7.9 \times 10^{69}\) (that’s 79 followed by 68 zeros!) distinct values. With 1,000 observations, \(D_n\) can take on \(5^{1000}\) distinct values, a number too big for Excel to even calculate.

### 7.1.3 The mean and variance

If our statistic has a probability distribution, it (usually) has a mean and variance as well. Under some circumstances, we can calculate them.

**The mean of the sample average**

Let \(\mu_x = E(x_i)\) be the mean of \(x_i\). Then the mean of \(\bar{x}_n\) is: \[E(\bar{x}_n) = E\left( \frac{1}{n} \sum_{i=1}^n x_i\right) = \frac{1}{n} \sum_{i=1}^n E\left( x_i\right) = \frac{1}{n} \sum_{i=1}^n \mu_x = \mu_x\]

This is an important and general result in statistics. The mean of the sample average in a random sample is identical to the mean of the random variable being averaged. \[E(\bar{x}_n) = E(x_i)\] We have shown this property specifically for a random sample, but it holds under many other sampling processes.

The variance of the sample average is not equal to the variance of the random variable being averaged, but they are closely related.

**The variance of the sample average**

To keep the math simple, suppose we only have \(n = 2\) observations. Then the sample average is: \[\bar{x} = \frac{1}{2}(x_1 + x_2)\] By our earlier formula for the variance: \[\begin{align} var(\bar{x}) &= var\left(\frac{1}{2}(x_1 + x_2)\right) \\ &= \left(\frac{1}{2}\right)^2 var(x_1 + x_2) \\ &= \frac{1}{4} \left( \underbrace{var(x_1)}_{\sigma_x^2} + 2 \underbrace{cov(x_1,x_2)}_{0 \textrm{ (independence)}} + \underbrace{var(x_2)}_{\sigma_x^2} \right) \\ &= \frac{1}{4} \left( 2 \sigma_x^2 \right) \\ &= \frac{\sigma_x^2}{2} \\ \end{align}\]

More generally, the variance of the sample average in a random sample of size \(n\) is: \[var(\bar{x}_n) = \frac{\sigma_x^2}{n}\] where \(\sigma_x^2 = var(x_i)\).
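We can confirm the \(\sigma_x^2/n\) formula exactly for a small DGP by brute-force enumeration. This sketch uses a fair six-sided die and \(n = 2\) (my choice of example, not the text's):

```python
from itertools import product

support = [1, 2, 3, 4, 5, 6]                      # fair die: each value has probability 1/6
mu = sum(support) / 6                             # E(x_i) = 3.5
sigma2 = sum((v - mu) ** 2 for v in support) / 6  # var(x_i) = 35/12

# Enumerate all 36 equally likely samples of size n = 2 and their averages
averages = [(a + b) / 2 for a, b in product(support, repeat=2)]
var_xbar = sum((x - mu) ** 2 for x in averages) / len(averages)

assert abs(var_xbar - sigma2 / 2) < 1e-9  # matches sigma_x^2 / n with n = 2
```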

Other commonly-used statistics also have a mean and variance.

**The mean and variance of the sample frequency**

Since the absolute sample frequency has the binomial distribution, we have already seen its mean and variance. Let \(p = \Pr(x_i \in A)\). Then \(n\hat{f}_A \sim Binomial(n,p)\) and: \[E(n\hat{f}_A) = np\] \[var(n\hat{f}_A) = np(1-p)\]

Applying the usual rules for expected values, the mean and variance of the relative sample frequency is: \[E(\hat{f}_A) = \frac{E(n\hat{f}_A)}{n} = \frac{np}{n} = p\] \[var(\hat{f}_A) = \frac{var(n\hat{f}_A)}{n^2} = \frac{np(1-p)}{n^2} = \frac{p(1-p)}{n} \]

## 7.2 Estimation

One of the most important uses of statistics is to **estimate**, or
guess at the value of, some unknown parameter of the DGP.

### 7.2.1 Parameters and estimators

A **parameter** is an unknown number characterizing a DGP.

**Examples of parameters**

Sometimes a single parameter completely describes the DGP:

- In our roulette data set, the joint distribution of the data depends only on the single parameter \(p = \Pr(b \in Red)\).

Sometimes a group of parameters completely describe the DGP:

- If \(x_i\) is a random sample from the \(U(L,H)\) distribution, then \(L\) and \(H\) are both parameters.

And sometimes a parameter only partially describes the DGP:

- If \(x_i\) is a random sample from some unknown distribution with unknown mean \(\mu_x = E(x_i)\), then \(\mu_x\) is a parameter.
- If \(x_i\) is a random sample from some unknown distribution with unknown median \(m_x = q_{0.5}(x_i)\), then \(m_x\) is a parameter.

Typically there will be particular parameters whose value we wish
to know. Such a parameter is called a **parameter of interest**.
Our model may include other parameters, which are typically called
*auxiliary parameters* or *nuisance parameters*.

An **estimator** is a statistic that is being used to *estimate*
(guess at the value of) an unknown parameter of interest. The distinction between estimator and estimate is a subtle one: we use “estimate” when talking about our statistic as a number calculated for a specific data set and “estimator” when talking about it as a random variable.

**Commonly used estimators**

Our four example statistics are commonly used as estimators:

- The relative sample frequency \(\hat{f}_A\) is typically used as an estimator of the probability \(p_A = \Pr(x_i \in A)\).
- The sample average \(\bar{x}\) is typically used as an estimator of the population mean \(\mu_x = E(x_i)\).
- The sample variance \(s_x^2\) is typically used as an estimator of the population variance \(\sigma_x^2 = var(x_i)\).
- The sample median \(\hat{m}_x\) is typically used as an estimator of the population median \(m_x = q_{0.5}(x_i)\).

Let \(s_n\) be a statistic we are using as an estimator of some parameter
of interest \(\theta\). We can define its **error** as:
\[err(s_n) = s_n - \theta\]
In principle, we want \(s_n\) to be a good estimator of \(\theta\), i.e.,
we want \(err(s_n)\) to be as close to zero as possible.

There are several major complications to keep in mind:

- Since \(s_n\) is a random variable with a probability distribution, \(err(s_n)\) is also a random variable with a probability distribution.
- Since the value of \(\theta\) is unknown, the value of \(err(s_n)\) is also unknown.

Always remember that \(err(s_n)\) is not an inherent property of the statistic - it depends on the relationship between the statistic and the parameter of interest. A given statistic may be a good estimator of one parameter, and a bad estimator of another parameter.

### 7.2.2 Bias and the MVUE criterion

In choosing an estimator, we can consider several criteria.

The first is the **bias** of the estimator, which is defined as:
\[bias(s_n) = E(err(s_n)) = E(s_n - \theta) = E(s_n) - \theta\]
The bias represents the expected error.

Ideally we would want \(bias(s_n)\) to be zero, in which case
we would say that \(s_n\) is an **unbiased** estimator of \(\theta\).
Note that bias is always defined relative to the parameter we wish
to estimate, and is not an inherent property of the statistic.

**Two unbiased estimators of the mean**

Consider the sample average \(\bar{x}_n\) in a random sample as an estimator of the parameter \(\mu_x = E(x_i)\). The bias is: \[bias(\bar{x}_n) = E(\bar{x}_n) - \mu_x = \mu_x - \mu_x = 0\] That is, the sample average is an unbiased estimator of the population mean.

However, it is not the only unbiased estimator. For example, suppose we simply take the value of \(x_i\) in the first observation and throw away the rest of the data. This “first observation estimator” is easier to calculate than the sample average, and is also an unbiased estimator of \(\mu_x\): \[bias(x_1) = E(x_1) - \mu_x = \mu_x - \mu_x = 0\] This example illustrates a general principle: there is rarely exactly one unbiased estimator. There are either none, or many.

**An unbiased estimator of the variance**

The sample variance is an unbiased estimator of the population variance: \[E(s_x^2) = \sigma_x^2 = var(x_i)\] This is not hard to prove, but I will skip it for now.

If there are multiple unbiased estimators available for a given
parameter, we need to apply a second criterion to choose
one. A natural second criterion is the **variance** of
the estimator:
\[var(s_n) = E[(s_n-E(s_n))^2]\]
If \(s_n\) is unbiased, then a low variance means it is usually
close to \(\theta\), while a high variance means that it is
often either much larger or much smaller than \(\theta\). Clearly,
low variance is better than high variance.

The **minimum variance unbiased estimator** (MVUE) of a parameter is
the unbiased estimator with the lowest variance.

**The variance of the sample average and first observation estimators**

In our previous example, we found two unbiased estimators for the mean, the sample average \(\bar{x}_n\) and the first observation \(x_1\).

The variance of the sample average is: \[var(\bar{x}_n) = \sigma_x^2/n\] and the variance of the first observation estimator is: \[var(x_1) = \sigma_x^2\] For any \(n > 1\), the sample average \(\bar{x}_n\) has lower variance than the first observation estimator \(x_1\). Since both are unbiased, the sample average is the preferred estimator of the two.
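A small Monte Carlo experiment makes the comparison vivid. The DGP here (\(x_i \sim U(0,1)\), so \(\mu_x = 0.5\) and \(\sigma_x^2 = 1/12\)) and the sample size are illustrative choices, not from the text:

```python
import random

random.seed(0)  # fixed seed so the experiment is reproducible

n, reps = 25, 20000
xbar_draws, x1_draws = [], []
for _ in range(reps):
    sample = [random.random() for _ in range(n)]  # random sample from U(0, 1)
    xbar_draws.append(sum(sample) / n)            # sample average estimator
    x1_draws.append(sample[0])                    # first observation estimator

def mc_var(draws):
    m = sum(draws) / len(draws)
    return sum((d - m) ** 2 for d in draws) / len(draws)

# Both estimators center on mu = 0.5, but var(xbar) is near 1/300 vs 1/12 for x1
assert mc_var(xbar_draws) < mc_var(x1_draws)
```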

In fact, we can prove that \(\bar{x}_n\) is the minimum variance unbiased estimator of \(\mu_x\).

### 7.2.3 Mean squared error

Unfortunately, once we move beyond the simple case of estimating the population mean, we run into several complications:

The first complication is that an unbiased estimator may not exist for a particular parameter of interest. If there is no unbiased estimator, there is no minimum variance unbiased estimator. So we need some other way of choosing an estimator.

**The sample median**

There is no unbiased estimator of the median of a random variable with unknown distribution. To see why, consider the simplest possible data set: a random sample of size \(n=1\) on the random variable \(x_i \sim Bernoulli(p)\), where \(0 < p < 1\). The median of \(x_i\) in this case is: \[m_x = I(p > 0.5)\]

First we show that the sample median is a biased estimator of \(m_x\). The sample median is: \[\hat{m}_x = x_1\] and its expected value is: \[E(\hat{m}_x) = E(x_1) = p \neq I(p > 0.5)\] So the sample median \(\hat{m}_x\) is a biased estimator of the population median \(m_x\).

More generally, any statistic calculated from this data set must take the form \(s = a_0 + a_1x_1\), where \(s = a_0\) when \(x_1 = 0\) and \(s = a_0+a_1\) is its value when \(x_1 = 1\). This statistic has expected value \(E(a_0 + a_1x_1) = a_0 + a_1p\), so any unbiased estimator would need to solve the equation: \[a_0 + a_1 p = I(p > 0.5)\] and there is no such solution.

The second complication is that we often have access to an unbiased estimator and a biased estimator with lower variance.

**The relationship between age and earnings**

One common question in labour economics is how earnings vary by various characteristics such as age.

Suppose we have a random sample of 800 Canadians, and we want to estimate the earnings of the average 35-year-old Canadian. Assuming for simplicity that ages are equally spaced between 0 and 80, our data set will have only 10 Canadians at each age. So we have several options:

- Average earnings of the 10 35-year-olds in our data.

This estimator will be unbiased, but 10 observations isn’t
very much and so its variance will be high. We can reduce the
variance by adding more observations from people who are *almost*
35 years old:

- Average earnings of the 30 people aged 34-36 in our data.
- Average earnings of the 100 people aged 30-39 in our data.
- Average earnings of the 800 people aged 0-80 in our data.

By including more data, these estimators will have lower variance but will introduce bias. My guess is that including 34 and 36-year-olds is a good idea since they probably have similar earnings to 35-year-olds, but including children and the elderly is not such a good idea.

This suggests that we need a criterion that

- Can be used to choose between biased estimators
- Can choose slightly biased estimators with low variance over unbiased estimators with high variance.

The **mean squared error** of an estimator is defined as:
\[MSE(s_n) = E[err(s_n)^2] = E[(s_n-\theta)^2]\]
We can do a little math and show that:
\[MSE(s_n) = var(s_n) + [bias(s_n)]^2\]
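The decomposition can be checked exactly for a simple biased estimator. Here I take one Bernoulli observation and the (deliberately biased) estimator \(s = x_1/2\) of \(\theta = p\); the numbers are illustrative:

```python
p, c = 0.3, 0.5                  # illustrative parameter and shrinkage factor
outcomes = [(0, 1 - p), (1, p)]  # (value of x1, probability)

E_s = sum(c * x * pr for x, pr in outcomes)                 # E(s) = c * p
var_s = sum((c * x - E_s) ** 2 * pr for x, pr in outcomes)  # var(s)
bias = E_s - p                                              # bias(s)
mse = sum((c * x - p) ** 2 * pr for x, pr in outcomes)      # E[(s - theta)^2]

assert abs(mse - (var_s + bias**2)) < 1e-12  # MSE = var + bias^2 holds exactly
```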
The MSE criterion allows us to choose a biased estimator with low variance
over an unbiased estimator with high variance, and also allows us to choose
between biased estimators when no unbiased estimator exists.

**The MSE of the sample mean and first observation estimators**

The mean squared error of the sample average is: \[MSE(\bar{x}_n) = var(\bar{x}_n) + [bias(\bar{x}_n)]^2 = \frac{\sigma_x^2}{n} + 0^2 = \frac{\sigma_x^2}{n}\] and the mean squared error of the first observation estimator is: \[MSE(x_1) = \sigma_x^2\] The sample average is the preferred estimator by the MSE criterion, so in this case we get the same result as applying the MVUE criterion.

### 7.2.4 Standard errors

Parameter estimates are typically reported along with their
**standard errors**. The standard error of a statistic is
an estimate of its standard deviation.

**The standard error of the average**

We have shown that the sample average provides a good estimate of the population mean, and that its variance is: \[var(\bar{x}_n) = \frac{\sigma_x^2}{n} = \frac{var(x_i)}{n}\] Since \(s_x^2\) is an unbiased estimator of \(var(x_i)\) we can use it to construct an unbiased estimator of \(var(\bar{x})\): \[E\left(\frac{s_x^2}{n}\right) = \frac{E(s_x^2)}{n} = \frac{var(x_i)}{n} = var(\bar{x}_n)\]

We might also want to estimate the standard deviation of \(\bar{x}\). A natural approach would be to take the square root of the estimator above, yielding: \[se(\bar{x}_n) = \frac{s_x}{\sqrt{n}}\] This is the conventional formula for the standard error of the sample average, and is typically reported next to the sample average.
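In code, the conventional standard error is one line once the sample variance is in hand. A sketch with an illustrative data set:

```python
import math

x = [2, 4, 4, 5, 7, 8]  # illustrative data set
n = len(x)
xbar = sum(x) / n
s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)  # sample variance s_x^2
se = math.sqrt(s2 / n)                            # se(xbar) = s_x / sqrt(n)
```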

Standard errors are usually biased estimators of the statistic’s true standard deviation, but the bias is typically small.

## 7.3 Asymptotics

So far, we have described statistics and estimators in terms of their probability distribution and the mean, variance and mean squared error associated with that probability distribution.

We are able to do this fairly easily with both sample averages and sample frequencies (which are also sample averages) because they are sums. Unfortunately, this is not so easy with other statistics (e.g. standard errors, medians, percentiles, etc.) that are nonlinear functions of the data.

In order to deal with those statistics, we need to construct
approximations based on the **asymptotic** properties
of the statistics. Asymptotic properties are properties
that hold approximately, with the approximation getting
closer and closer to the truth as the sample size gets
larger.

Properties that hold exactly for any sample size
are sometimes called **exact** or *finite sample* properties.
All of the results we have discussed so far are finite sample results.

We will state three main asymptotic results: the law of large numbers, the central limit theorem, and Slutsky’s theorem. All three rely on relatively sophisticated math, so I will not expect you to do much with them. Please focus on the intuition and interpretation and don’t worry too much about the math.

### 7.3.1 The law of large numbers

The **law of large numbers** (LLN) says that for a large enough random
sample, the sample average is almost identical to the corresponding
population mean.

In order to state the LLN, we need to introduce some concepts. Consider
a data set \(D_n\) of size \(n\), and let \(s_n\) be some statistic
calculated from \(D_n\). We say that \(s_n\) **converges in probability**
to some constant \(c\) if:
\[\lim_{n \rightarrow \infty} \Pr( |s_n - c| < \epsilon) = 1\]
for any \(\epsilon > 0\).

Intuitively, what this means is that for a sufficiently large \(n\) (the \(\lim_{n \rightarrow \infty}\) part), \(s_n\) is almost certainly (the \(\Pr(\cdot) = 1\) part) very close to \(c\) (the \(|s_n-c| < \epsilon\) part).

We have a compact way of writing convergence in probability: \[s_n \rightarrow^p c\] means that \(s_n\) converges in probability to \(c\).

Having defined our terms we can now state the law of large numbers.

**LAW OF LARGE NUMBERS**: Let \(\bar{x}_n\) be the sample average
from a random sample of size \(n\) on the random variable \(x_i\) with
mean \(E(x_i) = \mu_x\). Then
\[\bar{x}_n \rightarrow^p \mu_x\]
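A quick simulation illustrates the theorem. With \(x_i \sim Bernoulli(0.5)\) (an illustrative DGP), the sample average can wander for small \(n\) but is pinned near \(\mu_x = 0.5\) for large \(n\):

```python
import random

random.seed(1)  # fixed seed for reproducibility

def xbar(n):
    """Sample average of n Bernoulli(0.5) draws."""
    return sum(random.random() < 0.5 for _ in range(n)) / n

small, large = xbar(10), xbar(100_000)
# With n = 100,000 the average is almost certainly within 0.01 of mu = 0.5
assert abs(large - 0.5) < 0.01
```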

**The LLN in the economy**

The law of large numbers is extremely powerful and important, as it is the basis for the gambling industry, the insurance industry, and much of the banking industry.

A casino works by taking in a *large* number of *independent* small
bets. As we have seen for the case of roulette, these bets
have a small house advantage, so their average benefit to
the casino is positive. The casino can lose any bet, but the
LLN virtually guarantees that the gains will outweigh the
losses as long as the casino takes in a large enough number
of independent bets.

An insurance company works almost the same as a casino. Each of us faces a small risk of a catastrophic cost: a house that burns down, a car accident leading to serious injury, etc. Insurance companies collect a little bit of money from each of us, and pay out a lot of money to the small number of people who have claims. Although the context is quite different, the underlying economics are identical to those of a casino: the insurance company prices its products so that its revenues exceed its expected payout, and takes on a large number of independent risks.

Sometimes insurance companies do lose money, and even go bankrupt. The usual cause of this is a big systemic event like a natural disaster, pandemic or financial crisis that affects everyone. Here the independence needed for the LLN does not apply.

### 7.3.2 The central limit theorem

The **Central Limit Theorem (CLT)** is an even more powerful result.
It roughly says that for a sufficiently large sample size we can
approximate the entire probability distribution of
\(\bar{x}_n\) by a normal distribution.

As with the LLN, we need to invest in some terminology before we can
state the CLT. Let \(s_n\) be a statistic calculated from \(D_n\) and
let \(F_n(\cdot)\) be its CDF. We say that \(s_n\)
**converges in distribution** to a random variable \(s\) with CDF \(F(\cdot)\), or
\[s_n \rightarrow^D s\]
if
\[\lim_{n \rightarrow \infty} |F_n(a) - F(a)| = 0\]
for every \(a \in \mathbb{R}\).

Convergence in distribution means we can approximate the actual CDF \(F_n(\cdot)\) of \(s_n\) with its limit \(F(\cdot)\). As with most approximations, this is useful whenever \(F_n(\cdot)\) is difficult to calculate and \(F(\cdot)\) is easy to calculate.

**CENTRAL LIMIT THEOREM**: Let \(\bar{x}_n\) be the sample average
from a random sample of size \(n\) on the random variable \(x_i\) with
mean \(E(x_i) = \mu_x\) and variance \(var(x_i) = \sigma_x^2\), and let
\(z_n = \frac{\bar{x}_n - \mu_x}{\sigma_x/\sqrt{n}}\) be the standardized
sample average. Then
\[z_n \rightarrow^D z \sim N(0,1)\]

What does the central limit theorem mean?

- Fundamentally, it means that if \(n\) is big enough then the probability
distribution of \(\bar{x}_n\) is approximately normal
*no matter what the original distribution of \(x_i\) looks like.*
- In order for the CLT to apply, we need to rescale \(\bar{x}_n\)
so that it has zero mean (by subtracting \(E(\bar{x}_n) = \mu_x\))
and constant variance as \(n\) increases (by dividing by
\(sd(\bar{x}_n) = \sigma_x/\sqrt{n}\)). That rescaled sample
average is \(z_n\).

- In practice, we don’t usually know \(\mu_x\) or \(\sigma_x\) so we can’t calculate \(z_n\) from data. Fortunately, there are some tricks for getting around this problem that we will talk about later.
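A simulation shows the CLT at work. With \(x_i \sim U(0,1)\) (so \(\mu_x = 0.5\) and \(\sigma_x = 1/\sqrt{12}\); these are illustrative choices), the standardized averages have mean near 0 and variance near 1:

```python
import math
import random

random.seed(2)  # fixed seed for reproducibility

n, reps = 50, 20000
mu, sigma = 0.5, 1 / math.sqrt(12)  # mean and sd of U(0, 1)

z = []
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n
    z.append((xbar - mu) / (sigma / math.sqrt(n)))  # standardized sample average

mean_z = sum(z) / reps
var_z = sum((zi - mean_z) ** 2 for zi in z) / reps
assert abs(mean_z) < 0.05 and abs(var_z - 1) < 0.05  # approximately N(0, 1)
```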

### 7.3.3 Consistent estimation

The law of large numbers and central limit theorem apply to the sample mean, but we are interested in other estimators as well.

In general, we say that the statistic \(s_n\) is a **consistent**
estimator of a parameter \(\theta\) if:
\[s_n \rightarrow^p \theta\]
It will turn out that most of the statistics we use are consistent
estimators of the thing we typically use them to estimate.

The key to this property is a result called **Slutsky’s theorem**.
Slutsky’s theorem roughly says that if the law of large numbers
and central limit theorem apply to a statistic \(s_n\), they also
apply to \(g(s_n)\) for any continuous function \(g(\cdot)\).

**SLUTSKY THEOREM**: Let \(g(\cdot)\) be a continuous function. Then:
\[s_n \rightarrow^p c \implies g(s_n) \rightarrow^p g(c)\]
and:
\[s_n \rightarrow^D s \implies g(s_n) \rightarrow^D g(s)\]
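For a concrete instance (my example, not the text's): since \(\bar{x}_n \rightarrow^p \mu_x\) and \(g(u) = e^u\) is continuous, Slutsky's theorem gives \(e^{\bar{x}_n} \rightarrow^p e^{\mu_x}\). A sketch with \(x_i \sim U(0,1)\), so \(\mu_x = 0.5\):

```python
import math
import random

random.seed(3)  # fixed seed for reproducibility

# x_i ~ U(0, 1), so mu_x = 0.5; g(u) = exp(u) is continuous
n = 200_000
xbar = sum(random.random() for _ in range(n)) / n

# exp(xbar) is very close to exp(mu_x) = exp(0.5) in a large sample
assert abs(math.exp(xbar) - math.exp(0.5)) < 0.01
```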

What is the importance of Slutsky’s theorem? Most commonly used statistics can be written as continuous functions of a sample average (or several sample averages). Slutsky’s theorem:

- extends the LLN to these statistics: most commonly used statistics are consistent estimators of the corresponding population parameter.
- extends the CLT to these statistics: most commonly used statistics are also asymptotically normal.

The math needed to make full use of Slutsky’s theorem is beyond the scope of this course, so all I am asking here is for you to know that it can be used for this purpose.