Chapter 7 Statistics

In earlier chapters, we learned to use Excel to construct common univariate statistics and charts. We also learned the basics of probability theory, and working with simple or complex random variables. The next step is to bring these concepts together, and apply the theoretical tools of probability and random variables to statistics calculated from data.

This chapter will develop the theory of mathematical statistics, which treats our data set and each statistic calculated from the data as the outcome of a random data generating process. We will also explore one of the most important uses of statistics: to estimate, or guess at the value of, some unknown feature of the data generating process.

Chapter goals

In this chapter, we will learn how to:

  1. Describe the joint probability distribution of a very simple data set.
  2. Identify the key features of a random sample.
  3. Classify data sets by sampling types.
  4. Find the sampling distribution of a very simple statistic.
  5. Find the mean and variance of a statistic from its sampling distribution.
  6. Find the mean and variance of a statistic that is linear in the data.
  7. Distinguish between parameters, statistics, and estimators.
  8. Calculate the sampling error of an estimator.
  9. Calculate bias and classify estimators as biased or unbiased.
  10. Calculate the mean squared error of an estimator.
  11. Apply MVUE and MSE criteria to select an estimator.
  12. Calculate the standard error for a sample average.
  13. Explain the law of large numbers and what it means for an estimator to be consistent.

To prepare for this chapter, please review both the introductory and advanced chapters on random variables, as well as the sections in the data analysis chapter on summary statistics and frequency tables.

7.1 Using statistics

Statistics are just numbers calculated from data. Modern computers make statistics easy to calculate, and they are easy to interpret as descriptions of the data.

But that is not the only possible interpretation of a statistic, and it is not even the most important one. Instead, we regularly use statistics calculated from data to infer or predict other quantities that are not in the data.

  1. Statistics Canada may conduct a survey of a few thousand Canadians, and use statistics based on that survey to infer how the other 38+ million Canadians would have responded to that survey.
    • This is the main application we will consider in this course.
  2. Wal-Mart may use historical sales data to predict how many chocolate bunnies it will sell this Easter. It will then use this prediction to determine how many chocolate bunnies to order.
    • We will talk a little about this kind of application.
  3. Economists and other researchers will often be interested in making causal or counterfactual inferences.
    • Counterfactual inferences are predictions about how the data would have been different under other (counterfactual) circumstances.
    • Economic fundamentals like supply and demand curves are primarily counterfactual because they describe how much would have been bought or sold at each price (not just the equilibrium price).
    • Causal inferences are inferences about the underlying mechanism that produced the data.
    • For example, labour economists are often interested in whether and how much the typical individual’s earnings would increase if they spent one more year in school, or obtained a particular educational credential.
    • Counterfactual and causal inference are beyond the scope of this course, but are important in applied economics and may be covered extensively in later courses.

Anyone can make predictions, and almost anyone can calculate a few statistics in Excel. The hard part is making accurate predictions, and selecting or constructing statistics that will tend to produce accurate predictions.

In order to make consistently accurate predictions using data, we will need to construct a probabilistic model of the random process that generated the data.

Example 7.1 Using data to predict roulette outcomes

Our probability calculations for roulette have relied on two pieces of knowledge:

  • We know the game’s structure: there are 37 numbered slots, 18 numbers are red, 18 are black, and one is green.
  • We know the game is fair: the ball is equally likely to land in all 37 numbered slots.

In addition, the game is simple enough that we can do all of the calculations.

What if we do not know the structure of the game, are not sure the game is fair, or find the game too complicated to calculate the probabilities directly? If we have access to a data set of past results, we can use that data set to:

  1. estimate the win probability of various bets.
    • This application will be covered in the current chapter.
  2. test the claim that the game is fair.

This approach will be particularly useful for games like poker or blackjack that are more complex and/or involve human decision making. The win probability in blackjack depends on the skill of the player, so the house advantage can vary depending on who is playing, whether they are distracted or intoxicated, and various other human factors.

7.2 Data and the data generating process

Suppose for the rest of this chapter that we have a data set or sample called \(D_n\). In most applications, it will be a tidy rectangular table with \(n\) observations (rows) and \(K\) numeric variables (columns).

For this chapter, we will simplify by assuming that \(K = 1\), i.e., that \(D_n = (x_1,x_2,\ldots,x_n)\) contains \(n\) observations on a single numeric variable \(x_i\). This case will cover all of the univariate statistics and methods described in the chapter on basic data analysis.

Example 7.2 Data from two roulette games

Suppose we have a data set \(D_n\) providing the result of \(n\) independent games of roulette. Let \(b_i\) be the number the ball lands on in game \(i\), and let \(x_i\) be the result of a bet on red: \[\begin{align} x_i = \begin{cases} 1 & \textrm{if Red wins game } i \\ 0 & \textrm{if Red loses game } i \\ \end{cases} \end{align}\] Suppose our data set has observations from two games (\(n = 2\)). Then \(D_2 = (x_1,x_2)\) where \(x_1\) is the result from the first game and \(x_2\) is the result from the second game. For example, suppose red loses the first game and wins the second game. Then our data could be written in tabular form as:

Game # (\(i\)) Result of bet on red (\(x_i\))
1 0
2 1

or in a list as \(D_2 = (0,1)\).

Our data set \(D_n\) is a table or list of numbers, but we can also think of it as a set of random variables with an unknown joint PDF \(f_D\). This PDF is sometimes called the data generating process or DGP for the data set. The exact DGP is usually unknown, but we usually have some information about it.

Example 7.3 The DGP for the roulette data

The support of our roulette data set includes four possible values: \[\begin{align} S_D = \{(0,0), (0,1), (1,0), (1,1)\} \end{align}\] The DGP is just the joint PDF of \(D_n\), which we can calculate for each value in the support: \[\begin{align} f_D(0,0) &= \Pr(x_1 = 0 \cap x_2 = 0) \\ &= \Pr(x_1 = 0)\Pr(x_2 = 0) \\ &= (1-p)^2 \\ f_D(0,1) &= \Pr(x_1 = 0 \cap x_2 = 1) \\ &= \Pr(x_1 = 0)\Pr(x_2 = 1) \\ &= (1-p)p \\ f_D(1,0) &= \Pr(x_1 = 1 \cap x_2 = 0) \\ &= \Pr(x_1 = 1)\Pr(x_2 = 0) \\ &= p(1-p) \\ f_D(1,1) &= \Pr(x_1 = 1 \cap x_2 = 1) \\ &= \Pr(x_1 = 1)\Pr(x_2 = 1) \\ &= p^2 \\ f_D(a,b) &= 0 \qquad \textrm{otherwise} \end{align}\] where \(p = \Pr(x_i = 1)\) is the unknown probability that a bet on red wins.
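For readers who want to experiment, here is a minimal Python sketch that enumerates the same joint PDF. The value of \(p\) below and the helper name `joint_pdf` are purely illustrative assumptions, since the true \(p\) is unknown:

```python
from itertools import product

p = 18 / 37  # assumed win probability for a bet on red (illustrative only)

def joint_pdf(x1, x2, p):
    """Joint PDF f_D(x1, x2) of two independent Bernoulli(p) observations."""
    def marginal(x):
        return p if x == 1 else 1 - p
    return marginal(x1) * marginal(x2)

# Enumerate the support S_D = {(0,0), (0,1), (1,0), (1,1)}.
for x1, x2 in product([0, 1], repeat=2):
    print(f"f_D({x1},{x2}) = {joint_pdf(x1, x2, p):.4f}")

# As with any PDF, the probabilities sum to one.
print("sum =", sum(joint_pdf(a, b, p) for a, b in product([0, 1], repeat=2)))
```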

While it is feasible to calculate the DGP for a very small data set, it quickly becomes impractical to do so as the number of observations increases. Fortunately, we don’t usually need to calculate the DGP. We just need to know that it could be calculated.

7.2.1 Simple random sampling

In most applications, we can simplify the DGP by assuming that \(D_n\) is independent and identically distributed (IID) or a simple random sample from a large population. A simple random sample has two features:

  1. The observations are independent: Each \(x_i\) is an independent random variable.
  2. The observations are identically distributed: Each \(x_i\) has the same (unknown) marginal distribution.

The reason we call this “independent and identically distributed” is hopefully obvious, but what does it mean to say we have a “random sample” from a “population”? Well, one simple way of generating an IID sample is to:

  1. Define the population of interest, for example all Canadian residents.
  2. Use some purely random mechanism to choose a small subset of cases from this population.
    • The subset is called our sample.
    • “Purely random” here means some mechanism like a computer’s random number generator, which can then be used to dial random telephone numbers or select cases from a list.
  3. Collect data from every case in our sample.

It turns out that a moderately-sized random sample provides surprisingly accurate information on the underlying population.

Example 7.4 Our roulette data is a random sample

Each observation \(x_i\) in our roulette data is an independent random draw from the \(Bernoulli(p)\) distribution where \(p = \Pr(\textrm{Red wins})\).

Therefore, this data set satisfies the criteria for a simple random sample.

The joint PDF of a simple random sample can be written: \[\begin{align} \Pr(D_n = (a_1,a_2,\ldots,a_n)) = f_x(a_1)f_x(a_2)\ldots f_x(a_n) \end{align}\] where \(f_x(a) = \Pr(x_i = a)\) is just the marginal PDF of a single observation. Independence allows us to write the joint PDF as the product of the marginal PDFs for each observation, and identical distribution allows us to use the same marginal PDF for each observation.
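The following Python sketch draws a simple random sample from a hypothetical population; the population values, sample size, and seed are all made up for illustration. Drawing with replacement keeps the observations independent and identically distributed.

```python
import numpy as np

rng = np.random.default_rng(seed=12345)

# A hypothetical population of one million values (illustrative only).
population = rng.normal(loc=50_000, scale=20_000, size=1_000_000)

# A simple random sample of n = 500 cases, drawn with replacement so that
# each observation is an independent draw from the same distribution.
n = 500
sample = rng.choice(population, size=n, replace=True)

print("population mean:", round(population.mean(), 1))
print("sample average :", round(sample.mean(), 1))
```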

7.2.2 Time series data

Time series data sets are constructed by repeatedly observing a variable (or several variables) at regularly-spaced points in time. Our historical employment data set is an example of time series data, as are most other macroeconomic variables such as GDP, population, inflation, interest rates, etc.

Time series have several features that are inconsistent with the random sampling assumption:

  • They usually have clear time trends.
    • For example, Canada’s real GDP has been steadily growing for as long as we have data.
    • This violates the assumption that observations are identically distributed since the expected value changes from one year to the next.
  • They usually have clear recurring cyclical patterns or seasonality.
    • For example, unemployment in Canada is usually lower from September through December.
    • This violates the assumption that observations are identically distributed since the expected value changes from one month to the next.
  • They usually exhibit what is called autocorrelation.
    • For example, shocks to the economy that affect GDP in one month or quarter (think of COVID or a financial crisis) are likely to have a similar (if smaller) effect on GDP in the next month or quarter.
    • This violates the assumption that observations are independent since values in nearby time periods are related.

We can calculate statistics for time series data, and we already did so in Chapter 5. However, using time series data for rigorous statistical analysis and prediction requires more advanced techniques than we will learn in this course.

ECON 433

Our fourth-year course ECON 433 covers time series econometrics, with applications in macroeconomics and finance.

7.2.3 Other sampling models

Not all useful data sets come from a simple random sample or a time series. For example:

  • A stratified sample is collected by dividing the population into strata (subgroups) based on some observable characteristics, and then randomly sampling a predetermined number of cases within each stratum.
    • Most professional surveys are constructed from stratified samples rather than random samples.
    • Stratified sampling is often combined with oversampling of some smaller strata that are of particular interest.
      • The LFS oversamples residents of Prince Edward Island (PEI) because a national random sample would not catch enough PEI residents to accurately measure PEI’s unemployment rate.
      • Government surveys typically oversample disadvantaged groups.
    • Stratified samples can usually be handled as if they were from a random sample, with some adjustments.
  • A cluster sample is gathered by dividing the population into clusters, randomly selecting some of these clusters, and sampling cases within the cluster.
    • Educational data sets are often gathered this way: we pick a random sample of schools, and then collect data from each student within those schools.
    • Cluster samples can usually be handled as if they were from a random sample, with some adjustments.
  • A census gathers data on every case in the population.
    • For example, we might have data on all fifty US states, or all ten Canadian provinces, or all of the countries of the world.
    • Data from administrative sources such as tax records or school records often cover the entire population of interest as well.
    • Censuses are often treated as random samples from some hypothetical population of “possible” cases.
  • A convenience sample is gathered by whatever method is convenient.
    • For example, we might gather a survey from people who walk by, or we might recruit our friends to participate in the survey.
    • Convenience samples are the worst-case scenario; in many cases they simply aren’t usable for accurate statistical analysis.

Many data sets combine several of these elements. For example, Canada’s unemployment rate is calculated using data from the Labour Force Survey (LFS). The LFS is built from a stratified sample of the civilian non-institutionalized working-age population of Canada. There is also some clustering: the LFS will typically interview whole households, and will do some geographic clustering to save on travel costs. The LFS is gathered monthly, and the resulting unemployment rate is a time series.

7.2.4 Sample selection and representativeness

Random samples and their close relatives have the feature that they are representative of the population from which they are drawn. In a sense that will be made more clear over the next few chapters, any sufficiently large random sample tends to closely resemble the population.

Unfortunately, a simple random sample is not always possible. Even if we are able to randomly select cases, we often run into the following problems:

  • Nonresponse occurs when a sampled individual refuses or is unable to provide the information requested by the survey.
    • Survey-level nonresponse occurs when the sampled individual does not answer any questions.
      • This can occur if the sampled individual cannot be found, refuses to answer, or cannot answer (for example, is incapacitated due to illness or disability).
      • Recent response rates to telephone surveys have been around 9%, implying over 90% of those contacted do not respond.
    • Item-level nonresponse occurs when the sampled individual does not answer a particular question.
      • This can occur if the respondent refuses to answer, or the question is not applicable or has no valid answer.
      • Item-level nonresponse is particularly common on sensitive questions including income.
  • Censoring occurs when a particular quantity of interest cannot be observed for a particular case. Censored outcomes are extremely common in economics, for example:
    • In labour market analysis, we cannot observe the market wage for individuals who are not currently employed.
    • In supply/demand analysis, we only observe quantity supplied and quantity demanded at the current market price.

When observations are subject to nonresponse or censoring, we must interpret the data carefully.

Example 7.5 Wald’s airplanes

Abraham Wald was a Hungarian/American statistician and econometrician who made important contributions to both the theory of statistical inference and the development of economic index numbers such as the Consumer Price Index.

Like many scientists of his time, he advised the US government during World War II. As part of his work, he was provided with data on combat damage received by airplanes, with the hopes that the data could be used to help make the planes more robust to damage. Planes can be reinforced with additional steel, but this is costly and makes them heavier and slower.

The data looked something like this (this isn’t the real data, just a visualization constructed for the example):

[Figure: simulated bullet-hole damage on returning planes, concentrated in the wings and mid-fuselage]

You will notice that most of the damage seems to be in the wings and in the middle of the fuselage (body), while there is little damage to the nose, engines, and back of the fuselage. You might think that means that the wings and middle fuselage are the areas that should be reinforced.

Wald’s team realized that this was wrong: the data were taken from planes that returned, which is not a random sample of planes that went out. Planes were probably shot in the nose, the engines, and the back of the fuselage just as often as anywhere else, but they did not appear often in the data because they crashed. Wald’s insight led to a counter-intuitive policy recommendation: reinforce the parts of the plane that usually show the least damage.

There are two basic solutions to nonresponse and censoring:

  • Reinterpretation: we redefine the population so that our data can be correctly interpreted as a random sample from that population.
    • Abraham Wald’s team made a distinction between a random sample of planes that flew, and a random sample of planes that returned.
    • Canadian survey data can be interpreted as a random sample of Canadians who answer surveys.
  • Imputation: we assume or impute values for all missing quantities.
    • Abraham Wald’s team assumed that the distribution of damage was roughly uniform across all planes that flew.
    • We might assume that Canadians who do not respond to surveys would give similar responses to those given by survey respondents.
    • We might assume that Canadians who do not respond to surveys would give similar responses to those given by survey respondents of the same age, gender, and education.

Like many real-world statistical issues, nonresponse and censoring do not have a purely technical solution. Careful thought is required. If we are imputing values, do we believe that our imputation method is reasonable? If we are redefining the population, is the redefined population one we are interested in? There are no clearly right answers to these questions, and sometimes our data are simply not good enough to answer our questions.

Nonresponse bias in US presidential elections

Going into both the 2016 and 2020 US presidential elections, polls indicated that the Democratic candidate had a substantial lead over the Republican candidate:

  • Hillary Clinton led Donald Trump by 4-6% nationally in 2016
  • Joe Biden led Trump by 8% nationally in 2020.

The actual vote was much closer:

  • Clinton won the popular vote (but lost the election) by 2%
  • Biden won the popular vote (and won the election) by about 4.5%.

The generally accepted explanation among pollsters for the clear disparity between polls and voting is systematic nonresponse: for some reason, Trump voters are less likely to respond to polls. Since most people do not respond to standard telephone polls any more (response rates are typically around 9%), it does not take much difference in response rates to produce a large difference in responses. For example, suppose that:

  • We call 1,000 voters
  • These voters are equally split, with 500 supporting Biden and 500 supporting Trump.
  • 10% of Biden voters respond (50 voters)
  • 8% of Trump voters respond (40 voters)

The overall response rate is 9% (similar to what we usually see in surveys), and Biden has the support of \(50/90 = 56\%\) of the respondents while Trump has the support of \(40/90 = 44\%\). Actual support is even, but the poll shows a 12 percentage point gap in support, entirely because of the small difference in response rates.
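The arithmetic in this example is easy to verify directly; here is the same calculation as a short Python snippet.

```python
# Hypothetical poll: 1,000 voters called, support split evenly.
biden_called, trump_called = 500, 500
biden_respondents = 0.10 * biden_called   # 10% response rate -> 50 respondents
trump_respondents = 0.08 * trump_called   # 8% response rate  -> 40 respondents

total_respondents = biden_respondents + trump_respondents
print("overall response rate:", total_respondents / (biden_called + trump_called))
print("Biden share of respondents:", round(biden_respondents / total_respondents, 3))
print("Trump share of respondents:", round(trump_respondents / total_respondents, 3))
```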

Polling organizations employ statisticians who are well aware of this problem, and they made various adjustments after the 2016 election to address it. For example, most now weight their analysis by education, since more educated people tend to have a higher response rate. Unfortunately, the 2020 results indicate that this adjustment was not enough. Some pollsters have argued that it makes more sense to just assume the nonresponse bias is 2-3% and adjust the numbers by that amount directly.

7.3 Statistics and their properties

A statistic is just a number \(s_n =s(D_n)\) that is calculated from the data. In general, each statistic is:

  • Observed since \(D_n\) is observed.
  • A random variable with a probability distribution that is well-defined but unknown. This is because \(D_n\) is a set of random variables with a well-defined but unknown joint probability distribution.

I will use \(s_n\) to represent a generic statistic, but we will often use other letters to talk about specific statistics.

Example 7.6 Roulette wins

In our two-observation roulette data, the total number of wins is: \[\begin{align} R = x_1 + x_2 \end{align}\] Since this is a number calculated from our data, it is a statistic.

Since the observations are independent draws from the \(Bernoulli(p)\) distribution, the total number of wins has the \(Binomial(2,p)\) distribution. This distribution is unknown because the true value of \(p\) is unknown.

7.3.1 Summary statistics

The univariate summary statistics we previously learned to calculate in Excel will serve as our main examples. Their definitions are reproduced below for convenience:

  • The count or sample size is just the number of valid observations in the data (\(n\)).
  • The sample average is: \[\begin{align} \bar{x}_n = \frac{1}{n} \sum_{i=1}^n x_i \end{align}\]
  • The sample variance is: \[\begin{align} sd_x^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 \end{align}\]
  • The sample standard deviation is: \[\begin{align} sd_x = \sqrt{sd_x^2} \end{align}\] Note: The conventional notation for the sample standard deviation is \(s_x\); I am using \(sd_x\) because I am already using \(s\) to represent a general statistic.
  • The sample median is: \[\begin{align} \hat{m} &= \begin{cases} x_{[(n/2) + (1/2)]} & \textrm{if $n$ is odd} \\ \frac{x_{[n/2]} + x_{[(n/2) + 1]}}{2} & \textrm{if $n$ is even} \\ \end{cases} \end{align}\] where \(x_{[k]}\) is the value of \(x_i\) for the \(k\)th-lowest observation in the data set.

We also learned to construct both simple and binned frequency tables. Let \(B \subset \mathbb{R}\) be a bin of values. Each bin would contain a single value for a simple frequency table, or multiple values for a binned frequency table.

Given a particular bin, we can define:

  • The sample frequency or relative sample frequency of bin \(B\) is the proportion of cases in which \(x_i\) is in \(B\): \[\begin{align} \hat{f}_B = \frac{1}{n} \sum_{i=1}^n I(x_i \in B) \end{align}\]
  • The absolute sample frequency of bin \(B\) is the number of cases in which \(x_i\) is in \(B\): \[\begin{align} n \hat{f}_B = \sum_{i=1}^n I(x_i \in B) \end{align}\]

We can then construct each cell in a frequency table by choosing the appropriate bin.

7.3.2 The sampling distribution

Since the data itself is a collection of random variables, any statistic calculated from that data is also a random variable, with a probability distribution that can be derived from the DGP.

Example 7.7 The sampling distribution of the sample average in the roulette data

Returning to our two-observation roulette data, the sample average is: \[\begin{align} \bar{x} = \frac{1}{2} (x_1 + x_2) \end{align}\] Since there are four possible values of \((x_1,x_2)\) we can determine the sampling distribution of the sample average by enumeration.

Data (\(D_2\)) Probability (\(f_D\)) Sample Average (\(\bar{x}\))
\((0,0)\) \((1-p)^2\) \(0.0\)
\((0,1)\) \(p(1-p)\) \(0.5\)
\((1,0)\) \(p(1-p)\) \(0.5\)
\((1,1)\) \(p^2\) \(1.0\)

Therefore the sampling distribution of \(\bar{x}\) can be described by the PDF: \[\begin{align} f_{\bar{x}}(a) &= \begin{cases} (1-p)^2 & \textrm{if $a=0$} \\ 2p(1-p) & \textrm{if $a=0.5$} \\ p^2 & \textrm{if $a=1$} \\ 0 & \textrm{otherwise} \\ \end{cases} \end{align}\]
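This kind of enumeration is easy to automate for small discrete data sets. The Python sketch below reproduces the table above for an assumed value of \(p\) (illustrative only), summing the probabilities of all outcomes that give each value of the sample average.

```python
from itertools import product
from collections import defaultdict

p = 18 / 37  # assumed win probability (the true value is unknown)

pdf_xbar = defaultdict(float)
for x1, x2 in product([0, 1], repeat=2):
    prob = (p if x1 else 1 - p) * (p if x2 else 1 - p)  # f_D(x1, x2)
    xbar = (x1 + x2) / 2                                 # value of the statistic
    pdf_xbar[xbar] += prob                               # sum over outcomes

for value, prob in sorted(pdf_xbar.items()):
    print(f"Pr(xbar = {value}) = {prob:.4f}")
```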

Unfortunately, most statistics have sampling distributions that are quite difficult to calculate, for the same reason the DGP is quite difficult to calculate: there are just too many possible values. Again, the important part is to understand what a sampling distribution is, that every statistic has one, and that it depends on the (usually unknown) DGP.

7.3.3 The mean

If our statistic has a probability distribution, it (usually) has a mean and variance as well.

Example 7.8 The mean of the sample average in the roulette data

We can calculate the expected value of \(\bar{x}\) in the two-observation roulette data directly from the PDF derived in the previous example: \[\begin{align} E(\bar{x}) &= 0 \times f_{\bar{x}}(0) + 0.5 \times f_{\bar{x}}(0.5) + 1 \times f_{\bar{x}}(1) \\ &= 0 \times (1-p)^2 + 0.5 \times 2p(1-p) + 1.0 \times p^2 \\ &= p \end{align}\] Notice that \(E(x_i) = p\) as well. This is a special case of a more general result that will be developed below.

As mentioned earlier, it is often impractical or impossible to calculate the sampling distribution for a given statistic. Fortunately, the mean is usually much easier to calculate.

In particular, suppose we have a simple random sample of size \(n\) on the random variable \(x_i\) with unknown mean \(E(x_i) = \mu_x\). Then the mean of the sample average is: \[\begin{align} E(\bar{x}_n) &= E\left( \frac{1}{n} \sum_{i=1}^n x_i\right) \\ &= \frac{1}{n} \sum_{i=1}^n E\left( x_i\right) \\ &= \frac{1}{n} \sum_{i=1}^n \mu_x \\ &= \mu_x \end{align}\] Notice that we took advantage of the fact that the sample average is a linear function of the \(x_i\)’s, so we can take the expected value “inside” of that function. This is an important and fairly general result in statistics - although we have derived it here for the specific case of a simple random sample, it applies for many other common sampling schemes.

7.3.4 The variance

The variance of a statistic can also be calculated in many cases.

Example 7.9 The variance of the sample average in the roulette data

In the two-observation roulette data, the variance of the sample average is: \[\begin{align} var(\bar{x}) &= var\left(\frac{1}{2}(x_1 + x_2)\right) \\ &= \left(\frac{1}{2}\right)^2 var(x_1 + x_2) \\ &= \frac{1}{4} \left( var(x_1) + \underbrace{2 \, cov(x_1,x_2)}_{\textrm{$= 0$ (by independence)}} + var(x_2) \right) \\ &= \frac{1}{4} \left( 2 \, var(x_i) \right) \\ &= \frac{var(x_i)}{2} \end{align}\]

Notice that \(var(\bar{x}_n) < var(x_i)\). Averages are typically less variable than the thing they are averaging.

In general, we can prove that the sample average in a random sample of size \(n\) has variance: \[\begin{align} var(\bar{x}_n) = \frac{\sigma_x^2}{n} \end{align}\] where \(\sigma_x^2 = var(x_i)\). Similarly the standard deviation of the sample average in a random sample of size \(n\) is \(\sigma_x/\sqrt{n}\).
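These formulas can be checked by simulation. The sketch below draws many random samples from a distribution chosen purely for illustration (a uniform on \([0,1]\), with known mean \(1/2\) and variance \(1/12\)) and compares the simulated mean and variance of \(\bar{x}_n\) with \(\mu_x\) and \(\sigma_x^2/n\).

```python
import numpy as np

rng = np.random.default_rng(seed=42)

n, reps = 25, 100_000
mu_x, sigma2_x = 0.5, 1 / 12   # mean and variance of a U(0,1) random variable

# Each row is one random sample of size n; each sample yields one x-bar.
samples = rng.uniform(0, 1, size=(reps, n))
xbars = samples.mean(axis=1)

print("E(xbar)  : theory =", mu_x, "  simulation =", round(xbars.mean(), 4))
print("var(xbar): theory =", round(sigma2_x / n, 5), "  simulation =", round(xbars.var(), 5))
```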

7.4 Estimation

One of the most important uses of statistics is to estimate, or guess the value of, some unknown feature of the population or DGP. Anyone can make a guess, but using data to make accurate guesses requires some attention to the details.

7.4.1 Parameters

A parameter is an unknown number \(\theta = \theta(f_D)\) whose value depends on the DGP. In general, each parameter’s value is:

  • Unobserved/unknown since its value depends on the unknown DGP \(f_D\).
  • Fixed (not random) since the DGP \(f_D\) is a fixed function and not a random variable.

I will use \(\theta\) to represent a generic parameter, but we will often use other letters to talk about specific parameters.

For example, we might have a random sample of size \(n\) on the random variable \(x_i\), and we can define the parameter \(\mu = E(x_i)\). Like all expected values, \(\mu\) is a number, and we could determine its value if we knew the true DGP. But we do not know the true DGP, and so we do not know the value of \(\mu\).

Example 7.10 Examples of parameters

Sometimes a single parameter completely describes the DGP:

  • In our roulette data set, the joint distribution of the data depends only on the (known) sample size \(n\) and the single (unknown) parameter \(p = \Pr(\textrm{Red wins})\).

Sometimes a group of parameters completely describe the DGP:

  • If \(x_i\) is a random sample from the \(U(L,H)\) distribution, then \(L\) and \(H\) are both parameters.

And sometimes a parameter only partially describes the DGP

  • If \(x_i\) is a random sample from some unknown distribution with unknown mean \(\mu_x = E(x_i)\), then \(\mu_x\) is a parameter.
  • If \(x_i\) is a random sample from some unknown distribution with unknown median \(m_x\), then \(m_x\) is a parameter.

Typically there will be particular parameters whose value we wish to know. Such a parameter is called a parameter of interest. Our model may include other parameters, which are typically called auxiliary parameters or nuisance parameters.

7.4.2 Estimators

An estimator is a statistic \(\hat{\theta}_n = \hat{\theta}(D_n)\) that is used to estimate (guess at the value of) an unknown parameter of interest \(\theta\). Since it is a statistic constructed from \(D_n\), an estimator is generally:

  • Observed since \(D_n\) is observed.
  • A random variable with a well-defined but unknown probability distribution.

I will use \(\hat{\theta}\) to represent a generic estimator, but we will often use other notation to talk about specific estimators. The circumflex or “hat” \(\hat{\,}\) notation is commonly used to identify an estimator.

Example 7.11 Two estimators for the mean of a random sample

Suppose we have a random sample of size \(n\) on the random variable \(x_i\) with unknown mean \(\mu_x = E(x_i)\) and unknown variance \(\sigma_x^2 = var(x_i)\). We will consider two estimators for the unknown parameter \(\mu_x\):

  1. The sample average \(\bar{x}_n\).
  2. The value of the first observation \(x_1\).

These are both statistics calculated from the data, so they are both valid estimators for \(\mu_x\).

In addition to the usual properties of a statistic, estimators have properties specific to their purpose of representing the unknown parameter of interest.

7.4.3 Sampling error

Let \(\hat{\theta}_n\) be a statistic we are using as an estimator of some parameter of interest \(\theta\). We can define its sampling error as: \[\begin{align} err(\hat{\theta}_n) = \hat{\theta}_n - \theta \end{align}\] In principle, we want \(\hat{\theta}_n\) to be a good estimator of \(\theta\), i.e., we want the sampling error to be as close to zero as possible.

There are several major complications to keep in mind:

  1. Since \(\hat{\theta}_n\) is a random variable with a probability distribution, \(err(\hat{\theta}_n)\) is also a random variable with a probability distribution.
  2. Since the value of \(\theta\) is unknown, the value of \(err(\hat{\theta}_n)\) is also unknown.

Always remember that \(err(\hat{\theta}_n)\) is not an inherent property of the statistic - it depends on the relationship between the statistic and the parameter of interest. A given statistic may be a good estimator of one parameter, and a bad estimator of another parameter.

7.4.4 Bias

Since any statistic could in principle be used to estimate any parameter, we need some way of choosing among the possibilities. In choosing an estimator, we can consider several criteria.

The first is the bias of the estimator, which is defined as its expected sampling error: \[\begin{align} bias(\hat{\theta}_n) &= E(err(\hat{\theta}_n)) \\ &= E(\hat{\theta}_n - \theta) \\ &= E(\hat{\theta}_n) - \theta \end{align}\] Note that bias is a property of the sampling error, so it is always defined relative to the parameter we wish to estimate, and is not an inherent property of the statistic.

Ideally we would want \(bias(\hat{\theta}_n)\) to be zero, in which case we would say that \(\hat{\theta}_n\) is an unbiased estimator of \(\theta\).

Example 7.12 Two unbiased estimators of the mean

Consider a random sample of size \(n\) on the random variable \(x_i\) with unknown mean \(\mu_x = E(x_i)\).

The sample average is an unbiased estimator of \(\mu_x\) since: \[\begin{align} bias(\bar{x}_n) &= E(\bar{x}_n) - \mu_x \\ &= \mu_x - \mu_x \\ &= 0 \end{align}\] The first observation is also an unbiased estimator of \(\mu_x\) since: \[\begin{align} bias(x_1) &= E(x_1) - \mu_x \\ &= \mu_x - \mu_x \\ &= 0 \end{align}\] This example illustrates a general principle: there is rarely exactly one unbiased estimator. There are either none, or many.

If the bias is nonzero, we would say that \(\hat{\theta}_n\) is a biased estimator of \(\theta\).

Example 7.13 A biased estimator of the median

Consider our two-observation roulette data, and suppose we are interested in estimating the median of \(x_i\). Applying our definition of the median from an earlier chapter, the median of \(x_i\) would be: \[\begin{align} m &= \begin{cases} 0 & \textrm{if $p \leq 0.5$} \\ 1 & \textrm{if $p > 0.5$} \\ \end{cases} \end{align}\] A natural estimator of the median is the sample median. In our two-observation example, the sample median would be: \[\begin{align} \hat{m} &= \frac{1}{2} (x_1 + x_2) \end{align}\] and its expected value would be: \[\begin{align} E(\hat{m}) &= E\left(\frac{1}{2} (x_1 + x_2) \right) \\ &= \frac{1}{2} \left( E(x_1) + E(x_2) \right) \\ &= \frac{1}{2} \left( p + p \right) \\ &= p \end{align}\] So its bias as an estimator of \(m\) is: \[\begin{align} bias(\hat{m}) &= E(\hat{m}) - m \\ &= p - m \\ &= \begin{cases} p & \textrm{if $p \leq 0.5$} \\ p-1 & \textrm{if $p > 0.5$} \\ \end{cases} \end{align}\] This is nonzero (unless \(p\) is zero or one), so the sample median is a biased estimator of the median in this case.

This result applies more generally: the sample median is typically a biased estimator of the median. It is still the most commonly-used estimator for the median, for reasons we will discuss soon.
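Because the sampling distribution of \(\hat{m}\) was derived above, its bias can be evaluated exactly for any hypothetical value of \(p\). The short sketch below does this for a few illustrative values.

```python
# Bias of the sample median in the two-game roulette example,
# evaluated at a few hypothetical values of p (the true p is unknown).
for p in [0.2, 0.4, 18 / 37, 0.6, 0.8]:
    expected_sample_median = p             # E(m_hat) = p, derived above
    true_median = 0 if p <= 0.5 else 1     # median of a Bernoulli(p) variable
    bias = expected_sample_median - true_median
    print(f"p = {p:.3f}: bias of the sample median = {bias:+.3f}")
```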

7.4.5 Variance and the MVUE

If there are multiple unbiased estimators available for a given parameter, we need to apply a second criterion to choose one. A natural second criterion is the variance of the estimator: \[\begin{align} var(\hat{\theta}_n) = E[(\hat{\theta}_n - E(\hat{\theta}_n))^2] \end{align}\] Why do we care about the variance? A low variance means that \(\hat{\theta}_n\) is usually close to its own expected value. If \(\hat{\theta}_n\) is unbiased, then that means it is also close to \(\theta\). So low variance is preferable to high variance, at least when comparing unbiased estimators.

The minimum variance unbiased estimator (MVUE) of a parameter is the unbiased estimator with the lowest variance, and the MVUE criterion for choosing an estimator says to choose the MVUE.

Example 7.14 The variance of two estimators of the mean

Suppose we have a random sample of size \(n > 1\) on the random variable \(x_i\) with unknown mean \(\mu_x = E(x_i)\) and unknown variance \(\sigma_x^2 = var(x_i)\).

We earlier showed that the sample average \(\bar{x}_n\) and the first observation \(x_1\) are both unbiased estimators of \(\mu_x\). We can select the “better” estimator by applying the MVUE criterion.

The variance of the sample average is: \[\begin{align} var(\bar{x}_n) = \sigma_x^2/n \end{align}\] and the variance of the first observation estimator is: \[\begin{align} var(x_1) = \sigma_x^2 \end{align}\] Since \(\bar{x}_n\) has lower variance for all \(n > 1\), it is the preferred estimator of these two unbiased estimators under the MVUE criterion.
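A simulation illustrates the same comparison: both estimators are centred on \(\mu_x\), but the first observation is far more variable than the sample average. The distribution, sample size, and seed below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

n, reps = 20, 100_000
samples = rng.exponential(scale=2.0, size=(reps, n))  # mean 2, variance 4

xbar = samples.mean(axis=1)   # sample-average estimator of the mean
x1 = samples[:, 0]            # first-observation estimator of the mean

print("mean of xbar:", round(xbar.mean(), 3), "  mean of x1:", round(x1.mean(), 3))
print("var of xbar :", round(xbar.var(), 3), "  var of x1 :", round(x1.var(), 3))
```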

7.4.6 Mean squared error

Unfortunately, once we move beyond the simple case of the sample average, we run into several complications.

The first complication is that an unbiased estimator may not exist for a particular parameter of interest. If there is no unbiased estimator, there is no MVUE. So we need some other way of choosing an estimator.

Example 7.15 The sample median

In general, there is no unbiased estimator of the median of a random variable with an unknown distribution.

The second complication is that we often have access to an unbiased estimator and a biased estimator with lower variance. Depending on the details, the biased estimator may be clearly preferable to the unbiased estimator.

Example 7.16 The relationship between age and earnings

Labour economists are often interested in the relationship between age and earnings. Typically, workers earn more as they get older but earnings do not increase at a constant rate. Instead, earnings rise rapidly in a typical worker’s 20s and 30s, then gradually flatten out. This pattern affects many economically important decisions like education, savings, household formation, having children, etc.

Suppose we want to estimate the earnings of the average 35-year-old Canadian, and have access to a random sample of 800 Canadians with 10 observations for each age between 0 and 80.

The average earnings of 35-year-olds in our data would be an unbiased estimator of the average earnings of 35-year-olds in Canada. However, it would be based on only 10 observations, and its variance would be very high.

We could increase the sample size and reduce the variance by including observations from people who are almost 35 years old. We have many options, including:

  • Average earnings of the 10 people in our data who are exactly 35 years old.
  • Average earnings of the 30 people in our data aged 34-36.
  • Average earnings of the 100 people in our data aged 30-39.
  • Average earnings of all 800 people in our data aged 0-80.

By widening the age range, these averages will have lower variance but will introduce bias (since they have added people that are not exactly like 35-year-olds).

This set of issues implies that we need a criterion that:

  • Can be used to choose between biased estimators.
  • Can choose slightly biased estimators with low variance over unbiased estimators with high variance.

The mean squared error of an estimator is defined as the expected value of the squared sampling error: \[\begin{align} MSE(\hat{\theta}_n) &= E[err(\hat{\theta}_n)^2] \\ &= E[(\hat{\theta}_n-\theta)^2] \\ &= var(\hat{\theta}_n) + [bias(\hat{\theta}_n)]^2 \end{align}\] and the MSE criterion says to choose the (biased or unbiased) estimator with the lowest MSE.

Example 7.17 The MSE of the sample mean and first observation estimators

The mean squared error of the sample average is: \[\begin{align} MSE(\bar{x}_n) = var(\bar{x}_n) + [bias(\bar{x}_n)]^2 = \frac{\sigma_x^2}{n} + 0^2 = \frac{\sigma_x^2}{n} \end{align}\] and the mean squared error of the first observation estimator is: \[\begin{align} MSE(x_1) = \sigma_x^2 \end{align}\] The sample average is the preferred estimator by the MSE criterion, so in this case we get the same result as applying the MVUE criterion.

The MSE criterion allows us to choose a biased estimator with low variance over an unbiased estimator with high variance, and also allows us to choose between biased estimators when no unbiased estimator exists.
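Returning to the age-earnings illustration in Example 7.16, the sketch below simulates a hypothetical age-earnings profile (the functional form, noise level, and sample layout are made up purely for illustration; the simulated sample has 10 people at each age from 0 to 80) and compares narrow and wide age windows as estimators of expected earnings at age 35. In this particular simulation, the narrowest window is unbiased but noisy, while a moderately wide window accepts a little bias in exchange for a much lower variance and ends up with the lowest MSE.

```python
import numpy as np

rng = np.random.default_rng(seed=2024)

def true_mean_earnings(age):
    # Hypothetical concave age-earnings profile (illustrative only).
    return 20_000 + 2_500 * age - 30 * age ** 2

ages = np.repeat(np.arange(0, 81), 10)   # 10 people at each age from 0 to 80
target = true_mean_earnings(35)          # parameter of interest: mean earnings at 35
reps, noise_sd = 5_000, 15_000           # simulation size and earnings noise (assumed)

for half_width in [0, 1, 5, 40]:
    in_window = np.abs(ages - 35) <= half_width
    window_means = true_mean_earnings(ages[in_window])
    # Simulate many data sets; in each one, average earnings within the window.
    noise = rng.normal(0, noise_sd, size=(reps, in_window.sum()))
    estimates = (window_means + noise).mean(axis=1)
    bias = estimates.mean() - target
    mse = ((estimates - target) ** 2).mean()
    print(f"ages 35±{half_width:>2}: n = {in_window.sum():>3}, "
          f"bias ≈ {bias:>8.0f}, variance ≈ {estimates.var():>11.0f}, MSE ≈ {mse:>11.0f}")
```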

Estimating bias and MSE

In most cases, the bias and mean squared error of our estimators depend on the unknown DGP and cannot be calculated. So why do we bother talking about them?

  1. They provide a clear framework for thinking about trade-offs in data analysis, even when we are unable to precisely quantify those trade-offs.
  2. Some advanced statistical techniques such as cross-validation and the bootstrap make it possible to estimate the bias and/or MSE of an estimator. These methods work (roughly) by estimating the parameter on multiple random subsets of the data, treating the original data set as the “true” population.

Cross-validation is a critical element of most machine learning techniques, which typically work by estimating a very large number of complex prediction models and then selecting the model that performs best in cross-validation.

7.4.7 Standard errors

Parameter estimates are typically reported along with their standard errors. The standard error of a statistic is an estimate of its standard deviation, so it gives a rough idea of how accurate the estimate is likely to be.

The first step in constructing a standard error is to find the actual standard deviation of the statistic, in terms of (probably unknown) parameters of the DGP.

Example 7.18 The standard deviation of the average in a random sample

Consider a random sample of size \(n\) on the random variable \(x_i\) with unknown mean \(\mu_x = E(x_i)\) and unknown variance \(\sigma_x^2 = var(x_i)\).

We earlier showed that the variance of the sample average \(\bar{x}_n\) was: \[\begin{align} var(\bar{x}_n) &= \frac{var(x_i)}{n} \\ &= \frac{\sigma_x^2}{n} \\ \end{align}\] and so its standard deviation is: \[\begin{align} sd(\bar{x}_n) &= \frac{\sigma_x}{\sqrt{n}} \end{align}\] The sample size \(n\) is known, but the standard deviation \(\sigma_x\) is unknown. So we will need to estimate it in order to estimate \(sd(\bar{x}_n)\).

The next step is to find suitable estimators for any unknown parameter values.

Example 7.19 An unbiased estimator of the variance

Consider a random sample of size \(n\) on the random variable \(x_i\) with unknown mean \(\mu_x = E(x_i)\) and unknown variance \(\sigma_x^2 = var(x_i)\).

The sample variance is an unbiased estimator of the population variance: \[\begin{align} E(sd_x^2) = \sigma_x^2 = var(x_i) \end{align}\] This is not hard to prove, but I will skip it for now.

Unfortunately, the square root is not a linear function: \[\begin{align} E(sd_x) = E(\sqrt{sd_x^2}) \neq \sqrt{E(sd_x^2)} = \sigma_x \end{align}\] so the sample standard deviation is a biased estimator of the population standard deviation. Even so, the bias is typically small and so this is the estimator we typically use.
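Both claims are easy to check by simulation: the sample variance averages out to \(\sigma_x^2\), while the sample standard deviation sits slightly below \(\sigma_x\), with the gap shrinking as \(n\) grows. The distribution, sample sizes, and seed below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=99)
sigma_x, reps = 2.0, 50_000

for n in [5, 20, 100]:
    samples = rng.normal(0, sigma_x, size=(reps, n))
    s2 = samples.var(axis=1, ddof=1)   # sample variance (n - 1 denominator)
    s = np.sqrt(s2)                    # sample standard deviation
    print(f"n = {n:>3}: average sd_x^2 = {s2.mean():.3f} (sigma_x^2 = {sigma_x**2}), "
          f"average sd_x = {s.mean():.3f} (sigma_x = {sigma_x})")
```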

Finally, we substitute to get the formula for the standard error.

Example 7.20 The standard error of the average

The usual standard error for the sample average in a random sample of size \(n\) is: \[\begin{align} se(\bar{x}_n) = \frac{sd_x}{\sqrt{n}} \end{align}\] where \(sd_x\) is the sample standard deviation.
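In software, this is a one-line calculation. The snippet below computes the standard error of the sample average for a small made-up data vector.

```python
import numpy as np

x = np.array([12.1, 9.8, 14.3, 11.0, 10.5, 13.2, 12.7, 9.9])  # hypothetical data

n = x.size
sd_x = x.std(ddof=1)            # sample standard deviation (n - 1 denominator)
se_xbar = sd_x / np.sqrt(n)     # standard error of the sample average

print("sample average:", round(x.mean(), 3))
print("standard error:", round(float(se_xbar), 3))
```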

Standard errors are useful as a data-driven measure of how variable a particular statistic or estimator is likely to be. We will also use them more formally as part of formulas for hypothesis testing and other statistical inference procedures.

7.5 The law of large numbers

Our analysis of the sampling distribution for a statistic/estimator, and related properties such as mean, variance, bias, and mean squared error, has mostly looked at sample averages. The sample average is particularly easy to characterize because it is a sum, and so we can exploit the linearity of the expected value to derive simple characterizations of its distribution. We can do this for a few other statistics, for example the sample frequency, but most statistics - including medians, quantiles, and standard deviations - are nonlinear functions of the data.

In order to deal with those statistics, we need to construct approximations based on their asymptotic properties. The asymptotic properties of a statistic are properties that hold approximately, with the approximation getting closer and closer to the truth as the sample size gets larger.

Every property of a statistic we have discussed so far - sampling distribution, mean, variance, bias, mean squared error, etc. - holds exactly for any sample size. Such properties are sometimes called exact or finite sample properties, to distinguish them from asymptotic properties.

We will state two main asymptotic results in this chapter: the law of large numbers and Slutsky’s theorem. A third asymptotic result called the central limit theorem will be discussed later.

All three results rely on the concept of a limit, which you would have learned in your calculus course. If you need to review that concept, please see the section on limits in the math appendix. However, I will not expect you to do any significant amount of math with limits. Please focus on the intuition and interpretation and don’t worry too much about the math.

7.5.1 Defining the LLN

The law of large numbers (LLN) says that for a large enough random sample, the sample average is nearly identical to the corresponding population mean with a very high probability.

The law of large numbers

In order to state the LLN in a more mathematically rigorous way, we need to clarify what “large enough”, “nearly identical” and “very high” mean. This will require the use of limits.

Consider a data set \(D_n\) of size \(n\), and think of it as part of an infinite sequence of data sets \((D_1, D_2, \ldots)\) where \(D_n\) is just the first \(n\) observations in an infinite sequence of observations \((x_1,x_2,\ldots)\). Let \(s_n\) be some statistic calculated from \(D_n\); we can also think of it as part of an infinite sequence \((s_1,s_2,\ldots)\).

We say that \(s_n\) converges in probability to some constant \(c\) if: \[\begin{align} \lim_{n \rightarrow \infty} \Pr( |s_n - c| < \epsilon) = 1 \end{align}\] for any positive number \(\epsilon > 0\).

Intuitively, what this means is that for a sufficiently large \(n\) (the \(\lim_{n \rightarrow \infty}\) part), \(s_n\) is almost certainly (the \(\Pr(\cdot) = 1\) part) very close to \(c\) (the \(|s_n-c| < \epsilon\) part).

We have a compact way of writing convergence in probability: \[\begin{align} s_n \rightarrow^p c \end{align}\] means that \(s_n\) converges in probability to \(c\).

Having defined our terms we can now state the law of large numbers.

LAW OF LARGE NUMBERS: Let \(\bar{x}_n\) be the sample average from a random sample of size \(n\) on the random variable \(x_i\) with mean \(E(x_i) = \mu_x\). Then: \[\begin{align} \bar{x}_n \rightarrow^p \mu_x \end{align}\]

One way of understanding the law of large numbers is to look at our earlier results on mean squared error. We earlier found that \(MSE(\bar{x}_n) = var(x_i)/n\) in a random sample. Note that the right side of this equation goes to zero as \(n\) goes to infinity, implying that the distribution of \(\bar{x}_n\) gradually settles down to a very small range around \(\mu_x\).
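The sketch below shows the LLN at work in the roulette example: as the number of simulated games grows, the running sample average of a bet on red settles down near \(p\). The value of \(p\) and the seed are assumptions made only so the simulation can run.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
p = 18 / 37   # assumed probability that a bet on red wins

games = rng.binomial(1, p, size=1_000_000)               # 0/1 result of each game
running_average = games.cumsum() / np.arange(1, games.size + 1)

for n in [10, 100, 1_000, 10_000, 100_000, 1_000_000]:
    print(f"n = {n:>9,}: xbar_n = {running_average[n - 1]:.5f}   (p = {p:.5f})")
```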

The LLN in the economy

The law of large numbers is extremely powerful and important, as it is the basis for the gambling industry, the insurance industry, and much of the banking industry.

A casino makes money in games like roulette by taking in a large number of independent small bets. These bets have a small house advantage, so their expected benefit to the casino is positive. The casino makes money on some bets and loses money on others, but the LLN ensures that a casino is almost certain to have more profits than losses if it takes enough bets.

Gambling is often considered a glamorous but shady business and insurance is often considered a boring but respectable business, but in many ways they are the same business. An insurance company operates just like a casino. Each of us faces a small risk of a catastrophic cost: a house that burns down, a car accident leading to serious injury, etc. Insurance companies collect a little bit of money from each of us, and pay out a lot of money to the small number of people who have claims. Although the context is quite different, the underlying economics are identical to those of a casino: the insurance company prices its products so that its revenues exceed its expected payout, and takes on a large number of independent risks to ensure that its revenues exceed its actual payout.

Sometimes insurance companies do lose money, and even go bankrupt. The usual cause of this is a big systemic event like a natural disaster, pandemic or financial crisis that affects everyone. Here the independence needed for the LLN does not apply, and an insurance company can take losses that substantially exceed its gains. For example, several insurance companies in Florida went bankrupt in 2022 as a result of property damage claims from Hurricane Ian.

Casinos can also face non-independent risks, for example in sports betting where many players are betting on the same outcome. Casinos typically address this by using different betting rules: rather than promising a fixed payout to winners, they take a fixed percentage (sometimes called the “vigorish” or “vig”) of the total amount bet, and pay the rest out to the winners.

7.5.2 Consistent estimation

We say that the statistic \(\hat{\theta}_n\) is a consistent estimator of a parameter \(\theta\) if \(\hat{\theta}_n\) is nearly identical to \(\theta\) with very high probability for a large enough sample.

The law of large numbers implies that the sample average is a consistent estimator of the corresponding population mean, but we can go much further than that. Almost all commonly-used estimators are consistent in a random sample. For example:

  • The sample variance is a consistent estimator of the population variance.
  • The sample standard deviation is a consistent estimator of the population standard deviation.
  • The relative sample frequency is a consistent estimator of the population probability.
  • The sample median is a consistent estimator of the population median.
  • All other sample quantiles are consistent estimators of the corresponding population quantile.

Similar results can be found for most other commonly-used estimators and sampling schemes. The reason for this is an advanced result called Slutsky’s theorem.

Consistency and Slutsky’s theorem

More formally, we say that \(\hat{\theta}_n\) is a consistent estimator of \(\theta\) if: \[\begin{align} \hat{\theta}_n \rightarrow^p \theta \end{align}\] As noted earlier, most commonly-used estimators are consistent.

The key to demonstrating this property is a result called Slutsky’s theorem. Slutsky’s theorem roughly says that if the law of large numbers applies to a statistic \(s_n\), it also applies to \(g(s_n)\) for any continuous function \(g(\cdot)\).

SLUTSKY THEOREM: Let \(g(\cdot)\) be a continuous function. Then: \[\begin{align} s_n \rightarrow^p c \implies g(s_n) \rightarrow^p g(c) \end{align}\]

Almost all commonly-used estimators can be written as a continuous function of one or more sample averages, and almost all parameters can be written as a continuous function of one or more population means. So we can prove consistency of an estimator in two steps:

  1. Use the LLN to prove that the sample averages converge to their expected values.
  2. Use Slutsky’s theorem to prove that the function of sample averages (the estimator) converges to the same function of the corresponding expected values (the parameter).

Actually doing that for a particular estimator is well beyond the scope of this course. The important thing to know is that it can be done, and to have a clear idea what consistency means.
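As an informal check of what consistency looks like, the sketch below computes the sample standard deviation (a continuous function of sample averages) for progressively larger simulated samples and watches it settle down near the population value \(\sigma_x\). The distribution and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
sigma_x = 3.0   # population standard deviation of the simulated variable

for n in [10, 100, 10_000, 1_000_000]:
    x = rng.normal(0, sigma_x, size=n)
    print(f"n = {n:>9,}: sample sd = {x.std(ddof=1):.4f}   (sigma_x = {sigma_x})")
```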

Chapter review

Statistics is the core subject of this course, making this chapter the most important one in the entire book.

In this chapter we have learned to model a data generating process, describe the probability distribution of a statistic, and interpret a statistic as estimating some unknown parameter of the underlying data generating process.

An estimator is rarely identical to the parameter of interest, so any conclusions based on estimating a parameter of interest have a degree of uncertainty. To describe this uncertainty in a rigorous and quantitative manner, we will next learn some principles of statistical inference.

Practice problems

Answers can be found in the appendix.

GOAL #1: Describe the joint probability distribution of a very simple data set

  1. Suppose we have a data set \(D_n = (x_1,x_2)\) that is a random sample of size \(n = 2\) on the random variable \(x_i\) which has discrete PDF: \[\begin{align} f_x(a) &= \begin{cases} 0.4 & a = 1 \\ 0.6 & a = 2 \\ \end{cases} \end{align}\] Let \(f_{D_n}(a,b) = \Pr(x_1=a \cap x_2 = b)\) be the joint PDF of the data set
    1. Find the support \(S_{D_n}\).
    2. Find \(f_{D_n}(1,1)\).
    3. Find \(f_{D_n}(2,1)\).
    4. Find \(f_{D_n}(1,2)\).
    5. Find \(f_{D_n}(2,2)\).

GOAL #2: Identify the key features of a random sample

  1. Suppose we have a data set \(D_n = (x_1,x_2)\) of size \(n = 2\). For each of the following conditions, identify whether it implies that \(D_n\) is (i) definitely a random sample; (ii) definitely not a random sample; or (iii) possibly a random sample.
    1. The two observations are independent and have the same mean \(E(x_1) = E(x_2) = \mu_x\).
    2. The two observations are independent and have the same mean \(E(x_1) = E(x_2) = \mu_x\) and variance \(var(x_1)=var(x_2)=\sigma_x^2\).
    3. The two observations are independent and have different means \(E(x_1) \neq E(x_2)\).
    4. The two observations have the same PDFs, and are independent.
    5. The two observations have the same PDFs, and have \(corr(x_1,x_2) = 0\)
    6. The two observations have the same PDFs, and have \(cov(x_1,x_2) > 0\).

GOAL #3: Classify data sets by sampling types

  1. Identify the sampling type (random sample, time series, stratified sample, cluster sample, census, convenience sample) for each of the following data sets.
    1. A data set from a survey of 100 SFU students who I found waiting in line at Tim Horton’s.
    2. A data set from a survey of 1,000 randomly selected SFU students.
    3. A data set from a survey of 100 randomly selected SFU students from each faculty.
    4. A data set that reports total SFU enrollment for each year from 2005-2020.
    5. A data set from administrative sources that describes demographic information and postal code of residence for all SFU students in 2020.

GOAL #4: Find the sampling distribution of a very simple statistic

  1. Suppose we have the data set described in question 1 above. Find the support \(S\) and sampling distribution \(f(\cdot)\) for:
    1. The sample frequency \(\hat{f}_1 = \frac{I(x_1=1) + I(x_2=1)}{2}\).
    2. The sample average \(\bar{x} = (x_1 + x_2)/2\).
    3. The sample variance \(sd_x^2 = (x_1-\bar{x})^2 + (x_2-\bar{x})^2\).
    4. The sample standard deviation \(sd_x = \sqrt{sd_x^2}\).
    5. The sample minimum \(xmin = \min(x_1, x_2)\).
    6. The sample maximum \(xmax = \max(x_1, x_2)\).

GOAL #5: Find the mean and variance of a statistic from its sampling distribution

  1. Suppose we have the data set described in question 1 above. Find the expected value of:
    1. The sample frequency \(\hat{f}_1 = \frac{I(x_1=1) + I(x_2=1)}{2}\).
    2. The sample average \(\bar{x} = (x_1 + x_2)/2\).
    3. The sample variance \(sd_x^2 = (x_1-\bar{x})^2 + (x_2-\bar{x})^2\).
    4. The sample standard deviation \(sd_x = \sqrt{sd_x^2}\).
    5. The sample minimum \(xmin = \min(x_1, x_2)\).
    6. The sample maximum \(xmax = \max(x_1, x_2)\).
  2. Suppose we have the data set described in question 1 above. Find the variance of:
    1. The sample frequency \(\hat{f}_1 = \frac{I(x_1=1) + I(x_2=1)}{2}\).
    2. The sample average \(\bar{x} = (x_1 + x_2)/2\).
    3. The sample minimum \(xmin = \min(x_1, x_2)\).
    4. The sample maximum \(xmax = \max(x_1, x_2)\).

GOAL #6: Find the mean and variance of a statistic that is linear in the data

  1. Suppose we have a data set \(D_n = (x_1,x_2)\) that is a random sample of size \(n = 2\) on the random variable \(x_i\) which has mean \(E(x_i) = 1.6\) and variance \(var(x_i) = 0.24\). Find the mean and variance of:
    1. The first observation \(x_1\).
    2. The sample average \(\bar{x} = (x_1 + x_2)/2\).
    3. The weighted average \(w = 0.2 x_1 + 0.8 x_2\).

GOAL #7: Distinguish between parameters, statistics, and estimators

  1. Suppose \(D_n\) is a random sample of size \(n=100\) on a random variable \(x_i\) which has the \(N(\mu,\sigma^2)\) distribution. Which of the following are unknown parameters of the DGP? Which are statistics calculated from the data?
    1. \(D_n\)
    2. \(n\)
    3. \(x_i\)
    4. \(i\)
    5. \(N\)
    6. \(\mu\)
    7. \(\sigma^2\)
    8. \(E(x_i)\)
    9. \(E(x_i^3)\)
    10. \(var(x_i)\)
    11. \(sd(x_i)/\sqrt{n}\)
    12. \(\bar{x}\)
  2. Suppose we have the data set described in question 1 above. Find the true value of:
    1. The probability \(\Pr(x_i=1)\).
    2. The population mean \(E(x_i)\).
    3. The population variance \(var(x_i)\).
    4. The population standard deviation \(sd(x_i)\).
    5. The population minimum \(\min(S_x)\).
    6. The population maximum \(\max(S_x)\).

GOAL #8: Calculate the sampling error of an estimator

  1. Suppose we have the data set described in question 1 above. Suppose we use the sample maximum \(xmax = \max(x_1,x_2)\) to estimate the population maximum \(\max(S_x)\).
    1. Find the support \(S_{err}\) of the sampling error \(err = \max(x_1,x_2) - \max(S_x)\).
    2. Find the PDF \(f_{err}(\cdot)\) for the sampling distribution of the sampling error \(err\).

GOAL #9: Calculate bias and classify estimators as biased or unbiased

  1. Suppose we have the data set described in question 1 above. Classify each of the following estimators as biased or unbiased, and calculate the bias:

    1. The sample frequency \(\hat{f}_1\) as an estimator of the probability \(\Pr(x_i=1)\).
    2. The sample average \(\bar{x}\) as an estimator of the population mean \(E(x_i)\)
    3. The sample variance \(sd_x^2\) as an estimator of the population variance \(var(x_i)\)
    4. The sample standard deviation \(sd_x\) as an estimator of the population standard deviation \(sd(x_i)\)
    5. The sample minimum \(xmin\) as an estimator of the population minimum \(\min(S_x)\).
    6. The sample maximum \(xmax\) as an estimator of the population maximum \(\max(S_x)\).
  2. Suppose we are interested in the following parameters:

    • The average earnings of Canadian men: \(\mu_M\).
    • The average earnings of Canadian women: \(\mu_W\).
    • The male-female earnings gap in Canada: \(\mu_M - \mu_W\).
    • The male-female earnings ratio in Canada: \(\mu_M/\mu_W\).

    and we have calculated the following statistics from a random sample of Canadians:

    • The average earnings of men in our sample \(\bar{y}_{M}\).
    • The average earnings of women in our sample \(\bar{y}_{W}\).
    • The male-female earnings gap in our sample \(\bar{y}_{M} - \bar{y}_{W}\).
    • The male-female earnings ratio in our sample \(\bar{y}_{M}/\bar{y}_{W}\).

    We already know that \(\bar{y}_{M}\) is an unbiased estimator of \(\mu_M\) and \(\bar{y}_{W}\) is an unbiased estimator of \(\mu_W\).

    1. Is the sample earnings gap \(\bar{y}_M - \bar{y}_W\) a biased or unbiased estimator of the population gap \(\mu_M - \mu_W\)? Explain.
    2. Is the sample earnings ratio \(\bar{y}_M/\bar{y}_W\) a biased or unbiased estimator of the population earnings ratio \(\mu_M/\mu_W\)? Explain.

GOAL #10: Calculate the mean squared error of an estimator

  1. Suppose we have the data set described in question 1 above. Calculate the mean squared error for:
    1. The sample frequency \(\hat{f}_1\) as an estimator of the probability \(\Pr(x_i=1)\).
    2. The sample average \(\bar{x}\) as an estimator of the population mean \(E(x_i)\).
    3. The sample minimum \(xmin\) as an estimator of the population minimum \(\min(S_x)\).
    4. The sample maximum \(xmax\) as an estimator of the population maximum \(\max(S_x)\).

GOAL #11: Apply MVUE and MSE criteria to select an estimator

  1. Suppose you have a random sample of size \(n=2\) on the random variable \(x_i\) with mean \(E(x_i)=\mu\) and variance \(var(x_i)=\sigma^2\). Two potential estimators of \(\mu\) are the sample average: \[\begin{align} \bar{x} = \frac{x_1 + x_2}{2} \end{align}\] and the last observation: \[\begin{align} x_2 \end{align}\]
    1. Are these estimators biased or unbiased?
    2. Find \(var(\bar{x})\).
    3. Find \(var(x_2)\).
    4. Find \(MSE(\bar{x})\).
    5. Find \(MSE(x_2)\).
    6. Which estimator is preferred under the MVUE criterion?
    7. Which estimator is preferred under the MSE criterion?

GOAL #12: Calculate the standard error for a sample average

  1. Suppose that we have a random sample \(D_n\) of size \(n=100\) on the random variable \(x_i\) with unknown mean \(\mu\) and unknown variance \(\sigma^2\). Suppose that the sample average is \(\bar{x} = 12\) and the sample variance is \(sd^2 = 4\). Find the standard error of \(\bar{x}\).

GOAL #13: Explain the law of large numbers and what it means for an estimator to be consistent

  1. Suppose we have a random sample of size \(n\) on the random variable \(x_i\) with mean \(E(x_i) = \mu\). Which of the following statistics are consistent estimators of \(\mu\)?
    1. The sample average \(\bar{x}\)
    2. The sample median.
    3. The first observation \(x_1\).
    4. The average of all even-numbered observations.
    5. The average of the first 100 observations.

  1. To get an idea what I mean by “impractical”, the support of \(D_n\) has \(|S_x|^n\) possible values, where \(|S_x|\) is the number of values in the support of \(x_i\) and \(n\) is the sample size. If our roulette data set had results from 100 games (this would still be a fairly small sample in economics), this would mean we need to calculate the joint PDF for \(2^{100} \approx 1.3 \times 10^{30}\) (a 31-digit number) different combinations.

  2. As a technical matter, the assumption of independence requires that we sample with replacement. This means we allow for the possibility that we sample the same case more than once. In practice this doesn’t matter as long as the sample is small relative to the population.