Lecture 1 Sample-Based Estimation of Population Parameters

1.1 Why this matters

What do the following, seemingly unrelated, questions have in common?

(i) What is the conversion rate of a trial promotion if extended to all qualifying customers?
(ii) What is the defect rate of a large batch of scratch-resistant glass supplied to a smartphone manufacturer?
(iii) What was the unemployment rate in the United States last month?
(iv) What fraction of the U.S. population was infected with COVID-19 during the first 3 months of the pandemic?
(v) What is the average claim amount of a COVID-19 hospitalization in the U.S. based on the first 3 months of the pandemic?
(vi) What was the average change in Earth’s global surface temperature over the past 50 years?
(vii) What is the volatility of monthly returns of a particular stock?

What these questions have in common is that answering any of them requires sample-based estimation of a population parameter.

Consider, for example, the unemployment rate (the parameter) reported monthly by the U.S. Bureau of Labor Statistics. It is not a full count, or census, of the entire labor force (the population). Rather, it is based on two samples: a household survey of approximately 60,000 households and an establishment survey of approximately 130,000 businesses and government agencies.¹ A complete census is simply impractical.

We resort to sample-based estimation when it is impractical, or impossible, to obtain pertinent data on each and every person or item in a population of interest. In particular, population data would be impossible to obtain if the observations of interest lie in the future or if a measurement, say of scratch-resistant glass, would destroy the item tested.

Reliable estimates of population parameters have clear decision-relevance: informing whether or how to launch a promotion, accept a supply batch, modify an economic or public health policy, price a new product, address global warming, or invest in stocks. Fortunately, simple and exceedingly useful statistical theory exists that, when combined with good sampling practices, provides reliable answers to many sample-based estimation problems that arise in business contexts. Let’s see and understand how.

1.2 Sample statistics and population parameters

Each question listed in the preceding section asks for the value of a specific parameter summarizing a specific measurement for each person or item in a well-defined population.

Consider again the unemployment rate question, Question (iii). The population consists of all individuals in the U.S. civilian labor force (those aged 16 or older who are either employed or are available for work and actively looking for employment). The measurement is a simple binary classification: 1 if the individual is unemployed and 0 if the individual is employed. The parameter in question is the proportion of unemployed individuals.

In Question (v), the population consists of all COVID-19 hospitalizations in the U.S. during a given time period undergone by members of a given insurance plan. The measurement is the dollar amount of each claim. The parameter is the mean value of these claims.

Note that when the measurement is binary, as in Questions (i)-(iv), asking for a proportion of an outcome is the same as asking for the mean of the binary measurement where 1 denotes the outcome.
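The equivalence between a proportion and the mean of a binary measurement can be checked on a small example. The ten values below are made up purely for illustration and are not survey data.

```python
from statistics import mean

# Hypothetical binary measurements: 1 = unemployed, 0 = employed.
labor_force_sample = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

# Proportion: count the outcome of interest and divide by the sample size.
proportion_unemployed = labor_force_sample.count(1) / len(labor_force_sample)

# Mean: average the 0/1 measurements directly.
mean_of_measurements = mean(labor_force_sample)

print(proportion_unemployed, mean_of_measurements)
```

Both calculations give the same number, which is why the rest of the lecture can treat proportions as a special case of means.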

In Question (vii), the population can be viewed as all months, past and future, in which a particular stock is traded. The measurement is the return of that stock during that month. The parameter can be the standard deviation of these returns (a common measure of volatility).

The essential first step in sample-based parameter estimation is to clarify the question being asked by carefully defining the target population, the precise measurement associated with each element of the population, and the summary parameter of interest. ‘What would we do with this estimate?’ is a good clarifying question.

A sample statistic is the value of the parameter of interest calculated on the sample data. It is the unemployment rate of the 60,000 households of the household survey or the 130,000 businesses and government agencies in the establishment survey. It is a proxy, or point estimate, of the population parameter.

1.3 The concept of an interval estimate

The simplistic approach to estimating a population parameter is to draw a sample from the population, calculate the sample statistic, and use that statistic as an estimate of the population parameter. But how reliable is such an estimate? Often, not very reliable. Assuming that a sample statistic equals a population parameter is betting on a lucky draw.

Consider a product promotion that costs $1 per customer and generates $40 in revenue if a customer converts. In a trial promotion - a sample - the conversion rate was 3%. Would you proceed with the promotion based on the 3% sample statistic? What if you learn that the uncertainty associated with this sample statistic is such that you can only be confident that the population conversion rate is between 1.25% and 4.75%?
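A quick break-even calculation shows why the range matters. The sketch below assumes, for simplicity, that the $1 cost and the $40 revenue are the only cash flows involved; the function and variable names are illustrative.

```python
cost_per_customer = 1.0        # dollars spent on every customer targeted
revenue_per_conversion = 40.0  # dollars earned only when a customer converts

def expected_profit_per_customer(conversion_rate):
    """Expected profit per targeted customer at a given conversion rate."""
    return conversion_rate * revenue_per_conversion - cost_per_customer

# The promotion breaks even when rate * 40 = 1, that is, at a 2.5% conversion rate.
breakeven_rate = cost_per_customer / revenue_per_conversion

# Evaluate the low end of the range, the point estimate, and the high end.
for rate in (0.0125, 0.03, 0.0475):
    profit = expected_profit_per_customer(rate)
    print(f"conversion rate {rate:.2%}: expected profit {profit:+.2f} per customer")
```

Under these assumptions the promotion is profitable at the 3% point estimate but loses money at the low end of the range, because the break-even rate of 2.5% lies inside the range.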

The 3% sample statistic is a take-it-or-leave-it point estimate. What we really need is an interval estimate that quantifies the degree of confidence and precision of the estimate.

An example of an interval estimate is: 3% ± 1.75% with 95% confidence. This is also referred to as a confidence interval. It says that we are 95% confident that the population parameter is somewhere between 1.25% and 4.75%. How we come up with such an interval - the recipe or formula - is covered in Section 1.5. For now, you just need to appreciate that the width of this interval (3.5 percentage points) reflects the precision of the sample-based estimate and the 95% reflects our level of confidence in that degree of precision.

But what exactly does this mean? It means that the recipe we used to construct that interval comes with a guarantee: if we were to repeat the sampling process many times on random samples, using the same recipe to construct the interval associated with each sample, then 95% of the intervals constructed would be ‘correct’ in the sense that they would contain the true unknown population parameter.² In the remaining 5% of the cases, the true population parameter would lie outside the interval. You don’t know whether this particular 3% ± 1.75% interval is correct or not. But intervals built this way are correct 95% of the time, and it is in this sense that we are 95% confident. You will get a deeper understanding of this statement, and how we can possibly arrive at it, when we outline the underlying theory shortly.
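The repeated-sampling interpretation can be checked by simulation. The sketch below is an illustration under stated assumptions: it invents a population with a known proportion (`true_p = 0.3`), draws many random samples of size 500, builds an interval from each sample using the proportion formula given in Section 1.5 with the factor 1.96, and counts how often the interval contains the true value. All names and parameter values are hypothetical.

```python
import random
from math import sqrt

def interval_covers(true_p, n, rng, q=1.96):
    """Draw one random sample of size n and report whether its interval contains true_p."""
    successes = sum(rng.random() < true_p for _ in range(n))
    p_hat = successes / n
    margin = q * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin <= true_p <= p_hat + margin

def coverage_rate(true_p=0.3, n=500, repetitions=2000, seed=0):
    """Fraction of repeated samples whose interval contains the true proportion."""
    rng = random.Random(seed)
    hits = sum(interval_covers(true_p, n, rng) for _ in range(repetitions))
    return hits / repetitions

print(coverage_rate())
```

The printed fraction should land close to 0.95. It will not be exactly 0.95, both because the number of repetitions is finite and because the recipe itself is an approximation.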

At this point it is important to understand that any interval estimate must simultaneously specify both a degree of precision and a level of confidence. We cannot claim one without the other. We can be highly confident in a highly imprecise estimate or we can be very precise with zero confidence. To take a trivial example, we are 100% confident that the proportion is somewhere between 0% and 100%! That’s not a very useful interval estimate. We are also, strictly speaking, 0% confident that our point estimate of 3% is exactly correct. It is common practice to seek interval estimates with 95% confidence (although specific applications may justify other levels of confidence). We can then use statistical theory to calculate the corresponding precision given the information contained in the sample data.

In summary, interval estimates are much more useful than point estimates in managerial decision-making because they guard against unjustified confidence in imprecise estimates.

1.4 Random samples

In order to construct interval estimates with quantifiable precision and confidence guarantees, it is essential that the sample drawn is random. This means that the data collection process gives every subset of the population of the same size as the selected sample an equal chance of being selected. If the data collection process exhibits any bias, then such bias would generally be very difficult to quantify and hence to correct.

Consider again the example of acceptance sampling of a supply batch of scratch-resistant glass. Drawing the sample from the “top” of the batch, for instance, is convenient but not random and might be biased by manufacturing sequence, transportation issues, or even deliberate placement of defective items at the bottom of the batch. Instead, a computer-generated random sample of the serial numbers of the batch items should determine which items are to be inspected.
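A minimal sketch of such a computer-generated selection follows. The serial-number format, the batch size of 10,000, and the sample size of 200 are assumptions made for illustration only.

```python
import random

def select_items_to_inspect(serial_numbers, sample_size, seed=None):
    """Pick sample_size distinct serial numbers, each subset being equally likely."""
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible for audit
    return rng.sample(serial_numbers, sample_size)

# Hypothetical batch of 10,000 items with serial numbers SN00001 ... SN10000.
batch = [f"SN{i:05d}" for i in range(1, 10001)]
to_inspect = select_items_to_inspect(batch, sample_size=200, seed=42)

print(to_inspect[:5])
```

Because `random.sample` draws without replacement, no item is selected twice, and the position of an item in the batch has no influence on its chance of being inspected.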

Random sampling is a key assumption underlying all of statistical sampling theory. Despite its conceptual simplicity, the ideal of random sampling can present practical challenges in many business settings. We discuss in Section 1.8 common root factors for these practical challenges to random sampling. It suffices to mention here that, in practice, we seldom have full control over the data collection process.

Be aware that human judgment of what constitutes randomness may not be reliable. Truly random samples are perhaps more ‘lumpy’ and exhibit greater variability across samples than intuition suggests. The panel below illustrates different actual random samples of size 100 drawn from a population of 50% blue tiles and 50% gray tiles. Note how the sample statistic fluctuates around the 50% population parameter. Randomness provides a probabilistic anchor that enables the construction of confidence intervals, as we discuss in the next sections.
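The fluctuation of the sample statistic can also be reproduced numerically. The sketch below assumes a population large enough that each tile drawn is blue with probability 0.5 independently of the others; the number of samples (8) and the seed are arbitrary choices.

```python
import random

def sample_blue_share(sample_size=100, blue_share=0.5, rng=random):
    """Share of blue tiles in one random sample from a 50/50 population."""
    draws = [rng.random() < blue_share for _ in range(sample_size)]
    return sum(draws) / sample_size

rng = random.Random(1)
shares = [sample_blue_share(rng=rng) for _ in range(8)]

# Each value is the sample statistic of a different random sample of 100 tiles.
print(shares)
```

The shares typically scatter by several percentage points around 50%, which is the variability that intuition tends to underestimate.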

1.5 Constructing confidence intervals

We focus on confidence intervals when the population parameter is the mean (including the special case of proportion, which is the mean of binary measurements).

The formula for constructing such confidence intervals, centered around the sample mean, is as follows:

$\bar{x} \pm q_{cl} \, \frac{\sigma}{\sqrt{n}}$

where $\bar{x}$ is the sample mean, $\sigma$ is the population standard deviation, $n$ is the sample size, and $q_{cl}$ is a ‘confidence factor’ related to the desired confidence level.

The relationship between the confidence factor, $q_{cl}$, and the confidence level, $cl$, is captured in the following graph. The most commonly used factor is $1.96$, corresponding to a $95\%$ confidence level. Other common confidence levels are $90\%$ and $99\%$.
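The factors for other confidence levels can be computed directly. The sketch below assumes that $q_{cl}$ is the two-sided quantile of the standard normal distribution, an assumption that is consistent with the value 1.96 at the 95% level; the function name is illustrative.

```python
from statistics import NormalDist

def confidence_factor(confidence_level):
    """Two-sided standard normal quantile for a given confidence level (assumed form of q_cl)."""
    tail = (1 - confidence_level) / 2  # probability left in each tail
    return NormalDist().inv_cdf(1 - tail)

for cl in (0.90, 0.95, 0.99):
    print(f"confidence level {cl:.0%}: factor {confidence_factor(cl):.3f}")
```

Under this assumption the factors are roughly 1.645, 1.960, and 2.576, so demanding more confidence widens the interval.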

The term $q_{cl} \frac{\sigma}{\sqrt{n}}$ is called the margin-of-error and quantifies the precision of the estimate. The margin-of-error, as expected and as confirmed by the formula, depends on the desired confidence level, the size of the sample, and the variability inherent in the population.

To apply the formula, we clearly need an approximation of the population standard deviation, $\sigma$, that can be calculated directly from sample data. Replacing $\sigma$ by the sample standard deviation, $s$, yields the following approximate confidence interval. The approximation error is typically small when the sample size is sufficiently large (greater than 100):

$\bar{x} \pm q_{cl} \, \frac{s}{\sqrt{n}}$
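As an illustration, the formula can be applied to a sample in a few lines. The claim amounts below are synthetic numbers generated for the example (400 values centered near $20,000) and are not real claims data; the function name is hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev

def mean_confidence_interval(sample, q=1.96):
    """Approximate confidence interval for the population mean: x_bar +/- q * s / sqrt(n)."""
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)          # sample standard deviation, standing in for sigma
    margin = q * s / sqrt(n)   # margin of error
    return x_bar - margin, x_bar + margin

# Synthetic claim amounts in dollars, for illustration only.
rng = random.Random(7)
claims = [rng.gauss(20000, 5000) for _ in range(400)]

low, high = mean_confidence_interval(claims)
print(f"95% confidence interval for the mean claim: {low:,.0f} to {high:,.0f}")
```

Quadrupling the sample size halves the margin of error, since $n$ enters the formula through $\sqrt{n}$.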

In the case where measurements are binary and we seek a confidence interval around a population proportion, $p$, the population standard deviation, $\sigma$, is simply $\sqrt{p(1-p)}$. An approximation in terms of the sample proportion, $\bar{p}$, is $\sqrt{\bar{p}(1-\bar{p})}$, yielding the following formula for the confidence interval of a population proportion:

$\bar{p} \pm q_{cl} \, \frac{\sqrt{\bar{p} (1 - \bar{p})}}{\sqrt{n}}$

A special feature of the above formula is that the term $\sqrt{\bar{p} (1 - \bar{p})}$ has a maximum value of $0.5$, corresponding to $\bar{p} = 0.5$. Thus a conservative calculation of the confidence interval for a population proportion is $\bar{p} \pm q_{cl} \, \frac{0.5}{\sqrt{n}}$. This is the formula often used in election polls.³ For example, an election poll based on a sample size of 1,067 randomly selected likely voters has a margin of error of $\pm 3\%$ at the $95\%$ confidence level.
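The polling figure can be verified with a short calculation. The sketch below implements both the sample-based margin and the conservative margin for a proportion; the function names are illustrative, and the 3% sample proportion in the last line is chosen only to show the contrast.

```python
from math import sqrt

def proportion_margin(p_hat, n, q=1.96):
    """Margin of error based on the sample proportion p_hat."""
    return q * sqrt(p_hat * (1 - p_hat)) / sqrt(n)

def conservative_margin(n, q=1.96):
    """Worst-case margin of error, using the maximum value 0.5 of sqrt(p_hat * (1 - p_hat))."""
    return q * 0.5 / sqrt(n)

# Election poll of 1,067 likely voters at the 95% confidence level.
print(f"conservative margin: {conservative_margin(1067):.2%}")

# With a sample proportion far from 50%, the sample-based margin is much tighter.
print(f"margin at a 3% sample proportion: {proportion_margin(0.03, 1067):.2%}")
```

The conservative margin depends only on the sample size and the confidence factor, which is why pollsters can quote it before any responses are collected.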

If you haven’t noted this already, the above confidence intervals do not depend on the population size! The same sample size will yield the same margin-of-error at the same confidence level regardless of whether the poll is conducted in a large population, say in the United States, or a much smaller population, say in Iceland. This is true because we have implicitly assumed that the sample size is small relative to the population. Some intuition is provided in the next section but a deeper understanding emerges in Section 1.7.

1.6 Understanding confidence intervals

TO BE COMPLETED.

1.7 The theory underlying confidence intervals

TO BE COMPLETED.

1.8 Non-sampling sources of error

TO BE COMPLETED.

Customers are free to choose whether or not to respond to a survey, and those who choose to respond may not be a representative random sample. Similarly, past data may not be a random sample of past and future data.

1.9 Review

TO BE COMPLETED.
