So far, you have learnt to ask a RQ, identify different ways of obtaining data, design the study, collect the data, describe the data, summarise data graphically and numerically, and understand the tools of inference. The previous chapter discussed forming a confidence interval (CI) for one proportion. We will also study CIs in other contexts.

The following applies to all CIs:

• CIs are formed for the unknown population parameter (such as the population proportion $$p$$), based on a sample statistic (such as the sample proportion $$\hat{p}$$).
• Over repeated sampling, a stated percentage (such as 95%) of the CIs computed in this way contain the value of the population parameter.
• Loosely speaking, this is usually interpreted as the CIs giving an interval which is likely to straddle the value of the unknown population quantity. That is, the CI gives an interval of plausible values of the population parameter that may have produced the observed sample statistic.
• Most CIs have the form $$\text{Statistic} \pm \overbrace{(\text{Multiplier} \times \text{standard error})}^{\text{Called the 'margin of error'}}.$$
• The multiplier is approximately 2 for a 95% CI (from the 68--95--99.7 rule).
• The margin of error is $$(\text{Multiplier} \times \text{standard error})$$.
• The statistical conditions should always be checked to see if the CI is (at least approximately) statistically valid.
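The general form above can be sketched in a few lines of Python. This is only an illustration: the function names `approx_ci` and `se_proportion` are made up for this sketch, the sample values are hypothetical, and the standard error of a sample proportion is assumed to be the usual $$\sqrt{\hat{p}(1 - \hat{p})/n}$$.

```python
import math

def approx_ci(statistic, standard_error, multiplier=2.0):
    """Return (lower, upper) for: statistic ± (multiplier × standard error)."""
    margin_of_error = multiplier * standard_error
    return statistic - margin_of_error, statistic + margin_of_error

def se_proportion(p_hat, n):
    """Standard error of a sample proportion (assumed formula, see lead-in)."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical sample (not from any example in this book): n = 100, p-hat = 0.5
p_hat = 0.5
n = 100
se = se_proportion(p_hat, n)

# A multiplier of 2 gives an approximate 95% CI
lower, upper = approx_ci(p_hat, se)
print(f"Approximate 95% CI: {lower:.3f} to {upper:.3f}")
```

The same `approx_ci` function would apply to any statistic with a known standard error; only the standard-error calculation changes between contexts.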

## 21.2 Interpretation of a CI

Interpreting CIs correctly is tricky. The correct interpretation (Definition 20.3) of a 95% CI is the following:

If samples were repeatedly taken many times, and the 95% confidence interval computed for each sample, 95% of these confidence intervals formed would contain the population parameter.

This is the idea shown in the animation in Sect. 20.4. In practice, this definition is unsatisfying, since we almost always have only one sample. And since the value of the parameter is unknown (after all, we went to the bother of taking a sample so we could estimate the value of the parameter), we don't know if our CI includes the population parameter or not.
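The repeated-sampling idea can be explored with a small simulation, which is only possible because the simulation fixes a value of $$p$$ that would be unknown in a real study. The sketch below is illustrative rather than definitive: the value $$p = 0.2$$, the sample size, the number of samples and the function names are all assumptions made for this sketch, and the standard error is assumed to be $$\sqrt{\hat{p}(1 - \hat{p})/n}$$.

```python
import math
import random

def approx_95_ci(p_hat, n):
    """Approximate 95% CI for a proportion, using a multiplier of 2."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - 2 * se, p_hat + 2 * se

def coverage(p, n, num_samples, seed=1):
    """Proportion of simulated CIs that contain the (known) value of p."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(num_samples):
        # Draw one simple random sample of size n and count the 'successes'
        successes = sum(rng.random() < p for _ in range(n))
        lower, upper = approx_95_ci(successes / n, n)
        if lower <= p <= upper:
            hits += 1
    return hits / num_samples

# Hypothetical population proportion, chosen only for the simulation
rate = coverage(p=0.2, n=500, num_samples=2000)
print(f"Proportion of CIs containing p: {rate:.3f}")
```

The printed proportion should be close to 0.95, though not exactly 0.95, since the multiplier of 2 is approximate and the simulation itself is based on a finite number of samples.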

A reasonable alternative interpretation is:

The interval gives a range of values of the parameter that could plausibly (with 95% confidence) have given rise to our observed value of the statistic.

Or we might say that:

There is a 95% chance that our computed CI straddles the value of the population parameter.

These alternatives are not absolutely correct, but are reasonable interpretations.

Many people will write---and you will see it written in many places---that the CI means that there is a 95% chance that the CI contains the population parameter. This is not strictly correct, but is common (probably because it is easier to understand).

I use this analogy: Most people say the sun rises in the east. This is incorrect: the sun doesn't rise at all. It appears to rise in the east because the earth rotates on its axis. But almost everyone says that the 'sun rises in the east', and for most circumstances this is fine and serviceable, even though technically incorrect.

Similarly, most people use the final interpretation above for a CI in practice, even though it is technically incorrect.

Example 21.1 (Energy drinks in Canadian youth) In Example 20.1, the approximate 95% CI was from 0.192 to 0.236. The correct interpretation is:

If we took many samples of 1516 Canadian youth, and computed the approximate 95% CI for each one, about 95% of those CIs would contain the population proportion.

We don't know if our CI includes the value of $$p$$, however. We might say:

This 95% CI is likely to straddle the actual value of $$p$$.

or

The range of values of $$p$$ that could plausibly (with 95% confidence) have produced $$\hat{p} = 0.214$$ is between 0.192 and 0.236.

In practice, the CI is usually interpreted as saying:

There is a 95% chance that the population proportion of Canadian youth who have experienced sleeping difficulties after consuming energy drinks is between 0.192 and 0.236.

This is not strictly correct, but is commonly used.

In Example 20.2 about koalas crossing roads, the approximate 95% CI was from 0.130 to 0.209.

What is the correct interpretation of this CI?

## 21.3 Validity and confidence intervals

When constructing confidence intervals, certain statistical validity conditions must be true; these ensure that the sampling distribution is sufficiently close to a normal distribution for the 68--95--99.7 rule to apply.

If these conditions are not met, the sampling distribution may not be normally distributed, so the 68--95--99.7 rule (on which the CI is based) may be inappropriate, and hence the CI itself may also be inappropriate.
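The link between the 68--95--99.7 rule and the multiplier can be checked numerically for an exactly normal distribution. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `area_within` is made up for this illustration.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution: mean 0, standard deviation 1

def area_within(k):
    """Area under the normal curve within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

# The 68--95--99.7 rule: areas within 1, 2 and 3 standard deviations
for k in (1, 2, 3):
    print(f"Within {k} standard deviation(s): {area_within(k):.4f}")

# The multiplier that captures exactly the central 95% (about 1.96),
# which is why 2 is described as an approximate multiplier
exact_multiplier = z.inv_cdf(0.975)
print(f"Exact 95% multiplier: {exact_multiplier:.2f}")
```

These areas only apply when the sampling distribution really is (approximately) normal, which is what the statistical validity conditions are meant to ensure.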

In addition to the statistical validity conditions, the internal validity and external validity of the study should also be discussed (Fig. 21.1).

Regarding external validity, all the CI computations in this book assume a simple random sample. If the sample is from a random sampling method, but not from a simple random sample, then methods exist for producing CIs that are externally valid, but are more complicated than those described in this book.

If the sample is a non-random sample, then the CI may be reasonable for the quite specific population that is represented by the sample; however, the sample probably does not represent the more general population that is probably intended.

External validity requires that a study is also internally valid. Internal validity can only be discussed if details are known about the study design.

FIGURE 21.1: Four types of validities for studies.

CIs also require that the sample size is less than 10% of the population size; however, this is almost always the case.

## 21.4 Quick revision exercises

1. True or false: CIs always have 95% confidence.
2. True or false: The statistical validity conditions concern external validity.
3. True or false: CIs give intervals in which the value of a population parameter will fall.
4. True or false: All other things being equal, a 95% CI is wider than a 90% CI.
5. What is the missing term: The 'multiplier times the standard error' is called the ?????.
6. What is the missing word: A CI gives an interval within which we are fairly sure the value of the ????? lies.

## 21.5 Exercises

Selected answers are available in Sect. D.20.

Exercise 21.1 A researcher was computing a 95% CI for a single proportion to estimate the proportion of trees with apple scab, and found that $$\hat{p} = 0.314$$ and $$\text{s.e.}(\hat{p}) = 0.091$$.

What is wrong with the following conclusion that the researcher made?

The approximate 95% CI for the sample proportion is between 0.223 and 0.405.

Exercise 21.2 A researcher was computing a 95% CI for a single proportion to estimate the proportion of trees with apple scab, and found that $$\hat{p} = 0.314$$ and $$\text{s.e.}(\hat{p}) = 0.091$$.

What is wrong with the following conclusion that the researcher made?

This CI means we are 95% confident that between 22.3 and 40.5 trees are infected with apple scab.