## 20.1 Sampling distribution: Known proportion

Suppose a fair, six-sided
die^{8}
is rolled 25 times.
What proportion of the rolls will produce an even number?
That is,
what will be the *sample proportion* of even numbers?

No-one knows exactly what will happen for *any* individual roll,
so no-one knows what proportion will be even for any set of 25 rolls.

In addition,
the proportion of the 25 rolls that will be even
will not be the same for every set of 25 rolls.
The sample proportion will *vary*:
there is *sampling variation*.

Describing *how* the sample proportion
varies from sample to sample is useful (Sect. 15.4.2).

To do this,
statistical theory could be used…
or thousands of repetitions of a set of 25 rolls could be performed…
or a computer could *simulate* many sets of 25 rolls.

Let’s simulate rolling a die 25 times, using just 10 sets of 25 rolls; see the animation below.

The proportion of rolls that is even varies from set to set. For these 10 sets of \(n=25\) rolls, the percentage of even rolls ranged from \(\hat{p}=0.32\) even rolls to \(\hat{p}=0.60\) even rolls.

The sample proportion of even rolls would
be expected to vary around \(p=0.5\),
since three of the six faces of the die are even numbers
(the *population proportion*),
using the classical approach to probability.

Of course, the sample proportion could be very small or very high by chance, but we wouldn’t expect to see that very often.

In this example,
the *population proportion* of even rolls
is known to be \(p=0.5\).
Each set of \(n=25\) rolls is a *sample* of all possible sets of \(n=25\) rolls,
and the *sample* proportion of even rolls is denoted by \(\hat{p}\).

For any set of 25 rolls,
the value of \(\hat{p}\) will be unknown until we roll the die.
The proportion of even rolls is likely to vary from sample to sample;
that is,
the sample proportions exhibit sampling variation,
and the *amount* of sampling variation is quantified
using a *standard error*.

*population*proportion, and \(\hat{p}\) refers to the

*sample*proportion.

Suppose a fair die was rolled 25 times,
and this was repeated *thousands* of times
(not just 10 sets times, as in
the animation
above),
and the proportion of even rolls was recorded
for every set of 25 rolls.

These thousands of sample proportions \(\hat{p}\), one from every set of 25 rolls, could be graphed using a histogram; see the animation below.

The shape of the histogram is roughly a normal distribution. This is no accident: statistical theory says this will happen (when certain conditions are met: see Sect. 20.6).

The possible values of the *sample* proportion \(\hat{p}\)
have a *sampling distribution*
which is roughly a normal distribution;
the mean and standard deviation of the normal distribution
the animation above
can even be determined.
The possible values of the *sample* proportion \(\hat{p}\)
will have a *sampling distribution*,
described by:

- an approximate normal distribution;
- centred around a mean of \(p=0.5\);
- with a standard deviation of 0.1 (where this number comes from will be revealed soon).

This distribution is called a *sampling distribution*,
as discussed in Sect. 18.1.
The standard deviation of the sampling distribution
is called a *standard error*,
since it measures how much a sample statistic
(in this case, a sample proportion \(\hat{p}\))
varies from sample to sample.

Since the variation in the sample proportions can be described,
a picture of this normal distribution can be drawn
(Fig. 20.1).
We still don’t know *exactly* what we’ll find next roll…
but we have some idea of *how* the sample proportion
is likely to vary in sets of 25 rolls.

*proportions*, but a

*mean*and

*standard deviation*are used to describe the

*sampling distribution*.

The value of \(p\)
(the *population* proportion:
the proportion of even numbers on the die)
remains the same,
but the value of \(\hat{p}\)
(the *sample* proportion:
the proportion of even numbers in the sample of 25 rolls)
is not the same in every set of 25 rolls.
That is,
\(\hat{p}\) varies,
and exhibits
*sampling variation*.
The variation in \(\hat{p}\) from sample to sample
is measured by the *standard error of the sample proportion*,
written as \(\text{s.e.}(\hat{p})\).

In general,
the **standard error for a sample proportion**
when \(p\) is known is
given by

\[\begin{equation} \text{s.e.}(\hat{p}) = \sqrt{\frac{p \times (1-p)}{n}}, \tag{20.1} \end{equation}\] where \(n\) is the number of rolls, and \(p\) is the population proportion. For this example, there are \(n = 25\) rolls of a die, and the population proportion of even rolls is \(p=0.5\). Then, the standard error of the sample proportion is

\[\begin{equation} \text{s.e.} (\hat{p}) = \sqrt{\frac{0.5 \times (1-0.5)}{25}} = 0.1. \tag{20.2} \end{equation}\]

This standard deviation is the standard deviation of the normal distribution in Fig. 20.1.

Recall that the
*the standard error is just a special standard deviation*,
that
measures how much a sample estimate is likely to vary
from sample to sample.
In that sense,
the standard error of the proportion measures how precisely \(\hat{p}\)
estimates the population proportion \(p\).

Almost always, the value of \(p\) is unknown, so when \(\text{s.e.}(\hat{p})\) is computed, the value of \(p\) can’t be used. Instead, the best available estimate of \(p\) is used, which is \(\hat{p}\). This situation is studied from Sect. 20.3 onwards.

**Definition 20.1 (Sampling distribution of a sample proportion when \(p\) is known) **When the value of \(p\) is

*known*, the

*sampling distribution of the sample proportion*is described by

- an approximate normal distribution,
- centred around a mean of \(p\),
- with a standard deviation (called the
*standard error*of \(\hat{p}\)) of

\[ \text{s.e.}(\hat{p}) = \sqrt{\frac{ p \times (1-p)}{n}}, \]

when certain conditions are met, where \(n\) is the size of the sample, and \(p\) is the population proportion.

In general, the approximation gets better as the sample size gets larger.**Think 20.1 (Values for \(\hat{p}\)) **From the die example,
the values of \(\hat{p}\) will vary

- with an approximate normal distribution;
- centred around \(p=0.5\); and
- with a standard error of approximately \(\text{s.e.}({\hat{p}}) = 0.1\).

Note the language: two

*dice*, but one*die*.*Dice*is plural.↩︎