17 Distributions and models
So far, you have learnt to ask a RQ, identify different ways of obtaining data, design the study, collect the data, describe the data, summarise data graphically and numerically, and understand the decision-making process.
In this chapter, you will learn about distributions and models to describe the distribution of populations and samples. You will learn to:
- describe distributions.
- describe populations using normal distributions.
- use \(z\)-scores to compute probabilities related to normal distributions.
- use \(z\)-scores to 'work backwards' from probabilities for normal distributions.
17.1 Introduction
In the decision-making process used in statistics, an assumption is made about the population parameter, and then, based on this assumption, the values expected from the sample statistics can be described.
The expectations about the sample statistic are based around how the statistic (such as a sample mean, or a sample proportion, or a sample odds ratio) is distributed: what values it can take in various samples, and how often.
A model is used to describe this sampling distribution. For example, if I deal 15 cards, the statistic could be 'the proportion of red cards in a hand of 15'. The model would describe how often we would see 0 red cards in 15, 1 red card in 15, 2 red cards in 15, ... up to 15 red cards in 15 (Sect. 15.4).
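As a rough sketch of what such a model looks like, the following Python code computes how often each number of red cards would be expected in a hand of 15. It assumes a standard 52-card deck (26 red, 26 black) dealt without replacement; this is an assumption made here for illustration only, and the name `prob_red_cards` is hypothetical.

```python
from math import comb

DECK_SIZE = 52   # assumed: a standard deck
RED_CARDS = 26   # assumed: half of the deck is red
HAND_SIZE = 15

def prob_red_cards(k):
    """Probability that exactly k of the 15 dealt cards are red."""
    black_cards = DECK_SIZE - RED_CARDS
    ways = comb(RED_CARDS, k) * comb(black_cards, HAND_SIZE - k)
    return ways / comb(DECK_SIZE, HAND_SIZE)

# The model: each possible count of red cards, with its probability.
model = {k: prob_red_cards(k) for k in range(HAND_SIZE + 1)}

for k, p in model.items():
    print(f"{k:2d} red cards in 15 (proportion {k / HAND_SIZE:.2f}): {p:.4f}")
```

Under these assumptions, hands with 7 or 8 red cards are the most common, and hands with 0 or 15 red cards are very rare.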
Under certain circumstances, many different statistics have a similarly-shaped distribution: a bell-shaped (or normal) distribution. We now study this distribution, as it often is the basis for describing what values the statistic can be expected to take, based on the assumption about the population that we begin with.
17.2 Distributions: An example
To begin, consider the heights of all Australian adult males. Clearly, the height of all Australian adult males is unknown: no-one has ever measured, or could ever realistically measure, the height of all Australian adult males. The Australian Bureau of Statistics (ABS), however, takes samples of Australians to compute estimates of the heights and other measurements.
A model could be assumed for the heights of all Australian adult males. This is a theoretical idea that might be a useful description of the heights of Australian adult males in the population. Suppose a model for the heights of Australian adult males is adopted that has:
- a symmetric distribution,
- with a mean height of 175 cm, and
- a standard deviation of 7 cm.
Then, the distribution of the heights of Australian adult males may look like Fig. 17.1. That is, most Australian adult males are between about 168 and 182cm, and very few are taller than 196cm or shorter than 154cm.
This model represents an idealised, or assumed, picture of the histogram of the heights of all Australian adult males in the population. If this model is accurate, the distribution of heights in any sample may be shaped a bit like this, but sampling variation will exist.
Any one sample will look a bit different than this model, but this model captures the general feel of the histogram from many of these samples. For example, see the animation below, where many samples of \(n=100\) men are taken.
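Sampling variation can also be explored with a short simulation. The sketch below draws five samples of \(n=100\) heights from the assumed model (normal, mean 175 cm, standard deviation 7 cm) using Python's standard-library `statistics.NormalDist`. The seeds and the number of samples are arbitrary choices made here, and the simulated values are not real data.

```python
from statistics import NormalDist, mean, stdev

# The assumed model for heights of Australian adult males (in cm).
heights_model = NormalDist(mu=175, sigma=7)

sample_means = []
for seed in range(5):
    # Each seed gives a different simulated sample of 100 men.
    sample = heights_model.samples(100, seed=seed)
    sample_means.append(mean(sample))
    print(f"sample {seed + 1}: mean = {mean(sample):.1f} cm, "
          f"sd = {stdev(sample):.1f} cm")
```

Each simulated sample has a mean and standard deviation close to, but not exactly equal to, the values in the model.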
The model of heights has approximately a bell-shape: that is, most values are near the average height, but a small number of men are very tall or very short. A bell-shaped distribution is formally called a normal distribution or a normal model. A normal distribution is a way of modelling the population.
A model is a theoretical or ideal concept. In the same way that a model skeleton isn't 100% accurate (wire joins?) and certainly not exactly like your skeleton, it suitably approximates reality. None of us probably have a skeleton exactly like the model, but the model is still useful and helpful.
Likewise, no variable has exactly a normal distribution, but the model is still useful and helpful. The model is a theoretical way of describing the distribution in the population.
17.3 Normal distributions
A suitable model for the heights of all Australian adult males may be described (Fig. 17.1) as having:
- An approximately normal shape,
- With a mean height of \(\mu=175\) cm, and
- A standard deviation of \(\sigma=7\) cm.
This model for the heights of Australian adult males is a theoretical idea about the unknown population: it does not represent any particular sample of data. The model can be thought of as an 'average' of the histograms of the data from many samples.
Indeed, if this model turns out to be poor at describing what appears in these many samples, the parameters of the model (that is, the values of \(\mu\) and \(\sigma\)) can be adjusted so the model does describe the sample data well.
In fact, sample evidence suggests that the average height of Australians has been increasing^{364} and so the mean of the model may need to be changed at various times to remain a good model for heights of Australian adult males.
17.4 Standardising (\(z\)-scores)
Since many statistics have a normal distribution (under certain circumstances), the 68--95--99.7 rule can be used to understand the distribution of sample statistics.
Recall that the 68--95--99.7 rule states that, for any normal distribution (Fig. 13.10):
- 68% of values lie within 1 standard deviation of the mean;
- 95% of values lie within 2 standard deviations of the mean; and
- 99.7% of values lie within 3 standard deviations of the mean.
These percentages only depend on how many standard deviations (\(\sigma\)) a value (\(x\)) is from the mean (\(\mu\)). This information can be used to learn about how values are distributed.
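The percentages in the rule are rounded values. As an illustrative check, the sketch below computes the areas within 1, 2 and 3 standard deviations of the mean for a standard normal distribution (mean 0, standard deviation 1), using Python's standard-library `statistics.NormalDist`. The helper name `area_within` is hypothetical.

```python
from statistics import NormalDist

# Standard normal model: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {area_within(k):.4f}")
```

The computed areas are about 0.6827, 0.9545 and 0.9973, which round to the 68%, 95% and 99.7% in the rule.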
Example 17.1 (The 68--95--99.7 rule) Suppose heights of Australian adult males have a mean of \(\mu=175\)cm, and a standard deviation of \(\sigma=7\)cm, and (approximately) follow a normal distribution. Using this model, what proportion of Australian adult men are taller than 182cm?
Drawing the situation is helpful (Fig. 17.2). Notice that \(175 + 7 = 182\)cm is one standard deviation above the mean. We know that 68% of values are within one standard deviation of the mean, so that 32% are outside that range (smaller or larger) (Fig. 17.2). Since the distribution is symmetric, half of this 32% is in each tail. Hence, 16% are taller than one standard deviation above the mean, so the answer is about 16%. (Another 16% are more than one standard deviation below the mean; that is, less than \(175 - 7 = 168\)cm in height.)
Again, the percentages only depend on how many standard deviations (\(\sigma\)) the value (\(x\)) is from the mean (\(\mu\)), and not the actual values of \(\mu\) and \(\sigma\).
Example 17.2 (The 68--95--99.7 rule) Suppose heights of Australian adult males have a mean of \(\mu=175\)cm, and a standard deviation of \(\sigma=7\)cm, and (approximately) follow a normal distribution. Using this model, what proportion are shorter than 161cm? Again, drawing the situation is helpful (Fig. 17.3).
Since \(175 - (2\times 7) = 161\), then 161cm is two standard deviations below the mean. Since 95% of values are within two standard deviations of the mean, 5% are outside that range (half smaller, half larger; see Fig. 17.3), so that 2.5% are shorter than 161cm. (Another 2.5% are taller than \(175 + 14 = 189\)cm.)
Again, the percentages only depend on how many standard deviations (\(\sigma\)) the value (\(x\)) is from the mean (\(\mu\)). The number of standard deviations that an observation is from the mean is called a \(z\)-score.
A \(z\)-score is computed using
\[ z = \frac{ x - \mu}{\sigma}. \] Converting values to \(z\)-scores is called standardising.
Definition 17.1 (z-score) A \(z\)-score measures how many standard deviations a value is from the mean. In symbols:
\[\begin{equation} z = \frac{x - \mu}{\sigma}, \tag{17.1} \end{equation}\] where \(x\) is the value, \(\mu\) is the mean of the distribution, and \(\sigma\) is the standard deviation of the distribution.
Example 17.3 (z-scores) In Example 17.1, the \(z\)-score for a height of 182cm is
\[ z = \frac{x-\mu}{\sigma} = \frac{182 - 175}{7} = 1, \] one standard deviation above the mean.
In Example 17.2, the \(z\)-score for a height of 161cm is
\[ z = \frac{x-\mu}{\sigma} = \frac{161 - 175}{7} = -2, \] two standard deviations below the mean (a negative \(z\)-score means the value is below the mean).
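The same calculations can be sketched in Python. The function name `z_score` is a hypothetical helper that simply applies Equation (17.1).

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x is from the mean (Equation 17.1)."""
    return (x - mu) / sigma

# Heights model: mean 175 cm, standard deviation 7 cm.
z_182 = z_score(182, mu=175, sigma=7)
z_161 = z_score(161, mu=175, sigma=7)

print(z_182)  # 1.0: one standard deviation above the mean
print(z_161)  # -2.0: two standard deviations below the mean
```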
The \(z\)-score is the number of standard deviations the observation is away from the mean. The \(z\)-score is also called the standardised value or standard score, and is calculated using Equation (17.1). Note that:
- \(z\)-scores are negative for observations below the mean, and positive for observations above the mean.
- \(z\)-scores are numbers without units (that is, they are not in kg, or cm, etc.).
Example 17.4 (The 68--95--99.7 rule) Consider the model for the heights of Australian adult males: a normal distribution, mean \(\mu=175\), standard deviation \(\sigma=7\) (Fig. 17.1).
Using this model:
- The mean is zero standard deviations from the mean: \(z=0\).
- 168cm and 182cm are one standard deviation from the mean: \(z=-1\) and \(z=1\) respectively.
- 161cm and 189cm are two standard deviations from the mean: \(z=-2\) and \(z=2\) respectively.
- 154cm and 196cm are three standard deviations from the mean: \(z=-3\) and \(z=3\) respectively.
17.5 Approximating areas using the 68--95--99.7 rule
Suppose again that heights of Australian adult males have a mean of \(\mu=175\)cm, and a standard deviation of \(\sigma=7\)cm, and (approximately) follow a normal distribution (Fig. 17.4).
Example 17.5 (Normal distribution areas) Using this model, what proportion of men are shorter than 160cm?
Again, drawing the situation is helpful (Fig. 17.5).
Proceeding as before, we need to ask 'How many standard deviations below the mean is 160cm?' Using Equation (17.1) to compute the \(z\)-score, \(160\)cm corresponds to a \(z\)-score of
\[ z = \frac{160 - 175}{7} = -2.14; \] that is, \(2.14\) standard deviations below the mean.
What percentage of observations are less than this \(z\)-score? This case is not covered by the 68--95--99.7 rule, though we can use the 68--95--99.7 rule to make some rough estimates.
About 2.5% of observations are more than 2 standard deviations below the mean (Example 17.2); that is, about 2.5% of men are shorter than 161cm.
Men shorter than 160cm are even further into the tail of the distribution, so the percentage of men shorter than 160cm will be less than 2.5%. The 68--95--99.7 rule does not tell us the probability exactly, only that it is smaller than 2.5%.
Estimates in this way are crude, but often serviceable. However, better estimates of 'areas under the normal curve' are found using tables compiled for this very purpose.
These tables are in Appendix B.2. 'Percentages' under a normal curve are also called 'areas' under the normal curve. The total area under a normal curve is one (or 100%), since it represents all possible values that could be observed.
We now learn how to use these tables, then come back to Example 17.5.
17.6 Exact areas from normal distributions
Areas under normal distributions can be found using:
- the online tables (Sect. 17.6.1); or
- the hard-copy tables (Sect. 17.6.2).
The online tables are easier to use.
17.6.1 Using the online tables
The online tables (which work differently to the hard-copy tables) can be found in Appendix B.2. Consider the same example again: the percentage of observations smaller than \(z = -2\).
The online tables (Appendix B.2) work with two decimal places, so consider the \(z\)-score as \(z = -2.00\).
In the tables, enter the value -2.00 in the search region just under the column labelled z.score (see the animation below). After pressing Enter, the answer is shown in the column headed Area.to.left: the probability of finding a \(z\)-score less than \(-2\) is 0.0228, or about 2.28%.
Using either the hard-copy or online tables gives an answer of about 2.28%. Using the 68--95--99.7 rule, the answer we obtained was \(2.5\)%. Recall that the 68--95--99.7 rule is an approximation only.
17.6.2 Using the hard-copy tables
To demonstrate the use of the normal distribution tables, consider the percentage of observations smaller than \(z = -2\) (that is, two standard deviations below the mean) in a normal distribution.
Like the online tables, the hard-copy tables work with \(z\)-scores to two decimal places, so consider the \(z\)-score as \(z=-2.00\).
On the tables, find \(-2.0\) in the left margin of the table, and find the second decimal place (in this case, 0) in the top margin of the table (Fig. 17.6): where these intersect is the area (or probability) less than the \(z\)-score. So the probability of finding a \(z\)-score less than \(z = -2\) is 0.0228, or about 2.28%. (The online tables work differently.)
The tables give the area to the left of the \(z\)-score that is looked up.
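The table lookup can also be reproduced in software. As a minimal sketch, and not a replacement for the tables in Appendix B.2, Python's standard-library `statistics.NormalDist` gives the area to the left of a \(z\)-score through its `cdf` method.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist()

# Area to the left of z = -2.00, as in the tables.
area_to_left = std_normal.cdf(-2.00)

print(round(area_to_left, 4))  # 0.0228
```

Like the tables, `cdf` always returns the area to the left of the value supplied.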
17.7 Comparing exact and approximate areas
Armed with knowledge of obtaining exact areas, let's return to Example 17.5:
Example 17.6 (Using normal distributions) Suppose heights of Australian adult males have a mean of \(\mu=175\)cm, and a standard deviation of \(\sigma=7\)cm, and (approximately) follow a normal distribution. Using this model, what proportion are shorter than 160cm?
The general approach to computing probabilities from normal distributions is:
- Draw a diagram: Mark on 160 cm (Fig. 17.5).
- Shade the required region of interest: 'less than 160 cm tall' (Fig. 17.5).
- Compute the \(z\)-score using Equation (17.1).
- Use the \(z\) tables in Appendix B.2.
- Compute the answer.
The number of standard deviations that 160cm is from the mean is found using Equation (17.1):
\[\begin{align*} z &= \frac{x-\mu}{\sigma} \\[3pt] &= \frac{160-175}{7} = \frac{-15}{7} = -2.14. \end{align*}\] That is, 160cm is 2.14 standard deviations below the mean, so use \(z=-2.14\) in the tables. The diagram at the top of the tables reminds us that this is the probability (area) that the value of \(z\) is less than \(z=-2.14\) (Fig. 17.5). The probability of finding an Australian man less than 160cm tall is about 1.6%.
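The steps above can be sketched in Python using the standard-library `statistics.NormalDist`. Rounding the \(z\)-score to two decimal places is done here only to mimic the tables; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 175, 7   # model for heights, in cm
x = 160              # height of interest, in cm

# Compute the z-score (Equation 17.1), rounded as in the tables.
z = round((x - mu) / sigma, 2)

# Area to the left of z under the standard normal curve.
prob_shorter = NormalDist().cdf(z)

print(f"z = {z}")
print(f"P(height < {x} cm) = {prob_shorter:.4f}")
```

The result agrees with the tables: about 1.6% of men are shorter than 160cm under this model.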
More complicated questions can be asked too, as shown in the next section.
17.8 Examples using \(z\)-scores
Example 17.7 (Normal distributions) Dario M. Aedo-Ortiz, Eldon D. Olsen, and Loren D. Kellogg^{365} simulated mechanized forest harvesting systems.^{366}
As part of their study, they assumed that the specific trees in their study would vary in diameter, with
- a normal distribution; with
- a mean of \(\mu=8.8\) inches; and
- a standard deviation of \(\sigma=2.7\) inches.
Using this model, what is the probability that a tree has a diameter greater than 6 inches?
Follow the steps identified earlier:
- Draw a normal curve, and mark on 6 inches (Fig. 17.7, top panel).
- Shade the region corresponding to 'greater than 6 inches' (Fig. 17.7, bottom panel).
- Compute the \(z\)-score using Eq. (17.1). Here, \(x=6\), \(\mu=8.8\), \(\sigma=2.7\), so \(\displaystyle z = (6 - 8.8)/2.7 = -2.8/2.7 = -1.04\) to two decimal places.
- Use tables: The probability of a tree diameter less than 6 inches is \(0.1492\). (The tables always give the area less than the value of \(z\) that is looked up.)
- Compute the answer: Since the total area under the normal distribution is one, the probability of a tree diameter greater than 6 inches is \(1 - 0.1492 = 0.8508\), or about 85%.
The normal-distribution tables in the Appendix always provide the area to the left of the \(z\)-score that is looked up. Drawing a picture of the situation is important: it helps visualise how to get the answer from what the tables give us.
Remember: The total area under the normal distribution is one.
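The 'greater than' calculation in Example 17.7 can be sketched in Python as follows, using the standard-library `statistics.NormalDist`. The \(z\)-score is rounded to two decimal places only to match the tables.

```python
from statistics import NormalDist

mu, sigma = 8.8, 2.7   # model for tree diameters, in inches
x = 6                  # diameter of interest, in inches

z = round((x - mu) / sigma, 2)        # z-score, rounded as in the tables

area_left = NormalDist().cdf(z)       # P(diameter < 6 inches)
area_right = 1 - area_left            # P(diameter > 6 inches)

print(f"z = {z}")
print(f"area to the left  = {area_left:.4f}")
print(f"area to the right = {area_right:.4f}")
```

Because the software, like the tables, gives the area to the left, the area to the right is found by subtracting from one.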
Match the diagram in Fig. 17.8 with the meaning for the tree-diameter model (recall: \(\mu=8.8\) inches):
- Tree diameters greater than 11 inches.
- Tree diameters between 6 and 11 inches.
- Tree diameters less than 11 inches.
- Tree diameters between 3 and 6 inches.
(Answer is here^{367}.)
Example 17.8 (Normal distributions) Using the model for tree diameters in Example 17.7,^{368} what is the probability that a tree has a diameter between 6 and 11 inches?
First, draw the situation, and shade 'between 6 and 11 inches' (Fig. 17.9). Then, compute the \(z\)-scores for both tree diameters:
\[\begin{align*} \text{6 inches: } &z = \displaystyle \frac{6 - 8.8}{2.7} = -1.04;\\[6pt] \text{11 inches: } &z = \displaystyle \frac{11 - 8.8}{2.7} = 0.81. \end{align*}\] The tables in Appendix B.2 can then be used to find the area to the left of \(z = -1.04\), and also the area to the left of \(z = 0.81\). However, neither of these provides the area between \(z = -1.04\) and \(z = 0.81\) (Fig. 17.10).
Looking carefully at the areas from the tables and the area sought, that area between the two \(z\)-scores is
\[ 0.7910 - 0.1492 = 0.6418; \] see the animation below. The probability that a tree has a diameter between 6 and 11 inches is about 0.6418, or about 64%.
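The subtraction of the two areas can be sketched in Python, again using the standard-library `statistics.NormalDist`. The \(z\)-scores are rounded to two decimal places only to match the tables.

```python
from statistics import NormalDist

mu, sigma = 8.8, 2.7   # model for tree diameters, in inches
std_normal = NormalDist()

z_low = round((6 - mu) / sigma, 2)     # z-score for 6 inches
z_high = round((11 - mu) / sigma, 2)   # z-score for 11 inches

# Area between = (area to the left of 11) - (area to the left of 6).
area_between = std_normal.cdf(z_high) - std_normal.cdf(z_low)

print(f"z-scores: {z_low} and {z_high}")
print(f"P(6 < diameter < 11) = {area_between:.4f}")
```

The result is about 0.64, in agreement with the value found from the tables.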
17.9 Unstandardising: Working backwards
Using the model for tree diameters in Example 17.7^{369} again, suppose now the diameters of the smallest 10% of trees need to be identified. What are these diameters?
Example 17.9 (Normal distributions backwards) Consider again the trees study. The tree diameters can be modelled with
- a normal distribution; with
- a mean of \(\mu=8.8\) inches; and
- a standard deviation of \(\sigma=2.7\) inches.
Identify the diameters of the smallest 10% of trees.
This is a different problem than before; previously, the tree diameter was known, so a \(z\)-score could be computed, and hence a probability (Fig. 17.11, top panel).
This time, the probability is known, and a tree diameter is sought. That is, working 'backwards' is needed (Fig. 17.11, bottom panel), so the \(z\)-tables need to be used 'backwards' too.
17.9.1 Using the hard-copy tables
When the \(z\)-scores (in the margins of the tables) were known, the areas were found in the body of the table. If the area (or probability) is known (found in the body of the table), the corresponding \(z\)-score can be found (in the margins of the table), and hence the observation \(x\); see the animation below. The closest area to 10% in the tables is 0.1003, or 10.03%.
To identify the diameters of the smallest 10% of trees, the \(z\)-score that has an area to the left of 10% (or 0.10) needs to be found (at least, as close as possible to 0.10).
17.9.2 Using the online tables
When the area (or probability) is known, special online tables can be used (Appendix B.3). In these tables, enter the area to the left in the search box under Area.to.left, and the corresponding \(z\)-score appears under the z.score column (see the animation below).
Using either the hard-copy or online tables, the appropriate \(z\)-value is \(1.28\) standard deviations below the mean (Fig. 17.12). Then, the \(z\)-score can be converted to an observation value \(x\) using the unstandardising formula^{370}:
\[ x = \mu + z\sigma. \] Using this unstandardising formula:
\[\begin{align*} x &= \mu + (z\times\sigma) \\ &= 8.8 + (-1.28 \times 2.7) = 5.344; \end{align*}\] that is, about 10% of trees have diameters less than about 5.3 inches.
Definition 17.2 (Unstandardising formula) When the \(z\)-score is known, the corresponding value of the observation \(x\) is
\[\begin{equation} x = \mu + z\sigma. \tag{17.2} \end{equation}\] This is called the unstandardising formula.
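Working backwards can also be sketched in software. In Python's standard-library `statistics.NormalDist`, the `inv_cdf` method returns the \(z\)-score with a given area to its left; the unstandardising formula (17.2) then gives the observation. Because the software does not round the \(z\)-score to two decimal places, the answer differs slightly from the table-based value in the later decimal places.

```python
from statistics import NormalDist

mu, sigma = 8.8, 2.7    # model for tree diameters, in inches
area_to_left = 0.10     # the smallest 10% of trees

# Work backwards: find the z-score with 10% of the area to its left.
z = NormalDist().inv_cdf(area_to_left)

# Unstandardise (Equation 17.2).
x = mu + z * sigma

print(f"z = {z:.2f}")
print(f"diameter = {x:.1f} inches")
```

As with the tables, the \(z\)-score is about \(-1.28\), and about 10% of trees have diameters less than about 5.3 inches.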
Ball bearings labelled as "50mm bearings" actually have diameters that follow a normal distribution with mean 50mm and standard deviation 0.1mm. The smallest 15% of bearings are too small for sale. What size bearings cannot be sold?
(Answer is here^{371}.)
Example 17.10 (Normal distributions backwards) Using the model for tree diameters in Example 17.7^{372} again, suppose now the diameters of the largest 25% of trees need to be identified. What are these diameters?
The tree diameters can be modelled with
- a normal distribution; with
- a mean of \(\mu=8.8\) inches; and
- a standard deviation of \(\sigma=2.7\) inches.
Again, we need to work 'backwards' (Fig. 17.13, bottom panel), so the \(z\)-tables need to be used 'backwards' too. The largest 25% implies large trees, so we would expect a diameter larger than the mean.
Using a diagram is important (Fig. 17.13): the tables work with the area to the left of the value of interest, which is 75%.
Using either the hard-copy or online tables, the appropriate \(z\)-value is \(z = 0.674\). Then, the \(z\)-score can be converted to an observation value \(x\) using the unstandardising formula:
\[\begin{align*} x &= \mu + (z\times\sigma) \\ &= 8.8 + (0.674 \times 2.7) = 10.620; \end{align*}\] that is, about 25% of trees have diameters larger than about 10.6 inches.
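For 'largest' questions, the upper-tail percentage must first be converted to an area to the left. A minimal Python sketch of this step follows; the function name `cutoff_for_largest` is hypothetical, and `statistics.NormalDist` is from the standard library.

```python
from statistics import NormalDist

def cutoff_for_largest(fraction, mu, sigma):
    """Value above which the given fraction of a normal model lies."""
    area_to_left = 1 - fraction              # tables and software use area to the left
    z = NormalDist().inv_cdf(area_to_left)   # z-score with that area to its left
    return mu + z * sigma                    # unstandardise (Equation 17.2)

# Largest 25% of tree diameters (mean 8.8 inches, sd 2.7 inches).
diameter = cutoff_for_largest(0.25, mu=8.8, sigma=2.7)

print(f"diameter = {diameter:.1f} inches")
```

The result is about 10.6 inches, matching the answer found from the tables.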
17.10 Summary
A model is a way of theoretically describing the distribution of some quantitative variable in a population. One common model is a normal model or normal distribution, which is a bell-shaped distribution with a theoretical mean \(\mu\) and a theoretical standard deviation \(\sigma\). Probabilities can be computed from normal distributions using \(z\)-scores.
17.11 Quick revision questions
Consider again the model for tree diameters in Example 17.7:^{373} a normal distribution with \(\mu=8.8\) inches, and \(\sigma=2.7\) inches.
- A tree diameter of 7.9 inches corresponds to a \(z\)-score (to two decimal places) of:
- The probability that a tree has a diameter less than 7.9 inches is (as a decimal value):
- The probability that a tree has a diameter greater than 7.9 inches is (as a decimal value):
- A tree diameter of 9 inches corresponds to a \(z\)-score (to two decimal places) of:
- The probability that a tree has a diameter less than 9 inches is (as a decimal value):
- The probability that a tree has a diameter greater than 9 inches is (as a decimal value):
17.12 Exercises
Selected answers are available in Sect. D.17.
Exercise 17.1 Consider again the study by Aedo-Ortiz, Olsen, and Kellogg,^{374} who studied the diameter of trees in certain forests. The tree diameters can be modelled with
- a normal distribution; with
- a mean of \(\mu=8.8\) inches; and
- a standard deviation of \(\sigma=2.7\) inches.
For these trees:
- What is the probability that a tree will have a diameter less than 8 inches?
- What is the probability that a tree will have a diameter greater than 9 inches?
- What is the probability that a tree will have a diameter between 7 and 10 inches?
- The largest 15% of trees have what diameters?
- The smallest 25% of trees have what diameters?
Exercise 17.2 In a study^{375} to help understand factors influencing preterm births, the researchers modelled the gestation length of healthy babies as having a normal distribution with a mean of 40 weeks, and a standard deviation of 1.64 weeks. Using this model:
- What proportion of births are longer than 39 weeks (that is, nine months)?
- In Australia, a premature birth is defined as a birth occurring before 37 weeks. What proportion of births are expected to be premature?
- According to Health Direct, 'Babies born between 32 and 37 weeks may need care in a special care nursery'. What proportion of healthy births would be expected to be born between 32 and 37 weeks gestation?
- How long is the gestation length for the longest 5% of pregnancies?
- How long is the gestation length for the shortest 5% of pregnancies?
Exercise 17.3 IQ scores are designed to have a mean of 100 and a standard deviation of 15. Mensa is a society for people with a high IQ:
Membership of Mensa is open to persons who have attained a score within the upper two percent of the general population on an approved intelligence test that has been properly administered and supervised.
--- Mensa webpage
What IQ score is needed to join Mensa?
Exercise 17.4 IQ scores are designed to have a mean of 100 and a standard deviation of 15. Jay L. Zagorsky^{376} reports that
...Congress requires the Pentagon to reject all military recruits whose IQ is in the bottom 10% of the population...
What IQ scores lead to a rejection from the US military?
Exercise 17.5 IQ scores are designed to have a mean of 100 and a standard deviation of 15. Match the diagram in Fig. 17.14 with the meaning.
- IQs greater than 110.
- IQs between 90 and 115.
- IQs less than 110.
- IQs greater than 85.
Exercise 17.6 IQ scores are designed to have a mean of 100 and a standard deviation of 15. Match the diagram in Fig. 17.14 with the meaning.
- The largest 25% of IQ scores.
- The smallest 10% of IQ scores.
- The largest 70% of IQ scores.
- The smallest 60% of IQ scores.
Exercise 17.7 A study of the impact of charging electric vehicles (EVs) on electricity demands^{378} modelled the time at which people began charging their EVs at home. Based on a survey,^{379} they modelled the time at which EVs began charging as having a mean of 5:30pm, with a standard deviation of 2.28 hrs. For this model:
- What is the probability that an EV will begin charging after 9pm?
- What is the probability that an EV will begin charging before 5pm?
- What is the probability that an EV will begin charging between 5pm and 6pm?
- 30% of the EVs begin charging after what time?
- The earliest 15% of charging begins when?
Hint: This question is much easier if you convert times into 'minutes after midnight'!
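One possible way to follow the hint is sketched below in Python. The helper names `to_minutes` and `to_clock` are hypothetical, and the sketch only converts the times; it does not answer the exercise.

```python
def to_minutes(hour, minute=0):
    """Convert a 24-hour clock time to minutes after midnight."""
    return hour * 60 + minute

def to_clock(minutes):
    """Convert minutes after midnight back to an 'HH:MM' string."""
    hour, minute = divmod(round(minutes), 60)
    return f"{hour:02d}:{minute:02d}"

mean_start = to_minutes(17, 30)   # 5:30pm as minutes after midnight
sd_start = 2.28 * 60              # 2.28 hours expressed in minutes

print(mean_start)                 # 1050
print(to_clock(mean_start))       # 17:30
```

With the mean and standard deviation both in minutes, times such as 9pm can be converted in the same way before computing \(z\)-scores.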