4.5 Determining whether data are sampled from a Normal distribution

Let us now return to the Height variable from the survey data set, which contains the responses of Statistics students to a set of questions (Venables & Ripley, 1999). In general, we let \(x_1, x_2, \ldots, x_n\) denote a sample of data, where \(n\) is the sample size, that is, the number of observations in the sample. In this instance we have 209 observations, so \(n = 209\). The observations can be listed as \((173, 177.8, 160, \ldots)\) and, as we have seen, we can display these data in a histogram.

In statistics it is often useful to ask: have these data been sampled from a normal distribution? There are several ways to approach this question. We will look at a few of them here.

Overlaying a Normal curve on a histogram

As shown in the figure below, it is possible to re-scale a histogram so that the vertical axis shows density instead of frequency. On the density scale the total area of the bars equals 1, which makes it possible to overlay the probability density function of a normal distribution and compare the shape of the histogram (the sampled data) with the shape of the normal curve.

We can see that the curve fits the data reasonably well. This suggests, but does not prove, that the data could have been sampled from a normal distribution. We may therefore surmise that the data come from a distribution that is normal, or at least approximately normal.

For the values of \(\mu\) and \(\sigma\) of the normal distribution here, we have used the sample mean and sample standard deviation calculated from the data.
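The re-scaling and the choice of \(\mu\) and \(\sigma\) can be sketched numerically. The example below is a minimal illustration in Python using only the standard library, not the method used to produce the figure. The `heights` values are simulated stand-ins (the survey observations are not reproduced here), and the bin width of 5 is an arbitrary assumption. Each bin count is divided by \(n\) times the bin width to give a density, and the result is compared with the normal density at the bin midpoint.

```python
import random
import statistics

# Hypothetical stand-in for the Height data: 209 simulated values.
# These are NOT the real survey observations.
random.seed(1)
heights = [random.gauss(172.0, 10.0) for _ in range(209)]

n = len(heights)
mu_hat = statistics.mean(heights)       # sample mean, used for mu
sigma_hat = statistics.stdev(heights)   # sample standard deviation, used for sigma
fitted = statistics.NormalDist(mu_hat, sigma_hat)

# Equal-width bins covering the whole sample (the width is an arbitrary choice).
width = 5.0
start = width * (min(heights) // width)
n_bins = int((max(heights) - start) // width) + 1
counts = [0] * n_bins
for x in heights:
    counts[int((x - start) // width)] += 1

# Density scale: divide each count by n * width so the bar areas sum to 1.
densities = [c / (n * width) for c in counts]

# Compare each bar height with the normal density at the bin midpoint.
for j, d in enumerate(densities):
    mid = start + (j + 0.5) * width
    print(f"{mid:6.1f}  histogram {d:.4f}  normal {fitted.pdf(mid):.4f}")
```

If the two columns of densities are broadly similar across the bins, the overlaid curve would appear to follow the histogram reasonably well.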

Comparing quantiles

We could also compare the quantiles of the sampled data with the corresponding quantiles of the normal distribution. Consider the table below. The first row shows the 0.1 to 0.9 quantiles as estimated from the data. The second row shows the same quantiles as calculated from the normal distribution (the theoretical quantiles):

Quantile        0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9
Sample       160.00  165.00  167.00  169.36  171.00  173.80  178.00  180.34  185.42
Theoretical  159.76  164.09  167.21  169.88  172.38  174.88  177.55  180.67  185.00

How well do you think the sample quantiles match the theoretical quantiles? Based on your answer, do you think the data have been sampled from a normal distribution?
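The theoretical row can be approximated with the inverse cumulative distribution function of a normal distribution. The sketch below uses Python's standard library. The value 172.38 is the 0.5 theoretical quantile from the table, taken as the mean. The standard deviation of 9.85 is an assumption back-derived from the table, not a value quoted in the text, so the results may differ from the table in the second decimal place.

```python
from statistics import NormalDist

probs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# Sample quantiles copied from the table above.
sample_q = [160.00, 165.00, 167.00, 169.36, 171.00, 173.80, 178.00, 180.34, 185.42]

# Assumed parameters: 172.38 is the 0.5 theoretical quantile in the table;
# 9.85 is back-derived from the table and is not stated in the text.
fitted = NormalDist(mu=172.38, sigma=9.85)

# Theoretical quantile for probability p: the value x with P(X <= x) = p.
theoretical_q = [fitted.inv_cdf(p) for p in probs]

for p, s, t in zip(probs, sample_q, theoretical_q):
    print(f"{p:.1f}  sample {s:7.2f}  theoretical {t:7.2f}  difference {s - t:+6.2f}")
```

The final column gives the gap between each sample quantile and its theoretical counterpart, which is one way to make the comparison concrete.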

Other methods

There are other methods commonly used to help determine whether or not data have been sampled from a normal distribution, including the inspection of a 'Normal Q-Q plot', and some statistical tests. We will consider these methods later on in the subject.