4.1 Checking whether the underlying distribution is normal

Whenever we wish to carry out a hypothesis test, it is a good idea to start by visualising the data at hand and producing some simple summary statistics. For a one-sample \(t\)-test, we can look at a histogram of the data, and calculate the sample mean and sample standard deviation. Carrying out these steps will not only give us an idea of what we may expect to find after carrying out the hypothesis test, but will also give us an idea of the underlying distribution of the population from which the data have been sampled. Let's take a look at the cholesterol data:

Sample mean = 5.13
Sample standard deviation = 0.5

Looking at the histogram above, we can observe a skew to the right, or positive skew, in the data. This is a sign that the data may not have been sampled from a population that is normally distributed. However, there is not enough detail in the histogram for us to draw a strong conclusion about how well the data fits the normal curve that has been overlaid. Histograms are not the only way we can check for normality. We will consider three ways:

Viewing the data in a histogram with a normal curve overlaid
Checking a Normal Q-Q plot
Carrying out a hypothesis test for normality.
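The first of these checks can be sketched in R as follows. This is a minimal illustration only: the `heartattack` data set is not available here, so the `cholesterol` vector below is simulated, and its sample size (60) and seed are assumptions rather than properties of the real data.

```r
# Simulated stand-in for heartattack$cholesterol (hypothetical values).
set.seed(42)
cholesterol <- rnorm(60, mean = 5.13, sd = 0.5)

# Simple summary statistics.
x_bar <- mean(cholesterol)
s <- sd(cholesterol)
cat("Sample mean =", round(x_bar, 2), "\n")
cat("Sample standard deviation =", round(s, 2), "\n")

# Histogram on the density scale, so the normal curve shares the same y-axis.
hist(cholesterol, freq = FALSE,
     main = "Histogram of cholesterol", xlab = "Cholesterol")

# Overlay the normal curve with the sample mean and standard deviation.
curve(dnorm(x, mean = x_bar, sd = s), add = TRUE, lwd = 2)
```

Setting `freq = FALSE` matters here: with the default frequency scale, the density curve would sit almost flat along the bottom of the plot.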

Checking a Normal Q-Q plot

A Normal Q-Q (Quantile-Quantile) plot is another graphical method we can use to check for normality. Without going into detail here, this plot compares the quantiles of the data with the quantiles we would expect from a normal distribution. The main thing to look for is how well the dots follow the diagonal straight line in the plot. For the data to be considered normally distributed, the dots should follow the line closely. Let's look at the Normal Q-Q plot for the cholesterol example:

For the most part, the dots follow the diagonal line quite well. However, in the upper right-hand corner the dots move away from the line. This is indicative of the skew we observed in the histogram.
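A Normal Q-Q plot of this kind can be drawn in base R with `qqnorm()` and `qqline()`. The sketch below again uses a simulated `cholesterol` vector (an assumption, since the real data are not available here), so the plot it draws will not reproduce the skew described above.

```r
# Simulated stand-in for heartattack$cholesterol (hypothetical values).
set.seed(42)
cholesterol <- rnorm(60, mean = 5.13, sd = 0.5)

# qqnorm() plots the data (y) against theoretical normal quantiles (x)
# and invisibly returns both sets of values.
qq <- qqnorm(cholesterol, main = "Normal Q-Q plot")

# qqline() adds the reference line through the first and third quartiles.
qqline(cholesterol)

# The theoretical quantiles on the x-axis are normal quantiles at
# evenly spaced probabilities.
theoretical <- qnorm(ppoints(length(cholesterol)))
```

Dots that drift above the line at the right-hand end, as in the cholesterol example, suggest a longer right tail than a normal distribution would have.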

Carrying out a hypothesis test for normality

We can use the Shapiro-Wilk test to test for normality. Consider the following null and alternative hypotheses:

\(H_0:\text{The data are normally distributed.}\)
\(H_1:\text{The data are not normally distributed.}\)

Since we start out by assuming the data are normally distributed, the test tells us to only reject this assumption if we get a small \(p\)-value. That is, a small \(p\)-value indicates the data are not normally distributed. To summarise:

Hypothesis test for normality:

If \(p < 0.05\), normality cannot be assumed
If \(p > 0.05\), normality can be assumed

Note the following:

If the sample size is very small, the test has little power and may fail to pick up non-normality
If the sample size is very large (e.g. 100 or more), the test may become too sensitive and flag small departures from normality that are of little practical importance
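The effect of sample size can be illustrated with a small simulation. This is a sketch under stated assumptions: the mildly skewed gamma population (`shape = 10`), the sample sizes and the number of repetitions are all hypothetical choices made for illustration, not part of the cholesterol example.

```r
# Hypothetical simulation: how often does the Shapiro-Wilk test reject
# normality when samples come from a mildly right-skewed population?
set.seed(1)

rejection_rate <- function(n, reps = 500, alpha = 0.05) {
  # Draw `reps` samples of size n and record each test's p-value.
  p_values <- replicate(reps, shapiro.test(rgamma(n, shape = 10))$p.value)
  # Proportion of samples for which normality is rejected.
  mean(p_values < alpha)
}

sample_sizes <- c(10, 50, 500)
rates <- sapply(sample_sizes, rejection_rate)
names(rates) <- sample_sizes
print(rates)
```

The population is identical in every run, so any change in the rejection rate comes from the sample size alone: small samples rarely reveal the skew, while large samples reveal it far more often.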

Let's carry out the Shapiro-Wilk test for normality for the cholesterol data:


    Shapiro-Wilk normality test

data:  heartattack$cholesterol
W = 0.97016, p-value = 0.08466

As we can see, we have \(p = 0.08466\). Since \(p > 0.05\), normality can be assumed.
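Output of the form shown above comes from R's `shapiro.test()` function. The sketch below applies it to a simulated `cholesterol` vector (an assumption standing in for `heartattack$cholesterol`), so its \(W\) and \(p\) values will differ from those reported above.

```r
# Simulated stand-in for heartattack$cholesterol (hypothetical values).
set.seed(42)
cholesterol <- rnorm(60, mean = 5.13, sd = 0.5)

# With the real data, the call would be shapiro.test(heartattack$cholesterol).
result <- shapiro.test(cholesterol)
print(result)

# Apply the decision rule at the 5% level of significance.
alpha <- 0.05
if (result$p.value < alpha) {
  decision <- "normality cannot be assumed"
} else {
  decision <- "normality can be assumed"
}
cat("p =", signif(result$p.value, 4), "so", decision, "\n")
```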

There is sometimes a grey area when deciding whether or not data are normally distributed. The cholesterol data is an example of where this is the case because:

We observe positive skew in the histogram
The Normal Q-Q plot showed some evidence of positive skew
The Shapiro-Wilk test indicated normality can be assumed at the \(\alpha = 0.05\) level of significance, although \(p = 0.085\) can still be considered 'close to significant'.

However, on balance, and especially when considering how the Central Limit Theorem applies, we could safely conclude in this example that the normality assumption has not been violated. We will see why in the next section.