14 Testing for normality and other distributions

Many statistical tests require normality as a prerequisite. You can check normality visually and with hypothesis tests. In practice, no single test is completely reliable. Because the assumption of normality is so critical in many applications, I recommend combining several methods; don’t rely on just a single test, especially when the result is inconclusive or borderline.

14.1 Visual methods

To use a visual method, you need to have some idea what distribution you are comparing your data to.

14.1.1 Using a histogram of your data

Most of us find it easier to identify what family of distributions the data come from by looking at a histogram, so plotting the data is usually the first step.

The problem with histograms is that the data can look very different depending on the scale and bin width used. Therefore, other graphical methods are often preferable. Below, all three histograms show the same data.
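A minimal sketch that reproduces the idea (the bin counts 5, 15, and 50 are arbitrary choices for illustration):

x <- rnorm(100, 1, 5)        # one sample, plotted three ways
par(mfrow = c(1, 3))         # three plots side by side
hist(x, breaks = 5,  main = "5 bins")
hist(x, breaks = 15, main = "15 bins")
hist(x, breaks = 50, main = "50 bins")
par(mfrow = c(1, 1))         # reset the plotting layout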

14.1.2 Plotting empirical cumulative distribution function against a known distribution

Plot your data’s empirical cumulative distribution function (ecdf) against a normal cdf and eyeball it. Does it look close? You can use information from the graph to adjust your guess. The wrong distribution will have a different shape than your data. A line that is shifted left or right has a different mean than your data (green line). The wrong standard deviation will change how steep the line looks (blue line). Remember that the normal distribution is symmetric.

x <- rnorm(100,1,5)  # sample of 100 points from N(mean = 1, sd = 5)
plot(ecdf(x), xlim=c(-10,20), main="A plot of ecdf of normal data vs cdfs of different normal distributions")
legend('topleft', legend=c("data","matching normal","mean too large","standard deviation too large","standard deviation too small"), fill=c("black","red","green","orange","blue"))
curve(pnorm(x,1,5), col="red", add=TRUE)    # correct mean and sd
curve(pnorm(x,8,5), col="green", add=TRUE)  # mean too large
curve(pnorm(x,1,8), col="orange", add=TRUE) # sd too large
curve(pnorm(x,1,1), col="blue", add=TRUE)   # sd too small
x <- rnorm(100,1,5)  # fresh sample from N(1, 5)
plot(ecdf(x), xlim=c(-10,20), main="A plot of ecdf of normal data vs cdfs of different non-normal distributions")
legend('topleft', legend=c("data","matching normal","uniform","Chi-square"), fill=c("black","red","green","blue"))
curve(pnorm(x,1,5), col="red", add=TRUE)   # the correct normal
curve(punif(x,1,5), col="green", add=TRUE) # uniform on [1,5]
curve(pchisq(x,5), col="blue", add=TRUE)   # Chi-square with 5 df

14.2 Q-Q plots

Q-Q plots, or quantile-quantile plots, can also be used to determine whether or not a set of data potentially came from some distribution. Again, we need to have an idea which distribution to compare to. Q-Q plots are an easy way to visually check whether a data set follows an assumed distribution. They also give some information about how this assumption is violated and which data points cause problems. We can then go back and check whether those data points were entered correctly or whether they are legitimate outliers.

Recall that quantiles are cut-offs in a dataset below which a certain portion of the data fall. For example, the 0.8 quantile is the point below which 80% of the data fall, and so on.
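For instance, a quick check in R (sample and quantile level chosen arbitrarily):

x <- rnorm(100, 1, 5)        # arbitrary sample
quantile(x, 0.8)             # the 0.8 quantile of the sample
mean(x <= quantile(x, 0.8))  # fraction of the data at or below it; close to 0.8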

Q-Q plots plot the quantiles of your sample data against the quantiles of the distribution you want to test for fit. If the plotted points fall along a straight diagonal line, then the data likely follow the assumed distribution. If you have the correct normal distribution, i.e. you also have the mean and standard deviation right, then the Q-Q plot will follow the \(x=y\) line (added below in red).
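Base R also provides qqnorm() and qqline() for a quick normal Q-Q check (sample chosen arbitrarily); note that qqnorm() uses standard normal quantiles, so the reference line is not \(x=y\) in general:

x <- rnorm(100, 1, 5)
qqnorm(x)               # sample quantiles vs standard normal quantiles
qqline(x, col = "red")  # reference line through the first and third quartiles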

Have a look at the plots below. I first compare our normal data to various normal distributions. Note how you always see a straight line except at the tails, but also how having the wrong mean or wrong standard deviation changes the position and slope of the line. Finally, note that if you guessed the wrong type of distribution, you won’t get a straight line at all.

data <- rnorm(1000,1,5)    # sample from N(1, 5)
x <- seq(0.01,0.99, 0.01)  # probability levels for the quantiles
y1 <- qnorm(x,1,5)         # matching normal
y2 <- qnorm(x,5,5)         # mean too large
y3 <- qnorm(x,1,10)        # sd too large
y4 <- qnorm(x,1,0.5)       # sd too small
qqplot(data,y1,xlim=c(-15,15),ylim=c(-15,15),ylab="comparison distribution",main="Q-Q plot against original distribution")
abline(coef=c(0,1),col="red")
qqplot(data,y2,xlim=c(-15,15),ylim=c(-15,15),ylab="comparison distribution",main="Q-Q plot against normal distribution, mean too large")
abline(coef=c(0,1),col="red")
qqplot(data,y3,xlim=c(-15,15),ylim=c(-15,15),ylab="comparison distribution",main="Q-Q plot against normal distribution, sd too large")
abline(coef=c(0,1),col="red")
qqplot(data,y4,xlim=c(-15,15),ylim=c(-15,15),ylab="comparison distribution",main="Q-Q plot against normal distribution, sd too small")
abline(coef=c(0,1),col="red")

y5 <- qunif(x,0,10)  # uniform quantiles
y6 <- qchisq(x,5)    # Chi-square quantiles, 5 df
qqplot(data,y5,xlim=c(-15,15),ylim=c(-15,15),ylab="comparison distribution",main="Q-Q plot against uniform distribution")
abline(coef=c(0,1),col="red")
qqplot(data,y6,xlim=c(-15,15),ylim=c(-15,15),ylab="comparison distribution",main="Q-Q plot against Chi square distribution")
abline(coef=c(0,1),col="red")

14.3 Hypothesis tests

The tests introduced here are testing the hypotheses:

\(H_o\): The sample’s distribution is equal to an assumed distribution
\(H_a\): The sample’s distribution is not equal to an assumed distribution
So in a way, we are not testing for a distribution, but against it, as a small p-value would lead us to reject \(H_o\).

Basically, we are assuming we have normality (or whatever distribution we are hypothesizing) unless we find evidence against it.

Normality tests are sensitive to sample size. Small samples often pass normality tests even when the underlying data are not normal, while very large samples can fail them for deviations that are practically negligible. You should combine visual inspection and significance test(s) to increase your chances of correctly identifying your distribution.
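A quick sketch of both effects (the distributions and sample sizes here are arbitrary choices for illustration):

set.seed(2)  # arbitrary seed for reproducibility
shapiro.test(rexp(15))$p.value           # clearly skewed data, n = 15: often fails to reject
shapiro.test(rt(5000, df = 30))$p.value  # nearly normal data, n = 5000: may reject anyway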

Note: While we take a look at the test statistics involved, their sampling distributions are beyond the scope of this course.

14.3.1 Chi-square test

We covered this earlier in the chapter on contingency tables.

14.3.2 One sample Kolmogorov-Smirnov test

The Kolmogorov-Smirnov test checks whether a variable follows a given distribution. This “given distribution” is usually, but not always, the normal distribution, which is why it is also called the “Kolmogorov-Smirnov normality test”.

The key assumptions of the one-sample test are that the theoretical distribution is continuous and fully defined, i.e. all parameters such as mean and standard deviation are known. It is a misuse of the K-S test to estimate the parameters from your data. And, as always, we have to assume our sample is random, representative, from a well-designed experiment, etc.

Let’s assume we have \(n\) observations \(x_1,...,x_n\) and the associated empirical cumulative distribution function \(F_n(x)\). Let \(F(x)\) be the cumulative distribution function of the distribution we want to test for. The Kolmogorov-Smirnov statistic for a given cumulative distribution function \(F(x)\) is

\[ D_n = \sup_{x} |F_n(x)-F(x)| \]

where \(\sup_{x}\) is the supremum over \(x\) of the set of distances between \(F_n(x)\) and \(F(x)\). There is a theorem (Glivenko-Cantelli) that states that, if the sample really did come from the hypothesized distribution \(F(x)\), then \(D_n\) converges to 0 in probability as \(n\) goes to infinity. The graph below illustrates this concept for testing a sample of size \(n\) against a uniform distribution.
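A minimal sketch that produces such a graph (the sample sizes 10, 100, and 1000 and the colors are arbitrary choices for illustration):

set.seed(1)  # for reproducibility
curve(punif(x, 0, 1), from = 0, to = 1, col = "red", lwd = 2,
      ylab = "F(x)", main = "ecdfs of uniform samples approaching the true cdf")
cols  <- c("green", "blue", "black")
sizes <- c(10, 100, 1000)
for (i in 1:3) {
  # each ecdf hugs the true cdf more closely as n grows
  plot(ecdf(runif(sizes[i], 0, 1)), add = TRUE, do.points = FALSE, col = cols[i])
}
legend("topleft", legend = c("true cdf", paste("n =", sizes)), fill = c("red", cols))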

The K-S test has several important limitations:

  • It only applies to continuous distributions.
  • It tends to be more sensitive near the center of the distribution than at the tails.
  • The distribution you test for must be fully specified. That is, location, scale, and shape parameters cannot be estimated from the data (see the sketch after this list).
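A minimal sketch of the last point (the parameters 10 and 2 are arbitrary choices); plugging estimated parameters into ks.test() makes the test too conservative:

x <- rnorm(50, 10, 2)
ks.test(x, "pnorm", 10, 2)           # valid: parameters fully specified in advance
ks.test(x, "pnorm", mean(x), sd(x))  # misuse: parameters estimated from x,
                                     # p-values come out too large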

Example Let’s test if 10000 data points generated from a uniform distribution really follow that distribution.
\(H_o\): Data are uniformly distributed on [0,5]
\(H_a\): Data are not uniformly distributed on [0,5]

x <- runif(10000, 0,5)
ks.test(x,'punif',0,5)
#> 
#>  Asymptotic one-sample Kolmogorov-Smirnov test
#> 
#> data:  x
#> D = 0.0086662, p-value = 0.4404
#> alternative hypothesis: two-sided

Here we get a large p-value, so we fail to reject that the data came from the specified uniform distribution. The data may or may not be distributed uniformly on [0,5].

Example Let’s test a different uniform distribution.

\(H_o\): Data are uniformly distributed on [0,1]
\(H_a\): Data are not uniformly distributed on [0,1]

x <- runif(10000, 0,5)
ks.test(x,'punif',0,1)
#> 
#>  Asymptotic one-sample Kolmogorov-Smirnov test
#> 
#> data:  x
#> D = 0.8042, p-value < 2.2e-16
#> alternative hypothesis: two-sided

Here we get a small p-value, so we reject the hypothesis that the data came from the specified uniform distribution. The data are not distributed uniformly on [0,1] (indeed, they were generated on [0,5]).

Example Next, we test a small and a large sample.

\(H_o\): Data are from the standard normal distribution
\(H_a\): Data are not from the standard normal distribution

xs <- runif(10, -1,1)
ks.test(xs,'pnorm',0,1)
#> 
#>  Exact one-sample Kolmogorov-Smirnov test
#> 
#> data:  xs
#> D = 0.21378, p-value = 0.6761
#> alternative hypothesis: two-sided
xl <- runif(100, -1,1)
ks.test(xl,'pnorm',0,1)
#> 
#>  Asymptotic one-sample Kolmogorov-Smirnov test
#> 
#> data:  xl
#> D = 0.1626, p-value = 0.01011
#> alternative hypothesis: two-sided

You see that the test is inconclusive for the small sample, but we can confidently reject normality for the larger sample. In general, compared to other tests such as the Anderson-Darling test (see below), the K-S test requires a relatively large number of data points to correctly reject the null hypothesis.

14.3.3 Two-sample Kolmogorov-Smirnov test

With this test, you are trying to find out whether two variables come from the same (continuous) distribution. Similarly to the one-sample test, the test statistic is the supremum of all the distances between the two empirical distribution functions. Assume we have two samples of size \(n_1\) and \(n_2\) respectively, with empirical distribution functions \(F_{n_1}(x)\) and \(F_{n_2}(x)\). The test statistic is \[ D_{n_1,n_2} = \sup_{x} |F_{n_1}(x)-F_{n_2}(x)|. \]
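Since both ecdfs are step functions that only jump at observed data points, the supremum can be found by evaluating the difference at the pooled sample values. A minimal sketch (samples chosen arbitrarily) comparing a hand computation with ks.test():

x <- rnorm(100, 1, 2)
y <- rnorm(100, 1, 2)
Fx <- ecdf(x); Fy <- ecdf(y)
grid <- sort(c(x, y))                  # the pooled data points
D_manual <- max(abs(Fx(grid) - Fy(grid)))
D_manual
ks.test(x, y)$statistic                # should agree with D_manual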

Example Compare two samples drawn from the same normal distribution.

x <- rnorm(100,1,2)
y <- rnorm(100,1,2)
ks.test(x,y)
#> 
#>  Asymptotic two-sample Kolmogorov-Smirnov test
#> 
#> data:  x and y
#> D = 0.14, p-value = 0.281
#> alternative hypothesis: two-sided

Example Compare two samples drawn from different normal distributions.

x <- rnorm(100,0,1)  # sd of 1 is an assumption; the argument was missing
y <- rnorm(100,1,2)
ks.test(x,y)
#> 
#>  Asymptotic two-sample Kolmogorov-Smirnov test
#> 
#> data:  x and y
#> D = 0.46, p-value = 1.292e-09
#> alternative hypothesis: two-sided

Example Compare two samples drawn from two different types of distribution.

x <- rnorm(10,1,2)
y <- runif(10,0,4)
ks.test(x,y)
#> 
#>  Exact two-sample Kolmogorov-Smirnov test
#> 
#> data:  x and y
#> D = 0.4, p-value = 0.4175
#> alternative hypothesis: two-sided
x <- rnorm(100,1,2)
y <- runif(100,0,4)
ks.test(x,y)
#> 
#>  Asymptotic two-sample Kolmogorov-Smirnov test
#> 
#> data:  x and y
#> D = 0.3, p-value = 0.0002468
#> alternative hypothesis: two-sided

The last example illustrates again that you need a large number of data points to correctly reject \(H_o\).

There are other tests you can run to check distributions. You first need to install the package ‘nortest’. Remember to check it in the Packages tab in window (4).
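A minimal sketch of the one-time setup:

install.packages("nortest")  # one-time installation
library(nortest)             # load once per session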

14.3.4 Anderson-Darling test

Like the K-S test, the Anderson-Darling (A-D) test is based on the distance between the empirical distribution function and an assumed cdf. However, in this case the difference between them is squared, weighted, and then integrated. Again, the details are beyond this course.
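For reference, the statistic (stated here for completeness) is

\[ A^2 = n \int_{-\infty}^{\infty} \frac{\big(F_n(x)-F(x)\big)^2}{F(x)\big(1-F(x)\big)}\, dF(x) \]

where the weight \(1/\big(F(x)(1-F(x))\big)\) makes the test more sensitive in the tails than the K-S test.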

Some notes:

  • The sample size of unique values needs to be 8 or more.
  • The version of the A-D test we use here does not require you to know the parameters of the distribution you are testing for. It tests for “generic” normality.
  • The data should not contain any ties. When a large number of ties exist, the Anderson-Darling test will frequently reject the data as non-normal by mistake.
  • You need to check or load the nortest package.
library(nortest)      # provides ad.test()
x <- rnorm(1000,0,3)  # normal sample
y <- runif(1000,0,4)  # large uniform sample
z <- runif(100,0,4)   # small uniform sample
ad.test(x)
#> 
#>  Anderson-Darling normality test
#> 
#> data:  x
#> A = 0.17784, p-value = 0.9197
ad.test(y)
#> 
#>  Anderson-Darling normality test
#> 
#> data:  y
#> A = 10.867, p-value < 2.2e-16
ad.test(z)
#> 
#>  Anderson-Darling normality test
#> 
#> data:  z
#> A = 2.2808, p-value = 8.051e-06

Note that, unlike the K-S test in the example above, the Anderson-Darling test could detect that sample z is not normal.

14.3.5 Shapiro-Wilk test for normality

The Shapiro-Wilk (S-W) test is not as affected by ties as the A-D test, but you still shouldn’t use it with a large number of ties. Recall the Q-Q plots: the S-W test measures how well the ordered and standardized quantiles computed from your data fit the standard normal quantiles. The statistic takes a value between 0 and 1, with 1 being a perfect match. You can think of Q-Q plots as the visual version of the S-W test.

  • The number of unique values in your sample must be between 3 and 5000.
  • The S-W test also does not require you to know the parameters of the normal distribution you are testing for.
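A rough sketch of the idea (not the exact S-W statistic, which uses special weights): the squared correlation between the ordered sample and the corresponding standard normal quantiles is close to 1 for normal data.

x <- sort(rnorm(100, 1, 5))  # ordered sample; parameters chosen arbitrarily
q <- qnorm(ppoints(100))     # approximate expected standard normal quantiles
cor(x, q)^2                  # close to 1 when the data are normal

The built-in shapiro.test() computes the proper statistic: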
x <- rnorm(1000,0,3)  # shapiro.test() is part of base R, no package needed
y <- runif(1000,0,4)  # large uniform sample
z <- runif(100,0,4)   # small uniform sample
shapiro.test(x)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  x
#> W = 0.99804, p-value = 0.2994
shapiro.test(y)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  y
#> W = 0.95771, p-value < 2.2e-16
shapiro.test(z)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  z
#> W = 0.95277, p-value = 0.001268

Recall that the power of a test is the probability of avoiding a type II error, i.e. the probability of correctly rejecting a false null hypothesis. https://www.nrc.gov/docs/ML1714/ML17143A100.pdf compared the power of the S-W, K-S, and A-D tests and concluded that, for smallish samples (n < 100), the S-W test is the most powerful, followed by the A-D and then the K-S test. However, as the sample size increases (n > 1000), all three tests perform very similarly. Not surprisingly, the power depends very much on the underlying distribution of the sample (symmetric or not, close to normal or not, etc.).
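A minimal simulation sketch of such a power comparison (the exponential alternative, sample size, and number of replications are arbitrary choices; note that the K-S line uses estimated parameters, which, as discussed above, is technically a misuse and only serves as a rough comparison here):

library(nortest)  # for ad.test()
set.seed(3)
B <- 1000; n <- 30
rej <- c(SW = 0, AD = 0, KS = 0)
for (b in 1:B) {
  x <- rexp(n)  # clearly non-normal alternative
  rej["SW"] <- rej["SW"] + (shapiro.test(x)$p.value < 0.05)
  rej["AD"] <- rej["AD"] + (ad.test(x)$p.value < 0.05)
  rej["KS"] <- rej["KS"] + (ks.test(x, "pnorm", mean(x), sd(x))$p.value < 0.05)
}
rej / B  # estimated power of each test at the 5% level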

14.4 Assignment

Test whether the data in normal_test.csv are normally distributed.