Chapter 7 Testing




7.1 Motivation


We began to discuss inference back with the Central Limit Theorem, and saw our first tool of inference with confidence intervals. Now we will actually solve inference-based problems using testing methods. This section provides really interesting applications of Statistics and a great introduction to further learning on inference.





7.2 Introduction to Testing


Recall that a concept central to Statistics is the sample-population relationship: that we can never find the true parameters of large populations and so we instead take samples to estimate these parameters. Testing allows us to judge claims about these parameters.


Sounds a little dicey, but the general idea is testing to see if someone’s ‘hunch’ about a population parameter is correct. More concretely, imagine someone claimed that the average height for adults in America was 8 feet. Using the techniques we are about to learn, you could actually test this claim and attach a probability to it. In the case of this extreme height, you would take a sample, find an average much lower than 8 feet, and eventually get a low probability that the hypothesized value of 8 feet on average is accurate.

Let’s start with the terminology. The most basic parts of hypothesis testing are the null and alternative hypotheses. These are essentially the two competing guesses when someone makes a claim about a population. Let’s break them down further.


1. Null Hypothesis, notated \(H_o\)


Perhaps the best way to think about the null is as a status quo. This is what we assume until it is proved otherwise. Let’s think of a relevant example. At the time of writing, the Patriots had just demolished the Colts in the AFC championship 45 to 7. Anyways, there is a little controversy regarding the actual air pressure of the footballs: some officials are claiming that the Patriots deflated their footballs slightly so they would be easier to catch in the rainy conditions (since the home team has to provide game balls for the playoffs, it would be easy for the Pats to deflate the balls they were going to play with). Let’s say that the rule states that the ball must be inflated to 100 PSI (pounds per square inch), and the officials are claiming the Patriots’ balls were at less than 100 PSI.


This is an excellent example of hypothesis testing. We have some population - all of the footballs that Patriots own - and we are testing a parameter of the population - the average pressure in the balls. You can’t measure ALL the balls (these teams could have thousands) but you can take a sample of a few balls and use that to estimate the true average pressure of all the balls.


So, what would the Null Hypothesis be in this situation? Well, it’s the status quo that we assume until proven otherwise. Here, we are assuming that the balls are regulated and pumped to 100 PSI; this is what we are trying to prove incorrect. Therefore, the null hypothesis here is that the average pressure in the Patriots’ balls is 100 PSI. Succinctly, we could write:

\[H_o: \mu = 100\]

Where \(H_o\) just means ‘null hypothesis’ and \(\mu\) is the average pressure in the balls (again, the status quo is that they are at 100 PSI).

This brings us to the…


2. Alternative Hypothesis, notated \(H_A\)


The alternative hypothesis is the claim being made about the population parameter; it is what we are testing to see whether it is correct. In this football example, what would be the alternative hypothesis?

Well, we know that the null, or status quo, is that the average pressure in the Patriots’ footballs is 100 PSI, but we also know that the officials are claiming that the Patriots deflated the footballs. Therefore, the alternative hypothesis is just that the average pressure is less than 100 PSI. Succinctly:

\[H_A: \mu < 100\]

Or “the alternative hypothesis is that the population average is less than 100 PSI.”



That’s it for our hypotheses: a null and an alternative. The null is the assumed status quo for a parameter of the population, and the alternative is a claim about this parameter.


What happens next? Well, in the football example, we take a sample of the Patriots’ footballs (maybe we look at 10 footballs) and check the PSI in them. We do some mathematical mechanics (which we will learn about later) and then we conclude something concerning our hypotheses. There are clearly two cases here: the sample gives enough evidence to support the claim (perhaps all of the footballs sampled were very deflated) or the sample does not give enough evidence to support the claim (all of the footballs were fine).

Let’s think about the first case, when there is enough evidence to support the claim (again, we will formalize just how we get to this point later). If this happens - say all the footballs are well below 100 PSI - we reject the null hypothesis (the status quo that all footballs are fine and regulated at 100 PSI) and accept the alternative hypothesis (that the footballs were indeed too deflated).

Be sure to become comfortable with this terminology! It will show up often, and will be needed every time you perform a hypothesis test. This one is pretty straightforward: if there is enough evidence for a claim, we can reject the status quo and accept the claim.


So, if we have enough evidence, we can reject the null and accept the alternative. The second case is when there is not enough evidence to support the claim/alternative hypothesis, like if all the footballs were perfect at 100 PSI. In this case (and this is a bit trickier) we fail to reject the null hypothesis.


Why this jargon? Well, again, the null hypothesis provides a specific statement about the population, since it actually gives a parameter value for what the mean or proportion of a population takes on. Since it talks about this true parameter and we already know we can never find the exact parameter (population is too big) we can never quite prove that the null hypothesis is true. Professor Parzen from Harvard has a great example of this. Say you have the null hypothesis that all cats have 4 legs. You could never exhaustively prove this: to do so you would have to round up every single cat and count their legs. However, you could very easily reject this claim if you saw just one cat with 3 legs (which you will if you go to class). So, you can only reject or fail to reject the null. If you do reject the null (hey, there’s a cat with 3 legs, so all cats don’t have 4 legs) you are implying that you accepted the alternative (here that not all cats have 4 legs).


Again, be sure to commit this reject/fail to reject/accept terminology to memory, since you will need it in all interpretations of Hypothesis testing!


The last qualitative bit of testing has to do with Testing Errors. These take two forms, called Type I and Type II.


1. Type I Error


A Type I Error occurs when you reject a true null hypothesis. That is, your sample seemed to have a lot of evidence that the null hypothesis was not true, but it turns out that the sample was just crazy random and the null is actually true. In the Patriots football example, this would happen if the officials concluded that the average PSI for the Patriots footballs was below 100, when in reality over the entire population the average is indeed 100 PSI.

Usually our work with Type I Errors is qualitative, but on the past two exams there has been a question on actually finding the probability of making these errors. To do this, you would have to write the overall probability in terms of a probability statement:

\[P(Type \; I \; Error) = P(Reject \; H_o \; | \; H_o \; is \; true)\]

This will make more sense once you actually learn the mechanics behind the testing (for practice, look on the Fall 2014 second midterm and final).


2. Type II Error


This is the exact opposite of a Type I Error (in fact, as we will see in a second, they are inversely related). It occurs when you fail to reject a false Null Hypothesis. In the football scenario, this would be the officials concluding that the Patriots’ footballs were fine and regulated at 100 PSI when really the Patriots were indeed cheating and the overall average for their footballs is lower than 100 PSI (perhaps the sample included a lot of the ‘good’ balls). In terms of probability, we would write:

\[P(Type \; II \; Error) = P(Fail \; to \; Reject \; H_o \; | \; H_o \; is \; false)\]



Since these concepts are so important, we’ll go over one more classic example to hammer them down. Take the court system in the United States. A person who goes on trial for a crime can either be guilty or innocent. When someone goes to trial, essentially what are the hypotheses, and what would a Type I and Type II error consist of?


Well, since the classic mantra of the court system is that someone is innocent until proven guilty, innocence is the status quo that we are assuming, while guilt is the alternative claim being made. So, the null is that the person is innocent, and the alternative is that they are guilty. Remember, then, the court can’t prove them innocent, just try to present enough evidence to either prove them guilty or not be able to refute their innocence.

How about the errors? Well, we know a Type I error occurs when we reject a true null. We know that the null is a person being innocent, so a true null means that the person is actually innocent. Rejecting a true null, then, would be calling an innocent person guilty and sending them to jail.

A Type II error occurs when you fail to reject a false null. Again, since the null is innocence, a false null is a guilty person, and failing to reject a false null is failing to send a guilty person to jail.

You will often be asked which of these errors is worse. It usually depends, and you can often make a case for either. However, most people generally think that sending an innocent person to jail is worse.


Finally, we should remember that the two errors are inversely related: for a fixed sample size, the more you try to lower the probability of a Type I Error occurring, the higher the probability that a Type II Error occurs. This is important because it means we can never eliminate errors; however, since the Type I is usually considered more serious, tests are set up to control the probability of this error more tightly (again, once we actually perform the tests, you will see these dynamics better).



7.3 Let’s get to Testing!


We’ve discussed the terminology for hypothesis testing, but we haven’t even scratched the surface of how it’s actually done. We’ll discuss two methods: confidence intervals (for which you have already learned the hard part) and the t-stat/p-value method. First, however, we should note that there are three types of tests: two-tailed, left-tailed and right-tailed.

These are simple distinctions that basically describe the alternative hypothesis. Say you have a null hypothesis that the population mean is some value \(X\) and you are testing the alternative that the mean is not that value \(X\) (remember, we can test means or proportions, but we will just do means for now). This is a two-tailed test, since the mean can be larger or smaller than the value \(X\) (a sample mean far from \(X\) in either direction counts as evidence against the null). It’s written as:

\[H_o: \mu = X, \; H_A: \mu \neq X\]

The other two tests are one-tailed (right or left), testing the claim that the population parameter is greater than or less than some value, but not both. For example, you would have a right-tailed test if your null was that the mean was some value \(X\) and the alternative was that the mean was greater than that value \(X\). It would be written:

\[H_o: \mu = X, \; H_A: \mu > X\]

And finally, as you can probably guess, the left-tailed test is testing whether the mean is less than the value \(X\) and is written:

\[H_o: \mu = X, \; H_A: \mu < X\]



These, then, are the three types of Hypothesis tests. They will be identical in all but one simple category, as you will see. Now, onto the first testing method, confidence intervals.


1. Confidence Intervals


If you need a quick refresher on confidence intervals, harken back to Unit VI. Otherwise, full steam ahead with their application to testing.

Nothing too crazy here. Basically, if you are testing that the population mean is 10, then you would build a confidence interval based on the sample and see if 10 is in the interval: if the hypothesized value falls inside the interval, you fail to reject the null, and if it falls outside, you reject. Again, since we will be working at 95% confidence in this class, you want to construct a 95% confidence interval just to be consistent (since 95% confidence in the interval translates logically to 95% certainty in the test).

That’s it! You’ve already learned the hard part, constructing confidence intervals, and testing is just an easy application.
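To make this concrete, here’s a minimal Python sketch of the confidence interval approach (my own illustration, not from the course materials). It assumes a large sample so the 1.96 multiplier applies, and the sample numbers and hypothesized mean of 10 plugged in at the bottom are hypothetical:

```python
import math

def ci_test(sample_mean, sample_sd, n, mu_0, z_star=1.96):
    """Two-sided test at 95% confidence via a confidence interval.

    Build the 95% interval around the sample mean and check whether
    the hypothesized mean mu_0 falls inside it.
    """
    margin = z_star * sample_sd / math.sqrt(n)      # margin of error
    lower, upper = sample_mean - margin, sample_mean + margin
    reject = not (lower <= mu_0 <= upper)           # reject if mu_0 is outside
    return (lower, upper), reject

# Hypothetical numbers: testing H_o: mu = 10 with a sample of 50 observations
interval, reject = ci_test(sample_mean=11.2, sample_sd=3.0, n=50, mu_0=10)
print(interval, "reject the null" if reject else "fail to reject the null")
# Here the interval is roughly (10.4, 12.0), which excludes 10, so we reject.
```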


2. t-stat/p-value Method


This method is a little more involved and thus will be tested more often. Again, it relies heavily on the central limit theorem, since we use the fact that sample means and proportions are distributed normally.

Basically, the idea here is that we have a claim about the population (i.e., the mean is some value \(X\)) and by using that claim we can work out the distribution of the sample mean. For example, let’s harken back to the first example about someone claiming the average person is 8 feet tall. Here, if we assume that the true average height is indeed 8 feet, then we know a large enough sample creates a sample mean that is distributed normally and centered at 8 feet. Then, since we know the distribution of the sample mean (and we can work with the Normal distribution to find probabilities), we can actually measure how likely the sample result is given what the null hypothesis says the true population parameter is.


That explanation wasn’t great, so let’s think about a concrete example. Let’s go with the 8-foot example from earlier (someone says that the status quo is that the average adult height is 8 feet; we claim it is not). We know that if the average height really was 8 feet, then the sample mean would be centered at 8 as well. So, the sample mean would be Normal with a mean of 8 and a variance of 1 (we’ll just assume this variance for simplicity). Say you take a sample and get an average height of 5 feet. Qualitatively, we know that if we see a sample average of 5 feet, the claim that the average height is 8 feet is probably not true. However, we can do better than ‘probably not true’ because we know the actual distribution. In this case, since the sample mean is 3 standard deviations below the hypothesized mean, we know it is at about the \(.15^{th}\) percentile (remember the 68-95-99.7 rule, AKA the Empirical Rule), and thus it is very unlikely that the hypothesized distribution is correct, AKA that the mean is 8 feet. In fact, what we are saying here is that if the null were indeed true, there would be about a .15% chance of getting a sample with a mean this small. Since the chance is so low, we can say with a large amount of confidence that the average height is not 8 feet.


Cool! So basically, we take the sample result and find the probability of it occurring using the null hypothesis to form the Normal distribution. How do we find probabilities when working with the Normal again? Well, we standardize, or subtract the mean and divide by the standard deviation, to get to a \(N(0,1)\), which we can then plug into our calculator.

We already know that this is essentially just finding a z-score. Remember, though, we use the t-distribution in the case of inference, since there is extra uncertainty (from estimating \(\sigma\) with the sample standard deviation \(s\)) that the fat tails of the t-distribution capture better. So, instead of calling what we calculate a z-score, we will call it a t-statistic (it is calculated the exact same way, though). For means, we have:

\[t_{stat} = \frac{\bar{X} - \mu_o}{\frac{s}{\sqrt{n}}}\]

Where \(\mu_o\) is the hypothesized mean.


Hey! We know this. Where does it come from? Well, remember, according to the central limit theorem, if the null hypothesis is true we get \(\bar{X} \sim N(\mu_o, \frac{\sigma^2}{n})\) (where \(\mu_o\) is just the mean hypothesized by the null hypothesis; remember that we can estimate \(\sigma\) with \(s\) if the sample is large enough).


So this is the same exact idea as a z-score. The more extreme the sample mean (the farther it is from the hypothesized mean), the larger the \(t_{stat}\) will be in absolute value. We can also use the t-stat to easily find the cumulative probability of a certain sample mean (remember, 0 has a cumulative probability of .5). If the \(t_{stat}\) shows that the sample mean is extreme enough (say the \(t_{stat}\) is 10 or something), then we will likely say there is something wrong with the null hypothesis, because there’s such a small chance of seeing a sample this extreme if the null were indeed true.


This is all good and fine, but it raises the question: how big is too big? That is, we know a larger \(t_{stat}\) means that the sample mean is more extreme and thus the null hypothesis is more unlikely, but how large does it have to be for us to abandon the null entirely? Well, since we work at the 95% level of confidence elsewhere, we will continue using it here. Therefore, if the \(t_{stat}\) puts us in the most extreme 5% of the distribution - for a two-sided test, either the bottom 2.5% or the top 2.5% - we will be 95% confident that the null is false and will reject.


This is called the Decision Rule, since it basically tells us what the threshold is to reject a null hypothesis. Luckily, we’ve already talked about the 95% threshold, and we know that it is 1.96 for a two-sided test and 1.64 for a one-sided test (harken back to Unit VI if you need a refresher here). Here are the basic Decision Rules:


For a two-sided test, we reject if \(|t_{stat}| > 1.96\) (don’t forget the absolute value!). For a one-sided test, we reject if the \(t_{stat}\) is beyond 1.64 in the direction of the alternative: greater than 1.64 for a right-tailed test, or less than -1.64 for a left-tailed test.


Again, this is because if you saw a \(t_{stat}\) greater than either of those values, it would mean that the sample mean is really extreme and rare if the null hypothesis is indeed true (5% or less likely) and thus something is very likely to be wrong with the null.
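Here is a small Python sketch of the t-stat calculation and these decision rules (my own illustration, not from the course; the football numbers fed in at the bottom are hypothetical):

```python
import math

def t_stat_mean(x_bar, s, n, mu_0):
    """t-statistic for a one-sample test of a mean: (x_bar - mu_0) / (s / sqrt(n))."""
    return (x_bar - mu_0) / (s / math.sqrt(n))

def decide(t, tails="two"):
    """Apply the 95%-level decision rules used in these notes."""
    if tails == "two":
        return abs(t) > 1.96      # two-sided: reject if |t| > 1.96
    elif tails == "right":
        return t > 1.64           # right-tailed: reject if t is far above 0
    else:
        return t < -1.64          # left-tailed: reject if t is far below 0

# Hypothetical footballs example: 10 balls averaging 97 PSI with s = 3,
# testing H_o: mu = 100 against H_A: mu < 100 (left-tailed)
t = t_stat_mean(x_bar=97, s=3, n=10, mu_0=100)
print(t, "reject" if decide(t, tails="left") else "fail to reject")
# t is about -3.16, which is well below -1.64, so we reject the null.
```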


I should also mention how to find the \(t_{stat}\) for proportions, although it’s the same process: subtract the hypothesized population parameter and divide by the standard deviation.

\[t_{stat} = \frac{\hat{p} - p}{\sqrt{\frac{pq}{n}}}\]

Where \(p\) is the hypothesized proportion, \(q = 1 - p\), \(\hat{p}\) is the sample proportion, and \(n\) is the sample size. The decision rules are the same.
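As a quick sketch (again, my own illustration with hypothetical numbers), the proportion version looks like this in Python:

```python
import math

def t_stat_proportion(p_hat, p_0, n):
    """t-statistic for a one-sample proportion test.

    Standardize the sample proportion using the hypothesized proportion p_0
    and its standard error sqrt(p_0 * (1 - p_0) / n).
    """
    se = math.sqrt(p_0 * (1 - p_0) / n)
    return (p_hat - p_0) / se

# Hypothetical example: 60 heads in 100 flips, testing H_o: p = 0.5 (two-sided)
t = t_stat_proportion(p_hat=0.60, p_0=0.50, n=100)
print(t, "reject" if abs(t) > 1.96 else "fail to reject")
# t = 2.0, which is above 1.96, so we reject the null of a fair coin.
```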



Then we have the p-value, which is really just short for probability value. Some people mark this as a different method, but really it’s the exact same thing as a \(t_{stat}\) but in different units. All you do is plug the \(t_{stat}\) that you get into your calculator or use the table to find the probability associated with it. This is the p-value, and it will be less than .05 if the \(t_{stat}\) is also large enough to reject the null.


We’ve already discussed the p-value implicitly, but again, it just gives the probability of seeing a sample mean as extreme as the one we got if the hypothesized mean for the population were actually true. So, a low p-value means that the sample mean is very extreme and would be rare if the null hypothesis were true; this in turn means that it’s likely that something is wrong with the null hypothesis. Again, if we have less than a 5% chance of seeing a sample mean this extreme given some hypothesized value, then we conclude something is wrong with the null (reject it).
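If you want to compute p-values without a table, here’s a minimal sketch using scipy’s standard normal, which is the large-sample approximation these notes rely on (the t-stats plugged in below are arbitrary):

```python
from scipy.stats import norm

def p_value(t, tails="two"):
    """Approximate p-value from a t-statistic using the standard normal
    (reasonable for large samples, as these notes assume)."""
    if tails == "two":
        return 2 * norm.sf(abs(t))   # area in both tails beyond |t|
    elif tails == "right":
        return norm.sf(t)            # area above t
    else:
        return norm.cdf(t)           # area below t

print(p_value(2.5, tails="two"))     # about 0.012 -- below .05, so reject
print(p_value(1.0, tails="right"))   # about 0.159 -- above .05, fail to reject
```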



This may seem dicey, so let’s do an example.


People generally assume that the average Harvard student IQ is 130, but one scientist claims that the average IQ is greater than 130. He can’t find out the IQ of every Harvard student, so he just takes a sample of 100 students from his Statistics class to find an average IQ of 145 and a standard deviation of 20 IQ points. Test this claim using his sample and Hypothesis Testing methods.


The first thing to do is write out the appropriate hypotheses. We are given the status quo, which we know is the null hypothesis, so we can write \(H_o: \mu = 130\), which basically means that the status quo is that the average Harvard IQ is 130. We know that the claim is the alternative, so we write \(H_A: \mu > 130\).

On to the actual testing. We know right away that the sample IQ is higher than the status quo, which supports the scientist’s claim that Harvard students have a higher IQ than 130. However, is the evidence strong enough for 95% confidence? Did we take a large enough sample? Is the variance too large? We check by calculating the \(t_{stat}\):

\[t_{stat} = \frac{\bar{x} - \mu_o}{\frac{s}{\sqrt{n}}} = \frac{145 - 130}{\frac{20}{\sqrt{100}}} = 7.5\]

So we have a \(t_{stat}\) of 7.5. Since we are doing a one-tailed test (we are interested in whether the parameter is greater than the hypothesized value, not just whether it is unequal to it), the threshold level is 1.64. Since the \(t_{stat}\) is well above it at 7.5, it looks like we have strong evidence to support a rejection.

Just to be sure, we’ll calculate the p-value. By plugging the \(t_{stat}\) into a calculator (I’m just using the normal CDF, since when \(n>30\) the t distribution is practically Normal), we get a p-value on the order of \(10^{-14}\). This essentially means that there is a minuscule chance of getting a sample where students on average had IQs of 145 if it were true that students actually had an average IQ of 130. Since it’s less than .05, we can reject.
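As a sanity check (my own sketch, not part of the original solution), here’s the same calculation in Python, using both the normal approximation and the t distribution with \(n - 1 = 99\) degrees of freedom; either way the p-value is far below .05:

```python
import math
from scipy.stats import norm, t

# Harvard IQ example from the text: n = 100, x_bar = 145, s = 20, H_o: mu = 130
t_stat = (145 - 130) / (20 / math.sqrt(100))   # = 7.5

p_normal = norm.sf(t_stat)        # right-tailed p-value, normal approximation
p_t = t.sf(t_stat, df=100 - 1)    # same idea with the t distribution (99 df)

print(t_stat, p_normal, p_t)      # 7.5, and both p-values are tiny (far below .05)
```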


Here’s an appropriate conclusion for this problem:


Since the t statistic is 7.5, which is larger than the threshold of 1.64, we have sufficient evidence to reject the null hypothesis. There is sufficient evidence at the 5% level of statistical significance that the true average IQ of Harvard students is greater than 130.


Commit this form to memory, as it is a great shell for answering these types of questions. You of course have to change parts of it based on the results (maybe 1.64 goes to 1.96, or we get a smaller t statistic, or insufficient evidence, you get it) but the format is there. Always make sure to interpret the result in the context of the problem!


7.4 Two Sample Testing


Now that you understand the basics of Hypothesis Testing, we will provide the different equations for t-statistics and the accompanying decision rules for two sample testing. Everything else is the exact same: the formation of the hypothesis, the conclusions, etc. You will, of course, have to calculate the t statistics differently, so here are those equations.


If you are testing two means (i.e., that they are not equal, that one is larger, etc.) we get a t statistic using: \[t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\]

Where \(\bar{X}_i\), \(s_i^2\), and \(n_i\) are the sample mean, sample variance, and sample size of the \(i^{th}\) sample.


As you can see, a little more involved. Same jazz, though. Use this formula, get a \(t_{stat}\), and if it’s greater than the threshold (again, 1.96 for two sided and 1.64 for one sided) reject whatever hypothesis you had established.
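Here’s a short Python sketch of the two-sample mean version (my own illustration; the exam-score numbers are hypothetical):

```python
import math

def t_stat_two_means(x1, s1, n1, x2, s2, n2):
    """t-statistic for comparing two sample means (unpooled standard error)."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    return (x1 - x2) / se

# Hypothetical example: do two classes have different average exam scores?
t = t_stat_two_means(x1=82, s1=8, n1=40, x2=78, s2=10, n2=35)
print(t, "reject" if abs(t) > 1.96 else "fail to reject")
# t is about 1.9 here, just under 1.96, so we fail to reject at 95% confidence.
```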


Likewise, for a two sample proportion, we get a t statistic using:

\[t = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1} + \frac{1}{n_2})}}\] Where \[\hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}\] And, again, \(\hat{p}_i\) gives the sample proportion for the \(i^{th}\) sample.


Like we said, not so fun, but not so hard if you understand Hypothesis Testing (again, 1.96 and 1.64 are the thresholds).
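And a matching sketch for the two-sample proportion test (again, my own illustration with hypothetical survey numbers):

```python
import math

def t_stat_two_proportions(p1, n1, p2, n2):
    """t-statistic for comparing two sample proportions, using the pooled p-hat."""
    p_pool = (n1 * p1 + n2 * p2) / (n1 + n2)    # pooled sample proportion
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical example: 60% of 200 respondents vs 45% of 180 respondents said yes
t = t_stat_two_proportions(p1=0.60, n1=200, p2=0.45, n2=180)
print(t, "reject" if abs(t) > 1.96 else "fail to reject")
# t is about 2.9, above 1.96, so we reject the null that the proportions are equal.
```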




7.5 Chi-Squared Goodness of Fit


The goodness of fit test works best for a multinomial random variable. This is kind of like a binomial random variable but with, you guessed it, multiple outcomes. Remember, the classic example of a binomial setup is a coin flip, which has two outcomes (with equal probabilities if the coin is fair). A good example of a multinomial, then, would be rolling a die with \(k\) sides, where the \(i^{th}\) side has probability \(p_i\) of showing up.

The null hypothesis here is always \(H_o: p_1 = a_1, p_2 = a_2, ..., p_k = a_k\), where the \(a_i\) are the hypothesized probabilities, and the alternative hypothesis is that at least one of those equalities does not hold. This looks confusing, but imagine we are testing to see if a six-sided die is fair, i.e., every side has an equal probability of occurring. Here, the null would just be that each probability equals \(\frac{1}{6}\).

How do we test this, then? Similar idea to calculating a t statistic, but this time we calculate a Chi-squared value, or \(\chi^2\) value. That is given by the formula:

\[\chi^2 = \sum\limits_{all \; i} \frac{(o_i - e_i)^2}{e_i}\]

Where \(o\) gives the observed values and \(e\) the expected ones.

So, you essentially take the squared difference between the observed and expected counts for each outcome, divide each by the expected count, and add them all up to get chi-squared. Intuitively, a large chi-squared value will result in a rejection of the null hypothesis, since it means that the observed values are very far from the expected values, which is exactly the sort of evidence that works against the null hypothesis.


Before, we had the decision rule that if the \(t_{stat}\) was greater than 1.96 or 1.64 (depending on whether the test was two- or one-tailed), we would reject. Here, the technical decision rule is to reject when \(\chi^2 > \chi_{\alpha, \; k-1}^2\). What that means in English is to reject if the value you found is higher than the threshold value at the specified level of significance \(\alpha\) for a chi-squared distribution with \(k - 1\) degrees of freedom. Remember, we always test at the 95% confidence level, so \(\alpha = .05\) here.



Seems pretty sketch, so I’ll do an example. Let’s say we are doing a Chi-squared goodness of fit test on a fair die. You roll the die 10 times and get 1 three times, 2 zero times, 3 once, 4 twice, 5 twice and 6 twice.


Again, since the null hypothesis is always that each proportion equals a certain value, and we are testing for a fair die, we expect a proportion of \(\frac{1}{6}\) for each value, or for it to show up 1.67 times. So, we find our Chi-squared value:

\[\chi ^2 = \frac{(3 - 1.67)^2}{1.67} + \frac{(0 - 1.67)^2}{1.67} + \frac{(1 - 1.67)^2}{1.67} + \frac{(2 - 1.67)^2}{1.67} + \frac{(2 - 1.67)^2}{1.67} +\frac{(2 - 1.67)^2}{1.67} \approx 3.19\]

So, we got a value of about 3.19. If we check my handy dandy chi-squared table, we know we are testing at the .05 level of significance and that \(k - 1 = 5\), so it looks like the threshold is 11.07. Since the value that we got is only about 3.19, which is below the threshold, we fail to reject the null hypothesis. There is insufficient evidence at the 5% level of significance that this is not a fair die.
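As a quick check on this example (my own sketch, not part of the original notes), scipy can compute both the chi-squared statistic and the threshold; it agrees with the hand calculation and the fail-to-reject conclusion:

```python
from scipy.stats import chi2, chisquare

observed = [3, 0, 1, 2, 2, 2]          # counts for faces 1 through 6 over 10 rolls
expected = [10 / 6] * 6                # a fair die: each face expected 10/6 times

stat, p = chisquare(f_obs=observed, f_exp=expected)
threshold = chi2.ppf(0.95, df=6 - 1)   # 95% cutoff with k - 1 = 5 degrees of freedom

print(stat, threshold, p)              # statistic ~3.2 vs threshold ~11.07, p ~0.67
print("reject" if stat > threshold else "fail to reject")
```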