2.14 Hypothesis Testing: Tips and Troubles

2.14.1 Introduction

One of the nice things about the inference framework we’ve been working with is that it’s flexible. You can use it with a lot of different kinds of questions and data, and come out with the same kinds of results: hypothesis test decisions, p-values, and confidence intervals.

Since we run into these so often, it’s worth taking the time to go over some of the common questions and, indeed, pitfalls associated with these tools.

2.14.2 Choosing alpha

Let’s start with a common question: what is the appropriate \(\alpha\) value to use? Remember that \(\alpha\) affects both your hypothesis test and your confidence interval.

In a hypothesis test, alpha represents a level of risk you’re willing to take. It’s your chance of accidentally – falsely – rejecting the null hypothesis if that null hypothesis is actually true.

In a confidence interval, \(1-\alpha\) represents the level of, well, confidence you have in your interval – how confident you are that your interval actually does contain the truth. Mathematically speaking, it shows up in the critical value. For example, to get a confidence interval for a mean, you use: \[\overline{y} \pm t^*_{n-1,\alpha/2} \cdot \frac{s}{\sqrt{n}}\]
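To make that formula concrete, here’s a minimal sketch in Python, using scipy.stats for the critical value; the seven “purchase” values are entirely made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of n = 7 purchase amounts, in dollars
y = np.array([48.0, 52.5, 61.0, 45.5, 50.0, 58.5, 49.5])

n = len(y)
ybar = y.mean()
s = y.std(ddof=1)                                # sample standard deviation

alpha = 0.05
t_star = stats.t.ppf(1 - alpha / 2, df=n - 1)    # t critical value with n-1 df

radius = t_star * s / np.sqrt(n)                 # the "plus or minus" part
print(f"{100 * (1 - alpha):.0f}% CI: ({ybar - radius:.2f}, {ybar + radius:.2f})")
```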

For instance, with \(n = 7\) (and hence 6 degrees of freedom), \(t^*_{6,0.05/2} = 2.45\), which is bigger than 1.96, the corresponding critical value for a Normal! The effect of that is to compensate for the fact that \(s\) tends to be too small as an estimate of \(\sigma\). Since your estimate of the standard deviation tends to be too small, you have to go out more “standard deviations” in order to catch 95% of the values.

For any given \(n\), that critical value \(t^*\) gets bigger as alpha gets smaller. For example, \(t^*_{6,0.05/2} = 2.45\), while \(t^*_{6,0.01/2} = 3.71\). That makes the confidence interval wider – with a larger “radius” – when \(\alpha = 0.01\) than when \(\alpha = 0.05\). Think about this in context: if you want to be 99% confident that your interval contains the truth, you’re going to have to “cast a wider net” than if you only need to be 95% confident.
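If you want to check those numbers yourself, here’s a quick sketch (again in Python with scipy.stats; the df of 6 and the two \(\alpha\) values are just the ones from the example above):

```python
from scipy import stats

df = 6  # n = 7 observations, so n - 1 = 6 degrees of freedom

for alpha in (0.05, 0.01):
    t_star = stats.t.ppf(1 - alpha / 2, df=df)
    print(f"alpha = {alpha}: t* = {t_star:.2f}")

# For comparison, the Normal critical value at alpha = 0.05:
print(f"z* = {stats.norm.ppf(0.975):.2f}")
```

which reproduces the 2.45, 3.71, and 1.96 values above.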

So what is the right \(\alpha\)? I dunno. It depends on how confident you want to be. We’ll think about this more when we get to error types, but ultimately it all comes back to context. If you’re willing to take a higher risk of a “false positive” – rejecting \(H_0\) when you shouldn’t have – then you can raise \(\alpha\). Your confidence intervals will be narrower, and you’ll be more likely to reject the null. But, again, you’ll have a higher risk of being wrong. If you’re in a situation where there are consequences to being wrong, and you can’t assume that much risk, you need to use a lower \(\alpha\).

I will say one thing here: the \(\alpha\) value that people are really used to is 0.05. This comes from basically a historical accident – I’m happy to tell you this story sometime – and these days, people use it because…that’s what people use. That doesn’t mean it is actually a good idea! A 5% chance of being wrong is actually pretty high in a lot of contexts. If you did just one test with \(\alpha = 0.05\) every day, always on a true null hypothesis, you’d falsely reject about once every three weeks. That’s more frequent than me paying my utility bills. If there are serious stakes of any kind, that’s a lot! So please, do not feel some kind of loyalty to 0.05 just because you’re used to it. Consider what kind of risk it’s appropriate to take given your situation.

2.14.3 One-sided tests

While we’re talking about rejecting null hypotheses: there’s a little twist you may have encountered before called a one-sided test. “One-sided” here doesn’t mean “unfair”; it means “one-directional” – there’s only one direction that interests you.

For example, suppose you run an online store selling, I don’t know, yarn. Usually, visitors to your site end up spending an average of $50. (It’s really good yarn.) You do a study for a few weeks, where you pop up a free-shipping coupon when someone arrives at your site.

Think about what’s of interest here. If your average sale remains the same with the coupon – $50, just like before – well, that’s not interesting. You’re not going to keep giving out coupons. And if your sales go down, well, that’s sort of psychologically interesting, but from a store owner’s perspective, it’s not useful. You’re still not going to keep giving out coupons. The only interesting scenario is if the coupons help: if the average purchase goes up from what it was before.

So your null hypothesis here would be the boring, no-change option: \(H_0: \mu = 50\), where \(\mu\) is the true average purchase when the coupon pops up.

But your alternative hypothesis wouldn’t necessarily be \(H_A: \mu \ne 50\). Because you’re only excited if the coupons help, your alternative could be \(H_A: \mu > 50\) instead.

If you do a one-sided test, you can only possibly reject \(H_0\) if you observe a value in the “right direction” – if your sample mean of purchase prices is higher than $50. But, if your value is in the right direction, the associated p-value will be half as large as the two-sided one, so it is in some sense “easier” to reject the null. (To see why this is true, draw a picture of where p-values come from – notice you now only care about one tail!)
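Here’s a small sketch of that halving in Python, using scipy.stats.ttest_1samp (the alternative= argument needs scipy 1.6 or later); the purchase data are simulated, purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical purchases during the coupon study
purchases = rng.normal(loc=54, scale=12, size=40)

two_sided = stats.ttest_1samp(purchases, popmean=50)
one_sided = stats.ttest_1samp(purchases, popmean=50, alternative="greater")

print(f"two-sided p = {two_sided.pvalue:.4f}")
print(f"one-sided p = {one_sided.pvalue:.4f}")  # half the two-sided p, since ybar > 50
```

If the sample mean had landed below $50 instead, the one-sided p-value would come out large (bigger than 0.5), and you couldn’t reject at all – which is exactly the “right direction” point above.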

This is an important cautionary point. If you want to do a one-sided test, you need to decide that’s what you’re doing in advance. You don’t get to observe the data and then decide to do a one-sided test so you can reject; that’s sketchy. One-sided tests should always be clearly motivated by the context of the problem.

2.14.4 Things not to say about confidence intervals

The important thing to remember when you’re talking about these confidence intervals, or indeed p-values and hypothesis tests, is what is uncertain. The tools we’re working with here come out of frequentist inference, and frequentist inference has very clear rules about this.

In particular, here is the thing to remember: the truth is not uncertain. Say what you will about the nature of reality, but in terms of frequentist inference, the truth is fixed. The population parameter – the true mean \(\mu\), or the true proportion, or the true regression coefficients, or whatever – it exists. It is whatever it is. It doesn’t move, it doesn’t change, it doesn’t exist in a superposition where it might be one thing but it might be another thing instead. It just is.

What’s uncertain is, well, us. We don’t know what the True parameter is. Instead, we have one random sample to use to come up with a guess. Maybe our sample is good, and will lead us to a good guess. Maybe we got unlucky and our sample is not so good. That’s where the uncertainty comes in.

When you give a confidence interval, you’re making a statement about the population parameter, and also expressing your level of confidence that you’re right.

There are two important pieces here:

First, your statement is about the population parameter, not about individuals. If I get a 99% confidence interval of, say, (45, 60) for the average sale at your online yarn store, that does not tell me that 99% of your customers spend between $45 and $60. I’m not 99% confident that your next customer will spend between $45 and $60. It doesn’t tell me about future samples, either – I don’t know what I would see if I took another sample; that new sample’s \(\overline{y}\) might not even be in this (45, 60) interval. I am, however, 99% confident that the True average spending per customer is between $45 and $60.
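If it helps to see that first point in action, here’s a simulated sketch (all numbers invented): even a quite narrow 99% interval for the mean leaves most individual customers outside it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical individual purchases: mean around $52, lots of customer-to-customer spread
purchases = rng.normal(loc=52, scale=20, size=200)

n = len(purchases)
ybar, s = purchases.mean(), purchases.std(ddof=1)
t_star = stats.t.ppf(0.995, df=n - 1)            # critical value for a 99% CI
lo, hi = ybar - t_star * s / np.sqrt(n), ybar + t_star * s / np.sqrt(n)

inside = np.mean((purchases >= lo) & (purchases <= hi))
print(f"99% CI for the mean: ({lo:.1f}, {hi:.1f})")
print(f"Fraction of individual purchases inside it: {inside:.0%}")  # nowhere near 99%
```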

Second, your confidence is in your sample and the interval that you derived from that sample. So is your uncertainty. I cannot say “there’s a 99% chance that the True Mean \(\mu\) is in this interval”; I have to say “there’s a 99% chance that this interval covers \(\mu\).” It’s a very subtle difference in wording, but the key is to emphasize that \(\mu\) is not a probabilistic quantity. \(\mu\) is whatever it is – it’s just a question of whether my interval found it or not.
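And that second point has a long-run interpretation: if you took many samples and built an interval from each one, about 99% of those intervals would cover the fixed \(\mu\). A simulation sketch, with a made-up “true” mean of $50:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
mu, sigma, n = 50, 15, 30                 # the fixed truth (unknown to us in real life)
n_reps = 10_000
t_star = stats.t.ppf(0.995, df=n - 1)     # critical value for 99% intervals

covered = 0
for _ in range(n_reps):
    sample = rng.normal(mu, sigma, size=n)
    ybar, s = sample.mean(), sample.std(ddof=1)
    radius = t_star * s / np.sqrt(n)
    covered += (ybar - radius <= mu <= ybar + radius)

print(f"Fraction of intervals covering mu: {covered / n_reps:.3f}")  # close to 0.99
```

The randomness lives in the intervals, not in \(\mu\) – each repetition produces a different interval, but \(\mu\) never moves.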

2.14.5 Things not to say about p-values

Is this the most common kind of statistical mistake? I mean, maybe? People make a lot of mistakes about p-values, and I don’t know of a study that has analyzed which ones are really the most common. Seems like it would be difficult to get a representative sample. Maybe if you restricted it to published papers in journals?

But I digress.

This brings us to the most common mistake that people make when interpreting p-values. They go and get themselves a p-value of 0.01 or whatever, and they say, “Ah! There’s a 1% chance that the null hypothesis is true.”

And across the country, statisticians wake up screaming without knowing why.

Think about the conditioning involved in creating a p-value. We start by assuming the null hypothesis is true – then we ask how unusual our test statistic would be. So we can say: if \(H_0\) is true, there’s a 1% chance we’d observe data as or more “extreme” than what we saw in our sample. But we can’t say: “given our sample data, here’s the chance \(H_0\) is true.” That’s just not the computation that we did. We actually don’t have any way to estimate the probability that \(H_0\) is true – not with these tools anyway.
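One way to internalize that conditioning is to simulate it. In the sketch below (Python, with invented numbers), we compute a p-value for one observed sample, then generate thousands of samples from a world where \(H_0\) really is true and check what fraction of them produce a test statistic at least as extreme – the two numbers should roughly agree.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
mu0, sigma, n = 50, 12, 25

# One observed (hypothetical) sample, drawn with a true mean of 53, and its p-value
observed = rng.normal(53, sigma, size=n)
res = stats.ttest_1samp(observed, popmean=mu0)

# Many samples from a world where H0 is true: the mean really is 50
n_reps = 20_000
t_null = np.array([
    stats.ttest_1samp(rng.normal(mu0, sigma, size=n), popmean=mu0).statistic
    for _ in range(n_reps)
])
frac_as_extreme = np.mean(np.abs(t_null) >= abs(res.statistic))

print(f"p-value from the test:                   {res.pvalue:.3f}")
print(f"null-world fraction at least as extreme: {frac_as_extreme:.3f}")
```

Notice that everything in the second half of the calculation happens under the assumption that \(H_0\) is true – nowhere do we compute a probability that \(H_0\) itself is true.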

The other one that you hear a lot is “There’s a 1% chance that this result is due to chance.” Which you can immediately recognize as nonsense. All results are due to chance, because they’re based on random samples!

There’s also an interesting version where people say that the null hypothesis is, like, more false if the p-value is really low. Like, if you got a p-value of 0.0001 for your test of whether your average yarn sale was $50, someone might think this means that your average sale is very different from $50. But this also doesn’t make sense. A hypothesis test is just about whether or not we think the null hypothesis is true. True or not true, those are your options. How far off it is from the truth is a different question.
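To see why a tiny p-value doesn’t mean the truth is far from $50, here’s one more sketch with made-up numbers: a true average of $50.25 – only a quarter off – combined with an enormous sample still produces a minuscule p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# True mean is barely different from 50 -- but the sample is huge
purchases = rng.normal(loc=50.25, scale=5, size=200_000)

res = stats.ttest_1samp(purchases, popmean=50)
print(f"sample mean = {purchases.mean():.2f}")
print(f"p-value     = {res.pvalue:.2e}")  # tiny, even though the effect is about 25 cents
```

The p-value reflects how detectable a departure from \(H_0\) is given your sample size, not how big that departure is.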

There’s a great video from FiveThirtyEight in which they go ask a bunch of actual professional statisticians at a conference to explain p-values. It…does not go well.

So, again, what you can say about a p-value of, say, 0.01 is this: if the null hypothesis is true, then I would have a 1% chance of getting a test statistic at least as extreme as this one here. If this p-value definition seems sort of awkward and unsatisfying, well, yeah. It is. That’s probably why people get it wrong so often. But understanding what p-values actually do mean, and where they come from, can help you treat them with respect and recognize when they might not be telling you what you really want to know.

Response moment: Suppose you got a p-value from a test of 0.98. Your client is excited because this feels like an interesting result – it’s unusual to see a p-value so close to 1. What would you tell them?