Probability distributions and hypothesis testing

When we collect data, that data ends up having a certain “shape” or distribution. “Descriptive statistics” are used to describe characteristics of these data: their peak, dispersion, skew, etc. “Inferential statistics” are used when you compare the data you actually have against the data you hypothetically would have, given certain assumptions. This whole chapter is going to slowly unpack that last sentence, but it’s worth reiterating: “inferential statistics” are used to compare real data against a hypothetical distribution of data.

I’ll give an example before going deeper into things. Let’s say you think someone has been swindling people with a loaded coin. In other words, you think the coin they are flipping does not have an even 50-50 chance of landing on heads or tails. To test this hypothesis, you collect some data. You watch them flip their coin many, many times and record the outcomes. Of course, even a fair coin isn’t going to land on heads (or tails) exactly 50% of the time. The number of heads (or tails) could be slightly higher (or lower) by random chance.

Let’s say your actual data (the real number of heads you observed) come out to 16 heads out of 20 flips. Hypothetically, if it’s a fair coin, the coin should’ve landed on heads about 10 out of 20 flips, give or take. I’m not going to go over the full calculations, but the probability of flipping exactly 16 heads out of 20 with a fair coin is 0.46%. The probability of flipping 16 or more heads out of 20 is about 0.59%.
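If you want to verify numbers like these yourself, here’s a minimal sketch in Python using scipy.stats (SciPy isn’t required for this chapter; I’m just using it to check the math):

```python
from scipy.stats import binom

n, p = 20, 0.5  # 20 flips, assuming a fair coin

# Probability of exactly 16 heads out of 20 flips
print(binom.pmf(16, n, p))  # ~0.0046, i.e., about 0.46%

# Probability of 16 or more heads: sf(k) gives P(X > k),
# so sf(15) = P(X >= 16)
print(binom.sf(15, n, p))   # ~0.0059, i.e., about 0.59%
```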

In conclusion, when you assume the coin is fair, the data you observed are very unlikely. However, if you assume the coin is unfair, and has a higher probability of landing heads versus tails, the data become more likely. Inferential statistics are all about comparing your actual data (e.g., coin flips) against hypothetical data (e.g., what the outcomes of a fair coin should look like).

Probability distributions

A probability distribution is a mathematical function that gives the probability of any and every outcome that is possible. I’ll start with two very simple probability distributions. First, a fair coin: The probability of flipping a coin and getting heads is 50%. The probability of flipping tails is 50%. In this probability distribution, every possible outcome (all two of them: heads and tails) is given a probability. The probabilities of all the outcomes add up to 100%.

Let’s take a slightly more complex example: rolling a 6-sided die. Each outcome, 1 through 6, has a probability of 1/6 (or about 16.67%) of being on top. Again, every possible outcome has a probability assigned to it. Plus, if you add all the probabilities together for every possible outcome, they sum to 100%.

The probability distribution for different outcomes when rolling a 6-sided die

One of the most important aspects of inferential statistics is the idea of an “interval of outcomes.” An “interval of outcomes” is just a subset of possible outcomes. For instance, what’s the probability of rolling a 3 or lower with a regular 6-sided die? You could do the math:

\(\frac{1}{6}+\frac{1}{6}+\frac{1}{6}=\frac{3}{6}=\frac{1}{2}=50\%\)

But you probably knew that one off the top of your head. What about the probability of rolling a 5 or lower? A 4 or lower? These are all intervals of values.

Value Probability of rolling that specific value Probability of rolling that value or lower
1 \(\frac{1}{6}\) or 16.67% \(\frac{1}{6}\) or 16.67%
2 \(\frac{1}{6}\) or 16.67% \(\frac{2}{6}\) or 33.33%
3 \(\frac{1}{6}\) or 16.67% \(\frac{3}{6}\) or 50.00%
4 \(\frac{1}{6}\) or 16.67% \(\frac{4}{6}\) or 66.67%
5 \(\frac{1}{6}\) or 16.67% \(\frac{5}{6}\) or 83.33%
6 \(\frac{1}{6}\) or 16.67% \(\frac{6}{6}\) or 100.00%

Another name for “the probability of rolling a certain value or lower” is the “cumulative probability” of that value. Each time the end of the interval goes up, the total probability of being within that interval accumulates.
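As a quick sketch, you can reproduce that cumulative column with plain arithmetic in Python (no special libraries needed):

```python
# Each face of a fair 6-sided die has probability 1/6.
# The cumulative probability of a value is the running sum
# of the probabilities of that value and everything below it.
cumulative = 0.0
for value in range(1, 7):
    cumulative += 1 / 6
    print(f"P(roll <= {value}) = {cumulative:.4f}")
# P(roll <= 3) = 0.5000, P(roll <= 6) = 1.0000, and so on.
```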

Randomly sampled data often “approach” or “approximate” the shape of a probability distribution

Hopefully you’re starting to see why it’s useful to compare probability distributions and distributions of actual data.

In the GIF below, I have randomly generated coin flips. If we assume it’s a fair coin, then, over time, all the random flips should reach the 50-50 point, an even split between “heads” and “tails”.

Animation of random coin flips approaching the probability distribution of a 50-50 chance.

This animation shows an ongoing, updating distribution of data that is getting closer to a theoretical probability distribution represented by the horizontal green line. Both bars in that chart should get closer and closer to being even with that line as the sample size (the number of flips) gets bigger and bigger.

Here’s a similar animation for 6-sided die rolls.

Animation of random die rolls approaching the probability distribution where each outcome has a 1/6 chance of occurring.

Since there are more possible outcomes (1 through 6), it takes longer for this data distribution to approach the theoretical probability distribution where every outcome has a 1 in 6 chance of occurring.

One of the most important concepts in statistics is the idea of comparing your data distribution to a theoretical probability distribution. Are your data more likely to have occurred if the coin was fair? Or, are your data more likely to occur if the coin is biased? In the coin flip GIF above, I told my program to generate random data where there really is a 50% chance of either landing heads or tails. What if I tell it to come up heads 75% of the time and to come up tails 25% of the time?

Animation of random coin flips for a biased coin

In the simulated data above, the green line represents the long-term expected outcomes from a fair coin. The red lines represent the long-term expected outcomes from a coin that is biased to come up heads 75% of the time and otherwise come up tails. Here, you can see that the simulated data don’t converge exactly where the theoretical probability distributions said they would. This is because of sampling error. We only flipped the coin 30 times. The data are going to approach a theoretical probability distribution, but not get there right away or with a small sample.
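If you’d like to recreate something like these simulations, here’s a minimal sketch using NumPy (the seed, sample size, and probabilities are just illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seeded for reproducibility

for p_heads in (0.50, 0.75):           # a fair coin, then a biased coin
    flips = rng.random(30) < p_heads   # True counts as "heads"
    running = flips.cumsum() / np.arange(1, 31)  # running proportion of heads
    print(f"p = {p_heads}: proportion of heads after 30 flips = {running[-1]:.2f}")

# With only 30 flips, the observed proportion lands near p,
# but sampling error keeps it from hitting p exactly.
```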

Area \(\approx\) probability

“The probability of rolling a 3 or lower” is equivalent to saying, “The cumulative probability of rolling a 3.” It’s also like you cut the distribution in half.

The cumulative probability of rolling a 3.

And if you ask for the cumulative probability of rolling a 4, it’s as if you cut the rectangle into one piece that makes up \(\frac{4}{6}\) of the rectangle and another piece that makes up \(\frac{2}{6}\) of the rectangle.

The cumulative probability of rolling a 4.

Did you remember that the area of a rectangle is its height times its width? In the picture above, the height is \(\frac{1}{6}\) and the width is 6. \(\frac{1}{6}*6=1\). That “1” represents 100% of the area of the rectangle. There’s a 100% chance of observing an outcome that falls somewhere within the area of that rectangle.

When you ask the probability of rolling a 3 or lower, it’s like asking the probability of observing an outcome on one half of the total rectangle but not on the other half. You can still think of that sub-area (1 through 3) as a probability. It has a height of \(\frac{1}{6}\) and a width of 3. \(\frac{1}{6}*3=\frac{3}{6}=\frac{1}{2}=50\%\). 1 through 3 makes up half the possible outcomes for rolling a 6-sided die. The interval “1 through 3” also takes up half (50%) of the area of the overall “rectangle”/probability distribution.

When you ask the probability of rolling a 4 or lower, you end up with a sub-section of the “rectangle”/probability distribution that takes up \(\frac{4}{6}\) (or about 66.67%) of the total area. You create a shape that is \(\frac{1}{6}\) tall and 4 wide. \(\frac{1}{6}*4=\frac{4}{6}=66.67\%\).

The area of an oddly shaped room

So, a distribution has a total area. Different subsections (or intervals) within the distribution have areas that make up a certain percent of the total area. This concept is so important that I have more examples for you.

Imagine you and your roommate are in a 5 foot by 5 foot room.

A picture of a 5x5 room

The room is 25 square feet. In other words, the total area of the floor is 25 square feet.

Now suppose your roommate is furious at you. In typical sitcom fashion, they draw a line across the floor separating their part of the room from your part of the room.

A picture of a 5x5 room with a dashed line drawn on the floor

Your roommate says you’re not allowed anywhere above the dotted line. That’s your roommate’s side of the room. Let’s say the rectangle formed by their side of the room is 1 ft tall and 5 ft wide. That makes their part of the room 5 square feet. That’s 1/5 (or 20%) of the total area of the room.

A picture of a 5x5 room split into 20-80% areas

To understand the connection between “area” and “probability”, imagine this: Someone throws a red bouncy ball as hard as they can to ricochet around the room, bouncing erratically off the walls and floor. It could land anywhere in the room.

A picture of a 5x5 room split into 20-80% areas

What is the probability that the ball will land on your roommate’s side of the room? What’s the probability that the ball will land on your side of the room?

There’s a 20% chance it will land on your roommate’s side of the room and an 80% chance it will land on your side of the room.

The percentage of the total area sectioned off by the white dotted line corresponds to the probability of a ball landing on that side of the room at random.

Now let’s try to transfer this reasoning over to a different shape. The area of a square is easy to deal with mathematically, but the area of a curved object like the normal distribution isn’t.

A picture of a normal distribution curve

But the logic is the same. Pretend this is just a room with a weird shape.

A picture of the normal curve, treated as an oddly shaped room

If you draw a white dotted line somewhere in this oddly shaped room, you are still dividing the total area of the room into subsections.

Let’s say you draw a white dotted line where z = 1.

The oddly shaped room divided by a dotted line at z = 1

This would divide the room into two parts: Most of it is not on your roommate’s side of the room. 84.13% of the room is still yours. The part shaded in gray is your roommate’s area. It takes up 15.87% of the total area of the room.

If you throw a ball in this oddly shaped room, there is an 84.13% chance it will land in your side of the room and a 15.87% chance it will land in your roommate’s part of the room.

Let’s say we divide the room where z = 2. Now you have 97.7% of the room to yourself and your poor roommate only gets 2.3% of the room. If you threw a ball in this room, there is still a chance the ball will land on your roommate’s section, but it is very unlikely.

The oddly shaped room divided by a dotted line at z = 2
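The 84.13%, 15.87%, 97.7%, and 2.3% figures come from the standard normal distribution. Here’s a small sketch of how you might look them up with scipy.stats (assuming SciPy is available):

```python
from scipy.stats import norm

# Area under the standard normal curve on either side of a cutoff z
for z in (1, 2):
    below = norm.cdf(z)  # area below z (your side of the room)
    above = norm.sf(z)   # area above z (your roommate's side)
    print(f"z = {z}: {below:.4f} below, {above:.4f} above")

# z = 1: 0.8413 below, 0.1587 above
# z = 2: 0.9772 below, 0.0228 above
```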

Much of this chapter will focus on hypothesis testing in statistics. When you assume a certain statistical hypothesis is true (e.g., “This coin is fair, 50% heads, 50% tails.”), then you will end up calculating the probability of observing the data that you observed. Calculating the probability of observing your data requires a statistical hypothesis. You have to make assumptions. Once you have an assumption in place, you will often end up with a probability that represents the “area under the curve” associated with observing the data that you did (when you assume a certain statistical hypothesis is true). We’ll return to this oddly-shaped room example. For now, though, we’re going to turn towards statistical hypotheses and explain what those are in more depth.

Parameters of probability distributions

Probability distributions often have “parameters”. Parameters are like the “settings” for the distribution. Just like how a TV can be set to different volumes and color contrasts, probability distributions can be set to have different peaks, widths, skews, etc.

One very neat probability distribution is the binomial distribution. I’m going to talk a bit about the binomial distribution here just to illustrate a point. I won’t be testing you on the math behind the binomial distribution. I used to just start talking about the binomial distribution and show the equation that goes with it, and it would terrify students:

\(p_{x}= \binom{n}{x}p^x q^{n-x}\)

(Here, \(n\) is the number of trials, \(x\) is the number of “successes,” \(p\) is the probability of success on each trial, and \(q = 1-p\).) The binomial distribution can give you useful information about the probabilities of events with only two possible outcomes: heads vs. tails, survived vs. didn’t survive, approved for a loan vs. not approved for a loan.

The binomial distribution has two parameters: The number of trials (n) and the probability of “success” on a given trial (p). Let’s say flipping a coin and landing “heads” is a “success”. The binomial distribution can tell you all sorts of things such as:

  • The probability of landing heads at least twice in 10 flips.
  • The probability of landing heads two or more times in 5 flips.

Going back to our coin example from earlier, let’s say you flipped a sketchy coin 20 times and it came up heads on 15 of those 20 flips. It’s still possible that the coin is actually fair. You just got “bad luck” and flipped a higher percentage of heads than you would’ve seen in the long run, if you’d kept flipping. With the binomial distribution, you can plug in “20” as the number of trials (n) and you can plug in 0.5 (p) as the probability of landing “heads”. According to the binomial distribution, there is about a 1.48% chance of flipping exactly 15 heads out of 20 when the coin is fair.

Two binomial distributions

On the distribution at the top, the parameter p has been set to 0.5, a 50% chance of “success”. In this case, a “success” is defined as “coming up heads.” On the distribution at the bottom, p has been set to 0.75, a 75% chance of coming up heads. As you can see, these two probability distributions – with their two different p settings – have different implications for how likely it is you will flip 15 heads out of 20 overall flips. The one at the top says there’s about a 1.48% chance of flipping 15 out of 20 heads when you assume p = .50. The one at the bottom says there’s about a 20.23% chance of flipping 15 out of 20 heads when you assume p = .75. So, flipping 15 out of 20 is still possible under both assumptions, but is much more likely under one versus the other.
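Here’s a sketch of that comparison in Python, again using scipy.stats to check the numbers (the variable names are just for illustration):

```python
from scipy.stats import binom

heads, n = 15, 20

# Same observed data, two different assumed parameter settings
print(binom.pmf(heads, n, 0.50))  # ~0.0148 -> about a 1.48% chance
print(binom.pmf(heads, n, 0.75))  # ~0.2023 -> about a 20.23% chance
```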

But why would we care so much about coin flips? The binomial distribution can work for any outcome that’s binary (“all or nothing”). Let’s say a new cancer drug is going on the market and you need to decide whether it’s safe enough for the public. You’ll have to run some clinical trials and compare how many people die when using the new drug versus the old drug. But some people are still going to die either way.

You’ll need to ask questions like:

  • What is the probability of 5 people dying out of 100 cancer patients in a clinical trial when the old treatment is used?
  • Let’s say usually 5 out of 100 people die during a clinical trial using the old treatment. If we see only 2 out of 100 people die with the new treatment, how likely were we to observe that if the new treatment is just as effective as the old one?
  • How likely are we to see this if the new treatment has a better “cure” rate than the old treatment?

The binomial distribution can answer questions like these.

What’s the probability of 0 people dying in a clinical trial of 100 cancer patients if the probability of dying is assumed to be 5%? Plug the right numbers into the binomial distribution and it’ll tell you there’s a 0.59% chance of that happening. Less than a 1% chance. What’s the probability of 1 person dying in that trial? 3.12%.

Number of deaths Probability of that many deaths when p is set to .05 in the binomial distribution
0 out of 100 0.59%
1 out of 100 3.12%
2 out of 100 8.12%
3 out of 100 13.96%
4 out of 100 17.81%
5 out of 100 18.00%
6 out of 100 15.00%
7 out of 100 10.60%
8 out of 100 6.49%
9 out of 100 3.49%
10 out of 100 1.67%
11 out of 100 0.72%
12 out of 100 0.28%
13 out of 100 0.10%
14 out of 100 0.03%
15 out of 100 0.01%

You can see our friend “sampling error” at play here. If you assume the probability of dying during a clinical trial is 5%, that doesn’t mean exactly 5% of the people in the trial will die. We can see from the table above that exactly 5 out of 100 patients dying has only an 18% chance of occurring. Nearby outcomes (such as 4 people or 6 people dying) are almost as likely.

A binomial distribution with n = 100 and p = 0.05

Now let’s say we do a clinical trial and 10 (out of 100) die. The probability of 10 or more people dying, when you assume there’s really a 5% chance of dying, is given by the binomial distribution: 2.82%.

That’s pretty unlikely.
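Here’s a sketch of that calculation (using scipy.stats; the cutoff of 9 is because sf(k) gives the probability of strictly more than k deaths):

```python
from scipy.stats import binom

n, p = 100, 0.05  # 100 patients, assumed 5% chance of dying

# P(10 or more deaths): sf(9) = P(X > 9) = P(X >= 10)
print(binom.sf(9, n, p))  # ~0.0282, i.e., about 2.82%
```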

Hopefully you are starting to see how the parameters of a probability distribution can be used to assess the probability of some real, actual data. Parameters of probability distributions can be used to say precisely what some data hypothetically should look like. With statistical hypothesis testing, we are really just asking which sets of parameters are plausible and which ones aren’t.

Statistical hypotheses

Probability distributions give you the power to say precisely how likely (or unlikely) your data are when you assume a certain set of parameters. This is what statistical hypotheses are all about: Assume the parameter(s) have a certain value, then calculate the likelihood of your data when that assumption is in place.

Let’s say you’re walking to the café between classes to grab some coffee and you happen to spot a man hanging out in a dark alley. He motions for you to come over and says, “Hey, you wanna shoot some dice? You bet $1. If I roll a 5 or higher, you get $12. If I roll anything lower than a 5, then I keep your dollar.” At first this doesn’t sound like the best deal because you’re more likely to lose money on any given play than to win money. As you think about it more, though, each time you play, you have a 2/6 chance of winning $12 and a 4/6 chance of losing $1. That works out to \(\frac{2}{6}*\$12-\frac{4}{6}*\$1\approx\$3.33\) per game, on average. In the long run, you’d end up winning enough money to make up for all the losses.

So you decide to play several rounds of dice with this shady guy in the alley. You end up not making nearly as much money as you thought you would. Out of 20 games, he only rolled a 5 or above once. You start wondering if maybe his die is loaded.

Time to formulate a statistical hypothesis! Suppose the probability of rolling a 5 or higher really is 2/6 (a 33.33% chance). How likely is it that, in 20 rounds of this dice game, you would win 0 times, or just once, or only twice, and so on?

A binomial distribution with n = 20 and p = 0.33

Let’s use the binomial distribution and set the “success rate” at 0.33. Theoretically, if the die is fair, then that’s what the real success rate should be. The figure above shows the probabilities associated with winning 0 out of 20 times, 1 out of 20 times, and so on. If the die were fair, then we should be winning about 6 (or maybe 7) times out of 20. But it is possible that you’ll win a bit less or a bit more than that. The probability of only winning once, when you assume the true “success rate” is 0.33, is 0.003. That’s 0.3%, a fraction of 1%.

The evidence at hand (only 1 win out of 20) therefore makes the hypothesis that the die is fair seem like it couldn’t be true. After all, if the probability of winning on any given roll really was 33.33%, then something with a less than 1% chance of happening just happened.
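A quick sketch of those dice-game numbers (using the exact fraction 2/6 rather than the rounded 0.33):

```python
from scipy.stats import binom

n = 20      # rounds of the dice game
p = 2 / 6   # chance of rolling a 5 or higher with a fair die

print(binom.pmf(1, n, p))  # P(exactly 1 win)  ~0.0030
print(binom.cdf(1, n, p))  # P(1 win or fewer) ~0.0033
```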

This is at the heart of statistical hypothesis testing: Calculating the probability of observing our data (or more extreme data) when you assume a certain statistical hypothesis. The clause “or more extreme data” is important here. There is a range, or “interval” of outcomes that we are interested in:

  • What is the probability that 70% or more heads would come up when you assume the coin is fair?
  • What is the probability that 5% or fewer would die in a clinical trial if the new drug is truly no more effective than the old drug?

Null hypothesis significance testing (NHST)

In this book/class, we will focus on a specific type of statistical inference called Null Hypothesis Significance Testing (NHST). This is currently (and historically) the most popular approach to statistical inference. It is also far easier than Bayesian statistics, which is the only major alternative to NHST.

The main idea behind NHST is something called the null hypothesis. The null hypothesis amounts to something like “Suppose the die is fair” or “Suppose I’m wrong and there is no correlation between these variables” or “Suppose I’m wrong, and there is no difference between these two groups.” You have to have some sort of statistical hypothesis in place before you can assess how likely your data are. The null hypothesis amounts to saying, “Well, suppose I’m wrong…”

What you’re usually hoping to demonstrate in NHST is that the null hypothesis is false, or at least really implausible. You want your argument to go something like this: “If there were really no difference (or no correlation), then observing my data would be very unlikely. Either my data are wrong or the null hypothesis is wrong. If the null hypothesis is wrong, then there must have been a difference (or a correlation) all along.”

This is a type of reductio ad absurdum argument. Here’s another example of a reductio ad absurdum argument: “You think the earth is flat? Okay, let’s assume you’re right and the earth is flat. Why aren’t people falling off the edge of the earth? How come, when something disappears into the horizon, the bottom part of it disappears first? Why does the earth cast a circular shadow onto the moon?”

In a reductio ad absurdum argument, you grant a certain premise as true, for the sake of argument. Then you demonstrate how, when that assumption is made, it leads to absurdities. You say, “If we assume X is true, then the data should look like this. The data don’t look like this. Therefore, X cannot be true.”

Ever heard of Alex Jones? (If you haven’t, I envy you.) Jones runs a radio/internet show that promotes conspiracy theories. He was sued by the victims of the Sandy Hook massacre because he claimed that the children and parents involved were actors hired by the government. The whole thing was a “false flag operation” to help the government (i.e., the “deep state”) justify stricter gun laws. He says the “deep state” has murdered people who disagree with them or expose their conspiracies. He says the “deep state” spies on people and looks for any reason it can to murder or silence its critics.

That’s only the tip of the Alex Jones iceberg. He also thinks the government is “turning the frogs gay” as it tries to develop a chemical treatment to feminize men and make them easier to manipulate. They’ve started with frogs while they work the kinks out of these chemicals.

The reason I brought up Alex Jones, though, is that his basic claims about the “deep state” hunting down and killing conspiracy theorists can be destroyed with a very simple reductio ad absurdum argument: If the government is actively tracking down and killing anyone who criticizes them, especially if they’re criticizing them on a large platform, then why haven’t they killed Alex Jones? The government must not be trying that hard to assassinate their critics if Alex Jones is able to keep his show running for decades.

NHST works much the same way. You assume that you are wrong. “There is no correlation between these two variables.” “There is no difference between these two groups.” Once you’ve made that assumption, you can potentially demonstrate that this assumption leads to an absurdity. “If there really is no correlation between these two variables, then there’s only a 1% chance that I would’ve observed a correlation as high as (or higher than) the one I did.” “If there really is no difference between these groups, then there’s only a 4% chance that I would’ve observed a difference between the groups as large as (or larger than) the one I did.”

Statistical significance

In statistics, we often talk about “effects”. A difference between two groups (whether it’s a big or small difference) is an effect. “What effect does being in the honors program have on feelings of belongingness on campus?” Correlations (or associations) between variables are talked about as effects too. “What is the effect of the hours of sleep one gets at night on their academic performance?”

You might observe a small effect or a large one, but your estimate of how large an effect is depends on your sample size (among other things). To address this concern, we assess whether an effect is “statistically significant.”

An effect is statistically significant if the probability of observing that effect (or a larger one) is less than 5% when you assume the null hypothesis is true.

A “p-value” is the probability of your data (or more extreme data) when you assume the null hypothesis is true.

That definition is so fundamental to understanding the material in this book that I want you to take extra special care to try and understand it as best you can. Re-read the material leading up to it. Think it over. Re-visit it. Look at it from different angles. Say it to yourself every morning before you get out of bed. Tattoo it to your arm and stare at it every chance you get.

Two binomial distributions

So, let’s say you observe a difference between sample means. But has lady luck played a trick on you and given you what looks like a real difference between means in your sample when really there is no difference in the population at large? You can calculate the probability of observing a difference between means as large (or larger), as long as you form a statistical hypothesis to base these calculations on. You can’t directly calculate the probability of your data without making assumptions. So, we calculate the probability of observing a difference between means as large (or larger) when you assume the null hypothesis is true. What you end up with is a p-value: The probability of observing an effect as large (or larger) when you assume there is no difference between the means (i.e., that the null hypothesis is true). If this p-value is below .05 (below a 5% probability), then we conclude that the null hypothesis is probably not true. After all, when you assume the null hypothesis is true, your data are very improbable.

We reject the null hypothesis when our p-value is less than .05. Another way of saying this is our results are “statistically significant”. It is very important to interpret p-values and statistical significance for what they are, no more and no less. We are not concluding that the alternative hypothesis is true. Besides, that would only amount to saying, “there is probably some non-zero difference between means.” Statistical significance also doesn’t entail that your result has any practical significance. You can make any trivial and unimportant effect, however small, show up as statistically significant if you collect a large enough sample. That doesn’t mean it would be worth anyone’s time and resources to try and translate that discovery into practice.

Let’s apply the concepts of the p-value and statistical significance to our biased coin example. Remember, though: any serious real-world data would be tested using the same logic.

Let’s say you believe a coin is biased. You think it’s not a fair 50-50 coin. So, you form a null hypothesis: “Suppose I’m wrong, and the coin really is fair.” Next, you calculate the probability of observing the data (or more extreme data) when you assume this null hypothesis is true. Let’s say the coin in question was flipped 20 times and came up heads 15 of those times.

A binomial distribution (n = 20, p = 0.5) with the outcomes 15 and higher highlighted

The figure above shows the probabilities of flipping a coin 20 times and observing different numbers of “heads” out of the 20. The outcomes from 15 onward are highlighted in pink. Under these assumptions, the probability of flipping 15 heads out of 20 is 1.48%. The probability of flipping 16 out of 20 is 0.46%. 17 out of 20 is 0.09%. If you add up the probabilities of all the outcomes from 15 through 20, you get about 2.07%. So, if you assume the null hypothesis is true, there is only about a 2% chance that you would observe as many “heads” as you did (or more). You might call these results… statistically significant!
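In code, that p-value is one line (a sketch with scipy.stats; sf(14) gives the probability of strictly more than 14 heads, i.e., 15 or more):

```python
from scipy.stats import binom

# Null hypothesis: the coin is fair (p = 0.5); observed: 15 heads in 20 flips
p_value = binom.sf(14, 20, 0.5)  # P(X >= 15) under the null
print(p_value)  # ~0.0207 -- below .05, so statistically significant
```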

Directional hypotheses: 1-tailed vs. 2-tailed tests

So far, I have been talking about statistical hypotheses that are technically called “directional” hypotheses (or “1-tailed” tests). That is, I’ve been talking about hypotheses where you believe the effect is either larger or smaller than zero. Directional hypotheses include, “I predict there will be a positive correlation,” “I predict there will be a negative correlation,” or “I predict that the Lord of the Rings fans will have a higher average IQ compared to the Star Wars fans.” In other words, there’s some predicted direction for the effect.

By contrast, there are non-directional hypotheses. Non-directional (or “2-tailed”) tests predict that there will be some effect, but the direction of this effect is not specified: “There will be a correlation between these variables” or “There will be a difference in IQ between Lord of the Rings fans and Star Wars fans.” You are saying that there will be an effect, but not what direction that effect will go in.

The “1-tailed” and “2-tailed” terminology comes from looking at the threshold of statistical significance on a null distribution. When you make a directional (1-tailed) hypothesis, you have to observe an effect so large that it exceeds either the top 5% or the bottom 5% of the area of the distribution, depending on the direction you predicted. In other words, you have to observe an effect extreme enough that there’s less than a 5% chance of observing it when you assume the null hypothesis is true.

Two binomial distributions

On the other hand, when you formulate a non-directional (two-tailed) hypothesis, you need to observe an effect that cuts off either the top 2.5% or the bottom 2.5%. So, in total, there is still 5% of area under the curve you have to get into, but that area is split between the two extremes (or tails) of the distribution.

It is harder to observe a statistically significant result when you are using a two-tailed test. If you observe a positive correlation, it has to surpass the top 2.5% of the area under the curve rather than the top 5%. Same thing for a negative correlation. With a two-tailed test, a negative correlation has to be so low that it surpasses the bottom 2.5% of the distribution rather than the bottom 5%.
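Here’s a sketch of that contrast using scipy.stats.binomtest (available in SciPy 1.7 and later), applied to the 15-heads-out-of-20 example:

```python
from scipy.stats import binomtest

# 15 heads out of 20 flips, null hypothesis p = 0.5
one_tailed = binomtest(15, 20, 0.5, alternative="greater").pvalue
two_tailed = binomtest(15, 20, 0.5, alternative="two-sided").pvalue

print(one_tailed)  # ~0.021
print(two_tailed)  # ~0.041, roughly double the one-tailed value
```

In this particular case both p-values happen to fall below .05, but the two-tailed value is about twice as large, which is exactly why the two-tailed test is the harder bar to clear.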

There was a time when researchers were allowed to “predict” that a certain effect will have a direction. They were allowed to predict that group A would be higher than group B (not lower). Or they were allowed to predict a negative correlation between two variables (rather than a positive one). This made it easier to get a statistically significant result. And publishing scientific papers is still heavily biased towards statistically significant results. Failing to reject the null hypothesis is still not considered “sexy” enough to get published in many places.

The era of 1-tailed tests has long passed. Nowadays, if you use directional hypotheses in your statistical analyses, people will think you’re trying to pull a fast one on them and demand to see what your p-values look like if you use a 2-tailed test rather than a 1-tailed test.

Type 1 and Type 2 errors

When you collect data and calculate means or correlations or whatever, you can never know whether the null hypothesis is actually true or not. You only have your data and your hypotheses. You calculate a p-value and you decide whether to reject or retain the null hypothesis based on your data, but you can never know with 100% certainty whether you made a mistake.

We have names for these mistakes. A type 1 error occurs when you reject the null hypothesis, but you shouldn’t have. The null hypothesis was true, but you rejected it anyway. There really is no effect, but lady luck has played a trick on you. You happened to observe what looks like an effect because of sampling error.

A type 2 error occurs when the null hypothesis is false, but you fail to reject it. There really is an effect, but you failed to detect that effect. You observed an effect so close to zero that it was statistically indistinguishable from the null hypothesis.

Two binomial distributions

Both errors are bad. With a type 1 error, you’re making it look as if there’s an effect when there really isn’t. People might waste resources doing follow-up studies or trying to put your research into practice. On the other hand, a type 2 error is bad because you missed something that might have been important. Maybe there’s some effect that could have been put into practice and could’ve helped people, but here you are going around telling people that this effect doesn’t exist.

Two binomial distributions

Null hypothesis significance testing (NHST) and its associated Type 1 and Type 2 errors can be kind of confusing because there are a lot of opposites and double-negatives in the logic. That’s why it’s worth going over these concepts, studying them, putting them in your own words, and testing yourself as much as possible. They’re tricky. Make yourself some flashcards. It’ll be tempting to try and “boil it down” to some simple rules. And that’s okay. But sometimes that can lead to confusion later on when someone uses different phrasing than what you’re used to. I’ve been using NHST and teaching it for a long time and I still get tongue-twisted and say stuff backwards at times.

Two binomial distributions

Statistical power

Statistical power is the ability of your study to detect an effect if that effect is real. Statistical power is your ability to avoid a Type 2 error. Traditionally, people want their statistical power to be at 0.8 (or 80% probability). This would be the probability of rejecting the null hypothesis if the null hypothesis is indeed false.

There are many factors that affect statistical power. The two main ones are sample size and effect size. In general, if your sample is small, you’re probably not going to be able to detect small (or small-ish) effects. And most effects in psychology are small. Small samples are also more erratic, giving you extremely large or small means (or correlations). In other words, sampling error is more of a problem with smaller samples. Large sample sizes are therefore good because they avoid extreme, erratic sampling error problems and they increase statistical power.

The other important factor is effect size. Let’s say I have a hypothesis that people who drink diet sodas tend to be shorter than people who don’t drink diet sodas. If that effect exists, it’s going to be very small. You would need an extremely large sample to be able to test this hypothesis.

Let’s say, on the other hand, I have a hypothesis that people bowl better when they haven’t been sprayed in the eyes with mace. I have a control group that bowls like normal with no mace in their eyes. I also have an experimental group of bowlers who I spray down with mace. You don’t need a very big sample size to see the difference in bowling scores. This is going to be a “night and day” difference. Studies with large effect sizes will have high statistical power, even if the sample size isn’t all that big.
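To make the sample size / effect size trade-off concrete, here’s a minimal power simulation sketch (the effect sizes, group sizes, and the simulated_power function name are all just illustrative assumptions):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=7)

def simulated_power(effect_size, n_per_group, n_sims=5000):
    """Estimate power: the proportion of simulated studies in which a
    two-sample t-test rejects the null, when the true group difference
    equals effect_size (in standard deviation units)."""
    rejections = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(effect_size, 1.0, n_per_group)
        if ttest_ind(control, treatment).pvalue < 0.05:
            rejections += 1
    return rejections / n_sims

print(simulated_power(0.2, 50))  # small "diet soda" effect: low power (~0.17)
print(simulated_power(1.5, 20))  # huge "mace" effect: power near 1.0
```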