Chapter 19 More on Hypothesis Testing

19.1 The Concept of Hypothesis Testing

The concept behind hypothesis testing is that I first write a pair of hypotheses \(H_0\) and \(H_a\) that correspond to a research question. Then I collect data via random sampling, choose an appropriate mathematical procedure called a hypothesis test, calculate a test statistic, and decide either to reject the null hypothesis or to fail to reject the null hypothesis.

One can draw an analogy between hypothesis testing and a jury trial. The null hypothesis is that the defendant is innocent and the alternative is that he/she is guilty. Evidence is presented during the trial, and the jury decides to either find the defendant guilty (i.e. they reject the null) or not guilty (i.e. they fail to reject the null due to lack of evidence ‘beyond a reasonable doubt’).

19.2 Type I and Type II Error

When we conduct a hypothesis test, there are two possible outcomes (reject \(H_0\) or fail to reject \(H_0\)). Naturally, we hope that the outcome is the proper one, but it is possible to make an error in a hypothesis test. (NOTE: By ‘error’, I do not mean making a computational mistake.)

Reality About The Null

Decision | \(H_0\) True | \(H_0\) False
Reject \(H_0\) | Type I Error, \(\alpha\) | Correct
Fail to Reject \(H_0\) | Correct | Type II Error, \(\beta\)
  • Type I error, or \(\alpha\), is defined as the probability of REJECTING the null hypothesis when it is TRUE. For example, suppose we concluded that girls from Kentucky were heavier than the national average when in reality they are not.

  • Type II error, or \(\beta\), is defined as the probability of FAILING TO REJECT the null hypothesis when it is FALSE. For example, suppose we concluded that girls from Kentucky were not heavier than the national average when in reality they are.
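
To make these two definitions concrete, here is a small simulation sketch (my own illustration, not from the chapter) that estimates both error rates for a one-sided one-sample z-test with a known \(\sigma\). The numbers used (a null mean of 25, \(\sigma = 6\), \(n = 30\), and a true mean of 27 under the alternative) are assumed for illustration. Under the null, the rejection rate should settle near \(\alpha = 0.05\); under the alternative, the proportion of failures to reject estimates \(\beta\).

```python
# Simulation sketch: estimate the Type I and Type II error rates of a
# one-sided one-sample z-test with known sigma (illustrative numbers only).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
mu0, sigma, n, alpha, reps = 25, 6, 30, 0.05, 10_000
z_crit = norm.ppf(1 - alpha)                      # reject H0 when z > z_crit

def reject_rate(true_mean):
    """Proportion of simulated samples in which H0: mu = mu0 is rejected."""
    rejections = 0
    for _ in range(reps):
        x = rng.normal(true_mean, sigma, n)       # one simulated sample
        z = (x.mean() - mu0) / (sigma / n**0.5)   # z test statistic
        rejections += z > z_crit
    return rejections / reps

print("Estimated Type I error rate (H0 true):  ", reject_rate(mu0))     # near alpha = 0.05
print("Estimated Type II error rate (H0 false):", 1 - reject_rate(27))  # this is beta
```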

19.3 Power

  • Power is the complement of Type II error. \(\text{Power}=1-\beta\)

  • In hypothesis testing, we typically are hoping to reject the null hypothesis (unlike real life, where we usually hope to avoid rejection). We want both \(\alpha\) and \(\beta\) to be low (close to zero) and power to be high (close to one or 100%).

  • Since we choose to control Type I error by selecting the level of significance \(\alpha\), we sacrifice our ability to simultaneously set the Type II error rate \(\beta\) without either changing the sample size or making some other change to the study.
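
As a rough numerical illustration of that trade-off, the sketch below (my own example, assuming a one-sided one-sample z-test with \(n = 30\), \(\sigma = 6\), and a true effect of 2 points) shows \(\beta\) climbing as \(\alpha\) is made stricter while everything else is held fixed.

```python
# Numerical sketch of the alpha/beta trade-off for a one-sided one-sample z-test
# with n = 30, sigma = 6, and a true effect of 2 points (illustrative numbers).
from scipy.stats import norm

n, sigma, delta = 30, 6, 2
shift = delta * n**0.5 / sigma                    # shift of the test statistic under Ha

for alpha in (0.10, 0.05, 0.01):
    z_crit = norm.ppf(1 - alpha)                  # stricter alpha -> larger critical value
    beta = norm.cdf(z_crit - shift)               # P(fail to reject | H0 false)
    print(f"alpha = {alpha:.2f}   beta = {beta:.2f}   power = {1 - beta:.2f}")
```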

The Spam Example You may be familiar with the concept of email filtering (such filters often use Bayesian techniques). The idea behind it is that when an email message is sent to your email address, it is analyzed to see if it is a legitimate message or if it is spam (i.e. junk mail). If it is judged to be legitimate, it is sent to your Inbox; if it is judged to be spam, it is sent to the Junk folder.

Here, the filter is playing the role of the hypothesis test. The null hypothesis is that the incoming message is legitimate. The message is scanned to see if it has suspicious words or other features (i.e. it came from a weird email address, it was sent by a Nigerian prince asking you for your bank account number, it uses the word ‘VIAGRA’ frequently). If enough suspicious features are detected, we reject the null and off it goes to the Junk folder. Otherwise, we fail to reject and it ends up in your Inbox.

  1. In the spam example, what would be a Type I error? What would be a Type II error? Which would be the worst type to make in this situation?

  2. In the jury example, where the null hypothesis is innocence, what would be a Type I error? Type II error? Which would be the worst type to make in this situation?

  3. A new drug is being tested that will cure a fatal disease. The alternative hypothesis is that the drug is more effective than a placebo. Identify what Type I and II errors are here, and which would be the worst to make.

NOTE: The hypothesis testing framework has been set up by statisticians such that Type I error is assumed to be the worst kind of error to commit, and is therefore controlled by choosing \(\alpha\).

19.4 How do we increase power?

There are a variety of methods for increasing power, which will thus decrease \(\beta\), the Type II error rate.

  1. We can lower \(\beta\) by increasing the Type I error rate \(\alpha\). Most statisticians do not approve of this approach.

  2. We can lower \(\beta\) if we can lower the variability \(\sigma^2\) of the response variable that we are measuring and are interested in testing or estimating. Usually this isn’t possible.

  3. We can sometimes lower \(\beta\) by choosing a different method of statistical test. This is analogous to finding a more powerful tool for a job (i.e. using a chainsaw rather than an ax to cut down a tree). We do not have time to pursue these other “tools” this semester, but options such as non-parametric tests and randomization tests can be useful.

  4. We can increase the sample size \(n\). This will lower \(\beta\) while keeping \(\alpha\) at the desired level of significance. This is an easy option for a statistician to give, but can be difficult or impossible in some real-life data collection settings.

  5. Many large-scale studies, such as clinical trials, will conduct a power study before the main study to try to plan a sufficient sample size. Factors such as the desired \(\alpha\), the desired \(\beta\), the desired effect size \(\Delta\) that is “practically significant”, and the variability of the data are used in such sample size calculations.
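
Here is a minimal sketch of the kind of sample size calculation mentioned in item 5, based on the usual normal-approximation formula \(n = \left((z_{1-\alpha} + z_{1-\beta})\,\sigma/\Delta\right)^2\) for a one-sided one-sample z-test. The inputs (\(\Delta = 2\), \(\sigma = 6\), \(\alpha = .05\), power \(= .80\)) are assumed values for illustration, not from a real power study.

```python
# Minimal sample-size sketch using the normal-approximation formula
#   n = ((z_{1-alpha} + z_{1-beta}) * sigma / Delta)^2
# for a one-sided one-sample z-test. Inputs are illustrative, not from a real study.
import math
from scipy.stats import norm

def sample_size_z(delta, sigma, alpha=0.05, power=0.80):
    """Smallest n giving at least the requested power for a one-sided z-test."""
    z_alpha = norm.ppf(1 - alpha)                 # critical value for the chosen alpha
    z_beta = norm.ppf(power)                      # quantile matching the desired power
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

def power_z(n, delta, sigma, alpha=0.05):
    """Approximate power of the same test at a given sample size."""
    z_alpha = norm.ppf(1 - alpha)
    return 1 - norm.cdf(z_alpha - delta * n**0.5 / sigma)

n = sample_size_z(delta=2, sigma=6, alpha=0.05, power=0.80)
print(n, round(power_z(n, delta=2, sigma=6), 3))  # n = 56 gives power just above 0.80
```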

19.5 Statistical & Practical Significance

Some YouTube videos!

  1. http://youtu.be/Oy6Co8-XkEc (Statistical vs. Practical Significance)

  2. http://youtu.be/PbODigCZqL8 (Biostatistics vs. Lab Research)

  3. http://youtu.be/kMYxd6QeAss (Power of the test, p-values, publication bias and statistical evidence)

  4. http://youtu.be/eyknGvncKLw (Understanding the p-value - Statistics Help)

If you didn’t watch the YouTube videos, particularly the first one, do so! I really feel that the last two videos did a nice job of explaining the difference between statistical and practical significance (the latter is sometimes called clinical significance).

Remember, the ‘job’ of a hypothesis test is to show whether or not we can reject the null hypothesis at level \(\alpha\) (i.e. do we have STATISTICAL significance). The mathematical formula knows nothing about the context of the problem, but you do!

Collecting data and analyzing it statistically does NOT mean we can turn off our brain and hunt the statistical output for ‘magic numbers’ that are less than 0.05.

Sometimes, people try to reduce statistical inference to the following:

  1. Load data into statistical software package

  2. Pick some statistical test

  3. If \(\text{p-value}<.05\), we have found a wonderful result that is significant! YAY! Write a paper!

  4. If \(\text{p-value}\geq .05\), we obviously chose the wrong test. Choose another test.

  5. Repeat until the ‘magic number’ is less than .05.

  6. If we cannot get the ‘magic number’ to be less than .05, complain to the statistician. It’s obviously his/her fault!

Example of Statistical vs. Practical Significance

College students who are pre-med and are applying to medical school usually take a high-stakes standardized test called the MCAT. Scores in 2010 were scaled such that \(\mu=25\) with \(\sigma \approx 6\).

A student who wishes to increase their probability of being accepted to medical school might choose to pay for an expensive course that is designed to train them to score better on this exam.

Suppose a study was done comparing the MCAT scores of students who had taken the training course versus those who had not. Further, suppose a statistically significant difference (\(p\)-value \(<.05\)) was found.

Obviously, since the result was statistically significant, we should automatically conclude that all pre-med students should take this expensive training course before taking the MCAT. Right???

Not necessarily! It might have been the case that very large samples were used, and the actual difference between the two groups (the effect) was very small. Hypothetically, the effect might have been \(\Delta=0.5\), or the students in the training course group scored a half point better than those in the control group.

Even though that difference would be statistically significant if the sample was large enough, at some point the difference (or effect) will not have any practical significance. Increasing one’s MCAT score by less than one point is very unlikely to change one’s prospects of being accepted into medical school, whereas an effect of 5 points might be very important.
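
A quick numerical sketch (my own illustrative group sizes, assuming a two-sample z-test with a known common \(\sigma = 6\)) shows how the same half-point effect goes from unremarkable to ‘statistically significant’ purely because the sample grew:

```python
# Sketch: the same half-point effect (Delta = 0.5, sigma = 6) at different sample sizes,
# using a two-sample z-test with equal group sizes and a known common sigma.
from scipy.stats import norm

def two_sample_z_pvalue(delta, sigma, n_per_group):
    """Two-sided p-value for an observed difference delta between two group means."""
    se = sigma * (2 / n_per_group) ** 0.5         # standard error of the difference
    z = delta / se                                # test statistic
    return 2 * (1 - norm.cdf(abs(z)))             # two-sided p-value

sigma, delta = 6, 0.5
for n in (100, 500, 2000):
    print(n, round(two_sample_z_pvalue(delta, sigma, n), 4))
# n = 100 per group: p is nowhere near .05; n = 2000 per group: p is well below .05,
# even though the half-point effect is just as small as before.
```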

Moral: Statistical significance is important, but considering the effect size and the context is also important!

19.6 Are P-Values Broken??

Let’s watch this video:

https://www.youtube.com/watch?v=tLM7xS6t4FE

The use of null hypothesis statistical testing and \(p\)-values has been controversial for decades. The methodology, as it is taught to you now and as it has been taught for decades, is an amalgam of methods developed by competing statisticians who hated each other. Ronald Fisher is largely responsible for the concept of the \(p\)-value, which was blended with the critical-value based method of Egon Pearson (Karl Pearson’s son) and Jerzy Neyman, which dealt with Type I & Type II errors but not the \(p\)-value. Fisher and Neyman said pretty rude and disparaging things about each other.

Papers were published before I was born criticizing null hypothesis testing. When I was a graduate student in the late 1990s, a book called What If There Were No Significance Tests? was published. More recently, an important article in Nature by Regina Nuzzo in 2014 pointed out the unfortunate reality that statistical testing is not as foolproof as many researchers think.

https://www.nature.com/news/scientific-method-statistical-errors-1.14700

In 2015, the editors of the journal Basic and Applied Social Psychology (BASP) BANNED the use of \(p\)-values in manuscripts submitted for their journal. Trafimow and Marks (2015, p. 1) said that “…prior to publication, authors will have to remove all vestiges of NHSTP (p-values, t-values, F-values, statements about ‘significant’ differences or lack thereof, and so on)” from articles published in BASP.

A 2019 paper looked at the articles published by BASP in 2016 and came to the conclusion that many of the papers overstated their results, possibly due to not computing \(p\)-values.

Many statisticians (myself included) feel that this is an overreaction, but the American Statistical Association has devoted a lot of energy over the past few years to this issue. A 2016 article, The ASA Statement on p-Values: Context, Process, and Purpose, had the following concluding paragraph:

https://amstat.tandfonline.com/doi/full/10.1080/00031305.2016.1154108#.XcljXTNKiUk

Good statistical practice, as an essential component of good scientific practice, emphasizes principles of good study design and conduct, a variety of numerical and graphical summaries of data, understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean.

No single index should substitute for scientific reasoning.

The ASA followed up in 2019 with a special issue on this topic, including an article called Moving to a World Beyond p < 0.05, which summarized an entire issue of The American Statistician devoted to various ideas on how to improve statistical testing.

https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1583913