1.4 Inference review

1.4.1 Introduction

This is a long one, so feel free to watch it in chunks! I am assuming you’ve seen this content before, but it may have been a while, and not everyone approaches it exactly the same way.

One thing to note before we start: what I’m about to describe is called frequentist inference. This is not the only approach to inference, either mathematically or philosophically. But it is very widely used, and it’s what we’ll focus on in this course. If you’re interested in other approaches to inference, I recommend picking up an elective on Bayesian statistics!

1.4.2 Frequentist inference: the big picture

In general, the goal of inference is to infer something based on what you can observe. In particular, suppose that there is some population you are interested in, and a particular parameter that describes the behavior of this population. For example, you might be interested in blood pressure. Your population is, let’s say, all US adults, and the parameter you’re interested in is the mean blood pressure, \(\mu\).

You can never observe the whole population, so you can never know the true value of the parameter. But you can guess, or infer, something about it if you take a sample of people and measure their blood pressure. You could calculate a sample statistic from your sample, like, say, the sample mean \(\bar{y}\), and use this to estimate the population parameter.

Frequentist inference takes a particular approach to this. In this philosophy, there is a true, fixed value of the population parameter. It is written on a secret mountain somewhere. It’s something. But you won’t ever know what it is.

What you do know, or at least assume you know, is the probabilistic relationship between that parameter and what you see in your sample. You use this relationship to work backwards from what you see to what you think about the true parameter.

There are two main inference tools or outcomes that you’re probably used to: the hypothesis test and the confidence interval.

A hypothesis test says: “Listen, if the true population parameter were some value \(X_0\), would it be pretty normal to see a sample value like this one I have here? Or would this sample value be really unlikely?”

A confidence interval says: “Here’s the whole range of parameter values that would make my sample value reasonably likely.”

1.4.3 The inference framework

So how do you find these things? Well, there’s a procedural framework. This same framework applies no matter what kind of parameter you’re interested in – a mean, a proportion, a regression slope, whatever. It looks like this:

  • Formulate question
  • Obtain data
  • Check conditions
  • Find the test statistic, and its sampling distribution under H0
  • Compare observed test statistic to null distribution
  • Create confidence interval
  • Report and interpret

Let’s go through in order. For example purposes, let’s say the population parameter we’re interested in is the mean, \(\mu\).

Formulate question:

Start by identifying the parameter, and the corresponding sample statistic. We’re interested in the population mean \(\mu\), so we’ll use the sample mean \(\bar{y}\).

Next, determine the null hypothesis, \(H_0\), and the alternative hypothesis, \(H_A\). The null hypothesis is a statement about the population parameter, which may or may not be true. The null should be two things: boring and tractable. Boring, because you will never be able to confirm it, only reject it or say nothing about it – so rejecting it should be interesting! And tractable, in the sense that you should know how the world works if that null hypothesis were true.

Let’s say my null hypothesis is \(H_0: \mu = 120\). The two-sided alternative would be \(H_A: \mu \neq 120\). (I could also do a one-sided alternative – maybe I’d only find it interesting if average blood pressure were higher than 120.)
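If it helps to see how the choice of alternative plays out, here’s a small sketch. The data are made up (50 simulated readings, nothing to do with NHANES); the only real R pieces are `t.test()` and its `alternative` argument.

```r
# Made-up data for illustration only: 50 simulated blood pressure readings
set.seed(6)
y <- rnorm(50, mean = 123, sd = 15)

# Same null value (120), three different alternatives
p_two_sided <- t.test(y, mu = 120, alternative = "two.sided")$p.value
p_greater   <- t.test(y, mu = 120, alternative = "greater")$p.value
p_less      <- t.test(y, mu = 120, alternative = "less")$p.value

c(two_sided = p_two_sided, greater = p_greater, less = p_less)
```

The two one-sided p-values add up to 1, and the two-sided p-value is twice the smaller of them. That’s why the direction of a one-sided alternative needs to be chosen before you look at the data.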

Before I go any further, I need to decide on a significance level, alpha, or equivalently a confidence level of \(1 - \alpha\). How confident do I need to be in my answer? This is one of those “talk to your client” moments. In an exploratory setting, you might be pretty relaxed about this – maybe 95% confidence is good enough. But when the stakes are higher, maybe you need 99% or 99.9% or more. For this example, let’s say \(\alpha = 0.01\), corresponding to a 99% confidence level.

I say you need to decide your alpha first not for mathematical reasons, but for scientific ones. If you don’t start with an alpha in mind, there is a great temptation to wait until you have results, and then pick the alpha that makes them significant. Very sketchy, do not recommend. You may decide in advance to report multiple alphas – you’ll often see this in scientific papers – but your alpha decision should be based on the context of your research problem, not on making yourself look good.

Obtain data:

Next step: get some data! Back in intro stats, this was when you started talking about sampling. Now, of course, we’ll be creating experiments to obtain the data. Properly designing your experiment requires that you’ve already thought about the question – the parameter of interest, the interesting and uninteresting hypotheses, the necessary level of confidence.

Check conditions:

Most inference tests rely on certain assumptions and conditions. These are things like the “nearly Normal condition” or “constant variance” or “large enough sample.” Which test you’re doing determines which conditions you care about.

(There are some “distribution-free” and “nonparametric” tests that rely on fewer conditions. BHH gets kind of into one particular approach, called randomization tests, notably in chapter 3. I don’t intend to focus on this topic, but you could explore nonparametrics for your new-topic project if you wanted to.)

For a mean, we generally care about the condition of Normality. That is: do the individual data points come from a roughly Normal distribution? And if they don’t, is our sample size large enough that the Central Limit Theorem will kick in and save us?
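To see what “the Central Limit Theorem will kick in and save us” looks like, here’s a rough simulation sketch. The population is an assumption on my part (an Exponential distribution, chosen only because it’s very right-skewed), and the skewness helper is a quick homemade measure, not a function from any package.

```r
# Sketch: sample means from a right-skewed population
# (assumed Exponential with rate 1; not blood pressure data)
set.seed(1)
n_reps <- 5000  # number of simulated samples (arbitrary)
n      <- 250   # size of each simulated sample

# Individual observations: strongly right-skewed
raw_values <- rexp(n_reps, rate = 1)

# One sample mean per simulated sample of size n
sample_means <- replicate(n_reps, mean(rexp(n, rate = 1)))

# Homemade skewness measure: mean cubed z-score (about 0 if symmetric)
skewness <- function(x) mean(((x - mean(x)) / sd(x))^3)

skew_raw   <- skewness(raw_values)    # clearly positive
skew_means <- skewness(sample_means)  # much closer to 0

# Side-by-side histograms, base R graphics
par(mfrow = c(1, 2))
hist(raw_values, breaks = 40, main = "Individual values", xlab = "y")
hist(sample_means, breaks = 40, main = "Means of n = 250", xlab = "ybar")
```

The individual values pile up on the left with a long right tail, while the sample means look close to symmetric and bell-shaped. That’s the sense in which a large sample makes inference about a mean robust to non-Normal data.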

Let’s check it for blood pressure. I’m pulling a sample of 250 people from the NHANES public health study.

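library(NHANES)
library(dplyr)   # for %>%, filter(), slice_sample(), and mutate()
library(ggplot2) # for the plots below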
set.seed(2) # force R to take the same "random" sample every time the code runs!
sam_NHANES = NHANES %>%
  filter(!is.na(BPSysAve)) %>% # only include people who *have* BP measurements
  slice_sample(n = 250, replace = FALSE)

Let’s check a histogram of systolic blood pressure:

sam_NHANES %>%
  ggplot() +
  geom_histogram(aes(x = BPSysAve), bins = 20)

Mmm. Not super Normal, really. Recall the “shape, center, and spread” catchphrase for describing distributions. The shape of this distribution is not very symmetric, like a Normal would be; instead it’s skewed right.

To confirm, we could try a Normal probability plot or QQ plot. If you haven’t met these before, basically, they compare the quantiles from my sample to the quantiles of a Normal distribution. If my sample values follow a Normal distribution, the points will more or less fall along a straight line.
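If you’re curious what “comparing quantiles” means mechanically, here’s a sketch that builds a QQ plot by hand in base R. The data are made up (simulated lognormal values, not the NHANES sample), and the variable names are just for illustration.

```r
# Made-up right-skewed data for illustration (NOT the NHANES sample)
set.seed(3)
y <- rlnorm(250, meanlog = 4.77, sdlog = 0.14)

n <- length(y)
sample_q <- sort(y)            # sample quantiles: the data in increasing order
theory_q <- qnorm(ppoints(n))  # matching quantiles of a standard Normal

# The QQ plot is just a scatterplot of one set of quantiles against the other
plot(theory_q, sample_q,
     xlab = "Theoretical Normal quantiles", ylab = "Sample quantiles")
qqline(y)  # reference line through the first and third quartiles

# Base R's qqnorm() does the same bookkeeping
built_in <- qqnorm(y, plot.it = FALSE)
```

If the data were Normal, the points would track the reference line. Skewed data bend away from it in a curve.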

sam_NHANES %>%
  ggplot(aes(sample = BPSysAve)) + # specifying aes() here means *all* the later commands will use it
  stat_qq() +
  stat_qq_line()

Again, not so great. QQ plots always look a bit funky out at the tails, but a big curve like this is a definite indication of non-Normality.

You may recall that one way of dealing with this is to try a transformation on the data. We’ll go into this more later, but for now, I’ll try taking the log of the values:

sam_NHANES = sam_NHANES %>% mutate("logBPSysAve" = log(BPSysAve))

sam_NHANES %>%
  ggplot() +
  geom_histogram(aes(x = logBPSysAve), bins = 20)

sam_NHANES %>%
  ggplot(aes(sample = logBPSysAve)) +
  stat_qq() +
  stat_qq_line()

Well, not perfect, but better. Probably good enough; with a sample size of 250, my analysis should be pretty robust to non-Normality.

I don’t see any other concerning behavior, like extreme outliers, multiple modes, or weird gaps, so I’d say we’re good on the nearly-Normal side of things.

There’s one assumption that you may remember cropping up in all the inference tests, no matter what parameter you were investigating: independent observations. In experimental design, we’ll usually talk about this as independence of errors. A lot of what we’ll do, especially in ANOVA, is about characterizing the dependence between various observations. For now, though, we’re working with observational data; we just want to know that we got a nice, random, unbiased sample. We’ll have to trust the NHANES study designers on that, so on we go!

Find the test statistic, and its sampling distribution under \(H_0\):

A statistic is any value you calculate from your data, and a test statistic is a statistic you calculate to help you do a test.

For a mean, the test statistic is the \(t\) statistic. You start with the sample mean \(\bar{y}\), then standardize it: \[ t = \frac{\bar{y} - \mu_0}{s/\sqrt{n}} \]

Here, \(\mu_0\) is the null value of the parameter – in our example, \(H_0\) was that the true average blood pressure was 120. But remember, we took a log! So we run the test on the log scale, with \(\mu_0 = \log(120)\), which is about 4.79. (Strictly speaking, the average of the logs is not the same as the log of the average, so this null is really a statement about the geometric mean of blood pressure being 120. We’ll gloss over that distinction here.) That \(s\) in the denominator is the sample standard deviation. You may recall that using \(s\) here instead of \(\sigma\), the true population SD, is why this is a \(t\) statistic and not a Normal, or \(z\), statistic. Or maybe you don’t recall. Not that important right now.

Let’s find this for our sample:

sam_ybar = sam_NHANES$logBPSysAve %>% mean(na.rm = TRUE) # remove NAs
sam_s = sam_NHANES$logBPSysAve %>% sd(na.rm = TRUE)
sam_t = (sam_ybar - log(120)) / (sam_s/sqrt(250))
sam_t
## [1] -2.42752

Meanwhile, we also happen to know that if the null hypothesis is true, this \(t\) statistic follows a Student’s \(t\) distribution with \(n-1\) degrees of freedom. That means that if \(H_0\) were true, and we took a whole bunch of samples and did this calculation for each one of them, and then made a histogram of the distribution of all those individual \(t\)’s, it would look like this: a symmetric, bell-shaped curve centered at 0, which with 249 degrees of freedom is nearly indistinguishable from a standard Normal.

That’s its null sampling distribution, or its sampling distribution under \(H_0\). Different test statistics have different null sampling distributions.
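You can draw this picture yourself by actually doing the “whole bunch of samples” thought experiment. This is a sketch under assumptions: I’m simulating Normal data on the log scale, and the population SD of 0.14 is a number I picked for illustration, not something estimated in these notes.

```r
# Simulate a world where H0 is true, many times over
set.seed(4)
n      <- 250
mu0    <- log(120)  # null value on the log scale
sigma  <- 0.14      # ASSUMED population SD, for illustration only
n_reps <- 10000     # number of simulated samples

# Draw one sample under H0 and return its t statistic
one_t <- function() {
  y <- rnorm(n, mean = mu0, sd = sigma)
  (mean(y) - mu0) / (sd(y) / sqrt(n))
}
sim_t <- replicate(n_reps, one_t())

# Histogram of simulated t's, with the theoretical t density on top
hist(sim_t, breaks = 50, freq = FALSE,
     main = "Simulated t statistics under H0", xlab = "t")
curve(dt(x, df = n - 1), add = TRUE, lwd = 2)
```

The histogram should hug the curve closely. Changing `sigma` shouldn’t change the picture, since the \(t\) statistic is standardized.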

Compare observed test statistic to null distribution:

Well, we have this picture of what \(t\) statistics tend to look like if \(H_0\) is true. We also have an actual \(t\) statistic that we calculated from our observed data. Let’s compare:

Huh. That seems…questionable. If I were drawing values from this null distribution, I’d be somewhat unlikely to draw something so far out there. Or as the statisticians phrase it: to get a value as or more extreme – where “extreme” means far from the center of the distribution.
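Here’s a sketch of that comparison in base R graphics, using the observed statistic printed earlier (\(-2.42752\)). The plotting choices (axis range, line types) are mine.

```r
obs_t <- -2.42752  # observed t statistic from the output above
df    <- 250 - 1   # degrees of freedom, n - 1

# Null distribution: t density with 249 degrees of freedom
curve(dt(x, df = df), from = -4, to = 4,
      xlab = "t", ylab = "Density",
      main = "Null distribution with observed t")
abline(v = obs_t, lty = 2)   # where our sample landed
abline(v = -obs_t, lty = 3)  # equally extreme value on the other side

# Share of the null distribution at least this far from 0, in either direction
tail_area <- pt(-abs(obs_t), df = df) +
  pt(abs(obs_t), df = df, lower.tail = FALSE)
```

The dashed line sits well out in the left tail. `tail_area` adds up everything beyond the two vertical lines, which is what “as or more extreme” means for a two-sided alternative.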

Suppose I decided that, yes indeed, that value is unlikely. One might even say weird. Now, as a statistician, I think that weird things happen, but not to me. Then I’d conclude that since this value would be weird if \(H_0\) were true, and weird things don’t happen to me, well, \(H_0\) must not be true after all.

But, I don’t know, maybe it’s not that unlikely. How weird is weird, anyway? We quantify this idea with a p-value. A p-value is a probability: specifically, it’s the probability of observing a value at least this extreme, if the null hypothesis is true. P-values have, like, a lot of problems, but again, they’re really common so it’s good to remember where they come from.

I can actually get the p-value semi-manually by finding the tail probabilities of the \(t\) distribution:

p_val = 2 * pt(-abs(sam_t), df = 250 - 1) # two-sided: double the tail area beyond |t|, whatever the sign of t
p_val
## [1] 0.01591225

If \(H_0\) is true, I’d have about a 1.6% chance of getting a test statistic at least this far from 0.

Then I compare that to the alpha I decided on way back at the beginning, 0.01. And I conclude: meh. To me, something that happens 1.6% of the time doesn’t count as weird, so it’s still plausible to me that \(H_0\) is in fact true. Or, to put it technically: because my p-value is higher than my alpha, I fail to reject \(H_0\). I’m not saying that \(H_0\) is true, but I haven’t found sufficiently strong evidence against it.

Note that if my p-value were lower than alpha, I would reject \(H_0\). In fact, I’d do that if I’d set \(\alpha = 0.05\). But I decided at the beginning that I needed to be more confident than that! I refuse to get all excited unless I see stronger evidence against \(H_0\).
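To make the “it depends on your alpha” point concrete, here’s a tiny sketch that applies the decision rule at several significance levels. The list of alphas is just an example; the p-value is the one computed above.

```r
p_val  <- 0.01591225                  # p-value from the calculation above
alphas <- c(0.10, 0.05, 0.01, 0.001)  # example significance levels

# Decision rule: reject H0 when the p-value is below alpha
decision <- ifelse(p_val < alphas, "reject H0", "fail to reject H0")
data.frame(alpha = alphas, decision = decision)
```

Same data, same p-value, different conclusions. This is exactly why alpha has to be fixed before you see the results.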

Create confidence interval:

Okay, so I can’t really say whether the true average blood pressure is 120 (or more precisely, whether the true average log blood pressure is \(\log(120)\)). What about…125? 114 and a half?

The thing about hypothesis tests is that they only tell you about a single null value. We want more! This is where confidence intervals come in.

A confidence interval tells you all the null values that you wouldn’t reject. It’s centered at your estimate of the parameter, and then goes out to either side. The more confident you want to be, the farther out you have to go. The general formula for a confidence interval is: \[ \text{estimate} \pm CV_{\alpha} \times SE(\text{estimate}) \] \(CV_{\alpha}\) is the critical value – it depends on the distribution of your estimate, and on your required confidence level.
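The “all the null values you wouldn’t reject” idea can be checked by brute force. This sketch uses made-up data (simulated Normal values standing in for the log blood pressures) and a grid of candidate null values whose width and spacing I chose for convenience.

```r
# Made-up stand-in data (NOT the NHANES sample)
set.seed(5)
y     <- rnorm(250, mean = 4.77, sd = 0.14)
alpha <- 0.01

# Candidate null values on a fine grid around the sample mean
grid <- seq(mean(y) - 0.05, mean(y) + 0.05, by = 0.0001)

# Two-sided p-value for each candidate null value
p_vals <- sapply(grid, function(m) t.test(y, mu = m)$p.value)

# Range of null values we would NOT reject at level alpha
not_rejected <- range(grid[p_vals >= alpha])

# The interval t.test() reports at the matching confidence level
ci <- as.numeric(t.test(y, conf.level = 1 - alpha)$conf.int)

rbind(not_rejected = not_rejected, ci = ci)
```

The two rows should agree up to the grid spacing: the 99% confidence interval is the set of null values that a two-sided test at \(\alpha = 0.01\) fails to reject.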

For a mean, the specific formula is: \[ \bar{y} \pm t^*_{n-1,\alpha/2} \cdot \frac{s}{\sqrt{n}}\]

That \(t^*\) is a critical value from the \(t\) distribution – note how the degrees of freedom are specified.
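For reference, the manual version of this formula is only a few lines. The sketch below runs on made-up data (simulated values standing in for the log blood pressures), so the numbers will not match the NHANES results.

```r
# Made-up stand-in data (NOT the NHANES sample)
set.seed(5)
y <- rnorm(250, mean = 4.77, sd = 0.14)

conf_level <- 0.99
alpha      <- 1 - conf_level
n          <- length(y)

t_star <- qt(1 - alpha / 2, df = n - 1)  # critical value with n - 1 df
se     <- sd(y) / sqrt(n)                # standard error of the mean

# estimate +/- critical value * standard error
manual_ci <- mean(y) + c(-1, 1) * t_star * se
manual_ci
```

Note that `qt()` takes the cumulative probability, so the upper \(\alpha/2\) critical value is `qt(1 - alpha / 2, ...)`.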

I could calculate all this stuff manually too, but instead, let’s just let R do its thing:

my_ttest = t.test(x = sam_NHANES$logBPSysAve, alternative = "two.sided",
       mu = log(120), conf.level = 0.99)
my_ttest
## 
##  One Sample t-test
## 
## data:  sam_NHANES$logBPSysAve
## t = -2.4275, df = 249, p-value = 0.01591
## alternative hypothesis: true mean is not equal to 4.787492
## 99 percent confidence interval:
##  4.743569 4.788962
## sample estimates:
## mean of x 
##  4.766266

R, helpfully, both conducts a hypothesis test and provides a confidence interval. And so should you! Which brings us to the last step:
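When it’s time to write things up, you usually don’t want to copy numbers off the printout by hand. The object that `t.test()` returns is a list, and you can pull the pieces out by name. This sketch uses made-up data again, and `fit` is just a name I picked.

```r
# Made-up stand-in data (NOT the NHANES sample)
set.seed(5)
y <- rnorm(250, mean = 4.77, sd = 0.14)

fit <- t.test(y, alternative = "two.sided", mu = log(120), conf.level = 0.99)

fit$statistic  # the t statistic
fit$parameter  # degrees of freedom
fit$p.value    # the p-value
fit$conf.int   # confidence interval endpoints
fit$estimate   # the sample mean
```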

Report and interpret:

Look, there are a lot of steps in this process that R can do. But there are some things that it is your job, as the human, to do: formulating an appropriate statistical question given the context, deciding on a confidence level, thinking about independence of observations. And now, interpreting what you found out, and helping other people understand it.

This process requires judgment, and it depends on your audience. Generally, you won’t want to hand over every single detail of your analysis (or at least, you’ll put it in an appendix). But you also don’t want to just holler “rejected \(H_0\), we win!” and leave.

As a general rule, you should always report:

  • Your reject/FTR decision about \(H_0\) (if you were doing a hypothesis test at all)
  • The associated p-value
  • A confidence interval if at all possible

…and then, you should make sure to put it in context. I can tell you “We failed to reject, the p-value was 0.016, and the confidence interval was (4.74, 4.79),” but what does that mean about blood pressure? Whether you’re working directly with a client or publishing something for a wider readership, you always want to translate your results into the language of your audience. That audience might include statisticians, but most often, it also includes people who aren’t.

Response moment: How would you report these results? There’s an extra calculation step you should probably do – what is it?