## 1.4 Inference review

### 1.4.1 Introduction

This is a long one, so feel free to watch it in chunks! I am assuming you’ve seen this content before, but it may have been a while, and not everyone approaches it exactly the same way.

One thing to note before we start: what I’m about to describe is called *frequentist inference*. This is not the only approach to inference, either mathematically or philosophically. But it is very widely used, and it’s what we’ll focus on in this course. If you’re interested in other approaches to inference, I recommend picking up an elective on Bayesian statistics!

### 1.4.2 Frequentist inference: the big picture

In general, the goal of inference is to *infer* something based on what you can observe. In particular, suppose that there is some *population* you are interested in, and a particular *parameter* that describes the behavior of this population. For example, you might be interested in blood pressure. Your population is, let’s say, all US adults, and the parameter you’re interested in is the mean blood pressure, \(\mu\).

You can never observe the whole population, so you can never know the true value of the parameter. But you can guess, or *infer*, something about it if you take a *sample* of people and measure their blood pressure. You could calculate a *sample statistic* from your sample, like, say, the sample mean \(\bar{y}\), and use this to *estimate* the population parameter.

Frequentist inference takes a particular approach to this. In this philosophy, there is a true, fixed value of the population parameter. It is written on a secret mountain somewhere. It’s *something.* But you won’t ever know what it is.

What you do know, or at least assume you know, is the probabilistic relationship between that parameter and what you see in your sample. You use this relationship to work backwards from what you see to what you think about the true parameter.

There are two main inference tools or outcomes that you’re probably used to: the *hypothesis test* and the *confidence interval*.

A hypothesis test says: “Listen, if the true population parameter were some value \(X_0\), would it be pretty normal to see a sample value like this one I have here? Or would this sample value be really unlikely?”

A confidence interval says: “Here’s the whole *range* of parameter values that would make my sample value reasonably likely.”

### 1.4.3 The inference framework

So how do you find these things? Well, there’s a procedural framework. This same framework applies no matter what kind of parameter you’re interested in – a mean, a proportion, a regression slope, whatever. It looks like this:

- Formulate question
- Obtain data
- Check conditions
- Find the test statistic, and its sampling distribution under \(H_0\)
- Compare observed test statistic to null distribution
- Create confidence interval
- Report and interpret

Let’s go through in order. For example purposes, let’s say the population parameter we’re interested in is the mean, \(\mu\).

**Formulate question:**

Start by identifying the parameter, and the corresponding sample statistic. We’re interested in the population mean \(\mu\), so we’ll use the sample mean \(\bar{y}\).

Next, determine the *null hypothesis*, \(H_0\), and the *alternative hypothesis*, \(H_A\). The null hypothesis is a statement about the population parameter, which may or may not be true. The null should be two things: boring and tractable. Boring, because you will never be able to confirm it, only reject it or say nothing about it – so rejecting it should be interesting! And tractable, in the sense that you should know how the world works *if that null hypothesis were true*.

Let’s say my null hypothesis is \(H_0: \mu = 120\). The two-sided alternative would be \(H_A: \mu \neq 120\). (I could also do a one-sided alternative – maybe I’d only find it interesting if average blood pressure were *higher* than 120.)

Before I go any further, I need to decide on a *significance level*, or *alpha* – equivalently, a *confidence level* of \(1 - \alpha\). How confident do I need to be in my answer? This is one of those “talk to your client” moments. In an exploratory setting, you might be pretty relaxed about this – maybe 95% confidence is good enough. But when the stakes are higher, maybe you need 99% or 99.9% or more. For this example, let’s say \(\alpha = 0.01\), corresponding to a 99% confidence level.

I say you need to decide your alpha *first* not for mathematical reasons, but for scientific ones. If you don’t start with an alpha in mind, there is a great temptation to wait until you have results, and then pick the alpha that makes them significant. Very sketchy, do not recommend. You may decide in advance to report multiple alphas – you’ll often see this in scientific papers – but your alpha decision should be based on the context of your research problem, not on making yourself look good.
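To preview how alpha gets used: for a \(t\)-based test, alpha translates into a cutoff, or *critical value*, in the tails of the \(t\) distribution. A quick sketch (not part of the analysis below; it borrows the sample size of 250 we’ll use later):

```
alpha = 0.01
qt(1 - alpha / 2, df = 250 - 1) # two-sided critical value; roughly 2.6
```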

**Obtain data:**

Next step: get some data! Back in intro stats, this was when you started talking about sampling. Now, of course, we’ll be creating experiments to obtain the data. Properly designing your experiment requires that you’ve *already* thought about the question – the parameter of interest, the interesting and uninteresting hypotheses, the necessary level of confidence.

**Check conditions:**

Most inference tests rely on certain *assumptions and conditions*. These are things like the “nearly Normal condition” or “constant variance” or “large enough sample.” Which test you’re doing determines which conditions you care about.

(There are also some “distribution-free” or *nonparametric* tests that rely on *fewer* conditions. BHH gets kind of into one particular approach, called *randomization tests*, notably in chapter 3. I don’t intend to focus on this topic, but you could explore nonparametrics for your new-topic project if you wanted to.)

For a mean, we generally care about the condition of *Normality*. That is: do the individual data points come from a roughly Normal distribution? And if they don’t, is our sample size large enough that the Central Limit Theorem will kick in and save us?
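If you’d like to see the Central Limit Theorem “kick in” for yourself, here’s a quick simulation sketch (mine, not part of the analysis): even when the raw values come from a skewed distribution, the *sample means* pile up in a roughly Normal shape.

```
set.seed(111) # hypothetical seed, just for reproducibility
sim_means = replicate(1000, mean(rexp(n = 250))) # 1000 sample means, n = 250 each
ggplot(data.frame(m = sim_means), aes(x = m)) +
  geom_histogram(bins = 20) # roughly symmetric, even though rexp() is right-skewed
```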

Let’s check it for blood pressure. I’m pulling a sample of 250 people from the NHANES public health study.

```
library(NHANES)
library(tidyverse) # for %>%, filter(), slice_sample(), ggplot(), etc.
set.seed(2) # force R to take the same "random" sample every time the code runs!
sam_NHANES = NHANES %>%
  filter(!is.na(BPSysAve)) %>% # only include people who *have* BP measurements
  slice_sample(n = 250, replace = FALSE)
```

Let’s check a histogram of systolic blood pressure:

```
sam_NHANES %>%
  ggplot() +
  geom_histogram(aes(x = BPSysAve), bins = 20)
```

Mmm. Not super normal, really. Recall the “shape, center, and spread” catchphrase for describing distributions. The shape of this distribution is not very symmetric, like a Normal would be; instead it’s skewed right.

To confirm, we could try a Normal probability plot or *QQ plot*. If you haven’t met these before, basically, they compare the quantiles from my sample to the quantiles of a Normal distribution. If my sample values follow a Normal distribution, the points will more or less fall along a straight line.

```
sam_NHANES %>%
  ggplot(aes(sample = BPSysAve)) + # specifying aes() here means *all* the later commands will use it
  stat_qq() +
  stat_qq_line()
```

Again, not so great. QQ plots always look a bit funky out at the tails, but a big curve like this is a definite indication of non-Normality.

You may recall that one way of dealing with this is to try a *transformation* on the data. We’ll go into this more later, but for now, I’ll try taking the log of the values:

```
sam_NHANES = sam_NHANES %>% mutate(logBPSysAve = log(BPSysAve))

sam_NHANES %>%
  ggplot() +
  geom_histogram(aes(x = logBPSysAve), bins = 20)

sam_NHANES %>%
  ggplot(aes(sample = logBPSysAve)) +
  stat_qq() +
  stat_qq_line()
```

Well, not perfect, but better. Probably good enough; with a sample size of 250, my analysis should be pretty robust to non-Normality.

I don’t see any other concerning behavior, like extreme outliers, multiple modes, or weird gaps, so I’d say we’re good on the nearly-Normal side of things.

There’s one assumption that you may remember cropping up in *all* the inference tests, no matter what parameter you were investigating: *independent observations*. In experimental design, we’ll usually talk about this as *independence of errors*. A lot of what we’ll do, especially in ANOVA, is about characterizing the dependence between various observations. For now, though, we’re working with observational data; we just want to know that we got a nice, random, unbiased sample. We’ll have to trust the NHANES study designers on that, so on we go!

**Find the test statistic, and its sampling distribution under \(H_0\):**

A *statistic* is any value you calculate from your data, and a *test statistic* is a statistic you calculate to help you do a test.

For a mean, the test statistic is the \(t\) statistic. You start with the sample mean \(\bar{y}\), then *standardize* it:
\[ t = \frac{\bar{y} - \mu_0}{s/\sqrt{n}} \]

Here, \(\mu_0\) is the null value of the parameter – in our example, \(H_0\) was that the true average blood pressure was 120. But remember, we took a log! If the average blood pressure is 120, the average *log* blood pressure is \(\log(120)\). So our \(\mu_0\) is \(\log(120)\), which is about 4.8. That \(s\) in the denominator is the sample standard deviation. You may recall that using \(s\) here instead of \(\sigma\), the true population SD, is why this is a \(t\) statistic and not a Normal, or \(z\), statistic. Or maybe you don’t recall. Not that important right now.

Let’s find this for our sample:

```
sam_ybar = sam_NHANES$logBPSysAve %>% mean(na.rm = TRUE) # remove NAs
sam_s = sam_NHANES$logBPSysAve %>% sd(na.rm = TRUE)
sam_t = (sam_ybar - log(120)) / (sam_s / sqrt(250))
sam_t
```

`## [1] -2.42752`

Meanwhile, we also happen to know that *if the null hypothesis is true*, this \(t\) statistic follows a Student’s \(t\) distribution with \(n-1\) degrees of freedom. That means that *if* \(H_0\) were true, and we took a whole bunch of samples and did this calculation for each one of them, and then made a histogram of the distribution of all those individual \(t\)’s, it would look like this:
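That density is just a \(t\) curve with 249 degrees of freedom; if you want to sketch it yourself (assuming ggplot2 is loaded):

```
ggplot(data.frame(t = c(-4, 4)), aes(x = t)) +
  stat_function(fun = dt, args = list(df = 250 - 1)) # t density under H0
```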

That’s its *null sampling distribution*, or its *sampling distribution under* \(H_0\). Different test statistics have different null sampling distributions.

**Compare observed test statistic to null distribution**

Well, we have this picture of what \(t\) statistics tend to look like if \(H_0\) is true. We also have an *actual* \(t\) statistic that we calculated from our observed data. Let’s compare:
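One way to draw this comparison yourself – a sketch, assuming `sam_t` from earlier is still defined – is to mark the observed statistic on the null density:

```
ggplot(data.frame(t = c(-4, 4)), aes(x = t)) +
  stat_function(fun = dt, args = list(df = 250 - 1)) + # null distribution
  geom_vline(xintercept = sam_t, color = "red") # our observed t, about -2.43
```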

Huh. That seems…questionable. If I were drawing values from this null distribution, I’d be somewhat unlikely to draw something so far out there. Or as the statisticians phrase it: to get a value *as or more extreme* – where “extreme” means far from the center of the distribution.

Suppose I decided that, yes indeed, that value is unlikely. One might even say weird. Now, as a statistician, I think that weird things happen, but *not to me*. Then I’d conclude that since this value would be weird if \(H_0\) were true, and weird things don’t happen to me, well, \(H_0\) must not be true after all.

But, I don’t know, maybe it’s not *that* unlikely. How weird is weird, anyway? We quantify this idea with a *p-value*. A p-value is a probability: specifically, it’s the probability of observing a value at least this extreme, *if the null hypothesis is true*. P-values have, like, a *lot* of problems, but again, they’re really common so it’s good to remember where they come from.

I can actually get the p-value semi-manually by finding the tail probabilities of the \(t\) distribution:

```
p_val = 2 * pt(sam_t, df = 250 - 1, lower.tail = TRUE) # works because sam_t is negative;
# in general, the two-sided p-value is 2 * pt(-abs(sam_t), df = n - 1)
p_val
```

`## [1] 0.01591225`

*If* \(H_0\) is true, I’d have about a 1.6% chance of getting a test statistic at least this far from 0.

Then I compare that to the alpha I decided on way back at the beginning, 0.01. And I conclude: meh. To me, something that happens 1.6% of the time doesn’t count as weird, so it’s still plausible to me that \(H_0\) is in fact true. Or, to put it technically: because my p-value is higher than my alpha, I *fail to reject* \(H_0\). I’m not saying that \(H_0\) *is* true, but I haven’t found sufficiently strong evidence *against* it.

Note that if my p-value were lower than alpha, I would *reject* \(H_0\). In fact, I’d do that if I’d set \(\alpha = 0.05\). But I decided at the beginning that I needed to be more confident than that! I refuse to get all excited unless I see stronger evidence against \(H_0\).

**Create confidence interval**

Okay, so I can’t really say whether the true average blood pressure is 120 (or more precisely, whether the true average *log* blood pressure is \(\log(120)\)). What about…125? 114 and a half?

The thing about hypothesis tests is that they only tell you about a single null value. We want more! This is where confidence intervals come in.

A confidence interval tells you *all* the null values that you *wouldn’t* reject. It’s centered at your estimate of the parameter, and then goes out to either side. The more confident you want to be, the farther out you have to go. The general formula for a confidence interval is:
\[ \text{estimate} \pm CV_{\alpha} \times SE(\text{estimate}) \]
\(CV_{\alpha}\) is the *critical value* – it depends on the distribution of your estimate, and on your required confidence level.

For a mean, the specific formula is: \[ \bar{y} \pm t^*_{n-1,\alpha/2} \times \frac{s}{\sqrt{n}} \]

That \(t^*\) is a critical value from the \(t\) distribution – note how the degrees of freedom are specified.

I could calculate all this stuff manually too, but instead, let’s just let R do its thing:

```
my_ttest = t.test(x = sam_NHANES$logBPSysAve, alternative = "two.sided",
                  mu = log(120), conf.level = 0.99)
my_ttest
```

```
##
## One Sample t-test
##
## data: sam_NHANES$logBPSysAve
## t = -2.4275, df = 249, p-value = 0.01591
## alternative hypothesis: true mean is not equal to 4.787492
## 99 percent confidence interval:
## 4.743569 4.788962
## sample estimates:
## mean of x
## 4.766266
```
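If you want to double-check R’s interval against the formula above – a sketch using the quantities we computed by hand earlier:

```
t_star = qt(1 - 0.01 / 2, df = 250 - 1) # critical value for 99% confidence
sam_ybar + c(-1, 1) * t_star * sam_s / sqrt(250) # matches t.test's interval
```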

R, helpfully, both conducts a hypothesis test *and* provides a confidence interval. And so should you! Which brings us to the last step:

**Report and interpret**

Look, there are a lot of steps in this process that R can do. But there are some things that it is your job, as the human, to do: formulating an appropriate statistical question given the context, deciding on a confidence level, thinking about independence of observations. And now, interpreting what you found out, and helping other people understand it.

This process requires judgment, and it depends on your audience. Generally, you won’t want to hand over every single detail of your analysis (or at least, you’ll put it in an appendix). But you also don’t want to just holler “rejected \(H_0\), we win!” and leave.

As a general rule, you should always report:

- Your reject/FTR decision about \(H_0\) (if you were doing a hypothesis test at all)
- The associated p-value
- A confidence interval if at all possible

…and *then*, you should make sure to put it in context. I can tell you “We failed to reject, the p-value was 0.015, and the confidence interval was (4.74, 4.79),” but what does that mean about blood pressure? Whether you’re working directly with a client or publishing something for a wider readership, you always want to translate your results into the language of your audience. That audience might include statisticians, but most often, it also includes people who aren’t.

**Response moment:** How would you report these results? There’s an extra calculation step you should probably do – what is it?