1 What this class is about

Chapter 1 Readings

Mathematics, Statistics, and Teaching (Cobb and Moore 1997)

Inference to the Best Explanation (Lipton 2000)

1.1 What is inference?

“Inference” is not easy to define. Do an online search and you’ll find many potential definitions for “inference” or “inferential reasoning”.

A search for “statistical inference” returns narrower definitions, typically focusing on drawing “conclusions”¹ about a population using a sample of data. This course will cover statistical inference, but it will not be limited to statistical inference. Our interest is in any kind of inference, broadly defined, that is based at least in part upon analyzing data.

Example: Prior to Pfizer’s COVID-19 vaccine being made available to the public, a “Phase 3 trial” was conducted. There were roughly 43,000 participants, split randomly into vaccine and placebo (no vaccine) groups. Out of 170 confirmed COVID-19 cases among participants, 162 were in the placebo group and 8 were in the vaccine group. (Link to paper)

We might infer from this that:

The vaccine is effective, because it is extremely unlikely that the number of cases among those vaccinated (8) would be so much smaller than the number of cases among those not vaccinated (162) if the vaccine was equivalent to placebo.

Since equal numbers of participants were randomly assigned to get the vaccine or get a placebo, and these participants were presumably unaware of which they received, we might infer that “random chance” is the only potential explanation for the trial’s outcome other than the vaccine being effective. And, based on the data (8 vaccinated cases vs. 162 unvaccinated cases), random chance is not a believable explanation. Thus we infer that the vaccine is effective.
In this study, the chance of a vaccinated person getting infected was roughly 5% as big as the chance of an unvaccinated person getting infected (8/162 = 0.049). We might infer from this that the vaccine was “95% effective”.² But since we only have 170 cases, there is sizable uncertainty in this estimate. We might infer that the population-level value of this ratio was somewhere between 2.4% and 9.7%, the reported 95%³ confidence interval.
We notice that, even in the placebo group, the number of cases was very small compared to the number of participants (162/21728 = 0.0075). But, we would not be well justified in inferring that the population infection probability for unvaccinated people over the course of the study was close to 0.75%, due to potential sources of bias:
- People who volunteer to be in a phase 3 COVID-19 vaccine trial may have been less likely than most to be exposed to the virus.
- Participants were not testing regularly for COVID-19; they had to elect to get tested. Most people who chose to get tested did so because they showed symptoms or they knew they’d been exposed to someone else with the virus. So we can infer that a large number of infections among the trial participants were missed.
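The arithmetic behind these inferences can be sketched in a few lines of Python. This is a minimal illustration, not the analysis used in the published trial. It assumes, as a simplification, that the two groups were exactly the same size, so that if the vaccine were equivalent to placebo each of the 170 cases would be equally likely to fall in either group. The helper name `binomial_tail` is made up for this example.

```python
from math import comb

# Counts taken from the example above.
cases_vaccine = 8
cases_placebo = 162
n_placebo = 21728
total_cases = cases_vaccine + cases_placebo

# ASSUMPTION: equal group sizes, so the ratio of case counts
# approximates the ratio of infection risks.
risk_ratio = cases_vaccine / cases_placebo
efficacy = 1 - risk_ratio

# Share of placebo participants who became confirmed cases.
placebo_case_rate = cases_placebo / n_placebo

def binomial_tail(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# If the vaccine did nothing, how likely is it that 8 or fewer
# of the 170 cases would land in the vaccine group by chance alone?
p_chance = binomial_tail(cases_vaccine, total_cases)

print(f"risk ratio:        {risk_ratio:.3f}")
print(f"efficacy estimate: {efficacy:.1%}")
print(f"placebo case rate: {placebo_case_rate:.4f}")
print(f"P(<= 8 of 170 cases in vaccine group | no effect): {p_chance:.1e}")
```

The last number is vanishingly small, which is the sense in which random chance is "not a believable explanation". Nothing in this calculation addresses the sources of bias listed above; those require arguments about how the data were collected.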

The words infer or inferential are used above when the claim being made goes beyond the sample itself, and is applied to general properties of the vaccine or to the larger population of people who might be vaccinated.

Inference involves making a “leap” from data to the questions that motivated us to collect the data in the first place. This leap may be larger than we often realize.

1.1.1 Inference vs. description

“Descriptive statistics” refer only to the data at hand. If we plot data, or calculate values from the data, but we avoid using this information to generalize beyond the data, then we are not engaging in inference.

1.1.2 Inference vs. prediction

Sometimes the purpose of analyzing data is simply to make predictions for the values of future observed outcomes.

For example, if you use data from college basketball games to fill out an NCAA March Madness bracket, you are making predictions. On the other hand, if you use data to claim that a team’s seeding (i.e. its ranking from 1 to 16 within one of four regions) is a better indicator of overall skill for teams seeded below 8th place than for teams seeded above 8th place, you are engaging in inference.

Justifying predictions is at least in principle simple, so long as your predictions are about observable outcomes: you just compare the prediction to the actual outcome. How, exactly, to quantify prediction accuracy can get complex, but the idea behind it isn’t.

Justifying inferences, however, is often complicated and controversial, as we will see again and again throughout the course. Inferences require making assumptions about the world beyond the data. But it is easy to slide from description or prediction into inference without justification. Watch out for this!

Example:

This paper claims that predictive algorithms do a surprisingly good job of determining a person’s sexual orientation from a photograph taken from a dating website. It also claims that this is due in part to biological differences between men or women of different sexual orientations. This claim is inferential, and came under strong criticism.

The first claim, that an algorithm did a good job of making predictions, was not controversial. The authors moved from prediction to inference when proposing an explanation for their algorithm’s predictive accuracy.

1.2 The gulf between models and reality

The image on the front page of Canvas captures the central idea of the course. We have data, analyzed using statistical models or algorithms. We have the results of those analyses. We then have the questions that we care about in the real world. How do we justify using data to answer larger questions? It is not always straightforward.

Some challenges:

Our models make assumptions that aren’t true (e.g. linearity, residuals that are independently sampled from a normal distribution with fixed variance across all values of the x’s, zero measurement error in the x’s).
The populations we draw inference to aren’t well defined (e.g. we want to make a claim about “people” but we measured people in a particular city who chose to volunteer for a study).
The parameters we estimate are features of models and (usually) exist only in our minds.
What we actually measure is not what we want to quantify (e.g. an exam score does not directly quantify “understanding” or “ability”).
The statistics we calculate do not directly answer scientific questions (e.g. the “coefficient of determination” \(R^2\) is not the answer to any well defined scientific question other than “what do I get when I plug my data into the formula for \(R^2\)?”)

1.3 “What does this number quantify?”

This is a question we will ask constantly in DSCI 335. It is a good habit to get into. If you calculate a number from data, and that number is supposed to be important, you should be able to explain what that number is quantifying.⁴

Hopefully DSCI 335 helps you to develop your “number sense”. It is sometimes too easy to calculate values without understanding their purpose, or to report the results of an analysis without understanding exactly what questions those results are answering.

1.4 Scientific vs. statistical inference

“Is this drug effective?” and “Can I reject the null hypothesis that the population mean outcomes for those receiving the drug vs. those receiving the placebo are identical?” are not the same question.

Formal statistical inference exists in the world of models, which are simplifications of reality. To infer a scientific conclusion based upon a statistical conclusion requires additional justification.

1.5 Logical forms of inference

Deductive inference: justifying a claim on the grounds that it must be true if the premises are true. For instance, if it is true that a rooster only crows during sunrise, and we hear the rooster crow, then we can deductively infer that it is sunrise.⁵ Deductive reasoning is great when you can use it. Statistical inferences are rarely deductive.
Inductive reasoning: justifying a claim on the grounds that it is consistent with all observations. For instance, if every time I’ve heard a rooster crowing has been at sunrise, I might conclude, inductively, that roosters only crow during sunrise. Inductive inferences are never guarantees, and are generally fraught with danger. Statistical inferences are often inductive.
Abductive reasoning: justifying a claim on the grounds that it is the best explanation available for what has been observed. This is similar to inductive reasoning in that it cannot be guaranteed correct. It is different in that it appeals to the quality of an explanation, rather than only a pattern or frequency. For instance, if I’ve only ever heard a rooster crowing at sunrise, I might consider three explanations:
- Roosters only crow during sunrise
- Roosters crow outside of sunrise but, by sheer luck, I have never heard this
- Roosters crow outside of sunrise but I have a rare and unexplained medical condition that prevents me from hearing this except during sunrise.
The first explanation seems much more likely than the second, and the third is just silly. I conclude, abductively, that roosters only crow during sunrise.

1.6 Very brief definitions of common terms

Here’s a list of terms we’ll be using a lot, along with very brief definitions:

Sample: a collection of data.

Statistic: a number calculated from data.

Interpretability: the extent to which some statistical result (like a statistic, or a plot) has a “real world” meaning.

Population: a larger group that we might be interested in; often where we suppose our sample came from.

Parameter: an unknown value assumed to exist at the population level, typically estimated by a statistic.

Estimate: what we call a statistic when it is being used to estimate an unknown value, typically thought of as a parameter.

Effect size: a term often given to a statistic that we think is easily interpreted and quantifies the magnitude of some phenomenon of interest.

95% Confidence interval (CI): an interval created in such a way that, under repeated sampling, 95% of samples will produce an interval that contains the value of the parameter being estimated.
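The "under repeated sampling" part of this definition can be checked by simulation. The sketch below is an illustration under stated assumptions, not a general recipe: the population is taken to be normal with a mean of 10 and a standard deviation of 2 (both values invented for the example), each sample has 100 observations, and the interval uses the normal-approximation multiplier 1.96. The function name `ci_covers` is hypothetical.

```python
import random
import statistics

def ci_covers(true_mean, true_sd, n, rng):
    """Draw one sample, build a 95% CI for the mean, report whether it contains true_mean."""
    sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
    mean = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / n**0.5
    # 1.96 is the normal-approximation multiplier for a 95% interval.
    lower, upper = mean - 1.96 * std_error, mean + 1.96 * std_error
    return lower <= true_mean <= upper

rng = random.Random(335)  # fixed seed so the sketch is reproducible
n_sims = 2000

# ASSUMPTION: hypothetical population with mean 10 and standard deviation 2.
coverage = sum(ci_covers(10.0, 2.0, 100, rng) for _ in range(n_sims)) / n_sims
print(f"share of intervals containing the true mean: {coverage:.3f}")
```

The printed share should be close to 0.95. Note what the 95% describes: the long-run behavior of the procedure across many samples, not any single interval.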

Sampling distribution: distribution of values a statistic takes on, under repeated sampling from the same population. Nice simulations from Brown University:

Central limit theorem

Confidence intervals

Sampling error: the extent to which we expect our estimates to differ from the parameters they estimate, due to variability in the sampling distribution. Often thought of as the extent to which our estimates are wrong “due to random chance”.

Bias: the extent to which we expect a statistical estimate to differ from the value of the parameter being estimated because of something systematic. Often referred to as “systematic error” or “non-sampling error”. Formally, the difference between the expected value of an estimator and the value of the parameter being estimated.

Noise: a shorthand way of referring to variability in data or variability in a sampling distribution that is due to factors unaccounted for by our study. Often formally modeled as “random error” and denoted as \(\varepsilon\).

Hypothesis test: a procedure for deciding whether to reject or fail to reject a null hypothesis (\(H_0\)). \(H_0\) may be rejected in favor of \(H_1\) (aka \(H_A\)), the alternative hypothesis. \(H_0\) usually represents that some effect does not exist, e.g. \(H_0: \mu_1-\mu_2=0\), “the difference in population means is zero”.

Test statistic: a statistic created for the purpose of testing against \(H_0\). Created in such a way that, the more the data deviate from what would be expected under \(H_0\), the farther away from 0 the test statistic will be.

p-value: the probability that a new set of data from the same population would produce a test statistic at least as large as the one calculated, under the assumption that \(H_0\) is true.
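One way to make this definition concrete is to generate many new data sets with \(H_0\) true and count how often the test statistic is at least as large as the one observed. The sketch below uses hypothetical data invented for illustration (60 heads in 100 coin flips), takes \(H_0\) to be "the coin is fair", and uses the distance between the number of heads and 50 as the test statistic. The name `simulate_stat` is made up for the example.

```python
import random

# HYPOTHETICAL data, invented for illustration: 60 heads in 100 flips.
n_flips = 100
observed_heads = 60

# Test statistic: distance between the number of heads and the
# 50 heads expected under H0 (a fair coin).
observed_stat = abs(observed_heads - n_flips / 2)

rng = random.Random(335)  # fixed seed so the sketch is reproducible
n_sims = 10_000

def simulate_stat(rng, n):
    """Test statistic for one new data set generated with H0 true."""
    heads = sum(rng.random() < 0.5 for _ in range(n))
    return abs(heads - n / 2)

# Share of simulated data sets whose statistic is at least as large as observed.
as_extreme = sum(simulate_stat(rng, n_flips) >= observed_stat for _ in range(n_sims))
p_value = as_extreme / n_sims
print(f"simulated p-value: {p_value:.3f}")
```

The simulated value should land near the exact binomial answer of roughly 0.057. Every simulated data set was generated with \(H_0\) true, so the result says nothing directly about the probability that \(H_0\) itself is true.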

Statistically significant: the thing we say when p<0.05 or the 95% CI does not contain the value in the null hypothesis. Equivalent to “reject \(H_0\)”.

Type I error: rejecting \(H_0\) when \(H_0\) is true.

Type II error: failing to reject \(H_0\) when \(H_0\) is false.
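The two error rates can be computed exactly in a simple setting. The sketch below is a hypothetical illustration: 100 coin flips, \(H_0\) "the coin is fair", rejection whenever the exact two-sided p-value is below 0.05, and a Type II error rate evaluated for one assumed alternative in which the true probability of heads is 0.6. The alternative value and the function names are assumptions made for the example.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def two_sided_p(k, n):
    """Exact two-sided p-value under H0: p = 0.5, with |k - n/2| as the test statistic."""
    d = abs(k - n / 2)
    return sum(binom_pmf(i, n, 0.5) for i in range(n + 1) if abs(i - n / 2) >= d)

n = 100
alpha = 0.05

# Every head count that would lead us to reject H0.
reject = [k for k in range(n + 1) if two_sided_p(k, n) < alpha]

# Type I error rate: probability of rejecting when H0 is true.
type_1 = sum(binom_pmf(k, n, 0.5) for k in reject)

# Type II error rate: probability of failing to reject when H0 is false.
# ASSUMPTION: the true probability of heads is 0.6.
type_2 = 1 - sum(binom_pmf(k, n, 0.6) for k in reject)

print(f"Type I error rate:  {type_1:.3f}")
print(f"Type II error rate: {type_2:.3f}")
```

The Type I rate sits a little below 0.05 because head counts are discrete. The Type II rate depends entirely on which alternative is assumed; a coin further from fair would be missed less often.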

A good free resource on introductory stats is OpenIntro Statistics.

An even better free resource is Wikipedia. Not joking: most entries on statistical topics are good.

¹ Also a poorly defined word.
² The vaccine was frequently referred to as “95% effective” in news reports; what this meant was rarely explained.
³ Not the same “95%” as the vaccine effectiveness; this was coincidental.
⁴ There are some exceptions, e.g. “AIC” or “deviance” or \(-2\log(\text{likelihood})\). This class will focus on interpretable statistics.
⁵ But it is not true that roosters only crow during sunrise. Logicians would say that the structure of this argument is “valid” because the conclusion follows from the premises, but that the argument is “unsound” because the first premise is false.