This book is meant to accompany DSCI 335. It is not a complete textbook; you will need to take notes on what you hear in class and what you read throughout the semester. In it, you will find:

  • An outline of the course topics
  • A list of readings and videos, both required and recommended
  • Notes on important topics that are not covered in readings and videos
  • My background notes on readings and videos, which I recommend you read first. If there are terms or concepts that are important and that I don’t expect you to already be familiar with, I’ll briefly address them here.
  • Example R scripts for conducting simulations

I will be updating and expanding this book as the semester progresses. If we have not yet reached a certain module, then that module’s content in this book is likely incomplete.

0.1 About this course

“Inference” in data analysis refers to making claims using data, where the claims are applied beyond the scope of the data themselves.[1] Inferential reasoning, then, refers to the types of reasoning we employ when arriving at these claims or when justifying them.

Example: The Phase 3 trial for Pfizer’s COVID-19 vaccine had roughly 43,000 participants, split into vaccine and placebo (no vaccine) groups. Out of 170 confirmed COVID-19 cases among participants, 162 were in the placebo group and 8 were in the vaccine group. (Link to paper)

We infer from this that:

  • The vaccine is effective, because if the vaccine were equivalent to placebo, it is extremely unlikely that chance alone would make the number of cases among those vaccinated (8) so much smaller than the number of cases among those not vaccinated (162).

  • A person who would have been infected if unvaccinated has a roughly 5% chance of getting infected if vaccinated (8/162 = 0.049). But, since we only have 170 cases, there is sizable uncertainty in this estimate. We think the population-level value is somewhere between 2.4% and 9.7%.
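
The two calculations above can be sketched in a few lines of code. This is a minimal illustration written in Python (the course's own example scripts are in R), and it is not the analysis used in the trial paper. It assumes the two groups were the same size, so that under "vaccine equivalent to placebo" each case is equally likely to fall in either group. The uniform prior, the random seed, and all function and variable names are assumptions made only for illustration.

```python
from math import comb
import random

CASES_VACCINE = 8
CASES_PLACEBO = 162
TOTAL_CASES = CASES_VACCINE + CASES_PLACEBO  # 170 confirmed cases


def prob_at_most(k, n, p=0.5):
    """Exact binomial probability of k or fewer successes in n trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Assumption: equal group sizes, so if the vaccine did nothing, each case
# would land in the vaccine group with probability 0.5.
p_chance = prob_at_most(CASES_VACCINE, TOTAL_CASES)

# Point estimate quoted in the text: vaccine cases relative to placebo cases.
point_estimate = CASES_VACCINE / CASES_PLACEBO


def simulate_ratio_interval(n_draws=100_000, seed=335):
    """Rough 95% interval for the vaccine/placebo case ratio.

    Assumption: a uniform Beta(1, 1) prior on the share of cases that occur
    in the vaccine group. This is NOT the prior used in the trial paper.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_draws):
        # Posterior draw for the share of cases falling in the vaccine group.
        share = rng.betavariate(CASES_VACCINE + 1, CASES_PLACEBO + 1)
        # Convert the share into a vaccine-to-placebo ratio.
        ratios.append(share / (1 - share))
    ratios.sort()
    return ratios[int(0.025 * n_draws)], ratios[int(0.975 * n_draws)]


lower, upper = simulate_ratio_interval()

print(f"P(8 or fewer of 170 cases in vaccine group by chance): {p_chance:.3g}")
print(f"Point estimate of the ratio: {point_estimate:.3f}")
print(f"Simulated 95% interval for the ratio: {lower:.3f} to {upper:.3f}")
```

Because the prior and the method differ from those in the paper, the simulated interval should land in the same neighborhood as the 2.4% to 9.7% range quoted above, but it is not expected to match it exactly.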

We notice that, even in the placebo group, the number of cases is very small compared to the number of participants (162/21728 = 0.0075). But we should be wary of treating this as an estimate of infection probability for people who are unvaccinated, due to potential sources of bias:

  • Those who participate in a phase 3 vaccine trial may be less likely than most to be exposed to the virus.

  • Participants were only tested if they showed symptoms, so many cases were likely missed.
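
To see how symptom-only testing by itself can pull the observed case proportion below the true infection probability, here is a small simulation sketch in Python. The infection probability (2%) and the fraction of infections that produce symptoms (50%) are hypothetical values chosen only for illustration; they are not estimates from the trial. The group size of 21,728 is the placebo-group denominator used above.

```python
import random

N_PLACEBO = 21_728  # placebo-group size used in the text


def simulate_trial_arm(n_participants, true_infection_prob,
                       symptomatic_fraction, seed=335):
    """Count true infections and the subset detected by symptom-only testing."""
    rng = random.Random(seed)
    infected = 0
    detected = 0
    for _ in range(n_participants):
        if rng.random() < true_infection_prob:
            infected += 1
            # Only infections that produce symptoms lead to a test,
            # so only those are recorded as confirmed cases.
            if rng.random() < symptomatic_fraction:
                detected += 1
    return infected, detected


# Hypothetical parameters, for illustration only.
infected, detected = simulate_trial_arm(
    N_PLACEBO, true_infection_prob=0.02, symptomatic_fraction=0.5
)

print(f"True infection proportion:     {infected / N_PLACEBO:.4f}")
print(f"Observed (detected) proportion: {detected / N_PLACEBO:.4f}")
```

Under these made-up settings, the observed proportion comes out at roughly half the true one, which is the kind of gap that makes 162/21728 a questionable estimate of infection probability.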

All of this reasoning is inferential, because we are making claims about the general effectiveness of the vaccine, based on observed case numbers from a sample.

Inference involves making a “leap” from data to the questions that motivated us to collect the data in the first place. This leap may be larger than we realize. It is the central metaphor of this course.

0.2 Course topics outline

  1. Statistics background
  2. Correlation and causation
  3. Forms of logic in science and statistics
  4. Effect sizes
  5. Statistical power and the sampling distribution of the p-value
  6. Probability: Frequentist vs. Bayesian
  7. Model assumptions
  8. Interpreting regression coefficients
  9. Ethical concerns in data analysis
  10. Replication

  1. Formally, this is often framed as drawing inference from a sample to a population, but I think the phrase is still meaningful without needing to invoke a population.