In the Foundations of inference chapters, we have provided three different methods for statistical inference. We will continue to build on all three of the methods throughout the text, and by the end, you should have an understanding of the similarities and differences between them. Meanwhile, it is important to note that the methods are designed to mimic the variability in data, and we know that variability can come from different sources (e.g., random sampling vs. random allocation, see Figure 2.6). In Table 10.1, we have summarized some of the ways the inferential procedures feature specific sources of variability. We hope that you refer back to the table often as you dive more deeply into inferential ideas in future chapters.
| Question | Randomization | Bootstrapping | Mathematical models |
|---|---|---|---|
| What does it do? | Shuffles the explanatory variable to mimic the natural variability found in a randomized experiment | Resamples (with replacement) from the observed data to mimic the sampling variability found by collecting data from a population | Uses theory (primarily the Central Limit Theorem) to describe the hypothetical variability resulting from either repeated randomized experiments or random samples |
| What is the random process described? | Randomized experiment | Random sampling from a population | Randomized experiment or random sampling |
| What other random processes can be approximated? | Can also be used to describe random sampling in an observational model | Can also be used to describe random allocation in an experiment | Randomized experiment or random sampling |
| What is it best for? | Hypothesis testing (can also be used for confidence intervals, but not covered in this text) | Confidence intervals (can also be used for bootstrap hypothesis testing for one proportion) | Quick analyses through, for example, calculating a Z score |
| What physical object represents the simulation process? | Shuffling cards | Pulling marbles from a bag with replacement | Not applicable |
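To make the "quick analysis" in the mathematical-model column concrete, here is a minimal sketch of a Z score calculation for a single proportion. The sample proportion, sample size, and null value are made up for illustration, and the function name `z_score_one_proportion` is our own, not something defined in this text.

```python
import math
from statistics import NormalDist

def z_score_one_proportion(p_hat, p_null, n):
    """Z score for a sample proportion, using the standard error under the null."""
    se_null = math.sqrt(p_null * (1 - p_null) / n)
    return (p_hat - p_null) / se_null

# Hypothetical numbers, chosen only for illustration.
p_hat, p_null, n = 0.56, 0.50, 200

z = z_score_one_proportion(p_hat, p_null, n)

# Two-sided p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"Z = {z:.3f}, two-sided p-value = {p_value:.4f}")
```

The normal approximation behind this calculation is only appropriate when the conditions for the Central Limit Theorem are reasonably met.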
You might have noticed that the word distribution is used throughout this part (and will continue to be used in future chapters). A distribution always describes variability, but sometimes it is worth reflecting on what is varying. Typically the distribution either describes how the observations vary or how a statistic varies. But even when describing how a statistic varies, there is a further consideration with respect to the study design, e.g., does the statistic vary from random sample to random sample or does it vary from random allocation to random allocation? The methods presented in this text (and used in science generally) are typically used interchangeably across ideas of random samples or random allocations of the treatment. Often, the two different analysis methods will give equivalent conclusions. The most important thing to consider is how to contextualize the conclusion in terms of the problem. See Figure 2.6 to confirm that your conclusions are appropriate.
Below, we synthesize the different types of distributions discussed throughout the text. Reading through the definitions and solidifying your understanding will help as you come across these distributions in future chapters. You can always return here to refresh your understanding of the differences between them.
A data distribution describes the shape, center, and variability of the observed data.
This can also be referred to as the sample distribution but we’ll avoid that phrase as it sounds too much like sampling distribution, which is different.
A population distribution describes the shape, center, and variability of the entire population of data.
Except in very rare circumstances of very small, very well-defined populations, this is never observed.
A sampling distribution describes the shape, center, and variability of all possible values of a sample statistic from samples of a given sample size from a given population.
Since the population is never observed, it’s never possible to observe the true sampling distribution either. However, when certain conditions hold, the Central Limit Theorem tells us what the sampling distribution is.
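Although a true sampling distribution cannot be observed in practice, it can be simulated when we pretend to know the population. The sketch below assumes a hypothetical population in which 30% of units are "successes", repeatedly draws samples of size 100, and compares the spread of the resulting sample proportions to the standard error that the Central Limit Theorem predicts. The population proportion, sample size, and function name are all assumptions made for illustration.

```python
import math
import random
import statistics

def simulate_sampling_distribution(p, n, reps, seed=0):
    """Draw `reps` random samples of size `n` from a population with
    success proportion `p` and return the sample proportion of each."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]

# Hypothetical population proportion and sample size.
p, n = 0.3, 100
p_hats = simulate_sampling_distribution(p, n, reps=5000)

simulated_se = statistics.stdev(p_hats)   # spread of the simulated statistics
clt_se = math.sqrt(p * (1 - p) / n)       # standard error predicted by theory

print(f"center of simulated sampling distribution: {statistics.mean(p_hats):.3f}")
print(f"simulated SE: {simulated_se:.4f}   theoretical SE: {clt_se:.4f}")
```

With a large number of repetitions, the simulated center should sit close to the population proportion and the two standard errors should roughly agree.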
A randomization distribution describes the shape, center, and variability of all possible values of a sample statistic from random allocations of the treatment variable.
We computationally generate the randomization distribution, though usually, it’s not feasible to generate the full distribution of all possible values of the sample statistic, so we instead generate a large number of them. Because the treatment variable is reallocated at random, the randomization distribution almost always describes the null hypothesis, i.e., it is centered at the null hypothesized value of the parameter.
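As a rough illustration of generating a randomization distribution, the sketch below shuffles the treatment labels of a small made-up data set many times and recomputes the difference in success proportions after each shuffle. The data, group labels, and function names are hypothetical.

```python
import random
import statistics

def diff_in_props(outcomes, groups):
    """Success proportion in the treatment group minus that in the control group."""
    treat = [o for o, g in zip(outcomes, groups) if g == "treatment"]
    ctrl = [o for o, g in zip(outcomes, groups) if g == "control"]
    return sum(treat) / len(treat) - sum(ctrl) / len(ctrl)

def randomization_distribution(outcomes, groups, reps, seed=0):
    """Shuffle the group labels `reps` times, recomputing the statistic each time."""
    rng = random.Random(seed)
    shuffled = list(groups)          # copy so the original labels are untouched
    stats = []
    for _ in range(reps):
        rng.shuffle(shuffled)        # mimics re-running the random allocation
        stats.append(diff_in_props(outcomes, shuffled))
    return stats

# Made-up data: 1 = success, 0 = failure; 24 units per group.
outcomes = [1] * 14 + [0] * 10 + [1] * 8 + [0] * 16
groups = ["treatment"] * 24 + ["control"] * 24

observed = diff_in_props(outcomes, groups)
null_stats = randomization_distribution(outcomes, groups, reps=2000)

# Two-sided p-value: share of shuffled statistics at least as extreme as observed.
p_value = sum(abs(s) >= abs(observed) - 1e-12 for s in null_stats) / len(null_stats)

print(f"observed difference: {observed:.3f}")
print(f"center of randomization distribution: {statistics.mean(null_stats):.3f}")
print(f"approximate p-value: {p_value:.3f}")
```

Because shuffling breaks any link between treatment and outcome, the shuffled differences should center near zero, the null value for a difference in proportions.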
A bootstrap distribution describes the shape, center, and variability of all possible values of a sample statistic from resamples of the observed data.
We computationally generate the bootstrap distribution, though usually, it’s not feasible to generate all possible resamples of the observed data, so we instead generate a large number of them. Since bootstrap distributions are generated by randomly resampling from the observed data, they are centered at the sample statistic. Bootstrap distributions are most often used for estimation, i.e., we base confidence intervals on them.
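A bootstrap distribution can be sketched in a few lines. The example below resamples a small made-up sample with replacement, records the mean of each resample, and reads off a 95% percentile interval. The data values and the function name are assumptions for illustration, and the percentile interval is only one of several ways to build a bootstrap confidence interval.

```python
import random
import statistics

def bootstrap_distribution(data, stat, reps, seed=0):
    """Resample `data` with replacement `reps` times and apply `stat` to each resample."""
    rng = random.Random(seed)
    n = len(data)
    return [stat(rng.choices(data, k=n)) for _ in range(reps)]

# Made-up observed sample.
sample = [12.1, 9.8, 11.4, 10.3, 13.7, 8.9, 10.8, 12.5, 11.0, 9.5]

boot_means = sorted(bootstrap_distribution(sample, statistics.mean, reps=5000))

# 95% percentile interval: the middle 95% of the bootstrapped statistics.
lower = boot_means[int(0.025 * len(boot_means))]
upper = boot_means[int(0.975 * len(boot_means)) - 1]

print(f"sample mean: {statistics.mean(sample):.2f}")
print(f"center of bootstrap distribution: {statistics.mean(boot_means):.2f}")
print(f"95% percentile interval: ({lower:.2f}, {upper:.2f})")
```

The bootstrap means should center near the observed sample mean, in contrast to a randomization distribution, which centers at the null value.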