Chapter 4 Statistical inference

Statistical inference is about learning about things you do not know (\(\theta\)) from things you do know, e.g., data from a sample (\(x\)). The general idea is to infer something unknown using statistical procedures. What we want to infer should be quantifiable, so the concrete focus of statistical inference lies in one or more quantities of interest.

Let’s start by presenting some basic concepts for statistical inference:

  • Estimand: the unknown population parameter that will be estimated in a statistical analysis.
  • Estimation: the process of finding an estimate.
  • Estimate: the quantity representing the estimand, obtained given some assumptions and data.
  • Estimator: a rule for calculating an estimate of a given quantity based on observed data.

In short, there is an unknown quantity of interest (the parameter or estimand), and a combination of observed data and a rule (the estimator) is used to find the best possible representation of it (the estimate).

For example, if we are interested in knowing the general well-being of individuals, we should seek a quantity of interest that can account for it. Assuming that the best judge of someone’s well-being is the individual herself/himself, we could focus on measuring subjective life satisfaction. A survey instrument would then require a respondent to state how satisfied she/he feels about her/his life on a certain scale, e.g., from Extremely dissatisfied (0) to Extremely satisfied (10). But this is just a variable; to determine the quantity of interest one should derive a parameter from that variable, e.g., the population mean of subjective life satisfaction \(\mu_{ls}\).
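The distinction between estimand, estimator, and estimate can be made concrete with a short simulation. The sketch below uses a hypothetical population of life-satisfaction scores on the 0–10 scale (all numbers are illustrative; in a real study the population values would of course be unknown):

```python
import random
import statistics

random.seed(42)

# Hypothetical population of N = 100,000 life-satisfaction scores (0-10 scale).
# We simulate it here only to make the estimand concrete.
N = 100_000
population = [random.randint(0, 10) for _ in range(N)]

# Estimand: the (unknown) population mean of life satisfaction, mu_ls.
mu_ls = statistics.mean(population)

# Estimation: draw a simple random sample and apply the estimator to it.
n = 1_000
sample = random.sample(population, n)

# Estimator: the rule "take the arithmetic mean of the sampled values".
# Estimate: the number that this rule produces for this particular sample.
x_bar_ls = statistics.mean(sample)

print(f"estimand (population mean): {mu_ls:.3f}")
print(f"estimate (sample mean):     {x_bar_ls:.3f}")
```

The estimate will generally differ somewhat from the estimand; how to account for that difference is the topic of Section 4.1.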

There are multiple approaches to statistical inference. Commonly, two broad ones are distinguished: Frequentist and Bayesian. Within each of these approaches there are controversies about the best tools and standards for doing statistical inference. This course will mainly consider the former approach, as it is the most common in the social sciences and in the reporting of official statistics.

In order to actually know the general well-being of a population in some country, we could design a study to answer this question. A first decision would be either to conduct a census – ask the question of all the individuals in the population – or to field a survey with a random sample of the population. The former strategy would be very cost-intensive and time-consuming, as countries’ populations tend to be in the order of millions, while the latter would be much more resource-efficient. After drawing a random sample of the population (which is developed in Part III), the survey is fielded and processed, so we end up with data on subjective well-being for this sample.

Consider the data generation process described in Fig. 3.1. For making a statistical inference we now want to go the opposite way: from the sample data to the population. Given that we know a fundamental part of the data generation process, i.e., that the individuals were selected at random from the population, it is possible to use this knowledge to make an inference. There is a probability distribution that accounts for the process, as each individual \(i\) in the population had a non-zero probability of being part of the sample (\(\pi_i > 0\)).
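Under simple random sampling without replacement, every individual has the same inclusion probability, \(\pi_i = n/N > 0\). A minimal sketch (with hypothetical, deliberately small numbers) checks this empirically by drawing many samples and counting how often each individual is included:

```python
import random
from collections import Counter

random.seed(1)

# Hypothetical population of N individuals, identified by index.
N, n = 50, 10
population = list(range(N))

# Under simple random sampling without replacement,
# each individual's inclusion probability is pi_i = n / N.
pi_theoretical = n / N  # 0.2

# Approximate pi_i empirically: draw many samples and count
# how often each individual ends up in the sample.
draws = 20_000
counts = Counter()
for _ in range(draws):
    counts.update(random.sample(population, n))

pi_empirical = [counts[i] / draws for i in population]
print(f"theoretical pi_i:      {pi_theoretical}")
print(f"empirical pi_i range:  {min(pi_empirical):.3f} to {max(pi_empirical):.3f}")
```

More complex designs assign unequal \(\pi_i\), but the requirement \(\pi_i > 0\) for every individual is what makes design-based inference possible.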

Nonetheless, as we showed in Section 3.1, there are multiple possible samples of size \(n\) that we could have obtained from a population of size \(N\), given that \(n < N\). Clearly, any quantity of interest is unlikely to be the same over all of the possible samples from the population (see Section 5.2 on sampling error). But sampling theory provides us with two important insights:

  • If a random procedure was used to generate the sample data, the estimate is expected, on average across all possible samples, to equal the population parameter (i.e., the estimator is unbiased).
  • With larger sample sizes (\(n\)), more accurate estimates will be obtained.
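Both insights can be illustrated by simulation. The sketch below (hypothetical 0–10 life-satisfaction scores, illustrative numbers) repeatedly draws samples of two different sizes, then compares the average of the estimates with the population mean and the spread of the estimates across sample sizes:

```python
import random
import statistics

random.seed(7)

# Hypothetical population of 0-10 life-satisfaction scores.
population = [random.randint(0, 10) for _ in range(10_000)]
mu = statistics.mean(population)

def sample_means(n, reps=2_000):
    """Sample means over many repeated samples of size n."""
    return [statistics.mean(random.sample(population, n)) for _ in range(reps)]

small, large = sample_means(n=25), sample_means(n=400)

# Insight 1: on average across samples, the estimate equals the parameter.
print(f"population mean:          {mu:.3f}")
print(f"mean of estimates (n=25): {statistics.mean(small):.3f}")

# Insight 2: larger n gives less variable, hence more accurate, estimates.
print(f"spread (sd) at n=25:  {statistics.stdev(small):.3f}")
print(f"spread (sd) at n=400: {statistics.stdev(large):.3f}")
```

The first insight is a statement about the average over all possible samples, not about any single sample; the second shows up as a visibly smaller standard deviation of the estimates at \(n = 400\) than at \(n = 25\).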

4.1 Uncertainty in statistical inference

As we have seen, the original inquiry about subjective well-being can be satisfactorily conducted with a probability-based sample. It is possible to use known information from the study, the sample mean of subjective life satisfaction (\(\bar{x}_{ls}\)), to estimate the parameter, i.e., the population mean of subjective life satisfaction (\(\mu_{ls}\)). At first this can be surprising and exciting, as we are able to say something about an unknown (the population parameter) with a known (the sample statistic). But there is a cost that has to be taken into account: by drawing a sample we have not observed all of the individuals in the population. It is necessary, then, to account for the uncertainty implied by the sampling process that generated the data. This uncertainty consists in part of sampling error, but this is not the only source of uncertainty in the survey process (see Section 5).

Statistical inference deals mostly with this problem of accounting for the inherent uncertainty attached to the sampling process. It is necessary to find a way to represent the differences that might arise between the sample statistic and the population parameter, given the random process generating the former. As the sample was drawn at random, it is unlikely that any significant bias in relevant characteristics of the quantity of interest will exist. For example, if the population we are studying is dispersed in terms of its subjective well-being, with individuals distributed along all of the 11 categories of the scale (0 to 10), we would not expect to obtain a sample of only the most dissatisfied individuals in the population. Although it is possible to draw a sample of the least satisfied individuals, this would be very improbable. The same applies to a possible sample of the most satisfied individuals. Part IV focuses on different ways to account for the uncertainty in the estimates.
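The improbability of such extreme samples can be made visible by simulation. The sketch below builds a hypothetical population dispersed evenly over all 11 categories and draws many samples; virtually all sample means land close to the population mean, and a sample made up only of the most dissatisfied (or most satisfied) individuals essentially never occurs:

```python
import random
import statistics

random.seed(3)

# Hypothetical dispersed population: 1,000 individuals in each of
# the 11 categories of the 0-10 scale.
population = [i % 11 for i in range(11_000)]
mu = statistics.mean(population)  # 5.0 by construction

# Draw many samples and record how far each sample mean falls from mu.
n, reps = 100, 5_000
means = [statistics.mean(random.sample(population, n)) for _ in range(reps)]

# Most sample means concentrate around the population mean; a mean
# near 0 or 10 (an all-dissatisfied or all-satisfied sample) never shows up.
within_one = sum(abs(m - mu) <= 1 for m in means) / reps
print(f"population mean: {mu:.1f}")
print(f"share of sample means within +/-1 of mu: {within_one:.3f}")
print(f"most extreme sample means: {min(means):.2f}, {max(means):.2f}")
```

This concentration of the sample means around the parameter is exactly the sampling distribution that the interval estimates in Part IV are built on.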