The Art of Data Science

6.3 Describe a model for the population

We need an abstract representation of how elements of the population are related to each other. Usually, this comes in the form of a statistical model that we can represent using mathematical notation. In more complex situations, however, we may resort to algorithmic representations that cannot be written down neatly on paper (many machine learning approaches have to be described this way). One of the simplest models is a linear model, such as

$y = \beta_0 +\beta_1 x + \varepsilon$

Here, $x$ and $y$ are features of the population, and $\beta_0$ and $\beta_1$ describe the relationship between those features (for example, whether they are positively or negatively associated). The final element $\varepsilon$ is a catch-all intended to capture all of the factors that contribute to the difference between the observed $y$ and what we expect $y$ to be, which is $\beta_0 + \beta_1 x$. It is this last part that makes the model a statistical model, because we typically allow $\varepsilon$ to be random.
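
To make the roles of the pieces concrete, here is a minimal sketch that simulates data from the linear model above and then recovers the two coefficients by ordinary least squares. The specific values (an intercept of 1, a slope of 2, normally distributed errors with standard deviation 1, and `x` drawn uniformly between 0 and 10) and the function names `simulate_linear` and `fit_least_squares` are assumptions chosen for illustration; they do not come from any particular dataset.

```python
import random


def simulate_linear(n, beta0, beta1, sigma, seed=0):
    """Draw n (x, y) pairs from y = beta0 + beta1 * x + eps.

    Assumption for illustration: eps is Normal(0, sigma) and
    x is Uniform(0, 10). Neither choice is prescribed by the text.
    """
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys


def fit_least_squares(xs, ys):
    """Return (intercept, slope) estimated by ordinary least squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


xs, ys = simulate_linear(n=2000, beta0=1.0, beta1=2.0, sigma=1.0)
b0_hat, b1_hat = fit_least_squares(xs, ys)
print(f"estimated beta0 = {b0_hat:.2f}, estimated beta1 = {b1_hat:.2f}")
```

The estimates should land close to, but not exactly on, the values used in the simulation. The gap is the contribution of the random error term: setting `sigma=0.0` removes it, and the fit then recovers the coefficients up to floating-point rounding.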

Another characteristic that we typically need to make an assumption about is how different units in the population interact with each other. Without any additional information, we will usually assume that the units in the population are independent, meaning that the measurements on one unit do not provide any information about the measurements on another unit. At best, this assumption is approximately true, but it can be a useful approximation. In some situations, such as when studying things that are closely connected in space or time, the assumption is clearly false, and we must resort to special modeling approaches to account for the lack of independence.
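
As a rough illustration of what a violation of independence looks like, the sketch below compares two simulated sequences of measurements: one in which each value is drawn independently, and one in which each value depends on the one before it. The dependent sequence uses a first-order autoregressive process with coefficient `rho=0.8`, which is just one simple form of dependence in time picked for illustration, and the function names are hypothetical. The lag-1 autocorrelation measures how much one measurement tells us about the next.

```python
import random


def lag1_autocorrelation(values):
    """Sample correlation between each value and the one that follows it."""
    n = len(values)
    mean = sum(values) / n
    num = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    den = sum((v - mean) ** 2 for v in values)
    return num / den


def independent_errors(n, seed=0):
    """Each value is drawn on its own from Normal(0, 1)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def ar1_errors(n, rho, seed=0):
    """Each value is rho times the previous value plus fresh Normal(0, 1) noise.

    Assumption for illustration: a first-order autoregressive process,
    standing in for measurements that are closely connected in time.
    """
    rng = random.Random(seed)
    errors = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        errors.append(rho * errors[-1] + rng.gauss(0.0, 1.0))
    return errors


r_iid = lag1_autocorrelation(independent_errors(5000))
r_dep = lag1_autocorrelation(ar1_errors(5000, rho=0.8))
print(f"lag-1 autocorrelation, independent units: {r_iid:.2f}")
print(f"lag-1 autocorrelation, dependent units:   {r_dep:.2f}")
```

For the independent sequence the autocorrelation should be near zero, while for the dependent sequence it should be near the chosen `rho`. A check like this on real data, or on the residuals of a fitted model, is one simple way to see whether the independence assumption is a reasonable approximation.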

George Box, a statistician, once said that “all models are wrong, but some are useful”. Whatever model you devise for describing the features of a population is likely to be technically wrong. But you shouldn’t be fixated on developing a correct model; rather, you should identify a model that is useful to you and tells a story about the data and about the underlying processes that you are trying to study.