## 2.3 Assumptions of the model

A natural25 question to ask is: “Why do we need assumptions?” The answer is that we need probabilistic assumptions to ground statistical inference about the model parameters. Or, in other words, to quantify the variability of the estimator $$\hat{\boldsymbol{\beta}}$$ and to infer properties about the unknown population coefficients $$\boldsymbol{\beta}$$ from the sample $$\{(\mathbf{X}_i, Y_i)\}_{i=1}^n$$.

The assumptions of the multiple linear model are:

1. Linearity: $$\mathbb{E}[Y|X_1=x_1,\ldots,X_p=x_p]=\beta_0+\beta_1x_1+\ldots+\beta_px_p$$.
2. Homoscedasticity: $$\mathbb{V}\text{ar}[\varepsilon|X_1=x_1,\ldots,X_p=x_p]=\sigma^2$$.
3. Normality: $$\varepsilon\sim\mathcal{N}(0,\sigma^2)$$.
4. Independence of the errors: $$\varepsilon_1,\ldots,\varepsilon_n$$ are independent (or uncorrelated, $$\mathbb{E}[\varepsilon_i\varepsilon_j]=0$$, $$i\neq j$$, since they are assumed to be normal).

A good one-line summary of the linear model is the following (independence is implicit)

\begin{align} Y|(X_1=x_1,\ldots,X_p=x_p)\sim \mathcal{N}(\beta_0+\beta_1x_1+\ldots+\beta_px_p,\sigma^2).\tag{2.8} \end{align}

Recall that, except assumption iv, the rest are expressed in terms of the random variables, not in terms of the sample. Thus are population versions, rather than sample versions. It is however trivial to express (2.8) in terms of assumptions about the sample $$\{(\mathbf{X}_i,Y_i)\}_{i=1}^n$$:

\begin{align} Y_i|(X_{i1}=x_{i1},\ldots,X_{ip}=x_{ip})\sim \mathcal{N}(\beta_0+\beta_1x_{i1}+\ldots+\beta_px_{ip},\sigma^2),\tag{2.9} \end{align}

with $$Y_1,\ldots,Y_n$$ being independent conditionally on the sample of predictors. Equivalently stated in a compact matrix way, the assumptions of the model on the sample are:

\begin{align} \mathbf{Y}|\mathbf{X}\sim\mathcal{N}_n(\mathbf{X}\boldsymbol{\beta},\sigma^2\mathbf{I}).\tag{2.10} \end{align}

Figures 2.10 and 2.11 represent situations where the assumptions of the model for $$p=1$$ are respected and violated, respectively.

Figure 2.12 represents situations where the assumptions of the model are respected and violated, for the situation with two predictors. Clearly, the inspection of the scatterplots for identifying strange patterns is more complicated than in simple linear regression – and here we are dealing only with two predictors.

Figure 2.12: Valid (all the assumptions are verified) and problematic (a single assumption does not hold) multiple linear models, when there are two predictors. Application available here.

The dataset assumptions.RData contains the variables x1, …, x9 and y1, …, y9. For each regression y1 ~ x1, …, y9 ~ x9:

• Check whether the assumptions of the linear model are being satisfied (make a scatterplot with a regression line).
• State which assumption(s) are violated and justify your answer.

1. After all, we already have a neat way of estimating $$\boldsymbol{\beta}$$ from the data. Isn’t it the end of the story?