Regression models with factors and interactions

Suppose we have a response variable, a continuous explanatory variable and a binary explanatory variable (a factor).

We have already discussed three models that can be expressed as:

(1) A collection of different regression lines (a model which includes an interaction with a factor),

(2) A collection of parallel regression lines,

(3) A single regression line (with no differences among the groups).

We use the following notation:

$y_{ij}$ : response observation $j$ in group $i$

$x_{ij}$ : explanatory variable observation $j$ in group $i$

$n_i$ : sample size in group $i$

$p$ : number of groups

$n = \sum_{i=1}^{p} n_i$ : total sample size.

The most general model, (1), can be formulated as

$E(y_{ij}) = \alpha_i+\beta_ix_{ij}$

since each group $i$ has its own intercept $\alpha_i$ and slope $\beta_i$.

In previous weeks we have found that the formulation in terms of $(x_i-\bar{x})$ led to simpler algebra. This is also true in the present case and we will therefore formulate the model as

$E(y_{ij}) = \alpha_i+\beta_i(x_{ij}-\bar{x}_{i.})$

$y_{ij} = \alpha_i+\beta_i(x_{ij}-\bar{x}_{i.})+\epsilon_{ij}$

where $\bar{x}_{i.} = \frac{1}{n_i}\sum_{j=1}^{n_i}x_{ij}$ is the mean of the explanatory variable in group $i$.
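The effect of centring can be checked numerically. The following is a minimal sketch for a single group, using invented data and a hypothetical helper `fit_line` (neither comes from the course material). It illustrates that centring the explanatory variable leaves the slope and the fitted line unchanged, while the intercept becomes the group mean of the response.

```python
def fit_line(x, y):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Invented observations for one group, purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
x_centred = [xi - x_bar for xi in x]

a_raw, b_raw = fit_line(x, y)          # y = a_raw + b_raw * x
a_cen, b_cen = fit_line(x_centred, y)  # y = a_cen + b_cen * (x - x_bar)

print("slopes:", b_raw, b_cen)
print("centred intercept:", a_cen, "group mean of y:", y_bar)
```

In the centred form the intercept estimate does not depend on the slope estimate, which is what makes the algebra simpler.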

The models of interest can be expressed as:

different lines: $E(y_{ij}) = \alpha_i+\beta_i(x_{ij}-\bar{x}_{i.})$

parallel lines: $E(y_{ij}) = \alpha_i+\beta(x_{ij}-\bar{x}_{i.})$

single line: $E(y_{ij}) = \alpha+\beta(x_{ij}-\bar{x}_{..})$ , where $\bar{x}_{..}$ is the mean of all $x_{ij}$ .

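Note that the “single line” model is written in terms of $(x_{ij}-\bar{x}_{..})$, the deviation from the overall mean rather than from the group mean, so that there are no differences among the groups. Centring on $\bar{x}_{i.}$ instead would give group $i$ the intercept $\alpha-\beta\bar{x}_{i.}$ on the original $x$ scale, which describes a set of parallel lines rather than a single line.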
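As a rough illustration of the three models, the sketch below fits each of them to invented two-group data and reports the residual sum of squares. The data and the helper names (`mean`, `sxx`, `sxy`, `rss`) are assumptions made for this example. The estimates used are the standard least-squares results for the centred parametrisation (group mean of $y$ for each $\alpha_i$, and a pooled within-group slope for the parallel-lines model), stated here without derivation.

```python
def mean(v):
    return sum(v) / len(v)

def sxx(x):
    m = mean(x)
    return sum((xi - m) ** 2 for xi in x)

def sxy(x, y):
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

def rss(x, y, alpha, beta, centre):
    """Residual sum of squares for the line alpha + beta * (x - centre)."""
    return sum((yi - alpha - beta * (xi - centre)) ** 2 for xi, yi in zip(x, y))

# Invented data: group label -> (x values, y values).
groups = {
    1: ([1.0, 2.0, 3.0, 4.0], [2.0, 2.9, 4.1, 5.0]),
    2: ([2.0, 3.0, 4.0, 5.0], [6.1, 8.0, 9.8, 12.1]),
}

# Different lines: each group has its own intercept and slope.
rss_different = 0.0
for x, y in groups.values():
    beta_i = sxy(x, y) / sxx(x)
    rss_different += rss(x, y, mean(y), beta_i, mean(x))

# Parallel lines: own intercepts, one slope pooled across groups.
beta_common = (sum(sxy(x, y) for x, y in groups.values())
               / sum(sxx(x) for x, _ in groups.values()))
rss_parallel = sum(rss(x, y, mean(y), beta_common, mean(x))
                   for x, y in groups.values())

# Single line: ignore the grouping and centre on the overall mean.
x_all = [xi for x, _ in groups.values() for xi in x]
y_all = [yi for _, y in groups.values() for yi in y]
beta_single = sxy(x_all, y_all) / sxx(x_all)
rss_single = rss(x_all, y_all, mean(y_all), beta_single, mean(x_all))

print("different lines RSS:", rss_different)
print("parallel lines RSS: ", rss_parallel)
print("single line RSS:    ", rss_single)
```

Because the models are nested, the residual sum of squares cannot decrease when moving from different lines to parallel lines to a single line. How large an increase is acceptable is the model selection question.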

We will now consider how to select an “appropriate” model given such data. We will begin with the most general model, with a different regression line for each group. Let’s assume that $p=2$, so that we have two groups ($i=1,2$).