Regression models with factors and interactions
Suppose we have a response variable, a continuous explanatory variable and a binary explanatory variable.
We have already discussed three models that can be expressed as:
- A collection of different regression lines (a model which includes an interaction with a factor),
- A collection of parallel regression lines,
- A single regression line (with no differences among the groups).
We use the following notation:
\(y_{ij}\) : response observation \(j\) in group \(i\)
\(x_{ij}\) : explanatory variable observation \(j\) in group \(i\)
\(n_i\) : sample size in group \(i\)
\(p\) : number of groups
\(n = \sum_{i=1}^p n_i\) : total sample size.
The most general model, with a different regression line for each group, could be formulated as
\[E(y_{ij}) = \alpha_i+\beta_ix_{ij}\]
as group \(i\) has its own slope and intercept.
In previous weeks we found that formulating the model in terms of \((x_i-\bar{x})\) led to simpler algebra. The same is true here, so we will formulate the model as
\[E(y_{ij}) = \alpha_i+\beta_i(x_{ij}-\bar{x}_{i.})\]
or
\[y_{ij} = \alpha_i+\beta_i(x_{ij}-\bar{x}_{i.})+\epsilon_{ij}\]
where \(\bar{x}_{i.} = \frac{1}{n_i}\sum_{j=1}^{n_i}x_{ij}\) is the mean of the explanatory variable in group \(i\).
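As a brief check that centring changes only the meaning of the intercept and not the fitted line, the centred form can be expanded. The symbol \(\alpha_i^{*}\) is introduced here only for illustration, to denote the intercept of the uncentred form:
\[\alpha_i+\beta_i(x_{ij}-\bar{x}_{i.}) = (\alpha_i-\beta_i\bar{x}_{i.})+\beta_ix_{ij} = \alpha_i^{*}+\beta_ix_{ij}, \qquad \alpha_i^{*} = \alpha_i-\beta_i\bar{x}_{i.}\]
The slope \(\beta_i\) is the same in both forms. In the centred form, \(\alpha_i\) is the expected response in group \(i\) when \(x_{ij} = \bar{x}_{i.}\), rather than when \(x_{ij} = 0\).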
The models of interest can be expressed as:
- different lines: \(E(y_{ij}) = \alpha_i+\beta_i(x_{ij}-\bar{x}_{i.})\)
- parallel lines: \(E(y_{ij}) = \alpha_i+\beta(x_{ij}-\bar{x}_{i.})\)
- single line: \(E(y_{ij}) = \alpha+\beta(x_{ij}-\bar{x}_{..})\), where \(\bar{x}_{..}\) is the mean of all \(x_{ij}\).
Note that the “single line” model is written in terms of \((x_{ij}-\bar{x}_{..})\), centring on the overall mean rather than the group means, so that no term in the model depends on the group.
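The three models above can be fitted by least squares once each is written as a design matrix. The following is a minimal sketch in Python with NumPy, not part of the original notes: the data are simulated, and the names `rss`, `X_different`, `X_parallel` and `X_single` are illustrative assumptions. It shows that the models are nested, so the residual sum of squares cannot decrease as the model is simplified.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; the data are simulated for illustration only

# Hypothetical data: p = 2 groups with n_i = 20 observations each
group = np.repeat([0, 1], 20)
x = rng.uniform(0, 10, size=group.size)
y = np.where(group == 0, 1.0 + 0.5 * x, 3.0 + 1.5 * x) + rng.normal(0, 1, group.size)


def rss(X, y):
    """Residual sum of squares from a least-squares fit of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)


# Indicator columns, one per group
d = np.column_stack([(group == i).astype(float) for i in (0, 1)])

# Group means of x, expanded to one value per observation
xbar_i = np.array([x[group == i].mean() for i in (0, 1)])[group]
xc_group = x - xbar_i   # x_ij - xbar_i.
xc_all = x - x.mean()   # x_ij - xbar_..

X_different = np.column_stack([d, d * xc_group[:, None]])  # alpha_1, alpha_2, beta_1, beta_2
X_parallel = np.column_stack([d, xc_group])                # alpha_1, alpha_2, beta
X_single = np.column_stack([np.ones_like(x), xc_all])      # alpha, beta

rss_different = rss(X_different, y)
rss_parallel = rss(X_parallel, y)
rss_single = rss(X_single, y)

print("different lines:", rss_different)
print("parallel lines: ", rss_parallel)
print("single line:    ", rss_single)
```

The different, parallel and single line models have 4, 3 and 2 regression parameters respectively when \(p=2\). Whether the increase in residual sum of squares from one model to the next is large enough to matter is the model selection question.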
We will now consider how to select an “appropriate” model given such data. We will begin with the most general model, with different regression lines for each group. Let’s assume that \(p=2\), so that we have two groups (\(i=1,2\)).