Regression models with factors and interactions
Suppose we have a response variable, a continuous explanatory variable and a binary explanatory variable (a factor).
We have already discussed three models that can be expressed as:
- A collection of different regression lines (a model which includes an interaction with a factor),
- A collection of parallel regression lines,
- A single regression line (with no differences among the groups).
We use the following notation:
y_{ij} : response observation j in group i
x_{ij} : explanatory variable observation j in group i
n_i : sample size in group i
p : number of groups
n = \sum_{i=1}^{p} n_i : total sample size
The most general model, the first in the list above, can be formulated as
\begin{equation*} E(y_{ij}) = \alpha_i+\beta_ix_{ij} \end{equation*}
as group i has its own slope and intercept.
In previous weeks we have found that the formulation in terms of (x_i-\bar{x}) led to simpler algebra. This is also true in the present case and we will therefore formulate the model as
E(y_{ij}) = \alpha_i+\beta_i(x_{ij}-\bar{x}_{i.})
or
y_{ij} = \alpha_i+\beta_i(x_{ij}-\bar{x}_{i.})+\epsilon_{ij}
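where \bar{x}_{i.} = \frac{1}{n_i}\sum_{j=1}^{n_i}x_{ij} is the mean of the explanatory variable in group i. With this centring, \alpha_i is the expected response at x = \bar{x}_{i.} rather than at x = 0, while the slope \beta_i is unchanged.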
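As a quick numerical check of why centring simplifies the algebra, the following sketch fits a line to a single group in both the uncentred and the centred form. The data are simulated purely for illustration (the sample size, coefficients and noise level are assumptions, not values from these notes). It illustrates the standard least squares results that the centred intercept estimate equals the group mean of the response and that the slope is the same in both forms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data for one group i (n_i = 20); all values are illustrative.
x = rng.uniform(0, 10, size=20)
y = 3.0 + 0.5 * x + rng.normal(0, 1, size=20)

x_bar = x.mean()
y_bar = y.mean()

# Uncentred fit: E(y) = a_raw + b_raw * x
X_raw = np.column_stack([np.ones_like(x), x])
a_raw, b_raw = np.linalg.lstsq(X_raw, y, rcond=None)[0]

# Centred fit: E(y) = a_cen + b_cen * (x - x_bar)
X_cen = np.column_stack([np.ones_like(x), x - x_bar])
a_cen, b_cen = np.linalg.lstsq(X_cen, y, rcond=None)[0]

# Closed-form slope from the centred sums of squares and products
s_xy = np.sum((x - x_bar) * (y - y_bar))
s_xx = np.sum((x - x_bar) ** 2)

print("centred intercept:", a_cen, " group mean of y:", y_bar)
print("slopes (uncentred, centred, S_xy/S_xx):", b_raw, b_cen, s_xy / s_xx)
print("uncentred intercept:", a_raw, " a_cen - b_cen * x_bar:", a_cen - b_cen * x_bar)
```

The two fits describe the same line; only the meaning of the intercept changes.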
The models of interest can be expressed as:
- different lines: E(y_{ij}) = \alpha_i+\beta_i(x_{ij}-\bar{x}_{i.})
- parallel lines: E(y_{ij}) = \alpha_i+\beta(x_{ij}-\bar{x}_{i.})
- single line: E(y_{ij}) = \alpha+\beta(x_{ij}-\bar{x}_{..}), where \bar{x}_{..} is the mean of all x_{ij}.
Note that the "single line" model is centred on the overall mean, (x_{ij}-\bar{x}_{..}), rather than on the group means, because it makes no distinction among the groups.
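The three models can be fitted by ordinary least squares using a design matrix for each. The sketch below does this with NumPy on simulated data for p = 2 groups; the data, the helper name `rss` and the variable names are assumptions made for illustration only. Because the models are nested, the residual sum of squares cannot decrease as we move from different lines to parallel lines to a single line.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: p = 2 groups with n_i = 15 observations each.
p, n_i = 2, 15
group = np.repeat(np.arange(p), n_i)          # group label i for each observation
x = rng.uniform(0, 10, size=p * n_i)
true_alpha = np.array([2.0, 4.0])             # illustrative values only
true_beta = np.array([0.5, 1.0])
y = true_alpha[group] + true_beta[group] * x + rng.normal(0, 1, size=p * n_i)
n = y.size

# Group indicator columns and centred explanatory variables
G = (group[:, None] == np.arange(p)).astype(float)
x_bar_i = np.array([x[group == i].mean() for i in range(p)])
xc_group = x - x_bar_i[group]                 # x_ij - xbar_i.
xc_all = x - x.mean()                         # x_ij - xbar_..

def rss(X, y):
    """Residual sum of squares from a least squares fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

# Design matrices for the three models
X_diff = np.column_stack([G, G * xc_group[:, None]])      # alpha_i, beta_i: 2p parameters
X_par = np.column_stack([G, xc_group])                    # alpha_i, beta: p + 1 parameters
X_single = np.column_stack([np.ones_like(x), xc_all])     # alpha, beta: 2 parameters

results = {
    "different lines": (rss(X_diff, y), X_diff.shape[1]),
    "parallel lines": (rss(X_par, y), X_par.shape[1]),
    "single line": (rss(X_single, y), X_single.shape[1]),
}

for name, (r, k) in results.items():
    print(f"{name}: RSS = {r:.3f}, parameters = {k}, residual df = {n - k}")
```

The residual degrees of freedom are n - 2p, n - (p + 1) and n - 2 respectively, which are the quantities needed when comparing nested models.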
We will now consider how to select an 'appropriate' model given such data. We will begin with the most general model, with a different regression line for each group. Let's assume that p=2, so that we have two groups.