Regression models with factors and interactions

Suppose we have a response variable, a continuous explanatory variable and a binary explanatory variable.

We have already discussed three models that can be expressed as:

  1. A collection of different regression lines (a model which includes an interaction with a factor),
  2. A collection of parallel regression lines,
  3. A single regression line (with no differences among the groups).

These models use the following notation:

\(y_{ij}\) : response observation \(j\) in group \(i\)

\(x_{ij}\) : explanatory variable observation \(j\) in group \(i\)

\(n_i\) : sample size in group \(i\)

\(p\) : number of groups

\(n = \sum_{i=1}^{p}n_i\) : total sample size.

The most general model, (1), can be formulated as

\[E(y_{ij}) = \alpha_i+\beta_ix_{ij}\]

since each group \(i\) has its own intercept \(\alpha_i\) and slope \(\beta_i\).

In previous weeks we found that formulating the model in terms of \((x_i-\bar{x})\) led to simpler algebra. This is also true in the present case, so we will formulate the model as

\[E(y_{ij}) = \alpha_i+\beta_i(x_{ij}-\bar{x}_{i.})\]

or

\[y_{ij} = \alpha_i+\beta_i(x_{ij}-\bar{x}_{i.})+\epsilon_{ij}\]

where \(\bar{x}_{i.} = \frac{1}{n_i}\sum_{j=1}^{n_i}x_{ij}\) is the mean of the explanatory variable \(x_{ij}\) in group \(i\).
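As a short check that centring does not change the model, the centred form can be expanded and compared with the uncentred one. The symbol \(\alpha_i^{*}\) is introduced here only for illustration and denotes the intercept of the uncentred form:

\[\alpha_i+\beta_i(x_{ij}-\bar{x}_{i.}) = (\alpha_i-\beta_i\bar{x}_{i.})+\beta_ix_{ij}, \qquad \text{so} \qquad \alpha_i^{*} = \alpha_i-\beta_i\bar{x}_{i.}\]

The slope \(\beta_i\) is the same in both forms. Only the meaning of the intercept changes: in the centred form, \(\alpha_i\) is the expected response in group \(i\) at \(x_{ij}=\bar{x}_{i.}\), rather than at \(x_{ij}=0\).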

The models of interest can be expressed as:

  1. different lines: \(E(y_{ij}) = \alpha_i+\beta_i(x_{ij}-\bar{x}_{i.})\)
  2. parallel lines: \(E(y_{ij}) = \alpha_i+\beta(x_{ij}-\bar{x}_{i.})\)
  3. single line: \(E(y_{ij}) = \alpha+\beta(x_{ij}-\bar{x}_{..})\), where \(\bar{x}_{..}\) is the mean of all \(x_{ij}\).

Note that the “single line” model is centred on the overall mean, \((x_{ij}-\bar{x}_{..})\), rather than on the group means \(\bar{x}_{i.}\), so that no term in the model differs among the groups.
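The three models listed above can be fitted by ordinary least squares once a design matrix is built for each. The following is a minimal sketch in Python with NumPy, not the method used in this course: the function names `fit_rss` and `three_models`, the group labels and the simulated data are all assumptions made purely for illustration.

```python
import numpy as np

def fit_rss(X, y):
    """Least-squares fit; return (coefficients, residual sum of squares)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return coef, float(resid @ resid)

def three_models(x, y, group):
    """Fit the different-lines, parallel-lines and single-line models.

    x, y  : 1-D arrays of explanatory and response observations
    group : 1-D array of group labels, one per observation
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    group = np.asarray(group)
    labels = np.unique(group)  # the p distinct groups

    # Indicator columns, one per group: shape (n, p)
    D = (group[:, None] == labels[None, :]).astype(float)

    # Group means x-bar_{i.}, then x centred within each group
    group_means = (D * x[:, None]).sum(axis=0) / D.sum(axis=0)
    xc_within = x - D @ group_means
    # x centred on the overall mean x-bar_{..}
    xc_overall = x - x.mean()

    X_diff = np.hstack([D, D * xc_within[:, None]])             # alpha_i, beta_i
    X_par = np.hstack([D, xc_within[:, None]])                  # alpha_i, beta
    X_single = np.column_stack([np.ones_like(x), xc_overall])   # alpha, beta

    return {
        "different": fit_rss(X_diff, y),
        "parallel": fit_rss(X_par, y),
        "single": fit_rss(X_single, y),
    }

# Simulated two-group data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
n_i = 20
group = np.repeat(["A", "B"], n_i)
x = rng.uniform(0, 10, size=2 * n_i)
intercept = np.where(group == "A", 0.0, 3.0)
slope = np.where(group == "A", 1.0, 2.0)
y = intercept + slope * x + rng.normal(0, 1, size=2 * n_i)

fits = three_models(x, y, group)
for name, (coef, rss) in fits.items():
    print(f"{name:>9}: {len(coef)} parameters, RSS = {rss:.2f}")
```

Because the models are nested, the residual sum of squares cannot decrease when moving from different lines to parallel lines to a single line. With the centred parameterisation, the fitted \(\alpha_i\) in the first two models equals the mean response in group \(i\).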

We will now consider how to select an “appropriate” model given such data. We will begin with the most general model, with a different regression line for each group. Let’s assume that \(p=2\), so that we have two groups.