Variable selection with two continuous explanatory variables
Variable selection in general is a rather complex topic. We will introduce some of the ideas in the simple case where we have two continuous explanatory variables available, \(x_1\) and \(x_2\), as opposed to one continuous and one categorical explanatory variable, and we want to decide which variables, if any, will be useful in predicting \(y\) or in explaining variation in \(y\) (these are very different things!).
One approach is to consider a hierarchy of models:
- \(E(Y|x_1,x_2) = \alpha+\beta x_1+\gamma x_2\)
- \(E(Y|x_1,x_2) = \alpha+\beta x_1\)
- \(E(Y|x_1,x_2)=\alpha+\gamma x_2\)
- \(E(Y|x_1,x_2) = \alpha\)
where \(Y\) is the response variable and \(x_1\), \(x_2\) are potential explanatory, or predictor, variables. Within this structure we might take a top-down approach.
Fit the most general model, i.e.
\[E(Y|x_1,x_2) = \alpha+\beta x_1+\gamma x_2\]
since we believe this is likely to provide a good description of the data.
Construct interval estimates and hypothesis tests for \(\beta\) and \(\gamma\).
Here \(\beta\) and \(\gamma\) are each a linear combination of the parameter vector \(\boldsymbol{\beta} = \left( \begin{array}{c} \alpha \\ \beta \\ \gamma \\ \end{array} \right)\): \(\beta = \mathbf{b}^T\boldsymbol{\beta}\) with \(\mathbf{b} = \left( \begin{array}{c} 0 \\ 1 \\ 0 \\ \end{array} \right)\), and \(\gamma = \mathbf{b}^T\boldsymbol{\beta}\) with \(\mathbf{b} = \left( \begin{array}{c} 0 \\ 0 \\ 1 \\ \end{array} \right)\).
If both 95% intervals exclude 0 (i.e. both p-values \(<0.05\)) then retain the model with both \(x_1\) and \(x_2\).
If the interval for \(\beta\) contains 0 (i.e. p-value \(>0.05\)) but that for \(\gamma\) does not, use the model with \(x_2\) alone.
If the interval for \(\gamma\) contains 0 (i.e. p-value \(>0.05\)) but that for \(\beta\) does not, use the model with \(x_1\) alone.
If both intervals include 0, it may still be that a model with one variable is useful (for example, when \(x_1\) and \(x_2\) are strongly correlated). In this case the two single-variable models should be fitted, and intervals (and tests) for \(\beta\) and \(\gamma\) constructed and compared with 0.
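The top-down rules above can be sketched in code. The following is a minimal illustration, not a definitive implementation: it assumes NumPy and SciPy are available, uses ordinary least squares with the usual t-based 95% intervals, and the function names (`coef_intervals`, `excludes_zero`, `top_down_select`) and the tiny dataset are hypothetical, made up purely for illustration. When both single-variable models turn out to be useful, the rules above do not say which to prefer, so the sketch simply returns both candidates.

```python
import numpy as np
from scipy import stats


def coef_intervals(X, y, level=0.95):
    """Least-squares estimates with t-based confidence limits for each column of X."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    sigma2 = resid @ resid / (n - p)                 # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    t_crit = stats.t.ppf(0.5 + level / 2, df=n - p)  # e.g. the 0.975 quantile
    return beta_hat, beta_hat - t_crit * se, beta_hat + t_crit * se


def excludes_zero(lower, upper):
    """True when the interval (lower, upper) does not contain 0."""
    return bool(lower > 0 or upper < 0)


def top_down_select(x1, x2, y, level=0.95):
    """Apply the top-down rules and return the names of the retained variables."""
    ones = np.ones_like(y, dtype=float)
    # Fit the most general model: E(Y | x1, x2) = alpha + beta*x1 + gamma*x2.
    _, lower, upper = coef_intervals(np.column_stack([ones, x1, x2]), y, level)
    keep_x1 = excludes_zero(lower[1], upper[1])      # interval for beta
    keep_x2 = excludes_zero(lower[2], upper[2])      # interval for gamma
    if keep_x1 and keep_x2:
        return ["x1", "x2"]
    if keep_x2:
        return ["x2"]
    if keep_x1:
        return ["x1"]
    # Both intervals include 0: fit each single-variable model separately.
    candidates = []
    for name, x in (("x1", x1), ("x2", x2)):
        _, lower, upper = coef_intervals(np.column_stack([ones, x]), y, level)
        if excludes_zero(lower[1], upper[1]):
            candidates.append(name)
    return candidates                                # may be empty, or hold both names


# Hypothetical data: a balanced design in which y depends on x1 but not on x2.
x1 = np.array([-1, -1, -1, -1, 1, 1, 1, 1], dtype=float)
x2 = np.array([-1, -1, 1, 1, -1, -1, 1, 1], dtype=float)
noise = 0.1 * np.array([-1, 1, -1, 1, -1, 1, -1, 1], dtype=float)
y = 1 + 2 * x1 + noise

print(top_down_select(x1, x2, y))
```

For this made-up dataset the interval for \(\beta\) is roughly \(2 \pm 0.115\) and the interval for \(\gamma\) is roughly \(0 \pm 0.115\), so the sketch retains \(x_1\) alone.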
In addition to the information provided by C.I.s and hypothesis tests, it is also wise to consider the proportion of variation explained by different models, as expressed in \(R^2\). This idea leads us on to variable selection, which we will discuss over the next few lectures.
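As a rough illustration of comparing \(R^2\) across the four models in the hierarchy, the sketch below fits each one by least squares and reports the proportion of variation explained. It assumes NumPy is available; the helper `r_squared` and the small dataset are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import numpy as np


def r_squared(X, y):
    """Proportion of variation in y explained by the least-squares fit on X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    total_ss = np.sum((y - y.mean()) ** 2)   # variation about the mean
    return 1.0 - (resid @ resid) / total_ss


# Hypothetical data: y depends strongly on x1 and more weakly on x2.
x1 = np.array([-1, -1, -1, -1, 1, 1, 1, 1], dtype=float)
x2 = np.array([-1, -1, 1, 1, -1, -1, 1, 1], dtype=float)
noise = 0.1 * np.array([-1, 1, -1, 1, -1, 1, -1, 1], dtype=float)
y = 1 + 2 * x1 + 1 * x2 + noise

ones = np.ones_like(y)
designs = {
    "x1 + x2": np.column_stack([ones, x1, x2]),
    "x1": np.column_stack([ones, x1]),
    "x2": np.column_stack([ones, x2]),
    "intercept only": ones.reshape(-1, 1),
}
r2 = {name: r_squared(X, y) for name, X in designs.items()}

for name, value in r2.items():
    print(f"{name:>15}: R^2 = {value:.3f}")
```

With these made-up numbers the full model explains about 99.8% of the variation, \(x_1\) alone about 79.8%, \(x_2\) alone about 20.0%, and the intercept-only model 0%. Note that \(R^2\) cannot decrease when a variable is added to a least-squares model, so it is best read alongside the intervals and tests rather than on its own.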