Variable selection with two continuous explanatory variables
Variable selection is, in general, a rather complex topic. We will introduce some of the ideas in the simple case where we have two continuous explanatory variables available, x_1 and x_2 (as opposed to one continuous and one categorical explanatory variable), and we want to decide which variables, if any, will be useful in predicting y or in explaining variation in y (these are very different things!).
One approach is to consider a hierarchy of models:
- E(Y|x_1,x_2) = \alpha+\beta x_1+\gamma x_2
- E(Y|x_1,x_2) = \alpha+\beta x_1
- E(Y|x_1,x_2)=\alpha+\gamma x_2
- E(Y|x_1,x_2) = \alpha
where Y is the response variable and x_1, x_2 are potential explanatory, or predictor, variables. Within this structure we might take a top-down approach.
Fit the most general model, i.e.
E(Y|x_1,x_2) = \alpha+\beta x_1+\gamma x_2
since we believe this is likely to provide a good description of the data.
Construct interval estimates and hypothesis tests for \beta and \gamma.
Both parameters are linear combinations of the parameter vector \boldsymbol{\beta} = (\alpha, \beta, \gamma)^T: \beta = \mathbf{b}^T\boldsymbol{\beta} with \mathbf{b} = (0, 1, 0)^T, and \gamma = \mathbf{b}^T\boldsymbol{\beta} with \mathbf{b} = (0, 0, 1)^T.
If both intervals exclude 0 (i.e. both p-values < 0.05), retain the model with both x_1 and x_2.
If the interval for \beta contains 0 (i.e. p-value > 0.05) but that for \gamma does not, use the model with x_2 alone.
If the interval for \gamma contains 0 (i.e. p-value > 0.05) but that for \beta does not, use the model with x_1 alone.
If both intervals include 0, it may still be that a model with one variable is useful (this can happen, for example, when x_1 and x_2 are strongly correlated). In this case the two single-variable models should each be fitted, and intervals (and tests) for \beta and \gamma constructed and compared with 0.
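The top-down steps above can be sketched in code. The following is a minimal illustration rather than a definitive implementation: it assumes 95% intervals based on the t distribution with n - p degrees of freedom, uses simulated data, and the function names (fit_with_intervals, excludes_zero, top_down) are hypothetical names introduced only for this example.

```python
import numpy as np
from scipy import stats


def fit_with_intervals(X, y, level=0.95):
    """Least-squares fit of y on the columns of X.

    Returns the estimates and a (p, 2) array of confidence intervals.
    """
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    est = xtx_inv @ X.T @ y
    resid = y - X @ est
    sigma2 = resid @ resid / (n - p)           # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(xtx_inv))    # standard error of each estimate
    t_crit = stats.t.ppf(0.5 + level / 2, df=n - p)
    return est, np.column_stack([est - t_crit * se, est + t_crit * se])


def excludes_zero(interval):
    """True if the interval lies entirely above or entirely below 0."""
    lo, hi = interval
    return bool(lo > 0 or hi < 0)


def top_down(x1, x2, y):
    """Return a list of candidate models following the top-down steps."""
    ones = np.ones(len(y))
    # Fit the most general model: alpha + beta * x1 + gamma * x2.
    _, ci = fit_with_intervals(np.column_stack([ones, x1, x2]), y)
    keep_x1, keep_x2 = excludes_zero(ci[1]), excludes_zero(ci[2])
    if keep_x1 and keep_x2:
        return ["x1 + x2"]
    if keep_x2:
        return ["x2 only"]
    if keep_x1:
        return ["x1 only"]
    # Both intervals contain 0: fit each single-variable model in turn.
    _, ci1 = fit_with_intervals(np.column_stack([ones, x1]), y)
    _, ci2 = fit_with_intervals(np.column_stack([ones, x2]), y)
    candidates = []
    if excludes_zero(ci1[1]):
        candidates.append("x1 only")
    if excludes_zero(ci2[1]):
        candidates.append("x2 only")
    return candidates or ["intercept only"]


# Simulated data (an assumption for illustration): x2 has no effect on y.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

chosen = top_down(x1, x2, y)
print(chosen)
```

When both single-variable intervals exclude 0 in the final step, the sketch returns both models as candidates, since the steps above do not say how to choose between them.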
In addition to the information provided by C.I.s and hypothesis tests, it is also wise to consider the proportion of variation explained by different models, as expressed in R^2. This idea leads us on to variable selection, which we will discuss over the next few lectures.
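As a rough illustration of comparing the proportion of variation explained, the sketch below computes R^2 = 1 - RSS/TSS for each of the four models in the hierarchy. The data are simulated and the coefficients used to generate them are arbitrary assumptions; r_squared is a hypothetical helper written for this example.

```python
import numpy as np


def r_squared(X, y):
    """R^2 = 1 - RSS/TSS for a least-squares fit of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ coef) ** 2)      # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
    return 1.0 - rss / tss


# Simulated data, for illustration only.
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)

ones = np.ones(n)
models = {
    "alpha only": np.column_stack([ones]),
    "alpha + x1": np.column_stack([ones, x1]),
    "alpha + x2": np.column_stack([ones, x2]),
    "alpha + x1 + x2": np.column_stack([ones, x1, x2]),
}
r2 = {name: r_squared(X, y) for name, X in models.items()}
for name, value in r2.items():
    print(f"{name:16s} R^2 = {value:.3f}")
```

Because the models are nested, R^2 for the model with both variables can never be smaller than R^2 for either single-variable model, so R^2 is best read alongside the intervals and tests rather than used as the sole criterion.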