Variable selection with several explanatory variables

If we have only a few explanatory variables, then an extension of the strategy outlined in the previous lecture is effective: start with the full model and simplify by removing terms until no further terms can be removed. When the number of explanatory variables is ‘large’, the problem becomes more difficult, and it can be hard to find a ‘good’ model. Variable selection is a process that aims to solve this problem. It is important to realise that this process is not an exact procedure and there is no ‘correct’ answer. In general we need to choose two things.

  1. A criterion that we can use to compare models.

  2. A strategy that we can use to search for models.

There are automatic strategies, which systematically search through the entire list of variables not in the current model and decide whether each should be included. We will discuss a few in this section.
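As a rough illustration of how a criterion and a search strategy fit together, the sketch below pairs AIC (as the criterion) with forward selection (as the search strategy). Both choices are assumptions made for the example, not the only options, and the data are simulated: the variable names `x1`, `x2`, `x3` and the helper functions `ols_rss`, `aic` and `forward_select` are hypothetical. It uses only the Python standard library.

```python
import math
import random


def ols_rss(columns, y):
    """Residual sum of squares from regressing y on an intercept plus `columns`."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations: (X'X | X'y).
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gaussian elimination with partial pivoting.
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[pivot] = A[pivot], A[k]
        for r in range(k + 1, p):
            f = A[r][k] / A[k][k]
            for c in range(k, p + 1):
                A[r][c] -= f * A[k][c]
    # Back-substitution for the coefficient estimates.
    b = [0.0] * p
    for k in reversed(range(p)):
        b[k] = (A[k][p] - sum(A[k][c] * b[c] for c in range(k + 1, p))) / A[k][k]
    return sum((y[i] - sum(X[i][j] * b[j] for j in range(p))) ** 2 for i in range(n))


def aic(columns, y):
    """AIC for a normal linear model, up to an additive constant."""
    n = len(y)
    k = len(columns) + 1  # regression coefficients, including the intercept
    return n * math.log(ols_rss(columns, y) / n) + 2 * k


def forward_select(data, y):
    """Add, one at a time, the variable that lowers AIC most; stop when none does."""
    selected = []
    best = aic([], y)  # start from the intercept-only model
    while True:
        candidates = [name for name in data if name not in selected]
        if not candidates:
            break
        scores = {name: aic([data[v] for v in selected + [name]], y)
                  for name in candidates}
        name = min(scores, key=scores.get)
        if scores[name] >= best:
            break  # no candidate improves the criterion
        selected.append(name)
        best = scores[name]
    return selected, best


# Simulated data (an assumption for illustration): y depends on x1 and x2 only.
random.seed(1)
n = 200
data = {name: [random.gauss(0, 1) for _ in range(n)] for name in ("x1", "x2", "x3")}
y = [2 + 3 * a - 2 * b + random.gauss(0, 1) for a, b in zip(data["x1"], data["x2"])]

selected, best = forward_select(data, y)
print("Selected variables:", selected)
print("AIC of selected model:", round(best, 2))
```

A different criterion or a different search (for example, starting from the full model and removing terms) could lead to a different final model, which is one reason the result of an automatic search should not be treated as the ‘correct’ answer.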

These strategies need to be handled with care, and a proper discussion of them is beyond the remit of this course. Our best strategy is a mixture of judgement about which variables should be included as potential explanatory variables, together with interval estimation and hypothesis testing to assess them. The judgement should be made in the light of advice from the problem context.