12.8 Variable Importance Measures
- Variable importance: Which are the most important predictors?
- Single decision tree: Interpretation is easy, just look at the splits in the graph
- e.g., in Figure 8.6 (James et al. 2013, 313), lower right, the thallium stress test (Thal) is the most important predictor
- Bag of a large number of trees: We can no longer visualize a single tree, and it is no longer clear which variables are most relevant for the splits
- Bagging improves prediction accuracy at the expense of interpretability (James et al. 2013, 319)
- Overall summary of importance of each predictor
- Using Gini index (measure of node purity) for bagging classification trees (or RSS for regression trees)
- Classification trees: Add up the total amount that the Gini index is decreased (i.e., node purity increased) by splits over a given predictor, averaged over all \(B\) trees
- Gini index: a small value indicates that a node contains predominantly observations from a single class
- See Figure 8.9 (James et al. 2013, 319) for a graphical representation of importance: mean decrease in Gini index for each variable, expressed relative to the largest
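The averaging described above can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure of any particular package: the function names (`gini`, `gini_decrease`, `mean_decrease_gini`, `relative_importance`), the predictor names `X1` and `X2`, and the toy splits are all made up for the example. Weighting each child node by its share of the parent's observations is likewise an assumption.

```python
from collections import Counter, defaultdict


def gini(labels):
    """Gini index of a node: sum over classes k of p_k * (1 - p_k)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return sum((c / n) * (1 - c / n) for c in counts.values())


def gini_decrease(parent, left, right):
    """Decrease in Gini index achieved by one split.

    Assumption: each child is weighted by its share of the parent's
    observations. Real implementations may weight differently.
    """
    n = len(parent)
    return (gini(parent)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))


def mean_decrease_gini(trees):
    """Sum the Gini decreases per predictor, then average over all B trees."""
    totals = defaultdict(float)
    for splits in trees:
        for predictor, parent, left, right in splits:
            totals[predictor] += gini_decrease(parent, left, right)
    n_trees = len(trees)  # this is B
    return {p: total / n_trees for p, total in totals.items()}


def relative_importance(importance):
    """Rescale so that the most important predictor has the value 100."""
    largest = max(importance.values())
    return {p: 100 * v / largest for p, v in importance.items()}


# Toy "bag" of B = 2 trees (hypothetical data, not taken from the Heart data).
# Each tree is a list of splits: (predictor, parent labels, left, right).
root = ["Yes"] * 4 + ["No"] * 4
tree_1 = [
    ("X1", root, ["Yes"] * 4, ["No"] * 4),  # a perfectly pure split
]
tree_2 = [
    ("X1", root, ["Yes"] * 3 + ["No"], ["Yes"] + ["No"] * 3),
    ("X2", ["Yes"] * 3 + ["No"], ["Yes"] * 3, ["No"]),
]
trees = [tree_1, tree_2]

importance = mean_decrease_gini(trees)
relative = relative_importance(importance)
print(importance)
print(relative)
```

In this toy bag, `X1` is used for a split in both trees and `X2` in only one, so `X1` receives the larger mean decrease and becomes the reference value of 100 in the relative scale, which is the same rescaling used in a plot such as Figure 8.9.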
References
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics. Springer.