Chapter 2 Tree-based Models

Tree-based models are a class of nonparametric algorithms that work by partitioning the feature space into a number of smaller (non-overlapping) regions with similar response values using a set of splitting rules.

For example, the following figure shows a tree-based classification model built on two predictors.

Figure 2.1: An example of a classification tree based on two predictors.

Definitions

  • Nodes: The subgroups formed recursively via binary partitions, obtained by asking simple yes-or-no questions about each feature (e.g., \(x_1 < 0.7\)?).

  • We refer to the first subgroup at the top of the tree as the root node (this node contains all of the training data).

  • The final subgroups at the bottom of the tree are called the terminal nodes or leaves.

  • The rest of the subgroups are referred to as internal nodes.

  • The connections between nodes are called branches.

  • Depth of a tree is the length of the longest path from the root to a leaf.

  • Size of a tree is the number of terminal nodes in the tree.

In the tree model example above, there are 8 internal nodes and 9 terminal nodes; the depth of the tree is 8 and the size of the tree is 9.

As we’ll see, tree-based models offer many benefits; however, classic tree-based models (classification and regression trees) typically lag behind more complex algorithms such as neural networks in predictive performance. Fortunately, we can combine trees with powerful ensemble methods, for example random forests and boosted trees, to improve predictive performance. In this chapter, we will explore various tree-based models.

2.1 Classification and Regression Trees

There are many methodologies for constructing tree-based models, but the most well-known and classic is the classification and regression tree (CART) algorithm. A classification and regression tree partitions the training data into homogeneous subgroups (i.e., groups with similar response values) and then fits a simple constant in each subgroup. Mathematically, each leaf node of a classification and regression tree corresponds to a hyper-rectangle in feature (predictor) space, \(R_j,\; j=1,...,J\); that is, the feature space defined by \(X_1,X_2,...,X_p\) is divided into \(J\) distinct and non-overlapping regions.

  • A regression tree assigns a value to each \(R_j\): the model predicts the average response value of all training observations that fall in that subgroup.
  • A classification tree assigns a class label to each \(R_j\): the model predicts the class with majority representation in that subgroup, or provides predicted probabilities using the proportion of each class within the subgroup.
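
To make this concrete, the following is a minimal sketch of fitting both kinds of tree with Python's scikit-learn (an assumed tool; the source does not prescribe any software), using small synthetic data invented purely for illustration:

```python
# Minimal sketch: CART regression and classification trees via scikit-learn.
# The data below is synthetic and purely illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))                          # two predictors x1, x2
y_reg = np.sin(4 * X[:, 0]) + X[:, 1] + rng.normal(scale=0.1, size=200)
y_cls = (X[:, 0] + X[:, 1] > 1).astype(int)             # a two-class label

# Regression tree: each leaf predicts the mean response of its subgroup R_j.
reg_tree = DecisionTreeRegressor(max_depth=3).fit(X, y_reg)
print(reg_tree.predict([[0.2, 0.7]]))

# Classification tree: each leaf predicts the majority class, and
# predict_proba returns the class proportions within the leaf.
cls_tree = DecisionTreeClassifier(max_depth=3).fit(X, y_cls)
print(cls_tree.predict([[0.2, 0.7]]), cls_tree.predict_proba([[0.2, 0.7]]))
```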

Interpretation

Classification trees mimic some decision-making processes. For example, the following decision tree is used by A&E doctors to decide what bed type to assign:

Note that trees can be constructed even in the presence of qualitative predictor variables, for example eye colour (blue, green, brown). Because tree methods produce simple rules that are easy to interpret and visualise with tree diagrams, they are often referred to as decision trees, which perhaps explains their popularity: they feel interpretable, like a data-learned expert system.

Building trees

In theory, the regions \(R_j\) could have any shape, but we choose hyper-rectangles (high-dimensional boxes) for simplicity. The left panel is an example of an invalid partitioning, and the right panel an example of a valid one.

Performance Metric

For regression trees, the goal is to find \(R_j,\; j=1,...,J\) that minimize the RSS \[\sum_{j=1}^J\sum_{i:x_i\in R_j}(y_i-\hat{y}_{R_j})^2,\] where \(\hat{y}_{R_j}\) is the mean response for the training observations within the jth rectangle.

For classification trees, we may adopt the classification error rate \(E=1-\max_k(\hat{p}_{jk})\) as the performance criterion, where \(\hat{p}_{jk}\) represents the proportion of training observations in the jth region that are from the kth class. However, it turns out that the classification error rate is not sufficiently sensitive for tree-growing, especially when the data is noisy. In practice, two other metrics are preferable:

  • Gini index: \(G=\sum^K_{k=1}\hat{p}_{jk}(1-\hat{p}_{jk})\)
  • Entropy: \(D=-\sum^K_{k=1}\hat{p}_{jk}\log\hat{p}_{jk}\)

Note that both functions will take on a value near zero if the \(\hat{p}_{jk}\)’s are all near zero or near one.
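
As a quick numerical illustration of these two impurity measures (a small Python sketch with made-up class proportions, not tied to any dataset in this chapter):

```python
# Sketch: node impurity measures for a vector of class proportions (made-up values).
import numpy as np

def gini(p):
    """Gini index: G = sum_k p_k * (1 - p_k)."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p)))

def entropy(p):
    """Entropy: D = -sum_k p_k * log(p_k), with 0*log(0) treated as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

print(gini([0.9, 0.1]), entropy([0.9, 0.1]))   # nearly pure node: both small
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))   # maximally mixed node: both large
```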

Greedy fitting algorithm

It is computationally infeasible to consider every possible partition of the feature space into \(J\) boxes. Think about how many ways one could divide this 2-D space into 5 boxes, let alone a high-dimensional feature space. Therefore, in practice, we adopt the so-called greedy (top-down) fitting algorithm:

  1. Initialise the hyper-rectangle, \(R=\mathbb{R}^p\).

  2. Find the best (based on some objective function) split on variable \(x_i\) at location \(s\), which gives \(R_1(i,s)=\{x\in R: x_i<s\}\) and \(R_2(i,s)=\{x\in R: x_i\ge s\}\).

  3. Repeat step 2 twice, once with \(R=R_1(i,s)\) and once with \(R=R_2(i,s)\).

There will be some stopping rule, for example requiring a minimum number (e.g., 5) of training observations in each hyper-rectangle (box).

Note that such a greedy fitting algorithm does NOT guarantee an optimal tree, since the tree is constructed one split at a time.
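
The sketch below (plain NumPy; a simplified illustration rather than the exact CART routine) shows one greedy step for a regression tree: scan every candidate variable and split location, and keep the split that most reduces the RSS. Growing the tree then amounts to applying the same step recursively to each resulting box until the stopping rule kicks in.

```python
# Sketch: greedy recursive binary splitting for a regression tree (simplified).
import numpy as np

def best_split(X, y):
    """Return (variable index i, split location s, total RSS) of the best split."""
    best = (None, None, np.inf)
    for i in range(X.shape[1]):
        for s in np.unique(X[:, i]):
            left, right = X[:, i] < s, X[:, i] >= s
            if left.sum() == 0 or right.sum() == 0:
                continue
            rss = (np.sum((y[left] - y[left].mean()) ** 2)
                   + np.sum((y[right] - y[right].mean()) ** 2))
            if rss < best[2]:
                best = (i, s, rss)
    return best

def grow_tree(X, y, min_leaf=5):
    """Stop splitting a box once it is small or no valid split exists."""
    if X.shape[0] < 2 * min_leaf:
        return {"leaf": True, "prediction": float(y.mean())}
    i, s, _ = best_split(X, y)
    if i is None:
        return {"leaf": True, "prediction": float(y.mean())}
    left, right = X[:, i] < s, X[:, i] >= s
    return {"leaf": False, "var": i, "split": s,
            "left": grow_tree(X[left], y[left], min_leaf),
            "right": grow_tree(X[right], y[right], min_leaf)}
```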

Pruning trees

The fitting procedure described above may produce good predictions on the training set, but is likely to overfit the data, leading to poor test set performance. To balance this trade-off between variance and bias and avoid overfitting, we often prune trees by adopting the so-called weakest link pruning algorithm in practice.

Weakest link pruning (or cost complexity pruning) introduces a nonnegative tuning parameter \(\alpha\); for regression trees we minimise \[\sum_{j=1}^{|T|}\sum_{i:x_i\in R_j}(y_i-\hat{y}_{R_j})^2+\alpha|T|,\] where \(|T|\) is the number of terminal nodes (size) of the tree \(T\) (for classification trees simply replace the RSS with a classification objective function, e.g. the Gini index). Minimisation is achieved by attempting to remove terminal nodes from the base of the tree upward. Similar to Lasso regression, \(\alpha\) is a penalty on the size (complexity) of the tree. When \(\alpha=0\), we keep the whole tree; when \(\alpha=\infty\), we prune everything back to the null tree (the forecast is simply the average of the response). As always, we may select \(\alpha\) using cross-validation.
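
In scikit-learn, cost-complexity pruning is exposed through the `ccp_alpha` argument of the tree estimators; the sketch below (with `X, y` as placeholder training data assumed to exist) selects \(\alpha\) by cross-validation over the candidate values returned by `cost_complexity_pruning_path`:

```python
# Sketch: weakest-link (cost-complexity) pruning, choosing alpha by cross-validation.
# X, y are assumed to be an existing regression training set (placeholders).
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Candidate alphas along the pruning path of a fully grown tree.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas

# Cross-validate each alpha and refit with the best one.
cv_scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                             X, y, cv=5).mean() for a in alphas]
best_alpha = alphas[int(np.argmax(cv_scores))]
pruned_tree = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```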

Pros and Cons

The advantages of CART are:

  • Simple and easy to interpret

  • Akin to common decision-making processes

  • Good visualisations

  • Easy to handle qualitative predictors (avoid dummy variables)

The disadvantages of CART are:

  • Trees tend to have high variance: small changes in the sample can lead to dramatic cascading changes in the fit!

  • Relatively poor prediction accuracy.

2.2 Bagging

We have seen CART has attractive features, but the biggest problem is high variance. Bootstrap aggregating (an ensemble algorithm), also called bagging, is designed to improve the stability and accuracy of regression and classification algorithms by model averaging. Although bagging helps to reduce variance and avoid overfitting, model interpretability will be sacrificed.

Bootstrap

A bootstrap sample of size \(m\) is \((y_i^\star,\boldsymbol{x}_i^\star)^m_{i=1}\), where each \((y_i^\star,\boldsymbol{x}_i^\star)\) is a uniformly random draw from the training data \((y_1,\boldsymbol{x}_1),...,(y_n,\boldsymbol{x}_n)\). A bootstrap sample is a random sample of the data taken with replacement, which means observations in the original data set may appear more than once in the bootstrap sample. Note that because we sample from the training data \((y_1,\boldsymbol{x}_1),...,(y_n,\boldsymbol{x}_n)\) in pairs, the relationship between \(y\) and \(\boldsymbol{x}\) is preserved.

There is a nice mathematical property of bootstrap samples. The bootstrap sample \((y_i^\star,\boldsymbol{x}_i^\star)^m_{i=1}\) is an approximation to drawing IID samples from \(\mathbb{P}(\boldsymbol{X},Y)\). If we set \(m=n\), then notionally we have sampled a whole new training set. However, on average only \(\approx 63.2\%\) of the original training observations appear in the bootstrap resample:

\[\begin{equation} \begin{split} \mathbb{P}(\text{sample } i \text{ is in the bootstrap resample}) & =1- \mathbb{P}(\text{sample } i \text{ is not in the bootstrap resample})\\ & =1- \mathbb{P}\left[(\boldsymbol{x}_1^\star,y_1^\star)\neq (\boldsymbol{x}_i,y_i)\cap \dots \cap (\boldsymbol{x}_n^\star,y_n^\star)\neq (\boldsymbol{x}_i,y_i)\right]\\ & =1- \prod_{j=1}^{n} \mathbb{P}\left[(\boldsymbol{x}_j^\star,y_j^\star)\neq (\boldsymbol{x}_i,y_i)\right] \\ & =1-\left(1-\frac{1}{n}\right)^n \\ & \approx 1-e^{-1} \\ & \approx 0.632 \end{split} \end{equation}\]

Note that we don’t really imagine we have a new IID training sample! But bootstrap methods allow us to produce model-free estimates of sampling distributions. If we have a lot of data, these estimates will be quite good. Intuitively, we can roughly approximate the sampling distribution of trees fitted to training sets of size \(n\), and then achieve variance reduction by averaging many such trees.
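
A quick simulation (an illustrative Python sketch) confirms the \(\approx 63.2\%\) figure empirically:

```python
# Sketch: check that a bootstrap resample of size n contains roughly 63.2%
# of the distinct original observations.
import numpy as np

rng = np.random.default_rng(1)
n, reps = 1000, 2000
fractions = []
for _ in range(reps):
    idx = rng.integers(0, n, size=n)           # draw n indices with replacement
    fractions.append(np.unique(idx).size / n)  # fraction of distinct originals included
print(np.mean(fractions), 1 - np.exp(-1))      # both close to 0.632
```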

Bagging for CART

We can apply bagging to CART as follows. For \(b=1,...,B\), draw a bootstrap sample \((y_i^\star,\boldsymbol{x}_i^\star)^n_{i=1}\) of size \(n\), fit a tree to it, and obtain \(\hat{f}^{*b}(\boldsymbol{x})\), the prediction at a point \(\boldsymbol{x}\).

  • For regression trees, average all the predictions to obtain: \[\hat{f}^{bag}(\boldsymbol{x})=\frac{1}{B}\sum^{B}_{b=1}\hat{f}^{*b}(\boldsymbol{x})\]

  • For classification trees, we choose the class label for which the most bootstrapped trees “voted”. Alternatively, we can construct a probabilistic prediction for the kth class by directly averaging the trees’ output probabilities \[\hat{f}_k^{bag}(\boldsymbol{x})=\frac{1}{B}\sum^{B}_{b=1}\hat{f}^{*b}_k(\boldsymbol{x})\]

When bagging CART, overfitting of the individual trees can be handled in one of two ways:

  • Grow large trees with minimal (or no) pruning, relying on the bagging procedure itself to avoid overfitting.

  • Prune each tree as described in the previous section.

Another nice property of bagging for CART is that it automatically provides out-of-sample evaluation. In fact, we don’t need to do cross-validation for bagging, because on average \(36.8\%\) of the original data will NOT appear in each bootstrap sample and can therefore be used as a test set. This kind of out-of-bag (OOB) error estimation is roughly comparable to 3-fold cross-validation, since each tree is trained on about two-thirds of the data.
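
A minimal sketch using scikit-learn’s `BaggingRegressor` (with `X, y` as placeholder training data), which both averages bootstrapped trees and reports the OOB score:

```python
# Sketch: bagging B regression trees and using the out-of-bag observations
# for "free" out-of-sample evaluation. X, y are placeholder training data.
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

bag = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # large, unpruned trees
    n_estimators=200,                   # B bootstrap resamples
    oob_score=True,                     # score on the ~36.8% left-out observations
    random_state=0,
).fit(X, y)

print(bag.oob_score_)                   # OOB R^2, no cross-validation needed
```

(In scikit-learn versions before 1.2 the first argument is called `base_estimator` rather than `estimator`.)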

Pros and Cons

The advantages of bagging are:

  • If you have a lot of data, bagging is a good choice because the empirical distribution will be close to the true population distribution

  • Bagging is not restricted to trees; it can be used with any type of model.

  • It is most useful when it is desirable to reduce the variance of a predictor. Note that under the (inaccurate) assumption of independence, the variance of the average is reduced to roughly \(1/B\) of that of a single model for \(B\) bootstrap resamples (central limit theorem).

  • Out-of-sample evaluation without using cross-validation, since any given observation will not be used in around \(36.8\%\) of the models.

The disadvantages of bagging are:

  • We lose interpretability as the final estimate is not a tree.

  • Computational cost is multiplied by a factor of at least \(B\).

  • Bagged trees are correlated, and the more correlated random variables are, the smaller the variance reduction from averaging them.

2.3 Random Forests

Bagging improved on a single CART model by reducing the variance through resampling. But, in fact, the bagged models are still quite correlated, and the more correlated random variables are, the smaller the variance reduction from averaging them. Can we make the trees more independent in order to produce a better variance reduction? Random forests achieve this by not only resampling observations, but also restricting the model to random subspaces, \(\mathcal{X}'\subset \mathcal{X}\).

Methodology

The Random Forests algorithm can be implemented as follows:

  1. Take a bootstrap resample \((y_i^\star,\boldsymbol{x}_i^\star)^n_{i=1}\) of the training data as for bagging.

  2. When building a tree, each time a split is considered, randomly select \(m\) predictors out of the full set of \(p\) predictors as split candidates, and find the “optimal” split among those \(m\) predictors only (typically choose \(m\approx \sqrt{p}\)).

  3. Repeat steps 1 and 2 to grow many trees, then average the predictions of all trees.
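
In scikit-learn the random-subspace step corresponds to the `max_features` argument; a minimal sketch (with `X, y` as placeholder training data):

```python
# Sketch: a random forest for regression. max_features="sqrt" restricts each
# split to a random subset of m ~ sqrt(p) predictors. X, y are placeholders.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=500,       # number of bootstrapped trees
    max_features="sqrt",    # m ~ sqrt(p) candidate predictors per split
    oob_score=True,         # out-of-bag evaluation comes for free, as with bagging
    random_state=0,
).fit(X, y)

print(rf.oob_score_)
```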

Interpretation

Classification and regression trees are easy to interpret and follow common human decision-making mechanisms. Linear models provide statistical significance tests (e.g. t-tests and z-tests of parameter significance). Random forests may seem great, but we have sacrificed interpretability through bagging and random subspaces. There are two popular approaches that help us quantify the importance of each variable.

  1. For each feature variable \(x_{j}, \;j=1,...,p\), loop over each tree in the forest,

    • find all nodes that make a split on \(x_{j}\);
    • compute the improvement in the loss criterion (e.g. accuracy, Gini, etc.) that each such split causes;
    • sum the improvements across the nodes of the tree.

    Finally, sum the improvements across all trees.

  2. We already discussed that bagging enables out-of-bag error estimates. We can then compute an out-of-bag variable importance for each feature \(x_{j},\; j=1,...,p\) as follows:

    • For each tree, take the out-of-bag data matrix, \(\boldsymbol{x}^{oob}\), and compute the predictive accuracy of that tree;
    • Take \(\boldsymbol{x}^{oob}\) and randomly permute the entries of the \(j\)th column (breaking the association between \(x_j\) and the response, as well as the other variables);
    • Pass the modified \(\boldsymbol{x}^{oob^*}\) through the tree and compute the change in predictive accuracy for that tree.

    Finally, average the decrease in accuracy over all trees.
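
Both importance measures have counterparts in scikit-learn (a sketch, assuming `rf` is a fitted forest and `X_test, y_test` are held-out placeholder data): `feature_importances_` sums split improvements over nodes and trees, while `permutation_importance` implements the permute-and-re-score idea.

```python
# Sketch: the two variable-importance measures, assuming `rf` is a fitted
# random forest and X_test, y_test are held-out (placeholder) data.
from sklearn.inspection import permutation_importance

# 1. Split-improvement (impurity-based) importance, summed over nodes and trees.
print(rf.feature_importances_)

# 2. Permutation importance: permute one column at a time and measure the drop
#    in predictive accuracy (computed here on held-out data rather than
#    strictly on each tree's own OOB samples).
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)
```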

Pros and Cons

The advantages of random forests are:

  • Inherits the advantages of bagging and trees
  • Very easy to parallelise
  • Works well with high-dimensional data
  • Tuning is rarely needed (easy to get good-quality predictions)

The disadvantages of random forests are:

  • Often quite suboptimal for regression.
  • Same extrapolation issue as trees.
  • Harder to implement and more memory-hungry.

2.4 Boosting

Boosting is an extremely popular machine learning algorithm that has proven successful across many domains and is one of the leading methods for winning Kaggle competitions. Whereas random forests build an ensemble of deep independent trees, boosting builds an ensemble of shallow trees in sequence, with each tree learning from and improving on the previous one. Although shallow trees are by themselves rather weak predictive models, they can be “boosted” to produce a powerful “committee” that, when appropriately tuned, is often hard to beat with other algorithms.

Methodology

Boosting is similar to bagging in the sense that we combine results from multiple classifiers, but it is fundamentally different in approach:

  1. Set \(\hat{f}(\boldsymbol{x})=0\) and \(\epsilon_i=y_i\) for \(i=1,…,n\).

  2. For \(b=1,…,B\), iterate:

    1. Fit a model (e.g. a tree) \(\hat{f}^b(\boldsymbol{x})\) with the current residuals \(\epsilon_1,…,\epsilon_n\) as the response
    2. Update the predictive model to \[\hat{f}(\boldsymbol{x}) \leftarrow \hat{f}(\boldsymbol{x}) + \lambda \hat{f}^b(\boldsymbol{x}) \]
    3. Update the residuals \[ \epsilon_i \leftarrow \epsilon_i-\lambda\hat{f}^b(\boldsymbol{x}_i) \;\;\;\; \forall i\]
  3. Output the final boosted model \[\hat{f}(\boldsymbol{x}) =\sum^B_{b=1}\lambda \hat{f}^b(\boldsymbol{x}) \]
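
The sketch below is a direct translation of this algorithm into Python, using shallow scikit-learn regression trees as the base learner (synthetic data, purely illustrative):

```python
# Sketch: least-squares boosting with shallow regression trees, following the
# algorithm above. Synthetic data, purely illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1] + rng.normal(scale=0.1, size=500)

B, lam, d = 500, 0.01, 1           # number of trees, shrinkage, splits per tree
residual = y.copy()                # step 1: f_hat = 0, so eps_i = y_i
trees = []
for b in range(B):                 # step 2
    tree = DecisionTreeRegressor(max_depth=d).fit(X, residual)  # 2.1 fit to residuals
    trees.append(tree)             # store tree; step 2.2 is implicit in f_hat below
    residual -= lam * tree.predict(X)                           # 2.3 update residuals

def f_hat(X_new):                  # step 3: the boosted model is a sum of shrunken trees
    return lam * np.sum([t.predict(X_new) for t in trees], axis=0)

print(np.mean((y - f_hat(X)) ** 2))   # training MSE of the boosted model
```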

Intuitively, boosting is characterised by:

  • Slow learning approach: Aim to learn a simple model (which has low variance but high bias) in each round, then bring the bias down by repeating over many rounds.

  • Sequential learning: Each round of boosting aims to correct the error of the previous rounds.

  • Generic methodology: Any model at all can be boosted! Completely general method.

Tuning parameters for boosting

There are three main hyper-parameters that need to be tuned when building a boosted tree model.

  1. The number of trees \(B\). Unlike bagging and random forests, boosting can overfit if \(B\) is too large, although this overfitting tends to occur slowly if at all. We use cross-validation to select \(B\).

  2. The shrinkage parameter \(\lambda\), a small positive number. This controls the rate at which boosting learns. Typical values are 0.01 or 0.001, and the right choice can depend on the problem. Very small \(\lambda\) can require using a very large value of \(B\) in order to achieve good performance.

  3. The number of splits \(d\) in each tree, which controls the complexity of the boosted ensemble. Often \(d=1\) works well, in which case each tree consists of a single split and results in an additive model. More generally \(d\) is the interaction depth and controls the interaction order of the boosted model since \(d\) splits can involve at most \(d\) variables.
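
These three knobs map directly onto the arguments of scikit-learn’s `GradientBoostingRegressor` (a closely related gradient-boosting implementation rather than exactly the algorithm above); a sketch with `X, y` as placeholder training data, tuning all three by cross-validation:

```python
# Sketch: tuning B (n_estimators), lambda (learning_rate) and d (max_depth)
# by cross-validation. X, y are placeholder training data.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 500, 1000],     # B, the number of trees
    "learning_rate": [0.1, 0.01, 0.001],  # lambda, the shrinkage parameter
    "max_depth": [1, 2, 3],               # d, the interaction depth
}
search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```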

Boosting discussion

Boosting is an incredibly powerful technique and is regularly instrumental in winning machine learning competitions. Any model at all can be boosted. However, tuning is needed (and is not trivial), and boosting brings some difficulties in interpretation and extrapolation.