# Chapter 10 Tree models

Decision trees (aka tree-based models) are commonly used in data mining to perform predictive analytics, typically with a single categorical dependent variable and multiple predictor variables (which can be continuous or categorical). There are two major types of decision trees: classification trees and regression trees. Classification trees are discussed in the first section and regression trees are discussed in the second section. Decision tree algorithms are “automatic” in that the independent variables are selected by searching for optimal splits using a measure of “purity” or effect size.

## 10.1 Classification trees

As an example, the results of a classification tree could inform customer targeting for a marketing campaign. Consider the hypothetical results of a tree that is based on four predictor variables:

1. Is the customer’s income greater or less than $70k? 2. How old is the customer? 3. Is the customer a college graduate? 4. Is the customer male or female? A classification tree based on past data on purchases might produce a tree-like structure as shown in Figure 10.1. Notice that some of the branches terminate before all the variables are considered. This is because further splitting of the branch does not lead to any more useful differences. Note that after splitting by income, the next variable selected for splitting differs depending on the income level. For income less than or equal to$70k, the next split is on age. For income greater than $70k, the split is on education. This is known as an interaction effect. Finally, note that the splits can be either on continuous variables such as age or on nominal variables such as gender. The process starts with the root node, which represents entire data set. The process proceeds by creating branches where the data is split into sub nodes. The final splits result in leaf or terminal nodes. From the hypothetical tree shown in Figure 10.1, a series of rules are produced: • If Income ≤$70,000 AND Age ≤ 30 THEN probability of purchase = 14%.
• If Income ≤ $70,000 AND Age > 30 AND Female, THEN probability of purchase = 40%. • If Income ≤$70,000 AND Age > 30 AND Male, THEN probability of purchase = 63%.
• If Income > $70,000 AND No college, THEN probability of purchase = 38%. • If Income >$70,000 AND College graduate, THEN probability of purchase = 70%.

A classification tree is not an inferential technique that is suitable for statistical hypothesis testing. One way to think about classification trees is that they recursively split data into smaller and smaller branches that are increasingly “pure” in terms of the target variable. To find a split, the program examines all the input variables and selects the one at each stage that is most effective according to the criteria in the algorithm.

This search process involved with forming a tree violates the logical premises of classical statistical testing since the data is used to inform the splits into branches. However, when there are many observations and many variables and in cases where no well-defined theory exists, classification trees can help the researcher to “discover” relationships that can be tested later a different sample.

Some common uses of classification trees are:

• Market segmentation – identifying segments most likely to purchase.
• Stratification – dividing cases into high/medium/low risk, for example.
• Prediction – creating rules and use them to predict future outcomes.
• Data reduction and variable screening – screening many variables to identify best prospects.
• Interaction detection – finding variables with effects which differ according to the levels of other variables.
• Category merging – recoding variables with large numbers of levels into fewer categories without substantial lose of predictive information. This is one attraction of classification trees: the technique works without much “thinking.”

## 10.2 Forming classification trees

There are exponentially many possible classification trees with a given set of attributes. The number is very large because continuous predictors can be split in many ways and the same predictor can be used again and again as the tree is built. Since it is usually impossible to examine all possible tree structures in each problem, an algorithm is used to grow a tree that is reasonably accurate instead of optimal. Classification trees are known as “greedy algorithms” since they use a strategy of proceeding stage by stage and once a split in the data is made, the algorithm does not go back after making additional splits to check previous splits. Thus, locally optimum selections are made at each stage. The result is usually a very good model, but not one that is not necessarily optimal.

However, there are dangers and costs associated with the approach. You put in a long list of variables and the program selects the best predictors (and the best point for splitting).

## 10.3 Varieties of classification tree algorithms

A wide variety of different models and techniques has been developed all of which fall under the umbrella term of classification trees. Models for classification trees differ in terms of how branches of the tree are split off from the main trunk, in terms of stopping rules, the types of variables that can be used, what is provided in the output, etc. Several algorithms are available for classification trees including CART, C5.0, and CHAID. By no means will these give the same answers to a given problem. One reason is that the programs use different criteria to select which parent nodes to split, as shown in Table 10.1.

Table 10.1: Characteristics of three classification tree algorithms.
Algorithm Splitting criterion Input variables Target variable Splits
CART Gini index Categorical or continuous Categorical Binary
C50 Entropy Categorical or continuous Categorical Binary or multiway
CHAID Chi-square test Categorical Categorical Binary or multiway

Classification trees have been around a long time, but until recently they were frequently discussed in derogatory terms. One of the first models, AID, was called a substitute for thinking. AID was an acronym for “automatic interaction detection.” Classification trees remain controversial, and some researchers claim that classification trees should not be used. By automatically combing through data sets in search of relationships, the models have the potential to find spurious associations that may appear to be plausible but are only artifacts due to randomness.

That is very much true if you use classification trees on small samples and do not develop both training and testing subsets. The criticism is due to past applications where small data sets were used. In the era of plentiful data, these concerns no longer are as important. Data mining, after all, is for large data sets.

For continuous predictor variables, all possible splits are considered. Thus, for n distinct values of a predictor, n-1 potential splits are considered. For nominal predictor variables, the number of possible splits into two groups depends upon the number of distinct categories. Table 10.2 gives examples of the number of possible splits as a function of distinct categories.

Table 10.2: The number of possible binary splits by number of categories in a nominal predictor.
Categories Possible.splits
2 1
3 3
4 7
5 15
6 21

In general, the number of possible splits as a function of categories is given by Sterling Numbers of the Second Kind .

## 10.4 Criteria for splitting and growing a tree

As noted above, various approaches have been developed for selecting a node to split in forming a classification tree. Three approaches for splitting are the Gini index, entropy, and chi-square. Each of these can measure the “purity” of a node.

This data set has three possible predictors: Feature 1, Feature 2, and Feature 3. So, the first split could be on any of the three predictors.

### 10.4.1 The Gini index

The Gini index of “node purity,” $$G_{A}$$ is computed using equation (10.1).

$$$\ \mathit{G_{A}}\; = 1 - (p_{0}^2\; + p_{1}^2\;) \tag{10.1}$$$

where $$p_{0}$$ is the proportion of cases in the node that are at level “0” and $$p_{1}$$ the proportion that are at level “1.” Figure 10.2 shows the Gini index as $$p_{0}$$ and $$p_{1}$$ are varied. The maximum value for the index is .5 and is 0 for either all cases equal 0 or equal 1.

Some algorithms only consider splitting each node into two child nodes while others, such as CHAID can create multi-category nodes. For this example, binary splits in the data set are considered. The overall Gini index for a binary split is computed as the weighted average of the Gini values for the two possible branches.

To consider splitting a node in two, $$G_{A}$$ is computed for each of the resulting nodes, which are labeled $$G_{A\_Left}\;$$ and $$G_{A\_Right}\;$$. The total Gini index is then calculated as the weighted average of $$G_{A\_Left}\;$$ and $$G_{A\_Right}\;$$, where the weights are the proportions of cases in the left and right nodes, given by $$w_{n\_Left}$$ and $$w_{n\_Right}$$ . So, the Gini index for a split is calculated with equation (10.2).

$$$\ \mathit{G_{A\_split}\; = w_{n\_Left} \times G_{A\_Left}\; + w_{n\_Right} \times G_{A\_Right}\; }\; \tag{10.2}$$$

### 10.4.2 Information Gain

Information in a node is a measure of impurity, with higher values indicating greater impurity. The expected information in each node for a binary target variable, IInfo_A, is computed using equation (10.3).

$$$\ \mathit{I_{Info\_A}\; = \sum_{i=1}^{2} p_{i}\; \times \log_{2} p_{i}\;}\; \tag{10.3}$$$

where $$p_{i}$$ is the proportion of cases in the node that are at level i (either 0 or 1). Figure 10.3 shows how expected information varies with $$p_{0}$$ and $$p_{1}$$. The maximum value for the index is 1.0 and is 0 for either all cases = 0 or all cases = 1.

To select a node to split, information gain is computed, which is the sum of the information values for the parent node minus the sum of he expected information values for the two child nodes.

To consider splitting a node in two, $$I_{A}$$ is computed for each of the resulting nodes, which are labeled $$I_{A\_Left}\;$$ and $$I_{A\_Right}\;$$. The information contained in the two child nodes is then calculated as the weighted average of $$I_{A\_Left}\;$$ and $$I_{A\_Right}\;$$, where the weights are the proportions of cases in the left and right nodes, given by $$w_{n\_Left}$$ and $$w_{n\_Right}$$. So, the information gain for a split is calculated with equation (10.4).

$$$\ \mathit{I_{A\_split}\; = w_{n\_Left} \times I_{A\_Left}\; + w_{n\_Right} \times I_{A\_Right}\; }\; \tag{10.4}$$$

### 10.4.3 Chi-square as a splitting criterion

This approach to splitting nodes uses the chi-square statistic for selecting a parent node to split. In this case the preferred split is based on which split increases the chi-square value the most, since two nodes that are each “pure” node in terms of 1’s and 0’s will have a sum of chi-squares greater than nodes which are balanced. The formula used at each stage of the classification tree where the target is binary (labeled “0” and “1”) and binary splits are being considered is equation (10.5).

$$$\ \mathit{\text{Chi-square}{_A}\; = \sum_{i=1}^{2} \frac{(a_i\; - e_i\;)^2 }{e_i}\;}\; \tag{10.5}$$$

where $$a_{i}$$ and $$e_{i}$$ are the actual number and expected of 0’s in the node and $$a_{1}$$ and $$e_{1}$$ are the actual number and expected of 1’s in the node. Figure 10.4 shows how shape of the chi-square value varies with different proportions of $$p_{0}$$ and $$p_{1}$$.

Note that as a node increases in “purity” in terms of $$p_{0}$$ or $$p_{1}$$, chi-square increases. Therefore, the parent node with the highest sum of chi-squares for two child nodes is selected for splitting.

To illustrate, consider the “toy” data in Table 10.3. This data consists of 20 observations, 9 zero responses and 11 responses of 1.

Table 10.3: Toy data set
Feature 1 Feature 2 Feature 3 Response
F1_A F2_A F3_A 0
F1_A F2_A F3_B 1
F1_A F2_A F3_A 1
F1_B F2_B F3_A 1
F1_B F2_B F3_B 1
F1_B F2_B F3_B 0
F1_A F2_B F3_B 1
F1_A F2_B F3_A 0
F1_A F2_B F3_B 1
F1_B F2_B F3_B 1
F1_A F2_B F3_A 0
F1_A F2_B F3_A 0
F1_A F2_A F3_B 1
F1_B F2_B F3_A 0
F1_B F2_B F3_A 0
F1_B F2_B F3_B 1
F1_A F2_A F3_B 1
F1_A F2_A F3_A 0
F1_A F2_A F3_A 0
F1_A F2_A F3_B 1

The Gini index, information gain, and chi-square were computed for candidate splits on Features 1, 2, and 3. All three of the criteria (shown in Table 10.4 led to the same initial split on Feature 3, but this will not always be the case.

Table 10.4: Criteria for initial splits.. .
Feature to split Gini index Information gain Chi-square
Feature 1 0.488 0.011 0.707
Feature 2 0.496 0.512 1.955
Feature 3 0.250 0.695 3.162

The tree so far is shown in Figure Figure 10.5:

Continuing with just the Gini index, the next possible splits are shown in Table 10.5.

Table 10.5: Criteria for second splits.
Feature to split Gini index
Split Node 2 by Feature 1 0.305
Split Node 2 by Feature 2 0.317
Split Node 3 by Feature 1 0.150
Split Node 3 by Feature 2 0.167

The best split at this stage is to split Node 3 by Feature 2. Then, the best split is to split Node 2 by Feature 1. The resulting tree is shown in 10.6. Note that the best split of Node 2 is based on Feature_1 while the best split on Node 3 is based on Feature 2. No further splits were made with this example.

## 10.5 Overfitting

Trees can grow in complexity and are susceptible to overfitting data. Bramer defines overfitting as follows:

A classification algorithm is said to overfit to the training data if it generates a classification tree … that depends too much on irrelevant features of the training instances, with the result that it performs well on the training data but relatively poorly on unseen instances. "Realistically, overfitting will always occur to a greater or lesser extent simply because the training set does not contain all possible instances. It only becomes a problem when the classification accuracy on unseen instances is significantly downgraded.

Figure 10.7 shows conceptually how increasing the size of a classification tree by allowing more and more modes will usually increase the accuracy in the training data, but eventually increases error in the test or unseen data. Overfitting the training data fits the model to errors or idiosyncrasies of the training data which are not present in other data sets.

There are two general approaches to avoid overfitting:

• Simply avoid growing large trees – by providing a stopping rule such as the minimum number of observations in a node or the maximum number of splits.
• Grow a large tree and cut branches afterwards, which is known as pruning. The full tree is grown (early stopping might additionally be used), and each split is examined to determine if it brings a reliable improvement.

## 10.6 Example of a classification tree

Customer churn occurs when a customer (player, subscriber, user, etc.) ceases his or her relationship with a company. The full cost of customer churn includes both lost revenue as well as the marketing costs involved with replacing those customers with new ones. Reducing customer churn is a key business goal of nearly every online business because it is almost always more difficult and expensive to acquire a new customer than it is to retain a current paying customer.

The ability to predict that a particular customer is at a high risk of churning, while there is still time to do something about it, represents an important potential for increasing revenue and profit.

In the case of telecom companies, customers may cancel for many reasons, including poor service, availability of specific hardware, and price. Identifying potential churners before they quit can be the first step in efforts to lower the churn rate. It only makes financial sense to offer incentives only to potential churners and not to customers that are going to remain.

Analytic techniques can be used to develop predictive models to identify likely churners based on customer characteristics and behaviors.

For this example, a data set, TelcoChurn5000.csv, consisting of 5,000 customers of a telecom provider is available with the variables shown in Table 10.6.

Table 10.6: Variables in the TelcoChurn5000.csv data set.
Variable Description
State 2-letter code of the US state of customer residence
Months w/ carrier Number of months with the telco provider
Intl plan? The customer has international plan (yes/no)
Intl mins Total minutes of international calls
# intl calls Total number of international calls
$Intl Total charge of international calls Voice mail? The customer has voice mail plan (yes/no) # vmail messages Number of voice-mail messages Call minutes Total minutes of calls # of calls Total number of calls$ Total Total monthly charges
# service calls Number of calls to customer service
Churn Customer churn (yes/no)

A classification tree was developed in KNIME using the workflow is shown in Figure 10.8.

The nodes in the workflow are described in Table 10.7,

Table 10.7: Descriptions of nodes in the churn workflow.
Node Label Description
1 File Reader TelcoChurn5000.csv
2 Statistics Descriptive statistics of Churn data set.
3 Partitioning A 60/40 (Training/Test) random split of the data set stratified on Churn is formed. Set random seed = 123.
4 SMOTE SMOTE (Synthetic Minority Over-sampling Technique) oversamples the class of Churn = “no” to equal Churn = “yes.” This was done because the of the imbalance of Churn in the data: 707 “yes” and 4293 “no.”
5 Statistics Descriptive statistics on Churn target variable showing balance.
6 Decision Tree Learner Set minimum records per node = 150 to pre-prune the classification tree.
7 Decision Tree Predictor Use the decision tree to create predictions of Churn on the unbalanced Training data (prior to running SMOTE).
8 Scorer Compute performance statistics and classification (confusion) matrix for the Training data.
9 Decision Tree Predictor Use the decision tree to create predictions of Churn on unbalanced Test data.
10 Scorer Compute performance statistics and classification (confusion) matrix for the Test data.
11 Joiner Join accuracy tables for Training and Test data.
12 Excel Writer Write accuracy tables to ChurnAccuracy.xlsx; include Row Key.

The performance results for the Training and Test summarized from the Excel file are shown in Table 10.8.

Table 10.8: Performance measures for the classification tree.
Metric Training data Test data
Accuracy 0.953 0.942
Cohen’s kappa 0.812 0.765
Precision 0.818 0.783
Sensitivity 0.861 0.816
Specificity 0.969 0.963
F-measure 0.839 0.799

The classification tree performed quite well on this data set, with comparable performance for both the Training and Test data as shown in Tables 10.9 and 10.10.

Table 10.9: Confusion matrix for Churn Training data.
Training Churn=Yes Churn=No Totals
Churn=Yes 365 59 424
Churn=No 81 2495 2576
Totals 446 2554 3000
Table 10.10: Confusion matrix for Churn Test data.
Training Churn=Yes Churn=No Totals
Churn=Yes 231 52 282
Churn=No 64 1653 1717
Totals 295 1705 2000

## 10.7 Regression trees

Decision trees can also be used to predict continuous target variables. In such applications these are known as regression trees. The analysis is much like the case with categorical target variables: the model produces a series of splits using selected predictor variables. With regression trees, much like ordinary regression, you can have continuous or nominal predictors. However, the other assumptions associated with linear regression regarding error distributions and so on, are not really relevant because regression trees are not a classical statistical technique.

### 10.7.1 How regression trees work

Regression trees operate by successively dividing a data set into smaller and smaller groups that are more homogeneous with respect to the target variable. The groups are nodes and each node lower in the tree is more homogeneous in terms of the target variable than those nodes higher in the tree. The model starts with no predictors and then examines each possible predictor in turn to select the best variable for initial split.

The process of building a regression tree is illustrated using a simple data set (DemoRegressionTrees.csv) consisting of 100 observations of home prices as the continuous target variable with area in square feet (either 1,000 or 2,000) and quality (either high or average) as the predictors. The first five and last five rows are shown in Table 10.11.

Table 10.11: Performance measures for the classification tree.
Row Price in 000’s Area in sq. ft. Quality
1 203.8 1000 High
2 195.0 1000 Average
3 200.9 1000 High
4 403.2 2000 High
5 402.8 2000 High
· · · · · · · · · · · ·
96 306.9 2000 High
97 194.3 1000 Average
98 206.2 1000 High
99 310.1 2000 High
100 298.0 2000 Average

The tree building process uses the criterion of minimizing the sum of squared errors as binary splits in the predictors are made. The initial sum of squares ($$SST_{I}$$) is given by equation (10.6).

$$$\ \mathit{\text{SST}{_I}\; = \sum_{i=1}^{n} (y_{i}\; + \bar{y}\;)^2\; = 502,301}\; \tag{10.6}$$$

where n = the number of observations, $$y_{i}$$ = the price for observation i and $$\bar{y}\;$$ = the mean value of the prices.

Next, splits are considered by area in square feet and quality. When a split is made, the sum of squares of each branch of the split is computed and added together. The sum of squares after splitting is subtracted from $$SST_{I}$$ and the split with the greatest reduction in sum of squares is selected.

The sum of squares for the binary splits is given by (10.7).

$$$\ \mathit{\text{SST}_{binary}\; = \sum_{i=1}^{n_L} (y_{i}\; + \bar{y}_L\;)^2\; + \sum_{i=1}^{n_R} (y_{i}\; + \bar{y}_R\;)^2\;} \tag{10.7}$$$

where $$n_{j}$$ = the number of observations in the jth node after split, $$\bar{y}_L\;)\;$$ = the mean value of the prices in the jth node with j = L for left node and j = R for the right node.

The results of the calculations are shown in Figure 10.9.

Splitting on area produced the greatest reduction in sum of squares and therefore the split should be made on area.

In a larger problem, a recursive process is used, and each predictor is considered. Each potential split of a continuous predictor must be calculated, so that for a variable with k distinct values, k-1 splits considered.

There are several different algorithms for regression trees, but a commonly used approach is the CART method. The depth of the tree can be controlled by the programs.

The predictions are made using the terminal nodes at the bottom of the tree. The prediction estimates of the target variable equal the average value of the observations in each terminal node.

Simple regression trees have a limitation, as we will see in an example. The limitation stems from the model structure itself because the estimates of the target are only made with the average value in each terminal node. Therefore, if there are only a few terminal nodes, then only a few different prediction values will be made. This happens even though the original target variable has many, perhaps hundreds or thousands of different values if the target is continuous.

The predictions are said to have limited cardinality. Cardinality is just the number of distinct values in the set of predicted values. This by itself can limit the R2 and predictive accuracy of regression trees.

### 10.7.2 Example: Predicting home prices

Next, an example again is based on predicting home prices. This data set has 522 observations in the file RealEstatePrices.csv. The variables in the data are listed in Table 10.12.

Table 10.12: Performance measures for the classification tree.
Variable Values
Sales Price Continuous
Finished square feet Continuous
Air conditioning Yes or No
Garage size # of cars
Pool Yes or No
Year built Continuous
Quality Fair, Moderate, or Best
Lot size Continuous
Adjacent to highway Yes or No

A workflow (Figure 10.10) was created in KNIME to predict Sales Price on Test data using regression trees and, for comparison, ordinary linear regression.

The nodes in the workflow for this example are described in Table 10.13.

Table 10.13: Node descriptions for Real Estate workflow.
Node Label Description
2 Partitioning Randomly split data 70/30; set seed to 123.
3 Simple Regression Tree Learner Limit number of levels to 5; Minimum split node size = 10; Minimum node size = 5.
4 Simple Regression Tree Predictor Predict Sales Price on Test data.
5 Numeric Scorer Calculate performance measures on Regression Tree using Test data.
6 Scatter Plot Plot Predicted vs. Actual for Regression Tree on Test data.
7 Linear Regression Learner Run the same data using OLS.
8 Regression Predictor Predict Sales Price on Test data.
9 Numeric Scorer Calculate performance measures on OLS using Test data.
10 Scatter Plot Plot Predicted vs. Actual for OLS on Test data.

The results for the two analyses are shown in Table 10.14. Ordinary least squares regression performed slightly better in this example. One reason may be the limitation on the predicted values when using regression trees. There are only as many distinct predictive values as the number of terminal nodes. In the original Selling Price variable there are 131 distinct values. In the predicted values from the regression tree, there are only 30 distinct values, while there are 157 distinct values in the predictions using ordinary regression.

Table 10.14: Node descriptions for Real Estate workflow.
Metric Regression tree Ordinary regression
R-sqaured 0.813 0.836
mean absolute error 36,544 36,710
mean squared error 3,189,206,925 2,789,026,413
root mean squared error 56,473 52,811
mean signed difference -11,198 -4,908
mean absolute percentage error 0.14 0.14

Since the results from the regression tree and ordinary regression differed, an exploration of combining the two results was conducted. The predictions from the two models were averaged and compared the actual Selling Price data. The performance metrics for the averaged predictions are shown in Table 10.15.

Table 10.15: Performance metrics for averaged predictions.
Metric Averaged predictions
Rsquared 0.853
mean absolute error 33,761
mean squared error 2,510,151,735
root mean squared error 50,101
mean signed difference -8,053
mean absolute percentage error 0.13

The metrics indicate improved accuracy. In particular the root mean squared error (RMSE) was reduced to 50,101 compared with the RMSE for the regression tree of 56,473 and 52,811 for OLS. The RMSE for the averaged model was about 5% lower than the OLS mode. Whether or not this improvement is important, it does illustrate that combining prediction estimates can result in better performance. This is a similar effect to the one discussed in the chapter on ensemble models.

## 10.8 Strengths and weaknesses

Decision trees are widely used in data mining but as with virtually every technique, there are both strengths and weaknesses.

Strengths of decision trees

• Interpretation is usually straightforward and easy to demonstrate and explain.
• There are few underlying assumptions that must be met.
• The results are displayed in a tree-like structure, which is intuitively appealing. Rules generated are transparent.
• Interactions among predictor variables can be identified.
• Outliers and missing values can be handled without problems (with most algorithms).
• Non-linear relationships are handled without problems.
• Predictor variable selection is automatic.
• Binary, categorical, ordinal, and interval target and predictor variables can be used.

Weaknesses of decision trees

• Slight changes in the data set can produce dramatically different results.
• Large datasets are needed.
• Careful validation is required to avoid over-fitting.
• The data snooping process can be misleading.
• There is a bias toward selecting categorical predictors with many levels.
• Considerable judgment and experimentation may be needed to develop a suitable result.

### References

Bramer, Max. 2007. Principles of Data Mining. New York: Springer.
Breiman, JH Friedman, L., and CJ Stone. 1984. Classification and Regression Trees. Boca Raton: Chapman; Hall/CRC.
“Wolfram MathWord: Stirling Numbers of the Second Kind.” n.d. https://https://mathworld.wolfram.com/StirlingNumberoftheSecondKind.html.