5.7 Factors versus Dummy Variables in Tree-Based Models

As previously mentioned, certain types of models have the ability to use categorical data in its natural form (i.e., without conversions to dummy variables). A simple regression tree (Breiman et al. 1984) using the day of the week predictor for the Chicago data resulted in this simple split for prediction:

if day in {Sun, Sat} then ridership = 4.4K
 else ridership = 17.3K

Suppose the day of the week had been converted to dummy variables. What would have occurred? In this case, the model is slightly more complex since it can only create rules as a function of a single dummy variable at a time:

if day = Sun then ridership = 3.84K
  else if day = Sat then ridership = 4.96K
    else ridership = 17.30K
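
The contrast between the two rule sets can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to produce the rules above: the ridership values are made up, the helper names (`sse`, `split_sse`, `best_factor_split`, `best_dummy_split`) are hypothetical, and the factor search relies on the standard regression-tree shortcut of ordering the levels by their mean outcome before scanning cut points.

```python
from statistics import mean

# Hypothetical toy ridership values (in thousands) per day; NOT the Chicago data.
toy = {
    "Sun": [3.8, 3.9], "Sat": [4.9, 5.0],
    "Mon": [17.0, 17.4], "Tue": [17.5, 17.3], "Wed": [17.6, 17.2],
    "Thu": [17.4, 17.1], "Fri": [17.0, 17.3],
}

def sse(values):
    """Sum of squared errors around the mean (0 for an empty group)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def split_sse(data, left_levels):
    """Total SSE after sending `left_levels` left and all other levels right."""
    left = [v for k, vs in data.items() if k in left_levels for v in vs]
    right = [v for k, vs in data.items() if k not in left_levels for v in vs]
    return sse(left) + sse(right)

def best_factor_split(data):
    """Factor encoding: order levels by mean outcome, scan the k-1 cut points."""
    ordered = sorted(data, key=lambda k: mean(data[k]))
    candidates = [set(ordered[:i]) for i in range(1, len(ordered))]
    return min(candidates, key=lambda s: split_sse(data, s))

def best_dummy_split(data):
    """Dummy encoding: a single 0/1 column can only isolate one level."""
    return min(({k} for k in data), key=lambda s: split_sse(data, s))

factor_left = best_factor_split(toy)
dummy_left = best_dummy_split(toy)
print(sorted(factor_left), round(split_sse(toy, factor_left), 3))
print(sorted(dummy_left), round(split_sse(toy, dummy_left), 3))
```

With the factor encoding, a single split can group both weekend days; with dummy variables, the first split can peel off only one day, so a second split is needed to reach a comparable fit.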

The non-dummy variable model could have resulted in the same structure if it had found that the results for Saturday and Sunday were sufficiently different to merit an extra set of splits. This leads to the question related to using categorical predictors in tree-based models: does it matter how the predictors are encoded?

To answer this question, a series of experiments was conducted⁵⁰. Several public classification data sets were used to make the comparison between the different encodings. These are summarized in Table 5.5. Two data sets contained both ordered and unordered factors. As described below, the ordered factors were treated in different ways.

Table 5.5: A summary of the data sets used to evaluate encodings of categorical predictors in tree- and rule-based models.
	Attrition	Cars	Churn	German Credit	HPC
n	1470	1728	5000	1000	4331
p	30	6	19	20	7
Classes	2	4	2	2	4
Numeric Predictors	16	0	15	7	5
Factor Predictors	14	6	4	13	2
Ordered Factors	7	0	0	1	0
Factor with 3+ Levels	12	7	2	11	3
Factor with 5+ Levels	3	0	1	5	2

For each iteration of the simulation, 75% of the data were used for the training set and 10-fold cross-validation was used to tune the models. When the outcome variable had two levels, the area under the ROC curve was maximized. For the other data sets, the multinomial log-likelihood was optimized. The same number of tuning parameter values was evaluated for each encoding although, in some cases, the values of these parameters differed because the number of predictors changes when dummy variables are generated. The same resamples and random numbers were used for the models with and without dummy variables. When the data contained a substantial class imbalance, the data were downsampled to compensate.
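
One way to keep the comparison paired is to derive every split from a single seeded random number generator, so both encodings see identical rows. The sketch below is an assumption-laden illustration rather than the simulation code: the split is not stratified, downsampling is applied to the training set before the folds are formed (the text does not say where it was applied), and the function names, outcome labels, and seed are all hypothetical.

```python
import random
from collections import defaultdict

def downsample(labels, rng):
    """Return positions with every class reduced to the minority class size."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    n_min = min(len(idx) for idx in by_class.values())
    keep = []
    for idx in by_class.values():
        keep.extend(rng.sample(idx, n_min))
    return sorted(keep)

def train_test_split(n_rows, train_frac, rng):
    """Shuffle row indices and cut them into training and test sets."""
    idx = list(range(n_rows))
    rng.shuffle(idx)
    cut = int(round(train_frac * n_rows))
    return sorted(idx[:cut]), sorted(idx[cut:])

def cv_folds(indices, n_folds, rng):
    """Assign shuffled indices to folds in round-robin fashion."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    return [shuffled[k::n_folds] for k in range(n_folds)]

def make_resamples(labels, seed):
    """75% training split, downsampling (assumed placement), then 10 folds."""
    rng = random.Random(seed)
    train, test = train_test_split(len(labels), 0.75, rng)
    kept = downsample([labels[i] for i in train], rng)
    train_bal = [train[j] for j in kept]
    return train_bal, test, cv_folds(train_bal, 10, rng)

labels = ["yes"] * 40 + ["no"] * 160      # hypothetical imbalanced outcome
factor_plan = make_resamples(labels, seed=2019)   # rows used for the factor model
dummy_plan = make_resamples(labels, seed=2019)    # same seed, so same rows
```

Because the row indices are fixed before any encoding is applied, any difference in performance can be attributed to the encoding rather than to the luck of the split.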

For each data set, models were fit with the data in its original form as well as with (unordered) dummy variables. For the two data sets with ordinal data, an additional model was created using ordinal dummy variables (i.e., polynomial contrasts).
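
To make the two dummy variable encodings concrete, the sketch below builds both for a small ordered factor. It is only an illustration: the level names are hypothetical, the unordered encoding assumes reference-cell coding (the first level is dropped), and `poly_contrasts` is a hand-rolled Gram-Schmidt construction of orthonormal polynomial contrasts for equally spaced levels, not the routine used in the experiments.

```python
import math

def dummy_columns(values, levels):
    """One 0/1 indicator per level except the first (reference) level."""
    return [[1 if v == lev else 0 for lev in levels[1:]] for v in values]

def poly_contrasts(k):
    """Orthonormal polynomial contrasts for k equally spaced ordered levels.

    Returns k rows (one per level) and k-1 columns (linear, quadratic, ...).
    """
    scores = [i - (k - 1) / 2 for i in range(k)]     # centered level scores
    basis = [[1 / math.sqrt(k)] * k]                  # normalized constant term
    for degree in range(1, k):
        col = [s ** degree for s in scores]
        for b in basis:                               # Gram-Schmidt step
            proj = sum(c * x for c, x in zip(col, b))
            col = [c - proj * x for c, x in zip(col, b)]
        norm = math.sqrt(sum(c * c for c in col))
        basis.append([c / norm for c in col])
    # Drop the constant column and transpose to level-by-contrast rows.
    return [list(row) for row in zip(*basis[1:])]

levels = ["low", "medium", "high"]            # hypothetical ordered factor
values = ["medium", "high", "low", "high"]

unordered = dummy_columns(values, levels)     # indicator columns
contrasts = poly_contrasts(len(levels))       # linear and quadratic terms
ordinal = [contrasts[levels.index(v)] for v in values]
```

The unordered columns carry no information about the ordering of the levels, whereas the polynomial columns encode it through linear and higher-order trends.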

Several models were fit to the data sets. Unless otherwise stated, the default software parameters were used for each model.

Single CART trees (Breiman et al. 1984). The complexity parameter was chosen using the “one-standard error rule” that is internal to the recursive partitioning algorithm.
Bagged CART trees (Breiman 1996). Each model contained 50 constituent trees.
Single C5.0 trees and single C5.0 rulesets (Quinlan 1993; Kuhn and Johnson 2013).
Single conditional inference trees (Hothorn, Hornik, and Zeileis 2006). A grid of 10 values for the p-value threshold for splitting was evaluated.
Boosted CART trees (a.k.a. stochastic gradient boosting, Friedman (2002)). The models were optimized over the number of trees, the learning rate, the tree depth, and the number of samples required to make additional splits. Twenty-five parameter combinations were evaluated using random search.
Boosted C5.0 trees. These were tuned for the number of iterations.
Boosted C5.0 rules. These were also tuned for the number of iterations.
Random forests using CART trees (Breiman 2001). Each forest contained 1,500 trees and the number of variables selected for a split was tuned over a grid of 10 values.
Random forests using conditional (unbiased) inference trees (Strobl et al. 2007). Each forest contained 100 trees and the number of variables selected for a split was tuned over a grid of 10 values.
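
The “one-standard error rule” mentioned for the single CART trees can be sketched as follows. This is a generic statement of the rule, not the internals of any particular recursive partitioning implementation; the function name and the cross-validation table are hypothetical and the numbers are for illustration only.

```python
def one_standard_error_rule(results):
    """Pick the simplest candidate within one SE of the best CV error.

    `results` holds (n_splits, cv_error, std_error) tuples; fewer splits
    means a simpler tree.
    """
    best = min(results, key=lambda r: r[1])          # lowest resampled error
    threshold = best[1] + best[2]                    # best error plus one SE
    eligible = [r for r in results if r[1] <= threshold]
    return min(eligible, key=lambda r: r[0])         # simplest eligible tree

# Hypothetical cross-validation summary, for illustration only.
cv_table = [
    (0, 1.00, 0.05),
    (1, 0.62, 0.04),
    (3, 0.48, 0.04),
    (6, 0.45, 0.04),
    (10, 0.47, 0.04),
]
chosen = one_standard_error_rule(cv_table)
print(chosen)
```

The rule trades a small, statistically indistinguishable amount of error for a smaller tree, which is why the chosen tree need not be the one with the numerically lowest error.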

For each model, a variety of performance metrics were estimated using resampling, as was the total time to train and tune the model.⁵¹

For the three data sets with two classes, Figure 5.6 shows a summary of the results for the area under the ROC curve. In the plot, the percent difference in performance is calculated using

$\%Difference = \frac{Factor - Dummy}{Factor}\times 100$

In this way, positive values indicate that the factor encodings have better performance. The image shows the median in addition to the 5th and 95th percentiles of the distribution. The gray dotted line indicates no difference between the encodings.
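
As a small worked example of this calculation, the sketch below applies the percent difference formula to paired results and then summarizes them with the median and the 5th and 95th percentiles. The AUC values are invented for illustration, and the function names are hypothetical.

```python
from statistics import median, quantiles

def percent_difference(factor, dummy):
    """(Factor - Dummy) / Factor * 100; positive favors the factor encoding."""
    return (factor - dummy) / factor * 100

def summarize(diffs):
    """Return the 5th percentile, median, and 95th percentile."""
    cuts = quantiles(diffs, n=20, method="inclusive")   # 19 cut points, 5% apart
    return cuts[0], median(diffs), cuts[-1]

# Hypothetical paired AUC values from repeated simulations (not real results).
factor_auc = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.80]
dummy_auc = [0.80, 0.80, 0.82, 0.81, 0.81, 0.79, 0.83, 0.80]

diffs = [percent_difference(f, d) for f, d in zip(factor_auc, dummy_auc)]
low, mid, high = summarize(diffs)
print(round(low, 2), round(mid, 2), round(high, 2))
```

An interval from `low` to `high` that includes zero corresponds to a case where the two encodings cannot be distinguished.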

Figure 5.6: A comparison of encodings for the area under the ROC curve.

The results of these simulations show that, for these data sets, there is no real difference in the area under the ROC curve between the encoding methods. This also appears to be true when simple factor encodings are compared to dummy variables generated from polynomial contrasts for ordered predictors⁵². Of the ten models, there are only two cases out of 40 scenarios where the bulk of the distribution did not cover zero. Stochastic gradient boosting and bagged CART trees are both ensemble methods, and these showed a 2%-4% drop in the area under the ROC curve when using factors instead of dummy variables for a single data set.

Another metric, overall accuracy, can also be assessed. These results are shown in Figure 5.7 where all of the models can be taken into account. In this case, the results are mixed. When comparing factors to unordered dummy variables, two of the models show differences in encodings. The churn data shows similar results to the ROC curve metrics. The car evaluation data demonstrates a nearly uniform effect where factor encodings do better than dummy variables. Recall that in the car data, which has four classes, all of the predictors are categorical. For this reason, it is likely to show the most effect of all of the data sets.

In the case of dummy variables generated using polynomial contrasts, neither of the data sets shows a difference between the two encodings. However, the car evaluation data shows a pattern where the factor encodings had no difference compared to polynomial contrasts but, when compared to unordered dummy variables, the factor encoding is superior. This indicates that the underlying trend in the data follows a polynomial pattern.

In terms of performance, it appears that differences between the two encodings are rare (but can occur). One might infer that, since all of the predictors in the car data are categorical, this situation would be a good indicator for when to use factors instead of dummy variables. However, two of the data sets (Attrition and German Credit) have a high proportion of categorical predictors and show no difference. In some data sets, the effect of the encoding would depend on whether the categorical predictor(s) are important to the outcome and in what way.

Figure 5.7: A comparison of encodings for the accuracy.

In summary, while few differences were seen, it is very difficult to predict when a difference will occur.

However, one other statistic was computed for each of the simulations: the time to train the models. Figure 5.8 shows the speed-up of using factors over dummy variables (i.e., a value of 2.5 indicates that dummy variable models are two and a half times slower than factor encoding models). Here, there is a very strong trend that factor-based models are more efficiently trained than their dummy variable counterparts. The reason for this is likely to be that the expanded number of predictors (caused by generating dummy variables) requires more computational time than the method for determining the optimal split of factor levels. The exceptions to this trend are the models using conditional inference trees.

Figure 5.8: A comparison of encodings for time to train the model. Large values indicate that the factor encoding took less time to train than the model with dummy variables.
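
The speed-up statistic and the growth in the number of predictors can both be written down directly. The sketch below is illustrative only: the level counts and timings are hypothetical, the function names are invented, and `full_rank=True` assumes that one reference level is dropped per factor.

```python
def dummy_column_count(level_counts, full_rank=True):
    """Number of indicator columns produced by a set of factor predictors."""
    drop = 1 if full_rank else 0          # drop a reference level per factor
    return sum(k - drop for k in level_counts)

def speed_up(dummy_seconds, factor_seconds):
    """Ratio above 1 means the factor encoding trained faster."""
    return dummy_seconds / factor_seconds

level_counts = [4, 4, 3, 7]               # hypothetical factors and their levels
print(len(level_counts), "factor predictors become",
      dummy_column_count(level_counts), "dummy columns")

# Hypothetical timings: the dummy model takes 25s, the factor model 10s.
print(speed_up(25.0, 10.0))
```

A tree must evaluate candidate splits on every column at every node, so the wider dummy variable matrix is a plausible source of the extra training time.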

One other effect of how qualitative predictors are encoded is related to summary measures. Many of these techniques, especially tree-based models, calculate variable importance scores that are relative measures for how much a predictor affected the outcome. For example, trees measure the effect of a specific split on the improvement in model performance (e.g., impurity, residual error, etc.). As predictors are used in splits, these improvements are aggregated; these can be used as the importance scores. If a split involves all of the predictor’s values (e.g., Saturday versus the other six days), the importance score for the entire variable is likely to be much larger than a similar importance score for an individual level (e.g., Saturday or not-Saturday). In the latter case, these fragmented scores for each level may not be ranked as highly as the analogous score that reflects all of the levels. A similar issue comes up during feature selection (Chapters 10 through 12) and interaction detection (Chapter 7). The choice of predictor encoding methods is discussed further there.
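
The fragmentation of importance scores can be illustrated with a toy calculation. The sketch below uses made-up ridership values (not the Chicago data) and a simple importance measure, the reduction in the sum of squared errors credited to whichever column was used in the split; real implementations differ in how they scale and aggregate these improvements.

```python
from statistics import mean

# Hypothetical toy ridership values (in thousands); NOT the Chicago data.
toy = {
    "Sun": [3.8, 3.9], "Sat": [4.9, 5.0],
    "Mon": [17.0, 17.4], "Tue": [17.5, 17.3], "Wed": [17.6, 17.2],
    "Thu": [17.4, 17.1], "Fri": [17.0, 17.3],
}

def sse(values):
    """Sum of squared errors around the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def improvement(parent, left, right):
    """Reduction in SSE achieved by splitting `parent` into two groups."""
    return sse(parent) - sse(left) - sse(right)

all_values = [v for vs in toy.values() for v in vs]
weekend = toy["Sun"] + toy["Sat"]
weekdays = [v for k, vs in toy.items() if k not in ("Sun", "Sat") for v in vs]
not_sun = [v for k, vs in toy.items() if k != "Sun" for v in vs]

# Factor encoding: one split, and the whole predictor gets the credit.
factor_importance = {"day": improvement(all_values, weekend, weekdays)}

# Dummy encoding: Sunday is split off first, then Saturday among the rest;
# each improvement is credited to a different indicator column.
dummy_importance = {
    "day_Sun": improvement(all_values, toy["Sun"], not_sun),
    "day_Sat": improvement(not_sun, toy["Sat"], weekdays),
}

print({k: round(v, 2) for k, v in factor_importance.items()})
print({k: round(v, 2) for k, v in dummy_importance.items()})
```

In this toy case, each indicator column receives roughly half of the score that the single factor receives, so either column could be outranked by a competing predictor that the intact factor would have beaten.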

As a guideline, we suggest using the predictors without converting them to dummy variables and, if the model appears promising, also refitting it using dummy variables.

⁵⁰ In Chapter 12, a similar but smaller-scale analysis is conducted with the naive Bayes model.
⁵¹ The programs used in the simulation are contained in the GitHub repository https://github.com/topepo/dummies-vs-factors.
⁵² Note that many of the non-ensemble methods, such as C5.0 trees/rules, CART, and conditional inference trees, show a significant amount of variation. This is due to the fact that they are unstable models with high variance.