7 Explainable AI
As demonstrated in Session 4, ensemble models (and, in many applications, deep learning models as well) increase model performance drastically. A large downside, however, is the trade-off between increased model complexity and reduced model interpretability. Remember that a nice graphical overview like that of a single decision tree is not possible for a random forest of 1,000 trees. This is a major drawback of ensemble methods: managers want to know what they are basing their decisions on. The issue is further compounded by recent regulations which make explainable machine learning methods mandatory by law in certain domains (e.g., credit scoring and default prediction).
To overcome this black-box issue, researchers have come up with a number of methods to explain how the various features influence the model. A first general question is: 'Which are the most important drivers of our model?'. To answer it, feature importances were introduced; they indicate how important each individual feature is for the model's predictive performance. We will be using the random forest model created in Session 4 for this.
setwd("C:\\Users\\matbogae\\OneDrive - UGent\\PPA22\\PredictiveAnalytics\\Book_2022\\data codeBook_2022")
load("randomForestSession4.Rdata")
load("TrainValTest.Rdata")
load("XGBSession4.Rdata")
7.1 Feature importance
Some algorithms have specific ways of calculating these importances. For example, random forest uses the decrease in Gini impurity as a measure of importance, as this decrease in impurity is, after all, what the algorithm optimizes. When we have hundreds of trees, each variable will be used numerous times as a splitting criterion, and the average decrease caused by splits on the variable can then be used as an estimate of its importance. Because this measure is inherently linked to the random forest algorithm, the authors of the randomForest package also included a function to calculate these importances automatically.
if (!require("pacman")) install.packages("pacman")
require("pacman", character.only = TRUE, quietly = TRUE)
p_load(randomForest)

randomForest::importance(rFmodel, type = 2)
##                        MeanDecreaseGini
## TotalDiscount               23.43498369
## TotalPrice                  25.64158626
## TotalCredit                 12.87933281
## PaymentType_DD               0.02843186
## PaymentStatus_Not.Paid       0.50086927
## Frequency                   15.11395560
## Recency                     24.28758256
varImpPlot(rFmodel, type = 2)
We clearly observe the importance of the traditional RFM variables. The total amount of discount received is, however, the single most important variable.
Remember that during the estimation of a random forest, part of the data is always left out of each tree's construction, as a bootstrap sample is very unlikely to include all unique observations. These out-of-bag (OOB) instances can be used to estimate the accuracy of the model. By permuting each predictor variable in turn, one can see how much the accuracy on the OOB data decreases.
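To see what such a permutation importance computes under the hood, here is a minimal sketch on toy data. It uses a simple glm instead of the session's random forest, and the variable names (x1, x2) are made up for illustration:

```r
# Sketch: permutation importance by hand (toy data, toy glm -- illustrative only)
set.seed(7)
n  <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- factor(ifelse(df$x1 + 0.1 * df$x2 + rnorm(n, sd = 0.5) > 0, "1", "0"))

fit <- glm(y ~ x1 + x2, data = df, family = binomial)
acc <- function(d) mean((predict(fit, d, type = "response") > 0.5) == (d$y == "1"))

base_acc <- acc(df)
perm_imp <- sapply(c("x1", "x2"), function(v) {
  drops <- replicate(20, {          # repeat the permutation for stability
    d <- df
    d[[v]] <- sample(d[[v]])        # break the link between v and y
    base_acc - acc(d)               # accuracy drop = importance
  })
  mean(drops)
})
perm_imp
stopifnot(perm_imp["x1"] > perm_imp["x2"])  # the strong predictor should dominate
```

The same idea carries over to the OOB case: instead of scoring on the full data, each tree is scored on its own out-of-bag rows before and after permuting the variable.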
randomForest::importance(rFmodel, type = 1)  # change to type = 1
##                        MeanDecreaseAccuracy
## TotalDiscount                      2.497947
## TotalPrice                         2.139902
## TotalCredit                        3.084465
## PaymentType_DD                     0.000000
## PaymentStatus_Not.Paid             0.000000
## Frequency                          2.138397
## Recency                            3.174698
varImpPlot(rFmodel, type = 1)
We observe how the recency variable has now gained the first ranking in terms of variable importance. This is also more in line with what we would expect. The over-estimation of TotalDiscount's importance based on the Gini decrease illustrates a typical problem with that method: permutation-based feature importances (i.e., the ones that use a traditional performance measure) tend to give more reliable estimates. Of course, not all methods involve OOB samples; you can, however, do the same using the training or test set.
The iml (interpretable machine learning) package provides such a built-in function, which calculates the feature importance (with 90% confidence bands) based on a number of possible performance measures. You can either use a predefined metric via a character input ('ce', 'f1', 'logLoss', 'mae', 'mse', 'rmse', 'mape', 'mdae', 'msle', 'percent_bias', 'rae', 'rmsle', 'rse', 'rrse', 'smape') or provide a self-defined metric which takes the functional form function(actual, predicted). If you go for a self-defined metric, the Metrics package can help you build your own function. We also use the revalue function from plyr, as the iml package has difficulties with factor levels called '0' and '1'. Since the output of a permutation-based feature importance score is stochastic, make sure to set the number of permutations (n.repetitions) sufficiently large to obtain stable and reliable estimates.
p_load(iml, partykit, plyr, tidyverse)

# Join into one table
iml_df <- BasetableTRAINbig

# Change the churn levels
iml_df <- iml_df %>%
  mutate(churn = revalue(yTRAINbig, c(`0` = "nonchurn", `1` = "churn")))

rf_iml <- randomForest(churn ~ ., data = iml_df, ntree = 500)

# Make new predictor class
x_pred <- Predictor$new(model = rf_iml, data = iml_df[, 1:7],
                        y = iml_df$churn, type = "prob")

# Perform feature importance
x_imp <- FeatureImp$new(x_pred, loss = "ce", n.repetitions = 10)
plot(x_imp)
We observe the TotalPrice variable to be the most important one. However, the wide confidence bands tell us that we should pay special attention to the other four or five top variables as well. The payment variables are clearly of lesser influence.
The big advantage of this method is that it can be applied to virtually any classifier that uses the standard data.frame input. Be careful, however, as the different packages all have different input requirements. Sometimes no refactoring is needed, while other times you will have to make additional, package-specific changes. This trial-and-error work is typical for a data scientist and quite similar to what you have already experienced during the preprocessing phase.
One special case we will discuss is XGBoost. As the package requires its own unique data input, it is incompatible with the iml package. Luckily, the xgboost package already has many interpretation functionalities implemented. Always have a good look at the help function to see what these functions are doing!
p_load(xgboost)

importance_matrix <- xgb.importance(colnames(BasetableTRAINbig), model = xgb)
xgb.plot.importance(importance_matrix, rel_to_first = TRUE,
                    xlab = "Relative importance")
p_load(Ckmeans.1d.dp)

xgb.ggplot.importance(importance_matrix, rel_to_first = TRUE,
                      xlab = "Relative importance")
TotalDiscount is ranked on top, while the other four variables (notice how the payment variables receive zero importance in the xgboost model) are also of considerable importance.
Knowing which variables are the most important is one thing; it is also useful to know how these variables influence the dependent variable. We know that discounts influence churn behaviour, but do they influence it negatively, as we expected, and, if so, does the relationship keep on decreasing?
7.2 PDP and ICE
To answer this, we want insight into the nature of the relationship as fitted by the model, and a plot is a useful tool for visualizing such relationships. The iml package provides two useful functionalities to do so: partial dependence plots (PDP) and individual conditional expectation (ICE) curves. The FeatureEffect class in the package enables us to plot both.
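Conceptually, a PDP is just an average of what-if predictions: fix the feature of interest at a grid value for all observations, predict, and average. A minimal hand-rolled sketch on toy data (the glm and variable names are illustrative, not the session's model):

```r
# Sketch: a partial dependence curve computed by hand (toy data -- illustrative only)
set.seed(7)
n  <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- factor(ifelse(df$x1 + rnorm(n, sd = 0.5) > 0, "1", "0"))
fit <- glm(y ~ x1 + x2, data = df, family = binomial)

grid <- seq(min(df$x1), max(df$x1), length.out = 20)
pdp  <- sapply(grid, function(g) {
  d <- df
  d$x1 <- g                                   # fix the feature of interest at g
  mean(predict(fit, d, type = "response"))    # average over the data distribution
})
plot(grid, pdp, type = "l", xlab = "x1", ylab = "Partial dependence")
stopifnot(all(diff(pdp) > 0))  # monotone increasing for this toy logistic model
```

ICE curves come from the same computation: simply skip the final averaging and keep one curve per row.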
# Use the x_pred object from above for partial dependence plots and ICE curves
eff <- FeatureEffect$new(x_pred, feature = "Recency", method = "pdp")
plot(eff)
eff <- FeatureEffect$new(x_pred, feature = "TotalDiscount", method = "pdp")
plot(eff)
We observe a clear monotonic increase in the first part of the TotalDiscount curve (until approximately TotalDiscount = 120). This means that more discounts translate to more likely churn behaviour! The total amount of discount may be indicative of compensations made by the firm towards its customers, and these compensations may not be sufficient to overcome unpleasant customer experiences. Afterwards, the curve flattens out heavily, but this can also be attributed to the smaller number of observations in that range (as indicated by the ticks at the bottom of the curve). This plot not only shows how we can identify potential churners, but also gives business insights into what might be going wrong: did we compensate dissatisfied customers sufficiently? Do we need to try compensations other than financial ones, since those do not really seem to work? Could other methods be more effective and less costly?
The PDP shows the average effect of the variable: how the model's predictions generally change with the feature. This average is of course not the whole story, due to interactions with other features. To visualize this more clearly, you can also add ICE curves, which show the curve for each individual observation; the PDP is still displayed as the thick yellow line. For several customers we notice first a downward trend followed by a stagnation, but for most customers the effect is an increase followed by a stagnation, which explains why the average curve favors this relationship.
eff <- FeatureEffect$new(x_pred, feature = "TotalDiscount", method = "pdp+ice")
plot(eff)
7.3 LIME
While the above methods already give a good idea of which variables affect our model and how, they are still not fully fit for explaining individual predictions. The LIME framework tries to overcome this. The framework builds a white-box model based on the predictions of the black-box model; this white-box model can then explain the predictions. Typical examples are logistic regression and decision trees. Our beloved iml package once again implements both for us: TreeSurrogate creates a local surrogate decision tree, while LocalModel fits a logistic regression on the predictions.
tree_sur <- TreeSurrogate$new(x_pred)

# Plot the resulting leaf nodes
plot(tree_sur)
The fitted decision tree splits the data into four large end nodes. However, the only pane that makes sense is the bottom right one: when customers have paid a high price and have a small credit line, they are likely to stay with your company. All other panes are not that clear-cut, which tempers the validity of our surrogate model. This is also a good example of how expert knowledge (in this case you don't have to be an extreme expert) can help identify issues with models when XAI tools are applied.
# Extract the results
dat <- tree_sur$results
head(dat)
##   TotalDiscount TotalPrice TotalCredit
## 1      22.79579     114.40        0.00
## 2       0.00000     972.00        0.00
## 3       0.00000     951.00        0.00
## 4       0.00000     957.69      -14.31
## 5      72.00000     144.00        0.00
## 6       0.00000     861.22      -17.78
##   PaymentType_DD PaymentStatus_Not.Paid Frequency
## 1              0                      0         4
## 2              0                      0         4
## 3              0                      0         5
## 4              0                      0         4
## 5              0                      0         1
## 6              0                      0         5
##   Recency .node
## 1     295     3
## 2      64     7
## 3     341     7
## 4      39     7
## 5      39     7
## 6      94     7
##                                       .path
## 1   TotalPrice <= 130 &\n TotalPrice <= 129
## 2 TotalPrice > 130 &\n TotalCredit > -90.49
## 3 TotalPrice > 130 &\n TotalCredit > -90.49
## 4 TotalPrice > 130 &\n TotalCredit > -90.49
## 5 TotalPrice > 130 &\n TotalCredit > -90.49
## 6 TotalPrice > 130 &\n TotalCredit > -90.49
##   .y.hat.nonchurn .y.hat.churn
## 1           1.000        0.000
## 2           1.000        0.000
## 3           1.000        0.000
## 4           0.862        0.138
## 5           0.530        0.470
## 6           0.504        0.496
##   .y.hat.tree.nonchurn .y.hat.tree.churn
## 1             0.914321        0.08567901
## 2             0.975840        0.02416000
## 3             0.975840        0.02416000
## 4             0.975840        0.02416000
## 5             0.975840        0.02416000
## 6             0.975840        0.02416000
We notice a discrepancy between the true value and the prediction of our tree, which further highlights the issues with our surrogate model. Let us try increasing the complexity of the local surrogate model by allowing deeper trees (the default is maxdepth = 2).
tree_sur <- TreeSurrogate$new(x_pred, maxdepth = 4)

# Plot the resulting leaf nodes
plot(tree_sur)
This gives some more insights, although mainly for nonchurners; for churners the insights remain rather limited. This shows a major downside of LIME, namely its sensitivity to the accuracy of the surrogate model. The decision tree has the advantage of modeling nonmonotonic relationships, but the various groups can become hard to track when maxdepth becomes too large. Luckily for us, LIME also allows you to fit a linear/logistic regression model, which can work better in this situation.
With logistic regression, the relationship can be explained by coefficients. Let's check out how this works exactly and how it can be used for individual explanations. As we saw that five features were highly influential, we set the number of used features (k) equal to 5. LocalModel takes just one instance at a time as input to explain; we will use the first observation in the test set.
lime <- LocalModel$new(x_pred, x.interest = BasetableTEST[1, ], k = 5)
lime
## Warning: from glmnet Fortran code (error code
## -68); Convergence for 68th lambda value not
## reached after maxit=100000 iterations; solutions
## for larger lambdas returned
## Interpretation method:  LocalModel
##
## Analysed predictor:
## Prediction task: unknown
##
## Analysed data:
## Sampling from data.frame with 1415 rows and 7 columns.
##
## Head of results:
##           beta x.recoded     effect x.original
## 1 -0.001567426      0.00  0.0000000          0
## 2  0.007885419    -12.95 -0.1021162     -12.95
## 3  0.606118346      0.00  0.0000000          0
## 4  0.483342888      4.00  1.9333716          4
## 5  0.002716930    145.00  0.3939549        145
## 6  0.001567426      0.00  0.0000000          0
##                  feature             feature.value
## 1          TotalDiscount           TotalDiscount=0
## 2            TotalCredit        TotalCredit=-12.95
## 3 PaymentStatus_Not.Paid  PaymentStatus_Not.Paid=0
## 4              Frequency               Frequency=4
## 5                Recency               Recency=145
## 6          TotalDiscount           TotalDiscount=0
##     .class
## 1 nonchurn
## 2 nonchurn
## 3 nonchurn
## 4 nonchurn
## 5 nonchurn
## 6    churn
Recency (which, rather surprisingly, was picked up by the model) has a positive effect on the churn probability, while TotalCredit has a negative effect and TotalDiscount has almost no effect.
7.4 Shapley values
Another framework which can be used to explain individual predictions is that of Shapley values. This framework treats the features as players in a cooperative game: the value of a feature is then equal to its (average) marginal contribution across all possible coalitions of features. Simply said, it is the 'general deviation' from what you would expect for an average observation.
The iml package also provides a class for this framework: Shapley. The individual Shapley value of a feature for a predicted observation expresses how much that feature contributes to the deviation from the average prediction. Its syntax shows great similarity to LocalModel. The main difference is that we don't have to set the number of features, as the Shapley values are distributed across all features. This has the advantage that you can't make incorrect assumptions about k; moreover, Shapley values form the only framework with a sound theoretical underpinning.
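To see where Shapley values come from, here is a minimal sketch that computes them exactly for a toy three-player game. The value function v below is entirely made up for illustration; in the model context, the "worth" of a coalition would be the model's prediction using only that subset of features:

```r
# Sketch: exact Shapley values for a toy 3-player cooperative game (illustrative only)
players <- c("A", "B", "C")
v <- function(S) {                           # worth of a coalition (toy numbers)
  s <- paste(sort(S), collapse = "")
  switch(s, "A" = 10, "B" = 20, "C" = 0, "AB" = 40, "AC" = 15,
            "BC" = 25, "ABC" = 50, 0)        # empty coalition is worth 0
}
shapley <- sapply(players, function(p) {
  others  <- setdiff(players, p)
  # subsets of the other players that can precede p, with Shapley weights
  # |S|! * (n - |S| - 1)! / n!  for n = 3:  {} and the pair get 2/6, singletons 1/6
  subsets <- list(c(), others[1], others[2], others)
  weights <- c(2, 1, 1, 2) / 6
  sum(weights * sapply(subsets, function(S) v(c(S, p)) - v(S)))
})
shapley
# Efficiency property: the Shapley values sum to the grand coalition's worth
stopifnot(abs(sum(shapley) - v(players)) < 1e-9)
```

For real models, enumerating all coalitions is infeasible, which is why implementations such as iml's Shapley class estimate these averages by sampling.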
shapley <- Shapley$new(x_pred, x.interest = BasetableTEST[1, ])
shapley
## Interpretation method:  Shapley
## Predicted value: 1.000000, Average prediction: 0.948749 (diff = 0.051251)
## Predicted value: 0.000000, Average prediction: 0.051251 (diff = -0.051251)
##
## Analysed predictor:
## Prediction task: unknown
##
## Analysed data:
## Sampling from data.frame with 1415 rows and 7 columns.
##
## Head of results:
##                  feature    class     phi
## 1          TotalDiscount nonchurn 0.00470
## 2             TotalPrice nonchurn 0.01316
## 3            TotalCredit nonchurn 0.00848
## 4         PaymentType_DD nonchurn 0.00000
## 5 PaymentStatus_Not.Paid nonchurn 0.00000
## 6              Frequency nonchurn 0.00592
##        phi.var             feature.value
## 1 0.0004990202           TotalDiscount=0
## 2 0.0031000549         TotalPrice=959.05
## 3 0.0064946360        TotalCredit=-12.95
## 4 0.0000000000          PaymentType_DD=0
## 5 0.0000000000  PaymentStatus_Not.Paid=0
## 6 0.0019281552               Frequency=4
You can clearly observe how both LIME and Shapley values can be very useful in explaining a model's individual predictions. This led to the creation of SHAP, which combines the strengths of both: SHAP essentially builds a local surrogate model based on the Shapley values. This leads to a flexible fit, which allowed for the creation of several diagnostic tools, all built around the same principle (note how we have been picking parts from different methodologies up until now).
Unfortunately, most SHAP implementations are not yet well developed in R. While Python has the excellent shap library, which allows for the use of SHAP with a high level of abstraction, we do not observe something similar in R. A first useful attempt is the shapper package, which is actually a wrapper around the Python shap library. To make this work, however, you need to work with virtual environments so that R knows where to find the corresponding Python code. This setup is rather difficult and often leads to issues with package dependencies.
To avoid these issues, we regard the shapper package as out of scope for this course. Another good package, and one more native to R, is fastshap; however, it is still under active development, which makes it prone to errors, so we do not cover it either.
To give you an understanding of the SHAP methodology nonetheless, we will discuss the SHAPforxgboost package. The creators of the xgboost algorithm (and package) built SHAP values right into it, which allows for easy usage of the SHAP methodology whenever the xgboost algorithm is used. This proves rather useful, as xgboost is likely to be the (or close to the) top performer.
p_load(SHAPforxgboost)

shap_values <- shap.values(xgb_model = xgb, X_train = as.matrix(BasetableTRAIN))
shap_values$mean_shap_score
##                Recency          TotalDiscount
##              0.6706929              0.6330155
##              Frequency             TotalPrice
##              0.3053244              0.2965370
##            TotalCredit         PaymentType_DD
##              0.2059751              0.0000000
## PaymentStatus_Not.Paid
##              0.0000000
The mean SHAP score can be used as a measure of variable importance. We get Recency at the top, closely followed by TotalDiscount. If we compare this with the previous methods, this is a general trend, and it is exactly the kind of behavior that we want: outcomes that are largely independent of the method used! The SHAPforxgboost package is not only very easy to use, it is also much faster than many other solutions involving SHAP.
Visualizations are also relatively quick. In a summary plot, we can easily observe how the features influence the predictions. The largest impacts are clearly made by the TotalDiscount and Recency variables, with TotalDiscount having a positive relationship to churn (low variable values (yellow) leading to reduced predictions) while Recency has a negative relationship (high variable values (blue/purple) leading to reduced predictions). Note that this relationship for Recency might seem counterintuitive; however, it has to do with the nature of the problem. A newspaper works with yearly subscriptions, so moderate to high values of Recency (around one year) can simply correspond to loyal customers renewing their subscription, while people who end their subscription may still have made a purchase a few months ago but will not renew in the next period. Again, it can be useful to apply some expert knowledge here to gain more insight.
When looking at the Frequency variable, the interpretation makes more sense: high values of frequency have a negative impact on the predicted churn probability, whereas a low frequency leads to higher churn chances. Note that for TotalCredit the interpretation is trickier, since a negative value means that you have a higher credit line.
shap_prepare <- shap.prep(xgb_model = xgb, X_train = as.matrix(BasetableTRAIN))
shap.plot.summary(shap_prepare)
A similar remark can be made about the dependence plots.
plot(shap.plot.dependence(data_long = shap_prepare, x = "Recency"))
## `geom_smooth()` using formula 'y ~ x'
plot(shap.plot.dependence(data_long = shap_prepare, x = "TotalDiscount"))
## `geom_smooth()` using formula 'y ~ x'
We observe a similar pattern to the ones seen in the PDPs and ICE curves (the downward slope at high TotalDiscount values is a clear case of overfitting to those data points).
The added advantage is the possibility to visualize interactions:
plot(shap.plot.dependence(data_long = shap_prepare, x = "TotalDiscount", y = "Recency", color_feature = "Recency"))
## `geom_smooth()` using formula 'y ~ x'
This plot suggests that the effect of Recency is larger than that of TotalDiscount when the outliers are not considered. For Recency, you can clearly see that around 300 days the influence on churn is negative, which can be seen as a confirmation of our theory above.
SHAP also allows for understandable explanations per observation. The force plot is perhaps the best-known concept: it explains an individual prediction by showing the predictor values acting as forces on the probability (pushing it up or down).
# Prepare data into the correct format
plot_data <- shap.prep.stack.data(shap_contrib = shap_values$shap_score)
## All the features will be used.
## Data has N = 707 | zoom in length is 70 at location 424.2.
The observed plotting style is clearly a bit different, but we still observe how the various individual features contribute to the predicted probabilities. However, all the analyses pointed to the payment variables having little to no impact. How do we reflect this in our code? With the top_n argument, which sets how many variables to visualize in the plot; in our case this is 5 (7 - 2).
plot_data <- shap.prep.stack.data(shap_contrib = shap_values$shap_score,
                                  top_n = 5)
## The SHAP values of the Rest 2 features were summed into variable 'rest_variables'.
## Data has N = 707 | zoom in length is 70 at location 424.2.
How do we get individual explanations then? By subsetting! We can clearly observe how TotalDiscount had a positive effect on the probability of churn, while the other variables had a negative effect. Observation 513 scores low on TotalDiscount, which is in line with the positive relationship observed above.
##     ID  Recency TotalDiscount Frequency
## 1: 541 1.147243    -0.2721461 -1.071955
##    TotalPrice TotalCredit rest_variables group
## 1:  0.9093845  -0.1211484              0     4
##    sorted_id
## 1:       513
## Data has N = 1 | zoom in length is 50 at location 0.6.
Another interesting feature of the package is the n_groups argument, which automatically clusters the predictions based on their SHAP explanations. This allows you to easily interpret your predictions per group. Looking at the overall force plot, there appear to be 5 types of predictions (just look at the switches in pattern), so let us try this out.
plot_data <- shap.prep.stack.data(shap_contrib = shap_values$shap_score,
                                  top_n = 5, n_groups = 5)
## The SHAP values of the Rest 2 features were summed into variable 'rest_variables'.
We have 5 clusters. Cluster 1 contains customers we expect to stay due to good 'signs' on all indicators. Cluster 2 contains the customers we are uncertain about, while cluster 3 is also expected to remain loyal, but mainly due to its score on frequency. Cluster 4 is the 'problematic' cluster with probable churners, mainly due to their Frequency values. Cluster 5 is quite similar to cluster 1, but we are less certain of these customers' loyalty (look at the y-axis).