# 7 Explainable AI

As demonstrated in Session 4, ensemble models (and in many applications deep learning models as well) increase model performance drastically. A large downside, however, is the trade-off between reduced model interpretability and increased model complexity. Remember that a very nice graphical overview like a decision tree is not possible for a random forest of 1000 trees? This is a large downside to ensemble methods. Managers want to know what they are basing their decisions on. This issue is even further enlarged by recent regulations which make explainable machine learning methods mandatory by law (e.g., credit scoring and default prediction).

To overcome this black-box issue, researchers came up with a number of methods to explain how the various features influence the model. A general question is *‘Which are the most important drivers of our model?’*. To answer this question, feature importances were introduced. These feature importances indicate how important each unique feature is for the model’s predictive performance. We will be using the random forest model created in Session 4 for this.

```
setwd("C:\\Users\\matbogae\\OneDrive - UGent\\PPA22\\PredictiveAnalytics\\Book_2022\\data codeBook_2022")
load("randomForestSession4.Rdata")
load("TrainValTest.Rdata")
load("XGBSession4.Rdata")
```

## 7.1 Feature importance

Some algorithms have specific ways of calculating these importances. For example, random forest uses the decrease in Gini as a measure of *importance* in our model, as after all this decrease in impurity is what we are trying to optimize. When we have hundreds of trees, each variable will be used numerous times as a splitting criterion. The average decrease caused by splits on the variable can then be used as an estimate of importance. Because this method is inherently linked to the random forest algorithm, did the authors of the `randomForest`

package also include a function to automatically calculate these importances.

```
if (!require("pacman")) install.packages("pacman")
require("pacman", character.only = TRUE, quietly = TRUE)
p_load(randomForest)
::importance(rFmodel, type = 2) randomForest
```

```
## MeanDecreaseGini
## TotalDiscount 23.43498369
## TotalPrice 25.64158626
## TotalCredit 12.87933281
## PaymentType_DD 0.02843186
## PaymentStatus_Not.Paid 0.50086927
## Frequency 15.11395560
## Recency 24.28758256
```

`varImpPlot(rFmodel, type = 2)`

We clearly observe the importance of the traditional RFM variables. The total amount of discount received is however the single most important variable.

Remember that during the estimation of a random forest, there is always a part of the data used as validation for tree construction, as the bootstrap sample is very unlikely to include all unique observations. These out-of-bag (OOB) instances can be used to estimate the accuracy of the model. By permuting each predictor variable, one can see how this affects (decreases) the accuracy on the OOB data.

`::importance(rFmodel, type = 1) #change to type = 1 randomForest`

```
## MeanDecreaseAccuracy
## TotalDiscount 2.497947
## TotalPrice 2.139902
## TotalCredit 3.084465
## PaymentType_DD 0.000000
## PaymentStatus_Not.Paid 0.000000
## Frequency 2.138397
## Recency 3.174698
```

`varImpPlot(rFmodel, type = 1)`

We observe how the recency variable has now gained the first ranking in terms of variable importance. This is actually also more in line with what we would expect. The over-estimation of TotalDiscount’s importance based on the Gini decrease outlines a typical problem with this method, using pure **permutation-based feature importances** (i.e., the ones that use a traditional performance measure) tend to give more reliable estimates. Of course, not all methods involve OOB samples. You can ,however, do the same when using the training or test set.

The `iml`

(interpretable machine learning) package provides such a built-in function which calculates the feature importance (with 90% confidence bands) based on a number of possible performance measures. It is both possible to use a pre-defined metric with a character input `('ce','f1','logLoss','mae','mse','rmse','mape','mdae','msle','percent_bias','rae','rmsle','rse','rrse','smape')`

as well providing a self-defined metric which takes the functional form `function(actual,predicted)`

. If you use this predefined function, you can also use the `Metrics`

package to add you own function. We also use the `revalue`

function from `plyr`

as the `iml`

package has difficulties with factor levels called ‘0’ and ‘1’. Since the output of a permutation-based feature importance score is stochastic, make sure to put the number of permutations (`n.repetitions`

) sufficiently large to have stable and reliable estimates.

```
p_load(iml, partykit, plyr, tidyverse)
# Join into one table
<- BasetableTRAINbig
iml_df
# change the churn levels
<- iml_df %>%
iml_df mutate(churn = revalue(yTRAINbig, c(`0` = "nonchurn", `1` = "churn")))
<- randomForest(churn ~ ., data = iml_df, ntree = 500)
rf_iml
# Make new predictor class
<- Predictor$new(model = rf_iml, data = iml_df[, 1:7],
x_pred y = iml_df$churn, type = "prob")
# Perform feature importance
<- FeatureImp$new(x_pred, loss = "ce", n.repetitions = 10)
x_imp plot(x_imp)
```

We observe the TotalPrice variable to be the most important variable. However, the wide confidence bands tell us that we should give special attention to the other four/five top variables. The payment variables are clearly of lesser influence.

The big advantage of this method, is the fact that it can be applied to virtually any classifier which uses the standard `data.frame`

input. Be careful, however, as the different packages all have different input requirements. Sometimes you may not have to refactor, while other times you will have to make new unseen changes. This trial-and-error work is typical for a data scientist and quite similar to what you will have already experienced during the preprocessing phase.

One special case we will be discussing is *XGBoost*. As the package requires it’s own unique data input, it is incompatible with the `iml`

package. Luckily, the `xgboost`

package already has many interpretation functionalities implemented. Always have a good look at the help function to see what these functions are doing!

```
p_load(xgboost)
<- xgb.importance(colnames(BasetableTRAINbig),
importance_matrix model = xgb)
xgb.plot.importance(importance_matrix, rel_to_first = TRUE, xlab = "Relative importance")
```

```
p_load(Ckmeans.1d.dp)
xgb.ggplot.importance(importance_matrix, rel_to_first = TRUE,
xlab = "Relative importance")
```

TotalDiscount is ranked on top. While the other four variables (notice how the payment variables receive zero importance in the xgboost model) are also of great importance.

While knowing which variables are the most important, it is also useful to know *how these variables influence the dependent variable*. We know that discounts influence churn behaviour, but do they influence it negatively, as we expected and, if so, does the relationship keep on decreasing?

## 7.2 PDP and ICE

To know this, we want to have insights into the nature of the relationship as fitted by the model. A plot is a useful tool of visualizing such relationships. The `iml`

package provides two useful functionalities to do so: the **partial dependence plots** (PDP) and **individual conditional expectation** (ICE) curves. The `FeatureEffect`

class in the package enables us to plot them both.

```
# Use x_pred object from above for partial dependence plot
# and ice curves
<- FeatureEffect$new(x_pred, feature = "Recency", method = "pdp")
eff plot(eff)
```

```
<- FeatureEffect$new(x_pred, feature = "TotalDiscount", method = "pdp")
eff plot(eff)
```

We observe a clear monotonic increase in the first part of the TotalDiscount curve (until approx TotalDiscount = 120). This means that more discounts translate to more likely churn behaviour! The total amount of discount may be indicative of compensations made by the firm towards their customers. These compensations could be not sufficient to overcome unpleasant customer experiences. Afterwards, the curve heavily flattens out, but this can also be contributed to the less observations in that range (as measured by the ticks at the bottom of the curve). This plot made us not only realize how we can identify potential churners, but also gave business insights into what might be going wrong: did we compensate dissatisfied customers?, Do we need to try other compensations besides financial since they do not really work? Can other methods be more effective and less costly?

The PDP shows the *average* effect of the variable. How you *generally* see the predictions to change according to the model. This is of course not always the case, due to interactions with other features. To visualize this more clearly, you can also add **ICE curves**, which show the curves for individual observations. The PDP is still displayed as the thick yellow line. As you can see for a several customer we notice first a downward trend followed by a stagnation, howevcer for most customers see the effect an increase follwed by a stagnation, which explains why the average curves favors this relationship.

```
<- FeatureEffect$new(x_pred, feature = "TotalDiscount", method = "pdp+ice")
eff plot(eff)
```

## 7.3 LIME

While the above methods already give a good idea about which variables affect our model and how, they are still not 100% fit for explaining individual predictions. The **LIME** framework tries to overcome this. The framework builds a white-box model based on the predictions of the black-box model. This white-box model can then explain the predictions. Typical examples are logistic regression and decision trees. Our beloved `iml`

package once again implements both for us: `TreeSurrogate`

creates a local surrogate decision tree, while `LocalModel`

fits a logistic regression on the predictions.

```
<- TreeSurrogate$new(x_pred)
tree_sur
# Plot the resulting leaf nodes
plot(tree_sur)
```

The fitted decision tree splits the data into four large end nodes. However, the only pane that makes sense is the bottom right pane. When customer have paid a high price and they have a small credit line, they are likely to stay at your company. All other panes are not that clearcut and this actually tempers the validity of our surrogate model. This is also a good example of how expert knowledge (in this case you don’t have to be an extreme expert) can aid in identifying issues with models when XAI tools are being applied.

```
# Extract the results
<- tree_sur$results
dat head(dat)
```

```
## TotalDiscount TotalPrice TotalCredit
## 1 22.79579 114.40 0.00
## 2 0.00000 972.00 0.00
## 3 0.00000 951.00 0.00
## 4 0.00000 957.69 -14.31
## 5 72.00000 144.00 0.00
## 6 0.00000 861.22 -17.78
## PaymentType_DD PaymentStatus_Not.Paid Frequency
## 1 0 0 4
## 2 0 0 4
## 3 0 0 5
## 4 0 0 4
## 5 0 0 1
## 6 0 0 5
## Recency .node
## 1 295 3
## 2 64 7
## 3 341 7
## 4 39 7
## 5 39 7
## 6 94 7
## .path
## 1 TotalPrice <= 130 &\n TotalPrice <= 129
## 2 TotalPrice > 130 &\n TotalCredit > -90.49
## 3 TotalPrice > 130 &\n TotalCredit > -90.49
## 4 TotalPrice > 130 &\n TotalCredit > -90.49
## 5 TotalPrice > 130 &\n TotalCredit > -90.49
## 6 TotalPrice > 130 &\n TotalCredit > -90.49
## .y.hat.nonchurn .y.hat.churn
## 1 1.000 0.000
## 2 1.000 0.000
## 3 1.000 0.000
## 4 0.862 0.138
## 5 0.530 0.470
## 6 0.504 0.496
## .y.hat.tree.nonchurn .y.hat.tree.churn
## 1 0.914321 0.08567901
## 2 0.975840 0.02416000
## 3 0.975840 0.02416000
## 4 0.975840 0.02416000
## 5 0.975840 0.02416000
## 6 0.975840 0.02416000
```

We notice a discrepancy between between true value and prediction of our tree, which further highlights the issues with our surrogate model. Let us try increasing the complexity of our local surrogate model by allowing to create deeper trees (default `maxdepth = 2`

).

```
<- TreeSurrogate$new(x_pred, maxdepth = 4)
tree_sur
# Plot the resulting leaf nodes
plot(tree_sur)
```

This gives some more insights, however mainly for nonchurners, for churners the insights are rather limited. This shows a huge downside of LIME, namely their sensitivity to the accuracy of the surrogate model. The decision tree has the advantage of modeling nonmonotonic relationships, but the various can groups can become hard to track when maxdepth is becoming too high. Luckily for us, LIME also allows you to model a linear/logistic regression model which can be better in this situation.

By using LR the relationship can explained by coefficients. Let’s check out how this works exactly and how it can be used for individual explanations. As we saw that five features were highly influenceable, we set the number of used features (`k`

) equal to 5. The `LocalModel`

can take just one instance at a time as input to explain. We will use the first observation in the test set.

```
<- LocalModel$new(x_pred, x.interest = BasetableTEST[1,
lime k = 5) ],
```

```
## Warning: from glmnet Fortran code (error code
## -68); Convergence for 68th lambda value not
## reached after maxit=100000 iterations; solutions
## for larger lambdas returned
```

` lime`

```
## Interpretation method: LocalModel
##
##
## Analysed predictor:
## Prediction task: unknown
##
##
## Analysed data:
## Sampling from data.frame with 1415 rows and 7 columns.
##
## Head of results:
## beta x.recoded effect x.original
## 1 -0.001567426 0.00 0.0000000 0
## 2 0.007885419 -12.95 -0.1021162 -12.95
## 3 0.606118346 0.00 0.0000000 0
## 4 0.483342888 4.00 1.9333716 4
## 5 0.002716930 145.00 0.3939549 145
## 6 0.001567426 0.00 0.0000000 0
## feature feature.value
## 1 TotalDiscount TotalDiscount=0
## 2 TotalCredit TotalCredit=-12.95
## 3 PaymentStatus_Not.Paid PaymentStatus_Not.Paid=0
## 4 Frequency Frequency=4
## 5 Recency Recency=145
## 6 TotalDiscount TotalDiscount=0
## .class
## 1 nonchurn
## 2 nonchurn
## 3 nonchurn
## 4 nonchurn
## 5 nonchurn
## 6 churn
```

`Frequency`

and `Recency`

(which rather surprisingly was picked up by the model) have a positive effect on churn probability, while `TotalCredit`

has a negative effect and the `Payementstatus`

and `TotalDiscount`

have almost no effect.

`plot(lime)`

## 7.4 Shapley values

Another framework which can be used for individual prediction explanations are **Shapley values**. The framework resulting in those Shapley values thinks of features as players in a cooperative game. The value of a feature is then equal to its (average) contribution across all coalitions (unique data points). Simply said: it is the ‘general deviation’ from what you would expect from an average observation.

The `iml`

package also provides a class for this framework: `Shapley`

. The individual Shapley value of a predicted observation is how much this feature contributes to the deviation from the average prediction across the various features. Its syntax shows great similarity to `LocalModel`

. The main difference is that we don’t have to set the number of features, as Shapley values are distributed across all features. This has the advantage that you can’t make incorrect assumptions about k, while Shapley values is also the only framework which has a sound theoretical backening.

```
<- Shapley$new(x_pred, x.interest = BasetableTEST[1,
shapley
]) shapley
```

```
## Interpretation method: Shapley
## Predicted value: 1.000000, Average prediction: 0.948749 (diff = 0.051251) Predicted value: 0.000000, Average prediction: 0.051251 (diff = -0.051251)
##
## Analysed predictor:
## Prediction task: unknown
##
##
## Analysed data:
## Sampling from data.frame with 1415 rows and 7 columns.
##
## Head of results:
## feature class phi
## 1 TotalDiscount nonchurn 0.00470
## 2 TotalPrice nonchurn 0.01316
## 3 TotalCredit nonchurn 0.00848
## 4 PaymentType_DD nonchurn 0.00000
## 5 PaymentStatus_Not.Paid nonchurn 0.00000
## 6 Frequency nonchurn 0.00592
## phi.var feature.value
## 1 0.0004990202 TotalDiscount=0
## 2 0.0031000549 TotalPrice=959.05
## 3 0.0064946360 TotalCredit=-12.95
## 4 0.0000000000 PaymentType_DD=0
## 5 0.0000000000 PaymentStatus_Not.Paid=0
## 6 0.0019281552 Frequency=4
```

`plot(shapley)`

## 7.5 SHAP

You can clearly observe how both LIME and Shapley values can be very useful in explaining a model’s single predictions. This led to the creation of **SHAP**, which combines the strengths of both. SHAP basically builds a local surrogate model based on the Shapley values. This leads to a flexible fitting, which allowed for the creation of several diagnostic tools, all built around the same principle (note how we have been picking parts from different methodologies up until now).

Unfortunately, most SHAP implementations are not yet nicely developed in R. While Python has the nice `shap`

library, which allows for the use of SHAP with a high level of abstraction, we do not observe something similar in R. A first useful attempt is the `shapper`

package. It is actually a wrapper around the `shap`

library in Python. To make this work, you however need to work with virtual environments, in order to make R know where to look for the corresponding Python code. This task is rather difficult and generally leads to issues with package dependencies.

To avoid these issues, we regard the `shapper`

package as out of scope of this course. Another good package -and more native to R- is the `fastshap`

packages, however it is still under active development. This makes the package prone to errors, which is why we also do not show you this package.

To let you have an understanding of the SHAP methodology nonetheless, we will be discussing the `SHAPforxgboost`

package. The creators of the xgboost algorithm (and package) automatically built in SHAP values. This allows for the easy usage of the methodology of SHAP when the xgboost algorithm is used. This proves rather useful as the xgboost algorithm is likely to be the (or close to the) top performer.

```
p_load(SHAPforxgboost)
<- shap.values(xgb_model = xgb, X_train = as.matrix(BasetableTRAIN))
shap_values $mean_shap_score shap_values
```

```
## Recency TotalDiscount
## 0.6706929 0.6330155
## Frequency TotalPrice
## 0.3053244 0.2965370
## TotalCredit PaymentType_DD
## 0.2059751 0.0000000
## PaymentStatus_Not.Paid
## 0.0000000
```

The mean SHAP score can be used as a means of variable importances. We get `Recency`

and `TotalDiscount`

at the top, followed by `Frequency`

and `TotalPrice`

. If we compare this with previous methods, this is a general trend. It is exactly this kind of behavior that we want, with outcomes being independent of the used method!

The `SHAPforxgboost`

package is not only very easy to use, it is actually also much faster than many other solutions involving SHAP.

Visualizations are also relatively quick. In a summary plot, we can easily observe how features influence the predictions. The largest impacts are clearly made by the `Recency`

and `TotalDiscount`

variables. With the `TotalDiscount`

variable having a positive relationship to churn (low variable values (yellow) leading to reduced predictions) while `Recency`

has a negative relationship (high variable values (blue/purple) leading to reduced predictions). Note that this relationship of `Recency`

might seem counterintuitive, however this has to do with the nature of the problem. A newspaper works with a yearly subscription, hence moderate to high values of `Recency`

could be 1 year and after 1 year people renew their subscription. People who end their subscription will maybe open one a few month ago, but not renew in the next period. Again, it can be useful to use some expert knowledge here to get more insights.

When looking at the `Frequency`

variable, the interpretation makes more sense. High values of frequency have a negative impact on the predicted churn probability, whereas a low frequency leads to higher churn chances. Note that for `TotalCredit`

the interpretation is more tricky, since a negative value means that you have a higher credit line.

```
<- shap.prep(xgb_model = xgb, X_train = as.matrix(BasetableTRAIN))
shap_prepare shap.plot.summary(shap_prepare)
```

A similar remark can be made about the dependence plots.

`plot(shap.plot.dependence(data_long = shap_prepare, x = "Recency"))`

`## `geom_smooth()` using formula 'y ~ x'`

`plot(shap.plot.dependence(data_long = shap_prepare, x = "TotalDiscount"))`

`## `geom_smooth()` using formula 'y ~ x'`

We observe a similar pattern as the ones observed in the PDPs and ICE curves (the downwards slope of high `TotalDiscount`

values is a clear case of overfitting to those data points).

The added advantage is the possibility to visualize interactions:

```
plot(shap.plot.dependence(data_long = shap_prepare, x = "TotalDiscount",
y = "Recency", color_feature = "Recency"))
```

`## `geom_smooth()` using formula 'y ~ x'`

This plot would argue the effect of `Recency`

as being larger than the one of total discount when not considering the outliers. For `Recency`

, you can clearly see here that around the 300 days the influence on churn is negative, this can be seen as a confirmation of our theory above.

SHAP also allows for understandable explanations per observation. The force plot is perhaps the most known concept, where you explain the individual prediction with the predictor values acting as forces on the probability (pushing up or down).

```
# Prepare data into correct format
<- shap.prep.stack.data(shap_contrib = shap_values$shap_score) plot_data
```

`## All the features will be used.`

`shap.plot.force_plot(plot_data)`

`## Data has N = 707 | zoom in length is 70 at location 424.2.`

The observed plotting style is clearly a bit different, but we still observe how the various individual features contribute to the predicted probabilities. However, all the analyses pointed to the payment variables having little to no impact. How do we adapt this in our code? By the `top_n`

argument. This sets how many variables to visualize in the plot, in our case this is 5 (7-2).

```
<- shap.prep.stack.data(shap_contrib = shap_values$shap_score,
plot_data top_n = 5)
```

`## The SHAP values of the Rest 2 features were summed into variable 'rest_variables'.`

`shap.plot.force_plot(plot_data)`

`## Data has N = 707 | zoom in length is 70 at location 424.2.`

How do we get individual explanations then? By subsetting! We can clearly observe how `Recency`

and `TotalDiscount`

had a positive effect on the probability of churn, while the other variables had a negative effect. Observation 513 scores low on both `Recency`

and `TotalDiscount`

, which is in line with the observed positive relationships above.

`513, ] plot_data[`

```
## ID Recency TotalDiscount Frequency
## 1: 541 1.147243 -0.2721461 -1.071955
## TotalPrice TotalCredit rest_variables group
## 1: 0.9093845 -0.1211484 0 4
## sorted_id
## 1: 513
```

`shap.plot.force_plot(plot_data[513, ])`

`## Data has N = 1 | zoom in length is 50 at location 0.6.`

Another interesting feature of the package is the `n_groups`

argument, this automatically clusters predictions based on the SHAP explanations. This allows you to easily interpret your predictions per group. When looking at the overall force plot, I would say there are 5 types of predictions (just look at the switches in pattern), so let us try this out.

```
<- shap.prep.stack.data(shap_contrib = shap_values$shap_score,
plot_data top_n = 5, n_groups = 5)
```

`## The SHAP values of the Rest 2 features were summed into variable 'rest_variables'.`

`shap.plot.force_plot_bygroup(plot_data)`

We have 5 clusters. Cluster 1 are customers we expect to stay due to good ‘signs’ on all indicators. Cluster 2 are the customer we are uncertain off, while customers cluster 3 is also expected to remain royal, but mainly due to their score on frequency. Cluster 4 is the ‘problematic’ cluster with probably churners, this due to their `TotalDiscount`

and `Frequency`

values. Cluster 5 is quite similar to cluster 1, but we are less certain of their loyalty (look at y-axis).