# Chapter 3 R Lab 2 - 04/04/2023

In this lab we will learn how to implement K-nearest neighbours (KNN) and linear discriminant analysis (LDA) for a classification problem. Moreover, the methods will be compared by using performance indexes.

The following packages are required: `class`, `MASS` and `tidyverse`.

```
library(MASS) #for LDA
library(tidyverse)
library(class) #for KNN classification
```

## 3.1 Loan default data

The data we use for this lab are taken from the book titled “Machine Learning & Data Science Blueprints for Finance” (suggested for the AIMLFF course). See in particular the **Case Study 2: loan default probability** in Chapter 6. The goal is to build machine learning models to predict the probability that a loan will default and won’t be repaid by the borrower.

The response variable is named `charged_off` and is a binary integer variable: `0` means that the borrower was able to fully pay the loan, while `1` means that the loan was charged off (default). In this case study there are several regressors, both quantitative and qualitative.

We import the data as usual:

```
d <- read.csv("./files/loandefault.csv")
glimpse(d)
```

```
## Rows: 86,138
## Columns: 34
## $ loan_amnt <dbl> 15000, 10400, 21425, 7650, 9600, 2500, 16000, 23325, 5250…
## $ funded_amnt <dbl> 15000, 10400, 21425, 7650, 9600, 2500, 16000, 23325, 5250…
## $ term <chr> " 60 months", " 36 months", " 60 months", " 36 months", "…
## $ int_rate <dbl> 12.39, 6.99, 15.59, 13.66, 13.66, 11.99, 11.44, 14.31, 11…
## $ installment <dbl> 336.64, 321.08, 516.36, 260.20, 326.53, 83.03, 351.40, 80…
## $ grade <chr> "C", "A", "D", "C", "C", "B", "B", "C", "B", "B", "D", "C…
## $ sub_grade <chr> "C1", "A3", "D1", "C3", "C3", "B5", "B4", "C4", "B4", "B5…
## $ emp_title <chr> "MANAGEMENT", "Truck Driver Delivery Personel", "Programm…
## $ emp_length <chr> "10+ years", "8 years", "6 years", "< 1 year", "10+ years…
## $ home_ownership <chr> "RENT", "MORTGAGE", "RENT", "RENT", "RENT", "MORTGAGE", "…
## $ annual_inc <dbl> 78000, 58000, 63800, 50000, 69000, 89000, 109777, 72000, …
## $ verification_status <chr> "Source Verified", "Not Verified", "Source Verified", "So…
## $ purpose <chr> "debt_consolidation", "credit_card", "credit_card", "debt…
## $ title <chr> "Debt consolidation", "Credit card refinancing", "Credit …
## $ zip_code <chr> "235xx", "937xx", "658xx", "850xx", "077xx", "554xx", "20…
## $ addr_state <chr> "VA", "CA", "MO", "AZ", "NJ", "MN", "VA", "WA", "MD", "MI…
## $ dti <dbl> 12.03, 14.92, 18.49, 34.81, 25.81, 13.77, 11.63, 27.03, 1…
## $ earliest_cr_line <chr> "Aug-1994", "Sep-1989", "Aug-2003", "Aug-2002", "Nov-1992…
## $ fico_range_low <dbl> 750, 710, 685, 685, 680, 685, 700, 665, 745, 675, 680, 67…
## $ fico_range_high <dbl> 754, 714, 689, 689, 684, 689, 704, 669, 749, 679, 684, 67…
## $ open_acc <dbl> 6, 17, 10, 11, 12, 9, 7, 14, 8, 5, 11, 7, 11, 10, 7, 9, 3…
## $ revol_util <dbl> 29.0, 31.6, 76.2, 91.9, 59.4, 94.3, 60.4, 82.2, 20.2, 98.…
## $ initial_list_status <chr> "w", "w", "w", "f", "f", "f", "w", "f", "f", "f", "f", "f…
## $ last_pymnt_amnt <dbl> 12017.81, 321.08, 17813.19, 17.70, 9338.58, 2294.26, 4935…
## $ application_type <chr> "Individual", "Individual", "Individual", "Individual", "…
## $ acc_open_past_24mths <dbl> 5, 7, 4, 6, 8, 6, 3, 6, 4, 0, 7, 2, 5, 6, 2, 6, 3, 4, 6, …
## $ avg_cur_bal <dbl> 29828, 9536, 4232, 5857, 3214, 44136, 53392, 39356, 1267,…
## $ bc_open_to_buy <dbl> 9525, 7599, 324, 332, 6494, 1333, 2559, 3977, 12152, 324,…
## $ bc_util <dbl> 4.7, 41.5, 97.8, 93.2, 69.2, 86.4, 72.2, 89.0, 26.8, 98.5…
## $ mo_sin_old_rev_tl_op <dbl> 244, 290, 136, 148, 265, 148, 133, 194, 67, 124, 122, 167…
## $ mo_sin_rcnt_rev_tl_op <dbl> 1, 1, 7, 8, 23, 24, 17, 15, 12, 40, 2, 21, 15, 8, 27, 3, …
## $ mort_acc <dbl> 0, 1, 0, 0, 0, 5, 2, 6, 0, 0, 0, 2, 5, 0, 0, 0, 0, 1, 0, …
## $ num_actv_rev_tl <dbl> 4, 9, 4, 4, 7, 4, 3, 5, 2, 5, 8, 3, 8, 6, 3, 2, 2, 5, 6, …
## $ charged_off <int> 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
```

## 3.2 Data preparation and cleaning

The following steps follow the approach used in the book. As we will implement LDA and KNN, we need to keep only quantitative regressors (`is.numeric`). They can be jointly selected by using the function `select_if` (we create a new data frame called `d2`):

```
d2 = d %>% select_if(is.numeric)
glimpse(d2)
```

```
## Rows: 86,138
## Columns: 20
## $ loan_amnt <dbl> 15000, 10400, 21425, 7650, 9600, 2500, 16000, 23325, 5250…
## $ funded_amnt <dbl> 15000, 10400, 21425, 7650, 9600, 2500, 16000, 23325, 5250…
## $ int_rate <dbl> 12.39, 6.99, 15.59, 13.66, 13.66, 11.99, 11.44, 14.31, 11…
## $ installment <dbl> 336.64, 321.08, 516.36, 260.20, 326.53, 83.03, 351.40, 80…
## $ annual_inc <dbl> 78000, 58000, 63800, 50000, 69000, 89000, 109777, 72000, …
## $ dti <dbl> 12.03, 14.92, 18.49, 34.81, 25.81, 13.77, 11.63, 27.03, 1…
## $ fico_range_low <dbl> 750, 710, 685, 685, 680, 685, 700, 665, 745, 675, 680, 67…
## $ fico_range_high <dbl> 754, 714, 689, 689, 684, 689, 704, 669, 749, 679, 684, 67…
## $ open_acc <dbl> 6, 17, 10, 11, 12, 9, 7, 14, 8, 5, 11, 7, 11, 10, 7, 9, 3…
## $ revol_util <dbl> 29.0, 31.6, 76.2, 91.9, 59.4, 94.3, 60.4, 82.2, 20.2, 98.…
## $ last_pymnt_amnt <dbl> 12017.81, 321.08, 17813.19, 17.70, 9338.58, 2294.26, 4935…
## $ acc_open_past_24mths <dbl> 5, 7, 4, 6, 8, 6, 3, 6, 4, 0, 7, 2, 5, 6, 2, 6, 3, 4, 6, …
## $ avg_cur_bal <dbl> 29828, 9536, 4232, 5857, 3214, 44136, 53392, 39356, 1267,…
## $ bc_open_to_buy <dbl> 9525, 7599, 324, 332, 6494, 1333, 2559, 3977, 12152, 324,…
## $ bc_util <dbl> 4.7, 41.5, 97.8, 93.2, 69.2, 86.4, 72.2, 89.0, 26.8, 98.5…
## $ mo_sin_old_rev_tl_op <dbl> 244, 290, 136, 148, 265, 148, 133, 194, 67, 124, 122, 167…
## $ mo_sin_rcnt_rev_tl_op <dbl> 1, 1, 7, 8, 23, 24, 17, 15, 12, 40, 2, 21, 15, 8, 27, 3, …
## $ mort_acc <dbl> 0, 1, 0, 0, 0, 5, 2, 6, 0, 0, 0, 2, 5, 0, 0, 0, 0, 1, 0, …
## $ num_actv_rev_tl <dbl> 4, 9, 4, 4, 7, 4, 3, 5, 2, 5, 8, 3, 8, 6, 3, 2, 2, 5, 6, …
## $ charged_off <int> 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
```

The next step is to transform the response variable (now coded as a 0/1 integer variable) into a `factor` (qualitative variable) with categories `0_FullyPaid` and `1_ChargedOff`. Note that we check the frequency distribution of the variable `charged_off` before and after the transformation to be sure that everything works fine:

`d2 %>% count(charged_off)`

```
## charged_off n
## 1 0 69982
## 2 1 16156
```

```
# from integer to factor
d2$charged_off = factor(d2$charged_off,
                        levels = c(0, 1),
                        labels = c("0_FullyPaid", "1_ChargedOff"))
d2 %>% count(charged_off)
```

```
## charged_off n
## 1 0_FullyPaid 69982
## 2 1_ChargedOff 16156
```

We can now plot the frequency distribution of `charged_off` using the `geom_bar` geometry (by default, absolute frequencies - counts - are reported on the y-axis):

```
d2 %>%
  ggplot() +
  geom_bar(aes(charged_off), fill = "pink")
```

If we run the `summary` function we realize that some of the variables have missing values (e.g. `bc_open_to_buy`):

`summary(d2)`

```
## loan_amnt funded_amnt int_rate installment annual_inc
## Min. : 1000 Min. : 1000 Min. : 6.00 Min. : 30.42 Min. : 4000
## 1st Qu.: 7800 1st Qu.: 7800 1st Qu.: 9.49 1st Qu.: 248.48 1st Qu.: 45000
## Median :12000 Median :12000 Median :12.99 Median : 370.48 Median : 62474
## Mean :14107 Mean :14107 Mean :13.00 Mean : 430.74 Mean : 73843
## 3rd Qu.:20000 3rd Qu.:20000 3rd Qu.:15.61 3rd Qu.: 568.00 3rd Qu.: 90000
## Max. :35000 Max. :35000 Max. :26.06 Max. :1408.13 Max. :7500000
##
## dti fico_range_low fico_range_high open_acc revol_util
## Min. : 0.00 Min. :660.0 Min. :664.0 Min. : 1.00 Min. : 0.00
## 1st Qu.:12.07 1st Qu.:670.0 1st Qu.:674.0 1st Qu.: 8.00 1st Qu.: 37.20
## Median :17.95 Median :685.0 Median :689.0 Median :11.00 Median : 54.90
## Mean :18.53 Mean :692.5 Mean :696.5 Mean :11.75 Mean : 54.58
## 3rd Qu.:24.50 3rd Qu.:705.0 3rd Qu.:709.0 3rd Qu.:14.00 3rd Qu.: 72.50
## Max. :39.99 Max. :845.0 Max. :850.0 Max. :84.00 Max. :180.30
## NA's :44
## last_pymnt_amnt acc_open_past_24mths avg_cur_bal bc_open_to_buy
## Min. : 0.0 Min. : 0.000 Min. : 0 Min. : 0
## 1st Qu.: 358.5 1st Qu.: 2.000 1st Qu.: 3010 1st Qu.: 1087
## Median : 1241.2 Median : 4.000 Median : 6994 Median : 3823
## Mean : 4757.4 Mean : 4.595 Mean : 13067 Mean : 8943
## 3rd Qu.: 7303.2 3rd Qu.: 6.000 3rd Qu.: 17905 3rd Qu.: 10588
## Max. :36234.4 Max. :53.000 Max. :447433 Max. :249625
## NA's :996
## bc_util mo_sin_old_rev_tl_op mo_sin_rcnt_rev_tl_op mort_acc
## Min. : 0.00 Min. : 3.0 Min. : 0.0 Min. : 0.000
## 1st Qu.: 44.10 1st Qu.:118.0 1st Qu.: 3.0 1st Qu.: 0.000
## Median : 67.70 Median :167.0 Median : 8.0 Median : 1.000
## Mean : 63.81 Mean :183.5 Mean : 12.8 Mean : 1.749
## 3rd Qu.: 87.50 3rd Qu.:232.0 3rd Qu.: 15.0 3rd Qu.: 3.000
## Max. :255.20 Max. :718.0 Max. :372.0 Max. :34.000
## NA's :1049
## num_actv_rev_tl charged_off
## Min. : 0.000 0_FullyPaid :69982
## 1st Qu.: 3.000 1_ChargedOff:16156
## Median : 5.000
## Mean : 5.762
## 3rd Qu.: 7.000
## Max. :38.000
##
```

One possibility is to impute the missing values. Here instead we will remove the observations (rows) with one or more missing values by using `na.omit`:

`d2 = na.omit(d2)`

Note the reduction in the number of observations in `d2` after removing the missing values.
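As a minimal sketch of what `na.omit` does (shown on a toy data frame, not the loan data):

```
# toy data frame with one missing value in x
toy <- data.frame(x = c(1, 2, NA, 4),
                  y = c(10, 20, 30, 40))
nrow(toy)          # 4 rows before
toy_clean <- na.omit(toy)
nrow(toy_clean)    # 3 rows after: the row containing the NA is dropped
```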

From the preview of the data and the summary statistics we realize that `loan_amnt` and `funded_amnt` have the same values. To be sure that these two variables take exactly the same values we plot them jointly using a scatterplot:

```
d2 %>%
  ggplot() +
  geom_point(aes(funded_amnt, loan_amnt))
```

As the points lie exactly on the straight line y = x, the two variables coincide. Thus, we need to remove one of the two to avoid problems with the models’ estimation:
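Besides the scatterplot, equality of two columns can also be checked numerically; a sketch on toy vectors (the names `a` and `b` are illustrative):

```
a <- c(15000, 10400, 21425, 7650)
b <- c(15000, 10400, 21425, 7650)
all(a == b)       # TRUE: every element coincides
identical(a, b)   # TRUE
```

On the real data the analogous check would be `all(d2$loan_amnt == d2$funded_amnt)`.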

```
# Remove funded_amnt
d2 = d2 %>% dplyr::select(-funded_amnt)
```

Note that the code `dplyr::select` here above specifies that the function `select` is taken from the `dplyr` package (which is part of the `tidyverse` collection). This is required because other packages (such as `MASS`) also define a function named `select`, so the unqualified call can resolve to the wrong one depending on the package loading order.

We then plot the distribution of `annual_inc` conditionally on `charged_off` by using boxplots:

```
d2 %>%
  ggplot() +
  geom_boxplot(aes(charged_off, annual_inc))
```

We see that, due to the extreme values of `annual_inc`, the boxes are completely squashed and the distribution of the variable is not clearly shown. We try to apply the log transformation to this quantitative variable:

```
d2 %>%
  ggplot() +
  geom_boxplot(aes(charged_off, log(annual_inc)))
```

Now the distribution appears more reasonable for both categories (the median `annual_inc` is very similar in the two groups). We thus decide to apply this transformation permanently in the data frame by creating a new variable named `log_annual_inc` and eliminating the old one:

```
d2 = d2 %>%
  mutate(log_annual_inc = log(annual_inc)) %>%
  dplyr::select(-annual_inc) # remove the original income
```

We now check the correlation between `fico_range_high` and `fico_range_low`, where the FICO score is designed to rank-order consumers by their sensitivity to a future economic downturn.

`cor(d2$fico_range_high, d2$fico_range_low)`

`## [1] 1`

The correlation is exactly equal to one, meaning that the two variables are perfectly linearly associated; see also the following plot:

```
d2 %>%
  ggplot() +
  geom_point(aes(fico_range_high, fico_range_low))
```
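Note that a correlation of 1 only requires a perfect linear relation, not equality: in the data preview `fico_range_high` appears to be `fico_range_low` plus a constant 4-point offset. A toy check of this distinction (illustrative values):

```
low  <- c(660, 670, 685, 700)   # illustrative FICO-like values
high <- low + 4                 # constant shift, as in the previewed rows
cor(low, high)                  # exactly 1, yet the values differ
all(low == high)                # FALSE
```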

This means that it doesn’t make sense to keep both variables for the modeling, because they provide the same information. We thus create a new variable `fico_new` which is the average of the two (remember to remove the original variables after creating `fico_new`):

```
d2 = d2 %>%
  mutate(fico_new = (fico_range_high + fico_range_low) / 2) %>%
  dplyr::select(-fico_range_high, -fico_range_low)
```

## 3.3 A new method for creating the training and testing set

To create the training (80%) and test (20%) dataset we use a new approach different from the one introduced in Section 2.2.

We first create a vector with the indexes we will use for the training dataset. These indexes are sampled randomly from the set of integers between 1 and `nrow(d2)` (85089 after removing the missing values). In this case we must set `replace` to `FALSE`, since each index (corresponding to an observation) can’t be sampled twice: each subject belongs either to the training set or to the test set, not to both. Recall also to set a seed, in order to be able to replicate your results:

```
set.seed(11)
training_index = sample(1:nrow(d2), nrow(d2) * 0.8, replace = FALSE)
head(training_index)
```

`## [1] 65570 66457 19004 73612 28886 37598`

These are the positions of the first 6 elements to be included in the training set.

We are now ready to select from the `d2` dataset 80% of the observations (rows) by using the `training_index` vector, and leave the non-sampled rows for the test set. Just use a minus (`-`) before the row selection to exclude them:

```
# recall: dataset[rows, columns]
s_train = d2[training_index, ]
s_test  = d2[-training_index, ]
```

Note that this procedure may be the easiest choice when we don’t have a third (validation) dataset.
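A quick sanity check that this indexing really partitions the rows, sketched on a toy data frame (the names `toy` and `idx` are illustrative):

```
set.seed(1)
toy <- data.frame(x = 1:10)
idx <- sample(1:nrow(toy), nrow(toy) * 0.8, replace = FALSE)
toy_train <- toy[idx, , drop = FALSE]
toy_test  <- toy[-idx, , drop = FALSE]
# the two parts cover all rows exactly once
nrow(toy_train) + nrow(toy_test) == nrow(toy)   # TRUE
any(idx %in% setdiff(1:nrow(toy), idx))         # FALSE: no overlap
```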

We are now ready to train our models.

## 3.4 Definition of a function for computing performance indexes

For assessing the performance of a classifier we compare predicted categories with observed categories. This can be done by using the **confusion matrix** which is a 2x2 matrix reporting the joint distribution (with absolute frequencies) of predicted (by row) and observed (by column) categories.

| | No (obs.) | Yes (obs.) |
|---|---|---|
| No (pred.) | TN | FN |
| Yes (pred.) | FP | TP |
| Total | N | P |

We consider in particular the following performance indexes:

- **sensitivity** (true positive rate): TP/P
- **specificity** (true negative rate): TN/N
- **accuracy** (rate of correctly classified observations): (TN+TP)/(N+P)

We define a function named `perf_indexes` which computes the above defined indexes. The function has only one argument, a 2x2 confusion matrix named `cm`. It returns a vector with the 3 indexes (sensitivity, specificity, overall accuracy).

```
perf_indexes = function(cm) {
  # True negative rate
  specificity = cm[1, 1] / (cm[1, 1] + cm[2, 1])
  # True positive rate
  sensitivity = cm[2, 2] / (cm[1, 2] + cm[2, 2])
  # Overall accuracy
  accuracy = sum(diag(cm)) / sum(cm)
  # Return all the indexes
  return(c(spec = specificity, sens = sensitivity, acc = accuracy))
}
```
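As a quick check that `perf_indexes` behaves as expected, we can feed it a small made-up confusion matrix (the counts below are purely illustrative; the function is restated so the chunk runs standalone):

```
perf_indexes = function(cm) {
  specificity = cm[1, 1] / (cm[1, 1] + cm[2, 1])
  sensitivity = cm[2, 2] / (cm[1, 2] + cm[2, 2])
  accuracy    = sum(diag(cm)) / sum(cm)
  c(spec = specificity, sens = sensitivity, acc = accuracy)
}

# toy confusion matrix (rows = predicted, columns = observed):
# TN = 80, FP = 20, FN = 10, TP = 90
cm_toy <- matrix(c(80, 20, 10, 90), nrow = 2,
                 dimnames = list(pred = c("No", "Yes"),
                                 obs  = c("No", "Yes")))
perf_indexes(cm_toy)  # spec = 80/100 = 0.8, sens = 90/100 = 0.9, acc = 170/200 = 0.85
```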

## 3.5 KNN for classification problems

For this particular case we assume that the KNN has already been tuned and that the best value for the hyperparameter is \(K=200\). Remember that tuning is usually necessary, as described in the previous lab with KNN regression.

We now implement KNN classification with \(K=200\) by using the `knn` function from the `class` package (see `?knn`). Note that there are two functions named `knn`, one in the `class` package and one in the `FNN` package. In order to specify that we want the function from the `class` package we can use `class::knn`. We have to specify the regressor data frames for the training and test observations and the training response vector (`cl`).

```
outknn = knn(train = dplyr::select(s_train, -charged_off),
             test  = dplyr::select(s_test, -charged_off),
             cl    = s_train$charged_off,
             k     = 200)
length(outknn)
```

`## [1] 17018`

`head(outknn)`

```
## [1] 0_FullyPaid 0_FullyPaid 0_FullyPaid 0_FullyPaid 0_FullyPaid 0_FullyPaid
## Levels: 0_FullyPaid 1_ChargedOff
```

Predictions (categories) are contained in the `outknn` object. To compare predictions \(\hat y_0\) with observed values \(y_0\), it is useful to create the **confusion matrix**, a double-entry table with predicted and observed categories and the corresponding frequencies:

`table(pred=outknn, obs=s_test$charged_off)`

```
## obs
## pred 0_FullyPaid 1_ChargedOff
## 0_FullyPaid 13199 2291
## 1_ChargedOff 712 816
```

On the main diagonal we can read the number of correctly classified observations. Outside the main diagonal we have the number of misclassified observations.

The percentage **test error rate** is computed as follows:

```
#test error rate
mean(outknn != s_test$charged_off)*100
```

`## [1] 17.64602`
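Note that the test error rate is simply the complement of the overall accuracy; a sketch on toy prediction vectors (illustrative values):

```
obs  <- c("No", "No", "Yes", "No", "Yes")
pred <- c("No", "Yes", "Yes", "No", "No")
err <- mean(pred != obs)   # 2 mismatches out of 5 = 0.4
acc <- mean(pred == obs)   # 3 matches out of 5 = 0.6
err + acc                  # always equal to 1
```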

We also use the function `perf_indexes` defined above to compute the performance indexes for KNN:

`perf_indexes(table(pred=outknn, obs=s_test$charged_off))`

```
## spec sens acc
## 0.9488175 0.2626328 0.8235398
```

In this case we see that 82.35% of the observations are correctly classified. However, the sensitivity is quite small and we are not able to identify risky individuals (we predict that 2291 subjects are `0_FullyPaid` but actually they charged off).

## 3.6 Linear discriminant analysis (LDA)

To run LDA we use the `lda` function from the `MASS` package. We have to specify the formula and the data:

```
outlda = lda(charged_off ~ ., data = s_train)
outlda
```

```
## Call:
## lda(charged_off ~ ., data = s_train)
##
## Prior probabilities of groups:
## 0_FullyPaid 1_ChargedOff
## 0.8116378 0.1883622
##
## Group means:
## loan_amnt int_rate installment dti open_acc revol_util last_pymnt_amnt
## 0_FullyPaid 13877.18 12.37247 427.2998 18.01403 11.66427 53.85583 5746.523
## 1_ChargedOff 15220.02 15.58736 449.6341 20.70623 12.24271 57.94041 462.385
## acc_open_past_24mths avg_cur_bal bc_open_to_buy bc_util
## 0_FullyPaid 4.445329 13577.79 9456.097 62.84290
## 1_ChargedOff 5.286461 10608.55 6620.385 68.23541
## mo_sin_old_rev_tl_op mo_sin_rcnt_rev_tl_op mort_acc num_actv_rev_tl
## 0_FullyPaid 185.6501 13.10170 1.803689 5.675107
## 1_ChargedOff 173.8203 10.95726 1.501794 6.294650
## log_annual_inc fico_new
## 0_FullyPaid 11.07526 696.3695
## 1_ChargedOff 10.98164 687.0090
##
## Coefficients of linear discriminants:
## LD1
## loan_amnt 2.637815e-04
## int_rate 1.368488e-01
## installment -6.307476e-03
## dti 7.136174e-03
## open_acc -5.994273e-03
## revol_util 1.130244e-04
## last_pymnt_amnt -1.806656e-04
## acc_open_past_24mths 4.915655e-02
## avg_cur_bal -1.423212e-06
## bc_open_to_buy -7.909778e-08
## bc_util -6.673435e-04
## mo_sin_old_rev_tl_op -3.727478e-04
## mo_sin_rcnt_rev_tl_op 5.698480e-05
## mort_acc -1.268820e-02
## num_actv_rev_tl 4.860252e-04
## log_annual_inc -5.032210e-02
## fico_new -1.276014e-03
```

The output reports the prior probabilities (\(\hat \pi_k\)) for the two categories and the conditional means of all the covariates (\(\hat \mu_k\)). Recall that the prior probability is the probability that a randomly chosen observation comes from the \(k\)-th class.
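When no `prior` argument is supplied, `lda` estimates \(\hat \pi_k\) as the class proportions in the training data; a toy illustration of that computation:

```
# toy response vector: 3 fully paid, 1 charged off
y <- factor(c("0_FullyPaid", "0_FullyPaid", "0_FullyPaid", "1_ChargedOff"))
prop.table(table(y))  # 0.75 and 0.25: these proportions are the default priors
```

On the real training set, `prop.table(table(s_train$charged_off))` should reproduce the priors printed in the `lda` output above.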

We go straight to the calculation of the predictions:

```
predlda = predict(outlda, newdata = s_test)
names(predlda)
```

`## [1] "class" "posterior" "x"`

Note that the object `predlda` is a list containing 3 objects, including the matrix of posterior probabilities (`posterior`, one column per class) and the vector of predicted categories (`class`):

`head(predlda$posterior)`

```
## 0_FullyPaid 1_ChargedOff
## 1 0.9682948 0.03170517
## 3 0.9662011 0.03379886
## 5 0.9837382 0.01626181
## 12 0.9888337 0.01116627
## 18 0.9481007 0.05189933
## 19 0.8994442 0.10055583
```

`head(predlda$class)`

```
## [1] 0_FullyPaid 0_FullyPaid 0_FullyPaid 0_FullyPaid 0_FullyPaid 0_FullyPaid
## Levels: 0_FullyPaid 1_ChargedOff
```

Note that the predictions contained in `predlda$class` are computed using 0.5 (the standard choice) as the threshold on the posterior probability of `1_ChargedOff`.

To compute the confusion matrix we use the vector `predlda$class`:

`table(predlda$class, s_test$charged_off)`

```
##
## 0_FullyPaid 1_ChargedOff
## 0_FullyPaid 13579 1789
## 1_ChargedOff 332 1318
```

`perf_indexes(table(predlda$class, s_test$charged_off))`

```
## spec sens acc
## 0.9761340 0.4242034 0.8753673
```

Note that the value of the sensitivity (true positive rate) is quite low, while specificity and accuracy are quite high. It seems that LDA is not able to correctly identify the actual defaulters.

We could choose a different threshold for our decision. In this case, the functions used do not allow us to change the threshold directly. Recall that LDA assigns each observation to the most likely class (the class with the highest posterior probability), so with two classes there is an implicit probability threshold of \(0.5\). Check it yourself here below:

```
predlda_myclass = ifelse(predlda$posterior[, "1_ChargedOff"] > 0.5,
                         "1_ChargedOff", "0_FullyPaid")
# same results as in the previous LDA
table(predlda_myclass, s_test$charged_off)
```

```
##
## predlda_myclass 0_FullyPaid 1_ChargedOff
## 0_FullyPaid 13579 1789
## 1_ChargedOff 332 1318
```

```
# same results as in the previous LDA
perf_indexes(table(predlda_myclass, s_test$charged_off))
```

```
## spec sens acc
## 0.9761340 0.4242034 0.8753673
```

We can now lower the probability threshold, with the aim of better detecting the “real” default customers. Let’s consider for example a probability threshold equal to \(0.2\):

```
predlda_myclass2 = ifelse(predlda$posterior[, "1_ChargedOff"] > 0.2,
                          "1_ChargedOff", "0_FullyPaid")
table(predlda_myclass2, s_test$charged_off)
```

```
##
## predlda_myclass2 0_FullyPaid 1_ChargedOff
## 0_FullyPaid 11863 903
## 1_ChargedOff 2048 2204
```

`perf_indexes(table(predlda_myclass2, s_test$charged_off)) `

```
## spec sens acc
## 0.8527784 0.7093659 0.8265954
```

We are now able to correctly detect more real charged-off customers (see the second column), increasing the sensitivity of our algorithm, while specificity and accuracy are slightly reduced.
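To explore this trade-off more systematically, one could sweep the threshold over a grid. The sketch below uses simulated posterior probabilities in place of `predlda$posterior` (all names and numbers are illustrative); lowering the threshold can only increase sensitivity and decrease specificity, because the set of predicted positives grows:

```
set.seed(123)
n     <- 1000
truth <- rbinom(n, 1, 0.2)                           # simulated 0/1 outcomes
post  <- plogis(qlogis(0.2) + 2 * truth + rnorm(n))  # simulated P(default)
for (thr in c(0.5, 0.3, 0.2)) {
  pred <- as.integer(post > thr)
  sens <- sum(pred == 1 & truth == 1) / sum(truth == 1)
  spec <- sum(pred == 0 & truth == 0) / sum(truth == 0)
  cat(sprintf("thr = %.1f  sens = %.3f  spec = %.3f\n", thr, sens, spec))
}
```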