4 Árboles de Decisión - Parte II
4.1 Árboles de decisión y modelos lineales
- Si la relación entre la variable dependiente y la(s) variable(s) independiente(s) se aproxima a un modelo lineal, la regresión lineal dará mejores resultados que un árbol de decisión (ver el boceto tras esta lista).
- Si la relación es compleja y altamente no lineal, entonces el árbol de decisión tendrá mejores resultados que un método clásico de regresión.
- Si se quiere construir un modelo que sea fácil de explicar, un árbol de decisión será mejor opción que un modelo lineal.
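Como ilustración, el siguiente boceto mínimo (con datos simulados hipotéticos) compara ambos enfoques sobre una relación perfectamente lineal; en este escenario el RMSE de lm() suele ser menor que el del árbol.
library(rpart)
set.seed(1)
n <- 500
x <- runif(n, 0, 10)
y <- 2 * x + rnorm(n)                      # relación lineal con ruido
datos <- data.frame(x, y)
idx <- sample(n, 0.8 * n)
train <- datos[idx, ]
test  <- datos[-idx, ]
mod_lm   <- lm(y ~ x, data = train)
mod_tree <- rpart(y ~ x, data = train)     # method = "anova" por defecto (regresión)
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(test$y, predict(mod_lm, test))        # aquí se espera el menor error
rmse(test$y, predict(mod_tree, test))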
4.2 Código generalizado
Existen varios algoritmos implementados: ID3, CART, C4.5, C5.0, CHAID.
Es importante saber que existen variadas implementaciones (librerías) de árboles de decisión en R, como por ejemplo: rpart, tree, party, ctree, etc. Algunas se diferencian en las heurísticas utilizadas para el proceso de poda del árbol y otras manejan un componente probabilístico internamente. A continuación, un ejemplo del esquema general de implementación.
> library(rpart)
> x <- cbind(x_train, y_train)   # x_train, y_train y x_test son marcadores de posición
# grow tree
> fit <- rpart(y_train ~ ., data = x, method = "class")
> summary(fit)
# Predict Output
> predicted <- predict(fit, x_test, type = "class")
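Dado que x_train, y_train y x_test son marcadores de posición, el siguiente boceto autocontenido (usando el conjunto iris incluido en R, a modo de ejemplo hipotético) muestra el mismo esquema de forma ejecutable:
library(rpart)
set.seed(42)
idx <- sample(nrow(iris), 0.8 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]
# grow tree: Species es la variable objetivo
fit <- rpart(Species ~ ., data = train, method = "class")
summary(fit)
# predict output
predicted <- predict(fit, test, type = "class")
table(test$Species, predicted)   # matriz de confusión rápida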
4.3 Ejemplo de Árbol de Decisión + prepruning
Entrenamiento y visualización de árboles de decisión.
- Paso 1: Importar los datos
- Paso 2: Limpiar los datos
- Paso 3: Crear los conjuntos de entrenamiento y test
- Paso 4: Construir el modelo
- Paso 5: Hacer la predicción
- Paso 6: Medir el rendimiento del modelo
- Paso 7: Ajustar los hiperparámetros
4.3.1 Importar los datos
El propósito del siguiente conjunto de datos titanic es predecir qué personas son más propensas a sobrevivir a la colisión con el iceberg. El conjunto de datos contiene 13 variables y 1309 observaciones. Finalmente, este se encuentra ordenado por la variable X.
set.seed(678)
path <- 'https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/titanic_csv.csv'
titanic <-read.csv(path)
str(titanic)
## 'data.frame': 1309 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ pclass : int 1 1 1 1 1 1 1 1 1 1 ...
## $ survived : int 1 1 0 0 0 1 1 0 1 0 ...
## $ name : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 22 24 25 26 27 31 46 47 51 55 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
## $ age : num 29 0.917 2 30 25 ...
## $ sibsp : int 0 1 1 1 1 0 1 0 2 0 ...
## $ parch : int 0 2 2 2 2 0 0 0 0 0 ...
## $ ticket : Factor w/ 929 levels "110152","110413",..: 188 50 50 50 50 125 93 16 77 826 ...
## $ fare : num 211 152 152 152 152 ...
## $ cabin : Factor w/ 187 levels "","A10","A11",..: 45 81 81 81 81 151 147 17 63 1 ...
## $ embarked : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 4 4 4 2 ...
## $ home.dest: Factor w/ 370 levels "","?Havana, Cuba",..: 310 232 232 232 232 238 163 25 23 230 ...
head(titanic)
## X pclass survived name sex
## 1 1 1 1 Allen, Miss. Elisabeth Walton female
## 2 2 1 1 Allison, Master. Hudson Trevor male
## 3 3 1 0 Allison, Miss. Helen Loraine female
## 4 4 1 0 Allison, Mr. Hudson Joshua Creighton male
## 5 5 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female
## 6 6 1 1 Anderson, Mr. Harry male
## age sibsp parch ticket fare cabin embarked
## 1 29.0000 0 0 24160 211.3375 B5 S
## 2 0.9167 1 2 113781 151.5500 C22 C26 S
## 3 2.0000 1 2 113781 151.5500 C22 C26 S
## 4 30.0000 1 2 113781 151.5500 C22 C26 S
## 5 25.0000 1 2 113781 151.5500 C22 C26 S
## 6 48.0000 0 0 19952 26.5500 E12 S
## home.dest
## 1 St Louis, MO
## 2 Montreal, PQ / Chesterville, ON
## 3 Montreal, PQ / Chesterville, ON
## 4 Montreal, PQ / Chesterville, ON
## 5 Montreal, PQ / Chesterville, ON
## 6 New York, NY
tail(titanic)
## X pclass survived name sex age sibsp
## 1304 1304 3 0 Yousseff, Mr. Gerious male NA 0
## 1305 1305 3 0 Zabour, Miss. Hileni female 14.5 1
## 1306 1306 3 0 Zabour, Miss. Thamine female NA 1
## 1307 1307 3 0 Zakarian, Mr. Mapriededer male 26.5 0
## 1308 1308 3 0 Zakarian, Mr. Ortin male 27.0 0
## 1309 1309 3 0 Zimmerman, Mr. Leo male 29.0 0
## parch ticket fare cabin embarked home.dest
## 1304 0 2627 14.4583 C
## 1305 0 2665 14.4542 C
## 1306 0 2665 14.4542 C
## 1307 0 2656 7.2250 C
## 1308 0 2670 7.2250 C
## 1309 0 315082 7.8750 S
Los datos no están ordenados aleatoriamente, sino secuencialmente de acuerdo a la variable categórica de interés. Esto es un problema importante y se debe corregir antes de dividir los datos en entrenamiento y test. Para desordenar la lista de observaciones se puede usar la función sample().
shuffle_index <- sample(1:nrow(titanic))
head(shuffle_index)
## [1] 288 874 1078 633 887 992
Ahora se usan estos índices para generar un ordenamiento aleatorio del conjunto de datos.
titanic <- titanic[shuffle_index, ]
head(titanic)
## X pclass survived
## 288 288 1 0
## 874 874 3 0
## 1078 1078 3 1
## 633 633 3 0
## 887 887 3 1
## 992 992 3 1
## name sex age
## 288 Sutton, Mr. Frederick male 61
## 874 Humblen, Mr. Adolf Mathias Nicolai Olsen male 42
## 1078 O'Driscoll, Miss. Bridget female NA
## 633 Andersson, Mrs. Anders Johan (Alfrida Konstantia Brogren) female 39
## 887 Jermyn, Miss. Annie female NA
## 992 Mamee, Mr. Hanna male NA
## sibsp parch ticket fare cabin embarked home.dest
## 288 0 0 36963 32.3208 D50 S Haddenfield, NJ
## 874 0 0 348121 7.6500 F G63 S
## 1078 0 0 14311 7.7500 Q
## 633 1 5 347082 31.2750 S Sweden Winnipeg, MN
## 887 0 0 14313 7.7500 Q
## 992 0 0 2677 7.2292 C
4.3.2 Limpiar el conjunto de datos
- Existen valores NA, por lo tanto deben ser eliminados.
- Prescindir de variables innecesarias.
- Crear/convertir variables a tipo factor de ser necesario (e.g., pclass y survived).
library(dplyr)
# Drop variables
clean_titanic <- titanic %>%
select(-c(home.dest, cabin, name, ticket)) %>%
#Convert to factor level
mutate(pclass = factor(pclass, levels = c(1, 2, 3), labels = c('Upper', 'Middle', 'Lower')),
survived = factor(survived, levels = c(0, 1), labels = c('No', 'Yes'))) %>%
na.omit()
glimpse(clean_titanic)
## Observations: 1,045
## Variables: 9
## $ X <int> 288, 874, 633, 182, 375, 21, 560, 307, 1104, 742, 916...
## $ pclass <fctr> Upper, Lower, Lower, Upper, Middle, Upper, Middle, U...
## $ survived <fctr> No, No, No, Yes, No, Yes, Yes, No, No, No, No, No, Y...
## $ sex <fctr> male, male, female, female, male, male, female, male...
## $ age <dbl> 61.0, 42.0, 39.0, 49.0, 29.0, 37.0, 20.0, 54.0, 2.0, ...
## $ sibsp <int> 0, 0, 1, 0, 0, 1, 0, 0, 4, 0, 0, 1, 1, 0, 0, 0, 1, 1,...
## $ parch <int> 0, 0, 5, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 2, 0, 4, 0,...
## $ fare <dbl> 32.3208, 7.6500, 31.2750, 25.9292, 10.5000, 52.5542, ...
## $ embarked <fctr> S, S, S, S, S, S, S, S, S, C, S, S, S, Q, C, S, S, C...
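Como verificación rápida de la limpieza (boceto), se puede comprobar la cantidad de NA antes y después:
colSums(is.na(titanic))      # NAs por columna en el conjunto original
sum(is.na(clean_titanic))    # debería ser 0 tras na.omit()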
4.3.3 Dividir en conjuntos de entrenamiento y test
Antes de entrenar el modelo vamos a dividir el conjunto de datos en entrenamiento y test. La práctica común es 80-20. Usaremos funciones de la librería dplyr con este propósito.
library(dplyr)
data_train <- clean_titanic %>% dplyr::sample_frac(.8)
data_test <- dplyr::anti_join(clean_titanic, data_train, by = 'X') # se debe tener un id
data_train <- dplyr::select(data_train, -X)
data_test <- dplyr::select(data_test, -X)
head(data_train)
## pclass survived sex age sibsp parch fare embarked
## 436 Middle Yes female 7 0 2 26.2500 S
## 837 Lower No male 25 0 0 7.7417 Q
## 504 Upper Yes female 38 0 0 227.5250 C
## 754 Lower No female 26 1 0 16.1000 S
## 290 Lower No male 31 0 0 7.7750 S
## 361 Upper Yes female 64 0 2 83.1583 C
dim(data_train)
## [1] 836 8
dim(data_test)
## [1] 209 8
- Conjunto de entrenamiento = 836 filas (instancias)
- Conjunto de test = 209 filas (instancias)
Ahora verificamos el proceso de aleatorización a través de la función prop.table() combinada con table().
prop.table(table(data_train$survived))
##
## No Yes
## 0.5873206 0.4126794
prop.table(table(data_test$survived))
##
## No Yes
## 0.6076555 0.3923445
Instalar rpart.plot
rpart.plot es una librería que sirve para generar visualizaciones más elegantes de árboles de decisión. Se debe instalar a través de la consola.
install.packages("rpart.plot")
4.3.4 Construir el modelo
La función para generar un modelo de árbol de decisión con la librería rpart lleva el mismo nombre.
library(rpart)
library(rpart.plot)
fit <- rpart(survived~., data = data_train, method = 'class')
rpart.plot(fit, extra = 106)
Cada nodo muestra:
- La clase predicha (died o survived),
- La probabilidad predicha de supervivencia,
- El porcentaje de observaciones en el nodo.
Ahora probemos con las opciones 1 y 9.
rpart.plot(fit, extra = 1)
rpart.plot(fit, extra = 9)
- Los árboles de decisión requieren muy poca preparación de datos. En particular, no requieren escalamiento ni centrado de atributos.
- Por defecto, rpart() usa el índice de Gini como medida para la división de los nodos (ver el boceto a continuación).
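Si se quisiera usar entropía (ganancia de información) en lugar de Gini, rpart() acepta el argumento parms; un boceto mínimo:
# Boceto: árbol con criterio de división por información en vez de Gini
fit_info <- rpart(survived ~ ., data = data_train, method = 'class',
                  parms = list(split = 'information'))
rpart.plot(fit_info, extra = 106)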
4.3.5 Hacer la predicción
El modelo ha sido entrenado y ahora puede ser usado para predecir nuevas instancias del conjunto de datos de test. Para esto se usa la función predict().
#Arguments:
#- fitted_model: This is the object stored after model estimation.
#- df: Data frame used to make the prediction
#- type: Type of prediction
# - 'class': for classification
# - 'prob': to compute the probability of each class
# - 'vector': Predict the mean response at the node level
predict_unseen <- predict(fit, data_test, type = 'class')
Contabilizar la coincidencia entre las observaciones de test y los valores predichos (matriz de confusión).
table_mat <- table(data_test$survived, predict_unseen)
table_mat
## predict_unseen
## No Yes
## No 113 14
## Yes 26 56
4.3.6 Medir el rendimiento del modelo
- A partir de la matriz de confusión es posible calcular una medida de rendimiento del modelo.
- La matriz de confusión es utilizada en casos de clasificación.
Accuracy test a partir de la matriz de confusión o tabla de contingencia:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Es la proporción de instancias predichas correctamente (TP y TN) sobre el total de elementos evaluados.
accuracy_Test <- sum(diag(table_mat)) / sum(table_mat)
print(paste('Accuracy for test', accuracy_Test))
## [1] "Accuracy for test 0.808612440191388"
4.3.7 Ajustar los hiperparámetros
Para incluir restricciones definidas por el usuario respecto a cómo elaborar el árbol de decisión se puede utilizar la función rpart.control() de la librería rpart.
#rpart.control(minsplit = 20, minbucket = round(minsplit/3), maxdepth = 30)
#Arguments:
#-minsplit: Set the minimum number of observations in the node before the algorithm performs a split
#-minbucket: Set the minimum number of observations in the final node i.e. the leaf
#-maxdepth: Set the maximum depth of any node of the final tree. The root node is treated as depth 0
Vamos a modificar el ejemplo anterior. Para ello vamos a construir una función que encapsule el cálculo de la precisión del modelo.
accuracy_tune <- function(fit) {
predict_unseen <- predict(fit, data_test, type = 'class')
table_mat <- table(data_test$survived, predict_unseen)
accuracy_Test <- sum(diag(table_mat)) / sum(table_mat)
accuracy_Test
}
Ahora, vamos a ajustar los parámetros para intentar mejorar el rendimiento del modelo sobre los valores por defecto. La precisión que obtuvimos previamente fue de 0.81.
control <- rpart.control(minsplit = 4,
minbucket = round(5 / 3),
maxdepth = 3,
cp = 0)
tune_fit <- rpart(survived~., data = data_train, method = 'class', control = control)
accuracy_tune(tune_fit)
## [1] 0.7990431
En este caso el ajuste manual no mejoró la estimación (pasó de 0.81 a 0.80). Como habíamos revisado anteriormente, lo ideal sería aplicar un proceso de validación cruzada para ajustar correctamente los hiperparámetros y encontrar así la mejor combinación (ver el boceto a continuación).
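Un boceto de cómo podría realizarse ese ajuste con validación cruzada usando la librería caret (supuesto: caret instalado); en este caso caret ajusta únicamente el parámetro de complejidad cp:
library(caret)
set.seed(678)
cv_fit <- train(survived ~ ., data = data_train,
                method = "rpart",
                trControl = trainControl(method = "cv", number = 10),
                tuneLength = 10)
cv_fit$bestTune                        # mejor cp según validación cruzada
pred_cv <- predict(cv_fit, newdata = data_test)
mean(pred_cv == data_test$survived)    # accuracy en test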
4.4 Ejemplo de clasificación + Poda (postpruning)
Usaremos la librería rpart para nuestro ejemplo. Esta librería incorpora, adicionalmente a lo visto en nuestra revisión teórica, un parámetro de regularización (cp) que permite identificar el mejor punto para la poda.
Criterio de costo de complejidad - Cost complexity criterion
Para encontrar el balance entre la profundidad y complejidad del árbol con respecto a la capacidad predictiva del modelo en datos de test, normalmente se hace crecer el árbol de decisión hasta su mayor extensión y luego se ejecuta el proceso de poda para identificar el subárbol óptimo.
Se encuentra el subárbol óptimo usando el parámetro de costo de complejidad (\(\alpha\)), que penaliza la función objetivo de abajo según el número de nodos hoja del árbol (\(|T|\)):
\[\text{minimizar} \left( SSE + \alpha|T| \right)\]
- Para un valor dado de \(\alpha\), se encuentra el árbol podado más pequeño (en número de nodos hoja) que tiene el menor error penalizado.
- Se evalúan múltiples modelos a través de un espectro de valores de \(\alpha\) y se usa validación cruzada para identificar el \(\alpha\) óptimo y, por lo tanto, el subárbol óptimo (ver el boceto a continuación).
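En rpart esto se traduce en examinar la tabla de complejidad del modelo (cptable, donde xerror es el error de validación cruzada) y podar con el cp óptimo; un boceto genérico, asumiendo un modelo fit crecido con cp = 0:
cpt <- fit$cptable
best_cp <- cpt[which.min(cpt[, "xerror"]), "CP"]   # cp con menor xerror
fit_podado <- prune(fit, cp = best_cp)             # subárbol óptimo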
Ejemplo basado en https://dzone.com/articles/decision-trees-and-pruning-in-r
library(rpart)
hr_data <- read.csv("data/HR.csv")
str(hr_data)
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : int 0 0 0 0 0 0 0 0 0 0 ...
## $ left : int 1 1 1 1 1 1 1 1 1 1 ...
## $ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
## $ sales : Factor w/ 10 levels "IT","RandD","accounting",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
sample_ind <- sample(nrow(hr_data),nrow(hr_data)*0.70)
train <- hr_data[sample_ind,]
test <- hr_data[-sample_ind,]
#Base Model
hr_base_model <- rpart(left ~ ., data = train, method = "class",
control = rpart.control(cp = 0))
summary(hr_base_model)
## Call:
## rpart(formula = left ~ ., data = train, method = "class", control = rpart.control(cp = 0))
## n= 10499
##
## CP nsplit rel error xerror xstd
## 1 0.2355305466 0 1.00000000 1.00000000 0.017512341
## 2 0.1887057878 1 0.76446945 0.76446945 0.015861870
## 3 0.0761655949 3 0.38705788 0.38705788 0.011886991
## 4 0.0526527331 5 0.23472669 0.23593248 0.009461835
## 5 0.0321543408 6 0.18207395 0.18247588 0.008376808
## 6 0.0140675241 7 0.14991961 0.15032154 0.007633241
## 7 0.0120578778 8 0.13585209 0.13866559 0.007341821
## 8 0.0064308682 9 0.12379421 0.12660772 0.007025709
## 9 0.0052250804 10 0.11736334 0.12138264 0.006883595
## 10 0.0048231511 13 0.10168810 0.11736334 0.006771987
## 11 0.0032154341 14 0.09686495 0.10450161 0.006400164
## 12 0.0028135048 15 0.09364952 0.09686495 0.006167590
## 13 0.0012057878 16 0.09083601 0.09606109 0.006142544
## 14 0.0009378349 17 0.08963023 0.09646302 0.006155081
## 15 0.0004823151 20 0.08681672 0.09766881 0.006192525
## 16 0.0004019293 25 0.08440514 0.09967846 0.006254385
## 17 0.0002679528 28 0.08319936 0.09967846 0.006254385
## 18 0.0001148369 31 0.08239550 0.10369775 0.006376123
## 19 0.0000000000 38 0.08159164 0.10490354 0.006412147
##
## Variable importance
## satisfaction_level average_montly_hours number_project
## 34 18 18
## last_evaluation time_spend_company
## 17 13
##
## Node number 1: 10499 observations, complexity param=0.2355305
## predicted class=0 expected loss=0.2369749 P(node) =1
## class counts: 8011 2488
## probabilities: 0.763 0.237
## left son=2 (7569 obs) right son=3 (2930 obs)
## Primary splits:
## satisfaction_level < 0.465 to the right, improve=1071.2240, (0 missing)
## number_project < 2.5 to the right, improve= 686.2670, (0 missing)
## time_spend_company < 2.5 to the left, improve= 288.9549, (0 missing)
## average_montly_hours < 287.5 to the left, improve= 270.1633, (0 missing)
## last_evaluation < 0.575 to the right, improve= 155.4792, (0 missing)
## Surrogate splits:
## number_project < 2.5 to the right, agree=0.793, adj=0.259, (0 split)
## average_montly_hours < 275.5 to the left, agree=0.753, adj=0.113, (0 split)
## last_evaluation < 0.485 to the right, agree=0.742, adj=0.076, (0 split)
##
## Node number 2: 7569 observations, complexity param=0.07616559
## predicted class=0 expected loss=0.09644603 P(node) =0.7209258
## class counts: 6839 730
## probabilities: 0.904 0.096
## left son=4 (6198 obs) right son=5 (1371 obs)
## Primary splits:
## time_spend_company < 4.5 to the left, improve=446.74570, (0 missing)
## last_evaluation < 0.815 to the left, improve=153.15450, (0 missing)
## average_montly_hours < 216.5 to the left, improve=120.71250, (0 missing)
## number_project < 4.5 to the left, improve= 80.28063, (0 missing)
## satisfaction_level < 0.715 to the left, improve= 58.80052, (0 missing)
## Surrogate splits:
## last_evaluation < 0.995 to the left, agree=0.823, adj=0.023, (0 split)
##
## Node number 3: 2930 observations, complexity param=0.1887058
## predicted class=1 expected loss=0.4 P(node) =0.2790742
## class counts: 1172 1758
## probabilities: 0.400 0.600
## left son=6 (1708 obs) right son=7 (1222 obs)
## Primary splits:
## number_project < 2.5 to the right, improve=310.9605, (0 missing)
## satisfaction_level < 0.115 to the right, improve=250.6215, (0 missing)
## time_spend_company < 4.5 to the right, improve=236.1650, (0 missing)
## last_evaluation < 0.575 to the right, improve=130.2469, (0 missing)
## average_montly_hours < 161.5 to the right, improve=111.4102, (0 missing)
## Surrogate splits:
## satisfaction_level < 0.355 to the left, agree=0.880, adj=0.713, (0 split)
## last_evaluation < 0.575 to the right, agree=0.859, adj=0.662, (0 split)
## average_montly_hours < 161.5 to the right, agree=0.856, adj=0.655, (0 split)
## time_spend_company < 3.5 to the right, agree=0.840, adj=0.615, (0 split)
##
## Node number 4: 6198 observations, complexity param=0.003215434
## predicted class=0 expected loss=0.01565021 P(node) =0.5903419
## class counts: 6101 97
## probabilities: 0.984 0.016
## left son=8 (6190 obs) right son=9 (8 obs)
## Primary splits:
## average_montly_hours < 290.5 to the left, improve=15.5231500, (0 missing)
## number_project < 5.5 to the left, improve= 3.7322140, (0 missing)
## satisfaction_level < 0.475 to the right, improve= 1.6447890, (0 missing)
## time_spend_company < 3.5 to the left, improve= 1.4018780, (0 missing)
## sales splits as LLLRRLLLLR, improve= 0.5033744, (0 missing)
##
## Node number 5: 1371 observations, complexity param=0.07616559
## predicted class=0 expected loss=0.4617068 P(node) =0.1305839
## class counts: 738 633
## probabilities: 0.538 0.462
## left son=10 (534 obs) right son=11 (837 obs)
## Primary splits:
## last_evaluation < 0.815 to the left, improve=301.1271, (0 missing)
## average_montly_hours < 216.5 to the left, improve=263.9916, (0 missing)
## time_spend_company < 6.5 to the right, improve=175.0792, (0 missing)
## satisfaction_level < 0.715 to the left, improve=157.5197, (0 missing)
## number_project < 3.5 to the left, improve=136.2939, (0 missing)
## Surrogate splits:
## average_montly_hours < 215.5 to the left, agree=0.743, adj=0.341, (0 split)
## number_project < 3.5 to the left, agree=0.708, adj=0.251, (0 split)
## satisfaction_level < 0.715 to the left, agree=0.705, adj=0.243, (0 split)
## time_spend_company < 6.5 to the right, agree=0.682, adj=0.184, (0 split)
## Work_accident < 0.5 to the right, agree=0.648, adj=0.097, (0 split)
##
## Node number 6: 1708 observations, complexity param=0.1887058
## predicted class=0 expected loss=0.4051522 P(node) =0.1626822
## class counts: 1016 692
## probabilities: 0.595 0.405
## left son=12 (1093 obs) right son=13 (615 obs)
## Primary splits:
## satisfaction_level < 0.115 to the right, improve=680.1184, (0 missing)
## average_montly_hours < 242.5 to the left, improve=384.2062, (0 missing)
## number_project < 5.5 to the left, improve=351.9510, (0 missing)
## last_evaluation < 0.765 to the left, improve=272.1409, (0 missing)
## time_spend_company < 3.5 to the left, improve=111.0344, (0 missing)
## Surrogate splits:
## average_montly_hours < 242.5 to the left, agree=0.856, adj=0.600, (0 split)
## number_project < 5.5 to the left, agree=0.833, adj=0.535, (0 split)
## last_evaluation < 0.765 to the left, agree=0.784, adj=0.400, (0 split)
##
## Node number 7: 1222 observations, complexity param=0.03215434
## predicted class=1 expected loss=0.1276596 P(node) =0.116392
## class counts: 156 1066
## probabilities: 0.128 0.872
## left son=14 (92 obs) right son=15 (1130 obs)
## Primary splits:
## last_evaluation < 0.575 to the right, improve=129.625400, (0 missing)
## average_montly_hours < 162 to the right, improve=117.537800, (0 missing)
## satisfaction_level < 0.355 to the left, improve=105.527700, (0 missing)
## time_spend_company < 3.5 to the right, improve= 63.082060, (0 missing)
## salary splits as LRR, improve= 7.446581, (0 missing)
## Surrogate splits:
## average_montly_hours < 162 to the right, agree=0.942, adj=0.228, (0 split)
## time_spend_company < 3.5 to the right, agree=0.938, adj=0.174, (0 split)
## satisfaction_level < 0.355 to the left, agree=0.936, adj=0.152, (0 split)
##
## Node number 8: 6190 observations, complexity param=0.0002679528
## predicted class=0 expected loss=0.01437803 P(node) =0.58958
## class counts: 6101 89
## probabilities: 0.986 0.014
## left son=16 (6080 obs) right son=17 (110 obs)
## Primary splits:
## number_project < 5.5 to the left, improve=3.3329360, (0 missing)
## satisfaction_level < 0.475 to the right, improve=1.6639030, (0 missing)
## time_spend_company < 3.5 to the left, improve=1.2096960, (0 missing)
## average_montly_hours < 247.5 to the left, improve=0.4426587, (0 missing)
## sales splits as LLLRRLLRRR, improve=0.4068224, (0 missing)
##
## Node number 9: 8 observations
## predicted class=1 expected loss=0 P(node) =0.0007619773
## class counts: 0 8
## probabilities: 0.000 1.000
##
## Node number 10: 534 observations, complexity param=0.0001148369
## predicted class=0 expected loss=0.04681648 P(node) =0.05086199
## class counts: 509 25
## probabilities: 0.953 0.047
## left son=20 (344 obs) right son=21 (190 obs)
## Primary splits:
## sales splits as LLLRLRLLRR, improve=2.3943050, (0 missing)
## average_montly_hours < 272.5 to the left, improve=2.3216550, (0 missing)
## last_evaluation < 0.805 to the left, improve=1.8136130, (0 missing)
## time_spend_company < 6.5 to the right, improve=1.4818060, (0 missing)
## satisfaction_level < 0.895 to the right, improve=0.6712242, (0 missing)
## Surrogate splits:
## last_evaluation < 0.465 to the right, agree=0.657, adj=0.037, (0 split)
## average_montly_hours < 132.5 to the right, agree=0.657, adj=0.037, (0 split)
##
## Node number 11: 837 observations, complexity param=0.05265273
## predicted class=1 expected loss=0.2735962 P(node) =0.07972188
## class counts: 229 608
## probabilities: 0.274 0.726
## left son=22 (155 obs) right son=23 (682 obs)
## Primary splits:
## average_montly_hours < 216.5 to the left, improve=160.24020, (0 missing)
## time_spend_company < 6.5 to the right, improve=132.25340, (0 missing)
## satisfaction_level < 0.715 to the left, improve=105.36270, (0 missing)
## number_project < 3.5 to the left, improve= 75.46432, (0 missing)
## salary splits as LRL, improve= 17.57553, (0 missing)
## Surrogate splits:
## time_spend_company < 6.5 to the right, agree=0.861, adj=0.252, (0 split)
## satisfaction_level < 0.715 to the left, agree=0.847, adj=0.174, (0 split)
## number_project < 3.5 to the left, agree=0.843, adj=0.155, (0 split)
##
## Node number 12: 1093 observations, complexity param=0.00522508
## predicted class=0 expected loss=0.07044831 P(node) =0.1041052
## class counts: 1016 77
## probabilities: 0.930 0.070
## left son=24 (1080 obs) right son=25 (13 obs)
## Primary splits:
## number_project < 6.5 to the left, improve=22.736150, (0 missing)
## average_montly_hours < 290.5 to the left, improve=12.174900, (0 missing)
## last_evaluation < 0.995 to the left, improve= 2.908747, (0 missing)
## sales splits as LLRLLLLLLR, improve= 2.400259, (0 missing)
## time_spend_company < 5.5 to the right, improve= 1.360632, (0 missing)
##
## Node number 13: 615 observations
## predicted class=1 expected loss=0 P(node) =0.05857701
## class counts: 0 615
## probabilities: 0.000 1.000
##
## Node number 14: 92 observations, complexity param=0.0004019293
## predicted class=0 expected loss=0.06521739 P(node) =0.008762739
## class counts: 86 6
## probabilities: 0.935 0.065
## left son=28 (85 obs) right son=29 (7 obs)
## Primary splits:
## average_montly_hours < 275.5 to the left, improve=3.8829380, (0 missing)
## sales splits as LLLLLRLLLR, improve=1.6265130, (0 missing)
## time_spend_company < 3.5 to the left, improve=1.0635450, (0 missing)
## salary splits as LRL, improve=0.4209983, (0 missing)
## satisfaction_level < 0.315 to the left, improve=0.4173913, (0 missing)
##
## Node number 15: 1130 observations, complexity param=0.01205788
## predicted class=1 expected loss=0.0619469 P(node) =0.1076293
## class counts: 70 1060
## probabilities: 0.062 0.938
## left son=30 (30 obs) right son=31 (1100 obs)
## Primary splits:
## last_evaluation < 0.445 to the left, improve=54.236520, (0 missing)
## average_montly_hours < 162 to the right, improve=47.188320, (0 missing)
## satisfaction_level < 0.35 to the left, improve=37.583870, (0 missing)
## time_spend_company < 2.5 to the left, improve=24.947510, (0 missing)
## Work_accident < 0.5 to the right, improve= 1.488762, (0 missing)
##
## Node number 16: 6080 observations
## predicted class=0 expected loss=0.01217105 P(node) =0.5791028
## class counts: 6006 74
## probabilities: 0.988 0.012
##
## Node number 17: 110 observations, complexity param=0.0002679528
## predicted class=0 expected loss=0.1363636 P(node) =0.01047719
## class counts: 95 15
## probabilities: 0.864 0.136
## left son=34 (59 obs) right son=35 (51 obs)
## Primary splits:
## sales splits as RLLLRLLLRR, improve=1.8612340, (0 missing)
## last_evaluation < 0.965 to the left, improve=1.5290910, (0 missing)
## average_montly_hours < 256.5 to the left, improve=1.3476870, (0 missing)
## satisfaction_level < 0.865 to the right, improve=1.2662340, (0 missing)
## salary splits as LRR, improve=0.3645365, (0 missing)
## Surrogate splits:
## satisfaction_level < 0.755 to the left, agree=0.591, adj=0.118, (0 split)
## salary splits as RLL, agree=0.582, adj=0.098, (0 split)
## number_project < 6.5 to the left, agree=0.573, adj=0.078, (0 split)
## average_montly_hours < 214.5 to the left, agree=0.573, adj=0.078, (0 split)
## time_spend_company < 3.5 to the left, agree=0.573, adj=0.078, (0 split)
##
## Node number 20: 344 observations
## predicted class=0 expected loss=0.01162791 P(node) =0.03276503
## class counts: 340 4
## probabilities: 0.988 0.012
##
## Node number 21: 190 observations, complexity param=0.0001148369
## predicted class=0 expected loss=0.1105263 P(node) =0.01809696
## class counts: 169 21
## probabilities: 0.889 0.111
## left son=42 (174 obs) right son=43 (16 obs)
## Primary splits:
## average_montly_hours < 272.5 to the left, improve=3.735768, (0 missing)
## last_evaluation < 0.555 to the left, improve=2.341083, (0 missing)
## Work_accident < 0.5 to the right, improve=1.703218, (0 missing)
## salary splits as RLL, improve=1.271752, (0 missing)
## satisfaction_level < 0.895 to the right, improve=1.048217, (0 missing)
##
## Node number 22: 155 observations, complexity param=0.0009378349
## predicted class=0 expected loss=0.07741935 P(node) =0.01476331
## class counts: 143 12
## probabilities: 0.923 0.077
## left son=44 (110 obs) right son=45 (45 obs)
## Primary splits:
## time_spend_company < 5.5 to the right, improve=3.5378950, (0 missing)
## number_project < 2.5 to the right, improve=2.6674830, (0 missing)
## sales splits as LLLRLLLLRR, improve=1.2673900, (0 missing)
## salary splits as LRL, improve=0.9090162, (0 missing)
## last_evaluation < 0.985 to the left, improve=0.9032991, (0 missing)
## Surrogate splits:
## average_montly_hours < 130.5 to the right, agree=0.761, adj=0.178, (0 split)
## number_project < 2.5 to the right, agree=0.748, adj=0.133, (0 split)
## sales splits as LLLRLLRLLL, agree=0.729, adj=0.067, (0 split)
## last_evaluation < 0.985 to the left, agree=0.716, adj=0.022, (0 split)
##
## Node number 23: 682 observations, complexity param=0.01406752
## predicted class=1 expected loss=0.1260997 P(node) =0.06495857
## class counts: 86 596
## probabilities: 0.126 0.874
## left son=46 (35 obs) right son=47 (647 obs)
## Primary splits:
## time_spend_company < 6.5 to the right, improve=56.351040, (0 missing)
## satisfaction_level < 0.715 to the left, improve=48.654300, (0 missing)
## number_project < 3.5 to the left, improve=44.350240, (0 missing)
## promotion_last_5years < 0.5 to the right, improve=17.076870, (0 missing)
## salary splits as LRR, improve= 8.593971, (0 missing)
## Surrogate splits:
## promotion_last_5years < 0.5 to the right, agree=0.959, adj=0.200, (0 split)
## satisfaction_level < 0.975 to the right, agree=0.952, adj=0.057, (0 split)
##
## Node number 24: 1080 observations, complexity param=0.002813505
## predicted class=0 expected loss=0.05925926 P(node) =0.1028669
## class counts: 1016 64
## probabilities: 0.941 0.059
## left son=48 (1073 obs) right son=49 (7 obs)
## Primary splits:
## average_montly_hours < 290.5 to the left, improve=12.4707300, (0 missing)
## last_evaluation < 0.995 to the left, improve= 3.1002080, (0 missing)
## satisfaction_level < 0.295 to the left, improve= 1.5340500, (0 missing)
## sales splits as LRRLLLLLLR, improve= 1.3515420, (0 missing)
## time_spend_company < 5.5 to the right, improve= 0.8342854, (0 missing)
##
## Node number 25: 13 observations
## predicted class=1 expected loss=0 P(node) =0.001238213
## class counts: 0 13
## probabilities: 0.000 1.000
##
## Node number 28: 85 observations
## predicted class=0 expected loss=0.02352941 P(node) =0.008096009
## class counts: 83 2
## probabilities: 0.976 0.024
##
## Node number 29: 7 observations
## predicted class=1 expected loss=0.4285714 P(node) =0.0006667302
## class counts: 3 4
## probabilities: 0.429 0.571
##
## Node number 30: 30 observations
## predicted class=0 expected loss=0 P(node) =0.002857415
## class counts: 30 0
## probabilities: 1.000 0.000
##
## Node number 31: 1100 observations, complexity param=0.00522508
## predicted class=1 expected loss=0.03636364 P(node) =0.1047719
## class counts: 40 1060
## probabilities: 0.036 0.964
## left son=62 (21 obs) right son=63 (1079 obs)
## Primary splits:
## average_montly_hours < 162 to the right, improve=25.5952600, (0 missing)
## satisfaction_level < 0.35 to the left, improve=17.0105900, (0 missing)
## time_spend_company < 2.5 to the left, improve=16.8526000, (0 missing)
## Work_accident < 0.5 to the right, improve= 1.0677430, (0 missing)
## last_evaluation < 0.455 to the left, improve= 0.6124402, (0 missing)
## Surrogate splits:
## time_spend_company < 2.5 to the left, agree=0.985, adj=0.238, (0 split)
## satisfaction_level < 0.295 to the left, agree=0.985, adj=0.190, (0 split)
##
## Node number 34: 59 observations
## predicted class=0 expected loss=0.05084746 P(node) =0.005619583
## class counts: 56 3
## probabilities: 0.949 0.051
##
## Node number 35: 51 observations, complexity param=0.0002679528
## predicted class=0 expected loss=0.2352941 P(node) =0.004857605
## class counts: 39 12
## probabilities: 0.765 0.235
## left son=70 (43 obs) right son=71 (8 obs)
## Primary splits:
## average_montly_hours < 256.5 to the left, improve=2.8820110, (0 missing)
## satisfaction_level < 0.595 to the right, improve=1.7431850, (0 missing)
## last_evaluation < 0.45 to the left, improve=0.8983957, (0 missing)
## salary splits as LRR, improve=0.8983957, (0 missing)
## sales splits as L---R---LR, improve=0.1434174, (0 missing)
## Surrogate splits:
## number_project < 6.5 to the left, agree=0.882, adj=0.25, (0 split)
##
## Node number 42: 174 observations, complexity param=0.0001148369
## predicted class=0 expected loss=0.08045977 P(node) =0.01657301
## class counts: 160 14
## probabilities: 0.920 0.080
## left son=84 (162 obs) right son=85 (12 obs)
## Primary splits:
## salary splits as RLL, improve=1.6483610, (0 missing)
## time_spend_company < 5.5 to the right, improve=1.3049180, (0 missing)
## last_evaluation < 0.565 to the left, improve=1.1541530, (0 missing)
## average_montly_hours < 152.5 to the left, improve=0.9341183, (0 missing)
## Work_accident < 0.5 to the right, improve=0.8831264, (0 missing)
##
## Node number 43: 16 observations
## predicted class=0 expected loss=0.4375 P(node) =0.001523955
## class counts: 9 7
## probabilities: 0.562 0.437
##
## Node number 44: 110 observations
## predicted class=0 expected loss=0.009090909 P(node) =0.01047719
## class counts: 109 1
## probabilities: 0.991 0.009
##
## Node number 45: 45 observations, complexity param=0.0009378349
## predicted class=0 expected loss=0.2444444 P(node) =0.004286122
## class counts: 34 11
## probabilities: 0.756 0.244
## left son=90 (24 obs) right son=91 (21 obs)
## Primary splits:
## number_project < 3.5 to the right, improve=4.2293650, (0 missing)
## average_montly_hours < 144.5 to the left, improve=3.5851850, (0 missing)
## sales splits as LLLRLLLRRR, improve=2.1847220, (0 missing)
## salary splits as LRL, improve=1.2230130, (0 missing)
## satisfaction_level < 0.535 to the right, improve=0.5620718, (0 missing)
## Surrogate splits:
## sales splits as LLLRRRLLRL, agree=0.711, adj=0.381, (0 split)
## satisfaction_level < 0.615 to the right, agree=0.600, adj=0.143, (0 split)
## last_evaluation < 0.895 to the right, agree=0.578, adj=0.095, (0 split)
## average_montly_hours < 101.5 to the right, agree=0.578, adj=0.095, (0 split)
## salary splits as LLR, agree=0.578, adj=0.095, (0 split)
##
## Node number 46: 35 observations
## predicted class=0 expected loss=0 P(node) =0.003333651
## class counts: 35 0
## probabilities: 1.000 0.000
##
## Node number 47: 647 observations, complexity param=0.006430868
## predicted class=1 expected loss=0.07882535 P(node) =0.06162492
## class counts: 51 596
## probabilities: 0.079 0.921
## left son=94 (26 obs) right son=95 (621 obs)
## Primary splits:
## number_project < 3.5 to the left, improve=28.781440, (0 missing)
## satisfaction_level < 0.715 to the left, improve=25.916140, (0 missing)
## average_montly_hours < 277.5 to the right, improve= 8.573654, (0 missing)
## time_spend_company < 5.5 to the right, improve= 4.287677, (0 missing)
## salary splits as LRL, improve= 2.443115, (0 missing)
## Surrogate splits:
## satisfaction_level < 0.485 to the left, agree=0.964, adj=0.115, (0 split)
## average_montly_hours < 282.5 to the right, agree=0.964, adj=0.115, (0 split)
##
## Node number 48: 1073 observations, complexity param=0.0004823151
## predicted class=0 expected loss=0.05312209 P(node) =0.1022002
## class counts: 1016 57
## probabilities: 0.947 0.053
## left son=96 (1061 obs) right son=97 (12 obs)
## Primary splits:
## last_evaluation < 0.995 to the left, improve=3.2078270, (0 missing)
## satisfaction_level < 0.295 to the left, improve=0.9214371, (0 missing)
## average_montly_hours < 131.5 to the left, improve=0.8642098, (0 missing)
## sales splits as LRRLLLLLLR, improve=0.8366530, (0 missing)
## time_spend_company < 5.5 to the right, improve=0.5995069, (0 missing)
##
## Node number 49: 7 observations
## predicted class=1 expected loss=0 P(node) =0.0006667302
## class counts: 0 7
## probabilities: 0.000 1.000
##
## Node number 62: 21 observations, complexity param=0.0004019293
## predicted class=0 expected loss=0.1904762 P(node) =0.00200019
## class counts: 17 4
## probabilities: 0.810 0.190
## left son=124 (14 obs) right son=125 (7 obs)
## Primary splits:
## satisfaction_level < 0.305 to the right, improve=3.0476190, (0 missing)
## average_montly_hours < 240.5 to the left, improve=3.0476190, (0 missing)
## sales splits as R-L-L-LRLR, improve=1.3852810, (0 missing)
## time_spend_company < 3.5 to the left, improve=0.8800366, (0 missing)
## salary splits as -LR, improve=0.6428571, (0 missing)
## Surrogate splits:
## average_montly_hours < 240.5 to the left, agree=0.905, adj=0.714, (0 split)
## sales splits as L-L-L-LLLR, agree=0.762, adj=0.286, (0 split)
## last_evaluation < 0.545 to the left, agree=0.714, adj=0.143, (0 split)
##
## Node number 63: 1079 observations, complexity param=0.00522508
## predicted class=1 expected loss=0.02131603 P(node) =0.1027717
## class counts: 23 1056
## probabilities: 0.021 0.979
## left son=126 (13 obs) right son=127 (1066 obs)
## Primary splits:
## average_montly_hours < 125.5 to the left, improve=25.2070800, (0 missing)
## satisfaction_level < 0.315 to the left, improve=11.7475200, (0 missing)
## sales splits as RLRRRRRRRR, improve= 0.5342061, (0 missing)
## last_evaluation < 0.455 to the left, improve= 0.3269398, (0 missing)
## salary splits as LRR, improve= 0.3144245, (0 missing)
## Surrogate splits:
## time_spend_company < 2.5 to the left, agree=0.99, adj=0.154, (0 split)
##
## Node number 70: 43 observations
## predicted class=0 expected loss=0.1627907 P(node) =0.004095628
## class counts: 36 7
## probabilities: 0.837 0.163
##
## Node number 71: 8 observations
## predicted class=1 expected loss=0.375 P(node) =0.0007619773
## class counts: 3 5
## probabilities: 0.375 0.625
##
## Node number 84: 162 observations, complexity param=0.0001148369
## predicted class=0 expected loss=0.0617284 P(node) =0.01543004
## class counts: 152 10
## probabilities: 0.938 0.062
## left son=168 (76 obs) right son=169 (86 obs)
## Primary splits:
## last_evaluation < 0.595 to the left, improve=1.0910130, (0 missing)
## satisfaction_level < 0.715 to the left, improve=0.6754688, (0 missing)
## time_spend_company < 5.5 to the right, improve=0.5640528, (0 missing)
## average_montly_hours < 152.5 to the left, improve=0.5198181, (0 missing)
## Work_accident < 0.5 to the right, improve=0.4748338, (0 missing)
## Surrogate splits:
## average_montly_hours < 132.5 to the left, agree=0.617, adj=0.184, (0 split)
## satisfaction_level < 0.715 to the left, agree=0.605, adj=0.158, (0 split)
## number_project < 4.5 to the left, agree=0.574, adj=0.092, (0 split)
## sales splits as ---R-R--LR, agree=0.568, adj=0.079, (0 split)
## promotion_last_5years < 0.5 to the right, agree=0.549, adj=0.039, (0 split)
##
## Node number 85: 12 observations
## predicted class=0 expected loss=0.3333333 P(node) =0.001142966
## class counts: 8 4
## probabilities: 0.667 0.333
##
## Node number 90: 24 observations
## predicted class=0 expected loss=0.04166667 P(node) =0.002285932
## class counts: 23 1
## probabilities: 0.958 0.042
##
## Node number 91: 21 observations, complexity param=0.0009378349
## predicted class=0 expected loss=0.4761905 P(node) =0.00200019
## class counts: 11 10
## probabilities: 0.524 0.476
## left son=182 (12 obs) right son=183 (9 obs)
## Primary splits:
## salary splits as LRL, improve=5.3650790, (0 missing)
## average_montly_hours < 142 to the left, improve=4.7619050, (0 missing)
## sales splits as ---RLL-LRR, improve=0.7619048, (0 missing)
## satisfaction_level < 0.725 to the right, improve=0.5852814, (0 missing)
## last_evaluation < 0.895 to the right, improve=0.5723443, (0 missing)
## Surrogate splits:
## last_evaluation < 0.895 to the right, agree=0.762, adj=0.444, (0 split)
## sales splits as ---LLL-LRL, agree=0.762, adj=0.444, (0 split)
## satisfaction_level < 0.55 to the right, agree=0.667, adj=0.222, (0 split)
## average_montly_hours < 142 to the left, agree=0.667, adj=0.222, (0 split)
## number_project < 2.5 to the left, agree=0.619, adj=0.111, (0 split)
##
## Node number 94: 26 observations, complexity param=0.0004019293
## predicted class=0 expected loss=0.1923077 P(node) =0.002476426
## class counts: 21 5
## probabilities: 0.808 0.192
## left son=188 (17 obs) right son=189 (9 obs)
## Primary splits:
## satisfaction_level < 0.61 to the right, improve=3.632479, (0 missing)
## salary splits as LRL, improve=2.622378, (0 missing)
## last_evaluation < 0.935 to the right, improve=1.410256, (0 missing)
## average_montly_hours < 234 to the right, improve=1.069404, (0 missing)
## sales splits as L--LLLLLLR, improve=1.069404, (0 missing)
## Surrogate splits:
## average_montly_hours < 224 to the right, agree=0.769, adj=0.333, (0 split)
## sales splits as R--LLRLLLL, agree=0.731, adj=0.222, (0 split)
## last_evaluation < 0.935 to the right, agree=0.692, adj=0.111, (0 split)
##
## Node number 95: 621 observations, complexity param=0.004823151
## predicted class=1 expected loss=0.04830918 P(node) =0.05914849
## class counts: 30 591
## probabilities: 0.048 0.952
## left son=190 (20 obs) right son=191 (601 obs)
## Primary splits:
## satisfaction_level < 0.71 to the left, improve=23.3537000, (0 missing)
## number_project < 5.5 to the right, improve=12.9053700, (0 missing)
## sales splits as LRLRLRRRRR, improve= 1.4124590, (0 missing)
## time_spend_company < 5.5 to the right, improve= 1.2939960, (0 missing)
## salary splits as LRR, improve= 0.9955379, (0 missing)
## Surrogate splits:
## number_project < 5.5 to the right, agree=0.969, adj=0.05, (0 split)
##
## Node number 96: 1061 observations, complexity param=0.0004823151
## predicted class=0 expected loss=0.04901037 P(node) =0.1010572
## class counts: 1009 52
## probabilities: 0.951 0.049
## left son=192 (666 obs) right son=193 (395 obs)
## Primary splits:
## satisfaction_level < 0.295 to the left, improve=1.0930690, (0 missing)
## average_montly_hours < 131.5 to the left, improve=0.7305080, (0 missing)
## last_evaluation < 0.985 to the left, improve=0.6231371, (0 missing)
## sales splits as LRRLLLLRRR, improve=0.6091616, (0 missing)
## time_spend_company < 5.5 to the right, improve=0.4644825, (0 missing)
## Surrogate splits:
## time_spend_company < 3.5 to the right, agree=0.678, adj=0.134, (0 split)
## number_project < 3.5 to the right, agree=0.672, adj=0.119, (0 split)
## average_montly_hours < 138.5 to the right, agree=0.668, adj=0.109, (0 split)
## last_evaluation < 0.475 to the right, agree=0.638, adj=0.028, (0 split)
##
## Node number 97: 12 observations
## predicted class=0 expected loss=0.4166667 P(node) =0.001142966
## class counts: 7 5
## probabilities: 0.583 0.417
##
## Node number 124: 14 observations
## predicted class=0 expected loss=0 P(node) =0.00133346
## class counts: 14 0
## probabilities: 1.000 0.000
##
## Node number 125: 7 observations
## predicted class=1 expected loss=0.4285714 P(node) =0.0006667302
## class counts: 3 4
## probabilities: 0.429 0.571
##
## Node number 126: 13 observations
## predicted class=0 expected loss=0 P(node) =0.001238213
## class counts: 13 0
## probabilities: 1.000 0.000
##
## Node number 127: 1066 observations, complexity param=0.001205788
## predicted class=1 expected loss=0.009380863 P(node) =0.1015335
## class counts: 10 1056
## probabilities: 0.009 0.991
## left son=254 (9 obs) right son=255 (1057 obs)
## Primary splits:
## satisfaction_level < 0.34 to the left, improve=7.84265700, (0 missing)
## salary splits as LRR, improve=0.43675570, (0 missing)
## sales splits as RLRRRRRRRR, improve=0.15299160, (0 missing)
## last_evaluation < 0.455 to the left, improve=0.09535386, (0 missing)
## Work_accident < 0.5 to the right, improve=0.07729790, (0 missing)
## Surrogate splits:
## time_spend_company < 3.5 to the right, agree=0.994, adj=0.333, (0 split)
##
## Node number 168: 76 observations
## predicted class=0 expected loss=0 P(node) =0.007238785
## class counts: 76 0
## probabilities: 1.000 0.000
##
## Node number 169: 86 observations, complexity param=0.0001148369
## predicted class=0 expected loss=0.1162791 P(node) =0.008191256
## class counts: 76 10
## probabilities: 0.884 0.116
## left son=338 (45 obs) right son=339 (41 obs)
## Primary splits:
## time_spend_company < 5.5 to the right, improve=0.9741476, (0 missing)
## Work_accident < 0.5 to the right, improve=0.8490218, (0 missing)
## satisfaction_level < 0.71 to the left, improve=0.7369186, (0 missing)
## last_evaluation < 0.655 to the right, improve=0.6458472, (0 missing)
## average_montly_hours < 152.5 to the left, improve=0.6155951, (0 missing)
## Surrogate splits:
## average_montly_hours < 185.5 to the right, agree=0.628, adj=0.220, (0 split)
## satisfaction_level < 0.885 to the right, agree=0.581, adj=0.122, (0 split)
## last_evaluation < 0.715 to the left, agree=0.581, adj=0.122, (0 split)
## number_project < 5.5 to the left, agree=0.570, adj=0.098, (0 split)
## sales splits as ---R-L--LL, agree=0.547, adj=0.049, (0 split)
##
## Node number 182: 12 observations
## predicted class=0 expected loss=0.1666667 P(node) =0.001142966
## class counts: 10 2
## probabilities: 0.833 0.167
##
## Node number 183: 9 observations
## predicted class=1 expected loss=0.1111111 P(node) =0.0008572245
## class counts: 1 8
## probabilities: 0.111 0.889
##
## Node number 188: 17 observations
## predicted class=0 expected loss=0 P(node) =0.001619202
## class counts: 17 0
## probabilities: 1.000 0.000
##
## Node number 189: 9 observations
## predicted class=1 expected loss=0.4444444 P(node) =0.0008572245
## class counts: 4 5
## probabilities: 0.444 0.556
##
## Node number 190: 20 observations
## predicted class=0 expected loss=0.2 P(node) =0.001904943
## class counts: 16 4
## probabilities: 0.800 0.200
##
## Node number 191: 601 observations
## predicted class=1 expected loss=0.02329451 P(node) =0.05724355
## class counts: 14 587
## probabilities: 0.023 0.977
##
## Node number 192: 666 observations
## predicted class=0 expected loss=0.03153153 P(node) =0.06343461
## class counts: 645 21
## probabilities: 0.968 0.032
##
## Node number 193: 395 observations, complexity param=0.0004823151
## predicted class=0 expected loss=0.07848101 P(node) =0.03762263
## class counts: 364 31
## probabilities: 0.922 0.078
## left son=386 (344 obs) right son=387 (51 obs)
## Primary splits:
## satisfaction_level < 0.315 to the right, improve=3.6453490, (0 missing)
## last_evaluation < 0.525 to the left, improve=1.8780370, (0 missing)
## average_montly_hours < 198.5 to the left, improve=1.4977330, (0 missing)
## Work_accident < 0.5 to the right, improve=0.8886682, (0 missing)
## number_project < 5.5 to the left, improve=0.7382494, (0 missing)
##
## Node number 254: 9 observations
## predicted class=0 expected loss=0.3333333 P(node) =0.0008572245
## class counts: 6 3
## probabilities: 0.667 0.333
##
## Node number 255: 1057 observations
## predicted class=1 expected loss=0.003784295 P(node) =0.1006763
## class counts: 4 1053
## probabilities: 0.004 0.996
##
## Node number 338: 45 observations
## predicted class=0 expected loss=0.04444444 P(node) =0.004286122
## class counts: 43 2
## probabilities: 0.956 0.044
##
## Node number 339: 41 observations, complexity param=0.0001148369
## predicted class=0 expected loss=0.195122 P(node) =0.003905134
## class counts: 33 8
## probabilities: 0.805 0.195
## left son=678 (14 obs) right son=679 (27 obs)
## Primary splits:
## satisfaction_level < 0.805 to the right, improve=1.6187900, (0 missing)
## average_montly_hours < 226.5 to the left, improve=1.4336040, (0 missing)
## Work_accident < 0.5 to the right, improve=1.1447150, (0 missing)
## last_evaluation < 0.635 to the right, improve=0.8538064, (0 missing)
## salary splits as -LR, improve=0.7066202, (0 missing)
## Surrogate splits:
## promotion_last_5years < 0.5 to the right, agree=0.707, adj=0.143, (0 split)
## number_project < 5.5 to the right, agree=0.683, adj=0.071, (0 split)
## sales splits as ---L-R--RR, agree=0.683, adj=0.071, (0 split)
##
## Node number 386: 344 observations
## predicted class=0 expected loss=0.05232558 P(node) =0.03276503
## class counts: 326 18
## probabilities: 0.948 0.052
##
## Node number 387: 51 observations, complexity param=0.0004823151
## predicted class=0 expected loss=0.254902 P(node) =0.004857605
## class counts: 38 13
## probabilities: 0.745 0.255
## left son=774 (19 obs) right son=775 (32 obs)
## Primary splits:
## number_project < 3.5 to the left, improve=3.935049, (0 missing)
## sales splits as LRRLLLLRRL, improve=3.000241, (0 missing)
## satisfaction_level < 0.305 to the left, improve=1.505882, (0 missing)
## last_evaluation < 0.535 to the left, improve=1.420168, (0 missing)
## Work_accident < 0.5 to the right, improve=1.420168, (0 missing)
## Surrogate splits:
## sales splits as RRRRRRLLRR, agree=0.725, adj=0.263, (0 split)
## satisfaction_level < 0.305 to the left, agree=0.686, adj=0.158, (0 split)
## last_evaluation < 0.405 to the left, agree=0.647, adj=0.053, (0 split)
## average_montly_hours < 261.5 to the right, agree=0.647, adj=0.053, (0 split)
## time_spend_company < 5.5 to the right, agree=0.647, adj=0.053, (0 split)
##
## Node number 678: 14 observations
## predicted class=0 expected loss=0 P(node) =0.00133346
## class counts: 14 0
## probabilities: 1.000 0.000
##
## Node number 679: 27 observations, complexity param=0.0001148369
## predicted class=0 expected loss=0.2962963 P(node) =0.002571673
## class counts: 19 8
## probabilities: 0.704 0.296
## left son=1358 (19 obs) right son=1359 (8 obs)
## Primary splits:
## satisfaction_level < 0.745 to the left, improve=2.4566280, (0 missing)
## average_montly_hours < 145.5 to the left, improve=1.6592590, (0 missing)
## last_evaluation < 0.64 to the right, improve=0.9434698, (0 missing)
## salary splits as -LR, improve=0.2945534, (0 missing)
## sales splits as ---L-R--LR, improve=0.1481481, (0 missing)
## Surrogate splits:
## last_evaluation < 0.795 to the left, agree=0.778, adj=0.25, (0 split)
## average_montly_hours < 217 to the left, agree=0.778, adj=0.25, (0 split)
##
## Node number 774: 19 observations
## predicted class=0 expected loss=0 P(node) =0.001809696
## class counts: 19 0
## probabilities: 1.000 0.000
##
## Node number 775: 32 observations, complexity param=0.0004823151
## predicted class=0 expected loss=0.40625 P(node) =0.003047909
## class counts: 19 13
## probabilities: 0.594 0.406
## left son=1550 (14 obs) right son=1551 (18 obs)
## Primary splits:
## sales splits as LRRLLLRRRL, improve=5.5803570, (0 missing)
## time_spend_company < 4.5 to the left, improve=1.2041670, (0 missing)
## average_montly_hours < 157 to the right, improve=1.2041670, (0 missing)
## last_evaluation < 0.625 to the right, improve=0.9120098, (0 missing)
## salary splits as LLR, improve=0.2556818, (0 missing)
## Surrogate splits:
## last_evaluation < 0.5 to the left, agree=0.688, adj=0.286, (0 split)
## Work_accident < 0.5 to the right, agree=0.688, adj=0.286, (0 split)
## average_montly_hours < 239 to the right, agree=0.656, adj=0.214, (0 split)
## satisfaction_level < 0.305 to the left, agree=0.625, adj=0.143, (0 split)
## time_spend_company < 4.5 to the left, agree=0.625, adj=0.143, (0 split)
##
## Node number 1358: 19 observations
## predicted class=0 expected loss=0.1578947 P(node) =0.001809696
## class counts: 16 3
## probabilities: 0.842 0.158
##
## Node number 1359: 8 observations
## predicted class=1 expected loss=0.375 P(node) =0.0007619773
## class counts: 3 5
## probabilities: 0.375 0.625
##
## Node number 1550: 14 observations
## predicted class=0 expected loss=0.07142857 P(node) =0.00133346
## class counts: 13 1
## probabilities: 0.929 0.071
##
## Node number 1551: 18 observations
## predicted class=1 expected loss=0.3333333 P(node) =0.001714449
## class counts: 6 12
## probabilities: 0.333 0.667
#Plot Decision Tree
rpart.plot(hr_base_model)
# Examine the complexity plot
printcp(hr_base_model)
##
## Classification tree:
## rpart(formula = left ~ ., data = train, method = "class", control = rpart.control(cp = 0))
##
## Variables actually used in tree construction:
## [1] average_montly_hours last_evaluation number_project
## [4] salary sales satisfaction_level
## [7] time_spend_company
##
## Root node error: 2488/10499 = 0.23697
##
## n= 10499
##
## CP nsplit rel error xerror xstd
## 1 0.23553055 0 1.000000 1.000000 0.0175123
## 2 0.18870579 1 0.764469 0.764469 0.0158619
## 3 0.07616559 3 0.387058 0.387058 0.0118870
## 4 0.05265273 5 0.234727 0.235932 0.0094618
## 5 0.03215434 6 0.182074 0.182476 0.0083768
## 6 0.01406752 7 0.149920 0.150322 0.0076332
## 7 0.01205788 8 0.135852 0.138666 0.0073418
## 8 0.00643087 9 0.123794 0.126608 0.0070257
## 9 0.00522508 10 0.117363 0.121383 0.0068836
## 10 0.00482315 13 0.101688 0.117363 0.0067720
## 11 0.00321543 14 0.096865 0.104502 0.0064002
## 12 0.00281350 15 0.093650 0.096865 0.0061676
## 13 0.00120579 16 0.090836 0.096061 0.0061425
## 14 0.00093783 17 0.089630 0.096463 0.0061551
## 15 0.00048232 20 0.086817 0.097669 0.0061925
## 16 0.00040193 25 0.084405 0.099678 0.0062544
## 17 0.00026795 28 0.083199 0.099678 0.0062544
## 18 0.00011484 31 0.082395 0.103698 0.0063761
## 19 0.00000000 38 0.081592 0.104904 0.0064121
plotcp(hr_base_model)
# Compute the accuracy of the base model
test$pred <- predict(hr_base_model, test, type = "class")
base_accuracy <- mean(test$pred == test$left)
# Grow a tree with minsplit of 100 and max depth of 8 (PREPRUNING)
hr_model_preprun <- rpart(left ~ ., data = train, method = "class",
control = rpart.control(cp = 0, maxdepth = 8, minsplit = 100))
# Compute the accuracy of the pruned tree
test$pred <- predict(hr_model_preprun, test, type = "class")
accuracy_preprun <- mean(test$pred == test$left)
rpart.plot(hr_model_preprun)
#Postpruning
# Prune the hr_base_model based on the optimal cp value (POSTPRUNING)
hr_model_pruned <- prune(hr_base_model, cp = 0.0046)
# Compute the accuracy of the pruned tree
test$pred <- predict(hr_model_pruned, test, type = "class")
accuracy_postprun <- mean(test$pred == test$left)
rpart.plot(hr_model_pruned)
data.frame(base_accuracy, accuracy_preprun, accuracy_postprun)
## base_accuracy accuracy_preprun accuracy_postprun
## 1 0.9757778 0.9717778 0.9766667
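En lugar de fijar cp a mano (aquí 0.0046, leído de printcp()), una alternativa habitual es la regla de 1 desviación estándar (1-SE) sobre la cptable; un boceto:
# Boceto: regla 1-SE sobre la cptable del modelo base
cpt <- hr_base_model$cptable
i_min  <- which.min(cpt[, "xerror"])
umbral <- cpt[i_min, "xerror"] + cpt[i_min, "xstd"]      # xerror mínimo + 1 SD
cp_1se <- cpt[which(cpt[, "xerror"] <= umbral)[1], "CP"] # árbol más pequeño bajo el umbral
hr_model_1se <- prune(hr_base_model, cp = cp_1se)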
4.5 Ejemplo de Regresión + Poda
Para este ejemplo necesitaremos instalar algunas librerías:
#Librería "AmesHousing"
install.packages("AmesHousing")
#Librería "rsample"
#https://tidymodels.github.io/rsample/
require(devtools) # si no tiene esta librería, instalarla
install_github("tidymodels/rsample")
## Librería purrr (functional programming tool - mapping). La forma más sencilla es instalando `tidyverse`
install.packages("tidyverse")
Cargamos las librerías
library(AmesHousing)
library(rsample) # data splitting
## Loading required package: tidyr
##
## Attaching package: 'rsample'
## The following object is masked from 'package:tidyr':
##
## fill
library(dplyr) # data wrangling
library(rpart) # performing regression trees
library(rpart.plot) # plotting regression trees
library(MLmetrics) # goodness of fit
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
Para ilustrar algunos conceptos de regularización vamos a usar el conjunto de datos de Ames Housing, que se incluye en el paquete del mismo nombre.
# Entrenamiento (70%) y test (30%) a partir de los datos de AmesHousing::make_ames().
# Usar semilla para reproducibilidad: set.seed()
set.seed(123)
ames_split <- initial_split(AmesHousing::make_ames(), prop = .7)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
str(ames_train)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2051 obs. of 81 variables:
## $ MS_SubClass : Factor w/ 16 levels "One_Story_1946_and_Newer_All_Styles",..: 1 1 6 6 12 12 6 6 1 6 ...
## $ MS_Zoning : Factor w/ 7 levels "Floating_Village_Residential",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Lot_Frontage : num 141 93 74 78 41 43 60 75 0 63 ...
## $ Lot_Area : int 31770 11160 13830 9978 4920 5005 7500 10000 7980 8402 ...
## $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
## $ Alley : Factor w/ 3 levels "Gravel","No_Alley_Access",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Lot_Shape : Factor w/ 4 levels "Regular","Slightly_Irregular",..: 2 1 2 2 1 2 1 2 2 2 ...
## $ Land_Contour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 2 4 4 4 4 ...
## $ Utilities : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Lot_Config : Factor w/ 5 levels "Corner","CulDSac",..: 1 1 5 5 5 5 5 1 5 5 ...
## $ Land_Slope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
## $ Neighborhood : Factor w/ 28 levels "North_Ames","College_Creek",..: 1 1 7 7 17 17 7 7 7 7 ...
## $ Condition_1 : Factor w/ 9 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Condition_2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Bldg_Type : Factor w/ 5 levels "OneFam","TwoFmCon",..: 1 1 1 1 5 5 1 1 1 1 ...
## $ House_Style : Factor w/ 8 levels "One_Story","One_and_Half_Fin",..: 1 1 6 6 1 1 6 6 1 6 ...
## $ Overall_Qual : Factor w/ 10 levels "Very_Poor","Poor",..: 6 7 5 6 8 8 7 6 6 6 ...
## $ Overall_Cond : Factor w/ 10 levels "Very_Poor","Poor",..: 5 5 5 6 5 5 5 5 7 5 ...
## $ Year_Built : int 1960 1968 1997 1998 2001 1992 1999 1993 1992 1998 ...
## $ Year_Remod_Add : int 1960 1968 1998 1998 2001 1992 1999 1994 2007 1998 ...
## $ Roof_Style : Factor w/ 6 levels "Flat","Gable",..: 4 4 2 2 2 2 2 2 2 2 ...
## $ Roof_Matl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior_1st : Factor w/ 16 levels "AsbShng","AsphShn",..: 4 4 14 14 6 7 14 7 7 14 ...
## $ Exterior_2nd : Factor w/ 17 levels "AsbShng","AsphShn",..: 11 4 15 15 6 7 15 7 7 15 ...
## $ Mas_Vnr_Type : Factor w/ 5 levels "BrkCmn","BrkFace",..: 5 4 4 2 4 4 4 4 4 4 ...
## $ Mas_Vnr_Area : num 112 0 0 20 0 0 0 0 0 0 ...
## $ Exter_Qual : Factor w/ 4 levels "Excellent","Fair",..: 4 3 4 4 3 3 4 4 4 4 ...
## $ Exter_Cond : Factor w/ 5 levels "Excellent","Fair",..: 5 5 5 5 5 5 5 5 3 5 ...
## $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 3 3 3 3 3 3 3 3 ...
## $ Bsmt_Qual : Factor w/ 6 levels "Excellent","Fair",..: 6 6 3 6 3 3 6 3 3 3 ...
## $ Bsmt_Cond : Factor w/ 6 levels "Excellent","Fair",..: 3 6 6 6 6 6 6 6 6 6 ...
## $ Bsmt_Exposure : Factor w/ 5 levels "Av","Gd","Mn",..: 2 4 4 4 3 4 4 4 4 4 ...
## $ BsmtFin_Type_1 : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 2 1 3 3 3 1 7 7 1 7 ...
## $ BsmtFin_SF_1 : num 2 1 3 3 3 1 7 7 1 7 ...
## $ BsmtFin_Type_2 : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ BsmtFin_SF_2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Bsmt_Unf_SF : num 441 1045 137 324 722 ...
## $ Total_Bsmt_SF : num 1080 2110 928 926 1338 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Heating_QC : Factor w/ 5 levels "Excellent","Fair",..: 2 1 3 1 1 1 3 3 1 3 ...
## $ Central_Air : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ Electrical : Factor w/ 6 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ First_Flr_SF : int 1656 2110 928 926 1338 1280 1028 763 1187 789 ...
## $ Second_Flr_SF : int 0 0 701 678 0 0 776 892 0 676 ...
## $ Low_Qual_Fin_SF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Gr_Liv_Area : int 1656 2110 1629 1604 1338 1280 1804 1655 1187 1465 ...
## $ Bsmt_Full_Bath : num 1 1 0 0 1 0 0 0 1 0 ...
## $ Bsmt_Half_Bath : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Full_Bath : int 1 2 2 2 2 2 2 2 2 2 ...
## $ Half_Bath : int 0 1 1 1 0 0 1 1 0 1 ...
## $ Bedroom_AbvGr : int 3 3 3 3 2 2 3 3 3 3 ...
## $ Kitchen_AbvGr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Kitchen_Qual : Factor w/ 5 levels "Excellent","Fair",..: 5 1 5 3 3 3 3 5 5 5 ...
## $ TotRms_AbvGrd : int 7 8 6 7 6 5 7 7 6 7 ...
## $ Functional : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ Fireplaces : int 2 2 1 1 0 0 1 1 0 1 ...
## $ Fireplace_Qu : Factor w/ 6 levels "Excellent","Fair",..: 3 6 6 3 4 4 6 6 4 3 ...
## $ Garage_Type : Factor w/ 7 levels "Attchd","Basment",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Garage_Finish : Factor w/ 4 levels "Fin","No_Garage",..: 1 1 1 1 1 3 1 1 1 1 ...
## $ Garage_Cars : num 2 2 2 2 2 2 2 2 2 2 ...
## $ Garage_Area : num 528 522 482 470 582 506 442 440 420 393 ...
## $ Garage_Qual : Factor w/ 6 levels "Excellent","Fair",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Garage_Cond : Factor w/ 6 levels "Excellent","Fair",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Paved_Drive : Factor w/ 3 levels "Dirt_Gravel",..: 2 3 3 3 3 3 3 3 3 3 ...
## $ Wood_Deck_SF : int 210 0 212 360 0 0 140 157 483 0 ...
## $ Open_Porch_SF : int 62 0 34 36 0 82 60 84 21 75 ...
## $ Enclosed_Porch : int 0 0 0 0 170 0 0 0 0 0 ...
## $ Three_season_porch: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Screen_Porch : int 0 0 0 0 0 144 0 0 0 0 ...
## $ Pool_Area : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Pool_QC : Factor w/ 5 levels "Excellent","Fair",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Fence : Factor w/ 5 levels "Good_Privacy",..: 5 5 3 5 5 5 5 5 1 5 ...
## $ Misc_Feature : Factor w/ 6 levels "Elev","Gar2",..: 3 3 3 3 3 3 3 3 5 3 ...
## $ Misc_Val : int 0 0 0 0 0 0 0 0 500 0 ...
## $ Mo_Sold : int 5 4 3 6 4 1 6 4 3 5 ...
## $ Year_Sold : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ Sale_Type : Factor w/ 10 levels "COD","CWD","Con",..: 10 10 10 10 10 10 10 10 10 10 ...
## $ Sale_Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Sale_Price : int 215000 244000 189900 195500 213500 191500 189000 175900 185000 180400 ...
## $ Longitude : num -93.6 -93.6 -93.6 -93.6 -93.6 ...
## $ Latitude : num 42.1 42.1 42.1 42.1 42.1 ...
To build the decision tree we use the same rpart function as in the previous examples, but this time with method = "anova". The printed description of the resulting object m1 can be quite long; it shows the explanatory structure of the fitted tree.
Note:
- yval = mean value of the observations in the node
- deviance = SSE (sum of squared errors) in the node
m1 <- rpart(
formula = Sale_Price ~ .,
data = ames_train,
method = "anova"
)
m1
## n= 2051
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 2051 1.329920e+13 181620.20
## 2) Overall_Qual=Very_Poor,Poor,Fair,Below_Average,Average,Above_Average,Good 1699 4.001092e+12 156147.10
## 4) Neighborhood=North_Ames,Old_Town,Edwards,Sawyer,Mitchell,Brookside,Iowa_DOT_and_Rail_Road,South_and_West_of_Iowa_State_University,Meadow_Village,Briardale,Northpark_Villa,Blueste 1000 1.298629e+12 131787.90
## 8) Overall_Qual=Very_Poor,Poor,Fair,Below_Average 195 1.733699e+11 98238.33 *
## 9) Overall_Qual=Average,Above_Average,Good 805 8.526051e+11 139914.80
## 18) First_Flr_SF< 1150.5 553 3.023384e+11 129936.80 *
## 19) First_Flr_SF>=1150.5 252 3.743907e+11 161810.90 *
## 5) Neighborhood=College_Creek,Somerset,Northridge_Heights,Gilbert,Northwest_Ames,Sawyer_West,Crawford,Timberland,Northridge,Stone_Brook,Clear_Creek,Bloomington_Heights,Veenker,Green_Hills 699 1.260199e+12 190995.90
## 10) Gr_Liv_Area< 1477.5 300 2.472611e+11 164045.20 *
## 11) Gr_Liv_Area>=1477.5 399 6.311990e+11 211259.60
## 22) Total_Bsmt_SF< 1004.5 232 1.640427e+11 192946.30 *
## 23) Total_Bsmt_SF>=1004.5 167 2.812570e+11 236700.80 *
## 3) Overall_Qual=Very_Good,Excellent,Very_Excellent 352 2.874510e+12 304571.10
## 6) Overall_Qual=Very_Good 254 8.855113e+11 273369.50
## 12) Gr_Liv_Area< 1959.5 155 3.256677e+11 247662.30 *
## 13) Gr_Liv_Area>=1959.5 99 2.970338e+11 313618.30 *
## 7) Overall_Qual=Excellent,Very_Excellent 98 1.100817e+12 385440.30
## 14) Gr_Liv_Area< 1990 42 7.880164e+10 325358.30 *
## 15) Gr_Liv_Area>=1990 56 7.566917e+11 430501.80
## 30) Neighborhood=College_Creek,Edwards,Timberland,Veenker 8 1.153051e+11 281887.50 *
## 31) Neighborhood=Old_Town,Somerset,Northridge_Heights,Northridge,Stone_Brook 48 4.352486e+11 455270.80
## 62) Total_Bsmt_SF< 1433 12 3.143066e+10 360094.20 *
## 63) Total_Bsmt_SF>=1433 36 2.588806e+11 486996.40 *
Let us visualize the tree using rpart.plot. This tree splits on 11 variables, yet there are 80 predictor variables in the training set. Why?
rpart.plot(m1)
By default rpart applies a cost-complexity penalty that grows with the size of the tree (its number of terminal nodes) and uses cross-validation to compare the candidate subtrees; this internal pruning is why only a subset of the variables ends up in the tree.
plotcp(m1)
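To read the same complexity information as text rather than as a plot, rpart's printcp can be used; a minimal sketch:
# text version of the complexity table visualized by plotcp
printcp(m1)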
The tree grown to its maximum size, forcing the argument cp = 0 (the dashed vertical line drawn afterwards marks the subtree with 12 terminal nodes):
m2 <- rpart(
formula = Sale_Price ~ .,
data = ames_train,
method = "anova",
control = list(cp = 0, xval = 10)
)
plotcp(m2)
abline(v = 12, lty = "dashed")
rpart thus automatically applies this optimization process to identify an optimal subtree: in this case, 11 splits, 12 terminal nodes, and a cross-validated error of 0.272. It is nevertheless possible to improve the model by tuning the constraint parameters (e.g., maximum depth, minimum number of observations per split, etc.).
m1$cptable
## CP nsplit rel error xerror xstd
## 1 0.48300624 0 1.0000000 1.0017486 0.05769371
## 2 0.10844747 1 0.5169938 0.5189120 0.02898242
## 3 0.06678458 2 0.4085463 0.4126655 0.02832854
## 4 0.02870391 3 0.3417617 0.3608270 0.02123062
## 5 0.02050153 4 0.3130578 0.3325157 0.02091087
## 6 0.01995037 5 0.2925563 0.3228913 0.02127370
## 7 0.01976132 6 0.2726059 0.3175645 0.02115401
## 8 0.01550003 7 0.2528446 0.3096765 0.02117779
## 9 0.01397824 8 0.2373446 0.2857729 0.01902451
## 10 0.01322455 9 0.2233663 0.2833382 0.01936841
## 11 0.01089820 10 0.2101418 0.2687777 0.01917474
## 12 0.01000000 11 0.1992436 0.2621273 0.01957837
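As in the classification example, we could also post-prune m1 explicitly at the cp value with the smallest cross-validated error; a minimal sketch using rpart's prune:
# cp with the minimum cross-validated error
best_cp <- m1$cptable[which.min(m1$cptable[, "xerror"]), "CP"]
# prune the tree at that value and visualize the result
m1_pruned <- prune(m1, cp = best_cp)
rpart.plot(m1_pruned)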
An example with some values to try:
m3 <- rpart(
formula = Sale_Price ~ .,
data = ames_train,
method = "anova",
control = list(minsplit = 10, maxdepth = 12, xval = 10)
)
m3$cptable
## CP nsplit rel error xerror xstd
## 1 0.48300624 0 1.0000000 1.0007911 0.05768347
## 2 0.10844747 1 0.5169938 0.5192042 0.02900726
## 3 0.06678458 2 0.4085463 0.4140423 0.02835387
## 4 0.02870391 3 0.3417617 0.3556013 0.02106960
## 5 0.02050153 4 0.3130578 0.3251197 0.02071312
## 6 0.01995037 5 0.2925563 0.3151983 0.02095032
## 7 0.01976132 6 0.2726059 0.3106164 0.02101621
## 8 0.01550003 7 0.2528446 0.2913458 0.01983930
## 9 0.01397824 8 0.2373446 0.2750055 0.01725564
## 10 0.01322455 9 0.2233663 0.2677136 0.01714828
## 11 0.01089820 10 0.2101418 0.2506827 0.01561141
## 12 0.01000000 11 0.1992436 0.2480154 0.01583340
It is possible to try many combinations and search for the optimal set of hyper-parameters through a grid search. Here we build a grid of minsplit values from 5 to 20 and maxdepth values from 8 to 15, which results in 128 different models.
hyper_grid <- expand.grid(
minsplit = seq(5, 20, 1),
maxdepth = seq(8, 15, 1)
)
head(hyper_grid)
## minsplit maxdepth
## 1 5 8
## 2 6 8
## 3 7 8
## 4 8 8
## 5 9 8
## 6 10 8
nrow(hyper_grid)
## [1] 128
We create a list of models:
models <- list()
for (i in 1:nrow(hyper_grid)) {
# get minsplit, maxdepth values at row i
minsplit <- hyper_grid$minsplit[i]
maxdepth <- hyper_grid$maxdepth[i]
# train a model and store in the list
models[[i]] <- rpart(
formula = Sale_Price ~ .,
data = ames_train,
method = "anova",
control = list(minsplit = minsplit, maxdepth = maxdepth)
)
}
Now we write functions to extract, from each model, the minimum cross-validated error together with its associated optimal cost-complexity (CP) value. We then add these results to our hyper-parameter grid and filter down to the five combinations with the lowest error (an xerror of 0.242 versus 0.272 for the default tree).
# function to get the optimal cp (the cp with minimum cross-validated error)
get_cp <- function(x) {
  min <- which.min(x$cptable[, "xerror"])
  x$cptable[min, "CP"]
}
# function to get the minimum cross-validated error
get_min_error <- function(x) {
  min <- which.min(x$cptable[, "xerror"])
  x$cptable[min, "xerror"]
}
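As a quick sanity check, the helpers can be applied to a single model before mapping them over the whole list:
# optimal cp and minimum cross-validated error for the first grid model
get_cp(models[[1]])
get_min_error(models[[1]])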
hyper_grid %>%
mutate(
cp = purrr::map_dbl(models, get_cp),
error = purrr::map_dbl(models, get_min_error)
) %>%
arrange(error) %>%
top_n(-5, wt = error)
## minsplit maxdepth cp error
## 1 5 13 0.0108982 0.2421256
## 2 6 8 0.0100000 0.2453631
## 3 12 10 0.0100000 0.2454067
## 4 8 13 0.0100000 0.2459588
## 5 19 9 0.0100000 0.2460173
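To avoid transcribing values by hand, the winning combination could also be extracted programmatically; a minimal sketch reusing the pipeline above (the object name best is just illustrative):
# keep only the row with the lowest cross-validated error
best <- hyper_grid %>%
  mutate(
    cp = purrr::map_dbl(models, get_cp),
    error = purrr::map_dbl(models, get_min_error)
  ) %>%
  arrange(error) %>%
  slice(1)
best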
We can now fit our final tree based on the information above.
optimal_tree <- rpart(
formula = Sale_Price ~ .,
data = ames_train,
method = "anova",
control = list(minsplit = 8, maxdepth = 11, cp = 0.01)
)
pred <- predict(optimal_tree, newdata = ames_test)
rmse_gof <- sqrt(MSE(y_pred = pred, y_true = ames_test$Sale_Price))
rmse_gof
## [1] 39145.39
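This suggests that, on average, the final tree's predicted sale prices are about $39,145 away from the actual prices. As a side note, MLmetrics also provides an RMSE helper, which should yield the same value directly:
# equivalent computation using MLmetrics' RMSE
RMSE(y_pred = pred, y_true = ames_test$Sale_Price)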
rpart.plot(optimal_tree)