6 Ensembles: Random Forest - Part II

6.1 Introductory example

Taken from https://datascienceplus.com/random-forests-in-r/

require(randomForest)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
require(MASS) # package which contains the Boston housing dataset
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
attach(Boston)
set.seed(101)

dim(Boston)
## [1] 506  14
## training sample with 300 observations
train=sample(1:nrow(Boston),300)
str(Boston)  
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

We will use the variable medv, the median home value, as the target variable. The model will be fit with 500 trees.

Boston.rf=randomForest(medv ~ . , data = Boston , subset = train)
Boston.rf
## 
## Call:
##  randomForest(formula = medv ~ ., data = Boston, subset = train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 12.34243
##                     % Var explained: 85.09

The mean squared error and the percentage of variance explained are computed using the OOB estimate: roughly 2/3 of the data are used to train each tree and the remaining 1/3 to validate it. The number of variables randomly selected at each split is 4 here by default (for regression, p/3 rounded down).
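
As a quick sanity check (an added sketch, not part of the original tutorial): the fitted randomForest object stores the OOB prediction for every training observation in its predicted component, so the reported mean of squared residuals can be recomputed by hand.

# recompute the OOB error reported above (assumes Boston.rf and train exist)
oob_pred = Boston.rf$predicted                # OOB prediction per training row
mean((Boston$medv[train] - oob_pred)^2)       # matches "Mean of squared residuals"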

#Plotting the Error vs Number of Trees Graph.
plot(Boston.rf)

This plot shows the error versus the number of trees; the error decreases as the number of trees increases. For this example we have also held out a separate test set (even though, as noted, the nature of the algorithm makes this strictly unnecessary). Let's inspect whether and how the error values actually differ when evaluated on the two different sets: the test set and the OOB sample.

To do so, we will check how the error changes with the number of random candidate attributes considered at each split of the trees.

oob.err=double(13)
test.err=double(13)

## mtry is the number of variables randomly chosen at each split
for(mtry in 1:13) 
{
  rf=randomForest(medv ~ . , data = Boston , subset = train,mtry=mtry,ntree=400) 
  oob.err[mtry] = rf$mse[400] 
  
  pred<-predict(rf,Boston[-train,]) # predictions on the test set for this value of mtry
  test.err[mtry]= with(Boston[-train,], mean( (medv - pred)^2)) #Mean Squared Test Error
  
  cat(mtry," ") #printing the output to the console
}
## 1  2  3  4  5  6  7  8  9  10  11  12  13
test.err
##  [1] 26.07809 18.45827 15.91703 14.82310 14.42375 14.43336 14.45756
##  [8] 14.62056 14.46458 14.79109 14.90324 15.13335 15.30873
oob.err
##  [1] 20.04025 14.31230 13.19824 12.48773 12.93861 12.96235 12.86396
##  [8] 13.69076 13.81878 14.36594 14.83846 14.85143 14.84741

A plot illustrating the previous results.

matplot(1:mtry, cbind(oob.err, test.err), pch = 19, col = c("red", "blue"),
        type = "b", ylab = "Mean Squared Error",
        xlab = "Number of Predictors Considered at each Split")
legend("topright", legend = c("Out of Bag Error", "Test Error"),
       pch = 19, col = c("red", "blue"))

The extreme case on the right, with 13 random variables (all possible predictors), reduces to the bagging model (see the short sketch after the importance plot). Finally, a first look at visualizing attribute importance.

varImpPlot(rf)
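
As noted, with mtry equal to all 13 predictors the forest degenerates into bagging. A minimal sketch making that equivalence explicit (bag.boston is a name introduced here; it reuses the same train index):

# bagging is simply a random forest where every predictor competes at every split
bag.boston = randomForest(medv ~ ., data = Boston, subset = train,
                          mtry = 13, ntree = 400, importance = TRUE)
varImpPlot(bag.boston) # importance plot for the pure-bagging model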

6.2 Regression example + tuning

library(AmesHousing)
library(rsample)     # data splitting 
library(dplyr)       # data wrangling
library(ranger)      # faster than randomForest
## 
## Attaching package: 'ranger'
## The following object is masked from 'package:randomForest':
## 
##     importance
library(h2o)         # distributed computing
## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
## 
## ----------------------------------------------------------------------
## 
## Attaching package: 'h2o'
## The following objects are masked from 'package:stats':
## 
##     cor, sd, var
## The following objects are masked from 'package:base':
## 
##     %*%, %in%, &&, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
##     log10, log1p, log2, round, signif, trunc, ||
library(ggplot2)

We load the data and split it into training and test sets.

# Training (70%) and test (30%) sets from the AmesHousing::make_ames() data.
# Use a seed for reproducibility: set.seed

set.seed(123)
ames_split <- initial_split(AmesHousing::make_ames(), prop = .7)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)
str(ames_train)
## Classes 'tbl_df', 'tbl' and 'data.frame':    2051 obs. of  81 variables:
##  $ MS_SubClass       : Factor w/ 16 levels "One_Story_1946_and_Newer_All_Styles",..: 1 1 6 6 12 12 6 6 1 6 ...
##  $ MS_Zoning         : Factor w/ 7 levels "Floating_Village_Residential",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Lot_Frontage      : num  141 93 74 78 41 43 60 75 0 63 ...
##  $ Lot_Area          : int  31770 11160 13830 9978 4920 5005 7500 10000 7980 8402 ...
##  $ Street            : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley             : Factor w/ 3 levels "Gravel","No_Alley_Access",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Lot_Shape         : Factor w/ 4 levels "Regular","Slightly_Irregular",..: 2 1 2 2 1 2 1 2 2 2 ...
##  $ Land_Contour      : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 2 4 4 4 4 ...
##  $ Utilities         : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Lot_Config        : Factor w/ 5 levels "Corner","CulDSac",..: 1 1 5 5 5 5 5 1 5 5 ...
##  $ Land_Slope        : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Neighborhood      : Factor w/ 28 levels "North_Ames","College_Creek",..: 1 1 7 7 17 17 7 7 7 7 ...
##  $ Condition_1       : Factor w/ 9 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Condition_2       : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Bldg_Type         : Factor w/ 5 levels "OneFam","TwoFmCon",..: 1 1 1 1 5 5 1 1 1 1 ...
##  $ House_Style       : Factor w/ 8 levels "One_Story","One_and_Half_Fin",..: 1 1 6 6 1 1 6 6 1 6 ...
##  $ Overall_Qual      : Factor w/ 10 levels "Very_Poor","Poor",..: 6 7 5 6 8 8 7 6 6 6 ...
##  $ Overall_Cond      : Factor w/ 10 levels "Very_Poor","Poor",..: 5 5 5 6 5 5 5 5 7 5 ...
##  $ Year_Built        : int  1960 1968 1997 1998 2001 1992 1999 1993 1992 1998 ...
##  $ Year_Remod_Add    : int  1960 1968 1998 1998 2001 1992 1999 1994 2007 1998 ...
##  $ Roof_Style        : Factor w/ 6 levels "Flat","Gable",..: 4 4 2 2 2 2 2 2 2 2 ...
##  $ Roof_Matl         : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior_1st      : Factor w/ 16 levels "AsbShng","AsphShn",..: 4 4 14 14 6 7 14 7 7 14 ...
##  $ Exterior_2nd      : Factor w/ 17 levels "AsbShng","AsphShn",..: 11 4 15 15 6 7 15 7 7 15 ...
##  $ Mas_Vnr_Type      : Factor w/ 5 levels "BrkCmn","BrkFace",..: 5 4 4 2 4 4 4 4 4 4 ...
##  $ Mas_Vnr_Area      : num  112 0 0 20 0 0 0 0 0 0 ...
##  $ Exter_Qual        : Factor w/ 4 levels "Excellent","Fair",..: 4 3 4 4 3 3 4 4 4 4 ...
##  $ Exter_Cond        : Factor w/ 5 levels "Excellent","Fair",..: 5 5 5 5 5 5 5 5 3 5 ...
##  $ Foundation        : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 3 3 3 3 3 3 3 3 ...
##  $ Bsmt_Qual         : Factor w/ 6 levels "Excellent","Fair",..: 6 6 3 6 3 3 6 3 3 3 ...
##  $ Bsmt_Cond         : Factor w/ 6 levels "Excellent","Fair",..: 3 6 6 6 6 6 6 6 6 6 ...
##  $ Bsmt_Exposure     : Factor w/ 5 levels "Av","Gd","Mn",..: 2 4 4 4 3 4 4 4 4 4 ...
##  $ BsmtFin_Type_1    : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 2 1 3 3 3 1 7 7 1 7 ...
##  $ BsmtFin_SF_1      : num  2 1 3 3 3 1 7 7 1 7 ...
##  $ BsmtFin_Type_2    : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ BsmtFin_SF_2      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Bsmt_Unf_SF       : num  441 1045 137 324 722 ...
##  $ Total_Bsmt_SF     : num  1080 2110 928 926 1338 ...
##  $ Heating           : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Heating_QC        : Factor w/ 5 levels "Excellent","Fair",..: 2 1 3 1 1 1 3 3 1 3 ...
##  $ Central_Air       : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Electrical        : Factor w/ 6 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ First_Flr_SF      : int  1656 2110 928 926 1338 1280 1028 763 1187 789 ...
##  $ Second_Flr_SF     : int  0 0 701 678 0 0 776 892 0 676 ...
##  $ Low_Qual_Fin_SF   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Gr_Liv_Area       : int  1656 2110 1629 1604 1338 1280 1804 1655 1187 1465 ...
##  $ Bsmt_Full_Bath    : num  1 1 0 0 1 0 0 0 1 0 ...
##  $ Bsmt_Half_Bath    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Full_Bath         : int  1 2 2 2 2 2 2 2 2 2 ...
##  $ Half_Bath         : int  0 1 1 1 0 0 1 1 0 1 ...
##  $ Bedroom_AbvGr     : int  3 3 3 3 2 2 3 3 3 3 ...
##  $ Kitchen_AbvGr     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Kitchen_Qual      : Factor w/ 5 levels "Excellent","Fair",..: 5 1 5 3 3 3 3 5 5 5 ...
##  $ TotRms_AbvGrd     : int  7 8 6 7 6 5 7 7 6 7 ...
##  $ Functional        : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ Fireplaces        : int  2 2 1 1 0 0 1 1 0 1 ...
##  $ Fireplace_Qu      : Factor w/ 6 levels "Excellent","Fair",..: 3 6 6 3 4 4 6 6 4 3 ...
##  $ Garage_Type       : Factor w/ 7 levels "Attchd","Basment",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Garage_Finish     : Factor w/ 4 levels "Fin","No_Garage",..: 1 1 1 1 1 3 1 1 1 1 ...
##  $ Garage_Cars       : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ Garage_Area       : num  528 522 482 470 582 506 442 440 420 393 ...
##  $ Garage_Qual       : Factor w/ 6 levels "Excellent","Fair",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Garage_Cond       : Factor w/ 6 levels "Excellent","Fair",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Paved_Drive       : Factor w/ 3 levels "Dirt_Gravel",..: 2 3 3 3 3 3 3 3 3 3 ...
##  $ Wood_Deck_SF      : int  210 0 212 360 0 0 140 157 483 0 ...
##  $ Open_Porch_SF     : int  62 0 34 36 0 82 60 84 21 75 ...
##  $ Enclosed_Porch    : int  0 0 0 0 170 0 0 0 0 0 ...
##  $ Three_season_porch: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Screen_Porch      : int  0 0 0 0 0 144 0 0 0 0 ...
##  $ Pool_Area         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pool_QC           : Factor w/ 5 levels "Excellent","Fair",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Fence             : Factor w/ 5 levels "Good_Privacy",..: 5 5 3 5 5 5 5 5 1 5 ...
##  $ Misc_Feature      : Factor w/ 6 levels "Elev","Gar2",..: 3 3 3 3 3 3 3 3 5 3 ...
##  $ Misc_Val          : int  0 0 0 0 0 0 0 0 500 0 ...
##  $ Mo_Sold           : int  5 4 3 6 4 1 6 4 3 5 ...
##  $ Year_Sold         : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ Sale_Type         : Factor w/ 10 levels "COD","CWD","Con",..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ Sale_Condition    : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Sale_Price        : int  215000 244000 189900 195500 213500 191500 189000 175900 185000 180400 ...
##  $ Longitude         : num  -93.6 -93.6 -93.6 -93.6 -93.6 ...
##  $ Latitude          : num  42.1 42.1 42.1 42.1 42.1 ...
set.seed(123)

# default RF model
m1 <- randomForest(
  formula = Sale_Price ~ .,
  data    = ames_train
)

m1
## 
## Call:
##  randomForest(formula = Sale_Price ~ ., data = ames_train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 26
## 
##           Mean of squared residuals: 659550782
##                     % Var explained: 89.83
# number of trees with the lowest MSE
which.min(m1$mse)
## [1] 344
# RMSE of this optimal RF model
sqrt(m1$mse[which.min(m1$mse)])
## [1] 25673.5
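
As in the introductory example, plotting the fitted model shows how the OOB error evolves with the number of trees (a small addition, using the same plot generic as before):

plot(m1) # OOB MSE vs. number of trees; the curve flattens near the 344-tree minimum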

6.2.1 Initial tuning

For an initial tuning of the mtry parameter, the randomForest::tuneRF command provides a quick and simple evaluation.

The tuneRF function requires the predictor data and the target (response) variable to be supplied separately.

# names of features
features <- setdiff(names(ames_train), "Sale_Price")

set.seed(123)

m2 <- tuneRF(
  x          = ames_train[features],
  y          = ames_train$Sale_Price,
  ntreeTry   = 350,  # number of trees
  mtryStart  = 5,    # starting value of mtry
  stepFactor = 1.5,  # factor by which mtry is stepped at each iteration
  improve    = 0.01, # keep searching only while OOB error improves by at least 1%
  trace      = TRUE  # print progress while searching
)
## mtry = 5  OOB error = 762415867 
## Searching left ...
## mtry = 4     OOB error = 802224340 
## -0.0522136 0.01 
## Searching right ...
## mtry = 7     OOB error = 726606619 
## 0.04696813 0.01 
## mtry = 10    OOB error = 692769021 
## 0.04656935 0.01 
## mtry = 15    OOB error = 671771621 
## 0.03030938 0.01 
## mtry = 22    OOB error = 664775551 
## 0.01041436 0.01 
## mtry = 33    OOB error = 669849982 
## -0.007633299 0.01
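
tuneRF returns a matrix with one row per mtry value tried, so the best value can also be extracted programmatically (an added convenience, not in the original):

# column 1 holds the mtry values, column 2 the corresponding OOB errors
m2[which.min(m2[, 2]), ]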

6.3 Using the ranger library

This library can be up to 6 times faster than randomForest.

# randomForest speed
system.time(
  ames_randomForest <- randomForest(
    formula = Sale_Price ~ ., 
    data    = ames_train, 
    ntree   = 350,
    mtry    = floor(length(features) / 3)
  )
)
##    user  system elapsed 
##  24.128   0.063  24.227
# ranger speed
system.time(
  ames_ranger <- ranger(
    formula   = Sale_Price ~ ., 
    data      = ames_train, 
    num.trees = 350,
    mtry      = floor(length(features) / 3)
  )
)
##    user  system elapsed 
##   4.836   0.054   0.453

To run a grid search, we first build our hyperparameter grid. We will search across 96 different models, varying mtry, the minimum node size, and the sample size.

# hyperparameter grid search
hyper_grid <- expand.grid(
  mtry       = seq(20, 30, by = 2),
  node_size  = seq(3, 9, by = 2),
  sampe_size = c(.55, .632, .70, .80),
  OOB_RMSE   = 0
)

# total number of combinations
nrow(hyper_grid)
## [1] 96

We now iterate through the combinations, fitting a few hundred trees per model (on the order of the optimum of 344 found above) to simplify the example and the computation time.

Results:

  • The OOB RMSE ranges between roughly 26,000 and 27,000.
  • The 10 best models all have RMSE values right around 26,000.
  • Models with slightly larger samples (70-80%) and deeper trees (3-5 observations in a leaf node) perform best.
  • A variety of mtry values appears among the 10 best models, suggesting this parameter is not especially influential here.

for(i in 1:nrow(hyper_grid)) {
  
  # train model
  model <- ranger(
    formula         = Sale_Price ~ ., 
    data            = ames_train, 
    num.trees       = 305,
    mtry            = hyper_grid$mtry[i],
    min.node.size   = hyper_grid$node_size[i],
    sample.fraction = hyper_grid$sampe_size[i],
    seed            = 123 # note: the seed is fixed for reproducibility
  )
  
  # add OOB error to grid
  hyper_grid$OOB_RMSE[i] <- sqrt(model$prediction.error)
}

hyper_grid %>% 
  dplyr::arrange(OOB_RMSE) %>%
  head(10)
##    mtry node_size sampe_size OOB_RMSE
## 1    20         5        0.8 26059.73
## 2    20         3        0.8 26074.55
## 3    28         3        0.8 26081.40
## 4    22         3        0.8 26128.50
## 5    28         5        0.8 26162.54
## 6    26         3        0.8 26162.87
## 7    26         5        0.7 26162.92
## 8    26         5        0.8 26180.37
## 9    22         5        0.8 26180.89
## 10   26         7        0.8 26200.37

Now we fit our best model using the hyperparameters identified above. Because ranger's resampling is random (no seed is fixed this time), we repeat the fit 100 times to observe the distribution of the OOB RMSE.

OOB_RMSE <- vector(mode = "numeric", length = 100)

for(i in seq_along(OOB_RMSE)) {

  optimal_ranger <- ranger(
    formula         = Sale_Price ~ ., 
    data            = ames_train, 
    num.trees       = 350,
    mtry            = 22,
    min.node.size   = 5,
    sample.fraction = .8,
    importance      = 'impurity' # to assess variable importance
  )
  
  OOB_RMSE[i] <- sqrt(optimal_ranger$prediction.error)
}

hist(OOB_RMSE, breaks = 20)
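
A quick numeric summary of the same distribution (added for convenience):

summary(OOB_RMSE) # min / quartiles / mean / max of the 100 OOB RMSE values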

Let's inspect the attribute importance results.

optimal_ranger$variable.importance
##        MS_SubClass          MS_Zoning       Lot_Frontage 
##       2.383412e+10       2.598857e+10       5.649101e+10 
##           Lot_Area             Street              Alley 
##       1.851665e+11       8.723654e+08       2.389873e+09 
##          Lot_Shape       Land_Contour          Utilities 
##       9.719195e+09       2.143386e+10       5.107861e+07 
##         Lot_Config         Land_Slope       Neighborhood 
##       7.601668e+09       9.803425e+09       5.107236e+10 
##        Condition_1        Condition_2          Bldg_Type 
##       7.380356e+09       1.517712e+09       1.023225e+10 
##        House_Style       Overall_Qual       Overall_Cond 
##       1.728124e+10       2.423575e+12       3.589149e+10 
##         Year_Built     Year_Remod_Add         Roof_Style 
##       5.556295e+11       1.259489e+11       1.228007e+10 
##          Roof_Matl       Exterior_1st       Exterior_2nd 
##       7.530494e+09       2.421003e+10       2.138768e+10 
##       Mas_Vnr_Type       Mas_Vnr_Area         Exter_Qual 
##       8.980558e+09       1.292766e+11       6.003321e+11 
##         Exter_Cond         Foundation          Bsmt_Qual 
##       6.127954e+09       3.195972e+10       8.198886e+11 
##          Bsmt_Cond      Bsmt_Exposure     BsmtFin_Type_1 
##       4.483529e+09       1.743399e+10       3.016338e+10 
##       BsmtFin_SF_1     BsmtFin_Type_2       BsmtFin_SF_2 
##       2.518336e+10       5.740365e+09       5.801121e+09 
##        Bsmt_Unf_SF      Total_Bsmt_SF            Heating 
##       5.375453e+10       5.425982e+11       1.285165e+09 
##         Heating_QC        Central_Air         Electrical 
##       1.202852e+10       1.769784e+10       2.673107e+09 
##       First_Flr_SF      Second_Flr_SF    Low_Qual_Fin_SF 
##       4.480740e+11       1.689317e+11       8.780560e+08 
##        Gr_Liv_Area     Bsmt_Full_Bath     Bsmt_Half_Bath 
##       1.018169e+12       2.226521e+10       3.829746e+09 
##          Full_Bath          Half_Bath      Bedroom_AbvGr 
##       1.550351e+11       1.879208e+10       2.951400e+10 
##      Kitchen_AbvGr       Kitchen_Qual      TotRms_AbvGrd 
##       3.435490e+09       2.368400e+11       7.200721e+10 
##         Functional         Fireplaces       Fireplace_Qu 
##       3.923851e+09       1.056745e+11       2.764051e+10 
##        Garage_Type      Garage_Finish        Garage_Cars 
##       6.851481e+10       2.185210e+10       1.110687e+12 
##        Garage_Area        Garage_Qual        Garage_Cond 
##       5.452131e+11       1.049091e+10       1.185353e+10 
##        Paved_Drive       Wood_Deck_SF      Open_Porch_SF 
##       1.778479e+10       4.306222e+10       5.307117e+10 
##     Enclosed_Porch Three_season_porch       Screen_Porch 
##       8.480061e+09       6.637285e+08       3.127910e+10 
##          Pool_Area            Pool_QC              Fence 
##       7.913716e+08       1.549350e+09       4.394394e+09 
##       Misc_Feature           Misc_Val            Mo_Sold 
##       1.396489e+09       1.863706e+09       4.869890e+10 
##          Year_Sold          Sale_Type     Sale_Condition 
##       1.420325e+10       6.876117e+09       1.568422e+10 
##          Longitude           Latitude 
##       8.645152e+10       8.924475e+10
list_features = data.frame(
  "feature" = names(optimal_ranger$variable.importance),
  "value"   = unname(optimal_ranger$variable.importance)
)

list_features %>%
  dplyr::arrange(desc(value)) %>%
  dplyr::top_n(25) %>%
  ggplot(aes(reorder(feature, value), value)) +
  geom_col() +
  coord_flip() +
  ggtitle("Top 25 important variables")
## Selecting by value
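
The same ranking can also be expressed as relative importance (a sketch; the pct column is introduced here and is not part of the original):

# impurity importance as a percentage of the total
list_features$pct = 100 * list_features$value / sum(list_features$value)
head(dplyr::arrange(list_features, dplyr::desc(pct)), 5) # top 5 as percentages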

6.3.1 H2O: a library for distributed computing

Exhaustive grid search using H2O

The code above is slow because of the for loops, which are inefficient in R. Even though the ranger library is computationally efficient, that alone is not enough.

H2O is a powerful and efficient Java-based interface that provides distributed, parallel implementations of many algorithms. Below is an example of how to use this library.

Note: because of the varying capabilities of each computer and the computation time involved, we will only inspect the code here and leave it to the reader to try running it. Keep in mind that distributed parallel computation requires a substantial amount of memory.

# start up h2o (I turn off progress bars when creating reports/tutorials)
h2o.no_progress()
h2o.init(max_mem_size = "5g")
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 44 minutes 
##     H2O cluster timezone:       America/Guayaquil 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.20.0.8 
##     H2O cluster version age:    1 month and 24 days  
##     H2O cluster name:           H2O_started_from_R_johannaorellana_kmg728 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   0.00 GB 
##     H2O cluster total cores:    12 
##     H2O cluster allowed cores:  12 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.3.3 (2017-03-06)
# create feature names
y <- "Sale_Price"
x <- setdiff(names(ames_train), y)

# turn training set into h2o object
train.h2o <- as.h2o(ames_train)

# hyperparameter grid
hyper_grid.h2o <- list(
  ntrees      = seq(200, 500, by = 100),
  mtries      = seq(20, 30, by = 2),
  sample_rate = c(.55, .632, .70, .80)
)

# build grid search 
grid <- h2o.grid(
  algorithm = "randomForest",
  grid_id = "rf_grid",
  x = x, 
  y = y, 
  training_frame = train.h2o,
  hyper_params = hyper_grid.h2o,
  search_criteria = list(strategy = "Cartesian")
  )

# collect the results and sort by our model performance metric of choice
grid_perf <- h2o.getGrid(
  grid_id = "rf_grid", 
  sort_by = "mse", 
  decreasing = FALSE
  )
print(grid_perf)

A (not particularly quick) run with some of the results.

# H2O Grid Details
# ================
#
# Grid ID: rf_grid
# Used hyper parameters:
#   - mtries
#   - ntrees
#   - sample_rate
# Number of models: 96
# Number of failed models: 1
#
# Hyper-Parameter Search Summary: ordered by increasing mse
# mtries ntrees sample_rate         model_ids                 mse
# 1     22    200         0.8  rf_grid_model_92 5.959833334434386E8
# 2     24    400         0.8 rf_grid_model_105 6.117881496530541E8
# 3     20    400         0.8 rf_grid_model_103 6.168158567573317E8
# 4     24    500         0.8 rf_grid_model_111 6.183128897636861E8
# 5     22    500         0.8 rf_grid_model_110  6.18855139151295E8
# 
# ---
#   mtries ntrees sample_rate       model_ids                 mse
# 91     22    200        0.55 rf_grid_model_1 6.672565213436959E8
# 92     20    200        0.55 rf_grid_model_0 6.691233582975057E8
# 93     24    200        0.55 rf_grid_model_2 6.695830321897688E8
# 94     24    300        0.55 rf_grid_model_8 6.705963772893049E8
# 95     26    300        0.55 rf_grid_model_9 6.716723861811426E8
# 96     26    200        0.55 rf_grid_model_3 6.731654202001693E8
# Failed models
# -------------
#   mtries ntrees sample_rate status_failed msgs_failed
# 22    500        0.55          FAIL        "NA"

Random discrete search

Here is an optimized version of the above: instead of searching the full cartesian space, random combinations are sampled and/or stopping criteria are defined for evaluating the RF models. The result is generally very close to the optimum.

# hyperparameter grid
hyper_grid.h2o <- list(
  ntrees      = seq(200, 500, by = 150),
  mtries      = seq(15, 35, by = 10),
  max_depth   = seq(20, 40, by = 5),
  min_rows    = seq(1, 5, by = 2),
  nbins       = seq(10, 30, by = 5),
  sample_rate = c(.55, .632, .75)
)

# random grid search criteria
search_criteria <- list(
  strategy = "RandomDiscrete",
  stopping_metric = "mse",
  stopping_tolerance = 0.005,
  stopping_rounds = 10,
  max_runtime_secs = 30*60
  )

# build grid search 
random_grid <- h2o.grid(
  algorithm = "randomForest",
  grid_id = "rf_grid2",
  x = x, 
  y = y, 
  training_frame = train.h2o,
  hyper_params = hyper_grid.h2o,
  search_criteria = search_criteria
  )

# collect the results and sort by our model performance metric of choice
grid_perf2 <- h2o.getGrid(
  grid_id = "rf_grid2", 
  sort_by = "mse", 
  decreasing = FALSE
  )
print(grid_perf2)


# Identify the best model, which sits at the top of the list, and apply it to the test data

best_model_id <- grid_perf2@model_ids[[1]]
best_model <- h2o.getModel(best_model_id)

# Now let’s evaluate the model performance on a test set
ames_test.h2o <- as.h2o(ames_test)
best_model_perf <- h2o.performance(model = best_model, newdata = ames_test.h2o)

# RMSE of best model
h2o.mse(best_model_perf) %>% sqrt()
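
A practical note not covered above: once you are done, shut the local H2O cluster down to release its memory.

h2o.shutdown(prompt = FALSE) # stops the local H2O instance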

6.3.2 Prediction

Regardless of whether the model was generated (trained) with the randomForest, ranger, or h2o library, the traditional predict function can be used to obtain the predictions.

However, the predictions may vary slightly between them.

# randomForest
pred_randomForest <- predict(ames_randomForest, ames_test)
head(pred_randomForest)
##        1        2        3        4        5        6 
## 129741.9 153747.9 267093.1 382370.9 208447.3 209501.6
# ranger
pred_ranger <- predict(ames_ranger, ames_test)
head(pred_ranger$predictions)
## [1] 125340.1 153001.1 270957.3 393530.7 223134.0 216577.0
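
For completeness, the H2O model predicts on an H2OFrame (a sketch that assumes best_model and ames_test.h2o from the H2O section were actually created):

# h2o
pred_h2o <- predict(best_model, ames_test.h2o)
head(pred_h2o)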