Chapter 8 Building Models For Prediction
8.1 Variable Engineering
Often, it is beneficial to create new variables, or to transform existing variables into different forms, before fitting a model. Here are some examples, using a dataset of Airbnb listings in major US cities.
library(tidyverse)   # str_remove(), mutate(), fct_lump(), recode(), %>%
library(knitr)       # kable()
Train <- read.csv("Train.csv")
Test <- read.csv("Test.csv")
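Before engineering anything, it helps to glance at the structure of the data. A quick look (output omitted here):
str(Train)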
The host response rate is stored as a character string with a percent sign; we strip the "%" and convert the variable to numeric.
Train$host_response_rate <- str_remove(Train$host_response_rate, "%")
Test$host_response_rate <- str_remove(Test$host_response_rate, "%")
Train$host_response_rate <- as.numeric(Train$host_response_rate)
Test$host_response_rate <- as.numeric(Test$host_response_rate)
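As a quick sanity check (not part of the original code), we can confirm that the conversion produced numeric values in the expected 0-100 range:
summary(Train$host_response_rate)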
Next, we:

- group infrequent property types into a category called "Other", keeping the four most frequent categories: apartment, house, townhouse, and condominium;
- create a yes/no variable for whether or not the listing has an online review;
- modify the host_has_profile_pic and host_identity_verified variables to group missing values and FALSEs together;
- create a variable indicating whether the last review was made on a weekend.
Train <- Train %>% mutate(property_type = fct_lump(property_type, n=4),
                          has_review = !is.na(review_scores_rating),
                          host_has_profile_pic = host_has_profile_pic == "t",
                          host_identity_verified = host_identity_verified == "t",
                          weekend = weekdays(as.Date(last_review)) %in% c("Saturday", "Sunday"))
Test <- Test %>% mutate(property_type = fct_lump(property_type, n=4),
                        has_review = !is.na(review_scores_rating),
                        host_has_profile_pic = host_has_profile_pic == "t",
                        host_identity_verified = host_identity_verified == "t",
                        weekend = weekdays(as.Date(last_review)) %in% c("Saturday", "Sunday"))
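It is worth verifying that the engineered variables look as intended. For example, fct_count() tabulates the lumped property types, and table() shows how many listings had a weekend last review (quick checks, not part of the original code):
fct_count(Train$property_type)   # four kept levels plus "Other"
table(Train$weekend)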
8.2 Model Evaluation
The caret (Classification And REgression Training) package is useful for comparing and evaluating models using cross-validation. Its train() function performs the cross-validation.
library(caret)
We must first impute missing values, as train() will not work with missing values.
# learn median-imputation values from the training predictors (excluding the response)
preProcValues <- preProcess(Train %>% select(-c(price)), method = c("medianImpute"))
Train <- predict(preProcValues, Train)
Test <- predict(preProcValues, Test)   # apply the same training-set medians to the test data
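To confirm that the imputation worked, we can count the remaining missing values in each column; any leftover NAs would still cause train() to fail:
colSums(is.na(Train))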
The trainControl() function allows us to set the number of folds and the number of repeats in our cross-validation procedure.
control <- trainControl(method="repeatedcv", number=5, repeats=5)
It is important to set the same seed before each training procedure to ensure that the models are compared on the same partitions of training and validation data.
set.seed(11082020)
model1 <- train(data=Train, price ~ bedrooms, method="lm", trControl=control)
set.seed(11082020)
model2 <- train(data=Train, price ~ bedrooms + accommodates + bathrooms + beds,
method="lm", trControl=control)
set.seed(11082020)
model3 <- train(data=Train, price ~ bedrooms * accommodates * bathrooms * beds,
method="lm", trControl=control)
set.seed(11082020)
model4 <- train(data=Train, price ~ bedrooms + city + room_type,
                method="lm", trControl=control)
set.seed(11082020)
model5 <- train(data=Train, price ~ bedrooms * city * room_type,
                method="lm", trControl=control)
set.seed(11082020)
model6 <- train(data=Train, price ~ bedrooms + accommodates + bathrooms + beds + city + room_type +
                  cancellation_policy + cleaning_fee,
                method="lm", trControl=control)
set.seed(11082020)
# exclude latitude, longitude, id, zipcode, weekday, weekend
model7 <- train(data=Train, price ~ property_type + bed_type + host_has_profile_pic + bedrooms +
                  accommodates + bathrooms + beds + city + room_type + cancellation_policy +
                  cleaning_fee + host_identity_verified + host_response_rate + instant_bookable +
                  number_of_reviews + review_scores_rating + has_review, method="lm", trControl=control)
set.seed(11082020)
model8 <- train(data=Train, price ~ ., method="lm", trControl=control)
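With 5 folds repeated 5 times, each call to train() produces 25 validation-set RMSE values, which are averaged to give the reported RMSE. The individual resamples are stored in the fitted object; for example:
head(model1$resample)   # one row per fold/repeat combination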
We extract the RMSPE for each model and record it in a table. Note that R labels this quantity RMSE, even though RMSPE is really the more accurate term, since it is computed on data that were withheld when fitting the model.
r1 <- model1$results$RMSE
r2 <- model2$results$RMSE
r3 <- model3$results$RMSE
r4 <- model4$results$RMSE
r5 <- model5$results$RMSE
r6 <- model6$results$RMSE
r7 <- model7$results$RMSE
r8 <- model8$results$RMSE
Model <- 1:8
RMSE <- c(r1, r2, r3, r4, r5, r6, r7, r8)
results <- data.frame(Model, RMSE)   # avoid naming the table T, which masks TRUE
kable(results)
Model | RMSE
---|---
1 | 121.1541
2 | 117.1564
3 | 118.8492
4 | 112.2782
5 | 112.8955
6 | 109.9661
7 | 111.6570
8 | 139.5865
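Since every model was trained with the same seed, and hence on the same folds, caret's resamples() function gives another way to compare them fold-by-fold. A sketch using the models above:
comparison <- resamples(list(model1=model1, model2=model2, model3=model3, model4=model4,
                             model5=model5, model6=model6, model7=model7, model8=model8))
summary(comparison)   # min/median/mean/max RMSE, R-squared, and MAE per model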
8.3 Predictions on Test Data
Having identified the best model through cross-validation (Model 6, which achieved the lowest RMSPE), we fit that model to the full training set and use it to make predictions on the test data.
M1 <- lm(data=Train, price ~ bedrooms + accommodates + bathrooms +
           beds + city + room_type + cancellation_policy + cleaning_fee)
Sometimes we encounter a categorical variable with a category that shows up in the test data but not the training data. When this happens, the model cannot make predictions for rows with the unseen category. A simple workaround is to recode that category to the most similar category that does appear in the training data, using the recode() function.
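We can first check which categories appear in the test set but not in the training set (a quick diagnostic, not part of the original code):
setdiff(unique(Test$cancellation_policy), unique(Train$cancellation_policy))
In this dataset, super_strict_60 appears only in the test data, so we recode it to super_strict_30, the most similar training category: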
Test$cancellation_policy <- recode(Test$cancellation_policy, super_strict_60="super_strict_30")
Now, we make the predictions on the new data.
Predictions <- predict(M1, newdata=Test)
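Before saving, it is wise to sanity-check the predictions, since a linear model can produce implausible values such as negative prices (a quick check we add here):
summary(Predictions)
sum(Predictions < 0)   # number of negative predicted prices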
We record the predicted prices in the column Test$price.
Test$price <- Predictions
We create a .csv file with the predictions. The file will be saved in your current working directory. Be sure this is where you want it by navigating to the desired directory in the lower-right RStudio pane and choosing "More -> Set As Working Directory".
write.csv(Test, file="Test_predictions.csv")