Chapter 8 Building Models For Prediction
8.1 Variable Engineering
Often it is beneficial to create new variables, or to transform existing variables into different forms, before fitting a model. Here are some examples, using a dataset of Airbnb listings in major US cities.
<- read.csv("Train.csv")
Train <- read.csv("Test.csv") Test
We convert the host response rate variable to numeric.
Train$host_response_rate <- str_remove(Train$host_response_rate, "%")
Test$host_response_rate <- str_remove(Test$host_response_rate, "%")
Train$host_response_rate <- as.numeric(as.character(Train$host_response_rate))
Test$host_response_rate <- as.numeric(as.character(Test$host_response_rate))
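As a quick sanity check (illustrative, not part of the original analysis), we can confirm the variable is now numeric:

summary(Train$host_response_rate)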
We then make several more changes, implemented in the mutate() calls below:

- group infrequent property types into a category called "other," keeping the four most frequent categories: apartment, house, townhouse, and condominium.
- create a yes/no variable for whether or not there was an online review.
- modify the host_has_profile_pic and host_identity_verified variables so that missing values and falses are grouped together.
- create a variable indicating whether the last review was made on a weekend.
Train <- Train %>% mutate(property_type = fct_lump(property_type, n=4),
                          has_review = !is.na(review_scores_rating),
                          host_has_profile_pic = host_has_profile_pic == "t",
                          host_identity_verified = host_identity_verified == "t",
                          weekend = weekdays(as.Date(last_review)) %in% c("Saturday", "Sunday"))

Test <- Test %>% mutate(property_type = fct_lump(property_type, n=4),
                        has_review = !is.na(review_scores_rating),
                        host_has_profile_pic = host_has_profile_pic == "t",
                        host_identity_verified = host_identity_verified == "t",
                        weekend = weekdays(as.Date(last_review)) %in% c("Saturday", "Sunday"))
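To verify the lumping (again, an illustrative check), we can tabulate the recoded variable; it should show the four retained categories plus Other:

table(Train$property_type)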
8.2 Model Evaluation
The caret package (Classification And REgression Training) is useful for comparing and evaluating models using cross-validation. Its train() function fits a model and estimates its performance by cross-validation.
library(caret)
We must first impute missing values, since train() will not run when the data contain them.
# learn medians from the training predictors (excluding price), then apply to both sets
preProcValues <- preProcess(Train %>% select(-c(price)), method = c("medianImpute"))
Train <- predict(preProcValues, Train)
Test <- predict(preProcValues, Test)
The trainControl() function lets us set the number of folds and the number of repeats in the cross-validation procedure. Here we use 5-fold cross-validation repeated 5 times, so each model is fit and evaluated 25 times.

control <- trainControl(method="repeatedcv", number=5, repeats=5)
It is important to set the same seed before each training procedure to ensure that the models are compared on the same partitions of training and validation data.
set.seed(11082020)
model1 <- train(data=Train, price ~ bedrooms, method="lm", trControl=control)

set.seed(11082020)
model2 <- train(data=Train, price ~ bedrooms + accommodates + bathrooms + beds,
                method="lm", trControl=control)

set.seed(11082020)
model3 <- train(data=Train, price ~ bedrooms * accommodates * bathrooms * beds,
                method="lm", trControl=control)

set.seed(11082020)
model4 <- train(data=Train, price ~ bedrooms + city + room_type,
                method="lm", trControl=control)

set.seed(11082020)
model5 <- train(data=Train, price ~ bedrooms * city * room_type,
                method="lm", trControl=control)

set.seed(11082020)
model6 <- train(data=Train, price ~ bedrooms + accommodates + bathrooms + beds + city +
                  room_type + cancellation_policy + cleaning_fee,
                method="lm", trControl=control)

set.seed(11082020)
# exclude latitude, longitude, id, zipcode, weekday, weekend
model7 <- train(data=Train, price ~ property_type + bed_type + host_has_profile_pic + bedrooms +
                  accommodates + bathrooms + beds + city + room_type + cancellation_policy +
                  host_identity_verified + host_response_rate + instant_bookable +
                  number_of_reviews + cleaning_fee + review_scores_rating + has_review,
                method="lm", trControl=control)

set.seed(11082020)
model8 <- train(data=Train, price ~ ., method="lm", trControl=control)
We extract the RMSPE for each model and record it in a table. Note that caret labels this quantity RMSE, even though RMSPE (root mean square prediction error) is really the more accurate term, since it is computed on data that were withheld when fitting the model.
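Each fitted train object stores its resampled performance in its results element; for these linear models, this is a one-row data frame whose columns include RMSE, Rsquared, and MAE. We can inspect it directly (shown for illustration):

model1$results

The code below pulls out the RMSE column for each of the eight models.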
library(knitr)  # for kable()

r1 <- model1$results$RMSE
r2 <- model2$results$RMSE
r3 <- model3$results$RMSE
r4 <- model4$results$RMSE
r5 <- model5$results$RMSE
r6 <- model6$results$RMSE
r7 <- model7$results$RMSE
r8 <- model8$results$RMSE

Model <- 1:8
RMSE <- c(r1, r2, r3, r4, r5, r6, r7, r8)
T <- data.frame(Model, RMSE)
kable(T)
| Model | RMSE |
|---|---|
| 1 | 121.1541 |
| 2 | 117.1564 |
| 3 | 118.8492 |
| 4 | 112.2782 |
| 5 | 112.8955 |
| 6 | 109.9661 |
| 7 | 111.6570 |
| 8 | 139.5865 |
8.3 Predictions on Test Data
Model 6 achieved the lowest cross-validated RMSPE, so we fit that model to the full training set and use it to make predictions on the test data.
M1 <- lm(data=Train, price ~ bedrooms + accommodates + bathrooms + beds +
           city + room_type + cancellation_policy + cleaning_fee)
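As an optional check, we can view the fitted coefficients before predicting:

summary(M1)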
Sometimes a categorical variable has a category that appears in the test data but not the training data. When this happens, the model cannot make predictions for those rows. One reasonable fix is to change the unseen category to the most similar category present in the training data, using the recode() function.
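One way to spot unseen categories (an illustrative check, not part of the original code) is to compare the unique values in the two sets:

setdiff(unique(Test$cancellation_policy), unique(Train$cancellation_policy))

Here, super_strict_60 appears only in the test data, so we recode it to the most similar training category, super_strict_30.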
Test$cancellation_policy <- recode(Test$cancellation_policy, super_strict_60="super_strict_30")
Now, we make the predictions on the new data.
Predictions <- predict(M1, newdata=Test)
We record the predicted prices in the column Test$price.
Test$price <- Predictions
We create a .csv file with the predictions. The file will be saved in your current working directory. Be sure this is where you want it by navigating to the desired directory in the lower-right RStudio pane and choosing "More -> Set as Working Directory."
write.csv(Test, file="Test_predictions.csv")