## 15.18 Lab: Predicting house prices: a regression example

Based on Chollet and Allaire (2018, Ch. 3.6), who provide examples for binary, categorical, and continuous outcomes.

In the lab below we have a quantitative outcome that we’ll try to predict, namely house prices. In other words, instead of predicting discrete labels (classification), we now predict values on a continuous variable.

• Q: Is that a classification or a regression problem?

• The Boston Housing Price dataset (overview)

• Outcome: Median house prices in suburbs (1970s)
• Unit of analysis: Suburb
• Predictors/explanatory variables/features
• 13 numerical features, such as per capita crime rate, average number of rooms per dwelling, and accessibility to highways
• Each input feature has a different scale
• Data: 506 observations/suburbs (404 training samples, 102 test samples)
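
Because each feature lives on a different scale, the features are standardized before training, using the training set's mean and standard deviation for both the training and the test data (the approach Chollet and Allaire use). A minimal base-R sketch with toy matrices standing in for the real `train_data`/`test_data`:

```r
# Toy stand-ins for train_data / test_data (2 features on very different scales)
train_data <- matrix(c(1, 2, 3, 4,
                       100, 200, 300, 400), ncol = 2)
test_data <- matrix(c(2, 3,
                      150, 250), ncol = 2)

# Compute mean and sd per column on the *training* data only
mean <- apply(train_data, 2, mean)
std <- apply(train_data, 2, sd)

# Scale both sets with the training statistics
# (the test data must never leak into the preprocessing)
train_data <- scale(train_data, center = mean, scale = std)
test_data <- scale(test_data, center = mean, scale = std)

colMeans(train_data) # Each training column now has mean 0 (and sd 1)
```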

We load the dataset (it comes as a list) and create four objects using the `%<-%` multi-assignment operator (from the zeallot package, re-exported by keras).

```r
library(keras)

# Dataset comes as a list that already splits
# the data into training and test sets
dataset <- dataset_boston_housing()

# Elements 'train' and 'test' each contain
# predictors/features x and outcome y
names(dataset)

# Unpack the list into four objects in one step
c(c(train_data, train_targets), c(test_data, test_targets)) %<-% dataset
```

#### 15.18.0.4 Validating our approach using K-fold validation

• Our dataset is relatively small (remember our discussion of bias/size of training and validation data!)
• Best practice: Use K-fold cross-validation (see Figure ??)
• Split available data into K partitions (typically K = 4 or 5)
• Create K identical models and train each one on $K - 1$ partitions while evaluating on the remaining partition
• The validation score for the model is then the average of the $K$ validation scores obtained
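
The fold assignment can be sketched in base R without Keras (using `n = 404` training samples as in the lab; variable names mirror the code below):

```r
set.seed(1) # for a reproducible shuffle
n <- 404    # Number of training samples
k <- 4      # Number of partitions

indices <- sample(1:n)                            # Shuffled observation indices
folds <- cut(indices, breaks = k, labels = FALSE) # Fold id (1..k) per observation

table(folds) # Each fold holds n / k = 101 observations
```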
```r
knitr::include_graphics("07-fig3-9.jpg")
```
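
The loop below calls a helper `build_model()` that is not defined in this excerpt. Following Chollet and Allaire's example for this dataset, it could look like the sketch below (a small network with two hidden layers, compiled with MSE loss and MAE as the metric; the layer sizes are the book's choices, not the only option):

```r
library(keras)

build_model <- function() {
  model <- keras_model_sequential() %>%
    layer_dense(units = 64, activation = "relu",
                input_shape = dim(train_data)[[2]]) %>%
    layer_dense(units = 64, activation = "relu") %>%
    layer_dense(units = 1) # Single output unit, no activation: regression

  model %>% compile(
    optimizer = "rmsprop",
    loss = "mse",      # Mean squared error
    metrics = c("mae") # Mean absolute error
  )
}
```

Returning a freshly compiled model from a function makes it easy to train K independent copies, one per fold.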

Below is the code for this procedure:

```r
# K-fold validation
k <- 4 # Number of partitions
indices <- sample(1:nrow(train_data)) # Shuffled indices used to build the folds

# Assign each observation to one of the k folds
folds <- cut(indices, breaks = k, labels = FALSE)

num_epochs <- 50
all_scores <- c() # Object to store results

# Loop over the k folds
for (i in 1:k) {
  cat("processing fold #", i, "\n")

  # Prepare the validation data: data from partition #i
  val_indices <- which(folds == i, arr.ind = TRUE)
  val_data <- train_data[val_indices, ]
  val_targets <- train_targets[val_indices]

  # Prepare the training data: data from all other partitions
  partial_train_data <- train_data[-val_indices, ]
  partial_train_targets <- train_targets[-val_indices]

  # Build the Keras model (already compiled)
  model <- build_model()

  # Train the model (in silent mode, verbose = 0)
  model %>% fit(partial_train_data, partial_train_targets,
                epochs = num_epochs, batch_size = 1, verbose = 0)

  # Evaluate the model on the validation data
  results <- model %>% evaluate(val_data, val_targets, verbose = 0)
  all_scores <- c(all_scores, results["mae"])
}
```

Running this with 50 epochs yields the following results:

```r
all_scores
mean(all_scores)
mean(all_scores) * 1000 # Outcome is in thousands of USD, so this converts the MAE to USD
```
• The different runs do indeed show rather different validation scores
• The average is a much more reliable metric than any single score; that is the entire point of K-fold cross-validation

#### 15.18.0.6 Wrapping up & takeaways

Above we discussed a DL example for a regression problem. An example of a classical ML analogue with the same data can be found in Chapter 3.6 of James et al. (2013). For DL examples with binary and categorical outcomes, see Chapters 3.4 and 3.5 in Chollet and Allaire (2018).

• Loss functions
• Regression uses different loss functions than classification (classification metrics such as error rate do not apply)
• Mean squared error (MSE) is a loss function commonly used for regression
• Evaluation metrics
• Differ for regression and classification
• Common regression metric is mean absolute error (MAE)
• Preprocessing
• When features in input data have values in different ranges, each feature should be scaled independently as a preprocessing step
• Small datasets
• K-fold validation is a great way to reliably evaluate a model
• Preferable to use a small network with few hidden layers (typically only one or two), in order to avoid severe overfitting
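
As a toy illustration of the two quantities (the values are hypothetical, not from the Boston data):

```r
y     <- c(2.0, 3.5, 1.0) # true values
y_hat <- c(2.5, 3.0, 1.5) # predictions

mse <- mean((y - y_hat)^2)  # Mean squared error: 0.25
mae <- mean(abs(y - y_hat)) # Mean absolute error: 0.5
```

MSE penalizes large errors more heavily (useful as a training loss), while MAE stays on the scale of the outcome, which makes it easier to interpret as an evaluation metric.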

#### 15.18.0.7 Summary

• ML (or DL) models can be built for different outcome types: binary variables (classification), categorical variables (classification), and continuous variables (regression)
• Preprocessing: When the data has features (explanatory variables) with different ranges, we scale each feature as part of the preprocessing
• Overfitting: As training progresses, neural networks eventually begin to overfit and obtain worse results for never-before-seen data
• If you don’t have much training data, use a small network with only one or two hidden layers, to avoid severe overfitting
• Layer size: If the data is divided into many categories, making the layers too small may cause information bottlenecks
• Loss functions: Regression uses different loss functions and different evaluation metrics than classification
• Data size: K-fold validation can help reliably evaluate your model when the data is small

### References

Chollet, François, and J. J. Allaire. 2018. *Deep Learning with R*. 1st ed. Manning Publications.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. *An Introduction to Statistical Learning: With Applications in R*. Springer Texts in Statistics. Springer.