15.18 Lab: Predicting house prices: a regression example
Based on Chollet and Allaire (2018, Ch. 3.6) who provide examples for binary, categorical and continuous outcomes.
In the lab below we have a quantitative outcome that we'll try to predict, namely house prices. In other words, instead of predicting discrete labels (classification), we now predict values on a continuous variable.
Q: Is that a classification or a regression problem?
The Boston Housing Price dataset (overview)
- Outcome: Median house prices in suburbs (1970s)
- Unit of analysis: Suburb
- Predictors/explanatory variables/features
- 13 numerical features, such as per capita crime rate, average number of rooms per dwelling, accessibility to highways etc.
- Each input feature has a different scale
- Data: 506 observations/suburbs (404 training samples, 102 test samples)
15.18.0.1 Load the dataset
We load the dataset (comes as a list) and create four objects using the %<-%
operator that is part of the keras package.
library(keras)
library(tensorflow) # should be preloaded
# Dataset comes as a list
dataset <- dataset_boston_housing()

# The function already splits the dataset into training and test data
# The data is stored in a list
# Elements 'train' and 'test' contain
# predictors/features x and outcome y
names(dataset)
names(dataset$train)
We convert the data in the list to four matrices train_data, train_targets, test_data and test_targets that we can use in our functions:
# Convert list to matrices
# train_data, train_targets: Training data (predictors, outcome)
# test_data, test_targets: Test data (predictors, outcome)
c(c(train_data, train_targets), c(test_data, test_targets)) %<-% dataset
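The %<-% operator (re-exported by keras from the zeallot package) unpacks the nested list in one step. As a rough equivalent, assuming the list elements are named x and y as usual, the same four objects could also be created with ordinary assignments:

# Equivalent extraction without %<-% (assumes elements x and y exist)
train_data    <- dataset$train$x
train_targets <- dataset$train$y
test_data     <- dataset$test$x
test_targets  <- dataset$test$y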
Let’s look at the data:
head(train_data)
head(test_data)
We have 404 training samples and 102 test samples, each with 13 numerical features, such as per capita crime rate, average number of rooms per dwelling, accessibility to highways, and so on.
Below we also inspect our outcome that we want to predict:
str(train_targets)
summary(train_targets)
str(test_targets)
The targets (the outcome we want to predict) are the median values of owner-occupied homes, in thousands of dollars. The prices are typically between $10,000 and $50,000 (prices are not adjusted for inflation; the data is from the 1970s).
15.18.0.2 Preparing the input data
- Learning for a neural network is more difficult when the input data has wildly different ranges (the network should be able to adapt but it takes longer)
- Chollet and Allaire (2018) recommend the “widespread best practice” of feature-wise normalization
- for each feature in the input data (a column in the input data matrix), you subtract the mean of the feature and divide by the standard deviation, so that the feature is centered around 0 and has a unit standard deviation
- We use the scale() function for that
# Calculate the mean and standard deviation
# for each column of training data
mean <- apply(train_data, 2, mean)
std <- apply(train_data, 2, sd)

# Scale the training and test data using the mean(s) and
# standard deviation(s) from the training data
train_data <- scale(train_data, center = mean, scale = std)
test_data <- scale(test_data, center = mean, scale = std)
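As a quick sanity check (a small addition, not part of the original lab code): after scaling, each training-data column should have mean roughly 0 and standard deviation roughly 1, while the test columns are only approximately standardized because the training-set statistics were used.

# Check the feature-wise normalization
round(apply(train_data, 2, mean), 2) # ~0 for every training column
round(apply(train_data, 2, sd), 2)   # ~1 for every training column
round(apply(test_data, 2, mean), 2)  # only approximately 0 (training statistics were used)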
Important: The quantities used for normalizing the test data are computed using the training data.
!! Never use any quantity computed on the test data in your workflow, not even for something as simple as data normalization !!
15.18.0.3 Building your network
- We use a small network with two hidden layers, each with 64 units, because of the small number of samples (small size of data)
- General rule: The less training data we have, the worse overfitting will be, and using a small network is one way to mitigate overfitting.
Because we will need to instantiate the same model multiple times, we use a function to construct it.
# Model definition (as a function)
build_model <- function() {
  # Define network architecture
  model <- keras_model_sequential() %>%
    layer_dense(units = 64, activation = "relu",
                input_shape = dim(train_data)[[2]]) %>%
    layer_dense(units = 64, activation = "relu") %>%
    layer_dense(units = 1) # Ends with single unit

  # Set loss function, optimizer and metric
  model %>% compile(
    optimizer = "rmsprop",
    loss = "mse",      # Mean squared error loss function
    metrics = c("mae") # mean absolute error (MAE)
  )
}
# MSE = mean squared error = square of the difference between
# the predictions and the targets
- The network ends with a single unit and no activation (it will be a linear layer)
- This is a typical setup for scalar regression where we are trying to predict a single continuous value
- Here: Last layer is purely linear so that the network is free to learn to predict values in any range
- Applying another activation function would constrain the range the output can take
- e.g., applying sigmoid activation function to the last layer, the network could only learn to predict values between 0 and 1
- Loss function
- Mean squared error: the square of the difference between the predictions and the targets (standard loss function for regression problems)
- Metric for monitoring: mean absolute error (MAE)
- Absolute value of the difference between the predictions and the targets
- e.g., an MAE of 0.5 here would mean predictions are off by $500 on average (see the small worked example below)
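To make MSE and MAE concrete, here is a tiny worked example with made-up predictions and targets (hypothetical numbers, in thousands of dollars, not part of the original lab):

# Toy example: three made-up houses (values in thousands of dollars)
toy_targets     <- c(20, 35, 15)
toy_predictions <- c(22, 31, 15.5)

mean((toy_predictions - toy_targets)^2)  # MSE = (4 + 16 + 0.25) / 3 = 6.75
mean(abs(toy_predictions - toy_targets)) # MAE = (2 + 4 + 0.5) / 3 = 2.17, i.e., off by ~$2,170 on average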
15.18.0.4 Validating our approach using K-fold validation
- Our dataset is relatively small (remember our discussion of bias/size of training and validation data!)
- Best practice: Use K-fold cross-validation (see the figure below)
- Split available data into K partitions (typically K = 4 or 5)
- Create K identical models and train each one on \(K-1\) partitions while evaluating on the remaining partition
- The validation score for the model is then the average of the \(K\) validation scores obtained
knitr::include_graphics("07-fig3-9.jpg")
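To see how the fold labels in the code below are created, here is a toy illustration with 12 made-up observations (a sketch, not part of the original lab): sample() shuffles the row indices and cut() then assigns each observation to one of the k folds.

# Toy illustration of the fold assignment (12 observations, k = 4)
toy_indices <- sample(1:12)
cut(toy_indices, breaks = 4, labels = FALSE) # fold label (1-4) for each observation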
Below is the code for this procedure:
# K-fold validation
k <- 4 # Number of partitions
indices <- sample(1:nrow(train_data)) # Create indices used for K folds

# Assign each observation to one of the k folds
folds <- cut(indices, breaks = k, labels = FALSE)

num_epochs <- 50
all_scores <- c() # Object to store results

# Loop over K (= folds)
for (i in 1:k) {
  cat("processing fold #", i, "\n")

  # Prepare the validation data: data from partition #k
  val_indices <- which(folds == i, arr.ind = TRUE)
  val_data <- train_data[val_indices,]
  val_targets <- train_targets[val_indices]

  # Prepare the training data: data from all other partitions
  partial_train_data <- train_data[-val_indices,]
  partial_train_targets <- train_targets[-val_indices]

  # Builds the Keras model (already compiled)
  model <- build_model()

  # Train the model (in silent mode, verbose = 0)
  model %>% fit(partial_train_data, partial_train_targets,
                epochs = num_epochs, batch_size = 1, verbose = 0)

  # Evaluate model on the validation data
  results <- model %>% evaluate(val_data, val_targets, verbose = 0)
  all_scores <- c(all_scores, results["mae"])
}
Running this with 50 epochs yields the following results:
all_scores
mean(all_scores)
mean(all_scores)*1000 # = USD
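Since the motivation for averaging is the fold-to-fold variation, it can also be instructive to look at the spread of the individual scores (a small addition, not in the original lab code):

# Spread of the individual fold scores (in thousands of dollars)
sd(all_scores)
range(all_scores)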
- The different runs do indeed show rather different validation scores
- Average is a much more reliable metric than any single score - that’s the entire point of K-fold cross-validation
- We are off by $ on average
- This is significant considering that the prices range from $10,000 to $50,000 (the variable is in thousands of dollars, i.e., divided by 1,000)
- Below we train the network longer and increase the number of epochs to 200
- We also save the per-epoch validation score log to evaluate how it develops across epochs (check how well the model does at each epoch!)
# Below we save the validation logs at each fold
num_epochs <- 200
all_mae_histories <- NULL

# Loop over K (= folds)
for (i in 1:k) {
  cat("processing fold #", i, "\n")

  # Prepares the validation data: data from partition #k
  val_indices <- which(folds == i, arr.ind = TRUE)
  val_data <- train_data[val_indices,]
  val_targets <- train_targets[val_indices]

  # Prepares the training data: data from all other partitions
  partial_train_data <- train_data[-val_indices,]
  partial_train_targets <- train_targets[-val_indices]

  # Builds the Keras model (already compiled)
  model <- build_model()

  # Trains the model (in silent mode, verbose = 0)
  history <- model %>% fit(
    partial_train_data, partial_train_targets,
    validation_data = list(val_data, val_targets),
    epochs = num_epochs, batch_size = 1, verbose = 0
  )

  # Store metrics (validation score)
  mae_history <- history$metrics$val_mae
  all_mae_histories <- rbind(all_mae_histories, mae_history)
}
- Then we compute the average of the per-epoch MAE scores for all folds:
# Building the history of successive mean K-fold validation scores
average_mae_history <- data.frame(
  epoch = seq(1:ncol(all_mae_histories)),
  validation_mae = apply(all_mae_histories, 2, mean)
)
Let’s plot this:
# Figure: Validation MAE by epoch
# Plotting validation scores
library(ggplot2)
ggplot(average_mae_history, aes(x = epoch, y = validation_mae)) + geom_line()
- Below we use geom_smooth() to get a better grasp of how the mean absolute error develops across epochs
# Validation MAE by epoch: smoothed
# Plotting validation scores with geom_smooth()
ggplot(average_mae_history, aes(x = epoch, y = validation_mae)) + geom_smooth()
- The MAE stops improving significantly after 125 epochs
- After 125 epochs, overfitting increases
In other words, after about 125 epochs the algorithm adapts so strongly to the training data that its predictions for the held-out validation data become worse again.
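One way to pin down the turning point (a small addition using the data frame created above) is to look up the epoch with the lowest average validation MAE; it should lie somewhere around epoch 125:

# Epoch with the lowest average validation MAE
which.min(average_mae_history$validation_mae)
min(average_mae_history$validation_mae)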
15.18.0.5 Confronting the model with test data
Generally, we would tune our model in different ways, e.g., in addition to the number of epochs we could adjust the size of the hidden layers, the number of predictors/features etc.
Then we would train a final production model on all of the training data, with the best parameters, and then look at its performance on the test data.
# Training the final model
model <- build_model()

# Trains the model on the entirety of the training data
model %>% fit(train_data, train_targets,
              epochs = 80, batch_size = 16, verbose = 0)

# Evaluates the model on the test data
result <- model %>% evaluate(test_data, test_targets)
Here’s the final result for the mean absolute error:
round(result["mae"], 2)
Our model is off by about $ when trying to predict the test data.
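If we want actual predicted prices rather than just the error metric, we can generate predictions for the test data with predict(); multiplying by 1,000 converts the values back to dollars, since the targets are stored in thousands (a small addition to the lab code):

# Generate price predictions for the test data (targets are in thousands of dollars)
predictions <- model %>% predict(test_data)
head(predictions * 1000) # first few predicted prices in dollars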
15.18.0.6 Wrapping up & takeaways
Above we discussed a DL example for a regression problem. An example of a classical ML analogue with the same data can be found in Chapter 3.6 of James et al. (2013). For DL examples with binary and categorical outcomes, see Chapters 3.4 and 3.5 in Chollet and Allaire (2018).
- Loss functions
- Regression uses different loss functions than classification
- Mean squared error (MSE) is a loss function commonly used for regression
- Evaluation metrics
- These also differ for regression and classification (e.g., error rate applies to classification, not regression)
- A common regression metric is mean absolute error (MAE)
- Preprocessing
- When features in input data have values in different ranges, each feature should be scaled independently as a preprocessing step
- Small datasets
- K-fold validation is a great way to reliably evaluate a model
- Preferable to use a small network with few hidden layers (typically only one or two), in order to avoid severe overfitting
15.18.0.7 Summary
- ML (or DL) models can be built for different outcomes: binary variables (classification), categorical variables (classification), continuous variables (regression)
- Preprocessing: When the data has features (explanatory vars) with different ranges, we would scale each feature as part of the preprocessing
- Overfitting: As training progresses, neural networks eventually begin to overfit and obtain worse results for never-before-seen data
- If you don’t have much training data, use a small network with only one or two hidden layers, to avoid severe overfitting
- Layer size: If the data is divided into many categories, making the layers too small may cause information bottlenecks
- Loss functions: Regression uses different loss functions and different evaluation metrics than classification
- Data size: K-fold validation can help reliably evaluate your model when the data is small